1PCREPATTERN(3) Library Functions Manual PCREPATTERN(3)
2
3
4
5 NAME
6 PCRE - Perl-compatible regular expressions
7
8 PCRE REGULAR EXPRESSION DETAILS
9
10 The syntax and semantics of the regular expressions that are supported
11 by PCRE are described in detail below. There is a quick-reference syn‐
12 tax summary in the pcresyntax page. PCRE tries to match Perl syntax and
13 semantics as closely as it can. PCRE also supports some alternative
14 regular expression syntax (which does not conflict with the Perl syn‐
15 tax) in order to provide some compatibility with regular expressions in
16 Python, .NET, and Oniguruma.
17
18 Perl's regular expressions are described in its own documentation, and
19 regular expressions in general are covered in a number of books, some
20 of which have copious examples. Jeffrey Friedl's "Mastering Regular
21 Expressions", published by O'Reilly, covers regular expressions in
22 great detail. This description of PCRE's regular expressions is
23 intended as reference material.
24
25 The original operation of PCRE was on strings of one-byte characters.
26 However, there is now also support for UTF-8 character strings. To use
27 this, PCRE must be built to include UTF-8 support, and you must call
28 pcre_compile() or pcre_compile2() with the PCRE_UTF8 option. There is
29 also a special sequence that can be given at the start of a pattern:
30
31 (*UTF8)
32
33 Starting a pattern with this sequence is equivalent to setting the
34 PCRE_UTF8 option. This feature is not Perl-compatible. How setting
35 UTF-8 mode affects pattern matching is mentioned in several places
36 below. There is also a summary of UTF-8 features in the section on
37 UTF-8 support in the main pcre page.
38
39 Another special sequence that may appear at the start of a pattern or
40 in combination with (*UTF8) is:
41
42 (*UCP)
43
44 This has the same effect as setting the PCRE_UCP option: it causes
45 sequences such as \d and \w to use Unicode properties to determine
46 character types, instead of recognizing only characters with codes less
47 than 128 via a lookup table.
48
49 The remainder of this document discusses the patterns that are sup‐
50 ported by PCRE when its main matching function, pcre_exec(), is used.
51 From release 6.0, PCRE offers a second matching function,
52 pcre_dfa_exec(), which matches using a different algorithm that is not
53 Perl-compatible. Some of the features discussed below are not available
54 when pcre_dfa_exec() is used. The advantages and disadvantages of the
55 alternative function, and how it differs from the normal function, are
56 discussed in the pcrematching page.
57
58 NEWLINE CONVENTIONS
59
60 PCRE supports five different conventions for indicating line breaks in
61 strings: a single CR (carriage return) character, a single LF (line‐
62 feed) character, the two-character sequence CRLF, any of the three pre‐
63 ceding, or any Unicode newline sequence. The pcreapi page has further
64 discussion about newlines, and shows how to set the newline convention
65 in the options arguments for the compiling and matching functions.
66
67 It is also possible to specify a newline convention by starting a pat‐
68 tern string with one of the following five sequences:
69
70 (*CR) carriage return
71 (*LF) linefeed
72 (*CRLF) carriage return, followed by linefeed
73 (*ANYCRLF) any of the three above
74 (*ANY) all Unicode newline sequences
75
76 These override the default and the options given to pcre_compile() or
77 pcre_compile2(). For example, on a Unix system where LF is the default
78 newline sequence, the pattern
79
80 (*CR)a.b
81
82 changes the convention to CR. That pattern matches "a\nb" because LF is
83 no longer a newline. Note that these special settings, which are not
84 Perl-compatible, are recognized only at the very start of a pattern,
85 and that they must be in upper case. If more than one of them is
86 present, the last one is used.
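
The rule that the last of several leading newline settings wins can be sketched as follows. This is a minimal illustration in Python, not PCRE's own code: the helper name leading_newline_setting and its "LF" default are assumptions, and it models only the five newline sequences listed above.

```python
import re

# Newline-convention names from the list above.
NEWLINE_NAMES = {"CR", "LF", "CRLF", "ANYCRLF", "ANY"}

def leading_newline_setting(pattern, default="LF"):
    """Return (convention, rest_of_pattern).

    Hypothetical helper: it scans (*NAME) items at the very start of the
    pattern and stops at the first item that is not a newline setting.
    """
    convention = default
    pos = 0
    while True:
        m = re.match(r"\(\*([A-Z]+)\)", pattern[pos:])
        if not m or m.group(1) not in NEWLINE_NAMES:
            break
        convention = m.group(1)   # a later setting replaces an earlier one
        pos += m.end()
    return convention, pattern[pos:]
```

Lower-case names and settings that are not at the very start are left in the pattern text untouched, which mirrors the restrictions described above.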
87
88 The newline convention affects the interpretation of the dot metachar‐
89 acter when PCRE_DOTALL is not set, and also the behaviour of \N. How‐
90 ever, it does not affect what the \R escape sequence matches. By
91 default, this is any Unicode newline sequence, for Perl compatibility.
92 However, this can be changed; see the description of \R in the section
93 entitled "Newline sequences" below. A change of \R setting can be com‐
94 bined with a change of newline convention.
95
96 CHARACTERS AND METACHARACTERS
97
98 A regular expression is a pattern that is matched against a subject
99 string from left to right. Most characters stand for themselves in a
100 pattern, and match the corresponding characters in the subject. As a
101 trivial example, the pattern
102
103 The quick brown fox
104
105 matches a portion of a subject string that is identical to itself. When
106 caseless matching is specified (the PCRE_CASELESS option), letters are
107 matched independently of case. In UTF-8 mode, PCRE always understands
108 the concept of case for characters whose values are less than 128, so
109 caseless matching is always possible. For characters with higher val‐
110 ues, the concept of case is supported if PCRE is compiled with Unicode
111 property support, but not otherwise. If you want to use caseless
112 matching for characters 128 and above, you must ensure that PCRE is
113 compiled with Unicode property support as well as with UTF-8 support.
114
115 The power of regular expressions comes from the ability to include
116 alternatives and repetitions in the pattern. These are encoded in the
117 pattern by the use of metacharacters, which do not stand for themselves
118 but instead are interpreted in some special way.
119
120 There are two different sets of metacharacters: those that are recog‐
121 nized anywhere in the pattern except within square brackets, and those
122 that are recognized within square brackets. Outside square brackets,
123 the metacharacters are as follows:
124
125   \      general escape character with several uses
126   ^      assert start of string (or line, in multiline mode)
127   $      assert end of string (or line, in multiline mode)
128   .      match any character except newline (by default)
129   [      start character class definition
130   |      start of alternative branch
131   (      start subpattern
132   )      end subpattern
133   ?      extends the meaning of (
134          also 0 or 1 quantifier
135          also quantifier minimizer
136   *      0 or more quantifier
137   +      1 or more quantifier
138          also "possessive quantifier"
139   {      start min/max quantifier
140
141 Part of a pattern that is in square brackets is called a "character
142 class". In a character class the only metacharacters are:
143
144 \ general escape character
145 ^ negate the class, but only if the first character
146 - indicates character range
147 [ POSIX character class (only if followed by POSIX
148 syntax)
149 ] terminates the character class
150
151 The following sections describe the use of each of the metacharacters.
152
153 BACKSLASH
154
155 The backslash character has several uses. Firstly, if it is followed by
156 a non-alphanumeric character, it takes away any special meaning that
157 character may have. This use of backslash as an escape character
158 applies both inside and outside character classes.
159
160 For example, if you want to match a * character, you write \* in the
161 pattern. This escaping action applies whether or not the following
162 character would otherwise be interpreted as a metacharacter, so it is
163 always safe to precede a non-alphanumeric with backslash to specify
164 that it stands for itself. In particular, if you want to match a back‐
165 slash, you write \\.
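
Because a backslash before any non-alphanumeric character is always safe, a literal string can be turned into a pattern mechanically. The sketch below is one way to do that in Python; quote_literal is a hypothetical name, and restricting the escaping to ASCII characters is an assumption made to keep the example simple.

```python
def quote_literal(text):
    """Backslash-escape every ASCII non-alphanumeric character in text."""
    out = []
    for ch in text:
        # ASCII letters and digits stay as they are; non-ASCII characters
        # are also left alone in this sketch.
        if ch.isalnum() or not ch.isascii():
            out.append(ch)
        else:
            out.append("\\" + ch)
    return "".join(out)
```

For example, quote_literal("a*b") gives the four characters a, backslash, *, b, and a single backslash is turned into two.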
166
167 If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
168 the pattern (other than in a character class) and characters between a
169 # outside a character class and the next newline are ignored. An escap‐
170 ing backslash can be used to include a whitespace or # character as
171 part of the pattern.
172
173 If you want to remove the special meaning from a sequence of charac‐
174 ters, you can do so by putting them between \Q and \E. This is differ‐
175 ent from Perl in that $ and @ are handled as literals in \Q...\E
176 sequences in PCRE, whereas in Perl, $ and @ cause variable interpola‐
177 tion. Note the following examples:
178
179   Pattern            PCRE matches   Perl matches
180
181   \Qabc$xyz\E        abc$xyz        abc followed by the
182                                       contents of $xyz
183   \Qabc\$xyz\E       abc\$xyz       abc\$xyz
184   \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
185
186 The \Q...\E sequence is recognized both inside and outside character
187 classes.
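
The PCRE column of the table can be reproduced with a small rewriting step that turns each \Q...\E run into individually escaped characters. This is a sketch only: expand_quoted is a hypothetical helper, it does not treat character classes specially, and it uses Python's re module (which has no \Q...\E of its own) to check the result.

```python
import re

def expand_quoted(pattern):
    """Rewrite \\Q...\\E runs as escaped literals; leave the rest alone."""
    out = []
    i = 0
    quoting = False
    while i < len(pattern):
        if quoting:
            if pattern.startswith("\\E", i):
                quoting = False
                i += 2
            else:
                out.append(re.escape(pattern[i]))  # $ and @ are literals here
                i += 1
        elif pattern.startswith("\\Q", i):
            quoting = True
            i += 2
        elif pattern.startswith("\\E", i):
            i += 2                                 # an isolated \E is dropped
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])           # keep other escapes intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)
```

Dropping an isolated \E is an assumption of the sketch rather than something stated above.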
188
189 Non-printing characters
190
191 A second use of backslash provides a way of encoding non-printing char‐
192 acters in patterns in a visible manner. There is no restriction on the
193 appearance of non-printing characters, apart from the binary zero that
194 terminates a pattern, but when a pattern is being prepared by text
195 editing, it is often easier to use one of the following escape
196 sequences than the binary character it represents:
197
198 \a alarm, that is, the BEL character (hex 07)
199 \cx "control-x", where x is any character
200 \e escape (hex 1B)
201 \f formfeed (hex 0C)
202 \n linefeed (hex 0A)
203 \r carriage return (hex 0D)
204 \t tab (hex 09)
205 \ddd character with octal code ddd, or back reference
206 \xhh character with hex code hh
207 \x{hhh..} character with hex code hhh..
208
209 The precise effect of \cx is as follows: if x is a lower case letter,
210 it is converted to upper case. Then bit 6 of the character (hex 40) is
211 inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
212 becomes hex 7B.
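
The \cx rule is a two-step computation, shown here as a short Python sketch. The function name control_char is an assumption; the arithmetic follows the description above.

```python
def control_char(x):
    """Character code produced by backslash-c followed by the character x."""
    if "a" <= x <= "z":
        x = x.upper()        # lower case letters are converted first
    return ord(x) ^ 0x40     # then bit 6 (hex 40) is inverted

# The three cases mentioned in the text:
examples = {x: hex(control_char(x)) for x in "z{;"}
```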
213
214 After \x, from zero to two hexadecimal digits are read (letters can be
215 in upper or lower case). Any number of hexadecimal digits may appear
216 between \x{ and }, but the value of the character code must be less
217 than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,
218 the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger
219 than the largest Unicode code point, which is 10FFFF.
220
221 If characters other than hexadecimal digits appear between \x{ and },
222 or if there is no terminating }, this form of escape is not recognized.
223 Instead, the initial \x will be interpreted as a basic hexadecimal
224 escape, with no following digits, giving a character whose value is
225 zero.
226
227 Characters whose value is less than 256 can be defined by either of the
228 two syntaxes for \x. There is no difference in the way they are han‐
229 dled. For example, \xdc is exactly the same as \x{dc}.
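
The two \x forms and the fallback for a malformed \x{...} can be modelled as below. This is a Python sketch under stated assumptions: parse_hex_escape is a hypothetical helper that receives the pattern text immediately after \x, and it does not enforce the upper limits on the character value.

```python
import re

def parse_hex_escape(rest):
    """Return (character_code, characters_consumed_after_backslash_x)."""
    braced = re.match(r"\{([0-9A-Fa-f]*)\}", rest)
    if braced:
        # Well-formed \x{hhh..}: any number of hex digits between the braces.
        return int(braced.group(1) or "0", 16), braced.end()
    # Otherwise read from zero to two hex digits; a malformed \x{ therefore
    # yields the value zero and consumes nothing.
    plain = re.match(r"[0-9A-Fa-f]{0,2}", rest)
    return int(plain.group() or "0", 16), plain.end()
```

With this model, "dc" and "{dc}" both give the value hex DC, while "{zz}" and "{12" give zero and leave the brace to be read as ordinary pattern text.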
230
231 After \0 up to two further octal digits are read. If there are fewer
232 than two digits, just those that are present are used. Thus the
233 sequence \0\x\07 specifies two binary zeros followed by a BEL character
234 (code value 7). Make sure you supply two digits after the initial zero
235 if the pattern character that follows is itself an octal digit.
236
237 The handling of a backslash followed by a digit other than 0 is compli‐
238 cated. Outside a character class, PCRE reads it and any following dig‐
239 its as a decimal number. If the number is less than 10, or if there
240 have been at least that many previous capturing left parentheses in the
241 expression, the entire sequence is taken as a back reference. A
242 description of how this works is given later, following the discussion
243 of parenthesized subpatterns.
244
245 Inside a character class, or if the decimal number is greater than 9
246 and there have not been that many capturing subpatterns, PCRE re-reads
247 up to three octal digits following the backslash, and uses them to gen‐
248 erate a data character. Any subsequent digits stand for themselves. In
249 non-UTF-8 mode, the value of a character specified in octal must be
250 less than \400. In UTF-8 mode, values up to \777 are permitted. For
251 example:
252
253   \040   is another way of writing a space
254   \40    is the same, provided there are fewer than 40
255             previous capturing subpatterns
256   \7     is always a back reference
257   \11    might be a back reference, or another way of
258             writing a tab
259   \011   is always a tab
260   \0113  is a tab followed by the character "3"
261   \113   might be a back reference, otherwise the
262             character with octal code 113
263   \377   might be a back reference, otherwise
264             the byte consisting entirely of 1 bits
265   \81    is either a back reference, or a binary zero
266             followed by the two characters "8" and "1"
267
268 Note that octal values of 100 or greater must not be introduced by a
269 leading zero, because no more than three octal digits are ever read.
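
The decision between a back reference and an octal character can be written out as a small function. The following Python sketch follows the rules above; read_backslash_digits and its arguments are hypothetical names, and the limits on the octal value in each mode are not checked.

```python
import re

def read_backslash_digits(text, groups_so_far, in_class=False):
    """text is what follows the backslash and starts with a digit.

    Returns ("backref", number, rest) or ("char", code, rest).
    """
    if text[0] == "0":
        # After \0, up to two further octal digits are read.
        m = re.match(r"0[0-7]{0,2}", text)
        return ("char", int(m.group(), 8), text[m.end():])
    if not in_class:
        dec = re.match(r"[0-9]+", text)
        number = int(dec.group())
        if number < 10 or number <= groups_so_far:
            return ("backref", number, text[dec.end():])
    # Re-read up to three octal digits; none at all gives a binary zero.
    octal = re.match(r"[0-7]{0,3}", text)
    code = int(octal.group(), 8) if octal.group() else 0
    return ("char", code, text[octal.end():])
```

Here groups_so_far stands for the number of capturing left parentheses seen earlier in the pattern.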
270
271 All the sequences that define a single character value can be used both
272 inside and outside character classes. In addition, inside a character
273 class, the sequence \b is interpreted as the backspace character (hex
274 08). The sequences \B, \N, \R, and \X are not special inside a charac‐
275 ter class. Like any other unrecognized escape sequences, they are
276 treated as the literal characters "B", "N", "R", and "X" by default,
277 but cause an error if the PCRE_EXTRA option is set. Outside a character
278 class, these sequences have different meanings.
279
280 Absolute and relative back references
281
282 The sequence \g followed by an unsigned or a negative number, option‐
283 ally enclosed in braces, is an absolute or relative back reference. A
284 named back reference can be coded as \g{name}. Back references are dis‐
285 cussed later, following the discussion of parenthesized subpatterns.
286
287 Absolute and relative subroutine calls
288
289 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
290 name or a number enclosed either in angle brackets or single quotes, is
291 an alternative syntax for referencing a subpattern as a "subroutine".
292 Details are discussed later. Note that \g{...} (Perl syntax) and
293 \g<...> (Oniguruma syntax) are not synonymous. The former is a back
294 reference; the latter is a subroutine call.
295
296 Generic character types
297
298 Another use of backslash is for specifying generic character types:
299
300 \d any decimal digit
301 \D any character that is not a decimal digit
302 \h any horizontal whitespace character
303 \H any character that is not a horizontal whitespace character
304 \s any whitespace character
305 \S any character that is not a whitespace character
306 \v any vertical whitespace character
307 \V any character that is not a vertical whitespace character
308 \w any "word" character
309 \W any "non-word" character
310
311 There is also the single sequence \N, which matches a non-newline char‐
312 acter. This is the same as the "." metacharacter when PCRE_DOTALL is
313 not set.
314
315 Each pair of lower and upper case escape sequences partitions the com‐
316 plete set of characters into two disjoint sets. Any given character
317 matches one, and only one, of each pair. The sequences can appear both
318 inside and outside character classes. They each match one character of
319 the appropriate type. If the current matching point is at the end of
320 the subject string, all of them fail, because there is no character to
321 match.
322
323 For compatibility with Perl, \s does not match the VT character (code
324 11). This makes it different from the POSIX "space" class. The \s
325 characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
326 "use locale;" is included in a Perl script, \s may match the VT charac‐
327 ter. In PCRE, it never does.
328
329 A "word" character is an underscore or any character that is a letter
330 or digit. By default, the definition of letters and digits is con‐
331 trolled by PCRE's low-valued character tables, and may vary if locale-
332 specific matching is taking place (see "Locale support" in the pcreapi
333 page). For example, in a French locale such as "fr_FR" in Unix-like
334 systems, or "french" in Windows, some character codes greater than 128
335 are used for accented letters, and these are then matched by \w. The
336 use of locales with Unicode is discouraged.
337
338 By default, in UTF-8 mode, characters with values greater than 127
339 never match \d, \s, or \w, and always match \D, \S, and \W. These
340 sequences retain their original meanings from before UTF-8 support was
341 available, mainly for efficiency reasons. However, if PCRE is compiled
342 with Unicode property support, and the PCRE_UCP option is set, the be‐
343 haviour is changed so that Unicode properties are used to determine
344 character types, as follows:
345
346 \d any character that \p{Nd} matches (decimal digit)
347 \s any character that \p{Z} matches, plus HT, LF, FF, CR
348 \w any character that \p{L} or \p{N} matches, plus underscore
349
350 The upper case escapes match the inverse sets of characters. Note that
351 \d matches only decimal digits, whereas \w matches any Unicode digit,
352 as well as any Unicode letter, and underscore. Note also that PCRE_UCP
353 affects \b and \B, because they are defined in terms of \w and \W.
354 Matching these sequences is noticeably slower when PCRE_UCP is set.
355
356 The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
357 the other sequences, which match only ASCII characters by default,
358 these always match certain high-valued codepoints in UTF-8 mode,
359 whether or not PCRE_UCP is set. The horizontal space characters are:
360
361 U+0009 Horizontal tab
362 U+0020 Space
363 U+00A0 Non-break space
364 U+1680 Ogham space mark
365 U+180E Mongolian vowel separator
366 U+2000 En quad
367 U+2001 Em quad
368 U+2002 En space
369 U+2003 Em space
370 U+2004 Three-per-em space
371 U+2005 Four-per-em space
372 U+2006 Six-per-em space
373 U+2007 Figure space
374 U+2008 Punctuation space
375 U+2009 Thin space
376 U+200A Hair space
377 U+202F Narrow no-break space
378 U+205F Medium mathematical space
379 U+3000 Ideographic space
380
381 The vertical space characters are:
382
383 U+000A Linefeed
384 U+000B Vertical tab
385 U+000C Formfeed
386 U+000D Carriage return
387 U+0085 Next line
388 U+2028 Line separator
389 U+2029 Paragraph separator
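
The two lists can be captured as plain sets of code points, which makes the membership tests for \h and \v explicit. This is an illustrative Python sketch; the names HSPACE, VSPACE, is_hspace and is_vspace are assumptions.

```python
# Horizontal space characters from the list above.
HSPACE = {0x09, 0x20, 0xA0, 0x1680, 0x180E, 0x202F, 0x205F, 0x3000}
HSPACE |= set(range(0x2000, 0x200B))      # U+2000 to U+200A inclusive

# Vertical space characters from the list above.
VSPACE = {0x0A, 0x0B, 0x0C, 0x0D, 0x85, 0x2028, 0x2029}

def is_hspace(ch):
    """True if ch would be matched by backslash-h."""
    return ord(ch) in HSPACE

def is_vspace(ch):
    """True if ch would be matched by backslash-v."""
    return ord(ch) in VSPACE
```

The upper case forms \H and \V are simply the negations of these tests.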
390
391 Newline sequences
392
393 Outside a character class, by default, the escape sequence \R matches
394 any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8
395 mode \R is equivalent to the following:
396
397 (?>\r\n|\n|\x0b|\f|\r|\x85)
398
399 This is an example of an "atomic group", details of which are given
400 below. This particular group matches either the two-character sequence
401 CR followed by LF, or one of the single characters LF (linefeed,
402 U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
403 return, U+000D), or NEL (next line, U+0085). The two-character sequence
404 is treated as a single unit that cannot be split.
405
406 In UTF-8 mode, two additional characters whose codepoints are greater
407 than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa‐
408 rator, U+2029). Unicode character property support is not needed for
409 these characters to be recognized.
410
411 It is possible to restrict \R to match only CR, LF, or CRLF (instead of
412 the complete set of Unicode line endings) by setting the option
413 PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
414 (BSR is an abbreviation for "backslash R".) This can be made the default
415 when PCRE is built; if this is the case, the other behaviour can be
416 requested via the PCRE_BSR_UNICODE option. It is also possible to
417 specify these settings by starting a pattern string with one of the
418 following sequences:
419
420 (*BSR_ANYCRLF) CR, LF, or CRLF only
421 (*BSR_UNICODE) any Unicode newline sequence
422
423 These override the default and the options given to pcre_compile() or
424 pcre_compile2(), but they can be overridden by options given to
425 pcre_exec() or pcre_dfa_exec(). Note that these special settings, which
426 are not Perl-compatible, are recognized only at the very start of a
427 pattern, and that they must be in upper case. If more than one of them
428 is present, the last one is used. They can be combined with a change of
429 newline convention; for example, a pattern can start with:
430
431 (*ANY)(*BSR_ANYCRLF)
432
433 They can also be combined with the (*UTF8) or (*UCP) special sequences.
434 Inside a character class, \R is treated as an unrecognized escape
435 sequence, and so matches the letter "R" by default, but causes an error
436 if PCRE_EXTRA is set.
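
What \R accepts under each setting can be sketched as a function that returns the length of the newline sequence at a given position. This is a Python illustration, not PCRE code; newline_sequence_length and its utf8 and anycrlf flags are hypothetical stand-ins for UTF-8 mode and the PCRE_BSR_ANYCRLF setting.

```python
def newline_sequence_length(subject, pos, utf8=False, anycrlf=False):
    """Length of the newline sequence at pos, or 0 if there is none."""
    if subject.startswith("\r\n", pos):
        return 2                  # CRLF is taken as one unit, never split
    singles = "\n\r" if anycrlf else "\n\x0b\x0c\r\x85"
    if utf8 and not anycrlf:
        singles += "\u2028\u2029"  # LS and PS are added in UTF-8 mode
    if pos < len(subject) and subject[pos] in singles:
        return 1
    return 0
```

Returning a single length models the atomic group: once CRLF has been taken, the function offers no shorter alternative.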
437
438 Unicode character properties
439
440 When PCRE is built with Unicode character property support, three addi‐
441 tional escape sequences that match characters with specific properties
442 are available. When not in UTF-8 mode, these sequences are of course
443 limited to testing characters whose codepoints are less than 256, but
444 they do work in this mode. The extra escape sequences are:
445
446 \p{xx} a character with the xx property
447 \P{xx} a character without the xx property
448 \X an extended Unicode sequence
449
450 The property names represented by xx above are limited to the Unicode
451 script names, the general category properties, "Any", which matches any
452 character (including newline), and some special PCRE properties
453 (described in the next section). Other Perl properties such as "InMu‐
454 sicalSymbols" are not currently supported by PCRE. Note that \P{Any}
455 does not match any characters, so always causes a match failure.
456
457 Sets of Unicode characters are defined as belonging to certain scripts.
458 A character from one of these sets can be matched using a script name.
459 For example:
460
461 \p{Greek}
462 \P{Han}
463
464 Those that are not part of an identified script are lumped together as
465 "Common". The current list of scripts is:
466
467 Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,
468 Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common,
469 Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyp‐
470 tian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek,
471 Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Impe‐
472 rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,
473 Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao,
474 Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam,
475 Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic,
476 Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya,
477 Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Shavian,
478 Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le,
479 Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,
480 Ugaritic, Vai, Yi.
481
482 Each character has exactly one Unicode general category property, spec‐
483 ified by a two-letter abbreviation. For compatibility with Perl, nega‐
484 tion can be specified by including a circumflex between the opening
485 brace and the property name. For example, \p{^Lu} is the same as
486 \P{Lu}.
487
488 If only one letter is specified with \p or \P, it includes all the gen‐
489 eral category properties that start with that letter. In this case, in
490 the absence of negation, the curly brackets in the escape sequence are
491 optional; these two examples have the same effect:
492
493 \p{L}
494 \pL
495
496 The following general category property codes are supported:
497
498 C Other
499 Cc Control
500 Cf Format
501 Cn Unassigned
502 Co Private use
503 Cs Surrogate
504
505 L Letter
506 Ll Lower case letter
507 Lm Modifier letter
508 Lo Other letter
509 Lt Title case letter
510 Lu Upper case letter
511
512 M Mark
513 Mc Spacing mark
514 Me Enclosing mark
515 Mn Non-spacing mark
516
517 N Number
518 Nd Decimal number
519 Nl Letter number
520 No Other number
521
522 P Punctuation
523 Pc Connector punctuation
524 Pd Dash punctuation
525 Pe Close punctuation
526 Pf Final punctuation
527 Pi Initial punctuation
528 Po Other punctuation
529 Ps Open punctuation
530
531 S Symbol
532 Sc Currency symbol
533 Sk Modifier symbol
534 Sm Mathematical symbol
535 So Other symbol
536
537 Z Separator
538 Zl Line separator
539 Zp Paragraph separator
540 Zs Space separator
541
542 The special property L& is also supported: it matches a character that
543 has the Lu, Ll, or Lt property, in other words, a letter that is not
544 classified as a modifier or "other".
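
The general category rules (two-letter codes, one-letter prefixes, negation with a circumflex, "Any" and L&) can be modelled with Python's unicodedata module. This is a sketch under stated assumptions: has_property is a hypothetical helper, script names are not handled, and unicodedata reflects the Unicode version bundled with Python, which may differ from the tables in a given PCRE build.

```python
import unicodedata

def has_property(ch, prop):
    """True if ch has the property named as in \\p{prop}."""
    if prop.startswith("^"):
        return not has_property(ch, prop[1:])   # \p{^Lu} equals \P{Lu}
    if prop == "Any":
        return True
    category = unicodedata.category(ch)         # e.g. "Lu", "Nd", "Zs"
    if prop == "L&":
        return category in ("Lu", "Ll", "Lt")
    if len(prop) == 1:
        return category.startswith(prop)        # \pL covers Ll, Lm, Lo, ...
    return category == prop
```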
545
546 The Cs (Surrogate) property applies only to characters in the range
547 U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see
548 RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check‐
549 ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in
550 the pcreapi page). Perl does not support the Cs property.
551
552 The long synonyms for property names that Perl supports (such as
553 \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
554 any of these properties with "Is".
555
556 No character that is in the Unicode table has the Cn (unassigned) prop‐
557 erty. Instead, this property is assumed for any code point that is not
558 in the Unicode table.
559
560 Specifying caseless matching does not affect these escape sequences.
561 For example, \p{Lu} always matches only upper case letters.
562
563 The \X escape matches any number of Unicode characters that form an
564 extended Unicode sequence. \X is equivalent to
565
566 (?>\PM\pM*)
567
568 That is, it matches a character without the "mark" property, followed
569 by zero or more characters with the "mark" property, and treats the
570 sequence as an atomic group (see below). Characters with the "mark"
571 property are typically accents that affect the preceding character.
572 None of them have codepoints less than 256, so in non-UTF-8 mode \X
573 matches any one character.
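
The equivalence of \X with (?>\PM\pM*) can be spelled out as a loop. The Python sketch below uses unicodedata to test the "mark" property; extended_sequence_length is a hypothetical name, and the category data comes from Python's bundled Unicode tables rather than from PCRE.

```python
import unicodedata

def is_mark(ch):
    """True if ch has one of the M (mark) general categories."""
    return unicodedata.category(ch).startswith("M")

def extended_sequence_length(subject, pos):
    """Number of characters an \\X at pos would take, or 0 for no match."""
    if pos >= len(subject) or is_mark(subject[pos]):
        return 0                      # \PM needs one non-mark character
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1                      # \pM* takes every following mark
    return end - pos
```

Because the loop always takes all following marks and returns one length, it behaves like the atomic group: no shorter sequence is ever offered.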
574
575 Matching characters by Unicode property is not fast, because PCRE has
576 to search a structure that contains data for over fifteen thousand
577 characters. That is why the traditional escape sequences such as \d and
578 \w do not use Unicode properties in PCRE by default, though you can
579 make them do so by setting the PCRE_UCP option for pcre_compile() or by
580 starting the pattern with (*UCP).
581
582 PCRE's additional properties
583
584 As well as the standard Unicode properties described in the previous
585 section, PCRE supports four more that make it possible to convert tra‐
586 ditional escape sequences such as \w and \s and POSIX character classes
587 to use Unicode properties. PCRE uses these non-standard, non-Perl prop‐
588 erties internally when PCRE_UCP is set. They are:
589
590 Xan Any alphanumeric character
591 Xps Any POSIX space character
592 Xsp Any Perl space character
593 Xwd Any Perl "word" character
594
595 Xan matches characters that have either the L (letter) or the N (num‐
596 ber) property. Xps matches the characters tab, linefeed, vertical tab,
597 formfeed, or carriage return, and any other character that has the Z
598 (separator) property. Xsp is the same as Xps, except that vertical tab
599 is excluded. Xwd matches the same characters as Xan, plus underscore.
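
The four additional properties are defined entirely in terms of general categories and a few control characters, so they can be sketched directly. This Python illustration uses unicodedata; the function names are assumptions, and the category data is that of Python's bundled Unicode version.

```python
import unicodedata

def _major(ch):
    """First letter of the general category, e.g. "L" for "Lu"."""
    return unicodedata.category(ch)[0]

def is_xan(ch):
    return _major(ch) in ("L", "N")              # letter or number

def is_xps(ch):
    # tab, linefeed, vertical tab, formfeed, carriage return, or Z property
    return ch in ("\t", "\n", "\x0b", "\x0c", "\r") or _major(ch) == "Z"

def is_xsp(ch):
    return ch != "\x0b" and is_xps(ch)           # Xps without vertical tab

def is_xwd(ch):
    return ch == "_" or is_xan(ch)               # Xan plus underscore
```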
600
601 Resetting the match start
602
603 The escape sequence \K, which is a Perl 5.10 feature, causes any previ‐
604 ously matched characters not to be included in the final matched
605 sequence. For example, the pattern:
606
607 foo\Kbar
608
609 matches "foobar", but reports that it has matched "bar". This feature
610 is similar to a lookbehind assertion (described below). However, in
611 this case, the part of the subject before the real match does not have
612 to be of fixed length, as lookbehind assertions do. The use of \K does
613 not interfere with the setting of captured substrings. For example,
614 when the pattern
615
616 (foo)\Kbar
617
618 matches "foobar", the first substring is still set to "foo".
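
The effect of \K can be imitated in an engine that lacks it by capturing the part after the reset point. The Python sketch below does this with the re module, which has no \K; search_with_reset and the group name "kept" are hypothetical, and the approach is an emulation rather than a description of how PCRE implements the feature.

```python
import re

def search_with_reset(before, after, subject):
    """Emulate the pattern before\\Kafter.

    Returns (reported_text, match_object), or None if there is no match.
    Capturing groups inside `before` keep their numbers.
    """
    m = re.search("(?:%s)(?P<kept>%s)" % (before, after), subject)
    if m is None:
        return None
    return m.group("kept"), m
```

With before "(foo)" and after "bar" on the subject "foobar", the reported text is "bar" while group 1 of the match object is still "foo".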
619
620 Perl documents that the use of \K within assertions is "not well
621 defined". In PCRE, \K is acted upon when it occurs inside positive
622 assertions, but is ignored in negative assertions.
623
624 Simple assertions
625
626 The final use of backslash is for certain simple assertions. An asser‐
627 tion specifies a condition that has to be met at a particular point in
628 a match, without consuming any characters from the subject string. The
629 use of subpatterns for more complicated assertions is described below.
630 The backslashed assertions are:
631
632 \b matches at a word boundary
633 \B matches when not at a word boundary
634 \A matches at the start of the subject
635 \Z matches at the end of the subject
636 also matches before a newline at the end of the subject
637 \z matches only at the end of the subject
638 \G matches at the first matching position in the subject
639
640 Inside a character class, \b has a different meaning; it matches the
641 backspace character. If any other of these assertions appears in a
642 character class, by default it matches the corresponding literal char‐
643 acter (for example, \B matches the letter B). However, if the
644 PCRE_EXTRA option is set, an "invalid escape sequence" error is gener‐
645 ated instead.
646
647 A word boundary is a position in the subject string where the current
648 character and the previous character do not both match \w or \W (i.e.
649 one matches \w and the other matches \W), or the start or end of the
650 string if the first or last character matches \w, respectively. In
651 UTF-8 mode, the meanings of \w and \W can be changed by setting the
652 PCRE_UCP option. When this is done, it also affects \b and \B. Neither
653 PCRE nor Perl has a separate "start of word" or "end of word" metase‐
654 quence. However, whatever follows \b normally determines which it is.
655 For example, the fragment \ba matches "a" at the start of a word.
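
The word boundary definition reduces to comparing the characters on either side of a position. The Python sketch below assumes the default (non-UCP) meaning of \w, that is ASCII letters, digits and underscore; is_word and is_word_boundary are hypothetical names.

```python
def is_word(ch):
    """Default \\w: an ASCII letter, an ASCII digit, or underscore."""
    return ch == "_" or (ch.isascii() and ch.isalnum())

def is_word_boundary(subject, pos):
    """True if \\b would match at pos (0 <= pos <= len(subject))."""
    before = pos > 0 and is_word(subject[pos - 1])
    after = pos < len(subject) and is_word(subject[pos])
    # A boundary exists when exactly one side is a word character; the
    # start and end of the subject count as non-word.
    return before != after
```

For the subject "ab cd" the boundaries are at offsets 0, 2, 3 and 5; \B matches at the remaining offsets.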
656
657 The \A, \Z, and \z assertions differ from the traditional circumflex
658 and dollar (described in the next section) in that they only ever match
659 at the very start and end of the subject string, whatever options are
660 set. Thus, they are independent of multiline mode. These three asser‐
661 tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
662 affect only the behaviour of the circumflex and dollar metacharacters.
663 However, if the startoffset argument of pcre_exec() is non-zero, indi‐
664 cating that matching is to start at a point other than the beginning of
665 the subject, \A can never match. The difference between \Z and \z is
666 that \Z matches before a newline at the end of the string as well as at
667 the very end, whereas \z matches only at the end.
668
669 The \G assertion is true only when the current matching position is at
670 the start point of the match, as specified by the startoffset argument
671 of pcre_exec(). It differs from \A when the value of startoffset is
672 non-zero. By calling pcre_exec() multiple times with appropriate
673 arguments, you can mimic Perl's /g option, and it is in this kind of
674 implementation that \G can be useful.
675
676 Note, however, that PCRE's interpretation of \G, as the start of the
677 current match, is subtly different from Perl's, which defines it as the
678 end of the previous match. In Perl, these can be different when the
679 previously matched string was empty. Because PCRE does just one match
680 at a time, it cannot reproduce this behaviour.
681
682 If all the alternatives of a pattern begin with \G, the expression is
683 anchored to the starting match position, and the "anchored" flag is set
684 in the compiled regular expression.
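
The repeated-call technique in which \G is useful can be sketched as a loop that restarts each match where the previous one ended. Python's re module has no \G, so this illustration uses Pattern.match(subject, pos), which only succeeds exactly at pos, as a stand-in; match_all_from is a hypothetical helper, and stopping at an empty match is a simplification.

```python
import re

def match_all_from(pattern, subject, start=0):
    """Collect consecutive matches, each anchored where the last one ended."""
    compiled = re.compile(pattern)
    pieces = []
    pos = start
    while pos <= len(subject):
        m = compiled.match(subject, pos)   # must match exactly at pos
        if m is None or m.end() == pos:    # stop on failure or empty match
            break
        pieces.append(m.group())
        pos = m.end()                      # next start offset
    return pieces
```

In PCRE terms, pos plays the role of the startoffset argument of pcre_exec(), and anchoring at pos is what a leading \G provides.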
685
686 CIRCUMFLEX AND DOLLAR
687
688 Outside a character class, in the default matching mode, the circumflex
689 character is an assertion that is true only if the current matching
690 point is at the start of the subject string. If the startoffset argu‐
691 ment of pcre_exec() is non-zero, circumflex can never match if the
692 PCRE_MULTILINE option is unset. Inside a character class, circumflex
693 has an entirely different meaning (see below).
694
695 Circumflex need not be the first character of the pattern if a number
696 of alternatives are involved, but it should be the first thing in each
697 alternative in which it appears if the pattern is ever to match that
698 branch. If all possible alternatives start with a circumflex, that is,
699 if the pattern is constrained to match only at the start of the sub‐
700 ject, it is said to be an "anchored" pattern. (There are also other
701 constructs that can cause a pattern to be anchored.)
702
703 A dollar character is an assertion that is true only if the current
704 matching point is at the end of the subject string, or immediately
705 before a newline at the end of the string (by default). Dollar need not
706 be the last character of the pattern if a number of alternatives are
707 involved, but it should be the last item in any branch in which it
708 appears. Dollar has no special meaning in a character class.
709
710 The meaning of dollar can be changed so that it matches only at the
711 very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
712 compile time. This does not affect the \Z assertion.
713
714 The meanings of the circumflex and dollar characters are changed if the
715 PCRE_MULTILINE option is set. When this is the case, a circumflex
716 matches immediately after internal newlines as well as at the start of
717 the subject string. It does not match after a newline that ends the
718 string. A dollar matches before any newlines in the string, as well as
719 at the very end, when PCRE_MULTILINE is set. When newline is specified
720 as the two-character sequence CRLF, isolated CR and LF characters do
721 not indicate newlines.
722
723 For example, the pattern /^abc$/ matches the subject string "def\nabc"
724 (where \n represents a newline) in multiline mode, but not otherwise.
725 Consequently, patterns that are anchored in single line mode because
726 all branches start with ^ are not anchored in multiline mode, and a
727 match for circumflex is possible when the startoffset argument of
728 pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
729 PCRE_MULTILINE is set.
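
The example above can be tried with Python's re module, used here only as a stand-in on the assumption that it follows the same default rules for circumflex and dollar; re.MULTILINE takes the place of PCRE_MULTILINE.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the very start, so "abc" on line two is not found.
single_line = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches just after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# By default $ also matches just before a newline that ends the string.
before_final_newline = re.search(r"abc$", "abc\n")
```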
730
731 Note that the sequences \A, \Z, and \z can be used to match the start
732 and end of the subject in both modes, and if all branches of a pattern
733 start with \A it is always anchored, whether or not PCRE_MULTILINE is
734 set.
735
737
738 Outside a character class, a dot in the pattern matches any one charac‐
739 ter in the subject string except (by default) a character that signi‐
740 fies the end of a line. In UTF-8 mode, the matched character may be
741 more than one byte long.
742
743 When a line ending is defined as a single character, dot never matches
744 that character; when the two-character sequence CRLF is used, dot does
745 not match CR if it is immediately followed by LF, but otherwise it
746 matches all characters (including isolated CRs and LFs). When any Uni‐
747 code line endings are being recognized, dot does not match CR or LF or
748 any of the other line ending characters.
749
750 The behaviour of dot with regard to newlines can be changed. If the
751 PCRE_DOTALL option is set, a dot matches any one character, without
752 exception. If the two-character sequence CRLF is present in the subject
753 string, it takes two dots to match it.
754
755 The handling of dot is entirely independent of the handling of circum‐
756 flex and dollar, the only relationship being that they both involve
757 newlines. Dot has no special meaning in a character class.
758
759 The escape sequence \N always behaves as a dot does when PCRE_DOTALL is
760 not set. In other words, it matches any one character except one that
761 signifies the end of a line.
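
A short sketch of the effect of the "dotall" setting, written for Python's re module as a stand-in. It assumes the simplest case, in which the only line-ending character is LF (the only one Python's dot treats specially); re.DOTALL takes the place of PCRE_DOTALL.

```python
import re

# Without DOTALL, dot refuses to match the newline between "a" and "b".
default_dot = re.search(r"a.b", "a\nb")

# With DOTALL, dot matches any one character, without exception.
dotall_dot = re.search(r"a.b", "a\nb", re.DOTALL)

# A CRLF pair is two characters, so it takes two dots to match it.
crlf = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)
```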
762
764
765 Outside a character class, the escape sequence \C matches any one byte,
766 both in and out of UTF-8 mode. Unlike a dot, it always matches any
767 line-ending characters. The feature is provided in Perl in order to
768 match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char‐
769 acters into individual bytes, what remains in the string may be a mal‐
770 formed UTF-8 string. For this reason, the \C escape sequence is best
771 avoided.
772
773 PCRE does not allow \C to appear in lookbehind assertions (described
774 below), because in UTF-8 mode this would make it impossible to calcu‐
775 late the length of the lookbehind.
776
778
779 An opening square bracket introduces a character class, terminated by a
780 closing square bracket. A closing square bracket on its own is not spe‐
781 cial by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set,
782 a lone closing square bracket causes a compile-time error. If a closing
783 square bracket is required as a member of the class, it should be the
784 first data character in the class (after an initial circumflex, if
785 present) or escaped with a backslash.
786
787 A character class matches a single character in the subject. In UTF-8
788 mode, the character may be more than one byte long. A matched character
789 must be in the set of characters defined by the class, unless the first
790 character in the class definition is a circumflex, in which case the
791 subject character must not be in the set defined by the class. If a
792 circumflex is actually required as a member of the class, ensure it is
793 not the first character, or escape it with a backslash.
794
795 For example, the character class [aeiou] matches any lower case vowel,
796 while [^aeiou] matches any character that is not a lower case vowel.
797 Note that a circumflex is just a convenient notation for specifying the
798 characters that are in the class by enumerating those that are not. A
799 class that starts with a circumflex is not an assertion; it still con‐
800 sumes a character from the subject string, and therefore it fails if
801 the current pointer is at the end of the string.
802
803 In UTF-8 mode, characters with values greater than 255 can be included
804 in a class as a literal string of bytes, or by using the \x{ escaping
805 mechanism.
806
807 When caseless matching is set, any letters in a class represent both
808 their upper case and lower case versions, so for example, a caseless
809 [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
810 match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
811 understands the concept of case for characters whose values are less
812 than 128, so caseless matching is always possible. For characters with
813 higher values, the concept of case is supported if PCRE is compiled
814 with Unicode property support, but not otherwise. If you want to use
815 caseless matching in UTF-8 mode for characters 128 and above, you must
816 ensure that PCRE is compiled with Unicode property support as well as
817 with UTF-8 support.
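
The behaviour of plain, negated, and caseless classes can be checked with a few lines of Python. The re module is used as a stand-in on the assumption that it agrees with PCRE for these ASCII cases; re.IGNORECASE takes the place of PCRE_CASELESS.

```python
import re

vowel = re.compile(r"[aeiou]")
non_vowel = re.compile(r"[^aeiou]")

# A negated class still has to consume a character, so it cannot match
# when the current position is the very end of the subject.
at_end = non_vowel.match("a", 1)

# Caseless matching: letters in the class stand for both cases.
caseless_vowel = re.compile(r"[aeiou]", re.IGNORECASE)
caseless_non_vowel = re.compile(r"[^aeiou]", re.IGNORECASE)
```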
818
819 Characters that might indicate line breaks are never treated in any
820 special way when matching character classes, whatever line-ending
821 sequence is in use, and whatever setting of the PCRE_DOTALL and
822 PCRE_MULTILINE options is used. A class such as [^a] always matches one
823 of these characters.
824
825 The minus (hyphen) character can be used to specify a range of charac‐
826 ters in a character class. For example, [d-m] matches any letter
827 between d and m, inclusive. If a minus character is required in a
828 class, it must be escaped with a backslash or appear in a position
829 where it cannot be interpreted as indicating a range, typically as the
830 first or last character in the class.
831
832 It is not possible to have the literal character "]" as the end charac‐
833 ter of a range. A pattern such as [W-]46] is interpreted as a class of
834 two characters ("W" and "-") followed by a literal string "46]", so it
835 would match "W46]" or "-46]". However, if the "]" is escaped with a
836 backslash it is interpreted as the end of range, so [W-\]46] is inter‐
837 preted as a class containing a range followed by two other characters.
838 The octal or hexadecimal representation of "]" can also be used to end
839 a range.
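
The two readings of a hyphen next to a closing square bracket are easy to confuse, so here is a small sketch. It uses Python's re module as a stand-in, on the assumption that it parses these particular classes the same way as PCRE.

```python
import re

# [d-m] : any letter between d and m, inclusive.
d_to_m = re.compile(r"[d-m]")

# [W-]46] : a class holding "W" and "-", followed by the literal text "46]".
two_char_class = re.compile(r"[W-]46]")

# [W-\]46] : a single class holding the range W..] plus "4" and "6".
range_class = re.compile(r"[W-\]46]")
```

With the escaped bracket, "Z" matches because it lies between "W" and "]" in character-code order, while "-" no longer matches because the hyphen now denotes a range.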
840
841 Ranges operate in the collating sequence of character values. They can
842 also be used for characters specified numerically, for example
843 [\000-\037]. In UTF-8 mode, ranges can include characters whose values
844 are greater than 255, for example [\x{100}-\x{2ff}].
845
846 If a range that includes letters is used when caseless matching is set,
847 it matches the letters in either case. For example, [W-c] is equivalent
848 to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
849 character tables for a French locale are in use, [\xc8-\xcb] matches
850 accented E characters in both cases. In UTF-8 mode, PCRE supports the
851 concept of case for characters with values greater than 127 only when
852 it is compiled with Unicode property support.
853
854 The character types \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, \w, and \W
855 may also appear in a character class, and add the characters that they
856 match to the class. For example, [\dABCDEF] matches any hexadecimal
857 digit. A circumflex can conveniently be used with the upper case char‐
858 acter types to specify a more restricted set of characters than the
859 matching lower case type. For example, the class [^\W_] matches any
860 letter or digit, but not underscore.
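
Both examples from this paragraph can be exercised in Python, again treating the re module as a stand-in that is assumed to behave like PCRE for ASCII subjects.

```python
import re

# \d contributes the digits; the listed letters complete the upper case hex set.
hex_digit = re.compile(r"[\dABCDEF]")

# Negating \W leaves the "word" characters; adding _ to the negated set
# removes underscore as well, leaving letters and digits.
letter_or_digit = re.compile(r"[^\W_]")
```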
861
862 The only metacharacters that are recognized in character classes are
863 backslash, hyphen (only where it can be interpreted as specifying a
864 range), circumflex (only at the start), opening square bracket (only
865 when it can be interpreted as introducing a POSIX class name - see the
866 next section), and the terminating closing square bracket. However,
867 escaping other non-alphanumeric characters does no harm.
868
870
871 Perl supports the POSIX notation for character classes. This uses names
872 enclosed by [: and :] within the enclosing square brackets. PCRE also
873 supports this notation. For example,
874
875 [01[:alpha:]%]
876
877 matches "0", "1", any alphabetic character, or "%". The supported class
878 names are:
879
880 alnum    letters and digits
881 alpha    letters
882 ascii    character codes 0 - 127
883 blank    space or tab only
884 cntrl    control characters
885 digit    decimal digits (same as \d)
886 graph    printing characters, excluding space
887 lower    lower case letters
888 print    printing characters, including space
889 punct    printing characters, excluding letters and digits and space
890 space    white space (not quite the same as \s)
891 upper    upper case letters
892 word     "word" characters (same as \w)
893 xdigit   hexadecimal digits
894
895 The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
896 and space (32). Notice that this list includes the VT character (code
897 11). This makes "space" different to \s, which does not include VT (for
898 Perl compatibility).
899
900 The name "word" is a Perl extension, and "blank" is a GNU extension
901 from Perl 5.8. Another Perl extension is negation, which is indicated
902 by a ^ character after the colon. For example,
903
904 [12[:^digit:]]
905
906 matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
907 POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
908 these are not supported, and an error is given if they are encountered.
909
910 By default, in UTF-8 mode, characters with values greater than 127 do
911 not match any of the POSIX character classes. However, if the PCRE_UCP
912 option is passed to pcre_compile(), some of the classes are changed so
913 that Unicode character properties are used. This is achieved by replac‐
914 ing the POSIX classes by other sequences, as follows:
915
916 [:alnum:]  becomes  \p{Xan}
917 [:alpha:]  becomes  \p{L}
918 [:blank:]  becomes  \h
919 [:digit:]  becomes  \p{Nd}
920 [:lower:]  becomes  \p{Ll}
921 [:space:]  becomes  \p{Xps}
922 [:upper:]  becomes  \p{Lu}
923 [:word:]   becomes  \p{Xwd}
924
925 Negated versions, such as [:^alpha:] use \P instead of \p. The other
926 POSIX classes are unchanged, and match only characters with code points
927 less than 128.
928
930
931 Vertical bar characters are used to separate alternative patterns. For
932 example, the pattern
933
934 gilbert|sullivan
935
936 matches either "gilbert" or "sullivan". Any number of alternatives may
937 appear, and an empty alternative is permitted (matching the empty
938 string). The matching process tries each alternative in turn, from left
939 to right, and the first one that succeeds is used. If the alternatives
940 are within a subpattern (defined below), "succeeds" means matching the
941 rest of the main pattern as well as the alternative in the subpattern.
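
The left-to-right rule has a consequence that is easy to overlook: an earlier alternative wins even if a later one would match more text. The sketch below uses Python's re module as a stand-in, assuming it tries alternatives in the same order; the patterns other than gilbert|sullivan are made up for illustration.

```python
import re

names = re.compile(r"gilbert|sullivan")

# Alternatives are tried left to right; the first that succeeds is used,
# even when a later alternative would match more text.
first_wins = re.match(r"a|ab", "ab")

# Inside a group, "succeeds" includes the rest of the pattern: "a" alone
# cannot be followed by "c" here, so the second alternative is used.
needs_rest = re.match(r"(a|ab)c", "abc")
```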
942
944
945 The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
946 PCRE_EXTENDED options (which are Perl-compatible) can be changed from
947 within the pattern by a sequence of Perl option letters enclosed
948 between "(?" and ")". The option letters are
949
950 i for PCRE_CASELESS
951 m for PCRE_MULTILINE
952 s for PCRE_DOTALL
953 x for PCRE_EXTENDED
954
955 For example, (?im) sets caseless, multiline matching. It is also possi‐
956 ble to unset these options by preceding the letter with a hyphen, and a
957 combined setting and unsetting such as (?im-sx), which sets PCRE_CASE‐
958 LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
959 is also permitted. If a letter appears both before and after the
960 hyphen, the option is unset.
961
962 The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
963 can be changed in the same way as the Perl-compatible options by using
964 the characters J, U and X respectively.
965
966 When one of these option changes occurs at top level (that is, not
967 inside subpattern parentheses), the change applies to the remainder of
968 the pattern that follows. If the change is placed right at the start of
969 a pattern, PCRE extracts it into the global options (and it will there‐
970 fore show up in data extracted by the pcre_fullinfo() function).
971
972 An option change within a subpattern (see below for a description of
973 subpatterns) affects only that part of the subpattern that follows
974 it, so
975
976 (a(?i)b)c
977
978 matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
979 used). By this means, options can be made to have different settings
980 in different parts of the pattern. Any changes made in one alternative
981 do carry on into subsequent branches within the same subpattern. For
982 example,
983
984 (a(?i)b|c)
985
986 matches "ab", "aB", "c", and "C", even though when matching "C" the
987 first branch is abandoned before the option setting. This is because
988 the effects of option settings happen at compile time. There would be
989 some very weird behaviour otherwise.
990
991 Note: There are other PCRE-specific options that can be set by the
992 application when the compile or match functions are called. In some
993 cases the pattern can contain special leading sequences such as (*CRLF)
994 to override what the application has set or what has been defaulted.
995 Details are given in the section entitled "Newline sequences" above.
996 There are also the (*UTF8) and (*UCP) leading sequences that can be
997 used to set UTF-8 and Unicode property modes; they are equivalent to
998 setting the PCRE_UTF8 and the PCRE_UCP options, respectively.
999
1001
1002 Subpatterns are delimited by parentheses (round brackets), which can be
1003 nested. Turning part of a pattern into a subpattern does two things:
1004
1005 1. It localizes a set of alternatives. For example, the pattern
1006
1007 cat(aract|erpillar|)
1008
1009 matches one of the words "cat", "cataract", or "caterpillar". Without
1010 the parentheses, it would match "cataract", "erpillar" or an empty
1011 string.
1012
1013 2. It sets up the subpattern as a capturing subpattern. This means
1014 that, when the whole pattern matches, that portion of the subject
1015 string that matched the subpattern is passed back to the caller via the
1016 ovector argument of pcre_exec(). Opening parentheses are counted from
1017 left to right (starting from 1) to obtain numbers for the capturing
1018 subpatterns.
1019
1020 For example, if the string "the red king" is matched against the pat‐
1021 tern
1022
1023 the ((red|white) (king|queen))
1024
1025 the captured substrings are "red king", "red", and "king", and are num‐
1026 bered 1, 2, and 3, respectively.
1027
1028 The fact that plain parentheses fulfil two functions is not always
1029 helpful. There are often times when a grouping subpattern is required
1030 without a capturing requirement. If an opening parenthesis is followed
1031 by a question mark and a colon, the subpattern does not do any captur‐
1032 ing, and is not counted when computing the number of any subsequent
1033 capturing subpatterns. For example, if the string "the white queen" is
1034 matched against the pattern
1035
1036 the ((?:red|white) (king|queen))
1037
1038 the captured substrings are "white queen" and "queen", and are numbered
1039 1 and 2. The maximum number of capturing subpatterns is 65535.
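
The numbering of capturing and non-capturing parentheses in the two examples above can be confirmed with Python's re module, used as a stand-in on the assumption that it numbers groups in the same way.

```python
import re

capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
non_capturing = re.match(r"the ((?:red|white) (king|queen))", "the white queen")

# groups() lists the captured substrings in order of group number, from 1.
numbered = capturing.groups()
renumbered = non_capturing.groups()
```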
1040
1041 As a convenient shorthand, if any option settings are required at the
1042 start of a non-capturing subpattern, the option letters may appear
1043 between the "?" and the ":". Thus the two patterns
1044
1045 (?i:saturday|sunday)
1046 (?:(?i)saturday|sunday)
1047
1048 match exactly the same set of strings. Because alternative branches are
1049 tried from left to right, and options are not reset until the end of
1050 the subpattern is reached, an option setting in one branch does affect
1051 subsequent branches, so the above patterns match "SUNDAY" as well as
1052 "Saturday".
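
A brief sketch of the shorthand form in Python, whose re module is used as a stand-in and accepts (?i:...) with the same meaning. Only the first of the two patterns is shown, because recent Python versions reject an option setting that is not at the start of the whole pattern. The second pattern below is a made-up example showing that the setting ends with the group.

```python
import re

# The option applies to every branch inside the group.
weekend = re.compile(r"(?i:saturday|sunday)")

# Hypothetical example: the caseless setting stops at the closing parenthesis.
scoped = re.compile(r"(?i:sat)urday")
```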
1053
1055
1056 Perl 5.10 introduced a feature whereby each alternative in a subpattern
1057 uses the same numbers for its capturing parentheses. Such a subpattern
1058 starts with (?| and is itself a non-capturing subpattern. For example,
1059 consider this pattern:
1060
1061 (?|(Sat)ur|(Sun))day
1062
1063 Because the two alternatives are inside a (?| group, both sets of cap‐
1064 turing parentheses are numbered one. Thus, when the pattern matches,
1065 you can look at captured substring number one, whichever alternative
1066 matched. This construct is useful when you want to capture part, but
1067 not all, of one of a number of alternatives. Inside a (?| group, paren‐
1068 theses are numbered as usual, but the number is reset at the start of
1069 each branch. The numbers of any capturing buffers that follow the sub‐
1070 pattern start after the highest number used in any branch. The follow‐
1071 ing example is taken from the Perl documentation. The numbers under‐
1072 neath show in which buffer the captured content will be stored.
1073
1074 # before  ---------------branch-reset----------- after
1075 / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
1076 # 1            2         2  3        2     3     4
1077
1078 A back reference to a numbered subpattern uses the most recent value
1079 that is set for that number by any subpattern. The following pattern
1080 matches "abcabc" or "defdef":
1081
1082 /(?|(abc)|(def))\1/
1083
1084 In contrast, a recursive or "subroutine" call to a numbered subpattern
1085 always refers to the first one in the pattern with the given number.
1086 The following pattern matches "abcabc" or "defabc":
1087
1088 /(?|(abc)|(def))(?1)/
1089
1090 If a condition test for a subpattern's having matched refers to a non-
1091 unique number, the test is true if any of the subpatterns of that num‐
1092 ber have matched.
1093
1094 An alternative approach to using this "branch reset" feature is to use
1095 duplicate named subpatterns, as described in the next section.
1096
1098
1099 Identifying capturing parentheses by number is simple, but it can be
1100 very hard to keep track of the numbers in complicated regular expres‐
1101 sions. Furthermore, if an expression is modified, the numbers may
1102 change. To help with this difficulty, PCRE supports the naming of sub‐
1103 patterns. This feature was not added to Perl until release 5.10. Python
1104 had the feature earlier, and PCRE introduced it at release 4.0, using
1105 the Python syntax. PCRE now supports both the Perl and the Python syn‐
1106 tax. Perl allows identically numbered subpatterns to have different
1107 names, but PCRE does not.
1108
1109 In PCRE, a subpattern can be named in one of three ways: (?<name>...)
1110 or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
1111 to capturing parentheses from other parts of the pattern, such as back
1112 references, recursion, and conditions, can be made by name as well as
1113 by number.
1114
1115 Names consist of up to 32 alphanumeric characters and underscores.
1116 Named capturing parentheses are still allocated numbers as well as
1117 names, exactly as if the names were not present. The PCRE API provides
1118 function calls for extracting the name-to-number translation table from
1119 a compiled pattern. There is also a convenience function for extracting
1120 a captured substring by name.
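
The relationship between names and numbers can be illustrated with the Python syntax, which PCRE also accepts. The sketch uses Python's re module rather than the PCRE API: its groupindex attribute is only comparable in spirit to the name-to-number table mentioned above, and the date pattern is hypothetical.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.match("2011-01")

# Named groups keep their ordinary numbers as well as their names.
by_name = m.group("year")
by_number = m.group(1)

# Name-to-number translation table for the compiled pattern.
name_table = dict(date.groupindex)
```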
1121
1122 By default, a name must be unique within a pattern, but it is possible
1123 to relax this constraint by setting the PCRE_DUPNAMES option at compile
1124 time. (Duplicate names are also always permitted for subpatterns with
1125 the same number, set up as described in the previous section.) Dupli‐
1126 cate names can be useful for patterns where only one instance of the
1127 named parentheses can match. Suppose you want to match the name of a
1128 weekday, either as a 3-letter abbreviation or as the full name, and in
1129 both cases you want to extract the abbreviation. This pattern (ignoring
1130 the line breaks) does the job:
1131
1132 (?<DN>Mon|Fri|Sun)(?:day)?|
1133 (?<DN>Tue)(?:sday)?|
1134 (?<DN>Wed)(?:nesday)?|
1135 (?<DN>Thu)(?:rsday)?|
1136 (?<DN>Sat)(?:urday)?
1137
1138 There are five capturing substrings, but only one is ever set after a
1139 match. (An alternative way of solving this problem is to use a "branch
1140 reset" subpattern, as described in the previous section.)
1141
1142 The convenience function for extracting the data by name returns the
1143 substring for the first (and in this example, the only) subpattern of
1144 that name that matched. This saves searching to find which numbered
1145 subpattern it was.
1146
1147 If you make a back reference to a non-unique named subpattern from
1148 elsewhere in the pattern, the one that corresponds to the first occur‐
1149 rence of the name is used. In the absence of duplicate numbers (see the
1150 previous section) this is the one with the lowest number. If you use a
1151 named reference in a condition test (see the section about conditions
1152 below), either to check whether a subpattern has matched, or to check
1153 for recursion, all subpatterns with the same name are tested. If the
1154 condition is true for any one of them, the overall condition is true.
1155 This is the same behaviour as testing by number. For further details of
1156 the interfaces for handling named subpatterns, see the pcreapi documen‐
1157 tation.
1158
1159 Warning: You cannot use different names to distinguish between two sub‐
1160 patterns with the same number because PCRE uses only the numbers when
1161 matching. For this reason, an error is given at compile time if differ‐
1162 ent names are given to subpatterns with the same number. However, you
1163 can give the same name to subpatterns with the same number, even when
1164 PCRE_DUPNAMES is not set.
1165
1167
1168 Repetition is specified by quantifiers, which can follow any of the
1169 following items:
1170
1171 a literal data character
1172 the dot metacharacter
1173 the \C escape sequence
1174 the \X escape sequence (in UTF-8 mode with Unicode properties)
1175 the \R escape sequence
1176 an escape such as \d that matches a single character
1177 a character class
1178 a back reference (see next section)
1179 a parenthesized subpattern (unless it is an assertion)
1180 a recursive or "subroutine" call to a subpattern
1181
1182 The general repetition quantifier specifies a minimum and maximum num‐
1183 ber of permitted matches, by giving the two numbers in curly brackets
1184 (braces), separated by a comma. The numbers must be less than 65536,
1185 and the first must be less than or equal to the second. For example:
1186
1187 z{2,4}
1188
1189 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
1190 special character. If the second number is omitted, but the comma is
1191 present, there is no upper limit; if the second number and the comma
1192 are both omitted, the quantifier specifies an exact number of required
1193 matches. Thus
1194
1195 [aeiou]{3,}
1196
1197 matches at least 3 successive vowels, but may match many more, while
1198
1199 \d{8}
1200
1201 matches exactly 8 digits. An opening curly bracket that appears in a
1202 position where a quantifier is not allowed, or one that does not match
1203 the syntax of a quantifier, is taken as a literal character. For exam‐
1204 ple, {,6} is not a quantifier, but a literal string of four characters.
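
The three quantifier examples above can be checked as follows, with Python's re module as a stand-in. One difference is deliberately left out: Python reads {,6} as a quantifier with a lower bound of zero, unlike PCRE, so that case is not shown.

```python
import re

two_to_four = re.compile(r"z{2,4}")
three_or_more = re.compile(r"[aeiou]{3,}")
exactly_eight = re.compile(r"\d{8}")

# A closing brace on its own is an ordinary character.
lone_brace = re.fullmatch(r"a}", "a}")
```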
1205
1206 In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
1207 individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char‐
1208 acters, each of which is represented by a two-byte sequence. Similarly,
1209 when Unicode property support is available, \X{3} matches three Unicode
1210 extended sequences, each of which may be several bytes long (and they
1211 may be of different lengths).
1212
1213 The quantifier {0} is permitted, causing the expression to behave as if
1214 the previous item and the quantifier were not present. This may be use‐
1215 ful for subpatterns that are referenced as subroutines from elsewhere
1216 in the pattern. Items other than subpatterns that have a {0} quantifier
1217 are omitted from the compiled pattern.
1218
1219 For convenience, the three most common quantifiers have single-charac‐
1220 ter abbreviations:
1221
1222 * is equivalent to {0,}
1223 + is equivalent to {1,}
1224 ? is equivalent to {0,1}
1225
1226 It is possible to construct infinite loops by following a subpattern
1227 that can match no characters with a quantifier that has no upper limit,
1228 for example:
1229
1230 (a?)*
1231
1232 Earlier versions of Perl and PCRE used to give an error at compile time
1233 for such patterns. However, because there are cases where this can be
1234 useful, such patterns are now accepted, but if any repetition of the
1235 subpattern does in fact match no characters, the loop is forcibly bro‐
1236 ken.
1237
1238 By default, the quantifiers are "greedy", that is, they match as much
1239 as possible (up to the maximum number of permitted times), without
1240 causing the rest of the pattern to fail. The classic example of where
1241 this gives problems is in trying to match comments in C programs. These
1242 appear between /* and */ and within the comment, individual * and /
1243 characters may appear. An attempt to match C comments by applying the
1244 pattern
1245
1246 /\*.*\*/
1247
1248 to the string
1249
1250 /* first comment */ not comment /* second comment */
1251
1252 fails, because it matches the entire string owing to the greediness of
1253 the .* item.
1254
1255 However, if a quantifier is followed by a question mark, it ceases to
1256 be greedy, and instead matches the minimum number of times possible, so
1257 the pattern
1258
1259 /\*.*?\*/
1260
1261 does the right thing with the C comments. The meaning of the various
1262 quantifiers is not otherwise changed, just the preferred number of
1263 matches. Do not confuse this use of question mark with its use as a
1264 quantifier in its own right. Because it has two uses, it can sometimes
1265 appear doubled, as in
1266
1267 \d??\d
1268
1269 which matches one digit by preference, but can match two if that is the
1270 only way the rest of the pattern matches.
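
The difference between the greedy and minimizing forms is easiest to see by running both against the comment example. Python's re module is used as a stand-in, assuming the same greedy and lazy semantics; the pattern \d??\dx is a made-up variant that forces the second digit to be taken.

```python
import re

source = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last "*/", so the whole string is one match.
greedy = re.findall(r"/\*.*\*/", source)

# Minimizing: .*? stops at the first "*/", giving each comment separately.
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit, but takes two when the rest of the pattern needs it.
one_digit = re.match(r"\d??\d", "12")
two_digits = re.match(r"\d??\dx", "12x")
```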
1271
1272 If the PCRE_UNGREEDY option is set (an option that is not available in
1273 Perl), the quantifiers are not greedy by default, but individual ones
1274 can be made greedy by following them with a question mark. In other
1275 words, it inverts the default behaviour.
1276
1277 When a parenthesized subpattern is quantified with a minimum repeat
1278 count that is greater than 1 or with a limited maximum, more memory is
1279 required for the compiled pattern, in proportion to the size of the
1280 minimum or maximum.
1281
1282 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv‐
1283 alent to Perl's /s) is set, thus allowing the dot to match newlines,
1284 the pattern is implicitly anchored, because whatever follows will be
1285 tried against every character position in the subject string, so there
1286 is no point in retrying the overall match at any position after the
1287 first. PCRE normally treats such a pattern as though it were preceded
1288 by \A.
1289
1290 In cases where it is known that the subject string contains no new‐
1291 lines, it is worth setting PCRE_DOTALL in order to obtain this opti‐
1292 mization, or alternatively using ^ to indicate anchoring explicitly.
1293
1294 However, there is one situation where the optimization cannot be used.
1295 When .* is inside capturing parentheses that are the subject of a back
1296 reference elsewhere in the pattern, a match at the start may fail where
1297 a later one succeeds. Consider, for example:
1298
1299 (.*)abc\1
1300
1301 If the subject is "xyz123abc123" the match point is the fourth charac‐
1302 ter. For this reason, such a pattern is not implicitly anchored.
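
The match point claimed above can be verified directly. Python's re module is used as a stand-in, assuming back references behave as in PCRE; note that start() counts from zero, so the fourth character has offset 3.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# Attempts at offsets 0, 1 and 2 fail because the captured text would have
# to reappear after "abc"; the attempt at offset 3 succeeds.
match_start = m.start()
captured = m.group(1)
```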
1303
1304 When a capturing subpattern is repeated, the value captured is the sub‐
1305 string that matched the final iteration. For example, after
1306
1307 (tweedle[dume]{3}\s*)+
1308
1309 has matched "tweedledum tweedledee" the value of the captured substring
1310 is "tweedledee". However, if there are nested capturing subpatterns,
1311 the corresponding captured values may have been set in previous itera‐
1312 tions. For example, after
1313
1314 /(a|(b))+/
1315
1316 matches "aba" the value of the second captured substring is "b".
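
Both cases can be reproduced in Python, whose re module is used here as a stand-in on the assumption that it retains captured values across iterations in the same way.

```python
import re

# The outer group is overwritten on every iteration; the last one remains.
last_iteration = re.match(r"(tweedle[dume]{3}\s*)+",
                          "tweedledum tweedledee").group(1)

# The inner group was last set in the second iteration (the "b"), and the
# third iteration, which takes the "a" branch, leaves it untouched.
nested = re.match(r"(a|(b))+", "aba")
```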
1317
1319
1320 With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
1321 repetition, failure of what follows normally causes the repeated item
1322 to be re-evaluated to see if a different number of repeats allows the
1323 rest of the pattern to match. Sometimes it is useful to prevent this,
1324 either to change the nature of the match, or to cause it to fail earlier
1325 than it otherwise might, when the author of the pattern knows there is
1326 no point in carrying on.
1327
1328 Consider, for example, the pattern \d+foo when applied to the subject
1329 line
1330
1331 123456bar
1332
1333 After matching all 6 digits and then failing to match "foo", the normal
1334 action of the matcher is to try again with only 5 digits matching the
1335 \d+ item, and then with 4, and so on, before ultimately failing.
1336 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
1337 the means for specifying that once a subpattern has matched, it is not
1338 to be re-evaluated in this way.
1339
1340 If we use atomic grouping for the previous example, the matcher gives
1341 up immediately on failing to match "foo" the first time. The notation
1342 is a kind of special parenthesis, starting with (?> as in this example:
1343
1344 (?>\d+)foo
1345
1346 This kind of parenthesis "locks up" the part of the pattern it con‐
1347 tains once it has matched, and a failure further into the pattern is
1348 prevented from backtracking into it. Backtracking past it to previous
1349 items, however, works as normal.
1350
1351 An alternative description is that a subpattern of this type matches
1352 the string of characters that an identical standalone pattern would
1353 match, if anchored at the current point in the subject string.
1354
1355 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
1356 such as the above example can be thought of as a maximizing repeat that
1357 must swallow everything it can. So, while both \d+ and \d+? are pre‐
1358 pared to adjust the number of digits they match in order to make the
1359 rest of the pattern match, (?>\d+) can only match an entire sequence of
1360 digits.
1361
1362 Atomic groups in general can of course contain arbitrarily complicated
1363 subpatterns, and can be nested. However, when the subpattern for an
1364 atomic group is just a single repeated item, as in the example above, a
1365 simpler notation, called a "possessive quantifier" can be used. This
1366 consists of an additional + character following a quantifier. Using
1367 this notation, the previous example can be rewritten as
1368
1369 \d++foo
1370
1371 Note that a possessive quantifier can be used with an entire group, for
1372 example:
1373
1374 (abc|xyz){2,3}+
1375
1376 Possessive quantifiers are always greedy; the setting of the
1377 PCRE_UNGREEDY option is ignored. They are a convenient notation for the
1378 simpler forms of atomic group. However, there is no difference in the
1379 meaning of a possessive quantifier and the equivalent atomic group,
1380 though there may be a performance difference; possessive quantifiers
1381 should be slightly faster.
1382
1383 The possessive quantifier syntax is an extension to the Perl 5.8 syn‐
1384 tax. Jeffrey Friedl originated the idea (and the name) in the first
1385 edition of his book. Mike McCloskey liked it, so implemented it when he
1386 built Sun's Java package, and PCRE copied it from there. It ultimately
1387 found its way into Perl at release 5.10.
1388
1389 PCRE has an optimization that automatically "possessifies" certain sim‐
1390 ple pattern constructs. For example, the sequence A+B is treated as
1391 A++B because there is no point in backtracking into a sequence of A's
1392 when B must follow.
1393
1394 When a pattern contains an unlimited repeat inside a subpattern that
1395 can itself be repeated an unlimited number of times, the use of an
1396 atomic group is the only way to avoid some failing matches taking a
1397 very long time indeed. The pattern
1398
1399 (\D+|<\d+>)*[!?]
1400
1401 matches an unlimited number of substrings that either consist of non-
1402 digits, or digits enclosed in <>, followed by either ! or ?. When it
1403 matches, it runs quickly. However, if it is applied to
1404
1405 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1406
1407 it takes a long time before reporting failure. This is because the
1408 string can be divided between the internal \D+ repeat and the external
1409 * repeat in a large number of ways, and all have to be tried. (The
1410 example uses [!?] rather than a single character at the end, because
1411 both PCRE and Perl have an optimization that allows for fast failure
1412 when a single character is used. They remember the last single charac‐
1413 ter that is required for a match, and fail early if it is not present
1414 in the string.) If the pattern is changed so that it uses an atomic
1415 group, like this:
1416
1417 ((?>\D+)|<\d+>)*[!?]
1418
1419 sequences of non-digits cannot be broken, and failure happens quickly.
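
The phrase "a large number of ways" can be made concrete with a rough
model. This is an assumption made for illustration, not a description
of the number of steps PCRE actually performs: each way of dividing a
run of n non-digits between the inner \D+ and the outer * corresponds
to cutting the run into consecutive non-empty pieces. The Python sketch
below counts those divisions by recurrence; the function name
count_divisions() is hypothetical.

```python
def count_divisions(n):
    """Count the ways to cut a run of n characters into consecutive
    non-empty pieces (hypothetical helper, a model of the search space)."""
    # ways[k] holds the number of divisions of the first k characters.
    ways = [0] * (n + 1)
    ways[0] = 1  # the empty prefix can be divided in exactly one way
    for k in range(1, n + 1):
        # The last piece may begin after any shorter prefix of length j.
        ways[k] = sum(ways[j] for j in range(k))
    return ways[n]


for n in (4, 10, 20):
    print(n, count_divisions(n))
```

In this model the count is 2 to the power n-1, so it doubles with every
additional character in the run. With the atomic group, a run of
non-digits is only ever taken as a single piece.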
1420
1422
1423 Outside a character class, a backslash followed by a digit greater than
1424 0 (and possibly further digits) is a back reference to a capturing sub‐
1425 pattern earlier (that is, to its left) in the pattern, provided there
1426 have been that many previous capturing left parentheses.
1427
1428 However, if the decimal number following the backslash is less than 10,
1429 it is always taken as a back reference, and causes an error only if
1430 there are not that many capturing left parentheses in the entire pat‐
1431 tern. In other words, the parentheses that are referenced need not be
1432 to the left of the reference for numbers less than 10. A "forward back
1433 reference" of this type can make sense when a repetition is involved
1434 and the subpattern to the right has participated in an earlier itera‐
1435 tion.
1436
1437 It is not possible to have a numerical "forward back reference" to a
1438 subpattern whose number is 10 or more using this syntax because a
1439 sequence such as \50 is interpreted as a character defined in octal.
1440 See the subsection entitled "Non-printing characters" above for further
1441 details of the handling of digits following a backslash. There is no
1442 such problem when named parentheses are used. A back reference to any
1443 subpattern is possible using named parentheses (see below).
1444
1445 Another way of avoiding the ambiguity inherent in the use of digits
1446 following a backslash is to use the \g escape sequence, which is a fea‐
1447 ture introduced in Perl 5.10. This escape must be followed by an
1448 unsigned number or a negative number, optionally enclosed in braces.
1449 These examples are all identical:
1450
1451 (ring), \1
1452 (ring), \g1
1453 (ring), \g{1}
1454
1455 An unsigned number specifies an absolute reference without the ambigu‐
1456 ity that is present in the older syntax. It is also useful when literal
1457 digits follow the reference. A negative number is a relative reference.
1458 Consider this example:
1459
1460 (abc(def)ghi)\g{-1}
1461
1462 The sequence \g{-1} is a reference to the most recently started
1463 capturing subpattern before \g, that is, it is equivalent to \2. Similarly,
1464 \g{-2} would be equivalent to \1. The use of relative references can be
1465 helpful in long patterns, and also in patterns that are created by
1466 joining together fragments that contain references within themselves.
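
The arithmetic behind a relative reference is simple. The following
Python sketch works it through for the example above; the function
absolute_group() and its parameter names are hypothetical and are not
part of PCRE.

```python
def absolute_group(opened_before, relative):
    """Translate a negative relative group number into an absolute one.

    opened_before: number of capturing parentheses opened to the left
                   of the reference
    relative:      the negative number in the reference, for example -1
    (hypothetical helper, for illustration only)
    """
    if relative >= 0:
        raise ValueError("expected a negative relative number")
    # -1 is the most recently opened group, -2 the one before it, ...
    absolute = opened_before + relative + 1
    if absolute < 1:
        raise ValueError("no such group to the left of the reference")
    return absolute


# In (abc(def)ghi)\g{-1} two groups are opened before the reference.
print(absolute_group(2, -1), absolute_group(2, -2))
```

For the example pattern this prints "2 1", in agreement with the
equivalences to \2 and \1 given above.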
1467
1468 A back reference matches whatever actually matched the capturing sub‐
1469 pattern in the current subject string, rather than anything matching
1470 the subpattern itself (see "Subpatterns as subroutines" below for a way
1471 of doing that). So the pattern
1472
1473 (sens|respons)e and \1ibility
1474
1475 matches "sense and sensibility" and "response and responsibility", but
1476 not "sense and responsibility". If caseful matching is in force at the
1477 time of the back reference, the case of letters is relevant. For exam‐
1478 ple,
1479
1480 ((?i)rah)\s+\1
1481
1482 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
1483 original capturing subpattern is matched caselessly.
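
Both examples can be tried out with the sketch below. It uses Python's
re module as a stand-in for PCRE, which is an assumption made for
illustration. Because that module does not give a mid-pattern (?i) the
meaning it has in PCRE, the scoped form (?i:...) is used instead.

```python
import re

# Assumption: Python's re module stands in for PCRE.
SENSE = re.compile(r"(sens|respons)e and \1ibility")

# (?i:rah) makes only the group caseless; the back reference that
# follows it is matched casefully.
RAH = re.compile(r"((?i:rah))\s+\1")

sense_results = {
    s: SENSE.fullmatch(s) is not None
    for s in (
        "sense and sensibility",
        "response and responsibility",
        "sense and responsibility",
    )
}
rah_results = {
    s: RAH.fullmatch(s) is not None
    for s in ("rah rah", "RAH RAH", "RAH rah")
}

print(sense_results)
print(rah_results)
```

Only "sense and responsibility" and "RAH rah" are reported as not
matching, because in each case the back reference requires a repeat of
the text that was actually captured.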
1484
1485 There are several different ways of writing back references to named
1486 subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
1487 \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
1488 unified back reference syntax, in which \g can be used for both numeric
1489 and named references, is also supported. We could rewrite the above
1490 example in any of the following ways:
1491
1492 (?<p1>(?i)rah)\s+\k<p1>
1493 (?'p1'(?i)rah)\s+\k{p1}
1494 (?P<p1>(?i)rah)\s+(?P=p1)
1495 (?<p1>(?i)rah)\s+\g{p1}
1496
1497 A subpattern that is referenced by name may appear in the pattern
1498 before or after the reference.
1499
1500 There may be more than one back reference to the same subpattern. If a
1501 subpattern has not actually been used in a particular match, any back
1502 references to it always fail by default. For example, the pattern
1503
1504 (a|(bc))\2
1505
1506 always fails if it starts to match "a" rather than "bc". However, if
1507 the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer‐
1508 ence to an unset value matches an empty string.
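
The default behaviour can be demonstrated with the following sketch,
which uses Python's re module as a stand-in for PCRE (an assumption
made for illustration). That module also treats a back reference to an
unset group as a failure.

```python
import re

# Assumption: Python's re module stands in for PCRE without the
# PCRE_JAVASCRIPT_COMPAT option.
UNSET = re.compile(r"(a|(bc))\2")

# The first alternative is taken, group 2 is never set, so \2 fails.
took_a = UNSET.fullmatch("a")

# The second alternative sets group 2 to "bc", so \2 matches "bc".
took_bc = UNSET.fullmatch("bcbc")

print(took_a is None, took_bc is not None)
```

Both values printed are True: the subject "a" is rejected, while "bcbc"
is accepted.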
1509
1510 Because there may be many capturing parentheses in a pattern, all dig‐
1511 its following a backslash are taken as part of a potential back refer‐
1512 ence number. If the pattern continues with a digit character, some
1513 delimiter must be used to terminate the back reference. If the
1514 PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{
1515 syntax or an empty comment (see "Comments" below) can be used.
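
The two delimiters that do not depend on \g can be tried with the
sketch below. It uses Python's re module as a stand-in for PCRE (an
assumption made for illustration), with re.VERBOSE playing the part of
PCRE_EXTENDED. Each pattern means "a back reference to group 1,
followed by the literal digit 1".

```python
import re

# Assumption: Python's re module stands in for PCRE.
# An empty comment ends the back reference before the literal 1.
WITH_COMMENT = re.compile(r"(a)\1(?#)1")

# In extended mode, white space serves the same purpose.
WITH_SPACE = re.compile(r"(a)\1 1", re.VERBOSE)

comment_ok = WITH_COMMENT.fullmatch("aa1") is not None
space_ok = WITH_SPACE.fullmatch("aa1") is not None

print(comment_ok, space_ok)
```

Without a delimiter, the digits would be read together as the single
number 11.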
1516
1517 Recursive back references
1518
1519 A back reference that occurs inside the parentheses to which it refers
1520 fails when the subpattern is first used, so, for example, (a\1) never
1521 matches. However, such references can be useful inside repeated sub‐
1522 patterns. For example, the pattern
1523
1524 (a|b\1)+
1525
1526 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter‐
1527 ation of the subpattern, the back reference matches the character
1528 string corresponding to the previous iteration. In order for this to
1529 work, the pattern must be such that the first iteration does not need
1530 to match the back reference. This can be done using alternation, as in
1531 the example above, or by a quantifier with a minimum of zero.
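
The iteration rule can be followed by hand. Python's re module rejects
a back reference to a group that is still open, so the sketch below is
a hand-written simulation of the pattern (a|b\1)+ rather than a call to
a regular expression engine. The function name match_length() is
hypothetical.

```python
def match_length(subject):
    """Simulate (a|b\\1)+ at the start of subject.

    Returns the length of the match, or None if there is no match
    (hypothetical helper, for illustration only).
    """
    pos = 0
    previous = None  # text matched by the previous iteration of the group
    while pos < len(subject):
        if subject[pos] == "a":
            # First alternative: a single "a".
            current = "a"
        elif previous is not None and subject.startswith("b" + previous, pos):
            # Second alternative: "b" followed by whatever the previous
            # iteration matched. It is unavailable on the first iteration.
            current = "b" + previous
        else:
            break
        pos += len(current)
        previous = current
    return pos if previous is not None else None


for subject in ("aaaa", "aba", "ababbaa", "b"):
    print(subject, match_length(subject))
```

For "ababbaa" the successive iterations match "a", "ba", "bba" and "a",
which accounts for all seven characters. The subject "b" does not match
at all, because the first iteration has no previous text to refer to.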
1532
1533 Back references of this type cause the group that they reference to be
1534 treated as an atomic group. Once the whole group has been matched, a
1535 subsequent matching failure cannot cause backtracking into the middle
1536 of the group.
1537
1539
1540 An assertion is a test on the characters following or preceding the
1541 current matching point that does not actually consume any characters.
1542 The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
1543 described above.
1544
1545 More complicated assertions are coded as subpatterns. There are two
1546 kinds: those that look ahead of the current position in the subject
1547 string, and those that look behind it. An assertion subpattern is
1548 matched in the normal way, except that it does not cause the current
1549 matching position to be changed.
1550
1551 Assertion subpatterns are not capturing subpatterns, and may not be
1552 repeated, because it makes no sense to assert the same thing several
1553 times. If any kind of assertion contains capturing subpatterns within
1554 it, these are counted for the purposes of numbering the capturing sub‐
1555 patterns in the whole pattern. However, substring capturing is carried
1556 out only for positive assertions, because it does not make sense for
1557 negative assertions.
1558
1559 Lookahead assertions
1560
1561 Lookahead assertions start with (?= for positive assertions and (?! for
1562 negative assertions. For example,
1563
1564 \w+(?=;)
1565
1566 matches a word followed by a semicolon, but does not include the semi‐
1567 colon in the match, and
1568
1569 foo(?!bar)
1570
1571 matches any occurrence of "foo" that is not followed by "bar". Note
1572 that the apparently similar pattern
1573
1574 (?!foo)bar
1575
1576 does not find an occurrence of "bar" that is preceded by something
1577 other than "foo"; it finds any occurrence of "bar" whatsoever, because
1578 the assertion (?!foo) is always true when the next three characters are
1579 "bar". A lookbehind assertion is needed to achieve the other effect.
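
These cases can be checked with the sketch below, which uses Python's
re module as a stand-in for PCRE (an assumption made for illustration).
The helper first_match() is hypothetical; it reports the offset and
text of the first match, or None. The last line uses the lookbehind
form (?<!foo)bar for comparison.

```python
import re


def first_match(pattern, subject):
    """Return (offset, text) of the first match, or None (hypothetical helper)."""
    m = re.search(pattern, subject)
    return None if m is None else (m.start(), m.group())


# A word followed by a semicolon; the semicolon is not part of the match.
word = first_match(r"\w+(?=;)", "alpha beta; gamma")

# The first "foo" is followed by "bar", so the second one is found.
not_followed = first_match(r"foo(?!bar)", "foobar foobaz")

# (?!foo) is tested where "bar" starts, so it always succeeds there.
misleading = first_match(r"(?!foo)bar", "foobar")

# The lookbehind form does look at the text before "bar".
lookbehind = first_match(r"(?<!foo)bar", "foobar")

print(word, not_followed, misleading, lookbehind)
```

The third result is (3, 'bar') even though "bar" is preceded by "foo",
whereas the lookbehind pattern in the fourth line finds no match.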
1580
1581 If you want to force a matching failure at some point in a pattern, the
1582 most convenient way to do it is with (?!) because an empty string
1583 always matches, so an assertion that requires there not to be an empty
1584 string must always fail. The Perl 5.10 backtracking control verb
1585 (*FAIL) or (*F) is essentially a synonym for (?!).
1586
1587 Lookbehind assertions
1588
1589 Lookbehind assertions start with (?<= for positive assertions and (?<!
1590 for negative assertions. For example,
1591
1592 (?<!foo)bar
1593
1594 does find an occurrence of "bar" that is not preceded by "foo". The
1595 contents of a lookbehind assertion are restricted such that all the
1596 strings it matches must have a fixed length. However, if there are sev‐
1597 eral top-level alternatives, they do not all have to have the same
1598 fixed length. Thus
1599
1600 (?<=bullock|donkey)
1601
1602 is permitted, but
1603
1604 (?<!dogs?|cats?)
1605
1606 causes an error at compile time. Branches that match different length
1607 strings are permitted only at the top level of a lookbehind assertion.
1608 This is an extension compared with Perl (5.8 and 5.10), which requires
1609 all branches to match the same length of string. An assertion such as
1610
1611 (?<=ab(c|de))
1612
1613 is not permitted, because its single top-level branch can match two
1614 different lengths, but it is acceptable to PCRE if rewritten to use two
1615 top-level branches:
1616
1617 (?<=abc|abde)
1618
1619 In some cases, the Perl 5.10 escape sequence \K (see above) can be used
1620 instead of a lookbehind assertion to get round the fixed-length
1621 restriction.
1622
1623 The implementation of lookbehind assertions is, for each alternative,
1624 to temporarily move the current position back by the fixed length and
1625 then try to match. If there are insufficient characters before the cur‐
1626 rent position, the assertion fails.
1627
1628 PCRE does not allow the \C escape (which matches a single byte in UTF-8
1629 mode) to appear in lookbehind assertions, because it makes it impossi‐
1630 ble to calculate the length of the lookbehind. The \X and \R escapes,
1631 which can match different numbers of bytes, are also not permitted.
1632
1633 "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
1634 lookbehinds, as long as the subpattern matches a fixed-length string.
1635 Recursion, however, is not supported.
1636
1637 Possessive quantifiers can be used in conjunction with lookbehind
1638 assertions to specify efficient matching of fixed-length strings at the
1639 end of subject strings. Consider a simple pattern such as
1640
1641 abcd$
1642
1643 when applied to a long string that does not match. Because matching
1644 proceeds from left to right, PCRE will look for each "a" in the subject
1645 and then see if what follows matches the rest of the pattern. If the
1646 pattern is specified as
1647
1648 ^.*abcd$
1649
1650 the initial .* matches the entire string at first, but when this fails
1651 (because there is no following "a"), it backtracks to match all but the
1652 last character, then all but the last two characters, and so on. Once
1653 again the search for "a" covers the entire string, from right to left,
1654 so we are no better off. However, if the pattern is written as
1655
1656 ^.*+(?<=abcd)
1657
1658 there can be no backtracking for the .*+ item; it can match only the
1659 entire string. The subsequent lookbehind assertion does a single test
1660 on the last four characters. If it fails, the match fails immediately.
1661 For long strings, this approach makes a significant difference to the
1662 processing time.
1663
1664 Using multiple assertions
1665
1666 Several assertions (of any sort) may occur in succession. For example,
1667
1668 (?<=\d{3})(?<!999)foo
1669
1670 matches "foo" preceded by three digits that are not "999". Notice that
1671 each of the assertions is applied independently at the same point in
1672 the subject string. First there is a check that the previous three
1673 characters are all digits, and then there is a check that the same
1674 three characters are not "999". This pattern does not match "foo"
1675 preceded by six characters, the first three of which are digits and the
1676 last three of which are not "999". For example, it does not match
1677 "123abcfoo". A pattern to do that is
1678
1679 (?<=\d{3}...)(?<!999)foo
1680
1681 This time the first assertion looks at the preceding six characters,
1682 checking that the first three are digits, and then the second assertion
1683 checks that the preceding three characters are not "999".
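
The two patterns can be compared side by side with the sketch below. It
uses Python's re module as a stand-in for PCRE (an assumption made for
illustration); both engines accept these fixed-length lookbehinds.

```python
import re

# Assumption: Python's re module stands in for PCRE.
THREE_BEFORE = re.compile(r"(?<=\d{3})(?<!999)foo")
SIX_BEFORE = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ("123foo", "999foo", "123abcfoo", "123999foo")

# Both assertions in each pattern are applied at the point where
# "foo" starts.
three = {s: THREE_BEFORE.search(s) is not None for s in subjects}
six = {s: SIX_BEFORE.search(s) is not None for s in subjects}

print(three)
print(six)
```

The first pattern accepts only "123foo" and the second only
"123abcfoo". Both reject "123999foo", because the three characters
immediately before "foo" are "999".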
1684
1685 Assertions can be nested in any combination. For example,
1686
1687 (?<=(?<!foo)bar)baz
1688
1689 matches an occurrence of "baz" that is preceded by "bar" which in turn
1690 is not preceded by "foo", while
1691
1692 (?<=\d{3}(?!999)...)foo
1693
1694 is another pattern that matches "foo" preceded by three digits and any
1695 three characters that are not "999".
1696
1698
1699 It is possible to cause the matching process to obey a subpattern con‐
1700 ditionally or to choose between two alternative subpatterns, depending
1701 on the result of an assertion, or whether a specific capturing subpat‐
1702 tern has already been matched. The two possible forms of conditional
1703 subpattern are:
1704
1705 (?(condition)yes-pattern)
1706 (?(condition)yes-pattern|no-pattern)
1707
1708 If the condition is satisfied, the yes-pattern is used; otherwise the
1709 no-pattern (if present) is used. If there are more than two alterna‐
1710 tives in the subpattern, a compile-time error occurs.
1711
1712 There are four kinds of condition: references to subpatterns, refer‐
1713 ences to recursion, a pseudo-condition called DEFINE, and assertions.
1714
1715 Checking for a used subpattern by number
1716
1717 If the text between the parentheses consists of a sequence of digits,
1718 the condition is true if a capturing subpattern of that number has pre‐
1719 viously matched. If there is more than one capturing subpattern with
1720 the same number (see the earlier section about duplicate subpattern
1721 numbers), the condition is true if any of them have been set. An alter‐
1722 native notation is to precede the digits with a plus or minus sign. In
1723 this case, the subpattern number is relative rather than absolute. The
1724 most recently opened parentheses can be referenced by (?(-1), the next
1725 most recent by (?(-2), and so on. In looping constructs it can also
1726 make sense to refer to subsequent groups with constructs such as
1727 (?(+2).
1728
1729 Consider the following pattern, which contains non-significant white
1730 space to make it more readable (assume the PCRE_EXTENDED option) and to
1731 divide it into three parts for ease of discussion:
1732
1733 ( \( )? [^()]+ (?(1) \) )
1734
1735 The first part matches an optional opening parenthesis, and if that
1736 character is present, sets it as the first captured substring. The sec‐
1737 ond part matches one or more characters that are not parentheses. The
1738 third part is a conditional subpattern that tests whether the first set
1739 of parentheses matched or not. If they did, that is, if the subject started
1740 with an opening parenthesis, the condition is true, and so the yes-pat‐
1741 tern is executed and a closing parenthesis is required. Otherwise,
1742 since no-pattern is not present, the subpattern matches nothing. In
1743 other words, this pattern matches a sequence of non-parentheses,
1744 optionally enclosed in parentheses.
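
The pattern can be exercised with the sketch below, which uses Python's
re module as a stand-in for PCRE (an assumption made for illustration).
The re.VERBOSE flag plays the part of PCRE_EXTENDED.

```python
import re

# Assumption: Python's re module stands in for PCRE.
OPTIONAL_PARENS = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

# fullmatch() requires the whole subject to be matched.
results = {
    s: OPTIONAL_PARENS.fullmatch(s) is not None
    for s in ("abc", "(abc)", "(abc", "abc)")
}

print(results)
```

Only "abc" and "(abc)" are accepted. In "(abc" the condition is true,
so a closing parenthesis is required but absent; in "abc)" the
condition is false, so the closing parenthesis is left unmatched.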
1745
1746 If you were embedding this pattern in a larger one, you could use a
1747 relative reference:
1748
1749 ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
1750
1751 This makes the fragment independent of the parentheses in the larger
1752 pattern.
1753
1754 Checking for a used subpattern by name
1755
1756 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
1757 used subpattern by name. For compatibility with earlier versions of
1758 PCRE, which had this facility before Perl, the syntax (?(name)...) is
1759 also recognized. However, there is a possible ambiguity with this syn‐
1760 tax, because subpattern names may consist entirely of digits. PCRE
1761 looks first for a named subpattern; if it cannot find one and the name
1762 consists entirely of digits, PCRE looks for a subpattern of that num‐
1763 ber, which must be greater than zero. Using subpattern names that con‐
1764 sist entirely of digits is not recommended.
1765
1766 Rewriting the above example to use a named subpattern gives this:
1767
1768 (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
1769
1770 If the name used in a condition of this kind is a duplicate, the test
1771 is applied to all subpatterns of the same name, and is true if any one
1772 of them has matched.
1773
1774 Checking for pattern recursion
1775
1776 If the condition is the string (R), and there is no subpattern with the
1777 name R, the condition is true if a recursive call to the whole pattern
1778 or any subpattern has been made. If digits or a name preceded by amper‐
1779 sand follow the letter R, for example:
1780
1781 (?(R3)...) or (?(R&name)...)
1782
1783 the condition is true if the most recent recursion is into a subpattern
1784 whose number or name is given. This condition does not check the entire
1785 recursion stack. If the name used in a condition of this kind is a
1786 duplicate, the test is applied to all subpatterns of the same name, and
1787 is true if any one of them is the most recent recursion.
1788
1789 At "top level", all these recursion test conditions are false. The
1790 syntax for recursive patterns is described below.
1791
1792 Defining subpatterns for use by reference only
1793
1794 If the condition is the string (DEFINE), and there is no subpattern
1795 with the name DEFINE, the condition is always false. In this case,
1796 there may be only one alternative in the subpattern. It is always
1797 skipped if control reaches this point in the pattern; the idea of
1798 DEFINE is that it can be used to define "subroutines" that can be ref‐
1799 erenced from elsewhere. (The use of "subroutines" is described below.)
1800 For example, a pattern to match an IPv4 address could be written like
1801 this (ignore whitespace and line breaks):
1802
1803 (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
1804 \b (?&byte) (\.(?&byte)){3} \b
1805
1806 The first part of the pattern is a DEFINE group inside which another
1807 group named "byte" is defined. This matches an individual component of
1808 an IPv4 address (a number less than 256). When matching takes place,
1809 this part of the pattern is skipped because DEFINE acts like a false
1810 condition. The rest of the pattern uses references to the named group
1811 to match the four dot-separated components of an IPv4 address, insist‐
1812 ing on a word boundary at each end.
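
Python's re module has neither DEFINE nor subroutine calls, so the
sketch below approximates the same idea by other means: the text of the
"byte" subpattern is held in a string and spliced into the pattern
wherever (?&byte) appears above. This is a substitute technique, not
the PCRE mechanism, and the names BYTE and IPV4 are hypothetical. The
repeated group is made non-capturing for simplicity.

```python
import re

# One component of an IPv4 address, copied from the DEFINE group.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# The same text is reused for all four components.
IPV4 = re.compile(r"\b" + BYTE + r"(?:\." + BYTE + r"){3}\b")

results = {
    s: IPV4.search(s) is not None
    for s in ("192.168.0.1", "10.0.0.255", "256.1.1.1", "1.2.3")
}

print(results)
```

The first two subjects are accepted and the last two are rejected: 256
is not a number less than 256, and "1.2.3" has only three components.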
1813
1814 Assertion conditions
1815
1816 If the condition is not in any of the above formats, it must be an
1817 assertion. This may be a positive or negative lookahead or lookbehind
1818 assertion. Consider this pattern, again containing non-significant
1819 white space, and with the two alternatives on the second line:
1820
1821 (?(?=[^a-z]*[a-z])
1822 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1823
1824 The condition is a positive lookahead assertion that matches an
1825 optional sequence of non-letters followed by a letter. In other words,
1826 it tests for the presence of at least one letter in the subject. If a
1827 letter is found, the subject is matched against the first alternative;
1828 otherwise it is matched against the second. This pattern matches
1829 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1830 letters and dd are digits.
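
Python's re module accepts only a group number or name as a condition,
so the sketch below emulates the assertion condition instead: the
yes-pattern is guarded by the positive lookahead and the no-pattern by
the corresponding negative lookahead. This is a substitute technique
shown for illustration; the names HAS_LETTER, NO_LETTER and DATE are
hypothetical.

```python
import re

HAS_LETTER = r"(?=[^a-z]*[a-z])"   # the condition
NO_LETTER = r"(?![^a-z]*[a-z])"    # its negation

# (?(condition)yes|no) written as (?:(?=condition)yes|(?!condition)no)
DATE = re.compile(
    "(?:" + HAS_LETTER + r"\d{2}-[a-z]{3}-\d{2}"
    + "|" + NO_LETTER + r"\d{2}-\d{2}-\d{2}" + ")"
)

results = {
    s: DATE.fullmatch(s) is not None
    for s in ("12-abc-34", "12-34-56", "12-ab-34", "12-345-67")
}

print(results)
```

The first two subjects have the forms dd-aaa-dd and dd-dd-dd and are
accepted; the other two fit neither form and are rejected.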
1831
1833
1834 The sequence (?# marks the start of a comment that continues up to the
1835 next closing parenthesis. Nested parentheses are not permitted. The
1836 characters that make up a comment play no part in the pattern matching
1837 at all.
1838
1839 If the PCRE_EXTENDED option is set, an unescaped # character outside a
1840 character class introduces a comment that continues to immediately
1841 after the next newline in the pattern.
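
Both kinds of comment can be tried with the sketch below, which uses
Python's re module as a stand-in for PCRE (an assumption made for
illustration). The re.VERBOSE flag plays the part of PCRE_EXTENDED.

```python
import re

# Assumption: Python's re module stands in for PCRE.
# A (?#...) comment is recognized with or without extended mode.
INLINE = re.compile(r"ab(?# this text is ignored)c")

# In extended mode a # outside a character class starts a comment
# that runs to the end of the line; inside a class it is a literal.
EXTENDED = re.compile(
    r"""
    [#]+    # one or more hash characters
    \d      # followed by a single digit
    """,
    re.VERBOSE,
)

inline_ok = INLINE.fullmatch("abc") is not None
extended_ok = EXTENDED.fullmatch("##7") is not None

print(inline_ok, extended_ok)
```

Both results are True: the comments take no part in the matching, and
the # inside the character class is treated as an ordinary character.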
1842
1844
1845 Consider the problem of matching a string in parentheses, allowing for
1846 unlimited nested parentheses. Without the use of recursion, the best
1847 that can be done is to use a pattern that matches up to some fixed
1848 depth of nesting. It is not possible to handle an arbitrary nesting
1849 depth.
1850
1851 For some time, Perl has provided a facility that allows regular expres‐
1852 sions to recurse (amongst other things). It does this by interpolating
1853 Perl code in the expression at run time, and the code can refer to the
1854 expression itself. A Perl pattern using code interpolation to solve the
1855 parentheses problem can be created like this:
1856
1857 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
1858
1859 The (?p{...}) item interpolates Perl code at run time, and in this case
1860 refers recursively to the pattern in which it appears.
1861
1862 Obviously, PCRE cannot support the interpolation of Perl code. Instead,
1863 it supports special syntax for recursion of the entire pattern, and
1864 also for individual subpattern recursion. After its introduction in
1865 PCRE and Python, this kind of recursion was subsequently introduced
1866 into Perl at release 5.10.
1867
1868 A special item that consists of (? followed by a number greater than
1869 zero and a closing parenthesis is a recursive call of the subpattern of
1870 the given number, provided that it occurs inside that subpattern. (If
1871 not, it is a "subroutine" call, which is described in the next sec‐
1872 tion.) The special item (?R) or (?0) is a recursive call of the entire
1873 regular expression.
1874
1875 This PCRE pattern solves the nested parentheses problem (assume the
1876 PCRE_EXTENDED option is set so that white space is ignored):
1877
1878 \( ( [^()]++ | (?R) )* \)
1879
1880 First it matches an opening parenthesis. Then it matches any number of
1881 substrings which can either be a sequence of non-parentheses, or a
1882 recursive match of the pattern itself (that is, a correctly parenthe‐
1883 sized substring). Finally there is a closing parenthesis. Note the use
1884 of a possessive quantifier to avoid backtracking into sequences of non-
1885 parentheses.
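
Python's re module has no recursion syntax, so the sketch below is a
hand-written function that follows the same three steps as the pattern,
not a call to a regular expression engine. It is intended only to show
what the recursive call does; the function name match_parens() is
hypothetical.

```python
def match_parens(subject, pos=0):
    """Follow the steps of  \\( ( [^()]++ | (?R) )* \\)  by hand.

    Returns the offset just past the matching closing parenthesis, or
    None if there is no match at pos (hypothetical helper).
    """
    # First, an opening parenthesis is required.
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1
    # Then any number of substrings of one of two kinds.
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            # Finally, the closing parenthesis.
            return pos + 1
        if ch == "(":
            # The (?R) alternative: a nested parenthesized substring.
            end = match_parens(subject, pos)
            if end is None:
                return None
            pos = end
        else:
            # The [^()]++ alternative: part of a run of non-parentheses.
            pos += 1
    return None  # the subject ended before the closing parenthesis


for subject in ("(ab(cd)ef)", "()", "(a(b)", "abc"):
    print(subject, match_parens(subject))
```

For "(ab(cd)ef)" the function returns 10, the length of the whole
subject. For "(a(b)" it returns None, because the outer parenthesis is
never closed.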
1886
1887 If this were part of a larger pattern, you would not want to recurse
1888 the entire pattern, so instead you could use this:
1889
1890 ( \( ( [^()]++ | (?1) )* \) )
1891
1892 We have put the pattern into parentheses, and caused the recursion to
1893 refer to them instead of the whole pattern.
1894
1895 In a larger pattern, keeping track of parenthesis numbers can be
1896 tricky. This is made easier by the use of relative references (a Perl
1897 5.10 feature). Instead of (?1) in the pattern above you can write
1898 (?-2) to refer to the second most recently opened parentheses preceding
1899 the recursion. In other words, a negative number counts capturing
1900 parentheses leftwards from the point at which it is encountered.
1901
1902 It is also possible to refer to subsequently opened parentheses, by
1903 writing references such as (?+2). However, these cannot be recursive
1904 because the reference is not inside the parentheses that are refer‐
1905 enced. They are always "subroutine" calls, as described in the next
1906 section.
1907
1908 An alternative approach is to use named parentheses instead. The Perl
1909 syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
1910 supported. We could rewrite the above example as follows:
1911
1912 (?<pn> \( ( [^()]++ | (?&pn) )* \) )
1913
1914 If there is more than one subpattern with the same name, the earliest
1915 one is used.
1916
1917 This particular example pattern that we have been looking at contains
1918 nested unlimited repeats, and so the use of a possessive quantifier for
1919 matching strings of non-parentheses is important when applying the pat‐
1920 tern to strings that do not match. For example, when this pattern is
1921 applied to
1922
1923 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1924
1925 it yields "no match" quickly. However, if a possessive quantifier is
1926 not used, the match runs for a very long time indeed because there are
1927 so many different ways the + and * repeats can carve up the subject,
1928 and all have to be tested before failure can be reported.
1929
1930 At the end of a match, the values of capturing parentheses are those
1931 from the outermost level. If you want to obtain intermediate values, a
1932 callout function can be used (see below and the pcrecallout documenta‐
1933 tion). If the pattern above is matched against
1934
1935 (ab(cd)ef)
1936
1937 the value for the inner capturing parentheses (numbered 2) is "ef",
1938 which is the last value taken on at the top level. If a capturing sub‐
1939 pattern is not matched at the top level, its final value is unset, even
1940 if it is (temporarily) set at a deeper level.
1941
1942 If there are more than 15 capturing parentheses in a pattern, PCRE has
1943 to obtain extra memory to store data during a recursion, which it does
1944 by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
1945 can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
1946
1947 Do not confuse the (?R) item with the condition (R), which tests for
1948 recursion. Consider this pattern, which matches text in angle brack‐
1949 ets, allowing for arbitrary nesting. Only digits are allowed in nested
1950 brackets (that is, when recursing), whereas any characters are permit‐
1951 ted at the outer level.
1952
1953 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
1954
1955 In this pattern, (?(R) is the start of a conditional subpattern, with
1956 two different alternatives for the recursive and non-recursive cases.
1957 The (?R) item is the actual recursive call.
1958
1959 Recursion difference from Perl
1960
1961 In PCRE (like Python, but unlike Perl), a recursive subpattern call is
1962 always treated as an atomic group. That is, once it has matched some of
1963 the subject string, it is never re-entered, even if it contains untried
1964 alternatives and there is a subsequent matching failure. This can be
1965 illustrated by the following pattern, which purports to match a palin‐
1966 dromic string that contains an odd number of characters (for example,
1967 "a", "aba", "abcba", "abcdcba"):
1968
1969 ^(.|(.)(?1)\2)$
1970
1971 The idea is that it either matches a single character, or two identical
1972 characters surrounding a sub-palindrome. In Perl, this pattern works;
1973 in PCRE it does not if the subject string is longer than three characters.
1974 Consider the subject string "abcba":
1975
1976 At the top level, the first character is matched, but as it is not at
1977 the end of the string, the first alternative fails; the second alterna‐
1978 tive is taken and the recursion kicks in. The recursive call to subpat‐
1979 tern 1 successfully matches the next character ("b"). (Note that the
1980 beginning and end of line tests are not part of the recursion.)
1981
1982 Back at the top level, the next character ("c") is compared with what
1983 subpattern 2 matched, which was "a". This fails. Because the recursion
1984 is treated as an atomic group, there are now no backtracking points,
1985 and so the entire match fails. (Perl is able, at this point, to re-
1986 enter the recursion and try the second alternative.) However, if the
1987 pattern is written with the alternatives in the other order, things are
1988 different:
1989
1990 ^((.)(?1)\2|.)$
1991
1992 This time, the recursing alternative is tried first, and continues to
1993 recurse until it runs out of characters, at which point the recursion
1994 fails. But this time we do have another alternative to try at the
1995 higher level. That is the big difference: in the previous case the
1996 remaining alternative is at a deeper recursion level, which PCRE cannot
1997 use.
1998
1999 To change the pattern so that it matches all palindromic strings, not just
2000 those with an odd number of characters, it is tempting to change the
2001 pattern to this:
2002
2003 ^((.)(?1)\2|.?)$
2004
2005 Again, this works in Perl, but not in PCRE, and for the same reason.
2006 When a deeper recursion has matched a single character, it cannot be
2007 entered again in order to match an empty string. The solution is to
2008 separate the two cases, and write out the odd and even cases as alter‐
2009 natives at the higher level:
2010
2011 ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
2012
2013 If you want to match typical palindromic phrases, the pattern has to
2014 ignore all non-word characters, which can be done like this:
2015
2016 ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
2017
2018 If run with the PCRE_CASELESS option, this pattern matches phrases such
2019 as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
2020 Perl. Note the use of the possessive quantifier *+ to avoid backtrack‐
2021 ing into sequences of non-word characters. Without this, PCRE takes a
2022 great deal longer (ten times or more) to match typical phrases, and
2023 Perl takes so long that you think it has gone into a loop.
2024
2025 WARNING: The palindrome-matching patterns above work only if the sub‐
2026 ject string does not start with a palindrome that is shorter than the
2027 entire string. For example, although "abcba" is correctly matched, if
2028 the subject is "ababa", PCRE finds the palindrome "aba" at the start,
2029 then fails at top level because the end of the string does not follow.
2030 Once again, it cannot jump back into the recursion to try other alter‐
2031 natives, so the entire match fails.
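
As a cross-check for the patterns above, the following Python sketch
tests a palindromic phrase directly instead of using a recursive
pattern. It is not part of PCRE, and the function name is illustrative.
Case folding stands in for PCRE_CASELESS. Because nothing is matched
from the start of the string, subjects such as "ababa" that the WARNING
describes are accepted too.

```python
import re

def is_palindromic_phrase(text):
    """Return True if the word characters of text read the same both ways."""
    # Keep only word characters, as the pattern above skips \W sequences,
    # and fold case, which stands in for the PCRE_CASELESS option.
    chars = re.findall(r"\w", text.casefold())
    return chars == chars[::-1]
```

A check of this kind is useful mainly for testing a palindrome pattern
against a known-good answer; it says nothing about how PCRE itself
processes the recursion.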
2032
2034
2035 If the syntax for a recursive subpattern reference (either by number or
2036 by name) is used outside the parentheses to which it refers, it oper‐
2037 ates like a subroutine in a programming language. The "called" subpat‐
2038 tern may be defined before or after the reference. A numbered reference
2039 can be absolute or relative, as in these examples:
2040
2041 (...(absolute)...)...(?2)...
2042 (...(relative)...)...(?-1)...
2043 (...(?+1)...(relative)...
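
The arithmetic behind relative references can be sketched in Python as
follows. The function name and its parameters are illustrative and not
part of PCRE; "opened" is assumed to be the number of capturing
parentheses that have been opened to the left of the reference.

```python
def resolve_relative(opened, offset):
    """Map a relative reference such as (?-1) or (?+1) to a group number.

    opened: capturing groups opened to the left of the reference (assumed).
    offset: the signed number in the reference; zero is not relative.
    """
    if offset == 0:
        raise ValueError("a relative reference needs a signed, non-zero number")
    # (?-n) counts back from the most recently opened group;
    # (?+n) counts forward to groups that have not been opened yet.
    target = opened + offset + 1 if offset < 0 else opened + offset
    if target < 1:
        raise ValueError("reference points before the first group")
    return target
```

In the second example above two groups have been opened when (?-1) is
reached, so resolve_relative(2, -1) gives group 2. In the third example
only one group has been opened when (?+1) is reached, so
resolve_relative(1, 1) also gives group 2.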
2044
2045 An earlier example pointed out that the pattern
2046
2047 (sens|respons)e and \1ibility
2048
2049 matches "sense and sensibility" and "response and responsibility", but
2050 not "sense and responsibility". If instead the pattern
2051
2052 (sens|respons)e and (?1)ibility
2053
2054 is used, it does match "sense and responsibility" as well as the other
2055 two strings. Another example is given in the discussion of DEFINE
2056 above.
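
The difference between the two patterns can be approximated in Python.
This is only a sketch: Python's re module has no (?1) syntax, so the
subroutine call is emulated by reusing the text of the subpattern. The
names STEM, with_backref, with_reuse and matches are illustrative, and
textual reuse does not reproduce the atomic treatment of real
subroutine calls.

```python
import re

# Text of the subpattern that (?1) would call; non-capturing so that the
# reuse does not add groups.
STEM = r"(?:sens|respons)"

# \1 must repeat the exact text that group 1 matched.
with_backref = re.compile(r"(sens|respons)e and \1ibility")

# Re-running the subpattern, as (?1) does, lets the second part match
# either alternative independently of the first.
with_reuse = re.compile(rf"({STEM})e and {STEM}ibility")

def matches(pattern, subject):
    """Return True if pattern matches the whole of subject."""
    return pattern.fullmatch(subject) is not None
```

With these definitions, matches(with_backref, "sense and
responsibility") is False, whereas matches(with_reuse, "sense and
responsibility") is True.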
2057
2058 Like recursive subpatterns, a subroutine call is always treated as an
2059 atomic group. That is, once it has matched some of the subject string,
2060 it is never re-entered, even if it contains untried alternatives and
2061 there is a subsequent matching failure. Any capturing parentheses that
2062 are set during the subroutine call revert to their previous values
2063 afterwards.
2064
2065 When a subpattern is used as a subroutine, processing options such as
2066 case-independence are fixed when the subpattern is defined. They cannot
2067 be changed for different calls. For example, consider this pattern:
2068
2069 (abc)(?i:(?-1))
2070
2071 It matches "abcabc". It does not match "abcABC" because the change of
2072 processing option does not affect the called subpattern.
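
The effect can be approximated in Python with scoped inline flags,
assuming a reasonably recent Python 3. This is a sketch, not PCRE:
Python's re module has no subroutine calls, so the called subpattern is
written out as a copy of its text, and the names fixed_options and
textual_copy are illustrative.

```python
import re

# Approximation of (abc)(?i:(?-1)): the called subpattern keeps the
# case-sensitive setting in force where it was defined, written here as
# an explicit (?-i:...) around a copy of its text.
fixed_options = re.compile(r"(abc)(?i:(?-i:abc))")

# For contrast, a plain textual copy inherits the caseless setting, which
# is NOT the behaviour described above for PCRE.
textual_copy = re.compile(r"(abc)(?i:abc)")
```

Here fixed_options matches "abcabc" but not "abcABC", in line with the
description above, while textual_copy matches both.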
2073
2075
2076 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
2077 name or a number enclosed in either angle brackets or single quotes is
2078 an alternative syntax for referencing a subpattern as a subroutine,
2079 possibly recursively. Here are two of the examples used above, rewrit‐
2080 ten using this syntax:
2081
2082 (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
2083 (sens|respons)e and \g'1'ibility
2084
2085 PCRE supports an extension to Oniguruma: if a number is preceded by a
2086 plus or a minus sign it is taken as a relative reference. For example:
2087
2088 (abc)(?i:\g<-1>)
2089
2090 Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
2091 synonymous. The former is a back reference; the latter is a subroutine
2092 call.
2093
2095
2096 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
2097 Perl code to be obeyed in the middle of matching a regular expression.
2098 This makes it possible, amongst other things, to extract different sub‐
2099 strings that match the same pair of parentheses when there is a repeti‐
2100 tion.
2101
2102 PCRE provides a similar feature, but of course it cannot obey arbitrary
2103 Perl code. The feature is called "callout". The caller of PCRE provides
2104 an external function by putting its entry point in the global variable
2105 pcre_callout. By default, this variable contains NULL, which disables
2106 all calling out.
2107
2108 Within a regular expression, (?C) indicates the points at which the
2109 external function is to be called. If you want to identify different
2110 callout points, you can put a number less than 256 after the letter C.
2111 The default value is zero. For example, this pattern has two callout
2112 points:
2113
2114 (?C1)abc(?C2)def
2115
2116 If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
2117 automatically installed before each item in the pattern. They are all
2118 numbered 255.
2119
2120 During matching, when PCRE reaches a callout point (and pcre_callout is
2121 set), the external function is called. It is provided with the number
2122 of the callout, the position in the pattern, and, optionally, one item
2123 of data originally supplied by the caller of pcre_exec(). The callout
2124 function may cause matching to proceed, to backtrack, or to fail alto‐
2125 gether. A complete description of the interface to the callout function
2126 is given in the pcrecallout documentation.
2127
2129
2130 Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
2131 which are described in the Perl documentation as "experimental and sub‐
2132 ject to change or removal in a future version of Perl". It goes on to
2133 say: "Their usage in production code should be noted to avoid problems
2134 during upgrades." The same remarks apply to the PCRE features described
2135 in this section.
2136
2137 Since these verbs are specifically related to backtracking, most of
2138 them can be used only when the pattern is to be matched using
2139 pcre_exec(), which uses a backtracking algorithm. With the exception of
2140 (*FAIL), which behaves like a failing negative assertion, they cause an
2141 error if encountered by pcre_dfa_exec().
2142
2143 If any of these verbs are used in an assertion or subroutine subpattern
2144 (including recursive subpatterns), their effect is confined to that
2145 subpattern; it does not extend to the surrounding pattern. Note that
2146 such subpatterns are processed as anchored at the point where they are
2147 tested.
2148
2149 The new verbs make use of what was previously invalid syntax: an open‐
2150 ing parenthesis followed by an asterisk. They are generally of the form
2151 (*VERB) or (*VERB:NAME). Some may take either form, with differing
2152 behaviour, depending on whether or not an argument is present. A name is
2153 a sequence of letters, digits, and underscores. If the name is empty,
2154 that is, if the closing parenthesis immediately follows the colon, the
2155 effect is as if the colon were not there. Any number of these verbs may
2156 occur in a pattern.
2157
2158 PCRE contains some optimizations that are used to speed up matching by
2159 running some checks at the start of each match attempt. For example, it
2160 may know the minimum length of matching subject, or that a particular
2161 character must be present. When one of these optimizations suppresses
2162 the running of a match, any included backtracking verbs will not, of
2163 course, be processed. You can suppress the start-of-match optimizations
2164 by setting the PCRE_NO_START_OPTIMIZE option when calling pcre_exec().
2165
2166 Verbs that act immediately
2167
2168 The following verbs act as soon as they are encountered. They may not
2169 be followed by a name.
2170
2171 (*ACCEPT)
2172
2173 This verb causes the match to end successfully, skipping the remainder
2174 of the pattern. When inside a recursion, only the innermost pattern is
2175 ended immediately. If (*ACCEPT) is inside capturing parentheses, the
2176 data so far is captured. (This feature was added to PCRE at release
2177 8.00.) For example:
2178
2179 A((?:A|B(*ACCEPT)|C)D)
2180
2181 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap‐
2182 tured by the outer parentheses.
2183
2184 (*FAIL) or (*F)
2185
2186 This verb causes the match to fail, forcing backtracking to occur. It
2187 is equivalent to (?!) but easier to read. The Perl documentation notes
2188 that it is probably useful only when combined with (?{}) or (??{}).
2189 Those are, of course, Perl features that are not present in PCRE. The
2190 nearest equivalent is the callout feature, as for example in this pat‐
2191 tern:
2192
2193 a+(?C)(*FAIL)
2194
2195 A match with the string "aaaa" always fails, but the callout is taken
2196 before each backtrack happens (in this example, 10 times).
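
The count of 10 can be reproduced with a small Python simulation. This
is a hand-written model of this one pattern, not a call into PCRE; it
assumes the pattern is unanchored and that the callout function always
lets matching proceed. The function name is illustrative.

```python
def count_callouts(subject):
    """Count how often (?C) is reached for a+(?C)(*FAIL), unanchored."""
    calls = 0
    for start in range(len(subject)):  # bumpalong over start positions
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ first takes the whole run, then gives back one "a" at a
        # time; each of the run lengths reaches (?C) once before (*FAIL)
        # forces the next backtrack.
        calls += run
    return calls
```

For "aaaa" the four starting positions contribute 4, 3, 2 and 1
callouts respectively, which is where the total of 10 comes from.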
2197
2198 Recording which path was taken
2199
2200 There is one verb whose main purpose is to track how a match was
2201 arrived at, though it also has a secondary use in conjunction with
2202 advancing the match starting point (see (*SKIP) below).
2203
2204 (*MARK:NAME) or (*:NAME)
2205
2206 A name is always required with this verb. There may be as many
2207 instances of (*MARK) as you like in a pattern, and their names do not
2208 have to be unique.
2209
2210 When a match succeeds, the name of the last-encountered (*MARK) is
2211 passed back to the caller via the pcre_extra data structure, as
2212 described in the section on pcre_extra in the pcreapi documentation. No
2213 data is returned for a partial match. Here is an example of pcretest
2214 output, where the /K modifier requests the retrieval and outputting of
2215 (*MARK) data:
2216
2217 /X(*MARK:A)Y|X(*MARK:B)Z/K
2218 XY
2219 0: XY
2220 MK: A
2221 XZ
2222 0: XZ
2223 MK: B
2224
2225 The (*MARK) name is tagged with "MK:" in this output, and in this exam‐
2226 ple it indicates which of the two alternatives matched. This is a more
2227 efficient way of obtaining this information than putting each alterna‐
2228 tive in its own capturing parentheses.
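
For comparison, the capturing-parentheses approach mentioned above can
be sketched in Python, whose re module has no (*MARK). The group names
A and B mirror the mark names in the pcretest example; the function
name is illustrative.

```python
import re

# Each alternative gets its own named capturing group; the group names
# play the role of the (*MARK) names A and B.
pattern = re.compile(r"(?P<A>XY)|(?P<B>XZ)")

def which_alternative(subject):
    """Return the name of the alternative that matched, or None."""
    m = pattern.fullmatch(subject)
    return m.lastgroup if m else None
```

This reports "A" for XY and "B" for XZ, at the cost of one capturing
group per alternative.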
2229
2230 A name may also be returned after a failed match if the final path
2231 through the pattern involves (*MARK). However, unless (*MARK) is used in
2232 conjunction with (*COMMIT), this is unlikely to happen for an unan‐
2233 chored pattern because, as the starting point for matching is advanced,
2234 the final check is often with an empty string, causing a failure before
2235 (*MARK) is reached. For example:
2236
2237 /X(*MARK:A)Y|X(*MARK:B)Z/K
2238 XP
2239 No match
2240
2241 There are three potential starting points for this match (starting with
2242 X, starting with P, and with an empty string). If the pattern is
2243 anchored, the result is different:
2244
2245 /^X(*MARK:A)Y|^X(*MARK:B)Z/K
2246 XP
2247 No match, mark = B
2248
2249 PCRE's start-of-match optimizations can also interfere with this. For
2250 example, if, as a result of a call to pcre_study(), it knows the mini‐
2251 mum subject length for a match, a shorter subject will not be scanned
2252 at all.
2253
2254 Note that similar anomalies (though different in detail) exist in Perl,
2255 no doubt for the same reasons. The use of (*MARK) data after a failed
2256 match of an unanchored pattern is not recommended, unless (*COMMIT) is
2257 involved.
2258
2259 Verbs that act after backtracking
2260
2261 The following verbs do nothing when they are encountered. Matching con‐
2262 tinues with what follows, but if there is no subsequent match, causing
2263 a backtrack to the verb, a failure is forced. That is, backtracking
2264 cannot pass to the left of the verb. However, when one of these verbs
2265 appears inside an atomic group, its effect is confined to that group,
2266 because once the group has been matched, there is never any backtrack‐
2267 ing into it. In this situation, backtracking can "jump back" to the
2268 left of the entire atomic group. (Remember also, as stated above, that
2269 this localization applies in subroutine calls and assertions.)
2270
2271 These verbs differ in exactly what kind of failure occurs when back‐
2272 tracking reaches them.
2273
2274 (*COMMIT)
2275
2276 This verb, which may not be followed by a name, causes the whole match
2277 to fail outright if the rest of the pattern does not match. Even if the
2278 pattern is unanchored, no further attempts to find a match by advancing
2279 the starting point take place. Once (*COMMIT) has been passed,
2280 pcre_exec() is committed to finding a match at the current starting
2281 point, or not at all. For example:
2282
2283 a+(*COMMIT)b
2284
2285 This matches "xxaab" but not "aacaab". It can be thought of as a kind
2286 of dynamic anchor, or "I've started, so I must finish." The name of the
2287 most recently passed (*MARK) in the path is passed back when (*COMMIT)
2288 forces a match failure.
2289
2290 Note that (*COMMIT) at the start of a pattern is not the same as an
2291 anchor, unless PCRE's start-of-match optimizations are turned off, as
2292 shown in this pcretest example:
2293
2294 /(*COMMIT)abc/
2295 xyzabc
2296 0: abc
2297 xyzabc\Y
2298 No match
2299
2300 PCRE knows that any match must start with "a", so the optimization
2301 skips along the subject to "a" before running the first match attempt,
2302 which succeeds. When the optimization is disabled by the \Y escape in
2303 the second subject, the match starts at "x" and so the (*COMMIT) causes
2304 it to fail without trying any other starting points.
2305
2306 (*PRUNE) or (*PRUNE:NAME)
2307
2308 This verb causes the match to fail at the current starting position in
2309 the subject if the rest of the pattern does not match. If the pattern
2310 is unanchored, the normal "bumpalong" advance to the next starting
2311 character then happens. Backtracking can occur as usual to the left of
2312 (*PRUNE), before it is reached, or when matching to the right of
2313 (*PRUNE), but if there is no match to the right, backtracking cannot
2314 cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter‐
2315 native to an atomic group or possessive quantifier, but there are some
2316 uses of (*PRUNE) that cannot be expressed in any other way. The behav‐
2317 iour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE) when the
2318 match fails completely; the name is passed back if this is the final
2319 attempt. (*PRUNE:NAME) does not pass back a name if the match suc‐
2320 ceeds. In an anchored pattern (*PRUNE) has the same effect as (*COM‐
2321 MIT).
2322
2323 (*SKIP)
2324
2325 This verb, when given without a name, is like (*PRUNE), except that if
2326 the pattern is unanchored, the "bumpalong" advance is not to the next
2327 character, but to the position in the subject where (*SKIP) was encoun‐
2328 tered. (*SKIP) signifies that whatever text was matched leading up to
2329 it cannot be part of a successful match. Consider:
2330
2331 a+(*SKIP)b
2332
2333 If the subject is "aaaac...", after the first match attempt fails
2334 (starting at the first character in the string), the starting point
2335 skips on to start the next attempt at "c". Note that a possessive
2336 quantifier does not have the same effect as this example; although it would
2337 suppress backtracking during the first match attempt, the second
2338 attempt would start at the second character instead of skipping on to
2339 "c".
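
The different "bumpalong" behaviour of (*COMMIT), (*PRUNE) and (*SKIP)
can be compared with a small Python simulation. This is a hand-written
model of patterns of the shape a+(*VERB)b only, not a general engine
and not a call into PCRE. It assumes that start-of-match optimizations
are off and that the end of the subject is tried as a final starting
point. The function name and return shape are illustrative.

```python
def search_a_plus_verb_b(subject, verb):
    """Simulate an unanchored search for a+(*VERB)b.

    verb is "COMMIT", "PRUNE" or "SKIP". Returns (span, tried), where
    span is (start, end) of the match or None, and tried lists the
    starting positions that were attempted.
    """
    if verb not in ("COMMIT", "PRUNE", "SKIP"):
        raise ValueError("unsupported verb: " + verb)
    tried = []
    start = 0
    while start <= len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end > start:  # a+ matched, so the verb has been passed
            if subject[end:end + 1] == "b":
                return (start, end + 1), tried
            # "b" failed and backtracking cannot cross the verb.
            if verb == "COMMIT":
                return None, tried  # no further starting points
            if verb == "SKIP":
                start = end  # resume where (*SKIP) was encountered
                continue
        start += 1  # normal bumpalong, also used by (*PRUNE)
    return None, tried
```

For the subject "aaaac", the SKIP variant attempts starting positions
0, 4 and 5, whereas the PRUNE variant attempts every position from 0 to
5, and the COMMIT variant attempts only position 0.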
2340
2341 (*SKIP:NAME)
2342
2343 When (*SKIP) has an associated name, its behaviour is modified. If the
2344 following pattern fails to match, the previous path through the pattern
2345 is searched for the most recent (*MARK) that has the same name. If one
2346 is found, the "bumpalong" advance is to the subject position that cor‐
2347 responds to that (*MARK) instead of to where (*SKIP) was encountered.
2348 If no (*MARK) with a matching name is found, normal "bumpalong" of one
2349 character happens (the (*SKIP) is ignored).
2350
2351 (*THEN) or (*THEN:NAME)
2352
2353 This verb causes a skip to the next alternative if the rest of the
2354 pattern does not match. That is, it cancels pending backtracking, but only
2355 within the current alternation. Its name comes from the observation
2356 that it can be used for a pattern-based if-then-else block:
2357
2358 ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
2359
2360 If the COND1 pattern matches, FOO is tried (and possibly further items
2361 after the end of the group if FOO succeeds); on failure the matcher
2362 skips to the second alternative and tries COND2, without backtracking
2363 into COND1. The behaviour of (*THEN:NAME) is exactly the same as
2364 (*MARK:NAME)(*THEN) if the overall match fails. If (*THEN) is not
2365 directly inside an alternation, it acts like (*PRUNE).
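
The if-then-else reading can be sketched in Python. This emulation is
an assumption-laden model, not PCRE: each alternative is supplied as a
separate (condition, body) pair of pattern strings, the function name
is illustrative, and anything that would follow the group in a real
pattern is ignored.

```python
import re

def match_with_then(alternatives, subject, pos=0):
    """Emulate ( COND1 (*THEN) BODY1 | COND2 (*THEN) BODY2 | ... ) at pos.

    alternatives: list of (cond, body) pattern strings (illustrative).
    Returns (index of the alternative that matched, end offset) or None.
    """
    for index, (cond, body) in enumerate(alternatives):
        cond_match = re.compile(cond).match(subject, pos)
        if cond_match is None:
            continue  # the condition failed before reaching (*THEN)
        body_match = re.compile(body).match(subject, cond_match.end())
        if body_match is not None:
            return index, body_match.end()
        # The body failed after (*THEN): do not retry the condition with
        # a different match; move straight on to the next alternative.
    return None
```

With the alternatives [("a+", "ab"), ("a", "a+b")] and the subject
"aaab", the first condition takes "aaa" and its body then fails, so the
second alternative is the one that matches. An ordinary alternation
such as (a+ab)|(aa+b) would instead backtrack into a+ and succeed with
the first alternative.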
2366
2368
2369 pcreapi(3), pcrecallout(3), pcrematching(3), pcresyntax(3), pcre(3).
2370
2372
2373 Philip Hazel
2374 University Computing Service
2375 Cambridge CB2 3QH, England.
2376
2378
2379 Last updated: 18 May 2010
2380 Copyright (c) 1997-2010 University of Cambridge.