pcrepattern(3)

PCREPATTERN(3)             Library Functions Manual             PCREPATTERN(3)

NAME

       PCRE - Perl-compatible regular expressions

PCRE REGULAR EXPRESSION DETAILS

       The  syntax and semantics of the regular expressions that are supported
       by PCRE are described in detail below. There is a quick-reference  syn‐
       tax summary in the pcresyntax page. PCRE tries to match Perl syntax and
       semantics as closely as it can. PCRE  also  supports  some  alternative
       regular  expression  syntax (which does not conflict with the Perl syn‐
       tax) in order to provide some compatibility with regular expressions in
       Python, .NET, and Oniguruma.

       Perl's  regular expressions are described in its own documentation, and
       regular expressions in general are covered in a number of  books,  some
       of  which  have  copious  examples. Jeffrey Friedl's "Mastering Regular
       Expressions", published by  O'Reilly,  covers  regular  expressions  in
       great  detail.  This  description  of  PCRE's  regular  expressions  is
       intended as reference material.

       The original operation of PCRE was on strings of  one-byte  characters.
       However,  there is now also support for UTF-8 character strings. To use
       this, PCRE must be built to include UTF-8 support, and  you  must  call
       pcre_compile()  or  pcre_compile2() with the PCRE_UTF8 option. There is
       also a special sequence that can be given at the start of a pattern:

         (*UTF8)

       Starting a pattern with this sequence  is  equivalent  to  setting  the
       PCRE_UTF8  option.  This  feature  is  not Perl-compatible. How setting
       UTF-8 mode affects pattern matching  is  mentioned  in  several  places
       below.  There  is  also  a  summary of UTF-8 features in the section on
       UTF-8 support in the main pcre page.

       Another special sequence that may appear at the start of a  pattern  or
       in combination with (*UTF8) is:

         (*UCP)

       This  has  the  same  effect  as setting the PCRE_UCP option: it causes
       sequences such as \d and \w to  use  Unicode  properties  to  determine
       character types, instead of recognizing only characters with codes less
       than 128 via a lookup table.

       The remainder of this document discusses the  patterns  that  are  sup‐
       ported  by  PCRE when its main matching function, pcre_exec(), is used.
       From  release  6.0,   PCRE   offers   a   second   matching   function,
       pcre_dfa_exec(),  which matches using a different algorithm that is not
       Perl-compatible. Some of the features discussed below are not available
       when  pcre_dfa_exec()  is used. The advantages and disadvantages of the
       alternative function, and how it differs from the normal function,  are
       discussed in the pcrematching page.

NEWLINE CONVENTIONS

       PCRE  supports five different conventions for indicating line breaks in
       strings: a single CR (carriage return) character, a  single  LF  (line‐
       feed) character, the two-character sequence CRLF, any of the three pre‐
       ceding, or any Unicode newline sequence. The pcreapi page  has  further
       discussion  about newlines, and shows how to set the newline convention
       in the options arguments for the compiling and matching functions.

       It is also possible to specify a newline convention by starting a  pat‐
       tern string with one of the following five sequences:

         (*CR)        carriage return
         (*LF)        linefeed
         (*CRLF)      carriage return, followed by linefeed
         (*ANYCRLF)   any of the three above
         (*ANY)       all Unicode newline sequences

       These  override  the default and the options given to pcre_compile() or
       pcre_compile2(). For example, on a Unix system where LF is the  default
       newline sequence, the pattern

         (*CR)a.b

       changes the convention to CR. That pattern matches "a\nb" because LF is
       no longer a newline. Note that these special settings,  which  are  not
       Perl-compatible,  are  recognized  only at the very start of a pattern,
       and that they must be in upper case.  If  more  than  one  of  them  is
       present, the last one is used.
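
       The rule for these leading settings (start of pattern only, upper case
       only, last one wins) can be sketched in Python. This is an illustration
       of the rule as stated above, not PCRE's own parser. The function name
       leading_newline_setting and its return shape are invented for the
       example, and other start-of-pattern items such as (*UTF8) are not
       handled.

```python
import re

# Only these five upper-case settings are recognized by this sketch.
_SETTING = re.compile(r"\(\*(CRLF|CR|LF|ANYCRLF|ANY)\)")

def leading_newline_setting(pattern, default="LF"):
    """Return (convention, rest_of_pattern) for a pattern string.

    Hypothetical helper: consumes newline settings found at the very
    start of the pattern and keeps the last one seen.
    """
    convention = default
    pos = 0
    while True:
        m = _SETTING.match(pattern, pos)
        if m is None:
            break
        convention = m.group(1)   # a later setting replaces an earlier one
        pos = m.end()
    return convention, pattern[pos:]

print(leading_newline_setting("(*CR)a.b"))
print(leading_newline_setting("(*LF)(*CRLF)x"))
```

       A setting that is not at the very start, or that is written in lower
       case, leaves the default untouched in this sketch.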

       The  newline convention affects the interpretation of the dot metachar‐
       acter when PCRE_DOTALL is not set, and also the behaviour of  \N.  How‐
       ever,  it  does  not  affect  what  the  \R escape sequence matches. By
       default, this is any Unicode newline sequence, for Perl  compatibility.
       However,  this can be changed; see the description of \R in the section
       entitled "Newline sequences" below. A change of \R setting can be  com‐
       bined with a change of newline convention.

CHARACTERS AND METACHARACTERS

       A  regular  expression  is  a pattern that is matched against a subject
       string from left to right. Most characters stand for  themselves  in  a
       pattern,  and  match  the corresponding characters in the subject. As a
       trivial example, the pattern

         The quick brown fox

       matches a portion of a subject string that is identical to itself. When
       caseless  matching is specified (the PCRE_CASELESS option), letters are
       matched independently of case. In UTF-8 mode, PCRE  always  understands
       the  concept  of case for characters whose values are less than 128, so
       caseless matching is always possible. For characters with  higher  val‐
       ues,  the concept of case is supported if PCRE is compiled with Unicode
       property support, but not otherwise.   If  you  want  to  use  caseless
       matching  for  characters  128  and above, you must ensure that PCRE is
       compiled with Unicode property support as well as with UTF-8 support.

       The power of regular expressions comes  from  the  ability  to  include
       alternatives  and  repetitions in the pattern. These are encoded in the
       pattern by the use of metacharacters, which do not stand for themselves
       but instead are interpreted in some special way.

       There  are  two different sets of metacharacters: those that are recog‐
       nized anywhere in the pattern except within square brackets, and  those
       that  are  recognized  within square brackets. Outside square brackets,
       the metacharacters are as follows:

         \      general escape character with several uses
         ^      assert start of string (or line, in multiline mode)
         $      assert end of string (or line, in multiline mode)
         .      match any character except newline (by default)
         [      start character class definition
         |      start of alternative branch
         (      start subpattern
         )      end subpattern
         ?      extends the meaning of (
                also 0 or 1 quantifier
                also quantifier minimizer
         *      0 or more quantifier
         +      1 or more quantifier
                also "possessive quantifier"
         {      start min/max quantifier

       Part of a pattern that is in square brackets  is  called  a  "character
       class". In a character class the only metacharacters are:

         \      general escape character
         ^      negate the class, but only if the first character
         -      indicates character range
         [      POSIX character class (only if followed by POSIX
                  syntax)
         ]      terminates the character class

       The following sections describe the use of each of the metacharacters.

BACKSLASH

       The backslash character has several uses. Firstly, if it is followed by
       a non-alphanumeric character, it takes away any  special  meaning  that
       character  may  have.  This  use  of  backslash  as an escape character
       applies both inside and outside character classes.

       For example, if you want to match a * character, you write  \*  in  the
       pattern.   This  escaping  action  applies whether or not the following
       character would otherwise be interpreted as a metacharacter, so  it  is
       always  safe  to  precede  a non-alphanumeric with backslash to specify
       that it stands for itself. In particular, if you want to match a  back‐
       slash, you write \\.
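
       As a rough illustration, the two escapes just described can be tried
       with Python's re module, used here only as a stand-in on the assumption
       that it treats \* and \\ the same way as PCRE. It is not PCRE, and
       other escapes on this page behave differently there.

```python
import re

# A backslash before a non-alphanumeric character makes it literal.
star = re.compile(r"a\*b")        # matches the three characters: a * b
backslash = re.compile(r"c\\d")   # \\ in the pattern matches one backslash

literal_star = star.fullmatch("a*b") is not None
star_as_quantifier = star.fullmatch("aaab") is not None
literal_backslash = backslash.fullmatch("c\\d") is not None

print(literal_star, star_as_quantifier, literal_backslash)
```

       The second result is false because the escaped * no longer acts as a
       quantifier.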

       If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
       the pattern (other than in a character class) and characters between  a
       # outside a character class and the next newline are ignored. An escap‐
       ing backslash can be used to include a whitespace  or  #  character  as
       part of the pattern.

       If  you  want  to remove the special meaning from a sequence of charac‐
       ters, you can do so by putting them between \Q and \E. This is  differ‐
       ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E
       sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola‐
       tion. Note the following examples:

         Pattern            PCRE matches   Perl matches

         \Qabc$xyz\E        abc$xyz        abc followed by the
                                             contents of $xyz
         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

       The  \Q...\E  sequence  is recognized both inside and outside character
       classes.
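
       The "PCRE matches" column of the table above can be reproduced with a
       small Python sketch that rewrites \Q...\E spans into individually
       escaped characters. Python's re module has no \Q...\E, so the helper
       expand_quoted is a made-up illustration, not how PCRE implements the
       feature. It also naively treats any \Q as the start of a quoted span,
       even one preceded by an escaping backslash.

```python
import re

def expand_quoted(pattern):
    """Hypothetical helper: replace each \\Q...\\E span by an escaped literal."""
    out = []
    i = 0
    while i < len(pattern):
        start = pattern.find(r"\Q", i)
        if start == -1:
            out.append(pattern[i:])          # no more quoted spans
            break
        out.append(pattern[i:start])         # text before \Q is kept as-is
        end = pattern.find(r"\E", start + 2)
        if end == -1:
            # Assumption: an unterminated \Q quotes the rest of the pattern.
            out.append(re.escape(pattern[start + 2:]))
            break
        out.append(re.escape(pattern[start + 2:end]))
        i = end + 2
    return "".join(out)

for pat in (r"\Qabc$xyz\E", r"\Qabc\$xyz\E", r"\Qabc\E\$\Qxyz\E"):
    print(pat, "->", expand_quoted(pat))
```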

   Non-printing characters

       A second use of backslash provides a way of encoding non-printing char‐
       acters  in patterns in a visible manner. There is no restriction on the
       appearance of non-printing characters, apart from the binary zero  that
       terminates  a  pattern,  but  when  a pattern is being prepared by text
       editing, it is  often  easier  to  use  one  of  the  following  escape
       sequences than the binary character it represents:

         \a        alarm, that is, the BEL character (hex 07)
         \cx       "control-x", where x is any character
         \e        escape (hex 1B)
         \f        formfeed (hex 0C)
         \n        linefeed (hex 0A)
         \r        carriage return (hex 0D)
         \t        tab (hex 09)
         \ddd      character with octal code ddd, or back reference
         \xhh      character with hex code hh
         \x{hhh..} character with hex code hhh..

       The  precise  effect of \cx is as follows: if x is a lower case letter,
       it is converted to upper case. Then bit 6 of the character (hex 40)  is
       inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
       becomes hex 7B.
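
       The \cx arithmetic is small enough to check directly. The following
       Python sketch applies the two steps described above; the function name
       control_escape is made up for the example.

```python
def control_escape(x):
    """Return the character code that \\cx stands for (illustrative helper)."""
    if "a" <= x <= "z":
        x = x.upper()          # step 1: lower case letters become upper case
    return ord(x) ^ 0x40       # step 2: invert bit 6 (hex 40)

for ch in "z{;":
    print("\\c" + ch, "->", format(control_escape(ch), "02X"))
```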

       After \x, from zero to two hexadecimal digits are read (letters can  be
       in  upper  or  lower case). Any number of hexadecimal digits may appear
       between \x{ and }, but the value of the character  code  must  be  less
       than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,
       the maximum value in hexadecimal is 7FFFFFFF. Note that this is  bigger
       than the largest Unicode code point, which is 10FFFF.

       If  characters  other than hexadecimal digits appear between \x{ and },
       or if there is no terminating }, this form of escape is not recognized.
       Instead,  the  initial  \x  will  be interpreted as a basic hexadecimal
       escape, with no following digits, giving a  character  whose  value  is
       zero.

       Characters whose value is less than 256 can be defined by either of the
       two syntaxes for \x. There is no difference in the way  they  are  han‐
       dled. For example, \xdc is exactly the same as \x{dc}.
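
       The reading rules for \x can be summarized in a short Python sketch.
       It is a model of the description above, not PCRE source code: the name
       parse_hex_escape is hypothetical, the mode-dependent upper limits are
       not checked, and treating an empty \x{} as "not recognized" is an
       assumption, since that case is not covered above.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")

def parse_hex_escape(rest):
    # 'rest' is the pattern text immediately after the two characters \x.
    # Returns (character_code, characters_consumed_from_rest).
    if rest.startswith("{"):
        close = rest.find("}")
        body = rest[1:close] if close != -1 else ""
        if body and all(c in HEX_DIGITS for c in body):
            return int(body, 16), close + 1
        # Otherwise the braced form is not recognized; fall through to
        # the basic form, which will find no digits at "{".
    n = 0
    while n < 2 and n < len(rest) and rest[n] in HEX_DIGITS:
        n += 1                 # basic form: zero, one or two hex digits
    return (int(rest[:n], 16) if n else 0), n

print(parse_hex_escape("dc"), parse_hex_escape("{dc}"), parse_hex_escape("{zz}"))
```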

       After  \0  up  to two further octal digits are read. If there are fewer
       than two digits, just  those  that  are  present  are  used.  Thus  the
       sequence \0\x\07 specifies two binary zeros followed by a BEL character
       (code value 7). Make sure you supply two digits after the initial  zero
       if the pattern character that follows is itself an octal digit.

       The handling of a backslash followed by a digit other than 0 is compli‐
       cated.  Outside a character class, PCRE reads it and any following dig‐
       its  as  a  decimal  number. If the number is less than 10, or if there
       have been at least that many previous capturing left parentheses in the
       expression,  the  entire  sequence  is  taken  as  a  back reference. A
       description of how this works is given later, following the  discussion
       of parenthesized subpatterns.

       Inside  a  character  class, or if the decimal number is greater than 9
       and there have not been that many capturing subpatterns, PCRE  re-reads
       up to three octal digits following the backslash, and uses them to gen‐
       erate a data character. Any subsequent digits stand for themselves.  In
       non-UTF-8  mode,  the  value  of a character specified in octal must be
       less than \400. In UTF-8 mode, values up to  \777  are  permitted.  For
       example:

         \040   is another way of writing a space
         \40    is the same, provided there are fewer than 40
                   previous capturing subpatterns
         \7     is always a back reference
         \11    might be a back reference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a back reference, otherwise the
                   character with octal code 113
         \377   might be a back reference, otherwise
                   the byte consisting entirely of 1 bits
         \81    is either a back reference, or a binary zero
                   followed by the two characters "8" and "1"

       Note  that  octal  values of 100 or greater must not be introduced by a
       leading zero, because no more than three octal digits are ever read.
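
       The decision between a back reference and an octal character can be
       modelled in a few lines of Python. This is a sketch of the rules as
       described above, not PCRE's implementation. The function name
       interpret_digits and its result tuples are invented for the example,
       and the \400 limit for non-UTF-8 mode is not enforced.

```python
OCTAL = "01234567"

def interpret_digits(digits, groups_so_far, in_class=False):
    # 'digits' is the run of decimal digits that follows the backslash.
    # Returns ("backref", number, "") or ("char", code, literal_rest).
    if digits[0] == "0":
        n = 1
        while n < 3 and n < len(digits) and digits[n] in OCTAL:
            n += 1             # \0 plus up to two further octal digits
        return ("char", int(digits[:n], 8), digits[n:])
    if not in_class:
        number = int(digits)   # read all the digits as a decimal number
        if number < 10 or number <= groups_so_far:
            return ("backref", number, "")
    n = 0
    while n < 3 and n < len(digits) and digits[n] in OCTAL:
        n += 1                 # re-read up to three octal digits
    code = int(digits[:n], 8) if n else 0
    return ("char", code, digits[n:])

print(interpret_digits("40", 5), interpret_digits("40", 40))
print(interpret_digits("0113", 0), interpret_digits("81", 0))
```

       The second argument is the number of capturing left parentheses seen
       so far, which is what makes sequences such as \40 context-dependent.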

       All the sequences that define a single character value can be used both
       inside  and  outside character classes. In addition, inside a character
       class, the sequence \b is interpreted as the backspace  character  (hex
       08).  The sequences \B, \N, \R, and \X are not special inside a charac‐
       ter class. Like any  other  unrecognized  escape  sequences,  they  are
       treated  as  the  literal characters "B", "N", "R", and "X" by default,
       but cause an error if the PCRE_EXTRA option is set. Outside a character
       class, these sequences have different meanings.

   Absolute and relative back references

       The  sequence  \g followed by an unsigned or a negative number, option‐
       ally enclosed in braces, is an absolute or relative back  reference.  A
       named back reference can be coded as \g{name}. Back references are dis‐
       cussed later, following the discussion of parenthesized subpatterns.

   Absolute and relative subroutine calls

       For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
       name or a number enclosed either in angle brackets or single quotes, is
       an alternative syntax for referencing a subpattern as  a  "subroutine".
       Details  are  discussed  later.   Note  that  \g{...} (Perl syntax) and
       \g<...> (Oniguruma syntax) are not synonymous. The  former  is  a  back
       reference; the latter is a subroutine call.

   Generic character types

       Another use of backslash is for specifying generic character types:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \h     any horizontal whitespace character
         \H     any character that is not a horizontal whitespace character
         \s     any whitespace character
         \S     any character that is not a whitespace character
         \v     any vertical whitespace character
         \V     any character that is not a vertical whitespace character
         \w     any "word" character
         \W     any "non-word" character

       There is also the single sequence \N, which matches a non-newline char‐
       acter.  This is the same as the "." metacharacter when  PCRE_DOTALL  is
       not set.

       Each  pair of lower and upper case escape sequences partitions the com‐
       plete set of characters into two disjoint  sets.  Any  given  character
       matches  one, and only one, of each pair. The sequences can appear both
       inside and outside character classes. They each match one character  of
       the  appropriate  type.  If the current matching point is at the end of
       the subject string, all of them fail, because there is no character  to
       match.

       For  compatibility  with Perl, \s does not match the VT character (code
       11).  This makes it different from the POSIX "space" class. The  \s
       characters  are  HT  (9), LF (10), FF (12), CR (13), and space (32). If
       "use locale;" is included in a Perl script, \s may match the VT charac‐
       ter. In PCRE, it never does.

       A  "word"  character is an underscore or any character that is a letter
       or digit.  By default, the definition of letters  and  digits  is  con‐
       trolled  by PCRE's low-valued character tables, and may vary if locale-
       specific matching is taking place (see "Locale support" in the  pcreapi
       page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
       systems, or "french" in Windows, some character codes greater than  128
       are  used  for  accented letters, and these are then matched by \w. The
       use of locales with Unicode is discouraged.

       By default, in UTF-8 mode, characters with values  of  128  or  greater
       never  match  \d,  \s,  or  \w,  and always match \D, \S, and \W. These
       sequences retain their original meanings from before UTF-8 support  was
       available,  mainly for efficiency reasons. However, if PCRE is compiled
       with Unicode property support, and the PCRE_UCP option is set, the  be‐
       haviour  is  changed  so  that Unicode properties are used to determine
       character types, as follows:

         \d  any character that \p{Nd} matches (decimal digit)
         \s  any character that \p{Z} matches, plus HT, LF, FF, CR
         \w  any character that \p{L} or \p{N} matches, plus underscore

       The upper case escapes match the inverse sets of characters. Note  that
       \d  matches  only decimal digits, whereas \w matches any Unicode digit,
       as well as any Unicode letter, and underscore. Note also that  PCRE_UCP
       affects  \b  and  \B,  because  they are defined in terms of \w and \W.
       Matching these sequences is noticeably slower when PCRE_UCP is set.
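
       The three PCRE_UCP definitions above can be approximated with Python's
       unicodedata module. This sketch only mirrors the table: the helper
       names are made up, and Python's Unicode tables may be a different
       Unicode version from the one a given PCRE build uses.

```python
import unicodedata

def ucp_digit(ch):
    # \d under PCRE_UCP: anything \p{Nd} matches
    return unicodedata.category(ch) == "Nd"

def ucp_space(ch):
    # \s under PCRE_UCP: anything \p{Z} matches, plus HT, LF, FF, CR
    return unicodedata.category(ch).startswith("Z") or ch in "\t\n\f\r"

def ucp_word(ch):
    # \w under PCRE_UCP: anything \p{L} or \p{N} matches, plus underscore
    return ch == "_" or unicodedata.category(ch)[0] in "LN"

roman_eight = "\u2167"   # ROMAN NUMERAL EIGHT, category Nl
print(ucp_digit(roman_eight), ucp_word(roman_eight))
```

       The Roman numeral shows the difference noted above: it is a Unicode
       number (so \w matches it) but not a decimal digit (so \d does not).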

       The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
       the  other  sequences,  which  match  only ASCII characters by default,
       these always  match  certain  high-valued  codepoints  in  UTF-8  mode,
       whether or not PCRE_UCP is set. The horizontal space characters are:

         U+0009     Horizontal tab
         U+0020     Space
         U+00A0     Non-break space
         U+1680     Ogham space mark
         U+180E     Mongolian vowel separator
         U+2000     En quad
         U+2001     Em quad
         U+2002     En space
         U+2003     Em space
         U+2004     Three-per-em space
         U+2005     Four-per-em space
         U+2006     Six-per-em space
         U+2007     Figure space
         U+2008     Punctuation space
         U+2009     Thin space
         U+200A     Hair space
         U+202F     Narrow no-break space
         U+205F     Medium mathematical space
         U+3000     Ideographic space

       The vertical space characters are:

         U+000A     Linefeed
         U+000B     Vertical tab
         U+000C     Formfeed
         U+000D     Carriage return
         U+0085     Next line
         U+2028     Line separator
         U+2029     Paragraph separator
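
       For engines that lack \h and \v, the two lists above translate
       directly into character classes. The following Python sketch builds
       them for the re module; the variable names are illustrative, and the
       classes are exactly the code points listed above, nothing more.

```python
import re

# Code points copied from the two lists above; the contiguous runs
# U+2000..U+200A and U+000A..U+000D are written as ranges.
H_SPACE = "\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000"
V_SPACE = "\u000A-\u000D\u0085\u2028\u2029"

horizontal = re.compile("[" + H_SPACE + "]")        # stand-in for \h
not_horizontal = re.compile("[^" + H_SPACE + "]")   # stand-in for \H
vertical = re.compile("[" + V_SPACE + "]")          # stand-in for \v
not_vertical = re.compile("[^" + V_SPACE + "]")     # stand-in for \V

print(bool(horizontal.fullmatch("\u2003")), bool(vertical.fullmatch("\u0085")))
```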

   Newline sequences

       Outside  a  character class, by default, the escape sequence \R matches
       any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8
       mode \R is equivalent to the following:

         (?>\r\n|\n|\x0b|\f|\r|\x85)

       This  is  an  example  of an "atomic group", details of which are given
       below.  This particular group matches either the two-character sequence
       CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
       U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
       return, U+000D), or NEL (next line, U+0085). The two-character sequence
       is treated as a single unit that cannot be split.

       In UTF-8 mode, two additional characters whose codepoints  are  greater
       than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa‐
       rator, U+2029).  Unicode character property support is not  needed  for
       these characters to be recognized.
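
       A rough Python equivalent of the \R alternation, including the two
       extra UTF-8 mode characters, is sketched below. It uses an ordinary
       non-capturing group rather than an atomic group, so it is only an
       approximation: inside a larger pattern, backtracking could split a
       CRLF pair, which the atomic group in PCRE prevents. The names are
       made up for the example.

```python
import re

# CRLF is listed first so that it is tried before a lone CR.
BSR_UNICODE = re.compile("(?:\r\n|\n|\x0b|\f|\r|\x85|\u2028|\u2029)")

def split_on_newline_sequences(text):
    """Split text wherever the \\R approximation matches."""
    return BSR_UNICODE.split(text)

print(split_on_newline_sequences("a\r\nb\nc\x85d\u2028e"))
```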

       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
       the complete set  of  Unicode  line  endings)  by  setting  the  option
       PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
       (BSR is an abbreviation for "backslash R".) This can be made the default
       when  PCRE  is  built;  if this is the case, the other behaviour can be
       requested via the PCRE_BSR_UNICODE option.   It  is  also  possible  to
       specify  these  settings  by  starting a pattern string with one of the
       following sequences:

         (*BSR_ANYCRLF)   CR, LF, or CRLF only
         (*BSR_UNICODE)   any Unicode newline sequence

       These override the default and the options given to  pcre_compile()  or
       pcre_compile2(),  but  they  can  be  overridden  by  options  given to
       pcre_exec() or pcre_dfa_exec(). Note that these special settings, which
       are  not  Perl-compatible,  are  recognized only at the very start of a
       pattern, and that they must be in upper case. If more than one of  them
       is present, the last one is used. They can be combined with a change of
       newline convention; for example, a pattern can start with:

         (*ANY)(*BSR_ANYCRLF)

       They can also be combined with the (*UTF8) or (*UCP) special sequences.
       Inside  a  character  class,  \R  is  treated as an unrecognized escape
       sequence, and so matches the letter "R" by default, but causes an error
       if PCRE_EXTRA is set.

   Unicode character properties

       When PCRE is built with Unicode character property support, three addi‐
       tional escape sequences that match characters with specific  properties
       are  available.   When not in UTF-8 mode, these sequences are of course
       limited to testing characters whose codepoints are less than  256,  but
       they do work in this mode.  The extra escape sequences are:

         \p{xx}   a character with the xx property
         \P{xx}   a character without the xx property
         \X       an extended Unicode sequence

       The  property  names represented by xx above are limited to the Unicode
       script names, the general category properties, "Any", which matches any
       character   (including  newline),  and  some  special  PCRE  properties
       (described in the next section).  Other Perl properties such as  "InMu‐
       sicalSymbols"  are  not  currently supported by PCRE. Note that \P{Any}
       does not match any characters, so always causes a match failure.

       Sets of Unicode characters are defined as belonging to certain scripts.
       A  character from one of these sets can be matched using a script name.
       For example:

         \p{Greek}
         \P{Han}

       Those that are not part of an identified script are lumped together  as
       "Common". The current list of scripts is:

       Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,
       Buginese, Buhid, Canadian_Aboriginal, Carian, Cham,  Cherokee,  Common,
       Coptic,   Cuneiform,  Cypriot,  Cyrillic,  Deseret,  Devanagari,  Egyp‐
       tian_Hieroglyphs,  Ethiopic,  Georgian,  Glagolitic,   Gothic,   Greek,
       Gujarati,  Gurmukhi,  Han,  Hangul,  Hanunoo,  Hebrew,  Hiragana, Impe‐
       rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,
       Javanese,  Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao,
       Latin,  Lepcha,  Limbu,  Linear_B,  Lisu,  Lycian,  Lydian,  Malayalam,
       Meetei_Mayek,  Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic,
       Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki,  Oriya,  Osmanya,
       Phags_Pa,  Phoenician,  Rejang,  Runic, Samaritan, Saurashtra, Shavian,
       Sinhala, Sundanese, Syloti_Nagri, Syriac,  Tagalog,  Tagbanwa,  Tai_Le,
       Tai_Tham,  Tai_Viet,  Tamil,  Telugu,  Thaana, Thai, Tibetan, Tifinagh,
       Ugaritic, Vai, Yi.

       Each character has exactly one Unicode general category property, spec‐
       ified  by a two-letter abbreviation. For compatibility with Perl, nega‐
       tion can be specified by including a  circumflex  between  the  opening
       brace  and  the  property  name.  For  example,  \p{^Lu} is the same as
       \P{Lu}.

       If only one letter is specified with \p or \P, it includes all the gen‐
       eral  category properties that start with that letter. In this case, in
       the absence of negation, the curly brackets in the escape sequence  are
       optional; these two examples have the same effect:

         \p{L}
         \pL

       The following general category property codes are supported:

         C     Other
         Cc    Control
         Cf    Format
         Cn    Unassigned
         Co    Private use
         Cs    Surrogate

         L     Letter
         Ll    Lower case letter
         Lm    Modifier letter
         Lo    Other letter
         Lt    Title case letter
         Lu    Upper case letter

         M     Mark
         Mc    Spacing mark
         Me    Enclosing mark
         Mn    Non-spacing mark

         N     Number
         Nd    Decimal number
         Nl    Letter number
         No    Other number

         P     Punctuation
         Pc    Connector punctuation
         Pd    Dash punctuation
         Pe    Close punctuation
         Pf    Final punctuation
         Pi    Initial punctuation
         Po    Other punctuation
         Ps    Open punctuation

         S     Symbol
         Sc    Currency symbol
         Sk    Modifier symbol
         Sm    Mathematical symbol
         So    Other symbol

         Z     Separator
         Zl    Line separator
         Zp    Paragraph separator
         Zs    Space separator

       The  special property L& is also supported: it matches a character that
       has the Lu, Ll, or Lt property, in other words, a letter  that  is  not
       classified as a modifier or "other".
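
       The way one-letter codes, two-letter codes, L&, "Any" and the
       circumflex negation relate to each other can be sketched with Python's
       unicodedata module. The helper has_property is hypothetical and covers
       only general categories; script names such as Greek are not handled,
       and Python's Unicode data may differ in version from PCRE's tables.

```python
import unicodedata

def has_property(ch, name):
    """Model of \\p{name} for general categories (illustrative only)."""
    negate = name.startswith("^")          # \p{^Lu} is the same as \P{Lu}
    if negate:
        name = name[1:]
    category = unicodedata.category(ch)    # always a two-letter code
    if name == "Any":
        result = True
    elif name == "L&":
        result = category in ("Lu", "Ll", "Lt")
    elif len(name) == 1:
        result = category.startswith(name) # one letter covers the group
    else:
        result = category == name
    return result != negate

modifier_h = "\u02b0"   # MODIFIER LETTER SMALL H, category Lm
print(has_property(modifier_h, "L"), has_property(modifier_h, "L&"))
```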

       The  Cs  (Surrogate)  property  applies only to characters in the range
       U+D800 to U+DFFF. Such characters are not valid in UTF-8  strings  (see
       RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check‐
       ing has been turned off (see the discussion  of  PCRE_NO_UTF8_CHECK  in
       the pcreapi page). Perl does not support the Cs property.

       The  long  synonyms  for  property  names  that  Perl supports (such as
       \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix
       any of these properties with "Is".

       No character that is in the Unicode table has the Cn (unassigned) prop‐
       erty.  Instead, this property is assumed for any code point that is not
       in the Unicode table.

       Specifying  caseless  matching  does not affect these escape sequences.
       For example, \p{Lu} always matches only upper case letters.

       The \X escape matches any number of Unicode  characters  that  form  an
       extended Unicode sequence. \X is equivalent to

         (?>\PM\pM*)

       That  is,  it matches a character without the "mark" property, followed
       by zero or more characters with the "mark"  property,  and  treats  the
       sequence  as  an  atomic group (see below).  Characters with the "mark"
       property are typically accents that  affect  the  preceding  character.
       None  of  them  have  codepoints less than 256, so in non-UTF-8 mode \X
       matches any one character.
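
       The \PM\pM* definition can be followed by hand with a short Python
       sketch that uses unicodedata to test for the "mark" categories. The
       function match_extended_sequence is a hypothetical helper that models
       one \X match at a given position; it is not how PCRE implements \X.

```python
import unicodedata

def is_mark(ch):
    # True for the M group: Mn, Mc and Me
    return unicodedata.category(ch).startswith("M")

def match_extended_sequence(text, pos=0):
    """Return the end index of one \\PM\\pM* match at pos, or None."""
    if pos >= len(text) or is_mark(text[pos]):
        return None            # \PM needs one non-mark character here
    end = pos + 1
    while end < len(text) and is_mark(text[end]):
        end += 1               # \pM* takes every following mark
    return end

subject = "e\u0301x"           # "e", COMBINING ACUTE ACCENT, "x"
print(match_extended_sequence(subject, 0), match_extended_sequence(subject, 1))
```

       Because the marks are taken greedily and the group is atomic, a base
       character is never separated from its accents.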
574
575       Matching characters by Unicode property is not fast, because  PCRE  has
576       to  search  a  structure  that  contains data for over fifteen thousand
577       characters. That is why the traditional escape sequences such as \d and
578       \w  do  not  use  Unicode properties in PCRE by default, though you can
579       make them do so by setting the PCRE_UCP option for pcre_compile() or by
580       starting the pattern with (*UCP).
581
582   PCRE's additional properties
583
584       As  well  as  the standard Unicode properties described in the previous
585       section, PCRE supports four more that make it possible to convert  tra‐
586       ditional escape sequences such as \w and \s and POSIX character classes
587       to use Unicode properties. PCRE uses these non-standard, non-Perl prop‐
588       erties internally when PCRE_UCP is set. They are:
589
590         Xan   Any alphanumeric character
591         Xps   Any POSIX space character
592         Xsp   Any Perl space character
593         Xwd   Any Perl "word" character
594
595       Xan  matches  characters that have either the L (letter) or the N (num‐
596       ber) property. Xps matches the characters tab, linefeed, vertical  tab,
597       formfeed,  or  carriage  return, and any other character that has the Z
598       (separator) property.  Xsp is the same as Xps, except that vertical tab
599       is excluded. Xwd matches the same characters as Xan, plus underscore.
600
601   Resetting the match start
602
603       The escape sequence \K, which is a Perl 5.10 feature, causes any previ‐
604       ously matched characters not  to  be  included  in  the  final  matched
605       sequence. For example, the pattern:
606
607         foo\Kbar
608
609       matches  "foobar",  but reports that it has matched "bar". This feature
610       is similar to a lookbehind assertion (described below). However, with
611       \K the part of the subject before the reported match does not have to
612       be of fixed length, as it does in a lookbehind assertion. The use of \K
613       does not interfere with the setting of captured substrings. For example,
614       when the pattern
615
616         (foo)\Kbar
617
618       matches "foobar", the first substring is still set to "foo".
619
620       Perl documents that the use  of  \K  within  assertions  is  "not  well
621       defined".  In  PCRE,  \K  is  acted upon when it occurs inside positive
622       assertions, but is ignored in negative assertions.
623
624   Simple assertions
625
626       The final use of backslash is for certain simple assertions. An  asser‐
627       tion  specifies a condition that has to be met at a particular point in
628       a match, without consuming any characters from the subject string.  The
629       use  of subpatterns for more complicated assertions is described below.
630       The backslashed assertions are:
631
632         \b     matches at a word boundary
633         \B     matches when not at a word boundary
634         \A     matches at the start of the subject
635         \Z     matches at the end of the subject
636                 also matches before a newline at the end of the subject
637         \z     matches only at the end of the subject
638         \G     matches at the first matching position in the subject
639
640       Inside a character class, \b has a different meaning;  it  matches  the
641       backspace  character.  If  any  other  of these assertions appears in a
642       character class, by default it matches the corresponding literal  char‐
643       acter  (for  example,  \B  matches  the  letter  B).  However,  if  the
644       PCRE_EXTRA option is set, an "invalid escape sequence" error is  gener‐
645       ated instead.
646
647       A  word  boundary is a position in the subject string where the current
648       character and the previous character do not both match \w or  \W  (i.e.
649       one  matches  \w  and the other matches \W), or the start or end of the
650       string if the first or last  character  matches  \w,  respectively.  In
651       UTF-8  mode,  the  meanings  of \w and \W can be changed by setting the
652       PCRE_UCP option. When this is done, it also affects \b and \B.  Neither
653       PCRE  nor  Perl has a separate "start of word" or "end of word" metase‐
654       quence. However, whatever follows \b normally determines which  it  is.
655       For example, the fragment \ba matches "a" at the start of a word.
656
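       As a rough illustration, the sketch below uses Python's re module as
       a stand-in for PCRE. That is an assumption made for convenience: it
       is a different engine. The re.ASCII flag restricts \w to letters,
       digits, and underscore, and for ASCII subjects its \b is expected to
       behave as described above.

```python
import re

subject = "an apple, banana"

# \ba matches "a" only where a word starts.
word_start_a = [m.start() for m in re.finditer(r"\ba", subject, re.ASCII)]

# a\b matches "a" only where a word ends.
word_end_a = [m.start() for m in re.finditer(r"a\b", subject, re.ASCII)]

# Inside a character class, \b stands for the backspace character (code 8).
backspace_in_class = re.fullmatch(r"[\b]", "\x08") is not None
```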
657       The  \A,  \Z,  and \z assertions differ from the traditional circumflex
658       and dollar (described in the next section) in that they only ever match
659       at  the  very start and end of the subject string, whatever options are
660       set. Thus, they are independent of multiline mode. These  three  asser‐
661       tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
662       affect only the behaviour of the circumflex and dollar  metacharacters.
663       However,  if the startoffset argument of pcre_exec() is non-zero, indi‐
664       cating that matching is to start at a point other than the beginning of
665       the  subject,  \A  can never match. The difference between \Z and \z is
666       that \Z matches before a newline at the end of the string as well as at
667       the very end, whereas \z matches only at the end.
668
669       The  \G assertion is true only when the current matching position is at
670       the start point of the match, as specified by the startoffset  argument
671       of  pcre_exec().  It  differs  from \A when the value of startoffset is
672       non-zero. By calling pcre_exec() multiple times with appropriate
673       arguments, you can mimic Perl's /g option, and it is in this kind of
674       implementation that \G can be useful.
675
676       Note, however, that PCRE's interpretation of \G, as the  start  of  the
677       current match, is subtly different from Perl's, which defines it as the
678       end of the previous match. In Perl, these can  be  different  when  the
679       previously  matched  string was empty. Because PCRE does just one match
680       at a time, it cannot reproduce this behaviour.
681
682       If all the alternatives of a pattern begin with \G, the  expression  is
683       anchored to the starting match position, and the "anchored" flag is set
684       in the compiled regular expression.
685
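       The repeated-call technique can be sketched in Python, whose re
       module is used here only as an analogy and is not PCRE. The method
       Pattern.match(string, pos) attempts a match at offset pos only, which
       roughly plays the role of a pattern that begins with \G combined
       with a non-zero startoffset. The helper name tokenize and the token
       pattern are hypothetical.

```python
import re

def tokenize(subject):
    """Collect consecutive tokens from the start of subject.

    Hypothetical helper: returns the tokens found and the offset at
    which scanning stopped.
    """
    token = re.compile(r"\s*(\d+|[a-z]+)")
    tokens = []
    pos = 0
    while pos < len(subject):
        # match() tries only at pos, much as a leading \G would.
        found = token.match(subject, pos)
        if found is None:
            break
        tokens.append(found.group(1))
        pos = found.end()  # the next attempt starts where this match ended
    return tokens, pos
```

       For the subject "12 ab 7 ?x" the loop stops at the "?" rather than
       skipping ahead to "x", because every attempt is tied to the current
       start offset.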

CIRCUMFLEX AND DOLLAR

687
688       Outside a character class, in the default matching mode, the circumflex
689       character  is  an  assertion  that is true only if the current matching
690       point is at the start of the subject string. If the  startoffset  argu‐
691       ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
692       PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
693       has an entirely different meaning (see below).
694
695       Circumflex  need  not be the first character of the pattern if a number
696       of alternatives are involved, but it should be the first thing in  each
697       alternative  in  which  it appears if the pattern is ever to match that
698       branch. If all possible alternatives start with a circumflex, that  is,
699       if  the  pattern  is constrained to match only at the start of the sub‐
700       ject, it is said to be an "anchored" pattern.  (There  are  also  other
701       constructs that can cause a pattern to be anchored.)
702
703       A  dollar  character  is  an assertion that is true only if the current
704       matching point is at the end of  the  subject  string,  or  immediately
705       before a newline at the end of the string (by default). Dollar need not
706       be the last character of the pattern if a number  of  alternatives  are
707       involved,  but  it  should  be  the last item in any branch in which it
708       appears. Dollar has no special meaning in a character class.
709
710       The meaning of dollar can be changed so that it  matches  only  at  the
711       very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at
712       compile time. This does not affect the \Z assertion.
713
714       The meanings of the circumflex and dollar characters are changed if the
715       PCRE_MULTILINE  option  is  set.  When  this  is the case, a circumflex
716       matches immediately after internal newlines as well as at the start  of
717       the  subject  string.  It  does not match after a newline that ends the
718       string. A dollar matches before any newlines in the string, as well  as
719       at  the very end, when PCRE_MULTILINE is set. When newline is specified
720       as the two-character sequence CRLF, isolated CR and  LF  characters  do
721       not indicate newlines.
722
723       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
724       (where \n represents a newline) in multiline mode, but  not  otherwise.
725       Consequently,  patterns  that  are anchored in single line mode because
726       all branches start with ^ are not anchored in  multiline  mode,  and  a
727       match  for  circumflex  is  possible  when  the startoffset argument of
728       pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is  ignored  if
729       PCRE_MULTILINE is set.
730
731       Note  that  the sequences \A, \Z, and \z can be used to match the start
732       and end of the subject in both modes, and if all branches of a  pattern
733       start  with  \A it is always anchored, whether or not PCRE_MULTILINE is
734       set.
735
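       The /^abc$/ example can be tried with Python's re module, used here
       as an assumed stand-in for PCRE. Its re.MULTILINE flag corresponds
       roughly to PCRE_MULTILINE. Only the cases where the two engines are
       expected to agree are shown.

```python
import re

subject = "def\nabc"

# Default mode: circumflex matches only at the start of the subject.
single_line = re.search(r"^abc$", subject)

# Multiline mode: circumflex also matches after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# By default, dollar also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")

# \A stays tied to the very start of the subject, even in multiline mode.
anchored_start = re.search(r"\Aabc", subject, re.MULTILINE)
```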

FULL STOP (PERIOD, DOT) AND \N

737
738       Outside a character class, a dot in the pattern matches any one charac‐
739       ter  in  the subject string except (by default) a character that signi‐
740       fies the end of a line. In UTF-8 mode, the  matched  character  may  be
741       more than one byte long.
742
743       When  a line ending is defined as a single character, dot never matches
744       that character; when the two-character sequence CRLF is used, dot  does
745       not  match  CR  if  it  is immediately followed by LF, but otherwise it
746       matches all characters (including isolated CRs and LFs). When any  Uni‐
747       code  line endings are being recognized, dot does not match CR or LF or
748       any of the other line ending characters.
749
750       The behaviour of dot with regard to newlines can  be  changed.  If  the
751       PCRE_DOTALL  option  is  set,  a dot matches any one character, without
752       exception. If the two-character sequence CRLF is present in the subject
753       string, it takes two dots to match it.
754
755       The  handling of dot is entirely independent of the handling of circum‐
756       flex and dollar, the only relationship being  that  they  both  involve
757       newlines. Dot has no special meaning in a character class.
758
759       The escape sequence \N always behaves as a dot does when PCRE_DOTALL is
760       not set. In other words, it matches any one character except  one  that
761       signifies the end of a line.
762
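       A small sketch of the dot behaviour, using Python's re module as an
       assumed stand-in. Python treats only linefeed as a line ending for
       dot, which corresponds to PCRE with LF as the newline convention,
       and it has no equivalent of \N. The flag re.DOTALL plays the role of
       PCRE_DOTALL.

```python
import re

# By default, dot does not match the newline character.
dot_default = re.search(r"a.b", "a\nb")

# With DOTALL, dot matches any one character.
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

# CRLF is two characters, so it takes two dots to match it.
crlf_one_dot = re.fullmatch(r"a.b", "a\r\nb", re.DOTALL)
crlf_two_dots = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)

# Inside a character class, dot is an ordinary character.
literal_dot = [c for c in ".x" if re.fullmatch(r"[.]", c)]
```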

MATCHING A SINGLE BYTE

764
765       Outside a character class, the escape sequence \C matches any one byte,
766       both in and out of UTF-8 mode. Unlike a  dot,  it  always  matches  any
767       line-ending  characters.  The  feature  is provided in Perl in order to
768       match individual bytes in UTF-8 mode. Because it breaks up UTF-8  char‐
769       acters  into individual bytes, what remains in the string may be a mal‐
770       formed UTF-8 string. For this reason, the \C escape  sequence  is  best
771       avoided.
772
773       PCRE  does  not  allow \C to appear in lookbehind assertions (described
774       below), because in UTF-8 mode this would make it impossible  to  calcu‐
775       late the length of the lookbehind.
776

SQUARE BRACKETS AND CHARACTER CLASSES

778
779       An opening square bracket introduces a character class, terminated by a
780       closing square bracket. A closing square bracket on its own is not spe‐
781       cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,
782       a lone closing square bracket causes a compile-time error. If a closing
783       square  bracket  is required as a member of the class, it should be the
784       first data character in the class  (after  an  initial  circumflex,  if
785       present) or escaped with a backslash.
786
787       A  character  class matches a single character in the subject. In UTF-8
788       mode, the character may be more than one byte long. A matched character
789       must be in the set of characters defined by the class, unless the first
790       character in the class definition is a circumflex, in  which  case  the
791       subject  character  must  not  be in the set defined by the class. If a
792       circumflex is actually required as a member of the class, ensure it  is
793       not the first character, or escape it with a backslash.
794
795       For  example, the character class [aeiou] matches any lower case vowel,
796       while [^aeiou] matches any character that is not a  lower  case  vowel.
797       Note that a circumflex is just a convenient notation for specifying the
798       characters that are in the class by enumerating those that are  not.  A
799       class  that starts with a circumflex is not an assertion; it still con‐
800       sumes a character from the subject string, and therefore  it  fails  if
801       the current pointer is at the end of the string.
802
803       In  UTF-8 mode, characters with values greater than 255 can be included
804       in a class as a literal string of bytes, or by using the  \x{  escaping
805       mechanism.
806
807       When  caseless  matching  is set, any letters in a class represent both
808       their upper case and lower case versions, so for  example,  a  caseless
809       [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
810       match "A", whereas a caseful version would. In UTF-8 mode, PCRE  always
811       understands  the  concept  of case for characters whose values are less
812       than 128, so caseless matching is always possible. For characters  with
813       higher  values,  the  concept  of case is supported if PCRE is compiled
814       with Unicode property support, but not otherwise.  If you want  to  use
815       caseless  matching in UTF-8 mode for characters 128 and above, you must
816       ensure that PCRE is compiled with Unicode property support as  well  as
817       with UTF-8 support.
818
819       Characters  that  might  indicate  line breaks are never treated in any
820       special way  when  matching  character  classes,  whatever  line-ending
821       sequence  is  in  use,  and  whatever  setting  of  the PCRE_DOTALL and
822       PCRE_MULTILINE options is used. A class such as [^a] always matches one
823       of these characters.
824
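       The class rules so far can be checked with a short sketch. It
       assumes Python's re module as a stand-in for PCRE, with
       re.IGNORECASE in place of caseless matching. The variable names are
       illustrative only.

```python
import re

vowel = re.fullmatch(r"[aeiou]", "e") is not None
non_vowel = re.fullmatch(r"[^aeiou]", "x") is not None

# A negated class still consumes a character, so it fails at the end
# of the subject.
fails_at_end = re.search(r"x[^aeiou]", "x") is None

# Caseless matching applies to both plain and negated classes.
caseless_vowel = re.fullmatch(r"[aeiou]", "A", re.IGNORECASE) is not None
caseless_negated = re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE) is None
caseful_negated = re.fullmatch(r"[^aeiou]", "A") is not None

# A newline is not special in a class: [^a] matches it.
newline_matches = re.fullmatch(r"[^a]", "\n") is not None

# A closing bracket placed first in the class is a member of the class.
bracket_member = re.fullmatch(r"[]x]", "]") is not None
```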
825       The  minus (hyphen) character can be used to specify a range of charac‐
826       ters in a character  class.  For  example,  [d-m]  matches  any  letter
827       between  d  and  m,  inclusive.  If  a minus character is required in a
828       class, it must be escaped with a backslash  or  appear  in  a  position
829       where  it cannot be interpreted as indicating a range, typically as the
830       first or last character in the class.
831
832       It is not possible to have the literal character "]" as the end charac‐
833       ter  of a range. A pattern such as [W-]46] is interpreted as a class of
834       two characters ("W" and "-") followed by a literal string "46]", so  it
835       would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
836       backslash it is interpreted as the end of range, so [W-\]46] is  inter‐
837       preted  as a class containing a range followed by two other characters.
838       The octal or hexadecimal representation of "]" can also be used to  end
839       a range.
840
841       Ranges  operate in the collating sequence of character values. They can
842       also  be  used  for  characters  specified  numerically,  for   example
843       [\000-\037].  In UTF-8 mode, ranges can include characters whose values
844       are greater than 255, for example [\x{100}-\x{2ff}].
845
846       If a range that includes letters is used when caseless matching is set,
847       it matches the letters in either case. For example, [W-c] is equivalent
848       to [][\\^_`wxyzabc], matched caselessly,  and  in  non-UTF-8  mode,  if
849       character  tables  for  a French locale are in use, [\xc8-\xcb] matches
850       accented E characters in both cases. In UTF-8 mode, PCRE  supports  the
851       concept  of  case for characters with values greater than 127 only when
852       it is compiled with Unicode property support.
853
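       The range examples above can be sketched as follows, again assuming
       Python's re module as a stand-in. For these particular ASCII
       patterns it is expected to parse the classes the same way.

```python
import re

# [W-]46] is the class [W-] followed by the literal string "46]".
literal_tail = [s for s in ("W46]", "-46]", "X46]")
                if re.fullmatch(r"[W-]46]", s)]

# Escaping "]" makes it the end of the range W..], followed by 4 and 6.
range_to_bracket = [c for c in "WZ[]46Va"
                    if re.fullmatch(r"[W-\]46]", c)]

# Ranges also work for characters given numerically (octal here).
control_char = re.fullmatch(r"[\000-\037]", "\x1f") is not None
space_char = re.fullmatch(r"[\000-\037]", " ") is not None

# A caseless [W-c] matches the letters in the range in either case.
caseless_range = [c for c in "Ww^Cc_dV"
                  if re.fullmatch(r"[W-c]", c, re.IGNORECASE)]
```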
854       The character types \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, \w, and  \W
855       may  also appear in a character class, and add the characters that they
856       match to the class. For example, [\dABCDEF] matches any hexadecimal
857       digit written with upper case letters. A circumflex can conveniently be
858       used with the upper case character types to specify a more restricted
859       set of characters than the matching lower case type. For example, the
860       class [^\W_] matches any letter or digit, but not underscore.
861
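       A brief sketch of these two classes, assuming Python's re module as
       a stand-in with the re.ASCII flag so that \d and \W cover only ASCII
       characters:

```python
import re

# \d adds the decimal digits to the class. The letters are listed in
# upper case only, so "a" does not match.
hex_upper = [c for c in "09AFaG"
             if re.fullmatch(r"[\dABCDEF]", c, re.ASCII)]

# [^\W_] is "word character, but not underscore".
letters_and_digits = [c for c in "a7_-"
                      if re.fullmatch(r"[^\W_]", c, re.ASCII)]
```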
862       The only metacharacters that are recognized in  character  classes  are
863       backslash,  hyphen  (only  where  it can be interpreted as specifying a
864       range), circumflex (only at the start), opening  square  bracket  (only
865       when  it can be interpreted as introducing a POSIX class name - see the
866       next section), and the terminating  closing  square  bracket.  However,
867       escaping other non-alphanumeric characters does no harm.
868

POSIX CHARACTER CLASSES

870
871       Perl supports the POSIX notation for character classes. This uses names
872       enclosed by [: and :] within the enclosing square brackets.  PCRE  also
873       supports this notation. For example,
874
875         [01[:alpha:]%]
876
877       matches "0", "1", any alphabetic character, or "%". The supported class
878       names are:
879
880         alnum    letters and digits
881         alpha    letters
882         ascii    character codes 0 - 127
883         blank    space or tab only
884         cntrl    control characters
885         digit    decimal digits (same as \d)
886         graph    printing characters, excluding space
887         lower    lower case letters
888         print    printing characters, including space
889         punct    printing characters, excluding letters and digits and space
890         space    white space (not quite the same as \s)
891         upper    upper case letters
892         word     "word" characters (same as \w)
893         xdigit   hexadecimal digits
894
895       The "space" characters are HT (9), LF (10), VT (11), FF (12), CR  (13),
896       and  space  (32). Notice that this list includes the VT character (code
897       11). This makes "space" different to \s, which does not include VT (for
898       Perl compatibility).
899
900       The  name  "word"  is  a Perl extension, and "blank" is a GNU extension
901       from Perl 5.8. Another Perl extension is negation, which  is  indicated
902       by a ^ character after the colon. For example,
903
904         [12[:^digit:]]
905
906       matches  "1", "2", or any non-digit. PCRE (and Perl) also recognize the
907       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
908       these are not supported, and an error is given if they are encountered.
909
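       Python's re module does not accept the [:name:] notation, so the
       sketch below expands a few of the class names by hand to show what
       they mean for ASCII subjects. The table POSIX_ASCII and the helper
       expand_posix are hypothetical. They cover only some of the names
       listed above and do not handle the negated form.

```python
import re

# ASCII-only expansions for some POSIX class names (illustrative table).
POSIX_ASCII = {
    "alnum": "A-Za-z0-9",
    "alpha": "A-Za-z",
    "blank": r" \t",
    "digit": "0-9",
    "lower": "a-z",
    "space": r"\t\n\x0b\x0c\r ",  # HT, LF, VT, FF, CR, space
    "upper": "A-Z",
    "word": "A-Za-z0-9_",
    "xdigit": "0-9A-Fa-f",
}

def expand_posix(pattern):
    """Replace each [:name:] with its ASCII expansion from the table."""
    return re.sub(r"\[:([a-z]+):\]",
                  lambda m: POSIX_ASCII[m.group(1)],
                  pattern)

expanded = expand_posix("[01[:alpha:]%]")
matched = [c for c in "0q%2" if re.fullmatch(expanded, c)]

# VT (code 11) is in "space", which is what sets it apart from \s in PCRE.
vt_is_space = re.fullmatch(expand_posix("[[:space:]]"), "\x0b") is not None
```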
910       By  default,  in UTF-8 mode, characters with values greater than 127 do
911       not match any of the POSIX character classes. However, if the  PCRE_UCP
912       option  is passed to pcre_compile(), some of the classes are changed so
913       that Unicode character properties are used. This is achieved by replac‐
914       ing the POSIX classes by other sequences, as follows:
915
916         [:alnum:]  becomes  \p{Xan}
917         [:alpha:]  becomes  \p{L}
918         [:blank:]  becomes  \h
919         [:digit:]  becomes  \p{Nd}
920         [:lower:]  becomes  \p{Ll}
921         [:space:]  becomes  \p{Xps}
922         [:upper:]  becomes  \p{Lu}
923         [:word:]   becomes  \p{Xwd}
924
925       Negated  versions,  such  as [:^alpha:] use \P instead of \p. The other
926       POSIX classes are unchanged, and match only characters with code points
927       less than 128.
928

VERTICAL BAR

930
931       Vertical  bar characters are used to separate alternative patterns. For
932       example, the pattern
933
934         gilbert|sullivan
935
936       matches either "gilbert" or "sullivan". Any number of alternatives  may
937       appear,  and  an  empty  alternative  is  permitted (matching the empty
938       string). The matching process tries each alternative in turn, from left
939       to  right, and the first one that succeeds is used. If the alternatives
940       are within a subpattern (defined below), "succeeds" means matching  the
941       rest of the main pattern as well as the alternative in the subpattern.
942
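       The left-to-right rule can be seen in a short sketch. It assumes
       Python's re module as a stand-in, which also tries alternatives in
       order.

```python
import re

either = [w for w in ("gilbert", "sullivan", "gilbertson")
          if re.fullmatch(r"gilbert|sullivan", w)]

# The first alternative that succeeds is used, even if a later one
# would match more of the subject.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes matching the rest of the
# pattern, so here the second alternative is the one that is used.
rest_counts = re.fullmatch(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alternative = re.fullmatch(r"x(y|)", "x") is not None
```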

INTERNAL OPTION SETTING

944
945       The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
946       PCRE_EXTENDED options (which are Perl-compatible) can be  changed  from
947       within  the  pattern  by  a  sequence  of  Perl option letters enclosed
948       between "(?" and ")".  The option letters are
949
950         i  for PCRE_CASELESS
951         m  for PCRE_MULTILINE
952         s  for PCRE_DOTALL
953         x  for PCRE_EXTENDED
954
955       For example, (?im) sets caseless, multiline matching. It is also possi‐
956       ble to unset these options by preceding the letter with a hyphen, and a
957       combined setting and unsetting such as (?im-sx), which sets  PCRE_CASE‐
958       LESS  and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
959       is also permitted. If a  letter  appears  both  before  and  after  the
960       hyphen, the option is unset.
961
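       A minimal sketch of a leading option setting, assuming Python's re
       module as a stand-in. Recent Python versions accept this form only
       at the very start of a pattern, so option changes placed later in a
       pattern, as PCRE allows, cannot be reproduced this way.

```python
import re

subject = "def\nABC"

# Without options, neither the case nor the line position fits.
plain = re.search(r"^abc$", subject)

# (?im) sets caseless, multiline matching for the whole pattern.
with_options = re.search(r"(?im)^abc$", subject)
```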
962       The  PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
963       can be changed in the same way as the Perl-compatible options by  using
964       the characters J, U and X respectively.
965
966       When  one  of  these  option  changes occurs at top level (that is, not
967       inside subpattern parentheses), the change applies to the remainder  of
968       the pattern that follows. If the change is placed right at the start of
969       a pattern, PCRE extracts it into the global options (and it will there‐
970       fore show up in data extracted by the pcre_fullinfo() function).
971
972       An  option  change  within a subpattern (see below for a description of
973       subpatterns) affects only that part of the current pattern that follows
974       it, so
975
976         (a(?i)b)c
977
978       matches "abc" and "aBc" and no other strings (assuming PCRE_CASELESS is
979       not used). By this means, options can be made to have different
980       settings in different parts of the pattern. Any changes made in one
981       alternative do carry on into subsequent branches within the same
982       subpattern. For example,
983
984         (a(?i)b|c)
985
986       matches  "ab",  "aB",  "c",  and "C", even though when matching "C" the
987       first branch is abandoned before the option setting.  This  is  because
988       the effects of option settings happen at compile time. If they happened
989       at match time, the result would depend on which branches had been tried.
990
991       Note: There are other PCRE-specific options that  can  be  set  by  the
992       application  when  the  compile  or match functions are called. In some
993       cases the pattern can contain special leading sequences such as (*CRLF)
994       to  override  what  the application has set or what has been defaulted.
995       Details are given in the section entitled  "Newline  sequences"  above.
996       There  are  also  the  (*UTF8) and (*UCP) leading sequences that can be
997       used to set UTF-8 and Unicode property modes; they  are  equivalent  to
998       setting the PCRE_UTF8 and the PCRE_UCP options, respectively.
999

SUBPATTERNS

1001
1002       Subpatterns are delimited by parentheses (round brackets), which can be
1003       nested.  Turning part of a pattern into a subpattern does two things:
1004
1005       1. It localizes a set of alternatives. For example, the pattern
1006
1007         cat(aract|erpillar|)
1008
1009       matches one of the words "cat", "cataract", or  "caterpillar".  Without
1010       the  parentheses,  it  would  match  "cataract", "erpillar" or an empty
1011       string.
1012
1013       2. It sets up the subpattern as  a  capturing  subpattern.  This  means
1014       that,  when  the  whole  pattern  matches,  that portion of the subject
1015       string that matched the subpattern is passed back to the caller via the
1016       ovector  argument  of pcre_exec(). Opening parentheses are counted from
1017       left to right (starting from 1) to obtain  numbers  for  the  capturing
1018       subpatterns.
1019
1020       For  example,  if the string "the red king" is matched against the pat‐
1021       tern
1022
1023         the ((red|white) (king|queen))
1024
1025       the captured substrings are "red king", "red", and "king", and are num‐
1026       bered 1, 2, and 3, respectively.
1027
1028       The  fact  that  plain  parentheses  fulfil two functions is not always
1029       helpful.  There are often times when a grouping subpattern is  required
1030       without  a capturing requirement. If an opening parenthesis is followed
1031       by a question mark and a colon, the subpattern does not do any  captur‐
1032       ing,  and  is  not  counted when computing the number of any subsequent
1033       capturing subpatterns. For example, if the string "the white queen"  is
1034       matched against the pattern
1035
1036         the ((?:red|white) (king|queen))
1037
1038       the captured substrings are "white queen" and "queen", and are numbered
1039       1 and 2. The maximum number of capturing subpatterns is 65535.
1040
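       The numbering of captured substrings can be checked with a sketch
       that assumes Python's re module as a stand-in. Its groups() method
       returns the captured substrings in numeric order, which is roughly
       what the ovector conveys to a caller of pcre_exec().

```python
import re

# Capturing parentheses are numbered by their opening parenthesis.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
captured = capturing.groups()

# (?: ... ) groups without capturing and is not counted.
mixed = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
captured_mixed = mixed.groups()

# Parentheses also localize a set of alternatives.
cat_words = [w for w in ("cat", "cataract", "caterpillar", "erpillar")
             if re.fullmatch(r"cat(aract|erpillar|)", w)]
```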
1041       As a convenient shorthand, if any option settings are required  at  the
1042       start  of  a  non-capturing  subpattern,  the option letters may appear
1043       between the "?" and the ":". Thus the two patterns
1044
1045         (?i:saturday|sunday)
1046         (?:(?i)saturday|sunday)
1047
1048       match exactly the same set of strings. Because alternative branches are
1049       tried  from  left  to right, and options are not reset until the end of
1050       the subpattern is reached, an option setting in one branch does  affect
1051       subsequent  branches,  so  the above patterns match "SUNDAY" as well as
1052       "Saturday".
1053
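       The shorthand form can be tried with Python's re module, assumed
       here as a stand-in. It supports the (?i:...) spelling, although
       recent Python versions reject the second pattern above because the
       option setting is not at the start.

```python
import re

days = [d for d in ("Saturday", "SUNDAY", "sunday", "Monday")
        if re.fullmatch(r"(?i:saturday|sunday)", d)]

# The option applies only inside the subpattern that carries it.
inside_only = re.fullmatch(r"(?i:sat)urday", "SATurday") is not None
not_outside = re.fullmatch(r"(?i:sat)urday", "SATURDAY") is None

# The subpattern is non-capturing, so no groups are counted.
group_count = re.compile(r"(?i:saturday|sunday)").groups
```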

DUPLICATE SUBPATTERN NUMBERS

1055
1056       Perl 5.10 introduced a feature whereby each alternative in a subpattern
1057       uses  the same numbers for its capturing parentheses. Such a subpattern
1058       starts with (?| and is itself a non-capturing subpattern. For  example,
1059       consider this pattern:
1060
1061         (?|(Sat)ur|(Sun))day
1062
1063       Because  the two alternatives are inside a (?| group, both sets of cap‐
1064       turing parentheses are numbered one. Thus, when  the  pattern  matches,
1065       you  can  look  at captured substring number one, whichever alternative
1066       matched. This construct is useful when you want to  capture  part,  but
1067       not all, of one of a number of alternatives. Inside a (?| group, paren‐
1068       theses are numbered as usual, but the number is reset at the  start  of
1069       each  branch. The numbers of any capturing buffers that follow the sub‐
1070       pattern start after the highest number used in any branch. The  follow‐
1071       ing  example  is taken from the Perl documentation.  The numbers under‐
1072       neath show in which buffer the captured content will be stored.
1073
1074         # before  ---------------branch-reset----------- after
1075         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
1076         # 1            2         2  3        2     3     4
1077
1078       A back reference to a numbered subpattern uses the  most  recent  value
1079       that  is  set  for that number by any subpattern. The following pattern
1080       matches "abcabc" or "defdef":
1081
1082         /(?|(abc)|(def))\1/
1083
1084       In contrast, a recursive or "subroutine" call to a numbered  subpattern
1085       always  refers  to  the first one in the pattern with the given number.
1086       The following pattern matches "abcabc" or "defabc":
1087
1088         /(?|(abc)|(def))(?1)/
1089
1090       If a condition test for a subpattern's having matched refers to a  non-
1091       unique  number, the test is true if any of the subpatterns of that num‐
1092       ber have matched.
1093
1094       An alternative approach to using this "branch reset" feature is to  use
1095       duplicate named subpatterns, as described in the next section.
1096

NAMED SUBPATTERNS

1098
1099       Identifying  capturing  parentheses  by number is simple, but it can be
1100       very hard to keep track of the numbers in complicated  regular  expres‐
1101       sions.  Furthermore,  if  an  expression  is  modified, the numbers may
1102       change. To help with this difficulty, PCRE supports the naming of  sub‐
1103       patterns. This feature was not added to Perl until release 5.10. Python
1104       had the feature earlier, and PCRE introduced it at release  4.0,  using
1105       the  Python syntax. PCRE now supports both the Perl and the Python syn‐
1106       tax. Perl allows identically numbered  subpatterns  to  have  different
1107       names, but PCRE does not.
1108
1109       In  PCRE,  a subpattern can be named in one of three ways: (?<name>...)
1110       or (?'name'...) as in Perl, or (?P<name>...) as in  Python.  References
1111       to  capturing parentheses from other parts of the pattern, such as back
1112       references, recursion, and conditions, can be made by name as  well  as
1113       by number.
1114
1115       Names  consist  of  up  to  32 alphanumeric characters and underscores.
1116       Named capturing parentheses are still  allocated  numbers  as  well  as
1117       names,  exactly as if the names were not present. The PCRE API provides
1118       function calls for extracting the name-to-number translation table from
1119       a compiled pattern. There is also a convenience function for extracting
1120       a captured substring by name.
1121
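       A sketch of named subpatterns, assuming Python's re module as a
       stand-in. It accepts only the (?P<name>...) spelling and does not
       allow duplicate names. Its groupindex attribute is comparable to
       the name-to-number table mentioned above. The pattern and names are
       illustrative.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d\d)")

# Named groups are still numbered as if the names were absent.
name_to_number = dict(date.groupindex)

found = date.fullmatch("2011-01")
by_name = found.group("year")
by_number = found.group(1)

# A back reference can be made by name as well as by number.
repeated = [s for s in ("defdef", "abcdef")
            if re.fullmatch(r"(?P<w>abc|def)(?P=w)", s)]
```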
1122       By default, a name must be unique within a pattern, but it is  possible
1123       to relax this constraint by setting the PCRE_DUPNAMES option at compile
1124       time. (Duplicate names are also always permitted for  subpatterns  with
1125       the  same  number, set up as described in the previous section.) Dupli‐
1126       cate names can be useful for patterns where only one  instance  of  the
1127       named  parentheses  can  match. Suppose you want to match the name of a
1128       weekday, either as a 3-letter abbreviation or as the full name, and  in
1129       both cases you want to extract the abbreviation. This pattern (ignoring
1130       the line breaks) does the job:
1131
1132         (?<DN>Mon|Fri|Sun)(?:day)?|
1133         (?<DN>Tue)(?:sday)?|
1134         (?<DN>Wed)(?:nesday)?|
1135         (?<DN>Thu)(?:rsday)?|
1136         (?<DN>Sat)(?:urday)?
1137
1138       There are five capturing substrings, but only one is ever set  after  a
1139       match.  (An alternative way of solving this problem is to use a "branch
1140       reset" subpattern, as described in the previous section.)
1141
1142       The convenience function for extracting the data by  name  returns  the
1143       substring  for  the first (and in this example, the only) subpattern of
1144       that name that matched. This saves searching  to  find  which  numbered
1145       subpattern it was.
1146
1147       If  you  make  a  back  reference to a non-unique named subpattern from
1148       elsewhere in the pattern, the one that corresponds to the first  occur‐
1149       rence of the name is used. In the absence of duplicate numbers (see the
1150       previous section) this is the one with the lowest number. If you use  a
1151       named  reference  in a condition test (see the section about conditions
1152       below), either to check whether a subpattern has matched, or  to  check
1153       for  recursion,  all  subpatterns with the same name are tested. If the
1154       condition is true for any one of them, the overall condition  is  true.
1155       This is the same behaviour as testing by number. For further details of
1156       the interfaces for handling named subpatterns, see the pcreapi documen‐
1157       tation.
1158
1159       Warning: You cannot use different names to distinguish between two sub‐
1160       patterns with the same number because PCRE uses only the  numbers  when
1161       matching. For this reason, an error is given at compile time if differ‐
1162       ent names are given to subpatterns with the same number.  However,  you
1163       can  give  the same name to subpatterns with the same number, even when
1164       PCRE_DUPNAMES is not set.
1165

REPETITION

1167
1168       Repetition is specified by quantifiers, which can  follow  any  of  the
1169       following items:
1170
1171         a literal data character
1172         the dot metacharacter
1173         the \C escape sequence
1174         the \X escape sequence (in UTF-8 mode with Unicode properties)
1175         the \R escape sequence
1176         an escape such as \d that matches a single character
1177         a character class
1178         a back reference (see next section)
1179         a parenthesized subpattern (unless it is an assertion)
1180         a recursive or "subroutine" call to a subpattern
1181
1182       The  general repetition quantifier specifies a minimum and maximum num‐
1183       ber of permitted matches, by giving the two numbers in  curly  brackets
1184       (braces),  separated  by  a comma. The numbers must be less than 65536,
1185       and the first must be less than or equal to the second. For example:
1186
1187         z{2,4}
1188
1189       matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a
1190       special  character.  If  the second number is omitted, but the comma is
1191       present, there is no upper limit; if the second number  and  the  comma
1192       are  both omitted, the quantifier specifies an exact number of required
1193       matches. Thus
1194
1195         [aeiou]{3,}
1196
1197       matches at least 3 successive vowels, but may match many more, while
1198
1199         \d{8}
1200
1201       matches exactly 8 digits. An opening curly bracket that  appears  in  a
1202       position  where a quantifier is not allowed, or one that does not match
1203       the syntax of a quantifier, is taken as a literal character. For  exam‐
1204       ple, {,6} is not a quantifier, but a literal string of four characters.
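
       The three brace forms above can be tried out with a short sketch.  It
       uses Python's re module purely as a stand-in, which is an assumption
       made for illustration: re is not PCRE and differs in places (for
       instance, it accepts {,6} as a quantifier), but it agrees with PCRE on
       the patterns shown here. The helper name full() is hypothetical.

```python
import re

# Python's re is used here only as a stand-in for PCRE.
def full(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# z{2,4}: between two and four repetitions
bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if full(r"z{2,4}", s)]

# [aeiou]{3,}: at least three, no upper limit
open_ended = full(r"[aeiou]{3,}", "aeiouaeiou")

# \d{8}: exactly eight
exact = (full(r"\d{8}", "12345678"), full(r"\d{8}", "1234567"))

print(bounded, open_ended, exact)
```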
1205
1206       In  UTF-8  mode,  quantifiers  apply to UTF-8 characters rather than to
1207       individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char‐
1208       acters, each of which is represented by a two-byte sequence. Similarly,
1209       when Unicode property support is available, \X{3} matches three Unicode
1210       extended  sequences,  each of which may be several bytes long (and they
1211       may be of different lengths).
1212
1213       The quantifier {0} is permitted, causing the expression to behave as if
1214       the previous item and the quantifier were not present. This may be use‐
1215       ful for subpatterns that are referenced as subroutines  from  elsewhere
1216       in the pattern. Items other than subpatterns that have a {0} quantifier
1217       are omitted from the compiled pattern.
1218
1219       For convenience, the three most common quantifiers have  single-charac‐
1220       ter abbreviations:
1221
1222         *    is equivalent to {0,}
1223         +    is equivalent to {1,}
1224         ?    is equivalent to {0,1}
1225
1226       It  is  possible  to construct infinite loops by following a subpattern
1227       that can match no characters with a quantifier that has no upper limit,
1228       for example:
1229
1230         (a?)*
1231
1232       Earlier versions of Perl and PCRE used to give an error at compile time
1233       for such patterns. However, because there are cases where this  can  be
1234       useful,  such  patterns  are now accepted, but if any repetition of the
1235       subpattern does in fact match no characters, the loop is forcibly  bro‐
1236       ken.
1237
1238       By  default,  the quantifiers are "greedy", that is, they match as much
1239       as possible (up to the maximum  number  of  permitted  times),  without
1240       causing  the  rest of the pattern to fail. The classic example of where
1241       this gives problems is in trying to match comments in C programs. These
1242       appear between /* and */, and individual * and / characters may  appear
1243       within  the  comment.  An  attempt to match C comments by applying the
1244       pattern
1245
1246         /\*.*\*/
1247
1248       to the string
1249
1250         /* first comment */  not comment  /* second comment */
1251
1252       fails,  because it matches the entire string owing to the greediness of
1253       the .*  item.
1254
1255       However, if a quantifier is followed by a question mark, it  ceases  to
1256       be greedy, and instead matches the minimum number of times possible, so
1257       the pattern
1258
1259         /\*.*?\*/
1260
1261       does the right thing with the C comments. The meaning  of  the  various
1262       quantifiers  is  not  otherwise  changed,  just the preferred number of
1263       matches.  Do not confuse this use of question mark with its  use  as  a
1264       quantifier  in its own right. Because it has two uses, it can sometimes
1265       appear doubled, as in
1266
1267         \d??\d
1268
1269       which matches one digit by preference, but can match two if that is the
1270       only way the rest of the pattern matches.
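
       The difference between greedy and minimizing repetition can be seen in
       a small sketch. Python's re module is assumed here as a stand-in for
       PCRE; the two behave alike for these particular patterns.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy: .* runs on to the last */ in the subject.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing: .*? stops at the first */ it can.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
one = re.match(r"\d??\d", "12").group()
two = re.fullmatch(r"\d??\d", "12").group()

print(greedy, lazy, one, two)
```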
1271
1272       If  the PCRE_UNGREEDY option is set (an option that is not available in
1273       Perl), the quantifiers are not greedy by default, but  individual  ones
1274       can  be  made  greedy  by following them with a question mark. In other
1275       words, it inverts the default behaviour.
1276
1277       When a parenthesized subpattern is quantified  with  a  minimum  repeat
1278       count  that is greater than 1 or with a limited maximum, more memory is
1279       required for the compiled pattern, in proportion to  the  size  of  the
1280       minimum or maximum.
1281
1282       If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv‐
1283       alent to Perl's /s) is set, thus allowing the dot  to  match  newlines,
1284       the  pattern  is  implicitly anchored, because whatever follows will be
1285       tried against every character position in the subject string, so  there
1286       is  no  point  in  retrying the overall match at any position after the
1287       first. PCRE normally treats such a pattern as though it  were  preceded
1288       by \A.
1289
1290       In  cases  where  it  is known that the subject string contains no new‐
1291       lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti‐
1292       mization, or alternatively using ^ to indicate anchoring explicitly.
1293
1294       However,  there is one situation where the optimization cannot be used.
1295       When .*  is inside capturing parentheses that are the subject of a back
1296       reference elsewhere in the pattern, a match at the start may fail where
1297       a later one succeeds. Consider, for example:
1298
1299         (.*)abc\1
1300
1301       If the subject is "xyz123abc123" the match point is the fourth  charac‐
1302       ter. For this reason, such a pattern is not implicitly anchored.
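
       The match point in this example can be confirmed with a brief sketch.
       It assumes Python's re module as a stand-in; re does not perform
       PCRE's implicit anchoring, so the sketch shows only where the match
       lies, not the optimization itself.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

start = m.start()      # zero-based offset of the match
whole = m.group(0)     # the text that matched
captured = m.group(1)  # what (.*) held when the match succeeded

print(start, whole, captured)
```

       An offset of 3 is the fourth character of the subject, so a matcher
       that tried only position 0 would miss this match.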
1303
1304       When a capturing subpattern is repeated, the value captured is the sub‐
1305       string that matched the final iteration. For example, after
1306
1307         (tweedle[dume]{3}\s*)+
1308
1309       has matched "tweedledum tweedledee" the value of the captured substring
1310       is  "tweedledee".  However,  if there are nested capturing subpatterns,
1311       the corresponding captured values may have been set in previous  itera‐
1312       tions. For example, after
1313
1314         /(a|(b))+/
1315
1316       matches "aba" the value of the second captured substring is "b".
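
       Both captured values described above can be inspected with a short
       sketch. Python's re module is assumed as a stand-in for PCRE; it
       reports the same values for these two patterns.

```python
import re

# The outer group keeps the text of its final iteration.
m1 = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)

# The nested group keeps a value set in an earlier iteration.
m2 = re.match(r"(a|(b))+", "aba")
outer, inner = m2.group(1), m2.group(2)

print(last_iteration, outer, inner)
```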
1317

ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

1319
1320       With  both  maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
1321       repetition, failure of what follows normally causes the  repeated  item
1322       to  be  re-evaluated to see if a different number of repeats allows the
1323       rest of the pattern to match. Sometimes it is useful to  prevent  this,
1324       either to change the nature of the match, or to cause it to fail earlier
1325       than it otherwise might, when the author of the pattern knows there  is
1326       no point in carrying on.
1327
1328       Consider,  for  example, the pattern \d+foo when applied to the subject
1329       line
1330
1331         123456bar
1332
1333       After matching all 6 digits and then failing to match "foo", the normal
1334       action  of  the matcher is to try again with only 5 digits matching the
1335       \d+ item, and then with  4,  and  so  on,  before  ultimately  failing.
1336       "Atomic  grouping"  (a  term taken from Jeffrey Friedl's book) provides
1337       the means for specifying that once a subpattern has matched, it is  not
1338       to be re-evaluated in this way.
1339
1340       If  we  use atomic grouping for the previous example, the matcher gives
1341       up immediately on failing to match "foo" the first time.  The  notation
1342       is a kind of special parenthesis, starting with (?> as in this example:
1343
1344         (?>\d+)foo
1345
1346       This  kind  of  parenthesis "locks up" the  part of the pattern it con‐
1347       tains once it has matched, and a failure further into  the  pattern  is
1348       prevented  from  backtracking into it. Backtracking past it to previous
1349       items, however, works as normal.
1350
1351       An alternative description is that a subpattern of  this  type  matches
1352       the  string  of  characters  that an identical standalone pattern would
1353       match, if anchored at the current point in the subject string.
1354
1355       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
1356       such as the above example can be thought of as a maximizing repeat that
1357       must swallow everything it can. So, while both \d+ and  \d+?  are  pre‐
1358       pared  to  adjust  the number of digits they match in order to make the
1359       rest of the pattern match, (?>\d+) can only match an entire sequence of
1360       digits.
1361
1362       Atomic  groups in general can of course contain arbitrarily complicated
1363       subpatterns, and can be nested. However, when  the  subpattern  for  an
1364       atomic group is just a single repeated item, as in the example above, a
1365       simpler notation, called a "possessive quantifier" can  be  used.  This
1366       consists  of  an  additional  + character following a quantifier. Using
1367       this notation, the previous example can be rewritten as
1368
1369         \d++foo
1370
1371       Note that a possessive quantifier can be used with an entire group, for
1372       example:
1373
1374         (abc|xyz){2,3}+
1375
1376       Possessive   quantifiers   are   always  greedy;  the  setting  of  the
1377       PCRE_UNGREEDY option is ignored. They are a convenient notation for the
1378       simpler  forms  of atomic group. However, there is no difference in the
1379       meaning of a possessive quantifier and  the  equivalent  atomic  group,
1380       though  there  may  be a performance difference; possessive quantifiers
1381       should be slightly faster.
1382
1383       The possessive quantifier syntax is an extension to the Perl  5.8  syn‐
1384       tax.   Jeffrey  Friedl  originated the idea (and the name) in the first
1385       edition of his book. Mike McCloskey liked it, so implemented it when he
1386       built  Sun's Java package, and PCRE copied it from there. It ultimately
1387       found its way into Perl at release 5.10.
1388
1389       PCRE has an optimization that automatically "possessifies" certain sim‐
1390       ple  pattern  constructs.  For  example, the sequence A+B is treated as
1391       A++B because there is no point in backtracking into a sequence  of  A's
1392       when B must follow.
1393
1394       When  a  pattern  contains an unlimited repeat inside a subpattern that
1395       can itself be repeated an unlimited number of  times,  the  use  of  an
1396       atomic  group  is  the  only way to avoid some failing matches taking a
1397       very long time indeed. The pattern
1398
1399         (\D+|<\d+>)*[!?]
1400
1401       matches an unlimited number of substrings that either consist  of  non-
1402       digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
1403       matches, it runs quickly. However, if it is applied to
1404
1405         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1406
1407       it takes a long time before reporting  failure.  This  is  because  the
1408       string  can be divided between the internal \D+ repeat and the external
1409       * repeat in a large number of ways, and all  have  to  be  tried.  (The
1410       example  uses  [!?]  rather than a single character at the end, because
1411       both PCRE and Perl have an optimization that allows  for  fast  failure
1412       when  a single character is used. They remember the last single charac‐
1413       ter that is required for a match, and fail early if it is  not  present
1414       in  the  string.)  If  the pattern is changed so that it uses an atomic
1415       group, like this:
1416
1417         ((?>\D+)|<\d+>)*[!?]
1418
1419       sequences of non-digits cannot be broken, and failure happens quickly.
1420

BACK REFERENCES

1422
1423       Outside a character class, a backslash followed by a digit greater than
1424       0 (and possibly further digits) is a back reference to a capturing sub‐
1425       pattern earlier (that is, to its left) in the pattern,  provided  there
1426       have been that many previous capturing left parentheses.
1427
1428       However, if the decimal number following the backslash is less than 10,
1429       it is always taken as a back reference, and causes  an  error  only  if
1430       there  are  not that many capturing left parentheses in the entire pat‐
1431       tern. In other words, the parentheses that are referenced need  not  be
1432       to  the left of the reference for numbers less than 10. A "forward back
1433       reference" of this type can make sense when a  repetition  is  involved
1434       and  the  subpattern to the right has participated in an earlier itera‐
1435       tion.
1436
1437       It is not possible to have a numerical "forward back  reference"  to  a
1438       subpattern  whose  number  is  10  or  more using this syntax because a
1439       sequence such as \50 is interpreted as a character  defined  in  octal.
1440       See the subsection entitled "Non-printing characters" above for further
1441       details of the handling of digits following a backslash.  There  is  no
1442       such  problem  when named parentheses are used. A back reference to any
1443       subpattern is possible using named parentheses (see below).
1444
1445       Another way of avoiding the ambiguity inherent in  the  use  of  digits
1446       following a backslash is to use the \g escape sequence, which is a fea‐
1447       ture introduced in Perl 5.10.  This  escape  must  be  followed  by  an
1448       unsigned  number  or  a negative number, optionally enclosed in braces.
1449       These examples are all identical:
1450
1451         (ring), \1
1452         (ring), \g1
1453         (ring), \g{1}
1454
1455       An unsigned number specifies an absolute reference without the  ambigu‐
1456       ity that is present in the older syntax. It is also useful when literal
1457       digits follow the reference. A negative number is a relative reference.
1458       Consider this example:
1459
1460         (abc(def)ghi)\g{-1}
1461
1462       The sequence \g{-1} is a reference to the most recently started
1463       capturing subpattern before \g, that is, it is equivalent to \2. Similarly,
1464       \g{-2} would be equivalent to \1. The use of relative references can be
1465       helpful in long patterns, and also in  patterns  that  are  created  by
1466       joining together fragments that contain references within themselves.
1467
1468       A  back  reference matches whatever actually matched the capturing sub‐
1469       pattern in the current subject string, rather  than  anything  matching
1470       the subpattern itself (see "Subpatterns as subroutines" below for a way
1471       of doing that). So the pattern
1472
1473         (sens|respons)e and \1ibility
1474
1475       matches "sense and sensibility" and "response and responsibility",  but
1476       not  "sense and responsibility". If caseful matching is in force at the
1477       time of the back reference, the case of letters is relevant. For  exam‐
1478       ple,
1479
1480         ((?i)rah)\s+\1
1481
1482       matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
1483       original capturing subpattern is matched caselessly.
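
       That a back reference matches the captured text rather than the
       subpattern can be checked with a short sketch, assuming Python's re
       module as a stand-in for PCRE.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",   # \1 holds "sens", so "respons" cannot match
]
results = [pattern.fullmatch(s) is not None for s in subjects]

print(results)
```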
1484
1485       There are several different ways of writing back  references  to  named
1486       subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or
1487       \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's
1488       unified back reference syntax, in which \g can be used for both numeric
1489       and named references, is also supported. We  could  rewrite  the  above
1490       example in any of the following ways:
1491
1492         (?<p1>(?i)rah)\s+\k<p1>
1493         (?'p1'(?i)rah)\s+\k{p1}
1494         (?P<p1>(?i)rah)\s+(?P=p1)
1495         (?<p1>(?i)rah)\s+\g{p1}
1496
1497       A  subpattern  that  is  referenced  by  name may appear in the pattern
1498       before or after the reference.
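
       The Python-style named form can be exercised directly in Python's re
       module, which is assumed here as a stand-in for PCRE. One adjustment
       is needed: recent Python versions reject (?i) in the middle of a
       pattern, so the sketch uses the scoped form (?i:rah), which is taken
       to be equivalent for this example.

```python
import re

# (?i:...) limits caseless matching to the group, standing in for ((?i)rah).
pattern = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

subjects = ["rah rah", "RAH RAH", "RAH rah"]
results = [pattern.fullmatch(s) is not None for s in subjects]

print(results)
```

       The last subject fails because the back reference is matched
       casefully even though the group itself was matched caselessly.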
1499
1500       There may be more than one back reference to the same subpattern. If  a
1501       subpattern  has  not actually been used in a particular match, any back
1502       references to it always fail by default. For example, the pattern
1503
1504         (a|(bc))\2
1505
1506       always fails if it starts to match "a" rather than  "bc".  However,  if
1507       the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer‐
1508       ence to an unset value matches an empty string.
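
       The default behaviour for an unset group can be sketched as follows.
       Python's re module is assumed as a stand-in; it behaves like PCRE
       without PCRE_JAVASCRIPT_COMPAT here, and the sketch does not model
       that option.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# The first alternative matches "a", group 2 stays unset, and \2 fails.
took_a = pattern.match("a")

# The second alternative sets group 2, so \2 can match a second "bc".
took_bc = pattern.match("bcbc")

print(took_a, took_bc.group())
```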
1509
1510       Because there may be many capturing parentheses in a pattern, all  dig‐
1511       its  following a backslash are taken as part of a potential back refer‐
1512       ence number.  If the pattern continues with  a  digit  character,  some
1513       delimiter  must  be  used  to  terminate  the  back  reference.  If the
1514       PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{
1515       syntax or an empty comment (see "Comments" below) can be used.
1516
1517   Recursive back references
1518
1519       A  back reference that occurs inside the parentheses to which it refers
1520       fails when the subpattern is first used, so, for example,  (a\1)  never
1521       matches.   However,  such references can be useful inside repeated sub‐
1522       patterns. For example, the pattern
1523
1524         (a|b\1)+
1525
1526       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter‐
1527       ation  of  the  subpattern,  the  back  reference matches the character
1528       string corresponding to the previous iteration. In order  for  this  to
1529       work,  the  pattern must be such that the first iteration does not need
1530       to match the back reference. This can be done using alternation, as  in
1531       the example above, or by a quantifier with a minimum of zero.
1532
1533       Back  references of this type cause the group that they reference to be
1534       treated as an atomic group.  Once the whole group has been  matched,  a
1535       subsequent  matching  failure cannot cause backtracking into the middle
1536       of the group.
1537

ASSERTIONS

1539
1540       An assertion is a test on the characters  following  or  preceding  the
1541       current  matching  point that does not actually consume any characters.
1542       The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
1543       described above.
1544
1545       More  complicated  assertions  are  coded as subpatterns. There are two
1546       kinds: those that look ahead of the current  position  in  the  subject
1547       string,  and  those  that  look  behind  it. An assertion subpattern is
1548       matched in the normal way, except that it does not  cause  the  current
1549       matching position to be changed.
1550
1551       Assertion  subpatterns  are  not  capturing subpatterns, and may not be
1552       repeated, because it makes no sense to assert the  same  thing  several
1553       times.  If  any kind of assertion contains capturing subpatterns within
1554       it, these are counted for the purposes of numbering the capturing  sub‐
1555       patterns in the whole pattern.  However, substring capturing is carried
1556       out only for positive assertions, because it does not  make  sense  for
1557       negative assertions.
1558
1559   Lookahead assertions
1560
1561       Lookahead assertions start with (?= for positive assertions and (?! for
1562       negative assertions. For example,
1563
1564         \w+(?=;)
1565
1566       matches a word followed by a semicolon, but does not include the  semi‐
1567       colon in the match, and
1568
1569         foo(?!bar)
1570
1571       matches  any  occurrence  of  "foo" that is not followed by "bar". Note
1572       that the apparently similar pattern
1573
1574         (?!foo)bar
1575
1576       does not find an occurrence of "bar"  that  is  preceded  by  something
1577       other  than "foo"; it finds any occurrence of "bar" whatsoever, because
1578       the assertion (?!foo) is always true when the next three characters are
1579       "bar". A lookbehind assertion is needed to achieve the other effect.
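
       The three lookahead patterns above can be compared in a short sketch,
       assuming Python's re module as a stand-in for PCRE. The subject
       strings are invented for illustration.

```python
import re

# The semicolon is required but is not part of the match.
words = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# Only the "foo" that is not followed by "bar" is found.
not_followed = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar finds every "bar", including the one preceded by "foo".
any_bar = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar xbar")]

print(words, not_followed, any_bar)
```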
1580
1581       If you want to force a matching failure at some point in a pattern, the
1582       most convenient way to do it is  with  (?!)  because  an  empty  string
1583       always  matches, so an assertion that requires there not to be an empty
1584       string must always fail.   The  Perl  5.10  backtracking  control  verb
1585       (*FAIL) or (*F) is essentially a synonym for (?!).
1586
1587   Lookbehind assertions
1588
1589       Lookbehind  assertions start with (?<= for positive assertions and (?<!
1590       for negative assertions. For example,
1591
1592         (?<!foo)bar
1593
1594       does find an occurrence of "bar" that is not  preceded  by  "foo".  The
1595       contents  of  a  lookbehind  assertion are restricted such that all the
1596       strings it matches must have a fixed length. However, if there are sev‐
1597       eral  top-level  alternatives,  they  do  not all have to have the same
1598       fixed length. Thus
1599
1600         (?<=bullock|donkey)
1601
1602       is permitted, but
1603
1604         (?<!dogs?|cats?)
1605
1606       causes an error at compile time. Branches that match  different  length
1607       strings  are permitted only at the top level of a lookbehind assertion.
1608       This is an extension compared with Perl (5.8 and 5.10), which  requires
1609       all branches to match the same length of string. An assertion such as
1610
1611         (?<=ab(c|de))
1612
1613       is  not  permitted,  because  its single top-level branch can match two
1614       different lengths, but it is acceptable to PCRE if rewritten to use two
1615       top-level branches:
1616
1617         (?<=abc|abde)
1618
1619       In some cases, the Perl 5.10 escape sequence \K (see above) can be used
1620       instead of  a  lookbehind  assertion  to  get  round  the  fixed-length
1621       restriction.
1622
1623       The  implementation  of lookbehind assertions is, for each alternative,
1624       to temporarily move the current position back by the fixed  length  and
1625       then try to match. If there are insufficient characters before the cur‐
1626       rent position, the assertion fails.
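
       A sketch of lookbehind in practice follows, assuming Python's re
       module as a stand-in. Python is stricter than PCRE in that every
       alternative in a lookbehind must have the same length, so only
       single-length assertions are used here.

```python
import re

negative = re.compile(r"(?<!foo)bar")

# Offsets of each "bar" that is not preceded by "foo".
starts = [m.start() for m in negative.finditer("foobar bazbar bar")]

# With fewer than three characters before "bar", a positive lookbehind
# fails, and the negative one therefore succeeds.
positive_at_start = re.search(r"(?<=foo)bar", "bar")
negative_at_start = negative.match("bar")

print(starts, positive_at_start, negative_at_start is not None)
```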
1627
1628       PCRE does not allow the \C escape (which matches a single byte in UTF-8
1629       mode)  to appear in lookbehind assertions, because it makes it impossi‐
1630       ble to calculate the length of the lookbehind. The \X and  \R  escapes,
1631       which can match different numbers of bytes, are also not permitted.
1632
1633       "Subroutine"  calls  (see below) such as (?2) or (?&X) are permitted in
1634       lookbehinds, as long as the subpattern matches a  fixed-length  string.
1635       Recursion, however, is not supported.
1636
1637       Possessive  quantifiers  can  be  used  in  conjunction with lookbehind
1638       assertions to specify efficient matching of fixed-length strings at the
1639       end of subject strings. Consider a simple pattern such as
1640
1641         abcd$
1642
1643       when  applied  to  a  long string that does not match. Because matching
1644       proceeds from left to right, PCRE will look for each "a" in the subject
1645       and  then  see  if what follows matches the rest of the pattern. If the
1646       pattern is specified as
1647
1648         ^.*abcd$
1649
1650       the initial .* matches the entire string at first, but when this  fails
1651       (because there is no following "a"), it backtracks to match all but the
1652       last character, then all but the last two characters, and so  on.  Once
1653       again  the search for "a" covers the entire string, from right to left,
1654       so we are no better off. However, if the pattern is written as
1655
1656         ^.*+(?<=abcd)
1657
1658       there can be no backtracking for the .*+ item; it can  match  only  the
1659       entire  string.  The subsequent lookbehind assertion does a single test
1660       on the last four characters. If it fails, the match fails  immediately.
1661       For  long  strings, this approach makes a significant difference to the
1662       processing time.
1663
1664   Using multiple assertions
1665
1666       Several assertions (of any sort) may occur in succession. For example,
1667
1668         (?<=\d{3})(?<!999)foo
1669
1670       matches "foo" preceded by three digits that are not "999". Notice  that
1671       each  of  the  assertions is applied independently at the same point in
1672       the subject string. First there is a  check  that  the  previous  three
1673       characters  are  all  digits,  and  then there is a check that the same
1674       three characters are not "999". This pattern does not match "foo"
1675       preceded by six characters, the first three of which are digits and the
1676       last three of which are not "999". For example, it does not match
1677       "123abcfoo". A pattern to do that is
1678
1679         (?<=\d{3}...)(?<!999)foo
1680
1681       This  time  the  first assertion looks at the preceding six characters,
1682       checking that the first three are digits, and then the second assertion
1683       checks that the preceding three characters are not "999".
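
       The two patterns can be compared side by side in a short sketch,
       assuming Python's re module as a stand-in for PCRE. The variable names
       are hypothetical.

```python
import re

# Both assertions look at the same three preceding characters.
same_point = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks six characters back, the second only three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

def found(pattern, subject):
    return pattern.search(subject) is not None

results = {
    "same 123foo": found(same_point, "123foo"),
    "same 999foo": found(same_point, "999foo"),
    "same 123abcfoo": found(same_point, "123abcfoo"),
    "six 123abcfoo": found(six_back, "123abcfoo"),
    "six 123999foo": found(six_back, "123999foo"),
}
print(results)
```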
1684
1685       Assertions can be nested in any combination. For example,
1686
1687         (?<=(?<!foo)bar)baz
1688
1689       matches  an occurrence of "baz" that is preceded by "bar" which in turn
1690       is not preceded by "foo", while
1691
1692         (?<=\d{3}(?!999)...)foo
1693
1694       is another pattern that matches "foo" preceded by three digits and  any
1695       three characters that are not "999".
1696

CONDITIONAL SUBPATTERNS

1698
1699       It  is possible to cause the matching process to obey a subpattern con‐
1700       ditionally or to choose between two alternative subpatterns,  depending
1701       on  the result of an assertion, or whether a specific capturing subpat‐
1702       tern has already been matched. The two possible  forms  of  conditional
1703       subpattern are:
1704
1705         (?(condition)yes-pattern)
1706         (?(condition)yes-pattern|no-pattern)
1707
1708       If  the  condition is satisfied, the yes-pattern is used; otherwise the
1709       no-pattern (if present) is used. If there are more  than  two  alterna‐
1710       tives in the subpattern, a compile-time error occurs.
1711
1712       There  are  four  kinds of condition: references to subpatterns, refer‐
1713       ences to recursion, a pseudo-condition called DEFINE, and assertions.
1714
1715   Checking for a used subpattern by number
1716
1717       If the text between the parentheses consists of a sequence  of  digits,
1718       the condition is true if a capturing subpattern of that number has pre‐
1719       viously matched. If there is more than one  capturing  subpattern  with
1720       the  same  number  (see  the earlier section about duplicate subpattern
1721       numbers), the condition is true if any of them have been set. An alter‐
1722       native  notation is to precede the digits with a plus or minus sign. In
1723       this case, the subpattern number is relative rather than absolute.  The
1724       most  recently opened parentheses can be referenced by (?(-1), the next
1725       most recent by (?(-2), and so on. In looping  constructs  it  can  also
1726       make  sense  to  refer  to  subsequent  groups  with constructs such as
1727       (?(+2).
1728
1729       Consider the following pattern, which  contains  non-significant  white
1730       space to make it more readable (assume the PCRE_EXTENDED option) and to
1731       divide it into three parts for ease of discussion:
1732
1733         ( \( )?    [^()]+    (?(1) \) )
1734
1735       The first part matches an optional opening  parenthesis,  and  if  that
1736       character is present, sets it as the first captured substring. The sec‐
1737       ond part matches one or more characters that are not  parentheses.  The
1738       third part is a conditional subpattern that tests whether the first set
1739       of parentheses matched or not. If they did, that is, if the subject
1740       started with an opening parenthesis, the condition is true, and so the
1741       yes-pattern is executed and a closing parenthesis is required. Otherwise,
1742       since  no-pattern  is  not  present, the subpattern matches nothing. In
1743       other words,  this  pattern  matches  a  sequence  of  non-parentheses,
1744       optionally enclosed in parentheses.
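
       The conditional pattern can be tried in a short sketch. It assumes
       Python's re module as a stand-in for PCRE, with re.VERBOSE playing the
       part of PCRE_EXTENDED; re accepts the same (?(1)...) syntax.

```python
import re

# re.VERBOSE ignores the white space, as PCRE_EXTENDED does.
pattern = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

# fullmatch is used so that an unbalanced parenthesis cannot be skipped over.
results = {
    s: pattern.fullmatch(s) is not None
    for s in ("(abc)", "abc", "(abc", "abc)")
}
print(results)
```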
1745
1746       If  you  were  embedding  this pattern in a larger one, you could use a
1747       relative reference:
1748
1749         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
1750
1751       This makes the fragment independent of the parentheses  in  the  larger
1752       pattern.
1753
1754   Checking for a used subpattern by name
1755
1756       Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
1757       used subpattern by name. For compatibility  with  earlier  versions  of
1758       PCRE,  which  had this facility before Perl, the syntax (?(name)...) is
1759       also recognized. However, there is a possible ambiguity with this  syn‐
1760       tax,  because  subpattern  names  may  consist entirely of digits. PCRE
1761       looks first for a named subpattern; if it cannot find one and the  name
1762       consists  entirely  of digits, PCRE looks for a subpattern of that num‐
1763       ber, which must be greater than zero. Using subpattern names that  con‐
1764       sist entirely of digits is not recommended.
1765
1766       Rewriting the above example to use a named subpattern gives this:
1767
1768         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
1769
1770       If  the  name used in a condition of this kind is a duplicate, the test
1771       is applied to all subpatterns of the same name, and is true if any  one
1772       of them has matched.
1773
1774   Checking for pattern recursion
1775
1776       If the condition is the string (R), and there is no subpattern with the
1777       name R, the condition is true if a recursive call to the whole  pattern
1778       or any subpattern has been made. If digits or a name preceded by amper‐
1779       sand follow the letter R, for example:
1780
1781         (?(R3)...) or (?(R&name)...)
1782
1783       the condition is true if the most recent recursion is into a subpattern
1784       whose number or name is given. This condition does not check the entire
1785       recursion stack. If the name used in a condition  of  this  kind  is  a
1786       duplicate, the test is applied to all subpatterns of the same name, and
1787       is true if any one of them is the most recent recursion.
1788
1789       At "top level", all these recursion test  conditions  are  false.   The
1790       syntax for recursive patterns is described below.
1791
1792   Defining subpatterns for use by reference only
1793
1794       If  the  condition  is  the string (DEFINE), and there is no subpattern
1795       with the name DEFINE, the condition is  always  false.  In  this  case,
1796       there  may  be  only  one  alternative  in the subpattern. It is always
1797       skipped if control reaches this point  in  the  pattern;  the  idea  of
1798       DEFINE  is that it can be used to define "subroutines" that can be ref‐
1799       erenced from elsewhere. (The use of "subroutines" is described  below.)
1800       For  example,  a pattern to match an IPv4 address could be written like
1801       this (ignore whitespace and line breaks):
1802
1803         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
1804         \b (?&byte) (\.(?&byte)){3} \b
1805
1806       The first part of the pattern is a DEFINE group inside  which  another
1807       group  named "byte" is defined. This matches an individual component of
1808       an IPv4 address (a number less than 256). When  matching  takes  place,
1809       this  part  of  the pattern is skipped because DEFINE acts like a false
1810       condition. The rest of the pattern uses references to the  named  group
1811       to  match the four dot-separated components of an IPv4 address, insist‐
1812       ing on a word boundary at each end.
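
       Python's re module has no DEFINE groups or subroutine calls, so the
       sketch below swaps in a different technique: the "byte" fragment is
       kept in a string and spliced into the pattern four times. It
       illustrates the reuse that (?&byte) provides, not how PCRE implements
       it. BYTE, IPV4 and find_ipv4 are hypothetical names.

```python
import re

# Hypothetical stand-in for the named group "byte": one decimal octet, 0 to 255.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Reuse the fragment four times, as (?&byte) does in the PCRE pattern.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)


def find_ipv4(text):
    """Return the first IPv4-looking substring of text, or None."""
    found = IPV4.search(text)
    return found.group(0) if found else None
```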
1813
1814   Assertion conditions
1815
1816       If the condition is not in any of the above  formats,  it  must  be  an
1817       assertion.   This may be a positive or negative lookahead or lookbehind
1818       assertion. Consider  this  pattern,  again  containing  non-significant
1819       white space, and with the two alternatives on the second line:
1820
1821         (?(?=[^a-z]*[a-z])
1822         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
1823
1824       The  condition  is  a  positive  lookahead  assertion  that  matches an
1825       optional sequence of non-letters followed by a letter. In other  words,
1826       it  tests  for the presence of at least one letter in the subject. If a
1827       letter is found, the subject is matched against the first  alternative;
1828       otherwise  it  is  matched  against  the  second.  This pattern matches
1829       strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
1830       letters and dd are digits.
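
       Python's re module does not accept an assertion as a condition, but
       the same decision can be written as two alternatives, each guarded by
       the assertion or its negation. The sketch below uses that rewrite; it
       is an equivalent formulation for experimenting, not PCRE syntax, and
       HAS_LETTER and DATE_LIKE are hypothetical names.

```python
import re

# The condition from the pattern above: some letter occurs ahead in the subject.
HAS_LETTER = r"[^a-z]*[a-z]"

# (?(?=A) X | Y) rewritten as (?=A) X | (?!A) Y, which Python's re accepts.
DATE_LIKE = re.compile(
    rf"(?={HAS_LETTER})\d{{2}}-[a-z]{{3}}-\d{{2}}"
    rf"|(?!{HAS_LETTER})\d{{2}}-\d{{2}}-\d{{2}}"
)


def is_date_like(text):
    """Return True for strings of the form dd-aaa-dd or dd-dd-dd."""
    return DATE_LIKE.fullmatch(text) is not None
```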
1831

COMMENTS

1833
1834       The  sequence (?# marks the start of a comment that continues up to the
1835       next closing parenthesis. Nested parentheses  are  not  permitted.  The
1836       characters  that make up a comment play no part in the pattern matching
1837       at all.
1838
1839       If the PCRE_EXTENDED option is set, an unescaped # character outside  a
1840       character  class  introduces  a  comment  that continues to immediately
1841       after the next newline in the pattern.
1842

RECURSIVE PATTERNS

1844
1845       Consider the problem of matching a string in parentheses, allowing  for
1846       unlimited  nested  parentheses.  Without the use of recursion, the best
1847       that can be done is to use a pattern that  matches  up  to  some  fixed
1848       depth  of  nesting.  It  is not possible to handle an arbitrary nesting
1849       depth.
1850
1851       For some time, Perl has provided a facility that allows regular expres‐
1852       sions  to recurse (amongst other things). It does this by interpolating
1853       Perl code in the expression at run time, and the code can refer to  the
1854       expression itself. A Perl pattern using code interpolation to solve the
1855       parentheses problem can be created like this:
1856
1857         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
1858
1859       The (?p{...}) item interpolates Perl code at run time, and in this case
1860       refers recursively to the pattern in which it appears.
1861
1862       Obviously, PCRE cannot support the interpolation of Perl code. Instead,
1863       it supports special syntax for recursion of  the  entire  pattern,  and
1864       also  for  individual  subpattern  recursion. After its introduction in
1865       PCRE and Python, this kind of  recursion  was  subsequently  introduced
1866       into Perl at release 5.10.
1867
1868       A  special  item  that consists of (? followed by a number greater than
1869       zero and a closing parenthesis is a recursive call of the subpattern of
1870       the  given  number, provided that it occurs inside that subpattern. (If
1871       not, it is a "subroutine" call, which is described  in  the  next  sec‐
1872       tion.)  The special item (?R) or (?0) is a recursive call of the entire
1873       regular expression.
1874
1875       This PCRE pattern solves the nested  parentheses  problem  (assume  the
1876       PCRE_EXTENDED option is set so that white space is ignored):
1877
1878         \( ( [^()]++ | (?R) )* \)
1879
1880       First  it matches an opening parenthesis. Then it matches any number of
1881       substrings which can either be a  sequence  of  non-parentheses,  or  a
1882       recursive  match  of the pattern itself (that is, a correctly parenthe‐
1883       sized substring).  Finally there is a closing parenthesis. Note the use
1884       of a possessive quantifier to avoid backtracking into sequences of non-
1885       parentheses.
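
       The three steps just described can be mirrored by a small hand-written
       recursive function. This is only an illustration of what the pattern
       accepts when anchored at a given offset, not of how PCRE executes it;
       match_parens is a hypothetical name.

```python
def match_parens(s, pos=0):
    """Return the end offset of a balanced group starting at pos, or None."""
    if pos >= len(s) or s[pos] != "(":
        return None  # the pattern must begin with an opening parenthesis
    i = pos + 1
    while i < len(s):
        if s[i] == ")":
            return i + 1  # the final closing parenthesis
        if s[i] == "(":
            end = match_parens(s, i)  # plays the role of (?R)
            if end is None:
                return None
            i = end
        else:
            i += 1  # one more character of a run of non-parentheses
    return None  # ran out of subject before the closing parenthesis
```

       For example, match_parens("(ab(cd)ef)") consumes the whole string,
       whereas match_parens("(a(b)") returns None because the outer
       parenthesis is never closed.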
1886
1887       If this were part of a larger pattern, you would not  want  to  recurse
1888       the entire pattern, so instead you could use this:
1889
1890         ( \( ( [^()]++ | (?1) )* \) )
1891
1892       We  have  put the pattern into parentheses, and caused the recursion to
1893       refer to them instead of the whole pattern.
1894
1895       In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
1896       tricky.  This  is made easier by the use of relative references (a Perl
1897       5.10 feature).  Instead of (?1) in the  pattern  above  you  can  write
1898       (?-2) to refer to the second most recently opened parentheses preceding
1899       the recursion. In other  words,  a  negative  number  counts  capturing
1900       parentheses leftwards from the point at which it is encountered.
1901
1902       It  is  also  possible  to refer to subsequently opened parentheses, by
1903       writing references such as (?+2). However, these  cannot  be  recursive
1904       because  the  reference  is  not inside the parentheses that are refer‐
1905       enced. They are always "subroutine" calls, as  described  in  the  next
1906       section.
1907
1908       An  alternative  approach is to use named parentheses instead. The Perl
1909       syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also
1910       supported. We could rewrite the above example as follows:
1911
1912         (?<pn> \( ( [^()]++ | (?&pn) )* \) )
1913
1914       If  there  is more than one subpattern with the same name, the earliest
1915       one is used.
1916
1917       This particular example pattern that we have been looking  at  contains
1918       nested unlimited repeats, and so the use of a possessive quantifier for
1919       matching strings of non-parentheses is important when applying the pat‐
1920       tern  to  strings  that do not match. For example, when this pattern is
1921       applied to
1922
1923         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1924
1925       it yields "no match" quickly. However, if a  possessive  quantifier  is
1926       not  used, the match runs for a very long time indeed because there are
1927       so many different ways the + and * repeats can carve  up  the  subject,
1928       and all have to be tested before failure can be reported.
1929
1930       At  the  end  of a match, the values of capturing parentheses are those
1931       from the outermost level. If you want to obtain intermediate values,  a
1932       callout  function can be used (see below and the pcrecallout documenta‐
1933       tion). If the pattern above is matched against
1934
1935         (ab(cd)ef)
1936
1937       the value for the inner capturing parentheses  (numbered  2)  is  "ef",
1938       which  is the last value taken on at the top level. If a capturing sub‐
1939       pattern is not matched at the top level, its final value is unset, even
1940       if it is (temporarily) set at a deeper level.
1941
1942       If  there are more than 15 capturing parentheses in a pattern, PCRE has
1943       to obtain extra memory to store data during a recursion, which it  does
1944       by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
1945       can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
1946
1947       Do not confuse the (?R) item with the condition (R),  which  tests  for
1948       recursion.   Consider  this pattern, which matches text in angle brack‐
1949       ets, allowing for arbitrary nesting. Only digits are allowed in  nested
1950       brackets  (that is, when recursing), whereas any characters are permit‐
1951       ted at the outer level.
1952
1953         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
1954
1955       In this pattern, (?(R) is the start of a conditional  subpattern,  with
1956       two  different  alternatives for the recursive and non-recursive cases.
1957       The (?R) item is the actual recursive call.
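
       The role of the (R) condition can be seen in a hand-written
       equivalent, where a flag records whether the function was entered
       through a recursive call. This is a sketch of the pattern's accepted
       language when anchored at a given offset, assuming ASCII digits only;
       match_angle is a hypothetical name.

```python
def match_angle(s, pos=0, recursing=False):
    """Return the end offset of a bracketed group starting at pos, or None."""
    if pos >= len(s) or s[pos] != "<":
        return None
    i = pos + 1
    while i < len(s):
        ch = s[i]
        if ch == ">":
            return i + 1
        if ch == "<":
            end = match_angle(s, i, recursing=True)  # the (?R) call
            if end is None:
                return None
            i = end
        elif recursing and ch not in "0123456789":
            return None  # (R) is true here, so only digits are allowed
        else:
            i += 1  # top level: any character other than < or >
    return None
```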
1958
1959   Recursion difference from Perl
1960
1961       In PCRE (like Python, but unlike Perl), a recursive subpattern call  is
1962       always treated as an atomic group. That is, once it has matched some of
1963       the subject string, it is never re-entered, even if it contains untried
1964       alternatives  and  there  is a subsequent matching failure. This can be
1965       illustrated by the following pattern, which purports to match a  palin‐
1966       dromic  string  that contains an odd number of characters (for example,
1967       "a", "aba", "abcba", "abcdcba"):
1968
1969         ^(.|(.)(?1)\2)$
1970
1971       The idea is that it either matches a single character, or two identical
1972       characters  surrounding  a sub-palindrome. In Perl, this pattern works;
1973       in PCRE it does not if the subject is  longer  than  three  characters.
1974       Consider the subject string "abcba":
1975
1976       At  the  top level, the first character is matched, but as it is not at
1977       the end of the string, the first alternative fails; the second alterna‐
1978       tive is taken and the recursion kicks in. The recursive call to subpat‐
1979       tern 1 successfully matches the next character ("b").  (Note  that  the
1980       beginning and end of line tests are not part of the recursion.)
1981
1982       Back  at  the top level, the next character ("c") is compared with what
1983       subpattern 2 matched, which was "a". This fails. Because the  recursion
1984       is  treated  as  an atomic group, there are now no backtracking points,
1985       and so the entire match fails. (Perl is able, at  this  point,  to  re-
1986       enter  the  recursion  and try the second alternative.) However, if the
1987       pattern is written with the alternatives in the other order, things are
1988       different:
1989
1990         ^((.)(?1)\2|.)$
1991
1992       This  time,  the recursing alternative is tried first, and continues to
1993       recurse until it runs out of characters, at which point  the  recursion
1994       fails.  But  this  time  we  do  have another alternative to try at the
1995       higher level. That is the big difference:  in  the  previous  case  the
1996       remaining alternative is at a deeper recursion level, which PCRE cannot
1997       use.
1998
1999       To change the pattern so that it matches all palindromic strings, not just
2000       those  with  an  odd number of characters, it is tempting to change the
2001       pattern to this:
2002
2003         ^((.)(?1)\2|.?)$
2004
2005       Again, this works in Perl, but not in PCRE, and for  the  same  reason.
2006       When  a  deeper  recursion has matched a single character, it cannot be
2007       entered again in order to match an empty string.  The  solution  is  to
2008       separate  the two cases, and write out the odd and even cases as alter‐
2009       natives at the higher level:
2010
2011         ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
2012
2013       If you want to match typical palindromic phrases, the  pattern  has  to
2014       ignore all non-word characters, which can be done like this:
2015
2016         ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
2017
2018       If run with the PCRE_CASELESS option, this pattern matches phrases such
2019       as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
2020       Perl.  Note the use of the possessive quantifier *+ to avoid backtrack‐
2021       ing into sequences of non-word characters. Without this, PCRE  takes  a
2022       great  deal  longer  (ten  times or more) to match typical phrases, and
2023       Perl takes so long that you think it has gone into a loop.
2024
2025       WARNING: The palindrome-matching patterns above work only if  the  sub‐
2026       ject  string  does not start with a palindrome that is shorter than the
2027       entire string.  For example, although "abcba" is correctly matched,  if
2028       the  subject  is "ababa", PCRE finds the palindrome "aba" at the start,
2029       then fails at top level because the end of the string does not  follow.
2030       Once  again, it cannot jump back into the recursion to try other alter‐
2031       natives, so the entire match fails.
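
       The effect of atomic recursion on the reordered odd-length pattern
       ^((.)(?1)\2|.)$ can be reproduced with a small model. The sketch
       assumes exactly the behaviour described in this section: a recursive
       call keeps only its first successful match and is never re-entered.
       It is a model for reasoning, not PCRE's implementation, and both
       function names are hypothetical.

```python
def group1_atomic(s, pos):
    # End offset of the first way group 1 of ^((.)(?1)\2|.)$ can match at pos,
    # or None. Only the first success is kept, mirroring an atomic recursion.
    if pos >= len(s):
        return None  # both alternatives need at least one character
    inner = group1_atomic(s, pos + 1)  # (.) followed by the recursive call (?1)
    if inner is not None and inner < len(s) and s[inner] == s[pos]:
        return inner + 1  # \2 matched the character captured by (.)
    return pos + 1  # first alternative failed: fall back to the single "."


def matches_like_pcre(s):
    # The top level is not a recursion, so both alternatives may be tried
    # against the end-of-string test.
    if not s:
        return False
    inner = group1_atomic(s, 1)
    if (inner is not None and inner < len(s)
            and s[inner] == s[0] and inner + 1 == len(s)):
        return True
    return len(s) == 1
```

       Under this model "abcba" is accepted, while "ababa" is rejected: the
       outermost recursive call settles for a single character, so the top
       level sees only the shorter palindrome "aba", as the warning above
       describes.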
2032

SUBPATTERNS AS SUBROUTINES

2034
2035       If the syntax for a recursive subpattern reference (either by number or
2036       by  name)  is used outside the parentheses to which it refers, it oper‐
2037       ates like a subroutine in a programming language. The "called"  subpat‐
2038       tern may be defined before or after the reference. A numbered reference
2039       can be absolute or relative, as in these examples:
2040
2041         (...(absolute)...)...(?2)...
2042         (...(relative)...)...(?-1)...
2043         (...(?+1)...(relative)...
2044
2045       An earlier example pointed out that the pattern
2046
2047         (sens|respons)e and \1ibility
2048
2049       matches "sense and sensibility" and "response and responsibility",  but
2050       not "sense and responsibility". If instead the pattern
2051
2052         (sens|respons)e and (?1)ibility
2053
2054       is  used, it does match "sense and responsibility" as well as the other
2055       two strings. Another example is  given  in  the  discussion  of  DEFINE
2056       above.
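
       The difference between a back reference and a subroutine call can be
       shown with Python's re module. Python has no (?1) syntax, so this
       sketch imitates the subroutine call by splicing the text of group 1
       into the pattern a second time; STEM, BACKREF and SUBROUTINE_LIKE are
       hypothetical names.

```python
import re

STEM = r"sens|respons"

# \1 must repeat the exact text that group 1 matched.
BACKREF = re.compile(rf"({STEM})e and \1ibility")

# Re-running the group's pattern, as (?1) does, may match different text.
SUBROUTINE_LIKE = re.compile(rf"({STEM})e and (?:{STEM})ibility")


def accepts(pattern, text):
    """Return True if the compiled pattern matches the whole of text."""
    return pattern.fullmatch(text) is not None
```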
2057
2058       Like  recursive  subpatterns, a subroutine call is always treated as an
2059       atomic group. That is, once it has matched some of the subject  string,
2060       it  is  never  re-entered, even if it contains untried alternatives and
2061       there is a subsequent matching failure. Any capturing parentheses  that
2062       are  set  during  the  subroutine  call revert to their previous values
2063       afterwards.
2064
2065       When a subpattern is used as a subroutine, processing options  such  as
2066       case-independence are fixed when the subpattern is defined. They cannot
2067       be changed for different calls. For example, consider this pattern:
2068
2069         (abc)(?i:(?-1))
2070
2071       It matches "abcabc". It does not match "abcABC" because the  change  of
2072       processing option does not affect the called subpattern.
2073

ONIGURUMA SUBROUTINE SYNTAX

2075
2076       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
2077       name or a number enclosed either in angle brackets or single quotes, is
2078       an  alternative  syntax  for  referencing a subpattern as a subroutine,
2079       possibly recursively. Here are two of the examples used above,  rewrit‐
2080       ten using this syntax:
2081
2082         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
2083         (sens|respons)e and \g'1'ibility
2084
2085       PCRE  supports  an extension to Oniguruma: if a number is preceded by a
2086       plus or a minus sign it is taken as a relative reference. For example:
2087
2088         (abc)(?i:\g<-1>)
2089
2090       Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not
2091       synonymous.  The former is a back reference; the latter is a subroutine
2092       call.
2093

CALLOUTS

2095
2096       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
2097       Perl  code to be obeyed in the middle of matching a regular expression.
2098       This makes it possible, amongst other things, to extract different sub‐
2099       strings that match the same pair of parentheses when there is a repeti‐
2100       tion.
2101
2102       PCRE provides a similar feature, but of course it cannot obey arbitrary
2103       Perl code. The feature is called "callout". The caller of PCRE provides
2104       an external function by putting its entry point in the global  variable
2105       pcre_callout.   By default, this variable contains NULL, which disables
2106       all calling out.
2107
2108       Within a regular expression, (?C) indicates the  points  at  which  the
2109       external  function  is  to be called. If you want to identify different
2110       callout points, you can put a number less than 256 after the letter  C.
2111       The  default  value is zero.  For example, this pattern has two callout
2112       points:
2113
2114         (?C1)abc(?C2)def
2115
2116       If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
2117       automatically  installed  before each item in the pattern. They are all
2118       numbered 255.
2119
2120       During matching, when PCRE reaches a callout point (and pcre_callout is
2121       set),  the  external function is called. It is provided with the number
2122       of the callout, the position in the pattern, and, optionally, one  item
2123       of  data  originally supplied by the caller of pcre_exec(). The callout
2124       function may cause matching to proceed, to backtrack, or to fail  alto‐
2125       gether. A complete description of the interface to the callout function
2126       is given in the pcrecallout documentation.
2127

BACKTRACKING CONTROL

2129
2130       Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",
2131       which are described in the Perl documentation as "experimental and sub‐
2132       ject to change or removal in a future version of Perl". It goes  on  to
2133       say:  "Their usage in production code should be noted to avoid problems
2134       during upgrades." The same remarks apply to the PCRE features described
2135       in this section.
2136
2137       Since  these  verbs  are  specifically related to backtracking, most of
2138       them can be  used  only  when  the  pattern  is  to  be  matched  using
2139       pcre_exec(), which uses a backtracking algorithm. With the exception of
2140       (*FAIL), which behaves like a failing negative assertion, they cause an
2141       error if encountered by pcre_dfa_exec().
2142
2143       If any of these verbs are used in an assertion or subroutine subpattern
2144       (including recursive subpatterns), their effect  is  confined  to  that
2145       subpattern;  it  does  not extend to the surrounding pattern. Note that
2146       such subpatterns are processed as anchored at the point where they  are
2147       tested.
2148
2149       The  new verbs make use of what was previously invalid syntax: an open‐
2150       ing parenthesis followed by an asterisk. They are generally of the form
2151       (*VERB) or (*VERB:NAME). Some may take either form, with differing
2152       behaviour, depending on whether or not an argument is present. A name is
2153       a  sequence  of letters, digits, and underscores. If the name is empty,
2154       that is, if the closing parenthesis immediately follows the colon,  the
2155       effect is as if the colon were not there. Any number of these verbs may
2156       occur in a pattern.
2157
2158       PCRE contains some optimizations that are used to speed up matching  by
2159       running some checks at the start of each match attempt. For example, it
2160       may know the minimum length of matching subject, or that  a  particular
2161       character  must  be present. When one of these optimizations suppresses
2162       the running of a match, any included backtracking verbs  will  not,  of
2163       course, be processed. You can suppress the start-of-match optimizations
2164       by setting the PCRE_NO_START_OPTIMIZE option when calling pcre_exec().
2165
2166   Verbs that act immediately
2167
2168       The following verbs act as soon as they are encountered. They  may  not
2169       be followed by a name.
2170
2171          (*ACCEPT)
2172
2173       This  verb causes the match to end successfully, skipping the remainder
2174       of the pattern. When inside a recursion, only the innermost pattern  is
2175       ended  immediately.  If  (*ACCEPT) is inside capturing parentheses, the
2176       data so far is captured. (This feature was added  to  PCRE  at  release
2177       8.00.) For example:
2178
2179         A((?:A|B(*ACCEPT)|C)D)
2180
2181       This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap‐
2182       tured by the outer parentheses.
2183
2184         (*FAIL) or (*F)
2185
2186       This verb causes the match to fail, forcing backtracking to  occur.  It
2187       is  equivalent to (?!) but easier to read. The Perl documentation notes
2188       that it is probably useful only when combined  with  (?{})  or  (??{}).
2189       Those  are,  of course, Perl features that are not present in PCRE. The
2190       nearest equivalent is the callout feature, as for example in this  pat‐
2191       tern:
2192
2193         a+(?C)(*FAIL)
2194
2195       A  match  with the string "aaaa" always fails, but the callout is taken
2196       before each backtrack happens (in this example, 10 times).
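
       The figure of 10 can be checked by counting. The sketch below assumes
       that the callout function always lets matching proceed and that no
       start-of-match optimization skips any starting position;
       count_callouts is a hypothetical name.

```python
def count_callouts(subject):
    """Count how often (?C) is reached for a+(?C)(*FAIL) on subject."""
    calls = 0
    for start in range(len(subject)):
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ first takes the whole run, then gives back one "a" at a time;
        # each of those lengths reaches (?C) once before (*FAIL) backtracks.
        calls += run
    return calls
```

       For "aaaa" the four starting positions contribute 4, 3, 2 and 1
       callouts respectively.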
2197
2198   Recording which path was taken
2199
2200       There is one verb whose main purpose  is  to  track  how  a  match  was
2201       arrived  at,  though  it  also  has a secondary use in conjunction with
2202       advancing the match starting point (see (*SKIP) below).
2203
2204         (*MARK:NAME) or (*:NAME)
2205
2206       A name is always  required  with  this  verb.  There  may  be  as  many
2207       instances  of  (*MARK) as you like in a pattern, and their names do not
2208       have to be unique.
2209
2210       When a match succeeds, the name  of  the  last-encountered  (*MARK)  is
2211       passed  back  to  the  caller  via  the  pcre_extra  data structure, as
2212       described in the section on pcre_extra in the pcreapi documentation. No
2213       data  is  returned  for a partial match. Here is an example of pcretest
2214       output, where the /K modifier requests the retrieval and outputting  of
2215       (*MARK) data:
2216
2217         /X(*MARK:A)Y|X(*MARK:B)Z/K
2218         XY
2219          0: XY
2220         MK: A
2221         XZ
2222          0: XZ
2223         MK: B
2224
2225       The (*MARK) name is tagged with "MK:" in this output, and in this exam‐
2226       ple it indicates which of the two alternatives matched. This is a  more
2227       efficient  way of obtaining this information than putting each alterna‐
2228       tive in its own capturing parentheses.
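
       For comparison, the capturing-parentheses approach mentioned above can
       be sketched with Python's re module, which has no (*MARK) verb. Each
       alternative gets its own named group and the match object's lastgroup
       attribute reports which one took part; ALTERNATIVES and
       which_alternative are hypothetical names.

```python
import re

# One named capturing group per alternative, standing in for (*MARK:A)/(*MARK:B).
ALTERNATIVES = re.compile(r"(?P<A>XY)|(?P<B>XZ)")


def which_alternative(subject):
    """Return the name of the alternative that matched, or None."""
    found = ALTERNATIVES.fullmatch(subject)
    return found.lastgroup if found else None
```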
2229
2230       A name may also be returned after a failed  match  if  the  final  path
2231       through the pattern involves (*MARK). However, unless (*MARK) is used in
2232       conjunction with (*COMMIT), this is unlikely to  happen  for  an  unan‐
2233       chored pattern because, as the starting point for matching is advanced,
2234       the final check is often with an empty string, causing a failure before
2235       (*MARK) is reached. For example:
2236
2237         /X(*MARK:A)Y|X(*MARK:B)Z/K
2238         XP
2239         No match
2240
2241       There are three potential starting points for this match (starting with
2242       X, starting with P, and with  an  empty  string).  If  the  pattern  is
2243       anchored, the result is different:
2244
2245         /^X(*MARK:A)Y|^X(*MARK:B)Z/K
2246         XP
2247         No match, mark = B
2248
2249       PCRE's  start-of-match  optimizations can also interfere with this. For
2250       example, if, as a result of a call to pcre_study(), it knows the  mini‐
2251       mum  subject  length for a match, a shorter subject will not be scanned
2252       at all.
2253
2254       Note that similar anomalies (though different in detail) exist in Perl,
2255       no  doubt  for the same reasons. The use of (*MARK) data after a failed
2256       match of an unanchored pattern is not recommended, unless (*COMMIT)  is
2257       involved.
2258
2259   Verbs that act after backtracking
2260
2261       The following verbs do nothing when they are encountered. Matching con‐
2262       tinues with what follows, but if there is no subsequent match,  causing
2263       a  backtrack  to  the  verb, a failure is forced. That is, backtracking
2264       cannot pass to the left of the verb. However, when one of  these  verbs
2265       appears  inside  an atomic group, its effect is confined to that group,
2266       because once the group has been matched, there is never any  backtrack‐
2267       ing  into  it.  In  this situation, backtracking can "jump back" to the
2268       left of the entire atomic group. (Remember also, as stated above,  that
2269       this localization also applies in subroutine calls and assertions.)
2270
2271       These  verbs  differ  in exactly what kind of failure occurs when back‐
2272       tracking reaches them.
2273
2274         (*COMMIT)
2275
2276       This verb, which may not be followed by a name, causes the whole  match
2277       to fail outright if the rest of the pattern does not match. Even if the
2278       pattern is unanchored, no further attempts to find a match by advancing
2279       the  starting  point  take  place.  Once  (*COMMIT)  has  been  passed,
2280       pcre_exec() is committed to finding a match  at  the  current  starting
2281       point, or not at all. For example:
2282
2283         a+(*COMMIT)b
2284
2285       This  matches  "xxaab" but not "aacaab". It can be thought of as a kind
2286       of dynamic anchor, or "I've started, so I must finish." The name of the
2287       most  recently passed (*MARK) in the path is passed back when (*COMMIT)
2288       forces a match failure.
2289
2290       Note that (*COMMIT) at the start of a pattern is not  the  same  as  an
2291       anchor,  unless  PCRE's start-of-match optimizations are turned off, as
2292       shown in this pcretest example:
2293
2294         /(*COMMIT)abc/
2295         xyzabc
2296          0: abc
2297         xyzabc\Y
2298         No match
2299
2300       PCRE knows that any match must start  with  "a",  so  the  optimization
2301       skips  along the subject to "a" before running the first match attempt,
2302       which succeeds. When the optimization is disabled by the \Y  escape  in
2303       the second subject, the match starts at "x" and so the (*COMMIT) causes
2304       it to fail without trying any other starting points.
2305
2306         (*PRUNE) or (*PRUNE:NAME)
2307
2308       This verb causes the match to fail at the current starting position  in
2309       the  subject  if the rest of the pattern does not match. If the pattern
2310       is unanchored, the normal "bumpalong"  advance  to  the  next  starting
2311       character  then happens. Backtracking can occur as usual to the left of
2312       (*PRUNE), before it is reached,  or  when  matching  to  the  right  of
2313       (*PRUNE),  but  if  there is no match to the right, backtracking cannot
2314       cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an  alter‐
2315       native  to an atomic group or possessive quantifier, but there are some
2316       uses of (*PRUNE) that cannot be expressed in any other way.  The behav‐
2317       iour  of  (*PRUNE:NAME)  is  the  same as (*MARK:NAME)(*PRUNE) when the
2318       match fails completely; the name is passed back if this  is  the  final
2319       attempt.   (*PRUNE:NAME)  does  not  pass back a name if the match suc‐
2320       ceeds. In an anchored pattern (*PRUNE) has the same  effect  as  (*COM‐
2321       MIT).
2322
2323         (*SKIP)
2324
2325       This  verb, when given without a name, is like (*PRUNE), except that if
2326       the pattern is unanchored, the "bumpalong" advance is not to  the  next
2327       character, but to the position in the subject where (*SKIP) was encoun‐
2328       tered. (*SKIP) signifies that whatever text was matched leading  up  to
2329       it cannot be part of a successful match. Consider:
2330
2331         a+(*SKIP)b
2332
2333       If  the  subject  is  "aaaac...",  after  the first match attempt fails
2334       (starting at the first character in the  string),  the  starting  point
2335       skips on to start the next attempt at "c". Note that a possessive
2336       quantifier does not have the same effect as this example; although it would
2337       suppress  backtracking  during  the  first  match  attempt,  the second
2338       attempt would start at the second character instead of skipping  on  to
2339       "c".
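
       The way (*COMMIT), (*PRUNE) and (*SKIP) differ in choosing the next
       starting point can be modelled for the pattern a+(*VERB)b. The sketch
       below is a simulation written for this one pattern, ignoring
       start-of-match optimizations; it is not a general regex engine, and
       attempts is a hypothetical name.

```python
def attempts(subject, verb):
    """Model a+(*VERB)b; verb is "COMMIT", "PRUNE" or "SKIP".

    Returns (start offsets tried, (start, end) of the match or None).
    """
    tried = []
    start = 0
    while start <= len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end == start:
            start += 1  # a+ failed before reaching the verb: normal bumpalong
            continue
        if end < len(subject) and subject[end] == "b":
            return tried, (start, end + 1)
        # "b" did not follow, so backtracking runs into the verb.
        if verb == "COMMIT":
            return tried, None  # no further starting points are tried
        start = end if verb == "SKIP" else start + 1
    return tried, None
```

       On the subject "aaaac", the (*PRUNE) model tries every offset from 0
       to 5, whereas the (*SKIP) model tries only 0, 4 and 5, and the
       (*COMMIT) model gives up after offset 0.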
2340
2341         (*SKIP:NAME)
2342
2343       When  (*SKIP) has an associated name, its behaviour is modified. If the
2344       following pattern fails to match, the previous path through the pattern
2345       is  searched for the most recent (*MARK) that has the same name. If one
2346       is found, the "bumpalong" advance is to the subject position that  cor‐
2347       responds  to  that (*MARK) instead of to where (*SKIP) was encountered.
2348       If no (*MARK) with a matching name is found, normal "bumpalong" of  one
2349       character happens (the (*SKIP) is ignored).
2350
2351         (*THEN) or (*THEN:NAME)
2352
2353       This verb causes a skip to the next alternation if the rest of the pat‐
2354       tern does not match. That is, it cancels pending backtracking, but only
2355       within  the  current  alternation.  Its name comes from the observation
2356       that it can be used for a pattern-based if-then-else block:
2357
2358         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
2359
2360       If the COND1 pattern matches, FOO is tried (and possibly further  items
2361       after  the  end  of  the group if FOO succeeds); on failure the matcher
2362       skips to the second alternative and tries COND2,  without  backtracking
2363       into  COND1.  The  behaviour  of  (*THEN:NAME)  is  exactly the same as
2364       (*MARK:NAME)(*THEN) if the overall  match  fails.  If  (*THEN)  is  not
2365       directly inside an alternation, it acts like (*PRUNE).
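
       The if-then-else reading of (*THEN) can be modelled in Python, whose
       re module has no backtracking verbs. The sketch assumes a group of
       COND (*THEN) BODY alternatives anchored at the start of the subject
       with nothing following the group: each condition is matched once, and
       if its body then fails the next alternative is tried without
       revisiting the condition. if_then_else is a hypothetical name.

```python
import re


def if_then_else(subject, branches):
    """branches is a list of (cond_regex, body_regex) pairs.

    Returns (index of the branch that matched, matched text) or None.
    """
    for index, (cond, body) in enumerate(branches):
        cond_match = re.match(cond, subject)
        if cond_match is None:
            continue  # condition failed: ordinary move to the next alternative
        body_match = re.compile(body).match(subject, cond_match.end())
        if body_match is not None:
            return index, subject[:body_match.end()]
        # Body failed: (*THEN) forbids retrying cond with a shorter match.
    return None
```

       With the branches (a+, ab) and (a+, b) and the subject "aab", this
       model picks the second branch, because a+ is not allowed to give back
       an "a" for the first body. An ordinary alternation a+ab|a+b would
       instead succeed with its first alternative.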
2366

AUTHOR

2372
2373       Philip Hazel
2374       University Computing Service
2375       Cambridge CB2 3QH, England.
2376

REVISION

2378
2379       Last updated: 18 May 2010
2380       Copyright (c) 1997-2010 University of Cambridge.
2381