PCREPATTERN(3)             Library Functions Manual             PCREPATTERN(3)

NAME

       PCRE - Perl-compatible regular expressions

PCRE REGULAR EXPRESSION DETAILS

       The syntax and semantics of the regular expressions that are supported
       by PCRE are described in detail below. There is a quick-reference
       syntax summary in the pcresyntax page. Perl's regular expressions are
       described in its own documentation, and regular expressions in general
       are covered in a number of books, some of which have copious examples.
       Jeffrey Friedl's "Mastering Regular Expressions", published by
       O'Reilly, covers regular expressions in great detail. This description
       of PCRE's regular expressions is intended as reference material.

       The original operation of PCRE was on strings of one-byte characters.
       However, there is now also support for UTF-8 character strings. To use
       this, you must build PCRE to include UTF-8 support, and then call
       pcre_compile() with the PCRE_UTF8 option. How this affects pattern
       matching is mentioned in several places below. There is also a summary
       of UTF-8 features in the section on UTF-8 support in the main pcre
       page.

       The remainder of this document discusses the patterns that are
       supported by PCRE when its main matching function, pcre_exec(), is
       used. From release 6.0, PCRE offers a second matching function,
       pcre_dfa_exec(), which matches using a different algorithm that is not
       Perl-compatible. Some of the features discussed below are not available
       when pcre_dfa_exec() is used. The advantages and disadvantages of the
       alternative function, and how it differs from the normal function, are
       discussed in the pcrematching page.

NEWLINE CONVENTIONS

       PCRE supports five different conventions for indicating line breaks in
       strings: a single CR (carriage return) character, a single LF
       (linefeed) character, the two-character sequence CRLF, any of the
       three preceding, or any Unicode newline sequence. The pcreapi page has
       further discussion about newlines, and shows how to set the newline
       convention in the options arguments for the compiling and matching
       functions.

       It is also possible to specify a newline convention by starting a
       pattern string with one of the following five sequences:

         (*CR)        carriage return
         (*LF)        linefeed
         (*CRLF)      carriage return, followed by linefeed
         (*ANYCRLF)   any of the three above
         (*ANY)       all Unicode newline sequences

       These override the default and the options given to pcre_compile(). For
       example, on a Unix system where LF is the default newline sequence, the
       pattern

         (*CR)a.b

       changes the convention to CR. That pattern matches "a\nb" because LF is
       no longer a newline. Note that these special settings, which are not
       Perl-compatible, are recognized only at the very start of a pattern,
       and that they must be in upper case.
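
       The "very start" and "upper case" rules can be illustrated with a
       short Python sketch. It is not PCRE's own parser: the helper name
       split_newline_setting and the "LF" default are assumptions made for
       this example, and it only separates a leading setting from the rest
       of the pattern string.

```python
import re

# The five leading settings listed above; upper case only.
_NEWLINE_PREFIX = re.compile(r"\(\*(CR|LF|CRLF|ANYCRLF|ANY)\)")

def split_newline_setting(pattern, default="LF"):
    """Return (convention, rest_of_pattern) for a PCRE-style pattern string.

    Hypothetical helper: 'default' stands in for whatever convention the
    library build or the compile options would otherwise select.
    """
    m = _NEWLINE_PREFIX.match(pattern)  # match() only looks at the very start
    if m is None:
        return default, pattern
    return m.group(1), pattern[m.end():]

print(split_newline_setting("(*CR)a.b"))   # setting recognized
print(split_newline_setting("a(*CR).b"))   # not at the start: left alone
print(split_newline_setting("(*cr)a.b"))   # lower case: left alone
```

       What PCRE itself does with a misplaced or lower-case setting is not
       covered here; the sketch simply leaves such a pattern unchanged.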

CHARACTERS AND METACHARACTERS

       A regular expression is a pattern that is matched against a subject
       string from left to right. Most characters stand for themselves in a
       pattern, and match the corresponding characters in the subject. As a
       trivial example, the pattern

         The quick brown fox

       matches a portion of a subject string that is identical to itself. When
       caseless matching is specified (the PCRE_CASELESS option), letters are
       matched independently of case. In UTF-8 mode, PCRE always understands
       the concept of case for characters whose values are less than 128, so
       caseless matching is always possible. For characters with higher
       values, the concept of case is supported if PCRE is compiled with
       Unicode property support, but not otherwise. If you want to use
       caseless matching for characters 128 and above, you must ensure that
       PCRE is compiled with Unicode property support as well as with UTF-8
       support.

       The power of regular expressions comes from the ability to include
       alternatives and repetitions in the pattern. These are encoded in the
       pattern by the use of metacharacters, which do not stand for themselves
       but instead are interpreted in some special way.

       There are two different sets of metacharacters: those that are
       recognized anywhere in the pattern except within square brackets, and
       those that are recognized within square brackets. Outside square
       brackets, the metacharacters are as follows:

         \      general escape character with several uses
         ^      assert start of string (or line, in multiline mode)
         $      assert end of string (or line, in multiline mode)
         .      match any character except newline (by default)
         [      start character class definition
         |      start of alternative branch
         (      start subpattern
         )      end subpattern
         ?      extends the meaning of (
                also 0 or 1 quantifier
                also quantifier minimizer
         *      0 or more quantifier
         +      1 or more quantifier
                also "possessive quantifier"
         {      start min/max quantifier

       Part of a pattern that is in square brackets is called a "character
       class". In a character class the only metacharacters are:

         \      general escape character
         ^      negate the class, but only if the first character
         -      indicates character range
         [      POSIX character class (only if followed by POSIX
                  syntax)
         ]      terminates the character class

       The following sections describe the use of each of the metacharacters.

BACKSLASH

       The backslash character has several uses. Firstly, if it is followed by
       a non-alphanumeric character, it takes away any special meaning that
       character may have. This use of backslash as an escape character
       applies both inside and outside character classes.

       For example, if you want to match a * character, you write \* in the
       pattern. This escaping action applies whether or not the following
       character would otherwise be interpreted as a metacharacter, so it is
       always safe to precede a non-alphanumeric with backslash to specify
       that it stands for itself. In particular, if you want to match a
       backslash, you write \\.

       If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
       the pattern (other than in a character class) and characters between a
       # outside a character class and the next newline are ignored. An
       escaping backslash can be used to include a whitespace or # character
       as part of the pattern.

       If you want to remove the special meaning from a sequence of
       characters, you can do so by putting them between \Q and \E. This is
       different from Perl in that $ and @ are handled as literals in \Q...\E
       sequences in PCRE, whereas in Perl, $ and @ cause variable
       interpolation. Note the following examples:

         Pattern            PCRE matches   Perl matches

         \Qabc$xyz\E        abc$xyz        abc followed by the
                                             contents of $xyz
         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

       The \Q...\E sequence is recognized both inside and outside character
       classes.
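
       The "PCRE matches" column can be reproduced with a small Python
       sketch that rewrites each \Q...\E span as an escaped literal before
       handing the result to Python's re module. The helper name
       expand_quoted is hypothetical, Python's re is only a stand-in for
       PCRE, and treating an unterminated \Q as quoting the rest of the
       pattern is an assumption not stated above. The sketch also does not
       recognize an escaped backslash that happens to precede a Q.

```python
import re

def expand_quoted(pattern):
    """Rewrite quoted spans as escaped literals (illustrative helper)."""
    out = []
    pos = 0
    while True:
        start = pattern.find(r"\Q", pos)
        if start == -1:
            out.append(pattern[pos:])          # no more quoted spans
            break
        out.append(pattern[pos:start])         # text before \Q is kept as is
        end = pattern.find(r"\E", start + 2)
        if end == -1:
            # Assumption: an unterminated \Q quotes the rest of the pattern.
            out.append(re.escape(pattern[start + 2:]))
            break
        out.append(re.escape(pattern[start + 2:end]))
        pos = end + 2
    return "".join(out)

for pat in (r"\Qabc$xyz\E", r"\Qabc\$xyz\E", r"\Qabc\E\$\Qxyz\E"):
    print(pat, "->", expand_quoted(pat))
```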

   Non-printing characters

       A second use of backslash provides a way of encoding non-printing
       characters in patterns in a visible manner. There is no restriction on
       the appearance of non-printing characters, apart from the binary zero
       that terminates a pattern, but when a pattern is being prepared by text
       editing, it is usually easier to use one of the following escape
       sequences than the binary character it represents:

         \a        alarm, that is, the BEL character (hex 07)
         \cx       "control-x", where x is any character
         \e        escape (hex 1B)
         \f        formfeed (hex 0C)
         \n        linefeed (hex 0A)
         \r        carriage return (hex 0D)
         \t        tab (hex 09)
         \ddd      character with octal code ddd, or backreference
         \xhh      character with hex code hh
         \x{hhh..} character with hex code hhh..

       The precise effect of \cx is as follows: if x is a lower case letter,
       it is converted to upper case. Then bit 6 of the character (hex 40) is
       inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
       becomes hex 7B.
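
       The \cx rule is simple enough to check by hand. The following Python
       sketch applies the two steps described above; the function name
       control_char is made up for the example and is not part of PCRE.

```python
def control_char(x):
    """Code point produced by a control-x escape under the rule above."""
    code = ord(x)
    if "a" <= x <= "z":
        code = ord(x.upper())   # lower case letters are upper-cased first
    return code ^ 0x40          # then bit 6 (hex 40) is inverted

for ch in "z{;":
    print(ch, hex(control_char(ch)))
```

       The three characters printed are the ones used in the text: z, {
       and ;.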

       After \x, from zero to two hexadecimal digits are read (letters can be
       in upper or lower case). Any number of hexadecimal digits may appear
       between \x{ and }, but the value of the character code must be less
       than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,
       the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger
       than the largest Unicode code point, which is 10FFFF.

       If characters other than hexadecimal digits appear between \x{ and },
       or if there is no terminating }, this form of escape is not recognized.
       Instead, the initial \x will be interpreted as a basic hexadecimal
       escape, with no following digits, giving a character whose value is
       zero.

       Characters whose value is less than 256 can be defined by either of the
       two syntaxes for \x. There is no difference in the way they are
       handled. For example, \xdc is exactly the same as \x{dc}.

       After \0 up to two further octal digits are read. If there are fewer
       than two digits, just those that are present are used. Thus the
       sequence \0\x\07 specifies two binary zeros followed by a BEL character
       (code value 7). Make sure you supply two digits after the initial zero
       if the pattern character that follows is itself an octal digit.

       The handling of a backslash followed by a digit other than 0 is
       complicated. Outside a character class, PCRE reads it and any following
       digits as a decimal number. If the number is less than 10, or if there
       have been at least that many previous capturing left parentheses in the
       expression, the entire sequence is taken as a back reference. A
       description of how this works is given later, following the discussion
       of parenthesized subpatterns.

       Inside a character class, or if the decimal number is greater than 9
       and there have not been that many capturing subpatterns, PCRE re-reads
       up to three octal digits following the backslash, and uses them to
       generate a data character. Any subsequent digits stand for themselves.
       In non-UTF-8 mode, the value of a character specified in octal must be
       less than \400. In UTF-8 mode, values up to \777 are permitted. For
       example:

         \040   is another way of writing a space
         \40    is the same, provided there are fewer than 40
                   previous capturing subpatterns
         \7     is always a back reference
         \11    might be a back reference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a back reference, otherwise the
                   character with octal code 113
         \377   might be a back reference, otherwise
                   the byte consisting entirely of 1 bits
         \81    is either a back reference, or a binary zero
                   followed by the two characters "8" and "1"

       Note that octal values of 100 or greater must not be introduced by a
       leading zero, because no more than three octal digits are ever read.
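
       The decision procedure for a backslash followed by digits, outside a
       character class, can be written down as a short Python sketch. It
       follows the rules in the preceding paragraphs only; the function
       name, its arguments, and the shape of the returned tuple are
       assumptions made for illustration, and the \400 limit for non-UTF-8
       mode is not checked.

```python
OCTAL_DIGITS = "01234567"

def classify_digit_escape(digits, groups_so_far):
    """Interpret the decimal digits that follow a backslash outside a class.

    'digits' is the run of digits after the backslash; 'groups_so_far' is
    the number of capturing left parentheses seen earlier in the pattern.
    Returns (kind, value, literal_tail), where kind is "backref" or "char".
    """
    if digits[0] != "0":
        number = int(digits)
        if number < 10 or number <= groups_so_far:
            return ("backref", number, "")
    # Otherwise read at most three octal digits; any remaining digits
    # stand for themselves.
    used = 0
    while used < min(3, len(digits)) and digits[used] in OCTAL_DIGITS:
        used += 1
    value = int(digits[:used], 8) if used else 0
    return ("char", value, digits[used:])

for digits in ("040", "40", "7", "11", "011", "0113", "113", "377", "81"):
    print("\\" + digits, classify_digit_escape(digits, groups_so_far=0))
```

       Calling the function with a larger groups_so_far value shows the
       "might be a back reference" rows of the table switching from a data
       character to a back reference.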

       All the sequences that define a single character value can be used both
       inside and outside character classes. In addition, inside a character
       class, the sequence \b is interpreted as the backspace character (hex
       08), and the sequences \R and \X are interpreted as the characters "R"
       and "X", respectively. Outside a character class, these sequences have
       different meanings (see below).

   Absolute and relative back references

       The sequence \g followed by an unsigned or a negative number,
       optionally enclosed in braces, is an absolute or relative back
       reference. A named back reference can be coded as \g{name}. Back
       references are discussed later, following the discussion of
       parenthesized subpatterns.

   Generic character types

       Another use of backslash is for specifying generic character types. The
       following are always recognized:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \h     any horizontal whitespace character
         \H     any character that is not a horizontal whitespace character
         \s     any whitespace character
         \S     any character that is not a whitespace character
         \v     any vertical whitespace character
         \V     any character that is not a vertical whitespace character
         \w     any "word" character
         \W     any "non-word" character

       Each pair of escape sequences partitions the complete set of characters
       into two disjoint sets. Any given character matches one, and only one,
       of each pair.

       These character type sequences can appear both inside and outside
       character classes. They each match one character of the appropriate
       type. If the current matching point is at the end of the subject
       string, all of them fail, since there is no character to match.

       For compatibility with Perl, \s does not match the VT character (code
       11). This makes it different from the POSIX "space" class. The \s
       characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
       "use locale;" is included in a Perl script, \s may match the VT
       character. In PCRE, it never does.

       In UTF-8 mode, characters with values greater than 128 never match \d,
       \s, or \w, and always match \D, \S, and \W. This is true even when
       Unicode character property support is available. These sequences retain
       their original meanings from before UTF-8 support was available, mainly
       for efficiency reasons.

       The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
       the other sequences, these do match certain high-valued codepoints in
       UTF-8 mode. The horizontal space characters are:

         U+0009     Horizontal tab
         U+0020     Space
         U+00A0     Non-break space
         U+1680     Ogham space mark
         U+180E     Mongolian vowel separator
         U+2000     En quad
         U+2001     Em quad
         U+2002     En space
         U+2003     Em space
         U+2004     Three-per-em space
         U+2005     Four-per-em space
         U+2006     Six-per-em space
         U+2007     Figure space
         U+2008     Punctuation space
         U+2009     Thin space
         U+200A     Hair space
         U+202F     Narrow no-break space
         U+205F     Medium mathematical space
         U+3000     Ideographic space

       The vertical space characters are:

         U+000A     Linefeed
         U+000B     Vertical tab
         U+000C     Formfeed
         U+000D     Carriage return
         U+0085     Next line
         U+2028     Line separator
         U+2029     Paragraph separator
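
       In an engine that lacks these classes, the two lists can be turned
       into explicit character classes. The Python sketch below does this
       for Python's re module, used here purely as a stand-in; the names
       H_SPACE, V_SPACE and the compiled objects are invented for the
       example.

```python
import re

# Code points copied from the two lists above.
H_SPACE = "\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000"
V_SPACE = "\u000A-\u000D\u0085\u2028\u2029"

horizontal = re.compile("[" + H_SPACE + "]")       # plays the role of \h
not_horizontal = re.compile("[^" + H_SPACE + "]")  # plays the role of \H
vertical = re.compile("[" + V_SPACE + "]")         # plays the role of \v
not_vertical = re.compile("[^" + V_SPACE + "]")    # plays the role of \V

sample = "a\u2003b\u0085c"
print([hex(ord(c)) for c in horizontal.findall(sample)])
print([hex(ord(c)) for c in vertical.findall(sample)])
```

       Each positive class and its negation partition the character set, in
       the same way as the escape-sequence pairs described earlier.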

       A "word" character is an underscore or any character less than 256 that
       is a letter or digit. The definition of letters and digits is
       controlled by PCRE's low-valued character tables, and may vary if
       locale-specific matching is taking place (see "Locale support" in the
       pcreapi page). For example, in a French locale such as "fr_FR" in
       Unix-like systems, or "french" in Windows, some character codes greater
       than 128 are used for accented letters, and these are matched by \w.
       The use of locales with Unicode is discouraged.

   Newline sequences

       Outside a character class, the escape sequence \R matches any Unicode
       newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R is
       equivalent to the following:

         (?>\r\n|\n|\x0b|\f|\r|\x85)

       This is an example of an "atomic group", details of which are given
       below. This particular group matches either the two-character sequence
       CR followed by LF, or one of the single characters LF (linefeed,
       U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
       return, U+000D), or NEL (next line, U+0085). The two-character sequence
       is treated as a single unit that cannot be split.
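
       The equivalent pattern can be tried out in Python's re module, used
       here as a stand-in for PCRE. The sketch below uses an ordinary
       non-capturing group rather than an atomic one, so it only
       approximates \R: with further pattern items after the group,
       backtracking could still split a CRLF pair. Python 3.11 and later
       also accept the (?>...) syntax. The names NEWLINE_SEQ and
       split_lines are made up for the example.

```python
import re

# Non-atomic approximation of the group shown above; CRLF is tried first,
# so a CR followed by LF is consumed as one unit when nothing follows.
NEWLINE_SEQ = re.compile(r"(?:\r\n|\n|\x0b|\f|\r|\x85)")

def split_lines(text):
    """Split text at any of the newline sequences listed above."""
    return NEWLINE_SEQ.split(text)

print(split_lines("a\r\nb\nc\rd\x85e"))
print(NEWLINE_SEQ.findall("x\r\ny"))
```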

       In UTF-8 mode, two additional characters whose codepoints are greater
       than 255 are added: LS (line separator, U+2028) and PS (paragraph
       separator, U+2029). Unicode character property support is not needed
       for these characters to be recognized.

       Inside a character class, \R matches the letter "R".

   Unicode character properties

       When PCRE is built with Unicode character property support, three
       additional escape sequences that match characters with specific
       properties are available. When not in UTF-8 mode, these sequences are
       of course limited to testing characters whose codepoints are less than
       256, but they do work in this mode. The extra escape sequences are:

         \p{xx}   a character with the xx property
         \P{xx}   a character without the xx property
         \X       an extended Unicode sequence

       The property names represented by xx above are limited to the Unicode
       script names, the general category properties, and "Any", which matches
       any character (including newline). Other properties such as
       "InMusicalSymbols" are not currently supported by PCRE. Note that
       \P{Any} does not match any characters, so always causes a match
       failure.

       Sets of Unicode characters are defined as belonging to certain scripts.
       A character from one of these sets can be matched using a script name.
       For example:

         \p{Greek}
         \P{Han}

       Those that are not part of an identified script are lumped together as
       "Common". The current list of scripts is:

       Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese,
       Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform,
       Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
       Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,
       Hiragana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin,
       Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko,
       Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,
       Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
       Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.

       Each character has exactly one general category property, specified by
       a two-letter abbreviation. For compatibility with Perl, negation can be
       specified by including a circumflex between the opening brace and the
       property name. For example, \p{^Lu} is the same as \P{Lu}.

       If only one letter is specified with \p or \P, it includes all the
       general category properties that start with that letter. In this case,
       in the absence of negation, the curly brackets in the escape sequence
       are optional; these two examples have the same effect:

         \p{L}
         \pL

       The following general category property codes are supported:

         C     Other
         Cc    Control
         Cf    Format
         Cn    Unassigned
         Co    Private use
         Cs    Surrogate

         L     Letter
         Ll    Lower case letter
         Lm    Modifier letter
         Lo    Other letter
         Lt    Title case letter
         Lu    Upper case letter

         M     Mark
         Mc    Spacing mark
         Me    Enclosing mark
         Mn    Non-spacing mark

         N     Number
         Nd    Decimal number
         Nl    Letter number
         No    Other number

         P     Punctuation
         Pc    Connector punctuation
         Pd    Dash punctuation
         Pe    Close punctuation
         Pf    Final punctuation
         Pi    Initial punctuation
         Po    Other punctuation
         Ps    Open punctuation

         S     Symbol
         Sc    Currency symbol
         Sk    Modifier symbol
         Sm    Mathematical symbol
         So    Other symbol

         Z     Separator
         Zl    Line separator
         Zp    Paragraph separator
         Zs    Space separator

       The special property L& is also supported: it matches a character that
       has the Lu, Ll, or Lt property, in other words, a letter that is not
       classified as a modifier or "other".
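
       The relationship between one-letter codes, two-letter codes, and L&
       can be sketched in Python with the standard unicodedata module,
       which reports the same two-letter general category abbreviations.
       This is an illustration of the rules above, not PCRE's lookup: the
       function name has_property is hypothetical, script names and "Any"
       are not handled, and Python's Unicode tables may be a different
       version from those built into a given PCRE.

```python
import unicodedata

def has_property(ch, prop):
    """True if ch has the general category property named by prop."""
    cat = unicodedata.category(ch)          # two-letter code such as "Lu"
    if prop == "L&":
        return cat in ("Lu", "Ll", "Lt")    # cased letters only
    if len(prop) == 1:
        return cat.startswith(prop)         # one letter covers the group
    return cat == prop

print(has_property("A", "Lu"), has_property("a", "L"))
print(has_property("\u01C5", "L&"))   # a title case letter
print(has_property("\u02B0", "L&"))   # a modifier letter
```

       A negated test in the style of \P{xx} or \p{^xx} is then simply
       "not has_property(ch, prop)".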

       The Cs (Surrogate) property applies only to characters in the range
       U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see
       RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity
       checking has been turned off (see the discussion of PCRE_NO_UTF8_CHECK
       in the pcreapi page).

       The long synonyms for these properties that Perl supports (such as
       \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
       any of these properties with "Is".

       No character that is in the Unicode table has the Cn (unassigned)
       property. Instead, this property is assumed for any code point that is
       not in the Unicode table.

       Specifying caseless matching does not affect these escape sequences.
       For example, \p{Lu} always matches only upper case letters.

       The \X escape matches any number of Unicode characters that form an
       extended Unicode sequence. \X is equivalent to

         (?>\PM\pM*)

       That is, it matches a character without the "mark" property, followed
       by zero or more characters with the "mark" property, and treats the
       sequence as an atomic group (see below). Characters with the "mark"
       property are typically accents that affect the preceding character.
       None of them have codepoints less than 256, so in non-UTF-8 mode \X
       matches any one character.
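
       The grouping that \X performs can be imitated in Python with the
       standard unicodedata module, as a rough sketch of the rule "one
       non-mark character followed by any marks". The function name
       extended_sequences is an assumption for this example. Unlike the
       pattern above, the sketch puts a mark that appears at the very start
       of the text into a sequence of its own rather than failing to match.

```python
import unicodedata

def extended_sequences(text):
    """Split text into base characters with their trailing marks."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch      # \pM* : attach marks to the base
        else:
            clusters.append(ch)     # \PM  : a non-mark starts a sequence
    return clusters

# "e" + combining acute, "a" + two combining marks, then a plain "b".
parts = extended_sequences("e\u0301a\u0308\u0323b")
print([len(p) for p in parts])
```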

       Matching characters by Unicode property is not fast, because PCRE has
       to search a structure that contains data for over fifteen thousand
       characters. That is why the traditional escape sequences such as \d and
       \w do not use Unicode properties in PCRE.

   Resetting the match start

       The escape sequence \K, which is a Perl 5.10 feature, causes any
       previously matched characters not to be included in the final matched
       sequence. For example, the pattern:

         foo\Kbar

       matches "foobar", but reports that it has matched "bar". This feature
       is similar to a lookbehind assertion (described below). However, in
       this case, the part of the subject before the real match does not have
       to be of fixed length, as lookbehind assertions do. The use of \K does
       not interfere with the setting of captured substrings. For example,
       when the pattern

         (foo)\Kbar

       matches "foobar", the first substring is still set to "foo".
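
       Python's standard re module, used here only as a stand-in, does not
       accept \K, so the sketch below shows the two nearest equivalents for
       the foo\Kbar example: a capturing group around the part to be
       reported, and a fixed-length lookbehind. Neither is the \K mechanism
       itself; they only produce the same reported text and offsets for
       this subject.

```python
import re

subject = "foobar"

# Capture group: the whole of "foobar" is matched, but only the
# captured tail is reported.
m = re.search(r"foo(bar)", subject)
print(m.group(1), m.span(1))

# Fixed-length lookbehind: the reported match is just "bar".
m2 = re.search(r"(?<=foo)bar", subject)
print(m2.group(0), m2.span())
```

       The lookbehind form works here because "foo" has a fixed length,
       which is the restriction that \K removes.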

   Simple assertions

       The final use of backslash is for certain simple assertions. An
       assertion specifies a condition that has to be met at a particular
       point in a match, without consuming any characters from the subject
       string. The use of subpatterns for more complicated assertions is
       described below. The backslashed assertions are:

         \b     matches at a word boundary
         \B     matches when not at a word boundary
         \A     matches at the start of the subject
         \Z     matches at the end of the subject
                 also matches before a newline at the end of the subject
         \z     matches only at the end of the subject
         \G     matches at the first matching position in the subject

       These assertions may not appear in character classes (but note that \b
       has a different meaning, namely the backspace character, inside a
       character class).

       A word boundary is a position in the subject string where the current
       character and the previous character do not both match \w or \W (i.e.
       one matches \w and the other matches \W), or the start or end of the
       string if the first or last character matches \w, respectively.
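
       The word-boundary definition translates directly into a position
       test. The Python sketch below implements it for ASCII word
       characters and lists the boundary positions in a sample string; the
       function names are invented for the example, and locale-dependent
       character tables are not modelled.

```python
import re

def is_word_char(c):
    """ASCII-only stand-in for \\w: a letter, a digit, or underscore."""
    return c == "_" or (c.isascii() and c.isalnum())

def is_word_boundary(subject, i):
    """True if position i (0..len) is a word boundary in subject."""
    before = i > 0 and is_word_char(subject[i - 1])
    after = i < len(subject) and is_word_char(subject[i])
    # A boundary exists when exactly one side is a word character; the
    # start and end of the string count as non-word sides.
    return before != after

s = "ab, c_d!"
positions = [i for i in range(len(s) + 1) if is_word_boundary(s, i)]
print(positions)
```

       For this sample the result agrees with the positions at which
       Python's own \b assertion matches when the re.ASCII flag is set.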

       The \A, \Z, and \z assertions differ from the traditional circumflex
       and dollar (described in the next section) in that they only ever match
       at the very start and end of the subject string, whatever options are
       set. Thus, they are independent of multiline mode. These three
       assertions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options,
       which affect only the behaviour of the circumflex and dollar
       metacharacters. However, if the startoffset argument of pcre_exec() is
       non-zero, indicating that matching is to start at a point other than
       the beginning of the subject, \A can never match. The difference
       between \Z and \z is that \Z matches before a newline at the end of the
       string as well as at the very end, whereas \z matches only at the end.

       The \G assertion is true only when the current matching position is at
       the start point of the match, as specified by the startoffset argument
       of pcre_exec(). It differs from \A when the value of startoffset is
       non-zero. By calling pcre_exec() multiple times with appropriate
       arguments, you can mimic Perl's /g option, and it is in this kind of
       implementation where \G can be useful.

       Note, however, that PCRE's interpretation of \G, as the start of the
       current match, is subtly different from Perl's, which defines it as the
       end of the previous match. In Perl, these can be different when the
       previously matched string was empty. Because PCRE does just one match
       at a time, it cannot reproduce this behaviour.

       If all the alternatives of a pattern begin with \G, the expression is
       anchored to the starting match position, and the "anchored" flag is set
       in the compiled regular expression.
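
       The "call the matcher repeatedly with a moving start offset" idea
       can be sketched in Python, whose compiled-pattern match(string, pos)
       method anchors the attempt at pos much as a leading \G anchors a
       PCRE pattern at startoffset. This is an analogy rather than the
       pcre_exec() interface: the function name match_all_from is
       hypothetical, and stepping one character past an empty match is an
       assumption made so that the loop always terminates.

```python
import re

def match_all_from(pattern, subject, start=0):
    """Collect consecutive matches, each anchored where the last one ended."""
    compiled = re.compile(pattern)
    results = []
    pos = start
    while pos <= len(subject):
        m = compiled.match(subject, pos)   # anchored at pos, like \G
        if m is None:
            break                          # the chain of matches is broken
        results.append(m.group(0))
        # Step past an empty match so the loop always makes progress.
        pos = m.end() if m.end() > pos else pos + 1
    return results

print(match_all_from(r"\d+,?", "12,34,x56"))
```

       The loop stops at the "x" because the next attempt must succeed
       exactly at the current position, which is what distinguishes this
       from an unanchored global search.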

CIRCUMFLEX AND DOLLAR

565
566       Outside a character class, in the default matching mode, the circumflex
567       character  is  an  assertion  that is true only if the current matching
568       point is at the start of the subject string. If the  startoffset  argu‐
569       ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
570       PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
571       has an entirely different meaning (see below).
572
573       Circumflex  need  not be the first character of the pattern if a number
574       of alternatives are involved, but it should be the first thing in  each
575       alternative  in  which  it appears if the pattern is ever to match that
576       branch. If all possible alternatives start with a circumflex, that  is,
577       if  the  pattern  is constrained to match only at the start of the sub‐
578       ject, it is said to be an "anchored" pattern.  (There  are  also  other
579       constructs that can cause a pattern to be anchored.)
580
581       A  dollar  character  is  an assertion that is true only if the current
582       matching point is at the end of  the  subject  string,  or  immediately
583       before a newline at the end of the string (by default). Dollar need not
584       be the last character of the pattern if a number  of  alternatives  are
585       involved,  but  it  should  be  the last item in any branch in which it
586       appears. Dollar has no special meaning in a character class.
587
588       The meaning of dollar can be changed so that it  matches  only  at  the
589       very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at
590       compile time. This does not affect the \Z assertion.
591
592       The meanings of the circumflex and dollar characters are changed if the
593       PCRE_MULTILINE  option  is  set.  When  this  is the case, a circumflex
594       matches immediately after internal newlines as well as at the start  of
595       the  subject  string.  It  does not match after a newline that ends the
596       string. A dollar matches before any newlines in the string, as well  as
597       at  the very end, when PCRE_MULTILINE is set. When newline is specified
598       as the two-character sequence CRLF, isolated CR and  LF  characters  do
599       not indicate newlines.
600
601       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
602       (where \n represents a newline) in multiline mode, but  not  otherwise.
603       Consequently,  patterns  that  are anchored in single line mode because
604       all branches start with ^ are not anchored in  multiline  mode,  and  a
605       match  for  circumflex  is  possible  when  the startoffset argument of
606       pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is  ignored  if
607       PCRE_MULTILINE is set.
608
609       Note  that  the sequences \A, \Z, and \z can be used to match the start
610       and end of the subject in both modes, and if all branches of a  pattern
611       start  with  \A it is always anchored, whether or not PCRE_MULTILINE is
612       set.
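
       The effect of multiline mode can be checked with a small sketch. It
       uses Python's re module as a stand-in, which is an assumption: re is
       not PCRE, but for these cases its ^, $, \A and re.MULTILINE behave
       like PCRE's circumflex, dollar, \A and PCRE_MULTILINE.

```python
import re

subject = "def\nabc"

# Single-line mode: ^ matches only at the very start of the subject.
single = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after an internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# \A is unaffected by multiline mode and still anchors at the subject start.
anchored = re.search(r"\Aabc", subject, re.MULTILINE)

# By default, $ also matches just before a newline that ends the string.
before_final_newline = re.search(r"abc$", "abc\n")
```

       Only the multiline search and the last search succeed. Python's \Z
       differs from PCRE's \Z, so it is deliberately left out of the sketch.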
613

FULL STOP (PERIOD, DOT)

615
616       Outside a character class, a dot in the pattern matches any one charac‐
617       ter  in  the subject string except (by default) a character that signi‐
618       fies the end of a line. In UTF-8 mode, the  matched  character  may  be
619       more than one byte long.
620
621       When  a line ending is defined as a single character, dot never matches
622       that character; when the two-character sequence CRLF is used, dot  does
623       not  match  CR  if  it  is immediately followed by LF, but otherwise it
624       matches all characters (including isolated CRs and LFs). When any  Uni‐
625       code  line endings are being recognized, dot does not match CR or LF or
626       any of the other line ending characters.
627
628       The behaviour of dot with regard to newlines can  be  changed.  If  the
629       PCRE_DOTALL  option  is  set,  a dot matches any one character, without
630       exception. If the two-character sequence CRLF is present in the subject
631       string, it takes two dots to match it.
632
633       The  handling of dot is entirely independent of the handling of circum‐
634       flex and dollar, the only relationship being  that  they  both  involve
635       newlines. Dot has no special meaning in a character class.
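
       A minimal sketch of the dot rules, assuming Python's re module as a
       stand-in for PCRE with the default single-character LF newline
       convention (re.DOTALL plays the part of PCRE_DOTALL):

```python
import re

# By default a dot does not match the newline character.
default_dot = re.fullmatch(r"a.c", "a\nc")

# With the "dot matches all" flag it matches any one character.
dotall_dot = re.fullmatch(r"a.c", "a\nc", re.DOTALL)

# CRLF is two characters, so one dot is not enough and two are needed.
one_dot_crlf = re.fullmatch(r"a.c", "a\r\nc", re.DOTALL)
two_dots_crlf = re.fullmatch(r"a..c", "a\r\nc", re.DOTALL)
```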
636

MATCHING A SINGLE BYTE

638
639       Outside a character class, the escape sequence \C matches any one byte,
640       both in and out of UTF-8 mode. Unlike a  dot,  it  always  matches  any
641       line-ending  characters.  The  feature  is provided in Perl in order to
642       match individual bytes in UTF-8 mode. Because it breaks up UTF-8  char‐
643       acters  into individual bytes, what remains in the string may be a mal‐
644       formed UTF-8 string. For this reason, the \C escape  sequence  is  best
645       avoided.
646
647       PCRE  does  not  allow \C to appear in lookbehind assertions (described
648       below), because in UTF-8 mode this would make it impossible  to  calcu‐
649       late the length of the lookbehind.
650

SQUARE BRACKETS AND CHARACTER CLASSES

652
653       An opening square bracket introduces a character class, terminated by a
654       closing square bracket. A closing square bracket on its own is not spe‐
655       cial. If a closing square bracket is required as a member of the class,
656       it should be the first data character in the class  (after  an  initial
657       circumflex, if present) or escaped with a backslash.
658
659       A  character  class matches a single character in the subject. In UTF-8
660       mode, the character may occupy more than one byte. A matched  character
661       must be in the set of characters defined by the class, unless the first
662       character in the class definition is a circumflex, in  which  case  the
663       subject  character  must  not  be in the set defined by the class. If a
664       circumflex is actually required as a member of the class, ensure it  is
665       not the first character, or escape it with a backslash.
666
667       For  example, the character class [aeiou] matches any lower case vowel,
668       while [^aeiou] matches any character that is not a  lower  case  vowel.
669       Note that a circumflex is just a convenient notation for specifying the
670       characters that are in the class by enumerating those that are  not.  A
671       class  that starts with a circumflex is not an assertion: it still con‐
672       sumes a character from the subject string, and therefore  it  fails  if
673       the current pointer is at the end of the string.
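
       The difference between a negated class and an assertion can be
       sketched as follows. Python's re module is used as a stand-in (an
       assumption; it is not PCRE, but it treats these classes the same
       way), and the negative lookahead is shown only for contrast.

```python
import re

vowel = re.fullmatch(r"[aeiou]", "e")        # a lower case vowel
non_vowel = re.fullmatch(r"[^aeiou]", "x")   # anything but a lower case vowel

# A negated class must still consume a character, so it fails when the
# current position is at the end of the subject ...
negated_at_end = re.search(r"q[^u]", "Iraq")

# ... whereas a negative lookahead assertion consumes nothing and succeeds.
lookahead_at_end = re.search(r"q(?!u)", "Iraq")
```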
674
675       In  UTF-8 mode, characters with values greater than 255 can be included
676       in a class as a literal string of bytes, or by using the  \x{  escaping
677       mechanism.
678
679       When  caseless  matching  is set, any letters in a class represent both
680       their upper case and lower case versions, so for  example,  a  caseless
681       [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
682       match "A", whereas a caseful version would. In UTF-8 mode, PCRE  always
683       understands  the  concept  of case for characters whose values are less
684       than 128, so caseless matching is always possible. For characters  with
685       higher  values,  the  concept  of case is supported if PCRE is compiled
686       with Unicode property support, but not otherwise.  If you want  to  use
687       caseless  matching  for  characters 128 and above, you must ensure that
688       PCRE is compiled with Unicode property support as well  as  with  UTF-8
689       support.
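
       The caseless class behaviour for ASCII letters can be sketched like
       this, assuming Python's re module as a stand-in for PCRE, with
       re.IGNORECASE playing the part of PCRE_CASELESS:

```python
import re

# A caseless [aeiou] matches "A" as well as "a".
caseless_hit = re.fullmatch(r"[aeiou]", "A", re.IGNORECASE)

# A caseless [^aeiou] does not match "A" ...
caseless_negated = re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE)

# ... whereas the caseful version does.
caseful_negated = re.fullmatch(r"[^aeiou]", "A")
```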
690
691       Characters  that  might  indicate  line breaks are never treated in any
692       special way  when  matching  character  classes,  whatever  line-ending
693       sequence  is  in  use,  and  whatever  setting  of  the PCRE_DOTALL and
694       PCRE_MULTILINE options is used. A class such as [^a] always matches one
695       of these characters.
696
697       The  minus (hyphen) character can be used to specify a range of charac‐
698       ters in a character  class.  For  example,  [d-m]  matches  any  letter
699       between  d  and  m,  inclusive.  If  a minus character is required in a
700       class, it must be escaped with a backslash  or  appear  in  a  position
701       where  it cannot be interpreted as indicating a range, typically as the
702       first or last character in the class.
703
704       It is not possible to have the literal character "]" as the end charac‐
705       ter  of a range. A pattern such as [W-]46] is interpreted as a class of
706       two characters ("W" and "-") followed by a literal string "46]", so  it
707       would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
708       backslash it is interpreted as the end of range, so [W-\]46] is  inter‐
709       preted  as a class containing a range followed by two other characters.
710       The octal or hexadecimal representation of "]" can also be used to  end
711       a range.
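
       The two readings of "]" can be compared in a short sketch. It
       assumes Python's re module as a stand-in; re is not PCRE, but it
       parses these two classes the same way.

```python
import re

# [W-] is a two-character class; "46]" is then a literal string.
literal_tail = re.compile(r"[W-]46]")

# Escaping "]" makes it the end of the range W..], followed by "4" and "6".
escaped_range = re.compile(r"[W-\]46]")

tail_matches = [s for s in ("W46]", "-46]", "X46]") if literal_tail.fullmatch(s)]
range_matches = [c for c in "WXZ]46-V" if escaped_range.fullmatch(c)]
```

       Here tail_matches holds "W46]" and "-46]", and range_matches holds
       every probe character except "-" and "V".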
712
713       Ranges  operate in the collating sequence of character values. They can
714       also  be  used  for  characters  specified  numerically,  for   example
715       [\000-\037].  In UTF-8 mode, ranges can include characters whose values
716       are greater than 255, for example [\x{100}-\x{2ff}].
717
718       If a range that includes letters is used when caseless matching is set,
719       it matches the letters in either case. For example, [W-c] is equivalent
720       to [][\\^_`wxyzabc], matched caselessly,  and  in  non-UTF-8  mode,  if
721       character  tables  for  a French locale are in use, [\xc8-\xcb] matches
722       accented E characters in both cases. In UTF-8 mode, PCRE  supports  the
723       concept  of  case for characters with values greater than 127 only when
724       it is compiled with Unicode property support.
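
       The equivalence claimed for [W-c] can be checked with a sketch.
       Python's re module is assumed as a stand-in for PCRE, and the
       explicit set is written with "]" and "[" escaped, which is the more
       conventional spelling in Python.

```python
import re

caseless_range = re.compile(r"[W-c]", re.IGNORECASE)
explicit_set = re.compile(r"[\]\[\\^_`wxyzabc]", re.IGNORECASE)

# Probe characters on both sides of the range, in both cases.
probe = "VWXYZ[\\]^_`abcdvwxyzABCD"
by_range = [c for c in probe if caseless_range.fullmatch(c)]
by_set = [c for c in probe if explicit_set.fullmatch(c)]
```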
725
726       The character types \d, \D, \p, \P, \s, \S, \w, and \W may also  appear
727       in  a  character  class,  and add the characters that they match to the
728       class. For example, [\dABCDEF] matches any hexadecimal digit. A circum‐
729       flex  can  conveniently  be used with the upper case character types to
730       specify a more restricted set of characters  than  the  matching  lower
731       case  type.  For example, the class [^\W_] matches any letter or digit,
732       but not underscore.
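
       A short sketch of character types inside a class, assuming
       Python's re module as a stand-in for PCRE. The re.ASCII flag is an
       addition that keeps \d and \W to their ASCII meanings, as in PCRE
       without Unicode property support.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
alnum_only = re.compile(r"[^\W_]", re.ASCII)   # \w minus underscore

hex_hits = [c for c in "09AFG" if hex_digit.fullmatch(c)]
alnum_hits = [c for c in "aZ5_- " if alnum_only.fullmatch(c)]
```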
733
734       The only metacharacters that are recognized in  character  classes  are
735       backslash,  hyphen  (only  where  it can be interpreted as specifying a
736       range), circumflex (only at the start), opening  square  bracket  (only
737       when  it can be interpreted as introducing a POSIX class name - see the
738       next section), and the terminating  closing  square  bracket.  However,
739       escaping other non-alphanumeric characters does no harm.
740

POSIX CHARACTER CLASSES

742
743       Perl supports the POSIX notation for character classes. This uses names
744       enclosed by [: and :] within the enclosing square brackets.  PCRE  also
745       supports this notation. For example,
746
747         [01[:alpha:]%]
748
749       matches "0", "1", any alphabetic character, or "%". The supported class
750       names are
751
752         alnum    letters and digits
753         alpha    letters
754         ascii    character codes 0 - 127
755         blank    space or tab only
756         cntrl    control characters
757         digit    decimal digits (same as \d)
758         graph    printing characters, excluding space
759         lower    lower case letters
760         print    printing characters, including space
761         punct    printing characters, excluding letters and digits
762         space    white space (not quite the same as \s)
763         upper    upper case letters
764         word     "word" characters (same as \w)
765         xdigit   hexadecimal digits
766
767       The "space" characters are HT (9), LF (10), VT (11), FF (12), CR  (13),
768       and  space  (32). Notice that this list includes the VT character (code
769       11). This makes "space" different to \s, which does not include VT (for
770       Perl compatibility).
771
772       The  name  "word"  is  a Perl extension, and "blank" is a GNU extension
773       from Perl 5.8. Another Perl extension is negation, which  is  indicated
774       by a ^ character after the colon. For example,
775
776         [12[:^digit:]]
777
778       matches  "1", "2", or any non-digit. PCRE (and Perl) also recognize the
779       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
780       these are not supported, and an error is given if they are encountered.
781
782       In UTF-8 mode, characters with values greater than 128 do not match any
783       of the POSIX character classes.
784

VERTICAL BAR

786
787       Vertical bar characters are used to separate alternative patterns.  For
788       example, the pattern
789
790         gilbert|sullivan
791
792       matches  either "gilbert" or "sullivan". Any number of alternatives may
793       appear, and an empty  alternative  is  permitted  (matching  the  empty
794       string). The matching process tries each alternative in turn, from left
795       to right, and the first one that succeeds is used. If the  alternatives
796       are  within a subpattern (defined below), "succeeds" means matching the
797       rest of the main pattern as well as the alternative in the subpattern.
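
       The left-to-right rule can be sketched as follows, assuming
       Python's re module as a stand-in; like PCRE's pcre_exec(), it tries
       alternatives in order rather than looking for the longest match.

```python
import re

either = [m.group() for m in re.finditer(r"gilbert|sullivan", "gilbert and sullivan")]

# Alternatives are tried left to right, so "a" wins even though "ab" is longer.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, an alternative only "succeeds" if the rest of the
# pattern matches too, so here the second alternative is the one used.
needs_rest = re.fullmatch(r"(a|ab)c", "abc").group(1)
```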
798

INTERNAL OPTION SETTING

800
801       The settings of the  PCRE_CASELESS,  PCRE_MULTILINE,  PCRE_DOTALL,  and
802       PCRE_EXTENDED  options  can  be  changed  from  within the pattern by a
803       sequence of Perl option letters enclosed  between  "(?"  and  ")".  The
804       option letters are
805
806         i  for PCRE_CASELESS
807         m  for PCRE_MULTILINE
808         s  for PCRE_DOTALL
809         x  for PCRE_EXTENDED
810
811       For example, (?im) sets caseless, multiline matching. It is also possi‐
812       ble to unset these options by preceding the letter with a hyphen, and a
813       combined  setting and unsetting such as (?im-sx), which sets PCRE_CASE‐
814       LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and  PCRE_EXTENDED,
815       is  also  permitted.  If  a  letter  appears  both before and after the
816       hyphen, the option is unset.
817
818       When an option change occurs at top level (that is, not inside  subpat‐
819       tern  parentheses),  the change applies to the remainder of the pattern
820       that follows.  If the change is placed right at the start of a pattern,
821       PCRE extracts it into the global options (and it will therefore show up
822       in data extracted by the pcre_fullinfo() function).
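
       Option letters at the start of a pattern can be sketched with
       Python's re module, used here as a stand-in (an assumption, not the
       PCRE API). The flags attribute of the compiled pattern is loosely
       analogous to the options reported by pcre_fullinfo(). Note that
       Python 3.11 and later accept an unscoped (?im) only at the very
       start of a pattern, which is narrower than what PCRE allows.

```python
import re

# Option letters at the very start of the pattern become pattern-wide flags.
compiled = re.compile(r"(?im)^abc$")

has_caseless = bool(compiled.flags & re.IGNORECASE)
has_multiline = bool(compiled.flags & re.MULTILINE)

# Caseless and multiline together: "ABC" on the second line matches.
hit = compiled.search("def\nABC")
```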
823
824       An option change within a subpattern (see below for  a  description  of
825       subpatterns) affects only that part of the current pattern that follows
826       it, so
827
828         (a(?i)b)c
829
830       matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
831       used).   By  this means, options can be made to have different settings
832       in different parts of the pattern. Any changes made in one  alternative
833       do  carry  on  into subsequent branches within the same subpattern. For
834       example,
835
836         (a(?i)b|c)
837
838       matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
839       first  branch  is  abandoned before the option setting. This is because
840       the effects of option settings happen at compile time.  The  behaviour
841       would otherwise be very hard to predict.
842
843       The  PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
844       can be changed in the same way as the Perl-compatible options by  using
845       the characters J, U and X respectively.
846

SUBPATTERNS

848
849       Subpatterns are delimited by parentheses (round brackets), which can be
850       nested.  Turning part of a pattern into a subpattern does two things:
851
852       1. It localizes a set of alternatives. For example, the pattern
853
854         cat(aract|erpillar|)
855
856       matches one of the words "cat", "cataract", or  "caterpillar".  Without
857       the  parentheses,  it  would  match  "cataract", "erpillar" or an empty
858       string.
859
860       2. It sets up the subpattern as  a  capturing  subpattern.  This  means
861       that,  when  the  whole  pattern  matches,  that portion of the subject
862       string that matched the subpattern is passed back to the caller via the
863       ovector  argument  of pcre_exec(). Opening parentheses are counted from
864       left to right (starting from 1) to obtain  numbers  for  the  capturing
865       subpatterns.
866
867       For  example,  if the string "the red king" is matched against the pat‐
868       tern
869
870         the ((red|white) (king|queen))
871
872       the captured substrings are "red king", "red", and "king", and are num‐
873       bered 1, 2, and 3, respectively.
874
875       The  fact  that  plain  parentheses  fulfil two functions is not always
876       helpful.  There are often times when a grouping subpattern is  required
877       without  a capturing requirement. If an opening parenthesis is followed
878       by a question mark and a colon, the subpattern does not do any  captur‐
879       ing,  and  is  not  counted when computing the number of any subsequent
880       capturing subpatterns. For example, if the string "the white queen"  is
881       matched against the pattern
882
883         the ((?:red|white) (king|queen))
884
885       the captured substrings are "white queen" and "queen", and are numbered
886       1 and 2. The maximum number of capturing subpatterns is 65535.
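
       The numbering of capturing and non-capturing parentheses can be
       checked with a sketch. It assumes Python's re module as a stand-in;
       the groups() tuple plays the role of the substrings that pcre_exec()
       returns through its ovector argument.

```python
import re

capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")

# Groups are numbered by counting opening parentheses from left to right;
# a (?: group is skipped in the count.
capturing_groups = capturing.groups()
non_capturing_groups = non_capturing.groups()
```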
887
888       As a convenient shorthand, if any option settings are required  at  the
889       start  of  a  non-capturing  subpattern,  the option letters may appear
890       between the "?" and the ":". Thus the two patterns
891
892         (?i:saturday|sunday)
893         (?:(?i)saturday|sunday)
894
895       match exactly the same set of strings. Because alternative branches are
896       tried  from  left  to right, and options are not reset until the end of
897       the subpattern is reached, an option setting in one branch does  affect
898       subsequent  branches,  so  the above patterns match "SUNDAY" as well as
899       "Saturday".
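
       A minimal sketch of option letters between "?" and ":", assuming
       Python's re module as a stand-in for PCRE. Only the (?i:...) form is
       shown, because Python 3.11 and later reject an unscoped (?i) in the
       middle of a pattern.

```python
import re

scoped = re.compile(r"(?i:saturday|sunday)")
scoped_hits = [s for s in ("Saturday", "SUNDAY", "monday") if scoped.fullmatch(s)]

# The option ends with the group: "b" is caseless, "a" and "c" are not.
local = re.compile(r"a(?i:b)c")
local_hits = [s for s in ("abc", "aBc", "Abc", "abC") if local.fullmatch(s)]
```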
900

DUPLICATE SUBPATTERN NUMBERS

902
903       Perl 5.10 introduced a feature whereby each alternative in a subpattern
904       uses  the same numbers for its capturing parentheses. Such a subpattern
905       starts with (?| and is itself a non-capturing subpattern. For  example,
906       consider this pattern:
907
908         (?|(Sat)ur|(Sun))day
909
910       Because  the two alternatives are inside a (?| group, both sets of cap‐
911       turing parentheses are numbered one. Thus, when  the  pattern  matches,
912       you  can  look  at captured substring number one, whichever alternative
913       matched. This construct is useful when you want to  capture  part,  but
914       not all, of one of a number of alternatives. Inside a (?| group, paren‐
915       theses are numbered as usual, but the number is reset at the  start  of
916       each  branch. The numbers of any capturing buffers that follow the sub‐
917       pattern start after the highest number used in any branch. The  follow‐
918       ing  example  is taken from the Perl documentation.  The numbers under‐
919       neath show in which buffer the captured content will be stored.
920
921         # before  ---------------branch-reset----------- after
922         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
923         # 1            2         2  3        2     3     4
924
925       A backreference or a recursive call to  a  numbered  subpattern  always
926       refers to the first one in the pattern with the given number.
927
928       An  alternative approach to using this "branch reset" feature is to use
929       duplicate named subpatterns, as described in the next section.
930

NAMED SUBPATTERNS

932
933       Identifying capturing parentheses by number is simple, but  it  can  be
934       very  hard  to keep track of the numbers in complicated regular expres‐
935       sions. Furthermore, if an  expression  is  modified,  the  numbers  may
936       change.  To help with this difficulty, PCRE supports the naming of sub‐
937       patterns. This feature was not added to Perl until release 5.10. Python
938       had  the  feature earlier, and PCRE introduced it at release 4.0, using
939       the Python syntax. PCRE now supports both the Perl and the Python  syn‐
940       tax.
941
942       In  PCRE,  a subpattern can be named in one of three ways: (?<name>...)
943       or (?'name'...) as in Perl, or (?P<name>...) as in  Python.  References
944       to capturing parentheses from other parts of the pattern, such as back‐
945       references, recursion, and conditions, can be made by name as  well  as
946       by number.
947
948       Names  consist  of  up  to  32 alphanumeric characters and underscores.
949       Named capturing parentheses are still  allocated  numbers  as  well  as
950       names,  exactly as if the names were not present. The PCRE API provides
951       function calls for extracting the name-to-number translation table from
952       a compiled pattern. There is also a convenience function for extracting
953       a captured substring by name.
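
       Named subpatterns in the Python syntax can be sketched with
       Python's own re module, used as a stand-in for the PCRE API. The
       pattern and the names "year" and "month" are hypothetical. The
       groupindex attribute corresponds loosely to the name-to-number
       translation table mentioned above.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.fullmatch("2008-01")

by_name = m.group("year")            # extract a captured substring by name
by_number = m.group(1)               # the same group is still number 1
name_table = dict(date.groupindex)   # name-to-number translation table
```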
954
955       By default, a name must be unique within a pattern, but it is  possible
956       to relax this constraint by setting the PCRE_DUPNAMES option at compile
957       time. This can be useful for patterns where only one  instance  of  the
958       named  parentheses  can  match. Suppose you want to match the name of a
959       weekday, either as a 3-letter abbreviation or as the full name, and  in
960       both cases you want to extract the abbreviation. This pattern (ignoring
961       the line breaks) does the job:
962
963         (?<DN>Mon|Fri|Sun)(?:day)?|
964         (?<DN>Tue)(?:sday)?|
965         (?<DN>Wed)(?:nesday)?|
966         (?<DN>Thu)(?:rsday)?|
967         (?<DN>Sat)(?:urday)?
968
969       There are five capturing substrings, but only one is ever set  after  a
970       match.  (An alternative way of solving this problem is to use a "branch
971       reset" subpattern, as described in the previous section.)
972
973       The convenience function for extracting the data by  name  returns  the
974       substring  for  the first (and in this example, the only) subpattern of
975       that name that matched. This saves searching  to  find  which  numbered
976       subpattern  it  was. If you make a reference to a non-unique named sub‐
977       pattern from elsewhere in the pattern, the one that corresponds to  the
978       lowest  number  is used. For further details of the interfaces for han‐
979       dling named subpatterns, see the pcreapi documentation.
980

REPETITION

982
983       Repetition is specified by quantifiers, which can  follow  any  of  the
984       following items:
985
986         a literal data character
987         the dot metacharacter
988         the \C escape sequence
989         the \X escape sequence (in UTF-8 mode with Unicode properties)
990         the \R escape sequence
991         an escape such as \d that matches a single character
992         a character class
993         a back reference (see next section)
994         a parenthesized subpattern (unless it is an assertion)
995
996       The  general repetition quantifier specifies a minimum and maximum num‐
997       ber of permitted matches, by giving the two numbers in  curly  brackets
998       (braces),  separated  by  a comma. The numbers must be less than 65536,
999       and the first must be less than or equal to the second. For example:
1000
1001         z{2,4}
1002
1003       matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a
1004       special  character.  If  the second number is omitted, but the comma is
1005       present, there is no upper limit; if the second number  and  the  comma
1006       are  both omitted, the quantifier specifies an exact number of required
1007       matches. Thus
1008
1009         [aeiou]{3,}
1010
1011       matches at least 3 successive vowels, but may match many more, while
1012
1013         \d{8}
1014
1015       matches exactly 8 digits. An opening curly bracket that  appears  in  a
1016       position  where a quantifier is not allowed, or one that does not match
1017       the syntax of a quantifier, is taken as a literal character. For  exam‐
1018       ple, {,6} is not a quantifier, but a literal string of four characters.
1019
1020       In  UTF-8  mode,  quantifiers  apply to UTF-8 characters rather than to
1021       individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char‐
1022       acters, each of which is represented by a two-byte sequence. Similarly,
1023       when Unicode property support is available, \X{3} matches three Unicode
1024       extended  sequences,  each of which may be several bytes long (and they
1025       may be of different lengths).
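
       That quantifiers count characters rather than bytes can be sketched
       with Python's re module, used as a stand-in. This is an assumption:
       Python matches on code points instead of UTF-8 bytes, and spells the
       escape \u0100 instead of \x{100}.

```python
import re

subject = "\u0100\u0100"                  # two copies of U+0100
m = re.fullmatch(r"\u0100{2}", subject)   # {2} counts characters, not bytes

chars_matched = len(m.group())
bytes_matched = len(m.group().encode("utf-8"))
```

       Two characters are matched, and they occupy four bytes once encoded
       as UTF-8.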
1026
1027       The quantifier {0} is permitted, causing the expression to behave as if
1028       the previous item and the quantifier were not present.
1029
1030       For  convenience, the three most common quantifiers have single-charac‐
1031       ter abbreviations:
1032
1033         *    is equivalent to {0,}
1034         +    is equivalent to {1,}
1035         ?    is equivalent to {0,1}
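
       The equivalences above, and the earlier brace examples, can be
       checked with a sketch that assumes Python's re module as a stand-in
       for PCRE. The subject string is hypothetical.

```python
import re

subject = "b ab aab"

# Each abbreviation finds exactly what its brace form finds.
star = re.findall(r"a*b", subject) == re.findall(r"a{0,}b", subject)
plus = re.findall(r"a+b", subject) == re.findall(r"a{1,}b", subject)
opt = re.findall(r"a?b", subject) == re.findall(r"a{0,1}b", subject)

bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if re.fullmatch(r"z{2,4}", s)]
eight_digits = re.fullmatch(r"\d{8}", "20080123") is not None
```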
1036
1037       It is possible to construct infinite loops by  following  a  subpattern
1038       that can match no characters with a quantifier that has no upper limit,
1039       for example:
1040
1041         (a?)*
1042
1043       Earlier versions of Perl and PCRE used to give an error at compile time
1044       for  such  patterns. However, because there are cases where this can be
1045       useful, such patterns are now accepted, but if any  repetition  of  the
1046       subpattern  does in fact match no characters, the loop is forcibly bro‐
1047       ken.
1048
1049       By default, the quantifiers are "greedy", that is, they match  as  much
1050       as  possible  (up  to  the  maximum number of permitted times), without
1051       causing the rest of the pattern to fail. The classic example  of  where
1052       this gives problems is in trying to match comments in C programs. These
1053       appear between /* and */ and within the comment,  individual  *  and  /
1054       characters  may  appear. An attempt to match C comments by applying the
1055       pattern
1056
1057         /\*.*\*/
1058
1059       to the string
1060
1061         /* first comment */  not comment  /* second comment */
1062
1063       fails to isolate the comments, because it matches the entire string owing
1064       to the greediness of the .* item.
1065
1066       However,  if  a quantifier is followed by a question mark, it ceases to
1067       be greedy, and instead matches the minimum number of times possible, so
1068       the pattern
1069
1070         /\*.*?\*/
1071
1072       does  the  right  thing with the C comments. The meaning of the various
1073       quantifiers is not otherwise changed,  just  the  preferred  number  of
1074       matches.   Do  not  confuse this use of question mark with its use as a
1075       quantifier in its own right. Because it has two uses, it can  sometimes
1076       appear doubled, as in
1077
1078         \d??\d
1079
1080       which matches one digit by preference, but can match two if that is the
1081       only way the rest of the pattern matches.
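
       Greedy and minimizing quantifiers can be compared in a sketch,
       assuming Python's re module as a stand-in for PCRE:

```python
import re

source = "/* first comment */  not comment  /* second comment */"

greedy = re.findall(r"/\*.*\*/", source)    # one match: the whole string
lazy = re.findall(r"/\*.*?\*/", source)     # one match per comment

# \d?? prefers to match nothing, so one digit is taken by preference ...
prefers_one = re.match(r"\d??\d", "12").group()

# ... but two are matched when that is the only way the rest can match.
takes_two = re.match(r"\d??\d$", "12").group()
```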
1082
1083       If the PCRE_UNGREEDY option is set (an option that is not available  in
1084       Perl),  the  quantifiers are not greedy by default, but individual ones
1085       can be made greedy by following them with a  question  mark.  In  other
1086       words, it inverts the default behaviour.
1087
1088       When  a  parenthesized  subpattern  is quantified with a minimum repeat
1089       count that is greater than 1 or with a limited maximum, more memory  is
1090       required  for  the  compiled  pattern, in proportion to the size of the
1091       minimum or maximum.
1092
1093       If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv‐
1094       alent  to  Perl's  /s) is set, thus allowing the dot to match newlines,
1095       the pattern is implicitly anchored, because whatever  follows  will  be
1096       tried  against every character position in the subject string, so there
1097       is no point in retrying the overall match at  any  position  after  the
1098       first.  PCRE  normally treats such a pattern as though it were preceded
1099       by \A.
1100
1101       In cases where it is known that the subject  string  contains  no  new‐
1102       lines,  it  is  worth setting PCRE_DOTALL in order to obtain this opti‐
1103       mization, or alternatively using ^ to indicate anchoring explicitly.
1104
1105       However, there is one situation where the optimization cannot be  used.
1106       When  .*   is  inside  capturing  parentheses that are the subject of a
1107       backreference elsewhere in the pattern, a match at the start  may  fail
1108       where a later one succeeds. Consider, for example:
1109
1110         (.*)abc\1
1111
1112       If  the subject is "xyz123abc123" the match point is the fourth charac‐
1113       ter. For this reason, such a pattern is not implicitly anchored.
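
       The match point in this example can be confirmed with a sketch,
       assuming Python's re module as a stand-in for PCRE. Offsets in
       Python are zero-based, so the fourth character is offset 3.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

match_offset = m.start()   # zero-based offset of the match point
captured = m.group(1)      # what (.*) held when the match succeeded
```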
1114
1115       When a capturing subpattern is repeated, the value captured is the sub‐
1116       string that matched the final iteration. For example, after
1117
1118         (tweedle[dume]{3}\s*)+
1119
1120       has matched "tweedledum tweedledee" the value of the captured substring
1121       is "tweedledee". However, if there are  nested  capturing  subpatterns,
1122       the  corresponding captured values may have been set in previous itera‐
1123       tions. For example, after
1124
1125         /(a|(b))+/
1126
1127       matches "aba" the value of the second captured substring is "b".
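
       Both cases can be sketched with Python's re module, used as a
       stand-in (an assumption; it happens to keep nested captures from
       earlier iterations in the same way as PCRE):

```python
import re

# The outer group reports the substring matched by its final iteration.
last_iteration = re.match(
    r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee"
).group(1)

# Group 2 was last set on the second iteration and keeps that value.
nested = re.match(r"(a|(b))+", "aba").groups()
```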
1128

ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

1130
1131       With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
1132       repetition,  failure  of what follows normally causes the repeated item
1133       to be re-evaluated to see if a different number of repeats  allows  the
1134       rest  of  the pattern to match. Sometimes it is useful to prevent this,
1135       either to change the nature of the match, or to cause it to fail earlier
1136       than  it otherwise might, when the author of the pattern knows there is
1137       no point in carrying on.
1138
1139       Consider, for example, the pattern \d+foo when applied to  the  subject
1140       line
1141
1142         123456bar
1143
1144       After matching all 6 digits and then failing to match "foo", the normal
1145       action of the matcher is to try again with only 5 digits  matching  the
1146       \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
1147       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
1148       the  means for specifying that once a subpattern has matched, it is not
1149       to be re-evaluated in this way.
1150
1151       If we use atomic grouping for the previous example, the  matcher  gives
1152       up  immediately  on failing to match "foo" the first time. The notation
1153       is a kind of special parenthesis, starting with (?> as in this example:
1154
1155         (?>\d+)foo
1156
1157       This kind of parenthesis "locks up" the  part of the  pattern  it  con‐
1158       tains  once  it  has matched, and a failure further into the pattern is
1159       prevented from backtracking into it. Backtracking past it  to  previous
1160       items, however, works as normal.
1161
1162       An  alternative  description  is that a subpattern of this type matches
1163       the string of characters that an  identical  standalone  pattern  would
1164       match, if anchored at the current point in the subject string.
1165
1166       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
1167       such as the above example can be thought of as a maximizing repeat that
1168       must  swallow  everything  it can. So, while both \d+ and \d+? are pre‐
1169       pared to adjust the number of digits they match in order  to  make  the
1170       rest of the pattern match, (?>\d+) can only match an entire sequence of
1171       digits.
1172
1173       Atomic groups in general can of course contain arbitrarily  complicated
1174       subpatterns,  and  can  be  nested. However, when the subpattern for an
1175       atomic group is just a single repeated item, as in the example above, a
1176       simpler notation, called a "possessive quantifier", can be used.  This
1177       consists of an additional + character  following  a  quantifier.  Using
1178       this notation, the previous example can be rewritten as
1179
1180         \d++foo
1181
1182       Note that a possessive quantifier can be used with an entire group, for
1183       example:
1184
1185         (abc|xyz){2,3}+
1186
1187       Possessive  quantifiers  are  always  greedy;  the   setting   of   the
1188       PCRE_UNGREEDY option is ignored. They are a convenient notation for the
1189       simpler forms of atomic group. However, there is no difference  in  the
1190       meaning  of  a  possessive  quantifier and the equivalent atomic group,
1191       though there may be a performance  difference;  possessive  quantifiers
1192       should be slightly faster.
1193
1194       The  possessive  quantifier syntax is an extension to the Perl 5.8 syn‐
1195       tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first
1196       edition of his book. Mike McCloskey liked it, so implemented it when he
1197       built Sun's Java package, and PCRE copied it from there. It  ultimately
1198       found its way into Perl at release 5.10.
1199
1200       PCRE has an optimization that automatically "possessifies" certain sim‐
1201       ple pattern constructs. For example, the sequence  A+B  is  treated  as
1202       A++B  because  there is no point in backtracking into a sequence of A's
1203       when B must follow.
1204
1205       When a pattern contains an unlimited repeat inside  a  subpattern  that
1206       can  itself  be  repeated  an  unlimited number of times, the use of an
1207       atomic group is the only way to avoid some  failing  matches  taking  a
1208       very long time indeed. The pattern
1209
1210         (\D+|<\d+>)*[!?]
1211
1212       matches  an  unlimited number of substrings that either consist of non-
1213       digits, or digits enclosed in <>, followed by either ! or  ?.  When  it
1214       matches, it runs quickly. However, if it is applied to
1215
1216         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1217
1218       it  takes  a  long  time  before reporting failure. This is because the
1219       string can be divided between the internal \D+ repeat and the  external
1220       *  repeat  in  a  large  number of ways, and all have to be tried. (The
1221       example uses [!?] rather than a single character at  the  end,  because
1222       both  PCRE  and  Perl have an optimization that allows for fast failure
1223       when a single character is used. They remember the last single  charac‐
1224       ter  that  is required for a match, and fail early if it is not present
1225       in the string.) If the pattern is changed so that  it  uses  an  atomic
1226       group, like this:
1227
1228         ((?>\D+)|<\d+>)*[!?]
1229
1230       sequences of non-digits cannot be broken, and failure happens quickly.
1231

BACK REFERENCES

1233
1234       Outside a character class, a backslash followed by a digit greater than
1235       0 (and possibly further digits) is a back reference to a capturing sub‐
1236       pattern  earlier  (that is, to its left) in the pattern, provided there
1237       have been that many previous capturing left parentheses.
1238
1239       However, if the decimal number following the backslash is less than 10,
1240       it  is  always  taken  as a back reference, and causes an error only if
1241       there are not that many capturing left parentheses in the  entire  pat‐
1242       tern.  In  other words, the parentheses that are referenced need not be
1243       to the left of the reference for numbers less than 10. A "forward  back
1244       reference"  of  this  type can make sense when a repetition is involved
1245       and the subpattern to the right has participated in an  earlier  itera‐
1246       tion.
1247
1248       It  is  not  possible to have a numerical "forward back reference" to a
1249       subpattern whose number is 10 or  more  using  this  syntax  because  a
1250       sequence  such  as  \50 is interpreted as a character defined in octal.
1251       See the subsection entitled "Non-printing characters" above for further
1252       details  of  the  handling of digits following a backslash. There is no
1253       such problem when named parentheses are used. A back reference  to  any
1254       subpattern is possible using named parentheses (see below).
1255
1256       Another  way  of  avoiding  the ambiguity inherent in the use of digits
1257       following a backslash is to use the \g escape sequence, which is a fea‐
1258       ture  introduced  in  Perl  5.10.  This  escape  must be followed by an
1259       unsigned number or a negative number, optionally  enclosed  in  braces.
1260       These examples are all identical:
1261
1262         (ring), \1
1263         (ring), \g1
1264         (ring), \g{1}
1265
1266       An  unsigned number specifies an absolute reference without the ambigu‐
1267       ity that is present in the older syntax. It is also useful when literal
1268       digits follow the reference. A negative number is a relative reference.
1269       Consider this example:
1270
1271         (abc(def)ghi)\g{-1}
1272
1273       The sequence \g{-1} is a reference to the most recently started
1274       capturing subpattern before \g, that is, it is equivalent to \2. Similarly,
1275       \g{-2} would be equivalent to \1. The use of relative references can be
1276       helpful  in  long  patterns,  and  also in patterns that are created by
1277       joining together fragments that contain references within themselves.
1278
1279       A back reference matches whatever actually matched the  capturing  sub‐
1280       pattern  in  the  current subject string, rather than anything matching
1281       the subpattern itself (see "Subpatterns as subroutines" below for a way
1282       of doing that). So the pattern
1283
1284         (sens|respons)e and \1ibility
1285
1286       matches  "sense and sensibility" and "response and responsibility", but
1287       not "sense and responsibility". If caseful matching is in force at  the
1288       time  of the back reference, the case of letters is relevant. For exam‐
1289       ple,
1290
1291         ((?i)rah)\s+\1
1292
1293       matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
1294       original capturing subpattern is matched caselessly.
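
       The behaviour described above can be tried out with a short sketch.
       It uses Python's re module rather than PCRE itself; that choice is
       an assumption made for illustration, on the basis that re treats
       these particular back references in the same way. Python does not
       accept (?i) in the middle of a pattern, so the scoped form (?i:rah)
       is substituted for it here.

```python
import re

# A back reference matches the text the group actually captured,
# not anything else that the group's subpattern could match.
pat = re.compile(r"(sens|respons)e and \1ibility")
results = {
    s: bool(pat.fullmatch(s))
    for s in (
        "sense and sensibility",
        "response and responsibility",
        "sense and responsibility",
    )
}

# The group is matched caselessly, but the back reference sits outside
# the (?i:...) scope, so it is compared casefully.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = {s: bool(rah.fullmatch(s)) for s in ("rah rah", "RAH RAH", "RAH rah")}

print(results)
print(rah_results)
```

       Only the mixed pairs ("sense and responsibility" and "RAH rah") are
       rejected.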
1295
1296       There  are  several  different ways of writing back references to named
1297       subpatterns. The .NET syntax \k{name} and the Perl syntax  \k<name>  or
1298       \k'name'  are supported, as is the Python syntax (?P=name). Perl 5.10's
1299       unified back reference syntax, in which \g can be used for both numeric
1300       and  named  references,  is  also supported. We could rewrite the above
1301       example in any of the following ways:
1302
1303         (?<p1>(?i)rah)\s+\k<p1>
1304         (?'p1'(?i)rah)\s+\k{p1}
1305         (?P<p1>(?i)rah)\s+(?P=p1)
1306         (?<p1>(?i)rah)\s+\g{p1}
1307
1308       A subpattern that is referenced by  name  may  appear  in  the  pattern
1309       before or after the reference.
1310
1311       There  may be more than one back reference to the same subpattern. If a
1312       subpattern has not actually been used in a particular match,  any  back
1313       references to it always fail. For example, the pattern
1314
1315         (a|(bc))\2
1316
1317       always  fails if it starts to match "a" rather than "bc". Because there
1318       may be many capturing parentheses in a pattern,  all  digits  following
1319       the  backslash  are taken as part of a potential back reference number.
1320       If the pattern continues with a digit character, some delimiter must be
1321       used  to  terminate  the back reference. If the PCRE_EXTENDED option is
1322       set, this can be whitespace.  Otherwise an  empty  comment  (see  "Com‐
1323       ments" below) can be used.
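
       As a rough illustration of both points, the following sketch again
       assumes Python's re module as a stand-in for PCRE. It shows a back
       reference to an unset group failing, and an empty comment used to
       separate a back reference from a literal digit. The subject strings
       are made up for the example.

```python
import re

pat = re.compile(r"(a|(bc))\2")

# Group 2 takes no part when the first alternative matches "a",
# so the back reference \2 cannot match and the whole match fails.
unset = pat.search("aa")

# When the second alternative is taken, group 2 is set to "bc".
both = pat.fullmatch("bcbc")

# The empty comment (?#) ends the reference, so the pattern means
# "group 1, a repeat of group 1, then a literal 1" and not "group 11".
delim = re.compile(r"(a)\1(?#)1")
delimited = delim.fullmatch("aa1")

print(unset, both, delimited)
```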
1324
1325       A  back reference that occurs inside the parentheses to which it refers
1326       fails when the subpattern is first used, so, for example,  (a\1)  never
1327       matches.   However,  such references can be useful inside repeated sub‐
1328       patterns. For example, the pattern
1329
1330         (a|b\1)+
1331
1332       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter‐
1333       ation  of  the  subpattern,  the  back  reference matches the character
1334       string corresponding to the previous iteration. In order  for  this  to
1335       work,  the  pattern must be such that the first iteration does not need
1336       to match the back reference. This can be done using alternation, as  in
1337       the example above, or by a quantifier with a minimum of zero.
1338

ASSERTIONS

1340
1341       An  assertion  is  a  test on the characters following or preceding the
1342       current matching point that does not actually consume  any  characters.
1343       The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are
1344       described above.
1345
1346       More complicated assertions are coded as  subpatterns.  There  are  two
1347       kinds:  those  that  look  ahead of the current position in the subject
1348       string, and those that look  behind  it.  An  assertion  subpattern  is
1349       matched  in  the  normal way, except that it does not cause the current
1350       matching position to be changed.
1351
1352       Assertion subpatterns are not capturing subpatterns,  and  may  not  be
1353       repeated,  because  it  makes no sense to assert the same thing several
1354       times. If any kind of assertion contains capturing  subpatterns  within
1355       it,  these are counted for the purposes of numbering the capturing sub‐
1356       patterns in the whole pattern.  However, substring capturing is carried
1357       out  only  for  positive assertions, because it does not make sense for
1358       negative assertions.
1359
1360   Lookahead assertions
1361
1362       Lookahead assertions start with (?= for positive assertions and (?! for
1363       negative assertions. For example,
1364
1365         \w+(?=;)
1366
1367       matches  a word followed by a semicolon, but does not include the semi‐
1368       colon in the match, and
1369
1370         foo(?!bar)
1371
1372       matches any occurrence of "foo" that is not  followed  by  "bar".  Note
1373       that the apparently similar pattern
1374
1375         (?!foo)bar
1376
1377       does  not  find  an  occurrence  of "bar" that is preceded by something
1378       other than "foo"; it finds any occurrence of "bar" whatsoever,  because
1379       the assertion (?!foo) is always true when the next three characters are
1380       "bar". A lookbehind assertion is needed to achieve the other effect.
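
       The difference between these patterns is easy to check. The sketch
       below assumes Python's re module, which is not PCRE but handles
       these simple assertions identically; the subject strings are
       invented for the example.

```python
import re

# The semicolon is required but is not part of the match.
word = re.search(r"\w+(?=;)", "alpha;").group()

# The first "foo" is followed by "bar", so the match is the second one.
foo_pos = re.search(r"foo(?!bar)", "foobar foobaz").start()

# (?!foo) is tested where "bar" begins, so it can never fail there.
ahead = re.search(r"(?!foo)bar", "foobar")

# A negative lookbehind gives the intended "not preceded by foo" test.
behind_foo = re.search(r"(?<!foo)bar", "foobar")
behind_baz = re.search(r"(?<!foo)bar", "bazbar")

print(word, foo_pos, ahead, behind_foo, behind_baz)
```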
1381
1382       If you want to force a matching failure at some point in a pattern, the
1383       most  convenient  way  to  do  it  is with (?!) because an empty string
1384       always matches, so an assertion that requires there not to be an  empty
1385       string must always fail.
1386
1387   Lookbehind assertions
1388
1389       Lookbehind  assertions start with (?<= for positive assertions and (?<!
1390       for negative assertions. For example,
1391
1392         (?<!foo)bar
1393
1394       does find an occurrence of "bar" that is not  preceded  by  "foo".  The
1395       contents  of  a  lookbehind  assertion are restricted such that all the
1396       strings it matches must have a fixed length. However, if there are sev‐
1397       eral  top-level  alternatives,  they  do  not all have to have the same
1398       fixed length. Thus
1399
1400         (?<=bullock|donkey)
1401
1402       is permitted, but
1403
1404         (?<!dogs?|cats?)
1405
1406       causes an error at compile time. Branches that match  different  length
1407       strings  are permitted only at the top level of a lookbehind assertion.
1408       This is an extension compared with  Perl  (at  least  for  5.8),  which
1409       requires  all branches to match the same length of string. An assertion
1410       such as
1411
1412         (?<=ab(c|de))
1413
1414       is not permitted, because its single top-level  branch  can  match  two
1415       different  lengths,  but  it is acceptable if rewritten to use two top-
1416       level branches:
1417
1418         (?<=abc|abde)
1419
1420       In some cases, the Perl 5.10 escape sequence \K (see above) can be used
1421       instead of a lookbehind assertion; this is not restricted to a fixed
1422       length.
1423
1424       The implementation of lookbehind assertions is, for  each  alternative,
1425       to  temporarily  move the current position back by the fixed length and
1426       then try to match. If there are insufficient characters before the cur‐
1427       rent position, the assertion fails.
1428
1429       PCRE does not allow the \C escape (which matches a single byte in UTF-8
1430       mode) to appear in lookbehind assertions, because it makes it  impossi‐
1431       ble  to  calculate the length of the lookbehind. The \X and \R escapes,
1432       which can match different numbers of bytes, are also not permitted.
1433
1434       Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
1435       assertions  to  specify  efficient  matching  at the end of the subject
1436       string. Consider a simple pattern such as
1437
1438         abcd$
1439
1440       when applied to a long string that does  not  match.  Because  matching
1441       proceeds from left to right, PCRE will look for each "a" in the subject
1442       and then see if what follows matches the rest of the  pattern.  If  the
1443       pattern is specified as
1444
1445         ^.*abcd$
1446
1447       the  initial .* matches the entire string at first, but when this fails
1448       (because there is no following "a"), it backtracks to match all but the
1449       last  character,  then all but the last two characters, and so on. Once
1450       again the search for "a" covers the entire string, from right to  left,
1451       so we are no better off. However, if the pattern is written as
1452
1453         ^.*+(?<=abcd)
1454
1455       there  can  be  no backtracking for the .*+ item; it can match only the
1456       entire string. The subsequent lookbehind assertion does a  single  test
1457       on  the last four characters. If it fails, the match fails immediately.
1458       For long strings, this approach makes a significant difference  to  the
1459       processing time.
1460
1461   Using multiple assertions
1462
1463       Several assertions (of any sort) may occur in succession. For example,
1464
1465         (?<=\d{3})(?<!999)foo
1466
1467       matches  "foo" preceded by three digits that are not "999". Notice that
1468       each of the assertions is applied independently at the  same  point  in
1469       the  subject  string.  First  there  is a check that the previous three
1470       characters are all digits, and then there is  a  check  that  the  same
1471       three characters are not "999". This pattern does not match "foo"
1472       preceded by six characters, the first three of which are digits and the
1473       last three of which are not "999". For example, it does not match
1474       "123abcfoo". A pattern to do that is
1475
1476         (?<=\d{3}...)(?<!999)foo
1477
1478       This time the first assertion looks at the  preceding  six  characters,
1479       checking that the first three are digits, and then the second assertion
1480       checks that the preceding three characters are not "999".
1481
1482       Assertions can be nested in any combination. For example,
1483
1484         (?<=(?<!foo)bar)baz
1485
1486       matches an occurrence of "baz" that is preceded by "bar" which in  turn
1487       is not preceded by "foo", while
1488
1489         (?<=\d{3}(?!999)...)foo
1490
1491       is  another pattern that matches "foo" preceded by three digits and any
1492       three characters that are not "999".
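
       The four patterns in this subsection can be compared side by side.
       This is a sketch under the assumption that Python's re module is an
       acceptable stand-in for PCRE here: every lookbehind used has a
       single fixed length, which re also accepts. The variable names and
       subject strings are illustrative only.

```python
import re

not_999 = re.compile(r"(?<=\d{3})(?<!999)foo")        # both look at the same 3 chars
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")    # first looks back 6 chars
nested = re.compile(r"(?<=(?<!foo)bar)baz")           # lookbehind inside lookbehind
mixed = re.compile(r"(?<=\d{3}(?!999)...)foo")        # lookahead inside lookbehind

checks = {
    "not_999/123foo": bool(not_999.search("123foo")),
    "not_999/999foo": bool(not_999.search("999foo")),
    "not_999/123abcfoo": bool(not_999.search("123abcfoo")),
    "six_back/123abcfoo": bool(six_back.search("123abcfoo")),
    "six_back/123999foo": bool(six_back.search("123999foo")),
    "nested/xxxbarbaz": bool(nested.search("xxxbarbaz")),
    "nested/foobarbaz": bool(nested.search("foobarbaz")),
    "mixed/123abcfoo": bool(mixed.search("123abcfoo")),
    "mixed/123999foo": bool(mixed.search("123999foo")),
}
print(checks)
```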
1493

CONDITIONAL SUBPATTERNS

1495
1496       It is possible to cause the matching process to obey a subpattern  con‐
1497       ditionally  or to choose between two alternative subpatterns, depending
1498       on the result of an assertion, or whether a previous capturing  subpat‐
1499       tern  matched  or not. The two possible forms of conditional subpattern
1500       are
1501
1502         (?(condition)yes-pattern)
1503         (?(condition)yes-pattern|no-pattern)
1504
1505       If the condition is satisfied, the yes-pattern is used;  otherwise  the
1506       no-pattern  (if  present)  is used. If there are more than two alterna‐
1507       tives in the subpattern, a compile-time error occurs.
1508
1509       There are four kinds of condition: references  to  subpatterns,  refer‐
1510       ences to recursion, a pseudo-condition called DEFINE, and assertions.
1511
1512   Checking for a used subpattern by number
1513
1514       If  the  text between the parentheses consists of a sequence of digits,
1515       the condition is true if the capturing subpattern of  that  number  has
1516       previously  matched.  An  alternative notation is to precede the digits
1517       with a plus or minus sign. In this case, the subpattern number is rela‐
1518       tive rather than absolute.  The most recently opened parentheses can be
1519       referenced by (?(-1), the next most recent by (?(-2),  and  so  on.  In
1520       looping constructs it can also make sense to refer to subsequent groups
1521       with constructs such as (?(+2).
1522
1523       Consider the following pattern, which  contains  non-significant  white
1524       space to make it more readable (assume the PCRE_EXTENDED option) and to
1525       divide it into three parts for ease of discussion:
1526
1527         ( \( )?    [^()]+    (?(1) \) )
1528
1529       The first part matches an optional opening  parenthesis,  and  if  that
1530       character is present, sets it as the first captured substring. The sec‐
1531       ond part matches one or more characters that are not  parentheses.  The
1532       third part is a conditional subpattern that tests whether the first set
1533       of parentheses matched or not. If they did, that is, if the subject started
1534       with an opening parenthesis, the condition is true, and so the yes-pat‐
1535       tern is executed and a  closing  parenthesis  is  required.  Otherwise,
1536       since  no-pattern  is  not  present, the subpattern matches nothing. In
1537       other words,  this  pattern  matches  a  sequence  of  non-parentheses,
1538       optionally enclosed in parentheses.
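
       A quick way to see the conditional at work is the sketch below. It
       assumes Python's re module, whose (?(1)...) conditional and VERBOSE
       flag correspond to the PCRE syntax and the PCRE_EXTENDED option
       used above. The whole subject is required to match so that a stray
       parenthesis is reported as a failure.

```python
import re

# Optional "(", a run of non-parentheses, then ")" only if "(" was seen.
pat = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

outcome = {s: bool(pat.fullmatch(s)) for s in ("abc", "(abc)", "(abc", "abc)")}
print(outcome)
```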
1539
1540       If  you  were  embedding  this pattern in a larger one, you could use a
1541       relative reference:
1542
1543         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
1544
1545       This makes the fragment independent of the parentheses  in  the  larger
1546       pattern.
1547
1548   Checking for a used subpattern by name
1549
1550       Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
1551       used subpattern by name. For compatibility  with  earlier  versions  of
1552       PCRE,  which  had this facility before Perl, the syntax (?(name)...) is
1553       also recognized. However, there is a possible ambiguity with this  syn‐
1554       tax,  because  subpattern  names  may  consist entirely of digits. PCRE
1555       looks first for a named subpattern; if it cannot find one and the  name
1556       consists  entirely  of digits, PCRE looks for a subpattern of that num‐
1557       ber, which must be greater than zero. Using subpattern names that  con‐
1558       sist entirely of digits is not recommended.
1559
1560       Rewriting the above example to use a named subpattern gives this:
1561
1562         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
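
       For comparison, here is the same test written in the Python-style
       spelling, (?P<OPEN>...) together with the (?(OPEN)...) form of the
       condition. The sketch assumes Python's re module in place of PCRE;
       re does not accept the (?<OPEN>...) or (?(<OPEN>)...) spellings.

```python
import re

pat = re.compile(r"(?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )", re.VERBOSE)

with_parens = pat.fullmatch("(abc)")
without = pat.fullmatch("abc")
unbalanced = pat.fullmatch("(abc")

# The named group is set only when the opening parenthesis was matched.
print(with_parens.group("OPEN"), without.group("OPEN"), unbalanced)
```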
1563
1564
1565   Checking for pattern recursion
1566
1567       If the condition is the string (R), and there is no subpattern with the
1568       name R, the condition is true if a recursive call to the whole  pattern
1569       or any subpattern has been made. If digits or a name preceded by amper‐
1570       sand follow the letter R, for example:
1571
1572         (?(R3)...) or (?(R&name)...)
1573
1574       the condition is true if the most recent recursion is into the  subpat‐
1575       tern  whose  number or name is given. This condition does not check the
1576       entire recursion stack.
1577
1578       At "top level", all these recursion test conditions are  false.  Recur‐
1579       sive patterns are described below.
1580
1581   Defining subpatterns for use by reference only
1582
1583       If  the  condition  is  the string (DEFINE), and there is no subpattern
1584       with the name DEFINE, the condition is  always  false.  In  this  case,
1585       there  may  be  only  one  alternative  in the subpattern. It is always
1586       skipped if control reaches this point  in  the  pattern;  the  idea  of
1587       DEFINE  is that it can be used to define "subroutines" that can be ref‐
1588       erenced from elsewhere. (The use of "subroutines" is described  below.)
1589       For  example,  a pattern to match an IPv4 address could be written like
1590       this (ignore whitespace and line breaks):
1591
1592         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
1593         \b (?&byte) (\.(?&byte)){3} \b
1594
1595       The first part of the pattern is a DEFINE group inside which another
1596       group  named "byte" is defined. This matches an individual component of
1597       an IPv4 address (a number less than 256). When  matching  takes  place,
1598       this  part  of  the pattern is skipped because DEFINE acts like a false
1599       condition.
1600
1601       The rest of the pattern uses references to the named group to match the
1602       four  dot-separated  components of an IPv4 address, insisting on a word
1603       boundary at each end.
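
       Where DEFINE is not available, the same effect can be approximated
       by building the pattern from a string. The sketch below assumes
       Python's re module, which has neither DEFINE nor (?&name), so the
       "byte" fragment is simply interpolated four times. The names BYTE
       and IPV4 are hypothetical, and the repeated group is written as
       non-capturing.

```python
import re

# One component of an IPv4 address: a number from 0 to 255.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Four dot-separated components with a word boundary at each end.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

found = IPV4.search("host 192.168.0.1 up").group()
valid = {s: bool(IPV4.fullmatch(s)) for s in ("255.255.255.255", "256.1.1.1", "1.2.3")}
print(found, valid)
```

       Unlike the DEFINE version, the fragment is duplicated in the
       compiled pattern rather than called, but the strings accepted
       should be the same.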
1604
1605   Assertion conditions
1606
1607       If the condition is not in any of the above  formats,  it  must  be  an
1608       assertion.   This may be a positive or negative lookahead or lookbehind
1609       assertion. Consider  this  pattern,  again  containing  non-significant
1610       white space, and with the two alternatives on the second line:
1611
1612         (?(?=[^a-z]*[a-z])
1613         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
1614
1615       The  condition  is  a  positive  lookahead  assertion  that  matches an
1616       optional sequence of non-letters followed by a letter. In other  words,
1617       it  tests  for the presence of at least one letter in the subject. If a
1618       letter is found, the subject is matched against the first  alternative;
1619       otherwise  it  is  matched  against  the  second.  This pattern matches
1620       strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
1621       letters and dd are digits.
1622

COMMENTS

1624
1625       The  sequence (?# marks the start of a comment that continues up to the
1626       next closing parenthesis. Nested parentheses  are  not  permitted.  The
1627       characters  that make up a comment play no part in the pattern matching
1628       at all.
1629
1630       If the PCRE_EXTENDED option is set, an unescaped # character outside  a
1631       character  class  introduces  a  comment  that continues to immediately
1632       after the next newline in the pattern.
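
       Both comment styles can be seen in the following sketch, which
       assumes Python's re module and its VERBOSE flag as the counterpart
       of PCRE_EXTENDED. The patterns and comment text are made up for the
       example.

```python
import re

# (?#...) is skipped entirely, so this is the same pattern as "abcdef".
inline = re.compile(r"abc(?#this text plays no part in matching)def")

# With the extended option, an unescaped # outside a character class
# starts a comment that runs to the end of the line.
extended = re.compile(
    r"""
    \d{2}   # two digits
    [#]     # inside a character class, # is an ordinary character
    \d{2}   # two more digits
    """,
    re.VERBOSE,
)

inline_ok = bool(inline.fullmatch("abcdef"))
extended_ok = bool(extended.fullmatch("12#34"))
print(inline_ok, extended_ok)
```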
1633

RECURSIVE PATTERNS

1635
1636       Consider the problem of matching a string in parentheses, allowing  for
1637       unlimited  nested  parentheses.  Without the use of recursion, the best
1638       that can be done is to use a pattern that  matches  up  to  some  fixed
1639       depth  of  nesting.  It  is not possible to handle an arbitrary nesting
1640       depth.
1641
1642       For some time, Perl has provided a facility that allows regular expres‐
1643       sions  to recurse (amongst other things). It does this by interpolating
1644       Perl code in the expression at run time, and the code can refer to  the
1645       expression itself. A Perl pattern using code interpolation to solve the
1646       parentheses problem can be created like this:
1647
1648         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
1649
1650       The (?p{...}) item interpolates Perl code at run time, and in this case
1651       refers recursively to the pattern in which it appears.
1652
1653       Obviously, PCRE cannot support the interpolation of Perl code. Instead,
1654       it supports special syntax for recursion of  the  entire  pattern,  and
1655       also  for  individual  subpattern  recursion. After its introduction in
1656       PCRE and Python, this kind of recursion was  introduced  into  Perl  at
1657       release 5.10.
1658
1659       A  special  item  that consists of (? followed by a number greater than
1660       zero and a closing parenthesis is a recursive call of the subpattern of
1661       the  given  number, provided that it occurs inside that subpattern. (If
1662       not, it is a "subroutine" call, which is described  in  the  next  sec‐
1663       tion.)  The special item (?R) or (?0) is a recursive call of the entire
1664       regular expression.
1665
1666       In PCRE (like Python, but unlike Perl), a recursive subpattern call  is
1667       always treated as an atomic group. That is, once it has matched some of
1668       the subject string, it is never re-entered, even if it contains untried
1669       alternatives and there is a subsequent matching failure.
1670
1671       This  PCRE  pattern  solves  the nested parentheses problem (assume the
1672       PCRE_EXTENDED option is set so that white space is ignored):
1673
1674         \( ( (?>[^()]+) | (?R) )* \)
1675
1676       First it matches an opening parenthesis. Then it matches any number  of
1677       substrings  which  can  either  be  a sequence of non-parentheses, or a
1678       recursive match of the pattern itself (that is, a  correctly  parenthe‐
1679       sized substring).  Finally there is a closing parenthesis.
1680
1681       If  this  were  part of a larger pattern, you would not want to recurse
1682       the entire pattern, so instead you could use this:
1683
1684         ( \( ( (?>[^()]+) | (?1) )* \) )
1685
1686       We have put the pattern into parentheses, and caused the  recursion  to
1687       refer to them instead of the whole pattern.
1688
1689       In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
1690       tricky. This is made easier by the use of relative references. (A  Perl
1691       5.10  feature.)   Instead  of  (?1)  in the pattern above you can write
1692       (?-2) to refer to the second most recently opened parentheses preceding
1693       the  recursion.  In  other  words,  a  negative number counts capturing
1694       parentheses leftwards from the point at which it is encountered.
1695
1696       It is also possible to refer to  subsequently  opened  parentheses,  by
1697       writing  references  such  as (?+2). However, these cannot be recursive
1698       because the reference is not inside the  parentheses  that  are  refer‐
1699       enced.  They  are  always  "subroutine" calls, as described in the next
1700       section.
1701
1702       An alternative approach is to use named parentheses instead.  The  Perl
1703       syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also
1704       supported. We could rewrite the above example as follows:
1705
1706         (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )
1707
1708       If there is more than one subpattern with the same name,  the  earliest
1709       one is used.
1710
1711       This  particular  example pattern that we have been looking at contains
1712       nested unlimited repeats, and so the use of atomic grouping for  match‐
1713       ing  strings  of non-parentheses is important when applying the pattern
1714       to strings that do not match. For example, when this pattern is applied
1715       to
1716
1717         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1718
1719       it  yields "no match" quickly. However, if atomic grouping is not used,
1720       the match runs for a very long time indeed because there  are  so  many
1721       different  ways  the  + and * repeats can carve up the subject, and all
1722       have to be tested before failure can be reported.
1723
1724       At the end of a match, the values set for any capturing subpatterns are
1725       those from the outermost level of the recursion at which the subpattern
1726       value is set.  If you want to obtain  intermediate  values,  a  callout
1727       function  can be used (see below and the pcrecallout documentation). If
1728       the pattern above is matched against
1729
1730         (ab(cd)ef)
1731
1732       the value for the capturing parentheses is  "ef",  which  is  the  last
1733       value  taken  on at the top level. If additional parentheses are added,
1734       giving
1735
1736         \( ( ( (?>[^()]+) | (?R) )* ) \)
1737            ^                        ^
1738            ^                        ^
1739
1740       the string they capture is "ab(cd)ef", the contents of  the  top  level
1741       parentheses.  If there are more than 15 capturing parentheses in a pat‐
1742       tern, PCRE has to obtain extra memory to store data during a recursion,
1743       which  it  does  by  using pcre_malloc, freeing it via pcre_free after‐
1744       wards. If  no  memory  can  be  obtained,  the  match  fails  with  the
1745       PCRE_ERROR_NOMEMORY error.
1746
1747       Do  not  confuse  the (?R) item with the condition (R), which tests for
1748       recursion.  Consider this pattern, which matches text in  angle  brack‐
1749       ets,  allowing for arbitrary nesting. Only digits are allowed in nested
1750       brackets (that is, when recursing), whereas any characters are  permit‐
1751       ted at the outer level.
1752
1753         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
1754
1755       In  this  pattern, (?(R) is the start of a conditional subpattern, with
1756       two different alternatives for the recursive and  non-recursive  cases.
1757       The (?R) item is the actual recursive call.
1758

SUBPATTERNS AS SUBROUTINES

1760
1761       If the syntax for a recursive subpattern reference (either by number or
1762       by name) is used outside the parentheses to which it refers,  it  oper‐
1763       ates  like a subroutine in a programming language. The "called" subpat‐
1764       tern may be defined before or after the reference. A numbered reference
1765       can be absolute or relative, as in these examples:
1766
1767         (...(absolute)...)...(?2)...
1768         (...(relative)...)...(?-1)...
1769         (...(?+1)...(relative)...
1770
1771       An earlier example pointed out that the pattern
1772
1773         (sens|respons)e and \1ibility
1774
1775       matches  "sense and sensibility" and "response and responsibility", but
1776       not "sense and responsibility". If instead the pattern
1777
1778         (sens|respons)e and (?1)ibility
1779
1780       is used, it does match "sense and responsibility" as well as the  other
1781       two  strings.  Another  example  is  given  in the discussion of DEFINE
1782       above.
1783
1784       Like recursive subpatterns, a "subroutine" call is always treated as an
1785       atomic  group. That is, once it has matched some of the subject string,
1786       it is never re-entered, even if it contains  untried  alternatives  and
1787       there is a subsequent matching failure.
1788
1789       When  a  subpattern is used as a subroutine, processing options such as
1790       case-independence are fixed when the subpattern is defined. They cannot
1791       be changed for different calls. For example, consider this pattern:
1792
1793         (abc)(?i:(?-1))
1794
1795       It  matches  "abcabc". It does not match "abcABC" because the change of
1796       processing option does not affect the called subpattern.
1797

CALLOUTS

1799
1800       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
1801       Perl  code to be obeyed in the middle of matching a regular expression.
1802       This makes it possible, amongst other things, to extract different sub‐
1803       strings that match the same pair of parentheses when there is a repeti‐
1804       tion.
1805
1806       PCRE provides a similar feature, but of course it cannot obey arbitrary
1807       Perl code. The feature is called "callout". The caller of PCRE provides
1808       an external function by putting its entry point in the global  variable
1809       pcre_callout.   By default, this variable contains NULL, which disables
1810       all calling out.
1811
1812       Within a regular expression, (?C) indicates the  points  at  which  the
1813       external  function  is  to be called. If you want to identify different
1814       callout points, you can put a number less than 256 after the letter  C.
1815       The  default  value is zero.  For example, this pattern has two callout
1816       points:
1817
1818         (?C1)abc(?C2)def
1819
1820       If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
1821       automatically  installed  before each item in the pattern. They are all
1822       numbered 255.
1823
1824       During matching, when PCRE reaches a callout point (and pcre_callout is
1825       set),  the  external function is called. It is provided with the number
1826       of the callout, the position in the pattern, and, optionally, one  item
1827       of  data  originally supplied by the caller of pcre_exec(). The callout
1828       function may cause matching to proceed, to backtrack, or to fail  alto‐
1829       gether. A complete description of the interface to the callout function
1830       is given in the pcrecallout documentation.
1831

BACKTRACKING CONTROL

1833
1834       Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",
1835       which are described in the Perl documentation as "experimental and sub‐
1836       ject to change or removal in a future version of Perl". It goes  on  to
1837       say:  "Their usage in production code should be noted to avoid problems
1838       during upgrades." The same remarks apply to the PCRE features described
1839       in this section.
1840
1841       Since these verbs are specifically related to backtracking, they can be
1842       used only when the pattern is to be matched  using  pcre_exec(),  which
1843       uses  a  backtracking  algorithm. They cause an error if encountered by
1844       pcre_dfa_exec().
1845
1846       The new verbs make use of what was previously invalid syntax: an  open‐
1847       ing parenthesis followed by an asterisk. In Perl, they are generally of
1848       the form (*VERB:ARG) but PCRE does not support the use of arguments, so
1849       its  general  form is just (*VERB). Any number of these verbs may occur
1850       in a pattern. There are two kinds:
1851
1852   Verbs that act immediately
1853
1854       The following verbs act as soon as they are encountered:
1855
1856          (*ACCEPT)
1857
1858       This verb causes the match to end successfully, skipping the  remainder
1859       of  the pattern. When inside a recursion, only the innermost pattern is
1860       ended immediately. PCRE differs  from  Perl  in  what  happens  if  the
1861       (*ACCEPT)  is inside capturing parentheses. In Perl, the data so far is
1862       captured; in PCRE, no data is captured. For example:
1863
1864         A(A|B(*ACCEPT)|C)D
1865
1866       This matches "AB", "AAD", or "ACD", but when it matches "AB",  no  data
1867       is captured.
1868
1869         (*FAIL) or (*F)
1870
1871       This  verb  causes the match to fail, forcing backtracking to occur. It
1872       is equivalent to (?!) but easier to read. The Perl documentation  notes
1873       that  it  is  probably  useful only when combined with (?{}) or (??{}).
1874       Those are, of course, Perl features that are not present in  PCRE.  The
1875       nearest  equivalent is the callout feature, as for example in this pat‐
1876       tern:
1877
1878         a+(?C)(*FAIL)
1879
1880       A match with the string "aaaa" always fails, but the callout  is  taken
1881       before each backtrack happens (in this example, 10 times).
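
       The figure of 10 can be checked by hand with a toy model. The Python
       sketch below is not PCRE code and makes no PCRE calls; the function
       name count_callouts is made up for illustration, and the model assumes
       a plain backtracking matcher with no start-of-match optimizations. For
       each starting point, a+ first takes the whole run of "a"s and then
       gives back one character per backtrack; the callout is reached once
       for every length tried.

```python
def count_callouts(subject):
    """Count how often (?C) is reached for a+(?C)(*FAIL) in a toy model.

    Hypothetical helper: it imitates a naive backtracking matcher and is
    not how PCRE is implemented.
    """
    calls = 0
    for start in range(len(subject)):      # "bumpalong" over start points
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ tries the longest run first, then one "a" fewer each time.
        for _taken in range(run, 0, -1):
            calls += 1                     # (?C) reached, then (*FAIL) backtracks
    return calls


print(count_callouts("aaaa"))  # 10
```

       With "aaaa" the four starting points contribute 4, 3, 2 and 1 callouts
       respectively.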
1882
1883   Verbs that act after backtracking
1884
1885       The following verbs do nothing when they are encountered. Matching con‐
1886       tinues with what follows, but if there is no subsequent match, a  fail‐
1887       ure  is  forced.   The  verbs  differ  in  exactly what kind of failure
1888       occurs.
1889
1890         (*COMMIT)
1891
1892       This verb causes the whole match to fail outright if the  rest  of  the
1893       pattern  does  not match. Even if the pattern is unanchored, no further
1894       attempts to find a match by advancing the start point take place.  Once
1895       (*COMMIT)  has been passed, pcre_exec() is committed to finding a match
1896       at the current starting point, or not at all. For example:
1897
1898         a+(*COMMIT)b
1899
1900       This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
1901       of dynamic anchor, or "I've started, so I must finish."
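
       The two subjects above can be traced with a small model. The Python
       sketch below is an illustration only: match_commit is a made-up name,
       the function hard-codes the pattern a+(*COMMIT)b, and it assumes a
       simple left-to-right scan rather than PCRE's actual matcher. The point
       it shows is that a failure before (*COMMIT) is reached still allows
       the start point to advance, whereas a failure after it ends the whole
       match.

```python
def match_commit(subject):
    """Toy model of the unanchored pattern a+(*COMMIT)b.

    Returns (start, end) offsets of the match, or None. Hypothetical
    helper for illustration; it does not call PCRE.
    """
    for start in range(len(subject)):
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end == start:
            # a+ failed, so (*COMMIT) was never reached: bump along.
            continue
        # (*COMMIT) has been passed: match here or not at all.
        if subject[end:end + 1] == "b":
            return (start, end + 1)
        return None
    return None


print(match_commit("xxaab"))   # (2, 5)
print(match_commit("aacaab"))  # None
```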
1902
1903         (*PRUNE)
1904
1905       This  verb causes the match to fail at the current position if the rest
1906       of the pattern does not match. If the pattern is unanchored, the normal
1907       "bumpalong"  advance to the next starting character then happens. Back‐
1908       tracking can occur as usual to the left of (*PRUNE), or  when  matching
1909       to  the right of (*PRUNE), but if there is no match to the right, back‐
1910       tracking cannot cross (*PRUNE).  In simple cases, the use  of  (*PRUNE)
1911       is just an alternative to an atomic group or possessive quantifier, but
1912       there are some uses of (*PRUNE) that cannot be expressed in  any  other
1913       way.
1914
1915         (*SKIP)
1916
1917       This  verb  is like (*PRUNE), except that if the pattern is unanchored,
1918       the "bumpalong" advance is not to the next character, but to the  posi‐
1919       tion  in  the  subject where (*SKIP) was encountered. (*SKIP) signifies
1920       that whatever text was matched leading up to it cannot  be  part  of  a
1921       successful match. Consider:
1922
1923         a+(*SKIP)b
1924
1925       If  the  subject  is  "aaaac...",  after  the first match attempt fails
1926       (starting at the first character in the  string),  the  starting  point
1927       skips on to start the next attempt at "c". Note that a possessive quan‐
1928       tifier does not have the same effect in this example; although it would
1929       suppress  backtracking  during  the  first  match  attempt,  the second
1930       attempt would start at the second character instead of skipping  on  to
1931       "c".
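
       The difference in starting points can be made concrete with a toy
       model. The Python sketch below is not PCRE code; start_points and its
       use_skip flag are made-up names, and the function hard-codes the two
       patterns a+(*SKIP)b (use_skip true) and a++b (use_skip false). In both
       cases a+ never gives characters back; only the position of the next
       attempt differs.

```python
def start_points(subject, use_skip):
    """List the start offsets tried for a+(*SKIP)b or a++b in a toy model.

    Returns (offsets_tried, match), where match is (start, end) or None.
    Hypothetical helper for illustration; it does not call PCRE.
    """
    tried = []
    start = 0
    while start < len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end > start and subject[end:end + 1] == "b":
            return tried, (start, end + 1)
        if use_skip and end > start:
            start = end      # resume where (*SKIP) was encountered
        else:
            start += 1       # normal bumpalong to the next character
    return tried, None


print(start_points("aaaac", use_skip=True))   # ([0, 4], None)
print(start_points("aaaac", use_skip=False))  # ([0, 1, 2, 3, 4], None)
```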
1932
1933         (*THEN)
1934
1935       This verb causes a skip to the next alternation if the rest of the pat‐
1936       tern does not match. That is, it cancels pending backtracking, but only
1937       within  the  current  alternation.  Its name comes from the observation
1938       that it can be used for a pattern-based if-then-else block:
1939
1940         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
1941
1942       If the COND1 pattern matches, FOO is tried (and possibly further  items
1943       after  the  end  of  the group if FOO succeeds); on failure the matcher
1944       skips to the second alternative and tries COND2,  without  backtracking
1945       into  COND1.  If  (*THEN)  is  used outside of any alternation, it acts
1946       exactly like (*PRUNE).
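
       A toy model may help to show what "without backtracking into COND1"
       means. The Python sketch below is not PCRE code; match_alternatives is
       a made-up name, and each alternative is restricted, purely for
       illustration, to a single repeated character followed by a literal
       string, tried at the start of the subject only. With use_then false it
       behaves like (?:a+ab|a+); with use_then true it behaves like
       (?:a+(*THEN)ab|a+(*THEN)).

```python
def match_alternatives(subject, branches, use_then):
    """Toy model of alternatives of the form  ch+ (*THEN) body.

    branches is a list of (ch, body) pairs. Returns the matched text at
    offset 0, or None. Hypothetical helper; it does not call PCRE.
    """
    for ch, body in branches:
        run = 0
        while run < len(subject) and subject[run] == ch:
            run += 1
        # ch+ tries the longest run first, then shorter ones on backtrack.
        for taken in range(run, 0, -1):
            if subject.startswith(body, taken):
                return subject[:taken + len(body)]
            if use_then:
                break        # (*THEN): give up this alternative at once
    return None


branches = [("a", "ab"), ("a", "")]
print(match_alternatives("aab", branches, use_then=False))  # aab
print(match_alternatives("aab", branches, use_then=True))   # aa
```

       Without (*THEN), the first alternative succeeds after a+ gives back
       one "a". With (*THEN), that backtrack is not allowed, so the second
       alternative is tried and matches "aa".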
1947

SEE ALSO

1949
1950       pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).
1951

AUTHOR

1953
1954       Philip Hazel
1955       University Computing Service
1956       Cambridge CB2 3QH, England.
1957

REVISION

1959
1960       Last updated: 21 August 2007
1961       Copyright (c) 1997-2007 University of Cambridge.
1962
1963
1964
1965                                                                PCREPATTERN(3)