PCREPATTERN(3)             Library Functions Manual             PCREPATTERN(3)

NAME

       PCRE - Perl-compatible regular expressions

PCRE REGULAR EXPRESSION DETAILS

       The syntax and semantics of the regular expressions that are supported
       by PCRE are described in detail below. There is a quick-reference
       syntax summary in the pcresyntax page. PCRE tries to match Perl syntax
       and semantics as closely as it can. PCRE also supports some alternative
       regular expression syntax (which does not conflict with the Perl
       syntax) in order to provide some compatibility with regular expressions
       in Python, .NET, and Oniguruma.

       Perl's regular expressions are described in its own documentation, and
       regular expressions in general are covered in a number of books, some
       of which have copious examples. Jeffrey Friedl's "Mastering Regular
       Expressions", published by O'Reilly, covers regular expressions in
       great detail. This description of PCRE's regular expressions is
       intended as reference material.

       The original operation of PCRE was on strings of one-byte characters.
       However, there is now also support for UTF-8 character strings. To use
       this, you must build PCRE to include UTF-8 support, and then call
       pcre_compile() with the PCRE_UTF8 option. How this affects pattern
       matching is mentioned in several places below. There is also a summary
       of UTF-8 features in the section on UTF-8 support in the main pcre
       page.

       The remainder of this document discusses the patterns that are
       supported by PCRE when its main matching function, pcre_exec(), is
       used. From release 6.0, PCRE offers a second matching function,
       pcre_dfa_exec(), which matches using a different algorithm that is not
       Perl-compatible. Some of the features discussed below are not available
       when pcre_dfa_exec() is used. The advantages and disadvantages of the
       alternative function, and how it differs from the normal function, are
       discussed in the pcrematching page.

NEWLINE CONVENTIONS

       PCRE supports five different conventions for indicating line breaks in
       strings: a single CR (carriage return) character, a single LF
       (linefeed) character, the two-character sequence CRLF, any of the three
       preceding, or any Unicode newline sequence. The pcreapi page has
       further discussion about newlines, and shows how to set the newline
       convention in the options arguments for the compiling and matching
       functions.

       It is also possible to specify a newline convention by starting a
       pattern string with one of the following five sequences:

         (*CR)        carriage return
         (*LF)        linefeed
         (*CRLF)      carriage return, followed by linefeed
         (*ANYCRLF)   any of the three above
         (*ANY)       all Unicode newline sequences

       These override the default and the options given to pcre_compile(). For
       example, on a Unix system where LF is the default newline sequence, the
       pattern

         (*CR)a.b

       changes the convention to CR. That pattern matches "a\nb" because LF is
       no longer a newline. Note that these special settings, which are not
       Perl-compatible, are recognized only at the very start of a pattern,
       and that they must be in upper case. If more than one of them is
       present, the last one is used.

       The newline convention does not affect what the \R escape sequence
       matches. By default, this is any Unicode newline sequence, for Perl
       compatibility. However, this can be changed; see the description of \R
       in the section entitled "Newline sequences" below. A change of \R
       setting can be combined with a change of newline convention.

CHARACTERS AND METACHARACTERS

       A regular expression is a pattern that is matched against a subject
       string from left to right. Most characters stand for themselves in a
       pattern, and match the corresponding characters in the subject. As a
       trivial example, the pattern

         The quick brown fox

       matches a portion of a subject string that is identical to itself. When
       caseless matching is specified (the PCRE_CASELESS option), letters are
       matched independently of case. In UTF-8 mode, PCRE always understands
       the concept of case for characters whose values are less than 128, so
       caseless matching is always possible. For characters with higher
       values, the concept of case is supported if PCRE is compiled with
       Unicode property support, but not otherwise. If you want to use
       caseless matching for characters 128 and above, you must ensure that
       PCRE is compiled with Unicode property support as well as with UTF-8
       support.

       The power of regular expressions comes from the ability to include
       alternatives and repetitions in the pattern. These are encoded in the
       pattern by the use of metacharacters, which do not stand for themselves
       but instead are interpreted in some special way.

       There are two different sets of metacharacters: those that are
       recognized anywhere in the pattern except within square brackets, and
       those that are recognized within square brackets. Outside square
       brackets, the metacharacters are as follows:

         \      general escape character with several uses
         ^      assert start of string (or line, in multiline mode)
         $      assert end of string (or line, in multiline mode)
         .      match any character except newline (by default)
         [      start character class definition
         |      start of alternative branch
         (      start subpattern
         )      end subpattern
         ?      extends the meaning of (
                also 0 or 1 quantifier
                also quantifier minimizer
         *      0 or more quantifier
         +      1 or more quantifier
                also "possessive quantifier"
         {      start min/max quantifier

       Part of a pattern that is in square brackets is called a "character
       class". In a character class the only metacharacters are:

         \      general escape character
         ^      negate the class, but only if the first character
         -      indicates character range
         [      POSIX character class (only if followed by POSIX
                  syntax)
         ]      terminates the character class

       The following sections describe the use of each of the metacharacters.

BACKSLASH

       The backslash character has several uses. Firstly, if it is followed by
       a non-alphanumeric character, it takes away any special meaning that
       character may have. This use of backslash as an escape character
       applies both inside and outside character classes.

       For example, if you want to match a * character, you write \* in the
       pattern. This escaping action applies whether or not the following
       character would otherwise be interpreted as a metacharacter, so it is
       always safe to precede a non-alphanumeric with backslash to specify
       that it stands for itself. In particular, if you want to match a
       backslash, you write \\.

       If a pattern is compiled with the PCRE_EXTENDED option, white space in
       the pattern (other than in a character class) and characters between a
       # outside a character class and the next newline are ignored. An
       escaping backslash can be used to include a white space or # character
       as part of the pattern.

       If you want to remove the special meaning from a sequence of
       characters, you can do so by putting them between \Q and \E. This is
       different from Perl in that $ and @ are handled as literals in \Q...\E
       sequences in PCRE, whereas in Perl, $ and @ cause variable
       interpolation. Note the following examples:

         Pattern            PCRE matches   Perl matches

         \Qabc$xyz\E        abc$xyz        abc followed by the
                                             contents of $xyz
         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

       The \Q...\E sequence is recognized both inside and outside character
       classes.
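
       As an illustration only: Python's re module has no \Q...\E, so the
       sketch below uses re.escape() as the nearest analogue. Like \Q...\E
       in PCRE, it treats $ and @ as ordinary literals. The helper name
       quote_literal is hypothetical, and Python's re is not PCRE.

```python
import re

def quote_literal(text):
    # Hypothetical helper: build a pattern that matches `text` literally,
    # roughly what wrapping the text in \Q...\E does in PCRE.
    return re.escape(text)

# "$" loses its special meaning, as in the first row of the table.
assert re.fullmatch(quote_literal("abc$xyz"), "abc$xyz")
# A backslash inside the quoted text is literal too, as in the second row.
assert re.fullmatch(quote_literal(r"abc\$xyz"), r"abc\$xyz")
```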

   Non-printing characters

       A second use of backslash provides a way of encoding non-printing
       characters in patterns in a visible manner. There is no restriction on
       the appearance of non-printing characters, apart from the binary zero
       that terminates a pattern, but when a pattern is being prepared by text
       editing, it is usually easier to use one of the following escape
       sequences than the binary character it represents:

         \a        alarm, that is, the BEL character (hex 07)
         \cx       "control-x", where x is any character
         \e        escape (hex 1B)
         \f        form feed (hex 0C)
         \n        linefeed (hex 0A)
         \r        carriage return (hex 0D)
         \t        tab (hex 09)
         \ddd      character with octal code ddd, or backreference
         \xhh      character with hex code hh
         \x{hhh..} character with hex code hhh..

       The precise effect of \cx is as follows: if x is a lower case letter,
       it is converted to upper case. Then bit 6 of the character (hex 40) is
       inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
       becomes hex 7B.
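
       The \cx rule can be written out as a few lines of Python. This is a
       sketch of the arithmetic described above, not PCRE code; the function
       name control_char is hypothetical.

```python
def control_char(x):
    # Step 1: a lower case ASCII letter is converted to upper case.
    code = ord(x)
    if "a" <= x <= "z":
        code = ord(x.upper())
    # Step 2: bit 6 (hex 40) of the character is inverted.
    return code ^ 0x40

# The three cases quoted in the text.
assert control_char("z") == 0x1A
assert control_char("{") == 0x3B
assert control_char(";") == 0x7B
```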

       After \x, from zero to two hexadecimal digits are read (letters can be
       in upper or lower case). Any number of hexadecimal digits may appear
       between \x{ and }, but the value of the character code must be less
       than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,
       the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger
       than the largest Unicode code point, which is 10FFFF.

       If characters other than hexadecimal digits appear between \x{ and },
       or if there is no terminating }, this form of escape is not recognized.
       Instead, the initial \x will be interpreted as a basic hexadecimal
       escape, with no following digits, giving a character whose value is
       zero.

       Characters whose value is less than 256 can be defined by either of the
       two syntaxes for \x. There is no difference in the way they are
       handled. For example, \xdc is exactly the same as \x{dc}.

       After \0 up to two further octal digits are read. If there are fewer
       than two digits, just those that are present are used. Thus the
       sequence \0\x\07 specifies two binary zeros followed by a BEL character
       (code value 7). Make sure you supply two digits after the initial zero
       if the pattern character that follows is itself an octal digit.

       The handling of a backslash followed by a digit other than 0 is
       complicated. Outside a character class, PCRE reads it and any following
       digits as a decimal number. If the number is less than 10, or if there
       have been at least that many previous capturing left parentheses in the
       expression, the entire sequence is taken as a back reference. A
       description of how this works is given later, following the discussion
       of parenthesized subpatterns.

       Inside a character class, or if the decimal number is greater than 9
       and there have not been that many capturing subpatterns, PCRE re-reads
       up to three octal digits following the backslash, and uses them to
       generate a data character. Any subsequent digits stand for themselves.
       In non-UTF-8 mode, the value of a character specified in octal must be
       less than \400. In UTF-8 mode, values up to \777 are permitted. For
       example:

         \040   is another way of writing a space
         \40    is the same, provided there are fewer than 40
                   previous capturing subpatterns
         \7     is always a back reference
         \11    might be a back reference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a back reference, otherwise the
                   character with octal code 113
         \377   might be a back reference, otherwise
                   the byte consisting entirely of 1 bits
         \81    is either a back reference, or a binary zero
                   followed by the two characters "8" and "1"

       Note that octal values of 100 or greater must not be introduced by a
       leading zero, because no more than three octal digits are ever read.
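
       The decision rule for a backslash followed by digits can be modelled
       as below. This is a simplified sketch of the rules stated above, not
       PCRE's parser: the function name interpret_digit_escape and its
       return format are hypothetical, and the \400 and \777 range limits
       are not enforced.

```python
def interpret_digit_escape(digits, captures, in_class=False):
    """Classify the digit string that follows a backslash.

    digits   -- all the digits after the backslash, e.g. "0113"
    captures -- number of capturing left parentheses seen so far
    Returns ("backref", n, "") or ("char", code, literal_rest).
    """
    if digits[0] != "0" and not in_class:
        number = int(digits)  # read as a decimal number first
        if number < 10 or captures >= number:
            return ("backref", number, "")
    # Otherwise re-read up to three octal digits as a data character.
    n = 0
    while n < len(digits) and n < 3 and digits[n] in "01234567":
        n += 1
    code = int(digits[:n], 8) if n else 0  # no octal digits: binary zero
    return ("char", code, digits[n:])

# Rows from the table above, assuming no capturing groups so far.
assert interpret_digit_escape("040", 0) == ("char", 32, "")
assert interpret_digit_escape("7", 0) == ("backref", 7, "")
assert interpret_digit_escape("0113", 0) == ("char", 9, "3")
assert interpret_digit_escape("81", 0) == ("char", 0, "81")
```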

       All the sequences that define a single character value can be used both
       inside and outside character classes. In addition, inside a character
       class, the sequence \b is interpreted as the backspace character (hex
       08), and the sequences \R and \X are interpreted as the characters "R"
       and "X", respectively. Outside a character class, these sequences have
       different meanings (see below).

   Absolute and relative back references

       The sequence \g followed by an unsigned or a negative number,
       optionally enclosed in braces, is an absolute or relative back
       reference. A named back reference can be coded as \g{name}. Back
       references are discussed later, following the discussion of
       parenthesized subpatterns.

   Absolute and relative subroutine calls

       For compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number enclosed either in angle brackets or single quotes, is
       an alternative syntax for referencing a subpattern as a "subroutine".
       Details are discussed later. Note that \g{...} (Perl syntax) and
       \g<...> (Oniguruma syntax) are not synonymous. The former is a back
       reference; the latter is a subroutine call.

   Generic character types

       Another use of backslash is for specifying generic character types. The
       following are always recognized:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \h     any horizontal white space character
         \H     any character that is not a horizontal white space character
         \s     any white space character
         \S     any character that is not a white space character
         \v     any vertical white space character
         \V     any character that is not a vertical white space character
         \w     any "word" character
         \W     any "non-word" character

       Each pair of escape sequences partitions the complete set of characters
       into two disjoint sets. Any given character matches one, and only one,
       of each pair.

       These character type sequences can appear both inside and outside
       character classes. They each match one character of the appropriate
       type. If the current matching point is at the end of the subject
       string, all of them fail, since there is no character to match.

       For compatibility with Perl, \s does not match the VT character (code
       11). This makes it different from the POSIX "space" class. The \s
       characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
       "use locale;" is included in a Perl script, \s may match the VT
       character. In PCRE, it never does.

       In UTF-8 mode, characters with values greater than 127 never match \d,
       \s, or \w, and always match \D, \S, and \W. This is true even when
       Unicode character property support is available. These sequences retain
       their original meanings from before UTF-8 support was available, mainly
       for efficiency reasons.

       The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
       the other sequences, these do match certain high-valued codepoints in
       UTF-8 mode. The horizontal space characters are:

         U+0009     Horizontal tab
         U+0020     Space
         U+00A0     Non-break space
         U+1680     Ogham space mark
         U+180E     Mongolian vowel separator
         U+2000     En quad
         U+2001     Em quad
         U+2002     En space
         U+2003     Em space
         U+2004     Three-per-em space
         U+2005     Four-per-em space
         U+2006     Six-per-em space
         U+2007     Figure space
         U+2008     Punctuation space
         U+2009     Thin space
         U+200A     Hair space
         U+202F     Narrow no-break space
         U+205F     Medium mathematical space
         U+3000     Ideographic space

       The vertical space characters are:

         U+000A     Linefeed
         U+000B     Vertical tab
         U+000C     Form feed
         U+000D     Carriage return
         U+0085     Next line
         U+2028     Line separator
         U+2029     Paragraph separator
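
       Python's re module does not provide \h or \v, so as an illustration
       the two lists above can be turned into explicit character classes.
       This is a sketch; the names HORIZONTAL, VERTICAL and char_class are
       hypothetical, and the classes approximate UTF-8 mode behaviour only.

```python
import re

# Code points copied from the two lists above.
HORIZONTAL = [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
              *range(0x2000, 0x200B), 0x202F, 0x205F, 0x3000]
VERTICAL = [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]

def char_class(code_points, negate=False):
    # Build "[...]" (or "[^...]") from \uXXXX escapes, one per code point.
    body = "".join(f"\\u{cp:04X}" for cp in code_points)
    return "[" + ("^" if negate else "") + body + "]"

H = char_class(HORIZONTAL)                    # stands in for \h
NOT_H = char_class(HORIZONTAL, negate=True)   # stands in for \H
V = char_class(VERTICAL)                      # stands in for \v

assert re.fullmatch(H, "\u3000")       # ideographic space is horizontal
assert re.fullmatch(V, "\u2028")       # line separator is vertical
assert re.fullmatch(NOT_H, "\n")       # linefeed is not horizontal
```

       Because each pair partitions the character set, the negated class is
       simply the complement of the positive one.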

       A "word" character is an underscore or any character less than 256 that
       is a letter or digit. The definition of letters and digits is
       controlled by PCRE's low-valued character tables, and may vary if
       locale-specific matching is taking place (see "Locale support" in the
       pcreapi page). For example, in a French locale such as "fr_FR" in
       Unix-like systems, or "french" in Windows, some character codes greater
       than 128 are used for accented letters, and these are matched by \w.
       The use of locales with Unicode is discouraged.

   Newline sequences

       Outside a character class, by default, the escape sequence \R matches
       any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8
       mode \R is equivalent to the following:

         (?>\r\n|\n|\x0b|\f|\r|\x85)

       This is an example of an "atomic group", details of which are given
       below. This particular group matches either the two-character sequence
       CR followed by LF, or one of the single characters LF (linefeed,
       U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR
       (carriage return, U+000D), or NEL (next line, U+0085). The
       two-character sequence is treated as a single unit that cannot be
       split.

       In UTF-8 mode, two additional characters whose codepoints are greater
       than 255 are added: LS (line separator, U+2028) and PS (paragraph
       separator, U+2029). Unicode character property support is not needed
       for these characters to be recognized.
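
       For illustration, the same alternation can be written for Python's re
       module, which has no \R. This sketch uses a plain non-capturing group
       rather than an atomic group (atomic groups need Python 3.11 or
       later), so it is only an approximation: inside a larger pattern,
       backtracking could split CRLF, which the atomic group prevents. The
       names ANY_NEWLINE and split_lines are hypothetical.

```python
import re

# CRLF is listed first so that it is consumed as one unit where possible.
# U+2028 and U+2029 are the two extra characters added in UTF-8 mode.
ANY_NEWLINE = re.compile(r"(?:\r\n|\n|\x0b|\f|\r|\x85|\u2028|\u2029)")

def split_lines(text):
    # Split on any single newline sequence.
    return ANY_NEWLINE.split(text)

assert split_lines("one\r\ntwo\x85three\u2028four") == \
    ["one", "two", "three", "four"]
# CRLF followed by LF is two newline sequences, not three.
assert split_lines("a\r\n\nb") == ["a", "", "b"]
```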

       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
       the complete set of Unicode line endings) by setting the option
       PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
       (BSR is an abbreviation for "backslash R".) This can be made the
       default when PCRE is built; if this is the case, the other behaviour
       can be requested via the PCRE_BSR_UNICODE option. It is also possible
       to specify these settings by starting a pattern string with one of the
       following sequences:

         (*BSR_ANYCRLF)   CR, LF, or CRLF only
         (*BSR_UNICODE)   any Unicode newline sequence

       These override the default and the options given to pcre_compile(), but
       they can be overridden by options given to pcre_exec(). Note that these
       special settings, which are not Perl-compatible, are recognized only at
       the very start of a pattern, and that they must be in upper case. If
       more than one of them is present, the last one is used. They can be
       combined with a change of newline convention, for example, a pattern
       can start with:

         (*ANY)(*BSR_ANYCRLF)
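
       The "very start of the pattern, upper case only, last one wins" rule
       for these settings and for the newline settings can be modelled as
       follows. This is a sketch of the stated rules, not PCRE's own parser;
       the function name leading_settings and the default values are
       assumptions, and it assumes the two kinds of setting may appear in
       any order.

```python
import re

# Case-sensitive on purpose: the settings must be in upper case.
_SETTING = re.compile(
    r"\(\*(CR|LF|CRLF|ANYCRLF|ANY|BSR_ANYCRLF|BSR_UNICODE)\)")

def leading_settings(pattern, newline="LF", bsr="BSR_UNICODE"):
    """Return (newline convention, \\R setting, remaining pattern)."""
    pos = 0
    while True:
        m = _SETTING.match(pattern, pos)  # only at the current start
        if m is None:
            break
        name = m.group(1)
        if name.startswith("BSR_"):
            bsr = name          # a later setting replaces an earlier one
        else:
            newline = name
        pos = m.end()
    return newline, bsr, pattern[pos:]

assert leading_settings("(*ANY)(*BSR_ANYCRLF)a.b") == \
    ("ANY", "BSR_ANYCRLF", "a.b")
assert leading_settings("(*CR)(*LF)x") == ("LF", "BSR_UNICODE", "x")
```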

       Inside a character class, \R matches the letter "R".

   Unicode character properties

       When PCRE is built with Unicode character property support, three
       additional escape sequences that match characters with specific
       properties are available. When not in UTF-8 mode, these sequences are
       of course limited to testing characters whose codepoints are less than
       256, but they do work in this mode. The extra escape sequences are:

         \p{xx}   a character with the xx property
         \P{xx}   a character without the xx property
         \X       an extended Unicode sequence

       The property names represented by xx above are limited to the Unicode
       script names, the general category properties, and "Any", which matches
       any character (including newline). Other properties such as
       "InMusicalSymbols" are not currently supported by PCRE. Note that
       \P{Any} does not match any characters, so always causes a match
       failure.

       Sets of Unicode characters are defined as belonging to certain scripts.
       A character from one of these sets can be matched using a script name.
       For example:

         \p{Greek}
         \P{Han}

       Those that are not part of an identified script are lumped together as
       "Common". The current list of scripts is:

       Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese,
       Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform,
       Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
       Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,
       Hiragana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin,
       Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko,
       Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,
       Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
       Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.

       Each character has exactly one general category property, specified by
       a two-letter abbreviation. For compatibility with Perl, negation can be
       specified by including a circumflex between the opening brace and the
       property name. For example, \p{^Lu} is the same as \P{Lu}.

       If only one letter is specified with \p or \P, it includes all the
       general category properties that start with that letter. In this case,
       in the absence of negation, the curly brackets in the escape sequence
       are optional; these two examples have the same effect:

         \p{L}
         \pL

       The following general category property codes are supported:

         C     Other
         Cc    Control
         Cf    Format
         Cn    Unassigned
         Co    Private use
         Cs    Surrogate

         L     Letter
         Ll    Lower case letter
         Lm    Modifier letter
         Lo    Other letter
         Lt    Title case letter
         Lu    Upper case letter

         M     Mark
         Mc    Spacing mark
         Me    Enclosing mark
         Mn    Non-spacing mark

         N     Number
         Nd    Decimal number
         Nl    Letter number
         No    Other number

         P     Punctuation
         Pc    Connector punctuation
         Pd    Dash punctuation
         Pe    Close punctuation
         Pf    Final punctuation
         Pi    Initial punctuation
         Po    Other punctuation
         Ps    Open punctuation

         S     Symbol
         Sc    Currency symbol
         Sk    Modifier symbol
         Sm    Mathematical symbol
         So    Other symbol

         Z     Separator
         Zl    Line separator
         Zp    Paragraph separator
         Zs    Space separator

       The special property L& is also supported: it matches a character that
       has the Lu, Ll, or Lt property, in other words, a letter that is not
       classified as a modifier or "other".

       The Cs (Surrogate) property applies only to characters in the range
       U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see
       RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity
       checking has been turned off (see the discussion of PCRE_NO_UTF8_CHECK
       in the pcreapi page).

       The long synonyms for these properties that Perl supports (such as
       \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
       any of these properties with "Is".

       No character that is in the Unicode table has the Cn (unassigned)
       property. Instead, this property is assumed for any code point that is
       not in the Unicode table.

       Specifying caseless matching does not affect these escape sequences.
       For example, \p{Lu} always matches only upper case letters.

       The \X escape matches any number of Unicode characters that form an
       extended Unicode sequence. \X is equivalent to

         (?>\PM\pM*)

       That is, it matches a character without the "mark" property, followed
       by zero or more characters with the "mark" property, and treats the
       sequence as an atomic group (see below). Characters with the "mark"
       property are typically accents that affect the preceding character.
       None of them have codepoints less than 256, so in non-UTF-8 mode \X
       matches any one character.
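
       The grouping performed by \X can be imitated with Python's
       unicodedata module, which reports the general category of each
       character. This is a sketch of the (?>\PM\pM*) rule only; the
       function name extended_sequences is hypothetical, and, unlike \X, it
       keeps a mark that has no preceding base character as its own item
       rather than failing to match it.

```python
import unicodedata

def extended_sequences(text):
    # Attach every character with an M* category (Mc, Me, Mn) to the
    # sequence begun by the most recent non-mark character.
    sequences = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and sequences:
            sequences[-1] += ch
        else:
            sequences.append(ch)
    return sequences

# "e" followed by U+0301 (combining acute accent, category Mn) is one unit.
assert extended_sequences("e\u0301a") == ["e\u0301", "a"]
```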

       Matching characters by Unicode property is not fast, because PCRE has
       to search a structure that contains data for over fifteen thousand
       characters. That is why the traditional escape sequences such as \d and
       \w do not use Unicode properties in PCRE.

   Resetting the match start

       The escape sequence \K, which is a Perl 5.10 feature, causes any
       previously matched characters not to be included in the final matched
       sequence. For example, the pattern:

         foo\Kbar

       matches "foobar", but reports that it has matched "bar". This feature
       is similar to a lookbehind assertion (described below). However, in
       this case, the part of the subject before the real match does not have
       to be of fixed length, as lookbehind assertions do. The use of \K does
       not interfere with the setting of captured substrings. For example,
       when the pattern

         (foo)\Kbar

       matches "foobar", the first substring is still set to "foo".
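
       Python's re module has no \K. As a rough illustration of what \K
       reports, the sketch below matches a prefix and then returns only the
       part that follows it, using a named group. The function name
       search_keep and the group name "kept" are hypothetical, and this is
       an emulation, not how PCRE implements \K.

```python
import re

def search_keep(prefix, kept, subject):
    # Match prefix followed by kept, but report only the kept part,
    # as a pattern of the form prefix\Kkept would in PCRE.
    m = re.search(f"(?:{prefix})(?P<kept>{kept})", subject)
    if m is None:
        return None
    return m.start("kept"), m.group("kept")

# foo\Kbar: the whole of "foobar" must match, but only "bar" is reported.
assert search_keep("foo", "bar", "foobar") == (3, "bar")
# The prefix need not be of fixed length, unlike a lookbehind assertion.
assert search_keep("fo+", "bar", "xxfooobar") == (6, "bar")
```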

   Simple assertions

       The final use of backslash is for certain simple assertions. An
       assertion specifies a condition that has to be met at a particular
       point in a match, without consuming any characters from the subject
       string. The use of subpatterns for more complicated assertions is
       described below. The backslashed assertions are:

         \b     matches at a word boundary
         \B     matches when not at a word boundary
         \A     matches at the start of the subject
         \Z     matches at the end of the subject
                 also matches before a newline at the end of the subject
         \z     matches only at the end of the subject
         \G     matches at the first matching position in the subject

       These assertions may not appear in character classes (but note that \b
       has a different meaning, namely the backspace character, inside a
       character class).

       A word boundary is a position in the subject string where the current
       character and the previous character do not both match \w or \W (i.e.
       one matches \w and the other matches \W), or the start or end of the
       string if the first or last character matches \w, respectively.
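
       The word boundary definition translates directly into code. The
       sketch below assumes the default tables, in which a "word" character
       is an ASCII letter, digit, or underscore; the function names are
       hypothetical.

```python
def is_word_char(ch):
    # Assumption: default (non-locale) tables, so only ASCII letters,
    # digits, and underscore count as "word" characters.
    return ch == "_" or (ord(ch) < 128 and ch.isalnum())

def is_word_boundary(subject, pos):
    # Treat "before the start" and "after the end" as non-word, so the
    # start or end is a boundary only next to a word character.
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after

s = "hi, you"
assert [p for p in range(len(s) + 1) if is_word_boundary(s, p)] == [0, 2, 4, 7]
```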

       The \A, \Z, and \z assertions differ from the traditional circumflex
       and dollar (described in the next section) in that they only ever match
       at the very start and end of the subject string, whatever options are
       set. Thus, they are independent of multiline mode. These three
       assertions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options,
       which affect only the behaviour of the circumflex and dollar
       metacharacters. However, if the startoffset argument of pcre_exec() is
       non-zero, indicating that matching is to start at a point other than
       the beginning of the subject, \A can never match. The difference
       between \Z and \z is that \Z matches before a newline at the end of the
       string as well as at the very end, whereas \z matches only at the end.

       The \G assertion is true only when the current matching position is at
       the start point of the match, as specified by the startoffset argument
       of pcre_exec(). It differs from \A when the value of startoffset is
       non-zero. By calling pcre_exec() multiple times with appropriate
       arguments, you can mimic Perl's /g option, and it is in this kind of
       implementation where \G can be useful.
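
       The repeated-call technique can be illustrated in Python, where
       Pattern.match(subject, pos) attempts a match only at offset pos and
       so plays a role similar to a pattern that begins with \G run from a
       given startoffset. This is an analogy, not the pcre_exec() API; the
       names TOKEN and tokens are hypothetical.

```python
import re

# Each call must match exactly at the current offset, like \G.
TOKEN = re.compile(r"\d+|[a-z]+|\s+")

def tokens(subject):
    pos = 0
    found = []
    while pos < len(subject):
        m = TOKEN.match(subject, pos)  # anchored at pos, no scanning ahead
        if m is None:
            break                      # nothing matches at this offset
        found.append(m.group())
        pos = m.end()                  # next call starts where this ended
    return found, pos

# Matching stops at "!" (offset 5) because no token can start there.
assert tokens("ab 12!x") == (["ab", " ", "12"], 5)
```

       The pattern above never matches the empty string; a loop of this
       kind needs extra care if it can, since the offset would not advance.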

       Note, however, that PCRE's interpretation of \G, as the start of the
       current match, is subtly different from Perl's, which defines it as the
       end of the previous match. In Perl, these can be different when the
       previously matched string was empty. Because PCRE does just one match
       at a time, it cannot reproduce this behaviour.

       If all the alternatives of a pattern begin with \G, the expression is
       anchored to the starting match position, and the "anchored" flag is set
       in the compiled regular expression.

CIRCUMFLEX AND DOLLAR

609
610       Outside a character class, in the default matching mode, the circumflex
611       character is an assertion that is true only  if  the  current  matching
612       point  is  at the start of the subject string. If the startoffset argu‐
613       ment of pcre_exec() is non-zero, circumflex  can  never  match  if  the
614       PCRE_MULTILINE  option  is  unset. Inside a character class, circumflex
615       has an entirely different meaning (see below).
616
617       Circumflex need not be the first character of the pattern if  a  number
618       of  alternatives are involved, but it should be the first thing in each
619       alternative in which it appears if the pattern is ever  to  match  that
620       branch.  If all possible alternatives start with a circumflex, that is,
621       if the pattern is constrained to match only at the start  of  the  sub‐
622       ject,  it  is  said  to be an "anchored" pattern. (There are also other
623       constructs that can cause a pattern to be anchored.)
624
625       A dollar character is an assertion that is true  only  if  the  current
626       matching  point  is  at  the  end of the subject string, or immediately
627       before a newline at the end of the string (by default). Dollar need not
628       be  the  last  character of the pattern if a number of alternatives are
629       involved, but it should be the last item in  any  branch  in  which  it
630       appears. Dollar has no special meaning in a character class.
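A minimal sketch of the default dollar behaviour, written with Python's re module as a stand-in. It is not PCRE, but it shares this default; the variable names are illustrative only.

```python
import re

end = re.compile(r"abc$")

# Dollar matches at the very end of the subject.
at_very_end = end.search("abc") is not None

# It also matches immediately before a newline that is the last character.
before_final_newline = end.search("abc\n") is not None

# It does not match before a newline that is followed by further characters.
before_two_newlines = end.search("abc\n\n") is not None
```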
631
632       The  meaning  of  dollar  can be changed so that it matches only at the
633       very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
634       compile time. This does not affect the \Z assertion.
635
636       The meanings of the circumflex and dollar characters are changed if the
637       PCRE_MULTILINE option is set. When  this  is  the  case,  a  circumflex
638       matches  immediately after internal newlines as well as at the start of
639       the subject string. It does not match after a  newline  that  ends  the
640       string.  A dollar matches before any newlines in the string, as well as
641       at the very end, when PCRE_MULTILINE is set. When newline is  specified
642       as  the  two-character  sequence CRLF, isolated CR and LF characters do
643       not indicate newlines.
644
645       For example, the pattern /^abc$/ matches the subject string  "def\nabc"
646       (where  \n  represents a newline) in multiline mode, but not otherwise.
647       Consequently, patterns that are anchored in single  line  mode  because
648       all  branches  start  with  ^ are not anchored in multiline mode, and a
649       match for circumflex is  possible  when  the  startoffset  argument  of
650       pcre_exec()  is  non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
651       PCRE_MULTILINE is set.
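The /^abc$/ example can be tried out with Python's re module, used here as an assumption-laden stand-in for PCRE. re.MULTILINE plays the role of PCRE_MULTILINE, and the second argument of search() plays the role of startoffset.

```python
import re

subject = "def\nabc"
single = re.compile(r"^abc$")
multi = re.compile(r"^abc$", re.MULTILINE)

single_result = single.search(subject)     # no match: "abc" is not at the subject start
multi_result = multi.search(subject)       # matches after the internal newline

# With a non-zero start position, circumflex is still tied to the true start
# of the subject in single-line mode, so it cannot match there.
offset_single = single.search(subject, 4)
offset_multi = multi.search(subject, 4)    # multiline: position 4 follows a newline
```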
652
653       Note that the sequences \A, \Z, and \z can be used to match  the  start
654       and  end of the subject in both modes, and if all branches of a pattern
655       start with \A it is always anchored, whether or not  PCRE_MULTILINE  is
656       set.
657

FULL STOP (PERIOD, DOT)

659
660       Outside a character class, a dot in the pattern matches any one charac‐
661       ter in the subject string except (by default) a character  that  signi‐
662       fies  the  end  of  a line. In UTF-8 mode, the matched character may be
663       more than one byte long.
664
665       When a line ending is defined as a single character, dot never  matches
666       that  character; when the two-character sequence CRLF is used, dot does
667       not match CR if it is immediately followed  by  LF,  but  otherwise  it
668       matches  all characters (including isolated CRs and LFs). When any Uni‐
669       code line endings are being recognized, dot does not match CR or LF  or
670       any of the other line ending characters.
671
672       The  behaviour  of  dot  with regard to newlines can be changed. If the
673       PCRE_DOTALL option is set, a dot matches  any  one  character,  without
674       exception. If the two-character sequence CRLF is present in the subject
675       string, it takes two dots to match it.
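The effect of a "dot matches all" option can be sketched with Python's re module, assuming its re.DOTALL flag as the counterpart of PCRE_DOTALL and LF as the newline character. This illustrates the rule and is not a PCRE call sequence.

```python
import re

plain = re.compile(r"a.c")
dotall = re.compile(r"a.c", re.DOTALL)

plain_letter = plain.fullmatch("abc") is not None      # dot matches an ordinary character
plain_newline = plain.fullmatch("a\nc") is not None    # dot refuses the newline
dotall_newline = dotall.fullmatch("a\nc") is not None  # with DOTALL, dot accepts it

# CRLF is two characters, so one dot is not enough even with DOTALL.
one_dot_crlf = dotall.fullmatch("a\r\nc") is not None
two_dots_crlf = re.compile(r"a..c", re.DOTALL).fullmatch("a\r\nc") is not None
```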
676
677       The handling of dot is entirely independent of the handling of  circum‐
678       flex  and  dollar,  the  only relationship being that they both involve
679       newlines. Dot has no special meaning in a character class.
680

MATCHING A SINGLE BYTE

682
683       Outside a character class, the escape sequence \C matches any one byte,
684       both  in  and  out  of  UTF-8 mode. Unlike a dot, it always matches any
685       line-ending characters. The feature is provided in  Perl  in  order  to
686       match  individual bytes in UTF-8 mode. Because it breaks up UTF-8 char‐
687       acters into individual bytes, what remains in the string may be a  mal‐
688       formed  UTF-8  string.  For this reason, the \C escape sequence is best
689       avoided.
690
691       PCRE does not allow \C to appear in  lookbehind  assertions  (described
692       below),  because  in UTF-8 mode this would make it impossible to calcu‐
693       late the length of the lookbehind.
694

SQUARE BRACKETS AND CHARACTER CLASSES

696
697       An opening square bracket introduces a character class, terminated by a
698       closing square bracket. A closing square bracket on its own is not spe‐
699       cial. If a closing square bracket is required as a member of the class,
700       it  should  be  the first data character in the class (after an initial
701       circumflex, if present) or escaped with a backslash.
702
703       A character class matches a single character in the subject.  In  UTF-8
704       mode,  the character may occupy more than one byte. A matched character
705       must be in the set of characters defined by the class, unless the first
706       character  in  the  class definition is a circumflex, in which case the
707       subject character must not be in the set defined by  the  class.  If  a
708       circumflex  is actually required as a member of the class, ensure it is
709       not the first character, or escape it with a backslash.
710
711       For example, the character class [aeiou] matches any lower case  vowel,
712       while  [^aeiou]  matches  any character that is not a lower case vowel.
713       Note that a circumflex is just a convenient notation for specifying the
714       characters  that  are in the class by enumerating those that are not. A
715       class that starts with a circumflex is not an assertion: it still  con‐
716       sumes  a  character  from the subject string, and therefore it fails if
717       the current pointer is at the end of the string.
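A short sketch of these class rules, using Python's re module as a stand-in that handles these simple classes the same way. The names are illustrative.

```python
import re

vowel = re.compile(r"[aeiou]")
non_vowel = re.compile(r"[^aeiou]")

vowel_e = vowel.fullmatch("e") is not None
non_vowel_e = non_vowel.fullmatch("e") is not None
non_vowel_x = non_vowel.fullmatch("x") is not None

# A negated class still has to consume a character, so it fails when
# nothing is left in the subject after "x".
needs_a_character = re.search(r"x[^aeiou]", "x") is not None

# A circumflex that is not first in the class is an ordinary member.
literal_circumflex = re.fullmatch(r"[a^]", "^") is not None
```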
718
719       In UTF-8 mode, characters with values greater than 255 can be  included
720       in  a  class as a literal string of bytes, or by using the \x{ escaping
721       mechanism.
722
723       When caseless matching is set, any letters in a  class  represent  both
724       their  upper  case  and lower case versions, so for example, a caseless
725       [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not
726       match  "A", whereas a caseful version would. In UTF-8 mode, PCRE always
727       understands the concept of case for characters whose  values  are  less
728       than  128, so caseless matching is always possible. For characters with
729       higher values, the concept of case is supported  if  PCRE  is  compiled
730       with  Unicode  property support, but not otherwise.  If you want to use
731       caseless matching for characters 128 and above, you  must  ensure  that
732       PCRE  is  compiled  with Unicode property support as well as with UTF-8
733       support.
734
735       Characters that might indicate line breaks are  never  treated  in  any
736       special  way  when  matching  character  classes,  whatever line-ending
737       sequence is in  use,  and  whatever  setting  of  the  PCRE_DOTALL  and
738       PCRE_MULTILINE options is used. A class such as [^a] always matches one
739       of these characters.
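As a sketch (Python's re module again standing in for PCRE), a negated class matches a newline regardless of the multiline and dot-all settings, whereas a dot does not by default:

```python
import re

# [^a] matches a newline under any combination of these flags.
flag_sets = [0, re.MULTILINE, re.DOTALL, re.MULTILINE | re.DOTALL]
class_matches_newline = [
    re.fullmatch(r"[^a]", "\n", flags) is not None for flags in flag_sets
]

# For contrast, a plain dot without DOTALL rejects the newline.
dot_matches_newline = re.fullmatch(r".", "\n") is not None
```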
740
741       The minus (hyphen) character can be used to specify a range of  charac‐
742       ters  in  a  character  class.  For  example,  [d-m] matches any letter
743       between d and m, inclusive. If a  minus  character  is  required  in  a
744       class,  it  must  be  escaped  with a backslash or appear in a position
745       where it cannot be interpreted as indicating a range, typically as  the
746       first or last character in the class.
747
748       It is not possible to have the literal character "]" as the end charac‐
749       ter of a range. A pattern such as [W-]46] is interpreted as a class  of
750       two  characters ("W" and "-") followed by a literal string "46]", so it
751       would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
752       backslash  it is interpreted as the end of range, so [W-\]46] is inter‐
753       preted as a class containing a range followed by two other  characters.
754       The  octal or hexadecimal representation of "]" can also be used to end
755       a range.
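The two readings of the bracket can be checked with a small sketch. Python's re module parses these two particular patterns the same way as described here, and is used only as a convenient stand-in.

```python
import re

# "[W-]" is a two-character class; "46]" then follows as literal text.
unescaped = re.compile(r"[W-]46]")
unescaped_w = unescaped.fullmatch("W46]") is not None
unescaped_hyphen = unescaped.fullmatch("-46]") is not None

# With the bracket escaped, the class holds the range W..] plus "4" and "6".
escaped = re.compile(r"[W-\]46]")
in_range = escaped.fullmatch("Z") is not None       # Z lies between W and ]
extra_member = escaped.fullmatch("4") is not None
hyphen_member = escaped.fullmatch("-") is not None  # hyphen is no longer a member
```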
756
757       Ranges operate in the collating sequence of character values. They  can
758       also   be  used  for  characters  specified  numerically,  for  example
759       [\000-\037]. In UTF-8 mode, ranges can include characters whose  values
760       are greater than 255, for example [\x{100}-\x{2ff}].
761
762       If a range that includes letters is used when caseless matching is set,
763       it matches the letters in either case. For example, [W-c] is equivalent
764       to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if
765       character tables for a French locale are in  use,  [\xc8-\xcb]  matches
766       accented  E  characters in both cases. In UTF-8 mode, PCRE supports the
767       concept of case for characters with values greater than 127  only  when
768       it is compiled with Unicode property support.
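The [W-c] equivalence can be verified character by character. The sketch below uses Python's re module with re.IGNORECASE as a stand-in for caseless matching; for these ASCII characters its behaviour agrees with the description above.

```python
import re

caseless_range = re.compile(r"[W-c]", re.IGNORECASE)

# Characters from W to c in code order, plus the other-case forms of the letters.
expected = "WXYZ[\\]^_`abc" + "wxyz" + "ABC"
all_expected_match = all(caseless_range.fullmatch(ch) for ch in expected)

# Characters just outside the range, in either case, are rejected.
unexpected = [ch for ch in "dDvV-" if caseless_range.fullmatch(ch)]
```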
769
770       The  character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
771       in a character class, and add the characters that  they  match  to  the
772       class. For example, [\dABCDEF] matches any hexadecimal digit. A circum‐
773       flex can conveniently be used with the upper case  character  types  to
774       specify  a  more  restricted  set of characters than the matching lower
775       case type. For example, the class [^\W_] matches any letter  or  digit,
776       but not underscore.
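A sketch of both classes in Python's re module. The re.ASCII flag is an assumption made so that \d and \W cover only ASCII characters, which is closer to PCRE's behaviour outside UTF-8 mode.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

all_hex_ok = all(hex_digit.fullmatch(ch) for ch in "0123456789ABCDEF")
lower_a_is_hex = hex_digit.fullmatch("a") is not None   # only upper case was listed

accepted = [ch for ch in "aZ7_-" if letter_or_digit.fullmatch(ch)]
```

Negating \W and then adding underscore to the negated set is what removes underscore from the "word" characters.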
777
778       The  only  metacharacters  that are recognized in character classes are
779       backslash, hyphen (only where it can be  interpreted  as  specifying  a
780       range),  circumflex  (only  at the start), opening square bracket (only
781       when it can be interpreted as introducing a POSIX class name - see  the
782       next  section),  and  the  terminating closing square bracket. However,
783       escaping other non-alphanumeric characters does no harm.
784

POSIX CHARACTER CLASSES

786
787       Perl supports the POSIX notation for character classes. This uses names
788       enclosed  by  [: and :] within the enclosing square brackets. PCRE also
789       supports this notation. For example,
790
791         [01[:alpha:]%]
792
793       matches "0", "1", any alphabetic character, or "%". The supported class
794       names are
795
796         alnum    letters and digits
797         alpha    letters
798         ascii    character codes 0 - 127
799         blank    space or tab only
800         cntrl    control characters
801         digit    decimal digits (same as \d)
802         graph    printing characters, excluding space
803         lower    lower case letters
804         print    printing characters, including space
805         punct    printing characters, excluding letters and digits
806         space    white space (not quite the same as \s)
807         upper    upper case letters
808         word     "word" characters (same as \w)
809         xdigit   hexadecimal digits
810
811       The  "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
812       and space (32). Notice that this list includes the VT  character  (code
813       11). This makes "space" different to \s, which does not include VT (for
814       Perl compatibility).
815
816       The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
817       from  Perl  5.8. Another Perl extension is negation, which is indicated
818       by a ^ character after the colon. For example,
819
820         [12[:^digit:]]
821
822       matches "1", "2", or any non-digit. PCRE (and Perl) also recognize  the
823       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
824       these are not supported, and an error is given if they are encountered.
825
826       In UTF-8 mode, characters with values greater than 128 do not match any
827       of the POSIX character classes.
828

VERTICAL BAR

830
831       Vertical  bar characters are used to separate alternative patterns. For
832       example, the pattern
833
834         gilbert|sullivan
835
836       matches either "gilbert" or "sullivan". Any number of alternatives  may
837       appear,  and  an  empty  alternative  is  permitted (matching the empty
838       string). The matching process tries each alternative in turn, from left
839       to  right, and the first one that succeeds is used. If the alternatives
840       are within a subpattern (defined below), "succeeds" means matching  the
841       rest of the main pattern as well as the alternative in the subpattern.
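The left-to-right rule and the meaning of "succeeds" inside a subpattern can be sketched as follows, using Python's re module as a stand-in that treats alternation the same way:

```python
import re

# The first alternative that succeeds is used, even if a later one is longer.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, an alternative only "succeeds" if the rest of the
# pattern matches too, so "a" is given up in favour of "ab" here.
inside_group = re.fullmatch(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alternative = re.fullmatch(r"gilbert|sullivan|", "") is not None
```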
842

INTERNAL OPTION SETTING

844
845       The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
846       PCRE_EXTENDED options (which are Perl-compatible) can be  changed  from
847       within  the  pattern  by  a  sequence  of  Perl option letters enclosed
848       between "(?" and ")".  The option letters are
849
850         i  for PCRE_CASELESS
851         m  for PCRE_MULTILINE
852         s  for PCRE_DOTALL
853         x  for PCRE_EXTENDED
854
855       For example, (?im) sets caseless, multiline matching. It is also possi‐
856       ble to unset these options by preceding the letter with a hyphen, and a
857       combined setting and unsetting such as (?im-sx), which sets  PCRE_CASE‐
858       LESS  and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
859       is also permitted. If a  letter  appears  both  before  and  after  the
860       hyphen, the option is unset.
861
862       The  PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
863       can be changed in the same way as the Perl-compatible options by  using
864       the characters J, U and X respectively.
865
866       When  an option change occurs at top level (that is, not inside subpat‐
867       tern parentheses), the change applies to the remainder of  the  pattern
868       that follows.  If the change is placed right at the start of a pattern,
869       PCRE extracts it into the global options (and it will therefore show up
870       in data extracted by the pcre_fullinfo() function).
871
872       An  option  change  within a subpattern (see below for a description of
873       subpatterns) affects only that part of the current pattern that follows
874       it, so
875
876         (a(?i)b)c
877
878       matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
879       used).  By this means, options can be made to have  different  settings
880       in  different parts of the pattern. Any changes made in one alternative
881       do carry on into subsequent branches within the  same  subpattern.  For
882       example,
883
884         (a(?i)b|c)
885
886       matches  "ab",  "aB",  "c",  and "C", even though when matching "C" the
887       first branch is abandoned before the option setting.  This  is  because
888       the  effects  of option settings happen at compile time. There would be
889       some very weird behaviour otherwise.
890
891       Note: There are other PCRE-specific options that  can  be  set  by  the
892       application  when  the  compile  or match functions are called. In some
893       cases the pattern can contain special  leading  sequences  to  override
894       what  the  application  has set or what has been defaulted. Details are
895       given in the section entitled "Newline sequences" above.
896

SUBPATTERNS

898
899       Subpatterns are delimited by parentheses (round brackets), which can be
900       nested.  Turning part of a pattern into a subpattern does two things:
901
902       1. It localizes a set of alternatives. For example, the pattern
903
904         cat(aract|erpillar|)
905
906       matches  one  of the words "cat", "cataract", or "caterpillar". Without
907       the parentheses, it would match  "cataract",  "erpillar"  or  an  empty
908       string.
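The effect of the parentheses on the alternatives can be seen by matching a few candidate words in full. This is a sketch in Python's re module, used as a stand-in; the word list is illustrative.

```python
import re

words = ["cat", "cataract", "caterpillar", "erpillar", ""]

grouped = re.compile(r"cat(aract|erpillar|)")
ungrouped = re.compile(r"cataract|erpillar|")

# With parentheses, "cat" is a common prefix of every alternative.
grouped_hits = [w for w in words if grouped.fullmatch(w)]

# Without them, "cat" belongs to the first alternative only.
ungrouped_hits = [w for w in words if ungrouped.fullmatch(w)]
```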
909
910       2.  It  sets  up  the  subpattern as a capturing subpattern. This means
911       that, when the whole pattern  matches,  that  portion  of  the  subject
912       string that matched the subpattern is passed back to the caller via the
913       ovector argument of pcre_exec(). Opening parentheses are  counted  from
914       left  to  right  (starting  from 1) to obtain numbers for the capturing
915       subpatterns.
916
917       For example, if the string "the red king" is matched against  the  pat‐
918       tern
919
920         the ((red|white) (king|queen))
921
922       the captured substrings are "red king", "red", and "king", and are num‐
923       bered 1, 2, and 3, respectively.
924
925       The fact that plain parentheses fulfil  two  functions  is  not  always
926       helpful.   There are often times when a grouping subpattern is required
927       without a capturing requirement. If an opening parenthesis is  followed
928       by  a question mark and a colon, the subpattern does not do any captur‐
929       ing, and is not counted when computing the  number  of  any  subsequent
930       capturing  subpatterns. For example, if the string "the white queen" is
931       matched against the pattern
932
933         the ((?:red|white) (king|queen))
934
935       the captured substrings are "white queen" and "queen", and are numbered
936       1 and 2. The maximum number of capturing subpatterns is 65535.
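The numbering in both examples can be reproduced with a short sketch. Python's re module is a stand-in here; its groups() method returns the captured substrings in the same left-to-right order, where PCRE would report them through the ovector.

```python
import re

capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
captured = capturing.groups()        # groups 1, 2, 3

# (?: ... ) groups without capturing and is skipped when numbering.
non_capturing = re.fullmatch(
    r"the ((?:red|white) (king|queen))", "the white queen"
)
captured_nc = non_capturing.groups() # groups 1, 2
```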
937
938       As  a  convenient shorthand, if any option settings are required at the
939       start of a non-capturing subpattern,  the  option  letters  may  appear
940       between the "?" and the ":". Thus the two patterns
941
942         (?i:saturday|sunday)
943         (?:(?i)saturday|sunday)
944
945       match exactly the same set of strings. Because alternative branches are
946       tried from left to right, and options are not reset until  the  end  of
947       the  subpattern is reached, an option setting in one branch does affect
948       subsequent branches, so the above patterns match "SUNDAY"  as  well  as
949       "Saturday".
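A sketch of the shorthand form in Python's re module, used as a stand-in. Only the (?i:...) form is shown, because recent Python versions reject an option setting such as (?i) that is not at the very start of the pattern, unlike PCRE.

```python
import re

weekend = re.compile(r"(?i:saturday|sunday)")

# The option applies to every branch of the group, not just the first.
upper_sunday = weekend.fullmatch("SUNDAY") is not None
mixed_saturday = weekend.fullmatch("Saturday") is not None
other_day = weekend.fullmatch("Monday") is not None
```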
950

DUPLICATE SUBPATTERN NUMBERS

952
953       Perl 5.10 introduced a feature whereby each alternative in a subpattern
954       uses the same numbers for its capturing parentheses. Such a  subpattern
955       starts  with (?| and is itself a non-capturing subpattern. For example,
956       consider this pattern:
957
958         (?|(Sat)ur|(Sun))day
959
960       Because the two alternatives are inside a (?| group, both sets of  cap‐
961       turing  parentheses  are  numbered one. Thus, when the pattern matches,
962       you can look at captured substring number  one,  whichever  alternative
963       matched.  This  construct  is useful when you want to capture part, but
964       not all, of one of a number of alternatives. Inside a (?| group, paren‐
965       theses  are  numbered as usual, but the number is reset at the start of
966       each branch. The numbers of any capturing buffers that follow the  sub‐
967       pattern  start after the highest number used in any branch. The follow‐
968       ing example is taken from the Perl documentation.  The  numbers  under‐
969       neath show in which buffer the captured content will be stored.
970
971         # before  ---------------branch-reset----------- after
972         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
973         # 1            2         2  3        2     3     4
974
975       A  backreference  or  a  recursive call to a numbered subpattern always
976       refers to the first one in the pattern with the given number.
977
978       An alternative approach to using this "branch reset" feature is to  use
979       duplicate named subpatterns, as described in the next section.
980

NAMED SUBPATTERNS

982
983       Identifying  capturing  parentheses  by number is simple, but it can be
984       very hard to keep track of the numbers in complicated  regular  expres‐
985       sions.  Furthermore,  if  an  expression  is  modified, the numbers may
986       change. To help with this difficulty, PCRE supports the naming of  sub‐
987       patterns. This feature was not added to Perl until release 5.10. Python
988       had the feature earlier, and PCRE introduced it at release  4.0,  using
989       the  Python syntax. PCRE now supports both the Perl and the Python syn‐
990       tax.
991
992       In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)
993       or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
994       to capturing parentheses from other parts of the pattern, such as back‐
995       references,  recursion,  and conditions, can be made by name as well as
996       by number.
997
998       Names consist of up to  32  alphanumeric  characters  and  underscores.
999       Named  capturing  parentheses  are  still  allocated numbers as well as
1000       names, exactly as if the names were not present. The PCRE API  provides
1001       function calls for extracting the name-to-number translation table from
1002       a compiled pattern. There is also a convenience function for extracting
1003       a captured substring by name.
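Named groups that keep their numbers can be sketched with the Python syntax, which PCRE also accepts. The code below uses Python's re module, not the PCRE API: groupindex is Python's counterpart of the name-to-number table, and the pattern and group names are hypothetical.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.fullmatch("2009-04")

# Names map onto the ordinary left-to-right group numbers.
name_to_number = dict(date.groupindex)

# A captured substring can be fetched by name or by number.
year_by_name = m.group("year")
year_by_number = m.group(1)
```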
1004
1005       By  default, a name must be unique within a pattern, but it is possible
1006       to relax this constraint by setting the PCRE_DUPNAMES option at compile
1007       time.  This  can  be useful for patterns where only one instance of the
1008       named parentheses can match. Suppose you want to match the  name  of  a
1009       weekday,  either as a 3-letter abbreviation or as the full name, and in
1010       both cases you want to extract the abbreviation. This pattern (ignoring
1011       the line breaks) does the job:
1012
1013         (?<DN>Mon|Fri|Sun)(?:day)?|
1014         (?<DN>Tue)(?:sday)?|
1015         (?<DN>Wed)(?:nesday)?|
1016         (?<DN>Thu)(?:rsday)?|
1017         (?<DN>Sat)(?:urday)?
1018
1019       There  are  five capturing substrings, but only one is ever set after a
1020       match.  (An alternative way of solving this problem is to use a "branch
1021       reset" subpattern, as described in the previous section.)
1022
1023       The  convenience  function  for extracting the data by name returns the
1024       substring for the first (and in this example, the only)  subpattern  of
1025       that  name  that  matched.  This saves searching to find which numbered
1026       subpattern it was. If you make a reference to a non-unique  named  sub‐
1027       pattern  from elsewhere in the pattern, the one that corresponds to the
1028       lowest number is used. For further details of the interfaces  for  han‐
1029       dling named subpatterns, see the pcreapi documentation.
1030

REPETITION

1032
1033       Repetition  is  specified  by  quantifiers, which can follow any of the
1034       following items:
1035
1036         a literal data character
1037         the dot metacharacter
1038         the \C escape sequence
1039         the \X escape sequence (in UTF-8 mode with Unicode properties)
1040         the \R escape sequence
1041         an escape such as \d that matches a single character
1042         a character class
1043         a back reference (see next section)
1044         a parenthesized subpattern (unless it is an assertion)
1045
1046       The general repetition quantifier specifies a minimum and maximum  num‐
1047       ber  of  permitted matches, by giving the two numbers in curly brackets
1048       (braces), separated by a comma. The numbers must be  less  than  65536,
1049       and the first must be less than or equal to the second. For example:
1050
1051         z{2,4}
1052
1053       matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
1054       special character. If the second number is omitted, but  the  comma  is
1055       present,  there  is  no upper limit; if the second number and the comma
1056       are both omitted, the quantifier specifies an exact number of  required
1057       matches. Thus
1058
1059         [aeiou]{3,}
1060
1061       matches at least 3 successive vowels, but may match many more, while
1062
1063         \d{8}
1064
1065       matches  exactly  8  digits. An opening curly bracket that appears in a
1066       position where a quantifier is not allowed, or one that does not  match
1067       the  syntax of a quantifier, is taken as a literal character. For exam‐
1068       ple, {,6} is not a quantifier, but a literal string of four characters.
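The three quantifier forms can be tried out with a small sketch in Python's re module, used as a stand-in. One known difference: Python does treat {,6} as a quantifier (zero to six), so that case is deliberately left out. The test strings are illustrative.

```python
import re

# {2,4}: between two and four repetitions.
z_counts = [
    len(s) for s in ["z", "zz", "zzz", "zzzz", "zzzzz"]
    if re.fullmatch(r"z{2,4}", s)
]

# {3,}: at least three, with no upper limit.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()

# {8}: exactly eight.
eight_digits = re.fullmatch(r"\d{8}", "20090411") is not None
seven_digits = re.fullmatch(r"\d{8}", "2009041") is not None
```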
1069
1070       In UTF-8 mode, quantifiers apply to UTF-8  characters  rather  than  to
1071       individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char‐
1072       acters, each of which is represented by a two-byte sequence. Similarly,
1073       when Unicode property support is available, \X{3} matches three Unicode
1074       extended sequences, each of which may be several bytes long  (and  they
1075       may be of different lengths).
1076
1077       The quantifier {0} is permitted, causing the expression to behave as if
1078       the previous item and the quantifier were not present. This may be use‐
1079       ful  for  subpatterns that are referenced as subroutines from elsewhere
1080       in the pattern. Items other than subpatterns that have a {0} quantifier
1081       are omitted from the compiled pattern.
1082
1083       For  convenience, the three most common quantifiers have single-charac‐
1084       ter abbreviations:
1085
1086         *    is equivalent to {0,}
1087         +    is equivalent to {1,}
1088         ?    is equivalent to {0,1}
1089
1090       It is possible to construct infinite loops by  following  a  subpattern
1091       that can match no characters with a quantifier that has no upper limit,
1092       for example:
1093
1094         (a?)*
1095
1096       Earlier versions of Perl and PCRE used to give an error at compile time
1097       for  such  patterns. However, because there are cases where this can be
1098       useful, such patterns are now accepted, but if any  repetition  of  the
1099       subpattern  does in fact match no characters, the loop is forcibly bro‐
1100       ken.
1101
1102       By default, the quantifiers are "greedy", that is, they match  as  much
1103       as  possible  (up  to  the  maximum number of permitted times), without
1104       causing the rest of the pattern to fail. The classic example  of  where
1105       this gives problems is in trying to match comments in C programs. These
1106       appear between /* and */ and within the comment,  individual  *  and  /
1107       characters  may  appear. An attempt to match C comments by applying the
1108       pattern
1109
1110         /\*.*\*/
1111
1112       to the string
1113
1114         /* first comment */  not comment  /* second comment */
1115
1116       fails, because it matches the entire string owing to the greediness  of
1117       the .* item.
1118
1119       However,  if  a quantifier is followed by a question mark, it ceases to
1120       be greedy, and instead matches the minimum number of times possible, so
1121       the pattern
1122
1123         /\*.*?\*/
1124
1125       does  the  right  thing with the C comments. The meaning of the various
1126       quantifiers is not otherwise changed,  just  the  preferred  number  of
1127       matches.   Do  not  confuse this use of question mark with its use as a
1128       quantifier in its own right. Because it has two uses, it can  sometimes
1129       appear doubled, as in
1130
1131         \d??\d
1132
1133       which matches one digit by preference, but can match two if that is the
1134       only way the rest of the pattern matches.
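The difference between the greedy and minimizing forms can be sketched with the C comment example. Python's re module is a stand-in here; it uses the same ? suffix for minimizing quantifiers.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy: .* runs to the last "*/" in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing: .*? stops at the first "*/" after each "/*".
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
prefers_one = re.match(r"\d??\d", "12").group()
forced_two = re.fullmatch(r"\d??\d", "12").group()
```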
1135
1136       If the PCRE_UNGREEDY option is set (an option that is not available  in
1137       Perl),  the  quantifiers are not greedy by default, but individual ones
1138       can be made greedy by following them with a  question  mark.  In  other
1139       words, it inverts the default behaviour.
1140
1141       When  a  parenthesized  subpattern  is quantified with a minimum repeat
1142       count that is greater than 1 or with a limited maximum, more memory  is
1143       required  for  the  compiled  pattern, in proportion to the size of the
1144       minimum or maximum.
1145
1146       If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv‐
1147       alent  to  Perl's  /s) is set, thus allowing the dot to match newlines,
1148       the pattern is implicitly anchored, because whatever  follows  will  be
1149       tried  against every character position in the subject string, so there
1150       is no point in retrying the overall match at  any  position  after  the
1151       first.  PCRE  normally treats such a pattern as though it were preceded
1152       by \A.
1153
1154       In cases where it is known that the subject  string  contains  no  new‐
1155       lines,  it  is  worth setting PCRE_DOTALL in order to obtain this opti‐
1156       mization, or alternatively using ^ to indicate anchoring explicitly.
1157
1158       However, there is one situation where the optimization cannot be  used.
1159       When  .*   is  inside  capturing  parentheses that are the subject of a
1160       backreference elsewhere in the pattern, a match at the start  may  fail
1161       where a later one succeeds. Consider, for example:
1162
1163         (.*)abc\1
1164
1165       If  the subject is "xyz123abc123" the match point is the fourth charac‐
1166       ter. For this reason, such a pattern is not implicitly anchored.
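This example can be checked with a sketch in Python's re module, a stand-in that handles this backreference the same way. Offsets in Python are zero-based, so the fourth character is offset 3.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# The attempts at offsets 0, 1 and 2 fail because \1 would have to repeat
# "xyz123", "yz123" or "z123" after "abc"; the attempt at offset 3 succeeds.
match_start = m.start()
whole_match = m.group(0)
captured = m.group(1)
```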
1167
1168       When a capturing subpattern is repeated, the value captured is the sub‐
1169       string that matched the final iteration. For example, after
1170
1171         (tweedle[dume]{3}\s*)+
1172
1173       has matched "tweedledum tweedledee" the value of the captured substring
1174       is "tweedledee". However, if there are  nested  capturing  subpatterns,
1175       the  corresponding captured values may have been set in previous itera‐
1176       tions. For example, after
1177
1178         /(a|(b))+/
1179
1180       matches "aba" the value of the second captured substring is "b".
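Both capture rules can be observed in a short sketch. It uses Python's re module as a stand-in, which keeps the same values for these two patterns.

```python
import re

# The outer group reports only the final iteration.
repeated = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = repeated.group(1)

# The nested group took part only in the second iteration ("b"),
# and its value survives the third iteration, which matched "a".
nested = re.fullmatch(r"(a|(b))+", "aba")
outer_value = nested.group(1)
inner_value = nested.group(2)
```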
1181

ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

1183
1184       With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
1185       repetition,  failure  of what follows normally causes the repeated item
1186       to be re-evaluated to see if a different number of repeats  allows  the
1187       rest  of  the pattern to match. Sometimes it is useful to prevent this,
1188       either to change the nature of the match, or to cause it to fail earlier
1189       than  it otherwise might, when the author of the pattern knows there is
1190       no point in carrying on.
1191
1192       Consider, for example, the pattern \d+foo when applied to  the  subject
1193       line
1194
1195         123456bar
1196
1197       After matching all 6 digits and then failing to match "foo", the normal
1198       action of the matcher is to try again with only 5 digits  matching  the
1199       \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
1200       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
1201       the  means for specifying that once a subpattern has matched, it is not
1202       to be re-evaluated in this way.
1203
1204       If we use atomic grouping for the previous example, the  matcher  gives
1205       up  immediately  on failing to match "foo" the first time. The notation
1206       is a kind of special parenthesis, starting with (?> as in this example:
1207
1208         (?>\d+)foo
1209
1210       This kind of parenthesis "locks up" the  part of the  pattern  it  con‐
1211       tains  once  it  has matched, and a failure further into the pattern is
1212       prevented from backtracking into it. Backtracking past it  to  previous
1213       items, however, works as normal.
1214
1215       An  alternative  description  is that a subpattern of this type matches
1216       the string of characters that an  identical  standalone  pattern  would
1217       match, if anchored at the current point in the subject string.
1218
1219       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
1220       such as the above example can be thought of as a maximizing repeat that
1221       must  swallow  everything  it can. So, while both \d+ and \d+? are pre‐
1222       pared to adjust the number of digits they match in order  to  make  the
1223       rest of the pattern match, (?>\d+) can only match an entire sequence of
1224       digits.
1225
1226       Atomic groups in general can of course contain arbitrarily  complicated
1227       subpatterns,  and  can  be  nested. However, when the subpattern for an
1228       atomic group is just a single repeated item, as in the example above, a
1229       simpler notation, called a "possessive quantifier", can be used. This
1230       consists of an additional + character  following  a  quantifier.  Using
1231       this notation, the previous example can be rewritten as
1232
1233         \d++foo
1234
1235       Note that a possessive quantifier can be used with an entire group, for
1236       example:
1237
1238         (abc|xyz){2,3}+
1239
1240       Possessive  quantifiers  are  always  greedy;  the   setting   of   the
1241       PCRE_UNGREEDY option is ignored. They are a convenient notation for the
1242       simpler forms of atomic group. However, there is no difference  in  the
1243       meaning  of  a  possessive  quantifier and the equivalent atomic group,
1244       though there may be a performance  difference;  possessive  quantifiers
1245       should be slightly faster.
1246
1247       The  possessive  quantifier syntax is an extension to the Perl 5.8 syn‐
1248       tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first
1249       edition of his book. Mike McCloskey liked it, so implemented it when he
1250       built Sun's Java package, and PCRE copied it from there. It  ultimately
1251       found its way into Perl at release 5.10.
1252
1253       PCRE has an optimization that automatically "possessifies" certain sim‐
1254       ple pattern constructs. For example, the sequence  A+B  is  treated  as
1255       A++B  because  there is no point in backtracking into a sequence of A's
1256       when B must follow.
1257
1258       When a pattern contains an unlimited repeat inside  a  subpattern  that
1259       can  itself  be  repeated  an  unlimited number of times, the use of an
1260       atomic group is the only way to avoid some  failing  matches  taking  a
1261       very long time indeed. The pattern
1262
1263         (\D+|<\d+>)*[!?]
1264
1265       matches  an  unlimited number of substrings that either consist of non-
1266       digits, or digits enclosed in <>, followed by either ! or  ?.  When  it
1267       matches, it runs quickly. However, if it is applied to
1268
1269         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1270
1271       it  takes  a  long  time  before reporting failure. This is because the
1272       string can be divided between the internal \D+ repeat and the  external
1273       *  repeat  in  a  large  number of ways, and all have to be tried. (The
1274       example uses [!?] rather than a single character at  the  end,  because
1275       both  PCRE  and  Perl have an optimization that allows for fast failure
1276       when a single character is used. They remember the last single  charac‐
1277       ter  that  is required for a match, and fail early if it is not present
1278       in the string.) If the pattern is changed so that  it  uses  an  atomic
1279       group, like this:
1280
1281         ((?>\D+)|<\d+>)*[!?]
1282
1283       sequences of non-digits cannot be broken, and failure happens quickly.
1284
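       The effect can be tried on a small scale. The sketch below again uses
       Python's re module as a stand-in, with the same assumption about
       Python 3.11 and the same lookahead emulation on older versions. The
       outer group is written as non-capturing to keep group numbers simple,
       and the plain pattern is run only on a short subject, because its
       running time grows very quickly with the length of the subject.

```python
import re
import sys

NATIVE = sys.version_info >= (3, 11)

# An atomic run of non-digits: native syntax, or the lookahead emulation
# (which introduces one extra capturing group) on older interpreters.
RUN = r"(?>\D+)" if NATIVE else r"(?=(\D+))\1"

plain = re.compile(r"(?:\D+|<\d+>)*[!?]")
atomic = re.compile(r"(?:" + RUN + r"|<\d+>)*[!?]")

long_subject = "a" * 52
short_subject = "a" * 12

# The atomic form cannot split the run of a's, so it fails quickly.
atomic_long = atomic.search(long_subject)

# The plain form also fails, but only after trying every way of dividing
# the a's between the inner and outer repeats.
plain_short = plain.search(short_subject)

# On this subject both forms find the same match.
plain_ok = plain.search("<12>!")
atomic_ok = atomic.search("<12>!")
```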

BACK REFERENCES

1286
1287       Outside a character class, a backslash followed by a digit greater than
1288       0 (and possibly further digits) is a back reference to a capturing sub‐
1289       pattern  earlier  (that is, to its left) in the pattern, provided there
1290       have been that many previous capturing left parentheses.
1291
1292       However, if the decimal number following the backslash is less than 10,
1293       it  is  always  taken  as a back reference, and causes an error only if
1294       there are not that many capturing left parentheses in the  entire  pat‐
1295       tern.  In  other words, the parentheses that are referenced need not be
1296       to the left of the reference for numbers less than 10. A "forward  back
1297       reference"  of  this  type can make sense when a repetition is involved
1298       and the subpattern to the right has participated in an  earlier  itera‐
1299       tion.
1300
1301       It  is  not  possible to have a numerical "forward back reference" to a
1302       subpattern whose number is 10 or  more  using  this  syntax  because  a
1303       sequence  such  as  \50 is interpreted as a character defined in octal.
1304       See the subsection entitled "Non-printing characters" above for further
1305       details  of  the  handling of digits following a backslash. There is no
1306       such problem when named parentheses are used. A back reference  to  any
1307       subpattern is possible using named parentheses (see below).
1308
1309       Another  way  of  avoiding  the ambiguity inherent in the use of digits
1310       following a backslash is to use the \g escape sequence, which is a fea‐
1311       ture  introduced  in  Perl  5.10.  This  escape  must be followed by an
1312       unsigned number or a negative number, optionally  enclosed  in  braces.
1313       These examples are all identical:
1314
1315         (ring), \1
1316         (ring), \g1
1317         (ring), \g{1}
1318
1319       An  unsigned number specifies an absolute reference without the ambigu‐
1320       ity that is present in the older syntax. It is also useful when literal
1321       digits follow the reference. A negative number is a relative reference.
1322       Consider this example:
1323
1324         (abc(def)ghi)\g{-1}
1325
1326       The sequence \g{-1} is a reference to the most recently started captur‐
1327       ing  subpattern  before \g, that is, it is equivalent to \2. Similarly,
1328       \g{-2} would be equivalent to \1. The use of relative references can be
1329       helpful  in  long  patterns,  and  also in patterns that are created by
1330       joining together fragments that contain references within themselves.
1331
1332       A back reference matches whatever actually matched the  capturing  sub‐
1333       pattern  in  the  current subject string, rather than anything matching
1334       the subpattern itself (see "Subpatterns as subroutines" below for a way
1335       of doing that). So the pattern
1336
1337         (sens|respons)e and \1ibility
1338
1339       matches  "sense and sensibility" and "response and responsibility", but
1340       not "sense and responsibility". If caseful matching is in force at  the
1341       time  of the back reference, the case of letters is relevant. For exam‐
1342       ple,
1343
1344         ((?i)rah)\s+\1
1345
1346       matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
1347       original capturing subpattern is matched caselessly.
1348
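       Both examples can be reproduced with Python's re module, used here
       only as a convenient stand-in for PCRE. One difference is an
       assumption of the sketch: recent Python versions reject a bare (?i)
       in the middle of a pattern, so the scoped form (?i:...) is used
       instead.

```python
import re

# \1 must repeat the text that group 1 matched, not just its pattern.
pairs = re.compile(r"(sens|respons)e and \1ibility")

# Group 1 is matched caselessly, but the back reference is caseful.
rah = re.compile(r"((?i:rah))\s+\1")


def matches(pattern, subject):
    """Return True if the whole subject matches the compiled pattern."""
    return pattern.fullmatch(subject) is not None
```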
1349       There  are  several  different ways of writing back references to named
1350       subpatterns. The .NET syntax \k{name} and the Perl syntax  \k<name>  or
1351       \k'name'  are supported, as is the Python syntax (?P=name). Perl 5.10's
1352       unified back reference syntax, in which \g can be used for both numeric
1353       and  named  references,  is  also supported. We could rewrite the above
1354       example in any of the following ways:
1355
1356         (?<p1>(?i)rah)\s+\k<p1>
1357         (?'p1'(?i)rah)\s+\k{p1}
1358         (?P<p1>(?i)rah)\s+(?P=p1)
1359         (?<p1>(?i)rah)\s+\g{p1}
1360
1361       A subpattern that is referenced by  name  may  appear  in  the  pattern
1362       before or after the reference.
1363
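       Of the spellings above, Python's re module accepts only the Python
       one, (?P<name>...) with (?P=name). The sketch below shows that form;
       the name p1 comes from the example, and (?i:...) again replaces the
       mid-pattern (?i).

```python
import re

named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

hit = named.fullmatch("RAH RAH")
captured = hit.group("p1")           # the text the named group matched
miss = named.fullmatch("RAH rah")    # the named back reference is caseful
```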
1364       There  may be more than one back reference to the same subpattern. If a
1365       subpattern has not actually been used in a particular match,  any  back
1366       references to it always fail. For example, the pattern
1367
1368         (a|(bc))\2
1369
1370       always  fails if it starts to match "a" rather than "bc". Because there
1371       may be many capturing parentheses in a pattern,  all  digits  following
1372       the  backslash  are taken as part of a potential back reference number.
1373       If the pattern continues with a digit character, some delimiter must be
1374       used  to  terminate  the back reference. If the PCRE_EXTENDED option is
1375       set, this can be white space.  Otherwise an empty  comment  (see  "Com‐
1376       ments" below) can be used.
1377
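       A short sketch of both points, using Python's re module as a
       stand-in. The second pattern, with its group (x) and trailing literal
       1, is a made-up illustration of the empty-comment delimiter.

```python
import re

unset = re.compile(r"(a|(bc))\2")

# Group 2 takes no part when the first alternative is used, so \2 fails.
took_a = unset.match("aa")

# When "bc" is matched, group 2 is set and \2 can repeat it.
took_bc = unset.match("bcbc")

# An empty comment ends the back reference, so the following 1 is a
# literal digit rather than part of the group number.
delimited = re.compile(r"(x)\1(?#)1")
literal_digit = delimited.fullmatch("xx1")
```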
1378       A  back reference that occurs inside the parentheses to which it refers
1379       fails when the subpattern is first used, so, for example,  (a\1)  never
1380       matches.   However,  such references can be useful inside repeated sub‐
1381       patterns. For example, the pattern
1382
1383         (a|b\1)+
1384
1385       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter‐
1386       ation  of  the  subpattern,  the  back  reference matches the character
1387       string corresponding to the previous iteration. In order  for  this  to
1388       work,  the  pattern must be such that the first iteration does not need
1389       to match the back reference. This can be done using alternation, as  in
1390       the example above, or by a quantifier with a minimum of zero.
1391

ASSERTIONS

1393
1394       An  assertion  is  a  test on the characters following or preceding the
1395       current matching point that does not actually consume  any  characters.
1396       The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are
1397       described above.
1398
1399       More complicated assertions are coded as  subpatterns.  There  are  two
1400       kinds:  those  that  look  ahead of the current position in the subject
1401       string, and those that look  behind  it.  An  assertion  subpattern  is
1402       matched  in  the  normal way, except that it does not cause the current
1403       matching position to be changed.
1404
1405       Assertion subpatterns are not capturing subpatterns,  and  may  not  be
1406       repeated,  because  it  makes no sense to assert the same thing several
1407       times. If any kind of assertion contains capturing  subpatterns  within
1408       it,  these are counted for the purposes of numbering the capturing sub‐
1409       patterns in the whole pattern.  However, substring capturing is carried
1410       out  only  for  positive assertions, because it does not make sense for
1411       negative assertions.
1412
1413   Lookahead assertions
1414
1415       Lookahead assertions start with (?= for positive assertions and (?! for
1416       negative assertions. For example,
1417
1418         \w+(?=;)
1419
1420       matches  a word followed by a semicolon, but does not include the semi‐
1421       colon in the match, and
1422
1423         foo(?!bar)
1424
1425       matches any occurrence of "foo" that is not  followed  by  "bar".  Note
1426       that the apparently similar pattern
1427
1428         (?!foo)bar
1429
1430       does  not  find  an  occurrence  of "bar" that is preceded by something
1431       other than "foo"; it finds any occurrence of "bar" whatsoever,  because
1432       the assertion (?!foo) is always true when the next three characters are
1433       "bar". A lookbehind assertion is needed to achieve the other effect.
1434
1435       If you want to force a matching failure at some point in a pattern, the
1436       most  convenient  way  to  do  it  is with (?!) because an empty string
1437       always matches, so an assertion that requires there not to be an  empty
1438       string must always fail.
1439
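       The lookahead examples above behave the same way in Python's re
       module, which is used in this sketch instead of PCRE. The subject
       strings are illustrative assumptions.

```python
import re

# The semicolon is tested but is not part of the match.
word_before_semicolon = re.search(r"\w+(?=;)", "x = value;").group()

# Only the "foo" that is not followed by "bar" is reported.
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo) is tested where "bar" starts, so it cannot exclude "foobar".
bar_anywhere = re.search(r"(?!foo)bar", "foobar")

# (?!) can never succeed, so it forces a failure at that point.
forced_failure = re.search(r"a(?!)", "aaa")
```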
1440   Lookbehind assertions
1441
1442       Lookbehind  assertions start with (?<= for positive assertions and (?<!
1443       for negative assertions. For example,
1444
1445         (?<!foo)bar
1446
1447       does find an occurrence of "bar" that is not  preceded  by  "foo".  The
1448       contents  of  a  lookbehind  assertion are restricted such that all the
1449       strings it matches must have a fixed length. However, if there are sev‐
1450       eral  top-level  alternatives,  they  do  not all have to have the same
1451       fixed length. Thus
1452
1453         (?<=bullock|donkey)
1454
1455       is permitted, but
1456
1457         (?<!dogs?|cats?)
1458
1459       causes an error at compile time. Branches that match  different  length
1460       strings  are permitted only at the top level of a lookbehind assertion.
1461       This is an extension compared with  Perl  (at  least  for  5.8),  which
1462       requires  all branches to match the same length of string. An assertion
1463       such as
1464
1465         (?<=ab(c|de))
1466
1467       is not permitted, because its single top-level  branch  can  match  two
1468       different  lengths,  but  it is acceptable if rewritten to use two top-
1469       level branches:
1470
1471         (?<=abc|abde)
1472
1473       In some cases, the Perl 5.10 escape sequence \K (see above) can be used
1474       instead of a lookbehind assertion; this is not restricted to a fixed
1475       length.
1476
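       A sketch of these rules using Python's re module. Note the assumption
       that matters here: Python is stricter than PCRE and requires every
       branch of a single lookbehind to have the same width, so a pattern
       such as (?<=bullock|donkey) is rejected there. The last pattern shows
       one way to express it as two separate assertions.

```python
import re

# "bar" is reported only where it is not preceded by "foo".
not_after_foo = re.compile(r"(?<!foo)bar")
hits = [m.start() for m in not_after_foo.finditer("foobar bazbar")]

# A branch that can match strings of different lengths is refused at
# compile time.
try:
    re.compile(r"(?<!dogs?|cats?)")
    rejected = False
except re.error:
    rejected = True

# Branches of different fixed lengths, written as an alternation of two
# lookbehinds so that Python accepts them.
after_animal = re.compile(r"(?:(?<=bullock)|(?<=donkey))\s")
```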
1477       The implementation of lookbehind assertions is, for  each  alternative,
1478       to  temporarily  move the current position back by the fixed length and
1479       then try to match. If there are insufficient characters before the cur‐
1480       rent position, the assertion fails.
1481
1482       PCRE does not allow the \C escape (which matches a single byte in UTF-8
1483       mode) to appear in lookbehind assertions, because it makes it  impossi‐
1484       ble  to  calculate the length of the lookbehind. The \X and \R escapes,
1485       which can match different numbers of bytes, are also not permitted.
1486
1487       Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
1488       assertions  to  specify  efficient  matching  at the end of the subject
1489       string. Consider a simple pattern such as
1490
1491         abcd$
1492
1493       when applied to a long string that does  not  match.  Because  matching
1494       proceeds from left to right, PCRE will look for each "a" in the subject
1495       and then see if what follows matches the rest of the  pattern.  If  the
1496       pattern is specified as
1497
1498         ^.*abcd$
1499
1500       the  initial .* matches the entire string at first, but when this fails
1501       (because there is no following "a"), it backtracks to match all but the
1502       last  character,  then all but the last two characters, and so on. Once
1503       again the search for "a" covers the entire string, from right to  left,
1504       so we are no better off. However, if the pattern is written as
1505
1506         ^.*+(?<=abcd)
1507
1508       there  can  be  no backtracking for the .*+ item; it can match only the
1509       entire string. The subsequent lookbehind assertion does a  single  test
1510       on  the last four characters. If it fails, the match fails immediately.
1511       For long strings, this approach makes a significant difference  to  the
1512       processing time.
1513
1514   Using multiple assertions
1515
1516       Several assertions (of any sort) may occur in succession. For example,
1517
1518         (?<=\d{3})(?<!999)foo
1519
1520       matches  "foo" preceded by three digits that are not "999". Notice that
1521       each of the assertions is applied independently at the  same  point  in
1522       the  subject  string.  First  there  is a check that the previous three
1523       characters are all digits, and then there is  a  check  that  the  same
1524       three characters are not "999". This pattern does not match "foo"
1525       preceded by six characters, the first three of which are digits and
1526       the last three of which are not "999". For example, it doesn't match
1527       "123abcfoo". A pattern to do that is
1528
1529         (?<=\d{3}...)(?<!999)foo
1530
1531       This time the first assertion looks at the  preceding  six  characters,
1532       checking that the first three are digits, and then the second assertion
1533       checks that the preceding three characters are not "999".
1534
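       The difference between the two patterns can be seen with a few
       subjects. The sketch uses Python's re module as a stand-in; the
       subject strings other than "123abcfoo" are illustrative.

```python
import re

# Both assertions look at the same three characters before "foo".
three_digits = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters, the second only three.
digits_then_any = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def found(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None
```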
1535       Assertions can be nested in any combination. For example,
1536
1537         (?<=(?<!foo)bar)baz
1538
1539       matches an occurrence of "baz" that is preceded by "bar" which in  turn
1540       is not preceded by "foo", while
1541
1542         (?<=\d{3}(?!999)...)foo
1543
1544       is  another pattern that matches "foo" preceded by three digits and any
1545       three characters that are not "999".
1546

CONDITIONAL SUBPATTERNS

1548
1549       It is possible to cause the matching process to obey a subpattern  con‐
1550       ditionally  or to choose between two alternative subpatterns, depending
1551       on the result of an assertion, or whether a previous capturing  subpat‐
1552       tern  matched  or not. The two possible forms of conditional subpattern
1553       are
1554
1555         (?(condition)yes-pattern)
1556         (?(condition)yes-pattern|no-pattern)
1557
1558       If the condition is satisfied, the yes-pattern is used;  otherwise  the
1559       no-pattern  (if  present)  is used. If there are more than two alterna‐
1560       tives in the subpattern, a compile-time error occurs.
1561
1562       There are four kinds of condition: references  to  subpatterns,  refer‐
1563       ences to recursion, a pseudo-condition called DEFINE, and assertions.
1564
1565   Checking for a used subpattern by number
1566
1567       If  the  text between the parentheses consists of a sequence of digits,
1568       the condition is true if the capturing subpattern of  that  number  has
1569       previously  matched.  An  alternative notation is to precede the digits
1570       with a plus or minus sign. In this case, the subpattern number is rela‐
1571       tive rather than absolute.  The most recently opened parentheses can be
1572       referenced by (?(-1), the next most recent by (?(-2),  and  so  on.  In
1573       looping constructs it can also make sense to refer to subsequent groups
1574       with constructs such as (?(+2).
1575
1576       Consider the following pattern, which  contains  non-significant  white
1577       space to make it more readable (assume the PCRE_EXTENDED option) and to
1578       divide it into three parts for ease of discussion:
1579
1580         ( \( )?    [^()]+    (?(1) \) )
1581
1582       The first part matches an optional opening  parenthesis,  and  if  that
1583       character is present, sets it as the first captured substring. The sec‐
1584       ond part matches one or more characters that are not  parentheses.  The
1585       third part is a conditional subpattern that tests whether the first set
1586       of parentheses matched or not. If they did, that is, if the subject
1587       started with an opening parenthesis, the condition is true, and so the
1588       yes-pattern is executed and a closing parenthesis is required.
1589       Otherwise, since no-pattern is not present, the subpattern matches
1590       nothing. In other words, this pattern matches a sequence of
1591       non-parentheses, optionally enclosed in parentheses.
1592
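       Python's re module accepts the same numbered condition, so the
       pattern can be tried directly. In this sketch re.VERBOSE plays the
       part of PCRE_EXTENDED, and the subjects are illustrative.

```python
import re

optional_parens = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)


def accepts(subject):
    """Return True if the whole subject matches the conditional pattern."""
    return optional_parens.fullmatch(subject) is not None
```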
1593       If  you  were  embedding  this pattern in a larger one, you could use a
1594       relative reference:
1595
1596         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
1597
1598       This makes the fragment independent of the parentheses  in  the  larger
1599       pattern.
1600
1601   Checking for a used subpattern by name
1602
1603       Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
1604       used subpattern by name. For compatibility  with  earlier  versions  of
1605       PCRE,  which  had this facility before Perl, the syntax (?(name)...) is
1606       also recognized. However, there is a possible ambiguity with this  syn‐
1607       tax,  because  subpattern  names  may  consist entirely of digits. PCRE
1608       looks first for a named subpattern; if it cannot find one and the  name
1609       consists  entirely  of digits, PCRE looks for a subpattern of that num‐
1610       ber, which must be greater than zero. Using subpattern names that  con‐
1611       sist entirely of digits is not recommended.
1612
1613       Rewriting the above example to use a named subpattern gives this:
1614
1615         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
1616
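       For comparison, Python's re module supports only the bare form of
       the named condition, (?(name)...), together with its own
       (?P<name>...) group syntax. The sketch below is therefore a Python
       spelling of the example, not the Perl one shown above.

```python
import re

named_condition = re.compile(
    r"(?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )", re.VERBOSE
)

closed = named_condition.fullmatch("(abc)")
bare = named_condition.fullmatch("abc")
unclosed = named_condition.fullmatch("(abc")
```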
1617
1618   Checking for pattern recursion
1619
1620       If the condition is the string (R), and there is no subpattern with the
1621       name R, the condition is true if a recursive call to the whole  pattern
1622       or any subpattern has been made. If digits or a name preceded by amper‐
1623       sand follow the letter R, for example:
1624
1625         (?(R3)...) or (?(R&name)...)
1626
1627       the condition is true if the most recent recursion is into the  subpat‐
1628       tern  whose  number or name is given. This condition does not check the
1629       entire recursion stack.
1630
1631       At "top level", all these recursion test conditions are  false.  Recur‐
1632       sive patterns are described below.
1633
1634   Defining subpatterns for use by reference only
1635
1636       If  the  condition  is  the string (DEFINE), and there is no subpattern
1637       with the name DEFINE, the condition is  always  false.  In  this  case,
1638       there  may  be  only  one  alternative  in the subpattern. It is always
1639       skipped if control reaches this point  in  the  pattern;  the  idea  of
1640       DEFINE  is that it can be used to define "subroutines" that can be ref‐
1641       erenced from elsewhere. (The use of "subroutines" is described  below.)
1642       For  example,  a pattern to match an IPv4 address could be written like
1643       this (ignore white space and line breaks):
1644
1645         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
1646         \b (?&byte) (\.(?&byte)){3} \b
1647
1648       The first part of the pattern is a DEFINE group inside which another
1649       group  named "byte" is defined. This matches an individual component of
1650       an IPv4 address (a number less than 256). When  matching  takes  place,
1651       this  part  of  the pattern is skipped because DEFINE acts like a false
1652       condition.
1653
1654       The rest of the pattern uses references to the named group to match the
1655       four  dot-separated  components of an IPv4 address, insisting on a word
1656       boundary at each end.
1657
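       Engines without DEFINE can get a similar effect by building the
       pattern from a reusable string. The sketch below does this in Python,
       whose re module has neither DEFINE nor subroutine calls; the names
       BYTE, IPV4 and is_ipv4 are hypothetical, and string composition is a
       substitute technique, not what PCRE does internally.

```python
import re

# The body of the "byte" group, kept as a non-capturing fragment.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Four components separated by dots, with a word boundary at each end.
IPV4 = re.compile(r"\b" + BYTE + r"(?:\." + BYTE + r"){3}\b")


def is_ipv4(text):
    """Return True if the whole text is a dotted IPv4 address."""
    return IPV4.fullmatch(text) is not None
```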
1658   Assertion conditions
1659
1660       If the condition is not in any of the above  formats,  it  must  be  an
1661       assertion.   This may be a positive or negative lookahead or lookbehind
1662       assertion. Consider  this  pattern,  again  containing  non-significant
1663       white space, and with the two alternatives on the second line:
1664
1665         (?(?=[^a-z]*[a-z])
1666         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
1667
1668       The  condition  is  a  positive  lookahead  assertion  that  matches an
1669       optional sequence of non-letters followed by a letter. In other  words,
1670       it  tests  for the presence of at least one letter in the subject. If a
1671       letter is found, the subject is matched against the first  alternative;
1672       otherwise  it  is  matched  against  the  second.  This pattern matches
1673       strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
1674       letters and dd are digits.
1675
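       Where assertion conditions are not available, the same choice can be
       written as two alternatives, each guarded by the assertion or its
       negation. The sketch below uses this rewrite in Python, whose re
       module does not accept an assertion as a condition; it is an
       equivalent formulation under that assumption, not PCRE's conditional
       syntax.

```python
import re

by_letter = re.compile(
    r"""
    (?:
        (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead
      |
        (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2}      # no letter lies ahead
    )
    """,
    re.VERBOSE,
)

with_letters = by_letter.fullmatch("12-abc-34")
digits_only = by_letter.fullmatch("12-34-56")
wrong_shape = by_letter.fullmatch("12-ab-34")
```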

COMMENTS

1677
1678       The  sequence (?# marks the start of a comment that continues up to the
1679       next closing parenthesis. Nested parentheses  are  not  permitted.  The
1680       characters  that make up a comment play no part in the pattern matching
1681       at all.
1682
1683       If the PCRE_EXTENDED option is set, an unescaped # character outside  a
1684       character  class  introduces  a  comment  that continues to immediately
1685       after the next newline in the pattern.
1686

RECURSIVE PATTERNS

1688
1689       Consider the problem of matching a string in parentheses, allowing  for
1690       unlimited  nested  parentheses.  Without the use of recursion, the best
1691       that can be done is to use a pattern that  matches  up  to  some  fixed
1692       depth  of  nesting.  It  is not possible to handle an arbitrary nesting
1693       depth.
1694
1695       For some time, Perl has provided a facility that allows regular expres‐
1696       sions  to recurse (amongst other things). It does this by interpolating
1697       Perl code in the expression at run time, and the code can refer to  the
1698       expression itself. A Perl pattern using code interpolation to solve the
1699       parentheses problem can be created like this:
1700
1701         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
1702
1703       The (?p{...}) item interpolates Perl code at run time, and in this case
1704       refers recursively to the pattern in which it appears.
1705
1706       Obviously, PCRE cannot support the interpolation of Perl code. Instead,
1707       it supports special syntax for recursion of  the  entire  pattern,  and
1708       also  for  individual  subpattern  recursion. After its introduction in
1709       PCRE and Python, this kind of recursion was  introduced  into  Perl  at
1710       release 5.10.
1711
1712       A  special  item  that consists of (? followed by a number greater than
1713       zero and a closing parenthesis is a recursive call of the subpattern of
1714       the  given  number, provided that it occurs inside that subpattern. (If
1715       not, it is a "subroutine" call, which is described  in  the  next  sec‐
1716       tion.)  The special item (?R) or (?0) is a recursive call of the entire
1717       regular expression.
1718
1719       In PCRE (like Python, but unlike Perl), a recursive subpattern call  is
1720       always treated as an atomic group. That is, once it has matched some of
1721       the subject string, it is never re-entered, even if it contains untried
1722       alternatives and there is a subsequent matching failure.
1723
1724       This  PCRE  pattern  solves  the nested parentheses problem (assume the
1725       PCRE_EXTENDED option is set so that white space is ignored):
1726
1727         \( ( (?>[^()]+) | (?R) )* \)
1728
1729       First it matches an opening parenthesis. Then it matches any number  of
1730       substrings  which  can  either  be  a sequence of non-parentheses, or a
1731       recursive match of the pattern itself (that is, a  correctly  parenthe‐
1732       sized substring).  Finally there is a closing parenthesis.
1733
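       The three steps can be mirrored by a small hand-written recursive
       function. This is only a sketch of what the pattern accepts, written
       in Python because its re module has no recursion syntax; the function
       name match_nested is hypothetical, and the code says nothing about
       how PCRE implements recursion.

```python
def match_nested(subject, pos=0):
    """Return the index just past a balanced group starting at pos, or None."""
    # First, an opening parenthesis is required.
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            # Finally, the closing parenthesis.
            return pos + 1
        if ch == "(":
            # The (?R) step: a correctly parenthesized substring.
            end = match_nested(subject, pos)
            if end is None:
                return None
            pos = end
        else:
            # Part of a sequence of non-parentheses.
            pos += 1
    return None
```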
1734       If  this  were  part of a larger pattern, you would not want to recurse
1735       the entire pattern, so instead you could use this:
1736
1737         ( \( ( (?>[^()]+) | (?1) )* \) )
1738
1739       We have put the pattern into parentheses, and caused the  recursion  to
1740       refer to them instead of the whole pattern.
1741
1742       In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
1743       tricky. This is made easier by the use of relative references. (A  Perl
1744       5.10  feature.)   Instead  of  (?1)  in the pattern above you can write
1745       (?-2) to refer to the second most recently opened parentheses preceding
1746       the  recursion.  In  other  words,  a  negative number counts capturing
1747       parentheses leftwards from the point at which it is encountered.
1748
1749       It is also possible to refer to  subsequently  opened  parentheses,  by
1750       writing  references  such  as (?+2). However, these cannot be recursive
1751       because the reference is not inside the  parentheses  that  are  refer‐
1752       enced.  They  are  always  "subroutine" calls, as described in the next
1753       section.
1754
1755       An alternative approach is to use named parentheses instead.  The  Perl
1756       syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also
1757       supported. We could rewrite the above example as follows:
1758
1759         (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )
1760
1761       If there is more than one subpattern with the same name,  the  earliest
1762       one is used.
1763
1764       This  particular  example pattern that we have been looking at contains
1765       nested unlimited repeats, and so the use of atomic grouping for  match‐
1766       ing  strings  of non-parentheses is important when applying the pattern
1767       to strings that do not match. For example, when this pattern is applied
1768       to
1769
1770         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1771
1772       it  yields "no match" quickly. However, if atomic grouping is not used,
1773       the match runs for a very long time indeed because there  are  so  many
1774       different  ways  the  + and * repeats can carve up the subject, and all
1775       have to be tested before failure can be reported.
1776
1777       At the end of a match, the values set for any capturing subpatterns are
1778       those from the outermost level of the recursion at which the subpattern
1779       value is set.  If you want to obtain  intermediate  values,  a  callout
1780       function  can be used (see below and the pcrecallout documentation). If
1781       the pattern above is matched against
1782
1783         (ab(cd)ef)
1784
1785       the value for the capturing parentheses is  "ef",  which  is  the  last
1786       value  taken  on at the top level. If additional parentheses are added,
1787       giving
1788
1789         \( ( ( (?>[^()]+) | (?R) )* ) \)
1790            ^                        ^
1791            ^                        ^
1792
1793       the string they capture is "ab(cd)ef", the contents of  the  top  level
1794       parentheses.  If there are more than 15 capturing parentheses in a pat‐
1795       tern, PCRE has to obtain extra memory to store data during a recursion,
1796       which  it  does  by  using pcre_malloc, freeing it via pcre_free after‐
1797       wards. If  no  memory  can  be  obtained,  the  match  fails  with  the
1798       PCRE_ERROR_NOMEMORY error.
1799
1800       Do  not  confuse  the (?R) item with the condition (R), which tests for
1801       recursion.  Consider this pattern, which matches text in  angle  brack‐
1802       ets,  allowing for arbitrary nesting. Only digits are allowed in nested
1803       brackets (that is, when recursing), whereas any characters are  permit‐
1804       ted at the outer level.
1805
1806         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
1807
1808       In  this  pattern, (?(R) is the start of a conditional subpattern, with
1809       two different alternatives for the recursive and  non-recursive  cases.
1810       The (?R) item is the actual recursive call.
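
       The  difference  between the (R) condition and the (?R) call can be
       sketched as a small hand-written matcher. This is only an  illustra‐
       tion  in  Python,  not PCRE code: the function name match_angle, the
       DIGITS constant, and the recursing parameter are all hypothetical, and
       the matcher is anchored at the offset it is given.

```python
DIGITS = "0123456789"

def match_angle(subject, pos=0, recursing=False):
    """Return the end offset of a bracketed item starting at pos, or None."""
    if not subject.startswith("<", pos):
        return None
    pos += 1
    while True:
        # (?(R) \d++ | [^<>]*+): digits only when recursing, otherwise any
        # characters except angle brackets.
        while pos < len(subject) and (
            subject[pos] in DIGITS if recursing else subject[pos] not in "<>"
        ):
            pos += 1
        if not subject.startswith("<", pos):
            break
        # (?R): the actual recursive call, made with the recursion flag set.
        pos = match_angle(subject, pos, recursing=True)
        if pos is None:
            return None
    return pos + 1 if subject.startswith(">", pos) else None

print(match_angle("<abc<123>def>"))   # nested digits are accepted
print(match_angle("<abc<12x>def>"))   # a letter in a nested bracket is not
print(match_angle("<12x>"))           # the same text is fine at the outer level
```

       The recursing flag plays the part of the (R) test; the call that sets
       it plays the part of (?R).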
1811

SUBPATTERNS AS SUBROUTINES

1813
1814       If the syntax for a recursive subpattern reference (either by number or
1815       by name) is used outside the parentheses to which it refers,  it  oper‐
1816       ates  like a subroutine in a programming language. The "called" subpat‐
1817       tern may be defined before or after the reference. A numbered reference
1818       can be absolute or relative, as in these examples:
1819
1820         (...(absolute)...)...(?2)...
1821         (...(relative)...)...(?-1)...
1822         (...(?+1)...(relative)...
1823
1824       An earlier example pointed out that the pattern
1825
1826         (sens|respons)e and \1ibility
1827
1828       matches  "sense and sensibility" and "response and responsibility", but
1829       not "sense and responsibility". If instead the pattern
1830
1831         (sens|respons)e and (?1)ibility
1832
1833       is used, it does match "sense and responsibility" as well as the  other
1834       two  strings.  Another  example  is  given  in the discussion of DEFINE
1835       above.
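
       The contrast between a back reference and a subroutine call can  be
       tried  out with a rough emulation. Python's standard re module has no
       (?1) syntax, so the sketch below  assumes  that  repeating  the  text
       of  group 1 as a non-capturing group is a fair stand-in for the call
       in this particular pattern. The names BACKREF,  SUBROUTINE_LIKE,  and
       matches are hypothetical.

```python
import re

# \1 must repeat the exact text that group 1 captured.
BACKREF = re.compile(r"(sens|respons)e and \1ibility")
# Stand-in for (?1): the subpattern itself is applied again, so either
# alternative may match the second time.
SUBROUTINE_LIKE = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

SUBJECTS = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

def matches(pattern, subject):
    """Return True if the whole subject matches the compiled pattern."""
    return pattern.fullmatch(subject) is not None

for s in SUBJECTS:
    print(f"{s!r}: backref={matches(BACKREF, s)} "
          f"call={matches(SUBROUTINE_LIKE, s)}")
```

       Only  the  third  subject distinguishes the two forms: the back ref‐
       erence rejects it and the emulated call accepts it.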
1836
1837       Like recursive subpatterns, a "subroutine" call is always treated as an
1838       atomic  group. That is, once it has matched some of the subject string,
1839       it is never re-entered, even if it contains  untried  alternatives  and
1840       there is a subsequent matching failure.
1841
1842       When  a  subpattern is used as a subroutine, processing options such as
1843       case-independence are fixed when the subpattern is defined. They cannot
1844       be changed for different calls. For example, consider this pattern:
1845
1846         (abc)(?i:(?-1))
1847
1848       It  matches  "abcabc". It does not match "abcABC" because the change of
1849       processing option does not affect the called subpattern.
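
       This  means a subroutine call is not a textual copy of the subpattern.
       The sketch below, again using Python's re module as a stand-in, com‐
       pares  a  plain textual copy with a copy that is forced to keep the
       case-sensitive setting it was defined under. The (?-i:...)  group  is
       an  assumption  made  to  imitate the PCRE behaviour; INLINED, CALL_LIKE,
       and full are hypothetical names.

```python
import re

# Textual copy: "abc" sits inside (?i:...) and therefore becomes caseless.
INLINED = re.compile(r"(abc)(?i:abc)")
# Imitation of the call: the copy keeps the case-sensitive setting that was
# in force where group 1 was defined, here forced with (?-i:...).
CALL_LIKE = re.compile(r"(abc)(?i:(?-i:abc))")

def full(pattern, subject):
    """Return True if the whole subject matches the compiled pattern."""
    return pattern.fullmatch(subject) is not None

for subject in ("abcabc", "abcABC"):
    print(subject, full(INLINED, subject), full(CALL_LIKE, subject))
```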
1850

ONIGURUMA SUBROUTINE SYNTAX

1852
1853       For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
1854       name or a number enclosed either in angle brackets or single quotes, is
1855       an alternative syntax for referencing a  subpattern  as  a  subroutine,
1856       possibly  recursively. Here are two of the examples used above, rewrit‐
1857       ten using this syntax:
1858
1859         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
1860         (sens|respons)e and \g'1'ibility
1861
1862       PCRE supports an extension to Oniguruma: if a number is preceded  by  a
1863       plus or a minus sign it is taken as a relative reference. For example:
1864
1865         (abc)(?i:\g<-1>)
1866
1867       Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
1868       synonymous. The former is a back reference; the latter is a  subroutine
1869       call.
1870

CALLOUTS

1872
1873       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
1874       Perl code to be obeyed in the middle of matching a regular  expression.
1875       This makes it possible, amongst other things, to extract different sub‐
1876       strings that match the same pair of parentheses when there is a repeti‐
1877       tion.
1878
1879       PCRE provides a similar feature, but of course it cannot obey arbitrary
1880       Perl code. The feature is called "callout". The caller of PCRE provides
1881       an  external function by putting its entry point in the global variable
1882       pcre_callout.  By default, this variable contains NULL, which  disables
1883       all calling out.
1884
1885       Within  a  regular  expression,  (?C) indicates the points at which the
1886       external function is to be called. If you want  to  identify  different
1887       callout  points, you can put a number less than 256 after the letter C.
1888       The default value is zero.  For example, this pattern has  two  callout
1889       points:
1890
1891         (?C1)abc(?C2)def
1892
1893       If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
1894       automatically installed before each item in the pattern. They  are  all
1895       numbered 255.
1896
1897       During matching, when PCRE reaches a callout point (and pcre_callout is
1898       set), the external function is called. It is provided with  the  number
1899       of  the callout, the position in the pattern, and, optionally, one item
1900       of data originally supplied by the caller of pcre_exec().  The  callout
1901       function  may cause matching to proceed, to backtrack, or to fail alto‐
1902       gether. A complete description of the interface to the callout function
1903       is given in the pcrecallout documentation.
1904

BACKTRACKING CONTROL

1906
1907       Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
1908       which are described in the Perl documentation as "experimental and sub‐
1909       ject  to  change or removal in a future version of Perl". It goes on to
1910       say: "Their usage in production code should be noted to avoid  problems
1911       during upgrades." The same remarks apply to the PCRE features described
1912       in this section.
1913
1914       Since these verbs are specifically related  to  backtracking,  most  of
1915       them  can  be  used  only  when  the  pattern  is  to  be matched using
1916       pcre_exec(), which uses a backtracking algorithm. With the exception of
1917       (*FAIL), which behaves like a failing negative assertion, they cause an
1918       error if encountered by pcre_dfa_exec().
1919
1920       The new verbs make use of what was previously invalid syntax: an  open‐
1921       ing parenthesis followed by an asterisk. In Perl, they are generally of
1922       the form (*VERB:ARG) but PCRE does not support the use of arguments, so
1923       its  general  form is just (*VERB). Any number of these verbs may occur
1924       in a pattern. There are two kinds:
1925
1926   Verbs that act immediately
1927
1928       The following verbs act as soon as they are encountered:
1929
1930          (*ACCEPT)
1931
1932       This verb causes the match to end successfully, skipping the  remainder
1933       of  the pattern. When inside a recursion, only the innermost pattern is
1934       ended immediately. PCRE differs  from  Perl  in  what  happens  if  the
1935       (*ACCEPT)  is inside capturing parentheses. In Perl, the data so far is
1936       captured; in PCRE no data is captured. For example:
1937
1938         A(A|B(*ACCEPT)|C)D
1939
1940       This matches "AB", "AAD", or "ACD", but when it matches "AB",  no  data
1941       is captured.
1942
1943         (*FAIL) or (*F)
1944
1945       This  verb  causes the match to fail, forcing backtracking to occur. It
1946       is equivalent to (?!) but easier to read. The Perl documentation  notes
1947       that  it  is  probably  useful only when combined with (?{}) or (??{}).
1948       Those are, of course, Perl features that are not present in  PCRE.  The
1949       nearest  equivalent is the callout feature, as for example in this pat‐
1950       tern:
1951
1952         a+(?C)(*FAIL)
1953
1954       A match with the string "aaaa" always fails, but the callout  is  taken
1955       before each backtrack happens (in this example, 10 times).
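
       The figure of 10 can be checked by counting. The sketch below assumes
       a  plain backtracking matcher with no start-of-match optimizations:
       at each starting position the greedy a+ first takes the whole run of
       "a"  characters and then gives back one at a time, and the callout is
       reached once for each length tried. The function name  count_callouts
       is hypothetical.

```python
def count_callouts(subject):
    """Count callouts for a+(?C)(*FAIL) under a simple backtracking model."""
    calls = 0
    for start in range(len(subject)):
        # Length of the run of "a" characters beginning at this position.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ is tried with lengths run, run-1, ..., 1; each one reaches (?C)
        # before (*FAIL) forces the next backtrack.
        calls += run
    return calls

print(count_callouts("aaaa"))  # 4 + 3 + 2 + 1
```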
1956
1957   Verbs that act after backtracking
1958
1959       The following verbs do nothing when they are encountered. Matching con‐
1960       tinues with what follows, but if there is no subsequent match, a  fail‐
1961       ure  is  forced.   The  verbs  differ  in  exactly what kind of failure
1962       occurs.
1963
1964         (*COMMIT)
1965
1966       This verb causes the whole match to fail outright if the  rest  of  the
1967       pattern  does  not match. Even if the pattern is unanchored, no further
1968       attempts to find a match by advancing the start point take place.  Once
1969       (*COMMIT)  has been passed, pcre_exec() is committed to finding a match
1970       at the current starting point, or not at all. For example:
1971
1972         a+(*COMMIT)b
1973
1974       This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
1975       of dynamic anchor, or "I've started, so I must finish."
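
       The two subjects can be traced with a short emulation. Python's re
       module  has  no  (*COMMIT),  so the sketch below hard-codes the pattern
       a+(*COMMIT)b: it finds the first place where a+ can match,  and  from
       there  either  "b"  follows  or  the whole match fails. The function
       name commit_match is hypothetical.

```python
import re

def commit_match(subject):
    """Emulate a+(*COMMIT)b; return the matched text or None."""
    # The first position where a+ can match is the first "a" in the subject;
    # greedy a+ takes the whole run before (*COMMIT) is passed.
    run = re.search(r"a+", subject)
    if run is None:
        return None
    # Past (*COMMIT): either "b" comes next, or the match fails outright.
    # There is no backtracking into a+ and no retry at a later start point.
    if subject.startswith("b", run.end()):
        return subject[run.start():run.end() + 1]
    return None

print(commit_match("xxaab"))                 # matched text
print(commit_match("aacaab"))                # None: committed at the first "aa"
print(re.search(r"a+b", "aacaab").group())   # without (*COMMIT) a later start works
```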
1976
1977         (*PRUNE)
1978
1979       This  verb causes the match to fail at the current position if the rest
1980       of the pattern does not match. If the pattern is unanchored, the normal
1981       "bumpalong"  advance to the next starting character then happens. Back‐
1982       tracking can occur as usual to the left of (*PRUNE), or  when  matching
1983       to  the right of (*PRUNE), but if there is no match to the right, back‐
1984       tracking cannot cross (*PRUNE).  In simple cases, the use  of  (*PRUNE)
1985       is just an alternative to an atomic group or possessive quantifier, but
1986       there are some uses of (*PRUNE) that cannot be expressed in  any  other
1987       way.
1988
1989         (*SKIP)
1990
1991       This  verb  is like (*PRUNE), except that if the pattern is unanchored,
1992       the "bumpalong" advance is not to the next character, but to the  posi‐
1993       tion  in  the  subject where (*SKIP) was encountered. (*SKIP) signifies
1994       that whatever text was matched leading up to it cannot  be  part  of  a
1995       successful match. Consider:
1996
1997         a+(*SKIP)b
1998
1999       If  the  subject  is  "aaaac...",  after  the first match attempt fails
2000       (starting at the first character in the  string),  the  starting  point
2001       skips on to start the next attempt at "c". Note that a possessive quan‐
2002       tifier does not have the same effect in this example; although it  would
2003       suppress  backtracking  during  the  first  match  attempt,  the second
2004       attempt would start at the second character instead of skipping  on  to
2005       "c".
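
       The difference in starting points can be listed with a small  simula‐
       tion.  The sketch below hard-codes a run of "a" followed by "b" and
       records each starting offset that is tried,  either  with  ordinary
       bumpalong  (as for a++b) or with the jump made by a+(*SKIP)b. It as‐
       sumes no start-of-match optimizations;  the  function  name  start_posi‐
       tions and its skip parameter are hypothetical.

```python
def start_positions(subject, skip):
    """Return (offsets tried, match span or None) for a++b or a+(*SKIP)b."""
    tried = []
    start = 0
    while start < len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end > start and subject.startswith("b", end):
            return tried, (start, end + 1)
        if skip and end > start:
            # (*SKIP) was reached at offset `end`; resume from there.
            start = end
        else:
            # Ordinary bumpalong to the next character.
            start += 1
    return tried, None

print(start_positions("aaaac", skip=False))  # every offset is tried
print(start_positions("aaaac", skip=True))   # jumps from 0 straight to "c"
```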
2006
2007         (*THEN)
2008
2009       This verb causes a skip to the next alternation if the rest of the pat‐
2010       tern does not match. That is, it cancels pending backtracking, but only
2011       within  the  current  alternation.  Its name comes from the observation
2012       that it can be used for a pattern-based if-then-else block:
2013
2014         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
2015
2016       If the COND1 pattern matches, FOO is tried (and possibly further  items
2017       after  the  end  of  the group if FOO succeeds); on failure the matcher
2018       skips to the second alternative and tries COND2,  without  backtracking
2019       into  COND1.  If  (*THEN)  is  used outside of any alternation, it acts
2020       exactly like (*PRUNE).
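
       The  if-then-else  reading can be sketched as a loop over (condition,
       body) pairs. This is a simplified emulation in Python, not a descrip‐
       tion  of  how  PCRE  is  implemented: it ignores anything that follows
       the group, and the names match_then_group and BRANCHES are  hypotheti‐
       cal.

```python
import re

def match_then_group(branches, subject, pos=0):
    """Emulate ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | ... ) at offset pos.

    branches is a list of (condition, body) pairs of compiled patterns.
    Returns the end offset of the first alternative that matches, or None.
    """
    for cond, body in branches:
        cond_match = cond.match(subject, pos)
        if cond_match is None:
            continue
        body_match = body.match(subject, cond_match.end())
        if body_match is not None:
            return body_match.end()
        # (*THEN) was passed and the body failed: do not backtrack into the
        # condition, just move on to the next alternative.
    return None

BRANCHES = [
    (re.compile(r"a+"), re.compile(r"ab")),
    (re.compile(r"a"), re.compile(r"aab!")),
]

print(match_then_group(BRANCHES, "aaab!"))
# Without (*THEN), the first alternative succeeds by backtracking into a+.
print(re.match(r"(?:a+ab|aaab!)", "aaab!").end())
```

       In  the  first  call  a+ takes "aaa", the body "ab" fails, and matching
       moves to the second alternative instead of retrying a+ with  a  shorter
       run.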
2021

AUTHOR

2027
2028       Philip Hazel
2029       University Computing Service
2030       Cambridge CB2 3QH, England.
2031

REVISION

2033
2034       Last updated: 19 April 2008
2035       Copyright (c) 1997-2008 University of Cambridge.
2036