PCREPATTERN(3)             Library Functions Manual             PCREPATTERN(3)

NAME

       PCRE - Perl-compatible regular expressions

PCRE REGULAR EXPRESSION DETAILS

       The syntax and semantics of the regular expressions that are supported
       by PCRE are described in detail below. There is a quick-reference
       syntax summary in the pcresyntax page. PCRE tries to match Perl syntax
       and semantics as closely as it can. PCRE also supports some alternative
       regular expression syntax (which does not conflict with the Perl
       syntax) in order to provide some compatibility with regular expressions
       in Python, .NET, and Oniguruma.

       Perl's regular expressions are described in its own documentation, and
       regular expressions in general are covered in a number of books, some
       of which have copious examples. Jeffrey Friedl's "Mastering Regular
       Expressions", published by O'Reilly, covers regular expressions in
       great detail. This description of PCRE's regular expressions is
       intended as reference material.

       The original operation of PCRE was on strings of one-byte characters.
       However, there is now also support for UTF-8 strings in the original
       library, an extra library that supports 16-bit and UTF-16 character
       strings, and a third library that supports 32-bit and UTF-32 character
       strings. To use these features, PCRE must be built to include
       appropriate support. When using UTF strings you must either call the
       compiling function with the PCRE_UTF8, PCRE_UTF16, or PCRE_UTF32
       option, or the pattern must start with one of these special sequences:

         (*UTF8)
         (*UTF16)
         (*UTF32)
         (*UTF)

       (*UTF) is a generic sequence that can be used with any of the
       libraries. Starting a pattern with such a sequence is equivalent to
       setting the relevant option. This feature is not Perl-compatible. How
       setting a UTF mode affects pattern matching is mentioned in several
       places below. There is also a summary of features in the pcreunicode
       page.

       Another special sequence that may appear at the start of a pattern or
       in combination with (*UTF8), (*UTF16), (*UTF32) or (*UTF) is:

         (*UCP)

       This has the same effect as setting the PCRE_UCP option: it causes
       sequences such as \d and \w to use Unicode properties to determine
       character types, instead of recognizing only characters with codes less
       than 128 via a lookup table.

       If a pattern starts with (*NO_START_OPT), it has the same effect as
       setting the PCRE_NO_START_OPTIMIZE option either at compile or matching
       time. There are also some more of these special sequences that are
       concerned with the handling of newlines; they are described below.

       The remainder of this document discusses the patterns that are
       supported by PCRE when one of its main matching functions, pcre_exec()
       (8-bit) or pcre[16|32]_exec() (16- or 32-bit), is used. PCRE also has
       alternative matching functions, pcre_dfa_exec() and
       pcre[16|32]_dfa_exec(), which match using a different algorithm that is
       not Perl-compatible. Some of the features discussed below are not
       available when DFA matching is used. The advantages and disadvantages
       of the alternative functions, and how they differ from the normal
       functions, are discussed in the pcrematching page.

EBCDIC CHARACTER CODES

       PCRE can be compiled to run in an environment that uses EBCDIC as its
       character code rather than ASCII or Unicode (typically a mainframe
       system). In the sections below, character code values are ASCII or
       Unicode; in an EBCDIC environment these characters may have different
       code values, and there are no code points greater than 255.

NEWLINE CONVENTIONS

       PCRE supports five different conventions for indicating line breaks in
       strings: a single CR (carriage return) character, a single LF
       (linefeed) character, the two-character sequence CRLF, any of the three
       preceding, or any Unicode newline sequence. The pcreapi page has
       further discussion about newlines, and shows how to set the newline
       convention in the options arguments for the compiling and matching
       functions.

       It is also possible to specify a newline convention by starting a
       pattern string with one of the following five sequences:

         (*CR)        carriage return
         (*LF)        linefeed
         (*CRLF)      carriage return, followed by linefeed
         (*ANYCRLF)   any of the three above
         (*ANY)       all Unicode newline sequences

       These override the default and the options given to the compiling
       function. For example, on a Unix system where LF is the default newline
       sequence, the pattern

         (*CR)a.b

       changes the convention to CR. That pattern matches "a\nb" because LF is
       no longer a newline. Note that these special settings, which are not
       Perl-compatible, are recognized only at the very start of a pattern,
       and that they must be in upper case. If more than one of them is
       present, the last one is used.

       The newline convention affects where the circumflex and dollar
       assertions are true. It also affects the interpretation of the dot
       metacharacter when PCRE_DOTALL is not set, and the behaviour of \N.
       However, it does not affect what the \R escape sequence matches. By
       default, this is any Unicode newline sequence, for Perl compatibility.
       However, this can be changed; see the description of \R in the section
       entitled "Newline sequences" below. A change of \R setting can be
       combined with a change of newline convention.
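
       The effect of a leading (*CR) or (*LF) on the dot metacharacter can
       be modelled outside PCRE. The following is a minimal sketch in Python,
       whose re module is not PCRE and does not understand these sequences.
       The names DOT_FOR_CONVENTION and translate are hypothetical, LF is
       assumed to be the build default, and only the two single-character
       conventions are handled. The naive replacement of "." ignores escaped
       dots and character classes.

```python
import re

# Hypothetical model: what "." may match under each single-character
# newline convention when PCRE_DOTALL is not set.
DOT_FOR_CONVENTION = {
    "CR": r"[^\r]",
    "LF": r"[^\n]",
}


def translate(pattern: str) -> str:
    """Strip leading (*CR)/(*LF) settings and rewrite each '.' accordingly."""
    convention = "LF"  # assumed default, as on a Unix system
    while True:
        m = re.match(r"\(\*(CR|LF)\)", pattern)
        if m is None:
            break
        convention = m.group(1)  # if several are present, the last one is used
        pattern = pattern[m.end():]
    # Simplification: every "." is treated as the dot metacharacter.
    return pattern.replace(".", DOT_FOR_CONVENTION[convention])


print(bool(re.fullmatch(translate("(*CR)a.b"), "a\nb")))
print(bool(re.fullmatch(translate("a.b"), "a\nb")))
```

       With the CR convention the dot accepts LF, so "a\nb" matches; with
       the assumed LF default it does not.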

CHARACTERS AND METACHARACTERS

       A regular expression is a pattern that is matched against a subject
       string from left to right. Most characters stand for themselves in a
       pattern, and match the corresponding characters in the subject. As a
       trivial example, the pattern

         The quick brown fox

       matches a portion of a subject string that is identical to itself. When
       caseless matching is specified (the PCRE_CASELESS option), letters are
       matched independently of case. In a UTF mode, PCRE always understands
       the concept of case for characters whose values are less than 128, so
       caseless matching is always possible. For characters with higher
       values, the concept of case is supported if PCRE is compiled with
       Unicode property support, but not otherwise. If you want to use
       caseless matching for characters 128 and above, you must ensure that
       PCRE is compiled with Unicode property support as well as with UTF
       support.

       The power of regular expressions comes from the ability to include
       alternatives and repetitions in the pattern. These are encoded in the
       pattern by the use of metacharacters, which do not stand for themselves
       but instead are interpreted in some special way.

       There are two different sets of metacharacters: those that are
       recognized anywhere in the pattern except within square brackets, and
       those that are recognized within square brackets. Outside square
       brackets, the metacharacters are as follows:

         \      general escape character with several uses
         ^      assert start of string (or line, in multiline mode)
         $      assert end of string (or line, in multiline mode)
         .      match any character except newline (by default)
         [      start character class definition
         |      start of alternative branch
         (      start subpattern
         )      end subpattern
         ?      extends the meaning of (
                also 0 or 1 quantifier
                also quantifier minimizer
         *      0 or more quantifier
         +      1 or more quantifier
                also "possessive quantifier"
         {      start min/max quantifier

       Part of a pattern that is in square brackets is called a "character
       class". In a character class the only metacharacters are:

         \      general escape character
         ^      negate the class, but only if the first character
         -      indicates character range
         [      POSIX character class (only if followed by POSIX
                  syntax)
         ]      terminates the character class

       The following sections describe the use of each of the metacharacters.

BACKSLASH

       The backslash character has several uses. Firstly, if it is followed by
       a character that is not a number or a letter, it takes away any special
       meaning that character may have. This use of backslash as an escape
       character applies both inside and outside character classes.

       For example, if you want to match a * character, you write \* in the
       pattern. This escaping action applies whether or not the following
       character would otherwise be interpreted as a metacharacter, so it is
       always safe to precede a non-alphanumeric with backslash to specify
       that it stands for itself. In particular, if you want to match a
       backslash, you write \\.

       In a UTF mode, only ASCII numbers and letters have any special meaning
       after a backslash. All other characters (in particular, those whose
       codepoints are greater than 127) are treated as literals.

       If a pattern is compiled with the PCRE_EXTENDED option, white space in
       the pattern (other than in a character class) and characters between a
       # outside a character class and the next newline are ignored. An
       escaping backslash can be used to include a white space or # character
       as part of the pattern.

       If you want to remove the special meaning from a sequence of
       characters, you can do so by putting them between \Q and \E. This is
       different from Perl in that $ and @ are handled as literals in \Q...\E
       sequences in PCRE, whereas in Perl, $ and @ cause variable
       interpolation. Note the following examples:

         Pattern            PCRE matches   Perl matches

         \Qabc$xyz\E        abc$xyz        abc followed by the
                                             contents of $xyz
         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

       The \Q...\E sequence is recognized both inside and outside character
       classes. An isolated \E that is not preceded by \Q is ignored. If \Q
       is not followed by \E later in the pattern, the literal interpretation
       continues to the end of the pattern (that is, \E is assumed at the
       end). If the isolated \Q is inside a character class, this causes an
       error, because the character class is not terminated.
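
       The \Q...\E rules above can be illustrated by rewriting a pattern for
       an engine that lacks them. This is a sketch only: Python's re module
       has no \Q...\E, so the hypothetical helper expand_quoting replaces
       each quoted run with re.escape() output. It covers the PCRE column of
       the table, the ignored isolated \E, and the unterminated \Q, but does
       not model character classes.

```python
import re


def expand_quoting(pattern: str) -> str:
    """Rewrite PCRE-style \\Q...\\E runs as escaped literals (hypothetical helper)."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        if pattern.startswith(r"\Q", i):
            end = pattern.find(r"\E", i + 2)
            if end == -1:
                end = n  # no \E later on: literal text runs to the end
            out.append(re.escape(pattern[i + 2:end]))
            i = end + 2
        elif pattern.startswith(r"\E", i):
            i += 2  # an isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < n:
            out.append(pattern[i:i + 2])  # keep any other escape pair intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


for pcre_pattern in (r"\Qabc$xyz\E", r"\Qabc\$xyz\E", r"\Qabc\E\$\Qxyz\E"):
    print(pcre_pattern, "->", expand_quoting(pcre_pattern))
```

       Each rewritten pattern matches the string shown in the "PCRE matches"
       column of the table.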

   Non-printing characters

       A second use of backslash provides a way of encoding non-printing
       characters in patterns in a visible manner. There is no restriction on
       the appearance of non-printing characters, apart from the binary zero
       that terminates a pattern, but when a pattern is being prepared by text
       editing, it is often easier to use one of the following escape
       sequences than the binary character it represents:

         \a        alarm, that is, the BEL character (hex 07)
         \cx       "control-x", where x is any ASCII character
         \e        escape (hex 1B)
         \f        form feed (hex 0C)
         \n        linefeed (hex 0A)
         \r        carriage return (hex 0D)
         \t        tab (hex 09)
         \ddd      character with octal code ddd, or back reference
         \xhh      character with hex code hh
         \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
         \uhhhh    character with hex code hhhh (JavaScript mode only)

       The precise effect of \cx on ASCII characters is as follows: if x is a
       lower case letter, it is converted to upper case. Then bit 6 of the
       character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
       (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
       hex 7B (; is 3B). If the data item (byte or 16-bit value) following \c
       has a value greater than 127, a compile-time error occurs. This locks
       out non-ASCII characters in all modes.

       The \c facility was designed for use with ASCII characters, but with
       the extension to Unicode it is even less useful than it once was. It
       is, however, recognized when PCRE is compiled in EBCDIC mode, where
       data items are always bytes. In this mode, all values are valid after
       \c. If the next character is a lower case letter, it is converted to
       upper case. Then the 0xc0 bits of the byte are inverted. Thus \cA
       becomes hex 01, as in ASCII (A is C1), but because the EBCDIC letters
       are disjoint, \cZ becomes hex 29 (Z is E9), and other characters also
       generate different values.
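
       The arithmetic behind \cx is a single bit inversion, which the
       following Python sketch reproduces. The function names ctrl_ascii and
       ctrl_ebcdic are hypothetical and are not part of any PCRE API. The
       EBCDIC helper assumes the caller passes a byte that has already been
       converted to upper case.

```python
def ctrl_ascii(x: str) -> int:
    """Code produced by \\cx in an ASCII environment (hypothetical helper)."""
    code = ord(x)
    if code > 127:
        # PCRE reports a compile-time error for non-ASCII data items.
        raise ValueError("\\c must be followed by an ASCII character")
    if "a" <= x <= "z":
        code = ord(x.upper())  # lower case letters are upper-cased first
    return code ^ 0x40  # invert bit 6 (hex 40)


def ctrl_ebcdic(byte: int) -> int:
    """Code produced by \\cx in EBCDIC mode for an upper-cased byte."""
    return byte ^ 0xC0  # invert the 0xc0 bits


print(hex(ctrl_ascii("A")), hex(ctrl_ascii("Z")))
print(hex(ctrl_ascii("{")), hex(ctrl_ascii(";")))
print(hex(ctrl_ebcdic(0xC1)), hex(ctrl_ebcdic(0xE9)))
```

       The printed values correspond to the cases worked through in the two
       paragraphs above.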

       By default, after \x, from zero to two hexadecimal digits are read
       (letters can be in upper or lower case). Any number of hexadecimal
       digits may appear between \x{ and }, but the character code is
       constrained as follows:

         8-bit non-UTF mode    less than 0x100
         8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
         16-bit non-UTF mode   less than 0x10000
         16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
         32-bit non-UTF mode   less than 0x80000000
         32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint

       Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the
       so-called "surrogate" codepoints), and 0xffef.

       If characters other than hexadecimal digits appear between \x{ and },
       or if there is no terminating }, this form of escape is not recognized.
       Instead, the initial \x will be interpreted as a basic hexadecimal
       escape, with no following digits, giving a character whose value is
       zero.

       If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x
       is as just described only when it is followed by two hexadecimal
       digits. Otherwise, it matches a literal "x" character. In JavaScript
       mode, support for code points greater than 255 is provided by \u, which
       must be followed by four hexadecimal digits; otherwise it matches a
       literal "u" character. Character codes specified by \u in JavaScript
       mode are constrained in the same way as those specified by \x in
       non-JavaScript mode.

       Characters whose value is less than 256 can be defined by either of the
       two syntaxes for \x (or by \u in JavaScript mode). There is no
       difference in the way they are handled. For example, \xdc is exactly
       the same as \x{dc} (or \u00dc in JavaScript mode).

       After \0 up to two further octal digits are read. If there are fewer
       than two digits, just those that are present are used. Thus the
       sequence \0\x\07 specifies two binary zeros followed by a BEL character
       (code value 7). Make sure you supply two digits after the initial zero
       if the pattern character that follows is itself an octal digit.

       The handling of a backslash followed by a digit other than 0 is
       complicated. Outside a character class, PCRE reads it and any following
       digits as a decimal number. If the number is less than 10, or if there
       have been at least that many previous capturing left parentheses in the
       expression, the entire sequence is taken as a back reference. A
       description of how this works is given later, following the discussion
       of parenthesized subpatterns.

       Inside a character class, or if the decimal number is greater than 9
       and there have not been that many capturing subpatterns, PCRE re-reads
       up to three octal digits following the backslash, and uses them to
       generate a data character. Any subsequent digits stand for themselves.
       The value of the character is constrained in the same way as characters
       specified in hexadecimal. For example:

         \040   is another way of writing an ASCII space
         \40    is the same, provided there are fewer than 40
                   previous capturing subpatterns
         \7     is always a back reference
         \11    might be a back reference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a back reference, otherwise the
                   character with octal code 113
         \377   might be a back reference, otherwise
                   the value 255 (decimal)
         \81    is either a back reference, or a binary zero
                   followed by the two characters "8" and "1"

       Note that octal values of 100 or greater must not be introduced by a
       leading zero, because no more than three octal digits are ever read.
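
       The decision between a back reference and an octal character can be
       written down as a small function. The sketch below is a model of the
       rule as described above, not PCRE's own code. The name interpret and
       its return shape are hypothetical, and the per-mode limit on the
       resulting character value is not checked.

```python
OCTAL_DIGITS = "01234567"


def interpret(digits: str, groups_so_far: int, in_class: bool = False):
    """Classify the run of decimal digits that follows a backslash.

    Returns ("backref", number, "") or ("char", value, literal_rest).
    """
    if digits[0] != "0" and not in_class:
        number = int(digits)
        if number < 10 or number <= groups_so_far:
            return ("backref", number, "")
    # Otherwise read up to three octal digits; the rest stand for themselves.
    i = 0
    while i < 3 and i < len(digits) and digits[i] in OCTAL_DIGITS:
        i += 1
    value = int(digits[:i], 8) if i else 0  # no octal digit: binary zero
    return ("char", value, digits[i:])


print(interpret("40", groups_so_far=3))
print(interpret("40", groups_so_far=40))
print(interpret("0113", groups_so_far=0))
print(interpret("81", groups_so_far=0))
```

       The second argument is the number of capturing left parentheses seen
       before the escape, which is why the same text can compile differently
       in different patterns.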

       All the sequences that define a single character value can be used both
       inside and outside character classes. In addition, inside a character
       class, \b is interpreted as the backspace character (hex 08).

       \N is not allowed in a character class. \B, \R, and \X are not special
       inside a character class. Like other unrecognized escape sequences,
       they are treated as the literal characters "B", "R", and "X" by
       default, but cause an error if the PCRE_EXTRA option is set. Outside a
       character class, these sequences have different meanings.

   Unsupported escape sequences

       In Perl, the sequences \l, \L, \u, and \U are recognized by its string
       handler and used to modify the case of following characters. By
       default, PCRE does not support these escape sequences. However, if the
       PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U" character, and
       \u can be used to define a character by code point, as described in the
       previous section.

   Absolute and relative back references

       The sequence \g followed by an unsigned or a negative number,
       optionally enclosed in braces, is an absolute or relative back
       reference. A named back reference can be coded as \g{name}. Back
       references are discussed later, following the discussion of
       parenthesized subpatterns.

   Absolute and relative subroutine calls

       For compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number enclosed either in angle brackets or single quotes, is
       an alternative syntax for referencing a subpattern as a "subroutine".
       Details are discussed later. Note that \g{...} (Perl syntax) and
       \g<...> (Oniguruma syntax) are not synonymous. The former is a back
       reference; the latter is a subroutine call.

   Generic character types

       Another use of backslash is for specifying generic character types:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \h     any horizontal white space character
         \H     any character that is not a horizontal white space character
         \s     any white space character
         \S     any character that is not a white space character
         \v     any vertical white space character
         \V     any character that is not a vertical white space character
         \w     any "word" character
         \W     any "non-word" character

       There is also the single sequence \N, which matches a non-newline
       character. This is the same as the "." metacharacter when PCRE_DOTALL
       is not set. Perl also uses \N to match characters by name; PCRE does
       not support this.

       Each pair of lower and upper case escape sequences partitions the
       complete set of characters into two disjoint sets. Any given character
       matches one, and only one, of each pair. The sequences can appear both
       inside and outside character classes. They each match one character of
       the appropriate type. If the current matching point is at the end of
       the subject string, all of them fail, because there is no character to
       match.

       For compatibility with Perl, \s does not match the VT character (code
       11). This makes it different from the POSIX "space" class. The \s
       characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
       "use locale;" is included in a Perl script, \s may match the VT
       character. In PCRE, it never does.
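
       Because other engines define \s differently, it can help to spell the
       set out explicitly when porting a pattern. The following sketch uses
       Python, whose own \s does match VT. The name PCRE_DEFAULT_SPACE is
       hypothetical and simply lists the five characters given above.

```python
import re

# The five characters that \s matches by default in this version of PCRE:
# HT (9), LF (10), FF (12), CR (13) and space (32). VT (11) is excluded.
PCRE_DEFAULT_SPACE = re.compile(r"[\t\n\f\r ]")

for ch in "\t\n\x0b\f\r ":
    print(ord(ch), bool(PCRE_DEFAULT_SPACE.fullmatch(ch)))
```

       Only code 11 (VT) is reported as not matching, whereas Python's
       built-in \s accepts it.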

       A "word" character is an underscore or any character that is a letter
       or digit. By default, the definition of letters and digits is
       controlled by PCRE's low-valued character tables, and may vary if
       locale-specific matching is taking place (see "Locale support" in the
       pcreapi page). For example, in a French locale such as "fr_FR" in
       Unix-like systems, or "french" in Windows, some character codes greater
       than 127 are used for accented letters, and these are then matched by
       \w. The use of locales with Unicode is discouraged.

       By default, in a UTF mode, characters with values greater than 127
       never match \d, \s, or \w, and always match \D, \S, and \W. These
       sequences retain their original meanings from before UTF support was
       available, mainly for efficiency reasons. However, if PCRE is compiled
       with Unicode property support, and the PCRE_UCP option is set, the
       behaviour is changed so that Unicode properties are used to determine
       character types, as follows:

         \d  any character that \p{Nd} matches (decimal digit)
         \s  any character that \p{Z} matches, plus HT, LF, FF, CR
         \w  any character that \p{L} or \p{N} matches, plus underscore

       The upper case escapes match the inverse sets of characters. Note that
       \d matches only decimal digits, whereas \w matches any Unicode digit,
       as well as any Unicode letter, and underscore. Note also that PCRE_UCP
       affects \b and \B because they are defined in terms of \w and \W.
       Matching these sequences is noticeably slower when PCRE_UCP is set.
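
       The PCRE_UCP definitions in the table above can be expressed directly
       in terms of general categories. This is a sketch using Python's
       unicodedata module, whose Unicode version may differ from the one a
       given PCRE build was compiled with. The function names ucp_d, ucp_s
       and ucp_w are hypothetical.

```python
import unicodedata


def ucp_d(ch: str) -> bool:
    """\\d under PCRE_UCP: any character that \\p{Nd} matches."""
    return unicodedata.category(ch) == "Nd"


def ucp_s(ch: str) -> bool:
    """\\s under PCRE_UCP: \\p{Z}, plus HT, LF, FF and CR."""
    return unicodedata.category(ch).startswith("Z") or ch in "\t\n\f\r"


def ucp_w(ch: str) -> bool:
    """\\w under PCRE_UCP: \\p{L} or \\p{N}, plus underscore."""
    return unicodedata.category(ch)[0] in "LN" or ch == "_"


# U+00B2 SUPERSCRIPT TWO has category No: a digit for \w, but not for \d.
print(ucp_d("\u00b2"), ucp_w("\u00b2"))
# U+0663 ARABIC-INDIC DIGIT THREE has category Nd.
print(ucp_d("\u0663"), ucp_w("\u0663"))
```

       The first line of output shows the distinction the text draws between
       \d (decimal digits only) and \w (any Unicode digit).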

       The sequences \h, \H, \v, and \V are features that were added to Perl
       at release 5.10. In contrast to the other sequences, which match only
       ASCII characters by default, these always match certain high-valued
       codepoints, whether or not PCRE_UCP is set. The horizontal space
       characters are:

         U+0009     Horizontal tab (HT)
         U+0020     Space
         U+00A0     Non-break space
         U+1680     Ogham space mark
         U+180E     Mongolian vowel separator
         U+2000     En quad
         U+2001     Em quad
         U+2002     En space
         U+2003     Em space
         U+2004     Three-per-em space
         U+2005     Four-per-em space
         U+2006     Six-per-em space
         U+2007     Figure space
         U+2008     Punctuation space
         U+2009     Thin space
         U+200A     Hair space
         U+202F     Narrow no-break space
         U+205F     Medium mathematical space
         U+3000     Ideographic space

       The vertical space characters are:

         U+000A     Linefeed (LF)
         U+000B     Vertical tab (VT)
         U+000C     Form feed (FF)
         U+000D     Carriage return (CR)
         U+0085     Next line (NEL)
         U+2028     Line separator
         U+2029     Paragraph separator

       In 8-bit, non-UTF-8 mode, only the characters with codepoints less than
       256 are relevant.
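
       For engines without \h and \v, equivalent character classes can be
       built from the two lists above. The sketch below does this in Python,
       whose re module has no \h and gives \v a different meaning. The names
       HORIZONTAL, VERTICAL and char_class are hypothetical.

```python
import re

# Code points copied from the two lists above.
HORIZONTAL = [0x09, 0x20, 0xA0, 0x1680, 0x180E,
              *range(0x2000, 0x200B),  # U+2000 to U+200A inclusive
              0x202F, 0x205F, 0x3000]
VERTICAL = [0x0A, 0x0B, 0x0C, 0x0D, 0x85, 0x2028, 0x2029]


def char_class(code_points, negate=False):
    """Build a bracketed class; negate=True gives the \\H or \\V counterpart."""
    body = "".join(f"\\u{cp:04x}" for cp in code_points)
    return "[" + ("^" if negate else "") + body + "]"


H = re.compile(char_class(HORIZONTAL))
NOT_H = re.compile(char_class(HORIZONTAL, negate=True))
V = re.compile(char_class(VERTICAL))

print(bool(H.fullmatch("\u00a0")), bool(H.fullmatch("\n")))
print(bool(V.fullmatch("\x85")), bool(NOT_H.fullmatch("a")))
```

       Each negated class is the exact complement of its counterpart, which
       mirrors the way each lower and upper case pair partitions the
       character set.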

   Newline sequences

       Outside a character class, by default, the escape sequence \R matches
       any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
       to the following:

         (?>\r\n|\n|\x0b|\f|\r|\x85)

       This is an example of an "atomic group", details of which are given
       below. This particular group matches either the two-character sequence
       CR followed by LF, or one of the single characters LF (linefeed,
       U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR
       (carriage return, U+000D), or NEL (next line, U+0085). The
       two-character sequence is treated as a single unit that cannot be
       split.

       In other modes, two additional characters whose codepoints are greater
       than 255 are added: LS (line separator, U+2028) and PS (paragraph
       separator, U+2029). Unicode character property support is not needed
       for these characters to be recognized.

       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
       the complete set of Unicode line endings) by setting the option
       PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
       (BSR is an abbreviation for "backslash R".) This can be made the
       default when PCRE is built; if this is the case, the other behaviour
       can be requested via the PCRE_BSR_UNICODE option. It is also possible
       to specify these settings by starting a pattern string with one of the
       following sequences:

         (*BSR_ANYCRLF)   CR, LF, or CRLF only
         (*BSR_UNICODE)   any Unicode newline sequence

       These override the default and the options given to the compiling
       function, but they can themselves be overridden by options given to a
       matching function. Note that these special settings, which are not
       Perl-compatible, are recognized only at the very start of a pattern,
       and that they must be in upper case. If more than one of them is
       present, the last one is used. They can be combined with a change of
       newline convention; for example, a pattern can start with:

         (*ANY)(*BSR_ANYCRLF)

       They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF)
       or (*UCP) special sequences. Inside a character class, \R is treated as
       an unrecognized escape sequence, and so matches the letter "R" by
       default, but causes an error if PCRE_EXTRA is set.
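
       The alternatives that \R stands for can be approximated in an engine
       that lacks it. The sketch below uses Python, with a plain
       non-capturing group instead of an atomic group, so the "cannot be
       split" guarantee for CRLF is not reproduced when the engine
       backtracks. The names BSR_UNICODE_8BIT, BSR_UNICODE_WIDE, BSR_ANYCRLF
       and split_lines are hypothetical.

```python
import re

# Non-atomic approximations of \R under the settings described above.
BSR_UNICODE_8BIT = r"(?:\r\n|\n|\x0b|\f|\r|\x85)"
BSR_UNICODE_WIDE = r"(?:\r\n|\n|\x0b|\f|\r|\x85|\u2028|\u2029)"
BSR_ANYCRLF = r"(?:\r\n|\n|\r)"


def split_lines(text: str, newline: str = BSR_UNICODE_WIDE) -> list:
    """Split text wherever the chosen \\R approximation matches."""
    return re.split(newline, text)


print(re.findall(BSR_UNICODE_8BIT, "a\r\nb\nc\x85d"))
print(split_lines("a\r\nb\u2028c"))
print(split_lines("a\x85b", BSR_ANYCRLF))
```

       Listing \r\n first makes the two-character sequence win over a lone
       CR at the same position, and with the ANYCRLF variant NEL is no
       longer treated as a line ending.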
513
514   Unicode character properties
515
516       When PCRE is built with Unicode character property support, three addi‐
517       tional  escape sequences that match characters with specific properties
518       are available.  When in 8-bit non-UTF-8 mode, these  sequences  are  of
519       course  limited  to  testing  characters whose codepoints are less than
520       256, but they do work in this mode.  The extra escape sequences are:
521
522         \p{xx}   a character with the xx property
523         \P{xx}   a character without the xx property
524         \X       a Unicode extended grapheme cluster
525
526       The property names represented by xx above are limited to  the  Unicode
527       script names, the general category properties, "Any", which matches any
528       character  (including  newline),  and  some  special  PCRE   properties
529       (described  in the next section).  Other Perl properties such as "InMu‐
530       sicalSymbols" are not currently supported by PCRE.  Note  that  \P{Any}
531       does not match any characters, so always causes a match failure.
532
533       Sets of Unicode characters are defined as belonging to certain scripts.
534       A character from one of these sets can be matched using a script  name.
535       For example:
536
537         \p{Greek}
538         \P{Han}
539
540       Those  that are not part of an identified script are lumped together as
541       "Common". The current list of scripts is:
542
543       Arabic, Armenian, Avestan, Balinese, Bamum, Batak,  Bengali,  Bopomofo,
544       Brahmi,  Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Chakma,
545       Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic,  Deseret,
546       Devanagari,   Egyptian_Hieroglyphs,   Ethiopic,  Georgian,  Glagolitic,
547       Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,  Hira‐
548       gana,   Imperial_Aramaic,  Inherited,  Inscriptional_Pahlavi,  Inscrip‐
549       tional_Parthian,  Javanese,  Kaithi,   Kannada,   Katakana,   Kayah_Li,
550       Kharoshthi,  Khmer,  Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian,
551       Lydian,    Malayalam,    Mandaic,    Meetei_Mayek,    Meroitic_Cursive,
552       Meroitic_Hieroglyphs,   Miao,  Mongolian,  Myanmar,  New_Tai_Lue,  Nko,
553       Ogham,   Old_Italic,   Old_Persian,   Old_South_Arabian,    Old_Turkic,
554       Ol_Chiki,  Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samari‐
555       tan, Saurashtra, Sharada, Shavian,  Sinhala,  Sora_Sompeng,  Sundanese,
556       Syloti_Nagri,  Syriac,  Tagalog,  Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet,
557       Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,  Ugaritic,  Vai,
558       Yi.
559
560       Each character has exactly one Unicode general category property, spec‐
561       ified by a two-letter abbreviation. For compatibility with Perl,  nega‐
562       tion  can  be  specified  by including a circumflex between the opening
563       brace and the property name.  For  example,  \p{^Lu}  is  the  same  as
564       \P{Lu}.
565
566       If only one letter is specified with \p or \P, it includes all the gen‐
567       eral category properties that start with that letter. In this case,  in
568       the  absence of negation, the curly brackets in the escape sequence are
569       optional; these two examples have the same effect:
570
571         \p{L}
572         \pL
573
574       The following general category property codes are supported:
575
576         C     Other
577         Cc    Control
578         Cf    Format
579         Cn    Unassigned
580         Co    Private use
581         Cs    Surrogate
582
583         L     Letter
584         Ll    Lower case letter
585         Lm    Modifier letter
586         Lo    Other letter
587         Lt    Title case letter
588         Lu    Upper case letter
589
590         M     Mark
591         Mc    Spacing mark
592         Me    Enclosing mark
593         Mn    Non-spacing mark
594
595         N     Number
596         Nd    Decimal number
597         Nl    Letter number
598         No    Other number
599
600         P     Punctuation
601         Pc    Connector punctuation
602         Pd    Dash punctuation
603         Pe    Close punctuation
604         Pf    Final punctuation
605         Pi    Initial punctuation
606         Po    Other punctuation
607         Ps    Open punctuation
608
609         S     Symbol
610         Sc    Currency symbol
611         Sk    Modifier symbol
612         Sm    Mathematical symbol
613         So    Other symbol
614
615         Z     Separator
616         Zl    Line separator
617         Zp    Paragraph separator
618         Zs    Space separator
619
620       The special property L& is also supported: it matches a character  that
621       has  the  Lu,  Ll, or Lt property, in other words, a letter that is not
622       classified as a modifier or "other".
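
       As a rough illustration of how these codes group together, the following
       Python sketch models \p with the standard unicodedata module. The names
       has_property and matches_p are hypothetical, and Python's Unicode tables
       may be a different version from the ones PCRE was built with.

```python
import unicodedata

def has_property(ch, prop):
    # unicodedata.category returns a two-letter code such as "Lu" or "Nd".
    category = unicodedata.category(ch)
    if prop == "L&":
        # L& covers cased letters only: Lu, Ll, or Lt.
        return category in ("Lu", "Ll", "Lt")
    # A one-letter code such as "L" covers every category starting with it;
    # a two-letter code has to match exactly.
    return category.startswith(prop)

def matches_p(ch, spec):
    # A leading circumflex negates, so "^Lu" behaves like \P{Lu}.
    if spec.startswith("^"):
        return not has_property(ch, spec[1:])
    return has_property(ch, spec)
```

       For example, matches_p("a", "L") and matches_p("a", "^Lu") are both
       true, because "a" has the Ll property.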
623
624       The Cs (Surrogate) property applies only to  characters  in  the  range
625       U+D800  to U+DFFF. Such characters are not valid in Unicode strings and
626       so cannot be tested by PCRE, unless  UTF  validity  checking  has  been
627       turned    off    (see    the    discussion    of    PCRE_NO_UTF8_CHECK,
628       PCRE_NO_UTF16_CHECK and PCRE_NO_UTF32_CHECK in the pcreapi page).  Perl
629       does not support the Cs property.
630
631       The  long  synonyms  for  property  names  that  Perl supports (such as
632       \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix
633       any of these properties with "Is".
634
635       No character that is in the Unicode table has the Cn (unassigned) prop‐
636       erty.  Instead, this property is assumed for any code point that is not
637       in the Unicode table.
638
639       Specifying  caseless  matching  does not affect these escape sequences.
640       For example, \p{Lu} always matches only upper case letters.
641
642       Matching characters by Unicode property is not fast, because  PCRE  has
643       to  do  a  multistage table lookup in order to find a character's prop‐
644       erty. That is why the traditional escape sequences such as \d and \w do
645       not use Unicode properties in PCRE by default, though you can make them
646       do so by setting the PCRE_UCP option or by starting  the  pattern  with
647       (*UCP).
648
649   Extended grapheme clusters
650
651       The  \X  escape  matches  any number of Unicode characters that form an
652       "extended grapheme cluster", and treats the sequence as an atomic group
653       (see  below).   Up  to and including release 8.31, PCRE matched an ear‐
654       lier, simpler definition that was equivalent to
655
656         (?>\PM\pM*)
657
658       That is, it matched a character without the "mark"  property,  followed
659       by  zero  or  more characters with the "mark" property. Characters with
660       the "mark" property are typically non-spacing accents that  affect  the
661       preceding character.
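
       The earlier definition can be modelled without a regular expression
       engine. This Python sketch (legacy_x_end is a hypothetical name) uses
       the unicodedata module to test for the "mark" property. It illustrates
       only the rule used up to release 8.31, not the extended grapheme
       cluster rules.

```python
import unicodedata

def is_mark(ch):
    # Mn, Mc and Me all belong to the general category group M.
    return unicodedata.category(ch).startswith("M")

def legacy_x_end(subject, pos):
    # Model (?>\PM\pM*): return the end offset of the match at pos,
    # or None when there is no match at that position.
    if pos >= len(subject) or is_mark(subject[pos]):
        return None              # \PM needs one character that is not a mark
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1                 # \pM* takes every following mark
    return end
```

       With the subject "e" followed by U+0301 (combining acute accent) and
       then "a", the match at offset 0 ends at offset 2, so the accent stays
       attached to its base character.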
662
663       This  simple definition was extended in Unicode to include more compli‐
664       cated kinds of composite character by giving each character a  grapheme
665       breaking  property,  and  creating  rules  that use these properties to
666       define the boundaries of extended grapheme  clusters.  In  releases  of
667       PCRE later than 8.31, \X matches one of these clusters.
668
669       \X  always  matches  at least one character. Then it decides whether to
670       add additional characters according to the following rules for ending a
671       cluster:
672
673       1. End at the end of the subject string.
674
675       2.  Do not end between CR and LF; otherwise end after any control char‐
676       acter.
677
678       3. Do not break Hangul (a Korean  script)  syllable  sequences.  Hangul
679       characters  are of five types: L, V, T, LV, and LVT. An L character may
680       be followed by an L, V, LV, or LVT character; an LV or V character  may
681       be followed by a V or T character; an LVT or T character may be followed
682       only by a T character.
683
684       4. Do not end before extending characters or spacing marks.  Characters
685       with  the  "mark"  property  always have the "extend" grapheme breaking
686       property.
687
688       5. Do not end after prepend characters.
689
690       6. Otherwise, end the cluster.
691
692   PCRE's additional properties
693
694       As well as the standard Unicode properties described above,  PCRE  sup‐
695       ports  four  more  that  make it possible to convert traditional escape
696       sequences such as \w and \s to use Unicode properties. PCRE uses  these
697       non-standard, non-Perl properties internally when PCRE_UCP is set. How‐
698       ever, they may also be used explicitly. These properties are:
699
700         Xan   Any alphanumeric character
701         Xps   Any POSIX space character
702         Xsp   Any Perl space character
703         Xwd   Any Perl "word" character
704
705       Xan matches characters that have either the L (letter) or the  N  (num‐
706       ber)  property. Xps matches the characters tab, linefeed, vertical tab,
707       form feed, or carriage return, and any other character that has  the  Z
708       (separator) property.  Xsp is the same as Xps, except that vertical tab
709       is excluded. Xwd matches the same characters as Xan, plus underscore.
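
       The four definitions can be written out as predicates. This is a
       Python sketch built on the standard unicodedata module; the function
       names are hypothetical and the code is not taken from PCRE.

```python
import unicodedata

def _group(ch):
    # First letter of the general category, e.g. "L" for "Lu".
    return unicodedata.category(ch)[0]

def is_xan(ch):
    # Letter or number.
    return _group(ch) in "LN"

def is_xps(ch):
    # Tab, linefeed, vertical tab, form feed, carriage return,
    # or any character with the Z (separator) property.
    return ch in "\t\n\x0b\x0c\r" or _group(ch) == "Z"

def is_xsp(ch):
    # Same as Xps, except that vertical tab is excluded.
    return ch != "\x0b" and is_xps(ch)

def is_xwd(ch):
    # Same as Xan, plus underscore.
    return ch == "_" or is_xan(ch)
```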
711
712   Resetting the match start
713
714       The  escape sequence \K causes any previously matched characters not to
715       be included in the final matched sequence. For example, the pattern:
716
717         foo\Kbar
718
719       matches "foobar", but reports that it has matched "bar".  This  feature
720       is  similar  to  a lookbehind assertion (described below).  However, in
721       this case, the part of the subject before the real match does not  have
722       to  be of fixed length, as lookbehind assertions do. The use of \K does
723       not interfere with the setting of captured  substrings.   For  example,
724       when the pattern
725
726         (foo)\Kbar
727
728       matches "foobar", the first substring is still set to "foo".
729
730       Perl  documents  that  the  use  of  \K  within assertions is "not well
731       defined". In PCRE, \K is acted upon  when  it  occurs  inside  positive
732       assertions, but is ignored in negative assertions.
733
734   Simple assertions
735
736       The  final use of backslash is for certain simple assertions. An asser‐
737       tion specifies a condition that has to be met at a particular point  in
738       a  match, without consuming any characters from the subject string. The
739       use of subpatterns for more complicated assertions is described  below.
740       The backslashed assertions are:
741
742         \b     matches at a word boundary
743         \B     matches when not at a word boundary
744         \A     matches at the start of the subject
745         \Z     matches at the end of the subject
746                 also matches before a newline at the end of the subject
747         \z     matches only at the end of the subject
748         \G     matches at the first matching position in the subject
749
750       Inside  a  character  class, \b has a different meaning; it matches the
751       backspace character. If any other of  these  assertions  appears  in  a
752       character  class, by default it matches the corresponding literal char‐
753       acter  (for  example,  \B  matches  the  letter  B).  However,  if  the
754       PCRE_EXTRA  option is set, an "invalid escape sequence" error is gener‐
755       ated instead.
756
757       A word boundary is a position in the subject string where  the  current
758       character  and  the previous character do not both match \w or \W (i.e.
759       one matches \w and the other matches \W), or the start or  end  of  the
760       string  if  the  first or last character matches \w, respectively. In a
761       UTF mode, the meanings of \w and \W  can  be  changed  by  setting  the
762       PCRE_UCP  option. When this is done, it also affects \b and \B. Neither
763       PCRE nor Perl has a separate "start of word" or "end of  word"  metase‐
764       quence.  However,  whatever follows \b normally determines which it is.
765       For example, the fragment \ba matches "a" at the start of a word.
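
       The \ba example can be tried in Python's re module, which gives \b the
       same meaning. This is only an illustration in another engine; the
       helper name word_start_offsets is hypothetical, and re.ASCII is used
       so that \w is limited to ASCII characters.

```python
import re

def word_start_offsets(subject, letter):
    # \b followed by the letter matches that letter only at the start of a word.
    pattern = re.compile(r"\b" + re.escape(letter), re.ASCII)
    return [m.start() for m in pattern.finditer(subject)]
```

       In the subject "a banana and ant" this reports only the "a" characters
       that begin a word, and skips the ones inside "banana".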
766
767       The \A, \Z, and \z assertions differ from  the  traditional  circumflex
768       and dollar (described in the next section) in that they only ever match
769       at the very start and end of the subject string, whatever  options  are
770       set.  Thus,  they are independent of multiline mode. These three asser‐
771       tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
772       affect  only the behaviour of the circumflex and dollar metacharacters.
773       However, if the startoffset argument of pcre_exec() is non-zero,  indi‐
774       cating that matching is to start at a point other than the beginning of
775       the subject, \A can never match. The difference between \Z  and  \z  is
776       that \Z matches before a newline at the end of the string as well as at
777       the very end, whereas \z matches only at the end.
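
       The difference between \Z and \z can be stated as two small
       predicates. Python's re module gives \Z a different meaning, so this
       sketch models the rule directly instead of using a pattern. It assumes
       that newline is a single LF, and the function names are hypothetical.

```python
def matches_small_z(subject, pos):
    # \z is true only at the very end of the subject.
    return pos == len(subject)

def matches_big_z(subject, pos):
    # \Z is true at the very end, and also just before a newline
    # that is the last character of the subject.
    before_final_newline = pos == len(subject) - 1 and subject.endswith("\n")
    return matches_small_z(subject, pos) or before_final_newline
```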
778
779       The \G assertion is true only when the current matching position is  at
780       the  start point of the match, as specified by the startoffset argument
781       of pcre_exec(). It differs from \A when the  value  of  startoffset  is
782       non-zero.  By calling pcre_exec() multiple times with appropriate argu‐
783       ments, you can mimic Perl's /g option, and it is in this kind of imple‐
784       mentation where \G can be useful.
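
       The repeated-call technique can be sketched in Python. This is an
       analogy and not the PCRE API: a compiled pattern's match(subject, pos)
       method succeeds only if the match begins exactly at pos, which plays
       the role of a leading \G combined with a non-zero startoffset. The
       function name scan_numbers is hypothetical.

```python
import re

def scan_numbers(subject):
    # Each token must begin exactly where the previous match ended.
    token = re.compile(r"\s*(\d+)")
    found = []
    offset = 0
    while True:
        m = token.match(subject, offset)   # anchored at offset, like \G
        if m is None:
            break                          # stop at the first gap
        found.append(m.group(1))
        offset = m.end()                   # the next call starts here
    return found
```

       Because every match is tied to the current offset, scanning stops at
       the first piece of text that is not a token instead of skipping it.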
785
786       Note,  however,  that  PCRE's interpretation of \G, as the start of the
787       current match, is subtly different from Perl's, which defines it as the
788       end  of  the  previous  match. In Perl, these can be different when the
789       previously matched string was empty. Because PCRE does just  one  match
790       at a time, it cannot reproduce this behaviour.
791
792       If  all  the alternatives of a pattern begin with \G, the expression is
793       anchored to the starting match position, and the "anchored" flag is set
794       in the compiled regular expression.
795

CIRCUMFLEX AND DOLLAR

797
798       The  circumflex  and  dollar  metacharacters are zero-width assertions.
799       That is, they test for a particular condition being true  without  con‐
800       suming any characters from the subject string.
801
802       Outside a character class, in the default matching mode, the circumflex
803       character is an assertion that is true only  if  the  current  matching
804       point  is  at the start of the subject string. If the startoffset argu‐
805       ment of pcre_exec() is non-zero, circumflex  can  never  match  if  the
806       PCRE_MULTILINE  option  is  unset. Inside a character class, circumflex
807       has an entirely different meaning (see below).
808
809       Circumflex need not be the first character of the pattern if  a  number
810       of  alternatives are involved, but it should be the first thing in each
811       alternative in which it appears if the pattern is ever  to  match  that
812       branch.  If all possible alternatives start with a circumflex, that is,
813       if the pattern is constrained to match only at the start  of  the  sub‐
814       ject,  it  is  said  to be an "anchored" pattern. (There are also other
815       constructs that can cause a pattern to be anchored.)
816
817       The dollar character is an assertion that is true only if  the  current
818       matching  point  is  at  the  end of the subject string, or immediately
819       before a newline at the end of the string (by default). Note,  however,
820       that  it  does  not  actually match the newline. Dollar need not be the
821       last character of the pattern if a number of alternatives are involved,
822       but  it should be the last item in any branch in which it appears. Dol‐
823       lar has no special meaning in a character class.
824
825       The meaning of dollar can be changed so that it  matches  only  at  the
826       very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at
827       compile time. This does not affect the \Z assertion.
828
829       The meanings of the circumflex and dollar characters are changed if the
830       PCRE_MULTILINE  option  is  set.  When  this  is the case, a circumflex
831       matches immediately after internal newlines as well as at the start  of
832       the  subject  string.  It  does not match after a newline that ends the
833       string. A dollar matches before any newlines in the string, as well  as
834       at  the very end, when PCRE_MULTILINE is set. When newline is specified
835       as the two-character sequence CRLF, isolated CR and  LF  characters  do
836       not indicate newlines.
837
838       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
839       (where \n represents a newline) in multiline mode, but  not  otherwise.
840       Consequently,  patterns  that  are anchored in single line mode because
841       all branches start with ^ are not anchored in  multiline  mode,  and  a
842       match  for  circumflex  is  possible  when  the startoffset argument of
843       pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is  ignored  if
844       PCRE_MULTILINE is set.
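
       The /^abc$/ example behaves the same way in Python's re module, which
       is used here purely as an illustration; re.MULTILINE stands in for
       PCRE_MULTILINE, and the variable names are hypothetical.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the start of the subject.
single_line = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# By default, $ also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")
```

       Here single_line is None, whereas multi_line is a match that starts
       after the newline.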
845
846       Note  that  the sequences \A, \Z, and \z can be used to match the start
847       and end of the subject in both modes, and if all branches of a  pattern
848       start  with  \A it is always anchored, whether or not PCRE_MULTILINE is
849       set.
850

FULL STOP (PERIOD, DOT) AND \N

852
853       Outside a character class, a dot in the pattern matches any one charac‐
854       ter  in  the subject string except (by default) a character that signi‐
855       fies the end of a line.
856
857       When a line ending is defined as a single character, dot never  matches
858       that  character; when the two-character sequence CRLF is used, dot does
859       not match CR if it is immediately followed  by  LF,  but  otherwise  it
860       matches  all characters (including isolated CRs and LFs). When any Uni‐
861       code line endings are being recognized, dot does not match CR or LF  or
862       any of the other line ending characters.
863
864       The  behaviour  of  dot  with regard to newlines can be changed. If the
865       PCRE_DOTALL option is set, a dot matches  any  one  character,  without
866       exception. If the two-character sequence CRLF is present in the subject
867       string, it takes two dots to match it.
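
       The effect can be seen in Python's re module, used here only as an
       illustration. Python treats LF as the only line ending for dot, so
       this corresponds to PCRE with LF newlines; re.DOTALL stands in for
       PCRE_DOTALL, and the variable names are hypothetical.

```python
import re

# By default a dot does not match the newline between "a" and "c".
default_dot = re.search(r"a.c", "a\nc")

# With DOTALL, the same pattern matches.
dotall_dot = re.search(r"a.c", "a\nc", re.DOTALL)

# CRLF is two characters, so one dot cannot cover it but two dots can.
crlf_one_dot = re.fullmatch(r"a.b", "a\r\nb", re.DOTALL)
crlf_two_dots = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)
```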
868
869       The handling of dot is entirely independent of the handling of  circum‐
870       flex  and  dollar,  the  only relationship being that they both involve
871       newlines. Dot has no special meaning in a character class.
872
873       The escape sequence \N behaves like  a  dot,  except  that  it  is  not
874       affected  by  the  PCRE_DOTALL  option.  In other words, it matches any
875       character except one that signifies the end of a line. Perl  also  uses
876       \N to match characters by name; PCRE does not support this.
877

MATCHING A SINGLE DATA UNIT

879
880       Outside  a character class, the escape sequence \C matches any one data
881       unit, whether or not a UTF mode is set. In the 8-bit library, one  data
882       unit  is  one  byte;  in the 16-bit library it is a 16-bit unit; in the
883       32-bit library it is a 32-bit unit. Unlike a  dot,  \C  always  matches
884       line-ending  characters.  The  feature  is provided in Perl in order to
885       match individual bytes in UTF-8 mode, but it is unclear how it can use‐
886       fully  be  used.  Because  \C breaks up characters into individual data
887       units, matching one unit with \C in a UTF mode means that the  rest  of
888       the string may start with a malformed UTF character. This has undefined
889       results, because PCRE assumes that it is dealing with valid UTF strings
890       (and  by  default  it checks this at the start of processing unless the
891       PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or  PCRE_NO_UTF32_CHECK  option
892       is used).
893
894       PCRE  does  not  allow \C to appear in lookbehind assertions (described
895       below) in a UTF mode, because this would make it impossible  to  calcu‐
896       late the length of the lookbehind.
897
898       In general, the \C escape sequence is best avoided. However, one way of
899       using it that avoids the problem of malformed UTF characters is to  use
900       a  lookahead to check the length of the next character, as in this pat‐
901       tern, which could be used with a UTF-8 string (ignore white  space  and
902       line breaks):
903
904         (?| (?=[\x00-\x7f])(\C) |
905             (?=[\x80-\x{7ff}])(\C)(\C) |
906             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
907             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
908
909       A  group  that starts with (?| resets the capturing parentheses numbers
910       in each alternative (see "Duplicate  Subpattern  Numbers"  below).  The
911       assertions  at  the start of each branch check the next UTF-8 character
912       for values whose encoding uses 1, 2, 3, or 4 bytes,  respectively.  The
913       character's  individual bytes are then captured by the appropriate num‐
914       ber of groups.
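
       The four ranges tested by the lookaheads correspond to the number of
       bytes in the UTF-8 encoding of a character. The following Python
       sketch (with hypothetical function names) restates that mapping and
       shows the data units that the \C groups would capture.

```python
def utf8_unit_count(code_point):
    # Same ranges as the four lookahead assertions in the pattern.
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x1FFFFF:
        return 4
    raise ValueError("code point out of range")

def data_units(ch):
    # The individual bytes of one character in UTF-8.
    return list(ch.encode("utf-8"))
```

       For the euro sign, U+20AC, the count is 3 and data_units returns the
       bytes 0xE2, 0x82, and 0xAC.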
915

SQUARE BRACKETS AND CHARACTER CLASSES

917
918       An opening square bracket introduces a character class, terminated by a
919       closing square bracket. A closing square bracket on its own is not spe‐
920       cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,
921       a lone closing square bracket causes a compile-time error. If a closing
922       square bracket is required as a member of the class, it should  be  the
923       first  data  character  in  the  class (after an initial circumflex, if
924       present) or escaped with a backslash.
925
926       A character class matches a single character in the subject. In  a  UTF
927       mode,  the  character  may  be  more than one data unit long. A matched
928       character must be in the set of characters defined by the class, unless
929       the  first  character in the class definition is a circumflex, in which
930       case the subject character must not be in the set defined by the class.
931       If  a  circumflex is actually required as a member of the class, ensure
932       it is not the first character, or escape it with a backslash.
933
934       For example, the character class [aeiou] matches any lower case  vowel,
935       while  [^aeiou]  matches  any character that is not a lower case vowel.
936       Note that a circumflex is just a convenient notation for specifying the
937       characters  that  are in the class by enumerating those that are not. A
938       class that starts with a circumflex is not an assertion; it still  con‐
939       sumes  a  character  from the subject string, and therefore it fails if
940       the current pointer is at the end of the string.
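
       These classes behave the same way in Python's re module, which is used
       here only as an illustration (the variable names are hypothetical).

```python
import re

vowel = re.compile(r"[aeiou]")
non_vowel = re.compile(r"[^aeiou]")

# A negated class still has to consume one character, so it cannot
# match at the end of the subject.
at_end = re.search(r"a[^b]", "a")
```

       Here at_end is None: after "a" has been matched there is no character
       left for [^b] to consume.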
941
942       In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255
943       (0xffff)  can be included in a class as a literal string of data units,
944       or by using the \x{ escaping mechanism.
945
946       When caseless matching is set, any letters in a  class  represent  both
947       their  upper  case  and lower case versions, so for example, a caseless
948       [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not
949       match  "A", whereas a caseful version would. In a UTF mode, PCRE always
950       understands the concept of case for characters whose  values  are  less
951       than  128, so caseless matching is always possible. For characters with
952       higher values, the concept of case is supported  if  PCRE  is  compiled
953       with  Unicode  property support, but not otherwise.  If you want to use
954       caseless matching in a UTF mode for characters 128 and above, you  must
955       ensure  that  PCRE is compiled with Unicode property support as well as
956       with UTF support.
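
       For ASCII letters the same behaviour can be observed in Python's re
       module, shown here only as an illustration; re.IGNORECASE stands in
       for caseless matching, and the variable names are hypothetical.

```python
import re

caseless_vowel = re.compile(r"[aeiou]", re.IGNORECASE)
caseless_non_vowel = re.compile(r"[^aeiou]", re.IGNORECASE)
caseful_non_vowel = re.compile(r"[^aeiou]")
```

       The caseless [aeiou] matches "A", the caseless [^aeiou] does not, and
       the caseful [^aeiou] does.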
957
958       Characters that might indicate line breaks are  never  treated  in  any
959       special  way  when  matching  character  classes,  whatever line-ending
960       sequence is in  use,  and  whatever  setting  of  the  PCRE_DOTALL  and
961       PCRE_MULTILINE options is used. A class such as [^a] always matches one
962       of these characters.
963
964       The minus (hyphen) character can be used to specify a range of  charac‐
965       ters  in  a  character  class.  For  example,  [d-m] matches any letter
966       between d and m, inclusive. If a  minus  character  is  required  in  a
967       class,  it  must  be  escaped  with a backslash or appear in a position
968       where it cannot be interpreted as indicating a range, typically as  the
969       first or last character in the class.
970
971       It is not possible to have the literal character "]" as the end charac‐
972       ter of a range. A pattern such as [W-]46] is interpreted as a class  of
973       two  characters ("W" and "-") followed by a literal string "46]", so it
974       would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
975       backslash  it is interpreted as the end of range, so [W-\]46] is inter‐
976       preted as a class containing a range followed by two other  characters.
977       The  octal or hexadecimal representation of "]" can also be used to end
978       a range.
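
       Python's re module parses these two patterns in the same way, so it
       can be used to check the interpretation. It is a different engine and
       the variable names are hypothetical.

```python
import re

# A class of two characters ("W" and "-") followed by the literal "46]".
literal_tail = re.compile(r"[W-]46]")

# One class: the range from "W" to "]" plus the characters "4" and "6".
range_to_bracket = re.compile(r"[W-\]46]")
```

       The range in the second pattern covers W, X, Y, Z, "[", backslash, and
       "]", because those are the characters whose values lie between "W" and
       "]".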
979
980       Ranges operate in the collating sequence of character values. They  can
981       also   be  used  for  characters  specified  numerically,  for  example
982       [\000-\037]. Ranges can include any characters that are valid  for  the
983       current mode.
984
985       If a range that includes letters is used when caseless matching is set,
986       it matches the letters in either case. For example, [W-c] is equivalent
987       to  [][\\^_`wxyzabc],  matched  caselessly,  and  in a non-UTF mode, if
988       character tables for a French locale are in  use,  [\xc8-\xcb]  matches
989       accented  E  characters  in both cases. In UTF modes, PCRE supports the
990       concept of case for characters with values greater than 127  only  when
991       it is compiled with Unicode property support.
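
       The [W-c] equivalence can be checked over the ASCII range with
       Python's re module. This is an illustration in another engine, with
       hypothetical names; re.ASCII keeps case folding to ASCII letters.

```python
import re

FLAGS = re.IGNORECASE | re.ASCII

as_range = re.compile(r"[W-c]", FLAGS)
spelled_out = re.compile(r"[][\\^_`wxyzabc]", FLAGS)

def same_on_ascii():
    # Compare the two classes on every character from 0 to 127.
    return all(
        bool(as_range.fullmatch(chr(c))) == bool(spelled_out.fullmatch(chr(c)))
        for c in range(128)
    )
```

       The range runs from "W" to "c" in character value order, so it takes
       in the six non-letters between "Z" and "a" as well as the letters.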
992
993       The  character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
994       \w, and \W may appear in a character class, and add the characters that
995       they  match to the class. For example, [\dABCDEF] matches any hexadeci‐
996       mal digit. In UTF modes, the PCRE_UCP option affects  the  meanings  of
997       \d,  \s,  \w  and  their upper case partners, just as it does when they
998       appear outside a character class, as described in the section  entitled
999       "Generic character types" above. The escape sequence \b has a different
1000       meaning inside a character class; it matches the  backspace  character.
1001       The  sequences  \B,  \N,  \R, and \X are not special inside a character
1002       class. Like any other unrecognized escape sequences, they  are  treated
1003       as  the literal characters "B", "N", "R", and "X" by default, but cause
1004       an error if the PCRE_EXTRA option is set.
1005
1006       A circumflex can conveniently be used with  the  upper  case  character
1007       types  to specify a more restricted set of characters than the matching
1008       lower case type.  For example, the class [^\W_] matches any  letter  or
1009       digit, but not underscore, whereas [\w] includes underscore. A positive
1010       character class should be read as "something OR something OR ..." and a
1011       negative class as "NOT something AND NOT something AND NOT ...".
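
       The [^\W_] idiom works the same way in Python's re module, used here
       only as an illustration. re.ASCII limits \w to ASCII characters, and
       the variable names are hypothetical.

```python
import re

# NOT a non-word character AND NOT underscore: a letter or a digit.
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

# A word character, which includes underscore.
word_char = re.compile(r"[\w]", re.ASCII)
```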
1012
1013       The  only  metacharacters  that are recognized in character classes are
1014       backslash, hyphen (only where it can be  interpreted  as  specifying  a
1015       range),  circumflex  (only  at the start), opening square bracket (only
1016       when it can be interpreted as introducing a POSIX class name - see  the
1017       next  section),  and  the  terminating closing square bracket. However,
1018       escaping other non-alphanumeric characters does no harm.
1019

POSIX CHARACTER CLASSES

1021
1022       Perl supports the POSIX notation for character classes. This uses names
1023       enclosed  by  [: and :] within the enclosing square brackets. PCRE also
1024       supports this notation. For example,
1025
1026         [01[:alpha:]%]
1027
1028       matches "0", "1", any alphabetic character, or "%". The supported class
1029       names are:
1030
1031         alnum    letters and digits
1032         alpha    letters
1033         ascii    character codes 0 - 127
1034         blank    space or tab only
1035         cntrl    control characters
1036         digit    decimal digits (same as \d)
1037         graph    printing characters, excluding space
1038         lower    lower case letters
1039         print    printing characters, including space
1040         punct    printing characters, excluding letters and digits and space
1041         space    white space (not quite the same as \s)
1042         upper    upper case letters
1043         word     "word" characters (same as \w)
1044         xdigit   hexadecimal digits
1045
1046       The  "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
1047       and space (32). Notice that this list includes the VT  character  (code
1048       11). This makes "space" different to \s, which does not include VT (for
1049       Perl compatibility).
1050
1051       The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
1052       from  Perl  5.8. Another Perl extension is negation, which is indicated
1053       by a ^ character after the colon. For example,
1054
1055         [12[:^digit:]]
1056
1057       matches "1", "2", or any non-digit. PCRE (and Perl) also recognize  the
1058       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
1059       these are not supported, and an error is given if they are encountered.
1060
1061       By default, in UTF modes, characters with values greater  than  127  do
1062       not  match any of the POSIX character classes. However, if the PCRE_UCP
1063       option is passed to pcre_compile(), some of the classes are changed  so
1064       that Unicode character properties are used. This is achieved by replac‐
1065       ing certain POSIX classes by other sequences, as follows:
1066
1067         [:alnum:]  becomes  \p{Xan}
1068         [:alpha:]  becomes  \p{L}
1069         [:blank:]  becomes  \h
1070         [:digit:]  becomes  \p{Nd}
1071         [:lower:]  becomes  \p{Ll}
1072         [:space:]  becomes  \p{Xps}
1073         [:upper:]  becomes  \p{Lu}
1074         [:word:]   becomes  \p{Xwd}
1075
1076       Negated versions, such as [:^alpha:] use \P instead of \p. Three  other
1077       POSIX classes are handled specially in UCP mode:
1078
1079       [:graph:] This  matches  characters that have glyphs that mark the page
1080                 when printed. In Unicode property terms, it matches all char‐
1081                 acters with the L, M, N, P, S, or Cf properties, except for:
1082
1083                   U+061C           Arabic Letter Mark
1084                   U+180E           Mongolian Vowel Separator
1085                   U+2066 - U+2069  Various "isolate"s
1086
1087
1088       [:print:] This  matches  the  same  characters  as [:graph:] plus space
1089                 characters that are not controls, that  is,  characters  with
1090                 the Zs property.
1091
1092       [:punct:] This matches all characters that have the Unicode P (punctua‐
1093                 tion) property, plus those characters whose code  points  are
1094                 less than 128 that have the S (Symbol) property.
1095
1096       The  other  POSIX classes are unchanged, and match only characters with
1097       code points less than 128.
1098

VERTICAL BAR

1100
1101       Vertical bar characters are used to separate alternative patterns.  For
1102       example, the pattern
1103
1104         gilbert|sullivan
1105
1106       matches  either "gilbert" or "sullivan". Any number of alternatives may
1107       appear, and an empty  alternative  is  permitted  (matching  the  empty
1108       string). The matching process tries each alternative in turn, from left
1109       to right, and the first one that succeeds is used. If the  alternatives
1110       are  within a subpattern (defined below), "succeeds" means matching the
1111       rest of the main pattern as well as the alternative in the subpattern.
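
       The left-to-right rule can be observed in Python's re module, which
       orders alternatives in the same way. It is used here only as an
       illustration, and the variable names are hypothetical.

```python
import re

either = re.compile(r"gilbert|sullivan")

# The first alternative that succeeds is used, even if a later one
# would match more of the subject.
first_wins = re.match(r"a|ab", "ab")

# Inside a subpattern, an alternative succeeds only if the rest of the
# pattern matches too, so "a" is given up in favour of "ab" here.
rest_must_match = re.match(r"(a|ab)c", "abc")

# An empty alternative matches the empty string.
empty_alternative = re.fullmatch(r"a|", "")
```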
1112

INTERNAL OPTION SETTING

1114
1115       The settings of the  PCRE_CASELESS,  PCRE_MULTILINE,  PCRE_DOTALL,  and
1116       PCRE_EXTENDED  options  (which are Perl-compatible) can be changed from
1117       within the pattern by  a  sequence  of  Perl  option  letters  enclosed
1118       between "(?" and ")".  The option letters are
1119
1120         i  for PCRE_CASELESS
1121         m  for PCRE_MULTILINE
1122         s  for PCRE_DOTALL
1123         x  for PCRE_EXTENDED
1124
1125       For example, (?im) sets caseless, multiline matching. It is also possi‐
1126       ble to unset these options by preceding the letter with a hyphen, and a
1127       combined  setting and unsetting such as (?im-sx), which sets PCRE_CASE‐
1128       LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and  PCRE_EXTENDED,
1129       is  also  permitted.  If  a  letter  appears  both before and after the
1130       hyphen, the option is unset.
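
       The set and unset rules can be modelled by a small parser. This Python
       sketch is not PCRE code: parse_option_setting is a hypothetical name,
       and the function does not check that the letters are valid options.

```python
def parse_option_setting(item):
    # Split an item such as "(?im-sx)" into (letters set, letters unset).
    if not (item.startswith("(?") and item.endswith(")")):
        raise ValueError("not an option-setting item")
    body = item[2:-1]
    before, _, after = body.partition("-")
    unset = set(after)
    # A letter that appears on both sides of the hyphen ends up unset.
    return set(before) - unset, unset
```

       For "(?im-sx)" this returns the letters i and m as set, and s and x as
       unset.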
1131
1132       The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and  PCRE_EXTRA
1133       can  be changed in the same way as the Perl-compatible options by using
1134       the characters J, U and X respectively.
1135
1136       When one of these option changes occurs at  top  level  (that  is,  not
1137       inside  subpattern parentheses), the change applies to the remainder of
1138       the pattern that follows. If the change is placed right at the start of
1139       a pattern, PCRE extracts it into the global options (and it will there‐
1140       fore show up in data extracted by the pcre_fullinfo() function).
1141
1142       An option change within a subpattern (see below for  a  description  of
1143       subpatterns)  affects only that part of the subpattern that follows it,
1144       so
1145
1146         (a(?i)b)c
1147
1148       matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
1149       used).   By  this means, options can be made to have different settings
1150       in different parts of the pattern. Any changes made in one  alternative
1151       do  carry  on  into subsequent branches within the same subpattern. For
1152       example,
1153
1154         (a(?i)b|c)
1155
1156       matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
1157       first  branch  is  abandoned before the option setting. This is because
1158       the effects of option settings happen at compile time. There  would  be
1159       some very weird behaviour otherwise.
1160
       Note: There are other PCRE-specific options that can be set by the
       application when the compiling or matching functions are called. In
       some cases the pattern can contain special leading sequences such as
       (*CRLF) to override what the application has set or what has been
       defaulted. Details are given in the section entitled "Newline
       sequences" above. There are also the (*UTF8), (*UTF16), (*UTF32), and
       (*UCP) leading sequences that can be used to set UTF and Unicode
       property modes; they are equivalent to setting the PCRE_UTF8,
       PCRE_UTF16, PCRE_UTF32, and PCRE_UCP options, respectively. The (*UTF)
       sequence is a generic version that can be used with any of the
       libraries.
1171

SUBPATTERNS

1173
1174       Subpatterns are delimited by parentheses (round brackets), which can be
1175       nested.  Turning part of a pattern into a subpattern does two things:
1176
1177       1. It localizes a set of alternatives. For example, the pattern
1178
1179         cat(aract|erpillar|)
1180
1181       matches  "cataract",  "caterpillar", or "cat". Without the parentheses,
1182       it would match "cataract", "erpillar" or an empty string.
1183
1184       2. It sets up the subpattern as  a  capturing  subpattern.  This  means
1185       that,  when  the  whole  pattern  matches,  that portion of the subject
1186       string that matched the subpattern is passed back to the caller via the
1187       ovector  argument  of  the matching function. (This applies only to the
1188       traditional matching functions; the DFA matching functions do not  sup‐
1189       port capturing.)
1190
1191       Opening parentheses are counted from left to right (starting from 1) to
1192       obtain numbers for the  capturing  subpatterns.  For  example,  if  the
1193       string "the red king" is matched against the pattern
1194
1195         the ((red|white) (king|queen))
1196
1197       the captured substrings are "red king", "red", and "king", and are num‐
1198       bered 1, 2, and 3, respectively.
1199
1200       The fact that plain parentheses fulfil  two  functions  is  not  always
1201       helpful.   There are often times when a grouping subpattern is required
1202       without a capturing requirement. If an opening parenthesis is  followed
1203       by  a question mark and a colon, the subpattern does not do any captur‐
1204       ing, and is not counted when computing the  number  of  any  subsequent
1205       capturing  subpatterns. For example, if the string "the white queen" is
1206       matched against the pattern
1207
1208         the ((?:red|white) (king|queen))
1209
1210       the captured substrings are "white queen" and "queen", and are numbered
1211       1 and 2. The maximum number of capturing subpatterns is 65535.
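
       The two roles of plain parentheses, and the effect of (?: on the
       numbering, can be sketched as follows. Python's re module is used here
       only as a convenient stand-in for PCRE (an assumption of this
       example); it numbers capturing groups in the same left-to-right way.
       All variable names are hypothetical.

```python
import re

# Role 1: the parentheses localize the alternatives (the last one is empty).
grouped = re.compile(r"cat(aract|erpillar|)")
localized = [grouped.fullmatch(s) is not None
             for s in ("cataract", "caterpillar", "cat")]

# Without the parentheses the alternation spans the whole pattern.
ungrouped = re.compile(r"cataract|erpillar|")
ungrouped_hits = [ungrouped.fullmatch(s) is not None
                  for s in ("cataract", "erpillar", "", "caterpillar")]

# Role 2: opening parentheses are numbered from 1, left to right.
m = re.search(r"the ((red|white) (king|queen))", "the red king")
capturing = m.groups()

# (?: ... ) groups without capturing and is skipped in the numbering.
m = re.search(r"the ((?:red|white) (king|queen))", "the white queen")
non_capturing = m.groups()

print(localized, ungrouped_hits)
print(capturing)
print(non_capturing)
```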
1212
1213       As  a  convenient shorthand, if any option settings are required at the
1214       start of a non-capturing subpattern,  the  option  letters  may  appear
1215       between the "?" and the ":". Thus the two patterns
1216
1217         (?i:saturday|sunday)
1218         (?:(?i)saturday|sunday)
1219
1220       match exactly the same set of strings. Because alternative branches are
1221       tried from left to right, and options are not reset until  the  end  of
1222       the  subpattern is reached, an option setting in one branch does affect
1223       subsequent branches, so the above patterns match "SUNDAY"  as  well  as
1224       "Saturday".
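
       A minimal sketch of the (?i: shorthand follows, again using Python's
       re module as an assumed stand-in for PCRE. Only the shorthand form is
       shown, because Python does not treat an option setting in the middle
       of a pattern the way PCRE does. The subject list is hypothetical.

```python
import re

# Option letters between "?" and ":" apply to the whole non-capturing group,
# so every alternative is matched caselessly.
scoped = re.compile(r"(?i:saturday|sunday)")
plain = re.compile(r"(?:saturday|sunday)")

subjects = ["Saturday", "SUNDAY", "sunday"]
scoped_hits = [s for s in subjects if scoped.fullmatch(s)]
plain_hits = [s for s in subjects if plain.fullmatch(s)]

print(scoped_hits)
print(plain_hits)
```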
1225

DUPLICATE SUBPATTERN NUMBERS

1227
1228       Perl 5.10 introduced a feature whereby each alternative in a subpattern
1229       uses the same numbers for its capturing parentheses. Such a  subpattern
1230       starts  with (?| and is itself a non-capturing subpattern. For example,
1231       consider this pattern:
1232
1233         (?|(Sat)ur|(Sun))day
1234
1235       Because the two alternatives are inside a (?| group, both sets of  cap‐
1236       turing  parentheses  are  numbered one. Thus, when the pattern matches,
1237       you can look at captured substring number  one,  whichever  alternative
1238       matched.  This  construct  is useful when you want to capture part, but
1239       not all, of one of a number of alternatives. Inside a (?| group, paren‐
1240       theses  are  numbered as usual, but the number is reset at the start of
1241       each branch. The numbers of any capturing parentheses that  follow  the
1242       subpattern  start after the highest number used in any branch. The fol‐
1243       lowing example is taken from the Perl documentation. The numbers under‐
1244       neath show in which buffer the captured content will be stored.
1245
1246         # before  ---------------branch-reset----------- after
1247         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
1248         # 1            2         2  3        2     3     4
1249
1250       A  back  reference  to a numbered subpattern uses the most recent value
1251       that is set for that number by any subpattern.  The  following  pattern
1252       matches "abcabc" or "defdef":
1253
1254         /(?|(abc)|(def))\1/
1255
1256       In  contrast,  a subroutine call to a numbered subpattern always refers
1257       to the first one in the pattern with the given  number.  The  following
1258       pattern matches "abcabc" or "defabc":
1259
1260         /(?|(abc)|(def))(?1)/
1261
1262       If  a condition test for a subpattern's having matched refers to a non-
1263       unique number, the test is true if any of the subpatterns of that  num‐
1264       ber have matched.
1265
1266       An  alternative approach to using this "branch reset" feature is to use
1267       duplicate named subpatterns, as described in the next section.
1268

NAMED SUBPATTERNS

1270
1271       Identifying capturing parentheses by number is simple, but  it  can  be
1272       very  hard  to keep track of the numbers in complicated regular expres‐
1273       sions. Furthermore, if an  expression  is  modified,  the  numbers  may
1274       change.  To help with this difficulty, PCRE supports the naming of sub‐
1275       patterns. This feature was not added to Perl until release 5.10. Python
1276       had  the  feature earlier, and PCRE introduced it at release 4.0, using
1277       the Python syntax. PCRE now supports both the Perl and the Python  syn‐
1278       tax.  Perl  allows  identically  numbered subpatterns to have different
1279       names, but PCRE does not.
1280
1281       In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)
1282       or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
1283       to capturing parentheses from other parts of the pattern, such as  back
1284       references,  recursion,  and conditions, can be made by name as well as
1285       by number.
1286
1287       Names consist of up to  32  alphanumeric  characters  and  underscores.
1288       Named  capturing  parentheses  are  still  allocated numbers as well as
1289       names, exactly as if the names were not present. The PCRE API  provides
1290       function calls for extracting the name-to-number translation table from
1291       a compiled pattern. There is also a convenience function for extracting
1292       a captured substring by name.
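
       The relationship between names and numbers can be sketched with the
       Python-style (?P<name>...) syntax, which PCRE also accepts. The
       example runs on Python's re module, whose groupindex attribute is
       used here only as an analogue of PCRE's name-to-number table; it is
       not the PCRE API. The date pattern and the names are hypothetical.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Named groups are still numbered as if the names were absent.
name_to_number = dict(date.groupindex)

m = date.fullmatch("2024-03-15")
by_name = m.group("month")   # extract a captured substring by name
by_number = m.group(2)       # the same substring, by number

print(name_to_number)
print(by_name, by_number)
```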
1293
1294       By  default, a name must be unique within a pattern, but it is possible
1295       to relax this constraint by setting the PCRE_DUPNAMES option at compile
1296       time.  (Duplicate  names are also always permitted for subpatterns with
1297       the same number, set up as described in the previous  section.)  Dupli‐
1298       cate  names  can  be useful for patterns where only one instance of the
1299       named parentheses can match. Suppose you want to match the  name  of  a
1300       weekday,  either as a 3-letter abbreviation or as the full name, and in
1301       both cases you want to extract the abbreviation. This pattern (ignoring
1302       the line breaks) does the job:
1303
1304         (?<DN>Mon|Fri|Sun)(?:day)?|
1305         (?<DN>Tue)(?:sday)?|
1306         (?<DN>Wed)(?:nesday)?|
1307         (?<DN>Thu)(?:rsday)?|
1308         (?<DN>Sat)(?:urday)?
1309
1310       There  are  five capturing substrings, but only one is ever set after a
1311       match.  (An alternative way of solving this problem is to use a "branch
1312       reset" subpattern, as described in the previous section.)
1313
1314       The  convenience  function  for extracting the data by name returns the
1315       substring for the first (and in this example, the only)  subpattern  of
1316       that  name  that  matched.  This saves searching to find which numbered
1317       subpattern it was.
1318
1319       If you make a back reference to  a  non-unique  named  subpattern  from
1320       elsewhere  in the pattern, the one that corresponds to the first occur‐
1321       rence of the name is used. In the absence of duplicate numbers (see the
1322       previous  section) this is the one with the lowest number. If you use a
1323       named reference in a condition test (see the section  about  conditions
1324       below),  either  to check whether a subpattern has matched, or to check
1325       for recursion, all subpatterns with the same name are  tested.  If  the
1326       condition  is  true for any one of them, the overall condition is true.
1327       This is the same behaviour as testing by number. For further details of
1328       the interfaces for handling named subpatterns, see the pcreapi documen‐
1329       tation.
1330
1331       Warning: You cannot use different names to distinguish between two sub‐
1332       patterns  with  the same number because PCRE uses only the numbers when
1333       matching. For this reason, an error is given at compile time if differ‐
1334       ent  names  are given to subpatterns with the same number. However, you
1335       can give the same name to subpatterns with the same number,  even  when
1336       PCRE_DUPNAMES is not set.
1337

REPETITION

1339
1340       Repetition  is  specified  by  quantifiers, which can follow any of the
1341       following items:
1342
1343         a literal data character
1344         the dot metacharacter
1345         the \C escape sequence
1346         the \X escape sequence
1347         the \R escape sequence
1348         an escape such as \d or \pL that matches a single character
1349         a character class
1350         a back reference (see next section)
1351         a parenthesized subpattern (including assertions)
1352         a subroutine call to a subpattern (recursive or otherwise)
1353
1354       The general repetition quantifier specifies a minimum and maximum  num‐
1355       ber  of  permitted matches, by giving the two numbers in curly brackets
1356       (braces), separated by a comma. The numbers must be  less  than  65536,
1357       and the first must be less than or equal to the second. For example:
1358
1359         z{2,4}
1360
1361       matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
1362       special character. If the second number is omitted, but  the  comma  is
1363       present,  there  is  no upper limit; if the second number and the comma
1364       are both omitted, the quantifier specifies an exact number of  required
1365       matches. Thus
1366
1367         [aeiou]{3,}
1368
1369       matches at least 3 successive vowels, but may match many more, while
1370
1371         \d{8}
1372
1373       matches  exactly  8  digits. An opening curly bracket that appears in a
1374       position where a quantifier is not allowed, or one that does not  match
1375       the  syntax of a quantifier, is taken as a literal character. For exam‐
1376       ple, {,6} is not a quantifier, but a literal string of four characters.
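
       The three quantifier forms above can be checked with a short sketch.
       It uses Python's re module as an assumed stand-in for PCRE; the
       helper name is hypothetical. The {,6} case is deliberately left out,
       because Python reads {,6} as {0,6} rather than as literal text.

```python
import re

def matches(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# {2,4}: between two and four repetitions.
bounded = [n for n in range(1, 7) if matches(r"z{2,4}", "z" * n)]

# {3,}: at least three, no upper limit.
open_ended = [matches(r"[aeiou]{3,}", s) for s in ("ae", "aei", "aeiouaeiou")]

# {8}: exactly eight.
exact = [matches(r"\d{8}", s) for s in ("1234567", "12345678", "123456789")]

print(bounded, open_ended, exact)
```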
1377
1378       In UTF modes, quantifiers apply to characters rather than to individual
1379       data  units. Thus, for example, \x{100}{2} matches two characters, each
1380       of which is represented by a two-byte sequence in a UTF-8 string. Simi‐
1381       larly,  \X{3} matches three Unicode extended grapheme clusters, each of
1382       which may be several data units long (and  they  may  be  of  different
1383       lengths).
1384
1385       The quantifier {0} is permitted, causing the expression to behave as if
1386       the previous item and the quantifier were not present. This may be use‐
1387       ful  for  subpatterns that are referenced as subroutines from elsewhere
1388       in the pattern (but see also the section entitled "Defining subpatterns
1389       for  use  by  reference only" below). Items other than subpatterns that
1390       have a {0} quantifier are omitted from the compiled pattern.
1391
1392       For convenience, the three most common quantifiers have  single-charac‐
1393       ter abbreviations:
1394
1395         *    is equivalent to {0,}
1396         +    is equivalent to {1,}
1397         ?    is equivalent to {0,1}
1398
1399       It  is  possible  to construct infinite loops by following a subpattern
1400       that can match no characters with a quantifier that has no upper limit,
1401       for example:
1402
1403         (a?)*
1404
1405       Earlier versions of Perl and PCRE used to give an error at compile time
1406       for such patterns. However, because there are cases where this  can  be
1407       useful,  such  patterns  are now accepted, but if any repetition of the
1408       subpattern does in fact match no characters, the loop is forcibly  bro‐
1409       ken.
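
       As a small check that such a pattern terminates, the sketch below
       runs (a?)* on two hypothetical subjects. Python's re module stands in
       for PCRE here (an assumption of the example); it also breaks the loop
       once an iteration matches no characters.

```python
import re

looping = re.compile(r"(a?)*")

# Two iterations consume "aa"; the next one would match nothing, so the
# repetition stops instead of looping forever.
matched = looping.match("aab").group()

# With no "a" at the start, the overall match is the empty string.
empty = looping.match("b").group()

print(repr(matched), repr(empty))
```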
1410
1411       By  default,  the quantifiers are "greedy", that is, they match as much
1412       as possible (up to the maximum  number  of  permitted  times),  without
1413       causing  the  rest of the pattern to fail. The classic example of where
1414       this gives problems is in trying to match comments in C programs. These
1415       appear  between  /*  and  */ and within the comment, individual * and /
1416       characters may appear. An attempt to match C comments by  applying  the
1417       pattern
1418
1419         /\*.*\*/
1420
1421       to the string
1422
1423         /* first comment */  not comment  /* second comment */
1424
1425       fails,  because it matches the entire string owing to the greediness of
1426       the .*  item.
1427
1428       However, if a quantifier is followed by a question mark, it  ceases  to
1429       be greedy, and instead matches the minimum number of times possible, so
1430       the pattern
1431
1432         /\*.*?\*/
1433
1434       does the right thing with the C comments. The meaning  of  the  various
1435       quantifiers  is  not  otherwise  changed,  just the preferred number of
1436       matches.  Do not confuse this use of question mark with its  use  as  a
1437       quantifier  in its own right. Because it has two uses, it can sometimes
1438       appear doubled, as in
1439
1440         \d??\d
1441
1442       which matches one digit by preference, but can match two if that is the
1443       only way the rest of the pattern matches.
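
       The difference between greedy and minimizing quantifiers can be
       sketched on the C comment subject used above. Python's re module is
       an assumed stand-in for PCRE; both engines treat a trailing question
       mark on a quantifier in this way. Variable names are hypothetical.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy: .* runs to the last */ in the subject, giving one long match.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing: .*? stops at the first */ it can, giving one match per comment.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
preferred = re.match(r"\d??\d", "12").group()
forced = re.fullmatch(r"\d??\d", "12").group()

print(greedy)
print(lazy)
print(preferred, forced)
```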
1444
1445       If  the PCRE_UNGREEDY option is set (an option that is not available in
1446       Perl), the quantifiers are not greedy by default, but  individual  ones
1447       can  be  made  greedy  by following them with a question mark. In other
1448       words, it inverts the default behaviour.
1449
1450       When a parenthesized subpattern is quantified  with  a  minimum  repeat
1451       count  that is greater than 1 or with a limited maximum, more memory is
1452       required for the compiled pattern, in proportion to  the  size  of  the
1453       minimum or maximum.
1454
1455       If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv‐
1456       alent to Perl's /s) is set, thus allowing the dot  to  match  newlines,
1457       the  pattern  is  implicitly anchored, because whatever follows will be
1458       tried against every character position in the subject string, so  there
1459       is  no  point  in  retrying the overall match at any position after the
1460       first. PCRE normally treats such a pattern as though it  were  preceded
1461       by \A.
1462
1463       In  cases  where  it  is known that the subject string contains no new‐
1464       lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti‐
1465       mization, or alternatively using ^ to indicate anchoring explicitly.
1466
1467       However,  there  are  some cases where the optimization cannot be used.
1468       When .*  is inside capturing parentheses that are the subject of a back
1469       reference elsewhere in the pattern, a match at the start may fail where
1470       a later one succeeds. Consider, for example:
1471
1472         (.*)abc\1
1473
1474       If the subject is "xyz123abc123" the match point is the fourth  charac‐
1475       ter. For this reason, such a pattern is not implicitly anchored.
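
       The match point in this example can be confirmed with a short sketch,
       assuming Python's re module as a stand-in for PCRE. re.match tries
       position 0 only, which imitates what an implicit anchor would do;
       re.search is unanchored. Variable names are hypothetical.

```python
import re

pattern = re.compile(r"(.*)abc\1")
subject = "xyz123abc123"

# Anchored at the start, the match fails: "xyz123" is not repeated after "abc".
anchored = pattern.match(subject)

# Unanchored, the match is found at offset 3, the fourth character.
m = pattern.search(subject)
start, captured, whole = m.start(), m.group(1), m.group()

print(anchored)
print(start, captured, whole)
```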
1476
1477       Another  case where implicit anchoring is not applied is when the lead‐
1478       ing .* is inside an atomic group. Once again, a match at the start  may
1479       fail where a later one succeeds. Consider this pattern:
1480
1481         (?>.*?a)b
1482
       It matches "ab" in the subject "aab". The use of the backtracking
       control verbs (*PRUNE) and (*SKIP) also disables this optimization.
1485
1486       When a capturing subpattern is repeated, the value captured is the sub‐
1487       string that matched the final iteration. For example, after
1488
1489         (tweedle[dume]{3}\s*)+
1490
1491       has matched "tweedledum tweedledee" the value of the captured substring
1492       is "tweedledee". However, if there are  nested  capturing  subpatterns,
1493       the  corresponding captured values may have been set in previous itera‐
1494       tions. For example, after
1495
1496         /(a|(b))+/
1497
1498       matches "aba" the value of the second captured substring is "b".
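
       Both cases can be sketched as follows, using Python's re module as an
       assumed stand-in for PCRE. Like PCRE, it reports the final iteration
       for a repeated group and leaves a nested group at the value set in an
       earlier iteration. Variable names are hypothetical.

```python
import re

# The repeated group reports the substring from its final iteration.
m = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m.group(1)

# The nested group (b) took part only in the second of three iterations,
# yet its value is still set after the match.
m = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = m.group(1), m.group(2)

print(last_iteration)
print(outer, inner)
```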
1499

ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

1501
       With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
       repetition, failure of what follows normally causes the repeated item
       to be re-evaluated to see if a different number of repeats allows the
       rest of the pattern to match. Sometimes it is useful to prevent this,
       either to change the nature of the match, or to cause it to fail
       earlier than it otherwise might, when the author of the pattern knows
       there is no point in carrying on.
1509
1510       Consider, for example, the pattern \d+foo when applied to  the  subject
1511       line
1512
1513         123456bar
1514
1515       After matching all 6 digits and then failing to match "foo", the normal
1516       action of the matcher is to try again with only 5 digits  matching  the
1517       \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
1518       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
1519       the  means for specifying that once a subpattern has matched, it is not
1520       to be re-evaluated in this way.
1521
1522       If we use atomic grouping for the previous example, the  matcher  gives
1523       up  immediately  on failing to match "foo" the first time. The notation
1524       is a kind of special parenthesis, starting with (?> as in this example:
1525
1526         (?>\d+)foo
1527
1528       This kind of parenthesis "locks up" the  part of the  pattern  it  con‐
1529       tains  once  it  has matched, and a failure further into the pattern is
1530       prevented from backtracking into it. Backtracking past it  to  previous
1531       items, however, works as normal.
1532
1533       An  alternative  description  is that a subpattern of this type matches
1534       the string of characters that an  identical  standalone  pattern  would
1535       match, if anchored at the current point in the subject string.
1536
1537       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
1538       such as the above example can be thought of as a maximizing repeat that
1539       must  swallow  everything  it can. So, while both \d+ and \d+? are pre‐
1540       pared to adjust the number of digits they match in order  to  make  the
1541       rest of the pattern match, (?>\d+) can only match an entire sequence of
1542       digits.
1543
1544       Atomic groups in general can of course contain arbitrarily  complicated
1545       subpatterns,  and  can  be  nested. However, when the subpattern for an
1546       atomic group is just a single repeated item, as in the example above, a
1547       simpler  notation,  called  a "possessive quantifier" can be used. This
1548       consists of an additional + character  following  a  quantifier.  Using
1549       this notation, the previous example can be rewritten as
1550
1551         \d++foo
1552
1553       Note that a possessive quantifier can be used with an entire group, for
1554       example:
1555
1556         (abc|xyz){2,3}+
1557
1558       Possessive  quantifiers  are  always  greedy;  the   setting   of   the
1559       PCRE_UNGREEDY option is ignored. They are a convenient notation for the
1560       simpler forms of atomic group. However, there is no difference  in  the
1561       meaning  of  a  possessive  quantifier and the equivalent atomic group,
1562       though there may be a performance  difference;  possessive  quantifiers
1563       should be slightly faster.
1564
1565       The  possessive  quantifier syntax is an extension to the Perl 5.8 syn‐
1566       tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first
1567       edition of his book. Mike McCloskey liked it, so implemented it when he
1568       built Sun's Java package, and PCRE copied it from there. It  ultimately
1569       found its way into Perl at release 5.10.
1570
1571       PCRE has an optimization that automatically "possessifies" certain sim‐
1572       ple pattern constructs. For example, the sequence  A+B  is  treated  as
1573       A++B  because  there is no point in backtracking into a sequence of A's
1574       when B must follow.
1575
1576       When a pattern contains an unlimited repeat inside  a  subpattern  that
1577       can  itself  be  repeated  an  unlimited number of times, the use of an
1578       atomic group is the only way to avoid some  failing  matches  taking  a
1579       very long time indeed. The pattern
1580
1581         (\D+|<\d+>)*[!?]
1582
1583       matches  an  unlimited number of substrings that either consist of non-
1584       digits, or digits enclosed in <>, followed by either ! or  ?.  When  it
1585       matches, it runs quickly. However, if it is applied to
1586
1587         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1588
1589       it  takes  a  long  time  before reporting failure. This is because the
1590       string can be divided between the internal \D+ repeat and the  external
1591       *  repeat  in  a  large  number of ways, and all have to be tried. (The
1592       example uses [!?] rather than a single character at  the  end,  because
1593       both  PCRE  and  Perl have an optimization that allows for fast failure
1594       when a single character is used. They remember the last single  charac‐
1595       ter  that  is required for a match, and fail early if it is not present
1596       in the string.) If the pattern is changed so that  it  uses  an  atomic
1597       group, like this:
1598
1599         ((?>\D+)|<\d+>)*[!?]
1600
1601       sequences of non-digits cannot be broken, and failure happens quickly.
1602

BACK REFERENCES

1604
1605       Outside a character class, a backslash followed by a digit greater than
1606       0 (and possibly further digits) is a back reference to a capturing sub‐
1607       pattern  earlier  (that is, to its left) in the pattern, provided there
1608       have been that many previous capturing left parentheses.
1609
1610       However, if the decimal number following the backslash is less than 10,
1611       it  is  always  taken  as a back reference, and causes an error only if
1612       there are not that many capturing left parentheses in the  entire  pat‐
1613       tern.  In  other words, the parentheses that are referenced need not be
1614       to the left of the reference for numbers less than 10. A "forward  back
1615       reference"  of  this  type can make sense when a repetition is involved
1616       and the subpattern to the right has participated in an  earlier  itera‐
1617       tion.
1618
1619       It  is  not  possible to have a numerical "forward back reference" to a
1620       subpattern whose number is 10 or  more  using  this  syntax  because  a
1621       sequence  such  as  \50 is interpreted as a character defined in octal.
1622       See the subsection entitled "Non-printing characters" above for further
1623       details  of  the  handling of digits following a backslash. There is no
1624       such problem when named parentheses are used. A back reference  to  any
1625       subpattern is possible using named parentheses (see below).
1626
1627       Another  way  of  avoiding  the ambiguity inherent in the use of digits
1628       following a backslash is to use the \g  escape  sequence.  This  escape
1629       must be followed by an unsigned number or a negative number, optionally
1630       enclosed in braces. These examples are all identical:
1631
1632         (ring), \1
1633         (ring), \g1
1634         (ring), \g{1}
1635
1636       An unsigned number specifies an absolute reference without the  ambigu‐
1637       ity that is present in the older syntax. It is also useful when literal
1638       digits follow the reference. A negative number is a relative reference.
1639       Consider this example:
1640
1641         (abc(def)ghi)\g{-1}
1642
       The sequence \g{-1} is a reference to the most recently started
       capturing subpattern before \g, that is, it is equivalent to \2 in this
       example. Similarly, \g{-2} would be equivalent to \1. The use of
       relative references can be helpful in long patterns, and also in
       patterns that are created by joining together fragments that contain
       references within themselves.
1649
1650       A back reference matches whatever actually matched the  capturing  sub‐
1651       pattern  in  the  current subject string, rather than anything matching
1652       the subpattern itself (see "Subpatterns as subroutines" below for a way
1653       of doing that). So the pattern
1654
1655         (sens|respons)e and \1ibility
1656
1657       matches  "sense and sensibility" and "response and responsibility", but
1658       not "sense and responsibility". If caseful matching is in force at  the
1659       time  of the back reference, the case of letters is relevant. For exam‐
1660       ple,
1661
1662         ((?i)rah)\s+\1
1663
1664       matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
1665       original capturing subpattern is matched caselessly.
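
       Both examples can be sketched with Python's re module as an assumed
       stand-in for PCRE. Python does not accept (?i) in the middle of a
       pattern, so the second pattern is written with the scoped form
       (?i:rah) instead; this substitution is part of the example, not of
       the PCRE text. Variable names are hypothetical.

```python
import re

# \1 matches the text that group 1 actually captured, not the group's pattern.
pair = re.compile(r"(sens|respons)e and \1ibility")
pair_results = [pair.fullmatch(s) is not None for s in (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)]

# The group is matched caselessly, but the back reference lies outside the
# scoped option and so compares case exactly.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = [rah.fullmatch(s) is not None
               for s in ("rah rah", "RAH RAH", "RAH rah")]

print(pair_results)
print(rah_results)
```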
1666
1667       There  are  several  different ways of writing back references to named
1668       subpatterns. The .NET syntax \k{name} and the Perl syntax  \k<name>  or
1669       \k'name'  are supported, as is the Python syntax (?P=name). Perl 5.10's
1670       unified back reference syntax, in which \g can be used for both numeric
1671       and  named  references,  is  also supported. We could rewrite the above
1672       example in any of the following ways:
1673
1674         (?<p1>(?i)rah)\s+\k<p1>
1675         (?'p1'(?i)rah)\s+\k{p1}
1676         (?P<p1>(?i)rah)\s+(?P=p1)
1677         (?<p1>(?i)rah)\s+\g{p1}
1678
1679       A subpattern that is referenced by  name  may  appear  in  the  pattern
1680       before or after the reference.
1681
1682       There  may be more than one back reference to the same subpattern. If a
1683       subpattern has not actually been used in a particular match,  any  back
1684       references to it always fail by default. For example, the pattern
1685
1686         (a|(bc))\2
1687
1688       always  fails  if  it starts to match "a" rather than "bc". However, if
1689       the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer‐
1690       ence to an unset value matches an empty string.
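
       The default behaviour can be sketched as follows, with Python's re
       module as an assumed stand-in for PCRE; it also fails a back
       reference to a group that has not been set. The subjects and variable
       names are hypothetical.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 took part in the match, so \2 can repeat "bc".
set_group = pattern.fullmatch("bcbc")

# The first alternative matched "a", group 2 is unset, and \2 fails.
unset_group = pattern.match("a")

print(set_group is not None, unset_group is not None)
```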
1691
1692       Because  there may be many capturing parentheses in a pattern, all dig‐
1693       its following a backslash are taken as part of a potential back  refer‐
1694       ence  number.   If  the  pattern continues with a digit character, some
1695       delimiter must  be  used  to  terminate  the  back  reference.  If  the
1696       PCRE_EXTENDED  option  is  set, this can be white space. Otherwise, the
1697       \g{ syntax or an empty comment (see "Comments" below) can be used.
1698
1699   Recursive back references
1700
1701       A back reference that occurs inside the parentheses to which it  refers
1702       fails  when  the subpattern is first used, so, for example, (a\1) never
1703       matches.  However, such references can be useful inside  repeated  sub‐
1704       patterns. For example, the pattern
1705
1706         (a|b\1)+
1707
1708       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter‐
1709       ation of the subpattern,  the  back  reference  matches  the  character
1710       string  corresponding  to  the previous iteration. In order for this to
1711       work, the pattern must be such that the first iteration does  not  need
1712       to  match the back reference. This can be done using alternation, as in
1713       the example above, or by a quantifier with a minimum of zero.
1714
1715       Back references of this type cause the group that they reference to  be
1716       treated  as  an atomic group.  Once the whole group has been matched, a
1717       subsequent matching failure cannot cause backtracking into  the  middle
1718       of the group.
1719

ASSERTIONS

1721
1722       An  assertion  is  a  test on the characters following or preceding the
1723       current matching point that does not actually consume  any  characters.
1724       The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are
1725       described above.
1726
1727       More complicated assertions are coded as  subpatterns.  There  are  two
1728       kinds:  those  that  look  ahead of the current position in the subject
1729       string, and those that look  behind  it.  An  assertion  subpattern  is
1730       matched  in  the  normal way, except that it does not cause the current
1731       matching position to be changed.
1732
1733       Assertion subpatterns are not capturing subpatterns. If such an  asser‐
1734       tion  contains  capturing  subpatterns within it, these are counted for
1735       the purposes of numbering the capturing subpatterns in the  whole  pat‐
1736       tern.  However,  substring  capturing  is carried out only for positive
1737       assertions, because it does not make sense for negative assertions.
1738
       For compatibility with Perl, assertion subpatterns may be repeated;
       though it makes no sense to assert the same thing several times, the
       side effect of capturing parentheses may occasionally be useful. In
       practice, there are only three cases:
1743
1744       (1)  If  the  quantifier  is  {0}, the assertion is never obeyed during
1745       matching.  However, it may  contain  internal  capturing  parenthesized
1746       groups that are called from elsewhere via the subroutine mechanism.
1747
       (2) If the quantifier is {0,n} where n is greater than zero, it is
       treated as if it were {0,1}. At run time, the rest of the pattern match
       is tried with and without the assertion, the order depending on the
       greediness of the quantifier.
1752
1753       (3) If the minimum repetition is greater than zero, the  quantifier  is
1754       ignored.   The  assertion  is  obeyed just once when encountered during
1755       matching.
1756
1757   Lookahead assertions
1758
1759       Lookahead assertions start with (?= for positive assertions and (?! for
1760       negative assertions. For example,
1761
1762         \w+(?=;)
1763
1764       matches  a word followed by a semicolon, but does not include the semi‐
1765       colon in the match, and
1766
1767         foo(?!bar)
1768
1769       matches any occurrence of "foo" that is not  followed  by  "bar".  Note
1770       that the apparently similar pattern
1771
1772         (?!foo)bar
1773
1774       does  not  find  an  occurrence  of "bar" that is preceded by something
1775       other than "foo"; it finds any occurrence of "bar" whatsoever,  because
1776       the assertion (?!foo) is always true when the next three characters are
1777       "bar". A lookbehind assertion is needed to achieve the other effect.
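
       The three lookahead patterns above can be tried out with  a  minimal
       sketch.  It uses Python's re module, which is an assumption of this
       example and is not PCRE, although it handles these  particular  forms
       in the same way. The subject strings are made up for illustration.

```python
import re

# \w+(?=;) matches a word only when a semicolon follows; the semicolon
# itself is not part of the match.
word = re.search(r"\w+(?=;)", "alpha beta; gamma").group(0)

# foo(?!bar) matches "foo" only when "bar" does not follow.
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz foo")]

# (?!foo)bar matches every "bar": where the next three characters are
# "bar", they cannot also be "foo", so the assertion is always true.
bar_starts = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar xbar")]

print(word)        # beta
print(foo_starts)  # [7, 14]
print(bar_starts)  # [3, 8]
```

       The last result includes offset 3, the "bar" inside "foobar",  which
       is why a lookbehind is needed to exclude it.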
1778
1779       If you want to force a matching failure at some point in a pattern, the
1780       most  convenient  way  to  do  it  is with (?!) because an empty string
1781       always matches, so an assertion that requires there not to be an  empty
1782       string must always fail.  The backtracking control verb (*FAIL) or (*F)
1783       is a synonym for (?!).
1784
1785   Lookbehind assertions
1786
1787       Lookbehind assertions start with (?<= for positive assertions and  (?<!
1788       for negative assertions. For example,
1789
1790         (?<!foo)bar
1791
1792       does  find  an  occurrence  of "bar" that is not preceded by "foo". The
1793       contents of a lookbehind assertion are restricted  such  that  all  the
1794       strings it matches must have a fixed length. However, if there are sev‐
1795       eral top-level alternatives, they do not all  have  to  have  the  same
1796       fixed length. Thus
1797
1798         (?<=bullock|donkey)
1799
1800       is permitted, but
1801
1802         (?<!dogs?|cats?)
1803
1804       causes  an  error at compile time. Branches that match different length
1805       strings are permitted only at the top level of a lookbehind  assertion.
1806       This is an extension compared with Perl, which requires all branches to
1807       match the same length of string. An assertion such as
1808
1809         (?<=ab(c|de))
1810
1811       is not permitted, because its single top-level  branch  can  match  two
1812       different lengths, but it is acceptable to PCRE if rewritten to use two
1813       top-level branches:
1814
1815         (?<=abc|abde)
1816
1817       In some cases, the escape sequence \K (see above) can be  used  instead
1818       of a lookbehind assertion to get round the fixed-length restriction.
1819
1820       The  implementation  of lookbehind assertions is, for each alternative,
1821       to temporarily move the current position back by the fixed  length  and
1822       then try to match. If there are insufficient characters before the cur‐
1823       rent position, the assertion fails.
1824
1825       In a UTF mode, PCRE does not allow the \C escape (which matches a  sin‐
1826       gle  data  unit even in a UTF mode) to appear in lookbehind assertions,
1827       because it makes it impossible to calculate the length of  the  lookbe‐
1828       hind.  The \X and \R escapes, which can match different numbers of data
1829       units, are also not permitted.
1830
1831       "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in
1832       lookbehinds,  as  long as the subpattern matches a fixed-length string.
1833       Recursion, however, is not supported.
1834
1835       Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
1836       assertions to specify efficient matching of fixed-length strings at the
1837       end of subject strings. Consider a simple pattern such as
1838
1839         abcd$
1840
1841       when applied to a long string that does  not  match.  Because  matching
1842       proceeds from left to right, PCRE will look for each "a" in the subject
1843       and then see if what follows matches the rest of the  pattern.  If  the
1844       pattern is specified as
1845
1846         ^.*abcd$
1847
1848       the  initial .* matches the entire string at first, but when this fails
1849       (because there is no following "a"), it backtracks to match all but the
1850       last  character,  then all but the last two characters, and so on. Once
1851       again the search for "a" covers the entire string, from right to  left,
1852       so we are no better off. However, if the pattern is written as
1853
1854         ^.*+(?<=abcd)
1855
1856       there  can  be  no backtracking for the .*+ item; it can match only the
1857       entire string. The subsequent lookbehind assertion does a  single  test
1858       on  the last four characters. If it fails, the match fails immediately.
1859       For long strings, this approach makes a significant difference  to  the
1860       processing time.
1861
1862   Using multiple assertions
1863
1864       Several assertions (of any sort) may occur in succession. For example,
1865
1866         (?<=\d{3})(?<!999)foo
1867
1868       matches  "foo" preceded by three digits that are not "999". Notice that
1869       each of the assertions is applied independently at the  same  point  in
1870       the  subject  string.  First  there  is a check that the previous three
1871       characters are all digits, and then there is  a  check  that  the  same
1872       three characters are not "999".  This pattern does not match "foo" pre‐
1873       ceded by six characters, the first of which are  digits  and  the  last
1874       three  of  which  are not "999". For example, it doesn't match "123abc‐
1875       foo". A pattern to do that is
1876
1877         (?<=\d{3}...)(?<!999)foo
1878
1879       This time the first assertion looks at the  preceding  six  characters,
1880       checking that the first three are digits, and then the second assertion
1881       checks that the preceding three characters are not "999".
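
       The difference between the two patterns above  can  be  checked  with
       the  following  sketch.  It assumes Python's re module as a stand-in
       for PCRE (both lookbehinds here have a fixed length, which re  also
       requires); the helper name hits and the subjects are hypothetical.

```python
import re

# Both assertions test the same three preceding characters.
same_three = re.compile(r"(?<=\d{3})(?<!999)foo")
# The first assertion tests six preceding characters, the second the last three.
six_before = re.compile(r"(?<=\d{3}...)(?<!999)foo")

def hits(pattern, subject):
    """True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]
results = {s: (hits(same_three, s), hits(six_before, s)) for s in subjects}

for subject, (a, b) in results.items():
    print(subject, a, b)
```

       Only "123foo" satisfies the first pattern and only  "123abcfoo"  the
       second; "123999foo" is rejected by both because of (?<!999).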
1882
1883       Assertions can be nested in any combination. For example,
1884
1885         (?<=(?<!foo)bar)baz
1886
1887       matches an occurrence of "baz" that is preceded by "bar" which in  turn
1888       is not preceded by "foo", while
1889
1890         (?<=\d{3}(?!999)...)foo
1891
1892       is  another pattern that matches "foo" preceded by three digits and any
1893       three characters that are not "999".
1894

CONDITIONAL SUBPATTERNS

1896
1897       It is possible to cause the matching process to obey a subpattern  con‐
1898       ditionally  or to choose between two alternative subpatterns, depending
1899       on the result of an assertion, or whether a specific capturing  subpat‐
1900       tern  has  already  been matched. The two possible forms of conditional
1901       subpattern are:
1902
1903         (?(condition)yes-pattern)
1904         (?(condition)yes-pattern|no-pattern)
1905
1906       If the condition is satisfied, the yes-pattern is used;  otherwise  the
1907       no-pattern  (if  present)  is used. If there are more than two alterna‐
1908       tives in the subpattern, a compile-time error occurs. Each of  the  two
1909       alternatives may itself contain nested subpatterns of any form, includ‐
1910       ing  conditional  subpatterns;  the  restriction  to  two  alternatives
1911       applies only at the level of the condition. This pattern fragment is an
1912       example where the alternatives are complex:
1913
1914         (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
1915
1916
1917       There are four kinds of condition: references  to  subpatterns,  refer‐
1918       ences to recursion, a pseudo-condition called DEFINE, and assertions.
1919
1920   Checking for a used subpattern by number
1921
1922       If  the  text between the parentheses consists of a sequence of digits,
1923       the condition is true if a capturing subpattern of that number has pre‐
1924       viously  matched.  If  there is more than one capturing subpattern with
1925       the same number (see the earlier  section  about  duplicate  subpattern
1926       numbers),  the condition is true if any of them have matched. An alter‐
1927       native notation is to precede the digits with a plus or minus sign.  In
1928       this  case, the subpattern number is relative rather than absolute. The
1929       most recently opened parentheses can be referenced by (?(-1), the  next
1930       most  recent  by (?(-2), and so on. Inside loops it can also make sense
1931       to refer to subsequent groups. The next parentheses to be opened can be
1932       referenced  as (?(+1), and so on. (The value zero in any of these forms
1933       is not used; it provokes a compile-time error.)
1934
1935       Consider the following pattern, which  contains  non-significant  white
1936       space to make it more readable (assume the PCRE_EXTENDED option) and to
1937       divide it into three parts for ease of discussion:
1938
1939         ( \( )?    [^()]+    (?(1) \) )
1940
1941       The first part matches an optional opening  parenthesis,  and  if  that
1942       character is present, sets it as the first captured substring. The sec‐
1943       ond part matches one or more characters that are not  parentheses.  The
1944       third  part  is  a conditional subpattern that tests whether or not the
1945       first set of parentheses matched. If they did, that is, if the  subject
1946       started  with an opening parenthesis, the condition is true, and so the
1947       yes-pattern is executed and a closing parenthesis is  required.  Other‐
1948       wise,  since no-pattern is not present, the subpattern matches nothing.
1949       In other words, this pattern matches  a  sequence  of  non-parentheses,
1950       optionally enclosed in parentheses.
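
       As a rough illustration, the same pattern can be run  with  Python's
       re  module,  which  is  an assumption of this sketch rather than the
       PCRE API; re.VERBOSE plays the role  of  PCRE_EXTENDED  here.  Whole
       subjects are tested so that partial matches do not hide the effect of
       the condition.

```python
import re

# ( \( )?  optional "(" captured as group 1
# [^()]+   one or more non-parentheses
# (?(1) \) )  require ")" only if group 1 took part in the match
pattern = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

subjects = ["abc", "(abc)", "(abc", "abc)"]
outcome = {s: pattern.fullmatch(s) is not None for s in subjects}

for subject, matched in outcome.items():
    print(subject, matched)
```

       The bare and the fully enclosed forms match;  the  two  subjects  with
       an unbalanced parenthesis do not.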
1951
1952       If  you  were  embedding  this pattern in a larger one, you could use a
1953       relative reference:
1954
1955         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
1956
1957       This makes the fragment independent of the parentheses  in  the  larger
1958       pattern.
1959
1960   Checking for a used subpattern by name
1961
1962       Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
1963       used subpattern by name. For compatibility  with  earlier  versions  of
1964       PCRE,  which  had this facility before Perl, the syntax (?(name)...) is
1965       also recognized. However, there is a possible ambiguity with this  syn‐
1966       tax,  because  subpattern  names  may  consist entirely of digits. PCRE
1967       looks first for a named subpattern; if it cannot find one and the  name
1968       consists  entirely  of digits, PCRE looks for a subpattern of that num‐
1969       ber, which must be greater than zero. Using subpattern names that  con‐
1970       sist entirely of digits is not recommended.
1971
1972       Rewriting the above example to use a named subpattern gives this:
1973
1974         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
1975
1976       If  the  name used in a condition of this kind is a duplicate, the test
1977       is applied to all subpatterns of the same name, and is true if any  one
1978       of them has matched.
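
       A sketch of the named form, assuming Python's re module  instead  of
       PCRE.  Python spells the group (?P<OPEN>...) and the condition
       (?(OPEN)...); PCRE accepts both of these spellings as well,  so  the
       pattern below is not specific to Python.

```python
import re

named = re.compile(r"(?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )", re.VERBOSE)

with_parens = named.fullmatch("(abc)")
without_parens = named.fullmatch("abc")
unbalanced = named.fullmatch("(abc")

# The OPEN group is set only when the opening parenthesis was present.
print(with_parens.group("OPEN"))     # (
print(without_parens.group("OPEN"))  # None
print(unbalanced)                    # None
```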
1979
1980   Checking for pattern recursion
1981
1982       If the condition is the string (R), and there is no subpattern with the
1983       name R, the condition is true if a recursive call to the whole  pattern
1984       or any subpattern has been made. If digits or a name preceded by amper‐
1985       sand follow the letter R, for example:
1986
1987         (?(R3)...) or (?(R&name)...)
1988
1989       the condition is true if the most recent recursion is into a subpattern
1990       whose number or name is given. This condition does not check the entire
1991       recursion stack. If the name used in a condition  of  this  kind  is  a
1992       duplicate, the test is applied to all subpatterns of the same name, and
1993       is true if any one of them is the most recent recursion.
1994
1995       At "top level", all these recursion test  conditions  are  false.   The
1996       syntax for recursive patterns is described below.
1997
1998   Defining subpatterns for use by reference only
1999
2000       If  the  condition  is  the string (DEFINE), and there is no subpattern
2001       with the name DEFINE, the condition is  always  false.  In  this  case,
2002       there  may  be  only  one  alternative  in the subpattern. It is always
2003       skipped if control reaches this point  in  the  pattern;  the  idea  of
2004       DEFINE  is that it can be used to define subroutines that can be refer‐
2005       enced from elsewhere. (The use of subroutines is described below.)  For
2006       example,  a  pattern  to match an IPv4 address such as "192.168.23.245"
2007       could be written like this (ignore white space and line breaks):
2008
2009         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
2010         \b (?&byte) (\.(?&byte)){3} \b
2011
2012       The first part of the pattern is a DEFINE group  inside  which  another
2013       group  named "byte" is defined. This matches an individual component of
2014       an IPv4 address (a number less than 256). When  matching  takes  place,
2015       this  part  of  the pattern is skipped because DEFINE acts like a false
2016       condition. The rest of the pattern uses references to the  named  group
2017       to  match the four dot-separated components of an IPv4 address, insist‐
2018       ing on a word boundary at each end.
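
       Python's re module has no DEFINE, so the following sketch swaps in a
       different  technique:  the text of the "byte" group is held in a
       string and pasted in wherever (?&byte) appears. The names BYTE,  IPV4
       and  is_ipv4  are hypothetical. Unlike a PCRE subroutine call, the
       pasted copy is an ordinary group and not an atomic one.

```python
import re

# Same alternatives, in the same order, as the "byte" group in the text.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# \b (?&byte) (\.(?&byte)){3} \b with each call replaced by the group text.
IPV4 = re.compile(r"\b" + BYTE + r"(?:\." + BYTE + r"){3}\b")

def is_ipv4(text):
    """True if the whole of text is a dotted IPv4 address."""
    return IPV4.fullmatch(text) is not None

for candidate in ["192.168.23.245", "10.0.0.1", "256.1.1.1", "1.2.3"]:
    print(candidate, is_ipv4(candidate))
```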
2019
2020   Assertion conditions
2021
2022       If the condition is not in any of the above  formats,  it  must  be  an
2023       assertion.   This may be a positive or negative lookahead or lookbehind
2024       assertion. Consider  this  pattern,  again  containing  non-significant
2025       white space, and with the two alternatives on the second line:
2026
2027         (?(?=[^a-z]*[a-z])
2028         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
2029
2030       The  condition  is  a  positive  lookahead  assertion  that  matches an
2031       optional sequence of non-letters followed by a letter. In other  words,
2032       it  tests  for the presence of at least one letter in the subject. If a
2033       letter is found, the subject is matched against the first  alternative;
2034       otherwise  it  is  matched  against  the  second.  This pattern matches
2035       strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
2036       letters and dd are digits.
2037

COMMENTS

2039
2040       There are two ways of including comments in patterns that are processed
2041       by PCRE. In both cases, the start of the comment must not be in a char‐
2042       acter class, nor in the middle of any other sequence of related charac‐
2043       ters such as (?: or a subpattern name or number.  The  characters  that
2044       make up a comment play no part in the pattern matching.
2045
2046       The  sequence (?# marks the start of a comment that continues up to the
2047       next closing parenthesis. Nested parentheses are not permitted. If  the
2048       PCRE_EXTENDED option is set, an unescaped # character also introduces a
2049       comment, which in this case continues to  immediately  after  the  next
2050       newline  character  or character sequence in the pattern. Which charac‐
2051       ters are interpreted as newlines is controlled by the options passed to
2052       a  compiling function or by a special sequence at the start of the pat‐
2053       tern, as described in the section entitled "Newline conventions" above.
2054       Note that the end of this type of comment is a literal newline sequence
2055       in the pattern; escape sequences that happen to represent a newline  do
2056       not  count.  For  example,  consider this pattern when PCRE_EXTENDED is
2057       set, and the default newline convention is in force:
2058
2059         abc #comment \n still comment
2060
2061       On encountering the # character, pcre_compile()  skips  along,  looking
2062       for  a newline in the pattern. The sequence \n is still literal at this
2063       stage, so it does not terminate the comment. Only an  actual  character
2064       with the code value 0x0a (the default newline) does so.
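
       The distinction between an escape sequence and a literal newline  can
       be  seen  in  a small sketch. It assumes Python's re module, whose
       re.VERBOSE flag behaves like PCRE_EXTENDED for this purpose  (Python
       always ends such a comment at a linefeed character).

```python
import re

# Raw string: the pattern holds a backslash followed by "n", which does
# not end the comment, so only "abc" remains.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: the pattern holds a real linefeed, which ends the
# comment, so "def" is part of the pattern.
literal = re.compile("abc #comment \n def", re.VERBOSE)

# (?# ... ) comments need no option and end at the next ")".
inline = re.compile(r"abc(?# ignored )def")

print(escaped.fullmatch("abc") is not None)     # True
print(literal.fullmatch("abcdef") is not None)  # True
print(inline.fullmatch("abcdef") is not None)   # True
```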
2065

RECURSIVE PATTERNS

2067
2068       Consider  the problem of matching a string in parentheses, allowing for
2069       unlimited nested parentheses. Without the use of  recursion,  the  best
2070       that  can  be  done  is  to use a pattern that matches up to some fixed
2071       depth of nesting. It is not possible to  handle  an  arbitrary  nesting
2072       depth.
2073
2074       For some time, Perl has provided a facility that allows regular expres‐
2075       sions to recurse (amongst other things). It does this by  interpolating
2076       Perl  code in the expression at run time, and the code can refer to the
2077       expression itself. A Perl pattern using code interpolation to solve the
2078       parentheses problem can be created like this:
2079
2080         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
2081
2082       The (?p{...}) item interpolates Perl code at run time, and in this case
2083       refers recursively to the pattern in which it appears.
2084
2085       Obviously, PCRE cannot support the interpolation of Perl code. Instead,
2086       it  supports  special  syntax  for recursion of the entire pattern, and
2087       also for individual subpattern recursion.  After  its  introduction  in
2088       PCRE  and  Python,  this  kind of recursion was subsequently introduced
2089       into Perl at release 5.10.
2090
2091       A special item that consists of (? followed by a  number  greater  than
2092       zero  and  a  closing parenthesis is a recursive subroutine call of the
2093       subpattern of the given number, provided that  it  occurs  inside  that
2094       subpattern.  (If  not,  it is a non-recursive subroutine call, which is
2095       described in the next section.) The special item  (?R)  or  (?0)  is  a
2096       recursive call of the entire regular expression.
2097
2098       This  PCRE  pattern  solves  the nested parentheses problem (assume the
2099       PCRE_EXTENDED option is set so that white space is ignored):
2100
2101         \( ( [^()]++ | (?R) )* \)
2102
2103       First it matches an opening parenthesis. Then it matches any number  of
2104       substrings  which  can  either  be  a sequence of non-parentheses, or a
2105       recursive match of the pattern itself (that is, a  correctly  parenthe‐
2106       sized substring).  Finally there is a closing parenthesis. Note the use
2107       of a possessive quantifier to avoid backtracking into sequences of non-
2108       parentheses.
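
       Python's re module does not support recursion, so the steps just
       described  are  sketched  here as a hand-written recursive function
       rather than as a pattern. The name match_parens is hypothetical,  and
       the  function  only mirrors the shape of the pattern; it is not how
       PCRE implements recursion.

```python
def match_parens(subject, pos=0):
    """Return the index just past a balanced group that starts at
    subject[pos], or None if there is no such group."""
    if pos >= len(subject) or subject[pos] != "(":
        return None                     # \( must match first
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":                   # \) closes this level
            return pos + 1
        if ch == "(":                   # (?R): recurse into a nested group
            pos = match_parens(subject, pos)
            if pos is None:
                return None
        else:                           # [^()]++: consume without giving back
            pos += 1
    return None                         # subject ended before \)

print(match_parens("(ab(cd)ef)"))  # 10
print(match_parens("(a(b)"))       # None
```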
2109
2110       If  this  were  part of a larger pattern, you would not want to recurse
2111       the entire pattern, so instead you could use this:
2112
2113         ( \( ( [^()]++ | (?1) )* \) )
2114
2115       We have put the pattern into parentheses, and caused the  recursion  to
2116       refer to them instead of the whole pattern.
2117
2118       In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
2119       tricky. This is made easier by the use of relative references.  Instead
2120       of (?1) in the pattern above you can write (?-2) to refer to the second
2121       most recently opened parentheses  preceding  the  recursion.  In  other
2122       words,  a  negative  number counts capturing parentheses leftwards from
2123       the point at which it is encountered.
2124
2125       It is also possible to refer to  subsequently  opened  parentheses,  by
2126       writing  references  such  as (?+2). However, these cannot be recursive
2127       because the reference is not inside the  parentheses  that  are  refer‐
2128       enced.  They are always non-recursive subroutine calls, as described in
2129       the next section.
2130
2131       An alternative approach is to use named parentheses instead.  The  Perl
2132       syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also
2133       supported. We could rewrite the above example as follows:
2134
2135         (?<pn> \( ( [^()]++ | (?&pn) )* \) )
2136
2137       If there is more than one subpattern with the same name,  the  earliest
2138       one is used.
2139
2140       This  particular  example pattern that we have been looking at contains
2141       nested unlimited repeats, and so the use of a possessive quantifier for
2142       matching strings of non-parentheses is important when applying the pat‐
2143       tern to strings that do not match. For example, when  this  pattern  is
2144       applied to
2145
2146         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2147
2148       it  yields  "no  match" quickly. However, if a possessive quantifier is
2149       not used, the match runs for a very long time indeed because there  are
2150       so  many  different  ways the + and * repeats can carve up the subject,
2151       and all have to be tested before failure can be reported.
2152
2153       At the end of a match, the values of capturing  parentheses  are  those
2154       from  the outermost level. If you want to obtain intermediate values, a
2155       callout function can be used (see below and the pcrecallout  documenta‐
2156       tion). If the pattern above is matched against
2157
2158         (ab(cd)ef)
2159
2160       the  value  for  the  inner capturing parentheses (numbered 2) is "ef",
2161       which is the last value taken on at the top level. If a capturing  sub‐
2162       pattern  is  not  matched at the top level, its final captured value is
2163       unset, even if it was (temporarily) set at a deeper  level  during  the
2164       matching process.
2165
2166       If  there are more than 15 capturing parentheses in a pattern, PCRE has
2167       to obtain extra memory to store data during a recursion, which it  does
2168       by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
2169       can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
2170
2171       Do not confuse the (?R) item with the condition (R),  which  tests  for
2172       recursion.   Consider  this pattern, which matches text in angle brack‐
2173       ets, allowing for arbitrary nesting. Only digits are allowed in  nested
2174       brackets  (that is, when recursing), whereas any characters are permit‐
2175       ted at the outer level.
2176
2177         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
2178
2179       In this pattern, (?(R) is the start of a conditional  subpattern,  with
2180       two  different  alternatives for the recursive and non-recursive cases.
2181       The (?R) item is the actual recursive call.
2182
2183   Differences in recursion processing between PCRE and Perl
2184
2185       Recursion processing in PCRE differs from Perl in two  important  ways.
2186       In  PCRE (like Python, but unlike Perl), a recursive subpattern call is
2187       always treated as an atomic group. That is, once it has matched some of
2188       the subject string, it is never re-entered, even if it contains untried
2189       alternatives and there is a subsequent matching failure.  This  can  be
2190       illustrated  by the following pattern, which purports to match a palin‐
2191       dromic string that contains an odd number of characters  (for  example,
2192       "a", "aba", "abcba", "abcdcba"):
2193
2194         ^(.|(.)(?1)\2)$
2195
2196       The idea is that it either matches a single character, or two identical
2197       characters surrounding a sub-palindrome. In Perl, this  pattern  works;
2198       in PCRE it does not if the subject is longer  than  three  characters.
2199       Consider the subject string "abcba":
2200
2201       At the top level, the first character is matched, but as it is  not  at
2202       the end of the string, the first alternative fails; the second alterna‐
2203       tive is taken and the recursion kicks in. The recursive call to subpat‐
2204       tern  1  successfully  matches the next character ("b"). (Note that the
2205       beginning and end of line tests are not part of the recursion).
2206
2207       Back at the top level, the next character ("c") is compared  with  what
2208       subpattern  2 matched, which was "a". This fails. Because the recursion
2209       is treated as an atomic group, there are now  no  backtracking  points,
2210       and  so  the  entire  match fails. (Perl is able, at this point, to re-
2211       enter the recursion and try the second alternative.)  However,  if  the
2212       pattern is written with the alternatives in the other order, things are
2213       different:
2214
2215         ^((.)(?1)\2|.)$
2216
2217       This time, the recursing alternative is tried first, and  continues  to
2218       recurse  until  it runs out of characters, at which point the recursion
2219       fails. But this time we do have  another  alternative  to  try  at  the
2220       higher  level.  That  is  the  big difference: in the previous case the
2221       remaining alternative is at a deeper recursion level, which PCRE cannot
2222       use.
2223
2224       To  change  the pattern so that it matches all palindromic strings, not
2225       just those with an odd number of characters, it is tempting  to  change
2226       the pattern to this:
2227
2228         ^((.)(?1)\2|.?)$
2229
2230       Again,  this  works  in Perl, but not in PCRE, and for the same reason.
2231       When a deeper recursion has matched a single character,  it  cannot  be
2232       entered  again  in  order  to match an empty string. The solution is to
2233       separate the two cases, and write out the odd and even cases as  alter‐
2234       natives at the higher level:
2235
2236         ^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
2237
2238       If  you  want  to match typical palindromic phrases, the pattern has to
2239       ignore all non-word characters, which can be done like this:
2240
2241         ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
2242
2243       If run with the PCRE_CASELESS option, this pattern matches phrases such
2244       as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
2245       Perl. Note the use of the possessive quantifier *+ to avoid  backtrack‐
2246       ing  into  sequences of non-word characters. Without this, PCRE takes a
2247       great deal longer (ten times or more) to  match  typical  phrases,  and
2248       Perl takes so long that you think it has gone into a loop.
2249
2250       WARNING:  The  palindrome-matching patterns above work only if the sub‐
2251       ject string does not start with a palindrome that is shorter  than  the
2252       entire  string.  For example, although "abcba" is correctly matched, if
2253       the subject is "ababa", PCRE finds the palindrome "aba" at  the  start,
2254       then  fails at top level because the end of the string does not follow.
2255       Once again, it cannot jump back into the recursion to try other  alter‐
2256       natives, so the entire match fails.
2257
2258       The  second  way  in which PCRE and Perl differ in their recursion pro‐
2259       cessing is in the handling of captured values. In Perl, when a  subpat‐
2260       tern  is  called recursively or as a subroutine (see the next section),
2261       it has no access to any values that were captured  outside  the  recur‐
2262       sion,  whereas  in  PCRE  these values can be referenced. Consider this
2263       pattern:
2264
2265         ^(.)(\1|a(?2))
2266
2267       In PCRE, this pattern matches "bab". The  first  capturing  parentheses
2268       match  "b",  then in the second group, when the back reference \1 fails
2269       to match "b", the second alternative matches "a" and then recurses.  In
2270       the  recursion,  \1 does now match "b" and so the whole match succeeds.
2271       In Perl, the pattern fails to match because inside the  recursive  call
2272       \1 cannot access the externally set value.
2273

SUBPATTERNS AS SUBROUTINES

2275
2276       If  the  syntax for a recursive subpattern call (either by number or by
2277       name) is used outside the parentheses to which it refers,  it  operates
2278       like  a subroutine in a programming language. The called subpattern may
2279       be defined before or after the reference. A numbered reference  can  be
2280       absolute or relative, as in these examples:
2281
2282         (...(absolute)...)...(?2)...
2283         (...(relative)...)...(?-1)...
2284         (...(?+1)...(relative)...
2285
2286       An earlier example pointed out that the pattern
2287
2288         (sens|respons)e and \1ibility
2289
2290       matches  "sense and sensibility" and "response and responsibility", but
2291       not "sense and responsibility". If instead the pattern
2292
2293         (sens|respons)e and (?1)ibility
2294
2295       is used, it does match "sense and responsibility" as well as the  other
2296       two  strings.  Another  example  is  given  in the discussion of DEFINE
2297       above.
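
       The contrast between \1 and (?1) can be approximated  in  a  sketch.
       Python's  re module (an assumption of this example) has no subroutine
       calls, so the call is emulated by repeating the  text  of  the  group;
       the variable names are hypothetical.

```python
import re

GROUP = r"sens|respons"

# \1 must match the same text that group 1 captured.
backref = re.compile(r"(" + GROUP + r")e and \1ibility")

# Stand-in for (?1): the group's pattern is applied again, so either
# alternative may match regardless of what group 1 captured.
subroutine_like = re.compile(r"(" + GROUP + r")e and (?:" + GROUP + r")ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
table = {
    s: (backref.fullmatch(s) is not None,
        subroutine_like.fullmatch(s) is not None)
    for s in subjects
}

for subject, row in table.items():
    print(subject, row)
```

       Only the mixed subject separates the two: the back reference  rejects
       it, while the repeated group accepts it.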
2298
2299       All subroutine calls, whether recursive or not, are always  treated  as
2300       atomic  groups. That is, once a subroutine has matched some of the sub‐
2301       ject string, it is never re-entered, even if it contains untried alter‐
2302       natives  and  there  is  a  subsequent  matching failure. Any capturing
2303       parentheses that are set during the subroutine  call  revert  to  their
2304       previous values afterwards.
2305
2306       Processing  options  such as case-independence are fixed when a subpat‐
2307       tern is defined, so if it is used as a subroutine, such options  cannot
2308       be changed for different calls. For example, consider this pattern:
2309
2310         (abc)(?i:(?-1))
2311
2312       It  matches  "abcabc". It does not match "abcABC" because the change of
2313       processing option does not affect the called subpattern.
2314

ONIGURUMA SUBROUTINE SYNTAX

2316
2317       For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
2318       name or a number enclosed either in angle brackets or single quotes, is
2319       an alternative syntax for referencing a  subpattern  as  a  subroutine,
2320       possibly  recursively. Here are two of the examples used above, rewrit‐
2321       ten using this syntax:
2322
2323         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
2324         (sens|respons)e and \g'1'ibility
2325
2326       PCRE supports an extension to Oniguruma: if a number is preceded  by  a
2327       plus or a minus sign it is taken as a relative reference. For example:
2328
2329         (abc)(?i:\g<-1>)
2330
2331       Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
2332       synonymous. The former is a back reference; the latter is a  subroutine
2333       call.
2334

CALLOUTS

2336
2337       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
2338       Perl code to be obeyed in the middle of matching a regular  expression.
2339       This makes it possible, amongst other things, to extract different sub‐
2340       strings that match the same pair of parentheses when there is a repeti‐
2341       tion.
2342
2343       PCRE provides a similar feature, but of course it cannot obey arbitrary
2344       Perl code. The feature is called "callout". The caller of PCRE provides
2345       an  external function by putting its entry point in the global variable
2346       pcre_callout (8-bit library) or pcre[16|32]_callout (16-bit  or  32-bit
2347       library).   By default, this variable contains NULL, which disables all
2348       calling out.
2349
2350       Within a regular expression, (?C) indicates the  points  at  which  the
2351       external  function  is  to be called. If you want to identify different
2352       callout points, you can put a number less than 256 after the letter  C.
2353       The  default  value is zero.  For example, this pattern has two callout
2354       points:
2355
2356         (?C1)abc(?C2)def
2357
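       As a rough illustration of the numbering rule, the following Python
       sketch lists the callout points written explicitly in a pattern. It
       is a naive text scan, not PCRE's compiler: it ignores escapes and
       character classes, and the name "callout_points" is hypothetical.

```python
import re

# Matches (?C) or (?Cn); a missing number means the default value zero.
CALLOUT = re.compile(r"\(\?C([0-9]*)\)")


def callout_points(pattern):
    """Return (callout number, pattern offset) pairs for explicit callouts."""
    points = []
    for m in CALLOUT.finditer(pattern):
        number = int(m.group(1)) if m.group(1) else 0
        if number > 255:
            raise ValueError("callout number must be less than 256")
        points.append((number, m.start()))
    return points


print(callout_points("(?C1)abc(?C2)def"))  # [(1, 0), (2, 8)]
```
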
2358       If the PCRE_AUTO_CALLOUT flag is passed to a compiling function,  call‐
2359       outs  are automatically installed before each item in the pattern. They
2360       are all numbered 255.
2361
2362       During matching, when PCRE reaches a callout point, the external  func‐
2363       tion  is  called.  It  is  provided with the number of the callout, the
2364       position in the pattern, and, optionally, one item of  data  originally
2365       supplied  by  the caller of the matching function. The callout function
2366       may cause matching to proceed, to backtrack, or to fail  altogether.  A
2367       complete  description of the interface to the callout function is given
2368       in the pcrecallout documentation.
2369

BACKTRACKING CONTROL

2371
2372       Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",
2373       which are described in the Perl documentation as "experimental and sub‐
2374       ject to change or removal in a future version of Perl". It goes  on  to
2375       say:  "Their usage in production code should be noted to avoid problems
2376       during upgrades." The same remarks apply to the PCRE features described
2377       in this section.
2378
2379       Since  these  verbs  are  specifically related to backtracking, most of
2380       them can be used only when the pattern is to be matched  using  one  of
2381       the traditional matching functions, which use a backtracking algorithm.
2382       With the exception of (*FAIL), which behaves like  a  failing  negative
2383       assertion,  they  cause an error if encountered by a DFA matching func‐
2384       tion.
2385
2386       If any of these verbs are used in an assertion or in a subpattern  that
2387       is called as a subroutine (whether or not recursively), their effect is
2388       confined to that subpattern; it does not extend to the surrounding
2389       pattern, with one exception: the name from a (*MARK), (*PRUNE), or (*THEN)
2390       that is encountered in a successful positive assertion is  passed  back
2391       when  a  match  succeeds (compare capturing parentheses in assertions).
2392       Note that such subpatterns are processed as anchored at the point where
2393       they  are  tested.  Note  also that Perl's treatment of subroutines and
2394       assertions is different in some cases.
2395
2396       The new verbs make use of what was previously invalid syntax: an  open‐
2397       ing parenthesis followed by an asterisk. They are generally of the form
2398       (*VERB) or (*VERB:NAME). Some may take either form, with differing  be‐
2399       haviour,  depending on whether or not an argument is present. A name is
2400       any sequence of characters that does not include a closing parenthesis.
2401       The maximum length of a name is 255 in the 8-bit library and 65535 in the
2402       16-bit and 32-bit libraries. If the name is empty, that is, if the closing
2403       parenthesis immediately follows the colon, the effect is as if the
2404       colon were not there. Any number of these verbs may occur in a pattern.
2405
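       The verb syntax just described can be sketched as a small parser. This
       is an illustration under stated assumptions, not PCRE's own code: the
       name "parse_verb" is hypothetical, verb names are not validated, and
       only the 8-bit length limit is checked (assuming one character per
       code unit).

```python
import re

# (*VERB) or (*VERB:NAME); a name is any text without a closing parenthesis.
VERB = re.compile(r"\(\*([A-Z_]*)(?::([^)]*))?\)")
MAX_NAME_8BIT = 255


def parse_verb(text):
    """Return (verb, name) for a verb item, or None if it is not one."""
    m = VERB.fullmatch(text)
    if m is None:
        return None
    verb, name = m.group(1), m.group(2)
    if verb == "":
        if not name:
            return None       # "(*)" and "(*:)" carry no verb at all
        verb = "MARK"         # (*:NAME) is shorthand for (*MARK:NAME)
    if name and len(name) > MAX_NAME_8BIT:
        raise ValueError("name too long for the 8-bit library")
    # An empty name behaves as if the colon were not there.
    return verb, (name or None)


print(parse_verb("(*PRUNE:)"))  # ('PRUNE', None)
print(parse_verb("(*:B)"))      # ('MARK', 'B')
```
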
2406   Optimizations that affect backtracking verbs
2407
2408       PCRE contains some optimizations that are used to speed up matching  by
2409       running some checks at the start of each match attempt. For example, it
2410       may know the minimum length of matching subject, or that  a  particular
2411       character  must  be present. When one of these optimizations suppresses
2412       the running of a match, any included backtracking verbs  will  not,  of
2413       course, be processed. You can suppress the start-of-match optimizations
2414       by setting the PCRE_NO_START_OPTIMIZE  option  when  calling  pcre_com‐
2415       pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).
2416       There is more discussion of this option in the section entitled "Option
2417       bits for pcre_exec()" in the pcreapi documentation.
2418
2419       Experiments  with  Perl  suggest that it too has similar optimizations,
2420       sometimes leading to anomalous results.
2421
2422   Verbs that act immediately
2423
2424       The following verbs act as soon as they are encountered. They  may  not
2425       be followed by a name.
2426
2427         (*ACCEPT)
2428
2429       This  verb causes the match to end successfully, skipping the remainder
2430       of the pattern. However, when it is inside a subpattern that is  called
2431       as  a  subroutine, only that subpattern is ended successfully. Matching
2432       then continues at the outer level. If  (*ACCEPT)  is  inside  capturing
2433       parentheses, the data so far is captured. For example:
2434
2435         A((?:A|B(*ACCEPT)|C)D)
2436
2437       This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap‐
2438       tured by the outer parentheses.
2439
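       For this one pattern the effect of (*ACCEPT) can be reproduced without
       the verb, which makes the captured text easy to inspect. The Python
       sketch below is a hand rewrite that applies only to this example
       (Python's re module has no (*ACCEPT)); the name "rewritten" is
       hypothetical.

```python
import re

# The B branch ends the match at once, so it is moved outside the part of
# the group that must be followed by D.
rewritten = re.compile(r"A((?:A|C)D|B)")

for subject in ("AB", "AAD", "ACD"):
    m = rewritten.match(subject)
    # Whole match, then the text captured by the outer parentheses.
    print(subject, m.group(0), m.group(1))
```

       For "AB" the captured text is "B", as stated above.
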
2440         (*FAIL) or (*F)
2441
2442       This verb causes a matching failure, forcing backtracking to occur.  It
2443       is  equivalent to (?!) but easier to read. The Perl documentation notes
2444       that it is probably useful only when combined  with  (?{})  or  (??{}).
2445       Those  are,  of course, Perl features that are not present in PCRE. The
2446       nearest equivalent is the callout feature, as for example in this  pat‐
2447       tern:
2448
2449         a+(?C)(*FAIL)
2450
2451       A  match  with the string "aaaa" always fails, but the callout is taken
2452       before each backtrack happens (in this example, 10 times).
2453
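       The figure of 10 callouts can be reproduced by walking through the
       backtracking by hand. The following Python sketch is a simulation,
       not a call into PCRE; it assumes no start-of-match optimizations, and
       the name "count_callouts" is hypothetical.

```python
def count_callouts(subject):
    """Count the (?C) calls made while a+(?C)(*FAIL) fails on subject."""
    calls = 0
    for start in range(len(subject) + 1):
        # Length of the run of "a" characters available at this start.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # Greedy a+ takes the whole run first, then gives back one
        # character at a time; each length reaches (?C) before (*FAIL).
        for _length in range(run, 0, -1):
            calls += 1
    return calls


print(count_callouts("aaaa"))  # 10
```
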
2454   Recording which path was taken
2455
2456       There is one verb whose main purpose  is  to  track  how  a  match  was
2457       arrived  at,  though  it  also  has a secondary use in conjunction with
2458       advancing the match starting point (see (*SKIP) below).
2459
2460         (*MARK:NAME) or (*:NAME)
2461
2462       A name is always  required  with  this  verb.  There  may  be  as  many
2463       instances  of  (*MARK) as you like in a pattern, and their names do not
2464       have to be unique.
2465
2466       When a match succeeds, the name of the last-encountered (*MARK) on  the
2467       matching  path is passed back to the caller as described in the section
2468       entitled "Extra data for pcre_exec()"  in  the  pcreapi  documentation.
2469       Here  is  an example of pcretest output, where the /K modifier requests
2470       the retrieval and outputting of (*MARK) data:
2471
2472           re> /X(*MARK:A)Y|X(*MARK:B)Z/K
2473         data> XY
2474          0: XY
2475         MK: A
2476         XZ
2477          0: XZ
2478         MK: B
2479
2480       The (*MARK) name is tagged with "MK:" in this output, and in this exam‐
2481       ple  it indicates which of the two alternatives matched. This is a more
2482       efficient way of obtaining this information than putting each  alterna‐
2483       tive in its own capturing parentheses.
2484
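       For comparison, this is what the capturing-parentheses approach looks
       like in an engine without (*MARK). The sketch uses Python's re module,
       where the lastgroup attribute names the last group that matched; the
       group names A and B and the function name are chosen here for
       illustration.

```python
import re

# Each alternative sits in its own named capturing group.
alternatives = re.compile(r"(?P<A>XY)|(?P<B>XZ)")


def which_alternative(subject):
    """Return the name of the alternative that matched, or None."""
    m = alternatives.search(subject)
    return m.lastgroup if m else None


print(which_alternative("XY"))  # A
print(which_alternative("XZ"))  # B
```
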
2485       If (*MARK) is encountered in a positive assertion, its name is recorded
2486       and passed back if it is the last-encountered. This does not happen for
2487       negative assertions.
2488
2489       After  a  partial match or a failed match, the name of the last encoun‐
2490       tered (*MARK) in the entire match process is returned. For example:
2491
2492           re> /X(*MARK:A)Y|X(*MARK:B)Z/K
2493         data> XP
2494         No match, mark = B
2495
2496       Note that in this unanchored example the  mark  is  retained  from  the
2497       match attempt that started at the letter "X" in the subject. Subsequent
2498       match attempts starting at "P" and then with an empty string do not get
2499       as far as the (*MARK) item, but nevertheless do not reset it.
2500
2501       If  you  are  interested  in  (*MARK)  values after failed matches, you
2502       should probably set the PCRE_NO_START_OPTIMIZE option  (see  above)  to
2503       ensure that the match is always attempted.
2504
2505   Verbs that act after backtracking
2506
2507       The following verbs do nothing when they are encountered. Matching con‐
2508       tinues with what follows, but if there is no subsequent match,  causing
2509       a  backtrack  to  the  verb, a failure is forced. That is, backtracking
2510       cannot pass to the left of the verb. However, when one of  these  verbs
2511       appears  inside  an atomic group, its effect is confined to that group,
2512       because once the group has been matched, there is never any  backtrack‐
2513       ing  into  it.  In  this situation, backtracking can "jump back" to the
2514       left of the entire atomic group. (Remember also, as stated above,  that
2515       this localization also applies in subroutine calls and assertions.)
2516
2517       These  verbs  differ  in exactly what kind of failure occurs when back‐
2518       tracking reaches them.
2519
2520         (*COMMIT)
2521
2522       This verb, which may not be followed by a name, causes the whole  match
2523       to fail outright if the rest of the pattern does not match. Even if the
2524       pattern is unanchored, no further attempts to find a match by advancing
2525       the  starting  point  take  place.  Once  (*COMMIT)  has  been  passed,
2526       pcre_exec() is committed to finding a match  at  the  current  starting
2527       point, or not at all. For example:
2528
2529         a+(*COMMIT)b
2530
2531       This  matches  "xxaab" but not "aacaab". It can be thought of as a kind
2532       of dynamic anchor, or "I've started, so I must finish." The name of the
2533       most  recently passed (*MARK) in the path is passed back when (*COMMIT)
2534       forces a match failure.
2535
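       The two subjects above can be traced with a hand simulation. The
       Python sketch below models only this one pattern, a+(*COMMIT)b, with
       no start-of-match optimizations; it does not call PCRE, and the
       function name is hypothetical.

```python
def match_a_plus_commit_b(subject):
    """Return the text matched by a+(*COMMIT)b, or None."""
    for start in range(len(subject)):
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end == start:
            continue  # a+ failed before (*COMMIT): ordinary bumpalong
        # (*COMMIT) has been passed: match here or fail outright, with no
        # backtracking into a+ and no further starting points.
        if subject.startswith("b", end):
            return subject[start:end + 1]
        return None
    return None


print(match_a_plus_commit_b("xxaab"))   # aab
print(match_a_plus_commit_b("aacaab"))  # None
```
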
2536       Note that (*COMMIT) at the start of a pattern is not  the  same  as  an
2537       anchor,  unless  PCRE's start-of-match optimizations are turned off, as
2538       shown in this pcretest example:
2539
2540           re> /(*COMMIT)abc/
2541         data> xyzabc
2542          0: abc
2543         xyzabc\Y
2544         No match
2545
2546       PCRE knows that any match must start  with  "a",  so  the  optimization
2547       skips  along the subject to "a" before running the first match attempt,
2548       which succeeds. When the optimization is disabled by the \Y  escape  in
2549       the second subject, the match starts at "x" and so the (*COMMIT) causes
2550       it to fail without trying any other starting points.
2551
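       The interaction with the optimization can be modelled in a few lines.
       In this Python sketch the start-of-match optimization is reduced to
       "skip ahead to the first a", which is a simplification of what PCRE
       does; the function name and the start_optimize parameter are
       hypothetical.

```python
def match_commit_abc(subject, start_optimize=True):
    """Hand simulation of (*COMMIT)abc on subject."""
    # With the optimization, the first attempt begins at the first "a";
    # without it, the first attempt begins at offset zero.
    start = subject.find("a") if start_optimize else 0
    if start < 0:
        return None
    # (*COMMIT) is passed immediately, so only this one start is tried.
    return "abc" if subject.startswith("abc", start) else None


print(match_commit_abc("xyzabc"))                        # abc
print(match_commit_abc("xyzabc", start_optimize=False))  # None
```
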
2552         (*PRUNE) or (*PRUNE:NAME)
2553
2554       This verb causes the match to fail at the current starting position  in
2555       the  subject  if the rest of the pattern does not match. If the pattern
2556       is unanchored, the normal "bumpalong"  advance  to  the  next  starting
2557       character  then happens. Backtracking can occur as usual to the left of
2558       (*PRUNE), before it is reached,  or  when  matching  to  the  right  of
2559       (*PRUNE),  but  if  there is no match to the right, backtracking cannot
2560       cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an  alter‐
2561       native  to an atomic group or possessive quantifier, but there are some
2562       uses of (*PRUNE) that cannot be expressed in any other way.  The behav‐
2563       iour  of  (*PRUNE:NAME)  is  the  same  as  (*MARK:NAME)(*PRUNE). In an
2564       anchored pattern (*PRUNE) has the same effect as (*COMMIT).
2565
2566         (*SKIP)
2567
2568       This verb, when given without a name, is like (*PRUNE), except that  if
2569       the  pattern  is unanchored, the "bumpalong" advance is not to the next
2570       character, but to the position in the subject where (*SKIP) was encoun‐
2571       tered.  (*SKIP)  signifies that whatever text was matched leading up to
2572       it cannot be part of a successful match. Consider:
2573
2574         a+(*SKIP)b
2575
2576       If the subject is "aaaac...",  after  the  first  match  attempt  fails
2577       (starting  at  the  first  character in the string), the starting point
2578       skips on to start the next attempt at "c". Note that a possessive
2579       quantifier does not have the same effect as this example; although it would
2580       suppress backtracking  during  the  first  match  attempt,  the  second
2581       attempt  would  start at the second character instead of skipping on to
2582       "c".
2583
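       The difference in "bumpalong" behaviour can be made visible by
       listing the starting offsets that are tried. The Python sketch below
       simulates a+(*VERB)b by hand for the verbs PRUNE and SKIP, assuming
       no start-of-match optimizations; it does not call PCRE, and the
       function name is hypothetical. The PRUNE case also shows the offsets
       a possessive quantifier would give.

```python
def attempted_starts(subject, verb):
    """List the start offsets tried for a+(*VERB)b, stopping at a match."""
    starts = []
    start = 0
    while start <= len(subject):
        starts.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end > start and subject.startswith("b", end):
            break                 # the pattern matched at this start
        if verb == "SKIP" and end > start:
            start = end           # resume where (*SKIP) was encountered
        else:
            start += 1            # ordinary advance by one character
    return starts


print(attempted_starts("aaaac", "SKIP"))   # [0, 4, 5]
print(attempted_starts("aaaac", "PRUNE"))  # [0, 1, 2, 3, 4, 5]
```
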
2584         (*SKIP:NAME)
2585
2586       When (*SKIP) has an associated name, its behaviour is modified. If  the
2587       following pattern fails to match, the previous path through the pattern
2588       is searched for the most recent (*MARK) that has the same name. If  one
2589       is  found, the "bumpalong" advance is to the subject position that cor‐
2590       responds to that (*MARK) instead of to where (*SKIP)  was  encountered.
2591       If no (*MARK) with a matching name is found, the (*SKIP) is ignored.
2592
2593         (*THEN) or (*THEN:NAME)
2594
2595       This  verb  causes a skip to the next innermost alternative if the rest
2596       of the pattern does not match. That is, it cancels  pending  backtrack‐
2597       ing,  but  only within the current alternative. Its name comes from the
2598       observation that it can be used for a pattern-based if-then-else block:
2599
2600         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
2601
2602       If the COND1 pattern matches, FOO is tried (and possibly further  items
2603       after  the  end  of the group if FOO succeeds); on failure, the matcher
2604       skips to the second alternative and tries COND2,  without  backtracking
2605       into  COND1.  The  behaviour  of  (*THEN:NAME)  is  exactly the same as
2606       (*MARK:NAME)(*THEN).  If (*THEN) is not inside an alternation, it  acts
2607       like (*PRUNE).
2608
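       The if-then-else reading can be sketched as follows. This Python
       simulation treats each alternative as a (COND, BODY) pair and takes
       only the first way COND matches, which models "no backtracking into
       COND". It ignores anything that follows the group, it does not call
       PCRE, and the function name is hypothetical.

```python
import re


def if_then_else(alternatives, subject, pos=0):
    """Simulate ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | ... ) at pos.

    Returns (index of the alternative used, matched text), or None.
    """
    for index, (cond, body) in enumerate(alternatives):
        c = re.compile(cond).match(subject, pos)
        if c is None:
            continue  # the condition failed: try the next alternative
        b = re.compile(body).match(subject, c.end())
        if b is not None:
            return index, subject[pos:b.end()]
        # The body failed: (*THEN) moves on to the next alternative
        # without retrying a shorter match for the condition.
    return None


# a+ takes "aa", so "ab" cannot follow; the second alternative is used.
print(if_then_else([(r"a+", r"ab"), (r"a", r"ab")], "aab"))  # (1, 'aab')
# With a single alternative there is nowhere to skip to.
print(if_then_else([(r"a+", r"ab")], "aab"))                 # None
```

       By contrast, the plain pattern a+ab matches "aab", because a+ is
       allowed to give back one character.
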
2609       Note  that  a  subpattern that does not contain a | character is just a
2610       part of the enclosing alternative; it is not a nested alternation  with
2611       only  one alternative. The effect of (*THEN) extends beyond such a sub‐
2612       pattern to the enclosing alternative. Consider this pattern,  where  A,
2613       B, etc. are complex pattern fragments that do not contain any | charac‐
2614       ters at this level:
2615
2616         A (B(*THEN)C) | D
2617
2618       If A and B are matched, but there is a failure in C, matching does  not
2619       backtrack into A; instead it moves to the next alternative, that is, D.
2620       However, if the subpattern containing (*THEN) is given an  alternative,
2621       it behaves differently:
2622
2623         A (B(*THEN)C | (*FAIL)) | D
2624
2625       The  effect of (*THEN) is now confined to the inner subpattern. After a
2626       failure in C, matching moves to (*FAIL), which causes the whole subpat‐
2627       tern  to  fail  because  there are no more alternatives to try. In this
2628       case, matching does now backtrack into A.
2629
2630       Note also that a conditional subpattern is not considered as having two
2631       alternatives,  because  only  one  is  ever used. In other words, the |
2632       character in a conditional subpattern has a different meaning. Ignoring
2633       white space, consider:
2634
2635         ^.*? (?(?=a) a | b(*THEN)c )
2636
2637       If  the  subject  is  "ba", this pattern does not match. Because .*? is
2638       ungreedy, it initially matches zero  characters.  The  condition  (?=a)
2639       then  fails,  the  character  "b"  is  matched, but "c" is not. At this
2640       point, matching does not backtrack to .*? as might perhaps be  expected
2641       from  the  presence  of  the | character. The conditional subpattern is
2642       part of the single alternative that comprises the whole pattern, and so
2643       the match fails. (If there were a backtrack into .*?, allowing it to
2644       match "b", the match would succeed.)
2645
2646       The verbs just described provide four different "strengths" of  control
2647       when subsequent matching fails. (*THEN) is the weakest, carrying on the
2648       match at the next alternative. (*PRUNE) comes next, failing  the  match
2649       at  the  current starting position, but allowing an advance to the next
2650       character (for an unanchored pattern). (*SKIP) is similar, except  that
2651       the advance may be more than one character. (*COMMIT) is the strongest,
2652       causing the entire match to fail.
2653
2654       If more than one such verb is present in a pattern, the "strongest" one
2655       wins.  For example, consider this pattern, where A, B, etc. are complex
2656       pattern fragments:
2657
2658         (A(*COMMIT)B(*THEN)C|D)
2659
2660       Once A has matched, PCRE is committed to this  match,  at  the  current
2661       starting  position. If subsequently B matches, but C does not, the nor‐
2662       mal (*THEN) action of trying the next alternative (that is, D) does not
2663       happen because (*COMMIT) overrides.
2664

AUTHOR

2671
2672       Philip Hazel
2673       University Computing Service
2674       Cambridge CB2 3QH, England.
2675

REVISION

2677
2678       Last updated: 11 November 2012
2679       Copyright (c) 1997-2012 University of Cambridge.
2680
2681
2682
2683PCRE 8.32                      11 November 2012                 PCREPATTERN(3)