pcre2pattern(3)

PCRE2PATTERN(3)            Library Functions Manual            PCRE2PATTERN(3)

NAME

       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 REGULAR EXPRESSION DETAILS

       The  syntax and semantics of the regular expressions that are supported
       by PCRE2 are described in detail below. There is a quick-reference syn‐
       tax  summary  in the pcre2syntax page. PCRE2 tries to match Perl syntax
       and semantics as closely as it can.  PCRE2 also supports some  alterna‐
       tive  regular  expression syntax (which does not conflict with the Perl
       syntax) in order to provide some compatibility with regular expressions
       in Python, .NET, and Oniguruma.

       Perl's  regular expressions are described in its own documentation, and
       regular expressions in general are covered in a number of  books,  some
       of  which  have  copious  examples. Jeffrey Friedl's "Mastering Regular
       Expressions", published by  O'Reilly,  covers  regular  expressions  in
       great  detail.  This  description  of  PCRE2's  regular  expressions is
       intended as reference material.

       This document discusses the regular expression patterns that  are  sup‐
       ported  by  PCRE2  when  its  main matching function, pcre2_match(), is
       used.   PCRE2   also   has   an    alternative    matching    function,
       pcre2_dfa_match(),  which  matches  using a different algorithm that is
       not Perl-compatible. Some of  the  features  discussed  below  are  not
       available  when  DFA matching is used. The advantages and disadvantages
       of the alternative function, and how it differs from the  normal  func‐
       tion, are discussed in the pcre2matching page.

SPECIAL START-OF-PATTERN ITEMS

       A  number  of options that can be passed to pcre2_compile() can also be
       set by special items at the start of a pattern. These are not Perl-com‐
       patible,  but  are provided to make these options accessible to pattern
       writers who are not able to change the program that processes the  pat‐
       tern.  Any  number  of  these  items  may  appear, but they must all be
       together right at the start of the pattern string, and the letters must
       be in upper case.

   UTF support

       In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
       as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
       can  be  specified  for the 32-bit library, in which case it constrains
       the character values to valid  Unicode  code  points.  To  process  UTF
       strings,  PCRE2  must be built to include Unicode support (which is the
       default). When using UTF strings you must  either  call  the  compiling
       function  with  one or both of the PCRE2_UTF or PCRE2_MATCH_INVALID_UTF
       options, or the pattern must start with the  special  sequence  (*UTF),
       which is equivalent to setting the PCRE2_UTF  option.  How  setting  a
       UTF mode affects pattern matching is mentioned in several places below.
       There is also a summary of features in the pcre2unicode page.

       Some applications that allow their users to supply patterns may wish to
       restrict  them  to  non-UTF  data  for   security   reasons.   If   the
       PCRE2_NEVER_UTF  option  is  passed  to  pcre2_compile(), (*UTF) is not
       allowed, and its appearance in a pattern causes an error.

   Unicode property support

       Another special sequence that may appear at the start of a  pattern  is
       (*UCP).   This  has the same effect as setting the PCRE2_UCP option: it
       causes sequences such as \d and \w to use Unicode properties to  deter‐
       mine character types, instead of recognizing only characters with codes
       less than 256 via a lookup table. It  also  causes  upper/lower  casing
       operations  to  use  Unicode properties for characters with code points
       greater than 127, even when UTF is not set.

       Some applications that allow their users to supply patterns may wish to
       restrict  them  for  security reasons. If the PCRE2_NEVER_UCP option is
       passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
       a pattern causes an error.

   Locking out empty string matching

       Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
       effect as passing the PCRE2_NOTEMPTY or  PCRE2_NOTEMPTY_ATSTART  option
       to whichever matching function is subsequently called to match the pat‐
       tern. These options lock out the  matching  of  empty  strings,  either
       entirely, or only at the start of the subject.

   Disabling auto-possessification

       If  a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
       setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from  making
       quantifiers  possessive  when  what  follows  cannot match the repeated
       item. For example, by default a+b is treated as a++b. For more details,
       see the pcre2api documentation.

   Disabling start-up optimizations

       If  a  pattern  starts  with (*NO_START_OPT), it has the same effect as
       setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti‐
       mizations  for  quickly  reaching "no match" results. For more details,
       see the pcre2api documentation.

   Disabling automatic anchoring

       If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the  same  effect
       as  setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza‐
       tions that apply to patterns whose top-level branches all start with .*
       (match  any  number of arbitrary characters). For more details, see the
       pcre2api documentation.

   Disabling JIT compilation

       If a pattern that starts with (*NO_JIT) is  successfully  compiled,  an
       attempt  by  the  application  to apply the JIT optimization by calling
       pcre2_jit_compile() is ignored.

   Setting match resource limits

       The pcre2_match() function contains a counter that is incremented every
       time it goes round its main loop. The caller of pcre2_match() can set a
       limit on this counter, which therefore limits the amount  of  computing
       resource used for a match. The maximum depth of nested backtracking can
       also be limited; this indirectly restricts the amount  of  heap  memory
       that  is  used,  but there is also an explicit memory limit that can be
       set.

       These facilities are provided to catch runaway matches  that  are  pro‐
       voked  by patterns with huge matching trees. A common example is a pat‐
       tern with nested unlimited repeats applied to a long string  that  does
       not  match. When one of these limits is reached, pcre2_match() gives an
       error return. The limits can also be set by items at the start  of  the
       pattern of the form

         (*LIMIT_HEAP=d)
         (*LIMIT_MATCH=d)
         (*LIMIT_DEPTH=d)

       where d is any number of decimal digits. However, the value of the set‐
       ting must be less than the value set (or defaulted) by  the  caller  of
       pcre2_match()  for  it  to have any effect. In other words, the pattern
       writer can lower the limits set by the programmer, but not raise  them.
       If  there  is  more  than one setting of one of these limits, the lower
       value is used. The heap limit is specified in kibibytes (units of  1024
       bytes).

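       The rule that a pattern item can lower a limit but never raise it
       can be modelled in a few lines of Python. This is only a sketch of
       the behaviour described above, not PCRE2 code: the function name
       effective_limits and the dictionary of caller limits are
       hypothetical, and the scan simply stops at the first leading item
       it does not recognize.

```python
import re

# One leading item: (*NAME) or (*NAME=digits).
_ITEM = re.compile(r"\(\*([A-Z_]+)(?:=(\d+))?\)")

# Option items named in this page that carry no value.
_FLAG_ITEMS = {
    "UTF", "UCP", "NOTEMPTY", "NOTEMPTY_ATSTART", "NO_AUTO_POSSESS",
    "NO_START_OPT", "NO_DOTSTAR_ANCHOR", "NO_JIT",
    "CR", "LF", "CRLF", "ANYCRLF", "ANY", "NUL",
    "BSR_ANYCRLF", "BSR_UNICODE",
}

# LIMIT_RECURSION is the pre-10.30 name for LIMIT_DEPTH.
_LIMIT_ITEMS = {
    "LIMIT_HEAP": "HEAP",
    "LIMIT_MATCH": "MATCH",
    "LIMIT_DEPTH": "DEPTH",
    "LIMIT_RECURSION": "DEPTH",
}


def effective_limits(pattern, caller_limits):
    """Return the limits in force after the leading (*LIMIT_...) items.

    caller_limits maps "HEAP", "MATCH" and "DEPTH" to the values set
    (or defaulted) by the caller (hypothetical representation).
    """
    limits = dict(caller_limits)
    pos = 0
    while True:
        m = _ITEM.match(pattern, pos)
        if m is None:
            break
        name, digits = m.group(1), m.group(2)
        if name in _LIMIT_ITEMS and digits is not None:
            key = _LIMIT_ITEMS[name]
            # A pattern may only lower a limit; with repeats the lowest wins.
            limits[key] = min(limits[key], int(digits))
        elif name in _FLAG_ITEMS and digits is None:
            pass  # a different start-of-pattern item: keep scanning
        else:
            break  # not a start-of-pattern item: stop scanning
        pos = m.end()
    return limits
```

       With caller limits of 1000 for MATCH, the pattern
       "(*LIMIT_MATCH=500)(*LIMIT_MATCH=200)abc" yields 200, while
       "(*LIMIT_MATCH=5000)abc" leaves the limit at 1000.
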
       Prior  to  release  10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This
       name is still recognized for backwards compatibility.

       The heap limit applies only when the pcre2_match() or pcre2_dfa_match()
       interpreters are used for matching. It does not apply to JIT. The match
       limit is used (but in a different way) when JIT is being used, or  when
       pcre2_dfa_match() is called, to limit computing resource usage by those
       matching functions. The depth limit is ignored by JIT but  is  relevant
       for  DFA  matching, which uses function recursion for recursions within
       the pattern and for lookaround assertions and atomic  groups.  In  this
       case, the depth limit controls the depth of such recursion.

   Newline conventions

       PCRE2  supports six different conventions for indicating line breaks in
       strings: a single CR (carriage return) character, a  single  LF  (line‐
       feed) character, the two-character sequence CRLF, any of the three pre‐
       ceding, any Unicode newline sequence,  or  the  NUL  character  (binary
       zero).  The  pcre2api  page  has further discussion about newlines, and
       shows how to set the newline convention when calling pcre2_compile().

       It is also possible to specify a newline convention by starting a  pat‐
       tern string with one of the following sequences:

         (*CR)        carriage return
         (*LF)        linefeed
         (*CRLF)      carriage return, followed by linefeed
         (*ANYCRLF)   any of the three above
         (*ANY)       all Unicode newline sequences
         (*NUL)       the NUL character (binary zero)

       These override the default and the options given to the compiling func‐
       tion. For example, on a Unix system where LF  is  the  default  newline
       sequence, the pattern

         (*CR)a.b

       changes the convention to CR. That pattern matches "a\nb" because LF is
       no longer a newline. If more than one of these settings is present, the
       last one is used.

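       The "last one is used" rule, and the reason (*CR)a.b matches
       "a\nb", can be illustrated with a small Python model. It is a
       sketch, not PCRE2's parser: newline_convention and dot_matches are
       hypothetical names, every leading (*NAME) item is assumed to be a
       start-of-pattern option, and the dot model covers only the
       single-character conventions.

```python
import re

# One leading item: (*NAME) or (*NAME=digits).
_ITEM = re.compile(r"\(\*([A-Z_]+)(?:=(\d+))?\)")
_NEWLINE_ITEMS = {"CR", "LF", "CRLF", "ANYCRLF", "ANY", "NUL"}

# Only the conventions whose newline is a single character are modelled.
_SINGLE_CHAR_NEWLINE = {"CR": "\r", "LF": "\n", "NUL": "\0"}


def newline_convention(pattern, default="LF"):
    """Return the convention selected by the leading items (last one wins)."""
    chosen = default
    pos = 0
    while True:
        m = _ITEM.match(pattern, pos)
        if m is None:
            break
        if m.group(1) in _NEWLINE_ITEMS and m.group(2) is None:
            chosen = m.group(1)
        pos = m.end()
    return chosen


def dot_matches(ch, convention):
    """Model '.' without PCRE2_DOTALL: any character except the newline.

    Raises KeyError for CRLF, ANYCRLF and ANY, which are not modelled.
    """
    return ch != _SINGLE_CHAR_NEWLINE[convention]
```

       Under this model dot_matches("\n", newline_convention("(*CR)a.b"))
       is true, whereas with the LF default the same character is rejected.
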
       The  newline  convention affects where the circumflex and dollar asser‐
       tions are true. It also affects the interpretation of the dot metachar‐
       acter  when  PCRE2_DOTALL  is not set, and the behaviour of \N when not
       followed by an opening brace. However, it does not affect what  the  \R
       escape  sequence  matches.  By  default,  this  is  any Unicode newline
       sequence, for Perl compatibility. However, this can be changed; see the
       next section and the description of \R in the section entitled "Newline
       sequences" below. A change of \R setting can be combined with a  change
       of newline convention.

   Specifying what \R matches

       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
       the complete set  of  Unicode  line  endings)  by  setting  the  option
       PCRE2_BSR_ANYCRLF  at compile time. This effect can also be achieved by
       starting a pattern with (*BSR_ANYCRLF).  For  completeness,  (*BSR_UNI‐
       CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.

EBCDIC CHARACTER CODES

       PCRE2  can be compiled to run in an environment that uses EBCDIC as its
       character code instead of ASCII or Unicode (typically a mainframe  sys‐
       tem).  In  the  sections below, character code values are ASCII or Uni‐
       code; in an EBCDIC environment these characters may have different code
       values, and there are no code points greater than 255.

CHARACTERS AND METACHARACTERS

       A  regular  expression  is  a pattern that is matched against a subject
       string from left to right. Most characters stand for  themselves  in  a
       pattern,  and  match  the corresponding characters in the subject. As a
       trivial example, the pattern

         The quick brown fox

       matches a portion of a subject string that is identical to itself. When
       caseless  matching  is  specified  (the  PCRE2_CASELESS  option or (?i)
       within the pattern), letters are matched independently  of  case.  Note
       that  there  are  two  ASCII  characters, K and S, that, in addition to
       their lower case ASCII equivalents, are  case-equivalent  with  Unicode
       U+212A  (Kelvin  sign)  and  U+017F  (long  S) respectively when either
       PCRE2_UTF or PCRE2_UCP is set.

       The power of regular expressions comes from the ability to include wild
       cards, character classes, alternatives, and repetitions in the pattern.
       These are encoded in the pattern by the use of metacharacters, which do
       not  stand  for  themselves but instead are interpreted in some special
       way.

       There are two different sets of metacharacters: those that  are  recog‐
       nized  anywhere in the pattern except within square brackets, and those
       that are recognized within square brackets.  Outside  square  brackets,
       the metacharacters are as follows:

         \      general escape character with several uses
         ^      assert start of string (or line, in multiline mode)
         $      assert end of string (or line, in multiline mode)
         .      match any character except newline (by default)
         [      start character class definition
         |      start of alternative branch
         (      start group or control verb
         )      end group or control verb
         *      0 or more quantifier
         +      1 or more quantifier; also "possessive quantifier"
         ?      0 or 1 quantifier; also quantifier minimizer
         {      start min/max quantifier

       Part  of  a  pattern  that is in square brackets is called a "character
       class". In a character class the only metacharacters are:

         \      general escape character
         ^      negate the class, but only if the first character
         -      indicates character range
         [      POSIX character class (if followed by POSIX syntax)
         ]      terminates the character class

       If a pattern is compiled with the  PCRE2_EXTENDED  option,  most  white
       space  in  the pattern, other than in a character class, and characters
       between a # outside a character class and the next newline,  inclusive,
       are ignored. An escaping backslash can be used to include a white space
       or a # character as part of the  pattern.  If  the  PCRE2_EXTENDED_MORE
       option  is  set,  the same applies, but in addition unescaped space and
       horizontal tab characters are ignored inside a character  class.  Note:
       only  these  two  characters  are  ignored, not the full set of pattern
       white space characters that are  ignored  outside  a  character  class.
       Option  settings can be changed within a pattern; see the section enti‐
       tled "Internal Option Setting" below.

       The following sections describe the use of each of the metacharacters.

BACKSLASH

       The backslash character has several uses. Firstly, if it is followed by
       a  character that is not a digit or a letter, it takes away any special
       meaning that character may have. This use of  backslash  as  an  escape
       character applies both inside and outside character classes.

       For  example,  if you want to match a * character, you must write \* in
       the pattern. This escaping action applies whether or not the  following
       character  would  otherwise be interpreted as a metacharacter, so it is
       always safe to precede a non-alphanumeric  with  backslash  to  specify
       that it stands for itself.  In particular, if you want to match a back‐
       slash, you write \\.

       Only ASCII digits and letters have any special meaning  after  a  back‐
       slash. All other characters (in particular, those whose code points are
       greater than 127) are treated as literals.

       If you want to treat all characters in a sequence as literals, you  can
       do so by putting them between \Q and \E. This is different from Perl in
       that $ and @ are handled as literals in  \Q...\E  sequences  in  PCRE2,
       whereas  in Perl, $ and @ cause variable interpolation. Also, Perl does
       "double-quotish backslash interpolation" on any backslashes between  \Q
       and  \E which, its documentation says, "may lead to confusing results".
       PCRE2 treats a backslash between \Q and \E just like any other  charac‐
       ter. Note the following examples:

         Pattern            PCRE2 matches   Perl matches

         \Qabc$xyz\E        abc$xyz         abc followed by the
                                              contents of $xyz
         \Qabc\$xyz\E       abc\$xyz        abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz
         \QA\B\E            A\B             A\B
         \Q\\E              \               \\E

       The  \Q...\E  sequence  is recognized both inside and outside character
       classes.  An isolated \E that is not preceded by \Q is ignored.  If  \Q
       is  not followed by \E later in the pattern, the literal interpretation
       continues to the end of the pattern (that is,  \E  is  assumed  at  the
       end).  If  the  isolated \Q is inside a character class, this causes an
       error, because the character class  is  not  terminated  by  a  closing
       square bracket.

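       The PCRE2 column of the table above can be reproduced by rewriting
       each \Q...\E run as individually escaped characters. The Python
       sketch below does this under stated assumptions: expand_quoted is a
       hypothetical helper, not part of PCRE2, and it does not track
       character classes, so the error case for an unterminated \Q inside
       a class is not modelled.

```python
def _escape(ch):
    # Preceding a non-alphanumeric ASCII character with a backslash is
    # always safe; everything else stands for itself already.
    if ch.isascii() and not ch.isalnum():
        return "\\" + ch
    return ch


def expand_quoted(pattern):
    r"""Rewrite \Q...\E runs in pattern as backslash-escaped literals."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch != "\\" or i + 1 >= n:
            out.append(ch)
            i += 1
            continue
        nxt = pattern[i + 1]
        if nxt == "E":
            i += 2  # an isolated \E is ignored
        elif nxt == "Q":
            end = pattern.find("\\E", i + 2)
            if end == -1:
                # No closing \E: literal text runs to the end of the pattern.
                literal, i = pattern[i + 2:], n
            else:
                literal, i = pattern[i + 2:end], end + 2
            out.extend(_escape(c) for c in literal)
        else:
            out.append(ch + nxt)  # any other escape is copied unchanged
            i += 2
    return "".join(out)
```

       For example, expand_quoted(r"\Qabc$xyz\E") gives abc\$xyz, which
       matches the literal text abc$xyz as the table shows.
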
   Non-printing characters

       A second use of backslash provides a way of encoding non-printing char‐
       acters in patterns in a visible manner. There is no restriction on  the
       appearance  of non-printing characters in a pattern, but when a pattern
       is being prepared by text editing, it is often easier to use one of the
       following  escape  sequences  instead of the binary character it repre‐
       sents. In an ASCII or Unicode environment, these escapes  are  as  fol‐
       lows:

         \a          alarm, that is, the BEL character (hex 07)
         \cx         "control-x", where x is any printable ASCII character
         \e          escape (hex 1B)
         \f          form feed (hex 0C)
         \n          linefeed (hex 0A)
         \r          carriage return (hex 0D) (but see below)
         \t          tab (hex 09)
         \0dd        character with octal code 0dd
         \ddd        character with octal code ddd, or backreference
         \o{ddd..}   character with octal code ddd..
         \xhh        character with hex code hh
         \x{hhh..}   character with hex code hhh..
         \N{U+hhh..} character with Unicode hex code point hhh..

       By  default, after \x that is not followed by {, from zero to two hexa‐
       decimal digits are read (letters can be in upper or  lower  case).  Any
       number of hexadecimal digits may appear between \x{ and }. If a charac‐
       ter other than a hexadecimal digit appears between \x{  and  },  or  if
       there is no terminating }, an error occurs.

       Characters whose code points are less than 256 can be defined by either
       of the two syntaxes for \x or by an octal sequence. There is no differ‐
       ence in the way they are handled. For example, \xdc is exactly the same
       as \x{dc} or \334.  However, using the braced versions does  make  such
       sequences easier to read.

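       The two default \x forms can be decoded as in the following Python
       sketch. It is illustrative only: parse_x_escape is a hypothetical
       name, the argument is the text that follows \x, and treating empty
       braces as an error is an assumption of the sketch rather than
       something stated above.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def parse_x_escape(rest):
    r"""Decode the text following \x; return (code_point, chars_consumed)."""
    if rest.startswith("{"):
        close = rest.find("}")
        if close == -1:
            raise ValueError("no terminating } after \\x{")
        digits = rest[1:close]
        # Assumption: an empty \x{} is rejected like any other bad content.
        if not digits or any(d not in HEX_DIGITS for d in digits):
            raise ValueError("non-hexadecimal character in \\x{...}")
        return int(digits, 16), close + 1
    # Unbraced form: read from zero to two hexadecimal digits.
    digits = ""
    for ch in rest[:2]:
        if ch not in HEX_DIGITS:
            break
        digits += ch
    return (int(digits, 16) if digits else 0), len(digits)
```

       Both parse_x_escape("dc") and parse_x_escape("{dc}") yield code
       point 0xDC, the same value as octal 334.
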
       Support  is  available  for  some  ECMAScript  (aka  JavaScript) escape
       sequences via two compile-time options. If PCRE2_ALT_BSUX is  set,  the
       sequence  \x followed by { is not recognized. Only if \x is followed by
       two hexadecimal digits is it recognized as a character  escape.  Other‐
       wise  it  is interpreted as a literal "x" character. In this mode, sup‐
       port for code points greater than 255 is provided by \u, which must  be
       followed  by  four hexadecimal digits; otherwise it is interpreted as a
       literal "u" character.

       PCRE2_EXTRA_ALT_BSUX has the same  effect  as  PCRE2_ALT_BSUX  and,  in
       addition,  \u{hhh..}  is recognized as the character specified by hexa‐
       decimal code point.  There may be any  number  of  hexadecimal  digits.
       This syntax is from ECMAScript 6.

       The  \N{U+hhh..} escape sequence is recognized only when PCRE2 is oper‐
       ating in UTF mode. Perl also uses \N{name}  to  specify  characters  by
       Unicode  name;  PCRE2  does  not support this. Note that when \N is not
       followed by an opening brace (curly bracket) it has an entirely differ‐
       ent meaning, matching any character that is not a newline.

       There  are  some  legacy  applications  where the escape sequence \r is
       expected to match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option
       is  set,  \r  in  a  pattern is converted to \n so that it matches a LF
       (linefeed) instead of a CR (carriage return) character.

       The precise effect of \cx on ASCII characters is as follows: if x is  a
       lower  case  letter,  it  is converted to upper case. Then bit 6 of the
       character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
       (A  is  41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
       hex 7B (; is 3B). If the code unit following \c has a value  less  than
       32 or greater than 126, a compile-time error occurs.

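       The \cx arithmetic for an ASCII environment is short enough to
       write out directly. The Python function below is a sketch of the
       rule just described; the name control_escape is hypothetical and
       the EBCDIC behaviour described next is not covered.

```python
def control_escape(x):
    r"""Return the character code produced by \cx in an ASCII environment."""
    code = ord(x)
    if code < 32 or code > 126:
        raise ValueError("character after \\c must be printable ASCII")
    if "a" <= x <= "z":
        code = ord(x.upper())  # lower case letters are converted first
    return code ^ 0x40  # invert bit 6 (hex 40)
```

       This gives 0x01 for "A", 0x1A for both "Z" and "z", 0x3B for "{",
       and 0x7B for ";".
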
       When  PCRE2  is  compiled in EBCDIC mode, \N{U+hhh..} is not supported.
       \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
       The \c escape is processed as specified for Perl in the perlebcdic doc‐
       ument. The only characters that are allowed after \c are A-Z,  a-z,  or
       one  of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
       time error. The sequence \c@ encodes character code  0;  after  \c  the
       letters  (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
       \, ], ^, and _ encode characters 27-31 (hex 1B  to  hex  1F),  and  \c?
       becomes either 255 (hex FF) or 95 (hex 5F).

       Thus,  apart  from  \c?, these escapes generate the same character code
       values as they do in an ASCII environment, though the meanings  of  the
       values  mostly  differ. For example, \cG always generates code value 7,
       which is BEL in ASCII but DEL in EBCDIC.

       The sequence \c? generates DEL (127, hex 7F) in an  ASCII  environment,
       but  because  127  is  not a control character in EBCDIC, Perl makes it
       generate the APC character. Unfortunately, there are  several  variants
       of  EBCDIC.  In  most  of them the APC character has the value 255 (hex
       FF), but in the one Perl calls POSIX-BC its value is 95  (hex  5F).  If
       certain other characters have POSIX-BC values, PCRE2 makes \c? generate
       95; otherwise it generates 255.

       After \0 up to two further octal digits are read. If  there  are  fewer
       than  two  digits,  just  those  that  are  present  are used. Thus the
       sequence \0\x\015 specifies two binary zeros followed by a CR character
       (code value 13). Make sure you supply two digits after the initial zero
       if the pattern character that follows is itself an octal digit.

       The escape \o must be followed by a sequence of octal digits,  enclosed
       in  braces.  An  error occurs if this is not the case. This escape is a
       recent addition to Perl; it provides a way of specifying character code
       points  as  octal  numbers  greater than 0777, and it also allows octal
       numbers and backreferences to be unambiguously specified.

       For greater clarity and unambiguity, it is best to avoid following \ by
       a digit greater than zero. Instead, use \o{} or \x{} to specify numeri‐
       cal character code points, and \g{} to specify backreferences. The fol‐
       lowing paragraphs describe the old, ambiguous syntax.

       The handling of a backslash followed by a digit other than 0 is compli‐
       cated, and Perl has changed over time, causing PCRE2 also to change.

       Outside a character class, PCRE2 reads the digit and any following dig‐
       its as a decimal number. If the number is less than 10, begins with the
       digit 8 or 9, or if there are  at  least  that  many  previous  capture
       groups  in the expression, the entire sequence is taken as a backrefer‐
       ence. A description of how this works is  given  later,  following  the
       discussion  of parenthesized groups.  Otherwise, up to three octal dig‐
       its are read to form a character code.

       Inside a character class, PCRE2 handles \8 and \9 as the literal  char‐
       acters  "8"  and "9", and otherwise reads up to three octal digits fol‐
       lowing the backslash, using them to generate a data character. Any sub‐
       sequent  digits  stand for themselves. For example, outside a character
       class:

         \040   is another way of writing an ASCII space
         \40    is the same, provided there are fewer than 40
                   previous capture groups
         \7     is always a backreference
         \11    might be a backreference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a backreference, otherwise the
                   character with octal code 113
         \377   might be a backreference, otherwise
                   the value 255 (decimal)
         \81    is always a backreference

       Note that octal values of 100 or greater that are specified using  this
       syntax  must  not be introduced by a leading zero, because no more than
       three octal digits are ever read.

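       The decision between a backreference and an octal character code,
       outside a character class, can be expressed as the following Python
       sketch. It models only the rules stated above: the function name,
       the tuple it returns, and the group_count parameter (the number of
       previous capture groups) are hypothetical, and the limits on
       character values described in the next section are not checked.

```python
OCTAL_DIGITS = "01234567"


def _read_octal(digits):
    # Read up to three leading octal digits; the rest stand for themselves.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in OCTAL_DIGITS:
        n += 1
    return ("char", int(digits[:n], 8), digits[n:])


def classify_backslash_digits(digits, group_count):
    """Interpret the run of decimal digits that follows a backslash.

    Returns (kind, value, literal_tail), where kind is "backreference"
    or "char" and literal_tail holds digits that stand for themselves.
    """
    if digits[0] == "0":
        # \0 plus up to two further octal digits is always a character.
        return _read_octal(digits)
    number = int(digits)
    if number < 10 or digits[0] in "89" or number <= group_count:
        return ("backreference", number, "")
    return _read_octal(digits)
```

       With no capture groups, "11" yields ("char", 9, ""), a tab, while
       with eleven or more groups it yields ("backreference", 11, "").
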
   Constraints on character values

       Characters that are specified using octal or  hexadecimal  numbers  are
       limited to certain values, as follows:

         8-bit non-UTF mode    no greater than 0xff
         16-bit non-UTF mode   no greater than 0xffff
         32-bit non-UTF mode   no greater than 0xffffffff
         All UTF modes         no greater than 0x10ffff and a valid code point

       Invalid Unicode code points are all those in the range 0xd800 to 0xdfff
       (the so-called "surrogate" code points). The check  for  these  can  be
       disabled  by  the  caller  of  pcre2_compile()  by  setting  the option
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only  in
       UTF-8  and  UTF-32 modes, because these values are not representable in
       UTF-16.

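       These limits amount to a simple range check, sketched here in
       Python. The function and its parameters are hypothetical: width is
       the code unit width (8, 16, or 32), utf says whether a UTF mode is
       in force, and allow_surrogates stands in for
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES.

```python
_NON_UTF_MAX = {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}


def code_point_allowed(value, width, utf, allow_surrogates=False):
    """Return True if an octal or hex escape may specify this value."""
    if not utf:
        return value <= _NON_UTF_MAX[width]
    if value > 0x10FFFF:
        return False
    if 0xD800 <= value <= 0xDFFF:
        # Surrogates can be permitted only in UTF-8 and UTF-32 modes.
        return allow_surrogates and width in (8, 32)
    return True
```

       For instance, 0x100 is rejected in 8-bit non-UTF mode but accepted
       in any UTF mode, and 0xD800 is accepted only when surrogates are
       explicitly allowed and the width is not 16.
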
   Escape sequences in character classes

       All the sequences that define a single character value can be used both
       inside  and  outside character classes. In addition, inside a character
       class, \b is interpreted as the backspace character (hex 08).

       When not followed by an opening brace, \N is not allowed in a character
       class.   \B,  \R, and \X are not special inside a character class. Like
       other unrecognized alphabetic escape sequences, they  cause  an  error.
       Outside a character class, these sequences have different meanings.

   Unsupported escape sequences

       In  Perl,  the  sequences  \F, \l, \L, \u, and \U are recognized by its
       string handler and used to modify the case of following characters.  By
       default,  PCRE2  does  not  support these escape sequences in patterns.
       However,  if  either  of  the  PCRE2_ALT_BSUX  or  PCRE2_EXTRA_ALT_BSUX
       options  is  set,  \U  matches  a  "U" character, and \u can be used to
       define a character by code point, as described above.

   Absolute and relative backreferences

       The sequence \g followed by a signed  or  unsigned  number,  optionally
       enclosed  in  braces, is an absolute or relative backreference. A named
       backreference can be coded as \g{name}.  Backreferences  are  discussed
       later, following the discussion of parenthesized groups.

   Absolute and relative subroutine calls

       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number enclosed either in angle brackets or single quotes, is
       an  alternative syntax for referencing a capture group as a subroutine.
       Details are discussed later.   Note  that  \g{...}  (Perl  syntax)  and
       \g<...> (Oniguruma syntax) are not synonymous. The former is a backref‐
       erence; the latter is a subroutine call.

   Generic character types

       Another use of backslash is for specifying generic character types:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \h     any horizontal white space character
         \H     any character that is not a horizontal white space character
         \N     any character that is not a newline
         \s     any white space character
         \S     any character that is not a white space character
         \v     any vertical white space character
         \V     any character that is not a vertical white space character
         \w     any "word" character
         \W     any "non-word" character

       The \N escape sequence has the same meaning as  the  "."  metacharacter
       when  PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change
       the meaning of \N. Note that when \N is followed by an opening brace it
       has a different meaning. See the section entitled "Non-printing charac‐
       ters" above for details. Perl also uses \N{name} to specify  characters
       by Unicode name; PCRE2 does not support this.

       Each  pair of lower and upper case escape sequences partitions the com‐
       plete set of characters into two disjoint  sets.  Any  given  character
       matches  one, and only one, of each pair. The sequences can appear both
       inside and outside character classes. They each match one character  of
       the  appropriate  type.  If the current matching point is at the end of
       the subject string, all of them fail, because there is no character  to
       match.

       The  default  \s  characters  are HT (9), LF (10), VT (11), FF (12), CR
       (13), and space (32), which are defined  as  white  space  in  the  "C"
       locale. This list may vary if locale-specific matching is taking place.
       For example, in some locales the "non-breaking space" character  (\xA0)
       is recognized as white space, and in others the VT character is not.

       A  "word"  character is an underscore or any character that is a letter
       or digit.  By default, the definition of letters  and  digits  is  con‐
       trolled by PCRE2's low-valued character tables, and may vary if locale-
       specific matching is taking place (see "Locale support" in the pcre2api
       page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
       systems, or "french" in Windows, some character codes greater than  127
       are  used  for  accented letters, and these are then matched by \w. The
       use of locales with Unicode is discouraged.

561       By default, characters whose code points are  greater  than  127  never
562       match \d, \s, or \w, and always match \D, \S, and \W, although this may
563       be different for characters in the range 128-255  when  locale-specific
564       matching  is  happening.   These escape sequences retain their original
565       meanings from before Unicode support was available,  mainly  for  effi‐
566       ciency  reasons.  If  the  PCRE2_UCP  option  is  set, the behaviour is
567       changed so that Unicode properties  are  used  to  determine  character
568       types, as follows:
569
570         \d  any character that matches \p{Nd} (decimal digit)
571         \s  any character that matches \p{Z} or \h or \v
572         \w  any character that matches \p{L} or \p{N}, plus underscore
573
574       The  upper case escapes match the inverse sets of characters. Note that
575       \d matches only decimal digits, whereas \w matches any  Unicode  digit,
576       as well as any Unicode letter, and underscore. Note also that PCRE2_UCP
577       affects \b and \B because they are defined in terms of \w and \W.
578       Matching these sequences is noticeably slower when PCRE2_UCP is set.
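
       The effect of PCRE2_UCP can be illustrated by analogy, using Python's
       re module rather than PCRE2 itself. This is only a sketch: the
       assumption is that re.ASCII behaves roughly like PCRE2's default
       tables, and that Python's default Unicode matching for str patterns
       behaves roughly like PCRE2_UCP. The helper name matches() is
       hypothetical, and the two engines' tables may differ in detail.

```python
import re

# Analogy only: re.ASCII ~ PCRE2 default behaviour,
# Python's default Unicode matching ~ PCRE2_UCP.
def matches(pattern, ch, ascii_only):
    """Return True if the one-character string ch matches pattern."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(pattern, ch, flags) is not None

E_ACUTE = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE (category Ll)
ARABIC_THREE = "\u0663"   # ARABIC-INDIC DIGIT THREE (category Nd)

if __name__ == "__main__":
    for name, ch in (("e-acute", E_ACUTE), ("arabic three", ARABIC_THREE)):
        print(name,
              "ascii \\w:", matches(r"\w", ch, True),
              "unicode \\w:", matches(r"\w", ch, False))
```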
579
580       The  sequences  \h, \H, \v, and \V, in contrast to the other sequences,
581       which match only ASCII characters by default, always match  a  specific
582       list  of  code  points, whether or not PCRE2_UCP is set. The horizontal
583       space characters are:
584
585         U+0009     Horizontal tab (HT)
586         U+0020     Space
587         U+00A0     Non-break space
588         U+1680     Ogham space mark
589         U+180E     Mongolian vowel separator
590         U+2000     En quad
591         U+2001     Em quad
592         U+2002     En space
593         U+2003     Em space
594         U+2004     Three-per-em space
595         U+2005     Four-per-em space
596         U+2006     Six-per-em space
597         U+2007     Figure space
598         U+2008     Punctuation space
599         U+2009     Thin space
600         U+200A     Hair space
601         U+202F     Narrow no-break space
602         U+205F     Medium mathematical space
603         U+3000     Ideographic space
604
605       The vertical space characters are:
606
607         U+000A     Linefeed (LF)
608         U+000B     Vertical tab (VT)
609         U+000C     Form feed (FF)
610         U+000D     Carriage return (CR)
611         U+0085     Next line (NEL)
612         U+2028     Line separator
613         U+2029     Paragraph separator
614
615       In 8-bit, non-UTF-8 mode, only the characters  with  code  points  less
616       than 256 are relevant.
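
       The two fixed lists above can be written down directly as sets of
       code points. The following Python sketch is a model of the tables
       only, not of PCRE2's implementation; the names HSPACE, VSPACE and
       relevant_in_8bit() are hypothetical.

```python
# Code points that \h matches, as listed above.
HSPACE = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))        # U+2000 .. U+200A inclusive
    + [0x202F, 0x205F, 0x3000]
)

# Code points that \v matches, as listed above.
VSPACE = frozenset([0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029])

def relevant_in_8bit(code_points):
    """Subset that can occur in 8-bit, non-UTF-8 mode (code points < 256)."""
    return sorted(cp for cp in code_points if cp < 256)

if __name__ == "__main__":
    print([hex(cp) for cp in relevant_in_8bit(HSPACE)])
    print([hex(cp) for cp in relevant_in_8bit(VSPACE)])
```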
617
618   Newline sequences
619
620       Outside  a  character class, by default, the escape sequence \R matches
621       any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is  equivalent
622       to the following:
623
624         (?>\r\n|\n|\x0b|\f|\r|\x85)
625
626       This  is  an  example  of an "atomic group", details of which are given
627       below.  This particular group matches either the two-character sequence
628       CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
629       U+000A), VT (vertical tab, U+000B), FF (form feed,  U+000C),  CR  (car‐
630       riage  return,  U+000D), or NEL (next line, U+0085). Because this is an
631       atomic group, the two-character sequence is treated as  a  single  unit
632       that cannot be split.
633
634       In other modes, two additional characters whose code points are greater
635       than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa‐
636       rator,  U+2029).  Unicode support is not needed for these characters to
637       be recognized.
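
       As a rough illustration, the set of sequences that \R matches can be
       approximated in Python's re module, which has no \R of its own. This
       sketch lists CRLF first so that it is preferred over a lone CR. It
       does not use an atomic group, so inside a larger pattern backtracking
       could still split CRLF, which PCRE2's \R does not allow. The names
       ANY_NEWLINE and split_lines() are hypothetical.

```python
import re

# CRLF first, then the single-character newlines, including LS and PS.
# The string is not raw, so the pattern contains the literal characters.
ANY_NEWLINE = re.compile("\r\n|[\n\x0b\f\r\x85\u2028\u2029]")

def split_lines(text):
    """Split text at every newline sequence in the \\R set."""
    return ANY_NEWLINE.split(text)

if __name__ == "__main__":
    print(split_lines("a\r\nb\nc\x85d\u2028e"))
```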
638
639       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
640       the  complete  set  of  Unicode  line  endings)  by  setting the option
641       PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for
642       "backslash R".) This can be made the default when PCRE2 is built; if this is
643       the case, the other behaviour can be requested via  the  PCRE2_BSR_UNI‐
644       CODE  option. It is also possible to specify these settings by starting
645       a pattern string with one of the following sequences:
646
647         (*BSR_ANYCRLF)   CR, LF, or CRLF only
648         (*BSR_UNICODE)   any Unicode newline sequence
649
650       These override the default and the options given to the compiling func‐
651       tion.  Note that these special settings, which are not Perl-compatible,
652       are recognized only at the very start of a pattern, and that they  must
653       be  in upper case. If more than one of them is present, the last one is
654       used. They can be combined with a change  of  newline  convention;  for
655       example, a pattern can start with:
656
657         (*ANY)(*BSR_ANYCRLF)
658
659       They  can also be combined with the (*UTF) or (*UCP) special sequences.
660       Inside a character class, \R  is  treated  as  an  unrecognized  escape
661       sequence, and causes an error.
662
663   Unicode character properties
664
665       When  PCRE2  is  built  with Unicode support (the default), three addi‐
666       tional escape sequences that match characters with specific  properties
667       are available. They can be used in any mode, though in 8-bit and 16-bit
668       non-UTF modes these sequences are of course limited to testing  charac‐
669       ters  whose code points are less than U+0100 and U+10000, respectively.
670       In 32-bit non-UTF mode, code points greater than 0x10ffff (the  Unicode
671       limit)  may  be  encountered.  These  are  all  treated as being in the
672       Unknown script and with an unassigned type. The extra escape  sequences
673       are:
674
675         \p{xx}   a character with the xx property
676         \P{xx}   a character without the xx property
677         \X       a Unicode extended grapheme cluster
678
679       The property names represented by xx above are case-sensitive. There is
680       support for Unicode script names, Unicode general category  properties,
681       "Any",  which  matches any character (including newline), and some spe‐
682       cial PCRE2 properties (described in  the  next  section).   Other  Perl
683       properties such as "InMusicalSymbols" are not supported by PCRE2.  Note
684       that \P{Any} does not match any characters, so always  causes  a  match
685       failure.
686
687       Sets of Unicode characters are defined as belonging to certain scripts.
688       A character from one of these sets can be matched using a script  name.
689       For example:
690
691         \p{Greek}
692         \P{Han}
693
694       Unassigned characters (and in non-UTF 32-bit mode, characters with code
695       points greater than 0x10FFFF) are assigned the "Unknown" script. Others
696       that  are not part of an identified script are lumped together as "Com‐
697       mon". The current list of scripts is:
698
699       Adlam, Ahom, Anatolian_Hieroglyphs, Arabic,  Armenian,  Avestan,  Bali‐
700       nese,  Bamum,  Bassa_Vah,  Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
701       Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,  Caucasian_Alba‐
702       nian,  Chakma,  Cham,  Cherokee, Chorasmian, Common, Coptic, Cuneiform,
703       Cypriot, Cyrillic, Deseret, Devanagari, Dives_Akuru,  Dogra,  Duployan,
704       Egyptian_Hieroglyphs, Elbasan, Elymaic, Ethiopic, Georgian, Glagolitic,
705       Gothic, Grantha, Greek, Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul,
706       Hanifi_Rohingya,  Hanunoo,  Hatran, Hebrew, Hiragana, Imperial_Aramaic,
707       Inherited,  Inscriptional_Pahlavi,  Inscriptional_Parthian,   Javanese,
708       Kaithi,  Kannada,  Katakana, Kayah_Li, Kharoshthi, Khitan_Small_Script,
709       Khmer, Khojki, Khudawadi, Lao, Latin,  Lepcha,  Limbu,  Linear_A,  Lin‐
710       ear_B,  Lisu,  Lycian,  Lydian,  Mahajani, Makasar, Malayalam, Mandaic,
711       Manichaean,   Marchen,   Masaram_Gondi,   Medefaidrin,    Meetei_Mayek,
712       Mende_Kikakui, Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mon‐
713       golian, Mro, Multani,  Myanmar,  Nabataean,  Nandinagari,  New_Tai_Lue,
714       Newa,  Nko,  Nushu, Nyakeng_Puachue_Hmong, Ogham, Ol_Chiki, Old_Hungar‐
715       ian, Old_Italic, Old_North_Arabian, Old_Permic,  Old_Persian,  Old_Sog‐
716       dian,    Old_South_Arabian,    Old_Turkic,   Oriya,   Osage,   Osmanya,
717       Pahawh_Hmong,    Palmyrene,    Pau_Cin_Hau,    Phags_Pa,    Phoenician,
718       Psalter_Pahlavi,  Rejang,  Runic,  Samaritan, Saurashtra, Sharada, Sha‐
719       vian, Siddham, SignWriting, Sinhala,  Sogdian,  Sora_Sompeng,  Soyombo,
720       Sundanese,  Syloti_Nagri,  Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
721       Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana,  Thai,  Tibetan,  Tifi‐
722       nagh, Tirhuta, Ugaritic, Unknown, Vai, Wancho, Warang_Citi, Yezidi, Yi,
723       Zanabazar_Square.
724
725       Each character has exactly one Unicode general category property, spec‐
726       ified  by a two-letter abbreviation. For compatibility with Perl, nega‐
727       tion can be specified by including a  circumflex  between  the  opening
728       brace  and  the  property  name.  For  example,  \p{^Lu} is the same as
729       \P{Lu}.
730
731       If only one letter is specified with \p or \P, it includes all the gen‐
732       eral  category properties that start with that letter. In this case, in
733       the absence of negation, the curly brackets in the escape sequence  are
734       optional; these two examples have the same effect:
735
736         \p{L}
737         \pL
738
739       The following general category property codes are supported:
740
741         C     Other
742         Cc    Control
743         Cf    Format
744         Cn    Unassigned
745         Co    Private use
746         Cs    Surrogate
747
748         L     Letter
749         Ll    Lower case letter
750         Lm    Modifier letter
751         Lo    Other letter
752         Lt    Title case letter
753         Lu    Upper case letter
754
755         M     Mark
756         Mc    Spacing mark
757         Me    Enclosing mark
758         Mn    Non-spacing mark
759
760         N     Number
761         Nd    Decimal number
762         Nl    Letter number
763         No    Other number
764
765         P     Punctuation
766         Pc    Connector punctuation
767         Pd    Dash punctuation
768         Pe    Close punctuation
769         Pf    Final punctuation
770         Pi    Initial punctuation
771         Po    Other punctuation
772         Ps    Open punctuation
773
774         S     Symbol
775         Sc    Currency symbol
776         Sk    Modifier symbol
777         Sm    Mathematical symbol
778         So    Other symbol
779
780         Z     Separator
781         Zl    Line separator
782         Zp    Paragraph separator
783         Zs    Space separator
784
785       The  special property L& is also supported: it matches a character that
786       has the Lu, Ll, or Lt property, in other words, a letter  that  is  not
787       classified as a modifier or "other".
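
       The rules for general category names (two-letter codes, one-letter
       groups, circumflex negation, and L&) can be modelled with Python's
       unicodedata module. This is a sketch under the assumption that
       Python's Unicode database agrees with the one PCRE2 was built with,
       which need not be the case. It does not handle script names or
       "Any". The function name has_property() is hypothetical.

```python
import unicodedata

def has_property(ch, prop):
    """Model of \\p{prop} for general categories: "Lu", "L", "^Lu", "L&"."""
    negate = prop.startswith("^")            # \p{^Lu} is the same as \P{Lu}
    name = prop[1:] if negate else prop
    category = unicodedata.category(ch)      # for example "Lu", "Nd", "Zs"
    if name == "L&":
        result = category in ("Lu", "Ll", "Lt")
    elif len(name) == 1:
        result = category.startswith(name)   # one letter covers its group
    else:
        result = category == name
    return result != negate

if __name__ == "__main__":
    for prop in ("Lu", "L", "^Lu", "L&", "N"):
        print(prop, has_property("A", prop))
```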
788
789       The  Cs  (Surrogate)  property  applies  only  to characters whose code
790       points are in the range U+D800 to U+DFFF. These characters are no  dif‐
791       ferent  to any other character when PCRE2 is not in UTF mode (using the
792       16-bit or 32-bit library).  However, they  are  not  valid  in  Unicode
793       strings and so cannot be tested by PCRE2 in UTF mode, unless UTF valid‐
794       ity  checking  has   been   turned   off   (see   the   discussion   of
795       PCRE2_NO_UTF_CHECK in the pcre2api page).
796
797       The  long  synonyms  for  property  names  that  Perl supports (such as
798       \p{Letter}) are not supported by PCRE2, nor is it permitted  to  prefix
799       any of these properties with "Is".
800
801       No character that is in the Unicode table has the Cn (unassigned) prop‐
802       erty.  Instead, this property is assumed for any code point that is not
803       in the Unicode table.
804
805       Specifying  caseless  matching  does not affect these escape sequences.
806       For example, \p{Lu} always matches only upper  case  letters.  This  is
807       different from the behaviour of current versions of Perl.
808
809       Matching  characters by Unicode property is not fast, because PCRE2 has
810       to do a multistage table lookup in order to find  a  character's  prop‐
811       erty. That is why the traditional escape sequences such as \d and \w do
812       not use Unicode properties in PCRE2 by default,  though  you  can  make
813       them  do  so by setting the PCRE2_UCP option or by starting the pattern
814       with (*UCP).
815
816   Extended grapheme clusters
817
818       The \X escape matches any number of Unicode  characters  that  form  an
819       "extended grapheme cluster", and treats the sequence as an atomic group
820       (see below).  Unicode supports various kinds of composite character  by
821       giving  each  character  a grapheme breaking property, and having rules
822       that use these properties to define the boundaries of extended grapheme
823       clusters.  The rules are defined in Unicode Standard Annex 29, "Unicode
824       Text Segmentation". Unicode 11.0.0 abandoned the use of  some  previous
825       properties  that had been used for emojis.  Instead it introduced vari‐
826       ous emoji-specific properties. PCRE2  uses  only  the  Extended  Picto‐
827       graphic property.
828
829       \X  always  matches  at least one character. Then it decides whether to
830       add additional characters according to the following rules for ending a
831       cluster:
832
833       1. End at the end of the subject string.
834
835       2.  Do not end between CR and LF; otherwise end after any control char‐
836       acter.
837
838       3. Do not break Hangul (a Korean  script)  syllable  sequences.  Hangul
839       characters  are of five types: L, V, T, LV, and LVT. An L character may
840       be followed by an L, V, LV, or LVT character; an LV or V character  may
841       be followed by a V or T character; an LVT or T character may be followed
842       only by a T character.
843
844       4. Do not end before extending  characters  or  spacing  marks  or  the
845       "zero-width  joiner"  character.  Characters  with  the "mark" property
846       always have the "extend" grapheme breaking property.
847
848       5. Do not end after prepend characters.
849
850       6. Do not break within emoji modifier sequences or emoji ZWJ sequences.
851       That is, do not break between characters with the Extended_Pictographic
852       property.  Extend and ZWJ characters are allowed  between  the  charac‐
853       ters.
854
855       7.  Do  not  break  within  emoji flag sequences. That is, do not break
856       between regional indicator (RI) characters if there are an  odd  number
857       of RI characters before the break point.
858
859       8. Otherwise, end the cluster.
860
861   PCRE2's additional properties
862
863       As  well as the standard Unicode properties described above, PCRE2 sup‐
864       ports four more that make it possible  to  convert  traditional  escape
865       sequences such as \w and \s to use Unicode properties. PCRE2 uses these
866       non-standard, non-Perl properties internally  when  PCRE2_UCP  is  set.
867       However, they may also be used explicitly. These properties are:
868
869         Xan   Any alphanumeric character
870         Xps   Any POSIX space character
871         Xsp   Any Perl space character
872         Xwd   Any Perl "word" character
873
874       Xan  matches  characters that have either the L (letter) or the N (num‐
875       ber) property. Xps matches the characters tab, linefeed, vertical  tab,
876       form  feed,  or carriage return, and any other character that has the Z
877       (separator) property.  Xsp is the same as Xps;  in  PCRE1  it  used  to
878       exclude  vertical  tab,  for  Perl compatibility, but Perl changed. Xwd
879       matches the same characters as Xan, plus underscore.
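
       The definitions of Xan, Xps, Xsp, and Xwd given above translate
       directly into predicates. The following Python sketch follows the
       wording of this section only; it assumes that Python's unicodedata
       categories agree with PCRE2's tables, and the function names are
       hypothetical.

```python
import unicodedata

def is_xan(ch):
    """Xan: any character with the L (letter) or N (number) property."""
    return unicodedata.category(ch)[0] in ("L", "N")

def is_xps(ch):
    """Xps: tab, linefeed, vertical tab, form feed, carriage return,
    or any character with the Z (separator) property."""
    return ch in "\t\n\x0b\f\r" or unicodedata.category(ch)[0] == "Z"

# Xsp is described above as being the same as Xps.
is_xsp = is_xps

def is_xwd(ch):
    """Xwd: the same characters as Xan, plus underscore."""
    return ch == "_" or is_xan(ch)

if __name__ == "__main__":
    for ch in ("a", "_", "\t", "\u3000", "-"):
        print(repr(ch), is_xan(ch), is_xps(ch), is_xwd(ch))
```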
880
881       There is another non-standard property, Xuc, which matches any  charac‐
882       ter  that  can  be represented by a Universal Character Name in C++ and
883       other programming languages. These are the characters $,  @,  `  (grave
884       accent),  and  all  characters with Unicode code points greater than or
885       equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note  that
886       most  base  (ASCII) characters are excluded. (Universal Character Names
887       are of the form \uHHHH or \UHHHHHHHH where H is  a  hexadecimal  digit.
888       Note that the Xuc property does not match these sequences but the char‐
889       acters that they represent.)
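
       The Xuc set described above can be sketched as a simple predicate on
       code points. This Python model follows the description in this
       section; the function name is_xuc() is hypothetical.

```python
def is_xuc(ch):
    """Model of Xuc: $, @, grave accent, and every code point at or above
    U+00A0 except the surrogates U+D800 to U+DFFF."""
    cp = ord(ch)
    if ch in "$@`":
        return True
    return cp >= 0xA0 and not (0xD800 <= cp <= 0xDFFF)

if __name__ == "__main__":
    for ch in ("$", "A", "\u00a0", "\u00e9"):
        print(hex(ord(ch)), is_xuc(ch))
```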
890
891   Resetting the match start
892
893       In normal use, the escape sequence \K  causes  any  previously  matched
894       characters  not  to  be  included in the final matched sequence that is
895       returned. For example, the pattern:
896
897         foo\Kbar
898
899       matches "foobar", but reports that it has matched "bar".  \K  does  not
900       interact with anchoring in any way. The pattern:
901
902         ^foo\Kbar
903
904       matches  only  when  the  subject  begins with "foobar" (in single line
905       mode), though it again reports the matched string as "bar".  This  fea‐
906       ture  is similar to a lookbehind assertion (described below).  However,
907       in this case, the part of the subject before the real  match  does  not
908       have  to be of fixed length, as lookbehind assertions do. The use of \K
909       does not interfere with the setting of captured substrings.  For  exam‐
910       ple, when the pattern
911
912         (foo)\Kbar
913
914       matches "foobar", the first substring is still set to "foo".
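
       Python's re module has no \K, but the reported result of a pattern
       of the form prefix\Krest can be emulated by matching both parts and
       reporting only the span of the second. The sketch below does this
       with a named group; it models only the simple case where \K is used
       once, outside any assertion. The function name search_with_reset()
       and the group name "kept" are hypothetical.

```python
import re

def search_with_reset(prefix, rest, subject):
    """Emulate prefix\\Krest: both parts must match, but only the span of
    rest is reported. Returns (start, end) or None."""
    m = re.search("(?:" + prefix + ")(?P<kept>" + rest + ")", subject)
    return None if m is None else m.span("kept")

if __name__ == "__main__":
    print(search_with_reset("foo", "bar", "foobar"))
    # The prefix does not have to be of fixed length.
    print(search_with_reset(r"\d+", "bar", "12345bar"))
```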
915
916       Perl  used  to document that the use of \K within lookaround assertions
917       is "not well defined", but from version 5.32.0 Perl  does  not  support
918       this  usage  at  all.  In PCRE2, \K is acted upon when it occurs inside
919       positive assertions, but is ignored in negative assertions.  Note  that
920       when  a  pattern  such  as  (?=ab\K) matches, the reported start of the
921       match can be greater than the end of the match. Using \K in  a  lookbe‐
922       hind  assertion at the start of a pattern can also lead to odd effects.
923       For example, consider this pattern:
924
925         (?<=\Kfoo)bar
926
927       If the subject is "foobar", a call to  pcre2_match()  with  a  starting
928       offset  of 3 succeeds and reports the matching string as "foobar", that
929       is, the start of the reported match is earlier  than  where  the  match
930       started.
931
932   Simple assertions
933
934       The  final use of backslash is for certain simple assertions. An asser‐
935       tion specifies a condition that has to be met at a particular point  in
936       a  match, without consuming any characters from the subject string. The
937       use of groups for more complicated assertions is described below.   The
938       backslashed assertions are:
939
940         \b     matches at a word boundary
941         \B     matches when not at a word boundary
942         \A     matches at the start of the subject
943         \Z     matches at the end of the subject
944                 also matches before a newline at the end of the subject
945         \z     matches only at the end of the subject
946         \G     matches at the first matching position in the subject
947
948       Inside  a  character  class, \b has a different meaning; it matches the
949       backspace character. If any other of  these  assertions  appears  in  a
950       character class, an "invalid escape sequence" error is generated.
951
952       A  word  boundary is a position in the subject string where the current
953       character and the previous character do not both match \w or  \W  (i.e.
954       one  matches  \w  and the other matches \W), or the start or end of the
955       string if the first or last character matches  \w,  respectively.  When
956       PCRE2  is  built with Unicode support, the meanings of \w and \W can be
957       changed by setting the PCRE2_UCP option. When this  is  done,  it  also
958       affects  \b  and  \B.  Neither  PCRE2 nor Perl has a separate "start of
959       word" or "end of word" metasequence. However, whatever follows \b  nor‐
960       mally determines which it is. For example, the fragment \ba matches "a"
961       at the start of a word.
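
       The definition of a word boundary can be written out explicitly.
       The following Python sketch implements the definition above for
       ASCII \w (the PCRE2 default without PCRE2_UCP) by treating the
       positions outside the string as non-word. The names is_word() and
       word_boundaries() are hypothetical.

```python
import re

WORD = re.compile(r"\w", re.ASCII)   # ASCII letters, digits, underscore

def is_word(ch):
    return WORD.fullmatch(ch) is not None

def word_boundaries(subject):
    """Return every offset at which \\b would be true."""
    n = len(subject)
    found = []
    for i in range(n + 1):
        before = i > 0 and is_word(subject[i - 1])   # previous character
        after = i < n and is_word(subject[i])        # current character
        if before != after:                          # one \w, the other not
            found.append(i)
    return found

if __name__ == "__main__":
    print(word_boundaries("foo bar"))
```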
962
963       The \A, \Z, and \z assertions differ from  the  traditional  circumflex
964       and dollar (described in the next section) in that they only ever match
965       at the very start and end of the subject string, whatever  options  are
966       set.  Thus,  they are independent of multiline mode. These three asser‐
967       tions are not affected by the  PCRE2_NOTBOL  or  PCRE2_NOTEOL  options,
968       which  affect only the behaviour of the circumflex and dollar metachar‐
969       acters. However, if the startoffset argument of pcre2_match()  is  non-
970       zero,  indicating  that  matching is to start at a point other than the
971       beginning of the subject, \A can never match.  The  difference  between
972       \Z  and \z is that \Z matches before a newline at the end of the string
973       as well as at the very end, whereas \z matches only at the end.
974
975       The \G assertion is true only when the current matching position is  at
976       the  start point of the matching process, as specified by the startoff‐
977       set argument of pcre2_match(). It differs from \A  when  the  value  of
978       startoffset  is  non-zero. By calling pcre2_match() multiple times with
979       appropriate arguments, you can mimic Perl's /g option,  and  it  is  in
980       this kind of implementation where \G can be useful.
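
       The repeated-match loop described above can be sketched in Python,
       whose re module has no \G. The assumption here is that calling a
       compiled pattern's match() method with a start position plays the
       role of a pattern beginning with \G that is run through
       pcre2_match() with that startoffset: in both cases the match must
       begin exactly at the given offset. The function name scan_tokens()
       and the token pattern are hypothetical.

```python
import re

# Every alternative matches at least one character, so the loop always
# makes progress and empty matches need no special handling.
TOKEN = re.compile(r"\d+|[a-z]+|\s+")

def scan_tokens(subject):
    """Collect consecutive tokens, stopping at the first offset where no
    token starts. Returns (tokens, offset_reached)."""
    pos = 0
    tokens = []
    while pos < len(subject):
        m = TOKEN.match(subject, pos)   # must match at pos, like \G
        if m is None:
            break
        tokens.append(m.group())
        pos = m.end()                   # next call starts where this ended
    return tokens, pos

if __name__ == "__main__":
    print(scan_tokens("ab 12!x"))
```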
981
982       Note,  however,  that  PCRE2's  implementation of \G, being true at the
983       starting character of the matching process, is  subtly  different  from
984       Perl's,  which  defines it as true at the end of the previous match. In
985       Perl, these can be different when the  previously  matched  string  was
986       empty. Because PCRE2 does just one match at a time, it cannot reproduce
987       this behaviour.
988
989       If all the alternatives of a pattern begin with \G, the  expression  is
990       anchored to the starting match position, and the "anchored" flag is set
991       in the compiled regular expression.
992

CIRCUMFLEX AND DOLLAR

994
995       The circumflex and dollar  metacharacters  are  zero-width  assertions.
996       That  is,  they test for a particular condition being true without con‐
997       suming any characters from the subject string. These two metacharacters
998       are  concerned  with matching the starts and ends of lines. If the new‐
999       line convention is set so that only the two-character sequence CRLF  is
1000       recognized  as  a newline, isolated CR and LF characters are treated as
1001       ordinary data characters, and are not recognized as newlines.
1002
1003       Outside a character class, in the default matching mode, the circumflex
1004       character  is  an  assertion  that is true only if the current matching
1005       point is at the start of the subject string. If the  startoffset  argu‐
1006       ment  of  pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum‐
1007       flex can never match if the PCRE2_MULTILINE option is unset.  Inside  a
1008       character  class,  circumflex  has  an  entirely different meaning (see
1009       below).
1010
1011       Circumflex need not be the first character of the pattern if  a  number
1012       of  alternatives are involved, but it should be the first thing in each
1013       alternative in which it appears if the pattern is ever  to  match  that
1014       branch.  If all possible alternatives start with a circumflex, that is,
1015       if the pattern is constrained to match only at the start  of  the  sub‐
1016       ject,  it  is  said  to be an "anchored" pattern. (There are also other
1017       constructs that can cause a pattern to be anchored.)
1018
1019       The dollar character is an assertion that is true only if  the  current
1020       matching  point  is  at  the  end of the subject string, or immediately
1021       before a newline  at  the  end  of  the  string  (by  default),  unless
1022       PCRE2_NOTEOL is set. Note, however, that it does not actually match the
1023       newline. Dollar need not be the last character of the pattern if a num‐
1024       ber of alternatives are involved, but it should be the last item in any
1025       branch in which it appears. Dollar has no special meaning in a  charac‐
1026       ter class.
1027
1028       The  meaning  of  dollar  can be changed so that it matches only at the
1029       very end of the string, by setting the PCRE2_DOLLAR_ENDONLY  option  at
1030       compile time. This does not affect the \Z assertion.
1031
1032       The meanings of the circumflex and dollar metacharacters are changed if
1033       the PCRE2_MULTILINE option is set. When this  is  the  case,  a  dollar
1034       character  matches before any newlines in the string, as well as at the
1035       very end, and a circumflex matches immediately after internal  newlines
1036       as  well as at the start of the subject string. It does not match after
1037       a newline that ends the string, for compatibility with  Perl.  However,
1038       this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.
1039
1040       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
1041       (where \n represents a newline) in multiline mode, but  not  otherwise.
1042       Consequently,  patterns  that  are anchored in single line mode because
1043       all branches start with ^ are not anchored in  multiline  mode,  and  a
1044       match  for  circumflex  is  possible  when  the startoffset argument of
1045       pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option  is  ignored
1046       if PCRE2_MULTILINE is set.
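
       The example above can be reproduced with Python's re module, whose
       ^ and $ behave like PCRE2's for this case. This is an analogy only:
       Python recognizes just "\n" as a newline here, and re.MULTILINE
       stands in for PCRE2_MULTILINE.

```python
import re

subject = "def\nabc"

# Single line mode: ^ is true only at the start of the subject.
single = re.search(r"^abc$", subject)

# Multiline mode: ^ is also true immediately after the internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# By default, $ is also true immediately before a newline that ends the
# string, but the newline itself is not part of the match.
trailing = re.search(r"abc$", "abc\n")

if __name__ == "__main__":
    print(single, multi.span(), trailing.span())
```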
1047
1048       When  the  newline  convention (see "Newline conventions" below) recog‐
1049       nizes the two-character sequence CRLF as a newline, this is  preferred,
1050       even  if  the  single  characters CR and LF are also recognized as new‐
1051       lines. For example, if the newline convention  is  "any",  a  multiline
1052       mode  circumflex matches before "xyz" in the string "abc\r\nxyz" rather
1053       than after CR, even though CR on its own is a valid newline.  (It  also
1054       matches at the very start of the string, of course.)
1055
1056       Note  that  the sequences \A, \Z, and \z can be used to match the start
1057       and end of the subject in both modes, and if all branches of a  pattern
1058       start  with \A it is always anchored, whether or not PCRE2_MULTILINE is
1059       set.
1060

FULL STOP (PERIOD, DOT) AND \N

1062
1063       Outside a character class, a dot in the pattern matches any one charac‐
1064       ter  in  the subject string except (by default) a character that signi‐
1065       fies the end of a line.
1066
1067       When a line ending is defined as a single character, dot never  matches
1068       that  character; when the two-character sequence CRLF is used, dot does
1069       not match CR if it is immediately followed  by  LF,  but  otherwise  it
1070       matches  all characters (including isolated CRs and LFs). When any Uni‐
1071       code line endings are being recognized, dot does not match CR or LF  or
1072       any of the other line ending characters.
1073
1074       The  behaviour  of  dot  with regard to newlines can be changed. If the
1075       PCRE2_DOTALL option is set, a dot matches any  one  character,  without
1076       exception.   If  the two-character sequence CRLF is present in the sub‐
1077       ject string, it takes two dots to match it.
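
       The effect of PCRE2_DOTALL can be illustrated by analogy with
       re.DOTALL in Python. Note that Python's dot treats only "\n" as a
       line ending, which corresponds to PCRE2 with a newline convention
       of LF only; the other conventions described above have no direct
       equivalent in this sketch.

```python
import re

# Default: dot does not match the newline character.
default_dots = re.findall(r".", "a\nb")

# DOTALL: dot matches any one character, without exception.
dotall_dots = re.findall(r".", "a\nb", re.DOTALL)

# CRLF is two characters, so it takes two dots to match it.
crlf_two_dots = re.fullmatch(r"..", "\r\n", re.DOTALL)
crlf_one_dot = re.fullmatch(r".", "\r\n", re.DOTALL)

if __name__ == "__main__":
    print(default_dots, dotall_dots)
    print(crlf_two_dots is not None, crlf_one_dot is not None)
```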
1078
1079       The handling of dot is entirely independent of the handling of  circum‐
1080       flex  and  dollar,  the  only relationship being that they both involve
1081       newlines. Dot has no special meaning in a character class.
1082
1083       The escape sequence \N when not followed by an  opening  brace  behaves
1084       like  a dot, except that it is not affected by the PCRE2_DOTALL option.
1085       In other words, it matches any character except one that signifies  the
1086       end of a line.
1087
1088       When \N is followed by an opening brace it has a different meaning. See
1089       the section entitled "Non-printing characters" above for details.  Perl
1090       also  uses  \N{name}  to specify characters by Unicode name; PCRE2 does
1091       not support this.
1092

MATCHING A SINGLE CODE UNIT

1094
1095       Outside a character class, the escape sequence \C matches any one  code
1096       unit,  whether or not a UTF mode is set. In the 8-bit library, one code
1097       unit is one byte; in the 16-bit library it is a  16-bit  unit;  in  the
1098       32-bit  library  it  is  a 32-bit unit. Unlike a dot, \C always matches
1099       line-ending characters. The feature is provided in  Perl  in  order  to
1100       match individual bytes in UTF-8 mode, but it is unclear how it can use‐
1101       fully be used.
1102
1103       Because \C breaks up characters into individual  code  units,  matching
1104       one  unit  with  \C  in UTF-8 or UTF-16 mode means that the rest of the
1105       string may start with a malformed UTF  character.  This  has  undefined
1106       results, because PCRE2 assumes that it is matching character by charac‐
1107       ter in a valid UTF string (by default it checks  the  subject  string's
1108       validity  at  the  start of processing unless the PCRE2_NO_UTF_CHECK or
1109       PCRE2_MATCH_INVALID_UTF option is used).
1110
1111       An  application  can  lock  out  the  use  of   \C   by   setting   the
1112       PCRE2_NEVER_BACKSLASH_C  option  when  compiling  a pattern. It is also
1113       possible to build PCRE2 with the use of \C permanently disabled.
1114
1115       PCRE2 does not allow \C to appear in lookbehind  assertions  (described
1116       below)  in UTF-8 or UTF-16 modes, because this would make it impossible
1117       to calculate the length of  the  lookbehind.  Neither  the  alternative
1118       matching function pcre2_dfa_match() nor the JIT optimizer supports \C in
1119       these UTF modes.  The former gives a match-time error; the latter fails
1120       to optimize and so the match is always run using the interpreter.
1121
1122       In  the  32-bit  library,  however,  \C  is  always supported (when not
1123       explicitly locked out) because it always matches a  single  code  unit,
1124       whether or not UTF-32 is specified.
1125
1126       In general, the \C escape sequence is best avoided. However, one way of
1127       using it that avoids the problem of malformed UTF-8 or  UTF-16  charac‐
1128       ters  is  to use a lookahead to check the length of the next character,
1129       as in this pattern, which could be used with  a  UTF-8  string  (ignore
1130       white space and line breaks):
1131
1132         (?| (?=[\x00-\x7f])(\C) |
1133             (?=[\x80-\x{7ff}])(\C)(\C) |
1134             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
1135             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
1136
1137       In  this  example,  a  group  that starts with (?| resets the capturing
1138       parentheses numbers in each alternative (see "Duplicate Group  Numbers"
1139       below). The assertions at the start of each branch check the next UTF-8
1140       character for values whose encoding uses 1, 2, 3, or 4  bytes,  respec‐
1141       tively.  The  character's  individual  bytes  are  then captured by the
1142       appropriate number of \C groups.
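
       The four lookahead ranges above follow the UTF-8 rule  for  how  many
       bytes  a  code  point  needs. As a rough illustration (in Python, not
       PCRE2; the helper name utf8_length is hypothetical), the same  bound‐
       aries can be checked against Python's own encoder. Unicode itself ends
       at U+10FFFF, so the samples stop there.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 needs for a code point (hypothetical helper)."""
    if code_point < 0:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x1FFFFF:
        return 4
    raise ValueError("code point out of range")


# Boundary values of the four ranges, limited to what Python can encode.
samples = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
lengths = [utf8_length(cp) for cp in samples]
encoded = [len(chr(cp).encode("utf-8")) for cp in samples]
```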
1143

SQUARE BRACKETS AND CHARACTER CLASSES

1145
1146       An opening square bracket introduces a character class, terminated by a
1147       closing square bracket. A closing square bracket on its own is not spe‐
1148       cial by default.  If a closing square bracket is required as  a  member
1149       of the class, it should be the first data character in the class (after
1150       an initial circumflex, if present) or escaped with  a  backslash.  This
1151       means  that,  by default, an empty class cannot be defined. However, if
1152       the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket  at
1153       the start does end the (empty) class.
1154
1155       A  character class matches a single character in the subject. A matched
1156       character must be in the set of characters defined by the class, unless
1157       the  first  character in the class definition is a circumflex, in which
1158       case the subject character must not be in the set defined by the class.
1159       If  a  circumflex is actually required as a member of the class, ensure
1160       it is not the first character, or escape it with a backslash.
1161
1162       For example, the character class [aeiou] matches any lower case  vowel,
1163       while  [^aeiou]  matches  any character that is not a lower case vowel.
1164       Note that a circumflex is just a convenient notation for specifying the
1165       characters  that  are in the class by enumerating those that are not. A
1166       class that starts with a circumflex is not an assertion; it still  con‐
1167       sumes  a  character  from the subject string, and therefore it fails if
1168       the current pointer is at the end of the string.
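
       The point that a negated class still consumes  a  character  can  be
       seen in a small sketch. It uses Python's re module as a stand-in, on
       the assumption that simple classes behave the same way there; it does
       not call PCRE2.

```python
import re

vowel = re.compile(r"[aeiou]")
non_vowel = re.compile(r"[^aeiou]")

vowel_hit = vowel.fullmatch("e") is not None          # "e" is in the set
non_vowel_hit = non_vowel.fullmatch("z") is not None  # "z" is not in the set

# A negated class must consume one character, so "q[^u]" cannot match
# when "q" is the last character of the subject.
at_end = re.search(r"q[^u]", "Iraq")
mid_word = re.search(r"q[^u]", "Iraqi")
```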
1169
1170       Characters in a class may be specified by their code points  using  \o,
1171       \x,  or \N{U+hh..} in the usual way. When caseless matching is set, any
1172       letters in a class represent both their upper case and lower case  ver‐
1173       sions,  so  for example, a caseless [aeiou] matches "A" as well as "a",
1174       and a caseless [^aeiou] does not match "A", whereas a  caseful  version
1175       would.  Note  that  there  are  two ASCII characters, K and S, that, in
1176       addition to their lower case  ASCII  equivalents,  are  case-equivalent
1177       with Unicode U+212A (Kelvin sign) and U+017F (long S) respectively when
1178       either PCRE2_UTF or PCRE2_UCP is set.
1179
1180       Characters that might indicate line breaks are  never  treated  in  any
1181       special  way  when  matching  character  classes,  whatever line-ending
1182       sequence is in use,  and  whatever  setting  of  the  PCRE2_DOTALL  and
1183       PCRE2_MULTILINE  options  is  used. A class such as [^a] always matches
1184       one of these characters.
1185
1186       The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
1187       \S,  \v,  \V,  \w,  and \W may appear in a character class, and add the
1188       characters that they  match  to  the  class.  For  example,  [\dABCDEF]
1189       matches  any  hexadecimal  digit.  In  UTF  modes, the PCRE2_UCP option
1190       affects the meanings of \d, \s, \w and their upper case partners,  just
1191       as  it does when they appear outside a character class, as described in
1192       the section  entitled  "Generic  character  types"  above.  The  escape
1193       sequence  \b  has  a  different  meaning  inside  a character class; it
1194       matches the backspace character. The sequences \B, \R, and \X  are  not
1195       special  inside  a  character class. Like any other unrecognized escape
1196       sequences, they cause an error. The same is true for \N when  not  fol‐
1197       lowed by an opening brace.
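
       A short sketch of two of these points, written  for  Python's  re
       module (an assumption: its handling of \d and \b inside a class is
       taken to mirror the description above). The re.ASCII  flag  restricts
       \d to 0-9, roughly like PCRE2 without PCRE2_UCP.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
backspace = re.compile(r"[\b]")

# Only digits and the upper case letters A-F are members of the class.
hex_hits = [c for c in "09AFGaz" if hex_digit.fullmatch(c)]

# Inside a class, \b is the backspace character (code point 8).
bs_hit = backspace.fullmatch("\x08") is not None

# Outside a class, \b is a word boundary and consumes nothing.
boundary_positions = [m.start() for m in re.finditer(r"\b", "ab cd")]
```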
1198
1199       The  minus (hyphen) character can be used to specify a range of charac‐
1200       ters in a character  class.  For  example,  [d-m]  matches  any  letter
1201       between  d  and  m,  inclusive.  If  a minus character is required in a
1202       class, it must be escaped with a backslash  or  appear  in  a  position
1203       where  it cannot be interpreted as indicating a range, typically as the
1204       first or last character in the class, or immediately after a range. For
1205       example,  [b-d-z] matches letters in the range b to d, a hyphen charac‐
1206       ter, or z.
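
       The hyphen positions described above can be tried out  as  follows.
       This is a sketch in Python's re module, which is assumed to read these
       particular classes the same way; the helper "members" is hypothetical.

```python
import re


def members(char_class, candidates):
    """Return the candidate characters that the class matches."""
    return "".join(c for c in candidates if re.fullmatch(char_class, c))


range_d_m = members(r"[d-m]", "cdmn-")        # a plain range
after_range = members(r"[b-d-z]", "abcdez-")  # hyphen right after a range
escaped = members(r"[a\-z]", "amz-")          # hyphen escaped
first = members(r"[-az]", "amz-")             # hyphen first in the class
last = members(r"[az-]", "amz-")              # hyphen last in the class
```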
1207
1208       Perl treats a hyphen as a literal if it appears before or after a POSIX
1209       class (see below) or before or after a character type escape such as
1210       \d or \H.  However, unless the hyphen is  the  last  character  in  the
1211       class,  Perl  outputs  a  warning  in its warning mode, as this is most
1212       likely a user error. As PCRE2 has no facility for warning, an error  is
1213       given in these cases.
1214
1215       It is not possible to have the literal character "]" as the end charac‐
1216       ter of a range. A pattern such as [W-]46] is interpreted as a class  of
1217       two  characters ("W" and "-") followed by a literal string "46]", so it
1218       would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
1219       backslash  it is interpreted as the end of range, so [W-\]46] is inter‐
1220       preted as a class containing a range followed by two other  characters.
1221       The  octal or hexadecimal representation of "]" can also be used to end
1222       a range.
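
       Both readings of the closing bracket can be checked with  a  small
       sketch. It relies on Python's re module, which is assumed to parse
       these two classes as described above.

```python
import re

literal_tail = re.compile(r"[W-]46]")    # class of "W" and "-", then "46]"
range_to_bracket = re.compile(r"[W-\]46]")  # range W to ], plus "4" and "6"

tail_hits = [s for s in ("W46]", "-46]", "X46]", "W")
             if literal_tail.fullmatch(s)]
range_hits = "".join(c for c in "VWXZ[\\]^46-"
                     if range_to_bracket.fullmatch(c))
```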
1223
1224       Ranges normally include all code points between the start and end char‐
1225       acters,  inclusive.  They  can  also  be used for code points specified
1226       numerically, for example [\000-\037]. Ranges can include any characters
1227       that  are  valid  for  the current mode. In any UTF mode, the so-called
1228       "surrogate" characters (those whose code points lie between 0xd800  and
1229       0xdfff  inclusive)  may  not  be  specified  explicitly by default (the
1230       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables this  check).  How‐
1231       ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
1232       are always permitted.
1233
1234       There is a special case in EBCDIC environments  for  ranges  whose  end
1235       points are both specified as literal letters in the same case. For com‐
1236       patibility with Perl, EBCDIC code points within the range that are  not
1237       letters  are  omitted. For example, [h-k] matches only four characters,
1238       even though the codes for h and k are 0x88 and 0x92, a range of 11 code
1239       points.  However,  if  the range is specified numerically, for example,
1240       [\x88-\x92] or [h-\x92], all code points are included.
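
       The arithmetic behind the [h-k] example can be reproduced  outside
       PCRE2. The sketch below assumes that Python's "cp037" codec is a fair
       representative of the EBCDIC code page meant here; it only  inspects
       code points and performs no matching.

```python
# Assumption: cp037 stands in for the EBCDIC environment described above.
h_code = "h".encode("cp037")[0]
k_code = "k".encode("cp037")[0]

# Every code point from h to k, decoded back to text.
span = bytes(range(h_code, k_code + 1)).decode("cp037")

# Only the unaccented letters survive; these are what [h-k] matches.
letters_in_span = "".join(c for c in span if c.isascii() and c.isalpha())
```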
1241
1242       If a range that includes letters is used when caseless matching is set,
1243       it matches the letters in either case. For example, [W-c] is equivalent
1244       to [][\\^_`wxyzabc], matched caselessly, and  in  a  non-UTF  mode,  if
1245       character  tables  for  a French locale are in use, [\xc8-\xcb] matches
1246       accented E characters in both cases.
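
       The ASCII part of this can be illustrated  with  Python's  re  module,
       assuming its caseless handling of ranges agrees with PCRE2 for these
       characters. The locale-dependent [\xc8-\xcb] case is not covered.

```python
import re

caseless_range = re.compile(r"[W-c]", re.IGNORECASE)

# W-c spans W X Y Z [ \ ] ^ _ ` a b c; caseless matching adds the other
# case of each letter, but "d", "D", "v" and "V" stay outside.
hits = "".join(c for c in "VWwXxZz[]^_`AaCcDd"
               if caseless_range.fullmatch(c))
```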
1247
1248       A circumflex can conveniently be used with  the  upper  case  character
1249       types  to specify a more restricted set of characters than the matching
1250       lower case type.  For example, the class [^\W_] matches any  letter  or
1251       digit, but not underscore, whereas [\w] includes underscore. A positive
1252       character class should be read as "something OR something OR ..." and a
1253       negative class as "NOT something AND NOT something AND NOT ...".
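
       A minimal sketch of the [^\W_] idiom, using Python's  re  module  in
       ASCII mode as a stand-in for PCRE2 without PCRE2_UCP:

```python
import re

alnum_only = re.compile(r"[^\W_]", re.ASCII)  # NOT non-word AND NOT "_"
word_char = re.compile(r"[\w]", re.ASCII)

alnum_hits = "".join(c for c in "aZ5_-" if alnum_only.fullmatch(c))
word_hits = "".join(c for c in "aZ5_-" if word_char.fullmatch(c))
```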
1254
1255       The  only  metacharacters  that are recognized in character classes are
1256       backslash, hyphen (only where it can be  interpreted  as  specifying  a
1257       range),  circumflex  (only  at the start), opening square bracket (only
1258       when it can be interpreted as introducing a POSIX class name, or for  a
1259       special  compatibility  feature  -  see the next two sections), and the
1260       terminating  closing  square  bracket.  However,  escaping  other  non-
1261       alphanumeric characters does no harm.
1262

POSIX CHARACTER CLASSES

1264
1265       Perl supports the POSIX notation for character classes. This uses names
1266       enclosed by [: and :] within the enclosing square brackets. PCRE2  also
1267       supports this notation. For example,
1268
1269         [01[:alpha:]%]
1270
1271       matches "0", "1", any alphabetic character, or "%". The supported class
1272       names are:
1273
1274         alnum    letters and digits
1275         alpha    letters
1276         ascii    character codes 0 - 127
1277         blank    space or tab only
1278         cntrl    control characters
1279         digit    decimal digits (same as \d)
1280         graph    printing characters, excluding space
1281         lower    lower case letters
1282         print    printing characters, including space
1283         punct    printing characters, excluding letters and digits and space
1284         space    white space (the same as \s from PCRE1 8.34)
1285         upper    upper case letters
1286         word     "word" characters (same as \w)
1287         xdigit   hexadecimal digits
1288
1289       The default "space" characters are HT (9), LF (10), VT (11),  FF  (12),
1290       CR  (13),  and space (32). If locale-specific matching is taking place,
1291       the list of space characters may be different; there may  be  fewer  or
1292       more of them. "Space" and \s match the same set of characters.
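
       For engines that lack the [:name:] notation, the default  (non-UCP,
       no locale tables) meanings in the table above can be approximated by
       explicit ASCII ranges. The following is only  a  sketch:  the  table
       POSIX_ASCII and the helper expand_posix are hypothetical, the ranges
       are the usual "C" locale ones, and negated  names  such  as  [:^digit:]
       are left untouched.

```python
import re

# Hypothetical ASCII-only expansions of the class names listed above.
POSIX_ASCII = {
    "alnum": "A-Za-z0-9",
    "alpha": "A-Za-z",
    "ascii": r"\x00-\x7f",
    "blank": r" \t",
    "cntrl": r"\x00-\x1f\x7f",
    "digit": "0-9",
    "graph": r"\x21-\x7e",
    "lower": "a-z",
    "print": r"\x20-\x7e",
    "punct": r"!-/:-@\[-`{-~",
    "space": r"\t\n\x0b\x0c\r ",
    "upper": "A-Z",
    "word": "A-Za-z0-9_",
    "xdigit": "0-9A-Fa-f",
}


def expand_posix(pattern):
    """Replace each [:name:] item by its ASCII range text."""
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_ASCII[m.group(1)], pattern)


expanded = expand_posix("[01[:alpha:]%]")
```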
1293
1294       The  name  "word"  is  a Perl extension, and "blank" is a GNU extension
1295       from Perl 5.8. Another Perl extension is negation, which  is  indicated
1296       by a ^ character after the colon. For example,
1297
1298         [12[:^digit:]]
1299
1300       matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
1301       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
1302       these are not supported, and an error is given if they are encountered.
1303
1304       By default, characters with values greater than 127 do not match any of
1305       the POSIX character classes, although this may be different for charac‐
1306       ters  in  the range 128-255 when locale-specific matching is happening.
1307       However, if the PCRE2_UCP option is passed to pcre2_compile(), some  of
1308       the  classes are changed so that Unicode character properties are used.
1309       This  is  achieved  by  replacing  certain  POSIX  classes  with  other
1310       sequences, as follows:
1311
1312         [:alnum:]  becomes  \p{Xan}
1313         [:alpha:]  becomes  \p{L}
1314         [:blank:]  becomes  \h
1315         [:cntrl:]  becomes  \p{Cc}
1316         [:digit:]  becomes  \p{Nd}
1317         [:lower:]  becomes  \p{Ll}
1318         [:space:]  becomes  \p{Xps}
1319         [:upper:]  becomes  \p{Lu}
1320         [:word:]   becomes  \p{Xwd}
1321
1322       Negated  versions, such as [:^alpha:] use \P instead of \p. Three other
1323       POSIX classes are handled specially in UCP mode:
1324
1325       [:graph:] This matches characters that have glyphs that mark  the  page
1326                 when printed. In Unicode property terms, it matches all char‐
1327                 acters with the L, M, N, P, S, or Cf properties, except for:
1328
1329                   U+061C           Arabic Letter Mark
1330                   U+180E           Mongolian Vowel Separator
1331                   U+2066 - U+2069  Various "isolate"s
1332
1333
1334       [:print:] This matches the same  characters  as  [:graph:]  plus  space
1335                 characters  that  are  not controls, that is, characters with
1336                 the Zs property.
1337
1338       [:punct:] This matches all characters that have the Unicode P (punctua‐
1339                 tion)  property,  plus those characters with code points less
1340                 than 256 that have the S (Symbol) property.
1341
1342       The other POSIX classes are unchanged, and match only  characters  with
1343       code points less than 256.
1344

COMPATIBILITY FEATURE FOR WORD BOUNDARIES

1346
1347       In  the POSIX.2 compliant library that was included in 4.4BSD Unix, the
1348       ugly syntax [[:<:]] and [[:>:]] is used for matching  "start  of  word"
1349       and "end of word". PCRE2 treats these items as follows:
1350
1351         [[:<:]]  is converted to  \b(?=\w)
1352         [[:>:]]  is converted to  \b(?<=\w)
1353
1354       Only these exact character sequences are recognized. A sequence such as
1355       [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
1356       support  is not compatible with Perl. It is provided to help migrations
1357       from other environments, and is best not used in any new patterns. Note
1358       that  \b matches at the start and the end of a word (see "Simple asser‐
1359       tions" above), and in a Perl-style pattern the preceding  or  following
1360       character  normally  shows  which  is  wanted, without the need for the
1361       assertions that are used above in order to give exactly the  POSIX  be‐
1362       haviour.
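
       The two converted forms can be compared with a plain  \b.  The  sketch
       below uses Python's re module, which does not recognize [[:<:]] or
       [[:>:]] itself, so the converted patterns are written out directly.

```python
import re

START_OF_WORD = r"\b(?=\w)"   # what [[:<:]] is converted to
END_OF_WORD = r"\b(?<=\w)"    # what [[:>:]] is converted to

subject = "hi there"
starts = [m.start() for m in re.finditer(START_OF_WORD, subject)]
ends = [m.start() for m in re.finditer(END_OF_WORD, subject)]
plain = [m.start() for m in re.finditer(r"\b", subject)]
```

       A plain \b reports both kinds of position; the assertions select one
       kind each.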
1363

VERTICAL BAR

1365
1366       Vertical  bar characters are used to separate alternative patterns. For
1367       example, the pattern
1368
1369         gilbert|sullivan
1370
1371       matches either "gilbert" or "sullivan". Any number of alternatives  may
1372       appear,  and  an  empty  alternative  is  permitted (matching the empty
1373       string). The matching process tries each alternative in turn, from left
1374       to  right, and the first one that succeeds is used. If the alternatives
1375       are within a group (defined below), "succeeds" means matching the  rest
1376       of the main pattern as well as the alternative in the group.
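
       The left-to-right rule, and what "succeeds" means inside a group, can
       be seen in this sketch (Python's re module, assumed to  order  alter‐
       natives in the same way):

```python
import re

# The first alternative that succeeds is used, even if a later one
# would match more of the subject.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a group, "a" alone does not let the following "c" match, so
# the second alternative is the one that succeeds.
in_group = re.fullmatch(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"gilbert|sullivan|", "") is not None
```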
1377

INTERNAL OPTION SETTING

1379
1380       The  settings  of  the  PCRE2_CASELESS,  PCRE2_MULTILINE, PCRE2_DOTALL,
1381       PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE  options
1382       can  be  changed  from  within  the  pattern  by  a sequence of letters
1383       enclosed between "(?"  and ")". These options are Perl-compatible,  and
1384       are  described in detail in the pcre2api documentation. The option let‐
1385       ters are:
1386
1387         i  for PCRE2_CASELESS
1388         m  for PCRE2_MULTILINE
1389         n  for PCRE2_NO_AUTO_CAPTURE
1390         s  for PCRE2_DOTALL
1391         x  for PCRE2_EXTENDED
1392         xx for PCRE2_EXTENDED_MORE
1393
1394       For example, (?im) sets caseless, multiline matching. It is also possi‐
1395       ble  to  unset  these  options by preceding the relevant letters with a
1396       hyphen, for example (?-im). The two "extended" options are not indepen‐
1397       dent; unsetting either one cancels the effects of both of them.
1398
1399       A   combined  setting  and  unsetting  such  as  (?im-sx),  which  sets
1400       PCRE2_CASELESS and PCRE2_MULTILINE  while  unsetting  PCRE2_DOTALL  and
1401       PCRE2_EXTENDED,  is  also  permitted. Only one hyphen may appear in the
1402       options string. If a letter appears both before and after  the  hyphen,
1403       the  option  is unset. An empty options setting "(?)" is allowed. Need‐
1404       less to say, it has no effect.
1405
1406       If the first character following (? is a circumflex, it causes  all  of
1407       the  above  options to be unset. Thus, (?^) is equivalent to (?-imnsx).
1408       Letters may follow the circumflex to  cause  some  options  to  be  re-
1409       instated, but a hyphen may not appear.
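
       The rules for the letters, the hyphen, and the  circumflex  can  be
       summarized in a short sketch. It is not how  PCRE2  parses  patterns;
       the function parse_option_setting and its helpers  are  hypothetical,
       and only the letters listed above are handled.

```python
LETTERS = {
    "i": "PCRE2_CASELESS",
    "m": "PCRE2_MULTILINE",
    "n": "PCRE2_NO_AUTO_CAPTURE",
    "s": "PCRE2_DOTALL",
    "x": "PCRE2_EXTENDED",
    "xx": "PCRE2_EXTENDED_MORE",
}
EXTENDED = {"PCRE2_EXTENDED", "PCRE2_EXTENDED_MORE"}


def _names(letters):
    """Translate a run of letters such as 'imxx' into option names."""
    names, pos = set(), 0
    while pos < len(letters):
        key = "xx" if letters.startswith("xx", pos) else letters[pos]
        if key not in LETTERS:
            raise ValueError("unknown option letter: " + key)
        names.add(LETTERS[key])
        pos += len(key)
    return names


def parse_option_setting(item):
    """Return (set_names, unset_names) for an item such as '(?im-sx)'."""
    if not (item.startswith("(?") and item.endswith(")")):
        raise ValueError("not an option setting: " + item)
    body = item[2:-1]
    if body.startswith("^"):
        if "-" in body:
            raise ValueError("a hyphen may not follow a circumflex")
        to_set = _names(body[1:])
        return to_set, set(LETTERS.values()) - to_set
    if body.count("-") > 1:
        raise ValueError("only one hyphen may appear")
    before, _, after = body.partition("-")
    to_unset = _names(after)
    if to_unset & EXTENDED:
        to_unset |= EXTENDED  # unsetting either one cancels both
    # A letter on both sides of the hyphen ends up unset.
    return _names(before) - to_unset, to_unset
```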
1410
1411       The  PCRE2-specific  options  PCRE2_DUPNAMES  and PCRE2_UNGREEDY can be
1412       changed in the same way as the Perl-compatible  options  by  using  the
1413       characters J and U respectively. However, these are not unset by (?^).
1414
1415       When  one  of  these  option  changes occurs at top level (that is, not
1416       inside group parentheses), the change applies to the remainder  of  the
1417       pattern  that follows. An option change within a group (see below for a
1418       description of groups) affects only that part of the group that follows
1419       it, so
1420
1421         (a(?i)b)c
1422
1423       matches  abc  and  aBc and no other strings (assuming PCRE2_CASELESS is
1424       not used).  By this means, options can be made to have  different  set‐
1425       tings in different parts of the pattern. Any changes made in one alter‐
1426       native do carry on into subsequent branches within the same group.  For
1427       example,
1428
1429         (a(?i)b|c)
1430
1431       matches  "ab",  "aB",  "c",  and "C", even though when matching "C" the
1432       first branch is abandoned before the option setting.  This  is  because
1433       the  effects  of option settings happen at compile time. There would be
1434       some very weird behaviour otherwise.
1435
1436       As a convenient shorthand, if any option settings are required  at  the
1437       start  of a non-capturing group (see the next section), the option let‐
1438       ters may appear between the "?" and the ":". Thus the two patterns
1439
1440         (?i:saturday|sunday)
1441         (?:(?i)saturday|sunday)
1442
1443       match exactly the same set of strings.
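
       The shorthand form can be tried in Python's re module, which accepts
       (?i:...) with the same meaning. Recent versions of that module do not
       accept an unscoped (?i) in mid-pattern, so only the first of the  two
       patterns is shown; this is a sketch, not a PCRE2 run.

```python
import re

weekend = re.compile(r"(?i:saturday|sunday)")
hits = [s for s in ("saturday", "SUNDAY", "Sunday", "monday")
        if weekend.fullmatch(s)]

# The option applies only inside the group; "urday" stays caseful.
scoped = re.compile(r"(?i:sat)urday")
scoped_hits = [s for s in ("SATurday", "saturday", "SATURDAY")
               if scoped.fullmatch(s)]
```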
1444
1445       Note: There are other PCRE2-specific options,  applying  to  the  whole
1446       pattern,  which  can be set by the application when the compiling func‐
1447       tion is called. In addition, the pattern can  contain  special  leading
1448       sequences  such  as (*CRLF) to override what the application has set or
1449       what has been defaulted.  Details are given  in  the  section  entitled
1450       "Newline sequences" above. There are also the (*UTF) and (*UCP) leading
1451       sequences that can be used to set UTF and Unicode property modes;  they
1452       are  equivalent to setting the PCRE2_UTF and PCRE2_UCP options, respec‐
1453       tively. However,  the  application  can  set  the  PCRE2_NEVER_UTF  and
1454       PCRE2_NEVER_UCP  options,  which  lock  out  the  use of the (*UTF) and
1455       (*UCP) sequences.
1456

GROUPS

1458
1459       Groups are delimited by parentheses  (round  brackets),  which  can  be
1460       nested.  Turning part of a pattern into a group does two things:
1461
1462       1. It localizes a set of alternatives. For example, the pattern
1463
1464         cat(aract|erpillar|)
1465
1466       matches  "cataract",  "caterpillar", or "cat". Without the parentheses,
1467       it would match "cataract", "erpillar" or an empty string.
1468
1469       2. It creates a "capture group". This means that, when the  whole  pat‐
1470       tern  matches, the portion of the subject string that matched the group
1471       is passed back to the caller, separately from the portion that  matched
1472       the  whole  pattern.   (This  applies  only to the traditional matching
1473       function; the DFA matching function does not support capturing.)
1474
1475       Opening parentheses are counted from left to right (starting from 1) to
1476       obtain  numbers for capture groups. For example, if the string "the red
1477       king" is matched against the pattern
1478
1479         the ((red|white) (king|queen))
1480
1481       the captured substrings are "red king", "red", and "king", and are num‐
1482       bered 1, 2, and 3, respectively.
1483
1484       The  fact  that  plain  parentheses  fulfil two functions is not always
1485       helpful.  There are often times when grouping is required without  cap‐
1486       turing.  If an opening parenthesis is followed by a question mark and a
1487       colon, the group does not do any capturing, and  is  not  counted  when
1488       computing  the number of any subsequent capture groups. For example, if
1489       the string "the white queen" is matched against the pattern
1490
1491         the ((?:red|white) (king|queen))
1492
1493       the captured substrings are "white queen" and "queen", and are numbered
1494       1 and 2. The maximum number of capture groups is 65535.
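
       The numbering in the examples above can be reproduced with  Python's
       re module, which is assumed to count opening parentheses in the same
       way for these patterns:

```python
import re

# Three capture groups, numbered by their opening parentheses.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

# (?: ... ) groups without capturing and is skipped when numbering.
mixed = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")

# The group localizes the alternatives, including the empty one.
cat = [s for s in ("cataract", "caterpillar", "cat", "erpillar")
       if re.fullmatch(r"cat(aract|erpillar|)", s)]
```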
1495
1496       As  a  convenient shorthand, if any option settings are required at the
1497       start of a non-capturing group, the option letters may  appear  between
1498       the "?" and the ":". Thus the two patterns
1499
1500         (?i:saturday|sunday)
1501         (?:(?i)saturday|sunday)
1502
1503       match exactly the same set of strings. Because alternative branches are
1504       tried from left to right, and options are not reset until  the  end  of
1505       the  group is reached, an option setting in one branch does affect sub‐
1506       sequent branches, so the above patterns match "SUNDAY" as well as "Sat‐
1507       urday".
1508

DUPLICATE GROUP NUMBERS

1510
1511       Perl 5.10 introduced a feature whereby each alternative in a group uses
1512       the same numbers for its capturing parentheses.  Such  a  group  starts
1513       with  (?|  and  is  itself a non-capturing group. For example, consider
1514       this pattern:
1515
1516         (?|(Sat)ur|(Sun))day
1517
1518       Because the two alternatives are inside a (?| group, both sets of  cap‐
1519       turing  parentheses  are  numbered one. Thus, when the pattern matches,
1520       you can look at captured substring number  one,  whichever  alternative
1521       matched.  This  construct  is useful when you want to capture part, but
1522       not all, of one of a number of alternatives. Inside a (?| group, paren‐
1523       theses  are  numbered as usual, but the number is reset at the start of
1524       each branch. The numbers of any capturing parentheses that  follow  the
1525       whole group start after the highest number used in any branch. The fol‐
1526       lowing example is taken from the Perl documentation. The numbers under‐
1527       neath show in which buffer the captured content will be stored.
1528
1529         # before  ---------------branch-reset----------- after
1530         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
1531         # 1            2         2  3        2     3     4
1532
1533       A  backreference  to a capture group uses the most recent value that is
1534       set for the group. The following pattern matches "abcabc" or "defdef":
1535
1536         /(?|(abc)|(def))\1/
1537
1538       In contrast, a subroutine call to a capture group always refers to  the
1539       first  one  in the pattern with the given number. The following pattern
1540       matches "abcabc" or "defabc":
1541
1542         /(?|(abc)|(def))(?1)/
1543
1544       A relative reference such as (?-1) is no different: it is just a conve‐
1545       nient way of computing an absolute group number.
1546
1547       If a condition test for whether a group has matched refers to a non-unique
1548       number, the test is true if any group with that number has matched.
1549
1550       An alternative approach to using this "branch reset" feature is to  use
1551       duplicate named groups, as described in the next section.
1552

NAMED CAPTURE GROUPS

1554
1555       Identifying capture groups by number is simple, but it can be very hard
1556       to keep track of the numbers in complicated patterns.  Furthermore,  if
1557       an  expression  is  modified, the numbers may change. To help with this
1558       difficulty, PCRE2 supports the naming of capture groups.  This  feature
1559       was  not  added to Perl until release 5.10. Python had the feature ear‐
1560       lier, and PCRE1 introduced it at release 4.0, using the Python  syntax.
1561       PCRE2 supports both the Perl and the Python syntax.
1562
1563       In  PCRE2,  a  capture  group  can  be  named  in  one  of  three ways:
1564       (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
1565       Names  may be up to 32 code units long. When PCRE2_UTF is not set, they
1566       may contain only ASCII alphanumeric  characters  and  underscores,  but
1567       must start with a non-digit. When PCRE2_UTF is set, the syntax of group
1568       names is extended to allow any Unicode letter or Unicode decimal digit.
1569       In other words, group names must match one of these patterns:
1570
1571         ^[_A-Za-z][_A-Za-z0-9]*\z   when PCRE2_UTF is not set
1572         ^[_\p{L}][_\p{L}\p{Nd}]*\z  when PCRE2_UTF is set
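
       As an illustration of the non-UTF rule and the  length  limit,  a
       name could be checked before it is put into a pattern. The  function
       is_valid_group_name is hypothetical and covers only  the  ASCII  case,
       where one character is one code unit.

```python
import re

NAME_RE = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")
MAX_NAME_LENGTH = 32  # limit quoted above, in code units


def is_valid_group_name(name):
    """Check a group name against the rule for when PCRE2_UTF is not set."""
    return len(name) <= MAX_NAME_LENGTH and NAME_RE.fullmatch(name) is not None
```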
1573
1574       References  to  capture groups from other parts of the pattern, such as
1575       backreferences, recursion, and conditions, can all be made by  name  as
1576       well as by number.
1577
1578       Named capture groups are allocated numbers as well as names, exactly as
1579       if the names were not present. In both PCRE2 and Perl,  capture  groups
1580       are  primarily  identified  by  numbers; any names are just aliases for
1581       these numbers. The PCRE2 API provides function calls for extracting the
1582       complete  name-to-number  translation table from a compiled pattern, as
1583       well as convenience functions for  extracting  captured  substrings  by
1584       name.
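
       The idea that names are aliases for numbers can be shown by analogy
       in Python's re module, which accepts only the (?P<name>...) spelling
       and exposes its own name-to-number table as "groupindex".  The  PCRE2
       functions themselves are described in pcre2api and are not used here;
       the pattern and subject are made-up examples.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")
m = date.fullmatch("2024-05-17")

# Named groups are numbered exactly as if the names were absent.
name_table = dict(date.groupindex)
same_value = m.group("year") == m.group(1)
unnamed_third = m.group(3)
```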
1585
1586       Warning:  When  more  than  one  capture  group has the same number, as
1587       described in the previous section, a name given to one of them  applies
1588       to all of them. Perl allows identically numbered groups to have differ‐
1589       ent names.  Consider this pattern, where there are two capture  groups,
1590       both numbered 1:
1591
1592         (?|(?<AA>aa)|(?<BB>bb))
1593
1594       Perl  allows  this,  with  both  names AA and BB as aliases of group 1.
1595       Thus, after a successful match, both names yield the same value (either
1596       "aa" or "bb").
1597
1598       In  an attempt to reduce confusion, PCRE2 does not allow the same group
1599       number to be associated with more than one name. The example above pro‐
1600       vokes  a  compile-time  error. However, there is still scope for confu‐
1601       sion. Consider this pattern:
1602
1603         (?|(?<AA>aa)|(bb))
1604
1605       Although the second group number 1 is not explicitly named, the name AA
1606       is  still an alias for any group 1. Whether the pattern matches "aa" or
1607       "bb", a reference by name to group AA yields the matched string.
1608
1609       By default, a name must be unique within a pattern, except that  dupli‐
1610       cate names are permitted for groups with the same number, for example:
1611
1612         (?|(?<AA>aa)|(?<AA>bb))
1613
1614       The duplicate name constraint can be disabled by setting the PCRE2_DUP‐
1615       NAMES option at compile time, or by the use of (?J) within the pattern,
1616       as described in the section entitled "Internal Option Setting" above.
1617
1618       Duplicate  names  can be useful for patterns where only one instance of
1619       the named capture group can match. Suppose you want to match  the  name
1620       of  a  weekday,  either as a 3-letter abbreviation or as the full name,
1621       and in both cases you want to extract the  abbreviation.  This  pattern
1622       (ignoring the line breaks) does the job:
1623
1624         (?J)
1625         (?<DN>Mon|Fri|Sun)(?:day)?|
1626         (?<DN>Tue)(?:sday)?|
1627         (?<DN>Wed)(?:nesday)?|
1628         (?<DN>Thu)(?:rsday)?|
1629         (?<DN>Sat)(?:urday)?
1630
1631       There  are five capture groups, but only one is ever set after a match.
1632       The convenience functions for extracting the data by name  return  the
1633       substring  for  the first (and in this example, the only) group of that
1634       name that matched. This saves searching to find which numbered group it
1635       was.  (An  alternative  way of solving this problem is to use a "branch
1636       reset" group, as described in the previous section.)
1637
1638       If you make a backreference to a non-unique named group from  elsewhere
1639       in  the pattern, the groups to which the name refers are checked in the
1640       order in which they appear in the overall pattern. The first  one  that
1641       is  set  is  used  for the reference. For example, this pattern matches
1642       both "foofoo" and "barbar" but not "foobar" or "barfoo":
1643
1644         (?J)(?:(?<n>foo)|(?<n>bar))\k<n>
1645
1646
1647       If you make a subroutine call to a non-unique named group, the one that
1648       corresponds to the first occurrence of the name is used. In the absence
1649       of duplicate numbers this is the one with the lowest number.
1650
1651       If you use a named reference in a condition test (see the section about
1652       conditions below), either to check whether a capture group has matched,
1653       or to check for recursion, all groups with the same name are tested. If
1654       the  condition  is  true  for any one of them, the overall condition is
1655       true. This is the same behaviour as  testing  by  number.  For  further
1656       details  of  the  interfaces for handling named capture groups, see the
1657       pcre2api documentation.
1658

REPETITION

1660
1661       Repetition is specified by quantifiers, which can  follow  any  of  the
1662       following items:
1663
1664         a literal data character
1665         the dot metacharacter
1666         the \C escape sequence
1667         the \R escape sequence
1668         the \X escape sequence
1669         an escape such as \d or \pL that matches a single character
1670         a character class
1671         a backreference
1672         a parenthesized group (including lookaround assertions)
1673         a subroutine call (recursive or otherwise)
1674
1675       The  general repetition quantifier specifies a minimum and maximum num‐
1676       ber of permitted matches, by giving the two numbers in  curly  brackets
1677       (braces),  separated  by  a comma. The numbers must be less than 65536,
1678       and the first must be less than or equal to the second. For example,
1679
1680         z{2,4}
1681
1682       matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a
1683       special  character.  If  the second number is omitted, but the comma is
1684       present, there is no upper limit; if the second number  and  the  comma
1685       are  both omitted, the quantifier specifies an exact number of required
1686       matches. Thus
1687
1688         [aeiou]{3,}
1689
1690       matches at least 3 successive vowels, but may match many more, whereas
1691
1692         \d{8}
1693
1694       matches exactly 8 digits. An opening curly bracket that  appears  in  a
1695       position  where a quantifier is not allowed, or one that does not match
1696       the syntax of a quantifier, is taken as a literal character. For  exam‐
1697       ple, {,6} is not a quantifier, but a literal string of four characters.
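
       The three quantifier forms can be exercised with Python's re module
       as a stand-in. One difference is worth noting: Python reads a  form
       such as x{,6} as a quantifier with a minimum of zero, so that case is
       left out of the sketch.

```python
import re

z_hits = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz")
          if re.fullmatch(r"z{2,4}", s)]
vowel_runs = [s for s in ("ae", "aei", "aeiouaeiou")
              if re.fullmatch(r"[aeiou]{3,}", s)]
eight_digits = [s for s in ("1234567", "12345678", "123456789")
                if re.fullmatch(r"\d{8}", s)]

# A closing brace on its own is an ordinary character.
lone_brace = re.fullmatch(r"a}", "a}") is not None
```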
1698
1699       In UTF modes, quantifiers apply to characters rather than to individual
1700       code units. Thus, for example, \x{100}{2} matches two characters,  each
1701       of which is represented by a two-byte sequence in a UTF-8 string. Simi‐
1702       larly, \X{3} matches three Unicode extended grapheme clusters, each  of
1703       which  may  be  several  code  units long (and they may be of different
1704       lengths).
1705
1706       The quantifier {0} is permitted, causing the expression to behave as if
1707       the previous item and the quantifier were not present. This may be use‐
1708       ful for capture groups that are referenced as  subroutines  from  else‐
1709       where  in the pattern (but see also the section entitled "Defining cap‐
1710       ture groups for use by reference only" below). Except for parenthesized
1711       groups,  items that have a {0} quantifier are omitted from the compiled
1712       pattern.
1713
1714       For convenience, the three most common quantifiers have  single-charac‐
1715       ter abbreviations:
1716
1717         *    is equivalent to {0,}
1718         +    is equivalent to {1,}
1719         ?    is equivalent to {0,1}
1720
1721       It  is  possible  to construct infinite loops by following a group that
1722       can match no characters with a quantifier that has no upper limit,  for
1723       example:
1724
1725         (a?)*
1726
1727       Earlier  versions  of  Perl  and PCRE1 used to give an error at compile
1728       time for such patterns. However, because there are cases where this can
1729       be useful, such patterns are now accepted, but whenever an iteration of
1730       such a group matches no characters, matching moves on to the next  item
1731       in  the  pattern  instead  of repeatedly matching an empty string. This
1732       does not prevent backtracking into any of the iterations  if  a  subse‐
1733       quent item fails to match.
1734
1735       By  default,  quantifiers  are "greedy", that is, they match as much as
1736       possible (up to the maximum number of permitted times), without causing
1737       the  rest  of  the  pattern  to fail. The classic example of where this
1738       gives problems is in trying to match  comments  in  C  programs.  These
1739       appear  between  /*  and  */ and within the comment, individual * and /
1740       characters may appear. An attempt to match C comments by  applying  the
1741       pattern
1742
1743         /\*.*\*/
1744
1745       to the string
1746
1747         /* first comment */  not comment  /* second comment */
1748
1749       fails,  because it matches the entire string owing to the greediness of
1750       the .*  item. However, if a quantifier is followed by a question  mark,
1751       it ceases to be greedy, and instead matches the minimum number of times
1752       possible, so the pattern
1753
1754         /\*.*?\*/
1755
1756       does the right thing with the C comments. The meaning  of  the  various
1757       quantifiers  is  not  otherwise  changed,  just the preferred number of
1758       matches.  Do not confuse this use of question mark with its  use  as  a
1759       quantifier  in its own right. Because it has two uses, it can sometimes
1760       appear doubled, as in
1761
1762         \d??\d
1763
1764       which matches one digit by preference, but can match two if that is the
1765       only way the rest of the pattern matches.
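
       The difference between greedy and minimizing repetition can be seen
       in a small sketch. It runs the two comment patterns and the doubled
       question mark through Python's re module, used here only as a
       stand-in on the assumption that it agrees with PCRE2 for these
       constructs.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last "*/", so one match covers the whole subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Lazy .*? stops at the first "*/", so each comment is found separately.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the whole subject must match.
preferred = re.match(r"\d??\d", "12").group()
forced = re.fullmatch(r"\d??\d", "12").group()

print(greedy == subject, lazy, preferred, forced)
```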
1766
1767       If the PCRE2_UNGREEDY option is set (an option that is not available in
1768       Perl), the quantifiers are not greedy by default, but  individual  ones
1769       can  be  made  greedy  by following them with a question mark. In other
1770       words, it inverts the default behaviour.
1771
1772       When a parenthesized group is quantified with a  minimum  repeat  count
1773       that  is  greater  than  1  or  with  a limited maximum, more memory is
1774       required for the compiled pattern, in proportion to  the  size  of  the
1775       minimum or maximum.
1776
1777       If  a  pattern  starts  with  .*  or  .{0,} and the PCRE2_DOTALL option
1778       (equivalent to Perl's /s) is set, thus allowing the dot to  match  new‐
1779       lines,  the  pattern  is  implicitly anchored, because whatever follows
1780       will be tried against every character position in the  subject  string,
1781       so  there  is  no  point  in retrying the overall match at any position
1782       after the first. PCRE2 normally treats such a pattern as though it were
1783       preceded by \A.
1784
1785       In  cases  where  it  is known that the subject string contains no new‐
1786       lines, it is worth setting PCRE2_DOTALL in order to obtain  this  opti‐
1787       mization, or alternatively, using ^ to indicate anchoring explicitly.
1788
1789       However,  there  are  some cases where the optimization cannot be used.
1790       When .*  is inside capturing parentheses that  are  the  subject  of  a
1791       backreference  elsewhere  in the pattern, a match at the start may fail
1792       where a later one succeeds. Consider, for example:
1793
1794         (.*)abc\1
1795
1796       If the subject is "xyz123abc123" the match point is the fourth  charac‐
1797       ter. For this reason, such a pattern is not implicitly anchored.
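
       A brief sketch shows why the match point is not at the start. It
       uses Python's re module as a stand-in, assuming it gives the same
       result as PCRE2 for this pattern, and reports a zero-based offset,
       so the fourth character is offset 3.

```python
import re

match = re.search(r"(.*)abc\1", "xyz123abc123")

# No attempt at offsets 0, 1 or 2 can succeed, because the text captured
# before "abc" would have to be repeated after it.
start = match.start()
whole = match.group(0)
captured = match.group(1)

print(start, whole, captured)
```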
1798
1799       Another  case where implicit anchoring is not applied is when the lead‐
1800       ing .* is inside an atomic group. Once again, a match at the start  may
1801       fail where a later one succeeds. Consider this pattern:
1802
1803         (?>.*?a)b
1804
1805       It matches "ab" in the subject "aab". The use of the backtracking
1806       control verbs (*PRUNE) and (*SKIP) also disables this optimization, and
1807       there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
1808
1809       When  a  capture group is repeated, the value captured is the substring
1810       that matched the final iteration. For example, after
1811
1812         (tweedle[dume]{3}\s*)+
1813
1814       has matched "tweedledum tweedledee" the value of the captured substring
1815       is  "tweedledee". However, if there are nested capture groups, the cor‐
1816       responding captured values may have been set  in  previous  iterations.
1817       For example, after
1818
1819         (a|(b))+
1820
1821       matches "aba" the value of the second captured substring is "b".
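
       Both cases can be checked with a short sketch. It uses Python's re
       module as a stand-in, on the assumption that it treats repeated and
       nested capture groups in the same way as PCRE2 for these two
       patterns.

```python
import re

# The outer group is repeated twice; only the final iteration is kept.
last = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee").group(1)

# Group 2 is set in the second iteration ("b") and is not cleared when the
# third iteration matches "a" through the other alternative.
nested = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = nested.group(1), nested.group(2)

print(last, outer, inner)
```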
1822

ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

1824
1825       With  both  maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
1826       repetition, failure of what follows normally causes the  repeated  item
1827       to  be  re-evaluated to see if a different number of repeats allows the
1828       rest of the pattern to match. Sometimes it is useful to  prevent  this,
1829       either to change the nature of the match, or to cause it to fail earlier
1830       than it otherwise might, when the author of the pattern knows there  is
1831       no point in carrying on.
1832
1833       Consider,  for  example, the pattern \d+foo when applied to the subject
1834       line
1835
1836         123456bar
1837
1838       After matching all 6 digits and then failing to match "foo", the normal
1839       action  of  the matcher is to try again with only 5 digits matching the
1840       \d+ item, and then with  4,  and  so  on,  before  ultimately  failing.
1841       "Atomic  grouping"  (a  term taken from Jeffrey Friedl's book) provides
1842       the means for specifying that once a group has matched, it is not to be
1843       re-evaluated in this way.
1844
1845       If  we  use atomic grouping for the previous example, the matcher gives
1846       up immediately on failing to match "foo" the first time.  The  notation
1847       is a kind of special parenthesis, starting with (?> as in this example:
1848
1849         (?>\d+)foo
1850
1851       Perl  5.28  introduced an experimental alphabetic form starting with (*
1852       which may be easier to remember:
1853
1854         (*atomic:\d+)foo
1855
1856       This kind of parenthesized group "locks up" the  part of the pattern it
1857       contains once it has matched, and a failure further into the pattern is
1858       prevented from backtracking into it. Backtracking past it  to  previous
1859       items, however, works as normal.
1860
1861       An alternative description is that a group of this type matches exactly
1862       the string of characters that an  identical  standalone  pattern  would
1863       match, if anchored at the current point in the subject string.
1864
1865       Atomic  groups  are  not capture groups. Simple cases such as the above
1866       example can be thought of as a  maximizing  repeat  that  must  swallow
1867       everything  it can.  So, while both \d+ and \d+? are prepared to adjust
1868       the number of digits they match in order to make the rest of  the  pat‐
1869       tern match, (?>\d+) can only match an entire sequence of digits.
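
       The "must swallow everything" behaviour can be illustrated without
       atomic group syntax. The sketch below is an emulation, not the
       (?> notation itself: it uses Python's re module and the common idiom
       of capturing inside a lookahead and then consuming the capture with
       a backreference. It assumes, as is the case for Python, that
       lookahead assertions are never backtracked into. The group name
       "run" is illustrative.

```python
import re

# Ordinary greedy repeat: \d+ gives back the final "6" so the literal can match.
backtracking = re.search(r"\d+6", "123456")

# Emulated atomic group: the lookahead captures the full digit run, and the
# backreference must then consume exactly that run, with no giving back.
atomic_like = re.search(r"(?=(?P<run>\d+))(?P=run)6", "123456")

# When different text follows the digits, the emulated form still matches.
followed = re.search(r"(?=(?P<run>\d+))(?P=run)foo", "123456foo")

print(backtracking.group(), atomic_like, followed.group())
```

       In PCRE2 the same effect is written directly as (?>\d+)6.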
1870
1871       Atomic  groups in general can of course contain arbitrarily complicated
1872       expressions, and can be nested. However, when the contents of an atomic
1873       group are just a single repeated item, as in the example above, a
1874       simpler notation, called a "possessive quantifier", can be used. This
1875       consists of an additional + character following a quantifier. Using this
1876       notation, the previous example can be rewritten as
1877
1878         \d++foo
1879
1880       Note that a possessive quantifier can be used with an entire group, for
1881       example:
1882
1883         (abc|xyz){2,3}+
1884
1885       Possessive   quantifiers   are   always  greedy;  the  setting  of  the
1886       PCRE2_UNGREEDY option is ignored. They are a  convenient  notation  for
1887       the  simpler  forms of atomic group. However, there is no difference in
1888       the meaning of a possessive quantifier and the equivalent atomic group,
1889       though  there  may  be a performance difference; possessive quantifiers
1890       should be slightly faster.
1891
1892       The possessive quantifier syntax is an extension to the Perl  5.8  syn‐
1893       tax.   Jeffrey  Friedl  originated the idea (and the name) in the first
1894       edition of his book. Mike McCloskey liked it, so implemented it when he
1895       built  Sun's Java package, and PCRE1 copied it from there. It found its
1896       way into Perl at release 5.10.
1897
1898       PCRE2 has an optimization  that  automatically  "possessifies"  certain
1899       simple  pattern constructs. For example, the sequence A+B is treated as
1900       A++B because there is no point in backtracking into a sequence  of  A's
1901       when B must follow. This feature can be disabled by the
1902       PCRE2_NO_AUTO_POSSESS option, or by starting the pattern with (*NO_AUTO_POSSESS).
1903
1904       When a pattern contains an unlimited repeat inside  a  group  that  can
1905       itself  be  repeated an unlimited number of times, the use of an atomic
1906       group is the only way to avoid some failing matches taking a very  long
1907       time indeed. The pattern
1908
1909         (\D+|<\d+>)*[!?]
1910
1911       matches  an  unlimited number of substrings that either consist of non-
1912       digits, or digits enclosed in <>, followed by either ! or  ?.  When  it
1913       matches, it runs quickly. However, if it is applied to
1914
1915         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1916
1917       it  takes  a  long  time  before reporting failure. This is because the
1918       string can be divided between the internal \D+ repeat and the  external
1919       *  repeat  in  a  large  number of ways, and all have to be tried. (The
1920       example uses [!?] rather than a single character at  the  end,  because
1921       both  PCRE2  and Perl have an optimization that allows for fast failure
1922       when a single character is used. They remember the last single  charac‐
1923       ter  that  is required for a match, and fail early if it is not present
1924       in the string.) If the pattern is changed so that  it  uses  an  atomic
1925       group, like this:
1926
1927         ((?>\D+)|<\d+>)*[!?]
1928
1929       sequences of non-digits cannot be broken, and failure happens quickly.
1930

BACKREFERENCES

1932
1933       Outside a character class, a backslash followed by a digit greater than
1934       0 (and possibly further digits) is a backreference to a  capture  group
1935       earlier (that is, to its left) in the pattern, provided there have been
1936       that many previous capture groups.
1937
1938       However, if the decimal number following the backslash is less than  8,
1939       it  is  always  taken  as  a backreference, and causes an error only if
1940       there are not that many capture groups in the entire pattern. In  other
1941       words, the group that is referenced need not be to the left of the ref‐
1942       erence for numbers less than 8. A "forward backreference" of this  type
1943       can make sense when a repetition is involved and the group to the right
1944       has participated in an earlier iteration.
1945
1946       It is not possible to have a numerical  "forward  backreference"  to  a
1947       group  whose  number  is 8 or more using this syntax because a sequence
1948       such as \50 is interpreted as a character defined  in  octal.  See  the
1949       subsection entitled "Non-printing characters" above for further details
1950       of the handling of digits following a backslash. Other forms  of  back‐
1951       referencing  do  not suffer from this restriction. In particular, there
1952       is no problem when named capture groups are used (see below).
1953
1954       Another way of avoiding the ambiguity inherent in  the  use  of  digits
1955       following  a  backslash  is  to use the \g escape sequence. This escape
1956       must be followed by a signed or unsigned number, optionally enclosed in
1957       braces. These examples are all identical:
1958
1959         (ring), \1
1960         (ring), \g1
1961         (ring), \g{1}
1962
1963       An  unsigned number specifies an absolute reference without the ambigu‐
1964       ity that is present in the older syntax. It is also useful when literal
1965       digits  follow  the reference. A signed number is a relative reference.
1966       Consider this example:
1967
1968         (abc(def)ghi)\g{-1}
1969
1970       The sequence \g{-1} is a reference to the most recently started capture
1971       group before \g, that is, it is equivalent to \2 in this example.
1972       Similarly, \g{-2} would be equivalent to \1. The use of relative references
1973       can  be helpful in long patterns, and also in patterns that are created
1974       by joining together fragments  that  contain  references  within  them‐
1975       selves.
1976
1977       The sequence \g{+1} is a reference to the next capture group. This kind
1978       of forward reference can be useful in patterns that repeat.  Perl  does
1979       not support the use of + in this way.
1980
1981       A  backreference  matches  whatever  actually most recently matched the
1982       capture group in the current subject string, rather  than  anything  at
1983       all that matches the group (see "Groups as subroutines" below for a way
1984       of doing that). So the pattern
1985
1986         (sens|respons)e and \1ibility
1987
1988       matches "sense and sensibility" and "response and responsibility",  but
1989       not  "sense and responsibility". If caseful matching is in force at the
1990       time of the backreference, the case of letters is relevant.  For  exam‐
1991       ple,
1992
1993         ((?i)rah)\s+\1
1994
1995       matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
1996       original capture group is matched caselessly.
1997
1998       There are several different ways of  writing  backreferences  to  named
1999       capture  groups.  The .NET syntax \k{name} and the Perl syntax \k<name>
2000       or \k'name' are supported, as is  the  Python  syntax  (?P=name).  Perl
2001       5.10's  unified  backreference syntax, in which \g can be used for both
2002       numeric and named references, is also supported. We could  rewrite  the
2003       above example in any of the following ways:
2004
2005         (?<p1>(?i)rah)\s+\k<p1>
2006         (?'p1'(?i)rah)\s+\k{p1}
2007         (?P<p1>(?i)rah)\s+(?P=p1)
2008         (?<p1>(?i)rah)\s+\g{p1}
2009
2010       A  capture  group  that is referenced by name may appear in the pattern
2011       before or after the reference.
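
       The following sketch compares a numbered backreference with the
       Python-style named form, using the "sense and sensibility" pattern
       from earlier in this section. It runs under Python's re module as a
       stand-in for PCRE2; the group name "stem" is illustrative.

```python
import re

numbered = re.compile(r"(sens|respons)e and \1ibility")
named = re.compile(r"(?P<stem>sens|respons)e and (?P=stem)ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

# The backreference must repeat the text that was actually captured, so
# the mixed third subject is rejected by both patterns.
numbered_results = [numbered.fullmatch(s) is not None for s in subjects]
named_results = [named.fullmatch(s) is not None for s in subjects]

print(numbered_results, named_results)
```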
2012
2013       There may be more than one backreference to the same group. If a  group
2014       has  not actually been used in a particular match, backreferences to it
2015       always fail by default. For example, the pattern
2016
2017         (a|(bc))\2
2018
2019       always fails if it starts to match "a" rather than  "bc".  However,  if
2020       the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref‐
2021       erence to an unset value matches an empty string.
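
       The default behaviour for an unset group can be seen in a short
       sketch. It uses Python's re module as a stand-in, which also fails a
       backreference to an unset group. Python has no counterpart to
       PCRE2_MATCH_UNSET_BACKREF, so only the default case is shown.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 takes no part when "a" is matched, so \2 cannot match.
unset_case = pattern.fullmatch("a")

# Group 2 captures "bc", and \2 then matches the second "bc".
set_case = pattern.fullmatch("bcbc")

print(unset_case, set_case.group(2))
```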
2022
2023       Because there may be many capture groups in a pattern, all digits  fol‐
2024       lowing  a backslash are taken as part of a potential backreference num‐
2025       ber. If the pattern continues with a digit  character,  some  delimiter
2026       must  be  used to terminate the backreference. If the PCRE2_EXTENDED or
2027       PCRE2_EXTENDED_MORE option is set, this can be white space.  Otherwise,
2028       the \g{} syntax or an empty comment (see "Comments" below) can be used.
2029
2030   Recursive backreferences
2031
2032       A  backreference  that occurs inside the group to which it refers fails
2033       when the group is first used, so, for  example,  (a\1)  never  matches.
2034       However,  such  references  can  be  useful inside repeated groups. For
2035       example, the pattern
2036
2037         (a|b\1)+
2038
2039       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter‐
2040       ation of the group, the backreference matches the character string cor‐
2041       responding to the previous iteration. In order for this  to  work,  the
2042       pattern  must  be  such that the first iteration does not need to match
2043       the backreference. This can be done using alternation, as in the  exam‐
2044       ple above, or by a quantifier with a minimum of zero.
2045
2046       In versions of PCRE2 earlier than 10.25, backreferences of this type
2047       caused the group that they reference to be treated as an atomic
2048       group. This restriction no longer applies, and backtracking into such
2049       groups can occur as normal.
2050

ASSERTIONS

2052
2053       An assertion is a test on the characters  following  or  preceding  the
2054       current matching point that does not consume any characters. The simple
2055       assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are  described
2056       above.
2057
2058       More  complicated  assertions  are coded as parenthesized groups. There
2059       are two kinds: those that look ahead of the  current  position  in  the
2060       subject  string,  and  those  that  look behind it, and in each case an
2061       assertion may be positive (must match for the assertion to be true)  or
2062       negative  (must  not  match for the assertion to be true). An assertion
2063       group is matched in the normal way, and if it is true, matching contin‐
2064       ues  after  it,  but  with  the matching position in the subject string
2065       reset to what it was before the assertion was processed.
2066
2067       The Perl-compatible lookaround assertions are atomic. If  an  assertion
2068       is  true, but there is a subsequent matching failure, there is no back‐
2069       tracking into the assertion. However, there are some cases  where  non-
2070       atomic  assertions  can  be  useful.  PCRE2 has some support for these,
2071       described in the section entitled "Non-atomic  assertions"  below,  but
2072       they are not Perl-compatible.
2073
2074       A  lookaround  assertion  may  appear as the condition in a conditional
2075       group (see below). In this case, the result of matching  the  assertion
2076       determines which branch of the condition is followed.
2077
2078       Assertion  groups are not capture groups. If an assertion contains cap‐
2079       ture groups within it, these are counted for the purposes of  numbering
2080       the  capture  groups  in  the  whole  pattern. Within each branch of an
2081       assertion, locally captured substrings may be referenced in  the  usual
2082       way.  For  example,  a  sequence such as (.)\g{-1} can be used to check
2083       that two adjacent characters are the same.
2084
2085       When a branch within an assertion fails to match, any  substrings  that
2086       were  captured  are  discarded (as happens with any pattern branch that
2087       fails to match). A  negative  assertion  is  true  only  when  all  its
2088       branches fail to match; this means that no captured substrings are ever
2089       retained after a successful negative assertion. When an assertion  con‐
2090       tains a matching branch, what happens depends on the type of assertion.
2091
2092       For  a  positive  assertion, internally captured substrings in the suc‐
2093       cessful branch are retained, and matching continues with the next  pat‐
2094       tern  item  after  the  assertion. For a negative assertion, a matching
2095       branch means that the assertion is not true. If such  an  assertion  is
2096       being  used as a condition in a conditional group (see below), captured
2097       substrings are retained,  because  matching  continues  with  the  "no"
2098       branch of the condition. For other failing negative assertions, control
2099       passes to the previous backtracking point, thus discarding any captured
2100       strings within the assertion.
2101
2102       Most  assertion  groups  may  be  repeated; though it makes no sense to
2103       assert the same thing several times, the side effect  of  capturing  in
2104       positive  assertions  may occasionally be useful. However, an assertion
2105       that forms the condition for a conditional group may not be quantified.
2106       PCRE2  used  to restrict the repetition of assertions, but from release
2107       10.35 the only restriction is that an unlimited maximum  repetition  is
2108       changed  to  be one more than the minimum. For example, {3,} is treated
2109       as {3,4}.
2110
2111   Alphabetic assertion names
2112
2113       Traditionally, symbolic sequences such as (?= and (?<= have  been  used
2114       to  specify lookaround assertions. Perl 5.28 introduced some experimen‐
2115       tal alphabetic alternatives which might be easier to remember. They all
2116       start  with  (* instead of (? and must be written using lower case let‐
2117       ters. PCRE2 supports the following synonyms:
2118
2119         (*positive_lookahead:  or (*pla: is the same as (?=
2120         (*negative_lookahead:  or (*nla: is the same as (?!
2121         (*positive_lookbehind: or (*plb: is the same as (?<=
2122         (*negative_lookbehind: or (*nlb: is the same as (?<!
2123
2124       For example, (*pla:foo) is the same assertion as (?=foo). In  the  fol‐
2125       lowing  sections, the various assertions are described using the origi‐
2126       nal symbolic forms.
2127
2128   Lookahead assertions
2129
2130       Lookahead assertions start with (?= for positive assertions and (?! for
2131       negative assertions. For example,
2132
2133         \w+(?=;)
2134
2135       matches  a word followed by a semicolon, but does not include the semi‐
2136       colon in the match, and
2137
2138         foo(?!bar)
2139
2140       matches any occurrence of "foo" that is not  followed  by  "bar".  Note
2141       that the apparently similar pattern
2142
2143         (?!foo)bar
2144
2145       does  not  find  an  occurrence  of "bar" that is preceded by something
2146       other than "foo"; it finds any occurrence of "bar" whatsoever,  because
2147       the assertion (?!foo) is always true when the next three characters are
2148       "bar". A lookbehind assertion is needed to achieve the other effect.
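
       The three lookahead patterns above can be exercised with a small
       sketch. It uses Python's re module as a stand-in, on the assumption
       that it agrees with PCRE2 for these assertions. The sample subjects
       are illustrative, and offsets are zero-based.

```python
import re

# The semicolon is required but is not part of the matched text.
word = re.search(r"\w+(?=;)", "x = value;").group()

# Only the second "foo" is not followed by "bar".
not_followed = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo) is tested at the position where "bar" starts, where it is always
# true, so the "bar" that follows "foo" is still found.
any_bar = re.search(r"(?!foo)bar", "foobar").start()

print(word, not_followed, any_bar)
```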
2149
2150       If you want to force a matching failure at some point in a pattern, the
2151       most  convenient  way  to  do  it  is with (?!) because an empty string
2152       always matches, so an assertion that requires there not to be an  empty
2153       string must always fail.  The backtracking control verb (*FAIL) or (*F)
2154       is a synonym for (?!).
2155
2156   Lookbehind assertions
2157
2158       Lookbehind assertions start with (?<= for positive assertions and  (?<!
2159       for negative assertions. For example,
2160
2161         (?<!foo)bar
2162
2163       does  find  an  occurrence  of "bar" that is not preceded by "foo". The
2164       contents of a lookbehind assertion are restricted  such  that  all  the
2165       strings it matches must have a fixed length. However, if there are sev‐
2166       eral top-level alternatives, they do not all  have  to  have  the  same
2167       fixed length. Thus
2168
2169         (?<=bullock|donkey)
2170
2171       is permitted, but
2172
2173         (?<!dogs?|cats?)
2174
2175       causes  an  error at compile time. Branches that match different length
2176       strings are permitted only at the top level of a lookbehind  assertion.
2177       This is an extension compared with Perl, which requires all branches to
2178       match the same length of string. An assertion such as
2179
2180         (?<=ab(c|de))
2181
2182       is not permitted, because its single top-level  branch  can  match  two
2183       different  lengths,  but  it is acceptable to PCRE2 if rewritten to use
2184       two top-level branches:
2185
2186         (?<=abc|abde)
2187
2188       In some cases, the escape sequence \K (see above) can be  used  instead
2189       of a lookbehind assertion to get round the fixed-length restriction.
2190
2191       The  implementation  of lookbehind assertions is, for each alternative,
2192       to temporarily move the current position back by the fixed  length  and
2193       then try to match. If there are insufficient characters before the cur‐
2194       rent position, the assertion fails.
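
       A short sketch shows the step back by a fixed length, including the
       case where too few characters precede the current position. It uses
       Python's re module as a stand-in. Python is stricter than PCRE2
       about alternatives of different lengths in a lookbehind, so only
       single-branch assertions are used here.

```python
import re

pattern = re.compile(r"(?<!foo)bar")

after_foo = pattern.search("foobar")   # "bar" is preceded by "foo"
after_baz = pattern.search("bazbar")   # "bar" is preceded by "baz"

# Fewer than three characters precede "bar", so the inner test cannot
# match and the negative assertion is therefore true.
at_start = pattern.search("bar")

# A positive lookbehind fails when there are too few preceding characters.
too_short = re.search(r"(?<=abc)d", "bcd")

print(after_foo, after_baz.start(), at_start.start(), too_short)
```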
2195
2196       In UTF-8 and UTF-16 modes, PCRE2 does not allow the  \C  escape  (which
2197       matches  a single code unit even in a UTF mode) to appear in lookbehind
2198       assertions, because it makes it impossible to calculate the  length  of
2199       the  lookbehind.  The \X and \R escapes, which can match different num‐
2200       bers of code units, are never permitted in lookbehinds.
2201
2202       "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in
2203       lookbehinds, as long as the called capture group matches a fixed-length
2204       string. However, recursion, that is, a "subroutine" call into  a  group
2205       that is already active, is not supported.
2206
2207       Perl does not support backreferences in lookbehinds. PCRE2 does support
2208       them,   but   only    if    certain    conditions    are    met.    The
2209       PCRE2_MATCH_UNSET_BACKREF  option must not be set, there must be no use
2210       of (?| in the pattern (it creates duplicate group numbers), and if  the
2211       backreference  is by name, the name must be unique. Of course, the ref‐
2212       erenced group must itself match a fixed length substring. The following
2213       pattern matches words containing at least two characters that begin and
2214       end with the same character:
2215
2216          \b(\w)\w++(?<=\1)
2217
2218       Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
2219       assertions to specify efficient matching of fixed-length strings at the
2220       end of subject strings. Consider a simple pattern such as
2221
2222         abcd$
2223
2224       when applied to a long string that does  not  match.  Because  matching
2225       proceeds  from  left to right, PCRE2 will look for each "a" in the sub‐
2226       ject and then see if what follows matches the rest of the  pattern.  If
2227       the pattern is specified as
2228
2229         ^.*abcd$
2230
2231       the  initial .* matches the entire string at first, but when this fails
2232       (because there is no following "a"), it backtracks to match all but the
2233       last  character,  then all but the last two characters, and so on. Once
2234       again the search for "a" covers the entire string, from right to  left,
2235       so we are no better off. However, if the pattern is written as
2236
2237         ^.*+(?<=abcd)
2238
2239       there can be no backtracking for the .*+ item because of the possessive
2240       quantifier; it can match only the entire string. The subsequent lookbe‐
2241       hind  assertion  does  a single test on the last four characters. If it
2242       fails, the match fails immediately. For  long  strings,  this  approach
2243       makes a significant difference to the processing time.
2244
2245   Using multiple assertions
2246
2247       Several assertions (of any sort) may occur in succession. For example,
2248
2249         (?<=\d{3})(?<!999)foo
2250
2251       matches  "foo" preceded by three digits that are not "999". Notice that
2252       each of the assertions is applied independently at the  same  point  in
2253       the  subject  string.  First  there  is a check that the previous three
2254       characters are all digits, and then there is  a  check  that  the  same
2255       three characters are not "999". This pattern does not match "foo"
2256       preceded by six characters, the first three of which are digits and the
2257       last three of which are not "999". For example, it does not match
2258       "123abcfoo". A pattern to do that is
2259
2260         (?<=\d{3}...)(?<!999)foo
2261
2262       This time the first assertion looks at the  preceding  six  characters,
2263       checking that the first three are digits, and then the second assertion
2264       checks that the preceding three characters are not "999".
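
       The two patterns can be compared side by side in a short sketch. It
       uses Python's re module as a stand-in, assuming it agrees with PCRE2
       for these fixed-length lookbehinds. The subject list is
       illustrative.

```python
import re

three = re.compile(r"(?<=\d{3})(?<!999)foo")
six = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]

# Both assertions in each pattern are tested at the same point, which is
# the position just before "foo".
three_results = [three.search(s) is not None for s in subjects]
six_results = [six.search(s) is not None for s in subjects]

print(three_results, six_results)
```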
2265
2266       Assertions can be nested in any combination. For example,
2267
2268         (?<=(?<!foo)bar)baz
2269
2270       matches an occurrence of "baz" that is preceded by "bar" which in  turn
2271       is not preceded by "foo", while
2272
2273         (?<=\d{3}(?!999)...)foo
2274
2275       is  another pattern that matches "foo" preceded by three digits and any
2276       three characters that are not "999".
2277

NON-ATOMIC ASSERTIONS

2279
2280       The traditional Perl-compatible lookaround assertions are atomic.  That
2281       is,  if  an assertion is true, but there is a subsequent matching fail‐
2282       ure, there is no backtracking into the assertion.  However,  there  are
2283       some  cases  where  non-atomic positive assertions can be useful. PCRE2
2284       provides these using the following syntax:
2285
2286         (*non_atomic_positive_lookahead:  or (*napla: or (?*
2287         (*non_atomic_positive_lookbehind: or (*naplb: or (?<*
2288
2289       Consider the problem of finding the right-most word in  a  string  that
2290       also  appears  earlier  in the string, that is, it must appear at least
2291       twice in total.  This pattern returns the required result  as  captured
2292       substring 1:
2293
2294         ^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2}
2295
2296       For  a subject such as "word1 word2 word3 word2 word3 word4" the result
2297       is "word3". How does it work? At the start, ^(?x) anchors  the  pattern
2298       and sets the "x" option, which causes white space (introduced for read‐
2299       ability) to be ignored. Inside the assertion, the greedy  .*  at  first
2300       consumes the entire string, but then has to backtrack until the rest of
2301       the assertion can match a word, which is captured by group 1. In  other
2302       words,  when  the  assertion first succeeds, it captures the right-most
2303       word in the string.
2304
2305       The current matching point is then reset to the start of  the  subject,
2306       and  the  rest  of  the pattern match checks for two occurrences of the
2307       captured word, using an ungreedy .*? to scan from  the  left.  If  this
2308       succeeds,  we  are  done,  but  if the last word in the string does not
2309       occur twice, this part of the pattern fails. If  a  traditional  atomic
2310       lookahead (?= or (*pla: had been used, the assertion could not  be  re-
2311       entered, and the whole match would fail. The pattern would succeed only
2312       if the very last word in the subject was found twice.
2313
2314       Using  a  non-atomic  lookahead, however, means that when the last word
2315       does not occur twice in the string, the  lookahead  can  backtrack  and
2316       find  the second-last word, and so on, until either the match succeeds,
2317       or all words have been tested.
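
       Python's re module has no non-atomic assertions, so the following is
       only a plain-Python sketch of the search that the pattern performs,
       not a use of PCRE2 itself. The helper name rightmost_repeated_word is
       hypothetical. It tries the words from right to left, as the
       backtracking lookahead does, and accepts the first one that occurs at
       least twice as a whole word.

```python
import re

def rightmost_repeated_word(subject):
    """Return the right-most word that occurs at least twice, or None."""
    words = re.findall(r"\w+", subject)
    # Walk candidates from the right, mirroring the greedy .* backing off
    # one word at a time inside the non-atomic lookahead.
    for candidate in reversed(words):
        # Mirror (?> .*? \b\1\b ){2}: two whole-word occurrences are needed.
        if words.count(candidate) >= 2:
            return candidate
    return None

print(rightmost_repeated_word("word1 word2 word3 word2 word3 word4"))
```

       For the sample subject from the text, "word4" is rejected first
       because it occurs only once, and "word3" is then returned.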
2318
2319       Two conditions must be met for a non-atomic assertion to be useful: the
2320       contents  of one or more capturing groups must change after a backtrack
2321       into the assertion, and there must be  a  backreference  to  a  changed
2322       group  later  in  the pattern. If this is not the case, the rest of the
2323       pattern match fails exactly as before because nothing has  changed,  so
2324       using a non-atomic assertion just wastes resources.
2325
2326       There  is one exception to backtracking into a non-atomic assertion. If
2327       an (*ACCEPT) control verb is triggered, the assertion  succeeds  atomi‐
2328       cally.  That  is,  a subsequent match failure cannot backtrack into the
2329       assertion.
2330
2331       Non-atomic assertions are not supported  by  the  alternative  matching
2332       function pcre2_dfa_match(). They are supported by JIT, but only if they
2333       do not contain any control verbs such as (*ACCEPT). (This may change in
2334       the future.) Note that  assertions  that  appear  as  conditions  for
2335       conditional groups (see below) must be atomic.
2336

SCRIPT RUNS

2338
2339       In concept, a script run is a sequence of characters that are all  from
2340       the  same  Unicode script such as Latin or Greek. However, because some
2341       scripts are commonly used together, and because  some  diacritical  and
2342       other  marks  are  used  with  multiple scripts, it is not that simple.
2343       There is a full description of the rules that PCRE2 uses in the section
2344       entitled "Script Runs" in the pcre2unicode documentation.
2345
2346       If  part  of a pattern is enclosed between (*script_run: or (*sr: and a
2347       closing parenthesis, it fails if the sequence  of  characters  that  it
2348       matches  is  not  a  script  run. After a failure, normal backtracking
2349       occurs. Script runs can be used to detect spoofing attacks using  char‐
2350       acters  that  look the same, but are from different scripts. The string
2351       "paypal.com" is an infamous example, where the letters could be a  mix‐
2352       ture of Latin and Cyrillic. This pattern ensures that the matched char‐
2353       acters in a sequence of non-spaces that follow white space are a script
2354       run:
2355
2356         \s+(*sr:\S+)
2357
2358       To  be  sure  that  they are all from the Latin script (for example), a
2359       lookahead can be used:
2360
2361         \s+(?=\p{Latin})(*sr:\S+)
2362
2363       This works as long as the first character is expected to be a character
2364       in  that  script,  and  not (for example) punctuation, which is allowed
2365       with any script. If this is not the case, a more creative lookahead  is
2366       needed.  For  example, if digits, underscore, and dots are permitted at
2367       the start:
2368
2369         \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
2370
2371
2372       In many cases, backtracking into a script run pattern fragment  is  not
2373       desirable.  The  script run can employ an atomic group to prevent this.
2374       Because this is a common requirement, a shorthand notation is  provided
2375       by (*atomic_script_run: or (*asr:
2376
2377         (*asr:...) is the same as (*sr:(?>...))
2378
2379       Note that the atomic group is inside the script run. Putting it outside
2380       would not prevent backtracking into the script run pattern.
2381
2382       Support for script runs is not available if PCRE2 is  compiled  without
2383       Unicode support. A compile-time error is given if any of the above con‐
2384       structs is encountered. Script runs are not supported by the alternative
2385       matching function, pcre2_dfa_match(), because they use the  same  mecha‐
2386       nism as capturing parentheses.
2387
2388       Warning: The (*ACCEPT) control verb (see  below)  should  not  be  used
2389       within a script run group, because it causes an immediate exit from the
2390       group, bypassing the script run checking.
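
       The spoofing check described in this section can be illustrated
       outside PCRE2 with a deliberately crude approximation. Python's re
       module has no script-run support and the standard library does not
       expose the Unicode Script property, so this sketch guesses a script
       from the first word of each character's Unicode name. That heuristic,
       and the helper names rough_script and looks_like_single_script, are
       assumptions made for illustration; the rules that PCRE2 really uses
       are those given in the pcre2unicode documentation.

```python
import unicodedata

def rough_script(ch):
    """Guess a script from the Unicode name, e.g. 'LATIN' or 'CYRILLIC'."""
    if not ch.isalpha():
        return None  # digits and punctuation are treated as script-neutral
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def looks_like_single_script(text):
    """Return True if all letters in text share one guessed script."""
    scripts = {rough_script(ch) for ch in text} - {None}
    return len(scripts) <= 1

genuine = "paypal.com"
spoofed = "p\u0430yp\u0430l.com"  # U+0430 is CYRILLIC SMALL LETTER A

print(looks_like_single_script(genuine))
print(looks_like_single_script(spoofed))
```

       The all-Latin string passes, while the string that mixes in Cyrillic
       letters is flagged, which is the kind of mixture that (*sr:...) is
       designed to reject.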
2391

CONDITIONAL GROUPS

2393
2394       It is possible to cause the matching process to obey a pattern fragment
2395       conditionally or to choose between two alternative fragments, depending
2396       on the result of an assertion, or whether a specific capture group  has
2397       already been matched. The two possible forms of conditional group are:
2398
2399         (?(condition)yes-pattern)
2400         (?(condition)yes-pattern|no-pattern)
2401
2402       If  the  condition is satisfied, the yes-pattern is used; otherwise the
2403       no-pattern (if present) is used. An absent no-pattern is equivalent  to
2404       an  empty string (it always matches). If there are more than two alter‐
2405       natives in the group, a compile-time error  occurs.  Each  of  the  two
2406       alternatives  may  itself  contain nested groups of any form, including
2407       conditional groups; the restriction to two alternatives applies only at
2408       the  level of the condition itself. This pattern fragment is an example
2409       where the alternatives are complex:
2410
2411         (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
2412
2413
2414       There are five kinds of condition: references to capture groups, refer‐
2415       ences  to  recursion,  two pseudo-conditions called DEFINE and VERSION,
2416       and assertions.
2417
2418   Checking for a used capture group by number
2419
2420       If the text between the parentheses consists of a sequence  of  digits,
2421       the  condition is true if a capture group of that number has previously
2422       matched. If there is more than one capture group with the  same  number
2423       (see  the earlier section about duplicate group numbers), the condition
2424       is true if any of them have matched. An alternative notation is to pre‐
2425       cede the digits with a plus or minus sign. In this case, the group num‐
2426       ber is relative rather than absolute. The most recently opened  capture
2427       group  can be referenced by (?(-1), the next most recent by (?(-2), and
2428       so on. Inside loops it can also  make  sense  to  refer  to  subsequent
2429       groups.  The next capture group can be referenced as (?(+1), and so on.
2430       (The value zero in any of these forms is not used; it provokes  a  com‐
2431       pile-time error.)
2432
2433       Consider  the  following  pattern, which contains non-significant white
2434       space to make it more readable (assume the PCRE2_EXTENDED  option)  and
2435       to divide it into three parts for ease of discussion:
2436
2437         ( \( )?    [^()]+    (?(1) \) )
2438
2439       The  first  part  matches  an optional opening parenthesis, and if that
2440       character is present, sets it as the first captured substring. The sec‐
2441       ond  part  matches one or more characters that are not parentheses. The
2442       third part is a conditional group that tests whether or not  the  first
2443       capture group matched. If it did, that is, if the subject started with an
2444       opening parenthesis, the condition is true, and so the  yes-pattern  is
2445       executed  and  a  closing parenthesis is required. Otherwise, since no-
2446       pattern is not present, the conditional group matches nothing. In other
2447       words,  this  pattern matches a sequence of non-parentheses, optionally
2448       enclosed in parentheses.
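
       Python's re module accepts the absolute form (?(1)...) with the same
       meaning, which makes it a convenient stand-in for trying the pattern
       out. In this sketch re.VERBOSE plays the role of PCRE2_EXTENDED, the
       subject strings are invented, and fullmatch() is used so that the
       whole subject must be consumed. Only the absolute form is shown.

```python
import re

# Same pattern as above: optional "(", non-parentheses, ")" only if "(" matched.
pattern = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

for subject in ["(abc)", "abc", "(abc", "abc)"]:
    print(subject, bool(pattern.fullmatch(subject)))
```

       The first two subjects match in full; the last two do not, because
       the closing parenthesis is required exactly when the opening one was
       captured.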
2449
2450       If you were embedding this pattern in a larger one,  you  could  use  a
2451       relative reference:
2452
2453         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
2454
2455       This  makes  the  fragment independent of the parentheses in the larger
2456       pattern.
2457
2458   Checking for a used capture group by name
2459
2460       Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
2461       used  capture group by name. For compatibility with earlier versions of
2462       PCRE1, which had this facility before Perl, the syntax (?(name)...)  is
2463       also  recognized.   Note, however, that undelimited names consisting of
2464       the letter R followed by digits are ambiguous (see the  following  sec‐
2465       tion). Rewriting the above example to use a named group gives this:
2466
2467         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
2468
2469       If  the  name used in a condition of this kind is a duplicate, the test
2470       is applied to all groups of the same name, and is true if  any  one  of
2471       them has matched.
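
       Python's re module spells named groups as (?P<name>...) and tests
       them with the undelimited form (?(name)...), both of which PCRE2 also
       recognizes. The following is a minimal sketch of the named example in
       that spelling, again using Python only as a stand-in for PCRE2.

```python
import re

# (?P<OPEN>...) defines the group; (?(OPEN)...) tests whether it matched.
pattern = re.compile(r"(?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )", re.VERBOSE)

print(bool(pattern.fullmatch("(abc)")))
print(bool(pattern.fullmatch("abc")))
print(bool(pattern.fullmatch("(abc")))
```

       Using a name rather than a number keeps the condition valid if other
       groups are later added in front of it.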
2472
2473   Checking for pattern recursion
2474
2475       "Recursion"  in  this sense refers to any subroutine-like call from one
2476       part of the pattern to another, whether or not it  is  actually  recur‐
2477       sive.  See  the  sections  entitled "Recursive patterns" and "Groups as
2478       subroutines" below for details of recursion and subroutine calls.
2479
2480       If a condition is the string (R), and there is no  capture  group  with
2481       the  name R, the condition is true if matching is currently in a recur‐
2482       sion or subroutine call to the whole pattern or any capture  group.  If
2483       digits  follow  the letter R, and there is no group with that name, the
2484       condition is true if the most recent call is  into  a  group  with  the
2485       given  number,  which must exist somewhere in the overall pattern. This
2486       is a contrived example that is equivalent to a+b:
2487
2488         ((?(R1)a+|(?1)b))
2489
2490       However, in both cases, if there is a capture  group  with  a  matching
2491       name,  the  condition tests for its being set, as described in the sec‐
2492       tion above, instead of testing for recursion. For example,  creating  a
2493       group  with  the  name  R1  by adding (?<R1>) to the above pattern com‐
2494       pletely changes its meaning.
2495
2496       If a name preceded by ampersand follows the letter R, for example:
2497
2498         (?(R&name)...)
2499
2500       the condition is true if the most recent recursion is into a  group  of
2501       that name (which must exist within the pattern).
2502
2503       This condition does not check the entire recursion stack. It tests only
2504       the current level. If the name used in a condition of this  kind  is  a
2505       duplicate,  the  test is applied to all groups of the same name, and is
2506       true if any one of them is the most recent recursion.
2507
2508       At "top level", all these recursion test conditions are false.
2509
2510   Defining capture groups for use by reference only
2511
2512       If the condition is the string (DEFINE), the condition is always false,
2513       even  if there is a group with the name DEFINE. In this case, there may
2514       be only one alternative in the rest of the  conditional  group.  It  is
2515       always  skipped  if control reaches this point in the pattern; the idea
2516       of DEFINE is that it can be used to define subroutines that can be ref‐
2517       erenced  from  elsewhere.  (The use of subroutines is described below.)
2518       For  example,  a  pattern  to   match   an   IPv4   address   such   as
2519       "192.168.23.245"  could  be  written  like this (ignore white space and
2520       line breaks):
2521
2522         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
2523         \b (?&byte) (\.(?&byte)){3} \b
2524
2525       The first part of the pattern is a DEFINE group  inside  which  another
2526       group  named "byte" is defined. This matches an individual component of
2527       an IPv4 address (a number less than 256). When  matching  takes  place,
2528       this  part  of  the pattern is skipped because DEFINE acts like a false
2529       condition. The rest of the pattern uses references to the  named  group
2530       to  match the four dot-separated components of an IPv4 address, insist‐
2531       ing on a word boundary at each end.
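
       Python's re module has neither DEFINE nor subroutine calls, but the
       same reuse can be imitated by building the pattern from a string.
       This is only a sketch of the idea under that assumption: the variable
       BYTE stands in for the (?<byte>...) definition, and each
       interpolation stands in for a (?&byte) call. The re.ASCII flag is an
       added choice so that \d means 0-9 only.

```python
import re

# Stand-in for (?(DEFINE) (?<byte> ... )): defined once, matches nothing itself.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Stand-in for \b (?&byte) (\.(?&byte)){3} \b
ipv4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)

for subject in ["192.168.23.245", "256.1.1.1", "1.2.3"]:
    print(subject, bool(ipv4.fullmatch(subject)))
```

       Only the first subject matches: "256" is not a valid component, and
       "1.2.3" has too few components.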
2532
2533   Checking the PCRE2 version
2534
2535       Programs that link with a PCRE2 library can check the version by  call‐
2536       ing  pcre2_config()  with  appropriate arguments. Users of applications
2537       that do not have access to the underlying code cannot do this.  A  spe‐
2538       cial  "condition" called VERSION exists to allow such users to discover
2539       which version of PCRE2 they are dealing with by using this condition to
2540       match  a string such as "yesno". VERSION must be followed either by "="
2541       or ">=" and a version number.  For example:
2542
2543         (?(VERSION>=10.4)yes|no)
2544
2545       This pattern matches "yes" if the PCRE2 version is greater than or equal to
2546       10.4,  or "no" otherwise. The fractional part of the version number may
2547       not contain more than two digits.
2548
2549   Assertion conditions
2550
2551       If the condition is not in any of the  above  formats,  it  must  be  a
2552       parenthesized  assertion.  This may be a positive or negative lookahead
2553       or lookbehind assertion. However,  it  must  be  a  traditional  atomic
2554       assertion, not one of the PCRE2-specific non-atomic assertions.
2555
2556       Consider  this  pattern,  again containing non-significant white space,
2557       and with the two alternatives on the second line:
2558
2559         (?(?=[^a-z]*[a-z])
2560         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
2561
2562       The condition  is  a  positive  lookahead  assertion  that  matches  an
2563       optional  sequence of non-letters followed by a letter. In other words,
2564       it tests for the presence of at least one letter in the subject.  If  a
2565       letter  is found, the subject is matched against the first alternative;
2566       otherwise it is  matched  against  the  second.  This  pattern  matches
2567       strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2568       letters and dd are digits.
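
       In an engine without assertion conditions, (?(?=A)X|Y) can be
       rewritten as (?:(?=A)X|(?!A)Y), which selects the same branch as long
       as no captures made inside the assertion are needed. The sketch below
       uses Python's re module, which lacks assertion conditions, to show
       that rewrite for the pattern above. It is an equivalent formulation
       for illustration, not a description of how PCRE2 evaluates the
       condition; the name HAS_LETTER is hypothetical.

```python
import re

HAS_LETTER = r"[^a-z]*[a-z]"  # the body of the lookahead condition

pattern = re.compile(
    rf"""
    (?: (?={HAS_LETTER}) \d{{2}}-[a-z]{{3}}-\d{{2}}   # condition true
      | (?!{HAS_LETTER}) \d{{2}}-\d{{2}}-\d{{2}}      # condition false
    )
    """,
    re.VERBOSE,
)

for subject in ["12-abc-34", "12-34-56", "12-ab-34"]:
    print(subject, bool(pattern.fullmatch(subject)))
```

       The third subject contains a letter, so only the first alternative
       is tried, and it fails because three letters are required.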
2569
2570       When an assertion that is a condition contains capture groups, any cap‐
2571       turing  that  occurs  in  a matching branch is retained afterwards, for
2572       both positive and negative assertions, because matching always  contin‐
2573       ues  after  the  assertion, whether it succeeds or fails. (Compare non-
2574       conditional assertions, for which captures are retained only for  posi‐
2575       tive assertions that succeed.)
2576

COMMENTS

2578
2579       There are two ways of including comments in patterns that are processed
2580       by PCRE2. In both cases, the start of the comment  must  not  be  in  a
2581       character  class,  nor  in  the middle of any other sequence of related
2582       characters such as (?: or a group name or number. The  characters  that
2583       make up a comment play no part in the pattern matching.
2584
2585       The  sequence (?# marks the start of a comment that continues up to the
2586       next closing parenthesis. Nested parentheses are not permitted. If  the
2587       PCRE2_EXTENDED  or  PCRE2_EXTENDED_MORE  option  is set, an unescaped #
2588       character also introduces a comment, which in this  case  continues  to
2589       immediately  after  the next newline character or character sequence in
2590       the pattern. Which characters are interpreted as newlines is controlled
2591       by  an option passed to the compiling function or by a special sequence
2592       at the start of the pattern, as described in the section entitled "New‐
2593       line conventions" above. Note that the end of this type of comment is a
2594       literal newline sequence in the pattern; escape sequences  that  happen
2595       to represent a newline do not count. For example, consider this pattern
2596       when PCRE2_EXTENDED is set, and the default newline convention (a  sin‐
2597       gle linefeed character) is in force:
2598
2599         abc #comment \n still comment
2600
2601       On  encountering  the # character, pcre2_compile() skips along, looking
2602       for a newline in the pattern. The sequence \n is still literal at  this
2603       stage,  so  it does not terminate the comment. Only an actual character
2604       with the code value 0x0a (the default newline) does so.
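
       The verbose mode of Python's re module treats # comments in the same
       way for this case, so it can serve as a quick stand-in to see the
       effect. In the sketch below the first pattern is a raw string, so \n
       reaches the compiler as the two characters backslash and n; the
       second contains a real linefeed. The variable names are illustrative.

```python
import re

# Raw string: backslash + "n" stays inside the comment, so only "abc" remains.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: a real linefeed (0x0a) ends the comment, so "still" is pattern text.
literal = re.compile("abc #comment \n still", re.VERBOSE)

print(bool(escaped.fullmatch("abc")))
print(bool(literal.fullmatch("abcstill")))
```

       Both checks succeed, which shows that only the actual linefeed
       character terminated the comment.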
2605

RECURSIVE PATTERNS

2607
2608       Consider the problem of matching a string in parentheses, allowing  for
2609       unlimited  nested  parentheses.  Without the use of recursion, the best
2610       that can be done is to use a pattern that  matches  up  to  some  fixed
2611       depth  of  nesting.  It  is not possible to handle an arbitrary nesting
2612       depth.
2613
2614       For some time, Perl has provided a facility that allows regular expres‐
2615       sions  to recurse (amongst other things). It does this by interpolating
2616       Perl code in the expression at run time, and the code can refer to  the
2617       expression itself. A Perl pattern using code interpolation to solve the
2618       parentheses problem can be created like this:
2619
2620         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
2621
2622       The (?p{...}) item interpolates Perl code at run time, and in this case
2623       refers recursively to the pattern in which it appears.
2624
2625       Obviously,  PCRE2  cannot  support  the  interpolation  of  Perl  code.
2626       Instead, it supports special syntax for recursion of  the  entire  pat‐
2627       tern, and also for individual capture group recursion. After its intro‐
2628       duction in PCRE1 and Python, this kind of  recursion  was  subsequently
2629       introduced into Perl at release 5.10.
2630
2631       A  special  item  that consists of (? followed by a number greater than
2632       zero and a closing parenthesis is a recursive subroutine  call  of  the
2633       capture  group of the given number, provided that it occurs inside that
2634       group. (If not,  it  is  a  non-recursive  subroutine  call,  which  is
2635       described  in  the  next  section.)  The special item (?R) or (?0) is a
2636       recursive call of the entire regular expression.
2637
2638       This PCRE2 pattern solves the nested parentheses  problem  (assume  the
2639       PCRE2_EXTENDED option is set so that white space is ignored):
2640
2641         \( ( [^()]++ | (?R) )* \)
2642
2643       First  it matches an opening parenthesis. Then it matches any number of
2644       substrings which can either be a  sequence  of  non-parentheses,  or  a
2645       recursive  match  of the pattern itself (that is, a correctly parenthe‐
2646       sized substring).  Finally there is a closing parenthesis. Note the use
2647       of a possessive quantifier to avoid backtracking into sequences of non-
2648       parentheses.
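
       Python's re module has no recursion syntax, so the following
       plain-Python sketch only mirrors what the pattern does. A
       hypothetical helper match_parens() plays the role of the whole
       pattern, and its call to itself plays the role of (?R).

```python
def match_parens(s, pos=0):
    """Return the index just past a balanced group starting at pos, or -1."""
    if pos >= len(s) or s[pos] != "(":
        return -1                      # the leading \( did not match
    pos += 1
    while pos < len(s):
        if s[pos] == ")":
            return pos + 1             # the closing \)
        if s[pos] == "(":
            pos = match_parens(s, pos) # recursive call, like (?R)
            if pos == -1:
                return -1
        else:
            # A run of non-parentheses, consumed without backtracking,
            # like [^()]++
            while pos < len(s) and s[pos] not in "()":
                pos += 1
    return -1                          # ran out of subject before \)

print(match_parens("(ab(cd)ef)"))
print(match_parens("(aaaa()"))
```

       The first call returns 10, the length of the balanced string; the
       second returns -1 because the outer parenthesis is never closed.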
2649
2650       If this were part of a larger pattern, you would not  want  to  recurse
2651       the entire pattern, so instead you could use this:
2652
2653         ( \( ( [^()]++ | (?1) )* \) )
2654
2655       We  have  put the pattern into parentheses, and caused the recursion to
2656       refer to them instead of the whole pattern.
2657
2658       In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
2659       tricky.  This is made easier by the use of relative references. Instead
2660       of (?1) in the pattern above you can write (?-2) to refer to the second
2661       most  recently  opened  parentheses  preceding  the recursion. In other
2662       words, a negative number counts capturing  parentheses  leftwards  from
2663       the point at which it is encountered.
2664
2665       Be aware, however, that if duplicate capture group numbers are in use,
2666       relative references refer to the earliest group  with  the  appropriate
2667       number. Consider, for example:
2668
2669         (?|(a)|(b)) (c) (?-2)
2670
2671       The first two capture groups (a) and (b) are both numbered 1, and group
2672       (c) is number 2. When the reference (?-2) is  encountered,  the  second
2673       most recently opened group has the number 1, but it  is  the  first
2674       such group (the (a) group) to which the recursion refers. This would be
2675       the  same if an absolute reference (?1) was used. In other words, rela‐
2676       tive references are just a shorthand for computing a group number.
2677
2678       It is also possible to refer to subsequent capture groups,  by  writing
2679       references  such  as  (?+2). However, these cannot be recursive because
2680       the reference is not inside the parentheses that are  referenced.  They
2681       are  always  non-recursive  subroutine  calls, as described in the next
2682       section.
2683
2684       An alternative approach is to use named parentheses.  The  Perl  syntax
2685       for  this  is  (?&name);  PCRE1's earlier syntax (?P>name) is also sup‐
2686       ported. We could rewrite the above example as follows:
2687
2688         (?<pn> \( ( [^()]++ | (?&pn) )* \) )
2689
2690       If there is more than one group with the same name, the earliest one is
2691       used.
2692
2693       The example pattern that we have been looking at contains nested unlim‐
2694       ited repeats, and so the use of a possessive  quantifier  for  matching
2695       strings  of  non-parentheses  is important when applying the pattern to
2696       strings that do not match. For example, when this pattern is applied to
2697
2698         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2699
2700       it yields "no match" quickly. However, if a  possessive  quantifier  is
2701       not  used, the match runs for a very long time indeed because there are
2702       so many different ways the + and * repeats can carve  up  the  subject,
2703       and all have to be tested before failure can be reported.
2704
2705       At  the  end  of a match, the values of capturing parentheses are those
2706       from the outermost level. If you want to obtain intermediate values,  a
2707       callout function can be used (see below and the pcre2callout documenta‐
2708       tion). If the pattern above is matched against
2709
2710         (ab(cd)ef)
2711
2712       the value for the inner capturing parentheses  (numbered  2)  is  "ef",
2713       which  is  the last value taken on at the top level. If a capture group
2714       is not matched at the top level, its final  captured  value  is  unset,
2715       even  if it was (temporarily) set at a deeper level during the matching
2716       process.
2717
2718       Do not confuse the (?R) item with the condition (R),  which  tests  for
2719       recursion.   Consider  this pattern, which matches text in angle brack‐
2720       ets, allowing for arbitrary nesting. Only digits are allowed in  nested
2721       brackets  (that is, when recursing), whereas any characters are permit‐
2722       ted at the outer level.
2723
2724         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
2725
2726       In this pattern, (?(R) is the start of a conditional  group,  with  two
2727       different  alternatives  for the recursive and non-recursive cases. The
2728       (?R) item is the actual recursive call.
2729
2730   Differences in recursion processing between PCRE2 and Perl
2731
2732       Some former differences between PCRE2 and Perl no longer exist.
2733
2734       Before release 10.30, recursion processing in PCRE2 differed from  Perl
2735       in  that  a  recursive  subroutine call was always treated as an atomic
2736       group. That is, once it had matched some of the subject string, it  was
2737       never  re-entered,  even if it contained untried alternatives and there
2738       was a subsequent matching failure. (Historical note:  PCRE  implemented
2739       recursion before Perl did.)
2740
2741       Starting  with  release 10.30, recursive subroutine calls are no longer
2742       treated as atomic. That is, they can be re-entered to try unused alter‐
2743       natives  if  there  is a matching failure later in the pattern. This is
2744       now compatible with the way Perl works. If you want a  subroutine  call
2745       to be atomic, you must explicitly enclose it in an atomic group.
2746
2747       Supporting  backtracking  into  recursions  simplifies certain types of
2748       recursive  pattern.  For  example,  this  pattern  matches  palindromic
2749       strings:
2750
2751         ^((.)(?1)\2|.?)$
2752
2753       The  second  branch  in the group matches a single central character in
2754       the palindrome when there are an odd number of characters,  or  nothing
2755       when  there  are  an even number of characters, but in order to work it
2756       has to be able to try the second case when  the  rest  of  the  pattern
2757       match fails. If you want to match typical palindromic phrases, the pat‐
2758       tern has to ignore all non-word characters,  which  can  be  done  like
2759       this:
2760
2761         ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
2762
2763       If  run  with  the  PCRE2_CASELESS option, this pattern matches phrases
2764       such as "A man, a plan, a canal: Panama!". Note the use of the  posses‐
2765       sive  quantifier  *+  to  avoid backtracking into sequences of non-word
2766       characters. Without this, PCRE2 takes a great deal longer (ten times or
2767       more)  to  match typical phrases, and Perl takes so long that you think
2768       it has gone into a loop.
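
       The structure of the two palindrome patterns above can be mirrored in
       plain Python. This is a sketch of the logic only, since Python's re
       module has no recursion; the helper names is_palindrome and
       is_palindromic_phrase are hypothetical, and lower-casing stands in
       for the PCRE2_CASELESS option.

```python
import re

def is_palindrome(s):
    # The .? branch: a single central character, or nothing at all.
    if len(s) <= 1:
        return True
    # The (.)(?1)\2 branch: equal outer characters around a recursive match.
    return s[0] == s[-1] and is_palindrome(s[1:-1])

def is_palindromic_phrase(phrase):
    # Drop non-word characters and ignore case before testing.
    letters = re.sub(r"\W+", "", phrase).lower()
    return is_palindrome(letters)

print(is_palindrome("abcba"))
print(is_palindromic_phrase("A man, a plan, a canal: Panama!"))
```

       Both calls report a palindrome. The base case corresponds to the
       second alternative of the group, which the pattern can reach only by
       backtracking into the recursion.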
2769
2770       Another way in which PCRE2 and Perl used to differ in  their  recursion
2771       processing  is  in  the  handling of captured values. Formerly in Perl,
2772       when a group was called recursively or as a subroutine  (see  the  next
2773       section), it had no access to any values that were captured outside the
2774       recursion, whereas in PCRE2 these values can  be  referenced.  Consider
2775       this pattern:
2776
2777         ^(.)(\1|a(?2))
2778
2779       This  pattern matches "bab". The first capturing parentheses match "b",
2780       then in the second group, when the backreference \1 fails to match "b",
2781       the second alternative matches "a" and then recurses. In the recursion,
2782       \1 does now match "b" and so the whole match succeeds. This match  used
2783       to fail in Perl, but it works in later versions (tested with 5.024).
2784

GROUPS AS SUBROUTINES

2786
2787       If  the syntax for a recursive group call (either by number or by name)
2788       is used outside the parentheses to which it refers, it operates  a  bit
2789       like  a  subroutine  in  a programming language. More accurately, PCRE2
2790       treats the referenced group as an independent subpattern which it tries
2791       to  match  at  the  current  matching position. The called group may be
2792       defined before or after the reference.  A  numbered  reference  can  be
2793       absolute or relative, as in these examples:
2794
2795         (...(absolute)...)...(?2)...
2796         (...(relative)...)...(?-1)...
2797         (...(?+1)...(relative)...
2798
2799       An earlier example pointed out that the pattern
2800
2801         (sens|respons)e and \1ibility
2802
2803       matches  "sense and sensibility" and "response and responsibility", but
2804       not "sense and responsibility". If instead the pattern
2805
2806         (sens|respons)e and (?1)ibility
2807
2808       is used, it does match "sense and responsibility" as well as the  other
2809       two  strings.  Another  example  is  given  in the discussion of DEFINE
2810       above.
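
       The difference between the backreference and the subroutine call can
       be sketched in Python. Its re module supports \1 but not (?1), so the
       subroutine call is imitated here by repeating the text of the group;
       that substitution, and the name GROUP, are assumptions made for
       illustration.

```python
import re

GROUP = r"(?:sens|respons)"   # the text of group 1, reused like (?1)

# \1 must repeat the same characters that group 1 actually matched.
backreference = re.compile(r"(sens|respons)e and \1ibility")

# A subroutine call re-runs the group's pattern, so either word may follow.
subroutine_like = re.compile(rf"{GROUP}e and {GROUP}ibility")

subject = "sense and responsibility"
print(bool(backreference.fullmatch(subject)))
print(bool(subroutine_like.fullmatch(subject)))
```

       The backreference form rejects the mixed subject, while the form
       that re-applies the group's pattern accepts it.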
2811
2812       Like recursions, subroutine calls used to be  treated  as  atomic,  but
2813       this  changed  at  PCRE2 release 10.30, so backtracking into subroutine
2814       calls can now occur. However, any capturing parentheses  that  are  set
2815       during the subroutine call revert to their previous values afterwards.
2816
2817       Processing  options such as case-independence are fixed when a group is
2818       defined, so if it is used as  a  subroutine,  such  options  cannot  be
2819       changed for different calls. For example, consider this pattern:
2820
2821         (abc)(?i:(?-1))
2822
2823       It  matches  "abcabc". It does not match "abcABC" because the change of
2824       processing option does not affect the called group.
2825
2826       The behaviour of backtracking control verbs in groups  when  called  as
2827       subroutines is described in the section entitled "Backtracking verbs in
2828       subroutines" below.
2829

ONIGURUMA SUBROUTINE SYNTAX

2831
2832       For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
2833       name or a number enclosed either in angle brackets or single quotes, is
2834       an alternative syntax for calling a group  as  a  subroutine,  possibly
2835       recursively.  Here  are two of the examples used above, rewritten using
2836       this syntax:
2837
2838         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
2839         (sens|respons)e and \g'1'ibility
2840
2841       PCRE2 supports an extension to Oniguruma: if a number is preceded by  a
2842       plus or a minus sign it is taken as a relative reference. For example:
2843
2844         (abc)(?i:\g<-1>)
2845
2846       Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
2847       synonymous. The former is a backreference; the latter is  a  subroutine
2848       call.
2849

CALLOUTS

2851
2852       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
2853       Perl code to be obeyed in the middle of matching a regular  expression.
2854       This makes it possible, amongst other things, to extract different sub‐
2855       strings that match the same pair of parentheses when there is a repeti‐
2856       tion.
2857
2858       PCRE2  provides  a  similar feature, but of course it cannot obey arbi‐
2859       trary Perl code. The feature is called "callout". The caller  of  PCRE2
2860       provides  an  external  function  by putting its entry point in a match
2861       context using the function pcre2_set_callout(), and then  passing  that
2862       context  to  pcre2_match() or pcre2_dfa_match(). If no match context is
2863       passed, or if the callout entry point is set to NULL, callouts are dis‐
2864       abled.
2865
2866       Within  a  regular expression, (?C<arg>) indicates a point at which the
2867       external function is to be called. There  are  two  kinds  of  callout:
2868       those  with a numerical argument and those with a string argument. (?C)
2869       on its own with no argument is treated as (?C0). A  numerical  argument
2870       allows  the  application  to  distinguish  between  different callouts.
2871       String arguments were added for release 10.20 to make it  possible  for
2872       script  languages that use PCRE2 to embed short scripts within patterns
2873       in a similar way to Perl.
2874
2875       During matching, when PCRE2 reaches a callout point, the external func‐
2876       tion  is  called.  It is provided with the number or string argument of
2877       the callout, the position in the pattern, and one item of data that  is
2878       also set in the match block. The callout function may cause matching to
2879       proceed, to backtrack, or to fail.
2880
2881       By default, PCRE2 implements a  number  of  optimizations  at  matching
2882       time,  and  one  side-effect is that sometimes callouts are skipped. If
2883       you need all possible callouts to happen, you need to set options  that
2884       disable  the relevant optimizations. More details, including a complete
2885       description of the programming interface to the callout  function,  are
2886       given in the pcre2callout documentation.
2887
2888   Callouts with numerical arguments
2889
2890       If  you  just  want  to  have  a means of identifying different callout
2891       points, put a number less than 256 after the  letter  C.  For  example,
2892       this pattern has two callout points:
2893
2894         (?C1)abc(?C2)def
2895
2896       If  the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical
2897       callouts are automatically installed before each item in  the  pattern.
2898       They  are all numbered 255. If there is a conditional group in the pat‐
2899       tern whose condition is an assertion, an additional callout is inserted
2900       just  before the condition. An explicit callout may also be set at this
2901       position, as in this example:
2902
2903         (?(?C9)(?=a)abc|def)
2904
2905       Note that this applies only to assertion conditions, not to other types
2906       of condition.
2907
2908   Callouts with string arguments
2909
2910       A  delimited  string may be used instead of a number as a callout argu‐
2911       ment. The starting delimiter must be one of ` ' " ^ % #  $  {  and  the
2912       ending delimiter is the same as the start, except for {, where the end‐
2913       ing delimiter is }. If  the  ending  delimiter  is  needed  within  the
2914       string, it must be doubled. For example:
2915
2916         (?C'ab ''c'' d')xyz(?C{any text})pqr
2917
2918       The  doubling  is  removed  before  the string is passed to the callout
2919       function.
2920
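           The delimiter rules above can be sketched as a small parser. This
           is a toy model in Python, not PCRE2 code: the function name
           parse_callout_string and the DELIMITERS table are hypothetical and
           exist only for illustration. It assumes the input starts at the
           opening delimiter, that is, just after "(?C".

```python
# Toy parser for a PCRE2-style callout string argument such as (?C'ab ''c'' d').
# It models the delimiter rules described in the text; it is not PCRE2 itself.

# Map each permitted starting delimiter to its ending delimiter.
DELIMITERS = {"`": "`", "'": "'", '"': '"', "^": "^", "%": "%",
              "#": "#", "$": "$", "{": "}"}


def parse_callout_string(text, start=0):
    """Parse a delimited callout string beginning at text[start].

    Returns (value, next_index): value has doubled ending delimiters
    collapsed to one, and next_index is just past the closing delimiter.
    """
    opener = text[start]
    if opener not in DELIMITERS:
        raise ValueError(f"invalid callout string delimiter: {opener!r}")
    closer = DELIMITERS[opener]
    out = []
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == closer:
            # A doubled ending delimiter stands for one literal delimiter.
            if i + 1 < len(text) and text[i + 1] == closer:
                out.append(closer)
                i += 2
                continue
            return "".join(out), i + 1
        out.append(ch)
        i += 1
    raise ValueError("unterminated callout string")


print(parse_callout_string("'ab ''c'' d'"))
print(parse_callout_string("{any text}"))
```

           The returned value corresponds to what the text says is passed to
           the callout function: the string with the doubling removed.
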

BACKTRACKING CONTROL

2922
2923       There are a number of special  "Backtracking  Control  Verbs"  (to  use
2924       Perl's  terminology)  that  modify the behaviour of backtracking during
2925       matching. They are generally of the form (*VERB) or (*VERB:NAME).  Some
2926       verbs take either form, and may behave differently depending on whether
2927       or not a name argument is present. The names are  not  required  to  be
2928       unique within the pattern.
2929
2930       By  default,  for  compatibility  with  Perl, a name is any sequence of
2931       characters that does not include a closing parenthesis. The name is not
2932       processed  in  any  way,  and  it  is not possible to include a closing
2933       parenthesis  in  the  name.   This  can  be  changed  by  setting   the
2934       PCRE2_ALT_VERBNAMES  option,  but the result is no longer Perl-compati‐
2935       ble.
2936
2937       When PCRE2_ALT_VERBNAMES is set, backslash  processing  is  applied  to
2938       verb  names  and  only  an unescaped closing parenthesis terminates the
2939       name. However, the only backslash items that are permitted are \Q,  \E,
2940       and  sequences such as \x{100} that define character code points. Char‐
2941       acter type escapes such as \d are faulted.
2942
2943       A closing parenthesis can be included in a name either as \) or between
2944       \Q  and  \E. In addition to backslash processing, if the PCRE2_EXTENDED
2945       or PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb
2946       names is skipped, and #-comments are recognized, exactly as in the rest
2947       of the pattern.  PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do  not  affect
2948       verb names unless PCRE2_ALT_VERBNAMES is also set.
2949
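           The difference between the default rule and PCRE2_ALT_VERBNAMES
           can be sketched as follows. This is a simplified Python model, not
           PCRE2 code: parse_verb_name is a hypothetical helper, it handles
           only \Q...\E and a backslash followed by a non-alphanumeric
           character, and it rejects every other escape (including \x{...},
           which PCRE2 itself permits). The handling of whitespace and
           #-comments under PCRE2_EXTENDED is not modelled.

```python
def parse_verb_name(text, alt_verbnames=False):
    """Return (name, index_after_closing_parenthesis).

    text starts just after the colon in (*VERB:NAME). With
    alt_verbnames=False the name ends at the first ")". With
    alt_verbnames=True a simplified form of backslash processing applies.
    """
    if not alt_verbnames:
        end = text.index(")")          # raises ValueError if there is no ")"
        return text[:end], end + 1

    out = []
    i = 0
    literal = False                    # True between \Q and \E
    while i < len(text):
        ch = text[i]
        if literal:
            if text.startswith("\\E", i):
                literal = False
                i += 2
            else:
                out.append(ch)
                i += 1
        elif ch == "\\":
            if i + 1 >= len(text):
                raise ValueError("trailing backslash in verb name")
            nxt = text[i + 1]
            if nxt == "Q":
                literal = True
            elif nxt == "E":
                pass                   # an isolated \E is ignored
            elif not nxt.isalnum():
                out.append(nxt)        # e.g. \) gives a literal ")"
            else:
                raise ValueError(f"escape \\{nxt} is not modelled here")
            i += 2
        elif ch == ")":
            return "".join(out), i + 1
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated verb name")


print(parse_verb_name(r"A\)B)"))                      # default rule
print(parse_verb_name(r"A\)B)", alt_verbnames=True))  # alternative rule
```

           Under the default rule the backslash is an ordinary character and
           the name stops at the first closing parenthesis; under the
           alternative rule the escaped parenthesis becomes part of the name.
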
2950       The  maximum  length of a name is 255 in the 8-bit library and 65535 in
2951       the 16-bit and 32-bit libraries. If the name is empty, that is, if  the
2952       closing  parenthesis immediately follows the colon, the effect is as if
2953       the colon were not there. Any number of these verbs may occur in a pat‐
2954       tern. Except for (*ACCEPT), they may not be quantified.
2955
2956       Since  these  verbs  are  specifically related to backtracking, most of
2957       them can be used only when the pattern is to be matched using the  tra‐
2958       ditional matching function, because that uses a backtracking algorithm.
2959       With the exception of (*FAIL), which behaves like  a  failing  negative
2960       assertion, the backtracking control verbs cause an error if encountered
2961       by the DFA matching function.
2962
2963       The behaviour of these verbs in repeated  groups,  assertions,  and  in
2964       capture  groups  called  as subroutines (whether or not recursively) is
2965       documented below.
2966
2967   Optimizations that affect backtracking verbs
2968
2969       PCRE2 contains some optimizations that are used to speed up matching by
2970       running some checks at the start of each match attempt. For example, it
2971       may know the minimum length of matching subject, or that  a  particular
2972       character must be present. When one of these optimizations bypasses the
2973       running of a match,  any  included  backtracking  verbs  will  not,  of
2974       course, be processed. You can suppress the start-of-match optimizations
2975       by setting the PCRE2_NO_START_OPTIMIZE option when  calling  pcre2_com‐
2976       pile(),  or by starting the pattern with (*NO_START_OPT). There is more
2977       discussion of this option in the section entitled "Compiling a pattern"
2978       in the pcre2api documentation.
2979
2980       Experiments  with  Perl  suggest that it too has similar optimizations,
2981       and that, as in PCRE2, turning them off can change the result of a match.
2982
2983   Verbs that act immediately
2984
2985       The following verbs act as soon as they are encountered.
2986
2987          (*ACCEPT) or (*ACCEPT:NAME)
2988
2989       This verb causes the match to end successfully, skipping the  remainder
2990       of  the  pattern.  However,  when  it is inside a capture group that is
2991       called as a subroutine, only that group is ended successfully. Matching
2992       then continues at the outer level. If (*ACCEPT) is triggered in a posi‐
2993       tive assertion, the assertion succeeds; in a  negative  assertion,  the
2994       assertion fails.
2995
2996       If  (*ACCEPT)  is inside capturing parentheses, the data so far is cap‐
2997       tured. For example:
2998
2999         A((?:A|B(*ACCEPT)|C)D)
3000
3001       This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap‐
3002       tured by the outer parentheses.
3003
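           For readers who want to experiment without PCRE2, the behaviour of
           this particular example can be reproduced in Python's re module,
           which has no (*ACCEPT). The rewrite below is an assumption-laden
           emulation, valid only because nothing follows the group in the
           pattern; it is not a general translation of (*ACCEPT).

```python
import re

# Emulation of A((?:A|B(*ACCEPT)|C)D) without (*ACCEPT): the "B" branch is
# moved outside the part that requires a following "D". This works only
# because the pattern ends immediately after the capture group.
EMULATED = re.compile(r"A((?:A|C)D|B)")

for subject in ("AB", "AAD", "ACD"):
    m = EMULATED.match(subject)
    print(subject, "->", m.group(0), "captured:", m.group(1))
```

           As in the text, the subject "AB" matches with "B" captured by the
           outer parentheses.
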
3004       (*ACCEPT)  is  the only backtracking verb that is allowed to be quanti‐
3005       fied because an ungreedy quantification with a  minimum  of  zero  acts
3006       only when a backtrack happens. Consider, for example,
3007
3008         (A(*ACCEPT)??B)C
3009
3010       where  A,  B, and C may be complex expressions. After matching "A", the
3011       matcher processes "BC"; if that fails, causing a  backtrack,  (*ACCEPT)
3012       is  triggered  and the match succeeds. In both cases, all but C is cap‐
3013       tured. Whereas (*COMMIT) (see  below)  means  "fail  on  backtrack",  a
3014       repeated (*ACCEPT) of this type means "succeed on backtrack".
3015
3016       Warning:  (*ACCEPT)  should  not  be  used  within  a script run group,
3017       because it causes an immediate  exit  from  the  group,  bypassing  the
3018       script run checking.
3019
3020         (*FAIL) or (*FAIL:NAME)
3021
3022       This  verb causes a matching failure, forcing backtracking to occur. It
3023       may be abbreviated to (*F). It is equivalent  to  (?!)  but  easier  to
3024       read. The Perl documentation notes that it is probably useful only when
3025       combined with (?{}) or (??{}). Those are, of course, Perl features that
3026       are  not  present  in PCRE2. The nearest equivalent is the callout fea‐
3027       ture, as for example in this pattern:
3028
3029         a+(?C)(*FAIL)
3030
3031       A match with the string "aaaa" always fails, but the callout  is  taken
3032       before each backtrack happens (in this example, 10 times).
3033
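           The figure of 10 follows from simple counting, which the sketch
           below reproduces. It is a Python model of the backtracking, not a
           call into PCRE2; count_callouts is a hypothetical name, and the
           model assumes that every starting position is tried and that no
           optimization removes any backtracking point.

```python
def count_callouts(subject):
    """Count callouts for the pattern a+(?C)(*FAIL) against subject.

    Model: at each starting position, greedy a+ first takes the longest
    run of "a" and then gives back one character per backtrack. The
    callout runs once for every length that a+ tries, and (*FAIL) then
    rejects that attempt.
    """
    calls = 0
    for start in range(len(subject)):
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        calls += run   # lengths run, run-1, ..., 1 each reach the callout
    return calls


print(count_callouts("aaaa"))   # 4 + 3 + 2 + 1
```

           For "aaaa" the four starting positions contribute 4, 3, 2, and 1
           callouts respectively.
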
3034       (*ACCEPT:NAME)     and     (*FAIL:NAME)     behave    the    same    as
3035       (*MARK:NAME)(*ACCEPT) and (*MARK:NAME)(*FAIL), respectively, that is, a
3036       (*MARK) is recorded just before the verb acts.
3037
3038   Recording which path was taken
3039
3040       There  is  one  verb  whose  main  purpose  is to track how a match was
3041       arrived at, though it also has a  secondary  use  in  conjunction  with
3042       advancing the match starting point (see (*SKIP) below).
3043
3044         (*MARK:NAME) or (*:NAME)
3045
3046       A  name is always required with this verb. For all the other backtrack‐
3047       ing control verbs, a NAME argument is optional.
3048
3049       When  a  match  succeeds,  the  last-encountered   mark   name   on
3050       the matching path is passed back to the caller as described in the sec‐
3051       tion entitled "Other information about the match" in the pcre2api docu‐
3052       mentation.  This  applies  to all instances of (*MARK) and other verbs,
3053       including those inside assertions and atomic groups. However, there are
3054       differences  in  those  cases  when (*MARK) is used in conjunction with
3055       (*SKIP) as described below.
3056
3057       The mark name that was last encountered on the matching path is  passed
3058       back.  A verb without a NAME argument is ignored for this purpose. Here
3059       is an example of pcre2test output, where the "mark"  modifier  requests
3060       the retrieval and outputting of (*MARK) data:
3061
3062           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
3063         data> XY
3064          0: XY
3065         MK: A
3066         data> XZ
3067          0: XZ
3068         MK: B
3069
3070       The (*MARK) name is tagged with "MK:" in this output, and in this exam‐
3071       ple it indicates which of the two alternatives matched. This is a  more
3072       efficient  way of obtaining this information than putting each alterna‐
3073       tive in its own capturing parentheses.
3074
3075       If a verb with a name is encountered in a positive  assertion  that  is
3076       true,  the  name  is recorded and passed back if it is the last-encoun‐
3077       tered. This does not happen for negative assertions or failing positive
3078       assertions.
3079
3080       After  a  partial match or a failed match, the last encountered name in
3081       the entire match process is returned. For example:
3082
3083           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
3084         data> XP
3085         No match, mark = B
3086
3087       Note that in this unanchored example the  mark  is  retained  from  the
3088       match attempt that started at the letter "X" in the subject. Subsequent
3089       match attempts starting at "P" and then with an empty string do not get
3090       as far as the (*MARK) item, but nevertheless do not reset it.
3091
3092       If  you  are  interested  in  (*MARK)  values after failed matches, you
3093       should probably set the PCRE2_NO_START_OPTIMIZE option (see  above)  to
3094       ensure that the match is always attempted.
3095
3096   Verbs that act after backtracking
3097
3098       The following verbs do nothing when they are encountered. Matching con‐
3099       tinues with what follows, but if there is a subsequent  match  failure,
3100       causing  a  backtrack  to the verb, a failure is forced. That is, back‐
3101       tracking cannot pass to the left of the  verb.  However,  when  one  of
3102       these verbs appears inside an atomic group or in a lookaround assertion
3103       that is true, its effect is confined to that group,  because  once  the
3104       group  has been matched, there is never any backtracking into it. Back‐
3105       tracking from beyond an assertion or an atomic group ignores the entire
3106       group, and seeks a preceding backtracking point.
3107
3108       These  verbs  differ  in exactly what kind of failure occurs when back‐
3109       tracking reaches them. The behaviour described below  is  what  happens
3110       when  the  verb is not in a subroutine or an assertion. Subsequent sec‐
3111       tions cover these special cases.
3112
3113         (*COMMIT) or (*COMMIT:NAME)
3114
3115       This verb causes the whole match to fail outright if there is  a  later
3116       matching failure that causes backtracking to reach it. Even if the pat‐
3117       tern is unanchored, no further attempts to find a  match  by  advancing
3118       the  starting  point  take place. If (*COMMIT) is the only backtracking
3119       verb that is encountered, once it has been passed, pcre2_match() is com‐
3120       mitted to finding a match at the current starting point, or not at all.
3121       For example:
3122
3123         a+(*COMMIT)b
3124
3125       This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
3126       of dynamic anchor, or "I've started, so I must finish."
3127
3128       The  behaviour  of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM‐
3129       MIT). It is like (*MARK:NAME) in that the name is remembered for  pass‐
3130       ing  back  to the caller. However, (*SKIP:NAME) searches only for names
3131       that are set with (*MARK), ignoring those set by any of the other back‐
3132       tracking verbs.
3133
3134       If  there  is more than one backtracking verb in a pattern, a different
3135       one that follows (*COMMIT) may be triggered first,  so  merely  passing
3136       (*COMMIT) during a match does not always guarantee that a match must be
3137       at this starting point.
3138
3139       Note that (*COMMIT) at the start of a pattern is not  the  same  as  an
3140       anchor,  unless PCRE2's start-of-match optimizations are turned off, as
3141       shown in this output from pcre2test:
3142
3143           re> /(*COMMIT)abc/
3144         data> xyzabc
3145          0: abc
3146         data>
3147           re> /(*COMMIT)abc/no_start_optimize
3148         data> xyzabc
3149         No match
3150
3151       For the first pattern, PCRE2 knows that any match must start with  "a",
3152       so  the optimization skips along the subject to "a" before applying the
3153       pattern to the first set of data. The match attempt then succeeds.  The
3154       second  pattern disables the optimization that skips along to the first
3155       character. The pattern is now applied  starting  at  "x",  and  so  the
3156       (*COMMIT)  causes  the  match to fail without trying any other starting
3157       points.
3158
3159         (*PRUNE) or (*PRUNE:NAME)
3160
3161       This verb causes the match to fail at the current starting position  in
3162       the subject if there is a later matching failure that causes backtrack‐
3163       ing to reach it. If the pattern is unanchored, the  normal  "bumpalong"
3164       advance  to  the next starting character then happens. Backtracking can
3165       occur as usual to the left of (*PRUNE), before it is reached,  or  when
3166       matching  to  the  right  of  (*PRUNE), but if there is no match to the
3167       right, backtracking cannot cross (*PRUNE). In simple cases, the use  of
3168       (*PRUNE)  is just an alternative to an atomic group or possessive quan‐
3169       tifier, but there are some uses of (*PRUNE) that cannot be expressed in
3170       any  other  way. In an anchored pattern (*PRUNE) has the same effect as
3171       (*COMMIT).
3172
3173       The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
3174       It is like (*MARK:NAME) in that the name is remembered for passing back
3175       to the caller. However, (*SKIP:NAME) searches only for names  set  with
3176       (*MARK), ignoring those set by other backtracking verbs.
3177
3178         (*SKIP)
3179
3180       This  verb, when given without a name, is like (*PRUNE), except that if
3181       the pattern is unanchored, the "bumpalong" advance is not to  the  next
3182       character, but to the position in the subject where (*SKIP) was encoun‐
3183       tered. (*SKIP) signifies that whatever text was matched leading  up  to
3184       it  cannot  be part of a successful match if there is a later mismatch.
3185       Consider:
3186
3187         a+(*SKIP)b
3188
3189       If the subject is "aaaac...",  after  the  first  match  attempt  fails
3190       (starting  at  the  first  character in the string), the starting point
3191       skips on to start the next attempt at "c". Note that a possessive quan‐
3192       tifier does not have the same effect as this example; although it would
3193       suppress backtracking  during  the  first  match  attempt,  the  second
3194       attempt  would  start at the second character instead of skipping on to
3195       "c".
3196
3197       If (*SKIP) is used to specify a new starting position that is the  same
3198       as  the  starting  position of the current match, or (by being inside a
3199       lookbehind) earlier, the position specified by (*SKIP) is ignored,  and
3200       instead the normal "bumpalong" occurs.
3201
3202         (*SKIP:NAME)
3203
3204       When  (*SKIP)  has  an associated name, its behaviour is modified. When
3205       such a (*SKIP) is triggered, the previous path through the  pattern  is
3206       searched  for the most recent (*MARK) that has the same name. If one is
3207       found, the "bumpalong" advance is to the subject position  that  corre‐
3208       sponds  to that (*MARK) instead of to where (*SKIP) was encountered. If
3209       no (*MARK) with a matching name is found, the (*SKIP) is ignored.
3210
3211       The search for a (*MARK) name uses the normal  backtracking  mechanism,
3212       which  means  that  it  does  not  see (*MARK) settings that are inside
3213       atomic groups or assertions, because they are never re-entered by back‐
3214       tracking. Compare the following pcre2test examples:
3215
3216           re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
3217         data> abc
3218          0: a
3219          1: a
3220         data>
3221           re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
3222         data> abc
3223          0: b
3224          1: b
3225
3226       In  the first example, the (*MARK) setting is in an atomic group, so it
3227       is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored.
3228       This  allows  the second branch of the pattern to be tried at the first
3229       character position.  In the second example, the (*MARK) setting is  not
3230       in  an  atomic group. This allows (*SKIP:X) to find the (*MARK) when it
3231       backtracks, and this causes a new matching attempt to start at the sec‐
3232       ond  character.  This  time, the (*MARK) is never seen because "a" does
3233       not match "b", so the matcher immediately jumps to the second branch of
3234       the pattern.
3235
3236       Note  that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
3237       ignores names that are set by other backtracking verbs.
3238
3239         (*THEN) or (*THEN:NAME)
3240
3241       This verb causes a skip to the next innermost  alternative  when  back‐
3242       tracking  reaches  it.  That  is,  it  cancels any further backtracking
3243       within the current alternative. Its name  comes  from  the  observation
3244       that it can be used for a pattern-based if-then-else block:
3245
3246         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
3247
3248       If  the COND1 pattern matches, FOO is tried (and possibly further items
3249       after the end of the group if FOO succeeds); on  failure,  the  matcher
3250       skips  to  the second alternative and tries COND2, without backtracking
3251       into COND1. If that succeeds and BAR fails, COND3 is tried.  If  subse‐
3252       quently  BAZ fails, there are no more alternatives, so there is a back‐
3253       track to whatever came before the  entire  group.  If  (*THEN)  is  not
3254       inside an alternation, it acts like (*PRUNE).
3255
3256       The  behaviour  of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
3257       It is like (*MARK:NAME) in that the name is remembered for passing back
3258       to  the  caller. However, (*SKIP:NAME) searches only for names set with
3259       (*MARK), ignoring those set by other backtracking verbs.
3260
3261       A group that does not contain a | character  is  just  a  part  of  the
3262       enclosing  alternative;  it  is  not a nested alternation with only one
3263       alternative. The effect of (*THEN) extends beyond such a group  to  the
3264       enclosing  alternative.   Consider  this  pattern, where A, B, etc. are
3265       complex pattern fragments that do not contain any | characters at  this
3266       level:
3267
3268         A (B(*THEN)C) | D
3269
3270       If  A and B are matched, but there is a failure in C, matching does not
3271       backtrack into A; instead it moves to the next alternative, that is, D.
3272       However,  if  the  group containing (*THEN) is given an alternative, it
3273       behaves differently:
3274
3275         A (B(*THEN)C | (*FAIL)) | D
3276
3277       The effect of (*THEN) is now confined to the inner group. After a fail‐
3278       ure  in  C,  matching moves to (*FAIL), which causes the whole group to
3279       fail because there are no more  alternatives  to  try.  In  this  case,
3280       matching does backtrack into A.
3281
3282       Note  that a conditional group is not considered as having two alterna‐
3283       tives, because only one is ever used. In other words, the  |  character
3284       in  a  conditional group has a different meaning. Ignoring white space,
3285       consider:
3286
3287         ^.*? (?(?=a) a | b(*THEN)c )
3288
3289       If the subject is "ba", this pattern does not  match.  Because  .*?  is
3290       ungreedy,  it  initially  matches  zero characters. The condition (?=a)
3291       then fails, the character "b" is matched,  but  "c"  is  not.  At  this
3292       point,  matching does not backtrack to .*? as might perhaps be expected
3293       from the presence of the | character. The conditional group is part  of
3294       the  single  alternative  that  comprises the whole pattern, and so the
3295       match fails. (If there were a backtrack into .*?, allowing it to  match
3296       "b", the match would succeed.)
3297
3298       The  verbs just described provide four different "strengths" of control
3299       when subsequent matching fails. (*THEN) is the weakest, carrying on the
3300       match  at  the next alternative. (*PRUNE) comes next, failing the match
3301       at the current starting position, but allowing an advance to  the  next
3302       character  (for an unanchored pattern). (*SKIP) is similar, except that
3303       the advance may be more than one character. (*COMMIT) is the strongest,
3304       causing the entire match to fail.
3305
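           The difference in strength between (*PRUNE), (*SKIP), and
           (*COMMIT) can be seen in a toy model of the unanchored pattern
           a+(*VERB)b. The Python function below is a hand-written matcher
           for that one pattern, not PCRE2 itself; match_a_plus_verb_b is a
           hypothetical name. It assumes start-of-match optimizations are
           disabled, so every starting position, including the end of the
           subject, is tried.

```python
def match_a_plus_verb_b(subject, verb=None):
    """Model the unanchored pattern a+(*VERB)b.

    verb is None (plain greedy a+b), "PRUNE", "SKIP", or "COMMIT".
    Returns (span_or_None, starts_tried), where span is (start, end) of
    the match and starts_tried lists each starting position attempted.
    """
    tried = []
    n = len(subject)
    start = 0
    while start <= n:
        tried.append(start)
        run = 0
        while start + run < n and subject[start + run] == "a":
            run += 1
        next_start = start + 1                 # normal "bumpalong"
        if run:
            # Without a verb, a+ may give back characters one at a time.
            # With a verb, backtracking cannot cross it, so only the
            # longest run is ever followed by an attempt to match "b".
            lengths = range(run, 0, -1) if verb is None else [run]
            for length in lengths:
                end = start + length
                if end < n and subject[end] == "b":
                    return (start, end + 1), tried
            if verb == "COMMIT":
                return None, tried             # whole match fails outright
            if verb == "SKIP":
                next_start = start + run       # resume where (*SKIP) was met
        start = next_start
    return None, tried


for verb in (None, "PRUNE", "SKIP", "COMMIT"):
    print(verb, match_a_plus_verb_b("aaaac", verb))
```

           With the subject "aaaac" none of the variants matches, but the
           list of starting positions shrinks as the verb gets stronger.
           Because this pattern contains no alternation, (*THEN) would behave
           like (*PRUNE) here.
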
3306   More than one backtracking verb
3307
3308       If  more  than  one  backtracking verb is present in a pattern, the one
3309       that is backtracked onto first acts. For example,  consider  this  pat‐
3310       tern, where A, B, etc. are complex pattern fragments:
3311
3312         (A(*COMMIT)B(*THEN)C|ABD)
3313
3314       If  A matches but B fails, the backtrack to (*COMMIT) causes the entire
3315       match to fail. However, if A and B match, but C fails, the backtrack to
3316       (*THEN)  causes  the next alternative (ABD) to be tried. This behaviour
3317       is consistent, but is not always the same as Perl's. It means  that  if
3318       two or more backtracking verbs appear in succession, all but  the  last
3319       of them have no effect. Consider this example:
3320
3321         ...(*COMMIT)(*PRUNE)...
3322
3323       If there is a matching failure to the right, backtracking onto (*PRUNE)
3324       causes  it to be triggered, and its action is taken. There can never be
3325       a backtrack onto (*COMMIT).
3326
3327   Backtracking verbs in repeated groups
3328
3329       PCRE2 sometimes differs from Perl in its handling of backtracking verbs
3330       in repeated groups. For example, consider:
3331
3332         /(a(*COMMIT)b)+ac/
3333
3334       If  the  subject  is  "abac", Perl matches unless its optimizations are
3335       disabled, but PCRE2 always fails because the (*COMMIT)  in  the  second
3336       repeat of the group acts.
3337
3338   Backtracking verbs in assertions
3339
3340       (*FAIL)  in any assertion has its normal effect: it forces an immediate
3341       backtrack. The behaviour of the other  backtracking  verbs  depends  on
3342       whether  the assertion is standalone or is acting as the condition in a
3343       conditional group.
3344
3345       (*ACCEPT) in a standalone positive assertion causes  the  assertion  to
3346       succeed  without  any  further  processing; captured strings and a mark
3347       name (if  set)  are  retained.  In  a  standalone  negative  assertion,
3348       (*ACCEPT)  causes the assertion to fail without any further processing;
3349       captured substrings and any mark name are discarded.
3350
3351       If the assertion is a condition, (*ACCEPT) causes the condition  to  be
3352       true  for  a  positive assertion and false for a negative one; captured
3353       substrings are retained in both cases.
3354
3355       The remaining verbs act only when a later failure causes a backtrack to
3356       reach  them. This means that, for the Perl-compatible assertions, their
3357       effect is confined to the assertion, because Perl lookaround assertions
3358       are atomic. A backtrack that occurs after such an assertion is complete
3359       does not jump back into  the  assertion.  Note  in  particular  that  a
3360       (*MARK)  name  that is set in an assertion is not "seen" by an instance
3361       of (*SKIP:NAME) later in the pattern.
3362
3363       PCRE2 now supports non-atomic positive assertions, as described in  the
3364       section  entitled  "Non-atomic assertions" above. These assertions must
3365       be standalone (not used as conditions). They are  not  Perl-compatible.
3366       For  these assertions, a later backtrack does jump back into the asser‐
3367       tion, and therefore verbs such as (*COMMIT) can be triggered  by  back‐
3368       tracks from later in the pattern.
3369
3370       The  effect of (*THEN) is not allowed to escape beyond an assertion. If
3371       there are no more branches to try, (*THEN) causes a positive  assertion
3372       to be false, and a negative assertion to be true.
3373
3374       The  other  backtracking verbs are not treated specially if they appear
3375       in a standalone positive assertion. In a  conditional  positive  asser‐
3376       tion, backtracking (from within the assertion) into (*COMMIT), (*SKIP),
3377       or (*PRUNE) causes the condition to be false. However, for both  stand‐
3378       alone and conditional negative assertions, backtracking into (*COMMIT),
3379       (*SKIP), or (*PRUNE) causes the assertion to be true, without consider‐
3380       ing any further alternative branches.
3381
3382   Backtracking verbs in subroutines
3383
3384       These behaviours occur whether or not the group is called recursively.
3385
3386       (*ACCEPT) in a group called as a subroutine causes the subroutine match
3387       to succeed without any  further  processing.  Matching  then  continues
3388       after the subroutine call. Perl documents this behaviour. Perl's treat‐
3389       ment of the other verbs in subroutines is different in some cases.
3390
3391       (*FAIL) in a group called as a subroutine has  its  normal  effect:  it
3392       forces an immediate backtrack.
3393
3394       (*COMMIT),  (*SKIP),  and  (*PRUNE)  cause the subroutine match to fail
3395       when triggered by being backtracked to in a group called as  a  subrou‐
3396       tine. There is then a backtrack at the outer level.
3397
3398       (*THEN), when triggered, skips to the next alternative in the innermost
3399       enclosing group that has alternatives (its normal behaviour).  However,
3400       if there is no such group within the subroutine's group, the subroutine
3401       match fails and there is a backtrack at the outer level.
3402

AUTHOR

3409
3410       Philip Hazel
3411       University Computing Service
3412       Cambridge, England.
3413

REVISION

3415
3416       Last updated: 06 October 2020
3417       Copyright (c) 1997-2020 University of Cambridge.
3418
3419
3420
3421PCRE2 10.35                     06 October 2020                PCRE2PATTERN(3)