PCRE2PATTERN(3)            Library Functions Manual            PCRE2PATTERN(3)

NAME

       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 REGULAR EXPRESSION DETAILS

       The syntax and semantics of the regular expressions that are supported
       by PCRE2 are described in detail below. There is a quick-reference
       syntax summary in the pcre2syntax page. PCRE2 tries to match Perl
       syntax and semantics as closely as it can. PCRE2 also supports some
       alternative regular expression syntax (which does not conflict with
       the Perl syntax) in order to provide some compatibility with regular
       expressions in Python, .NET, and Oniguruma.

       Perl's regular expressions are described in its own documentation, and
       regular expressions in general are covered in a number of books, some
       of which have copious examples. Jeffrey Friedl's "Mastering Regular
       Expressions", published by O'Reilly, covers regular expressions in
       great detail. This description of PCRE2's regular expressions is
       intended as reference material.

       This document discusses the patterns that are supported by PCRE2 when
       its main matching function, pcre2_match(), is used. PCRE2 also has an
       alternative matching function, pcre2_dfa_match(), which matches using
       a different algorithm that is not Perl-compatible. Some of the
       features discussed below are not available when DFA matching is used.
       The advantages and disadvantages of the alternative function, and how
       it differs from the normal function, are discussed in the
       pcre2matching page.

SPECIAL START-OF-PATTERN ITEMS

       A number of options that can be passed to pcre2_compile() can also be
       set by special items at the start of a pattern. These are not
       Perl-compatible, but are provided to make these options accessible to
       pattern writers who are not able to change the program that processes
       the pattern. Any number of these items may appear, but they must all
       be together right at the start of the pattern string, and the letters
       must be in upper case.

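       The placement rule can be illustrated with a small Python sketch. It
       is not PCRE2 code: split_start_items() is a hypothetical helper, and
       the item names are simply those described in the subsections of this
       part of the document. It peels items off the front of a pattern and
       stops at the first thing that is not an upper-case start-of-pattern
       item.

```python
import re

# Item names taken from this section of the document.
FLAG_ITEMS = {
    "UTF", "UCP", "NOTEMPTY", "NOTEMPTY_ATSTART", "NO_AUTO_POSSESS",
    "NO_START_OPT", "NO_DOTSTAR_ANCHOR", "NO_JIT",
    "CR", "LF", "CRLF", "ANYCRLF", "ANY", "BSR_ANYCRLF", "BSR_UNICODE",
}
VALUE_ITEMS = {"LIMIT_MATCH", "LIMIT_RECURSION"}

# Upper-case letters and underscores only: lower-case items do not match.
_ITEM = re.compile(r"\(\*([A-Z_]+)(?:=([0-9]+))?\)")


def split_start_items(pattern):
    """Hypothetical helper: return (items, rest_of_pattern)."""
    items, pos = [], 0
    while True:
        m = _ITEM.match(pattern, pos)
        if m is None:
            break
        name, value = m.group(1), m.group(2)
        if value is None and name in FLAG_ITEMS:
            items.append((name, None))
        elif value is not None and name in VALUE_ITEMS:
            items.append((name, int(value)))
        else:
            break  # not a start-of-pattern item; leave it in the pattern
        pos = m.end()
    return items, pattern[pos:]


print(split_start_items("(*UTF)(*LIMIT_MATCH=1000)a.b"))
print(split_start_items("a(*UTF)"))  # not at the start: nothing is split off
```

       Only the syntax is modelled here; what each item does is described
       in the subsections that follow.
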
   UTF support

       In the 8-bit and 16-bit PCRE2 libraries, characters may be coded
       either as single code units, or as multiple UTF-8 or UTF-16 code
       units. UTF-32 can be specified for the 32-bit library, in which case
       it constrains the character values to valid Unicode code points. To
       process UTF strings, PCRE2 must be built to include Unicode support
       (which is the default). When using UTF strings you must either call
       the compiling function with the PCRE2_UTF option, or the pattern must
       start with the special sequence (*UTF), which is equivalent to setting
       the relevant option. How setting a UTF mode affects pattern matching
       is mentioned in several places below. There is also a summary of
       features in the pcre2unicode page.

       Some applications that allow their users to supply patterns may wish
       to restrict them to non-UTF data for security reasons. If the
       PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not
       allowed, and its appearance in a pattern causes an error.

   Unicode property support

       Another special sequence that may appear at the start of a pattern is
       (*UCP). This has the same effect as setting the PCRE2_UCP option: it
       causes sequences such as \d and \w to use Unicode properties to
       determine character types, instead of recognizing only characters with
       codes less than 256 via a lookup table.

       Some applications that allow their users to supply patterns may wish
       to restrict them for security reasons. If the PCRE2_NEVER_UCP option
       is passed to pcre2_compile(), (*UCP) is not allowed, and its
       appearance in a pattern causes an error.

   Locking out empty string matching

       Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the
       same effect as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART
       option to whichever matching function is subsequently called to match
       the pattern. These options lock out the matching of empty strings,
       either entirely, or only at the start of the subject.

   Disabling auto-possessification

       If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
       setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making
       quantifiers possessive when what follows cannot match the repeated
       item. For example, by default a+b is treated as a++b. For more
       details, see the pcre2api documentation.

   Disabling start-up optimizations

       If a pattern starts with (*NO_START_OPT), it has the same effect as
       setting the PCRE2_NO_START_OPTIMIZE option. This disables several
       optimizations for quickly reaching "no match" results. For more
       details, see the pcre2api documentation.

   Disabling automatic anchoring

       If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect
       as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables
       optimizations that apply to patterns whose top-level branches all
       start with .* (match any number of arbitrary characters). For more
       details, see the pcre2api documentation.

   Disabling JIT compilation

       If a pattern that starts with (*NO_JIT) is successfully compiled, an
       attempt by the application to apply the JIT optimization by calling
       pcre2_jit_compile() is ignored.

   Setting match and recursion limits

       The caller of pcre2_match() can set a limit on the number of times the
       internal match() function is called and on the maximum depth of
       recursive calls. These facilities are provided to catch runaway
       matches that are provoked by patterns with huge matching trees (a
       typical example is a pattern with nested unlimited repeats) and to
       avoid running out of system stack by too much recursion. When one of
       these limits is reached, pcre2_match() gives an error return. The
       limits can also be set by items at the start of the pattern of the
       form

         (*LIMIT_MATCH=d)
         (*LIMIT_RECURSION=d)

       where d is any number of decimal digits. However, the value of the
       setting must be less than the value set (or defaulted) by the caller
       of pcre2_match() for it to have any effect. In other words, the
       pattern writer can lower the limits set by the programmer, but not
       raise them. If there is more than one setting of one of these limits,
       the lower value is used.

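       The "lower, never raise" rule amounts to taking a minimum. The Python
       sketch below only illustrates that arithmetic; effective_limit() is a
       hypothetical name, not a PCRE2 function.

```python
def effective_limit(caller_limit, pattern_values):
    """Limit in force after applying (*LIMIT_...=d) items from a pattern.

    caller_limit   -- value set (or defaulted) by the caller of pcre2_match()
    pattern_values -- the d values of every such item found in the pattern
    """
    # A pattern value matters only when it is below the caller's limit,
    # and when several are given the lowest one is used.
    return min([caller_limit, *pattern_values])


print(effective_limit(10000, [500, 2000]))  # pattern lowers the limit
print(effective_limit(1000, [5000]))        # attempt to raise it is ignored
```
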
       The match limit is used (but in a different way) when JIT is being
       used, but it is not relevant, and is ignored, when matching with
       pcre2_dfa_match(). However, the recursion limit is relevant for DFA
       matching, which does use some function recursion, in particular, for
       recursions within the pattern.

   Newline conventions

       PCRE2 supports five different conventions for indicating line breaks
       in strings: a single CR (carriage return) character, a single LF
       (linefeed) character, the two-character sequence CRLF, any of the
       three preceding, or any Unicode newline sequence. The pcre2api page
       has further discussion about newlines, and shows how to set the
       newline convention when calling pcre2_compile().

       It is also possible to specify a newline convention by starting a
       pattern string with one of the following five sequences:

         (*CR)        carriage return
         (*LF)        linefeed
         (*CRLF)      carriage return, followed by linefeed
         (*ANYCRLF)   any of the three above
         (*ANY)       all Unicode newline sequences

       These override the default and the options given to the compiling
       function. For example, on a Unix system where LF is the default
       newline sequence, the pattern

         (*CR)a.b

       changes the convention to CR. That pattern matches "a\nb" because LF
       is no longer a newline. If more than one of these settings is present,
       the last one is used.

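       The (*CR)a.b example can be mimicked with Python's re module, which
       has no start-of-pattern items of its own. This is a rough sketch
       under stated assumptions: only the CR and LF conventions are handled,
       the pattern body may contain only literals and ".", a whole-string
       match is used for simplicity, and both function names are
       hypothetical.

```python
import re

# Stand-ins for "." (without PCRE2_DOTALL) under two of the conventions.
# CRLF, ANYCRLF and ANY need multi-character handling and are omitted.
DOT_CLASS = {"CR": r"[^\r]", "LF": r"[^\n]"}
_NEWLINE_ITEM = re.compile(r"\(\*(CR|LF)\)")


def split_newline_items(pattern, default="LF"):
    """Strip leading (*CR)/(*LF) items; the last one present wins."""
    convention = default
    m = _NEWLINE_ITEM.match(pattern)
    while m:
        convention = m.group(1)
        pattern = pattern[m.end():]
        m = _NEWLINE_ITEM.match(pattern)
    return convention, pattern


def matches(pattern, subject):
    """Whole-string match of a literals-and-dots pattern body."""
    convention, body = split_newline_items(pattern)
    regex = "".join(
        DOT_CLASS[convention] if c == "." else re.escape(c) for c in body
    )
    return re.fullmatch(regex, subject) is not None


print(matches("a.b", "a\nb"))       # LF is the newline, so "." rejects it
print(matches("(*CR)a.b", "a\nb"))  # CR convention: LF is an ordinary character
```
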
       The newline convention affects where the circumflex and dollar
       assertions are true. It also affects the interpretation of the dot
       metacharacter when PCRE2_DOTALL is not set, and the behaviour of \N.
       However, it does not affect what the \R escape sequence matches. By
       default, this is any Unicode newline sequence, for Perl compatibility.
       However, this can be changed; see the description of \R in the section
       entitled "Newline sequences" below. A change of \R setting can be
       combined with a change of newline convention.

   Specifying what \R matches

       It is possible to restrict \R to match only CR, LF, or CRLF (instead
       of the complete set of Unicode line endings) by setting the option
       PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by
       starting a pattern with (*BSR_ANYCRLF). For completeness,
       (*BSR_UNICODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.

EBCDIC CHARACTER CODES

       PCRE2 can be compiled to run in an environment that uses EBCDIC as its
       character code rather than ASCII or Unicode (typically a mainframe
       system). In the sections below, character code values are ASCII or
       Unicode; in an EBCDIC environment these characters may have different
       code values, and there are no code points greater than 255.

CHARACTERS AND METACHARACTERS

       A regular expression is a pattern that is matched against a subject
       string from left to right. Most characters stand for themselves in a
       pattern, and match the corresponding characters in the subject. As a
       trivial example, the pattern

         The quick brown fox

       matches a portion of a subject string that is identical to itself.
       When caseless matching is specified (the PCRE2_CASELESS option),
       letters are matched independently of case.

       The power of regular expressions comes from the ability to include
       alternatives and repetitions in the pattern. These are encoded in the
       pattern by the use of metacharacters, which do not stand for
       themselves but instead are interpreted in some special way.

       There are two different sets of metacharacters: those that are
       recognized anywhere in the pattern except within square brackets, and
       those that are recognized within square brackets. Outside square
       brackets, the metacharacters are as follows:

         \      general escape character with several uses
         ^      assert start of string (or line, in multiline mode)
         $      assert end of string (or line, in multiline mode)
         .      match any character except newline (by default)
         [      start character class definition
         |      start of alternative branch
         (      start subpattern
         )      end subpattern
         ?      extends the meaning of (
                also 0 or 1 quantifier
                also quantifier minimizer
         *      0 or more quantifier
         +      1 or more quantifier
                also "possessive quantifier"
         {      start min/max quantifier

       Part of a pattern that is in square brackets is called a "character
       class". In a character class the only metacharacters are:

         \      general escape character
         ^      negate the class, but only if the first character
         -      indicates character range
         [      POSIX character class (only if followed by POSIX
                  syntax)
         ]      terminates the character class

       The following sections describe the use of each of the metacharacters.

BACKSLASH

       The backslash character has several uses. Firstly, if it is followed
       by a character that is not a number or a letter, it takes away any
       special meaning that character may have. This use of backslash as an
       escape character applies both inside and outside character classes.

       For example, if you want to match a * character, you write \* in the
       pattern. This escaping action applies whether or not the following
       character would otherwise be interpreted as a metacharacter, so it is
       always safe to precede a non-alphanumeric with backslash to specify
       that it stands for itself. In particular, if you want to match a
       backslash, you write \\.

       In a UTF mode, only ASCII numbers and letters have any special meaning
       after a backslash. All other characters (in particular, those whose
       codepoints are greater than 127) are treated as literals.

       If a pattern is compiled with the PCRE2_EXTENDED option, most white
       space in the pattern (other than in a character class), and characters
       between a # outside a character class and the next newline, inclusive,
       are ignored. An escaping backslash can be used to include a white
       space or # character as part of the pattern.

       If you want to remove the special meaning from a sequence of
       characters, you can do so by putting them between \Q and \E. This is
       different from Perl in that $ and @ are handled as literals in \Q...\E
       sequences in PCRE2, whereas in Perl, $ and @ cause variable
       interpolation. Note the following examples:

         Pattern            PCRE2 matches   Perl matches

         \Qabc$xyz\E        abc$xyz         abc followed by the
                                              contents of $xyz
         \Qabc\$xyz\E       abc\$xyz        abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz

       The \Q...\E sequence is recognized both inside and outside character
       classes. An isolated \E that is not preceded by \Q is ignored. If \Q
       is not followed by \E later in the pattern, the literal interpretation
       continues to the end of the pattern (that is, \E is assumed at the
       end). If the isolated \Q is inside a character class, this causes an
       error, because the character class is not terminated.

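       The PCRE2 column of the table above can be reproduced with a short
       Python sketch that rewrites \Q...\E spans as escaped literals before
       handing the result to Python's re module. expand_quotes() is a
       hypothetical helper, not part of PCRE2, and it does not model
       character classes (so the unterminated-class error is not detected).

```python
import re


def expand_quotes(pattern):
    # Rewrite each \Q...\E span as an escaped literal.
    out = []
    pos = 0
    while pos < len(pattern):
        if pattern.startswith(r"\Q", pos):
            end = pattern.find(r"\E", pos + 2)
            if end == -1:
                # No closing \E: the literal text runs to the end.
                out.append(re.escape(pattern[pos + 2:]))
                break
            out.append(re.escape(pattern[pos + 2:end]))
            pos = end + 2
        elif pattern.startswith(r"\E", pos):
            pos += 2  # an isolated \E is ignored
        elif pattern[pos] == "\\" and pos + 1 < len(pattern):
            out.append(pattern[pos:pos + 2])  # keep other escapes intact
            pos += 2
        else:
            out.append(pattern[pos])
            pos += 1
    return "".join(out)


for pat in (r"\Qabc$xyz\E", r"\Qabc\$xyz\E", r"\Qabc\E\$\Qxyz\E"):
    print(pat, "->", expand_quotes(pat))
```

       Because $ has no interpolation meaning in Python either, the expanded
       patterns match the same subjects as the "PCRE2 matches" column.
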
   Non-printing characters

       A second use of backslash provides a way of encoding non-printing
       characters in patterns in a visible manner. There is no restriction on
       the appearance of non-printing characters in a pattern, but when a
       pattern is being prepared by text editing, it is often easier to use
       one of the following escape sequences than the binary character it
       represents. In an ASCII or Unicode environment, these escapes are as
       follows:

         \a        alarm, that is, the BEL character (hex 07)
         \cx       "control-x", where x is any printable ASCII character
         \e        escape (hex 1B)
         \f        form feed (hex 0C)
         \n        linefeed (hex 0A)
         \r        carriage return (hex 0D)
         \t        tab (hex 09)
         \0dd      character with octal code 0dd
         \ddd      character with octal code ddd, or back reference
         \o{ddd..} character with octal code ddd..
         \xhh      character with hex code hh
         \x{hhh..} character with hex code hhh.. (default mode)
         \uhhhh    character with hex code hhhh (when PCRE2_ALT_BSUX is set)

       The precise effect of \cx on ASCII characters is as follows: if x is a
       lower case letter, it is converted to upper case. Then bit 6 of the
       character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex
       1A (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c;
       becomes hex 7B (; is 3B). If the code unit following \c has a value
       less than 32 or greater than 126, a compile-time error occurs.

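       The \cx arithmetic for the ASCII case can be written out directly.
       The function below is an illustrative Python sketch of the rule just
       stated; control_escape() is a hypothetical name and raising
       ValueError stands in for the compile-time error.

```python
def control_escape(x):
    """Code value produced by backslash-c followed by x (ASCII case)."""
    code = ord(x)
    if code < 32 or code > 126:
        raise ValueError("character after \\c must be printable ASCII")
    if "a" <= x <= "z":
        code = ord(x.upper())  # lower case letters are upper-cased first
    return code ^ 0x40         # then bit 6 (hex 40) is inverted


for ch in "AZa{;":
    print(repr(ch), hex(control_escape(ch)))
```
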
       When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t
       generate the appropriate EBCDIC code values. The \c escape is
       processed as specified for Perl in the perlebcdic document. The only
       characters that are allowed after \c are A-Z, a-z, or one of @, [, \,
       ], ^, _, or ?. Any other character provokes a compile-time error. The
       sequence \c@ encodes character code 0; after \c the letters (in either
       case) encode characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _
       encode characters 27-31 (hex 1B to hex 1F), and \c? becomes either 255
       (hex FF) or 95 (hex 5F).

       Thus, apart from \c?, these escapes generate the same character code
       values as they do in an ASCII environment, though the meanings of the
       values mostly differ. For example, \cG always generates code value 7,
       which is BEL in ASCII but DEL in EBCDIC.

       The sequence \c? generates DEL (127, hex 7F) in an ASCII environment,
       but because 127 is not a control character in EBCDIC, Perl makes it
       generate the APC character. Unfortunately, there are several variants
       of EBCDIC. In most of them the APC character has the value 255 (hex
       FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
       certain other characters have POSIX-BC values, PCRE2 makes \c?
       generate 95; otherwise it generates 255.

       After \0 up to two further octal digits are read. If there are fewer
       than two digits, just those that are present are used. Thus the
       sequence \0\x\015 specifies two binary zeros followed by a CR
       character (code value 13). Make sure you supply two digits after the
       initial zero if the pattern character that follows is itself an octal
       digit.

       The escape \o must be followed by a sequence of octal digits, enclosed
       in braces. An error occurs if this is not the case. This escape is a
       recent addition to Perl; it provides a way of specifying character
       code points as octal numbers greater than 0777, and it also allows
       octal numbers and back references to be unambiguously specified.

       For greater clarity and unambiguity, it is best to avoid following \
       by a digit greater than zero. Instead, use \o{} or \x{} to specify
       character numbers, and \g{} to specify back references. The following
       paragraphs describe the old, ambiguous syntax.

       The handling of a backslash followed by a digit other than 0 is
       complicated, and Perl has changed over time, causing PCRE2 also to
       change.

       Outside a character class, PCRE2 reads the digit and any following
       digits as a decimal number. If the number is less than 10, begins with
       the digit 8 or 9, or if there are at least that many previous
       capturing left parentheses in the expression, the entire sequence is
       taken as a back reference. A description of how this works is given
       later, following the discussion of parenthesized subpatterns.
       Otherwise, up to three octal digits are read to form a character code.

       Inside a character class, PCRE2 handles \8 and \9 as the literal
       characters "8" and "9", and otherwise reads up to three octal digits
       following the backslash, using them to generate a data character. Any
       subsequent digits stand for themselves. For example, outside a
       character class:

         \040   is another way of writing an ASCII space
         \40    is the same, provided there are fewer than 40
                   previous capturing subpatterns
         \7     is always a back reference
         \11    might be a back reference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a back reference, otherwise the
                   character with octal code 113
         \377   might be a back reference, otherwise
                   the value 255 (decimal)
         \81    is always a back reference

       Note that octal values of 100 or greater that are specified using this
       syntax must not be introduced by a leading zero, because no more than
       three octal digits are ever read.

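       The decision rule for a backslash followed by a non-zero digit,
       outside a character class, can be sketched in Python as follows. This
       is a reading of the rule as described above, not PCRE2's source;
       classify_backslash_digits() and its return format are hypothetical.

```python
OCTAL_DIGITS = "01234567"


def classify_backslash_digits(digits, groups_so_far):
    """Interpret the digit run after a backslash, outside a class.

    digits        -- the run of decimal digits; the first must not be "0"
    groups_so_far -- number of previous capturing left parentheses
    Returns ("backref", n) or ("char", code, trailing_literal_digits).
    """
    number = int(digits)
    if number < 10 or digits[0] in "89" or groups_so_far >= number:
        return ("backref", number)
    # Otherwise read up to three octal digits as a character code;
    # any digits left over stand for themselves.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in OCTAL_DIGITS:
        n += 1
    return ("char", int(digits[:n], 8), digits[n:])


print(classify_backslash_digits("40", 0))    # octal 40: an ASCII space
print(classify_backslash_digits("40", 40))   # enough groups: back reference
print(classify_backslash_digits("81", 0))    # begins with 8: back reference
```

       The \0 forms in the table (\040, \011, \0113) follow the separate
       "after \0" rule and are deliberately outside this function.
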
       By default, after \x that is not followed by {, from zero to two
       hexadecimal digits are read (letters can be in upper or lower case).
       Any number of hexadecimal digits may appear between \x{ and }. If a
       character other than a hexadecimal digit appears between \x{ and }, or
       if there is no terminating }, an error occurs.

       If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as
       just described only when it is followed by two hexadecimal digits.
       Otherwise, it matches a literal "x" character. In this mode, support
       for code points greater than 255 is provided by \u, which must be
       followed by four hexadecimal digits; otherwise it matches a literal
       "u" character.

       Characters whose value is less than 256 can be defined by either of
       the two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no
       difference in the way they are handled. For example, \xdc is exactly
       the same as \x{dc} (or \u00dc in PCRE2_ALT_BSUX mode).

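       The two ways of reading \x can be compared with a small Python
       sketch. It follows the wording above literally and is not PCRE2's
       parser: parse_x() is a hypothetical helper, ValueError stands in for
       the compile-time error, and treating an empty \x{} as an error is an
       assumption, since the text does not cover that case.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_x(rest, alt_bsux=False):
    """rest is the pattern text immediately after backslash-x.

    Returns (value, consumed): value is a code point, or the literal "x"
    in PCRE2_ALT_BSUX mode; consumed counts characters taken from rest.
    """
    if alt_bsux:
        # Interpreted only when exactly followed by two hex digits.
        if len(rest) >= 2 and rest[0] in HEX_DIGITS and rest[1] in HEX_DIGITS:
            return int(rest[:2], 16), 2
        return "x", 0
    if rest.startswith("{"):
        close = rest.find("}")
        body = rest[1:close] if close != -1 else ""
        # Assumption: an empty body is rejected along with non-hex content.
        if close == -1 or not body or any(c not in HEX_DIGITS for c in body):
            raise ValueError("malformed \\x{...}")
        return int(body, 16), close + 1
    # Default mode without braces: zero to two hex digits are read.
    n = 0
    while n < 2 and n < len(rest) and rest[n] in HEX_DIGITS:
        n += 1
    return (int(rest[:n], 16) if n else 0), n


print(parse_x("dc"), parse_x("{dc}"))   # same code point either way
print(parse_x("g"))                     # no hex digits: code point 0
print(parse_x("g", alt_bsux=True))      # ALT_BSUX: a literal "x"
```
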
   Constraints on character values

       Characters that are specified using octal or hexadecimal numbers are
       limited to certain values, as follows:

         8-bit non-UTF mode    less than 0x100
         8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
         16-bit non-UTF mode   less than 0x10000
         16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
         32-bit non-UTF mode   less than 0x100000000
         32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint

       Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the
       so-called "surrogate" codepoints), and 0xffef.

   Escape sequences in character classes

       All the sequences that define a single character value can be used
       both inside and outside character classes. In addition, inside a
       character class, \b is interpreted as the backspace character (hex
       08).

       \N is not allowed in a character class. \B, \R, and \X are not special
       inside a character class. Like other unrecognized alphabetic escape
       sequences, they cause an error. Outside a character class, these
       sequences have different meanings.

   Unsupported escape sequences

       In Perl, the sequences \l, \L, \u, and \U are recognized by its string
       handler and used to modify the case of following characters. By
       default, PCRE2 does not support these escape sequences. However, if
       the PCRE2_ALT_BSUX option is set, \U matches a "U" character, and \u
       can be used to define a character by code point, as described in the
       previous section.

   Absolute and relative back references

       The sequence \g followed by a signed or unsigned number, optionally
       enclosed in braces, is an absolute or relative back reference. A named
       back reference can be coded as \g{name}. Back references are discussed
       later, following the discussion of parenthesized subpatterns.

   Absolute and relative subroutine calls

       For compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number enclosed either in angle brackets or single quotes,
       is an alternative syntax for referencing a subpattern as a
       "subroutine". Details are discussed later. Note that \g{...} (Perl
       syntax) and \g<...> (Oniguruma syntax) are not synonymous. The former
       is a back reference; the latter is a subroutine call.

   Generic character types

       Another use of backslash is for specifying generic character types:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \h     any horizontal white space character
         \H     any character that is not a horizontal white space character
         \s     any white space character
         \S     any character that is not a white space character
         \v     any vertical white space character
         \V     any character that is not a vertical white space character
         \w     any "word" character
         \W     any "non-word" character

       There is also the single sequence \N, which matches a non-newline
       character. This is the same as the "." metacharacter when PCRE2_DOTALL
       is not set. Perl also uses \N to match characters by name; PCRE2 does
       not support this.

       Each pair of lower and upper case escape sequences partitions the
       complete set of characters into two disjoint sets. Any given character
       matches one, and only one, of each pair. The sequences can appear both
       inside and outside character classes. They each match one character of
       the appropriate type. If the current matching point is at the end of
       the subject string, all of them fail, because there is no character to
       match.

       The default \s characters are HT (9), LF (10), VT (11), FF (12), CR
       (13), and space (32), which are defined as white space in the "C"
       locale. This list may vary if locale-specific matching is taking
       place. For example, in some locales the "non-breaking space" character
       (\xA0) is recognized as white space, and in others the VT character is
       not.

       A "word" character is an underscore or any character that is a letter
       or digit. By default, the definition of letters and digits is
       controlled by PCRE2's low-valued character tables, and may vary if
       locale-specific matching is taking place (see "Locale support" in the
       pcre2api page). For example, in a French locale such as "fr_FR" in
       Unix-like systems, or "french" in Windows, some character codes
       greater than 127 are used for accented letters, and these are then
       matched by \w. The use of locales with Unicode is discouraged.

       By default, characters whose code points are greater than 127 never
       match \d, \s, or \w, and always match \D, \S, and \W, although this
       may be different for characters in the range 128-255 when
       locale-specific matching is happening. These escape sequences retain
       their original meanings from before Unicode support was available,
       mainly for efficiency reasons. If the PCRE2_UCP option is set, the
       behaviour is changed so that Unicode properties are used to determine
       character types, as follows:

         \d  any character that matches \p{Nd} (decimal digit)
         \s  any character that matches \p{Z} or \h or \v
         \w  any character that matches \p{L} or \p{N}, plus underscore

       The upper case escapes match the inverse sets of characters. Note that
       \d matches only decimal digits, whereas \w matches any Unicode digit,
       as well as any Unicode letter, and underscore. Note also that
       PCRE2_UCP affects \b and \B, because they are defined in terms of \w
       and \W. Matching these sequences is noticeably slower when PCRE2_UCP
       is set.

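       The difference between \d and \w under PCRE2_UCP can be seen with
       Python's unicodedata module, which exposes the same general category
       values. This is only an illustration of the property definitions
       above; the function names are hypothetical, and the Unicode version
       of Python's tables may differ from the one a given PCRE2 build uses.

```python
import unicodedata


def ucp_digit(ch):
    """\\d under PCRE2_UCP: general category Nd (decimal digit)."""
    return unicodedata.category(ch) == "Nd"


def ucp_word(ch):
    """\\w under PCRE2_UCP: any L* or N* category, plus underscore."""
    return ch == "_" or unicodedata.category(ch)[0] in "LN"


# U+0663 is an Arabic-Indic digit (Nd); U+00B2 is superscript two (No).
for ch in ("7", "\u0663", "\u00b2", "\u00e9", "_", "-"):
    print(f"U+{ord(ch):04X}", ucp_digit(ch), ucp_word(ch))
```

       Superscript two shows the asymmetry described above: it is a Unicode
       digit for the purposes of \w but not a decimal digit for \d.
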
       The sequences \h, \H, \v, and \V, in contrast to the other sequences,
       which match only ASCII characters by default, always match a specific
       list of code points, whether or not PCRE2_UCP is set. The horizontal
       space characters are:

         U+0009     Horizontal tab (HT)
         U+0020     Space
         U+00A0     Non-break space
         U+1680     Ogham space mark
         U+180E     Mongolian vowel separator
         U+2000     En quad
         U+2001     Em quad
         U+2002     En space
         U+2003     Em space
         U+2004     Three-per-em space
         U+2005     Four-per-em space
         U+2006     Six-per-em space
         U+2007     Figure space
         U+2008     Punctuation space
         U+2009     Thin space
         U+200A     Hair space
         U+202F     Narrow no-break space
         U+205F     Medium mathematical space
         U+3000     Ideographic space

       The vertical space characters are:

         U+000A     Linefeed (LF)
         U+000B     Vertical tab (VT)
         U+000C     Form feed (FF)
         U+000D     Carriage return (CR)
         U+0085     Next line (NEL)
         U+2028     Line separator
         U+2029     Paragraph separator

       In 8-bit, non-UTF-8 mode, only the characters with code points less
       than 256 are relevant.

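       Because \h and \v are fixed lists, they can be expressed as plain
       sets. The Python sketch below transcribes the two lists above; the
       names HSPACE, VSPACE, is_hspace() and is_vspace() are hypothetical
       and nothing here calls PCRE2.

```python
# Code points transcribed from the two lists above.
HSPACE = {
    0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
    *range(0x2000, 0x200B),  # U+2000 (en quad) through U+200A (hair space)
    0x202F, 0x205F, 0x3000,
}
VSPACE = {0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029}


def is_hspace(ch):
    """True if ch would match \\h; \\H is simply the negation."""
    return ord(ch) in HSPACE


def is_vspace(ch):
    """True if ch would match \\v; \\V is simply the negation."""
    return ord(ch) in VSPACE


# In 8-bit non-UTF-8 mode only the members below 256 can ever occur.
print(sorted(hex(c) for c in HSPACE if c < 256))
print(sorted(hex(c) for c in VSPACE if c < 256))
```
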
   Newline sequences

       Outside a character class, by default, the escape sequence \R matches
       any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
       to the following:

         (?>\r\n|\n|\x0b|\f|\r|\x85)

       This is an example of an "atomic group", details of which are given
       below. This particular group matches either the two-character sequence
       CR followed by LF, or one of the single characters LF (linefeed,
       U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR
       (carriage return, U+000D), or NEL (next line, U+0085). Because this is
       an atomic group, the two-character sequence is treated as a single
       unit that cannot be split.

       In other modes, two additional characters whose codepoints are greater
       than 255 are added: LS (line separator, U+2028) and PS (paragraph
       separator, U+2029). Unicode support is not needed for these characters
       to be recognized.

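       The default meaning of \R can be approximated in Python's re module,
       which has no \R escape. The sketch below is an assumption-laden
       stand-in: it uses an ordinary non-capturing group rather than an
       atomic one, and the names BSR_8BIT, BSR_WIDE and newline_sequences()
       are hypothetical.

```python
import re

# Non-atomic approximations of the group shown above; the second adds
# LS and PS for modes that have code points greater than 255.
BSR_8BIT = r"(?:\r\n|\n|\x0b|\f|\r|\x85)"
BSR_WIDE = r"(?:\r\n|\n|\x0b|\f|\r|\x85|\u2028|\u2029)"


def newline_sequences(subject, wide=True):
    """List every newline sequence found in subject, left to right."""
    return re.findall(BSR_WIDE if wide else BSR_8BIT, subject)


print(newline_sequences("a\r\nb\nc\x85d\u2028e"))
print(newline_sequences("a\u2028b", wide=False))
```

       The missing atomicity matters only when something follows the group:
       with this approximation, BSR_8BIT followed by \n can match "\r\n" by
       backtracking into the lone-CR alternative, which the atomic group in
       PCRE2 does not allow.
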
       It is possible to restrict \R to match only CR, LF, or CRLF (instead
       of the complete set of Unicode line endings) by setting the option
       PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for
       "backslash R".) This can be made the default when PCRE2 is built; if
       this is the case, the other behaviour can be requested via the
       PCRE2_BSR_UNICODE option. It is also possible to specify these
       settings by starting a pattern string with one of the following
       sequences:

         (*BSR_ANYCRLF)   CR, LF, or CRLF only
         (*BSR_UNICODE)   any Unicode newline sequence

       These override the default and the options given to the compiling
       function. Note that these special settings, which are not
       Perl-compatible, are recognized only at the very start of a pattern,
       and that they must be in upper case. If more than one of them is
       present, the last one is used. They can be combined with a change of
       newline convention; for example, a pattern can start with:

         (*ANY)(*BSR_ANYCRLF)

       They can also be combined with the (*UTF) or (*UCP) special sequences.
       Inside a character class, \R is treated as an unrecognized escape
       sequence, and causes an error.

   Unicode character properties

       When PCRE2 is built with Unicode support (the default), three
       additional escape sequences that match characters with specific
       properties are available. In 8-bit non-UTF-8 mode, these sequences are
       of course limited to testing characters whose codepoints are less than
       256, but they do work in this mode. The extra escape sequences are:

         \p{xx}   a character with the xx property
         \P{xx}   a character without the xx property
         \X       a Unicode extended grapheme cluster

       The property names represented by xx above are limited to the Unicode
       script names, the general category properties, "Any", which matches
       any character (including newline), and some special PCRE2 properties
       (described in the next section). Other Perl properties such as
       "InMusicalSymbols" are not supported by PCRE2. Note that \P{Any} does
       not
611       limited to testing characters whose codepoints are less than  256,  but
612       they do work in this mode.  The extra escape sequences are:
613
614         \p{xx}   a character with the xx property
615         \P{xx}   a character without the xx property
616         \X       a Unicode extended grapheme cluster
617
618       The  property  names represented by xx above are limited to the Unicode
619       script names, the general category properties, "Any", which matches any
620       character  (including  newline),  and  some  special  PCRE2  properties
621       (described in the next section).  Other Perl properties such as  "InMu‐
622       sicalSymbols"  are  not supported by PCRE2.  Note that \P{Any} does not
623       match any characters, so always causes a match failure.
624
625       Sets of Unicode characters are defined as belonging to certain scripts.
626       A  character from one of these sets can be matched using a script name.
627       For example:
628
629         \p{Greek}
630         \P{Han}
631
632       Those that are not part of an identified script are lumped together  as
633       "Common". The current list of scripts is:
634
635       Ahom,   Anatolian_Hieroglyphs,  Arabic,  Armenian,  Avestan,  Balinese,
636       Bamum, Bassa_Vah, Batak, Bengali, Bopomofo, Brahmi, Braille,  Buginese,
637       Buhid,  Canadian_Aboriginal,  Carian, Caucasian_Albanian, Chakma, Cham,
638       Cherokee,  Common,  Coptic,  Cuneiform,  Cypriot,  Cyrillic,   Deseret,
639       Devanagari,  Duployan,  Egyptian_Hieroglyphs,  Elbasan, Ethiopic, Geor‐
640       gian, Glagolitic, Gothic,  Grantha,  Greek,  Gujarati,  Gurmukhi,  Han,
641       Hangul, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
642       Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese,  Kaithi,  Kan‐
643       nada,  Katakana,  Kayah_Li,  Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
644       Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian,  Lydian,  Maha‐
645       jani,  Malayalam,  Mandaic,  Manichaean,  Meetei_Mayek,  Mende_Kikakui,
646       Meroitic_Cursive, Meroitic_Hieroglyphs,  Miao,  Modi,  Mongolian,  Mro,
647       Multani,   Myanmar,   Nabataean,  New_Tai_Lue,  Nko,  Ogham,  Ol_Chiki,
648       Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic,  Old_Persian,
649       Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene,
650       Pau_Cin_Hau,  Phags_Pa,  Phoenician,  Psalter_Pahlavi,  Rejang,  Runic,
651       Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting, Sinhala,
652       Sora_Sompeng,  Sundanese,  Syloti_Nagri,  Syriac,  Tagalog,   Tagbanwa,
653       Tai_Le,   Tai_Tham,  Tai_Viet,  Takri,  Tamil,  Telugu,  Thaana,  Thai,
654       Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi.
655
656       Each character has exactly one Unicode general category property, spec‐
657       ified  by a two-letter abbreviation. For compatibility with Perl, nega‐
658       tion can be specified by including a  circumflex  between  the  opening
659       brace  and  the  property  name.  For  example,  \p{^Lu} is the same as
660       \P{Lu}.
661
662       If only one letter is specified with \p or \P, it includes all the gen‐
663       eral  category properties that start with that letter. In this case, in
664       the absence of negation, the curly brackets in the escape sequence  are
665       optional; these two examples have the same effect:
666
667         \p{L}
668         \pL
669
670       The following general category property codes are supported:
671
672         C     Other
673         Cc    Control
674         Cf    Format
675         Cn    Unassigned
676         Co    Private use
677         Cs    Surrogate
678
679         L     Letter
680         Ll    Lower case letter
681         Lm    Modifier letter
682         Lo    Other letter
683         Lt    Title case letter
684         Lu    Upper case letter
685
686         M     Mark
687         Mc    Spacing mark
688         Me    Enclosing mark
689         Mn    Non-spacing mark
690
691         N     Number
692         Nd    Decimal number
693         Nl    Letter number
694         No    Other number
695
696         P     Punctuation
697         Pc    Connector punctuation
698         Pd    Dash punctuation
699         Pe    Close punctuation
700         Pf    Final punctuation
701         Pi    Initial punctuation
702         Po    Other punctuation
703         Ps    Open punctuation
704
705         S     Symbol
706         Sc    Currency symbol
707         Sk    Modifier symbol
708         Sm    Mathematical symbol
709         So    Other symbol
710
711         Z     Separator
712         Zl    Line separator
713         Zp    Paragraph separator
714         Zs    Space separator
715
716       The  special property L& is also supported: it matches a character that
717       has the Lu, Ll, or Lt property, in other words, a letter  that  is  not
718       classified as a modifier or "other".
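
       The way one-letter codes, two-letter codes, negation, and L& relate
       to each other can be modelled outside PCRE2. The following sketch
       uses Python's unicodedata module; has_property is a hypothetical
       helper, not a PCRE2 function, and Python's Unicode tables may come
       from a different Unicode version than those built into PCRE2.

```python
import unicodedata


def has_property(ch, prop):
    """Return True if ch has the general category property named by prop.

    prop is a one- or two-letter code such as "L" or "Lu", the special
    name "L&", or any of these preceded by "^" for negation.
    """
    if prop.startswith("^"):
        return not has_property(ch, prop[1:])
    category = unicodedata.category(ch)  # always a two-letter code
    if prop == "L&":
        return category in ("Lu", "Ll", "Lt")
    if len(prop) == 1:
        # A single letter covers every category that starts with it.
        return category.startswith(prop)
    return category == prop
```

       For instance, has_property("a", "^Lu") gives the same answer as
       negating has_property("a", "Lu"), mirroring the equivalence of
       \p{^Lu} and \P{Lu}.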
719
720       The  Cs  (Surrogate)  property  applies only to characters in the range
721       U+D800 to U+DFFF. Such characters are not valid in Unicode strings  and
722       so  cannot  be  tested  by PCRE2, unless UTF validity checking has been
723       turned off (see the discussion of PCRE2_NO_UTF_CHECK  in  the  pcre2api
724       page). Perl does not support the Cs property.
725
726       The  long  synonyms  for  property  names  that  Perl supports (such as
727       \p{Letter}) are not supported by PCRE2, nor is it permitted  to  prefix
728       any of these properties with "Is".
729
730       No character that is in the Unicode table has the Cn (unassigned) prop‐
731       erty.  Instead, this property is assumed for any code point that is not
732       in the Unicode table.
733
734       Specifying  caseless  matching  does not affect these escape sequences.
735       For example, \p{Lu} always matches only upper  case  letters.  This  is
736       different from the behaviour of current versions of Perl.
737
738       Matching  characters by Unicode property is not fast, because PCRE2 has
739       to do a multistage table lookup in order to find  a  character's  prop‐
740       erty. That is why the traditional escape sequences such as \d and \w do
741       not use Unicode properties in PCRE2 by default,  though  you  can  make
742       them  do  so by setting the PCRE2_UCP option or by starting the pattern
743       with (*UCP).
744
745   Extended grapheme clusters
746
747       The \X escape matches any number of Unicode  characters  that  form  an
748       "extended grapheme cluster", and treats the sequence as an atomic group
749       (see below).  Unicode supports various kinds of composite character  by
750       giving  each  character  a grapheme breaking property, and having rules
751       that use these properties to define the boundaries of extended grapheme
752       clusters.  \X  always  matches  at least one character. Then it decides
753       whether to add additional characters according to the  following  rules
754       for ending a cluster:
755
756       1. End at the end of the subject string.
757
758       2.  Do not end between CR and LF; otherwise end after any control char‐
759       acter.
760
761       3. Do not break Hangul (a Korean  script)  syllable  sequences.  Hangul
762       characters  are of five types: L, V, T, LV, and LVT. An L character may
763       be followed by an L, V, LV, or LVT character; an LV or V character  may
764       be followed by a V or T character; an LVT or T character may be followed
765       only by a T character.
766
767       4. Do not end before extending characters or spacing marks.  Characters
768       with  the  "mark"  property  always have the "extend" grapheme breaking
769       property.
770
771       5. Do not end after prepend characters.
772
773       6. Otherwise, end the cluster.
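
       A deliberately incomplete sketch of these rules, written in Python
       rather than taken from PCRE2, may help to show how a cluster is
       extended. It covers only rules 1, 2, 4, and 6; the Hangul and prepend
       rules are omitted. It also assumes that "control character" means
       general category Cc and that "extending characters or spacing marks"
       means any character in category M. The name simple_clusters is
       invented for this example.

```python
import unicodedata


def simple_clusters(text):
    """Split text into clusters using a simplified subset of the rules."""
    clusters = []
    i = 0
    while i < len(text):
        j = i + 1  # a cluster always contains at least one character
        if text[i] == "\r" and j < len(text) and text[j] == "\n":
            j += 1  # rule 2: do not end between CR and LF
        elif unicodedata.category(text[i]) != "Cc":
            # Rule 4 (simplified): do not end before a following mark.
            while j < len(text) and unicodedata.category(text[j]).startswith("M"):
                j += 1
        # Rule 2 (second half): a control character ends its cluster at once.
        clusters.append(text[i:j])
        i = j
    return clusters
```

       Under these assumptions, a base letter followed by a combining accent
       forms one cluster, and so does a CRLF pair.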
774
775   PCRE2's additional properties
776
777       As well as the standard Unicode properties described above, PCRE2  sup‐
778       ports  four  more  that  make it possible to convert traditional escape
779       sequences such as \w and \s to use Unicode properties. PCRE2 uses these
780       non-standard,  non-Perl  properties  internally  when PCRE2_UCP is set.
781       However, they may also be used explicitly. These properties are:
782
783         Xan   Any alphanumeric character
784         Xps   Any POSIX space character
785         Xsp   Any Perl space character
786         Xwd   Any Perl "word" character
787
788       Xan matches characters that have either the L (letter) or the  N  (num‐
789       ber)  property. Xps matches the characters tab, linefeed, vertical tab,
790       form feed, or carriage return, and any other character that has  the  Z
791       (separator)  property.   Xsp  is  the  same as Xps; in PCRE1 it used to
792       exclude vertical tab, for Perl compatibility,  but  Perl  changed.  Xwd
793       matches the same characters as Xan, plus underscore.
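
       These definitions translate directly into tests on the general
       category. The functions below are a Python sketch of the descriptions
       in the preceding paragraph, not PCRE2 code; their names are invented,
       and Python's Unicode tables may differ in version from PCRE2's.

```python
import unicodedata


def is_xan(ch):
    """Xan: the L (letter) or N (number) property."""
    return unicodedata.category(ch)[0] in "LN"


def is_xps(ch):
    """Xps: tab, linefeed, vertical tab, form feed, carriage return,
    or any character with the Z (separator) property."""
    return ch in "\t\n\x0b\f\r" or unicodedata.category(ch)[0] == "Z"


# Xsp is described above as the same as Xps.
is_xsp = is_xps


def is_xwd(ch):
    """Xwd: the same characters as Xan, plus underscore."""
    return is_xan(ch) or ch == "_"
```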
794
795       There  is another non-standard property, Xuc, which matches any charac‐
796       ter that can be represented by a Universal Character Name  in  C++  and
797       other  programming  languages.  These are the characters $, @, ` (grave
798       accent), and all characters with Unicode code points  greater  than  or
799       equal  to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
800       most base (ASCII) characters are excluded. (Universal  Character  Names
801       are  of  the  form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
802       Note that the Xuc property does not match these sequences but the char‐
803       acters that they represent.)
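
       The membership rule for Xuc is simple enough to write out. This
       Python sketch follows the description above; is_xuc is a hypothetical
       name, not part of PCRE2.

```python
def is_xuc(ch):
    """Return True if ch is in the set described for the Xuc property."""
    if ch in "$@`":
        return True
    code_point = ord(ch)
    # U+00A0 and above, except for the surrogates U+D800 to U+DFFF.
    return code_point >= 0xA0 and not (0xD800 <= code_point <= 0xDFFF)
```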
804
805   Resetting the match start
806
807       The  escape sequence \K causes any previously matched characters not to
808       be included in the final matched sequence. For example, the pattern:
809
810         foo\Kbar
811
812       matches "foobar", but reports that it has matched "bar".  This  feature
813       is  similar  to  a lookbehind assertion (described below).  However, in
814       this case, the part of the subject before the real match does not  have
815       to  be of fixed length, as lookbehind assertions do. The use of \K does
816       not interfere with the setting of captured  substrings.   For  example,
817       when the pattern
818
819         (foo)\Kbar
820
821       matches "foobar", the first substring is still set to "foo".
822
823       Perl  documents  that  the  use  of  \K  within assertions is "not well
824       defined". In PCRE2, \K is acted upon when  it  occurs  inside  positive
825       assertions,  but  is  ignored  in negative assertions. Note that when a
826       pattern such as (?=ab\K) matches, the reported start of the  match  can
827       be greater than the end of the match.
828
829   Simple assertions
830
831       The  final use of backslash is for certain simple assertions. An asser‐
832       tion specifies a condition that has to be met at a particular point  in
833       a  match, without consuming any characters from the subject string. The
834       use of subpatterns for more complicated assertions is described  below.
835       The backslashed assertions are:
836
837         \b     matches at a word boundary
838         \B     matches when not at a word boundary
839         \A     matches at the start of the subject
840         \Z     matches at the end of the subject
841                 also matches before a newline at the end of the subject
842         \z     matches only at the end of the subject
843         \G     matches at the first matching position in the subject
844
845       Inside  a  character  class, \b has a different meaning; it matches the
846       backspace character. If any other of  these  assertions  appears  in  a
847       character class, an "invalid escape sequence" error is generated.
848
849       A  word  boundary is a position in the subject string where the current
850       character and the previous character do not both match \w or  \W  (i.e.
851       one  matches  \w  and the other matches \W), or the start or end of the
852       string if the first or last character matches \w,  respectively.  In  a
853       UTF  mode,  the  meanings  of  \w  and \W can be changed by setting the
854       PCRE2_UCP option. When this is done, it also affects \b and \B. Neither
855       PCRE2  nor Perl has a separate "start of word" or "end of word" metase‐
856       quence. However, whatever follows \b normally determines which  it  is.
857       For example, the fragment \ba matches "a" at the start of a word.
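
       The definition of a word boundary can be restated as a comparison of
       the characters on either side of a position. The Python sketch below
       assumes the default ASCII meaning of \w (letters, digits, and
       underscore, that is, without PCRE2_UCP); both function names are
       invented for illustration.

```python
def is_word_char(ch):
    """ASCII-only model of \\w: a letter, a digit, or underscore."""
    return ch.isascii() and (ch.isalnum() or ch == "_")


def is_word_boundary(subject, pos):
    """Return True if \\b would be true at offset pos in subject."""
    # Positions outside the string count as "not a word character", which
    # gives a boundary at the start or end when the first or last
    # character is a word character.
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after
```

       In the subject "a cat" this reports boundaries at offsets 0, 1, 2,
       and 5.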
858
859       The  \A,  \Z,  and \z assertions differ from the traditional circumflex
860       and dollar (described in the next section) in that they only ever match
861       at  the  very start and end of the subject string, whatever options are
862       set. Thus, they are independent of multiline mode. These  three  asser‐
863       tions  are  not  affected  by the PCRE2_NOTBOL or PCRE2_NOTEOL options,
864       which affect only the behaviour of the circumflex and dollar  metachar‐
865       acters.  However,  if the startoffset argument of pcre2_match() is non-
866       zero, indicating that matching is to start at a point  other  than  the
867       beginning  of  the subject, \A can never match.  The difference between
868       \Z and \z is that \Z matches before a newline at the end of the  string
869       as well as at the very end, whereas \z matches only at the end.
870
871       The  \G assertion is true only when the current matching position is at
872       the start point of the match, as specified by the startoffset  argument
873       of  pcre2_match().  It differs from \A when the value of startoffset is
874       non-zero. By calling  pcre2_match()  multiple  times  with  appropriate
875       arguments,  you  can  mimic Perl's /g option, and it is in this kind of
876       implementation where \G can be useful.
877
878       Note, however, that PCRE2's interpretation of \G, as the start  of  the
879       current match, is subtly different from Perl's, which defines it as the
880       end of the previous match. In Perl, these can  be  different  when  the
881       previously  matched string was empty. Because PCRE2 does just one match
882       at a time, it cannot reproduce this behaviour.
883
884       If all the alternatives of a pattern begin with \G, the  expression  is
885       anchored to the starting match position, and the "anchored" flag is set
886       in the compiled regular expression.
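
       The kind of loop in which \G is useful can be sketched by analogy.
       Python's re module has no \G, but its Pattern.match(string, pos)
       method attempts a match only at pos, which plays a similar role to a
       pattern that begins with \G when startoffset is set to pos. This is
       an analogy rather than PCRE2 code; TOKEN and contiguous_numbers are
       invented names.

```python
import re

# One number, optionally followed by a comma.
TOKEN = re.compile(r"(\d+),?")


def contiguous_numbers(subject, start=0):
    """Collect numbers that follow one another directly from start."""
    found = []
    pos = start
    while pos < len(subject):
        match = TOKEN.match(subject, pos)  # attempt only at pos, like \G
        if match is None:
            break  # the chain of adjacent matches is broken
        found.append(match.group(1))
        pos = match.end()  # the next attempt starts where this one ended
    return found
```

       Because every attempt is tied to the current offset, the scan stops
       at the first character that does not continue the sequence instead of
       skipping ahead to a later match.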
887

CIRCUMFLEX AND DOLLAR

889
890       The circumflex and dollar  metacharacters  are  zero-width  assertions.
891       That  is,  they test for a particular condition being true without con‐
892       suming any characters from the subject string. These two metacharacters
893       are  concerned  with matching the starts and ends of lines. If the new‐
894       line convention is set so that only the two-character sequence CRLF  is
895       recognized  as  a newline, isolated CR and LF characters are treated as
896       ordinary data characters, and are not recognized as newlines.
897
898       Outside a character class, in the default matching mode, the circumflex
899       character  is  an  assertion  that is true only if the current matching
900       point is at the start of the subject string. If the  startoffset  argu‐
901       ment  of  pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum‐
902       flex can never match if the PCRE2_MULTILINE option is unset.  Inside  a
903       character  class,  circumflex  has  an  entirely different meaning (see
904       below).
905
906       Circumflex need not be the first character of the pattern if  a  number
907       of  alternatives are involved, but it should be the first thing in each
908       alternative in which it appears if the pattern is ever  to  match  that
909       branch.  If all possible alternatives start with a circumflex, that is,
910       if the pattern is constrained to match only at the start  of  the  sub‐
911       ject,  it  is  said  to be an "anchored" pattern. (There are also other
912       constructs that can cause a pattern to be anchored.)
913
914       The dollar character is an assertion that is true only if  the  current
915       matching  point  is  at  the  end of the subject string, or immediately
916       before a newline  at  the  end  of  the  string  (by  default),  unless
917       PCRE2_NOTEOL is set. Note, however, that it does not actually match the
918       newline. Dollar need not be the last character of the pattern if a num‐
919       ber of alternatives are involved, but it should be the last item in any
920       branch in which it appears. Dollar has no special meaning in a  charac‐
921       ter class.
922
923       The  meaning  of  dollar  can be changed so that it matches only at the
924       very end of the string, by setting the PCRE2_DOLLAR_ENDONLY  option  at
925       compile time. This does not affect the \Z assertion.
926
927       The meanings of the circumflex and dollar metacharacters are changed if
928       the PCRE2_MULTILINE option is set. When this  is  the  case,  a  dollar
929       character  matches before any newlines in the string, as well as at the
930       very end, and a circumflex matches immediately after internal  newlines
931       as  well as at the start of the subject string. It does not match after
932       a newline that ends the string, for compatibility with  Perl.  However,
933       this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.
934
935       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
936       (where \n represents a newline) in multiline mode, but  not  otherwise.
937       Consequently,  patterns  that  are anchored in single line mode because
938       all branches start with ^ are not anchored in  multiline  mode,  and  a
939       match  for  circumflex  is  possible  when  the startoffset argument of
940       pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option  is  ignored
941       if PCRE2_MULTILINE is set.
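
       The /^abc$/ example can be reproduced in Python's re module, whose
       re.MULTILINE flag behaves like PCRE2_MULTILINE for this particular
       case. The two engines differ in other details, such as newline
       conventions and the treatment of a circumflex after a final newline,
       so this is an illustration and not a statement about PCRE2 itself.
       The helper name matches is invented.

```python
import re

SUBJECT = "def\nabc"


def matches(pattern, subject, multiline=False):
    """Return True if pattern matches anywhere in subject."""
    flags = re.MULTILINE if multiline else 0
    return re.search(pattern, subject, flags) is not None
```

       Here matches(r"^abc$", SUBJECT) is false, but it becomes true when
       multiline=True is passed, because the circumflex can then match
       immediately after the internal newline.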
942
943       When  the  newline  convention (see "Newline conventions" below) recog‐
944       nizes the two-character sequence CRLF as a newline, this is  preferred,
945       even  if  the  single  characters CR and LF are also recognized as new‐
946       lines. For example, if the newline convention  is  "any",  a  multiline
947       mode  circumflex matches before "xyz" in the string "abc\r\nxyz" rather
948       than after CR, even though CR on its own is a valid newline.  (It  also
949       matches at the very start of the string, of course.)
950
951       Note  that  the sequences \A, \Z, and \z can be used to match the start
952       and end of the subject in both modes, and if all branches of a  pattern
953       start  with \A it is always anchored, whether or not PCRE2_MULTILINE is
954       set.
955

FULL STOP (PERIOD, DOT) AND \N

957
958       Outside a character class, a dot in the pattern matches any one charac‐
959       ter  in  the subject string except (by default) a character that signi‐
960       fies the end of a line.
961
962       When a line ending is defined as a single character, dot never  matches
963       that  character; when the two-character sequence CRLF is used, dot does
964       not match CR if it is immediately followed  by  LF,  but  otherwise  it
965       matches  all characters (including isolated CRs and LFs). When any Uni‐
966       code line endings are being recognized, dot does not match CR or LF  or
967       any of the other line ending characters.
968
969       The  behaviour  of  dot  with regard to newlines can be changed. If the
970       PCRE2_DOTALL option is set, a dot matches any  one  character,  without
971       exception.   If  the two-character sequence CRLF is present in the sub‐
972       ject string, it takes two dots to match it.
973
974       The handling of dot is entirely independent of the handling of  circum‐
975       flex  and  dollar,  the  only relationship being that they both involve
976       newlines. Dot has no special meaning in a character class.
977
978       The escape sequence \N behaves like  a  dot,  except  that  it  is  not
979       affected  by  the  PCRE2_DOTALL  option. In other words, it matches any
980       character except one that signifies the end of a line. Perl  also  uses
981       \N to match characters by name; PCRE2 does not support this.
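
       The effect of a "dotall" setting can be seen in Python's re module,
       where re.DOTALL corresponds roughly to PCRE2_DOTALL. Python's dot
       excludes only LF by default, which resembles PCRE2 with a newline
       convention of LF; the other conventions described above have no
       Python counterpart. The helper dots_match is a hypothetical name.

```python
import re


def dots_match(count, subject, dotall=False):
    """Return True if a pattern of count dots matches all of subject."""
    flags = re.DOTALL if dotall else 0
    return re.fullmatch("." * count, subject, flags) is not None
```

       In this sketch one dot never matches the whole of "\r\n", even in
       dotall mode; two dots are needed.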
982

MATCHING A SINGLE CODE UNIT

984
985       Outside  a character class, the escape sequence \C matches any one code
986       unit, whether or not a UTF mode is set. In the 8-bit library, one  code
987       unit  is  one  byte;  in the 16-bit library it is a 16-bit unit; in the
988       32-bit library it is a 32-bit unit. Unlike a  dot,  \C  always  matches
989       line-ending  characters.  The  feature  is provided in Perl in order to
990       match individual bytes in UTF-8 mode, but it is unclear how it can use‐
991       fully be used.
992
993       Because  \C  breaks  up characters into individual code units, matching
994       one unit with \C in UTF-8 or UTF-16 mode means that  the  rest  of  the
995       string  may  start  with  a malformed UTF character. This has undefined
996       results, because PCRE2 assumes that it is matching character by charac‐
997       ter  in  a  valid UTF string (by default it checks the subject string's
998       validity at the  start  of  processing  unless  the  PCRE2_NO_UTF_CHECK
999       option is used).
1000
1001       An   application   can   lock   out  the  use  of  \C  by  setting  the
1002       PCRE2_NEVER_BACKSLASH_C option when compiling a  pattern.  It  is  also
1003       possible to build PCRE2 with the use of \C permanently disabled.
1004
1005       PCRE2  does  not allow \C to appear in lookbehind assertions (described
1006       below) in UTF-8 or UTF-16 modes, because this would make it  impossible
1007       to  calculate  the  length  of  the lookbehind. Neither the alternative
1008       matching function pcre2_dfa_match() nor the JIT optimizer supports \C in
1009       these UTF modes.  The former gives a match-time error; the latter fails
1010       to optimize and so the match is always run using the interpreter.
1011
1012       In the 32-bit library,  however,  \C  is  always  supported  (when  not
1013       explicitly  locked  out)  because it always matches a single code unit,
1014       whether or not UTF-32 is specified.
1015
1016       In general, the \C escape sequence is best avoided. However, one way of
1017       using  it  that avoids the problem of malformed UTF-8 or UTF-16 charac‐
1018       ters is to use a lookahead to check the length of the  next  character,
1019       as  in  this  pattern,  which could be used with a UTF-8 string (ignore
1020       white space and line breaks):
1021
1022         (?| (?=[\x00-\x7f])(\C) |
1023             (?=[\x80-\x{7ff}])(\C)(\C) |
1024             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
1025             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
1026
1027       In this example, a group that starts  with  (?|  resets  the  capturing
1028       parentheses numbers in each alternative (see "Duplicate Subpattern Num‐
1029       bers" below). The assertions at the start of each branch check the next
1030       UTF-8  character  for  values  whose encoding uses 1, 2, 3, or 4 bytes,
1031       respectively. The character's individual bytes are then captured by the
1032       appropriate number of \C groups.
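
       The code point ranges tested by the four lookaheads correspond to the
       UTF-8 encoded lengths. The Python sketch below states that mapping
       explicitly and checks it against Python's own UTF-8 encoder; the
       function names are invented, and surrogate code points (which cannot
       be encoded) are outside its scope.

```python
def utf8_length(code_point):
    """Number of UTF-8 code units (bytes) used to encode code_point."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


def utf8_code_units(ch):
    """Return the individual bytes of one character, as the groups would."""
    encoded = ch.encode("utf-8")
    # The range test and the real encoder must agree on the length.
    assert len(encoded) == utf8_length(ord(ch))
    return list(encoded)
```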
1033

SQUARE BRACKETS AND CHARACTER CLASSES

1035
1036       An opening square bracket introduces a character class, terminated by a
1037       closing square bracket. A closing square bracket on its own is not spe‐
1038       cial  by  default.  If a closing square bracket is required as a member
1039       of the class, it should be the first data character in the class (after
1040       an  initial  circumflex,  if present) or escaped with a backslash. This
1041       means that, by default, an empty class cannot be defined.  However,  if
1042       the  PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
1043       the start does end the (empty) class.
1044
1045       A character class matches a single character in the subject. A  matched
1046       character must be in the set of characters defined by the class, unless
1047       the first character in the class definition is a circumflex,  in  which
1048       case the subject character must not be in the set defined by the class.
1049       If a circumflex is actually required as a member of the  class,  ensure
1050       it is not the first character, or escape it with a backslash.
1051
1052       For  example, the character class [aeiou] matches any lower case vowel,
1053       while [^aeiou] matches any character that is not a  lower  case  vowel.
1054       Note that a circumflex is just a convenient notation for specifying the
1055       characters that are in the class by enumerating those that are  not.  A
1056       class  that starts with a circumflex is not an assertion; it still con‐
1057       sumes a character from the subject string, and therefore  it  fails  if
1058       the current pointer is at the end of the string.
1059
1060       When  caseless  matching  is set, any letters in a class represent both
1061       their upper case and lower case versions, so for  example,  a  caseless
1062       [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
1063       match "A", whereas a caseful version would.
1064
1065       Characters that might indicate line breaks are  never  treated  in  any
1066       special  way  when  matching  character  classes,  whatever line-ending
1067       sequence is in use,  and  whatever  setting  of  the  PCRE2_DOTALL  and
1068       PCRE2_MULTILINE  options  is  used. A class such as [^a] always matches
1069       one of these characters.
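
       The behaviour of positive and negated classes described in the last
       few paragraphs is common to most regular expression engines. It is
       shown here with Python's re module as a stand-in for PCRE2, with
       re.IGNORECASE taking the place of caseless matching; class_matches is
       an invented helper.

```python
import re


def class_matches(char_class, subject, caseless=False):
    """Return True if the class alone matches the whole of subject."""
    flags = re.IGNORECASE if caseless else 0
    return re.fullmatch(char_class, subject, flags) is not None
```

       For example, class_matches("[^aeiou]", "") is false because a negated
       class still has to consume a character, and class_matches("[^a]",
       "\n") is true because line-break characters get no special treatment
       inside a class.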
1070
1071       The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,  \V,
1072       \w, and \W may appear in a character class, and add the characters that
1073       they match to the class. For example, [\dABCDEF] matches any  hexadeci‐
1074       mal  digit.  In UTF modes, the PCRE2_UCP option affects the meanings of
1075       \d, \s, \w and their upper case partners, just as  it  does  when  they
1076       appear  outside a character class, as described in the section entitled
1077       "Generic character types" above. The escape sequence \b has a different
1078       meaning  inside  a character class; it matches the backspace character.
1079       The sequences \B, \N, \R, and \X are not  special  inside  a  character
1080       class.  Like  any  other  unrecognized  escape sequences, they cause an
1081       error.
1082
1083       The minus (hyphen) character can be used to specify a range of  charac‐
1084       ters  in  a  character  class.  For  example,  [d-m] matches any letter
1085       between d and m, inclusive. If a  minus  character  is  required  in  a
1086       class,  it  must  be  escaped  with a backslash or appear in a position
1087       where it cannot be interpreted as indicating a range, typically as  the
1088       first or last character in the class, or immediately after a range. For
1089       example, [b-d-z] matches letters in the range b to d, a hyphen  charac‐
1090       ter, or z.
1091
1092       Perl treats a hyphen as a literal if it appears before or after a POSIX
1093       class (see below) or a character type escape such as \d, but gives a
1094       warning  in  its  warning mode, as this is most likely a user error. As
1095       PCRE2 has no facility for warning, an error is given in these cases.
1096
1097       It is not possible to have the literal character "]" as the end charac‐
1098       ter  of a range. A pattern such as [W-]46] is interpreted as a class of
1099       two characters ("W" and "-") followed by a literal string "46]", so  it
1100       would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
1101       backslash it is interpreted as the end of range, so [W-\]46] is  inter‐
1102       preted  as a class containing a range followed by two other characters.
1103       The octal or hexadecimal representation of "]" can also be used to  end
1104       a range.
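
       The hyphen and closing-bracket cases above can be checked
       mechanically. The table below restates them as (pattern, subject,
       expected) triples and evaluates them with Python's re module, which
       parses these particular classes in the same way; it is not a test of
       PCRE2 itself. EXAMPLES and check_examples are invented names.

```python
import re

EXAMPLES = [
    (r"[b-d-z]", "c", True),    # inside the range b to d
    (r"[b-d-z]", "-", True),    # hyphen straight after a range is literal
    (r"[b-d-z]", "z", True),
    (r"[b-d-z]", "e", False),
    (r"[W-]46]", "W46]", True),  # class of "W" and "-", then literal "46]"
    (r"[W-]46]", "-46]", True),
    (r"[W-]46]", "X", False),
    (r"[W-\]46]", "Z", True),   # escaped "]" ends the range W to ]
    (r"[W-\]46]", "4", True),
    (r"[W-\]46]", "-", False),
]


def check_examples():
    """Return True if every example behaves as the table says."""
    return all(
        (re.fullmatch(pattern, subject) is not None) == expected
        for pattern, subject, expected in EXAMPLES
    )
```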
1105
1106       Ranges normally include all code points between the start and end char‐
1107       acters, inclusive. They can also be  used  for  code  points  specified
1108       numerically, for example [\000-\037]. Ranges can include any characters
1109       that are valid for the current mode.
1110
1111       There is a special case in EBCDIC environments  for  ranges  whose  end
1112       points are both specified as literal letters in the same case. For com‐
1113       patibility with Perl, EBCDIC code points within the range that are  not
1114       letters  are  omitted. For example, [h-k] matches only four characters,
1115       even though the codes for h and k are 0x88 and 0x92, a range of 11 code
1116       points.  However,  if  the range is specified numerically, for example,
1117       [\x88-\x92] or [h-\x92], all code points are included.
1118
1119       If a range that includes letters is used when caseless matching is set,
1120       it matches the letters in either case. For example, [W-c] is equivalent
1121       to [][\\^_`wxyzabc], matched caselessly, and  in  a  non-UTF  mode,  if
1122       character  tables  for  a French locale are in use, [\xc8-\xcb] matches
1123       accented E characters in both cases.
1124
1125       A circumflex can conveniently be used with  the  upper  case  character
1126       types  to specify a more restricted set of characters than the matching
1127       lower case type.  For example, the class [^\W_] matches any  letter  or
1128       digit, but not underscore, whereas [\w] includes underscore. A positive
1129       character class should be read as "something OR something OR ..." and a
1130       negative class as "NOT something AND NOT something AND NOT ...".
1131
1132       The  only  metacharacters  that are recognized in character classes are
1133       backslash, hyphen (only where it can be  interpreted  as  specifying  a
1134       range),  circumflex  (only  at the start), opening square bracket (only
1135       when it can be interpreted as introducing a POSIX class name, or for  a
1136       special  compatibility  feature  -  see the next two sections), and the
1137       terminating  closing  square  bracket.  However,  escaping  other  non-
1138       alphanumeric characters does no harm.
1139

POSIX CHARACTER CLASSES

1141
1142       Perl supports the POSIX notation for character classes. This uses names
1143       enclosed by [: and :] within the enclosing square brackets. PCRE2  also
1144       supports this notation. For example,
1145
1146         [01[:alpha:]%]
1147
1148       matches "0", "1", any alphabetic character, or "%". The supported class
1149       names are:
1150
         alnum    letters and digits
         alpha    letters
         ascii    character codes 0 - 127
         blank    space or tab only
         cntrl    control characters
         digit    decimal digits (same as \d)
         graph    printing characters, excluding space
         lower    lower case letters
         print    printing characters, including space
         punct    printing characters, excluding letters and digits and space
         space    white space (the same as \s from PCRE1 8.34)
         upper    upper case letters
         word     "word" characters (same as \w)
         xdigit   hexadecimal digits
1165
1166       The default "space" characters are HT (9), LF (10), VT (11),  FF  (12),
1167       CR  (13),  and space (32). If locale-specific matching is taking place,
1168       the list of space characters may be different; there may  be  fewer  or
1169       more of them. "Space" and \s match the same set of characters.
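
       For readers who want to see what the class names cover by default,
       the sketch below spells them out as explicit ASCII ranges. It is
       written in Python, whose re module does not understand the [:name:]
       notation, so the table POSIX_ASCII and the helper expand_posix() are
       hypothetical names invented for this illustration. The ranges assume
       the default tables (no locale-specific matching, no PCRE2_UCP), and
       negation with ^ is not handled.

```python
import re

# Hypothetical table: ASCII-only bodies for the POSIX class names, assuming
# the default (non-UCP, "C" locale) behaviour.
POSIX_ASCII = {
    "alnum": "A-Za-z0-9",
    "alpha": "A-Za-z",
    "ascii": r"\x00-\x7f",
    "blank": r" \t",
    "cntrl": r"\x00-\x1f\x7f",
    "digit": "0-9",
    "graph": r"\x21-\x7e",
    "lower": "a-z",
    "print": r"\x20-\x7e",
    "punct": r"\x21-\x2f\x3a-\x40\x5b-\x60\x7b-\x7e",
    "space": r"\x09-\x0d\x20",   # HT, LF, VT, FF, CR, and space
    "upper": "A-Z",
    "word": "A-Za-z0-9_",
    "xdigit": "0-9A-Fa-f",
}


def expand_posix(pattern):
    """Replace each [:name:] item with its ASCII body (no negation support)."""
    return re.sub(r"\[:([a-z]+):\]", lambda m: POSIX_ASCII[m.group(1)], pattern)


expanded = expand_posix("[01[:alpha:]%]")
matcher = re.compile(expanded)
print(expanded)   # [01A-Za-z%]
```

       With this expansion, matcher accepts "0", "1", any ASCII letter, or
       "%", which mirrors the example class at the start of this section.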
1170
1171       The  name  "word"  is  a Perl extension, and "blank" is a GNU extension
1172       from Perl 5.8. Another Perl extension is negation, which  is  indicated
1173       by a ^ character after the colon. For example,
1174
1175         [12[:^digit:]]
1176
1177       matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
1178       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
1179       these are not supported, and an error is given if they are encountered.
1180
1181       By default, characters with values greater than 127 do not match any of
1182       the POSIX character classes, although this may be different for charac‐
1183       ters  in  the range 128-255 when locale-specific matching is happening.
1184       However, if the PCRE2_UCP option is passed to pcre2_compile(), some  of
1185       the  classes are changed so that Unicode character properties are used.
1186       This  is  achieved  by  replacing  certain  POSIX  classes  with  other
1187       sequences, as follows:
1188
1189         [:alnum:]  becomes  \p{Xan}
1190         [:alpha:]  becomes  \p{L}
1191         [:blank:]  becomes  \h
1192         [:cntrl:]  becomes  \p{Cc}
1193         [:digit:]  becomes  \p{Nd}
1194         [:lower:]  becomes  \p{Ll}
1195         [:space:]  becomes  \p{Xps}
1196         [:upper:]  becomes  \p{Lu}
1197         [:word:]   becomes  \p{Xwd}
1198
       Negated versions, such as [:^alpha:], use \P instead of \p. Three other
       POSIX classes are handled specially in UCP mode:
1201
1202       [:graph:] This matches characters that have glyphs that mark  the  page
1203                 when printed. In Unicode property terms, it matches all char‐
1204                 acters with the L, M, N, P, S, or Cf properties, except for:
1205
1206                   U+061C           Arabic Letter Mark
1207                   U+180E           Mongolian Vowel Separator
1208                   U+2066 - U+2069  Various "isolate"s
1209
1210
1211       [:print:] This matches the same  characters  as  [:graph:]  plus  space
1212                 characters  that  are  not controls, that is, characters with
1213                 the Zs property.
1214
1215       [:punct:] This matches all characters that have the Unicode P (punctua‐
1216                 tion)  property,  plus those characters with code points less
1217                 than 256 that have the S (Symbol) property.
1218
1219       The other POSIX classes are unchanged, and match only  characters  with
1220       code points less than 256.
1221

COMPATIBILITY FEATURE FOR WORD BOUNDARIES

1223
1224       In  the POSIX.2 compliant library that was included in 4.4BSD Unix, the
1225       ugly syntax [[:<:]] and [[:>:]] is used for matching  "start  of  word"
1226       and "end of word". PCRE2 treats these items as follows:
1227
1228         [[:<:]]  is converted to  \b(?=\w)
1229         [[:>:]]  is converted to  \b(?<=\w)
1230
       Only these exact character sequences are recognized. A sequence such as
       [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
1233       support  is not compatible with Perl. It is provided to help migrations
1234       from other environments, and is best not used in any new patterns. Note
1235       that  \b matches at the start and the end of a word (see "Simple asser‐
1236       tions" above), and in a Perl-style pattern the preceding  or  following
1237       character  normally  shows  which  is  wanted, without the need for the
1238       assertions that are used above in order to give exactly the  POSIX  be‐
1239       haviour.
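
       The two replacement sequences can be tried directly. The following
       sketch uses Python's re module as a stand-in (an assumption; Python
       does not accept [[:<:]] or [[:>:]] themselves, but it does support
       \b and the two assertions). It lists the offsets at which each
       sequence matches in an arbitrary subject.

```python
import re

SUBJECT = "hi there"

# \b(?=\w): a word boundary that is followed by a word character,
# that is, the start of a word.
starts = [m.start() for m in re.finditer(r"\b(?=\w)", SUBJECT)]

# \b(?<=\w): a word boundary that is preceded by a word character,
# that is, the end of a word.
ends = [m.start() for m in re.finditer(r"\b(?<=\w)", SUBJECT)]

print(starts)   # [0, 3]
print(ends)     # [2, 8]
```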
1240

VERTICAL BAR

1242
1243       Vertical  bar characters are used to separate alternative patterns. For
1244       example, the pattern
1245
1246         gilbert|sullivan
1247
1248       matches either "gilbert" or "sullivan". Any number of alternatives  may
1249       appear,  and  an  empty  alternative  is  permitted (matching the empty
1250       string). The matching process tries each alternative in turn, from left
1251       to  right, and the first one that succeeds is used. If the alternatives
1252       are within a subpattern (defined below), "succeeds" means matching  the
1253       rest of the main pattern as well as the alternative in the subpattern.
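
       The left-to-right rule can be observed with a few small patterns.
       This sketch assumes Python's re module, which is a separate engine
       but follows the same Perl-style ordering of alternatives; the test
       strings are arbitrary.

```python
import re

# Either alternative can match.
names = [s for s in ("gilbert", "sullivan", "sondheim")
         if re.fullmatch(r"gilbert|sullivan", s)]

# Alternatives are tried left to right and the first success is used, so
# "a" is chosen here even though "ab" would also match.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes the rest of the pattern: after
# "a" is matched, "c" cannot follow, so the second alternative is used.
with_rest = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"cat(aract|)", "cat").group(1)

print(names)        # ['gilbert', 'sullivan']
print(first_wins)   # a
print(with_rest)    # ab
```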
1254

INTERNAL OPTION SETTING

1256
1257       The  settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
1258       PCRE2_EXTENDED options (which are Perl-compatible) can be changed  from
1259       within  the  pattern  by  a  sequence  of  Perl option letters enclosed
1260       between "(?" and ")".  The option letters are
1261
1262         i  for PCRE2_CASELESS
1263         m  for PCRE2_MULTILINE
1264         s  for PCRE2_DOTALL
1265         x  for PCRE2_EXTENDED
1266
1267       For example, (?im) sets caseless, multiline matching. It is also possi‐
1268       ble to unset these options by preceding the letter with a hyphen, and a
1269       combined setting and unsetting such as (?im-sx), which sets PCRE2_CASE‐
1270       LESS    and    PCRE2_MULTILINE   while   unsetting   PCRE2_DOTALL   and
1271       PCRE2_EXTENDED, is also permitted. If a letter appears both before  and
1272       after  the  hyphen, the option is unset. An empty options setting "(?)"
1273       is allowed. Needless to say, it has no effect.
1274
1275       The PCRE2-specific options PCRE2_DUPNAMES  and  PCRE2_UNGREEDY  can  be
1276       changed  in  the  same  way as the Perl-compatible options by using the
1277       characters J and U respectively.
1278
1279       When one of these option changes occurs at  top  level  (that  is,  not
1280       inside  subpattern parentheses), the change applies to the remainder of
1281       the pattern that follows. An option change  within  a  subpattern  (see
1282       below  for  a description of subpatterns) affects only that part of the
1283       subpattern that follows it, so
1284
1285         (a(?i)b)c
1286
1287       matches abc and aBc and no other strings  (assuming  PCRE2_CASELESS  is
1288       not  used).   By this means, options can be made to have different set‐
1289       tings in different parts of the pattern. Any changes made in one alter‐
1290       native do carry on into subsequent branches within the same subpattern.
1291       For example,
1292
1293         (a(?i)b|c)
1294
1295       matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
1296       first  branch  is  abandoned before the option setting. This is because
1297       the effects of option settings happen at compile time. There  would  be
1298       some very weird behaviour otherwise.
1299
1300       As  a  convenient shorthand, if any option settings are required at the
1301       start of a non-capturing subpattern (see the next section), the  option
1302       letters may appear between the "?" and the ":". Thus the two patterns
1303
1304         (?i:saturday|sunday)
1305         (?:(?i)saturday|sunday)
1306
1307       match exactly the same set of strings.
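
       A brief sketch of the shorthand form, assuming Python's re module
       as a stand-in for PCRE2. Recent Python versions reject an unscoped
       (?i) in the middle of a pattern, so only the scoped forms (?i:...)
       and (?-i:...) are shown; the test strings are arbitrary.

```python
import re

# (?i:...) applies caseless matching to both alternatives in the group.
scoped = re.compile(r"(?i:saturday|sunday)")

# Without the option letter, matching stays case-sensitive.
plain = re.compile(r"(?:saturday|sunday)")

# A hyphen unsets an option for part of a group: "a" is caseless, "b" is
# not, and "c" lies outside the group so it is case-sensitive as well.
mixed = re.compile(r"(?i:a(?-i:b))c")

print(bool(scoped.fullmatch("SUNDAY")))   # True
print(bool(plain.fullmatch("SUNDAY")))    # False
print(bool(mixed.fullmatch("Abc")))       # True
print(bool(mixed.fullmatch("aBc")))       # False
```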
1308
1309       Note:  There  are  other  PCRE2-specific options that can be set by the
1310       application when the compiling function is called. The pattern can con‐
1311       tain  special  leading  sequences  such as (*CRLF) to override what the
1312       application has set or what has been defaulted. Details  are  given  in
1313       the  section  entitled  "Newline  sequences"  above. There are also the
1314       (*UTF) and (*UCP) leading sequences that can be used  to  set  UTF  and
1315       Unicode  property  modes;  they are equivalent to setting the PCRE2_UTF
1316       and PCRE2_UCP options, respectively. However, the application  can  set
1317       the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options, which lock out the use
1318       of the (*UTF) and (*UCP) sequences.
1319

SUBPATTERNS

1321
1322       Subpatterns are delimited by parentheses (round brackets), which can be
1323       nested.  Turning part of a pattern into a subpattern does two things:
1324
1325       1. It localizes a set of alternatives. For example, the pattern
1326
1327         cat(aract|erpillar|)
1328
1329       matches  "cataract",  "caterpillar", or "cat". Without the parentheses,
1330       it would match "cataract", "erpillar" or an empty string.
1331
1332       2. It sets up the subpattern as  a  capturing  subpattern.  This  means
1333       that, when the whole pattern matches, the portion of the subject string
1334       that matched the subpattern is passed back to  the  caller,  separately
1335       from  the portion that matched the whole pattern. (This applies only to
1336       the traditional matching function; the DFA matching function  does  not
1337       support capturing.)
1338
1339       Opening parentheses are counted from left to right (starting from 1) to
1340       obtain numbers for the  capturing  subpatterns.  For  example,  if  the
1341       string "the red king" is matched against the pattern
1342
1343         the ((red|white) (king|queen))
1344
1345       the captured substrings are "red king", "red", and "king", and are num‐
1346       bered 1, 2, and 3, respectively.
1347
1348       The fact that plain parentheses fulfil  two  functions  is  not  always
1349       helpful.   There are often times when a grouping subpattern is required
1350       without a capturing requirement. If an opening parenthesis is  followed
1351       by  a question mark and a colon, the subpattern does not do any captur‐
1352       ing, and is not counted when computing the  number  of  any  subsequent
1353       capturing  subpatterns. For example, if the string "the white queen" is
1354       matched against the pattern
1355
1356         the ((?:red|white) (king|queen))
1357
1358       the captured substrings are "white queen" and "queen", and are numbered
1359       1 and 2. The maximum number of capturing subpatterns is 65535.
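
       The numbering of the two example patterns can be checked with a
       short script. It assumes Python's re module, which counts opening
       parentheses in the same way; it is an illustration rather than a
       PCRE2 program.

```python
import re

capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
non_capturing = re.match(r"the ((?:red|white) (king|queen))", "the white queen")

# groups() lists the captured substrings in order, starting from number 1.
print(capturing.groups())       # ('red king', 'red', 'king')
print(non_capturing.groups())   # ('white queen', 'queen')
```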
1360
1361       As  a  convenient shorthand, if any option settings are required at the
1362       start of a non-capturing subpattern,  the  option  letters  may  appear
1363       between the "?" and the ":". Thus the two patterns
1364
1365         (?i:saturday|sunday)
1366         (?:(?i)saturday|sunday)
1367
1368       match exactly the same set of strings. Because alternative branches are
1369       tried from left to right, and options are not reset until  the  end  of
1370       the  subpattern is reached, an option setting in one branch does affect
1371       subsequent branches, so the above patterns match "SUNDAY"  as  well  as
1372       "Saturday".
1373

DUPLICATE SUBPATTERN NUMBERS

1375
1376       Perl 5.10 introduced a feature whereby each alternative in a subpattern
1377       uses the same numbers for its capturing parentheses. Such a  subpattern
1378       starts  with (?| and is itself a non-capturing subpattern. For example,
1379       consider this pattern:
1380
1381         (?|(Sat)ur|(Sun))day
1382
1383       Because the two alternatives are inside a (?| group, both sets of  cap‐
1384       turing  parentheses  are  numbered one. Thus, when the pattern matches,
1385       you can look at captured substring number  one,  whichever  alternative
1386       matched.  This  construct  is useful when you want to capture part, but
1387       not all, of one of a number of alternatives. Inside a (?| group, paren‐
1388       theses  are  numbered as usual, but the number is reset at the start of
1389       each branch. The numbers of any capturing parentheses that  follow  the
1390       subpattern  start after the highest number used in any branch. The fol‐
1391       lowing example is taken from the Perl documentation. The numbers under‐
1392       neath show in which buffer the captured content will be stored.
1393
1394         # before  ---------------branch-reset----------- after
1395         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
1396         # 1            2         2  3        2     3     4
1397
1398       A  back  reference  to a numbered subpattern uses the most recent value
1399       that is set for that number by any subpattern.  The  following  pattern
1400       matches "abcabc" or "defdef":
1401
1402         /(?|(abc)|(def))\1/
1403
1404       In  contrast,  a subroutine call to a numbered subpattern always refers
1405       to the first one in the pattern with the given  number.  The  following
1406       pattern matches "abcabc" or "defabc":
1407
1408         /(?|(abc)|(def))(?1)/
1409
1410       A relative reference such as (?-1) is no different: it is just a conve‐
1411       nient way of computing an absolute group number.
1412
1413       If a condition test for a subpattern's having matched refers to a  non-
1414       unique  number, the test is true if any of the subpatterns of that num‐
1415       ber have matched.
1416
1417       An alternative approach to using this "branch reset" feature is to  use
1418       duplicate named subpatterns, as described in the next section.
1419

NAMED SUBPATTERNS

1421
1422       Identifying  capturing  parentheses  by number is simple, but it can be
1423       very hard to keep track of the numbers in complicated  regular  expres‐
1424       sions.  Furthermore,  if  an  expression  is  modified, the numbers may
1425       change. To help with this difficulty, PCRE2 supports the naming of sub‐
1426       patterns. This feature was not added to Perl until release 5.10. Python
1427       had the feature earlier, and PCRE1 introduced it at release 4.0,  using
1428       the  Python syntax. PCRE2 supports both the Perl and the Python syntax.
1429       Perl allows identically numbered subpatterns to have  different  names,
1430       but PCRE2 does not.
1431
1432       In  PCRE2, a subpattern can be named in one of three ways: (?<name>...)
1433       or (?'name'...) as in Perl, or (?P<name>...) as in  Python.  References
1434       to  capturing parentheses from other parts of the pattern, such as back
1435       references, recursion, and conditions, can be made by name as  well  as
1436       by number.
1437
1438       Names  consist of up to 32 alphanumeric characters and underscores, but
1439       must start with a non-digit.  Named  capturing  parentheses  are  still
1440       allocated  numbers  as  well as names, exactly as if the names were not
1441       present. The PCRE2 API provides function calls for extracting the name-
1442       to-number  translation  table  from  a compiled pattern. There are also
1443       convenience functions for extracting a captured substring by name.
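
       The relationship between names and numbers can be sketched using the
       Python syntax (?P<name>...), which PCRE2 also accepts. The code runs
       on Python's re module (an assumption, standing in for the PCRE2
       API); the date pattern and its group names are made up for the
       example.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.fullmatch("2024-05")

# A named group can be read by name ...
by_name = (m.group("year"), m.group("month"))

# ... and it still has an ordinary number, as if the name were absent.
by_number = (m.group(1), m.group(2))

# Python's counterpart of a name-to-number translation table.
name_table = dict(date.groupindex)

print(by_name)      # ('2024', '05')
print(name_table)   # {'year': 1, 'month': 2}
```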
1444
1445       By default, a name must be unique within a pattern, but it is  possible
1446       to  relax  this constraint by setting the PCRE2_DUPNAMES option at com‐
1447       pile time.  (Duplicate names are also always permitted for  subpatterns
1448       with  the  same  number,  set up as described in the previous section.)
1449       Duplicate names can be useful for patterns where only one  instance  of
1450       the named parentheses can match.  Suppose you want to match the name of
1451       a weekday, either as a 3-letter abbreviation or as the full  name,  and
1452       in  both  cases  you  want  to  extract  the abbreviation. This pattern
1453       (ignoring the line breaks) does the job:
1454
1455         (?<DN>Mon|Fri|Sun)(?:day)?|
1456         (?<DN>Tue)(?:sday)?|
1457         (?<DN>Wed)(?:nesday)?|
1458         (?<DN>Thu)(?:rsday)?|
1459         (?<DN>Sat)(?:urday)?
1460
1461       There are five capturing substrings, but only one is ever set  after  a
1462       match.  (An alternative way of solving this problem is to use a "branch
1463       reset" subpattern, as described in the previous section.)
1464
       The convenience functions for extracting the data by name return the
       substring for the first (and in this example, the only) subpattern of
       that name that matched. This saves searching to find which numbered
       subpattern it was.
1469
1470       If  you  make  a  back  reference to a non-unique named subpattern from
1471       elsewhere in the pattern, the subpatterns to which the name refers  are
1472       checked  in  the order in which they appear in the overall pattern. The
1473       first one that is set is used for the reference. For example, this pat‐
1474       tern matches both "foofoo" and "barbar" but not "foobar" or "barfoo":
1475
1476         (?:(?<n>foo)|(?<n>bar))\k<n>
1477
1478
1479       If you make a subroutine call to a non-unique named subpattern, the one
1480       that corresponds to the first occurrence of the name is  used.  In  the
1481       absence of duplicate numbers (see the previous section) this is the one
1482       with the lowest number.
1483
1484       If you use a named reference in a condition test (see the section about
1485       conditions below), either to check whether a subpattern has matched, or
1486       to check for recursion, all subpatterns with the same name are  tested.
1487       If  the condition is true for any one of them, the overall condition is
1488       true. This is the same behaviour as  testing  by  number.  For  further
1489       details  of  the  interfaces  for  handling  named subpatterns, see the
1490       pcre2api documentation.
1491
1492       Warning: You cannot use different names to distinguish between two sub‐
1493       patterns  with the same number because PCRE2 uses only the numbers when
1494       matching. For this reason, an error is given at compile time if differ‐
1495       ent  names  are given to subpatterns with the same number. However, you
1496       can always give the same name to subpatterns with the same number, even
1497       when PCRE2_DUPNAMES is not set.
1498

REPETITION

1500
1501       Repetition  is  specified  by  quantifiers, which can follow any of the
1502       following items:
1503
1504         a literal data character
1505         the dot metacharacter
1506         the \C escape sequence
1507         the \X escape sequence
1508         the \R escape sequence
1509         an escape such as \d or \pL that matches a single character
1510         a character class
1511         a back reference
1512         a parenthesized subpattern (including most assertions)
1513         a subroutine call to a subpattern (recursive or otherwise)
1514
1515       The general repetition quantifier specifies a minimum and maximum  num‐
1516       ber  of  permitted matches, by giving the two numbers in curly brackets
1517       (braces), separated by a comma. The numbers must be  less  than  65536,
1518       and the first must be less than or equal to the second. For example:
1519
1520         z{2,4}
1521
1522       matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
1523       special character. If the second number is omitted, but  the  comma  is
1524       present,  there  is  no upper limit; if the second number and the comma
1525       are both omitted, the quantifier specifies an exact number of  required
1526       matches. Thus
1527
1528         [aeiou]{3,}
1529
1530       matches at least 3 successive vowels, but may match many more, whereas
1531
1532         \d{8}
1533
1534       matches  exactly  8  digits. An opening curly bracket that appears in a
1535       position where a quantifier is not allowed, or one that does not  match
1536       the  syntax of a quantifier, is taken as a literal character. For exam‐
1537       ple, {,6} is not a quantifier, but a literal string of four characters.
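
       The three forms of the general quantifier can be exercised as
       follows. The sketch assumes Python's re module; whole() is a helper
       defined only for this example. The {,6} case is deliberately left
       out, because Python reads it as a quantifier, unlike the behaviour
       described above.

```python
import re


def whole(pattern, subject):
    """Return True if the pattern matches the entire subject."""
    return re.fullmatch(pattern, subject) is not None


# {2,4}: between two and four repeats.
between = [n for n in range(1, 6) if whole(r"z{2,4}", "z" * n)]

# {3,}: three or more repeats, with no upper limit.
at_least = [n for n in range(1, 6) if whole(r"[aeiou]{3,}", "a" * n)]

# {8}: exactly eight repeats.
exact = [n for n in range(6, 11) if whole(r"\d{8}", "1" * n)]

print(between)    # [2, 3, 4]
print(at_least)   # [3, 4, 5]
print(exact)      # [8]
```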
1538
1539       In UTF modes, quantifiers apply to characters rather than to individual
1540       code  units. Thus, for example, \x{100}{2} matches two characters, each
1541       of which is represented by a two-byte sequence in a UTF-8 string. Simi‐
1542       larly,  \X{3} matches three Unicode extended grapheme clusters, each of
1543       which may be several code units long (and  they  may  be  of  different
1544       lengths).
1545
1546       The quantifier {0} is permitted, causing the expression to behave as if
1547       the previous item and the quantifier were not present. This may be use‐
1548       ful  for  subpatterns that are referenced as subroutines from elsewhere
1549       in the pattern (but see also the section entitled "Defining subpatterns
1550       for  use  by  reference only" below). Items other than subpatterns that
1551       have a {0} quantifier are omitted from the compiled pattern.
1552
1553       For convenience, the three most common quantifiers have  single-charac‐
1554       ter abbreviations:
1555
1556         *    is equivalent to {0,}
1557         +    is equivalent to {1,}
1558         ?    is equivalent to {0,1}
1559
1560       It  is  possible  to construct infinite loops by following a subpattern
1561       that can match no characters with a quantifier that has no upper limit,
1562       for example:
1563
1564         (a?)*
1565
1566       Earlier  versions  of  Perl  and PCRE1 used to give an error at compile
1567       time for such patterns. However, because there are cases where this can
1568       be useful, such patterns are now accepted, but if any repetition of the
1569       subpattern does in fact match no characters, the loop is forcibly  bro‐
1570       ken.
1571
1572       By  default,  the quantifiers are "greedy", that is, they match as much
1573       as possible (up to the maximum  number  of  permitted  times),  without
1574       causing  the  rest of the pattern to fail. The classic example of where
1575       this gives problems is in trying to match comments in C programs. These
1576       appear  between  /*  and  */ and within the comment, individual * and /
1577       characters may appear. An attempt to match C comments by  applying  the
1578       pattern
1579
1580         /\*.*\*/
1581
1582       to the string
1583
1584         /* first comment */  not comment  /* second comment */
1585
1586       fails,  because it matches the entire string owing to the greediness of
1587       the .*  item.
1588
1589       If a quantifier is followed by a question mark, it ceases to be greedy,
1590       and  instead  matches the minimum number of times possible, so the pat‐
1591       tern
1592
1593         /\*.*?\*/
1594
1595       does the right thing with the C comments. The meaning  of  the  various
1596       quantifiers  is  not  otherwise  changed,  just the preferred number of
1597       matches.  Do not confuse this use of question mark with its  use  as  a
1598       quantifier  in its own right. Because it has two uses, it can sometimes
1599       appear doubled, as in
1600
1601         \d??\d
1602
1603       which matches one digit by preference, but can match two if that is the
1604       only way the rest of the pattern matches.
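
       The C comment example and the doubled question mark can be
       reproduced with a short script. It assumes Python's re module,
       which treats greedy and minimizing quantifiers in the same way for
       these patterns.

```python
import re

SOURCE = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last "*/", so one match covers the whole string.
greedy = re.findall(r"/\*.*\*/", SOURCE)

# Minimizing .*? stops at the first "*/", giving one match per comment.
lazy = re.findall(r"/\*.*?\*/", SOURCE)

# \d??\d prefers one digit, but takes two when the rest requires it
# (here, fullmatch requires the whole subject to be consumed).
prefers_one = re.match(r"\d??\d", "12").group()
can_take_two = re.fullmatch(r"\d??\d", "12").group()

print(len(greedy), len(lazy))      # 1 2
print(prefers_one, can_take_two)   # 1 12
```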
1605
1606       If the PCRE2_UNGREEDY option is set (an option that is not available in
1607       Perl), the quantifiers are not greedy by default, but  individual  ones
1608       can  be  made  greedy  by following them with a question mark. In other
1609       words, it inverts the default behaviour.
1610
1611       When a parenthesized subpattern is quantified  with  a  minimum  repeat
1612       count  that is greater than 1 or with a limited maximum, more memory is
1613       required for the compiled pattern, in proportion to  the  size  of  the
1614       minimum or maximum.
1615
1616       If  a  pattern  starts  with  .*  or  .{0,} and the PCRE2_DOTALL option
1617       (equivalent to Perl's /s) is set, thus allowing the dot to  match  new‐
1618       lines,  the  pattern  is  implicitly anchored, because whatever follows
1619       will be tried against every character position in the  subject  string,
1620       so  there  is  no  point  in retrying the overall match at any position
1621       after the first. PCRE2 normally treats such a pattern as though it were
1622       preceded by \A.
1623
1624       In  cases  where  it  is known that the subject string contains no new‐
1625       lines, it is worth setting PCRE2_DOTALL in order to obtain  this  opti‐
1626       mization, or alternatively, using ^ to indicate anchoring explicitly.
1627
1628       However,  there  are  some cases where the optimization cannot be used.
1629       When .*  is inside capturing parentheses that are the subject of a back
1630       reference elsewhere in the pattern, a match at the start may fail where
1631       a later one succeeds. Consider, for example:
1632
1633         (.*)abc\1
1634
1635       If the subject is "xyz123abc123" the match point is the fourth  charac‐
1636       ter. For this reason, such a pattern is not implicitly anchored.
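
       The reason anchoring would be wrong here can be seen by comparing an
       unanchored search with a match that is forced to begin at the first
       character. The sketch assumes Python's re module, which applies no
       implicit anchoring of its own; re.match() plays the part of an
       anchored attempt.

```python
import re

SUBJECT = "xyz123abc123"

# An unanchored search finds the match at offset 3, the fourth character.
found = re.search(r"(.*)abc\1", SUBJECT)
match_offset = found.start()
captured = found.group(1)

# An attempt anchored at the start of the subject fails.
anchored = re.match(r"(.*)abc\1", SUBJECT)

print(match_offset, captured)   # 3 123
print(anchored)                 # None
```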
1637
1638       Another  case where implicit anchoring is not applied is when the lead‐
1639       ing .* is inside an atomic group. Once again, a match at the start  may
1640       fail where a later one succeeds. Consider this pattern:
1641
1642         (?>.*?a)b
1643
       It matches "ab" in the subject "aab". The use of the backtracking
       control verbs (*PRUNE) and (*SKIP) also disables this optimization,
       and there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
1647
1648       When a capturing subpattern is repeated, the value captured is the sub‐
1649       string that matched the final iteration. For example, after
1650
1651         (tweedle[dume]{3}\s*)+
1652
1653       has matched "tweedledum tweedledee" the value of the captured substring
1654       is  "tweedledee".  However,  if there are nested capturing subpatterns,
1655       the corresponding captured values may have been set in previous  itera‐
1656       tions. For example, after
1657
1658         (a|(b))+
1659
1660       matches "aba" the value of the second captured substring is "b".
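
       Both examples can be checked with a short script, assuming Python's
       re module, which keeps captured values across iterations in the
       same way for these two patterns.

```python
import re

# The outer group reports the substring matched by its final iteration.
last_iteration = re.fullmatch(
    r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee"
).group(1)

# The nested group (b) was last set in the second iteration and keeps
# that value even though the third iteration matched "a".
nested = re.fullmatch(r"(a|(b))+", "aba").groups()

print(last_iteration)   # tweedledee
print(nested)           # ('a', 'b')
```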
1661

ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

1663
       With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
       repetition, failure of what follows normally causes the repeated item
       to be re-evaluated to see if a different number of repeats allows the
       rest of the pattern to match. Sometimes it is useful to prevent this,
       either to change the nature of the match, or to cause it to fail
       earlier than it otherwise might, when the author of the pattern knows
       there is no point in carrying on.
1671
1672       Consider,  for  example, the pattern \d+foo when applied to the subject
1673       line
1674
1675         123456bar
1676
1677       After matching all 6 digits and then failing to match "foo", the normal
1678       action  of  the matcher is to try again with only 5 digits matching the
1679       \d+ item, and then with  4,  and  so  on,  before  ultimately  failing.
1680       "Atomic  grouping"  (a  term taken from Jeffrey Friedl's book) provides
1681       the means for specifying that once a subpattern has matched, it is  not
1682       to be re-evaluated in this way.
1683
1684       If  we  use atomic grouping for the previous example, the matcher gives
1685       up immediately on failing to match "foo" the first time.  The  notation
1686       is a kind of special parenthesis, starting with (?> as in this example:
1687
1688         (?>\d+)foo
1689
1690       This  kind  of  parenthesis "locks up" the  part of the pattern it con‐
1691       tains once it has matched, and a failure further into  the  pattern  is
1692       prevented  from  backtracking into it. Backtracking past it to previous
1693       items, however, works as normal.
1694
1695       An alternative description is that a subpattern of  this  type  matches
1696       exactly  the  string of characters that an identical standalone pattern
1697       would match, if anchored at the current point in the subject string.
1698
1699       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
1700       such as the above example can be thought of as a maximizing repeat that
1701       must swallow everything it can. So, while both \d+ and  \d+?  are  pre‐
1702       pared  to  adjust  the number of digits they match in order to make the
1703       rest of the pattern match, (?>\d+) can only match an entire sequence of
1704       digits.
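
       The "swallow everything" behaviour can be made visible with a
       pattern in which giving back a digit would be the only way to
       succeed. The sketch below runs on Python's re module, and because
       the (?>...) syntax is not available in all Python versions it uses
       a well-known emulation instead: a lookahead that captures the run
       of digits, followed by a back reference that consumes it. This
       emulation is an assumption made for the example, not PCRE2 syntax;
       in PCRE2 the second pattern would be written (?>\d+)\d.

```python
import re

# Ordinary greedy repeat: \d+ gives back one digit so the final \d can match.
backtracking = re.fullmatch(r"\d+\d", "1234")

# Emulated atomic group: the lookahead captures the longest run of digits
# and \1 then consumes exactly that run. The engine does not backtrack
# into a lookahead once it has succeeded, so no digit is given back and
# the final \d has nothing left to match.
atomic_like = re.fullmatch(r"(?=(\d+))\1\d", "1234")

print(backtracking is not None)   # True
print(atomic_like)                # None
```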
1705
1706       Atomic  groups in general can of course contain arbitrarily complicated
1707       subpatterns, and can be nested. However, when  the  subpattern  for  an
1708       atomic group is just a single repeated item, as in the example above, a
1709       simpler notation, called a "possessive quantifier" can  be  used.  This
1710       consists  of  an  additional  + character following a quantifier. Using
1711       this notation, the previous example can be rewritten as
1712
1713         \d++foo
1714
1715       Note that a possessive quantifier can be used with an entire group, for
1716       example:
1717
1718         (abc|xyz){2,3}+
1719
1720       Possessive   quantifiers   are   always  greedy;  the  setting  of  the
1721       PCRE2_UNGREEDY option is ignored. They are a  convenient  notation  for
1722       the  simpler  forms of atomic group. However, there is no difference in
1723       the meaning of a possessive quantifier and the equivalent atomic group,
1724       though  there  may  be a performance difference; possessive quantifiers
1725       should be slightly faster.
1726
1727       The possessive quantifier syntax is an extension to the Perl  5.8  syn‐
1728       tax.   Jeffrey  Friedl  originated the idea (and the name) in the first
1729       edition of his book. Mike McCloskey liked it, so implemented it when he
1730       built Sun's Java package, and PCRE1 copied it from there. It ultimately
1731       found its way into Perl at release 5.10.
1732
       PCRE2 has an optimization that automatically "possessifies" certain
       simple pattern constructs. For example, the sequence A+B is treated as
       A++B because there is no point in backtracking into a sequence of A's
       when B must follow. This feature can be disabled by the
       PCRE2_NO_AUTO_POSSESS option, or by starting the pattern with
       (*NO_AUTO_POSSESS).
1738
1739       When a pattern contains an unlimited repeat inside  a  subpattern  that
1740       can  itself  be  repeated  an  unlimited number of times, the use of an
1741       atomic group is the only way to avoid some  failing  matches  taking  a
1742       very long time indeed. The pattern
1743
1744         (\D+|<\d+>)*[!?]
1745
1746       matches  an  unlimited number of substrings that either consist of non-
1747       digits, or digits enclosed in <>, followed by either ! or  ?.  When  it
1748       matches, it runs quickly. However, if it is applied to
1749
1750         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1751
1752       it  takes  a  long  time  before reporting failure. This is because the
1753       string can be divided between the internal \D+ repeat and the  external
1754       *  repeat  in  a  large  number of ways, and all have to be tried. (The
1755       example uses [!?] rather than a single character at  the  end,  because
1756       both  PCRE2  and Perl have an optimization that allows for fast failure
1757       when a single character is used. They remember the last single  charac‐
1758       ter  that  is required for a match, and fail early if it is not present
1759       in the string.) If the pattern is changed so that  it  uses  an  atomic
1760       group, like this:
1761
1762         ((?>\D+)|<\d+>)*[!?]
1763
1764       sequences of non-digits cannot be broken, and failure happens quickly.
1765

BACK REFERENCES

1767
1768       Outside a character class, a backslash followed by a digit greater than
1769       0 (and possibly further digits) is a back reference to a capturing sub‐
1770       pattern  earlier  (that is, to its left) in the pattern, provided there
1771       have been that many previous capturing left parentheses.
1772
1773       However, if the decimal number following the backslash is less than  8,
1774       it  is  always  taken  as a back reference, and causes an error only if
1775       there are not that many capturing left parentheses in the  entire  pat‐
1776       tern.  In  other words, the parentheses that are referenced need not be
1777       to the left of the reference for numbers less than 8. A  "forward  back
1778       reference"  of  this  type can make sense when a repetition is involved
1779       and the subpattern to the right has participated in an  earlier  itera‐
1780       tion.
1781
1782       It  is  not  possible to have a numerical "forward back reference" to a
1783       subpattern whose number is 8  or  more  using  this  syntax  because  a
1784       sequence  such  as  \50 is interpreted as a character defined in octal.
1785       See the subsection entitled "Non-printing characters" above for further
1786       details  of  the  handling of digits following a backslash. There is no
1787       such problem when named parentheses are used. A back reference  to  any
1788       subpattern is possible using named parentheses (see below).
1789
1790       Another  way  of  avoiding  the ambiguity inherent in the use of digits
1791       following a backslash is to use the \g  escape  sequence.  This  escape
1792       must be followed by a signed or unsigned number, optionally enclosed in
1793       braces. These examples are all identical:
1794
1795         (ring), \1
1796         (ring), \g1
1797         (ring), \g{1}
1798
1799       An unsigned number specifies an absolute reference without the  ambigu‐
1800       ity that is present in the older syntax. It is also useful when literal
1801       digits follow the reference. A signed number is a  relative  reference.
1802       Consider this example:
1803
1804         (abc(def)ghi)\g{-1}
1805
1806       The sequence \g{-1} refers to the most recently started capturing
1807       subpattern before \g, that is, it is equivalent to \2 in this example.
1808       Similarly, \g{-2} would be equivalent to \1. The use of relative
1809       references can be helpful in long patterns, and also in  patterns  that
1810       are  created  by  joining  together  fragments  that contain references
1811       within themselves.
1812
1813       The sequence \g{+1} is a reference to the  next  capturing  subpattern.
1814       This kind of forward reference can be useful in patterns that repeat.
1815       Perl does not support the use of + in this way.
1816
1817       A back reference matches whatever actually matched the  capturing  sub‐
1818       pattern  in  the  current subject string, rather than anything matching
1819       the subpattern itself (see "Subpatterns as subroutines" below for a way
1820       of doing that). So the pattern
1821
1822         (sens|respons)e and \1ibility
1823
1824       matches  "sense and sensibility" and "response and responsibility", but
1825       not "sense and responsibility". If caseful matching is in force at  the
1826       time  of the back reference, the case of letters is relevant. For exam‐
1827       ple,
1828
1829         ((?i)rah)\s+\1
1830
1831       matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
1832       original capturing subpattern is matched caselessly.
1833
1834       There  are  several  different ways of writing back references to named
1835       subpatterns. The .NET syntax \k{name} and the Perl syntax  \k<name>  or
1836       \k'name'  are supported, as is the Python syntax (?P=name). Perl 5.10's
1837       unified back reference syntax, in which \g can be used for both numeric
1838       and  named  references,  is  also supported. We could rewrite the above
1839       example in any of the following ways:
1840
1841         (?<p1>(?i)rah)\s+\k<p1>
1842         (?'p1'(?i)rah)\s+\k{p1}
1843         (?P<p1>(?i)rah)\s+(?P=p1)
1844         (?<p1>(?i)rah)\s+\g{p1}
1845
1846       A subpattern that is referenced by  name  may  appear  in  the  pattern
1847       before or after the reference.
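
       As a rough illustration of numbered and named back references, the
       sketch below uses Python's standard re module rather than PCRE2
       itself. That is a different engine, but it accepts the \1 and
       (?P=name) spellings shown above. The group name "stem" and the
       variable names are made up for this example.

```python
import re

# Numbered reference: \1 must repeat the text that group 1 captured,
# not merely something that group 1 could have matched.
numbered = re.compile(r"(sens|respons)e and \1ibility")

# Named reference, written in the Python spelling that PCRE2 also accepts.
named = re.compile(r"(?P<stem>sens|respons)e and (?P=stem)ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

numbered_hits = [bool(numbered.fullmatch(s)) for s in subjects]
named_hits = [bool(named.fullmatch(s)) for s in subjects]
print(numbered_hits, named_hits)
```

       Only the third subject is rejected, because its second half does
       not repeat the stem that the group captured.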
1848
1849       There  may be more than one back reference to the same subpattern. If a
1850       subpattern has not actually been used in a particular match,  any  back
1851       references to it always fail by default. For example, the pattern
1852
1853         (a|(bc))\2
1854
1855       always  fails  if  it starts to match "a" rather than "bc". However, if
1856       the PCRE2_MATCH_UNSET_BACKREF option is set at  compile  time,  a  back
1857       reference to an unset value matches an empty string.
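
       The default behaviour for an unset group can be tried out with
       Python's re module, which is a separate engine that treats a
       reference to an unset group the same way. This sketch does not
       attempt to reproduce the PCRE2_MATCH_UNSET_BACKREF option, and the
       variable names are illustrative only.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# The first alternative matches "a", so group 2 is never set and the
# reference \2 fails. The other alternative cannot match "a" either.
took_a_branch = pattern.match("a")

# Here group 2 captures "bc" and \2 then matches the second "bc".
took_bc_branch = pattern.match("bcbc")

print(took_a_branch, took_bc_branch)
```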
1858
1859       Because  there may be many capturing parentheses in a pattern, all dig‐
1860       its following a backslash are taken as part of a potential back  refer‐
1861       ence  number.   If  the  pattern continues with a digit character, some
1862       delimiter must  be  used  to  terminate  the  back  reference.  If  the
1863       PCRE2_EXTENDED  option  is set, this can be white space. Otherwise, the
1864       \g{ syntax or an empty comment (see "Comments" below) can be used.
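
       A minimal sketch of two of these delimiters, written for Python's
       re module as a stand-in engine: an empty comment, and white space
       in verbose (extended) mode. Python's re does not accept the \g{
       form inside a pattern, so that variant is not shown. In both
       patterns the intent is "group 1 again, then a literal digit 1".

```python
import re

# An empty comment (?#) separates the reference \1 from the literal 1.
with_comment = re.compile(r"(a)\1(?#)1")

# In verbose mode, insignificant white space does the same job.
with_space = re.compile(r"(a)\1 1", re.VERBOSE)

comment_match = with_comment.fullmatch("aa1")
space_match = with_space.fullmatch("aa1")
print(comment_match, space_match)
```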
1865
1866   Recursive back references
1867
1868       A back reference that occurs inside the parentheses to which it  refers
1869       fails  when  the subpattern is first used, so, for example, (a\1) never
1870       matches.  However, such references can be useful inside  repeated  sub‐
1871       patterns. For example, the pattern
1872
1873         (a|b\1)+
1874
1875       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter‐
1876       ation of the subpattern,  the  back  reference  matches  the  character
1877       string  corresponding  to  the previous iteration. In order for this to
1878       work, the pattern must be such that the first iteration does  not  need
1879       to  match the back reference. This can be done using alternation, as in
1880       the example above, or by a quantifier with a minimum of zero.
1881
1882       Back references of this type cause the group that they reference to  be
1883       treated  as  an atomic group.  Once the whole group has been matched, a
1884       subsequent matching failure cannot cause backtracking into  the  middle
1885       of the group.
1886

ASSERTIONS

1888
1889       An  assertion  is  a  test on the characters following or preceding the
1890       current matching point that does not consume any characters. The simple
1891       assertions  coded  as  \b,  \B,  \A,  \G, \Z, \z, ^ and $ are described
1892       above.
1893
1894       More complicated assertions are coded as  subpatterns.  There  are  two
1895       kinds:  those  that  look  ahead of the current position in the subject
1896       string, and those that look  behind  it.  An  assertion  subpattern  is
1897       matched  in  the  normal way, except that it does not cause the current
1898       matching position to be changed.
1899
1900       Assertion subpatterns are not capturing subpatterns. If such an  asser‐
1901       tion  contains  capturing  subpatterns within it, these are counted for
1902       the purposes of numbering the capturing subpatterns in the  whole  pat‐
1903       tern.  However,  substring  capturing  is carried out only for positive
1904       assertions. (Perl sometimes, but not always, does do capturing in nega‐
1905       tive assertions.)
1906
1907       WARNING:  If a positive assertion containing one or more capturing sub‐
1908       patterns succeeds, but failure to match later  in  the  pattern  causes
1909       backtracking over this assertion, the captures within the assertion are
1910       reset only if no higher numbered captures are  already  set.  This  is,
1911       unfortunately,  a fundamental limitation of the current implementation;
1912       it may get removed in a future reworking.
1913
1914       For  compatibility  with  Perl,  most  assertion  subpatterns  may   be
1915       repeated;  though  it  makes  no sense to assert the same thing several
1916       times, the side effect of capturing  parentheses  may  occasionally  be
1917       useful. However, an assertion that forms the condition for a
1918       conditional subpattern may not be quantified. In practice, for other
1919       assertions, there are only three cases:
1920
1921       (1)  If  the  quantifier  is  {0}, the assertion is never obeyed during
1922       matching.  However, it may  contain  internal  capturing  parenthesized
1923       groups that are called from elsewhere via the subroutine mechanism.
1924
1925       (2) If the quantifier is {0,n} where n is greater than zero, it is
1926       treated as if it were {0,1}. At run time, the rest of the pattern match
1927       is tried with and without the assertion, the order depending on the
1928       greediness of the quantifier.
1929
1930       (3) If the minimum repetition is greater than zero, the  quantifier  is
1931       ignored.   The  assertion  is  obeyed just once when encountered during
1932       matching.
1933
1934   Lookahead assertions
1935
1936       Lookahead assertions start with (?= for positive assertions and (?! for
1937       negative assertions. For example,
1938
1939         \w+(?=;)
1940
1941       matches  a word followed by a semicolon, but does not include the semi‐
1942       colon in the match, and
1943
1944         foo(?!bar)
1945
1946       matches any occurrence of "foo" that is not  followed  by  "bar".  Note
1947       that the apparently similar pattern
1948
1949         (?!foo)bar
1950
1951       does  not  find  an  occurrence  of "bar" that is preceded by something
1952       other than "foo"; it finds any occurrence of "bar" whatsoever,  because
1953       the assertion (?!foo) is always true when the next three characters are
1954       "bar". A lookbehind assertion is needed to achieve the other effect.
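
       The three lookahead patterns above can be exercised with a short
       Python sketch. It uses the standard re module, a different engine
       that shares this lookahead syntax. The subject strings and
       variable names are chosen only for illustration.

```python
import re

# Positive lookahead: the semicolon must follow but is not consumed.
word = re.search(r"\w+(?=;)", "total;")

# Negative lookahead: "foo" is rejected when "bar" comes next.
foo_then_bar = re.search(r"foo(?!bar)", "foobar")
foo_then_baz = re.search(r"foo(?!bar)", "foobaz")

# (?!foo)bar looks ahead from where "bar" starts, so the preceding
# "foo" is never examined and the match succeeds.
bar_after_foo = re.search(r"(?!foo)bar", "foobar")

print(word.group(), foo_then_bar, foo_then_baz.group(), bar_after_foo.span())
```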
1955
1956       If you want to force a matching failure at some point in a pattern, the
1957       most  convenient  way  to  do  it  is with (?!) because an empty string
1958       always matches, so an assertion that requires there not to be an  empty
1959       string must always fail.  The backtracking control verb (*FAIL) or (*F)
1960       is a synonym for (?!).
1961
1962   Lookbehind assertions
1963
1964       Lookbehind assertions start with (?<= for positive assertions and  (?<!
1965       for negative assertions. For example,
1966
1967         (?<!foo)bar
1968
1969       does  find  an  occurrence  of "bar" that is not preceded by "foo". The
1970       contents of a lookbehind assertion are restricted  such  that  all  the
1971       strings it matches must have a fixed length. However, if there are sev‐
1972       eral top-level alternatives, they do not all  have  to  have  the  same
1973       fixed length. Thus
1974
1975         (?<=bullock|donkey)
1976
1977       is permitted, but
1978
1979         (?<!dogs?|cats?)
1980
1981       causes  an  error at compile time. Branches that match different length
1982       strings are permitted only at the top level of a lookbehind  assertion.
1983       This is an extension compared with Perl, which requires all branches to
1984       match the same length of string. An assertion such as
1985
1986         (?<=ab(c|de))
1987
1988       is not permitted, because its single top-level  branch  can  match  two
1989       different  lengths,  but  it is acceptable to PCRE2 if rewritten to use
1990       two top-level branches:
1991
1992         (?<=abc|abde)
1993
1994       In some cases, the escape sequence \K (see above) can be  used  instead
1995       of a lookbehind assertion to get round the fixed-length restriction.
1996
1997       The  implementation  of lookbehind assertions is, for each alternative,
1998       to temporarily move the current position back by the fixed  length  and
1999       then try to match. If there are insufficient characters before the cur‐
2000       rent position, the assertion fails.
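
       The following sketch shows a negative lookbehind and the
       "insufficient characters" case, using Python's re module as a
       stand-in. Python is stricter than PCRE2 here, since it requires
       every alternative in a lookbehind to have the same length, so only
       single-length lookbehinds are used. All names are illustrative.

```python
import re

# "bar" is rejected only when "foo" comes immediately before it.
after_foo = re.search(r"(?<!foo)bar", "foobar")
after_baz = re.search(r"(?<!foo)bar", "bazbar")

# A positive lookbehind needs three characters before the "d". In "bcd"
# only two are available, so the assertion fails.
too_short = re.search(r"(?<=abc)d", "bcd")
long_enough = re.search(r"(?<=abc)d", "abcd")

print(after_foo, after_baz.span(), too_short, long_enough.span())
```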
2001
2002       In UTF-8 and UTF-16 modes, PCRE2 does not allow the  \C  escape  (which
2003       matches  a single code unit even in a UTF mode) to appear in lookbehind
2004       assertions, because it makes it impossible to calculate the  length  of
2005       the  lookbehind.  The \X and \R escapes, which can match different num‐
2006       bers of code units, are never permitted in lookbehinds.
2007
2008       "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in
2009       lookbehinds,  as  long as the subpattern matches a fixed-length string.
2010       However, recursion, that is, a "subroutine" call into a group  that  is
2011       already active, is not supported.
2012
2013       Perl  does  not support back references in lookbehinds. PCRE2 does sup‐
2014       port  them,   but   only   if   certain   conditions   are   met.   The
2015       PCRE2_MATCH_UNSET_BACKREF  option must not be set, there must be no use
2016       of (?| in the pattern (it creates duplicate subpattern numbers), and if
2017       the  back reference is by name, the name must be unique. Of course, the
2018       referenced subpattern must itself be of  fixed  length.  The  following
2019       pattern matches words containing at least two characters that begin and
2020       end with the same character:
2021
2022          \b(\w)\w++(?<=\1)
2023
2024       Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
2025       assertions to specify efficient matching of fixed-length strings at the
2026       end of subject strings. Consider a simple pattern such as
2027
2028         abcd$
2029
2030       when applied to a long string that does  not  match.  Because  matching
2031       proceeds  from  left to right, PCRE2 will look for each "a" in the sub‐
2032       ject and then see if what follows matches the rest of the  pattern.  If
2033       the pattern is specified as
2034
2035         ^.*abcd$
2036
2037       the  initial .* matches the entire string at first, but when this fails
2038       (because there is no following "a"), it backtracks to match all but the
2039       last  character,  then all but the last two characters, and so on. Once
2040       again the search for "a" covers the entire string, from right to  left,
2041       so we are no better off. However, if the pattern is written as
2042
2043         ^.*+(?<=abcd)
2044
2045       there can be no backtracking for the .*+ item because of the possessive
2046       quantifier; it can match only the entire string. The subsequent lookbe‐
2047       hind  assertion  does  a single test on the last four characters. If it
2048       fails, the match fails immediately. For  long  strings,  this  approach
2049       makes a significant difference to the processing time.
2050
2051   Using multiple assertions
2052
2053       Several assertions (of any sort) may occur in succession. For example,
2054
2055         (?<=\d{3})(?<!999)foo
2056
2057       matches  "foo" preceded by three digits that are not "999". Notice that
2058       each of the assertions is applied independently at the  same  point  in
2059       the  subject  string.  First  there  is a check that the previous three
2060       characters are all digits, and then there is  a  check  that  the  same
2061       three characters are not "999". This pattern does not match "foo"
2062       preceded by six characters, the first three of which are digits and the
2063       last three of which are not "999". For example, it doesn't match
2064       "123abcfoo". A pattern to do that is
2065
2066         (?<=\d{3}...)(?<!999)foo
2067
2068       This time the first assertion looks at the  preceding  six  characters,
2069       checking that the first three are digits, and then the second assertion
2070       checks that the preceding three characters are not "999".
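
       To see the difference between the two patterns, here is a small
       sketch using Python's re module, which accepts both lookbehinds
       because each has a single fixed length. The names "adjacent" and
       "six_back" are invented for this example.

```python
import re

# Both assertions inspect the same three characters before "foo".
adjacent = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion now inspects six characters; the second still
# inspects only the last three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

adjacent_results = [bool(adjacent.search(s))
                    for s in ("123foo", "999foo", "123abcfoo")]
six_back_results = [bool(six_back.search(s))
                    for s in ("123abcfoo", "123999foo", "abc123foo")]
print(adjacent_results, six_back_results)
```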
2071
2072       Assertions can be nested in any combination. For example,
2073
2074         (?<=(?<!foo)bar)baz
2075
2076       matches an occurrence of "baz" that is preceded by "bar" which in  turn
2077       is not preceded by "foo", while
2078
2079         (?<=\d{3}(?!999)...)foo
2080
2081       is  another pattern that matches "foo" preceded by three digits and any
2082       three characters that are not "999".
2083

CONDITIONAL SUBPATTERNS

2085
2086       It is possible to cause the matching process to obey a subpattern  con‐
2087       ditionally  or to choose between two alternative subpatterns, depending
2088       on the result of an assertion, or whether a specific capturing  subpat‐
2089       tern  has  already  been matched. The two possible forms of conditional
2090       subpattern are:
2091
2092         (?(condition)yes-pattern)
2093         (?(condition)yes-pattern|no-pattern)
2094
2095       If the condition is satisfied, the yes-pattern is used;  otherwise  the
2096       no-pattern  (if  present)  is used. If there are more than two alterna‐
2097       tives in the subpattern, a compile-time error occurs. Each of  the  two
2098       alternatives may itself contain nested subpatterns of any form, includ‐
2099       ing  conditional  subpatterns;  the  restriction  to  two  alternatives
2100       applies only at the level of the condition. This pattern fragment is an
2101       example where the alternatives are complex:
2102
2103         (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
2104
2105
2106       There are five kinds of condition: references  to  subpatterns,  refer‐
2107       ences  to  recursion,  two pseudo-conditions called DEFINE and VERSION,
2108       and assertions.
2109
2110   Checking for a used subpattern by number
2111
2112       If the text between the parentheses consists of a sequence  of  digits,
2113       the condition is true if a capturing subpattern of that number has pre‐
2114       viously matched. If there is more than one  capturing  subpattern  with
2115       the  same  number  (see  the earlier section about duplicate subpattern
2116       numbers), the condition is true if any of them have matched. An  alter‐
2117       native  notation is to precede the digits with a plus or minus sign. In
2118       this case, the subpattern number is relative rather than absolute.  The
2119       most  recently opened parentheses can be referenced by (?(-1), the next
2120       most recent by (?(-2), and so on. Inside loops it can also  make  sense
2121       to refer to subsequent groups. The next parentheses to be opened can be
2122       referenced as (?(+1), and so on. (The value zero in any of these  forms
2123       is not used; it provokes a compile-time error.)
2124
2125       Consider  the  following  pattern, which contains non-significant white
2126       space to make it more readable (assume the PCRE2_EXTENDED  option)  and
2127       to divide it into three parts for ease of discussion:
2128
2129         ( \( )?    [^()]+    (?(1) \) )
2130
2131       The  first  part  matches  an optional opening parenthesis, and if that
2132       character is present, sets it as the first captured substring. The sec‐
2133       ond  part  matches one or more characters that are not parentheses. The
2134       third part is a conditional subpattern that tests whether  or  not  the
2135       first set of parentheses matched. If they did, that is, if the subject
2136       started with an opening parenthesis, the condition is true, and so the
2137       yes-pattern  is  executed and a closing parenthesis is required. Other‐
2138       wise, since no-pattern is not present, the subpattern matches  nothing.
2139       In  other  words,  this  pattern matches a sequence of non-parentheses,
2140       optionally enclosed in parentheses.
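
       The behaviour described above can be checked with Python's re
       module, which supports the (?(1)...) form. re.VERBOSE plays the
       role of PCRE2_EXTENDED in this sketch, and fullmatch() is used so
       that a stray parenthesis cannot simply be left unmatched.

```python
import re

# Optional "(", then non-parentheses, then ")" only if group 1 matched.
pattern = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

results = {s: bool(pattern.fullmatch(s))
           for s in ("(abc)", "abc", "(abc", "abc)")}
print(results)
```

       The two unbalanced subjects are rejected: "(abc" lacks the closing
       parenthesis that the condition demands, and "abc)" has a closing
       parenthesis that nothing in the pattern can consume.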
2141
2142       If you were embedding this pattern in a larger one,  you  could  use  a
2143       relative reference:
2144
2145         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
2146
2147       This  makes  the  fragment independent of the parentheses in the larger
2148       pattern.
2149
2150   Checking for a used subpattern by name
2151
2152       Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
2153       used  subpattern  by  name.  For compatibility with earlier versions of
2154       PCRE1, which had this facility before Perl, the syntax (?(name)...)  is
2155       also  recognized.  Note,  however, that undelimited names consisting of
2156       the letter R followed by digits are ambiguous (see the  following  sec‐
2157       tion).
2158
2159       Rewriting the above example to use a named subpattern gives this:
2160
2161         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
2162
2163       If  the  name used in a condition of this kind is a duplicate, the test
2164       is applied to all subpatterns of the same name, and is true if any  one
2165       of them has matched.
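
       A sketch of the named form, again using Python's re module as a
       stand-in engine. Python spells the group (?P<OPEN>...) and the
       condition (?(OPEN)...), so the pattern below is a translation of
       the example above rather than a copy of it.

```python
import re

pattern = re.compile(
    r"(?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )",
    re.VERBOSE,
)

with_parens = pattern.fullmatch("(abc)")
without_parens = pattern.fullmatch("abc")
unbalanced = pattern.fullmatch("(abc")

# The OPEN group is set only when the subject began with "(".
print(with_parens.group("OPEN"), without_parens.group("OPEN"), unbalanced)
```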
2166
2167   Checking for pattern recursion
2168
2169       "Recursion"  in  this sense refers to any subroutine-like call from one
2170       part of the pattern to another, whether or not it  is  actually  recur‐
2171       sive.  See  the sections entitled "Recursive patterns" and "Subpatterns
2172       as subroutines" below for details of recursion and subpattern calls.
2173
2174       If a condition is the string (R), and there is no subpattern  with  the
2175       name  R,  the condition is true if matching is currently in a recursion
2176       or subroutine call to the whole pattern or any  subpattern.  If  digits
2177       follow  the  letter  R,  and there is no subpattern with that name, the
2178       condition is true if the most recent call is into a subpattern with the
2179       given  number,  which must exist somewhere in the overall pattern. This
2180       is a contrived example that is equivalent to a+b:
2181
2182         ((?(R1)a+|(?1)b))
2183
2184       However, in both cases, if there is a subpattern with a matching  name,
2185       the  condition  tests  for  its  being set, as described in the section
2186       above, instead of testing for recursion. For example, creating a  group
2187       with  the  name  R1  by  adding (?<R1>) to the above pattern completely
2188       changes its meaning.
2189
2190       If a name preceded by ampersand follows the letter R, for example:
2191
2192         (?(R&name)...)
2193
2194       the condition is true if the most recent recursion is into a subpattern
2195       of that name (which must exist within the pattern).
2196
2197       This condition does not check the entire recursion stack. It tests only
2198       the current level. If the name used in a condition of this  kind  is  a
2199       duplicate, the test is applied to all subpatterns of the same name, and
2200       is true if any one of them is the most recent recursion.
2201
2202       At "top level", all these recursion test conditions are false.
2203
2204   Defining subpatterns for use by reference only
2205
2206       If the condition is the string (DEFINE), the condition is always false,
2207       even  if there is a group with the name DEFINE. In this case, there may
2208       be only one alternative in the subpattern. It is always skipped if con‐
2209       trol  reaches  this point in the pattern; the idea of DEFINE is that it
2210       can be used to define subroutines that can  be  referenced  from  else‐
2211       where. (The use of subroutines is described below.) For example, a pat‐
2212       tern to match an IPv4 address such as "192.168.23.245" could be written
2213       like this (ignore white space and line breaks):
2214
2215         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
2216         \b (?&byte) (\.(?&byte)){3} \b
2217
2218       The first part of the pattern is a DEFINE group inside which another
2219       group named "byte" is defined. This matches an individual component  of
2220       an  IPv4  address  (a number less than 256). When matching takes place,
2221       this part of the pattern is skipped because DEFINE acts  like  a  false
2222       condition.  The  rest of the pattern uses references to the named group
2223       to match the four dot-separated components of an IPv4 address,  insist‐
2224       ing on a word boundary at each end.
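
       Python's re module has no DEFINE groups or subroutine calls, so the
       sketch below imitates the idea by a different technique: the "byte"
       fragment is kept in an ordinary string and spliced into the full
       pattern. This is plain string composition, not the PCRE2 mechanism,
       and the names BYTE and IPV4 are hypothetical.

```python
import re

# The same alternatives as the "byte" group above, as a reusable fragment.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# One byte, then three more, each preceded by a dot.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

results = {s: bool(IPV4.fullmatch(s))
           for s in ("192.168.23.245", "256.1.1.1", "1.2.3")}
print(results)
```

       In PCRE2 the DEFINE form keeps the definition inside the pattern
       itself, which matters when the pattern is the only thing a user
       can supply.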
2225
2226   Checking the PCRE2 version
2227
2228       Programs  that link with a PCRE2 library can check the version by call‐
2229       ing pcre2_config() with appropriate arguments.  Users  of  applications
2230       that  do  not have access to the underlying code cannot do this. A spe‐
2231       cial "condition" called VERSION exists to allow such users to  discover
2232       which version of PCRE2 they are dealing with by using this condition to
2233       match a string such as "yesno". VERSION must be followed either by  "="
2234       or ">=" and a version number.  For example:
2235
2236         (?(VERSION>=10.4)yes|no)
2237
2238       This pattern matches "yes" if the PCRE2 version is greater than or equal
2239       to 10.4, or "no" otherwise. The fractional part of the version number
2240       may not contain more than two digits.
2241
2242   Assertion conditions
2243
2244       If  the  condition  is  not  in any of the above formats, it must be an
2245       assertion.  This may be a positive or negative lookahead or  lookbehind
2246       assertion.  Consider  this  pattern,  again  containing non-significant
2247       white space, and with the two alternatives on the second line:
2248
2249         (?(?=[^a-z]*[a-z])
2250         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
2251
2252       The condition  is  a  positive  lookahead  assertion  that  matches  an
2253       optional  sequence of non-letters followed by a letter. In other words,
2254       it tests for the presence of at least one letter in the subject.  If  a
2255       letter  is found, the subject is matched against the first alternative;
2256       otherwise it is  matched  against  the  second.  This  pattern  matches
2257       strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2258       letters and dd are digits.
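
       Python's re module does not accept an assertion as a condition, so
       the following sketch approximates the pattern above with a
       different construction: each alternative is guarded by the
       lookahead or by its negation. The names HAS_LETTER and DATE are
       invented for the example.

```python
import re

HAS_LETTER = r"[^a-z]*[a-z]"

# If a letter lies ahead, only the dd-aaa-dd form is tried; otherwise
# only the dd-dd-dd form is tried.
DATE = re.compile(
    rf"(?:(?={HAS_LETTER})\d{{2}}-[a-z]{{3}}-\d{{2}}"
    rf"|(?!{HAS_LETTER})\d{{2}}-\d{{2}}-\d{{2}})"
)

with_letters = DATE.match("12-abc-34")
digits_only = DATE.match("12-34-56")
# The trailing "x" makes the condition true, so the all-digit form is
# never attempted and the match fails.
letter_later = DATE.match("12-34-56 x")

print(with_letters.group(), digits_only.group(), letter_later)
```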
2259

COMMENTS

2261
2262       There are two ways of including comments in patterns that are processed
2263       by  PCRE2.  In  both  cases,  the start of the comment must not be in a
2264       character class, nor in the middle of any  other  sequence  of  related
2265       characters  such  as (?: or a subpattern name or number. The characters
2266       that make up a comment play no part in the pattern matching.
2267
2268       The sequence (?# marks the start of a comment that continues up to  the
2269       next  closing parenthesis. Nested parentheses are not permitted. If the
2270       PCRE2_EXTENDED option is set, an unescaped # character also  introduces
2271       a  comment,  which in this case continues to immediately after the next
2272       newline character or character sequence in the pattern.  Which  charac‐
2273       ters  are  interpreted as newlines is controlled by an option passed to
2274       the compiling function or by a special sequence at  the  start  of  the
2275       pattern,  as  described  in  the section entitled "Newline conventions"
2276       above. Note that the end of this type of comment is a  literal  newline
2277       sequence  in  the  pattern; escape sequences that happen to represent a
2278       newline  do  not  count.  For  example,  consider  this  pattern   when
2279       PCRE2_EXTENDED  is  set,  and  the default newline convention (a single
2280       linefeed character) is in force:
2281
2282         abc #comment \n still comment
2283
2284       On encountering the # character, pcre2_compile() skips  along,  looking
2285       for  a newline in the pattern. The sequence \n is still literal at this
2286       stage, so it does not terminate the comment. Only an  actual  character
2287       with the code value 0x0a (the default newline) does so.
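
       Python's re module treats comments in verbose mode in a comparable
       way, so it can serve as a rough demonstration. In the sketch below
       the raw string holds a backslash followed by "n", whereas the
       ordinary string holds a real linefeed character. The variable
       names are illustrative.

```python
import re

# Backslash and "n" in the pattern: not a newline, so the comment runs
# to the end of the pattern and only "abc" remains.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# A real linefeed in the pattern ends the comment, so "def" is part of
# the pattern again.
literal = re.compile("abc #comment \n def", re.VERBOSE)

# The (?#...) form needs no option and ends at the next ")".
inline = re.compile(r"abc(?# ignored)def")

print(bool(escaped.fullmatch("abc")),
      bool(literal.fullmatch("abcdef")),
      bool(inline.fullmatch("abcdef")))
```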
2288

RECURSIVE PATTERNS

2290
2291       Consider  the problem of matching a string in parentheses, allowing for
2292       unlimited nested parentheses. Without the use of  recursion,  the  best
2293       that  can  be  done  is  to use a pattern that matches up to some fixed
2294       depth of nesting. It is not possible to  handle  an  arbitrary  nesting
2295       depth.
2296
2297       For some time, Perl has provided a facility that allows regular expres‐
2298       sions to recurse (amongst other things). It does this by  interpolating
2299       Perl  code in the expression at run time, and the code can refer to the
2300       expression itself. A Perl pattern using code interpolation to solve the
2301       parentheses problem can be created like this:
2302
2303         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
2304
2305       The (?p{...}) item interpolates Perl code at run time, and in this case
2306       refers recursively to the pattern in which it appears.
2307
2308       Obviously,  PCRE2  cannot  support  the  interpolation  of  Perl  code.
2309       Instead,  it  supports  special syntax for recursion of the entire pat‐
2310       tern, and also for individual subpattern recursion. After its introduc‐
2311       tion  in  PCRE1  and  Python,  this  kind of recursion was subsequently
2312       introduced into Perl at release 5.10.
2313
2314       A special item that consists of (? followed by a  number  greater  than
2315       zero  and  a  closing parenthesis is a recursive subroutine call of the
2316       subpattern of the given number, provided that  it  occurs  inside  that
2317       subpattern.  (If  not,  it is a non-recursive subroutine call, which is
2318       described in the next section.) The special item  (?R)  or  (?0)  is  a
2319       recursive call of the entire regular expression.
2320
2321       This  PCRE2  pattern  solves the nested parentheses problem (assume the
2322       PCRE2_EXTENDED option is set so that white space is ignored):
2323
2324         \( ( [^()]++ | (?R) )* \)
2325
2326       First it matches an opening parenthesis. Then it matches any number  of
2327       substrings  which  can  either  be  a sequence of non-parentheses, or a
2328       recursive match of the pattern itself (that is, a  correctly  parenthe‐
2329       sized substring).  Finally there is a closing parenthesis. Note the use
2330       of a possessive quantifier to avoid backtracking into sequences of non-
2331       parentheses.
2332
2333       If  this  were  part of a larger pattern, you would not want to recurse
2334       the entire pattern, so instead you could use this:
2335
2336         ( \( ( [^()]++ | (?1) )* \) )
2337
2338       We have put the pattern into parentheses, and caused the  recursion  to
2339       refer to them instead of the whole pattern.
2340
2341       In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
2342       tricky. This is made easier by the use of relative references.  Instead
2343       of (?1) in the pattern above you can write (?-2) to refer to the second
2344       most recently opened parentheses  preceding  the  recursion.  In  other
2345       words,  a  negative  number counts capturing parentheses leftwards from
2346       the point at which it is encountered.
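
       The leftward counting can be sketched in Python as follows. This is a
       simplified model, not PCRE2's own parser: the function names are
       hypothetical, and the scanner assumes a pattern without \Q...\E
       sequences, without a literal ] at the start of a character class, and
       without duplicate group numbers.

```python
def capturing_groups_before(pattern, upto):
    """Count capturing parentheses opened in pattern[:upto]."""
    count = 0
    i = 0
    in_class = False
    while i < upto:
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip an escaped character such as \(
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(":
            rest = pattern[i + 1:i + 4]
            # (?<name>, (?'name' and (?P<name> open capturing groups;
            # (?<= and (?<! are lookbehinds and do not.
            named = (rest.startswith("?P<") or rest.startswith("?'")
                     or (rest.startswith("?<") and rest[2:3] not in ("=", "!")))
            if named or not rest.startswith(("?", "*")):
                count += 1
        i += 1
    return count


def resolve_relative(pattern, ref_offset, relative):
    """Turn a relative group reference into an absolute group number."""
    opened = capturing_groups_before(pattern, ref_offset)
    if relative < 0:
        return opened + 1 + relative  # -1 is the most recently opened group
    return opened + relative          # +1 is the next group to be opened
```

       For the pattern above with (?-2) in place of (?1), two capturing
       groups have been opened when the reference is reached, so the
       reference resolves to group 1.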
2347
2348       Be aware, however, that if duplicate subpattern numbers are in use, rel‐
2349       ative  references refer to the earliest subpattern with the appropriate
2350       number. Consider, for example:
2351
2352         (?|(a)|(b)) (c) (?-2)
2353
2354       The first two capturing groups (a) and (b) are  both  numbered  1,  and
2355       group  (c)  is  number  2. When the reference (?-2) is encountered, the
2356       second most recently opened group has the number 1, but it is the
2357       first  such  group  (the (a) group) to which the recursion refers. This
2358       would be the same if an absolute reference  (?1)  was  used.  In  other
2359       words,  relative  references are just a shorthand for computing a group
2360       number.
2361
2362       It is also possible to refer to  subsequently  opened  parentheses,  by
2363       writing  references  such  as (?+2). However, these cannot be recursive
2364       because the reference is not inside the  parentheses  that  are  refer‐
2365       enced.  They are always non-recursive subroutine calls, as described in
2366       the next section.
2367
2368       An alternative approach is to use named parentheses.  The  Perl  syntax
2369       for  this  is  (?&name);  PCRE1's earlier syntax (?P>name) is also sup‐
2370       ported. We could rewrite the above example as follows:
2371
2372         (?<pn> \( ( [^()]++ | (?&pn) )* \) )
2373
2374       If there is more than one subpattern with the same name,  the  earliest
2375       one is used.
2376
2377       The example pattern that we have been looking at contains nested unlim‐
2378       ited repeats, and so the use of a possessive  quantifier  for  matching
2379       strings  of  non-parentheses  is important when applying the pattern to
2380       strings that do not match. For example, when this pattern is applied to
2381
2382         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2383
2384       it yields "no match" quickly. However, if a  possessive  quantifier  is
2385       not  used, the match runs for a very long time indeed because there are
2386       so many different ways the + and * repeats can carve  up  the  subject,
2387       and all have to be tested before failure can be reported.
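
       A rough feel for the growth can be had by counting how many ways a
       run of n identical characters can be cut into consecutive non-empty
       pieces, which is what the nested + and * repeats are free to do when
       backtracking is allowed. The Python sketch below is only a counting
       model with hypothetical function names; it does not reproduce the
       exact number of steps PCRE2 performs.

```python
from itertools import product


def count_splits_brute_force(n):
    """Count the ways to cut a run of n characters into non-empty pieces."""
    # Each of the n - 1 gaps between adjacent characters is either a cut
    # point or not, so enumerate every combination of choices.
    return sum(1 for _ in product((False, True), repeat=n - 1))


def count_splits(n):
    """Closed form for the same count: two choices per gap."""
    return 2 ** (n - 1)


if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        print(n, count_splits(n))
```

       The count doubles with every extra character, which is why a subject
       of only a few dozen characters is already impractical without the
       possessive quantifier.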
2388
2389       At  the  end  of a match, the values of capturing parentheses are those
2390       from the outermost level. If you want to obtain intermediate values,  a
2391       callout function can be used (see below and the pcre2callout documenta‐
2392       tion). If the pattern above is matched against
2393
2394         (ab(cd)ef)
2395
2396       the value for the inner capturing parentheses  (numbered  2)  is  "ef",
2397       which  is the last value taken on at the top level. If a capturing sub‐
2398       pattern is not matched at the top level, its final  captured  value  is
2399       unset,  even  if  it was (temporarily) set at a deeper level during the
2400       matching process.
2401
2402       If there are more than 15 capturing parentheses in a pattern, PCRE2 has
2403       to  obtain extra memory from the heap to store data during a recursion.
2404       If  no  memory  can   be   obtained,   the   match   fails   with   the
2405       PCRE2_ERROR_NOMEMORY error.
2406
2407       Do  not  confuse  the (?R) item with the condition (R), which tests for
2408       recursion.  Consider this pattern, which matches text in  angle  brack‐
2409       ets,  allowing for arbitrary nesting. Only digits are allowed in nested
2410       brackets (that is, when recursing), whereas any characters are  permit‐
2411       ted at the outer level.
2412
2413         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
2414
2415       In  this  pattern, (?(R) is the start of a conditional subpattern, with
2416       two different alternatives for the recursive and  non-recursive  cases.
2417       The (?R) item is the actual recursive call.
2418
2419   Differences in recursion processing between PCRE2 and Perl
2420
2421       Recursion  processing in PCRE2 differs from Perl in two important ways.
2422       In PCRE2 (like Python, but unlike Perl), a recursive subpattern call is
2423       always treated as an atomic group. That is, once it has matched some of
2424       the subject string, it is never re-entered, even if it contains untried
2425       alternatives  and  there  is a subsequent matching failure. This can be
2426       illustrated by the following pattern, which purports to match a  palin‐
2427       dromic  string  that contains an odd number of characters (for example,
2428       "a", "aba", "abcba", "abcdcba"):
2429
2430         ^(.|(.)(?1)\2)$
2431
2432       The idea is that it either matches a single character, or two identical
2433       characters  surrounding  a sub-palindrome. In Perl, this pattern works;
2434       in PCRE2 it does not if the subject is longer than three characters.
2435       Consider the subject string "abcba":
2436
2437       At  the  top level, the first character is matched, but as it is not at
2438       the end of the string, the first alternative fails; the second alterna‐
2439       tive is taken and the recursion kicks in. The recursive call to subpat‐
2440       tern 1 successfully matches the next character ("b").  (Note  that  the
2441       beginning and end of line tests are not part of the recursion.)
2442
2443       Back  at  the top level, the next character ("c") is compared with what
2444       subpattern 2 matched, which was "a". This fails. Because the  recursion
2445       is  treated  as  an atomic group, there are now no backtracking points,
2446       and so the entire match fails. (Perl is able, at  this  point,  to  re-
2447       enter  the  recursion  and try the second alternative.) However, if the
2448       pattern is written with the alternatives in the other order, things are
2449       different:
2450
2451         ^((.)(?1)\2|.)$
2452
2453       This  time,  the recursing alternative is tried first, and continues to
2454       recurse until it runs out of characters, at which point  the  recursion
2455       fails.  But  this  time  we  do  have another alternative to try at the
2456       higher level. That is the big difference:  in  the  previous  case  the
2457       remaining  alternative is at a deeper recursion level, which PCRE2 can‐
2458       not use.
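
       The effect of atomic recursion on these two patterns can be explored
       with a toy model in Python. The functions below are hypothetical and
       are not how PCRE2 or Perl is implemented; they enumerate the ends at
       which group 1 can match, in backtracking order, and optionally keep
       only the first result of each recursive call. Setting atomic=True
       models the PCRE2 behaviour described here, and atomic=False models a
       matcher that can re-enter the recursion. Setting single_first=True
       models ^(.|(.)(?1)\2)$ and single_first=False models ^((.)(?1)\2|.)$.

```python
from itertools import islice


def group1(subject, pos, atomic, single_first):
    """Yield each end offset at which group 1 can match from pos, in order."""

    def single():
        # The alternative that matches one character: .
        if pos < len(subject):
            yield pos + 1

    def wrapped():
        # The alternative (.)(?1)\2: a character, a recursive call to
        # group 1, then the same character again.
        if pos < len(subject):
            inner = group1(subject, pos + 1, atomic, single_first)
            if atomic:
                # An atomic call is never re-entered: keep its first result.
                inner = islice(inner, 1)
            for end in inner:
                if end < len(subject) and subject[end] == subject[pos]:
                    yield end + 1

    alternatives = (single, wrapped) if single_first else (wrapped, single)
    for alternative in alternatives:
        yield from alternative()


def is_match(subject, atomic, single_first):
    """Apply the anchors: group 1 must start at 0 and end at the end."""
    return any(end == len(subject)
               for end in group1(subject, 0, atomic, single_first))
```

       In this model "abcba" is rejected when atomic=True and
       single_first=True, but is accepted when either the recursion can be
       re-entered or the recursing alternative is placed first.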
2459
2460       To change the pattern so that it matches all palindromic  strings,  not
2461       just  those  with an odd number of characters, it is tempting to change
2462       the pattern to this:
2463
2464         ^((.)(?1)\2|.?)$
2465
2466       Again, this works in Perl, but not in PCRE2, and for the  same  reason.
2467       When  a  deeper  recursion has matched a single character, it cannot be
2468       entered again in order to match an empty string.  The  solution  is  to
2469       separate  the two cases, and write out the odd and even cases as alter‐
2470       natives at the higher level:
2471
2472         ^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
2473
2474       If you want to match typical palindromic phrases, the  pattern  has  to
2475       ignore all non-word characters, which can be done like this:
2476
2477         ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
2478
2479       If  run  with  the  PCRE2_CASELESS option, this pattern matches phrases
2480       such as "A man, a plan, a canal: Panama!" and it works  in  both  PCRE2
2481       and  Perl.  Note the use of the possessive quantifier *+ to avoid back‐
2482       tracking into sequences of non-word  characters.  Without  this,  PCRE2
2483       takes a great deal longer (ten times or more) to match typical phrases,
2484       and Perl takes so long that you think it has gone into a loop.
2485
2486       WARNING: The palindrome-matching patterns above work only if  the  sub‐
2487       ject  string  does not start with a palindrome that is shorter than the
2488       entire string.  For example, although "abcba" is correctly matched,  if
2489       the  subject is "ababa", PCRE2 finds the palindrome "aba" at the start,
2490       then fails at top level because the end of the string does not  follow.
2491       Once  again, it cannot jump back into the recursion to try other alter‐
2492       natives, so the entire match fails.
2493
2494       The second way in which PCRE2 and Perl differ in their  recursion  pro‐
2495       cessing  is in the handling of captured values. In Perl, when a subpat‐
2496       tern is called recursively or as a subroutine (see the  next  section),
2497       it  has  no  access to any values that were captured outside the recur‐
2498       sion, whereas in PCRE2 these values can be  referenced.  Consider  this
2499       pattern:
2500
2501         ^(.)(\1|a(?2))
2502
2503       In  PCRE2,  this pattern matches "bab". The first capturing parentheses
2504       match "b", then in the second group, when the back reference  \1  fails
2505       to  match "b", the second alternative matches "a" and then recurses. In
2506       the recursion, \1 does now match "b" and so the whole  match  succeeds.
2507       In  Perl,  the pattern fails to match because inside the recursive call
2508       \1 cannot access the externally set value.
2509

SUBPATTERNS AS SUBROUTINES

2511
2512       If the syntax for a recursive subpattern call (either by number  or  by
2513       name)  is  used outside the parentheses to which it refers, it operates
2514       like a subroutine in a programming language. The called subpattern  may
2515       be  defined  before or after the reference. A numbered reference can be
2516       absolute or relative, as in these examples:
2517
2518         (...(absolute)...)...(?2)...
2519         (...(relative)...)...(?-1)...
2520         (...(?+1)...(relative)...
2521
2522       An earlier example pointed out that the pattern
2523
2524         (sens|respons)e and \1ibility
2525
2526       matches "sense and sensibility" and "response and responsibility",  but
2527       not "sense and responsibility". If instead the pattern
2528
2529         (sens|respons)e and (?1)ibility
2530
2531       is  used, it does match "sense and responsibility" as well as the other
2532       two strings. Another example is  given  in  the  discussion  of  DEFINE
2533       above.
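
       The difference between a back reference and a subroutine call can be
       approximated with Python's re module, which supports back references
       but has no (?1) syntax. In the sketch below the group's pattern is
       simply written out a second time as a stand-in for the subroutine
       call; the names are hypothetical, and this textual repetition does
       not reproduce the atomic behaviour or the fixed options of a real
       PCRE2 subroutine call.

```python
import re

GROUP = r"sens|respons"

# Back reference: \1 must repeat the text that group 1 actually matched.
with_backref = re.compile(r"(" + GROUP + r")e and \1ibility")

# Stand-in for (?1): the group's pattern is matched afresh, so either
# alternative may be chosen the second time.
with_repeat = re.compile(r"(" + GROUP + r")e and (?:" + GROUP + r")ibility")


def accepts(regex, text):
    """Return True if the whole of text matches regex."""
    return regex.fullmatch(text) is not None
```

       With these definitions, "sense and responsibility" is rejected by
       with_backref and accepted by with_repeat, while the other two strings
       are accepted by both.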
2534
2535       All  subroutine  calls, whether recursive or not, are always treated as
2536       atomic groups. That is, once a subroutine has matched some of the  sub‐
2537       ject string, it is never re-entered, even if it contains untried alter‐
2538       natives and there is  a  subsequent  matching  failure.  Any  capturing
2539       parentheses  that  are  set  during the subroutine call revert to their
2540       previous values afterwards.
2541
2542       Processing options such as case-independence are fixed when  a  subpat‐
2543       tern  is defined, so if it is used as a subroutine, such options cannot
2544       be changed for different calls. For example, consider this pattern:
2545
2546         (abc)(?i:(?-1))
2547
2548       It matches "abcabc". It does not match "abcABC" because the  change  of
2549       processing option does not affect the called subpattern.
2550

ONIGURUMA SUBROUTINE SYNTAX

2552
2553       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
2554       name or a number enclosed either in angle brackets or single quotes, is
2555       an  alternative  syntax  for  referencing a subpattern as a subroutine,
2556       possibly recursively. Here are two of the examples used above,  rewrit‐
2557       ten using this syntax:
2558
2559         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
2560         (sens|respons)e and \g'1'ibility
2561
2562       PCRE2  supports an extension to Oniguruma: if a number is preceded by a
2563       plus or a minus sign it is taken as a relative reference. For example:
2564
2565         (abc)(?i:\g<-1>)
2566
2567       Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not
2568       synonymous.  The former is a back reference; the latter is a subroutine
2569       call.
2570

CALLOUTS

2572
2573       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
2574       Perl  code to be obeyed in the middle of matching a regular expression.
2575       This makes it possible, amongst other things, to extract different sub‐
2576       strings that match the same pair of parentheses when there is a repeti‐
2577       tion.
2578
2579       PCRE2 provides a similar feature, but of course it  cannot  obey  arbi‐
2580       trary  Perl  code. The feature is called "callout". The caller of PCRE2
2581       provides an external function by putting its entry  point  in  a  match
2582       context  using  the function pcre2_set_callout(), and then passing that
2583       context to pcre2_match() or pcre2_dfa_match(). If no match  context  is
2584       passed, or if the callout entry point is set to NULL, callouts are dis‐
2585       abled.
2586
2587       Within a regular expression, (?C<arg>) indicates a point at  which  the
2588       external  function  is  to  be  called. There are two kinds of callout:
2589       those with a numerical argument and those with a string argument.  (?C)
2590       on  its  own with no argument is treated as (?C0). A numerical argument
2591       allows the  application  to  distinguish  between  different  callouts.
2592       String  arguments  were added for release 10.20 to make it possible for
2593       script languages that use PCRE2 to embed short scripts within  patterns
2594       in a similar way to Perl.
2595
2596       During matching, when PCRE2 reaches a callout point, the external func‐
2597       tion is called. It is provided with the number or  string  argument  of
2598       the  callout, the position in the pattern, and one item of data that is
2599       also set in the match block. The callout function may cause matching to
2600       proceed, to backtrack, or to fail.
2601
2602       By  default,  PCRE2  implements  a  number of optimizations at matching
2603       time, and one side-effect is that sometimes callouts  are  skipped.  If
2604       you  need all possible callouts to happen, you need to set options that
2605       disable the relevant optimizations. More details, including a  complete
2606       description  of  the programming interface to the callout function, are
2607       given in the pcre2callout documentation.
2608
2609   Callouts with numerical arguments
2610
2611       If you just want to have  a  means  of  identifying  different  callout
2612       points,  put  a  number  less than 256 after the letter C. For example,
2613       this pattern has two callout points:
2614
2615         (?C1)abc(?C2)def
2616
2617       If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(),  numerical
2618       callouts  are  automatically installed before each item in the pattern.
2619       They are all numbered 255. If there is a conditional group in the  pat‐
2620       tern whose condition is an assertion, an additional callout is inserted
2621       just before the condition. An explicit callout may also be set at  this
2622       position, as in this example:
2623
2624         (?(?C9)(?=a)abc|def)
2625
2626       Note that this applies only to assertion conditions, not to other types
2627       of condition.
2628
2629   Callouts with string arguments
2630
2631       A delimited string may be used instead of a number as a  callout  argu‐
2632       ment.  The  starting  delimiter  must be one of ` ' " ^ % # $ { and the
2633       ending delimiter is the same as the start, except for {, where the end‐
2634       ing  delimiter  is  }.  If  the  ending  delimiter is needed within the
2635       string, it must be doubled. For example:
2636
2637         (?C'ab ''c'' d')xyz(?C{any text})pqr
2638
2639       The doubling is removed before the string  is  passed  to  the  callout
2640       function.
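
       The delimiter rules can be sketched as a small Python parser. This is
       an illustration of the rules stated above, not PCRE2's own code; the
       function name parse_callout_string and its return value are
       assumptions made for the example.

```python
START_DELIMITERS = "`'\"^%#${"


def parse_callout_string(pattern, pos):
    """Parse a delimited callout string whose opening delimiter is at pos.

    Returns (string with doubling removed, offset just past the closing
    delimiter).
    """
    start = pattern[pos]
    if start not in START_DELIMITERS:
        raise ValueError("not a callout string delimiter: %r" % start)
    end = "}" if start == "{" else start
    chars = []
    i = pos + 1
    while i < len(pattern):
        if pattern[i] == end:
            if i + 1 < len(pattern) and pattern[i + 1] == end:
                # A doubled ending delimiter stands for one literal copy.
                chars.append(end)
                i += 2
                continue
            return "".join(chars), i + 1
        chars.append(pattern[i])
        i += 1
    raise ValueError("unterminated callout string")
```

       Applied to the example pattern, the first callout string comes back
       as ab 'c' d and the second as any text.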
2641

BACKTRACKING CONTROL

2643
2644       Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
2645       which are still described in the Perl  documentation  as  "experimental
2646       and  subject to change or removal in a future version of Perl". It goes
2647       on to say: "Their usage in production code should  be  noted  to  avoid
2648       problems during upgrades." The same remarks apply to the PCRE2 features
2649       described in this section.
2650
2651       The new verbs make use of what was previously invalid syntax: an  open‐
2652       ing parenthesis followed by an asterisk. They are generally of the form
2653       (*VERB) or (*VERB:NAME). Some verbs take either form, possibly behaving
2654       differently depending on whether or not a name is present.
2655
2656       By  default,  for  compatibility  with  Perl, a name is any sequence of
2657       characters that does not include a closing parenthesis. The name is not
2658       processed  in  any  way,  and  it  is not possible to include a closing
2659       parenthesis  in  the  name.   This  can  be  changed  by  setting   the
2660       PCRE2_ALT_VERBNAMES  option,  but the result is no longer Perl-compati‐
2661       ble.
2662
2663       When PCRE2_ALT_VERBNAMES is set, backslash  processing  is  applied  to
2664       verb  names  and  only  an unescaped closing parenthesis terminates the
2665       name. However, the only backslash items that are permitted are \Q,  \E,
2666       and  sequences such as \x{100} that define character code points. Char‐
2667       acter type escapes such as \d are faulted.
2668
2669       A closing parenthesis can be included in a name either as \) or between
2670       \Q  and  \E. In addition to backslash processing, if the PCRE2_EXTENDED
2671       option is also set, unescaped whitespace in verb names is skipped,  and
2672       #-comments  are  recognized,  exactly  as  in  the rest of the pattern.
2673       PCRE2_EXTENDED does not affect verb names unless PCRE2_ALT_VERBNAMES is
2674       also set.
2675
2676       The  maximum  length of a name is 255 in the 8-bit library and 65535 in
2677       the 16-bit and 32-bit libraries. If the name is empty, that is, if  the
2678       closing  parenthesis immediately follows the colon, the effect is as if
2679       the colon were not there. Any number of these verbs may occur in a pat‐
2680       tern.
2681
2682       Since  these  verbs  are  specifically related to backtracking, most of
2683       them can be used only when the pattern is to be matched using the  tra‐
2684       ditional matching function, because it uses a backtracking algorithm.
2685       With the exception of (*FAIL), which behaves like  a  failing  negative
2686       assertion, the backtracking control verbs cause an error if encountered
2687       by the DFA matching function.
2688
2689       The behaviour of these verbs in repeated  groups,  assertions,  and  in
2690       subpatterns called as subroutines (whether or not recursively) is docu‐
2691       mented below.
2692
2693   Optimizations that affect backtracking verbs
2694
2695       PCRE2 contains some optimizations that are used to speed up matching by
2696       running some checks at the start of each match attempt. For example, it
2697       may know the minimum length of matching subject, or that  a  particular
2698       character must be present. When one of these optimizations bypasses the
2699       running of a match,  any  included  backtracking  verbs  will  not,  of
2700       course, be processed. You can suppress the start-of-match optimizations
2701       by setting the PCRE2_NO_START_OPTIMIZE option when  calling  pcre2_com‐
2702       pile(),  or by starting the pattern with (*NO_START_OPT). There is more
2703       discussion of this option in the section entitled "Compiling a pattern"
2704       in the pcre2api documentation.
2705
2706       Experiments  with  Perl  suggest that it too has similar optimizations,
2707       sometimes leading to anomalous results.
2708
2709   Verbs that act immediately
2710
2711       The following verbs act as soon as they are encountered. They  may  not
2712       be followed by a name.
2713
2714          (*ACCEPT)
2715
2716       This  verb causes the match to end successfully, skipping the remainder
2717       of the pattern. However, when it is inside a subpattern that is  called
2718       as  a  subroutine, only that subpattern is ended successfully. Matching
2719       then continues at the outer level. If (*ACCEPT) is triggered in a posi‐
2720       tive  assertion,  the  assertion succeeds; in a negative assertion, the
2721       assertion fails.
2722
2723       If (*ACCEPT) is inside capturing parentheses, the data so far  is  cap‐
2724       tured. For example:
2725
2726         A((?:A|B(*ACCEPT)|C)D)
2727
2728       This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap‐
2729       tured by the outer parentheses.
2730
2731         (*FAIL) or (*F)
2732
2733       This verb causes a matching failure, forcing backtracking to occur.  It
2734       is  equivalent to (?!) but easier to read. The Perl documentation notes
2735       that it is probably useful only when combined  with  (?{})  or  (??{}).
2736       Those  are, of course, Perl features that are not present in PCRE2. The
2737       nearest equivalent is the callout feature, as for example in this  pat‐
2738       tern:
2739
2740         a+(?C)(*FAIL)
2741
2742       A  match  with the string "aaaa" always fails, but the callout is taken
2743       before each backtrack happens (in this example, 10 times).
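
       The figure of 10 can be checked with a short Python model of the
       matching process. The function below is hypothetical and assumes what
       the example assumes: the pattern is unanchored, no optimization skips
       a match attempt, and a+ first takes the longest run and then gives
       back one character at a time.

```python
def count_callouts(subject):
    """Count how often (?C) is reached for a+(?C)(*FAIL) on subject."""
    calls = 0
    for start in range(len(subject)):
        # Length of the run of "a" characters available at this start.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ tries run, run - 1, ..., 1 characters. Each time, the callout
        # is reached and then (*FAIL) forces a backtrack.
        for _length in range(run, 0, -1):
            calls += 1
    return calls
```

       For "aaaa" the four starting positions contribute 4, 3, 2, and 1
       callouts respectively.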
2744
2745   Recording which path was taken
2746
2747       There is one verb whose main purpose  is  to  track  how  a  match  was
2748       arrived  at,  though  it  also  has a secondary use in conjunction with
2749       advancing the match starting point (see (*SKIP) below).
2750
2751         (*MARK:NAME) or (*:NAME)
2752
2753       A name is always  required  with  this  verb.  There  may  be  as  many
2754       instances  of  (*MARK) as you like in a pattern, and their names do not
2755       have to be unique.
2756
2757       When a match succeeds, the name of the  last-encountered  (*MARK:NAME),
2758       (*PRUNE:NAME),  or  (*THEN:NAME) on the matching path is passed back to
2759       the caller as described in  the  section  entitled  "Other  information
2760       about  the  match" in the pcre2api documentation. Here is an example of
2761       pcre2test output, where the "mark" modifier requests the retrieval  and
2762       outputting of (*MARK) data:
2763
2764           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
2765         data> XY
2766          0: XY
2767         MK: A
2768         data> XZ
2769          0: XZ
2770         MK: B
2771
2772       The (*MARK) name is tagged with "MK:" in this output, and in this exam‐
2773       ple it indicates which of the two alternatives matched. This is a  more
2774       efficient  way of obtaining this information than putting each alterna‐
2775       tive in its own capturing parentheses.
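
       For comparison, the capturing-parentheses approach mentioned above
       looks like this in Python's re module, which has no (*MARK) verb. The
       group names A and B and the helper function are chosen for the
       example only.

```python
import re

# Each alternative gets its own named capturing group, so the group that
# took part in the match identifies the alternative.
ALTERNATIVES = re.compile(r"(?P<A>XY)|(?P<B>XZ)")


def which_alternative(subject):
    """Return the name of the alternative that matched, or None."""
    m = ALTERNATIVES.search(subject)
    return None if m is None else m.lastgroup
```

       This reports the same information as the MK: lines in the pcre2test
       output, at the cost of one extra capturing group per alternative.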
2776
2777       If a verb with a name is encountered in a positive  assertion  that  is
2778       true,  the  name  is recorded and passed back if it is the last-encoun‐
2779       tered. This does not happen for negative assertions or failing positive
2780       assertions.
2781
2782       After  a  partial match or a failed match, the last encountered name in
2783       the entire match process is returned. For example:
2784
2785           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
2786         data> XP
2787         No match, mark = B
2788
2789       Note that in this unanchored example the  mark  is  retained  from  the
2790       match attempt that started at the letter "X" in the subject. Subsequent
2791       match attempts starting at "P" and then with an empty string do not get
2792       as far as the (*MARK) item, but nevertheless do not reset it.
2793
2794       If  you  are  interested  in  (*MARK)  values after failed matches, you
2795       should probably set the PCRE2_NO_START_OPTIMIZE option (see  above)  to
2796       ensure that the match is always attempted.
2797
2798   Verbs that act after backtracking
2799
2800       The following verbs do nothing when they are encountered. Matching con‐
2801       tinues with what follows, but if there is no subsequent match,  causing
2802       a  backtrack  to  the  verb, a failure is forced. That is, backtracking
2803       cannot pass to the left of the verb. However, when one of  these  verbs
2804       appears inside an atomic group (which includes any group that is called
2805       as a subroutine) or in an assertion that is true, its  effect  is  con‐
2806       fined  to that group, because once the group has been matched, there is
2807       never any backtracking into it. In this situation, backtracking has  to
2808       jump to the left of the entire atomic group or assertion.
2809
2810       These  verbs  differ  in exactly what kind of failure occurs when back‐
2811       tracking reaches them. The behaviour described below  is  what  happens
2812       when  the  verb is not in a subroutine or an assertion. Subsequent sec‐
2813       tions cover these special cases.
2814
2815         (*COMMIT)
2816
2817       This verb, which may not be followed by a name, causes the whole  match
2818       to fail outright if there is a later matching failure that causes back‐
2819       tracking to reach it. Even if the pattern  is  unanchored,  no  further
2820       attempts to find a match by advancing the starting point take place. If
2821       (*COMMIT) is the only backtracking verb that is  encountered,  once  it
2822       has  been  passed  pcre2_match() is committed to finding a match at the
2823       current starting point, or not at all. For example:
2824
2825         a+(*COMMIT)b
2826
2827       This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
2828       of dynamic anchor, or "I've started, so I must finish." The name of the
2829       most recently passed (*MARK) in the path is passed back when  (*COMMIT)
2830       forces a match failure.
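
       The behaviour of this example can be modelled in a few lines of
       Python. The function match_commit is a hypothetical hand-written
       model of a+(*COMMIT)b, not a call into PCRE2; match_plain uses
       Python's re module on the plain pattern a+b to show what happens when
       the starting point is allowed to advance.

```python
import re


def match_commit(subject):
    """Model a+(*COMMIT)b on subject; return (start, end) or None."""
    for start in range(len(subject)):
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end == start:
            continue  # a+ failed before (*COMMIT) was reached: bump along
        # (*COMMIT) has now been passed, with a+ holding the whole run.
        if end < len(subject) and subject[end] == "b":
            return (start, end + 1)
        # The failure backtracks onto (*COMMIT): the whole match fails and
        # no further starting points are tried.
        return None
    return None


def match_plain(subject):
    """Span of the first match of a+b, or None."""
    m = re.search(r"a+b", subject)
    return None if m is None else m.span()
```

       In this model "xxaab" matches either way, whereas "aacaab" is matched
       by the plain pattern at a later starting point but is rejected once
       (*COMMIT) has been passed.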
2831
2832       If  there  is more than one backtracking verb in a pattern, a different
2833       one that follows (*COMMIT) may be triggered first,  so  merely  passing
2834       (*COMMIT) during a match does not always guarantee that a match must be
2835       at this starting point.
2836
2837       Note that (*COMMIT) at the start of a pattern is not  the  same  as  an
2838       anchor,  unless PCRE2's start-of-match optimizations are turned off, as
2839       shown in this output from pcre2test:
2840
2841           re> /(*COMMIT)abc/
2842         data> xyzabc
2843          0: abc
2844         data>
2845           re> /(*COMMIT)abc/no_start_optimize
2846         data> xyzabc
2847         No match
2848
2849       For the first pattern, PCRE2 knows that any match must start with  "a",
2850       so  the optimization skips along the subject to "a" before applying the
2851       pattern to the first set of data. The match attempt then succeeds.  The
2852       second  pattern disables the optimization that skips along to the first
2853       character. The pattern is now applied  starting  at  "x",  and  so  the
2854       (*COMMIT)  causes  the  match to fail without trying any other starting
2855       points.
2856
2857         (*PRUNE) or (*PRUNE:NAME)
2858
2859       This verb causes the match to fail at the current starting position  in
2860       the subject if there is a later matching failure that causes backtrack‐
2861       ing to reach it. If the pattern is unanchored, the  normal  "bumpalong"
2862       advance  to  the next starting character then happens. Backtracking can
2863       occur as usual to the left of (*PRUNE), before it is reached,  or  when
2864       matching  to  the  right  of  (*PRUNE), but if there is no match to the
2865       right, backtracking cannot cross (*PRUNE). In simple cases, the use  of
2866       (*PRUNE)  is just an alternative to an atomic group or possessive quan‐
2867       tifier, but there are some uses of (*PRUNE) that cannot be expressed in
2868       any  other  way. In an anchored pattern (*PRUNE) has the same effect as
2869       (*COMMIT).
2870
2871       The behaviour of (*PRUNE:NAME) is not the same as
2872       (*MARK:NAME)(*PRUNE).   It  is  like  (*MARK:NAME)  in that the name is
2873       remembered for  passing  back  to  the  caller.  However,  (*SKIP:NAME)
2874       searches  only  for  names  set  with  (*MARK),  ignoring  those set by
2875       (*PRUNE) or (*THEN).
2876
2877         (*SKIP)
2878
2879       This verb, when given without a name, is like (*PRUNE), except that  if
2880       the  pattern  is unanchored, the "bumpalong" advance is not to the next
2881       character, but to the position in the subject where (*SKIP) was encoun‐
2882       tered.  (*SKIP)  signifies that whatever text was matched leading up to
2883       it cannot be part of a successful match. Consider:
2884
2885         a+(*SKIP)b
2886
2887       If the subject is "aaaac...",  after  the  first  match  attempt  fails
2888       (starting  at  the  first  character in the string), the starting point
2889       skips on to start the next attempt at "c". Note that a possessive quan‐
2890       tifier does not have the same effect as this example; although it would
2891       suppress backtracking  during  the  first  match  attempt,  the  second
2892       attempt  would  start at the second character instead of skipping on to
2893       "c".
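
       The difference in starting positions can be made visible with a small
       Python model. The function below is hypothetical and hand-written; it
       records the offsets at which a match attempt begins for a+(*SKIP)b
       (use_skip=True) or for the possessive a++b (use_skip=False). For
       brevity it omits the final attempt at the very end of the subject.

```python
def start_positions(subject, use_skip):
    """Return (offsets tried, match span or None) for the modelled pattern."""
    tried = []
    start = 0
    while start < len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end > start and end < len(subject) and subject[end] == "b":
            return tried, (start, end + 1)
        if use_skip and end > start:
            # (*SKIP) was reached at offset end: the next attempt starts
            # there instead of at start + 1.
            start = end
        else:
            # Normal bumpalong to the next character.
            start += 1
    return tried, None
```

       For the subject "aaaac" the model with (*SKIP) tries offsets 0 and 4
       only, while the possessive version tries every offset from 0 to 4.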
2894
2895         (*SKIP:NAME)
2896
2897       When (*SKIP) has an associated name, its behaviour is modified. When it
2898       is triggered, the previous path through the pattern is searched for the
2899       most recent (*MARK) that has the  same  name.  If  one  is  found,  the
2900       "bumpalong" advance is to the subject position that corresponds to that
2901       (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with
2902       a matching name is found, the (*SKIP) is ignored.
2903
2904       Note  that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
2905       ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).
2906
2907         (*THEN) or (*THEN:NAME)
2908
2909       This verb causes a skip to the next innermost  alternative  when  back‐
2910       tracking  reaches  it.  That  is,  it  cancels any further backtracking
2911       within the current alternative. Its name  comes  from  the  observation
2912       that it can be used for a pattern-based if-then-else block:
2913
2914         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
2915
2916       If  the COND1 pattern matches, FOO is tried (and possibly further items
2917       after the end of the group if FOO succeeds); on  failure,  the  matcher
2918       skips  to  the second alternative and tries COND2, without backtracking
2919       into COND1. If that succeeds and BAR fails, COND3 is tried.  If  subse‐
2920       quently  BAZ fails, there are no more alternatives, so there is a back‐
2921       track to whatever came before the  entire  group.  If  (*THEN)  is  not
2922       inside an alternation, it acts like (*PRUNE).
2923
2924       The   behaviour   of   (*THEN:NAME)   is    not    the    same    as
2925       (*MARK:NAME)(*THEN).  It is like  (*MARK:NAME)  in  that  the  name  is
2926       remembered  for  passing  back  to  the  caller.  However, (*SKIP:NAME)
2927       searches only for  names  set  with  (*MARK),  ignoring  those  set  by
2928       (*PRUNE) and (*THEN).
2929
2930       A  subpattern that does not contain a | character is just a part of the
2931       enclosing alternative; it is not a nested  alternation  with  only  one
2932       alternative.  The effect of (*THEN) extends beyond such a subpattern to
2933       the enclosing alternative. Consider this pattern, where A, B, etc.  are
2934       complex  pattern fragments that do not contain any | characters at this
2935       level:
2936
2937         A (B(*THEN)C) | D
2938
2939       If A and B are matched, but there is a failure in C, matching does  not
2940       backtrack into A; instead it moves to the next alternative, that is, D.
2941       However, if the subpattern containing (*THEN) is given an  alternative,
2942       it behaves differently:
2943
2944         A (B(*THEN)C | (*FAIL)) | D
2945
2946       The  effect of (*THEN) is now confined to the inner subpattern. After a
2947       failure in C, matching moves to (*FAIL), which causes the whole subpat‐
2948       tern  to  fail  because  there are no more alternatives to try. In this
2949       case, matching does now backtrack into A.
2950
2951       Note that a conditional subpattern is  not  considered  as  having  two
2952       alternatives,  because  only  one  is  ever used. In other words, the |
2953       character in a conditional subpattern has a different meaning. Ignoring
2954       white space, consider:
2955
2956         ^.*? (?(?=a) a | b(*THEN)c )
2957
2958       If  the  subject  is  "ba", this pattern does not match. Because .*? is
2959       ungreedy, it initially matches zero  characters.  The  condition  (?=a)
2960       then  fails,  the  character  "b"  is  matched, but "c" is not. At this
2961       point, matching does not backtrack to .*? as might perhaps be  expected
2962       from  the  presence  of  the | character. The conditional subpattern is
2963       part of the single alternative that comprises the whole pattern, and so
2964       the  match  fails.  (If  there was a backtrack into .*?, allowing it to
2965       match "b", the match would succeed.)
2966
2967       The verbs just described provide four different "strengths" of  control
2968       when subsequent matching fails. (*THEN) is the weakest, carrying on the
2969       match at the next alternative. (*PRUNE) comes next, failing  the  match
2970       at  the  current starting position, but allowing an advance to the next
2971       character (for an unanchored pattern). (*SKIP) is similar, except  that
2972       the advance may be more than one character. (*COMMIT) is the strongest,
2973       causing the entire match to fail.
2974
2975   More than one backtracking verb
2976
2977       If more than one backtracking verb is present in  a  pattern,  the  one
2978       that  is  backtracked  onto first acts. For example, consider this pat‐
2979       tern, where A, B, etc. are complex pattern fragments:
2980
2981         (A(*COMMIT)B(*THEN)C|ABD)
2982
2983       If A matches but B fails, the backtrack to (*COMMIT) causes the  entire
2984       match to fail. However, if A and B match, but C fails, the backtrack to
2985       (*THEN) causes the next alternative (ABD) to be tried.  This  behaviour
2986       is  consistent,  but is not always the same as Perl's. It means that if
2987       two or more backtracking verbs appear in succession, all but the  last
2988       of them have no effect. Consider this example:
2989
2990         ...(*COMMIT)(*PRUNE)...
2991
2992       If there is a matching failure to the right, backtracking onto (*PRUNE)
2993       causes it to be triggered, and its action is taken. There can never  be
2994       a backtrack onto (*COMMIT).
2995
2996   Backtracking verbs in repeated groups
2997
2998       PCRE2  differs  from  Perl  in  its  handling  of backtracking verbs in
2999       repeated groups. For example, consider:
3000
3001         /(a(*COMMIT)b)+ac/
3002
3003       If the subject is "abac", Perl matches, but  PCRE2  fails  because  the
3004       (*COMMIT) in the second repeat of the group acts.
3005
3006   Backtracking verbs in assertions
3007
3008       (*FAIL)  in  an assertion has its normal effect: it forces an immediate
3009       backtrack.
3010
3011       (*ACCEPT) in a positive assertion causes the assertion to succeed with‐
3012       out  any  further processing. In a negative assertion, (*ACCEPT) causes
3013       the assertion to fail without any further processing.
3014
3015       The other backtracking verbs are not treated specially if  they  appear
3016       in  a  positive  assertion.  In  particular,  (*THEN) skips to the next
3017       alternative in the innermost enclosing  group  that  has  alternations,
3018       whether or not this is within the assertion.
3019
3020       Negative  assertions  are,  however, different, in order to ensure that
3021       changing a positive assertion into a  negative  assertion  changes  its
3022       result. Backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a neg‐
3023       ative assertion to be true, without considering any further alternative
3024       branches in the assertion.  Backtracking into (*THEN) causes it to skip
3025       to the next enclosing alternative within the assertion (the normal  be‐
3026       haviour),  but  if  the  assertion  does  not have such an alternative,
3027       (*THEN) behaves like (*PRUNE).
3028
3029   Backtracking verbs in subroutines
3030
3031       These behaviours occur whether or not the subpattern is  called  recur‐
3032       sively.  Perl's treatment of subroutines is different in some cases.
3033
3034       (*FAIL)  in  a subpattern called as a subroutine has its normal effect:
3035       it forces an immediate backtrack.
3036
3037       (*ACCEPT) in a subpattern called as a subroutine causes the  subroutine
3038       match  to succeed without any further processing. Matching then contin‐
3039       ues after the subroutine call.
3040
3041       (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine
3042       cause the subroutine match to fail.
3043
3044       (*THEN)  skips to the next alternative in the innermost enclosing group
3045       within the subpattern that has alternatives. If there is no such  group
3046       within the subpattern, (*THEN) causes the subroutine match to fail.
3047

SEE ALSO

3049
3050       pcre2api(3),    pcre2callout(3),    pcre2matching(3),   pcre2syntax(3),
3051       pcre2(3).
3052

AUTHOR

3054
3055       Philip Hazel
3056       University Computing Service
3057       Cambridge, England.
3058

REVISION

3060
3061       Last updated: 27 December 2016
3062       Copyright (c) 1997-2016 University of Cambridge.
3063
3064
3065
3066PCRE2 10.23                    27 December 2016                PCRE2PATTERN(3)