re(3)                      Erlang Module Definition                      re(3)

NAME

6       re - Perl-like regular expressions for Erlang.
7

DESCRIPTION

9       This  module contains regular expression matching functions for strings
10       and binaries.
11
       The regular expression syntax and semantics resemble those of Perl.
13
14       The matching algorithms of the library are based on the  PCRE  library,
15       but  not  all  of  the PCRE library is interfaced and some parts of the
16       library go  beyond  what  PCRE  offers.  Currently  PCRE  version  8.40
17       (release  date 2017-01-11) is used. The sections of the PCRE documenta‐
18       tion that are relevant to this module are included here.
19
20   Note:
21       The Erlang literal syntax for strings uses the "\" (backslash)  charac‐
22       ter  as  an  escape  code.  You  need  to escape backslashes in literal
23       strings, both in your code and in the shell, with an  extra  backslash,
24       that is, "\\".
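
       As a small illustration of the note above (the subject string is
       made up for the example), the regular expression \d+ has to be
       written with a doubled backslash in Erlang source:

```erlang
%% The pattern \d+ needs a doubled backslash inside an Erlang string literal.
{match, [Digits]} = re:run("order 42", "\\d+", [{capture, first, list}]).
%% Digits is bound to "42".
```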
25
26

DATA TYPES

28       mp() = {re_pattern, term(), term(), term(), term()}
29
30              Opaque  data type containing a compiled regular expression. mp()
31              is guaranteed to be a tuple() having the atom re_pattern as  its
32              first element, to allow for matching in guards. The arity of the
33              tuple or the content of the other fields can  change  in  future
34              Erlang/OTP releases.
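
              A minimal sketch of such a guard test follows. The fun
              IsCompiled is a hypothetical helper, not part of this module,
              and it inspects only the re_pattern tag, never the arity or
              the other fields:

```erlang
%% Hypothetical helper: true for a compiled pattern, false for anything else.
%% Only the re_pattern tag in the first element is inspected.
IsCompiled = fun(T) when is_tuple(T), element(1, T) =:= re_pattern -> true;
                (_) -> false
             end.
{ok, MP} = re:compile("a+").
CompiledCheck = IsCompiled(MP).
StringCheck = IsCompiled("a+").
%% CompiledCheck is true, StringCheck is false.
```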
35
36       nl_spec() = cr | crlf | lf | anycrlf | any
37
38       compile_option() =
39           unicode |
40           anchored |
41           caseless |
42           dollar_endonly |
43           dotall |
44           extended |
45           firstline |
46           multiline |
47           no_auto_capture |
48           dupnames |
49           ungreedy |
50           {newline, nl_spec()} |
51           bsr_anycrlf |
52           bsr_unicode |
53           no_start_optimize |
54           ucp |
55           never_utf
56

EXPORTS

58       version() -> binary()
59
              Returns a binary containing the version of the PCRE library
              that was used when Erlang/OTP was compiled.
62
63       compile(Regexp) -> {ok, MP} | {error, ErrSpec}
64
65              Types:
66
67                 Regexp = iodata()
68                 MP = mp()
69                 ErrSpec =
70                     {ErrString :: string(), Position :: integer() >= 0}
71
              The same as compile(Regexp,[]).
73
74       compile(Regexp, Options) -> {ok, MP} | {error, ErrSpec}
75
76              Types:
77
78                 Regexp = iodata() | unicode:charlist()
79                 Options = [Option]
80                 Option = compile_option()
81                 MP = mp()
82                 ErrSpec =
83                     {ErrString :: string(), Position :: integer() >= 0}
84
85              Compiles a regular expression, with the syntax described  below,
86              into an internal format to be used later as a parameter to run/2
87              and run/3.
88
89              Compiling the regular expression before matching  is  useful  if
90              the  same  expression is to be used in matching against multiple
91              subjects during the lifetime of the program. Compiling once  and
92              executing  many  times is far more efficient than compiling each
93              time one wants to match.
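
              The compile-once pattern can be sketched as follows. The
              pattern and the subject list are invented for the example:

```erlang
%% Compile once, then reuse the compiled pattern for every subject.
{ok, Lower} = re:compile("^[a-z]+$").
Subjects = ["abc", "ABC", "x1", "erlang"].
Accepted = [S || S <- Subjects,
                 re:run(S, Lower, [{capture, none}]) =:= match].
%% Accepted is ["abc","erlang"].
```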
94
95              When option unicode is specified, the regular expression  is  to
96              be  specified  as  a  valid Unicode charlist(), otherwise as any
97              valid iodata().
98
99              Options:
100
101                unicode:
102                  The regular expression is specified as a Unicode  charlist()
103                  and  the  resulting  regular  expression  code  is to be run
104                  against a valid Unicode charlist()  subject.  Also  consider
105                  option ucp when using Unicode characters.
106
107                anchored:
108                  The  pattern is forced to be "anchored", that is, it is con‐
109                  strained to match only at the first matching  point  in  the
110                  string  that is searched (the "subject string"). This effect
111                  can also be achieved by appropriate constructs in  the  pat‐
112                  tern itself.
113
114                caseless:
115                  Letters  in  the  pattern match both uppercase and lowercase
116                  letters. It is equivalent to  Perl  option  /i  and  can  be
117                  changed within a pattern by a (?i) option setting. Uppercase
118                  and lowercase letters are defined as in the ISO 8859-1 char‐
119                  acter set.
120
121                dollar_endonly:
122                  A  dollar  metacharacter  in the pattern matches only at the
123                  end of the subject string. Without  this  option,  a  dollar
124                  also  matches immediately before a newline at the end of the
125                  string (but not before any other newlines). This  option  is
126                  ignored if option multiline is specified. There is no equiv‐
127                  alent option in Perl, and it cannot be set within a pattern.
128
129                dotall:
130                  A dot in the pattern matches all characters, including those
131                  indicating  newline.  Without  it, a dot does not match when
132                  the current position is at a newline. This option is equiva‐
133                  lent  to  Perl option /s and it can be changed within a pat‐
134                  tern by a (?s) option setting. A  negative  class,  such  as
135                  [^a],  always matches newline characters, independent of the
136                  setting of this option.
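
                  A short sketch of the difference, using a made-up
                  three-character subject:

```erlang
%% Without dotall the dot refuses to match the newline between "a" and "b".
Plain = re:run("a\nb", "a.b").
%% With dotall the dot matches the newline as well.
WithDotall = re:run("a\nb", "a.b", [dotall]).
%% Plain is nomatch, WithDotall is {match,[{0,3}]}.
```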
137
138                extended:
139                  If this option is set, most white space  characters  in  the
140                  pattern  are totally ignored except when escaped or inside a
141                  character class. However, white space is not allowed  within
142                  sequences  such  as (?> that introduce various parenthesized
143                  subpatterns, nor  within  a  numerical  quantifier  such  as
144                  {1,3}.  However,  ignorable white space is permitted between
145                  an item and a following quantifier and between a  quantifier
146                  and a following + that indicates possessiveness.
147
                  White space formerly did not include the VT character (code
149                  11), because Perl did not  treat  this  character  as  white
150                  space.  However,  Perl changed at release 5.18, so PCRE fol‐
151                  lowed at release 8.34, and VT is now treated as white space.
152
153                  This also causes characters between an unescaped # outside a
154                  character  class  and  the  next  newline,  inclusive, to be
155                  ignored. This is equivalent to Perl's /x option, and it  can
156                  be changed within a pattern by a (?x) option setting.
157
158                  With  this  option, comments inside complicated patterns can
159                  be included. However, notice that this applies only to  data
160                  characters.  Whitespace  characters  can never appear within
161                  special character sequences in a pattern, for example within
162                  sequence (?( that introduces a conditional subpattern.
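
                  As an illustration, the following sketch lays out a
                  hypothetical "YYYY-MM" pattern over several lines with
                  comments. The pattern and subject are assumptions made for
                  the example only:

```erlang
%% Hypothetical pattern for a "YYYY-MM" date, written with layout and comments.
%% Each "\n" ends the comment that precedes it.
Pattern = "(\\d{4})   # year\n"
          "-          # separator\n"
          "(\\d{2})   # month".
{match, Parts} = re:run("2017-01", Pattern,
                        [extended, {capture, all_but_first, list}]).
%% Parts is ["2017","01"].
```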
163
164                firstline:
165                  An  unanchored pattern is required to match before or at the
166                  first newline in the subject string,  although  the  matched
167                  text can continue over the newline.
168
169                multiline:
170                  By  default, PCRE treats the subject string as consisting of
171                  a single line of characters (even if it contains  newlines).
172                  The  "start  of  line" metacharacter (^) matches only at the
173                  start of the string, while the "end of  line"  metacharacter
174                  ($)  matches only at the end of the string, or before a ter‐
175                  minating newline (unless  option  dollar_endonly  is  speci‐
176                  fied). This is the same as in Perl.
177
178                  When  this option is specified, the "start of line" and "end
179                  of line" constructs match immediately following  or  immedi‐
180                  ately  before  internal  newlines  in  the  subject  string,
181                  respectively, as well as at the very start and end. This  is
182                  equivalent  to  Perl  option  /m and can be changed within a
183                  pattern by a (?m) option setting. If there are  no  newlines
184                  in  a  subject string, or no occurrences of ^ or $ in a pat‐
185                  tern, setting multiline has no effect.
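
                  The effect can be seen in this sketch, which runs the same
                  pattern with and without multiline on a two-line subject
                  chosen for the example:

```erlang
Subject = "a\nb".
%% Default: ^ and $ only apply to the subject as a whole, so nothing matches.
Single = re:run(Subject, "^\\w$", [global, {capture, first, list}]).
%% multiline: ^ and $ also match around the internal newline.
Multi = re:run(Subject, "^\\w$", [multiline, global, {capture, first, list}]).
%% Single is nomatch, Multi is {match,[["a"],["b"]]}.
```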
186
187                no_auto_capture:
188                  Disables the use of numbered capturing  parentheses  in  the
189                  pattern.  Any  opening parenthesis that is not followed by ?
190                  behaves as if it is followed by ?:.  Named  parentheses  can
191                  still be used for capturing (and they acquire numbers in the
192                  usual way). There is no equivalent option in Perl.
193
194                dupnames:
195                  Names used to identify capturing  subpatterns  need  not  be
196                  unique.  This  can  be  helpful for certain types of pattern
197                  when it is known that only one instance of the named subpat‐
198                  tern  can ever be matched. More details of named subpatterns
199                  are provided below.
200
201                ungreedy:
202                  Inverts the "greediness" of the quantifiers so that they are
203                  not greedy by default, but become greedy if followed by "?".
204                  It is not compatible with Perl. It can also be set by a (?U)
205                  option setting within the pattern.
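
                  A minimal sketch comparing the default and the ungreedy
                  behavior on an invented subject:

```erlang
Opts = [{capture, first, list}].
%% Default: .+ takes as much as it can.
{match, [Greedy]} = re:run("<a><b>", "<.+>", Opts).
%% ungreedy: .+ takes as little as it can.
{match, [Lazy]} = re:run("<a><b>", "<.+>", [ungreedy | Opts]).
%% Greedy is "<a><b>", Lazy is "<a>".
```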
206
207                {newline, NLSpec}:
208                  Overrides the default definition of a newline in the subject
209                  string, which is LF (ASCII 10) in Erlang.
210
211                  cr:
                    Newline is indicated by a single character CR (ASCII 13).
213
214                  lf:
215                    Newline is indicated by a single character LF (ASCII  10),
216                    the default.
217
218                  crlf:
219                    Newline  is  indicated by the two-character CRLF (ASCII 13
220                    followed by ASCII 10) sequence.
221
222                  anycrlf:
223                    Any of the three preceding sequences is to be recognized.
224
225                  any:
226                    Any of  the  newline  sequences  above,  and  the  Unicode
227                    sequences   VT   (vertical  tab,  U+000B),  FF  (formfeed,
228                    U+000C), NEL (next  line,  U+0085),  LS  (line  separator,
229                    U+2028), and PS (paragraph separator, U+2029).
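
                  The following sketch shows how the newline definition
                  changes where $ can match. The subject, which uses a CRLF
                  line ending, is made up for the example:

```erlang
Subject = "a\r\nb".
%% Default newline is LF: a CR follows "a", so $ cannot match there.
LfResult = re:run(Subject, "a$", [multiline]).
%% With CRLF as the newline sequence, $ matches right after "a".
CrlfResult = re:run(Subject, "a$", [multiline, {newline, crlf}]).
%% LfResult is nomatch, CrlfResult is {match,[{0,1}]}.
```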
230
231                bsr_anycrlf:
                  Specifies that \R is to match only the CR, LF, or CRLF
                  sequences, not the Unicode-specific newline characters.
235
236                bsr_unicode:
                  Specifies that \R is to match all the Unicode newline
                  characters (including CRLF, and so on). This is the default.
239
240                no_start_optimize:
241                  Disables  optimization  that  can  malfunction  if  "Special
242                  start-of-pattern  items"  are present in the regular expres‐
243                  sion. A typical example  would  be  when  matching  "DEFABC"
244                  against "(*COMMIT)ABC", where the start optimization of PCRE
245                  would skip the subject up to "A" and never realize that  the
                  (*COMMIT) instruction should have made the match fail.
247                  This option is only relevant if  you  use  "start-of-pattern
248                  items",  as  discussed  in  section  PCRE Regular Expression
249                  Details.
250
251                ucp:
252                  Specifies that Unicode character properties are to  be  used
253                  when  resolving  \B,  \b, \D, \d, \S, \s, \W and \w. Without
254                  this flag, only ISO Latin-1 properties are used. Using  Uni‐
255                  code  properties hurts performance, but is semantically cor‐
256                  rect when working with Unicode  characters  beyond  the  ISO
257                  Latin-1 range.
258
259                never_utf:
260                  Specifies  that  the (*UTF) and/or (*UTF8) "start-of-pattern
261                  items" are forbidden. This  flag  cannot  be  combined  with
262                  option  unicode.  Useful  if  ISO  Latin-1  patterns from an
263                  external source are to be compiled.
264
265       inspect(MP, Item) -> {namelist, [binary()]}
266
267              Types:
268
269                 MP = mp()
270                 Item = namelist
271
272              Takes a compiled regular expression and an item, and returns the
273              relevant  data  from  the regular expression. The only supported
274              item  is  namelist,   which   returns   the   tuple   {namelist,
275              [binary()]},  containing the names of all (unique) named subpat‐
276              terns in the regular expression. For example:
277
              1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
              {ok,{re_pattern,3,0,0,
                              <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
                                255,255,...>>}}
              2> re:inspect(MP,namelist).
              {namelist,[<<"A">>,<<"B">>,<<"C">>]}
              3> {ok,MPD} = re:compile("(?<C>A)|(?<B>B)|(?<C>C)",[dupnames]).
              {ok,{re_pattern,3,0,0,
                              <<69,82,67,80,119,0,0,0,0,0,8,0,1,0,0,0,255,255,255,255,
                                255,255,...>>}}
              4> re:inspect(MPD,namelist).
              {namelist,[<<"B">>,<<"C">>]}
290
291              Notice in the second example that the duplicate name only occurs
292              once  in the returned list, and that the list is in alphabetical
293              order regardless of where the names are positioned in the  regu‐
294              lar  expression. The order of the names is the same as the order
295              of captured subexpressions if {capture, all_names} is  specified
296              as  an option to run/3. You can therefore create a name-to-value
297              mapping from the result of run/3 like this:
298
              1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
              {ok,{re_pattern,3,0,0,
                              <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
                                255,255,...>>}}
              2> {namelist, N} = re:inspect(MP,namelist).
              {namelist,[<<"A">>,<<"B">>,<<"C">>]}
              3> {match,L} = re:run("AA",MP,[{capture,all_names,binary}]).
              {match,[<<"A">>,<<>>,<<>>]}
              4> NameMap = lists:zip(N,L).
              [{<<"A">>,<<"A">>},{<<"B">>,<<>>},{<<"C">>,<<>>}]
309
310       replace(Subject, RE, Replacement) -> iodata() | unicode:charlist()
311
312              Types:
313
314                 Subject = iodata() | unicode:charlist()
315                 RE = mp() | iodata()
316                 Replacement = iodata() | unicode:charlist()
317
318              Same as replace(Subject, RE, Replacement, []).
319
320       replace(Subject, RE, Replacement, Options) ->
321                  iodata() | unicode:charlist()
322
323              Types:
324
325                 Subject = iodata() | unicode:charlist()
326                 RE = mp() | iodata() | unicode:charlist()
327                 Replacement = iodata() | unicode:charlist()
328                 Options = [Option]
329                 Option =
330                     anchored |
331                     global |
332                     notbol |
333                     noteol |
334                     notempty |
335                     notempty_atstart |
336                     {offset, integer() >= 0} |
337                     {newline, NLSpec} |
338                     bsr_anycrlf |
339                     {match_limit, integer() >= 0} |
340                     {match_limit_recursion, integer() >= 0} |
341                     bsr_unicode |
342                     {return, ReturnType} |
343                     CompileOpt
344                 ReturnType = iodata | list | binary
345                 CompileOpt = compile_option()
346                 NLSpec = cr | crlf | lf | anycrlf | any
347
348              Replaces the matched part of the Subject string  with  the  con‐
349              tents of Replacement.
350
351              The  permissible  options are the same as for run/3, except that
352              option capture is not allowed. Instead a {return, ReturnType} is
353              present. The default return type is iodata, constructed in a way
354              to minimize copying. The iodata result can be used  directly  in
355              many  I/O  operations.  If  a  flat  list()  is desired, specify
356              {return,  list}.  If  a  binary  is  desired,  specify  {return,
357              binary}.
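
              A short sketch of the three return types. Since the exact
              nesting of the default iodata result is an implementation
              detail, the example flattens it with iolist_to_binary/1 before
              looking at it:

```erlang
%% Default: iodata whose exact nesting is an implementation detail.
IoData = re:replace("abcd", "c", "X").
Flat = iolist_to_binary(IoData).
%% Explicit return types.
AsList = re:replace("abcd", "c", "X", [{return, list}]).
AsBinary = re:replace("abcd", "c", "X", [{return, binary}]).
%% Flat and AsBinary are <<"abXd">>, AsList is "abXd".
```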
358
359              As  in  function  run/3,  an  mp()  compiled with option unicode
360              requires Subject to be a Unicode charlist(). If  compilation  is
361              done  implicitly and the unicode compilation option is specified
              to this function, both the regular expression and Subject are to
              be specified as valid Unicode charlist()s.
364
              The replacement string can contain the special character &,
              which inserts the whole matching expression in the result, and
              the special sequences \N (where N is an integer > 0), \gN, and
              \g{N}, which insert subexpression number N in the result. If no
              subexpression with that number is generated by the regular
              expression, nothing is inserted.
371
372              To insert an & or a \ in the result, precede it with a \. Notice
373              that  Erlang  already  gives  a  special meaning to \ in literal
374              strings, so a single \ must be written as "\\" and  therefore  a
375              double \ as "\\\\".
376
377              Example:
378
              re:replace("abcd","c","[&]",[{return,list}]).

              gives

              "ab[c]d"

              while

              re:replace("abcd","c","[\\&]",[{return,list}]).

              gives

              "ab[&]d"
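
              Numbered subexpressions and option global can be sketched in
              the same way. The subjects and patterns below are invented for
              the example:

```erlang
%% Reorder the three captured groups of an ISO-style date.
Swapped = re:replace("2017-01-11", "(\\d+)-(\\d+)-(\\d+)", "\\3/\\2/\\1",
                     [{return, list}]).
%% Without global only the first match is replaced; with global, all of them.
Once = re:replace("a-b-c", "-", "+", [{return, list}]).
All = re:replace("a-b-c", "-", "+", [global, {return, list}]).
%% Swapped is "11/01/2017", Once is "a+b-c", All is "a+b+c".
```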
392
393              As  with  run/3,  compilation errors raise the badarg exception.
394              compile/2 can be used to get more information about the error.
395
396       run(Subject, RE) -> {match, Captured} | nomatch
397
398              Types:
399
400                 Subject = iodata() | unicode:charlist()
401                 RE = mp() | iodata()
402                 Captured = [CaptureData]
403                 CaptureData = {integer(), integer()}
404
405              Same as run(Subject,RE,[]).
406
407       run(Subject, RE, Options) ->
408              {match, Captured} | match | nomatch | {error, ErrType}
409
410              Types:
411
412                 Subject = iodata() | unicode:charlist()
413                 RE = mp() | iodata() | unicode:charlist()
414                 Options = [Option]
415                 Option =
416                     anchored |
417                     global |
418                     notbol |
419                     noteol |
420                     notempty |
421                     notempty_atstart |
422                     report_errors |
423                     {offset, integer() >= 0} |
424                     {match_limit, integer() >= 0} |
425                     {match_limit_recursion, integer() >= 0} |
426                     {newline, NLSpec :: nl_spec()} |
427                     bsr_anycrlf |
428                     bsr_unicode |
429                     {capture, ValueSpec} |
430                     {capture, ValueSpec, Type} |
431                     CompileOpt
432                 Type = index | list | binary
433                 ValueSpec =
                     all | all_but_first | all_names | first | none | ValueList
436                 ValueList = [ValueID]
437                 ValueID = integer() | string() | atom()
438                 CompileOpt = compile_option()
439                   See compile/2.
440                 Captured = [CaptureData] | [[CaptureData]]
441                 CaptureData =
442                     {integer(), integer()} | ListConversionData | binary()
443                 ListConversionData =
444                     string() |
445                     {error, string(), binary()} |
446                     {incomplete, string(), binary()}
447                 ErrType =
                     match_limit | match_limit_recursion | {compile, CompileErr}
450                 CompileErr =
451                     {ErrString :: string(), Position :: integer() >= 0}
452
453              Executes   a   regular   expression   matching,   and    returns
454              match/{match,  Captured}  or nomatch. The regular expression can
455              be specified either as iodata() in which case  it  is  automati‐
456              cally  compiled  (as by compile/2) and executed, or as a precom‐
457              piled mp() in which case it  is  executed  against  the  subject
458              directly.
459
460              When  compilation  is  involved, exception badarg is thrown if a
461              compilation error occurs.  Call  compile/2  to  get  information
462              about the location of the error in the regular expression.
463
464              If  the  regular  expression  is previously compiled, the option
465              list can only contain the following options:
466
467                * anchored
468
469                * {capture, ValueSpec}/{capture, ValueSpec, Type}
470
471                * global
472
473                * {match_limit, integer() >= 0}
474
475                * {match_limit_recursion, integer() >= 0}
476
477                * {newline, NLSpec}
478
479                * notbol
480
481                * notempty
482
483                * notempty_atstart
484
485                * noteol
486
487                * {offset, integer() >= 0}
488
489                * report_errors
490
491              Otherwise all options valid  for  function  compile/2  are  also
492              allowed. Options allowed both for compilation and execution of a
493              match, namely anchored and {newline, NLSpec},  affect  both  the
494              compilation and execution if present together with a non-precom‐
495              piled regular expression.
496
497              If the regular expression was previously  compiled  with  option
498              unicode,   Subject   is  to  be  provided  as  a  valid  Unicode
499              charlist(), otherwise any iodata() will do.  If  compilation  is
500              involved  and  option unicode is specified, both Subject and the
501              regular  expression  are  to  be  specified  as  valid   Unicode
              charlist()s.
503
504              {capture,  ValueSpec}/{capture, ValueSpec, Type} defines what to
505              return from the function upon successful matching.  The  capture
506              tuple  can  contain both a value specification, telling which of
507              the captured substrings are to be returned, and a type  specifi‐
508              cation,  telling  how captured substrings are to be returned (as
509              index tuples, lists, or binaries). The options are described  in
510              detail below.
511
512              If  the  capture options describe that no substring capturing is
513              to be done ({capture, none}), the function  returns  the  single
514              atom match upon successful matching, otherwise the tuple {match,
515              ValueList}. Disabling capturing can be done either by specifying
516              none or an empty list as ValueSpec.
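
              A minimal sketch of the two ways to disable capturing, next to
              the default behavior for comparison:

```erlang
%% No capturing requested: the bare atom match is returned.
ByNone = re:run("abc", "b", [{capture, none}]).
ByEmptyList = re:run("abc", "b", [{capture, []}]).
%% Default capturing for comparison.
Default = re:run("abc", "b").
%% ByNone and ByEmptyList are match, Default is {match,[{1,1}]}.
```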
517
518              Option report_errors adds the possibility that an error tuple is
519              returned.  The  tuple  either   indicates   a   matching   error
520              (match_limit  or match_limit_recursion), or a compilation error,
521              where the error tuple has  the  format  {error,  {compile,  Com‐
522              pileErr}}. Notice that if option report_errors is not specified,
523              the function never returns error tuples, but reports compilation
524              errors  as  a  badarg  exception  and  failed matches because of
525              exceeded match limits simply as nomatch.
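
              The difference can be sketched with a deliberately broken
              pattern. The pattern is an assumption made for the example, and
              the error string and position are bound to variables rather
              than spelled out, as their exact values come from PCRE:

```erlang
Bad = "(abc".   %% unbalanced parenthesis, chosen to force a compile error
%% Without report_errors the compilation error surfaces as a badarg exception.
Raised = try re:run("abc", Bad) catch error:badarg -> badarg end.
%% With report_errors the same problem comes back as an error tuple.
{error, {compile, {ErrString, Position}}} = re:run("abc", Bad, [report_errors]).
%% compile/2 returns the error description directly.
{error, {CompileErrString, CompilePosition}} = re:compile(Bad, []).
```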
526
527              The following options are relevant for execution:
528
529                anchored:
530                  Limits run/3 to matching at the first matching position.  If
531                  a  pattern  was  compiled with anchored, or turned out to be
532                  anchored by virtue of its contents, it cannot be made  unan‐
533                  chored  at  matching  time,  hence  there  is  no unanchored
534                  option.
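
                  A short sketch of anchored at matching time, also combined
                  with option offset, on an invented subject:

```erlang
%% Unanchored: the pattern is found at offset 1.
Found = re:run("xabc", "abc").
%% Anchored: only a match starting at the first position (offset 0) counts.
Anchored = re:run("xabc", "abc", [anchored]).
%% Combined with offset, the anchor moves to the given starting offset.
Shifted = re:run("xabc", "abc", [anchored, {offset, 1}]).
%% Found is {match,[{1,3}]}, Anchored is nomatch, Shifted is {match,[{1,3}]}.
```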
535
536                global:
537                  Implements global (repetitive) search (flag g in Perl). Each
538                  match  is  returned as a separate list() containing the spe‐
539                  cific match and any matching subexpressions (or as specified
                  by option capture). The Captured part of the return value is
541                  hence a list() of list()s when this option is specified.
542
543                  The interaction of option global with a  regular  expression
544                  that  matches  an  empty  string  surprises some users. When
545                  option global is specified, run/3 handles empty  matches  in
546                  the  same  way  as Perl: a zero-length match at any point is
547                  also retried with options [anchored,  notempty_atstart].  If
548                  that  search  gives  a  result  of length > 0, the result is
549                  included. Example:
550
551                re:run("cat","(|at)",[global]).
552
553                  The following matchings are performed:
554
555                  At offset 0:
                    The regular expression (|at) first matches at the initial
557                    position   of   string   cat,   giving   the   result  set
558                    [{0,0},{0,0}] (the second {0,0} is because of  the  subex‐
559                    pression  marked by the parentheses). As the length of the
560                    match is 0, we do not advance to the next position yet.
561
562                  At offset 0 with [anchored, notempty_atstart]:
563                    The   search   is   retried   with   options    [anchored,
564                    notempty_atstart]  at  the  same  position, which does not
565                    give any interesting  result  of  longer  length,  so  the
566                    search position is advanced to the next character (a).
567
568                  At offset 1:
569                    The  search  results  in  [{1,0},{1,0}], so this search is
570                    also repeated with the extra options.
571
572                  At offset 1 with [anchored, notempty_atstart]:
                    Alternative at is found and the result is [{1,2},{1,2}].
574                    The  result  is added to the list of results and the posi‐
575                    tion in the search string is advanced two steps.
576
577                  At offset 3:
578                    The search once again matches  the  empty  string,  giving
579                    [{3,0},{3,0}].
580
                  At offset 3 with [anchored, notempty_atstart]:
582                    This  gives no result of length > 0 and we are at the last
583                    position, so the global search is complete.
584
585                  The result of the call is:
586
587                {match,[[{0,0},{0,0}],[{1,0},{1,0}],[{1,2},{1,2}],[{3,0},{3,0}]]}
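
                  For a pattern that cannot match the empty string, option
                  global simply yields one inner list per match. The sketch
                  below uses a made-up subject and also reruns the "cat"
                  walkthrough above so that the two shapes can be compared:

```erlang
%% One inner list per match; each holds the captures selected by capture.
{match, Numbers} = re:run("a1b22c333", "\\d+",
                          [global, {capture, first, list}]).
%% Numbers is [["1"],["22"],["333"]].
%% The empty-match walkthrough, run as described in the text.
CatResult = re:run("cat", "(|at)", [global]).
```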
588
589                notempty:
590                  An empty string is not considered to be  a  valid  match  if
591                  this  option  is  specified.  If alternatives in the pattern
592                  exist, they are tried. If all  the  alternatives  match  the
593                  empty string, the entire match fails.
594
595                  Example:
596
597                  If  the  following pattern is applied to a string not begin‐
598                  ning with "a" or "b", it  would  normally  match  the  empty
599                  string at the start of the subject:
600
601                a?b?
602
603                  With  option  notempty,  this  match  is  invalid,  so run/3
604                  searches further into the string for occurrences of  "a"  or
605                  "b".
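
                  Applied to a made-up subject, the behavior described above
                  looks like this:

```erlang
%% a?b? can match the empty string, which it does at offset 0.
Empty = re:run("xxb", "a?b?").
%% With notempty the empty match is rejected and the search moves on to "b".
NonEmpty = re:run("xxb", "a?b?", [notempty]).
%% Empty is {match,[{0,0}]}, NonEmpty is {match,[{2,1}]}.
```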
606
607                notempty_atstart:
608                  Like notempty, except that an empty string match that is not
609                  at the start of the subject is permitted. If the pattern  is
610                  anchored,  such  a  match can occur only if the pattern con‐
611                  tains \K.
612
613                  Perl   has   no   direct   equivalent   of    notempty    or
614                  notempty_atstart,  but it does make a special case of a pat‐
615                  tern match of the empty string within its split()  function,
616                  and  when  using  modifier /g. The Perl behavior can be emu‐
617                  lated after matching a null string by first trying the match
618                  again at the same offset with notempty_atstart and anchored,
619                  and then, if that fails, by advancing  the  starting  offset
620                  (see below) and trying an ordinary match again.
621
622                notbol:
623                  Specifies  that the first character of the subject string is
624                  not the beginning of a line, so the circumflex metacharacter
625                  is  not  to  match before it. Setting this without multiline
626                  (at compile time) causes circumflex  never  to  match.  This
627                  option only affects the behavior of the circumflex metachar‐
628                  acter. It does not affect \A.
629
630                noteol:
631                  Specifies that the end of the subject string is not the  end
632                  of  a  line,  so the dollar metacharacter is not to match it
633                  nor (except in multiline mode) a newline immediately  before
634                  it.  Setting this without multiline (at compile time) causes
635                  dollar never to match. This option affects only the behavior
636                  of the dollar metacharacter. It does not affect \Z or \z.
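
                  A minimal sketch of notbol and noteol, treating an invented
                  subject as a fragment taken from the middle of a line:

```erlang
%% Neither end of the subject counts as a line boundary.
NoBol = re:run("abc", "^a", [notbol]).
NoEol = re:run("abc", "c$", [noteol]).
%% \A and \z are unaffected by these options.
StillStart = re:run("abc", "\\Aa", [notbol]).
StillEnd = re:run("abc", "c\\z", [noteol]).
%% NoBol and NoEol are nomatch; StillStart is {match,[{0,1}]},
%% StillEnd is {match,[{2,1}]}.
```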
637
638                report_errors:
639                  Gives  better  control  of the error handling in run/3. When
640                  specified, compilation errors (if the regular expression  is
641                  not  already  compiled)  and  runtime  errors are explicitly
642                  returned as an error tuple.
643
644                  The following are the possible runtime errors:
645
646                  match_limit:
647                    The PCRE library sets a limit on how many times the inter‐
648                    nal  match  function can be called. Defaults to 10,000,000
649                    in  the  library   compiled   for   Erlang.   If   {error,
650                    match_limit}  is  returned,  the  execution of the regular
651                    expression has reached this limit. This is normally to  be
652                    regarded  as  a nomatch, which is the default return value
653                    when this occurs, but by specifying report_errors, you are
654                    informed when the match fails because of too many internal
655                    calls.
656
657                  match_limit_recursion:
658                    This error is very similar to match_limit, but occurs when
659                    the  internal  match  function  of  PCRE  is "recursively"
660                    called more times than  the  match_limit_recursion  limit,
661                    which  defaults to 10,000,000 as well. Notice that as long
662                    as the match_limit and match_limit_recursion values are kept
663                    at  the  default  values,  the match_limit_recursion error
664                    cannot occur, as the match_limit error occurs before  that
665                    (each  recursive call is also a call, but not conversely).
666                    Both limits can however be changed, either by setting lim‐
667                    its directly in the regular expression string (see section
668                    PCRE Regular Expression Details) or by specifying options
669                    to run/3.
670
671                  It  is  important  to understand that what is referred to as
672                  "recursion" when limiting matches is not recursion on the  C
673                  stack  of the Erlang machine or on the Erlang process stack.
674                  The PCRE version compiled into the Erlang  VM  uses  machine
675                  "heap"  memory to store values that must be kept over recur‐
676                  sion in regular expression matches.
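
                  One way to handle the possible results of run/3 with
                  report_errors is sketched below for the Erlang shell.
                  Classify is a hypothetical helper, not part of this module,
                  and the result tags ok and gave_up are chosen only for this
                  example.

```erlang
%% Classify is a hypothetical helper that always adds report_errors.
Classify = fun(S, R, Opts) ->
    case re:run(S, R, [report_errors | Opts]) of
        {match, Captured} -> {ok, Captured};
        match             -> {ok, []};          % result for {capture, none}
        nomatch           -> nomatch;
        {error, Reason}   -> {gave_up, Reason}  % compile or runtime error
    end
end.

%% Same subject and pattern as the match_limit_recursion example in this
%% manual page.
Matched = Classify("aaaaaaaaaaaaaz", "(a+)*z", []).
%% {ok,[{0,14},{0,13}]}
TooDeep = Classify("aaaaaaaaaaaaaz", "(a+)*z", [{match_limit_recursion, 5}]).
%% {gave_up,match_limit_recursion}
```

                  Keeping nomatch separate from the error case lets the caller
                  tell "no match exists" from "matching was abandoned".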
677
678                {match_limit, integer() >= 0}:
679                  Limits the execution time of a match in  an  implementation-
680                  specific  way.  It is described as follows by the PCRE docu‐
681                  mentation:
682
683                The match_limit field provides a means of preventing PCRE from using
684                up a vast amount of resources when running patterns that are not going
685                to match, but which have a very large number of possibilities in their
686                search trees. The classic example is a pattern that uses nested
687                unlimited repeats.
688
689                Internally, pcre_exec() uses a function called match(), which it calls
690                repeatedly (sometimes recursively). The limit set by match_limit is
691                imposed on the number of times this function is called during a match,
692                which has the effect of limiting the amount of backtracking that can
693                take place. For patterns that are not anchored, the count restarts
694                from zero for each position in the subject string.
695
696                  This means that runaway regular expression matches can  fail
697                  faster  if  the  limit  is  lowered  using  this option. The
698                  default value 10,000,000 is compiled into the Erlang VM.
699
700            Note:
701                This option in no way affects the execution of the Erlang VM
702                in terms of "long running BIFs". run/3 always gives control
703                back to the scheduler of Erlang processes  at  intervals  that
704                ensure the real-time properties of the Erlang system.
705
706
707                {match_limit_recursion, integer() >= 0}:
708                  Limits  the execution time and memory consumption of a match
709                  in  an  implementation-specific   way,   very   similar   to
710                  match_limit. It is described as follows by the PCRE documen‐
711                  tation:
712
713                The match_limit_recursion field is similar to match_limit, but instead
714                of limiting the total number of times that match() is called, it
715                limits the depth of recursion. The recursion depth is a smaller number
716                than the total number of calls, because not all calls to match() are
717                recursive. This limit is of use only if it is set smaller than
718                match_limit.
719
720                Limiting the recursion depth limits the amount of machine stack that
721                can be used, or, when PCRE has been compiled to use memory on the heap
722                instead of the stack, the amount of heap memory that can be used.
723
724                  The Erlang VM uses a PCRE library where heap memory is  used
725                  when  regular expression match recursion occurs. This there‐
726                  fore limits the use of machine heap, not C stack.
727
728                  Specifying a lower value can result  in  matches  with  deep
729                  recursion failing, when they should have matched:
730
731                1> re:run("aaaaaaaaaaaaaz","(a+)*z").
732                {match,[{0,14},{0,13}]}
733                2> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5}]).
734                nomatch
735                3> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5},report_errors]).
736                {error,match_limit_recursion}
737
738                  This  option  and  option match_limit are only to be used in
739                  rare cases. Understanding of the PCRE library  internals  is
740                  recommended before tampering with these limits.
741
742                {offset, integer() >= 0}:
743                  Start  matching  at  the  offset (position) specified in the
744                  subject string.  The  offset  is  zero-based,  so  that  the
745                  default is {offset,0} (all of the subject string).
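
                  A small sketch for the Erlang shell, with a made-up subject.
                  It illustrates that the returned index still counts from the
                  start of the subject, not from the offset.

```erlang
OffsetZero = re:run("abcabc", "abc").                 % {match,[{0,3}]}
OffsetOne  = re:run("abcabc", "abc", [{offset, 1}]).  % {match,[{3,3}]}
```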
746
747                {newline, NLSpec}:
748                  Overrides the default definition of a newline in the subject
749                  string, which is LF (ASCII 10) in Erlang.
750
751                  cr:
752                    Newline is indicated by a single character CR (ASCII 13).
753
754                  lf:
755                    Newline is indicated by a single character LF (ASCII  10),
756                    the default.
757
758                  crlf:
759                    Newline  is  indicated by the two-character CRLF (ASCII 13
760                    followed by ASCII 10) sequence.
761
762                  anycrlf:
763                    Any of the three preceding sequences is recognized.
764
765                  any:
766                    Any of  the  newline  sequences  above,  and  the  Unicode
767                    sequences   VT   (vertical  tab,  U+000B),  FF  (formfeed,
768                    U+000C), NEL (next  line,  U+0085),  LS  (line  separator,
769                    U+2028), and PS (paragraph separator, U+2029).
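
                  As an illustration (a sketch with a made-up subject), a lone
                  CR starts a new line for the circumflex in multiline mode
                  only when the newline definition says so:

```erlang
%% Default newline is LF, so CR does not start a new line.
NlDefault = re:run("a\rb", "^b", [multiline]).                 % nomatch
%% With {newline, cr}, "b" is at the beginning of a line.
NlCr      = re:run("a\rb", "^b", [multiline, {newline, cr}]).  % {match,[{2,1}]}
```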
770
771                bsr_anycrlf:
772                  Specifies that \R is to match only the CR, LF, or CRLF
773                  sequences, not the Unicode-specific newline characters.
774                  (Overrides the compilation option.)
775
776                bsr_unicode:
777                  Specifies that \R is to match all the Unicode newline
778                  characters (including CRLF, and so on). This is the default.
779                  (Overrides the compilation option.)
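
                  The difference between the two \R settings can be sketched in
                  the Erlang shell as follows. The subjects are made up; "\v"
                  is the vertical tab (VT), which is a newline character only
                  under bsr_unicode.

```erlang
BsrDefault = re:run("a\vb", "a\\Rb").                    % {match,[{0,3}]}
BsrAnyCrlf = re:run("a\vb", "a\\Rb", [bsr_anycrlf]).     % nomatch
BsrCrlf    = re:run("a\r\nb", "a\\Rb", [bsr_anycrlf]).   % {match,[{0,4}]}
```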
780
781                {capture, ValueSpec}/{capture, ValueSpec, Type}:
782                  Specifies which captured substrings are returned and in what
783                  format. By default, run/3 captures all of the matching  part
784                  of  the  subject and all capturing subpatterns (all of the
785                  pattern is automatically captured). The default return  type
786                  is (zero-based) indexes of the captured parts of the string,
787                  specified as {Offset,Length} pairs (the index Type  of  cap‐
788                  turing).
789
790                  As  an  example  of the default behavior, the following call
791                  returns, as first and only  captured  string,  the  matching
792                  part  of the subject ("abcd" in the middle) as an index pair
793                  {3,4}, where character positions are zero-based, just as  in
794                  offsets:
795
796                re:run("ABCabcdABC","abcd",[]).
797
798                  The return value of this call is:
799
800                {match,[{3,4}]}
801
802                  Another (and quite common) case is where the regular expres‐
803                  sion matches all of the subject:
804
805                re:run("ABCabcdABC",".*abcd.*",[]).
806
807                  Here the return value correspondingly points out all of  the
808                  string, beginning at index 0, and it is 10 characters long:
809
810                {match,[{0,10}]}
811
812                  If  the  regular  expression contains capturing subpatterns,
813                  like in:
814
815                re:run("ABCabcdABC",".*(abcd).*",[]).
816
817                  all of the matched subject is captured, as well as the  cap‐
818                  tured substrings:
819
820                {match,[{0,10},{3,4}]}
821
822                  The  complete matching pattern always gives the first return
823                  value in the list and the remaining subpatterns are added in
824                  the order they occurred in the regular expression.
825
826                  The capture tuple is built up as follows:
827
828                  ValueSpec:
829                    Specifies which captured (sub)patterns are to be returned.
830                    ValueSpec can either be an atom  describing  a  predefined
831                    set  of return values, or a list containing the indexes or
832                    the names of specific subpatterns to return.
833
834                    The following are the predefined sets of subpatterns:
835
836                    all:
837                      All captured subpatterns including the complete matching
838                      string. This is the default.
839
840                    all_names:
841                      All named subpatterns in the regular expression, as if a
842                      list() of all the names in alphabetical order was speci‐
843                      fied.  The  list of all names can also be retrieved with
844                      inspect/2.
845
846                    first:
847                      Only the first captured subpattern, which is always  the
848                      complete  matching  part  of the subject. All explicitly
849                      captured subpatterns are discarded.
850
851                    all_but_first:
852                      All but the first  matching  subpattern,  that  is,  all
853                      explicitly  captured  subpatterns,  but not the complete
854                      matching part of the subject string. This is  useful  if
855                      the  regular  expression as a whole matches a large part
856                      of the subject, but the part you are interested in is in
857                      an explicitly captured subpattern. If the return type is
858                      list or binary, not returning subpatterns  you  are  not
859                      interested in is a good way to optimize.
860
861                    none:
862                      Returns  no  matching subpatterns, gives the single atom
863                      match as the return value of the function when  matching
864                      successfully  instead  of  the  {match,  list()} return.
865                      Specifying an empty list gives the same behavior.
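
                    The predefined sets can be compared with a short sketch
                    for the Erlang shell. It reuses the subject and patterns
                    of the other examples in this section; return type list is
                    chosen only to make the results easy to read.

```erlang
CapFirst = re:run("ABCabcdABC", ".*(abcd).*", [{capture, first, list}]).
%% {match,["ABCabcdABC"]}
CapRest  = re:run("ABCabcdABC", ".*(abcd).*", [{capture, all_but_first, list}]).
%% {match,["abcd"]}
CapNone  = re:run("ABCabcdABC", ".*(abcd).*", [{capture, none}]).
%% match
CapNames = re:run("ABCabcdABC", ".*(?<FOO>abcd).*", [{capture, all_names, list}]).
%% {match,["abcd"]}
```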
866
867                    The value list is a list of indexes for the subpatterns to
868                    return,  where index 0 is for all of the pattern, and 1 is
869                    for the first explicit capturing subpattern in the regular
870                    expression,  and  so on. When using named captured subpat‐
871                    terns (see below) in the regular expression, one  can  use
872                    atom()s  or  string()s  to  specify  the subpatterns to be
873                    returned. For example, consider the regular expression:
874
875                  ".*(abcd).*"
876
877                    matched against string "ABCabcdABC",  capturing  only  the
878                    "abcd" part (the first explicit subpattern):
879
880                  re:run("ABCabcdABC",".*(abcd).*",[{capture,[1]}]).
881
882                    The  call gives the following result, as the first explic‐
883                    itly captured subpattern is "(abcd)", matching  "abcd"  in
884                    the subject, at (zero-based) position 3, of length 4:
885
886                  {match,[{3,4}]}
887
888                    Consider the same regular expression, but with the subpat‐
889                    tern explicitly named 'FOO':
890
891                  ".*(?<FOO>abcd).*"
892
893                    With this expression, we could still give the index of the
894                    subpattern with the following call:
895
896                  re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,[1]}]).
897
898                    giving  the  same result as before. But, as the subpattern
899                    is named, we can also specify its name in the value list:
900
901                  re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,['FOO']}]).
902
903                    This would give the same result as the  earlier  examples,
904                    namely:
905
906                  {match,[{3,4}]}
907
908                    The  values  list can specify indexes or names not present
909                    in the regular expression, in which case the return values
910                    vary  depending  on  the  type.  If the type is index, the
911                    tuple {-1,0} is returned for values with no  corresponding
912                    subpattern  in  the  regular expression, but for the other
913                    types (binary and list), the values are the  empty  binary
914                    or list, respectively.
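
                    A sketch of the case described above, where index 2 is
                    requested although the regular expression has only one
                    capturing subpattern:

```erlang
MissingIndex  = re:run("ABCabcdABC", ".*(abcd).*", [{capture, [1, 2], index}]).
%% {match,[{3,4},{-1,0}]}
MissingBinary = re:run("ABCabcdABC", ".*(abcd).*", [{capture, [1, 2], binary}]).
%% {match,[<<"abcd">>,<<>>]}
```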
915
916                  Type:
917                    Optionally  specifies  how  captured  substrings are to be
918                    returned. If omitted, the default of index is used.
919
920                    Type can be one of the following:
921
922                    index:
923                      Returns captured substrings as  pairs  of  byte  indexes
924                      into  the  subject  string  and  length  of the matching
925                      string in the subject (as  if  the  subject  string  was
926                      flattened   with   erlang:iolist_to_binary/1   or   uni‐
927                      code:characters_to_binary/2  before  matching).   Notice
928                      that  option unicode results in byte-oriented indexes in
929                      a (possibly virtual) UTF-8 encoded binary. A byte  index
930                      tuple  {0,2}  can therefore represent one or two charac‐
931                      ters when unicode is in effect. This can  seem  counter-
932                      intuitive,  but  has  been deemed the most effective and
933                      useful way to do it. To return lists instead can  result
934                      in  simpler code if that is desired. This return type is
935                      the default.
936
937                    list:
938                      Returns  matching  substrings  as  lists  of  characters
939                      (Erlang string()s). If option unicode is used in
940                      combination with the \C sequence in the regular expression, a
941                      captured subpattern can contain bytes that are not valid
942                      UTF-8 (\C matches bytes regardless of  character  encod‐
943                      ing).  In that case the list capturing can result in the
944                      same types of tuples  that  unicode:characters_to_list/2
945                      can  return,  namely three-tuples with tag incomplete or
946                      error, the successfully  converted  characters  and  the
947                      invalid  UTF-8  tail  of the conversion as a binary. The
948                      best strategy is to avoid using  the  \C  sequence  when
949                      capturing lists.
950
951                    binary:
952                      Returns  matching substrings as binaries. If option uni‐
953                      code is used, these binaries are in  UTF-8.  If  the  \C
954                      sequence is used together with unicode, the binaries can
955                      be invalid UTF-8.
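
                    The byte-oriented nature of type index under option
                    unicode can be sketched as follows. The subject is the
                    made-up two-character string [16#E5, $b] (LATIN SMALL
                    LETTER A WITH RING ABOVE followed by "b"); the first
                    character occupies two bytes in UTF-8.

```erlang
%% Without unicode the subject is two bytes, so "b" is at byte index 1.
Latin1Index = re:run([16#E5, $b], "b").              % {match,[{1,1}]}
%% With unicode the subject is three UTF-8 bytes, so "b" is at byte index 2.
Utf8Index   = re:run([16#E5, $b], "b", [unicode]).   % {match,[{2,1}]}
%% Type list returns characters, not bytes.
Utf8List    = re:run([16#E5, $b], ".", [unicode, {capture, first, list}]).
%% {match,[[16#E5]]}
```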
956
957                  In general, subpatterns that were not assigned  a  value  in
958                  the  match  are  returned  as  the tuple {-1,0} when type is
959                  index. Unassigned subpatterns  are  returned  as  the  empty
960                  binary  or  list, respectively, for other return types. Con‐
961                  sider the following regular expression:
962
963                ".*((?<FOO>abdd)|a(..d)).*"
964
965                  There are three explicitly capturing subpatterns, where  the
966                  opening  parenthesis  position  determines  the order in the
967                  result, hence ((?<FOO>abdd)|a(..d)) is subpattern  index  1,
968                  (?<FOO>abdd)  is subpattern index 2, and (..d) is subpattern
969                  index 3. When matched against the following string:
970
971                "ABCabcdABC"
972
973                  the subpattern at index 2 does not match, as "abdd"  is  not
974                  present  in  the  string,  but  the complete pattern matches
975                  (because of the alternative a(..d)). The subpattern at index
976                  2 is therefore unassigned and the default return value is:
977
978                {match,[{0,10},{3,4},{-1,0},{4,3}]}
979
980                  Setting the capture Type to binary gives:
981
982                {match,[<<"ABCabcdABC">>,<<"abcd">>,<<>>,<<"bcd">>]}
983
984                  Here  the empty binary (<<>>) represents the unassigned sub‐
985                  pattern. In the binary  case,  some  information  about  the
986                  matching  is  therefore  lost,  as <<>> can also be an empty
987                  string captured.
988
989                  If differentiation between empty  matches  and  non-existing
990                  subpatterns is necessary, use the type index and do the con‐
991                  version to the final type in Erlang code.
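
                  One possible way to do such a conversion is sketched below
                  for the Erlang shell. The atom unassigned is an arbitrary
                  marker chosen for this example, and the subject is assumed
                  to be a binary so that binary:part/2 can be applied to the
                  {Offset,Length} pairs directly.

```erlang
Subj = <<"ABCabcdABC">>.
{match, Idx} = re:run(Subj, ".*((?<FOO>abdd)|a(..d)).*", [{capture, all, index}]).
Parts = [case I of
             {-1, 0} -> unassigned;               % subpattern took no part
             PosLen  -> binary:part(Subj, PosLen) % may be <<>> for an empty match
         end || I <- Idx].
%% [<<"ABCabcdABC">>,<<"abcd">>,unassigned,<<"bcd">>]
```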
992
993                  When option global is specified, the capture  specification
994                  affects each match separately, so that:
995
996                re:run("cacb","c(a|b)",[global,{capture,[1],list}]).
997
998                  gives
999
1000                {match,[["a"],["b"]]}
1001
1002              For a description of options only affecting the compilation
1003              step, see compile/2.
1004
1005       split(Subject, RE) -> SplitList
1006
1007              Types:
1008
1009                 Subject = iodata() | unicode:charlist()
1010                 RE = mp() | iodata()
1011                 SplitList = [iodata() | unicode:charlist()]
1012
1013              Same as split(Subject, RE, []).
1014
1015       split(Subject, RE, Options) -> SplitList
1016
1017              Types:
1018
1019                 Subject = iodata() | unicode:charlist()
1020                 RE = mp() | iodata() | unicode:charlist()
1021                 Options = [Option]
1022                 Option =
1023                     anchored |
1024                     notbol |
1025                     noteol |
1026                     notempty |
1027                     notempty_atstart |
1028                     {offset, integer() >= 0} |
1029                     {newline, nl_spec()} |
1030                     {match_limit, integer() >= 0} |
1031                     {match_limit_recursion, integer() >= 0} |
1032                     bsr_anycrlf |
1033                     bsr_unicode |
1034                     {return, ReturnType} |
1035                     {parts, NumParts} |
1036                     group |
1037                     trim |
1038                     CompileOpt
1039                 NumParts = integer() >= 0 | infinity
1040                 ReturnType = iodata | list | binary
1041                 CompileOpt = compile_option()
1042                   See compile/2.
1043                 SplitList = [RetData] | [GroupedRetData]
1044                 GroupedRetData = [RetData]
1045                 RetData = iodata() | unicode:charlist() | binary() | list()
1046
1047              Splits the input into parts by finding tokens according  to  the
1048              regular  expression supplied. The splitting is basically done by
1049              running a global regular expression match and dividing the  ini‐
1050              tial  string  wherever  a match occurs. The matching part of the
1051              string is removed from the output.
1052
1053              As in run/3, an mp() compiled with option unicode requires  Sub‐
1054              ject  to be a Unicode charlist(). If compilation is done implic‐
1055              itly and the unicode compilation option  is  specified  to  this
1056              function,  both  the  regular  expression  and Subject are to be
1057              specified as valid Unicode charlist()s.
1058
1059              The result is given as a list of "strings" in the data type
1060              specified by option return (default iodata).
1061
1062              If  subexpressions  are specified in the regular expression, the
1063              matching subexpressions are returned in the  resulting  list  as
1064              well. For example:
1065
1066              re:split("Erlang","[ln]",[{return,list}]).
1067
1068              gives
1069
1070              ["Er","a","g"]
1071
1072              while
1073
1074              re:split("Erlang","([ln])",[{return,list}]).
1075
1076              gives
1077
1078              ["Er","l","a","n","g"]
1079
1080              The  text  matching the subexpression (marked by the parentheses
1081              in the regular expression) is inserted in the result list  where
1082              it  was  found.  This  means  that concatenating the result of a
1083              split where the whole regular expression is a single  subexpres‐
1084              sion  (as  in  the  last example) always results in the original
1085              string.
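
              This property can be checked with a short sketch in the Erlang
              shell, using lists:append/1 to concatenate the parts:

```erlang
SplitParts = re:split("Erlang", "([ln])", [{return, list}]).
%% ["Er","l","a","n","g"]
Rejoined   = lists:append(SplitParts).
%% "Erlang"
```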
1086
1087              As there is no matching subexpression for the last part  in  the
1088              example  (the  "g"), nothing is inserted after that. To make the
1089              group of strings and the parts matching the subexpressions  more
1090              obvious,  one  can  use  option group, which groups together the
1091              part of the subject string with the parts  matching  the  subex‐
1092              pressions when the string was split:
1093
1094              re:split("Erlang","([ln])",[{return,list},group]).
1095
1096              gives
1097
1098              [["Er","l"],["a","n"],["g"]]
1099
1100              Here  the regular expression first matched the "l", causing "Er"
1101              to be the first part in the result. When the regular  expression
1102              matched,  the  (only) subexpression was bound to the "l", so the
1103              "l" is inserted in the group together with "Er". The next  match
1104              is  of  the "n", making "a" the next part to be returned. As the
1105              subexpression is bound to substring "n" in this case, the "n" is
1106              inserted into this group. The last group consists of the remain‐
1107              ing string, as no more matches are found.
1108
1109              By default,  all  parts  of  the  string,  including  the  empty
1110              strings, are returned from the function, for example:
1111
1112              re:split("Erlang","[lg]",[{return,list}]).
1113
1114              gives
1115
1116              ["Er","an",[]]
1117
1118              as  the  matching  of the "g" at the end of the string leaves an
1119              empty rest, which is also returned. This behavior  differs  from
1120              the  default behavior of the split function in Perl, where empty
1121              strings at the end are by default removed. To get the "trimming"
1122              default behavior of Perl, specify trim as an option:
1123
1124              re:split("Erlang","[lg]",[{return,list},trim]).
1125
1126              gives
1127
1128              ["Er","an"]
1129
1130              The  "trim"  option  says:  "give  me  as many parts as possible
1131              except the empty ones", which sometimes can be useful.  You  can
1132              also specify how many parts you want, by specifying {parts,N}:
1133
1134              re:split("Erlang","[lg]",[{return,list},{parts,2}]).
1135
1136              gives
1137
1138              ["Er","ang"]
1139
1140              Notice  that  the last part is "ang", not "an", as splitting was
1141              specified into two parts, and the splitting  stops  when  enough
1142              parts  are  given,  which is why the result differs from that of
1143              trim.
1144
1145              More than three parts are not possible with this input data, so
1146
1147              re:split("Erlang","[lg]",[{return,list},{parts,4}]).
1148
1149              gives the same result as the default, which is to be  viewed  as
1150              "an infinite number of parts".
1151
1152              Specifying  0  as  the  number of parts gives the same effect as
1153              option trim. If subexpressions are  captured,  empty  subexpres‐
1154              sions  matched  at  the end are also stripped from the result if
1155              trim or {parts,0} is specified.
1156
1157              The trim behavior  corresponds  exactly  to  the  Perl  default.
1158              {parts,N}, where N is a positive integer, corresponds exactly to
1159              the Perl behavior with a positive numerical third parameter. The
1160              default  behavior  of  split/3  corresponds to the Perl behavior
1161              when a negative integer is specified as the third parameter  for
1162              the Perl routine.
1163
1164              Summary of options not previously described for function run/3:
1165
1166                {return,ReturnType}:
1167                  Specifies how the parts of the original string are presented
1168                  in the result list. Valid types:
1169
1170                  iodata:
1171                    The variant of iodata() that gives the  least  copying  of
1172                    data  with the current implementation (often a binary, but
1173                    do not depend on it).
1174
1175                  binary:
1176                    All parts returned as binaries.
1177
1178                  list:
1179                    All parts returned as lists of characters ("strings").
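
                  The return types can be compared with the following sketch
                  for the Erlang shell. As the exact shape of the iodata
                  result is not specified, the example normalizes it with
                  iolist_to_binary/1 instead of relying on it.

```erlang
AsBinary   = re:split("Erlang", "[ln]", [{return, binary}]).
%% [<<"Er">>,<<"a">>,<<"g">>]
AsList     = re:split("Erlang", "[ln]", [{return, list}]).
%% ["Er","a","g"]
AsIodata   = re:split("Erlang", "[ln]").           % default: {return, iodata}
Normalized = [iolist_to_binary(P) || P <- AsIodata].
%% [<<"Er">>,<<"a">>,<<"g">>]
```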
1180
1181                group:
1182                  Groups together the part of the string with the parts of the
1183                  string  matching  the  subexpressions of the regular expres‐
1184                  sion.
1185
1186                  The return value from the function is in this case a  list()
1187                  of  list()s.  Each sublist begins with the string picked out
1188                  of the subject string, followed by the parts  matching  each
1189                  of  the subexpressions in order of occurrence in the regular
1190                  expression.
1191
1192                {parts,N}:
1193                  Specifies the number of parts the subject string  is  to  be
1194                  split into.
1195
1196                  The  number  of parts is to be a positive integer for a spe‐
1197                  cific maximum number of parts, and infinity for the  maximum
1198                  number of parts possible (the default). Specifying {parts,0}
1199                  gives as many parts as possible disregarding empty parts  at
1200                  the end, the same as specifying trim.
1201
1202                trim:
1203                  Specifies that empty parts at the end of the result list are
1204                  to be disregarded. The same as  specifying  {parts,0}.  This
1205                  corresponds  to  the  default behavior of the split built-in
1206                  function in Perl.
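
              The interaction of trim and {parts,N} can be sketched as follows
              in the Erlang shell. The comma-separated subject is made up for
              this example and ends in two empty fields.

```erlang
PartsAll  = re:split("a,b,,", ",", [{return, list}]).
%% ["a","b",[],[]]
PartsTrim = re:split("a,b,,", ",", [{return, list}, trim]).
%% ["a","b"]
PartsZero = re:split("a,b,,", ",", [{return, list}, {parts, 0}]).
%% ["a","b"]
PartsTwo  = re:split("a,b,,", ",", [{return, list}, {parts, 2}]).
%% ["a","b,,"]
```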
1207

PERL-LIKE REGULAR EXPRESSION SYNTAX

1209       The following sections  contain  reference  material  for  the  regular
1210       expressions  used  by this module. The information is based on the PCRE
1211       documentation, with changes where this module  behaves  differently  to
1212       the PCRE library.
1213

PCRE REGULAR EXPRESSION DETAILS

1215       The  syntax  and semantics of the regular expressions supported by PCRE
1216       are described in detail  in  the  following  sections.  Perl's  regular
1217       expressions are described in its own documentation, and regular expres‐
1218       sions in general are covered in many books, some with copious examples.
1219       Jeffrey   Friedl's   "Mastering   Regular  Expressions",  published  by
1220       O'Reilly, covers regular expressions in great detail. This  description
1221       of the PCRE regular expressions is intended as reference material.
1222
1223       The reference material is divided into the following sections:
1224
1225         * Special Start-of-Pattern Items
1226
1227         * Characters and Metacharacters
1228
1229         * Backslash
1230
1231         * Circumflex and Dollar
1232
1233         * Full Stop (Period, Dot) and \N
1234
1235         * Matching a Single Data Unit
1236
1237         * Square Brackets and Character Classes
1238
1239         * POSIX Character Classes
1240
1241         * Vertical Bar
1242
1243         * Internal Option Setting
1244
1245         * Subpatterns
1246
1247         * Duplicate Subpattern Numbers
1248
1249         * Named Subpatterns
1250
1251         * Repetition
1252
1253         * Atomic Grouping and Possessive Quantifiers
1254
1255         * Back References
1256
1257         * Assertions
1258
1259         * Conditional Subpatterns
1260
1261         * Comments
1262
1263         * Recursive Patterns
1264
1265         * Subpatterns as Subroutines
1266
1267         * Oniguruma Subroutine Syntax
1268
1269         * Backtracking Control
1270

SPECIAL START-OF-PATTERN ITEMS

1272       Some options that can be passed to compile/2 can also be set by special
1273       items at the start of a pattern. These are not Perl-compatible, but are
1274       provided  to  make  these options accessible to pattern writers who are
1275       not able to change the program that processes the pattern.  Any  number
1276       of  these  items can appear, but they must all be together right at the
1277       start of the pattern string, and the letters must be in upper case.
1278
1279       UTF Support
1280
1281       Unicode support is basically UTF-8 based. To  use  Unicode  characters,
1282       you  either call compile/2 or run/3 with option unicode, or the pattern
1283       must start with one of these special sequences:
1284
1285       (*UTF8)
1286       (*UTF)
1287
1288       Both options have the same effect: the input string is interpreted as
1289       UTF-8. Notice that with these instructions, the automatic conversion of
1290       lists to UTF-8 is not performed by the re functions.  Therefore,  using
1291       these  sequences  is  not  recommended. Add option unicode when running
1292       compile/2 instead.
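
       As a brief illustration of the recommended approach, the following
       sketch (written for the Erlang shell; the subject is an arbitrary
       example, not taken from the PCRE documentation) passes option unicode
       to run/3, so that a list containing a code point > 255 is converted
       to UTF-8 before matching:

           %% One-element list: U+0416, CYRILLIC CAPITAL LETTER ZHE.
           Subject = [16#416].
           %% The regex escape \x{416} is written "\\x{416}" in Erlang.
           {match, [{0, 2}]} = re:run(Subject, "\\x{416}", [unicode]).

       The returned {0, 2} is a byte offset and length in the UTF-8 encoded
       subject, which is why the single character has length 2.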
1293
1294       Some applications that allow their users to supply patterns may want to
1295       restrict them to non-UTF data for security reasons. If option never_utf
1296       is set at compile time, (*UTF), and so on, are not allowed,  and  their
1297       appearance causes an error.
1298
1299       Unicode Property Support
1300
1301       The  following is another special sequence that can appear at the start
1302       of a pattern:
1303
1304       (*UCP)
1305
1306       This has the same effect as setting option  ucp:  it  causes  sequences
1307       such  as  \d  and  \w  to use Unicode properties to determine character
1308       types, instead of recognizing only characters with codes < 256  through
1309       a lookup table.
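
       The difference can be sketched in the Erlang shell as follows (the
       subject is an illustrative assumption). U+0416 is a letter, but it
       matches \w only when Unicode properties are consulted:

           Zhe = [16#416].
           %% Without ucp, characters > 255 never match \w.
           nomatch = re:run(Zhe, "\\w", [unicode]).
           %% With option ucp, or with (*UCP) at the start of the pattern.
           {match, [{0, 2}]} = re:run(Zhe, "\\w", [unicode, ucp]).
           {match, [{0, 2}]} = re:run(Zhe, "(*UCP)\\w", [unicode]).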
1310
1311       Disabling Startup Optimizations
1312
1313       If  a  pattern  starts  with (*NO_START_OPT), it has the same effect as
1314       setting option no_start_optimize at compile time.
1315
1316       Newline Conventions
1317
1318       PCRE supports five conventions for indicating line breaks in strings: a
1319       single  CR (carriage return) character, a single LF (line feed) charac‐
1320       ter, the two-character sequence CRLF, any of the three  preceding,  and
1321       any Unicode newline sequence.
1322
1323       A newline convention can also be specified by starting a pattern string
1324       with one of the following five sequences:
1325
1326         (*CR):
1327           Carriage return
1328
1329         (*LF):
1330           Line feed
1331
1332         (*CRLF):
1333           Carriage return followed by line feed
1334
1335         (*ANYCRLF):
1336           Any of the three above
1337
1338         (*ANY):
1339           All Unicode newline sequences
1340
1341       These override the default and the options specified to compile/2.  For
1342       example, the following pattern changes the convention to CR:
1343
1344       (*CR)a.b
1345
1346       This  pattern  matches a\nb, as LF is no longer a newline. If more than
1347       one of them is present, the last one is used.
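
       The (*CR) example above can be tried in the Erlang shell, where the
       default newline is LF. This is a minimal sketch; the last call uses
       option {newline, cr} as the programmatic equivalent:

           %% By default, "." does not match the LF between a and b.
           nomatch = re:run("a\nb", "a.b").
           %% With CR as the newline convention, LF is an ordinary character.
           {match, [{0, 3}]} = re:run("a\nb", "(*CR)a.b").
           {match, [{0, 3}]} = re:run("a\nb", "a.b", [{newline, cr}]).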
1348
1349       The newline convention affects where the circumflex and  dollar  asser‐
1350       tions are true. It also affects the interpretation of the dot metachar‐
1351       acter when dotall is not set, and the behavior of \N. However, it  does
1352       not affect what the \R escape sequence matches. By default, this is any
1353       Unicode newline sequence, for Perl compatibility. However, this can  be
1354       changed;  see  the  description  of  \R in section Newline Sequences. A
1355       change of the \R setting can be combined with a change of  the  newline
1356       convention.
1357
1358       Setting Match and Recursion Limits
1359
1360       The caller of run/3 can set a limit on the number of times the internal
1361       match() function is called and on the maximum depth of recursive calls.
1362       These  facilities  are  provided to catch runaway matches that are pro‐
1363       voked by patterns with huge matching trees (a typical example is a pat‐
1364       tern  with nested unlimited repeats) and to avoid running out of system
1365       stack by too much recursion. When  one  of  these  limits  is  reached,
1366       pcre_exec()  gives an error return. The limits can also be set by items
1367       at the start of the pattern of the following forms:
1368
1369       (*LIMIT_MATCH=d)
1370       (*LIMIT_RECURSION=d)
1371
1372       Here d is any number of decimal digits. However, the value of the  set‐
1373       ting  must  be less than the value set by the caller of run/3 for it to
1374       have any effect. That is, the pattern writer can lower the limit set by
1375       the  programmer, but not raise it. If there is more than one setting of
1376       one of these limits, the lower value is used.
1377
1378       The default value for both the limits is 10,000,000 in the  Erlang  VM.
1379       Notice  that the recursion limit does not affect the stack depth of the
1380       VM, as PCRE for Erlang is compiled in such a way that the  match  func‐
1381       tion never does recursion on the C stack.
1382
1383       Note  that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of
1384       the limits set by the caller, not increase them.
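
       The two ways of lowering the match limit can be sketched as follows.
       The subject, the pattern (nested unlimited repeats), and the limit of
       1000 are illustrative assumptions. Options match_limit and
       report_errors belong to run/3; with report_errors, a reached limit
       is expected to be reported as {error, match_limit} instead of
       nomatch. No result is asserted here, as it depends on how much
       backtracking the pattern causes:

           Subject = lists:duplicate(30, $a) ++ "b".
           %% Limit set by the pattern writer.
           ByPattern = re:run(Subject, "(*LIMIT_MATCH=1000)^(a+)+$",
                              [report_errors]).
           %% Limit set by the caller of run/3.
           ByOption = re:run(Subject, "^(a+)+$",
                             [{match_limit, 1000}, report_errors]).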
1385

CHARACTERS AND METACHARACTERS

1387       A regular expression is a pattern that is  matched  against  a  subject
1388       string  from  left  to right. Most characters stand for themselves in a
1389       pattern and match the corresponding characters in  the  subject.  As  a
1390       trivial  example,  the following pattern matches a portion of a subject
1391       string that is identical to itself:
1392
1393       The quick brown fox
1394
1395       When caseless matching is  specified  (option  caseless),  letters  are
1396       matched independently of case.
1397
1398       The  power  of  regular  expressions  comes from the ability to include
1399       alternatives and repetitions in the pattern. These are encoded  in  the
1400       pattern by the use of metacharacters, which do not stand for themselves
1401       but instead are interpreted in some special way.
1402
1403       Two sets of metacharacters exist: those that are recognized anywhere in
1404       the  pattern  except  within square brackets, and those that are recog‐
1405       nized within square brackets. Outside square brackets, the  metacharac‐
1406       ters are as follows:
1407
1408         \:
1409           General escape character with many uses
1410
1411         ^:
1412           Assert start of string (or line, in multiline mode)
1413
1414         $:
1415           Assert end of string (or line, in multiline mode)
1416
1417         .:
1418           Match any character except newline (by default)
1419
1420         [:
1421           Start character class definition
1422
1423         |:
1424           Start of alternative branch
1425
1426         (:
1427           Start subpattern
1428
1429         ):
1430           End subpattern
1431
1432         ?:
1433           Extends  the  meaning of (, also 0 or 1 quantifier, also quantifier
1434           minimizer
1435
1436         *:
1437           0 or more quantifier
1438
1439         +:
1440           1 or more quantifier, also "possessive quantifier"
1441
1442         {:
1443           Start min/max quantifier
1444
1445       Part of a pattern within square brackets is called a "character class".
1446       The following are the only metacharacters in a character class:
1447
1448         \:
1449           General escape character
1450
1451         ^:
1452           Negate the class, but only if the first character
1453
1454         -:
1455           Indicates character range
1456
1457         [:
1458           Posix character class (only if followed by Posix syntax)
1459
1460         ]:
1461           Terminates the character class
1462
1463       The following sections describe the use of each metacharacter.
1464

BACKSLASH

1466       The  backslash  character  has many uses. First, if it is followed by a
1467       character that is not a number or a letter, it takes away  any  special
1468       meaning  that  a character can have. This use of backslash as an escape
1469       character applies both inside and outside character classes.
1470
1471       For example, if you want to match a * character, you write  \*  in  the
1472       pattern.  This escaping action applies if the following character would
1473       otherwise be interpreted as a metacharacter, so it is  always  safe  to
1474       precede a non-alphanumeric with backslash to specify that it stands for
1475       itself. In particular, if you want to match a backslash, write \\.
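
       In Erlang source, each backslash of the pattern must itself be
       escaped in the string literal. A small sketch with illustrative
       subjects:

           %% Regex \* is written "\\*" in Erlang.
           {match, [{1, 1}]} = re:run("2*3", "\\*").
           %% Regex \\ is written "\\\\"; the subject "a\\b" is a, \, b.
           {match, [{1, 1}]} = re:run("a\\b", "\\\\").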
1476
1477       In unicode mode, only ASCII numbers and letters have any special  mean‐
1478       ing after a backslash. All other characters (in particular, those whose
1479       code points are > 127) are treated as literals.
1480
1481       If a pattern is compiled with option extended, whitespace in  the  pat‐
1482       tern  (other than in a character class) and characters between a # out‐
1483       side a character class and the next newline are  ignored.  An  escaping
1484       backslash can be used to include a whitespace or # character as part of
1485       the pattern.
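
       The effect of option extended can be sketched as follows (the
       patterns and subjects are illustrative assumptions):

           %% Whitespace and the # comment are ignored; the pattern is "ab".
           {match, [{0, 2}]} = re:run("ab", "a b  # comment", [extended]).
           nomatch = re:run("a b", "a b", [extended]).
           %% An escaping backslash keeps the space or # as part of the
           %% pattern.
           {match, [{0, 3}]} = re:run("a b", "a\\ b", [extended]).
           {match, [{0, 1}]} = re:run("#", "\\#", [extended]).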
1486
1487       To remove the special meaning from a sequence of characters,  put  them
1488       between \Q and \E. This is different from Perl in that $ and @ are han‐
1489       dled as literals in \Q...\E sequences in PCRE,  while  $  and  @  cause
1490       variable interpolation in Perl. Notice the following examples:
1491
1492       Pattern            PCRE matches   Perl matches
1493
1494       \Qabc$xyz\E        abc$xyz        abc followed by the contents of $xyz
1495       \Qabc\$xyz\E       abc\$xyz       abc\$xyz
1496       \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1497
1498       The  \Q...\E  sequence  is recognized both inside and outside character
1499       classes. An isolated \E that is not preceded by \Q is ignored. If \Q is
1500       not  followed  by  \E  later in the pattern, the literal interpretation
1501       continues to the end of the pattern (that is,  \E  is  assumed  at  the
1502       end).  If  the  isolated \Q is inside a character class, this causes an
1503       error, as the character class is not terminated.
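
       A short sketch of \Q...\E in the Erlang shell (subjects chosen for
       illustration only):

           %% "$" and "." lose their special meaning between \Q and \E.
           {match, [{0, 7}]} = re:run("abc$xyz", "\\Qabc$xyz\\E").
           {match, [{0, 3}]} = re:run("a.c", "\\Qa.c\\E").
           nomatch = re:run("abc", "\\Qa.c\\E").
           %% A missing \E is assumed at the end of the pattern.
           {match, [{0, 3}]} = re:run("a.c", "\\Qa.c").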
1504
1505       Non-Printing Characters
1506
1507       A second use of backslash provides a way of encoding non-printing char‐
1508       acters  in patterns in a visible manner. There is no restriction on the
1509       appearance of non-printing characters, apart from the binary zero  that
1510       terminates a pattern. When a pattern is prepared by text editing, it is
1511       often easier to use one of the  following  escape  sequences  than  the
1512       binary character it represents:
1513
1514         \a:
1515           Alarm, that is, the BEL character (hex 07)
1516
1517         \cx:
1518           "Control-x", where x is any ASCII character
1519
1520         \e:
1521           Escape (hex 1B)
1522
1523         \f:
1524           Form feed (hex 0C)
1525
1526         \n:
1527           Line feed (hex 0A)
1528
1529         \r:
1530           Carriage return (hex 0D)
1531
1532         \t:
1533           Tab (hex 09)
1534
1535         \0dd:
1536           Character with octal code 0dd
1537
1538         \ddd:
1539           Character with octal code ddd, or back reference
1540
1541         \o{ddd..}:
1542           Character with octal code ddd..
1543
1544         \xhh:
1545           Character with hex code hh
1546
1547         \x{hhh..}:
1548           Character with hex code hhh..
1549
1550   Note:
1551       Note that \0dd is always an octal code, and that \8 and \9 are the lit‐
1552       eral characters "8" and "9".
1553
1554
1555       The precise effect of \cx on ASCII characters is as follows: if x is  a
1556       lowercase  letter,  it  is  converted  to upper case. Then bit 6 of the
1557       character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
1558       (A  is  41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
1559       hex 7B (; is 3B). If the data item (byte or 16-bit value) following  \c
1560       has  a  value  >  127, a compile-time error occurs. This locks out non-
1561       ASCII characters in all modes.
1562
1563       The \c facility was designed for use with ASCII  characters,  but  with
1564       the extension to Unicode it is even less useful than it once was.
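
       The \cx rules above can be checked with a small sketch (the subjects
       are illustrative; [1] is a one-character subject with code hex 01):

           %% \cA and \ca both denote hex 01.
           {match, [{0, 1}]} = re:run([1], "\\cA").
           {match, [{0, 1}]} = re:run([1], "\\ca").
           %% \c; denotes hex 7B, that is, "{".
           {match, [{0, 1}]} = re:run("{", "\\c;").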
1565
1566       After  \0  up  to two further octal digits are read. If there are fewer
1567       than two digits, just  those  that  are  present  are  used.  Thus  the
1568       sequence \0\x\015 specifies two binary zeros followed by a CR character
1569       (code value 13). Make sure you supply two digits after the initial zero
1570       if the pattern character that follows is itself an octal digit.
1571
1572       The  escape \o must be followed by a sequence of octal digits, enclosed
1573       in braces. An error occurs if this is not the case. This  escape  is  a
1574       recent addition to Perl; it provides a way of specifying character code
1575       points as octal numbers greater than 0777, and  it  also  allows  octal
1576       numbers and back references to be unambiguously specified.
1577
1578       For greater clarity and unambiguity, it is best to avoid following \ by
1579       a digit greater than zero. Instead, use \o{} or \x{} to specify charac‐
1580       ter  numbers,  and \g{} to specify back references. The following para‐
1581       graphs describe the old, ambiguous syntax.
1582
1583       The handling of a backslash followed by a digit other than 0 is compli‐
1584       cated,  and  Perl  has changed in recent releases, causing PCRE also to
1585       change. Outside a character class, PCRE reads the digit and any follow‐
1586       ing  digits as a decimal number. If the number is < 8, or if there have
1587       been at least that many previous  capturing  left  parentheses  in  the
1588       expression,  the  entire  sequence  is  taken  as  a  back reference. A
1589       description of how this works is provided later, following the  discus‐
1590       sion of parenthesized subpatterns.
1591
1592       Inside  a  character class, or if the decimal number following \ is > 7
1593       and there have not been that many capturing subpatterns,  PCRE  handles
1594       \8 and \9 as the literal characters "8" and "9", and otherwise re-reads
1595       up to three octal digits following the backslash, using them to
1596       generate  a data character. Any subsequent digits stand for themselves.
1597       For example:
1598
1599         \040:
1600           Another way of writing an ASCII space
1601
1602         \40:
1603           The same, provided there are < 40 previous capturing subpatterns
1604
1605         \7:
1606           Always a back reference
1607
1608         \11:
1609           Can be a back reference, or another way of writing a tab
1610
1611         \011:
1612           Always a tab
1613
1614         \0113:
1615           A tab followed by character "3"
1616
1617         \113:
1618           Can be a back reference, otherwise the character  with  octal  code
1619           113
1620
1621         \377:
1622           Can be a back reference, otherwise value 255 (decimal)
1623
1624         \81:
1625           Either a back reference, or the two characters "8" and "1"
1626
1627       Notice  that  octal  values >= 100 that are specified using this syntax
1628       must not be introduced by a leading zero, as no more than  three  octal
1629       digits are ever read.
1630
1631       By  default, after \x that is not followed by {, from zero to two hexa‐
1632       decimal digits are read (letters can be in upper or  lower  case).  Any
1633       number of hexadecimal digits may appear between \x{ and }. If a charac‐
1634       ter other than a hexadecimal digit appears between \x{  and  },  or  if
1635       there is no terminating }, an error occurs.
1636
1637       Characters whose value is less than 256 can be defined by either of the
1638       two syntaxes for \x. There is no difference in the way  they  are  han‐
1639       dled. For example, \xdc is exactly the same as \x{dc}.
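
       The octal and hexadecimal escapes described above can be sketched as
       follows (illustrative subjects; octal 101 and hex 41 both denote
       "A"):

           {match, [{0, 1}]} = re:run("\t", "\\011").
           {match, [{0, 1}]} = re:run(" ", "\\040").
           %% \0113 is a tab followed by the character "3".
           {match, [{0, 2}]} = re:run("\t3", "\\0113").
           {match, [{0, 1}]} = re:run("A", "\\o{101}").
           %% Both \x syntaxes are handled identically.
           {match, [{0, 1}]} = re:run("A", "\\x41").
           {match, [{0, 1}]} = re:run("A", "\\x{41}").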
1640
1641       Constraints on character values
1642
1643       Characters  that  are  specified using octal or hexadecimal numbers are
1644       limited to certain values, as follows:
1645
1646         8-bit non-UTF mode:
1647           < 0x100
1648
1649         8-bit UTF-8 mode:
1650           < 0x10ffff and a valid codepoint
1651
1652       Invalid Unicode codepoints are the range  0xd800  to  0xdfff  (the  so-
1653       called "surrogate" codepoints), and 0xffef.
1654
1655       Escape sequences in character classes
1656
1657       All the sequences that define a single character value can be used both
1658       inside and outside character classes. Also, inside a  character  class,
1659       \b is interpreted as the backspace character (hex 08).
1660
1661       \N  is not allowed in a character class. \B, \R, and \X are not special
1662       inside a character class. Like  other  unrecognized  escape  sequences,
1663       they are treated as the literal characters "B", "R", and "X". Outside a
1664       character class, these sequences have different meanings.
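
       As an illustration (a sketch with assumed subjects), \b denotes a
       backspace inside a character class but a word boundary outside one,
       and \N inside a class is rejected when the pattern is compiled:

           %% The Erlang string "\b" is the backspace character (hex 08).
           {match, [{0, 1}]} = re:run("\b", "[\\b]").
           %% Outside a class, \b asserts a word boundary before "b".
           {match, [{2, 1}]} = re:run("a b", "\\bb").
           {error, _Reason} = re:compile("[\\N]").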
1665
1666       Unsupported Escape Sequences
1667
1668       In Perl, the sequences \l, \L, \u, and \U are recognized by its  string
1669       handler  and used to modify the case of following characters. PCRE does
1670       not support these escape sequences.
1671
1672       Absolute and Relative Back References
1673
1674       The sequence \g followed by an unsigned or a negative  number,  option‐
1675       ally  enclosed  in braces, is an absolute or relative back reference. A
1676       named back reference can be coded as \g{name}. Back references are dis‐
1677       cussed later, following the discussion of parenthesized subpatterns.
1678
1679       Absolute and Relative Subroutine Calls
1680
1681       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
1682       name or a number enclosed either in angle brackets or single quotes, is
1683       alternative  syntax  for  referencing  a  subpattern as a "subroutine".
1684       Details are discussed later. Notice  that  \g{...}  (Perl  syntax)  and
1685       \g<...>  (Oniguruma  syntax)  are  not synonymous. The former is a back
1686       reference and the latter is a subroutine call.
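
       The difference between the two syntaxes can be sketched as follows
       (the subjects are illustrative assumptions). A back reference
       matches the text that the group captured, while a subroutine call
       matches the group's pattern again:

           %% Absolute and relative back references to group 1.
           {match, [{0, 4}, {0, 2}]} = re:run("abab", "(ab)\\g{1}").
           {match, [{0, 4}, {0, 2}]} = re:run("abab", "(ab)\\g{-1}").

           Subject = "sense and responsibility".
           %% Back reference: requires "sens" to be repeated literally.
           nomatch = re:run(Subject, "(sens|respons)e and \\g{1}ibility").
           %% Subroutine call: (sens|respons) is tried again.
           {match, [Subject]} =
               re:run(Subject, "(sens|respons)e and \\g<1>ibility",
                      [{capture, first, list}]).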
1687
1688       Generic Character Types
1689
1690       Another use of backslash is for specifying generic character types:
1691
1692         \d:
1693           Any decimal digit
1694
1695         \D:
1696           Any character that is not a decimal digit
1697
1698         \h:
1699           Any horizontal whitespace character
1700
1701         \H:
1702           Any character that is not a horizontal whitespace character
1703
1704         \s:
1705           Any whitespace character
1706
1707         \S:
1708           Any character that is not a whitespace character
1709
1710         \v:
1711           Any vertical whitespace character
1712
1713         \V:
1714           Any character that is not a vertical whitespace character
1715
1716         \w:
1717           Any "word" character
1718
1719         \W:
1720           Any "non-word" character
1721
1722       There is also the single sequence \N, which matches a non-newline char‐
1723       acter.  This  is  the  same as the "." metacharacter when dotall is not
1724       set. Perl also uses \N to match characters by name, but PCRE  does  not
1725       support this.
1726
1727       Each  pair  of  lowercase and uppercase escape sequences partitions the
1728       complete set of characters into two disjoint sets. Any given  character
1729       matches  one, and only one, of each pair. The sequences can appear both
1730       inside and outside character classes. They each match one character  of
1731       the  appropriate  type.  If the current matching point is at the end of
1732       the subject string, all fail, as there is no character to match.
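
       A few of these sequences in use, as a minimal sketch with
       illustrative subjects:

           Opts = [{capture, first, list}].
           {match, ["123"]} = re:run("abc 123", "\\d+", Opts).
           {match, ["abc"]} = re:run("abc 123", "\\w+", Opts).
           {match, [{3, 1}]} = re:run("abc 123", "\\s").
           %% A tab is horizontal, not vertical, whitespace.
           {match, [{1, 1}]} = re:run("a\tb", "\\h").
           nomatch = re:run("a\tb", "\\v").
           %% At the end of the subject there is no character to match.
           nomatch = re:run("", "\\D").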
1733
1734       For compatibility with Perl, \s formerly did not match the VT character
1735       (code 11), which made it different from the POSIX "space" class.
1736       However, Perl added VT at release  5.18,  and  PCRE  followed  suit  at
1737       release  8.34.  The  default  \s characters are now HT (9), LF (10), VT
1738       (11), FF (12), CR (13), and space (32),  which  are  defined  as  white
1739       space in the "C" locale. This list may vary if locale-specific matching
1740       is taking place. For example, in some locales the "non-breaking  space"
1741       character  (\xA0)  is  recognized  as white space, and in others the VT
1742       character is not.
1743
1744       A "word" character is an underscore or any character that is  a  letter
1745       or  a  digit.  By default, the definition of letters and digits is con‐
1746       trolled by the PCRE low-valued character tables, in Erlang's case  (and
1747       without option unicode), the ISO Latin-1 character set.
1748
1749       By default, in unicode mode, characters with values > 255, that is, all
1750       characters outside the ISO Latin-1 character set, never match  \d,  \s,
1751       or  \w,  and  always match \D, \S, and \W. These sequences retain their
1752       original meanings from before UTF support  was  available,  mainly  for
1753       efficiency  reasons.  However,  if  option  ucp is set, the behavior is
1754       changed so that Unicode properties  are  used  to  determine  character
1755       types, as follows:
1756
1757         \d:
1758           Any character that \p{Nd} matches (decimal digit)
1759
1760         \s:
1761           Any character that matches \p{Z} or \h or \v
1762
1763         \w:
1764           Any character that matches \p{L} or \p{N}, plus underscore
1765
1766       The uppercase escapes match the inverse sets of characters. Notice that
1767       \d matches only decimal digits, while \w matches any Unicode digit, any
1768       Unicode letter, and underscore. Notice also that ucp affects \b and \B,
1769       as they are defined in terms of \w and \W. Matching these sequences  is
1770       noticeably slower when ucp is set.
1771
1772       The  sequences  \h, \H, \v, and \V are features that were added to Perl
1773       in release 5.10. In contrast to the other sequences, which  match  only
1774       ASCII  characters  by  default,  these always match certain high-valued
1775       code points, regardless of whether ucp is set.
1776
1777       The following are the horizontal space characters:
1778
1779         U+0009:
1780           Horizontal tab (HT)
1781
1782         U+0020:
1783           Space
1784
1785         U+00A0:
1786           Non-break space
1787
1788         U+1680:
1789           Ogham space mark
1790
1791         U+180E:
1792           Mongolian vowel separator
1793
1794         U+2000:
1795           En quad
1796
1797         U+2001:
1798           Em quad
1799
1800         U+2002:
1801           En space
1802
1803         U+2003:
1804           Em space
1805
1806         U+2004:
1807           Three-per-em space
1808
1809         U+2005:
1810           Four-per-em space
1811
1812         U+2006:
1813           Six-per-em space
1814
1815         U+2007:
1816           Figure space
1817
1818         U+2008:
1819           Punctuation space
1820
1821         U+2009:
1822           Thin space
1823
1824         U+200A:
1825           Hair space
1826
1827         U+202F:
1828           Narrow no-break space
1829
1830         U+205F:
1831           Medium mathematical space
1832
1833         U+3000:
1834           Ideographic space
1835
1836       The following are the vertical space characters:
1837
1838         U+000A:
1839           Line feed (LF)
1840
1841         U+000B:
1842           Vertical tab (VT)
1843
1844         U+000C:
1845           Form feed (FF)
1846
1847         U+000D:
1848           Carriage return (CR)
1849
1850         U+0085:
1851           Next line (NEL)
1852
1853         U+2028:
1854           Line separator
1855
1856         U+2029:
1857           Paragraph separator
1858
1859       In 8-bit, non-UTF-8 mode, only the characters with code  points  <  256
1860       are relevant.
1861
1862       Newline Sequences
1863
1864       Outside  a  character class, by default, the escape sequence \R matches
1865       any Unicode newline sequence. In non-UTF-8 mode, \R  is  equivalent  to
1866       the following:
1867
1868       (?>\r\n|\n|\x0b|\f|\r|\x85)
1869
1870       This is an example of an "atomic group"; details are provided below.
1871
1872       This particular group matches either the two-character sequence CR fol‐
1873       lowed by LF, or one of the single characters LF (line feed, U+000A), VT
1874       (vertical  tab,  U+000B),  FF (form feed, U+000C), CR (carriage return,
1875       U+000D), or NEL (next line,  U+0085).  The  two-character  sequence  is
1876       treated as a single unit that cannot be split.
1877
1878       In  Unicode  mode,  two more characters whose code points are > 255 are
1879       added:  LS  (line  separator,  U+2028)  and  PS  (paragraph  separator,
1880       U+2029).  Unicode  character  property  support is not needed for these
1881       characters to be recognized.
1882
1883       \R can be restricted to match only CR, LF, or CRLF (instead of the com‐
1884       plete set of Unicode line endings) by setting option bsr_anycrlf either
1885       at compile time or when the pattern is matched. (BSR is an acronym  for
1886       "backslash R".) This can be made the default when PCRE is built; if so,
1887       the other behavior can be requested through option  bsr_unicode.  These
1888       settings can also be specified by starting a pattern string with one of
1889       the following sequences:
1890
1891         (*BSR_ANYCRLF):
1892           CR, LF, or CRLF only
1893
1894         (*BSR_UNICODE):
1895           Any Unicode newline sequence
1896
1897       These override the default and the options specified to  the  compiling
1898       function, but they can themselves be overridden by options specified to
1899       a matching function. Notice that these special settings, which are  not
1900       Perl-compatible,  are  recognized  only at the very start of a pattern,
1901       and that they must be in upper case.  If  more  than  one  of  them  is
1902       present,  the  last  one is used. They can be combined with a change of
1903       newline convention; for example, a pattern can start with:
1904
1905       (*ANY)(*BSR_ANYCRLF)
1906
1907       They can also be combined with the (*UTF8), (*UTF), or  (*UCP)  special
1908       sequences.  Inside  a character class, \R is treated as an unrecognized
1909       escape sequence, and so matches the letter "R" by default.
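
       The behavior of \R and of the BSR settings can be sketched as
       follows (illustrative subjects; the Erlang string escape "\v" is the
       VT character, hex 0B):

           %% CRLF is consumed as a single unit, LF as a newline of its own.
           ["a", "b", "c"] =
               re:split("a\r\nb\nc", "\\R", [{return, list}]).
           %% By default, VT is one of the newline sequences \R matches.
           {match, [{1, 1}]} = re:run("a\vb", "\\R").
           %% Restricted to CR, LF, or CRLF, VT no longer matches.
           nomatch = re:run("a\vb", "\\R", [bsr_anycrlf]).
           nomatch = re:run("a\vb", "(*BSR_ANYCRLF)\\R").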
1910
1911       Unicode Character Properties
1912
1913       Three more escape sequences that match characters with specific proper‐
1914       ties  are  available. When in 8-bit non-UTF-8 mode, these sequences are
1915       limited to testing characters whose code points are < 256, but they  do
1916       work in this mode. The following are the extra escape sequences:
1917
1918         \p{xx}:
1919           A character with property xx
1920
1921         \P{xx}:
1922           A character without property xx
1923
1924         \X:
1925           A Unicode extended grapheme cluster
1926
1927       The  property  names represented by xx above are limited to the Unicode
1928       script names, the general category properties, "Any", which matches any
1929       character   (including  newline),  and  some  special  PCRE  properties
1930       (described in the next section). Other Perl properties, such as  "InMu‐
1931       sicalSymbols", are currently not supported by PCRE. Notice that \P{Any}
1932       does not match any characters and always causes a match failure.
1933
1934       Sets of Unicode characters are defined as belonging to certain scripts.
1935       A  character from one of these sets can be matched using a script name,
1936       for example:
1937
1938       \p{Greek} \P{Han}
1939
1940       Those that are not part of an identified script are lumped together  as
1941       "Common". The following is the current list of scripts:
1942
1943         * Arabic
1944
1945         * Armenian
1946
1947         * Avestan
1948
1949         * Balinese
1950
1951         * Bamum
1952
1953         * Bassa_Vah
1954
1955         * Batak
1956
1957         * Bengali
1958
1959         * Bopomofo
1960
1961         * Braille
1962
1963         * Buginese
1964
1965         * Buhid
1966
1967         * Canadian_Aboriginal
1968
1969         * Carian
1970
1971         * Caucasian_Albanian
1972
1973         * Chakma
1974
1975         * Cham
1976
1977         * Cherokee
1978
1979         * Common
1980
1981         * Coptic
1982
1983         * Cuneiform
1984
1985         * Cypriot
1986
1987         * Cyrillic
1988
1989         * Deseret
1990
1991         * Devanagari
1992
1993         * Duployan
1994
1995         * Egyptian_Hieroglyphs
1996
1997         * Elbasan
1998
1999         * Ethiopic
2000
2001         * Georgian
2002
2003         * Glagolitic
2004
2005         * Gothic
2006
2007         * Grantha
2008
2009         * Greek
2010
2011         * Gujarati
2012
2013         * Gurmukhi
2014
2015         * Han
2016
2017         * Hangul
2018
2019         * Hanunoo
2020
2021         * Hebrew
2022
2023         * Hiragana
2024
2025         * Imperial_Aramaic
2026
2027         * Inherited
2028
2029         * Inscriptional_Pahlavi
2030
2031         * Inscriptional_Parthian
2032
2033         * Javanese
2034
2035         * Kaithi
2036
2037         * Kannada
2038
2039         * Katakana
2040
2041         * Kayah_Li
2042
2043         * Kharoshthi
2044
2045         * Khmer
2046
2047         * Khojki
2048
2049         * Khudawadi
2050
2051         * Lao
2052
2053         * Latin
2054
2055         * Lepcha
2056
2057         * Limbu
2058
2059         * Linear_A
2060
2061         * Linear_B
2062
2063         * Lisu
2064
2065         * Lycian
2066
2067         * Lydian
2068
2069         * Mahajani
2070
2071         * Malayalam
2072
2073         * Mandaic
2074
2075         * Manichaean
2076
2077         * Meetei_Mayek
2078
2079         * Mende_Kikakui
2080
2081         * Meroitic_Cursive
2082
2083         * Meroitic_Hieroglyphs
2084
2085         * Miao
2086
2087         * Modi
2088
2089         * Mongolian
2090
2091         * Mro
2092
2093         * Myanmar
2094
2095         * Nabataean
2096
2097         * New_Tai_Lue
2098
2099         * Nko
2100
2101         * Ogham
2102
2103         * Ol_Chiki
2104
2105         * Old_Italic
2106
2107         * Old_North_Arabian
2108
2109         * Old_Permic
2110
2111         * Old_Persian
2112
2113         * Oriya
2114
2115         * Old_South_Arabian
2116
2117         * Old_Turkic
2118
2119         * Osmanya
2120
2121         * Pahawh_Hmong
2122
2123         * Palmyrene
2124
2125         * Pau_Cin_Hau
2126
2127         * Phags_Pa
2128
2129         * Phoenician
2130
2131         * Psalter_Pahlavi
2132
2133         * Rejang
2134
2135         * Runic
2136
2137         * Samaritan
2138
2139         * Saurashtra
2140
2141         * Sharada
2142
2143         * Shavian
2144
2145         * Siddham
2146
2147         * Sinhala
2148
2149         * Sora_Sompeng
2150
2151         * Sundanese
2152
2153         * Syloti_Nagri
2154
2155         * Syriac
2156
2157         * Tagalog
2158
2159         * Tagbanwa
2160
2161         * Tai_Le
2162
2163         * Tai_Tham
2164
2165         * Tai_Viet
2166
2167         * Takri
2168
2169         * Tamil
2170
2171         * Telugu
2172
2173         * Thaana
2174
2175         * Thai
2176
2177         * Tibetan
2178
2179         * Tifinagh
2180
2181         * Tirhuta
2182
2183         * Ugaritic
2184
2185         * Vai
2186
2187         * Warang_Citi
2188
2189         * Yi
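
       Matching by script name can be sketched as follows (the subjects are
       illustrative assumptions; U+03B1 is GREEK SMALL LETTER ALPHA):

           {match, [{0, 2}]} = re:run([16#3B1], "\\p{Greek}", [unicode]).
           nomatch = re:run("a", "\\p{Greek}", [unicode]).
           %% \P{Han} matches any character that is not in the Han script.
           {match, [{0, 1}]} = re:run("a", "\\P{Han}", [unicode]).
           {match, [{0, 1}]} = re:run("a", "\\p{Latin}").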
2190
2191       Each character has exactly one Unicode general category property, spec‐
2192       ified by a two-letter acronym. For compatibility  with  Perl,  negation
2193       can  be  specified  by including a circumflex between the opening brace
2194       and the property name. For example, \p{^Lu} is the same as \P{Lu}.
2195
2196       If only one letter is specified with \p or \P, it includes all the gen‐
2197       eral  category properties that start with that letter. In this case, in
2198       the absence of negation, the curly brackets in the escape sequence  are
2199       optional. The following two examples have the same effect:
2200
2201       \p{L}
2202       \pL
2203
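       As a minimal sketch of the property escapes described above, the
       following module (its name and function names are hypothetical, chosen
       only for illustration) calls run/3 with option unicode and asks for the
       matched text as a list:

```erlang
-module(re_category_example).
-export([first_upper/1, first_non_upper/1, letters/1]).

%% Return the first uppercase letter (general category Lu) in Subject.
first_upper(Subject) ->
    re:run(Subject, "\\p{Lu}", [unicode, {capture, first, list}]).

%% \p{^Lu} and \P{Lu} are two spellings of "not an uppercase letter".
%% The second call is matched against the first result, so this function
%% crashes with badmatch if the two forms ever disagree.
first_non_upper(Subject) ->
    Result = re:run(Subject, "\\p{^Lu}", [unicode, {capture, first, list}]),
    Result = re:run(Subject, "\\P{Lu}", [unicode, {capture, first, list}]),
    Result.

%% \pL is shorthand for \p{L}: any letter, whatever its subcategory.
letters(Subject) ->
    re:run(Subject, "\\pL+", [unicode, {capture, first, list}]).
```

       For example, re_category_example:first_upper("abcDef") is expected to
       return {match,["D"]}. Notice the doubled backslashes required by
       Erlang string literals.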
2204       The following general category property codes are supported:
2205
2206         C:
2207           Other
2208
2209         Cc:
2210           Control
2211
2212         Cf:
2213           Format
2214
2215         Cn:
2216           Unassigned
2217
2218         Co:
2219           Private use
2220
2221         Cs:
2222           Surrogate
2223
2224         L:
2225           Letter
2226
2227         Ll:
2228           Lowercase letter
2229
2230         Lm:
2231           Modifier letter
2232
2233         Lo:
2234           Other letter
2235
2236         Lt:
2237           Title case letter
2238
2239         Lu:
2240           Uppercase letter
2241
2242         M:
2243           Mark
2244
2245         Mc:
2246           Spacing mark
2247
2248         Me:
2249           Enclosing mark
2250
2251         Mn:
2252           Non-spacing mark
2253
2254         N:
2255           Number
2256
2257         Nd:
2258           Decimal number
2259
2260         Nl:
2261           Letter number
2262
2263         No:
2264           Other number
2265
2266         P:
2267           Punctuation
2268
2269         Pc:
2270           Connector punctuation
2271
2272         Pd:
2273           Dash punctuation
2274
2275         Pe:
2276           Close punctuation
2277
2278         Pf:
2279           Final punctuation
2280
2281         Pi:
2282           Initial punctuation
2283
2284         Po:
2285           Other punctuation
2286
2287         Ps:
2288           Open punctuation
2289
2290         S:
2291           Symbol
2292
2293         Sc:
2294           Currency symbol
2295
2296         Sk:
2297           Modifier symbol
2298
2299         Sm:
2300           Mathematical symbol
2301
2302         So:
2303           Other symbol
2304
2305         Z:
2306           Separator
2307
2308         Zl:
2309           Line separator
2310
2311         Zp:
2312           Paragraph separator
2313
2314         Zs:
2315           Space separator
2316
2317       The  special property L& is also supported. It matches a character that
2318       has the Lu, Ll, or Lt property, that is, a letter that is  not  classi‐
2319       fied as a modifier or "other".
2320
2321       The  Cs  (Surrogate)  property  applies only to characters in the range
2322       U+D800 to U+DFFF. Such characters are invalid in Unicode strings and so
2323       cannot be tested by PCRE. Perl does not support the Cs property.
2324
2325       The long synonyms for property names supported by Perl (such as \p{Let‐
2326       ter}) are not supported by PCRE. It is not permitted to prefix  any  of
2327       these properties with "Is".
2328
2329       No  character  in  the  Unicode table has the Cn (unassigned) property.
2330       This property is instead assumed for any code point that is not in  the
2331       Unicode table.
2332
2333       Specifying  caseless  matching  does not affect these escape sequences.
2334       For example, \p{Lu} always matches only uppercase letters. This is dif‐
2335       ferent from the behavior of current versions of Perl.
2336
2337       Matching  characters by Unicode property is not fast, as PCRE must do a
2338       multistage table lookup to find a character property. That is  why  the
2339       traditional escape sequences such as \d and \w do not use Unicode prop‐
2340       erties in PCRE by default. However, you can make them do so by  setting
2341       option ucp or by starting the pattern with (*UCP).
2342
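       The effect of option ucp on \d can be sketched as follows. The module
       and function names are hypothetical. The subject is the single
       character U+0661 (ARABIC-INDIC DIGIT ONE), which occupies two bytes in
       UTF-8:

```erlang
-module(re_ucp_example).
-export([digit/1, digit_inline/0]).

%% U+0661 ARABIC-INDIC DIGIT ONE as a UTF-8 binary (two bytes).
subject() -> <<16#0661/utf8>>.

%% Try to match \d, adding the caller's extra options to unicode.
digit(ExtraOpts) ->
    re:run(subject(), "\\d", [unicode | ExtraOpts]).

%% Same match, but UCP mode is requested inside the pattern itself.
digit_inline() ->
    re:run(subject(), "(*UCP)\\d", [unicode]).
```

       Without ucp, \d covers only ASCII digits, so digit([]) returns nomatch,
       while digit([ucp]) and digit_inline() both report a match covering the
       two bytes of the character.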
2343       Extended Grapheme Clusters
2344
2345       The  \X  escape  matches  any number of Unicode characters that form an
2346       "extended grapheme cluster", and treats the sequence as an atomic group
2347       (see below). Up to and including release 8.31, PCRE matched an earlier,
2348       simpler definition that was equivalent  to  (?>\PM\pM*).  That  is,  it
2349       matched  a  character  without the "mark" property, followed by zero or
2350       more characters with the "mark" property. Characters  with  the  "mark"
2351       property  are  typically  non-spacing accents that affect the preceding
2352       character.
2353
2354       This simple definition was extended in Unicode to include more  compli‐
2355       cated  kinds of composite character by giving each character a grapheme
2356       breaking property, and creating rules  that  use  these  properties  to
2357       define  the  boundaries of extended grapheme clusters. In PCRE releases
2358       later than 8.31, \X matches one of these clusters.
2359
2360       \X always matches at least one character. Then it  decides  whether  to
2361       add more characters according to the following rules for ending a clus‐
2362       ter:
2363
2364         * End at the end of the subject string.
2365
2366         * Do not end between CR and LF; otherwise end after any control char‐
2367           acter.
2368
2369         * Do  not  break  Hangul (a Korean script) syllable sequences. Hangul
2370           characters are of five types: L, V, T, LV, and LVT. An L  character
2371           can  be followed by an L, V, LV, or LVT character. An LV or V char‐
2372           acter can be followed by a V or T character. An LVT or T  character
2373           can be followed only by a T character.
2374
2375         * Do not end before extending characters or spacing marks. Characters
2376           with the "mark" property always have the "extend" grapheme breaking
2377           property.
2378
2379         * Do not end after prepend characters.
2380
2381         * Otherwise, end the cluster.
2382
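       The difference between \X and a plain dot can be illustrated with a
       base letter followed by a combining mark. This is a sketch only; the
       module name is hypothetical, and it assumes a release whose PCRE is
       later than 8.31, as this page describes:

```erlang
-module(re_grapheme_example).
-export([first_cluster/0, first_character/0]).

%% "e", then U+0301 COMBINING ACUTE ACCENT, then "x", as UTF-8.
subject() -> <<"e", 16#0301/utf8, "x">>.

%% \X takes the base letter together with the following combining mark.
first_cluster() ->
    re:run(subject(), "\\X", [unicode, {capture, first, binary}]).

%% A dot matches one character only, so the accent is left behind.
first_character() ->
    re:run(subject(), ".", [unicode, {capture, first, binary}]).
```

       Here first_cluster() returns the three-byte binary holding "e" and the
       accent, while first_character() returns only <<"e">>.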
2383       PCRE Additional Properties
2384
2385       In  addition to the standard Unicode properties described earlier, PCRE
2386       supports four more that make it possible to convert traditional  escape
       sequences, such as \w and \s, to use Unicode properties. PCRE uses these
2388       non-standard, non-Perl properties internally when  the  ucp  option  is
2389       passed.  However,  they can also be used explicitly. The properties are
2390       as follows:
2391
2392         Xan:
2393           Any alphanumeric character. Matches characters that have either the
2394           L (letter) or the N (number) property.
2395
2396         Xps:
2397           Any  Posix  space character. Matches the characters tab, line feed,
2398           vertical tab, form feed, carriage return, and any  other  character
2399           that has the Z (separator) property.
2400
2401         Xsp:
           Any Perl space character. Matches the same characters as Xps. It
           used to exclude vertical tab, but no longer does (see below).
2404
2405         Xwd:
2406           Any Perl "word" character. Matches the same characters as Xan, plus
2407           underscore.
2408
2409       Perl and POSIX space are now the same. Perl added VT to its space char‐
2410       acter set at release 5.18 and PCRE changed at release 8.34.
2411
2412       Xan matches characters that have either the L (letter) or the  N  (num‐
2413       ber)  property. Xps matches the characters tab, linefeed, vertical tab,
2414       form feed, or carriage return, and any other character that has  the  Z
2415       (separator) property. Xsp is the same as Xps; it used to exclude verti‐
2416       cal tab, for Perl compatibility, but Perl changed, and so PCRE followed
2417       at  release  8.34.  Xwd matches the same characters as Xan, plus under‐
2418       score.
2419
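       The additional properties can also be used explicitly in a pattern. A
       small sketch (hypothetical module name) contrasting Xan and Xwd:

```erlang
-module(re_xprops_example).
-export([alnum_run/1, word_run/1]).

%% Xan: letters and numbers only, so a run stops at an underscore.
alnum_run(Subject) ->
    re:run(Subject, "\\p{Xan}+", [unicode, {capture, first, list}]).

%% Xwd: the same characters as Xan, plus underscore.
word_run(Subject) ->
    re:run(Subject, "\\p{Xwd}+", [unicode, {capture, first, list}]).
```

       With the subject "foo_bar", alnum_run/1 returns {match,["foo"]} and
       word_run/1 returns {match,["foo_bar"]}.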
2420       There is another non-standard property, Xuc, which matches any  charac‐
2421       ter  that  can  be represented by a Universal Character Name in C++ and
2422       other programming languages. These are the characters $,  @,  `  (grave
2423       accent),  and all characters with Unicode code points >= U+00A0, except
2424       for the surrogates U+D800 to U+DFFF.  Notice  that  most  base  (ASCII)
2425       characters  are  excluded.  (Universal  Character Names are of the form
2426       \uHHHH or \UHHHHHHHH, where H is a hexadecimal digit. Notice  that  the
2427       Xuc  property  does  not  match these sequences but the characters that
2428       they represent.)
2429
2430       Resetting the Match Start
2431
2432       The escape sequence \K causes any previously matched characters not  to
2433       be  included  in the final matched sequence. For example, the following
2434       pattern matches "foobar", but reports that it has matched "bar":
2435
2436       foo\Kbar
2437
2438       This feature is similar to a lookbehind  assertion  (described  below).
2439       However,  in  this  case, the part of the subject before the real match
2440       does not have to be of fixed length, as lookbehind assertions  do.  The
2441       use  of  \K does not interfere with the setting of captured substrings.
2442       For example, when the following pattern  matches  "foobar",  the  first
2443       substring is still set to "foo":
2444
2445       (foo)\Kbar
2446
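       The two \K patterns above can be tried as follows. This is a minimal
       sketch; the module and function names are hypothetical:

```erlang
-module(re_reset_start_example).
-export([plain/0, with_group/0]).

%% The whole of "foobar" must match, but only "bar" is reported.
plain() ->
    re:run("foobar", "foo\\Kbar", [{capture, all, list}]).

%% \K moves the reported start; the captured group is still "foo".
with_group() ->
    re:run("foobar", "(foo)\\Kbar", [{capture, all, list}]).
```

       plain() returns {match,["bar"]} and with_group() returns
       {match,["bar","foo"]}: the overall match first, then subpattern 1.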
2447       Perl  documents  that  the  use  of  \K  within assertions is "not well
2448       defined". In PCRE, \K is acted upon  when  it  occurs  inside  positive
2449       assertions,  but  is  ignored  in negative assertions. Note that when a
2450       pattern such as (?=ab\K) matches, the reported start of the  match  can
2451       be greater than the end of the match.
2452
2453       Simple Assertions
2454
2455       The  final use of backslash is for certain simple assertions. An asser‐
2456       tion specifies a condition that must be met at a particular point in  a
2457       match,  without  consuming  any characters from the subject string. The
2458       use of subpatterns for more complicated assertions is described  below.
2459       The following are the backslashed assertions:
2460
2461         \b:
2462           Matches at a word boundary.
2463
2464         \B:
2465           Matches when not at a word boundary.
2466
2467         \A:
2468           Matches at the start of the subject.
2469
2470         \Z:
2471           Matches  at the end of the subject, and before a newline at the end
2472           of the subject.
2473
2474         \z:
2475           Matches only at the end of the subject.
2476
2477         \G:
2478           Matches at the first matching position in the subject.
2479
2480       Inside a character class, \b has a different meaning;  it  matches  the
2481       backspace  character.  If  any  other  of these assertions appears in a
2482       character class, by default it matches the corresponding literal  char‐
2483       acter (for example, \B matches the letter B).
2484
2485       A  word  boundary is a position in the subject string where the current
2486       character and the previous character do not both match \w or  \W  (that
2487       is,  one  matches  \w and the other matches \W), or the start or end of
2488       the string if the first or last character matches \w, respectively.  In
2489       UTF  mode,  the  meanings of \w and \W can be changed by setting option
2490       ucp. When this is done, it also affects \b and \B. PCRE and Perl do not
2491       have a separate "start of word" or "end of word" metasequence. However,
2492       whatever follows \b normally determines which it is. For  example,  the
2493       fragment \ba matches "a" at the start of a word.
2494
2495       The  \A,  \Z,  and \z assertions differ from the traditional circumflex
2496       and dollar (described in the next section) in that they only ever match
2497       at  the  very start and end of the subject string, whatever options are
2498       set. Thus, they are independent of multiline mode. These  three  asser‐
2499       tions  are  not affected by options notbol or noteol, which affect only
2500       the behavior of the circumflex and dollar metacharacters.  However,  if
2501       argument  startoffset of run/3 is non-zero, indicating that matching is
2502       to start at a point other than the beginning of  the  subject,  \A  can
2503       never match. The difference between \Z and \z is that \Z matches before
2504       a newline at the end of the string  and  at  the  very  end,  while  \z
2505       matches only at the end.
2506
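       The following sketch (hypothetical module name) shows the difference
       between \Z and \z, and the effect of a non-zero offset on \A. Option
       {offset, N} of run/3 plays the role of argument startoffset:

```erlang
-module(re_anchor_example).
-export([before_final_newline/0, very_end_only/0, start_with_offset/0]).

%% \Z also matches just before a newline that ends the subject.
before_final_newline() ->
    re:run("abc\n", "abc\\Z").

%% \z matches only at the very end, so the trailing newline gets in the way.
very_end_only() ->
    re:run("abc\n", "abc\\z").

%% With a non-zero offset, \A cannot match; the unanchored pattern can.
start_with_offset() ->
    {re:run("xab", "\\Aab", [{offset, 1}]),
     re:run("xab", "ab", [{offset, 1}])}.
```

       before_final_newline() returns {match,[{0,3}]}, very_end_only()
       returns nomatch, and start_with_offset() returns
       {nomatch,{match,[{1,2}]}}.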
2507       The  \G assertion is true only when the current matching position is at
2508       the start point of the match, as specified by argument  startoffset  of
2509       run/3. It differs from \A when the value of startoffset is non-zero. By
2510       calling run/3 multiple times with appropriate arguments, you can  mimic
2511       the  Perl  option /g, and it is in this kind of implementation where \G
2512       can be useful.
2513
2514       Notice, however, that the PCRE interpretation of \G, as  the  start  of
2515       the  current  match, is subtly different from Perl, which defines it as
2516       the end of the previous match. In Perl, these can be different when the
2517       previously  matched  string was empty. As PCRE does only one match at a
2518       time, it cannot reproduce this behavior.
2519
2520       If all the alternatives of a pattern begin with \G, the  expression  is
2521       anchored to the starting match position, and the "anchored" flag is set
2522       in the compiled regular expression.
2523
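       One way to use \G when calling run/3 repeatedly is sketched below. The
       module and helper names are hypothetical. Each call starts at the end
       of the previous match, and \G prevents the matcher from skipping ahead
       to a later position:

```erlang
-module(re_g_example).
-export([leading_digits/1]).

%% Collect the digits at the start of Subject (a string), one per call to
%% re:run/3, stopping at the first position that is not a digit.
leading_digits(Subject) ->
    loop(Subject, 0, []).

loop(Subject, Offset, Acc) ->
    case re:run(Subject, "\\G\\d", [{offset, Offset}]) of
        {match, [{Start, Len}]} ->
            %% Indices are zero-based; lists:sublist/3 is one-based.
            Digit = lists:sublist(Subject, Start + 1, Len),
            loop(Subject, Start + Len, [Digit | Acc]);
        nomatch ->
            lists:reverse(Acc)
    end.
```

       For the subject "12a3" this returns ["1","2"]. Without \G, the third
       call would search forward and also find the "3" after the "a".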

CIRCUMFLEX AND DOLLAR

2525       The circumflex and dollar  metacharacters  are  zero-width  assertions.
2526       That  is,  they test for a particular condition to be true without con‐
2527       suming any characters from the subject string.
2528
2529       Outside a character class, in the default matching mode, the circumflex
2530       character  is  an  assertion  that is true only if the current matching
2531       point is at the start of the subject string. If argument startoffset of
2532       run/3  is  non-zero,  circumflex can never match if option multiline is
2533       unset. Inside a character class, circumflex has an  entirely  different
2534       meaning (see below).
2535
       Circumflex need not be the first character of the pattern if some
       alternatives are involved, but it must be the first thing in each
       alternative in which it appears if the pattern is ever to match that
       branch. If all possible alternatives start with a circumflex, that is,
       if the pattern is constrained to match only at the start of the
       subject, it is said to be an "anchored" pattern. (There are also other
       constructs that can cause a pattern to be anchored.)
2543
       The dollar character is an assertion that is true only if the current
       matching point is at the end of the subject string, or immediately
       before a newline at the end of the string (by default). Notice,
       however, that it does not match the newline. Dollar need not be the
       last character of the pattern if some alternatives are involved, but
       it must be the last item in any branch in which it appears. Dollar has
       no special meaning in a character class.
2551
2552       The  meaning  of  dollar  can be changed so that it matches only at the
2553       very end of the string, by setting  option  dollar_endonly  at  compile
2554       time. This does not affect the \Z assertion.
2555
2556       The  meanings  of  the  circumflex and dollar characters are changed if
2557       option multiline is set. When this is the case,  a  circumflex  matches
2558       immediately  after  internal  newlines  and at the start of the subject
2559       string. It does not match after a newline that ends the string. A  dol‐
2560       lar  matches  before  any  newlines in the string, and at the very end,
2561       when multiline is set. When newline is specified as  the  two-character
2562       sequence CRLF, isolated CR and LF characters do not indicate newlines.
2563
2564       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
2565       (where \n represents a newline) in multiline mode, but  not  otherwise.
2566       So, patterns that are anchored in single-line mode because all branches
2567       start with ^ are not anchored in multiline mode, and a match  for  cir‐
2568       cumflex  is  possible  when  argument startoffset of run/3 is non-zero.
2569       Option dollar_endonly is ignored if multiline is set.
2570
       Notice that the sequences \A, \Z, and \z can be used to match the start
       and end of the subject in both modes. If all branches of a pattern
       start with \A, it is always anchored, regardless of whether multiline
       is set.
2574
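       The effect of options multiline and dollar_endonly can be sketched as
       follows. The module name is hypothetical, and the default newline
       convention (LF) is assumed:

```erlang
-module(re_circumflex_dollar_example).
-export([second_line/1, before_newline/1]).

%% ^abc$ against "def\nabc": matches only when multiline is among Opts.
second_line(Opts) ->
    re:run("def\nabc", "^abc$", Opts).

%% abc$ against "abc\n": by default, dollar matches before the final
%% newline; with dollar_endonly it matches only at the very end.
before_newline(Opts) ->
    re:run("abc\n", "abc$", Opts).
```

       second_line([]) returns nomatch, while second_line([multiline])
       returns {match,[{4,3}]}. before_newline([]) returns {match,[{0,3}]},
       while before_newline([dollar_endonly]) returns nomatch.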

FULL STOP (PERIOD, DOT) AND \N

2576       Outside a character class, a dot in the pattern matches  any  character
2577       in  the  subject  string except (by default) a character that signifies
2578       the end of a line.
2579
2580       When a line ending is defined as a single character, dot never  matches
2581       that  character. When the two-character sequence CRLF is used, dot does
2582       not match CR if it is immediately followed by LF, otherwise it  matches
2583       all  characters (including isolated CRs and LFs). When any Unicode line
2584       endings are recognized, dot does not match CR, LF, or any of the  other
2585       line-ending characters.
2586
2587       The behavior of dot regarding newlines can be changed. If option dotall
2588       is set, a dot matches any character, without  exception.  If  the  two-
2589       character  sequence CRLF is present in the subject string, it takes two
2590       dots to match it.
2591
       The handling of dot is entirely independent of the handling of
       circumflex and dollar; the only relationship is that both involve
       newlines. Dot has no special meaning in a character class.
2595
       The escape sequence \N behaves like a dot, except that it is not
       affected by option dotall. That is, it matches any character except
       one that signifies the end of a line. Perl also uses \N to match
       characters by name, but PCRE does not support this.
2600
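       A short sketch of how dot and \N treat a newline, with and without
       option dotall. The module name is hypothetical, and the default
       newline convention (LF) is assumed:

```erlang
-module(re_dot_example).
-export([dot/1, backslash_n/1]).

%% "a.b" against "a\nb": the dot matches the newline only with dotall.
dot(Opts) ->
    re:run("a\nb", "a.b", Opts).

%% "a\Nb" against "a\nb": \N never matches the newline.
backslash_n(Opts) ->
    re:run("a\nb", "a\\Nb", Opts).
```

       dot([]) returns nomatch and dot([dotall]) returns {match,[{0,3}]},
       whereas backslash_n/1 returns nomatch for both option lists.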

MATCHING A SINGLE DATA UNIT

2602       Outside  a  character  class,  the  escape sequence \C matches any data
       unit, regardless of whether a UTF mode is set. One data unit is one byte.
2604       Unlike  a dot, \C always matches line-ending characters. The feature is
2605       provided in Perl to match individual bytes in UTF-8  mode,  but  it  is
2606       unclear  how  it  can usefully be used. As \C breaks up characters into
2607       individual data units, matching one unit with \C in a  UTF  mode  means
2608       that  the  remaining  string  can start with a malformed UTF character.
2609       This has undefined results, as PCRE assumes that it  deals  with  valid
2610       UTF strings.
2611
2612       PCRE  does  not  allow \C to appear in lookbehind assertions (described
2613       below) in a UTF mode, as this would make it impossible to calculate the
2614       length of the lookbehind.
2615
2616       The  \C  escape  sequence is best avoided. However, one way of using it
2617       that avoids the problem of malformed UTF characters is to use a  looka‐
2618       head  to  check  the  length of the next character, as in the following
2619       pattern, which can be used with a UTF-8 string (ignore  whitespace  and
2620       line breaks):
2621
2622       (?| (?=[\x00-\x7f])(\C) |
2623           (?=[\x80-\x{7ff}])(\C)(\C) |
2624           (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
2625           (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
2626
2627       A  group  that starts with (?| resets the capturing parentheses numbers
2628       in each alternative (see section  Duplicate  Subpattern  Numbers).  The
2629       assertions  at  the start of each branch check the next UTF-8 character
2630       for values whose encoding uses 1, 2, 3, or 4 bytes,  respectively.  The
2631       individual  bytes of the character are then captured by the appropriate
2632       number of groups.
2633

SQUARE BRACKETS AND CHARACTER CLASSES

2635       An opening square bracket introduces a character class, terminated by a
2636       closing square bracket. A closing square bracket on its own is not spe‐
2637       cial by default. However, if option PCRE_JAVASCRIPT_COMPAT  is  set,  a
2638       lone  closing  square bracket causes a compile-time error. If a closing
2639       square bracket is required as a member of the class, it is  to  be  the
2640       first  data  character  in  the  class (after an initial circumflex, if
2641       present) or escaped with a backslash.
2642
2643       A character class matches a single character in the subject. In  a  UTF
2644       mode,  the  character  can  be  more than one data unit long. A matched
2645       character must be in the set of characters defined by the class, unless
2646       the  first  character in the class definition is a circumflex, in which
2647       case the subject character must not be in the set defined by the class.
2648       If a circumflex is required as a member of the class, ensure that it is
2649       not the first character, or escape it with a backslash.
2650
2651       For example, the character class [aeiou] matches any  lowercase  vowel,
2652       while  [^aeiou]  matches  any  character that is not a lowercase vowel.
2653       Notice that a circumflex is just a convenient notation  for  specifying
2654       the characters that are in the class by enumerating those that are not.
2655       A class that starts with a circumflex is not  an  assertion;  it  still
2656       consumes a character from the subject string, and therefore it fails if
2657       the current pointer is at the end of the string.
2658
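       The following sketch (hypothetical module name) shows a positive and a
       negated class, and that a negated class still has to consume a
       character:

```erlang
-module(re_class_example).
-export([vowel/1, non_vowels/1, negated_on_empty/0]).

%% Position and length of the first lowercase vowel, if any.
vowel(Subject) ->
    re:run(Subject, "[aeiou]").

%% First run of characters that are not lowercase vowels.
non_vowels(Subject) ->
    re:run(Subject, "[^aeiou]+", [{capture, first, list}]).

%% A negated class is not an assertion: it needs a character to consume,
%% so it cannot match an empty subject.
negated_on_empty() ->
    re:run("", "[^a]").
```

       vowel("sky") returns nomatch, non_vowels("sky") returns
       {match,["sky"]}, and negated_on_empty() returns nomatch.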
       In UTF-8 mode, characters with values > 255 (0xff) can be included in
       a class as a literal string of data units, or by using the \x{ escaping
       mechanism.
2662
2663       When caseless matching is set, any letters in a  class  represent  both
2664       their uppercase and lowercase versions. For example, a caseless [aeiou]
2665       matches "A" and "a", and a caseless [^aeiou] does not match "A", but  a
2666       caseful  version would. In a UTF mode, PCRE always understands the con‐
2667       cept of case for characters whose values are < 256, so caseless  match‐
2668       ing  is always possible. For characters with higher values, the concept
2669       of case is supported only if PCRE is  compiled  with  Unicode  property
       support. If you want to use caseless matching in a UTF mode for these
       higher-valued characters, ensure that PCRE is compiled with both
       Unicode property support and UTF support.
2673
       Characters that can indicate line breaks are never treated in any
       special way when matching character classes, whatever line-ending
       sequence is in use, and whatever setting of options dotall and
       multiline is used. A class such as [^a] always matches one of these
       characters.
2679
2680       The  minus (hyphen) character can be used to specify a range of charac‐
2681       ters in a character  class.  For  example,  [d-m]  matches  any  letter
2682       between  d  and  m,  inclusive.  If  a minus character is required in a
2683       class, it must be escaped with a backslash  or  appear  in  a  position
2684       where  it cannot be interpreted as indicating a range, typically as the
2685       first or last character in the class, or immediately after a range. For
2686       example,  [b-d-z] matches letters in the range b to d, a hyphen charac‐
2687       ter, or z.
2688
2689       The literal character "]" cannot be the end character  of  a  range.  A
2690       pattern  such  as  [W-]46]  is interpreted as a class of two characters
2691       ("W" and "-") followed by a literal string "46]",  so  it  would  match
2692       "W46]"  or  "-46]".  However, if "]" is escaped with a backslash, it is
2693       interpreted as the end of range, so [W-\]46] is interpreted as a  class
2694       containing a range followed by two other characters. The octal or hexa‐
2695       decimal representation of "]" can also be used to end a range.
2696
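       The hyphen and closing-bracket rules above can be checked with a small
       sketch. The module name is hypothetical; each pattern is wrapped in ^
       and $ so that the whole subject has to match:

```erlang
-module(re_range_example).
-export([b_to_d_hyphen_z/1, w_hyphen_then_46/1, w_to_bracket_4_6/1]).

%% [b-d-z]: the range b to d, a literal hyphen, or z.
b_to_d_hyphen_z(Subject) ->
    re:run(Subject, "^[b-d-z]$").

%% [W-]46]: the class [W-] followed by the literal string "46]".
w_hyphen_then_46(Subject) ->
    re:run(Subject, "^[W-]46]$").

%% [W-\]46]: one class holding the range W to ], plus "4" and "6".
w_to_bracket_4_6(Subject) ->
    re:run(Subject, "^[W-\\]46]$").
```

       For example, w_hyphen_then_46("-46]") matches all four characters,
       while w_to_bracket_4_6("Z") matches because "Z" lies between "W" and
       "]" in character-value order.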
2697       An error is generated if a POSIX character  class  (see  below)  or  an
2698       escape  sequence other than one that defines a single character appears
2699       at a point where a range ending character  is  expected.  For  example,
2700       [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not.
2701
2702       Ranges  operate in the collating sequence of character values. They can
2703       also  be  used  for  characters  specified  numerically,  for  example,
2704       [\000-\037].  Ranges  can include any characters that are valid for the
2705       current mode.
2706
2707       If a range that includes letters is used when caseless matching is set,
2708       it matches the letters in either case. For example, [W-c] is equivalent
2709       to [][\\^_`wxyzabc], matched caselessly. In a non-UTF mode, if  charac‐
2710       ter tables for a French locale are in use, [\xc8-\xcb] matches accented
2711       E characters in both cases. In UTF modes, PCRE supports the concept  of
2712       case  for  characters  with  values > 255 only when it is compiled with
2713       Unicode property support.
2714
2715       The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,  \V,
2716       \w, and \W can appear in a character class, and add the characters that
2717       they match to the class. For example, [\dABCDEF] matches any  hexadeci‐
2718       mal  digit. In UTF modes, option ucp affects the meanings of \d, \s, \w
2719       and their uppercase partners, just as it does when they appear  outside
2720       a character class, as described in section Generic Character Types ear‐
2721       lier. The escape sequence \b has a different meaning inside a character
2722       class;  it  matches  the backspace character. The sequences \B, \N, \R,
2723       and \X are not special inside a character class. Like any other  unrec‐
2724       ognized  escape  sequences,  they are treated as the literal characters
2725       "B", "N", "R", and "X".
2726
2727       A circumflex can conveniently be  used  with  the  uppercase  character
2728       types  to specify a more restricted set of characters than the matching
2729       lowercase type. For example, class [^\W_] matches any letter or  digit,
2730       but  not underscore, while [\w] includes underscore. A positive charac‐
2731       ter class is to be read as "something OR something OR ..." and a  nega‐
2732       tive class as "NOT something AND NOT something AND NOT ...".
2733
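       Character type escapes inside a class can be sketched as follows (the
       module name is hypothetical):

```erlang
-module(re_class_escape_example).
-export([upper_hex/1, alnum_run/1, word_run/1]).

%% [\dABCDEF]: a decimal digit or an uppercase hexadecimal letter.
upper_hex(Subject) ->
    re:run(Subject, "^[\\dABCDEF]+$").

%% [^\W_]: NOT a non-word character AND NOT underscore, that is,
%% a letter or a digit.
alnum_run(Subject) ->
    re:run(Subject, "[^\\W_]+", [{capture, first, list}]).

%% [\w]: letters, digits, and underscore.
word_run(Subject) ->
    re:run(Subject, "[\\w]+", [{capture, first, list}]).
```

       With the subject "a_1", alnum_run/1 returns {match,["a"]} because the
       run stops at the underscore, while word_run/1 returns {match,["a_1"]}.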
2734       Only the following metacharacters are recognized in character classes:
2735
2736         * Backslash
2737
2738         * Hyphen (only where it can be interpreted as specifying a range)
2739
2740         * Circumflex (only at the start)
2741
2742         * Opening  square  bracket (only when it can be interpreted as intro‐
2743           ducing a Posix class name, or for a special compatibility  feature;
2744           see the next two sections)
2745
2746         * Terminating closing square bracket
2747
2748       However, escaping other non-alphanumeric characters does no harm.
2749

POSIX CHARACTER CLASSES

2751       Perl supports the Posix notation for character classes. This uses names
2752       enclosed by [: and :] within the enclosing square brackets.  PCRE  also
2753       supports  this  notation.  For example, the following matches "0", "1",
2754       any alphabetic character, or "%":
2755
2756       [01[:alpha:]%]
2757
2758       The following are the supported class names:
2759
2760         alnum:
2761           Letters and digits
2762
2763         alpha:
2764           Letters
2765
2766         ascii:
2767           Character codes 0-127
2768
2769         blank:
2770           Space or tab only
2771
2772         cntrl:
2773           Control characters
2774
2775         digit:
2776           Decimal digits (same as \d)
2777
2778         graph:
2779           Printing characters, excluding space
2780
2781         lower:
2782           Lowercase letters
2783
2784         print:
2785           Printing characters, including space
2786
2787         punct:
2788           Printing characters, excluding letters, digits, and space
2789
2790         space:
2791           Whitespace (the same as \s from PCRE 8.34)
2792
2793         upper:
2794           Uppercase letters
2795
2796         word:
2797           "Word" characters (same as \w)
2798
2799         xdigit:
2800           Hexadecimal digits
2801
2802       The default "space" characters are HT (9), LF (10), VT (11),  FF  (12),
2803       CR  (13),  and space (32). If locale-specific matching is taking place,
2804       the list of space characters may be different; there may  be  fewer  or
       more of them. "Space" used to be different from \s, which did not include
2806       VT, for Perl compatibility. However, Perl changed at release 5.18,  and
2807       PCRE followed at release 8.34. "Space" and \s now match the same set of
2808       characters.
2809
2810       The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
2811       from  Perl  5.8. Another Perl extension is negation, which is indicated
2812       by a ^ character after the colon. For example,  the  following  matches
2813       "1", "2", or any non-digit:
2814
2815       [12[:^digit:]]
2816
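       The two POSIX class examples above can be tried with the following
       sketch (hypothetical module name):

```erlang
-module(re_posix_class_example).
-export([zero_one_alpha_percent/1, one_two_or_non_digit/1]).

%% [01[:alpha:]%]: "0", "1", any alphabetic character, or "%".
zero_one_alpha_percent(Subject) ->
    re:run(Subject, "^[01[:alpha:]%]+$").

%% [12[:^digit:]]: "1", "2", or any character that is not a digit.
one_two_or_non_digit(Subject) ->
    re:run(Subject, "[12[:^digit:]]").
```

       zero_one_alpha_percent("x%1") matches the whole subject, whereas
       zero_one_alpha_percent("2") returns nomatch. Similarly,
       one_two_or_non_digit("a") matches and one_two_or_non_digit("3") does
       not.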
2817       PCRE (and Perl) also recognize the Posix syntax [.ch.] and [=ch=] where
2818       "ch" is a "collating element", but these  are  not  supported,  and  an
2819       error is given if they are encountered.
2820
       By default, characters with values > 255 do not match any of the Posix
       character classes. However, if option ucp is set, some of the classes
       are changed so that Unicode character properties are used. This is
       achieved by replacing certain Posix classes by other sequences, as
       follows:
2826
2827         [:alnum:]:
2828           Becomes \p{Xan}
2829
2830         [:alpha:]:
2831           Becomes \p{L}
2832
2833         [:blank:]:
2834           Becomes \h
2835
2836         [:digit:]:
2837           Becomes \p{Nd}
2838
2839         [:lower:]:
2840           Becomes \p{Ll}
2841
2842         [:space:]:
2843           Becomes \p{Xps}
2844
2845         [:upper:]:
2846           Becomes \p{Lu}
2847
2848         [:word:]:
2849           Becomes \p{Xwd}
2850
2851       Negated versions, such as [:^alpha:], use \P instead of \p. Three other
2852       POSIX classes are handled specially in UCP mode:
2853
2854         [:graph:]:
2855           This matches characters that have glyphs that mark  the  page  when
2856           printed.  In Unicode property terms, it matches all characters with
2857           the L, M, N, P, S, or Cf properties, except for:
2858
2859           U+061C:
2860             Arabic Letter Mark
2861
2862           U+180E:
2863             Mongolian Vowel Separator
2864
2865           U+2066 - U+2069:
2866             Various "isolate"s
2867
2868         [:print:]:
2869           This matches the same characters as [:graph:] plus space characters
2870           that are not controls, that is, characters with the Zs property.
2871
2872         [:punct:]:
2873           This  matches  all characters that have the Unicode P (punctuation)
2874           property, plus those characters whose code points are less than 128
2875           that have the S (Symbol) property.
2876
2877       The  other  POSIX classes are unchanged, and match only characters with
2878       code points less than 128.
2879
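       As a rough illustration of the effect of ucp on the POSIX classes,
       the following shell expressions match the Greek letter alpha (code
       point 16#3B1) against [[:alpha:]]. This is only a sketch; the
       variable names are chosen for the example:

```erlang
% One code point above 255; option unicode is needed for such a subject.
Alpha = [16#3B1].
% Without ucp, [:alpha:] covers no character above 255.
NoUcp = re:run(Alpha, "[[:alpha:]]", [unicode, {capture, none}]).
% With ucp, [:alpha:] is replaced by \p{L}, which includes Greek letters.
WithUcp = re:run(Alpha, "[[:alpha:]]", [unicode, ucp, {capture, none}]).
```
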
2880       Compatibility Feature for Word Boundaries
2881
2882       In the POSIX.2 compliant library that was included in 4.4BSD Unix,  the
2883       ugly  syntax  [[:<:]]  and [[:>:]] is used for matching "start of word"
2884       and "end of word". PCRE treats these items as follows:
2885
2886         [[:<:]]:
2887           is converted to \b(?=\w)
2888
2889         [[:>:]]:
2890           is converted to \b(?<=\w)
2891
2892       Only these exact character sequences are recognized. A sequence such as
2893       [a[:<:]b]  provokes an error for an unrecognized POSIX class name. This
2894       support is not compatible with Perl. It is provided to help  migrations
2895       from other environments, and is best not used in any new patterns. Note
2896       that \b matches at the start and the end of a word (see "Simple  asser‐
2897       tions"  above),  and in a Perl-style pattern the preceding or following
2898       character normally shows which is wanted,  without  the  need  for  the
2899       assertions  that  are  used  above  to  give  exactly  the  POSIX
2900       behavior.
2901

VERTICAL BAR

2903       Vertical bar characters are used to separate alternative patterns.  For
2904       example, the following pattern matches either "gilbert" or "sullivan":
2905
2906       gilbert|sullivan
2907
2908       Any number of alternatives can appear, and an empty alternative is per‐
2909       mitted (matching the empty string). The  matching  process  tries  each
2910       alternative in turn, from left to right, and the first that succeeds is
2911       used. If the alternatives are within a subpattern (defined  in  section
2912       Subpatterns),  "succeeds" means matching the remaining main pattern and
2913       the alternative in the subpattern.
2914
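       The left-to-right rule can be tried in the Erlang shell. The
       expressions below are a minimal sketch (the subjects and variable
       names are made up for the example):

```erlang
% Either alternative can match.
V1 = re:run("sullivan", "gilbert|sullivan", [{capture, first, list}]).
% Alternatives are tried from left to right, so "ab" wins over "abcd".
V2 = re:run("abcd", "ab|abcd", [{capture, first, list}]).
% An empty alternative matches the empty string.
V3 = re:run("xyz", "gilbert|", [{capture, first, list}]).
```
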

INTERNAL OPTION SETTING

2916       The  settings  of  the  Perl-compatible  options  caseless,  multiline,
2917       dotall,  and  extended  can  be  changed  from  within the pattern by a
2918       sequence of Perl option letters enclosed  between  "(?"  and  ")".  The
2919       option letters are as follows:
2920
2921         i:
2922           For caseless
2923
2924         m:
2925           For multiline
2926
2927         s:
2928           For dotall
2929
2930         x:
2931           For extended
2932
2933       For example, (?im) sets caseless, multiline matching. These options can
2934       also be unset by preceding the letter with a hyphen. A combined setting
2935       and  unsetting  such  as  (?im-sx),  which sets caseless and multiline,
2936       while unsetting dotall and extended, is also  permitted.  If  a  letter
2937       appears both before and after the hyphen, the option is unset.
2938
2939       The  PCRE-specific options dupnames, ungreedy, and extra can be changed
2940       in the same way as the Perl-compatible options by using the  characters
2941       J, U, and X respectively.
2942
2943       When  one  of  these  option  changes occurs at top-level (that is, not
2944       inside subpattern parentheses), the change applies to the remainder  of
2945       the pattern that follows.
2946
2947       An  option change within a subpattern (see section Subpatterns) affects
2948       only that part of the subpattern that follows  it.  So,  the  following
2949       matches  abc  and  aBc  and  no other strings (assuming caseless is not
2950       used):
2951
2952       (a(?i)b)c
2953
2954       By this means, options can be made to have different settings  in  dif‐
2955       ferent  parts  of  the  pattern. Any changes made in one alternative do
2956       carry on into subsequent branches within the same subpattern. For exam‐
2957       ple:
2958
2959       (a(?i)b|c)
2960
2961       matches  "ab", "aB", "c", and "C", although when matching "C" the first
2962       branch is abandoned before the option  setting.  This  is  because  the
2963       effects  of  option settings occur at compile time. There would be some
2964       weird behavior otherwise.
2965
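       The two patterns above can be checked with run/3. This is a sketch
       for the shell; Opts and the result names are only illustrative:

```erlang
Opts = [{capture, none}].
% (?i) inside the parentheses affects only the "b" that follows it.
C1 = re:run("aBc", "(a(?i)b)c", Opts).
C2 = re:run("Abc", "(a(?i)b)c", Opts).
C3 = re:run("abC", "(a(?i)b)c", Opts).
% The setting made in the first branch carries on into the second one.
C4 = re:run("C", "(a(?i)b|c)", Opts).
```

       C1 and C4 are expected to be match, C2 and C3 nomatch.
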
2966   Note:
2967       Other PCRE-specific options can be set by the application when the com‐
2968       piling or matching functions are called. Sometimes the pattern can con‐
2969       tain special leading sequences, such as (*CRLF), to override  what  the
2970       application has set or what has been defaulted. Details are provided in
2971       section  Newline Sequences earlier.
2972
2973       The (*UTF8) and (*UCP) leading sequences can be used  to  set  UTF  and
2974       Unicode  property modes. They are equivalent to setting options unicode
2975       and ucp, respectively. The (*UTF) sequence is a  generic  version  that
2976       can be used with any of the libraries. However, the application can set
2977       option never_utf, which locks out the use of the (*UTF) sequences.
2978
2979

SUBPATTERNS

2981       Subpatterns are delimited by parentheses (round brackets), which can be
2982       nested. Turning part of a pattern into a subpattern does two things:
2983
2984         1.:
2985           It localizes a set of alternatives. For example, the following pat‐
2986           tern matches "cataract", "caterpillar", or "cat":
2987
2988           cat(aract|erpillar|)
2989
2990           Without the parentheses, it would match "cataract", "erpillar",  or
2991           an empty string.
2992
2993         2.:
2994           It  sets up the subpattern as a capturing subpattern. That is, when
2995           the complete pattern matches, that portion of  the  subject  string
2996           that  matched  the  subpattern is passed back to the caller through
2997           the return value of run/3.
2998
2999       Opening parentheses are counted from left to right (starting from 1) to
3000       obtain  numbers  for  the  capturing  subpatterns.  For example, if the
3001       string "the red king" is matched against  the  following  pattern,  the
3002       captured substrings are "red king", "red", and "king", and are numbered
3003       1, 2, and 3, respectively:
3004
3005       the ((red|white) (king|queen))
3006
3007       It is not always helpful that plain parentheses fulfill two  functions.
3008       Often  a  grouping  subpattern is required without a capturing require‐
3009       ment. If an opening parenthesis is followed by a question  mark  and  a
3010       colon,  the  subpattern  does  not do any capturing, and is not counted
3011       when computing the number of any subsequent capturing subpatterns.  For
3012       example, if the string "the white queen" is matched against the follow‐
3013       ing pattern, the captured substrings are "white queen" and "queen", and
3014       are numbered 1 and 2:
3015
3016       the ((?:red|white) (king|queen))
3017
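       The numbering can be observed through run/3. In this sketch, the
       capture specification all returns the whole match first, followed
       by the numbered subpatterns (the variable names are illustrative):

```erlang
% Three capturing subpatterns: "red king", "red", "king".
D1 = re:run("the red king",
            "the ((red|white) (king|queen))",
            [{capture, all, list}]).
% (?:...) groups without capturing, so only two subpatterns remain.
D2 = re:run("the white queen",
            "the ((?:red|white) (king|queen))",
            [{capture, all, list}]).
```
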
3018       The maximum number of capturing subpatterns is 65535.
3019
3020       As  a  convenient shorthand, if any option settings are required at the
3021       start of a non-capturing subpattern,  the  option  letters  can  appear
3022       between  "?"  and  ":". Thus, the following two patterns match the same
3023       set of strings:
3024
3025       (?i:saturday|sunday)
3026       (?:(?i)saturday|sunday)
3027
3028       As alternative branches are tried from left to right, and  options  are
3029       not reset until the end of the subpattern is reached, an option setting
3030       in one branch does affect subsequent branches, so  the  above  patterns
3031       match both "SUNDAY" and "Saturday".
3032
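       A short sketch of the shorthand, for the Erlang shell (names are
       illustrative). The third call omits the option for comparison:

```erlang
E1 = re:run("SUNDAY", "(?i:saturday|sunday)", [{capture, none}]).
E2 = re:run("Saturday", "(?:(?i)saturday|sunday)", [{capture, none}]).
% Without the option setting, matching is caseful.
E3 = re:run("SUNDAY", "(?:saturday|sunday)", [{capture, none}]).
```
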

DUPLICATE SUBPATTERN NUMBERS

3034       Perl  5.10  introduced a feature where each alternative in a subpattern
3035       uses the same numbers for its capturing parentheses. Such a  subpattern
3036       starts  with (?| and is itself a non-capturing subpattern. For example,
3037       consider the following pattern:
3038
3039       (?|(Sat)ur|(Sun))day
3040
3041       As the two alternatives are inside a (?| group, both sets of  capturing
3042       parentheses  are  numbered one. Thus, when the pattern matches, you can
3043       look at captured substring number one, whichever  alternative  matched.
3044       This  construct is useful when you want to capture a part, but not all,
3045       of one of many alternatives. Inside a (?| group, parentheses  are  num‐
3046       bered  as  usual,  but the number is reset at the start of each branch.
3047       The numbers of any capturing parentheses  that  follow  the  subpattern
3048       start  after the highest number used in any branch. The following exam‐
3049       ple is from the Perl documentation;  the  numbers  underneath  show  in
3050       which buffer the captured content is stored:
3051
3052       # before  ---------------branch-reset----------- after
3053       / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
3054       # 1            2         2  3        2     3     4
3055
3056       A  back  reference  to a numbered subpattern uses the most recent value
3057       that is set for that number by any subpattern.  The  following  pattern
3058       matches "abcabc" or "defdef":
3059
3060       /(?|(abc)|(def))\1/
3061
3062       In  contrast,  a subroutine call to a numbered subpattern always refers
3063       to the first one in the pattern with the given  number.  The  following
3064       pattern matches "abcabc" or "defabc":
3065
3066       /(?|(abc)|(def))(?1)/
3067
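       The following shell sketch exercises the three patterns above.
       Backslashes are doubled because the patterns are Erlang string
       literals; the variable names are made up for the example:

```erlang
% Both branches store their capture in subpattern number 1.
F1 = re:run("Saturday", "(?|(Sat)ur|(Sun))day", [{capture, [1], list}]).
F2 = re:run("Sunday", "(?|(Sat)ur|(Sun))day", [{capture, [1], list}]).
% A back reference uses the value most recently set for number 1.
F3 = re:run("defdef", "(?|(abc)|(def))\\1", [{capture, none}]).
F4 = re:run("defabc", "(?|(abc)|(def))\\1", [{capture, none}]).
% A subroutine call always refers to the first group numbered 1, (abc).
F5 = re:run("defabc", "(?|(abc)|(def))(?1)", [{capture, none}]).
```
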
3068       If  a  condition  test for a subpattern having matched refers to a non-
3069       unique number, the test is true if any of the subpatterns of that  num‐
3070       ber have matched.
3071
3072       An  alternative  approach to using this "branch reset" feature is to use
3073       duplicate named subpatterns, as described in the next section.
3074

NAMED SUBPATTERNS

3076       Identifying capturing parentheses by number is simple, but  it  can  be
3077       hard  to  keep track of the numbers in complicated regular expressions.
3078       Also, if an expression is modified, the numbers  can  change.  To  help
3079       with  this  difficulty,  PCRE  supports the naming of subpatterns. This
3080       feature was not added to Perl until release 5.10. Python had  the  fea‐
3081       ture  earlier,  and PCRE introduced it at release 4.0, using the Python
3082       syntax. PCRE now supports both the Perl and  the  Python  syntax.  Perl
3083       allows  identically  numbered  subpatterns to have different names, but
3084       PCRE does not.
3085
3086       In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)
3087       or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
3088       to capturing parentheses from other parts of the pattern, such as  back
3089       references,  recursion, and conditions, can be made by name and by num‐
3090       ber.
3091
3092       Names consist of up to 32 alphanumeric characters and underscores,  but
3093       must  start  with  a  non-digit.  Named capturing parentheses are still
3094       allocated numbers as well as names, exactly as if the  names  were  not
3095       present.  The  capture  specification  to run/3 can use named values if
3096       they are present in the regular expression.
3097
3098       By default, a name must be unique within a pattern, but this constraint
3099       can  be  relaxed by setting option dupnames at compile time. (Duplicate
3100       names are also always permitted for subpatterns with the  same  number,
3101       set  up  as  described in the previous section.) Duplicate names can be
3102       useful for patterns where only one instance of  the  named  parentheses
3103       can match. Suppose that you want to match the name of a weekday, either
3104       as a 3-letter abbreviation or as the full name, and in both  cases  you
3105       want  to  extract the abbreviation. The following pattern (ignoring the
3106       line breaks) does the job:
3107
3108       (?<DN>Mon|Fri|Sun)(?:day)?|
3109       (?<DN>Tue)(?:sday)?|
3110       (?<DN>Wed)(?:nesday)?|
3111       (?<DN>Thu)(?:rsday)?|
3112       (?<DN>Sat)(?:urday)?
3113
3114       There are five capturing substrings, but only one is ever set  after  a
3115       match.  (An alternative way of solving this problem is to use a "branch
3116       reset" subpattern, as described in the previous section.)
3117
3118       For capturing named subpatterns whose names are not unique,  the  first
3119       matching  occurrence  (counted  from  left  to right in the subject) is
3120       returned from run/3, if the name is specified in the values part of the
3121       capture  statement. The all_names capturing value matches all the names
3122       in the same way.
3123
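       As a sketch of how names can be used in the capture specification
       of run/3, the first call below uses unique names and the others use
       the weekday pattern with option dupnames. The subjects and variable
       names are illustrative only:

```erlang
% Unique names, selected by name in the capture specification.
G0 = re:run("2017-01-11", "(?<year>\\d{4})-(?<month>\\d\\d)",
            [{capture, [year, month], list}]).
% Adjacent string literals are concatenated, so this is one pattern.
Days = "(?<DN>Mon|Fri|Sun)(?:day)?|"
       "(?<DN>Tue)(?:sday)?|"
       "(?<DN>Wed)(?:nesday)?|"
       "(?<DN>Thu)(?:rsday)?|"
       "(?<DN>Sat)(?:urday)?".
% dupnames allows the name DN to be used for several subpatterns.
G1 = re:run("Thursday", Days, [dupnames, {capture, ['DN'], list}]).
G2 = re:run("Sat", Days, [dupnames, {capture, ['DN'], list}]).
```
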
3124   Note:
3125       You cannot use different names to distinguish between  two  subpatterns
3126       with  the same number, as PCRE uses only the numbers when matching. For
3127       this reason, an error is given at compile time if different  names  are
3128       given  to  subpatterns with the same number. However, you can give the
3129       same name to subpatterns with  the  same  number,  even  when  dupnames
3130       is not set.
3131
3132

REPETITION

3134       Repetition  is  specified  by  quantifiers, which can follow any of the
3135       following items:
3136
3137         * A literal data character
3138
3139         * The dot metacharacter
3140
3141         * The \C escape sequence
3142
3143         * The \X escape sequence
3144
3145         * The \R escape sequence
3146
3147         * An escape such as \d or \pL that matches a single character
3148
3149         * A character class
3150
3151         * A back reference (see the next section)
3152
3153         * A parenthesized subpattern (including assertions)
3154
3155         * A subroutine call to a subpattern (recursive or otherwise)
3156
3157       The general repetition quantifier specifies a minimum and maximum  num‐
3158       ber  of  permitted matches, by giving the two numbers in curly brackets
3159       (braces), separated by a comma. The numbers must be <  65536,  and  the
3160       first  must  be less than or equal to the second. For example, the fol‐
3161       lowing matches "zz", "zzz", or "zzzz":
3162
3163       z{2,4}
3164
3165       A closing brace on its own is not a special character.  If  the  second
3166       number  is  omitted, but the comma is present, there is no upper limit.
3167       If the second number and the comma are  both  omitted,  the  quantifier
3168       specifies  an  exact  number  of  required matches. Thus, the following
3169       matches at least three successive vowels, but can match many more:
3170
3171       [aeiou]{3,}
3172
3173       The following matches exactly eight digits:
3174
3175       \d{8}
3176
3177       An opening curly bracket that appears in a position where a  quantifier
3178       is  not allowed, or one that does not match the syntax of a quantifier,
3179       is taken as a literal character. For example, {,6} is not a quantifier,
3180       but a literal string of four characters.
3181
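       The counted quantifiers above can be tried as follows. This is a
       sketch for the Erlang shell with made-up subjects:

```erlang
% Greedy: takes four of the five available "z" characters.
H1 = re:run("zzzzz", "z{2,4}", [{capture, first, list}]).
% Fewer than the minimum number of repeats does not match.
H2 = re:run("z", "z{2,4}", [{capture, none}]).
% No upper limit: all five successive vowels are taken.
H3 = re:run("queueing", "[aeiou]{3,}", [{capture, first, list}]).
% Exactly eight digits.
H4 = re:run("id 20170111", "\\d{8}", [{capture, first, list}]).
```
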
3182       In  Unicode  mode, quantifiers apply to characters rather than to indi‐
3183       vidual data units. Thus, for example, \x{100}{2}  matches  two  charac‐
3184       ters,  each  of  which  is  represented by a 2-byte sequence in a UTF-8
3185       string. Similarly, \X{3} matches three Unicode extended grapheme  clus‐
3186       ters,  each  of  which  can be many data units long (and they can be of
3187       different lengths).
3188
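       A small sketch of a quantifier applied to a character rather than
       to a data unit. With option unicode, run/3 reports indexes as byte
       offsets into the UTF-8 encoded subject, so two 2-byte characters
       give a match of length 4 (the variable name is illustrative):

```erlang
% Two code points 16#100, each two bytes long in UTF-8.
U1 = re:run([16#100, 16#100], "\\x{100}{2}",
            [unicode, {capture, first, index}]).
```
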
3189       The quantifier {0} is permitted, causing the expression to behave as if
3190       the previous item and the quantifier were not present. This can be use‐
3191       ful for subpatterns that are referenced as subroutines  from  elsewhere
3192       in  the  pattern (but see also section  Defining Subpatterns for Use by
3193       Reference Only). Items other than subpatterns that have a  {0}  quanti‐
3194       fier are omitted from the compiled pattern.
3195
3196       For  convenience, the three most common quantifiers have single-charac‐
3197       ter abbreviations:
3198
3199         *:
3200           Equivalent to {0,}
3201
3202         +:
3203           Equivalent to {1,}
3204
3205         ?:
3206           Equivalent to {0,1}
3207
3208       Infinite loops can be constructed by following a  subpattern  that  can
3209       match  no  characters  with  a  quantifier that has no upper limit, for
3210       example:
3211
3212       (a?)*
3213
3214       Earlier versions of Perl and PCRE used to give an error at compile time
3215       for  such  patterns.  As  there are cases where this can be useful, such
3216       patterns are now accepted. However,  if  any  repetition  of  the
3217       subpattern matches no characters, the loop is forcibly broken.
3218
3219       By  default,  the quantifiers are "greedy", that is, they match as much
3220       as possible (up to the maximum  number  of  permitted  times),  without
3221       causing  the  remaining  pattern  to fail. The classic example of where
3222       this gives problems is in trying to match comments in C programs. These
3223       appear  between /* and */. Within the comment, individual * and / char‐
3224       acters can appear. An attempt to match C comments by applying the  pat‐
3225       tern
3226
3227       /\*.*\*/
3228
3229       to the string
3230
3231       /* first comment */  not comment  /* second comment */
3232
3233       fails,  as  it matches the entire string owing to the greediness of the
3234       .* item.
3235
3236       However, if a quantifier is followed by a question mark, it  ceases  to
3237       be greedy, and instead matches the minimum number of times possible, so
3238       the following pattern does the right thing with the C comments:
3239
3240       /\*.*?\*/
3241
3242       The meaning of the various quantifiers is not otherwise  changed,  only
3243       the  preferred  number  of matches. Do not confuse this use of question
3244       mark with its use as a quantifier in its own right. As it has two uses,
3245       it can sometimes appear doubled, as in
3246
3247       \d??\d
3248
3249       which matches one digit by preference, but can match two if that is the
3250       only way the remaining pattern matches.
3251
3252       If option ungreedy is set (an option that is not  available  in  Perl),
3253       the  quantifiers  are not greedy by default, but individual ones can be
3254       made greedy by following them with a question mark. That is, it inverts
3255       the default behavior.
3256
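       The C comment example can be reproduced in the shell. In this
       sketch, Src is the subject string from above and the backslashes
       are doubled because the patterns are Erlang string literals:

```erlang
Src = "/* first comment */  not comment  /* second comment */".
% Greedy .* runs to the last "*/", so the entire string is matched.
J1 = re:run(Src, "/\\*.*\\*/", [{capture, first, list}]).
% Minimizing .*? stops at the first "*/".
J2 = re:run(Src, "/\\*.*?\\*/", [{capture, first, list}]).
% Option ungreedy inverts the default, so plain .* is now minimizing.
J3 = re:run(Src, "/\\*.*\\*/", [ungreedy, {capture, first, list}]).
```
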
3257       When  a  parenthesized  subpattern  is quantified with a minimum repeat
3258       count that is > 1 or with a limited maximum, more  memory  is  required
3259       for  the  compiled pattern, in proportion to the size of the minimum or
3260       maximum.
3261
3262       If a pattern starts with .* or .{0,} and option dotall  (equivalent  to
3263       Perl  option  /s)  is set, thus allowing the dot to match newlines, the
3264       pattern is implicitly  anchored,  because  whatever  follows  is  tried
3265       against every character position in the subject string. So, there is no
3266       point in retrying the overall match at any position  after  the  first.
3267       PCRE normally treats such a pattern as if it was preceded by \A.
3268
3269       In  cases  where  it  is known that the subject string contains no new‐
3270       lines, it is worth setting  dotall  to  obtain  this  optimization,  or
3271       alternatively using ^ to indicate anchoring explicitly.
3272
3273       However,  there  are  some cases where the optimization cannot be used.
3274       When .* is inside capturing parentheses that are the subject of a  back
3275       reference elsewhere in the pattern, a match at the start can fail where
3276       a later one succeeds. Consider, for example:
3277
3278       (.*)abc\1
3279
3280       If the subject is "xyz123abc123", the match point is the fourth charac‐
3281       ter. Therefore, such a pattern is not implicitly anchored.
3282
3283       Another  case where implicit anchoring is not applied is when the lead‐
3284       ing .* is inside an atomic group. Once again, a match at the start  can
3285       fail where a later one succeeds. Consider the following pattern:
3286
3287       (?>.*?a)b
3288
3289       It  matches "ab" in the subject "aab". The use of the backtracking con‐
3290       trol verbs (*PRUNE) and (*SKIP) also disables this optimization.
3291
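       Both exceptions can be observed by asking run/3 for indexes, which
       are returned as {Offset, Length} pairs with offsets starting at 0.
       A sketch with illustrative variable names:

```erlang
% The match starts at offset 3 (the fourth character), not at the start.
L1 = re:run("xyz123abc123", "(.*)abc\\1", [{capture, all, index}]).
% The atomic group fails at offset 0 and succeeds at offset 1.
L2 = re:run("aab", "(?>.*?a)b", [{capture, first, index}]).
```
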
3292       When a capturing subpattern is repeated, the value captured is the sub‐
3293       string that matched the final iteration. For example, after
3294
3295       (tweedle[dume]{3}\s*)+
3296
3297       has  matched  "tweedledum  tweedledee",  the value of the captured sub‐
3298       string is "tweedledee". However, if there are nested capturing  subpat‐
3299       terns,  the corresponding captured values can have been set in previous
3300       iterations. For example, after
3301
3302       /(a|(b))+/
3303
3304       matches "aba", the value of the second captured substring is "b".
3305
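       A sketch of both cases in the Erlang shell (variable names are
       illustrative):

```erlang
% Subpattern 1 holds the substring from the final iteration.
K1 = re:run("tweedledum tweedledee", "(tweedle[dume]{3}\\s*)+",
            [{capture, [1], list}]).
% Subpattern 2 was set in the second iteration and keeps that value.
K2 = re:run("aba", "(a|(b))+", [{capture, all, list}]).
```
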

ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

3307       With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
3308       repetition,  failure  of what follows normally causes the repeated item
3309       to be re-evaluated to see if a different number of repeats  allows  the
3310       remaining  pattern  to  match.  Sometimes it is useful to prevent this,
3311       either to change the nature of the match, or to cause it to  fail  ear‐
3312       lier than it otherwise might, when the author of the pattern knows that
3313       there is no point in carrying on.
3314
3315       Consider, for example, the pattern \d+foo when applied to the following
3316       subject line:
3317
3318       123456bar
3319
3320       After matching all six digits and then failing to match "foo", the nor‐
3321       mal action of the matcher is to try again with only five digits  match‐
3322       ing item \d+, and then with four, and so on, before ultimately failing.
3323       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
3324       the  means for specifying that once a subpattern has matched, it is not
3325       to be re-evaluated in this way.
3326
3327       If atomic grouping is used for the previous example, the matcher  gives
3328       up  immediately  on failing to match "foo" the first time. The notation
3329       is a kind of special parenthesis, starting with (?> as in the following
3330       example:
3331
3332       (?>\d+)foo
3333
3334       This kind of parenthesis "locks up" the part of the pattern it contains
3335       once it has matched, and a failure further into  the  pattern  is  pre‐
3336       vented  from  backtracking  into  it.  Backtracking past it to previous
3337       items, however, works as normal.
3338
3339       An alternative description is that a subpattern of  this  type  matches
3340       the  string  of  characters  that an identical standalone pattern would
3341       match, if anchored at the current point in the subject string.
3342
3343       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
3344       such as the above example can be thought of as a maximizing repeat that
3345       must swallow everything it can. So, while both \d+ and  \d+?  are  pre‐
3346       pared  to  adjust the number of digits they match to make the remaining
3347       pattern match, (?>\d+) can only match an entire sequence of digits.
3348
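       The difference is easiest to see when the item that follows could
       also be matched by the repeat. The subject "12345" and the trailing
       5 in this sketch are made up for the illustration:

```erlang
% \d+ gives back the final digit so that the trailing 5 can match.
M1 = re:run("12345", "\\d+5", [{capture, first, list}]).
% The atomic group keeps all five digits, so the trailing 5 never matches.
M2 = re:run("12345", "(?>\\d+)5", [{capture, none}]).
% When nothing has to be given back, the atomic group matches as usual.
M3 = re:run("123456foo", "(?>\\d+)foo", [{capture, first, list}]).
```
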
3349       Atomic groups in general can contain any complicated  subpatterns,  and
3350       can be nested. However, when the subpattern for an atomic group is just
3351       a single repeated item, as in the example above,  a  simpler  notation,
3352       called a "possessive quantifier" can be used. This consists of an extra
3353       + character following a quantifier. Using this notation,  the  previous
3354       example can be rewritten as
3355
3356       \d++foo
3357
3358       Notice  that  a possessive quantifier can be used with an entire group,
3359       for example:
3360
3361       (abc|xyz){2,3}+
3362
3363       Possessive  quantifiers  are  always  greedy;  the  setting  of  option
3364       ungreedy  is  ignored.  They  are a convenient notation for the simpler
3365       forms of an atomic group. There is no difference in the meaning of  a
3366       possessive  quantifier  and  the  equivalent  atomic  group,  but
3367       there can be a performance difference; possessive quantifiers are prob‐
3368       ably slightly faster.
3369
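       A sketch of possessive quantifiers in the shell. The subjects, and
       the trailing 5 and abc appended to the patterns, are assumptions
       added to make the lack of backtracking visible:

```erlang
% Same result as the atomic group (?>\d+)5: no digit is given back.
P1 = re:run("12345", "\\d++5", [{capture, none}]).
% Ordinary repeat: backtracks from three to two repeats, then abc matches.
P2 = re:run("abcxyzabc", "(abc|xyz){2,3}abc", [{capture, none}]).
% Possessive repeat on the whole group: the third repeat is not given up.
P3 = re:run("abcxyzabc", "(abc|xyz){2,3}+abc", [{capture, none}]).
```
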
3370       The  possessive  quantifier syntax is an extension to the Perl 5.8 syn‐
3371       tax. Jeffrey Friedl originated the idea (and the  name)  in  the  first
3372       edition of his book. Mike McCloskey liked it, so implemented it when he
3373       built the Sun Java package, and PCRE copied it  from  there.  It  ulti‐
3374       mately found its way into Perl at release 5.10.
3375
3376       PCRE has an optimization that automatically "possessifies" certain sim‐
3377       ple pattern constructs. For example, the sequence  A+B  is  treated  as
3378       A++B,  as there is no point in backtracking into a sequence of A's when
3379       B must follow.
3380
3381       When a pattern contains an unlimited repeat inside  a  subpattern  that
3382       can  itself  be  repeated  an  unlimited number of times, the use of an
3383       atomic group is the only way to avoid some  failing  matches  taking  a
3384       long time. The pattern
3385
3386       (\D+|<\d+>)*[!?]
3387
3388       matches  an  unlimited number of substrings that either consist of non-
3389       digits, or digits enclosed in <>, followed by ! or ?. When it  matches,
3390       it runs quickly. However, if it is applied to
3391
3392       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3393
3394       it  takes  a  long  time  before reporting failure. This is because the
3395       string can be divided between the internal \D+ repeat and the  external
3396       *  repeat  in  many ways, and all must be tried. (The example uses [!?]
3397       rather than a single character at the end, as both PCRE and  Perl  have
3398       an optimization that allows for fast failure when a single character is
3399       used. They remember the last single character that is  required  for  a
3400       match,  and fail early if it is not present in the string.) If the pat‐
3401       tern is changed so that it uses an atomic group,  like  the  following,
3402       sequences of non-digits cannot be broken, and failure happens quickly:
3403
3404       ((?>\D+)|<\d+>)*[!?]
3405
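       The atomic version can be tried on the subject from above. Only
       this version is run in the sketch, since the original pattern is
       the slow case being described. As is an illustrative name:

```erlang
% 52 "a" characters, as in the subject shown above.
As = lists:duplicate(52, $a).
% Sequences of non-digits cannot be broken up, so failure is reported fast.
N1 = re:run(As, "((?>\\D+)|<\\d+>)*[!?]", [{capture, none}]).
```
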

BACK REFERENCES

3407       Outside  a  character  class,  a backslash followed by a digit > 0 (and
3408       possibly further digits) is a back reference to a capturing  subpattern
3409       earlier (that is, to its left) in the pattern, provided there have been
3410       that many previous capturing left parentheses.
3411
3412       However, if the decimal number following the backslash is < 10,  it  is
3413       always taken as a back reference, and causes an error only if there are
3414       not that many capturing left parentheses in the  entire  pattern.  That
3415       is,  the  parentheses  that are referenced need not be to the left of
3416       the reference for numbers < 10. A "forward back reference" of this type
3417       can  make sense when a repetition is involved and the subpattern to the
3418       right has participated in an earlier iteration.
3419
3420       It is not possible to have a numerical "forward back  reference"  to  a
3421       subpattern  whose number is 10 or more using this syntax, as a sequence
3422       such as \50 is interpreted as a character defined in  octal.  For  more
3423       details  of  the  handling of digits following a backslash, see section
3424       Non-Printing Characters earlier. There is no such  problem  when  named
3425       parentheses  are  used.  A back reference to any subpattern is possible
3426       using named parentheses (see below).
3427
3428       Another way to avoid the ambiguity inherent in the use of  digits  fol‐
3429       lowing  a  backslash is to use the \g escape sequence. This escape must
3430       be followed by an unsigned number  or  a  negative  number,  optionally
3431       enclosed in braces. The following examples are identical:
3432
3433       (ring), \1
3434       (ring), \g1
3435       (ring), \g{1}
3436
3437       An  unsigned number specifies an absolute reference without the ambigu‐
3438       ity that is present in the older syntax. It is also useful when literal
3439       digits follow the reference. A negative number is a relative reference.
3440       Consider the following example:
3441
3442       (abc(def)ghi)\g{-1}
3443
3444       The sequence \g{-1} is a reference to the most recently started captur‐
3445       ing subpattern before \g, that is, it is equivalent to \2 in this exam‐
3446       ple. Similarly, \g{-2} would be equivalent to \1. The use  of  relative
3447       references  can  be helpful in long patterns, and also in patterns that
3448       are created by joining fragments  containing  references  within  them‐
3449       selves.
3450
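       The absolute and relative forms of \g can be compared in the shell.
       This is a sketch; the subjects are chosen so that every call is
       expected to return match:

```erlang
Opt = [{capture, none}].
% Three equivalent ways of referring to subpattern 1.
O1 = re:run("ring, ring", "(ring), \\1", Opt).
O2 = re:run("ring, ring", "(ring), \\g1", Opt).
O3 = re:run("ring, ring", "(ring), \\g{1}", Opt).
% \g{-1} refers to (def), the most recently started subpattern.
O4 = re:run("abcdefghidef", "(abc(def)ghi)\\g{-1}", Opt).
% \g{-2} refers to the outer subpattern, (abc(def)ghi).
O5 = re:run("abcdefghiabcdefghi", "(abc(def)ghi)\\g{-2}", Opt).
```
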
3451       A  back  reference matches whatever matched the capturing subpattern in
3452       the current subject string, rather than anything matching  the  subpat‐
3453       tern itself (section Subpatterns as Subroutines describes a way of doing
3454       that). So, the following pattern matches "sense  and  sensibility"  and
3455       "response and responsibility", but not "sense and responsibility":
3456
3457       (sens|respons)e and \1ibility
3458
3459       If  caseful matching is in force at the time of the back reference, the
3460       case of letters is relevant. For example, the  following  matches  "rah
3461       rah"  and "RAH RAH", but not "RAH rah", although the original capturing
3462       subpattern is matched caselessly:
3463
3464       ((?i)rah)\s+\1
3465
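       The two patterns above can be checked as follows (a sketch for the
       Erlang shell; the variable names are illustrative):

```erlang
Sense = "(sens|respons)e and \\1ibility".
Q1 = re:run("sense and sensibility", Sense, [{capture, none}]).
% \1 must repeat what was actually matched, here "sens".
Q2 = re:run("sense and responsibility", Sense, [{capture, none}]).
% The back reference is outside (?i), so it is compared casefully.
Q3 = re:run("RAH RAH", "((?i)rah)\\s+\\1", [{capture, none}]).
Q4 = re:run("RAH rah", "((?i)rah)\\s+\\1", [{capture, none}]).
```
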
3466       There are many different ways of writing back references to named  sub‐
3467       patterns.  The  .NET  syntax  \k{name}  and the Perl syntax \k<name> or
3468       \k'name' are supported, as is the Python syntax (?P=name). The  unified
3469       back  reference  syntax  in Perl 5.10, in which \g can be used for both
3470       numeric and named references, is also supported. The  previous  example
3471       can be rewritten in the following ways:
3472
3473       (?<p1>(?i)rah)\s+\k<p1>
3474       (?'p1'(?i)rah)\s+\k{p1}
3475       (?P<p1>(?i)rah)\s+(?P=p1)
3476       (?<p1>(?i)rah)\s+\g{p1}
3477
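       As a sketch, the four spellings can be run against the same
       subject; each is expected to give the same result. Forms and
       Results are illustrative names:

```erlang
Forms = ["(?<p1>(?i)rah)\\s+\\k<p1>",
         "(?'p1'(?i)rah)\\s+\\k{p1}",
         "(?P<p1>(?i)rah)\\s+(?P=p1)",
         "(?<p1>(?i)rah)\\s+\\g{p1}"].
% Run every form against the same subject and collect the results.
Results = [re:run("RAH RAH", F, [{capture, none}]) || F <- Forms].
```
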
3478       A  subpattern  that  is  referenced  by  name can appear in the pattern
3479       before or after the reference.
3480
3481       There can be more than one back reference to the same subpattern. If  a
3482       subpattern has not been used in a particular match, any back  reference
3483       to it always fails. For example, the following pattern always fails  if
3484       it starts to match "a" rather than "bc":
3485
3486       (a|(bc))\2
3487
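       A sketch of this behavior with two made-up subjects:

```erlang
% Subpattern 2 takes part in the match, so \2 can repeat "bc".
B1 = re:run("bcbc", "(a|(bc))\\2", [{capture, none}]).
% Only the "a" branch can be taken; subpattern 2 is unset and \2 fails.
B2 = re:run("aa", "(a|(bc))\\2", [{capture, none}]).
```
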
3488       As  there  can  be  many capturing parentheses in a pattern, all digits
3489       following the backslash are taken as part of a potential back reference
3490       number. If the pattern continues with a digit character, some delimiter
3491       must be used to terminate the back reference.  If  option  extended  is
3492       set,  this  can  be whitespace. Otherwise an empty comment (see section
3493       Comments) can be used.
3494
3495       Recursive Back References
3496
3497       A back reference that occurs inside the parentheses to which it  refers
3498       fails  when  the subpattern is first used, so, for example, (a\1) never
3499       matches. However, such references can be useful inside repeated subpat‐
3500       terns.  For  example,  the following pattern matches any number of "a"s
3501       and also "aba", "ababbaa", and so on:
3502
3503       (a|b\1)+
3504
3505       At each iteration of the subpattern, the  back  reference  matches  the
3506       character  string corresponding to the previous iteration. In order for
3507       this to work, the pattern must be such that the  first  iteration  does
3508       not  need  to match the back reference. This can be done using alterna‐
3509       tion, as in the example above, or by a quantifier  with  a  minimum  of
3510       zero.
3511
3512       Back  references of this type cause the group that they reference to be
3513       treated as an atomic group. Once the whole group has  been  matched,  a
3514       subsequent  matching  failure cannot cause backtracking into the middle
3515       of the group.
3516
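       As a minimal sketch of the repeated-subpattern case, the pattern can be
       anchored  at both ends and tried on a few subjects. The module name and
       the function whole/1 are hypothetical helpers, not part of re.

```erlang
-module(recursive_backref_demo).
-export([whole/1]).

%% Anchored variant of (a|b\1)+ so that the entire subject must match.
%% On each iteration, \1 refers to what the group matched last time.
whole(Subject) ->
    re:run(Subject, "^(a|b\\1)+$", [{capture, none}]).
```

       For "ababbaa" the iterations match "a", "ba", "bba", and "a" in turn.
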

ASSERTIONS

3518       An assertion is a test on the characters  following  or  preceding  the
3519       current matching point that does not consume any characters. The simple
3520       assertions coded as \b, \B, \A, \G, \Z, \z, ^, and $ are  described  in
3521       the previous sections.
3522
3523       More  complicated  assertions  are  coded as subpatterns. There are two
3524       kinds: those that look ahead of the current  position  in  the  subject
3525       string,  and  those  that  look  behind  it. An assertion subpattern is
3526       matched in the normal way, except that it does not  cause  the  current
3527       matching position to be changed.
3528
3529       Assertion  subpatterns are not capturing subpatterns. If such an asser‐
3530       tion contains capturing subpatterns within it, these  are  counted  for
3531       the  purposes  of numbering the capturing subpatterns in the whole pat‐
3532       tern. However, substring capturing is done  only  for  positive  asser‐
3533       tions.  (Perl sometimes, but not always, performs capturing in negative
3534       assertions.)
3535
3536   Warning:
3537       If a positive assertion containing one or  more  capturing  subpatterns
3538       succeeds, but failure to match later in the pattern causes backtracking
3539       over this assertion, the captures within the assertion are  reset  only
3540       if no higher numbered captures are already set. This is, unfortunately,
3541       a fundamental limitation of the current implementation, and as PCRE1 is
3542       now in maintenance-only status, it is unlikely ever to change.
3543
3544
       For  compatibility  with  Perl,  assertion subpatterns can be repeated.
       Although it makes no sense to assert the same thing  many  times,  the
       side  effect  of  capturing  parentheses can occasionally be useful. In
       practice, there are only three cases:
3549
3550         * If the quantifier is {0}, the  assertion  is  never  obeyed  during
3551           matching.  However, it can contain internal capturing parenthesized
3552           groups that are called from elsewhere through the subroutine mecha‐
3553           nism.
3554
         * If the quantifier is {0,n}, where n > 0, it is treated as if it was
           {0,1}. At runtime, the remaining pattern match is  tried  with  and
           without  the assertion; the order depends on the greediness of the
           quantifier.
3559
3560         * If the minimum repetition is > 0, the quantifier  is  ignored.  The
3561           assertion is obeyed only once when encountered during matching.
3562
3563       Lookahead Assertions
3564
3565       Lookahead assertions start with (?= for positive assertions and (?! for
3566       negative assertions. For example, the following matches a word followed
3567       by a semicolon, but does not include the semicolon in the match:
3568
3569       \w+(?=;)
3570
3571       The  following  matches any occurrence of "foo" that is not followed by
3572       "bar":
3573
3574       foo(?!bar)
3575
3576       Notice that the apparently similar pattern
3577
3578       (?!foo)bar
3579
3580       does not find an occurrence of "bar"  that  is  preceded  by  something
3581       other  than  "foo". It finds any occurrence of "bar" whatsoever, as the
3582       assertion (?!foo) is always true when the  next  three  characters  are
3583       "bar". A lookbehind assertion is needed to achieve the other effect.
3584
       If you want to force a matching failure at some point in a pattern, the
       most convenient way to do it is with (?!), as an  empty  string  always
       matches.  So, an assertion that requires that there is not an empty
       string must always fail. The backtracking control verb (*FAIL) or  (*F)
       is a synonym for (?!).
3590
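       The lookahead examples above can be tried with  a  sketch  like  the
       following.  The  module  and its functions text/2, pos/2, and found/2
       are hypothetical wrappers around re:run/3, chosen for illustration.

```erlang
-module(lookahead_demo).
-export([text/2, pos/2, found/2]).

%% Matched text of the whole match, as a list.
text(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, first, list}]).

%% {Offset, Length} of the whole match (zero-based offset).
pos(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, first, index}]).

%% true if the pattern matches anywhere in the subject.
found(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```

       For instance, text("word; next", "\\w+(?=;)") returns the word  without
       the  semicolon,  and  pos("foobar",  "(?!foo)bar")  still  finds  "bar"
       although it is preceded by "foo".
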
3591       Lookbehind Assertions
3592
3593       Lookbehind  assertions start with (?<= for positive assertions and (?<!
3594       for negative assertions. For example, the following finds an occurrence
3595       of "bar" that is not preceded by "foo":
3596
3597       (?<!foo)bar
3598
3599       The contents of a lookbehind assertion are restricted such that all the
3600       strings it matches must have a fixed length. However, if there are many
3601       top-level  alternatives,  they  do  not all have to have the same fixed
3602       length. Thus, the following is permitted:
3603
3604       (?<=bullock|donkey)
3605
3606       The following causes an error at compile time:
3607
3608       (?<!dogs?|cats?)
3609
3610       Branches that match different length strings are permitted only at  the
3611       top-level of a lookbehind assertion. This is an extension compared with
3612       Perl, which requires all branches to match the same length  of  string.
3613       An assertion such as the following is not permitted, as its single top-
3614       level branch can match two different lengths:
3615
3616       (?<=ab(c|de))
3617
3618       However, it is acceptable to PCRE if rewritten  to  use  two  top-level
3619       branches:
3620
3621       (?<=abc|abde)
3622
3623       Sometimes  the  escape sequence \K (see above) can be used instead of a
3624       lookbehind assertion to get round the fixed-length restriction.
3625
3626       The implementation of lookbehind assertions is, for  each  alternative,
3627       to  move  the current position back temporarily by the fixed length and
3628       then try to match. If there are insufficient characters before the cur‐
3629       rent position, the assertion fails.
3630
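       A sketch of how the fixed-length rule shows up in practice follows. The
       helper names are hypothetical. Whether a given  lookbehind  is  rejected
       depends  on  the  PCRE  version  linked  into  the  runtime; the comments
       describe the rules stated in this section.

```erlang
-module(lookbehind_demo).
-export([pos/2, compiles/1]).

%% {Offset, Length} of the whole match (zero-based offset).
pos(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, first, index}]).

%% true if re:compile/1 accepts the pattern, false if it reports an error.
%% Under the rules described in this section, "(?<!dogs?|cats?)" and
%% "(?<=ab(c|de))" are expected to be rejected, while
%% "(?<=bullock|donkey)" and "(?<=abc|abde)" are accepted.
compiles(Pattern) ->
    case re:compile(Pattern) of
        {ok, _MP} -> true;
        {error, _Reason} -> false
    end.
```
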
3631       In  a UTF mode, PCRE does not allow the \C escape (which matches a sin‐
3632       gle data unit even in a UTF mode) to appear in  lookbehind  assertions,
3633       as  it  makes  it impossible to calculate the length of the lookbehind.
3634       The \X and \R escapes, which can match different numbers of data units,
3635       are not permitted either.
3636
3637       "Subroutine" calls (see below), such as (?2) or (?&X), are permitted in
3638       lookbehinds, as long as the subpattern matches a  fixed-length  string.
3639       Recursion, however, is not supported.
3640
3641       Possessive  quantifiers can be used with lookbehind assertions to spec‐
3642       ify efficient matching of fixed-length strings at the  end  of  subject
3643       strings.  Consider  the following simple pattern when applied to a long
3644       string that does not match:
3645
3646       abcd$
3647
3648       As matching proceeds from left to right, PCRE looks for each "a" in the
3649       subject and then sees if what follows matches the remaining pattern. If
3650       the pattern is specified as
3651
3652       ^.*abcd$
3653
3654       the initial .* matches the entire string at first. However,  when  this
3655       fails  (as  there  is no following "a"), it backtracks to match all but
3656       the last character, then all but the last two characters,  and  so  on.
3657       Once  again  the search for "a" covers the entire string, from right to
3658       left, so we are no better off. However, if the pattern is written as
3659
3660       ^.*+(?<=abcd)
3661
3662       there can be no backtracking for the .*+ item; it can  match  only  the
3663       entire  string.  The subsequent lookbehind assertion does a single test
3664       on the last four characters. If it fails, the match fails  immediately.
3665       For  long  strings, this approach makes a significant difference to the
3666       processing time.
3667
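       The suffix test can be wrapped as follows. This is only a sketch:  the
       function  name  is  hypothetical,  and it assumes a single-line subject,
       because . does not match a newline unless option dotall is set.

```erlang
-module(suffix_demo).
-export([ends_with_abcd/1]).

%% .*+ swallows the whole (single-line) subject without backtracking;
%% the lookbehind then inspects only the last four characters.
ends_with_abcd(Subject) ->
    re:run(Subject, "^.*+(?<=abcd)", [{capture, none}]) =:= match.
```
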
3668       Using Multiple Assertions
3669
3670       Many assertions (of any sort) can occur in succession. For example, the
3671       following matches "foo" preceded by three digits that are not "999":
3672
3673       (?<=\d{3})(?<!999)foo
3674
       Notice that each of the assertions is applied independently at the same
       point in the subject string. First there is a check that  the  previous
       three  characters  are  all  digits, and then there is a check that the
       same three characters are not "999". This pattern does not match  "foo"
       preceded  by six characters, the first three of which are digits and the
       last three of which  are  not  "999".  For  example,  it  does  not  match
       "123abcfoo". A pattern to do that is the following:
3682
3683       (?<=\d{3}...)(?<!999)foo
3684
3685       This  time  the  first assertion looks at the preceding six characters,
3686       checks that the first three are digits, and then the  second  assertion
3687       checks that the preceding three characters are not "999".
3688
3689       Assertions can be nested in any combination. For example, the following
3690       matches an occurrence of "baz" that is preceded by "bar", which in turn
3691       is not preceded by "foo":
3692
3693       (?<=(?<!foo)bar)baz
3694
3695       The  following  pattern  matches "foo" preceded by three digits and any
3696       three characters that are not "999":
3697
3698       (?<=\d{3}(?!999)...)foo
3699
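       The combinations above can be checked with  one  small  helper.  The
       module and function names are hypothetical; the patterns are those from
       this section, written as Erlang string literals.

```erlang
-module(multi_assert_demo).
-export([found/2]).

%% true if Pattern matches somewhere in Subject.
found(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```

       For example, found("123abcfoo", "(?<=\\d{3})(?<!999)foo") is false,
       whereas found("123abcfoo", "(?<=\\d{3}...)(?<!999)foo") is true.
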

CONDITIONAL SUBPATTERNS

3701       It is possible to cause the matching process to obey a subpattern  con‐
3702       ditionally  or to choose between two alternative subpatterns, depending
3703       on the result of an assertion, or whether a specific capturing  subpat‐
3704       tern has already been matched. The following are the two possible forms
3705       of conditional subpattern:
3706
3707       (?(condition)yes-pattern)
3708       (?(condition)yes-pattern|no-pattern)
3709
3710       If the condition is satisfied, the yes-pattern is used,  otherwise  the
3711       no-pattern  (if  present).  If  more than two alternatives exist in the
3712       subpattern, a compile-time error occurs. Each of the  two  alternatives
3713       can  itself  contain  nested  subpatterns of any form, including condi‐
3714       tional subpatterns; the restriction to two alternatives applies only at
3715       the  level of the condition. The following pattern fragment is an exam‐
3716       ple where the alternatives are complex:
3717
3718       (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
3719
3720       There are four kinds of condition: references  to  subpatterns,  refer‐
3721       ences to recursion, a pseudo-condition called DEFINE, and assertions.
3722
3723       Checking for a Used Subpattern By Number
3724
3725       If  the  text between the parentheses consists of a sequence of digits,
3726       the condition is true if a capturing subpattern of that number has pre‐
3727       viously  matched.  If  more than one capturing subpattern with the same
3728       number exists (see section  Duplicate Subpattern Numbers earlier),  the
3729       condition  is true if any of them have matched. An alternative notation
3730       is to precede the digits with a plus or minus sign. In this  case,  the
3731       subpattern  number  is relative rather than absolute. The most recently
3732       opened parentheses can be referenced by (?(-1), the next most recent by
3733       (?(-2),  and  so  on.  Inside loops, it can also make sense to refer to
3734       subsequent groups. The next parentheses to be opened can be  referenced
3735       as  (?(+1),  and  so  on.  (The value zero in any of these forms is not
3736       used; it provokes a compile-time error.)
3737
3738       Consider the following pattern, which contains  non-significant  white‐
3739       space  to  make it more readable (assume option extended) and to divide
3740       it into three parts for ease of discussion:
3741
3742       ( \( )?    [^()]+    (?(1) \) )
3743
3744       The first part matches an optional opening  parenthesis,  and  if  that
3745       character is present, sets it as the first captured substring. The sec‐
3746       ond part matches one or more characters that are not  parentheses.  The
       third part is a conditional subpattern that tests whether the first set
       of  parentheses  matched  or  not.  If they did, that is, if the subject
       started with an opening parenthesis, the condition is true, and so  the
       yes-pattern  is  executed  and a closing parenthesis is required. Other-
       wise, as no-pattern is not present,  the  subpattern  matches  nothing.
       That  is,  this  pattern matches a sequence of non-parentheses, option-
       ally enclosed in parentheses.
3754
       If  this  pattern is embedded in a larger one, a relative reference can
       be used:

       ( \( )?    [^()]+    (?(-1) \) )

       This makes the fragment independent of the parentheses  in  the  larger
       pattern.
3760
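       A sketch of the numbered condition, in both  its  absolute  and  rela-
       tive  forms,  follows. The pattern is anchored with ^ and $ here so that
       the whole subject must match; that anchoring and the function names are
       additions made for illustration.

```erlang
-module(cond_paren_demo).
-export([whole/1, whole_relative/1]).

%% Absolute reference: (?(1) tests whether group 1 took part in the match.
whole(Subject) ->
    re:run(Subject, "^ ( \\( )? [^()]+ (?(1) \\) ) $",
           [extended, {capture, none}]).

%% Relative reference: (?(-1) tests the most recently opened group.
whole_relative(Subject) ->
    re:run(Subject, "^ ( \\( )? [^()]+ (?(-1) \\) ) $",
           [extended, {capture, none}]).
```
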
3761       Checking for a Used Subpattern By Name
3762
3763       Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
3764       used subpattern by name. For compatibility  with  earlier  versions  of
3765       PCRE,  which  had this facility before Perl, the syntax (?(name)...) is
3766       also recognized.
3767
3768       Rewriting the previous example to use a named subpattern gives:
3769
3770       (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
3771
3772       If the name used in a condition of this kind is a duplicate,  the  test
3773       is  applied to all subpatterns of the same name, and is true if any one
3774       of them has matched.
3775
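       The named form behaves the same way. In this sketch the ^ and $ anchors
       and the function name are again assumptions made for illustration.

```erlang
-module(cond_name_demo).
-export([whole/1]).

%% (?(<OPEN>) ...) is true only if the group named OPEN has matched.
whole(Subject) ->
    re:run(Subject, "^ (?<OPEN> \\( )? [^()]+ (?(<OPEN>) \\) ) $",
           [extended, {capture, none}]).
```
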
3776       Checking for Pattern Recursion
3777
3778       If the condition is the string (R), and there is no subpattern with the
3779       name  R, the condition is true if a recursive call to the whole pattern
3780       or any subpattern has been made. If digits or a name preceded by amper‐
3781       sand follow the letter R, for example:
3782
3783       (?(R3)...) or (?(R&name)...)
3784
3785       the condition is true if the most recent recursion is into a subpattern
3786       whose number or name is given. This condition does not check the entire
3787       recursion  stack.  If  the  name  used in a condition of this kind is a
3788       duplicate, the test is applied to all subpatterns of the same name, and
3789       is true if any one of them is the most recent recursion.
3790
3791       At "top-level", all these recursion test conditions are false. The syn‐
3792       tax for recursive patterns is described below.
3793
3794       Defining Subpatterns for Use By Reference Only
3795
3796       If the condition is the string (DEFINE), and  there  is  no  subpattern
3797       with  the  name  DEFINE,  the  condition is always false. In this case,
3798       there can be only one alternative  in  the  subpattern.  It  is  always
3799       skipped  if  control  reaches  this  point  in the pattern. The idea of
3800       DEFINE is that it can be used to define "subroutines" that can be  ref‐
3801       erenced  from  elsewhere.  (The use of subroutines is described below.)
3802       For  example,  a  pattern  to  match   an   IPv4   address,   such   as
3803       "192.168.23.245",  can be written like this (ignore whitespace and line
3804       breaks):
3805
3806       (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) \b (?&byte) (\.(?&byte)){3} \b
3807
       The first part of the pattern is a DEFINE group inside  which  another
       group  named  "byte"  is  defined.  This matches an individual component
       of an IPv4 address (a number < 256). When matching takes  place,
3811       this part of the pattern is skipped, as DEFINE acts like a false condi‐
3812       tion. The remaining pattern uses references to the named group to match
3813       the  four  dot-separated  components of an IPv4 address, insisting on a
3814       word boundary at each end.
3815
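       The IPv4 pattern can be used from Erlang roughly as follows. The  module
       name  and find/1 are hypothetical; option extended is needed because the
       pattern contains non-significant whitespace.

```erlang
-module(ipv4_define_demo).
-export([find/1]).

%% The DEFINE group is skipped during matching; it only declares "byte",
%% which is then called four times through (?&byte).
pattern() ->
    "(?(DEFINE) (?<byte> 2[0-4]\\d | 25[0-5] | 1\\d\\d | [1-9]?\\d) )"
    " \\b (?&byte) (\\.(?&byte)){3} \\b".

%% Returns {match, [Address]} for the first address found, or nomatch.
find(Subject) ->
    re:run(Subject, pattern(), [extended, {capture, first, list}]).
```
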
3816       Assertion Conditions
3817
3818       If the condition is not in any of the above  formats,  it  must  be  an
3819       assertion.  This  can be a positive or negative lookahead or lookbehind
3820       assertion. Consider the following pattern,  containing  non-significant
3821       whitespace, and with the two alternatives on the second line:
3822
3823       (?(?=[^a-z]*[a-z])
3824       \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
3825
3826       The  condition  is  a  positive  lookahead  assertion  that  matches an
3827       optional sequence of non-letters followed by  a  letter.  That  is,  it
3828       tests for the presence of at least one letter in the subject. If a let‐
3829       ter is found, the subject is matched  against  the  first  alternative,
3830       otherwise  it  is  matched  against  the  second.  This pattern matches
3831       strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
3832       letters and dd are digits.
3833
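       A sketch of the assertion condition follows. The ^ and $ anchors and the
       function name are additions for illustration, so  that  only  complete
       subjects of one of the two forms are accepted.

```erlang
-module(assert_cond_demo).
-export([whole/1]).

%% If a lowercase letter occurs ahead, only the dd-aaa-dd alternative is
%% tried; otherwise only the dd-dd-dd alternative is tried.
whole(Subject) ->
    re:run(Subject,
           "^(?(?=[^a-z]*[a-z]) \\d{2}-[a-z]{3}-\\d{2} | \\d{2}-\\d{2}-\\d{2} )$",
           [extended, {capture, none}]).
```
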

COMMENTS

3835       There  are  two ways to include comments in patterns that are processed
3836       by PCRE. In both cases, the start of the comment must not be in a char‐
3837       acter  class, or in the middle of any other sequence of related charac‐
3838       ters such as (?: or a subpattern name or number.  The  characters  that
3839       make up a comment play no part in the pattern matching.
3840
3841       The  sequence (?# marks the start of a comment that continues up to the
3842       next closing parenthesis. Nested  parentheses  are  not  permitted.  If
       option  extended  is set, an unescaped # character also introduces a
3844       comment, which in this case continues to  immediately  after  the  next
3845       newline  character  or character sequence in the pattern. Which charac‐
3846       ters are interpreted as newlines is controlled by the options passed to
3847       a  compiling function or by a special sequence at the start of the pat‐
3848       tern, as described in section  Newline Conventions earlier.
3849
3850       Notice that the end of this  type  of  comment  is  a  literal  newline
3851       sequence  in  the  pattern; escape sequences that happen to represent a
3852       newline do not count. For example, consider the following pattern  when
3853       extended is set, and the default newline convention is in force:
3854
3855       abc #comment \n still comment
3856
3857       On  encountering character #, pcre_compile() skips along, looking for a
3858       newline in the pattern. The sequence \n is still literal at this stage,
3859       so  it does not terminate the comment. Only a character with code value
3860       0x0a (the default newline) does so.
3861
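       The difference between the escape sequence \n and a literal newline  in
       the  pattern  can be seen in a sketch like this one. The helper name is
       hypothetical, and the default newline convention (LF) is assumed.

```erlang
-module(comment_demo).
-export([first/2]).

%% Runs Pattern with option extended and returns the matched text.
%% Note the difference between the Erlang literals "\\n" (backslash
%% followed by n, two characters in the pattern) and "\n" (a real LF).
first(Subject, Pattern) ->
    re:run(Subject, Pattern, [extended, {capture, first, list}]).
```

       With  "abc #comment \\n still comment" everything after # is comment, so
       only "abc" is matched. With "abc #comment\ndef" the  comment  ends  at
       the LF character and "def" is part of the pattern again.
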

RECURSIVE PATTERNS

3863       Consider the problem of matching a string in parentheses, allowing  for
3864       unlimited  nested  parentheses.  Without the use of recursion, the best
3865       that can be done is to use a pattern that  matches  up  to  some  fixed
3866       depth  of  nesting.  It  is not possible to handle an arbitrary nesting
3867       depth.
3868
3869       For some time, Perl has provided a facility that allows regular expres‐
3870       sions  to  recurse  (among other things). It does this by interpolating
3871       Perl code in the expression at runtime, and the code can refer  to  the
3872       expression itself. A Perl pattern using code interpolation to solve the
3873       parentheses problem can be created like this:
3874
3875       $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
3876
3877       Item (?p{...}) interpolates Perl code at  runtime,  and  in  this  case
3878       refers recursively to the pattern in which it appears.
3879
3880       Obviously, PCRE cannot support the interpolation of Perl code. Instead,
3881       it supports special syntax for recursion of the entire pattern, and for
3882       individual  subpattern  recursion.  After  its introduction in PCRE and
3883       Python, this kind of  recursion  was  later  introduced  into  Perl  at
3884       release 5.10.
3885
3886       A special item that consists of (? followed by a number > 0 and a clos‐
3887       ing parenthesis is a recursive subroutine call of the subpattern of the
3888       given  number,  if  it  occurs inside that subpattern. (If not, it is a
3889       non-recursive subroutine call, which is described in the next section.)
3890       The special item (?R) or (?0) is a recursive call of the entire regular
3891       expression.
3892
3893       This PCRE pattern solves the nested parentheses  problem  (assume  that
3894       option extended is set so that whitespace is ignored):
3895
3896       \( ( [^()]++ | (?R) )* \)
3897
3898       First  it matches an opening parenthesis. Then it matches any number of
3899       substrings, which can either be a  sequence  of  non-parentheses  or  a
3900       recursive  match  of the pattern itself (that is, a correctly parenthe‐
3901       sized substring). Finally there is a closing  parenthesis.  Notice  the
3902       use  of a possessive quantifier to avoid backtracking into sequences of
3903       non-parentheses.
3904
3905       If this was part of a larger pattern, you would not want to recurse the
3906       entire pattern, so instead you can use:
3907
3908       ( \( ( [^()]++ | (?1) )* \) )
3909
3910       The  pattern is here within parentheses so that the recursion refers to
3911       them instead of the whole pattern.
3912
3913       In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
3914       tricky.  This is made easier by the use of relative references. Instead
3915       of (?1) in the pattern above, you can write (?-2) to refer to the  sec‐
3916       ond  most recently opened parentheses preceding the recursion. That is,
3917       a negative number counts capturing parentheses leftwards from the point
3918       at which it is encountered.
3919
3920       It  is  also  possible to refer to later opened parentheses, by writing
3921       references such as (?+2). However, these cannot be  recursive,  as  the
3922       reference  is  not inside the parentheses that are referenced. They are
3923       always non-recursive subroutine calls, as described in  the  next  sec‐
3924       tion.
3925
3926       An  alternative  approach is to use named parentheses instead. The Perl
3927       syntax for this is (?&name). The earlier PCRE syntax (?P>name) is  also
3928       supported. We can rewrite the above example as follows:
3929
3930       (?<pn> \( ( [^()]++ | (?&pn) )* \) )
3931
3932       If  there  is more than one subpattern with the same name, the earliest
3933       one is used.
3934
3935       This particular example pattern that we have  studied  contains  nested
3936       unlimited repeats, and so the use of a possessive quantifier for match‐
3937       ing strings of non-parentheses is important when applying  the  pattern
3938       to strings that do not match. For example, when this pattern is applied
3939       to
3940
3941       (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
3942
3943       it gives "no match" quickly. However, if a possessive quantifier is not
3944       used,  the  match  runs for a long time, as there are so many different
3945       ways the + and * repeats can carve up the  subject,  and  all  must  be
3946       tested before failure can be reported.
3947
3948       At  the  end  of a match, the values of capturing parentheses are those
3949       from the outermost level. If the pattern above is matched against
3950
3951       (ab(cd)ef)
3952
3953       the value for the inner capturing parentheses  (numbered  2)  is  "ef",
3954       which  is the last value taken on at the top-level. If a capturing sub‐
3955       pattern is not matched at the top level, its final  captured  value  is
3956       unset,  even  if  it was (temporarily) set at a deeper level during the
3957       matching process.
3958
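       The numbered and named forms of the nested-parentheses pattern  can  be
       exercised  as  sketched below. The function names, and the ^ and $ an-
       chors placed around the named group in balanced/1, are assumptions made
       for illustration.

```erlang
-module(nested_paren_demo).
-export([groups/1, balanced/1]).

%% Returns the whole match followed by the values of groups 1 and 2.
groups(Subject) ->
    re:run(Subject, "( \\( ( [^()]++ | (?1) )* \\) )",
           [extended, {capture, all, list}]).

%% match only if the entire subject is one correctly nested group.
%% The anchors are outside group pn, so they are not part of the recursion.
balanced(Subject) ->
    re:run(Subject, "^(?<pn> \\( ( [^()]++ | (?&pn) )* \\) )$",
           [extended, {capture, none}]).
```

       For  the  subject  "(ab(cd)ef)",  groups/1  reports "ef" for group 2, in
       line with the description above.
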
3959       Do not confuse item (?R) with condition (R), which tests for recursion.
3960       Consider  the  following pattern, which matches text in angle brackets,
3961       allowing for arbitrary nesting.  Only  digits  are  allowed  in  nested
3962       brackets  (that is, when recursing), while any characters are permitted
3963       at the outer level.
3964
3965       < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
3966
3967       Here (?(R) is the start of a conditional subpattern, with two different
3968       alternatives  for  the  recursive and non-recursive cases. Item (?R) is
3969       the actual recursive call.
3970
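       A sketch of the angle-bracket pattern follows. The run-time option  an-
       chored  is  used  here  (an illustrative choice) so that a match is at-
       tempted only at the start of the subject; the  function  name  is  hypo-
       thetical.

```erlang
-module(angle_demo).
-export([from_start/1]).

%% (?(R) ...) selects \d++ while recursing and [^<>]*+ at the outer level;
%% (?R) is the recursive call of the whole pattern.
from_start(Subject) ->
    re:run(Subject, "< (?: (?(R) \\d++ | [^<>]*+) | (?R))* >",
           [extended, anchored, {capture, first, list}]).
```
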
3971       Differences in Recursion Processing between PCRE and Perl
3972
3973       Recursion processing in PCRE differs from Perl in two  important  ways.
3974       In  PCRE (like Python, but unlike Perl), a recursive subpattern call is
3975       always treated as an atomic group. That is, once it has matched some of
3976       the subject string, it is never re-entered, even if it contains untried
3977       alternatives and there is a subsequent matching failure.  This  can  be
3978       illustrated  by  the  following  pattern, which means to match a palin‐
3979       dromic string containing an odd number of characters (for example, "a",
3980       "aba", "abcba", "abcdcba"):
3981
3982       ^(.|(.)(?1)\2)$
3983
3984       The idea is that it either matches a single character, or two identical
3985       characters surrounding a subpalindrome. In Perl, this pattern works; in
       PCRE  it  does not work if the subject is longer than three characters.
3987       Consider the subject string "abcba".
3988
3989       At the top level, the first character is matched, but as it is  not  at
3990       the end of the string, the first alternative fails, the second alterna‐
3991       tive is taken, and the recursion kicks in. The recursive call  to  sub‐
3992       pattern  1  successfully matches the next character ("b"). (Notice that
3993       the beginning and end of line tests are not part of the recursion.)
3994
3995       Back at the top level, the next character ("c") is compared  with  what
3996       subpattern  2  matched,  which was "a". This fails. As the recursion is
3997       treated as an atomic group, there are now no backtracking  points,  and
3998       so the entire match fails. (Perl can now re-enter the recursion and try
3999       the second alternative.) However, if the pattern is  written  with  the
4000       alternatives in the other order, things are different:
4001
4002       ^((.)(?1)\2|.)$
4003
4004       This  time,  the recursing alternative is tried first, and continues to
4005       recurse until it runs out of characters, at which point  the  recursion
4006       fails.  But  this time we have another alternative to try at the higher
4007       level. That is the significant difference: in  the  previous  case  the
4008       remaining alternative is at a deeper recursion level, which PCRE cannot
4009       use.
4010
4011       To change the pattern so that it matches all palindromic  strings,  not
4012       only  those  with an odd number of characters, it is tempting to change
4013       the pattern to this:
4014
4015       ^((.)(?1)\2|.?)$
4016
4017       Again, this works in Perl, but not in PCRE, and for  the  same  reason.
4018       When  a  deeper  recursion has matched a single character, it cannot be
4019       entered again to match an empty string. The solution is to separate the
4020       two  cases, and write out the odd and even cases as alternatives at the
4021       higher level:
4022
4023       ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
4024
4025       If you want to match typical  palindromic  phrases,  the  pattern  must
4026       ignore all non-word characters, which can be done as follows:
4027
4028       ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
4029
4030       If  run  with  option caseless, this pattern matches phrases such as "A
4031       man, a plan, a canal: Panama!" and it works well in both PCRE and Perl.
4032       Notice  the  use  of the possessive quantifier *+ to avoid backtracking
4033       into sequences of non-word characters. Without this,  PCRE  takes  much
4034       longer  (10  times or more) to match typical phrases, and Perl takes so
4035       long that you think it has gone into a loop.
4036
4037   Note:
4038       The palindrome-matching patterns above work only if the subject  string
4039       does  not  start  with  a  palindrome  that  is shorter than the entire
4040       string. For example, although "abcba" is correctly matched, if the sub‐
4041       ject  is  "ababa",  PCRE  finds palindrome "aba" at the start, and then
4042       fails at top level, as the end of the  string  does  not  follow.  Once
4043       again,  it  cannot  jump  back into the recursion to try other alterna‐
4044       tives, so the entire match fails.
4045
4046
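       The palindrome patterns can be tried with a sketch such as this one. The
       function  names  are  hypothetical.  Results  that  depend on recursion
       being atomic, such as odd_first("abcba") or odd_reordered("ababa")  not
       matching,  apply  to  the  PCRE version described in this document and
       might differ with another regular expression engine.

```erlang
-module(palindrome_demo).
-export([odd_first/1, odd_reordered/1, phrase/1]).

%% Single-character alternative first, as in the first pattern above.
odd_first(Subject) ->
    re:run(Subject, "^(.|(.)(?1)\\2)$", [{capture, none}]).

%% Recursing alternative first, as in the reordered pattern above.
odd_reordered(Subject) ->
    re:run(Subject, "^((.)(?1)\\2|.)$", [{capture, none}]).

%% Phrase pattern from the text: skips non-word characters, caseless.
phrase(Subject) ->
    re:run(Subject,
           "^\\W*+(?:((.)\\W*+(?1)\\W*+\\2|)|((.)\\W*+(?3)\\W*+\\4|\\W*+.\\W*+))\\W*+$",
           [caseless, {capture, none}]).
```
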
       The second way in which PCRE and Perl differ in  their  recursion  pro-
       cessing  is in the handling of captured values. In Perl, when a subpat-
       tern is called recursively or as a subroutine (see the  next  section),
       it  has  no  access to any values that were captured outside the recur-
       sion. In PCRE these values can be referenced.  Consider  the  following
       pattern:
4053
4054       ^(.)(\1|a(?2))
4055
       In  PCRE,  it matches "bab". The first capturing parentheses match "b".
       In the second group, the back reference \1 (which now  holds  "b")  fails
       to  match  the next character "a", so the second alternative matches "a"
       and then recurses. In the recursion, \1 does now match "b" and  so  the
       whole  match  succeeds.  In  Perl,  the pattern fails to match because
       inside the recursive call \1 cannot access the externally set value.
4062
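       A minimal sketch of this example follows; the function name is hypothet-
       ical.

```erlang
-module(recursion_capture_demo).
-export([run/1]).

%% Inside the recursive call (?2), \1 still refers to the value that
%% group 1 captured outside the recursion.
run(Subject) ->
    re:run(Subject, "^(.)(\\1|a(?2))", [{capture, first, list}]).
```
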

SUBPATTERNS AS SUBROUTINES

4064       If the syntax for a recursive subpattern call (either by number  or  by
4065       name)  is  used outside the parentheses to which it refers, it operates
4066       like a subroutine in a programming language. The called subpattern  can
4067       be  defined  before or after the reference. A numbered reference can be
4068       absolute or relative, as in the following examples:
4069
4070       (...(absolute)...)...(?2)...
4071       (...(relative)...)...(?-1)...
4072       (...(?+1)...(relative)...
4073
4074       An earlier example pointed  out  that  the  following  pattern  matches
4075       "sense  and  sensibility"  and  "response  and responsibility", but not
4076       "sense and responsibility":
4077
4078       (sens|respons)e and \1ibility
4079
4080       If instead the following pattern is used, it matches "sense and respon‐
4081       sibility" and the other two strings:
4082
4083       (sens|respons)e and (?1)ibility
4084
4085       Another example is provided in the discussion of DEFINE earlier.
4086
4087       All  subroutine  calls,  recursive or not, are always treated as atomic
4088       groups. That is, once a subroutine has  matched  some  of  the  subject
4089       string,  it  is  never re-entered, even if it contains untried alterna‐
4090       tives and there is a subsequent matching failure. Any capturing  paren‐
4091       theses that are set during the subroutine call revert to their previous
4092       values afterwards.
4093
4094       Processing options such as case-independence are fixed when  a  subpat‐
4095       tern  is defined, so if it is used as a subroutine, such options cannot
4096       be changed for different calls.  For  example,  the  following  pattern
4097       matches  "abcabc"  but not "abcABC", as the change of processing option
4098       does not affect the called subpattern:
4099
4100       (abc)(?i:(?-1))
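
           A small sketch of this behavior follows (module and function names
           are illustrative). The called group stays case-sensitive even though
           the call is made inside (?i:...).

```erlang
-module(subroutine_options_demo).
-export([demo/0]).

%% The called group (abc) was defined without (?i), so the
%% subroutine call inside (?i:...) still matches case-sensitively.
demo() ->
    Pattern = "(abc)(?i:(?-1))",
    Opts = [{capture, none}],
    {re:run("abcabc", Pattern, Opts),   % match
     re:run("abcABC", Pattern, Opts)}.  % nomatch
```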
4101

ONIGURUMA SUBROUTINE SYNTAX

4103       For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
4104       name or a number enclosed in either angle brackets or single quotes  is
4105       an alternative syntax for referencing a subpattern  as  a  subroutine,
4106       possibly recursively. Here follow two of  the  examples  used  above,
4107       rewritten using this syntax:
4108
4109       (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
4110       (sens|respons)e and \g'1'ibility
4111
4112       PCRE supports an extension to Oniguruma: if a number is preceded  by  a
4113       plus or minus sign, it is taken as a relative reference, for example:
4114
4115       (abc)(?i:\g<-1>)
4116
4117       Notice  that  \g{...}  (Perl syntax) and \g<...> (Oniguruma syntax) are
4118       not synonymous. The former is a back reference; the latter is a subrou‐
4119       tine call.
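
           The distinction can be sketched as follows, reusing the patterns
           above (module and function names are illustrative). Notice the
           doubled backslashes required in Erlang string literals.

```erlang
-module(oniguruma_syntax_demo).
-export([demo/0]).

demo() ->
    Subject = "sense and responsibility",
    Opts = [{capture, none}],
    %% \g{1} is a back reference: it needs "sens" again, so nomatch.
    BackRef = re:run(Subject, "(sens|respons)e and \\g{1}ibility", Opts),
    %% \g<1> is a subroutine call: it re-runs group 1, so match.
    Subroutine = re:run(Subject, "(sens|respons)e and \\g<1>ibility", Opts),
    %% \g<-1> is a relative subroutine call to the preceding group: match.
    Relative = re:run("abcabc", "(abc)(?i:\\g<-1>)", Opts),
    {BackRef, Subroutine, Relative}.
```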
4120

BACKTRACKING CONTROL

4122       Perl  5.10  introduced some "Special Backtracking Control Verbs", which
4123       are still described in the Perl documentation as "experimental and sub‐
4124       ject  to  change or removal in a future version of Perl". It goes on to
4125       say: "Their usage in production code should be noted to avoid  problems
4126       during upgrades." The same remarks apply to the PCRE features described
4127       in this section.
4128
4129       The new verbs make use of what was previously invalid syntax: an  open‐
4130       ing parenthesis followed by an asterisk. They are generally of the form
4131       (*VERB) or (*VERB:NAME). Some can take either form,  possibly  behaving
4132       differently  depending  on  whether  a  name  is present. A name is any
4133       sequence of characters that does not include a closing parenthesis. The
4134       maximum name length is 255 in the 8-bit library and 65535 in the 16-bit
4135       and 32-bit libraries. If the name is empty, that  is,  if  the  closing
4136       parenthesis  immediately  follows  the  colon,  the effect is as if the
4137       colon was not there. Any number of these verbs can occur in a pattern.
4138
4139       The behavior of these verbs in repeated groups, assertions, and in sub‐
4140       patterns   called  as  subroutines  (whether  or  not  recursively)  is
4141       described below.
4142
4143       Optimizations That Affect Backtracking Verbs
4144
4145       PCRE contains some optimizations that are used to speed up matching  by
4146       running some checks at the start of each match attempt. For example, it
4147       can know the minimum length of a matching subject, or that a particular
4148       character must be present. When one of these optimizations bypasses the
4149       running  of a match, any included backtracking verbs are not processed.
4150       You can suppress the start-of-match  optimizations  by  setting
4151       option no_start_optimize when calling compile/2 or run/3, or by  start‐
4152       ing the pattern with (*NO_START_OPT).
4153
4154       Experiments  with  Perl  suggest that it too has similar optimizations,
4155       sometimes leading to anomalous results.
4156
4157       Verbs That Act Immediately
4158
4159       The following verbs act as soon as they are encountered. They must  not
4160       be followed by a name.
4161
4162       (*ACCEPT)
4163
4164       This  verb causes the match to end successfully, skipping the remainder
4165       of the pattern. However, when it is inside a subpattern that is  called
4166       as  a  subroutine, only that subpattern is ended successfully. Matching
4167       then continues at the outer level. If (*ACCEPT) is triggered in a posi‐
4168       tive  assertion,  the  assertion succeeds; in a negative assertion, the
4169       assertion fails.
4170
4171       If (*ACCEPT) is inside capturing parentheses, the data so far  is  cap‐
4172       tured.  For  example, the following matches "AB", "AAD", or "ACD". When
4173       it matches "AB", "B" is captured by the outer parentheses.
4174
4175       A((?:A|B(*ACCEPT)|C)D)
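
           The captures can be inspected from Erlang. This is a minimal sketch
           with illustrative module and function names; the subject "ABD" is
           added here to show that the match ends at (*ACCEPT) and the final
           "D" is not consumed.

```erlang
-module(accept_demo).
-export([demo/0]).

demo() ->
    Pattern = "A((?:A|B(*ACCEPT)|C)D)",
    Opts = [{capture, all, list}],
    %% Each result is {match,[WholeMatch, Group1]}.
    [re:run(Subject, Pattern, Opts)
     || Subject <- ["AB", "AAD", "ACD", "ABD"]].
    %% "AB"  -> {match,["AB","B"]}
    %% "AAD" -> {match,["AAD","AD"]}
    %% "ACD" -> {match,["ACD","CD"]}
    %% "ABD" -> {match,["AB","B"]}
```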
4176
4177       The following verb causes a matching failure, forcing  backtracking  to
4178       occur. It is equivalent to (?!) but easier to read.
4179
4180       (*FAIL) or (*F)
4181
4182       The Perl documentation states that it is probably useful only when com‐
4183       bined with (?{}) or (??{}).  Those  are  Perl  features  that  are  not
4184       present in PCRE.
4185
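4186       In PCRE, the nearest equivalent is the callout feature,  which  this
4187       module does not interface, as in the pattern  a+(?C)(*FAIL).  A  match
           with the string "aaaa" always fails, but  the  callout  is  taken
           before each backtrack occurs (in this example, 10 times).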
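
           A short sketch of (*FAIL) in Erlang follows (module and function
           names are illustrative). A branch that ends in (*FAIL) can never
           succeed, so matching falls through to any other alternative.

```erlang
-module(fail_verb_demo).
-export([demo/0]).

demo() ->
    %% (*FAIL) and (?!) are equivalent: both always force a failure.
    Verb = re:run("aaaa", "a+(*FAIL)", [{capture, none}]),      % nomatch
    Assertion = re:run("aaaa", "a+(?!)", [{capture, none}]),    % nomatch
    %% The first branch always fails, so the second one is used.
    Fallback = re:run("aaaa", "a+(*F)|aa", [{capture, all, list}]),
    {Verb, Assertion, Fallback}.                     % Fallback: {match,["aa"]}
```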
4188
4189       Recording Which Path Was Taken
4190
4191       The main purpose of the (*MARK) verb is to track how a match was arrived
4192       at, although it also has a secondary use in conjunction with  advancing
4193       the match starting point (see (*SKIP) below).
4194
4195   Note:
4196       In Erlang, there is no interface to retrieve a mark  with  run/2,3,  so
4197       only the secondary purpose is relevant to the Erlang programmer.
4198
4199       The  rest  of  this  section  is therefore deliberately not adapted for
4200       reading by the Erlang programmer, but the examples can help  in  under‐
4201       standing NAMES as they can be used by (*SKIP).
4202
4203
4204       (*MARK:NAME) or (*:NAME)
4205
4206       A  name  is  always  required  with  this  verb.  There  can be as many
4207       instances of (*MARK) as you like in a pattern, and their names  do  not
4208       have to be unique.
4209
4210       When  a  match succeeds, the name of the last encountered (*MARK:NAME),
4211       (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed  back  to
4212       the  caller as described in section "Extra data for pcre_exec()" in the
4213       pcreapi documentation. In the following example of pcretest output, the
4214       /K modifier requests the retrieval and outputting of (*MARK) data:
4215
4216         re> /X(*MARK:A)Y|X(*MARK:B)Z/K
4217       data> XY
4218        0: XY
4219       MK: A
4220       XZ
4221        0: XZ
4222       MK: B
4223
4224       The (*MARK) name is tagged with "MK:" in this output, and in this exam‐
4225       ple it indicates which of the two alternatives matched. This is a  more
4226       efficient  way of obtaining this information than putting each alterna‐
4227       tive in its own capturing parentheses.
4228
4229       If a verb with a name is encountered in a positive  assertion  that  is
4230       true,  the  name  is recorded and passed back if it is the last encoun‐
4231       tered. This does not occur for negative assertions or failing  positive
4232       assertions.
4233
4234       After  a  partial match or a failed match, the last encountered name in
4235       the entire match process is returned, for example:
4236
4237         re> /X(*MARK:A)Y|X(*MARK:B)Z/K
4238       data> XP
4239       No match, mark = B
4240
4241       Notice that in this unanchored example, the mark is retained  from  the
4242       match  attempt  that  started  at letter "X" in the subject. Subsequent
4243       match attempts starting at "P" and then with an empty string do not get
4244       as far as the (*MARK) item, but nevertheless do not reset it.
4245
4246       Verbs That Act after Backtracking
4247
4248       The following verbs do nothing when they are encountered. Matching con‐
4249       tinues with what follows, but if there is no subsequent match,  causing
4250       a  backtrack  to  the  verb, a failure is forced. That is, backtracking
4251       cannot pass to the left of the verb. However, when one of  these  verbs
4252       appears inside an atomic group or an assertion that is true, its effect
4253       is confined to that group, as once the group has been matched, there is
4254       never  any  backtracking  into  it. In this situation, backtracking can
4255       "jump back" to the left  of  the  entire  atomic  group  or  assertion.
4256       (Remember, as stated above, that this localization also  applies  in
4257       subroutine calls.)
4258
4259       These verbs differ in exactly what kind of failure  occurs  when  back‐
4260       tracking reaches them. The behavior described below is what occurs when
4261       the verb is not in a subroutine or an  assertion.  Subsequent  sections
4262       cover these special cases.
4263
4264       The  following  verb,  which must not be followed by a name, causes the
4265       whole match to fail outright if there is a later matching failure  that
4266       causes  backtracking to reach it. Even if the pattern is unanchored, no
4267       further attempts to find a match by advancing the starting  point  take
4268       place.
4269
4270       (*COMMIT)
4271
4272       If (*COMMIT) is the only backtracking verb that is encountered, once it
4273       has been passed, run/2,3 is committed to finding a match at the current
4274       starting point, or not at all, for example:
4275
4276       a+(*COMMIT)b
4277
4278       This  matches  "xxaab" but not "aacaab". It can be thought of as a kind
4279       of dynamic anchor, or "I've started, so I must finish". The name of the
4280       most  recently passed (*MARK) in the path is passed back when (*COMMIT)
4281       forces a match failure.
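
           The two subjects above can be tried from Erlang as follows. This is
           a minimal sketch with illustrative names; the third call, without
           the verb, is added for comparison.

```erlang
-module(commit_demo).
-export([demo/0]).

demo() ->
    Opts = [{capture, all, list}],
    {re:run("xxaab", "a+(*COMMIT)b", Opts),    % {match,["aab"]}
     %% "aa" is followed by "c": backtracking reaches (*COMMIT), so the
     %% later "aab" is never tried.
     re:run("aacaab", "a+(*COMMIT)b", Opts),   % nomatch
     %% Without the verb, the match start advances and "aab" is found.
     re:run("aacaab", "a+b", Opts)}.           % {match,["aab"]}
```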
4282
4283       If more than one backtracking verb exists in a pattern, a different one
4284       that follows (*COMMIT) can be triggered first, so merely passing (*COM‐
4285       MIT) during a match does not always guarantee that a match must  be  at
4286       this starting point.
4287
4288       Notice  that  (*COMMIT) at the start of a pattern is not the same as an
4289       anchor, unless the PCRE start-of-match optimizations are turned off, as
4290       shown in the following example:
4291
4292       1> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list}]).
4293       {match,["abc"]}
4294       2> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list},no_start_optimize]).
4295       nomatch
4296
4297       For this pattern, PCRE knows that any match must start with "a", so the
4298       optimization skips along the subject to "a" before applying the pattern
4299       to the first set of data. The match attempt then succeeds. In the  sec‐
4300       ond call, option no_start_optimize disables the optimization that skips
4301       along  to  the  first character. The pattern is now applied starting at
4302       "x", and so the (*COMMIT) causes the match to fail without  trying  any
4303       other starting points.
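
           As mentioned earlier, the optimizations can also be suppressed from
           within the pattern by starting it with (*NO_START_OPT). The sketch
           below (illustrative module and function names) puts the three
           variants side by side; the third result assumes that the verb has
           the same effect as option no_start_optimize.

```erlang
-module(no_start_opt_demo).
-export([demo/0]).

demo() ->
    Opts = [{capture, all, list}],
    {re:run("xyzabc", "(*COMMIT)abc", Opts),                        % {match,["abc"]}
     re:run("xyzabc", "(*COMMIT)abc", [no_start_optimize | Opts]),  % nomatch
     re:run("xyzabc", "(*NO_START_OPT)(*COMMIT)abc", Opts)}.        % nomatch
```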
4304
4305       The  following  verb  causes  the match to fail at the current starting
4306       position in the subject if there  is  a  later  matching  failure  that
4307       causes backtracking to reach it:
4308
4309       (*PRUNE) or (*PRUNE:NAME)
4310
4311       If  the  pattern  is  unanchored, the normal "bumpalong" advance to the
4312       next starting character then occurs. Backtracking can occur as usual to
4313       the  left  of  (*PRUNE),  before it is reached, or when matching to the
4314       right of (*PRUNE), but if there is no match to the right,  backtracking
4315       cannot  cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an
4316       alternative to an atomic group or possessive quantifier, but there  are
4317       some  uses of (*PRUNE) that cannot be expressed in any other way. In an
4318       anchored pattern, (*PRUNE) has the same effect as (*COMMIT).
4319
4320       The behavior of (*PRUNE:NAME) is not the same as
4321       (*MARK:NAME)(*PRUNE).  It  is  like  (*MARK:NAME)  in  that the name is
4322       remembered for  passing  back  to  the  caller.  However,  (*SKIP:NAME)
4323       searches only for names set with (*MARK).
4324
4325   Note:
4326       The fact that (*PRUNE:NAME) remembers the name is useless to the Erlang
4327       programmer, as names cannot be retrieved.
4328
4329
4330       The following verb, when specified without a name,  is  like  (*PRUNE),
4331       except  that  if  the pattern is unanchored, the "bumpalong" advance is
4332       not to the next character, but to the position  in  the  subject  where
4333       (*SKIP) was encountered.
4334
4335       (*SKIP)
4336
4337       (*SKIP)  signifies that whatever text was matched leading up to it can‐
4338       not be part of a successful match. Consider:
4339
4340       a+(*SKIP)b
4341
4342       If the subject is "aaaac...",  after  the  first  match  attempt  fails
4343       (starting  at  the  first  character in the string), the starting point
4344       skips on to start the next attempt at "c".  Notice  that  a  possessive
4345       quantifier  does  not have the same effect as this example; although it
4346       would suppress backtracking during the first match attempt, the  second
4347       attempt  would  start at the second character instead of skipping on to
4348       "c".
4349
4350       When (*SKIP) has an associated name, its behavior is modified:
4351
4352       (*SKIP:NAME)
4353
4354       When this is triggered,  the  previous  path  through  the  pattern  is
4355       searched  for the most recent (*MARK) that has the same name. If one is
4356       found, the "bumpalong" advance is to the subject position  that  corre‐
4357       sponds  to that (*MARK) instead of to where (*SKIP) was encountered. If
4358       no (*MARK) with a matching name is found, (*SKIP) is ignored.
4359
4360       Notice that (*SKIP:NAME) searches only for names set  by  (*MARK:NAME).
4361       It ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).
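
           Although mark names cannot be retrieved in Erlang, their effect on
           (*SKIP) can be observed through the match position. The following
           is a sketch under that reading of the rules above; the patterns,
           the mark names, and the module name are made up for illustration.

```erlang
-module(skip_name_demo).
-export([demo/0]).

demo() ->
    Subject = "abbd",
    Opts = [{capture, first, index}],   % results are {Offset, Length}
    %% Plain (*SKIP): the next attempt starts at "d", where nothing matches.
    Plain = re:run(Subject, "a(*MARK:m)bb(*SKIP)c|b", Opts),
    %% (*SKIP:m): the next attempt starts at the mark (offset 1), where the
    %% second alternative matches "b".
    Named = re:run(Subject, "a(*MARK:m)bb(*SKIP:m)c|b", Opts),
    %% No mark called "other" exists, so (*SKIP:other) is ignored and the
    %% normal advance to offset 1 takes place.
    NoMark = re:run(Subject, "a(*MARK:m)bb(*SKIP:other)c|b", Opts),
    {Plain, Named, NoMark}.
```

           With the rules as described, the expected result is
           {nomatch,{match,[{1,1}]},{match,[{1,1}]}}.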
4362
4363       The following verb causes a skip to the next innermost alternative when
4364       backtracking reaches it. That is, it cancels any  further  backtracking
4365       within the current alternative.
4366
4367       (*THEN) or (*THEN:NAME)
4368
4369       The verb name comes from the observation that it can be used for a pat‐
4370       tern-based if-then-else block:
4371
4372       ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
4373
4374       If the COND1 pattern matches, FOO is tried (and possibly further  items
4375       after  the  end  of the group if FOO succeeds). On failure, the matcher
4376       skips to the second alternative and tries COND2,  without  backtracking
4377       into COND1. If that succeeds and BAR fails, COND3 is tried. If BAZ then
4378       fails, there are no more alternatives, so there is a backtrack to what‐
4379       ever came before the entire group. If (*THEN) is not inside an alterna‐
4380       tion, it acts like (*PRUNE).
4381
4382       The behavior of (*THEN:NAME) is not the same as
4383       (*MARK:NAME)(*THEN). It is like (*MARK:NAME) in that the name is remem‐
4384       bered for passing back to the caller.  However,  (*SKIP:NAME)  searches
4385       only for names set with (*MARK).
4386
4387   Note:
4388       The  fact that (*THEN:NAME) remembers the name is useless to the Erlang
4389       programmer, as names cannot be retrieved.
4390
4391
4392       A subpattern that does not contain a | character is just a part of  the
4393       enclosing  alternative;  it  is  not a nested alternation with only one
4394       alternative. The effect of (*THEN) extends beyond such a subpattern  to
4395       the  enclosing alternative. Consider the following pattern, where A, B,
4396       and so on, are complex pattern fragments that  do  not  contain  any  |
4397       characters at this level:
4398
4399       A (B(*THEN)C) | D
4400
4401       If  A and B are matched, but there is a failure in C, matching does not
4402       backtrack into A; instead it moves to the next alternative, that is, D.
4403       However,  if the subpattern containing (*THEN) is given an alternative,
4404       it behaves differently:
4405
4406       A (B(*THEN)C | (*FAIL)) | D
4407
4408       The effect of (*THEN) is now confined to the inner subpattern. After  a
4409       failure in C, matching moves to (*FAIL), which causes the whole subpat‐
4410       tern to fail, as there are no more alternatives to try. In  this  case,
4411       matching does now backtrack into A.
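
           A concrete instance of the two patterns can be sketched as follows.
           The fragments are chosen for illustration only: A is a+?, B is a,
           C is b, and D is ^aa. The module name is also illustrative.

```erlang
-module(then_scope_demo).
-export([demo/0]).

demo() ->
    Subject = "aaab",
    Opts = [{capture, first, list}],
    %% No verb: a+? is extended by backtracking until "ab" fits.
    NoVerb = re:run(Subject, "^a+?(ab)|^aa", Opts),
    %% (*THEN) in a group without "|": a failure of C skips straight to
    %% the next top-level alternative without backtracking into A.
    Then = re:run(Subject, "^a+?(a(*THEN)b)|^aa", Opts),
    %% The group now has its own alternative, so (*THEN) is confined to
    %% it and matching backtracks into A again.
    Confined = re:run(Subject, "^a+?(a(*THEN)b|(*FAIL))|^aa", Opts),
    {NoVerb, Then, Confined}.
```

           The expected result is {{match,["aaab"]},{match,["aa"]},
           {match,["aaab"]}}.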
4412
4413       Notice  that  a  conditional subpattern is not considered as having two
4414       alternatives, as only one is ever used. That is, the | character  in  a
4415       conditional  subpattern  has  a different meaning. Ignoring whitespace,
4416       consider:
4417
4418       ^.*? (?(?=a) a | b(*THEN)c )
4419
4420       If the subject is  "ba",  this  pattern  does  not  match.  As  .*?  is
4421       ungreedy,  it  initially  matches  zero characters. The condition (?=a)
4422       then fails, the character "b" is matched,  but  "c"  is  not.  At  this
4423       point,  matching  does  not backtrack to .*? as can perhaps be expected
4424       from the presence of the | character.  The  conditional  subpattern  is
4425       part of the single alternative that comprises the whole pattern, and so
4426       the match fails. (If there was a backtrack into  .*?,  allowing  it  to
4427       match "b", the match would succeed.)
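
           This can be reproduced with option extended, which makes the
           whitespace in the pattern insignificant. The sketch below uses
           illustrative module and function names; the second pattern drops
           the verb to show the backtrack into .*? that (*THEN) prevents.

```erlang
-module(then_conditional_demo).
-export([demo/0]).

demo() ->
    Opts = [extended, {capture, none}],
    WithThen = re:run("ba", "^.*? (?(?=a) a | b(*THEN)c )", Opts),  % nomatch
    NoVerb = re:run("ba", "^.*? (?(?=a) a | bc )", Opts),           % match
    {WithThen, NoVerb}.
```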
4428
4429       The verbs described above provide four different "strengths" of control
4430       when subsequent matching fails:
4431
4432         * (*THEN) is the weakest, carrying on the match at the next  alterna‐
4433           tive.
4434
4435         * (*PRUNE) comes next, failing the match at the current starting posi‐
4436           tion, but allowing an advance to the next character  (for  an  unan‐
4437           chored pattern).
4438
4439         * (*SKIP)  is  similar,  except that the advance can be more than one
4440           character.
4441
4442         * (*COMMIT) is the strongest, causing the entire match to fail.
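
           The four strengths can be compared on one subject by looking at
           where the match is found. This is a minimal sketch; the pattern,
           the subject, and the module name are made up for illustration.

```erlang
-module(verb_strength_demo).
-export([demo/0]).

%% In every pattern "ab" matches at offset 0 and "c" then fails against
%% "d", so backtracking reaches the verb.
demo() ->
    Subject = "abdb",
    Opts = [{capture, first, index}],   % results are {Offset, Length}
    [{Verb, re:run(Subject, "ab(*" ++ Verb ++ ")c|a|b", Opts)}
     || Verb <- ["THEN", "PRUNE", "SKIP", "COMMIT"]].
    %% THEN:   {match,[{0,1}]}  next alternative "a" at the same start
    %% PRUNE:  {match,[{1,1}]}  start advances by one character
    %% SKIP:   {match,[{3,1}]}  start advances to where (*SKIP) was met
    %% COMMIT: nomatch          no further start positions are tried
```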
4443
4444       More than One Backtracking Verb
4445
4446       If more than one backtracking verb is present in  a  pattern,  the  one
4447       that  is backtracked onto first acts. For example, consider the follow‐
4448       ing pattern, where A, B, and so on, are complex pattern fragments:
4449
4450       (A(*COMMIT)B(*THEN)C|ABD)
4451
4452       If A matches but B fails, the backtrack to (*COMMIT) causes the  entire
4453       match to fail. However, if A and B match, but C fails, the backtrack to
4454       (*THEN) causes the next alternative (ABD) to be tried. This behavior is
4455       consistent, but is not always the same as in Perl. It means that if two
4456       or more backtracking verbs appear in succession, all but  the  last  of
4457       them have no effect. Consider a pattern containing (*COMMIT)(*PRUNE).
4458
4459       If there is a matching failure to the right, backtracking onto (*PRUNE)
4460       causes it to be triggered, and its action is taken. There can never  be
4461       a backtrack onto (*COMMIT).
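
           The effect of placing (*PRUNE) directly after (*COMMIT) can be
           sketched as follows (illustrative pattern and names). With
           (*COMMIT) alone the whole match fails; with (*PRUNE) after it,
           only the current starting position is given up.

```erlang
-module(verb_sequence_demo).
-export([demo/0]).

demo() ->
    Subject = "abdb",
    Opts = [{capture, first, index}],
    CommitOnly = re:run(Subject, "ab(*COMMIT)c|b", Opts),           % nomatch
    %% (*PRUNE) is backtracked onto first, so (*COMMIT) never acts and
    %% "b" is found at offset 1.
    CommitPrune = re:run(Subject, "ab(*COMMIT)(*PRUNE)c|b", Opts),  % {match,[{1,1}]}
    {CommitOnly, CommitPrune}.
```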
4462
4463       Backtracking Verbs in Repeated Groups
4464
4465       PCRE  differs  from  Perl  in  its  handling  of  backtracking verbs in
4466       repeated groups. For example, consider:
4467
4468       /(a(*COMMIT)b)+ac/
4469
4470       If the subject is "abac", Perl matches,  but  PCRE  fails  because  the
4471       (*COMMIT) in the second repeat of the group acts.
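
           In Erlang this reads as follows (illustrative module and function
           names). The second call omits the verb for comparison.

```erlang
-module(commit_repeat_demo).
-export([demo/0]).

demo() ->
    Opts = [{capture, none}],
    %% The second repeat of the group matches "a", passes (*COMMIT), and
    %% then fails on "c", which fails the whole match.
    {re:run("abac", "(a(*COMMIT)b)+ac", Opts),   % nomatch
     re:run("abac", "(ab)+ac", Opts)}.           % match
```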
4472
4473       Backtracking Verbs in Assertions
4474
4475       (*FAIL)  in  an assertion has its normal effect: it forces an immediate
4476       backtrack.
4477
4478       (*ACCEPT) in a positive assertion causes the assertion to succeed with‐
4479       out  any  further processing. In a negative assertion, (*ACCEPT) causes
4480       the assertion to fail without any further processing.
4481
4482       The other backtracking verbs are not treated specially if  they  appear
4483       in  a  positive  assertion.  In  particular,  (*THEN) skips to the next
4484       alternative in the innermost enclosing  group  that  has  alternations,
4485       regardless of whether this is within the assertion.
4486
4487       Negative  assertions are, however, different, to ensure that changing a
4488       positive assertion into a negative assertion changes its result.  Back‐
4489       tracking  into (*COMMIT), (*SKIP), or (*PRUNE) causes a negative asser‐
4490       tion to be true, without considering any further  alternative  branches
4491       in  the  assertion.  Backtracking into (*THEN) causes it to skip to the
4492       next enclosing alternative within the assertion (the normal  behavior),
4493       but if the assertion does not have such an alternative, (*THEN) behaves
4494       like (*PRUNE).
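
           The rule for negative assertions can be illustrated with a sketch
           such as the following. The patterns and names are made up for the
           example, and the result assumes the behavior described above.

```erlang
-module(commit_assertion_demo).
-export([demo/0]).

demo() ->
    Opts = [{capture, none}],
    %% The second branch of the assertion matches "ac", so the negative
    %% assertion is false and nothing matches.
    Plain = re:run("ac", "(?!ab|ac)ac", Opts),               % nomatch
    %% Backtracking into (*COMMIT) makes the negative assertion true
    %% at once; the "ac" branch is never considered.
    Commit = re:run("ac", "(?!a(*COMMIT)b|ac)ac", Opts),     % match
    {Plain, Commit}.
```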
4495
4496       Backtracking Verbs in Subroutines
4497
4498       These behaviors occur regardless of whether the  subpattern  is  called
4499       recursively. The treatment of subroutines in Perl is different in some
4500       cases.
4501
4502         * (*FAIL) in a subpattern called  as  a  subroutine  has  its  normal
4503           effect: it forces an immediate backtrack.
4504
4505         * (*ACCEPT) in a subpattern called as a subroutine causes the subrou‐
4506           tine match to succeed without any further processing. Matching then
4507           continues after the subroutine call.
4508
4509         * (*COMMIT),  (*SKIP),  and (*PRUNE) in a subpattern called as a sub‐
4510           routine cause the subroutine match to fail.
4511
4512         * (*THEN) skips to the next alternative in  the  innermost  enclosing
4513           group  within  the subpattern that has alternatives. If there is no
4514           such group within the subpattern,  (*THEN)  causes  the  subroutine
4515           match to fail.
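
           The confinement of (*COMMIT) to a subroutine can be sketched as
           follows. The group name x, the patterns, and the module name are
           illustrative; the subpattern is defined in a DEFINE group so that
           it is only ever used as a subroutine.

```erlang
-module(commit_subroutine_demo).
-export([demo/0]).

demo() ->
    Opts = [{capture, none}],
    %% Inline: backtracking onto (*COMMIT) fails the whole match.
    Inline = re:run("ac", "(?:a(*COMMIT)b|ac)", Opts),              % nomatch
    %% Called as a subroutine: only the call fails, so the second
    %% alternative "ac" is still tried.
    Called = re:run("ac", "(?(DEFINE)(?<x>a(*COMMIT)b))(?:(?&x)|ac)",
                    Opts),                                          % match
    {Inline, Called}.
```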
4516
4517Ericsson AB                     stdlib 3.4.5.1                           re(3)