re(3)                      Erlang Module Definition                      re(3)

NAME

6       re - Perl-like regular expressions for Erlang.
7

DESCRIPTION

9       This  module contains regular expression matching functions for strings
10       and binaries.
11
12       The regular expression syntax and semantics resemble that of Perl.
13
14       The matching algorithms of the library are based on the  PCRE  library,
15       but  not  all  of  the PCRE library is interfaced and some parts of the
16       library go  beyond  what  PCRE  offers.  Currently  PCRE  version  8.40
17       (release  date 2017-01-11) is used. The sections of the PCRE documenta‐
18       tion that are relevant to this module are included here.
19
20   Note:
21       The Erlang literal syntax for strings uses the "\" (backslash)  charac‐
22       ter  as  an  escape  code.  You  need  to escape backslashes in literal
23       strings, both in your code and in the shell, with an  extra  backslash,
24       that is, "\\".
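
       For illustration, a pattern meant to match one or more digits is
       written \d+ in regular expression syntax, but must be typed as "\\d+"
       in an Erlang string literal (the subject and pattern here are made up
       for this sketch):

       1> re:run("abc123", "\\d+", [{capture, first, list}]).
       {match,["123"]}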
25
26

DATA TYPES

28       mp() = {re_pattern, term(), term(), term(), term()}
29
30              Opaque  data type containing a compiled regular expression. mp()
31              is guaranteed to be a tuple() having the atom re_pattern as  its
32              first element, to allow for matching in guards. The arity of the
33              tuple or the content of the other fields can  change  in  future
34              Erlang/OTP releases.
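
              A minimal sketch of such a guard test follows. The fun F and
              the atoms compiled and not_compiled are made up for this
              illustration. Only the tag in the first element is inspected;
              the other fields are left alone:

              1> {ok, MP} = re:compile("a+"), ok.
              ok
              2> element(1, MP).
              re_pattern
              3> F = fun(X) when element(1, X) =:= re_pattern -> compiled;
                        (_) -> not_compiled end, {F(MP), F(<<"a+">>)}.
              {compiled,not_compiled}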
35
36       nl_spec() = cr | crlf | lf | anycrlf | any
37
38       compile_option() =
39           unicode | anchored | caseless | dollar_endonly | dotall |
40           extended | firstline | multiline | no_auto_capture |
41           dupnames | ungreedy |
42           {newline, nl_spec()} |
43           bsr_anycrlf | bsr_unicode | no_start_optimize | ucp |
44           never_utf
45

EXPORTS

47       version() -> binary()
48
              Returns a binary containing the PCRE version of the system that
              was used in the Erlang/OTP compilation.
51
52       compile(Regexp) -> {ok, MP} | {error, ErrSpec}
53
54              Types:
55
56                 Regexp = iodata()
57                 MP = mp()
58                 ErrSpec =
59                     {ErrString :: string(), Position :: integer() >= 0}
60
              The same as compile(Regexp,[]).
62
63       compile(Regexp, Options) -> {ok, MP} | {error, ErrSpec}
64
65              Types:
66
67                 Regexp = iodata() | unicode:charlist()
68                 Options = [Option]
69                 Option = compile_option()
70                 MP = mp()
71                 ErrSpec =
72                     {ErrString :: string(), Position :: integer() >= 0}
73
74              Compiles a regular expression, with the syntax described  below,
75              into an internal format to be used later as a parameter to run/2
76              and run/3.
77
78              Compiling the regular expression before matching  is  useful  if
79              the  same  expression is to be used in matching against multiple
80              subjects during the lifetime of the program. Compiling once  and
81              executing  many  times is far more efficient than compiling each
82              time one wants to match.
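
              As a sketch of this usage (the pattern and subjects are made up
              for illustration), the expression is compiled once and the
              resulting mp() is reused for every subject:

              1> {ok, MP} = re:compile("^[a-z]+$"), ok.
              ok
              2> [re:run(S, MP, [{capture, none}]) || S <- ["abc", "ab1"]].
              [match,nomatch]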
83
84              When option unicode is specified, the regular expression  is  to
85              be  specified  as  a  valid Unicode charlist(), otherwise as any
86              valid iodata().
87
88              Options:
89
90                unicode:
91                  The regular expression is specified as a Unicode  charlist()
92                  and  the  resulting  regular  expression  code  is to be run
93                  against a valid Unicode charlist()  subject.  Also  consider
94                  option ucp when using Unicode characters.
95
96                anchored:
97                  The  pattern is forced to be "anchored", that is, it is con‐
98                  strained to match only at the first matching  point  in  the
99                  string  that is searched (the "subject string"). This effect
100                  can also be achieved by appropriate constructs in  the  pat‐
101                  tern itself.
102
103                caseless:
104                  Letters  in  the  pattern match both uppercase and lowercase
105                  letters. It is equivalent to  Perl  option  /i  and  can  be
106                  changed within a pattern by a (?i) option setting. Uppercase
107                  and lowercase letters are defined as in the ISO 8859-1 char‐
108                  acter set.
109
110                dollar_endonly:
111                  A  dollar  metacharacter  in the pattern matches only at the
112                  end of the subject string. Without  this  option,  a  dollar
113                  also  matches immediately before a newline at the end of the
114                  string (but not before any other newlines). This  option  is
115                  ignored if option multiline is specified. There is no equiv‐
116                  alent option in Perl, and it cannot be set within a pattern.
117
118                dotall:
119                  A dot in the pattern matches all characters, including those
120                  indicating  newline.  Without  it, a dot does not match when
121                  the current position is at a newline. This option is equiva‐
122                  lent  to  Perl option /s and it can be changed within a pat‐
123                  tern by a (?s) option setting. A  negative  class,  such  as
124                  [^a],  always matches newline characters, independent of the
125                  setting of this option.
126
127                extended:
128                  If this option is set, most white space  characters  in  the
129                  pattern  are totally ignored except when escaped or inside a
130                  character class. However, white space is not allowed  within
131                  sequences  such  as (?> that introduce various parenthesized
132                  subpatterns, nor  within  a  numerical  quantifier  such  as
133                  {1,3}.  However,  ignorable white space is permitted between
134                  an item and a following quantifier and between a  quantifier
135                  and a following + that indicates possessiveness.
136
                  Before PCRE release 8.34, white space did not include the VT
                  character (code 11), because Perl did not treat this
                  character as white space. Perl changed at release 5.18, PCRE
                  followed at release 8.34, and VT is now treated as white
                  space.
141
142                  This also causes characters between an unescaped # outside a
143                  character  class  and  the  next  newline,  inclusive, to be
144                  ignored. This is equivalent to Perl's /x option, and it  can
145                  be changed within a pattern by a (?x) option setting.
146
147                  With  this  option, comments inside complicated patterns can
148                  be included. However, notice that this applies only to  data
149                  characters.  Whitespace  characters  can never appear within
150                  special character sequences in a pattern, for example within
151                  sequence (?( that introduces a conditional subpattern.
152
153                firstline:
154                  An  unanchored pattern is required to match before or at the
155                  first newline in the subject string,  although  the  matched
156                  text can continue over the newline.
157
158                multiline:
159                  By  default, PCRE treats the subject string as consisting of
160                  a single line of characters (even if it contains  newlines).
161                  The  "start  of  line" metacharacter (^) matches only at the
162                  start of the string, while the "end of  line"  metacharacter
163                  ($)  matches only at the end of the string, or before a ter‐
164                  minating newline (unless  option  dollar_endonly  is  speci‐
165                  fied). This is the same as in Perl.
166
167                  When  this option is specified, the "start of line" and "end
168                  of line" constructs match immediately following  or  immedi‐
169                  ately  before  internal  newlines  in  the  subject  string,
170                  respectively, as well as at the very start and end. This  is
171                  equivalent  to  Perl  option  /m and can be changed within a
172                  pattern by a (?m) option setting. If there are  no  newlines
173                  in  a  subject string, or no occurrences of ^ or $ in a pat‐
174                  tern, setting multiline has no effect.
175
176                no_auto_capture:
177                  Disables the use of numbered capturing  parentheses  in  the
178                  pattern.  Any  opening parenthesis that is not followed by ?
179                  behaves as if it is followed by ?:.  Named  parentheses  can
180                  still be used for capturing (and they acquire numbers in the
181                  usual way). There is no equivalent option in Perl.
182
183                dupnames:
184                  Names used to identify capturing  subpatterns  need  not  be
185                  unique.  This  can  be  helpful for certain types of pattern
186                  when it is known that only one instance of the named subpat‐
187                  tern  can ever be matched. More details of named subpatterns
188                  are provided below.
189
190                ungreedy:
191                  Inverts the "greediness" of the quantifiers so that they are
192                  not greedy by default, but become greedy if followed by "?".
193                  It is not compatible with Perl. It can also be set by a (?U)
194                  option setting within the pattern.
195
196                {newline, NLSpec}:
197                  Overrides the default definition of a newline in the subject
198                  string, which is LF (ASCII 10) in Erlang.
199
200                  cr:
                    Newline is indicated by a single character CR (ASCII 13).
202
203                  lf:
204                    Newline is indicated by a single character LF (ASCII  10),
205                    the default.
206
207                  crlf:
208                    Newline  is  indicated by the two-character CRLF (ASCII 13
209                    followed by ASCII 10) sequence.
210
211                  anycrlf:
212                    Any of the three preceding sequences is to be recognized.
213
214                  any:
215                    Any of  the  newline  sequences  above,  and  the  Unicode
216                    sequences   VT   (vertical  tab,  U+000B),  FF  (formfeed,
217                    U+000C), NEL (next  line,  U+0085),  LS  (line  separator,
218                    U+2028), and PS (paragraph separator, U+2029).
219
220                bsr_anycrlf:
                  Specifies that \R is to match only the CR, LF, or CRLF
                  sequences, not the Unicode-specific newline characters.
224
225                bsr_unicode:
                  Specifies that \R is to match all the Unicode newline
                  characters (including CRLF, and so on). This is the default.
228
229                no_start_optimize:
230                  Disables  optimization  that  can  malfunction  if  "Special
231                  start-of-pattern  items"  are present in the regular expres‐
232                  sion. A typical example  would  be  when  matching  "DEFABC"
233                  against "(*COMMIT)ABC", where the start optimization of PCRE
                  would skip the subject up to "A" and never realize that the
                  (*COMMIT) instruction should have made the match fail.
236                  This option is only relevant if  you  use  "start-of-pattern
237                  items",  as  discussed  in  section  PCRE Regular Expression
238                  Details.
239
240                ucp:
241                  Specifies that Unicode character properties are to  be  used
242                  when  resolving  \B,  \b, \D, \d, \S, \s, \W and \w. Without
243                  this flag, only ISO Latin-1 properties are used. Using  Uni‐
244                  code  properties hurts performance, but is semantically cor‐
245                  rect when working with Unicode  characters  beyond  the  ISO
246                  Latin-1 range.
247
248                never_utf:
249                  Specifies  that  the (*UTF) and/or (*UTF8) "start-of-pattern
250                  items" are forbidden. This  flag  cannot  be  combined  with
251                  option  unicode.  Useful  if  ISO  Latin-1  patterns from an
252                  external source are to be compiled.
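
              The following shell session is a small sketch of how some of
              these options change the outcome of a match. The subjects and
              patterns are made up for illustration, and {capture, none} is
              used only to keep the output short:

              1> re:run("ABC", "abc", [{capture, none}]).
              nomatch
              2> re:run("ABC", "abc", [caseless, {capture, none}]).
              match
              3> re:run("a\nb", "a.b", [{capture, none}]).
              nomatch
              4> re:run("a\nb", "a.b", [dotall, {capture, none}]).
              match
              5> re:run("one\ntwo", "^two$", [{capture, none}]).
              nomatch
              6> re:run("one\ntwo", "^two$", [multiline, {capture, none}]).
              match
              7> re:run("abc\n", "abc$", [dollar_endonly, {capture, none}]).
              nomatch
              8> re:run("ab", "a b # note", [extended, {capture, none}]).
              match
              9> re:run("aaa", "a+", [ungreedy, {capture, first, list}]).
              {match,["a"]}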
253
254       inspect(MP, Item) -> {namelist, [binary()]}
255
256              Types:
257
258                 MP = mp()
259                 Item = namelist
260
261              Takes a compiled regular expression and an item, and returns the
262              relevant  data  from  the regular expression. The only supported
263              item  is  namelist,   which   returns   the   tuple   {namelist,
264              [binary()]},  containing the names of all (unique) named subpat‐
265              terns in the regular expression. For example:
266
267              1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
268              {ok,{re_pattern,3,0,0,
269                              <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
270                                255,255,...>>}}
271              2> re:inspect(MP,namelist).
272              {namelist,[<<"A">>,<<"B">>,<<"C">>]}
273              3> {ok,MPD} = re:compile("(?<C>A)|(?<B>B)|(?<C>C)",[dupnames]).
274              {ok,{re_pattern,3,0,0,
275                              <<69,82,67,80,119,0,0,0,0,0,8,0,1,0,0,0,255,255,255,255,
276                                255,255,...>>}}
277              4> re:inspect(MPD,namelist).
278              {namelist,[<<"B">>,<<"C">>]}
279
280              Notice in the second example that the duplicate name only occurs
281              once  in the returned list, and that the list is in alphabetical
282              order regardless of where the names are positioned in the  regu‐
283              lar  expression. The order of the names is the same as the order
284              of captured subexpressions if {capture, all_names} is  specified
285              as  an option to run/3. You can therefore create a name-to-value
286              mapping from the result of run/3 like this:
287
288              1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
289              {ok,{re_pattern,3,0,0,
290                              <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
291                                255,255,...>>}}
292              2> {namelist, N} = re:inspect(MP,namelist).
293              {namelist,[<<"A">>,<<"B">>,<<"C">>]}
294              3> {match,L} = re:run("AA",MP,[{capture,all_names,binary}]).
295              {match,[<<"A">>,<<>>,<<>>]}
296              4> NameMap = lists:zip(N,L).
297              [{<<"A">>,<<"A">>},{<<"B">>,<<>>},{<<"C">>,<<>>}]
298
299       replace(Subject, RE, Replacement) -> iodata() | unicode:charlist()
300
301              Types:
302
303                 Subject = iodata() | unicode:charlist()
304                 RE = mp() | iodata()
305                 Replacement = iodata() | unicode:charlist()
306
307              Same as replace(Subject, RE, Replacement, []).
308
309       replace(Subject, RE, Replacement, Options) ->
310                  iodata() | unicode:charlist()
311
312              Types:
313
314                 Subject = iodata() | unicode:charlist()
315                 RE = mp() | iodata() | unicode:charlist()
316                 Replacement = iodata() | unicode:charlist()
317                 Options = [Option]
318                 Option =
319                     anchored | global | notbol | noteol | notempty |
320                     notempty_atstart |
321                     {offset, integer() >= 0} |
322                     {newline, NLSpec} |
323                     bsr_anycrlf |
324                     {match_limit, integer() >= 0} |
325                     {match_limit_recursion, integer() >= 0} |
326                     bsr_unicode |
327                     {return, ReturnType} |
328                     CompileOpt
329                 ReturnType = iodata | list | binary
330                 CompileOpt = compile_option()
331                 NLSpec = cr | crlf | lf | anycrlf | any
332
333              Replaces the matched part of the Subject string  with  the  con‐
334              tents of Replacement.
335
336              The  permissible  options are the same as for run/3, except that
337              option capture is not allowed. Instead a {return, ReturnType} is
338              present. The default return type is iodata, constructed in a way
339              to minimize copying. The iodata result can be used  directly  in
340              many  I/O  operations.  If  a  flat  list()  is desired, specify
341              {return,  list}.  If  a  binary  is  desired,  specify  {return,
342              binary}.
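
              A short sketch of the return types, using a made-up subject.
              The default iodata result is not printed here, as its exact
              nesting is an implementation detail; it is flattened with
              iolist_to_binary/1 instead:

              1> re:replace("abcd", "c", "X", [{return, list}]).
              "abXd"
              2> re:replace("abcd", "c", "X", [{return, binary}]).
              <<"abXd">>
              3> iolist_to_binary(re:replace("abcd", "c", "X")).
              <<"abXd">>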
343
344              As  in  function  run/3,  an  mp()  compiled with option unicode
345              requires Subject to be a Unicode charlist(). If  compilation  is
346              done  implicitly and the unicode compilation option is specified
347              to this function, both the regular expression and Subject are to
              be specified as valid Unicode charlist()s.
349
              The replacement string can contain the special character &,
              which inserts the whole matching expression in the result, and
              the special sequences \N (where N is an integer > 0), \gN, or
              \g{N}, which insert subexpression number N in the result. If no
              subexpression with that number is generated by the regular
              expression, nothing is inserted.
356
357              To insert an & or a \ in the result, precede it with a \. Notice
358              that  Erlang  already  gives  a  special meaning to \ in literal
359              strings, so a single \ must be written as "\\" and  therefore  a
360              double \ as "\\\\".
361
362              Example:
363
364              re:replace("abcd","c","[&]",[{return,list}]).
365
366              gives
367
368              "ab[c]d"
369
370              while
371
372              re:replace("abcd","c","[\\&]",[{return,list}]).
373
374              gives
375
376              "ab[&]d"
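
              Numbered subexpressions and option global can be sketched in
              the same way. The subjects and the variable Opts are made up
              for illustration:

              1> Opts = [{return, list}], ok.
              ok
              2> re:replace("ab cd", "(\\w+) (\\w+)", "\\2 \\1", Opts).
              "cd ab"
              3> re:replace("ab cd", "(\\w+) (\\w+)", "\\g{2} \\g1", Opts).
              "cd ab"
              4> re:replace("a.b.c", "\\.", "-", [global | Opts]).
              "a-b-c"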
377
378              As  with  run/3,  compilation errors raise the badarg exception.
379              compile/2 can be used to get more information about the error.
380
381       run(Subject, RE) -> {match, Captured} | nomatch
382
383              Types:
384
385                 Subject = iodata() | unicode:charlist()
386                 RE = mp() | iodata()
387                 Captured = [CaptureData]
388                 CaptureData = {integer(), integer()}
389
390              Same as run(Subject,RE,[]).
391
392       run(Subject, RE, Options) ->
393              {match, Captured} | match | nomatch | {error, ErrType}
394
395              Types:
396
397                 Subject = iodata() | unicode:charlist()
398                 RE = mp() | iodata() | unicode:charlist()
399                 Options = [Option]
400                 Option =
401                     anchored | global | notbol | noteol | notempty |
402                     notempty_atstart | report_errors |
403                     {offset, integer() >= 0} |
404                     {match_limit, integer() >= 0} |
405                     {match_limit_recursion, integer() >= 0} |
406                     {newline, NLSpec :: nl_spec()} |
407                     bsr_anycrlf | bsr_unicode |
408                     {capture, ValueSpec} |
409                     {capture, ValueSpec, Type} |
410                     CompileOpt
411                 Type = index | list | binary
412                 ValueSpec =
                     all | all_but_first | all_names | first | none | ValueList
415                 ValueList = [ValueID]
416                 ValueID = integer() | string() | atom()
417                 CompileOpt = compile_option()
418                   See compile/2.
419                 Captured = [CaptureData] | [[CaptureData]]
420                 CaptureData =
421                     {integer(), integer()} | ListConversionData | binary()
422                 ListConversionData =
423                     string() |
424                     {error, string(), binary()} |
425                     {incomplete, string(), binary()}
426                 ErrType =
                     match_limit | match_limit_recursion | {compile, CompileErr}
429                 CompileErr =
430                     {ErrString :: string(), Position :: integer() >= 0}
431
432              Executes   a   regular   expression   matching,   and    returns
433              match/{match,  Captured}  or nomatch. The regular expression can
434              be specified either as iodata() in which case  it  is  automati‐
435              cally  compiled  (as by compile/2) and executed, or as a precom‐
436              piled mp() in which case it  is  executed  against  the  subject
437              directly.
438
439              When  compilation  is  involved, exception badarg is thrown if a
440              compilation error occurs.  Call  compile/2  to  get  information
441              about the location of the error in the regular expression.
442
443              If  the  regular  expression  is previously compiled, the option
444              list can only contain the following options:
445
446                * anchored
447
448                * {capture, ValueSpec}/{capture, ValueSpec, Type}
449
450                * global
451
452                * {match_limit, integer() >= 0}
453
454                * {match_limit_recursion, integer() >= 0}
455
456                * {newline, NLSpec}
457
458                * notbol
459
460                * notempty
461
462                * notempty_atstart
463
464                * noteol
465
466                * {offset, integer() >= 0}
467
468                * report_errors
469
470              Otherwise all options valid  for  function  compile/2  are  also
471              allowed. Options allowed both for compilation and execution of a
472              match, namely anchored and {newline, NLSpec},  affect  both  the
473              compilation and execution if present together with a non-precom‐
474              piled regular expression.
475
476              If the regular expression was previously  compiled  with  option
477              unicode,   Subject   is  to  be  provided  as  a  valid  Unicode
478              charlist(), otherwise any iodata() will do.  If  compilation  is
479              involved  and  option unicode is specified, both Subject and the
480              regular  expression  are  to  be  specified  as  valid   Unicode
              charlist()s.
482
483              {capture,  ValueSpec}/{capture, ValueSpec, Type} defines what to
484              return from the function upon successful matching.  The  capture
485              tuple  can  contain both a value specification, telling which of
486              the captured substrings are to be returned, and a type  specifi‐
487              cation,  telling  how captured substrings are to be returned (as
488              index tuples, lists, or binaries). The options are described  in
489              detail below.
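
              As a brief sketch of how the value and type specifications
              interact (S and RE are made-up names for an illustrative
              subject and pattern):

              1> S = "key=value", RE = "(\\w+)=(\\w+)", ok.
              ok
              2> re:run(S, RE).
              {match,[{0,9},{0,3},{4,5}]}
              3> re:run(S, RE, [{capture, all, list}]).
              {match,["key=value","key","value"]}
              4> re:run(S, RE, [{capture, all_but_first, binary}]).
              {match,[<<"key">>,<<"value">>]}
              5> re:run(S, RE, [{capture, [2], list}]).
              {match,["value"]}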
490
491              If  the  capture options describe that no substring capturing is
492              to be done ({capture, none}), the function  returns  the  single
493              atom match upon successful matching, otherwise the tuple {match,
494              ValueList}. Disabling capturing can be done either by specifying
495              none or an empty list as ValueSpec.
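
              For example, with a made-up subject, the three possible shapes
              of the return value look like this:

              1> re:run("abc", "b", [{capture, none}]).
              match
              2> re:run("abc", "b", [{capture, first}]).
              {match,[{1,1}]}
              3> re:run("abc", "x", [{capture, none}]).
              nomatch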
496
497              Option report_errors adds the possibility that an error tuple is
498              returned.  The  tuple  either   indicates   a   matching   error
499              (match_limit  or match_limit_recursion), or a compilation error,
500              where the error tuple has  the  format  {error,  {compile,  Com‐
501              pileErr}}. Notice that if option report_errors is not specified,
502              the function never returns error tuples, but reports compilation
503              errors  as  a  badarg  exception  and  failed matches because of
504              exceeded match limits simply as nomatch.
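
              The difference can be sketched with a deliberately invalid
              pattern (an unclosed parenthesis). The atom badarg_raised is
              made up for this illustration, and the error string and
              position are not shown, as they depend on the PCRE version:

              1> try re:run("abc", "(") catch error:badarg -> badarg_raised end.
              badarg_raised
              2> {error, {compile, _}} = re:run("abc", "(", [report_errors]), ok.
              ok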
505
506              The following options are relevant for execution:
507
508                anchored:
509                  Limits run/3 to matching at the first matching position.  If
510                  a  pattern  was  compiled with anchored, or turned out to be
511                  anchored by virtue of its contents, it cannot be made  unan‐
512                  chored  at  matching  time,  hence  there  is  no unanchored
513                  option.
514
515                global:
516                  Implements global (repetitive) search (flag g in Perl). Each
517                  match  is  returned as a separate list() containing the spe‐
518                  cific match and any matching subexpressions (or as specified
                  by option capture). The Captured part of the return value is
520                  hence a list() of list()s when this option is specified.
521
522                  The interaction of option global with a  regular  expression
523                  that  matches  an  empty  string  surprises some users. When
524                  option global is specified, run/3 handles empty  matches  in
525                  the  same  way  as Perl: a zero-length match at any point is
526                  also retried with options [anchored,  notempty_atstart].  If
527                  that  search  gives  a  result  of length > 0, the result is
528                  included. Example:
529
530                re:run("cat","(|at)",[global]).
531
532                  The following matchings are performed:
533
534                  At offset 0:
                    The regular expression (|at) first matches at the initial
536                    position   of   string   cat,   giving   the   result  set
537                    [{0,0},{0,0}] (the second {0,0} is because of  the  subex‐
538                    pression  marked by the parentheses). As the length of the
539                    match is 0, we do not advance to the next position yet.
540
541                  At offset 0 with [anchored, notempty_atstart]:
542                    The   search   is   retried   with   options    [anchored,
543                    notempty_atstart]  at  the  same  position, which does not
544                    give any interesting  result  of  longer  length,  so  the
545                    search position is advanced to the next character (a).
546
547                  At offset 1:
548                    The  search  results  in  [{1,0},{1,0}], so this search is
549                    also repeated with the extra options.
550
551                  At offset 1 with [anchored, notempty_atstart]:
                    Alternative at is found and the result is [{1,2},{1,2}].
553                    The  result  is added to the list of results and the posi‐
554                    tion in the search string is advanced two steps.
555
556                  At offset 3:
557                    The search once again matches  the  empty  string,  giving
558                    [{3,0},{3,0}].
559
                  At offset 3 with [anchored, notempty_atstart]:
561                    This  gives no result of length > 0 and we are at the last
562                    position, so the global search is complete.
563
564                  The result of the call is:
565
566                {match,[[{0,0},{0,0}],[{1,0},{1,0}],[{1,2},{1,2}],[{3,0},{3,0}]]}
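
                  As a simpler sketch of a global search that involves no
                  empty matches (the subject is made up for illustration),
                  each match is returned in its own inner list:

                re:run("a1b22", "\\d+", [global, {capture, first, list}]).

                  gives

                {match,[["1"],["22"]]}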
567
568                notempty:
569                  An empty string is not considered to be  a  valid  match  if
570                  this  option  is  specified.  If alternatives in the pattern
571                  exist, they are tried. If all  the  alternatives  match  the
572                  empty string, the entire match fails.
573
574                  Example:
575
576                  If  the  following pattern is applied to a string not begin‐
577                  ning with "a" or "b", it  would  normally  match  the  empty
578                  string at the start of the subject:
579
580                a?b?
581
582                  With  option  notempty,  this  match  is  invalid,  so run/3
583                  searches further into the string for occurrences of  "a"  or
584                  "b".
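
                  A sketch of this behavior with the made-up subject "xab":

                re:run("xab", "a?b?").

                  gives the empty match at offset 0:

                {match,[{0,0}]}

                  while

                re:run("xab", "a?b?", [notempty]).

                  gives

                {match,[{1,2}]}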
585
586                notempty_atstart:
587                  Like notempty, except that an empty string match that is not
588                  at the start of the subject is permitted. If the pattern  is
589                  anchored,  such  a  match can occur only if the pattern con‐
590                  tains \K.
591
592                  Perl   has   no   direct   equivalent   of    notempty    or
593                  notempty_atstart,  but it does make a special case of a pat‐
594                  tern match of the empty string within its split()  function,
595                  and  when  using  modifier /g. The Perl behavior can be emu‐
596                  lated after matching a null string by first trying the match
597                  again at the same offset with notempty_atstart and anchored,
598                  and then, if that fails, by advancing  the  starting  offset
599                  (see below) and trying an ordinary match again.
600
601                notbol:
602                  Specifies  that the first character of the subject string is
603                  not the beginning of a line, so the circumflex metacharacter
604                  is  not  to  match before it. Setting this without multiline
605                  (at compile time) causes circumflex  never  to  match.  This
606                  option only affects the behavior of the circumflex metachar‐
607                  acter. It does not affect \A.
608
609                noteol:
610                  Specifies that the end of the subject string is not the  end
611                  of  a  line,  so the dollar metacharacter is not to match it
612                  nor (except in multiline mode) a newline immediately  before
613                  it.  Setting this without multiline (at compile time) causes
614                  dollar never to match. This option affects only the behavior
615                  of the dollar metacharacter. It does not affect \Z or \z.
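
                  As a small sketch of notbol and noteol (the subject is made
                  up, and multiline is not set), both of the following calls
                  give nomatch:

                re:run("abc", "^abc", [notbol, {capture, none}]).
                re:run("abc", "abc$", [noteol, {capture, none}]).

                  while the following call, which uses \A and \z instead,
                  still gives match:

                re:run("abc", "\\Aabc\\z", [notbol, noteol, {capture, none}]).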
616
617                report_errors:
618                  Gives  better  control  of the error handling in run/3. When
619                  specified, compilation errors (if the regular expression  is
620                  not  already  compiled)  and  runtime  errors are explicitly
621                  returned as an error tuple.
622
623                  The following are the possible runtime errors:
624
625                  match_limit:
626                    The PCRE library sets a limit on how many times the inter‐
627                    nal  match  function can be called. Defaults to 10,000,000
628                    in  the  library   compiled   for   Erlang.   If   {error,
629                    match_limit}  is  returned,  the  execution of the regular
630                    expression has reached this limit. This is normally to  be
631                    regarded  as  a nomatch, which is the default return value
632                    when this occurs, but by specifying report_errors, you are
633                    informed when the match fails because of too many internal
634                    calls.
635
636                  match_limit_recursion:
637                    This error is very similar to match_limit, but occurs when
638                    the  internal  match  function  of  PCRE  is "recursively"
639                    called more times than  the  match_limit_recursion  limit,
640                    which  defaults to 10,000,000 as well. Notice that as long
                    as the match_limit and match_limit_recursion values are kept
642                    at  the  default  values,  the match_limit_recursion error
643                    cannot occur, as the match_limit error occurs before  that
644                    (each  recursive call is also a call, but not conversely).
645                    Both limits can however be changed, either by setting lim‐
646                    its directly in the regular expression string (see section
                    PCRE Regular Expression Details) or by specifying options
648                    to run/3.
649
650                  It  is  important  to understand that what is referred to as
651                  "recursion" when limiting matches is not recursion on the  C
652                  stack  of the Erlang machine or on the Erlang process stack.
653                  The PCRE version compiled into the Erlang  VM  uses  machine
654                  "heap"  memory to store values that must be kept over recur‐
655                  sion in regular expression matches.
656
657                {match_limit, integer() >= 0}:
658                  Limits the execution time of a match in  an  implementation-
659                  specific  way.  It is described as follows by the PCRE docu‐
660                  mentation:
661
662                The match_limit field provides a means of preventing PCRE from using
663                up a vast amount of resources when running patterns that are not going
664                to match, but which have a very large number of possibilities in their
665                search trees. The classic example is a pattern that uses nested
666                unlimited repeats.
667
668                Internally, pcre_exec() uses a function called match(), which it calls
669                repeatedly (sometimes recursively). The limit set by match_limit is
670                imposed on the number of times this function is called during a match,
671                which has the effect of limiting the amount of backtracking that can
672                take place. For patterns that are not anchored, the count restarts
673                from zero for each position in the subject string.
674
675                  This means that runaway regular expression matches can  fail
676                  faster  if  the  limit  is  lowered  using  this option. The
677                  default value 10,000,000 is compiled into the Erlang VM.
678
679            Note:
                This option in no way affects the execution of the Erlang VM
                in terms of "long running BIFs". run/3 always gives control
                back to the scheduler of Erlang processes at intervals that
                ensure the real-time properties of the Erlang system.
684
685
686                {match_limit_recursion, integer() >= 0}:
687                  Limits  the execution time and memory consumption of a match
688                  in  an  implementation-specific   way,   very   similar   to
689                  match_limit. It is described as follows by the PCRE documen‐
690                  tation:
691
692                The match_limit_recursion field is similar to match_limit, but instead
693                of limiting the total number of times that match() is called, it
694                limits the depth of recursion. The recursion depth is a smaller number
695                than the total number of calls, because not all calls to match() are
696                recursive. This limit is of use only if it is set smaller than
697                match_limit.
698
699                Limiting the recursion depth limits the amount of machine stack that
700                can be used, or, when PCRE has been compiled to use memory on the heap
701                instead of the stack, the amount of heap memory that can be used.
702
703                  The Erlang VM uses a PCRE library where heap memory is  used
704                  when  regular expression match recursion occurs. This there‐
705                  fore limits the use of machine heap, not C stack.
706
707                  Specifying a lower value can result  in  matches  with  deep
708                  recursion failing, when they should have matched:
709
710                1> re:run("aaaaaaaaaaaaaz","(a+)*z").
711                {match,[{0,14},{0,13}]}
712                2> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5}]).
713                nomatch
714                3> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5},report_errors]).
715                {error,match_limit_recursion}
716
717                  This  option  and  option match_limit are only to be used in
718                  rare cases. Understanding of the PCRE library  internals  is
719                  recommended before tampering with these limits.
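
                  When a limit is lowered anyway, combining it with  option
                  report_errors  lets  the  caller tell "no match" apart from
                  "gave up". A minimal sketch, where LimitedRun and  the  tag
                  gave_up  are  hypothetical names used only for illustration
                  and the subject and pattern are those of the example above:

                ```erlang
                %% Hypothetical wrapper, not part of re.
                LimitedRun = fun(Subject, RE, Limits) ->
                    case re:run(Subject, RE, [report_errors | Limits]) of
                        {error, match_limit} ->
                            {gave_up, match_limit};
                        {error, match_limit_recursion} ->
                            {gave_up, match_limit_recursion};
                        Other ->
                            %% match, nomatch, or a compilation error
                            Other
                    end
                end.

                LimitedRun("aaaaaaaaaaaaaz", "(a+)*z", []).
                %% returns {match,[{0,14},{0,13}]}
                LimitedRun("aaaaaaaaaaaaaz", "(a+)*z",
                           [{match_limit_recursion, 5}]).
                %% returns {gave_up,match_limit_recursion}
                ```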
720
721                {offset, integer() >= 0}:
722                  Start  matching  at  the  offset (position) specified in the
723                  subject string.  The  offset  is  zero-based,  so  that  the
724                  default is {offset,0} (all of the subject string).
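
                  For example (subject and pattern chosen only for illustra‐
                  tion),  the  returned  index  is  still  relative to the
                  beginning of the subject string, not to the offset:

                ```erlang
                re:run("ABCabcdABC", "[A-C]+").
                %% returns {match,[{0,3}]}
                re:run("ABCabcdABC", "[A-C]+", [{offset, 3}]).
                %% returns {match,[{7,3}]}
                ```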
725
726                {newline, NLSpec}:
727                  Overrides the default definition of a newline in the subject
728                  string, which is LF (ASCII 10) in Erlang.
729
730                  cr:
731                    Newline is indicated by a single character CR (ASCII 13).
732
733                  lf:
734                    Newline is indicated by a single character LF (ASCII  10),
735                    the default.
736
737                  crlf:
738                    Newline  is  indicated by the two-character CRLF (ASCII 13
739                    followed by ASCII 10) sequence.
740
741                  anycrlf:
                    Any of the three preceding sequences is recognized.
743
744                  any:
745                    Any of  the  newline  sequences  above,  and  the  Unicode
746                    sequences   VT   (vertical  tab,  U+000B),  FF  (formfeed,
747                    U+000C), NEL (next  line,  U+0085),  LS  (line  separator,
748                    U+2028), and PS (paragraph separator, U+2029).
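
                  A short sketch of the effect, using a subject  that  con‐
                  tains  a  lone  CR  (chosen only for illustration). With
                  multiline set, the circumflex matches after a newline,  so
                  the result depends on what counts as a newline:

                ```erlang
                re:run("a\rb", "^b", [multiline]).
                %% returns nomatch, as CR is not a newline by default
                re:run("a\rb", "^b", [multiline, {newline, cr}]).
                %% returns {match,[{2,1}]}
                re:run("a\rb", "^b", [multiline, {newline, anycrlf}]).
                %% returns {match,[{2,1}]}
                ```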
749
750                bsr_anycrlf:
                  Specifies that \R is to match only the CR, LF, or CRLF
                  sequences, not the Unicode-specific newline characters.
                  (Overrides the compilation option.)
754
755                bsr_unicode:
                  Specifies that \R is to match all the Unicode newline
                  characters (including CRLF, and so on). This is the default.
                  (Overrides the compilation option.)
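
                  As a sketch of the difference between the two \R settings,
                  the  following  calls  use  a  vertical tab ("\v"), which
                  counts as a newline sequence for \R only in  the  default
                  (bsr_unicode) mode. The subjects are chosen only for illus‐
                  tration:

                ```erlang
                re:run("a\vb", "a\\Rb").
                %% returns {match,[{0,3}]}
                re:run("a\vb", "a\\Rb", [bsr_anycrlf]).
                %% returns nomatch
                re:run("a\r\nb", "a\\Rb", [bsr_anycrlf]).
                %% returns {match,[{0,4}]}
                ```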
759
760                {capture, ValueSpec}/{capture, ValueSpec, Type}:
761                  Specifies which captured substrings are returned and in what
762                  format. By default, run/3 captures all of the matching  part
                  of the subject string and all capturing subpatterns (all of the
764                  pattern is automatically captured). The default return  type
765                  is (zero-based) indexes of the captured parts of the string,
766                  specified as {Offset,Length} pairs (the index Type  of  cap‐
767                  turing).
768
769                  As  an  example  of the default behavior, the following call
770                  returns, as first and only  captured  string,  the  matching
771                  part  of the subject ("abcd" in the middle) as an index pair
772                  {3,4}, where character positions are zero-based, just as  in
773                  offsets:
774
775                re:run("ABCabcdABC","abcd",[]).
776
777                  The return value of this call is:
778
779                {match,[{3,4}]}
780
781                  Another (and quite common) case is where the regular expres‐
782                  sion matches all of the subject:
783
784                re:run("ABCabcdABC",".*abcd.*",[]).
785
786                  Here the return value correspondingly points out all of  the
787                  string, beginning at index 0, and it is 10 characters long:
788
789                {match,[{0,10}]}
790
791                  If  the  regular  expression contains capturing subpatterns,
792                  like in:
793
794                re:run("ABCabcdABC",".*(abcd).*",[]).
795
796                  all of the matched subject is captured, as well as the  cap‐
797                  tured substrings:
798
799                {match,[{0,10},{3,4}]}
800
801                  The  complete matching pattern always gives the first return
802                  value in the list and the remaining subpatterns are added in
803                  the order they occurred in the regular expression.
804
805                  The capture tuple is built up as follows:
806
807                  ValueSpec:
808                    Specifies which captured (sub)patterns are to be returned.
809                    ValueSpec can either be an atom  describing  a  predefined
810                    set  of return values, or a list containing the indexes or
811                    the names of specific subpatterns to return.
812
813                    The following are the predefined sets of subpatterns:
814
815                    all:
816                      All captured subpatterns including the complete matching
817                      string. This is the default.
818
819                    all_names:
820                      All named subpatterns in the regular expression, as if a
821                      list() of all the names in alphabetical order was speci‐
822                      fied.  The  list of all names can also be retrieved with
823                      inspect/2.
824
825                    first:
826                      Only the first captured subpattern, which is always  the
827                      complete  matching  part  of the subject. All explicitly
828                      captured subpatterns are discarded.
829
830                    all_but_first:
831                      All but the first  matching  subpattern,  that  is,  all
832                      explicitly  captured  subpatterns,  but not the complete
833                      matching part of the subject string. This is  useful  if
834                      the  regular  expression as a whole matches a large part
835                      of the subject, but the part you are interested in is in
836                      an explicitly captured subpattern. If the return type is
837                      list or binary, not returning subpatterns  you  are  not
838                      interested in is a good way to optimize.
839
840                    none:
841                      Returns  no  matching subpatterns, gives the single atom
842                      match as the return value of the function when  matching
843                      successfully  instead  of  the  {match,  list()} return.
844                      Specifying an empty list gives the same behavior.
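
                    The predefined sets can be compared side by side  with  a
                    pattern  that has one named subpattern. The variable name
                    Pattern is used only for this sketch, and  the  list  Type
                    (described below) is chosen to make the results readable:

                  ```erlang
                  Pattern = ".*(?<FOO>abcd).*".
                  re:run("ABCabcdABC", Pattern, [{capture, all, list}]).
                  %% returns {match,["ABCabcdABC","abcd"]}
                  re:run("ABCabcdABC", Pattern, [{capture, first, list}]).
                  %% returns {match,["ABCabcdABC"]}
                  re:run("ABCabcdABC", Pattern, [{capture, all_but_first, list}]).
                  %% returns {match,["abcd"]}
                  re:run("ABCabcdABC", Pattern, [{capture, all_names, list}]).
                  %% returns {match,["abcd"]}
                  re:run("ABCabcdABC", Pattern, [{capture, none}]).
                  %% returns match
                  ```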
845
846                    The value list is a list of indexes for the subpatterns to
847                    return,  where index 0 is for all of the pattern, and 1 is
848                    for the first explicit capturing subpattern in the regular
849                    expression,  and  so on. When using named captured subpat‐
850                    terns (see below) in the regular expression, one  can  use
851                    atom()s  or  string()s  to  specify  the subpatterns to be
852                    returned. For example, consider the regular expression:
853
854                  ".*(abcd).*"
855
856                    matched against string "ABCabcdABC",  capturing  only  the
857                    "abcd" part (the first explicit subpattern):
858
859                  re:run("ABCabcdABC",".*(abcd).*",[{capture,[1]}]).
860
861                    The  call gives the following result, as the first explic‐
862                    itly captured subpattern is "(abcd)", matching  "abcd"  in
863                    the subject, at (zero-based) position 3, of length 4:
864
865                  {match,[{3,4}]}
866
867                    Consider the same regular expression, but with the subpat‐
868                    tern explicitly named 'FOO':
869
870                  ".*(?<FOO>abcd).*"
871
872                    With this expression, we could still give the index of the
873                    subpattern with the following call:
874
875                  re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,[1]}]).
876
877                    giving  the  same result as before. But, as the subpattern
878                    is named, we can also specify its name in the value list:
879
880                  re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,['FOO']}]).
881
882                    This would give the same result as the  earlier  examples,
883                    namely:
884
885                  {match,[{3,4}]}
886
887                    The  values  list can specify indexes or names not present
888                    in the regular expression, in which case the return values
889                    vary  depending  on  the  type.  If the type is index, the
890                    tuple {-1,0} is returned for values with no  corresponding
891                    subpattern  in  the  regular expression, but for the other
892                    types (binary and list), the values are the  empty  binary
893                    or list, respectively.
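
                    For instance, requesting index 2 from a pattern that  has
                    only  one  explicit subpattern (a sketch based on the rule
                    above) is expected to give:

                  ```erlang
                  re:run("ABCabcdABC", ".*(abcd).*", [{capture, [1, 2]}]).
                  %% returns {match,[{3,4},{-1,0}]}
                  re:run("ABCabcdABC", ".*(abcd).*", [{capture, [1, 2], binary}]).
                  %% returns {match,[<<"abcd">>,<<>>]}
                  ```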
894
895                  Type:
896                    Optionally  specifies  how  captured  substrings are to be
897                    returned. If omitted, the default of index is used.
898
899                    Type can be one of the following:
900
901                    index:
902                      Returns captured substrings as  pairs  of  byte  indexes
903                      into  the  subject  string  and  length  of the matching
904                      string in the subject (as  if  the  subject  string  was
905                      flattened   with   erlang:iolist_to_binary/1   or   uni‐
906                      code:characters_to_binary/2  before  matching).   Notice
907                      that  option unicode results in byte-oriented indexes in
908                      a (possibly virtual) UTF-8 encoded binary. A byte  index
909                      tuple  {0,2}  can therefore represent one or two charac‐
910                      ters when unicode is in effect. This can  seem  counter-
911                      intuitive,  but  has  been deemed the most effective and
912                      useful way to do it. To return lists instead can  result
913                      in  simpler code if that is desired. This return type is
914                      the default.
915
916                    list:
917                      Returns  matching  substrings  as  lists  of  characters
                      (Erlang string()s). If option unicode is used in
                      combination with the \C sequence in the regular expression, a
920                      captured subpattern can contain bytes that are not valid
921                      UTF-8 (\C matches bytes regardless of  character  encod‐
922                      ing).  In that case the list capturing can result in the
923                      same types of tuples  that  unicode:characters_to_list/2
924                      can  return,  namely three-tuples with tag incomplete or
925                      error, the successfully  converted  characters  and  the
926                      invalid  UTF-8  tail  of the conversion as a binary. The
927                      best strategy is to avoid using  the  \C  sequence  when
928                      capturing lists.
929
930                    binary:
931                      Returns  matching substrings as binaries. If option uni‐
932                      code is used, these binaries are in  UTF-8.  If  the  \C
933                      sequence is used together with unicode, the binaries can
934                      be invalid UTF-8.
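
                    The three return types can be compared on the same match.
                    The  last  two calls sketch the byte-oriented indexing: the
                    character U+00E5 (written "\x{E5}") occupies two  bytes  in
                    UTF-8, so with option unicode the following "b" is found at
                    byte index 2:

                  ```erlang
                  re:run("ABCabcdABC", "abcd", [{capture, first, index}]).
                  %% returns {match,[{3,4}]}
                  re:run("ABCabcdABC", "abcd", [{capture, first, list}]).
                  %% returns {match,["abcd"]}
                  re:run("ABCabcdABC", "abcd", [{capture, first, binary}]).
                  %% returns {match,[<<"abcd">>]}
                  re:run("\x{E5}b", "b", [unicode]).
                  %% returns {match,[{2,1}]}
                  re:run("\x{E5}b", "b").
                  %% returns {match,[{1,1}]}
                  ```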
935
936                  In general, subpatterns that were not assigned  a  value  in
937                  the  match  are  returned  as  the tuple {-1,0} when type is
938                  index. Unassigned subpatterns  are  returned  as  the  empty
939                  binary  or  list, respectively, for other return types. Con‐
940                  sider the following regular expression:
941
942                ".*((?<FOO>abdd)|a(..d)).*"
943
944                  There are three explicitly capturing subpatterns, where  the
945                  opening  parenthesis  position  determines  the order in the
946                  result, hence ((?<FOO>abdd)|a(..d)) is subpattern  index  1,
947                  (?<FOO>abdd)  is subpattern index 2, and (..d) is subpattern
948                  index 3. When matched against the following string:
949
950                "ABCabcdABC"
951
952                  the subpattern at index 2 does not match, as "abdd"  is  not
953                  present  in  the  string,  but  the complete pattern matches
954                  (because of the alternative a(..d)). The subpattern at index
955                  2 is therefore unassigned and the default return value is:
956
957                {match,[{0,10},{3,4},{-1,0},{4,3}]}
958
959                  Setting the capture Type to binary gives:
960
961                {match,[<<"ABCabcdABC">>,<<"abcd">>,<<>>,<<"bcd">>]}
962
963                  Here  the empty binary (<<>>) represents the unassigned sub‐
964                  pattern. In the binary  case,  some  information  about  the
965                  matching  is  therefore  lost,  as <<>> can also be an empty
966                  string captured.
967
968                  If differentiation between empty  matches  and  non-existing
969                  subpatterns is necessary, use the type index and do the con‐
970                  version to the final type in Erlang code.
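
                  One possible way to do that conversion is sketched  below.
                  The  fun  name ToBinary and the atom undefined used for an
                  unassigned subpattern are choices made for this  sketch,
                  not  something  re  prescribes.  An empty capture has the
                  form {Pos,0} and is therefore still returned as <<>>:

                ```erlang
                Subject = <<"ABCabcdABC">>.
                {match, Indexes} =
                    re:run(Subject, ".*((?<FOO>abdd)|a(..d)).*",
                           [{capture, all, index}]).
                %% Map each index pair to a binary; keep unassigned ones apart.
                ToBinary = fun({-1, 0}) -> undefined;
                              ({Pos, Len}) -> binary:part(Subject, Pos, Len)
                           end.
                lists:map(ToBinary, Indexes).
                %% returns [<<"ABCabcdABC">>,<<"abcd">>,undefined,<<"bcd">>]
                ```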
971
                  When option global is specified, the capture  specification
973                  affects each match separately, so that:
974
975                re:run("cacb","c(a|b)",[global,{capture,[1],list}]).
976
977                  gives
978
979                {match,[["a"],["b"]]}
980
              For a description of options only affecting the compilation
              step, see compile/2.
983
984       split(Subject, RE) -> SplitList
985
986              Types:
987
988                 Subject = iodata() | unicode:charlist()
989                 RE = mp() | iodata()
990                 SplitList = [iodata() | unicode:charlist()]
991
992              Same as split(Subject, RE, []).
993
994       split(Subject, RE, Options) -> SplitList
995
996              Types:
997
998                 Subject = iodata() | unicode:charlist()
999                 RE = mp() | iodata() | unicode:charlist()
1000                 Options = [Option]
1001                 Option =
                     anchored | notbol | noteol | notempty |
                     notempty_atstart |
1004                     {offset, integer() >= 0} |
1005                     {newline, nl_spec()} |
1006                     {match_limit, integer() >= 0} |
1007                     {match_limit_recursion, integer() >= 0} |
1008                     bsr_anycrlf | bsr_unicode |
1009                     {return, ReturnType} |
1010                     {parts, NumParts} |
1011                     group | trim | CompileOpt
1012                 NumParts = integer() >= 0 | infinity
1013                 ReturnType = iodata | list | binary
1014                 CompileOpt = compile_option()
1015                   See compile/2.
1016                 SplitList = [RetData] | [GroupedRetData]
1017                 GroupedRetData = [RetData]
1018                 RetData = iodata() | unicode:charlist() | binary() | list()
1019
1020              Splits  the  input into parts by finding tokens according to the
1021              regular expression supplied. The splitting is basically done  by
1022              running  a global regular expression match and dividing the ini‐
1023              tial string wherever a match occurs. The matching  part  of  the
1024              string is removed from the output.
1025
1026              As  in run/3, an mp() compiled with option unicode requires Sub‐
1027              ject to be a Unicode charlist(). If compilation is done  implic‐
1028              itly  and  the  unicode  compilation option is specified to this
1029              function, both the regular expression  and  Subject  are  to  be
1030              specified as valid Unicode charlist()s.
1031
1032              The  result  is given as a list of "strings", the preferred data
1033              type specified in option return (default iodata).
1034
1035              If subexpressions are specified in the regular  expression,  the
1036              matching  subexpressions  are  returned in the resulting list as
1037              well. For example:
1038
1039              re:split("Erlang","[ln]",[{return,list}]).
1040
1041              gives
1042
1043              ["Er","a","g"]
1044
1045              while
1046
1047              re:split("Erlang","([ln])",[{return,list}]).
1048
1049              gives
1050
1051              ["Er","l","a","n","g"]
1052
1053              The text matching the subexpression (marked by  the  parentheses
1054              in  the regular expression) is inserted in the result list where
1055              it was found. This means that  concatenating  the  result  of  a
1056              split  where the whole regular expression is a single subexpres‐
1057              sion (as in the last example) always  results  in  the  original
1058              string.
1059
1060              As  there  is no matching subexpression for the last part in the
1061              example (the "g"), nothing is inserted after that. To  make  the
1062              group  of strings and the parts matching the subexpressions more
1063              obvious, one can use option group,  which  groups  together  the
1064              part  of  the  subject string with the parts matching the subex‐
1065              pressions when the string was split:
1066
1067              re:split("Erlang","([ln])",[{return,list},group]).
1068
1069              gives
1070
1071              [["Er","l"],["a","n"],["g"]]
1072
1073              Here the regular expression first matched the "l", causing  "Er"
1074              to  be the first part in the result. When the regular expression
1075              matched, the (only) subexpression was bound to the "l",  so  the
1076              "l"  is inserted in the group together with "Er". The next match
1077              is of the "n", making "a" the next part to be returned.  As  the
1078              subexpression is bound to substring "n" in this case, the "n" is
1079              inserted into this group. The last group consists of the remain‐
1080              ing string, as no more matches are found.
1081
1082              By  default,  all  parts  of  the  string,  including  the empty
1083              strings, are returned from the function, for example:
1084
1085              re:split("Erlang","[lg]",[{return,list}]).
1086
1087              gives
1088
1089              ["Er","an",[]]
1090
              as the matching of the "g" at the end of the  string  leaves  an
1092              empty  rest,  which is also returned. This behavior differs from
1093              the default behavior of the split function in Perl, where  empty
1094              strings at the end are by default removed. To get the "trimming"
1095              default behavior of Perl, specify trim as an option:
1096
1097              re:split("Erlang","[lg]",[{return,list},trim]).
1098
1099              gives
1100
1101              ["Er","an"]
1102
              The trim option means "give me as many parts as possible except
              the empty ones at the end", which can sometimes be useful. You
              can also specify how many parts you want, by specifying
              {parts,N}:
1106
1107              re:split("Erlang","[lg]",[{return,list},{parts,2}]).
1108
1109              gives
1110
1111              ["Er","ang"]
1112
1113              Notice that the last part is "ang", not "an", as  splitting  was
1114              specified  into  two  parts, and the splitting stops when enough
1115              parts are given, which is why the result differs  from  that  of
1116              trim.
1117
              More than three parts are not possible with this input data, so
1119
1120              re:split("Erlang","[lg]",[{return,list},{parts,4}]).
1121
1122              gives  the  same result as the default, which is to be viewed as
1123              "an infinite number of parts".
1124
1125              Specifying 0 as the number of parts gives  the  same  effect  as
1126              option  trim.  If  subexpressions are captured, empty subexpres‐
1127              sions matched at the end are also stripped from  the  result  if
1128              trim or {parts,0} is specified.
1129
1130              The  trim  behavior  corresponds  exactly  to  the Perl default.
1131              {parts,N}, where N is a positive integer, corresponds exactly to
1132              the Perl behavior with a positive numerical third parameter. The
1133              default behavior of split/3 corresponds  to  the  Perl  behavior
1134              when  a negative integer is specified as the third parameter for
1135              the Perl routine.
1136
1137              Summary of options not previously described for function run/3:
1138
1139                {return,ReturnType}:
1140                  Specifies how the parts of the original string are presented
1141                  in the result list. Valid types:
1142
1143                  iodata:
1144                    The  variant  of  iodata() that gives the least copying of
1145                    data with the current implementation (often a binary,  but
1146                    do not depend on it).
1147
1148                  binary:
1149                    All parts returned as binaries.
1150
1151                  list:
1152                    All parts returned as lists of characters ("strings").
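
                  A short sketch of the three return types. As the  elements
                  returned  for  iodata  are  not  guaranteed to have a given
                  representation, the last call normalizes them  with  erlang:
                  iolist_to_binary/1 before they are inspected:

                ```erlang
                re:split("Erlang", "[ln]", [{return, list}]).
                %% returns ["Er","a","g"]
                re:split("Erlang", "[ln]", [{return, binary}]).
                %% returns [<<"Er">>,<<"a">>,<<"g">>]
                [iolist_to_binary(Part) || Part <- re:split("Erlang", "[ln]")].
                %% returns [<<"Er">>,<<"a">>,<<"g">>]
                ```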
1153
1154                group:
1155                  Groups together the part of the string with the parts of the
1156                  string matching the subexpressions of  the  regular  expres‐
1157                  sion.
1158
1159                  The  return value from the function is in this case a list()
1160                  of list()s. Each sublist begins with the string  picked  out
1161                  of  the  subject string, followed by the parts matching each
1162                  of the subexpressions in order of occurrence in the  regular
1163                  expression.
1164
1165                {parts,N}:
1166                  Specifies  the  number  of parts the subject string is to be
1167                  split into.
1168
1169                  The number of parts is to be a positive integer for  a  spe‐
1170                  cific  maximum number of parts, and infinity for the maximum
1171                  number of parts possible (the default). Specifying {parts,0}
1172                  gives  as many parts as possible disregarding empty parts at
1173                  the end, the same as specifying trim.
1174
1175                trim:
1176                  Specifies that empty parts at the end of the result list are
1177                  to  be  disregarded.  The same as specifying {parts,0}. This
1178                  corresponds to the default behavior of  the  split  built-in
1179                  function in Perl.
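
              The relation between trim, {parts,0}, and  {parts,infinity}  can
              be  checked  with  the  subject  and pattern used in the earlier
              examples:

              ```erlang
              re:split("Erlang", "[lg]", [{return, list}, {parts, infinity}]).
              %% returns ["Er","an",[]]
              re:split("Erlang", "[lg]", [{return, list}, {parts, 0}]).
              %% returns ["Er","an"]
              re:split("Erlang", "[lg]", [{return, list}, trim]).
              %% returns ["Er","an"]
              ```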
1180

PERL-LIKE REGULAR EXPRESSION SYNTAX

1182       The  following  sections  contain  reference  material  for the regular
1183       expressions used by this module. The information is based on  the  PCRE
1184       documentation,  with  changes  where this module behaves differently to
1185       the PCRE library.
1186

PCRE REGULAR EXPRESSION DETAILS

1188       The syntax and semantics of the regular expressions supported  by  PCRE
1189       are  described  in  detail  in  the  following sections. Perl's regular
1190       expressions are described in its own documentation, and regular expres‐
1191       sions in general are covered in many books, some with copious examples.
1192       Jeffrey  Friedl's  "Mastering  Regular   Expressions",   published   by
1193       O'Reilly,  covers regular expressions in great detail. This description
1194       of the PCRE regular expressions is intended as reference material.
1195
1196       The reference material is divided into the following sections:
1197
1198         * Special Start-of-Pattern Items
1199
1200         * Characters and Metacharacters
1201
1202         * Backslash
1203
1204         * Circumflex and Dollar
1205
1206         * Full Stop (Period, Dot) and \N
1207
1208         * Matching a Single Data Unit
1209
1210         * Square Brackets and Character Classes
1211
1212         * Posix Character Classes
1213
1214         * Vertical Bar
1215
1216         * Internal Option Setting
1217
1218         * Subpatterns
1219
1220         * Duplicate Subpattern Numbers
1221
1222         * Named Subpatterns
1223
1224         * Repetition
1225
1226         * Atomic Grouping and Possessive Quantifiers
1227
1228         * Back References
1229
1230         * Assertions
1231
1232         * Conditional Subpatterns
1233
1234         * Comments
1235
1236         * Recursive Patterns
1237
1238         * Subpatterns as Subroutines
1239
1240         * Oniguruma Subroutine Syntax
1241
1242         * Backtracking Control
1243

SPECIAL START-OF-PATTERN ITEMS

1245       Some options that can be passed to compile/2 can also be set by special
1246       items at the start of a pattern. These are not Perl-compatible, but are
1247       provided to make these options accessible to pattern  writers  who  are
1248       not  able  to change the program that processes the pattern. Any number
1249       of these items can appear, but they must all be together right  at  the
1250       start of the pattern string, and the letters must be in upper case.
1251
1252       UTF Support
1253
1254       Unicode  support  is  basically UTF-8 based. To use Unicode characters,
1255       you either call compile/2 or run/3 with option unicode, or the  pattern
1256       must start with one of these special sequences:
1257
1258       (*UTF8)
1259       (*UTF)
1260
       Both sequences have the same effect: the input string is interpreted as
1262       UTF-8. Notice that with these instructions, the automatic conversion of
1263       lists  to  UTF-8 is not performed by the re functions. Therefore, using
1264       these sequences is not recommended. Add  option  unicode  when  running
1265       compile/2 instead.
1266
1267       Some applications that allow their users to supply patterns may wish to
1268       restrict them to non-UTF data for security reasons. If option never_utf
1269       is  set  at compile time, (*UTF), and so on, are not allowed, and their
1270       appearance causes an error.
1271
1272       Unicode Property Support
1273
1274       The following is another special sequence that can appear at the  start
1275       of a pattern:
1276
1277       (*UCP)
1278
1279       This  has  the  same  effect as setting option ucp: it causes sequences
1280       such as \d and \w to use  Unicode  properties  to  determine  character
1281       types,  instead of recognizing only characters with codes < 256 through
1282       a lookup table.
1283
1284       Disabling Startup Optimizations
1285
1286       If a pattern starts with (*NO_START_OPT), it has  the  same  effect  as
1287       setting option no_start_optimize at compile time.
1288
1289       Newline Conventions
1290
1291       PCRE supports five conventions for indicating line breaks in strings: a
1292       single CR (carriage return) character, a single LF (line feed)  charac‐
1293       ter,  the  two-character sequence CRLF, any of the three preceding, and
1294       any Unicode newline sequence.
1295
1296       A newline convention can also be specified by starting a pattern string
1297       with one of the following five sequences:
1298
1299         (*CR):
1300           Carriage return
1301
1302         (*LF):
1303           Line feed
1304
1305         (*CRLF):
1306           Carriage return followed by line feed
1307
1308         (*ANYCRLF):
1309           Any of the three above
1310
1311         (*ANY):
1312           All Unicode newline sequences
1313
1314       These  override the default and the options specified to compile/2. For
1315       example, the following pattern changes the convention to CR:
1316
1317       (*CR)a.b
1318
1319       This pattern matches a\nb, as LF is no longer a newline. If  more  than
1320       one of these sequences is present, the last one is used.
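
       The effect can be checked from Erlang with a minimal sketch such
       as the following. The module name re_newline_demo and its function
       names are illustrative only, and the expected results in the
       comments assume the default LF newline convention.

```erlang
-module(re_newline_demo).
-export([default_lf/0, forced_cr/0]).

%% Default convention (LF): "." does not match "\n".
%% Expected result: nomatch
default_lf() ->
    re:run("a\nb", "a.b").

%% (*CR) makes CR the only newline, so "." can match LF.
%% Expected result: {match,[{0,3}]}
forced_cr() ->
    re:run("a\nb", "(*CR)a.b").
```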
1321
1322       The  newline  convention affects where the circumflex and dollar asser‐
1323       tions are true. It also affects the interpretation of the dot metachar‐
1324       acter  when dotall is not set, and the behavior of \N. However, it does
1325       not affect what the \R escape sequence matches. By default, this is any
1326       Unicode  newline sequence, for Perl compatibility. However, this can be
1327       changed; see the description of \R  in  section  Newline  Sequences.  A
1328       change  of  the \R setting can be combined with a change of the newline
1329       convention.
1330
1331       Setting Match and Recursion Limits
1332
1333       The caller of run/3 can set a limit on the number of times the internal
1334       match() function is called and on the maximum depth of recursive calls.
1335       These facilities are provided to catch runaway matches  that  are  pro‐
1336       voked by patterns with huge matching trees (a typical example is a pat‐
1337       tern with nested unlimited repeats) and to avoid running out of  system
1338       stack  by  too  much  recursion.  When  one of these limits is reached,
1339       pcre_exec() gives an error return. The limits can also be set by  items
1340       at the start of the pattern of the following forms:
1341
1342       (*LIMIT_MATCH=d)
1343       (*LIMIT_RECURSION=d)
1344
1345       Here  d is any number of decimal digits. However, the value of the set‐
1346       ting must be less than the value set by the caller of run/3 for  it  to
1347       have any effect. That is, the pattern writer can lower the limit set by
1348       the programmer, but not raise it. If there is more than one setting  of
1349       one of these limits, the lower value is used.
1350
1351       The  default  value for both the limits is 10,000,000 in the Erlang VM.
1352       Notice that the recursion limit does not affect the stack depth of  the
1353       VM,  as  PCRE for Erlang is compiled in such a way that the match func‐
1354       tion never does recursion on the C stack.
1355
1356       Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value  of
1357       the limits set by the caller, not increase them.
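
       As a sketch of how a pattern writer can lower the match limit, the
       following hypothetical module runs a pattern with nested unlimited
       repeats against a subject that cannot match. The limit value 100
       and the subject are arbitrary choices made for illustration.
       Option report_errors of run/3 is passed so that reaching the limit
       is reported as an error tuple instead of nomatch.

```erlang
-module(re_limit_demo).
-export([limited/0]).

%% 30 "a" characters followed by "!" force heavy backtracking in
%% ^(a+)+$ because the trailing "!" prevents "$" from ever matching.
%% The in-pattern limit of 100 is far below the default, so it applies.
%% Expected result: {error,match_limit}
limited() ->
    Subject = lists:duplicate(30, $a) ++ "!",
    re:run(Subject, "(*LIMIT_MATCH=100)^(a+)+$", [report_errors]).
```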
1358

CHARACTERS AND METACHARACTERS

1360       A  regular  expression  is  a pattern that is matched against a subject
1361       string from left to right. Most characters stand for  themselves  in  a
1362       pattern  and  match  the  corresponding characters in the subject. As a
1363       trivial example, the following pattern matches a portion of  a  subject
1364       string that is identical to itself:
1365
1366       The quick brown fox
1367
1368       When  caseless  matching  is  specified  (option caseless), letters are
1369       matched independently of case.
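
       A minimal sketch of the difference, using an illustrative module
       name and subject string. Option {capture, none} of run/3 is used
       so that only match or nomatch is returned.

```erlang
-module(re_caseless_demo).
-export([exact/0, caseless/0]).

%% Case differs in the subject, so the literal pattern fails.
%% Expected result: nomatch
exact() ->
    re:run("the QUICK brown fox", "The quick brown fox",
           [{capture, none}]).

%% With caseless, letters match independently of case.
%% Expected result: match
caseless() ->
    re:run("the QUICK brown fox", "The quick brown fox",
           [caseless, {capture, none}]).
```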
1370
1371       The power of regular expressions comes  from  the  ability  to  include
1372       alternatives  and  repetitions in the pattern. These are encoded in the
1373       pattern by the use of metacharacters, which do not stand for themselves
1374       but instead are interpreted in some special way.
1375
1376       Two sets of metacharacters exist: those that are recognized anywhere in
1377       the pattern except within square brackets, and those  that  are  recog‐
1378       nized  within square brackets. Outside square brackets, the metacharac‐
1379       ters are as follows:
1380
1381         \:
1382           General escape character with many uses
1383
1384         ^:
1385           Assert start of string (or line, in multiline mode)
1386
1387         $:
1388           Assert end of string (or line, in multiline mode)
1389
1390         .:
1391           Match any character except newline (by default)
1392
1393         [:
1394           Start character class definition
1395
1396         |:
1397           Start of alternative branch
1398
1399         (:
1400           Start subpattern
1401
1402         ):
1403           End subpattern
1404
1405         ?:
1406           Extends the meaning of (, also 0 or 1 quantifier,  also  quantifier
1407           minimizer
1408
1409         *:
1410           0 or more quantifier
1411
1412         +:
1413           1 or more quantifier, also "possessive quantifier"
1414
1415         {:
1416           Start min/max quantifier
1417
1418       Part of a pattern within square brackets is called a "character class".
1419       The following are the only metacharacters in a character class:
1420
1421         \:
1422           General escape character
1423
1424         ^:
1425           Negate the class, but only if the first character
1426
1427         -:
1428           Indicates character range
1429
1430         [:
1431           Posix character class (only if followed by Posix syntax)
1432
1433         ]:
1434           Terminates the character class
1435
1436       The following sections describe the use of each metacharacter.
1437

BACKSLASH

1439       The backslash character has many uses. First, if it is  followed  by  a
1440       character  that  is not a number or a letter, it takes away any special
1441       meaning that a character can have. This use of backslash as  an  escape
1442       character applies both inside and outside character classes.
1443
1444       For  example,  if  you want to match a * character, you write \* in the
1445       pattern. This escaping action applies if the following character  would
1446       otherwise  be  interpreted  as a metacharacter, so it is always safe to
1447       precede a non-alphanumeric with backslash to specify that it stands for
1448       itself. In particular, if you want to match a backslash, write \\.
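
       In Erlang source code each backslash of the pattern must itself be
       doubled inside a string literal. The following sketch (module and
       function names are illustrative) shows both escapes described
       above.

```erlang
-module(re_escape_demo).
-export([star/0, backslash/0]).

%% The regular expression \* matches a literal "*".
%% Expected result: {match,[{1,1}]}
star() ->
    re:run("2*3", "\\*").

%% The regular expression \\ matches a single backslash.
%% The subject "a\\b" is the three characters a, \, b.
%% Expected result: {match,[{1,1}]}
backslash() ->
    re:run("a\\b", "\\\\").
```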
1449
1450       In  unicode mode, only ASCII numbers and letters have any special mean‐
1451       ing after a backslash. All other characters (in particular, those whose
1452       code points are > 127) are treated as literals.
1453
1454       If  a  pattern is compiled with option extended, whitespace in the pat‐
1455       tern (other than in a character class) and characters between a #  out‐
1456       side  a  character  class and the next newline are ignored. An escaping
1457       backslash can be used to include a whitespace or # character as part of
1458       the pattern.
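
       The following sketch illustrates option extended. The module name,
       patterns, and subjects are chosen for illustration only.

```erlang
-module(re_extended_demo).
-export([ignored/0, unescaped_space/0, escaped_space/0]).

%% Whitespace and the "#" comment are ignored: the pattern is "ab".
%% Expected result: match
ignored() ->
    re:run("ab", "a b  # trailing comment", [extended, {capture, none}]).

%% The space in the pattern is ignored, so "a b" is not matched.
%% Expected result: nomatch
unescaped_space() ->
    re:run("a b", "a b", [extended, {capture, none}]).

%% An escaping backslash keeps the space as part of the pattern.
%% Expected result: match
escaped_space() ->
    re:run("a b", "a\\ b", [extended, {capture, none}]).
```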
1459
1460       To  remove  the special meaning from a sequence of characters, put them
1461       between \Q and \E. This is different from Perl in that $ and @ are han‐
1462       dled  as  literals  in  \Q...\E  sequences in PCRE, while $ and @ cause
1463       variable interpolation in Perl. Notice the following examples:
1464
1465       Pattern            PCRE matches   Perl matches
1466
1467       \Qabc$xyz\E        abc$xyz        abc followed by the contents of $xyz
1468       \Qabc\$xyz\E       abc\$xyz       abc\$xyz
1469       \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1470
1471       The \Q...\E sequence is recognized both inside  and  outside  character
1472       classes. An isolated \E that is not preceded by \Q is ignored. If \Q is
1473       not followed by \E later in the  pattern,  the  literal  interpretation
1474       continues  to  the  end  of  the pattern (that is, \E is assumed at the
1475       end). If the isolated \Q is inside a character class,  this  causes  an
1476       error, as the character class is not terminated.
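
       A short sketch of \Q...\E from Erlang, with illustrative module
       and function names. Without the quoting, $ is an end-of-subject
       assertion and the pattern cannot match.

```erlang
-module(re_quote_demo).
-export([quoted/0, unquoted/0]).

%% Inside \Q...\E the "$" is an ordinary character.
%% Expected result: {match,["abc$xyz"]}
quoted() ->
    re:run("abc$xyz", "\\Qabc$xyz\\E", [{capture, first, list}]).

%% Unquoted, "$" asserts the end of the subject before "xyz".
%% Expected result: nomatch
unquoted() ->
    re:run("abc$xyz", "abc$xyz", [{capture, none}]).
```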
1477
1478       Non-Printing Characters
1479
1480       A second use of backslash provides a way of encoding non-printing char‐
1481       acters in patterns in a visible manner. There is no restriction on  the
1482       appearance  of non-printing characters, apart from the binary zero that
1483       terminates a pattern. When a pattern is prepared by text editing, it is
1484       often  easier  to  use  one  of the following escape sequences than the
1485       binary character it represents:
1486
1487         \a:
1488           Alarm, that is, the BEL character (hex 07)
1489
1490         \cx:
1491           "Control-x", where x is any ASCII character
1492
1493         \e:
1494           Escape (hex 1B)
1495
1496         \f:
1497           Form feed (hex 0C)
1498
1499         \n:
1500           Line feed (hex 0A)
1501
1502         \r:
1503           Carriage return (hex 0D)
1504
1505         \t:
1506           Tab (hex 09)
1507
1508         \0dd:
1509           Character with octal code 0dd
1510
1511         \ddd:
1512           Character with octal code ddd, or back reference
1513
1514         \o{ddd..}:
1515           Character with octal code ddd..
1516
1517         \xhh:
1518           Character with hex code hh
1519
1520         \x{hhh..}:
1521           Character with hex code hhh..
1522
1523   Note:
1524       \0dd is always an octal code, and \8 and \9 are the literal
1525       characters "8" and "9".
1526
1527
1528       The  precise effect of \cx on ASCII characters is as follows: if x is a
1529       lowercase letter, it is converted to upper case.  Then  bit  6  of  the
1530       character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
1531       (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and  \c;  becomes
1532       hex  7B (; is 3B). If the data item (byte or 16-bit value) following \c
1533       has a value > 127, a compile-time error occurs.  This  locks  out  non-
1534       ASCII characters in all modes.
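
       The bit inversion can be observed with a sketch like the
       following, where the module and function names are illustrative.

```erlang
-module(re_ctrl_demo).
-export([ctrl_upper/0, ctrl_lower/0, ctrl_semicolon/0]).

%% \cA is hex 01 (A is hex 41 with bit 6 inverted).
%% Expected result: match
ctrl_upper() ->
    re:run(<<1>>, "\\cA", [{capture, none}]).

%% A lowercase letter is first converted to upper case.
%% Expected result: match
ctrl_lower() ->
    re:run(<<1>>, "\\ca", [{capture, none}]).

%% \c; is hex 7B, that is, the character "{".
%% Expected result: match
ctrl_semicolon() ->
    re:run("{", "\\c;", [{capture, none}]).
```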
1535
1536       The  \c  facility  was designed for use with ASCII characters, but with
1537       the extension to Unicode it is even less useful than it once was.
1538
1539       After \0 up to two further octal digits are read. If  there  are  fewer
1540       than  two  digits,  just  those  that  are  present  are used. Thus the
1541       sequence \0\x\015 specifies two binary zeros followed by a CR character
1542       (code value 13). Make sure you supply two digits after the initial zero
1543       if the pattern character that follows is itself an octal digit.
1544
1545       The escape \o must be followed by a sequence of octal digits,  enclosed
1546       in  braces.  An  error occurs if this is not the case. This escape is a
1547       recent addition to Perl; it provides a way of specifying character code
1548       points  as  octal  numbers  greater than 0777, and it also allows octal
1549       numbers and back references to be unambiguously specified.
1550
1551       For greater clarity and unambiguity, it is best to avoid following \ by
1552       a digit greater than zero. Instead, use \o{} or \x{} to specify charac‐
1553       ter numbers, and \g{} to specify back references. The  following  para‐
1554       graphs describe the old, ambiguous syntax.
1555
1556       The handling of a backslash followed by a digit other than 0 is compli‐
1557       cated, and Perl has changed in recent releases, causing  PCRE  also  to
1558       change. Outside a character class, PCRE reads the digit and any follow‐
1559       ing digits as a decimal number. If the number is < 8, or if there  have
1560       been  at  least  that  many  previous capturing left parentheses in the
1561       expression, the entire  sequence  is  taken  as  a  back  reference.  A
1562       description  of how this works is provided later, following the discus‐
1563       sion of parenthesized subpatterns.
1564
1565       Inside a character class, or if the decimal number following \ is  >  7
1566       and  there  have not been that many capturing subpatterns, PCRE handles
1567       \8 and \9 as the literal characters "8" and "9", and otherwise re-reads
1568       up to three octal digits following the backslash, using them to
1569       generate a data character. Any subsequent digits stand for  themselves.
1570       For example:
1571
1572         \040:
1573           Another way of writing an ASCII space
1574
1575         \40:
1576           The same, provided there are < 40 previous capturing subpatterns
1577
1578         \7:
1579           Always a back reference
1580
1581         \11:
1582           Can be a back reference, or another way of writing a tab
1583
1584         \011:
1585           Always a tab
1586
1587         \0113:
1588           A tab followed by character "3"
1589
1590         \113:
1591           Can  be  a  back reference, otherwise the character with octal code
1592           113
1593
1594         \377:
1595           Can be a back reference, otherwise value 255 (decimal)
1596
1597         \81:
1598           Either a back reference, or the two characters "8" and "1"
1599
1600       Notice that octal values >= 100 that are specified  using  this  syntax
1601       must  not  be introduced by a leading zero, as no more than three octal
1602       digits are ever read.
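
       The following sketch contrasts the old syntax with the
       unambiguous \o{} and \g{} forms. Module and function names are
       illustrative.

```erlang
-module(re_octal_demo).
-export([tab_then_three/0, tab_braced/0, backref_old/0, backref_braced/0]).

%% \0113 is \011 (a tab) followed by the character "3".
%% Expected result: {match,[{0,2}]}
tab_then_three() ->
    re:run("\t3", "\\0113", [{capture, first, index}]).

%% \o{11} is always the character with octal code 11 (a tab).
%% Expected result: match
tab_braced() ->
    re:run("\t", "\\o{11}", [{capture, none}]).

%% \1 is a back reference to the first capturing subpattern.
%% Expected result: match
backref_old() ->
    re:run("abab", "(ab)\\1", [{capture, none}]).

%% \g{1} is the same back reference in unambiguous form.
%% Expected result: match
backref_braced() ->
    re:run("abab", "(ab)\\g{1}", [{capture, none}]).
```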
1603
1604       By default, after \x that is not followed by {, from zero to two  hexa‐
1605       decimal  digits  are  read (letters can be in upper or lower case). Any
1606       number of hexadecimal digits may appear between \x{ and }. If a charac‐
1607       ter  other  than  a  hexadecimal digit appears between \x{ and }, or if
1608       there is no terminating }, an error occurs.
1609
1610       Characters whose value is less than 256 can be defined by either of the
1611       two  syntaxes  for  \x. There is no difference in the way they are han‐
1612       dled. For example, \xdc is exactly the same as \x{dc}.
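
       A minimal check of that equivalence in 8-bit non-UTF mode, using
       an illustrative module name and a one-byte binary subject:

```erlang
-module(re_hex_demo).
-export([short_form/0, braced_form/0]).

%% Two-digit form. Expected result: match
short_form() ->
    re:run(<<16#DC>>, "\\xdc", [{capture, none}]).

%% Braced form of the same character. Expected result: match
braced_form() ->
    re:run(<<16#DC>>, "\\x{dc}", [{capture, none}]).
```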
1613
1614       Constraints on character values
1615
1616       Characters that are specified using octal or  hexadecimal  numbers  are
1617       limited to certain values, as follows:
1618
1619         8-bit non-UTF mode:
1620           < 0x100
1621
1622         8-bit UTF-8 mode:
1623           < 0x10ffff and a valid codepoint
1624
1625       Invalid  Unicode  codepoints  are  the  range 0xd800 to 0xdfff (the so-
1626       called "surrogate" codepoints), and 0xffef.
1627
1628       Escape sequences in character classes
1629
1630       All the sequences that define a single character value can be used both
1631       inside  and  outside character classes. Also, inside a character class,
1632       \b is interpreted as the backspace character (hex 08).
1633
1634       \N is not allowed in a character class. \B, \R, and \X are not  special
1635       inside  a  character  class.  Like other unrecognized escape sequences,
1636       they are treated as the literal characters "B", "R", and "X". Outside a
1637       character class, these sequences have different meanings.
1638
1639       Unsupported Escape Sequences
1640
1641       In  Perl, the sequences \l, \L, \u, and \U are recognized by its string
1642       handler and used to modify the case of following characters. PCRE  does
1643       not support these escape sequences.
1644
1645       Absolute and Relative Back References
1646
1647       The  sequence  \g followed by an unsigned or a negative number, option‐
1648       ally enclosed in braces, is an absolute or relative back  reference.  A
1649       named back reference can be coded as \g{name}. Back references are dis‐
1650       cussed later, following the discussion of parenthesized subpatterns.
1651
1652       Absolute and Relative Subroutine Calls
1653
1654       For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
1655       name or a number enclosed either in angle brackets or single quotes, is
1656       alternative syntax for referencing  a  subpattern  as  a  "subroutine".
1657       Details  are  discussed  later.  Notice  that \g{...} (Perl syntax) and
1658       \g<...> (Oniguruma syntax) are not synonymous. The  former  is  a  back
1659       reference and the latter is a subroutine call.
1660
1661       Generic Character Types
1662
1663       Another use of backslash is for specifying generic character types:
1664
1665         \d:
1666           Any decimal digit
1667
1668         \D:
1669           Any character that is not a decimal digit
1670
1671         \h:
1672           Any horizontal whitespace character
1673
1674         \H:
1675           Any character that is not a horizontal whitespace character
1676
1677         \s:
1678           Any whitespace character
1679
1680         \S:
1681           Any character that is not a whitespace character
1682
1683         \v:
1684           Any vertical whitespace character
1685
1686         \V:
1687           Any character that is not a vertical whitespace character
1688
1689         \w:
1690           Any "word" character
1691
1692         \W:
1693           Any "non-word" character
1694
1695       There is also the single sequence \N, which matches a non-newline char‐
1696       acter. This is the same as the "." metacharacter  when  dotall  is  not
1697       set.  Perl  also uses \N to match characters by name, but PCRE does not
1698       support this.
1699
1700       Each pair of lowercase and uppercase escape  sequences  partitions  the
1701       complete  set of characters into two disjoint sets. Any given character
1702       matches one, and only one, of each pair. The sequences can appear  both
1703       inside  and outside character classes. They each match one character of
1704       the appropriate type. If the current matching point is at  the  end  of
1705       the subject string, all fail, as there is no character to match.
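
       The following sketch uses some of these sequences from Erlang. The
       module name and subject strings are illustrative only.

```erlang
-module(re_types_demo).
-export([digits/0, words/0, non_digit/0]).

%% \d+ matches a run of decimal digits.
%% Expected result: {match,["66"]}
digits() ->
    re:run("order 66 shipped", "\\d+", [{capture, first, list}]).

%% \w+ matches runs of "word" characters, including underscore.
%% Expected result: {match,[["foo_bar"],["baz"]]}
words() ->
    re:run("foo_bar baz", "\\w+", [global, {capture, first, list}]).

%% A digit matches \d, so it cannot match \D.
%% Expected result: nomatch
non_digit() ->
    re:run("7", "\\D", [{capture, none}]).
```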
1706
1707       For compatibility with Perl, \s formerly did not match the VT character
1708       (code 11), which made it different from the POSIX "space" class.
1709       However,  Perl  added  VT  at  release  5.18, and PCRE followed suit at
1710       release 8.34. The default \s characters are now HT  (9),  LF  (10),  VT
1711       (11),  FF  (12),  CR  (13),  and space (32), which are defined as white
1712       space in the "C" locale. This list may vary if locale-specific matching
1713       is  taking place. For example, in some locales the "non-breaking space"
1714       character (\xA0) is recognized as white space, and  in  others  the  VT
1715       character is not.
1716
1717       A  "word"  character is an underscore or any character that is a letter
1718       or a digit. By default, the definition of letters and  digits  is  con‐
1719       trolled  by the PCRE low-valued character tables, in Erlang's case (and
1720       without option unicode), the ISO Latin-1 character set.
1721
1722       By default, in unicode mode, characters with values > 255, that is, all
1723       characters  outside  the ISO Latin-1 character set, never match \d, \s,
1724       or \w, and always match \D, \S, and \W. These  sequences  retain  their
1725       original  meanings  from  before  UTF support was available, mainly for
1726       efficiency reasons. However, if option ucp  is  set,  the  behavior  is
1727       changed  so  that  Unicode  properties  are used to determine character
1728       types, as follows:
1729
1730         \d:
1731           Any character that \p{Nd} matches (decimal digit)
1732
1733         \s:
1734           Any character that matches \p{Z} or \h or \v
1735
1736         \w:
1737           Any character that matches \p{L} or \p{N}, plus underscore
1738
1739       The uppercase escapes match the inverse sets of characters. Notice that
1740       \d matches only decimal digits, while \w matches any Unicode digit, any
1741       Unicode letter, and underscore. Notice also that ucp affects \b and \B,
1742       as  they are defined in terms of \w and \W. Matching these sequences is
1743       noticeably slower when ucp is set.
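
       As an illustration, the following sketch tests \w against the
       Greek letter alpha (U+03B1) with and without ucp. The module name
       and the choice of character are assumptions made for the example.

```erlang
-module(re_ucp_demo).
-export([without_ucp/0, with_ucp/0, with_ucp_prefix/0]).

%% Without ucp, characters > 255 never match \w.
%% Expected result: nomatch
without_ucp() ->
    re:run([16#3B1], "\\w", [unicode, {capture, none}]).

%% With ucp, Unicode properties decide: alpha is a letter.
%% Expected result: match
with_ucp() ->
    re:run([16#3B1], "\\w", [unicode, ucp, {capture, none}]).

%% (*UCP) at the start of the pattern has the same effect.
%% Expected result: match
with_ucp_prefix() ->
    re:run([16#3B1], "(*UCP)\\w", [unicode, {capture, none}]).
```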
1744
1745       The sequences \h, \H, \v, and \V are features that were added  to  Perl
1746       in  release  5.10. In contrast to the other sequences, which match only
1747       ASCII characters by default, these  always  match  certain  high-valued
1748       code points, regardless of whether ucp is set.
1749
1750       The following are the horizontal space characters:
1751
1752         U+0009:
1753           Horizontal tab (HT)
1754
1755         U+0020:
1756           Space
1757
1758         U+00A0:
1759           Non-break space
1760
1761         U+1680:
1762           Ogham space mark
1763
1764         U+180E:
1765           Mongolian vowel separator
1766
1767         U+2000:
1768           En quad
1769
1770         U+2001:
1771           Em quad
1772
1773         U+2002:
1774           En space
1775
1776         U+2003:
1777           Em space
1778
1779         U+2004:
1780           Three-per-em space
1781
1782         U+2005:
1783           Four-per-em space
1784
1785         U+2006:
1786           Six-per-em space
1787
1788         U+2007:
1789           Figure space
1790
1791         U+2008:
1792           Punctuation space
1793
1794         U+2009:
1795           Thin space
1796
1797         U+200A:
1798           Hair space
1799
1800         U+202F:
1801           Narrow no-break space
1802
1803         U+205F:
1804           Medium mathematical space
1805
1806         U+3000:
1807           Ideographic space
1808
1809       The following are the vertical space characters:
1810
1811         U+000A:
1812           Line feed (LF)
1813
1814         U+000B:
1815           Vertical tab (VT)
1816
1817         U+000C:
1818           Form feed (FF)
1819
1820         U+000D:
1821           Carriage return (CR)
1822
1823         U+0085:
1824           Next line (NEL)
1825
1826         U+2028:
1827           Line separator
1828
1829         U+2029:
1830           Paragraph separator
1831
1832       In  8-bit,  non-UTF-8  mode, only the characters with code points < 256
1833       are relevant.
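
       The following sketch shows that \h and \v match high-valued code
       points without ucp, while \s does not. The module name and the
       chosen characters (em space and line separator) are illustrative.

```erlang
-module(re_hv_demo).
-export([em_space_h/0, em_space_s/0, line_sep_v/0]).

%% U+2003 (em space) is a horizontal space character.
%% Expected result: match
em_space_h() ->
    re:run([16#2003], "\\h", [unicode, {capture, none}]).

%% Without ucp, characters > 255 never match \s.
%% Expected result: nomatch
em_space_s() ->
    re:run([16#2003], "\\s", [unicode, {capture, none}]).

%% U+2028 (line separator) is a vertical space character.
%% Expected result: match
line_sep_v() ->
    re:run([16#2028], "\\v", [unicode, {capture, none}]).
```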
1834
1835       Newline Sequences
1836
1837       Outside a character class, by default, the escape sequence  \R  matches
1838       any  Unicode  newline  sequence. In non-UTF-8 mode, \R is equivalent to
1839       the following:
1840
1841       (?>\r\n|\n|\x0b|\f|\r|\x85)
1842
1843       This is an example of an "atomic group"; details are provided below.
1844
1845       This particular group matches either the two-character sequence CR fol‐
1846       lowed by LF, or one of the single characters LF (line feed, U+000A), VT
1847       (vertical tab, U+000B), FF (form feed, U+000C),  CR  (carriage  return,
1848       U+000D),  or  NEL  (next  line,  U+0085). The two-character sequence is
1849       treated as a single unit that cannot be split.
1850
1851       In Unicode mode, two more characters whose code points are  >  255  are
1852       added:  LS  (line  separator,  U+2028)  and  PS  (paragraph  separator,
1853       U+2029). Unicode character property support is  not  needed  for  these
1854       characters to be recognized.
1855
1856       \R can be restricted to match only CR, LF, or CRLF (instead of the com‐
1857       plete set of Unicode line endings) by setting option bsr_anycrlf either
1858       at  compile time or when the pattern is matched. (BSR is an acronym for
1859       "backslash R".) This can be made the default when PCRE is built; if so,
1860       the  other  behavior can be requested through option bsr_unicode. These
1861       settings can also be specified by starting a pattern string with one of
1862       the following sequences:
1863
1864         (*BSR_ANYCRLF):
1865           CR, LF, or CRLF only
1866
1867         (*BSR_UNICODE):
1868           Any Unicode newline sequence
1869
1870       These  override  the default and the options specified to the compiling
1871       function, but they can themselves be overridden by options specified to
1872       a  matching function. Notice that these special settings, which are not
1873       Perl-compatible, are recognized only at the very start  of  a  pattern,
1874       and  that  they  must  be  in  upper  case. If more than one of them is
1875       present, the last one is used. They can be combined with  a  change  of
1876       newline convention; for example, a pattern can start with:
1877
1878       (*ANY)(*BSR_ANYCRLF)
1879
1880       They  can  also be combined with the (*UTF8), (*UTF), or (*UCP) special
1881       sequences. Inside a character class, \R is treated as  an  unrecognized
1882       escape sequence, and so matches the letter "R" by default.
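
       A minimal sketch of the \R settings, assuming the Erlang default
       in which \R matches any Unicode newline sequence. The module name
       and subjects are illustrative.

```erlang
-module(re_bsr_demo).
-export([crlf/0, ff_default/0, ff_option/0, ff_prefix/0]).

%% CRLF is matched by \R as a single unit.
%% Expected result: match
crlf() ->
    re:run("a\r\nb", "a\\Rb", [{capture, none}]).

%% By default, form feed is also a newline sequence for \R.
%% Expected result: match
ff_default() ->
    re:run("a\fb", "a\\Rb", [{capture, none}]).

%% bsr_anycrlf restricts \R to CR, LF, or CRLF.
%% Expected result: nomatch
ff_option() ->
    re:run("a\fb", "a\\Rb", [bsr_anycrlf, {capture, none}]).

%% The same restriction set at the start of the pattern.
%% Expected result: nomatch
ff_prefix() ->
    re:run("a\fb", "(*BSR_ANYCRLF)a\\Rb", [{capture, none}]).
```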
1883
1884       Unicode Character Properties
1885
1886       Three more escape sequences that match characters with specific proper‐
1887       ties are available. When in 8-bit non-UTF-8 mode, these  sequences  are
1888       limited  to testing characters whose code points are < 256, but they do
1889       work in this mode. The following are the extra escape sequences:
1890
1891         \p{xx}:
1892           A character with property xx
1893
1894         \P{xx}:
1895           A character without property xx
1896
1897         \X:
1898           A Unicode extended grapheme cluster
1899
1900       The property names represented by xx above are limited to  the  Unicode
1901       script names, the general category properties, "Any", which matches any
1902       character  (including  newline),  and  some  special  PCRE   properties
1903       (described  in the next section). Other Perl properties, such as "InMu‐
1904       sicalSymbols", are currently not supported by PCRE. Notice that \P{Any}
1905       does not match any characters and always causes a match failure.
1906
1907       Sets of Unicode characters are defined as belonging to certain scripts.
1908       A character from one of these sets can be matched using a script  name,
1909       for example:
1910
1911       \p{Greek} \P{Han}
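
       A sketch of script matching from Erlang, using the Greek letter
       alpha (U+03B1) as an arbitrary test character and an illustrative
       module name:

```erlang
-module(re_script_demo).
-export([greek/0, not_han/0, han/0]).

%% Alpha belongs to the Greek script. Expected result: match
greek() ->
    re:run([16#3B1], "\\p{Greek}", [unicode, {capture, none}]).

%% Alpha is not a Han character, so \P{Han} matches it.
%% Expected result: match
not_han() ->
    re:run([16#3B1], "\\P{Han}", [unicode, {capture, none}]).

%% Expected result: nomatch
han() ->
    re:run([16#3B1], "\\p{Han}", [unicode, {capture, none}]).
```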
1912
1913       Those  that are not part of an identified script are lumped together as
1914       "Common". The following is the current list of scripts:
1915
1916         * Arabic
1917
1918         * Armenian
1919
1920         * Avestan
1921
1922         * Balinese
1923
1924         * Bamum
1925
1926         * Bassa_Vah
1927
1928         * Batak
1929
1930         * Bengali
1931
1932         * Bopomofo
1933
1934         * Braille
1935
1936         * Buginese
1937
1938         * Buhid
1939
1940         * Canadian_Aboriginal
1941
1942         * Carian
1943
1944         * Caucasian_Albanian
1945
1946         * Chakma
1947
1948         * Cham
1949
1950         * Cherokee
1951
1952         * Common
1953
1954         * Coptic
1955
1956         * Cuneiform
1957
1958         * Cypriot
1959
1960         * Cyrillic
1961
1962         * Deseret
1963
1964         * Devanagari
1965
1966         * Duployan
1967
1968         * Egyptian_Hieroglyphs
1969
1970         * Elbasan
1971
1972         * Ethiopic
1973
1974         * Georgian
1975
1976         * Glagolitic
1977
1978         * Gothic
1979
1980         * Grantha
1981
1982         * Greek
1983
1984         * Gujarati
1985
1986         * Gurmukhi
1987
1988         * Han
1989
1990         * Hangul
1991
1992         * Hanunoo
1993
1994         * Hebrew
1995
1996         * Hiragana
1997
1998         * Imperial_Aramaic
1999
2000         * Inherited
2001
2002         * Inscriptional_Pahlavi
2003
2004         * Inscriptional_Parthian
2005
2006         * Javanese
2007
2008         * Kaithi
2009
2010         * Kannada
2011
2012         * Katakana
2013
2014         * Kayah_Li
2015
2016         * Kharoshthi
2017
2018         * Khmer
2019
2020         * Khojki
2021
2022         * Khudawadi
2023
2024         * Lao
2025
2026         * Latin
2027
2028         * Lepcha
2029
2030         * Limbu
2031
2032         * Linear_A
2033
2034         * Linear_B
2035
2036         * Lisu
2037
2038         * Lycian
2039
2040         * Lydian
2041
2042         * Mahajani
2043
2044         * Malayalam
2045
2046         * Mandaic
2047
2048         * Manichaean
2049
2050         * Meetei_Mayek
2051
2052         * Mende_Kikakui
2053
2054         * Meroitic_Cursive
2055
2056         * Meroitic_Hieroglyphs
2057
2058         * Miao
2059
2060         * Modi
2061
2062         * Mongolian
2063
2064         * Mro
2065
2066         * Myanmar
2067
2068         * Nabataean
2069
2070         * New_Tai_Lue
2071
2072         * Nko
2073
2074         * Ogham
2075
2076         * Ol_Chiki
2077
2078         * Old_Italic
2079
2080         * Old_North_Arabian
2081
2082         * Old_Permic
2083
2084         * Old_Persian
2085
2086         * Oriya
2087
2088         * Old_South_Arabian
2089
2090         * Old_Turkic
2091
2092         * Osmanya
2093
2094         * Pahawh_Hmong
2095
2096         * Palmyrene
2097
2098         * Pau_Cin_Hau
2099
2100         * Phags_Pa
2101
2102         * Phoenician
2103
2104         * Psalter_Pahlavi
2105
2106         * Rejang
2107
2108         * Runic
2109
2110         * Samaritan
2111
2112         * Saurashtra
2113
2114         * Sharada
2115
2116         * Shavian
2117
2118         * Siddham
2119
2120         * Sinhala
2121
2122         * Sora_Sompeng
2123
2124         * Sundanese
2125
2126         * Syloti_Nagri
2127
2128         * Syriac
2129
2130         * Tagalog
2131
2132         * Tagbanwa
2133
2134         * Tai_Le
2135
2136         * Tai_Tham
2137
2138         * Tai_Viet
2139
2140         * Takri
2141
2142         * Tamil
2143
2144         * Telugu
2145
2146         * Thaana
2147
2148         * Thai
2149
2150         * Tibetan
2151
2152         * Tifinagh
2153
2154         * Tirhuta
2155
2156         * Ugaritic
2157
2158         * Vai
2159
2160         * Warang_Citi
2161
2162         * Yi
2163
2164       Each character has exactly one Unicode general category property, spec‐
2165       ified  by  a  two-letter acronym. For compatibility with Perl, negation
2166       can be specified by including a circumflex between  the  opening  brace
2167       and the property name. For example, \p{^Lu} is the same as \P{Lu}.
2168
2169       If only one letter is specified with \p or \P, it includes all the gen‐
2170       eral category properties that start with that letter. In this case,  in
2171       the  absence of negation, the curly brackets in the escape sequence are
2172       optional. The following two examples have the same effect:
2173
2174       \p{L}
2175       \pL
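
       As a minimal sketch of the two spellings and of the circumflex nega‐
       tion, the following module checks single code points. The module and
       function names are hypothetical and chosen only for illustration;
       option unicode is passed so that the subject is read as a list of
       code points.

```erlang
-module(re_prop_example).
-export([is_letter/1, is_not_upper/1]).

%% True if code point Cp has a general category starting with L.
%% The braced form \p{L} and the short form \pL are expected to agree.
is_letter(Cp) ->
    Braced = re:run([Cp], "^\\p{L}$", [unicode, {capture, none}]),
    Short = re:run([Cp], "^\\pL$", [unicode, {capture, none}]),
    Braced = Short,   % crash loudly if the two spellings ever differ
    Braced =:= match.

%% \p{^Lu} is the Perl-compatible spelling of \P{Lu}.
is_not_upper(Cp) ->
    re:run([Cp], "^\\p{^Lu}$", [unicode, {capture, none}]) =:= match.
```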
2176
2177       The following general category property codes are supported:
2178
2179         C:
2180           Other
2181
2182         Cc:
2183           Control
2184
2185         Cf:
2186           Format
2187
2188         Cn:
2189           Unassigned
2190
2191         Co:
2192           Private use
2193
2194         Cs:
2195           Surrogate
2196
2197         L:
2198           Letter
2199
2200         Ll:
2201           Lowercase letter
2202
2203         Lm:
2204           Modifier letter
2205
2206         Lo:
2207           Other letter
2208
2209         Lt:
2210           Title case letter
2211
2212         Lu:
2213           Uppercase letter
2214
2215         M:
2216           Mark
2217
2218         Mc:
2219           Spacing mark
2220
2221         Me:
2222           Enclosing mark
2223
2224         Mn:
2225           Non-spacing mark
2226
2227         N:
2228           Number
2229
2230         Nd:
2231           Decimal number
2232
2233         Nl:
2234           Letter number
2235
2236         No:
2237           Other number
2238
2239         P:
2240           Punctuation
2241
2242         Pc:
2243           Connector punctuation
2244
2245         Pd:
2246           Dash punctuation
2247
2248         Pe:
2249           Close punctuation
2250
2251         Pf:
2252           Final punctuation
2253
2254         Pi:
2255           Initial punctuation
2256
2257         Po:
2258           Other punctuation
2259
2260         Ps:
2261           Open punctuation
2262
2263         S:
2264           Symbol
2265
2266         Sc:
2267           Currency symbol
2268
2269         Sk:
2270           Modifier symbol
2271
2272         Sm:
2273           Mathematical symbol
2274
2275         So:
2276           Other symbol
2277
2278         Z:
2279           Separator
2280
2281         Zl:
2282           Line separator
2283
2284         Zp:
2285           Paragraph separator
2286
2287         Zs:
2288           Space separator
2289
2290       The special property L& is also supported. It matches a character  that
2291       has  the  Lu, Ll, or Lt property, that is, a letter that is not classi‐
2292       fied as a modifier or "other".
2293
2294       The Cs (Surrogate) property applies only to  characters  in  the  range
2295       U+D800 to U+DFFF. Such characters are invalid in Unicode strings and so
2296       cannot be tested by PCRE. Perl does not support the Cs property.
2297
2298       The long synonyms for property names supported by Perl (such as \p{Let‐
2299       ter})  are  not supported by PCRE. It is not permitted to prefix any of
2300       these properties with "Is".
2301
2302       No character in the Unicode table has  the  Cn  (unassigned)  property.
2303       This  property is instead assumed for any code point that is not in the
2304       Unicode table.
2305
2306       Specifying caseless matching does not affect  these  escape  sequences.
2307       For example, \p{Lu} always matches only uppercase letters. This is dif‐
2308       ferent from the behavior of current versions of Perl.
2309
2310       Matching characters by Unicode property is not fast, as PCRE must do  a
2311       multistage  table  lookup to find a character property. That is why the
2312       traditional escape sequences such as \d and \w do not use Unicode prop‐
2313       erties  in PCRE by default. However, you can make them do so by setting
2314       option ucp or by starting the pattern with (*UCP).
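
       The effect of option ucp on \w can be sketched as follows, using the
       Greek small letter alpha (U+03B1) as a code point above 255. The
       module and function names are illustrative assumptions.

```erlang
-module(re_ucp_example).
-export([word_char/2, word_char_inline/1]).

%% Does code point Cp match \w when the extra options Opts are passed?
word_char(Cp, Opts) ->
    re:run([Cp], "^\\w$", [unicode, {capture, none} | Opts]) =:= match.

%% Same check, but Unicode properties are requested inside the pattern
%% with the leading (*UCP) item instead of the ucp option.
word_char_inline(Cp) ->
    re:run([Cp], "(*UCP)^\\w$", [unicode, {capture, none}]) =:= match.
```

       Calling word_char(16#3B1, []) is expected to return false, while
       word_char(16#3B1, [ucp]) and word_char_inline(16#3B1) return true.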
2315
2316       Extended Grapheme Clusters
2317
2318       The \X escape matches any number of Unicode  characters  that  form  an
2319       "extended grapheme cluster", and treats the sequence as an atomic group
2320       (see below). Up to and including release 8.31, PCRE matched an earlier,
2321       simpler  definition  that  was  equivalent  to (?>\PM\pM*). That is, it
2322       matched a character without the "mark" property, followed  by  zero  or
2323       more  characters  with  the "mark" property. Characters with the "mark"
2324       property are typically non-spacing accents that  affect  the  preceding
2325       character.
2326
2327       This  simple definition was extended in Unicode to include more compli‐
2328       cated kinds of composite character by giving each character a  grapheme
2329       breaking  property,  and  creating  rules  that use these properties to
2330       define the boundaries of extended grapheme clusters. In  PCRE  releases
2331       later than 8.31, \X matches one of these clusters.
2332
2333       \X  always  matches  at least one character. Then it decides whether to
2334       add more characters according to the following rules for ending a clus‐
2335       ter:
2336
2337         * End at the end of the subject string.
2338
2339         * Do not end between CR and LF; otherwise end after any control char‐
2340           acter.
2341
2342         * Do not break Hangul (a Korean script)  syllable  sequences.  Hangul
2343           characters  are of five types: L, V, T, LV, and LVT. An L character
2344           can be followed by an L, V, LV, or LVT character. An LV or V  char‐
2345           acter  can be followed by a V or T character. An LVT or T character
2346           can be followed only by a T character.
2347
2348         * Do not end before extending characters or spacing marks. Characters
2349           with the "mark" property always have the "extend" grapheme breaking
2350           property.
2351
2352         * Do not end after prepend characters.
2353
2354         * Otherwise, end the cluster.
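
       The rules above can be tried out with a small helper that returns the
       first cluster of a code point list. This is only a sketch; the module
       and function names are hypothetical.

```erlang
-module(re_grapheme_example).
-export([first_cluster/1]).

%% Return the first extended grapheme cluster of Chars as a list of
%% code points, or [] when Chars is empty (\X needs one character).
first_cluster(Chars) ->
    case re:run(Chars, "\\X", [unicode, {capture, first, list}]) of
        {match, [Cluster]} -> Cluster;
        nomatch -> []
    end.
```

       For example, first_cluster([$e, 16#301, $x]) keeps the letter "e" and
       the combining acute accent U+0301 together in one cluster.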
2355
2356       PCRE Additional Properties
2357
2358       In addition to the standard Unicode properties described earlier,  PCRE
2359       supports  four more that make it possible to convert traditional escape
2360       sequences, such as \w and \s to use Unicode properties. PCRE uses these
2361       non-standard,  non-Perl  properties  internally  when the ucp option is
2362       passed. However, they can also be used explicitly. The  properties  are
2363       as follows:
2364
2365         Xan:
2366           Any alphanumeric character. Matches characters that have either the
2367           L (letter) or the N (number) property.
2368
2369         Xps:
2370           Any Posix space character. Matches the characters tab,  line  feed,
2371           vertical  tab,  form feed, carriage return, and any other character
2372           that has the Z (separator) property.
2373
         Xsp:
           Any Perl space character. Matches the same characters as Xps.
           (Before PCRE 8.34, vertical tab was excluded.)
2377
2378         Xwd:
2379           Any Perl "word" character. Matches the same characters as Xan, plus
2380           underscore.
2381
2382       Perl and POSIX space are now the same. Perl added VT to its space char‐
2383       acter set at release 5.18 and PCRE changed at release 8.34.
2384
2385       Xan  matches  characters that have either the L (letter) or the N (num‐
2386       ber) property. Xps matches the characters tab, linefeed, vertical  tab,
2387       form  feed,  or carriage return, and any other character that has the Z
2388       (separator) property. Xsp is the same as Xps; it used to exclude verti‐
2389       cal tab, for Perl compatibility, but Perl changed, and so PCRE followed
2390       at release 8.34. Xwd matches the same characters as  Xan,  plus  under‐
2391       score.
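
       The additional properties can also be used explicitly with \p, as in
       this sketch. The helper has_prop/2 and the module name are assump‐
       tions made for illustration; Prop is expected to be one of "Xan",
       "Xps", "Xsp", or "Xwd".

```erlang
-module(re_xprop_example).
-export([has_prop/2]).

%% True if code point Cp has the PCRE-specific property named Prop.
%% \A and \z pin the match to the whole one-character subject.
has_prop(Cp, Prop) ->
    Pattern = "\\A\\p{" ++ Prop ++ "}\\z",
    re:run([Cp], Pattern, [unicode, {capture, none}]) =:= match.
```

       Underscore is the difference between Xan and Xwd: has_prop($_, "Xan")
       is false, while has_prop($_, "Xwd") is true.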
2392
2393       There  is another non-standard property, Xuc, which matches any charac‐
2394       ter that can be represented by a Universal Character Name  in  C++  and
2395       other  programming  languages.  These are the characters $, @, ` (grave
2396       accent), and all characters with Unicode code points >= U+00A0,  except
2397       for  the  surrogates  U+D800  to  U+DFFF. Notice that most base (ASCII)
2398       characters are excluded. (Universal Character Names  are  of  the  form
2399       \uHHHH  or  \UHHHHHHHH, where H is a hexadecimal digit. Notice that the
2400       Xuc property does not match these sequences  but  the  characters  that
2401       they represent.)
2402
2403       Resetting the Match Start
2404
2405       The  escape sequence \K causes any previously matched characters not to
2406       be included in the final matched sequence. For example,  the  following
2407       pattern matches "foobar", but reports that it has matched "bar":
2408
2409       foo\Kbar
2410
2411       This  feature  is  similar to a lookbehind assertion (described below).
2412       However, in this case, the part of the subject before  the  real  match
2413       does  not  have to be of fixed length, as lookbehind assertions do. The
2414       use of \K does not interfere with the setting of  captured  substrings.
2415       For  example,  when  the  following pattern matches "foobar", the first
2416       substring is still set to "foo":
2417
2418       (foo)\Kbar
2419
2420       Perl documents that the use  of  \K  within  assertions  is  "not  well
2421       defined".  In  PCRE,  \K  is  acted upon when it occurs inside positive
2422       assertions, but is ignored in negative assertions.  Note  that  when  a
2423       pattern  such  as (?=ab\K) matches, the reported start of the match can
2424       be greater than the end of the match.
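
       A short sketch of \K with run/3 follows. The module and function
       names are illustrative only. Notice the doubled backslash required by
       Erlang string literals.

```erlang
-module(re_k_example).
-export([after_foo/1, with_group/1]).

%% "foo" must be present, but only "bar" is reported as the match.
after_foo(Subject) ->
    re:run(Subject, "foo\\Kbar", [{capture, all, list}]).

%% \K does not disturb captured substrings: group 1 is still "foo".
with_group(Subject) ->
    re:run(Subject, "(foo)\\Kbar", [{capture, all, list}]).
```

       Here after_foo("foobar") returns {match,["bar"]} and
       with_group("foobar") returns {match,["bar","foo"]}.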
2425
2426       Simple Assertions
2427
2428       The final use of backslash is for certain simple assertions. An  asser‐
2429       tion  specifies a condition that must be met at a particular point in a
2430       match, without consuming any characters from the  subject  string.  The
2431       use  of subpatterns for more complicated assertions is described below.
2432       The following are the backslashed assertions:
2433
2434         \b:
2435           Matches at a word boundary.
2436
2437         \B:
2438           Matches when not at a word boundary.
2439
2440         \A:
2441           Matches at the start of the subject.
2442
2443         \Z:
2444           Matches at the end of the subject, and before a newline at the  end
2445           of the subject.
2446
2447         \z:
2448           Matches only at the end of the subject.
2449
2450         \G:
2451           Matches at the first matching position in the subject.
2452
2453       Inside  a  character  class, \b has a different meaning; it matches the
2454       backspace character. If any other of  these  assertions  appears  in  a
2455       character  class, by default it matches the corresponding literal char‐
2456       acter (for example, \B matches the letter B).
2457
2458       A word boundary is a position in the subject string where  the  current
2459       character  and  the previous character do not both match \w or \W (that
2460       is, one matches \w and the other matches \W), or the start  or  end  of
2461       the  string if the first or last character matches \w, respectively. In
2462       UTF mode, the meanings of \w and \W can be changed  by  setting  option
2463       ucp. When this is done, it also affects \b and \B. PCRE and Perl do not
2464       have a separate "start of word" or "end of word" metasequence. However,
2465       whatever  follows  \b normally determines which it is. For example, the
2466       fragment \ba matches "a" at the start of a word.
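
       As an illustration, \b on both sides of a literal restricts a search
       to whole words. The sketch below uses hypothetical names and assumes
       that Word contains only word characters, so that it needs no escap‐
       ing.

```erlang
-module(re_boundary_example).
-export([whole_word/2]).

%% True if Word occurs in Subject as a whole word, that is, with a word
%% boundary immediately before and after it.
whole_word(Subject, Word) ->
    Pattern = "\\b" ++ Word ++ "\\b",
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```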
2467
2468       The \A, \Z, and \z assertions differ from  the  traditional  circumflex
2469       and dollar (described in the next section) in that they only ever match
2470       at the very start and end of the subject string, whatever  options  are
2471       set.  Thus,  they are independent of multiline mode. These three asser‐
2472       tions are not affected by options notbol or noteol, which  affect  only
2473       the  behavior  of the circumflex and dollar metacharacters. However, if
2474       argument startoffset of run/3 is non-zero, indicating that matching  is
2475       to  start  at  a  point other than the beginning of the subject, \A can
2476       never match. The difference between \Z and \z is that \Z matches before
2477       a  newline  at  the  end  of  the  string and at the very end, while \z
2478       matches only at the end.
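
       The differences described above can be sketched with a few calls. The
       module and its helper are hypothetical; the start offset is given to
       run/3 as option {offset, N}.

```erlang
-module(re_anchor_example).
-export([demo/0]).

matches(Subject, Pattern, Opts) ->
    re:run(Subject, Pattern, [{capture, none} | Opts]) =:= match.

demo() ->
    [matches("abc\n", "abc\\Z", []),             % \Z: also before a final newline
     matches("abc\n", "abc\\z", []),             % \z: only at the very end
     matches("x\nabc", "\\Aabc", [multiline]),   % \A ignores multiline
     matches("x\nabc", "^abc", [multiline]),     % circumflex does not
     matches("xabc", "\\Aabc", [{offset, 1}])].  % \A fails at a non-zero offset
```

       The list returned by demo() is expected to be
       [true,false,false,true,false].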
2479
2480       The \G assertion is true only when the current matching position is  at
2481       the  start  point of the match, as specified by argument startoffset of
2482       run/3. It differs from \A when the value of startoffset is non-zero. By
2483       calling  run/3 multiple times with appropriate arguments, you can mimic
2484       the Perl option /g, and it is in this kind of implementation  where  \G
2485       can be useful.
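
       A minimal sketch of \G with a start offset is shown below. The names
       are assumptions made for illustration.

```erlang
-module(re_g_example).
-export([digits_at/2]).

%% Match a run of digits only if it begins exactly at Offset. Because of
%% \G, the matcher cannot skip ahead to digits later in the subject.
digits_at(Subject, Offset) ->
    re:run(Subject, "\\G\\d+", [{offset, Offset}, {capture, first, list}]).
```

       With subject "ab12cd", offset 2 yields {match,["12"]}, while offsets
       0 and 1 yield nomatch even though digits occur later in the string.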
2486
2487       Notice,  however,  that  the PCRE interpretation of \G, as the start of
2488       the current match, is subtly different from Perl, which defines  it  as
2489       the end of the previous match. In Perl, these can be different when the
2490       previously matched string was empty. As PCRE does only one match  at  a
2491       time, it cannot reproduce this behavior.
2492
2493       If  all  the alternatives of a pattern begin with \G, the expression is
2494       anchored to the starting match position, and the "anchored" flag is set
2495       in the compiled regular expression.
2496

CIRCUMFLEX AND DOLLAR

2498       The  circumflex  and  dollar  metacharacters are zero-width assertions.
2499       That is, they test for a particular condition to be true  without  con‐
2500       suming any characters from the subject string.
2501
2502       Outside a character class, in the default matching mode, the circumflex
2503       character is an assertion that is true only  if  the  current  matching
2504       point is at the start of the subject string. If argument startoffset of
2505       run/3 is non-zero, circumflex can never match if  option  multiline  is
2506       unset.  Inside  a character class, circumflex has an entirely different
2507       meaning (see below).
2508
       Circumflex need not be the first character of the pattern if several
       alternatives are involved, but it must be the first thing in each
       alternative in which it appears if the pattern is ever to match that
       branch. If all possible alternatives start with a circumflex, that is,
       if the pattern is constrained to match only at the start of the
       subject, it is said to be an "anchored" pattern. (There are also other
       constructs that can cause a pattern to be anchored.)
2516
       The dollar character is an assertion that is true only if the current
       matching point is at the end of the subject string, or immediately
       before a newline at the end of the string (by default). Notice,
       however, that it does not match the newline. Dollar need not be the
       last character of the pattern if several alternatives are involved,
       but it must be the last item in any branch in which it appears. Dollar
       has no special meaning in a character class.
2524
2525       The meaning of dollar can be changed so that it  matches  only  at  the
2526       very  end  of  the  string, by setting option dollar_endonly at compile
2527       time. This does not affect the \Z assertion.
2528
2529       The meanings of the circumflex and dollar  characters  are  changed  if
2530       option  multiline  is  set. When this is the case, a circumflex matches
2531       immediately after internal newlines and at the  start  of  the  subject
2532       string.  It does not match after a newline that ends the string. A dol‐
2533       lar matches before any newlines in the string, and  at  the  very  end,
2534       when  multiline  is set. When newline is specified as the two-character
2535       sequence CRLF, isolated CR and LF characters do not indicate newlines.
2536
2537       For example, the pattern /^abc$/ matches the subject string  "def\nabc"
2538       (where  \n  represents a newline) in multiline mode, but not otherwise.
2539       So, patterns that are anchored in single-line mode because all branches
2540       start  with  ^ are not anchored in multiline mode, and a match for cir‐
2541       cumflex is possible when argument startoffset  of  run/3  is  non-zero.
2542       Option dollar_endonly is ignored if multiline is set.
2543
       Notice that the sequences \A, \Z, and \z can be used to match the start
       and end of the subject in both modes. If all branches of a pattern
       start with \A, it is always anchored, regardless of whether multiline
       is set.
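
       The behavior of circumflex and dollar under options multiline and
       dollar_endonly can be sketched as follows. The module and helper
       names are hypothetical, and the default newline (LF) is assumed.

```erlang
-module(re_caret_dollar_example).
-export([demo/0]).

matches(Subject, Pattern, Opts) ->
    re:run(Subject, Pattern, [{capture, none} | Opts]) =:= match.

demo() ->
    [matches("def\nabc", "^abc$", []),            % single-line mode
     matches("def\nabc", "^abc$", [multiline]),   % ^ after internal newline
     matches("abc\n", "abc$", []),                % $ before a final newline
     matches("abc\n", "abc$", [dollar_endonly]),  % $ only at the very end
     matches("def\nabc", "^abc", [multiline, {offset, 4}]),
     matches("def\nabc", "^abc", [{offset, 4}])]. % no multiline, offset > 0
```

       The list returned by demo() is expected to be
       [false,true,true,false,true,false].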
2547

FULL STOP (PERIOD, DOT) AND \N

2549       Outside  a  character class, a dot in the pattern matches any character
2550       in the subject string except (by default) a  character  that  signifies
2551       the end of a line.
2552
2553       When  a line ending is defined as a single character, dot never matches
2554       that character. When the two-character sequence CRLF is used, dot  does
2555       not  match CR if it is immediately followed by LF, otherwise it matches
2556       all characters (including isolated CRs and LFs). When any Unicode  line
2557       endings  are recognized, dot does not match CR, LF, or any of the other
2558       line-ending characters.
2559
2560       The behavior of dot regarding newlines can be changed. If option dotall
2561       is  set,  a  dot  matches any character, without exception. If the two-
2562       character sequence CRLF is present in the subject string, it takes  two
2563       dots to match it.
2564
       The handling of dot is entirely independent of the handling of
       circumflex and dollar; the only relationship is that both involve
       newlines. Dot has no special meaning in a character class.
2568
       The escape sequence \N behaves like a dot, except that it is not
       affected by option dotall. That is, it matches any character except
       one that signifies the end of a line. Perl also uses \N to match
       characters by name, but PCRE does not support this.
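
       A sketch of how dot, option dotall, and \N treat line endings is
       shown below. The module and helper names are illustrative; the
       default newline (LF) is assumed except where {newline, crlf} is
       passed.

```erlang
-module(re_dot_example).
-export([demo/0]).

matches(Subject, Pattern, Opts) ->
    re:run(Subject, Pattern, [{capture, none} | Opts]) =:= match.

demo() ->
    [matches("a\nb", "a.b", []),                  % dot stops at the newline
     matches("a\nb", "a.b", [dotall]),            % dotall lets dot match it
     matches("a\nb", "a\\Nb", [dotall]),          % \N is not affected by dotall
     matches("a\r\nb", "a..b", [dotall]),         % CRLF takes two dots
     matches("a\rb", "a.b", [{newline, crlf}])].  % isolated CR is not a newline
```

       The list returned by demo() is expected to be
       [false,true,false,true,true].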
2573

MATCHING A SINGLE DATA UNIT

       Outside a character class, the escape sequence \C matches any data
       unit, regardless of whether a UTF mode is set. One data unit is one byte.
2577       Unlike a dot, \C always matches line-ending characters. The feature  is
2578       provided  in  Perl  to  match individual bytes in UTF-8 mode, but it is
2579       unclear how it can usefully be used. As \C breaks  up  characters  into
2580       individual  data  units,  matching one unit with \C in a UTF mode means
2581       that the remaining string can start with  a  malformed  UTF  character.
2582       This  has  undefined  results, as PCRE assumes that it deals with valid
2583       UTF strings.
2584
2585       PCRE does not allow \C to appear in  lookbehind  assertions  (described
2586       below) in a UTF mode, as this would make it impossible to calculate the
2587       length of the lookbehind.
2588
2589       The \C escape sequence is best avoided. However, one way  of  using  it
2590       that  avoids the problem of malformed UTF characters is to use a looka‐
2591       head to check the length of the next character,  as  in  the  following
2592       pattern,  which  can be used with a UTF-8 string (ignore whitespace and
2593       line breaks):
2594
2595       (?| (?=[\x00-\x7f])(\C) |
2596           (?=[\x80-\x{7ff}])(\C)(\C) |
2597           (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
2598           (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
2599
2600       A group that starts with (?| resets the capturing  parentheses  numbers
2601       in  each  alternative  (see  section Duplicate Subpattern Numbers). The
2602       assertions at the start of each branch check the next  UTF-8  character
2603       for  values  whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
2604       individual bytes of the character are then captured by the  appropriate
2605       number of groups.
2606

SQUARE BRACKETS AND CHARACTER CLASSES

2608       An opening square bracket introduces a character class, terminated by a
2609       closing square bracket. A closing square bracket on its own is not spe‐
2610       cial  by  default.  However, if option PCRE_JAVASCRIPT_COMPAT is set, a
2611       lone closing square bracket causes a compile-time error. If  a  closing
2612       square  bracket  is  required as a member of the class, it is to be the
2613       first data character in the class  (after  an  initial  circumflex,  if
2614       present) or escaped with a backslash.
2615
2616       A  character  class matches a single character in the subject. In a UTF
2617       mode, the character can be more than one  data  unit  long.  A  matched
2618       character must be in the set of characters defined by the class, unless
2619       the first character in the class definition is a circumflex,  in  which
2620       case the subject character must not be in the set defined by the class.
2621       If a circumflex is required as a member of the class, ensure that it is
2622       not the first character, or escape it with a backslash.
2623
2624       For  example,  the character class [aeiou] matches any lowercase vowel,
2625       while [^aeiou] matches any character that is  not  a  lowercase  vowel.
2626       Notice  that  a circumflex is just a convenient notation for specifying
2627       the characters that are in the class by enumerating those that are not.
2628       A  class  that  starts  with a circumflex is not an assertion; it still
2629       consumes a character from the subject string, and therefore it fails if
2630       the current pointer is at the end of the string.
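
       The following sketch shows a positive and a negated class, including
       the point that a negated class still has to consume a character. The
       names are hypothetical.

```erlang
-module(re_class_example).
-export([first/2]).

%% Return the first substring of Subject matched by Pattern, as a list.
first(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, first, list}]).
```

       For example, first("sky", "[aeiou]") returns nomatch, and
       first("sky", "[^aeiou]") returns {match,["s"]}. On the empty subject,
       first("", "[^aeiou]") also returns nomatch, as there is no character
       to consume.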
2631
       In UTF-8 mode, characters with values > 255 (0xff) can be included in
       a class as a literal string of data units, or by using the \x{ escaping
       mechanism.
2635
2636       When  caseless  matching  is set, any letters in a class represent both
2637       their uppercase and lowercase versions. For example, a caseless [aeiou]
2638       matches  "A" and "a", and a caseless [^aeiou] does not match "A", but a
2639       caseful version would. In a UTF mode, PCRE always understands the  con‐
2640       cept  of case for characters whose values are < 256, so caseless match‐
2641       ing is always possible. For characters with higher values, the  concept
2642       of  case  is  supported  only if PCRE is compiled with Unicode property
       support. If you want to use caseless matching in a UTF mode for these
       higher-valued characters, ensure that PCRE is compiled with Unicode
       property support and with UTF support.
2646
       Characters that can indicate line breaks are never treated in any
       special way when matching character classes, whatever line-ending
       sequence is in use, and whatever setting of options dotall and
       multiline is used. A class such as [^a] always matches one of these
       characters.
2652
2653       The minus (hyphen) character can be used to specify a range of  charac‐
2654       ters  in  a  character  class.  For  example,  [d-m] matches any letter
2655       between d and m, inclusive. If a  minus  character  is  required  in  a
2656       class,  it  must  be  escaped  with a backslash or appear in a position
2657       where it cannot be interpreted as indicating a range, typically as  the
2658       first or last character in the class, or immediately after a range. For
2659       example, [b-d-z] matches letters in the range b to d, a hyphen  charac‐
2660       ter, or z.
2661
2662       The  literal  character  "]"  cannot be the end character of a range. A
2663       pattern such as [W-]46] is interpreted as a  class  of  two  characters
2664       ("W"  and  "-")  followed  by a literal string "46]", so it would match
2665       "W46]" or "-46]". However, if "]" is escaped with a  backslash,  it  is
2666       interpreted  as the end of range, so [W-\]46] is interpreted as a class
2667       containing a range followed by two other characters. The octal or hexa‐
2668       decimal representation of "]" can also be used to end a range.
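
       The treatment of hyphens and closing square brackets in classes can
       be checked with a sketch like the following. The module and function
       names are assumptions made for illustration.

```erlang
-module(re_range_example).
-export([whole/2]).

%% True if Pattern, wrapped in ^...$, matches all of Subject.
whole(Subject, Pattern) ->
    re:run(Subject, "^" ++ Pattern ++ "$", [{capture, none}]) =:= match.
```

       Here whole("-", "[b-d-z]") is true and whole("e", "[b-d-z]") is
       false. The unescaped form [W-]46] matches the strings "W46]" and
       "-46]", while the escaped form, written "[W-\\]46]" in Erlang, is a
       single class that matches, for example, "Z" or "4".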
2669
2670       An  error  is  generated  if  a POSIX character class (see below) or an
2671       escape sequence other than one that defines a single character  appears
2672       at  a  point  where  a range ending character is expected. For example,
2673       [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not.
2674
2675       Ranges operate in the collating sequence of character values. They  can
2676       also  be  used  for  characters  specified  numerically,  for  example,
2677       [\000-\037]. Ranges can include any characters that are valid  for  the
2678       current mode.
2679
2680       If a range that includes letters is used when caseless matching is set,
2681       it matches the letters in either case. For example, [W-c] is equivalent
2682       to  [][\\^_`wxyzabc], matched caselessly. In a non-UTF mode, if charac‐
2683       ter tables for a French locale are in use, [\xc8-\xcb] matches accented
2684       E  characters in both cases. In UTF modes, PCRE supports the concept of
2685       case for characters with values > 255 only when  it  is  compiled  with
2686       Unicode property support.
2687
2688       The  character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
2689       \w, and \W can appear in a character class, and add the characters that
2690       they  match to the class. For example, [\dABCDEF] matches any hexadeci‐
2691       mal digit. In UTF modes, option ucp affects the meanings of \d, \s,  \w
2692       and  their uppercase partners, just as it does when they appear outside
2693       a character class, as described in section Generic Character Types ear‐
2694       lier. The escape sequence \b has a different meaning inside a character
2695       class; it matches the backspace character. The sequences  \B,  \N,  \R,
2696       and  \X are not special inside a character class. Like any other unrec‐
2697       ognized escape sequences, they are treated as  the  literal  characters
2698       "B", "N", "R", and "X".
2699
2700       A  circumflex  can  conveniently  be  used with the uppercase character
2701       types to specify a more restricted set of characters than the  matching
2702       lowercase  type. For example, class [^\W_] matches any letter or digit,
2703       but not underscore, while [\w] includes underscore. A positive  charac‐
2704       ter  class is to be read as "something OR something OR ..." and a nega‐
2705       tive class as "NOT something AND NOT something AND NOT ...".
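
       Character type escapes inside classes can be sketched as follows.
       The names are hypothetical, and the subjects are single ASCII char‐
       acters.

```erlang
-module(re_class_escape_example).
-export([in_class/2]).

%% True if the one-character string Char is matched by class Class.
in_class(Char, Class) ->
    re:run(Char, "\\A" ++ Class ++ "\\z", [{capture, none}]) =:= match.
```

       For example, in_class("a", "[^\\W_]") and in_class("7", "[^\\W_]")
       are true, while in_class("_", "[^\\W_]") is false. Similarly,
       in_class("F", "[\\dABCDEF]") is true, and in_class("\b", "[\\b]") is
       true because \b means backspace inside a class.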
2706
2707       Only the following metacharacters are recognized in character classes:
2708
2709         * Backslash
2710
2711         * Hyphen (only where it can be interpreted as specifying a range)
2712
2713         * Circumflex (only at the start)
2714
2715         * Opening square bracket (only when it can be interpreted  as  intro‐
2716           ducing  a Posix class name, or for a special compatibility feature;
2717           see the next two sections)
2718
2719         * Terminating closing square bracket
2720
2721       However, escaping other non-alphanumeric characters does no harm.
2722

POSIX CHARACTER CLASSES

2724       Perl supports the Posix notation for character classes. This uses names
2725       enclosed  by  [: and :] within the enclosing square brackets. PCRE also
2726       supports this notation. For example, the following  matches  "0",  "1",
2727       any alphabetic character, or "%":
2728
2729       [01[:alpha:]%]
2730
2731       The following are the supported class names:
2732
2733         alnum:
2734           Letters and digits
2735
2736         alpha:
2737           Letters
2738
2739         ascii:
2740           Character codes 0-127
2741
2742         blank:
2743           Space or tab only
2744
2745         cntrl:
2746           Control characters
2747
2748         digit:
2749           Decimal digits (same as \d)
2750
2751         graph:
2752           Printing characters, excluding space
2753
2754         lower:
2755           Lowercase letters
2756
2757         print:
2758           Printing characters, including space
2759
2760         punct:
2761           Printing characters, excluding letters, digits, and space
2762
2763         space:
2764           Whitespace (the same as \s from PCRE 8.34)
2765
2766         upper:
2767           Uppercase letters
2768
2769         word:
2770           "Word" characters (same as \w)
2771
2772         xdigit:
2773           Hexadecimal digits
2774
2775       The  default  "space" characters are HT (9), LF (10), VT (11), FF (12),
2776       CR (13), and space (32). If locale-specific matching is  taking  place,
2777       the  list  of  space characters may be different; there may be fewer or
2778       more of them. "Space" used to be different to \s, which did not include
2779       VT,  for Perl compatibility. However, Perl changed at release 5.18, and
2780       PCRE followed at release 8.34. "Space" and \s now match the same set of
2781       characters.
2782
2783       The  name  "word"  is  a Perl extension, and "blank" is a GNU extension
2784       from Perl 5.8. Another Perl extension is negation, which  is  indicated
2785       by  a  ^  character after the colon. For example, the following matches
2786       "1", "2", or any non-digit:
2787
2788       [12[:^digit:]]
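
       Both example classes above can be tried with a sketch such as the
       following. The module and function names are illustrative only.

```erlang
-module(re_posix_example).
-export([in_class/2]).

%% True if the one-character string Char is matched by class Class.
in_class(Char, Class) ->
    re:run(Char, "\\A" ++ Class ++ "\\z", [{capture, none}]) =:= match.
```

       With Class set to "[01[:alpha:]%]", the subjects "%", "x", and "0"
       match, but "2" does not. With Class set to "[12[:^digit:]]", the
       subjects "1" and "x" match, but "3" does not.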
2789
2790       PCRE (and Perl) also recognize the Posix syntax [.ch.] and [=ch=] where
2791       "ch"  is  a  "collating  element",  but these are not supported, and an
2792       error is given if they are encountered.
2793
       By default, characters with values > 255 do not match any of the Posix
       character classes. However, if option ucp is passed to compile/2, some
       of the classes are changed so that Unicode character properties are
       used. This is achieved by replacing certain Posix classes by other
       sequences, as follows:
2799
2800         [:alnum:]:
2801           Becomes \p{Xan}
2802
2803         [:alpha:]:
2804           Becomes \p{L}
2805
2806         [:blank:]:
2807           Becomes \h
2808
2809         [:digit:]:
2810           Becomes \p{Nd}
2811
2812         [:lower:]:
2813           Becomes \p{Ll}
2814
2815         [:space:]:
2816           Becomes \p{Xps}
2817
2818         [:upper:]:
2819           Becomes \p{Lu}
2820
2821         [:word:]:
2822           Becomes \p{Xwd}
2823
2824       Negated versions, such as [:^alpha:], use \P instead of \p. Three other
2825       POSIX classes are handled specially in UCP mode:
2826
2827         [:graph:]:
2828           This  matches  characters  that have glyphs that mark the page when
2829           printed. In Unicode property terms, it matches all characters  with
2830           the L, M, N, P, S, or Cf properties, except for:
2831
2832           U+061C:
2833             Arabic Letter Mark
2834
2835           U+180E:
2836             Mongolian Vowel Separator
2837
2838           U+2066 - U+2069:
2839             Various "isolate"s
2840
2841         [:print:]:
2842           This matches the same characters as [:graph:] plus space characters
2843           that are not controls, that is, characters with the Zs property.
2844
2845         [:punct:]:
2846           This matches all characters that have the Unicode  P  (punctuation)
2847           property, plus those characters whose code points are less than 128
2848           that have the S (Symbol) property.
2849
2850       The other POSIX classes are unchanged, and match only  characters  with
2851       code points less than 128.
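
       The effect of option ucp on a POSIX class can be sketched with the
       Greek small letter alpha (U+03B1), a letter above code point 255.
       The module and function names are hypothetical.

```erlang
-module(re_posix_ucp_example).
-export([is_alpha/2]).

%% Does code point Cp match [[:alpha:]] when the extra options Opts
%% (for example [ucp]) are passed?
is_alpha(Cp, Opts) ->
    re:run([Cp], "^[[:alpha:]]$", [unicode, {capture, none} | Opts]) =:= match.
```

       Here is_alpha(16#3B1, []) is expected to be false, while
       is_alpha(16#3B1, [ucp]) is expected to be true, as [:alpha:] is then
       treated as \p{L}.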
2852
2853       Compatibility Feature for Word Boundaries
2854
2855       In  the POSIX.2 compliant library that was included in 4.4BSD Unix, the
2856       ugly syntax [[:<:]] and [[:>:]] is used for matching  "start  of  word"
2857       and "end of word". PCRE treats these items as follows:
2858
2859         [[:<:]]:
2860           is converted to \b(?=\w)
2861
2862         [[:>:]]:
2863           is converted to \b(?<=\w)
2864
       Only these exact character sequences are recognized. A sequence such as
       [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
       support is not compatible with Perl. It is provided to help migrations
       from other environments, and is best not used in any new patterns. Note
       that \b matches at the start and the end of a word (see "Simple
       assertions" above), and in a Perl-style pattern the preceding or
       following character normally shows which is wanted, without the need
       for the assertions that are used above to give exactly the POSIX
       behaviour.
2874

VERTICAL BAR

2876       Vertical  bar characters are used to separate alternative patterns. For
2877       example, the following pattern matches either "gilbert" or "sullivan":
2878
2879       gilbert|sullivan
2880
2881       Any number of alternatives can appear, and an empty alternative is per‐
2882       mitted  (matching  the  empty  string). The matching process tries each
2883       alternative in turn, from left to right, and the first that succeeds is
2884       used.  If  the alternatives are within a subpattern (defined in section
2885       Subpatterns), "succeeds" means matching the remaining main pattern  and
2886       the alternative in the subpattern.
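
       A minimal sketch of how the order of alternatives affects the result.
       The module alt_demo and its function are hypothetical helpers; only
       re:run/3 and the capture option are taken from this module.

```erlang
-module(alt_demo).
-export([first_match/2]).

%% Hypothetical helper: returns the text of the overall match as a
%% string, or the atom nomatch.
first_match(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, first, list}]) of
        {match, [Text]} -> Text;
        nomatch -> nomatch
    end.
```

       With subject "abc", pattern "a|ab" yields "a" because the first
       alternative already succeeds, whereas "ab|a" yields "ab". An empty
       alternative, as in "gilbert|", matches the empty string when nothing
       else does.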
2887

INTERNAL OPTION SETTING

2889       The  settings  of  the  Perl-compatible  options  caseless,  multiline,
2890       dotall, and extended can be  changed  from  within  the  pattern  by  a
2891       sequence  of  Perl  option  letters  enclosed between "(?" and ")". The
2892       option letters are as follows:
2893
2894         i:
2895           For caseless
2896
2897         m:
2898           For multiline
2899
2900         s:
2901           For dotall
2902
2903         x:
2904           For extended
2905
2906       For example, (?im) sets caseless, multiline matching. These options can
2907       also be unset by preceding the letter with a hyphen. A combined setting
2908       and unsetting such as (?im-sx),  which  sets  caseless  and  multiline,
2909       while  unsetting  dotall  and  extended, is also permitted. If a letter
2910       appears both before and after the hyphen, the option is unset.
2911
2912       The PCRE-specific options dupnames, ungreedy, and extra can be  changed
2913       in  the same way as the Perl-compatible options by using the characters
2914       J, U, and X respectively.
2915
2916       When one of these option changes occurs  at  top-level  (that  is,  not
2917       inside  subpattern parentheses), the change applies to the remainder of
2918       the pattern that follows.
2919
2920       An option change within a subpattern (see section Subpatterns)  affects
2921       only  that  part  of  the subpattern that follows it. So, the following
2922       matches abc and aBc and no other  strings  (assuming  caseless  is  not
2923       used):
2924
2925       (a(?i)b)c
2926
2927       By  this  means, options can be made to have different settings in dif‐
2928       ferent parts of the pattern. Any changes made  in  one  alternative  do
2929       carry on into subsequent branches within the same subpattern. For exam‐
2930       ple:
2931
2932       (a(?i)b|c)
2933
2934       matches "ab", "aB", "c", and "C", although when matching "C" the  first
2935       branch  is  abandoned  before  the  option setting. This is because the
2936       effects of option settings occur at compile time. There would  be  some
2937       weird behavior otherwise.
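
       The scoping rules above can be checked with a small sketch. The module
       and function names are hypothetical; the helper wraps each pattern in
       ^(?:...)$ so that only whole-subject matches count, which is an
       assumption made for the illustration.

```erlang
-module(inline_opt_demo).
-export([scoped/1, branch/1]).

%% (?i) inside the group affects only "b"; "a" and "c" stay caseful.
scoped(Subject) -> full_match(Subject, "(a(?i)b)c").

%% (?i) set in the first branch carries on into the second branch.
branch(Subject) -> full_match(Subject, "(a(?i)b|c)").

%% Hypothetical helper: true if Pattern matches the whole subject.
full_match(Subject, Pattern) ->
    Anchored = "^(?:" ++ Pattern ++ ")$",
    re:run(Subject, Anchored, [{capture, none}]) =:= match.
```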
2938
2939   Note:
2940       Other PCRE-specific options can be set by the application when the com‐
2941       piling or matching functions are called. Sometimes the pattern can con‐
2942       tain  special  leading sequences, such as (*CRLF), to override what the
2943       application has set or what has been defaulted. Details are provided in
2944       section  Newline Sequences earlier.
2945
2946       The  (*UTF8)  and  (*UCP)  leading sequences can be used to set UTF and
2947       Unicode property modes. They are equivalent to setting options  unicode
2948       and  ucp,  respectively.  The (*UTF) sequence is a generic version that
2949       can be used with any of the libraries. However, the application can set
2950       option never_utf, which locks out the use of the (*UTF) sequences.
2951
2952

SUBPATTERNS

2954       Subpatterns are delimited by parentheses (round brackets), which can be
2955       nested. Turning part of a pattern into a subpattern does two things:
2956
2957         1.:
2958           It localizes a set of alternatives. For example, the following pat‐
2959           tern matches "cataract", "caterpillar", or "cat":
2960
2961         cat(aract|erpillar|)
2962
2963           Without  the parentheses, it would match "cataract", "erpillar", or
2964           an empty string.
2965
2966         2.:
2967           It sets up the subpattern as a capturing subpattern. That is,  when
2968           the  complete  pattern  matches, that portion of the subject string
2969           that matched the subpattern is passed back to  the  caller  through
2970           the return value of run/3.
2971
2972       Opening parentheses are counted from left to right (starting from 1) to
2973       obtain numbers for the  capturing  subpatterns.  For  example,  if  the
2974       string  "the  red  king"  is matched against the following pattern, the
2975       captured substrings are "red king", "red", and "king", and are numbered
2976       1, 2, and 3, respectively:
2977
2978       the ((red|white) (king|queen))
2979
2980       It  is not always helpful that plain parentheses fulfill two functions.
2981       Often a grouping subpattern is required without  a  capturing  require‐
2982       ment.  If  an  opening parenthesis is followed by a question mark and a
2983       colon, the subpattern does not do any capturing,  and  is  not  counted
2984       when  computing the number of any subsequent capturing subpatterns. For
2985       example, if the string "the white queen" is matched against the follow‐
2986       ing pattern, the captured substrings are "white queen" and "queen", and
2987       are numbered 1 and 2:
2988
2989       the ((?:red|white) (king|queen))
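
       The numbering of capturing and non-capturing groups can be observed
       through run/3. The following is a sketch; capture_demo and groups/2
       are hypothetical names, and the capture option all_but_first is used
       so that only the parenthesized substrings are returned.

```erlang
-module(capture_demo).
-export([groups/2]).

%% Hypothetical helper: returns the captured substrings 1..N as a
%% list of strings, or the atom nomatch.
groups(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, all_but_first, list}]) of
        {match, Captured} -> Captured;
        nomatch -> nomatch
    end.
```

       For "the red king" and the pattern with plain parentheses, the result
       is ["red king","red","king"]. For "the white queen" and the pattern
       using (?:red|white), it is ["white queen","queen"].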
2990
2991       The maximum number of capturing subpatterns is 65535.
2992
2993       As a convenient shorthand, if any option settings are required  at  the
2994       start  of  a  non-capturing  subpattern,  the option letters can appear
2995       between "?" and ":". Thus, the following two patterns  match  the  same
2996       set of strings:
2997
2998       (?i:saturday|sunday)
2999       (?:(?i)saturday|sunday)
3000
3001       As  alternative  branches are tried from left to right, and options are
3002       not reset until the end of the subpattern is reached, an option setting
3003       in  one  branch  does affect subsequent branches, so the above patterns
3004       match both "SUNDAY" and "Saturday".
3005

DUPLICATE SUBPATTERN NUMBERS

3007       Perl 5.10 introduced a feature where each alternative in  a  subpattern
3008       uses  the same numbers for its capturing parentheses. Such a subpattern
3009       starts with (?| and is itself a non-capturing subpattern. For  example,
3010       consider the following pattern:
3011
3012       (?|(Sat)ur|(Sun))day
3013
3014       As  the two alternatives are inside a (?| group, both sets of capturing
3015       parentheses are numbered one. Thus, when the pattern matches,  you  can
3016       look  at  captured substring number one, whichever alternative matched.
3017       This construct is useful when you want to capture a part, but not  all,
3018       of  one  of many alternatives. Inside a (?| group, parentheses are num‐
3019       bered as usual, but the number is reset at the start  of  each  branch.
3020       The  numbers  of  any  capturing parentheses that follow the subpattern
3021       start after the highest number used in any branch. The following  exam‐
3022       ple  is  from  the  Perl  documentation; the numbers underneath show in
3023       which buffer the captured content is stored:
3024
3025       # before  ---------------branch-reset----------- after
3026       / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
3027       # 1            2         2  3        2     3     4
3028
3029       A back reference to a numbered subpattern uses the  most  recent  value
3030       that  is  set  for that number by any subpattern. The following pattern
3031       matches "abcabc" or "defdef":
3032
3033       /(?|(abc)|(def))\1/
3034
3035       In contrast, a subroutine call to a numbered subpattern  always  refers
3036       to  the  first  one in the pattern with the given number. The following
3037       pattern matches "abcabc" or "defabc":
3038
3039       /(?|(abc)|(def))(?1)/
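
       A sketch of the three behaviours described so far in this section. The
       module and function names are hypothetical; full_match/2 anchors the
       pattern at both ends, which is an assumption made so that partial
       matches do not count.

```erlang
-module(branch_reset_demo).
-export([day_prefix/1, full_match/2]).

%% Both alternatives store their capture in group 1.
day_prefix(Subject) ->
    Pattern = "(?|(Sat)ur|(Sun))day",
    case re:run(Subject, Pattern, [{capture, [1], list}]) of
        {match, [Prefix]} -> Prefix;
        nomatch -> nomatch
    end.

%% Hypothetical helper: true if Pattern matches the whole subject.
full_match(Subject, Pattern) ->
    Anchored = "^(?:" ++ Pattern ++ ")$",
    re:run(Subject, Anchored, [{capture, none}]) =:= match.
```

       The back reference \1 repeats whatever group 1 last captured, so
       "(?|(abc)|(def))\\1" accepts "defdef" but not "defabc". The subroutine
       call (?1) always runs the first group numbered 1, so
       "(?|(abc)|(def))(?1)" accepts "defabc" but not "defdef".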
3040
3041       If a condition test for a subpattern having matched refers  to  a  non-
3042       unique  number, the test is true if any of the subpatterns of that num‐
3043       ber have matched.
3044
       An alternative approach to using this "branch reset" feature is to use
       duplicate named subpatterns, as described in the next section.
3047

NAMED SUBPATTERNS

3049       Identifying  capturing  parentheses  by number is simple, but it can be
3050       hard to keep track of the numbers in complicated  regular  expressions.
3051       Also,  if  an  expression  is modified, the numbers can change. To help
3052       with this difficulty, PCRE supports the  naming  of  subpatterns.  This
3053       feature  was  not added to Perl until release 5.10. Python had the fea‐
3054       ture earlier, and PCRE introduced it at release 4.0, using  the  Python
3055       syntax.  PCRE  now  supports  both the Perl and the Python syntax. Perl
3056       allows identically numbered subpatterns to have  different  names,  but
3057       PCRE does not.
3058
3059       In  PCRE,  a subpattern can be named in one of three ways: (?<name>...)
3060       or (?'name'...) as in Perl, or (?P<name>...) as in  Python.  References
3061       to  capturing parentheses from other parts of the pattern, such as back
3062       references, recursion, and conditions, can be made by name and by  num‐
3063       ber.
3064
3065       Names  consist of up to 32 alphanumeric characters and underscores, but
3066       must start with a non-digit.  Named  capturing  parentheses  are  still
3067       allocated  numbers  as  well as names, exactly as if the names were not
3068       present. The capture specification to run/3 can  use  named  values  if
3069       they are present in the regular expression.
3070
3071       By default, a name must be unique within a pattern, but this constraint
3072       can be relaxed by setting option dupnames at compile  time.  (Duplicate
3073       names  are  also always permitted for subpatterns with the same number,
3074       set up as described in the previous section.) Duplicate  names  can  be
3075       useful  for  patterns  where only one instance of the named parentheses
3076       can match. Suppose that you want to match the name of a weekday, either
3077       as  a  3-letter abbreviation or as the full name, and in both cases you
3078       want to extract the abbreviation. The following pattern  (ignoring  the
3079       line breaks) does the job:
3080
3081       (?<DN>Mon|Fri|Sun)(?:day)?|
3082       (?<DN>Tue)(?:sday)?|
3083       (?<DN>Wed)(?:nesday)?|
3084       (?<DN>Thu)(?:rsday)?|
3085       (?<DN>Sat)(?:urday)?
3086
3087       There  are  five capturing substrings, but only one is ever set after a
3088       match. (An alternative way of solving this problem is to use a  "branch
3089       reset" subpattern, as described in the previous section.)
3090
       For named capturing subpatterns whose names are not unique, the first
       matching occurrence (counted from left to right in the subject) is
       returned from run/3, if the name is specified in the values part of the
       capture statement. The all_names capturing value matches all the names
       in the same way.
3096
3097   Note:
       You cannot use different names to distinguish between two subpatterns
       with the same number, as PCRE uses only the numbers when matching. For
       this reason, an error is given at compile time if different names are
       given to subpatterns with the same number. However, you can give the
       same name to subpatterns with the same number, even when dupnames is
       not set.
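
       Named subpatterns can be requested by name in the capture
       specification. The sketch below uses hypothetical module and function
       names; the year/month pattern is an invented example, while the
       weekday pattern is the one shown above, compiled with option dupnames.

```erlang
-module(named_capture_demo).
-export([year_month/1, day_abbrev/1]).

%% Unique names: ask for the groups by name instead of by number.
year_month(Subject) ->
    Pattern = "(?<year>\\d{4})-(?<month>\\d{2})",
    case re:run(Subject, Pattern, [{capture, [year, month], list}]) of
        {match, [Year, Month]} -> {Year, Month};
        nomatch -> nomatch
    end.

%% Duplicate names: every alternative uses the name DN, so option
%% dupnames is required. Only one DN group is set after a match.
day_abbrev(Subject) ->
    Pattern = "(?<DN>Mon|Fri|Sun)(?:day)?|"
              "(?<DN>Tue)(?:sday)?|"
              "(?<DN>Wed)(?:nesday)?|"
              "(?<DN>Thu)(?:rsday)?|"
              "(?<DN>Sat)(?:urday)?",
    case re:run(Subject, Pattern, [dupnames, {capture, ['DN'], list}]) of
        {match, [Abbrev]} -> Abbrev;
        nomatch -> nomatch
    end.
```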
3104
3105

REPETITION

3107       Repetition is specified by quantifiers, which can  follow  any  of  the
3108       following items:
3109
3110         * A literal data character
3111
3112         * The dot metacharacter
3113
3114         * The \C escape sequence
3115
3116         * The \X escape sequence
3117
3118         * The \R escape sequence
3119
3120         * An escape such as \d or \pL that matches a single character
3121
3122         * A character class
3123
3124         * A back reference (see the next section)
3125
3126         * A parenthesized subpattern (including assertions)
3127
3128         * A subroutine call to a subpattern (recursive or otherwise)
3129
3130       The  general repetition quantifier specifies a minimum and maximum num‐
3131       ber of permitted matches, by giving the two numbers in  curly  brackets
3132       (braces),  separated  by  a comma. The numbers must be < 65536, and the
3133       first must be less than or equal to the second. For example,  the  fol‐
3134       lowing matches "zz", "zzz", or "zzzz":
3135
3136       z{2,4}
3137
3138       A  closing  brace  on its own is not a special character. If the second
3139       number is omitted, but the comma is present, there is no  upper  limit.
3140       If  the  second  number  and the comma are both omitted, the quantifier
3141       specifies an exact number of  required  matches.  Thus,  the  following
3142       matches at least three successive vowels, but can match many more:
3143
3144       [aeiou]{3,}
3145
3146       The following matches exactly eight digits:
3147
3148       \d{8}
3149
3150       An  opening curly bracket that appears in a position where a quantifier
3151       is not allowed, or one that does not match the syntax of a  quantifier,
3152       is taken as a literal character. For example, {,6} is not a quantifier,
3153       but a literal string of four characters.
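
       The general quantifier forms can be tried out with a sketch like the
       following. quantifier_demo and first_match/2 are hypothetical names,
       and the subjects are invented for the illustration.

```erlang
-module(quantifier_demo).
-export([first_match/2]).

%% Hypothetical helper: returns the text of the overall match as a
%% string, or the atom nomatch.
first_match(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, first, list}]) of
        {match, [Text]} -> Text;
        nomatch -> nomatch
    end.
```

       For example, "z{2,4}" takes at most four characters from "zzzzz",
       "[aeiou]{3,}" takes all four trailing vowels of "queue", and "\\d{8}"
       requires exactly eight digits.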
3154
3155       In Unicode mode, quantifiers apply to characters rather than  to  indi‐
3156       vidual  data  units.  Thus, for example, \x{100}{2} matches two charac‐
3157       ters, each of which is represented by a  2-byte  sequence  in  a  UTF-8
3158       string.  Similarly, \X{3} matches three Unicode extended grapheme clus‐
3159       ters, each of which can be many data units long (and  they  can  be  of
3160       different lengths).
3161
3162       The quantifier {0} is permitted, causing the expression to behave as if
3163       the previous item and the quantifier were not present. This can be use‐
3164       ful  for  subpatterns that are referenced as subroutines from elsewhere
3165       in the pattern (but see also section  Defining Subpatterns for  Use  by
3166       Reference  Only).  Items other than subpatterns that have a {0} quanti‐
3167       fier are omitted from the compiled pattern.
3168
3169       For convenience, the three most common quantifiers have  single-charac‐
3170       ter abbreviations:
3171
3172         *:
3173           Equivalent to {0,}
3174
3175         +:
3176           Equivalent to {1,}
3177
3178         ?:
3179           Equivalent to {0,1}
3180
3181       Infinite  loops  can  be constructed by following a subpattern that can
3182       match no characters with a quantifier that  has  no  upper  limit,  for
3183       example:
3184
3185       (a?)*
3186
       Earlier versions of Perl and PCRE used to give an error at compile time
       for such patterns. As there are cases where this can be useful, such
       patterns are now accepted, but if any repetition of the subpattern
       matches no characters, the loop is forcibly broken.
3191
3192       By default, the quantifiers are "greedy", that is, they match  as  much
3193       as  possible  (up  to  the  maximum number of permitted times), without
3194       causing the remaining pattern to fail. The  classic  example  of  where
3195       this gives problems is in trying to match comments in C programs. These
3196       appear between /* and */. Within the comment, individual * and /  char‐
3197       acters  can appear. An attempt to match C comments by applying the pat‐
3198       tern
3199
3200       /\*.*\*/
3201
3202       to the string
3203
3204       /* first comment */  not comment  /* second comment */
3205
3206       fails, as it matches the entire string owing to the greediness  of  the
3207       .* item.
3208
3209       However,  if  a quantifier is followed by a question mark, it ceases to
3210       be greedy, and instead matches the minimum number of times possible, so
3211       the following pattern does the right thing with the C comments:
3212
3213       /\*.*?\*/
3214
3215       The  meaning  of the various quantifiers is not otherwise changed, only
3216       the preferred number of matches. Do not confuse this  use  of  question
3217       mark with its use as a quantifier in its own right. As it has two uses,
3218       it can sometimes appear doubled, as in
3219
3220       \d??\d
3221
3222       which matches one digit by preference, but can match two if that is the
3223       only way the remaining pattern matches.
3224
3225       If  option  ungreedy  is set (an option that is not available in Perl),
3226       the quantifiers are not greedy by default, but individual ones  can  be
3227       made greedy by following them with a question mark. That is, it inverts
3228       the default behavior.
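
       The C comment example and the effect of option ungreedy can be
       sketched as follows. The module and function names are hypothetical;
       the subject string is the one used above.

```erlang
-module(greedy_demo).
-export([subject/0, first_comment/2]).

%% The subject string from the C comment example.
subject() ->
    "/* first comment */  not comment  /* second comment */".

%% Hypothetical helper: returns the text matched by Pattern in the
%% example subject, using the extra compile options in Opts.
first_comment(Pattern, Opts) ->
    Options = [{capture, first, list} | Opts],
    {match, [Text]} = re:run(subject(), Pattern, Options),
    Text.
```

       With no options, "/\\*.*\\*/" returns the whole subject and
       "/\\*.*?\\*/" returns "/* first comment */". With option ungreedy the
       two results are swapped, because the question mark then makes the
       quantifier greedy.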
3229
3230       When a parenthesized subpattern is quantified  with  a  minimum  repeat
3231       count  that  is  > 1 or with a limited maximum, more memory is required
3232       for the compiled pattern, in proportion to the size of the  minimum  or
3233       maximum.
3234
3235       If  a  pattern starts with .* or .{0,} and option dotall (equivalent to
3236       Perl option /s) is set, thus allowing the dot to  match  newlines,  the
3237       pattern  is  implicitly  anchored,  because  whatever  follows is tried
3238       against every character position in the subject string. So, there is no
3239       point  in  retrying  the overall match at any position after the first.
3240       PCRE normally treats such a pattern as if it was preceded by \A.
3241
3242       In cases where it is known that the subject  string  contains  no  new‐
3243       lines,  it  is  worth  setting  dotall  to obtain this optimization, or
3244       alternatively using ^ to indicate anchoring explicitly.
3245
3246       However, there are some cases where the optimization  cannot  be  used.
3247       When  .* is inside capturing parentheses that are the subject of a back
3248       reference elsewhere in the pattern, a match at the start can fail where
3249       a later one succeeds. Consider, for example:
3250
3251       (.*)abc\1
3252
3253       If the subject is "xyz123abc123", the match point is the fourth charac‐
3254       ter. Therefore, such a pattern is not implicitly anchored.
3255
3256       Another case where implicit anchoring is not applied is when the  lead‐
3257       ing  .* is inside an atomic group. Once again, a match at the start can
3258       fail where a later one succeeds. Consider the following pattern:
3259
3260       (?>.*?a)b
3261
       It matches "ab" in the subject "aab". The use of the backtracking
       control verbs (*PRUNE) and (*SKIP) also disables this optimization.
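
       Both exceptions to implicit anchoring can be observed by asking run/3
       for the position of the match. This is a sketch with hypothetical
       module and function names; the returned pair is the zero-based start
       offset and the length of the match.

```erlang
-module(anchor_demo).
-export([match_span/2]).

%% Hypothetical helper: returns {Start, Length} of the overall
%% match, or the atom nomatch.
match_span(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, first, index}]) of
        {match, [{Start, Len}]} -> {Start, Len};
        nomatch -> nomatch
    end.
```

       For "xyz123abc123" and "(.*)abc\\1" the result is {3,9}, that is, the
       match starts at the fourth character. For "aab" and "(?>.*?a)b" the
       result is {1,2}.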
3264
3265       When a capturing subpattern is repeated, the value captured is the sub‐
3266       string that matched the final iteration. For example, after
3267
3268       (tweedle[dume]{3}\s*)+
3269
3270       has matched "tweedledum tweedledee", the value  of  the  captured  sub‐
3271       string  is "tweedledee". However, if there are nested capturing subpat‐
3272       terns, the corresponding captured values can have been set in  previous
3273       iterations. For example, after
3274
3275       /(a|(b))+/
3276
3277       matches "aba", the value of the second captured substring is "b".
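
       The values left in repeated capturing subpatterns can be inspected
       with a sketch such as the following. The module and function names
       are hypothetical; the first element of the returned list is the
       overall match.

```erlang
-module(repeat_capture_demo).
-export([captures/2]).

%% Hypothetical helper: returns the overall match followed by all
%% captured substrings, or the atom nomatch.
captures(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, all, list}]) of
        {match, Captured} -> Captured;
        nomatch -> nomatch
    end.
```

       For "aba" and "(a|(b))+" the result is ["aba","a","b"]: group 1 holds
       the final iteration, while group 2 still holds the value set in the
       second iteration.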
3278

ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

3280       With  both  maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
3281       repetition, failure of what follows normally causes the  repeated  item
3282       to  be  re-evaluated to see if a different number of repeats allows the
3283       remaining pattern to match. Sometimes it is  useful  to  prevent  this,
3284       either  to  change the nature of the match, or to cause it to fail ear‐
3285       lier than it otherwise might, when the author of the pattern knows that
3286       there is no point in carrying on.
3287
3288       Consider, for example, the pattern \d+foo when applied to the following
3289       subject line:
3290
3291       123456bar
3292
3293       After matching all six digits and then failing to match "foo", the nor‐
3294       mal  action of the matcher is to try again with only five digits match‐
3295       ing item \d+, and then with four, and so on, before ultimately failing.
3296       "Atomic  grouping"  (a  term taken from Jeffrey Friedl's book) provides
3297       the means for specifying that once a subpattern has matched, it is  not
3298       to be re-evaluated in this way.
3299
3300       If  atomic grouping is used for the previous example, the matcher gives
3301       up immediately on failing to match "foo" the first time.  The  notation
3302       is a kind of special parenthesis, starting with (?> as in the following
3303       example:
3304
3305       (?>\d+)foo
3306
3307       This kind of parenthesis "locks up" the part of the pattern it contains
3308       once  it  has  matched,  and a failure further into the pattern is pre‐
3309       vented from backtracking into it.  Backtracking  past  it  to  previous
3310       items, however, works as normal.
3311
3312       An  alternative  description  is that a subpattern of this type matches
3313       the string of characters that an  identical  standalone  pattern  would
3314       match, if anchored at the current point in the subject string.
3315
3316       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
3317       such as the above example can be thought of as a maximizing repeat that
3318       must  swallow  everything  it can. So, while both \d+ and \d+? are pre‐
3319       pared to adjust the number of digits they match to make  the  remaining
3320       pattern match, (?>\d+) can only match an entire sequence of digits.
3321
3322       Atomic  groups  in general can contain any complicated subpatterns, and
3323       can be nested. However, when the subpattern for an atomic group is just
3324       a  single  repeated  item, as in the example above, a simpler notation,
3325       called a "possessive quantifier" can be used. This consists of an extra
3326       +  character  following a quantifier. Using this notation, the previous
3327       example can be rewritten as
3328
3329       \d++foo
3330
3331       Notice that a possessive quantifier can be used with an  entire  group,
3332       for example:
3333
3334       (abc|xyz){2,3}+
3335
       Possessive quantifiers are always greedy; the setting of option
       ungreedy is ignored. They are a convenient notation for the simpler
       forms of an atomic group. There is no difference in meaning between a
       possessive quantifier and the equivalent atomic group, but there can be
       a performance difference; possessive quantifiers are probably slightly
       faster.
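
       The difference between an ordinary repeat, an atomic group, and a
       possessive quantifier shows up when the item that follows could only
       match by taking back a character. The sketch below uses hypothetical
       names and an invented subject, "12345".

```erlang
-module(atomic_demo).
-export([matches/2]).

%% Hypothetical helper: true if Pattern matches somewhere in Subject.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```

       "\\d+5" matches "12345" because \d+ gives back the final digit. Neither
       "(?>\\d+)5" nor "\\d++5" matches, because the digits, once taken, are
       not given back.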
3342
3343       The possessive quantifier syntax is an extension to the Perl  5.8  syn‐
3344       tax.  Jeffrey  Friedl  originated  the idea (and the name) in the first
3345       edition of his book. Mike McCloskey liked it, so implemented it when he
3346       built  the  Sun  Java  package, and PCRE copied it from there. It ulti‐
3347       mately found its way into Perl at release 5.10.
3348
       PCRE has an optimization that automatically "possessifies" certain
       simple pattern constructs. For example, the sequence A+B is treated as
       A++B, as there is no point in backtracking into a sequence of A's when
       B must follow.
3353
3354       When  a  pattern  contains an unlimited repeat inside a subpattern that
3355       can itself be repeated an unlimited number of  times,  the  use  of  an
3356       atomic  group  is  the  only way to avoid some failing matches taking a
3357       long time. The pattern
3358
3359       (\D+|<\d+>)*[!?]
3360
3361       matches an unlimited number of substrings that either consist  of  non-
3362       digits,  or digits enclosed in <>, followed by ! or ?. When it matches,
3363       it runs quickly. However, if it is applied to
3364
3365       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3366
3367       it takes a long time before reporting  failure.  This  is  because  the
3368       string  can be divided between the internal \D+ repeat and the external
3369       * repeat in many ways, and all must be tried. (The  example  uses  [!?]
3370       rather  than  a single character at the end, as both PCRE and Perl have
3371       an optimization that allows for fast failure when a single character is
3372       used.  They  remember  the last single character that is required for a
3373       match, and fail early if it is not present in the string.) If the  pat‐
3374       tern  is  changed  so that it uses an atomic group, like the following,
3375       sequences of non-digits cannot be broken, and failure happens quickly:
3376
3377       ((?>\D+)|<\d+>)*[!?]
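
       A sketch of the atomic-group version applied to the failing subject.
       The module and function names are hypothetical. Only the atomic form
       is run here; the non-atomic form is deliberately left out because, as
       described above, it can take a long time on this input. The helper
       reports success or failure only, not the extent of the match.

```erlang
-module(fast_fail_demo).
-export([atomic_match/1]).

%% Hypothetical helper: runs the atomic-group pattern and returns
%% the atom match or nomatch.
atomic_match(Subject) ->
    Pattern = "((?>\\D+)|<\\d+>)*[!?]",
    re:run(Subject, Pattern, [{capture, none}]).
```

       Calling atomic_match(lists:duplicate(52, $a)) returns nomatch.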
3378

BACK REFERENCES

3380       Outside a character class, a backslash followed by a  digit  >  0  (and
3381       possibly  further digits) is a back reference to a capturing subpattern
3382       earlier (that is, to its left) in the pattern, provided there have been
3383       that many previous capturing left parentheses.
3384
       However, if the decimal number following the backslash is < 10, it is
       always taken as a back reference, and causes an error only if there are
       not that many capturing left parentheses in the entire pattern. That
       is, the parentheses that are referenced need not be to the left of the
       reference for numbers < 10. A "forward back reference" of this type
       can make sense when a repetition is involved and the subpattern to the
       right has participated in an earlier iteration.
3392
3393       It  is  not  possible to have a numerical "forward back reference" to a
3394       subpattern whose number is 10 or more using this syntax, as a  sequence
3395       such  as  \50  is interpreted as a character defined in octal. For more
3396       details of the handling of digits following a  backslash,  see  section
3397       Non-Printing  Characters  earlier.  There is no such problem when named
3398       parentheses are used. A back reference to any  subpattern  is  possible
3399       using named parentheses (see below).
3400
3401       Another  way  to avoid the ambiguity inherent in the use of digits fol‐
3402       lowing a backslash is to use the \g escape sequence. This  escape  must
3403       be  followed  by  an  unsigned  number or a negative number, optionally
3404       enclosed in braces. The following examples are identical:
3405
3406       (ring), \1
3407       (ring), \g1
3408       (ring), \g{1}
3409
3410       An unsigned number specifies an absolute reference without the  ambigu‐
3411       ity that is present in the older syntax. It is also useful when literal
3412       digits follow the reference. A negative number is a relative reference.
3413       Consider the following example:
3414
3415       (abc(def)ghi)\g{-1}
3416
3417       The sequence \g{-1} is a reference to the most recently started captur‐
3418       ing subpattern before \g, that is, it is equivalent to \2 in this exam‐
3419       ple.  Similarly,  \g{-2} would be equivalent to \1. The use of relative
3420       references can be helpful in long patterns, and also in  patterns  that
3421       are  created  by  joining  fragments containing references within them‐
3422       selves.
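
       The \g forms can be compared with a sketch like the following. The
       module and function names are hypothetical, and the subjects are
       invented; full_match/2 anchors the pattern at both ends so that only
       whole-subject matches count.

```erlang
-module(g_escape_demo).
-export([full_match/2]).

%% Hypothetical helper: true if Pattern matches the whole subject.
full_match(Subject, Pattern) ->
    Anchored = "^(?:" ++ Pattern ++ ")$",
    re:run(Subject, Anchored, [{capture, none}]) =:= match.
```

       The braces also separate the reference from a literal digit that
       follows it: "(a)\\g{1}1" matches "aa1".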
3423
       A back reference matches whatever matched the capturing subpattern in
       the current subject string, rather than anything matching the
       subpattern itself (section Subpatterns as Subroutines describes a way
       of doing that). So, the following pattern matches "sense and
       sensibility" and "response and responsibility", but not "sense and
       responsibility":
3429
3430       (sens|respons)e and \1ibility
3431
3432       If caseful matching is in force at the time of the back reference,  the
3433       case  of  letters  is relevant. For example, the following matches "rah
3434       rah" and "RAH RAH", but not "RAH rah", although the original  capturing
3435       subpattern is matched caselessly:
3436
3437       ((?i)rah)\s+\1
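
       Both back reference examples above can be checked with a short sketch.
       The module and function names are hypothetical.

```erlang
-module(backref_demo).
-export([sense/1, rah/1]).

%% The back reference repeats the text that group 1 matched.
sense(Subject) ->
    matches(Subject, "(sens|respons)e and \\1ibility").

%% Group 1 is matched caselessly, but \1 is compared casefully.
rah(Subject) ->
    matches(Subject, "((?i)rah)\\s+\\1").

%% Hypothetical helper: true if Pattern matches somewhere in Subject.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```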
3438
3439       There  are many different ways of writing back references to named sub‐
3440       patterns. The .NET syntax \k{name} and  the  Perl  syntax  \k<name>  or
3441       \k'name'  are supported, as is the Python syntax (?P=name). The unified
3442       back reference syntax in Perl 5.10, in which \g can be  used  for  both
3443       numeric  and  named references, is also supported. The previous example
3444       can be rewritten in the following ways:
3445
3446       (?<p1>(?i)rah)\s+\k<p1>
3447       (?'p1'(?i)rah)\s+\k{p1}
3448       (?P<p1>(?i)rah)\s+(?P=p1)
3449       (?<p1>(?i)rah)\s+\g{p1}
3450
3451       A subpattern that is referenced by  name  can  appear  in  the  pattern
3452       before or after the reference.
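
       The four spellings listed above are expected to behave identically.
       A sketch that runs each of them against the same subject follows; the
       module and function names are hypothetical.

```erlang
-module(named_backref_demo).
-export([patterns/0, results/1]).

%% The four equivalent ways of writing the named back reference.
patterns() ->
    ["(?<p1>(?i)rah)\\s+\\k<p1>",
     "(?'p1'(?i)rah)\\s+\\k{p1}",
     "(?P<p1>(?i)rah)\\s+(?P=p1)",
     "(?<p1>(?i)rah)\\s+\\g{p1}"].

%% Returns one match | nomatch atom per pattern, in the same order.
results(Subject) ->
    [re:run(Subject, P, [{capture, none}]) || P <- patterns()].
```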
3453
       There can be more than one back reference to the same subpattern. If a
       subpattern has not been used in a particular match, any back reference
       to it always fails. For example, the following pattern always fails if
       it starts to match "a" rather than "bc":
3458
3459       (a|(bc))\2
3460
3461       As there can be many capturing parentheses in  a  pattern,  all  digits
3462       following the backslash are taken as part of a potential back reference
3463       number. If the pattern continues with a digit character, some delimiter
3464       must  be  used  to  terminate the back reference. If option extended is
3465       set, this can be whitespace. Otherwise an empty  comment  (see  section
3466       Comments) can be used.
3467
3468       Recursive Back References
3469
3470       A  back reference that occurs inside the parentheses to which it refers
3471       fails when the subpattern is first used, so, for example,  (a\1)  never
3472       matches. However, such references can be useful inside repeated subpat‐
3473       terns. For example, the following pattern matches any  number  of  "a"s
3474       and also "aba", "ababbaa", and so on:
3475
3476       (a|b\1)+
3477
3478       At  each  iteration  of  the subpattern, the back reference matches the
3479       character string corresponding to the previous iteration. In order  for
3480       this  to  work,  the pattern must be such that the first iteration does
3481       not need to match the back reference. This can be done  using  alterna‐
3482       tion,  as  in  the  example above, or by a quantifier with a minimum of
3483       zero.
3484
3485       Back references of this type cause the group that they reference to  be
3486       treated  as  an  atomic group. Once the whole group has been matched, a
3487       subsequent matching failure cannot cause backtracking into  the  middle
3488       of the group.
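
           The following sketch checks the strings mentioned above against
           the pattern. The anchors ^ and $ are added here (they are not
           part of the original example) so that only whole-subject matches
           count, and "ab" and "bab" are extra subjects chosen as
           counterexamples.

```erlang
Pattern = "^(a|b\\1)+$".

%% Each later iteration may match "b" followed by whatever the previous
%% iteration matched; the first iteration can only use the "a" branch.
[{S, re:run(S, Pattern, [{capture, none}])}
 || S <- ["aaa", "aba", "ababbaa", "ab", "bab"]].
%% [{"aaa",match},
%%  {"aba",match},
%%  {"ababbaa",match},
%%  {"ab",nomatch},
%%  {"bab",nomatch}]
```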
3489

ASSERTIONS

3491       An  assertion  is  a  test on the characters following or preceding the
3492       current matching point that does not consume any characters. The simple
3493       assertions  coded  as \b, \B, \A, \G, \Z, \z, ^, and $ are described in
3494       the previous sections.
3495
3496       More complicated assertions are coded as  subpatterns.  There  are  two
3497       kinds:  those  that  look  ahead of the current position in the subject
3498       string, and those that look  behind  it.  An  assertion  subpattern  is
3499       matched  in  the  normal way, except that it does not cause the current
3500       matching position to be changed.
3501
3502       Assertion subpatterns are not capturing subpatterns. If such an  asser‐
3503       tion  contains  capturing  subpatterns within it, these are counted for
3504       the purposes of numbering the capturing subpatterns in the  whole  pat‐
3505       tern.  However,  substring  capturing  is done only for positive asser‐
3506       tions. (Perl sometimes, but not always, performs capturing in  negative
3507       assertions.)
3508
3509   Warning:
3510       If  a  positive  assertion containing one or more capturing subpatterns
3511       succeeds, but failure to match later in the pattern causes backtracking
3512       over  this  assertion, the captures within the assertion are reset only
3513       if no higher numbered captures are already set. This is, unfortunately,
3514       a fundamental limitation of the current implementation, and as PCRE1 is
3515       now in maintenance-only status, it is unlikely ever to change.
3516
3517
3518       For compatibility with Perl, assertion  subpatterns  can  be  repeated.
3519       Although it makes no sense to assert the same thing many times, the
3520       side effect of capturing parentheses can  occasionally  be  useful.  In
3521       practice, there are only three cases:
3522
3523         * If  the  quantifier  is  {0},  the assertion is never obeyed during
3524           matching. However, it can contain internal capturing  parenthesized
3525           groups that are called from elsewhere through the subroutine mecha‐
3526           nism.
3527
3528         * If the quantifier is {0,n}, where n > 0, it is treated as if it were
3529           {0,1}.  At  runtime,  the remaining pattern match is tried with and
3530           without the assertion; the order depends on the greediness  of  the
3531           quantifier.
3532
3533         * If  the  minimum  repetition is > 0, the quantifier is ignored. The
3534           assertion is obeyed only once when encountered during matching.
3535
3536       Lookahead Assertions
3537
3538       Lookahead assertions start with (?= for positive assertions and (?! for
3539       negative assertions. For example, the following matches a word followed
3540       by a semicolon, but does not include the semicolon in the match:
3541
3542       \w+(?=;)
3543
3544       The following matches any occurrence of "foo" that is not  followed  by
3545       "bar":
3546
3547       foo(?!bar)
3548
3549       Notice that the apparently similar pattern
3550
3551       (?!foo)bar
3552
3553       does  not  find  an  occurrence  of "bar" that is preceded by something
3554       other than "foo". It finds any occurrence of "bar" whatsoever,  as  the
3555       assertion  (?!foo)  is  always  true when the next three characters are
3556       "bar". A lookbehind assertion is needed to achieve the other effect.
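
           The three lookahead patterns above can be compared in the shell.
           This is a sketch with invented subjects; offsets are reported as
           {Start, Length} pairs.

```erlang
%% The semicolon is required but is not part of the match.
re:run("word; next", "\\w+(?=;)", [{capture, first, list}]).
%% {match,["word"]}

%% Only the second "foo" is not followed by "bar".
re:run("foobar foobaz", "foo(?!bar)").
%% {match,[{7,3}]}

%% (?!foo) is tested where "bar" starts, so it is trivially true there
%% and the "bar" preceded by "foo" is still found.
re:run("foobar", "(?!foo)bar").
%% {match,[{3,3}]}
```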
3557
3558       If you want to force a matching failure at some point in a pattern, the
3559       most  convenient  way  to do it is with (?!). An empty string always
3560       matches, so an assertion that requires there not to be an empty
3561       string  must always fail. The backtracking control verb (*FAIL) or (*F)
3562       is a synonym for (?!).
3563
3564       Lookbehind Assertions
3565
3566       Lookbehind assertions start with (?<= for positive assertions and  (?<!
3567       for negative assertions. For example, the following finds an occurrence
3568       of "bar" that is not preceded by "foo":
3569
3570       (?<!foo)bar
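
           A brief sketch of this pattern in the shell, with subjects made
           up for illustration:

```erlang
%% The first "bar" is preceded by "foo", the second one is not.
re:run("foobar bazbar", "(?<!foo)bar").
%% {match,[{10,3}]}

re:run("foobar", "(?<!foo)bar").
%% nomatch

%% With fewer than three characters before "bar", "foo" cannot precede
%% it, so the negative assertion holds.
re:run("bar", "(?<!foo)bar").
%% {match,[{0,3}]}
```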
3571
3572       The contents of a lookbehind assertion are restricted such that all the
3573       strings it matches must have a fixed length. However, if there are many
3574       top-level alternatives, they do not all have to  have  the  same  fixed
3575       length. Thus, the following is permitted:
3576
3577       (?<=bullock|donkey)
3578
3579       The following causes an error at compile time:
3580
3581       (?<!dogs?|cats?)
3582
3583       Branches  that match different length strings are permitted only at the
3584       top-level of a lookbehind assertion. This is an extension compared with
3585       Perl,  which  requires all branches to match the same length of string.
3586       An assertion such as the following is not permitted, as its single top-
3587       level branch can match two different lengths:
3588
3589       (?<=ab(c|de))
3590
3591       However,  it  is  acceptable  to PCRE if rewritten to use two top-level
3592       branches:
3593
3594       (?<=abc|abde)
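
           The fixed-length rule is enforced at compile time, so it can be
           checked with re:compile/1 alone. The helper Check below is a
           hypothetical convenience for this sketch; it keeps only the ok or
           error tag of the result.

```erlang
%% Reduce the result of re:compile/1 to its tag: ok or error.
Check = fun(P) -> element(1, re:compile(P)) end.

[{P, Check(P)} || P <- ["(?<=bullock|donkey)",
                        "(?<!dogs?|cats?)",
                        "(?<=ab(c|de))",
                        "(?<=abc|abde)"]].
%% [{"(?<=bullock|donkey)",ok},
%%  {"(?<!dogs?|cats?)",error},
%%  {"(?<=ab(c|de))",error},
%%  {"(?<=abc|abde)",ok}]
```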
3595
3596       Sometimes the escape sequence \K (see above) can be used instead  of  a
3597       lookbehind assertion to get round the fixed-length restriction.
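
           As a sketch of that workaround, the pattern below requires a
           prefix of variable length ("foo" followed by any number of
           digits) but reports only "bar". The pattern and subject are
           assumptions made for this example.

```erlang
%% \K resets the reported start of the match, so the variable-length
%% prefix is required but not included in the result.
re:run("foo123bar", "foo\\d*\\Kbar").
%% {match,[{6,3}]}

%% The corresponding lookbehind is rejected, as \d* has no fixed length.
element(1, re:compile("(?<=foo\\d*)bar")).
%% error
```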
3598
3599       The  implementation  of lookbehind assertions is, for each alternative,
3600       to move the current position back temporarily by the fixed  length  and
3601       then try to match. If there are insufficient characters before the cur‐
3602       rent position, the assertion fails.
3603
3604       In a UTF mode, PCRE does not allow the \C escape (which matches a  sin‐
3605       gle  data  unit even in a UTF mode) to appear in lookbehind assertions,
3606       as it makes it impossible to calculate the length  of  the  lookbehind.
3607       The \X and \R escapes, which can match different numbers of data units,
3608       are not permitted either.
3609
3610       "Subroutine" calls (see below), such as (?2) or (?&X), are permitted in
3611       lookbehinds,  as  long as the subpattern matches a fixed-length string.
3612       Recursion, however, is not supported.
3613
3614       Possessive quantifiers can be used with lookbehind assertions to  spec‐
3615       ify  efficient  matching  of fixed-length strings at the end of subject
3616       strings. Consider the following simple pattern when applied to  a  long
3617       string that does not match:
3618
3619       abcd$
3620
3621       As matching proceeds from left to right, PCRE looks for each "a" in the
3622       subject and then sees if what follows matches the remaining pattern. If
3623       the pattern is specified as
3624
3625       ^.*abcd$
3626
3627       the  initial  .* matches the entire string at first. However, when this
3628       fails (as there is no following "a"), it backtracks to  match  all  but
3629       the  last  character,  then all but the last two characters, and so on.
3630       Once again the search for "a" covers the entire string, from  right  to
3631       left, so we are no better off. However, if the pattern is written as
3632
3633       ^.*+(?<=abcd)
3634
3635       there  can  be  no backtracking for the .*+ item; it can match only the
3636       entire string. The subsequent lookbehind assertion does a  single  test
3637       on  the last four characters. If it fails, the match fails immediately.
3638       For long strings, this approach makes a significant difference  to  the
3639       processing time.
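
           The effect of the possessive form on the result (not on timing)
           can be sketched with two short subjects. Both are invented here,
           and neither is long enough to show the performance difference
           described above.

```erlang
Pattern = "^.*+(?<=abcd)".

%% .*+ consumes the whole subject; the lookbehind then inspects the
%% last four characters once.
re:run("xyzabcd", Pattern).
%% {match,[{0,7}]}

%% The subject does not end in "abcd", and .*+ cannot give anything back.
re:run("xyzabcdx", Pattern).
%% nomatch
```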
3640
3641       Using Multiple Assertions
3642
3643       Many assertions (of any sort) can occur in succession. For example, the
3644       following matches "foo" preceded by three digits that are not "999":
3645
3646       (?<=\d{3})(?<!999)foo
3647
3648       Notice that each of the assertions is applied independently at the same
3649       point  in  the subject string. First there is a check that the previous
3650       three characters are all digits, and then there is  a  check  that  the
3651       same  three characters are not "999". This pattern does not match "foo"
3652       preceded by six characters, the first three of which are digits and the
3653       last three of which are not "999". For example, it does not match
3654       "123abcfoo". A pattern to do that is the following:
3655
3656       (?<=\d{3}...)(?<!999)foo
3657
3658       This time the first assertion looks at the  preceding  six  characters,
3659       checks  that  the first three are digits, and then the second assertion
3660       checks that the preceding three characters are not "999".
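
           The difference between the two patterns can be tabulated in the
           shell. This is a sketch; the subjects other than "123abcfoo" are
           assumptions added for contrast.

```erlang
P1 = "(?<=\\d{3})(?<!999)foo".      % three digits, not "999", then foo
P2 = "(?<=\\d{3}...)(?<!999)foo".   % three digits, three more chars, foo

%% Each tuple is {Subject, result with P1, result with P2}.
[{S, re:run(S, P1, [{capture, none}]), re:run(S, P2, [{capture, none}])}
 || S <- ["123foo", "999foo", "123abcfoo", "123999foo"]].
%% [{"123foo",match,nomatch},
%%  {"999foo",nomatch,nomatch},
%%  {"123abcfoo",nomatch,match},
%%  {"123999foo",nomatch,nomatch}]
```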
3661
3662       Assertions can be nested in any combination. For example, the following
3663       matches an occurrence of "baz" that is preceded by "bar", which in turn
3664       is not preceded by "foo":
3665
3666       (?<=(?<!foo)bar)baz
3667
3668       The following pattern matches "foo" preceded by three  digits  and  any
3669       three characters that are not "999":
3670
3671       (?<=\d{3}(?!999)...)foo
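
           Both nested patterns can be exercised as follows. The subjects
           are invented for this sketch.

```erlang
Nested = "(?<=(?<!foo)bar)baz".
re:run("xxxbarbaz", Nested).
%% {match,[{6,3}]}
re:run("foobarbaz", Nested).
%% nomatch

%% A lookahead inside a lookbehind: the three characters after the
%% digits must not be "999".
Inner = "(?<=\\d{3}(?!999)...)foo".
re:run("123abcfoo", Inner).
%% {match,[{6,3}]}
re:run("123999foo", Inner).
%% nomatch
```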
3672

CONDITIONAL SUBPATTERNS

3674       It  is possible to cause the matching process to obey a subpattern con‐
3675       ditionally or to choose between two alternative subpatterns,  depending
3676       on  the result of an assertion, or whether a specific capturing subpat‐
3677       tern has already been matched. The following are the two possible forms
3678       of conditional subpattern:
3679
3680       (?(condition)yes-pattern)
3681       (?(condition)yes-pattern|no-pattern)
3682
3683       If  the  condition is satisfied, the yes-pattern is used, otherwise the
3684       no-pattern (if present). If more than two  alternatives  exist  in  the
3685       subpattern,  a  compile-time error occurs. Each of the two alternatives
3686       can itself contain nested subpatterns of  any  form,  including  condi‐
3687       tional subpatterns; the restriction to two alternatives applies only at
3688       the level of the condition. The following pattern fragment is an  exam‐
3689       ple where the alternatives are complex:
3690
3691       (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
3692
3693       There  are  four  kinds of condition: references to subpatterns, refer‐
3694       ences to recursion, a pseudo-condition called DEFINE, and assertions.
3695
3696       Checking for a Used Subpattern By Number
3697
3698       If the text between the parentheses consists of a sequence  of  digits,
3699       the condition is true if a capturing subpattern of that number has pre‐
3700       viously matched. If more than one capturing subpattern  with  the  same
3701       number  exists (see section  Duplicate Subpattern Numbers earlier), the
3702       condition is true if any of them have matched. An alternative  notation
3703       is  to  precede the digits with a plus or minus sign. In this case, the
3704       subpattern number is relative rather than absolute. The  most  recently
3705       opened parentheses can be referenced by (?(-1), the next most recent by
3706       (?(-2), and so on. Inside loops, it can also make  sense  to  refer  to
3707       subsequent  groups. The next parentheses to be opened can be referenced
3708       as (?(+1), and so on. (The value zero in any  of  these  forms  is  not
3709       used; it provokes a compile-time error.)
3710
3711       Consider  the  following pattern, which contains non-significant white‐
3712       space to make it more readable (assume option extended) and  to  divide
3713       it into three parts for ease of discussion:
3714
3715       ( \( )?    [^()]+    (?(1) \) )
3716
3717       The  first  part  matches  an optional opening parenthesis, and if that
3718       character is present, sets it as the first captured substring. The sec‐
3719       ond  part  matches one or more characters that are not parentheses. The
3720       third part is a conditional subpattern that tests whether the first set
3721       of parentheses matched or not. If they did, that is, if the subject started
3722       with an opening parenthesis, the condition is true, and so the yes-pat‐
3723       tern  is  executed and a closing parenthesis is required. Otherwise, as
3724       no-pattern is not present, the subpattern  matches  nothing.  That  is,
3725       this pattern matches a sequence of non-parentheses, optionally enclosed
3726       in parentheses.
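
           A sketch of this pattern in the shell follows. The anchors ^ and
           $ are additions made here, so that an unbalanced subject cannot
           match through a shorter substring; the subjects are invented.

```erlang
Pattern = "^ ( \\( )? [^()]+ (?(1) \\) ) $".
Opts = [extended, {capture, none}].

%% A closing parenthesis is required exactly when group 1 matched.
[{S, re:run(S, Pattern, Opts)} || S <- ["(abc)", "abc", "(abc", "abc)"]].
%% [{"(abc)",match},{"abc",match},{"(abc",nomatch},{"abc)",nomatch}]
```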
3727
3728       If this pattern is embedded in a larger one, a relative  reference  can
3729       be used:
3730
           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...

3731       This  makes  the  fragment independent of the parentheses in the larger
3732       pattern.
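
           As a sketch of a relative condition inside a larger pattern, the
           example below puts a hypothetical "name=" prefix, with its own
           capturing group, in front of the fragment. (?(-1) then refers to
           the most recently opened parentheses, whatever their absolute
           number.

```erlang
%% Group 1 is the name; (?(-1)...) tests the optional "(" group that
%% was opened most recently, which happens to be group 2 here.
Pattern = "^ (\\w+) = ( \\( )? [^()]+ (?(-1) \\) ) $".
Opts = [extended, {capture, none}].

[{S, re:run(S, Pattern, Opts)} || S <- ["x=(1+2)", "x=1+2", "x=(1+2"]].
%% [{"x=(1+2)",match},{"x=1+2",match},{"x=(1+2",nomatch}]
```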
3733
3734       Checking for a Used Subpattern By Name
3735
3736       Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
3737       used  subpattern  by  name.  For compatibility with earlier versions of
3738       PCRE, which had this facility before Perl, the syntax  (?(name)...)  is
3739       also recognized.
3740
3741       Rewriting the previous example to use a named subpattern gives:
3742
3743       (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
3744
3745       If  the  name used in a condition of this kind is a duplicate, the test
3746       is applied to all subpatterns of the same name, and is true if any  one
3747       of them has matched.
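
           The named form can be combined with capture by name, as in this
           sketch. An unset group is returned as an empty list when the
           capture type is list.

```erlang
Pattern = "(?<OPEN> \\( )? [^()]+ (?(<OPEN>) \\) )".
Opts = [extended, {capture, ['OPEN'], list}].

%% OPEN matched, so the closing parenthesis was required.
re:run("(abc)", Pattern, Opts).
%% {match,["("]}

%% OPEN did not take part in the match; the condition was false.
re:run("abc", Pattern, Opts).
%% {match,[[]]}
```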
3748
3749       Checking for Pattern Recursion
3750
3751       If the condition is the string (R), and there is no subpattern with the
3752       name R, the condition is true if a recursive call to the whole  pattern
3753       or any subpattern has been made. If digits or a name preceded by amper‐
3754       sand follow the letter R, for example:
3755
3756       (?(R3)...) or (?(R&name)...)
3757
3758       the condition is true if the most recent recursion is into a subpattern
3759       whose number or name is given. This condition does not check the entire
3760       recursion stack. If the name used in a condition  of  this  kind  is  a
3761       duplicate, the test is applied to all subpatterns of the same name, and
3762       is true if any one of them is the most recent recursion.
3763
3764       At "top-level", all these recursion test conditions are false. The syn‐
3765       tax for recursive patterns is described below.
3766
3767       Defining Subpatterns for Use By Reference Only
3768
3769       If  the  condition  is  the string (DEFINE), and there is no subpattern
3770       with the name DEFINE, the condition is  always  false.  In  this  case,
3771       there  can  be  only  one  alternative  in the subpattern. It is always
3772       skipped if control reaches this point  in  the  pattern.  The  idea  of
3773       DEFINE  is that it can be used to define "subroutines" that can be ref‐
3774       erenced from elsewhere. (The use of subroutines  is  described  below.)
3775       For   example,   a   pattern   to   match  an  IPv4  address,  such  as
3776       "192.168.23.245", can be written like this (ignore whitespace and  line
3777       breaks):
3778
3779       (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) \b (?&byte) (\.(?&byte)){3} \b
3780
3781       The  first  part  of  the  pattern  is a DEFINE group inside which
3782       another group named "byte" is defined. This matches an individual
3783       component of an IPv4 address (a number < 256). When matching takes place,
3784       this part of the pattern is skipped, as DEFINE acts like a false condi‐
3785       tion. The remaining pattern uses references to the named group to match
3786       the four dot-separated components of an IPv4 address,  insisting  on  a
3787       word boundary at each end.
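
           The IPv4 pattern can be tried as shown in this sketch. Option
           extended is needed because the pattern contains whitespace, and
           the surrounding text in the first subject is invented.

```erlang
%% Adjacent string literals are concatenated by the Erlang compiler.
Pattern = "(?(DEFINE) (?<byte> 2[0-4]\\d | 25[0-5] | 1\\d\\d | [1-9]?\\d) )"
          " \\b (?&byte) (\\.(?&byte)){3} \\b".

re:run("host 192.168.23.245 up", Pattern,
       [extended, {capture, first, list}]).
%% {match,["192.168.23.245"]}

%% 256 is not a valid component, and no other part of the subject has
%% four dot-separated components.
re:run("256.1.1.1", Pattern, [extended, {capture, none}]).
%% nomatch
```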
3788
3789       Assertion Conditions
3790
3791       If  the  condition  is  not  in any of the above formats, it must be an
3792       assertion. This can be a positive or negative lookahead  or  lookbehind
3793       assertion.  Consider  the following pattern, containing non-significant
3794       whitespace, and with the two alternatives on the second line:
3795
3796       (?(?=[^a-z]*[a-z])
3797       \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
3798
3799       The condition  is  a  positive  lookahead  assertion  that  matches  an
3800       optional  sequence  of  non-letters  followed  by a letter. That is, it
3801       tests for the presence of at least one letter in the subject. If a let‐
3802       ter  is  found,  the  subject is matched against the first alternative,
3803       otherwise it is  matched  against  the  second.  This  pattern  matches
3804       strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
3805       letters and dd are digits.
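
           A sketch of the assertion condition in use. The third subject is
           an assumption added to show that a letter anywhere ahead in the
           subject selects the first alternative.

```erlang
Pattern = "(?(?=[^a-z]*[a-z]) \\d{2}-[a-z]{3}-\\d{2} | \\d{2}-\\d{2}-\\d{2} )".

re:run("12-abc-34", Pattern, [extended]).
%% {match,[{0,9}]}
re:run("12-34-56", Pattern, [extended]).
%% {match,[{0,8}]}

%% The trailing "x" makes the condition true, so only the dd-aaa-dd
%% form is tried, and it cannot match.
re:run("12-34-56 x", Pattern, [extended, {capture, none}]).
%% nomatch
```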
3806

COMMENTS

3808       There are two ways to include comments in patterns that  are  processed
3809       by PCRE. In both cases, the start of the comment must not be in a char‐
3810       acter class, or in the middle of any other sequence of related  charac‐
3811       ters  such  as  (?: or a subpattern name or number. The characters that
3812       make up a comment play no part in the pattern matching.
3813
3814       The sequence (?# marks the start of a comment that continues up to  the
3815       next  closing  parenthesis.  Nested  parentheses  are not permitted. If
3816       option extended is set, an unescaped # character also introduces a
3817       comment,  which  in  this  case continues to immediately after the next
3818       newline character or character sequence in the pattern.  Which  charac‐
3819       ters are interpreted as newlines is controlled by the options passed to
3820       a compiling function or by a special sequence at the start of the  pat‐
3821       tern, as described in section  Newline Conventions earlier.
3822
3823       Notice  that  the  end  of  this  type  of comment is a literal newline
3824       sequence in the pattern; escape sequences that happen  to  represent  a
3825       newline  do not count. For example, consider the following pattern when
3826       extended is set, and the default newline convention is in force:
3827
3828       abc #comment \n still comment
3829
3830       On encountering character #, the pattern compiler skips along, looking for a
3831       newline in the pattern. The sequence \n is still literal at this stage,
3832       so it does not terminate the comment. Only a character with code  value
3833       0x0a (the default newline) does so.
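
           In Erlang source the distinction depends on how the string
           literal is written, as this sketch shows. It assumes the default
           newline convention (LF); the subjects are invented.

```erlang
%% A (?# ... ) comment works with or without option extended.
re:run("abcdef", "abc(?# inline comment)def").
%% {match,[{0,6}]}

%% "\\n" puts the two characters \ and n into the pattern, so the
%% comment runs to the end of the pattern.
re:run("abc", "abc #comment \\n still comment", [extended]).
%% {match,[{0,3}]}

%% "\n" puts a real LF (0x0a) into the pattern, which ends the comment;
%% "def" is then part of the pattern again.
re:run("abcdef", "abc #comment \n def", [extended]).
%% {match,[{0,6}]}
```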
3834

RECURSIVE PATTERNS

3836       Consider  the problem of matching a string in parentheses, allowing for
3837       unlimited nested parentheses. Without the use of  recursion,  the  best
3838       that  can  be  done  is  to use a pattern that matches up to some fixed
3839       depth of nesting. It is not possible to  handle  an  arbitrary  nesting
3840       depth.
3841
3842       For some time, Perl has provided a facility that allows regular expres‐
3843       sions to recurse (among other things). It does  this  by  interpolating
3844       Perl  code  in the expression at runtime, and the code can refer to the
3845       expression itself. A Perl pattern using code interpolation to solve the
3846       parentheses problem can be created like this:
3847
3848       $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
3849
3850       Item  (?p{...})  interpolates  Perl  code  at runtime, and in this case
3851       refers recursively to the pattern in which it appears.
3852
3853       Obviously, PCRE cannot support the interpolation of Perl code. Instead,
3854       it supports special syntax for recursion of the entire pattern, and for
3855       individual subpattern recursion. After its  introduction  in  PCRE  and
3856       Python,  this  kind  of  recursion  was  later  introduced into Perl at
3857       release 5.10.
3858
3859       A special item that consists of (? followed by a number > 0 and a clos‐
3860       ing parenthesis is a recursive subroutine call of the subpattern of the
3861       given number, if it occurs inside that subpattern. (If  not,  it  is  a
3862       non-recursive subroutine call, which is described in the next section.)
3863       The special item (?R) or (?0) is a recursive call of the entire regular
3864       expression.
3865
3866       This  PCRE  pattern  solves the nested parentheses problem (assume that
3867       option extended is set so that whitespace is ignored):
3868
3869       \( ( [^()]++ | (?R) )* \)
3870
3871       First it matches an opening parenthesis. Then it matches any number  of
3872       substrings,  which  can  either  be  a sequence of non-parentheses or a
3873       recursive match of the pattern itself (that is, a  correctly  parenthe‐
3874       sized  substring).  Finally  there is a closing parenthesis. Notice the
3875       use of a possessive quantifier to avoid backtracking into sequences  of
3876       non-parentheses.
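
           A sketch of this pattern in the shell, with subjects invented for
           illustration:

```erlang
Pattern = "\\( ( [^()]++ | (?R) )* \\)".

%% The balanced substring is found inside the surrounding text.
re:run("x(ab(cd)ef)y", Pattern, [extended, {capture, first, list}]).
%% {match,["(ab(cd)ef)"]}

%% No closing parenthesis is available at any nesting level.
re:run("(ab(cd", Pattern, [extended, {capture, none}]).
%% nomatch
```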
3877
3878       If this were part of a larger pattern, you would not want to recurse the
3879       entire pattern, so instead you can use:
3880
3881       ( \( ( [^()]++ | (?1) )* \) )
3882
3883       The pattern is here within parentheses so that the recursion refers  to
3884       them instead of the whole pattern.
3885
3886       In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
3887       tricky. This is made easier by the use of relative references.  Instead
3888       of  (?1) in the pattern above, you can write (?-2) to refer to the sec‐
3889       ond most recently opened parentheses preceding the recursion. That  is,
3890       a negative number counts capturing parentheses leftwards from the point
3891       at which it is encountered.
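
           The absolute and relative forms can be compared inside a larger
           pattern. In this sketch the leading \w+ and the anchors are
           assumptions standing in for "the larger pattern"; they lie
           outside group 1 and so are not part of the recursion.

```erlang
Abs = "^ \\w+ ( \\( ( [^()]++ | (?1) )* \\) ) $".
Rel = "^ \\w+ ( \\( ( [^()]++ | (?-2) )* \\) ) $".
Opts = [extended, {capture, [1], list}].

%% Both forms recurse into group 1 and give the same result.
re:run("f(a(b)c)", Abs, Opts).
%% {match,["(a(b)c)"]}
re:run("f(a(b)c)", Rel, Opts).
%% {match,["(a(b)c)"]}
```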
3892
3893       It is also possible to refer to later opened  parentheses,  by  writing
3894       references  such  as  (?+2). However, these cannot be recursive, as the
3895       reference is not inside the parentheses that are referenced.  They  are
3896       always  non-recursive  subroutine  calls, as described in the next sec‐
3897       tion.
3898
3899       An alternative approach is to use named parentheses instead.  The  Perl
3900       syntax  for this is (?&name). The earlier PCRE syntax (?P>name) is also
3901       supported. We can rewrite the above example as follows:
3902
3903       (?<pn> \( ( [^()]++ | (?&pn) )* \) )
3904
3905       If there is more than one subpattern with the same name,  the  earliest
3906       one is used.
3907
3908       This  particular  example  pattern that we have studied contains nested
3909       unlimited repeats, and so the use of a possessive quantifier for match‐
3910       ing  strings  of non-parentheses is important when applying the pattern
3911       to strings that do not match. For example, when this pattern is applied
3912       to
3913
3914       (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
3915
3916       it gives "no match" quickly. However, if a possessive quantifier is not
3917       used, the match runs for a long time, as there are  so  many  different
3918       ways  the  +  and  *  repeats can carve up the subject, and all must be
3919       tested before failure can be reported.
3920
3921       At the end of a match, the values of capturing  parentheses  are  those
3922       from the outermost level. If the pattern above is matched against
3923
3924       (ab(cd)ef)
3925
3926       the  value  for  the  inner capturing parentheses (numbered 2) is "ef",
3927       which is the last value taken on at the top-level. If a capturing  sub‐
3928       pattern  is  not  matched at the top level, its final captured value is
3929       unset, even if it was (temporarily) set at a deeper  level  during  the
3930       matching process.
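
           The captured values described above can be inspected directly,
           as in this sketch using the named form of the pattern:

```erlang
Pattern = "(?<pn> \\( ( [^()]++ | (?&pn) )* \\) )".

%% The list holds the whole match, group 1 (pn), and group 2. Group 2
%% reports "ef", its last value at the outermost level, not the "cd"
%% that was matched inside the recursive call.
re:run("(ab(cd)ef)", Pattern, [extended, {capture, all, list}]).
%% {match,["(ab(cd)ef)","(ab(cd)ef)","ef"]}
```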
3931
3932       Do not confuse item (?R) with condition (R), which tests for recursion.
3933       Consider the following pattern, which matches text in  angle  brackets,
3934       allowing  for  arbitrary  nesting.  Only  digits  are allowed in nested
3935       brackets (that is, when recursing), while any characters are  permitted
3936       at the outer level.
3937
3938       < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
3939
3940       Here (?(R) is the start of a conditional subpattern, with two different
3941       alternatives for the recursive and non-recursive cases.  Item  (?R)  is
3942       the actual recursive call.
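
           A sketch of the angle-bracket pattern follows. The quantifier is
           written directly after the group here, which does not change the
           meaning, and the subjects are invented.

```erlang
Pattern = "< (?: (?(R) \\d++ | [^<>]*+) | (?R))* >".

%% Digits inside the nested brackets: the whole subject matches.
re:run("<ab<12>cd>", Pattern, [extended]).
%% {match,[{0,10}]}

%% Letters inside the nested brackets: the outer pair cannot match, so
%% the search moves on and finds "<cd>" as a top-level match instead.
re:run("<ab<cd>ef>", Pattern, [extended]).
%% {match,[{3,4}]}
```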
3943
3944       Differences in Recursion Processing between PCRE and Perl
3945
3946       Recursion  processing  in PCRE differs from Perl in two important ways.
3947       In PCRE (like Python, but unlike Perl), a recursive subpattern call  is
3948       always treated as an atomic group. That is, once it has matched some of
3949       the subject string, it is never re-entered, even if it contains untried
3950       alternatives  and  there  is a subsequent matching failure. This can be
3951       illustrated by the following pattern, which is meant to match a
3952       palindromic string containing an odd number of characters (for example, "a",
3953       "aba", "abcba", "abcdcba"):
3954
3955       ^(.|(.)(?1)\2)$
3956
3957       The idea is that it either matches a single character, or two identical
3958       characters surrounding a subpalindrome. In Perl, this pattern works; in
3959       PCRE it does not work if the subject is longer than  three  characters.
3960       Consider the subject string "abcba".
3961
3962       At  the  top level, the first character is matched, but as it is not at
3963       the end of the string, the first alternative fails, the second alterna‐
3964       tive  is  taken, and the recursion kicks in. The recursive call to sub‐
3965       pattern 1 successfully matches the next character ("b").  (Notice  that
3966       the beginning and end of line tests are not part of the recursion.)
3967
3968       Back  at  the top level, the next character ("c") is compared with what
3969       subpattern 2 matched, which was "a". This fails. As  the  recursion  is
3970       treated  as  an atomic group, there are now no backtracking points, and
3971       so the entire match fails. (Perl can now re-enter the recursion and try
3972       the  second  alternative.)  However, if the pattern is written with the
3973       alternatives in the other order, things are different:
3974
3975       ^((.)(?1)\2|.)$
3976
3977       This time, the recursing alternative is tried first, and  continues  to
3978       recurse  until  it runs out of characters, at which point the recursion
3979       fails. But this time we have another alternative to try at  the  higher
3980       level.  That  is  the  significant difference: in the previous case the
3981       remaining alternative is at a deeper recursion level, which PCRE cannot
3982       use.
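
           The two orderings of the alternatives can be compared side by
           side. In this sketch the names Odd1 and Odd2 are labels chosen
           here for the two patterns above.

```erlang
Odd1 = "^(.|(.)(?1)\\2)$".   % single character tried first
Odd2 = "^((.)(?1)\\2|.)$".   % recursing alternative tried first

%% Each tuple is {Subject, result with Odd1, result with Odd2}.
[{S, re:run(S, Odd1, [{capture, none}]), re:run(S, Odd2, [{capture, none}])}
 || S <- ["a", "aba", "abcba"]].
%% [{"a",match,match},
%%  {"aba",match,match},
%%  {"abcba",nomatch,match}]
```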
3983
3984       To  change  the pattern so that it matches all palindromic strings, not
3985       only those with an odd number of characters, it is tempting  to  change
3986       the pattern to this:
3987
3988       ^((.)(?1)\2|.?)$
3989
3990       Again,  this  works  in Perl, but not in PCRE, and for the same reason.
3991       When a deeper recursion has matched a single character,  it  cannot  be
3992       entered again to match an empty string. The solution is to separate the
3993       two cases, and write out the odd and even cases as alternatives at  the
3994       higher level:
3995
3996       ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
3997
3998       If  you  want  to  match  typical palindromic phrases, the pattern must
3999       ignore all non-word characters, which can be done as follows:
4000
4001       ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
4002
4003       If run with option caseless, this pattern matches phrases  such  as  "A
4004       man, a plan, a canal: Panama!" and it works well in both PCRE and Perl.
4005       Notice the use of the possessive quantifier *+  to  avoid  backtracking
4006       into  sequences  of  non-word characters. Without this, PCRE takes much
4007       longer (10 times or more) to match typical phrases, and Perl  takes  so
4008       long that you think it has gone into a loop.
4009
4010   Note:
4011       The  palindrome-matching patterns above work only if the subject string
4012       does not start with a  palindrome  that  is  shorter  than  the  entire
4013       string. For example, although "abcba" is correctly matched, if the sub‐
4014       ject is "ababa", PCRE finds palindrome "aba" at  the  start,  and  then
4015       fails  at  top  level,  as  the end of the string does not follow. Once
4016       again, it cannot jump back into the recursion  to  try  other  alterna‐
4017       tives, so the entire match fails.
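
           The limitation in the note can be reproduced with the reordered
           odd-length pattern, as in this sketch:

```erlang
Pattern = "^((.)(?1)\\2|.)$".

re:run("abcba", Pattern, [{capture, none}]).
%% match

%% "ababa" is a palindrome, but it starts with the shorter palindrome
%% "aba", and the recursion cannot be re-entered to try again.
re:run("ababa", Pattern, [{capture, none}]).
%% nomatch
```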
4018
4019
4020       The  second  way  in which PCRE and Perl differ in their recursion
4021       processing is in the handling of captured values. In Perl, when a
4022       subpattern is called recursively or as a subroutine (see the next section),
4023       it has no access to any values that were captured  outside  the  recur‐
4024       sion.  In  PCRE  these values can be referenced. Consider the following
4025       pattern:
4026
4027       ^(.)(\1|a(?2))
4028
4029       In PCRE, it matches "bab". The first capturing parentheses  match  "b",
4030       then  in  the  second  group, when the back reference \1 fails to match
4031       "b", the second alternative matches "a",  and  then  recurses.  In  the
4032       recursion,  \1  does  now match "b" and so the whole match succeeds. In
4033       Perl, the pattern fails to match because inside the recursive  call  \1
4034       cannot access the externally set value.
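
           Since this module uses PCRE, the PCRE behavior is what re:run/3
           shows, as in this sketch:

```erlang
%% Inside the recursive call (?2), \1 still refers to the "b" that was
%% captured outside the recursion.
re:run("bab", "^(.)(\\1|a(?2))", [{capture, all, list}]).
%% {match,["bab","b","ab"]}
```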
4035

SUBPATTERNS AS SUBROUTINES

4037       If  the  syntax for a recursive subpattern call (either by number or by
4038       name) is used outside the parentheses to which it refers,  it  operates
4039       like  a subroutine in a programming language. The called subpattern can
4040       be defined before or after the reference. A numbered reference  can  be
4041       absolute or relative, as in the following examples:
4042
4043       (...(absolute)...)...(?2)...
4044       (...(relative)...)...(?-1)...
4045       (...(?+1)...(relative)...
4046
4047       An  earlier  example  pointed  out  that  the following pattern matches
4048       "sense and sensibility" and  "response  and  responsibility",  but  not
4049       "sense and responsibility":
4050
4051       (sens|respons)e and \1ibility
4052
4053       If instead the following pattern is used, it matches "sense and respon‐
4054       sibility" and the other two strings:
4055
4056       (sens|respons)e and (?1)ibility
4057
4058       Another example is provided in the discussion of DEFINE earlier.
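
           The difference between a back reference and a subroutine call
           can be sketched like this (BackRef and SubCall are names chosen
           for the example):

```erlang
BackRef = "(sens|respons)e and \\1ibility".    % repeats the captured text
SubCall = "(sens|respons)e and (?1)ibility".   % re-runs the subpattern

%% Each tuple is {Subject, result with BackRef, result with SubCall}.
[{S, re:run(S, BackRef, [{capture, none}]),
     re:run(S, SubCall, [{capture, none}])}
 || S <- ["sense and sensibility", "sense and responsibility"]].
%% [{"sense and sensibility",match,match},
%%  {"sense and responsibility",nomatch,match}]
```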
4059
4060       All subroutine calls, recursive or not, are always  treated  as  atomic
4061       groups.  That  is,  once  a  subroutine has matched some of the subject
4062       string, it is never re-entered, even if it  contains  untried  alterna‐
4063       tives  and there is a subsequent matching failure. Any capturing paren‐
4064       theses that are set during the subroutine call revert to their previous
4065       values afterwards.
4066
4067       Processing  options  such as case-independence are fixed when a subpat‐
4068       tern is defined, so if it is used as a subroutine, such options  cannot
4069       be  changed  for  different  calls.  For example, the following pattern
4070       matches "abcabc" but not "abcABC", as the change of  processing  option
4071       does not affect the called subpattern:
4072
4073       (abc)(?i:(?-1))
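
       A short shell sketch of this behavior follows. The third call is an
       added contrast case, not taken from the text above, showing that a
       literal inside (?i:...) is matched caselessly:

```erlang
FixedOpts = "(abc)(?i:(?-1))".
%% The call to group 1 stays case-sensitive despite the enclosing (?i:...).
{match, ["abcabc"]} = re:run("abcabc", FixedOpts, [{capture, first, list}]).
nomatch = re:run("abcABC", FixedOpts).
%% For contrast, a literal inside (?i:...) is matched caselessly.
{match, ["abcABC"]} =
    re:run("abcABC", "(abc)(?i:abc)", [{capture, first, list}]).
```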
4074

ONIGURUMA SUBROUTINE SYNTAX

       For compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number enclosed either in angle brackets or single quotes is
       an alternative syntax for referencing a subpattern as a subroutine,
       possibly recursively. Here follow two of the examples used above,
       rewritten using this syntax:
4081
4082       (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
4083       (sens|respons)e and \g'1'ibility
4084
4085       PCRE  supports  an extension to Oniguruma: if a number is preceded by a
4086       plus or minus sign, it is taken as a relative reference, for example:
4087
4088       (abc)(?i:\g<-1>)
4089
4090       Notice that \g{...} (Perl syntax) and \g<...>  (Oniguruma  syntax)  are
4091       not synonymous. The former is a back reference; the latter is a subrou‐
4092       tine call.
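
       The distinction can be sketched in the shell as follows. Notice that
       the backslash is doubled in the Erlang string literals; the variable
       name is arbitrary:

```erlang
OnigSubject = "sense and responsibility".
%% \g'1' (Oniguruma syntax) is a subroutine call to group 1.
{match, ["sense and responsibility"]} =
    re:run(OnigSubject, "(sens|respons)e and \\g'1'ibility",
           [{capture, first, list}]).
%% \g{1} (Perl syntax) is a back reference to group 1.
nomatch = re:run(OnigSubject, "(sens|respons)e and \\g{1}ibility").
%% A signed number is a relative reference: \g<-1> calls the previous group.
nomatch = re:run("abcABC", "(abc)(?i:\\g<-1>)").
```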
4093

BACKTRACKING CONTROL

4095       Perl 5.10 introduced some "Special Backtracking Control  Verbs",  which
4096       are still described in the Perl documentation as "experimental and sub‐
4097       ject to change or removal in a future version of Perl". It goes  on  to
4098       say:  "Their usage in production code should be noted to avoid problems
4099       during upgrades." The same remarks apply to the PCRE features described
4100       in this section.
4101
4102       The  new verbs make use of what was previously invalid syntax: an open‐
4103       ing parenthesis followed by an asterisk. They are generally of the form
4104       (*VERB)  or  (*VERB:NAME). Some can take either form, possibly behaving
4105       differently depending on whether a name  is  present.  A  name  is  any
4106       sequence of characters that does not include a closing parenthesis. The
4107       maximum name length is 255 in the 8-bit library and 65535 in the 16-bit
4108       and  32-bit  libraries.  If  the name is empty, that is, if the closing
4109       parenthesis immediately follows the colon, the  effect  is  as  if  the
4110       colon was not there. Any number of these verbs can occur in a pattern.
4111
4112       The behavior of these verbs in repeated groups, assertions, and in sub‐
4113       patterns  called  as  subroutines  (whether  or  not  recursively)   is
4114       described below.
4115
4116       Optimizations That Affect Backtracking Verbs
4117
       PCRE contains some optimizations that are used to speed up matching by
       running some checks at the start of each match attempt. For example, it
       can know the minimum length of a matching subject, or that a particular
       character must be present. When one of these optimizations bypasses the
       running of a match, any included backtracking verbs are not processed.
       You can suppress the start-of-match optimizations by setting option
       no_start_optimize when calling compile/2 or run/3, or by starting the
       pattern with (*NO_START_OPT).
4126
4127       Experiments with Perl suggest that it too  has  similar  optimizations,
4128       sometimes leading to anomalous results.
4129
4130       Verbs That Act Immediately
4131
4132       The  following verbs act as soon as they are encountered. They must not
4133       be followed by a name.
4134
4135       (*ACCEPT)
4136
4137       This verb causes the match to end successfully, skipping the  remainder
4138       of  the pattern. However, when it is inside a subpattern that is called
4139       as a subroutine, only that subpattern is ended  successfully.  Matching
4140       then continues at the outer level. If (*ACCEPT) is triggered in a posi‐
4141       tive assertion, the assertion succeeds; in a  negative  assertion,  the
4142       assertion fails.
4143
4144       If  (*ACCEPT)  is inside capturing parentheses, the data so far is cap‐
4145       tured. For example, the following matches "AB", "AAD", or  "ACD".  When
4146       it matches "AB", "B" is captured by the outer parentheses.
4147
4148       A((?:A|B(*ACCEPT)|C)D)
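
       As a sketch, the captured data can be inspected with the capture
       option of run/3 (the variable name is arbitrary):

```erlang
AcceptRe = "A((?:A|B(*ACCEPT)|C)D)".
%% (*ACCEPT) ends the match after "B"; group 1 holds what it had so far.
{match, ["AB", "B"]} = re:run("AB", AcceptRe, [{capture, all, list}]).
%% The other branches must still be followed by "D".
{match, ["AAD", "AD"]} = re:run("AAD", AcceptRe, [{capture, all, list}]).
{match, ["ACD", "CD"]} = re:run("ACD", AcceptRe, [{capture, all, list}]).
```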
4149
4150       The  following  verb causes a matching failure, forcing backtracking to
4151       occur. It is equivalent to (?!) but easier to read.
4152
4153       (*FAIL) or (*F)
4154
       The Perl documentation states that it is probably useful only when
       combined with (?{}) or (??{}). Those are Perl features that are not
       present in PCRE. The nearest equivalent in PCRE is the callout feature,
       as, for example, in the following pattern:

       a+(?C)(*FAIL)

       A match with the string "aaaa" always fails, but the callout is taken
       before each backtrack occurs (in this example, 10 times).
4161
4162       Recording Which Path Was Taken
4163
       The main purpose of the (*MARK) verb is to track how a match was
       arrived at, although it also has a secondary use in conjunction with
       advancing the match starting point (see (*SKIP) below).
4167
4168   Note:
4169       In  Erlang,  there  is no interface to retrieve a mark with run/2,3, so
4170       only the secondary purpose is relevant to the Erlang programmer.
4171
4172       The rest of this section is  therefore  deliberately  not  adapted  for
4173       reading  by  the Erlang programmer, but the examples can help in under‐
4174       standing NAMES as they can be used by (*SKIP).
4175
4176
4177       (*MARK:NAME) or (*:NAME)
4178
4179       A name is always  required  with  this  verb.  There  can  be  as  many
4180       instances  of  (*MARK) as you like in a pattern, and their names do not
4181       have to be unique.
4182
4183       When a match succeeds, the name of the last  encountered  (*MARK:NAME),
4184       (*PRUNE:NAME),  or  (*THEN:NAME) on the matching path is passed back to
4185       the caller as described in section "Extra data for pcre_exec()" in  the
4186       pcreapi documentation. In the following example of pcretest output, the
4187       /K modifier requests the retrieval and outputting of (*MARK) data:
4188
4189         re> /X(*MARK:A)Y|X(*MARK:B)Z/K
4190       data> XY
4191        0: XY
4192       MK: A
4193       XZ
4194        0: XZ
4195       MK: B
4196
4197       The (*MARK) name is tagged with "MK:" in this output, and in this exam‐
4198       ple  it indicates which of the two alternatives matched. This is a more
4199       efficient way of obtaining this information than putting each  alterna‐
4200       tive in its own capturing parentheses.
4201
4202       If  a  verb  with a name is encountered in a positive assertion that is
4203       true, the name is recorded and passed back if it is  the  last  encoun‐
4204       tered.  This does not occur for negative assertions or failing positive
4205       assertions.
4206
4207       After a partial match or a failed match, the last encountered  name  in
4208       the entire match process is returned, for example:
4209
4210         re> /X(*MARK:A)Y|X(*MARK:B)Z/K
4211       data> XP
4212       No match, mark = B
4213
       Notice that in this unanchored example, the mark is retained from the
       match attempt that started at letter "X" in the subject. Subsequent
       match attempts starting at "P" and then with an empty string do not get
       as far as the (*MARK) item, but nevertheless do not reset it.
4218
4219       Verbs That Act after Backtracking
4220
4221       The following verbs do nothing when they are encountered. Matching con‐
4222       tinues  with what follows, but if there is no subsequent match, causing
4223       a backtrack to the verb, a failure is  forced.  That  is,  backtracking
4224       cannot  pass  to the left of the verb. However, when one of these verbs
4225       appears inside an atomic group or an assertion that is true, its effect
4226       is confined to that group, as once the group has been matched, there is
4227       never any backtracking into it. In  this  situation,  backtracking  can
4228       "jump  back"  to  the  left  of  the  entire atomic group or assertion.
4229       (Remember also, as stated above, that this localization also applies in
4230       subroutine calls.)
4231
4232       These  verbs  differ  in exactly what kind of failure occurs when back‐
4233       tracking reaches them. The behavior described below is what occurs when
4234       the  verb  is  not in a subroutine or an assertion. Subsequent sections
4235       cover these special cases.
4236
4237       The following verb, which must not be followed by a  name,  causes  the
4238       whole  match to fail outright if there is a later matching failure that
4239       causes backtracking to reach it. Even if the pattern is unanchored,  no
4240       further  attempts  to find a match by advancing the starting point take
4241       place.
4242
4243       (*COMMIT)
4244
       If (*COMMIT) is the only backtracking verb that is encountered, once it
       has been passed, run/2,3 is committed to finding a match at the current
       starting point, or not at all, for example:
4248
4249       a+(*COMMIT)b
4250
4251       This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
4252       of dynamic anchor, or "I've started, so I must finish". The name of the
4253       most recently passed (*MARK) in the path is passed back when  (*COMMIT)
4254       forces a match failure.
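
       The two subjects mentioned above can be tried in the shell. This is a
       sketch; the last call, without the verb, is added only for contrast:

```erlang
CommitRe = "a+(*COMMIT)b".
{match, ["aab"]} = re:run("xxaab", CommitRe, [{capture, first, list}]).
%% "aa" is matched at offset 0, then "b" fails against "c"; (*COMMIT)
%% stops the search even though "aab" occurs later in the subject.
nomatch = re:run("aacaab", CommitRe).
%% Without the verb, the later occurrence is found.
{match, ["aab"]} = re:run("aacaab", "a+b", [{capture, first, list}]).
```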
4255
4256       If more than one backtracking verb exists in a pattern, a different one
4257       that follows (*COMMIT) can be triggered first, so merely passing (*COM‐
4258       MIT)  during  a match does not always guarantee that a match must be at
4259       this starting point.
4260
4261       Notice that (*COMMIT) at the start of a pattern is not the same  as  an
4262       anchor, unless the PCRE start-of-match optimizations are turned off, as
4263       shown in the following example:
4264
4265       1> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list}]).
4266       {match,["abc"]}
4267       2> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list},no_start_optimize]).
4268       nomatch
4269
4270       For this pattern, PCRE knows that any match must start with "a", so the
4271       optimization skips along the subject to "a" before applying the pattern
4272       to the first set of data. The match attempt then succeeds. In the  sec‐
4273       ond  call  the  no_start_optimize  disables the optimization that skips
4274       along to the first character. The pattern is now  applied  starting  at
4275       "x",  and  so the (*COMMIT) causes the match to fail without trying any
4276       other starting points.
4277
4278       The following verb causes the match to fail  at  the  current  starting
4279       position  in  the  subject  if  there  is a later matching failure that
4280       causes backtracking to reach it:
4281
4282       (*PRUNE) or (*PRUNE:NAME)
4283
4284       If the pattern is unanchored, the normal  "bumpalong"  advance  to  the
4285       next starting character then occurs. Backtracking can occur as usual to
4286       the left of (*PRUNE), before it is reached, or  when  matching  to  the
4287       right  of (*PRUNE), but if there is no match to the right, backtracking
4288       cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) is just  an
4289       alternative  to an atomic group or possessive quantifier, but there are
4290       some uses of (*PRUNE) that cannot be expressed in any other way. In  an
4291       anchored pattern, (*PRUNE) has the same effect as (*COMMIT).
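
       The following sketch shows one of the simple cases, where (*PRUNE)
       gives the same result as an atomic group or a possessive quantifier.
       The subject and patterns are chosen for illustration only:

```erlang
PruneSubject = "aaab".
%% Plain a+ gives back one "a" so that "ab" can match.
{match, ["aaab"]} = re:run(PruneSubject, "a+ab", [{capture, first, list}]).
%% With (*PRUNE), backtracking into a+ is cut off at every starting point.
nomatch = re:run(PruneSubject, "a+(*PRUNE)ab").
%% An atomic group and a possessive quantifier behave the same way here.
nomatch = re:run(PruneSubject, "(?>a+)ab").
nomatch = re:run(PruneSubject, "a++ab").
```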
4292
       The behavior of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
       It is like (*MARK:NAME) in that the name is remembered for passing back
       to the caller. However, (*SKIP:NAME) searches only for names set with
       (*MARK).
4297
4298   Note:
4299       The fact that (*PRUNE:NAME) remembers the name is useless to the Erlang
4300       programmer, as names cannot be retrieved.
4301
4302
4303       The  following  verb,  when specified without a name, is like (*PRUNE),
4304       except that if the pattern is unanchored, the  "bumpalong"  advance  is
4305       not  to  the  next  character, but to the position in the subject where
4306       (*SKIP) was encountered.
4307
4308       (*SKIP)
4309
4310       (*SKIP) signifies that whatever text was matched leading up to it  can‐
4311       not be part of a successful match. Consider:
4312
4313       a+(*SKIP)b
4314
4315       If  the  subject  is  "aaaac...",  after  the first match attempt fails
4316       (starting at the first character in the  string),  the  starting  point
4317       skips  on  to  start  the next attempt at "c". Notice that a possessive
4318       quantifier does not have the same effect as this example;  although  it
4319       would  suppress backtracking during the first match attempt, the second
4320       attempt would start at the second character instead of skipping  on  to
4321       "c".
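
       A sketch of the effect on the starting point follows. The first call
       uses the pattern above with a longer, made-up subject; the other two
       use an illustrative pattern on which (*SKIP) and (*PRUNE) give
       different results:

```erlang
%% The first attempt consumes "aaaa" and fails at "c"; the next attempt
%% starts at "c", and the match is eventually found at offset 5.
{match, [{5, 3}]} =
    re:run("aaaacaab", "a+(*SKIP)b", [{capture, first, index}]).
%% With (*PRUNE) the next attempt starts at offset 1, where "b" matches.
{match, [{1, 1}]} =
    re:run("abd", "ab(*PRUNE)c|b", [{capture, first, index}]).
%% With (*SKIP) the next attempt starts at offset 2, past the "b".
nomatch = re:run("abd", "ab(*SKIP)c|b").
```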
4322
4323       When (*SKIP) has an associated name, its behavior is modified:
4324
4325       (*SKIP:NAME)
4326
4327       When  this  is  triggered,  the  previous  path  through the pattern is
4328       searched for the most recent (*MARK) that has the same name. If one  is
4329       found,  the  "bumpalong" advance is to the subject position that corre‐
4330       sponds to that (*MARK) instead of to where (*SKIP) was encountered.  If
4331       no (*MARK) with a matching name is found, (*SKIP) is ignored.
4332
4333       Notice  that  (*SKIP:NAME) searches only for names set by (*MARK:NAME).
4334       It ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).
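
       Although mark names cannot be retrieved in Erlang, they still steer
       (*SKIP:NAME). The following is a sketch with a made-up pattern and the
       arbitrary mark name "here":

```erlang
MarkSubject = "abcx".
%% Backtracking onto (*SKIP:here) restarts at the (*MARK:here) position
%% (offset 1), so the "c" at offset 2 can still be found.
{match, [{2, 1}]} =
    re:run(MarkSubject, "a(*MARK:here)bc(*SKIP:here)d|c",
           [{capture, first, index}]).
%% An unnamed (*SKIP) restarts at offset 3, beyond the "c".
nomatch = re:run(MarkSubject, "a(*MARK:here)bc(*SKIP)d|c").
```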
4335
4336       The following verb causes a skip to the next innermost alternative when
4337       backtracking  reaches  it. That is, it cancels any further backtracking
4338       within the current alternative.
4339
4340       (*THEN) or (*THEN:NAME)
4341
4342       The verb name comes from the observation that it can be used for a pat‐
4343       tern-based if-then-else block:
4344
4345       ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
4346
4347       If  the COND1 pattern matches, FOO is tried (and possibly further items
4348       after the end of the group if FOO succeeds). On  failure,  the  matcher
4349       skips  to  the second alternative and tries COND2, without backtracking
4350       into COND1. If that succeeds and BAR fails, COND3 is tried. If BAZ then
4351       fails, there are no more alternatives, so there is a backtrack to what‐
4352       ever came before the entire group. If (*THEN) is not inside an alterna‐
4353       tion, it acts like (*PRUNE).
4354
       The behavior of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
       It is like (*MARK:NAME) in that the name is remembered for passing back
       to the caller. However, (*SKIP:NAME) searches only for names set with
       (*MARK).
4359
4360   Note:
4361       The fact that (*THEN:NAME) remembers the name is useless to the  Erlang
4362       programmer, as names cannot be retrieved.
4363
4364
4365       A  subpattern that does not contain a | character is just a part of the
4366       enclosing alternative; it is not a nested  alternation  with  only  one
4367       alternative.  The effect of (*THEN) extends beyond such a subpattern to
4368       the enclosing alternative. Consider the following pattern, where A,  B,
4369       and  so  on,  are  complex  pattern fragments that do not contain any |
4370       characters at this level:
4371
4372       A (B(*THEN)C) | D
4373
4374       If A and B are matched, but there is a failure in C, matching does  not
4375       backtrack into A; instead it moves to the next alternative, that is, D.
4376       However, if the subpattern containing (*THEN) is given an  alternative,
4377       it behaves differently:
4378
4379       A (B(*THEN)C | (*FAIL)) | D
4380
4381       The  effect of (*THEN) is now confined to the inner subpattern. After a
4382       failure in C, matching moves to (*FAIL), which causes the whole subpat‐
4383       tern  to  fail, as there are no more alternatives to try. In this case,
4384       matching does now backtrack into A.
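
       The two forms can be compared with concrete fragments. In this sketch,
       "^.*?", "b", "c", and "never" are arbitrary stand-ins for A, B, C, and
       D:

```erlang
ThenSubject = "bxbc".
%% (b(*THEN)c) has no "|", so (*THEN) jumps to the outer alternative
%% "never" instead of letting .*? consume more characters.
nomatch = re:run(ThenSubject, "^.*?(b(*THEN)c)|never").
%% With an inner alternative, (*THEN) only moves on to (*FAIL); the group
%% then fails normally and .*? is extended until "bc" is reached.
{match, ["bxbc"]} =
    re:run(ThenSubject, "^.*?(b(*THEN)c|(*FAIL))|never",
           [{capture, first, list}]).
```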
4385
4386       Notice that a conditional subpattern is not considered  as  having  two
4387       alternatives,  as  only one is ever used. That is, the | character in a
4388       conditional subpattern has a different  meaning.  Ignoring  whitespace,
4389       consider:
4390
4391       ^.*? (?(?=a) a | b(*THEN)c )
4392
4393       If  the  subject  is  "ba",  this  pattern  does  not  match. As .*? is
4394       ungreedy, it initially matches zero  characters.  The  condition  (?=a)
4395       then  fails,  the  character  "b"  is  matched, but "c" is not. At this
4396       point, matching does not backtrack to .*? as can  perhaps  be  expected
4397       from  the  presence  of  the | character. The conditional subpattern is
4398       part of the single alternative that comprises the whole pattern, and so
       the match fails. (If there were a backtrack into .*?, allowing it to
       match "b", the match would succeed.)
4401
4402       The verbs described above provide four different "strengths" of control
4403       when subsequent matching fails:
4404
4405         * (*THEN)  is the weakest, carrying on the match at the next alterna‐
4406           tive.
4407
         * (*PRUNE) comes next, failing the match at the current starting
           position, but allowing an advance to the next character (for an
           unanchored pattern).
4411
4412         * (*SKIP) is similar, except that the advance can be  more  than  one
4413           character.
4414
4415         * (*COMMIT) is the strongest, causing the entire match to fail.
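
       The four strengths can be observed on a single subject. In this
       sketch, Strength is a hypothetical helper fun, and the pattern and
       subject are chosen only so that each verb yields a different result:

```erlang
%% Hypothetical helper: build "ab(*VERB)c|\w" and run it against "abd",
%% where "ab" matches, "c" fails, and backtracking reaches the verb.
Strength = fun(Verb) ->
    re:run("abd", "ab(*" ++ Verb ++ ")c|\\w", [{capture, first, list}])
end.
{match, ["a"]} = Strength("THEN").    % next alternative, same start
{match, ["b"]} = Strength("PRUNE").   % next attempt starts at offset 1
{match, ["d"]} = Strength("SKIP").    % next attempt starts at offset 2
nomatch        = Strength("COMMIT").  % no further attempts
```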
4416
4417       More than One Backtracking Verb
4418
4419       If  more  than  one  backtracking verb is present in a pattern, the one
4420       that is backtracked onto first acts. For example, consider the  follow‐
4421       ing pattern, where A, B, and so on, are complex pattern fragments:
4422
4423       (A(*COMMIT)B(*THEN)C|ABD)
4424
4425       If  A matches but B fails, the backtrack to (*COMMIT) causes the entire
4426       match to fail. However, if A and B match, but C fails, the backtrack to
       (*THEN) causes the next alternative (ABD) to be tried. This behavior is
       consistent, but is not always the same as in Perl. It means that if two
       or more backtracking verbs appear in succession, all but the last of
       them have no effect. Consider the following example:

       ...(*COMMIT)(*PRUNE)...

       If there is a matching failure to the right, backtracking onto (*PRUNE)
       causes it to be triggered, and its action is taken. There can never be
       a backtrack onto (*COMMIT).
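
       Both situations can be sketched with simple literals standing in for
       the fragments A, B, C, and D. The subjects are made up for
       illustration:

```erlang
TwoVerbs = "(a(*COMMIT)b(*THEN)c|abd)".
%% "a" and "b" match, "c" fails: (*THEN) is reached first and the
%% second alternative is tried.
{match, ["abd"]} = re:run("abd", TwoVerbs, [{capture, first, list}]).
%% "a" matches, "b" fails: (*COMMIT) is reached and the whole match
%% fails, although "abd" occurs at offset 2.
nomatch = re:run("axabd", TwoVerbs).
%% (*COMMIT) directly followed by (*PRUNE): only (*PRUNE) ever acts, so
%% later starting points are still tried and "aab" is found.
{match, ["aab"]} =
    re:run("aacaab", "a+(*COMMIT)(*PRUNE)b", [{capture, first, list}]).
```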
4435
4436       Backtracking Verbs in Repeated Groups
4437
4438       PCRE differs from  Perl  in  its  handling  of  backtracking  verbs  in
4439       repeated groups. For example, consider:
4440
4441       /(a(*COMMIT)b)+ac/
4442
4443       If  the  subject  is  "abac",  Perl matches, but PCRE fails because the
4444       (*COMMIT) in the second repeat of the group acts.
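
       As run/2,3 uses PCRE, the PCRE behavior is what the Erlang programmer
       sees. A sketch, with a verb-free pattern added for contrast:

```erlang
%% The second repeat matches "a", passes (*COMMIT), and fails on "c".
nomatch = re:run("abac", "(a(*COMMIT)b)+ac").
%% Without the verb, the group matches once and "ac" follows.
{match, ["abac"]} = re:run("abac", "(ab)+ac", [{capture, first, list}]).
```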
4445
4446       Backtracking Verbs in Assertions
4447
4448       (*FAIL) in an assertion has its normal effect: it forces  an  immediate
4449       backtrack.
4450
4451       (*ACCEPT) in a positive assertion causes the assertion to succeed with‐
4452       out any further processing. In a negative assertion,  (*ACCEPT)  causes
4453       the assertion to fail without any further processing.
4454
       The other backtracking verbs are not treated specially if they appear
       in a positive assertion. In particular, (*THEN) skips to the next
       alternative in the innermost enclosing group that has alternations,
       regardless of whether this is within the assertion.
4459
4460       Negative assertions are, however, different, to ensure that changing  a
4461       positive  assertion into a negative assertion changes its result. Back‐
4462       tracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a negative  asser‐
4463       tion  to  be true, without considering any further alternative branches
4464       in the assertion. Backtracking into (*THEN) causes it to  skip  to  the
4465       next  enclosing alternative within the assertion (the normal behavior),
4466       but if the assertion does not have such an alternative, (*THEN) behaves
4467       like (*PRUNE).
4468
4469       Backtracking Verbs in Subroutines
4470
       These behaviors occur regardless of whether the subpattern is called
       recursively. The treatment of subroutines in Perl is different in some
       cases.
4474
4475         * (*FAIL)  in  a  subpattern  called  as  a subroutine has its normal
4476           effect: it forces an immediate backtrack.
4477
4478         * (*ACCEPT) in a subpattern called as a subroutine causes the subrou‐
4479           tine match to succeed without any further processing. Matching then
4480           continues after the subroutine call.
4481
4482         * (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as  a  sub‐
4483           routine cause the subroutine match to fail.
4484
4485         * (*THEN)  skips  to  the next alternative in the innermost enclosing
4486           group within the subpattern that has alternatives. If there  is  no
4487           such  group  within  the  subpattern, (*THEN) causes the subroutine
4488           match to fail.
4489
Ericsson AB                       stdlib 3.10                            re(3)