re(3)                      Erlang Module Definition                      re(3)

NAME

6       re - Perl-like regular expressions for Erlang.
7

DESCRIPTION

9       This  module contains regular expression matching functions for strings
10       and binaries.
11
       The regular expression syntax and semantics resemble those of Perl.
13
14       The matching algorithms of the library are based on the  PCRE  library,
15       but not all of the PCRE library is interfaced and some parts of the li‐
16       brary go beyond what PCRE offers. Currently PCRE version 8.40  (release
17       date  2017-01-11)  is used. The sections of the PCRE documentation that
18       are relevant to this module are included here.
19
20   Note:
21       The Erlang literal syntax for strings uses the "\" (backslash)  charac‐
22       ter  as  an  escape  code.  You  need  to escape backslashes in literal
23       strings, both in your code and in the shell, with an  extra  backslash,
24       that is, "\\".
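
       As a minimal sketch of the escaping rule above, the following helper
       matches the regular expression \d+ by writing it as "\\d+" in the
       source. The module and function names are hypothetical and chosen
       only for this illustration.

```erlang
-module(re_escape_demo).
-export([digits/1]).

%% Extracts the first run of decimal digits from Subject.
%% The regular expression \d+ is written "\\d+" in an Erlang string literal,
%% because the Erlang parser consumes one level of backslashes.
digits(Subject) ->
    case re:run(Subject, "\\d+", [{capture, first, list}]) of
        {match, [Digits]} -> {ok, Digits};
        nomatch -> error
    end.
```

       For example, re_escape_demo:digits("abc123def") returns {ok,"123"}.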
25
26

DATA TYPES

28       mp() = {re_pattern, term(), term(), term(), term()}
29
30              Opaque  data type containing a compiled regular expression. mp()
31              is guaranteed to be a tuple() having the atom re_pattern as  its
32              first element, to allow for matching in guards. The arity of the
33              tuple or the content of the other fields can  change  in  future
34              Erlang/OTP releases.
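
              A guard that relies only on the documented first element could
              be sketched as follows. The module and function names are
              hypothetical; the code deliberately ignores the arity and the
              remaining fields, as these can change between releases.

```erlang
-module(re_mp_guard_demo).
-export([is_compiled/1]).

%% Returns true if Term looks like a compiled regular expression.
%% Only the re_pattern tag in the first element is inspected.
%% For a non-tuple or an empty tuple the guard fails and the
%% second clause returns false.
is_compiled(Term) when is_tuple(Term), element(1, Term) =:= re_pattern ->
    true;
is_compiled(_) ->
    false.
```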
35
36       nl_spec() = cr | crlf | lf | anycrlf | any
37
38       compile_option() =
39           unicode | anchored | caseless | dollar_endonly | dotall |
40           extended | firstline | multiline | no_auto_capture |
41           dupnames | ungreedy |
42           {newline, nl_spec()} |
43           bsr_anycrlf | bsr_unicode | no_start_optimize | ucp |
44           never_utf
45

EXPORTS

47       version() -> binary()
48
              Returns a binary containing the version string of the PCRE
              library that was used when Erlang/OTP was compiled.
51
52       compile(Regexp) -> {ok, MP} | {error, ErrSpec}
53
54              Types:
55
56                 Regexp = iodata()
57                 MP = mp()
58                 ErrSpec =
59                     {ErrString :: string(), Position :: integer() >= 0}
60
              The same as compile(Regexp, []).
62
63       compile(Regexp, Options) -> {ok, MP} | {error, ErrSpec}
64
65              Types:
66
67                 Regexp = iodata() | unicode:charlist()
68                 Options = [Option]
69                 Option = compile_option()
70                 MP = mp()
71                 ErrSpec =
72                     {ErrString :: string(), Position :: integer() >= 0}
73
74              Compiles a regular expression, with the syntax described  below,
75              into an internal format to be used later as a parameter to run/2
76              and run/3.
77
78              Compiling the regular expression before matching  is  useful  if
79              the  same  expression is to be used in matching against multiple
80              subjects during the lifetime of the program. Compiling once  and
81              executing  many  times is far more efficient than compiling each
82              time one wants to match.
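
              The compile-once pattern described above can be sketched like
              this. The module and function names are hypothetical; the
              pattern is compiled a single time and the resulting mp() is
              reused for every subject.

```erlang
-module(re_compile_once_demo).
-export([filter_matching/2]).

%% Returns the subjects in Subjects that match Regexp.
%% The pattern is compiled once and the compiled form is reused
%% for every call to re:run/3.
filter_matching(Regexp, Subjects) ->
    {ok, MP} = re:compile(Regexp),
    [S || S <- Subjects, re:run(S, MP, [{capture, none}]) =:= match].
```

              For example, filter_matching("[0-9]", ["a1", "b", "c3"])
              returns ["a1","c3"].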
83
84              When option unicode is specified, the regular expression  is  to
85              be  specified  as  a  valid Unicode charlist(), otherwise as any
86              valid iodata().
87
88              Options:
89
90                unicode:
91                  The regular expression is specified as a Unicode  charlist()
92                  and  the  resulting  regular  expression  code  is to be run
93                  against a valid Unicode charlist()  subject.  Also  consider
94                  option ucp when using Unicode characters.
95
96                anchored:
97                  The  pattern is forced to be "anchored", that is, it is con‐
98                  strained to match only at the first matching  point  in  the
99                  string  that is searched (the "subject string"). This effect
100                  can also be achieved by appropriate constructs in  the  pat‐
101                  tern itself.
102
103                caseless:
104                  Letters  in  the  pattern match both uppercase and lowercase
105                  letters. It is equivalent to  Perl  option  /i  and  can  be
106                  changed within a pattern by a (?i) option setting. Uppercase
107                  and lowercase letters are defined as in the ISO 8859-1 char‐
108                  acter set.
109
110                dollar_endonly:
111                  A  dollar  metacharacter  in the pattern matches only at the
112                  end of the subject string. Without  this  option,  a  dollar
113                  also  matches immediately before a newline at the end of the
114                  string (but not before any other newlines). This  option  is
115                  ignored if option multiline is specified. There is no equiv‐
116                  alent option in Perl, and it cannot be set within a pattern.
117
118                dotall:
119                  A dot in the pattern matches all characters, including those
120                  indicating  newline.  Without  it, a dot does not match when
121                  the current position is at a newline. This option is equiva‐
122                  lent  to  Perl option /s and it can be changed within a pat‐
123                  tern by a (?s) option setting. A  negative  class,  such  as
124                  [^a],  always matches newline characters, independent of the
125                  setting of this option.
126
127                extended:
128                  If this option is set, most white space  characters  in  the
129                  pattern  are totally ignored except when escaped or inside a
130                  character class. However, white space is not allowed  within
131                  sequences  such  as (?> that introduce various parenthesized
132                  subpatterns, nor  within  a  numerical  quantifier  such  as
133                  {1,3}.  However,  ignorable white space is permitted between
134                  an item and a following quantifier and between a  quantifier
135                  and a following + that indicates possessiveness.
136
                  White space formerly did not include the VT character (code
                  11), because Perl did not treat this character as white
                  space. However, Perl changed at release 5.18, so PCRE
                  followed at release 8.34, and VT is now treated as white
                  space.
141
142                  This also causes characters between an unescaped # outside a
143                  character  class  and the next newline, inclusive, to be ig‐
144                  nored. This is equivalent to Perl's /x option, and it can be
145                  changed within a pattern by a (?x) option setting.
146
147                  With  this  option, comments inside complicated patterns can
148                  be included. However, notice that this applies only to  data
149                  characters.  Whitespace  characters  can never appear within
150                  special character sequences in a pattern, for example within
151                  sequence (?( that introduces a conditional subpattern.
152
153                firstline:
154                  An  unanchored pattern is required to match before or at the
155                  first newline in the subject string,  although  the  matched
156                  text can continue over the newline.
157
158                multiline:
159                  By  default, PCRE treats the subject string as consisting of
160                  a single line of characters (even if it contains  newlines).
161                  The  "start  of  line" metacharacter (^) matches only at the
162                  start of the string, while the "end of  line"  metacharacter
163                  ($)  matches only at the end of the string, or before a ter‐
164                  minating newline (unless  option  dollar_endonly  is  speci‐
165                  fied). This is the same as in Perl.
166
167                  When  this option is specified, the "start of line" and "end
168                  of line" constructs match immediately following  or  immedi‐
169                  ately  before  internal  newlines in the subject string, re‐
170                  spectively, as well as at the very start and  end.  This  is
171                  equivalent  to  Perl  option  /m and can be changed within a
172                  pattern by a (?m) option setting. If there are  no  newlines
173                  in  a  subject string, or no occurrences of ^ or $ in a pat‐
174                  tern, setting multiline has no effect.
175
176                no_auto_capture:
177                  Disables the use of numbered capturing  parentheses  in  the
178                  pattern.  Any  opening parenthesis that is not followed by ?
179                  behaves as if it is followed by ?:.  Named  parentheses  can
180                  still be used for capturing (and they acquire numbers in the
181                  usual way). There is no equivalent option in Perl.
182
183                dupnames:
184                  Names used to identify capturing  subpatterns  need  not  be
185                  unique.  This  can  be  helpful for certain types of pattern
186                  when it is known that only one instance of the named subpat‐
187                  tern  can ever be matched. More details of named subpatterns
188                  are provided below.
189
190                ungreedy:
191                  Inverts the "greediness" of the quantifiers so that they are
192                  not greedy by default, but become greedy if followed by "?".
193                  It is not compatible with Perl. It can also be set by a (?U)
194                  option setting within the pattern.
195
196                {newline, NLSpec}:
197                  Overrides the default definition of a newline in the subject
198                  string, which is LF (ASCII 10) in Erlang.
199
200                  cr:
                    Newline is indicated by a single character CR (ASCII 13).
202
203                  lf:
204                    Newline is indicated by a single character LF (ASCII  10),
205                    the default.
206
207                  crlf:
208                    Newline  is  indicated by the two-character CRLF (ASCII 13
209                    followed by ASCII 10) sequence.
210
211                  anycrlf:
212                    Any of the three preceding sequences is to be recognized.
213
214                  any:
215                    Any of the newline sequences above, and  the  Unicode  se‐
216                    quences  VT (vertical tab, U+000B), FF (formfeed, U+000C),
217                    NEL (next line, U+0085), LS (line separator, U+2028),  and
218                    PS (paragraph separator, U+2029).
219
220                bsr_anycrlf:
                  Specifies that \R is to match only the CR, LF, or CRLF
                  sequences, not the Unicode-specific newline characters.
224
225                bsr_unicode:
                  Specifies that \R is to match all the Unicode newline
                  characters (including CRLF, and so on, the default).
228
229                no_start_optimize:
230                  Disables  optimization  that  can  malfunction  if  "Special
231                  start-of-pattern  items"  are present in the regular expres‐
232                  sion. A typical example  would  be  when  matching  "DEFABC"
233                  against "(*COMMIT)ABC", where the start optimization of PCRE
234                  would skip the subject up to "A" and never realize that  the
235                  (*COMMIT)  instruction  is  to  have made the matching fail.
236                  This option is only relevant if  you  use  "start-of-pattern
237                  items",  as discussed in section PCRE Regular Expression De‐
238                  tails.
239
240                ucp:
241                  Specifies that Unicode character properties are to  be  used
242                  when  resolving  \B,  \b, \D, \d, \S, \s, \W and \w. Without
243                  this flag, only ISO Latin-1 properties are used. Using  Uni‐
244                  code  properties hurts performance, but is semantically cor‐
245                  rect when working with Unicode  characters  beyond  the  ISO
246                  Latin-1 range.
247
248                never_utf:
249                  Specifies  that  the (*UTF) and/or (*UTF8) "start-of-pattern
250                  items" are forbidden. This flag cannot be combined with  op‐
251                  tion  unicode. Useful if ISO Latin-1 patterns from an exter‐
252                  nal source are to be compiled.
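
              The effect of some of the options above can be observed with a
              small comparison, sketched here under the assumption that each
              option is tried in isolation. The module and function names are
              hypothetical; every tuple pairs the result without the option
              with the result with it.

```erlang
-module(re_compile_opts_demo).
-export([matches/3, first/3, demo/0]).

%% Compiles Regexp with Options and reports match | nomatch for Subject.
matches(Subject, Regexp, Options) ->
    {ok, MP} = re:compile(Regexp, Options),
    re:run(Subject, MP, [{capture, none}]).

%% Same, but returns the text of the whole match as a list.
first(Subject, Regexp, Options) ->
    {ok, MP} = re:compile(Regexp, Options),
    re:run(Subject, MP, [{capture, first, list}]).

%% Each tuple is {Option, ResultWithoutOption, ResultWithOption}.
demo() ->
    [{caseless,  matches("HELLO", "hello", []),
                 matches("HELLO", "hello", [caseless])},
     {dotall,    matches("a\nb", "a.b", []),
                 matches("a\nb", "a.b", [dotall])},
     {multiline, matches("one\ntwo", "^two$", []),
                 matches("one\ntwo", "^two$", [multiline])},
     {extended,  matches("abc", "a b c  # three letters", []),
                 matches("abc", "a b c  # three letters", [extended])},
     {anchored,  matches("xabc", "abc", []),
                 matches("xabc", "abc", [anchored])},
     {ungreedy,  first("<a><b>", "<.*>", []),
                 first("<a><b>", "<.*>", [ungreedy])}].
```

              All options except anchored turn a nomatch into a match in
              these cases; anchored does the opposite, and ungreedy shortens
              the matched text from "<a><b>" to "<a>".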
253
254       inspect(MP, Item) -> {namelist, [binary()]}
255
256              Types:
257
258                 MP = mp()
259                 Item = namelist
260
261              Takes a compiled regular expression and an item, and returns the
262              relevant  data  from  the regular expression. The only supported
263              item is  namelist,  which  returns  the  tuple  {namelist,  [bi‐
264              nary()]}, containing the names of all (unique) named subpatterns
265              in the regular expression. For example:
266
267              1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
268              {ok,{re_pattern,3,0,0,
269                              <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
270                                255,255,...>>}}
271              2> re:inspect(MP,namelist).
272              {namelist,[<<"A">>,<<"B">>,<<"C">>]}
273              3> {ok,MPD} = re:compile("(?<C>A)|(?<B>B)|(?<C>C)",[dupnames]).
274              {ok,{re_pattern,3,0,0,
275                              <<69,82,67,80,119,0,0,0,0,0,8,0,1,0,0,0,255,255,255,255,
276                                255,255,...>>}}
277              4> re:inspect(MPD,namelist).
278              {namelist,[<<"B">>,<<"C">>]}
279
280              Notice in the second example that the duplicate name only occurs
281              once  in the returned list, and that the list is in alphabetical
282              order regardless of where the names are positioned in the  regu‐
283              lar  expression. The order of the names is the same as the order
284              of captured subexpressions if {capture, all_names} is  specified
285              as  an option to run/3. You can therefore create a name-to-value
286              mapping from the result of run/3 like this:
287
288              1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
289              {ok,{re_pattern,3,0,0,
290                              <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
291                                255,255,...>>}}
292              2> {namelist, N} = re:inspect(MP,namelist).
293              {namelist,[<<"A">>,<<"B">>,<<"C">>]}
294              3> {match,L} = re:run("AA",MP,[{capture,all_names,binary}]).
295              {match,[<<"A">>,<<>>,<<>>]}
296              4> NameMap = lists:zip(N,L).
297              [{<<"A">>,<<"A">>},{<<"B">>,<<>>},{<<"C">>,<<>>}]
298
299       replace(Subject, RE, Replacement) -> iodata() | unicode:charlist()
300
301              Types:
302
303                 Subject = iodata() | unicode:charlist()
304                 RE = mp() | iodata()
305                 Replacement = iodata() | unicode:charlist()
306
307              Same as replace(Subject, RE, Replacement, []).
308
309       replace(Subject, RE, Replacement, Options) ->
310                  iodata() | unicode:charlist()
311
312              Types:
313
314                 Subject = iodata() | unicode:charlist()
315                 RE = mp() | iodata() | unicode:charlist()
316                 Replacement = iodata() | unicode:charlist()
317                 Options = [Option]
318                 Option =
319                     anchored | global | notbol | noteol | notempty |
320                     notempty_atstart |
321                     {offset, integer() >= 0} |
322                     {newline, NLSpec} |
323                     bsr_anycrlf |
324                     {match_limit, integer() >= 0} |
325                     {match_limit_recursion, integer() >= 0} |
326                     bsr_unicode |
327                     {return, ReturnType} |
328                     CompileOpt
329                 ReturnType = iodata | list | binary
330                 CompileOpt = compile_option()
331                 NLSpec = cr | crlf | lf | anycrlf | any
332
333              Replaces the matched part of the Subject string  with  the  con‐
334              tents of Replacement.
335
336              The  permissible  options are the same as for run/3, except that
337              option capture is not allowed. Instead a {return, ReturnType} is
338              present. The default return type is iodata, constructed in a way
339              to minimize copying. The iodata result can be used  directly  in
340              many  I/O  operations. If a flat list() is desired, specify {re‐
341              turn, list}. If a binary is desired, specify {return, binary}.
342
343              As in function run/3, an mp() compiled with option  unicode  re‐
344              quires  Subject  to  be  a Unicode charlist(). If compilation is
345              done implicitly and the unicode compilation option is  specified
              to this function, both the regular expression and Subject are to
              be specified as valid Unicode charlist()s.
348
              The replacement string can contain the special character &,
              which inserts the whole matching expression in the result, and
              the special sequences \N (where N is an integer > 0), \gN, or
              \g{N}, which insert subexpression number N in the result. If no
              subexpression with that number is generated by the regular
              expression, nothing is inserted.
355
356              To insert an & or a \ in the result, precede it with a \. Notice
357              that Erlang already gives a special  meaning  to  \  in  literal
358              strings,  so  a single \ must be written as "\\" and therefore a
359              double \ as "\\\\".
360
361              Example:
362
363              re:replace("abcd","c","[&]",[{return,list}]).
364
365              gives
366
367              "ab[c]d"
368
369              while
370
371              re:replace("abcd","c","[\\&]",[{return,list}]).
372
373              gives
374
375              "ab[&]d"
376
377              As with run/3, compilation errors raise  the  badarg  exception.
378              compile/2 can be used to get more information about the error.
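
              The following sketch exercises the replacement sequences and
              return types described above. The module and function names are
              hypothetical; notice that \1 and \2 are written "\\1" and "\\2"
              in Erlang source.

```erlang
-module(re_replace_demo).
-export([swap/1, bracket_digits/1, dashes_to_plus/1, flatten_default/1]).

%% Swaps the first "b" and "c" pair using numbered subexpressions.
swap(Subject) ->
    re:replace(Subject, "(b)(c)", "\\2\\1", [{return, list}]).

%% Wraps every run of digits in brackets; & stands for the whole match.
bracket_digits(Subject) ->
    re:replace(Subject, "[0-9]+", "[&]", [global, {return, list}]).

%% Replaces every "-" with "+" and returns a binary.
dashes_to_plus(Subject) ->
    re:replace(Subject, "-", "+", [global, {return, binary}]).

%% The default return type is iodata, which can be flattened on demand.
flatten_default(Subject) ->
    iolist_to_binary(re:replace(Subject, "c", "X")).
```

              For example, swap("abcd") returns "acbd" and
              bracket_digits("a1b22") returns "a[1]b[22]".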
379
380       run(Subject, RE) -> {match, Captured} | nomatch
381
382              Types:
383
384                 Subject = iodata() | unicode:charlist()
385                 RE = mp() | iodata()
386                 Captured = [CaptureData]
387                 CaptureData = {integer(), integer()}
388
389              Same as run(Subject,RE,[]).
390
391       run(Subject, RE, Options) ->
392              {match, Captured} | match | nomatch | {error, ErrType}
393
394              Types:
395
396                 Subject = iodata() | unicode:charlist()
397                 RE = mp() | iodata() | unicode:charlist()
398                 Options = [Option]
399                 Option =
400                     anchored | global | notbol | noteol | notempty |
401                     notempty_atstart | report_errors |
402                     {offset, integer() >= 0} |
403                     {match_limit, integer() >= 0} |
404                     {match_limit_recursion, integer() >= 0} |
405                     {newline, NLSpec :: nl_spec()} |
406                     bsr_anycrlf | bsr_unicode |
407                     {capture, ValueSpec} |
408                     {capture, ValueSpec, Type} |
409                     CompileOpt
410                 Type = index | list | binary
411                 ValueSpec =
                     all | all_but_first | all_names | first | none |
                     ValueList
414                 ValueList = [ValueID]
415                 ValueID = integer() | string() | atom()
416                 CompileOpt = compile_option()
417                   See compile/2.
418                 Captured = [CaptureData] | [[CaptureData]]
419                 CaptureData =
420                     {integer(), integer()} | ListConversionData | binary()
421                 ListConversionData =
422                     string() |
423                     {error, string(), binary()} |
424                     {incomplete, string(), binary()}
425                 ErrType =
                     match_limit | match_limit_recursion |
                     {compile, CompileErr}
428                 CompileErr =
429                     {ErrString :: string(), Position :: integer() >= 0}
430
431              Executes    a   regular   expression   matching,   and   returns
432              match/{match, Captured} or nomatch. The regular  expression  can
433              be  specified  either  as iodata() in which case it is automati‐
434              cally compiled (as by compile/2) and executed, or as  a  precom‐
435              piled  mp() in which case it is executed against the subject di‐
436              rectly.
437
438              When compilation is involved, exception badarg is  thrown  if  a
439              compilation  error  occurs.  Call  compile/2  to get information
440              about the location of the error in the regular expression.
441
442              If the regular expression is  previously  compiled,  the  option
443              list can only contain the following options:
444
445                * anchored
446
447                * {capture, ValueSpec}/{capture, ValueSpec, Type}
448
449                * global
450
451                * {match_limit, integer() >= 0}
452
453                * {match_limit_recursion, integer() >= 0}
454
455                * {newline, NLSpec}
456
457                * notbol
458
459                * notempty
460
461                * notempty_atstart
462
463                * noteol
464
465                * {offset, integer() >= 0}
466
467                * report_errors
468
469              Otherwise  all options valid for function compile/2 are also al‐
470              lowed. Options allowed both for compilation and execution  of  a
471              match,  namely  anchored  and {newline, NLSpec}, affect both the
472              compilation and execution if present together with a non-precom‐
473              piled regular expression.
474
475              If  the  regular  expression was previously compiled with option
476              unicode,  Subject  is  to  be  provided  as  a   valid   Unicode
477              charlist(),  otherwise  any  iodata() will do. If compilation is
478              involved and option unicode is specified, both Subject  and  the
479              regular   expression  are  to  be  specified  as  valid  Unicode
480              charlists().
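
              A minimal sketch of matching with option unicode, assuming the
              subject and the pattern are given as lists of code points. The
              module and function names are hypothetical.

```erlang
-module(re_unicode_demo).
-export([find_beta/0]).

%% Searches for the Greek letter beta in a subject of three code points.
%% Both the subject and the pattern are Unicode charlists, so option
%% unicode is passed to the implicit compilation and to the match.
find_beta() ->
    Subject = [16#3B1, 16#3B2, 16#3B3],  % alpha, beta, gamma
    Pattern = [16#3B2],                  % beta
    re:run(Subject, Pattern, [unicode, {capture, first, list}]).
```

              The captured text is returned as a list of code points, here
              {match,[[946]]}.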
481
482              {capture, ValueSpec}/{capture, ValueSpec, Type} defines what  to
483              return  from  the function upon successful matching. The capture
484              tuple can contain both a value specification, telling  which  of
485              the  captured substrings are to be returned, and a type specifi‐
486              cation, telling how captured substrings are to be  returned  (as
487              index  tuples, lists, or binaries). The options are described in
488              detail below.
489
490              If the capture options describe that no substring  capturing  is
491              to  be  done  ({capture, none}), the function returns the single
492              atom match upon successful matching, otherwise the tuple {match,
493              ValueList}. Disabling capturing can be done either by specifying
494              none or an empty list as ValueSpec.
495
496              Option report_errors adds the possibility that an error tuple is
497              returned.   The   tuple   either   indicates  a  matching  error
498              (match_limit or match_limit_recursion), or a compilation  error,
499              where  the  error  tuple  has  the format {error, {compile, Com‐
500              pileErr}}. Notice that if option report_errors is not specified,
501              the function never returns error tuples, but reports compilation
502              errors as a badarg exception and failed matches because  of  ex‐
503              ceeded match limits simply as nomatch.
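
              The return shapes described above can be handled as in the
              following sketch. The module and function names are
              hypothetical, and the returned tags bad_regexp and
              limit_exceeded are invented for this illustration.

```erlang
-module(re_run_result_demo).
-export([has_match/2, named/3, safe_run/2]).

%% With {capture, none} a successful match is the single atom match.
has_match(Subject, RE) ->
    re:run(Subject, RE, [{capture, none}]) =:= match.

%% Returns the substrings for the named subpatterns in Names as binaries.
named(Subject, RE, Names) ->
    re:run(Subject, RE, [{capture, Names, binary}]).

%% With report_errors, compilation and limit errors are returned as
%% tuples instead of a badarg exception or a plain nomatch.
safe_run(Subject, RE) ->
    case re:run(Subject, RE, [report_errors, {capture, first, list}]) of
        {match, [Text]} -> {ok, Text};
        nomatch -> nomatch;
        {error, {compile, {Reason, Pos}}} -> {bad_regexp, Reason, Pos};
        {error, Limit} -> {limit_exceeded, Limit}
    end.
```

              For example, named("2024-05", "(?<year>\\d+)-(?<month>\\d+)",
              ["year", "month"]) returns {match,[<<"2024">>,<<"05">>]}.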
504
505              The following options are relevant for execution:
506
507                anchored:
508                  Limits  run/3 to matching at the first matching position. If
509                  a pattern was compiled with anchored, or turned  out  to  be
510                  anchored  by virtue of its contents, it cannot be made unan‐
511                  chored at matching time, hence there is  no  unanchored  op‐
512                  tion.
513
514                global:
515                  Implements global (repetitive) search (flag g in Perl). Each
                  match is returned as a separate list() containing the
                  specific match and any matching subexpressions (or as
                  specified by option capture). The Captured part of the
                  return value is hence a list() of list()s when this option
                  is specified.
520
521                  The  interaction  of option global with a regular expression
522                  that matches an empty string surprises some users. When  op‐
523                  tion global is specified, run/3 handles empty matches in the
524                  same way as Perl: a zero-length match at any point  is  also
525                  retried  with  options [anchored, notempty_atstart]. If that
526                  search gives a result of length > 0, the result is included.
527                  Example:
528
529                re:run("cat","(|at)",[global]).
530
531                  The following matchings are performed:
532
533                  At offset 0:
                    The regular expression (|at) first matches at the initial
                    position of string cat, giving the result set
                    [{0,0},{0,0}] (the second {0,0} is because of the
                    subexpression marked by the parentheses). As the length of
                    the match is 0, we do not advance to the next position yet.
539
540                  At offset 0 with [anchored, notempty_atstart]:
541                    The search is retried with options [anchored, notempty_at‐
542                    start] at the same position, which does not give  any  in‐
543                    teresting  result of longer length, so the search position
544                    is advanced to the next character (a).
545
546                  At offset 1:
547                    The search results in [{1,0},{1,0}],  so  this  search  is
548                    also repeated with the extra options.
549
550                  At offset 1 with [anchored, notempty_atstart]:
                    Alternative at is found and the result is [{1,2},{1,2}].
                    The result is added to the list of results and the
                    position in the search string is advanced two steps.
554
555                  At offset 3:
556                    The  search  once  again  matches the empty string, giving
557                    [{3,0},{3,0}].
558
                  At offset 3 with [anchored, notempty_atstart]:
560                    This gives no result of length > 0 and we are at the  last
561                    position, so the global search is complete.
562
563                  The result of the call is:
564
565                {match,[[{0,0},{0,0}],[{1,0},{1,0}],[{1,2},{1,2}],[{3,0},{3,0}]]}
566
567                notempty:
568                  An  empty  string  is  not considered to be a valid match if
569                  this option is specified. If alternatives in the pattern ex‐
570                  ist, they are tried. If all the alternatives match the empty
571                  string, the entire match fails.
572
573                  Example:
574
575                  If the following pattern is applied to a string  not  begin‐
576                  ning  with  "a"  or  "b",  it would normally match the empty
577                  string at the start of the subject:
578
579                a?b?
580
581                  With option  notempty,  this  match  is  invalid,  so  run/3
582                  searches  further  into the string for occurrences of "a" or
583                  "b".
584
585                notempty_atstart:
586                  Like notempty, except that an empty string match that is not
587                  at  the start of the subject is permitted. If the pattern is
588                  anchored, such a match can occur only if  the  pattern  con‐
589                  tains \K.
590
591                  Perl  has  no  direct equivalent of notempty or notempty_at‐
592                  start, but it does make a special case of a pattern match of
593                  the empty string within its split() function, and when using
594                  modifier /g. The Perl behavior can be emulated after  match‐
595                  ing  a  null  string  by first trying the match again at the
596                  same offset with notempty_atstart and anchored, and then, if
597                  that fails, by advancing the starting offset (see below) and
598                  trying an ordinary match again.
599
600                notbol:
601                  Specifies that the first character of the subject string  is
602                  not the beginning of a line, so the circumflex metacharacter
603                  is not to match before it. Setting  this  without  multiline
604                  (at compile time) causes circumflex never to match. This op‐
605                  tion only affects the behavior of the circumflex metacharac‐
606                  ter. It does not affect \A.
607
608                noteol:
609                  Specifies  that the end of the subject string is not the end
610                  of a line, so the dollar metacharacter is not  to  match  it
611                  nor  (except in multiline mode) a newline immediately before
612                  it. Setting this without multiline (at compile time)  causes
613                  dollar never to match. This option affects only the behavior
614                  of the dollar metacharacter. It does not affect \Z or \z.
615
616                report_errors:
617                  Gives better control of the error handling  in  run/3.  When
618                  specified,  compilation errors (if the regular expression is
619                  not already compiled) and runtime errors are explicitly  re‐
620                  turned as an error tuple.
621
622                  The following are the possible runtime errors:
623
624                  match_limit:
625                    The PCRE library sets a limit on how many times the inter‐
626                    nal match function can be called. Defaults  to  10,000,000
627                    in   the   library   compiled   for   Erlang.  If  {error,
628                    match_limit} is returned, the execution of the regular ex‐
629                    pression  has  reached  this limit. This is normally to be
630                    regarded as a nomatch, which is the default  return  value
631                    when this occurs, but by specifying report_errors, you are
632                    informed when the match fails because of too many internal
633                    calls.
634
635                  match_limit_recursion:
636                    This error is very similar to match_limit, but occurs when
637                    the internal  match  function  of  PCRE  is  "recursively"
638                    called  more  times  than the match_limit_recursion limit,
639                    which defaults to 10,000,000 as well. Notice that as  long
640                    as the match_limit and match_limit_recursion values are kept
641                    at the default  values,  the  match_limit_recursion  error
642                    cannot  occur, as the match_limit error occurs before that
643                    (each recursive call is also a call, but not  conversely).
644                    Both limits can however be changed, either by setting lim‐
645                    its directly in the regular expression string (see section
646                    PCRE Regular Expression Details) or by specifying options
647                    to run/3.
648
649                  It is important to understand that what is  referred  to  as
650                  "recursion"  when limiting matches is not recursion on the C
651                  stack of the Erlang machine or on the Erlang process  stack.
652                  The  PCRE  version  compiled into the Erlang VM uses machine
653                  "heap" memory to store values that must be kept over  recur‐
654                  sion in regular expression matches.
655
656                {match_limit, integer() >= 0}:
657                  Limits  the  execution time of a match in an implementation-
658                  specific way. It is described as follows by the  PCRE  docu‐
659                  mentation:
660
661                The match_limit field provides a means of preventing PCRE from using
662                up a vast amount of resources when running patterns that are not going
663                to match, but which have a very large number of possibilities in their
664                search trees. The classic example is a pattern that uses nested
665                unlimited repeats.
666
667                Internally, pcre_exec() uses a function called match(), which it calls
668                repeatedly (sometimes recursively). The limit set by match_limit is
669                imposed on the number of times this function is called during a match,
670                which has the effect of limiting the amount of backtracking that can
671                take place. For patterns that are not anchored, the count restarts
672                from zero for each position in the subject string.
673
674                  This  means that runaway regular expression matches can fail
675                  faster if the limit is lowered using this  option.  The  de‐
676                  fault value 10,000,000 is compiled into the Erlang VM.
677
678            Note:
679                This  option does in no way affect the execution of the Erlang
680                VM in terms of "long running BIFs". run/3 always gives control
681                back  to  the  scheduler of Erlang processes at intervals that
682                ensure the real-time properties of the Erlang system.
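
                  As an illustration, match_limit can be combined with
                  report_errors so that a caller can tell "gave up" apart
                  from "did not match". This is a sketch only: the module
                  name, the function classify/3, and the returned atoms are
                  assumptions made for the example, not part of this module.

```erlang
-module(re_limit_sketch).
-export([classify/3]).

%% Hypothetical helper: runs RE against Subject with a caller-chosen
%% match_limit and reports whether the limit was reached.
classify(Subject, RE, Limit) ->
    Options = [{match_limit, Limit}, report_errors, {capture, none}],
    case re:run(Subject, RE, Options) of
        match                -> matched;
        nomatch              -> no_match;
        {error, match_limit} -> gave_up;
        {error, Other}       -> {failed, Other}
    end.
```

                  A pattern with nested unlimited repeats, such as "(a+)+$"
                  run against a long run of "a" followed by "b", is the kind
                  of input that is expected to hit a low limit.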
683
684
685                {match_limit_recursion, integer() >= 0}:
686                  Limits the execution time and memory consumption of a  match
687                  in   an   implementation-specific   way,   very  similar  to
688                  match_limit. It is described as follows by the PCRE documen‐
689                  tation:
690
691                The match_limit_recursion field is similar to match_limit, but instead
692                of limiting the total number of times that match() is called, it
693                limits the depth of recursion. The recursion depth is a smaller number
694                than the total number of calls, because not all calls to match() are
695                recursive. This limit is of use only if it is set smaller than
696                match_limit.
697
698                Limiting the recursion depth limits the amount of machine stack that
699                can be used, or, when PCRE has been compiled to use memory on the heap
700                instead of the stack, the amount of heap memory that can be used.
701
702                  The  Erlang VM uses a PCRE library where heap memory is used
703                  when regular expression match recursion occurs. This  there‐
704                  fore limits the use of machine heap, not C stack.
705
706                  Specifying a lower value can result in matches with deep re‐
707                  cursion failing, when they should have matched:
708
709                1> re:run("aaaaaaaaaaaaaz","(a+)*z").
710                {match,[{0,14},{0,13}]}
711                2> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5}]).
712                nomatch
713                3> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5},report_errors]).
714                {error,match_limit_recursion}
715
716                  This option and option match_limit are only to  be  used  in
717                  rare  cases.  Understanding of the PCRE library internals is
718                  recommended before tampering with these limits.
719
720                {offset, integer() >= 0}:
721                  Start matching at the offset  (position)  specified  in  the
722                  subject  string.  The  offset is zero-based, so that the de‐
723                  fault is {offset,0} (all of the subject string).
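
                  A short sketch of the offset option (the module name is
                  hypothetical). Notice that the returned index is still
                  relative to the start of the subject, not to the offset.

```erlang
-module(re_offset_sketch).
-export([demo/0]).

demo() ->
    Subject = "abcabc",
    %% Default: matching starts at offset 0.
    {match, [{0,3}]} = re:run(Subject, "abc", []),
    %% Starting at offset 1 skips the first occurrence.
    {match, [{3,3}]} = re:run(Subject, "abc", [{offset, 1}]),
    ok.
```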
724
725                {newline, NLSpec}:
726                  Overrides the default definition of a newline in the subject
727                  string, which is LF (ASCII 10) in Erlang.
728
729                  cr:
730                    Newline is indicated by a single character CR (ASCII 13).
731
732                  lf:
733                    Newline  is indicated by a single character LF (ASCII 10),
734                    the default.
735
736                  crlf:
737                    Newline is indicated by the two-character CRLF  (ASCII  13
738                    followed by ASCII 10) sequence.
739
740                  anycrlf:
741                    Any of the three preceding sequences is recognized.
742
743                  any:
744                    Any  of  the  newline sequences above, and the Unicode se‐
745                    quences VT (vertical tab, U+000B), FF (formfeed,  U+000C),
746                    NEL  (next line, U+0085), LS (line separator, U+2028), and
747                    PS (paragraph separator, U+2029).
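
                  The effect of the newline option can be sketched with a
                  subject that contains a lone CR. The module name is
                  hypothetical; multiline is a compile option that is passed
                  here because the regular expression is not precompiled.

```erlang
-module(re_newline_sketch).
-export([demo/0]).

demo() ->
    Subject = "a\rb",
    %% Default newline is LF, so CR does not start a new line.
    nomatch = re:run(Subject, "^b", [multiline]),
    %% With CR as the newline, ^ matches right after it.
    {match, [{2,1}]} = re:run(Subject, "^b", [multiline, {newline, cr}]),
    %% anycrlf accepts CR, LF, or CRLF.
    {match, [{2,1}]} = re:run(Subject, "^b", [multiline, {newline, anycrlf}]),
    ok.
```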
748
749                bsr_anycrlf:
750                  Specifies that \R is to match only the CR, LF,
751                  or  CRLF sequences, not the Unicode-specific newline charac‐
752                  ters. (Overrides the compilation option.)
753
754                bsr_unicode:
755                  Specifies that \R is to match all the Unicode
756                  newline characters (including CRLF, and so on, the default).
757                  (Overrides the compilation option.)
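
                  A sketch of the difference between the two \R settings,
                  using a form feed as a newline character that only the
                  Unicode set recognizes. The module name is hypothetical.

```erlang
-module(re_bsr_sketch).
-export([demo/0]).

demo() ->
    %% Default (bsr_unicode): \R also matches form feed.
    {match, [{0,3}]} = re:run("a\fb", "a\\Rb", []),
    %% bsr_anycrlf: form feed is no longer matched by \R.
    nomatch = re:run("a\fb", "a\\Rb", [bsr_anycrlf]),
    %% CRLF is still matched, as one two-character sequence.
    {match, [{0,4}]} = re:run("a\r\nb", "a\\Rb", [bsr_anycrlf]),
    ok.
```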
758
759                {capture, ValueSpec}/{capture, ValueSpec, Type}:
760                  Specifies which captured substrings are returned and in what
761                  format.  By default, run/3 captures all of the matching part
762                  of the subject string and all capturing subpatterns (all of the
763                  pattern  is automatically captured). The default return type
764                  is (zero-based) indexes of the captured parts of the string,
765                  specified  as  {Offset,Length} pairs (the index Type of cap‐
766                  turing).
767
768                  As an example of the default behavior,  the  following  call
769                  returns,  as  first  and  only captured string, the matching
770                  part of the subject ("abcd" in the middle) as an index  pair
771                  {3,4},  where character positions are zero-based, just as in
772                  offsets:
773
774                re:run("ABCabcdABC","abcd",[]).
775
776                  The return value of this call is:
777
778                {match,[{3,4}]}
779
780                  Another (and quite common) case is where the regular expres‐
781                  sion matches all of the subject:
782
783                re:run("ABCabcdABC",".*abcd.*",[]).
784
785                  Here  the return value correspondingly points out all of the
786                  string, beginning at index 0, and it is 10 characters long:
787
788                {match,[{0,10}]}
789
790                  If the regular expression  contains  capturing  subpatterns,
791                  like in:
792
793                re:run("ABCabcdABC",".*(abcd).*",[]).
794
795                  all  of the matched subject is captured, as well as the cap‐
796                  tured substrings:
797
798                {match,[{0,10},{3,4}]}
799
800                  The complete matching pattern always gives the first  return
801                  value in the list and the remaining subpatterns are added in
802                  the order they occurred in the regular expression.
803
804                  The capture tuple is built up as follows:
805
806                  ValueSpec:
807                    Specifies which captured (sub)patterns are to be returned.
808                    ValueSpec  can  either  be an atom describing a predefined
809                    set of return values, or a list containing the indexes  or
810                    the names of specific subpatterns to return.
811
812                    The following are the predefined sets of subpatterns:
813
814                    all:
815                      All captured subpatterns including the complete matching
816                      string. This is the default.
817
818                    all_names:
819                      All named subpatterns in the regular expression, as if a
820                      list() of all the names in alphabetical order was speci‐
821                      fied. The list of all names can also be  retrieved  with
822                      inspect/2.
823
824                    first:
825                      Only  the first captured subpattern, which is always the
826                      complete matching part of the  subject.  All  explicitly
827                      captured subpatterns are discarded.
828
829                    all_but_first:
830                      All  but the first matching subpattern, that is, all ex‐
831                      plicitly captured  subpatterns,  but  not  the  complete
832                      matching  part  of the subject string. This is useful if
833                      the regular expression as a whole matches a  large  part
834                      of the subject, but the part you are interested in is in
835                      an explicitly captured subpattern. If the return type is
836                      list  or  binary,  not returning subpatterns you are not
837                      interested in is a good way to optimize.
838
839                    none:
840                      Returns no matching subpatterns, gives the  single  atom
841                      match  as the return value of the function when matching
842                      successfully instead  of  the  {match,  list()}  return.
843                      Specifying an empty list gives the same behavior.
844
845                    The value list is a list of indexes for the subpatterns to
846                    return, where index 0 is for all of the pattern, and 1  is
847                    for the first explicit capturing subpattern in the regular
848                    expression, and so on. When using named  captured  subpat‐
849                    terns  (see  below) in the regular expression, one can use
850                    atom()s or string()s to specify the subpatterns to be  re‐
851                    turned. For example, consider the regular expression:
852
853                  ".*(abcd).*"
854
855                    matched  against  string  "ABCabcdABC", capturing only the
856                    "abcd" part (the first explicit subpattern):
857
858                  re:run("ABCabcdABC",".*(abcd).*",[{capture,[1]}]).
859
860                    The call gives the following result, as the first  explic‐
861                    itly  captured  subpattern is "(abcd)", matching "abcd" in
862                    the subject, at (zero-based) position 3, of length 4:
863
864                  {match,[{3,4}]}
865
866                    Consider the same regular expression, but with the subpat‐
867                    tern explicitly named 'FOO':
868
869                  ".*(?<FOO>abcd).*"
870
871                    With this expression, we could still give the index of the
872                    subpattern with the following call:
873
874                  re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,[1]}]).
875
876                    giving the same result as before. But, as  the  subpattern
877                    is named, we can also specify its name in the value list:
878
879                  re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,['FOO']}]).
880
881                    This  would  give the same result as the earlier examples,
882                    namely:
883
884                  {match,[{3,4}]}
885
886                    The values list can specify indexes or names  not  present
887                    in the regular expression, in which case the return values
888                    vary depending on the type. If the type is index, the  tu‐
889                    ple  {-1,0}  is  returned for values with no corresponding
890                    subpattern in the regular expression, but  for  the  other
891                    types  (binary  and list), the values are the empty binary
892                    or list, respectively.
893
894                  Type:
895                    Optionally specifies how captured substrings are to be re‐
896                    turned. If omitted, the default of index is used.
897
898                    Type can be one of the following:
899
900                    index:
901                      Returns  captured  substrings  as  pairs of byte indexes
902                      into the subject  string  and  length  of  the  matching
903                      string  in  the  subject  (as  if the subject string was
904                      flattened   with   erlang:iolist_to_binary/1   or   uni‐
905                      code:characters_to_binary/2   before  matching).  Notice
906                      that option unicode results in byte-oriented indexes  in
907                      a  (possibly virtual) UTF-8 encoded binary. A byte index
908                      tuple {0,2} can therefore represent one or  two  charac‐
909                      ters  when  unicode is in effect. This can seem counter-
910                      intuitive, but has been deemed the  most  effective  and
911                      useful  way to do it. To return lists instead can result
912                      in simpler code if that is desired. This return type  is
913                      the default.
914
915                    list:
916                      Returns  matching substrings as lists of characters (Er‐
917                      lang string()s). If option unicode is used  in  combina‐
918                      tion  with  the \C sequence in the regular expression, a
919                      captured subpattern can contain bytes that are not valid
920                      UTF-8  (\C  matches bytes regardless of character encod‐
921                      ing). In that case the list capturing can result in  the
922                      same  types  of tuples that unicode:characters_to_list/2
923                      can return, namely three-tuples with tag  incomplete  or
924                      error, the successfully converted characters and the in‐
925                      valid UTF-8 tail of the conversion as a binary. The best
926                      strategy  is to avoid using the \C sequence when captur‐
927                      ing lists.
928
929                    binary:
930                      Returns matching substrings as binaries. If option  uni‐
931                      code is used, these binaries are in UTF-8. If the \C se‐
932                      quence is used together with unicode, the  binaries  can
933                      be invalid UTF-8.
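
                  The three return types can be compared side by side. The
                  sketch below reuses the regular expression from the earlier
                  examples; the module name is hypothetical. The last call
                  illustrates the byte-oriented indexes under option unicode.

```erlang
-module(re_type_sketch).
-export([demo/0]).

demo() ->
    Subject = "ABCabcdABC",
    RE = ".*(abcd).*",
    {match, [{3,4}]}      = re:run(Subject, RE, [{capture, [1], index}]),
    {match, ["abcd"]}     = re:run(Subject, RE, [{capture, [1], list}]),
    {match, [<<"abcd">>]} = re:run(Subject, RE, [{capture, [1], binary}]),
    %% U+00E5 occupies two bytes in UTF-8, so "b" is found at byte index 2
    %% although it is the second character of the subject.
    {match, [{2,1}]} = re:run([16#E5, $b], "b", [unicode]),
    ok.
```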
934
935                  In  general,  subpatterns  that were not assigned a value in
936                  the match are returned as the tuple {-1,0} when type is  in‐
937                  dex. Unassigned subpatterns are returned as the empty binary
938                  or list, respectively, for other return types. Consider  the
939                  following regular expression:
940
941                ".*((?<FOO>abdd)|a(..d)).*"
942
943                  There  are three explicitly capturing subpatterns, where the
944                  opening parenthesis position determines the order in the re‐
945                  sult,  hence  ((?<FOO>abdd)|a(..d))  is  subpattern index 1,
946                  (?<FOO>abdd) is subpattern index 2, and (..d) is  subpattern
947                  index 3. When matched against the following string:
948
949                "ABCabcdABC"
950
951                  the  subpattern  at index 2 does not match, as "abdd" is not
952                  present in the string, but the complete pattern matches (be‐
953                  cause  of the alternative a(..d)). The subpattern at index 2
954                  is therefore unassigned and the default return value is:
955
956                {match,[{0,10},{3,4},{-1,0},{4,3}]}
957
958                  Setting the capture Type to binary gives:
959
960                {match,[<<"ABCabcdABC">>,<<"abcd">>,<<>>,<<"bcd">>]}
961
962                  Here the empty binary (<<>>) represents the unassigned  sub‐
963                  pattern.  In  the  binary  case,  some information about the
964                  matching is therefore lost, as <<>> can  also  be  an  empty
965                  string captured.
966
967                  If  differentiation  between  empty matches and non-existing
968                  subpatterns is necessary, use the type index and do the con‐
969                  version to the final type in Erlang code.
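
                  One possible way to do that conversion is sketched below.
                  The module, the helper captures/2, and the use of the atom
                  undefined for unassigned subpatterns are assumptions made
                  for this example. It also assumes that option unicode is
                  not used, so that the subject can be flattened with
                  iolist_to_binary/1.

```erlang
-module(re_capture_sketch).
-export([captures/2]).

%% Hypothetical helper: returns captured parts as binaries, and the atom
%% 'undefined' for subpatterns that were not assigned a value.
captures(Subject, RE) ->
    Bin = iolist_to_binary(Subject),
    case re:run(Bin, RE, [{capture, all, index}]) of
        {match, Indexes} -> {match, [part(Bin, I) || I <- Indexes]};
        nomatch          -> nomatch
    end.

%% {-1,0} marks an unassigned subpattern when the type is index.
part(_Bin, {-1, 0})     -> undefined;
part(Bin, {Start, Len}) -> binary:part(Bin, Start, Len).
```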
970
971                  When  option global is specified, the capture specification
972                  affects each match separately, so that:
973
974                re:run("cacb","c(a|b)",[global,{capture,[1],list}]).
975
976                  gives
977
978                {match,[["a"],["b"]]}
979
980              For descriptions of options that affect only the compilation
981              step, see compile/2.
982
983       split(Subject, RE) -> SplitList
984
985              Types:
986
987                 Subject = iodata() | unicode:charlist()
988                 RE = mp() | iodata()
989                 SplitList = [iodata() | unicode:charlist()]
990
991              Same as split(Subject, RE, []).
992
993       split(Subject, RE, Options) -> SplitList
994
995              Types:
996
997                 Subject = iodata() | unicode:charlist()
998                 RE = mp() | iodata() | unicode:charlist()
999                 Options = [Option]
1000                 Option =
1001                     anchored | notbol | noteol | notempty |
1002                     notempty_atstart |
1003                     {offset, integer() >= 0} |
1004                     {newline, nl_spec()} |
1005                     {match_limit, integer() >= 0} |
1006                     {match_limit_recursion, integer() >= 0} |
1007                     bsr_anycrlf | bsr_unicode |
1008                     {return, ReturnType} |
1009                     {parts, NumParts} |
1010                     group | trim | CompileOpt
1011                 NumParts = integer() >= 0 | infinity
1012                 ReturnType = iodata | list | binary
1013                 CompileOpt = compile_option()
1014                   See compile/2.
1015                 SplitList = [RetData] | [GroupedRetData]
1016                 GroupedRetData = [RetData]
1017                 RetData = iodata() | unicode:charlist() | binary() | list()
1018
1019              Splits the input into parts by finding tokens according  to  the
1020              regular  expression supplied. The splitting is basically done by
1021              running a global regular expression match and dividing the  ini‐
1022              tial  string  wherever  a match occurs. The matching part of the
1023              string is removed from the output.
1024
1025              As in run/3, an mp() compiled with option unicode requires  Sub‐
1026              ject  to be a Unicode charlist(). If compilation is done implic‐
1027              itly and the unicode compilation option  is  specified  to  this
1028              function,  both  the  regular  expression  and Subject are to be
1029              specified as valid Unicode charlist()s.
1030
1031              The result is given as a list of "strings", the  preferred  data
1032              type specified in option return (default iodata).
1033
1034              If  subexpressions  are specified in the regular expression, the
1035              matching subexpressions are returned in the  resulting  list  as
1036              well. For example:
1037
1038              re:split("Erlang","[ln]",[{return,list}]).
1039
1040              gives
1041
1042              ["Er","a","g"]
1043
1044              while
1045
1046              re:split("Erlang","([ln])",[{return,list}]).
1047
1048              gives
1049
1050              ["Er","l","a","n","g"]
1051
1052              The  text  matching the subexpression (marked by the parentheses
1053              in the regular expression) is inserted in the result list  where
1054              it  was  found.  This  means  that concatenating the result of a
1055              split where the whole regular expression is a single  subexpres‐
1056              sion  (as  in  the  last example) always results in the original
1057              string.
1058
1059              As there is no matching subexpression for the last part  in  the
1060              example  (the  "g"), nothing is inserted after that. To make the
1061              group of strings and the parts matching the subexpressions  more
1062              obvious,  one  can  use  option group, which groups together the
1063              part of the subject string with the parts  matching  the  subex‐
1064              pressions when the string was split:
1065
1066              re:split("Erlang","([ln])",[{return,list},group]).
1067
1068              gives
1069
1070              [["Er","l"],["a","n"],["g"]]
1071
1072              Here  the regular expression first matched the "l", causing "Er"
1073              to be the first part in the result. When the regular  expression
1074              matched,  the  (only) subexpression was bound to the "l", so the
1075              "l" is inserted in the group together with "Er". The next  match
1076              is  of  the "n", making "a" the next part to be returned. As the
1077              subexpression is bound to substring "n" in this case, the "n" is
1078              inserted into this group. The last group consists of the remain‐
1079              ing string, as no more matches are found.
1080
1081              By default,  all  parts  of  the  string,  including  the  empty
1082              strings, are returned from the function, for example:
1083
1084              re:split("Erlang","[lg]",[{return,list}]).
1085
1086              gives
1087
1088              ["Er","an",[]]
1089
1090              as  the  matching  of the "g" at the end of the string leaves an
1091              empty rest, which is also returned. This behavior  differs  from
1092              the  default behavior of the split function in Perl, where empty
1093              strings at the end are by default removed. To get the "trimming"
1094              default behavior of Perl, specify trim as an option:
1095
1096              re:split("Erlang","[lg]",[{return,list},trim]).
1097
1098              gives
1099
1100              ["Er","an"]
1101
1102              The trim option says: "give me as many parts as possible ex‐
1103              cept the empty ones at the end", which sometimes can be useful. You can
1104              also specify how many parts you want, by specifying {parts,N}:
1105
1106              re:split("Erlang","[lg]",[{return,list},{parts,2}]).
1107
1108              gives
1109
1110              ["Er","ang"]
1111
1112              Notice  that  the last part is "ang", not "an", as splitting was
1113              specified into two parts, and the splitting  stops  when  enough
1114              parts  are  given,  which is why the result differs from that of
1115              trim.
1116
1117              More than three parts are not possible with this input data, so
1118
1119              re:split("Erlang","[lg]",[{return,list},{parts,4}]).
1120
1121              gives the same result as the default, which is to be  viewed  as
1122              "an infinite number of parts".
1123
1124              Specifying 0 as the number of parts gives the same effect as op‐
1125              tion trim. If subexpressions are captured, empty  subexpressions
1126              matched  at the end are also stripped from the result if trim or
1127              {parts,0} is specified.
1128
1129              The trim behavior  corresponds  exactly  to  the  Perl  default.
1130              {parts,N}, where N is a positive integer, corresponds exactly to
1131              the Perl behavior with a positive numerical third parameter. The
1132              default  behavior  of  split/3  corresponds to the Perl behavior
1133              when a negative integer is specified as the third parameter  for
1134              the Perl routine.
1135
1136              Summary of options not previously described for function run/3:
1137
1138                {return,ReturnType}:
1139                  Specifies how the parts of the original string are presented
1140                  in the result list. Valid types:
1141
1142                  iodata:
1143                    The variant of iodata() that gives the  least  copying  of
1144                    data  with the current implementation (often a binary, but
1145                    do not depend on it).
1146
1147                  binary:
1148                    All parts returned as binaries.
1149
1150                  list:
1151                    All parts returned as lists of characters ("strings").
1152
1153                group:
1154                  Groups together the part of the string with the parts of the
1155                  string  matching  the  subexpressions of the regular expres‐
1156                  sion.
1157
1158                  The return value from the function is in this case a  list()
1159                  of  list()s.  Each sublist begins with the string picked out
1160                  of the subject string, followed by the parts  matching  each
1161                  of  the subexpressions in order of occurrence in the regular
1162                  expression.
1163
1164                {parts,N}:
1165                  Specifies the number of parts the subject string  is  to  be
1166                  split into.
1167
1168                  The  number  of parts is to be a positive integer for a spe‐
1169                  cific maximum number of parts, and infinity for the  maximum
1170                  number of parts possible (the default). Specifying {parts,0}
1171                  gives as many parts as possible disregarding empty parts  at
1172                  the end, the same as specifying trim.
1173
1174                trim:
1175                  Specifies that empty parts at the end of the result list are
1176                  to be disregarded. The same as  specifying  {parts,0}.  This
1177                  corresponds  to  the  default behavior of the split built-in
1178                  function in Perl.
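
              The options above can be compared on a subject that ends in
              empty parts. This is a sketch with a hypothetical module name;
              the subject "a,b,," is chosen only for illustration.

```erlang
-module(re_split_sketch).
-export([demo/0]).

demo() ->
    Subject = "a,b,,",
    %% Default: trailing empty parts are kept.
    ["a", "b", [], []] = re:split(Subject, ",", [{return, list}]),
    %% trim and {parts,0} both drop the trailing empty parts.
    ["a", "b"] = re:split(Subject, ",", [{return, list}, trim]),
    ["a", "b"] = re:split(Subject, ",", [{return, list}, {parts, 0}]),
    %% {parts,2} stops splitting once two parts exist.
    ["a", "b,,"] = re:split(Subject, ",", [{return, list}, {parts, 2}]),
    %% The return type changes only the representation of each part.
    [<<"a">>, <<"b">>, <<>>, <<>>] = re:split(Subject, ",", [{return, binary}]),
    ok.
```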
1179

PERL-LIKE REGULAR EXPRESSION SYNTAX

1181       The following sections contain reference material for the  regular  ex‐
1182       pressions  used  by  this  module. The information is based on the PCRE
1183       documentation, with changes where this module  behaves  differently  to
1184       the PCRE library.
1185

PCRE REGULAR EXPRESSION DETAILS

1187       The  syntax  and semantics of the regular expressions supported by PCRE
1188       are described in detail in the following sections. Perl's  regular  ex‐
1189       pressions  are  described in its own documentation, and regular expres‐
1190       sions in general are covered in many books, some with copious examples.
1191       Jeffrey   Friedl's   "Mastering   Regular  Expressions",  published  by
1192       O'Reilly, covers regular expressions in great detail. This  description
1193       of the PCRE regular expressions is intended as reference material.
1194
1195       The reference material is divided into the following sections:
1196
1197         * Special Start-of-Pattern Items
1198
1199         * Characters and Metacharacters
1200
1201         * Backslash
1202
1203         * Circumflex and Dollar
1204
1205         * Full Stop (Period, Dot) and \N
1206
1207         * Matching a Single Data Unit
1208
1209         * Square Brackets and Character Classes
1210
1211         * Posix Character Classes
1212
1213         * Vertical Bar
1214
1215         * Internal Option Setting
1216
1217         * Subpatterns
1218
1219         * Duplicate Subpattern Numbers
1220
1221         * Named Subpatterns
1222
1223         * Repetition
1224
1225         * Atomic Grouping and Possessive Quantifiers
1226
1227         * Back References
1228
1229         * Assertions
1230
1231         * Conditional Subpatterns
1232
1233         * Comments
1234
1235         * Recursive Patterns
1236
1237         * Subpatterns as Subroutines
1238
1239         * Oniguruma Subroutine Syntax
1240
1241         * Backtracking Control
1242

SPECIAL START-OF-PATTERN ITEMS

1244       Some options that can be passed to compile/2 can also be set by special
1245       items at the start of a pattern. These are not Perl-compatible, but are
1246       provided  to  make  these options accessible to pattern writers who are
1247       not able to change the program that processes the pattern.  Any  number
1248       of  these  items can appear, but they must all be together right at the
1249       start of the pattern string, and the letters must be in upper case.
1250
1251       UTF Support
1252
1253       Unicode support is basically UTF-8 based. To  use  Unicode  characters,
1254       you  either call compile/2 or run/3 with option unicode, or the pattern
1255       must start with one of these special sequences:
1256
1257       (*UTF8)
1258       (*UTF)
1259
1260       Both options have the same effect: the input string is interpreted as
1261       UTF-8. Notice that with these instructions, the automatic conversion of
1262       lists to UTF-8 is not performed by the re functions.  Therefore,  using
1263       these  sequences  is  not  recommended. Add option unicode when running
1264       compile/2 instead.
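
       A minimal sketch of the recommended approach, compiling with option
       unicode and passing the subject as a Unicode charlist. The module
       name is hypothetical, and U+20AC is used only as an example of a
       code point above 255.

```erlang
-module(re_unicode_sketch).
-export([demo/0]).

demo() ->
    Euro = [16#20AC],                      % one character, three UTF-8 bytes
    {ok, MP} = re:compile(".", [unicode]),
    %% Index results count bytes of the UTF-8 encoding.
    {match, [{0,3}]} = re:run(Euro, MP),
    %% List results are converted back to characters.
    {match, [Euro]} = re:run(Euro, MP, [{capture, first, list}]),
    ok.
```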
1265
1266       Some applications that allow their users to supply patterns may want to
1267       restrict them to non-UTF data for security reasons. If option never_utf
1268       is set at compile time, (*UTF), and so on, are not allowed,  and  their
1269       appearance causes an error.
1270
1271       Unicode Property Support
1272
1273       The  following is another special sequence that can appear at the start
1274       of a pattern:
1275
1276       (*UCP)
1277
1278       This has the same effect as setting option  ucp:  it  causes  sequences
1279       such  as  \d  and  \w  to use Unicode properties to determine character
1280       types, instead of recognizing only characters with codes < 256  through
1281       a lookup table.
1282
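       As a rough illustration, the following sketch compares the default
       behavior, option ucp, and the (*UCP) item for a character with a code
       point > 255. The module re_ucp_demo and its function names are
       assumptions made for the example.

```erlang
-module(re_ucp_demo).
-export([default/0, option/0, item/0]).

%% Hypothetical demo module, not part of OTP.

%% U+03BB (Greek small letter lambda), given as a one-element list.
subject() -> [16#3BB].

%% Lookup-table behavior: code points > 255 never match \w.
default() ->
    re:run(subject(), "^\\w$", [unicode, {capture, none}]).

%% Option ucp makes \w consult Unicode properties.
option() ->
    re:run(subject(), "^\\w$", [unicode, ucp, {capture, none}]).

%% The same request, made from inside the pattern.
item() ->
    re:run(subject(), "(*UCP)^\\w$", [unicode, {capture, none}]).
```

       default() is expected to return nomatch, while option() and item()
       both return match.
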
1283       Disabling Startup Optimizations
1284
1285       If  a  pattern  starts  with (*NO_START_OPT), it has the same effect as
1286       setting option no_start_optimize at compile time.
1287
1288       Newline Conventions
1289
1290       PCRE supports five conventions for indicating line breaks in strings: a
1291       single  CR (carriage return) character, a single LF (line feed) charac‐
1292       ter, the two-character sequence CRLF, any of the three  preceding,  and
1293       any Unicode newline sequence.
1294
1295       A newline convention can also be specified by starting a pattern string
1296       with one of the following five sequences:
1297
1298         (*CR):
1299           Carriage return
1300
1301         (*LF):
1302           Line feed
1303
1304         (*CRLF):
1305           Carriage return followed by line feed
1306
1307         (*ANYCRLF):
1308           Any of the three above
1309
1310         (*ANY):
1311           All Unicode newline sequences
1312
1313       These override the default and the options specified to compile/2.  For
1314       example, the following pattern changes the convention to CR:
1315
1316       (*CR)a.b
1317
1318       This  pattern  matches a\nb, as LF is no longer a newline. If more than
1319       one of these sequences is present, the last one is used.
1320
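       The (*CR) example above can be tried out with a sketch like the
       following. The module re_newline_demo and its function names are
       hypothetical; the default newline convention is assumed to be LF.

```erlang
-module(re_newline_demo).
-export([lf_default/0, cr_item/0, cr_subject/0, last_wins/0]).

%% Hypothetical demo module, not part of OTP.

run(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]).

%% Default convention (LF): "." does not match the line feed.
lf_default() -> run("a\nb", "a.b").

%% With (*CR), LF is an ordinary character, so "." matches it.
cr_item() -> run("a\nb", "(*CR)a.b").

%% With (*CR), CR is the newline, so "." does not match it.
cr_subject() -> run("a\rb", "(*CR)a.b").

%% Of several items, the last one is used: LF is the newline again.
last_wins() -> run("a\nb", "(*CR)(*LF)a.b").
```

       Expected results: lf_default() gives nomatch, cr_item() gives match,
       and both cr_subject() and last_wins() give nomatch.
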
1321       The newline convention affects where the circumflex and  dollar  asser‐
1322       tions are true. It also affects the interpretation of the dot metachar‐
1323       acter when dotall is not set, and the behavior of \N. However, it  does
1324       not affect what the \R escape sequence matches. By default, this is any
1325       Unicode newline sequence, for Perl compatibility. However, this can  be
1326       changed;  see  the  description  of  \R in section Newline Sequences. A
1327       change of the \R setting can be combined with a change of  the  newline
1328       convention.
1329
1330       Setting Match and Recursion Limits
1331
1332       The caller of run/3 can set a limit on the number of times the internal
1333       match() function is called and on the maximum depth of recursive calls.
1334       These  facilities  are  provided to catch runaway matches that are pro‐
1335       voked by patterns with huge matching trees (a typical example is a pat‐
1336       tern  with nested unlimited repeats) and to avoid running out of system
1337       stack by too much recursion. When  one  of  these  limits  is  reached,
1338       pcre_exec()  gives an error return. The limits can also be set by items
1339       at the start of the pattern of the following forms:
1340
1341       (*LIMIT_MATCH=d)
1342       (*LIMIT_RECURSION=d)
1343
1344       Here d is any number of decimal digits. However, the value of the  set‐
1345       ting  must  be less than the value set by the caller of run/3 for it to
1346       have any effect. That is, the pattern writer can lower the limit set by
1347       the  programmer, but not raise it. If there is more than one setting of
1348       one of these limits, the lower value is used.
1349
1350       The default value for both the limits is 10,000,000 in the  Erlang  VM.
1351       Notice  that the recursion limit does not affect the stack depth of the
1352       VM, as PCRE for Erlang is compiled in such a way that the  match  func‐
1353       tion never does recursion on the C stack.
1354
1355       Note  that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of
1356       the limits set by the caller, not increase them.
1357
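       A small sketch of a pattern that lowers its own match limit follows.
       The module re_limit_demo, the limit of 100, and the 30-character
       subject are arbitrary choices for the illustration; option
       report_errors of run/3 is used to make the limit visible.

```erlang
-module(re_limit_demo).
-export([reported/0, silent/0]).

%% Hypothetical demo module, not part of OTP.

%% Nested unlimited repeats that can never succeed on a run of "a"s.
pattern() -> "(*LIMIT_MATCH=100)(a+)+[bc]".

subject() -> lists:duplicate(30, $a).

%% With report_errors, reaching the limit is returned as an error tuple.
reported() ->
    re:run(subject(), pattern(), [report_errors]).

%% Without report_errors, reaching the limit is reported as nomatch.
silent() ->
    re:run(subject(), pattern(), []).
```

       reported() is expected to return {error,match_limit}, and silent() to
       return nomatch, both long before the default limit would be reached.
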

CHARACTERS AND METACHARACTERS

1359       A regular expression is a pattern that is  matched  against  a  subject
1360       string  from  left  to right. Most characters stand for themselves in a
1361       pattern and match the corresponding characters in  the  subject.  As  a
1362       trivial  example,  the following pattern matches a portion of a subject
1363       string that is identical to itself:
1364
1365       The quick brown fox
1366
1367       When caseless matching is  specified  (option  caseless),  letters  are
1368       matched independently of case.
1369
1370       The  power of regular expressions comes from the ability to include al‐
1371       ternatives and repetitions in the pattern. These  are  encoded  in  the
1372       pattern by the use of metacharacters, which do not stand for themselves
1373       but instead are interpreted in some special way.
1374
1375       Two sets of metacharacters exist: those that are recognized anywhere in
1376       the  pattern  except  within square brackets, and those that are recog‐
1377       nized within square brackets. Outside square brackets, the  metacharac‐
1378       ters are as follows:
1379
1380         \:
1381           General escape character with many uses
1382
1383         ^:
1384           Assert start of string (or line, in multiline mode)
1385
1386         $:
1387           Assert end of string (or line, in multiline mode)
1388
1389         .:
1390           Match any character except newline (by default)
1391
1392         [:
1393           Start character class definition
1394
1395         |:
1396           Start of alternative branch
1397
1398         (:
1399           Start subpattern
1400
1401         ):
1402           End subpattern
1403
1404         ?:
1405           Extends  the  meaning of (, also 0 or 1 quantifier, also quantifier
1406           minimizer
1407
1408         *:
1409           0 or more quantifier
1410
1411         +:
1412           1 or more quantifier, also "possessive quantifier"
1413
1414         {:
1415           Start min/max quantifier
1416
1417       Part of a pattern within square brackets is called a "character class".
1418       The following are the only metacharacters in a character class:
1419
1420         \:
1421           General escape character
1422
1423         ^:
1424           Negate the class, but only if the first character
1425
1426         -:
1427           Indicates character range
1428
1429         [:
1430           Posix character class (only if followed by Posix syntax)
1431
1432         ]:
1433           Terminates the character class
1434
1435       The following sections describe the use of each metacharacter.
1436

BACKSLASH

1438       The  backslash  character  has many uses. First, if it is followed by a
1439       character that is not a number or a letter, it takes away  any  special
1440       meaning  that  a character can have. This use of backslash as an escape
1441       character applies both inside and outside character classes.
1442
1443       For example, if you want to match a * character, you write  \*  in  the
1444       pattern.  This escaping action applies if the following character would
1445       otherwise be interpreted as a metacharacter, so it is  always  safe  to
1446       precede a non-alphanumeric with backslash to specify that it stands for
1447       itself. In particular, if you want to match a backslash, write \\.
1448
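       The difference between an escaped and an unescaped metacharacter can
       be sketched as follows. Notice the doubled backslashes required by
       Erlang string literals. The module re_escape_demo and its function
       names are made up for the example.

```erlang
-module(re_escape_demo).
-export([unescaped/0, escaped/0, backslash/0]).

%% Hypothetical demo module, not part of OTP.

%% "2*3" means "zero or more 2s, then 3", so only "3" is matched.
unescaped() ->
    re:run("2*3", "2*3", [{capture, first, list}]).

%% The pattern 2\*3 matches the three literal characters.
escaped() ->
    re:run("2*3", "2\\*3", [{capture, first, list}]).

%% The pattern a\\b matches a single literal backslash between a and b.
backslash() ->
    re:run("a\\b", "a\\\\b", [{capture, none}]).
```

       unescaped() is expected to return {match,["3"]}, escaped() returns
       {match,["2*3"]}, and backslash() returns match.
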
1449       In unicode mode, only ASCII numbers and letters have any special  mean‐
1450       ing after a backslash. All other characters (in particular, those whose
1451       code points are > 127) are treated as literals.
1452
1453       If a pattern is compiled with option extended, whitespace in  the  pat‐
1454       tern  (other than in a character class) and characters between a # out‐
1455       side a character class and the next newline are  ignored.  An  escaping
1456       backslash can be used to include a whitespace or # character as part of
1457       the pattern.
1458
1459       To remove the special meaning from a sequence of characters,  put  them
1460       between \Q and \E. This is different from Perl in that $ and @ are han‐
1461       dled as literals in \Q...\E sequences in PCRE,  while  $  and  @  cause
1462       variable interpolation in Perl. Notice the following examples:
1463
1464       Pattern            PCRE matches   Perl matches
1465
1466       \Qabc$xyz\E        abc$xyz        abc followed by the contents of $xyz
1467       \Qabc\$xyz\E       abc\$xyz       abc\$xyz
1468       \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1469
1470       The  \Q...\E  sequence  is recognized both inside and outside character
1471       classes. An isolated \E that is not preceded by \Q is ignored. If \Q is
1472       not  followed  by  \E  later in the pattern, the literal interpretation
1473       continues to the end of the pattern (that is,  \E  is  assumed  at  the
1474       end).  If  the  isolated \Q is inside a character class, this causes an
1475       error, as the character class is not terminated.
1476
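       A minimal sketch of \Q...\E quoting, using the first row of the table
       above. The module re_quote_demo and its function names are assumptions
       made for the illustration.

```erlang
-module(re_quote_demo).
-export([dollar/0, dot/1]).

%% Hypothetical demo module, not part of OTP.

%% Inside \Q...\E the "$" is an ordinary character.
dollar() ->
    re:run("price: abc$xyz", "\\Qabc$xyz\\E", [{capture, first, list}]).

%% Inside \Q...\E the "." matches only a literal dot.
dot(Subject) ->
    re:run(Subject, "^\\Qa.c\\E$", [{capture, none}]).
```

       dollar() is expected to return {match,["abc$xyz"]}; dot("a.c") returns
       match and dot("abc") returns nomatch.
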
1477       Non-Printing Characters
1478
1479       A second use of backslash provides a way of encoding non-printing char‐
1480       acters  in patterns in a visible manner. There is no restriction on the
1481       appearance of non-printing characters, apart from the binary zero  that
1482       terminates a pattern. When a pattern is prepared by text editing, it is
1483       often easier to use one of the following escape sequences than the  bi‐
1484       nary character it represents:
1485
1486         \a:
1487           Alarm, that is, the BEL character (hex 07)
1488
1489         \cx:
1490           "Control-x", where x is any ASCII character
1491
1492         \e:
1493           Escape (hex 1B)
1494
1495         \f:
1496           Form feed (hex 0C)
1497
1498         \n:
1499           Line feed (hex 0A)
1500
1501         \r:
1502           Carriage return (hex 0D)
1503
1504         \t:
1505           Tab (hex 09)
1506
1507         \0dd:
1508           Character with octal code 0dd
1509
1510         \ddd:
1511           Character with octal code ddd, or back reference
1512
1513         \o{ddd..}:
1514           Character with octal code ddd..
1515
1516         \xhh:
1517           Character with hex code hh
1518
1519         \x{hhh..}:
1520           Character with hex code hhh..
1521
1522   Note:
1523       \0dd is always an octal code, and \8 and \9 are the literal characters
1524       "8" and "9".
1525
1526
1527       The precise effect of \cx on ASCII characters is as follows: if x is  a
1528       lowercase  letter,  it  is  converted  to upper case. Then bit 6 of the
1529       character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
1530       (A  is  41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
1531       hex 7B (; is 3B). If the data item (byte or 16-bit value) following  \c
1532       has  a  value  >  127, a compile-time error occurs. This locks out non-
1533       ASCII characters in all modes.
1534
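       The \cx rule can be written out as plain arithmetic and compared with
       what the pattern matches. This is only a sketch; the module
       re_ctrl_demo and both function names are hypothetical.

```erlang
-module(re_ctrl_demo).
-export([control_code/1, matches/2]).

%% Hypothetical demo module, not part of OTP.

%% The rule from the text, for ASCII input only: a lowercase letter is
%% converted to upper case, then bit 6 (hex 40) is inverted.
control_code(X) when X >= $a, X =< $z -> (X - 32) bxor 16#40;
control_code(X) when X >= 0, X =< 127 -> X bxor 16#40.

%% True if the one-character subject [Code] matches the pattern \cX.
matches(X, Code) ->
    Pattern = "^\\c" ++ [X] ++ "$",
    re:run([Code], Pattern, [{capture, none}]) =:= match.
```

       For example, control_code($A) is 1 and control_code($;) is 16#7B, and
       matches($A, 1), matches($a, 1), and matches($;, ${) are all expected
       to be true.
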
1535       The \c facility was designed for use with ASCII  characters,  but  with
1536       the extension to Unicode it is even less useful than it once was.
1537
1538       After  \0  up  to two further octal digits are read. If there are fewer
1539       than two digits, just those that are present are  used.  Thus  the  se‐
1540       quence  \0\x\015  specifies two binary zeros followed by a CR character
1541       (code value 13). Make sure you supply two digits after the initial zero
1542       if the pattern character that follows is itself an octal digit.
1543
1544       The  escape \o must be followed by a sequence of octal digits, enclosed
1545       in braces. An error occurs if this is not the case. This  escape  is  a
1546       recent addition to Perl; it provides a way of specifying character code
1547       points as octal numbers greater than 0777, and  it  also  allows  octal
1548       numbers and back references to be unambiguously specified.
1549
1550       For greater clarity and unambiguity, it is best to avoid following \ by
1551       a digit greater than zero. Instead, use \o{} or \x{} to specify charac‐
1552       ter  numbers,  and \g{} to specify back references. The following para‐
1553       graphs describe the old, ambiguous syntax.
1554
1555       The handling of a backslash followed by a digit other than 0 is compli‐
1556       cated,  and  Perl  has changed in recent releases, causing PCRE also to
1557       change. Outside a character class, PCRE reads the digit and any follow‐
1558       ing  digits as a decimal number. If the number is < 8, or if there have
1559       been at least that many previous capturing left parentheses in the  ex‐
1560       pression,  the entire sequence is taken as a back reference. A descrip‐
1561       tion of how this works is provided later, following the  discussion  of
1562       parenthesized subpatterns.
1563
1564       Inside  a  character class, or if the decimal number following \ is > 7
1565       and there have not been that many capturing subpatterns,  PCRE  handles
1566       \8 and \9 as the literal characters "8" and "9", and otherwise re-reads
1567       up to three octal digits following the backslash, and uses them to
1568       generate  a data character. Any subsequent digits stand for themselves.
1569       For example:
1570
1571         \040:
1572           Another way of writing an ASCII space
1573
1574         \40:
1575           The same, provided there are < 40 previous capturing subpatterns
1576
1577         \7:
1578           Always a back reference
1579
1580         \11:
1581           Can be a back reference, or another way of writing a tab
1582
1583         \011:
1584           Always a tab
1585
1586         \0113:
1587           A tab followed by character "3"
1588
1589         \113:
1590           Can be a back reference, otherwise the character  with  octal  code
1591           113
1592
1593         \377:
1594           Can be a back reference, otherwise value 255 (decimal)
1595
1596         \81:
1597           Either a back reference, or the two characters "8" and "1"
1598
1599       Notice  that  octal  values >= 100 that are specified using this syntax
1600       must not be introduced by a leading zero, as no more than  three  octal
1601       digits are ever read.
1602
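       Some rows of the table above can be checked with a sketch such as the
       following. The module re_octal_demo and its function names are made up
       for the example; the patterns contain no capturing groups unless
       shown.

```erlang
-module(re_octal_demo).
-export([space/0, tab_then_3/0, tab/0, backref/0]).

%% Hypothetical demo module, not part of OTP.

run(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]).

%% \040 is an ASCII space.
space() -> run(" ", "^\\040$").

%% \0113 is a tab (octal 011) followed by the character "3".
tab_then_3() -> run("\t3", "^\\0113$").

%% \11 with fewer than 11 capturing groups is read as octal 11, a tab.
tab() -> run("\t", "^\\11$").

%% \1 after one capturing group is a back reference.
backref() -> run("abab", "^(ab)\\1$").
```

       All four functions are expected to return match.
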
1603       By  default, after \x that is not followed by {, from zero to two hexa‐
1604       decimal digits are read (letters can be in upper or  lower  case).  Any
1605       number of hexadecimal digits may appear between \x{ and }. If a charac‐
1606       ter other than a hexadecimal digit appears between \x{  and  },  or  if
1607       there is no terminating }, an error occurs.
1608
1609       Characters whose value is less than 256 can be defined by either of the
1610       two syntaxes for \x. There is no difference in the way  they  are  han‐
1611       dled. For example, \xdc is exactly the same as \x{dc}.
1612
1613       Constraints on character values
1614
1615       Characters  that  are  specified using octal or hexadecimal numbers are
1616       limited to certain values, as follows:
1617
1618         8-bit non-UTF mode:
1619           < 0x100
1620
1621         8-bit UTF-8 mode:
1622           < 0x10ffff and a valid codepoint
1623
1624       Invalid Unicode codepoints are the range  0xd800  to  0xdfff  (the  so-
1625       called "surrogate" codepoints), and 0xffef.
1626
1627       Escape sequences in character classes
1628
1629       All the sequences that define a single character value can be used both
1630       inside and outside character classes. Also, inside a  character  class,
1631       \b is interpreted as the backspace character (hex 08).
1632
1633       \N  is not allowed in a character class. \B, \R, and \X are not special
1634       inside a character class. Like  other  unrecognized  escape  sequences,
1635       they are treated as the literal characters "B", "R", and "X". Outside a
1636       character class, these sequences have different meanings.
1637
1638       Unsupported Escape Sequences
1639
1640       In Perl, the sequences \l, \L, \u, and \U are recognized by its  string
1641       handler  and used to modify the case of following characters. PCRE does
1642       not support these escape sequences.
1643
1644       Absolute and Relative Back References
1645
1646       The sequence \g followed by an unsigned or a negative  number,  option‐
1647       ally  enclosed  in braces, is an absolute or relative back reference. A
1648       named back reference can be coded as \g{name}. Back references are dis‐
1649       cussed later, following the discussion of parenthesized subpatterns.
1650
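       The three forms of \g back reference can be sketched as follows. The
       module re_backref_demo, its function names, and the group name pair
       are assumptions made for the illustration.

```erlang
-module(re_backref_demo).
-export([absolute/0, relative/0, named/0, mismatch/0]).

%% Hypothetical demo module, not part of OTP.

run(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]).

%% \g{1} refers to capturing group number 1.
absolute() -> run("abab", "^(ab)\\g{1}$").

%% \g{-1} refers to the most recently opened capturing group.
relative() -> run("abcabc", "^(abc)\\g{-1}$").

%% \g{pair} refers to the group named "pair".
named() -> run("xyxy", "^(?<pair>xy)\\g{pair}$").

%% The back reference must repeat the captured text exactly.
mismatch() -> run("abac", "^(ab)\\g1$").
```

       The first three functions are expected to return match, and
       mismatch() to return nomatch.
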
1651       Absolute and Relative Subroutine Calls
1652
1653       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
1654       name or a number enclosed either in angle brackets or single quotes, is
1655       alternative  syntax for referencing a subpattern as a "subroutine". De‐
1656       tails are discussed  later.  Notice  that  \g{...}  (Perl  syntax)  and
1657       \g<...>  (Oniguruma  syntax)  are  not synonymous. The former is a back
1658       reference and the latter is a subroutine call.
1659
1660       Generic Character Types
1661
1662       Another use of backslash is for specifying generic character types:
1663
1664         \d:
1665           Any decimal digit
1666
1667         \D:
1668           Any character that is not a decimal digit
1669
1670         \h:
1671           Any horizontal whitespace character
1672
1673         \H:
1674           Any character that is not a horizontal whitespace character
1675
1676         \s:
1677           Any whitespace character
1678
1679         \S:
1680           Any character that is not a whitespace character
1681
1682         \v:
1683           Any vertical whitespace character
1684
1685         \V:
1686           Any character that is not a vertical whitespace character
1687
1688         \w:
1689           Any "word" character
1690
1691         \W:
1692           Any "non-word" character
1693
1694       There is also the single sequence \N, which matches a non-newline char‐
1695       acter.  This  is  the  same as the "." metacharacter when dotall is not
1696       set. Perl also uses \N to match characters by name, but PCRE  does  not
1697       support this.
1698
1699       Each  pair  of  lowercase and uppercase escape sequences partitions the
1700       complete set of characters into two disjoint sets. Any given  character
1701       matches  one, and only one, of each pair. The sequences can appear both
1702       inside and outside character classes. They each match one character  of
1703       the  appropriate  type.  If the current matching point is at the end of
1704       the subject string, all fail, as there is no character to match.
1705
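       A few of the generic character types in use, as a minimal sketch. The
       module re_types_demo, its function names, and the sample subjects are
       invented for the example.

```erlang
-module(re_types_demo).
-export([digits/0, words/0, line/0, at_end/0]).

%% Hypothetical demo module, not part of OTP.

%% \d+ matches the first run of decimal digits.
digits() ->
    re:run("tel: 555-1234", "\\d+", [{capture, first, list}]).

%% \s+ as a separator: any run of whitespace splits the subject.
words() ->
    re:split("one  two\tthree", "\\s+", [{return, list}]).

%% \N+ stops at the first newline.
line() ->
    re:run("ab\ncd", "\\N+", [{capture, first, list}]).

%% At the end of the subject there is no character left to match.
at_end() ->
    re:run("", "\\d", [{capture, none}]).
```

       digits() is expected to return {match,["555"]}, words() returns
       ["one","two","three"], line() returns {match,["ab"]}, and at_end()
       returns nomatch.
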
1706       For compatibility with Perl, \s formerly did not match the VT character
1707       (code  11),  which  made it different from the POSIX "space" class.
1708       However, Perl added VT at release 5.18, and PCRE followed suit  at  re‐
1709       lease 8.34. The default \s characters are now HT (9), LF (10), VT (11),
1710       FF (12), CR (13), and space (32), which are defined as white  space  in
1711       the  "C" locale. This list may vary if locale-specific matching is tak‐
1712       ing place. For example, in some locales the "non-breaking space"  char‐
1713       acter (\xA0) is recognized as white space, and in others the VT charac‐
1714       ter is not.
1715
1716       A "word" character is an underscore or any character that is  a  letter
1717       or  a  digit.  By default, the definition of letters and digits is con‐
1718       trolled by the PCRE low-valued character tables, in Erlang's case  (and
1719       without option unicode), the ISO Latin-1 character set.
1720
1721       By default, in unicode mode, characters with values > 255, that is, all
1722       characters outside the ISO Latin-1 character set, never match  \d,  \s,
1723       or  \w,  and  always match \D, \S, and \W. These sequences retain their
1724       original meanings from before UTF support was available, mainly for ef‐
1725       ficiency  reasons.  However,  if  option  ucp  is  set, the behavior is
1726       changed so that Unicode properties  are  used  to  determine  character
1727       types, as follows:
1728
1729         \d:
1730           Any character that matches \p{Nd} (decimal digit)
1731
1732         \s:
1733           Any character that matches \p{Z} or \h or \v
1734
1735         \w:
1736           Any character that matches \p{L} or \p{N}, plus underscore
1737
1738       The uppercase escapes match the inverse sets of characters. Notice that
1739       \d matches only decimal digits, while \w matches any Unicode digit, any
1740       Unicode letter, and underscore. Notice also that ucp affects \b and \B,
1741       as they are defined in terms of \w and \W. Matching these sequences  is
1742       noticeably slower when ucp is set.
1743
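       The difference between \d and \w under ucp can be sketched with two
       sample characters: U+0663 (Arabic-Indic digit three, category Nd) and
       U+2160 (Roman numeral one, category Nl). The module re_ucp_types_demo
       and the function classify/2 are hypothetical.

```erlang
-module(re_ucp_types_demo).
-export([classify/2]).

%% Hypothetical demo module, not part of OTP.

%% Returns {ResultForDigit, ResultForWord} for a single code point,
%% where each result is match or nomatch. ExtraOpts is [] or [ucp].
classify(Char, ExtraOpts) ->
    Opts = [unicode, {capture, none} | ExtraOpts],
    Run = fun(Pattern) -> re:run([Char], Pattern, Opts) end,
    {Run("^\\d$"), Run("^\\w$")}.
```

       classify(16#0663, []) is expected to return {nomatch,nomatch},
       classify(16#0663, [ucp]) returns {match,match}, and
       classify(16#2160, [ucp]) returns {nomatch,match}, as a letter number
       is a Unicode digit for \w but not a decimal digit for \d.
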
1744       The  sequences  \h, \H, \v, and \V are features that were added to Perl
1745       in release 5.10. In contrast to the other sequences, which  match  only
1746       ASCII  characters  by  default,  these always match certain high-valued
1747       code points, regardless of whether ucp is set.
1748
1749       The following are the horizontal space characters:
1750
1751         U+0009:
1752           Horizontal tab (HT)
1753
1754         U+0020:
1755           Space
1756
1757         U+00A0:
1758           Non-break space
1759
1760         U+1680:
1761           Ogham space mark
1762
1763         U+180E:
1764           Mongolian vowel separator
1765
1766         U+2000:
1767           En quad
1768
1769         U+2001:
1770           Em quad
1771
1772         U+2002:
1773           En space
1774
1775         U+2003:
1776           Em space
1777
1778         U+2004:
1779           Three-per-em space
1780
1781         U+2005:
1782           Four-per-em space
1783
1784         U+2006:
1785           Six-per-em space
1786
1787         U+2007:
1788           Figure space
1789
1790         U+2008:
1791           Punctuation space
1792
1793         U+2009:
1794           Thin space
1795
1796         U+200A:
1797           Hair space
1798
1799         U+202F:
1800           Narrow no-break space
1801
1802         U+205F:
1803           Medium mathematical space
1804
1805         U+3000:
1806           Ideographic space
1807
1808       The following are the vertical space characters:
1809
1810         U+000A:
1811           Line feed (LF)
1812
1813         U+000B:
1814           Vertical tab (VT)
1815
1816         U+000C:
1817           Form feed (FF)
1818
1819         U+000D:
1820           Carriage return (CR)
1821
1822         U+0085:
1823           Next line (NEL)
1824
1825         U+2028:
1826           Line separator
1827
1828         U+2029:
1829           Paragraph separator
1830
1831       In 8-bit, non-UTF-8 mode, only the characters with code  points  <  256
1832       are relevant.
1833
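       The horizontal and vertical space escapes can be exercised with a
       sketch like the following. The module re_space_demo and its function
       names are assumptions made for the illustration; notice that ucp is
       not used anywhere.

```erlang
-module(re_space_demo).
-export([nbsp/0, em_space/0, form_feed/0, space_is_vertical/0, collapse/0]).

%% Hypothetical demo module, not part of OTP.

%% U+00A0 is matched by \h even in 8-bit, non-UTF-8 mode.
nbsp() -> re:run([16#A0], "^\\h$", [{capture, none}]).

%% U+2003 (em space) is matched by \h in unicode mode.
em_space() -> re:run([16#2003], "^\\h$", [unicode, {capture, none}]).

%% Form feed is a vertical space character.
form_feed() -> re:run("\f", "^\\v$", [{capture, none}]).

%% An ordinary space is horizontal, not vertical.
space_is_vertical() -> re:run(" ", "^\\v$", [{capture, none}]).

%% Replace every run of horizontal space with a single space.
collapse() -> re:replace("a \t b", "\\h+", " ", [global, {return, list}]).
```

       nbsp(), em_space(), and form_feed() are expected to return match,
       space_is_vertical() returns nomatch, and collapse() returns "a b".
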
1834       Newline Sequences
1835
1836       Outside  a  character class, by default, the escape sequence \R matches
1837       any Unicode newline sequence. In non-UTF-8 mode, \R  is  equivalent  to
1838       the following:
1839
1840       (?>\r\n|\n|\x0b|\f|\r|\x85)
1841
1842       This is an example of an "atomic group"; details are provided below.
1843
1844       This particular group matches either the two-character sequence CR fol‐
1845       lowed by LF, or one of the single characters LF (line feed, U+000A), VT
1846       (vertical  tab,  U+000B),  FF (form feed, U+000C), CR (carriage return,
1847       U+000D), or NEL (next line,  U+0085).  The  two-character  sequence  is
1848       treated as a single unit that cannot be split.
1849
1850       In  Unicode  mode,  two more characters whose code points are > 255 are
1851       added:  LS  (line  separator,  U+2028)  and  PS  (paragraph  separator,
1852       U+2029).  Unicode  character  property  support is not needed for these
1853       characters to be recognized.
1854
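       As a small sketch, \R can be used to split text that mixes line-ending
       styles, and the atomic treatment of CRLF can be observed directly. The
       module re_anynl_demo and its function names are made up for the
       example.

```erlang
-module(re_anynl_demo).
-export([lines/0, crlf_once/0, crlf_twice/0]).

%% Hypothetical demo module, not part of OTP.

%% CRLF, LF, and CR each count as one line break.
lines() ->
    re:split("a\r\nb\nc\rd", "\\R", [{return, list}]).

%% A single \R consumes the whole CRLF pair.
crlf_once() ->
    re:run("\r\n", "^\\R$", [{capture, none}]).

%% The pair cannot be split into CR for one \R and LF for the next.
crlf_twice() ->
    re:run("\r\n", "^\\R\\R$", [{capture, none}]).
```

       lines() is expected to return ["a","b","c","d"], crlf_once() returns
       match, and crlf_twice() returns nomatch.
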
1855       \R can be restricted to match only CR, LF, or CRLF (instead of the com‐
1856       plete set of Unicode line endings) by setting option bsr_anycrlf either
1857       at compile time or when the pattern is matched. (BSR is an acronym  for
1858       "backslash R".) This can be made the default when PCRE is built; if so,
1859       the other behavior can be requested through option  bsr_unicode.  These
1860       settings can also be specified by starting a pattern string with one of
1861       the following sequences:
1862
1863         (*BSR_ANYCRLF):
1864           CR, LF, or CRLF only
1865
1866         (*BSR_UNICODE):
1867           Any Unicode newline sequence
1868
1869       These override the default and the options specified to  the  compiling
1870       function, but they can themselves be overridden by options specified to
1871       a matching function. Notice that these special settings, which are  not
1872       Perl-compatible,  are  recognized  only at the very start of a pattern,
1873       and that they must be in upper case.  If  more  than  one  of  them  is
1874       present,  the  last  one is used. They can be combined with a change of
1875       newline convention; for example, a pattern can start with:
1876
1877       (*ANY)(*BSR_ANYCRLF)
1878
1879       They can also be combined with the (*UTF8), (*UTF), or  (*UCP)  special
1880       sequences.  Inside  a character class, \R is treated as an unrecognized
1881       escape sequence, and so matches the letter "R" by default.
1882
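       The following sketch restricts \R in two ways, using a vertical tab
       (VT) as a subject that only the complete Unicode set accepts. The
       module re_bsr_demo and its function names are hypothetical.

```erlang
-module(re_bsr_demo).
-export([default/0, option/0, item/0, combined/0]).

%% Hypothetical demo module, not part of OTP.

%% By default \R matches VT (written "\v" in an Erlang string).
default() ->
    re:run("\v", "^\\R$", [{capture, none}]).

%% Option bsr_anycrlf limits \R to CR, LF, or CRLF.
option() ->
    re:run("\v", "^\\R$", [bsr_anycrlf, {capture, none}]).

%% The same restriction, requested from inside the pattern.
item() ->
    re:run("\v", "(*BSR_ANYCRLF)^\\R$", [{capture, none}]).

%% A newline convention item and a BSR item can be combined.
combined() ->
    re:run("\r\n", "(*ANY)(*BSR_ANYCRLF)^\\R$", [{capture, none}]).
```

       default() and combined() are expected to return match, while option()
       and item() return nomatch.
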
1883       Unicode Character Properties
1884
1885       Three more escape sequences that match characters with specific proper‐
1886       ties  are  available. When in 8-bit non-UTF-8 mode, these sequences are
1887       limited to testing characters whose code points are < 256, but they  do
1888       work in this mode. The following are the extra escape sequences:
1889
1890         \p{xx}:
1891           A character with property xx
1892
1893         \P{xx}:
1894           A character without property xx
1895
1896         \X:
1897           A Unicode extended grapheme cluster
1898
1899       The  property  names represented by xx above are limited to the Unicode
1900       script names, the general category properties, "Any", which matches any
1901       character  (including  newline),  and some special PCRE properties (de‐
1902       scribed in the next section). Other Perl properties, such  as  "InMusi‐
1903       calSymbols",  are  currently not supported by PCRE. Notice that \P{Any}
1904       does not match any characters and always causes a match failure.
1905
1906       Sets of Unicode characters are defined as belonging to certain scripts.
1907       A  character from one of these sets can be matched using a script name,
1908       for example:
1909
1910       \p{Greek} \P{Han}
1911
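       A minimal sketch of script matching follows, using U+03B1 (Greek small
       letter alpha) and U+4E2D (a Han ideograph) as sample characters. The
       module re_script_demo and the function in_script/2 are assumptions
       made for the illustration.

```erlang
-module(re_script_demo).
-export([in_script/2]).

%% Hypothetical demo module, not part of OTP.

%% True if the single code point Char matches the property escape
%% Escape, for example "\\p{Greek}" or "\\P{Han}".
in_script(Char, Escape) ->
    Pattern = "^" ++ Escape ++ "$",
    re:run([Char], Pattern, [unicode, {capture, none}]) =:= match.
```

       in_script(16#3B1, "\\p{Greek}"), in_script(16#3B1, "\\P{Han}"), and
       in_script(16#4E2D, "\\p{Han}") are expected to be true, and
       in_script(16#4E2D, "\\p{Greek}") to be false.
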
1912       Those that are not part of an identified script are lumped together  as
1913       "Common". The following is the current list of scripts:
1914
1915         * Arabic
1916
1917         * Armenian
1918
1919         * Avestan
1920
1921         * Balinese
1922
1923         * Bamum
1924
1925         * Bassa_Vah
1926
1927         * Batak
1928
1929         * Bengali
1930
1931         * Bopomofo
1932
1933         * Braille
1934
1935         * Buginese
1936
1937         * Buhid
1938
1939         * Canadian_Aboriginal
1940
1941         * Carian
1942
1943         * Caucasian_Albanian
1944
1945         * Chakma
1946
1947         * Cham
1948
1949         * Cherokee
1950
1951         * Common
1952
1953         * Coptic
1954
1955         * Cuneiform
1956
1957         * Cypriot
1958
1959         * Cyrillic
1960
1961         * Deseret
1962
1963         * Devanagari
1964
1965         * Duployan
1966
1967         * Egyptian_Hieroglyphs
1968
1969         * Elbasan
1970
1971         * Ethiopic
1972
1973         * Georgian
1974
1975         * Glagolitic
1976
1977         * Gothic
1978
1979         * Grantha
1980
1981         * Greek
1982
1983         * Gujarati
1984
1985         * Gurmukhi
1986
1987         * Han
1988
1989         * Hangul
1990
1991         * Hanunoo
1992
1993         * Hebrew
1994
1995         * Hiragana
1996
1997         * Imperial_Aramaic
1998
1999         * Inherited
2000
2001         * Inscriptional_Pahlavi
2002
2003         * Inscriptional_Parthian
2004
2005         * Javanese
2006
2007         * Kaithi
2008
2009         * Kannada
2010
2011         * Katakana
2012
2013         * Kayah_Li
2014
2015         * Kharoshthi
2016
2017         * Khmer
2018
2019         * Khojki
2020
2021         * Khudawadi
2022
2023         * Lao
2024
2025         * Latin
2026
2027         * Lepcha
2028
2029         * Limbu
2030
2031         * Linear_A
2032
2033         * Linear_B
2034
2035         * Lisu
2036
2037         * Lycian
2038
2039         * Lydian
2040
2041         * Mahajani
2042
2043         * Malayalam
2044
2045         * Mandaic
2046
2047         * Manichaean
2048
2049         * Meetei_Mayek
2050
2051         * Mende_Kikakui
2052
2053         * Meroitic_Cursive
2054
2055         * Meroitic_Hieroglyphs
2056
2057         * Miao
2058
2059         * Modi
2060
2061         * Mongolian
2062
2063         * Mro
2064
2065         * Myanmar
2066
2067         * Nabataean
2068
2069         * New_Tai_Lue
2070
2071         * Nko
2072
2073         * Ogham
2074
2075         * Ol_Chiki
2076
2077         * Old_Italic
2078
2079         * Old_North_Arabian
2080
2081         * Old_Permic
2082
2083         * Old_Persian
2084
2085         * Old_South_Arabian
2086
2087         * Old_Turkic
2088
2089         * Oriya
2090
2091         * Osmanya
2092
2093         * Pahawh_Hmong
2094
2095         * Palmyrene
2096
2097         * Pau_Cin_Hau
2098
2099         * Phags_Pa
2100
2101         * Phoenician
2102
2103         * Psalter_Pahlavi
2104
2105         * Rejang
2106
2107         * Runic
2108
2109         * Samaritan
2110
2111         * Saurashtra
2112
2113         * Sharada
2114
2115         * Shavian
2116
2117         * Siddham
2118
2119         * Sinhala
2120
2121         * Sora_Sompeng
2122
2123         * Sundanese
2124
2125         * Syloti_Nagri
2126
2127         * Syriac
2128
2129         * Tagalog
2130
2131         * Tagbanwa
2132
2133         * Tai_Le
2134
2135         * Tai_Tham
2136
2137         * Tai_Viet
2138
2139         * Takri
2140
2141         * Tamil
2142
2143         * Telugu
2144
2145         * Thaana
2146
2147         * Thai
2148
2149         * Tibetan
2150
2151         * Tifinagh
2152
2153         * Tirhuta
2154
2155         * Ugaritic
2156
2157         * Vai
2158
2159         * Warang_Citi
2160
2161         * Yi
2162
2163       Each character has exactly one Unicode general category property, spec‐
2164       ified by a two-letter acronym. For compatibility  with  Perl,  negation
2165       can  be  specified  by including a circumflex between the opening brace
2166       and the property name. For example, \p{^Lu} is the same as \P{Lu}.
2167
2168       If only one letter is specified with \p or \P, it includes all the gen‐
2169       eral  category properties that start with that letter. In this case, in
2170       the absence of negation, the curly brackets in the escape sequence  are
2171       optional. The following two examples have the same effect:
2172
2173       \p{L}
2174       \pL
2175
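       The notations for general category properties can be compared with a
       sketch such as the following, here in 8-bit mode with ASCII subjects.
       The module re_category_demo and the function has/2 are made up for the
       example.

```erlang
-module(re_category_demo).
-export([has/2]).

%% Hypothetical demo module, not part of OTP.

%% True if the single character Char matches the property escape
%% Escape, for example "\\p{Lu}", "\\p{^Lu}", "\\P{Lu}", or "\\pL".
has(Char, Escape) ->
    Pattern = "^" ++ Escape ++ "$",
    re:run([Char], Pattern, [{capture, none}]) =:= match.
```

       has($A, "\\p{Lu}") is expected to be true, has($a, "\\p{^Lu}") and
       has($a, "\\P{Lu}") are both true, has($a, "\\pL") and
       has($a, "\\p{L}") are both true, and has($7, "\\pL") is false.
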
2176       The following general category property codes are supported:
2177
2178         C:
2179           Other
2180
2181         Cc:
2182           Control
2183
2184         Cf:
2185           Format
2186
2187         Cn:
2188           Unassigned
2189
2190         Co:
2191           Private use
2192
2193         Cs:
2194           Surrogate
2195
2196         L:
2197           Letter
2198
2199         Ll:
2200           Lowercase letter
2201
2202         Lm:
2203           Modifier letter
2204
2205         Lo:
2206           Other letter
2207
2208         Lt:
2209           Title case letter
2210
2211         Lu:
2212           Uppercase letter
2213
2214         M:
2215           Mark
2216
2217         Mc:
2218           Spacing mark
2219
2220         Me:
2221           Enclosing mark
2222
2223         Mn:
2224           Non-spacing mark
2225
2226         N:
2227           Number
2228
2229         Nd:
2230           Decimal number
2231
2232         Nl:
2233           Letter number
2234
2235         No:
2236           Other number
2237
2238         P:
2239           Punctuation
2240
2241         Pc:
2242           Connector punctuation
2243
2244         Pd:
2245           Dash punctuation
2246
2247         Pe:
2248           Close punctuation
2249
2250         Pf:
2251           Final punctuation
2252
2253         Pi:
2254           Initial punctuation
2255
2256         Po:
2257           Other punctuation
2258
2259         Ps:
2260           Open punctuation
2261
2262         S:
2263           Symbol
2264
2265         Sc:
2266           Currency symbol
2267
2268         Sk:
2269           Modifier symbol
2270
2271         Sm:
2272           Mathematical symbol
2273
2274         So:
2275           Other symbol
2276
2277         Z:
2278           Separator
2279
2280         Zl:
2281           Line separator
2282
2283         Zp:
2284           Paragraph separator
2285
2286         Zs:
2287           Space separator
2288
2289       The  special property L& is also supported. It matches a character that
2290       has the Lu, Ll, or Lt property, that is, a letter that is  not  classi‐
2291       fied as a modifier or "other".
2292
2293       The  Cs  (Surrogate)  property  applies only to characters in the range
2294       U+D800 to U+DFFF. Such characters are invalid in Unicode strings and so
2295       cannot be tested by PCRE. Perl does not support the Cs property.
2296
2297       The long synonyms for property names supported by Perl (such as \p{Let‐
2298       ter}) are not supported by PCRE. It is not permitted to prefix  any  of
2299       these properties with "Is".
2300
2301       No  character  in  the  Unicode table has the Cn (unassigned) property.
2302       This property is instead assumed for any code point that is not in  the
2303       Unicode table.
2304
2305       Specifying  caseless  matching  does not affect these escape sequences.
2306       For example, \p{Lu} always matches only uppercase letters. This is dif‐
2307       ferent from the behavior of current versions of Perl.
2308
2309       Matching  characters by Unicode property is not fast, as PCRE must do a
2310       multistage table lookup to find a character property. That is  why  the
2311       traditional escape sequences such as \d and \w do not use Unicode prop‐
2312       erties in PCRE by default. However, you can make them do so by  setting
2313       option ucp or by starting the pattern with (*UCP).
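
       As an illustration of the effect of option ucp, the following sketch
       tests \w against the Greek letter alpha (U+03B1), a code point above
       255. The module re_ucp_demo and its functions are hypothetical names
       chosen for this example.

```erlang
-module(re_ucp_demo).
-export([alpha/0, matches/3]).

%% Greek small letter alpha (U+03B1) as a UTF-8 encoded binary.
alpha() -> <<16#3B1/utf8>>.

%% True if Pattern matches somewhere in Subject with the given options.
matches(Subject, Pattern, Opts) ->
    re:run(Subject, Pattern, [{capture, none} | Opts]) =:= match.

%% matches(alpha(), "\\w", [unicode])        returns false
%% matches(alpha(), "\\w", [unicode, ucp])   returns true
%% matches(alpha(), "(*UCP)\\w", [unicode])  returns true
```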
2314
2315       Extended Grapheme Clusters
2316
2317       The  \X  escape  matches  any number of Unicode characters that form an
2318       "extended grapheme cluster", and treats the sequence as an atomic group
2319       (see below). Up to and including release 8.31, PCRE matched an earlier,
2320       simpler definition that was equivalent  to  (?>\PM\pM*).  That  is,  it
2321       matched  a  character  without the "mark" property, followed by zero or
2322       more characters with the "mark" property. Characters  with  the  "mark"
2323       property  are  typically  non-spacing accents that affect the preceding
2324       character.
2325
2326       This simple definition was extended in Unicode to include more  compli‐
2327       cated  kinds of composite character by giving each character a grapheme
2328       breaking property, and creating rules that use these properties to  de‐
2329       fine  the  boundaries  of  extended grapheme clusters. In PCRE releases
2330       later than 8.31, \X matches one of these clusters.
2331
2332       \X always matches at least one character. Then it  decides  whether  to
2333       add more characters according to the following rules for ending a clus‐
2334       ter:
2335
2336         * End at the end of the subject string.
2337
2338         * Do not end between CR and LF; otherwise end after any control char‐
2339           acter.
2340
2341         * Do  not  break  Hangul (a Korean script) syllable sequences. Hangul
2342           characters are of five types: L, V, T, LV, and LVT. An L  character
2343           can  be followed by an L, V, LV, or LVT character. An LV or V char‐
2344           acter can be followed by a V or T character. An LVT or T  character
2345           can be followed only by a T character.
2346
2347         * Do not end before extending characters or spacing marks. Characters
2348           with the "mark" property always have the "extend" grapheme breaking
2349           property.
2350
2351         * Do not end after prepend characters.
2352
2353         * Otherwise, end the cluster.
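
       A small sketch of the rule about extending characters: the subject be‐
       low is "e" followed by U+0301 (combining acute accent) and "x", and \X
       is expected to take the first two code points as one cluster. The mod‐
       ule name re_cluster_demo and its functions are hypothetical.

```erlang
-module(re_cluster_demo).
-export([subject/0, first_cluster/1]).

%% "e", U+0301 (combining acute accent), "x", encoded as UTF-8.
subject() -> <<$e, 16#301/utf8, $x>>.

%% Return the first extended grapheme cluster of Subject as a binary.
first_cluster(Subject) ->
    {match, [Cluster]} =
        re:run(Subject, "\\X", [unicode, {capture, first, binary}]),
    Cluster.

%% first_cluster(subject()) returns <<$e, 16#301/utf8>>, which is 3 bytes:
%% one for "e" and two for the combining accent.
```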
2354
2355       PCRE Additional Properties
2356
2357       In  addition to the standard Unicode properties described earlier, PCRE
2358       supports four more that make it possible to convert traditional  escape
2359       sequences, such as \w and \s to use Unicode properties. PCRE uses these
2360       non-standard, non-Perl properties internally when  the  ucp  option  is
2361       passed.  However,  they can also be used explicitly. The properties are
2362       as follows:
2363
2364         Xan:
2365           Any alphanumeric character. Matches characters that have either the
2366           L (letter) or the N (number) property.
2367
2368         Xps:
2369           Any  Posix  space character. Matches the characters tab, line feed,
2370           vertical tab, form feed, carriage return, and any  other  character
2371           that has the Z (separator) property.
2372
2373         Xsp:
2374           Any Perl space character. Matches the same as Xps. Before PCRE 8.34
2375           vertical tab was excluded.
2376
2377         Xwd:
2378           Any Perl "word" character. Matches the same characters as Xan, plus
2379           underscore.
2380
2381       Perl and POSIX space are now the same. Perl added VT to its space char‐
2382       acter set at release 5.18 and PCRE changed at release 8.34.
2383
2384       Xan matches characters that have either the L (letter) or the  N  (num‐
2385       ber)  property. Xps matches the characters tab, linefeed, vertical tab,
2386       form feed, or carriage return, and any other character that has  the  Z
2387       (separator) property. Xsp is the same as Xps; it used to exclude verti‐
2388       cal tab, for Perl compatibility, but Perl changed, and so PCRE followed
2389       at  release  8.34.  Xwd matches the same characters as Xan, plus under‐
2390       score.
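
       The difference between Xan and Xwd can be checked with a sketch such as
       the following. The module re_xprop_demo and the helper matches/2 are
       hypothetical names; option unicode is assumed.

```erlang
-module(re_xprop_demo).
-export([matches/2]).

%% True if Pattern matches somewhere in Subject (UTF mode).
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [unicode, {capture, none}]) =:= match.

%% matches("_", "\\p{Xan}")   returns false  (underscore is not L or N)
%% matches("_", "\\p{Xwd}")   returns true   (Xan plus underscore)
%% matches("7", "\\p{Xan}")   returns true
%% matches("\t", "\\p{Xps}")  returns true   (tab is a space character)
```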
2391
2392       There is another non-standard property, Xuc, which matches any  charac‐
2393       ter  that  can  be represented by a Universal Character Name in C++ and
2394       other programming languages. These are the characters $,  @,  `  (grave
2395       accent),  and all characters with Unicode code points >= U+00A0, except
2396       for the surrogates U+D800 to U+DFFF.  Notice  that  most  base  (ASCII)
2397       characters  are  excluded.  (Universal  Character Names are of the form
2398       \uHHHH or \UHHHHHHHH, where H is a hexadecimal digit. Notice  that  the
2399       Xuc  property  does  not  match these sequences but the characters that
2400       they represent.)
2401
2402       Resetting the Match Start
2403
2404       The escape sequence \K causes any previously matched characters not  to
2405       be  included  in the final matched sequence. For example, the following
2406       pattern matches "foobar", but reports that it has matched "bar":
2407
2408       foo\Kbar
2409
2410       This feature is similar to a lookbehind  assertion  (described  below).
2411       However,  in  this  case, the part of the subject before the real match
2412       does not have to be of fixed length, as lookbehind assertions  do.  The
2413       use  of  \K does not interfere with the setting of captured substrings.
2414       For example, when the following pattern  matches  "foobar",  the  first
2415       substring is still set to "foo":
2416
2417       (foo)\Kbar
2418
2419       Perl  documents  that  the use of \K within assertions is "not well de‐
2420       fined". In PCRE, \K is acted upon when it occurs inside positive asser‐
2421       tions,  but is ignored in negative assertions. Note that when a pattern
2422       such as (?=ab\K) matches, the  reported  start  of  the  match  can  be
2423       greater than the end of the match.
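
       The two \K patterns above can be run as follows. This is only a sketch;
       the module re_k_demo and the function run/1 are hypothetical names.

```erlang
-module(re_k_demo).
-export([run/1]).

%% Match Pattern against "foobar" and return all captures as strings.
run(Pattern) ->
    re:run("foobar", Pattern, [{capture, all, list}]).

%% run("foo\\Kbar")    returns {match, ["bar"]}
%% run("(foo)\\Kbar")  returns {match, ["bar", "foo"]}
%% In the second call the whole match is reported as "bar", while the
%% first captured substring is still "foo".
```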
2424
2425       Simple Assertions
2426
2427       The  final use of backslash is for certain simple assertions. An asser‐
2428       tion specifies a condition that must be met at a particular point in  a
2429       match,  without  consuming  any characters from the subject string. The
2430       use of subpatterns for more complicated assertions is described  below.
2431       The following are the backslashed assertions:
2432
2433         \b:
2434           Matches at a word boundary.
2435
2436         \B:
2437           Matches when not at a word boundary.
2438
2439         \A:
2440           Matches at the start of the subject.
2441
2442         \Z:
2443           Matches  at the end of the subject, and before a newline at the end
2444           of the subject.
2445
2446         \z:
2447           Matches only at the end of the subject.
2448
2449         \G:
2450           Matches at the first matching position in the subject.
2451
2452       Inside a character class, \b has a different meaning;  it  matches  the
2453       backspace  character.  If  any  other  of these assertions appears in a
2454       character class, by default it matches the corresponding literal  char‐
2455       acter (for example, \B matches the letter B).
2456
2457       A  word  boundary is a position in the subject string where the current
2458       character and the previous character do not both match \w or  \W  (that
2459       is,  one  matches  \w and the other matches \W), or the start or end of
2460       the string if the first or last character matches \w, respectively.  In
2461       UTF  mode,  the  meanings of \w and \W can be changed by setting option
2462       ucp. When this is done, it also affects \b and \B. PCRE and Perl do not
2463       have a separate "start of word" or "end of word" metasequence. However,
2464       whatever follows \b normally determines which it is. For  example,  the
2465       fragment \ba matches "a" at the start of a word.
2466
2467       The  \A,  \Z,  and \z assertions differ from the traditional circumflex
2468       and dollar (described in the next section) in that they only ever match
2469       at  the  very start and end of the subject string, whatever options are
2470       set. Thus, they are independent of multiline mode. These  three  asser‐
2471       tions  are  not affected by options notbol or noteol, which affect only
2472       the behavior of the circumflex and dollar metacharacters.  However,  if
2473       argument  startoffset of run/3 is non-zero, indicating that matching is
2474       to start at a point other than the beginning of  the  subject,  \A  can
2475       never match. The difference between \Z and \z is that \Z matches before
2476       a newline at the end of the string  and  at  the  very  end,  while  \z
2477       matches only at the end.
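
       The following sketch contrasts \Z with \z and shows \b in use. The mod‐
       ule re_assert_demo and the helper matches/2 are hypothetical names; the
       default newline convention (LF) is assumed.

```erlang
-module(re_assert_demo).
-export([matches/2]).

%% True if Pattern matches somewhere in Subject.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.

%% matches("abc\n", "abc\\Z")      returns true   (before the final newline)
%% matches("abc\n", "abc\\z")      returns false  (only at the very end)
%% matches("abc",   "abc\\z")      returns true
%% matches("a cat", "\\bcat\\b")   returns true
%% matches("concat", "\\bcat\\b")  returns false  (no boundary before "cat")
```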
2478
2479       The  \G assertion is true only when the current matching position is at
2480       the start point of the match, as specified by argument  startoffset  of
2481       run/3. It differs from \A when the value of startoffset is non-zero. By
2482       calling run/3 multiple times with appropriate arguments, you can  mimic
2483       the  Perl  option /g, and it is in this kind of implementation where \G
2484       can be useful.
2485
2486       Notice, however, that the PCRE interpretation of \G, as  the  start  of
2487       the  current  match, is subtly different from Perl, which defines it as
2488       the end of the previous match. In Perl, these can be different when the
2489       previously  matched  string was empty. As PCRE does only one match at a
2490       time, it cannot reproduce this behavior.
2491
2492       If all the alternatives of a pattern begin with \G, the  expression  is
2493       anchored to the starting match position, and the "anchored" flag is set
2494       in the compiled regular expression.
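
       A sketch of \G in combination with a start offset follows. It assumes
       that the start offset is passed to run/3 as option {offset, N}. The
       module re_g_demo and its functions are hypothetical; leading_digits/1
       mimics a Perl /g loop that must continue exactly where the previous
       match ended.

```erlang
-module(re_g_demo).
-export([at_offset/3, leading_digits/1]).

%% Run Pattern against Subject starting at Offset; report the position.
at_offset(Subject, Pattern, Offset) ->
    re:run(Subject, Pattern, [{offset, Offset}, {capture, first, index}]).

%% Collect single digits from the start of Subject (a string), stopping
%% at the first position where \G\d fails.
leading_digits(Subject) ->
    leading_digits(Subject, 0, []).

leading_digits(Subject, Offset, Acc) ->
    case at_offset(Subject, "\\G\\d", Offset) of
        {match, [{Start, Len}]} ->
            Digit = lists:sublist(Subject, Start + 1, Len),
            leading_digits(Subject, Start + Len, [Digit | Acc]);
        nomatch ->
            lists:reverse(Acc)
    end.

%% at_offset("aab", "\\Ga", 1)  returns {match, [{1, 1}]}
%% at_offset("aab", "\\Aa", 1)  returns nomatch  (\A needs offset 0)
%% at_offset("aab", "\\Gb", 1)  returns nomatch  (anchored at offset 1)
%% leading_digits("12a3")       returns ["1", "2"]
```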
2495

CIRCUMFLEX AND DOLLAR

2497       The circumflex and dollar  metacharacters  are  zero-width  assertions.
2498       That  is,  they test for a particular condition to be true without con‐
2499       suming any characters from the subject string.
2500
2501       Outside a character class, in the default matching mode, the circumflex
2502       character  is  an  assertion  that is true only if the current matching
2503       point is at the start of the subject string. If argument startoffset of
2504       run/3  is  non-zero,  circumflex can never match if option multiline is
2505       unset. Inside a character class, circumflex has an  entirely  different
2506       meaning (see below).
2507
2508       Circumflex need not be the first character of the pattern if several
2509       alternatives are involved, but it must be the first thing in each
2510       alternative in which it appears if the pattern is ever to match that
2511       branch. If all possible alternatives start with a circumflex, that  is,
2512       if  the  pattern  is constrained to match only at the start of the sub‐
2513       ject, it is said to be an "anchored" pattern.  (There  are  also  other
2514       constructs that can cause a pattern to be anchored.)
2515
2516       The dollar character is an assertion that is true only if the current
2517       matching point is at the end of the subject string, or immediately
2518       before a newline at the end of the string (by default). Notice, however,
2519       that it does not match the newline. Dollar need not be the last
2520       character of the pattern if several alternatives are involved, but it
2521       must be the last item in any branch in which it appears. Dollar has no
2522       special meaning in a character class.
2523
2524       The  meaning  of  dollar  can be changed so that it matches only at the
2525       very end of the string, by setting  option  dollar_endonly  at  compile
2526       time. This does not affect the \Z assertion.
2527
2528       The meanings of the circumflex and dollar characters are changed if op‐
2529       tion multiline is set. When this is the case, a circumflex matches  im‐
2530       mediately  after  internal  newlines  and  at  the start of the subject
2531       string. It does not match after a newline that ends the string. A  dol‐
2532       lar  matches  before  any  newlines in the string, and at the very end,
2533       when multiline is set. When newline is specified as  the  two-character
2534       sequence CRLF, isolated CR and LF characters do not indicate newlines.
2535
2536       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
2537       (where \n represents a newline) in multiline mode, but  not  otherwise.
2538       So, patterns that are anchored in single-line mode because all branches
2539       start with ^ are not anchored in multiline mode, and a match  for  cir‐
2540       cumflex is possible when argument startoffset of run/3 is non-zero. Op‐
2541       tion dollar_endonly is ignored if multiline is set.
2542
2543       Notice that the sequences \A, \Z, and \z can be used to match the start
2544       and  end  of  the  subject  in both modes. If all branches of a pattern
2545       start with \A, it is always anchored, regardless of whether multiline is set.
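
       The effect of options multiline and dollar_endonly can be observed with
       a sketch like the following. The module re_anchor_demo and the helper
       matches/3 are hypothetical names; the default newline (LF) is assumed.

```erlang
-module(re_anchor_demo).
-export([matches/3]).

%% True if Pattern matches somewhere in Subject with the given options.
matches(Subject, Pattern, Opts) ->
    re:run(Subject, Pattern, [{capture, none} | Opts]) =:= match.

%% matches("def\nabc", "^abc$", [])             returns false
%% matches("def\nabc", "^abc$", [multiline])    returns true
%% matches("abc\n", "abc$", [])                 returns true
%% matches("abc\n", "abc$", [dollar_endonly])   returns false
%% matches("def\nabc", "\\Aabc", [multiline])   returns false
```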
2546

FULL STOP (PERIOD, DOT) AND \N

2548       Outside a character class, a dot in the pattern matches  any  character
2549       in  the  subject  string except (by default) a character that signifies
2550       the end of a line.
2551
2552       When a line ending is defined as a single character, dot never  matches
2553       that  character. When the two-character sequence CRLF is used, dot does
2554       not match CR if it is immediately followed by LF, otherwise it  matches
2555       all  characters (including isolated CRs and LFs). When any Unicode line
2556       endings are recognized, dot does not match CR, LF, or any of the  other
2557       line-ending characters.
2558
2559       The behavior of dot regarding newlines can be changed. If option dotall
2560       is set, a dot matches any character, without  exception.  If  the  two-
2561       character  sequence CRLF is present in the subject string, it takes two
2562       dots to match it.
2563
2564       The handling of dot is entirely independent of the handling of
2565       circumflex and dollar; the only relationship is that both involve
2566       newlines. Dot has no special meaning in a character class.
2567
2568       The escape sequence \N behaves like a dot, except that it is not
2569       affected by option dotall. That is, it matches any character except
2570       one that signifies the end of a line. Perl also uses \N to match
2571       characters by name, but PCRE does not support this.
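
       A short sketch of dot, option dotall, and \N follows. The module
       re_dot_demo and the helper matches/3 are hypothetical names; the de‐
       fault newline (LF) is assumed.

```erlang
-module(re_dot_demo).
-export([matches/3]).

%% True if Pattern matches somewhere in Subject with the given options.
matches(Subject, Pattern, Opts) ->
    re:run(Subject, Pattern, [{capture, none} | Opts]) =:= match.

%% matches("a\nb", "a.b", [])           returns false  (dot skips newline)
%% matches("a\nb", "a.b", [dotall])     returns true
%% matches("a\nb", "a\\Nb", [dotall])   returns false  (\N ignores dotall)
%% matches("a\r\nb", "a..b", [dotall])  returns true   (two dots for CRLF)
```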
2572

MATCHING A SINGLE DATA UNIT

2574       Outside a character class, the escape sequence \C matches any data
2575       unit, regardless of whether a UTF mode is set. One data unit is one
2576       byte. Unlike a dot, \C always matches line-ending characters. The feature is
2577       provided in Perl to match individual bytes in UTF-8 mode, but it is un‐
2578       clear  how it can usefully be used. As \C breaks up characters into in‐
2579       dividual data units, matching one unit with \C in a UTF mode means that
2580       the remaining string can start with a malformed UTF character. This has
2581       undefined results, as  PCRE  assumes  that  it  deals  with  valid  UTF
2582       strings.
2583
2584       PCRE  does  not  allow \C to appear in lookbehind assertions (described
2585       below) in a UTF mode, as this would make it impossible to calculate the
2586       length of the lookbehind.
2587
2588       The  \C  escape  sequence is best avoided. However, one way of using it
2589       that avoids the problem of malformed UTF characters is to use  a  look‐
2590       ahead  to  check  the length of the next character, as in the following
2591       pattern, which can be used with a UTF-8 string (ignore  whitespace  and
2592       line breaks):
2593
2594       (?| (?=[\x00-\x7f])(\C) |
2595           (?=[\x80-\x{7ff}])(\C)(\C) |
2596           (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
2597           (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
2598
2599       A  group  that starts with (?| resets the capturing parentheses numbers
2600       in each alternative (see section Duplicate Subpattern Numbers). The as‐
2601       sertions at the start of each branch check the next UTF-8 character for
2602       values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The indi‐
2603       vidual bytes of the character are then captured by the appropriate num‐
2604       ber of groups.
2605

SQUARE BRACKETS AND CHARACTER CLASSES

2607       An opening square bracket introduces a character class, terminated by a
2608       closing square bracket. A closing square bracket on its own is not spe‐
2609       cial by default. However, if option PCRE_JAVASCRIPT_COMPAT  is  set,  a
2610       lone  closing  square bracket causes a compile-time error. If a closing
2611       square bracket is required as a member of the class, it is  to  be  the
2612       first  data  character  in  the  class (after an initial circumflex, if
2613       present) or escaped with a backslash.
2614
2615       A character class matches a single character in the subject. In  a  UTF
2616       mode,  the  character  can  be  more than one data unit long. A matched
2617       character must be in the set of characters defined by the class, unless
2618       the  first  character in the class definition is a circumflex, in which
2619       case the subject character must not be in the set defined by the class.
2620       If a circumflex is required as a member of the class, ensure that it is
2621       not the first character, or escape it with a backslash.
2622
2623       For example, the character class [aeiou] matches any  lowercase  vowel,
2624       while [^aeiou] matches any character that is not a lowercase vowel. No‐
2625       tice that a circumflex is just a convenient notation for specifying the
2626       characters  that  are in the class by enumerating those that are not. A
2627       class that starts with a circumflex is not an assertion; it still  con‐
2628       sumes  a  character  from the subject string, and therefore it fails if
2629       the current pointer is at the end of the string.
2630
2631       In UTF-8 mode, characters with values > 255 (0xff) can be included in
2632       a class as a literal string of data units, or by using the \x{ escaping
2633       mechanism.
2634
2635       When caseless matching is set, any letters in a  class  represent  both
2636       their uppercase and lowercase versions. For example, a caseless [aeiou]
2637       matches "A" and "a", and a caseless [^aeiou] does not match "A", but  a
2638       caseful  version would. In a UTF mode, PCRE always understands the con‐
2639       cept of case for characters whose values are < 256, so caseless  match‐
2640       ing  is always possible. For characters with higher values, the concept
2641       of case is supported only if PCRE is  compiled  with  Unicode  property
2642       support. If you want to use caseless matching in a UTF mode for such
2643       characters, ensure that PCRE is compiled with Unicode property support
2644       and with UTF support.
2645
2646       Characters that can indicate line breaks are never treated in any
2647       special way when matching character classes, whatever line-ending
2648       sequence is in use, and whatever setting of options dotall and
2649       multiline is used. A class such as [^a] always matches one of these
2650       characters.
2651
2652       The  minus (hyphen) character can be used to specify a range of charac‐
2653       ters in a character class. For example, [d-m] matches  any  letter  be‐
2654       tween  d and m, inclusive. If a minus character is required in a class,
2655       it must be escaped with a backslash or appear in a  position  where  it
2656       cannot  be interpreted as indicating a range, typically as the first or
2657       last character in the class, or immediately after a range. For example,
2658       [b-d-z] matches letters in the range b to d, a hyphen character, or z.
2659
2660       The  literal  character  "]"  cannot be the end character of a range. A
2661       pattern such as [W-]46] is interpreted as a  class  of  two  characters
2662       ("W"  and  "-")  followed  by a literal string "46]", so it would match
2663       "W46]" or "-46]". However, if "]" is escaped with a  backslash,  it  is
2664       interpreted  as the end of range, so [W-\]46] is interpreted as a class
2665       containing a range followed by two other characters. The octal or hexa‐
2666       decimal representation of "]" can also be used to end a range.
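
       The hyphen and closing-bracket rules above can be verified with a
       sketch such as this one. The module re_range_demo and the helper
       matches/2 are hypothetical names. Each pattern is wrapped in ^ and $ so
       that the whole subject must match.

```erlang
-module(re_range_demo).
-export([matches/2]).

%% True if Pattern matches somewhere in Subject.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.

%% matches("-", "^[b-d-z]$")       returns true   (literal hyphen)
%% matches("e", "^[b-d-z]$")       returns false  (not b..d, "-", or z)
%% matches("W46]", "^[W-]46]$")    returns true   (class, then "46]")
%% matches("-46]", "^[W-]46]$")    returns true
%% matches("X", "^[W-\\]46]$")     returns true   (X is in the range W..])
%% matches("4", "^[W-\\]46]$")     returns true
```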
2667
2668       An  error is generated if a POSIX character class (see below) or an es‐
2669       cape sequence other than one that defines a single character appears at
2670       a  point  where  a  range  ending  character  is expected. For example,
2671       [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not.
2672
2673       Ranges operate in the collating sequence of character values. They  can
2674       also  be  used  for  characters  specified  numerically,  for  example,
2675       [\000-\037]. Ranges can include any characters that are valid  for  the
2676       current mode.
2677
2678       If a range that includes letters is used when caseless matching is set,
2679       it matches the letters in either case. For example, [W-c] is equivalent
2680       to  [][\\^_`wxyzabc], matched caselessly. In a non-UTF mode, if charac‐
2681       ter tables for a French locale are in use, [\xc8-\xcb] matches accented
2682       E  characters in both cases. In UTF modes, PCRE supports the concept of
2683       case for characters with values > 255 only when  it  is  compiled  with
2684       Unicode property support.
2685
2686       The  character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
2687       \w, and \W can appear in a character class, and add the characters that
2688       they  match to the class. For example, [\dABCDEF] matches any hexadeci‐
2689       mal digit. In UTF modes, option ucp affects the meanings of \d, \s,  \w
2690       and  their uppercase partners, just as it does when they appear outside
2691       a character class, as described in section Generic Character Types ear‐
2692       lier. The escape sequence \b has a different meaning inside a character
2693       class; it matches the backspace character. The sequences  \B,  \N,  \R,
2694       and  \X are not special inside a character class. Like any other unrec‐
2695       ognized escape sequences, they are treated as  the  literal  characters
2696       "B", "N", "R", and "X".
2697
2698       A  circumflex  can  conveniently  be  used with the uppercase character
2699       types to specify a more restricted set of characters than the  matching
2700       lowercase  type. For example, class [^\W_] matches any letter or digit,
2701       but not underscore, while [\w] includes underscore. A positive  charac‐
2702       ter  class is to be read as "something OR something OR ..." and a nega‐
2703       tive class as "NOT something AND NOT something AND NOT ...".
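
       As a sketch of character types inside classes, the following compares
       [^\W_] with [\w] and uses \d inside a class. The module re_class_demo
       and the helper matches/2 are hypothetical names.

```erlang
-module(re_class_demo).
-export([matches/2]).

%% True if Pattern matches somewhere in Subject.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.

%% matches("_", "[^\\W_]")      returns false  (underscore excluded)
%% matches("a", "[^\\W_]")      returns true
%% matches("_", "[\\w]")        returns true   (\w includes underscore)
%% matches("F", "[\\dABCDEF]")  returns true   (hexadecimal digit)
%% matches("G", "[\\dABCDEF]")  returns false
```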
2704
2705       Only the following metacharacters are recognized in character classes:
2706
2707         * Backslash
2708
2709         * Hyphen (only where it can be interpreted as specifying a range)
2710
2711         * Circumflex (only at the start)
2712
2713         * Opening square bracket (only when it can be interpreted  as  intro‐
2714           ducing  a Posix class name, or for a special compatibility feature;
2715           see the next two sections)
2716
2717         * Terminating closing square bracket
2718
2719       However, escaping other non-alphanumeric characters does no harm.
2720

POSIX CHARACTER CLASSES

2722       Perl supports the Posix notation for character classes. This uses names
2723       enclosed  by  [: and :] within the enclosing square brackets. PCRE also
2724       supports this notation. For example, the following  matches  "0",  "1",
2725       any alphabetic character, or "%":
2726
2727       [01[:alpha:]%]
2728
2729       The following are the supported class names:
2730
2731         alnum:
2732           Letters and digits
2733
2734         alpha:
2735           Letters
2736
2737         blank:
2738           Space or tab only
2739
2740         cntrl:
2741           Control characters
2742
2743         digit:
2744           Decimal digits (same as \d)
2745
2746         graph:
2747           Printing characters, excluding space
2748
2749         lower:
2750           Lowercase letters
2751
2752         print:
2753           Printing characters, including space
2754
2755         punct:
2756           Printing characters, excluding letters, digits, and space
2757
2758         space:
2759           Whitespace (the same as \s from PCRE 8.34)
2760
2761         upper:
2762           Uppercase letters
2763
2764         word:
2765           "Word" characters (same as \w)
2766
2767         xdigit:
2768           Hexadecimal digits
2769
2770       There  is  another  character  class,  ascii,  that erroneously matches
2771       Latin-1 characters instead of the 0-127 range specified by POSIX.  This
2772       cannot  be fixed without altering the behaviour of other classes, so we
2773       recommend matching the range with [\\0-\x7f] instead.
2774
2775       The default "space" characters are HT (9), LF (10), VT (11),  FF  (12),
2776       CR  (13),  and space (32). If locale-specific matching is taking place,
2777       the list of space characters may be different; there may  be  fewer  or
2778       more of them. "Space" used to be different to \s, which did not include
2779       VT, for Perl compatibility. However, Perl changed at release 5.18,  and
2780       PCRE followed at release 8.34. "Space" and \s now match the same set of
2781       characters.
2782
2783       The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
2784       from  Perl  5.8. Another Perl extension is negation, which is indicated
2785       by a ^ character after the colon. For example,  the  following  matches
2786       "1", "2", or any non-digit:
2787
2788       [12[:^digit:]]
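
       The two POSIX class examples in this section can be tried as follows.
       This is a sketch only; the module re_posix_demo and the helper
       matches/2 are hypothetical names.

```erlang
-module(re_posix_demo).
-export([matches/2]).

%% True if Pattern matches somewhere in Subject.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.

%% matches("%", "^[01[:alpha:]%]$")  returns true
%% matches("x", "^[01[:alpha:]%]$")  returns true
%% matches("2", "^[01[:alpha:]%]$")  returns false
%% matches("1", "^[12[:^digit:]]$")  returns true
%% matches("x", "^[12[:^digit:]]$")  returns true   (any non-digit)
%% matches("3", "^[12[:^digit:]]$")  returns false
```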
2789
2790       PCRE (and Perl) also recognize the Posix syntax [.ch.] and [=ch=] where
2791       "ch" is a "collating element", but these are not supported, and an  er‐
2792       ror is given if they are encountered.
2793
2794       By default, characters with values > 255 do not match any of the Posix
2795       character classes. However, if option ucp is set when the pattern is
2796       compiled, some of the classes are changed so that Unicode character
2797       properties are used. This is achieved by replacing certain Posix
2798       classes by other sequences, as follows:
2799
2800         [:alnum:]:
2801           Becomes \p{Xan}
2802
2803         [:alpha:]:
2804           Becomes \p{L}
2805
2806         [:blank:]:
2807           Becomes \h
2808
2809         [:digit:]:
2810           Becomes \p{Nd}
2811
2812         [:lower:]:
2813           Becomes \p{Ll}
2814
2815         [:space:]:
2816           Becomes \p{Xps}
2817
2818         [:upper:]:
2819           Becomes \p{Lu}
2820
2821         [:word:]:
2822           Becomes \p{Xwd}
2823
2824       Negated versions, such as [:^alpha:], use \P instead of \p. Three other
2825       POSIX classes are handled specially in UCP mode:
2826
2827         [:graph:]:
2828           This matches characters that have glyphs that mark  the  page  when
2829           printed.  In Unicode property terms, it matches all characters with
2830           the L, M, N, P, S, or Cf properties, except for:
2831
2832           U+061C:
2833             Arabic Letter Mark
2834
2835           U+180E:
2836             Mongolian Vowel Separator
2837
2838           U+2066 - U+2069:
2839             Various "isolate"s
2840
2841         [:print:]:
2842           This matches the same characters as [:graph:] plus space characters
2843           that are not controls, that is, characters with the Zs property.
2844
2845         [:punct:]:
2846           This  matches  all characters that have the Unicode P (punctuation)
2847           property, plus those characters whose code points are less than 128
2848           that have the S (Symbol) property.
2849
2850       The  other  POSIX classes are unchanged, and match only characters with
2851       code points less than 128.
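
       The change that option ucp makes to a POSIX class can be seen with the
       Greek letter alpha (U+03B1), whose code point is above 255. The module
       re_posix_ucp_demo and its functions are hypothetical names used only
       for this sketch.

```erlang
-module(re_posix_ucp_demo).
-export([alpha/0, matches/3]).

%% Greek small letter alpha (U+03B1) as a UTF-8 encoded binary.
alpha() -> <<16#3B1/utf8>>.

%% True if Pattern matches somewhere in Subject with the given options.
matches(Subject, Pattern, Opts) ->
    re:run(Subject, Pattern, [{capture, none} | Opts]) =:= match.

%% matches(alpha(), "[[:alpha:]]", [unicode])       returns false
%% matches(alpha(), "[[:alpha:]]", [unicode, ucp])  returns true
%% With ucp set, [:alpha:] is treated as \p{L}.
```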
2852
2853       Compatibility Feature for Word Boundaries
2854
2855       In the POSIX.2 compliant library that was included in 4.4BSD Unix,  the
2856       ugly  syntax  [[:<:]]  and [[:>:]] is used for matching "start of word"
2857       and "end of word". PCRE treats these items as follows:
2858
2859         [[:<:]]:
2860           is converted to \b(?=\w)
2861
2862         [[:>:]]:
2863           is converted to \b(?<=\w)
2864
2865       Only these exact character sequences are recognized. A sequence such as
2866       [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
2867       support is not compatible with Perl. It is provided to help  migrations
2868       from other environments, and is best not used in any new patterns. Note
2869       that \b matches at the start and the end of a word (see "Simple  asser‐
2870       tions"  above),  and in a Perl-style pattern the preceding or following
2871       character normally shows which is wanted, without the need for the  as‐
2872       sertions  that are used above in order to give exactly the POSIX behav‐
2873       iour.
2874

VERTICAL BAR

2876       Vertical bar characters are used to separate alternative patterns.  For
2877       example, the following pattern matches either "gilbert" or "sullivan":
2878
2879       gilbert|sullivan
2880
2881       Any number of alternatives can appear, and an empty alternative is per‐
2882       mitted (matching the empty string). The matching process tries each al‐
2883       ternative  in  turn, from left to right, and the first that succeeds is
2884       used. If the alternatives are within a subpattern (defined  in  section
2885       Subpatterns),  "succeeds" means matching the remaining main pattern and
2886       the alternative in the subpattern.
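
       The left-to-right rule can be tried out with a small sketch such as
       the following. The module and helper names are hypothetical; only
       re:run/3 is part of this module.

```erlang
-module(re_alternation_demo).
-export([examples/0]).

%% Each entry is the text matched by the whole pattern.
examples() ->
    [first("sullivan", "gilbert|sullivan"),  % second alternative matches
     first("abcd", "ab|abcd"),               % first successful alternative wins
     first("xyz", "gilbert|")].              % empty alternative, empty match

first(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, first, list}]) of
        {match, [Text]} -> Text;
        nomatch -> nomatch
    end.
```

       Note that "ab|abcd" yields "ab" for the subject "abcd": the first
       alternative succeeds, so the longer second alternative is never
       tried.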
2887

INTERNAL OPTION SETTING

2889       The  settings  of  the  Perl-compatible  options  caseless,  multiline,
2890       dotall,  and  extended  can be changed from within the pattern by a se‐
2891       quence of Perl option letters enclosed between "(?" and ")". The option
2892       letters are as follows:
2893
2894         i:
2895           For caseless
2896
2897         m:
2898           For multiline
2899
2900         s:
2901           For dotall
2902
2903         x:
2904           For extended
2905
2906       For example, (?im) sets caseless, multiline matching. These options can
2907       also be unset by preceding the letter with a hyphen. A combined setting
2908       and  unsetting  such  as  (?im-sx),  which sets caseless and multiline,
2909       while unsetting dotall and extended, is also permitted. If a letter ap‐
2910       pears both before and after the hyphen, the option is unset.
2911
2912       The  PCRE-specific options dupnames, ungreedy, and extra can be changed
2913       in the same way as the Perl-compatible options by using the  characters
2914       J, U, and X respectively.
2915
2916       When  one of these option changes occurs at top-level (that is, not in‐
2917       side subpattern parentheses), the change applies to  the  remainder  of
2918       the pattern that follows.
2919
2920       An  option change within a subpattern (see section Subpatterns) affects
2921       only that part of the subpattern that follows  it.  So,  the  following
2922       matches  abc  and  aBc  and  no other strings (assuming caseless is not
2923       used):
2924
2925       (a(?i)b)c
2926
2927       By this means, options can be made to have different settings  in  dif‐
2928       ferent  parts  of  the  pattern. Any changes made in one alternative do
2929       carry on into subsequent branches within the same subpattern. For exam‐
2930       ple:
2931
2932       (a(?i)b|c)
2933
2934       matches  "ab", "aB", "c", and "C", although when matching "C" the first
2935       branch is abandoned before the option setting. This is because the  ef‐
2936       fects  of  option  settings  occur at compile time. There would be some
2937       weird behavior otherwise.
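
       The two patterns above can be checked with a sketch along these
       lines. The module and function names are made up for illustration,
       and the patterns are anchored with ^ and $ so that only whole
       subjects count as matches.

```erlang
-module(re_inline_option_demo).
-export([scoped/1, branches/1]).

%% (?i) inside the group applies only to the "b" that follows it.
scoped(Subject) ->
    re:run(Subject, "^(a(?i)b)c$", [{capture, none}]).

%% The (?i) set in the first branch is still in force for "c".
branches(Subject) ->
    re:run(Subject, "^(a(?i)b|c)$", [{capture, none}]).
```

       Here scoped("aBc") returns match while scoped("abC") returns
       nomatch, and branches("C") returns match although the first branch
       is abandoned before its option setting is reached.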
2938
2939   Note:
2940       Other PCRE-specific options can be set by the application when the com‐
2941       piling or matching functions are called. Sometimes the pattern can con‐
2942       tain special leading sequences, such as (*CRLF), to override  what  the
2943       application has set or what has been defaulted. Details are provided in
2944       section  Newline Sequences earlier.
2945
2946       The (*UTF8) and (*UCP) leading sequences can be used  to  set  UTF  and
2947       Unicode  property modes. They are equivalent to setting options unicode
2948       and ucp, respectively. The (*UTF) sequence is a  generic  version  that
2949       can be used with any of the libraries. However, the application can set
2950       option never_utf, which locks out the use of the (*UTF) sequences.
2951
2952

SUBPATTERNS

2954       Subpatterns are delimited by parentheses (round brackets), which can be
2955       nested. Turning part of a pattern into a subpattern does two things:
2956
2957         1.:
2958           It localizes a set of alternatives. For example, the following pat‐
2959           tern matches "cataract", "caterpillar", or "cat":
2960
2961         cat(aract|erpillar|)
2962
2963           Without the parentheses, it would match "cataract", "erpillar",  or
2964           an empty string.
2965
2966         2.:
2967           It  sets up the subpattern as a capturing subpattern. That is, when
2968           the complete pattern matches, that portion of  the  subject  string
2969           that  matched  the  subpattern is passed back to the caller through
2970           the return value of run/3.
2971
2972       Opening parentheses are counted from left to right (starting from 1) to
2973       obtain  numbers  for  the  capturing  subpatterns.  For example, if the
2974       string "the red king" is matched against  the  following  pattern,  the
2975       captured substrings are "red king", "red", and "king", and are numbered
2976       1, 2, and 3, respectively:
2977
2978       the ((red|white) (king|queen))
2979
2980       It is not always helpful that plain parentheses fulfill two  functions.
2981       Often  a  grouping  subpattern is required without a capturing require‐
2982       ment. If an opening parenthesis is followed by a question  mark  and  a
2983       colon,  the  subpattern  does  not do any capturing, and is not counted
2984       when computing the number of any subsequent capturing subpatterns.  For
2985       example, if the string "the white queen" is matched against the follow‐
2986       ing pattern, the captured substrings are "white queen" and "queen", and
2987       are numbered 1 and 2:
2988
2989       the ((?:red|white) (king|queen))
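
       The numbering of capturing and non-capturing groups described above
       can be observed through the capture option of run/3. The following
       is a minimal sketch; the module and function names are assumptions.

```erlang
-module(re_capture_demo).
-export([captures/2]).

%% Returns the substrings captured by groups 1..N as strings.
captures(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, all_but_first, list}]).
```

       For example, captures("the red king",
       "the ((red|white) (king|queen))") returns
       {match,["red king","red","king"]}, while captures("the white queen",
       "the ((?:red|white) (king|queen))") returns
       {match,["white queen","queen"]}.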
2990
2991       The maximum number of capturing subpatterns is 65535.
2992
2993       As  a  convenient shorthand, if any option settings are required at the
2994       start of a non-capturing subpattern, the option letters can appear  be‐
2995       tween  "?" and ":". Thus, the following two patterns match the same set
2996       of strings:
2997
2998       (?i:saturday|sunday)
2999       (?:(?i)saturday|sunday)
3000
3001       As alternative branches are tried from left to right, and  options  are
3002       not reset until the end of the subpattern is reached, an option setting
3003       in one branch does affect subsequent branches, so  the  above  patterns
3004       match both "SUNDAY" and "Saturday".
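
       A quick way to confirm that the two spellings behave alike is to
       run both against the same subject, as in this sketch (the module
       and function names are hypothetical):

```erlang
-module(re_scoped_option_demo).
-export([day/1]).

%% Runs both spellings of the pattern and returns the pair of results.
day(Subject) ->
    {re:run(Subject, "^(?i:saturday|sunday)$", [{capture, none}]),
     re:run(Subject, "^(?:(?i)saturday|sunday)$", [{capture, none}])}.
```

       Both day("SUNDAY") and day("Saturday") return {match,match}.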
3005

DUPLICATE SUBPATTERN NUMBERS

3007       Perl  5.10  introduced a feature where each alternative in a subpattern
3008       uses the same numbers for its capturing parentheses. Such a  subpattern
3009       starts  with (?| and is itself a non-capturing subpattern. For example,
3010       consider the following pattern:
3011
3012       (?|(Sat)ur|(Sun))day
3013
3014       As the two alternatives are inside a (?| group, both sets of  capturing
3015       parentheses  are  numbered one. Thus, when the pattern matches, you can
3016       look at captured substring number one, whichever  alternative  matched.
3017       This  construct is useful when you want to capture a part, but not all,
3018       of one of many alternatives. Inside a (?| group, parentheses  are  num‐
3019       bered  as  usual,  but the number is reset at the start of each branch.
3020       The numbers of any capturing parentheses  that  follow  the  subpattern
3021       start  after the highest number used in any branch. The following exam‐
3022       ple is from the Perl documentation;  the  numbers  underneath  show  in
3023       which buffer the captured content is stored:
3024
3025       # before  ---------------branch-reset----------- after
3026       / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
3027       # 1            2         2  3        2     3     4
3028
3029       A  back  reference  to a numbered subpattern uses the most recent value
3030       that is set for that number by any subpattern.  The  following  pattern
3031       matches "abcabc" or "defdef":
3032
3033       /(?|(abc)|(def))\1/
3034
3035       In  contrast,  a subroutine call to a numbered subpattern always refers
3036       to the first one in the pattern with the given  number.  The  following
3037       pattern matches "abcabc" or "defabc":
3038
3039       /(?|(abc)|(def))(?1)/
3040
3041       If  a  condition  test for a subpattern having matched refers to a non-
3042       unique number, the test is true if any of the subpatterns of that  num‐
3043       ber have matched.
3044
3045       An alternative approach to using this "branch reset" feature is to use
3046       duplicate named subpatterns, as described in the next section.
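
       The difference between a back reference and a subroutine call to a
       duplicated number can be sketched as follows. The module and
       function names are illustrative only, and the last two patterns are
       anchored with ^ and $ so that only whole subjects match.

```erlang
-module(re_branch_reset_demo).
-export([day_prefix/1, backref/1, subroutine/1]).

%% Group 1 is whichever of (Sat) or (Sun) took part in the match.
day_prefix(Subject) ->
    re:run(Subject, "(?|(Sat)ur|(Sun))day", [{capture, [1], list}]).

%% \1 repeats the text most recently captured as group 1.
backref(Subject) ->
    re:run(Subject, "^(?|(abc)|(def))\\1$", [{capture, none}]).

%% (?1) re-runs the first group numbered 1, that is, (abc).
subroutine(Subject) ->
    re:run(Subject, "^(?|(abc)|(def))(?1)$", [{capture, none}]).
```

       With these definitions, backref("defdef") matches but
       backref("defabc") does not, whereas subroutine("defabc") matches
       but subroutine("defdef") does not.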
3047

NAMED SUBPATTERNS

3049       Identifying capturing parentheses by number is simple, but  it  can  be
3050       hard  to  keep track of the numbers in complicated regular expressions.
3051       Also, if an expression is modified, the numbers  can  change.  To  help
3052       with  this  difficulty,  PCRE  supports the naming of subpatterns. This
3053       feature was not added to Perl until release 5.10. Python had  the  fea‐
3054       ture  earlier,  and PCRE introduced it at release 4.0, using the Python
3055       syntax. PCRE now supports both the Perl and the Python syntax. Perl al‐
3056       lows identically numbered subpatterns to have different names, but PCRE
3057       does not.
3058
3059       In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)
3060       or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
3061       to capturing parentheses from other parts of the pattern, such as  back
3062       references,  recursion, and conditions, can be made by name and by num‐
3063       ber.
3064
3065       Names consist of up to 32 alphanumeric characters and underscores,  but
3066       must  start with a non-digit. Named capturing parentheses are still al‐
3067       located numbers as well as names, exactly as  if  the  names  were  not
3068       present.  The  capture  specification  to run/3 can use named values if
3069       they are present in the regular expression.
3070
3071       By default, a name must be unique within a pattern, but this constraint
3072       can  be  relaxed by setting option dupnames at compile time. (Duplicate
3073       names are also always permitted for subpatterns with the  same  number,
3074       set  up  as  described in the previous section.) Duplicate names can be
3075       useful for patterns where only one instance of  the  named  parentheses
3076       can match. Suppose that you want to match the name of a weekday, either
3077       as a 3-letter abbreviation or as the full name, and in both  cases  you
3078       want  to  extract the abbreviation. The following pattern (ignoring the
3079       line breaks) does the job:
3080
3081       (?<DN>Mon|Fri|Sun)(?:day)?|
3082       (?<DN>Tue)(?:sday)?|
3083       (?<DN>Wed)(?:nesday)?|
3084       (?<DN>Thu)(?:rsday)?|
3085       (?<DN>Sat)(?:urday)?
3086
3087       There are five capturing substrings, but only one is ever set  after  a
3088       match.  (An alternative way of solving this problem is to use a "branch
3089       reset" subpattern, as described in the previous section.)
3090
3091       For capturing named subpatterns whose names are not unique, the first
3092       matching  occurrence (counted from left to right in the subject) is re‐
3093       turned from run/3, if the name is specified in the values part  of  the
3094       capture  statement. The all_names capturing value matches all the names
3095       in the same way.
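
       The following sketch shows named captures being requested from
       run/3, first with unique names and then with the weekday pattern
       above compiled with dupnames. The module name, function names, and
       the date pattern are assumptions introduced for this example.

```erlang
-module(re_named_capture_demo).
-export([date_parts/1, day_abbrev/1]).

%% Captures are selected by name; here the names are given as atoms.
date_parts(Subject) ->
    re:run(Subject, "(?<year>\\d{4})-(?<month>\\d{2})",
           [{capture, [year, month], list}]).

%% The weekday pattern from the text, with dupnames set so that
%% the name DN can be reused in every alternative.
day_abbrev(Subject) ->
    Pattern = "(?<DN>Mon|Fri|Sun)(?:day)?|"
              "(?<DN>Tue)(?:sday)?|"
              "(?<DN>Wed)(?:nesday)?|"
              "(?<DN>Thu)(?:rsday)?|"
              "(?<DN>Sat)(?:urday)?",
    re:run(Subject, Pattern, [dupnames, {capture, ['DN'], list}]).
```

       For instance, date_parts("2017-01-11") returns
       {match,["2017","01"]}, and day_abbrev("Tuesday") is expected to
       return {match,["Tue"]}.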
3096
3097   Note:
3098       You cannot use different names to distinguish between  two  subpatterns
3099       with  the same number, as PCRE uses only the numbers when matching. For
3100       this reason, an error is given at compile time if different names are
3101       given to subpatterns with the same number. However, you can give the
3102       same name to subpatterns with the same number, even when dupnames
3103       is not set.
3104
3105

REPETITION

3107       Repetition  is  specified  by  quantifiers, which can follow any of the
3108       following items:
3109
3110         * A literal data character
3111
3112         * The dot metacharacter
3113
3114         * The \C escape sequence
3115
3116         * The \X escape sequence
3117
3118         * The \R escape sequence
3119
3120         * An escape such as \d or \pL that matches a single character
3121
3122         * A character class
3123
3124         * A back reference (see the next section)
3125
3126         * A parenthesized subpattern (including assertions)
3127
3128         * A subroutine call to a subpattern (recursive or otherwise)
3129
3130       The general repetition quantifier specifies a minimum and maximum  num‐
3131       ber  of  permitted matches, by giving the two numbers in curly brackets
3132       (braces), separated by a comma. The numbers must be <  65536,  and  the
3133       first  must  be less than or equal to the second. For example, the fol‐
3134       lowing matches "zz", "zzz", or "zzzz":
3135
3136       z{2,4}
3137
3138       A closing brace on its own is not a special character.  If  the  second
3139       number  is  omitted, but the comma is present, there is no upper limit.
3140       If the second number and the comma are  both  omitted,  the  quantifier
3141       specifies  an  exact  number  of  required matches. Thus, the following
3142       matches at least three successive vowels, but can match many more:
3143
3144       [aeiou]{3,}
3145
3146       The following matches exactly eight digits:
3147
3148       \d{8}
3149
3150       An opening curly bracket that appears in a position where a  quantifier
3151       is  not allowed, or one that does not match the syntax of a quantifier,
3152       is taken as a literal character. For example, {,6} is not a quantifier,
3153       but a literal string of four characters.
3154
3155       In  Unicode  mode, quantifiers apply to characters rather than to indi‐
3156       vidual data units. Thus, for example, \x{100}{2}  matches  two  charac‐
3157       ters,  each  of  which  is  represented by a 2-byte sequence in a UTF-8
3158       string. Similarly, \X{3} matches three Unicode extended grapheme  clus‐
3159       ters,  each  of  which  can be many data units long (and they can be of
3160       different lengths).
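
       The quantifier forms described so far can be exercised with a
       sketch like the one below. The module name, helper name, and
       subjects are chosen for illustration only.

```erlang
-module(re_quantifier_demo).
-export([examples/0]).

%% Each entry is the text matched by the whole pattern.
examples() ->
    [first("zzzzz", "z{2,4}", []),                       % two to four "z"
     first("beautiful", "[aeiou]{3,}", []),              % three or more vowels
     first("id 20170111", "\\d{8}", []),                 % exactly eight digits
     first("x{,6}", "x{,6}", []),                        % {,6} is literal text
     first([256, 256, 256], "\\x{100}{2}", [unicode])].  % two characters

first(Subject, Pattern, Options) ->
    case re:run(Subject, Pattern, [{capture, first, list} | Options]) of
        {match, [Text]} -> Text;
        nomatch -> nomatch
    end.
```

       The last entry is returned as the list [256,256], that is, two
       characters rather than two bytes, because option unicode is set.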
3161
3162       The quantifier {0} is permitted, causing the expression to behave as if
3163       the previous item and the quantifier were not present. This can be use‐
3164       ful for subpatterns that are referenced as subroutines  from  elsewhere
3165       in  the  pattern (but see also section  Defining Subpatterns for Use by
3166       Reference Only). Items other than subpatterns that have a  {0}  quanti‐
3167       fier are omitted from the compiled pattern.
3168
3169       For  convenience, the three most common quantifiers have single-charac‐
3170       ter abbreviations:
3171
3172         *:
3173           Equivalent to {0,}
3174
3175         +:
3176           Equivalent to {1,}
3177
3178         ?:
3179           Equivalent to {0,1}
3180
3181       Infinite loops can be constructed by following a  subpattern  that  can
3182       match  no characters with a quantifier that has no upper limit, for ex‐
3183       ample:
3184
3185       (a?)*
3186
3187       Earlier versions of Perl and PCRE used to give an error at compile time
3188       for such patterns. As there are cases where this can be useful, such
3189       patterns are now accepted, but if any repetition of the subpattern
3190       matches no characters, the loop is forcibly broken.
3191
3192       By  default,  the quantifiers are "greedy", that is, they match as much
3193       as possible (up to the maximum  number  of  permitted  times),  without
3194       causing  the  remaining  pattern  to fail. The classic example of where
3195       this gives problems is in trying to match comments in C programs. These
3196       appear  between /* and */. Within the comment, individual * and / char‐
3197       acters can appear. An attempt to match C comments by applying the  pat‐
3198       tern
3199
3200       /\*.*\*/
3201
3202       to the string
3203
3204       /* first comment */  not comment  /* second comment */
3205
3206       fails,  as  it matches the entire string owing to the greediness of the
3207       .* item.
3208
3209       However, if a quantifier is followed by a question mark, it  ceases  to
3210       be greedy, and instead matches the minimum number of times possible, so
3211       the following pattern does the right thing with the C comments:
3212
3213       /\*.*?\*/
3214
3215       The meaning of the various quantifiers is not otherwise  changed,  only
3216       the  preferred  number  of matches. Do not confuse this use of question
3217       mark with its use as a quantifier in its own right. As it has two uses,
3218       it can sometimes appear doubled, as in
3219
3220       \d??\d
3221
3222       which matches one digit by preference, but can match two if that is the
3223       only way the remaining pattern matches.
3224
3225       If option ungreedy is set (an option that is not  available  in  Perl),
3226       the  quantifiers  are not greedy by default, but individual ones can be
3227       made greedy by following them with a question mark. That is, it inverts
3228       the default behavior.
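
       The C comment example, and the effect of option ungreedy on it, can
       be reproduced with a sketch such as this one. The module and
       function names are hypothetical.

```erlang
-module(re_greedy_demo).
-export([greedy/1, lazy/1, inverted/1]).

%% Greedy .* runs on to the last */ in the subject.
greedy(Subject) ->
    first(Subject, "/\\*.*\\*/", []).

%% Lazy .*? stops at the first */ that lets the pattern match.
lazy(Subject) ->
    first(Subject, "/\\*.*?\\*/", []).

%% With ungreedy set, the trailing ? makes the quantifier greedy again.
inverted(Subject) ->
    first(Subject, "/\\*.*?\\*/", [ungreedy]).

first(Subject, Pattern, Options) ->
    case re:run(Subject, Pattern, [{capture, first, list} | Options]) of
        {match, [Text]} -> Text;
        nomatch -> nomatch
    end.
```

       Applied to the subject line shown earlier, greedy/1 and inverted/1
       return the entire line, while lazy/1 returns only
       "/* first comment */".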
3229
3230       When  a  parenthesized  subpattern  is quantified with a minimum repeat
3231       count that is > 1 or with a limited maximum, more  memory  is  required
3232       for  the  compiled pattern, in proportion to the size of the minimum or
3233       maximum.
3234
3235       If a pattern starts with .* or .{0,} and option dotall  (equivalent  to
3236       Perl  option  /s)  is set, thus allowing the dot to match newlines, the
3237       pattern is implicitly  anchored,  because  whatever  follows  is  tried
3238       against every character position in the subject string. So, there is no
3239       point in retrying the overall match at any position  after  the  first.
3240       PCRE normally treats such a pattern as if it was preceded by \A.
3241
3242       In  cases  where  it  is known that the subject string contains no new‐
3243       lines, it is worth setting dotall to obtain this optimization,  or  al‐
3244       ternatively using ^ to indicate anchoring explicitly.
3245
3246       However,  there  are  some cases where the optimization cannot be used.
3247       When .* is inside capturing parentheses that are the subject of a  back
3248       reference elsewhere in the pattern, a match at the start can fail where
3249       a later one succeeds. Consider, for example:
3250
3251       (.*)abc\1
3252
3253       If the subject is "xyz123abc123", the match point is the fourth charac‐
3254       ter. Therefore, such a pattern is not implicitly anchored.
3255
3256       Another  case where implicit anchoring is not applied is when the lead‐
3257       ing .* is inside an atomic group. Once again, a match at the start  can
3258       fail where a later one succeeds. Consider the following pattern:
3259
3260       (?>.*?a)b
3261
3262       It matches "ab" in the subject "aab". The use of the backtracking control
3263       verbs (*PRUNE) and (*SKIP) also disables this optimization.
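
       Both counterexamples can be verified by asking run/3 for the
       position of the match instead of its text. This is a minimal
       sketch; the module and function names are assumptions.

```erlang
-module(re_anchoring_demo).
-export([match_position/2]).

%% Returns {Offset, Length} of the overall match (Offset is zero-based),
%% or nomatch.
match_position(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, first, index}]) of
        {match, [Position]} -> Position;
        nomatch -> nomatch
    end.
```

       Here match_position("xyz123abc123", "(.*)abc\\1") returns {3,9},
       that is, the match starts at the fourth character, and
       match_position("aab", "(?>.*?a)b") returns {1,2}.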
3264
3265       When a capturing subpattern is repeated, the value captured is the sub‐
3266       string that matched the final iteration. For example, after
3267
3268       (tweedle[dume]{3}\s*)+
3269
3270       has  matched  "tweedledum  tweedledee",  the value of the captured sub‐
3271       string is "tweedledee". However, if there are nested capturing  subpat‐
3272       terns,  the corresponding captured values can have been set in previous
3273       iterations. For example, after
3274
3275       /(a|(b))+/
3276
3277       matches "aba", the value of the second captured substring is "b".
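
       The values left in repeated capturing groups can be inspected as in
       the following sketch (the module and function names are
       illustrative):

```erlang
-module(re_repeated_capture_demo).
-export([tweedle/0, nested/0]).

%% Group 1 holds the text matched by the final iteration.
tweedle() ->
    re:run("tweedledum tweedledee", "(tweedle[dume]{3}\\s*)+",
           [{capture, [1], list}]).

%% Group 2 keeps the "b" set during an earlier iteration.
nested() ->
    re:run("aba", "(a|(b))+", [{capture, all, list}]).
```

       tweedle() returns {match,["tweedledee"]} and nested() returns
       {match,["aba","a","b"]}.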
3278

ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

3280       With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
3281       repetition,  failure  of what follows normally causes the repeated item
3282       to be re-evaluated to see if a different number of repeats  allows  the
3283       remaining pattern to match. Sometimes it is useful to prevent this, ei‐
3284       ther to change the nature of the match, or to cause it to fail  earlier
3285       than  it  otherwise  might,  when  the author of the pattern knows that
3286       there is no point in carrying on.
3287
3288       Consider, for example, the pattern \d+foo when applied to the following
3289       subject line:
3290
3291       123456bar
3292
3293       After matching all six digits and then failing to match "foo", the nor‐
3294       mal action of the matcher is to try again with only five digits  match‐
3295       ing item \d+, and then with four, and so on, before ultimately failing.
3296       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
3297       the  means for specifying that once a subpattern has matched, it is not
3298       to be re-evaluated in this way.
3299
3300       If atomic grouping is used for the previous example, the matcher  gives
3301       up  immediately  on failing to match "foo" the first time. The notation
3302       is a kind of special parenthesis, starting with (?> as in the following
3303       example:
3304
3305       (?>\d+)foo
3306
3307       This kind of parenthesis "locks up" the part of the pattern it contains
3308       once it has matched, and a failure further into  the  pattern  is  pre‐
3309       vented  from  backtracking  into  it.  Backtracking past it to previous
3310       items, however, works as normal.
3311
3312       An alternative description is that a subpattern of  this  type  matches
3313       the  string  of  characters  that an identical standalone pattern would
3314       match, if anchored at the current point in the subject string.
3315
3316       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
3317       such as the above example can be thought of as a maximizing repeat that
3318       must swallow everything it can. So, while both \d+ and  \d+?  are  pre‐
3319       pared  to  adjust the number of digits they match to make the remaining
3320       pattern match, (?>\d+) can only match an entire sequence of digits.
3321
3322       Atomic groups in general can contain any complicated  subpatterns,  and
3323       can be nested. However, when the subpattern for an atomic group is just
3324       a single repeated item, as in the example above,  a  simpler  notation,
3325       called a "possessive quantifier" can be used. This consists of an extra
3326       + character following a quantifier. Using this notation,  the  previous
3327       example can be rewritten as
3328
3329       \d++foo
3330
3331       Notice  that  a possessive quantifier can be used with an entire group,
3332       for example:
3333
3334       (abc|xyz){2,3}+
3335
3336       Possessive quantifiers are always greedy; the  setting  of  option  un‐
3337       greedy is ignored. They are a convenient notation for the simpler forms
3338       of an atomic group. There is no difference in meaning between a
3339       possessive quantifier and the equivalent atomic group, but there can be
3340       a performance difference; possessive quantifiers are probably  slightly
3341       faster.
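
       Besides failing earlier, an atomic group or possessive quantifier
       can change whether a pattern matches at all. The sketch below uses
       a pattern that is not taken from the text (digits followed by one
       more digit) to show this; the module and function names are
       hypothetical.

```erlang
-module(re_atomic_demo).
-export([plain/1, atomic/1, possessive/1]).

%% \d+ gives back one digit so that the final \d can match.
plain(Subject) ->
    run(Subject, "^\\d+\\d$").

%% The atomic group keeps every digit, so the final \d cannot match.
atomic(Subject) ->
    run(Subject, "^(?>\\d+)\\d$").

%% Possessive form of the same atomic group.
possessive(Subject) ->
    run(Subject, "^\\d++\\d$").

run(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]).
```

       For the subject "123", plain/1 returns match, while atomic/1 and
       possessive/1 both return nomatch.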
3342
3343       The  possessive  quantifier syntax is an extension to the Perl 5.8 syn‐
3344       tax. Jeffrey Friedl originated the idea (and the  name)  in  the  first
3345       edition of his book. Mike McCloskey liked it, so implemented it when he
3346       built the Sun Java package, and PCRE copied it  from  there.  It  ulti‐
3347       mately found its way into Perl at release 5.10.
3348
3349       PCRE has an optimization that automatically "possessifies" certain sim‐
3350       ple pattern constructs. For example, the sequence  A+B  is  treated  as
3351       A++B,  as there is no point in backtracking into a sequence of A's when
3352       B must follow.
3353
3354       When a pattern contains an unlimited repeat inside  a  subpattern  that
3355       can  itself  be  repeated  an  unlimited number of times, the use of an
3356       atomic group is the only way to avoid some  failing  matches  taking  a
3357       long time. The pattern
3358
3359       (\D+|<\d+>)*[!?]
3360
3361       matches  an  unlimited number of substrings that either consist of non-
3362       digits, or digits enclosed in <>, followed by ! or ?. When it  matches,
3363       it runs quickly. However, if it is applied to
3364
3365       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3366
3367       it  takes  a  long  time  before reporting failure. This is because the
3368       string can be divided between the internal \D+ repeat and the  external
3369       *  repeat  in  many ways, and all must be tried. (The example uses [!?]
3370       rather than a single character at the end, as both PCRE and  Perl  have
3371       an optimization that allows for fast failure when a single character is
3372       used. They remember the last single character that is  required  for  a
3373       match,  and fail early if it is not present in the string.) If the pat‐
3374       tern is changed so that it uses an atomic group,  like  the  following,
3375       sequences of non-digits cannot be broken, and failure happens quickly:
3376
3377       ((?>\D+)|<\d+>)*[!?]
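
       The two patterns can be compared with a sketch like the following.
       The module and function names are assumptions, and option
       report_errors is added so that a run stopped by the internal match
       limit is reported rather than shown as a plain nomatch.

```erlang
-module(re_atomic_backtrack_demo).
-export([subject/0, plain/1, atomic/1]).

%% The 52-character subject from the text above.
subject() ->
    lists:duplicate(52, $a).

%% Original pattern: the subject can be split between \D+ and the
%% outer * in a very large number of ways before failure is reported.
plain(Subject) ->
    re:run(Subject, "(\\D+|<\\d+>)*[!?]",
           [report_errors, {capture, none}]).

%% Atomic variant: a run of non-digits is never split up again.
atomic(Subject) ->
    re:run(Subject, "((?>\\D+)|<\\d+>)*[!?]",
           [report_errors, {capture, none}]).
```

       atomic(subject()) should return nomatch almost immediately.
       plain(subject()) can take noticeably longer, and it may return
       {error,match_limit} instead of nomatch if the match limit is
       reached first.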
3378

BACK REFERENCES

3380       Outside  a  character  class,  a backslash followed by a digit > 0 (and
3381       possibly further digits) is a back reference to a capturing  subpattern
3382       earlier (that is, to its left) in the pattern, provided there have been
3383       that many previous capturing left parentheses.
3384
3385       However, if the decimal number following the backslash is < 10,  it  is
3386       always taken as a back reference, and causes an error only if there are
3387       not that many capturing left parentheses in the  entire  pattern.  That
3388       is, the parentheses that are referenced need not be to the left of
3389       the reference for numbers < 10. A "forward back reference" of this type
3390       can  make sense when a repetition is involved and the subpattern to the
3391       right has participated in an earlier iteration.
3392
3393       It is not possible to have a numerical "forward back  reference"  to  a
3394       subpattern  whose number is 10 or more using this syntax, as a sequence
3395       such as \50 is interpreted as a character defined in  octal.  For  more
3396       details  of  the  handling of digits following a backslash, see section
3397       Non-Printing Characters earlier. There is no such  problem  when  named
3398       parentheses  are  used.  A back reference to any subpattern is possible
3399       using named parentheses (see below).
3400
3401       Another way to avoid the ambiguity inherent in the use of  digits  fol‐
3402       lowing  a  backslash is to use the \g escape sequence. This escape must
3403       be followed by an unsigned number or a negative number, optionally  en‐
3404       closed in braces. The following examples are identical:
3405
3406       (ring), \1
3407       (ring), \g1
3408       (ring), \g{1}
3409
3410       An  unsigned number specifies an absolute reference without the ambigu‐
3411       ity that is present in the older syntax. It is also useful when literal
3412       digits follow the reference. A negative number is a relative reference.
3413       Consider the following example:
3414
3415       (abc(def)ghi)\g{-1}
3416
3417       The sequence \g{-1} is a reference to the most recently started captur‐
3418       ing subpattern before \g, that is, it is equivalent to \2 in this exam‐
3419       ple. Similarly, \g{-2} would be equivalent to \1. The use  of  relative
3420       references  can  be helpful in long patterns, and also in patterns that
3421       are created by joining fragments  containing  references  within  them‐
3422       selves.
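
       The \g forms above can be tried out as follows. This is only a
       sketch: the module and helper names are made up, the patterns are
       anchored with ^ and $, and the last entry illustrates a literal
       digit following a reference.

```erlang
-module(re_g_backref_demo).
-export([examples/0]).

%% Every entry is expected to be true.
examples() ->
    [matches("ring, ring", "^(ring), \\1$"),
     matches("ring, ring", "^(ring), \\g1$"),
     matches("ring, ring", "^(ring), \\g{1}$"),
     matches("abcdefghidef", "^(abc(def)ghi)\\g{-1}$"),        % same as \2
     matches("abcdefghiabcdefghi", "^(abc(def)ghi)\\g{-2}$"),  % same as \1
     matches("aa0", "^(a)\\g{1}0$")].                          % \1, then "0"

matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```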
3423
3424       A  back  reference matches whatever matched the capturing subpattern in
3425       the current subject string, rather than anything matching the subpattern
3426       itself (section Subpatterns as Subroutines describes a way of doing
3427       that). So, the following pattern matches "sense  and  sensibility"  and
3428       "response and responsibility", but not "sense and responsibility":
3429
3430       (sens|respons)e and \1ibility
3431
3432       If  caseful matching is in force at the time of the back reference, the
3433       case of letters is relevant. For example, the  following  matches  "rah
3434       rah"  and "RAH RAH", but not "RAH rah", although the original capturing
3435       subpattern is matched caselessly:
3436
3437       ((?i)rah)\s+\1
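
       The two preceding examples can be checked with a sketch such as the
       following. The module and function names are hypothetical, and the
       patterns are anchored with ^ and $ so that only whole subjects
       match.

```erlang
-module(re_backref_demo).
-export([sense/1, rah/1]).

%% \1 must repeat the text that group 1 actually matched.
sense(Subject) ->
    re:run(Subject, "^(sens|respons)e and \\1ibility$", [{capture, none}]).

%% The group is matched caselessly, but \1 is compared casefully.
rah(Subject) ->
    re:run(Subject, "^((?i)rah)\\s+\\1$", [{capture, none}]).
```

       sense("sense and responsibility") returns nomatch, and
       rah("RAH rah") returns nomatch although rah("RAH RAH") matches.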
3438
3439       There are many different ways of writing back references to named  sub‐
3440       patterns.  The  .NET  syntax  \k{name}  and the Perl syntax \k<name> or
3441       \k'name' are supported, as is the Python syntax (?P=name). The  unified
3442       back  reference  syntax  in Perl 5.10, in which \g can be used for both
3443       numeric and named references, is also supported. The  previous  example
3444       can be rewritten in the following ways:
3445
3446       (?<p1>(?i)rah)\s+\k<p1>
3447       (?'p1'(?i)rah)\s+\k{p1}
3448       (?P<p1>(?i)rah)\s+(?P=p1)
3449       (?<p1>(?i)rah)\s+\g{p1}
3450
3451       A  subpattern  that is referenced by name can appear in the pattern be‐
3452       fore or after the reference.
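
       That the four spellings above are interchangeable can be checked
       with a sketch along these lines (the module and function names are
       assumptions, and ^ and $ are added to anchor the patterns):

```erlang
-module(re_named_backref_demo).
-export([patterns/0, all_match/1]).

%% The four equivalent spellings of a named back reference.
patterns() ->
    ["^(?<p1>(?i)rah)\\s+\\k<p1>$",
     "^(?'p1'(?i)rah)\\s+\\k{p1}$",
     "^(?P<p1>(?i)rah)\\s+(?P=p1)$",
     "^(?<p1>(?i)rah)\\s+\\g{p1}$"].

%% True if every spelling matches the whole subject.
all_match(Subject) ->
    lists:all(fun(Pattern) ->
                      re:run(Subject, Pattern, [{capture, none}]) =:= match
              end, patterns()).
```

       all_match("RAH RAH") returns true and all_match("RAH rah") returns
       false.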
3453
3454       There can be more than one back reference to the same subpattern. If  a
3455       subpattern has not been used in a particular match, any back reference
3456       to it always fails. For example, the following pattern always fails  if
3457       it starts to match "a" rather than "bc":
3458
3459       (a|(bc))\2
3460
3461       As  there  can  be  many capturing parentheses in a pattern, all digits
3462       following the backslash are taken as part of a potential back reference
3463       number. If the pattern continues with a digit character, some delimiter
3464       must be used to terminate the back reference.  If  option  extended  is
3465       set,  this  can  be whitespace. Otherwise an empty comment (see section
3466       Comments) can be used.
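
       The sketch below illustrates a back reference to an unused
       subpattern and the two ways of terminating a back reference before
       a literal digit. The module name, function names, and the pattern
       (a)\1 followed by "1" are assumptions chosen for this example.

```erlang
-module(re_backref_delimiter_demo).
-export([unset_group/1, comment_delimiter/1, whitespace_delimiter/1]).

%% \2 fails whenever the first alternative "a" was taken.
unset_group(Subject) ->
    re:run(Subject, "^(a|(bc))\\2$", [{capture, none}]).

%% An empty comment separates \1 from the literal digit 1.
comment_delimiter(Subject) ->
    re:run(Subject, "^(a)\\1(?#)1$", [{capture, none}]).

%% With extended set, ignored whitespace serves the same purpose.
whitespace_delimiter(Subject) ->
    re:run(Subject, "^(a)\\1 1$", [extended, {capture, none}]).
```

       unset_group("aa") returns nomatch while unset_group("bcbc")
       matches; both delimiter variants are expected to match "aa1".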
3467
3468       Recursive Back References
3469
3470       A back reference that occurs inside the parentheses to which it  refers
3471       fails  when  the subpattern is first used, so, for example, (a\1) never
3472       matches. However, such references can be useful inside repeated subpat‐
3473       terns.  For  example,  the following pattern matches any number of "a"s
3474       and also "aba", "ababbaa", and so on:
3475
3476       (a|b\1)+
3477
3478       At each iteration of the subpattern, the  back  reference  matches  the
3479       character  string corresponding to the previous iteration. In order for
3480       this to work, the pattern must be such that the  first  iteration  does
3481       not  need  to match the back reference. This can be done using alterna‐
3482       tion, as in the example above, or by a quantifier  with  a  minimum  of
3483       zero.
3484
3485       Back  references of this type cause the group that they reference to be
3486       treated as an atomic group. Once the whole group has  been  matched,  a
3487       subsequent  matching  failure cannot cause backtracking into the middle
3488       of the group.
3489

ASSERTIONS

3491       An assertion is a test on the characters  following  or  preceding  the
3492       current matching point that does not consume any characters. The simple
3493       assertions coded as \b, \B, \A, \G, \Z, \z, ^, and $ are  described  in
3494       the previous sections.
3495
3496       More  complicated  assertions  are  coded as subpatterns. There are two
3497       kinds: those that look ahead of the current  position  in  the  subject
3498       string,  and  those  that  look  behind  it. An assertion subpattern is
3499       matched in the normal way, except that it does not  cause  the  current
3500       matching position to be changed.
3501
3502       Assertion  subpatterns are not capturing subpatterns. If such an asser‐
3503       tion contains capturing subpatterns within it, these  are  counted  for
3504       the  purposes  of numbering the capturing subpatterns in the whole pat‐
3505       tern. However, substring capturing is done  only  for  positive  asser‐
3506       tions.  (Perl sometimes, but not always, performs capturing in negative
3507       assertions.)
3508
3509   Warning:
3510       If a positive assertion containing one or  more  capturing  subpatterns
3511       succeeds, but failure to match later in the pattern causes backtracking
3512       over this assertion, the captures within the assertion are  reset  only
3513       if no higher numbered captures are already set. This is, unfortunately,
3514       a fundamental limitation of the current implementation, and as PCRE1 is
3515       now in maintenance-only status, it is unlikely ever to change.
3516
3517
3518       For  compatibility  with  Perl,  assertion subpatterns can be repeated.
3519       Although it makes no sense to assert the same thing  many  times,  the
3520       side  effect  of  capturing  parentheses can occasionally be useful. In
3521       practice, there are only three cases:
3522
3523         * If the quantifier is {0}, the  assertion  is  never  obeyed  during
3524           matching.  However, it can contain internal capturing parenthesized
3525           groups that are called from elsewhere through the subroutine mecha‐
3526           nism.
3527
3528         * If the quantifier is {0,n}, where n > 0, it is treated as if it were
3529           {0,1}. At runtime, the remaining pattern match is  tried  with  and
3530           without the assertion; the order depends on the greediness  of  the
3531           quantifier.
3532
3533         * If the minimum repetition is > 0, the quantifier  is  ignored.  The
3534           assertion is obeyed only once when encountered during matching.
3535
3536       Lookahead Assertions
3537
3538       Lookahead assertions start with (?= for positive assertions and (?! for
3539       negative assertions. For example, the following matches a word followed
3540       by a semicolon, but does not include the semicolon in the match:
3541
3542       \w+(?=;)
3543
3544       The  following  matches any occurrence of "foo" that is not followed by
3545       "bar":
3546
3547       foo(?!bar)
3548
3549       Notice that the apparently similar pattern
3550
3551       (?!foo)bar
3552
3553       does not find an occurrence of "bar"  that  is  preceded  by  something
3554       other  than  "foo". It finds any occurrence of "bar" whatsoever, as the
3555       assertion (?!foo) is always true when the  next  three  characters  are
3556       "bar". A lookbehind assertion is needed to achieve the other effect.
3557
3558       If you want to force a matching failure at some point in a pattern, the
3559       most convenient way to do it is with (?!), as an  empty  string  always
3560       matches. So, an assertion that requires  there  not  to  be  an  empty
3561       string must always fail. The backtracking control verb (*FAIL) or  (*F)
3562       is a synonym for (?!).
3563
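       The lookahead examples above can be tried with re:run/3. This is only
       an illustrative sketch; the module name and the subject strings are
       assumptions, not part of the PCRE documentation.

```erlang
-module(re_lookahead_demo).
-export([demo/0]).

demo() ->
    [%% The semicolon is tested but not included in the match.
     re:run("hello;", "\\w+(?=;)", [{capture, first, list}]),
     %% Only the second "foo" is not followed by "bar"; index is {Start, Length}.
     re:run("foobar foobaz", "foo(?!bar)", [{capture, first, index}]),
     %% (?!foo) looks ahead, so "bar" preceded by "foo" still matches.
     re:run("foobar", "(?!foo)bar", [{capture, first, list}]),
     %% (?!) can never succeed.
     re:run("abc", "(?!)", [{capture, none}])].
```
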
3564       Lookbehind Assertions
3565
3566       Lookbehind  assertions start with (?<= for positive assertions and (?<!
3567       for negative assertions. For example, the following finds an occurrence
3568       of "bar" that is not preceded by "foo":
3569
3570       (?<!foo)bar
3571
3572       The contents of a lookbehind assertion are restricted such that all the
3573       strings it matches must have a fixed length. However, if there are many
3574       top-level  alternatives,  they  do  not all have to have the same fixed
3575       length. Thus, the following is permitted:
3576
3577       (?<=bullock|donkey)
3578
3579       The following causes an error at compile time:
3580
3581       (?<!dogs?|cats?)
3582
3583       Branches that match different length strings are permitted only at  the
3584       top-level of a lookbehind assertion. This is an extension compared with
3585       Perl, which requires all branches to match the same length  of  string.
3586       An assertion such as the following is not permitted, as its single top-
3587       level branch can match two different lengths:
3588
3589       (?<=ab(c|de))
3590
3591       However, it is acceptable to PCRE if rewritten  to  use  two  top-level
3592       branches:
3593
3594       (?<=abc|abde)
3595
3596       Sometimes  the  escape sequence \K (see above) can be used instead of a
3597       lookbehind assertion to get round the fixed-length restriction.
3598
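       A small sketch of the fixed-length rule, using re:compile/1 to see
       which of the patterns above are accepted. The helper compiles/1 and
       the module name are hypothetical. The error term returned for a
       rejected pattern is deliberately not inspected here.

```erlang
-module(re_lookbehind_demo).
-export([compiles/1, demo/0]).

%% true if re:compile/1 accepts the pattern, false if it reports an error.
compiles(Pattern) ->
    case re:compile(Pattern) of
        {ok, _MP} -> true;
        {error, _Reason} -> false
    end.

demo() ->
    Compiled = [compiles(P) || P <- ["(?<=bullock|donkey)",
                                     "(?<!dogs?|cats?)",
                                     "(?<=ab(c|de))",
                                     "(?<=abc|abde)"]],
    Runs = [re:run("foobar", "(?<!foo)bar", [{capture, none}]),
            re:run("bazbar", "(?<!foo)bar", [{capture, none}]),
            %% \K drops "foo" from the reported match without a lookbehind.
            re:run("foobar", "foo\\Kbar", [{capture, first, list}])],
    {Compiled, Runs}.
```
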
3599       The implementation of lookbehind assertions is, for  each  alternative,
3600       to  move  the current position back temporarily by the fixed length and
3601       then try to match. If there are insufficient characters before the cur‐
3602       rent position, the assertion fails.
3603
3604       In  a UTF mode, PCRE does not allow the \C escape (which matches a sin‐
3605       gle data unit even in a UTF mode) to appear in  lookbehind  assertions,
3606       as  it  makes  it impossible to calculate the length of the lookbehind.
3607       The \X and \R escapes, which can match different numbers of data units,
3608       are not permitted either.
3609
3610       "Subroutine" calls (see below), such as (?2) or (?&X), are permitted in
3611       lookbehinds, as long as the subpattern matches a  fixed-length  string.
3612       Recursion, however, is not supported.
3613
3614       Possessive  quantifiers can be used with lookbehind assertions to spec‐
3615       ify efficient matching of fixed-length strings at the  end  of  subject
3616       strings.  Consider  the following simple pattern when applied to a long
3617       string that does not match:
3618
3619       abcd$
3620
3621       As matching proceeds from left to right, PCRE looks for each "a" in the
3622       subject and then sees if what follows matches the remaining pattern. If
3623       the pattern is specified as
3624
3625       ^.*abcd$
3626
3627       the initial .* matches the entire string at first. However,  when  this
3628       fails  (as  there  is no following "a"), it backtracks to match all but
3629       the last character, then all but the last two characters,  and  so  on.
3630       Once  again  the search for "a" covers the entire string, from right to
3631       left, so we are no better off. However, if the pattern is written as
3632
3633       ^.*+(?<=abcd)
3634
3635       there can be no backtracking for the .*+ item; it can  match  only  the
3636       entire  string.  The subsequent lookbehind assertion does a single test
3637       on the last four characters. If it fails, the match fails  immediately.
3638       For  long  strings, this approach makes a significant difference to the
3639       processing time.
3640
3641       Using Multiple Assertions
3642
3643       Many assertions (of any sort) can occur in succession. For example, the
3644       following matches "foo" preceded by three digits that are not "999":
3645
3646       (?<=\d{3})(?<!999)foo
3647
3648       Notice that each of the assertions is applied independently at the same
3649       point in the subject string. First there is a check that  the  previous
3650       three  characters  are  all  digits, and then there is a check that the
3651       same three characters are not "999". This pattern does not match  "foo"
3652       preceded by six characters, the first three of which are digits and the last
3653       three of which are not "999". For example, it does not  match  "123abc‐
3654       foo". A pattern to do that is the following:
3655
3656       (?<=\d{3}...)(?<!999)foo
3657
3658       This  time  the  first assertion looks at the preceding six characters,
3659       checks that the first three are digits, and then the  second  assertion
3660       checks that the preceding three characters are not "999".
3661
3662       Assertions can be nested in any combination. For example, the following
3663       matches an occurrence of "baz" that is preceded by "bar", which in turn
3664       is not preceded by "foo":
3665
3666       (?<=(?<!foo)bar)baz
3667
3668       The  following  pattern  matches "foo" preceded by three digits and any
3669       three characters that are not "999":
3670
3671       (?<=\d{3}(?!999)...)foo
3672
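       The four patterns in this section can be compared side by side. The
       following is a sketch only; the module name, the helper matches/2,
       and the subject strings other than "123abcfoo" are assumptions made
       for illustration.

```erlang
-module(re_multi_assert_demo).
-export([demo/0]).

%% true if Pattern matches somewhere in Subject.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.

demo() ->
    Adjacent = "(?<=\\d{3})(?<!999)foo",
    SixBack  = "(?<=\\d{3}...)(?<!999)foo",
    Nested   = "(?<=(?<!foo)bar)baz",
    Mixed    = "(?<=\\d{3}(?!999)...)foo",
    [{adjacent, matches("123foo", Adjacent),
                matches("999foo", Adjacent),
                matches("123abcfoo", Adjacent)},
     {six_back, matches("123abcfoo", SixBack),
                matches("123999foo", SixBack)},
     {nested,   matches("barbaz", Nested),
                matches("foobarbaz", Nested)},
     {mixed,    matches("123abcfoo", Mixed),
                matches("123999foo", Mixed)}].
```
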

CONDITIONAL SUBPATTERNS

3674       It is possible to cause the matching process to obey a subpattern  con‐
3675       ditionally  or to choose between two alternative subpatterns, depending
3676       on the result of an assertion, or whether a specific capturing  subpat‐
3677       tern has already been matched. The following are the two possible forms
3678       of conditional subpattern:
3679
3680       (?(condition)yes-pattern)
3681       (?(condition)yes-pattern|no-pattern)
3682
3683       If the condition is satisfied, the yes-pattern is used,  otherwise  the
3684       no-pattern  (if  present).  If  more than two alternatives exist in the
3685       subpattern, a compile-time error occurs. Each of the  two  alternatives
3686       can  itself  contain  nested  subpatterns of any form, including condi‐
3687       tional subpatterns; the restriction to two alternatives applies only at
3688       the  level of the condition. The following pattern fragment is an exam‐
3689       ple where the alternatives are complex:
3690
3691       (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
3692
3693       There are four kinds of condition: references  to  subpatterns,  refer‐
3694       ences to recursion, a pseudo-condition called DEFINE, and assertions.
3695
3696       Checking for a Used Subpattern By Number
3697
3698       If  the  text between the parentheses consists of a sequence of digits,
3699       the condition is true if a capturing subpattern of that number has pre‐
3700       viously  matched.  If  more than one capturing subpattern with the same
3701       number exists (see section  Duplicate Subpattern Numbers earlier),  the
3702       condition  is true if any of them have matched. An alternative notation
3703       is to precede the digits with a plus or minus sign. In this  case,  the
3704       subpattern  number  is relative rather than absolute. The most recently
3705       opened parentheses can be referenced by (?(-1), the next most recent by
3706       (?(-2),  and  so  on.  Inside loops, it can also make sense to refer to
3707       subsequent groups. The next parentheses to be opened can be  referenced
3708       as  (?(+1),  and  so  on.  (The value zero in any of these forms is not
3709       used; it provokes a compile-time error.)
3710
3711       Consider the following pattern, which contains  non-significant  white‐
3712       space  to  make it more readable (assume option extended) and to divide
3713       it into three parts for ease of discussion:
3714
3715       ( \( )?    [^()]+    (?(1) \) )
3716
3717       The first part matches an optional opening  parenthesis,  and  if  that
3718       character is present, sets it as the first captured substring. The sec‐
3719       ond part matches one or more characters that are not  parentheses.  The
3720       third part is a conditional subpattern that tests whether the first set
3721       of parentheses matched or not. If they did, that is, if the subject started
3722       with an opening parenthesis, the condition is true, and so the yes-pat‐
3723       tern is executed and a closing parenthesis is required.  Otherwise,  as
3724       no-pattern  is  not  present,  the subpattern matches nothing. That is,
3725       this pattern matches a sequence of non-parentheses, optionally enclosed
3726       in parentheses.
3727
3728       If  this  pattern is embedded in a larger one, a relative reference can
3729       be used:
3730
       ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...

3731       This makes the fragment independent of the parentheses  in  the  larger
3732       pattern.
3733
3734       Checking for a Used Subpattern By Name
3735
3736       Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
3737       used subpattern by name. For compatibility  with  earlier  versions  of
3738       PCRE,  which  had this facility before Perl, the syntax (?(name)...) is
3739       also recognized.
3740
3741       Rewriting the previous example to use a named subpattern gives:
3742
3743       (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
3744
3745       If the name used in a condition of this kind is a duplicate,  the  test
3746       is  applied to all subpatterns of the same name, and is true if any one
3747       of them has matched.
3748
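       A sketch of the numbered and named forms of this condition. The
       anchors ^ and $ are added here (an assumption, not part of the
       patterns above) so that an unbalanced parenthesis makes the whole
       match fail instead of being skipped by the search. The module and
       function names are hypothetical.

```erlang
-module(re_conditional_demo).
-export([by_number/1, by_name/1]).

%% Condition (?(1)...) tests whether group 1 took part in the match.
by_number(Subject) ->
    re:run(Subject, "^ ( \\( )? [^()]+ (?(1) \\) ) $",
           [extended, {capture, none}]).

%% Same test, referring to the group by name.
by_name(Subject) ->
    re:run(Subject, "^ (?<OPEN> \\( )? [^()]+ (?(<OPEN>) \\) ) $",
           [extended, {capture, none}]).
```

       Both functions are expected to return match for "abc" and "(abc)",
       and nomatch for "(abc" and "abc)".
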
3749       Checking for Pattern Recursion
3750
3751       If the condition is the string (R), and there is no subpattern with the
3752       name  R, the condition is true if a recursive call to the whole pattern
3753       or any subpattern has been made. If digits or a name preceded by amper‐
3754       sand follow the letter R, for example:
3755
3756       (?(R3)...) or (?(R&name)...)
3757
3758       the condition is true if the most recent recursion is into a subpattern
3759       whose number or name is given. This condition does not check the entire
3760       recursion  stack. If the name used in a condition of this kind is a du‐
3761       plicate, the test is applied to all subpatterns of the same  name,  and
3762       is true if any one of them is the most recent recursion.
3763
3764       At "top-level", all these recursion test conditions are false. The syn‐
3765       tax for recursive patterns is described below.
3766
3767       Defining Subpatterns for Use By Reference Only
3768
3769       If the condition is the string (DEFINE), and  there  is  no  subpattern
3770       with  the  name  DEFINE,  the  condition is always false. In this case,
3771       there can be only one alternative  in  the  subpattern.  It  is  always
3772       skipped  if  control reaches this point in the pattern. The idea of DE‐
3773       FINE is that it can be used to define "subroutines" that can be  refer‐
3774       enced  from elsewhere. (The use of subroutines is described below.) For
3775       example, a pattern to match an IPv4 address, such as  "192.168.23.245",
3776       can be written like this (ignore whitespace and line breaks):
3777
3778       (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) \b (?&byte) (\.(?&byte)){3} \b
3779
3780       The first part of the pattern is a DEFINE group  inside  which  another
3781       group named "byte" is defined. This matches  an  individual  component
3782       of  an  IPv4 address (a number < 256). When matching takes place,
3783       this part of the pattern is skipped, as DEFINE acts like a false condi‐
3784       tion. The remaining pattern uses references to the named group to match
3785       the four dot-separated components of an IPv4 address,  insisting  on  a
3786       word boundary at each end.
3787
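       The IPv4 pattern can be used from Erlang as sketched below. Option
       extended makes the whitespace in the pattern insignificant. The
       module name, function name, and subject strings are assumptions
       made for illustration.

```erlang
-module(re_define_demo).
-export([find_ipv4/1]).

%% Returns {match, [Address]} for the first IPv4-like substring, or nomatch.
find_ipv4(Subject) ->
    %% Adjacent string literals are concatenated by the compiler.
    Pattern = "(?(DEFINE) (?<byte> 2[0-4]\\d | 25[0-5] | 1\\d\\d | [1-9]?\\d) )"
              " \\b (?&byte) (\\.(?&byte)){3} \\b",
    re:run(Subject, Pattern, [extended, {capture, first, list}]).
```

       Notice that "\\b" must be written with two backslashes: in an Erlang
       string literal, "\b" is the backspace character.
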
3788       Assertion Conditions
3789
3790       If  the condition is not in any of the above formats, it must be an as‐
3791       sertion. This can be a positive or negative lookahead or lookbehind as‐
3792       sertion.  Consider  the  following  pattern, containing non-significant
3793       whitespace, and with the two alternatives on the second line:
3794
3795       (?(?=[^a-z]*[a-z])
3796       \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
3797
3798       The condition is a positive lookahead assertion  that  matches  an  op‐
3799       tional  sequence of non-letters followed by a letter. That is, it tests
3800       for the presence of at least one letter in the subject. If a letter  is
3801       found,  the subject is matched against the first alternative, otherwise
3802       it is matched against the second. This pattern matches strings  in  one
3803       of  the  two  forms dd-aaa-dd or dd-dd-dd, where aaa are letters and dd
3804       are digits.
3805
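       A sketch of the same condition in Erlang. The anchors ^ and $ are an
       addition for this illustration, and the module and function names
       are hypothetical. Once the lookahead has found a letter, only the
       first alternative is tried.

```erlang
-module(re_assert_condition_demo).
-export([date_like/1]).

date_like(Subject) ->
    Pattern = "^ (?(?=[^a-z]*[a-z])"
              " \\d{2}-[a-z]{3}-\\d{2} | \\d{2}-\\d{2}-\\d{2} ) $",
    re:run(Subject, Pattern, [extended, {capture, none}]).
```
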

COMMENTS

3807       There are two ways to include comments in patterns that  are  processed
3808       by PCRE. In both cases, the start of the comment must not be in a char‐
3809       acter class, or in the middle of any other sequence of related  charac‐
3810       ters  such  as  (?: or a subpattern name or number. The characters that
3811       make up a comment play no part in the pattern matching.
3812
3813       The sequence (?# marks the start of a comment that continues up to  the
3814       next  closing  parenthesis.  Nested  parentheses  are not permitted. If
3815       option extended is set, an unescaped # character  also  introduces  a
3816       comment,  which  in  this  case continues to immediately after the next
3817       newline character or character sequence in the pattern.  Which  charac‐
3818       ters are interpreted as newlines is controlled by the options passed to
3819       a compiling function or by a special sequence at the start of the  pat‐
3820       tern, as described in section  Newline Conventions earlier.
3821
3822       Notice  that  the  end of this type of comment is a literal newline se‐
3823       quence in the pattern; escape sequences that happen to represent a new‐
3824       line do not count. For example, consider the following pattern when ex‐
3825       tended is set, and the default newline convention is in force:
3826
3827       abc #comment \n still comment
3828
3829       On encountering character #, pcre_compile() skips along, looking for  a
3830       newline in the pattern. The sequence \n is still literal at this stage,
3831       so it does not terminate the comment. Only a character with code  value
3832       0x0a (the default newline) does so.
3833
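       In Erlang source the distinction is visible in the string literal:
       "\\n" passes a backslash followed by "n" to the pattern compiler,
       while "\n" passes a real linefeed. The sketch below (hypothetical
       module name, illustrative subjects) assumes the default newline
       convention.

```erlang
-module(re_comment_demo).
-export([demo/0]).

demo() ->
    Opts = [extended, {capture, none}],
    [%% Backslash + n stays inside the comment: effective pattern is abc.
     re:run("abc", "abc #comment \\n still comment", Opts),
     %% A real linefeed ends the comment: effective pattern is abcdef.
     re:run("abc", "abc #comment \n def", Opts),
     re:run("abcdef", "abc #comment \n def", Opts),
     %% The (?# ...) form needs no option.
     re:run("abc", "ab(?#inline note)c", [{capture, none}])].
```
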

RECURSIVE PATTERNS

3835       Consider  the problem of matching a string in parentheses, allowing for
3836       unlimited nested parentheses. Without the use of  recursion,  the  best
3837       that  can  be  done  is  to use a pattern that matches up to some fixed
3838       depth of nesting. It is not possible to  handle  an  arbitrary  nesting
3839       depth.
3840
3841       For some time, Perl has provided a facility that allows regular expres‐
3842       sions to recurse (among other things). It does  this  by  interpolating
3843       Perl  code  in the expression at runtime, and the code can refer to the
3844       expression itself. A Perl pattern using code interpolation to solve the
3845       parentheses problem can be created like this:
3846
3847       $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
3848
3849       Item  (?p{...})  interpolates  Perl  code  at runtime, and in this case
3850       refers recursively to the pattern in which it appears.
3851
3852       Obviously, PCRE cannot support the interpolation of Perl code. Instead,
3853       it supports special syntax for recursion of the entire pattern, and for
3854       individual subpattern recursion. After its  introduction  in  PCRE  and
3855       Python,  this  kind  of recursion was later introduced into Perl at re‐
3856       lease 5.10.
3857
3858       A special item that consists of (? followed by a number > 0 and a clos‐
3859       ing parenthesis is a recursive subroutine call of the subpattern of the
3860       given number, if it occurs inside that subpattern. (If  not,  it  is  a
3861       non-recursive subroutine call, which is described in the next section.)
3862       The special item (?R) or (?0) is a recursive call of the entire regular
3863       expression.
3864
3865       This  PCRE  pattern  solves the nested parentheses problem (assume that
3866       option extended is set so that whitespace is ignored):
3867
3868       \( ( [^()]++ | (?R) )* \)
3869
3870       First it matches an opening parenthesis. Then it matches any number  of
3871       substrings,  which can either be a sequence of non-parentheses or a re‐
3872       cursive match of the pattern itself (that is, a correctly parenthesized
3873       substring). Finally there is a closing parenthesis. Notice the use of a
3874       possessive quantifier to avoid  backtracking  into  sequences  of  non-
3875       parentheses.
3876
3877       If this was part of a larger pattern, you would not want to recurse the
3878       entire pattern, so instead you can use:
3879
3880       ( \( ( [^()]++ | (?1) )* \) )
3881
3882       The pattern is here within parentheses so that the recursion refers  to
3883       them instead of the whole pattern.
3884
3885       In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
3886       tricky. This is made easier by the use of relative references.  Instead
3887       of  (?1) in the pattern above, you can write (?-2) to refer to the sec‐
3888       ond most recently opened parentheses preceding the recursion. That  is,
3889       a negative number counts capturing parentheses leftwards from the point
3890       at which it is encountered.
3891
3892       It is also possible to refer to later opened  parentheses,  by  writing
3893       references  such  as  (?+2). However, these cannot be recursive, as the
3894       reference is not inside the parentheses that are referenced.  They  are
3895       always  non-recursive  subroutine  calls, as described in the next sec‐
3896       tion.
3897
3898       An alternative approach is to use named parentheses instead.  The  Perl
3899       syntax  for this is (?&name). The earlier PCRE syntax (?P>name) is also
3900       supported. We can rewrite the above example as follows:
3901
3902       (?<pn> \( ( [^()]++ | (?&pn) )* \) )
3903
3904       If there is more than one subpattern with the same name,  the  earliest
3905       one is used.
3906
3907       This  particular  example  pattern that we have studied contains nested
3908       unlimited repeats, and so the use of a possessive quantifier for match‐
3909       ing  strings  of non-parentheses is important when applying the pattern
3910       to strings that do not match. For example, when this pattern is applied
3911       to
3912
3913       (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
3914
3915       it gives "no match" quickly. However, if a possessive quantifier is not
3916       used, the match runs for a long time, as there are  so  many  different
3917       ways  the  +  and  *  repeats can carve up the subject, and all must be
3918       tested before failure can be reported.
3919
3920       At the end of a match, the values of capturing  parentheses  are  those
3921       from the outermost level. If the pattern above is matched against
3922
3923       (ab(cd)ef)
3924
3925       the  value  for  the  inner capturing parentheses (numbered 2) is "ef",
3926       which is the last value taken on at the top-level. If a capturing  sub‐
3927       pattern  is  not  matched at the top level, its final captured value is
3928       unset, even if it was (temporarily) set at a deeper  level  during  the
3929       matching process.
3930
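       The subpattern-recursion form and the captured values described
       above can be observed as follows. This is a sketch: the anchors ^
       and $ are added so that only a fully balanced subject matches, and
       the module and function names are hypothetical.

```erlang
-module(re_recursion_demo).
-export([balanced/1]).

%% Group 1 is the whole parenthesized string; (?1) recurses into it.
%% The possessive ++ avoids backtracking into runs of non-parentheses.
balanced(Subject) ->
    re:run(Subject, "^ ( \\( ( [^()]++ | (?1) )* \\) ) $",
           [extended, {capture, all, list}]).
```

       For "(ab(cd)ef)" the result lists the whole match, group 1, and then
       "ef" for group 2, the last value it took at the top level.
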
3931       Do not confuse item (?R) with condition (R), which tests for recursion.
3932       Consider the following pattern, which matches text in  angle  brackets,
3933       allowing  for  arbitrary  nesting.  Only  digits  are allowed in nested
3934       brackets (that is, when recursing), while any characters are  permitted
3935       at the outer level.
3936
3937       < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
3938
3939       Here (?(R) is the start of a conditional subpattern, with two different
3940       alternatives for the recursive and non-recursive cases.  Item  (?R)  is
3941       the actual recursive call.
3942
3943       Differences in Recursion Processing between PCRE and Perl
3944
3945       Recursion  processing  in PCRE differs from Perl in two important ways.
3946       In PCRE (like Python, but unlike Perl), a recursive subpattern call  is
3947       always treated as an atomic group. That is, once it has matched some of
3948       the subject string, it is never re-entered, even if it contains untried
3949       alternatives  and  there  is a subsequent matching failure. This can be
3950       illustrated by the following pattern, which means  to  match  a  palin‐
3951       dromic string containing an odd number of characters (for example, "a",
3952       "aba", "abcba", "abcdcba"):
3953
3954       ^(.|(.)(?1)\2)$
3955
3956       The idea is that it either matches a single character, or two identical
3957       characters surrounding a subpalindrome. In Perl, this pattern works; in
3958       PCRE it does not work if the subject is longer than  three  characters.
3959       Consider the subject string "abcba".
3960
3961       At  the  top level, the first character is matched, but as it is not at
3962       the end of the string, the first alternative fails, the second alterna‐
3963       tive  is  taken, and the recursion kicks in. The recursive call to sub‐
3964       pattern 1 successfully matches the next character ("b").  (Notice  that
3965       the beginning and end of line tests are not part of the recursion.)
3966
3967       Back  at  the top level, the next character ("c") is compared with what
3968       subpattern 2 matched, which was "a". This fails. As  the  recursion  is
3969       treated  as  an atomic group, there are now no backtracking points, and
3970       so the entire match fails. (Perl can now re-enter the recursion and try
3971       the  second  alternative.)  However, if the pattern is written with the
3972       alternatives in the other order, things are different:
3973
3974       ^((.)(?1)\2|.)$
3975
3976       This time, the recursing alternative is tried first, and  continues  to
3977       recurse  until  it runs out of characters, at which point the recursion
3978       fails. But this time we have another alternative to try at  the  higher
3979       level. That is the significant difference: in the previous case the re‐
3980       maining alternative is at a deeper recursion level, which  PCRE  cannot
3981       use.
3982
3983       To  change  the pattern so that it matches all palindromic strings, not
3984       only those with an odd number of characters, it is tempting  to  change
3985       the pattern to this:
3986
3987       ^((.)(?1)\2|.?)$
3988
3989       Again,  this  works  in Perl, but not in PCRE, and for the same reason.
3990       When a deeper recursion has matched a single character,  it  cannot  be
3991       entered again to match an empty string. The solution is to separate the
3992       two cases, and write out the odd and even cases as alternatives at  the
3993       higher level:
3994
3995       ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
3996
3997       If  you want to match typical palindromic phrases, the pattern must ig‐
3998       nore all non-word characters, which can be done as follows:
3999
4000       ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
4001
4002       If run with option caseless, this pattern matches phrases  such  as  "A
4003       man, a plan, a canal: Panama!" and it works well in both PCRE and Perl.
4004       Notice the use of the possessive quantifier *+  to  avoid  backtracking
4005       into  sequences  of  non-word characters. Without this, PCRE takes much
4006       longer (10 times or more) to match typical phrases, and Perl  takes  so
4007       long that you think it has gone into a loop.
4008
4009   Note:
4010       The  palindrome-matching patterns above work only if the subject string
4011       does not start with a  palindrome  that  is  shorter  than  the  entire
4012       string. For example, although "abcba" is correctly matched, if the sub‐
4013       ject is "ababa", PCRE finds palindrome "aba" at  the  start,  and  then
4014       fails  at  top  level,  as  the end of the string does not follow. Once
4015       again, it cannot jump back into the recursion  to  try  other  alterna‐
4016       tives, so the entire match fails.
4017
4018
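       The effect of the alternative order, and the limitation described in
       the note, can be checked with the two odd-length patterns. This is
       an illustrative sketch with a hypothetical module name.

```erlang
-module(re_palindrome_demo).
-export([demo/0]).

demo() ->
    Opts = [{capture, none}],
    SingleFirst    = "^(.|(.)(?1)\\2)$",   %% single character tried first
    RecursiveFirst = "^((.)(?1)\\2|.)$",   %% recursion tried first
    [{Subject,
      re:run(Subject, SingleFirst, Opts),
      re:run(Subject, RecursiveFirst, Opts)}
     || Subject <- ["aba", "abcba", "ababa"]].
```

       Only the second pattern is expected to match "abcba", and neither
       matches "ababa", which starts with the shorter palindrome "aba".
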
4019       The second way in which PCRE and  Perl  differ  in  their  recursion
4020       processing is in the handling of captured values. In Perl,  when  a
4021       subpattern is called recursively or as a subroutine (see the next section),
4022       it has no access to any values that were captured  outside  the  recur‐
4023       sion.  In  PCRE  these values can be referenced. Consider the following
4024       pattern:
4025
4026       ^(.)(\1|a(?2))
4027
4028       In PCRE, it matches "bab". The first capturing parentheses  match  "b",
4029       then  in  the  second  group, when the back reference \1 fails to match
4030       "b", the second alternative matches "a", and then recurses. In the  re‐
4031       cursion,  \1  does  now  match  "b" and so the whole match succeeds. In
4032       Perl, the pattern fails to match because inside the recursive  call  \1
4033       cannot access the externally set value.
4034
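       As this module follows the PCRE behavior, the pattern above matches
       "bab" when run from Erlang. A minimal sketch, with a hypothetical
       module name:

```erlang
-module(re_recursion_capture_demo).
-export([demo/0]).

%% Inside the recursive call (?2), \1 still sees the "b" captured outside.
demo() ->
    re:run("bab", "^(.)(\\1|a(?2))", [{capture, all, list}]).
```
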

SUBPATTERNS AS SUBROUTINES

4036       If  the  syntax for a recursive subpattern call (either by number or by
4037       name) is used outside the parentheses to which it refers,  it  operates
4038       like  a subroutine in a programming language. The called subpattern can
4039       be defined before or after the reference. A numbered reference  can  be
4040       absolute or relative, as in the following examples:
4041
4042       (...(absolute)...)...(?2)...
4043       (...(relative)...)...(?-1)...
4044       (...(?+1)...(relative)...
4045
4046       An  earlier  example  pointed  out  that  the following pattern matches
4047       "sense and sensibility" and  "response  and  responsibility",  but  not
4048       "sense and responsibility":
4049
4050       (sens|respons)e and \1ibility
4051
4052       If instead the following pattern is used, it matches "sense and respon‐
4053       sibility" and the other two strings:
4054
4055       (sens|respons)e and (?1)ibility
4056
4057       Another example is provided in the discussion of DEFINE earlier.
4058
4059       All subroutine calls, recursive or not, are always  treated  as  atomic
4060       groups.  That  is,  once  a  subroutine has matched some of the subject
4061       string, it is never re-entered, even if it  contains  untried  alterna‐
4062       tives  and there is a subsequent matching failure. Any capturing paren‐
4063       theses that are set during the subroutine call revert to their previous
4064       values afterwards.
4065
4066       Processing  options  such as case-independence are fixed when a subpat‐
4067       tern is defined, so if it is used as a subroutine, such options  cannot
4068       be  changed  for  different  calls.  For example, the following pattern
4069       matches "abcabc" but not "abcABC", as the change of  processing  option
4070       does not affect the called subpattern:
4071
4072       (abc)(?i:(?-1))
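
       As an illustrative sketch (the variable names and the third, inline
       pattern are assumptions added for comparison), the fixed options can be
       observed with run/3. Option {capture, none} makes a successful call
       return the single atom match:

```erlang
%% The called group keeps the case-sensitive setting it was defined with.
CaseSame = re:run("abcabc", "(abc)(?i:(?-1))", [{capture, none}]).
CaseMixed = re:run("abcABC", "(abc)(?i:(?-1))", [{capture, none}]).
%% Text written inline, instead of a subroutine call, does honour (?i:...).
CaseInline = re:run("abcABC", "(abc)(?i:abc)", [{capture, none}]).
```

       CaseSame and CaseInline are expected to be match, and CaseMixed to be
       nomatch.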
4073

ONIGURUMA SUBROUTINE SYNTAX

       For compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number enclosed either in angle brackets or single quotes is
       an alternative syntax for referencing a subpattern as a subroutine,
       possibly recursively. Here follow two of the examples used above,
       rewritten using this syntax:
4080
4081       (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
4082       (sens|respons)e and \g'1'ibility
4083
4084       PCRE  supports  an extension to Oniguruma: if a number is preceded by a
4085       plus or minus sign, it is taken as a relative reference, for example:
4086
4087       (abc)(?i:\g<-1>)
4088
4089       Notice that \g{...} (Perl syntax) and \g<...>  (Oniguruma  syntax)  are
4090       not synonymous. The former is a back reference; the latter is a subrou‐
4091       tine call.
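
       The following sketch, with variable names and the subject "(ab(cd)ef)"
       chosen only for illustration, contrasts the two forms from Erlang. The
       backslashes are doubled for the string literals, and option extended is
       assumed for the recursive pattern because it is written with
       whitespace:

```erlang
%% \g'1' calls group 1 as a subroutine, so "respons" is accepted.
OnigCall = re:run("sense and responsibility",
                  "(sens|respons)e and \\g'1'ibility", [{capture, none}]).
%% \g{1} is a back reference to the text captured by group 1 ("sens").
OnigBack = re:run("sense and responsibility",
                  "(sens|respons)e and \\g{1}ibility", [{capture, none}]).
%% \g<pn> calls the named group recursively to match nested parentheses.
OnigParens = re:run("(ab(cd)ef)",
                    "(?<pn> \\( ( (?>[^()]+) | \\g<pn> )* \\) )",
                    [extended, {capture, first, list}]).
```

       OnigCall is expected to be match, OnigBack nomatch, and OnigParens
       {match,["(ab(cd)ef)"]}.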
4092

BACKTRACKING CONTROL

4094       Perl 5.10 introduced some "Special Backtracking Control  Verbs",  which
4095       are still described in the Perl documentation as "experimental and sub‐
4096       ject to change or removal in a future version of Perl". It goes  on  to
4097       say:  "Their usage in production code should be noted to avoid problems
4098       during upgrades." The same remarks apply to the PCRE features described
4099       in this section.
4100
4101       The  new verbs make use of what was previously invalid syntax: an open‐
4102       ing parenthesis followed by an asterisk. They are generally of the form
4103       (*VERB)  or  (*VERB:NAME). Some can take either form, possibly behaving
4104       differently depending on whether a name is present. A name is  any  se‐
4105       quence  of  characters that does not include a closing parenthesis. The
4106       maximum name length is 255 in the 8-bit library and 65535 in the 16-bit
4107       and  32-bit  libraries.  If  the name is empty, that is, if the closing
4108       parenthesis immediately follows the colon, the  effect  is  as  if  the
4109       colon was not there. Any number of these verbs can occur in a pattern.
4110
4111       The behavior of these verbs in repeated groups, assertions, and in sub‐
4112       patterns called as subroutines (whether  or  not  recursively)  is  de‐
4113       scribed below.
4114
4115       Optimizations That Affect Backtracking Verbs
4116
       PCRE contains some optimizations that are used to speed up matching by
       running some checks at the start of each match attempt. For example, it
       can know the minimum length of a matching subject, or that a particular
       character must be present. When one of these optimizations bypasses the
       running of a match, any included backtracking verbs are not processed.
       You can suppress the start-of-match optimizations by setting option
       no_start_optimize when calling compile/2 or run/3, or by starting the
       pattern with (*NO_START_OPT).
4125
4126       Experiments with Perl suggest that it too  has  similar  optimizations,
4127       sometimes leading to anomalous results.
4128
4129       Verbs That Act Immediately
4130
4131       The  following verbs act as soon as they are encountered. They must not
4132       be followed by a name.
4133
4134       (*ACCEPT)
4135
4136       This verb causes the match to end successfully, skipping the  remainder
4137       of  the pattern. However, when it is inside a subpattern that is called
4138       as a subroutine, only that subpattern is ended  successfully.  Matching
4139       then continues at the outer level. If (*ACCEPT) is triggered in a posi‐
4140       tive assertion, the assertion succeeds; in a  negative  assertion,  the
4141       assertion fails.
4142
4143       If  (*ACCEPT)  is inside capturing parentheses, the data so far is cap‐
4144       tured. For example, the following matches "AB", "AAD", or  "ACD".  When
4145       it matches "AB", "B" is captured by the outer parentheses.
4146
4147       A((?:A|B(*ACCEPT)|C)D)
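
       A minimal sketch of this behavior from Erlang, with illustrative
       variable names; option {capture, all, list} returns the whole match
       followed by the text captured by the outer parentheses:

```erlang
%% (*ACCEPT) ends the match after "B"; group 1 holds what it matched so far.
Accepted = re:run("AB", "A((?:A|B(*ACCEPT)|C)D)", [{capture, all, list}]).
%% The other branches never reach (*ACCEPT) and still need the trailing "D".
AcceptNormal = re:run("ACD", "A((?:A|B(*ACCEPT)|C)D)", [{capture, all, list}]).
```

       Accepted is expected to be {match,["AB","B"]} and AcceptNormal to be
       {match,["ACD","CD"]}.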
4148
4149       The  following  verb causes a matching failure, forcing backtracking to
4150       occur. It is equivalent to (?!) but easier to read.
4151
4152       (*FAIL) or (*F)
4153
4154       The Perl documentation states that it is probably useful only when com‐
4155       bined  with  (?{})  or  (??{}).  Those  are  Perl features that are not
4156       present in PCRE.
4157
       The nearest equivalent in PCRE is the callout feature, as in the
       pattern a+(?C)(*FAIL). A match with the string "aaaa" always fails, but
       the callout is taken before each backtrack occurs (in this example, 10
       times).
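
       The effect of (*FAIL) on its own can be sketched from Erlang as
       follows. The patterns and variable names are illustrative assumptions
       and do not use callouts:

```erlang
%% (*FAIL) forces a backtrack, so every way of matching a+ is rejected.
FailNever = re:run("aaaa", "a+(*FAIL)").
%% (?!) is the equivalent, less readable spelling.
FailSame = re:run("aaaa", "a+(?!)").
%% Failing one branch still lets another alternative match.
FailOther = re:run("ab", "a(*FAIL)|b", [{capture, first, list}]).
```

       FailNever and FailSame are expected to be nomatch, and FailOther to be
       {match,["b"]}.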
4160
4161       Recording Which Path Was Taken
4162
       The main purpose of the (*MARK) verb is to track how a match was
       arrived at, although it also has a secondary use in conjunction with
       advancing the match starting point (see (*SKIP) below).
4166
4167   Note:
4168       In  Erlang,  there  is no interface to retrieve a mark with run/2,3, so
4169       only the secondary purpose is relevant to the Erlang programmer.
4170
4171       The rest of this section is  therefore  deliberately  not  adapted  for
4172       reading  by  the Erlang programmer, but the examples can help in under‐
4173       standing NAMES as they can be used by (*SKIP).
4174
4175
4176       (*MARK:NAME) or (*:NAME)
4177
4178       A name is always required with this verb. There  can  be  as  many  in‐
4179       stances  of  (*MARK)  as  you like in a pattern, and their names do not
4180       have to be unique.
4181
4182       When a match succeeds, the name of the last  encountered  (*MARK:NAME),
4183       (*PRUNE:NAME),  or  (*THEN:NAME) on the matching path is passed back to
4184       the caller as described in section "Extra data for pcre_exec()" in  the
4185       pcreapi documentation. In the following example of pcretest output, the
4186       /K modifier requests the retrieval and outputting of (*MARK) data:
4187
4188         re> /X(*MARK:A)Y|X(*MARK:B)Z/K
4189       data> XY
4190        0: XY
4191       MK: A
4192       XZ
4193        0: XZ
4194       MK: B
4195
4196       The (*MARK) name is tagged with "MK:" in this output, and in this exam‐
4197       ple  it indicates which of the two alternatives matched. This is a more
4198       efficient way of obtaining this information than putting each  alterna‐
4199       tive in its own capturing parentheses.
4200
4201       If  a  verb  with a name is encountered in a positive assertion that is
4202       true, the name is recorded and passed back if it is  the  last  encoun‐
4203       tered.  This does not occur for negative assertions or failing positive
4204       assertions.
4205
4206       After a partial match or a failed match, the last encountered  name  in
4207       the entire match process is returned, for example:
4208
4209         re> /X(*MARK:A)Y|X(*MARK:B)Z/K
4210       data> XP
4211       No match, mark = B
4212
       Notice that in this unanchored example, the mark is retained from the
       match attempt that started at letter "X" in the subject. Subsequent
       match attempts starting at "P" and then with an empty string do not get
       as far as the (*MARK) item, but nevertheless do not reset it.
4217
4218       Verbs That Act after Backtracking
4219
4220       The following verbs do nothing when they are encountered. Matching con‐
4221       tinues  with what follows, but if there is no subsequent match, causing
4222       a backtrack to the verb, a failure is  forced.  That  is,  backtracking
4223       cannot  pass  to the left of the verb. However, when one of these verbs
4224       appears inside an atomic group or an assertion that is true, its effect
4225       is confined to that group, as once the group has been matched, there is
4226       never any backtracking into it. In  this  situation,  backtracking  can
4227       "jump  back"  to the left of the entire atomic group or assertion. (Re‐
4228       member also, as stated above, that this localization  also  applies  in
4229       subroutine calls.)
4230
4231       These  verbs  differ  in exactly what kind of failure occurs when back‐
4232       tracking reaches them. The behavior described below is what occurs when
4233       the  verb  is  not in a subroutine or an assertion. Subsequent sections
4234       cover these special cases.
4235
4236       The following verb, which must not be followed by a  name,  causes  the
4237       whole  match to fail outright if there is a later matching failure that
4238       causes backtracking to reach it. Even if the pattern is unanchored,  no
4239       further  attempts  to find a match by advancing the starting point take
4240       place.
4241
4242       (*COMMIT)
4243
       If (*COMMIT) is the only backtracking verb that is encountered, once it
       has been passed, run/2,3 is committed to finding a match at the current
       starting point, or not at all, for example:
4247
4248       a+(*COMMIT)b
4249
4250       This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
4251       of dynamic anchor, or "I've started, so I must finish". The name of the
4252       most recently passed (*MARK) in the path is passed back when  (*COMMIT)
4253       forces a match failure.
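
       The two subjects above can be tried from Erlang as in the following
       sketch. The variable names and the plain a+b comparison are added for
       illustration:

```erlang
%% The first attempt that gets past (*COMMIT) starts at an "a" and succeeds.
CommitFound = re:run("xxaab", "a+(*COMMIT)b", [{capture, first, list}]).
%% The attempt at offset 0 passes (*COMMIT) and then fails on "c", so no
%% later starting point is tried, although "aab" occurs further on.
CommitLost = re:run("aacaab", "a+(*COMMIT)b").
%% Without the verb, the later occurrence is found.
CommitPlain = re:run("aacaab", "a+b", [{capture, first, list}]).
```

       CommitFound and CommitPlain are expected to be {match,["aab"]}, and
       CommitLost to be nomatch.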
4254
4255       If more than one backtracking verb exists in a pattern, a different one
4256       that follows (*COMMIT) can be triggered first, so merely passing (*COM‐
4257       MIT)  during  a match does not always guarantee that a match must be at
4258       this starting point.
4259
4260       Notice that (*COMMIT) at the start of a pattern is not the same  as  an
4261       anchor, unless the PCRE start-of-match optimizations are turned off, as
4262       shown in the following example:
4263
4264       1> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list}]).
4265       {match,["abc"]}
4266       2> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list},no_start_optimize]).
4267       nomatch
4268
       For this pattern, PCRE knows that any match must start with "a", so in
       the first call the optimization skips along the subject to "a" before
       applying the pattern. The match attempt then succeeds. In the second
       call, option no_start_optimize disables the optimization that skips
       along to the first character. The pattern is now applied starting at
       "x", and so the (*COMMIT) causes the match to fail without trying any
       other starting points.
4276
4277       The following verb causes the match to fail at the current starting po‐
4278       sition  in the subject if there is a later matching failure that causes
4279       backtracking to reach it:
4280
4281       (*PRUNE) or (*PRUNE:NAME)
4282
4283       If the pattern is unanchored, the normal  "bumpalong"  advance  to  the
4284       next starting character then occurs. Backtracking can occur as usual to
4285       the left of (*PRUNE), before it is reached, or  when  matching  to  the
4286       right  of (*PRUNE), but if there is no match to the right, backtracking
4287       cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) is just  an
4288       alternative  to an atomic group or possessive quantifier, but there are
4289       some uses of (*PRUNE) that cannot be expressed in any other way. In  an
4290       anchored pattern, (*PRUNE) has the same effect as (*COMMIT).
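
       A simple case in which (*PRUNE) behaves like an atomic group can be
       sketched as follows. The subject "aaab", the patterns, and the variable
       names are illustrative assumptions:

```erlang
%% a+ takes all three "a"s; when the next "a" cannot match, backtracking
%% reaches (*PRUNE) and the attempt at that starting point is abandoned.
Pruned = re:run("aaab", "a+(*PRUNE)ab").
%% An atomic group gives the same result in this simple case.
PruneAtomic = re:run("aaab", "(?>a+)ab").
%% Ordinary backtracking lets a+ give back one "a".
PrunePlain = re:run("aaab", "a+ab", [{capture, first, list}]).
```

       Pruned and PruneAtomic are expected to be nomatch, and PrunePlain to be
       {match,["aaab"]}.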
4291
       The behavior of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
       It is like (*MARK:NAME) in that the name is remembered for passing back
       to the caller. However, (*SKIP:NAME) searches only for names set with
       (*MARK).
4296
4297   Note:
4298       The fact that (*PRUNE:NAME) remembers the name is useless to the Erlang
4299       programmer, as names cannot be retrieved.
4300
4301
4302       The  following  verb,  when specified without a name, is like (*PRUNE),
4303       except that if the pattern is unanchored, the  "bumpalong"  advance  is
4304       not  to  the  next  character, but to the position in the subject where
4305       (*SKIP) was encountered.
4306
4307       (*SKIP)
4308
4309       (*SKIP) signifies that whatever text was matched leading up to it  can‐
4310       not be part of a successful match. Consider:
4311
4312       a+(*SKIP)b
4313
4314       If  the  subject  is  "aaaac...",  after  the first match attempt fails
4315       (starting at the first character in the  string),  the  starting  point
4316       skips  on  to  start  the next attempt at "c". Notice that a possessive
4317       quantifier does not have the same effect as this example;  although  it
4318       would  suppress backtracking during the first match attempt, the second
4319       attempt would start at the second character instead of skipping  on  to
4320       "c".
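
       The skipped starting points do not change the result for a+(*SKIP)b
       itself. To make the effect visible, the following sketch adds a second
       alternative "." that matches any character; this alternative, the
       subject "aaaac", and the variable names are assumptions made for
       illustration:

```erlang
%% After the failure at "c", the next attempt starts where (*SKIP) was
%% passed, that is, at "c", where the second alternative matches.
Skipped = re:run("aaaac", "a+(*SKIP)b|.", [{capture, first, list}]).
%% A possessive quantifier only stops backtracking within the attempt: the
%% second alternative is still tried at offset 0.
SkipPossessive = re:run("aaaac", "a++b|.", [{capture, first, list}]).
```

       Skipped is expected to be {match,["c"]} and SkipPossessive to be
       {match,["a"]}.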
4321
4322       When (*SKIP) has an associated name, its behavior is modified:
4323
4324       (*SKIP:NAME)
4325
4326       When  this  is  triggered,  the  previous  path  through the pattern is
4327       searched for the most recent (*MARK) that has the same name. If one  is
4328       found,  the  "bumpalong" advance is to the subject position that corre‐
4329       sponds to that (*MARK) instead of to where (*SKIP) was encountered.  If
4330       no (*MARK) with a matching name is found, (*SKIP) is ignored.
4331
4332       Notice  that  (*SKIP:NAME) searches only for names set by (*MARK:NAME).
4333       It ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).
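
       The three cases can be compared in a hedged sketch. The subject
       "abbbd", the mark names here and nope, and the trailing alternative "."
       are hypothetical and serve only to show where the next match attempt
       starts. With the default capture type, run/2 returns the offset and
       length of the match:

```erlang
%% Unnamed: the next attempt starts where (*SKIP) was passed (offset 4).
SkipUnnamed = re:run("abbbd", "ab+(*SKIP)c|.").
%% Named: the next attempt starts at the matching (*MARK), just after "a".
SkipNamed = re:run("abbbd", "a(*MARK:here)b+(*SKIP:here)c|.").
%% No (*MARK:nope) exists, so the verb is ignored and "." matches at 0.
SkipIgnored = re:run("abbbd", "a(*MARK:here)b+(*SKIP:nope)c|.").
```

       SkipUnnamed is expected to be {match,[{4,1}]}, SkipNamed
       {match,[{1,1}]}, and SkipIgnored {match,[{0,1}]}.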
4334
4335       The following verb causes a skip to the next innermost alternative when
4336       backtracking  reaches  it. That is, it cancels any further backtracking
4337       within the current alternative.
4338
4339       (*THEN) or (*THEN:NAME)
4340
4341       The verb name comes from the observation that it can be used for a pat‐
4342       tern-based if-then-else block:
4343
4344       ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
4345
4346       If  the COND1 pattern matches, FOO is tried (and possibly further items
4347       after the end of the group if FOO succeeds). On  failure,  the  matcher
4348       skips  to  the second alternative and tries COND2, without backtracking
4349       into COND1. If that succeeds and BAR fails, COND3 is tried. If BAZ then
4350       fails, there are no more alternatives, so there is a backtrack to what‐
4351       ever came before the entire group. If (*THEN) is not inside an alterna‐
4352       tion, it acts like (*PRUNE).
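
       As a small, concrete instance of this scheme, the following sketch uses
       a+ as the condition. The patterns, the subject "aaab", and the variable
       names are assumptions for illustration:

```erlang
%% a+ takes "aaa"; when "ab" then fails, (*THEN) jumps to the next
%% alternative instead of letting a+ give back a character.
ThenTaken = re:run("aaab", "^(?:a+(*THEN)ab|aaa)", [{capture, first, list}]).
%% Without the verb, a+ backtracks to "aa" and the first alternative matches.
ThenPlain = re:run("aaab", "^(?:a+ab|aaa)", [{capture, first, list}]).
```

       ThenTaken is expected to be {match,["aaa"]} and ThenPlain to be
       {match,["aaab"]}.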
4353
       The behavior of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
       It is like (*MARK:NAME) in that the name is remembered for passing back
       to the caller. However, (*SKIP:NAME) searches only for names set with
       (*MARK).
4358
4359   Note:
4360       The fact that (*THEN:NAME) remembers the name is useless to the  Erlang
4361       programmer, as names cannot be retrieved.
4362
4363
4364       A  subpattern that does not contain a | character is just a part of the
4365       enclosing alternative; it is not a nested alternation with only one al‐
4366       ternative.  The  effect  of (*THEN) extends beyond such a subpattern to
4367       the enclosing alternative. Consider the following pattern, where A,  B,
4368       and  so  on,  are  complex  pattern fragments that do not contain any |
4369       characters at this level:
4370
4371       A (B(*THEN)C) | D
4372
4373       If A and B are matched, but there is a failure in C, matching does  not
4374       backtrack into A; instead it moves to the next alternative, that is, D.
4375       However, if the subpattern containing (*THEN) is given an  alternative,
4376       it behaves differently:
4377
4378       A (B(*THEN)C | (*FAIL)) | D
4379
4380       The  effect of (*THEN) is now confined to the inner subpattern. After a
4381       failure in C, matching moves to (*FAIL), which causes the whole subpat‐
4382       tern  to  fail, as there are no more alternatives to try. In this case,
4383       matching does now backtrack into A.
4384
4385       Notice that a conditional subpattern is not considered  as  having  two
4386       alternatives,  as  only one is ever used. That is, the | character in a
4387       conditional subpattern has a different  meaning.  Ignoring  whitespace,
4388       consider:
4389
4390       ^.*? (?(?=a) a | b(*THEN)c )
4391
4392       If  the  subject  is  "ba",  this pattern does not match. As .*? is un‐
4393       greedy, it initially matches zero characters. The condition (?=a)  then
4394       fails,  the  character  "b"  is matched, but "c" is not. At this point,
4395       matching does not backtrack to .*? as can perhaps be expected from  the
4396       presence  of the | character. The conditional subpattern is part of the
4397       single alternative that comprises the whole pattern, and so  the  match
4398       fails.  (If  there  was a backtrack into .*?, allowing it to match "b",
4399       the match would succeed.)
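
       This case can be tried from Erlang as sketched below. Option extended
       is assumed so that the whitespace in the pattern is ignored; the
       variable names and the comparison pattern without (*THEN) are added for
       illustration:

```erlang
%% The | separates the two arms of the condition; it is not an alternation
%% that (*THEN) could skip to, so the failure on "c" ends the attempt.
CondThen = re:run("ba", "^.*? (?(?=a) a | b(*THEN)c )", [extended]).
%% Without the verb, .*? can take "b", after which the condition holds.
CondPlain = re:run("ba", "^.*? (?(?=a) a | b c )",
                   [extended, {capture, first, list}]).
```

       CondThen is expected to be nomatch and CondPlain to be {match,["ba"]}.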
4400
4401       The verbs described above provide four different "strengths" of control
4402       when subsequent matching fails:
4403
4404         * (*THEN)  is the weakest, carrying on the match at the next alterna‐
4405           tive.
4406
         * (*PRUNE) comes next, failing the match at the current starting
           position, but allowing an advance to the next character (for an
           unanchored pattern).
4410
4411         * (*SKIP) is similar, except that the advance can be  more  than  one
4412           character.
4413
4414         * (*COMMIT) is the strongest, causing the entire match to fail.
4415
4416       More than One Backtracking Verb
4417
4418       If  more  than  one  backtracking verb is present in a pattern, the one
4419       that is backtracked onto first acts. For example, consider the  follow‐
4420       ing pattern, where A, B, and so on, are complex pattern fragments:
4421
4422       (A(*COMMIT)B(*THEN)C|ABD)
4423
4424       If  A matches but B fails, the backtrack to (*COMMIT) causes the entire
4425       match to fail. However, if A and B match, but C fails, the backtrack to
4426       (*THEN) causes the next alternative (ABD) to be tried. This behavior is
       consistent, but is not always the same as in Perl. It means that if two
       or more backtracking verbs appear in succession, all but the last of
       them have no effect. Consider the following example:

       ...(*COMMIT)(*PRUNE)...

4431       If there is a matching failure to the right, backtracking onto (*PRUNE)
4432       causes  it to be triggered, and its action is taken. There can never be
4433       a backtrack onto (*COMMIT).
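
       A sketch of (*COMMIT) immediately followed by (*PRUNE), using a
       hypothetical pattern and subject, shows that only the (*PRUNE) action
       is taken:

```erlang
%% Backtracking reaches (*PRUNE) first, so the attempt fails only at the
%% current starting point and a later attempt finds "aab".
VerbsBoth = re:run("aacaab", "a+(*COMMIT)(*PRUNE)b",
                   [{capture, first, list}]).
%% With (*COMMIT) alone, the first failure ends the whole match.
VerbsCommitOnly = re:run("aacaab", "a+(*COMMIT)b").
```

       VerbsBoth is expected to be {match,["aab"]} and VerbsCommitOnly to be
       nomatch.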
4434
4435       Backtracking Verbs in Repeated Groups
4436
4437       PCRE differs from Perl in its handling of  backtracking  verbs  in  re‐
4438       peated groups. For example, consider:
4439
4440       /(a(*COMMIT)b)+ac/
4441
4442       If  the  subject  is  "abac",  Perl matches, but PCRE fails because the
4443       (*COMMIT) in the second repeat of the group acts.
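
       As this module is based on PCRE, the PCRE result is the one to expect
       from run/2. A minimal sketch, with illustrative variable names and a
       comparison pattern without the verb:

```erlang
%% The (*COMMIT) passed in the second repeat of the group acts when that
%% repeat fails on "c", so the whole match fails.
Repeated = re:run("abac", "(a(*COMMIT)b)+ac").
%% Without the verb, the group matches once and "ac" follows.
RepeatedPlain = re:run("abac", "(ab)+ac", [{capture, first, list}]).
```

       Repeated is expected to be nomatch and RepeatedPlain to be
       {match,["abac"]}.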
4444
4445       Backtracking Verbs in Assertions
4446
4447       (*FAIL) in an assertion has its normal effect: it forces  an  immediate
4448       backtrack.
4449
4450       (*ACCEPT) in a positive assertion causes the assertion to succeed with‐
4451       out any further processing. In a negative assertion,  (*ACCEPT)  causes
4452       the assertion to fail without any further processing.
4453
       The other backtracking verbs are not treated specially if they appear
       in a positive assertion. In particular, (*THEN) skips to the next
       alternative in the innermost enclosing group that has alternations,
       regardless of whether this is within the assertion.
4458
4459       Negative assertions are, however, different, to ensure that changing  a
4460       positive  assertion into a negative assertion changes its result. Back‐
4461       tracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a negative  asser‐
4462       tion  to  be true, without considering any further alternative branches
4463       in the assertion. Backtracking into (*THEN) causes it to  skip  to  the
4464       next  enclosing alternative within the assertion (the normal behavior),
4465       but if the assertion does not have such an alternative, (*THEN) behaves
4466       like (*PRUNE).
4467
4468       Backtracking Verbs in Subroutines
4469
       These behaviors occur regardless of whether the subpattern is called
       recursively. The treatment of subroutines in Perl is different in some
       cases.
4473
4474         * (*FAIL)  in  a subpattern called as a subroutine has its normal ef‐
4475           fect: it forces an immediate backtrack.
4476
4477         * (*ACCEPT) in a subpattern called as a subroutine causes the subrou‐
4478           tine match to succeed without any further processing. Matching then
4479           continues after the subroutine call.
4480
4481         * (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as  a  sub‐
4482           routine cause the subroutine match to fail.
4483
4484         * (*THEN)  skips  to  the next alternative in the innermost enclosing
4485           group within the subpattern that has alternatives. If there  is  no
4486           such  group  within  the  subpattern, (*THEN) causes the subroutine
4487           match to fail.
4488
4489Ericsson AB                     stdlib 4.3.1.3                           re(3)