re(3erl) - f39

1re(3)                      Erlang Module Definition                      re(3)
2
3
4

NAME

6       re - Perl-like regular expressions for Erlang.
7

DESCRIPTION

9       This  module contains regular expression matching functions for strings
10       and binaries.
11
12       The regular expression syntax and semantics resemble that of Perl.
13
14       The matching algorithms of the library are based on the  PCRE  library,
15       but not all of the PCRE library is interfaced and some parts of the li‐
16       brary go beyond what PCRE offers. Currently PCRE version 8.40  (release
17       date  2017-01-11)  is used. The sections of the PCRE documentation that
18       are relevant to this module are included here.
19
20   Note:
21       The Erlang literal syntax for strings uses the "\" (backslash)  charac‐
22       ter  as  an  escape  code.  You  need  to escape backslashes in literal
23       strings, both in your code and in the shell, with an  extra  backslash,
24       that is, "\\".
25
26

DATA TYPES

28       mp() = {re_pattern, term(), term(), term(), term()}
29
30              Opaque  data type containing a compiled regular expression. mp()
31              is guaranteed to be a tuple() having the atom re_pattern as  its
32              first element, to allow for matching in guards. The arity of the
33              tuple or the content of the other fields can  change  in  future
34              Erlang/OTP releases.
35
36       nl_spec() = cr | crlf | lf | anycrlf | any
37
38       compile_option() =
39           unicode | anchored | caseless | dollar_endonly | dotall |
40           extended | firstline | multiline | no_auto_capture |
41           dupnames | ungreedy |
42           {newline, nl_spec()} |
43           bsr_anycrlf | bsr_unicode | no_start_optimize | ucp |
44           never_utf
45
46       replace_fun() =
47           fun((binary(), [binary()]) -> iodata() | unicode:charlist())
48

EXPORTS

50       version() -> binary()
51
52              The return of this function is a string with the PCRE version of
53              the system that was used in the Erlang/OTP compilation.
54
55       compile(Regexp) -> {ok, MP} | {error, ErrSpec}
56
57              Types:
58
59                 Regexp = iodata()
60                 MP = mp()
61                 ErrSpec =
62                     {ErrString :: string(), Position :: integer() >= 0}
63
64              The same as compile(Regexp,[])
65
66       compile(Regexp, Options) -> {ok, MP} | {error, ErrSpec}
67
68              Types:
69
70                 Regexp = iodata() | unicode:charlist()
71                 Options = [Option]
72                 Option = compile_option()
73                 MP = mp()
74                 ErrSpec =
75                     {ErrString :: string(), Position :: integer() >= 0}
76
77              Compiles a regular expression, with the syntax described  below,
78              into an internal format to be used later as a parameter to run/2
79              and run/3.
80
81              Compiling the regular expression before matching  is  useful  if
82              the  same  expression is to be used in matching against multiple
83              subjects during the lifetime of the program. Compiling once  and
84              executing  many  times is far more efficient than compiling each
85              time one wants to match.
86
87              When option unicode is specified, the regular expression  is  to
88              be  specified  as  a  valid Unicode charlist(), otherwise as any
89              valid iodata().
90
91              Options:
92
93                unicode:
94                  The regular expression is specified as a Unicode  charlist()
95                  and  the  resulting  regular  expression  code  is to be run
96                  against a valid Unicode charlist()  subject.  Also  consider
97                  option ucp when using Unicode characters.
98
99                anchored:
100                  The  pattern is forced to be "anchored", that is, it is con‐
101                  strained to match only at the first matching  point  in  the
102                  string  that is searched (the "subject string"). This effect
103                  can also be achieved by appropriate constructs in  the  pat‐
104                  tern itself.
105
106                caseless:
107                  Letters  in  the  pattern match both uppercase and lowercase
108                  letters. It is equivalent to  Perl  option  /i  and  can  be
109                  changed within a pattern by a (?i) option setting. Uppercase
110                  and lowercase letters are defined as in the ISO 8859-1 char‐
111                  acter set.
112
113                dollar_endonly:
114                  A  dollar  metacharacter  in the pattern matches only at the
115                  end of the subject string. Without  this  option,  a  dollar
116                  also  matches immediately before a newline at the end of the
117                  string (but not before any other newlines). This  option  is
118                  ignored if option multiline is specified. There is no equiv‐
119                  alent option in Perl, and it cannot be set within a pattern.
120
121                dotall:
122                  A dot in the pattern matches all characters, including those
123                  indicating  newline.  Without  it, a dot does not match when
124                  the current position is at a newline. This option is equiva‐
125                  lent  to  Perl option /s and it can be changed within a pat‐
126                  tern by a (?s) option setting. A  negative  class,  such  as
127                  [^a],  always matches newline characters, independent of the
128                  setting of this option.
129
130                extended:
131                  If this option is set, most white space  characters  in  the
132                  pattern  are totally ignored except when escaped or inside a
133                  character class. However, white space is not allowed  within
134                  sequences  such  as (?> that introduce various parenthesized
135                  subpatterns, nor  within  a  numerical  quantifier  such  as
136                  {1,3}.  However,  ignorable white space is permitted between
137                  an item and a following quantifier and between a  quantifier
138                  and a following + that indicates possessiveness.
139
140                  White  space  did not used to include the VT character (code
141                  11), because Perl did not  treat  this  character  as  white
142                  space.  However,  Perl changed at release 5.18, so PCRE fol‐
143                  lowed at release 8.34, and VT is now treated as white space.
144
145                  This also causes characters between an unescaped # outside a
146                  character  class  and the next newline, inclusive, to be ig‐
147                  nored. This is equivalent to Perl's /x option, and it can be
148                  changed within a pattern by a (?x) option setting.
149
150                  With  this  option, comments inside complicated patterns can
151                  be included. However, notice that this applies only to  data
152                  characters.  Whitespace  characters  can never appear within
153                  special character sequences in a pattern, for example within
154                  sequence (?( that introduces a conditional subpattern.
155
156                firstline:
157                  An  unanchored pattern is required to match before or at the
158                  first newline in the subject string,  although  the  matched
159                  text can continue over the newline.
160
161                multiline:
162                  By  default, PCRE treats the subject string as consisting of
163                  a single line of characters (even if it contains  newlines).
164                  The  "start  of  line" metacharacter (^) matches only at the
165                  start of the string, while the "end of  line"  metacharacter
166                  ($)  matches only at the end of the string, or before a ter‐
167                  minating newline (unless  option  dollar_endonly  is  speci‐
168                  fied). This is the same as in Perl.
169
170                  When  this option is specified, the "start of line" and "end
171                  of line" constructs match immediately following  or  immedi‐
172                  ately  before  internal  newlines in the subject string, re‐
173                  spectively, as well as at the very start and  end.  This  is
174                  equivalent  to  Perl  option  /m and can be changed within a
175                  pattern by a (?m) option setting. If there are  no  newlines
176                  in  a  subject string, or no occurrences of ^ or $ in a pat‐
177                  tern, setting multiline has no effect.
178
179                no_auto_capture:
180                  Disables the use of numbered capturing  parentheses  in  the
181                  pattern.  Any  opening parenthesis that is not followed by ?
182                  behaves as if it is followed by ?:.  Named  parentheses  can
183                  still be used for capturing (and they acquire numbers in the
184                  usual way). There is no equivalent option in Perl.
185
186                dupnames:
187                  Names used to identify capturing  subpatterns  need  not  be
188                  unique.  This  can  be  helpful for certain types of pattern
189                  when it is known that only one instance of the named subpat‐
190                  tern  can ever be matched. More details of named subpatterns
191                  are provided below.
192
193                ungreedy:
194                  Inverts the "greediness" of the quantifiers so that they are
195                  not greedy by default, but become greedy if followed by "?".
196                  It is not compatible with Perl. It can also be set by a (?U)
197                  option setting within the pattern.
198
199                {newline, NLSpec}:
200                  Overrides the default definition of a newline in the subject
201                  string, which is LF (ASCII 10) in Erlang.
202
203                  cr:
204                    Newline is indicated by a single character cr (ASCII 13).
205
206                  lf:
207                    Newline is indicated by a single character LF (ASCII  10),
208                    the default.
209
210                  crlf:
211                    Newline  is  indicated by the two-character CRLF (ASCII 13
212                    followed by ASCII 10) sequence.
213
214                  anycrlf:
215                    Any of the three preceding sequences is to be recognized.
216
217                  any:
218                    Any of the newline sequences above, and  the  Unicode  se‐
219                    quences  VT (vertical tab, U+000B), FF (formfeed, U+000C),
220                    NEL (next line, U+0085), LS (line separator, U+2028),  and
221                    PS (paragraph separator, U+2029).
222
223                bsr_anycrlf:
224                  Specifies  specifically that \R is to match only the CR, LF,
225                  or CRLF sequences, not the Unicode-specific newline  charac‐
226                  ters.
227
228                bsr_unicode:
229                  Specifies  specifically  that \R is to match all the Unicode
230                  newline characters (including CRLF, and so on, the default).
231
232                no_start_optimize:
233                  Disables  optimization  that  can  malfunction  if  "Special
234                  start-of-pattern  items"  are present in the regular expres‐
235                  sion. A typical example  would  be  when  matching  "DEFABC"
236                  against "(*COMMIT)ABC", where the start optimization of PCRE
237                  would skip the subject up to "A" and never realize that  the
238                  (*COMMIT)  instruction  is  to  have made the matching fail.
239                  This option is only relevant if  you  use  "start-of-pattern
240                  items",  as discussed in section PCRE Regular Expression De‐
241                  tails.
242
243                ucp:
244                  Specifies that Unicode character properties are to  be  used
245                  when  resolving  \B,  \b, \D, \d, \S, \s, \W and \w. Without
246                  this flag, only ISO Latin-1 properties are used. Using  Uni‐
247                  code  properties hurts performance, but is semantically cor‐
248                  rect when working with Unicode  characters  beyond  the  ISO
249                  Latin-1 range.
250
251                never_utf:
252                  Specifies  that  the (*UTF) and/or (*UTF8) "start-of-pattern
253                  items" are forbidden. This flag cannot be combined with  op‐
254                  tion  unicode. Useful if ISO Latin-1 patterns from an exter‐
255                  nal source are to be compiled.
256
257       inspect(MP, Item) -> {namelist, [binary()]}
258
259              Types:
260
261                 MP = mp()
262                 Item = namelist
263
264              Takes a compiled regular expression and an item, and returns the
265              relevant  data  from  the regular expression. The only supported
266              item is  namelist,  which  returns  the  tuple  {namelist,  [bi‐
267              nary()]}, containing the names of all (unique) named subpatterns
268              in the regular expression. For example:
269
270              1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
271              {ok,{re_pattern,3,0,0,
272                              <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
273                                255,255,...>>}}
274              2> re:inspect(MP,namelist).
275              {namelist,[<<"A">>,<<"B">>,<<"C">>]}
276              3> {ok,MPD} = re:compile("(?<C>A)|(?<B>B)|(?<C>C)",[dupnames]).
277              {ok,{re_pattern,3,0,0,
278                              <<69,82,67,80,119,0,0,0,0,0,8,0,1,0,0,0,255,255,255,255,
279                                255,255,...>>}}
280              4> re:inspect(MPD,namelist).
281              {namelist,[<<"B">>,<<"C">>]}
282
283              Notice in the second example that the duplicate name only occurs
284              once  in the returned list, and that the list is in alphabetical
285              order regardless of where the names are positioned in the  regu‐
286              lar  expression. The order of the names is the same as the order
287              of captured subexpressions if {capture, all_names} is  specified
288              as  an option to run/3. You can therefore create a name-to-value
289              mapping from the result of run/3 like this:
290
291              1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
292              {ok,{re_pattern,3,0,0,
293                              <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
294                                255,255,...>>}}
295              2> {namelist, N} = re:inspect(MP,namelist).
296              {namelist,[<<"A">>,<<"B">>,<<"C">>]}
297              3> {match,L} = re:run("AA",MP,[{capture,all_names,binary}]).
298              {match,[<<"A">>,<<>>,<<>>]}
299              4> NameMap = lists:zip(N,L).
300              [{<<"A">>,<<"A">>},{<<"B">>,<<>>},{<<"C">>,<<>>}]
301
302       replace(Subject, RE, Replacement) -> iodata() | unicode:charlist()
303
304              Types:
305
306                 Subject = iodata() | unicode:charlist()
307                 RE = mp() | iodata()
308                 Replacement = iodata() | unicode:charlist() | replace_fun()
309
310              Same as replace(Subject, RE, Replacement, []).
311
312       replace(Subject, RE, Replacement, Options) ->
313                  iodata() | unicode:charlist()
314
315              Types:
316
317                 Subject = iodata() | unicode:charlist()
318                 RE = mp() | iodata() | unicode:charlist()
319                 Replacement = iodata() | unicode:charlist() | replace_fun()
320                 Options = [Option]
321                 Option =
322                     anchored | global | notbol | noteol | notempty |
323                     notempty_atstart |
324                     {offset, integer() >= 0} |
325                     {newline, NLSpec} |
326                     bsr_anycrlf |
327                     {match_limit, integer() >= 0} |
328                     {match_limit_recursion, integer() >= 0} |
329                     bsr_unicode |
330                     {return, ReturnType} |
331                     CompileOpt
332                 ReturnType = iodata | list | binary
333                 CompileOpt = compile_option()
334                 NLSpec = cr | crlf | lf | anycrlf | any
335
336              Replaces the matched part of the Subject  string  with  Replace‐
337              ment.
338
339              The  permissible  options are the same as for run/3, except that
340              option capture is not allowed. Instead a {return, ReturnType} is
341              present. The default return type is iodata, constructed in a way
342              to minimize copying. The iodata result can be used  directly  in
343              many  I/O  operations. If a flat list() is desired, specify {re‐
344              turn, list}. If a binary is desired, specify {return, binary}.
345
346              As in function run/3, an mp() compiled with option  unicode  re‐
347              quires  Subject  to  be  a Unicode charlist(). If compilation is
348              done implicitly and the unicode compilation option is  specified
349              to this function, both the regular expression and Subject are to
350              specified as valid Unicode charlist()s.
351
352              If the replacement is given as a string, it can contain the spe‐
353              cial character &, which inserts the whole matching expression in
354              the result, and the special sequence \N (where N is an integer >
355              0),  \gN,  or \g{N}, resulting in the subexpression number N, is
356              inserted in the result. If no subexpression with that number  is
357              generated by the regular expression, nothing is inserted.
358
359              To insert an & or a \ in the result, precede it with a \. Notice
360              that Erlang already gives a special  meaning  to  \  in  literal
361              strings,  so  a single \ must be written as "\\" and therefore a
362              double \ as "\\\\".
363
364              Example:
365
366              re:replace("abcd","c","[&]",[{return,list}]).
367
368              gives
369
370              "ab[c]d"
371
372              while
373
374              re:replace("abcd","c","[\\&]",[{return,list}]).
375
376              gives
377
378              "ab[&]d"
379
380              If the replacement is given as a fun, it will be called with the
381              whole  matching  expression  as the first argument and a list of
382              subexpression matches in the order in which they appear  in  the
383              regular  expression.  The returned value will be inserted in the
384              result.
385
386              Example:
387
388              re:replace("abcd", ".(.)", fun(Whole, [<<C>>]) -> <<$#, Whole/binary, $-, (C - $a + $A), $#>> end, [{return, list}]).
389
390              gives
391
392              "#ab-B#cd"
393
394          Note:
395              Non-matching optional subexpressions will not be included in the
396              list  of  subexpression  matches if they are the last subexpres‐
397              sions in the regular expression.
398
399              Example:
400
401              The regular expression "(a)(b)?(c)?" ("a",  optionally  followed
402              by  "b",  optionally  followed by "c") will create the following
403              subexpression lists:
404
405                * [<<"a">>, <<"b">>, <<"c">>] when applied to the string "abc"
406
407                * [<<"a">>, <<>>, <<"c">>] when applied to the string "acx"
408
409                * [<<"a">>, <<"b">>] when applied to the string "abx"
410
411                * [<<"a">>] when applied to the string "axx"
412
413              As with run/3, compilation errors raise  the  badarg  exception.
414              compile/2 can be used to get more information about the error.
415
416       run(Subject, RE) -> {match, Captured} | nomatch
417
418              Types:
419
420                 Subject = iodata() | unicode:charlist()
421                 RE = mp() | iodata()
422                 Captured = [CaptureData]
423                 CaptureData = {integer(), integer()}
424
425              Same as run(Subject,RE,[]).
426
427       run(Subject, RE, Options) ->
428              {match, Captured} | match | nomatch | {error, ErrType}
429
430              Types:
431
432                 Subject = iodata() | unicode:charlist()
433                 RE = mp() | iodata() | unicode:charlist()
434                 Options = [Option]
435                 Option =
436                     anchored | global | notbol | noteol | notempty |
437                     notempty_atstart | report_errors |
438                     {offset, integer() >= 0} |
439                     {match_limit, integer() >= 0} |
440                     {match_limit_recursion, integer() >= 0} |
441                     {newline, NLSpec :: nl_spec()} |
442                     bsr_anycrlf | bsr_unicode |
443                     {capture, ValueSpec} |
444                     {capture, ValueSpec, Type} |
445                     CompileOpt
446                 Type = index | list | binary
447                 ValueSpec =
448                     all  |  all_but_first  |  all_names | first | none | Val‐
449                 ueList
450                 ValueList = [ValueID]
451                 ValueID = integer() | string() | atom()
452                 CompileOpt = compile_option()
453                   See compile/2.
454                 Captured = [CaptureData] | [[CaptureData]]
455                 CaptureData =
456                     {integer(), integer()} | ListConversionData | binary()
457                 ListConversionData =
458                     string() |
459                     {error, string(), binary()} |
460                     {incomplete, string(), binary()}
461                 ErrType =
462                     match_limit  |  match_limit_recursion  |  {compile,  Com‐
463                 pileErr}
464                 CompileErr =
465                     {ErrString :: string(), Position :: integer() >= 0}
466
467              Executes    a   regular   expression   matching,   and   returns
468              match/{match, Captured} or nomatch. The regular  expression  can
469              be  specified  either  as iodata() in which case it is automati‐
470              cally compiled (as by compile/2) and executed, or as  a  precom‐
471              piled  mp() in which case it is executed against the subject di‐
472              rectly.
473
474              When compilation is involved, exception badarg is  thrown  if  a
475              compilation  error  occurs.  Call  compile/2  to get information
476              about the location of the error in the regular expression.
477
478              If the regular expression is  previously  compiled,  the  option
479              list can only contain the following options:
480
481                * anchored
482
483                * {capture, ValueSpec}/{capture, ValueSpec, Type}
484
485                * global
486
487                * {match_limit, integer() >= 0}
488
489                * {match_limit_recursion, integer() >= 0}
490
491                * {newline, NLSpec}
492
493                * notbol
494
495                * notempty
496
497                * notempty_atstart
498
499                * noteol
500
501                * {offset, integer() >= 0}
502
503                * report_errors
504
505              Otherwise  all options valid for function compile/2 are also al‐
506              lowed. Options allowed both for compilation and execution  of  a
507              match,  namely  anchored  and {newline, NLSpec}, affect both the
508              compilation and execution if present together with a non-precom‐
509              piled regular expression.
510
511              If  the  regular  expression was previously compiled with option
512              unicode,  Subject  is  to  be  provided  as  a   valid   Unicode
513              charlist(),  otherwise  any  iodata() will do. If compilation is
514              involved and option unicode is specified, both Subject  and  the
515              regular   expression  are  to  be  specified  as  valid  Unicode
516              charlists().
517
518              {capture, ValueSpec}/{capture, ValueSpec, Type} defines what  to
519              return  from  the function upon successful matching. The capture
520              tuple can contain both a value specification, telling  which  of
521              the  captured substrings are to be returned, and a type specifi‐
522              cation, telling how captured substrings are to be  returned  (as
523              index  tuples, lists, or binaries). The options are described in
524              detail below.
525
526              If the capture options describe that no substring  capturing  is
527              to  be  done  ({capture, none}), the function returns the single
528              atom match upon successful matching, otherwise the tuple {match,
529              ValueList}. Disabling capturing can be done either by specifying
530              none or an empty list as ValueSpec.
531
532              Option report_errors adds the possibility that an error tuple is
533              returned.   The   tuple   either   indicates  a  matching  error
534              (match_limit or match_limit_recursion), or a compilation  error,
535              where  the  error  tuple  has  the format {error, {compile, Com‐
536              pileErr}}. Notice that if option report_errors is not specified,
537              the function never returns error tuples, but reports compilation
538              errors as a badarg exception and failed matches because  of  ex‐
539              ceeded match limits simply as nomatch.
540
541              The following options are relevant for execution:
542
543                anchored:
544                  Limits  run/3 to matching at the first matching position. If
545                  a pattern was compiled with anchored, or turned  out  to  be
546                  anchored  by virtue of its contents, it cannot be made unan‐
547                  chored at matching time, hence there is  no  unanchored  op‐
548                  tion.
549
550                global:
551                  Implements global (repetitive) search (flag g in Perl). Each
552                  match is returned as a separate list() containing  the  spe‐
553                  cific match and any matching subexpressions (or as specified
554                  by option capture. The Captured part of the return value  is
555                  hence a list() of list()s when this option is specified.
556
557                  The  interaction  of option global with a regular expression
558                  that matches an empty string surprises some users. When  op‐
559                  tion global is specified, run/3 handles empty matches in the
560                  same way as Perl: a zero-length match at any point  is  also
561                  retried  with  options [anchored, notempty_atstart]. If that
562                  search gives a result of length > 0, the result is included.
563                  Example:
564
565                re:run("cat","(|at)",[global]).
566
567                  The following matchings are performed:
568
569                  At offset 0:
570                    The  regular  expression  (|at) first match at the initial
571                    position  of   string   cat,   giving   the   result   set
572                    [{0,0},{0,0}]  (the  second {0,0} is because of the subex‐
573                    pression marked by the parentheses). As the length of  the
574                    match is 0, we do not advance to the next position yet.
575
576                  At offset 0 with [anchored, notempty_atstart]:
577                    The search is retried with options [anchored, notempty_at‐
578                    start] at the same position, which does not give  any  in‐
579                    teresting  result of longer length, so the search position
580                    is advanced to the next character (a).
581
582                  At offset 1:
583                    The search results in [{1,0},{1,0}],  so  this  search  is
584                    also repeated with the extra options.
585
586                  At offset 1 with [anchored, notempty_atstart]:
587                    Alternative  ab  is found and the result is [{1,2},{1,2}].
588                    The result is added to the list of results and  the  posi‐
589                    tion in the search string is advanced two steps.
590
591                  At offset 3:
592                    The  search  once  again  matches the empty string, giving
593                    [{3,0},{3,0}].
594
595                  At offset 1 with [anchored, notempty_atstart]:
596                    This gives no result of length > 0 and we are at the  last
597                    position, so the global search is complete.
598
599                  The result of the call is:
600
601                {match,[[{0,0},{0,0}],[{1,0},{1,0}],[{1,2},{1,2}],[{3,0},{3,0}]]}
602
603                notempty:
604                  An  empty  string  is  not considered to be a valid match if
605                  this option is specified. If alternatives in the pattern ex‐
606                  ist, they are tried. If all the alternatives match the empty
607                  string, the entire match fails.
608
609                  Example:
610
611                  If the following pattern is applied to a string  not  begin‐
612                  ning  with  "a"  or  "b",  it would normally match the empty
613                  string at the start of the subject:
614
615                a?b?
616
617                  With option  notempty,  this  match  is  invalid,  so  run/3
618                  searches  further  into the string for occurrences of "a" or
619                  "b".
620
621                notempty_atstart:
622                  Like notempty, except that an empty string match that is not
623                  at  the start of the subject is permitted. If the pattern is
624                  anchored, such a match can occur only if  the  pattern  con‐
625                  tains \K.
626
627                  Perl  has  no  direct equivalent of notempty or notempty_at‐
628                  start, but it does make a special case of a pattern match of
629                  the empty string within its split() function, and when using
630                  modifier /g. The Perl behavior can be emulated after  match‐
631                  ing  a  null  string  by first trying the match again at the
632                  same offset with notempty_atstart and anchored, and then, if
633                  that fails, by advancing the starting offset (see below) and
634                  trying an ordinary match again.
635
636                notbol:
637                  Specifies that the first character of the subject string  is
638                  not the beginning of a line, so the circumflex metacharacter
639                  is not to match before it. Setting  this  without  multiline
640                  (at compile time) causes circumflex never to match. This op‐
641                  tion only affects the behavior of the circumflex metacharac‐
642                  ter. It does not affect \A.
643
644                noteol:
645                  Specifies  that the end of the subject string is not the end
646                  of a line, so the dollar metacharacter is not  to  match  it
647                  nor  (except in multiline mode) a newline immediately before
648                  it. Setting this without multiline (at compile time)  causes
649                  dollar never to match. This option affects only the behavior
650                  of the dollar metacharacter. It does not affect \Z or \z.
651
652                report_errors:
653                  Gives better control of the error handling  in  run/3.  When
654                  specified,  compilation errors (if the regular expression is
655                  not already compiled) and runtime errors are explicitly  re‐
656                  turned as an error tuple.
657
658                  The following are the possible runtime errors:
659
660                  match_limit:
661                    The PCRE library sets a limit on how many times the inter‐
662                    nal match function can be called. Defaults  to  10,000,000
663                    in   the   library   compiled   for   Erlang.  If  {error,
664                    match_limit} is returned, the execution of the regular ex‐
665                    pression  has  reached  this limit. This is normally to be
666                    regarded as a nomatch, which is the default  return  value
667                    when this occurs, but by specifying report_errors, you are
668                    informed when the match fails because of too many internal
669                    calls.
670
671                  match_limit_recursion:
672                    This error is very similar to match_limit, but occurs when
673                    the internal  match  function  of  PCRE  is  "recursively"
674                    called  more  times  than the match_limit_recursion limit,
675                    which defaults to 10,000,000 as well. Notice that as  long
676                    as the match_limit and match_limit_default values are kept
677                    at the default  values,  the  match_limit_recursion  error
678                    cannot  occur, as the match_limit error occurs before that
679                    (each recursive call is also a call, but not  conversely).
680                    Both limits can however be changed, either by setting lim‐
681                    its directly in the regular expression string (see section
682                    PCRE Regular Eexpression Details) or by specifying options
683                    to run/3.
684
685                  It is important to understand that what is  referred  to  as
686                  "recursion"  when limiting matches is not recursion on the C
687                  stack of the Erlang machine or on the Erlang process  stack.
688                  The  PCRE  version  compiled into the Erlang VM uses machine
689                  "heap" memory to store values that must be kept over  recur‐
690                  sion in regular expression matches.
691
692                {match_limit, integer() >= 0}:
693                  Limits  the  execution time of a match in an implementation-
694                  specific way. It is described as follows by the  PCRE  docu‐
695                  mentation:
696
697                The match_limit field provides a means of preventing PCRE from using
698                up a vast amount of resources when running patterns that are not going
699                to match, but which have a very large number of possibilities in their
700                search trees. The classic example is a pattern that uses nested
701                unlimited repeats.
702
703                Internally, pcre_exec() uses a function called match(), which it calls
704                repeatedly (sometimes recursively). The limit set by match_limit is
705                imposed on the number of times this function is called during a match,
706                which has the effect of limiting the amount of backtracking that can
707                take place. For patterns that are not anchored, the count restarts
708                from zero for each position in the subject string.
709
710                  This  means that runaway regular expression matches can fail
711                  faster if the limit is lowered using this  option.  The  de‐
712                  fault value 10,000,000 is compiled into the Erlang VM.
713
714            Note:
715                This  option does in no way affect the execution of the Erlang
716                VM in terms of "long running BIFs". run/3 always gives control
717                back  to  the  scheduler of Erlang processes at intervals that
718                ensures the real-time properties of the Erlang system.
719
720
721                {match_limit_recursion, integer() >= 0}:
722                  Limits the execution time and memory consumption of a  match
723                  in   an   implementation-specific   way,   very  similar  to
724                  match_limit. It is described as follows by the PCRE documen‐
725                  tation:
726
727                The match_limit_recursion field is similar to match_limit, but instead
728                of limiting the total number of times that match() is called, it
729                limits the depth of recursion. The recursion depth is a smaller number
730                than the total number of calls, because not all calls to match() are
731                recursive. This limit is of use only if it is set smaller than
732                match_limit.
733
734                Limiting the recursion depth limits the amount of machine stack that
735                can be used, or, when PCRE has been compiled to use memory on the heap
736                instead of the stack, the amount of heap memory that can be used.
737
738                  The  Erlang VM uses a PCRE library where heap memory is used
739                  when regular expression match recursion occurs. This  there‐
740                  fore limits the use of machine heap, not C stack.
741
742                  Specifying a lower value can result in matches with deep re‐
743                  cursion failing, when they should have matched:
744
745                1> re:run("aaaaaaaaaaaaaz","(a+)*z").
746                {match,[{0,14},{0,13}]}
747                2> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5}]).
748                nomatch
749                3> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5},report_errors]).
750                {error,match_limit_recursion}
751
752                  This option and option match_limit are only to  be  used  in
753                  rare  cases.  Understanding of the PCRE library internals is
754                  recommended before tampering with these limits.
755
756                {offset, integer() >= 0}:
757                  Start matching at the offset  (position)  specified  in  the
758                  subject  string.  The  offset is zero-based, so that the de‐
759                  fault is {offset,0} (all of the subject string).
760
761                {newline, NLSpec}:
762                  Overrides the default definition of a newline in the subject
763                  string, which is LF (ASCII 10) in Erlang.
764
765                  cr:
766                    Newline is indicated by a single character CR (ASCII 13).
767
768                  lf:
769                    Newline  is indicated by a single character LF (ASCII 10),
770                    the default.
771
772                  crlf:
773                    Newline is indicated by the two-character CRLF  (ASCII  13
774                    followed by ASCII 10) sequence.
775
776                  anycrlf:
777                    Any of the three preceding sequences is be recognized.
778
779                  any:
780                    Any  of  the  newline sequences above, and the Unicode se‐
781                    quences VT (vertical tab, U+000B), FF (formfeed,  U+000C),
782                    NEL  (next line, U+0085), LS (line separator, U+2028), and
783                    PS (paragraph separator, U+2029).
784
785                bsr_anycrlf:
786                  Specifies specifically that \R is to match only the  CR  LF,
787                  or  CRLF sequences, not the Unicode-specific newline charac‐
788                  ters. (Overrides the compilation option.)
789
790                bsr_unicode:
791                  Specifies specifically that \R is to match all  the  Unicode
792                  newline characters (including CRLF, and so on, the default).
793                  (Overrides the compilation option.)
794
795                {capture, ValueSpec}/{capture, ValueSpec, Type}:
796                  Specifies which captured substrings are returned and in what
797                  format.  By default, run/3 captures all of the matching part
798                  of the substring and all capturing subpatterns (all  of  the
799                  pattern  is automatically captured). The default return type
800                  is (zero-based) indexes of the captured parts of the string,
801                  specified  as  {Offset,Length} pairs (the index Type of cap‐
802                  turing).
803
804                  As an example of the default behavior,  the  following  call
805                  returns,  as  first  and  only captured string, the matching
806                  part of the subject ("abcd" in the middle) as an index  pair
807                  {3,4},  where character positions are zero-based, just as in
808                  offsets:
809
810                re:run("ABCabcdABC","abcd",[]).
811
812                  The return value of this call is:
813
814                {match,[{3,4}]}
815
816                  Another (and quite common) case is where the regular expres‐
817                  sion matches all of the subject:
818
819                re:run("ABCabcdABC",".*abcd.*",[]).
820
821                  Here  the return value correspondingly points out all of the
822                  string, beginning at index 0, and it is 10 characters long:
823
824                {match,[{0,10}]}
825
826                  If the regular expression  contains  capturing  subpatterns,
827                  like in:
828
829                re:run("ABCabcdABC",".*(abcd).*",[]).
830
831                  all  of the matched subject is captured, as well as the cap‐
832                  tured substrings:
833
834                {match,[{0,10},{3,4}]}
835
836                  The complete matching pattern always gives the first  return
837                  value in the list and the remaining subpatterns are added in
838                  the order they occurred in the regular expression.
839
840                  The capture tuple is built up as follows:
841
842                  ValueSpec:
843                    Specifies which captured (sub)patterns are to be returned.
844                    ValueSpec  can  either  be an atom describing a predefined
845                    set of return values, or a list containing the indexes  or
846                    the names of specific subpatterns to return.
847
848                    The following are the predefined sets of subpatterns:
849
850                    all:
851                      All captured subpatterns including the complete matching
852                      string. This is the default.
853
854                    all_names:
855                      All named subpatterns in the regular expression, as if a
856                      list() of all the names in alphabetical order was speci‐
857                      fied. The list of all names can also be  retrieved  with
858                      inspect/2.
859
860                    first:
861                      Only  the first captured subpattern, which is always the
862                      complete matching part of the  subject.  All  explicitly
863                      captured subpatterns are discarded.
864
865                    all_but_first:
866                      All  but the first matching subpattern, that is, all ex‐
867                      plicitly captured  subpatterns,  but  not  the  complete
868                      matching  part  of the subject string. This is useful if
869                      the regular expression as a whole matches a  large  part
870                      of the subject, but the part you are interested in is in
871                      an explicitly captured subpattern. If the return type is
872                      list  or  binary,  not returning subpatterns you are not
873                      interested in is a good way to optimize.
874
875                    none:
876                      Returns no matching subpatterns, gives the  single  atom
877                      match  as the return value of the function when matching
878                      successfully instead  of  the  {match,  list()}  return.
879                      Specifying an empty list gives the same behavior.
880
881                    The value list is a list of indexes for the subpatterns to
882                    return, where index 0 is for all of the pattern, and 1  is
883                    for the first explicit capturing subpattern in the regular
884                    expression, and so on. When using named  captured  subpat‐
885                    terns  (see  below) in the regular expression, one can use
886                    atom()s or string()s to specify the subpatterns to be  re‐
887                    turned. For example, consider the regular expression:
888
889                  ".*(abcd).*"
890
891                    matched  against  string  "ABCabcdABC", capturing only the
892                    "abcd" part (the first explicit subpattern):
893
894                  re:run("ABCabcdABC",".*(abcd).*",[{capture,[1]}]).
895
896                    The call gives the following result, as the first  explic‐
897                    itly  captured  subpattern is "(abcd)", matching "abcd" in
898                    the subject, at (zero-based) position 3, of length 4:
899
900                  {match,[{3,4}]}
901
902                    Consider the same regular expression, but with the subpat‐
903                    tern explicitly named 'FOO':
904
905                  ".*(?<FOO>abcd).*"
906
907                    With this expression, we could still give the index of the
908                    subpattern with the following call:
909
910                  re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,[1]}]).
911
912                    giving the same result as before. But, as  the  subpattern
913                    is named, we can also specify its name in the value list:
914
915                  re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,['FOO']}]).
916
917                    This  would  give the same result as the earlier examples,
918                    namely:
919
920                  {match,[{3,4}]}
921
922                    The values list can specify indexes or names  not  present
923                    in the regular expression, in which case the return values
924                    vary depending on the type. If the type is index, the  tu‐
925                    ple  {-1,0}  is  returned for values with no corresponding
926                    subpattern in the regular expression, but  for  the  other
927                    types  (binary  and list), the values are the empty binary
928                    or list, respectively.
929
930                  Type:
931                    Optionally specifies how captured substrings are to be re‐
932                    turned. If omitted, the default of index is used.
933
934                    Type can be one of the following:
935
936                    index:
937                      Returns  captured  substrings  as  pairs of byte indexes
938                      into the subject  string  and  length  of  the  matching
939                      string  in  the  subject  (as  if the subject string was
940                      flattened   with   erlang:iolist_to_binary/1   or   uni‐
941                      code:characters_to_binary/2   before  matching).  Notice
942                      that option unicode results in byte-oriented indexes  in
943                      a  (possibly virtual) UTF-8 encoded binary. A byte index
944                      tuple {0,2} can therefore represent one or  two  charac‐
945                      ters  when  unicode is in effect. This can seem counter-
946                      intuitive, but has been deemed the  most  effective  and
947                      useful  way to do it. To return lists instead can result
948                      in simpler code if that is desired. This return type  is
949                      the default.
950
951                    list:
952                      Returns  matching substrings as lists of characters (Er‐
953                      lang string()s). It option unicode is used  in  combina‐
954                      tion  with  the \C sequence in the regular expression, a
955                      captured subpattern can contain bytes that are not valid
956                      UTF-8  (\C  matches bytes regardless of character encod‐
957                      ing). In that case the list capturing can result in  the
958                      same  types  of tuples that unicode:characters_to_list/2
959                      can return, namely three-tuples with tag  incomplete  or
960                      error, the successfully converted characters and the in‐
961                      valid UTF-8 tail of the conversion as a binary. The best
962                      strategy  is to avoid using the \C sequence when captur‐
963                      ing lists.
964
965                    binary:
966                      Returns matching substrings as binaries. If option  uni‐
967                      code is used, these binaries are in UTF-8. If the \C se‐
968                      quence is used together with unicode, the  binaries  can
969                      be invalid UTF-8.
970
971                  In  general,  subpatterns  that were not assigned a value in
972                  the match are returned as the tuple {-1,0} when type is  in‐
973                  dex. Unassigned subpatterns are returned as the empty binary
974                  or list, respectively, for other return types. Consider  the
975                  following regular expression:
976
977                ".*((?<FOO>abdd)|a(..d)).*"
978
979                  There  are three explicitly capturing subpatterns, where the
980                  opening parenthesis position determines the order in the re‐
981                  sult,  hence  ((?<FOO>abdd)|a(..d))  is  subpattern index 1,
982                  (?<FOO>abdd) is subpattern index 2, and (..d) is  subpattern
983                  index 3. When matched against the following string:
984
985                "ABCabcdABC"
986
987                  the  subpattern  at index 2 does not match, as "abdd" is not
988                  present in the string, but the complete pattern matches (be‐
989                  cause  of the alternative a(..d)). The subpattern at index 2
990                  is therefore unassigned and the default return value is:
991
992                {match,[{0,10},{3,4},{-1,0},{4,3}]}
993
994                  Setting the capture Type to binary gives:
995
996                {match,[<<"ABCabcdABC">>,<<"abcd">>,<<>>,<<"bcd">>]}
997
998                  Here the empty binary (<<>>) represents the unassigned  sub‐
999                  pattern.  In  the  binary  case,  some information about the
1000                  matching is therefore lost, as <<>> can  also  be  an  empty
1001                  string captured.
1002
1003                  If  differentiation  between  empty matches and non-existing
1004                  subpatterns is necessary, use the type index and do the con‐
1005                  version to the final type in Erlang code.
1006
1007                  When  option global is speciified, the capture specification
1008                  affects each match separately, so that:
1009
1010                re:run("cacb","c(a|b)",[global,{capture,[1],list}]).
1011
1012                  gives
1013
1014                {match,[["a"],["b"]]}
1015
1016              For a descriptions of options  only  affecting  the  compilation
1017              step, see compile/2.
1018
1019       split(Subject, RE) -> SplitList
1020
1021              Types:
1022
1023                 Subject = iodata() | unicode:charlist()
1024                 RE = mp() | iodata()
1025                 SplitList = [iodata() | unicode:charlist()]
1026
1027              Same as split(Subject, RE, []).
1028
1029       split(Subject, RE, Options) -> SplitList
1030
1031              Types:
1032
1033                 Subject = iodata() | unicode:charlist()
1034                 RE = mp() | iodata() | unicode:charlist()
1035                 Options = [Option]
1036                 Option =
1037                     anchored  | notbol | noteol | notempty | notempty_atstart
1038                 |
1039                     {offset, integer() >= 0} |
1040                     {newline, nl_spec()} |
1041                     {match_limit, integer() >= 0} |
1042                     {match_limit_recursion, integer() >= 0} |
1043                     bsr_anycrlf | bsr_unicode |
1044                     {return, ReturnType} |
1045                     {parts, NumParts} |
1046                     group | trim | CompileOpt
1047                 NumParts = integer() >= 0 | infinity
1048                 ReturnType = iodata | list | binary
1049                 CompileOpt = compile_option()
1050                   See compile/2.
1051                 SplitList = [RetData] | [GroupedRetData]
1052                 GroupedRetData = [RetData]
1053                 RetData = iodata() | unicode:charlist() | binary() | list()
1054
1055              Splits the input into parts by finding tokens according  to  the
1056              regular  expression supplied. The splitting is basically done by
1057              running a global regular expression match and dividing the  ini‐
1058              tial  string  wherever  a match occurs. The matching part of the
1059              string is removed from the output.
1060
1061              As in run/3, an mp() compiled with option unicode requires  Sub‐
1062              ject  to be a Unicode charlist(). If compilation is done implic‐
1063              itly and the unicode compilation option  is  specified  to  this
1064              function,  both  the  regular  expression  and Subject are to be
1065              specified as valid Unicode charlist()s.
1066
1067              The result is given as a list of "strings", the  preferred  data
1068              type specified in option return (default iodata).
1069
1070              If  subexpressions  are specified in the regular expression, the
1071              matching subexpressions are returned in the  resulting  list  as
1072              well. For example:
1073
1074              re:split("Erlang","[ln]",[{return,list}]).
1075
1076              gives
1077
1078              ["Er","a","g"]
1079
1080              while
1081
1082              re:split("Erlang","([ln])",[{return,list}]).
1083
1084              gives
1085
1086              ["Er","l","a","n","g"]
1087
1088              The  text  matching the subexpression (marked by the parentheses
1089              in the regular expression) is inserted in the result list  where
1090              it  was  found.  This  means  that concatenating the result of a
1091              split where the whole regular expression is a single  subexpres‐
1092              sion  (as  in  the  last example) always results in the original
1093              string.
1094
1095              As there is no matching subexpression for the last part  in  the
1096              example  (the  "g"), nothing is inserted after that. To make the
1097              group of strings and the parts matching the subexpressions  more
1098              obvious,  one  can  use  option group, which groups together the
1099              part of the subject string with the parts  matching  the  subex‐
1100              pressions when the string was split:
1101
1102              re:split("Erlang","([ln])",[{return,list},group]).
1103
1104              gives
1105
1106              [["Er","l"],["a","n"],["g"]]
1107
1108              Here  the regular expression first matched the "l", causing "Er"
1109              to be the first part in the result. When the regular  expression
1110              matched,  the  (only) subexpression was bound to the "l", so the
1111              "l" is inserted in the group together with "Er". The next  match
1112              is  of  the "n", making "a" the next part to be returned. As the
1113              subexpression is bound to substring "n" in this case, the "n" is
1114              inserted into this group. The last group consists of the remain‐
1115              ing string, as no more matches are found.
1116
1117              By default,  all  parts  of  the  string,  including  the  empty
1118              strings, are returned from the function, for example:
1119
1120              re:split("Erlang","[lg]",[{return,list}]).
1121
1122              gives
1123
1124              ["Er","an",[]]
1125
1126              as  the  matching  of the "g" in the end of the string leaves an
1127              empty rest, which is also returned. This behavior  differs  from
1128              the  default behavior of the split function in Perl, where empty
1129              strings at the end are by default removed. To get the "trimming"
1130              default behavior of Perl, specify trim as an option:
1131
1132              re:split("Erlang","[lg]",[{return,list},trim]).
1133
1134              gives
1135
1136              ["Er","an"]
1137
1138              The  "trim"  option says; "give me as many parts as possible ex‐
1139              cept the empty ones", which sometimes can  be  useful.  You  can
1140              also specify how many parts you want, by specifying {parts,N}:
1141
1142              re:split("Erlang","[lg]",[{return,list},{parts,2}]).
1143
1144              gives
1145
1146              ["Er","ang"]
1147
1148              Notice  that  the last part is "ang", not "an", as splitting was
1149              specified into two parts, and the splitting  stops  when  enough
1150              parts  are  given,  which is why the result differs from that of
1151              trim.
1152
1153              More than three parts are not possible with this indata, so
1154
1155              re:split("Erlang","[lg]",[{return,list},{parts,4}]).
1156
1157              gives the same result as the default, which is to be  viewed  as
1158              "an infinite number of parts".
1159
1160              Specifying 0 as the number of parts gives the same effect as op‐
1161              tion trim. If subexpressions are captured, empty  subexpressions
1162              matched  at the end are also stripped from the result if trim or
1163              {parts,0} is specified.
1164
1165              The trim behavior  corresponds  exactly  to  the  Perl  default.
1166              {parts,N}, where N is a positive integer, corresponds exactly to
1167              the Perl behavior with a positive numerical third parameter. The
1168              default  behavior  of  split/3  corresponds to the Perl behavior
1169              when a negative integer is specified as the third parameter  for
1170              the Perl routine.
1171
1172              Summary of options not previously described for function run/3:
1173
1174                {return,ReturnType}:
1175                  Specifies how the parts of the original string are presented
1176                  in the result list. Valid types:
1177
1178                  iodata:
1179                    The variant of iodata() that gives the  least  copying  of
1180                    data  with the current implementation (often a binary, but
1181                    do not depend on it).
1182
1183                  binary:
1184                    All parts returned as binaries.
1185
1186                  list:
1187                    All parts returned as lists of characters ("strings").
1188
1189                group:
1190                  Groups together the part of the string with the parts of the
1191                  string  matching  the  subexpressions of the regular expres‐
1192                  sion.
1193
1194                  The return value from the function is in this case a  list()
1195                  of  list()s.  Each sublist begins with the string picked out
1196                  of the subject string, followed by the parts  matching  each
1197                  of  the subexpressions in order of occurrence in the regular
1198                  expression.
1199
1200                {parts,N}:
1201                  Specifies the number of parts the subject string  is  to  be
1202                  split into.
1203
1204                  The  number  of parts is to be a positive integer for a spe‐
1205                  cific maximum number of parts, and infinity for the  maximum
1206                  number of parts possible (the default). Specifying {parts,0}
1207                  gives as many parts as possible disregarding empty parts  at
1208                  the end, the same as specifying trim.
1209
1210                trim:
1211                  Specifies that empty parts at the end of the result list are
1212                  to be disregarded. The same as  specifying  {parts,0}.  This
1213                  corresponds  to  the  default behavior of the split built-in
1214                  function in Perl.
1215

PERL-LIKE REGULAR EXPRESSION SYNTAX

1217       The following sections contain reference material for the  regular  ex‐
1218       pressions  used  by  this  module. The information is based on the PCRE
1219       documentation, with changes where this module  behaves  differently  to
1220       the PCRE library.
1221

PCRE REGULAR EXPRESSION DETAILS

1223       The  syntax  and semantics of the regular expressions supported by PCRE
1224       are described in detail in the following sections. Perl's  regular  ex‐
1225       pressions  are  described in its own documentation, and regular expres‐
1226       sions in general are covered in many books, some with copious examples.
1227       Jeffrey   Friedl's   "Mastering   Regular  Expressions",  published  by
1228       O'Reilly, covers regular expressions in great detail. This  description
1229       of the PCRE regular expressions is intended as reference material.
1230
1231       The reference material is divided into the following sections:
1232
1233         * Special Start-of-Pattern Items
1234
1235         * Characters and Metacharacters
1236
1237         * Backslash
1238
1239         * Circumflex and Dollar
1240
1241         * Full Stop (Period, Dot) and \N
1242
1243         * Matching a Single Data Unit
1244
1245         * Square Brackets and Character Classes
1246
1247         * Posix Character Classes
1248
1249         * Vertical Bar
1250
1251         * Internal Option Setting
1252
1253         * Subpatterns
1254
1255         * Duplicate Subpattern Numbers
1256
1257         * Named Subpatterns
1258
1259         * Repetition
1260
1261         * Atomic Grouping and Possessive Quantifiers
1262
1263         * Back References
1264
1265         * Assertions
1266
1267         * Conditional Subpatterns
1268
1269         * Comments
1270
1271         * Recursive Patterns
1272
1273         * Subpatterns as Subroutines
1274
1275         * Oniguruma Subroutine Syntax
1276
1277         * Backtracking Control
1278

SPECIAL START-OF-PATTERN ITEMS

1280       Some options that can be passed to compile/2 can also be set by special
1281       items at the start of a pattern. These are not Perl-compatible, but are
1282       provided  to  make  these options accessible to pattern writers who are
1283       not able to change the program that processes the pattern.  Any  number
1284       of  these  items can appear, but they must all be together right at the
1285       start of the pattern string, and the letters must be in upper case.
1286
1287       UTF Support
1288
1289       Unicode support is basically UTF-8 based. To  use  Unicode  characters,
1290       you  either call compile/2 or run/3 with option unicode, or the pattern
1291       must start with one of these special sequences:
1292
1293       (*UTF8)
1294       (*UTF)
1295
1296       Both options give the same effect, the input string is  interpreted  as
1297       UTF-8. Notice that with these instructions, the automatic conversion of
1298       lists to UTF-8 is not performed by the re functions.  Therefore,  using
1299       these  sequences  is  not  recommended. Add option unicode when running
1300       compile/2 instead.
1301
1302       Some applications that allow their users to supply patterns can wish to
1303       restrict them to non-UTF data for security reasons. If option never_utf
1304       is set at compile time, (*UTF), and so on, are not allowed,  and  their
1305       appearance causes an error.
1306
1307       Unicode Property Support
1308
1309       The  following is another special sequence that can appear at the start
1310       of a pattern:
1311
1312       (*UCP)
1313
1314       This has the same effect as setting option  ucp:  it  causes  sequences
1315       such  as  \d  and  \w  to use Unicode properties to determine character
1316       types, instead of recognizing only characters with codes < 256  through
1317       a lookup table.
1318
1319       Disabling Startup Optimizations
1320
1321       If  a  pattern  starts  with (*NO_START_OPT), it has the same effect as
1322       setting option no_start_optimize at compile time.
1323
1324       Newline Conventions
1325
1326       PCRE supports five conventions for indicating line breaks in strings: a
1327       single  CR (carriage return) character, a single LF (line feed) charac‐
1328       ter, the two-character sequence CRLF, any of the three  preceding,  and
1329       any Unicode newline sequence.
1330
1331       A newline convention can also be specified by starting a pattern string
1332       with one of the following five sequences:
1333
1334         (*CR):
1335           Carriage return
1336
1337         (*LF):
1338           Line feed
1339
1340         (*CRLF):
1341           >Carriage return followed by line feed
1342
1343         (*ANYCRLF):
1344           Any of the three above
1345
1346         (*ANY):
1347           All Unicode newline sequences
1348
1349       These override the default and the options specified to compile/2.  For
1350       example, the following pattern changes the convention to CR:
1351
1352       (*CR)a.b
1353
1354       This  pattern  matches a\nb, as LF is no longer a newline. If more than
1355       one of them is present, the last one is used.
1356
1357       The newline convention affects where the circumflex and  dollar  asser‐
1358       tions are true. It also affects the interpretation of the dot metachar‐
1359       acter when dotall is not set, and the behavior of \N. However, it  does
1360       not affect what the \R escape sequence matches. By default, this is any
1361       Unicode newline sequence, for Perl compatibility. However, this can  be
1362       changed;  see  the  description  of  \R in section Newline Sequences. A
1363       change of the \R setting can be combined with a change of  the  newline
1364       convention.
1365
1366       Setting Match and Recursion Limits
1367
1368       The caller of run/3 can set a limit on the number of times the internal
1369       match() function is called and on the maximum depth of recursive calls.
1370       These  facilities  are  provided to catch runaway matches that are pro‐
1371       voked by patterns with huge matching trees (a typical example is a pat‐
1372       tern  with nested unlimited repeats) and to avoid running out of system
1373       stack by too much recursion. When  one  of  these  limits  is  reached,
1374       pcre_exec()  gives an error return. The limits can also be set by items
1375       at the start of the pattern of the following forms:
1376
1377       (*LIMIT_MATCH=d)
1378       (*LIMIT_RECURSION=d)
1379
1380       Here d is any number of decimal digits. However, the value of the  set‐
1381       ting  must  be less than the value set by the caller of run/3 for it to
1382       have any effect. That is, the pattern writer can lower the limit set by
1383       the  programmer, but not raise it. If there is more than one setting of
1384       one of these limits, the lower value is used.
1385
1386       The default value for both the limits is 10,000,000 in the  Erlang  VM.
1387       Notice  that the recursion limit does not affect the stack depth of the
1388       VM, as PCRE for Erlang is compiled in such a way that the  match  func‐
1389       tion never does recursion on the C stack.
1390
1391       Note  that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of
1392       the limits set by the caller, not increase them.
1393

CHARACTERS AND METACHARACTERS

1395       A regular expression is a pattern that is  matched  against  a  subject
1396       string  from  left  to right. Most characters stand for themselves in a
1397       pattern and match the corresponding characters in  the  subject.  As  a
1398       trivial  example,  the following pattern matches a portion of a subject
1399       string that is identical to itself:
1400
1401       The quick brown fox
1402
1403       When caseless matching is  specified  (option  caseless),  letters  are
1404       matched independently of case.
1405
1406       The  power of regular expressions comes from the ability to include al‐
1407       ternatives and repetitions in the pattern. These  are  encoded  in  the
1408       pattern by the use of metacharacters, which do not stand for themselves
1409       but instead are interpreted in some special way.
1410
1411       Two sets of metacharacters exist: those that are recognized anywhere in
1412       the  pattern  except  within square brackets, and those that are recog‐
1413       nized within square brackets. Outside square brackets, the  metacharac‐
1414       ters are as follows:
1415
1416         \:
1417           General escape character with many uses
1418
1419         ^:
1420           Assert start of string (or line, in multiline mode)
1421
1422         $:
1423           Assert end of string (or line, in multiline mode)
1424
1425         .:
1426           Match any character except newline (by default)
1427
1428         [:
1429           Start character class definition
1430
1431         |:
1432           Start of alternative branch
1433
1434         (:
1435           Start subpattern
1436
1437         ):
1438           End subpattern
1439
1440         ?:
1441           Extends  the  meaning of (, also 0 or 1 quantifier, also quantifier
1442           minimizer
1443
1444         *:
1445           0 or more quantifiers
1446
1447         +:
1448           1 or more quantifier, also "possessive quantifier"
1449
1450         {:
1451           Start min/max quantifier
1452
1453       Part of a pattern within square brackets is called a "character class".
1454       The following are the only metacharacters in a character class:
1455
1456         \:
1457           General escape character
1458
1459         ^:
1460           Negate the class, but only if the first character
1461
1462         -:
1463           Indicates character range
1464
1465         [:
1466           Posix character class (only if followed by Posix syntax)
1467
1468         ]:
1469           Terminates the character class
1470
1471       The following sections describe the use of each metacharacter.
1472

BACKSLASH

1474       The  backslash  character  has many uses. First, if it is followed by a
1475       character that is not a number or a letter, it takes away  any  special
1476       meaning  that  a character can have. This use of backslash as an escape
1477       character applies both inside and outside character classes.
1478
1479       For example, if you want to match a * character, you write  \*  in  the
1480       pattern.  This escaping action applies if the following character would
1481       otherwise be interpreted as a metacharacter, so it is  always  safe  to
1482       precede a non-alphanumeric with backslash to specify that it stands for
1483       itself. In particular, if you want to match a backslash, write \\.
1484
1485       In unicode mode, only ASCII numbers and letters have any special  mean‐
1486       ing after a backslash. All other characters (in particular, those whose
1487       code points are > 127) are treated as literals.
1488
1489       If a pattern is compiled with option extended, whitespace in  the  pat‐
1490       tern  (other than in a character class) and characters between a # out‐
1491       side a character class and the next newline are  ignored.  An  escaping
1492       backslash can be used to include a whitespace or # character as part of
1493       the pattern.
1494
1495       To remove the special meaning from a sequence of characters,  put  them
1496       between \Q and \E. This is different from Perl in that $ and @ are han‐
1497       dled as literals in \Q...\E sequences in PCRE,  while  $  and  @  cause
1498       variable interpolation in Perl. Notice the following examples:
1499
1500       Pattern            PCRE matches   Perl matches
1501
1502       \Qabc$xyz\E        abc$xyz        abc followed by the contents of $xyz
1503       \Qabc\$xyz\E       abc\$xyz       abc\$xyz
1504       \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1505
1506       The  \Q...\E  sequence  is recognized both inside and outside character
1507       classes. An isolated \E that is not preceded by \Q is ignored. If \Q is
1508       not  followed  by  \E  later in the pattern, the literal interpretation
1509       continues to the end of the pattern (that is,  \E  is  assumed  at  the
1510       end).  If  the  isolated \Q is inside a character class, this causes an
1511       error, as the character class is not terminated.
1512
1513       Non-Printing Characters
1514
1515       A second use of backslash provides a way of encoding non-printing char‐
1516       acters  in patterns in a visible manner. There is no restriction on the
1517       appearance of non-printing characters, apart from the binary zero  that
1518       terminates a pattern. When a pattern is prepared by text editing, it is
1519       often easier to use one of the following escape sequences than the  bi‐
1520       nary character it represents:
1521
1522         \a:
1523           Alarm, that is, the BEL character (hex 07)
1524
1525         \cx:
1526           "Control-x", where x is any ASCII character
1527
1528         \e:
1529           Escape (hex 1B)
1530
1531         \f:
1532           Form feed (hex 0C)
1533
1534         \n:
1535           Line feed (hex 0A)
1536
1537         \r:
1538           Carriage return (hex 0D)
1539
1540         \t:
1541           Tab (hex 09)
1542
1543         \0dd:
1544           Character with octal code 0dd
1545
1546         \ddd:
1547           Character with octal code ddd, or back reference
1548
1549         \o{ddd..}:
1550           character with octal code ddd..
1551
1552         \xhh:
1553           Character with hex code hh
1554
1555         \x{hhh..}:
1556           Character with hex code hhh..
1557
1558   Note:
1559       Note that \0dd is always an octal code, and that \8 and \9 are the lit‐
1560       eral characters "8" and "9".
1561
1562
1563       The precise effect of \cx on ASCII characters is as follows: if x is  a
1564       lowercase  letter,  it  is  converted  to upper case. Then bit 6 of the
1565       character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
1566       (A  is  41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
1567       hex 7B (; is 3B). If the data item (byte or 16-bit value) following  \c
1568       has  a  value  >  127, a compile-time error occurs. This locks out non-
1569       ASCII characters in all modes.
1570
1571       The \c facility was designed for use with ASCII  characters,  but  with
1572       the extension to Unicode it is even less useful than it once was.
1573
1574       After  \0  up  to two further octal digits are read. If there are fewer
1575       than two digits, just those that are present are  used.  Thus  the  se‐
1576       quence  \0\x\015  specifies two binary zeros followed by a CR character
1577       (code value 13). Make sure you supply two digits after the initial zero
1578       if the pattern character that follows is itself an octal digit.
1579
1580       The  escape \o must be followed by a sequence of octal digits, enclosed
1581       in braces. An error occurs if this is not the case. This  escape  is  a
1582       recent  addition  to Perl; it provides way of specifying character code
1583       points as octal numbers greater than 0777, and  it  also  allows  octal
1584       numbers and back references to be unambiguously specified.
1585
1586       For greater clarity and unambiguity, it is best to avoid following \ by
1587       a digit greater than zero. Instead, use \o{} or \x{} to specify charac‐
1588       ter  numbers,  and \g{} to specify back references. The following para‐
1589       graphs describe the old, ambiguous syntax.
1590
1591       The handling of a backslash followed by a digit other than 0 is compli‐
1592       cated,  and  Perl  has changed in recent releases, causing PCRE also to
1593       change. Outside a character class, PCRE reads the digit and any follow‐
1594       ing  digits as a decimal number. If the number is < 8, or if there have
1595       been at least that many previous capturing left parentheses in the  ex‐
1596       pression,  the entire sequence is taken as a back reference. A descrip‐
1597       tion of how this works is provided later, following the  discussion  of
1598       parenthesized subpatterns.
1599
1600       Inside  a  character class, or if the decimal number following \ is > 7
1601       and there have not been that many capturing subpatterns,  PCRE  handles
1602       \8 and \9 as the literal characters "8" and "9", and otherwise re-reads
1603       up to three octal digits following the backslash,  and  using  them  to
1604       generate  a data character. Any subsequent digits stand for themselves.
1605       For example:
1606
1607         \040:
1608           Another way of writing an ASCII space
1609
1610         \40:
1611           The same, provided there are < 40 previous capturing subpatterns
1612
1613         \7:
1614           Always a back reference
1615
1616         \11:
1617           Can be a back reference, or another way of writing a tab
1618
1619         \011:
1620           Always a tab
1621
1622         \0113:
1623           A tab followed by character "3"
1624
1625         \113:
1626           Can be a back reference, otherwise the character  with  octal  code
1627           113
1628
1629         \377:
1630           Can be a back reference, otherwise value 255 (decimal)
1631
1632         \81:
1633           Either a back reference, or the two characters "8" and "1"
1634
1635       Notice  that  octal  values >= 100 that are specified using this syntax
1636       must not be introduced by a leading zero, as no more than  three  octal
1637       digits are ever read.
1638
1639       By  default, after \x that is not followed by {, from zero to two hexa‐
1640       decimal digits are read (letters can be in upper or  lower  case).  Any
1641       number of hexadecimal digits may appear between \x{ and }. If a charac‐
1642       ter other than a hexadecimal digit appears between \x{  and  },  or  if
1643       there is no terminating }, an error occurs.
1644
1645       Characters whose value is less than 256 can be defined by either of the
1646       two syntaxes for \x. There is no difference in the way  they  are  han‐
1647       dled. For example, \xdc is exactly the same as \x{dc}.
1648
1649       Constraints on character values
1650
1651       Characters  that  are  specified using octal or hexadecimal numbers are
1652       limited to certain values, as follows:
1653
1654         8-bit non-UTF mode:
1655           < 0x100
1656
1657         8-bit UTF-8 mode:
1658           < 0x10ffff and a valid codepoint
1659
1660       Invalid Unicode codepoints are the range  0xd800  to  0xdfff  (the  so-
1661       called "surrogate" codepoints), and 0xffef.
1662
1663       Escape sequences in character classes
1664
1665       All the sequences that define a single character value can be used both
1666       inside and outside character classes. Also, inside a  character  class,
1667       \b is interpreted as the backspace character (hex 08).
1668
1669       \N  is not allowed in a character class. \B, \R, and \X are not special
1670       inside a character class. Like  other  unrecognized  escape  sequences,
1671       they are treated as the literal characters "B", "R", and "X". Outside a
1672       character class, these sequences have different meanings.
1673
1674       Unsupported Escape Sequences
1675
1676       In Perl, the sequences \l, \L, \u, and \U are recognized by its  string
1677       handler  and used to modify the case of following characters. PCRE does
1678       not support these escape sequences.
1679
1680       Absolute and Relative Back References
1681
1682       The sequence \g followed by an unsigned or a negative  number,  option‐
1683       ally  enclosed  in braces, is an absolute or relative back reference. A
1684       named back reference can be coded as \g{name}. Back references are dis‐
1685       cussed later, following the discussion of parenthesized subpatterns.
1686
1687       Absolute and Relative Subroutine Calls
1688
1689       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
1690       name or a number enclosed either in angle brackets or single quotes, is
1691       alternative  syntax for referencing a subpattern as a "subroutine". De‐
1692       tails are discussed  later.  Notice  that  \g{...}  (Perl  syntax)  and
1693       \g<...>  (Oniguruma  syntax)  are  not synonymous. The former is a back
1694       reference and the latter is a subroutine call.
1695
1696       Generic Character Types
1697
1698       Another use of backslash is for specifying generic character types:
1699
1700         \d:
1701           Any decimal digit
1702
1703         \D:
1704           Any character that is not a decimal digit
1705
1706         \h:
1707           Any horizontal whitespace character
1708
1709         \H:
1710           Any character that is not a horizontal whitespace character
1711
1712         \s:
1713           Any whitespace character
1714
1715         \S:
1716           Any character that is not a whitespace character
1717
1718         \v:
1719           Any vertical whitespace character
1720
1721         \V:
1722           Any character that is not a vertical whitespace character
1723
1724         \w:
1725           Any "word" character
1726
1727         \W:
1728           Any "non-word" character
1729
1730       There is also the single sequence \N, which matches a non-newline char‐
1731       acter.  This  is  the  same as the "." metacharacter when dotall is not
1732       set. Perl also uses \N to match characters by name, but PCRE  does  not
1733       support this.
1734
1735       Each  pair  of  lowercase and uppercase escape sequences partitions the
1736       complete set of characters into two disjoint sets. Any given  character
1737       matches  one, and only one, of each pair. The sequences can appear both
1738       inside and outside character classes. They each match one character  of
1739       the  appropriate  type.  If the current matching point is at the end of
1740       the subject string, all fail, as there is no character to match.
1741
1742       For compatibility with Perl, \s did not used to match the VT  character
1743       (code  11),  which  made it different from the the POSIX "space" class.
1744       However, Perl added VT at release 5.18, and PCRE followed suit  at  re‐
1745       lease 8.34. The default \s characters are now HT (9), LF (10), VT (11),
1746       FF (12), CR (13), and space (32), which are defined as white  space  in
1747       the  "C" locale. This list may vary if locale-specific matching is tak‐
1748       ing place. For example, in some locales the "non-breaking space"  char‐
1749       acter (\xA0) is recognized as white space, and in others the VT charac‐
1750       ter is not.
1751
1752       A "word" character is an underscore or any character that is  a  letter
1753       or  a  digit.  By default, the definition of letters and digits is con‐
1754       trolled by the PCRE low-valued character tables, in Erlang's case  (and
1755       without option unicode), the ISO Latin-1 character set.
1756
1757       By default, in unicode mode, characters with values > 255, that is, all
1758       characters outside the ISO Latin-1 character set, never match  \d,  \s,
1759       or  \w,  and  always match \D, \S, and \W. These sequences retain their
1760       original meanings from before UTF support was available, mainly for ef‐
1761       ficiency  reasons.  However,  if  option  ucp  is  set, the behavior is
1762       changed so that Unicode properties  are  used  to  determine  character
1763       types, as follows:
1764
1765         \d:
1766           Any character that \p{Nd} matches (decimal digit)
1767
1768         \s:
1769           Any character that \p{Z} or \h or \v
1770
1771         \w:
1772           Any character that matches \p{L} or \p{N} matches, plus underscore
1773
1774       The uppercase escapes match the inverse sets of characters. Notice that
1775       \d matches only decimal digits, while \w matches any Unicode digit, any
1776       Unicode letter, and underscore. Notice also that ucp affects \b and \B,
1777       as they are defined in terms of \w and \W. Matching these sequences  is
1778       noticeably slower when ucp is set.
1779
1780       The  sequences  \h, \H, \v, and \V are features that were added to Perl
1781       in release 5.10. In contrast to the other sequences, which  match  only
1782       ASCII  characters  by  default,  these always match certain high-valued
1783       code points, regardless if ucp is set.
1784
1785       The following are the horizontal space characters:
1786
1787         U+0009:
1788           Horizontal tab (HT)
1789
1790         U+0020:
1791           Space
1792
1793         U+00A0:
1794           Non-break space
1795
1796         U+1680:
1797           Ogham space mark
1798
1799         U+180E:
1800           Mongolian vowel separator
1801
1802         U+2000:
1803           En quad
1804
1805         U+2001:
1806           Em quad
1807
1808         U+2002:
1809           En space
1810
1811         U+2003:
1812           Em space
1813
1814         U+2004:
1815           Three-per-em space
1816
1817         U+2005:
1818           Four-per-em space
1819
1820         U+2006:
1821           Six-per-em space
1822
1823         U+2007:
1824           Figure space
1825
1826         U+2008:
1827           Punctuation space
1828
1829         U+2009:
1830           Thin space
1831
1832         U+200A:
1833           Hair space
1834
1835         U+202F:
1836           Narrow no-break space
1837
1838         U+205F:
1839           Medium mathematical space
1840
1841         U+3000:
1842           Ideographic space
1843
1844       The following are the vertical space characters:
1845
1846         U+000A:
1847           Line feed (LF)
1848
1849         U+000B:
1850           Vertical tab (VT)
1851
1852         U+000C:
1853           Form feed (FF)
1854
1855         U+000D:
1856           Carriage return (CR)
1857
1858         U+0085:
1859           Next line (NEL)
1860
1861         U+2028:
1862           Line separator
1863
1864         U+2029:
1865           Paragraph separator
1866
1867       In 8-bit, non-UTF-8 mode, only the characters with code  points  <  256
1868       are relevant.
1869
1870       Newline Sequences
1871
1872       Outside  a  character class, by default, the escape sequence \R matches
1873       any Unicode newline sequence. In non-UTF-8 mode, \R  is  equivalent  to
1874       the following:
1875
1876       (?>\r\n|\n|\x0b|\f|\r|\x85)
1877
1878       This is an example of an "atomic group", details are provided below.
1879
1880       This particular group matches either the two-character sequence CR fol‐
1881       lowed by LF, or one of the single characters LF (line feed, U+000A), VT
1882       (vertical  tab,  U+000B),  FF (form feed, U+000C), CR (carriage return,
1883       U+000D), or NEL (next line,  U+0085).  The  two-character  sequence  is
1884       treated as a single unit that cannot be split.
1885
1886       In  Unicode  mode,  two more characters whose code points are > 255 are
1887       added:  LS  (line  separator,  U+2028)  and  PS  (paragraph  separator,
1888       U+2029).  Unicode  character  property  support is not needed for these
1889       characters to be recognized.
1890
1891       \R can be restricted to match only CR, LF, or CRLF (instead of the com‐
1892       plete set of Unicode line endings) by setting option bsr_anycrlf either
1893       at compile time or when the pattern is matched. (BSR is an acronym  for
1894       "backslash R".) This can be made the default when PCRE is built; if so,
1895       the other behavior can be requested through option  bsr_unicode.  These
1896       settings can also be specified by starting a pattern string with one of
1897       the following sequences:
1898
1899         (*BSR_ANYCRLF):
1900           CR, LF, or CRLF only
1901
1902         (*BSR_UNICODE):
1903           Any Unicode newline sequence
1904
1905       These override the default and the options specified to  the  compiling
1906       function, but they can themselves be overridden by options specified to
1907       a matching function. Notice that these special settings, which are  not
1908       Perl-compatible,  are  recognized  only at the very start of a pattern,
1909       and that they must be in upper case.  If  more  than  one  of  them  is
1910       present,  the  last  one is used. They can be combined with a change of
1911       newline convention; for example, a pattern can start with:
1912
1913       (*ANY)(*BSR_ANYCRLF)
1914
1915       They can also be combined with the (*UTF8), (*UTF), or  (*UCP)  special
1916       sequences.  Inside  a character class, \R is treated as an unrecognized
1917       escape sequence, and so matches the letter "R" by default.
1918
1919       Unicode Character Properties
1920
1921       Three more escape sequences that match characters with specific proper‐
1922       ties  are  available. When in 8-bit non-UTF-8 mode, these sequences are
1923       limited to testing characters whose code points are < 256, but they  do
1924       work in this mode. The following are the extra escape sequences:
1925
1926         \p{xx}:
1927           A character with property xx
1928
1929         \P{xx}:
1930           A character without property xx
1931
1932         \X:
1933           A Unicode extended grapheme cluster
1934
1935       The  property  names represented by xx above are limited to the Unicode
1936       script names, the general category properties, "Any", which matches any
1937       character  (including  newline),  and some special PCRE properties (de‐
1938       scribed in the next section). Other Perl properties, such  as  "InMusi‐
1939       calSymbols",  are  currently not supported by PCRE. Notice that \P{Any}
1940       does not match any characters and always causes a match failure.
1941
1942       Sets of Unicode characters are defined as belonging to certain scripts.
1943       A  character from one of these sets can be matched using a script name,
1944       for example:
1945
1946       \p{Greek} \P{Han}
1947
1948       Those that are not part of an identified script are lumped together  as
1949       "Common". The following is the current list of scripts:
1950
1951         * Arabic
1952
1953         * Armenian
1954
1955         * Avestan
1956
1957         * Balinese
1958
1959         * Bamum
1960
1961         * Bassa_Vah
1962
1963         * Batak
1964
1965         * Bengali
1966
1967         * Bopomofo
1968
1969         * Braille
1970
1971         * Buginese
1972
1973         * Buhid
1974
1975         * Canadian_Aboriginal
1976
1977         * Carian
1978
1979         * Caucasian_Albanian
1980
1981         * Chakma
1982
1983         * Cham
1984
1985         * Cherokee
1986
1987         * Common
1988
1989         * Coptic
1990
1991         * Cuneiform
1992
1993         * Cypriot
1994
1995         * Cyrillic
1996
1997         * Deseret
1998
1999         * Devanagari
2000
2001         * Duployan
2002
2003         * Egyptian_Hieroglyphs
2004
2005         * Elbasan
2006
2007         * Ethiopic
2008
2009         * Georgian
2010
2011         * Glagolitic
2012
2013         * Gothic
2014
2015         * Grantha
2016
2017         * Greek
2018
2019         * Gujarati
2020
2021         * Gurmukhi
2022
2023         * Han
2024
2025         * Hangul
2026
2027         * Hanunoo
2028
2029         * Hebrew
2030
2031         * Hiragana
2032
2033         * Imperial_Aramaic
2034
2035         * Inherited
2036
2037         * Inscriptional_Pahlavi
2038
2039         * Inscriptional_Parthian
2040
2041         * Javanese
2042
2043         * Kaithi
2044
2045         * Kannada
2046
2047         * Katakana
2048
2049         * Kayah_Li
2050
2051         * Kharoshthi
2052
2053         * Khmer
2054
2055         * Khojki
2056
2057         * Khudawadi
2058
2059         * Lao
2060
2061         * Latin
2062
2063         * Lepcha
2064
2065         * Limbu
2066
2067         * Linear_A
2068
2069         * Linear_B
2070
2071         * Lisu
2072
2073         * Lycian
2074
2075         * Lydian
2076
2077         * Mahajani
2078
2079         * Malayalam
2080
2081         * Mandaic
2082
2083         * Manichaean
2084
2085         * Meetei_Mayek
2086
2087         * Mende_Kikakui
2088
2089         * Meroitic_Cursive
2090
2091         * Meroitic_Hieroglyphs
2092
2093         * Miao
2094
2095         * Modi
2096
2097         * Mongolian
2098
2099         * Mro
2100
2101         * Myanmar
2102
2103         * Nabataean
2104
2105         * New_Tai_Lue
2106
2107         * Nko
2108
2109         * Ogham
2110
2111         * Ol_Chiki
2112
2113         * Old_Italic
2114
2115         * Old_North_Arabian
2116
2117         * Old_Permic
2118
2119         * Old_Persian
2120
2121         * Oriya
2122
2123         * Old_South_Arabian
2124
2125         * Old_Turkic
2126
2127         * Osmanya
2128
2129         * Pahawh_Hmong
2130
2131         * Palmyrene
2132
2133         * Pau_Cin_Hau
2134
2135         * Phags_Pa
2136
2137         * Phoenician
2138
2139         * Psalter_Pahlavi
2140
2141         * Rejang
2142
2143         * Runic
2144
2145         * Samaritan
2146
2147         * Saurashtra
2148
2149         * Sharada
2150
2151         * Shavian
2152
2153         * Siddham
2154
2155         * Sinhala
2156
2157         * Sora_Sompeng
2158
2159         * Sundanese
2160
2161         * Syloti_Nagri
2162
2163         * Syriac
2164
2165         * Tagalog
2166
2167         * Tagbanwa
2168
2169         * Tai_Le
2170
2171         * Tai_Tham
2172
2173         * Tai_Viet
2174
2175         * Takri
2176
2177         * Tamil
2178
2179         * Telugu
2180
2181         * Thaana
2182
2183         * Thai
2184
2185         * Tibetan
2186
2187         * Tifinagh
2188
2189         * Tirhuta
2190
2191         * Ugaritic
2192
2193         * Vai
2194
2195         * Warang_Citi
2196
2197         * Yi
2198
2199       Each character has exactly one Unicode general category property, spec‐
2200       ified by a two-letter acronym. For compatibility  with  Perl,  negation
2201       can  be  specified  by including a circumflex between the opening brace
2202       and the property name. For example, \p{^Lu} is the same as \P{Lu}.
2203
2204       If only one letter is specified with \p or \P, it includes all the gen‐
2205       eral  category properties that start with that letter. In this case, in
2206       the absence of negation, the curly brackets in the escape sequence  are
2207       optional. The following two examples have the same effect:
2208
2209       \p{L}
2210       \pL
2211
2212       The following general category property codes are supported:
2213
2214         C:
2215           Other
2216
2217         Cc:
2218           Control
2219
2220         Cf:
2221           Format
2222
2223         Cn:
2224           Unassigned
2225
2226         Co:
2227           Private use
2228
2229         Cs:
2230           Surrogate
2231
2232         L:
2233           Letter
2234
2235         Ll:
2236           Lowercase letter
2237
2238         Lm:
2239           Modifier letter
2240
2241         Lo:
2242           Other letter
2243
2244         Lt:
2245           Title case letter
2246
2247         Lu:
2248           Uppercase letter
2249
2250         M:
2251           Mark
2252
2253         Mc:
2254           Spacing mark
2255
2256         Me:
2257           Enclosing mark
2258
2259         Mn:
2260           Non-spacing mark
2261
2262         N:
2263           Number
2264
2265         Nd:
2266           Decimal number
2267
2268         Nl:
2269           Letter number
2270
2271         No:
2272           Other number
2273
2274         P:
2275           Punctuation
2276
2277         Pc:
2278           Connector punctuation
2279
2280         Pd:
2281           Dash punctuation
2282
2283         Pe:
2284           Close punctuation
2285
2286         Pf:
2287           Final punctuation
2288
2289         Pi:
2290           Initial punctuation
2291
2292         Po:
2293           Other punctuation
2294
2295         Ps:
2296           Open punctuation
2297
2298         S:
2299           Symbol
2300
2301         Sc:
2302           Currency symbol
2303
2304         Sk:
2305           Modifier symbol
2306
2307         Sm:
2308           Mathematical symbol
2309
2310         So:
2311           Other symbol
2312
2313         Z:
2314           Separator
2315
2316         Zl:
2317           Line separator
2318
2319         Zp:
2320           Paragraph separator
2321
2322         Zs:
2323           Space separator
2324
2325       The  special property L& is also supported. It matches a character that
2326       has the Lu, Ll, or Lt property, that is, a letter that is  not  classi‐
2327       fied as a modifier or "other".
2328
2329       The  Cs  (Surrogate)  property  applies only to characters in the range
2330       U+D800 to U+DFFF. Such characters are invalid in Unicode strings and so
2331       cannot be tested by PCRE. Perl does not support the Cs property.
2332
2333       The long synonyms for property names supported by Perl (such as \p{Let‐
2334       ter}) are not supported by PCRE. It is not permitted to prefix  any  of
2335       these properties with "Is".
2336
2337       No  character  in  the  Unicode table has the Cn (unassigned) property.
2338       This property is instead assumed for any code point that is not in  the
2339       Unicode table.
2340
2341       Specifying  caseless  matching  does not affect these escape sequences.
2342       For example, \p{Lu} always matches only uppercase letters. This is dif‐
2343       ferent from the behavior of current versions of Perl.
2344
2345       Matching  characters by Unicode property is not fast, as PCRE must do a
2346       multistage table lookup to find a character property. That is  why  the
2347       traditional escape sequences such as \d and \w do not use Unicode prop‐
2348       erties in PCRE by default. However, you can make them do so by  setting
2349       option ucp or by starting the pattern with (*UCP).
2350
2351       Extended Grapheme Clusters
2352
2353       The  \X  escape  matches  any number of Unicode characters that form an
2354       "extended grapheme cluster", and treats the sequence as an atomic group
2355       (see below). Up to and including release 8.31, PCRE matched an earlier,
2356       simpler definition that was equivalent  to  (?>\PM\pM*).  That  is,  it
2357       matched  a  character  without the "mark" property, followed by zero or
2358       more characters with the "mark" property. Characters  with  the  "mark"
2359       property  are  typically  non-spacing accents that affect the preceding
2360       character.
2361
2362       This simple definition was extended in Unicode to include more  compli‐
2363       cated  kinds of composite character by giving each character a grapheme
2364       breaking property, and creating rules that use these properties to  de‐
2365       fine  the  boundaries  of  extended grapheme clusters. In PCRE releases
2366       later than 8.31, \X matches one of these clusters.
2367
2368       \X always matches at least one character. Then it  decides  whether  to
2369       add more characters according to the following rules for ending a clus‐
2370       ter:
2371
2372         * End at the end of the subject string.
2373
2374         * Do not end between CR and LF; otherwise end after any control char‐
2375           acter.
2376
2377         * Do  not  break  Hangul (a Korean script) syllable sequences. Hangul
2378           characters are of five types: L, V, T, LV, and LVT. An L  character
2379           can  be followed by an L, V, LV, or LVT character. An LV or V char‐
2380           acter can be followed by a V or T character. An LVT or T  character
2381           can be followed only by a T character.
2382
2383         * Do not end before extending characters or spacing marks. Characters
2384           with the "mark" property always have the "extend" grapheme breaking
2385           property.
2386
2387         * Do not end after prepend characters.
2388
2389         * Otherwise, end the cluster.
2390
2391       PCRE Additional Properties
2392
2393       In  addition to the standard Unicode properties described earlier, PCRE
2394       supports four more that make it possible to convert traditional  escape
2395       sequences, such as \w and \s to use Unicode properties. PCRE uses these
2396       non-standard, non-Perl properties internally when  the  ucp  option  is
2397       passed.  However,  they can also be used explicitly. The properties are
2398       as follows:
2399
2400         Xan:
2401           Any alphanumeric character. Matches characters that have either the
2402           L (letter) or the N (number) property.
2403
2404         Xps:
2405           Any  Posix  space character. Matches the characters tab, line feed,
2406           vertical tab, form feed, carriage return, and any  other  character
2407           that has the Z (separator) property.
2408
2409         Xsp:
2410           Any Perl space character. Matches the same as Xps, except that ver‐
2411           tical tab is excluded.
2412
2413         Xwd:
2414           Any Perl "word" character. Matches the same characters as Xan, plus
2415           underscore.
2416
2417       Perl and POSIX space are now the same. Perl added VT to its space char‐
2418       acter set at release 5.18 and PCRE changed at release 8.34.
2419
2420       Xan matches characters that have either the L (letter) or the  N  (num‐
2421       ber)  property. Xps matches the characters tab, linefeed, vertical tab,
2422       form feed, or carriage return, and any other character that has  the  Z
2423       (separator) property. Xsp is the same as Xps; it used to exclude verti‐
2424       cal tab, for Perl compatibility, but Perl changed, and so PCRE followed
2425       at  release  8.34.  Xwd matches the same characters as Xan, plus under‐
2426       score.
2427
2428       There is another non-standard property, Xuc, which matches any  charac‐
2429       ter  that  can  be represented by a Universal Character Name in C++ and
2430       other programming languages. These are the characters $,  @,  `  (grave
2431       accent),  and all characters with Unicode code points >= U+00A0, except
2432       for the surrogates U+D800 to U+DFFF.  Notice  that  most  base  (ASCII)
2433       characters  are  excluded.  (Universal  Character Names are of the form
2434       \uHHHH or \UHHHHHHHH, where H is a hexadecimal digit. Notice  that  the
2435       Xuc  property  does  not  match these sequences but the characters that
2436       they represent.)
2437
2438       Resetting the Match Start
2439
2440       The escape sequence \K causes any previously matched characters not  to
2441       be  included  in the final matched sequence. For example, the following
2442       pattern matches "foobar", but reports that it has matched "bar":
2443
2444       foo\Kbar
2445
2446       This feature is similar to a lookbehind  assertion  (described  below).
2447       However,  in  this  case, the part of the subject before the real match
2448       does not have to be of fixed length, as lookbehind assertions  do.  The
2449       use  of  \K does not interfere with the setting of captured substrings.
2450       For example, when the following pattern  matches  "foobar",  the  first
2451       substring is still set to "foo":
2452
2453       (foo)\Kbar
2454
2455       Perl  documents  that  the use of \K within assertions is "not well de‐
2456       fined". In PCRE, \K is acted upon when it occurs inside positive asser‐
2457       tions,  but is ignored in negative assertions. Note that when a pattern
2458       such as (?=ab\K) matches, the  reported  start  of  the  match  can  be
2459       greater than the end of the match.
2460
2461       Simple Assertions
2462
2463       The  final use of backslash is for certain simple assertions. An asser‐
2464       tion specifies a condition that must be met at a particular point in  a
2465       match,  without  consuming  any characters from the subject string. The
2466       use of subpatterns for more complicated assertions is described  below.
2467       The following are the backslashed assertions:
2468
2469         \b:
2470           Matches at a word boundary.
2471
2472         \B:
2473           Matches when not at a word boundary.
2474
2475         \A:
2476           Matches at the start of the subject.
2477
2478         \Z:
2479           Matches  at the end of the subject, and before a newline at the end
2480           of the subject.
2481
2482         \z:
2483           Matches only at the end of the subject.
2484
2485         \G:
2486           Matches at the first matching position in the subject.
2487
2488       Inside a character class, \b has a different meaning;  it  matches  the
2489       backspace  character.  If  any  other  of these assertions appears in a
2490       character class, by default it matches the corresponding literal  char‐
2491       acter (for example, \B matches the letter B).
2492
2493       A  word  boundary is a position in the subject string where the current
2494       character and the previous character do not both match \w or  \W  (that
2495       is,  one  matches  \w and the other matches \W), or the start or end of
2496       the string if the first or last character matches \w, respectively.  In
2497       UTF  mode,  the  meanings of \w and \W can be changed by setting option
2498       ucp. When this is done, it also affects \b and \B. PCRE and Perl do not
2499       have a separate "start of word" or "end of word" metasequence. However,
2500       whatever follows \b normally determines which it is. For  example,  the
2501       fragment \ba matches "a" at the start of a word.
2502
2503       The  \A,  \Z,  and \z assertions differ from the traditional circumflex
2504       and dollar (described in the next section) in that they only ever match
2505       at  the  very start and end of the subject string, whatever options are
2506       set. Thus, they are independent of multiline mode. These  three  asser‐
2507       tions  are  not affected by options notbol or noteol, which affect only
2508       the behavior of the circumflex and dollar metacharacters.  However,  if
2509       argument  startoffset of run/3 is non-zero, indicating that matching is
2510       to start at a point other than the beginning of  the  subject,  \A  can
2511       never match. The difference between \Z and \z is that \Z matches before
2512       a newline at the end of the string  and  at  the  very  end,  while  \z
2513       matches only at the end.
2514
2515       The  \G assertion is true only when the current matching position is at
2516       the start point of the match, as specified by argument  startoffset  of
2517       run/3. It differs from \A when the value of startoffset is non-zero. By
2518       calling run/3 multiple times with appropriate arguments, you can  mimic
2519       the  Perl  option /g, and it is in this kind of implementation where \G
2520       can be useful.
2521
2522       Notice, however, that the PCRE interpretation of \G, as  the  start  of
2523       the  current  match, is subtly different from Perl, which defines it as
2524       the end of the previous match. In Perl, these can be different when the
2525       previously  matched  string was empty. As PCRE does only one match at a
2526       time, it cannot reproduce this behavior.
2527
2528       If all the alternatives of a pattern begin with \G, the  expression  is
2529       anchored to the starting match position, and the "anchored" flag is set
2530       in the compiled regular expression.
2531

CIRCUMFLEX AND DOLLAR

2533       The circumflex and dollar  metacharacters  are  zero-width  assertions.
2534       That  is,  they test for a particular condition to be true without con‐
2535       suming any characters from the subject string.
2536
2537       Outside a character class, in the default matching mode, the circumflex
2538       character  is  an  assertion  that is true only if the current matching
2539       point is at the start of the subject string. If argument startoffset of
2540       run/3  is  non-zero,  circumflex can never match if option multiline is
2541       unset. Inside a character class, circumflex has an  entirely  different
2542       meaning (see below).
2543
2544       Circumflex  needs  not to be the first character of the pattern if some
2545       alternatives are involved, but it is to be the first thing in each  al‐
2546       ternative  in  which  it  appears  if the pattern is ever to match that
2547       branch. If all possible alternatives start with a circumflex, that  is,
2548       if  the  pattern  is constrained to match only at the start of the sub‐
2549       ject, it is said to be an "anchored" pattern.  (There  are  also  other
2550       constructs that can cause a pattern to be anchored.)
2551
2552       The  dollar  character is an assertion that is true only if the current
2553       matching point is at the end of the subject string, or immediately  be‐
2554       fore  a  newline  at the end of the string (by default). Notice however
2555       that it does not match the newline. Dollar needs not  to  be  the  last
2556       character  of  the pattern if some alternatives are involved, but it is
2557       to be the last item in any branch in which it appears.  Dollar  has  no
2558       special meaning in a character class.
2559
2560       The  meaning  of  dollar  can be changed so that it matches only at the
2561       very end of the string, by setting  option  dollar_endonly  at  compile
2562       time. This does not affect the \Z assertion.
2563
2564       The meanings of the circumflex and dollar characters are changed if op‐
2565       tion multiline is set. When this is the case, a circumflex matches  im‐
2566       mediately  after  internal  newlines  and  at  the start of the subject
2567       string. It does not match after a newline that ends the string. A  dol‐
2568       lar  matches  before  any  newlines in the string, and at the very end,
2569       when multiline is set. When newline is specified as  the  two-character
2570       sequence CRLF, isolated CR and LF characters do not indicate newlines.
2571
2572       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
2573       (where \n represents a newline) in multiline mode, but  not  otherwise.
2574       So, patterns that are anchored in single-line mode because all branches
2575       start with ^ are not anchored in multiline mode, and a match  for  cir‐
2576       cumflex is possible when argument startoffset of run/3 is non-zero. Op‐
2577       tion dollar_endonly is ignored if multiline is set.
2578
2579       Notice that the sequences \A, \Z, and \z can be used to match the start
2580       and  end  of  the  subject  in both modes. If all branches of a pattern
2581       start with \A, it is always anchored, regardless if multiline is set.
2582

FULL STOP (PERIOD, DOT) AND \N

2584       Outside a character class, a dot in the pattern matches  any  character
2585       in  the  subject  string except (by default) a character that signifies
2586       the end of a line.
2587
2588       When a line ending is defined as a single character, dot never  matches
2589       that  character. When the two-character sequence CRLF is used, dot does
2590       not match CR if it is immediately followed by LF, otherwise it  matches
2591       all  characters (including isolated CRs and LFs). When any Unicode line
2592       endings are recognized, dot does not match CR, LF, or any of the  other
2593       line-ending characters.
2594
2595       The behavior of dot regarding newlines can be changed. If option dotall
2596       is set, a dot matches any character, without  exception.  If  the  two-
2597       character  sequence CRLF is present in the subject string, it takes two
2598       dots to match it.
2599
2600       The handling of dot is entirely independent of the handling of  circum‐
2601       flex  and  dollar, the only relationship is that both involve newlines.
2602       Dot has no special meaning in a character class.
2603
2604       The escape sequence \N behaves like a dot, except that it  is  not  af‐
2605       fected  by option PCRE_DOTALL. That is, it matches any character except
2606       one that signifies the end of a line. Perl also uses \N to match  char‐
2607       acters by name but PCRE does not support this.
2608

MATCHING A SINGLE DATA UNIT

2610       Outside  a  character  class,  the  escape sequence \C matches any data
2611       unit, regardless if a UTF mode is set. One data unit is one  byte.  Un‐
2612       like  a  dot,  \C always matches line-ending characters. The feature is
2613       provided in Perl to match individual bytes in UTF-8 mode, but it is un‐
2614       clear  how it can usefully be used. As \C breaks up characters into in‐
2615       dividual data units, matching one unit with \C in a UTF mode means that
2616       the remaining string can start with a malformed UTF character. This has
2617       undefined results, as  PCRE  assumes  that  it  deals  with  valid  UTF
2618       strings.
2619
2620       PCRE  does  not  allow \C to appear in lookbehind assertions (described
2621       below) in a UTF mode, as this would make it impossible to calculate the
2622       length of the lookbehind.
2623
2624       The  \C  escape  sequence is best avoided. However, one way of using it
2625       that avoids the problem of malformed UTF characters is to use  a  look‐
2626       ahead  to  check  the length of the next character, as in the following
2627       pattern, which can be used with a UTF-8 string (ignore  whitespace  and
2628       line breaks):
2629
2630       (?| (?=[\x00-\x7f])(\C) |
2631           (?=[\x80-\x{7ff}])(\C)(\C) |
2632           (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
2633           (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
2634
2635       A  group  that starts with (?| resets the capturing parentheses numbers
2636       in each alternative (see section Duplicate Subpattern Numbers). The as‐
2637       sertions at the start of each branch check the next UTF-8 character for
2638       values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The indi‐
2639       vidual bytes of the character are then captured by the appropriate num‐
2640       ber of groups.
2641

SQUARE BRACKETS AND CHARACTER CLASSES

2643       An opening square bracket introduces a character class, terminated by a
2644       closing square bracket. A closing square bracket on its own is not spe‐
2645       cial by default. However, if option PCRE_JAVASCRIPT_COMPAT  is  set,  a
2646       lone  closing  square bracket causes a compile-time error. If a closing
2647       square bracket is required as a member of the class, it is  to  be  the
2648       first  data  character  in  the  class (after an initial circumflex, if
2649       present) or escaped with a backslash.
2650
2651       A character class matches a single character in the subject. In  a  UTF
2652       mode,  the  character  can  be  more than one data unit long. A matched
2653       character must be in the set of characters defined by the class, unless
2654       the  first  character in the class definition is a circumflex, in which
2655       case the subject character must not be in the set defined by the class.
2656       If a circumflex is required as a member of the class, ensure that it is
2657       not the first character, or escape it with a backslash.
2658
2659       For example, the character class [aeiou] matches any  lowercase  vowel,
2660       while [^aeiou] matches any character that is not a lowercase vowel. No‐
2661       tice that a circumflex is just a convenient notation for specifying the
2662       characters  that  are in the class by enumerating those that are not. A
2663       class that starts with a circumflex is not an assertion; it still  con‐
2664       sumes  a  character  from the subject string, and therefore it fails if
2665       the current pointer is at the end of the string.
2666
2667       In UTF-8 mode, characters with values > 255 (0xffff) can be included in
2668       a class as a literal string of data units, or by using the \x{ escaping
2669       mechanism.
2670
2671       When caseless matching is set, any letters in a  class  represent  both
2672       their uppercase and lowercase versions. For example, a caseless [aeiou]
2673       matches "A" and "a", and a caseless [^aeiou] does not match "A", but  a
2674       caseful  version would. In a UTF mode, PCRE always understands the con‐
2675       cept of case for characters whose values are < 256, so caseless  match‐
2676       ing  is always possible. For characters with higher values, the concept
2677       of case is supported only if PCRE is  compiled  with  Unicode  property
2678       support. If you want to use caseless matching in a UTF mode for charac‐
2679       ters >=, ensure that PCRE is compiled with Unicode property support and
2680       with UTF support.
2681
2682       Characters  that can indicate line breaks are never treated in any spe‐
2683       cial way when matching character classes, whatever line-ending sequence
2684       is  in use, and whatever setting of options PCRE_DOTALL and PCRE_MULTI‐
2685       LINE is used. A class such as [^a] always matches one of these  charac‐
2686       ters.
2687
2688       The  minus (hyphen) character can be used to specify a range of charac‐
2689       ters in a character class. For example, [d-m] matches  any  letter  be‐
2690       tween  d and m, inclusive. If a minus character is required in a class,
2691       it must be escaped with a backslash or appear in a  position  where  it
2692       cannot  be interpreted as indicating a range, typically as the first or
2693       last character in the class, or immediately after a range. For example,
2694       [b-d-z] matches letters in the range b to d, a hyphen character, or z.
2695
2696       The  literal  character  "]"  cannot be the end character of a range. A
2697       pattern such as [W-]46] is interpreted as a  class  of  two  characters
2698       ("W"  and  "-")  followed  by a literal string "46]", so it would match
2699       "W46]" or "-46]". However, if "]" is escaped with a  backslash,  it  is
2700       interpreted  as the end of range, so [W-\]46] is interpreted as a class
2701       containing a range followed by two other characters. The octal or hexa‐
2702       decimal representation of "]" can also be used to end a range.
2703
2704       An  error is generated if a POSIX character class (see below) or an es‐
2705       cape sequence other than one that defines a single character appears at
2706       a  point  where  a  range  ending  character  is expected. For example,
2707       [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not.
2708
2709       Ranges operate in the collating sequence of character values. They  can
2710       also  be  used  for  characters  specified  numerically,  for  example,
2711       [\000-\037]. Ranges can include any characters that are valid  for  the
2712       current mode.
2713
2714       If a range that includes letters is used when caseless matching is set,
2715       it matches the letters in either case. For example, [W-c] is equivalent
2716       to  [][\\^_`wxyzabc], matched caselessly. In a non-UTF mode, if charac‐
2717       ter tables for a French locale are in use, [\xc8-\xcb] matches accented
2718       E  characters in both cases. In UTF modes, PCRE supports the concept of
2719       case for characters with values > 255 only when  it  is  compiled  with
2720       Unicode property support.
2721
2722       The  character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
2723       \w, and \W can appear in a character class, and add the characters that
2724       they  match to the class. For example, [\dABCDEF] matches any hexadeci‐
2725       mal digit. In UTF modes, option ucp affects the meanings of \d, \s,  \w
2726       and  their uppercase partners, just as it does when they appear outside
2727       a character class, as described in section Generic Character Types ear‐
2728       lier. The escape sequence \b has a different meaning inside a character
2729       class; it matches the backspace character. The sequences  \B,  \N,  \R,
2730       and  \X are not special inside a character class. Like any other unrec‐
2731       ognized escape sequences, they are treated as  the  literal  characters
2732       "B", "N", "R", and "X".
2733
2734       A  circumflex  can  conveniently  be  used with the uppercase character
2735       types to specify a more restricted set of characters than the  matching
2736       lowercase  type. For example, class [^\W_] matches any letter or digit,
2737       but not underscore, while [\w] includes underscore. A positive  charac‐
2738       ter  class is to be read as "something OR something OR ..." and a nega‐
2739       tive class as "NOT something AND NOT something AND NOT ...".
2740
2741       Only the following metacharacters are recognized in character classes:
2742
2743         * Backslash
2744
2745         * Hyphen (only where it can be interpreted as specifying a range)
2746
2747         * Circumflex (only at the start)
2748
2749         * Opening square bracket (only when it can be interpreted  as  intro‐
2750           ducing  a Posix class name, or for a special compatibility feature;
2751           see the next two sections)
2752
2753         * Terminating closing square bracket
2754
2755       However, escaping other non-alphanumeric characters does no harm.
2756

POSIX CHARACTER CLASSES

2758       Perl supports the Posix notation for character classes. This uses names
2759       enclosed  by  [: and :] within the enclosing square brackets. PCRE also
2760       supports this notation. For example, the following  matches  "0",  "1",
2761       any alphabetic character, or "%":
2762
2763       [01[:alpha:]%]
2764
2765       The following are the supported class names:
2766
2767         alnum:
2768           Letters and digits
2769
2770         alpha:
2771           Letters
2772
2773         blank:
2774           Space or tab only
2775
2776         cntrl:
2777           Control characters
2778
2779         digit:
2780           Decimal digits (same as \d)
2781
2782         graph:
2783           Printing characters, excluding space
2784
2785         lower:
2786           Lowercase letters
2787
2788         print:
2789           Printing characters, including space
2790
2791         punct:
2792           Printing characters, excluding letters, digits, and space
2793
2794         space:
2795           Whitespace (the same as \s from PCRE 8.34)
2796
2797         upper:
2798           Uppercase letters
2799
2800         word:
2801           "Word" characters (same as \w)
2802
2803         xdigit:
2804           Hexadecimal digits
2805
2806       There  is  another  character  class,  ascii,  that erroneously matches
2807       Latin-1 characters instead of the 0-127 range specified by POSIX.  This
2808       cannot  be fixed without altering the behaviour of other classes, so we
2809       recommend matching the range with [\\0-\x7f] instead.
2810
2811       The default "space" characters are HT (9), LF (10), VT (11),  FF  (12),
2812       CR  (13),  and space (32). If locale-specific matching is taking place,
2813       the list of space characters may be different; there may  be  fewer  or
2814       more of them. "Space" used to be different to \s, which did not include
2815       VT, for Perl compatibility. However, Perl changed at release 5.18,  and
2816       PCRE followed at release 8.34. "Space" and \s now match the same set of
2817       characters.
2818
2819       The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
2820       from  Perl  5.8. Another Perl extension is negation, which is indicated
2821       by a ^ character after the colon. For example,  the  following  matches
2822       "1", "2", or any non-digit:
2823
2824       [12[:^digit:]]
2825
2826       PCRE (and Perl) also recognize the Posix syntax [.ch.] and [=ch=] where
2827       "ch" is a "collating element", but these are not supported, and an  er‐
2828       ror is given if they are encountered.
2829
2830       By  default, characters with values > 255 do not match any of the Posix
2831       character classes. However, if option PCRE_UCP is passed  to  pcre_com‐
2832       pile(), some of the classes are changed so that Unicode character prop‐
2833       erties are used. This is achieved by replacing certain Posix classes by
2834       other sequences, as follows:
2835
2836         [:alnum:]:
2837           Becomes \p{Xan}
2838
2839         [:alpha:]:
2840           Becomes \p{L}
2841
2842         [:blank:]:
2843           Becomes \h
2844
2845         [:digit:]:
2846           Becomes \p{Nd}
2847
2848         [:lower:]:
2849           Becomes \p{Ll}
2850
2851         [:space:]:
2852           Becomes \p{Xps}
2853
2854         [:upper:]:
2855           Becomes \p{Lu}
2856
2857         [:word:]:
2858           Becomes \p{Xwd}
2859
2860       Negated versions, such as [:^alpha:], use \P instead of \p. Three other
2861       POSIX classes are handled specially in UCP mode:
2862
2863         [:graph:]:
2864           This matches characters that have glyphs that mark  the  page  when
2865           printed.  In Unicode property terms, it matches all characters with
2866           the L, M, N, P, S, or Cf properties, except for:
2867
2868           U+061C:
2869             Arabic Letter Mark
2870
2871           U+180E:
2872             Mongolian Vowel Separator
2873
2874           U+2066 - U+2069:
2875             Various "isolate"s
2876
2877         [:print:]:
2878           This matches the same characters as [:graph:] plus space characters
2879           that are not controls, that is, characters with the Zs property.
2880
2881         [:punct:]:
2882           This  matches  all characters that have the Unicode P (punctuation)
2883           property, plus those characters whose code points are less than 128
2884           that have the S (Symbol) property.
2885
2886       The  other  POSIX classes are unchanged, and match only characters with
2887       code points less than 128.
2888
2889       Compatibility Feature for Word Boundaries
2890
2891       In the POSIX.2 compliant library that was included in 4.4BSD Unix,  the
2892       ugly  syntax  [[:<:]]  and [[:>:]] is used for matching "start of word"
2893       and "end of word". PCRE treats these items as follows:
2894
2895         [[:<:]]:
2896           is converted to \b(?=\w)
2897
2898         [[:>:]]:
2899           is converted to \b(?<=\w)
2900
2901       Only these exact character sequences are recognized. A sequence such as
2902       [a[:<:]b]  provokes  error  for  an unrecognized POSIX class name. This
2903       support is not compatible with Perl. It is provided to help  migrations
2904       from other environments, and is best not used in any new patterns. Note
2905       that \b matches at the start and the end of a word (see "Simple  asser‐
2906       tions"  above),  and in a Perl-style pattern the preceding or following
2907       character normally shows which is wanted, without the need for the  as‐
2908       sertions  that are used above in order to give exactly the POSIX behav‐
2909       iour.
2910

VERTICAL BAR

2912       Vertical bar characters are used to separate alternative patterns.  For
2913       example, the following pattern matches either "gilbert" or "sullivan":
2914
2915       gilbert|sullivan
2916
2917       Any number of alternatives can appear, and an empty alternative is per‐
2918       mitted (matching the empty string). The matching process tries each al‐
2919       ternative  in  turn, from left to right, and the first that succeeds is
2920       used. If the alternatives are within a subpattern (defined  in  section
2921       Subpatterns),  "succeeds" means matching the remaining main pattern and
2922       the alternative in the subpattern.
2923

INTERNAL OPTION SETTING

2925       The  settings  of  the  Perl-compatible  options  caseless,  multiline,
2926       dotall,  and  extended  can be changed from within the pattern by a se‐
2927       quence of Perl option letters enclosed between "(?" and ")". The option
2928       letters are as follows:
2929
2930         i:
2931           For caseless
2932
2933         m:
2934           For multiline
2935
2936         s:
2937           For dotall
2938
2939         x:
2940           For extended
2941
2942       For example, (?im) sets caseless, multiline matching. These options can
2943       also be unset by preceding the letter with a hyphen. A combined setting
2944       and  unsetting  such  as  (?im-sx),  which sets caseless and multiline,
2945       while unsetting dotall and extended, is also permitted. If a letter ap‐
2946       pears both before and after the hyphen, the option is unset.
2947
2948       The  PCRE-specific options dupnames, ungreedy, and extra can be changed
2949       in the same way as the Perl-compatible options by using the  characters
2950       J, U, and X respectively.
2951
2952       When  one of these option changes occurs at top-level (that is, not in‐
2953       side subpattern parentheses), the change applies to  the  remainder  of
2954       the pattern that follows.
2955
2956       An  option change within a subpattern (see section Subpatterns) affects
2957       only that part of the subpattern that follows  it.  So,  the  following
2958       matches  abc  and  aBc  and  no other strings (assuming caseless is not
2959       used):
2960
2961       (a(?i)b)c
2962
2963       By this means, options can be made to have different settings  in  dif‐
2964       ferent  parts  of  the  pattern. Any changes made in one alternative do
2965       carry on into subsequent branches within the same subpattern. For exam‐
2966       ple:
2967
2968       (a(?i)b|c)
2969
2970       matches  "ab", "aB", "c", and "C", although when matching "C" the first
2971       branch is abandoned before the option setting. This is because the  ef‐
2972       fects  of  option  settings  occur at compile time. There would be some
2973       weird behavior otherwise.
2974
2975   Note:
2976       Other PCRE-specific options can be set by the application when the com‐
2977       piling or matching functions are called. Sometimes the pattern can con‐
2978       tain special leading sequences, such as (*CRLF), to override  what  the
2979       application has set or what has been defaulted. Details are provided in
2980       section  Newline Sequences earlier.
2981
2982       The (*UTF8) and (*UCP) leading sequences can be used  to  set  UTF  and
2983       Unicode  property modes. They are equivalent to setting options unicode
2984       and ucp, respectively. The (*UTF) sequence is a  generic  version  that
2985       can be used with any of the libraries. However, the application can set
2986       option never_utf, which locks out the use of the (*UTF) sequences.
2987
2988

SUBPATTERNS

2990       Subpatterns are delimited by parentheses (round brackets), which can be
2991       nested. Turning part of a pattern into a subpattern does two things:
2992
2993         1.:
2994           It localizes a set of alternatives. For example, the following pat‐
2995           tern matches "cataract", "caterpillar", or "cat":
2996
2997         cat(aract|erpillar|)
2998
2999           Without the parentheses, it would match "cataract", "erpillar",  or
3000           an empty string.
3001
3002         2.:
3003           It  sets up the subpattern as a capturing subpattern. That is, when
3004           the complete pattern matches, that portion of  the  subject  string
3005           that  matched  the  subpattern is passed back to the caller through
3006           the return value of run/3.
3007
3008       Opening parentheses are counted from left to right (starting from 1) to
3009       obtain  numbers  for  the  capturing  subpatterns.  For example, if the
3010       string "the red king" is matched against  the  following  pattern,  the
3011       captured substrings are "red king", "red", and "king", and are numbered
3012       1, 2, and 3, respectively:
3013
3014       the ((red|white) (king|queen))
3015
3016       It is not always helpful that plain parentheses fulfill two  functions.
3017       Often  a  grouping  subpattern is required without a capturing require‐
3018       ment. If an opening parenthesis is followed by a question  mark  and  a
3019       colon,  the  subpattern  does  not do any capturing, and is not counted
3020       when computing the number of any subsequent capturing subpatterns.  For
3021       example, if the string "the white queen" is matched against the follow‐
3022       ing pattern, the captured substrings are "white queen" and "queen", and
3023       are numbered 1 and 2:
3024
3025       the ((?:red|white) (king|queen))
3026
3027       The maximum number of capturing subpatterns is 65535.
3028
3029       As  a  convenient shorthand, if any option settings are required at the
3030       start of a non-capturing subpattern, the option letters can appear  be‐
3031       tween  "?" and ":". Thus, the following two patterns match the same set
3032       of strings:
3033
3034       (?i:saturday|sunday)
3035       (?:(?i)saturday|sunday)
3036
3037       As alternative branches are tried from left to right, and  options  are
3038       not reset until the end of the subpattern is reached, an option setting
3039       in one branch does affect subsequent branches, so  the  above  patterns
3040       match both "SUNDAY" and "Saturday".
3041

DUPLICATE SUBPATTERN NUMBERS

3043       Perl  5.10  introduced a feature where each alternative in a subpattern
3044       uses the same numbers for its capturing parentheses. Such a  subpattern
3045       starts  with (?| and is itself a non-capturing subpattern. For example,
3046       consider the following pattern:
3047
3048       (?|(Sat)ur|(Sun))day
3049
3050       As the two alternatives are inside a (?| group, both sets of  capturing
3051       parentheses  are  numbered one. Thus, when the pattern matches, you can
3052       look at captured substring number one, whichever  alternative  matched.
3053       This  construct is useful when you want to capture a part, but not all,
3054       of one of many alternatives. Inside a (?| group, parentheses  are  num‐
3055       bered  as  usual,  but the number is reset at the start of each branch.
3056       The numbers of any capturing parentheses  that  follow  the  subpattern
3057       start  after the highest number used in any branch. The following exam‐
3058       ple is from the Perl documentation;  the  numbers  underneath  show  in
3059       which buffer the captured content is stored:
3060
3061       # before  ---------------branch-reset----------- after
3062       / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
3063       # 1            2         2  3        2     3     4
3064
3065       A  back  reference  to a numbered subpattern uses the most recent value
3066       that is set for that number by any subpattern.  The  following  pattern
3067       matches "abcabc" or "defdef":
3068
3069       /(?|(abc)|(def))\1/
3070
3071       In  contrast,  a subroutine call to a numbered subpattern always refers
3072       to the first one in the pattern with the given  number.  The  following
3073       pattern matches "abcabc" or "defabc":
3074
3075       /(?|(abc)|(def))(?1)/
3076
3077       If  a  condition  test for a subpattern having matched refers to a non-
3078       unique number, the test is true if any of the subpatterns of that  num‐
3079       ber have matched.
3080
3081       An alternative approach using this "branch reset" feature is to use du‐
3082       plicate named subpatterns, as described in the next section.
3083

NAMED SUBPATTERNS

3085       Identifying capturing parentheses by number is simple, but  it  can  be
3086       hard  to  keep track of the numbers in complicated regular expressions.
3087       Also, if an expression is modified, the numbers  can  change.  To  help
3088       with  this  difficulty,  PCRE  supports the naming of subpatterns. This
3089       feature was not added to Perl until release 5.10. Python had  the  fea‐
3090       ture  earlier,  and PCRE introduced it at release 4.0, using the Python
3091       syntax. PCRE now supports both the Perl and the Python syntax. Perl al‐
3092       lows identically numbered subpatterns to have different names, but PCRE
3093       does not.
3094
3095       In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)
3096       or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
3097       to capturing parentheses from other parts of the pattern, such as  back
3098       references,  recursion, and conditions, can be made by name and by num‐
3099       ber.
3100
3101       Names consist of up to 32 alphanumeric characters and underscores,  but
3102       must  start with a non-digit. Named capturing parentheses are still al‐
3103       located numbers as well as names, exactly as  if  the  names  were  not
3104       present.  The  capture  specification  to run/3 can use named values if
3105       they are present in the regular expression.
3106
3107       By default, a name must be unique within a pattern, but this constraint
3108       can  be  relaxed by setting option dupnames at compile time. (Duplicate
3109       names are also always permitted for subpatterns with the  same  number,
3110       set  up  as  described in the previous section.) Duplicate names can be
3111       useful for patterns where only one instance of  the  named  parentheses
3112       can match. Suppose that you want to match the name of a weekday, either
3113       as a 3-letter abbreviation or as the full name, and in both  cases  you
3114       want  to  extract the abbreviation. The following pattern (ignoring the
3115       line breaks) does the job:
3116
3117       (?<DN>Mon|Fri|Sun)(?:day)?|
3118       (?<DN>Tue)(?:sday)?|
3119       (?<DN>Wed)(?:nesday)?|
3120       (?<DN>Thu)(?:rsday)?|
3121       (?<DN>Sat)(?:urday)?
3122
3123       There are five capturing substrings, but only one is ever set  after  a
3124       match.  (An alternative way of solving this problem is to use a "branch
3125       reset" subpattern, as described in the previous section.)
3126
3127       For capturing named subpatterns which names are not unique,  the  first
3128       matching  occurrence (counted from left to right in the subject) is re‐
3129       turned from run/3, if the name is specified in the values part  of  the
3130       capture  statement. The all_names capturing value matches all the names
3131       in the same way.
3132
3133   Note:
3134       You cannot use different names to distinguish between  two  subpatterns
3135       with  the same number, as PCRE uses only the numbers when matching. For
3136       this reason, an error is given at compile time if different  names  are
3137       specified to subpatterns with the same number. However, you can specify
3138       the same name to subpatterns with the same number, even  when  dupnames
3139       is not set.
3140
3141

REPETITION

3143       Repetition  is  specified  by  quantifiers, which can follow any of the
3144       following items:
3145
3146         * A literal data character
3147
3148         * The dot metacharacter
3149
3150         * The \C escape sequence
3151
3152         * The \X escape sequence
3153
3154         * The \R escape sequence
3155
3156         * An escape such as \d or \pL that matches a single character
3157
3158         * A character class
3159
3160         * A back reference (see the next section)
3161
3162         * A parenthesized subpattern (including assertions)
3163
3164         * A subroutine call to a subpattern (recursive or otherwise)
3165
3166       The general repetition quantifier specifies a minimum and maximum  num‐
3167       ber  of  permitted matches, by giving the two numbers in curly brackets
3168       (braces), separated by a comma. The numbers must be <  65536,  and  the
3169       first  must  be less than or equal to the second. For example, the fol‐
3170       lowing matches "zz", "zzz", or "zzzz":
3171
3172       z{2,4}
3173
3174       A closing brace on its own is not a special character.  If  the  second
3175       number  is  omitted, but the comma is present, there is no upper limit.
3176       If the second number and the comma are  both  omitted,  the  quantifier
3177       specifies  an  exact  number  of  required matches. Thus, the following
3178       matches at least three successive vowels, but can match many more:
3179
3180       [aeiou]{3,}
3181
3182       The following matches exactly eight digits:
3183
3184       \d{8}
3185
3186       An opening curly bracket that appears in a position where a  quantifier
3187       is  not allowed, or one that does not match the syntax of a quantifier,
3188       is taken as a literal character. For example, {,6} is not a quantifier,
3189       but a literal string of four characters.
3190
3191       In  Unicode  mode, quantifiers apply to characters rather than to indi‐
3192       vidual data units. Thus, for example, \x{100}{2}  matches  two  charac‐
3193       ters,  each  of  which  is  represented by a 2-byte sequence in a UTF-8
3194       string. Similarly, \X{3} matches three Unicode extended grapheme  clus‐
3195       ters,  each  of  which  can be many data units long (and they can be of
3196       different lengths).
3197
3198       The quantifier {0} is permitted, causing the expression to behave as if
3199       the previous item and the quantifier were not present. This can be use‐
3200       ful for subpatterns that are referenced as subroutines  from  elsewhere
3201       in  the  pattern (but see also section  Defining Subpatterns for Use by
3202       Reference Only). Items other than subpatterns that have a  {0}  quanti‐
3203       fier are omitted from the compiled pattern.
3204
3205       For  convenience, the three most common quantifiers have single-charac‐
3206       ter abbreviations:
3207
3208         *:
3209           Equivalent to {0,}
3210
3211         +:
3212           Equivalent to {1,}
3213
3214         ?:
3215           Equivalent to {0,1}
3216
3217       Infinite loops can be constructed by following a  subpattern  that  can
3218       match  no characters with a quantifier that has no upper limit, for ex‐
3219       ample:
3220
3221       (a?)*
3222
3223       Earlier versions of Perl and PCRE used to give an error at compile time
3224       for  such  patterns. However, as there are cases where this can be use‐
3225       ful, such patterns are now accepted. However, if any repetition of  the
3226       subpattern matches no characters, the loop is forcibly broken.
3227
3228       By  default,  the quantifiers are "greedy", that is, they match as much
3229       as possible (up to the maximum  number  of  permitted  times),  without
3230       causing  the  remaining  pattern  to fail. The classic example of where
3231       this gives problems is in trying to match comments in C programs. These
3232       appear  between /* and */. Within the comment, individual * and / char‐
3233       acters can appear. An attempt to match C comments by applying the  pat‐
3234       tern
3235
3236       /\*.*\*/
3237
3238       to the string
3239
3240       /* first comment */  not comment  /* second comment */
3241
3242       fails,  as  it matches the entire string owing to the greediness of the
3243       .* item.
3244
3245       However, if a quantifier is followed by a question mark, it  ceases  to
3246       be greedy, and instead matches the minimum number of times possible, so
3247       the following pattern does the right thing with the C comments:
3248
3249       /\*.*?\*/
3250
3251       The meaning of the various quantifiers is not otherwise  changed,  only
3252       the  preferred  number  of matches. Do not confuse this use of question
3253       mark with its use as a quantifier in its own right. As it has two uses,
3254       it can sometimes appear doubled, as in
3255
3256       \d??\d
3257
3258       which matches one digit by preference, but can match two if that is the
3259       only way the remaining pattern matches.
3260
3261       If option ungreedy is set (an option that is not  available  in  Perl),
3262       the  quantifiers  are not greedy by default, but individual ones can be
3263       made greedy by following them with a question mark. That is, it inverts
3264       the default behavior.
3265
3266       When  a  parenthesized  subpattern  is quantified with a minimum repeat
3267       count that is > 1 or with a limited maximum, more  memory  is  required
3268       for  the  compiled pattern, in proportion to the size of the minimum or
3269       maximum.
3270
3271       If a pattern starts with .* or .{0,} and option dotall  (equivalent  to
3272       Perl  option  /s)  is set, thus allowing the dot to match newlines, the
3273       pattern is implicitly  anchored,  because  whatever  follows  is  tried
3274       against every character position in the subject string. So, there is no
3275       point in retrying the overall match at any position  after  the  first.
3276       PCRE normally treats such a pattern as if it was preceded by \A.
3277
3278       In  cases  where  it  is known that the subject string contains no new‐
3279       lines, it is worth setting dotall to obtain this optimization,  or  al‐
3280       ternatively using ^ to indicate anchoring explicitly.
3281
3282       However,  there  are  some cases where the optimization cannot be used.
3283       When .* is inside capturing parentheses that are the subject of a  back
3284       reference elsewhere in the pattern, a match at the start can fail where
3285       a later one succeeds. Consider, for example:
3286
3287       (.*)abc\1
3288
3289       If the subject is "xyz123abc123", the match point is the fourth charac‐
3290       ter. Therefore, such a pattern is not implicitly anchored.
3291
3292       Another  case where implicit anchoring is not applied is when the lead‐
3293       ing .* is inside an atomic group. Once again, a match at the start  can
3294       fail where a later one succeeds. Consider the following pattern:
3295
3296       (?>.*?a)b
3297
3298       It  matches "ab" in the subject "aab". The use of the backtracking con‐
3299       trol verbs (*PRUNE) and (*SKIP) also disable this optimization.
3300
3301       When a capturing subpattern is repeated, the value captured is the sub‐
3302       string that matched the final iteration. For example, after
3303
3304       (tweedle[dume]{3}\s*)+
3305
3306       has  matched  "tweedledum  tweedledee",  the value of the captured sub‐
3307       string is "tweedledee". However, if there are nested capturing  subpat‐
3308       terns,  the corresponding captured values can have been set in previous
3309       iterations. For example, after
3310
3311       /(a|(b))+/
3312
3313       matches "aba", the value of the second captured substring is "b".
3314

ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

3316       With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
3317       repetition,  failure  of what follows normally causes the repeated item
3318       to be re-evaluated to see if a different number of repeats  allows  the
3319       remaining pattern to match. Sometimes it is useful to prevent this, ei‐
3320       ther to change the nature of the match, or to cause it to fail  earlier
3321       than  it  otherwise  might,  when  the author of the pattern knows that
3322       there is no point in carrying on.
3323
3324       Consider, for example, the pattern \d+foo when applied to the following
3325       subject line:
3326
3327       123456bar
3328
3329       After matching all six digits and then failing to match "foo", the nor‐
3330       mal action of the matcher is to try again with only five digits  match‐
3331       ing item \d+, and then with four, and so on, before ultimately failing.
3332       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
3333       the  means for specifying that once a subpattern has matched, it is not
3334       to be re-evaluated in this way.
3335
3336       If atomic grouping is used for the previous example, the matcher  gives
3337       up  immediately  on failing to match "foo" the first time. The notation
3338       is a kind of special parenthesis, starting with (?> as in the following
3339       example:
3340
3341       (?>\d+)foo
3342
3343       This kind of parenthesis "locks up" the part of the pattern it contains
3344       once it has matched, and a failure further into  the  pattern  is  pre‐
3345       vented  from  backtracking  into  it.  Backtracking past it to previous
3346       items, however, works as normal.
3347
3348       An alternative description is that a subpattern of  this  type  matches
3349       the  string  of  characters  that an identical standalone pattern would
3350       match, if anchored at the current point in the subject string.
3351
3352       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
3353       such as the above example can be thought of as a maximizing repeat that
3354       must swallow everything it can. So, while both \d+ and  \d+?  are  pre‐
3355       pared  to  adjust the number of digits they match to make the remaining
3356       pattern match, (?>\d+) can only match an entire sequence of digits.
3357
3358       Atomic groups in general can contain any complicated  subpatterns,  and
3359       can be nested. However, when the subpattern for an atomic group is just
3360       a single repeated item, as in the example above,  a  simpler  notation,
3361       called a "possessive quantifier" can be used. This consists of an extra
3362       + character following a quantifier. Using this notation,  the  previous
3363       example can be rewritten as
3364
3365       \d++foo
3366
3367       Notice  that  a possessive quantifier can be used with an entire group,
3368       for example:
3369
3370       (abc|xyz){2,3}+
3371
3372       Possessive quantifiers are always greedy; the  setting  of  option  un‐
3373       greedy is ignored. They are a convenient notation for the simpler forms
3374       of an atomic group. However, there is no difference in the meaning of a
3375       possessive quantifier and the equivalent atomic group, but there can be
3376       a performance difference; possessive quantifiers are probably  slightly
3377       faster.
3378
3379       The  possessive  quantifier syntax is an extension to the Perl 5.8 syn‐
3380       tax. Jeffrey Friedl originated the idea (and the  name)  in  the  first
3381       edition of his book. Mike McCloskey liked it, so implemented it when he
3382       built the Sun Java package, and PCRE copied it  from  there.  It  ulti‐
3383       mately found its way into Perl at release 5.10.
3384
3385       PCRE has an optimization that automatically "possessifies" certain sim‐
3386       ple pattern constructs. For example, the sequence  A+B  is  treated  as
3387       A++B,  as there is no point in backtracking into a sequence of A:s when
3388       B must follow.
3389
3390       When a pattern contains an unlimited repeat inside  a  subpattern  that
3391       can  itself  be  repeated  an  unlimited number of times, the use of an
3392       atomic group is the only way to avoid some  failing  matches  taking  a
3393       long time. The pattern
3394
3395       (\D+|<\d+>)*[!?]
3396
3397       matches  an  unlimited number of substrings that either consist of non-
3398       digits, or digits enclosed in <>, followed by ! or ?. When it  matches,
3399       it runs quickly. However, if it is applied to
3400
3401       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3402
3403       it  takes  a  long  time  before reporting failure. This is because the
3404       string can be divided between the internal \D+ repeat and the  external
3405       *  repeat  in  many ways, and all must be tried. (The example uses [!?]
3406       rather than a single character at the end, as both PCRE and  Perl  have
3407       an optimization that allows for fast failure when a single character is
3408       used. They remember the last single character that is  required  for  a
3409       match,  and fail early if it is not present in the string.) If the pat‐
3410       tern is changed so that it uses an atomic group,  like  the  following,
3411       sequences of non-digits cannot be broken, and failure happens quickly:
3412
3413       ((?>\D+)|<\d+>)*[!?]
3414

BACK REFERENCES

3416       Outside  a  character  class,  a backslash followed by a digit > 0 (and
3417       possibly further digits) is a back reference to a capturing  subpattern
3418       earlier (that is, to its left) in the pattern, provided there have been
3419       that many previous capturing left parentheses.
3420
3421       However, if the decimal number following the backslash is < 10,  it  is
3422       always taken as a back reference, and causes an error only if there are
3423       not that many capturing left parentheses in the  entire  pattern.  That
3424       is,  the  parentheses that are referenced do need not be to the left of
3425       the reference for numbers < 10. A "forward back reference" of this type
3426       can  make sense when a repetition is involved and the subpattern to the
3427       right has participated in an earlier iteration.
3428
3429       It is not possible to have a numerical "forward back  reference"  to  a
3430       subpattern  whose number is 10 or more using this syntax, as a sequence
3431       such as \50 is interpreted as a character defined in  octal.  For  more
3432       details  of  the  handling of digits following a backslash, see section
3433       Non-Printing Characters earlier. There is no such  problem  when  named
3434       parentheses  are  used.  A back reference to any subpattern is possible
3435       using named parentheses (see below).
3436
3437       Another way to avoid the ambiguity inherent in the use of  digits  fol‐
3438       lowing  a  backslash is to use the \g escape sequence. This escape must
3439       be followed by an unsigned number or a negative number, optionally  en‐
3440       closed in braces. The following examples are identical:
3441
3442       (ring), \1
3443       (ring), \g1
3444       (ring), \g{1}
3445
3446       An  unsigned number specifies an absolute reference without the ambigu‐
3447       ity that is present in the older syntax. It is also useful when literal
3448       digits follow the reference. A negative number is a relative reference.
3449       Consider the following example:
3450
3451       (abc(def)ghi)\g{-1}
3452
3453       The sequence \g{-1} is a reference to the most recently started captur‐
3454       ing subpattern before \g, that is, it is equivalent to \2 in this exam‐
3455       ple. Similarly, \g{-2} would be equivalent to \1. The use  of  relative
3456       references  can  be helpful in long patterns, and also in patterns that
3457       are created by joining fragments  containing  references  within  them‐
3458       selves.
3459
3460       A  back  reference matches whatever matched the capturing subpattern in
3461       the current subject string, rather than anything matching  the  subpat‐
3462       tern itself (section Subpattern as Subroutines describes a way of doing
3463       that). So, the following pattern matches "sense  and  sensibility"  and
3464       "response and responsibility", but not "sense and responsibility":
3465
3466       (sens|respons)e and \1ibility
3467
3468       If  caseful matching is in force at the time of the back reference, the
3469       case of letters is relevant. For example, the  following  matches  "rah
3470       rah"  and "RAH RAH", but not "RAH rah", although the original capturing
3471       subpattern is matched caselessly:
3472
3473       ((?i)rah)\s+\1
3474
3475       There are many different ways of writing back references to named  sub‐
3476       patterns.  The  .NET  syntax  \k{name}  and the Perl syntax \k<name> or
3477       \k'name' are supported, as is the Python syntax (?P=name). The  unified
3478       back  reference  syntax  in Perl 5.10, in which \g can be used for both
3479       numeric and named references, is also supported. The  previous  example
3480       can be rewritten in the following ways:
3481
3482       (?<p1>(?i)rah)\s+\k<p1>
3483       (?'p1'(?i)rah)\s+\k{p1}
3484       (?P<p1>(?i)rah)\s+(?P=p1)
3485       (?<p1>(?i)rah)\s+\g{p1}
3486
3487       A  subpattern  that is referenced by name can appear in the pattern be‐
3488       fore or after the reference.
3489
3490       There can be more than one back reference to the same subpattern. If  a
3491       subpattern has not been used in a particular match, any back references
3492       to it always fails. For example, the following pattern always fails  if
3493       it starts to match "a" rather than "bc":
3494
3495       (a|(bc))\2
3496
3497       As  there  can  be  many capturing parentheses in a pattern, all digits
3498       following the backslash are taken as part of a potential back reference
3499       number. If the pattern continues with a digit character, some delimiter
3500       must be used to terminate the back reference.  If  option  extended  is
3501       set,  this  can  be whitespace. Otherwise an empty comment (see section
3502       Comments) can be used.
3503
3504       Recursive Back References
3505
3506       A back reference that occurs inside the parentheses to which it  refers
3507       fails  when  the subpattern is first used, so, for example, (a\1) never
3508       matches. However, such references can be useful inside repeated subpat‐
3509       terns.  For  example,  the following pattern matches any number of "a"s
3510       and also "aba", "ababbaa", and so on:
3511
3512       (a|b\1)+
3513
3514       At each iteration of the subpattern, the  back  reference  matches  the
3515       character  string corresponding to the previous iteration. In order for
3516       this to work, the pattern must be such that the  first  iteration  does
3517       not  need  to match the back reference. This can be done using alterna‐
3518       tion, as in the example above, or by a quantifier  with  a  minimum  of
3519       zero.
3520
3521       Back  references of this type cause the group that they reference to be
3522       treated as an atomic group. Once the whole group has  been  matched,  a
3523       subsequent  matching  failure cannot cause backtracking into the middle
3524       of the group.
3525

ASSERTIONS

3527       An assertion is a test on the characters  following  or  preceding  the
3528       current matching point that does not consume any characters. The simple
3529       assertions coded as \b, \B, \A, \G, \Z, \z, ^, and $ are  described  in
3530       the previous sections.
3531
3532       More  complicated  assertions  are  coded as subpatterns. There are two
3533       kinds: those that look ahead of the current  position  in  the  subject
3534       string,  and  those  that  look  behind  it. An assertion subpattern is
3535       matched in the normal way, except that it does not  cause  the  current
3536       matching position to be changed.
3537
3538       Assertion  subpatterns are not capturing subpatterns. If such an asser‐
3539       tion contains capturing subpatterns within it, these  are  counted  for
3540       the  purposes  of numbering the capturing subpatterns in the whole pat‐
3541       tern. However, substring capturing is done  only  for  positive  asser‐
3542       tions.  (Perl sometimes, but not always, performs capturing in negative
3543       assertions.)
3544
3545   Warning:
3546       If a positive assertion containing one or  more  capturing  subpatterns
3547       succeeds, but failure to match later in the pattern causes backtracking
3548       over this assertion, the captures within the assertion are  reset  only
3549       if no higher numbered captures are already set. This is, unfortunately,
3550       a fundamental limitation of the current implementation, and as PCRE1 is
3551       now in maintenance-only status, it is unlikely ever to change.
3552
3553
3554       For  compatibility  with  Perl,  assertion subpatterns can be repeated.
3555       However, it makes no sense to assert the same  thing  many  times,  the
3556       side  effect  of  capturing  parentheses can occasionally be useful. In
3557       practice, there are only three cases:
3558
3559         * If the quantifier is {0}, the  assertion  is  never  obeyed  during
3560           matching.  However, it can contain internal capturing parenthesized
3561           groups that are called from elsewhere through the subroutine mecha‐
3562           nism.
3563
3564         * If  quantifier  is  {0,n},  where n > 0, it is treated as if it was
3565           {0,1}. At runtime, the remaining pattern match is  tried  with  and
3566           without  the  assertion, the order depends on the greediness of the
3567           quantifier.
3568
3569         * If the minimum repetition is > 0, the quantifier  is  ignored.  The
3570           assertion is obeyed only once when encountered during matching.
3571
3572       Lookahead Assertions
3573
3574       Lookahead assertions start with (?= for positive assertions and (?! for
3575       negative assertions. For example, the following matches a word followed
3576       by a semicolon, but does not include the semicolon in the match:
3577
3578       \w+(?=;)
3579
3580       The  following  matches any occurrence of "foo" that is not followed by
3581       "bar":
3582
3583       foo(?!bar)
3584
3585       Notice that the apparently similar pattern
3586
3587       (?!foo)bar
3588
3589       does not find an occurrence of "bar"  that  is  preceded  by  something
3590       other  than  "foo". It finds any occurrence of "bar" whatsoever, as the
3591       assertion (?!foo) is always true when the  next  three  characters  are
3592       "bar". A lookbehind assertion is needed to achieve the other effect.
3593
3594       If you want to force a matching failure at some point in a pattern, the
3595       most convenient way to do it is with (?!), as an  empty  string  always
3596       matches.  So,  an  assertion  that requires there is not to be an empty
3597       string must always fail. The backtracking control verb (*FAIL) or  (*F)
3598       is a synonym for (?!).
3599
3600       Lookbehind Assertions
3601
3602       Lookbehind  assertions start with (?<= for positive assertions and (?<!
3603       for negative assertions. For example, the following finds an occurrence
3604       of "bar" that is not preceded by "foo":
3605
3606       (?<!foo)bar
3607
3608       The contents of a lookbehind assertion are restricted such that all the
3609       strings it matches must have a fixed length. However, if there are many
3610       top-level  alternatives,  they  do  not all have to have the same fixed
3611       length. Thus, the following is permitted:
3612
3613       (?<=bullock|donkey)
3614
3615       The following causes an error at compile time:
3616
3617       (?<!dogs?|cats?)
3618
3619       Branches that match different length strings are permitted only at  the
3620       top-level of a lookbehind assertion. This is an extension compared with
3621       Perl, which requires all branches to match the same length  of  string.
3622       An assertion such as the following is not permitted, as its single top-
3623       level branch can match two different lengths:
3624
3625       (?<=ab(c|de))
3626
3627       However, it is acceptable to PCRE if rewritten  to  use  two  top-level
3628       branches:
3629
3630       (?<=abc|abde)
3631
3632       Sometimes  the  escape sequence \K (see above) can be used instead of a
3633       lookbehind assertion to get round the fixed-length restriction.
3634
3635       The implementation of lookbehind assertions is, for  each  alternative,
3636       to  move  the current position back temporarily by the fixed length and
3637       then try to match. If there are insufficient characters before the cur‐
3638       rent position, the assertion fails.
3639
3640       In  a UTF mode, PCRE does not allow the \C escape (which matches a sin‐
3641       gle data unit even in a UTF mode) to appear in  lookbehind  assertions,
3642       as  it  makes  it impossible to calculate the length of the lookbehind.
3643       The \X and \R escapes, which can match different numbers of data units,
3644       are not permitted either.
3645
3646       "Subroutine" calls (see below), such as (?2) or (?&X), are permitted in
3647       lookbehinds, as long as the subpattern matches a  fixed-length  string.
3648       Recursion, however, is not supported.
3649
3650       Possessive  quantifiers can be used with lookbehind assertions to spec‐
3651       ify efficient matching of fixed-length strings at the  end  of  subject
3652       strings.  Consider  the following simple pattern when applied to a long
3653       string that does not match:
3654
3655       abcd$
3656
3657       As matching proceeds from left to right, PCRE looks for each "a" in the
3658       subject and then sees if what follows matches the remaining pattern. If
3659       the pattern is specified as
3660
3661       ^.*abcd$
3662
3663       the initial .* matches the entire string at first. However,  when  this
3664       fails  (as  there  is no following "a"), it backtracks to match all but
3665       the last character, then all but the last two characters,  and  so  on.
3666       Once  again  the search for "a" covers the entire string, from right to
3667       left, so we are no better off. However, if the pattern is written as
3668
3669       ^.*+(?<=abcd)
3670
3671       there can be no backtracking for the .*+ item; it can  match  only  the
3672       entire  string.  The subsequent lookbehind assertion does a single test
3673       on the last four characters. If it fails, the match fails  immediately.
3674       For  long  strings, this approach makes a significant difference to the
3675       processing time.
3676
3677       Using Multiple Assertions
3678
3679       Many assertions (of any sort) can occur in succession. For example, the
3680       following matches "foo" preceded by three digits that are not "999":
3681
3682       (?<=\d{3})(?<!999)foo
3683
3684       Notice that each of the assertions is applied independently at the same
3685       point in the subject string. First there is a check that  the  previous
3686       three  characters  are  all  digits, and then there is a check that the
3687       same three characters are not "999". This pattern does not match  "foo"
3688       preceded  by six characters, the first of which are digits and the last
3689       three of which are not "999". For example, it does not  match  "123abc‐
3690       foo". A pattern to do that is the following:
3691
3692       (?<=\d{3}...)(?<!999)foo
3693
3694       This  time  the  first assertion looks at the preceding six characters,
3695       checks that the first three are digits, and then the  second  assertion
3696       checks that the preceding three characters are not "999".
3697
3698       Assertions can be nested in any combination. For example, the following
3699       matches an occurrence of "baz" that is preceded by "bar", which in turn
3700       is not preceded by "foo":
3701
3702       (?<=(?<!foo)bar)baz
3703
3704       The  following  pattern  matches "foo" preceded by three digits and any
3705       three characters that are not "999":
3706
3707       (?<=\d{3}(?!999)...)foo
3708

CONDITIONAL SUBPATTERNS

3710       It is possible to cause the matching process to obey a subpattern  con‐
3711       ditionally  or to choose between two alternative subpatterns, depending
3712       on the result of an assertion, or whether a specific capturing  subpat‐
3713       tern has already been matched. The following are the two possible forms
3714       of conditional subpattern:
3715
3716       (?(condition)yes-pattern)
3717       (?(condition)yes-pattern|no-pattern)
3718
3719       If the condition is satisfied, the yes-pattern is used,  otherwise  the
3720       no-pattern  (if  present).  If  more than two alternatives exist in the
3721       subpattern, a compile-time error occurs. Each of the  two  alternatives
3722       can  itself  contain  nested  subpatterns of any form, including condi‐
3723       tional subpatterns; the restriction to two alternatives applies only at
3724       the  level of the condition. The following pattern fragment is an exam‐
3725       ple where the alternatives are complex:
3726
3727       (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
3728
3729       There are four kinds of condition: references  to  subpatterns,  refer‐
3730       ences to recursion, a pseudo-condition called DEFINE, and assertions.
3731
3732       Checking for a Used Subpattern By Number
3733
3734       If  the  text between the parentheses consists of a sequence of digits,
3735       the condition is true if a capturing subpattern of that number has pre‐
3736       viously  matched.  If  more than one capturing subpattern with the same
3737       number exists (see section  Duplicate Subpattern Numbers earlier),  the
3738       condition  is true if any of them have matched. An alternative notation
3739       is to precede the digits with a plus or minus sign. In this  case,  the
3740       subpattern  number  is relative rather than absolute. The most recently
3741       opened parentheses can be referenced by (?(-1), the next most recent by
3742       (?(-2),  and  so  on.  Inside loops, it can also make sense to refer to
3743       subsequent groups. The next parentheses to be opened can be  referenced
3744       as  (?(+1),  and  so  on.  (The value zero in any of these forms is not
3745       used; it provokes a compile-time error.)
3746
3747       Consider the following pattern, which contains  non-significant  white‐
3748       space  to  make it more readable (assume option extended) and to divide
3749       it into three parts for ease of discussion:
3750
3751       ( \( )?    [^()]+    (?(1) \) )
3752
3753       The first part matches an optional opening  parenthesis,  and  if  that
3754       character is present, sets it as the first captured substring. The sec‐
3755       ond part matches one or more characters that are not  parentheses.  The
3756       third part is a conditional subpattern that tests whether the first set
3757       of parentheses matched or not. If they did, that is, if subject started
3758       with an opening parenthesis, the condition is true, and so the yes-pat‐
3759       tern is executed and a closing parenthesis is required.  Otherwise,  as
3760       no-pattern  is  not  present,  the subpattern matches nothing. That is,
3761       this pattern matches a sequence of non-parentheses, optionally enclosed
3762       in parentheses.
3763
3764       If  this  pattern is embedded in a larger one, a relative reference can
3765       be used:
3766
3767       This makes the fragment independent of the parentheses  in  the  larger
3768       pattern.
3769
3770       Checking for a Used Subpattern By Name
3771
3772       Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
3773       used subpattern by name. For compatibility  with  earlier  versions  of
3774       PCRE,  which  had this facility before Perl, the syntax (?(name)...) is
3775       also recognized.
3776
3777       Rewriting the previous example to use a named subpattern gives:
3778
3779       (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
3780
3781       If the name used in a condition of this kind is a duplicate,  the  test
3782       is  applied to all subpatterns of the same name, and is true if any one
3783       of them has matched.
3784
3785       Checking for Pattern Recursion
3786
3787       If the condition is the string (R), and there is no subpattern with the
3788       name  R, the condition is true if a recursive call to the whole pattern
3789       or any subpattern has been made. If digits or a name preceded by amper‐
3790       sand follow the letter R, for example:
3791
3792       (?(R3)...) or (?(R&name)...)
3793
3794       the condition is true if the most recent recursion is into a subpattern
3795       whose number or name is given. This condition does not check the entire
3796       recursion  stack. If the name used in a condition of this kind is a du‐
3797       plicate, the test is applied to all subpatterns of the same  name,  and
3798       is true if any one of them is the most recent recursion.
3799
3800       At "top-level", all these recursion test conditions are false. The syn‐
3801       tax for recursive patterns is described below.
3802
3803       Defining Subpatterns for Use By Reference Only
3804
3805       If the condition is the string (DEFINE), and  there  is  no  subpattern
3806       with  the  name  DEFINE,  the  condition is always false. In this case,
3807       there can be only one alternative  in  the  subpattern.  It  is  always
3808       skipped  if  control reaches this point in the pattern. The idea of DE‐
3809       FINE is that it can be used to define "subroutines" that can be  refer‐
3810       enced  from elsewhere. (The use of subroutines is described below.) For
3811       example, a pattern to match an IPv4 address, such as  "192.168.23.245",
3812       can be written like this (ignore whitespace and line breaks):
3813
3814       (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) \b (?&byte) (\.(?&byte)){3} \b
3815
3816       The  first  part of the pattern is a DEFINE group inside which is a an‐
3817       other group named "byte" is defined. This matches an individual  compo‐
3818       nent  of  an  IPv4 address (a number < 256). When matching takes place,
3819       this part of the pattern is skipped, as DEFINE acts like a false condi‐
3820       tion. The remaining pattern uses references to the named group to match
3821       the four dot-separated components of an IPv4 address,  insisting  on  a
3822       word boundary at each end.
3823
3824       Assertion Conditions
3825
3826       If  the condition is not in any of the above formats, it must be an as‐
3827       sertion. This can be a positive or negative lookahead or lookbehind as‐
3828       sertion.  Consider  the  following  pattern, containing non-significant
3829       whitespace, and with the two alternatives on the second line:
3830
3831       (?(?=[^a-z]*[a-z])
3832       \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
3833
3834       The condition is a positive lookahead assertion  that  matches  an  op‐
3835       tional  sequence of non-letters followed by a letter. That is, it tests
3836       for the presence of at least one letter in the subject. If a letter  is
3837       found,  the subject is matched against the first alternative, otherwise
3838       it is matched against the second. This pattern matches strings  in  one
3839       of  the  two  forms dd-aaa-dd or dd-dd-dd, where aaa are letters and dd
3840       are digits.
3841

COMMENTS

3843       There are two ways to include comments in patterns that  are  processed
3844       by PCRE. In both cases, the start of the comment must not be in a char‐
3845       acter class, or in the middle of any other sequence of related  charac‐
3846       ters  such  as  (?: or a subpattern name or number. The characters that
3847       make up a comment play no part in the pattern matching.
3848
3849       The sequence (?# marks the start of a comment that continues up to  the
3850       next  closing parenthesis. Nested parentheses are not permitted. If op‐
3851       tion PCRE_EXTENDED is set, an unescaped # character also  introduces  a
3852       comment,  which  in  this  case continues to immediately after the next
3853       newline character or character sequence in the pattern.  Which  charac‐
3854       ters are interpreted as newlines is controlled by the options passed to
3855       a compiling function or by a special sequence at the start of the  pat‐
3856       tern, as described in section  Newline Conventions earlier.
3857
3858       Notice  that  the  end of this type of comment is a literal newline se‐
3859       quence in the pattern; escape sequences that happen to represent a new‐
3860       line do not count. For example, consider the following pattern when ex‐
3861       tended is set, and the default newline convention is in force:
3862
3863       abc #comment \n still comment
3864
3865       On encountering character #, pcre_compile() skips along, looking for  a
3866       newline in the pattern. The sequence \n is still literal at this stage,
3867       so it does not terminate the comment. Only a character with code  value
3868       0x0a (the default newline) does so.
3869

RECURSIVE PATTERNS

3871       Consider  the problem of matching a string in parentheses, allowing for
3872       unlimited nested parentheses. Without the use of  recursion,  the  best
3873       that  can  be  done  is  to use a pattern that matches up to some fixed
3874       depth of nesting. It is not possible to  handle  an  arbitrary  nesting
3875       depth.
3876
3877       For some time, Perl has provided a facility that allows regular expres‐
3878       sions to recurse (among other things). It does  this  by  interpolating
3879       Perl  code  in the expression at runtime, and the code can refer to the
3880       expression itself. A Perl pattern using code interpolation to solve the
3881       parentheses problem can be created like this:
3882
3883       $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
3884
3885       Item  (?p{...})  interpolates  Perl  code  at runtime, and in this case
3886       refers recursively to the pattern in which it appears.
3887
3888       Obviously, PCRE cannot support the interpolation of Perl code. Instead,
3889       it supports special syntax for recursion of the entire pattern, and for
3890       individual subpattern recursion. After its  introduction  in  PCRE  and
3891       Python,  this  kind  of recursion was later introduced into Perl at re‐
3892       lease 5.10.
3893
3894       A special item that consists of (? followed by a number > 0 and a clos‐
3895       ing parenthesis is a recursive subroutine call of the subpattern of the
3896       given number, if it occurs inside that subpattern. (If  not,  it  is  a
3897       non-recursive subroutine call, which is described in the next section.)
3898       The special item (?R) or (?0) is a recursive call of the entire regular
3899       expression.
3900
3901       This  PCRE  pattern  solves the nested parentheses problem (assume that
3902       option extended is set so that whitespace is ignored):
3903
3904       \( ( [^()]++ | (?R) )* \)
3905
3906       First it matches an opening parenthesis. Then it matches any number  of
3907       substrings,  which can either be a sequence of non-parentheses or a re‐
3908       cursive match of the pattern itself (that is, a correctly parenthesized
3909       substring). Finally there is a closing parenthesis. Notice the use of a
3910       possessive quantifier to avoid  backtracking  into  sequences  of  non-
3911       parentheses.
3912
3913       If this was part of a larger pattern, you would not want to recurse the
3914       entire pattern, so instead you can use:
3915
3916       ( \( ( [^()]++ | (?1) )* \) )
3917
3918       The pattern is here within parentheses so that the recursion refers  to
3919       them instead of the whole pattern.
3920
3921       In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
3922       tricky. This is made easier by the use of relative references.  Instead
3923       of  (?1) in the pattern above, you can write (?-2) to refer to the sec‐
3924       ond most recently opened parentheses preceding the recursion. That  is,
3925       a negative number counts capturing parentheses leftwards from the point
3926       at which it is encountered.
3927
3928       It is also possible to refer to later opened  parentheses,  by  writing
3929       references  such  as  (?+2). However, these cannot be recursive, as the
3930       reference is not inside the parentheses that are referenced.  They  are
3931       always  non-recursive  subroutine  calls, as described in the next sec‐
3932       tion.
3933
3934       An alternative approach is to use named parentheses instead.  The  Perl
3935       syntax  for this is (?&name). The earlier PCRE syntax (?P>name) is also
3936       supported. We can rewrite the above example as follows:
3937
3938       (?<pn> \( ( [^()]++ | (?&pn) )* \) )
3939
3940       If there is more than one subpattern with the same name,  the  earliest
3941       one is used.
3942
3943       This  particular  example  pattern that we have studied contains nested
3944       unlimited repeats, and so the use of a possessive quantifier for match‐
3945       ing  strings  of non-parentheses is important when applying the pattern
3946       to strings that do not match. For example, when this pattern is applied
3947       to
3948
3949       (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
3950
3951       it gives "no match" quickly. However, if a possessive quantifier is not
3952       used, the match runs for a long time, as there are  so  many  different
3953       ways  the  +  and  *  repeats can carve up the subject, and all must be
3954       tested before failure can be reported.
3955
3956       At the end of a match, the values of capturing  parentheses  are  those
3957       from the outermost level. If the pattern above is matched against
3958
3959       (ab(cd)ef)
3960
3961       the  value  for  the  inner capturing parentheses (numbered 2) is "ef",
3962       which is the last value taken on at the top-level. If a capturing  sub‐
3963       pattern  is  not  matched at the top level, its final captured value is
3964       unset, even if it was (temporarily) set at a deeper  level  during  the
3965       matching process.
3966
3967       Do not confuse item (?R) with condition (R), which tests for recursion.
3968       Consider the following pattern, which matches text in  angle  brackets,
3969       allowing  for  arbitrary  nesting.  Only  digits  are allowed in nested
3970       brackets (that is, when recursing), while any characters are  permitted
3971       at the outer level.
3972
3973       < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
3974
3975       Here (?(R) is the start of a conditional subpattern, with two different
3976       alternatives for the recursive and non-recursive cases.  Item  (?R)  is
3977       the actual recursive call.
3978
3979       Differences in Recursion Processing between PCRE and Perl
3980
3981       Recursion  processing  in PCRE differs from Perl in two important ways.
3982       In PCRE (like Python, but unlike Perl), a recursive subpattern call  is
3983       always treated as an atomic group. That is, once it has matched some of
3984       the subject string, it is never re-entered, even if it contains untried
3985       alternatives  and  there  is a subsequent matching failure. This can be
3986       illustrated by the following pattern, which means  to  match  a  palin‐
3987       dromic string containing an odd number of characters (for example, "a",
3988       "aba", "abcba", "abcdcba"):
3989
3990       ^(.|(.)(?1)\2)$
3991
3992       The idea is that it either matches a single character, or two identical
3993       characters surrounding a subpalindrome. In Perl, this pattern works; in
3994       PCRE it does not work if the pattern is longer than  three  characters.
3995       Consider the subject string "abcba".
3996
3997       At  the  top level, the first character is matched, but as it is not at
3998       the end of the string, the first alternative fails, the second alterna‐
3999       tive  is  taken, and the recursion kicks in. The recursive call to sub‐
4000       pattern 1 successfully matches the next character ("b").  (Notice  that
4001       the beginning and end of line tests are not part of the recursion.)
4002
4003       Back  at  the top level, the next character ("c") is compared with what
4004       subpattern 2 matched, which was "a". This fails. As  the  recursion  is
4005       treated  as  an atomic group, there are now no backtracking points, and
4006       so the entire match fails. (Perl can now re-enter the recursion and try
4007       the  second  alternative.)  However, if the pattern is written with the
4008       alternatives in the other order, things are different:
4009
4010       ^((.)(?1)\2|.)$
4011
4012       This time, the recursing alternative is tried first, and  continues  to
4013       recurse  until  it runs out of characters, at which point the recursion
4014       fails. But this time we have another alternative to try at  the  higher
4015       level. That is the significant difference: in the previous case the re‐
4016       maining alternative is at a deeper recursion level, which  PCRE  cannot
4017       use.
4018
4019       To  change  the pattern so that it matches all palindromic strings, not
4020       only those with an odd number of characters, it is tempting  to  change
4021       the pattern to this:
4022
4023       ^((.)(?1)\2|.?)$
4024
4025       Again,  this  works  in Perl, but not in PCRE, and for the same reason.
4026       When a deeper recursion has matched a single character,  it  cannot  be
4027       entered again to match an empty string. The solution is to separate the
4028       two cases, and write out the odd and even cases as alternatives at  the
4029       higher level:
4030
4031       ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
4032
4033       If  you want to match typical palindromic phrases, the pattern must ig‐
4034       nore all non-word characters, which can be done as follows:
4035
4036       ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
4037
4038       If run with option caseless, this pattern matches phrases  such  as  "A
4039       man, a plan, a canal: Panama!" and it works well in both PCRE and Perl.
4040       Notice the use of the possessive quantifier *+  to  avoid  backtracking
4041       into  sequences  of  non-word characters. Without this, PCRE takes much
4042       longer (10 times or more) to match typical phrases, and Perl  takes  so
4043       long that you think it has gone into a loop.
4044
4045   Note:
4046       The  palindrome-matching patterns above work only if the subject string
4047       does not start with a  palindrome  that  is  shorter  than  the  entire
4048       string. For example, although "abcba" is correctly matched, if the sub‐
4049       ject is "ababa", PCRE finds palindrome "aba" at  the  start,  and  then
4050       fails  at  top  level,  as  the end of the string does not follow. Once
4051       again, it cannot jump back into the recursion  to  try  other  alterna‐
4052       tives, so the entire match fails.
4053
4054
4055       The  second  way  in which PCRE and Perl differ in their recursion pro‐
4056       cessing is in the handling of captured values. In Perl, when a  subpat‐
4057       tern  is  called recursively or as a subpattern (see the next section),
4058       it has no access to any values that were captured  outside  the  recur‐
4059       sion.  In  PCRE  these values can be referenced. Consider the following
4060       pattern:
4061
4062       ^(.)(\1|a(?2))
4063
4064       In PCRE, it matches "bab". The first capturing parentheses  match  "b",
4065       then  in  the  second  group, when the back reference \1 fails to match
4066       "b", the second alternative matches "a", and then recurses. In the  re‐
4067       cursion,  \1  does  now  match  "b" and so the whole match succeeds. In
4068       Perl, the pattern fails to match because inside the recursive  call  \1
4069       cannot access the externally set value.
4070

SUBPATTERNS AS SUBROUTINES

4072       If  the  syntax for a recursive subpattern call (either by number or by
4073       name) is used outside the parentheses to which it refers,  it  operates
4074       like  a subroutine in a programming language. The called subpattern can
4075       be defined before or after the reference. A numbered reference  can  be
4076       absolute or relative, as in the following examples:
4077
4078       (...(absolute)...)...(?2)...
4079       (...(relative)...)...(?-1)...
4080       (...(?+1)...(relative)...
4081
4082       An  earlier  example  pointed  out  that  the following pattern matches
4083       "sense and sensibility" and  "response  and  responsibility",  but  not
4084       "sense and responsibility":
4085
4086       (sens|respons)e and \1ibility
4087
4088       If instead the following pattern is used, it matches "sense and respon‐
4089       sibility" and the other two strings:
4090
4091       (sens|respons)e and (?1)ibility
4092
4093       Another example is provided in the discussion of DEFINE earlier.
4094
4095       All subroutine calls, recursive or not, are always  treated  as  atomic
4096       groups.  That  is,  once  a  subroutine has matched some of the subject
4097       string, it is never re-entered, even if it  contains  untried  alterna‐
4098       tives  and there is a subsequent matching failure. Any capturing paren‐
4099       theses that are set during the subroutine call revert to their previous
4100       values afterwards.
4101
4102       Processing  options  such as case-independence are fixed when a subpat‐
4103       tern is defined, so if it is used as a subroutine, such options  cannot
4104       be  changed  for  different  calls.  For example, the following pattern
4105       matches "abcabc" but not "abcABC", as the change of  processing  option
4106       does not affect the called subpattern:
4107
4108       (abc)(?i:(?-1))
4109

ONIGURUMA SUBROUTINE SYNTAX

4111       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
4112       name or a number enclosed either in angle brackets or single quotes, is
4113       alternative syntax for referencing a subpattern as a subroutine, possi‐
4114       bly recursively. Here follows two of the examples used above, rewritten
4115       using this syntax:
4116
4117       (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
4118       (sens|respons)e and \g'1'ibility
4119
4120       PCRE  supports  an extension to Oniguruma: if a number is preceded by a
4121       plus or minus sign, it is taken as a relative reference, for example:
4122
4123       (abc)(?i:\g<-1>)
4124
4125       Notice that \g{...} (Perl syntax) and \g<...>  (Oniguruma  syntax)  are
4126       not synonymous. The former is a back reference; the latter is a subrou‐
4127       tine call.
4128

BACKTRACKING CONTROL

4130       Perl 5.10 introduced some "Special Backtracking Control  Verbs",  which
4131       are still described in the Perl documentation as "experimental and sub‐
4132       ject to change or removal in a future version of Perl". It goes  on  to
4133       say:  "Their usage in production code should be noted to avoid problems
4134       during upgrades." The same remarks apply to the PCRE features described
4135       in this section.
4136
4137       The  new verbs make use of what was previously invalid syntax: an open‐
4138       ing parenthesis followed by an asterisk. They are generally of the form
4139       (*VERB)  or  (*VERB:NAME). Some can take either form, possibly behaving
4140       differently depending on whether a name is present. A name is  any  se‐
4141       quence  of  characters that does not include a closing parenthesis. The
4142       maximum name length is 255 in the 8-bit library and 65535 in the 16-bit
4143       and  32-bit  libraries.  If  the name is empty, that is, if the closing
4144       parenthesis immediately follows the colon, the  effect  is  as  if  the
4145       colon was not there. Any number of these verbs can occur in a pattern.
4146
4147       The behavior of these verbs in repeated groups, assertions, and in sub‐
4148       patterns called as subroutines (whether  or  not  recursively)  is  de‐
4149       scribed below.
4150
4151       Optimizations That Affect Backtracking Verbs
4152
4153       PCRE  contains some optimizations that are used to speed up matching by
4154       running some checks at the start of each match attempt. For example, it
4155       can  know  the minimum length of matching subject, or that a particular
4156       character must be present. When one of these optimizations bypasses the
4157       running  of a match, any included backtracking verbs are not processed.
4158       processed. You can suppress the start-of-match optimizations by setting
4159       option  no_start_optimize when calling compile/2 or run/3, or by start‐
4160       ing the pattern with (*NO_START_OPT).
4161
4162       Experiments with Perl suggest that it too  has  similar  optimizations,
4163       sometimes leading to anomalous results.
4164
4165       Verbs That Act Immediately
4166
4167       The  following verbs act as soon as they are encountered. They must not
4168       be followed by a name.
4169
4170       (*ACCEPT)
4171
4172       This verb causes the match to end successfully, skipping the  remainder
4173       of  the pattern. However, when it is inside a subpattern that is called
4174       as a subroutine, only that subpattern is ended  successfully.  Matching
4175       then continues at the outer level. If (*ACCEPT) is triggered in a posi‐
4176       tive assertion, the assertion succeeds; in a  negative  assertion,  the
4177       assertion fails.
4178
4179       If  (*ACCEPT)  is inside capturing parentheses, the data so far is cap‐
4180       tured. For example, the following matches "AB", "AAD", or  "ACD".  When
4181       it matches "AB", "B" is captured by the outer parentheses.
4182
4183       A((?:A|B(*ACCEPT)|C)D)
4184
4185       The  following  verb causes a matching failure, forcing backtracking to
4186       occur. It is equivalent to (?!) but easier to read.
4187
4188       (*FAIL) or (*F)
4189
4190       The Perl documentation states that it is probably useful only when com‐
4191       bined  with  (?{})  or  (??{}).  Those  are  Perl features that are not
4192       present in PCRE.
4193
4194       A match with the string "aaaa" always fails, but the callout  is  taken
4195       before each backtrack occurs (in this example, 10 times).
4196
4197       Recording Which Path Was Taken
4198
4199       The  main  purpose of this verb is to track how a match was arrived at,
4200       although it also has a secondary use in with advancing the match start‐
4201       ing point (see (*SKIP) below).
4202
4203   Note:
4204       In  Erlang,  there  is no interface to retrieve a mark with run/2,3, so
4205       only the secondary purpose is relevant to the Erlang programmer.
4206
4207       The rest of this section is  therefore  deliberately  not  adapted  for
4208       reading  by  the Erlang programmer, but the examples can help in under‐
4209       standing NAMES as they can be used by (*SKIP).
4210
4211
4212       (*MARK:NAME) or (*:NAME)
4213
4214       A name is always required with this verb. There  can  be  as  many  in‐
4215       stances  of  (*MARK)  as  you like in a pattern, and their names do not
4216       have to be unique.
4217
4218       When a match succeeds, the name of the last  encountered  (*MARK:NAME),
4219       (*PRUNE:NAME),  or  (*THEN:NAME) on the matching path is passed back to
4220       the caller as described in section "Extra data for pcre_exec()" in  the
4221       pcreapi documentation. In the following example of pcretest output, the
4222       /K modifier requests the retrieval and outputting of (*MARK) data:
4223
4224         re> /X(*MARK:A)Y|X(*MARK:B)Z/K
4225       data> XY
4226        0: XY
4227       MK: A
4228       XZ
4229        0: XZ
4230       MK: B
4231
4232       The (*MARK) name is tagged with "MK:" in this output, and in this exam‐
4233       ple  it indicates which of the two alternatives matched. This is a more
4234       efficient way of obtaining this information than putting each  alterna‐
4235       tive in its own capturing parentheses.
4236
4237       If  a  verb  with a name is encountered in a positive assertion that is
4238       true, the name is recorded and passed back if it is  the  last  encoun‐
4239       tered.  This does not occur for negative assertions or failing positive
4240       assertions.
4241
4242       After a partial match or a failed match, the last encountered  name  in
4243       the entire match process is returned, for example:
4244
4245         re> /X(*MARK:A)Y|X(*MARK:B)Z/K
4246       data> XP
4247       No match, mark = B
4248
4249       Notice  that  in this unanchored example, the mark is retained from the
4250       match attempt that started at letter "X"  in  the  subject.  Subsequent
4251       match attempts starting at "P" and then with an empty string do not get
4252       as far as the (*MARK) item, nevertheless do not reset it.
4253
4254       Verbs That Act after Backtracking
4255
4256       The following verbs do nothing when they are encountered. Matching con‐
4257       tinues  with what follows, but if there is no subsequent match, causing
4258       a backtrack to the verb, a failure is  forced.  That  is,  backtracking
4259       cannot  pass  to the left of the verb. However, when one of these verbs
4260       appears inside an atomic group or an assertion that is true, its effect
4261       is confined to that group, as once the group has been matched, there is
4262       never any backtracking into it. In  this  situation,  backtracking  can
4263       "jump  back"  to the left of the entire atomic group or assertion. (Re‐
4264       member also, as stated above, that this localization  also  applies  in
4265       subroutine calls.)
4266
4267       These  verbs  differ  in exactly what kind of failure occurs when back‐
4268       tracking reaches them. The behavior described below is what occurs when
4269       the  verb  is  not in a subroutine or an assertion. Subsequent sections
4270       cover these special cases.
4271
4272       The following verb, which must not be followed by a  name,  causes  the
4273       whole  match to fail outright if there is a later matching failure that
4274       causes backtracking to reach it. Even if the pattern is unanchored,  no
4275       further  attempts  to find a match by advancing the starting point take
4276       place.
4277
4278       (*COMMIT)
4279
4280       If (*COMMIT) is the only backtracking verb that is encountered, once it
4281       has  been  passed,  run/2,3 is committed to find a match at the current
4282       starting point, or not at all, for example:
4283
4284       a+(*COMMIT)b
4285
4286       This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
4287       of dynamic anchor, or "I've started, so I must finish". The name of the
4288       most recently passed (*MARK) in the path is passed back when  (*COMMIT)
4289       forces a match failure.
4290
4291       If more than one backtracking verb exists in a pattern, a different one
4292       that follows (*COMMIT) can be triggered first, so merely passing (*COM‐
4293       MIT)  during  a match does not always guarantee that a match must be at
4294       this starting point.
4295
4296       Notice that (*COMMIT) at the start of a pattern is not the same  as  an
4297       anchor, unless the PCRE start-of-match optimizations are turned off, as
4298       shown in the following example:
4299
4300       1> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list}]).
4301       {match,["abc"]}
4302       2> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list},no_start_optimize]).
4303       nomatch
4304
4305       For this pattern, PCRE knows that any match must start with "a", so the
4306       optimization skips along the subject to "a" before applying the pattern
4307       to the first set of data. The match attempt then succeeds. In the  sec‐
4308       ond  call  the  no_start_optimize  disables the optimization that skips
4309       along to the first character. The pattern is now  applied  starting  at
4310       "x",  and  so the (*COMMIT) causes the match to fail without trying any
4311       other starting points.
4312
4313       The following verb causes the match to fail at the current starting po‐
4314       sition  in the subject if there is a later matching failure that causes
4315       backtracking to reach it:
4316
4317       (*PRUNE) or (*PRUNE:NAME)
4318
4319       If the pattern is unanchored, the normal  "bumpalong"  advance  to  the
4320       next starting character then occurs. Backtracking can occur as usual to
4321       the left of (*PRUNE), before it is reached, or  when  matching  to  the
4322       right  of (*PRUNE), but if there is no match to the right, backtracking
4323       cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) is just  an
4324       alternative  to an atomic group or possessive quantifier, but there are
4325       some uses of (*PRUNE) that cannot be expressed in any other way. In  an
4326       anchored pattern, (*PRUNE) has the same effect as (*COMMIT).
4327
4328       The    behavior   of   (*PRUNE:NAME)   is   the   not   the   same   as
4329       (*MARK:NAME)(*PRUNE). It is like (*MARK:NAME) in that the name  is  re‐
4330       membered for passing back to the caller. However, (*SKIP:NAME) searches
4331       only for names set with (*MARK).
4332
4333   Note:
4334       The fact that (*PRUNE:NAME) remembers the name is useless to the Erlang
4335       programmer, as names cannot be retrieved.
4336
4337
4338       The  following  verb,  when specified without a name, is like (*PRUNE),
4339       except that if the pattern is unanchored, the  "bumpalong"  advance  is
4340       not  to  the  next  character, but to the position in the subject where
4341       (*SKIP) was encountered.
4342
4343       (*SKIP)
4344
4345       (*SKIP) signifies that whatever text was matched leading up to it  can‐
4346       not be part of a successful match. Consider:
4347
4348       a+(*SKIP)b
4349
4350       If  the  subject  is  "aaaac...",  after  the first match attempt fails
4351       (starting at the first character in the  string),  the  starting  point
4352       skips  on  to  start  the next attempt at "c". Notice that a possessive
4353       quantifier does not have the same effect as this example;  although  it
4354       would  suppress backtracking during the first match attempt, the second
4355       attempt would start at the second character instead of skipping  on  to
4356       "c".
4357
4358       When (*SKIP) has an associated name, its behavior is modified:
4359
4360       (*SKIP:NAME)
4361
4362       When  this  is  triggered,  the  previous  path  through the pattern is
4363       searched for the most recent (*MARK) that has the same name. If one  is
4364       found,  the  "bumpalong" advance is to the subject position that corre‐
4365       sponds to that (*MARK) instead of to where (*SKIP) was encountered.  If
4366       no (*MARK) with a matching name is found, (*SKIP) is ignored.
4367
4368       Notice  that  (*SKIP:NAME) searches only for names set by (*MARK:NAME).
4369       It ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).
4370
4371       The following verb causes a skip to the next innermost alternative when
4372       backtracking  reaches  it. That is, it cancels any further backtracking
4373       within the current alternative.
4374
4375       (*THEN) or (*THEN:NAME)
4376
4377       The verb name comes from the observation that it can be used for a pat‐
4378       tern-based if-then-else block:
4379
4380       ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
4381
4382       If  the COND1 pattern matches, FOO is tried (and possibly further items
4383       after the end of the group if FOO succeeds). On  failure,  the  matcher
4384       skips  to  the second alternative and tries COND2, without backtracking
4385       into COND1. If that succeeds and BAR fails, COND3 is tried. If BAZ then
4386       fails, there are no more alternatives, so there is a backtrack to what‐
4387       ever came before the entire group. If (*THEN) is not inside an alterna‐
4388       tion, it acts like (*PRUNE).
4389
4390       The    behavior    of   (*THEN:NAME)   is   the   not   the   same   as
4391       (*MARK:NAME)(*THEN). It is like (*MARK:NAME) in that the name is remem‐
4392       bered  for  passing  back to the caller. However, (*SKIP:NAME) searches
4393       only for names set with (*MARK).
4394
4395   Note:
4396       The fact that (*THEN:NAME) remembers the name is useless to the  Erlang
4397       programmer, as names cannot be retrieved.
4398
4399
4400       A  subpattern that does not contain a | character is just a part of the
4401       enclosing alternative; it is not a nested alternation with only one al‐
4402       ternative.  The  effect  of (*THEN) extends beyond such a subpattern to
4403       the enclosing alternative. Consider the following pattern, where A,  B,
4404       and  so  on,  are  complex  pattern fragments that do not contain any |
4405       characters at this level:
4406
4407       A (B(*THEN)C) | D
4408
4409       If A and B are matched, but there is a failure in C, matching does  not
4410       backtrack into A; instead it moves to the next alternative, that is, D.
4411       However, if the subpattern containing (*THEN) is given an  alternative,
4412       it behaves differently:
4413
4414       A (B(*THEN)C | (*FAIL)) | D
4415
4416       The  effect of (*THEN) is now confined to the inner subpattern. After a
4417       failure in C, matching moves to (*FAIL), which causes the whole subpat‐
4418       tern  to  fail, as there are no more alternatives to try. In this case,
4419       matching does now backtrack into A.
4420
4421       Notice that a conditional subpattern is not considered  as  having  two
4422       alternatives,  as  only one is ever used. That is, the | character in a
4423       conditional subpattern has a different  meaning.  Ignoring  whitespace,
4424       consider:
4425
4426       ^.*? (?(?=a) a | b(*THEN)c )
4427
4428       If  the  subject  is  "ba",  this pattern does not match. As .*? is un‐
4429       greedy, it initially matches zero characters. The condition (?=a)  then
4430       fails,  the  character  "b"  is matched, but "c" is not. At this point,
4431       matching does not backtrack to .*? as can perhaps be expected from  the
4432       presence  of the | character. The conditional subpattern is part of the
4433       single alternative that comprises the whole pattern, and so  the  match
4434       fails.  (If  there  was a backtrack into .*?, allowing it to match "b",
4435       the match would succeed.)
4436
4437       The verbs described above provide four different "strengths" of control
4438       when subsequent matching fails:
4439
4440         * (*THEN)  is the weakest, carrying on the match at the next alterna‐
4441           tive.
4442
4443         * (*PRUNE) comes next, fails the match at the current starting  posi‐
4444           tion,  but  allows  an  advance to the next character (for an unan‐
4445           chored pattern).
4446
4447         * (*SKIP) is similar, except that the advance can be  more  than  one
4448           character.
4449
4450         * (*COMMIT) is the strongest, causing the entire match to fail.
4451
4452       More than One Backtracking Verb
4453
4454       If  more  than  one  backtracking verb is present in a pattern, the one
4455       that is backtracked onto first acts. For example, consider the  follow‐
4456       ing pattern, where A, B, and so on, are complex pattern fragments:
4457
4458       (A(*COMMIT)B(*THEN)C|ABD)
4459
4460       If  A matches but B fails, the backtrack to (*COMMIT) causes the entire
4461       match to fail. However, if A and B match, but C fails, the backtrack to
4462       (*THEN) causes the next alternative (ABD) to be tried. This behavior is
4463       consistent, but is not always the same as in Perl. It means that if two
4464       or  more  backtracking verbs appear in succession, the last of them has
4465       no effect. Consider the following example:
4466
4467       If there is a matching failure to the right, backtracking onto (*PRUNE)
4468       causes  it to be triggered, and its action is taken. There can never be
4469       a backtrack onto (*COMMIT).
4470
4471       Backtracking Verbs in Repeated Groups
4472
4473       PCRE differs from Perl in its handling of  backtracking  verbs  in  re‐
4474       peated groups. For example, consider:
4475
4476       /(a(*COMMIT)b)+ac/
4477
4478       If  the  subject  is  "abac",  Perl matches, but PCRE fails because the
4479       (*COMMIT) in the second repeat of the group acts.
4480
4481       Backtracking Verbs in Assertions
4482
4483       (*FAIL) in an assertion has its normal effect: it forces  an  immediate
4484       backtrack.
4485
4486       (*ACCEPT) in a positive assertion causes the assertion to succeed with‐
4487       out any further processing. In a negative assertion,  (*ACCEPT)  causes
4488       the assertion to fail without any further processing.
4489
4490       The  other  backtracking verbs are not treated specially if they appear
4491       in a positive assertion. In particular, (*THEN) skips to the  next  al‐
4492       ternative  in  the innermost enclosing group that has alternations, re‐
4493       gardless if this is within the assertion.
4494
4495       Negative assertions are, however, different, to ensure that changing  a
4496       positive  assertion into a negative assertion changes its result. Back‐
4497       tracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a negative  asser‐
4498       tion  to  be true, without considering any further alternative branches
4499       in the assertion. Backtracking into (*THEN) causes it to  skip  to  the
4500       next  enclosing alternative within the assertion (the normal behavior),
4501       but if the assertion does not have such an alternative, (*THEN) behaves
4502       like (*PRUNE).
4503
4504       Backtracking Verbs in Subroutines
4505
4506       These  behaviors  occur  regardless  if the subpattern is called recur‐
4507       sively. The treatment of subroutines  in  Perl  is  different  in  some
4508       cases.
4509
4510         * (*FAIL)  in  a subpattern called as a subroutine has its normal ef‐
4511           fect: it forces an immediate backtrack.
4512
4513         * (*ACCEPT) in a subpattern called as a subroutine causes the subrou‐
4514           tine match to succeed without any further processing. Matching then
4515           continues after the subroutine call.
4516
4517         * (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as  a  sub‐
4518           routine cause the subroutine match to fail.
4519
4520         * (*THEN)  skips  to  the next alternative in the innermost enclosing
4521           group within the subpattern that has alternatives. If there  is  no
4522           such  group  within  the  subpattern, (*THEN) causes the subroutine
4523           match to fail.
4524
4525Ericsson AB                      stdlib 5.1.1                            re(3)