PERLRE(1)              Perl Programmers Reference Guide              PERLRE(1)

NAME

       perlre - Perl regular expressions

DESCRIPTION

       This page describes the syntax of regular expressions in Perl.

       If you haven't used regular expressions before, a quick-start
       introduction is available in perlrequick, and a longer tutorial
       introduction is available in perlretut.

       For reference on how regular expressions are used in matching
       operations, plus various examples of the same, see discussions of
       "m//", "s///", "qr//" and "??" in "Regexp Quote-Like Operators" in
       perlop.

   Modifiers
       Matching operations can have various modifiers.  Modifiers that relate
       to the interpretation of the regular expression inside are listed
       below.  Modifiers that alter the way a regular expression is used by
       Perl are detailed in "Regexp Quote-Like Operators" in perlop and "Gory
       details of parsing quoted constructs" in perlop.

       m   Treat string as multiple lines.  That is, change "^" and "$" from
           matching the start or end of the string to matching the start or
           end of any line anywhere within the string.

       s   Treat string as single line.  That is, change "." to match any
           character whatsoever, even a newline, which normally it would not
           match.

           Used together, as "/ms", they let the "." match any character
           whatsoever, while still allowing "^" and "$" to match,
           respectively, just after and just before newlines within the
           string.

       i   Do case-insensitive pattern matching.

           If "use locale" is in effect, the case map is taken from the
           current locale.  See perllocale.

       x   Extend your pattern's legibility by permitting whitespace and
           comments.

       p   Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and
           ${^POSTMATCH} are available for use after matching.

       g and c
           Global matching, and keep the Current position after failed
           matching.  Unlike i, m, s and x, these two flags affect the way the
           regex is used rather than the regex itself. See "Using regular
           expressions in Perl" in perlretut for further explanation of the g
           and c modifiers.

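       As a minimal sketch of the "m" and "s" entries above (the sample
       string and variable names are made up for illustration), the
       following compares the same patterns with and without each modifier:

```perl
use strict;
use warnings;

# Hypothetical two-line sample string.
my $text = "first line\nsecond line\n";

# Without /m, "^" matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, "^" also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, "." cannot cross the newline, so this does not match.
my ($no_s) = $text =~ /(first.*second)/;

# With /s, "." matches the newline as well.
my ($with_s) = $text =~ /(first.*second)/s;

print "without /m: @plain\n";
print "with /m:    @multi\n";
print "without /s: ", (defined $no_s ? "matched" : "no match"), "\n";
print "with /s:    ", (defined $with_s ? "matched" : "no match"), "\n";
```
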
       These are usually written as "the "/x" modifier", even though the
       delimiter in question might not really be a slash.  Any of these
       modifiers may also be embedded within the regular expression itself
       using the "(?...)" construct.  See below.

       The "/x" modifier itself needs a little more explanation.  It tells the
       regular expression parser to ignore most whitespace that is neither
       backslashed nor within a character class.  You can use this to break up
       your regular expression into (slightly) more readable parts.  The "#"
       character is also treated as a metacharacter introducing a comment,
       just as in ordinary Perl code.  This also means that if you want real
       whitespace or "#" characters in the pattern (outside a character class,
       where they are unaffected by "/x"), then you'll either have to escape
       them (using backslashes or "\Q...\E") or encode them using octal, hex,
       or "\N{}" escapes.  Taken together, these features go a long way
       towards making Perl's regular expressions more readable.  Note that you
       have to be careful not to include the pattern delimiter in the
       comment--perl has no way of knowing you did not intend to close the
       pattern early.  See the C-comment deletion code in perlop.  Also note
       that anything inside a "\Q...\E" stays unaffected by "/x".  And note
       that "/x" doesn't affect space interpretation within a single
       multi-character construct.  For example in "\x{...}", regardless of the
       "/x" modifier, there can be no spaces.  Same for a quantifier such as
       "{3}" or "{5,}".  Similarly, "(?:...)" can't have a space between the
       "?" and ":", but can between the "(" and "?".  Within any delimiters
       for such a construct, allowed spaces are not affected by "/x", and
       depend on the construct.  For example, "\x{...}" can't have spaces
       because hexadecimal numbers don't have spaces in them.  But, Unicode
       properties can have spaces, so in "\p{...}"  there can be spaces that
       follow the Unicode rules, for which see "Properties accessible through
       \p{} and \P{}" in perluniprops.

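       The "/x" rules above can be sketched as follows.  The pattern, the
       input format and the variable names are hypothetical; brace
       delimiters are used so that a stray slash in a comment cannot close
       the pattern early:

```perl
use strict;
use warnings;

# Hypothetical format: an ISO-like date, a space, "#" and a ticket number.
my $entry_re = qr{
    (\d{4})     # year (no spaces are allowed inside the quantifier)
    -           # literal hyphen
    (\d{2})     # month
    -
    (\d{2})     # day
    [ ]         # a real space, protected by the character class
    \#          # a real hash sign, backslashed so it starts no comment
    (\d+)       # ticket number
}x;

my ($year, $month, $day, $ticket) = "2010-04-15 #42" =~ $entry_re;
print "$year $month $day $ticket\n";
```
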
   Regular Expressions
       Metacharacters

       The patterns used in Perl pattern matching evolved from those supplied
       in the Version 8 regex routines.  (The routines are derived (distantly)
       from Henry Spencer's freely redistributable reimplementation of the V8
       routines.)  See "Version 8 Regular Expressions" for details.

       In particular the following metacharacters have their standard
       egrep-ish meanings:

           \   Quote the next metacharacter
           ^   Match the beginning of the line
           .   Match any character (except newline)
           $   Match the end of the line (or before newline at the end)
           |   Alternation
           ()  Grouping
           []  Bracketed Character class

       By default, the "^" character is guaranteed to match only the beginning
       of the string, the "$" character only the end (or before the newline at
       the end), and Perl does certain optimizations with the assumption that
       the string contains only one line.  Embedded newlines will not be
       matched by "^" or "$".  You may, however, wish to treat a string as a
       multi-line buffer, such that the "^" will match after any newline
       within the string (except if the newline is the last character in the
       string), and "$" will match before any newline.  At the cost of a
       little more overhead, you can do this by using the /m modifier on the
       pattern match operator.  (Older programs did this by setting $*, but
       this practice has been removed in perl 5.9.)

       To simplify multi-line substitutions, the "." character never matches a
       newline unless you use the "/s" modifier, which in effect tells Perl to
       pretend the string is a single line--even if it isn't.

       Quantifiers

       The following standard quantifiers are recognized:

           *      Match 0 or more times
           +      Match 1 or more times
           ?      Match 1 or 0 times
           {n}    Match exactly n times
           {n,}   Match at least n times
           {n,m}  Match at least n but not more than m times

       (If a curly bracket occurs in any other context, it is treated as a
       regular character.  In particular, the lower bound is not optional.)
       The "*" quantifier is equivalent to "{0,}", the "+" quantifier to
       "{1,}", and the "?" quantifier to "{0,1}".  n and m are limited to
       non-negative integral values less than a preset limit defined when perl
       is built.  This is usually 32766 on the most common platforms.  The
       actual limit can be seen in the error message generated by code such as
       this:

           $_ **= $_ , / {$_} / for 2 .. 42;

       By default, a quantified subpattern is "greedy", that is, it will match
       as many times as possible (given a particular starting location) while
       still allowing the rest of the pattern to match.  If you want it to
       match the minimum number of times possible, follow the quantifier with
       a "?".  Note that the meanings don't change, just the "greediness":

           *?     Match 0 or more times, not greedily
           +?     Match 1 or more times, not greedily
           ??     Match 0 or 1 time, not greedily
           {n}?   Match exactly n times, not greedily
           {n,}?  Match at least n times, not greedily
           {n,m}? Match at least n but not more than m times, not greedily

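       A small sketch of the difference between greedy and non-greedy
       matching, using a made-up fragment of markup as input:

```perl
use strict;
use warnings;

# Hypothetical input with two pairs of angle-bracketed tags.
my $html = '<b>bold</b> and <i>italic</i>';

# Greedy: ".+" runs to the end and backs off only as far as the last ">".
my ($greedy) = $html =~ /<(.+)>/;

# Non-greedy: ".+?" stops at the first ">" that lets the match succeed.
my ($lazy) = $html =~ /<(.+?)>/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```
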
       By default, when a quantified subpattern does not allow the rest of the
       overall pattern to match, Perl will backtrack. However, this behaviour
       is sometimes undesirable. Thus Perl provides the "possessive"
       quantifier form as well.

           *+     Match 0 or more times and give nothing back
           ++     Match 1 or more times and give nothing back
           ?+     Match 0 or 1 time and give nothing back
           {n}+   Match exactly n times and give nothing back (redundant)
           {n,}+  Match at least n times and give nothing back
           {n,m}+ Match at least n but not more than m times and give nothing back

       For instance,

          'aaaa' =~ /a++a/

       will never match, as the "a++" will gobble up all the "a"'s in the
       string and won't leave any for the remaining part of the pattern. This
       feature can be extremely useful to give perl hints about where it
       shouldn't backtrack. For instance, the typical "match a double-quoted
       string" problem can be most efficiently performed when written as:

          /"(?:[^"\\]++|\\.)*+"/

       as we know that if the final quote does not match, backtracking will
       not help. See the independent subexpression "(?>...)" for more details;
       possessive quantifiers are just syntactic sugar for that construct. For
       instance the above example could also be written as follows:

          /"(?>(?:(?>[^"\\]+)|\\.)*)"/

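       The two possessive examples above can be tried out with a sketch
       like the following.  It assumes perl 5.10 or later; the sample
       source line and the variable names are invented for illustration:

```perl
use strict;
use warnings;

# "a++" keeps all four characters, so the trailing "a" can never match.
my $possessive = ('aaaa' =~ /a++a/) ? 1 : 0;

# The ordinary greedy form gives one "a" back and succeeds.
my $greedy = ('aaaa' =~ /a+a/) ? 1 : 0;

# The double-quoted-string pattern from the text, compiled once.
my $dq = qr/"(?:[^"\\]++|\\.)*+"/;

# Hypothetical input; in a single-quoted string \" stays as two characters.
my $src = 'say "a \"quoted\" word" here';
my ($str) = $src =~ /($dq)/;

print "possessive: $possessive, greedy: $greedy\n";
print "string literal found: $str\n";
```
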
       Escape sequences

       Because patterns are processed as double quoted strings, the following
       also work:

           \t          tab                   (HT, TAB)
           \n          newline               (LF, NL)
           \r          return                (CR)
           \f          form feed             (FF)
           \a          alarm (bell)          (BEL)
           \e          escape (think troff)  (ESC)
           \033        octal char            (example: ESC)
           \x1B        hex char              (example: ESC)
           \x{263a}    long hex char         (example: Unicode SMILEY)
           \cK         control char          (example: VT)
           \N{name}    named Unicode character
           \N{U+263D}  Unicode character     (example: FIRST QUARTER MOON)
           \l          lowercase next char (think vi)
           \u          uppercase next char (think vi)
           \L          lowercase till \E (think vi)
           \U          uppercase till \E (think vi)
           \Q          quote (disable) pattern metacharacters till \E
           \E          end either case modification or quoted section (think vi)

       Details are in "Quote and Quote-like Operators" in perlop.

       Character Classes and other Special Escapes

       In addition, Perl defines the following:

         Sequence   Note    Description
          [...]     [1]  Match a character according to the rules of the bracketed
                           character class defined by the "...".  Example: [a-z]
                           matches "a" or "b" or "c" ... or "z"
          [[:...:]] [2]  Match a character according to the rules of the POSIX
                           character class "..." within the outer bracketed character
                           class.  Example: [[:upper:]] matches any uppercase
                           character.
          \w        [3]  Match a "word" character (alphanumeric plus "_")
          \W        [3]  Match a non-"word" character
          \s        [3]  Match a whitespace character
          \S        [3]  Match a non-whitespace character
          \d        [3]  Match a decimal digit character
          \D        [3]  Match a non-digit character
          \pP       [3]  Match P, named property.  Use \p{Prop} for longer names.
          \PP       [3]  Match non-P
          \X        [4]  Match Unicode "eXtended grapheme cluster"
          \C             Match a single C-language char (octet) even if that is part
                           of a larger UTF-8 character.  Thus it breaks up characters
                           into their UTF-8 bytes, so you may end up with malformed
                           pieces of UTF-8.  Unsupported in lookbehind.
          \1        [5]  Backreference to a specific capture buffer or group.
                           '1' may actually be any positive integer.
          \g1       [5]  Backreference to a specific or previous group,
          \g{-1}    [5]  The number may be negative indicating a relative previous
                           buffer and may optionally be wrapped in curly brackets for
                           safer parsing.
          \g{name}  [5]  Named backreference
          \k<name>  [5]  Named backreference
          \K        [6]  Keep the stuff left of the \K, don't include it in $&
          \N        [7]  Any character but \n (experimental).  Not affected by /s
                           modifier
          \v        [3]  Vertical whitespace
          \V        [3]  Not vertical whitespace
          \h        [3]  Horizontal whitespace
          \H        [3]  Not horizontal whitespace
          \R        [4]  Linebreak

       [1] See "Bracketed Character Classes" in perlrecharclass for details.

       [2] See "POSIX Character Classes" in perlrecharclass for details.

       [3] See "Backslash sequences" in perlrecharclass for details.

       [4] See "Misc" in perlrebackslash for details.

       [5] See "Capture buffers" below for details.

       [6] See "Extended Patterns" below for details.

       [7] Note that "\N" has two meanings.  When of the form "\N{NAME}", it
           matches the character whose name is "NAME"; and similarly when of
           the form "\N{U+wide hex char}", it matches the character whose
           Unicode ordinal is wide hex char.  Otherwise it matches any
           character but "\n".

       Assertions

       Perl defines the following zero-width assertions:

           \b  Match a word boundary
           \B  Match except at a word boundary
           \A  Match only at beginning of string
           \Z  Match only at end of string, or before newline at the end
           \z  Match only at end of string
           \G  Match only at pos() (e.g. at the end-of-match position
               of prior m//g)

       A word boundary ("\b") is a spot between two characters that has a "\w"
       on one side of it and a "\W" on the other side of it (in either order),
       counting the imaginary characters off the beginning and end of the
       string as matching a "\W".  (Within character classes "\b" represents
       backspace rather than a word boundary, just as it normally does in any
       double-quoted string.)  The "\A" and "\Z" are just like "^" and "$",
       except that they won't match multiple times when the "/m" modifier is
       used, while "^" and "$" will match at every internal line boundary.  To
       match the actual end of the string and not ignore an optional trailing
       newline, use "\z".

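       The differences between "$" under "/m", "\Z" and "\z", and the
       behaviour of "\b", can be seen in a short sketch such as this one
       (the sample strings are arbitrary):

```perl
use strict;
use warnings;

# Hypothetical two-line string ending in a newline.
my $s = "one\ntwo\n";

# Under /m, "$" matches before every newline, so both words are found.
my @line_ends = $s =~ /(\w+)$/mg;

# "\Z" matches only at the end of the string, or before a final newline.
my ($before_final_nl) = $s =~ /(\w+)\Z/;

# "\z" matches only at the very end; here a newline sits before it.
my $word_at_very_end = ($s =~ /\w+\z/) ? 1 : 0;

# "\b" falls between \w and \W, so the apostrophe splits "can't".
my @words = "can't stop" =~ /\b(\w+)\b/g;

print "line ends: @line_ends\n";
print "before final newline: $before_final_nl\n";
print "word at very end: $word_at_very_end\n";
print "words: @words\n";
```
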
       The "\G" assertion can be used to chain global matches (using "m//g"),
       as described in "Regexp Quote-Like Operators" in perlop.  It is also
       useful when writing "lex"-like scanners, when you have several patterns
       that you want to match against consecutive substrings of your string;
       see the previous reference.  The actual location where "\G" will match
       can also be influenced by using "pos()" as an lvalue: see "pos" in
       perlfunc. Note that the rule for zero-length matches is modified
       somewhat, in that contents to the left of "\G" are not counted when
       determining the length of the match. Thus the following will not match
       forever:

           $str = 'ABC';
           pos($str) = 1;
           while ($str =~ /.\G/g) {
               print $&;
           }

       It will print 'A' and then terminate, as it considers the match to be
       zero-width, and thus will not match at the same position twice in a
       row.

       It is worth noting that "\G" improperly used can result in an infinite
       loop. Take care when using patterns that include "\G" in an
       alternation.

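       A "lex"-like scanner of the kind mentioned above might be sketched
       as follows.  The token classes, the input and the labels are
       invented for illustration; the sketch relies on the "c" modifier to
       keep pos() in place when one of the alternatives fails:

```perl
use strict;
use warnings;

# Hypothetical input and token classes.
my $input = "x = 42 + y";
my @tokens;

pos($input) = 0;
while (pos($input) < length $input) {
    # Each pattern is anchored with \G at the current position; /c keeps
    # that position when a pattern fails, so the next one can be tried.
    if    ($input =~ /\G\s+/gc)            { next }
    elsif ($input =~ /\G(\d+)/gc)          { push @tokens, "NUM:$1" }
    elsif ($input =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, "ID:$1" }
    elsif ($input =~ /\G([=+])/gc)         { push @tokens, "OP:$1" }
    else {
        die "unexpected character at offset " . pos($input) . "\n";
    }
}

print "@tokens\n";
```
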
       Capture buffers

       The bracketing construct "( ... )" creates capture buffers. To refer to
       the current contents of a buffer later on, within the same pattern, use
       \1 for the first, \2 for the second, and so on.  Outside the match use
       "$" instead of "\".  (The \<digit> notation works in certain
       circumstances outside the match.  See "Warning on \1 Instead of $1"
       below for details.)  Referring back to another part of the match is
       called a backreference.

       There is no limit to the number of captured substrings that you may
       use.  However Perl also uses \10, \11, etc. as aliases for \010, \011,
       etc.  (Recall that 0 means octal, so \011 is the character at number 9
       in your coded character set; which would be the 10th character, a
       horizontal tab under ASCII.)  Perl resolves this ambiguity by
       interpreting \10 as a backreference only if at least 10 left
       parentheses have opened before it.  Likewise \11 is a backreference
       only if at least 11 left parentheses have opened before it.  And so on.
       \1 through \9 are always interpreted as backreferences.  If the
       bracketing group did not match, the associated backreference won't
       match either. (This can happen if the bracketing group is optional, or
       in a different branch of an alternation.)

       In order to provide a safer and easier way to construct patterns using
       backreferences, Perl provides the "\g{N}" notation (starting with perl
       5.10.0). The curly brackets are optional; however, omitting them is
       less safe, as the meaning of the pattern can be changed by text (such
       as digits) following it. When N is a positive integer the "\g{N}"
       notation is exactly equivalent to using normal backreferences. When N
       is a negative integer then it is a relative backreference referring to
       the previous N'th capturing group. When the bracket form is used and N
       is not an integer, it is treated as a reference to a named buffer.

       Thus "\g{-1}" refers to the last buffer, "\g{-2}" refers to the buffer
       before that. For example:

               /
                (Y)            # buffer 1
                (              # buffer 2
                   (X)         # buffer 3
                   \g{-1}      # backref to buffer 3
                   \g{-3}      # backref to buffer 1
                )
               /x

       This would match the same as "/(Y) ( (X) \3 \1 )/x".

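       One place where relative backreferences pay off is when a pattern
       fragment is interpolated into a larger pattern, since the group
       numbers are then counted across the combined pattern.  The
       following is a sketch under that assumption, with made-up fragment
       names and input:

```perl
use strict;
use warnings;

# Two hypothetical fragments meant to match a doubled word character.
my $relative = qr/(\w)\g{-1}/;   # refers to the group just before it
my $absolute = qr/(\w)\1/;       # always refers to group 1

my $input = "id=7 zz";

# Inside the larger pattern (\d) becomes group 1 and (\w) group 2.
# \g{-1} still points at (\w), while \1 now points at the digit.
my $with_relative = ($input =~ /(\d) $relative/) ? 1 : 0;
my $with_absolute = ($input =~ /(\d) $absolute/) ? 1 : 0;

print "relative fragment matched: $with_relative\n";
print "absolute fragment matched: $with_absolute\n";
```
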
       Additionally, as of Perl 5.10.0 you may use named capture buffers and
       named backreferences. The notation is "(?<name>...)" to declare and
       "\k<name>" to reference. You may also use apostrophes instead of angle
       brackets to delimit the name; and you may use the bracketed "\g{name}"
       backreference syntax.  It's possible to refer to a named capture buffer
       by absolute and relative number as well.  Outside the pattern, a named
       capture buffer is available via the "%+" hash.  When different buffers
       within the same pattern have the same name, $+{name} and "\k<name>"
       refer to the leftmost defined group. (Thus it's possible to do things
       with named capture buffers that would otherwise require "(??{})" code
       to accomplish.)

       Examples:

           s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

           /(.)\1/                         # find first doubled char
                and print "'$1' is the first doubled character\n";

           /(?<char>.)\k<char>/            # ... a different way
                and print "'$+{char}' is the first doubled character\n";

           /(?'char'.)\1/                  # ... mix and match
                and print "'$1' is the first doubled character\n";

           if (/Time: (..):(..):(..)/) {   # parse out values
               $hours = $1;
               $minutes = $2;
               $seconds = $3;
           }

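       The last example can also be written with named capture buffers.
       This is only a sketch, assuming perl 5.10 or later; the buffer
       names and the sample line are chosen for illustration:

```perl
use strict;
use warnings;

# Hypothetical input line.
my $line = 'Time: 12:34:56';
my %time;

if ($line =~ /Time: (?<hours>\d\d):(?<minutes>\d\d):(?<seconds>\d\d)/) {
    # Copy the named captures; %+ changes with the next successful match.
    %time = %+;
    print "$time{hours}h $time{minutes}m $time{seconds}s\n";

    # Named buffers are still reachable by number.
    print "first buffer by number: $1\n";
}
```
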
       Several special variables also refer back to portions of the previous
       match.  $+ returns whatever the last bracket match matched.  $& returns
       the entire matched string.  (At one point $0 did also, but now it
       returns the name of the program.)  "$`" returns everything before the
       matched string.  "$'" returns everything after the matched string. And
       $^N contains whatever was matched by the most-recently closed group
       (submatch). $^N can be used in extended patterns (see below), for
       example to assign a submatch to a variable.

       The numbered match variables ($1, $2, $3, etc.) and the related
       punctuation set ($+, $&, "$`", "$'", and $^N) are all dynamically
       scoped until the end of the enclosing block or until the next
       successful match, whichever comes first.  (See "Compound Statements" in
       perlsyn.)

       NOTE: Failed matches in Perl do not reset the match variables, which
       makes it easier to write code that tests for a series of more specific
       cases and remembers the best match.

       WARNING: Once Perl sees that you need one of $&, "$`", or "$'" anywhere
       in the program, it has to provide them for every pattern match.  This
       may substantially slow your program.  Perl uses the same mechanism to
       produce $1, $2, etc, so you also pay a price for each pattern that
       contains capturing parentheses.  (To avoid this cost while retaining
       the grouping behaviour, use the extended regular expression "(?: ... )"
       instead.)  But if you never use $&, "$`" or "$'", then patterns without
       capturing parentheses will not be penalized.  So avoid $&, "$'", and
       "$`" if you can, but if you can't (and some algorithms really
       appreciate them), once you've used them once, use them at will, because
       you've already paid the price.  As of 5.005, $& is not so costly as the
       other two.

       As a workaround for this problem, Perl 5.10.0 introduces
       "${^PREMATCH}", "${^MATCH}" and "${^POSTMATCH}", which are equivalent
       to "$`", $& and "$'", except that they are only guaranteed to be
       defined after a successful match that was executed with the "/p"
       (preserve) modifier.  The use of these variables incurs no global
       performance penalty, unlike their punctuation char equivalents, however
       at the trade-off that you have to tell perl when you want to use them.

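       A minimal sketch of the "/p" variables, assuming perl 5.10 or
       later and a made-up "key=value" string:

```perl
use strict;
use warnings;

# Hypothetical input.
my $setting = 'key=value';
my ($pre, $match, $post);

# /p asks perl to preserve the pieces for this match only.
if ($setting =~ /=/p) {
    $pre   = ${^PREMATCH};    # text before the match
    $match = ${^MATCH};       # the matched text itself
    $post  = ${^POSTMATCH};   # text after the match
    print "before: $pre, matched: $match, after: $post\n";
}
```
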
   Quoting metacharacters
       Backslashed metacharacters in Perl are alphanumeric, such as "\b",
       "\w", "\n".  Unlike some other regular expression languages, there are
       no backslashed symbols that aren't alphanumeric.  So anything that
       looks like \\, \(, \), \<, \>, \{, or \} is always interpreted as a
       literal character, not a metacharacter.  This was once used in a common
       idiom to disable or quote the special meanings of regular expression
       metacharacters in a string that you want to use for a pattern. Simply
       quote all non-"word" characters:

           $pattern =~ s/(\W)/\\$1/g;

       (If "use locale" is set, then this depends on the current locale.)
       Today it is more common to use the quotemeta() function or the "\Q"
       metaquoting escape sequence to disable all metacharacters' special
       meanings like this:

           /$unquoted\Q$quoted\E$unquoted/

       Beware that if you put literal backslashes (those not inside
       interpolated variables) between "\Q" and "\E", double-quotish backslash
       interpolation may lead to confusing results.  If you need to use
       literal backslashes within "\Q...\E", consult "Gory details of parsing
       quoted constructs" in perlop.

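       The effect of quotemeta() and "\Q" can be sketched with a string
       that happens to contain metacharacters.  The strings below are
       arbitrary examples:

```perl
use strict;
use warnings;

# Hypothetical text that should be matched literally.
my $literal = 'a.b*c';

# quotemeta() backslashes every non-"word" character.
my $quoted = quotemeta $literal;

# With \Q...\E the "." and "*" are taken literally.
my $exact_hit  = ('a.b*c' =~ /^\Q$literal\E$/) ? 1 : 0;
my $exact_miss = ('aXbbc' =~ /^\Q$literal\E$/) ? 1 : 0;

# Without quoting, "." matches any character and "*" repeats the "b".
my $raw_hit = ('aXbbc' =~ /^$literal$/) ? 1 : 0;

print "quoted form: $quoted\n";
print "quoted: $exact_hit $exact_miss, unquoted: $raw_hit\n";
```
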
   Extended Patterns
       Perl also defines a consistent extension syntax for features not found
       in standard tools like awk and lex.  The syntax is a pair of
       parentheses with a question mark as the first thing within the
       parentheses.  The character after the question mark indicates the
       extension.

       The stability of these extensions varies widely.  Some have been part
       of the core language for many years.  Others are experimental and may
       change without warning or be completely removed.  Check the
       documentation on an individual feature to verify its current status.

       A question mark was chosen for this and for the minimal-matching
       construct because 1) question marks are rare in older regular
       expressions, and 2) whenever you see one, you should stop and
       "question" exactly what is going on.  That's psychology...

       "(?#text)"
                 A comment.  The text is ignored.  If the "/x" modifier
                 enables whitespace formatting, a simple "#" will suffice.
                 Note that Perl closes the comment as soon as it sees a ")",
                 so there is no way to put a literal ")" in the comment.

       "(?pimsx-imsx)"
                 One or more embedded pattern-match modifiers, to be turned on
                 (or turned off, if preceded by "-") for the remainder of the
                 pattern or the remainder of the enclosing pattern group (if
                 any). This is particularly useful for dynamic patterns, such
                 as those read in from a configuration file, taken from an
                 argument, or specified in a table somewhere.  Consider the
                 case where some patterns want to be case sensitive and some
                 do not:  The case insensitive ones merely need to include
                 "(?i)" at the front of the pattern.  For example:

                     $pattern = "foobar";
                     if ( /$pattern/i ) { }

                     # more flexible:

                     $pattern = "(?i)foobar";
                     if ( /$pattern/ ) { }

                 These modifiers are restored at the end of the enclosing
                 group. For example,

                     ( (?i) blah ) \s+ \1

                 will match "blah" in any case, some spaces, and an exact
                 (including the case!)  repetition of the previous word,
                 assuming the "/x" modifier, and no "/i" modifier outside this
                 group.

                 These modifiers do not carry over into named subpatterns
                 called in the enclosing group. In other words, a pattern such
                 as "((?i)(?&NAME))" does not change the case-sensitivity of
                 the "NAME" pattern.

                 Note that the "p" modifier is special in that it can only be
                 enabled, not disabled, and that its presence anywhere in a
                 pattern has a global effect. Thus "(?-p)" and "(?-p:...)" are
                 meaningless and will warn when executed under "use warnings".

       "(?:pattern)"
       "(?imsx-imsx:pattern)"
                 This is for clustering, not capturing; it groups
                 subexpressions like "()", but doesn't make backreferences as
                 "()" does.  So

                     @fields = split(/\b(?:a|b|c)\b/)

                 is like

                     @fields = split(/\b(a|b|c)\b/)

                 but doesn't spit out extra fields.  It's also cheaper not to
                 capture characters if you don't need to.

                 Any letters between "?" and ":" act as flag modifiers as
                 with "(?imsx-imsx)".  For example,

                     /(?s-i:more.*than).*million/i

                 is equivalent to the more verbose

                     /(?:(?s-i)more.*than).*million/i

       "(?|pattern)"
                 This is the "branch reset" pattern, which has the special
                 property that the capture buffers are numbered from the same
                 starting point in each alternation branch. It is available
                 starting from perl 5.10.0.

                 Capture buffers are numbered from left to right, but inside
                 this construct the numbering is restarted for each branch.

                 The numbering within each branch will be as normal, and any
                 buffers following this construct will be numbered as though
                 the construct contained only one branch, that being the one
                 with the most capture buffers in it.

                 This construct will be useful when you want to capture one of
                 a number of alternative matches.

                 Consider the following pattern.  The numbers underneath show
                 in which buffer the captured content will be stored.

                     # before  ---------------branch-reset----------- after
                     / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
                     # 1            2         2  3        2     3     4

                 Be careful when using the branch reset pattern in combination
                 with named captures. Named captures are implemented as being
                 aliases to numbered buffers holding the captures, and that
                 interferes with the implementation of the branch reset
                 pattern. If you are using named captures in a branch reset
                 pattern, it's best to use the same names, in the same order,
                 in each of the alternations:

                    /(?|  (?<a> x ) (?<b> y )
                       |  (?<a> z ) (?<b> w )) /x

                 Not doing so may lead to surprises:

                   "12" =~ /(?| (?<a> \d+ ) | (?<b> \D+))/x;
                   say $+ {a};   # Prints '12'
                   say $+ {b};   # *Also* prints '12'.

                 The problem here is that both the buffer named "a" and the
                 buffer named "b" are aliases for the buffer belonging to $1.

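                 As a sketch of the "capture one of a number of
                 alternative matches" use case, the following accepts a
                 value in three hypothetical forms and always finds it in
                 $1.  The helper name "value_of" and the inputs are
                 assumptions made for illustration; perl 5.10 or later is
                 required:

```perl
use strict;
use warnings;

# Double-quoted, single-quoted or bare value; every branch uses buffer 1.
my $value = qr/(?| "([^"]*)" | '([^']*)' | (\S+) )/x;

# Hypothetical helper: return whatever follows the first "=".
sub value_of {
    my ($setting) = @_;
    return $setting =~ /=$value/ ? $1 : undef;
}

print value_of(q{name="Ann Lee"}), "\n";
print value_of(q{name='Bob'}), "\n";
print value_of(q{name=carol}), "\n";
```
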
597       Look-Around Assertions
598                 Look-around assertions are zero width patterns which match a
599                 specific pattern without including it in $&. Positive
600                 assertions match when their subpattern matches, negative
601                 assertions match when their subpattern fails. Look-behind
602                 matches text up to the current match position, look-ahead
603                 matches text following the current match position.
604
605                 "(?=pattern)"
606                     A zero-width positive look-ahead assertion.  For example,
607                     "/\w+(?=\t)/" matches a word followed by a tab, without
608                     including the tab in $&.
609
610                 "(?!pattern)"
611                     A zero-width negative look-ahead assertion.  For example
612                     "/foo(?!bar)/" matches any occurrence of "foo" that isn't
613                     followed by "bar".  Note however that look-ahead and
614                     look-behind are NOT the same thing.  You cannot use this
615                     for look-behind.
616
617                     If you are looking for a "bar" that isn't preceded by a
618                     "foo", "/(?!foo)bar/" will not do what you want.  That's
619                     because the "(?!foo)" is just saying that the next thing
620                     cannot be "foo"--and it's not, it's a "bar", so "foobar"
621                     will match.  You would have to do something like
622                     "/(?!foo)...bar/" for that.   We say "like" because
623                     there's the case of your "bar" not having three
624                     characters before it.  You could cover that this way:
625                     "/(?:(?!foo)...|^.{0,2})bar/".  Sometimes it's still
626                     easier just to say:
627
628                         if (/bar/ && $` !~ /foo$/)
629
630                     For look-behind see below.
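
                    The differences described above can be checked with a
                    small sketch.  It tries the plain look-ahead, the padded
                    look-ahead workaround, and a negative look-behind on
                    three sample strings; the strings and hash keys are
                    assumptions made for illustration.

```perl
use strict;
use warnings;

my %result;
for my $s (qw(foobar bazbar bar)) {
    $result{$s} = {
        lookahead  => ($s =~ /(?!foo)bar/                ? 1 : 0),
        padded     => ($s =~ /(?:(?!foo)...|^.{0,2})bar/ ? 1 : 0),
        lookbehind => ($s =~ /(?<!foo)bar/               ? 1 : 0),
    };
    printf "%-7s lookahead=%d padded=%d lookbehind=%d\n",
        $s, @{ $result{$s} }{qw(lookahead padded lookbehind)};
}
```

                    Only the plain look-ahead accepts "foobar"; the other two
                    forms reject it while still accepting "bazbar" and a bare
                    "bar".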
631
632                 "(?<=pattern)" "\K"
633                     A zero-width positive look-behind assertion.  For
634                     example, "/(?<=\t)\w+/" matches a word that follows a
635                     tab, without including the tab in $&.  Works only for
636                     fixed-width look-behind.
637
638                     There is a special form of this construct, called "\K",
639                     which causes the regex engine to "keep" everything it had
640                     matched prior to the "\K" and not include it in $&. This
641                     effectively provides variable length look-behind. The use
642                     of "\K" inside of another look-around assertion is
643                     allowed, but the behaviour is currently not well defined.
644
645                     For various reasons "\K" may be significantly more
646                     efficient than the equivalent "(?<=...)" construct, and
647                     it is especially useful in situations where you want to
648                     efficiently remove something following something else in
649                     a string. For instance
650
651                       s/(foo)bar/$1/g;
652
653                     can be rewritten as the much more efficient
654
655                       s/foo\Kbar//g;
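
                    A minimal sketch of the two substitutions side by side,
                    plus a case where the kept prefix has variable length.
                    The sample strings and variable names are assumptions
                    for illustration; no timing claims are made here.

```perl
use strict;
use warnings;

my $text = "foobar bar foobarbar";

# Capture-and-reinsert versus "keep" with \K: same result.
(my $with_capture = $text) =~ s/(foo)bar/$1/g;
(my $with_keep    = $text) =~ s/foo\Kbar//g;

# \K after a prefix of unbounded length, which a fixed-width
# look-behind cannot express.
(my $variable = "fobar foooobar") =~ s/fo+\Kbar//g;

print "$with_capture\n$with_keep\n$variable\n";
```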
656
657                 "(?<!pattern)"
658                     A zero-width negative look-behind assertion.  For example
659                     "/(?<!bar)foo/" matches any occurrence of "foo" that does
660                     not follow "bar".  Works only for fixed-width look-
661                     behind.
662
663       "(?'NAME'pattern)"
664       "(?<NAME>pattern)"
665                 A named capture buffer. Identical in every respect to normal
666                 capturing parentheses "()" but for the additional fact that
667                 "%+" or "%-" may be used after a successful match to refer to
668                 a named buffer. See "perlvar" for more details on the "%+"
669                 and "%-" hashes.
670
671                 If multiple distinct capture buffers have the same name, then
672                 $+{NAME} will refer to the leftmost defined buffer in the
673                 match.
674
675                 The forms "(?'NAME'pattern)" and "(?<NAME>pattern)" are
676                 equivalent.
677
678                 NOTE: While the notation of this construct is the same as the
679                 similar function in .NET regexes, the behavior is not. In
680                 Perl the buffers are numbered sequentially regardless of
681                 being named or not. Thus in the pattern
682
683                   /(x)(?<foo>y)(z)/
684
685                 $+{foo} will be the same as $2, and $3 will contain 'z'
686                 instead of the opposite which is what a .NET regex hacker
687                 might expect.
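
                 A short sketch of the numbering rule just described: the
                 named buffer is also the second numbered buffer.  The
                 variable names are assumptions for the example.

```perl
use strict;
use warnings;

my ($named, $second, $third);
if ('xyz' =~ /(x)(?<foo>y)(z)/) {
    # The named buffer "foo" is simply an alias for buffer 2.
    ($named, $second, $third) = ($+{foo}, $2, $3);
    print "foo=$named \$2=$second \$3=$third\n";
}
```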
688
689                 Currently NAME is restricted to simple identifiers only.  In
690                 other words, it must match "/^[_A-Za-z][_A-Za-z0-9]*\z/" or
691                 its Unicode extension (see utf8), though it isn't extended by
692                 the locale (see perllocale).
693
694                 NOTE: In order to make things easier for programmers with
695                 experience with the Python or PCRE regex engines, the pattern
696                 "(?P<NAME>pattern)" may be used instead of
697                 "(?<NAME>pattern)"; however this form does not support the
698                 use of single quotes as a delimiter for the name.
699
700       "\k<NAME>"
701       "\k'NAME'"
702                 Named backreference. Similar to numeric backreferences,
703                 except that the group is designated by name and not number.
704                 If multiple groups have the same name then it refers to the
705                 leftmost defined group in the current match.
706
707                 It is an error to refer to a name not defined by a
708                 "(?<NAME>)" earlier in the pattern.
709
710                 Both forms are equivalent.
711
712                 NOTE: In order to make things easier for programmers with
713                 experience with the Python or PCRE regex engines, the pattern
714                 "(?P=NAME)" may be used instead of "\k<NAME>".
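
                 As an illustration, the sketch below uses a named
                 backreference to find the first doubled character in a
                 word, and then repeats the test with the Python-style
                 spelling.  The word list and the buffer name "char" are
                 assumptions for the example.

```perl
use strict;
use warnings;

my @doubled;
for my $word (qw(bookkeeper letter cat)) {
    # \k<char> must match the same text that (?<char>...) captured.
    push @doubled, $word . ':' . $+{char}
        if $word =~ /(?<char>\w)\k<char>/;
}
print "@doubled\n";

# The Python/PCRE-compatible spelling of the same pattern.
my $python_style = 'letter' =~ /(?P<char>\w)(?P=char)/ ? $+{char} : undef;
```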
715
716       "(?{ code })"
717                 WARNING: This extended regular expression feature is
718                 considered experimental, and may be changed without notice.
719                 Code executed that has side effects may not perform
720                 identically from version to version due to the effect of
721                 future optimisations in the regex engine.
722
723                 This zero-width assertion evaluates any embedded Perl code.
724                 It always succeeds, and its "code" is not interpolated.
725                 Currently, the rules to determine where the "code" ends are
726                 somewhat convoluted.
727
728                 This feature can be used together with the special variable
729                 $^N to capture the results of submatches in variables without
730                 having to keep track of the number of nested parentheses. For
731                 example:
732
733                   $_ = "The brown fox jumps over the lazy dog";
734                   /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
735                   print "color = $color, animal = $animal\n";
736
737                 Inside the "(?{...})" block, $_ refers to the string the
738                 regular expression is matching against. You can also use
739                 "pos()" to find the current match position within this
740                 string.
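
                 A minimal sketch of reading $_ and "pos()" from inside a
                 code block.  The variable names and the sample string are
                 assumptions; package variables are used rather than "my"
                 variables, in line with the advice given for these blocks.

```perl
use strict;
use warnings;

# Package variables, to avoid "my" variables inside the code block.
our ($subject, $offset);

# The block runs right after the "o" has matched.
'hello world' =~ /o(?{ $subject = $_; $offset = pos() }) w/;

print "matching against '$subject', position $offset\n";
```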
741
742                 The "code" is properly scoped in the following sense: If the
743                 assertion is backtracked (compare "Backtracking"), all
744                 changes introduced after "local"ization are undone, so that
745
746                   $_ = 'a' x 8;
747                   m<
748                      (?{ $cnt = 0 })                    # Initialize $cnt.
749                      (
750                        a
751                        (?{
752                            local $cnt = $cnt + 1;       # Update $cnt, backtracking-safe.
753                        })
754                      )*
755                      aaaa
756                      (?{ $res = $cnt })                 # On success copy to non-localized
757                                                         # location.
758                    >x;
759
760                 will set "$res = 4".  Note that after the match, $cnt returns
761                 to the globally introduced value, because the scopes that
762                 restrict "local" operators are unwound.
763
764                 This assertion may be used as the condition in a
765                 "(?(condition)yes-pattern|no-pattern)" switch.  If not used
766                 in this way, the result of evaluation of "code" is put into
767                 the special variable $^R.  This happens immediately, so $^R
768                 can be used from other "(?{ code })" assertions inside the
769                 same regular expression.
770
771                 The assignment to $^R above is properly localized, so the old
772                 value of $^R is restored if the assertion is backtracked;
773                 compare "Backtracking".
774
775                 For reasons of security, this construct is forbidden if the
776                 regular expression involves run-time interpolation of
777                 variables, unless the perilous "use re 'eval'" pragma has
778                 been used (see re), or the variables contain results of
779                 "qr//" operator (see "qr/STRING/msixpo" in perlop).
780
781                 This restriction is due to the wide-spread and remarkably
782                 convenient custom of using run-time determined strings as
783                 patterns.  For example:
784
785                     $re = <>;
786                     chomp $re;
787                     $string =~ /$re/;
788
789                 Before Perl knew how to execute interpolated code within a
790                 pattern, this operation was completely safe from a security
791                 point of view, although it could raise an exception from an
792                 illegal pattern.  If you turn on the "use re 'eval'", though,
793                 it is no longer secure, so you should only do so if you are
794                 also using taint checking.  Better yet, use the carefully
795                 constrained evaluation within a Safe compartment.  See
796                 perlsec for details about both these mechanisms.
797
798                 WARNING: Use of lexical ("my") variables in these blocks is
799                 broken. The result is unpredictable and will make perl
800                 unstable. The workaround is to use global ("our") variables.
801
802                 WARNING: Because Perl's regex engine is currently not re-
803                 entrant, interpolated code may not invoke the regex engine
804                 either directly (with "m//" or "s///"), or indirectly with
805                 functions such as "split". Invoking the regex engine in these
806                 blocks will make perl unstable.
807
808       "(??{ code })"
809                 WARNING: This extended regular expression feature is
810                 considered experimental, and may be changed without notice.
811                 Code executed that has side effects may not perform
812                 identically from version to version due to the effect of
813                 future optimisations in the regex engine.
814
815                 This is a "postponed" regular subexpression.  The "code" is
816                 evaluated at run time, at the moment this subexpression may
817                 match.  The result of evaluation is considered as a regular
818                 expression and matched as if it were inserted instead of this
819                 construct.  Note that this means that the contents of capture
820                 buffers defined inside an eval'ed pattern are not available
821                 outside of the pattern, and vice versa, there is no way for
822                 the inner pattern to refer to a capture buffer defined
823                 outside.  Thus,
824
825                     ('a' x 100)=~/(??{'(.)' x 100})/
826
827                 will match, but it will not set $1.
828
829                 The "code" is not interpolated.  As before, the rules to
830                 determine where the "code" ends are currently somewhat
831                 convoluted.
832
833                 The following pattern matches a parenthesized group:
834
835                   $re = qr{
836                              \(
837                              (?:
838                                 (?> [^()]+ )    # Non-parens without backtracking
839                               |
840                                 (??{ $re })     # Group with matching parens
841                              )*
842                              \)
843                           }x;
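
                 One way to exercise the pattern above is sketched below.
                 It assumes $re is a package variable, so that the
                 postponed block can see it, and anchors the match to test
                 whole strings; the test strings are made up.

```perl
use strict;
use warnings;

our $re;    # package variable, visible to the postponed code block
$re = qr{
          \(
          (?:
             (?> [^()]+ )    # Non-parens without backtracking
           |
             (??{ $re })     # Group with matching parens
          )*
          \)
        }x;

my %balanced;
for my $s ('(a(b)c)', '()', '(a(b)c', 'a)') {
    # Interpolating a qr// object is permitted without "use re 'eval'".
    $balanced{$s} = $s =~ /^$re\z/ ? 1 : 0;
    print "$s => $balanced{$s}\n";
}
```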
844
845                 See also "(?PARNO)" for a different, more efficient way to
846                 accomplish the same task.
847
848                 For reasons of security, this construct is forbidden if the
849                 regular expression involves run-time interpolation of
850                 variables, unless the perilous "use re 'eval'" pragma has
851                 been used (see re), or the variables contain results of
852                 "qr//" operator (see "qr/STRING/msixpo" in perlop).
853
854                 Because perl's regex engine is not currently re-entrant,
855                 delayed code may not invoke the regex engine either directly
856                 (with "m//" or "s///"), or indirectly with functions such as
857                 "split".
858
859                 Recursing deeper than 50 times without consuming any input
860                 string will result in a fatal error.  The maximum depth is
861                 compiled into perl, so changing it requires a custom build.
862
863       "(?PARNO)" "(?-PARNO)" "(?+PARNO)" "(?R)" "(?0)"
864                 Similar to "(??{ code })" except it does not involve
865                 compiling any code, instead it treats the contents of a
866                 capture buffer as an independent pattern that must match at
867                 the current position.  Capture buffers contained by the
868                 pattern will have the value as determined by the outermost
869                 recursion.
870
871                 PARNO is a sequence of digits (not starting with 0) whose
872                 value reflects the paren-number of the capture buffer to
873                 recurse to. "(?R)" recurses to the beginning of the whole
874                 pattern. "(?0)" is an alternate syntax for "(?R)". If PARNO
875                 is preceded by a plus or minus sign then it is assumed to be
876                 relative, with negative numbers indicating preceding capture
877                 buffers and positive ones following. Thus "(?-1)" refers to
878                 the most recently declared buffer, and "(?+1)" indicates the
879                 next buffer to be declared.  Note that the counting for
880                 relative recursion differs from that of relative
881                 backreferences, in that with recursion unclosed buffers are
882                 included.
883
884                 The following pattern matches a function foo() which may
885                 contain balanced parentheses as the argument.
886
887                   $re = qr{ (                    # paren group 1 (full function)
888                               foo
889                               (                  # paren group 2 (parens)
890                                 \(
891                                   (              # paren group 3 (contents of parens)
892                                   (?:
893                                    (?> [^()]+ )  # Non-parens without backtracking
894                                   |
895                                    (?2)          # Recurse to start of paren group 2
896                                   )*
897                                   )
898                                 \)
899                               )
900                             )
901                           }x;
902
903                 If the pattern was used as follows
904
905                     'foo(bar(baz)+baz(bop))'=~/$re/
906                         and print "\$1 = $1\n",
907                                   "\$2 = $2\n",
908                                   "\$3 = $3\n";
909
910                 the output produced should be the following:
911
912                     $1 = foo(bar(baz)+baz(bop))
913                     $2 = (bar(baz)+baz(bop))
914                     $3 = bar(baz)+baz(bop)
915
916                 If there is no corresponding capture buffer defined, then it
917                 is a fatal error.  Recursing deeper than 50 times without
918                 consuming any input string will also result in a fatal error.
919                 The maximum depth is compiled into perl, so changing it
920                 requires a custom build.
921
922                 The following shows how using negative indexing can make it
923                 easier to embed recursive patterns inside of a "qr//"
924                 construct for later use:
925
926                     my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
927                     if (/foo $parens \s+ \+ \s+ bar $parens/x) {
928                        # do something here...
929                     }
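
                 A runnable sketch of the same idea, with the literal plus
                 sign written as "\+" and a made-up input string.  Because
                 "(?-1)" is relative, each interpolated copy of $parens
                 recurses into its own group.

```perl
use strict;
use warnings;

my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

my ($first, $second);
if ('foo(a(b)) + bar(c)' =~ /foo $parens \s+ \+ \s+ bar $parens/x) {
    # Each copy of $parens contributes one capture buffer.
    ($first, $second) = ($1, $2);
    print "first=$first second=$second\n";
}
```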
930
931                 Note that this pattern does not behave the same way as the
932                 equivalent PCRE or Python construct of the same form. In Perl
933                 you can backtrack into a recursed group; in PCRE and Python
934                 the recursed-into group is treated as atomic. Also, modifiers
935                 are resolved at compile time, so constructs like (?i:(?1)) or
936                 (?:(?i)(?1)) do not affect how the sub-pattern will be
937                 processed.
938
939       "(?&NAME)"
940                 Recurse to a named subpattern. Identical to "(?PARNO)" except
941                 that the parenthesis to recurse to is determined by name. If
942                 multiple parentheses have the same name, then it recurses to
943                 the leftmost.
944
945                 It is an error to refer to a name that is not declared
946                 somewhere in the pattern.
947
948                 NOTE: In order to make things easier for programmers with
949                 experience with the Python or PCRE regex engines the pattern
950                 "(?P>NAME)" may be used instead of "(?&NAME)".
951
952       "(?(condition)yes-pattern|no-pattern)"
953       "(?(condition)yes-pattern)"
954                 Conditional expression.  "(condition)" should be either an
955                 integer in parentheses (which is valid if the corresponding
956                 pair of parentheses matched), a
957                 look-ahead/look-behind/evaluate zero-width assertion, a name
958                 in angle brackets or single quotes (which is valid if a
959                 buffer with the given name matched), or the special symbol
960                 (R) (true when evaluated inside of recursion or eval).
961                 Additionally, the R may be followed by a number (in which
962                 case it is true only when evaluated while recursing inside
963                 the corresponding group), or by &NAME, in which case it is
964                 true only when evaluated during recursion in the named group.
965
966                 Here's a summary of the possible predicates:
967
968                 (1) (2) ...
969                     Checks if the numbered capturing buffer has matched
970                     something.
971
972                 (<NAME>) ('NAME')
973                     Checks if a buffer with the given name has matched
974                     something.
975
976                 (?{ CODE })
977                     Treats the code block as the condition.
978
979                 (R) Checks if the expression has been evaluated inside of
980                     recursion.
981
982                 (R1) (R2) ...
983                     Checks if the expression has been evaluated while
984                     executing directly inside of the n-th capture group. This
985                     check is the regex equivalent of
986
987                       if ((caller(0))[3] eq 'subname') { ... }
988
989                     In other words, it does not check the full recursion
990                     stack.
991
992                 (R&NAME)
993                     Similar to "(R1)", this predicate checks to see if we're
994                     executing directly inside of the leftmost group with a
995                     given name (this is the same logic used by "(?&NAME)" to
996                     disambiguate). It does not check the full stack, but only
997                     the name of the innermost active recursion.
998
999                 (DEFINE)
1000                     In this case, the yes-pattern is never directly executed,
1001                     and no no-pattern is allowed. Similar in spirit to
1002                     "(?{0})" but more efficient.  See below for details.
1003
1004                 For example:
1005
1006                     m{ ( \( )?
1007                        [^()]+
1008                        (?(1) \) )
1009                      }x
1010
1011                 matches a chunk of non-parentheses, possibly included in
1012                 parentheses themselves.
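
                 The following sketch anchors that pattern at both ends so
                 that the effect of the "(?(1) \) )" condition is visible
                 on whole strings.  The anchors and the test strings are
                 additions made for illustration.

```perl
use strict;
use warnings;

my $re = qr{ ^ ( \( )? [^()]+ (?(1) \) ) \z }x;

my %ok;
for my $s ('abc', '(abc)', '(abc', 'abc)') {
    # A closing paren is required exactly when the opening one matched.
    $ok{$s} = $s =~ $re ? 1 : 0;
    print "$s => $ok{$s}\n";
}
```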
1013
1014                 A special form is the "(DEFINE)" predicate, which never
1015                 executes its yes-pattern directly, and does not allow a no-
1016                 pattern. This allows you to define subpatterns which will be
1017                 executed only by using the recursion mechanism.  This way,
1018                 you can define a set of regular expression rules that can be
1019                 bundled into any pattern you choose.
1020
1021                 It is recommended that for this usage you put the DEFINE
1022                 block at the end of the pattern, and that you name any
1023                 subpatterns defined within it.
1024
1025                 Also, it's worth noting that patterns defined this way
1026                 probably will not be as efficient, as the optimiser is not
1027                 very clever about handling them.
1028
1029                 An example of how this might be used is as follows:
1030
1031                   /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
1032                    (?(DEFINE)
1033                      (?<NAME_PAT>....)
1034                      (?<ADDRESS_PAT>....)
1035                    )/x
1036
1037                 Note that capture buffers matched inside of recursion are not
1038                 accessible after the recursion returns, so the extra layer of
1039                 capturing buffers is necessary. Thus $+{NAME_PAT} would not
1040                 be defined even though $+{NAME} would be.
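
                 A filled-in sketch of this layout follows.  The two
                 subpattern bodies are deliberately simplistic assumptions
                 (they are not a real name or e-mail grammar), and the
                 input string is made up; only the structure is the point.

```perl
use strict;
use warnings;

my $re = qr{
    ^ (?<NAME> (?&NAME_PAT) ) \s+ < (?<ADDR> (?&ADDRESS_PAT) ) > \z
    (?(DEFINE)
        (?<NAME_PAT>    [A-Za-z]+ (?: \s [A-Za-z]+ )* )
        (?<ADDRESS_PAT> [\w.]+ \@ [\w.]+ )
    )
}x;

my %got;
if ('Jane Doe <jane@example.com>' =~ $re) {
    %got = (
        name          => $+{NAME},
        addr          => $+{ADDR},
        # The buffer inside DEFINE was only set during recursion.
        inner_defined => (defined $+{NAME_PAT} ? 1 : 0),
    );
    print "name=$got{name} addr=$got{addr} inner=$got{inner_defined}\n";
}
```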
1041
1042       "(?>pattern)"
1043                 An "independent" subexpression, one which matches the
1044                 substring that a standalone "pattern" would match if anchored
1045                 at the given position, and it matches nothing other than this
1046                 substring.  This construct is useful for optimizations of
1047                 what would otherwise be "eternal" matches, because it will
1048                 not backtrack (see "Backtracking").  It may also be useful in
1049                 places where the "grab all you can, and do not give anything
1050                 back" semantic is desirable.
1051
1052                 For example: "^(?>a*)ab" will never match, since "(?>a*)"
1053                 (anchored at the beginning of string, as above) will match
1054                 all characters "a" at the beginning of string, leaving no "a"
1055                 for "ab" to match.  In contrast, "a*ab" will match the same
1056                 as "a+b", since the match of the subgroup "a*" is influenced
1057                 by the following group "ab" (see "Backtracking").  In
1058                 particular, "a*" inside "a*ab" will match fewer characters
1059                 than a standalone "a*", since this makes the tail match.
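
                 The behaviour described above can be observed with a
                 short sketch; the input string and hash keys are
                 assumptions for illustration.

```perl
use strict;
use warnings;

my %matched;
$matched{atomic} = 'aaab' =~ /^(?>a*)ab/ ? 1 : 0;   # (?>a*) keeps every "a"
$matched{plain}  = 'aaab' =~ /^a*ab/     ? 1 : 0;   # a* gives one "a" back

# How many characters did the backtracking a* end up with?
my $kept = 'aaab' =~ /^(a*)ab/ ? length $1 : undef;

print "atomic=$matched{atomic} plain=$matched{plain} kept=$kept\n";
```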
1060
1061                 An effect similar to "(?>pattern)" may be achieved by writing
1062                 "(?=(pattern))\1".  This matches the same substring as a
1063                 standalone "pattern", and the following "\1" eats the matched
1064                 string; it therefore makes a zero-length assertion into an
1065                 analogue of "(?>...)".  (The difference between these two
1066                 constructs is that the second one uses a capturing group,
1067                 thus shifting ordinals of backreferences in the rest of a
1068                 regular expression.)
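
                 A small sketch of the look-ahead-plus-backreference
                 idiom, using "a*" as the inner pattern.  The input string
                 and variable names are assumptions for the example.

```perl
use strict;
use warnings;

# The look-ahead captures all leading "a"s; \1 then consumes them
# without any possibility of giving characters back.
my $fails    = 'aaab' =~ /^(?=(a*))\1ab/ ? 1 : 0;
my $succeeds = 'aaab' =~ /^(?=(a*))\1b/  ? 1 : 0;
my $grabbed  = $succeeds ? $1 : undef;

print "fails=$fails succeeds=$succeeds grabbed=$grabbed\n";
```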
1069
1070                 Consider this pattern:
1071
1072                     m{ \(
1073                           (
1074                             [^()]+              # x+
1075                           |
1076                             \( [^()]* \)
1077                           )+
1078                        \)
1079                      }x
1080
1081                 That will efficiently match a nonempty group with matching
1082                 parentheses two levels deep or less.  However, if there is no
1083                 such group, it will take virtually forever on a long string.
1084                 That's because there are so many different ways to split a
1085                 long string into several substrings.  This is what "(.+)+" is
1086                 doing, and "(.+)+" is similar to a subpattern of the above
1087                 pattern.  Consider how the pattern above detects no-match on
1088                 "((()aaaaaaaaaaaaaaaaaa" in several seconds, but that each
1089                 extra letter doubles this time.  This exponential performance
1090                 will make it appear that your program has hung.  However, a
1091                 tiny change to this pattern
1092
1093                     m{ \(
1094                           (
1095                             (?> [^()]+ )        # change x+ above to (?> x+ )
1096                           |
1097                             \( [^()]* \)
1098                           )+
1099                        \)
1100                      }x
1101
1102                 which uses "(?>...)" matches exactly when the one above does
1103                 (verifying this yourself would be a productive exercise), but
1104                 finishes in a quarter of the time when used on a similar
1105                 string with 1000000 "a"s.  Be aware, however, that this
1106                 pattern currently triggers a warning message under the
1107                 "use warnings" pragma or -w switch saying it "matches null
1108                 string many times in regex".
1109
1110                 On simple groups, such as the pattern "(?> [^()]+ )", a
1111                 comparable effect may be achieved by negative look-ahead, as
1112                 in "[^()]+ (?! [^()] )".  This was only 4 times slower on a
1113                 string with 1000000 "a"s.
1114
1115                 The "grab all you can, and do not give anything back"
1116                 semantic is desirable in many situations where at first
1117                 sight a simple "()*" looks like the correct solution.
1118                 Suppose we parse text with comments being delimited by "#"
1119                 followed by some optional (horizontal) whitespace.  Contrary
1120                 to its appearance, "#[ \t]*" is not the correct subexpression
1121                 to match the comment delimiter, because it may "give up" some
1122                 whitespace if the remainder of the pattern can be made to
1123                 match that way.  The correct answer is either one of these:
1124
1125                     (?>#[ \t]*)
1126                     #[ \t]*(?![ \t])
1127
1128                 For example, to grab non-empty comments into $1, one should
1129                 use either one of these:
1130
1131                     / (?> \# [ \t]* ) (        .+ ) /x;
1132                     /     \# [ \t]*   ( [^ \t] .* ) /x;
1133
1134                 Which one you pick depends on which of these expressions
1135                 better reflects the above specification of comments.
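
                 To see the "give up" behaviour concretely, the sketch
                 below applies a naive pattern and the two suggested
                 patterns to a comment that holds only whitespace and to
                 one that holds text.  The input lines and variable names
                 are assumptions for illustration.

```perl
use strict;
use warnings;

my ($empty, $full) = ("#   ", "#   note");   # made-up input lines

# Naive form: [ \t]* hands back one blank so that (.+) can match.
my $naive    = $empty =~ /     \# [ \t]*   (        .+ ) /x ? "[$1]" : 'no match';
# The two forms recommended in the text refuse the empty comment.
my $atomic   = $empty =~ / (?> \# [ \t]* ) (        .+ ) /x ? "[$1]" : 'no match';
my $explicit = $empty =~ /     \# [ \t]*   ( [^ \t] .* ) /x ? "[$1]" : 'no match';

# On a non-empty comment both recommended forms capture the text.
my $atomic_full   = $full =~ / (?> \# [ \t]* ) (        .+ ) /x ? "[$1]" : 'no match';
my $explicit_full = $full =~ /     \# [ \t]*   ( [^ \t] .* ) /x ? "[$1]" : 'no match';

print "naive=$naive atomic=$atomic explicit=$explicit\n";
print "atomic_full=$atomic_full explicit_full=$explicit_full\n";
```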
1136
1137                 In some literature this construct is called "atomic matching"
1138                 or "possessive matching".
1139
1140                 Possessive quantifiers are equivalent to putting the item
1141                 they are applied to inside of one of these constructs. The
1142                 following equivalences apply:
1143
1144                     Quantifier Form     Bracketing Form
1145                     ---------------     ---------------
1146                     PAT*+               (?>PAT*)
1147                     PAT++               (?>PAT+)
1148                     PAT?+               (?>PAT?)
1149                     PAT{min,max}+       (?>PAT{min,max})
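
                 The first equivalence in the table can be checked with a
                 brief sketch; the input string and hash keys are
                 assumptions for the example.

```perl
use strict;
use warnings;

my %form = (
    greedy     => qr/^a+a/,       # a+ can give back one "a"
    possessive => qr/^a++a/,      # a++ never gives anything back
    atomic     => qr/^(?>a+)a/,   # bracketing form of the same thing
);

my %matched;
for my $name (sort keys %form) {
    $matched{$name} = 'aaa' =~ $form{$name} ? 1 : 0;
    print "$name => $matched{$name}\n";
}
```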
1150
1151   Special Backtracking Control Verbs
1152       WARNING: These patterns are experimental and subject to change or
1153       removal in a future version of Perl. Their usage in production code
1154       should be noted to avoid problems during upgrades.
1155
1156       These special patterns are generally of the form "(*VERB:ARG)". Unless
1157       otherwise stated the ARG argument is optional; in some cases, it is
1158       forbidden.
1159
1160       Any pattern containing a special backtracking verb that allows an
1161       argument has the special behaviour that when executed it sets the
1162       current package's $REGERROR and $REGMARK variables. When doing so the
1163       following rules apply:
1164
1165       On failure, the $REGERROR variable will be set to the ARG value of the
1166       verb pattern, if the verb was involved in the failure of the match. If
1167       the ARG part of the pattern was omitted, then $REGERROR will be set to
1168       the name of the last "(*MARK:NAME)" pattern executed, or to TRUE if
1169       there was none. Also, the $REGMARK variable will be set to FALSE.
1170
1171       On a successful match, the $REGERROR variable will be set to FALSE, and
1172       the $REGMARK variable will be set to the name of the last
1173       "(*MARK:NAME)" pattern executed.  See the explanation for the
1174       "(*MARK:NAME)" verb below for more details.
1175
1176       NOTE: $REGERROR and $REGMARK are not magic variables like $1 and most
1177       other regex related variables. They are not local to a scope, nor
1178       readonly, but instead are volatile package variables similar to
1179       $AUTOLOAD.  Use "local" to localize changes to them to a specific scope
1180       if necessary.
1181
1182       If a pattern does not contain a special backtracking verb that allows
1183       an argument, then $REGERROR and $REGMARK are not touched at all.
1184
1185       Verbs that take an argument
1186           "(*PRUNE)" "(*PRUNE:NAME)"
1187               This zero-width pattern prunes the backtracking tree at the
1188               current point when backtracked into on failure. Consider the
1189               pattern "A (*PRUNE) B", where A and B are complex patterns.
1190               Until the "(*PRUNE)" verb is reached, A may backtrack as
1191               necessary to match. Once it is reached, matching continues in
1192               B, which may also backtrack as necessary; however, should B not
1193               match, then no further backtracking will take place, and the
1194               pattern will fail outright at the current starting position.
1195
1196               The following example counts all the possible matching strings
1197               in a pattern (without actually matching any of them).
1198
1199                   'aaab' =~ /a+b?(?{print "$&\n"; $count++})(*FAIL)/;
1200                   print "Count=$count\n";
1201
1202               which produces:
1203
1204                   aaab
1205                   aaa
1206                   aa
1207                   a
1208                   aab
1209                   aa
1210                   a
1211                   ab
1212                   a
1213                   Count=9
1214
1215               If we add a "(*PRUNE)" before the count like the following
1216
1217                   'aaab' =~ /a+b?(*PRUNE)(?{print "$&\n"; $count++})(*FAIL)/;
1218                   print "Count=$count\n";
1219
1220               we prevent backtracking and count only the longest match at
1221               each starting point, like so:
1222
1223                   aaab
1224                   aab
1225                   ab
1226                   Count=3
1227
1228               Any number of "(*PRUNE)" assertions may be used in a pattern.
1229
1230               See also "(?>pattern)" and possessive quantifiers for other
1231               ways to control backtracking. In some cases, the use of
1232               "(*PRUNE)" can be replaced with a "(?>pattern)" with no
1233               functional difference; however, "(*PRUNE)" can be used to
1234               handle cases that cannot be expressed using a "(?>pattern)"
1235               alone.
1236
1237           "(*SKIP)" "(*SKIP:NAME)"
1238               This zero-width pattern is similar to "(*PRUNE)", except that
1239               on failure it also signifies that whatever text was
1240               matched leading up to the "(*SKIP)" pattern being executed
1241               cannot be part of any match of this pattern. This effectively
1242               means that the regex engine "skips" forward to this position on
1243               failure and tries to match again (assuming that there is
1244               sufficient room to match).
1245
1246               The name of the "(*SKIP:NAME)" pattern has special
1247               significance. If a "(*MARK:NAME)" was encountered while
1248               matching, then it is that position which is used as the "skip
1249               point". If no "(*MARK)" of that name was encountered, then the
1250               "(*SKIP)" operator has no effect. When used without a name the
1251               "skip point" is where the match point was when executing the
1252               "(*SKIP)" pattern.
1253
1254               Compare the following to the examples in "(*PRUNE)"; note that
1255               the string is twice as long:
1256
1257                   'aaabaaab' =~ /a+b?(*SKIP)(?{print "$&\n"; $count++})(*FAIL)/;
1258                   print "Count=$count\n";
1259
1260               outputs
1261
1262                   aaab
1263                   aaab
1264                   Count=2
1265
1266               Once the 'aaab' at the start of the string has matched, and the
1267               "(*SKIP)" executed, the next starting point will be where the
1268               cursor was when the "(*SKIP)" was executed.
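
One common way to put "(*SKIP)" to work is to pair it with "(*FAIL)" so that a stretch of text is consumed and then rejected. The following sketch collects numbers that are not inside double-quoted strings; the sample text and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $text = 'a 12 "34 b" 56';

# First branch: match a whole quoted string, then fail. Because (*SKIP)
# was passed, the engine resumes after the closing quote instead of
# retrying inside the quoted text.
# Second branch: a run of digits found outside any quoted string.
my @outside = $text =~ /"[^"]*"(*SKIP)(*FAIL)|\d+/g;

# For comparison, a plain search also finds the digits inside the quotes.
my @all = $text =~ /\d+/g;

print "outside quotes: @outside\n";
print "everywhere:     @all\n";
```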
1269
1270           "(*MARK:NAME)" "(*:NAME)"
1271               This zero-width pattern can be used to mark the point reached
1272               in a string when a certain part of the pattern has been
1273               successfully matched. This mark may be given a name. A later
1274               "(*SKIP)" pattern will then skip forward to that point if
1275               backtracked into on failure. Any number of "(*MARK)" patterns
1276               are allowed, and the NAME portion may be duplicated.
1277
1278               In addition to interacting with the "(*SKIP)" pattern,
1279               "(*MARK:NAME)" can be used to "label" a pattern branch, so that
1280               after matching, the program can determine which branches of the
1281               pattern were involved in the match.
1282
1283               When a match is successful, the $REGMARK variable will be set
1284               to the name of the most recently executed "(*MARK:NAME)" that
1285               was involved in the match.
1286
1287               This can be used to determine which branch of a pattern was
1288               matched without using a separate capture buffer for each
1289               branch, which in turn can result in a performance improvement,
1290               as perl cannot optimize "/(?:(x)|(y)|(z))/" as efficiently as
1291               something like "/(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/".
1292
1293               When a match has failed, and unless another verb has been
1294               involved in failing the match and has provided its own name to
1295               use, the $REGERROR variable will be set to the name of the most
1296               recently executed "(*MARK:NAME)".
1297
1298               See "(*SKIP)" for more details.
1299
1300               As a shortcut "(*MARK:NAME)" can be written "(*:NAME)".
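
The branch-labelling use described above might look like the following sketch. The helper name classify() and the three mark names are hypothetical, chosen only for this example.

```perl
use strict;
use warnings;

# $REGMARK is a package variable set by the regex engine.
our ($REGMARK, $REGERROR);

# Hypothetical helper: report which branch matched the whole token,
# using one mark per branch instead of one capture group per branch.
sub classify {
    my ($token) = @_;
    if ($token =~ /\A(?:\d+(*MARK:number)|[A-Za-z_]\w*(*MARK:word)|\s+(*MARK:space))\z/) {
        return $REGMARK;
    }
    return undef;    # no branch matched the entire token
}

for my $token ('42', 'foo_1', '  ', '4x') {
    my $kind = classify($token);
    printf "%-7s => %s\n", "'$token'", defined $kind ? $kind : 'no match';
}
```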
1301
1302           "(*THEN)" "(*THEN:NAME)"
1303               This is similar to the "cut group" operator "::" from Perl 6.
1304               Like "(*PRUNE)", this verb always matches, and when backtracked
1305               into on failure, it causes the regex engine to try the next
1306               alternation in the innermost enclosing group (capturing or
1307               otherwise).
1308
1309               Its name comes from the observation that this operation
1310               combined with the alternation operator ("|") can be used to
1311               create what is essentially a pattern-based if/then/else block:
1312
1313                 ( COND (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )
1314
1315               Note that if this operator is used outside of an
1316               alternation, it acts exactly like the "(*PRUNE)" operator.
1317
1318                 / A (*PRUNE) B /
1319
1320               is the same as
1321
1322                 / A (*THEN) B /
1323
1324               but
1325
1326                 / ( A (*THEN) B | C (*THEN) D ) /
1327
1328               is not the same as
1329
1330                 / ( A (*PRUNE) B | C (*PRUNE) D ) /
1331
1332               as after matching the A but failing on the B the "(*THEN)" verb
1333               will backtrack and try C; but the "(*PRUNE)" verb will simply
1334               fail.
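
The difference between the two verbs inside an alternation can be checked with a small sketch. The patterns are deliberately tiny and the variable names are illustrative assumptions.

```perl
use strict;
use warnings;

# After "a" matches, "x" fails against "b". Backtracking into (*THEN)
# moves on to the next alternative, "ab", which matches.
my $with_then = ('ab' =~ /^(?:a(*THEN)x|ab)$/) ? 1 : 0;

# Backtracking into (*PRUNE) instead fails the attempt at this starting
# position, so the second alternative is never tried.
my $with_prune = ('ab' =~ /^(?:a(*PRUNE)x|ab)$/) ? 1 : 0;

print "(*THEN)  matched: $with_then\n";
print "(*PRUNE) matched: $with_prune\n";
```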
1335
1336           "(*COMMIT)"
1337               This is the Perl 6 "commit pattern" "<commit>" or ":::". It's a
1338               zero-width pattern similar to "(*SKIP)", except that when
1339               backtracked into on failure it causes the match to fail
1340               outright. No further attempts to find a valid match by
1341               advancing the start pointer will occur again.  For example,
1342
1343                   'aaabaaab' =~ /a+b?(*COMMIT)(?{print "$&\n"; $count++})(*FAIL)/;
1344                   print "Count=$count\n";
1345
1346               outputs
1347
1348                   aaab
1349                   Count=1
1350
1351               In other words, once the "(*COMMIT)" has been entered, and if
1352               the pattern does not match, the regex engine will not try any
1353               further matching on the rest of the string.
1354
1355       Verbs without an argument
1356           "(*FAIL)" "(*F)"
1357               This pattern matches nothing and always fails. It can be used
1358               to force the engine to backtrack. It is equivalent to "(?!)",
1359               but easier to read. In fact, "(?!)" gets optimised into
1360               "(*FAIL)" internally.
1361
1362               It is probably useful only when combined with "(?{})" or
1363               "(??{})".
1364
1365           "(*ACCEPT)"
1366               WARNING: This feature is highly experimental. It is not
1367               recommended for production code.
1368
1369               This pattern matches nothing and causes the end of successful
1370               matching at the point at which the "(*ACCEPT)" pattern was
1371               encountered, regardless of whether there is actually more to
1372               match in the string. When inside of a nested pattern, such as
1373               recursion, or in a subpattern dynamically generated via
1374               "(??{})", only the innermost pattern is ended immediately.
1375
1376               If the "(*ACCEPT)" is inside of capturing buffers then the
1377               buffers are marked as ended at the point at which the
1378               "(*ACCEPT)" was encountered.  For instance:
1379
1380                 'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;
1381
1382               will match; $1 will be "AB", $2 will be "B", and $3 will not
1383               be set. If another branch in the inner parentheses were
1384               matched, such as in the string 'ACDE', then the "D" and "E"
1385               would have to be matched as well.
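
The behaviour just described can be checked with a short sketch. Since the feature is flagged as experimental, treat this as an illustration of the documented behaviour rather than a guarantee; the variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# The "B" branch hits (*ACCEPT), so the match ends there and the open
# capture groups are closed at that point.
my $short_ok = ('AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x) ? 1 : 0;
my ($outer, $inner, $tail) = ($1, $2, $3);

# The "C" branch has no (*ACCEPT), so "D" and "E" must match as well.
my $long_ok = ('ACDE' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x) ? 1 : 0;
my @long    = ($1, $2, $3);

print "AB:   \$1=$outer \$2=$inner \$3=", (defined $tail ? $tail : 'unset'), "\n";
print "ACDE: \$1=$long[0] \$2=$long[1] \$3=$long[2]\n";
```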
1386
1387   Backtracking
1388       NOTE: This section presents an abstract approximation of regular
1389       expression behavior.  For a more rigorous (and complicated) view of the
1390       rules involved in selecting a match among possible alternatives, see
1391       "Combining RE Pieces".
1392
1393       A fundamental feature of regular expression matching involves the
1394       notion called backtracking, which is currently used (when needed) by
1395       all non-possessive regular expression quantifiers, namely "*", "*?",
1396       "+", "+?", "{n,m}", and "{n,m}?".  Backtracking is often optimized
1397       internally, but the general principle outlined here is valid.
1398
1399       For a regular expression to match, the entire regular expression must
1400       match, not just part of it.  So if the beginning of a pattern
1401       containing a quantifier succeeds in a way that causes later parts in
1402       the pattern to fail, the matching engine backs up and recalculates the
1403       beginning part--that's why it's called backtracking.
1404
1405       Here is an example of backtracking:  Let's say you want to find the
1406       word following "foo" in the string "Food is on the foo table.":
1407
1408           $_ = "Food is on the foo table.";
1409           if ( /\b(foo)\s+(\w+)/i ) {
1410               print "$2 follows $1.\n";
1411           }
1412
1413       When the match runs, the first part of the regular expression
1414       ("\b(foo)") finds a possible match right at the beginning of the
1415       string, and loads up $1 with "Foo".  However, as soon as the matching
1416       engine sees that there's no whitespace following the "Foo" that it had
1417       saved in $1, it realizes its mistake and starts over again one
1418       character after where it had the tentative match.  This time it goes
1419       all the way until the next occurrence of "foo". The complete regular
1420       expression matches this time, and you get the expected output of "table
1421       follows foo."
1422
1423       Sometimes minimal matching can help a lot.  Imagine you'd like to match
1424       everything between "foo" and "bar".  Initially, you write something
1425       like this:
1426
1427           $_ =  "The food is under the bar in the barn.";
1428           if ( /foo(.*)bar/ ) {
1429               print "got <$1>\n";
1430           }
1431
1432       Which perhaps unexpectedly yields:
1433
1434         got <d is under the bar in the >
1435
1436       That's because ".*" was greedy, so you get everything between the first
1437       "foo" and the last "bar".  Here it's more effective to use minimal
1438       matching to make sure you get the text between a "foo" and the first
1439       "bar" thereafter.
1440
1441           if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
1442         got <d is under the >
1443
1444       Here's another example. Let's say you'd like to match a number at the
1445       end of a string, and you also want to keep the preceding part of the
1446       match.  So you write this:
1447
1448           $_ = "I have 2 numbers: 53147";
1449           if ( /(.*)(\d*)/ ) {                                # Wrong!
1450               print "Beginning is <$1>, number is <$2>.\n";
1451           }
1452
1453       That won't work at all, because ".*" was greedy and gobbled up the
1454       whole string. As "\d*" can match on an empty string the complete
1455       regular expression matched successfully.
1456
1457           Beginning is <I have 2 numbers: 53147>, number is <>.
1458
1459       Here are some variants, most of which don't work:
1460
1461           $_ = "I have 2 numbers: 53147";
1462           @pats = qw{
1463               (.*)(\d*)
1464               (.*)(\d+)
1465               (.*?)(\d*)
1466               (.*?)(\d+)
1467               (.*)(\d+)$
1468               (.*?)(\d+)$
1469               (.*)\b(\d+)$
1470               (.*\D)(\d+)$
1471           };
1472
1473           for $pat (@pats) {
1474               printf "%-12s ", $pat;
1475               if ( /$pat/ ) {
1476                   print "<$1> <$2>\n";
1477               } else {
1478                   print "FAIL\n";
1479               }
1480           }
1481
1482       That will print out:
1483
1484           (.*)(\d*)    <I have 2 numbers: 53147> <>
1485           (.*)(\d+)    <I have 2 numbers: 5314> <7>
1486           (.*?)(\d*)   <> <>
1487           (.*?)(\d+)   <I have > <2>
1488           (.*)(\d+)$   <I have 2 numbers: 5314> <7>
1489           (.*?)(\d+)$  <I have 2 numbers: > <53147>
1490           (.*)\b(\d+)$ <I have 2 numbers: > <53147>
1491           (.*\D)(\d+)$ <I have 2 numbers: > <53147>
1492
1493       As you see, this can be a bit tricky.  It's important to realize that a
1494       regular expression is merely a set of assertions that gives a
1495       definition of success.  There may be 0, 1, or several different ways
1496       that the definition might succeed against a particular string.  And if
1497       there are multiple ways it might succeed, you need to understand
1498       backtracking to know which variety of success you will achieve.
1499
1500       When using look-ahead assertions and negations, this can all get even
1501       trickier.  Imagine you'd like to find a sequence of non-digits not
1502       followed by "123".  You might try to write that as
1503
1504           $_ = "ABC123";
1505           if ( /^\D*(?!123)/ ) {              # Wrong!
1506               print "Yup, no 123 in $_\n";
1507           }
1508
1509       But that won't work the way you're hoping: the pattern matches, so the
1510       code claims there is no 123 in the string.  Here's a clearer picture of
1511       why that pattern matches, contrary to popular expectations:
1512
1513           $x = 'ABC123';
1514           $y = 'ABC445';
1515
1516           print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
1517           print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;
1518
1519           print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
1520           print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;
1521
1522       This prints
1523
1524           2: got ABC
1525           3: got AB
1526           4: got ABC
1527
1528       You might have expected test 3 to fail because it seems to be a more
1529       general-purpose version of test 1.  The important difference between
1530       them is that test 3 contains a quantifier ("\D*") and so can use
1531       backtracking, whereas test 1 will not.  What's happening is that you've
1532       asked "Is it true that at the start of $x, following 0 or more non-
1533       digits, you have something that's not 123?"  If the pattern matcher had
1534       let "\D*" expand to "ABC", this would have caused the whole pattern to
1535       fail.
1536
1537       The search engine will initially match "\D*" with "ABC".  Then it will
1538       try to match "(?!123)" with "123", which fails.  But because a
1539       quantifier ("\D*") has been used in the regular expression, the search
1540       engine can backtrack and retry the match differently in the hope of
1541       matching the complete regular expression.
1542
1543       The pattern really, really wants to succeed, so it uses the standard
1544       pattern back-off-and-retry and lets "\D*" expand to just "AB" this
1545       time.  Now there's indeed something following "AB" that is not "123".
1546       It's "C123", which suffices.
1547
1548       We can deal with this by using both an assertion and a negation.  We'll
1549       say that the first part in $1 must be followed both by a digit and by
1550       something that's not "123".  Remember that the look-aheads are zero-
1551       width expressions--they only look, but don't consume any of the string
1552       in their match.  So rewriting this way produces what you'd expect; that
1553       is, case 5 will fail, but case 6 succeeds:
1554
1555           print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
1556           print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;
1557
1558           6: got ABC
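
A different way to get the same outcome, not the one used in the text above, is to stop "\D*" from giving characters back by making it possessive ("\D*+"). The sketch below assumes the same two sample strings; the variable names are illustrative.

```perl
use strict;
use warnings;

my ($x, $y) = ('ABC123', 'ABC445');

# \D*+ takes "ABC" and refuses to backtrack, so the negative look-ahead
# is tested only at the position right after the non-digits.
my $x_capture = ($x =~ /^(\D*+)(?!123)/) ? $1 : undef;
my $y_capture = ($y =~ /^(\D*+)(?!123)/) ? $1 : undef;

print "x: ", (defined $x_capture ? "got $x_capture" : "no match"), "\n";
print "y: ", (defined $y_capture ? "got $y_capture" : "no match"), "\n";
```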
1559
1560       In other words, the two zero-width assertions next to each other work
1561       as though they're ANDed together, just as you'd use any built-in
1562       assertions:  "/^$/" matches only if you're at the beginning of the line
1563       AND the end of the line simultaneously.  The deeper underlying truth is
1564       that juxtaposition in regular expressions always means AND, except when
1565       you write an explicit OR using the vertical bar.  "/ab/" means match
1566       "a" AND (then) match "b", although the attempted matches are made at
1567       different positions because "a" is not a zero-width assertion, but a
1568       one-width assertion.
1569
1570       WARNING: Particularly complicated regular expressions can take
1571       exponential time to solve because of the immense number of possible
1572       ways they can use backtracking to try for a match.  For example,
1573       without internal optimizations done by the regular expression engine,
1574       this will take a painfully long time to run:
1575
1576           'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/
1577
1578       And if you used "*"'s in the internal groups instead of limiting them
1579       to 0 through 5 matches, then it would take forever--or until you ran
1580       out of stack space.  Moreover, these internal optimizations are not
1581       always applicable.  For example, if you put "{0,5}" instead of "*" on
1582       the external group, no current optimization is applicable, and the
1583       match takes a long time to finish.
1584
1585       A powerful tool for optimizing such beasts is what is known as an
1586       "independent group", which does not backtrack (see "(?>pattern)").
1587       Note also that zero-length look-ahead/look-behind assertions will not
1588       backtrack to make the tail match, since they are in "logical" context:
1589       only whether they match is considered relevant.  For an example where
1590       side-effects of look-ahead might have influenced the following match,
1591       see "(?>pattern)".
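
As a minimal sketch of what "does not backtrack" means for an independent group, compare an ordinary quantifier with the same quantifier wrapped in "(?>...)". The sample string and variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# a+ first takes "aaa", then gives one "a" back so that "ab" can match.
my $plain = ('aaab' =~ /a+ab/) ? 1 : 0;

# (?>a+) keeps every "a" it took, so the following "a" can never match.
my $atomic = ('aaab' =~ /(?>a+)ab/) ? 1 : 0;

print "a+ab      matched: $plain\n";
print "(?>a+)ab  matched: $atomic\n";
```

Because the group gives nothing back, the engine has far fewer combinations to explore, which is what makes it useful against runaway backtracking; the price is that some strings no longer match.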
1592
1593   Version 8 Regular Expressions
1594       In case you're not familiar with the "regular" Version 8 regex
1595       routines, here are the pattern-matching rules not described above.
1596
1597       Any single character matches itself, unless it is a metacharacter with
1598       a special meaning described here or above.  You can cause characters
1599       that normally function as metacharacters to be interpreted literally by
1600       prefixing them with a "\" (e.g., "\." matches a ".", not any character;
1601       "\\" matches a "\"). This escape mechanism is also required for the
1602       character used as the pattern delimiter.
1603
1604       A series of characters matches that series of characters in the target
1605       string, so the pattern "blurfl" would match "blurfl" in the target
1606       string.
1607
1608       You can specify a character class, by enclosing a list of characters in
1609       "[]", which will match any character from the list.  If the first
1610       character after the "[" is "^", the class matches any character not in
1611       the list.  Within a list, the "-" character specifies a range, so that
1612       "a-z" represents all characters between "a" and "z", inclusive.  If you
1613       want either "-" or "]" itself to be a member of a class, put it at the
1614       start of the list (possibly after a "^"), or escape it with a
1615       backslash.  "-" is also taken literally when it is at the end of the
1616       list, just before the closing "]".  (The following all specify the same
1617       class of three characters: "[-az]", "[az-]", and "[a\-z]".  All are
1618       different from "[a-z]", which specifies a class containing twenty-six
1619       characters, even on EBCDIC-based character sets.)  Also, if you try to
1620       use the character classes "\w", "\W", "\s", "\S", "\d", or "\D" as
1621       endpoints of a range, the "-" is understood literally.
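
The character-class rules above can be checked with a short sketch. The four candidate characters and the %members hash are assumptions made for the example.

```perl
use strict;
use warnings;

# For each class, record which of four candidate characters it accepts.
my %members;
for my $class ('[-az]', '[az-]', '[a\-z]', '[a-z]') {
    $members{$class} = join '', grep { /^$class$/ } ('a', 'm', 'z', '-');
}

printf "%-7s accepts: %s\n", $_, $members{$_}
    for '[-az]', '[az-]', '[a\-z]', '[a-z]';
```

The first three classes accept "a", "z" and "-" but not "m"; only "[a-z]" treats the hyphen as a range operator.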
1622
1623       Note also that the whole range idea is rather unportable between
1624       character sets--and even within a character set, ranges may cause
1625       results you probably didn't expect.  A sound principle is to use only
1626       ranges that begin from and end at either alphabetics of equal case
1627       ([a-e], [A-E]), or digits ([0-9]).  Anything else is unsafe.  If in
1628       doubt, spell out the character sets in full.
1629
1630       Characters may be specified using a metacharacter syntax much like that
1631       used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
1632       "\f" a form feed, etc.  More generally, \nnn, where nnn is a string of
1633       octal digits, matches the character whose coded character set value is
1634       nnn.  Similarly, \xnn, where nn are hexadecimal digits, matches the
1635       character whose numeric value is nn. The expression \cx matches the
1636       character control-x.  Finally, the "." metacharacter matches any
1637       character except "\n" (unless you use "/s").
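
A few of these escapes, sketched as quick checks. The variable names are illustrative, and the octal example uses a leading zero so that it cannot be mistaken for a backreference.

```perl
use strict;
use warnings;

my $hex     = ("A"    =~ /\A\x41\z/) ? 1 : 0;   # hexadecimal 41 is "A"
my $octal   = ("\t"   =~ /\A\011\z/) ? 1 : 0;   # octal 011 is a tab
my $control = (chr(1) =~ /\A\cA\z/)  ? 1 : 0;   # control-A is character 1
my $dot     = ("\n"   =~ /\A.\z/)    ? 1 : 0;   # "." skips newline...
my $dot_s   = ("\n"   =~ /\A.\z/s)   ? 1 : 0;   # ...unless /s is used

print "hex=$hex octal=$octal control=$control dot=$dot dot_s=$dot_s\n";
```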
1638
1639       You can specify a series of alternatives for a pattern using "|" to
1640       separate them, so that "fee|fie|foe" will match any of "fee", "fie", or
1641       "foe" in the target string (as would "f(e|i|o)e").  The first
1642       alternative includes everything from the last pattern delimiter ("(",
1643       "[", or the beginning of the pattern) up to the first "|", and the last
1644       alternative contains everything from the last "|" to the next pattern
1645       delimiter.  That's why it's common practice to include alternatives in
1646       parentheses: to minimize confusion about where they start and end.
1647
1648       Alternatives are tried from left to right, so the first alternative
1649       found for which the entire expression matches is the one that is
1650       chosen. This means that alternatives are not necessarily greedy. For
1651       example: when matching "foo|foot" against "barefoot", only the "foo"
1652       part will match, as that is the first alternative tried, and it
1653       successfully matches the target string. (This might not seem important,
1654       but it is important when you are capturing matched text using
1655       parentheses.)
1656
1657       Also remember that "|" is interpreted as a literal within square
1658       brackets, so if you write "[fee|fie|foe]" you're really only matching
1659       "[feio|]".
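
The points about alternation order and about "|" inside brackets can be sketched as follows; the variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# Alternatives are tried left to right, so the order decides the capture.
my ($first)  = 'barefoot' =~ /(foo|foot)/;
my ($second) = 'barefoot' =~ /(foot|foo)/;

# Inside brackets "|" is just another member of the class, and the class
# matches a single character, never a whole word.
my $bar_is_member = ('|'   =~ /^[fee|fie|foe]$/) ? 1 : 0;
my $word_matches  = ('fie' =~ /^[fee|fie|foe]$/) ? 1 : 0;

print "foo|foot captured: $first\n";
print "foot|foo captured: $second\n";
print "class accepts '|': $bar_is_member, accepts 'fie': $word_matches\n";
```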
1660
1661       Within a pattern, you may designate subpatterns for later reference by
1662       enclosing them in parentheses, and you may refer back to the nth
1663       subpattern later in the pattern using the metacharacter \n.
1664       Subpatterns are numbered based on the left to right order of their
1665       opening parenthesis.  A backreference matches whatever actually matched
1666       the subpattern in the string being examined, not the rules for that
1667       subpattern.  Therefore, "(0|0x)\d*\s\1\d*" will match "0x1234 0x4321",
1668       but not "0x1234 01234", because subpattern 1 matched "0x", even though
1669       the rule "0|0x" could potentially match the leading 0 in the second
1670       number.
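
The backreference example from the paragraph above, written out as a runnable sketch (the variable names are assumptions):

```perl
use strict;
use warnings;

# \1 must repeat the text that group 1 actually matched ("0x" here),
# not merely something the group's pattern could match.
my $re = qr/(0|0x)\d*\s\1\d*/;

my $same_prefix  = ('0x1234 0x4321' =~ $re) ? 1 : 0;
my $mixed_prefix = ('0x1234 01234'  =~ $re) ? 1 : 0;

print "0x1234 0x4321 matched: $same_prefix\n";
print "0x1234 01234  matched: $mixed_prefix\n";
```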
1671
1672   Warning on \1 Instead of $1
1673       Some people get too used to writing things like:
1674
1675           $pattern =~ s/(\W)/\\\1/g;
1676
1677       This is grandfathered (for \1 to \9) for the RHS of a substitute to
1678       avoid shocking the sed addicts, but it's a dirty habit to get into.
1679       That's because in PerlThink, the righthand side of an "s///" is a
1680       double-quoted string.  "\1" in the usual double-quoted string means a
1681       control-A.  The customary Unix meaning of "\1" is kludged in for
1682       "s///".  However, if you get into the habit of doing that, you get
1683       yourself into trouble if you then add an "/e" modifier.
1684
1685           s/(\d+)/ \1 + 1 /eg;        # causes warning under -w
1686
1687       Or if you try to do
1688
1689           s/(\d+)/\1000/;
1690
1691       You can't disambiguate that by saying "\{1}000", whereas you can fix it
1692       with "${1}000".  The operation of interpolation should not be confused
1693       with the operation of matching a backreference.  Certainly they mean
1694       two different things on the left side of the "s///".
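
A short sketch of the "$1" forms recommended above, using made-up sample strings and variable names:

```perl
use strict;
use warnings;

# ${1}000 is unambiguous: the first capture followed by three zeros.
(my $thousands = 'order 7') =~ s/(\d+)/${1}000/;

# With /e the replacement is Perl code, where $1 is an ordinary variable.
(my $bumped = 'a1 b22') =~ s/(\d+)/$1 + 1/eg;

print "$thousands\n";
print "$bumped\n";
```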
1695
1696   Repeated Patterns Matching a Zero-length Substring
1697       WARNING: Difficult material (and prose) ahead.  This section needs a
1698       rewrite.
1699
1700       Regular expressions provide a terse and powerful programming language.
1701       As with most other power tools, power comes together with the ability
1702       to wreak havoc.
1703
1704       A common abuse of this power stems from the ability to make infinite
1705       loops using regular expressions, with something as innocuous as:
1706
1707           'foo' =~ m{ ( o? )* }x;
1708
1709       The "o?" matches at the beginning of 'foo', and since the position in
1710       the string is not moved by the match, "o?" would match again and again
1711       because of the "*" quantifier.  Another common way to create a similar
1712       cycle is with the looping modifier "//g":
1713
1714           @matches = ( 'foo' =~ m{ o? }xg );
1715
1716       or
1717
1718           print "match: <$&>\n" while 'foo' =~ m{ o? }xg;
1719
1720       or the loop implied by split().
1721
1722       However, long experience has shown that many programming tasks may be
1723       significantly simplified by using repeated subexpressions that may
1724       match zero-length substrings.  Here's a simple example:
1725
1726           @chars = split //, $string;           # // is not magic in split
1727           ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
1728
1729       Thus Perl allows such constructs, by forcefully breaking the infinite
1730       loop.  The rules for this are different for lower-level loops given by
1731       the greedy quantifiers "*+{}", and for higher-level ones like the "/g"
1732       modifier or split() operator.
1733
1734       The lower-level loops are interrupted (that is, the loop is broken)
1735       when Perl detects that a repeated expression matched a zero-length
1736       substring.   Thus
1737
1738          m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
1739
1740       is made equivalent to
1741
1742          m{   (?: NON_ZERO_LENGTH )*
1743             |
1744               (?: ZERO_LENGTH )?
1745           }x;
1746
1747       The higher-level loops preserve an additional state between iterations:
1748       whether the last match was zero-length.  To break the loop, the
1749       following match after a zero-length match is prohibited from having a
1750       length of zero.  This prohibition interacts with backtracking (see
1751       "Backtracking"), and so the second best match is chosen if the best
1752       match is of zero length.
1753
1754       For example:
1755
1756           $_ = 'bar';
1757           s/\w??/<$&>/g;
1758
1759       results in "<><b><><a><><r><>".  At each position of the string the
1760       best match given by non-greedy "??" is the zero-length match, and the
1761       second best match is what is matched by "\w".  Thus zero-length matches
1762       alternate with one-character-long matches.
1763
1764       Similarly, for repeated "m/()/g" the second-best match is the match at
1765       the position one notch further in the string.
1766
1767       The additional state of being matched with zero-length is associated
1768       with the matched string, and is reset by each assignment to pos().
1769       Zero-length matches at the end of the previous match are ignored during
1770       "split".
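
The loop-breaking rules for the higher-level loops can be observed with a sketch like the following. It uses a capture group instead of $& and illustrative variable names; otherwise the patterns are the ones discussed in this section.

```perl
use strict;
use warnings;

# Zero-length and one-character matches alternate, as described above.
(my $marked = 'bar') =~ s/(\w??)/<$1>/g;

# An empty pattern in split terminates and yields single characters.
my @chars = split //, 'bar';

# In list context with /g, o? yields an empty match before "f", then each
# "o", then a final empty match at the end of the string.
my @found = 'foo' =~ /(o?)/g;

print "$marked\n";
print join(',', @chars), "\n";
print join(',', map { "<$_>" } @found), "\n";
```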
1771
1772   Combining RE Pieces
1773       Each of the elementary pieces of regular expressions which were
1774       described before (such as "ab" or "\Z") could match at most one
1775       substring at the given position of the input string.  However, in a
1776       typical regular expression these elementary pieces are combined into
1777       more complicated patterns using combining operators "ST", "S|T", "S*"
1778       etc. (in these examples "S" and "T" are regular subexpressions).
1779
1780       Such combinations can include alternatives, leading to a problem of
1781       choice: if we match a regular expression "a|ab" against "abc", will it
1782       match substring "a" or "ab"?  One way to describe which substring is
1783       actually matched is the concept of backtracking (see "Backtracking").
1784       However, this description is too low-level and makes you think in terms
1785       of a particular implementation.
1786
1787       Another description starts with notions of "better"/"worse".  All the
1788       substrings which may be matched by the given regular expression can be
1789       sorted from the "best" match to the "worst" match, and it is the "best"
1790       match which is chosen.  This substitutes the question of "what is
1791       chosen?"  by the question of "which matches are better, and which are
1792       worse?".
1793
1794       Again, for elementary pieces there is no such question, since at most
1795       one match at a given position is possible.  This section describes the
1796       notion of better/worse for combining operators.  In the description
1797       below "S" and "T" are regular subexpressions.
1798
1799       "ST"
1800           Consider two possible matches, "AB" and "A'B'", where "A" and "A'"
1801           are substrings which can be matched by "S", and "B" and "B'" are
1802           substrings which can be matched by "T".
1803
1804           If "A" is a better match for "S" than "A'", "AB" is a better match
1805           than "A'B'".
1806
1807           If "A" and "A'" coincide: "AB" is a better match than "AB'" if "B"
1808           is a better match for "T" than "B'".
1809
1810       "S|T"
1811           When "S" can match, it is a better match than when only "T" can
1812           match.
1813
1814           Ordering of two matches for "S" is the same as for "S".  Similarly
1815           for two matches for "T".
1816
       "S{REPEAT_COUNT}"
           Matches as "SSS...S" (repeated as many times as necessary).

       "S{min,max}"
           Matches as "S{max}|S{max-1}|...|S{min+1}|S{min}".

       "S{min,max}?"
           Matches as "S{min}|S{min+1}|...|S{max-1}|S{max}".

       "S?", "S*", "S+"
           Same as "S{0,1}", "S{0,BIG_NUMBER}", "S{1,BIG_NUMBER}"
           respectively.

       "S??", "S*?", "S+?"
           Same as "S{0,1}?", "S{0,BIG_NUMBER}?", "S{1,BIG_NUMBER}?"
           respectively.

       "(?>S)"
           Matches the best match for "S" and only that.

       "(?=S)", "(?<=S)"
           Only the best match for "S" is considered.  (This is important only
           if "S" has capturing parentheses, and backreferences are used
           somewhere else in the whole regular expression.)

       "(?!S)", "(?<!S)"
           For these grouping operators there is no need to describe the
           ordering, since only whether or not "S" can match is important.

       "(??{ EXPR })", "(?PARNO)"
           The ordering is the same as for the regular expression which is the
           result of EXPR, or the pattern contained by capture buffer PARNO.

       "(?(condition)yes-pattern|no-pattern)"
           Recall that which of "yes-pattern" or "no-pattern" actually matches
           is already determined.  The ordering of the matches is the same as
           for the chosen subexpression.

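       The ordering rules in this list can be checked directly.  The
       following is a minimal sketch, not taken from the original text; the
       sample strings and variable names are chosen purely for illustration.

```perl
use strict;
use warnings;

# "ST": the best match for S is taken first, even when T then has to
# make do with whatever is left.
my ($s1, $t1) = "abcd" =~ /(a|ab)(c|bcd)/;   # S prefers "a"; T supplies "bcd"
my ($s2, $t2) = "aaa"  =~ /(a*)(a*)/;        # greedy S takes all three "a"s

# "S|T": a match through the left alternative beats one through the right.
my ($alt) = "foobar" =~ /(foo|foobar)/;

# "S{min,max}" prefers the largest count, "S{min,max}?" the smallest.
my ($greedy) = "aaaa" =~ /(a{1,3})/;
my ($lazy)   = "aaaa" =~ /(a{1,3}?)/;

# "(?>S)": only the best match for S is kept, so the group cannot give
# back an "a" to let the rest of the pattern succeed.
my $plain  = "aaab" =~ /a+ab/     ? 1 : 0;
my $atomic = "aaab" =~ /(?>a+)ab/ ? 1 : 0;

# "(?=S)": only the best match for S ("a") is considered, so \1 can
# never become "ab" and the trailing "c" is never reached.
my $look = "abc" =~ /(?=(a|ab))\1c/ ? 1 : 0;

print "ST:        '$s1' + '$t1', '$s2' + '$t2'\n";  # 'a' + 'bcd', 'aaa' + ''
print "S|T:       '$alt'\n";                        # 'foo'
print "greedy:    '$greedy'\n";                     # 'aaa'
print "lazy:      '$lazy'\n";                       # 'a'
print "plain:     $plain\n";                        # 1
print "atomic:    $atomic\n";                       # 0
print "lookahead: $look\n";                         # 0
```

       In the first "ST" case the overall match "a" plus "bcd" wins over "ab"
       plus "c", even though "c" on its own is the better match for the second
       group: the ordering for "S" is decided first.
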
       The above recipes describe the ordering of matches at a given position.
       One more rule is needed to understand how a match is determined for the
       whole regular expression: a match at an earlier position is always
       better than a match at a later position.

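       As a small illustration of the position rule (a sketch with made-up
       input strings, using the "@-" array from perlvar to read the offset at
       which the match started):

```perl
use strict;
use warnings;

# A longer run of "a"s starts at offset 2, but the match at offset 0 wins.
my ($first)  = "a aaa" =~ /(a+)/;
my $first_at = $-[0];

# An empty match at offset 0 beats the non-empty match "aa" at offset 1.
my ($empty)  = "xaa" =~ /(a*)/;
my $empty_at = $-[0];

print "first: '$first' at $first_at\n";   # 'a' at 0
print "empty: '$empty' at $empty_at\n";   # '' at 0
```
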
   Creating Custom RE Engines
       Overloaded constants (see overload) provide a simple way to extend the
       functionality of the RE engine.

       Suppose that we want to enable a new RE escape sequence "\Y|" which
       matches at a boundary between whitespace characters and non-whitespace
       characters.  Note that "(?=\S)(?<!\S)|(?!\S)(?<=\S)" matches exactly at
       these positions, so we want to have each "\Y|" in the place of the more
       complicated version.  We can create a module "customre" to do this:

           package customre;
           use overload;

           sub import {
             shift;
             die "No argument to customre::import allowed" if @_;
             overload::constant 'qr' => \&convert;
           }

           sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}

           # We must also take care of not escaping the legitimate \\Y|
           # sequence, hence the presence of '\\' in the conversion rules.
           my %rules = ( '\\' => '\\\\',
                         'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
           sub convert {
             my $re = shift;
             $re =~ s{
                       \\ ( \\ | Y . )
                     }
                     { $rules{$1} or invalid($re,$1) }sgex;
             return $re;
           }

           1;  # return a true value so that "use customre" succeeds

       Now "use customre" enables the new escape in constant regular
       expressions, i.e., those without any runtime variable interpolations.
       As documented in overload, this conversion will work only over literal
       parts of regular expressions.  For "\Y|$re\Y|" the variable part of
       this regular expression needs to be converted explicitly (but only if
       the special meaning of "\Y|" should be enabled inside $re):

           use customre;
           $re = <>;
           chomp $re;
           $re = customre::convert $re;
           /\Y|$re\Y|/;

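       The conversion step can also be tried without installing a module.
       The sketch below copies the rules and the body of "convert" into a
       single script; the name "convert_y" and the test strings are
       hypothetical and serve only to show what the rewritten pattern does.

```perl
use strict;
use warnings;

my %rules = ( '\\' => '\\\\',
              'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );

# Stand-alone copy of customre::convert; the name convert_y is hypothetical.
sub convert_y {
    my $re = shift;
    $re =~ s{ \\ ( \\ | Y . ) }
            { $rules{$1} or die "/$re/: invalid escape '\\$1'\n" }sgex;
    return $re;
}

# \Y| is replaced by the whitespace/non-whitespace boundary assertion.
my $pattern = convert_y('\Y|foo\Y|');
my $hit     = "a foo b" =~ /$pattern/ ? 1 : 0;   # "foo" stands alone
my $miss    = "afoob"   =~ /$pattern/ ? 1 : 0;   # "foo" is embedded

# An escaped backslash followed by "Y|" is left untouched.
my $kept = convert_y('\\\\Y|');

# Any other \Y escape is rejected.
my $ok = eval { convert_y('\Yx'); 1 } ? 1 : 0;

print "hit:  $hit\n";    # 1
print "miss: $miss\n";   # 0
print "kept: $kept\n";   # \\Y|
print "ok:   $ok\n";     # 0
```

       Because the replacement for "Y|" is a "qr//" object, it is
       interpolated as a self-contained group, so its top-level "|" does not
       leak into the surrounding pattern.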

PCRE/Python Support

       As of Perl 5.10.0, Perl supports several Python/PCRE-specific
       extensions to the regex syntax. While Perl programmers are encouraged
       to use the Perl-specific syntax, the following are also accepted:

       "(?P<NAME>pattern)"
           Define a named capture buffer. Equivalent to "(?<NAME>pattern)".

       "(?P=NAME)"
           Backreference to a named capture buffer. Equivalent to "\g{NAME}".

       "(?P>NAME)"
           Subroutine call to a named capture buffer. Equivalent to
           "(?&NAME)".

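       A short sketch of the three forms in use, assuming Perl 5.10 or
       later; the group names "quote", "body" and "paren" and the input
       strings are made up for the example.

```perl
use strict;
use warnings;

# Named capture and backreference: Python/PCRE spelling vs. Perl spelling.
my $python_style = qr/(?P<quote>['"])(?P<body>.*?)(?P=quote)/;
my $perl_style   = qr/(?<quote>['"])(?<body>.*?)\g{quote}/;

my $input = q{print "it's ok"};
my ($py_body, $pl_body);
$py_body = $+{body} if $input =~ $python_style;
$pl_body = $+{body} if $input =~ $perl_style;

# Recursion into a named group: (?P>paren) behaves like (?&paren).
my $balanced = qr/(?P<paren>\((?:[^()]++|(?P>paren))*\))/;
my ($nested) = "f(a(b)c)d" =~ $balanced;

print "python style: $py_body\n";   # it's ok
print "perl style:   $pl_body\n";   # it's ok
print "balanced:     $nested\n";    # (a(b)c)
```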

BUGS

       There are numerous problems with case-insensitive matching of
       characters outside the ASCII range, especially with those whose folds
       are multiple characters, such as ligatures like "LATIN SMALL LIGATURE
       FF".

       In a bracketed character class with case-insensitive matching, ranges
       only work for ASCII characters.  For example, "m/[\N{CYRILLIC CAPITAL
       LETTER A}-\N{CYRILLIC CAPITAL LETTER YA}]/i" doesn't match all the
       Russian upper and lower case letters.

       Many regular expression constructs don't work on EBCDIC platforms.

       This document varies from difficult to understand to completely and
       utterly opaque.  The wandering prose riddled with jargon is hard to
       fathom in several places.

       This document needs a rewrite that separates the tutorial content from
       the reference content.

SEE ALSO