perlre(1)

PERLRE(1)              Perl Programmers Reference Guide              PERLRE(1)

NAME

6       perlre - Perl regular expressions
7

DESCRIPTION

9       This page describes the syntax of regular expressions in Perl.
10
11       If you haven't used regular expressions before, a quick-start introduc‐
12       tion is available in perlrequick, and a longer tutorial introduction is
13       available in perlretut.
14
15       For reference on how regular expressions are used in matching opera‐
16       tions, plus various examples of the same, see discussions of "m//",
17       "s///", "qr//" and "??" in "Regexp Quote-Like Operators" in perlop.
18
19       Matching operations can have various modifiers.  Modifiers that relate
20       to the interpretation of the regular expression inside are listed
21       below.  Modifiers that alter the way a regular expression is used by
22       Perl are detailed in "Regexp Quote-Like Operators" in perlop and "Gory
23       details of parsing quoted constructs" in perlop.
24
25       i   Do case-insensitive pattern matching.
26
27           If "use locale" is in effect, the case map is taken from the cur‐
28           rent locale.  See perllocale.
29
30       m   Treat string as multiple lines.  That is, change "^" and "$" from
31           matching the start or end of the string to matching the start or
32           end of any line anywhere within the string.
33
34       s   Treat string as single line.  That is, change "." to match any
35           character whatsoever, even a newline, which normally it would not
36           match.
37
38           The "/s" and "/m" modifiers both override the $* setting.  That is,
39           no matter what $* contains, "/s" without "/m" will force "^" to
40           match only at the beginning of the string and "$" to match only at
41           the end (or just before a newline at the end) of the string.
42           Together, as /ms, they let the "." match any character whatsoever,
43           while still allowing "^" and "$" to match, respectively, just after
44           and just before newlines within the string.
45
46       x   Extend your pattern's legibility by permitting whitespace and com‐
47           ments.
48
49       These are usually written as "the "/x" modifier", even though the
50       delimiter in question might not really be a slash.  Any of these modi‐
51       fiers may also be embedded within the regular expression itself using
52       the "(?...)" construct.  See below.
53
54       The "/x" modifier itself needs a little more explanation.  It tells the
55       regular expression parser to ignore whitespace that is neither back‐
56       slashed nor within a character class.  You can use this to break up
57       your regular expression into (slightly) more readable parts.  The "#"
58       character is also treated as a metacharacter introducing a comment,
59       just as in ordinary Perl code.  This also means that if you want real
60       whitespace or "#" characters in the pattern (outside a character class,
61       where they are unaffected by "/x"), that you'll either have to escape
62       them or encode them using octal or hex escapes.  Taken together, these
63       features go a long way towards making Perl's regular expressions more
64       readable.  Note that you have to be careful not to include the pattern
65       delimiter in the comment--perl has no way of knowing you did not intend
66       to close the pattern early.  See the C-comment deletion code in perlop.
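
       As a minimal sketch of the "/x" rules above (the input string and
       variable names are hypothetical, chosen only for illustration), the
       pattern below is spread over several lines with comments, escapes the
       literal "#", and keeps a literal space by placing it in a character
       class:

```perl
use strict;
use warnings;

# Hypothetical sample input of the form "key # value".
my $line = "color # dark red";

my ($key, $value);
if ($line =~ m{
        ^ (\w+)         # key: one or more word characters
        \s* \# \s*      # a literal "#" has to be escaped under /x
        (\w+ [ ] \w+)   # two words; the space survives inside a character class
        $
    }x) {
    ($key, $value) = ($1, $2);
}

print "$key => $value\n";   # prints: color => dark red
```

       The comments avoid the closing delimiter and the "$" and "@"
       characters, since the pattern is still scanned for its end and for
       interpolated variables before the comments are discarded.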
67
68       Regular Expressions
69
       The patterns used in Perl pattern matching derive from those supplied
       in the Version 8 regex routines.  (The routines are derived (distantly)
       from Henry Spencer's freely redistributable reimplementation of the V8
       routines.)  See "Version 8 Regular Expressions" for details.
74
75       In particular the following metacharacters have their standard
76       egrep-ish meanings:
77
           \   Quote the next metacharacter
           ^   Match the beginning of the line
           .   Match any character (except newline)
           $   Match the end of the line (or before newline at the end)
           |   Alternation
           ()  Grouping
           []  Character class
85
86       By default, the "^" character is guaranteed to match only the beginning
87       of the string, the "$" character only the end (or before the newline at
88       the end), and Perl does certain optimizations with the assumption that
89       the string contains only one line.  Embedded newlines will not be
90       matched by "^" or "$".  You may, however, wish to treat a string as a
91       multi-line buffer, such that the "^" will match after any newline
92       within the string, and "$" will match before any newline.  At the cost
93       of a little more overhead, you can do this by using the /m modifier on
94       the pattern match operator.  (Older programs did this by setting $*,
95       but this practice is now deprecated.)
96
97       To simplify multi-line substitutions, the "." character never matches a
98       newline unless you use the "/s" modifier, which in effect tells Perl to
99       pretend the string is a single line--even if it isn't.  The "/s" modi‐
100       fier also overrides the setting of $*, in case you have some (badly
101       behaved) older code that sets it in another module.
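
       The effect of "/m" and "/s" can be seen in a short sketch.  The sample
       string and variable names are illustrative assumptions only:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, "^" matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;      # ("first")

# With /m, "^" also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;     # ("first", "second")

# Without /s, "." stops at a newline; with /s it crosses it.
my ($no_s)   = $text =~ /(first.*)/;     # "first line"
my ($with_s) = $text =~ /(first.*)/s;    # the whole string

print scalar(@plain), " match without /m, ", scalar(@multi), " with /m\n";
```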
102
103       The following standard quantifiers are recognized:
104
105           *      Match 0 or more times
106           +      Match 1 or more times
107           ?      Match 1 or 0 times
108           {n}    Match exactly n times
109           {n,}   Match at least n times
110           {n,m}  Match at least n but not more than m times
111
       (If a curly bracket occurs in any other context, it is treated as a
       regular character.  In particular, the lower bound is not optional.)
       The "*" quantifier is equivalent to "{0,}", the "+" quantifier to
       "{1,}", and the "?" quantifier to "{0,1}".  n and m are limited to
       integral values less than a preset limit defined when perl is built.
       This is usually 32766 on the most common platforms.  The actual limit
       can be seen in the error message generated by code such as this:
119
           $_ **= $_ , / {$_} / for 2 .. 42;
121
122       By default, a quantified subpattern is "greedy", that is, it will match
123       as many times as possible (given a particular starting location) while
124       still allowing the rest of the pattern to match.  If you want it to
125       match the minimum number of times possible, follow the quantifier with
126       a "?".  Note that the meanings don't change, just the "greediness":
127
128           *?     Match 0 or more times
129           +?     Match 1 or more times
130           ??     Match 0 or 1 time
131           {n}?   Match exactly n times
132           {n,}?  Match at least n times
133           {n,m}? Match at least n but not more than m times
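
       A small sketch of the difference between greedy and minimal matching
       follows; the sample markup and variable names are hypothetical:

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: ".+" takes as much as it can, so the match runs to the last ">".
my ($greedy) = $html =~ /<(.+)>/;

# Minimal: ".+?" takes as little as it can, so the match stops at the first ">".
my ($minimal) = $html =~ /<(.+?)>/;

print "greedy:  $greedy\n";    # b>bold</b> and <i>italic</i
print "minimal: $minimal\n";   # b
```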
134
135       Because patterns are processed as double quoted strings, the following
136       also work:
137
138           \t          tab                   (HT, TAB)
139           \n          newline               (LF, NL)
140           \r          return                (CR)
141           \f          form feed             (FF)
142           \a          alarm (bell)          (BEL)
143           \e          escape (think troff)  (ESC)
144           \033        octal char (think of a PDP-11)
145           \x1B        hex char
146           \x{263a}    wide hex char         (Unicode SMILEY)
147           \c[         control char
148           \N{name}    named char
149           \l          lowercase next char (think vi)
150           \u          uppercase next char (think vi)
151           \L          lowercase till \E (think vi)
152           \U          uppercase till \E (think vi)
153           \E          end case modification (think vi)
154           \Q          quote (disable) pattern metacharacters till \E
155
156       If "use locale" is in effect, the case map used by "\l", "\L", "\u" and
157       "\U" is taken from the current locale.  See perllocale.  For documenta‐
158       tion of "\N{name}", see charnames.
159
160       You cannot include a literal "$" or "@" within a "\Q" sequence.  An
161       unescaped "$" or "@" interpolates the corresponding variable, while
162       escaping will cause the literal string "\$" to be matched.  You'll need
163       to write something like "m/\Quser\E\@\Qhost/".
164
165       In addition, Perl defines the following:
166
167           \w  Match a "word" character (alphanumeric plus "_")
168           \W  Match a non-"word" character
169           \s  Match a whitespace character
170           \S  Match a non-whitespace character
171           \d  Match a digit character
172           \D  Match a non-digit character
173           \pP Match P, named property.  Use \p{Prop} for longer names.
174           \PP Match non-P
175           \X  Match eXtended Unicode "combining character sequence",
176               equivalent to (?:\PM\pM*)
177           \C  Match a single C char (octet) even under Unicode.
178               NOTE: breaks up characters into their UTF-8 bytes,
179               so you may end up with malformed pieces of UTF-8.
180               Unsupported in lookbehind.
181
       A "\w" matches a single alphanumeric character (an alphabetic
       character, or a decimal digit) or "_", not a whole word.  Use "\w+" to
       match a string of Perl-identifier characters (which isn't the same as
       matching an English word).  If "use locale" is in effect, the list of
       alphabetic characters generated by "\w" is taken from the current
       locale.  See perllocale.  You may use "\w", "\W", "\s", "\S", "\d", and
       "\D" within character classes, but if you try to use them as endpoints
       of a range, the result is not a range: the "-" is understood literally.
       If Unicode is in effect, "\s" also matches "\x{85}", "\x{2028}", and
       "\x{2029}".  See perlunicode for more details about "\pP", "\PP", and
       "\X", and perluniintro about Unicode in general.  You can define your
       own "\p" and "\P" properties; see perlunicode.
194
195       The POSIX character class syntax
196
197           [:class:]
198
199       is also available.  The available classes and their backslash equiva‐
200       lents (if available) are as follows:
201
202           alpha
203           alnum
204           ascii
205           blank               [1]
206           cntrl
207           digit       \d
208           graph
209           lower
210           print
211           punct
212           space       \s      [2]
213           upper
214           word        \w      [3]
215           xdigit
216
217       [1] A GNU extension equivalent to "[ \t]", "all horizontal whitespace".
218
219       [2] Not exactly equivalent to "\s" since the "[[:space:]]" includes
220           also the (very rare) "vertical tabulator", "\ck", chr(11).
221
222       [3] A Perl extension, see above.
223
224       For example use "[:upper:]" to match all the uppercase characters.
225       Note that the "[]" are part of the "[::]" construct, not part of the
226       whole character class.  For example:
227
228           [01[:alpha:]%]
229
230       matches zero, one, any alphabetic character, and the percentage sign.
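
       As an illustration (the test strings and variable names are assumptions
       made for this sketch), the class above can be anchored and applied to a
       few inputs:

```perl
use strict;
use warnings;

# The POSIX class sits inside the outer brackets alongside "0", "1" and "%".
my $class_re = qr/^[01[:alpha:]%]+$/;

my @accepted = grep { /$class_re/ }  ("0a%", "1Z", "abc");
my @rejected = grep { !/$class_re/ } ("2", "a b", "50%");

print "accepted: @accepted\n";   # 0a% 1Z abc
print "rejected: @rejected\n";   # 2 a b 50%
```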
231
232       The following equivalences to Unicode \p{} constructs and equivalent
233       backslash character classes (if available), will hold:
234
235           [:...:]     \p{...}         backslash
236
237           alpha       IsAlpha
238           alnum       IsAlnum
239           ascii       IsASCII
240           blank       IsSpace
241           cntrl       IsCntrl
242           digit       IsDigit        \d
243           graph       IsGraph
244           lower       IsLower
245           print       IsPrint
246           punct       IsPunct
247           space       IsSpace
248                       IsSpacePerl    \s
249           upper       IsUpper
250           word        IsWord
251           xdigit      IsXDigit
252
253       For example "[:lower:]" and "\p{IsLower}" are equivalent.
254
255       If the "utf8" pragma is not used but the "locale" pragma is, the
256       classes correlate with the usual isalpha(3) interface (except for
257       "word" and "blank").
258
       The classes whose names may not be obvious are:
260
261       cntrl
262           Any control character.  Usually characters that don't produce out‐
263           put as such but instead control the terminal somehow: for example
264           newline and backspace are control characters.  All characters with
265           ord() less than 32 are most often classified as control characters
266           (assuming ASCII, the ISO Latin character sets, and Unicode), as is
267           the character with the ord() value of 127 ("DEL").
268
269       graph
270           Any alphanumeric or punctuation (special) character.
271
272       print
273           Any alphanumeric or punctuation (special) character or the space
274           character.
275
276       punct
277           Any punctuation (special) character.
278
279       xdigit
280           Any hexadecimal digit.  Though this may feel silly ([0-9A-Fa-f]
281           would work just fine) it is included for completeness.
282
283       You can negate the [::] character classes by prefixing the class name
284       with a '^'. This is a Perl extension.  For example:
285
           POSIX           traditional     Unicode

           [:^digit:]      \D              \P{IsDigit}
           [:^space:]      \S              \P{IsSpace}
           [:^word:]       \W              \P{IsWord}
291
292       Perl respects the POSIX standard in that POSIX character classes are
293       only supported within a character class.  The POSIX character classes
294       [.cc.] and [=cc=] are recognized but not supported and trying to use
295       them will cause an error.
296
297       Perl defines the following zero-width assertions:
298
299           \b  Match a word boundary
300           \B  Match a non-(word boundary)
301           \A  Match only at beginning of string
302           \Z  Match only at end of string, or before newline at the end
303           \z  Match only at end of string
304           \G  Match only at pos() (e.g. at the end-of-match position
305               of prior m//g)
306
307       A word boundary ("\b") is a spot between two characters that has a "\w"
308       on one side of it and a "\W" on the other side of it (in either order),
309       counting the imaginary characters off the beginning and end of the
310       string as matching a "\W".  (Within character classes "\b" represents
311       backspace rather than a word boundary, just as it normally does in any
312       double-quoted string.)  The "\A" and "\Z" are just like "^" and "$",
313       except that they won't match multiple times when the "/m" modifier is
314       used, while "^" and "$" will match at every internal line boundary.  To
315       match the actual end of the string and not ignore an optional trailing
316       newline, use "\z".
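
       The following sketch contrasts "$", "\Z" and "\z" on a string with a
       trailing newline, and shows "\b" selecting whole words.  The sample
       strings are hypothetical:

```perl
use strict;
use warnings;

my $s = "end\n";

my $dollar  = $s =~ /end$/  ? 1 : 0;   # 1: "$" tolerates the trailing newline
my $upper_z = $s =~ /end\Z/ ? 1 : 0;   # 1: so does "\Z"
my $lower_z = $s =~ /end\z/ ? 1 : 0;   # 0: "\z" insists on the true end

# "\b" needs a \w on one side and a \W (or the string edge) on the other,
# so the "cat" inside "concat" is skipped.
my @whole = "cat concat cat's" =~ /\bcat\b/g;

print "$dollar $upper_z $lower_z ", scalar(@whole), "\n";   # 1 1 0 2
```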
317
       The "\G" assertion can be used to chain global matches (using "m//g"),
       as described in "Regexp Quote-Like Operators" in perlop.  It is also
       useful when writing "lex"-like scanners, when you have several patterns
       that you want to match against successive substrings of your string;
       see the previous reference.  The actual location where "\G" will match
       can also be influenced by using "pos()" as an lvalue: see "pos" in
       perlfunc.  Currently "\G" is only fully supported when anchored to the
       start of the pattern; while it is permitted to use it elsewhere, as in
       "/(?<=\G..)./g", some such uses ("/.\G/g", for example) currently cause
       problems, and it is recommended that you avoid such usage for now.
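
       A "lex"-like scanner of the kind mentioned above might be sketched as
       follows.  The token names and input are hypothetical, and the sketch
       relies on the "/c" modifier described in perlop, which keeps pos()
       intact when a match fails so that the next alternative can be tried at
       the same place:

```perl
use strict;
use warnings;

my $input = "x = 42 + y";   # hypothetical input
my @tokens;

while (1) {
    if    ($input =~ /\G\s+/gc)    { next }                      # skip whitespace
    elsif ($input =~ /\G(\d+)/gc)  { push @tokens, "NUM:$1" }
    elsif ($input =~ /\G(\w+)/gc)  { push @tokens, "ID:$1" }
    elsif ($input =~ /\G([=+])/gc) { push @tokens, "OP:$1" }
    else                           { last }                      # end of input or unknown character
}

print join(" ", @tokens), "\n";   # ID:x OP:= NUM:42 OP:+ ID:y
```

       Each pattern starts with "\G", in line with the advice above to anchor
       it at the start of the pattern.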
328
329       The bracketing construct "( ... )" creates capture buffers.  To refer
330       to the digit'th buffer use \<digit> within the match.  Outside the
331       match use "$" instead of "\".  (The \<digit> notation works in certain
332       circumstances outside the match.  See the warning below about \1 vs $1
333       for details.)  Referring back to another part of the match is called a
334       backreference.
335
336       There is no limit to the number of captured substrings that you may
337       use.  However Perl also uses \10, \11, etc. as aliases for \010, \011,
338       etc.  (Recall that 0 means octal, so \011 is the character at number 9
339       in your coded character set; which would be the 10th character, a hori‐
340       zontal tab under ASCII.)  Perl resolves this ambiguity by interpreting
341       \10 as a backreference only if at least 10 left parentheses have opened
342       before it.  Likewise \11 is a backreference only if at least 11 left
343       parentheses have opened before it.  And so on.  \1 through \9 are
344       always interpreted as backreferences.
345
346       Examples:
347
           s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

           if (/(.)\1/) {                  # find first doubled char
               print "'$1' is the first doubled character\n";
           }

           if (/Time: (..):(..):(..)/) {   # parse out values
               $hours = $1;
               $minutes = $2;
               $seconds = $3;
           }
359
360       Several special variables also refer back to portions of the previous
361       match.  $+ returns whatever the last bracket match matched.  $& returns
362       the entire matched string.  (At one point $0 did also, but now it
363       returns the name of the program.)  $` returns everything before the
364       matched string.  $' returns everything after the matched string. And
365       $^N contains whatever was matched by the most-recently closed group
366       (submatch). $^N can be used in extended patterns (see below), for exam‐
367       ple to assign a submatch to a variable.
368
369       The numbered match variables ($1, $2, $3, etc.) and the related punctu‐
370       ation set ($+, $&, $`, $', and $^N) are all dynamically scoped until
371       the end of the enclosing block or until the next successful match,
372       whichever comes first.  (See "Compound Statements" in perlsyn.)
373
374       NOTE: failed matches in Perl do not reset the match variables, which
375       makes it easier to write code that tests for a series of more specific
376       cases and remembers the best match.
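
       A brief sketch of this behaviour, using a hypothetical input string: the
       second match fails, yet $1 still holds the value captured by the first.

```perl
use strict;
use warnings;

my $str = "version 12";

$str =~ /(\d+)/;               # succeeds and sets $1
my $after_success = $1;

$str =~ /(\d+)\.(\d+)/;        # fails: the string contains no dot
my $after_failure = $1;        # unchanged by the failed match

print "$after_success $after_failure\n";   # 12 12
```

       Because a failed match leaves stale values behind, code that depends on
       $1 should normally test the result of the match itself.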
377
378       WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in
379       the program, it has to provide them for every pattern match.  This may
380       substantially slow your program.  Perl uses the same mechanism to pro‐
381       duce $1, $2, etc, so you also pay a price for each pattern that con‐
382       tains capturing parentheses.  (To avoid this cost while retaining the
383       grouping behaviour, use the extended regular expression "(?: ... )"
384       instead.)  But if you never use $&, $` or $', then patterns without
385       capturing parentheses will not be penalized.  So avoid $&, $', and $`
386       if you can, but if you can't (and some algorithms really appreciate
387       them), once you've used them once, use them at will, because you've
388       already paid the price.  As of 5.005, $& is not so costly as the other
389       two.
390
391       Backslashed metacharacters in Perl are alphanumeric, such as "\b",
392       "\w", "\n".  Unlike some other regular expression languages, there are
393       no backslashed symbols that aren't alphanumeric.  So anything that
394       looks like \\, \(, \), \<, \>, \{, or \} is always interpreted as a
395       literal character, not a metacharacter.  This was once used in a common
396       idiom to disable or quote the special meanings of regular expression
397       metacharacters in a string that you want to use for a pattern. Simply
398       quote all non-"word" characters:
399
400           $pattern =~ s/(\W)/\\$1/g;
401
402       (If "use locale" is set, then this depends on the current locale.)
403       Today it is more common to use the quotemeta() function or the "\Q"
404       metaquoting escape sequence to disable all metacharacters' special
405       meanings like this:
406
407           /$unquoted\Q$quoted\E$unquoted/
408
409       Beware that if you put literal backslashes (those not inside interpo‐
410       lated variables) between "\Q" and "\E", double-quotish backslash inter‐
411       polation may lead to confusing results.  If you need to use literal
412       backslashes within "\Q...\E", consult "Gory details of parsing quoted
413       constructs" in perlop.
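
       The sketch below compares a quoted and an unquoted interpolation.  The
       strings are hypothetical, and only quotemeta() and "\Q...\E" as
       described above are used:

```perl
use strict;
use warnings;

my $literal = "a.b*c";               # hypothetical text to match literally
my $quoted  = quotemeta($literal);   # backslashes every non-word character

my $exact    = "xx a.b*c yy"  =~ /\Q$literal\E/ ? 1 : 0;   # 1
my $loose    = "xx aXbbbc yy" =~ /$literal/     ? 1 : 0;   # 1: "." and "*" stay special
my $no_match = "xx aXbbbc yy" =~ /\Q$literal\E/ ? 1 : 0;   # 0

print "$quoted\n";   # a\.b\*c
```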
414
415       Extended Patterns
416
417       Perl also defines a consistent extension syntax for features not found
418       in standard tools like awk and lex.  The syntax is a pair of parenthe‐
419       ses with a question mark as the first thing within the parentheses.
420       The character after the question mark indicates the extension.
421
422       The stability of these extensions varies widely.  Some have been part
423       of the core language for many years.  Others are experimental and may
424       change without warning or be completely removed.  Check the documenta‐
425       tion on an individual feature to verify its current status.
426
427       A question mark was chosen for this and for the minimal-matching con‐
428       struct because 1) question marks are rare in older regular expressions,
429       and 2) whenever you see one, you should stop and "question" exactly
430       what is going on.  That's psychology...
431
432       "(?#text)"
433                 A comment.  The text is ignored.  If the "/x" modifier
434                 enables whitespace formatting, a simple "#" will suffice.
435                 Note that Perl closes the comment as soon as it sees a ")",
436                 so there is no way to put a literal ")" in the comment.
437
438       "(?imsx-imsx)"
                 One or more embedded pattern-match modifiers, to be turned on
                 (or turned off, if preceded by "-") for the remainder of the
                 pattern or the remainder of the enclosing pattern group (if
                 any).  This is particularly useful for dynamic patterns, such
                 as those read in from a configuration file, taken from an
                 argument, or specified in a table somewhere.  Consider the
                 case where some patterns want to be case sensitive and some
                 do not.  The case insensitive ones merely need to include
                 "(?i)" at the front of the pattern.  For example:
448
                     $pattern = "foobar";
                     if ( /$pattern/i ) { }

                     # more flexible:

                     $pattern = "(?i)foobar";
                     if ( /$pattern/ ) { }
456
457                 These modifiers are restored at the end of the enclosing
458                 group. For example,
459
                     ( (?i) blah ) \s+ \1
461
462                 will match a repeated (including the case!) word "blah" in
463                 any case, assuming "x" modifier, and no "i" modifier outside
464                 this group.
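
                 A minimal sketch of that scoping, with hypothetical test
                 strings: the group matches "blah" in any case, but the
                 backreference outside the group must repeat the captured
                 text exactly.

```perl
use strict;
use warnings;

# (?i) applies only inside the capturing group; \1 sits outside it.
my $re = qr/( (?i) blah ) \s+ \1/x;

my $same_case  = "BLAH BLAH" =~ $re ? 1 : 0;   # 1
my $mixed_case = "Blah blah" =~ $re ? 1 : 0;   # 0

print "same case: $same_case, mixed case: $mixed_case\n";
```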
465
466       "(?:pattern)"
467       "(?imsx-imsx:pattern)"
468                 This is for clustering, not capturing; it groups subexpres‐
469                 sions like "()", but doesn't make backreferences as "()"
470                 does.  So
471
                     @fields = split(/\b(?:a|b|c)\b/)

                 is like

                     @fields = split(/\b(a|b|c)\b/)
477
478                 but doesn't spit out extra fields.  It's also cheaper not to
479                 capture characters if you don't need to.
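
                 The difference shows up in the returned list.  In this
                 sketch the input line is hypothetical, and "\s*" has been
                 added around the separator so that the fields come back
                 without surrounding spaces:

```perl
use strict;
use warnings;

my $line = "1 a 2 b 3";

# Clustering only: the separators are discarded.
my @clustered = split /\s*\b(?:a|b|c)\b\s*/, $line;

# Capturing: each captured separator is returned as an extra field.
my @captured = split /\s*\b(a|b|c)\b\s*/, $line;

print "@clustered\n";   # 1 2 3
print "@captured\n";    # 1 a 2 b 3
```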
480
                 Any letters between "?" and ":" act as flag modifiers, as
                 with "(?imsx-imsx)".  For example,
483
                     /(?s-i:more.*than).*million/i

                 is equivalent to the more verbose

                     /(?:(?s-i)more.*than).*million/i
489
490       "(?=pattern)"
491                 A zero-width positive look-ahead assertion.  For example,
492                 "/\w+(?=\t)/" matches a word followed by a tab, without
493                 including the tab in $&.
494
495       "(?!pattern)"
496                 A zero-width negative look-ahead assertion.  For example
497                 "/foo(?!bar)/" matches any occurrence of "foo" that isn't
498                 followed by "bar".  Note however that look-ahead and look-
499                 behind are NOT the same thing.  You cannot use this for
500                 look-behind.
501
                 If you are looking for a "bar" that isn't preceded by a
                 "foo", "/(?!foo)bar/" will not do what you want.  That's
                 because the "(?!foo)" is just saying that the next thing
                 cannot be "foo"--and it's not, it's a "bar", so "foobar" will
                 match.  You would have to do something like "/(?!foo)...bar/"
                 for that.  We say "like" because there's the case of your
                 "bar" not having three characters before it.  You could cover
                 that this way: "/(?:(?!foo)...|^.{0,2})bar/".  Sometimes it's
                 still easier just to say:
511
                     if (/bar/ && $` !~ /foo$/)
513
514                 For look-behind see below.
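
                 The pitfall can be checked with a short sketch.  The sample
                 words are hypothetical; the third pattern uses the negative
                 look-behind form for comparison:

```perl
use strict;
use warnings;

my @samples = ("foobar", "bazbar", "bar");

# Look-ahead placed before "bar" only inspects "bar" itself, so nothing is excluded.
my @naive = grep { /(?!foo)bar/ } @samples;

# The padded form checks the three characters before "bar", or a short prefix.
my @padded = grep { /(?:(?!foo)...|^.{0,2})bar/ } @samples;

# Negative look-behind states the intent directly.
my @behind = grep { /(?<!foo)bar/ } @samples;

print "naive:  @naive\n";    # foobar bazbar bar
print "padded: @padded\n";   # bazbar bar
print "behind: @behind\n";   # bazbar bar
```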
515
516       "(?<=pattern)"
517                 A zero-width positive look-behind assertion.  For example,
518                 "/(?<=\t)\w+/" matches a word that follows a tab, without
519                 including the tab in $&.  Works only for fixed-width
520                 look-behind.
521
522       "(?<!pattern)"
523                 A zero-width negative look-behind assertion.  For example
524                 "/(?<!bar)foo/" matches any occurrence of "foo" that does not
525                 follow "bar".  Works only for fixed-width look-behind.
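
                 As a sketch of both look-behind forms (the sample text and
                 variable names are assumptions for illustration), a number
                 can be selected by what does or does not precede it:

```perl
use strict;
use warnings;

my $text = 'cost: $45, qty: 3';

# Positive look-behind: digits that directly follow a dollar sign.
my ($price) = $text =~ /(?<=\$)(\d+)/;

# Negative look-behind: a whole number that does not follow a dollar sign.
my ($count) = $text =~ /(?<!\$)\b(\d+)/;

print "price=$price count=$count\n";   # price=45 count=3
```

                 Both look-behind patterns here are a single character wide,
                 which satisfies the fixed-width requirement.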
526
527       "(?{ code })"
528                 WARNING: This extended regular expression feature is consid‐
529                 ered highly experimental, and may be changed or deleted with‐
530                 out notice.
531
532                 This zero-width assertion evaluates any embedded Perl code.
533                 It always succeeds, and its "code" is not interpolated.  Cur‐
534                 rently, the rules to determine where the "code" ends are
535                 somewhat convoluted.
536
537                 This feature can be used together with the special variable
538                 $^N to capture the results of submatches in variables without
539                 having to keep track of the number of nested parentheses. For
540                 example:
541
                   $_ = "The brown fox jumps over the lazy dog";
                   /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
                   print "color = $color, animal = $animal\n";
545
                 Inside the "(?{...})" block, $_ refers to the string the
                 regular expression is matching against.  You can also use
                 "pos()" to find the current matching position within this
                 string.
550
551                 The "code" is properly scoped in the following sense: If the
552                 assertion is backtracked (compare "Backtracking"), all
553                 changes introduced after "local"ization are undone, so that
554
                   $_ = 'a' x 8;
                   m<
                      (?{ $cnt = 0 })                    # Initialize $cnt.
                      (
                        a
                        (?{
                            local $cnt = $cnt + 1;       # Update $cnt, backtracking-safe.
                        })
                      )*
                      aaaa
                      (?{ $res = $cnt })                 # On success copy to non-localized
                                                         # location.
                    >x;
568
569                 will set "$res = 4".  Note that after the match, $cnt returns
570                 to the globally introduced value, because the scopes that
571                 restrict "local" operators are unwound.
572
                 This assertion may be used as a
                 "(?(condition)yes-pattern|no-pattern)" switch.  If not used
                 in this way, the result of evaluation of "code" is put into
                 the special variable $^R.  This happens immediately, so $^R
                 can be used from other "(?{ code })" assertions inside the
                 same regular expression.
579
580                 The assignment to $^R above is properly localized, so the old
581                 value of $^R is restored if the assertion is backtracked;
582                 compare "Backtracking".
583
584                 For reasons of security, this construct is forbidden if the
585                 regular expression involves run-time interpolation of vari‐
586                 ables, unless the perilous "use re 'eval'" pragma has been
587                 used (see re), or the variables contain results of "qr//"
588                 operator (see "qr/STRING/imosx" in perlop).
589
590                 This restriction is because of the wide-spread and remarkably
591                 convenient custom of using run-time determined strings as
592                 patterns.  For example:
593
                     $re = <>;
                     chomp $re;
                     $string =~ /$re/;
597
598                 Before Perl knew how to execute interpolated code within a
599                 pattern, this operation was completely safe from a security
600                 point of view, although it could raise an exception from an
601                 illegal pattern.  If you turn on the "use re 'eval'", though,
602                 it is no longer secure, so you should only do so if you are
603                 also using taint checking.  Better yet, use the carefully
604                 constrained evaluation within a Safe compartment.  See
605                 perlsec for details about both these mechanisms.
606
607       "(??{ code })"
608                 WARNING: This extended regular expression feature is consid‐
609                 ered highly experimental, and may be changed or deleted with‐
610                 out notice.  A simplified version of the syntax may be intro‐
611                 duced for commonly used idioms.
612
613                 This is a "postponed" regular subexpression.  The "code" is
614                 evaluated at run time, at the moment this subexpression may
615                 match.  The result of evaluation is considered as a regular
616                 expression and matched as if it were inserted instead of this
617                 construct.
618
619                 The "code" is not interpolated.  As before, the rules to
620                 determine where the "code" ends are currently somewhat convo‐
621                 luted.
622
623                 The following pattern matches a parenthesized group:
624
                   $re = qr{
                              \(
                              (?:
                                 (?> [^()]+ )    # Non-parens without backtracking
                               |
                                 (??{ $re })     # Group with matching parens
                              )*
                              \)
                           }x;
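
                 A possible way to exercise this pattern is sketched below.
                 The test strings and the variable names other than $re are
                 made up for illustration, and $re is declared before it is
                 assigned so that the code block can see it under "use
                 strict":

```perl
use strict;
use warnings;

# Declare $re first so the code block inside the pattern can refer to it.
my $re;
$re = qr{
          \(
          (?:
             (?> [^()]+ )    # Non-parens without backtracking
           |
             (??{ $re })     # Group with matching parens
          )*
          \)
        }x;

my ($first)    = "f(a(b)c)(d)" =~ /($re)/;      # first balanced group found
my $balanced   = "(a(b)c)" =~ /^$re$/ ? 1 : 0;  # whole string is one group
my $unbalanced = "((x)"    =~ /^$re$/ ? 1 : 0;  # one ")" is missing

print "first group: $first\n";
print "balanced: $balanced, unbalanced: $unbalanced\n";
```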
634
635       "(?>pattern)"
636                 WARNING: This extended regular expression feature is consid‐
637                 ered highly experimental, and may be changed or deleted with‐
638                 out notice.
639
640                 An "independent" subexpression, one which matches the sub‐
641                 string that a standalone "pattern" would match if anchored at
642                 the given position, and it matches nothing other than this
643                 substring.  This construct is useful for optimizations of
644                 what would otherwise be "eternal" matches, because it will
645                 not backtrack (see "Backtracking").  It may also be useful in
646                 places where the "grab all you can, and do not give anything
647                 back" semantic is desirable.
648
649                 For example: "^(?>a*)ab" will never match, since "(?>a*)"
650                 (anchored at the beginning of string, as above) will match
651                 all characters "a" at the beginning of string, leaving no "a"
652                 for "ab" to match.  In contrast, "a*ab" will match the same
653                 as "a+b", since the match of the subgroup "a*" is influenced
654                 by the following group "ab" (see "Backtracking").  In partic‐
655                 ular, "a*" inside "a*ab" will match fewer characters than a
656                 standalone "a*", since this makes the tail match.
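
                 The contrast can be checked directly.  This is only a small
                 sketch; the sample string "aaab" is an arbitrary choice:

```perl
use strict;
use warnings;

my $text = "aaab";

# The independent group keeps all three "a"s, so "ab" has nothing to match.
my $atomic = $text =~ /^(?>a*)ab/ ? 1 : 0;

# The plain quantifier gives one "a" back so that the tail can match.
my $plain  = $text =~ /^a*ab/ ? 1 : 0;

print "atomic: $atomic, plain: $plain\n";
```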
657
                 An effect similar to "(?>pattern)" may be achieved by writing
                 "(?=(pattern))\1".  The look-ahead matches the same substring
                 as a standalone "pattern", and the following "\1" eats the
                 matched string; it therefore makes a zero-length assertion
                 into an analogue of "(?>...)".  (The difference between these
                 two constructs is that the second one uses a capturing group,
                 thus shifting ordinals of backreferences in the rest of a
                 regular expression.)
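
                 A short sketch of both points, using "a*" as the pattern and
                 made-up sample strings: the look-ahead form refuses to give
                 characters back just as the independent group does, but it
                 occupies capture group 1:

```perl
use strict;
use warnings;

# Neither form lets "a*" give back an "a" for the trailing "ab".
my $lookahead = "aaab" =~ /^(?=(a*))\1ab/ ? 1 : 0;
my $atomic    = "aaab" =~ /^(?>a*)ab/     ? 1 : 0;

# The look-ahead form uses group 1, so the "!" lands in group 2.
my @look_groups   = "aaab!" =~ /^(?=(a*))\1b(!)/;
my @atomic_groups = "aaab!" =~ /^(?>a*)b(!)/;

print "lookahead: $lookahead, atomic: $atomic\n";
print "look-ahead groups: @look_groups\n";
print "atomic groups: @atomic_groups\n";
```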
666
667                 Consider this pattern:
668
                     m{ \(
                           (
                             [^()]+              # x+
                           |
                             \( [^()]* \)
                           )+
                        \)
                      }x
677
678                 That will efficiently match a nonempty group with matching
679                 parentheses two levels deep or less.  However, if there is no
680                 such group, it will take virtually forever on a long string.
681                 That's because there are so many different ways to split a
682                 long string into several substrings.  This is what "(.+)+" is
683                 doing, and "(.+)+" is similar to a subpattern of the above
684                 pattern.  Consider how the pattern above detects no-match on
685                 "((()aaaaaaaaaaaaaaaaaa" in several seconds, but that each
686                 extra letter doubles this time.  This exponential performance
687                 will make it appear that your program has hung.  However, a
688                 tiny change to this pattern
689
                     m{ \(
                           (
                             (?> [^()]+ )        # change x+ above to (?> x+ )
                           |
                             \( [^()]* \)
                           )+
                        \)
                      }x
698
699                 which uses "(?>...)" matches exactly when the one above does
700                 (verifying this yourself would be a productive exercise), but
701                 finishes in a fourth the time when used on a similar string
702                 with 1000000 "a"s.  Be aware, however, that this pattern cur‐
703                 rently triggers a warning message under the "use warnings"
704                 pragma or -w switch saying it "matches null string many times
705                 in regex".
706
707                 On simple groups, such as the pattern "(?> [^()]+ )", a com‐
708                 parable effect may be achieved by negative look-ahead, as in
709                 "[^()]+ (?! [^()] )".  This was only 4 times slower on a
710                 string with 1000000 "a"s.
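
                 The claim that these variants accept the same strings can be
                 spot-checked on a few short inputs.  This sketch says nothing
                 about timing; the inputs are arbitrary, and the failing one
                 is kept short so that even the plain pattern finishes
                 quickly:

```perl
use strict;
use warnings;

my $plain     = qr{ \( ( [^()]+                | \( [^()]* \) )+ \) }x;
my $atomic    = qr{ \( ( (?> [^()]+ )          | \( [^()]* \) )+ \) }x;
my $lookahead = qr{ \( ( [^()]+ (?! [^()] )    | \( [^()]* \) )+ \) }x;

my @inputs = ("(a(b)c)", "x(ab)y", "((()aaaaaaaaaa", "no parens");
my @verdicts;

for my $input (@inputs) {
    my $row = '';
    for my $pattern ($plain, $atomic, $lookahead) {
        $row .= ($input =~ $pattern) ? 1 : 0;   # one digit per pattern
    }
    push @verdicts, $row;
    print "$input: $row\n";
}
```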
711
                 The "grab all you can, and do not give anything back"
                 semantic is desirable in many situations where at first sight
                 a simple "()*" looks like the correct solution.  Suppose we
                 parse text with comments being delimited by "#" followed by
                 some optional (horizontal) whitespace.  Contrary to its
                 appearance, "#[ \t]*" is not the correct subexpression to
                 match the comment delimiter, because it may "give up" some
                 whitespace if the remainder of the pattern can be made to
                 match that way.  The correct answer is either one of these:
721
722                     (?>#[ \t]*)
723                     #[ \t]*(?![ \t])
724
725                 For example, to grab non-empty comments into $1, one should
726                 use either one of these:
727
728                     / (?> \# [ \t]* ) (        .+ ) /x;
729                     /     \# [ \t]*   ( [^ \t] .* ) /x;
730
731                 Which one you pick depends on which of these expressions bet‐
732                 ter reflects the above specification of comments.
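
                 To see the difference, consider a line whose comment is
                 empty apart from whitespace.  In this sketch the sample
                 lines and the variable names are invented for illustration:

```perl
use strict;
use warnings;

my $naive    = qr/ \# [ \t]* ( .+ ) /x;           # may give whitespace back
my $atomic   = qr/ (?> \# [ \t]* ) ( .+ ) /x;
my $explicit = qr/ \# [ \t]* ( [^ \t] .* ) /x;

my $empty = "#  ";       # delimiter followed only by whitespace
my $full  = "#  note";

# The naive pattern hands one space back to ".+" and reports a comment.
my ($naive_empty)  = $empty =~ $naive;
my $atomic_empty   = $empty =~ $atomic   ? 1 : 0;
my $explicit_empty = $empty =~ $explicit ? 1 : 0;

my ($atomic_full)   = $full =~ $atomic;
my ($explicit_full) = $full =~ $explicit;

print "naive on empty comment: <$naive_empty>\n";
print "atomic: $atomic_empty, explicit: $explicit_empty\n";
print "full comment: <$atomic_full> <$explicit_full>\n";
```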
733
       "(?(condition)yes-pattern|no-pattern)"
       "(?(condition)yes-pattern)"
736                 WARNING: This extended regular expression feature is consid‐
737                 ered highly experimental, and may be changed or deleted with‐
738                 out notice.
739
                 Conditional expression.  "(condition)" should be either an
                 integer in parentheses (which is valid if the corresponding
                 pair of parentheses matched), or a look-ahead, look-behind,
                 or evaluate zero-width assertion.
744
745                 For example:
746
747                     m{ ( \( )?
748                        [^()]+
749                        (?(1) \) )
750                      }x
751
752                 matches a chunk of non-parentheses, possibly included in
753                 parentheses themselves.
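
                 A brief sketch of the pattern in use.  The "^" and "$"
                 anchors are an addition made here so that partial matches do
                 not hide the effect of the condition, and the inputs are
                 arbitrary:

```perl
use strict;
use warnings;

# Same pattern as above, anchored at both ends for the demonstration.
my $chunk = qr{ ^ ( \( )? [^()]+ (?(1) \) ) $ }x;

my %result;
for my $input ("abc", "(abc)", "(abc", "abc)") {
    $result{$input} = $input =~ $chunk ? 1 : 0;
    print "$input: $result{$input}\n";
}
```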
754
755       Backtracking
756
757       NOTE: This section presents an abstract approximation of regular
758       expression behavior.  For a more rigorous (and complicated) view of the
759       rules involved in selecting a match among possible alternatives, see
760       "Combining pieces together".
761
762       A fundamental feature of regular expression matching involves the
763       notion called backtracking, which is currently used (when needed) by
764       all regular expression quantifiers, namely "*", "*?", "+", "+?",
765       "{n,m}", and "{n,m}?".  Backtracking is often optimized internally, but
766       the general principle outlined here is valid.
767
768       For a regular expression to match, the entire regular expression must
769       match, not just part of it.  So if the beginning of a pattern contain‐
770       ing a quantifier succeeds in a way that causes later parts in the pat‐
771       tern to fail, the matching engine backs up and recalculates the begin‐
772       ning part--that's why it's called backtracking.
773
774       Here is an example of backtracking:  Let's say you want to find the
775       word following "foo" in the string "Food is on the foo table.":
776
777           $_ = "Food is on the foo table.";
778           if ( /\b(foo)\s+(\w+)/i ) {
779               print "$2 follows $1.\n";
780           }
781
782       When the match runs, the first part of the regular expression
783       ("\b(foo)") finds a possible match right at the beginning of the
784       string, and loads up $1 with "Foo".  However, as soon as the matching
785       engine sees that there's no whitespace following the "Foo" that it had
786       saved in $1, it realizes its mistake and starts over again one charac‐
787       ter after where it had the tentative match.  This time it goes all the
788       way until the next occurrence of "foo". The complete regular expression
789       matches this time, and you get the expected output of "table follows
790       foo."
791
792       Sometimes minimal matching can help a lot.  Imagine you'd like to match
793       everything between "foo" and "bar".  Initially, you write something
794       like this:
795
796           $_ =  "The food is under the bar in the barn.";
797           if ( /foo(.*)bar/ ) {
798               print "got <$1>\n";
799           }
800
801       Which perhaps unexpectedly yields:
802
803         got <d is under the bar in the >
804
805       That's because ".*" was greedy, so you get everything between the first
806       "foo" and the last "bar".  Here it's more effective to use minimal
807       matching to make sure you get the text between a "foo" and the first
808       "bar" thereafter.
809
810           if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
811         got <d is under the >
812
813       Here's another example: let's say you'd like to match a number at the
814       end of a string, and you also want to keep the preceding part of the
815       match.  So you write this:
816
817           $_ = "I have 2 numbers: 53147";
818           if ( /(.*)(\d*)/ ) {                                # Wrong!
819               print "Beginning is <$1>, number is <$2>.\n";
820           }
821
822       That won't work at all, because ".*" was greedy and gobbled up the
823       whole string. As "\d*" can match on an empty string the complete regu‐
824       lar expression matched successfully.
825
826           Beginning is <I have 2 numbers: 53147>, number is <>.
827
828       Here are some variants, most of which don't work:
829
830           $_ = "I have 2 numbers: 53147";
831           @pats = qw{
832               (.*)(\d*)
833               (.*)(\d+)
834               (.*?)(\d*)
835               (.*?)(\d+)
836               (.*)(\d+)$
837               (.*?)(\d+)$
838               (.*)\b(\d+)$
839               (.*\D)(\d+)$
840           };
841
842           for $pat (@pats) {
843               printf "%-12s ", $pat;
844               if ( /$pat/ ) {
845                   print "<$1> <$2>\n";
846               } else {
847                   print "FAIL\n";
848               }
849           }
850
851       That will print out:
852
853           (.*)(\d*)    <I have 2 numbers: 53147> <>
854           (.*)(\d+)    <I have 2 numbers: 5314> <7>
855           (.*?)(\d*)   <> <>
856           (.*?)(\d+)   <I have > <2>
857           (.*)(\d+)$   <I have 2 numbers: 5314> <7>
858           (.*?)(\d+)$  <I have 2 numbers: > <53147>
859           (.*)\b(\d+)$ <I have 2 numbers: > <53147>
860           (.*\D)(\d+)$ <I have 2 numbers: > <53147>
861
862       As you see, this can be a bit tricky.  It's important to realize that a
863       regular expression is merely a set of assertions that gives a defini‐
864       tion of success.  There may be 0, 1, or several different ways that the
865       definition might succeed against a particular string.  And if there are
866       multiple ways it might succeed, you need to understand backtracking to
867       know which variety of success you will achieve.
868
869       When using look-ahead assertions and negations, this can all get even
870       trickier.  Imagine you'd like to find a sequence of non-digits not fol‐
871       lowed by "123".  You might try to write that as
872
873           $_ = "ABC123";
874           if ( /^\D*(?!123)/ ) {              # Wrong!
875               print "Yup, no 123 in $_\n";
876           }
877
878       But that isn't going to match; at least, not the way you're hoping.  It
879       claims that there is no 123 in the string.  Here's a clearer picture of
880       why that pattern matches, contrary to popular expectations:
881
882           $x = 'ABC123';
883           $y = 'ABC445';
884
885           print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
886           print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;
887
888           print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
889           print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;
890
891       This prints
892
893           2: got ABC
894           3: got AB
895           4: got ABC
896
       You might have expected test 3 to fail because it seems to be a more
       general-purpose version of test 1.  The important difference between
       them is that test 3 contains a quantifier ("\D*") and so can use
       backtracking, whereas test 1 will not.  What's happening is that
       you've asked "Is it true that at the start of $x, following 0 or more
       non-digits, you have something that's not 123?"  If the pattern matcher
       had let "\D*" expand to "ABC", this would have caused the whole pattern
       to fail.
905
       The search engine will initially match "\D*" with "ABC".  Then it will
       try to match "(?!123)" with "123", which fails.  But because a
       quantifier ("\D*") has been used in the regular expression, the search
       engine can backtrack and retry the match differently in the hope of
       matching the complete regular expression.
911
912       The pattern really, really wants to succeed, so it uses the standard
913       pattern back-off-and-retry and lets "\D*" expand to just "AB" this
914       time.  Now there's indeed something following "AB" that is not "123".
915       It's "C123", which suffices.
916
917       We can deal with this by using both an assertion and a negation.  We'll
918       say that the first part in $1 must be followed both by a digit and by
919       something that's not "123".  Remember that the look-aheads are zero-
920       width expressions--they only look, but don't consume any of the string
921       in their match.  So rewriting this way produces what you'd expect; that
922       is, case 5 will fail, but case 6 succeeds:
923
924           print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
925           print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;
926
927           6: got ABC
928
929       In other words, the two zero-width assertions next to each other work
930       as though they're ANDed together, just as you'd use any built-in asser‐
931       tions:  "/^$/" matches only if you're at the beginning of the line AND
932       the end of the line simultaneously.  The deeper underlying truth is
933       that juxtaposition in regular expressions always means AND, except when
934       you write an explicit OR using the vertical bar.  "/ab/" means match
935       "a" AND (then) match "b", although the attempted matches are made at
936       different positions because "a" is not a zero-width assertion, but a
937       one-width assertion.
938
       WARNING: particularly complicated regular expressions can take
       exponential time to solve because of the immense number of possible
       ways they can use backtracking to try to match.  For example, without
       internal optimizations done by the regular expression engine, this will
       take a painfully long time to run:
944
945           'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/
946
947       And if you used "*"'s in the internal groups instead of limiting them
948       to 0 through 5 matches, then it would take forever--or until you ran
949       out of stack space.  Moreover, these internal optimizations are not
950       always applicable.  For example, if you put "{0,5}" instead of "*" on
951       the external group, no current optimization is applicable, and the
952       match takes a long time to finish.
953
       A powerful tool for optimizing such beasts is what is known as an
       "independent group", which does not backtrack (see "(?>pattern)").
       Note also that zero-length look-ahead/look-behind assertions will not
       backtrack to make the tail match, since they are in "logical" context:
       only whether they match is considered relevant.  For an example where
       side-effects of look-ahead might have influenced the following match,
       see "(?>pattern)".
961
962       Version 8 Regular Expressions
963
964       In case you're not familiar with the "regular" Version 8 regex rou‐
965       tines, here are the pattern-matching rules not described above.
966
967       Any single character matches itself, unless it is a metacharacter with
968       a special meaning described here or above.  You can cause characters
969       that normally function as metacharacters to be interpreted literally by
970       prefixing them with a "\" (e.g., "\." matches a ".", not any character;
971       "\\" matches a "\").  A series of characters matches that series of
972       characters in the target string, so the pattern "blurfl" would match
973       "blurfl" in the target string.
974
       You can specify a character class by enclosing a list of characters in
       "[]", which will match any one character from the list.  If the first
       character after the "[" is "^", the class matches any character not in
       the list.  Within a list, the "-" character specifies a range, so that
       "a-z" represents all characters between "a" and "z", inclusive.  If you
       want either "-" or "]" itself to be a member of a class, put it at the
       start of the list (possibly after a "^"), or escape it with a
       backslash.  "-" is also taken literally when it is at the end of the
       list, just before the closing "]".  (The following all specify the same
       class of three characters: "[-az]", "[az-]", and "[a\-z]".  All are
       different from "[a-z]", which specifies a class containing twenty-six
       characters, even on EBCDIC-based coded character sets.)  Also, if you
       try to use the character classes "\w", "\W", "\s", "\S", "\d", or "\D"
       as endpoints of a range, that is not a range; the "-" is understood
       literally.
989
       Note also that the whole range idea is rather unportable between
       character sets--and even within a character set, ranges may cause
       results you probably didn't expect.  A sound principle is to use only
       ranges that begin and end at letters of the same case ([a-e], [A-E]),
       or at digits ([0-9]).  Anything else is unsafe.  If in doubt, spell out
       the character sets in full.
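
       The three spellings of the literal-hyphen class can be compared with
       the range as follows.  This is a small sketch; the sample text "a-mz"
       was picked so that the range and the three-character classes disagree:

```perl
use strict;
use warnings;

my $text = "a-mz";
my %hits;

# Collect, for each class, the characters of $text that it matches.
for my $class ('[-az]', '[az-]', '[a\-z]', '[a-z]') {
    $hits{$class} = join '', $text =~ /$class/g;
    print "$class matches: $hits{$class}\n";
}
```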
996
997       Characters may be specified using a metacharacter syntax much like that
998       used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
999       "\f" a form feed, etc.  More generally, \nnn, where nnn is a string of
1000       octal digits, matches the character whose coded character set value is
1001       nnn.  Similarly, \xnn, where nn are hexadecimal digits, matches the
1002       character whose numeric value is nn. The expression \cx matches the
1003       character control-x.  Finally, the "." metacharacter matches any char‐
1004       acter except "\n" (unless you use "/s").
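
       A few of these escapes in action.  The sketch assumes an ASCII-based
       platform, where newline is octal 012 and "A" is hexadecimal 41; the
       hash keys are illustrative names only:

```perl
use strict;
use warnings;

my %check = (
    octal     => ("\n"   =~ /\012/ ? 1 : 0),   # newline by octal code
    hex       => ("A"    =~ /\x41/ ? 1 : 0),   # "A" by hexadecimal code
    control   => (chr(1) =~ /\cA/  ? 1 : 0),   # control-A
    dot_plain => ("\n"   =~ /./    ? 1 : 0),   # "." skips newline
    dot_s     => ("\n"   =~ /./s   ? 1 : 0),   # unless /s is used
);

print "$_: $check{$_}\n" for sort keys %check;
```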
1005
       You can specify a series of alternatives for a pattern using "|" to
       separate them, so that "fee|fie|foe" will match any of "fee", "fie", or
       "foe" in the target string (as would "f(e|i|o)e").  The first
       alternative includes everything from the last pattern delimiter ("(",
       "[", or the beginning of the pattern) up to the first "|", and the last
       alternative contains everything from the last "|" to the next pattern
       delimiter.  That's why it's common practice to include alternatives in
       parentheses: to minimize confusion about where they start and end.
1014
       Alternatives are tried from left to right, so the first alternative
       found for which the entire expression matches is the one that is
       chosen.  This means that alternatives are not necessarily greedy.  For
       example: when matching "foo|foot" against "barefoot", only the "foo"
       part will match, as that is the first alternative tried, and it
       successfully matches the target string.  (This might not seem
       important, but it is important when you are capturing matched text
       using parentheses.)
1022
       Also remember that "|" is interpreted as a literal within square
       brackets, so if you write "[fee|fie|foe]" you're really only matching
       "[feio|]".
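
       Both points can be seen in a short sketch (the variable names are
       illustrative): the order of the alternatives decides what is captured,
       and a vertical bar inside brackets is just another member of the
       class:

```perl
use strict;
use warnings;

# The first alternative that lets the whole pattern match wins.
my ($short_first) = "barefoot" =~ /(foo|foot)/;
my ($long_first)  = "barefoot" =~ /(foot|foo)/;

# Inside a character class "|" is a literal character.
my $bar_in_class = "|" =~ /[fee|fie|foe]/ ? 1 : 0;

print "foo|foot captured: $short_first\n";
print "foot|foo captured: $long_first\n";
print "literal bar matched: $bar_in_class\n";
```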
1026
       Within a pattern, you may designate subpatterns for later reference by
       enclosing them in parentheses, and you may refer back to the nth
       subpattern later in the pattern using the metacharacter \n.
       Subpatterns are numbered based on the left to right order of their
       opening parenthesis.  A backreference matches whatever actually matched
       the subpattern in the string being examined, not the rules for that
       subpattern.  Therefore, "(0|0x)\d*\s\1\d*" will match "0x1234 0x4321",
       but not "0x1234 01234", because subpattern 1 matched "0x", even though
       the rule "0|0x" could potentially match the leading 0 in the second
       number.
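
       The example can be run as written.  In this sketch only the variable
       names are additions:

```perl
use strict;
use warnings;

my $same_prefix = qr/(0|0x)\d*\s\1\d*/;

# \1 must repeat the text that group 1 matched, here "0x".
my $both_hex = "0x1234 0x4321" =~ $same_prefix ? 1 : 0;
my $mixed    = "0x1234 01234"  =~ $same_prefix ? 1 : 0;

print "both hex: $both_hex, mixed: $mixed\n";
```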
1036
1037       Warning on \1 vs $1
1038
1039       Some people get too used to writing things like:
1040
1041           $pattern =~ s/(\W)/\\\1/g;
1042
1043       This is grandfathered for the RHS of a substitute to avoid shocking the
1044       sed addicts, but it's a dirty habit to get into.  That's because in
1045       PerlThink, the righthand side of an "s///" is a double-quoted string.
1046       "\1" in the usual double-quoted string means a control-A.  The custom‐
1047       ary Unix meaning of "\1" is kludged in for "s///".  However, if you get
1048       into the habit of doing that, you get yourself into trouble if you then
1049       add an "/e" modifier.
1050
1051           s/(\d+)/ \1 + 1 /eg;        # causes warning under -w
1052
1053       Or if you try to do
1054
1055           s/(\d+)/\1000/;
1056
1057       You can't disambiguate that by saying "\{1}000", whereas you can fix it
1058       with "${1}000".  The operation of interpolation should not be confused
1059       with the operation of matching a backreference.  Certainly they mean
1060       two different things on the left side of the "s///".
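
       A minimal sketch of the recommended spellings, with made-up sample
       strings: "${1}" keeps the group number apart from the digits that
       follow it, and "$1" behaves as an ordinary variable under "/e":

```perl
use strict;
use warnings;

# Braces separate the group number from the literal zeros.
(my $scaled = "cost: 42") =~ s/(\d+)/${1}000/;

# With /e the right-hand side is code, so use $1 rather than \1.
(my $bumped = "a 1 b 2") =~ s/(\d+)/ $1 + 1 /eg;

print "$scaled\n";
print "$bumped\n";
```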
1061
1062       Repeated patterns matching zero-length substring
1063
1064       WARNING: Difficult material (and prose) ahead.  This section needs a
1065       rewrite.
1066
1067       Regular expressions provide a terse and powerful programming language.
1068       As with most other power tools, power comes together with the ability
1069       to wreak havoc.
1070
1071       A common abuse of this power stems from the ability to make infinite
1072       loops using regular expressions, with something as innocuous as:
1073
1074           'foo' =~ m{ ( o? )* }x;
1075
       The "o?" can match at the beginning of 'foo', and since the position in
       the string is not moved by the match, "o?" would match again and again
       because of the "*" quantifier.  Another common way to create a similar
       cycle is with the looping modifier "//g":
1080
1081           @matches = ( 'foo' =~ m{ o? }xg );
1082
1083       or
1084
1085           print "match: <$&>\n" while 'foo' =~ m{ o? }xg;
1086
1087       or the loop implied by split().
1088
       However, long experience has shown that many programming tasks may be
       significantly simplified by using repeated subexpressions that may
       match zero-length substrings.  Here's a simple example:
1092
1093           @chars = split //, $string;           # // is not magic in split
1094           ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
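
       Run on a concrete value, these two idioms behave as follows.  This is
       a sketch; the sample string "abc" is arbitrary:

```perl
use strict;
use warnings;

my $string = "abc";

# An empty pattern splits between every pair of characters.
my @chars = split //, $string;

# The empty group matches at every position, including both ends.
(my $whitewashed = $string) =~ s/()/ /g;

print "chars: @chars\n";
print "whitewashed: <$whitewashed>\n";
```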
1095
       Thus Perl allows such constructs, by forcefully breaking the infinite
       loop.  The rules for this are different for lower-level loops given by
       the greedy quantifiers "*+{}", and for higher-level ones like the "/g"
       modifier or split() operator.
1100
1101       The lower-level loops are interrupted (that is, the loop is broken)
1102       when Perl detects that a repeated expression matched a zero-length sub‐
1103       string.   Thus
1104
          m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
1106
1107       is made equivalent to
1108
          m{ (?: NON_ZERO_LENGTH )* (?: ZERO_LENGTH )? }x;
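
       A small sketch showing that such a loop does terminate.  The "regexp"
       warning category is switched off here only to keep the output quiet,
       since perl may warn that the group matches the null string many
       times; the variable names are illustrative:

```perl
use strict;
use warnings;
no warnings 'regexp';   # silence "matches null string many times"

# At position 0 of 'foo', "o?" can only match the empty string.
my $matched_foo = 'foo' =~ m{ ( o? )* }x ? 1 : 0;
my $length_foo  = $+[0] - $-[0];

# On 'oof' the loop consumes both "o"s and then stops.
my $matched_oof = 'oof' =~ m{ ( o? )* }x ? 1 : 0;
my $length_oof  = $+[0] - $-[0];

print "foo: matched=$matched_foo length=$length_foo\n";
print "oof: matched=$matched_oof length=$length_oof\n";
```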
1113
       The higher-level loops preserve an additional state between iterations:
       whether the last match was zero-length.  To break the loop, the
       following match after a zero-length match is prohibited to have a
       length of zero.  This prohibition interacts with backtracking (see
       "Backtracking"), and so the second best match is chosen if the best
       match is of zero length.
1120
1121       For example:
1122
1123           $_ = 'bar';
1124           s/\w??/<$&>/g;
1125
1126       results in "<><b><><a><><r><>".  At each position of the string the
1127       best match given by non-greedy "??" is the zero-length match, and the
1128       second best match is what is matched by "\w".  Thus zero-length matches
1129       alternate with one-character-long matches.
1130
1131       Similarly, for repeated "m/()/g" the second-best match is the match at
1132       the position one notch further in the string.
1133
1134       The additional state of being matched with zero-length is associated
1135       with the matched string, and is reset by each assignment to pos().
1136       Zero-length matches at the end of the previous match are ignored during
1137       "split".
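
       The higher-level rules can be observed with the "/g" examples from
       this section.  The following is a sketch; the variable names are
       illustrative:

```perl
use strict;
use warnings;

# List-context //g returns every match, including the empty ones.
my @matches = ( 'foo' =~ m{ o? }xg );
my $joined  = join '|', @matches;

# Zero-length and one-character matches alternate.
(my $marked = 'bar') =~ s/\w??/<$&>/g;

print "matches: ", scalar(@matches), " ($joined)\n";
print "marked: $marked\n";
```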
1138
1139       Combining pieces together
1140
       Each of the elementary pieces of regular expressions which were
       described before (such as "ab" or "\Z") could match at most one
       substring at the given position of the input string.  However, in a
       typical regular expression these elementary pieces are combined into
       more complicated patterns using combining operators "ST", "S|T", "S*"
       etc.  (in these examples "S" and "T" are regular subexpressions).
1147
       Such combinations can include alternatives, leading to a problem of
       choice: if we match a regular expression "a|ab" against "abc", will it
       match substring "a" or "ab"?  One way to describe which substring is
       actually matched is the concept of backtracking (see "Backtracking").
       However, this description is too low-level and makes you think in terms
       of a particular implementation.
1154
1155       Another description starts with notions of "better"/"worse".  All the
1156       substrings which may be matched by the given regular expression can be
1157       sorted from the "best" match to the "worst" match, and it is the "best"
1158       match which is chosen.  This substitutes the question of "what is cho‐
1159       sen?"  by the question of "which matches are better, and which are
1160       worse?".
1161
1162       Again, for elementary pieces there is no such question, since at most
1163       one match at a given position is possible.  This section describes the
1164       notion of better/worse for combining operators.  In the description
1165       below "S" and "T" are regular subexpressions.
1166
1167       "ST"
           Consider two possible matches, "AB" and "A'B'", where "A" and "A'"
           are substrings which can be matched by "S", and "B" and "B'" are
           substrings which can be matched by "T".

           If "A" is a better match for "S" than "A'", "AB" is a better match
           than "A'B'".

           If "A" and "A'" coincide: "AB" is a better match than "AB'" if "B"
           is a better match for "T" than "B'".
1177
       "S|T"
           When "S" can match, it is a better match than when only "T" can
           match.

           Ordering of two matches for "S" is the same as for "S".  Similarly
           for two matches for "T".
1184
1185       "S{REPEAT_COUNT}"
1186           Matches as "SSS...S" (repeated as many times as necessary).
1187
       "S{min,max}"
           Matches as "S{max}|S{max-1}|...|S{min+1}|S{min}".

       "S{min,max}?"
           Matches as "S{min}|S{min+1}|...|S{max-1}|S{max}".
1193
1194       "S?", "S*", "S+"
1195           Same as "S{0,1}", "S{0,BIG_NUMBER}", "S{1,BIG_NUMBER}" respec‐
1196           tively.
1197
1198       "S??", "S*?", "S+?"
1199           Same as "S{0,1}?", "S{0,BIG_NUMBER}?", "S{1,BIG_NUMBER}?" respec‐
1200           tively.
1201
1202       "(?>S)"
1203           Matches the best match for "S" and only that.
1204
1205       "(?=S)", "(?<=S)"
1206           Only the best match for "S" is considered.  (This is important only
1207           if "S" has capturing parentheses, and backreferences are used some‐
1208           where else in the whole regular expression.)
1209
1210       "(?!S)", "(?<!S)"
1211           For this grouping operator there is no need to describe the order‐
1212           ing, since only whether or not "S" can match is important.
1213
1214       "(??{ EXPR })"
1215           The ordering is the same as for the regular expression which is the
1216           result of EXPR.
1217
       "(?(condition)yes-pattern|no-pattern)"
1219           Recall that which of "yes-pattern" or "no-pattern" actually matches
1220           is already determined.  The ordering of the matches is the same as
1221           for the chosen subexpression.
1222
1223       The above recipes describe the ordering of matches at a given position.
1224       One more rule is needed to understand how a match is determined for the
1225       whole regular expression: a match at an earlier position is always bet‐
1226       ter than a match at a later position.
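
       A few of these ordering rules can be observed directly.  This is a
       sketch with arbitrary sample strings and illustrative variable names:

```perl
use strict;
use warnings;

# S|T: a match through the first alternative is better.
my ($alt)     = "abc" =~ /(a|ab)/;
my ($alt_rev) = "abc" =~ /(ab|a)/;

# S{min,max} prefers more repetitions, S{min,max}? prefers fewer.
my ($greedy) = "aaa" =~ /(a{1,3})/;
my ($lazy)   = "aaa" =~ /(a{1,3}?)/;

# A match at an earlier position beats a better-ranked alternative later.
my ($earliest) = "foobar" =~ /(bar|o)/;

print "a|ab: $alt, ab|a: $alt_rev\n";
print "greedy: $greedy, lazy: $lazy\n";
print "earliest: $earliest\n";
```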
1227
1228       Creating custom RE engines
1229
1230       Overloaded constants (see overload) provide a simple way to extend the
1231       functionality of the RE engine.
1232
       Suppose that we want to enable a new RE escape-sequence "\Y|" which
       matches at a boundary between whitespace characters and non-whitespace
       characters.  Note that "(?=\S)(?<!\S)|(?!\S)(?<=\S)" matches exactly at
       these positions, so we want to have each "\Y|" in the place of the more
       complicated version.  We can create a module "customre" to do this:
1238
           package customre;
           use overload;

           sub import {
             shift;
             die "No argument to customre::import allowed" if @_;
             overload::constant 'qr' => \&convert;
           }

           sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}

           # We must also take care of not escaping the legitimate \\Y|
           # sequence, hence the presence of '\\' in the conversion rules.
           my %rules = ( '\\' => '\\\\',
                         'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
           sub convert {
             my $re = shift;
             $re =~ s{
                       \\ ( \\ | Y . )
                     }
                     { $rules{$1} or invalid($re,$1) }sgex;
             return $re;
           }
1262
       Now "use customre" enables the new escape in constant regular
       expressions, i.e., those without any runtime variable interpolations.
       As documented in overload, this conversion will work only over literal
       parts of regular expressions.  For "\Y|$re\Y|" the variable part of
       this regular expression needs to be converted explicitly (but only if
       the special meaning of "\Y|" should be enabled inside $re):
1269
           use customre;
           $re = <>;
           chomp $re;
           $re = customre::convert $re;
           /\Y|$re\Y|/;
1275

BUGS

1277       This document varies from difficult to understand to completely and
1278       utterly opaque.  The wandering prose riddled with jargon is hard to
1279       fathom in several places.
1280
1281       This document needs a rewrite that separates the tutorial content from
1282       the reference content.
1283
