PERLRE(1)              Perl Programmers Reference Guide              PERLRE(1)
6 perlre - Perl regular expressions
7
9 This page describes the syntax of regular expressions in Perl.
10
11 If you haven't used regular expressions before, a quick-start
12 introduction is available in perlrequick, and a longer tutorial
13 introduction is available in perlretut.
14
15 For reference on how regular expressions are used in matching
16 operations, plus various examples of the same, see discussions of
17 "m//", "s///", "qr//" and "??" in "Regexp Quote-Like Operators" in
18 perlop.
19
20 Modifiers
21 Matching operations can have various modifiers. Modifiers that relate
22 to the interpretation of the regular expression inside are listed
23 below. Modifiers that alter the way a regular expression is used by
24 Perl are detailed in "Regexp Quote-Like Operators" in perlop and "Gory
25 details of parsing quoted constructs" in perlop.
26
27 m Treat string as multiple lines. That is, change "^" and "$" from
28 matching the start or end of the string to matching the start or
29 end of any line anywhere within the string.
30
31 s Treat string as single line. That is, change "." to match any
32 character whatsoever, even a newline, which normally it would not
33 match.
34
35 Used together, as "/ms", they let the "." match any character
36 whatsoever, while still allowing "^" and "$" to match,
37 respectively, just after and just before newlines within the
38 string.
39
40 i Do case-insensitive pattern matching.
41
42 If "use locale" is in effect, the case map is taken from the
43 current locale. See perllocale.
44
45 x Extend your pattern's legibility by permitting whitespace and
46 comments.
47
48 p Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and
49 ${^POSTMATCH} are available for use after matching.
50
51 g and c
52 Global matching, and keep the Current position after failed
53 matching. Unlike i, m, s and x, these two flags affect the way the
54 regex is used rather than the regex itself. See "Using regular
55 expressions in Perl" in perlretut for further explanation of the g
56 and c modifiers.
57
58 These are usually written as "the "/x" modifier", even though the
59 delimiter in question might not really be a slash. Any of these
60 modifiers may also be embedded within the regular expression itself
61 using the "(?...)" construct. See below.
62
The "/x" modifier itself needs a little more explanation. It tells the
regular expression parser to ignore most whitespace that is neither
backslashed nor within a character class. You can use this to break up
your regular expression into (slightly) more readable parts. The "#"
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code. This also means that if you want real
whitespace or "#" characters in the pattern (outside a character class,
where they are unaffected by "/x"), then you'll either have to escape
them (using backslashes or "\Q...\E") or encode them using octal, hex,
or "\N{}" escapes. Taken together, these features go a long way
towards making Perl's regular expressions more readable. Note that you
have to be careful not to include the pattern delimiter in the
comment--perl has no way of knowing you did not intend to close the
pattern early. See the C-comment deletion code in perlop.

Also note that anything inside a "\Q...\E" stays unaffected by "/x".
And note that "/x" doesn't affect space interpretation within a single
multi-character construct. For example, in "\x{...}", regardless of the
"/x" modifier, there can be no spaces. The same holds for a quantifier
such as "{3}" or "{5,}". Similarly, "(?:...)" can't have a space between
the "?" and ":", but can between the "(" and "?". Within any delimiters
for such a construct, allowed spaces are not affected by "/x", and
depend on the construct. For example, "\x{...}" can't have spaces
because hexadecimal numbers don't have spaces in them. But Unicode
properties can have spaces, so in "\p{...}" there can be spaces that
follow the Unicode rules, for which see "Properties accessible through
\p{} and \P{}" in perluniprops.
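
The "/x" rules above can be sketched as follows. This is only an
illustration: the sample input and the variable names are assumptions
made for the example, not part of the reference text.

```perl
use strict;
use warnings;

# Hypothetical sample input: a key, a literal " # " separator, and a value.
my $line = "color # dark red";

my ($key, $value) = $line =~ /
    \A (\w+)         # the key; this whitespace and comment are ignored
    \  \# \          # backslashed space, literal hash sign, backslashed space
    (\w+ [ ] \w+)    # a space inside a character class still counts
    \z
/x;

print "$key => $value\n";   # color => dark red
```

Note that none of the comments contains the pattern delimiter, for the
reason given above.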
89
90 Regular Expressions
91 Metacharacters
92
93 The patterns used in Perl pattern matching evolved from those supplied
94 in the Version 8 regex routines. (The routines are derived (distantly)
95 from Henry Spencer's freely redistributable reimplementation of the V8
96 routines.) See "Version 8 Regular Expressions" for details.
97
98 In particular the following metacharacters have their standard
99 egrep-ish meanings:
100
    \        Quote the next metacharacter
    ^        Match the beginning of the line
    .        Match any character (except newline)
    $        Match the end of the line (or before newline at the end)
    |        Alternation
    ()       Grouping
    []       Bracketed Character class
108
By default, the "^" character is guaranteed to match only the beginning
of the string, the "$" character only the end (or before the newline at
the end), and Perl does certain optimizations with the assumption that
the string contains only one line. Embedded newlines will not be
matched by "^" or "$". You may, however, wish to treat a string as a
multi-line buffer, such that the "^" will match after any newline
within the string (except if the newline is the last character in the
string), and "$" will match before any newline. At the cost of a
little more overhead, you can do this by using the /m modifier on the
pattern match operator. (Older programs did this by setting $*, but
support for that variable was removed in perl 5.9.)
120
121 To simplify multi-line substitutions, the "." character never matches a
122 newline unless you use the "/s" modifier, which in effect tells Perl to
123 pretend the string is a single line--even if it isn't.
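
As a small, hedged illustration of the two paragraphs above (the sample
text is made up for the example), the following compares "^" with and
without "/m", and "." with and without "/s":

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, "^" matches only at the start of the string.
my @plain = $text =~ /^(\w+)/g;       # ("first")

# With /m, "^" also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;      # ("first", "second")

# Without /s, "." stops at the newline; with /s it crosses it.
my ($no_s)   = $text =~ /(first.*)/;    # "first line"
my ($with_s) = $text =~ /(first.*)/s;   # the whole string

print scalar(@plain), " ", scalar(@multi), "\n";   # 1 2
print length($no_s), " ", length($with_s), "\n";   # 10 23
```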
124
125 Quantifiers
126
127 The following standard quantifiers are recognized:
128
    *        Match 0 or more times
    +        Match 1 or more times
    ?        Match 1 or 0 times
    {n}      Match exactly n times
    {n,}     Match at least n times
    {n,m}    Match at least n but not more than m times
135
136 (If a curly bracket occurs in any other context, it is treated as a
137 regular character. In particular, the lower bound is not optional.)
138 The "*" quantifier is equivalent to "{0,}", the "+" quantifier to
139 "{1,}", and the "?" quantifier to "{0,1}". n and m are limited to non-
140 negative integral values less than a preset limit defined when perl is
141 built. This is usually 32766 on the most common platforms. The actual
142 limit can be seen in the error message generated by code such as this:
143
144 $_ **= $_ , / {$_} / for 2 .. 42;
145
146 By default, a quantified subpattern is "greedy", that is, it will match
147 as many times as possible (given a particular starting location) while
148 still allowing the rest of the pattern to match. If you want it to
149 match the minimum number of times possible, follow the quantifier with
150 a "?". Note that the meanings don't change, just the "greediness":
151
    *?       Match 0 or more times, not greedily
    +?       Match 1 or more times, not greedily
    ??       Match 0 or 1 time, not greedily
    {n}?     Match exactly n times, not greedily
    {n,}?    Match at least n times, not greedily
    {n,m}?   Match at least n but not more than m times, not greedily
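
A minimal sketch of the difference between the greedy and non-greedy
forms, using an invented sample string:

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

my ($greedy) = $html =~ /<(.+)>/;    # runs on to the last ">"
my ($lazy)   = $html =~ /<(.+?)>/;   # stops at the first ">"

print "greedy: $greedy\n";   # greedy: b>bold</b> and <i>italic</i
print "lazy:   $lazy\n";     # lazy:   b
```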
158
159 By default, when a quantified subpattern does not allow the rest of the
160 overall pattern to match, Perl will backtrack. However, this behaviour
161 is sometimes undesirable. Thus Perl provides the "possessive"
162 quantifier form as well.
163
    *+       Match 0 or more times and give nothing back
    ++       Match 1 or more times and give nothing back
    ?+       Match 0 or 1 time and give nothing back
    {n}+     Match exactly n times and give nothing back (redundant)
    {n,}+    Match at least n times and give nothing back
    {n,m}+   Match at least n but not more than m times and give nothing back
170
171 For instance,
172
173 'aaaa' =~ /a++a/
174
175 will never match, as the "a++" will gobble up all the "a"'s in the
176 string and won't leave any for the remaining part of the pattern. This
177 feature can be extremely useful to give perl hints about where it
178 shouldn't backtrack. For instance, the typical "match a double-quoted
179 string" problem can be most efficiently performed when written as:
180
181 /"(?:[^"\\]++|\\.)*+"/
182
183 as we know that if the final quote does not match, backtracking will
184 not help. See the independent subexpression "(?>...)" for more details;
185 possessive quantifiers are just syntactic sugar for that construct. For
186 instance the above example could also be written as follows:
187
188 /"(?>(?:(?>[^"\\]+)|\\.)*)"/
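
The behaviour described above can be checked with a short script. The
input string and the added capture group are assumptions for the sake of
the example; possessive quantifiers need perl 5.10.0 or later.

```perl
use strict;
use warnings;

# "a++" takes all four letters and refuses to give one back.
my $possessive = ('aaaa' =~ /a++a/) ? 1 : 0;   # 0
# The ordinary greedy form backtracks one step and succeeds.
my $greedy     = ('aaaa' =~ /a+a/)  ? 1 : 0;   # 1

# The double-quoted-string pattern from the text, with a capture added.
my $input = 'say "hello \"world\"" now';
my ($quoted) = $input =~ /("(?:[^"\\]++|\\.)*+")/;

print "$possessive $greedy $quoted\n";
```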
189
190 Escape sequences
191
192 Because patterns are processed as double quoted strings, the following
193 also work:
194
    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \033        octal char            (example: ESC)
    \x1B        hex char              (example: ESC)
    \x{263a}    long hex char         (example: Unicode SMILEY)
    \cK         control char          (example: VT)
    \N{name}    named Unicode character
    \N{U+263D}  Unicode character     (example: FIRST QUARTER MOON)
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase till \E (think vi)
    \U          uppercase till \E (think vi)
    \Q          quote (disable) pattern metacharacters till \E
    \E          end either case modification or quoted section (think vi)
213
214 Details are in "Quote and Quote-like Operators" in perlop.
215
216 Character Classes and other Special Escapes
217
218 In addition, Perl defines the following:
219
    Sequence   Note  Description
    [...]      [1]   Match a character according to the rules of the
                     bracketed character class defined by the "...".
                     Example: [a-z] matches "a" or "b" or "c" ... or "z"
    [[:...:]]  [2]   Match a character according to the rules of the POSIX
                     character class "..." within the outer bracketed
                     character class. Example: [[:upper:]] matches any
                     uppercase character.
    \w         [3]   Match a "word" character (alphanumeric plus "_")
    \W         [3]   Match a non-"word" character
    \s         [3]   Match a whitespace character
    \S         [3]   Match a non-whitespace character
    \d         [3]   Match a decimal digit character
    \D         [3]   Match a non-digit character
    \pP        [3]   Match P, named property. Use \p{Prop} for longer names.
    \PP        [3]   Match non-P
    \X         [4]   Match Unicode "eXtended grapheme cluster"
    \C               Match a single C-language char (octet) even if that is
                     part of a larger UTF-8 character. Thus it breaks up
                     characters into their UTF-8 bytes, so you may end up
                     with malformed pieces of UTF-8. Unsupported in
                     lookbehind.
    \1         [5]   Backreference to a specific capture buffer or group.
                     '1' may actually be any positive integer.
    \g1        [5]   Backreference to a specific or previous group.
    \g{-1}     [5]   The number may be negative, indicating a relative
                     previous buffer, and may optionally be wrapped in
                     curly brackets for safer parsing.
    \g{name}   [5]   Named backreference
    \k<name>   [5]   Named backreference
    \K         [6]   Keep the stuff left of the \K, don't include it in $&
    \N         [7]   Any character but \n (experimental). Not affected by
                     /s modifier
    \v         [3]   Vertical whitespace
    \V         [3]   Not vertical whitespace
    \h         [3]   Horizontal whitespace
    \H         [3]   Not horizontal whitespace
    \R         [4]   Linebreak
257
258 [1] See "Bracketed Character Classes" in perlrecharclass for details.
259
260 [2] See "POSIX Character Classes" in perlrecharclass for details.
261
262 [3] See "Backslash sequences" in perlrecharclass for details.
263
264 [4] See "Misc" in perlrebackslash for details.
265
266 [5] See "Capture buffers" below for details.
267
268 [6] See "Extended Patterns" below for details.
269
270 [7] Note that "\N" has two meanings. When of the form "\N{NAME}", it
271 matches the character whose name is "NAME"; and similarly when of
272 the form "\N{U+wide hex char}", it matches the character whose
273 Unicode ordinal is wide hex char. Otherwise it matches any
274 character but "\n".
275
276 Assertions
277
278 Perl defines the following zero-width assertions:
279
    \b  Match a word boundary
    \B  Match except at a word boundary
    \A  Match only at beginning of string
    \Z  Match only at end of string, or before newline at the end
    \z  Match only at end of string
    \G  Match only at pos() (e.g. at the end-of-match position
        of prior m//g)
287
288 A word boundary ("\b") is a spot between two characters that has a "\w"
289 on one side of it and a "\W" on the other side of it (in either order),
290 counting the imaginary characters off the beginning and end of the
291 string as matching a "\W". (Within character classes "\b" represents
292 backspace rather than a word boundary, just as it normally does in any
293 double-quoted string.) The "\A" and "\Z" are just like "^" and "$",
294 except that they won't match multiple times when the "/m" modifier is
295 used, while "^" and "$" will match at every internal line boundary. To
296 match the actual end of the string and not ignore an optional trailing
297 newline, use "\z".
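
A brief sketch of "\b", "\Z" and "\z" on an invented string that ends
in a newline:

```perl
use strict;
use warnings;

my $s = "cat concat\n";

# Count whole-word matches: "concat" has no boundary before its "cat".
my $words = () = $s =~ /\bcat\b/g;             # 1

my $upper_z = ($s =~ /concat\Z/) ? 1 : 0;      # 1: \Z tolerates the final newline
my $lower_z = ($s =~ /concat\z/) ? 1 : 0;      # 0: \z demands the true end

print "$words $upper_z $lower_z\n";            # 1 1 0
```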
298
The "\G" assertion can be used to chain global matches (using "m//g"),
as described in "Regexp Quote-Like Operators" in perlop. It is also
useful when writing "lex"-like scanners, when you have several patterns
that you want to match against consecutive substrings of your string;
see the previous reference. The actual location where "\G" will match
can also be influenced by using "pos()" as an lvalue: see "pos" in
perlfunc. Note that the rule for zero-length matches is modified
somewhat, in that contents to the left of "\G" are not counted when
determining the length of the match. Thus the following will not match
forever:
309
    $str = 'ABC';
    pos($str) = 1;
    while ($str =~ /.\G/g) {    # bind to $str, whose pos() was set above
        print $&;
    }
315
316 It will print 'A' and then terminate, as it considers the match to be
317 zero-width, and thus will not match at the same position twice in a
318 row.
319
320 It is worth noting that "\G" improperly used can result in an infinite
321 loop. Take care when using patterns that include "\G" in an
322 alternation.
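
The "lex"-like use of "\G" mentioned above might look like the following
toy tokenizer. The token names and the input are purely illustrative;
the "/c" modifier keeps pos() in place when one alternative fails so
that the next one can be tried at the same position.

```perl
use strict;
use warnings;

# Hypothetical toy tokenizer; the token names are illustrative only.
my $src = "x = 42 + y";
my @tokens;

for ($src) {                     # alias $_ to $src for the matches below
    while (1) {
        if    (/\G\s+/gc)    { next }                       # skip blanks
        elsif (/\G(\d+)/gc)  { push @tokens, "NUM($1)" }
        elsif (/\G(\w+)/gc)  { push @tokens, "ID($1)" }
        elsif (/\G([=+])/gc) { push @tokens, "OP($1)" }
        else                 { last }                       # nothing matched
    }
}

print "@tokens\n";   # ID(x) OP(=) NUM(42) OP(+) ID(y)
```

Every branch anchors "\G" at the start of its pattern, which avoids the
alternation pitfall noted above.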
323
324 Capture buffers
325
326 The bracketing construct "( ... )" creates capture buffers. To refer to
327 the current contents of a buffer later on, within the same pattern, use
328 \1 for the first, \2 for the second, and so on. Outside the match use
329 "$" instead of "\". (The \<digit> notation works in certain
330 circumstances outside the match. See "Warning on \1 Instead of $1"
331 below for details.) Referring back to another part of the match is
332 called a backreference.
333
334 There is no limit to the number of captured substrings that you may
335 use. However Perl also uses \10, \11, etc. as aliases for \010, \011,
336 etc. (Recall that 0 means octal, so \011 is the character at number 9
337 in your coded character set; which would be the 10th character, a
338 horizontal tab under ASCII.) Perl resolves this ambiguity by
339 interpreting \10 as a backreference only if at least 10 left
340 parentheses have opened before it. Likewise \11 is a backreference
341 only if at least 11 left parentheses have opened before it. And so on.
342 \1 through \9 are always interpreted as backreferences. If the
343 bracketing group did not match, the associated backreference won't
344 match either. (This can happen if the bracketing group is optional, or
345 in a different branch of an alternation.)
346
347 In order to provide a safer and easier way to construct patterns using
348 backreferences, Perl provides the "\g{N}" notation (starting with perl
349 5.10.0). The curly brackets are optional, however omitting them is less
350 safe as the meaning of the pattern can be changed by text (such as
351 digits) following it. When N is a positive integer the "\g{N}" notation
352 is exactly equivalent to using normal backreferences. When N is a
353 negative integer then it is a relative backreference referring to the
354 previous N'th capturing group. When the bracket form is used and N is
355 not an integer, it is treated as a reference to a named buffer.
356
357 Thus "\g{-1}" refers to the last buffer, "\g{-2}" refers to the buffer
358 before that. For example:
359
    /
     (Y)            # buffer 1
     (              # buffer 2
        (X)         # buffer 3
        \g{-1}      # backref to buffer 3
        \g{-3}      # backref to buffer 1
     )
    /x
368
369 and would match the same as "/(Y) ( (X) \3 \1 )/x".
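
One situation where relative backreferences help is a pattern fragment
that is interpolated into larger patterns. The sketch below assumes a
hypothetical fragment $doubled and sample text; it is an illustration,
not part of the reference.

```perl
use strict;
use warnings;

# A fragment matching a doubled word. The relative backreference keeps
# working no matter how many groups precede the fragment.
my $doubled = qr/(\w+) \s+ \g{-1}/x;

my $text = "error: the the file is missing";

# Here the fragment's group becomes group 2, yet \g{-1} still refers to it.
my ($label, $word) = $text =~ /^(\w+): .*? \b $doubled/x;

print "label=$label repeated=$word\n";   # label=error repeated=the
```

Had the fragment used "\1", it would have referred to the "error" group
once interpolated.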
370
371 Additionally, as of Perl 5.10.0 you may use named capture buffers and
372 named backreferences. The notation is "(?<name>...)" to declare and
373 "\k<name>" to reference. You may also use apostrophes instead of angle
374 brackets to delimit the name; and you may use the bracketed "\g{name}"
375 backreference syntax. It's possible to refer to a named capture buffer
376 by absolute and relative number as well. Outside the pattern, a named
377 capture buffer is available via the "%+" hash. When different buffers
378 within the same pattern have the same name, $+{name} and "\k<name>"
379 refer to the leftmost defined group. (Thus it's possible to do things
380 with named capture buffers that would otherwise require "(??{})" code
381 to accomplish.)
382
383 Examples:
384
    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

    /(.)\1/                         # find first doubled char
         and print "'$1' is the first doubled character\n";

    /(?<char>.)\k<char>/            # ... a different way
         and print "'$+{char}' is the first doubled character\n";

    /(?'char'.)\1/                  # ... mix and match
         and print "'$1' is the first doubled character\n";

    if (/Time: (..):(..):(..)/) {   # parse out values
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }
401
402 Several special variables also refer back to portions of the previous
403 match. $+ returns whatever the last bracket match matched. $& returns
404 the entire matched string. (At one point $0 did also, but now it
405 returns the name of the program.) "$`" returns everything before the
406 matched string. "$'" returns everything after the matched string. And
407 $^N contains whatever was matched by the most-recently closed group
408 (submatch). $^N can be used in extended patterns (see below), for
409 example to assign a submatch to a variable.
410
411 The numbered match variables ($1, $2, $3, etc.) and the related
412 punctuation set ($+, $&, "$`", "$'", and $^N) are all dynamically
413 scoped until the end of the enclosing block or until the next
414 successful match, whichever comes first. (See "Compound Statements" in
415 perlsyn.)
416
417 NOTE: Failed matches in Perl do not reset the match variables, which
418 makes it easier to write code that tests for a series of more specific
419 cases and remembers the best match.
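
A small sketch of this behaviour, with invented strings: the second
match fails, and $1 and $2 still hold the values from the earlier
success.

```perl
use strict;
use warnings;

my $best;
if ("version 5.10" =~ /(\d+)\.(\d+)/) {
    # This match fails, so $1 and $2 are left untouched.
    "no digits here" =~ /(\d+)/;
    $best = "$1.$2";
}

print "$best\n";   # 5.10
```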
420
421 WARNING: Once Perl sees that you need one of $&, "$`", or "$'" anywhere
422 in the program, it has to provide them for every pattern match. This
423 may substantially slow your program. Perl uses the same mechanism to
424 produce $1, $2, etc, so you also pay a price for each pattern that
425 contains capturing parentheses. (To avoid this cost while retaining
426 the grouping behaviour, use the extended regular expression "(?: ... )"
427 instead.) But if you never use $&, "$`" or "$'", then patterns without
428 capturing parentheses will not be penalized. So avoid $&, "$'", and
429 "$`" if you can, but if you can't (and some algorithms really
430 appreciate them), once you've used them once, use them at will, because
431 you've already paid the price. As of 5.005, $& is not so costly as the
432 other two.
433
434 As a workaround for this problem, Perl 5.10.0 introduces
435 "${^PREMATCH}", "${^MATCH}" and "${^POSTMATCH}", which are equivalent
436 to "$`", $& and "$'", except that they are only guaranteed to be
437 defined after a successful match that was executed with the "/p"
438 (preserve) modifier. The use of these variables incurs no global
439 performance penalty, unlike their punctuation char equivalents, however
440 at the trade-off that you have to tell perl when you want to use them.
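
A minimal sketch of the "/p" variables, using an invented string
(requires perl 5.10.0 or later):

```perl
use strict;
use warnings;

my $str = "key=value";
my ($pre, $match, $post);

if ($str =~ /=/p) {
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

print "[$pre][$match][$post]\n";   # [key][=][value]
```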
441
442 Quoting metacharacters
443 Backslashed metacharacters in Perl are alphanumeric, such as "\b",
444 "\w", "\n". Unlike some other regular expression languages, there are
445 no backslashed symbols that aren't alphanumeric. So anything that
446 looks like \\, \(, \), \<, \>, \{, or \} is always interpreted as a
447 literal character, not a metacharacter. This was once used in a common
448 idiom to disable or quote the special meanings of regular expression
449 metacharacters in a string that you want to use for a pattern. Simply
450 quote all non-"word" characters:
451
452 $pattern =~ s/(\W)/\\$1/g;
453
454 (If "use locale" is set, then this depends on the current locale.)
455 Today it is more common to use the quotemeta() function or the "\Q"
456 metaquoting escape sequence to disable all metacharacters' special
457 meanings like this:
458
459 /$unquoted\Q$quoted\E$unquoted/
460
461 Beware that if you put literal backslashes (those not inside
462 interpolated variables) between "\Q" and "\E", double-quotish backslash
463 interpolation may lead to confusing results. If you need to use
464 literal backslashes within "\Q...\E", consult "Gory details of parsing
465 quoted constructs" in perlop.
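
The effect of "\Q" and quotemeta() can be sketched as follows; the
search text and the haystack are made up for the example.

```perl
use strict;
use warnings;

my $needle = "a.b*c";                 # text containing metacharacters
my $hay    = "xx a.b*c yy aXbbc";

my @raw    = $hay =~ /($needle)/g;        # "." and "*" act as metacharacters
my @quoted = $hay =~ /(\Q$needle\E)/g;    # every character is literal
my $escaped = quotemeta($needle);         # a\.b\*c

print "raw: @raw\n";         # raw: aXbbc
print "quoted: @quoted\n";   # quoted: a.b*c
print "escaped: $escaped\n";
```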
466
467 Extended Patterns
468 Perl also defines a consistent extension syntax for features not found
469 in standard tools like awk and lex. The syntax is a pair of
470 parentheses with a question mark as the first thing within the
471 parentheses. The character after the question mark indicates the
472 extension.
473
474 The stability of these extensions varies widely. Some have been part
475 of the core language for many years. Others are experimental and may
476 change without warning or be completely removed. Check the
477 documentation on an individual feature to verify its current status.
478
479 A question mark was chosen for this and for the minimal-matching
480 construct because 1) question marks are rare in older regular
481 expressions, and 2) whenever you see one, you should stop and
482 "question" exactly what is going on. That's psychology...
483
484 "(?#text)"
485 A comment. The text is ignored. If the "/x" modifier
486 enables whitespace formatting, a simple "#" will suffice.
487 Note that Perl closes the comment as soon as it sees a ")",
488 so there is no way to put a literal ")" in the comment.
489
490 "(?pimsx-imsx)"
491 One or more embedded pattern-match modifiers, to be turned on
492 (or turned off, if preceded by "-") for the remainder of the
493 pattern or the remainder of the enclosing pattern group (if
494 any). This is particularly useful for dynamic patterns, such
495 as those read in from a configuration file, taken from an
496 argument, or specified in a table somewhere. Consider the
497 case where some patterns want to be case sensitive and some
498 do not: The case insensitive ones merely need to include
499 "(?i)" at the front of the pattern. For example:
500
    $pattern = "foobar";
    if ( /$pattern/i ) { }

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ ) { }
508
509 These modifiers are restored at the end of the enclosing
510 group. For example,
511
512 ( (?i) blah ) \s+ \1
513
514 will match "blah" in any case, some spaces, and an exact
515 (including the case!) repetition of the previous word,
516 assuming the "/x" modifier, and no "/i" modifier outside this
517 group.
518
519 These modifiers do not carry over into named subpatterns
520 called in the enclosing group. In other words, a pattern such
521 as "((?i)(&NAME))" does not change the case-sensitivity of
522 the "NAME" pattern.
523
524 Note that the "p" modifier is special in that it can only be
525 enabled, not disabled, and that its presence anywhere in a
526 pattern has a global effect. Thus "(?-p)" and "(?-p:...)" are
527 meaningless and will warn when executed under "use warnings".
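
The scoping of an embedded modifier, as in the "( (?i) blah ) \s+ \1"
example above, can be checked with a short script. The test strings
are assumptions chosen for the illustration.

```perl
use strict;
use warnings;

# (?i) applies only inside the group; the backreference stays case-sensitive.
my $re = qr/((?i)blah)\s+\1/;

my $same  = ("BLAH BLAH" =~ $re) ? 1 : 0;   # 1: exact repetition
my $mixed = ("BLAH blah" =~ $re) ? 1 : 0;   # 0: case differs in the repeat

print "$same $mixed\n";   # 1 0
```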
528
529 "(?:pattern)"
530 "(?imsx-imsx:pattern)"
531 This is for clustering, not capturing; it groups
532 subexpressions like "()", but doesn't make backreferences as
533 "()" does. So
534
535 @fields = split(/\b(?:a|b|c)\b/)
536
537 is like
538
539 @fields = split(/\b(a|b|c)\b/)
540
541 but doesn't spit out extra fields. It's also cheaper not to
542 capture characters if you don't need to.
543
Any letters between "?" and ":" act as flag modifiers, as
with "(?imsx-imsx)". For example,
546
547 /(?s-i:more.*than).*million/i
548
549 is equivalent to the more verbose
550
551 /(?:(?s-i)more.*than).*million/i
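
The split() difference described above, sketched with an invented input
string:

```perl
use strict;
use warnings;

my $s = "1 a 2 b 3";

# Clustering only: the separators are discarded.
my @plain   = split /\s*(?:a|b)\s*/, $s;   # ("1", "2", "3")

# Capturing: each separator is returned as an extra field.
my @capture = split /\s*(a|b)\s*/, $s;     # ("1", "a", "2", "b", "3")

print scalar(@plain), " ", scalar(@capture), "\n";   # 3 5
```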
552
553 "(?|pattern)"
554 This is the "branch reset" pattern, which has the special
555 property that the capture buffers are numbered from the same
556 starting point in each alternation branch. It is available
557 starting from perl 5.10.0.
558
559 Capture buffers are numbered from left to right, but inside
560 this construct the numbering is restarted for each branch.
561
562 The numbering within each branch will be as normal, and any
563 buffers following this construct will be numbered as though
564 the construct contained only one branch, that being the one
565 with the most capture buffers in it.
566
567 This construct will be useful when you want to capture one of
568 a number of alternative matches.
569
570 Consider the following pattern. The numbers underneath show
571 in which buffer the captured content will be stored.
572
    # before  ---------------branch-reset----------- after
    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
    # 1            2         2  3        2     3     4
576
577 Be careful when using the branch reset pattern in combination
578 with named captures. Named captures are implemented as being
579 aliases to numbered buffers holding the captures, and that
580 interferes with the implementation of the branch reset
581 pattern. If you are using named captures in a branch reset
582 pattern, it's best to use the same names, in the same order,
583 in each of the alternations:
584
585 /(?| (?<a> x ) (?<b> y )
586 | (?<a> z ) (?<b> w )) /x
587
588 Not doing so may lead to surprises:
589
590 "12" =~ /(?| (?<a> \d+ ) | (?<b> \D+))/x;
591 say $+ {a}; # Prints '12'
592 say $+ {b}; # *Also* prints '12'.
593
594 The problem here is that both the buffer named "a" and the
595 buffer named "b" are aliases for the buffer belonging to $1.
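
A hedged sketch of a typical use of branch reset: whichever alternative
matches, the interesting part ends up in $1. The two date formats are
hypothetical and chosen only for the illustration.

```perl
use strict;
use warnings;

# Hypothetical date formats; either branch leaves the year in $1.
my $re = qr/(?| (\d{4})-\d\d-\d\d | \d\d\.\d\d\.(\d{4}) )/x;

my @years;
for my $date ("2009-12-31", "31.12.2009") {
    push @years, $1 if $date =~ $re;
}

print "@years\n";   # 2009 2009
```

Without "(?|...)" the second alternative would store the year in $2.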
596
597 Look-Around Assertions
Look-around assertions are zero-width patterns which match a
specific pattern without including it in $&. Positive
assertions match when their subpattern matches; negative
assertions match when their subpattern fails. Look-behind
matches text up to the current match position; look-ahead
matches text following the current match position.
604
605 "(?=pattern)"
606 A zero-width positive look-ahead assertion. For example,
607 "/\w+(?=\t)/" matches a word followed by a tab, without
608 including the tab in $&.
609
610 "(?!pattern)"
611 A zero-width negative look-ahead assertion. For example
612 "/foo(?!bar)/" matches any occurrence of "foo" that isn't
613 followed by "bar". Note however that look-ahead and
614 look-behind are NOT the same thing. You cannot use this
615 for look-behind.
616
617 If you are looking for a "bar" that isn't preceded by a
618 "foo", "/(?!foo)bar/" will not do what you want. That's
619 because the "(?!foo)" is just saying that the next thing
620 cannot be "foo"--and it's not, it's a "bar", so "foobar"
621 will match. You would have to do something like
622 "/(?!foo)...bar/" for that. We say "like" because
623 there's the case of your "bar" not having three
624 characters before it. You could cover that this way:
625 "/(?:(?!foo)...|^.{0,2})bar/". Sometimes it's still
626 easier just to say:
627
628 if (/bar/ && $` !~ /foo$/)
629
630 For look-behind see below.
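
The pitfall described above can be verified directly. The test strings
are assumptions chosen to exercise each case.

```perl
use strict;
use warnings;

# The naive pattern matches "foobar": at the "b", the next thing is not "foo".
my $naive = ("foobar" =~ /(?!foo)bar/) ? 1 : 0;             # 1

# The longer form from the text, which also handles short prefixes.
my $fixed = qr/(?:(?!foo)...|^.{0,2})bar/;

my %result = map { $_ => ($_ =~ $fixed ? 1 : 0) }
             ("foobar", "bazbar", "bar", "xbar");

print "$naive\n";                                            # 1
print join(" ", map { "$_=$result{$_}" } sort keys %result), "\n";
```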
631
632 "(?<=pattern)" "\K"
633 A zero-width positive look-behind assertion. For
634 example, "/(?<=\t)\w+/" matches a word that follows a
635 tab, without including the tab in $&. Works only for
636 fixed-width look-behind.
637
638 There is a special form of this construct, called "\K",
639 which causes the regex engine to "keep" everything it had
640 matched prior to the "\K" and not include it in $&. This
641 effectively provides variable length look-behind. The use
642 of "\K" inside of another look-around assertion is
643 allowed, but the behaviour is currently not well defined.
644
645 For various reasons "\K" may be significantly more
646 efficient than the equivalent "(?<=...)" construct, and
647 it is especially useful in situations where you want to
648 efficiently remove something following something else in
649 a string. For instance
650
651 s/(foo)bar/$1/g;
652
653 can be rewritten as the much more efficient
654
655 s/foo\Kbar//g;
656
657 "(?<!pattern)"
658 A zero-width negative look-behind assertion. For example
659 "/(?<!bar)foo/" matches any occurrence of "foo" that does
660 not follow "bar". Works only for fixed-width look-
661 behind.
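
A short sketch of the look-behind forms and "\K" described above, on
invented strings (requires perl 5.10.0 or later for "\K"):

```perl
use strict;
use warnings;

# Both substitutions remove "bar" only where it follows "foo".
my $with_capture = "foobar foobaz";
$with_capture =~ s/(foo)bar/$1/g;            # "foo foobaz"

my $with_keep = "foobar foobaz";
$with_keep =~ s/foo\Kbar//g;                 # "foo foobaz"

# Positive look-behind: words that follow a tab.
my @after_tab = "a\tbc d\tef" =~ /(?<=\t)(\w+)/g;        # ("bc", "ef")

# Negative look-behind: "foo" that does not follow "bar".
my @not_after_bar = "barfoo xfoo" =~ /(?<!bar)(foo)/g;   # one match

print "$with_capture | $with_keep | @after_tab | ", scalar(@not_after_bar), "\n";
```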
662
663 "(?'NAME'pattern)"
664 "(?<NAME>pattern)"
665 A named capture buffer. Identical in every respect to normal
666 capturing parentheses "()" but for the additional fact that
667 "%+" or "%-" may be used after a successful match to refer to
668 a named buffer. See "perlvar" for more details on the "%+"
669 and "%-" hashes.
670
671 If multiple distinct capture buffers have the same name then
672 the $+{NAME} will refer to the leftmost defined buffer in the
673 match.
674
675 The forms "(?'NAME'pattern)" and "(?<NAME>pattern)" are
676 equivalent.
677
678 NOTE: While the notation of this construct is the same as the
679 similar function in .NET regexes, the behavior is not. In
680 Perl the buffers are numbered sequentially regardless of
681 being named or not. Thus in the pattern
682
683 /(x)(?<foo>y)(z)/
684
685 $+{foo} will be the same as $2, and $3 will contain 'z'
686 instead of the opposite which is what a .NET regex hacker
687 might expect.
688
689 Currently NAME is restricted to simple identifiers only. In
690 other words, it must match "/^[_A-Za-z][_A-Za-z0-9]*\z/" or
691 its Unicode extension (see utf8), though it isn't extended by
692 the locale (see perllocale).
693
694 NOTE: In order to make things easier for programmers with
695 experience with the Python or PCRE regex engines, the pattern
696 "(?P<NAME>pattern)" may be used instead of
697 "(?<NAME>pattern)"; however this form does not support the
698 use of single quotes as a delimiter for the name.
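
The numbering rule above, and the "(?P<NAME>pattern)" alternative, can
be sketched as follows (the variable names are illustrative):

```perl
use strict;
use warnings;

my ($named, $second, $third);
if ("xyz" =~ /(x)(?<foo>y)(z)/) {
    # The named buffer is also buffer 2; the next group is buffer 3.
    ($named, $second, $third) = ($+{foo}, $2, $3);
}

# The Python/PCRE-style spelling declares the same kind of buffer.
my $python_style = ("xyz" =~ /(?P<foo>y)/) ? $+{foo} : undef;

print "$named $second $third $python_style\n";   # y y z y
```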
699
700 "\k<NAME>"
701 "\k'NAME'"
702 Named backreference. Similar to numeric backreferences,
703 except that the group is designated by name and not number.
704 If multiple groups have the same name then it refers to the
705 leftmost defined group in the current match.
706
707 It is an error to refer to a name not defined by a
708 "(?<NAME>)" earlier in the pattern.
709
710 Both forms are equivalent.
711
712 NOTE: In order to make things easier for programmers with
713 experience with the Python or PCRE regex engines, the pattern
714 "(?P=NAME)" may be used instead of "\k<NAME>".
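
The three spellings of a named backreference can be compared side by
side. This sketch uses an invented word and finds its first doubled
character with each form.

```perl
use strict;
use warnings;

my @forms = (
    qr/(?<ch>.)\k<ch>/,      # angle brackets
    qr/(?'ch'.)\k'ch'/,      # apostrophes
    qr/(?P<ch>.)(?P=ch)/,    # Python/PCRE-style
);

my @found = map { "bookkeeper" =~ $_ ? $+{ch} : '(none)' } @forms;

print "@found\n";   # o o o
```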
715
716 "(?{ code })"
717 WARNING: This extended regular expression feature is
718 considered experimental, and may be changed without notice.
719 Code executed that has side effects may not perform
720 identically from version to version due to the effect of
721 future optimisations in the regex engine.
722
723 This zero-width assertion evaluates any embedded Perl code.
724 It always succeeds, and its "code" is not interpolated.
725 Currently, the rules to determine where the "code" ends are
726 somewhat convoluted.
727
728 This feature can be used together with the special variable
729 $^N to capture the results of submatches in variables without
730 having to keep track of the number of nested parentheses. For
731 example:
732
733 $_ = "The brown fox jumps over the lazy dog";
734 /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
735 print "color = $color, animal = $animal\n";
736
737 Inside the "(?{...})" block, $_ refers to the string the
738 regular expression is matching against. You can also use
739 "pos()" to find the current match position within this
740 string.
741
742 The "code" is properly scoped in the following sense: If the
743 assertion is backtracked (compare "Backtracking"), all
744 changes introduced after "local"ization are undone, so that
745
746 $_ = 'a' x 8;
747 m<
748 (?{ $cnt = 0 }) # Initialize $cnt.
749 (
750 a
751 (?{
752 local $cnt = $cnt + 1; # Update $cnt, backtracking-safe.
753 })
754 )*
755 aaaa
756 (?{ $res = $cnt }) # On success copy to non-localized
757 # location.
758 >x;
759
760 will set "$res = 4". Note that after the match, $cnt returns
761 to the globally introduced value, because the scopes that
762 restrict "local" operators are unwound.
763
764 This assertion may be used as a
765 "(?(condition)yes-pattern|no-pattern)" switch. If not used
766 in this way, the result of evaluation of "code" is put into
767 the special variable $^R. This happens immediately, so $^R
768 can be used from other "(?{ code })" assertions inside the
769 same regular expression.
770
771 The assignment to $^R above is properly localized, so the old
772 value of $^R is restored if the assertion is backtracked;
773 compare "Backtracking".
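
A small sketch of reading $^R once a match has succeeded, assuming made-up unit suffixes: each alternative ends in a code block, and the value of the block in the branch that matched is what remains in $^R.

```perl
use strict;
use warnings;

my %multiplier;
for my $suffix (qw(kb mb)) {
    # The value of the last code block executed as part of the
    # successful match is left in $^R.
    if ($suffix =~ /^(?: kb (?{ 1024 }) | mb (?{ 1024 * 1024 }) )$/x) {
        $multiplier{$suffix} = $^R;
    }
}

print "$_ => $multiplier{$_}\n" for sort keys %multiplier;
```

Because the code blocks here only return constants and have no side effects, the result should not depend on how the engine orders its attempts.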
774
775 For reasons of security, this construct is forbidden if the
776 regular expression involves run-time interpolation of
777 variables, unless the perilous "use re 'eval'" pragma has
778 been used (see re), or the variables contain results of
779 "qr//" operator (see "qr/STRING/msixpo" in perlop).
780
781 This restriction is due to the wide-spread and remarkably
782 convenient custom of using run-time determined strings as
783 patterns. For example:
784
785 $re = <>;
786 chomp $re;
787 $string =~ /$re/;
788
789 Before Perl knew how to execute interpolated code within a
790 pattern, this operation was completely safe from a security
791 point of view, although it could raise an exception from an
792 illegal pattern. If you turn on "use re 'eval'", though,
793 it is no longer secure, so you should only do so if you are
794 also using taint checking. Better yet, use the carefully
795 constrained evaluation within a Safe compartment. See
796 perlsec for details about both these mechanisms.
797
798 WARNING: Use of lexical ("my") variables in these blocks is
799 broken. The result is unpredictable and will make perl
800 unstable. The workaround is to use global ("our") variables.
801
802 WARNING: Because Perl's regex engine is currently not re-
803 entrant, interpolated code may not invoke the regex engine
804 either directly with "m//" or "s///", or indirectly with
805 functions such as "split". Invoking the regex engine in these
806 blocks will make perl unstable.
807
808 "(??{ code })"
809 WARNING: This extended regular expression feature is
810 considered experimental, and may be changed without notice.
811 Code executed that has side effects may not perform
812 identically from version to version due to the effect of
813 future optimisations in the regex engine.
814
815 This is a "postponed" regular subexpression. The "code" is
816 evaluated at run time, at the moment this subexpression may
817 match. The result of evaluation is considered as a regular
818 expression and matched as if it were inserted instead of this
819 construct. Note that this means that the contents of capture
820 buffers defined inside an eval'ed pattern are not available
821 outside of the pattern, and vice versa, there is no way for
822 the inner pattern to refer to a capture buffer defined
823 outside. Thus,
824
825 ('a' x 100)=~/(??{'(.)' x 100})/
826
827 will match, but it will not set $1.
828
829 The "code" is not interpolated. As before, the rules to
830 determine where the "code" ends are currently somewhat
831 convoluted.
832
833 The following pattern matches a parenthesized group:
834
835 $re = qr{
836 \(
837 (?:
838 (?> [^()]+ ) # Non-parens without backtracking
839 |
840 (??{ $re }) # Group with matching parens
841 )*
842 \)
843 }x;
844
845 See also "(?PARNO)" for a different, more efficient way to
846 accomplish the same task.
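
The pattern above can be exercised with a short driver. This is only a sketch: the anchors and the sample strings are additions for the test, and $re is declared with "our" so that the code block refers to a package variable rather than a lexical.

```perl
use strict;
use warnings;

our $re;    # package variable, visible from inside the code block
$re = qr{
    \(
    (?:
        (?> [^()]+ )    # Non-parens without backtracking
      |
        (??{ $re })     # Group with matching parens
    )*
    \)
}x;

my %balanced;
for my $text ('(a(b)c)', '((()))', '(a(b)c') {
    # Anchor the pattern so that the whole string must be one group.
    $balanced{$text} = ($text =~ /^$re$/) ? 1 : 0;
}

print "$_ => $balanced{$_}\n" for sort keys %balanced;
```

The first two strings are fully balanced and match. The last one is missing its closing parenthesis and does not.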
847
848 For reasons of security, this construct is forbidden if the
849 regular expression involves run-time interpolation of
850 variables, unless the perilous "use re 'eval'" pragma has
851 been used (see re), or the variables contain results of
852 "qr//" operator (see "qr/STRING/msixpo" in perlop).
853
854 Because perl's regex engine is not currently re-entrant,
855 delayed code may not invoke the regex engine either directly
856 with "m//" or "s///", or indirectly with functions such as
857 "split".
858
859 Recursing deeper than 50 times without consuming any input
860 string will result in a fatal error. The maximum depth is
861 compiled into perl, so changing it requires a custom build.
862
863 "(?PARNO)" "(?-PARNO)" "(?+PARNO)" "(?R)" "(?0)"
864 Similar to "(??{ code })" except it does not involve
865 compiling any code, instead it treats the contents of a
866 capture buffer as an independent pattern that must match at
867 the current position. Capture buffers contained by the
868 pattern will have the value as determined by the outermost
869 recursion.
870
871 PARNO is a sequence of digits (not starting with 0) whose
872 value reflects the paren-number of the capture buffer to
873 recurse to. "(?R)" recurses to the beginning of the whole
874 pattern. "(?0)" is an alternate syntax for "(?R)". If PARNO
875 is preceded by a plus or minus sign then it is assumed to be
876 relative, with negative numbers indicating preceding capture
877 buffers and positive ones following. Thus "(?-1)" refers to
878 the most recently declared buffer, and "(?+1)" indicates the
879 next buffer to be declared. Note that the counting for
880 relative recursion differs from that of relative
881 backreferences, in that with recursion unclosed buffers are
882 included.
883
884 The following pattern matches a function foo() which may
885 contain balanced parentheses as the argument.
886
887 $re = qr{ ( # paren group 1 (full function)
888 foo
889 ( # paren group 2 (parens)
890 \(
891 ( # paren group 3 (contents of parens)
892 (?:
893 (?> [^()]+ ) # Non-parens without backtracking
894 |
895 (?2) # Recurse to start of paren group 2
896 )*
897 )
898 \)
899 )
900 )
901 }x;
902
903 If the pattern was used as follows
904
905 'foo(bar(baz)+baz(bop))'=~/$re/
906 and print "\$1 = $1\n",
907 "\$2 = $2\n",
908 "\$3 = $3\n";
909
910 the output produced should be the following:
911
912 $1 = foo(bar(baz)+baz(bop))
913 $2 = (bar(baz)+baz(bop))
914 $3 = bar(baz)+baz(bop)
915
916 If there is no corresponding capture buffer defined, then it
917 is a fatal error. Recursing deeper than 50 times without
918 consuming any input string will also result in a fatal error.
919 The maximum depth is compiled into perl, so changing it
920 requires a custom build.
921
922 The following shows how using negative indexing can make it
923 easier to embed recursive patterns inside of a "qr//"
924 construct for later use:
925
926 my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
927 if (/foo $parens \s+ \+ \s+ bar $parens/x) {
928 # do something here...
929 }
930
931 Note that this pattern does not behave the same way as the
932 equivalent PCRE or Python construct of the same form. In Perl
933 you can backtrack into a recursed group, whereas in PCRE and
934 Python the recursed-into group is treated as atomic. Also, modifiers
935 are resolved at compile time, so constructs like (?i:(?1)) or
936 (?:(?i)(?1)) do not affect how the sub-pattern will be
937 processed.
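
A runnable sketch of the relative-recursion idiom, assuming a made-up input string. Because "(?-1)" is relative, each interpolated copy of $parens recurses into its own group, whichever number that group ends up with in the enclosing pattern.

```perl
use strict;
use warnings;

my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

my ($left, $right);
if ('foo(a(b)c) + bar(d)' =~ /foo $parens \s+ \+ \s+ bar $parens/x) {
    # The first copy of $parens is group 1, the second is group 2.
    ($left, $right) = ($1, $2);
}

print "left = $left, right = $right\n" if defined $left;
```

Had the recursion been written with an absolute number such as "(?1)", the second copy would have recursed into the first copy's group instead of its own.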
938
939 "(?&NAME)"
940 Recurse to a named subpattern. Identical to "(?PARNO)" except
941 that the parenthesis to recurse to is determined by name. If
942 multiple parentheses have the same name, then it recurses to
943 the leftmost.
944
945 It is an error to refer to a name that is not declared
946 somewhere in the pattern.
947
948 NOTE: In order to make things easier for programmers with
949 experience with the Python or PCRE regex engines the pattern
950 "(?P>NAME)" may be used instead of "(?&NAME)".
951
952 "(?(condition)yes-pattern|no-pattern)"
953 "(?(condition)yes-pattern)"
954 Conditional expression. "(condition)" should be either an
955 integer in parentheses (which is valid if the corresponding
956 pair of parentheses matched), a
957 look-ahead/look-behind/evaluate zero-width assertion, a name
958 in angle brackets or single quotes (which is valid if a
959 buffer with the given name matched), or the special symbol
960 (R) (true when evaluated inside of recursion or eval).
961 Additionally, the R may be followed by a number (in which
962 case it is true when evaluated while recursing inside the
963 appropriate group), or by &NAME, in which case it is
964 true only when evaluated during recursion in the named group.
965
966 Here's a summary of the possible predicates:
967
968 (1) (2) ...
969 Checks if the numbered capturing buffer has matched
970 something.
971
972 (<NAME>) ('NAME')
973 Checks if a buffer with the given name has matched
974 something.
975
976 (?{ CODE })
977 Treats the code block as the condition.
978
979 (R) Checks if the expression has been evaluated inside of
980 recursion.
981
982 (R1) (R2) ...
983 Checks if the expression has been evaluated while
984 executing directly inside of the n-th capture group. This
985 check is the regex equivalent of
986
987 if ((caller(0))[3] eq 'subname') { ... }
988
989 In other words, it does not check the full recursion
990 stack.
991
992 (R&NAME)
993 Similar to "(R1)", this predicate checks to see if we're
994 executing directly inside of the leftmost group with a
995 given name (this is the same logic used by "(?&NAME)" to
996 disambiguate). It does not check the full stack, but only
997 the name of the innermost active recursion.
998
999 (DEFINE)
1000 In this case, the yes-pattern is never directly executed,
1001 and no no-pattern is allowed. Similar in spirit to
1002 "(?{0})" but more efficient. See below for details.
1003
1004 For example:
1005
1006 m{ ( \( )?
1007 [^()]+
1008 (?(1) \) )
1009 }x
1010
1011 matches a chunk of non-parentheses, possibly included in
1012 parentheses themselves.
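
A short sketch of how the condition behaves, using an anchored variant of the pattern above so that partial matches do not count. The anchors and the sample strings are additions for this example.

```perl
use strict;
use warnings;

# The closing parenthesis is required only if group 1 (the opening
# parenthesis) took part in the match.
my $chunk = qr{ ^ ( \( )? [^()]+ (?(1) \) ) $ }x;

my %ok;
for my $text ('abc', '(abc)', '(abc', 'abc)') {
    $ok{$text} = ($text =~ $chunk) ? 1 : 0;
}

print "$_ => $ok{$_}\n" for sort keys %ok;
```

The bare and the fully parenthesized chunks match. The two strings with a single unpaired parenthesis do not.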
1013
1014 A special form is the "(DEFINE)" predicate, which never
1015 directly executes its yes-pattern, and does not allow a no-
1016 pattern. This allows you to define subpatterns which will be
1017 executed only by using the recursion mechanism. This way,
1018 you can define a set of regular expression rules that can be
1019 bundled into any pattern you choose.
1020
1021 It is recommended that for this usage you put the DEFINE
1022 block at the end of the pattern, and that you name any
1023 subpatterns defined within it.
1024
1025 Also, it's worth noting that patterns defined this way
1026 probably will not be as efficient, as the optimiser is not
1027 very clever about handling them.
1028
1029 An example of how this might be used is as follows:
1030
1031 /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
1032 (?(DEFINE)
1033 (?<NAME_PAT>....)
1034 (?<ADDRESS_PAT>....)
1035 )/x
1036
1037 Note that capture buffers matched inside of recursion are not
1038 accessible after the recursion returns, so the extra layer of
1039 capturing buffers is necessary. Thus $+{NAME_PAT} would not
1040 be defined even though $+{NAME} would be.
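
As a concrete sketch of the same layout, with hypothetical rule names (IDENT, NUMBER) and a made-up input line: the rules live in a "(DEFINE)" block at the end of the pattern, and the outer named groups capture what the rules matched.

```perl
use strict;
use warnings;

# IDENT and NUMBER are illustrative rule names chosen for this sketch.
my $assignment = qr{
    ^ (?<KEY> (?&IDENT) ) \s* = \s* (?<VALUE> (?&NUMBER) ) $
    (?(DEFINE)
        (?<IDENT>  [A-Za-z_] \w* )
        (?<NUMBER> \d+ (?: \. \d+ )? )
    )
}x;

my ($key, $value, $inner_seen);
if ('width = 12.5' =~ $assignment) {
    ($key, $value) = ($+{KEY}, $+{VALUE});
    # Captures made during recursion are not kept afterwards.
    $inner_seen = defined $+{IDENT} ? 1 : 0;
}

print "key = $key, value = $value\n" if defined $key;
```

Only the outer groups KEY and VALUE are populated after the match, which is why the extra layer of capturing groups is needed.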
1041
1042 "(?>pattern)"
1043 An "independent" subexpression, one which matches the
1044 substring that a standalone "pattern" would match if anchored
1045 at the given position, and it matches nothing other than this
1046 substring. This construct is useful for optimizations of
1047 what would otherwise be "eternal" matches, because it will
1048 not backtrack (see "Backtracking"). It may also be useful in
1049 places where the "grab all you can, and do not give anything
1050 back" semantic is desirable.
1051
1052 For example: "^(?>a*)ab" will never match, since "(?>a*)"
1053 (anchored at the beginning of string, as above) will match
1054 all characters "a" at the beginning of string, leaving no "a"
1055 for "ab" to match. In contrast, "a*ab" will match the same
1056 as "a+b", since the match of the subgroup "a*" is influenced
1057 by the following group "ab" (see "Backtracking"). In
1058 particular, "a*" inside "a*ab" will match fewer characters
1059 than a standalone "a*", since this makes the tail match.
1060
1061 An effect similar to "(?>pattern)" may be achieved by writing
1062 "(?=(pattern))\1". This matches the same substring as a
1063 standalone "pattern", and the following "\1" eats the matched
1064 string; it therefore makes a zero-length assertion into an
1065 analogue of "(?>...)". (The difference between these two
1066 constructs is that the second one uses a capturing group,
1067 thus shifting ordinals of backreferences in the rest of a
1068 regular expression.)
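
The three variants discussed above can be compared side by side. This sketch assumes the sample string 'aaab'; the hash keys are labels invented for the example.

```perl
use strict;
use warnings;

my %matched = (
    ordinary  => ('aaab' =~ /^a*ab/)         ? 1 : 0,   # "a*" gives back one "a"
    atomic    => ('aaab' =~ /^(?>a*)ab/)     ? 1 : 0,   # keeps every "a"
    lookahead => ('aaab' =~ /^(?=(a*))\1ab/) ? 1 : 0,   # emulation with "\1"
);

print "$_ => $matched{$_}\n" for sort keys %matched;
```

Only the ordinary form matches. Both the atomic group and its look-ahead emulation keep every leading "a", so nothing is left for "ab".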
1069
1070 Consider this pattern:
1071
1072 m{ \(
1073 (
1074 [^()]+ # x+
1075 |
1076 \( [^()]* \)
1077 )+
1078 \)
1079 }x
1080
1081 That will efficiently match a nonempty group with matching
1082 parentheses two levels deep or less. However, if there is no
1083 such group, it will take virtually forever on a long string.
1084 That's because there are so many different ways to split a
1085 long string into several substrings. This is what "(.+)+" is
1086 doing, and "(.+)+" is similar to a subpattern of the above
1087 pattern. Consider how the pattern above detects no-match on
1088 "((()aaaaaaaaaaaaaaaaaa" in several seconds, but that each
1089 extra letter doubles this time. This exponential performance
1090 will make it appear that your program has hung. However, a
1091 tiny change to this pattern
1092
1093 m{ \(
1094 (
1095 (?> [^()]+ ) # change x+ above to (?> x+ )
1096 |
1097 \( [^()]* \)
1098 )+
1099 \)
1100 }x
1101
1102 which uses "(?>...)" matches exactly when the one above does
1103 (verifying this yourself would be a productive exercise), but
1104 finishes in a quarter of the time when used on a similar string
1105 with 1000000 "a"s. Be aware, however, that this pattern
1106 currently triggers a warning message under the "use warnings"
1107 pragma or -w switch saying it "matches null string many times
1108 in regex".
1109
1110 On simple groups, such as the pattern "(?> [^()]+ )", a
1111 comparable effect may be achieved by negative look-ahead, as
1112 in "[^()]+ (?! [^()] )". This was only 4 times slower on a
1113 string with 1000000 "a"s.
1114
1115 The "grab all you can, and do not give anything back"
1116 semantic is desirable in many situations where, at first
1117 sight, a simple "()*" looks like the correct solution.
1118 Suppose we parse text with comments being delimited by "#"
1119 followed by some optional (horizontal) whitespace. Contrary
1120 to its appearance, "#[ \t]*" is not the correct subexpression
1121 to match the comment delimiter, because it may "give up" some
1122 whitespace if the remainder of the pattern can be made to
1123 match that way. The correct answer is either one of these:
1124
1125 (?>#[ \t]*)
1126 #[ \t]*(?![ \t])
1127
1128 For example, to grab non-empty comments into $1, one should
1129 use either one of these:
1130
1131 / (?> \# [ \t]* ) ( .+ ) /x;
1132 / \# [ \t]* ( [^ \t] .* ) /x;
1133
1134 Which one you pick depends on which of these expressions
1135 better reflects the above specification of comments.
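
To see the difference, consider an empty comment, that is, a "#" followed only by blanks. The sketch below assumes that input and compares a plain quantifier with the two patterns above; the variable names are illustrative.

```perl
use strict;
use warnings;

my $line = "#  ";    # a "#" followed only by blanks: an empty comment

# Plain quantifier: "[ \t]*" gives back a blank so that ".+" can match.
my ($naive)  = $line =~ / \# [ \t]* ( .+ ) /x;

# Atomic delimiter: the blanks are never given back, so the match fails.
my ($atomic) = $line =~ / (?> \# [ \t]* ) ( .+ ) /x;

# The comment body must start with a non-blank, so the match fails too.
my ($strict) = $line =~ / \# [ \t]* ( [^ \t] .* ) /x;

printf "naive=%s atomic=%s strict=%s\n",
    map { defined $_ ? "<$_>" : 'no match' } $naive, $atomic, $strict;
```

The plain version reports a single blank as the comment text, which is almost certainly not what was intended.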
1136
1137 In some literature this construct is called "atomic matching"
1138 or "possessive matching".
1139
1140 Possessive quantifiers are equivalent to putting the item
1141 they are applied to inside of one of these constructs. The
1142 following equivalences apply:
1143
1144 Quantifier Form Bracketing Form
1145 --------------- ---------------
1146 PAT*+ (?>PAT*)
1147 PAT++ (?>PAT+)
1148 PAT?+ (?>PAT?)
1149 PAT{min,max}+ (?>PAT{min,max})
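
The equivalences in the table can be spot-checked with a sketch like the following. It assumes PAT is the single character "a", followed by one more "a", and tries a few short inputs; every pair is expected to agree.

```perl
use strict;
use warnings;

# Quantifier form on the left, bracketing form on the right.
my @pairs = (
    [ qr/^a*+a$/,     qr/^(?>a*)a$/     ],
    [ qr/^a++a$/,     qr/^(?>a+)a$/     ],
    [ qr/^a?+a$/,     qr/^(?>a?)a$/     ],
    [ qr/^a{1,3}+a$/, qr/^(?>a{1,3})a$/ ],
);

my @agree;
for my $pair (@pairs) {
    my ($possessive, $bracketed) = @$pair;
    for my $text ('a', 'aa', 'aaa', 'aaaa') {
        my $p = ($text =~ $possessive) ? 1 : 0;
        my $q = ($text =~ $bracketed)  ? 1 : 0;
        push @agree, ($p == $q) ? 1 : 0;
    }
}

print "all forms agree\n" unless grep { !$_ } @agree;
```

For instance, neither form of the "{1,3}" row matches 'aaa', because the quantified part takes all three characters and gives none back, while both match 'aaaa'.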
1150
1151 Special Backtracking Control Verbs
1152 WARNING: These patterns are experimental and subject to change or
1153 removal in a future version of Perl. Their usage in production code
1154 should be noted to avoid problems during upgrades.
1155
1156 These special patterns are generally of the form "(*VERB:ARG)". Unless
1157 otherwise stated the ARG argument is optional; in some cases, it is
1158 forbidden.
1159
1160 Any pattern containing a special backtracking verb that allows an
1161 argument has the special behaviour that when executed it sets the
1162 current package's $REGERROR and $REGMARK variables. When doing so the
1163 following rules apply:
1164
1165 On failure, the $REGERROR variable will be set to the ARG value of the
1166 verb pattern, if the verb was involved in the failure of the match. If
1167 the ARG part of the pattern was omitted, then $REGERROR will be set to
1168 the name of the last "(*MARK:NAME)" pattern executed, or to TRUE if
1169 there was none. Also, the $REGMARK variable will be set to FALSE.
1170
1171 On a successful match, the $REGERROR variable will be set to FALSE, and
1172 the $REGMARK variable will be set to the name of the last
1173 "(*MARK:NAME)" pattern executed. See the explanation for the
1174 "(*MARK:NAME)" verb below for more details.
1175
1176 NOTE: $REGERROR and $REGMARK are not magic variables like $1 and most
1177 other regex related variables. They are not local to a scope, nor
1178 readonly, but instead are volatile package variables similar to
1179 $AUTOLOAD. Use "local" to localize changes to them to a specific scope
1180 if necessary.
1181
1182 If a pattern does not contain a special backtracking verb that allows
1183 an argument, then $REGERROR and $REGMARK are not touched at all.
1184
1185 Verbs that take an argument
1186 "(*PRUNE)" "(*PRUNE:NAME)"
1187 This zero-width pattern prunes the backtracking tree at the
1188 current point when backtracked into on failure. Consider the
1189 pattern "A (*PRUNE) B", where A and B are complex patterns.
1190 Until the "(*PRUNE)" verb is reached, A may backtrack as
1191 necessary to match. Once it is reached, matching continues in
1192 B, which may also backtrack as necessary; however, should B not
1193 match, then no further backtracking will take place, and the
1194 pattern will fail outright at the current starting position.
1195
1196 The following example counts all the possible matching strings
1197 in a pattern (without actually matching any of them).
1198
1199 'aaab' =~ /a+b?(?{print "$&\n"; $count++})(*FAIL)/;
1200 print "Count=$count\n";
1201
1202 which produces:
1203
1204 aaab
1205 aaa
1206 aa
1207 a
1208 aab
1209 aa
1210 a
1211 ab
1212 a
1213 Count=9
1214
1215 If we add a "(*PRUNE)" before the count like the following
1216
1217 'aaab' =~ /a+b?(*PRUNE)(?{print "$&\n"; $count++})(*FAIL)/;
1218 print "Count=$count\n";
1219
1220 we prevent backtracking and find the count of the longest
1221 matching at each matching starting point like so:
1222
1223 aaab
1224 aab
1225 ab
1226 Count=3
1227
1228 Any number of "(*PRUNE)" assertions may be used in a pattern.
1229
1230 See also "(?>pattern)" and possessive quantifiers for other
1231 ways to control backtracking. In some cases, the use of
1232 "(*PRUNE)" can be replaced with a "(?>pattern)" with no
1233 functional difference; however, "(*PRUNE)" can be used to
1234 handle cases that cannot be expressed using a "(?>pattern)"
1235 alone.
1236
1237 "(*SKIP)" "(*SKIP:NAME)"
1238 This zero-width pattern is similar to "(*PRUNE)", except that
1239 on failure it also signifies that whatever text that was
1240 matched leading up to the "(*SKIP)" pattern being executed
1241 cannot be part of any match of this pattern. This effectively
1242 means that the regex engine "skips" forward to this position on
1243 failure and tries to match again (assuming that there is
1244 sufficient room to match).
1245
1246 The name of the "(*SKIP:NAME)" pattern has special
1247 significance. If a "(*MARK:NAME)" was encountered while
1248 matching, then it is that position which is used as the "skip
1249 point". If no "(*MARK)" of that name was encountered, then the
1250 "(*SKIP)" operator has no effect. When used without a name the
1251 "skip point" is where the match point was when executing the
1252 (*SKIP) pattern.
1253
1254 Compare the following to the examples in "(*PRUNE)"; note that the
1255 string is twice as long:
1256
1257 'aaabaaab' =~ /a+b?(*SKIP)(?{print "$&\n"; $count++})(*FAIL)/;
1258 print "Count=$count\n";
1259
1260 outputs
1261
1262 aaab
1263 aaab
1264 Count=2
1265
1266 Once the 'aaab' at the start of the string has matched, and the
1267 "(*SKIP)" executed, the next starting point will be where the
1268 cursor was when the "(*SKIP)" was executed.
1269
1270 "(*MARK:NAME)" "(*:NAME)"
1271 This zero-width pattern can be used to mark the point reached
1272 in a string when a certain part of the pattern has been
1273 successfully matched. This mark may be given a name. A later
1274 "(*SKIP)" pattern will then skip forward to that point if
1275 backtracked into on failure. Any number of "(*MARK)" patterns
1276 are allowed, and the NAME portion may be duplicated.
1277
1278 In addition to interacting with the "(*SKIP)" pattern,
1279 "(*MARK:NAME)" can be used to "label" a pattern branch, so that
1280 after matching, the program can determine which branches of the
1281 pattern were involved in the match.
1282
1283 When a match is successful, the $REGMARK variable will be set
1284 to the name of the most recently executed "(*MARK:NAME)" that
1285 was involved in the match.
1286
1287 This can be used to determine which branch of a pattern was
1288 matched without using a separate capture buffer for each
1289 branch, which in turn can result in a performance improvement,
1290 as perl cannot optimize "/(?:(x)|(y)|(z))/" as efficiently as
1291 something like "/(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/".
1292
1293 When a match has failed, and unless another verb has been
1294 involved in failing the match and has provided its own name to
1295 use, the $REGERROR variable will be set to the name of the most
1296 recently executed "(*MARK:NAME)".
1297
1298 See "(*SKIP)" for more details.
1299
1300 As a shortcut "(*MARK:NAME)" can be written "(*:NAME)".
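
A minimal sketch of branch labelling, assuming made-up mark names ("first", "second", "third"). $REGMARK is declared with "our" because it is an ordinary package variable.

```perl
use strict;
use warnings;

our ($REGMARK, $REGERROR);    # package variables set by the regex engine

my %branch;
for my $text (qw(x y z)) {
    if ($text =~ /(?:x(*MARK:first)|y(*MARK:second)|z(*MARK:third))/) {
        # After a successful match, $REGMARK names the last mark executed.
        $branch{$text} = $REGMARK;
    }
}

print "$_ => $branch{$_}\n" for sort keys %branch;
```

No capture groups are needed to tell the branches apart; the mark name alone identifies which alternative matched.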
1301
1302 "(*THEN)" "(*THEN:NAME)"
1303 This is similar to the "cut group" operator "::" from Perl 6.
1304 Like "(*PRUNE)", this verb always matches, and when backtracked
1305 into on failure, it causes the regex engine to try the next
1306 alternation in the innermost enclosing group (capturing or
1307 otherwise).
1308
1309 Its name comes from the observation that this operation
1310 combined with the alternation operator ("|") can be used to
1311 create what is essentially a pattern-based if/then/else block:
1312
1313 ( COND (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )
1314
1315 Note that if this operator is used and NOT inside of an
1316 alternation then it acts exactly like the "(*PRUNE)" operator.
1317
1318 / A (*PRUNE) B /
1319
1320 is the same as
1321
1322 / A (*THEN) B /
1323
1324 but
1325
1326 / ( A (*THEN) B | C (*THEN) D ) /
1327
1328 is not the same as
1329
1330 / ( A (*PRUNE) B | C (*PRUNE) D ) /
1331
1332 as after matching the A but failing on the B the "(*THEN)" verb
1333 will backtrack and try C; but the "(*PRUNE)" verb will simply
1334 fail.
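
The difference can be seen in a small sketch, assuming the literal stand-ins A = "a", B = "b", C = "a" and D = "c", and the input 'ac'. The pattern is anchored so that only one starting position is tried.

```perl
use strict;
use warnings;

# After "a" matches and "b" fails, (*THEN) moves on to the next
# alternative, whereas (*PRUNE) fails the attempt at this position.
my $then  = ('ac' =~ /^(?: a (*THEN)  b | a c )$/x) ? 1 : 0;
my $prune = ('ac' =~ /^(?: a (*PRUNE) b | a c )$/x) ? 1 : 0;

print "THEN: $then, PRUNE: $prune\n";
```

With "(*THEN)" the second alternative still gets its chance and the match succeeds; with "(*PRUNE)" it does not.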
1335
1336 "(*COMMIT)"
1337 This is the Perl 6 "commit pattern" "<commit>" or ":::". It's a
1338 zero-width pattern similar to "(*SKIP)", except that when
1339 backtracked into on failure it causes the match to fail
1340 outright. No further attempts to find a valid match by
1341 advancing the start pointer will occur again. For example,
1342
1343 'aaabaaab' =~ /a+b?(*COMMIT)(?{print "$&\n"; $count++})(*FAIL)/;
1344 print "Count=$count\n";
1345
1346 outputs
1347
1348 aaab
1349 Count=1
1350
1351 In other words, once the "(*COMMIT)" has been entered, and if
1352 the pattern does not match, the regex engine will not try any
1353 further matching on the rest of the string.
1354
1355 Verbs without an argument
1356 "(*FAIL)" "(*F)"
1357 This pattern matches nothing and always fails. It can be used
1358 to force the engine to backtrack. It is equivalent to "(?!)",
1359 but easier to read. In fact, "(?!)" gets optimised into
1360 "(*FAIL)" internally.
1361
1362 It is probably useful only when combined with "(?{})" or
1363 "(??{})".
1364
1365 "(*ACCEPT)"
1366 WARNING: This feature is highly experimental. It is not
1367 recommended for production code.
1368
1369 This pattern matches nothing and causes the end of successful
1370 matching at the point at which the "(*ACCEPT)" pattern was
1371 encountered, regardless of whether there is actually more to
1372 match in the string. When inside of a nested pattern, such as
1373 recursion, or in a subpattern dynamically generated via
1374 "(??{})", only the innermost pattern is ended immediately.
1375
1376 If the "(*ACCEPT)" is inside of capturing buffers then the
1377 buffers are marked as ended at the point at which the
1378 "(*ACCEPT)" was encountered. For instance:
1379
1380 'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;
1381
1382 will match, and $1 will be "AB" and $2 will be "B"; $3 will not
1383 be set. If another branch in the inner parentheses were
1384 matched, such as in the string 'ACDE', then the "D" and "E"
1385 would have to be matched as well.
1386
1387 Backtracking
1388 NOTE: This section presents an abstract approximation of regular
1389 expression behavior. For a more rigorous (and complicated) view of the
1390 rules involved in selecting a match among possible alternatives, see
1391 "Combining RE Pieces".
1392
1393 A fundamental feature of regular expression matching involves the
1394 notion called backtracking, which is currently used (when needed) by
1395 all non-possessive regular expression quantifiers, namely "*", "*?",
1396 "+", "+?", "{n,m}", and "{n,m}?". Backtracking is often optimized
1397 internally, but the general principle outlined here is valid.
1398
1399 For a regular expression to match, the entire regular expression must
1400 match, not just part of it. So if the beginning of a pattern
1401 containing a quantifier succeeds in a way that causes later parts in
1402 the pattern to fail, the matching engine backs up and recalculates the
1403 beginning part--that's why it's called backtracking.
1404
1405 Here is an example of backtracking: Let's say you want to find the
1406 word following "foo" in the string "Food is on the foo table.":
1407
1408 $_ = "Food is on the foo table.";
1409 if ( /\b(foo)\s+(\w+)/i ) {
1410 print "$2 follows $1.\n";
1411 }
1412
1413 When the match runs, the first part of the regular expression
1414 ("\b(foo)") finds a possible match right at the beginning of the
1415 string, and loads up $1 with "Foo". However, as soon as the matching
1416 engine sees that there's no whitespace following the "Foo" that it had
1417 saved in $1, it realizes its mistake and starts over again one
1418 character after where it had the tentative match. This time it goes
1419 all the way until the next occurrence of "foo". The complete regular
1420 expression matches this time, and you get the expected output of "table
1421 follows foo."
1422
1423 Sometimes minimal matching can help a lot. Imagine you'd like to match
1424 everything between "foo" and "bar". Initially, you write something
1425 like this:
1426
1427 $_ = "The food is under the bar in the barn.";
1428 if ( /foo(.*)bar/ ) {
1429 print "got <$1>\n";
1430 }
1431
1432 Which perhaps unexpectedly yields:
1433
1434 got <d is under the bar in the >
1435
1436 That's because ".*" was greedy, so you get everything between the first
1437 "foo" and the last "bar". Here it's more effective to use minimal
1438 matching to make sure you get the text between a "foo" and the first
1439 "bar" thereafter.
1440
1441 if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
1442 got <d is under the >
1443
1444 Here's another example. Let's say you'd like to match a number at the
1445 end of a string, and you also want to keep the preceding part of the
1446 match. So you write this:
1447
1448 $_ = "I have 2 numbers: 53147";
1449 if ( /(.*)(\d*)/ ) { # Wrong!
1450 print "Beginning is <$1>, number is <$2>.\n";
1451 }
1452
1453 That won't work at all, because ".*" was greedy and gobbled up the
1454 whole string. As "\d*" can match on an empty string the complete
1455 regular expression matched successfully.
1456
1457 Beginning is <I have 2 numbers: 53147>, number is <>.
1458
1459 Here are some variants, most of which don't work:
1460
1461 $_ = "I have 2 numbers: 53147";
1462 @pats = qw{
1463 (.*)(\d*)
1464 (.*)(\d+)
1465 (.*?)(\d*)
1466 (.*?)(\d+)
1467 (.*)(\d+)$
1468 (.*?)(\d+)$
1469 (.*)\b(\d+)$
1470 (.*\D)(\d+)$
1471 };
1472
1473 for $pat (@pats) {
1474 printf "%-12s ", $pat;
1475 if ( /$pat/ ) {
1476 print "<$1> <$2>\n";
1477 } else {
1478 print "FAIL\n";
1479 }
1480 }
1481
1482 That will print out:
1483
1484 (.*)(\d*) <I have 2 numbers: 53147> <>
1485 (.*)(\d+) <I have 2 numbers: 5314> <7>
1486 (.*?)(\d*) <> <>
1487 (.*?)(\d+) <I have > <2>
1488 (.*)(\d+)$ <I have 2 numbers: 5314> <7>
1489 (.*?)(\d+)$ <I have 2 numbers: > <53147>
1490 (.*)\b(\d+)$ <I have 2 numbers: > <53147>
1491 (.*\D)(\d+)$ <I have 2 numbers: > <53147>
1492
1493 As you see, this can be a bit tricky. It's important to realize that a
1494 regular expression is merely a set of assertions that gives a
1495 definition of success. There may be 0, 1, or several different ways
1496 that the definition might succeed against a particular string. And if
1497 there are multiple ways it might succeed, you need to understand
1498 backtracking to know which variety of success you will achieve.
1499
1500 When using look-ahead assertions and negations, this can all get even
1501 trickier. Imagine you'd like to find a sequence of non-digits not
1502 followed by "123". You might try to write that as
1503
1504 $_ = "ABC123";
1505 if ( /^\D*(?!123)/ ) { # Wrong!
1506 print "Yup, no 123 in $_\n";
1507 }
1508
1509 But that isn't going to behave the way you're hoping: it matches, and
1510 so claims that there is no 123 in the string. Here's a clearer picture of
1511 why that pattern matches, contrary to popular expectations:
1512
1513 $x = 'ABC123';
1514 $y = 'ABC445';
1515
1516 print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
1517 print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;
1518
1519 print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
1520 print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;
1521
1522 This prints
1523
1524 2: got ABC
1525 3: got AB
1526 4: got ABC
1527
1528 You might have expected test 3 to fail because it seems to be a more
1529 general purpose version of test 1. The important difference between
1530 them is that test 3 contains a quantifier ("\D*") and so can use
1531 backtracking, whereas test 1 will not. What's happening is that you've
1532 asked "Is it true that at the start of $x, following 0 or more non-
1533 digits, you have something that's not 123?" If the pattern matcher had
1534 let "\D*" expand to "ABC", this would have caused the whole pattern to
1535 fail.
1536
1537 The search engine will initially match "\D*" with "ABC". Then it will
1538 try to match "(?!123)" with "123", which fails. But because a
1539 quantifier ("\D*") has been used in the regular expression, the search
1540 engine can backtrack and retry the match differently in the hope of
1541 matching the complete regular expression.
1542
1543 The pattern really, really wants to succeed, so it uses the standard
1544 pattern back-off-and-retry and lets "\D*" expand to just "AB" this
1545 time. Now there's indeed something following "AB" that is not "123".
1546 It's "C123", which suffices.
1547
1548 We can deal with this by using both an assertion and a negation. We'll
1549 say that the first part in $1 must be followed both by a digit and by
1550 something that's not "123". Remember that the look-aheads are zero-
1551 width expressions--they only look, but don't consume any of the string
1552 in their match. So rewriting this way produces what you'd expect; that
1553 is, case 5 will fail, but case 6 succeeds:
1554
1555 print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
1556 print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;
1557
1558 6: got ABC
1559
1560 In other words, the two zero-width assertions next to each other work
1561 as though they're ANDed together, just as you'd use any built-in
1562 assertions: "/^$/" matches only if you're at the beginning of the line
1563 AND the end of the line simultaneously. The deeper underlying truth is
1564 that juxtaposition in regular expressions always means AND, except when
1565 you write an explicit OR using the vertical bar. "/ab/" means match
1566 "a" AND (then) match "b", although the attempted matches are made at
1567 different positions because "a" is not a zero-width assertion, but a
1568 one-width assertion.
1569
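As a rough sketch of assertions being ANDed together, the following
stacks three look-aheads at the start of a string. The helper name
looks_strong and its three conditions are invented for illustration
only; they are not part of Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 only if ALL three look-aheads succeed
# at position 0 (a digit somewhere, a lowercase letter somewhere, and
# at least six characters). None of them consumes any of the string.
sub looks_strong {
    my ($s) = @_;
    return $s =~ /^(?=.*\d)(?=.*[a-z])(?=.{6})/ ? 1 : 0;
}

print looks_strong('abc123'), "\n";   # all three conditions hold
print looks_strong('abcdef'), "\n";   # no digit, so the AND fails
print looks_strong('a1'),     "\n";   # too short, so the AND fails
```

Because every look-ahead is zero-width, all three are tested at the
same position, which is what makes the juxtaposition behave like AND.
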
WARNING: Particularly complicated regular expressions can take
exponential time to solve because of the immense number of possible
ways they can use backtracking to try for a match. For example,
without internal optimizations done by the regular expression engine,
this will take a painfully long time to run:

    'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/

And if you used "*"'s in the internal groups instead of limiting them
to 0 through 5 matches, then it would take forever--or until you ran
out of stack space. Moreover, these internal optimizations are not
always applicable. For example, if you put "{0,5}" instead of "*" on
the external group, no current optimization is applicable, and the
match takes a long time to finish.

A powerful tool for optimizing such beasts is what is known as an
"independent group", which does not backtrack (see "(?>pattern)").
Note also that zero-length look-ahead/look-behind assertions will not
backtrack to make the tail match, since they are in "logical" context:
only whether they match is considered relevant. For an example where
side-effects of look-ahead might have influenced the following match,
see "(?>pattern)".
1592
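To illustrate the independent group, the sketch below reruns test 3
from the earlier example with "\D*" wrapped in "(?>...)". The variable
names are made up for this example.

```perl
use strict;
use warnings;

my $x = 'ABC123';
my $y = 'ABC445';

# Plain \D* may give back the "C", so the look-ahead succeeds after "AB".
my ($plain) = $x =~ /^(\D*)(?!123)/;

# (?>\D*) keeps all of "ABC" and will not give any of it back, so the
# look-ahead sees "123" and the whole match fails for $x.
my ($atomic_x) = $x =~ /^((?>\D*))(?!123)/;
my ($atomic_y) = $y =~ /^((?>\D*))(?!123)/;

print "plain:    ", $plain    // 'no match', "\n";
print "atomic x: ", $atomic_x // 'no match', "\n";
print "atomic y: ", $atomic_y // 'no match', "\n";
```

This is an alternative to the "(?=\d)(?!123)" rewrite shown earlier: it
removes the backtracking instead of adding a second assertion.
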
Version 8 Regular Expressions
In case you're not familiar with the "regular" Version 8 regex
routines, here are the pattern-matching rules not described above.

Any single character matches itself, unless it is a metacharacter with
a special meaning described here or above. You can cause characters
that normally function as metacharacters to be interpreted literally by
prefixing them with a "\" (e.g., "\." matches a ".", not any character;
"\\" matches a "\"). This escape mechanism is also required for the
character used as the pattern delimiter.

A series of characters matches that series of characters in the target
string, so the pattern "blurfl" would match "blurfl" in the target
string.

You can specify a character class by enclosing a list of characters in
"[]", which will match any character from the list. If the first
character after the "[" is "^", the class matches any character not in
the list. Within a list, the "-" character specifies a range, so that
"a-z" represents all characters between "a" and "z", inclusive. If you
want either "-" or "]" itself to be a member of a class, put it at the
start of the list (possibly after a "^"), or escape it with a
backslash. "-" is also taken literally when it is at the end of the
list, just before the closing "]". (The following all specify the same
class of three characters: "[-az]", "[az-]", and "[a\-z]". All are
different from "[a-z]", which specifies a class containing twenty-six
characters, even on EBCDIC-based character sets.) Also, if you try to
use the character classes "\w", "\W", "\s", "\S", "\d", or "\D" as
endpoints of a range, the "-" is understood literally.

Note also that the whole range idea is rather unportable between
character sets, and even within a character set ranges may cause
results you probably didn't expect. A sound principle is to use only
ranges that begin from and end at either alphabetics of equal case
([a-e], [A-E]), or digits ([0-9]). Anything else is unsafe. If in
doubt, spell out the character sets in full.
1629
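The placement rules for "-" can be checked with a small sketch like the
one below. The helper name members and the four candidate characters
are chosen only for this illustration.

```perl
use strict;
use warnings;

my @candidates = ('a', '-', 'z', 'm');

# Return the candidate characters that a given class matches.
sub members {
    my ($class) = @_;
    return join ' ', grep { $_ =~ $class } @candidates;
}

print members(qr/[-az]/),  "\n";   # "-" first: a literal hyphen
print members(qr/[az-]/),  "\n";   # "-" last: a literal hyphen
print members(qr/[a\-z]/), "\n";   # "-" escaped: a literal hyphen
print members(qr/[a-z]/),  "\n";   # "-" between endpoints: a range
```

The first three classes accept "a", "-" and "z" but not "m"; the range
accepts "a", "z" and "m" but not "-".
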
Characters may be specified using a metacharacter syntax much like that
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
"\f" a form feed, etc. More generally, \nnn, where nnn is a string of
octal digits, matches the character whose coded character set value is
nnn. Similarly, \xnn, where nn are hexadecimal digits, matches the
character whose numeric value is nn. The expression \cx matches the
character control-x. Finally, the "." metacharacter matches any
character except "\n" (unless you use "/s").
1638
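A brief sketch of these escapes follows. It assumes an ASCII-based
platform, where octal 101 and hexadecimal 41 both denote "A"; the hash
name %check is arbitrary.

```perl
use strict;
use warnings;

# Each entry is 1 if the pattern matched and 0 otherwise (ASCII assumed).
my %check = (
    octal     => ("A"  =~ /\101/ ? 1 : 0),   # octal 101 is "A"
    hex       => ("A"  =~ /\x41/ ? 1 : 0),   # hex 41 is "A"
    control   => ("\t" =~ /\cI/  ? 1 : 0),   # control-I is a tab
    dot_plain => ("\n" =~ /./    ? 1 : 0),   # "." skips newline
    dot_s     => ("\n" =~ /./s   ? 1 : 0),   # unless /s is in effect
);

printf "%-9s %d\n", $_, $check{$_} for sort keys %check;
```

Only the dot_plain entry is 0, since "." does not match a newline
without "/s".
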
You can specify a series of alternatives for a pattern using "|" to
separate them, so that "fee|fie|foe" will match any of "fee", "fie", or
"foe" in the target string (as would "f(e|i|o)e"). The first
alternative includes everything from the last pattern delimiter ("(",
"[", or the beginning of the pattern) up to the first "|", and the last
alternative contains everything from the last "|" to the next pattern
delimiter. That's why it's common practice to include alternatives in
parentheses: to minimize confusion about where they start and end.

Alternatives are tried from left to right, so the first alternative
found for which the entire expression matches is the one that is
chosen. This means that alternatives are not necessarily greedy. For
example, when matching "foo|foot" against "barefoot", only the "foo"
part will match, as that is the first alternative tried, and it
successfully matches the target string. (This might not seem important,
but it is important when you are capturing matched text using
parentheses.)

Also remember that "|" is interpreted as a literal within square
brackets, so if you write "[fee|fie|foe]" you're really only matching
"[feio|]".
1660
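The left-to-right rule can be seen in a short sketch such as this one,
which reuses the "barefoot" example; the variable names are
illustrative.

```perl
use strict;
use warnings;

# The first alternative that lets the whole pattern match wins.
my ($first)  = 'barefoot' =~ /(foo|foot)/;    # "foo" is tried first
my ($second) = 'barefoot' =~ /(foot|foo)/;    # "foot" is tried first
my ($forced) = 'barefoot' =~ /(foo|foot)$/;   # "foo" cannot reach "$"

# Inside brackets "|" is just another member of the class.
my @literal = 'fie|' =~ /[fee|fie|foe]/g;

print "$first $second $forced\n";
print scalar(@literal), " single-character matches\n";
```

In the third match "foo" is tried first but the trailing "$" fails, so
the engine falls back to "foot".
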
Within a pattern, you may designate subpatterns for later reference by
enclosing them in parentheses, and you may refer back to the nth
subpattern later in the pattern using the metacharacter \n.
Subpatterns are numbered based on the left to right order of their
opening parenthesis. A backreference matches whatever actually matched
the subpattern in the string being examined, not the rules for that
subpattern. Therefore, "(0|0x)\d*\s\1\d*" will match "0x1234 0x4321",
but not "0x1234 01234", because subpattern 1 matched "0x", even though
the rule "0|0x" could potentially match the leading 0 in the second
number.
1671
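The backreference example above can be tried out with a sketch along
these lines. The third test string and the hash name %group are
additions made for illustration.

```perl
use strict;
use warnings;

my $re = qr/(0|0x)\d*\s\1\d*/;

# Record what subpattern 1 captured, or undef when there is no match.
my %group;
for my $s ('0x1234 0x4321', '0x1234 01234', '01234 04321') {
    $group{$s} = $s =~ $re ? $1 : undef;
    print "$s: ", defined $group{$s} ? "matched, \\1 is '$group{$s}'"
                                     : "no match", "\n";
}
```

The second string fails because \1 must repeat the text "0x" that
subpattern 1 actually matched, not merely satisfy "0|0x" again.
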
Warning on \1 Instead of $1
Some people get too used to writing things like:

    $pattern =~ s/(\W)/\\\1/g;

This is grandfathered (for \1 to \9) for the RHS of a substitute to
avoid shocking the sed addicts, but it's a dirty habit to get into.
That's because in PerlThink, the righthand side of an "s///" is a
double-quoted string. "\1" in the usual double-quoted string means a
control-A. The customary Unix meaning of "\1" is kludged in for
"s///". However, if you get into the habit of doing that, you get
yourself into trouble if you then add an "/e" modifier.

    s/(\d+)/ \1 + 1 /eg;    # causes warning under -w

Or if you try to do

    s/(\d+)/\1000/;

You can't disambiguate that by saying "\{1}000", whereas you can fix it
with "${1}000". The operation of interpolation should not be confused
with the operation of matching a backreference. Certainly they mean
two different things on the left side of the "s///".
1695
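A minimal sketch of the $1 forms recommended above; the sample text and
variable names are invented for the example.

```perl
use strict;
use warnings;

my $text = 'item 7';

# ${1}000 is the captured text followed by three literal zeros.
(my $padded = $text) =~ s/(\d+)/${1}000/;

# With /e the right-hand side is Perl code, so $1 is an ordinary variable.
(my $bumped = $text) =~ s/(\d+)/$1 + 1/e;

print "$padded\n";
print "$bumped\n";
```

The braces in "${1}000" mark where the variable name ends, which "\1"
has no way of expressing.
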
Repeated Patterns Matching a Zero-length Substring
WARNING: Difficult material (and prose) ahead. This section needs a
rewrite.

Regular expressions provide a terse and powerful programming language.
As with most other power tools, power comes together with the ability
to wreak havoc.

A common abuse of this power stems from the ability to make infinite
loops using regular expressions, with something as innocuous as:

    'foo' =~ m{ ( o? )* }x;

The "o?" matches at the beginning of 'foo', and since the position in
the string is not moved by the match, "o?" would match again and again
because of the "*" quantifier. Another common way to create a similar
cycle is with the looping modifier "//g":

    @matches = ( 'foo' =~ m{ o? }xg );

or

    print "match: <$&>\n" while 'foo' =~ m{ o? }xg;

or the loop implied by split().

However, long experience has shown that many programming tasks may be
significantly simplified by using repeated subexpressions that may
match zero-length substrings. Here's a simple example:

    @chars = split //, $string;             # // is not magic in split
    ($whitewashed = $string) =~ s/()/ /g;   # parens avoid magic s// /

Thus Perl allows such constructs, by forcefully breaking the infinite
loop. The rules for this are different for lower-level loops given by
the greedy quantifiers "*+{}", and for higher-level ones like the "/g"
modifier or split() operator.

The lower-level loops are interrupted (that is, the loop is broken)
when Perl detects that a repeated expression matched a zero-length
substring. Thus

    m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;

is made equivalent to

    m{ (?: NON_ZERO_LENGTH )*
       |
       (?: ZERO_LENGTH )?
     }x;

The higher-level loops preserve an additional state between iterations:
whether the last match was zero-length. To break the loop, the
following match after a zero-length match is prohibited from having a
length of zero. This prohibition interacts with backtracking (see
"Backtracking"), and so the second best match is chosen if the best
match is of zero length.

For example:

    $_ = 'bar';
    s/\w??/<$&>/g;

results in "<><b><><a><><r><>". At each position of the string the
best match given by non-greedy "??" is the zero-length match, and the
second best match is what is matched by "\w". Thus zero-length matches
alternate with one-character-long matches.

Similarly, for repeated "m/()/g" the second-best match is the match at
the position one notch further in the string.

The additional state of being matched with zero-length is associated
with the matched string, and is reset by each assignment to pos().
Zero-length matches at the end of the previous match are ignored during
"split".
1771
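The loop-breaking rules can be observed with a small sketch like the
following, which collects the results of the constructs discussed in
this section. The variable names are illustrative.

```perl
use strict;
use warnings;

my $string = 'bar';

# Zero-length and one-character matches alternate under /g.
(my $tagged = $string) =~ s/\w??/<$&>/g;

# An empty pattern splits between every pair of characters.
my @chars = split //, $string;

# An empty group matches at every position, including the end.
(my $spaced = $string) =~ s/()/ /g;

# The //g loop terminates even though "o?" can match nothing.
my @found = 'foo' =~ m{ o? }xg;

print "$tagged\n";
print join(',', @chars), "\n";
print "[$spaced]\n";
print scalar(@found), " matches from the //g loop\n";
```

Each of these would loop forever if a zero-length match were allowed to
repeat at the same position.
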
Combining RE Pieces
Each of the elementary pieces of regular expressions which were
described before (such as "ab" or "\Z") could match at most one
substring at the given position of the input string. However, in a
typical regular expression these elementary pieces are combined into
more complicated patterns using combining operators "ST", "S|T", "S*"
etc. (in these examples "S" and "T" are regular subexpressions).

Such combinations can include alternatives, leading to a problem of
choice: if we match a regular expression "a|ab" against "abc", will it
match substring "a" or "ab"? One way to describe which substring is
actually matched is the concept of backtracking (see "Backtracking").
However, this description is too low-level and makes you think in terms
of a particular implementation.

Another description starts with notions of "better"/"worse". All the
substrings which may be matched by the given regular expression can be
sorted from the "best" match to the "worst" match, and it is the "best"
match which is chosen. This replaces the question of "what is chosen?"
with the question of "which matches are better, and which are worse?".

Again, for elementary pieces there is no such question, since at most
one match at a given position is possible. This section describes the
notion of better/worse for combining operators. In the description
below "S" and "T" are regular subexpressions.

"ST"
    Consider two possible matches, "AB" and "A'B'", where "A" and "A'"
    are substrings which can be matched by "S", and "B" and "B'" are
    substrings which can be matched by "T".

    If "A" is a better match for "S" than "A'", "AB" is a better match
    than "A'B'".

    If "A" and "A'" coincide: "AB" is a better match than "AB'" if "B"
    is a better match for "T" than "B'".

"S|T"
    When "S" can match, it is a better match than when only "T" can
    match.

    Ordering of two matches for "S" is the same as for "S". Similarly
    for two matches for "T".

"S{REPEAT_COUNT}"
    Matches as "SSS...S" (repeated as many times as necessary).

"S{min,max}"
    Matches as "S{max}|S{max-1}|...|S{min+1}|S{min}".

"S{min,max}?"
    Matches as "S{min}|S{min+1}|...|S{max-1}|S{max}".

"S?", "S*", "S+"
    Same as "S{0,1}", "S{0,BIG_NUMBER}", "S{1,BIG_NUMBER}"
    respectively.

"S??", "S*?", "S+?"
    Same as "S{0,1}?", "S{0,BIG_NUMBER}?", "S{1,BIG_NUMBER}?"
    respectively.

"(?>S)"
    Matches the best match for "S" and only that.

"(?=S)", "(?<=S)"
    Only the best match for "S" is considered. (This is important only
    if "S" has capturing parentheses, and backreferences are used
    somewhere else in the whole regular expression.)

"(?!S)", "(?<!S)"
    For this grouping operator there is no need to describe the
    ordering, since only whether or not "S" can match is important.

"(??{ EXPR })", "(?PARNO)"
    The ordering is the same as for the regular expression which is the
    result of EXPR, or the pattern contained by capture buffer PARNO.

"(?(condition)yes-pattern|no-pattern)"
    Recall that which of "yes-pattern" or "no-pattern" actually matches
    is already determined. The ordering of the matches is the same as
    for the chosen subexpression.

The above recipes describe the ordering of matches at a given position.
One more rule is needed to understand how a match is determined for the
whole regular expression: a match at an earlier position is always
better than a match at a later position.
1859
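A few of these ordering rules can be checked with a sketch such as the
one below. The test strings and variable names are chosen only for
illustration.

```perl
use strict;
use warnings;

# "S|T": a match for the left alternative is better.
my ($alt)     = 'abc' =~ /(a|ab)/;
my ($alt_rev) = 'abc' =~ /(ab|a)/;

# "S{min,max}" prefers more repetitions; "S{min,max}?" prefers fewer.
my ($greedy) = 'aaaa' =~ /(a{1,3})/;
my ($lazy)   = 'aaaa' =~ /(a{1,3}?)/;

# An earlier position beats a "better" alternative found later on.
my ($early) = 'xaab' =~ /(aab|x)/;

print "$alt $alt_rev $greedy $lazy $early\n";
```

In the last match "aab" is the preferred alternative, but it can only
match at position 1, so "x" at position 0 is chosen.
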
Creating Custom RE Engines
Overloaded constants (see overload) provide a simple way to extend the
functionality of the RE engine.

Suppose that we want to enable a new RE escape-sequence "\Y|" which
matches at a boundary between whitespace characters and non-whitespace
characters. Note that "(?=\S)(?<!\S)|(?!\S)(?<=\S)" matches exactly at
these positions, so we want to have each "\Y|" in the place of the more
complicated version. We can create a module "customre" to do this:

    package customre;
    use overload;

    sub import {
        shift;
        die "No argument to customre::import allowed" if @_;
        overload::constant 'qr' => \&convert;
    }

    sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}

    # We must also take care of not escaping the legitimate \\Y|
    # sequence, hence the presence of '\\' in the conversion rules.
    my %rules = ( '\\' => '\\\\',
                  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
    sub convert {
        my $re = shift;
        $re =~ s{
                  \\ ( \\ | Y . )
                }
                { $rules{$1} or invalid($re,$1) }sgex;
        return $re;
    }

Now "use customre" enables the new escape in constant regular
expressions, i.e., those without any runtime variable interpolations.
As documented in overload, this conversion will work only over literal
parts of regular expressions. For "\Y|$re\Y|" the variable part of
this regular expression needs to be converted explicitly (but only if
the special meaning of "\Y|" should be enabled inside $re):

    use customre;
    $re = <>;
    chomp $re;
    $re = customre::convert $re;
    /\Y|$re\Y|/;
1906
PCRE/Python Support
As of Perl 5.10.0, Perl supports several Python/PCRE-specific
extensions to the regex syntax. While Perl programmers are encouraged
to use the Perl-specific syntax, the following are also accepted:

"(?P<NAME>pattern)"
    Define a named capture buffer. Equivalent to "(?<NAME>pattern)".

"(?P=NAME)"
    Backreference to a named capture buffer. Equivalent to "\g{NAME}".

"(?P>NAME)"
    Subroutine call to a named capture buffer. Equivalent to
    "(?&NAME)".
1921
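The three Python/PCRE-style forms can be exercised with a short sketch
like this one (Perl 5.10 or later). The group names word and paren and
the test strings are made up for the example.

```perl
use strict;
use warnings;

# (?P<word>...) names a group; (?P=word) must match the same text again.
my $doubled;
if ('hello hello world' =~ /(?P<word>\w+)\s+(?P=word)/) {
    $doubled = $+{word};
}

# (?P>paren) calls the named group recursively to match nested parens.
my $balanced;
if ('f(a(b)c)' =~ /(?P<paren>\((?:[^()]++|(?P>paren))*\))/) {
    $balanced = $+{paren};
}

print "$doubled\n";
print "$balanced\n";
```

The same patterns could be written with the Perl-specific equivalents
"(?<word>...)", "\g{word}" and "(?&paren)".
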
BUGS
There are numerous problems with case-insensitive matching of
characters outside the ASCII range, especially with those whose folds
are multiple characters, such as ligatures like "LATIN SMALL LIGATURE
FF".

In a bracketed character class with case-insensitive matching, ranges
only work for ASCII characters. For example, "m/[\N{CYRILLIC CAPITAL
LETTER A}-\N{CYRILLIC CAPITAL LETTER YA}]/i" doesn't match all the
Russian upper and lower case letters.

Many regular expression constructs don't work on EBCDIC platforms.

This document varies from difficult to understand to completely and
utterly opaque. The wandering prose riddled with jargon is hard to
fathom in several places.

This document needs a rewrite that separates the tutorial content from
the reference content.

SEE ALSO
perlrequick.

perlretut.

"Regexp Quote-Like Operators" in perlop.

"Gory details of parsing quoted constructs" in perlop.

perlfaq6.

"pos" in perlfunc.

perllocale.

perlebcdic.

Mastering Regular Expressions by Jeffrey Friedl, published by O'Reilly
and Associates.
1961
perl v5.12.4                      2011-06-07                      PERLRE(1)