PERLRE(1)            Perl Programmers Reference Guide            PERLRE(1)
2
3
4
6 perlre - Perl regular expressions
7
9 This page describes the syntax of regular expressions in Perl.
10
11 If you haven't used regular expressions before, a quick-start
12 introduction is available in perlrequick, and a longer tutorial
13 introduction is available in perlretut.
14
15 For reference on how regular expressions are used in matching
16 operations, plus various examples of the same, see discussions of
17 "m//", "s///", "qr//" and "??" in "Regexp Quote-Like Operators" in
18 perlop.
19
20 Modifiers
21 Matching operations can have various modifiers. Modifiers that relate
22 to the interpretation of the regular expression inside are listed
23 below. Modifiers that alter the way a regular expression is used by
24 Perl are detailed in "Regexp Quote-Like Operators" in perlop and "Gory
25 details of parsing quoted constructs" in perlop.
26
27 m Treat string as multiple lines. That is, change "^" and "$" from
28 matching the start or end of the string to matching the start or
29 end of any line anywhere within the string.
30
31 s Treat string as single line. That is, change "." to match any
32 character whatsoever, even a newline, which normally it would not
33 match.
34
35 Used together, as "/ms", they let the "." match any character
36 whatsoever, while still allowing "^" and "$" to match,
37 respectively, just after and just before newlines within the
38 string.
39
40 i Do case-insensitive pattern matching.
41
If locale matching rules are in effect, the case map is taken from
the current locale for code points less than 256, and from Unicode
rules for larger code points. However, matches that would cross
the Unicode rules/non-Unicode rules boundary (ords 255/256) will
not succeed. See perllocale.
47
48 There are a number of Unicode characters that match multiple
49 characters under "/i". For example, "LATIN SMALL LIGATURE FI"
50 should match the sequence "fi". Perl is not currently able to do
51 this when the multiple characters are in the pattern and are split
52 between groupings, or when one or more are quantified. Thus
53
    "\N{LATIN SMALL LIGATURE FI}" =~ /fi/i;          # Matches
    "\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i;    # Doesn't match!
    "\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i;         # Doesn't match!

    # The below doesn't match, and it isn't clear what $1 and $2 would
    # be even if it did!!
    "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i;      # Doesn't match!
61
62 Perl doesn't match multiple characters in an inverted bracketed
63 character class, which otherwise could be highly confusing. See
64 "Negation" in perlrecharclass.
65
66 Another bug involves character classes that match both a sequence
67 of multiple characters, and an initial sub-string of that sequence.
68 For example,
69
    /[s\xDF]/i
71
72 should match both a single and a double "s", since "\xDF" (on ASCII
73 platforms) matches "ss". However, this bug ([perl #89774]
74 <https://rt.perl.org/rt3/Ticket/Display.html?id=89774>) causes it
75 to only match a single "s", even if the final larger match fails,
76 and matching the double "ss" would have succeeded.
77
78 Also, Perl matching doesn't fully conform to the current Unicode
79 "/i" recommendations, which ask that the matching be made upon the
80 NFD (Normalization Form Decomposed) of the text. However, Unicode
81 is in the process of reconsidering and revising their
82 recommendations.
83
x   Extend your pattern's legibility by permitting whitespace and
    comments. Details in "/x".
86
87 p Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and
88 ${^POSTMATCH} are available for use after matching.
89
90 g and c
91 Global matching, and keep the Current position after failed
92 matching. Unlike i, m, s and x, these two flags affect the way the
93 regex is used rather than the regex itself. See "Using regular
94 expressions in Perl" in perlretut for further explanation of the g
95 and c modifiers.
96
97 a, d, l and u
98 These modifiers, all new in 5.14, affect which character-set
99 semantics (Unicode, etc.) are used, as described below in
100 "Character set modifiers".
101
102 Regular expression modifiers are usually written in documentation as
103 e.g., "the "/x" modifier", even though the delimiter in question might
104 not really be a slash. The modifiers "/imsxadlup" may also be embedded
105 within the regular expression itself using the "(?...)" construct, see
106 "Extended Patterns" below.
107
108 /x
109
110 "/x" tells the regular expression parser to ignore most whitespace that
111 is neither backslashed nor within a character class. You can use this
112 to break up your regular expression into (slightly) more readable
113 parts. The "#" character is also treated as a metacharacter
114 introducing a comment, just as in ordinary Perl code. This also means
115 that if you want real whitespace or "#" characters in the pattern
116 (outside a character class, where they are unaffected by "/x"), then
117 you'll either have to escape them (using backslashes or "\Q...\E") or
118 encode them using octal, hex, or "\N{}" escapes. Taken together, these
119 features go a long way towards making Perl's regular expressions more
120 readable. Note that you have to be careful not to include the pattern
121 delimiter in the comment--perl has no way of knowing you did not intend
122 to close the pattern early. See the C-comment deletion code in perlop.
123 Also note that anything inside a "\Q...\E" stays unaffected by "/x".
124 And note that "/x" doesn't affect space interpretation within a single
125 multi-character construct. For example in "\x{...}", regardless of the
126 "/x" modifier, there can be no spaces. Same for a quantifier such as
127 "{3}" or "{5,}". Similarly, "(?:...)" can't have a space between the
128 "?" and ":", but can between the "(" and "?". Within any delimiters
129 for such a construct, allowed spaces are not affected by "/x", and
130 depend on the construct. For example, "\x{...}" can't have spaces
131 because hexadecimal numbers don't have spaces in them. But, Unicode
132 properties can have spaces, so in "\p{...}" there can be spaces that
133 follow the Unicode rules, for which see "Properties accessible through
134 \p{} and \P{}" in perluniprops.
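
As a small illustration of the rules above, the following sketch uses "/x"
to lay out a pattern for a "key = value" line with an optional trailing
comment. The variable names and the sample input are invented for this
example; it is not taken from the Perl distribution.

```perl
use strict;
use warnings;

# Under /x, unescaped whitespace and "#" comments in the pattern are ignored.
my $kv = qr/
    ^ \s*
    (\w+)           # group 1: the key
    \s* = \s*
    (.*?)           # group 2: the value, matched lazily
    \s*
    (?: \# .* )?    # a backslashed "#" is a literal hash sign
    \z
/x;

my ($key, $value) = "color = red   # default" =~ $kv;

# Literal whitespace must be escaped or placed in a bracketed class.
my $spaced_ok  = "hello world" =~ / hello [ ] world /x ? 1 : 0;
my $spaced_bad = "hello world" =~ / hello world /x     ? 1 : 0;
```

Note that the comments inside the pattern avoid the "/" delimiter and the
"$" and "@" sigils, since the pattern is still scanned for its closing
delimiter and for interpolated variables before the comments are discarded.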
135
136 Character set modifiers
137
138 "/d", "/u", "/a", and "/l", available starting in 5.14, are called the
139 character set modifiers; they affect the character set semantics used
140 for the regular expression.
141
142 The "/d", "/u", and "/l" modifiers are not likely to be of much use to
143 you, and so you need not worry about them very much. They exist for
144 Perl's internal use, so that complex regular expression data structures
145 can be automatically serialized and later exactly reconstituted,
146 including all their nuances. But, since Perl can't keep a secret, and
147 there may be rare instances where they are useful, they are documented
148 here.
149
150 The "/a" modifier, on the other hand, may be useful. Its purpose is to
151 allow code that is to work mostly on ASCII data to not have to concern
152 itself with Unicode.
153
154 Briefly, "/l" sets the character set to that of whatever Locale is in
155 effect at the time of the execution of the pattern match.
156
157 "/u" sets the character set to Unicode.
158
159 "/a" also sets the character set to Unicode, BUT adds several
160 restrictions for ASCII-safe matching.
161
162 "/d" is the old, problematic, pre-5.14 Default character set behavior.
163 Its only use is to force that old behavior.
164
165 At any given time, exactly one of these modifiers is in effect. Their
166 existence allows Perl to keep the originally compiled behavior of a
167 regular expression, regardless of what rules are in effect when it is
168 actually executed. And if it is interpolated into a larger regex, the
169 original's rules continue to apply to it, and only it.
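
A minimal sketch of that last point, assuming Perl 5.14 or later: a pattern
compiled with "/a" is interpolated into one compiled with "/u", and each
part keeps its own rules. The variable names are hypothetical.

```perl
use strict;
use warnings;

my $ascii_digit = qr/\d/a;                  # compiled with ASCII-only \d
my $mixed       = qr/^$ascii_digit\d\z/u;   # outer pattern uses Unicode rules

my $bengali_four = "\x{09EA}";              # BENGALI DIGIT FOUR

# The interpolated piece keeps its /a rules; the outer \d uses /u rules.
my $ascii_then_bengali = "1$bengali_four"   =~ $mixed ? 1 : 0;
my $bengali_then_ascii = "${bengali_four}1" =~ $mixed ? 1 : 0;
```

Here only the first string should match, because the first "\d" comes from
the "/a" pattern and so accepts only "0" through "9".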
170
171 The "/l" and "/u" modifiers are automatically selected for regular
172 expressions compiled within the scope of various pragmas, and we
173 recommend that in general, you use those pragmas instead of specifying
these modifiers explicitly. For one thing, the modifiers affect only
pattern matching, and do not extend even to any replacement done,
whereas using the pragmas gives consistent results for all appropriate
operations within their scopes. For example,
178
    s/foo/\Ubar/il
180
181 will match "foo" using the locale's rules for case-insensitive
182 matching, but the "/l" does not affect how the "\U" operates. Most
183 likely you want both of them to use locale rules. To do this, instead
184 compile the regular expression within the scope of "use locale". This
185 both implicitly adds the "/l" and applies locale rules to the "\U".
186 The lesson is to "use locale" and not "/l" explicitly.
187
Similarly, it would be better to use "use feature 'unicode_strings'"
instead of

    s/foo/\Lbar/iu
192
193 to get Unicode rules, as the "\L" in the former (but not necessarily
194 the latter) would also use Unicode rules.
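
The difference can be sketched as follows, assuming an ASCII platform, no
locale in effect, and Perl 5.14 or later. The replacement text here is a
non-ASCII character chosen for illustration, since "\L" on plain "bar"
would behave the same either way.

```perl
use strict;
use warnings;

my $upper_e_acute = "\xC9";   # LATIN CAPITAL LETTER E WITH ACUTE, not UTF-8 encoded

my ($with_feature, $with_u_only);
{
    use feature 'unicode_strings';
    # The pragma covers the \L in the replacement as well as the match.
    ($with_feature = "foo") =~ s/foo/\L$upper_e_acute/i;
}
{
    no feature 'unicode_strings';
    # /u covers only the match; \L falls back to the native rules.
    ($with_u_only = "foo") =~ s/foo/\L$upper_e_acute/iu;
}
```

Under those assumptions the first result is the lowercase "\xE9", while the
second is left as "\xC9".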
195
196 More detail on each of the modifiers follows. Most likely you don't
197 need to know this detail for "/l", "/u", and "/d", and can skip ahead
198 to /a.
199
200 /l
201
202 means to use the current locale's rules (see perllocale) when pattern
203 matching. For example, "\w" will match the "word" characters of that
204 locale, and "/i" case-insensitive matching will match according to the
205 locale's case folding rules. The locale used will be the one in effect
206 at the time of execution of the pattern match. This may not be the
207 same as the compilation-time locale, and can differ from one match to
208 another if there is an intervening call of the setlocale() function.
209
210 Perl only supports single-byte locales. This means that code points
211 above 255 are treated as Unicode no matter what locale is in effect.
212 Under Unicode rules, there are a few case-insensitive matches that
213 cross the 255/256 boundary. These are disallowed under "/l". For
214 example, 0xFF (on ASCII platforms) does not caselessly match the
215 character at 0x178, "LATIN CAPITAL LETTER Y WITH DIAERESIS", because
216 0xFF may not be "LATIN SMALL LETTER Y WITH DIAERESIS" in the current
217 locale, and Perl has no way of knowing if that character even exists in
218 the locale, much less what code point it is.
219
220 This modifier may be specified to be the default by "use locale", but
221 see "Which character set modifier is in effect?".
222
223 /u
224
225 means to use Unicode rules when pattern matching. On ASCII platforms,
226 this means that the code points between 128 and 255 take on their
227 Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's).
228 (Otherwise Perl considers their meanings to be undefined.) Thus, under
229 this modifier, the ASCII platform effectively becomes a Unicode
230 platform; and hence, for example, "\w" will match any of the more than
231 100_000 word characters in Unicode.
232
233 Unlike most locales, which are specific to a language and country pair,
234 Unicode classifies all the characters that are letters somewhere in the
235 world as "\w". For example, your locale might not think that "LATIN
236 SMALL LETTER ETH" is a letter (unless you happen to speak Icelandic),
237 but Unicode does. Similarly, all the characters that are decimal
238 digits somewhere in the world will match "\d"; this is hundreds, not
239 10, possible matches. And some of those digits look like some of the
240 10 ASCII digits, but mean a different number, so a human could easily
241 think a number is a different quantity than it really is. For example,
242 "BENGALI DIGIT FOUR" (U+09EA) looks very much like an "ASCII DIGIT
EIGHT" (U+0038). Also, "\d+" may match strings of digits that are a
mixture from different writing systems, creating a security issue.
"num()" in Unicode::UCD can be used to sort this out. Or the "/a"
modifier can be used to force "\d" to match just the ASCII 0 through 9.
247
248 Also, under this modifier, case-insensitive matching works on the full
set of Unicode characters. The "KELVIN SIGN", for example, matches the
250 letters "k" and "K"; and "LATIN SMALL LIGATURE FF" matches the sequence
251 "ff", which, if you're not prepared, might make it look like a
252 hexadecimal constant, presenting another potential security issue. See
253 <http://unicode.org/reports/tr36> for a detailed discussion of Unicode
254 security issues.
255
256 On the EBCDIC platforms that Perl handles, the native character set is
257 equivalent to Latin-1. Thus this modifier changes behavior only when
258 the "/i" modifier is also specified, and it turns out it affects only
259 two characters, giving them full Unicode semantics: the "MICRO SIGN"
260 will match the Greek capital and small letters "MU", otherwise not; and
261 the "LATIN CAPITAL LETTER SHARP S" will match any of "SS", "Ss", "sS",
262 and "ss", otherwise not.
263
This modifier may be specified to be the default by "use feature
'unicode_strings'", "use locale ':not_characters'", or "use 5.012" (or
higher), but see "Which character set modifier is in effect?".
267
268 /d
269
270 This modifier means to use the "Default" native rules of the platform
271 except when there is cause to use Unicode rules instead, as follows:
272
273 1. the target string is encoded in UTF-8; or
274
275 2. the pattern is encoded in UTF-8; or
276
277 3. the pattern explicitly mentions a code point that is above 255 (say
278 by "\x{100}"); or
279
280 4. the pattern uses a Unicode name ("\N{...}"); or
281
282 5. the pattern uses a Unicode property ("\p{...}")
283
284 Another mnemonic for this modifier is "Depends", as the rules actually
285 used depend on various things, and as a result you can get unexpected
286 results. See "The "Unicode Bug"" in perlunicode. The Unicode Bug has
287 become rather infamous, leading to yet another (printable) name for
288 this modifier, "Dodgy".
289
290 On ASCII platforms, the native rules are ASCII, and on EBCDIC platforms
291 (at least the ones that Perl handles), they are Latin-1.
292
293 Here are some examples of how that works on an ASCII platform:
294
    $str =  "\xDF";      # $str is not in UTF-8 format.
    $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format.
    $str .= "\x{0e0b}";  # Now $str is in UTF-8 format.
    $str =~ /^\w/;       # Match! $str is now in UTF-8 format.
    chop $str;
    $str =~ /^\w/;       # Still a match! $str remains in UTF-8 format.
301
302 This modifier is automatically selected by default when none of the
303 others are, so yet another name for it is "Default".
304
305 Because of the unexpected behaviors associated with this modifier, you
306 probably should only use it to maintain weird backward compatibilities.
307
308 /a (and /aa)
309
310 This modifier stands for ASCII-restrict (or ASCII-safe). This
311 modifier, unlike the others, may be doubled-up to increase its effect.
312
313 When it appears singly, it causes the sequences "\d", "\s", "\w", and
314 the Posix character classes to match only in the ASCII range. They
315 thus revert to their pre-5.6, pre-Unicode meanings. Under "/a", "\d"
316 always means precisely the digits "0" to "9"; "\s" means the five
317 characters "[ \f\n\r\t]"; "\w" means the 63 characters "[A-Za-z0-9_]";
318 and likewise, all the Posix classes such as "[[:print:]]" match only
319 the appropriate ASCII-range characters.
320
321 This modifier is useful for people who only incidentally use Unicode,
322 and who do not wish to be burdened with its complexities and security
323 concerns.
324
325 With "/a", one can write "\d" with confidence that it will only match
326 ASCII characters, and should the need arise to match beyond ASCII, you
327 can instead use "\p{Digit}" (or "\p{Word}" for "\w"). There are
328 similar "\p{...}" constructs that can match beyond ASCII both white
329 space (see "Whitespace" in perlrecharclass), and Posix classes (see
330 "POSIX Character Classes" in perlrecharclass). Thus, this modifier
331 doesn't mean you can't use Unicode, it means that to get Unicode
332 matching you must explicitly use a construct ("\p{}", "\P{}") that
333 signals Unicode.
334
335 As you would expect, this modifier causes, for example, "\D" to mean
336 the same thing as "[^0-9]"; in fact, all non-ASCII characters match
337 "\D", "\S", and "\W". "\b" still means to match at the boundary
338 between "\w" and "\W", using the "/a" definitions of them (similarly
339 for "\B").
340
341 Otherwise, "/a" behaves like the "/u" modifier, in that case-
342 insensitive matching uses Unicode semantics; for example, "k" will
match the Unicode "\N{KELVIN SIGN}" under "/i" matching, and code
points in the Latin1 range, above ASCII, will have Unicode rules when it
comes to case-insensitive matching.
346
347 To forbid ASCII/non-ASCII matches (like "k" with "\N{KELVIN SIGN}"),
348 specify the "a" twice, for example "/aai" or "/aia". (The first
349 occurrence of "a" restricts the "\d", etc., and the second occurrence
350 adds the "/i" restrictions.) But, note that code points outside the
351 ASCII range will use Unicode rules for "/i" matching, so the modifier
352 doesn't really restrict things to just ASCII; it just forbids the
353 intermixing of ASCII and non-ASCII.
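
The effect of "/a" and "/aa" described above can be sketched like this,
assuming Perl 5.14 or later. The code points are written with "\x{...}"
rather than "\N{...}" only to keep the example self-contained.

```perl
use strict;
use warnings;

my $bengali_four = "\x{09EA}";   # BENGALI DIGIT FOUR
my $kelvin       = "\x{212A}";   # KELVIN SIGN

my $digit_u = $bengali_four =~ /^\d\z/u ? 1 : 0;   # Unicode \d
my $digit_a = $bengali_four =~ /^\d\z/a ? 1 : 0;   # ASCII-only \d

my $kelvin_a  = $kelvin =~ /^k\z/ai  ? 1 : 0;   # /a still folds "k" to KELVIN SIGN
my $kelvin_aa = $kelvin =~ /^k\z/aai ? 1 : 0;   # /aa forbids that fold
```

The first and third tests succeed; the second and fourth do not.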
354
355 To summarize, this modifier provides protection for applications that
356 don't wish to be exposed to all of Unicode. Specifying it twice gives
357 added protection.
358
359 This modifier may be specified to be the default by "use re '/a'" or
360 "use re '/aa'". If you do so, you may actually have occasion to use
the "/u" modifier explicitly if there are a few regular expressions
362 where you do want full Unicode rules (but even here, it's best if
363 everything were under feature "unicode_strings", along with the "use re
364 '/aa'"). Also see "Which character set modifier is in effect?".
365
366 Which character set modifier is in effect?
367
368 Which of these modifiers is in effect at any given point in a regular
369 expression depends on a fairly complex set of interactions. These have
370 been designed so that in general you don't have to worry about it, but
371 this section gives the gory details. As explained below in "Extended
372 Patterns" it is possible to explicitly specify modifiers that apply
373 only to portions of a regular expression. The innermost always has
374 priority over any outer ones, and one applying to the whole expression
375 has priority over any of the default settings that are described in the
376 remainder of this section.
377
378 The "use re '/foo'" pragma can be used to set default modifiers
379 (including these) for regular expressions compiled within its scope.
380 This pragma has precedence over the other pragmas listed below that
381 also change the defaults.
382
383 Otherwise, "use locale" sets the default modifier to "/l"; and "use
feature 'unicode_strings'", or "use 5.012" (or higher) set the default
385 to "/u" when not in the same scope as either "use locale" or "use
386 bytes". ("use locale ':not_characters'" also sets the default to "/u",
387 overriding any plain "use locale".) Unlike the mechanisms mentioned
above, these affect operations besides regular expression pattern
389 matching, and so give more consistent results with other operators,
390 including using "\U", "\l", etc. in substitution replacements.
391
392 If none of the above apply, for backwards compatibility reasons, the
393 "/d" modifier is the one in effect by default. As this can lead to
394 unexpected results, it is best to specify which other rule set should
395 be used.
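
As a rough sketch of that precedence, assuming Perl 5.14 or later and no
other pragmas in scope: "use re '/a'" sets the default inside its block, an
explicit modifier on a single pattern overrides it, and the "/d" default
returns once the block ends. The variable names are hypothetical.

```perl
use strict;
use warnings;

my $bengali_four = "\x{09EA}";   # BENGALI DIGIT FOUR
my ($by_default, $explicit_u, $outside);
{
    use re '/a';   # default for patterns compiled in this block
    $by_default = $bengali_four =~ /^\d\z/  ? 1 : 0;
    $explicit_u = $bengali_four =~ /^\d\z/u ? 1 : 0;   # explicit modifier wins
}
# Back under /d: the target is UTF-8 encoded, so Unicode rules are used.
$outside = $bengali_four =~ /^\d\z/ ? 1 : 0;
```

Only the pattern that relies on the "/a" default rejects the Bengali digit.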
396
397 Character set modifier behavior prior to Perl 5.14
398
399 Prior to 5.14, there were no explicit modifiers, but "/l" was implied
400 for regexes compiled within the scope of "use locale", and "/d" was
401 implied otherwise. However, interpolating a regex into a larger regex
402 would ignore the original compilation in favor of whatever was in
403 effect at the time of the second compilation. There were a number of
404 inconsistencies (bugs) with the "/d" modifier, where Unicode rules
405 would be used when inappropriate, and vice versa. "\p{}" did not imply
406 Unicode rules, and neither did all occurrences of "\N{}", until 5.12.
407
408 Regular Expressions
409 Metacharacters
410
411 The patterns used in Perl pattern matching evolved from those supplied
412 in the Version 8 regex routines. (The routines are derived (distantly)
413 from Henry Spencer's freely redistributable reimplementation of the V8
414 routines.) See "Version 8 Regular Expressions" for details.
415
416 In particular the following metacharacters have their standard
417 egrep-ish meanings:
418
    \        Quote the next metacharacter
    ^        Match the beginning of the line
    .        Match any character (except newline)
    $        Match the end of the line (or before newline at the end)
    |        Alternation
    ()       Grouping
    []       Bracketed Character class
426
427 By default, the "^" character is guaranteed to match only the beginning
428 of the string, the "$" character only the end (or before the newline at
429 the end), and Perl does certain optimizations with the assumption that
430 the string contains only one line. Embedded newlines will not be
431 matched by "^" or "$". You may, however, wish to treat a string as a
432 multi-line buffer, such that the "^" will match after any newline
433 within the string (except if the newline is the last character in the
434 string), and "$" will match before any newline. At the cost of a
435 little more overhead, you can do this by using the /m modifier on the
436 pattern match operator. (Older programs did this by setting $*, but
437 this option was removed in perl 5.9.)
438
439 To simplify multi-line substitutions, the "." character never matches a
440 newline unless you use the "/s" modifier, which in effect tells Perl to
441 pretend the string is a single line--even if it isn't.
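
The following sketch, with an invented two-line string, shows how "/m"
changes "^" and how "/s" changes ".":

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

my @line_starts_plain = $text =~ /^(\w+)/g;    # ^ only at start of string
my @line_starts_multi = $text =~ /^(\w+)/mg;   # ^ after each newline too

my $dot_plain  = $text =~ /first.*second/  ? 1 : 0;   # "." stops at newline
my $dot_single = $text =~ /first.*second/s ? 1 : 0;   # "." crosses newline
```

Without "/m" only "first" is captured; with it, both "first" and "second"
are. Likewise only the "/s" form of the second pattern matches.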
442
443 Quantifiers
444
445 The following standard quantifiers are recognized:
446
    *           Match 0 or more times
    +           Match 1 or more times
    ?           Match 1 or 0 times
    {n}         Match exactly n times
    {n,}        Match at least n times
    {n,m}       Match at least n but not more than m times
453
454 (If a curly bracket occurs in any other context and does not form part
455 of a backslashed sequence like "\x{...}", it is treated as a regular
456 character. In particular, the lower quantifier bound is not optional.
457 However, in Perl v5.18, it is planned to issue a deprecation warning
458 for all such occurrences, and in Perl v5.20 to require literal uses of
459 a curly bracket to be escaped, say by preceding them with a backslash
460 or enclosing them within square brackets, ("\{" or "[{]"). This change
461 will allow for future syntax extensions (like making the lower bound of
462 a quantifier optional), and better error checking of quantifiers. Now,
463 a typo in a quantifier silently causes it to be treated as the literal
464 characters. For example,
465
    /o{4,3}/
467
468 looks like a quantifier that matches 0 times, since 4 is greater than
469 3, but it really means to match the sequence of six characters
470 "o { 4 , 3 }".)
471
472 The "*" quantifier is equivalent to "{0,}", the "+" quantifier to
473 "{1,}", and the "?" quantifier to "{0,1}". n and m are limited to non-
474 negative integral values less than a preset limit defined when perl is
475 built. This is usually 32766 on the most common platforms. The actual
476 limit can be seen in the error message generated by code such as this:
477
    $_ **= $_ , / {$_} / for 2 .. 42;
479
480 By default, a quantified subpattern is "greedy", that is, it will match
481 as many times as possible (given a particular starting location) while
482 still allowing the rest of the pattern to match. If you want it to
483 match the minimum number of times possible, follow the quantifier with
484 a "?". Note that the meanings don't change, just the "greediness":
485
    *?        Match 0 or more times, not greedily
    +?        Match 1 or more times, not greedily
    ??        Match 0 or 1 time, not greedily
    {n}?      Match exactly n times, not greedily (redundant)
    {n,}?     Match at least n times, not greedily
    {n,m}?    Match at least n but not more than m times, not greedily
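
A short sketch of the difference between the greedy and non-greedy forms,
using an invented sample string:

```perl
use strict;
use warnings;

my $html = "<b>one</b><b>two</b>";

# Greedy: .* runs to the last "</b>" that still lets the pattern match.
my ($greedy) = $html =~ m{<b>(.*)</b>};

# Non-greedy: .*? stops at the first "</b>".
my ($lazy) = $html =~ m{<b>(.*?)</b>};
```

The greedy capture is "one</b><b>two", while the non-greedy capture is
just "one".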
492
493 By default, when a quantified subpattern does not allow the rest of the
494 overall pattern to match, Perl will backtrack. However, this behaviour
495 is sometimes undesirable. Thus Perl provides the "possessive"
496 quantifier form as well.
497
    *+     Match 0 or more times and give nothing back
    ++     Match 1 or more times and give nothing back
    ?+     Match 0 or 1 time and give nothing back
    {n}+   Match exactly n times and give nothing back (redundant)
    {n,}+  Match at least n times and give nothing back
    {n,m}+ Match at least n but not more than m times and give nothing back
504
505 For instance,
506
    'aaaa' =~ /a++a/
508
509 will never match, as the "a++" will gobble up all the "a"'s in the
510 string and won't leave any for the remaining part of the pattern. This
511 feature can be extremely useful to give perl hints about where it
512 shouldn't backtrack. For instance, the typical "match a double-quoted
513 string" problem can be most efficiently performed when written as:
514
    /"(?:[^"\\]++|\\.)*+"/
516
517 as we know that if the final quote does not match, backtracking will
518 not help. See the independent subexpression ""(?>pattern)"" for more
519 details; possessive quantifiers are just syntactic sugar for that
520 construct. For instance the above example could also be written as
521 follows:
522
    /"(?>(?:(?>[^"\\]+)|\\.)*)"/
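
The two patterns discussed above can be exercised as follows. This is only
a sketch, assuming Perl 5.10 or later; the sample sentence is invented.

```perl
use strict;
use warnings;

my $possessive = 'aaaa' =~ /a++a/ ? 1 : 0;   # a++ keeps every "a"
my $greedy     = 'aaaa' =~ /a+a/  ? 1 : 0;   # a+ gives one "a" back

# The double-quoted-string pattern from the text, wrapped in a capture.
my $dq_string = qr/"(?:[^"\\]++|\\.)*+"/;
my ($quoted)  = 'say "a \"quoted\" word" now' =~ /($dq_string)/;
```

The possessive form fails where the ordinary greedy form succeeds, and the
capture holds the whole quoted string, backslashed quotes included.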
524
525 Escape sequences
526
527 Because patterns are processed as double-quoted strings, the following
528 also work:
529
    \t           tab                   (HT, TAB)
    \n           newline               (LF, NL)
    \r           return                (CR)
    \f           form feed             (FF)
    \a           alarm (bell)          (BEL)
    \e           escape (think troff)  (ESC)
    \cK          control char          (example: VT)
    \x{}, \x00   character whose ordinal is the given hexadecimal number
    \N{name}     named Unicode character or character sequence
    \N{U+263D}   Unicode character     (example: FIRST QUARTER MOON)
    \o{}, \000   character whose ordinal is the given octal number
    \l           lowercase next char (think vi)
    \u           uppercase next char (think vi)
    \L           lowercase till \E (think vi)
    \U           uppercase till \E (think vi)
    \Q           quote (disable) pattern metacharacters till \E
    \E           end either case modification or quoted section, think vi
547
548 Details are in "Quote and Quote-like Operators" in perlop.
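
As one example of these escapes in use, "\Q...\E" is handy when a variable
holds text that should be matched literally. A minimal sketch with invented
data:

```perl
use strict;
use warnings;

my $literal = "a.b*c";

# Unquoted, "." and "*" keep their metacharacter meanings.
my $unquoted_hit = "aXbbbc" =~ /^$literal\z/     ? 1 : 0;

# Quoted, the interpolated text must appear exactly as written.
my $quoted_miss  = "aXbbbc" =~ /^\Q$literal\E\z/ ? 1 : 0;
my $quoted_hit   = "a.b*c"  =~ /^\Q$literal\E\z/ ? 1 : 0;
```

Only the quoted pattern treats the contents of $literal as plain text.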
549
550 Character Classes and other Special Escapes
551
552 In addition, Perl defines the following:
553
    Sequence   Note  Description
    [...]      [1]   Match a character according to the rules of the
                     bracketed character class defined by the "...".
                     Example: [a-z] matches "a" or "b" or "c" ... or "z"
    [[:...:]]  [2]   Match a character according to the rules of the POSIX
                     character class "..." within the outer bracketed
                     character class. Example: [[:upper:]] matches any
                     uppercase character.
    \w         [3]   Match a "word" character (alphanumeric plus "_", plus
                     other connector punctuation chars plus Unicode
                     marks)
    \W         [3]   Match a non-"word" character
    \s         [3]   Match a whitespace character
    \S         [3]   Match a non-whitespace character
    \d         [3]   Match a decimal digit character
    \D         [3]   Match a non-digit character
    \pP        [3]   Match P, named property. Use \p{Prop} for longer names
    \PP        [3]   Match non-P
    \X         [4]   Match Unicode "eXtended grapheme cluster"
    \C               Match a single C-language char (octet) even if that is
                     part of a larger UTF-8 character. Thus it breaks up
                     characters into their UTF-8 bytes, so you may end up
                     with malformed pieces of UTF-8. Unsupported in
                     lookbehind.
    \1         [5]   Backreference to a specific capture group or buffer.
                     '1' may actually be any positive integer.
    \g1        [5]   Backreference to a specific or previous group,
    \g{-1}     [5]   The number may be negative indicating a relative
                     previous group and may optionally be wrapped in
                     curly brackets for safer parsing.
    \g{name}   [5]   Named backreference
    \k<name>   [5]   Named backreference
    \K         [6]   Keep the stuff left of the \K, don't include it in $&
    \N         [7]   Any character but \n (experimental). Not affected by
                     /s modifier
    \v         [3]   Vertical whitespace
    \V         [3]   Not vertical whitespace
    \h         [3]   Horizontal whitespace
    \H         [3]   Not horizontal whitespace
    \R         [4]   Linebreak
594
595 [1] See "Bracketed Character Classes" in perlrecharclass for details.
596
597 [2] See "POSIX Character Classes" in perlrecharclass for details.
598
599 [3] See "Backslash sequences" in perlrecharclass for details.
600
601 [4] See "Misc" in perlrebackslash for details.
602
603 [5] See "Capture groups" below for details.
604
605 [6] See "Extended Patterns" below for details.
606
607 [7] Note that "\N" has two meanings. When of the form "\N{NAME}", it
608 matches the character or character sequence whose name is "NAME";
609 and similarly when of the form "\N{U+hex}", it matches the
610 character whose Unicode code point is hex. Otherwise it matches
611 any character but "\n".
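
A few of the sequences in the table can be tried out as below. This is a
sketch only, assuming Perl 5.12 or later (for "\N"); the strings are
invented for illustration.

```perl
use strict;
use warnings;

# \K: text matched to its left is kept out of the part being replaced.
(my $kept = "price: 100") =~ s/price:\h*\K\d+/200/;

# \R: a generic linebreak, so "\r\n" is treated as a single separator.
my @lines = split /\R/, "a\r\nb\nc";

# \N never matches a newline, even under /s, whereas "." then does.
my $n_under_s   = "a\nb" =~ /a\Nb/s ? 1 : 0;
my $dot_under_s = "a\nb" =~ /a.b/s  ? 1 : 0;
```

Here $kept becomes "price: 200", @lines holds three elements, and only the
"." form of the last pair matches.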
612
613 Assertions
614
615 Perl defines the following zero-width assertions:
616
    \b  Match a word boundary
    \B  Match except at a word boundary
    \A  Match only at beginning of string
    \Z  Match only at end of string, or before newline at the end
    \z  Match only at end of string
    \G  Match only at pos() (e.g. at the end-of-match position
        of prior m//g)
624
625 A word boundary ("\b") is a spot between two characters that has a "\w"
626 on one side of it and a "\W" on the other side of it (in either order),
627 counting the imaginary characters off the beginning and end of the
628 string as matching a "\W". (Within character classes "\b" represents
629 backspace rather than a word boundary, just as it normally does in any
630 double-quoted string.) The "\A" and "\Z" are just like "^" and "$",
631 except that they won't match multiple times when the "/m" modifier is
632 used, while "^" and "$" will match at every internal line boundary. To
633 match the actual end of the string and not ignore an optional trailing
634 newline, use "\z".
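
The distinctions drawn above can be sketched with a couple of short,
invented strings:

```perl
use strict;
use warnings;

my $with_newline = "line\n";
my $dollar  = $with_newline =~ /line$/  ? 1 : 0;   # allows a trailing newline
my $big_z   = $with_newline =~ /line\Z/ ? 1 : 0;   # likewise
my $small_z = $with_newline =~ /line\z/ ? 1 : 0;   # true end of string only

my $two_lines = "a\nb";
my $caret_m = $two_lines =~ /^b/m  ? 1 : 0;   # ^ matches at the inner line start
my $big_a_m = $two_lines =~ /\Ab/m ? 1 : 0;   # \A is unaffected by /m

# \b marks the edges of each run of \w characters.
my @words = "foo, bar-baz" =~ /\b(\w+)\b/g;
```

Only "\z" rejects the string with the trailing newline, only "^" is moved
by "/m", and @words receives "foo", "bar" and "baz".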
635
636 The "\G" assertion can be used to chain global matches (using "m//g"),
637 as described in "Regexp Quote-Like Operators" in perlop. It is also
638 useful when writing "lex"-like scanners, when you have several patterns
that you want to match against consecutive substrings of your string;
640 see the previous reference. The actual location where "\G" will match
641 can also be influenced by using "pos()" as an lvalue: see "pos" in
642 perlfunc. Note that the rule for zero-length matches (see "Repeated
643 Patterns Matching a Zero-length Substring") is modified somewhat, in
644 that contents to the left of "\G" are not counted when determining the
645 length of the match. Thus the following will not match forever:
646
    my $string = 'ABC';
    pos($string) = 1;
    while ($string =~ /(.\G)/g) {
        print $1;
    }
652
653 It will print 'A' and then terminate, as it considers the match to be
654 zero-width, and thus will not match at the same position twice in a
655 row.
656
657 It is worth noting that "\G" improperly used can result in an infinite
658 loop. Take care when using patterns that include "\G" in an
659 alternation.
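
A "lex"-like scanner of the kind mentioned above might be sketched as
follows. The token names and the input are made up for this example; the
"/c" modifier keeps pos() in place when one alternative fails so that the
next can be tried from the same spot.

```perl
use strict;
use warnings;

my $input = "x=42";
my @tokens;

while (1) {
    # Each pattern is anchored with \G to where the previous match ended.
    if    ($input =~ /\G([a-z]+)/gc) { push @tokens, "IDENT:$1" }
    elsif ($input =~ /\G(\d+)/gc)    { push @tokens, "NUM:$1"   }
    elsif ($input =~ /\G(=)/gc)      { push @tokens, "OP:$1"    }
    else                             { last }   # nothing matches here: stop
}

my $stopped_at = pos($input);   # offset at which scanning ended
```

For this input the loop collects three tokens and stops at offset 4, the
end of the string.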
660
661 Capture groups
662
663 The bracketing construct "( ... )" creates capture groups (also
664 referred to as capture buffers). To refer to the current contents of a
665 group later on, within the same pattern, use "\g1" (or "\g{1}") for the
666 first, "\g2" (or "\g{2}") for the second, and so on. This is called a
667 backreference.
668
669
676 There is no limit to the number of captured substrings that you may
677 use. Groups are numbered with the leftmost open parenthesis being
678 number 1, etc. If a group did not match, the associated backreference
679 won't match either. (This can happen if the group is optional, or in a
680 different branch of an alternation.) You can omit the "g", and write
681 "\1", etc, but there are some issues with this form, described below.
682
683 You can also refer to capture groups relatively, by using a negative
684 number, so that "\g-1" and "\g{-1}" both refer to the immediately
685 preceding capture group, and "\g-2" and "\g{-2}" both refer to the
686 group before it. For example:
687
688 /
689 (Y) # group 1
690 ( # group 2
691 (X) # group 3
692 \g{-1} # backref to group 3
693 \g{-3} # backref to group 1
694 )
695 /x
696
697 would match the same as "/(Y) ( (X) \g3 \g1 )/x". This allows you to
698 interpolate regexes into larger regexes and not have to worry about the
699 capture groups being renumbered.
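
The following sketch illustrates that renumbering concern. The two fragments are hypothetical; each is meant to match "some character, then the same character again".

```perl
use strict;
use warnings;

my $relative = qr/(.)\g{-1}/;   # "the group just before this point"
my $absolute = qr/(.)\1/;       # "group 1 of whatever pattern this ends up in"

# Once embedded after another group, the fragment's own group becomes
# number 2. The relative form still checks for a doubled character...
my $rel_ok = "x-aab" =~ /(\w)-.*?$relative/ ? 1 : 0;

# ...while the absolute form now looks for a character followed by "x".
my $abs_ok = "x-aab" =~ /(\w)-.*?$absolute/ ? 1 : 0;

print "relative: $rel_ok, absolute: $abs_ok\n";   # relative: 1, absolute: 0
```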
700
You can dispense with numbers altogether and create named capture
groups. The notation is "(?<name>...)" to declare and "\g{name}" to
reference. (To be compatible with .NET regular expressions, "\g{name}"
may also be written as "\k{name}", "\k<name>" or "\k'name'".) name
must not begin with a number, nor contain hyphens. When different
groups within the same pattern have the same name, any reference to
that name assumes the leftmost defined group. Named groups count in
absolute and relative numbering, and so can also be referred to by
those numbers. (It's possible to do things with named capture groups
that would otherwise require "(??{})".)
711
712 Capture group contents are dynamically scoped and available to you
713 outside the pattern until the end of the enclosing block or until the
714 next successful match, whichever comes first. (See "Compound
715 Statements" in perlsyn.) You can refer to them by absolute number
716 (using "$1" instead of "\g1", etc); or by name via the "%+" hash, using
717 "$+{name}".
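
A minimal sketch of reading captures by name and by number after a match; the date string and the group names are made up for the example:

```perl
use strict;
use warnings;

my $stamp = "2011-05-14";
my %date;

if ($stamp =~ /(?<year>\d{4})-(?<month>\d\d)-(?<day>\d\d)/) {
    %date = %+;                           # copy before another match replaces it
    print "year by name:   $+{year}\n";   # 2011
    print "year by number: $1\n";         # 2011
}
```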
718
719 Braces are required in referring to named capture groups, but are
720 optional for absolute or relative numbered ones. Braces are safer when
721 creating a regex by concatenating smaller strings. For example if you
722 have "qr/$a$b/", and $a contained "\g1", and $b contained "37", you
723 would get "/\g137/" which is probably not what you intended.
724
725 The "\g" and "\k" notations were introduced in Perl 5.10.0. Prior to
726 that there were no named nor relative numbered capture groups.
727 Absolute numbered groups were referred to using "\1", "\2", etc., and
728 this notation is still accepted (and likely always will be). But it
729 leads to some ambiguities if there are more than 9 capture groups, as
730 "\10" could mean either the tenth capture group, or the character whose
731 ordinal in octal is 010 (a backspace in ASCII). Perl resolves this
732 ambiguity by interpreting "\10" as a backreference only if at least 10
733 left parentheses have opened before it. Likewise "\11" is a
734 backreference only if at least 11 left parentheses have opened before
735 it. And so on. "\1" through "\9" are always interpreted as
736 backreferences. There are several examples below that illustrate these
737 perils. You can avoid the ambiguity by always using "\g{}" or "\g" if
738 you mean capturing groups; and for octal constants always using "\o{}",
739 or for "\077" and below, using 3 digits padded with leading zeros,
740 since a leading zero implies an octal constant.
741
742 The "\digit" notation also works in certain circumstances outside the
743 pattern. See "Warning on \1 Instead of $1" below for details.
744
745 Examples:
746
747 s/^([^ ]*) *([^ ]*)/$2 $1/; # swap first two words
748
749 /(.)\g1/ # find first doubled char
750 and print "'$1' is the first doubled character\n";
751
752 /(?<char>.)\k<char>/ # ... a different way
753 and print "'$+{char}' is the first doubled character\n";
754
755 /(?'char'.)\g1/ # ... mix and match
756 and print "'$1' is the first doubled character\n";
757
758 if (/Time: (..):(..):(..)/) { # parse out values
759 $hours = $1;
760 $minutes = $2;
761 $seconds = $3;
762 }
763
764 /(.)(.)(.)(.)(.)(.)(.)(.)(.)\g10/ # \g10 is a backreference
765 /(.)(.)(.)(.)(.)(.)(.)(.)(.)\10/ # \10 is octal
766 /((.)(.)(.)(.)(.)(.)(.)(.)(.))\10/ # \10 is a backreference
767 /((.)(.)(.)(.)(.)(.)(.)(.)(.))\010/ # \010 is octal
768
769 $a = '(.)\1'; # Creates problems when concatenated.
770 $b = '(.)\g{1}'; # Avoids the problems.
771 "aa" =~ /${a}/; # True
772 "aa" =~ /${b}/; # True
773 "aa0" =~ /${a}0/; # False!
774 "aa0" =~ /${b}0/; # True
775 "aa\x08" =~ /${a}0/; # True!
776 "aa\x08" =~ /${b}0/; # False
777
778 Several special variables also refer back to portions of the previous
779 match. $+ returns whatever the last bracket match matched. $& returns
780 the entire matched string. (At one point $0 did also, but now it
781 returns the name of the program.) "$`" returns everything before the
782 matched string. "$'" returns everything after the matched string. And
783 $^N contains whatever was matched by the most-recently closed group
784 (submatch). $^N can be used in extended patterns (see below), for
785 example to assign a submatch to a variable.
786
787 These special variables, like the "%+" hash and the numbered match
788 variables ($1, $2, $3, etc.) are dynamically scoped until the end of
789 the enclosing block or until the next successful match, whichever comes
790 first. (See "Compound Statements" in perlsyn.)
791
792 NOTE: Failed matches in Perl do not reset the match variables, which
793 makes it easier to write code that tests for a series of more specific
794 cases and remembers the best match.
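
A short sketch of that behaviour, using two throwaway strings:

```perl
use strict;
use warnings;

my $ok     = "key=value" =~ /(\w+)=(\w+)/ ? 1 : 0;   # succeeds, sets $1 and $2
my $failed = "no digits" =~ /(\d+)/       ? 0 : 1;   # fails, leaves $1 alone
my $kept   = $1;

print "ok=$ok failed=$failed kept=$kept\n";   # ok=1 failed=1 kept=key
```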
795
796 WARNING: Once Perl sees that you need one of $&, "$`", or "$'" anywhere
797 in the program, it has to provide them for every pattern match. This
798 may substantially slow your program. Perl uses the same mechanism to
799 produce $1, $2, etc, so you also pay a price for each pattern that
800 contains capturing parentheses. (To avoid this cost while retaining
801 the grouping behaviour, use the extended regular expression "(?: ... )"
802 instead.) But if you never use $&, "$`" or "$'", then patterns without
803 capturing parentheses will not be penalized. So avoid $&, "$'", and
804 "$`" if you can, but if you can't (and some algorithms really
805 appreciate them), once you've used them once, use them at will, because
806 you've already paid the price. As of 5.005, $& is not so costly as the
807 other two.
808
As a workaround for this problem, Perl 5.10.0 introduces
"${^PREMATCH}", "${^MATCH}" and "${^POSTMATCH}", which are equivalent
to "$`", $& and "$'", except that they are only guaranteed to be
defined after a successful match that was executed with the "/p"
(preserve) modifier. Unlike their punctuation equivalents, these
variables incur no global performance penalty; the trade-off is that
you have to tell perl when you want to use them.
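
For instance, a sketch along these lines (assuming Perl 5.10.0 or later; the sample text is arbitrary) splits a string around a match without touching the punctuation variables:

```perl
use strict;
use warnings;

my $text = "hello world";
my ($pre, $match, $post);

# The /p modifier asks perl to preserve the three pieces for this match only.
if ($text =~ /o w/p) {
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

print "[$pre][$match][$post]\n";   # [hell][o w][orld]
```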
816
817 Quoting metacharacters
818 Backslashed metacharacters in Perl are alphanumeric, such as "\b",
819 "\w", "\n". Unlike some other regular expression languages, there are
820 no backslashed symbols that aren't alphanumeric. So anything that
821 looks like \\, \(, \), \<, \>, \{, or \} is always interpreted as a
822 literal character, not a metacharacter. This was once used in a common
823 idiom to disable or quote the special meanings of regular expression
824 metacharacters in a string that you want to use for a pattern. Simply
825 quote all non-"word" characters:
826
827 $pattern =~ s/(\W)/\\$1/g;
828
829 (If "use locale" is set, then this depends on the current locale.)
830 Today it is more common to use the quotemeta() function or the "\Q"
831 metaquoting escape sequence to disable all metacharacters' special
832 meanings like this:
833
834 /$unquoted\Q$quoted\E$unquoted/
835
836 Beware that if you put literal backslashes (those not inside
837 interpolated variables) between "\Q" and "\E", double-quotish backslash
838 interpolation may lead to confusing results. If you need to use
839 literal backslashes within "\Q...\E", consult "Gory details of parsing
840 quoted constructs" in perlop.
841
842 "quotemeta()" and "\Q" are fully described in "quotemeta" in perlfunc.
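
To see the difference quoting makes, consider this sketch, in which the search text happens to contain two metacharacters (both strings are invented for the example):

```perl
use strict;
use warnings;

my $literal = "a.b*c";
my $text    = "xxa.b*cxx and aXbbbc";

my @quoted   = $text =~ /(\Q$literal\E)/g;   # matches the literal text only
my @unquoted = $text =~ /($literal)/g;       # "." and "*" act as metacharacters

print "quoted:    @quoted\n";                     # a.b*c
print "unquoted:  @unquoted\n";                   # aXbbbc
print "quotemeta: ", quotemeta($literal), "\n";   # a\.b\*c
```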
843
844 Extended Patterns
845 Perl also defines a consistent extension syntax for features not found
846 in standard tools like awk and lex. The syntax for most of these is a
847 pair of parentheses with a question mark as the first thing within the
848 parentheses. The character after the question mark indicates the
849 extension.
850
851 The stability of these extensions varies widely. Some have been part
852 of the core language for many years. Others are experimental and may
853 change without warning or be completely removed. Check the
854 documentation on an individual feature to verify its current status.
855
856 A question mark was chosen for this and for the minimal-matching
857 construct because 1) question marks are rare in older regular
858 expressions, and 2) whenever you see one, you should stop and
859 "question" exactly what is going on. That's psychology....
860
861 "(?#text)"
862 A comment. The text is ignored. If the "/x" modifier enables
863 whitespace formatting, a simple "#" will suffice. Note that Perl
864 closes the comment as soon as it sees a ")", so there is no way to
865 put a literal ")" in the comment.
866
867 "(?adlupimsx-imsx)"
868 "(?^alupimsx)"
869 One or more embedded pattern-match modifiers, to be turned on (or
870 turned off, if preceded by "-") for the remainder of the pattern or
871 the remainder of the enclosing pattern group (if any).
872
873 This is particularly useful for dynamic patterns, such as those
874 read in from a configuration file, taken from an argument, or
875 specified in a table somewhere. Consider the case where some
876 patterns want to be case-sensitive and some do not: The case-
877 insensitive ones merely need to include "(?i)" at the front of the
878 pattern. For example:
879
880 $pattern = "foobar";
881 if ( /$pattern/i ) { }
882
883 # more flexible:
884
885 $pattern = "(?i)foobar";
886 if ( /$pattern/ ) { }
887
888 These modifiers are restored at the end of the enclosing group. For
889 example,
890
891 ( (?i) blah ) \s+ \g1
892
893 will match "blah" in any case, some spaces, and an exact (including
894 the case!) repetition of the previous word, assuming the "/x"
895 modifier, and no "/i" modifier outside this group.
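
A runnable sketch of that pattern, with two sample strings chosen to show that the backreference stays case-sensitive:

```perl
use strict;
use warnings;

my $re = qr/ ( (?i) blah ) \s+ \g1 /x;

my $same_case  = "BLAH BLAH" =~ $re ? 1 : 0;   # repetition has identical case
my $mixed_case = "Blah blah" =~ $re ? 1 : 0;   # repetition differs in case

print "same: $same_case, mixed: $mixed_case\n";   # same: 1, mixed: 0
```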
896
897 These modifiers do not carry over into named subpatterns called in
898 the enclosing group. In other words, a pattern such as
899 "((?i)(?&NAME))" does not change the case-sensitivity of the "NAME"
900 pattern.
901
902 Any of these modifiers can be set to apply globally to all regular
903 expressions compiled within the scope of a "use re". See "'/flags'
904 mode" in re.
905
906 Starting in Perl 5.14, a "^" (caret or circumflex accent)
907 immediately after the "?" is a shorthand equivalent to "d-imsx".
908 Flags (except "d") may follow the caret to override it. But a
909 minus sign is not legal with it.
910
911 Note that the "a", "d", "l", "p", and "u" modifiers are special in
912 that they can only be enabled, not disabled, and the "a", "d", "l",
913 and "u" modifiers are mutually exclusive: specifying one de-
914 specifies the others, and a maximum of one (or two "a"'s) may
915 appear in the construct. Thus, for example, "(?-p)" will warn when
916 compiled under "use warnings"; "(?-d:...)" and "(?dl:...)" are
917 fatal errors.
918
919 Note also that the "p" modifier is special in that its presence
920 anywhere in a pattern has a global effect.
921
922 "(?:pattern)"
923 "(?adluimsx-imsx:pattern)"
924 "(?^aluimsx:pattern)"
925 This is for clustering, not capturing; it groups subexpressions
926 like "()", but doesn't make backreferences as "()" does. So
927
928 @fields = split(/\b(?:a|b|c)\b/)
929
930 is like
931
932 @fields = split(/\b(a|b|c)\b/)
933
934 but doesn't spit out extra fields. It's also cheaper not to
935 capture characters if you don't need to.
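
The effect on "split" can be checked with a sketch like this one (the input string is arbitrary):

```perl
use strict;
use warnings;

my $text = "x a y";

my @capturing  = split /\b(a|b|c)\b/,   $text;   # separator is returned too
my @clustering = split /\b(?:a|b|c)\b/, $text;   # separator is discarded

print scalar(@capturing),  " fields with ()\n";     # 3
print scalar(@clustering), " fields with (?:)\n";   # 2
```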
936
937 Any letters between "?" and ":" act as flags modifiers as with
938 "(?adluimsx-imsx)". For example,
939
940 /(?s-i:more.*than).*million/i
941
942 is equivalent to the more verbose
943
944 /(?:(?s-i)more.*than).*million/i
945
946 Starting in Perl 5.14, a "^" (caret or circumflex accent)
947 immediately after the "?" is a shorthand equivalent to "d-imsx".
948 Any positive flags (except "d") may follow the caret, so
949
950 (?^x:foo)
951
952 is equivalent to
953
954 (?x-ims:foo)
955
956 The caret tells Perl that this cluster doesn't inherit the flags of
957 any surrounding pattern, but uses the system defaults ("d-imsx"),
958 modified by any flags specified.
959
960 The caret allows for simpler stringification of compiled regular
961 expressions. These look like
962
963 (?^:pattern)
964
965 with any non-default flags appearing between the caret and the
966 colon. A test that looks at such stringification thus doesn't need
967 to have the system default flags hard-coded in it, just the caret.
968 If new flags are added to Perl, the meaning of the caret's
969 expansion will change to include the default for those flags, so
970 the test will still work, unchanged.
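
A brief sketch of both points, assuming Perl 5.14 or later. The exact stringified form may carry additional flags on builds or scopes with other defaults, so treat the last comment as typical rather than guaranteed.

```perl
use strict;
use warnings;

# An outer /i does not leak into a "(?^...)" cluster...
my $caret = "FOO" =~ /(?^:foo)/i ? 1 : 0;
# ...but it does apply to a plain "(?:...)" cluster.
my $plain = "FOO" =~ /(?:foo)/i  ? 1 : 0;

# Stringifying a compiled pattern shows only its non-default flags.
my $shown = "" . qr/foo/i;

print "caret: $caret, plain: $plain\n";   # caret: 0, plain: 1
print "$shown\n";                         # typically (?^i:foo)
```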
971
972 Specifying a negative flag after the caret is an error, as the flag
973 is redundant.
974
975 Mnemonic for "(?^...)": A fresh beginning since the usual use of a
976 caret is to match at the beginning.
977
978 "(?|pattern)"
979 This is the "branch reset" pattern, which has the special property
980 that the capture groups are numbered from the same starting point
981 in each alternation branch. It is available starting from perl
982 5.10.0.
983
984 Capture groups are numbered from left to right, but inside this
985 construct the numbering is restarted for each branch.
986
987 The numbering within each branch will be as normal, and any groups
988 following this construct will be numbered as though the construct
989 contained only one branch, that being the one with the most capture
990 groups in it.
991
992 This construct is useful when you want to capture one of a number
993 of alternative matches.
994
995 Consider the following pattern. The numbers underneath show in
996 which group the captured content will be stored.
997
    # before  ---------------branch-reset----------- after
    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
    # 1            2         2  3        2     3     4
1001
1002 Be careful when using the branch reset pattern in combination with
1003 named captures. Named captures are implemented as being aliases to
1004 numbered groups holding the captures, and that interferes with the
1005 implementation of the branch reset pattern. If you are using named
1006 captures in a branch reset pattern, it's best to use the same
1007 names, in the same order, in each of the alternations:
1008
1009 /(?| (?<a> x ) (?<b> y )
1010 | (?<a> z ) (?<b> w )) /x
1011
1012 Not doing so may lead to surprises:
1013
1014 "12" =~ /(?| (?<a> \d+ ) | (?<b> \D+))/x;
1015 say $+ {a}; # Prints '12'
1016 say $+ {b}; # *Also* prints '12'.
1017
1018 The problem here is that both the group named "a" and the group
1019 named "b" are aliases for the group belonging to $1.
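
As a sketch of the ordinary, unnamed use of branch reset, the hypothetical pattern below accepts either "key=value" or "key: value" and delivers the two parts in $1 and $2 whichever branch matched:

```perl
use strict;
use warnings;

my $pair = qr/\A (?| (\w+) = (\w+) | (\w+) : \s* (\w+) ) \z/x;

my @from_equals = "colour=red"  =~ $pair;   # first branch
my @from_colon  = "size: large" =~ $pair;   # second branch

print "@from_equals\n";   # colour red
print "@from_colon\n";    # size large
```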
1020
1021 Look-Around Assertions
Look-around assertions are zero-width patterns which match a
specific pattern without including it in $&. Positive assertions
match when their subpattern matches; negative assertions match when
their subpattern fails. Look-behind matches text up to the current
match position; look-ahead matches text following the current match
position.
1028
1029 "(?=pattern)"
1030 A zero-width positive look-ahead assertion. For example,
1031 "/\w+(?=\t)/" matches a word followed by a tab, without
1032 including the tab in $&.
1033
1034 "(?!pattern)"
1035 A zero-width negative look-ahead assertion. For example
1036 "/foo(?!bar)/" matches any occurrence of "foo" that isn't
1037 followed by "bar". Note however that look-ahead and look-
1038 behind are NOT the same thing. You cannot use this for look-
1039 behind.
1040
1041 If you are looking for a "bar" that isn't preceded by a "foo",
1042 "/(?!foo)bar/" will not do what you want. That's because the
1043 "(?!foo)" is just saying that the next thing cannot be
1044 "foo"--and it's not, it's a "bar", so "foobar" will match. Use
1045 look-behind instead (see below).
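
Both points can be tried out with a sketch such as this (sample strings invented for the purpose):

```perl
use strict;
use warnings;

# "foo" not followed by "bar": only the second word qualifies.
my @not_bar = "foobar foobaz" =~ /(foo(?!bar)\w*)/g;
print "@not_bar\n";                      # foobaz

# The pitfall: "(?!foo)" inspects what FOLLOWS the current position,
# so it cannot reject a "bar" that is preceded by "foo".
my $pitfall = "foobar" =~ /(?!foo)bar/ ? 1 : 0;
print "pitfall matches: $pitfall\n";     # 1
```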
1046
1047 "(?<=pattern)" "\K"
1048 A zero-width positive look-behind assertion. For example,
1049 "/(?<=\t)\w+/" matches a word that follows a tab, without
1050 including the tab in $&. Works only for fixed-width look-
1051 behind.
1052
1053 There is a special form of this construct, called "\K", which
1054 causes the regex engine to "keep" everything it had matched
1055 prior to the "\K" and not include it in $&. This effectively
1056 provides variable-length look-behind. The use of "\K" inside of
1057 another look-around assertion is allowed, but the behaviour is
1058 currently not well defined.
1059
1060 For various reasons "\K" may be significantly more efficient
1061 than the equivalent "(?<=...)" construct, and it is especially
1062 useful in situations where you want to efficiently remove
1063 something following something else in a string. For instance
1064
1065 s/(foo)bar/$1/g;
1066
1067 can be rewritten as the much more efficient
1068
1069 s/foo\Kbar//g;
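
A sketch of "\K" in substitutions follows. The second case uses a prefix of varying width, which a fixed-width "(?<=...)" could not express; the strings are invented for illustration.

```perl
use strict;
use warnings;

my $fixed = "foobar foobaz";
$fixed =~ s/foo\Kbar//g;                 # delete "bar" only when it follows "foo"

my $config = "password = hunter2";
$config =~ s/\w+\s*=\s*\K.*/REDACTED/;   # keep the variable-width prefix as is

print "$fixed\n";    # foo foobaz
print "$config\n";   # password = REDACTED
```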
1070
1071 "(?<!pattern)"
1072 A zero-width negative look-behind assertion. For example
1073 "/(?<!bar)foo/" matches any occurrence of "foo" that does not
1074 follow "bar". Works only for fixed-width look-behind.
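
For the "bar not preceded by foo" problem, a negative look-behind behaves as in this sketch:

```perl
use strict;
use warnings;

my $after_foo = "foobar" =~ /(?<!foo)bar/ ? 1 : 0;   # "bar" follows "foo": rejected
my $after_baz = "bazbar" =~ /(?<!foo)bar/ ? 1 : 0;   # "bar" follows "baz": accepted

print "after foo: $after_foo, after baz: $after_baz\n";   # after foo: 0, after baz: 1
```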
1075
1076 "(?'NAME'pattern)"
1077 "(?<NAME>pattern)"
1078 A named capture group. Identical in every respect to normal
1079 capturing parentheses "()" but for the additional fact that the
1080 group can be referred to by name in various regular expression
1081 constructs (like "\g{NAME}") and can be accessed by name after a
1082 successful match via "%+" or "%-". See perlvar for more details on
1083 the "%+" and "%-" hashes.
1084
1085 If multiple distinct capture groups have the same name then the
1086 $+{NAME} will refer to the leftmost defined group in the match.
1087
1088 The forms "(?'NAME'pattern)" and "(?<NAME>pattern)" are equivalent.
1089
1090 NOTE: While the notation of this construct is the same as the
1091 similar function in .NET regexes, the behavior is not. In Perl the
1092 groups are numbered sequentially regardless of being named or not.
1093 Thus in the pattern
1094
1095 /(x)(?<foo>y)(z)/
1096
1097 $+{foo} will be the same as $2, and $3 will contain 'z' instead of
1098 the opposite which is what a .NET regex hacker might expect.
1099
1100 Currently NAME is restricted to simple identifiers only. In other
1101 words, it must match "/^[_A-Za-z][_A-Za-z0-9]*\z/" or its Unicode
1102 extension (see utf8), though it isn't extended by the locale (see
1103 perllocale).
1104
1105 NOTE: In order to make things easier for programmers with
1106 experience with the Python or PCRE regex engines, the pattern
1107 "(?P<NAME>pattern)" may be used instead of "(?<NAME>pattern)";
1108 however this form does not support the use of single quotes as a
1109 delimiter for the name.
1110
1111 "\k<NAME>"
1112 "\k'NAME'"
1113 Named backreference. Similar to numeric backreferences, except that
1114 the group is designated by name and not number. If multiple groups
1115 have the same name then it refers to the leftmost defined group in
1116 the current match.
1117
1118 It is an error to refer to a name not defined by a "(?<NAME>)"
1119 earlier in the pattern.
1120
1121 Both forms are equivalent.
1122
1123 NOTE: In order to make things easier for programmers with
1124 experience with the Python or PCRE regex engines, the pattern
1125 "(?P=NAME)" may be used instead of "\k<NAME>".
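
One common use of a named backreference is matching a closing delimiter that must equal the opening one. The following is a sketch with invented group names; the last line shows the Python-style spelling.

```perl
use strict;
use warnings;

my $quoted = qr/(?<quote>['"])(?<body>.*?)\k<quote>/;

my ($quote, $body);
if (q{say "it's fine" now} =~ $quoted) {
    ($quote, $body) = ($+{quote}, $+{body});
}

print "quote=$quote body=$body\n";   # quote=" body=it's fine

# "(?P<NAME>...)" and "(?P=NAME)" are accepted as alternative spellings.
my $python_style = "abcabc" =~ /(?P<w>abc)(?P=w)/ ? 1 : 0;
```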
1126
1127 "(?{ code })"
1128 WARNING: This extended regular expression feature is considered
1129 experimental, and may be changed without notice. Code executed that
1130 has side effects may not perform identically from version to
1131 version due to the effect of future optimisations in the regex
1132 engine.
1133
1134 This zero-width assertion evaluates any embedded Perl code. It
1135 always succeeds, and its "code" is not interpolated. Currently,
1136 the rules to determine where the "code" ends are somewhat
1137 convoluted.
1138
1139 This feature can be used together with the special variable $^N to
1140 capture the results of submatches in variables without having to
1141 keep track of the number of nested parentheses. For example:
1142
1143 $_ = "The brown fox jumps over the lazy dog";
1144 /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
1145 print "color = $color, animal = $animal\n";
1146
Inside the "(?{...})" block, $_ refers to the string the regular
expression is matching against. You can also use "pos()" to find
the current match position within this string.
1150
1151 The "code" is properly scoped in the following sense: If the
1152 assertion is backtracked (compare "Backtracking"), all changes
1153 introduced after "local"ization are undone, so that
1154
1155 $_ = 'a' x 8;
1156 m<
1157 (?{ $cnt = 0 }) # Initialize $cnt.
1158 (
1159 a
1160 (?{
1161 local $cnt = $cnt + 1; # Update $cnt,
1162 # backtracking-safe.
1163 })
1164 )*
1165 aaaa
1166 (?{ $res = $cnt }) # On success copy to
1167 # non-localized location.
1168 >x;
1169
1170 will set "$res = 4". Note that after the match, $cnt returns to
1171 the globally introduced value, because the scopes that restrict
1172 "local" operators are unwound.
1173
This assertion may be used as the condition in a
"(?(condition)yes-pattern|no-pattern)" switch. If not used in this
way, the result of evaluation of "code" is put into the special
variable $^R. This happens immediately, so $^R can be used from
other "(?{ code })" assertions inside the same regular expression.
1179
1180 The assignment to $^R above is properly localized, so the old value
1181 of $^R is restored if the assertion is backtracked; compare
1182 "Backtracking".
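
A minimal sketch of passing a value from one code block to the next through $^R. It uses a package variable for the result, in line with the advice elsewhere in this section to avoid lexicals in these blocks; since the feature is experimental, details may differ between versions.

```perl
use strict;
use warnings;

our $result;

# Each block's return value lands in $^R, where the next block can read it.
"abc" =~ /a (?{ 1 }) b (?{ $^R + 1 }) c (?{ $result = $^R })/x;

print "result = $result\n";   # result = 2
```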
1183
1184 For reasons of security, this construct is forbidden if the regular
1185 expression involves run-time interpolation of variables, unless the
1186 perilous "use re 'eval'" pragma has been used (see re), or the
1187 variables contain results of the "qr//" operator (see
1188 "qr/STRING/msixpodual" in perlop).
1189
1190 This restriction is due to the wide-spread and remarkably
1191 convenient custom of using run-time determined strings as patterns.
1192 For example:
1193
1194 $re = <>;
1195 chomp $re;
1196 $string =~ /$re/;
1197
1198 Before Perl knew how to execute interpolated code within a pattern,
1199 this operation was completely safe from a security point of view,
1200 although it could raise an exception from an illegal pattern. If
1201 you turn on the "use re 'eval'", though, it is no longer secure, so
1202 you should only do so if you are also using taint checking. Better
1203 yet, use the carefully constrained evaluation within a Safe
1204 compartment. See perlsec for details about both these mechanisms.
1205
1206 WARNING: Use of lexical ("my") variables in these blocks is broken.
1207 The result is unpredictable and will make perl unstable. The
1208 workaround is to use global ("our") variables.
1209
WARNING: In perl 5.12.x and earlier, the regex engine was not
re-entrant, so interpolated code could not safely invoke the regex
engine either directly (with "m//" or "s///"), or indirectly with
functions such as "split". Invoking the regex engine in these
blocks would make perl unstable.
1215
1216 "(??{ code })"
1217 WARNING: This extended regular expression feature is considered
1218 experimental, and may be changed without notice. Code executed that
1219 has side effects may not perform identically from version to
1220 version due to the effect of future optimisations in the regex
1221 engine.
1222
1223 This is a "postponed" regular subexpression. The "code" is
1224 evaluated at run time, at the moment this subexpression may match.
1225 The result of evaluation is considered a regular expression and
1226 matched as if it were inserted instead of this construct. Note
1227 that this means that the contents of capture groups defined inside
1228 an eval'ed pattern are not available outside of the pattern, and
1229 vice versa, there is no way for the inner pattern returned from the
1230 code block to refer to a capture group defined outside. (The code
1231 block itself can use $1, etc., to refer to the enclosing pattern's
1232 capture groups.) Thus,
1233
1234 ('a' x 100)=~/(??{'(.)' x 100})/
1235
will match, but it will not set $1.
1237
1238 The "code" is not interpolated. As before, the rules to determine
1239 where the "code" ends are currently somewhat convoluted.
1240
1241 The following pattern matches a parenthesized group:
1242
1243 $re = qr{
1244 \(
1245 (?:
1246 (?> [^()]+ ) # Non-parens without backtracking
1247 |
1248 (??{ $re }) # Group with matching parens
1249 )*
1250 \)
1251 }x;
1252
1253 See also "(?PARNO)" for a different, more efficient way to
1254 accomplish the same task.
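
A sketch of putting the pattern above to use. It is restated here with a package variable so that the code block can see $re; the test strings are invented.

```perl
use strict;
use warnings;

our $re;
$re = qr{
    \(
    (?:
        (?> [^()]+ )    # Non-parens without backtracking
      |
        (??{ $re })     # Group with matching parens
    )*
    \)
}x;

my $balanced   = "(a(b)c)" =~ /\A$re\z/ ? 1 : 0;
my $unbalanced = "(a(b)c"  =~ /\A$re\z/ ? 1 : 0;

print "balanced: $balanced, unbalanced: $unbalanced\n";   # balanced: 1, unbalanced: 0
```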
1255
1256 For reasons of security, this construct is forbidden if the regular
1257 expression involves run-time interpolation of variables, unless the
1258 perilous "use re 'eval'" pragma has been used (see re), or the
1259 variables contain results of the "qr//" operator (see
1260 "qr/STRING/msixpodual" in perlop).
1261
In perl 5.12.x and earlier, because the regex engine was not
re-entrant, delayed code could not safely invoke the regex engine
either directly (with "m//" or "s///"), or indirectly with functions
such as "split".
1266
1267 Recursing deeper than 50 times without consuming any input string
1268 will result in a fatal error. The maximum depth is compiled into
1269 perl, so changing it requires a custom build.
1270
1271 "(?PARNO)" "(?-PARNO)" "(?+PARNO)" "(?R)" "(?0)"
Similar to "(??{ code })" except that it does not involve compiling
any code. Instead, it treats the contents of a capture group as an
independent pattern that must match at the current position.
Capture groups contained by the pattern will have the value as
determined by the outermost recursion.
1277
1278 PARNO is a sequence of digits (not starting with 0) whose value
1279 reflects the paren-number of the capture group to recurse to.
1280 "(?R)" recurses to the beginning of the whole pattern. "(?0)" is an
1281 alternate syntax for "(?R)". If PARNO is preceded by a plus or
1282 minus sign then it is assumed to be relative, with negative numbers
1283 indicating preceding capture groups and positive ones following.
1284 Thus "(?-1)" refers to the most recently declared group, and
1285 "(?+1)" indicates the next group to be declared. Note that the
1286 counting for relative recursion differs from that of relative
1287 backreferences, in that with recursion unclosed groups are
1288 included.
1289
1290 The following pattern matches a function foo() which may contain
1291 balanced parentheses as the argument.
1292
1293 $re = qr{ ( # paren group 1 (full function)
1294 foo
1295 ( # paren group 2 (parens)
1296 \(
1297 ( # paren group 3 (contents of parens)
1298 (?:
1299 (?> [^()]+ ) # Non-parens without backtracking
1300 |
1301 (?2) # Recurse to start of paren group 2
1302 )*
1303 )
1304 \)
1305 )
1306 )
1307 }x;
1308
1309 If the pattern was used as follows
1310
1311 'foo(bar(baz)+baz(bop))'=~/$re/
1312 and print "\$1 = $1\n",
1313 "\$2 = $2\n",
1314 "\$3 = $3\n";
1315
1316 the output produced should be the following:
1317
1318 $1 = foo(bar(baz)+baz(bop))
1319 $2 = (bar(baz)+baz(bop))
1320 $3 = bar(baz)+baz(bop)
1321
1322 If there is no corresponding capture group defined, then it is a
1323 fatal error. Recursing deeper than 50 times without consuming any
1324 input string will also result in a fatal error. The maximum depth
1325 is compiled into perl, so changing it requires a custom build.
1326
1327 The following shows how using negative indexing can make it easier
1328 to embed recursive patterns inside of a "qr//" construct for later
1329 use:
1330
    my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
    if (/foo $parens \s+ \+ \s+ bar $parens/x) {
        # do something here...
    }
1335
Note that this pattern does not behave the same way as the
equivalent PCRE or Python construct of the same form. In Perl you
can backtrack into a recursed group, whereas in PCRE and Python the
recursed-into group is treated as atomic. Also, modifiers are
resolved at compile time, so constructs like (?i:(?1)) or
(?:(?i)(?1)) do not affect how the sub-pattern will be processed.
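
As a sketch of relative recursion surviving interpolation, the fragment below is embedded twice in one pattern. In each copy "(?-1)" refers to that copy's own group, even though the second copy becomes group 2. The input string and the "\s*" padding are assumptions made for the example.

```perl
use strict;
use warnings;

my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

my ($first, $second);
if ("foo (a(b)) + bar (c)" =~ /foo \s* $parens \s+ \+ \s+ bar \s* $parens/x) {
    ($first, $second) = ($1, $2);
}

print "$first $second\n";   # (a(b)) (c)
```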
1342
1343 "(?&NAME)"
1344 Recurse to a named subpattern. Identical to "(?PARNO)" except that
1345 the parenthesis to recurse to is determined by name. If multiple
1346 parentheses have the same name, then it recurses to the leftmost.
1347
1348 It is an error to refer to a name that is not declared somewhere in
1349 the pattern.
1350
1351 NOTE: In order to make things easier for programmers with
1352 experience with the Python or PCRE regex engines the pattern
1353 "(?P>NAME)" may be used instead of "(?&NAME)".
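
A sketch of named recursion, checking whether a whole string is one balanced parenthesized group (the group name is arbitrary):

```perl
use strict;
use warnings;

my $balanced = qr/\A(?<group>\((?:[^()]++|(?&group))*+\))\z/;

my $good = "(a(b)(c))" =~ $balanced ? 1 : 0;
my $bad  = "((a)"      =~ $balanced ? 1 : 0;

print "good: $good, bad: $bad\n";   # good: 1, bad: 0
```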
1354
1355 "(?(condition)yes-pattern|no-pattern)"
1356 "(?(condition)yes-pattern)"
1357 Conditional expression. Matches "yes-pattern" if "condition" yields
1358 a true value, matches "no-pattern" otherwise. A missing pattern
1359 always matches.
1360
1361 "(condition)" should be either an integer in parentheses (which is
1362 valid if the corresponding pair of parentheses matched), a
1363 look-ahead/look-behind/evaluate zero-width assertion, a name in
1364 angle brackets or single quotes (which is valid if a group with the
1365 given name matched), or the special symbol (R) (true when evaluated
1366 inside of recursion or eval). Additionally the R may be followed by
1367 a number, (which will be true when evaluated when recursing inside
1368 of the appropriate group), or by &NAME, in which case it will be
1369 true only when evaluated during recursion in the named group.
1370
1371 Here's a summary of the possible predicates:
1372
1373 (1) (2) ...
1374 Checks if the numbered capturing group has matched something.
1375
1376 (<NAME>) ('NAME')
1377 Checks if a group with the given name has matched something.
1378
1379 (?=...) (?!...) (?<=...) (?<!...)
1380 Checks whether the pattern matches (or does not match, for the
1381 '!' variants).
1382
1383 (?{ CODE })
1384 Treats the return value of the code block as the condition.
1385
1386 (R) Checks if the expression has been evaluated inside of
1387 recursion.
1388
1389 (R1) (R2) ...
1390 Checks if the expression has been evaluated while executing
1391 directly inside of the n-th capture group. This check is the
1392 regex equivalent of
1393
1394 if ((caller(0))[3] eq 'subname') { ... }
1395
1396 In other words, it does not check the full recursion stack.
1397
1398 (R&NAME)
1399 Similar to "(R1)", this predicate checks to see if we're
1400 executing directly inside of the leftmost group with a given
1401 name (this is the same logic used by "(?&NAME)" to
1402 disambiguate). It does not check the full stack, but only the
1403 name of the innermost active recursion.
1404
1405 (DEFINE)
1406 In this case, the yes-pattern is never directly executed, and
1407 no no-pattern is allowed. Similar in spirit to "(?{0})" but
1408 more efficient. See below for details.
1409
1410 For example:
1411
1412 m{ ( \( )?
1413 [^()]+
1414 (?(1) \) )
1415 }x
1416
matches a chunk of non-parentheses, possibly enclosed in
parentheses itself.
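
Anchoring that pattern at both ends makes the effect of the condition easy to observe. This is a sketch; the candidate strings are invented.

```perl
use strict;
use warnings;

my $chunk = qr{ \A ( \( )? [^()]+ (?(1) \) ) \z }x;

my %accepts;
for my $candidate ("abc", "(abc)", "(abc", "abc)") {
    # A closing paren is required exactly when an opening one was captured.
    $accepts{$candidate} = $candidate =~ $chunk ? 1 : 0;
    print "$candidate => $accepts{$candidate}\n";
}
```

The first two candidates are accepted and the last two are rejected.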
1419
1420 A special form is the "(DEFINE)" predicate, which never executes
1421 its yes-pattern directly, and does not allow a no-pattern. This
1422 allows one to define subpatterns which will be executed only by the
1423 recursion mechanism. This way, you can define a set of regular
1424 expression rules that can be bundled into any pattern you choose.
1425
1426 It is recommended that for this usage you put the DEFINE block at
1427 the end of the pattern, and that you name any subpatterns defined
1428 within it.
1429
1430 Also, it's worth noting that patterns defined this way probably
1431 will not be as efficient, as the optimiser is not very clever about
1432 handling them.
1433
1434 An example of how this might be used is as follows:
1435
    /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
     (?(DEFINE)
       (?<NAME_PAT>....)
       (?<ADDRESS_PAT>....)
     )/x
1441
1442 Note that capture groups matched inside of recursion are not
1443 accessible after the recursion returns, so the extra layer of
1444 capturing groups is necessary. Thus $+{NAME_PAT} would not be
1445 defined even though $+{NAME} would be.
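
A runnable sketch of this layout follows. The bodies given to NAME_PAT
and ADDRESS_PAT are simplified placeholders chosen for the
illustration (they are not real name or address validators), and the
input string is invented.

```perl
use strict;
use warnings;

# The bodies of NAME_PAT and ADDRESS_PAT are simplified placeholders,
# not real validators; the input string is invented for this sketch.
my $re = qr{
    ^ (?<NAME> (?&NAME_PAT) ) \s+ < (?<ADDR> (?&ADDRESS_PAT) ) > $
    (?(DEFINE)
        (?<NAME_PAT>    [A-Za-z]+ (?: \s [A-Za-z]+ )* )
        (?<ADDRESS_PAT> [\w.]+ \@ [\w.]+ )
    )
}x;

my ($name, $addr, $inner);
if ('Jane Doe <jane@example.org>' =~ $re) {
    $name  = $+{NAME};       # set by the outer capturing group
    $addr  = $+{ADDR};
    $inner = $+{NAME_PAT};   # matched only inside recursion
}

print "name=$name addr=$addr\n";
print 'NAME_PAT is ', (defined $inner ? 'defined' : 'undefined'), "\n";
```

The outer NAME and ADDR groups exist only to make the text matched by
the recursive calls available after the match.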
1446
1447 Finally, keep in mind that subpatterns created inside a DEFINE
1448 block count towards the absolute and relative number of captures,
1449 so this:
1450
    use feature 'say';    # "say" is not enabled by default

    my @captures = "a" =~ /(.)                  # First capture
                           (?(DEFINE)
                               (?<EXAMPLE> 1 )  # Second capture
                           )/x;
    say scalar @captures;
1456
will output 2, not 1. This is particularly important if you intend
to compile the definitions with the "qr//" operator and later
interpolate them into another pattern.
1460
1461 "(?>pattern)"
1462 An "independent" subexpression, one which matches the substring
1463 that a standalone "pattern" would match if anchored at the given
1464 position, and it matches nothing other than this substring. This
1465 construct is useful for optimizations of what would otherwise be
1466 "eternal" matches, because it will not backtrack (see
1467 "Backtracking"). It may also be useful in places where the "grab
1468 all you can, and do not give anything back" semantic is desirable.
1469
1470 For example: "^(?>a*)ab" will never match, since "(?>a*)" (anchored
1471 at the beginning of string, as above) will match all characters "a"
1472 at the beginning of string, leaving no "a" for "ab" to match. In
1473 contrast, "a*ab" will match the same as "a+b", since the match of
1474 the subgroup "a*" is influenced by the following group "ab" (see
1475 "Backtracking"). In particular, "a*" inside "a*ab" will match
1476 fewer characters than a standalone "a*", since this makes the tail
1477 match.
1478
1479 "(?>pattern)" does not disable backtracking altogether once it has
1480 matched. It is still possible to backtrack past the construct, but
1481 not into it. So "((?>a*)|(?>b*))ar" will still match "bar".
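
The three claims above can be checked with a short sketch. The variable
names, and the "^" and "\z" anchors on the last pattern, are
assumptions added for the illustration.

```perl
use strict;
use warnings;

# The atomic group keeps every "a", so nothing is left for "ab".
my $atomic = 'aaab' =~ /^(?>a*)ab/ ? 'match' : 'no match';

# A plain "a*" gives one "a" back so that the tail can match.
my $greedy = 'aaab' =~ /^a*ab/ ? 'match' : 'no match';

# The engine backtracks past the first atomic group (which matched the
# empty string) and tries the second alternative instead.
my ($first_group) = 'bar' =~ /^((?>a*)|(?>b*))ar\z/;

print "atomic: $atomic\n";
print "greedy: $greedy\n";
print "group 1 of the bar match: $first_group\n";
```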
1482
An effect similar to "(?>pattern)" may be achieved by writing
"(?=(pattern))\g{-1}". The look-ahead matches the same substring as a
standalone "pattern", and the following "\g{-1}" eats the matched
string; it therefore makes a zero-length assertion into an analogue
of "(?>...)". (The difference between these two constructs is that
the second one uses a capturing group, thus shifting ordinals of
backreferences in the rest of a regular expression.)
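
A small sketch comparing the two spellings, with "a+" standing in for
"pattern"; the test string and variable names are assumptions made for
the illustration.

```perl
use strict;
use warnings;

# Both the atomic group and the look-ahead emulation take all three
# "a" characters and refuse to give one back, so "ab" cannot match.
my $atomic   = 'aaab' =~ /^(?>a+)ab/         ? 'match' : 'no match';
my $emulated = 'aaab' =~ /^(?=(a+))\g{-1}ab/ ? 'match' : 'no match';

# The ordinary quantifier backtracks and succeeds.
my $plain    = 'aaab' =~ /^a+ab/             ? 'match' : 'no match';

print "atomic: $atomic, emulated: $emulated, plain: $plain\n";
```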
1490
1491 Consider this pattern:
1492
1493 m{ \(
1494 (
1495 [^()]+ # x+
1496 |
1497 \( [^()]* \)
1498 )+
1499 \)
1500 }x
1501
That will efficiently match a nonempty group with matching
parentheses two levels deep or less. However, if there is no such
group, it will take virtually forever on a long string. That's
because there are so many different ways to split a long string
into several substrings. This is what "(.+)+" is doing, and
"(.+)+" is similar to a subpattern of the above pattern. The pattern
above takes several seconds to detect no-match on
"((()aaaaaaaaaaaaaaaaaa", and each extra letter doubles this time.
This exponential performance will make it appear that your program
has hung. However, a tiny change to this pattern
1512
1513 m{ \(
1514 (
1515 (?> [^()]+ ) # change x+ above to (?> x+ )
1516 |
1517 \( [^()]* \)
1518 )+
1519 \)
1520 }x
1521
1522 which uses "(?>...)" matches exactly when the one above does
1523 (verifying this yourself would be a productive exercise), but
1524 finishes in a fourth the time when used on a similar string with
1525 1000000 "a"s. Be aware, however, that, when this construct is
1526 followed by a quantifier, it currently triggers a warning message
1527 under the "use warnings" pragma or -w switch saying it "matches
1528 null string many times in regex".
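
The sketch below is one way to start the suggested exercise: it runs
both patterns over a handful of short strings and counts the cases
where they disagree. The sample strings and variable names are
assumptions; the strings are kept short on purpose so that the slower
pattern finishes quickly. Agreement on a few samples is evidence, not
a proof of equivalence.

```perl
use strict;
use warnings;

my $plain  = qr{ \( (     [^()]+   | \( [^()]* \) )+ \) }x;
my $atomic = qr{ \( ( (?> [^()]+ ) | \( [^()]* \) )+ \) }x;

my %matched;
my $disagreements = 0;
for my $str ('(a(b)c)', 'x(y)z', '(abc', '((()aaaa', 'no parens') {
    my ($got_plain)  = $str =~ /($plain)/;    # whole match, or undef
    my ($got_atomic) = $str =~ /($atomic)/;
    $disagreements++ if ($got_plain // '') ne ($got_atomic // '');
    $matched{$str} = $got_atomic;
    printf "%-10s %s\n", $str, $got_atomic // 'no match';
}
print "disagreements: $disagreements\n";
```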
1529
1530 On simple groups, such as the pattern "(?> [^()]+ )", a comparable
1531 effect may be achieved by negative look-ahead, as in "[^()]+ (?!
1532 [^()] )". This was only 4 times slower on a string with 1000000
1533 "a"s.
1534
The "grab all you can, and do not give anything back" semantic is
desirable in many situations where at first sight a simple "()*"
looks like the correct solution. Suppose we parse text with
comments being delimited by "#" followed by some optional
(horizontal) whitespace. Contrary to its appearance, "#[ \t]*" is
not the correct subexpression to match the comment delimiter,
because it may "give up" some whitespace if the remainder of the
pattern can be made to match that way. The correct answer is
either one of these:
1544
1545 (?>#[ \t]*)
1546 #[ \t]*(?![ \t])
1547
1548 For example, to grab non-empty comments into $1, one should use
1549 either one of these:
1550
1551 / (?> \# [ \t]* ) ( .+ ) /x;
1552 / \# [ \t]* ( [^ \t] .* ) /x;
1553
1554 Which one you pick depends on which of these expressions better
1555 reflects the above specification of comments.
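
To see the difference, the following sketch applies a naive pattern and
the two corrected ones to a comment that contains only whitespace and
to one that contains text. Both sample strings and all variable names
are assumptions made for the illustration.

```perl
use strict;
use warnings;

my $blank = "#   ";        # marker followed only by spaces
my $real  = "# hello";     # marker, one space, then text

# Naive: "[ \t]*" gives one space back so that "(.+)" can succeed.
my ($naive)  = $blank =~ /\#[ \t]*(.+)/;

# Corrected forms: neither lets the delimiter give whitespace back.
my ($atomic) = $blank =~ /(?>\#[ \t]*)(.+)/;
my ($strict) = $blank =~ /\#[ \t]*([^ \t].*)/;

# On a comment with real content all three forms agree.
my @on_real = map { $real =~ $_ ? $1 : undef }
              qr/\#[ \t]*(.+)/, qr/(?>\#[ \t]*)(.+)/, qr/\#[ \t]*([^ \t].*)/;

print "naive captured: <", $naive // 'undef', ">\n";
print "atomic captured: <", $atomic // 'undef', ">\n";
print "strict captured: <", $strict // 'undef', ">\n";
print "on real comment: @on_real\n";
```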
1556
1557 In some literature this construct is called "atomic matching" or
1558 "possessive matching".
1559
1560 Possessive quantifiers are equivalent to putting the item they are
1561 applied to inside of one of these constructs. The following
1562 equivalences apply:
1563
1564 Quantifier Form Bracketing Form
1565 --------------- ---------------
1566 PAT*+ (?>PAT*)
1567 PAT++ (?>PAT+)
1568 PAT?+ (?>PAT?)
1569 PAT{min,max}+ (?>PAT{min,max})
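
A short sketch of these equivalences: each pattern is tried against the
string "aaa", which an ordinary "a+a" matches only by giving one "a"
back. The test string and the "\A"/"\z" anchors are assumptions added
for the illustration.

```perl
use strict;
use warnings;

my %outcome;
for my $pat ('a+a', 'a++a', '(?>a+)a', 'a{1,3}+a', '(?>a{1,3})a') {
    # The possessive and atomic forms keep all three "a" characters,
    # leaving nothing for the final "a".
    $outcome{$pat} = 'aaa' =~ /\A$pat\z/ ? 'match' : 'no match';
    printf "%-12s %s\n", $pat, $outcome{$pat};
}
```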
1570
1571 Special Backtracking Control Verbs
1572 WARNING: These patterns are experimental and subject to change or
1573 removal in a future version of Perl. Their usage in production code
1574 should be noted to avoid problems during upgrades.
1575
1576 These special patterns are generally of the form "(*VERB:ARG)". Unless
1577 otherwise stated the ARG argument is optional; in some cases, it is
1578 forbidden.
1579
1580 Any pattern containing a special backtracking verb that allows an
1581 argument has the special behaviour that when executed it sets the
1582 current package's $REGERROR and $REGMARK variables. When doing so the
1583 following rules apply:
1584
1585 On failure, the $REGERROR variable will be set to the ARG value of the
1586 verb pattern, if the verb was involved in the failure of the match. If
1587 the ARG part of the pattern was omitted, then $REGERROR will be set to
1588 the name of the last "(*MARK:NAME)" pattern executed, or to TRUE if
1589 there was none. Also, the $REGMARK variable will be set to FALSE.
1590
1591 On a successful match, the $REGERROR variable will be set to FALSE, and
1592 the $REGMARK variable will be set to the name of the last
1593 "(*MARK:NAME)" pattern executed. See the explanation for the
1594 "(*MARK:NAME)" verb below for more details.
1595
1596 NOTE: $REGERROR and $REGMARK are not magic variables like $1 and most
1597 other regex-related variables. They are not local to a scope, nor
1598 readonly, but instead are volatile package variables similar to
1599 $AUTOLOAD. Use "local" to localize changes to them to a specific scope
1600 if necessary.
1601
1602 If a pattern does not contain a special backtracking verb that allows
1603 an argument, then $REGERROR and $REGMARK are not touched at all.
1604
1605 Verbs that take an argument
1606 "(*PRUNE)" "(*PRUNE:NAME)"
1607 This zero-width pattern prunes the backtracking tree at the
1608 current point when backtracked into on failure. Consider the
1609 pattern "A (*PRUNE) B", where A and B are complex patterns.
1610 Until the "(*PRUNE)" verb is reached, A may backtrack as
1611 necessary to match. Once it is reached, matching continues in B,
1612 which may also backtrack as necessary; however, should B not
1613 match, then no further backtracking will take place, and the
1614 pattern will fail outright at the current starting position.
1615
1616 The following example counts all the possible matching strings
1617 in a pattern (without actually matching any of them).
1618
1619 'aaab' =~ /a+b?(?{print "$&\n"; $count++})(*FAIL)/;
1620 print "Count=$count\n";
1621
1622 which produces:
1623
1624 aaab
1625 aaa
1626 aa
1627 a
1628 aab
1629 aa
1630 a
1631 ab
1632 a
1633 Count=9
1634
1635 If we add a "(*PRUNE)" before the count like the following
1636
1637 'aaab' =~ /a+b?(*PRUNE)(?{print "$&\n"; $count++})(*FAIL)/;
1638 print "Count=$count\n";
1639
1640 we prevent backtracking and find the count of the longest
1641 matching string at each matching starting point like so:
1642
1643 aaab
1644 aab
1645 ab
1646 Count=3
1647
1648 Any number of "(*PRUNE)" assertions may be used in a pattern.
1649
1650 See also "(?>pattern)" and possessive quantifiers for other ways
1651 to control backtracking. In some cases, the use of "(*PRUNE)"
1652 can be replaced with a "(?>pattern)" with no functional
1653 difference; however, "(*PRUNE)" can be used to handle cases that
1654 cannot be expressed using a "(?>pattern)" alone.
1655
1656 "(*SKIP)" "(*SKIP:NAME)"
This zero-width pattern is similar to "(*PRUNE)", except that on
failure it also signifies that whatever text was matched leading up
to the "(*SKIP)" pattern being executed cannot be part of any match
of this pattern. This effectively means that the regex engine
"skips" forward to this position on failure and tries to match
again (assuming that there is sufficient room to match).
1664
1665 The name of the "(*SKIP:NAME)" pattern has special significance.
1666 If a "(*MARK:NAME)" was encountered while matching, then it is
1667 that position which is used as the "skip point". If no "(*MARK)"
1668 of that name was encountered, then the "(*SKIP)" operator has no
1669 effect. When used without a name the "skip point" is where the
1670 match point was when executing the (*SKIP) pattern.
1671
1672 Compare the following to the examples in "(*PRUNE)"; note the
1673 string is twice as long:
1674
1675 'aaabaaab' =~ /a+b?(*SKIP)(?{print "$&\n"; $count++})(*FAIL)/;
1676 print "Count=$count\n";
1677
1678 outputs
1679
1680 aaab
1681 aaab
1682 Count=2
1683
1684 Once the 'aaab' at the start of the string has matched, and the
1685 "(*SKIP)" executed, the next starting point will be where the
1686 cursor was when the "(*SKIP)" was executed.
1687
1688 "(*MARK:NAME)" "(*:NAME)"
1689 This zero-width pattern can be used to mark the point reached in
1690 a string when a certain part of the pattern has been
1691 successfully matched. This mark may be given a name. A later
1692 "(*SKIP)" pattern will then skip forward to that point if
1693 backtracked into on failure. Any number of "(*MARK)" patterns
1694 are allowed, and the NAME portion may be duplicated.
1695
1696 In addition to interacting with the "(*SKIP)" pattern,
1697 "(*MARK:NAME)" can be used to "label" a pattern branch, so that
1698 after matching, the program can determine which branches of the
1699 pattern were involved in the match.
1700
1701 When a match is successful, the $REGMARK variable will be set to
1702 the name of the most recently executed "(*MARK:NAME)" that was
1703 involved in the match.
1704
1705 This can be used to determine which branch of a pattern was
1706 matched without using a separate capture group for each branch,
1707 which in turn can result in a performance improvement, as perl
1708 cannot optimize "/(?:(x)|(y)|(z))/" as efficiently as something
1709 like "/(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/".
1710
1711 When a match has failed, and unless another verb has been
1712 involved in failing the match and has provided its own name to
1713 use, the $REGERROR variable will be set to the name of the most
1714 recently executed "(*MARK:NAME)".
1715
1716 See "(*SKIP)" for more details.
1717
1718 As a shortcut "(*MARK:NAME)" can be written "(*:NAME)".
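
A minimal sketch of branch labelling with $REGMARK. The word list, the
"^" anchor and the %branch hash are assumptions made for the
illustration; "our $REGMARK" merely declares the package variable so
that the code compiles under "use strict".

```perl
use strict;
use warnings;

our $REGMARK;    # package variable set by the regex engine

my %branch;
for my $word (qw(x-ray yak zoo)) {
    if ($word =~ /^(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/) {
        # On success $REGMARK names the last mark that was executed.
        $branch{$word} = $REGMARK;
        print "$word matched via branch $REGMARK\n";
    }
}
```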
1719
1720 "(*THEN)" "(*THEN:NAME)"
1721 This is similar to the "cut group" operator "::" from Perl 6.
1722 Like "(*PRUNE)", this verb always matches, and when backtracked
1723 into on failure, it causes the regex engine to try the next
1724 alternation in the innermost enclosing group (capturing or
1725 otherwise) that has alternations. The two branches of a
1726 "(?(condition)yes-pattern|no-pattern)" do not count as an
1727 alternation, as far as "(*THEN)" is concerned.
1728
1729 Its name comes from the observation that this operation combined
1730 with the alternation operator ("|") can be used to create what
1731 is essentially a pattern-based if/then/else block:
1732
1733 ( COND (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )
1734
1735 Note that if this operator is used and NOT inside of an
1736 alternation then it acts exactly like the "(*PRUNE)" operator.
1737
1738 / A (*PRUNE) B /
1739
1740 is the same as
1741
1742 / A (*THEN) B /
1743
1744 but
1745
1746 / ( A (*THEN) B | C (*THEN) D ) /
1747
1748 is not the same as
1749
1750 / ( A (*PRUNE) B | C (*PRUNE) D ) /
1751
1752 as after matching the A but failing on the B the "(*THEN)" verb
1753 will backtrack and try C; but the "(*PRUNE)" verb will simply
1754 fail.
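
The difference can be sketched with concrete patterns substituted for
A, B, C and D. The choice of "a", "b", "a" and "d", the "^" anchor and
the test string are assumptions made for the illustration.

```perl
use strict;
use warnings;

# After "a" matches and "b" fails, (*THEN) moves on to the second
# alternative, which matches "ad".
my $then  = 'ad' =~ /^(?:a(*THEN)b|a(*THEN)d)/   ? 'match' : 'no match';

# With (*PRUNE) the same failure abandons this starting position, and
# the anchor rules out any other.
my $prune = 'ad' =~ /^(?:a(*PRUNE)b|a(*PRUNE)d)/ ? 'match' : 'no match';

print "THEN: $then\n";
print "PRUNE: $prune\n";
```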
1755
1756 Verbs without an argument
1757 "(*COMMIT)"
1758 This is the Perl 6 "commit pattern" "<commit>" or ":::". It's a
1759 zero-width pattern similar to "(*SKIP)", except that when
1760 backtracked into on failure it causes the match to fail
1761 outright. No further attempts to find a valid match by advancing
1762 the start pointer will occur again. For example,
1763
1764 'aaabaaab' =~ /a+b?(*COMMIT)(?{print "$&\n"; $count++})(*FAIL)/;
1765 print "Count=$count\n";
1766
1767 outputs
1768
1769 aaab
1770 Count=1
1771
1772 In other words, once the "(*COMMIT)" has been entered, and if
1773 the pattern does not match, the regex engine will not try any
1774 further matching on the rest of the string.
1775
1776 "(*FAIL)" "(*F)"
1777 This pattern matches nothing and always fails. It can be used to
1778 force the engine to backtrack. It is equivalent to "(?!)", but
1779 easier to read. In fact, "(?!)" gets optimised into "(*FAIL)"
1780 internally.
1781
1782 It is probably useful only when combined with "(?{})" or
1783 "(??{})".
1784
1785 "(*ACCEPT)"
1786 WARNING: This feature is highly experimental. It is not
1787 recommended for production code.
1788
1789 This pattern matches nothing and causes the end of successful
1790 matching at the point at which the "(*ACCEPT)" pattern was
1791 encountered, regardless of whether there is actually more to
1792 match in the string. When inside of a nested pattern, such as
1793 recursion, or in a subpattern dynamically generated via
1794 "(??{})", only the innermost pattern is ended immediately.
1795
1796 If the "(*ACCEPT)" is inside of capturing groups then the groups
1797 are marked as ended at the point at which the "(*ACCEPT)" was
1798 encountered. For instance:
1799
1800 'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;
1801
1802 will match, and $1 will be "AB" and $2 will be "B", $3 will not
1803 be set. If another branch in the inner parentheses was matched,
1804 such as in the string 'ACDE', then the "D" and "E" would have to
1805 be matched as well.
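
The two cases described above, written as a runnable sketch. The array
names are assumptions; the pattern and both strings come from the
text.

```perl
use strict;
use warnings;

# In list context a successful match returns the capture groups, with
# undef for any group that did not take part in the match.
my @ab   = 'AB'   =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;
my @acde = 'ACDE' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;

print 'AB:   ', join(' ', map { $_ // 'undef' } @ab),   "\n";
print 'ACDE: ', join(' ', map { $_ // 'undef' } @acde), "\n";
```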
1806
1807 Backtracking
1808 NOTE: This section presents an abstract approximation of regular
1809 expression behavior. For a more rigorous (and complicated) view of the
1810 rules involved in selecting a match among possible alternatives, see
1811 "Combining RE Pieces".
1812
A fundamental feature of regular expression matching involves the
notion called backtracking, which is currently used (when needed) by
all non-possessive regular expression quantifiers, namely "*", "*?",
"+", "+?", "{n,m}", and "{n,m}?". Backtracking is often optimized
internally, but the general principle outlined here is valid.
1818
1819 For a regular expression to match, the entire regular expression must
1820 match, not just part of it. So if the beginning of a pattern
1821 containing a quantifier succeeds in a way that causes later parts in
1822 the pattern to fail, the matching engine backs up and recalculates the
1823 beginning part--that's why it's called backtracking.
1824
1825 Here is an example of backtracking: Let's say you want to find the
1826 word following "foo" in the string "Food is on the foo table.":
1827
1828 $_ = "Food is on the foo table.";
1829 if ( /\b(foo)\s+(\w+)/i ) {
1830 print "$2 follows $1.\n";
1831 }
1832
1833 When the match runs, the first part of the regular expression
1834 ("\b(foo)") finds a possible match right at the beginning of the
1835 string, and loads up $1 with "Foo". However, as soon as the matching
1836 engine sees that there's no whitespace following the "Foo" that it had
1837 saved in $1, it realizes its mistake and starts over again one
1838 character after where it had the tentative match. This time it goes
1839 all the way until the next occurrence of "foo". The complete regular
1840 expression matches this time, and you get the expected output of "table
1841 follows foo."
1842
1843 Sometimes minimal matching can help a lot. Imagine you'd like to match
1844 everything between "foo" and "bar". Initially, you write something
1845 like this:
1846
1847 $_ = "The food is under the bar in the barn.";
1848 if ( /foo(.*)bar/ ) {
1849 print "got <$1>\n";
1850 }
1851
1852 Which perhaps unexpectedly yields:
1853
1854 got <d is under the bar in the >
1855
1856 That's because ".*" was greedy, so you get everything between the first
1857 "foo" and the last "bar". Here it's more effective to use minimal
1858 matching to make sure you get the text between a "foo" and the first
1859 "bar" thereafter.
1860
1861 if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
1862 got <d is under the >
1863
1864 Here's another example. Let's say you'd like to match a number at the
1865 end of a string, and you also want to keep the preceding part of the
1866 match. So you write this:
1867
1868 $_ = "I have 2 numbers: 53147";
1869 if ( /(.*)(\d*)/ ) { # Wrong!
1870 print "Beginning is <$1>, number is <$2>.\n";
1871 }
1872
1873 That won't work at all, because ".*" was greedy and gobbled up the
1874 whole string. As "\d*" can match on an empty string the complete
1875 regular expression matched successfully.
1876
1877 Beginning is <I have 2 numbers: 53147>, number is <>.
1878
1879 Here are some variants, most of which don't work:
1880
1881 $_ = "I have 2 numbers: 53147";
1882 @pats = qw{
1883 (.*)(\d*)
1884 (.*)(\d+)
1885 (.*?)(\d*)
1886 (.*?)(\d+)
1887 (.*)(\d+)$
1888 (.*?)(\d+)$
1889 (.*)\b(\d+)$
1890 (.*\D)(\d+)$
1891 };
1892
1893 for $pat (@pats) {
1894 printf "%-12s ", $pat;
1895 if ( /$pat/ ) {
1896 print "<$1> <$2>\n";
1897 } else {
1898 print "FAIL\n";
1899 }
1900 }
1901
1902 That will print out:
1903
1904 (.*)(\d*) <I have 2 numbers: 53147> <>
1905 (.*)(\d+) <I have 2 numbers: 5314> <7>
1906 (.*?)(\d*) <> <>
1907 (.*?)(\d+) <I have > <2>
1908 (.*)(\d+)$ <I have 2 numbers: 5314> <7>
1909 (.*?)(\d+)$ <I have 2 numbers: > <53147>
1910 (.*)\b(\d+)$ <I have 2 numbers: > <53147>
1911 (.*\D)(\d+)$ <I have 2 numbers: > <53147>
1912
1913 As you see, this can be a bit tricky. It's important to realize that a
1914 regular expression is merely a set of assertions that gives a
1915 definition of success. There may be 0, 1, or several different ways
1916 that the definition might succeed against a particular string. And if
1917 there are multiple ways it might succeed, you need to understand
1918 backtracking to know which variety of success you will achieve.
1919
1920 When using look-ahead assertions and negations, this can all get even
1921 trickier. Imagine you'd like to find a sequence of non-digits not
1922 followed by "123". You might try to write that as
1923
1924 $_ = "ABC123";
1925 if ( /^\D*(?!123)/ ) { # Wrong!
1926 print "Yup, no 123 in $_\n";
1927 }
1928
1929 But that isn't going to match; at least, not the way you're hoping. It
1930 claims that there is no 123 in the string. Here's a clearer picture of
1931 why that pattern matches, contrary to popular expectations:
1932
1933 $x = 'ABC123';
1934 $y = 'ABC445';
1935
1936 print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
1937 print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;
1938
1939 print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
1940 print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;
1941
1942 This prints
1943
1944 2: got ABC
1945 3: got AB
1946 4: got ABC
1947
You might have expected test 3 to fail because it seems to be a more
general purpose version of test 1. The important difference between
them is that test 3 contains a quantifier ("\D*") and so can use
backtracking, whereas test 1 will not. What's happening is that you've
asked "Is it true that at the start of $x, following 0 or more
non-digits, you have something that's not 123?" If the pattern matcher
had let "\D*" expand to "ABC", this would have caused the whole
pattern to fail.
1956
1957 The search engine will initially match "\D*" with "ABC". Then it will
1958 try to match "(?!123)" with "123", which fails. But because a
1959 quantifier ("\D*") has been used in the regular expression, the search
1960 engine can backtrack and retry the match differently in the hope of
1961 matching the complete regular expression.
1962
1963 The pattern really, really wants to succeed, so it uses the standard
1964 pattern back-off-and-retry and lets "\D*" expand to just "AB" this
1965 time. Now there's indeed something following "AB" that is not "123".
1966 It's "C123", which suffices.
1967
1968 We can deal with this by using both an assertion and a negation. We'll
1969 say that the first part in $1 must be followed both by a digit and by
1970 something that's not "123". Remember that the look-aheads are zero-
1971 width expressions--they only look, but don't consume any of the string
1972 in their match. So rewriting this way produces what you'd expect; that
1973 is, case 5 will fail, but case 6 succeeds:
1974
1975 print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
1976 print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;
1977
1978 6: got ABC
1979
1980 In other words, the two zero-width assertions next to each other work
1981 as though they're ANDed together, just as you'd use any built-in
1982 assertions: "/^$/" matches only if you're at the beginning of the line
1983 AND the end of the line simultaneously. The deeper underlying truth is
1984 that juxtaposition in regular expressions always means AND, except when
1985 you write an explicit OR using the vertical bar. "/ab/" means match
1986 "a" AND (then) match "b", although the attempted matches are made at
1987 different positions because "a" is not a zero-width assertion, but a
1988 one-width assertion.
1989
1990 WARNING: Particularly complicated regular expressions can take
1991 exponential time to solve because of the immense number of possible
1992 ways they can use backtracking to try for a match. For example,
1993 without internal optimizations done by the regular expression engine,
1994 this will take a painfully long time to run:
1995
1996 'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/
1997
1998 And if you used "*"'s in the internal groups instead of limiting them
1999 to 0 through 5 matches, then it would take forever--or until you ran
2000 out of stack space. Moreover, these internal optimizations are not
2001 always applicable. For example, if you put "{0,5}" instead of "*" on
2002 the external group, no current optimization is applicable, and the
2003 match takes a long time to finish.
2004
2005 A powerful tool for optimizing such beasts is what is known as an
2006 "independent group", which does not backtrack (see ""(?>pattern)"").
2007 Note also that zero-length look-ahead/look-behind assertions will not
2008 backtrack to make the tail match, since they are in "logical" context:
2009 only whether they match is considered relevant. For an example where
2010 side-effects of look-ahead might have influenced the following match,
2011 see ""(?>pattern)"".
2012
2013 Version 8 Regular Expressions
2014 In case you're not familiar with the "regular" Version 8 regex
2015 routines, here are the pattern-matching rules not described above.
2016
2017 Any single character matches itself, unless it is a metacharacter with
2018 a special meaning described here or above. You can cause characters
2019 that normally function as metacharacters to be interpreted literally by
2020 prefixing them with a "\" (e.g., "\." matches a ".", not any character;
2021 "\\" matches a "\"). This escape mechanism is also required for the
2022 character used as the pattern delimiter.
2023
2024 A series of characters matches that series of characters in the target
2025 string, so the pattern "blurfl" would match "blurfl" in the target
2026 string.
2027
2028 You can specify a character class, by enclosing a list of characters in
2029 "[]", which will match any character from the list. If the first
2030 character after the "[" is "^", the class matches any character not in
2031 the list. Within a list, the "-" character specifies a range, so that
2032 "a-z" represents all characters between "a" and "z", inclusive. If you
2033 want either "-" or "]" itself to be a member of a class, put it at the
2034 start of the list (possibly after a "^"), or escape it with a
2035 backslash. "-" is also taken literally when it is at the end of the
2036 list, just before the closing "]". (The following all specify the same
2037 class of three characters: "[-az]", "[az-]", and "[a\-z]". All are
2038 different from "[a-z]", which specifies a class containing twenty-six
2039 characters, even on EBCDIC-based character sets.) Also, if you try to
2040 use the character classes "\w", "\W", "\s", "\S", "\d", or "\D" as
2041 endpoints of a range, the "-" is understood literally.
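
The claim about the three-character classes can be checked with a small
sketch that lists which of the characters "-" and "a" through "z" each
class accepts. The candidate list and the variable names are
assumptions made for the illustration.

```perl
use strict;
use warnings;

my @candidates = ('-', 'a' .. 'z');
my %members;
for my $class ('[-az]', '[az-]', '[a\-z]', '[a-z]') {
    my $re = qr/\A$class\z/;
    # Collect every candidate character the class accepts.
    $members{$class} = join '', grep { $_ =~ $re } @candidates;
    printf "%-7s %s\n", $class, $members{$class};
}
```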
2042
Note also that the whole range idea is rather unportable between
character sets--and even within a character set, ranges may cause
results you probably didn't expect. A sound principle is to use only
ranges that begin from and end at either alphabetics of equal case
([a-e], [A-E]), or digits ([0-9]). Anything else is unsafe. If in
doubt, spell out the character sets in full.
2049
2050 Characters may be specified using a metacharacter syntax much like that
2051 used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
2052 "\f" a form feed, etc. More generally, \nnn, where nnn is a string of
2053 three octal digits, matches the character whose coded character set
2054 value is nnn. Similarly, \xnn, where nn are hexadecimal digits,
2055 matches the character whose ordinal is nn. The expression \cx matches
2056 the character control-x. Finally, the "." metacharacter matches any
2057 character except "\n" (unless you use "/s").
2058
2059 You can specify a series of alternatives for a pattern using "|" to
2060 separate them, so that "fee|fie|foe" will match any of "fee", "fie", or
2061 "foe" in the target string (as would "f(e|i|o)e"). The first
2062 alternative includes everything from the last pattern delimiter ("(",
2063 "(?:", etc. or the beginning of the pattern) up to the first "|", and
2064 the last alternative contains everything from the last "|" to the next
2065 closing pattern delimiter. That's why it's common practice to include
2066 alternatives in parentheses: to minimize confusion about where they
2067 start and end.
2068
2069 Alternatives are tried from left to right, so the first alternative
2070 found for which the entire expression matches, is the one that is
2071 chosen. This means that alternatives are not necessarily greedy. For
2072 example: when matching "foo|foot" against "barefoot", only the "foo"
2073 part will match, as that is the first alternative tried, and it
2074 successfully matches the target string. (This might not seem important,
2075 but it is important when you are capturing matched text using
2076 parentheses.)
2077
2078 Also remember that "|" is interpreted as a literal within square
2079 brackets, so if you write "[fee|fie|foe]" you're really only matching
2080 "[feio|]".
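
Both points, the left-to-right choice between alternatives and the
literal "|" inside square brackets, can be seen in a short sketch. The
capturing parentheses, the string "x|y" and the variable names are
assumptions added for the illustration.

```perl
use strict;
use warnings;

# The leftmost alternative that lets the whole pattern match wins,
# even when a later alternative would match more text.
my ($first)  = 'barefoot' =~ /(foo|foot)/;
my ($second) = 'barefoot' =~ /(foot|foo)/;

# Inside square brackets "|" is an ordinary member of the class.
my ($hit) = 'x|y' =~ /([fee|fie|foe])/;

print "foo|foot captured: $first\n";
print "foot|foo captured: $second\n";
print "bracketed class matched: $hit\n";
```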
2081
2082 Within a pattern, you may designate subpatterns for later reference by
2083 enclosing them in parentheses, and you may refer back to the nth
2084 subpattern later in the pattern using the metacharacter \n or \gn.
2085 Subpatterns are numbered based on the left to right order of their
2086 opening parenthesis. A backreference matches whatever actually matched
2087 the subpattern in the string being examined, not the rules for that
2088 subpattern. Therefore, "(0|0x)\d*\s\g1\d*" will match "0x1234 0x4321",
2089 but not "0x1234 01234", because subpattern 1 matched "0x", even though
2090 the rule "0|0x" could potentially match the leading 0 in the second
2091 number.
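
The backreference example above as a runnable sketch; the variable
names are assumptions.

```perl
use strict;
use warnings;

my $re = qr/(0|0x)\d*\s\g1\d*/;

# "\g1" must repeat the text that group 1 actually matched ("0x"),
# not merely satisfy the rule "0|0x" again.
my $same_prefix  = '0x1234 0x4321' =~ $re ? 'match' : 'no match';
my $mixed_prefix = '0x1234 01234'  =~ $re ? 'match' : 'no match';

print "0x1234 0x4321: $same_prefix\n";
print "0x1234 01234: $mixed_prefix\n";
```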
2092
2093 Warning on \1 Instead of $1
2094 Some people get too used to writing things like:
2095
2096 $pattern =~ s/(\W)/\\\1/g;
2097
2098 This is grandfathered (for \1 to \9) for the RHS of a substitute to
2099 avoid shocking the sed addicts, but it's a dirty habit to get into.
2100 That's because in PerlThink, the righthand side of an "s///" is a
2101 double-quoted string. "\1" in the usual double-quoted string means a
2102 control-A. The customary Unix meaning of "\1" is kludged in for
2103 "s///". However, if you get into the habit of doing that, you get
2104 yourself into trouble if you then add an "/e" modifier.
2105
2106 s/(\d+)/ \1 + 1 /eg; # causes warning under -w
2107
2108 Or if you try to do
2109
2110 s/(\d+)/\1000/;
2111
2112 You can't disambiguate that by saying "\{1}000", whereas you can fix it
2113 with "${1}000". The operation of interpolation should not be confused
2114 with the operation of matching a backreference. Certainly they mean
2115 two different things on the left side of the "s///".
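
A brief sketch of the recommended spellings, using $1 on the right-hand
side. The input strings and variable names are assumptions made for the
illustration.

```perl
use strict;
use warnings;

# "${1}000" is unambiguous: the first capture followed by "000".
(my $scaled = 'cost: 7') =~ s/(\d+)/${1}000/;

# With /e the replacement is Perl code, where $1 is an ordinary scalar.
(my $bumped = 'a1 b2') =~ s/(\d+)/ $1 + 1 /eg;

print "$scaled\n";
print "$bumped\n";
```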
2116
2117 Repeated Patterns Matching a Zero-length Substring
2118 WARNING: Difficult material (and prose) ahead. This section needs a
2119 rewrite.
2120
2121 Regular expressions provide a terse and powerful programming language.
2122 As with most other power tools, power comes together with the ability
2123 to wreak havoc.
2124
2125 A common abuse of this power stems from the ability to make infinite
2126 loops using regular expressions, with something as innocuous as:
2127
2128 'foo' =~ m{ ( o? )* }x;
2129
2130 The "o?" matches at the beginning of 'foo', and since the position in
2131 the string is not moved by the match, "o?" would match again and again
2132 because of the "*" quantifier. Another common way to create a similar
2133 cycle is with the looping modifier "//g":
2134
2135 @matches = ( 'foo' =~ m{ o? }xg );
2136
2137 or
2138
2139 print "match: <$&>\n" while 'foo' =~ m{ o? }xg;
2140
2141 or the loop implied by split().
2142
However, long experience has shown that many programming tasks may be
significantly simplified by using repeated subexpressions that may
match zero-length substrings. Here is a simple example:
2146
2147 @chars = split //, $string; # // is not magic in split
2148 ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
2149
2150 Thus Perl allows such constructs, by forcefully breaking the infinite
2151 loop. The rules for this are different for lower-level loops given by
2152 the greedy quantifiers "*+{}", and for higher-level ones like the "/g"
2153 modifier or split() operator.
2154
2155 The lower-level loops are interrupted (that is, the loop is broken)
2156 when Perl detects that a repeated expression matched a zero-length
2157 substring. Thus
2158
2159 m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
2160
2161 is made equivalent to
2162
2163 m{ (?: NON_ZERO_LENGTH )* (?: ZERO_LENGTH )? }x;
2164
2165 For example, this program
2166
2167 #!perl -l
2168 "aaaaab" =~ /
2169   (?:
2170      a                  # non-zero
2171      |                  # or
2172     (?{print "hello"})  # print hello whenever this
2173                         #    branch is tried
2174     (?=(b))             # zero-width assertion
2175   )*                    # any number of times
2176  /x;
2177 print $&;
2178 print $1;
2179
2180 prints
2181
2182 hello
2183 aaaaa
2184 b
2185
2186 Notice that "hello" is printed only once: when Perl sees that the
2187 sixth iteration of the outermost "(?:)*" matches a zero-length string,
2188 it stops the "*".
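
The equivalence stated above can be checked with a small sketch that
runs both forms on the same string and compares what they capture. An
outer capture group is added so that $& is not needed, and a counter
replaces the print; both, along with all variable names, are
assumptions made for this illustration.

```perl
use strict;
use warnings;

# Package variable so the (?{ }) block can update it.
our $zero_branch_runs = 0;

my $text = 'aaaaab';

# Form with a zero-length alternative inside the loop.
my ($looped, $looped_peek) =
    $text =~ /( (?: a | (?{ $zero_branch_runs++ }) (?=(b)) )* )/x;

# Form the text describes as equivalent: loop first, zero-length part once.
my ($split_up, $split_peek) =
    $text =~ /( (?: a )* (?: (?=(b)) )? )/x;

print "$looped $looped_peek\n";      # aaaaa b
print "$split_up $split_peek\n";     # aaaaa b
print "$zero_branch_runs\n";         # 1
```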
2189
2190 The higher-level loops preserve an additional state between iterations:
2191 whether the last match was zero-length. To break the loop, the match
2192 that follows a zero-length match is prohibited from having a length of
2193 zero. This prohibition interacts with backtracking (see
2194 "Backtracking"), so the second-best match is chosen if the best match
2195 is of zero length.
2196
2197 For example:
2198
2199 $_ = 'bar';
2200 s/\w??/<$&>/g;
2201
2202 results in "<><b><><a><><r><>". At each position of the string the
2203 best match given by non-greedy "??" is the zero-length match, and the
2204 second best match is what is matched by "\w". Thus zero-length matches
2205 alternate with one-character-long matches.
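
The alternation can also be observed one match at a time. The sketch
below repeats the substitution on a copy and then walks the same
string with "m//g" in scalar context, recording the length of each
match; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = 'bar';

# Same substitution as above, applied to a copy of the string.
(my $marked = $text) =~ s/\w??/<$&>/g;

# Walk the string with m//g and record how long each match is.
my @match_lengths;
while ($text =~ /(\w??)/g) {
    push @match_lengths, length $1;
}

print "$marked\n";           # <><b><><a><><r><>
print "@match_lengths\n";    # 0 1 0 1 0 1 0
```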
2206
2207 Similarly, for repeated "m/()/g" the second-best match is the match at
2208 the position one notch further in the string.
2209
2210 This additional state (whether the last match was zero-length) is
2211 associated with the matched string, and is reset by each assignment to
2212 pos(). Zero-length matches at the end of the previous match are ignored
2213 during "split".
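
A minimal sketch of both points, assuming two short sample strings:
without an assignment to pos() the second match has to move on, while
assigning pos() to itself clears the state and allows a zero-length
match at the same offset again. The split call is the familiar example
of a pattern that can match the empty string.

```perl
use strict;
use warnings;

my $plain = 'ab';
$plain =~ /x*/g or die;          # zero-length match at offset 0
$plain =~ /x*/g or die;          # may not end at offset 0 again
my $plain_pos = pos $plain;

my $reset = 'ab';
$reset =~ /x*/g or die;          # zero-length match at offset 0
pos($reset) = pos($reset);       # assignment clears the zero-length state
$reset =~ /x*/g or die;          # zero-length match at offset 0 allowed again
my $reset_pos = pos $reset;

# After the space is matched, the empty match at the same spot is ignored,
# so no empty field appears between "i" and "t".
my @fields = split / */, 'hi there';

print "$plain_pos $reset_pos\n";     # 1 0
print join(':', @fields), "\n";      # h:i:t:h:e:r:e
```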
2214
2215 Combining RE Pieces
2216 Each of the elementary pieces of regular expressions which were
2217 described before (such as "ab" or "\Z") could match at most one
2218 substring at the given position of the input string. However, in a
2219 typical regular expression these elementary pieces are combined into
2220 more complicated patterns using combining operators "ST", "S|T", "S*"
2221 etc. (in these examples "S" and "T" are regular subexpressions).
2222
2223 Such combinations can include alternatives, leading to a problem of
2224 choice: if we match a regular expression "a|ab" against "abc", will it
2225 match substring "a" or "ab"? One way to describe which substring is
2226 actually matched is the concept of backtracking (see "Backtracking").
2227 However, this description is too low-level and makes you think in terms
2228 of a particular implementation.
2229
2230 Another description starts with notions of "better"/"worse". All the
2231 substrings which may be matched by the given regular expression can be
2232 sorted from the "best" match to the "worst" match, and it is the "best"
2233 match which is chosen. This replaces the question "what is chosen?"
2234 with the question "which matches are better, and which are
2235 worse?".
2236
2237 Again, for elementary pieces there is no such question, since at most
2238 one match at a given position is possible. This section describes the
2239 notion of better/worse for combining operators. In the description
2240 below "S" and "T" are regular subexpressions.
2241
2242 "ST"
2243 Consider two possible matches, "AB" and "A'B'", where "A" and "A'"
2244 are substrings which can be matched by "S", and "B" and "B'" are
2245 substrings which can be matched by "T".
2246
2247 If "A" is a better match for "S" than "A'", "AB" is a better match
2248 than "A'B'".
2249
2250 If "A" and "A'" coincide: "AB" is a better match than "AB'" if "B"
2251 is a better match for "T" than "B'".
2252
2253 "S|T"
2254 When "S" can match, it is a better match than when only "T" can
2255 match.
2256
2257 The ordering of two matches that both use "S" is the same as their
2258 ordering for "S" alone. Similarly for two matches for "T".
2259
2260 "S{REPEAT_COUNT}"
2261 Matches as "SSS...S" (with "S" repeated REPEAT_COUNT times).
2262
2263 "S{min,max}"
2264 Matches as "S{max}|S{max-1}|...|S{min+1}|S{min}".
2265
2266 "S{min,max}?"
2267 Matches as "S{min}|S{min+1}|...|S{max-1}|S{max}".
2268
2269 "S?", "S*", "S+"
2270 Same as "S{0,1}", "S{0,BIG_NUMBER}", "S{1,BIG_NUMBER}"
2271 respectively.
2272
2273 "S??", "S*?", "S+?"
2274 Same as "S{0,1}?", "S{0,BIG_NUMBER}?", "S{1,BIG_NUMBER}?"
2275 respectively.
2276
2277 "(?>S)"
2278 Matches the best match for "S" and only that.
2279
2280 "(?=S)", "(?<=S)"
2281 Only the best match for "S" is considered. (This is important only
2282 if "S" has capturing parentheses, and backreferences are used
2283 somewhere else in the whole regular expression.)
2284
2285 "(?!S)", "(?<!S)"
2286 For this grouping operator there is no need to describe the
2287 ordering, since only whether or not "S" can match is important.
2288
2289 "(??{ EXPR })", "(?PARNO)"
2290 The ordering is the same as for the regular expression which is the
2291 result of EXPR, or the pattern contained by capture group PARNO.
2292
2293 "(?(condition)yes-pattern|no-pattern)"
2294 Recall that which of "yes-pattern" or "no-pattern" actually matches
2295 is already determined. The ordering of the matches is the same as
2296 for the chosen subexpression.
2297
2298 The above recipes describe the ordering of matches at a given position.
2299 One more rule is needed to understand how a match is determined for the
2300 whole regular expression: a match at an earlier position is always
2301 better than a match at a later position.
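
A few of these ordering rules can be observed directly. The patterns
and strings in this sketch are illustrative choices, not taken from
the text above, except for "a|ab" against "abc".

```perl
use strict;
use warnings;

# S|T: a match through the first alternative is better, even if shorter.
my ($alt) = 'abc' =~ /(a|ab)/;

# ST: the best match for S is kept as long as T can still match after it.
my ($s, $t, $rest) = 'abcd' =~ /(a|ab)(c|bcd)(d*)/;

# S{min,max} prefers more repetitions; S{min,max}? prefers fewer.
my ($greedy) = 'aaa' =~ /(a{1,2})/;
my ($lazy)   = 'aaa' =~ /(a{1,2}?)/;

# (?>S) keeps only the best match for S, so it cannot give back an "a".
my $atomic = 'aaab' =~ /(?>a*)ab/ ? 1 : 0;
my $normal = 'aaab' =~ /a*ab/     ? 1 : 0;

# An earlier position beats an earlier alternative found further right.
my ($earliest) = 'xab' =~ /(ab|x)/;

print "$alt\n";                  # a
print "$s|$t|$rest\n";           # a|bcd|
print "$greedy $lazy\n";         # aa a
print "$atomic $normal\n";       # 0 1
print "$earliest\n";             # x
```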
2302
2303 Creating Custom RE Engines
2304 As of Perl 5.10.0, one can create custom regular expression engines.
2305 This is not for the faint of heart, as such engines have to plug in at
2306 the C level. See perlreapi for more details.
2307
2308 As an alternative, overloaded constants (see overload) provide a simple
2309 way to extend the functionality of the RE engine, by substituting one
2310 pattern for another.
2311
2312 Suppose that we want to enable a new RE escape-sequence "\Y|" which
2313 matches at a boundary between whitespace characters and non-whitespace
2314 characters. Note that "(?=\S)(?<!\S)|(?!\S)(?<=\S)" matches exactly at
2315 these positions, so we want to have each "\Y|" in the place of the more
2316 complicated version. We can create a module "customre" to do this:
2317
2318 package customre;
2319 use overload;
2320
2321 sub import {
2322   shift;
2323   die "No argument to customre::import allowed" if @_;
2324   overload::constant 'qr' => \&convert;
2325 }
2326
2327 sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}
2328
2329 # We must also take care of not escaping the legitimate \\Y|
2330 # sequence, hence the presence of '\\' in the conversion rules.
2331 my %rules = ( '\\' => '\\\\',
2332               'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
2333 sub convert {
2334   my $re = shift;
2335   $re =~ s{
2336             \\ ( \\ | Y . )
2337           }
2338           { $rules{$1} or invalid($re,$1) }sgex;
2339   return $re;
2340 }
2341
2342 Now "use customre" enables the new escape in constant regular
2343 expressions, i.e., those without any runtime variable interpolations.
2344 As documented in overload, this conversion will work only over literal
2345 parts of regular expressions. For "\Y|$re\Y|" the variable part of
2346 this regular expression needs to be converted explicitly (but only if
2347 the special meaning of "\Y|" should be enabled inside $re):
2348
2349 use customre;
2350 $re = <>;
2351 chomp $re;
2352 $re = customre::convert $re;
2353 /\Y|$re\Y|/;
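
The pattern that "\Y|" stands for can also be tried on its own,
without the module. This sketch lists the offsets at which it matches
in a sample string; the string (with two spaces in the middle) and the
variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

# The expansion used for "\Y|" in the conversion rules above.
my $boundary = qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/;

my $text = 'ab  cd';    # sample input: two words separated by two spaces

# Each match is zero-length, so pos() gives the offset of the boundary.
my @offsets;
while ($text =~ /$boundary/g) {
    push @offsets, pos $text;
}

print "@offsets\n";     # 0 2 4 6
```

No match is reported at offset 3, between the two spaces, since
neither neighbour there is a non-whitespace character.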
2354
2355 PCRE/Python Support
2356 As of Perl 5.10.0, Perl supports several Python/PCRE-specific
2357 extensions to the regex syntax. While Perl programmers are encouraged
2358 to use the Perl-specific syntax, the following are also accepted:
2359
2360 "(?P<NAME>pattern)"
2361 Define a named capture group. Equivalent to "(?<NAME>pattern)".
2362
2363 "(?P=NAME)"
2364 Backreference to a named capture group. Equivalent to "\g{NAME}".
2365
2366 "(?P>NAME)"
2367 Subroutine call to a named capture group. Equivalent to "(?&NAME)".
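
The three forms can be sketched as follows. The sample strings, group
names and variable names are illustrative assumptions; the
balanced-parentheses pattern is only one possible use of a recursive
subroutine call.

```perl
use strict;
use warnings;

# (?P<NAME>pattern) fills %+ just as (?<NAME>pattern) does.
my ($year, $month);
if ('2013-03-04' =~ /(?P<year>\d{4})-(?P<month>\d\d)/) {
    ($year, $month) = ($+{year}, $+{month});
}

# (?P=NAME) refers back to the text captured by the named group.
my $doubled     = 'abab' =~ /^(?P<pair>ab)(?P=pair)$/ ? 1 : 0;
my $not_doubled = 'abba' =~ /^(?P<pair>ab)(?P=pair)$/ ? 1 : 0;

# (?P>NAME) re-enters the named group as a subroutine, allowing recursion.
my $balanced = qr/^(?P<group>\((?:[^()]++|(?P>group))*\))$/;
my $ok  = '(a(b)c)' =~ $balanced ? 1 : 0;
my $bad = '(a(b)c'  =~ $balanced ? 1 : 0;

print "$year $month\n";              # 2013 03
print "$doubled $not_doubled\n";     # 1 0
print "$ok $bad\n";                  # 1 0
```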
2368
2369 BUGS
2370 Many regular expression constructs don't work on EBCDIC platforms.
2371
2372 There are a number of issues with regard to case-insensitive matching
2373 in Unicode rules. See "i" under "Modifiers" above.
2374
2375 This document varies from difficult to understand to completely and
2376 utterly opaque. The wandering prose riddled with jargon is hard to
2377 fathom in several places.
2378
2379 This document needs a rewrite that separates the tutorial content from
2380 the reference content.
2381
2382 SEE ALSO
2383 perlrequick.
2384
2385 perlretut.
2386
2387 "Regexp Quote-Like Operators" in perlop.
2388
2389 "Gory details of parsing quoted constructs" in perlop.
2390
2391 perlfaq6.
2392
2393 "pos" in perlfunc.
2394
2395 perllocale.
2396
2397 perlebcdic.
2398
2399 Mastering Regular Expressions by Jeffrey Friedl, published by O'Reilly
2400 and Associates.
2401
2402
2403
2404perl v5.16.3 2013-03-04 PERLRE(1)