perlretut(1)

PERLRETUT(1)           Perl Programmers Reference Guide           PERLRETUT(1)

NAME

6       perlretut - Perl regular expressions tutorial
7

DESCRIPTION

9       This page provides a basic tutorial on understanding, creating and
10       using regular expressions in Perl.  It serves as a complement to the
11       reference page on regular expressions perlre.  Regular expressions are
12       an integral part of the "m//", "s///", "qr//" and "split" operators and
13       so this tutorial also overlaps with "Regexp Quote-Like Operators" in
14       perlop and "split" in perlfunc.
15
16       Perl is widely renowned for excellence in text processing, and regular
17       expressions are one of the big factors behind this fame.  Perl regular
18       expressions display an efficiency and flexibility unknown in most other
19       computer languages.  Mastering even the basics of regular expressions
20       will allow you to manipulate text with surprising ease.
21
22       What is a regular expression?  A regular expression is simply a string
23       that describes a pattern.  Patterns are in common use these days; exam‐
24       ples are the patterns typed into a search engine to find web pages and
25       the patterns used to list files in a directory, e.g., "ls *.txt" or
26       "dir *.*".  In Perl, the patterns described by regular expressions are
27       used to search strings, extract desired parts of strings, and to do
28       search and replace operations.
29
30       Regular expressions have the undeserved reputation of being abstract
31       and difficult to understand.  Regular expressions are constructed using
32       simple concepts like conditionals and loops and are no more difficult
33       to understand than the corresponding "if" conditionals and "while"
34       loops in the Perl language itself.  In fact, the main challenge in
35       learning regular expressions is just getting used to the terse notation
36       used to express these concepts.
37
38       This tutorial flattens the learning curve by discussing regular expres‐
39       sion concepts, along with their notation, one at a time and with many
40       examples.  The first part of the tutorial will progress from the sim‐
41       plest word searches to the basic regular expression concepts.  If you
42       master the first part, you will have all the tools needed to solve
43       about 98% of your needs.  The second part of the tutorial is for those
44       comfortable with the basics and hungry for more power tools.  It dis‐
45       cusses the more advanced regular expression operators and introduces
46       the latest cutting edge innovations in 5.6.0.
47
48       A note: to save time, 'regular expression' is often abbreviated as reg‐
49       exp or regex.  Regexp is a more natural abbreviation than regex, but is
50       harder to pronounce.  The Perl pod documentation is evenly split on
51       regexp vs regex; in Perl, there is more than one way to abbreviate it.
52       We'll use regexp in this tutorial.
53

Part 1: The basics

55       Simple word matching
56
57       The simplest regexp is simply a word, or more generally, a string of
58       characters.  A regexp consisting of a word matches any string that con‐
59       tains that word:
60
61           "Hello World" =~ /World/;  # matches
62
63       What is this perl statement all about? "Hello World" is a simple double
64       quoted string.  "World" is the regular expression and the "//" enclos‐
65       ing "/World/" tells perl to search a string for a match.  The operator
66       "=~" associates the string with the regexp match and produces a true
67       value if the regexp matched, or false if the regexp did not match.  In
68       our case, "World" matches the second word in "Hello World", so the
69       expression is true.  Expressions like this are useful in conditionals:
70
71           if ("Hello World" =~ /World/) {
72               print "It matches\n";
73           }
74           else {
75               print "It doesn't match\n";
76           }
77
       There are useful variations on this theme.  The sense of the match can
       be reversed by using the "!~" operator:
80
81           if ("Hello World" !~ /World/) {
82               print "It doesn't match\n";
83           }
84           else {
85               print "It matches\n";
86           }
87
88       The literal string in the regexp can be replaced by a variable:
89
90           $greeting = "World";
91           if ("Hello World" =~ /$greeting/) {
92               print "It matches\n";
93           }
94           else {
95               print "It doesn't match\n";
96           }
97
98       If you're matching against the special default variable $_, the "$_ =~"
99       part can be omitted:
100
101           $_ = "Hello World";
102           if (/World/) {
103               print "It matches\n";
104           }
105           else {
106               print "It doesn't match\n";
107           }
108
109       And finally, the "//" default delimiters for a match can be changed to
110       arbitrary delimiters by putting an 'm' out front:
111
112           "Hello World" =~ m!World!;   # matches, delimited by '!'
113           "Hello World" =~ m{World};   # matches, note the matching '{}'
114           "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
115                                        # '/' becomes an ordinary char
116
       "/World/", "m!World!", and "m{World}" all represent the same thing.
       When, e.g., the quote '"' is used as a delimiter, the forward slash '/'
       becomes an ordinary character and can be used in a regexp without
       trouble.
120
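       As a small sketch of the point above, the snippet below writes one
       pattern with three delimiter styles and counts how many of them match.
       The variable names ($path, @results, $matched) are illustrative
       assumptions and do not come from the tutorial.

```perl
use strict;
use warnings;

my $path = "/usr/bin/perl";

# One pattern, three delimiter styles.  Only the first needs
# backslashes, because '/' is its delimiter.
my @results = (
    scalar($path =~ /\/bin\/perl/),
    scalar($path =~ m!/bin/perl!),
    scalar($path =~ m{/bin/perl}),
);

my $matched = grep { $_ } @results;   # number of true results
print "$matched of 3 forms matched\n";
```
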
121       Let's consider how different regexps would match "Hello World":
122
123           "Hello World" =~ /world/;  # doesn't match
124           "Hello World" =~ /o W/;    # matches
125           "Hello World" =~ /oW/;     # doesn't match
126           "Hello World" =~ /World /; # doesn't match
127
128       The first regexp "world" doesn't match because regexps are case-sensi‐
129       tive.  The second regexp matches because the substring 'o W'  occurs in
130       the string "Hello World" .  The space character ' ' is treated like any
131       other character in a regexp and is needed to match in this case.  The
132       lack of a space character is the reason the third regexp 'oW' doesn't
133       match.  The fourth regexp 'World ' doesn't match because there is a
134       space at the end of the regexp, but not at the end of the string.  The
135       lesson here is that regexps must match a part of the string exactly in
136       order for the statement to be true.
137
138       If a regexp matches in more than one place in the string, perl will
139       always match at the earliest possible point in the string:
140
141           "Hello World" =~ /o/;       # matches 'o' in 'Hello'
142           "That hat is red" =~ /hat/; # matches 'hat' in 'That'
143
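       One way to see the "earliest possible point" rule is to ask perl where
       the match started.  The sketch below assumes the built-in array @-,
       whose first element $-[0] holds the offset of the start of the last
       successful match, and the "qr//" operator to pass a regexp to a sub.
       The helper name match_offset is hypothetical.

```perl
use strict;
use warnings;

# Return the offset where $regexp first matches in $string,
# or -1 if it does not match.  $-[0] holds the start of the
# most recent successful match.
sub match_offset {
    my ($string, $regexp) = @_;
    return $string =~ $regexp ? $-[0] : -1;
}

my $o_offset   = match_offset("Hello World",     qr/o/);
my $hat_offset = match_offset("That hat is red", qr/hat/);

print "o at $o_offset, hat at $hat_offset\n";
```

       Offsets count from zero, so the 'o' in 'Hello' and the 'hat' inside
       'That' are the ones reported, not the later occurrences.
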
144       With respect to character matching, there are a few more points you
145       need to know about.   First of all, not all characters can be used 'as
146       is' in a match.  Some characters, called metacharacters, are reserved
147       for use in regexp notation.  The metacharacters are
148
           {}[]()^$.|*+?\
150
151       The significance of each of these will be explained in the rest of the
152       tutorial, but for now, it is important only to know that a metacharac‐
153       ter can be matched by putting a backslash before it:
154
155           "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
156           "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
157           "The interval is [0,1)." =~ /[0,1)./     # is a syntax error!
158           "The interval is [0,1)." =~ /\[0,1\)\./  # matches
159           "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches
160
161       In the last regexp, the forward slash '/' is also backslashed, because
162       it is used to delimit the regexp.  This can lead to LTS (leaning tooth‐
163       pick syndrome), however, and it is often more readable to change delim‐
164       iters.
165
166           "/usr/bin/perl" =~ m!/usr/bin/perl!;    # easier to read
167
168       The backslash character '\' is a metacharacter itself and needs to be
169       backslashed:
170
171           'C:\WIN32' =~ /C:\\WIN/;   # matches
172
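       When the text to be matched literally is held in a variable,
       backslashing each metacharacter by hand is error-prone.  As a hedged
       sketch, the built-in quotemeta() function and the "\Q...\E" escape do
       the backslashing for you; neither is introduced by this tutorial at
       this point, and the variable names below are assumptions.

```perl
use strict;
use warnings;

my $text    = "The interval is [0,1).";
my $literal = "[0,1).";

# quotemeta() backslashes every character that is not a letter,
# digit, or underscore, so the result is safe to interpolate.
my $escaped = quotemeta($literal);

my $found_function = $text =~ /$escaped/     ? 1 : 0;
my $found_inline   = $text =~ /\Q$literal\E/ ? 1 : 0;

# The '+' example from the text, for comparison.
my $plus_unescaped = "2+2=4" =~ /2+2/  ? 1 : 0;
my $plus_escaped   = "2+2=4" =~ /2\+2/ ? 1 : 0;

print "escaped pattern: $escaped\n";
print "unescaped +: $plus_unescaped, escaped +: $plus_escaped\n";
```
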
173       In addition to the metacharacters, there are some ASCII characters
174       which don't have printable character equivalents and are instead repre‐
175       sented by escape sequences.  Common examples are "\t" for a tab, "\n"
176       for a newline, "\r" for a carriage return and "\a" for a bell.  If your
177       string is better thought of as a sequence of arbitrary bytes, the octal
178       escape sequence, e.g., "\033", or hexadecimal escape sequence, e.g.,
179       "\x1B" may be a more natural representation for your bytes.  Here are
180       some examples of escapes:
181
           "1000\t2000" =~ m(0\t2);   # matches
           "1000\n2000" =~ /0\n20/;   # matches
           "1000\t2000" =~ /\000\t2/; # doesn't match, "0" ne "\000"
           "cat"        =~ /\143\x61\x74/; # matches, but a weird way to spell cat
186
187       If you've been around Perl a while, all this talk of escape sequences
188       may seem familiar.  Similar escape sequences are used in double-quoted
189       strings and in fact the regexps in Perl are mostly treated as double-
190       quoted strings.  This means that variables can be used in regexps as
191       well.  Just like double-quoted strings, the values of the variables in
192       the regexp will be substituted in before the regexp is evaluated for
193       matching purposes.  So we have:
194
195           $foo = 'house';
196           'housecat' =~ /$foo/;      # matches
197           'cathouse' =~ /cat$foo/;   # matches
198           'housecat' =~ /${foo}cat/; # matches
199
200       So far, so good.  With the knowledge above you can already perform
201       searches with just about any literal string regexp you can dream up.
202       Here is a very simple emulation of the Unix grep program:
203
204           % cat > simple_grep
205           #!/usr/bin/perl
206           $regexp = shift;
207           while (<>) {
208               print if /$regexp/;
209           }
210           ^D
211
212           % chmod +x simple_grep
213
214           % simple_grep abba /usr/dict/words
215           Babbage
216           cabbage
217           cabbages
218           sabbath
219           Sabbathize
220           Sabbathizes
221           sabbatical
222           scabbard
223           scabbards
224
225       This program is easy to understand.  "#!/usr/bin/perl" is the standard
226       way to invoke a perl program from the shell.  "$regexp = shift;"  saves
227       the first command line argument as the regexp to be used, leaving the
228       rest of the command line arguments to be treated as files.
229       "while (<>)"  loops over all the lines in all the files.  For each
230       line, "print if /$regexp/;"  prints the line if the regexp matches the
231       line.  In this line, both "print" and "/$regexp/" use the default vari‐
232       able $_ implicitly.
233
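       The same idea can be tried without a word list on disk.  The sketch
       below is a variation, not the program above: a hypothetical sub named
       simple_grep takes the regexp and a list of lines as arguments, and the
       sample words are made up for illustration.

```perl
use strict;
use warnings;

# Return the lines that match $regexp, in their original order.
sub simple_grep {
    my ($regexp, @lines) = @_;
    return grep { /$regexp/ } @lines;
}

my @words = qw(Babbage cabbage sabbath rabbit abbot);
my @hits  = simple_grep('abba', @words);

print "$_\n" for @hits;
```

       As in the script, the pattern is interpolated into "/$regexp/", so
       whatever the caller passes is treated as a regexp rather than as a
       literal string.
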
234       With all of the regexps above, if the regexp matched anywhere in the
235       string, it was considered a match.  Sometimes, however, we'd like to
236       specify where in the string the regexp should try to match.  To do
237       this, we would use the anchor metacharacters "^" and "$".  The anchor
238       "^" means match at the beginning of the string and the anchor "$" means
239       match at the end of the string, or before a newline at the end of the
240       string.  Here is how they are used:
241
242           "housekeeper" =~ /keeper/;    # matches
243           "housekeeper" =~ /^keeper/;   # doesn't match
244           "housekeeper" =~ /keeper$/;   # matches
245           "housekeeper\n" =~ /keeper$/; # matches
246
247       The second regexp doesn't match because "^" constrains "keeper" to
248       match only at the beginning of the string, but "housekeeper" has keeper
249       starting in the middle.  The third regexp does match, since the "$"
250       constrains "keeper" to match only at the end of the string.
251
252       When both "^" and "$" are used at the same time, the regexp has to
253       match both the beginning and the end of the string, i.e., the regexp
254       matches the whole string.  Consider
255
256           "keeper" =~ /^keep$/;      # doesn't match
257           "keeper" =~ /^keeper$/;    # matches
258           ""       =~ /^$/;          # ^$ matches an empty string
259
260       The first regexp doesn't match because the string has more to it than
261       "keep".  Since the second regexp is exactly the string, it matches.
262       Using both "^" and "$" in a regexp forces the complete string to match,
263       so it gives you complete control over which strings match and which
264       don't.  Suppose you are looking for a fellow named bert, off in a
265       string by himself:
266
267           "dogbert" =~ /bert/;   # matches, but not what you want
268
269           "dilbert" =~ /^bert/;  # doesn't match, but ..
270           "bertram" =~ /^bert/;  # matches, so still not good enough
271
272           "bertram" =~ /^bert$/; # doesn't match, good
273           "dilbert" =~ /^bert$/; # doesn't match, good
274           "bert"    =~ /^bert$/; # matches, perfect
275
276       Of course, in the case of a literal string, one could just as easily
277       use the string equivalence "$string eq 'bert'"  and it would be more
278       efficient.   The  "^...$" regexp really becomes useful when we add in
279       the more powerful regexp tools below.
280
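       The effect of each anchor can be checked by filtering the names from
       the example with grep.  This is only a sketch; the array names are
       assumptions.

```perl
use strict;
use warnings;

my @names = qw(dogbert dilbert bertram bert);

my @contains = grep { /bert/   } @names;   # anywhere in the string
my @starts   = grep { /^bert/  } @names;   # at the beginning
my @whole    = grep { /^bert$/ } @names;   # the entire string
my @equal    = grep { $_ eq 'bert' } @names;

print "contains: @contains\n";
print "starts:   @starts\n";
print "whole:    @whole\n";
```

       For these names, "/^bert$/" and "eq 'bert'" select the same single
       element.
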
281       Using character classes
282
283       Although one can already do quite a lot with the literal string regexps
284       above, we've only scratched the surface of regular expression technol‐
285       ogy.  In this and subsequent sections we will introduce regexp concepts
286       (and associated metacharacter notations) that will allow a regexp to
287       not just represent a single character sequence, but a whole class of
288       them.
289
290       One such concept is that of a character class.  A character class
291       allows a set of possible characters, rather than just a single charac‐
292       ter, to match at a particular point in a regexp.  Character classes are
293       denoted by brackets "[...]", with the set of characters to be possibly
294       matched inside.  Here are some examples:
295
           /cat/;       # matches 'cat'
           /[bcr]at/;   # matches 'bat', 'cat', or 'rat'
           /item[0123456789]/;  # matches 'item0' or ... or 'item9'
           "abc" =~ /[cab]/;    # matches 'a'
300
301       In the last statement, even though 'c' is the first character in the
302       class, 'a' matches because the first character position in the string
303       is the earliest point at which the regexp can match.
304
305           /[yY][eE][sS]/;      # match 'yes' in a case-insensitive way
306                                # 'yes', 'Yes', 'YES', etc.
307
       This regexp displays a common task: perform a case-insensitive match.
       Perl provides a way of avoiding all those brackets by simply appending
       an 'i' to the end of the match.  Then "/[yY][eE][sS]/;" can be
       rewritten as "/yes/i;".  The 'i' stands for case-insensitive and is an
       example of a modifier of the matching operation.  We will meet other
       modifiers later in the tutorial.
314
315       We saw in the section above that there were ordinary characters, which
316       represented themselves, and special characters, which needed a back‐
317       slash "\" to represent themselves.  The same is true in a character
318       class, but the sets of ordinary and special characters inside a charac‐
319       ter class are different than those outside a character class.  The spe‐
320       cial characters for a character class are "-]\^$".  "]" is special
321       because it denotes the end of a character class.  "$" is special
322       because it denotes a scalar variable.  "\" is special because it is
323       used in escape sequences, just like above.  Here is how the special
324       characters "]$\" are handled:
325
           /[\]c]def/; # matches ']def' or 'cdef'
           $x = 'bcr';
           /[$x]at/;   # matches 'bat', 'cat', or 'rat'
           /[\$x]at/;  # matches '$at' or 'xat'
           /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'
331
       The last two are a little tricky.  In "[\$x]", the backslash protects
       the dollar sign, so the character class has two members "$" and "x".
       In "[\\$x]", the backslash is protected, so $x is treated as a variable
       and substituted in double quote fashion.
336
337       The special character '-' acts as a range operator within character
338       classes, so that a contiguous set of characters can be written as a
339       range.  With ranges, the unwieldy "[0123456789]" and "[abc...xyz]"
340       become the svelte "[0-9]" and "[a-z]".  Some examples are
341
342           /item[0-9]/;  # matches 'item0' or ... or 'item9'
343           /[0-9bx-z]aa/;  # matches '0aa', ..., '9aa',
344                           # 'baa', 'xaa', 'yaa', or 'zaa'
345           /[0-9a-fA-F]/;  # matches a hexadecimal digit
346           /[0-9a-zA-Z_]/; # matches a "word" character,
347                           # like those in a perl variable name
348
349       If '-' is the first or last character in a character class, it is
350       treated as an ordinary character; "[-ab]", "[ab-]" and "[a\-b]" are all
351       equivalent.
352
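       A quick way to experiment with ranges is to split a string into single
       characters and filter them through a class.  The sketch below does
       that, and also counts matches for the three equivalent spellings of a
       class containing '-'.  The sample strings and variable names are
       assumptions.

```perl
use strict;
use warnings;

# Split a string into single characters and keep the hex digits.
my @chars = split //, "Perl 5.6";
my @hex   = grep { /[0-9a-fA-F]/ } @chars;

# A '-' that is first, last, or backslashed is an ordinary character,
# so all three classes accept 'a', 'b', and '-'.
my @symbols = split //, "a-b-c";
my $first   = grep { /[-ab]/  } @symbols;
my $last    = grep { /[ab-]/  } @symbols;
my $escaped = grep { /[a\-b]/ } @symbols;

print "hex digits: @hex\n";
print "counts: $first $last $escaped\n";
```
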
353       The special character "^" in the first position of a character class
354       denotes a negated character class, which matches any character but
355       those in the brackets.  Both "[...]" and "[^...]" must match a charac‐
356       ter, or the match fails.  Then
357
           /[^a]at/;  # doesn't match 'aat' or 'at', but matches
                      # all other 'bat', 'cat', '0at', '%at', etc.
           /[^0-9]/;  # matches a non-numeric character
           /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary
362
       Now, even "[0-9]" can be a bother to write multiple times, so in the
       interest of saving keystrokes and making regexps more readable, Perl
       has several abbreviations for common character classes:
366
367       ·   \d is a digit and represents [0-9]
368
369       ·   \s is a whitespace character and represents [\ \t\r\n\f]
370
371       ·   \w is a word character (alphanumeric or _) and represents
372           [0-9a-zA-Z_]
373
374       ·   \D is a negated \d; it represents any character but a digit [^0-9]
375
376       ·   \S is a negated \s; it represents any non-whitespace character
377           [^\s]
378
379       ·   \W is a negated \w; it represents any non-word character [^\w]
380
381       ·   The period '.' matches any character but "\n"
382
383       The "\d\s\w\D\S\W" abbreviations can be used both inside and outside of
384       character classes.  Here are some in use:
385
386           /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
387           /[\d\s]/;         # matches any digit or whitespace character
388           /\w\W\w/;         # matches a word char, followed by a
389                             # non-word char, followed by a word char
390           /..rt/;           # matches any two chars, followed by 'rt'
391           /end\./;          # matches 'end.'
392           /end[.]/;         # same thing, matches 'end.'
393
394       Because a period is a metacharacter, it needs to be escaped to match as
395       an ordinary period. Because, for example, "\d" and "\w" are sets of
396       characters, it is incorrect to think of "[^\d\w]" as "[\D\W]"; in fact
397       "[^\d\w]" is the same as "[^\w]", which is the same as "[\W]". Think
398       DeMorgan's laws.
399
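       The De Morgan point can be checked on a handful of characters.  In this
       sketch the sample list is an assumption, chosen to include a letter, a
       digit, an underscore, a space and a punctuation mark.  "[^\d\w]" ends
       up selecting the same characters as "\W", while "[\D\W]" selects the
       same characters as "\D".

```perl
use strict;
use warnings;

my @samples = ("a", "7", "_", " ", "%");

my @neither  = grep { /[^\d\w]/ } @samples;  # not a digit and not a word char
my @non_word = grep { /\W/      } @samples;
my @either   = grep { /[\D\W]/  } @samples;  # a non-digit or a non-word char
my @non_dig  = grep { /\D/      } @samples;

printf "[^\\d\\w] matched %d, \\W matched %d\n",
    scalar @neither, scalar @non_word;
printf "[\\D\\W] matched %d, \\D matched %d\n",
    scalar @either, scalar @non_dig;
```
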
400       An anchor useful in basic regexps is the word anchor  "\b".  This
401       matches a boundary between a word character and a non-word character
402       "\w\W" or "\W\w":
403
           $x = "Housecat catenates house and cat";
           $x =~ /cat/;    # matches cat in 'Housecat'
           $x =~ /\bcat/;  # matches cat in 'catenates'
           $x =~ /cat\b/;  # matches cat in 'Housecat'
           $x =~ /\bcat\b/;  # matches 'cat' at end of string
409
410       Note in the last example, the end of the string is considered a word
411       boundary.
412
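       To confirm which 'cat' each pattern finds, the sketch below records the
       offset of the first match for each of the four regexps.  It assumes the
       built-in variable $-[0] (start offset of the last successful match);
       the hash name %offset is illustrative.

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";

# Offset of the first match of each pattern, taken from $-[0];
# -1 would mean that the pattern did not match at all.
my %offset;
for my $pattern ('cat', '\bcat', 'cat\b', '\bcat\b') {
    $offset{$pattern} = $x =~ /$pattern/ ? $-[0] : -1;
}

print "$_ => $offset{$_}\n" for sort keys %offset;
```

       Offset 5 is the 'cat' inside the first word, 9 is the start of
       'catenates', and 29 is the final word.
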
       You might wonder why '.' matches everything but "\n" - why not every
       character? The reason is that often one is matching against lines and
       would like to ignore the newline characters.  For instance, while the
       string "\n" represents one line, we would like to think of it as empty.
       Then
418
419           ""   =~ /^$/;    # matches
420           "\n" =~ /^$/;    # matches, "\n" is ignored
421
422           ""   =~ /./;      # doesn't match; it needs a char
423           ""   =~ /^.$/;    # doesn't match; it needs a char
424           "\n" =~ /^.$/;    # doesn't match; it needs a char other than "\n"
425           "a"  =~ /^.$/;    # matches
426           "a\n"  =~ /^.$/;  # matches, ignores the "\n"
427
428       This behavior is convenient, because we usually want to ignore newlines
429       when we count and match characters in a line.  Sometimes, however, we
430       want to keep track of newlines.  We might even want "^" and "$" to
431       anchor at the beginning and end of lines within the string, rather than
432       just the beginning and end of the string.  Perl allows us to choose
433       between ignoring and paying attention to newlines by using the "//s"
434       and "//m" modifiers.  "//s" and "//m" stand for single line and multi-
435       line and they determine whether a string is to be treated as one con‐
436       tinuous string, or as a set of lines.  The two modifiers affect two
437       aspects of how the regexp is interpreted: 1) how the '.' character
438       class is defined, and 2) where the anchors "^" and "$" are able to
439       match.  Here are the four possible combinations:
440
441       ·   no modifiers (//): Default behavior.  '.' matches any character
442           except "\n".  "^" matches only at the beginning of the string and
443           "$" matches only at the end or before a newline at the end.
444
445       ·   s modifier (//s): Treat string as a single long line.  '.' matches
446           any character, even "\n".  "^" matches only at the beginning of the
447           string and "$" matches only at the end or before a newline at the
448           end.
449
450       ·   m modifier (//m): Treat string as a set of multiple lines.  '.'
451           matches any character except "\n".  "^" and "$" are able to match
452           at the start or end of any line within the string.
453
454       ·   both s and m modifiers (//sm): Treat string as a single long line,
455           but detect multiple lines.  '.' matches any character, even "\n".
456           "^" and "$", however, are able to match at the start or end of any
457           line within the string.
458
459       Here are examples of "//s" and "//m" in action:
460
461           $x = "There once was a girl\nWho programmed in Perl\n";
462
463           $x =~ /^Who/;   # doesn't match, "Who" not at start of string
464           $x =~ /^Who/s;  # doesn't match, "Who" not at start of string
465           $x =~ /^Who/m;  # matches, "Who" at start of second line
466           $x =~ /^Who/sm; # matches, "Who" at start of second line
467
468           $x =~ /girl.Who/;   # doesn't match, "." doesn't match "\n"
469           $x =~ /girl.Who/s;  # matches, "." matches "\n"
470           $x =~ /girl.Who/m;  # doesn't match, "." doesn't match "\n"
471           $x =~ /girl.Who/sm; # matches, "." matches "\n"
472
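       The two modifiers are independent, which a single pattern can show.  In
       the sketch below, the regexp "girl.^Who" is an assumption made up for
       this illustration: it needs '.' to match the newline and "^" to match
       just after it, so it should succeed only when both modifiers are
       present.

```perl
use strict;
use warnings;

my $x = "There once was a girl\nWho programmed in Perl\n";

# "girl.^Who" needs '.' to match "\n" (//s) and '^' to match
# right after that newline (//m), so only //sm succeeds.
my %matched = (
    'none' => ($x =~ /girl.^Who/   ? 1 : 0),
    's'    => ($x =~ /girl.^Who/s  ? 1 : 0),
    'm'    => ($x =~ /girl.^Who/m  ? 1 : 0),
    'sm'   => ($x =~ /girl.^Who/sm ? 1 : 0),
);

print "$_: $matched{$_}\n" for qw(none s m sm);
```
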
       Most of the time, the default behavior is what is wanted, but "//s" and
       "//m" are occasionally very useful.  If "//m" is being used, the start
       of the string can still be matched with "\A" and the end of the string
       can still be matched with the anchors "\Z" (matches both the end and
       the newline before, like "$"), and "\z" (matches only the end):
478
479           $x =~ /^Who/m;   # matches, "Who" at start of second line
480           $x =~ /\AWho/m;  # doesn't match, "Who" is not at start of string
481
482           $x =~ /girl$/m;  # matches, "girl" at end of first line
483           $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string
484
485           $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end
486           $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string
487
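       The difference between "\Z" and "\z" matters most for input that still
       carries its trailing newline.  The following sketch assumes a line such
       as one read from a file; the variable names are illustrative, and
       chomp() is the standard built-in for removing the line ending.

```perl
use strict;
use warnings;

my $line = "bert\n";   # e.g. a line of input, newline still attached

my $dollar  = $line =~ /^bert$/   ? 1 : 0;  # '$' tolerates the final "\n"
my $big_z   = $line =~ /\Abert\Z/ ? 1 : 0;  # so does \Z
my $small_z = $line =~ /\Abert\z/ ? 1 : 0;  # \z does not

chomp(my $clean = $line);                   # remove the trailing newline
my $after_chomp = $clean =~ /\Abert\z/ ? 1 : 0;

print "\$: $dollar  \\Z: $big_z  \\z: $small_z  chomped \\z: $after_chomp\n";
```
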
488       We now know how to create choices among classes of characters in a reg‐
489       exp.  What about choices among words or character strings? Such choices
490       are described in the next section.
491
492       Matching this or that
493
       Sometimes we would like our regexp to be able to match different
       possible words or character strings.  This is accomplished by using the
       alternation metacharacter "|".  To match "dog" or "cat", we form the
       regexp "dog|cat".  As before, perl will try to match the regexp at the
       earliest possible point in the string.  At each character position,
       perl will first try to match the first alternative, "dog".  If "dog"
       doesn't match, perl will then try the next alternative, "cat".  If
       "cat" doesn't match either, then the match fails and perl moves to the
       next position in the string.  Some examples:
503
           "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
           "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"
506
507       Even though "dog" is the first alternative in the second regexp, "cat"
508       is able to match earlier in the string.
509
           "cats"          =~ /c|ca|cat|cats/; # matches "c"
           "cats"          =~ /cats|cat|ca|c/; # matches "cats"
512
513       Here, all the alternatives match at the first string position, so the
514       first alternative is the one that matches.  If some of the alternatives
515       are truncations of the others, put the longest ones first to give them
516       a chance to match.
517
           "cab" =~ /a|b|c/; # matches "c"
                             # /a|b|c/ == /[abc]/
520
521       The last example points out that character classes are like alterna‐
522       tions of characters.  At a given character position, the first alterna‐
523       tive that allows the regexp match to succeed will be the one that
524       matches.
525
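       The ordering rule is easier to see when the matched text itself is
       printed.  The sketch below assumes the built-in variable $&, which
       holds the text of the last successful match; the helper name
       matched_text is hypothetical.

```perl
use strict;
use warnings;

# Return the text that $regexp matched in $string ($& holds it),
# or undef if there was no match.
sub matched_text {
    my ($string, $regexp) = @_;
    return $string =~ $regexp ? $& : undef;
}

my $short_first = matched_text("cats", qr/c|ca|cat|cats/);
my $long_first  = matched_text("cats", qr/cats|cat|ca|c/);
my $earliest    = matched_text("cats and dogs", qr/dog|cat|bird/);

print "short first: $short_first\n";
print "long first:  $long_first\n";
print "earliest:    $earliest\n";
```
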
526       Grouping things and hierarchical matching
527
       Alternation allows a regexp to choose among alternatives, but by itself
       it is unsatisfying.  The reason is that each alternative is a whole
       regexp, but sometimes we want alternatives for just part of a regexp.
       For instance, suppose we want to search for housecats or housekeepers.
       The regexp "housecat|housekeeper" fits the bill, but is inefficient
       because we had to type "house" twice.  It would be nice to have parts
       of the regexp be constant, like "house", and some parts have
       alternatives, like "cat|keeper".
536
       The grouping metacharacters "()" solve this problem.  Grouping allows
       parts of a regexp to be treated as a single unit.  Parts of a regexp
       are grouped by enclosing them in parentheses.  Thus we could solve the
       "housecat|housekeeper" problem by forming the regexp as
       "house(cat|keeper)".  The regexp "house(cat|keeper)" means match
       "house" followed by either "cat" or "keeper".  Some more examples are
543
           /(a|b)b/;    # matches 'ab' or 'bb'
           /(ac|b)b/;   # matches 'acb' or 'bb'
           /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere
           /(a|[bc])d/; # matches 'ad', 'bd', or 'cd'

           /house(cat|)/;  # matches either 'housecat' or 'house'
           /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                               # 'house'.  Note groups can be nested.

           /(19|20|)\d\d/;  # match years 19xx, 20xx, or the Y2K problem, xx
           "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                                    # because '20\d\d' can't match
556
557       Alternations behave the same way in groups as out of them: at a given
558       string position, the leftmost alternative that allows the regexp to
559       match is taken.  So in the last example at the first string position,
560       "20" matches the second alternative, but there is nothing left over to
561       match the next two digits "\d\d".  So perl moves on to the next alter‐
562       native, which is the null alternative and that works, since "20" is two
563       digits.
564
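       As a sketch of grouped alternation in use, the snippet below filters a
       small word list with "house(cat|keeper)" and then records which
       alternative of "(19|20|)" was taken for a few year strings, using the
       group variable $1.  The sample data and variable names are assumptions.

```perl
use strict;
use warnings;

my @words = qw(housecat housekeeper houseboat);
my @hits  = grep { /house(cat|keeper)/ } @words;

# The empty alternative lets the group match nothing at all,
# in which case $1 is the empty string.
my %prefix;
for my $year ("1999", "2024", "20") {
    $prefix{$year} = $1 if $year =~ /(19|20|)\d\d/;
}

print "hits: @hits\n";
print "$_ => '$prefix{$_}'\n" for sort keys %prefix;
```
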
565       The process of trying one alternative, seeing if it matches, and moving
566       on to the next alternative if it doesn't, is called backtracking.  The
567       term 'backtracking' comes from the idea that matching a regexp is like
568       a walk in the woods.  Successfully matching a regexp is like arriving
569       at a destination.  There are many possible trailheads, one for each
570       string position, and each one is tried in order, left to right.  From
571       each trailhead there may be many paths, some of which get you there,
572       and some which are dead ends.  When you walk along a trail and hit a
573       dead end, you have to backtrack along the trail to an earlier point to
574       try another trail.  If you hit your destination, you stop immediately
575       and forget about trying all the other trails.  You are persistent, and
576       only if you have tried all the trails from all the trailheads and not
577       arrived at your destination, do you declare failure.  To be concrete,
578       here is a step-by-step analysis of what perl does when it tries to
579       match the regexp
580
           "abcde" =~ /(abd|abc)(df|d|de)/;
582
583       0   Start with the first letter in the string 'a'.
584
585       1   Try the first alternative in the first group 'abd'.
586
587       2   Match 'a' followed by 'b'. So far so good.
588
589       3   'd' in the regexp doesn't match 'c' in the string - a dead end.  So
590           backtrack two characters and pick the second alternative in the
591           first group 'abc'.
592
593       4   Match 'a' followed by 'b' followed by 'c'.  We are on a roll and
594           have satisfied the first group. Set $1 to 'abc'.
595
596       5   Move on to the second group and pick the first alternative 'df'.
597
598       6   Match the 'd'.
599
600       7   'f' in the regexp doesn't match 'e' in the string, so a dead end.
601           Backtrack one character and pick the second alternative in the sec‐
602           ond group 'd'.
603
604       8   'd' matches. The second grouping is satisfied, so set $2 to 'd'.
605
606       9   We are at the end of the regexp, so we are done! We have matched
607           'abcd' out of the string "abcde".
608
609       There are a couple of things to note about this analysis.  First, the
610       third alternative in the second group 'de' also allows a match, but we
611       stopped before we got to it - at a given character position, leftmost
612       wins.  Second, we were able to get a match at the first character posi‐
613       tion of the string 'a'.  If there were no matches at the first posi‐
614       tion, perl would move to the second character position 'b' and attempt
615       the match all over again.  Only when all possible paths at all possible
616       character positions have been exhausted does perl give up and declare
617       "$string =~ /(abd|abc)(df|d|de)/;" to be false.
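
       The walk-through above can be checked directly.  The following is a
       minimal sketch using the same string and regexp; the variable names
       $first and $second are only illustrative.

```perl
use strict;
use warnings;

# Same string and regexp as in the step-by-step analysis above.
my ($first, $second);
if ("abcde" =~ /(abd|abc)(df|d|de)/) {
    ($first, $second) = ($1, $2);
    printf "first group: %s, second group: %s\n", $first, $second;
}
# prints "first group: abc, second group: d"
```

       The second group holds 'd' rather than 'de' because the alternative
       'd' is tried first and already lets the whole regexp succeed.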
618
619       Even with all this work, regexp matching happens remarkably fast.  To
620       speed things up, during the compilation stage, perl compiles the regexp
621       into a compact sequence of opcodes that can often fit inside a proces‐
622       sor cache.  When the code is executed, these opcodes can then run at
623       full throttle and search very quickly.
624
625       Extracting matches
626
627       The grouping metacharacters "()" also serve another completely differ‐
628       ent function: they allow the extraction of the parts of a string that
629       matched.  This is very useful to find out what matched and for text
630       processing in general.  For each grouping, the part that matched inside
631       goes into the special variables $1, $2, etc.  They can be used just as
632       ordinary variables:
633
634           # extract hours, minutes, seconds
635           if ($time =~ /(\d\d):(\d\d):(\d\d)/) {    # match hh:mm:ss format
636               $hours = $1;
637               $minutes = $2;
638               $seconds = $3;
639           }
640
641       Now, we know that in scalar context, "$time =~ /(\d\d):(\d\d):(\d\d)/"
642       returns a true or false value.  In list context, however, it returns
643       the list of matched values "($1,$2,$3)".  So we could write the code
644       more compactly as
645
646           # extract hours, minutes, seconds
647           ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
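
       A useful consequence of the list-context behaviour is that a failed
       match returns an empty list.  The sketch below wraps the match in a
       hypothetical helper, parse_time, which is not part of Perl, to show
       both outcomes.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (hours, minutes, seconds) when called in
# list context, or an empty list if no hh:mm:ss time is found.
sub parse_time {
    my ($time) = @_;
    return $time =~ /(\d\d):(\d\d):(\d\d)/;
}

my @good = parse_time("12:34:56");
my @bad  = parse_time("noon");

printf "good: %s\n", join('-', @good);           # good: 12-34-56
printf "bad has %d elements\n", scalar @bad;     # bad has 0 elements
```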
648
649       If the groupings in a regexp are nested, $1 gets the group with the
650       leftmost opening parenthesis, $2 the next opening parenthesis, etc.
651       For example, here is a complex regexp and the matching variables indi‐
652       cated below it:
653
654           /(ab(cd|ef)((gi)|j))/;
655            1  2      34
656
657       so that if the regexp matched, e.g., $2 would contain 'cd' or 'ef'. For
658       convenience, perl sets $+ to the string held by the highest numbered
659       $1, $2, ... that got assigned (and, somewhat related, $^N to the value
660       of the $1, $2, ... most-recently assigned; i.e. the $1, $2, ... associ‐
661       ated with the rightmost closing parenthesis used in the match).
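
       As an illustration of the numbering, here is a small sketch that
       matches the nested regexp above against the sample string "abcdj"
       (the string and the variable names are chosen here for illustration).
       Group 4 takes no part in this match, so $4 is undefined.

```perl
use strict;
use warnings;

my ($g1, $g2, $g3, $g4, $highest, $last_closed);
if ("abcdj" =~ /(ab(cd|ef)((gi)|j))/) {
    ($g1, $g2, $g3, $g4) = ($1, $2, $3, $4);
    $highest     = $+;     # highest-numbered group that was set: $3 here
    $last_closed = $^N;    # group whose ')' closed last: $1 here
}

printf "1=%s 2=%s 3=%s 4=%s\n",
    $g1, $g2, $g3, defined($g4) ? $g4 : '(undef)';
printf "\$+ = %s, \$^N = %s\n", $highest, $last_closed;
```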
662
663       Closely associated with the matching variables $1, $2, ... are the
664       backreferences "\1", "\2", ... .  Backreferences are simply matching
665       variables that can be used inside a regexp.  This is a really nice fea‐
666       ture - what matches later in a regexp can depend on what matched ear‐
667       lier in the regexp.  Suppose we wanted to look for doubled words in
668       text, like 'the the'.  The following regexp finds all 3-letter doubles
669       with a space in between:
670
671           /(\w\w\w)\s\1/;
672
673       The grouping assigns a value to \1, so that the same 3 letter sequence
674       is used for both parts.  Here are some words with repeated parts:
675
676           % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words
677           beriberi
678           booboo
679           coco
680           mama
681           murmur
682           papa
683
684       The regexp has a single grouping which considers 4-letter combinations,
685       then 3-letter combinations, etc.  and uses "\1" to look for a repeat.
686       Although $1 and "\1" represent the same thing, care should be taken to
687       use matched variables $1, $2, ... only outside a regexp and backrefer‐
688       ences "\1", "\2", ... only inside a regexp; not doing so may lead to
689       surprising and/or undefined results.
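
       The same search can be tried without a dictionary file.  This sketch
       applies the regexp to a short word list made up for the example;
       @words and @doubled are illustrative names.

```perl
use strict;
use warnings;

# A small stand-in for /usr/dict/words.
my @words = qw(banana beriberi booboo coco mama murmur papa perl);

# One group of 4, 3, 2 or 1 word characters, immediately repeated by \1.
my @doubled = grep { /^(\w\w\w\w|\w\w\w|\w\w|\w)\1$/ } @words;

print "$_\n" for @doubled;
```

       Only 'banana' and 'perl' are filtered out; neither consists of a
       sequence followed by an exact copy of itself.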
690
691       In addition to what was matched, Perl 5.6.0 also provides the positions
692       of what was matched with the "@-" and "@+" arrays. "$-[0]" is the posi‐
693       tion of the start of the entire match and $+[0] is the position of the
694       end. Similarly, "$-[n]" is the position of the start of the $n match
695       and $+[n] is the position of the end. If $n is undefined, so are
696       "$-[n]" and $+[n]. Then this code
697
698           $x = "Mmm...donut, thought Homer";
699           $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
700           foreach $expr (1..$#-) {
701               print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n";
702           }
703
704       prints
705
706           Match 1: 'Mmm' at position (0,3)
707           Match 2: 'donut' at position (6,11)
708
709       Even if there are no groupings in a regexp, it is still possible to
710       find out what exactly matched in a string.  If you use them, perl will
711       set $` to the part of the string before the match, will set $& to the
712       part of the string that matched, and will set $' to the part of the
713       string after the match.  An example:
714
715           $x = "the cat caught the mouse";
716           $x =~ /cat/;  # $` = 'the ', $& = 'cat', $' = ' caught the mouse'
717           $x =~ /the/;  # $` = '', $& = 'the', $' = ' cat caught the mouse'
718
719       In the second match, "$` = ''" because the regexp matched at the first
720       character position in the string and stopped; it never saw the second
721       'the'.  It is important to note that using $` and $' slows down regexp
722       matching quite a bit, and $& slows it down to a lesser extent,
723       because if they are used in one regexp in a program, they are generated
724       for all regexps in the program.  So if raw performance is a goal of
725       your application, they should be avoided.  If you need them, use "@-"
726       and "@+" instead:
727
728           $` is the same as substr( $x, 0, $-[0] )
729           $& is the same as substr( $x, $-[0], $+[0]-$-[0] )
730           $' is the same as substr( $x, $+[0] )
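
       The equivalences above can be sketched as follows, reusing the earlier
       example string.  The names $before, $matched and $after are
       illustrative.

```perl
use strict;
use warnings;

my $x = "the cat caught the mouse";
my ($before, $matched, $after);

if ($x =~ /cat/) {
    $before  = substr($x, 0, $-[0]);               # what $` would hold
    $matched = substr($x, $-[0], $+[0] - $-[0]);   # what $& would hold
    $after   = substr($x, $+[0]);                  # what $' would hold
}

printf "[%s] [%s] [%s]\n", $before, $matched, $after;
# prints "[the ] [cat] [ caught the mouse]"
```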
731
732       Matching repetitions
733
734       The examples in the previous section display an annoying weakness.  We
735       were only matching 3-letter words, or syllables of 4 letters or less.
736       We'd like to be able to match words or syllables of any length, without
737       writing out tedious alternatives like "\w\w\w\w|\w\w\w|\w\w|\w".
738
739       This is exactly the problem the quantifier metacharacters "?", "*",
740       "+", and "{}" were created for.  They allow us to determine the number
741       of repeats of a portion of a regexp we consider to be a match.  Quanti‐
742       fiers are put immediately after the character, character class, or
743       grouping that we want to specify.  They have the following meanings:
744
745       ·   "a?" = match 'a' 1 or 0 times
746
747       ·   "a*" = match 'a' 0 or more times, i.e., any number of times
748
749       ·   "a+" = match 'a' 1 or more times, i.e., at least once
750
751       ·   "a{n,m}" = match at least "n" times, but not more than "m" times.
752
753       ·   "a{n,}" = match at least "n" times
754
755       ·   "a{n}" = match exactly "n" times
756
757       Here are some examples:
758
759           /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
760                            # any number of digits
761           /(\w+)\s+\1/;    # match doubled words of arbitrary length
762           /y(es)?/i;       # matches 'y', 'Y', or a case-insensitive 'yes'
763           $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
764                                  # than 4 digits
765           $year =~ /^\d{4}$|^\d{2}$/;  # better match; throw out 3 digit dates
766           $year =~ /^\d{2}(\d{2})?$/;  # same thing written differently. However,
767                                        # this produces $1 and the other does not.
768
769           % simple_grep '^(\w+)\1$' /usr/dict/words   # isn't this easier?
770           beriberi
771           booboo
772           coco
773           mama
774           murmur
775           papa
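
       A quantifier limits the length of the whole string only when the
       regexp is anchored at both ends with "^" and "$".  The sketch below
       shows the difference for year checking; is_year is a hypothetical
       helper written for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: true only for strings of exactly 2 or 4 digits.
sub is_year {
    my ($year) = @_;
    # ^ and $ force the quantified digits to span the whole string.
    return $year =~ /^\d{2}(\d{2})?$/ ? 1 : 0;
}

# Unanchored, the quantifier is satisfied by part of a longer number.
my $unanchored = "12345" =~ /\d{2,4}/ ? 1 : 0;

printf "%-5s anchored: %d\n", $_, is_year($_) for qw(99 1999 199 12345);
printf "12345 unanchored: %d\n", $unanchored;
```

       The anchored check accepts '99' and '1999' and rejects '199' and
       '12345', while the unanchored regexp reports a match for '12345'.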
776
777       For all of these quantifiers, perl will try to match as much of the
778       string as possible, while still allowing the regexp to succeed.  Thus
779       with "/a?.../", perl will first try to match the regexp with the "a"
780       present; if that fails, perl will try to match the regexp without the
781       "a" present.  For the quantifier "*", we get the following:
782
783           $x = "the cat in the hat";
784           $x =~ /^(.*)(cat)(.*)$/; # matches,
785                                    # $1 = 'the '
786                                    # $2 = 'cat'
787                                    # $3 = ' in the hat'
788
789       This is what we might expect: the match finds the only "cat" in the
790       string and locks onto it.  Consider, however, this regexp:
791
792           $x =~ /^(.*)(at)(.*)$/; # matches,
793                                   # $1 = 'the cat in the h'
794                                   # $2 = 'at'
795                                   # $3 = ''   (0 matches)
796
797       One might initially guess that perl would find the "at" in "cat" and
798       stop there, but that wouldn't give the longest possible string to the
799       first quantifier ".*".  Instead, the first quantifier ".*" grabs as
800       much of the string as possible while still having the regexp match.  In
801       this example, that means matching the "at" sequence against the final
802       "at" in the string.  The other important principle illustrated here is
803       that when there are two or more elements in a regexp, the leftmost
804       quantifier, if there is one, gets to grab as much of the string as
805       possible, leaving the rest of the regexp to fight over scraps.  Thus in
806       our example, the first quantifier ".*" grabs most of the string, while
807       the second quantifier ".*" gets the empty string.  Quantifiers that grab
808       as much of the string as possible are called maximal match or greedy
809       quantifiers.
810
811       When a regexp can match a string in several different ways, we can use
812       the principles above to predict which way the regexp will match:
813
814       ·   Principle 0: Taken as a whole, any regexp will be matched at the
815           earliest possible position in the string.
816
817       ·   Principle 1: In an alternation "a|b|c...", the leftmost alternative
818           that allows a match for the whole regexp will be the one used.
819
820       ·   Principle 2: The maximal matching quantifiers "?", "*", "+" and
821           "{n,m}" will in general match as much of the string as possible
822           while still allowing the whole regexp to match.
823
824       ·   Principle 3: If there are two or more elements in a regexp, the
825           leftmost greedy quantifier, if any, will match as much of the
826           string as possible while still allowing the whole regexp to match.
827           The next leftmost greedy quantifier, if any, will try to match as
828           much of the string remaining available to it as possible, while
829           still allowing the whole regexp to match.  And so on, until all the
830           regexp elements are satisfied.
831
832       As we have seen above, Principle 0 overrides the others - the regexp
833       will be matched as early as possible, with the other principles deter‐
834       mining how the regexp matches at that earliest character position.
835
836       Here is an example of these principles in action:
837
838           $x = "The programming republic of Perl";
839           $x =~ /^(.+)(e|r)(.*)$/;  # matches,
840                                     # $1 = 'The programming republic of Pe'
841                                     # $2 = 'r'
842                                     # $3 = 'l'
843
844       This regexp matches at the earliest string position, 'T'.  One might
845       think that "e", being leftmost in the alternation, would be matched,
846       but "r" produces the longest string in the first quantifier.
847
848           $x =~ /(m{1,2})(.*)$/;  # matches,
849                                   # $1 = 'mm'
850                                   # $2 = 'ing republic of Perl'
851
852       Here, the earliest possible match is at the first 'm' in "programming".
853       "m{1,2}" is the first quantifier, so it gets to match a maximal "mm".
854
855           $x =~ /.*(m{1,2})(.*)$/;  # matches,
856                                     # $1 = 'm'
857                                     # $2 = 'ing republic of Perl'
858
859       Here, the regexp matches at the start of the string. The first quanti‐
860       fier ".*" grabs as much as possible, leaving just a single 'm' for the
861       second quantifier "m{1,2}".
862
863           $x =~ /(.?)(m{1,2})(.*)$/;  # matches,
864                                       # $1 = 'a'
865                                       # $2 = 'mm'
866                                       # $3 = 'ing republic of Perl'
867
868       Here, ".?" eats its maximal one character at the earliest possible
869       position in the string, 'a' in "programming", leaving "m{1,2}" the
870       opportunity to match both "m"'s. Finally,
871
872           "aXXXb" =~ /(X*)/; # matches with $1 = ''
873
874       because it can match zero copies of 'X' at the beginning of the string.
875       If you definitely want to match at least one 'X', use "X+", not "X*".
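
       The difference between the two quantifiers on this string can be
       sketched as follows; $star and $plus are illustrative names.

```perl
use strict;
use warnings;

my $string = "aXXXb";

my ($star) = $string =~ /(X*)/;   # succeeds at position 0 with zero X's
my ($plus) = $string =~ /(X+)/;   # has to move on to the first real 'X'

printf "X* captured %d characters, X+ captured %d\n",
    length($star), length($plus);
# prints "X* captured 0 characters, X+ captured 3"
```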
876
877       Sometimes greed is not good.  At times, we would like quantifiers to
878       match a minimal piece of string, rather than a maximal piece.  For this
879       purpose, Larry Wall created the minimal match or non-greedy quanti‐
880       fiers "??", "*?", "+?", and "{}?".  These are the usual quantifiers with
881       a "?" appended to them.  They have the following meanings:
882
883       ·   "a??" = match 'a' 0 or 1 times. Try 0 first, then 1.
884
885       ·   "a*?" = match 'a' 0 or more times, i.e., any number of times, but
886           as few times as possible
887
888       ·   "a+?" = match 'a' 1 or more times, i.e., at least once, but as few
889           times as possible
890
891       ·   "a{n,m}?" = match at least "n" times, not more than "m" times, as
892           few times as possible
893
894       ·   "a{n,}?" = match at least "n" times, but as few times as possible
895
896       ·   "a{n}?" = match exactly "n" times.  Because we match exactly "n"
897           times, "a{n}?" is equivalent to "a{n}" and is just there for nota‐
898           tional consistency.
899
900       Let's look at the example above, but with minimal quantifiers:
901
902           $x = "The programming republic of Perl";
903           $x =~ /^(.+?)(e|r)(.*)$/; # matches,
904                                     # $1 = 'Th'
905                                     # $2 = 'e'
906                                     # $3 = ' programming republic of Perl'
907
908       The minimal string that will allow both the start of the string "^" and
909       the alternation to match is "Th", with the alternation "e|r" matching
910       "e".  The second quantifier ".*" is free to gobble up the rest of the
911       string.
912
913           $x =~ /(m{1,2}?)(.*?)$/;  # matches,
914                                     # $1 = 'm'
915                                     # $2 = 'ming republic of Perl'
916
917       The first string position that this regexp can match is at the first
918       'm' in "programming". At this position, the minimal "m{1,2}?"  matches
919       just one 'm'.  Although the second quantifier ".*?" would prefer to
920       match no characters, it is constrained by the end-of-string anchor "$"
921       to match the rest of the string.
922
923           $x =~ /(.*?)(m{1,2}?)(.*)$/;  # matches,
924                                         # $1 = 'The progra'
925                                         # $2 = 'm'
926                                         # $3 = 'ming republic of Perl'
927
928       In this regexp, you might expect the first minimal quantifier ".*?"  to
929       match the empty string, because it is not constrained by a "^" anchor
930       to match the beginning of the word.  Principle 0 applies here, however.
931       Because it is possible for the whole regexp to match at the start of
932       the string, it will match at the start of the string.  Thus the first
933       quantifier has to match everything up to the first "m".  The second
934       minimal quantifier matches just one "m" and the third quantifier
935       matches the rest of the string.
936
937           $x =~ /(.??)(m{1,2})(.*)$/;  # matches,
938                                        # $1 = 'a'
939                                        # $2 = 'mm'
940                                        # $3 = 'ing republic of Perl'
941
942       Just as in the previous regexp, the first quantifier ".??" can match
943       earliest at position 'a', so it does.  The second quantifier is greedy,
944       so it matches "mm", and the third matches the rest of the string.
945
946       We can modify principle 3 above to take into account non-greedy quanti‐
947       fiers:
948
949       ·   Principle 3: If there are two or more elements in a regexp, the
950           leftmost greedy (non-greedy) quantifier, if any, will match as much
951           (little) of the string as possible while still allowing the whole
952           regexp to match.  The next leftmost greedy (non-greedy) quantifier,
953           if any, will try to match as much (little) of the string remaining
954           available to it as possible, while still allowing the whole regexp
955           to match.  And so on, until all the regexp elements are satisfied.
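
       To see the revised principle at work, this sketch runs a greedy and a
       non-greedy version of the same regexp against the "cat in the hat"
       string used earlier.  @greedy and @minimal are illustrative names.

```perl
use strict;
use warnings;

my $x = "the cat in the hat";

# In list context each match returns ($1, $2, $3).
my @greedy  = $x =~ /^(.*)(at)(.*)$/;    # first .* takes as much as it can
my @minimal = $x =~ /^(.*?)(at)(.*)$/;   # first .*? takes as little as it can

printf "greedy:  [%s] [%s] [%s]\n", @greedy;
printf "minimal: [%s] [%s] [%s]\n", @minimal;
# greedy:  [the cat in the h] [at] []
# minimal: [the c] [at] [ in the hat]
```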
956
957       Just like alternation, quantifiers are also susceptible to backtrack‐
958       ing.  Here is a step-by-step analysis of the example
959
960           $x = "the cat in the hat";
961           $x =~ /^(.*)(at)(.*)$/; # matches,
962                                   # $1 = 'the cat in the h'
963                                   # $2 = 'at'
964                                   # $3 = ''   (0 matches)
965
966       0   Start with the first letter in the string 't'.
967
968       1   The first quantifier '.*' starts out by matching the whole string
969           'the cat in the hat'.
970
971       2   'a' in the regexp element 'at' doesn't match the end of the string.
972           Backtrack one character.
973
974       3   'a' in the regexp element 'at' still doesn't match the last letter
975           of the string 't', so backtrack one more character.
976
977       4   Now we can match the 'a' and the 't'.
978
979       5   Move on to the third element '.*'.  Since we are at the end of the
980           string and '.*' can match 0 times, assign it the empty string.
981
982       6   We are done!
983
984       Most of the time, all this moving forward and backtracking happens
985       quickly and searching is fast.   There are some pathological regexps,
986       however, whose execution time exponentially grows with the size of the
987       string.  A typical structure that blows up in your face is of the form
988
989           /(a|b+)*/;
990
991       The problem is the nested indeterminate quantifiers.  There are many
992       different ways of partitioning a string of length n between the "+" and
993       "*": one repetition with "b+" of length n, two repetitions with the
994       first "b+" length k and the second with length n-k, m repetitions whose
995       bits add up to length n, etc.  In fact there are an exponential number
996       of ways to partition a string as a function of length.  A regexp may
997       get lucky and match early in the process, but if there is no match,
998       perl will try every possibility before giving up.  So be careful with
999       nested "*"'s, "{n,m}"'s, and "+"'s.  The book Mastering Regular Expres‐
1000       sions by Jeffrey Friedl gives a wonderful discussion of this and other
1001       efficiency issues.
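
       To get a feel for the numbers, the sketch below counts the ways a run
       of n characters can be cut into consecutive non-empty pieces, which is
       the partitioning described above.  It only counts; it does not time
       the regexp engine, whose actual behaviour depends on the pattern and
       on any optimizations perl applies.  count_partitions is a
       hypothetical helper.

```perl
use strict;
use warnings;

# Number of ways to cut a run of $n characters into consecutive
# non-empty pieces, e.g. the ways "(b+)*" could divide up "b" x $n.
sub count_partitions {
    my ($n) = @_;
    my @ways = (1);                      # one way to split the empty string
    for my $len (1 .. $n) {
        my $total = 0;
        # The first piece takes $first characters; the remainder can be
        # split in $ways[$len - $first] ways.
        for my $first (1 .. $len) {
            $total += $ways[$len - $first];
        }
        $ways[$len] = $total;
    }
    return $ways[$n];
}

printf "%2d characters: %d ways\n", $_, count_partitions($_) for 1, 5, 10, 20;
```

       The count doubles with every extra character (1, 16, 512 and 524288
       ways for the lengths shown), which is the exponential growth that
       makes a failing match so expensive.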
1002
1003       Building a regexp
1004
1005       At this point, we have all the basic regexp concepts covered, so let's
1006       give a more involved example of a regular expression.  We will build a
1007       regexp that matches numbers.
1008
1009       The first task in building a regexp is to decide what we want to match
1010       and what we want to exclude.  In our case, we want to match both inte‐
1011       gers and floating point numbers and we want to reject any string that
1012       isn't a number.
1013
1014       The next task is to break the problem down into smaller problems that
1015       are easily converted into a regexp.
1016
1017       The simplest case is integers.  These consist of a sequence of digits,
1018       with an optional sign in front.  The digits we can represent with "\d+"
1019       and the sign can be matched with "[+-]".  Thus the integer regexp is
1020
1021           /[+-]?\d+/;  # matches integers
1022
1023       A floating point number potentially has a sign, an integral part, a
1024       decimal point, a fractional part, and an exponent.  One or more of
1025       these parts is optional, so we need to check out the different possi‐
1026       bilities.  Floating point numbers which are in proper form include
1027       123., 0.345, .34, -1e6, and 25.4E-72.  As with integers, the sign out
1028       front is completely optional and can be matched by "[+-]?".  We can see
1029       that if there is no exponent, floating point numbers must have a deci‐
1030       mal point, otherwise they are integers.  We might be tempted to model
1031       these with "\d*\.\d*", but this would also match just a single decimal
1032       point, which is not a number.  So the three cases of floating point
1033       number sans exponent are
1034
1035          /[+-]?\d+\./;  # 1., 321., etc.
1036          /[+-]?\.\d+/;  # .1, .234, etc.
1037          /[+-]?\d+\.\d+/;  # 1.0, 30.56, etc.
1038
1039       These can be combined into a single regexp with a three-way alterna‐
1040       tion:
1041
1042          /[+-]?(\d+\.\d+|\d+\.|\.\d+)/;  # floating point, no exponent
1043
1044       In this alternation, it is important to put '\d+\.\d+' before '\d+\.'.
1045       If '\d+\.' were first, the regexp would happily match that and ignore
1046       the fractional part of the number.
1047
1048       Now consider floating point numbers with exponents.  The key observa‐
1049       tion here is that both integers and numbers with decimal points are
1050       allowed in front of an exponent.  Then exponents, like the overall
1051       sign, are independent of whether we are matching numbers with or with‐
1052       out decimal points, and can be 'decoupled' from the mantissa.  The
1053       overall form of the regexp now becomes clear:
1054
1055           /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;
1056
1057       The exponent is an "e" or "E", followed by an integer.  So the exponent
1058       regexp is
1059
1060          /[eE][+-]?\d+/;  # exponent
1061
1062       Putting all the parts together, we get a regexp that matches numbers:
1063
1064          /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;  # Ta da!
1065
1066       Long regexps like this may impress your friends, but can be hard to
1067       decipher.  In complex situations like this, the "//x" modifier for a
1068       match is invaluable.  It allows one to put nearly arbitrary whitespace
1069       and comments into a regexp without affecting its meaning.  Using it,
1070       we can rewrite our 'extended' regexp in the more pleasing form
1071
1072          /^
1073             [+-]?         # first, match an optional sign
1074             (             # then match integers or f.p. mantissas:
1075                 \d+\.\d+  # mantissa of the form a.b
1076                |\d+\.     # mantissa of the form a.
1077                |\.\d+     # mantissa of the form .b
1078                |\d+       # integer of the form a
1079             )
1080             ([eE][+-]?\d+)?  # finally, optionally match an exponent
1081          $/x;
1082
1083       If whitespace is mostly irrelevant, how does one include space charac‐
1084       ters in an extended regexp? The answer is to backslash it '\ ' or put
1085       it in a character class "[ ]".  The same thing goes for pound signs:
1086       use "\#" or "[#]".  For instance, Perl allows a space between the sign
1087       and the mantissa/integer, and we could add this to our regexp as fol‐
1088       lows:
1089
1090          /^
1091             [+-]?\ *      # first, match an optional sign *and space*
1092             (             # then match integers or f.p. mantissas:
1093                 \d+\.\d+  # mantissa of the form a.b
1094                |\d+\.     # mantissa of the form a.
1095                |\.\d+     # mantissa of the form .b
1096                |\d+       # integer of the form a
1097             )
1098             ([eE][+-]?\d+)?  # finally, optionally match an exponent
1099          $/x;
1100
1101       In this form, it is easier to see a way to simplify the alternation.
1102       Alternatives 1, 2, and 4 all start with "\d+", so it could be factored
1103       out:
1104
1105          /^
1106             [+-]?\ *      # first, match an optional sign *and space*
1107             (             # then match integers or f.p. mantissas:
1108                 \d+       # start out with a ...
1109                 (
1110                     \.\d* # mantissa of the form a.b or a.
1111                 )?        # ? takes care of integers of the form a
1112                |\.\d+     # mantissa of the form .b
1113             )
1114             ([eE][+-]?\d+)?  # finally, optionally match an exponent
1115          $/x;
1116
1117       or written in the compact form,
1118
1119           /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;
1120
1121       This is our final regexp.  To recap, we built a regexp by
1122
1123       ·   specifying the task in detail,
1124
1125       ·   breaking down the problem into smaller parts,
1126
1127       ·   translating the small parts into regexps,
1128
1129       ·   combining the regexps,
1130
1131       ·   and optimizing the final combined regexp.
1132
1133       These are also the typical steps involved in writing a computer pro‐
1134       gram.  This makes perfect sense, because regular expressions are essen‐
1135       tially programs written in a little computer language that specifies
1136       patterns.
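
       As with any program, the result deserves a test.  The sketch below
       runs the compact form of the final regexp over the sample numbers
       mentioned in this section plus a few strings that should be rejected.
       is_number is a hypothetical wrapper, and the list of candidates is
       chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the final regexp built in this section.
sub is_number {
    my ($text) = @_;
    return $text =~ /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/ ? 1 : 0;
}

my @candidates = ('123.', '0.345', '.34', '-1e6', '25.4E-72', '- 5',
                  '.', '1e', 'abc', '1.2.3');

for my $candidate (@candidates) {
    printf "%-10s %s\n", $candidate,
        is_number($candidate) ? 'number' : 'not a number';
}
```

       The first six candidates are accepted; a lone decimal point, a
       dangling exponent marker, plain text and a string with two decimal
       points are rejected.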
1137
1138       Using regular expressions in Perl
1139
1140       The last topic of Part 1 briefly covers how regexps are used in Perl
1141       programs.  Where do they fit into Perl syntax?
1142
1143       We have already introduced the matching operator in its default "/reg‐
1144       exp/" and arbitrary delimiter "m!regexp!" forms.  We have used the
1145       binding operator "=~" and its negation "!~" to test for string matches.
1146       Associated with the matching operator, we have discussed the single
1147       line "//s", multi-line "//m", case-insensitive "//i" and extended "//x"
1148       modifiers.
1149
1150       There are a few more things you might want to know about matching oper‐
1151       ators.  First, we pointed out earlier that variables in regexps are
1152       substituted before the regexp is evaluated:
1153
1154           $pattern = 'Seuss';
1155           while (<>) {
1156               print if /$pattern/;
1157           }
1158
1159       This will print any lines containing the word "Seuss".  It is not as
1160       efficient as it could be, however, because perl has to re-evaluate
1161       $pattern each time through the loop.  If $pattern won't be changing
1162       over the lifetime of the script, we can add the "//o" modifier, which
1163       directs perl to only perform variable substitutions once:
1164
1165           #!/usr/bin/perl
1166           #    Improved simple_grep
1167           $regexp = shift;
1168           while (<>) {
1169               print if /$regexp/o;  # a good deal faster
1170           }
1171
1172       If you change $pattern after the first substitution happens, perl will
1173       ignore it.  If you don't want any substitutions at all, use the special
1174       delimiter "m''":
1175
1176           @pattern = ('Seuss');
1177           while (<>) {
1178               print if m'@pattern';  # matches literal '@pattern', not 'Seuss'
1179           }
1180
1181       "m''" acts like single quotes on a regexp; all other "m" delimiters act
1182       like double quotes.  If the regexp evaluates to the empty string, the
1183       regexp in the last successful match is used instead.  So we have
1184
1185           "dog" =~ /d/;  # 'd' matches
1186           "dogbert" =~ //;  # this matches the 'd' regexp used before
1187
1188       The final two modifiers "//g" and "//c" concern multiple matches.  The
1189       modifier "//g" stands for global matching and allows the matching oper‐
1190       ator to match within a string as many times as possible.  In scalar
1191       context, successive invocations against a string will have "//g" jump
1192       from match to match, keeping track of position in the string as it goes
1193       along.  You can get or set the position with the "pos()" function.
1194
1195       The use of "//g" is shown in the following example.  Suppose we have a
1196       string that consists of words separated by spaces.  If we know how many
1197       words there are in advance, we could extract the words using groupings:
1198
1199           $x = "cat dog house"; # 3 words
1200           $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches,
1201                                                  # $1 = 'cat'
1202                                                  # $2 = 'dog'
1203                                                  # $3 = 'house'
1204
1205       But what if we had an indeterminate number of words? This is the sort
1206       of task "//g" was made for.  To extract all words, form the simple reg‐
1207       exp "(\w+)" and loop over all matches with "/(\w+)/g":
1208
1209           while ($x =~ /(\w+)/g) {
1210               print "Word is $1, ends at position ", pos $x, "\n";
1211           }
1212
1213       prints
1214
1215           Word is cat, ends at position 3
1216           Word is dog, ends at position 7
1217           Word is house, ends at position 13
1218
1219       A failed match or changing the target string resets the position.  If
1220       you don't want the position reset after failure to match, add the
1221       "//c" modifier, as in "/regexp/gc".  The current position in the string
1222       is associated with the string, not the regexp.  This means that different
1223       strings have different positions and their respective positions can be
1224       set or read independently.
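
       The effect of "//c" on the stored position can be sketched like this.
       The variable names are illustrative; the string is the one from the
       example above.

```perl
use strict;
use warnings;

my $x = "cat dog house";

$x =~ /\w+/g;                    # matches 'cat'
my $after_match = pos $x;        # 3

$x =~ /\d+/g;                    # fails: there are no digits
my $after_fail = pos $x;         # undef, the failure reset the position

$x =~ /\w+/g;                    # starts over and matches 'cat' again
$x =~ /\d+/gc;                   # fails, but //c keeps the position
my $after_fail_c = pos $x;       # still 3

printf "after match: %d\n", $after_match;
printf "after failed //g: %s\n", defined($after_fail) ? $after_fail : 'undef';
printf "after failed //gc: %d\n", $after_fail_c;
```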
1225
1226       In list context, "//g" returns a list of matched groupings, or if there
1227       are no groupings, a list of matches to the whole regexp.  So if we
1228       wanted just the words, we could use
1229
1230           @words = ($x =~ /(\w+)/g);  # matches,
1231                                       # $words[0] = 'cat'
1232                                       # $words[1] = 'dog'
1233                                       # $words[2] = 'house'
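
       When the regexp has more than one grouping, the list returned by
       "//g" contains the groupings of every match in order.  A sketch with
       two groupings, using a made-up settings string, shows how this pairs
       up naturally with a hash:

```perl
use strict;
use warnings;

my $config = "width=80 height=24 depth=8";   # sample input for illustration

# Each match contributes ($1, $2), so the flat list is
# ('width', 80, 'height', 24, 'depth', 8).
my %setting = $config =~ /(\w+)=(\d+)/g;

printf "%s is %d\n", $_, $setting{$_} for sort keys %setting;
```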
1234
1235       Closely associated with the "//g" modifier is the "\G" anchor.  The
1236       "\G" anchor matches at the point where the previous "//g" match left
1237       off.  "\G" allows us to easily do context-sensitive matching:
1238
1239           $metric = 1;  # use metric units
1240           ...
1241           $x = <FILE>;  # read in measurement
1242           $x =~ /^([+-]?\d+)\s*/g;  # get magnitude
1243           $weight = $1;
1244           if ($metric) { # error checking
1245               print "Units error!" unless $x =~ /\Gkg\./g;
1246           }
1247           else {
1248               print "Units error!" unless $x =~ /\Glbs\./g;
1249           }
           $x =~ /\G\s+(widget|sprocket)/g;  # continue processing
1251
1252       The combination of "//g" and "\G" allows us to process the string a bit
1253       at a time and use arbitrary Perl logic to decide what to do next.  Cur‐
1254       rently, the "\G" anchor is only fully supported when used to anchor to
1255       the start of the pattern.
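
       One common use of this "a bit at a time" style is a small tokenizer.
       The following is only a sketch under assumed conventions (the token
       labels "NUM" and "OP" and the input expression are invented here); it
       combines "\G" with "//gc" so that a failed alternative leaves the
       position where it was:

```perl
use strict;
use warnings;

my $expr = "3 + 42*(7 - 1)";
my @tokens;

pos($expr) = 0;
while (pos($expr) < length($expr)) {
    # Each test is anchored at the current position by \G; //c keeps
    # that position when a test fails, so the next one can try there.
    if    ($expr =~ /\G\s+/gc)          { next }
    elsif ($expr =~ /\G(\d+)/gc)        { push @tokens, "NUM:$1" }
    elsif ($expr =~ /\G([-+*\/()])/gc)  { push @tokens, "OP:$1" }
    else  { die "Unexpected character at position ", pos($expr), "\n" }
}

print "@tokens\n";
```

       Each branch either consumes a token and advances the position, or
       fails and leaves the decision to the next branch.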
1256
1257       "\G" is also invaluable in processing fixed length records with reg‐
1258       exps.  Suppose we have a snippet of coding region DNA, encoded as base
1259       pair letters "ATCGTTGAAT..." and we want to find all the stop codons
1260       "TGA".  In a coding region, codons are 3-letter sequences, so we can
1261       think of the DNA snippet as a sequence of 3-letter records.  The naive
1262       regexp
1263
1264           # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
1265           $dna = "ATCGTTGAATGCAAATGACATGAC";
1266           $dna =~ /TGA/;
1267
1268       doesn't work; it may match a "TGA", but there is no guarantee that the
1269       match is aligned with codon boundaries, e.g., the substring "GTT GAA"
1270       gives a match.  A better solution is
1271
1272           while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
1273               print "Got a TGA stop codon at position ", pos $dna, "\n";
1274           }
1275
1276       which prints
1277
1278           Got a TGA stop codon at position 18
1279           Got a TGA stop codon at position 23
1280
1281       Position 18 is good, but position 23 is bogus.  What happened?
1282
1283       The answer is that our regexp works well until we get past the last
1284       real match.  Then the regexp will fail to match a synchronized "TGA"
1285       and start stepping ahead one character position at a time, not what we
1286       want.  The solution is to use "\G" to anchor the match to the codon
1287       alignment:
1288
1289           while ($dna =~ /\G(\w\w\w)*?TGA/g) {
1290               print "Got a TGA stop codon at position ", pos $dna, "\n";
1291           }
1292
1293       This prints
1294
1295           Got a TGA stop codon at position 18
1296
1297       which is the correct answer.  This example illustrates that it is
1298       important not only to match what is desired, but to reject what is not
1299       desired.
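
       The same anchoring idea can be used in list context to cut the string
       into aligned 3-letter records first and inspect them afterwards.  This
       is an alternative sketch, not the approach used above; the variable
       names are illustrative:

```perl
use strict;
use warnings;

my $dna = "ATCGTTGAATGCAAATGACATGAC";

# \G forces every match to start where the previous one ended,
# so the captured records stay aligned to codon boundaries.
my @codons = ($dna =~ /\G(\w\w\w)/g);

# Indices (counting from 0) of the codons that are stop codons.
my @stops = grep { $codons[$_] eq 'TGA' } 0 .. $#codons;

print "codons: @codons\n";
print "TGA is codon number $_, starting at offset ", 3 * $_, "\n" for @stops;
```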
1300
1301       search and replace
1302
1303       Regular expressions also play a big role in search and replace opera‐
1304       tions in Perl.  Search and replace is accomplished with the "s///"
1305       operator.  The general form is "s/regexp/replacement/modifiers", with
1306       everything we know about regexps and modifiers applying in this case as
1307       well.  The "replacement" is a Perl double quoted string that replaces
1308       in the string whatever is matched with the "regexp".  The operator "=~"
1309       is also used here to associate a string with "s///".  If matching
1310       against $_, the "$_ =~"  can be dropped.  If there is a match, "s///"
1311       returns the number of substitutions made, otherwise it returns false.
1312       Here are a few examples:
1313
1314           $x = "Time to feed the cat!";
1315           $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
1316           if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
1317               $more_insistent = 1;
1318           }
1319           $y = "'quoted words'";
1320           $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
1321                                  # $y contains "quoted words"
1322
1323       In the last example, the whole string was matched, but only the part
1324       inside the single quotes was grouped.  With the "s///" operator, the
1325       matched variables $1, $2, etc.  are immediately available for use in
1326       the replacement expression, so we use $1 to replace the quoted string
1327       with just what was quoted.  With the global modifier, "s///g" will
1328       search and replace all occurrences of the regexp in the string:
1329
1330           $x = "I batted 4 for 4";
1331           $x =~ s/4/four/;   # doesn't do it all:
1332                              # $x contains "I batted four for 4"
1333           $x = "I batted 4 for 4";
1334           $x =~ s/4/four/g;  # does it all:
1335                              # $x contains "I batted four for four"
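
       Because "s///" returns the number of substitutions made, the result
       can be captured to find out how much work "s///g" did.  A small
       sketch (the variable names are illustrative):

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";
my $count = ($x =~ s/4/four/g);    # number of substitutions made

my $y = "no digits here";
my $none = ($y =~ s/4/four/g);     # false: nothing was replaced

print "replaced $count occurrence(s): $x\n";
print "nothing to replace in: $y\n" unless $none;
```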
1336
1337       If you prefer 'regex' over 'regexp' in this tutorial, you could use the
1338       following program to replace it:
1339
1340           % cat > simple_replace
1341           #!/usr/bin/perl
1342           $regexp = shift;
1343           $replacement = shift;
1344           while (<>) {
1345               s/$regexp/$replacement/go;
1346               print;
1347           }
1348           ^D
1349
1350           % simple_replace regexp regex perlretut.pod
1351
1352       In "simple_replace" we used the "s///g" modifier to replace all occur‐
1353       rences of the regexp on each line and the "s///o" modifier to compile
1354       the regexp only once.  As with "simple_grep", both the "print" and the
1355       "s/$regexp/$replacement/go" use $_ implicitly.
1356
1357       A modifier available specifically to search and replace is the "s///e"
1358       evaluation modifier.  "s///e" wraps an "eval{...}" around the replace‐
1359       ment string and the evaluated result is substituted for the matched
1360       substring.  "s///e" is useful if you need to do a bit of computation in
1361       the process of replacing text.  This example counts character frequen‐
1362       cies in a line:
1363
1364           $x = "Bill the cat";
1365           $x =~ s/(.)/$chars{$1}++;$1/eg;  # final $1 replaces char with itself
1366           print "frequency of '$_' is $chars{$_}\n"
1367               foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);
1368
       This prints (the order of characters with equal counts may vary)
1370
1371           frequency of ' ' is 2
1372           frequency of 't' is 2
1373           frequency of 'l' is 2
1374           frequency of 'B' is 1
1375           frequency of 'c' is 1
1376           frequency of 'e' is 1
1377           frequency of 'h' is 1
1378           frequency of 'i' is 1
1379           frequency of 'a' is 1
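
       "s///e" is also handy when the replacement is computed from the
       matched text.  The sketch below uses an invented price list; the
       first substitution doubles each number and the second zero-pads it
       with "sprintf":

```perl
use strict;
use warnings;

my $prices = "apple 3, pear 12, fig 100";

# The replacement is evaluated as Perl code for every match.
(my $doubled = $prices) =~ s/(\d+)/$1 * 2/eg;
(my $padded  = $prices) =~ s/(\d+)/sprintf("%03d", $1)/eg;

print "$doubled\n";
print "$padded\n";
```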
1380
1381       As with the match "m//" operator, "s///" can use other delimiters, such
1382       as "s!!!" and "s{}{}", and even "s{}//".  If single quotes are used
1383       "s'''", then the regexp and replacement are treated as single quoted
1384       strings and there are no substitutions.  "s///" in list context returns
1385       the same thing as in scalar context, i.e., the number of matches.
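
       A short sketch of these points, using made-up strings: alternate
       delimiters avoid escaping slashes, single-quote delimiters suppress
       interpolation, and list context still yields the count:

```perl
use strict;
use warnings;

my $path = "/usr/local/bin";
(my $p1 = $path) =~ s!/usr/local!/opt!;    # no need to escape '/'
(my $p2 = $path) =~ s{/usr/local}{/opt};   # bracketing delimiters

my $name = "World";
(my $t1 = 'Hello NAME') =~ s/NAME/$name/;  # $name is interpolated
(my $t2 = 'Hello NAME') =~ s'NAME'$name';  # $name is taken literally

my $z = "a-b-c";
my @ret = ($z =~ s/-/+/g);                 # list context: still the count

print "$p1\n$p2\n$t1\n$t2\n";
print "substitutions: @ret, result: $z\n";
```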
1386
1387       The split operator
1388
1389       The "split"  function can also optionally use a matching operator "m//"
1390       to split a string.  "split /regexp/, string, limit" splits "string"
1391       into a list of substrings and returns that list.  The regexp is used to
1392       match the character sequence that the "string" is split with respect
1393       to.  The "limit", if present, constrains splitting into no more than
1394       "limit" number of strings.  For example, to split a string into words,
1395       use
1396
1397           $x = "Calvin and Hobbes";
           @words = split /\s+/, $x;  # $words[0] = 'Calvin'
                                      # $words[1] = 'and'
                                      # $words[2] = 'Hobbes'
1401
       If the empty regexp "//" is used, the regexp always matches and the
       string is split into individual characters.  If the regexp has
       groupings, then the list produced contains the matched substrings from
       the groupings as well.  For instance,
1406
1407           $x = "/usr/bin/perl";
1408           @dirs = split m!/!, $x;  # $dirs[0] = ''
1409                                    # $dirs[1] = 'usr'
1410                                    # $dirs[2] = 'bin'
1411                                    # $dirs[3] = 'perl'
1412           @parts = split m!(/)!, $x;  # $parts[0] = ''
1413                                       # $parts[1] = '/'
1414                                       # $parts[2] = 'usr'
1415                                       # $parts[3] = '/'
1416                                       # $parts[4] = 'bin'
1417                                       # $parts[5] = '/'
1418                                       # $parts[6] = 'perl'
1419
1420       Since the first character of $x matched the regexp, "split" prepended
1421       an empty initial element to the list.
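
       The "limit" argument and the empty regexp can be sketched as follows;
       the colon-separated record is a hypothetical example:

```perl
use strict;
use warnings;

my $record = "root:x:0:0:root:/root:/bin/sh";

# With a limit of 2, everything after the first ':' stays in one piece.
my @two = split /:/, $record, 2;

# The empty regexp splits between every pair of characters.
my @chars = split //, "perl";

# The separator can be any regexp, here a comma with optional spaces.
my @fields = split /\s*,\s*/, "a , b,c ,d";

print "first: $two[0]\nrest:  $two[1]\n";
print "chars: @chars\n";
print "fields: @fields\n";
```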
1422
1423       If you have read this far, congratulations! You now have all the basic
1424       tools needed to use regular expressions to solve a wide range of text
1425       processing problems.  If this is your first time through the tutorial,
1426       why not stop here and play around with regexps a while...  Part 2 con‐
1427       cerns the more esoteric aspects of regular expressions and those con‐
1428       cepts certainly aren't needed right at the start.
1429

Part 2: Power tools

1431       OK, you know the basics of regexps and you want to know more.  If
1432       matching regular expressions is analogous to a walk in the woods, then
1433       the tools discussed in Part 1 are analogous to topo maps and a compass,
       basic tools we use all the time.  Most of the tools in Part 2 are
       analogous to flare guns and satellite phones.  They aren't used too often
1436       on a hike, but when we are stuck, they can be invaluable.
1437
1438       What follows are the more advanced, less used, or sometimes esoteric
1439       capabilities of perl regexps.  In Part 2, we will assume you are com‐
1440       fortable with the basics and concentrate on the new features.
1441
1442       More on characters, strings, and character classes
1443
1444       There are a number of escape sequences and character classes that we
1445       haven't covered yet.
1446
1447       There are several escape sequences that convert characters or strings
1448       between upper and lower case.  "\l" and "\u" convert the next character
1449       to lower or upper case, respectively:
1450
1451           $x = "perl";
1452           $string =~ /\u$x/;  # matches 'Perl' in $string
           $x = "M(rs?|s)\\."; # note the double backslash
           $string =~ /\l$x/;  # matches 'mr.', 'mrs.', and 'ms.'
1455
       "\L" and "\U" convert a whole substring, delimited by "\L" or "\U" and
1457       "\E", to lower or upper case:
1458
1459           $x = "This word is in lower case:\L SHOUT\E";
1460           $x =~ /shout/;       # matches
           $x = "I STILL KEYPUNCH CARDS FOR MY 360";
1462           $x =~ /\Ukeypunch/;  # matches punch card string
1463
1464       If there is no "\E", case is converted until the end of the string. The
1465       regexps "\L\u$word" or "\u\L$word" convert the first character of $word
1466       to uppercase and the rest of the characters to lowercase.
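
       These escapes work in ordinary double-quoted strings as well as in
       regexps.  A brief sketch with invented sample words:

```perl
use strict;
use warnings;

my @names = ("pERL", "REGEXP", "tutorial");

# \L lowercases everything that follows; \u then uppercases the first letter.
my @fixed = map "\u\L$_", @names;

# \E ends the case conversion started by \U.
my $shout = "\Uloud\E and quiet";

# The same escapes apply when a variable is interpolated into a regexp.
my $word = "pERL";
my $hit = ("I like Perl" =~ /\u\L$word/) ? 1 : 0;

print "@fixed\n$shout\nhit: $hit\n";
```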
1467
1468       Control characters can be escaped with "\c", so that a control-Z char‐
1469       acter would be matched with "\cZ".  The escape sequence "\Q"..."\E"
       quotes, or protects, most non-alphabetic characters.  For instance,
1471
           $x = "That !^*&%~& cat!";
1473           $x =~ /\Q!^*&%~&\E/;  # check for rough language
1474
1475       It does not protect "$" or "@", so that variables can still be substi‐
1476       tuted.
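
       Because variables are still interpolated, "\Q" is a convenient way to
       match the contents of a variable literally.  In this sketch (the
       sample text is invented) the unquoted pattern fails because "+", "(",
       ")" and "?" act as metacharacters:

```perl
use strict;
use warnings;

my $literal = "1+1=2 (really?)";
my $text    = "They say 1+1=2 (really?) all the time";

# Without \Q the metacharacters in $literal are active.
my $unquoted = ($text =~ /$literal/)      ? 1 : 0;

# With \Q...\E the interpolated text is matched character for character.
my $quoted   = ($text =~ /\Q$literal\E/)  ? 1 : 0;

print "unquoted: $unquoted, quoted: $quoted\n";
```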
1477
1478       With the advent of 5.6.0, perl regexps can handle more than just the
1479       standard ASCII character set.  Perl now supports Unicode, a standard
1480       for encoding the character sets from many of the world's written lan‐
1481       guages.  Unicode does this by allowing characters to be more than one
1482       byte wide.  Perl uses the UTF-8 encoding, in which ASCII characters are
1483       still encoded as one byte, but characters greater than "chr(127)" may
1484       be stored as two or more bytes.
1485
1486       What does this mean for regexps? Well, regexp users don't need to know
1487       much about perl's internal representation of strings.  But they do need
1488       to know 1) how to represent Unicode characters in a regexp and 2) when
1489       a matching operation will treat the string to be searched as a sequence
1490       of bytes (the old way) or as a sequence of Unicode characters (the new
1491       way).  The answer to 1) is that Unicode characters greater than
1492       "chr(127)" may be represented using the "\x{hex}" notation, with "hex"
1493       a hexadecimal integer:
1494
1495           /\x{263a}/;  # match a Unicode smiley face :)
1496
1497       Unicode characters in the range of 128-255 use two hexadecimal digits
1498       with braces: "\x{ab}".  Note that this is different than "\xab", which
1499       is just a hexadecimal byte with no Unicode significance.
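
       A minimal sketch of the "\x{hex}" notation in both a string and a
       regexp follows.  Only ASCII is printed, so no output encoding needs to
       be configured; the variable names are illustrative:

```perl
use strict;
use warnings;

my $s = "smile: \x{263a}";

# The same notation works inside the regexp.
my $found = ($s =~ /\x{263a}/) ? 1 : 0;

# length counts characters, not bytes: 7 ASCII characters plus the smiley.
my $len = length $s;

# Report each non-ASCII character by its code point.
my @codes = map { sprintf "U+%04X", ord($_) } ($s =~ /([^\x{00}-\x{7f}])/g);

print "found: $found, length: $len, non-ASCII: @codes\n";
```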
1500
1501       NOTE: in Perl 5.6.0 it used to be that one needed to say "use utf8" to
       use any Unicode features.  This is no longer the case: for almost all
1503       Unicode processing, the explicit "utf8" pragma is not needed.  (The
1504       only case where it matters is if your Perl script is in Unicode and
1505       encoded in UTF-8, then an explicit "use utf8" is needed.)
1506
1507       Figuring out the hexadecimal sequence of a Unicode character you want
1508       or deciphering someone else's hexadecimal Unicode regexp is about as
1509       much fun as programming in machine code.  So another way to specify
1510       Unicode characters is to use the named character  escape sequence
1511       "\N{name}".  "name" is a name for the Unicode character, as specified
1512       in the Unicode standard.  For instance, if we wanted to represent or
1513       match the astrological sign for the planet Mercury, we could use
1514
1515           use charnames ":full"; # use named chars with Unicode full names
1516           $x = "abc\N{MERCURY}def";
1517           $x =~ /\N{MERCURY}/;   # matches
1518
1519       One can also use short names or restrict names to a certain alphabet:
1520
1521           use charnames ':full';
1522           print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n";
1523
1524           use charnames ":short";
1525           print "\N{greek:Sigma} is an upper-case sigma.\n";
1526
1527           use charnames qw(greek);
1528           print "\N{sigma} is Greek sigma\n";
1529
1530       A list of full names is found in the file Names.txt in the
1531       lib/perl5/5.X.X/unicore directory.
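
       The names can also be looked up from code.  This sketch assumes the
       "charnames::viacode" and "charnames::vianame" functions provided by
       the "charnames" module, and checks that a named character and its
       "\x{hex}" form are the same string:

```perl
use strict;
use warnings;
use charnames ":full";

my $by_name = "\N{WHITE SMILING FACE}";
my $by_code = "\x{263a}";
my $same    = ($by_name eq $by_code) ? 1 : 0;

# Code point to name, and name to code point.
my $name = charnames::viacode(0x263a);
my $code = charnames::vianame("GREEK SMALL LETTER SIGMA");

print "same: $same\n";
print "U+263A is $name\n";
printf "GREEK SMALL LETTER SIGMA is U+%04X\n", $code;
```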
1532
1533       The answer to requirement 2), as of 5.6.0, is that if a regexp contains
1534       Unicode characters, the string is searched as a sequence of Unicode
1535       characters.  Otherwise, the string is searched as a sequence of bytes.
1536       If the string is being searched as a sequence of Unicode characters,
1537       but matching a single byte is required, we can use the "\C" escape
1538       sequence.  "\C" is a character class akin to "." except that it matches
1539       any byte 0-255.  So
1540
1541           use charnames ":full"; # use named chars with Unicode full names
1542           $x = "a";
1543           $x =~ /\C/;  # matches 'a', eats one byte
1544           $x = "";
1545           $x =~ /\C/;  # doesn't match, no bytes to match
           $x = "\N{MERCURY}";  # multi-byte Unicode character (three bytes in UTF-8)
1547           $x =~ /\C/;  # matches, but dangerous!
1548
1549       The last regexp matches, but is dangerous because the string character
1550       position is no longer synchronized to the string byte position.  This
1551       generates the warning 'Malformed UTF-8 character'.  The "\C" is best
1552       used for matching the binary data in strings with binary data inter‐
1553       mixed with Unicode characters.
1554
1555       Let us now discuss the rest of the character classes.  Just as with
1556       Unicode characters, there are named Unicode character classes repre‐
1557       sented by the "\p{name}" escape sequence.  Closely associated is the
1558       "\P{name}" character class, which is the negation of the "\p{name}"
1559       class.  For example, to match lower and uppercase characters,
1560
1561           use charnames ":full"; # use named chars with Unicode full names
1562           $x = "BOB";
1563           $x =~ /^\p{IsUpper}/;   # matches, uppercase char class
1564           $x =~ /^\P{IsUpper}/;   # doesn't match, char class sans uppercase
1565           $x =~ /^\p{IsLower}/;   # doesn't match, lowercase char class
1566           $x =~ /^\P{IsLower}/;   # matches, char class sans lowercase
1567
1568       Here is the association between some Perl named classes and the tradi‐
1569       tional Unicode classes:
1570
1571           Perl class name  Unicode class name or regular expression
1572
1573           IsAlpha          /^[LM]/
1574           IsAlnum          /^[LMN]/
1575           IsASCII          $code <= 127
1576           IsCntrl          /^C/
           IsBlank          $code =~ /^(0020|0009)$/ || /^Z[^lp]/
           IsDigit          Nd
           IsGraph          /^([LMNPS]|Co)/
           IsLower          Ll
           IsPrint          /^([LMNPS]|Co|Zs)/
           IsPunct          /^P/
           IsSpace          /^Z/ || ($code =~ /^(0009|000A|000B|000C|000D)$/)
           IsSpacePerl      /^Z/ || ($code =~ /^(0009|000A|000C|000D|0085|2028|2029)$/)
           IsUpper          /^L[ut]/
           IsWord           /^[LMN]/ || $code eq "005F"
           IsXDigit         $code =~ /^00(3[0-9]|[46][1-6])$/
1588
1589       You can also use the official Unicode class names with the "\p" and
1590       "\P", like "\p{L}" for Unicode 'letters', or "\p{Lu}" for uppercase
1591       letters, or "\P{Nd}" for non-digits.  If a "name" is just one letter,
1592       the braces can be dropped.  For instance, "\pM" is the character class
1593       of Unicode 'marks', for example accent marks.  For the full list see
1594       perlunicode.
1595
       Unicode has also been separated into various sets of characters
1597       which you can test with "\p{In...}" (in) and "\P{In...}" (not in), for
1598       example "\p{Latin}", "\p{Greek}", or "\P{Katakana}".  For the full list
1599       see perlunicode.
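
       A short sketch of the official class names and the script tests on
       invented sample data (only ASCII is printed):

```perl
use strict;
use warnings;

my $s = "Perl 5 rocks";

my @upper     = ($s =~ /(\p{Lu})/g);   # uppercase letters
my @words     = ($s =~ /\pL+/g);       # runs of letters; braces dropped
my @non_digit = ($s =~ /\P{Nd}+/g);    # runs of anything but digits

my $sigma    = "\x{3c3}";              # GREEK SMALL LETTER SIGMA
my $is_greek = ($sigma =~ /\p{Greek}/) ? 1 : 0;
my $is_latin = ($sigma =~ /\p{Latin}/) ? 1 : 0;

print "upper: @upper\n";
print "words: @words\n";
print "non-digit runs: ", scalar(@non_digit), "\n";
print "sigma is Greek: $is_greek, Latin: $is_latin\n";
```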
1600
       "\X" is an abbreviation for a character class sequence that includes
       the Unicode 'combining character sequences'.  A 'combining character
       sequence' is a base character followed by any number of combining
       characters.  An example of a combining character is an accent.  Using
       the Unicode full names, e.g., "A + COMBINING RING" is a combining
       character sequence with base character "A" and combining character
       "COMBINING RING", which translates in Danish to A with the circle atop
       it, as in the word Angstrom.  "\X" is equivalent to "\PM\pM*", i.e., a
       non-mark followed by zero or more marks.
1610
1611       For the full and latest information about Unicode see the latest Uni‐
1612       code standard, or the Unicode Consortium's website http://www.uni‐
1613       code.org/
1614
1615       As if all those classes weren't enough, Perl also defines POSIX style
1616       character classes.  These have the form "[:name:]", with "name" the
1617       name of the POSIX class.  The POSIX classes are "alpha", "alnum",
1618       "ascii", "cntrl", "digit", "graph", "lower", "print", "punct", "space",
1619       "upper", and "xdigit", and two extensions, "word" (a Perl extension to
1620       match "\w"), and "blank" (a GNU extension).  If "utf8" is being used,
1621       then these classes are defined the same as their corresponding perl
1622       Unicode classes: "[:upper:]" is the same as "\p{IsUpper}", etc.  The
1623       POSIX character classes, however, don't require using "utf8".  The
1624       "[:digit:]", "[:word:]", and "[:space:]" correspond to the familiar
1625       "\d", "\w", and "\s" character classes.  To negate a POSIX class, put a
1626       "^" in front of the name, so that, e.g., "[:^digit:]" corresponds to
1627       "\D" and under "utf8", "\P{IsDigit}".  The Unicode and POSIX character
1628       classes can be used just like "\d", with the exception that POSIX char‐
1629       acter classes can only be used inside of a character class:
1630
1631           /\s+[abc[:digit:]xyz]\s*/;  # match a,b,c,x,y,z, or a digit
1632           /^=item\s[[:digit:]]/;      # match '=item',
1633                                       # followed by a space and a digit
1634           use charnames ":full";
1635           /\s+[abc\p{IsDigit}xyz]\s+/;  # match a,b,c,x,y,z, or a digit
1636           /^=item\s\p{IsDigit}/;        # match '=item',
1637                                         # followed by a space and a digit
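
       A few more POSIX class uses, as a sketch on made-up input; note the
       double brackets, since the class name must sit inside a character
       class:

```perl
use strict;
use warnings;

my $line = "item42: ok";

# Letters followed by digits; capture the digits.
my ($id) = $line =~ /^[[:alpha:]]+([[:digit:]]+)/;

# A negated POSIX class: a string with no digits at all.
my $no_digits = ("abc" =~ /^[[:^digit:]]+$/) ? 1 : 0;

# [[:word:]] behaves like \w, so '_' is included.
my @tokens = ("a1_b2 c3" =~ /[[:word:]]+/g);

print "id: $id\n";
print "no digits: $no_digits\n";
print "tokens: @tokens\n";
```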
1638
1639       Whew! That is all the rest of the characters and character classes.
1640
1641       Compiling and saving regular expressions
1642
1643       In Part 1 we discussed the "//o" modifier, which compiles a regexp just
1644       once.  This suggests that a compiled regexp is some data structure that
1645       can be stored once and used again and again.  The regexp quote "qr//"
1646       does exactly that: "qr/string/" compiles the "string" as a regexp and
1647       transforms the result into a form that can be assigned to a variable:
1648
           $reg = qr/foo+bar?/;  # $reg contains a compiled regexp
1650
1651       Then $reg can be used as a regexp:
1652
1653           $x = "fooooba";
1654           $x =~ $reg;     # matches, just like /foo+bar?/
1655           $x =~ /$reg/;   # same thing, alternate form
1656
1657       $reg can also be interpolated into a larger regexp:
1658
1659           $x =~ /(abc)?$reg/;  # still matches
1660
1661       As with the matching operator, the regexp quote can use different
1662       delimiters, e.g., "qr!!", "qr{}" and "qr~~".  The single quote delim‐
1663       iters "qr''" prevent any interpolation from taking place.
1664
       Pre-compiled regexps are useful for creating dynamic matches that don't
       need to be recompiled each time they are encountered.  Using
       pre-compiled regexps, the "simple_grep" program can be expanded into a
       program that matches multiple patterns:
1669
1670           % cat > multi_grep
1671           #!/usr/bin/perl
1672           # multi_grep - match any of <number> regexps
1673           # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ...
1674
1675           $number = shift;
1676           $regexp[$_] = shift foreach (0..$number-1);
1677           @compiled = map qr/$_/, @regexp;
1678           while ($line = <>) {
1679               foreach $pattern (@compiled) {
1680                   if ($line =~ /$pattern/) {
1681                       print $line;
1682                       last;  # we matched, so move onto the next line
1683                   }
1684               }
1685           }
1686           ^D
1687
1688           % multi_grep 2 last for multi_grep
1689               $regexp[$_] = shift foreach (0..$number-1);
1690                   foreach $pattern (@compiled) {
1691                           last;
1692
1693       Storing pre-compiled regexps in an array @compiled allows us to simply
1694       loop through the regexps without any recompilation, thus gaining flexi‐
1695       bility without sacrificing speed.
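
       Since a compiled regexp can be interpolated into a larger one, another
       option is to join the pieces into a single alternation and test each
       line once.  This is a sketch of that variation, not a replacement for
       "multi_grep"; the word list and sample lines are invented:

```perl
use strict;
use warnings;

# Compile each word as its own regexp, with word boundaries.
my @compiled = map qr/\b$_\b/, qw(last for);

# Interpolating the compiled regexps keeps each one's grouping intact.
my $any    = join '|', @compiled;
my $any_re = qr/$any/;

my @lines = ("last call", "foreach item", "wait for it", "nothing here");
my @hits  = grep { $_ =~ $any_re } @lines;

print "$_\n" for @hits;
```

       Here "foreach item" is not reported, because "\b" requires "for" to
       stand as a whole word.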
1696
1697       Embedding comments and modifiers in a regular expression
1698
1699       Starting with this section, we will be discussing Perl's set of
1700       extended patterns.  These are extensions to the traditional regular
1701       expression syntax that provide powerful new tools for pattern matching.
1702       We have already seen extensions in the form of the minimal matching
1703       constructs "??", "*?", "+?", "{n,m}?", and "{n,}?".  The rest of the
1704       extensions below have the form "(?char...)", where the "char" is a
1705       character that determines the type of extension.
1706
1707       The first extension is an embedded comment "(?#text)".  This embeds a
1708       comment into the regular expression without affecting its meaning.  The
1709       comment should not have any closing parentheses in the text.  An exam‐
1710       ple is
1711
1712           /(?# Match an integer:)[+-]?\d+/;
1713
1714       This style of commenting has been largely superseded by the raw,
1715       freeform commenting that is allowed with the "//x" modifier.
1716
       The modifiers "//i", "//m", "//s", and "//x" can also be embedded in a
1718       regexp using "(?i)", "(?m)", "(?s)", and "(?x)".  For instance,
1719
1720           /(?i)yes/;  # match 'yes' case insensitively
1721           /yes/i;     # same thing
1722           /(?x)(          # freeform version of an integer regexp
1723                    [+-]?  # match an optional sign
1724                    \d+    # match a sequence of digits
1725                )
1726           /x;
1727
       Embedded modifiers have two important advantages over the usual
       modifiers.  First, embedded modifiers allow a custom set of modifiers
       for each regexp pattern.  This is great for matching an array of regexps
       that must have different modifiers:
1732
1733           $pattern[0] = '(?i)doctor';
1734           $pattern[1] = 'Johnson';
1735           ...
1736           while (<>) {
1737               foreach $patt (@pattern) {
1738                   print if /$patt/;
1739               }
1740           }
1741
1742       The second advantage is that embedded modifiers only affect the regexp
1743       inside the group the embedded modifier is contained in.  So grouping
1744       can be used to localize the modifier's effects:
1745
1746           /Answer: ((?i)yes)/;  # matches 'Answer: yes', 'Answer: YES', etc.
1747
1748       Embedded modifiers can also turn off any modifiers already present by
1749       using, e.g., "(?-i)".  Modifiers can also be combined into a single
1750       expression, e.g., "(?s-i)" turns on single line mode and turns off case
1751       insensitivity.
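
       The following sketch exercises all three forms on invented test
       strings; the helper "matches" is a hypothetical convenience defined
       here, not part of Perl:

```perl
use strict;
use warnings;

# Hypothetical helper: 1 or 0 for each string, depending on a match.
sub matches {
    my ($re, @strings) = @_;
    return map { ($_ =~ $re) ? 1 : 0 } @strings;
}

# (?i) inside the group affects only 'yes'.
my @local = matches(qr/Answer: ((?i)yes)/, "Answer: YES", "ANSWER: yes");

# (?-i) turns case insensitivity off for the rest of the pattern.
my @off   = matches(qr/perl (?-i)Rocks/i, "PERL Rocks", "perl rocks");

# (?s-i): from here on '.' matches a newline and case matters again.
my @combo = matches(qr/a(?s-i).b/i, "A\nb", "A\nB");

print "local: @local\noff: @off\ncombo: @combo\n";
```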
1752
1753       Non-capturing groupings
1754
1755       We noted in Part 1 that groupings "()" had two distinct functions: 1)
1756       group regexp elements together as a single unit, and 2) extract, or
1757       capture, substrings that matched the regexp in the grouping.  Non-cap‐
1758       turing groupings, denoted by "(?:regexp)", allow the regexp to be
1759       treated as a single unit, but don't extract substrings or set matching
1760       variables $1, etc.  Both capturing and non-capturing groupings are
1761       allowed to co-exist in the same regexp.  Because there is no extrac‐
1762       tion, non-capturing groupings are faster than capturing groupings.
1763       Non-capturing groupings are also handy for choosing exactly which parts
1764       of a regexp are to be extracted to matching variables:
1765
           # match a number, $1-$4 are set, but we only want $1
           /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/;

           # match a number faster, only $1 is set
           /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/;

           # match a number, get $1 = whole number, $2 = exponent
           /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;
1774
1775       Non-capturing groupings are also useful for removing nuisance elements
1776       gathered from a split operation:
1777
1778           $x = '12a34b5';
           @num = split /(a|b)/, $x;    # @num = ('12','a','34','b','5')
           @num = split /(?:a|b)/, $x;  # @num = ('12','34','5')
1781
1782       Non-capturing groupings may also have embedded modifiers: "(?i-m:reg‐
1783       exp)" is a non-capturing grouping that matches "regexp" case insensi‐
1784       tively and turns off multi-line mode.
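
       To see which groupings actually capture, the matches can be collected
       in list context.  This sketch reuses the number regexp with the
       exponent grouping and adds a non-capturing grouping with an embedded
       modifier; the sample strings are invented:

```perl
use strict;
use warnings;

# Only two groupings capture: the whole number and the exponent.
my $number = qr/([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;
my @caps = ("3.14e-10" =~ $number);

# (?i:...) makes only 'doctor' case insensitive and captures nothing.
my $title = qr/(?i:doctor) Johnson/;
my $upper = ("DOCTOR Johnson" =~ $title) ? 1 : 0;
my $lower = ("doctor johnson" =~ $title) ? 1 : 0;

print "captures: @caps\n";
print "DOCTOR Johnson: $upper, doctor johnson: $lower\n";
```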
1785
1786       Looking ahead and looking behind
1787
1788       This section concerns the lookahead and lookbehind assertions.  First,
1789       a little background.
1790
1791       In Perl regular expressions, most regexp elements 'eat up' a certain
1792       amount of string when they match.  For instance, the regexp element
       "[abc]" eats up one character of the string when it matches, in the
1794       sense that perl moves to the next character position in the string
1795       after the match.  There are some elements, however, that don't eat up
1796       characters (advance the character position) if they match.  The exam‐
1797       ples we have seen so far are the anchors.  The anchor "^" matches the
1798       beginning of the line, but doesn't eat any characters.  Similarly, the
1799       word boundary anchor "\b" matches, e.g., if the character to the left
1800       is a word character and the character to the right is a non-word char‐
1801       acter, but it doesn't eat up any characters itself.  Anchors are exam‐
1802       ples of 'zero-width assertions'.  Zero-width, because they consume no
1803       characters, and assertions, because they test some property of the
1804       string.  In the context of our walk in the woods analogy to regexp
1805       matching, most regexp elements move us along a trail, but anchors have
1806       us stop a moment and check our surroundings.  If the local environment
1807       checks out, we can proceed forward.  But if the local environment
1808       doesn't satisfy us, we must backtrack.
1809
1810       Checking the environment entails either looking ahead on the trail,
1811       looking behind, or both.  "^" looks behind, to see that there are no
1812       characters before.  "$" looks ahead, to see that there are no charac‐
1813       ters after.  "\b" looks both ahead and behind, to see if the characters
1814       on either side differ in their 'word'-ness.
1815
1816       The lookahead and lookbehind assertions are generalizations of the
1817       anchor concept.  Lookahead and lookbehind are zero-width assertions
1818       that let us specify which characters we want to test for.  The looka‐
1819       head assertion is denoted by "(?=regexp)" and the lookbehind assertion
1820       is denoted by "(?<=fixed-regexp)".  Some examples are
1821
1822           $x = "I catch the housecat 'Tom-cat' with catnip";
1823           $x =~ /cat(?=\s+)/;  # matches 'cat' in 'housecat'
1824           @catwords = ($x =~ /(?<=\s)cat\w+/g);  # matches,
1825                                                  # $catwords[0] = 'catch'
1826                                                  # $catwords[1] = 'catnip'
1827           $x =~ /\bcat\b/;  # matches 'cat' in 'Tom-cat'
1828           $x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in
1829                                     # middle of $x
1830
1831       Note that the parentheses in "(?=regexp)" and "(?<=regexp)" are
1832       non-capturing, since these are zero-width assertions.  Thus in the sec‐
1833       ond regexp, the substrings captured are those of the whole regexp
1834       itself.  Lookahead "(?=regexp)" can match arbitrary regexps, but look‐
1835       behind "(?<=fixed-regexp)" only works for regexps of fixed width, i.e.,
       a fixed number of characters long.  Thus "(?<=(ab|bc))" is fine, but
1837       "(?<=(ab)*)" is not.  The negated versions of the lookahead and lookbe‐
1838       hind assertions are denoted by "(?!regexp)" and "(?<!fixed-regexp)"
1839       respectively.  They evaluate true if the regexps do not match:
1840
1841           $x = "foobar";
1842           $x =~ /foo(?!bar)/;  # doesn't match, 'bar' follows 'foo'
1843           $x =~ /foo(?!baz)/;  # matches, 'baz' doesn't follow 'foo'
1844           $x =~ /(?<!\s)foo/;  # matches, there is no \s before 'foo'
1845
1846       The "\C" is unsupported in lookbehind, because the already treacherous
1847       definition of "\C" would become even more so when going backwards.
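
       Because both kinds of assertion are zero-width, they can be combined to
       match a position rather than any text.  The following sketch inserts
       thousands separators into a string of digits; the helper name "commify"
       is hypothetical, and the code assumes its input is a plain integer:

```perl
# Hypothetical helper: insert commas between groups of three digits.
sub commify {
    my ($n) = @_;
    # Match the empty string that has a digit behind it and, ahead of it,
    # a whole number of three-digit groups followed by no further digit.
    $n =~ s/(?<=\d)(?=(?:\d{3})+(?!\d))/,/g;
    return $n;
}

print commify("1234567"), "\n";   # prints 1,234,567
print commify("999"), "\n";       # prints 999
```

       The substitution never consumes a character, so the digits themselves
       are left untouched and only the commas are added.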
1848
1849       Using independent subexpressions to prevent backtracking
1850
1851       The last few extended patterns in this tutorial are experimental as of
1852       5.6.0.  Play with them, use them in some code, but don't rely on them
1853       just yet for production code.
1854
       Independent subexpressions are regular expressions, in the context of
1856       a larger regular expression, that function independently of the larger
1857       regular expression.  That is, they consume as much or as little of the
1858       string as they wish without regard for the ability of the larger regexp
1859       to match.  Independent subexpressions are represented by "(?>regexp)".
1860       We can illustrate their behavior by first considering an ordinary reg‐
1861       exp:
1862
1863           $x = "ab";
1864           $x =~ /a*ab/;  # matches
1865
1866       This obviously matches, but in the process of matching, the subexpres‐
1867       sion "a*" first grabbed the "a".  Doing so, however, wouldn't allow the
1868       whole regexp to match, so after backtracking, "a*" eventually gave back
1869       the "a" and matched the empty string.  Here, what "a*" matched was
1870       dependent on what the rest of the regexp matched.
1871
1872       Contrast that with an independent subexpression:
1873
1874           $x =~ /(?>a*)ab/;  # doesn't match!
1875
1876       The independent subexpression "(?>a*)" doesn't care about the rest of
1877       the regexp, so it sees an "a" and grabs it.  Then the rest of the reg‐
1878       exp "ab" cannot match.  Because "(?>a*)" is independent, there is no
1879       backtracking and the independent subexpression does not give up its
1880       "a".  Thus the match of the regexp as a whole fails.  A similar behav‐
1881       ior occurs with completely independent regexps:
1882
1883           $x = "ab";
1884           $x =~ /a*/g;   # matches, eats an 'a'
1885           $x =~ /\Gab/g; # doesn't match, no 'a' available
1886
1887       Here "//g" and "\G" create a 'tag team' handoff of the string from one
1888       regexp to the other.  Regexps with an independent subexpression are
1889       much like this, with a handoff of the string to the independent subex‐
1890       pression, and a handoff of the string back to the enclosing regexp.
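
       The difference can be checked side by side.  This is only a small
       sketch with made-up variable names; it runs an ordinary and an
       independent version of the same regexp against one string:

```perl
my $x = "aaab";
my $plain   = $x =~ /a*ab/     ? "matches" : "fails";  # a* gives back one 'a'
my $atomic  = $x =~ /(?>a*)ab/ ? "matches" : "fails";  # (?>a*) keeps every 'a'
my $atomic2 = $x =~ /(?>a*)b/  ? "matches" : "fails";  # nothing has to be given back
print "plain: $plain, independent: $atomic, independent without 'a': $atomic2\n";
# prints: plain: matches, independent: fails, independent without 'a': matches
```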
1891
1892       The ability of an independent subexpression to prevent backtracking can
1893       be quite useful.  Suppose we want to match a non-empty string enclosed
1894       in parentheses up to two levels deep.  Then the following regexp
1895       matches:
1896
           $x = "abc(de(fg)h";  # unbalanced parentheses
           $x =~ /\( ( [^()]+ | \([^()]*\) )+ \)/x;
1899
       The regexp matches an open parenthesis, one or more copies of an
       alternation, and a close parenthesis.  The alternation is two-way, with
       the first alternative "[^()]+" matching a substring with no parentheses
       and the second alternative "\([^()]*\)" matching a substring delimited
       by parentheses.  The problem with this regexp is that it is
       pathological: it has nested indeterminate quantifiers of the form
       "(a+|b)+".  We discussed in Part 1 how nested quantifiers like this
       could take an exponentially long time to execute if there was no match
       possible.  To prevent the exponential blowup, we need to prevent
       useless backtracking at some point.  This can be done by enclosing the
       inner quantifier as an independent subexpression:
1911
           $x =~ /\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;
1913
       Here, "(?>[^()]+)" breaks the degeneracy of string partitioning by
       gobbling up as much of the string as possible and keeping it.  Then
       match failures fail much more quickly.
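
       As a rough illustration (the test strings are invented for this
       sketch), the regexp with the independent subexpression still finds a
       balanced group, and it gives up on a long unbalanced string without
       trying every way of splitting it up:

```perl
my $atomic = qr/\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;

my $balanced = "x(de(fg)h)y";
my ($found) = $balanced =~ /($atomic)/;   # the whole parenthesized group

my $unbalanced = "(" . "a" x 5000;        # an open parenthesis that is never closed
my $fails = $unbalanced !~ $atomic;       # true: no match is possible
print "found '$found'\n";                 # prints: found '(de(fg)h)'
print "unbalanced input rejected\n" if $fails;
```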
1917
1918       Conditional expressions
1919
       A conditional expression is a form of if-then-else statement that
       allows one to choose which patterns are to be matched, based on some
       condition.  There are two types of conditional expression:
       "(?(condition)yes-regexp)" and "(?(condition)yes-regexp|no-regexp)".
       "(?(condition)yes-regexp)" is like an 'if () {}' statement in Perl.
       If the "condition" is true, the "yes-regexp" will be matched.  If the
       "condition" is false, the "yes-regexp" will be skipped and perl will
       move on to the next regexp element.  The second form is like an
       'if () {} else {}' statement in Perl.  If the "condition" is true, the
       "yes-regexp" will be matched, otherwise the "no-regexp" will be
       matched.
1931
1932       The "condition" can have two forms.  The first form is simply an inte‐
1933       ger in parentheses "(integer)".  It is true if the corresponding back‐
1934       reference "\integer" matched earlier in the regexp.  The second form is
1935       a bare zero width assertion "(?...)", either a lookahead, a lookbehind,
1936       or a code assertion (discussed in the next section).
1937
1938       The integer form of the "condition" allows us to choose, with more
1939       flexibility, what to match based on what matched earlier in the regexp.
1940       This searches for words of the form "$x$x" or "$x$y$y$x":
1941
           % simple_grep '^(\w+)(\w+)?(?(2)\2\1|\1)$' /usr/dict/words
           beriberi
           coco
           couscous
           deed
           ...
           toot
           toto
           tutu
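
       The run above depends on the "simple_grep" script and on a system word
       list.  As a self-contained sketch, the same regexp can be tried
       against a short in-line list (the words are chosen here purely for
       illustration):

```perl
my $re = qr/^(\w+)(\w+)?(?(2)\2\1|\1)$/;
my @words   = qw(beriberi coco deed toot perl regexp);
# Keep the words that have the form $x$x or $x$y$y$x.
my @matched = grep { /$re/ } @words;
print "@matched\n";   # prints: beriberi coco deed toot
```

       For 'deed', group 1 is 'd' and group 2 is 'e', so the condition "(2)"
       is true and "\2\1" must match 'ed'.  For 'coco', group 2 does not
       take part in the match, so the plain "\1" branch is used.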
1951
1952       The lookbehind "condition" allows, along with backreferences, an ear‐
1953       lier part of the match to influence a later part of the match.  For
1954       instance,
1955
           /[ATGC]+(?(?<=AA)G|C)$/;
1957
       matches a DNA sequence such that it either ends in "AAG", or some other
       base pair combination and "C".  Note that the form is "(?(?<=AA)G|C)"
       and not "(?((?<=AA))G|C)"; for the lookahead, lookbehind or code
       assertions, the parentheses around the conditional are not needed.
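
       A short sketch makes the two branches concrete.  The sequences below
       are invented test data, not taken from the tutorial:

```perl
my $re = qr/[ATGC]+(?(?<=AA)G|C)$/;
my %ok;
for my $dna (qw(GATTAAG GATTC GATTAAC GATTG)) {
    # After 'AA' the last base must be 'G'; otherwise it must be 'C'.
    $ok{$dna} = $dna =~ $re ? 1 : 0;
    print "$dna: ", ($ok{$dna} ? "matches" : "does not match"), "\n";
}
```

       Here 'GATTAAG' and 'GATTC' match, while 'GATTAAC' and 'GATTG' do not.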
1962
1963       A bit of magic: executing Perl code in a regular expression
1964
1965       Normally, regexps are a part of Perl expressions.  Code evaluation
1966       expressions turn that around by allowing arbitrary Perl code to be a
1967       part of a regexp.  A code evaluation expression is denoted "(?{code})",
1968       with "code" a string of Perl statements.
1969
1970       Code expressions are zero-width assertions, and the value they return
1971       depends on their environment.  There are two possibilities: either the
1972       code expression is used as a conditional in a conditional expression
1973       "(?(condition)...)", or it is not.  If the code expression is a condi‐
1974       tional, the code is evaluated and the result (i.e., the result of the
1975       last statement) is used to determine truth or falsehood.  If the code
1976       expression is not used as a conditional, the assertion always evaluates
1977       true and the result is put into the special variable $^R.  The variable
1978       $^R can then be used in code expressions later in the regexp.  Here are
1979       some silly examples:
1980
1981           $x = "abcdef";
1982           $x =~ /abc(?{print "Hi Mom!";})def/; # matches,
1983                                                # prints 'Hi Mom!'
1984           $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
1985                                                # no 'Hi Mom!'
1986
1987       Pay careful attention to the next example:
1988
1989           $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match,
1990                                                # no 'Hi Mom!'
1991                                                # but why not?
1992
1993       At first glance, you'd think that it shouldn't print, because obviously
1994       the "ddd" isn't going to match the target string. But look at this
1995       example:
1996
1997           $x =~ /abc(?{print "Hi Mom!";})[d]dd/; # doesn't match,
1998                                                  # but _does_ print
1999
2000       Hmm. What happened here? If you've been following along, you know that
2001       the above pattern should be effectively the same as the last one --
2002       enclosing the d in a character class isn't going to change what it
2003       matches. So why does the first not print while the second one does?
2004
2005       The answer lies in the optimizations the REx engine makes. In the first
2006       case, all the engine sees are plain old characters (aside from the
2007       "?{}" construct). It's smart enough to realize that the string 'ddd'
2008       doesn't occur in our target string before actually running the pattern
2009       through. But in the second case, we've tricked it into thinking that
2010       our pattern is more complicated than it is. It takes a look, sees our
2011       character class, and decides that it will have to actually run the pat‐
2012       tern to determine whether or not it matches, and in the process of run‐
2013       ning it hits the print statement before it discovers that we don't have
2014       a match.
2015
2016       To take a closer look at how the engine does optimizations, see the
2017       section "Pragmas and debugging" below.
2018
2019       More fun with "?{}":
2020
2021           $x =~ /(?{print "Hi Mom!";})/;       # matches,
2022                                                # prints 'Hi Mom!'
2023           $x =~ /(?{$c = 1;})(?{print "$c";})/;  # matches,
2024                                                  # prints '1'
2025           $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,
2026                                                  # prints '1'
2027
2028       The bit of magic mentioned in the section title occurs when the regexp
2029       backtracks in the process of searching for a match.  If the regexp
2030       backtracks over a code expression and if the variables used within are
2031       localized using "local", the changes in the variables produced by the
2032       code expression are undone! Thus, if we wanted to count how many times
2033       a character got matched inside a group, we could use, e.g.,
2034
2035           $x = "aaaa";
2036           $count = 0;  # initialize 'a' count
2037           $c = "bob";  # test if $c gets clobbered
2038           $x =~ /(?{local $c = 0;})         # initialize count
2039                  ( a                        # match 'a'
2040                    (?{local $c = $c + 1;})  # increment count
2041                  )*                         # do this any number of times,
2042                  aa                         # but match 'aa' at the end
2043                  (?{$count = $c;})          # copy local $c var into $count
2044                 /x;
2045           print "'a' count is $count, \$c variable is '$c'\n";
2046
2047       This prints
2048
2049           'a' count is 2, $c variable is 'bob'
2050
       If we replace the "(?{local $c = $c + 1;})" with "(?{$c = $c + 1;})",
       the variable changes are not undone during backtracking, and we get
2054
2055           'a' count is 4, $c variable is 'bob'
2056
2057       Note that only localized variable changes are undone.  Other side
2058       effects of code expression execution are permanent.  Thus
2059
2060           $x = "aaaa";
2061           $x =~ /(a(?{print "Yow\n";}))*aa/;
2062
2063       produces
2064
2065          Yow
2066          Yow
2067          Yow
2068          Yow
2069
2070       The result $^R is automatically localized, so that it will behave prop‐
2071       erly in the presence of backtracking.
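
       The counting example can therefore be written with $^R alone.  This
       is a sketch rather than the tutorial's own code: it assumes only the
       documented behavior that each code expression's result lands in $^R,
       and it uses a package variable "$count" to carry the answer out of
       the regexp:

```perl
our $count;
"aaaa" =~ /(?{ 0 })                  # the running result starts at 0
           (?: a (?{ $^R + 1 }) )*   # every matched a yields the old result plus 1
           aa                        # two of the four must be given back
           (?{ $count = $^R })       # keep the result that survived
          /x;
print "count is $count\n";
```

       The loop first grabs all four 'a' characters, then backtracks twice so
       that "aa" can match.  Each step back restores the previous $^R, so the
       count that survives is 2, with no explicit "local" needed.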
2072
2073       This example uses a code expression in a conditional to match the arti‐
2074       cle 'the' in either English or German:
2075
           $lang = 'DE';  # use German
           ...
           $text = "das";
           print "matched\n"
               if $text =~ /(?(?{
                                 $lang eq 'EN'; # is the language English?
                                })
                              the |             # if so, then match 'the'
                              (die|das|der)     # else, match 'die|das|der'
                            )
                           /xi;
2087
       Note that the syntax here is "(?(?{...})yes-regexp|no-regexp)", not
       "(?((?{...}))yes-regexp|no-regexp)".  In other words, in the case of a
       code expression, we don't need the extra parentheses around the
       conditional.
2092
2093       If you try to use code expressions with interpolating variables, perl
2094       may surprise you:
2095
2096           $bar = 5;
2097           $pat = '(?{ 1 })';
2098           /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated
2099           /foo(?{ 1 })$bar/;   # compile error!
2100           /foo${pat}bar/;      # compile error!
2101
2102           $pat = qr/(?{ $foo = 1 })/;  # precompile code regexp
2103           /foo${pat}bar/;      # compiles ok
2104
2105       If a regexp has (1) code expressions and interpolating variables, or
2106       (2) a variable that interpolates a code expression, perl treats the
2107       regexp as an error. If the code expression is precompiled into a vari‐
2108       able, however, interpolating is ok. The question is, why is this an
2109       error?
2110
2111       The reason is that variable interpolation and code expressions together
2112       pose a security risk.  The combination is dangerous because many pro‐
2113       grammers who write search engines often take user input and plug it
2114       directly into a regexp:
2115
           $regexp = <>;       # read user-supplied regexp
           chomp $regexp;      # get rid of possible newline
           $text =~ /$regexp/; # search $text for the $regexp
2119
       If the $regexp variable contains a code expression, the user could then
       execute arbitrary Perl code.  For instance, some joker could search for
       "(?{ system('rm -rf *'); })" to erase your files.  In this sense, the
       combination of interpolation and code expressions taints your regexp.
       So by default, using both interpolation and code expressions in the
       same regexp is not allowed.  If you're not concerned about malicious
       users, it is possible to bypass this security check by invoking
       "use re 'eval'":
2127
2128           use re 'eval';       # throw caution out the door
2129           $bar = 5;
2130           $pat = '(?{ 1 })';
2131           /foo(?{ 1 })$bar/;   # compiles ok
2132           /foo${pat}bar/;      # compiles ok
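
       When user input is meant as literal text rather than as a regexp, a
       safer route than "use re 'eval'" is to quote it with "\Q...\E" (the
       same quoting that the "quotemeta" function performs).  The strings in
       this sketch are invented stand-ins for real input:

```perl
my $input = '(?{ die "gotcha" })';   # stands in for hostile user input
my $text  = 'the text (?{ die "gotcha" }) appears literally here';

# \Q...\E escapes every metacharacter, so the input is matched character
# for character and is never compiled as a code expression.
my $found = $text =~ /\Q$input\E/ ? 1 : 0;

my $dot_input   = 'a.c';
my $literal_hit = 'a.c' =~ /\Q$dot_input\E/ ? 1 : 0;   # the dot matches itself
my $wild_hit    = 'abc' =~ /\Q$dot_input\E/ ? 1 : 0;   # but not any character
print "found=$found literal=$literal_hit wild=$wild_hit\n";
# prints: found=1 literal=1 wild=0
```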
2133
       Another form of code expression is the pattern code expression.  The
       pattern code expression is like a regular code expression, except that
       the result of the code evaluation is treated as a regular expression
       and matched immediately.  A simple example is
2138
2139           $length = 5;
2140           $char = 'a';
2141           $x = 'aaaaabb';
2142           $x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'
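
       A pattern code expression can also build its regexp from something
       matched earlier.  In this sketch (the record format is an assumption
       made up for the example), the digits captured in group 1 decide how
       many characters group 2 must match:

```perl
my $record = "3:abcdef";
# ".{$1}" is built while matching, after group 1 has captured the length.
my ($len, $payload) = $record =~ /^(\d+):((??{ ".{$1}" }))/;
print "length $len, payload '$payload'\n";
# prints: length 3, payload 'abc'
```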
2143
2144       This final example contains both ordinary and pattern code expressions.
2145       It detects if a binary string 1101010010001... has a Fibonacci spacing
2146       0,1,1,2,3,5,...  of the 1's:
2147
2148           $s0 = 0; $s1 = 1; # initial conditions
2149           $x = "1101010010001000001";
2150           print "It is a Fibonacci sequence\n"
2151               if $x =~ /^1         # match an initial '1'
2152                           (
2153                              (??{'0' x $s0}) # match $s0 of '0'
2154                              1               # and then a '1'
2155                              (?{
2156                                 $largest = $s0;   # largest seq so far
2157                                 $s2 = $s1 + $s0;  # compute next term
2158                                 $s0 = $s1;        # in Fibonacci sequence
2159                                 $s1 = $s2;
2160                                })
2161                           )+   # repeat as needed
2162                         $      # that is all there is
2163                        /x;
2164           print "Largest sequence matched was $largest\n";
2165
2166       This prints
2167
2168           It is a Fibonacci sequence
2169           Largest sequence matched was 5
2170
2171       Ha! Try that with your garden variety regexp package...
2172
2173       Note that the variables $s0 and $s1 are not substituted when the regexp
2174       is compiled, as happens for ordinary variables outside a code expres‐
2175       sion.  Rather, the code expressions are evaluated when perl encounters
2176       them during the search for a match.
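
       The late evaluation can be observed directly.  In this sketch, with
       illustrative names, the package variable "$n" is changed after the
       regexp has been compiled, and the match still sees the new value:

```perl
our $n = 2;
my $re = qr/^a(??{ "b{$n}" })$/;   # the code is stored here, not yet run
$n = 3;                            # changed after compilation
my $two   = "abb"  =~ $re ? 1 : 0; # the code runs now and asks for three 'b's
my $three = "abbb" =~ $re ? 1 : 0;
print "abb: $two, abbb: $three\n"; # prints: abb: 0, abbb: 1
```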
2177
2178       The regexp without the "//x" modifier is
2179
           /^1((??{'0'x$s0})1(?{$largest=$s0;$s2=$s1+$s0;$s0=$s1;$s1=$s2;}))+$/;
2181
2182       and is a great start on an Obfuscated Perl entry :-) When working with
2183       code and conditional expressions, the extended form of regexps is
2184       almost necessary in creating and debugging regexps.
2185
2186       Pragmas and debugging
2187
       Speaking of debugging, there are several pragmas available to control
       and debug regexps in Perl.  We have already encountered one pragma in
       the previous section, "use re 'eval';", that allows variable
       interpolation and code expressions to coexist in a regexp.  The other
       pragmas are
2193
           use re 'taint';
           $tainted = <>;
           @parts = ($tainted =~ /(\w+)\s+(\w+)/); # @parts is now tainted
2197
2198       The "taint" pragma causes any substrings from a match with a tainted
2199       variable to be tainted as well.  This is not normally the case, as reg‐
2200       exps are often used to extract the safe bits from a tainted variable.
2201       Use "taint" when you are not extracting safe bits, but are performing
2202       some other processing.  Both "taint" and "eval" pragmas are lexically
2203       scoped, which means they are in effect only until the end of the block
2204       enclosing the pragmas.
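
       Lexical scoping can be seen with the "eval" pragma.  In this sketch
       (variable names are illustrative) the same interpolated pattern is
       accepted inside the block that enables the pragma and rejected outside
       it:

```perl
my $pat = 'a(?{ 1 })b';     # an ordinary string that holds a code expression

my $inside = do {
    use re 'eval';          # in effect only until the end of this block
    "ab" =~ /$pat/ ? "matched" : "no match";
};

# Outside the block the pragma is off again, so the same match dies at run time.
my $outside = eval { "ab" =~ /$pat/ ? "matched" : "no match" };
my $error   = $@;
print "inside: $inside\n";
print "outside: ", (defined $outside ? $outside : "died: $error"), "\n";
```

       Keeping "use re 'eval'" inside the smallest block that needs it limits
       how much code runs with the security check switched off.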
2205
2206           use re 'debug';
2207           /^(.*)$/s;       # output debugging info
2208
2209           use re 'debugcolor';
2210           /^(.*)$/s;       # output debugging info in living color
2211
2212       The global "debug" and "debugcolor" pragmas allow one to get detailed
2213       debugging info about regexp compilation and execution.  "debugcolor" is
2214       the same as debug, except the debugging information is displayed in
2215       color on terminals that can display termcap color sequences.  Here is
2216       example output:
2217
           % perl -e 'use re "debug"; "abc" =~ /a*b+c/;'
           Compiling REx `a*b+c'
           size 9 first at 1
              1: STAR(4)
              2:   EXACT <a>(0)
              4: PLUS(7)
              5:   EXACT <b>(0)
              7: EXACT <c>(9)
              9: END(0)
           floating `bc' at 0..2147483647 (checking floating) minlen 2
           Guessing start of match, REx `a*b+c' against `abc'...
           Found floating substr `bc' at offset 1...
           Guessed: match at offset 0
           Matching REx `a*b+c' against `abc'
             Setting an EVAL scope, savestack=3
              0 <> <abc>             |  1:  STAR
                                      EXACT <a> can match 1 times out of 32767...
             Setting an EVAL scope, savestack=3
              1 <a> <bc>             |  4:    PLUS
                                      EXACT <b> can match 1 times out of 32767...
             Setting an EVAL scope, savestack=3
              2 <ab> <c>             |  7:      EXACT <c>
              3 <abc> <>             |  9:      END
           Match successful!
           Freeing REx: `a*b+c'
2243
2244       If you have gotten this far into the tutorial, you can probably guess
2245       what the different parts of the debugging output tell you.  The first
2246       part
2247
2248           Compiling REx `a*b+c'
2249           size 9 first at 1
2250              1: STAR(4)
2251              2:   EXACT <a>(0)
2252              4: PLUS(7)
2253              5:   EXACT <b>(0)
2254              7: EXACT <c>(9)
2255              9: END(0)
2256
2257       describes the compilation stage.  STAR(4) means that there is a starred
2258       object, in this case 'a', and if it matches, goto line 4, i.e.,
2259       PLUS(7).  The middle lines describe some heuristics and optimizations
2260       performed before a match:
2261
2262           floating `bc' at 0..2147483647 (checking floating) minlen 2
2263           Guessing start of match, REx `a*b+c' against `abc'...
2264           Found floating substr `bc' at offset 1...
2265           Guessed: match at offset 0
2266
2267       Then the match is executed and the remaining lines describe the
2268       process:
2269
           Matching REx `a*b+c' against `abc'
             Setting an EVAL scope, savestack=3
              0 <> <abc>             |  1:  STAR
                                      EXACT <a> can match 1 times out of 32767...
             Setting an EVAL scope, savestack=3
              1 <a> <bc>             |  4:    PLUS
                                      EXACT <b> can match 1 times out of 32767...
             Setting an EVAL scope, savestack=3
              2 <ab> <c>             |  7:      EXACT <c>
              3 <abc> <>             |  9:      END
           Match successful!
           Freeing REx: `a*b+c'
2282
       Each step is of the form "n <x> <y>", with "<x>" the part of the
       string matched and "<y>" the part not yet matched.  The "|  1:  STAR"
       says that perl is at line number 1 in the compilation list above.  See
       "Debugging regular expressions" in perldebguts for much more detail.
2287
2288       An alternative method of debugging regexps is to embed "print" state‐
2289       ments within the regexp.  This provides a blow-by-blow account of the
2290       backtracking in an alternation:
2291
           "that this" =~ m@(?{print "Start at position ", pos, "\n";})
                            t(?{print "t1\n";})
                            h(?{print "h1\n";})
                            i(?{print "i1\n";})
                            s(?{print "s1\n";})
                                |
                            t(?{print "t2\n";})
                            h(?{print "h2\n";})
                            a(?{print "a2\n";})
                            t(?{print "t2\n";})
                            (?{print "Done at position ", pos, "\n";})
                           @x;
2304
2305       prints
2306
2307           Start at position 0
2308           t1
2309           h1
2310           t2
2311           h2
2312           a2
2313           t2
2314           Done at position 4
2315

BUGS

2317       Code expressions, conditional expressions, and independent expressions
2318       are experimental.  Don't use them in production code.  Yet.
2319

AUTHOR AND COPYRIGHT

2333       Copyright (c) 2000 Mark Kvale All rights reserved.
2334
2335       This document may be distributed under the same terms as Perl itself.
2336
2337       Acknowledgments
2338
2339       The inspiration for the stop codon DNA example came from the ZIP code
2340       example in chapter 7 of Mastering Regular Expressions.
2341
2342       The author would like to thank Jeff Pinyan, Andrew Johnson, Peter
2343       Haworth, Ronald J Kimball, and Joe Smith for all their helpful com‐
2344       ments.
2345
2346
2347
perl v5.8.8                       2006-01-07                      PERLRETUT(1)
