PERLRETUT(1)           Perl Programmers Reference Guide           PERLRETUT(1)

NAME

       perlretut - Perl regular expressions tutorial

DESCRIPTION

       This page provides a basic tutorial on understanding, creating and
       using regular expressions in Perl.  It serves as a complement to the
       reference page on regular expressions, perlre.  Regular expressions are
       an integral part of the "m//", "s///", "qr//" and "split" operators and
       so this tutorial also overlaps with "Regexp Quote-Like Operators" in
       perlop and "split" in perlfunc.

       Perl is widely renowned for excellence in text processing, and regular
       expressions are one of the big factors behind this fame.  Perl regular
       expressions display an efficiency and flexibility unknown in most other
       computer languages.  Mastering even the basics of regular expressions
       will allow you to manipulate text with surprising ease.

       What is a regular expression?  At its most basic, a regular expression
       is a template that is used to determine if a string has certain
       characteristics.  The string is most often some text, such as a line,
       sentence, web page, or even a whole book, but less commonly it could be
       some binary data as well.  Suppose we want to determine if the text in
       the variable $var contains the sequence of characters "m u s h r o o m"
       (blanks added for legibility).  We can write in Perl

        $var =~ m/mushroom/

       The value of this expression will be TRUE if $var contains that
       sequence of characters, and FALSE otherwise.  The portion enclosed in
       '/' characters denotes the characteristic we are looking for.  We use
       the term pattern for it.  The process of looking to see if the pattern
       occurs in the string is called matching, and the "=~" operator along
       with the "m//" tells Perl to try to match the pattern against the
       string.  Note that the pattern is also a string, but a very special
       kind of one, as we will see.  Patterns are in common use these days;
       examples are the patterns typed into a search engine to find web pages
       and the patterns used to list files in a directory, e.g., "ls *.txt"
       or "dir *.*".  In Perl, the patterns described by regular expressions
       are used not only to search strings, but also to extract desired parts
       of strings, and to do search and replace operations.

       Regular expressions have the undeserved reputation of being abstract
       and difficult to understand.  This stems simply from the fact that the
       notation used to express them tends to be terse and dense, and not
       from any inherent complexity.  We recommend using the "/x" regular
       expression modifier (described below) along with plenty of white space
       to make them less dense, and easier to read.  Regular expressions are
       constructed using simple concepts like conditionals and loops and are
       no more difficult to understand than the corresponding "if"
       conditionals and "while" loops in the Perl language itself.

       This tutorial flattens the learning curve by discussing regular
       expression concepts, along with their notation, one at a time and with
       many examples.  The first part of the tutorial will progress from the
       simplest word searches to the basic regular expression concepts.  If
       you master the first part, you will have all the tools needed to solve
       about 98% of your needs.  The second part of the tutorial is for those
       comfortable with the basics and hungry for more power tools.  It
       discusses the more advanced regular expression operators and introduces
       the latest cutting-edge innovations.

       A note: to save time, "regular expression" is often abbreviated as
       regexp or regex.  Regexp is a more natural abbreviation than regex, but
       is harder to pronounce.  The Perl pod documentation is evenly split on
       regexp vs regex; in Perl, there is more than one way to abbreviate it.
       We'll use regexp in this tutorial.

       New in v5.22, "use re 'strict'" applies stricter rules than otherwise
       when compiling regular expression patterns.  It can find things that,
       while legal, may not be what you intended.

Part 1: The basics

   Simple word matching
       The simplest regexp is simply a word, or more generally, a string of
       characters.  A regexp consisting of just a word matches any string that
       contains that word:

           "Hello World" =~ /World/;  # matches

       What is this Perl statement all about? "Hello World" is a simple
       double-quoted string.  "World" is the regular expression and the "//"
       enclosing "/World/" tells Perl to search a string for a match.  The
       operator "=~" associates the string with the regexp match and produces
       a true value if the regexp matched, or false if the regexp did not
       match.  In our case, "World" matches the second word in "Hello World",
       so the expression is true.  Expressions like this are useful in
       conditionals:

           if ("Hello World" =~ /World/) {
               print "It matches\n";
           }
           else {
               print "It doesn't match\n";
           }

       There are useful variations on this theme.  The sense of the match can
       be reversed by using the "!~" operator:

           if ("Hello World" !~ /World/) {
               print "It doesn't match\n";
           }
           else {
               print "It matches\n";
           }

       The literal string in the regexp can be replaced by a variable:

           my $greeting = "World";
           if ("Hello World" =~ /$greeting/) {
               print "It matches\n";
           }
           else {
               print "It doesn't match\n";
           }

       If you're matching against the special default variable $_, the "$_ =~"
       part can be omitted:

           $_ = "Hello World";
           if (/World/) {
               print "It matches\n";
           }
           else {
               print "It doesn't match\n";
           }

       And finally, the "//" default delimiters for a match can be changed to
       arbitrary delimiters by putting an 'm' out front:

           "Hello World" =~ m!World!;   # matches, delimited by '!'
           "Hello World" =~ m{World};   # matches, note the matching '{}'
           "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                        # '/' becomes an ordinary char

       "/World/", "m!World!", and "m{World}" all represent the same thing.
       When, e.g., the quote ('"') is used as a delimiter, the forward slash
       '/' becomes an ordinary character and can be used in this regexp
       without trouble.

       Let's consider how different regexps would match "Hello World":

           "Hello World" =~ /world/;  # doesn't match
           "Hello World" =~ /o W/;    # matches
           "Hello World" =~ /oW/;     # doesn't match
           "Hello World" =~ /World /; # doesn't match

       The first regexp "world" doesn't match because regexps are
       case-sensitive.  The second regexp matches because the substring 'o W'
       occurs in the string "Hello World".  The space character ' ' is treated
       like any other character in a regexp and is needed to match in this
       case.  The lack of a space character is the reason the third regexp
       'oW' doesn't match.  The fourth regexp "World " doesn't match because
       there is a space at the end of the regexp, but not at the end of the
       string.  The lesson here is that regexps must match a part of the
       string exactly in order for the statement to be true.

       If a regexp matches in more than one place in the string, Perl will
       always match at the earliest possible point in the string:

           "Hello World" =~ /o/;       # matches 'o' in 'Hello'
           "That hat is red" =~ /hat/; # matches 'hat' in 'That'

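       As a rough illustration of the earliest-match rule, the sketch below
       records where each match begins.  It relies on the built-in array
       "@-", whose first element holds the start offset of the last
       successful match; that array has not been introduced at this point in
       the tutorial, and the variable names are made up for the example.

```perl
use strict;
use warnings;

# Each case pairs a string with a precompiled regexp (see "qr//").
my @cases = (
    [ "Hello World",     qr/o/ ],
    [ "That hat is red", qr/hat/ ],
);

my @start;
for my $case (@cases) {
    my ( $string, $regexp ) = @$case;
    if ( $string =~ $regexp ) {
        my $offset = $-[0];    # offset where the successful match began
        push @start, $offset;
        print "'$string' matched at offset $offset\n";
    }
}
```

       The offsets point at the 'o' in 'Hello' and the 'hat' inside 'That',
       not at the later occurrences.
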
       With respect to character matching, there are a few more points you
       need to know about.  First of all, not all characters can be used "as
       is" in a match.  Some characters, called metacharacters, are reserved
       for use in regexp notation.  The metacharacters are

           {}[]()^$.|*+?-\

       The significance of each of these will be explained in the rest of the
       tutorial, but for now, it is important only to know that a
       metacharacter can be matched as-is by putting a backslash before it:

           "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
           "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
           "The interval is [0,1)." =~ /[0,1)./     # is a syntax error!
           "The interval is [0,1)." =~ /\[0,1\)\./  # matches
           "#!/usr/bin/perl" =~ /#!\/usr\/bin\/perl/;  # matches

       In the last regexp, the forward slash '/' is also backslashed, because
       it is used to delimit the regexp.  This can lead to LTS (leaning
       toothpick syndrome), however, and it is often more readable to change
       delimiters.

           "#!/usr/bin/perl" =~ m!#\!/usr/bin/perl!;  # easier to read

       The backslash character '\' is a metacharacter itself and needs to be
       backslashed:

           'C:\WIN32' =~ /C:\\WIN/;   # matches

       In situations where it doesn't make sense for a particular
       metacharacter to mean what it normally does, it automatically loses its
       metacharacter-ness and becomes an ordinary character that is to be
       matched literally.  For example, the '}' is a metacharacter only when
       it is the mate of a '{' metacharacter.  Otherwise it is treated as a
       literal RIGHT CURLY BRACKET.  This may lead to unexpected results.
       "use re 'strict'" can catch some of these.

       In addition to the metacharacters, there are some ASCII characters
       which don't have printable character equivalents and are instead
       represented by escape sequences.  Common examples are "\t" for a tab,
       "\n" for a newline, "\r" for a carriage return and "\a" for a bell (or
       alert).  If your string is better thought of as a sequence of arbitrary
       bytes, the octal escape sequence, e.g., "\033", or hexadecimal escape
       sequence, e.g., "\x1B" may be a more natural representation for your
       bytes.  Here are some examples of escapes:

           "1000\t2000" =~ m(0\t2)   # matches
           "1000\n2000" =~ /0\n20/   # matches
           "1000\t2000" =~ /\000\t2/ # doesn't match, "0" ne "\000"
           "cat"   =~ /\o{143}\x61\x74/ # matches in ASCII, but a weird way
                                        # to spell cat

       If you've been around Perl a while, all this talk of escape sequences
       may seem familiar.  Similar escape sequences are used in double-quoted
       strings and in fact the regexps in Perl are mostly treated as
       double-quoted strings.  This means that variables can be used in
       regexps as well.  Just like double-quoted strings, the values of the
       variables in the regexp will be substituted in before the regexp is
       evaluated for matching purposes.  So we have:

           $foo = 'house';
           'housecat' =~ /$foo/;      # matches
           'cathouse' =~ /cat$foo/;   # matches
           'housecat' =~ /${foo}cat/; # matches

       So far, so good.  With the knowledge above you can already perform
       searches with just about any literal string regexp you can dream up.
       Here is a very simple emulation of the Unix grep program:

           % cat > simple_grep
           #!/usr/bin/perl
           $regexp = shift;
           while (<>) {
               print if /$regexp/;
           }
           ^D

           % chmod +x simple_grep

           % simple_grep abba /usr/dict/words
           Babbage
           cabbage
           cabbages
           sabbath
           Sabbathize
           Sabbathizes
           sabbatical
           scabbard
           scabbards

       This program is easy to understand.  "#!/usr/bin/perl" is the standard
       way to invoke a perl program from the shell.  "$regexp = shift;" saves
       the first command line argument as the regexp to be used, leaving the
       rest of the command line arguments to be treated as files.
       "while (<>)" loops over all the lines in all the files.  For each line,
       "print if /$regexp/;" prints the line if the regexp matches the line.
       In this line, both "print" and "/$regexp/" use the default variable $_
       implicitly.

       With all of the regexps above, if the regexp matched anywhere in the
       string, it was considered a match.  Sometimes, however, we'd like to
       specify where in the string the regexp should try to match.  To do
       this, we would use the anchor metacharacters '^' and '$'.  The anchor
       '^' means match at the beginning of the string and the anchor '$' means
       match at the end of the string, or before a newline at the end of the
       string.  Here is how they are used:

           "housekeeper" =~ /keeper/;    # matches
           "housekeeper" =~ /^keeper/;   # doesn't match
           "housekeeper" =~ /keeper$/;   # matches
           "housekeeper\n" =~ /keeper$/; # matches

       The second regexp doesn't match because '^' constrains "keeper" to
       match only at the beginning of the string, but "housekeeper" has keeper
       starting in the middle.  The third regexp does match, since the '$'
       constrains "keeper" to match only at the end of the string.

       When both '^' and '$' are used at the same time, the regexp has to
       match both the beginning and the end of the string, i.e., the regexp
       matches the whole string.  Consider

           "keeper" =~ /^keep$/;      # doesn't match
           "keeper" =~ /^keeper$/;    # matches
           ""       =~ /^$/;          # ^$ matches an empty string

       The first regexp doesn't match because the string has more to it than
       "keep".  Since the second regexp is exactly the string, it matches.
       Using both '^' and '$' in a regexp forces the complete string to match,
       so it gives you complete control over which strings match and which
       don't.  Suppose you are looking for a fellow named bert, off in a
       string by himself:

           "dogbert" =~ /bert/;   # matches, but not what you want

           "dilbert" =~ /^bert/;  # doesn't match, but ..
           "bertram" =~ /^bert/;  # matches, so still not good enough

           "bertram" =~ /^bert$/; # doesn't match, good
           "dilbert" =~ /^bert$/; # doesn't match, good
           "bert"    =~ /^bert$/; # matches, perfect

       Of course, in the case of a literal string, one could just as easily
       use the string comparison "$string eq 'bert'" and it would be more
       efficient.  The "^...$" regexp really becomes useful when we add in
       the more powerful regexp tools below.

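       One difference between the two approaches is worth seeing in practice:
       because '$' also matches before a newline at the end of the string,
       "/^bert$/" and "eq 'bert'" are not quite interchangeable.  The sketch
       below is only an illustration; the labels and variable names are
       invented for the example.

```perl
use strict;
use warnings;

# Each case pairs a printable label with the string to test.
my @cases = (
    [ 'bert',    "bert" ],
    [ 'bert\n',  "bert\n" ],
    [ 'bertram', "bertram" ],
    [ 'dilbert', "dilbert" ],
);

my %verdict;
for my $case (@cases) {
    my ( $label, $string ) = @$case;
    my $by_regexp = $string =~ /^bert$/ ? "yes" : "no";
    my $by_eq     = $string eq 'bert'   ? "yes" : "no";
    $verdict{$label} = "$by_regexp/$by_eq";
    print "$label: regexp $by_regexp, eq $by_eq\n";
}
```

       Only the string with the trailing newline is treated differently: the
       anchored regexp accepts it, while the string comparison does not.
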
   Using character classes
       Although one can already do quite a lot with the literal string regexps
       above, we've only scratched the surface of regular expression
       technology.  In this and subsequent sections we will introduce regexp
       concepts (and associated metacharacter notations) that will allow a
       regexp to represent not just a single character sequence, but a whole
       class of them.

       One such concept is that of a character class.  A character class
       allows a set of possible characters, rather than just a single
       character, to match at a particular point in a regexp.  You can define
       your own custom character classes.  These are denoted by brackets
       "[...]", with the set of characters to be possibly matched inside.
       Here are some examples:

           /cat/;       # matches 'cat'
           /[bcr]at/;   # matches 'bat', 'cat', or 'rat'
           /item[0123456789]/;  # matches 'item0' or ... or 'item9'
           "abc" =~ /[cab]/;    # matches 'a'

       In the last statement, even though 'c' is the first character in the
       class, 'a' matches because the first character position in the string
       is the earliest point at which the regexp can match.

           /[yY][eE][sS]/;      # match 'yes' in a case-insensitive way
                                # 'yes', 'Yes', 'YES', etc.

       This regexp displays a common task: perform a case-insensitive match.
       Perl provides a way of avoiding all those brackets by simply appending
       an 'i' to the end of the match.  Then "/[yY][eE][sS]/;" can be
       rewritten as "/yes/i;".  The 'i' stands for case-insensitive and is an
       example of a modifier of the matching operation.  We will meet other
       modifiers later in the tutorial.

       We saw in the section above that there were ordinary characters, which
       represented themselves, and special characters, which needed a
       backslash '\' to represent themselves.  The same is true in a character
       class, but the sets of ordinary and special characters inside a
       character class are different than those outside a character class.
       The special characters for a character class are "-]\^$" (and the
       pattern delimiter, whatever it is).  ']' is special because it denotes
       the end of a character class.  '$' is special because it denotes a
       scalar variable.  '\' is special because it is used in escape
       sequences, just like above.  Here is how the special characters "]$\"
       are handled:

          /[\]c]def/; # matches ']def' or 'cdef'
          $x = 'bcr';
          /[$x]at/;   # matches 'bat', 'cat', or 'rat'
          /[\$x]at/;  # matches '$at' or 'xat'
          /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

       The last two are a little tricky.  In "[\$x]", the backslash protects
       the dollar sign, so the character class has two members '$' and 'x'.
       In "[\\$x]", the backslash is protected, so $x is treated as a variable
       and substituted in double quote fashion.

       The special character '-' acts as a range operator within character
       classes, so that a contiguous set of characters can be written as a
       range.  With ranges, the unwieldy "[0123456789]" and "[abc...xyz]"
       become the svelte "[0-9]" and "[a-z]".  Some examples are

           /item[0-9]/;  # matches 'item0' or ... or 'item9'
           /[0-9bx-z]aa/;  # matches '0aa', ..., '9aa',
                           # 'baa', 'xaa', 'yaa', or 'zaa'
           /[0-9a-fA-F]/;  # matches a hexadecimal digit
           /[0-9a-zA-Z_]/; # matches a "word" character,
                           # like those in a Perl variable name

       If '-' is the first or last character in a character class, it is
       treated as an ordinary character; "[-ab]", "[ab-]" and "[a\-b]" are all
       equivalent.

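       The following sketch puts ranges and the literal '-' to work.  The
       test strings and variable names are invented for the example, and the
       '^' and '$' anchors are added so that the whole string must consist of
       exactly two hexadecimal digits.

```perl
use strict;
use warnings;

# Keep only the strings that consist of exactly two hexadecimal digits.
my @hex_pairs;
for my $candidate ( "1B", "ff", "g0", "7", "abc" ) {
    push @hex_pairs, $candidate
        if $candidate =~ /^[0-9a-fA-F][0-9a-fA-F]$/;
}
print "hex pairs: @hex_pairs\n";

# Between two characters '-' forms a range; placed first, it is literal.
my $in_range   = "-" =~ /[a-b]/ ? "yes" : "no";
my $in_literal = "-" =~ /[-ab]/ ? "yes" : "no";
print "range class matches a hyphen:   $in_range\n";
print "literal class matches a hyphen: $in_literal\n";
```
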
       The special character '^' in the first position of a character class
       denotes a negated character class, which matches any character but
       those in the brackets.  Both "[...]" and "[^...]" must match a
       character, or the match fails.  Then

           /[^a]at/;  # doesn't match 'aat' or 'at', but matches
                      # all other 'bat', 'cat', '0at', '%at', etc.
           /[^0-9]/;  # matches a non-numeric character
           /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

       Now, even "[0-9]" can be a bother to write multiple times, so in the
       interest of saving keystrokes and making regexps more readable, Perl
       has several abbreviations for common character classes, as shown below.
       Since the introduction of Unicode, unless the "/a" modifier is in
       effect, these character classes match more than just a few characters
       in the ASCII range.

       ·   "\d" matches a digit, not just "[0-9]" but also digits from
           non-roman scripts

       ·   "\s" matches a whitespace character, the set "[\ \t\r\n\f]" and
           others

       ·   "\w" matches a word character (alphanumeric or '_'), not just
           "[0-9a-zA-Z_]" but also digits and characters from non-roman
           scripts

       ·   "\D" is a negated "\d"; it represents any other character than a
           digit, or "[^\d]"

       ·   "\S" is a negated "\s"; it represents any non-whitespace character
           "[^\s]"

       ·   "\W" is a negated "\w"; it represents any non-word character
           "[^\w]"

       ·   The period '.' matches any character but "\n" (unless the modifier
           "/s" is in effect, as explained below).

       ·   "\N", like the period, matches any character but "\n", but it does
           so regardless of whether the modifier "/s" is in effect.

       The "/a" modifier, available starting in Perl 5.14, is used to
       restrict the matches of "\d", "\s", and "\w" to just those in the ASCII
       range.  It is useful to keep your program from being needlessly exposed
       to full Unicode (and its accompanying security considerations) when all
       you want is to process English-like text.  (The "a" may be doubled,
       "/aa", to provide even more restrictions, preventing case-insensitive
       matching of ASCII with non-ASCII characters; otherwise a Unicode
       "Kelvin Sign" would caselessly match a "k" or "K".)

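       A small sketch of the difference these modifiers make, assuming Perl
       5.14 or later.  The two test characters, ARABIC-INDIC DIGIT THREE
       (U+0663) and KELVIN SIGN (U+212A), and the variable names are chosen
       only for illustration.

```perl
use strict;
use warnings;

# The /a and /aa modifiers need Perl 5.14 or later.
my $arabic_three = "\x{0663}";    # ARABIC-INDIC DIGIT THREE
my $kelvin       = "\x{212A}";    # KELVIN SIGN

my $digit_default = $arabic_three =~ /\d/  ? 1 : 0;   # Unicode rules
my $digit_ascii   = $arabic_three =~ /\d/a ? 1 : 0;   # ASCII digits only
my $kelvin_i      = $kelvin =~ /k/i   ? 1 : 0;         # caseless, Unicode
my $kelvin_aai    = $kelvin =~ /k/aai ? 1 : 0;         # no ASCII/non-ASCII mix

print "U+0663 is a digit by default:  $digit_default\n";
print "U+0663 is a digit under /a:    $digit_ascii\n";
print "U+212A matches k under /i:     $kelvin_i\n";
print "U+212A matches k under /aai:   $kelvin_aai\n";
```
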
       The "\d\s\w\D\S\W" abbreviations can be used both inside and outside of
       bracketed character classes.  Here are some in use:

           /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
           /[\d\s]/;         # matches any digit or whitespace character
           /\w\W\w/;         # matches a word char, followed by a
                             # non-word char, followed by a word char
           /..rt/;           # matches any two chars, followed by 'rt'
           /end\./;          # matches 'end.'
           /end[.]/;         # same thing, matches 'end.'

       Because a period is a metacharacter, it needs to be escaped to match as
       an ordinary period. Because, for example, "\d" and "\w" are sets of
       characters, it is incorrect to think of "[^\d\w]" as "[\D\W]"; in fact
       "[^\d\w]" is the same as "[^\w]", which is the same as "[\W]". Think
       De Morgan's laws.

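       The distinction can be checked directly.  In this sketch (the sample
       characters and variable names are arbitrary), "[^\d\w]" requires a
       character that is neither a digit nor a word character, while
       "[\D\W]" accepts anything that is a non-digit or a non-word character,
       which includes every letter.

```perl
use strict;
use warnings;

my %class;
for my $char ( "a", "5", "_", " ", "%" ) {
    my $neither    = $char =~ /[^\d\w]/ ? 1 : 0;   # not digit AND not word
    my $non_word   = $char =~ /[\W]/    ? 1 : 0;   # same set as above
    my $either_not = $char =~ /[\D\W]/  ? 1 : 0;   # not digit OR not word
    $class{$char} = "$neither$non_word$either_not";
    printf "'%s': neither=%s non_word=%s either_not=%s\n",
        $char, $neither, $non_word, $either_not;
}
```

       The first two columns always agree; the third differs for 'a' and
       '_', which are word characters but not digits.
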
       In actuality, the period and "\d\s\w\D\S\W" abbreviations are
       themselves types of character classes, so the ones surrounded by
       brackets are just one type of character class.  When we need to make a
       distinction, we refer to them as "bracketed character classes."

       An anchor useful in basic regexps is the word anchor "\b".  This
       matches a boundary between a word character and a non-word character
       "\w\W" or "\W\w":

           $x = "Housecat catenates house and cat";
           $x =~ /cat/;    # matches cat in 'Housecat'
           $x =~ /\bcat/;  # matches cat in 'catenates'
           $x =~ /cat\b/;  # matches cat in 'Housecat'
           $x =~ /\bcat\b/;  # matches 'cat' at end of string

       Note in the last example, the end of the string is considered a word
       boundary.

       For natural language processing (so that, for example, apostrophes are
       included in words), use "\b{wb}" instead:

           "don't" =~ / .+? \b{wb} /x;  # matches the whole string

       You might wonder why '.' matches everything but "\n" - why not every
       character? The reason is that often one is matching against lines and
       would like to ignore the newline characters.  For instance, while the
       string "\n" represents one line, we would like to think of it as empty.
       Then

           ""   =~ /^$/;    # matches
           "\n" =~ /^$/;    # matches, $ anchors before "\n"

           ""   =~ /./;      # doesn't match; it needs a char
           ""   =~ /^.$/;    # doesn't match; it needs a char
           "\n" =~ /^.$/;    # doesn't match; it needs a char other than "\n"
           "a"  =~ /^.$/;    # matches
           "a\n"  =~ /^.$/;  # matches, $ anchors before "\n"

       This behavior is convenient, because we usually want to ignore newlines
       when we count and match characters in a line.  Sometimes, however, we
       want to keep track of newlines.  We might even want '^' and '$' to
       anchor at the beginning and end of lines within the string, rather than
       just the beginning and end of the string.  Perl allows us to choose
       between ignoring and paying attention to newlines by using the "/s" and
       "/m" modifiers.  "/s" and "/m" stand for single line and multi-line and
       they determine whether a string is to be treated as one continuous
       string, or as a set of lines.  The two modifiers affect two aspects of
       how the regexp is interpreted: 1) how the '.' character class is
       defined, and 2) where the anchors '^' and '$' are able to match.  Here
       are the four possible combinations:

       ·   no modifiers: Default behavior.  '.' matches any character except
           "\n".  '^' matches only at the beginning of the string and '$'
           matches only at the end or before a newline at the end.

       ·   s modifier ("/s"): Treat string as a single long line.  '.' matches
           any character, even "\n".  '^' matches only at the beginning of the
           string and '$' matches only at the end or before a newline at the
           end.

       ·   m modifier ("/m"): Treat string as a set of multiple lines.  '.'
           matches any character except "\n".  '^' and '$' are able to match
           at the start or end of any line within the string.

       ·   both s and m modifiers ("/sm"): Treat string as a single long line,
           but detect multiple lines.  '.' matches any character, even "\n".
           '^' and '$', however, are able to match at the start or end of any
           line within the string.

       Here are examples of "/s" and "/m" in action:

           $x = "There once was a girl\nWho programmed in Perl\n";

           $x =~ /^Who/;   # doesn't match, "Who" not at start of string
           $x =~ /^Who/s;  # doesn't match, "Who" not at start of string
           $x =~ /^Who/m;  # matches, "Who" at start of second line
           $x =~ /^Who/sm; # matches, "Who" at start of second line

           $x =~ /girl.Who/;   # doesn't match, "." doesn't match "\n"
           $x =~ /girl.Who/s;  # matches, "." matches "\n"
           $x =~ /girl.Who/m;  # doesn't match, "." doesn't match "\n"
           $x =~ /girl.Who/sm; # matches, "." matches "\n"

       Most of the time, the default behavior is what is wanted, but "/s" and
       "/m" are occasionally very useful.  If "/m" is being used, the start of
       the string can still be matched with "\A" and the end of the string can
       still be matched with the anchors "\Z" (matches both the end and the
       newline before, like '$'), and "\z" (matches only the end):

           $x =~ /^Who/m;   # matches, "Who" at start of second line
           $x =~ /\AWho/m;  # doesn't match, "Who" is not at start of string

           $x =~ /girl$/m;  # matches, "girl" at end of first line
           $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string

           $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end
           $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string

       We now know how to create choices among classes of characters in a
       regexp.  What about choices among words or character strings? Such
       choices are described in the next section.

   Matching this or that
       Sometimes we would like our regexp to be able to match different
       possible words or character strings.  This is accomplished by using the
       alternation metacharacter '|'.  To match "dog" or "cat", we form the
       regexp "dog|cat".  As before, Perl will try to match the regexp at the
       earliest possible point in the string.  At each character position,
       Perl will first try to match the first alternative, "dog".  If "dog"
       doesn't match, Perl will then try the next alternative, "cat".  If
       "cat" doesn't match either, then the match fails and Perl moves to the
       next position in the string.  Some examples:

           "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
           "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

       Even though "dog" is the first alternative in the second regexp, "cat"
       is able to match earlier in the string.

           "cats"          =~ /c|ca|cat|cats/; # matches "c"
           "cats"          =~ /cats|cat|ca|c/; # matches "cats"

       Here, all the alternatives match at the first string position, so the
       first alternative is the one that matches.  If some of the alternatives
       are truncations of the others, put the longest ones first to give them
       a chance to match.

           "cab" =~ /a|b|c/; # matches "c"
                             # /a|b|c/ == /[abc]/

       The last example points out that character classes are like
       alternations of characters.  At a given character position, the first
       alternative that allows the regexp match to succeed will be the one
       that matches.

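       To see which text an alternation actually picked, one can inspect the
       special variable "$&", which holds the text of the last successful
       match.  That variable has not been introduced at this point in the
       tutorial, so treat the following as a sketch; the variable names are
       invented for the example.

```perl
use strict;
use warnings;

# Same alternatives, opposite order: the first one that succeeds wins.
my @matched;
for my $regexp ( qr/c|ca|cat|cats/, qr/cats|cat|ca|c/ ) {
    # $& holds the text matched by the last successful match.
    push @matched, $& if "cats" =~ $regexp;
}
print "matched: @matched\n";

# Position still comes first: "cat" wins because it occurs earlier.
my $earliest = "cats and dogs" =~ /dog|cat|bird/ ? $& : "";
print "earliest: $earliest\n";
```
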
589   Grouping things and hierarchical matching
590       Alternation allows a regexp to choose among alternatives, but by itself
591       it is unsatisfying.  The reason is that each alternative is a whole
592       regexp, but sometime we want alternatives for just part of a regexp.
593       For instance, suppose we want to search for housecats or housekeepers.
594       The regexp "housecat|housekeeper" fits the bill, but is inefficient
595       because we had to type "house" twice.  It would be nice to have parts
596       of the regexp be constant, like "house", and some parts have
597       alternatives, like "cat|keeper".
598
599       The grouping metacharacters "()" solve this problem.  Grouping allows
600       parts of a regexp to be treated as a single unit.  Parts of a regexp
601       are grouped by enclosing them in parentheses.  Thus we could solve the
602       "housecat|housekeeper" by forming the regexp as "house(cat|keeper)".
603       The regexp "house(cat|keeper)" means match "house" followed by either
604       "cat" or "keeper".  Some more examples are
605
606           /(a|b)b/;    # matches 'ab' or 'bb'
607           /(ac|b)b/;   # matches 'acb' or 'bb'
608           /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere
609           /(a|[bc])d/; # matches 'ad', 'bd', or 'cd'
610
611           /house(cat|)/;  # matches either 'housecat' or 'house'
612           /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
613                               # 'house'.  Note groups can be nested.
614
615           /(19|20|)\d\d/;  # match years 19xx, 20xx, or the Y2K problem, xx
616           "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
617                                    # because '20\d\d' can't match
618
619       Alternations behave the same way in groups as out of them: at a given
620       string position, the leftmost alternative that allows the regexp to
621       match is taken.  So in the last example at the first string position,
622       "20" matches the second alternative, but there is nothing left over to
623       match the next two digits "\d\d".  So Perl moves on to the next
624       alternative, which is the null alternative and that works, since "20"
625       is two digits.
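
       The following sketch exercises two of the groupings above.  The '^' and
       '$' anchors and the sample words are assumptions added so that each
       test string must match as a whole:

```perl
use strict;
use warnings;

# Grouping limits the alternation to the part after "house".
my @found = grep { /^house(cat|keeper)$/ } qw(housecat housekeeper houseboat);

# The empty (null) alternative makes the century prefix optional.
my ($century_a) = "1999" =~ /^(19|20|)\d\d$/;   # first alternative succeeds
my ($century_b) = "20"   =~ /^(19|20|)\d\d$/;   # falls through to the null one

print "found: @found\n";
print "centuries: '$century_a' and '$century_b'\n";
```

       Here @found holds 'housecat' and 'housekeeper', $century_a is '19', and
       $century_b is the empty string.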
626
       The process of trying one alternative, seeing if it matches, and, if it
       doesn't, going back to the string position where that alternative was
       tried and moving on to the next alternative is called
       backtracking.  The term "backtracking" comes from the idea that
631       matching a regexp is like a walk in the woods.  Successfully matching a
632       regexp is like arriving at a destination.  There are many possible
633       trailheads, one for each string position, and each one is tried in
634       order, left to right.  From each trailhead there may be many paths,
635       some of which get you there, and some which are dead ends.  When you
636       walk along a trail and hit a dead end, you have to backtrack along the
637       trail to an earlier point to try another trail.  If you hit your
638       destination, you stop immediately and forget about trying all the other
639       trails.  You are persistent, and only if you have tried all the trails
640       from all the trailheads and not arrived at your destination, do you
641       declare failure.  To be concrete, here is a step-by-step analysis of
642       what Perl does when it tries to match the regexp
643
644           "abcde" =~ /(abd|abc)(df|d|de)/;
645
646       0. Start with the first letter in the string 'a'.
647
648
649       1. Try the first alternative in the first group 'abd'.
650
651
652       2.  Match 'a' followed by 'b'. So far so good.
653
654
655       3.  'd' in the regexp doesn't match 'c' in the string - a dead end.  So
656       backtrack two characters and pick the second alternative in the first
657       group 'abc'.
658
659
660       4.  Match 'a' followed by 'b' followed by 'c'.  We are on a roll and
661       have satisfied the first group. Set $1 to 'abc'.
662
663
       5.  Move on to the second group and pick the first alternative 'df'.


       6.  Match the 'd'.
668
669
670       7.  'f' in the regexp doesn't match 'e' in the string, so a dead end.
671       Backtrack one character and pick the second alternative in the second
672       group 'd'.
673
674
675       8.  'd' matches. The second grouping is satisfied, so set $2 to 'd'.
676
677
678       9.  We are at the end of the regexp, so we are done! We have matched
679       'abcd' out of the string "abcde".
680
681       There are a couple of things to note about this analysis.  First, the
682       third alternative in the second group 'de' also allows a match, but we
683       stopped before we got to it - at a given character position, leftmost
684       wins.  Second, we were able to get a match at the first character
685       position of the string 'a'.  If there were no matches at the first
686       position, Perl would move to the second character position 'b' and
687       attempt the match all over again.  Only when all possible paths at all
688       possible character positions have been exhausted does Perl give up and
689       declare "$string =~ /(abd|abc)(df|d|de)/;" to be false.
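
       The outcome of the analysis above can be checked with a short script.
       The variable names are illustrative; the whole match is rebuilt from
       the two groups, which works here because they cover the entire match:

```perl
use strict;
use warnings;

my ($first, $second, $whole);
if ("abcde" =~ /(abd|abc)(df|d|de)/) {
    ($first, $second) = ($1, $2);
    $whole = $1 . $2;   # the two groups together make up the whole match here
}

print "\$1 = $first, \$2 = $second, matched = $whole\n";
```

       This prints 'abc' for $1 and 'd' for $2, so the match is 'abcd' even
       though the alternative 'de' would have matched a longer piece.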
690
691       Even with all this work, regexp matching happens remarkably fast.  To
692       speed things up, Perl compiles the regexp into a compact sequence of
693       opcodes that can often fit inside a processor cache.  When the code is
694       executed, these opcodes can then run at full throttle and search very
695       quickly.
696
697   Extracting matches
698       The grouping metacharacters "()" also serve another completely
699       different function: they allow the extraction of the parts of a string
700       that matched.  This is very useful to find out what matched and for
701       text processing in general.  For each grouping, the part that matched
702       inside goes into the special variables $1, $2, etc.  They can be used
703       just as ordinary variables:
704
705           # extract hours, minutes, seconds
706           if ($time =~ /(\d\d):(\d\d):(\d\d)/) {    # match hh:mm:ss format
707               $hours = $1;
708               $minutes = $2;
709               $seconds = $3;
710           }
711
712       Now, we know that in scalar context, "$time =~ /(\d\d):(\d\d):(\d\d)/"
713       returns a true or false value.  In list context, however, it returns
714       the list of matched values "($1,$2,$3)".  So we could write the code
715       more compactly as
716
717           # extract hours, minutes, seconds
           ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
719
720       If the groupings in a regexp are nested, $1 gets the group with the
721       leftmost opening parenthesis, $2 the next opening parenthesis, etc.
722       Here is a regexp with nested groups:
723
724           /(ab(cd|ef)((gi)|j))/;
725            1  2      34
726
727       If this regexp matches, $1 contains a string starting with 'ab', $2 is
728       either set to 'cd' or 'ef', $3 equals either 'gi' or 'j', and $4 is
729       either set to 'gi', just like $3, or it remains undefined.
730
731       For convenience, Perl sets $+ to the string held by the highest
732       numbered $1, $2,... that got assigned (and, somewhat related, $^N to
733       the value of the $1, $2,... most-recently assigned; i.e. the $1, $2,...
734       associated with the rightmost closing parenthesis used in the match).
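
       To see the numbering of nested groups in practice, the regexp above can
       be matched in list context against two made-up strings, one for each
       branch of the third group:

```perl
use strict;
use warnings;

my $re = qr/(ab(cd|ef)((gi)|j))/;

my @with_gi = "abefgi" =~ $re;   # all four groups take part in the match
my @with_j  = "abcdj"  =~ $re;   # group 4 does not take part

# Show undefined captures explicitly instead of as empty strings.
print join(',', map { defined $_ ? $_ : 'undef' } @with_gi), "\n";
print join(',', map { defined $_ ? $_ : 'undef' } @with_j),  "\n";
```

       The first line printed is 'abefgi,ef,gi,gi' and the second is
       'abcdj,cd,j,undef'.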
735
736   Backreferences
737       Closely associated with the matching variables $1, $2, ... are the
738       backreferences "\g1", "\g2",...  Backreferences are simply matching
739       variables that can be used inside a regexp.  This is a really nice
740       feature; what matches later in a regexp is made to depend on what
741       matched earlier in the regexp.  Suppose we wanted to look for doubled
742       words in a text, like "the the".  The following regexp finds all
743       3-letter doubles with a space in between:
744
745           /\b(\w\w\w)\s\g1\b/;
746
747       The grouping assigns a value to "\g1", so that the same 3-letter
748       sequence is used for both parts.
749
750       A similar task is to find words consisting of two identical parts:
751
752           % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\g1$' /usr/dict/words
753           beriberi
754           booboo
755           coco
756           mama
757           murmur
758           papa
759
760       The regexp has a single grouping which considers 4-letter combinations,
761       then 3-letter combinations, etc., and uses "\g1" to look for a repeat.
762       Although $1 and "\g1" represent the same thing, care should be taken to
763       use matched variables $1, $2,... only outside a regexp and
764       backreferences "\g1", "\g2",... only inside a regexp; not doing so may
765       lead to surprising and unsatisfactory results.
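
       As a small sketch, both regexps above can be applied from within a
       script.  The sample text and the short word list stand in for a real
       document and for /usr/dict/words; they are assumptions made for the
       illustration:

```perl
use strict;
use warnings;

my $text = "Paris in the the spring and and so on";   # made-up sample text

# Collect every 3-letter word that is immediately repeated.
my @doubles;
while ($text =~ /\b(\w\w\w)\s\g1\b/g) {
    push @doubles, $1;
}

# Words made of two identical parts, as in the simple_grep example.
my @twins = grep { /^(\w\w\w\w|\w\w\w|\w\w|\w)\g1$/ } qw(mama papa cat murmur);

print "doubles: @doubles\n";
print "twins: @twins\n";
```

       Note that $1 is used outside the regexp and "\g1" inside it, as
       recommended above.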
766
767   Relative backreferences
768       Counting the opening parentheses to get the correct number for a
769       backreference is error-prone as soon as there is more than one
770       capturing group.  A more convenient technique became available with
771       Perl 5.10: relative backreferences. To refer to the immediately
772       preceding capture group one now may write "\g{-1}", the next but last
773       is available via "\g{-2}", and so on.
774
775       Another good reason in addition to readability and maintainability for
776       using relative backreferences is illustrated by the following example,
777       where a simple pattern for matching peculiar strings is used:
778
779           $a99a = '([a-z])(\d)\g2\g1';   # matches a11a, g22g, x33x, etc.
780
781       Now that we have this pattern stored as a handy string, we might feel
782       tempted to use it as a part of some other pattern:
783
784           $line = "code=e99e";
785           if ($line =~ /^(\w+)=$a99a$/){   # unexpected behavior!
786               print "$1 is valid\n";
787           } else {
788               print "bad line: '$line'\n";
789           }
790
791       But this doesn't match, at least not the way one might expect. Only
792       after inserting the interpolated $a99a and looking at the resulting
793       full text of the regexp is it obvious that the backreferences have
794       backfired. The subexpression "(\w+)" has snatched number 1 and demoted
795       the groups in $a99a by one rank. This can be avoided by using relative
796       backreferences:
797
798           $a99a = '([a-z])(\d)\g{-1}\g{-2}';  # safe for being interpolated
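
       The difference can be checked side by side.  In this sketch the
       variable names $absolute and $relative are illustrative; both patterns
       are interpolated after the same leading group:

```perl
use strict;
use warnings;

my $absolute = '([a-z])(\d)\g2\g1';        # numbers shift when embedded
my $relative = '([a-z])(\d)\g{-1}\g{-2}';  # counts back from its own position

my $line = "code=e99e";
my $absolute_ok = ($line =~ /^(\w+)=$absolute$/) ? 1 : 0;
my $relative_ok = ($line =~ /^(\w+)=$relative$/) ? 1 : 0;

print "absolute: $absolute_ok, relative: $relative_ok\n";
```

       Only the relative version matches, because "\g{-1}" and "\g{-2}" still
       refer to the digit and the letter after "(\w+)" has taken number 1.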
799
800   Named backreferences
801       Perl 5.10 also introduced named capture groups and named
802       backreferences.  To attach a name to a capturing group, you write
803       either "(?<name>...)" or "(?'name'...)".  The backreference may then be
804       written as "\g{name}".  It is permissible to attach the same name to
805       more than one group, but then only the leftmost one of the eponymous
806       set can be referenced.  Outside of the pattern a named capture group is
807       accessible through the "%+" hash.
808
809       Assuming that we have to match calendar dates which may be given in one
810       of the three formats yyyy-mm-dd, mm/dd/yyyy or dd.mm.yyyy, we can write
811       three suitable patterns where we use 'd', 'm' and 'y' respectively as
812       the names of the groups capturing the pertaining components of a date.
813       The matching operation combines the three patterns as alternatives:
814
815           $fmt1 = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
816           $fmt2 = '(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)';
817           $fmt3 = '(?<d>\d\d)\.(?<m>\d\d)\.(?<y>\d\d\d\d)';
           for my $d (qw( 2006-10-21 15.01.2007 10/31/2005 )) {
819               if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){
820                   print "day=$+{d} month=$+{m} year=$+{y}\n";
821               }
822           }
823
824       If any of the alternatives matches, the hash "%+" is bound to contain
825       the three key-value pairs.
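
       Because every format uses the same three names, the result can be
       handled without knowing which alternative matched.  The helper to_iso
       below is hypothetical, and the "\A" and "\z" anchors are an added
       assumption so that the whole string must be a date:

```perl
use strict;
use warnings;

my $fmt1 = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
my $fmt2 = '(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)';
my $fmt3 = '(?<d>\d\d)\.(?<m>\d\d)\.(?<y>\d\d\d\d)';

# to_iso is a hypothetical helper: it returns yyyy-mm-dd,
# or undef if none of the three formats fits.
sub to_iso {
    my ($date) = @_;
    return undef unless $date =~ m{\A(?:$fmt1|$fmt2|$fmt3)\z};
    return "$+{y}-$+{m}-$+{d}";
}

my @iso = map { to_iso($_) } qw(2006-10-21 15.01.2007 10/31/2005);
print "@iso\n";
```

       All three inputs come out in yyyy-mm-dd order.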
826
827   Alternative capture group numbering
       Yet another capturing group numbering technique (also available since
       Perl 5.10) deals with the problem of referring to groups within a set
       of alternatives.  Consider a pattern for matching a time of the day,
       civil or military style:
832
833           if ( $time =~ /(\d\d|\d):(\d\d)|(\d\d)(\d\d)/ ){
834               # process hour and minute
835           }
836
837       Processing the results requires an additional if statement to determine
838       whether $1 and $2 or $3 and $4 contain the goodies. It would be easier
       if we could use group numbers 1 and 2 in the second alternative as
       well, and this is exactly what the parenthesized construct "(?|...)",
       set around an alternative, achieves. Here is an extended version of the
842       previous pattern:
843
844         if($time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])/){
845             print "hour=$1 minute=$2 zone=$3\n";
846         }
847
848       Within the alternative numbering group, group numbers start at the same
849       position for each alternative. After the group, numbering continues
850       with one higher than the maximum reached across all the alternatives.
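
       A short sketch shows the same group numbers being filled by either
       alternative.  The two sample inputs and the anchors '^' and '$' are
       assumptions made for the illustration:

```perl
use strict;
use warnings;

my @parsed;
for my $time ("9:30 UTC", "0930 UTC") {    # made-up sample inputs
    if ($time =~ /^(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])$/) {
        # $1 and $2 are set by whichever alternative matched; $3 is the zone.
        push @parsed, "hour=$1 minute=$2 zone=$3";
    }
}

print "$_\n" for @parsed;
```

       Both inputs are reported through $1, $2 and $3, with the hour being '9'
       for the first and '09' for the second.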
851
852   Position information
853       In addition to what was matched, Perl also provides the positions of
854       what was matched as contents of the "@-" and "@+" arrays. "$-[0]" is
855       the position of the start of the entire match and $+[0] is the position
856       of the end. Similarly, "$-[n]" is the position of the start of the $n
857       match and $+[n] is the position of the end. If $n is undefined, so are
858       "$-[n]" and $+[n]. Then this code
859
860           $x = "Mmm...donut, thought Homer";
861           $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
           foreach $exp (1..$#-) {
               no strict "refs";  # ${$exp} is a symbolic reference to $1, $2, ...
               print "Match $exp: '${$exp}' at position ($-[$exp],$+[$exp])\n";
           }
865
866       prints
867
868           Match 1: 'Mmm' at position (0,3)
869           Match 2: 'donut' at position (6,11)
870
871       Even if there are no groupings in a regexp, it is still possible to
872       find out what exactly matched in a string.  If you use them, Perl will
873       set "$`" to the part of the string before the match, will set $& to the
       part of the string that matched, and will set "$'" to the part of the
875       string after the match.  An example:
876
877           $x = "the cat caught the mouse";
878           $x =~ /cat/;  # $` = 'the ', $& = 'cat', $' = ' caught the mouse'
879           $x =~ /the/;  # $` = '', $& = 'the', $' = ' cat caught the mouse'
880
881       In the second match, "$`" equals '' because the regexp matched at the
882       first character position in the string and stopped; it never saw the
883       second "the".
884
885       If your code is to run on Perl versions earlier than 5.20, it is
       worthwhile to note that using "$`" and "$'" slows down regexp matching
887       quite a bit, while $& slows it down to a lesser extent, because if they
888       are used in one regexp in a program, they are generated for all regexps
889       in the program.  So if raw performance is a goal of your application,
890       they should be avoided.  If you need to extract the corresponding
891       substrings, use "@-" and "@+" instead:
892
893           $` is the same as substr( $x, 0, $-[0] )
894           $& is the same as substr( $x, $-[0], $+[0]-$-[0] )
895           $' is the same as substr( $x, $+[0] )
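
       Written out as runnable code, the three equivalences look like this.
       The variable names $before, $matched and $after are illustrative:

```perl
use strict;
use warnings;

my $x = "the cat caught the mouse";
my ($before, $matched, $after);
if ($x =~ /cat/) {
    $before  = substr($x, 0, $-[0]);               # same text as $`
    $matched = substr($x, $-[0], $+[0] - $-[0]);   # same text as $&
    $after   = substr($x, $+[0]);                  # same text as $'
}

print "before='$before' match='$matched' after='$after'\n";
```

       The three pieces are 'the ', 'cat' and ' caught the mouse'.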
896
897       As of Perl 5.10, the "${^PREMATCH}", "${^MATCH}" and "${^POSTMATCH}"
898       variables may be used.  These are only set if the "/p" modifier is
899       present.  Consequently they do not penalize the rest of the program.
900       In Perl 5.20, "${^PREMATCH}", "${^MATCH}" and "${^POSTMATCH}" are
901       available whether the "/p" has been used or not (the modifier is
       ignored), and "$`", "$'" and $& do not cause any speed difference.
903
904   Non-capturing groupings
905       A group that is required to bundle a set of alternatives may or may not
906       be useful as a capturing group.  If it isn't, it just creates a
907       superfluous addition to the set of available capture group values,
908       inside as well as outside the regexp.  Non-capturing groupings, denoted
909       by "(?:regexp)", still allow the regexp to be treated as a single unit,
910       but don't establish a capturing group at the same time.  Both capturing
911       and non-capturing groupings are allowed to co-exist in the same regexp.
912       Because there is no extraction, non-capturing groupings are faster than
913       capturing groupings.  Non-capturing groupings are also handy for
914       choosing exactly which parts of a regexp are to be extracted to
915       matching variables:
916
917           # match a number, $1-$4 are set, but we only want $1
918           /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/;
919
           # match a number faster, only $1 is set
921           /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/;
922
923           # match a number, get $1 = whole number, $2 = exponent
924           /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;
925
926       Non-capturing groupings are also useful for removing nuisance elements
927       gathered from a split operation where parentheses are required for some
928       reason:
929
930           $x = '12aba34ba5';
931           @num = split /(a|b)+/, $x;    # @num = ('12','a','34','a','5')
932           @num = split /(?:a|b)+/, $x;  # @num = ('12','34','5')
933
934       In Perl 5.22 and later, all groups within a regexp can be set to non-
935       capturing by using the new "/n" flag:
936
937           "hello" =~ /(hi|hello)/n; # $1 is not set!
938
939       See "n" in perlre for more information.
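
       The following sketch applies the third number regexp and the second
       split from above to concrete input.  The test number "-12.5e+3" is a
       made-up example:

```perl
use strict;
use warnings;

# Only the whole number and the exponent are captured.
my $number = qr/([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;
my ($whole, $exponent) = "-12.5e+3" =~ $number;

# With a non-capturing group, split returns only the fields.
my @fields = split /(?:a|b)+/, '12aba34ba5';

print "number=$whole exponent=$exponent fields=@fields\n";
```

       Only two capture variables are set by the number regexp, and the split
       result contains no separator fragments.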
940
941   Matching repetitions
942       The examples in the previous section display an annoying weakness.  We
943       were only matching 3-letter words, or chunks of words of 4 letters or
944       less.  We'd like to be able to match words or, more generally, strings
945       of any length, without writing out tedious alternatives like
946       "\w\w\w\w|\w\w\w|\w\w|\w".
947
948       This is exactly the problem the quantifier metacharacters '?', '*',
949       '+', and "{}" were created for.  They allow us to delimit the number of
950       repeats for a portion of a regexp we consider to be a match.
951       Quantifiers are put immediately after the character, character class,
952       or grouping that we want to specify.  They have the following meanings:
953
954       ·   "a?" means: match 'a' 1 or 0 times
955
956       ·   "a*" means: match 'a' 0 or more times, i.e., any number of times
957
958       ·   "a+" means: match 'a' 1 or more times, i.e., at least once
959
960       ·   "a{n,m}" means: match at least "n" times, but not more than "m"
961           times.
962
       ·   "a{n,}" means: match at least "n" times
964
965       ·   "a{n}" means: match exactly "n" times
966
967       Here are some examples:
968
969           /[a-z]+\s+\d*/;  # match a lowercase word, at least one space, and
970                            # any number of digits
971           /(\w+)\s+\g1/;    # match doubled words of arbitrary length
972           /y(es)?/i;       # matches 'y', 'Y', or a case-insensitive 'yes'
973           $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
974                                  # than 4 digits
975           $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3-digit dates
976           $year =~ /^\d{2}(\d{2})?$/; # same thing written differently.
977                                       # However, this captures the last two
978                                       # digits in $1 and the other does not.
979
980           % simple_grep '^(\w+)\g1$' /usr/dict/words   # isn't this easier?
981           beriberi
982           booboo
983           coco
984           mama
985           murmur
986           papa
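
       As a quick check of the two year regexps above, they can be used to
       filter a list of made-up candidates:

```perl
use strict;
use warnings;

my @candidates = qw(99 1999 199 19999);   # made-up inputs

my @loose = grep { /^\d{2,4}$/ }       @candidates;  # 2, 3 or 4 digits
my @exact = grep { /^\d{4}$|^\d{2}$/ } @candidates;  # exactly 2 or 4 digits

print "loose: @loose\n";
print "exact: @exact\n";
```

       The first regexp lets the 3-digit '199' through; the second rejects it.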
987
988       For all of these quantifiers, Perl will try to match as much of the
989       string as possible, while still allowing the regexp to succeed.  Thus
990       with "/a?.../", Perl will first try to match the regexp with the 'a'
991       present; if that fails, Perl will try to match the regexp without the
992       'a' present.  For the quantifier '*', we get the following:
993
994           $x = "the cat in the hat";
995           $x =~ /^(.*)(cat)(.*)$/; # matches,
996                                    # $1 = 'the '
997                                    # $2 = 'cat'
998                                    # $3 = ' in the hat'
999
       This is what we might expect: the match finds the only "cat" in the
       string and locks onto it.  Consider, however, this regexp:
1002
1003           $x =~ /^(.*)(at)(.*)$/; # matches,
1004                                   # $1 = 'the cat in the h'
1005                                   # $2 = 'at'
1006                                   # $3 = ''   (0 characters match)
1007
1008       One might initially guess that Perl would find the "at" in "cat" and
1009       stop there, but that wouldn't give the longest possible string to the
1010       first quantifier ".*".  Instead, the first quantifier ".*" grabs as
       much of the string as possible while still having the regexp match.  In
       this example, that means matching the "at" element against the final
       "at" in the string.  The other important principle illustrated here is
       that, when there are two or more elements in a regexp, the leftmost
1015       quantifier, if there is one, gets to grab as much of the string as
1016       possible, leaving the rest of the regexp to fight over scraps.  Thus in
1017       our example, the first quantifier ".*" grabs most of the string, while
1018       the second quantifier ".*" gets the empty string.   Quantifiers that
1019       grab as much of the string as possible are called maximal match or
1020       greedy quantifiers.
1021
1022       When a regexp can match a string in several different ways, we can use
1023       the principles above to predict which way the regexp will match:
1024
1025       ·   Principle 0: Taken as a whole, any regexp will be matched at the
1026           earliest possible position in the string.
1027
1028       ·   Principle 1: In an alternation "a|b|c...", the leftmost alternative
1029           that allows a match for the whole regexp will be the one used.
1030
1031       ·   Principle 2: The maximal matching quantifiers '?', '*', '+' and
1032           "{n,m}" will in general match as much of the string as possible
1033           while still allowing the whole regexp to match.
1034
1035       ·   Principle 3: If there are two or more elements in a regexp, the
1036           leftmost greedy quantifier, if any, will match as much of the
1037           string as possible while still allowing the whole regexp to match.
1038           The next leftmost greedy quantifier, if any, will try to match as
1039           much of the string remaining available to it as possible, while
1040           still allowing the whole regexp to match.  And so on, until all the
1041           regexp elements are satisfied.
1042
1043       As we have seen above, Principle 0 overrides the others. The regexp
1044       will be matched as early as possible, with the other principles
1045       determining how the regexp matches at that earliest character position.
1046
1047       Here is an example of these principles in action:
1048
1049           $x = "The programming republic of Perl";
1050           $x =~ /^(.+)(e|r)(.*)$/;  # matches,
1051                                     # $1 = 'The programming republic of Pe'
1052                                     # $2 = 'r'
1053                                     # $3 = 'l'
1054
1055       This regexp matches at the earliest string position, 'T'.  One might
1056       think that 'e', being leftmost in the alternation, would be matched,
1057       but 'r' produces the longest string in the first quantifier.
1058
1059           $x =~ /(m{1,2})(.*)$/;  # matches,
1060                                   # $1 = 'mm'
1061                                   # $2 = 'ing republic of Perl'
1062
       Here, the earliest possible match is at the first 'm' in "programming".
1064       "m{1,2}" is the first quantifier, so it gets to match a maximal "mm".
1065
1066           $x =~ /.*(m{1,2})(.*)$/;  # matches,
1067                                     # $1 = 'm'
1068                                     # $2 = 'ing republic of Perl'
1069
1070       Here, the regexp matches at the start of the string. The first
1071       quantifier ".*" grabs as much as possible, leaving just a single 'm'
1072       for the second quantifier "m{1,2}".
1073
1074           $x =~ /(.?)(m{1,2})(.*)$/;  # matches,
1075                                       # $1 = 'a'
1076                                       # $2 = 'mm'
1077                                       # $3 = 'ing republic of Perl'
1078
1079       Here, ".?" eats its maximal one character at the earliest possible
1080       position in the string, 'a' in "programming", leaving "m{1,2}" the
       opportunity to match both 'm's. Finally,
1082
1083           "aXXXb" =~ /(X*)/; # matches with $1 = ''
1084
1085       because it can match zero copies of 'X' at the beginning of the string.
1086       If you definitely want to match at least one 'X', use "X+", not "X*".
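
       The difference between "X*" and "X+" in this last example can be seen
       by recording where each match starts.  The variable names are
       illustrative:

```perl
use strict;
use warnings;

my ($star) = "aXXXb" =~ /(X*)/;   # succeeds at once with zero X's
my $star_pos = $-[0];             # start of that (empty) match

my ($plus) = "aXXXb" =~ /(X+)/;   # has to move on to the first real 'X'
my $plus_pos = $-[0];

print "X* matched '$star' at $star_pos, X+ matched '$plus' at $plus_pos\n";
```

       "X*" is satisfied by the empty string at position 0, which Principle 0
       prefers to the longer match starting at position 1.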
1087
1088       Sometimes greed is not good.  At times, we would like quantifiers to
1089       match a minimal piece of string, rather than a maximal piece.  For this
1090       purpose, Larry Wall created the minimal match or non-greedy quantifiers
1091       "??", "*?", "+?", and "{}?".  These are the usual quantifiers with a
1092       '?' appended to them.  They have the following meanings:
1093
1094       ·   "a??" means: match 'a' 0 or 1 times. Try 0 first, then 1.
1095
1096       ·   "a*?" means: match 'a' 0 or more times, i.e., any number of times,
1097           but as few times as possible
1098
1099       ·   "a+?" means: match 'a' 1 or more times, i.e., at least once, but as
1100           few times as possible
1101
1102       ·   "a{n,m}?" means: match at least "n" times, not more than "m" times,
1103           as few times as possible
1104
1105       ·   "a{n,}?" means: match at least "n" times, but as few times as
1106           possible
1107
1108       ·   "a{n}?" means: match exactly "n" times.  Because we match exactly
1109           "n" times, "a{n}?" is equivalent to "a{n}" and is just there for
1110           notational consistency.
1111
1112       Let's look at the example above, but with minimal quantifiers:
1113
1114           $x = "The programming republic of Perl";
1115           $x =~ /^(.+?)(e|r)(.*)$/; # matches,
1116                                     # $1 = 'Th'
1117                                     # $2 = 'e'
1118                                     # $3 = ' programming republic of Perl'
1119
1120       The minimal string that will allow both the start of the string '^' and
1121       the alternation to match is "Th", with the alternation "e|r" matching
1122       'e'.  The second quantifier ".*" is free to gobble up the rest of the
1123       string.
1124
1125           $x =~ /(m{1,2}?)(.*?)$/;  # matches,
1126                                     # $1 = 'm'
1127                                     # $2 = 'ming republic of Perl'
1128
1129       The first string position that this regexp can match is at the first
1130       'm' in "programming". At this position, the minimal "m{1,2}?"  matches
1131       just one 'm'.  Although the second quantifier ".*?" would prefer to
1132       match no characters, it is constrained by the end-of-string anchor '$'
1133       to match the rest of the string.
1134
1135           $x =~ /(.*?)(m{1,2}?)(.*)$/;  # matches,
1136                                         # $1 = 'The progra'
1137                                         # $2 = 'm'
1138                                         # $3 = 'ming republic of Perl'
1139
1140       In this regexp, you might expect the first minimal quantifier ".*?"  to
1141       match the empty string, because it is not constrained by a '^' anchor
1142       to match the beginning of the word.  Principle 0 applies here, however.
1143       Because it is possible for the whole regexp to match at the start of
1144       the string, it will match at the start of the string.  Thus the first
1145       quantifier has to match everything up to the first 'm'.  The second
1146       minimal quantifier matches just one 'm' and the third quantifier
1147       matches the rest of the string.
1148
1149           $x =~ /(.??)(m{1,2})(.*)$/;  # matches,
1150                                        # $1 = 'a'
1151                                        # $2 = 'mm'
1152                                        # $3 = 'ing republic of Perl'
1153
1154       Just as in the previous regexp, the first quantifier ".??" can match
1155       earliest at position 'a', so it does.  The second quantifier is greedy,
1156       so it matches "mm", and the third matches the rest of the string.
1157
1158       We can modify principle 3 above to take into account non-greedy
1159       quantifiers:
1160
1161       ·   Principle 3: If there are two or more elements in a regexp, the
1162           leftmost greedy (non-greedy) quantifier, if any, will match as much
1163           (little) of the string as possible while still allowing the whole
1164           regexp to match.  The next leftmost greedy (non-greedy) quantifier,
1165           if any, will try to match as much (little) of the string remaining
1166           available to it as possible, while still allowing the whole regexp
1167           to match.  And so on, until all the regexp elements are satisfied.
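
       A common place where the choice matters is text with repeated
       delimiters.  The sketch below, using a made-up string with angle
       brackets, compares a greedy and a non-greedy quantifier at the same
       starting position:

```perl
use strict;
use warnings;

my $markup = "<b>bold</b> and <i>italic</i>";   # made-up sample text

my ($greedy)  = $markup =~ /<(.+)>/;    # runs on to the last '>'
my ($minimal) = $markup =~ /<(.+?)>/;   # stops at the first '>'

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
```

       Both matches start at the first '<'; only the amount taken by the
       quantifier differs.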
1168
1169       Just like alternation, quantifiers are also susceptible to
1170       backtracking.  Here is a step-by-step analysis of the example
1171
1172           $x = "the cat in the hat";
1173           $x =~ /^(.*)(at)(.*)$/; # matches,
1174                                   # $1 = 'the cat in the h'
1175                                   # $2 = 'at'
                                   # $3 = ''   (0 characters match)
1177
1178       0.  Start with the first letter in the string 't'.
1179
1180
       1.  The first quantifier '.*' starts out by matching the whole string
       "the cat in the hat".
1183
1184
1185       2.  'a' in the regexp element 'at' doesn't match the end of the string.
1186       Backtrack one character.
1187
1188
1189       3.  'a' in the regexp element 'at' still doesn't match the last letter
1190       of the string 't', so backtrack one more character.
1191
1192
1193       4.  Now we can match the 'a' and the 't'.
1194
1195
1196       5.  Move on to the third element '.*'.  Since we are at the end of the
1197       string and '.*' can match 0 times, assign it the empty string.
1198
1199
1200       6.  We are done!
1201
1202       Most of the time, all this moving forward and backtracking happens
       quickly and searching is fast. There are some pathological regexps,
       however, whose execution time grows exponentially with the size of the
       string.  A typical structure that blows up in your face is of the form
1206
1207           /(a|b+)*/;
1208
1209       The problem is the nested indeterminate quantifiers.  There are many
1210       different ways of partitioning a string of length n between the '+' and
1211       '*': one repetition with "b+" of length n, two repetitions with the
1212       first "b+" length k and the second with length n-k, m repetitions whose
1213       bits add up to length n, etc.  In fact there are an exponential number
1214       of ways to partition a string as a function of its length.  A regexp
1215       may get lucky and match early in the process, but if there is no match,
       Perl will try every possibility before giving up.  So be careful with
       nested '*', "{n,m}", and '+' quantifiers.  The book Mastering Regular
       Expressions by Jeffrey Friedl gives a wonderful discussion of this and
       other efficiency issues.
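
       For this particular structure, one way out is to remove the nesting
       altogether.  The sketch below assumes anchored, non-capturing versions
       of the pattern and a handful of short made-up strings; it only checks
       that a single quantifier over a character class accepts the same
       strings, and is not a general cure:

```perl
use strict;
use warnings;

my $nested = qr/^(?:a|b+)*$/;   # nested quantifiers: '+' inside '*'
my $flat   = qr/^[ab]*$/;       # one quantifier over a character class

my @samples  = ('', 'a', 'abba', 'bbbb', 'abc', 'bbbc');
my @disagree = grep { ($_ =~ $nested ? 1 : 0) != ($_ =~ $flat ? 1 : 0) } @samples;

print "patterns disagree on ", scalar(@disagree), " sample(s)\n";
```

       The flat version has only one way to divide up a run of 'b's, so there
       is nothing to try again when the overall match fails.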
1220
1221   Possessive quantifiers
1222       Backtracking during the relentless search for a match may be a waste of
1223       time, particularly when the match is bound to fail.  Consider the
1224       simple pattern
1225
1226           /^\w+\s+\w+$/; # a word, spaces, a word
1227
1228       Whenever this is applied to a string which doesn't quite meet the
1229       pattern's expectations such as "abc  " or "abc  def ", the regexp
1230       engine will backtrack, approximately once for each character in the
1231       string.  But we know that there is no way around taking all of the
1232       initial word characters to match the first repetition, that all spaces
1233       must be eaten by the middle part, and the same goes for the second
1234       word.
1235
1236       With the introduction of the possessive quantifiers in Perl 5.10, we
1237       have a way of instructing the regexp engine not to backtrack: append a
1238       '+' to one of the usual quantifiers.  This makes them greedy
1239       as well as stingy; once they succeed they won't give anything back to
1240       permit another solution. They have the following meanings:
1241
1242       ·   "a{n,m}+" means: match at least "n" times, not more than "m" times,
1243           as many times as possible, and don't give anything up. "a?+" is
1244           short for "a{0,1}+"
1245
1246       ·   "a{n,}+" means: match at least "n" times, but as many times as
1247           possible, and don't give anything up. "a*+" is short for "a{0,}+"
1248           and "a++" is short for "a{1,}+".
1249
1250       ·   "a{n}+" means: match exactly "n" times.  It is just there for
1251           notational consistency.
1252
1253       These possessive quantifiers represent a special case of a more general
1254       concept, the independent subexpression (see below).
1255
1256       As an example where a possessive quantifier is suitable we consider
1257       matching a quoted string, as it appears in several programming
1258       languages.  The backslash is used as an escape character that indicates
1259       that the next character is to be taken literally, as another character
1260       for the string.  Therefore, after the opening quote, we expect a
1261       (possibly empty) sequence of alternatives: either some character except
1262       an unescaped quote or backslash or an escaped character.
1263
1264           /"(?:[^"\\]++|\\.)*+"/;
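
           The effect of refusing to backtrack can be seen in a small
           experiment.  This is only a sketch: the test strings and variable
           names are made up for illustration, and a capture group has been
           added around the quoted-string pattern so the match can be
           inspected.

```perl
use strict;
use warnings;
use 5.010;    # possessive quantifiers need Perl 5.10 or later

# Greedy 'a+' gives one 'a' back so that the trailing 'a' can match;
# possessive 'a++' keeps all three, so the trailing 'a' has nothing left.
my $greedy_ok     = ("aaa" =~ /^a+a$/)  ? 1 : 0;
my $possessive_ok = ("aaa" =~ /^a++a$/) ? 1 : 0;

# The quoted-string pattern, applied to text containing escaped quotes.
my $text = 'say "hi \"there\"" now';
my ($quoted) = $text =~ /("(?:[^"\\]++|\\.)*+")/;

print "greedy: $greedy_ok, possessive: $possessive_ok\n";
print "quoted: $quoted\n";
```

           The greedy version matches and the possessive one does not, while
           the quoted-string pattern still finds the whole string including
           its escaped quotes.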
1265
1266   Building a regexp
1267       At this point, we have all the basic regexp concepts covered, so let's
1268       give a more involved example of a regular expression.  We will build a
1269       regexp that matches numbers.
1270
1271       The first task in building a regexp is to decide what we want to match
1272       and what we want to exclude.  In our case, we want to match both
1273       integers and floating point numbers and we want to reject any string
1274       that isn't a number.
1275
1276       The next task is to break the problem down into smaller problems that
1277       are easily converted into a regexp.
1278
1279       The simplest case is integers.  These consist of a sequence of digits,
1280       with an optional sign in front.  The digits we can represent with "\d+"
1281       and the sign can be matched with "[+-]".  Thus the integer regexp is
1282
1283           /[+-]?\d+/;  # matches integers
1284
1285       A floating point number potentially has a sign, an integral part, a
1286       decimal point, a fractional part, and an exponent.  One or more of
1287       these parts is optional, so we need to check out the different
1288       possibilities.  Floating point numbers which are in proper form include
1289       123., 0.345, .34, -1e6, and 25.4E-72.  As with integers, the sign out
1290       front is completely optional and can be matched by "[+-]?".  We can see
1291       that if there is no exponent, floating point numbers must have a
1292       decimal point, otherwise they are integers.  We might be tempted to
1293       model these with "\d*\.\d*", but this would also match just a single
1294       decimal point, which is not a number.  So the three cases of floating
1295       point number without exponent are
1296
1297          /[+-]?\d+\./;  # 1., 321., etc.
1298          /[+-]?\.\d+/;  # .1, .234, etc.
1299          /[+-]?\d+\.\d+/;  # 1.0, 30.56, etc.
1300
1301       These can be combined into a single regexp with a three-way
1302       alternation:
1303
1304          /[+-]?(\d+\.\d+|\d+\.|\.\d+)/;  # floating point, no exponent
1305
1306       In this alternation, it is important to put '\d+\.\d+' before '\d+\.'.
1307       If '\d+\.' were first, the regexp would happily match that and ignore
1308       the fractional part of the number.
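
           The effect of the ordering is easy to check.  In this short
           sketch (the variable names are illustrative), the same string is
           matched against both orderings and the captured text is compared:

```perl
use strict;
use warnings;

# '\d+\.\d+' first: the whole number is captured.
my ($good) = "3.14" =~ /(\d+\.\d+|\d+\.|\.\d+)/;

# '\d+\.' first: the alternation succeeds early and drops the fraction.
my ($bad)  = "3.14" =~ /(\d+\.|\d+\.\d+|\.\d+)/;

print "correct order captures '$good'\n";
print "wrong order captures   '$bad'\n";
```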
1309
1310       Now consider floating point numbers with exponents.  The key
1311       observation here is that both integers and numbers with decimal points
1312       are allowed in front of an exponent.  Then exponents, like the overall
1313       sign, are independent of whether we are matching numbers with or
1314       without decimal points, and can be "decoupled" from the mantissa.  The
1315       overall form of the regexp now becomes clear:
1316
1317           /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;
1318
1319       The exponent is an 'e' or 'E', followed by an integer.  So the exponent
1320       regexp is
1321
1322          /[eE][+-]?\d+/;  # exponent
1323
1324       Putting all the parts together, we get a regexp that matches numbers:
1325
1326          /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;  # Ta da!
1327
1328       Long regexps like this may impress your friends, but can be hard to
1329       decipher.  In complex situations like this, the "/x" modifier for a
1330       match is invaluable.  It allows one to put nearly arbitrary whitespace
1331       and comments into a regexp without affecting its meaning.  Using it,
1332       we can rewrite our "extended" regexp in the more pleasing form
1333
1334          /^
1335             [+-]?         # first, match an optional sign
1336             (             # then match integers or f.p. mantissas:
1337                 \d+\.\d+  # mantissa of the form a.b
1338                |\d+\.     # mantissa of the form a.
1339                |\.\d+     # mantissa of the form .b
1340                |\d+       # integer of the form a
1341             )
1342             ( [eE] [+-]? \d+ )?  # finally, optionally match an exponent
1343          $/x;
1344
1345       If whitespace is mostly irrelevant, how does one include space
1346       characters in an extended regexp? The answer is to backslash it '\ ' or
1347       put it in a character class "[ ]".  The same thing goes for pound
1348       signs: use "\#" or "[#]".  For instance, Perl allows a space between
1349       the sign and the mantissa or integer, and we could add this to our
1350       regexp as follows:
1351
1352          /^
1353             [+-]?\ *      # first, match an optional sign *and space*
1354             (             # then match integers or f.p. mantissas:
1355                 \d+\.\d+  # mantissa of the form a.b
1356                |\d+\.     # mantissa of the form a.
1357                |\.\d+     # mantissa of the form .b
1358                |\d+       # integer of the form a
1359             )
1360             ( [eE] [+-]? \d+ )?  # finally, optionally match an exponent
1361          $/x;
1362
1363       In this form, it is easier to see a way to simplify the alternation.
1364       Alternatives 1, 2, and 4 all start with "\d+", so it could be factored
1365       out:
1366
1367          /^
1368             [+-]?\ *      # first, match an optional sign
1369             (             # then match integers or f.p. mantissas:
1370                 \d+       # start out with a ...
1371                 (
1372                     \.\d* # mantissa of the form a.b or a.
1373                 )?        # ? takes care of integers of the form a
1374                |\.\d+     # mantissa of the form .b
1375             )
1376             ( [eE] [+-]? \d+ )?  # finally, optionally match an exponent
1377          $/x;
1378
1379       Starting in Perl v5.26, specifying "/xx" changes the square-bracketed
1380       portions of a pattern to ignore tabs and space characters unless they
1381       are escaped by preceding them with a backslash.  So, we could write
1382
1383          /^
1384             [ + - ]?\ *   # first, match an optional sign
1385             (             # then match integers or f.p. mantissas:
1386                 \d+       # start out with a ...
1387                 (
1388                     \.\d* # mantissa of the form a.b or a.
1389                 )?        # ? takes care of integers of the form a
1390                |\.\d+     # mantissa of the form .b
1391             )
1392             ( [ e E ] [ + - ]? \d+ )?  # finally, optionally match an exponent
1393          $/xx;
1394
1395       This doesn't really improve the legibility of this example, but it's
1396       available in case you want it.  Squashing the pattern down to the
1397       compact form, we have
1398
1399           /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;
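
           A regexp like this is worth trying against a handful of inputs,
           both valid and invalid.  The sketch below stores the compact form
           with "qr//" and wraps it in a hypothetical helper, is_number; the
           list of test strings is an assumption chosen to exercise each
           alternative.

```perl
use strict;
use warnings;

# The final regexp from the text, precompiled with qr//.
my $number_re = qr/^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

sub is_number {
    my ($s) = @_;
    return $s =~ $number_re ? 1 : 0;
}

for my $s ('123.', '0.345', '.34', '-1e6', '25.4E-72', '- 7',
           '.', '1e', 'abc', '1.2.3') {
    printf "%-10s %s\n", "'$s'", is_number($s) ? 'number' : 'not a number';
}
```

           The first six strings are accepted; a lone decimal point, a
           dangling exponent, plain text and "1.2.3" are rejected.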
1400
1401       This is our final regexp.  To recap, we built a regexp by
1402
1403       ·   specifying the task in detail,
1404
1405       ·   breaking down the problem into smaller parts,
1406
1407       ·   translating the small parts into regexps,
1408
1409       ·   combining the regexps,
1410
1411       ·   and optimizing the final combined regexp.
1412
1413       These are also the typical steps involved in writing a computer
1414       program.  This makes perfect sense, because regular expressions are
1415       essentially programs written in a little computer language that
1416       specifies patterns.
1417
1418   Using regular expressions in Perl
1419       The last topic of Part 1 briefly covers how regexps are used in Perl
1420       programs.  Where do they fit into Perl syntax?
1421
1422       We have already introduced the matching operator in its default
1423       "/regexp/" and arbitrary delimiter "m!regexp!" forms.  We have used the
1424       binding operator "=~" and its negation "!~" to test for string matches.
1425       Associated with the matching operator, we have discussed the single
1426       line "/s", multi-line "/m", case-insensitive "/i" and extended "/x"
1427       modifiers.  There are a few more things you might want to know about
1428       matching operators.
1429
1430       Prohibiting substitution
1431
1432       Variables in a regexp are normally substituted into the pattern before
1433       matching.  If you don't want any substitutions at all, use the special
1434       delimiter "m''":
1435
1436           @pattern = ('Seuss');
1437           while (<>) {
1438               print if m'@pattern';  # matches literal '@pattern', not 'Seuss'
1439           }
1440
1441       Similar to strings, "m''" acts like apostrophes on a regexp; all other
1442       'm' delimiters act like quotes.  If the regexp evaluates to the empty
1443       string, the regexp in the last successful match is used instead.  So we
1444       have
1445
1446           "dog" =~ /d/;  # 'd' matches
1447           "dogbert" =~ //;  # this matches the 'd' regexp used before
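
           Both behaviors can be observed directly.  This is a small sketch
           with made-up strings and variable names; it records each result
           as 1 or 0 so that the outcomes can be compared side by side.

```perl
use strict;
use warnings;

# 1. An empty pattern reuses the last successfully matched regexp.
"dog" =~ /o/;                               # succeeds, so /o/ is remembered
my $hit  = ("dogbert" =~ //) ? 1 : 0;       # behaves like /o/: matches
my $miss = ("cat"     =~ //) ? 1 : 0;       # still /o/: no 'o' in "cat"

# 2. m'' suppresses interpolation, so '@pattern' is taken literally.
my @pattern = ('Seuss');
my $interpolated = ("Dr. Seuss" =~ m/@pattern/) ? 1 : 0;   # looks for 'Seuss'
my $literal      = ("Dr. Seuss" =~ m'@pattern') ? 1 : 0;   # looks for '@pattern'

print "empty pattern: hit=$hit miss=$miss\n";
print "interpolated=$interpolated literal=$literal\n";
```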
1448
1449       Global matching
1450
1451       The final two modifiers we will discuss here, "/g" and "/c", concern
1452       multiple matches.  The modifier "/g" stands for global matching and
1453       allows the matching operator to match within a string as many times as
1454       possible.  In scalar context, successive invocations against a string
1455       will have "/g" jump from match to match, keeping track of position in
1456       the string as it goes along.  You can get or set the position with the
1457       "pos()" function.
1458
1459       The use of "/g" is shown in the following example.  Suppose we have a
1460       string that consists of words separated by spaces.  If we know how many
1461       words there are in advance, we could extract the words using groupings:
1462
1463           $x = "cat dog house"; # 3 words
1464           $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches,
1465                                                  # $1 = 'cat'
1466                                                  # $2 = 'dog'
1467                                                  # $3 = 'house'
1468
1469       But what if we had an indeterminate number of words? This is the sort
1470       of task "/g" was made for.  To extract all words, form the simple
1471       regexp "(\w+)" and loop over all matches with "/(\w+)/g":
1472
1473           while ($x =~ /(\w+)/g) {
1474               print "Word is $1, ends at position ", pos $x, "\n";
1475           }
1476
1477       prints
1478
1479           Word is cat, ends at position 3
1480           Word is dog, ends at position 7
1481           Word is house, ends at position 13
1482
1483       A failed match or changing the target string resets the position.  If
1484       you don't want the position reset after failure to match, add the "/c",
1485       as in "/regexp/gc".  The current position in the string is associated
1486       with the string, not the regexp.  This means that different strings
1487       have different positions and their respective positions can be set or
1488       read independently.
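
           The difference that "/c" makes shows up when "pos()" is inspected
           after a failed match.  A minimal sketch, with an illustrative
           string and variable names:

```perl
use strict;
use warnings;

my $s = "cat dog";

$s =~ /cat/g;                    # succeeds
my $after_hit = pos $s;          # position just past 'cat'

$s =~ /bird/g;                   # fails without /c
my $after_miss = pos $s;         # the position has been reset

$s =~ /cat/g;                    # match again to restore the position
$s =~ /bird/gc;                  # fails, but /c preserves the position
my $after_gc_miss = pos $s;

printf "after hit: %s\n",      $after_hit;
printf "after /g miss: %s\n",  defined $after_miss ? $after_miss : 'undef';
printf "after /gc miss: %s\n", $after_gc_miss;
```

           After the plain "/g" failure "pos()" returns undef, whereas after
           the "/gc" failure it still reports 3.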
1489
1490       In list context, "/g" returns a list of matched groupings, or if there
1491       are no groupings, a list of matches to the whole regexp.  So if we
1492       wanted just the words, we could use
1493
1494           @words = ($x =~ /(\w+)/g);  # matches,
1495                                       # $words[0] = 'cat'
1496                                       # $words[1] = 'dog'
1497                                       # $words[2] = 'house'
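
           When the regexp has more than one grouping, every match
           contributes all of its groups to the list.  The sketch below uses
           made-up input strings to show both cases:

```perl
use strict;
use warnings;

# Two groupings: each match adds a key and a value to the list,
# which is a convenient way to fill a hash.
my %setting = "width=80 height=24" =~ /(\w+)=(\w+)/g;

# No groupings: each element is the text matched by the whole regexp.
my @numbers = "10 apples, 20 pears, 30 plums" =~ /\d+/g;

print "$_ => $setting{$_}\n" for sort keys %setting;
print "numbers: @numbers\n";
```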
1498
1499       Closely associated with the "/g" modifier is the "\G" anchor.  The "\G"
1500       anchor matches at the point where the previous "/g" match left off.
1501       "\G" allows us to easily do context-sensitive matching:
1502
1503           $metric = 1;  # use metric units
1504           ...
1505           $x = <FILE>;  # read in measurement
1506           $x =~ /^([+-]?\d+)\s*/g;  # get magnitude
1507           $weight = $1;
1508           if ($metric) { # error checking
1509               print "Units error!" unless $x =~ /\Gkg\./g;
1510           }
1511           else {
1512               print "Units error!" unless $x =~ /\Glbs\./g;
1513           }
1514           $x =~ /\G\s+(widget|sprocket)/g;  # continue processing
1515
1516       The combination of "/g" and "\G" allows us to process the string a bit
1517       at a time and use arbitrary Perl logic to decide what to do next.
1518       Currently, the "\G" anchor is only fully supported when used to anchor
1519       to the start of the pattern.
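
           A common use of this technique is a small tokenizer, in which
           each branch tries one kind of token at the current position.  The
           following is a sketch only: the token types, the tokenize helper
           and the input string are all invented for illustration.  The "/c"
           modifier keeps the position intact when a branch fails, so the
           next branch starts from the same place.

```perl
use strict;
use warnings;

sub tokenize {
    my ($input) = @_;
    my @tokens;
    while (1) {
        if    ($input =~ /\G\s+/gc)        { next }                       # skip blanks
        elsif ($input =~ /\G(\d+)/gc)      { push @tokens, "NUM($1)" }    # integer
        elsif ($input =~ /\G([-+*\/])/gc)  { push @tokens, "OP($1)" }     # operator
        elsif ($input =~ /\G\z/gc)         { last }                       # end of input
        else {
            die "Unexpected input at position " . (pos($input) // 0) . "\n";
        }
    }
    return @tokens;
}

my @tokens = tokenize("12 + 3*45");
print "@tokens\n";
```

           For the sample input this prints "NUM(12) OP(+) NUM(3) OP(*)
           NUM(45)".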
1520
1521       "\G" is also invaluable in processing fixed-length records with
1522       regexps.  Suppose we have a snippet of coding region DNA, encoded as
1523       base pair letters "ATCGTTGAAT..." and we want to find all the stop
1524       codons "TGA".  In a coding region, codons are 3-letter sequences, so we
1525       can think of the DNA snippet as a sequence of 3-letter records.  The
1526       naive regexp
1527
1528           # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
1529           $dna = "ATCGTTGAATGCAAATGACATGAC";
1530           $dna =~ /TGA/;
1531
1532       doesn't work; it may match a "TGA", but there is no guarantee that the
1533       match is aligned with codon boundaries, e.g., the substring "GTT GAA"
1534       gives a match.  A better solution is
1535
1536           while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
1537               print "Got a TGA stop codon at position ", pos $dna, "\n";
1538           }
1539
1540       which prints
1541
1542           Got a TGA stop codon at position 18
1543           Got a TGA stop codon at position 23
1544
1545       Position 18 is good, but position 23 is bogus.  What happened?
1546
1547       The answer is that our regexp works well until we get past the last
1548       real match.  Then the regexp will fail to match a synchronized "TGA"
1549       and start stepping ahead one character position at a time, not what we
1550       want.  The solution is to use "\G" to anchor the match to the codon
1551       alignment:
1552
1553           while ($dna =~ /\G(\w\w\w)*?TGA/g) {
1554               print "Got a TGA stop codon at position ", pos $dna, "\n";
1555           }
1556
1557       This prints
1558
1559           Got a TGA stop codon at position 18
1560
1561       which is the correct answer.  This example illustrates that it is
1562       important not only to match what is desired, but to reject what is not
1563       desired.
1564
1565       (There are other regexp modifiers that are available, such as "/o", but
1566       their specialized uses are beyond the scope of this introduction.)
1567
1568       Search and replace
1569
1570       Regular expressions also play a big role in search and replace
1571       operations in Perl.  Search and replace is accomplished with the "s///"
1572       operator.  The general form is "s/regexp/replacement/modifiers", with
1573       everything we know about regexps and modifiers applying in this case as
1574       well.  The replacement is a Perl double-quoted string that replaces in
1575       the string whatever is matched with the "regexp".  The operator "=~" is
1576       also used here to associate a string with "s///".  If matching against
1577       $_, the "$_ =~" can be dropped.  If there is a match, "s///" returns
1578       the number of substitutions made; otherwise it returns false.  Here are
1579       a few examples:
1580
1581           $x = "Time to feed the cat!";
1582           $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
1583           if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
1584               $more_insistent = 1;
1585           }
1586           $y = "'quoted words'";
1587           $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
1588                                  # $y contains "quoted words"
1589
1590       In the last example, the whole string was matched, but only the part
1591       inside the single quotes was grouped.  With the "s///" operator, the
1592       matched variables $1, $2, etc. are immediately available for use in the
1593       replacement expression, so we use $1 to replace the quoted string with
1594       just what was quoted.  With the global modifier, "s///g" will search
1595       and replace all occurrences of the regexp in the string:
1596
1597           $x = "I batted 4 for 4";
1598           $x =~ s/4/four/;   # doesn't do it all:
1599                              # $x contains "I batted four for 4"
1600           $x = "I batted 4 for 4";
1601           $x =~ s/4/four/g;  # does it all:
1602                              # $x contains "I batted four for four"
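
           The return value of "s///g" can be used to find out how many
           replacements were made.  A short sketch, with illustrative
           variable names:

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";

my $made = ($x =~ s/4/four/g);     # number of substitutions made
my $none = ($x =~ s/7/seven/g);    # nothing to replace: returns false

print "made $made substitutions; \$x is now '$x'\n";
print "no match returned ", ($none ? "true" : "false"), "\n";
```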
1603
1604       If you prefer "regex" over "regexp" in this tutorial, you could use the
1605       following program to replace it:
1606
1607           % cat > simple_replace
1608           #!/usr/bin/perl
1609           $regexp = shift;
1610           $replacement = shift;
1611           while (<>) {
1612               s/$regexp/$replacement/g;
1613               print;
1614           }
1615           ^D
1616
1617           % simple_replace regexp regex perlretut.pod
1618
1619       In "simple_replace" we used the "s///g" modifier to replace all
1620       occurrences of the regexp on each line.  (Even though the regular
1621       expression appears in a loop, Perl is smart enough to compile it only
1622       once.)  As with "simple_grep", both the "print" and the
1623       "s/$regexp/$replacement/g" use $_ implicitly.
1624
1625       If you don't want "s///" to change your original variable you can use
1626       the non-destructive substitute modifier, "s///r".  This changes the
1627       behavior so that "s///r" returns the final substituted string (instead
1628       of the number of substitutions):
1629
1630           $x = "I like dogs.";
1631           $y = $x =~ s/dogs/cats/r;
1632           print "$x $y\n";
1633
1634       That example prints "I like dogs. I like cats."  Notice the original
1635       $x variable has not been affected. The overall result of the
1636       substitution is instead stored in $y. If the substitution doesn't
1637       affect anything then the original string is returned:
1638
1639           $x = "I like dogs.";
1640           $y = $x =~ s/elephants/cougars/r;
1641           print "$x $y\n"; # prints "I like dogs. I like dogs."
1642
1643       One other interesting thing that the "s///r" flag allows is chaining
1644       substitutions:
1645
1646           $x = "Cats are great.";
1647           print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~
1648               s/Frogs/Hedgehogs/r, "\n";
1649           # prints "Hedgehogs are great."
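
           Because "s///r" returns a value instead of modifying its target,
           it also fits naturally inside "map".  The sketch below trims
           whitespace from a list of strings while leaving the originals
           alone; the data and variable names are assumptions, and "/r"
           requires Perl 5.14 or later.

```perl
use strict;
use warnings;
use 5.014;    # the /r modifier needs Perl 5.14 or later

my @raw = ("  alpha  ", "beta ", " gamma");

# Each element is copied, trimmed at both ends, and returned by s///r.
my @trimmed = map { s/^\s+|\s+$//gr } @raw;

print "trimmed: ", join("|", @trimmed), "\n";
print "first original is still '", $raw[0], "'\n";
```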
1650
1651       A modifier available specifically to search and replace is the "s///e"
1652       evaluation modifier.  "s///e" treats the replacement text as Perl code,
1653       rather than a double-quoted string.  The value that the code returns is
1654       substituted for the matched substring.  "s///e" is useful if you need
1655       to do a bit of computation in the process of replacing text.  This
1656       example counts character frequencies in a line:
1657
1658           $x = "Bill the cat";
1659           $x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself
1660           print "frequency of '$_' is $chars{$_}\n"
1661               foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);
1662
1663       This prints
1664
1665           frequency of ' ' is 2
1666           frequency of 't' is 2
1667           frequency of 'l' is 2
1668           frequency of 'B' is 1
1669           frequency of 'c' is 1
1670           frequency of 'e' is 1
1671           frequency of 'h' is 1
1672           frequency of 'i' is 1
1673           frequency of 'a' is 1
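
           A simpler illustration of computing the replacement is to do
           arithmetic on each number found.  This sketch, with an invented
           input string, doubles every quantity in a line of text:

```perl
use strict;
use warnings;

my $recipe = "2 eggs, 150 g flour, 300 ml milk";

# Copy the string, then replace each number with twice its value.
(my $doubled = $recipe) =~ s/(\d+)/$1 * 2/eg;

print "$doubled\n";
```

           This prints "4 eggs, 300 g flour, 600 ml milk".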
1674
1675       As with the match "m//" operator, "s///" can use other delimiters, such
1676       as "s!!!" and "s{}{}", and even "s{}//".  If single quotes are used
1677       "s'''", then the regexp and replacement are treated as single-quoted
1678       strings and there are no variable substitutions.  "s///" in list
1679       context returns the same thing as in scalar context, i.e., the number
1680       of matches.
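
           The sketch below contrasts two of these delimiter choices.  The
           strings and variable names are made up for the example.

```perl
use strict;
use warnings;

# Braces as delimiters: the slashes in the path need no escaping.
my $path = "/usr/local/bin/perl";
(my $moved = $path) =~ s{/usr/local}{/opt};

# Single-quote delimiters: '$name' in the replacement is not interpolated.
my $name = "World";
(my $template = "Hello, NAME!") =~ s'NAME'$name';

print "$moved\n";
print "$template\n";
```

           The second substitution inserts the five literal characters
           "$name" rather than the value of the variable.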
1681
1682       The split function
1683
1684       The "split()" function is another place where a regexp is used.  "split
1685       /regexp/, string, limit" separates the "string" operand into a list of
1686       substrings and returns that list.  The regexp must be designed to match
1687       whatever constitutes the separators for the desired substrings.  The
1688       "limit", if present, constrains splitting into no more than "limit"
1689       number of strings.  For example, to split a string into words, use
1690
1691           $x = "Calvin and Hobbes";
1692           @words = split /\s+/, $x;  # $words[0] = 'Calvin'
1693                                      # $words[1] = 'and'
1694                                      # $words[2] = 'Hobbes'
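
           The "limit" operand is handy when only the first few fields
           should be separated and the remainder kept intact.  A minimal
           sketch, using an invented colon-separated record:

```perl
use strict;
use warnings;

my $record = "root:x:0:0:root:/root:/bin/sh";

# With a limit of 2, everything after the first separator stays together.
my ($user, $rest) = split /:/, $record, 2;

# Without a limit, every separator produces a new field.
my @fields = split /:/, $record;

print "user: $user\n";
print "rest: $rest\n";
print "field count: ", scalar(@fields), "\n";
```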
1695
1696       If the empty regexp "//" is used, the regexp always matches and the
1697       string is split into individual characters.  If the regexp has
1698       groupings, then the resulting list contains the matched substrings from
1699       the groupings as well.  For instance,
1700
1701           $x = "/usr/bin/perl";
1702           @dirs = split m!/!, $x;  # $dirs[0] = ''
1703                                    # $dirs[1] = 'usr'
1704                                    # $dirs[2] = 'bin'
1705                                    # $dirs[3] = 'perl'
1706           @parts = split m!(/)!, $x;  # $parts[0] = ''
1707                                       # $parts[1] = '/'
1708                                       # $parts[2] = 'usr'
1709                                       # $parts[3] = '/'
1710                                       # $parts[4] = 'bin'
1711                                       # $parts[5] = '/'
1712                                       # $parts[6] = 'perl'
1713
1714       Since the first character of $x matched the regexp, "split" prepended
1715       an empty initial element to the list.
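
           The empty regexp case and the treatment of empty fields can be
           checked with a few lines.  This is a sketch with made-up strings;
           note that, by default, "split" keeps an empty leading field but
           drops empty trailing ones.

```perl
use strict;
use warnings;

# The empty regexp splits the string into individual characters.
my @chars = split //, "perl";

# A leading separator yields an empty first field; trailing empty
# fields are removed by default.
my @items = split /,/, ",a,b,,";

print "chars: @chars\n";
print "item count: ", scalar(@items), "\n";
```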
1716
1717       If you have read this far, congratulations! You now have all the basic
1718       tools needed to use regular expressions to solve a wide range of text
1719       processing problems.  If this is your first time through the tutorial,
1720       why not stop here and play around with regexps a while....  Part 2
1721       concerns the more esoteric aspects of regular expressions and those
1722       concepts certainly aren't needed right at the start.
1723

Part 2: Power tools

1725       OK, you know the basics of regexps and you want to know more.  If
1726       matching regular expressions is analogous to a walk in the woods, then
1727       the tools discussed in Part 1 are analogous to topo maps and a compass,
1728       basic tools we use all the time.  Most of the tools in Part 2 are
1729       analogous to flare guns and satellite phones.  They aren't used too
1730       often on a hike, but when we are stuck, they can be invaluable.
1731
1732       What follows are the more advanced, less used, or sometimes esoteric
1733       capabilities of Perl regexps.  In Part 2, we will assume you are
1734       comfortable with the basics and concentrate on the advanced features.
1735
1736   More on characters, strings, and character classes
1737       There are a number of escape sequences and character classes that we
1738       haven't covered yet.
1739
1740       There are several escape sequences that convert characters or strings
1741       between upper and lower case, and they are also available within
1742       patterns.  "\l" and "\u" convert the next character to lower or upper
1743       case, respectively:
1744
1745           $x = "perl";
1746           $string =~ /\u$x/;  # matches 'Perl' in $string
1747           $x = "M(rs?|s)\\."; # note the double backslash
1748           $string =~ /\l$x/;  # matches 'mr.', 'mrs.', and 'ms.'
1749
1750       A "\L" or "\U" indicates a lasting conversion of case, until terminated
1751       by "\E" or overridden by another "\U" or "\L":
1752
1753           $x = "This word is in lower case:\L SHOUT\E";
1754           $x =~ /shout/;       # matches
1755           $x = "I STILL KEYPUNCH CARDS FOR MY 360";
1756           $x =~ /\Ukeypunch/;  # matches punch card string
1757
1758       If there is no "\E", case is converted until the end of the string. The
1759       regexps "\L\u$word" or "\u\L$word" convert the first character of $word
1760       to uppercase and the rest of the characters to lowercase.
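
           As a quick check of the combined form, the sketch below
           normalizes a word whose case is mixed up and then uses the same
           escapes inside a pattern.  The word and the test sentences are
           invented for the example.

```perl
use strict;
use warnings;

my $word = "pERL";

# In a double-quoted string: first letter upper case, the rest lower case.
my $capitalized = "\u\L$word";

# The same escapes work inside a pattern, which then looks for 'Perl'.
my $found_cap   = ("I like Perl." =~ /\u\L$word/) ? 1 : 0;
my $found_lower = ("I like perl." =~ /\u\L$word/) ? 1 : 0;

print "capitalized: $capitalized\n";
print "matches 'Perl': $found_cap, matches 'perl': $found_lower\n";
```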
1761
1762       Control characters can be escaped with "\c", so that a control-Z
1763       character would be matched with "\cZ".  The escape sequence "\Q"..."\E"
1764       quotes, or protects most non-alphabetic characters.   For instance,
1765
1766           $x = "That !^*&%~& cat!";
1767           $x =~ /\Q!^*&%~&\E/;  # check for rough language
1768
1769       It does not protect '$' or '@', so that variables can still be
1770       substituted.
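
           This makes "\Q" especially useful when a variable holds text
           that should be matched literally.  In the following sketch the
           variable names and strings are assumptions; the point is that
           the '.' in $needle stops acting as a metacharacter once it is
           quoted.

```perl
use strict;
use warnings;

my $needle = "a.c";

# Unquoted: '.' is a metacharacter, so 'abc' matches.
my $loose = ("abc" =~ /$needle/) ? 1 : 0;

# Quoted: '.' must match a literal period.
my $quoted_miss = ("abc" =~ /\Q$needle\E/) ? 1 : 0;
my $quoted_hit  = ("a.c" =~ /\Q$needle\E/) ? 1 : 0;

print "loose=$loose quoted_miss=$quoted_miss quoted_hit=$quoted_hit\n";
```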
1771
1772       "\Q", "\L", "\l", "\U", "\u" and "\E" are actually part of double-
1773       quotish syntax, and not part of regexp syntax proper.  They will work
1774       if they appear in a regular expression embedded directly in a program,
1775       but not when contained in a string that is interpolated in a pattern.
1776
1777       Perl regexps can handle more than just the standard ASCII character
1778       set.  Perl supports Unicode, a standard for representing the alphabets
1779       from virtually all of the world's written languages, and a host of
1780       symbols.  Perl's text strings are Unicode strings, so they can contain
1781       characters with a value (codepoint or character number) higher than
1782       255.
1783
1784       What does this mean for regexps? Well, regexp users don't need to know
1785       much about Perl's internal representation of strings.  But they do need
1786       to know 1) how to represent Unicode characters in a regexp and 2) that
1787       a matching operation will treat the string to be searched as a sequence
1788       of characters, not bytes.  The answer to 1) is that Unicode characters
1789       greater than "chr(255)" are represented using the "\x{hex}" notation,
1790       because "\x"XY (without curly braces and XY are two hex digits) doesn't
1791       go further than 255.  (Starting in Perl 5.14, if you're an octal fan,
1792       you can also use "\o{oct}".)
1793
1794           /\x{263a}/;  # match a Unicode smiley face :)
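
           To see that such a character is treated as a single unit, the
           sketch below builds a string with "chr()" and matches it.  The
           variable names are illustrative, and the character itself is not
           printed so that no output encoding needs to be set up.

```perl
use strict;
use warnings;

my $smiley = chr(0x263a);                  # WHITE SMILING FACE
my $text   = "Have a nice day $smiley";

my $hit    = ($text =~ /\x{263a}/) ? 1 : 0;
my $length = length $smiley;               # counted in characters, not bytes

printf "U+%04X matched: %s, length %d\n",
    ord($smiley), ($hit ? "yes" : "no"), $length;
```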
1795
1796       NOTE: In Perl 5.6.0 it used to be that one needed to say "use utf8" to
1797       use any Unicode features.  This is no longer the case: for almost all
1798       Unicode processing, the explicit "utf8" pragma is not needed.  (The
1799       only case where it matters is when your Perl script itself is written
1800       in Unicode and encoded in UTF-8; then an explicit "use utf8" is needed.)
1801
1802       Figuring out the hexadecimal sequence of a Unicode character you want
1803       or deciphering someone else's hexadecimal Unicode regexp is about as
1804       much fun as programming in machine code.  So another way to specify
1805       Unicode characters is to use the named character escape sequence
1806       "\N{name}".  name is a name for the Unicode character, as specified in
1807       the Unicode standard.  For instance, if we wanted to represent or match
1808       the astrological sign for the planet Mercury, we could use
1809
1810           $x = "abc\N{MERCURY}def";
1811           $x =~ /\N{MERCURY}/;   # matches
1812
1813       One can also use "short" names:
1814
1815           print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n";
1816           print "\N{greek:Sigma} is an upper-case sigma.\n";
1817
1818       You can also restrict names to a certain alphabet by specifying the
1819       charnames pragma:
1820
1821           use charnames qw(greek);
1822           print "\N{sigma} is Greek sigma\n";
1823
1824       An index of character names is available on-line from the Unicode
1825       Consortium, <http://www.unicode.org/charts/charindex.html>; explanatory
1826       material with links to other resources at
1827       <http://www.unicode.org/standard/where>.
1828
1829       The answer to requirement 2) is that a regexp (mostly) uses Unicode
1830       characters.  The "mostly" is for messy backward compatibility reasons,
1831       but starting in Perl 5.14, any regexp compiled in the scope of a "use
1832       feature 'unicode_strings'" (which is automatically turned on within the
1833       scope of a "use 5.012" or higher) will turn that "mostly" into
1834       "always".  If you want to handle Unicode properly, you should ensure
1835       that 'unicode_strings' is turned on.  Internally, a string is encoded
1836       to bytes using either UTF-8 or a native 8 bit encoding, depending on the
1837       history of the string, but conceptually it is a sequence of characters,
1838       not bytes. See perlunitut for a tutorial about that.
1839
1840       Let us now discuss Unicode character classes, usually called
1841       "character properties".  These are represented by the "\p{name}" escape
1842       sequence.  The negation of this is "\P{name}".  For example, to match
1843       lower and uppercase characters,
1844
1845           $x = "BOB";
1846           $x =~ /^\p{IsUpper}/;   # matches, uppercase char class
1847           $x =~ /^\P{IsUpper}/;   # doesn't match, char class sans uppercase
1848           $x =~ /^\p{IsLower}/;   # doesn't match, lowercase char class
1849           $x =~ /^\P{IsLower}/;   # matches, char class sans lowercase
1850
1851       (The "Is" is optional.)
1852
1853       There are many, many Unicode character properties.  For the full list
1854       see perluniprops.  Most of them have synonyms with shorter names, also
1855       listed there.  Some synonyms are a single character.  For these, you
1856       can drop the braces.  For instance, "\pM" is the same thing as
1857       "\p{Mark}", meaning things like accent marks.
1858
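           As a small illustration of the single-character shortcut, the sketch
           below strips accent marks with "\pM".  It assumes the core module
           Unicode::Normalize is available and uses its "NFD" function to
           decompose each accented letter into a base letter plus a combining
           mark; the sample word and variable names are our own.

           ```perl
               use strict;
               use warnings;
               use Unicode::Normalize qw(NFD);

               # "resume" with two acute accents, written with code points so the
               # example does not depend on the encoding of the source file
               my $word = "r\x{E9}sum\x{E9}";

               # decompose, then delete every character that has the Mark property
               (my $plain = NFD($word)) =~ s/\pM//g;

               print "$plain\n";    # prints: resume
           ```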
1859       The Unicode "\p{Script}" and "\p{Script_Extensions}" properties are
1860       used to categorize every Unicode character into the language script it
1861       is written in.  ("Script_Extensions" is an improved version of
1862       "Script", which is retained for backward compatibility, and so you
1863       should generally use "Script_Extensions".)  For example, English,
1864       French, and a bunch of other European languages are written in the
1865       Latin script.  But there is also the Greek script, the Thai script, the
1866       Katakana script, etc.  You can test whether a character is in a
1867       particular script (based on "Script_Extensions") with, for example
1868       "\p{Latin}", "\p{Greek}", or "\p{Katakana}".  To test if it isn't in
1869       the Balinese script, you would use "\P{Balinese}".
1870
1871       What we have described so far is the single form of the "\p{...}"
1872       character classes.  There is also a compound form which you may run
1873       into.  These look like "\p{name=value}" or "\p{name:value}" (the equals
1874       sign and colon can be used interchangeably).  These are more general
1875       than the single form, and in fact most of the single forms are just
1876       Perl-defined shortcuts for common compound forms.  For example, the
1877       script examples in the previous paragraph could be written equivalently
1878       as "\p{Script_Extensions=Latin}", "\p{Script_Extensions:Greek}",
1879       "\p{script_extensions=katakana}", and "\P{script_extensions=balinese}"
1880       (case is irrelevant between the "{}" braces).  You may never have to
1881       use the compound forms, but sometimes it is necessary, and their use
1882       can make your code easier to understand.
1883
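           To see that the single and compound forms really are interchangeable,
           here is a minimal sketch.  The test characters are our own choice, and
           the "Script_Extensions" property is assumed to be supported by the
           Perl in use (reasonably recent versions are).

           ```perl
               use strict;
               use warnings;

               my $alpha = "\x{3B1}";    # GREEK SMALL LETTER ALPHA

               # single form and compound form of the same script test
               my $single   = $alpha =~ /\p{Greek}/                    ? 1 : 0;
               my $compound = $alpha =~ /\p{Script_Extensions=Greek}/  ? 1 : 0;

               # negated compound form; case is irrelevant inside the braces
               my $not_latin = $alpha =~ /\P{script_extensions=latin}/ ? 1 : 0;

               # the same idea for a General_Category shortcut
               my $lu_short = "B" =~ /\p{Lu}/                                ? 1 : 0;
               my $lu_long  = "B" =~ /\p{General_Category=Uppercase_Letter}/ ? 1 : 0;

               print "$single $compound $not_latin $lu_short $lu_long\n";
           ```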
1884       "\X" is an abbreviation for a character class that comprises a Unicode
1885       extended grapheme cluster.  This represents a "logical character": what
1886       appears to be a single character, but may be represented internally by
1887       more than one.  As an example, using the Unicode full names,
1888       "A + COMBINING RING" is a grapheme cluster with base character "A" and
1889       combining character "COMBINING RING", which translates in Danish to "A"
1890       with the circle atop it, as in the word Ångstrom.
1891
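           The difference between characters and logical characters can be
           checked with a short sketch.  The string below is our own example; it
           spells the base letter and the combining ring (U+030A) as two separate
           code points.

           ```perl
               use strict;
               use warnings;

               # "A", COMBINING RING ABOVE, then seven plain letters
               my $word = "A\x{30A}ngstrom";

               my $code_points = length $word;         # counts every code point
               my @clusters    = $word =~ /\X/g;       # one element per grapheme cluster

               print "$code_points code points, ", scalar(@clusters), " clusters\n";
               # prints: 9 code points, 8 clusters
           ```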
1892       For complete and current information about Unicode, see the latest
1893       Unicode standard, or the Unicode Consortium's website
1894       <http://www.unicode.org>.
1895
1896       As if all those classes weren't enough, Perl also defines POSIX-style
1897       character classes.  These have the form "[:name:]", with name the name
1898       of the POSIX class.  The POSIX classes are "alpha", "alnum", "ascii",
1899       "cntrl", "digit", "graph", "lower", "print", "punct", "space", "upper",
1900       and "xdigit", and two extensions, "word" (a Perl extension to match
1901       "\w"), and "blank" (a GNU extension).  The "/a" modifier restricts
1902       these to matching just in the ASCII range; otherwise they can match the
1903       same as their corresponding Perl Unicode classes: "[:upper:]" is the
1904       same as "\p{IsUpper}", etc.  (There are some exceptions and gotchas
1905       with this; see perlrecharclass for a full discussion.) The "[:digit:]",
1906       "[:word:]", and "[:space:]" correspond to the familiar "\d", "\w", and
1907       "\s" character classes.  To negate a POSIX class, put a '^' in front of
1908       the name, so that, e.g., "[:^digit:]" corresponds to "\D" and, under
1909       Unicode, "\P{IsDigit}".  The Unicode and POSIX character classes can be
1910       used just like "\d", with the exception that POSIX character classes
1911       can only be used inside of a character class:
1912
1913           /\s+[abc[:digit:]xyz]\s*/;  # match a,b,c,x,y,z, or a digit
1914           /^=item\s[[:digit:]]/;      # match '=item',
1915                                       # followed by a space and a digit
1916           /\s+[abc\p{IsDigit}xyz]\s+/;  # match a,b,c,x,y,z, or a digit
1917           /^=item\s\p{IsDigit}/;        # match '=item',
1918                                         # followed by a space and a digit
1919
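           The effect of the "/a" modifier on a POSIX class can be sketched as
           follows.  The sample digit (ARABIC-INDIC DIGIT THREE, U+0663) is our
           own choice, and "/a" requires Perl 5.14 or later.

           ```perl
               use strict;
               use warnings;

               my $arabic_three = "\x{663}";   # a decimal digit outside ASCII

               my $unicode_digit = $arabic_three =~ /^[[:digit:]]$/  ? 1 : 0;  # Unicode rules
               my $ascii_digit   = $arabic_three =~ /^[[:digit:]]$/a ? 1 : 0;  # ASCII only
               my $plain_seven   = "7"           =~ /^[[:digit:]]$/a ? 1 : 0;
               my $negated       = "x"           =~ /^[[:^digit:]]$/ ? 1 : 0;  # like \D

               print "$unicode_digit $ascii_digit $plain_seven $negated\n";
               # prints: 1 0 1 1
           ```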
1920       Whew! That is all the rest of the characters and character classes.
1921
1922   Compiling and saving regular expressions
1923       In Part 1 we mentioned that Perl compiles a regexp into a compact
1924       sequence of opcodes.  Thus, a compiled regexp is a data structure that
1925       can be stored once and used again and again.  The regexp quote "qr//"
1926       does exactly that: "qr/string/" compiles the "string" as a regexp and
1927       transforms the result into a form that can be assigned to a variable:
1928
1929           $reg = qr/foo+bar?/;  # reg contains a compiled regexp
1930
1931       Then $reg can be used as a regexp:
1932
1933           $x = "fooooba";
1934           $x =~ $reg;     # matches, just like /foo+bar?/
1935           $x =~ /$reg/;   # same thing, alternate form
1936
1937       $reg can also be interpolated into a larger regexp:
1938
1939           $x =~ /(abc)?$reg/;  # still matches
1940
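           One detail worth trying out: a compiled regexp keeps its own
           modifiers when it is interpolated into a larger regexp.  In the
           sketch below (variable names are ours), "/i" was given to "qr//", so
           it applies to the "foo+bar?" part but not to the "(abc)?" that
           surrounds it.

           ```perl
               use strict;
               use warnings;

               my $reg = qr/foo+bar?/i;        # case-insensitive, compiled once
               print "$reg\n";                 # e.g. (?^i:foo+bar?) on Perl 5.14 and later

               my $larger = qr/^(abc)?$reg\z/; # no /i on the enclosing regexp

               my $lower_prefix = "abcFOOOBA" =~ $larger ? 1 : 0;  # 1: only $reg ignores case
               my $upper_prefix = "ABCfooba"  =~ $larger ? 1 : 0;  # 0: "ABC" is not "abc"
               print "$lower_prefix $upper_prefix\n";
           ```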
1941       As with the matching operator, the regexp quote can use different
1942       delimiters, e.g., "qr!!", "qr{}" or "qr~~".  Apostrophes as delimiters
1943       ("qr''") inhibit any interpolation.
1944
1945       Pre-compiled regexps are useful for creating dynamic matches that don't
1946       need to be recompiled each time they are encountered.  Using pre-
1947       compiled regexps, we write a "grep_step" program which greps for a
1948       sequence of patterns, advancing to the next pattern as soon as one has
1949       been satisfied.
1950
1951           % cat > grep_step
1952           #!/usr/bin/perl
1953           # grep_step - match <number> regexps, one after the other
1954           # usage: grep_step <number> regexp1 regexp2 ... file1 file2 ...
1955
1956           $number = shift;
1957           $regexp[$_] = shift foreach (0..$number-1);
1958           @compiled = map qr/$_/, @regexp;
1959           while ($line = <>) {
1960               if ($line =~ /$compiled[0]/) {
1961                   print $line;
1962                   shift @compiled;
1963                   last unless @compiled;
1964               }
1965           }
1966           ^D
1967
1968           % grep_step 3 shift print last grep_step
1969           $number = shift;
1970                   print $line;
1971                   last unless @compiled;
1972
1973       Storing pre-compiled regexps in an array @compiled allows us to simply
1974       loop through the regexps without any recompilation, thus gaining
1975       flexibility without sacrificing speed.
1976
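           The same stepping logic can be tried without files or a shell.  The
           following is only a sketch: the subroutine name "grep_step", its
           argument layout (two array references) and the sample lines are our
           assumptions, not part of the program above.

           ```perl
               use strict;
               use warnings;

               # Return the lines that satisfy the patterns one after the other.
               sub grep_step {
                   my ($patterns, $lines) = @_;
                   my @compiled = map { qr/$_/ } @$patterns;   # compile each pattern once
                   my @hits;
                   for my $line (@$lines) {
                       last unless @compiled;                  # every pattern satisfied
                       if ($line =~ $compiled[0]) {
                           push @hits, $line;
                           shift @compiled;                    # advance to the next pattern
                       }
                   }
                   return @hits;
               }

               my @hits = grep_step(
                   [ qw(shift print last) ],
                   [ '$number = shift;', 'while ($line = <>) {', 'print $line;',
                     'shift @compiled;', 'last unless @compiled;' ],
               );
               print "$_\n" for @hits;
           ```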
1977   Composing regular expressions at runtime
1978       Backtracking is more efficient than repeated tries with different
1979       regular expressions.  If there are several regular expressions and a
1980       match with any of them is acceptable, then it is possible to combine
1981       them into a set of alternatives.  If the individual expressions are
1982       input data, this can be done by programming a join operation.  We'll
1983       exploit this idea in an improved version of the "simple_grep" program:
1984       a program that matches multiple patterns:
1985
1986           % cat > multi_grep
1987           #!/usr/bin/perl
1988           # multi_grep - match any of <number> regexps
1989           # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ...
1990
1991           $number = shift;
1992           $regexp[$_] = shift foreach (0..$number-1);
1993           $pattern = join '|', @regexp;
1994
1995           while ($line = <>) {
1996               print $line if $line =~ /$pattern/;
1997           }
1998           ^D
1999
2000           % multi_grep 2 shift for multi_grep
2001           $number = shift;
2002           $regexp[$_] = shift foreach (0..$number-1);
2003
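           When the joined patterns come from users, two refinements may be
           worth considering; the sketch below is one way to do it, with sample
           patterns of our own.  Wrapping each pattern in "(?:...)" keeps an
           embedded modifier such as "(?i)" from spilling over into the
           alternatives that follow it, and "qr//" compiles the combined
           pattern once.

           ```perl
               use strict;
               use warnings;

               my @regexp = ('(?i)doctor', 'Johnson');

               # isolate each alternative in its own non-capturing group
               my $pattern  = join '|', map { "(?:$_)" } @regexp;
               my $combined = qr/$pattern/;

               my @lines   = ("DOCTOR Who", "Dr. Johnson", "mr. johnson");
               my @matched = grep { $_ =~ $combined } @lines;

               print "$_\n" for @matched;   # "mr. johnson" is not printed
           ```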
2004       Sometimes it is advantageous to construct a pattern from the input that
2005       is to be analyzed and use the permissible values on the left hand side
2006       of the matching operations.  As an example for this somewhat
2007       paradoxical situation, let's assume that our input contains a command
2008       verb which should match one out of a set of available command verbs,
2009       with the additional twist that commands may be abbreviated as long as
2010       the given string is unique. The program below demonstrates the basic
2011       algorithm.
2012
2013           % cat > keymatch
2014           #!/usr/bin/perl
2015           $kwds = 'copy compare list print';
2016           while( $cmd = <> ){
2017               $cmd =~ s/^\s+|\s+$//g;  # trim leading and trailing spaces
2018               if( ( @matches = $kwds =~ /\b$cmd\w*/g ) == 1 ){
2019                   print "command: '@matches'\n";
2020               } elsif( @matches == 0 ){
2021                   print "no such command: '$cmd'\n";
2022               } else {
2023                   print "not unique: '$cmd' (could be one of: @matches)\n";
2024               }
2025           }
2026           ^D
2027
2028           % keymatch
2029           li
2030           command: 'list'
2031           co
2032           not unique: 'co' (could be one of: copy compare)
2033           printer
2034           no such command: 'printer'
2035
2036       Rather than trying to match the input against the keywords, we match
2037       the combined set of keywords against the input.  The pattern matching
2038       operation "$kwds =~ /\b$cmd\w*/g" does several things at the same
2039       time. It makes sure that the given command begins where a keyword
2040       begins ("\b"). It tolerates abbreviations due to the added "\w*". It
2041       tells us the number of matches ("scalar @matches") and all the keywords
2042       that were actually matched.  You could hardly ask for more.
2043
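           For experimenting, the lookup can be wrapped in a subroutine.  This
           is a sketch under our own assumptions: the name "match_command" is
           hypothetical, and "\Q...\E" is added on the assumption that the
           command should be treated as literal text rather than as a regexp.

           ```perl
               use strict;
               use warnings;

               my $kwds = 'copy compare list print';

               # In list context, return every keyword that $cmd abbreviates.
               sub match_command {
                   my ($cmd) = @_;
                   $cmd =~ s/^\s+|\s+$//g;               # trim leading and trailing spaces
                   return $kwds =~ /\b\Q$cmd\E\w*/g;     # /g in list context: all matches
               }

               for my $cmd ("li", " co ", "printer") {
                   my @matches = match_command($cmd);
                   print "'$cmd' -> (@matches)\n";
               }
           ```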
2044   Embedding comments and modifiers in a regular expression
2045       Starting with this section, we will be discussing Perl's set of
2046       extended patterns.  These are extensions to the traditional regular
2047       expression syntax that provide powerful new tools for pattern matching.
2048       We have already seen extensions in the form of the minimal matching
2049       constructs "??", "*?", "+?", "{n,m}?", and "{n,}?".  Most of the
2050       extensions below have the form "(?char...)", where the "char" is a
2051       character that determines the type of extension.
2052
2053       The first extension is an embedded comment "(?#text)".  This embeds a
2054       comment into the regular expression without affecting its meaning.  The
2055       comment should not have any closing parentheses in the text.  An
2056       example is
2057
2058           /(?# Match an integer:)[+-]?\d+/;
2059
2060       This style of commenting has been largely superseded by the raw,
2061       freeform commenting that is allowed with the "/x" modifier.
2062
2063       Most modifiers, such as "/i", "/m", "/s" and "/x" (or any combination
2064       thereof) can also be embedded in a regexp using "(?i)", "(?m)", "(?s)",
2065       and "(?x)".  For instance,
2066
2067           /(?i)yes/;  # match 'yes' case insensitively
2068           /yes/i;     # same thing
2069           /(?x)(          # freeform version of an integer regexp
2070                    [+-]?  # match an optional sign
2071                    \d+    # match a sequence of digits
2072                )
2073           /x;
2074
2075       Embedded modifiers can have two important advantages over the usual
2076       modifiers.  Embedded modifiers allow a custom set of modifiers for each
2077       regexp pattern.  This is great for matching an array of regexps that
2078       must have different modifiers:
2079
2080           $pattern[0] = '(?i)doctor';
2081           $pattern[1] = 'Johnson';
2082           ...
2083           while (<>) {
2084               foreach $patt (@pattern) {
2085                   print if /$patt/;
2086               }
2087           }
2088
2089       The second advantage is that embedded modifiers (except "/p", which
2090       modifies the entire regexp) only affect the regexp inside the group the
2091       embedded modifier is contained in.  So grouping can be used to localize
2092       the modifier's effects:
2093
2094           /Answer: ((?i)yes)/;  # matches 'Answer: yes', 'Answer: YES', etc.
2095
2096       Embedded modifiers can also turn off any modifiers already present by
2097       using, e.g., "(?-i)".  Modifiers can also be combined into a single
2098       expression, e.g., "(?s-i)" turns on single line mode and turns off case
2099       insensitivity.
2100
2101       Embedded modifiers may also be added to a non-capturing grouping.
2102       "(?i-m:regexp)" is a non-capturing grouping that matches "regexp" case
2103       insensitively and turns off multi-line mode.
2104
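           A brief sketch of switching modifiers off inside a regexp; the
           sample strings and variable names are ours.

           ```perl
               use strict;
               use warnings;

               # /i applies to the label, (?-i:...) turns it off for the answer
               my $strict_answer = qr/answer: (?-i:yes|no)/i;
               my $label_any_case = "ANSWER: yes" =~ $strict_answer ? 1 : 0;   # 1
               my $answer_upper   = "answer: YES" =~ $strict_answer ? 1 : 0;   # 0

               # (?s-i) turns on single line mode and turns off case insensitivity
               my $mixed      = qr/(?s-i)a.b/i;
               my $dot_newline = "a\nb" =~ $mixed ? 1 : 0;   # 1: '.' matches "\n"
               my $wrong_case  = "A\nB" =~ $mixed ? 1 : 0;   # 0: case matters again

               print "$label_any_case $answer_upper $dot_newline $wrong_case\n";
               # prints: 1 0 1 0
           ```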
2105   Looking ahead and looking behind
2106       This section concerns the lookahead and lookbehind assertions.  First,
2107       a little background.
2108
2109       In Perl regular expressions, most regexp elements "eat up" a certain
2110       amount of string when they match.  For instance, the regexp element
2111       "[abc]" eats up one character of the string when it matches, in the
2112       sense that Perl moves to the next character position in the string
2113       after the match.  There are some elements, however, that don't eat up
2114       characters (advance the character position) if they match.  The
2115       examples we have seen so far are the anchors.  The anchor '^' matches
2116       the beginning of the line, but doesn't eat any characters.  Similarly,
2117       the word boundary anchor "\b" matches wherever a character matching
2118       "\w" is next to a character that doesn't, but it doesn't eat up any
2119       characters itself.  Anchors are examples of zero-width assertions:
2120       zero-width, because they consume no characters, and assertions, because
2121       they test some property of the string.  In the context of our walk in
2122       the woods analogy to regexp matching, most regexp elements move us
2123       along a trail, but anchors have us stop a moment and check our
2124       surroundings.  If the local environment checks out, we can proceed
2125       forward.  But if the local environment doesn't satisfy us, we must
2126       backtrack.
2127
2128       Checking the environment entails either looking ahead on the trail,
2129       looking behind, or both.  '^' looks behind, to see that there are no
2130       characters before.  '$' looks ahead, to see that there are no
2131       characters after.  "\b" looks both ahead and behind, to see if the
2132       characters on either side differ in their "word-ness".
2133
2134       The lookahead and lookbehind assertions are generalizations of the
2135       anchor concept.  Lookahead and lookbehind are zero-width assertions
2136       that let us specify which characters we want to test for.  The
2137       lookahead assertion is denoted by "(?=regexp)" and the lookbehind
2138       assertion is denoted by "(?<=fixed-regexp)".  Some examples are
2139
2140           $x = "I catch the housecat 'Tom-cat' with catnip";
2141           $x =~ /cat(?=\s)/;   # matches 'cat' in 'housecat'
2142           @catwords = ($x =~ /(?<=\s)cat\w+/g);  # matches,
2143                                                  # $catwords[0] = 'catch'
2144                                                  # $catwords[1] = 'catnip'
2145           $x =~ /\bcat\b/;  # matches 'cat' in 'Tom-cat'
2146           $x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in
2147                                     # middle of $x
2148
2149       Note that the parentheses in "(?=regexp)" and "(?<=regexp)" are non-
2150       capturing, since these are zero-width assertions.  Thus in the second
2151       regexp, the substrings captured are those of the whole regexp itself.
2152       Lookahead "(?=regexp)" can match arbitrary regexps, but lookbehind
2153       "(?<=fixed-regexp)" only works for regexps of fixed width, i.e., a
2154       fixed number of characters long.  Thus "(?<=(ab|bc))" is fine, but
2155       "(?<=(ab)*)" is not.  The negated versions of the lookahead and
2156       lookbehind assertions are denoted by "(?!regexp)" and
2157       "(?<!fixed-regexp)" respectively.  They evaluate true if the regexps do
2158       not match:
2159
2160           $x = "foobar";
2161           $x =~ /foo(?!bar)/;  # doesn't match, 'bar' follows 'foo'
2162           $x =~ /foo(?!baz)/;  # matches, 'baz' doesn't follow 'foo'
2163           $x =~ /(?<!\s)foo/;  # matches, there is no \s before 'foo'
2164
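           Because these assertions consume nothing, they are handy for
           inserting text at a position.  The sketch below combines a
           lookbehind, a lookahead and a negated lookahead to add thousands
           separators; the helper name "commify" is our own, and the code
           assumes plain integers (digits after a decimal point would be
           grouped too).

           ```perl
               use strict;
               use warnings;

               sub commify {
                   my ($text) = @_;
                   # a position with a digit before it, and after it one or more
                   # groups of three digits that are not followed by another digit
                   $text =~ s/(?<=\d)(?=(?:\d{3})+(?!\d))/,/g;
                   return $text;
               }

               print commify("1234567"), "\n";           # prints: 1,234,567
               print commify("total 98765 items"), "\n"; # prints: total 98,765 items
               print commify("123"), "\n";               # prints: 123
           ```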
2165       Here is an example where a string containing blank-separated words,
2166       numbers and single dashes is to be split into its components.  Using
2167       "/\s+/" alone won't work, because spaces are not required between
2168       dashes, or between a word and a dash. Additional places for a split are
2169       established by looking ahead and behind:
2170
2171           $str = "one two - --6-8";
2172           @toks = split / \s+              # a run of spaces
2173                         | (?<=\S) (?=-)    # any non-space followed by '-'
2174                         | (?<=-)  (?=\S)   # a '-' followed by any non-space
2175                         /x, $str;          # @toks = qw(one two - - - 6 - 8)
2176
2177   Using independent subexpressions to prevent backtracking
2178       Independent subexpressions are regular expressions, in the context of a
2179       larger regular expression, that function independently of the larger
2180       regular expression.  That is, they consume as much or as little of the
2181       string as they wish without regard for the ability of the larger regexp
2182       to match.  Independent subexpressions are represented by "(?>regexp)".
2183       We can illustrate their behavior by first considering an ordinary
2184       regexp:
2185
2186           $x = "ab";
2187           $x =~ /a*ab/;  # matches
2188
2189       This obviously matches, but in the process of matching, the
2190       subexpression "a*" first grabbed the 'a'.  Doing so, however, wouldn't
2191       allow the whole regexp to match, so after backtracking, "a*" eventually
2192       gave back the 'a' and matched the empty string.  Here, what "a*"
2193       matched was dependent on what the rest of the regexp matched.
2194
2195       Contrast that with an independent subexpression:
2196
2197           $x =~ /(?>a*)ab/;  # doesn't match!
2198
2199       The independent subexpression "(?>a*)" doesn't care about the rest of
2200       the regexp, so it sees an 'a' and grabs it.  Then the rest of the
2201       regexp "ab" cannot match.  Because "(?>a*)" is independent, there is no
2202       backtracking and the independent subexpression does not give up its
2203       'a'.  Thus the match of the regexp as a whole fails.  A similar
2204       behavior occurs with completely independent regexps:
2205
2206           $x = "ab";
2207           $x =~ /a*/g;   # matches, eats an 'a'
2208           $x =~ /\Gab/g; # doesn't match, no 'a' available
2209
2210       Here "/g" and "\G" create a "tag team" handoff of the string from one
2211       regexp to the other.  Regexps with an independent subexpression are
2212       much like this, with a handoff of the string to the independent
2213       subexpression, and a handoff of the string back to the enclosing
2214       regexp.
2215
2216       The ability of an independent subexpression to prevent backtracking can
2217       be quite useful.  Suppose we want to match a non-empty string enclosed
2218       in parentheses up to two levels deep.  Then the following regexp
2219       matches:
2220
2221           $x = "abc(de(fg)h";  # unbalanced parentheses
2222           $x =~ /\( ( [ ^ () ]+ | \( [ ^ () ]* \) )+ \)/xx;
2223
2224       The regexp matches an open parenthesis, one or more copies of an
2225       alternation, and a close parenthesis.  The alternation is two-way, with
2226       the first alternative "[^()]+" matching a substring with no parentheses
2227       and the second alternative "\([^()]*\)"  matching a substring delimited
2228       by parentheses.  The problem with this regexp is that it is
2229       pathological: it has nested indeterminate quantifiers of the form
2230       "(a+|b)+".  We discussed in Part 1 how nested quantifiers like this
2231       could take an exponentially long time to execute if there was no match
2232       possible.  To prevent the exponential blowup, we need to prevent
2233       useless backtracking at some point.  This can be done by enclosing the
2234       inner quantifier as an independent subexpression:
2235
2236           $x =~ /\( ( (?> [ ^ () ]+ ) | \([ ^ () ]* \) )+ \)/xx;
2237
2238       Here, "(?>[^()]+)" breaks the degeneracy of string partitioning by
2239       gobbling up as much of the string as possible and keeping it.   Then
2240       match failures fail much more quickly.
2241
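           Perl 5.10 and later also provide possessive quantifiers, written by
           appending a '+' to an ordinary quantifier, which behave like an
           independent subexpression around the quantified element.  A minimal
           comparison, using the string from the examples above:

           ```perl
               use strict;
               use warnings;

               my $x = "ab";

               my $ordinary   = $x =~ /a*ab/     ? 1 : 0;  # a* gives back its 'a'
               my $atomic     = $x =~ /(?>a*)ab/ ? 1 : 0;  # (?>a*) keeps its 'a'
               my $possessive = $x =~ /a*+ab/    ? 1 : 0;  # a*+ behaves like (?>a*)

               print "$ordinary $atomic $possessive\n";    # prints: 1 0 0
           ```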
2242   Conditional expressions
2243       A conditional expression is a form of if-then-else statement that
2244       allows one to choose which patterns are to be matched, based on some
2245       condition.  There are two types of conditional expression:
2246       "(?(condition)yes-regexp)" and "(?(condition)yes-regexp|no-regexp)".
2247       "(?(condition)yes-regexp)" is like an 'if () {}' statement in Perl.  If
2248       the condition is true, the yes-regexp will be matched.  If the
2249       condition is false, the yes-regexp will be skipped and Perl will move
2250       onto the next regexp element.  The second form is like an
2251       'if () {} else {}' statement in Perl.  If the condition is true, the
2252       yes-regexp will be matched, otherwise the no-regexp will be matched.
2253
2254       The condition can have several forms.  The first form is simply an
2255       integer in parentheses "(integer)".  It is true if the corresponding
2256       backreference "\integer" matched earlier in the regexp.  The same thing
2257       can be done with a name associated with a capture group, written as
2258       "(<name>)" or "('name')".  The second form is a bare zero-width
2259       assertion "(?...)", either a lookahead, a lookbehind, or a code
2260       assertion (discussed in the next section).  The third set of forms
2261       provides tests that return true if the expression is executed within a
2262       recursion ("(R)") or is being called from some capturing group,
2263       referenced either by number ("(R1)", "(R2)",...) or by name
2264       ("(R&name)").
2265
2266       The integer or name form of the "condition" allows us to choose, with
2267       more flexibility, what to match based on what matched earlier in the
2268       regexp. This searches for words of the form "$x$x" or "$x$y$y$x":
2269
2270           % simple_grep '^(\w+)(\w+)?(?(2)\g2\g1|\g1)$' /usr/dict/words
2271           beriberi
2272           coco
2273           couscous
2274           deed
2275           ...
2276           toot
2277           toto
2278           tutu
2279
2280       The lookbehind "condition" allows, along with backreferences, an
2281       earlier part of the match to influence a later part of the match.  For
2282       instance,
2283
2284           /[ATGC]+(?(?<=AA)G|C)$/;
2285
2286       matches a DNA sequence such that it either ends in "AAG", or some other
2287       base pair combination and 'C'.  Note that the form is "(?(?<=AA)G|C)"
2288       and not "(?((?<=AA))G|C)"; for the lookahead, lookbehind or code
2289       assertions, the parentheses around the conditional are not needed.
2290
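           The integer form of the condition is perhaps easiest to grasp with a
           small sketch (the pattern and test strings are our own): a number
           may be wrapped in parentheses, but a closing parenthesis is required
           exactly when an opening one was matched.

           ```perl
               use strict;
               use warnings;

               # group 1 is the optional '('; (?(1)\)) demands ')' only if it matched
               my $number = qr/^(\()?\d+(?(1)\))$/;

               for my $s ("(123)", "123", "(123", "123)") {
                   printf "%-6s %s\n", $s, $s =~ $number ? "matches" : "doesn't match";
               }
           ```

           Only the first two strings match.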
2291   Defining named patterns
2292       Some regular expressions use identical subpatterns in several places.
2293       Starting with Perl 5.10, it is possible to define named subpatterns in
2294       a section of the pattern so that they can be called up by name anywhere
2295       in the pattern.  The syntactic pattern for this definition group is
2296       "(?(DEFINE)(?<name>pattern)...)".  An insertion of a named pattern is
2297       written as "(?&name)".
2298
2299       The example below illustrates this feature using the pattern for
2300       floating point numbers that was presented earlier on.  The three
2301       subpatterns that are used more than once are the optional sign, the
2302       digit sequence for an integer and the decimal fraction.  The "DEFINE"
2303       group at the end of the pattern contains their definition.  Notice that
2304       the decimal fraction pattern is the first place where we can reuse the
2305       integer pattern.
2306
2307          /^ (?&osg)\ * ( (?&int)(?&dec)? | (?&dec) )
2308             (?: [eE](?&osg)(?&int) )?
2309           $
2310           (?(DEFINE)
2311             (?<osg>[-+]?)         # optional sign
2312             (?<int>\d++)          # integer
2313             (?<dec>\.(?&int))     # decimal fraction
2314           )/x
2315
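           To try the pattern out, it can be stored with "qr//" and applied to a
           few strings.  This is only a sketch; the variable name "$float" and
           the test strings are ours.

           ```perl
               use strict;
               use warnings;

               my $float = qr/^ (?&osg)\ * ( (?&int)(?&dec)? | (?&dec) )
                                (?: [eE](?&osg)(?&int) )?
                              $
                              (?(DEFINE)
                                (?<osg>[-+]?)         # optional sign
                                (?<int>\d++)          # integer
                                (?<dec>\.(?&int))     # decimal fraction
                              )/x;

               for my $s ("-12.5e3", ".5", "+ 7", "1.", "12e") {
                   printf "%-8s %s\n", $s, $s =~ $float ? "matches" : "doesn't match";
               }
           ```

           The first three strings match; "1." and "12e" do not, because a
           decimal point or an exponent marker must be followed by digits.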
2316   Recursive patterns
2317       This feature (introduced in Perl 5.10) significantly extends the power
2318       of Perl's pattern matching.  By referring to some other capture group
2319       anywhere in the pattern with the construct "(?group-ref)", the pattern
2320       within the referenced group is used as an independent subpattern in
2321       place of the group reference itself.  Because the group reference may
2322       be contained within the group it refers to, it is now possible to apply
2323       pattern matching to tasks that hitherto required a recursive parser.
2324
2325       To illustrate this feature, we'll design a pattern that matches if a
2326       string contains a palindrome. (This is a word or a sentence that,
2327       ignoring spaces, punctuation and case, reads the same backwards as
2328       forwards.) We begin by observing that the empty string or a string
2329       containing just one word character is a palindrome. Otherwise it must
2330       have a word character up front and the same at its end, with another
2331       palindrome in between.
2332
2333           /(?: (\w) (?...Here be a palindrome...) \g{-1} | \w? )/x
2334
2335       Adding "\W*" at either end to eliminate what is to be ignored, we
2336       already have the full pattern:
2337
2338           my $pp = qr/^(\W* (?: (\w) (?1) \g{-1} | \w? ) \W*)$/ix;
2339           for $s ( "saippuakauppias", "A man, a plan, a canal: Panama!" ){
2340               print "'$s' is a palindrome\n" if $s =~ /$pp/;
2341           }
2342
2343       In "(?...)" both absolute and relative backreferences may be used.  The
2344       entire pattern can be reinserted with "(?R)" or "(?0)".  If you prefer
2345       to name your groups, you can use "(?&name)" to recurse into that group.
2346
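           Another task that traditionally called for a recursive parser is
           checking for balanced parentheses.  The sketch below (our own
           example) lets group 1 call itself with "(?1)" wherever a nested
           pair may appear.

           ```perl
               use strict;
               use warnings;

               my $balanced = qr/^ (              # group 1: one balanced pair
                                     \(
                                     (?: [^()]++  # a run of non-parentheses
                                       | (?1)     # or a nested balanced pair
                                     )*
                                     \)
                                   ) $/x;

               for my $s ("(a(b)c)", "((()))", "(a(b)c", "(a))") {
                   printf "%-8s %s\n", $s, $s =~ $balanced ? "balanced" : "not balanced";
               }
           ```

           The first two strings are reported as balanced, the last two are not.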
2347   A bit of magic: executing Perl code in a regular expression
2348       Normally, regexps are a part of Perl expressions.  Code evaluation
2349       expressions turn that around by allowing arbitrary Perl code to be a
2350       part of a regexp.  A code evaluation expression is denoted "(?{code})",
2351       with code a string of Perl statements.
2352
2353       Code expressions are zero-width assertions, and the value they return
2354       depends on their environment.  There are two possibilities: either the
2355       code expression is used as a conditional in a conditional expression
2356       "(?(condition)...)", or it is not.  If the code expression is a
2357       conditional, the code is evaluated and the result (i.e., the result of
2358       the last statement) is used to determine truth or falsehood.  If the
2359       code expression is not used as a conditional, the assertion always
2360       evaluates true and the result is put into the special variable $^R.
2361       The variable $^R can then be used in code expressions later in the
2362       regexp.  Here are some silly examples:
2363
2364           $x = "abcdef";
2365           $x =~ /abc(?{print "Hi Mom!";})def/; # matches,
2366                                                # prints 'Hi Mom!'
2367           $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
2368                                                # no 'Hi Mom!'
2369
2370       Pay careful attention to the next example:
2371
2372           $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match,
2373                                                # no 'Hi Mom!'
2374                                                # but why not?
2375
2376       At first glance, you'd think that it shouldn't print, because obviously
2377       the "ddd" isn't going to match the target string. But look at this
2378       example:
2379
2380           $x =~ /abc(?{print "Hi Mom!";})[dD]dd/; # doesn't match,
2381                                                   # but _does_ print
2382
2383       Hmm. What happened here? If you've been following along, you know that
2384       the above pattern should be effectively (almost) the same as the last
2385       one; enclosing the 'd' in a character class isn't going to change what
2386       it matches. So why does the first not print while the second one does?
2387
2388       The answer lies in the optimizations the regexp engine makes. In the
2389       first case, all the engine sees are plain old characters (aside from
2390       the "?{}" construct). It's smart enough to realize that the string
2391       'ddd' doesn't occur in our target string before actually running the
2392       pattern through. But in the second case, we've tricked it into thinking
2393       that our pattern is more complicated. It takes a look, sees our
2394       character class, and decides that it will have to actually run the
2395       pattern to determine whether or not it matches, and in the process of
2396       running it hits the print statement before it discovers that we don't
2397       have a match.
2398
2399       To take a closer look at how the engine does optimizations, see the
2400       section "Pragmas and debugging" below.
2401
2402       More fun with "?{}":
2403
2404           $x =~ /(?{print "Hi Mom!";})/;       # matches,
2405                                                # prints 'Hi Mom!'
2406           $x =~ /(?{$c = 1;})(?{print "$c";})/;  # matches,
2407                                                  # prints '1'
2408           $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,
2409                                                  # prints '1'
2410
2411       The bit of magic mentioned in the section title occurs when the regexp
2412       backtracks in the process of searching for a match.  If the regexp
2413       backtracks over a code expression and if the variables used within are
2414       localized using "local", the changes in the variables produced by the
2415       code expression are undone! Thus, if we wanted to count how many times
2416       a character got matched inside a group, we could use, e.g.,
2417
2418           $x = "aaaa";
2419           $count = 0;  # initialize 'a' count
2420           $c = "bob";  # test if $c gets clobbered
2421           $x =~ /(?{local $c = 0;})         # initialize count
2422                  ( a                        # match 'a'
2423                    (?{local $c = $c + 1;})  # increment count
2424                  )*                         # do this any number of times,
2425                  aa                         # but match 'aa' at the end
2426                  (?{$count = $c;})          # copy local $c var into $count
2427                 /x;
2428           print "'a' count is $count, \$c variable is '$c'\n";
2429
2430       This prints
2431
2432           'a' count is 2, $c variable is 'bob'
2433
2434       If we replace the "(?{local $c = $c + 1;})" with "(?{$c = $c + 1;})",
2435       the variable changes are not undone during backtracking, and we get
2436
2437           'a' count is 4, $c variable is 'bob'
2438
2439       Note that only localized variable changes are undone.  Other side
2440       effects of code expression execution are permanent.  Thus
2441
2442           $x = "aaaa";
2443           $x =~ /(a(?{print "Yow\n";}))*aa/;
2444
2445       produces
2446
2447          Yow
2448          Yow
2449          Yow
2450          Yow
2451
2452       The result $^R is automatically localized, so that it will behave
2453       properly in the presence of backtracking.
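
       As a minimal sketch of that last point, the counting example above can
       be rewritten to carry the count in $^R instead of a localized
       variable.  The strict and warnings lines and the lexical variables are
       additions made here for illustration, and lexicals inside code
       expressions are assumed to behave as they do in Perl 5.18 or later:

```perl
use strict;
use warnings;

my $x = "aaaa";
my $count;

$x =~ /(?{ 0 })                 # the value of this block, 0, becomes $^R
       (?: a (?{ $^R + 1 }) )*  # each 'a' stores a value one larger in $^R
       aa                       # forces the loop to give back two 'a's
       (?{ $count = $^R })      # copy out the value that survived
      /x;

print "'a' count is $count\n";
```

       If $^R is restored on backtracking as described, the two iterations
       that are given back to "aa" do not contribute, and the count reported
       should be 2 rather than 4.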
2454
2455       This example uses a code expression in a conditional to match a
2456       definite article, either 'the' in English or 'der|die|das' in German:
2457
2458           $lang = 'DE';  # use German
2459           ...
2460           $text = "das";
2461           print "matched\n"
2462               if $text =~ /(?(?{
2463                                 $lang eq 'EN'; # is the language English?
2464                                })
2465                              the |             # if so, then match 'the'
2466                              (der|die|das)     # else, match 'der|die|das'
2467                            )
2468                           /xi;
2469
2470       Note that the syntax here is "(?(?{...})yes-regexp|no-regexp)", not
2471       "(?((?{...}))yes-regexp|no-regexp)".  In other words, in the case of a
2472       code expression, we don't need the extra parentheses around the
2473       conditional.
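
       The same conditional can be wrapped in a small subroutine so that both
       branches are exercised.  This is only a sketch: is_article() and the
       package variable $lang are names invented here, and anchoring with
       "\A" and "\z" is an added assumption that the whole string should be
       an article:

```perl
use strict;
use warnings;

our $lang;   # package variable read by the code expression

# is_article() is a hypothetical helper, not part of the tutorial.
sub is_article {
    my ($language, $text) = @_;
    local $lang = $language;   # restored when the sub returns
    return $text =~ /\A
                     (?(?{ $lang eq 'EN' })   # is the language English?
                         the                  # if so, match 'the'
                       | (?:der|die|das)      # else match a German article
                     )
                     \z/xi ? 1 : 0;
}

for my $case (['EN', 'the'], ['EN', 'das'], ['DE', 'Die'], ['DE', 'the']) {
    my ($language, $text) = @$case;
    printf "%s %-3s => %d\n", $language, $text, is_article($language, $text);
}
```

       Note that the condition is code, so it is evaluated afresh on every
       match attempt and no recompilation is needed when $lang changes.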
2474
2475       If you try to use code expressions where the code text is contained
2476       within an interpolated variable, rather than appearing literally in the
2477       pattern, Perl may surprise you:
2478
2479           $bar = 5;
2480           $pat = '(?{ 1 })';
2481           /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated
2482           /foo(?{ 1 })$bar/;   # compiles ok, $bar interpolated
2483           /foo${pat}bar/;      # compile error!
2484
2485           $pat = qr/(?{ $foo = 1 })/;  # precompile code regexp
2486           /foo${pat}bar/;      # compiles ok
2487
2488       If a regexp has a variable that interpolates a code expression, Perl
2489       treats the regexp as an error. If the code expression is precompiled
2490       into a variable, however, interpolating is ok. The question is, why is
2491       this an error?
2492
2493       The reason is that variable interpolation and code expressions together
2494       pose a security risk.  The combination is dangerous because programmers
2495       who write search engines often take user input and plug it directly
2496       into a regexp:
2497
2498           $regexp = <>;       # read user-supplied regexp
2499           chomp $regexp;      # get rid of possible newline
2500           $text =~ /$regexp/; # search $text for the $regexp
2501
2502       If the $regexp variable contains a code expression, the user could then
2503       execute arbitrary Perl code.  For instance, some joker could search for
2504       "system('rm -rf *');" to erase your files.  In this sense, the
2505       combination of interpolation and code expressions taints your regexp.
2506       So by default, using both interpolation and code expressions in the
2507       same regexp is not allowed.  If you're not concerned about malicious
2508       users, it is possible to bypass this security check by invoking
2509       "use re 'eval'":
2510
2511           use re 'eval';       # throw caution out the door
2512           $bar = 5;
2513           $pat = '(?{ 1 })';
2514           /foo${pat}bar/;      # compiles ok
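
       The effect of the check can be observed at run time.  The following is
       a sketch rather than a recommended practice: it assumes the refusal is
       raised when the interpolated pattern is compiled, so that a block
       "eval" can trap it, and the variable names are illustrative:

```perl
use strict;
use warnings;

my $pat = '(?{ 1 })';   # a code expression held in a plain string

# Without "use re 'eval'" the interpolated pattern is refused.
my $ok      = eval { ("foobar" =~ /foo${pat}bar/) ? 1 : 0 };
my $error   = $@;
my $refused = defined $ok ? 0 : 1;

# Inside this block the check is switched off, so the pattern compiles.
my $matched;
{
    use re 'eval';
    $matched = ("foobar" =~ /foo${pat}bar/) ? 1 : 0;
}

print "refused without the pragma: ", ($refused ? "yes" : "no"), "\n";
print "matched with the pragma:    ", ($matched ? "yes" : "no"), "\n";
```

       Keeping "use re 'eval'" inside the smallest possible block limits the
       amount of code in which untrusted patterns could run Perl code.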
2515
2516       Another form of code expression is the pattern code expression.  The
2517       pattern code expression is like a regular code expression, except that
2518       the result of the code evaluation is treated as a regular expression
2519       and matched immediately.  A simple example is
2520
2521           $length = 5;
2522           $char = 'a';
2523           $x = 'aaaaabb';
2524           $x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'
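
       Because the code runs while the match is in progress, it can also
       refer to text captured earlier in the same match.  The sketch below
       assumes a made-up record format, a decimal length followed by a colon
       and a payload, and uses $1 inside the pattern code expression to build
       the subpattern that consumes the payload:

```perl
use strict;
use warnings;

my $record = "3:abcdef";   # hypothetical format: length, colon, payload
my ($len, $payload);

# ".{$1}" is built at match time, here giving the subpattern ".{3}"
if ($record =~ /^(\d+):((??{ ".{$1}" }))/) {
    ($len, $payload) = ($1, $2);
}

print "length $len, payload '$payload'\n";
```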
2525
2526       This final example contains both ordinary and pattern code expressions.
2527       It detects whether a binary string 1101010010001... has a Fibonacci
2528       spacing 0,1,1,2,3,5,... of the '1's:
2529
2530           $x = "1101010010001000001";
2531           $z0 = ''; $z1 = '0';   # initial conditions
2532           print "It is a Fibonacci sequence\n"
2533               if $x =~ /^1         # match an initial '1'
2534                           (?:
2535                              ((??{ $z0 })) # match some '0'
2536                              1             # and then a '1'
2537                              (?{ $z0 = $z1; $z1 .= $^N; })
2538                           )+   # repeat as needed
2539                         $      # that is all there is
2540                        /x;
2541           printf "Largest sequence matched was %d\n", length($z1)-length($z0);
2542
2543       Remember that $^N is set to whatever was matched by the last completed
2544       capture group. This prints
2545
2546           It is a Fibonacci sequence
2547           Largest sequence matched was 5
2548
2549       Ha! Try that with your garden variety regexp package...
2550
2551       Note that the variables $z0 and $z1 are not substituted when the regexp
2552       is compiled, as happens for ordinary variables outside a code
2553       expression.  Rather, the whole code block is parsed as perl code at the
2554       same time as perl is compiling the code containing the literal regexp
2555       pattern.
2556
2557       This regexp without the "/x" modifier is
2558
2559           /^1(?:((??{ $z0 }))1(?{ $z0 = $z1; $z1 .= $^N; }))+$/
2560
2561       which shows that spaces are still possible in the code parts.
2562       Nevertheless, when working with code and conditional expressions, the
2563       extended form of regexps is all but essential for writing and
2564       debugging them.
2565
2566   Backtracking control verbs
2567       Perl 5.10 introduced a number of control verbs intended to provide
2568       detailed control over the backtracking process, by directly influencing
2569       the regexp engine and by providing monitoring techniques.  See "Special
2570       Backtracking Control Verbs" in perlre for a detailed description.
2571
2572       Below is just one example, illustrating the control verb "(*FAIL)",
2573       which may be abbreviated as "(*F)". If this is inserted in a regexp it
2574       will cause it to fail, just as it would at some mismatch between the
2575       pattern and the string. Processing of the regexp continues as it would
2576       after any "normal" failure, so that, for instance, the next position in
2577       the string or another alternative will be tried. As failing to match
2578       doesn't preserve capture groups or produce results, it may be necessary
2579       to use this in combination with embedded code.
2580
2581          %count = ();
2582          "supercalifragilisticexpialidocious" =~
2583              /([aeiou])(?{ $count{$1}++; })(*FAIL)/i;
2584          printf "%3d '%s'\n", $count{$_}, $_ for (sort keys %count);
2585
2586       The pattern begins with a class matching a subset of letters.  Whenever
2587       this matches, a statement like "$count{'a'}++;" is executed,
2588       incrementing the letter's counter. Then "(*FAIL)" does what it says,
2589       and the regexp engine proceeds according to the book: as long as the
2590       end of the string hasn't been reached, the position is advanced before
2591       looking for another vowel. Thus, match or no match makes no difference,
2592       and the regexp engine proceeds until the entire string has been
2593       inspected.  (It's remarkable that an alternative solution using
2594       something like
2595
2596          $count2{lc($_)}++ for split('', "supercalifragilisticexpialidocious");
2597          printf "%3d '%s'\n", $count2{$_}, $_ for ( qw{ a e i o u } );
2598
2599       is considerably slower.)
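
       Since every forced failure makes the engine backtrack, the same idiom
       can be used to enumerate each way a subpattern could have matched.  A
       small sketch, in which the array @found is an illustrative name:

```perl
use strict;
use warnings;

my @found;

# Record what (\w+) holds, then force a failure so the engine tries the
# next shorter match, and after that the next starting position.
"abc" =~ /(\w+)(?{ push @found, $1 })(*FAIL)/;

print join(",", @found), "\n";   # abc,ab,a,bc,b,c
```

       The overall match fails, so the result has to be taken from @found
       rather than from the return value or from $1.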
2600
2601   Pragmas and debugging
2602       Speaking of debugging, there are several pragmas available to control
2603       and debug regexps in Perl.  We have already encountered one pragma in
2604       the previous section, "use re 'eval';", that allows variable
2605       interpolation and code expressions to coexist in a regexp.  The other
2606       pragmas are
2607
2608           use re 'taint';
2609           $tainted = <>;
2610           @parts = ($tainted =~ /(\w+)\s+(\w+)/); # @parts is now tainted
2611
2612       The "taint" pragma causes any substrings from a match with a tainted
2613       variable to be tainted as well.  This is not normally the case, as
2614       regexps are often used to extract the safe bits from a tainted
2615       variable.  Use "taint" when you are not extracting safe bits, but are
2616       performing some other processing.  Both "taint" and "eval" pragmas are
2617       lexically scoped, which means they are in effect only until the end of
2618       the block enclosing the pragmas.
2619
2620           use re '/m';  # or any other flags
2621           $multiline_string =~ /^foo/; # /m is implied
2622
2623       The "re '/flags'" pragma (introduced in Perl 5.14) turns on the given
2624       regular expression flags until the end of the lexical scope.  See
2625       "'/flags' mode" in re for more detail.
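
       A short sketch of the scoping, assuming Perl 5.14 or later; the
       variable names are illustrative:

```perl
use strict;
use warnings;

my $text = "first line\nfoo on the second line\n";

my $plain = $text =~ /^foo/ ? 1 : 0;     # no /m: ^ matches only at the start
my $scoped;
{
    use re '/m';                         # regexps in this block get /m
    $scoped = $text =~ /^foo/ ? 1 : 0;   # ^ now also matches after a newline
}
my $after = $text =~ /^foo/ ? 1 : 0;     # the pragma is out of scope again

print "plain=$plain scoped=$scoped after=$after\n";   # plain=0 scoped=1 after=0
```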
2626
2627           use re 'debug';
2628           /^(.*)$/s;       # output debugging info
2629
2630           use re 'debugcolor';
2631           /^(.*)$/s;       # output debugging info in living color
2632
2633       The global "debug" and "debugcolor" pragmas allow one to get detailed
2634       debugging info about regexp compilation and execution.  "debugcolor" is
2635       the same as debug, except the debugging information is displayed in
2636       color on terminals that can display termcap color sequences.  Here is
2637       example output:
2638
2639           % perl -e 'use re "debug"; "abc" =~ /a*b+c/;'
2640           Compiling REx 'a*b+c'
2641           size 9 first at 1
2642              1: STAR(4)
2643              2:   EXACT <a>(0)
2644              4: PLUS(7)
2645              5:   EXACT <b>(0)
2646              7: EXACT <c>(9)
2647              9: END(0)
2648           floating 'bc' at 0..2147483647 (checking floating) minlen 2
2649           Guessing start of match, REx 'a*b+c' against 'abc'...
2650           Found floating substr 'bc' at offset 1...
2651           Guessed: match at offset 0
2652           Matching REx 'a*b+c' against 'abc'
2653             Setting an EVAL scope, savestack=3
2654              0 <> <abc>           |  1:  STAR
2655                                    EXACT <a> can match 1 times out of 32767...
2656             Setting an EVAL scope, savestack=3
2657              1 <a> <bc>           |  4:    PLUS
2658                                    EXACT <b> can match 1 times out of 32767...
2659             Setting an EVAL scope, savestack=3
2660              2 <ab> <c>           |  7:      EXACT <c>
2661              3 <abc> <>           |  9:      END
2662           Match successful!
2663           Freeing REx: 'a*b+c'
2664
2665       If you have gotten this far into the tutorial, you can probably guess
2666       what the different parts of the debugging output tell you.  The first
2667       part
2668
2669           Compiling REx 'a*b+c'
2670           size 9 first at 1
2671              1: STAR(4)
2672              2:   EXACT <a>(0)
2673              4: PLUS(7)
2674              5:   EXACT <b>(0)
2675              7: EXACT <c>(9)
2676              9: END(0)
2677
2678       describes the compilation stage.  STAR(4) means that there is a starred
2679       object, in this case 'a', and if it matches, go to line 4, i.e.,
2680       PLUS(7).  The middle lines describe some heuristics and optimizations
2681       performed before a match:
2682
2683           floating 'bc' at 0..2147483647 (checking floating) minlen 2
2684           Guessing start of match, REx 'a*b+c' against 'abc'...
2685           Found floating substr 'bc' at offset 1...
2686           Guessed: match at offset 0
2687
2688       Then the match is executed and the remaining lines describe the
2689       process:
2690
2691           Matching REx 'a*b+c' against 'abc'
2692             Setting an EVAL scope, savestack=3
2693              0 <> <abc>           |  1:  STAR
2694                                    EXACT <a> can match 1 times out of 32767...
2695             Setting an EVAL scope, savestack=3
2696              1 <a> <bc>           |  4:    PLUS
2697                                    EXACT <b> can match 1 times out of 32767...
2698             Setting an EVAL scope, savestack=3
2699              2 <ab> <c>           |  7:      EXACT <c>
2700              3 <abc> <>           |  9:      END
2701           Match successful!
2702           Freeing REx: 'a*b+c'
2703
2704       Each step is of the form "n <x> <y>", with "<x>" the part of the string
2705       matched and "<y>" the part not yet matched.  The "|  1:  STAR" says
2706       that Perl is at line number 1 in the compilation list above.  See
2707       "Debugging Regular Expressions" in perldebguts for much more detail.
2708
2709       An alternative method of debugging regexps is to embed "print"
2710       statements within the regexp.  This provides a blow-by-blow account of
2711       the backtracking in an alternation:
2712
2713           "that this" =~ m@(?{print "Start at position ", pos, "\n";})
2714                            t(?{print "t1\n";})
2715                            h(?{print "h1\n";})
2716                            i(?{print "i1\n";})
2717                            s(?{print "s1\n";})
2718                                |
2719                            t(?{print "t2\n";})
2720                            h(?{print "h2\n";})
2721                            a(?{print "a2\n";})
2722                            t(?{print "t2\n";})
2723                            (?{print "Done at position ", pos, "\n";})
2724                           @x;
2725
2726       prints
2727
2728           Start at position 0
2729           t1
2730           h1
2731           t2
2732           h2
2733           a2
2734           t2
2735           Done at position 4
2736

SEE ALSO

2738       This is just a tutorial.  For the full story on Perl regular
2739       expressions, see the perlre regular expressions reference page.
2740
2741       For more information on the matching "m//" and substitution "s///"
2742       operators, see "Regexp Quote-Like Operators" in perlop.  For
2743       information on the "split" operation, see "split" in perlfunc.
2744
2745       For an excellent all-around resource on the care and feeding of regular
2746       expressions, see the book Mastering Regular Expressions by Jeffrey
2747       Friedl (published by O'Reilly, ISBN 1-56592-257-3).
2748

AUTHOR AND COPYRIGHT

2750       Copyright (c) 2000 Mark Kvale.  All rights reserved.  Now maintained by
2751       Perl porters.
2752
2753       This document may be distributed under the same terms as Perl itself.
2754
2755   Acknowledgments
2756       The inspiration for the stop codon DNA example came from the ZIP code
2757       example in chapter 7 of Mastering Regular Expressions.
2758
2759       The author would like to thank Jeff Pinyan, Andrew Johnson, Peter
2760       Haworth, Ronald J Kimball, and Joe Smith for all their helpful
2761       comments.
2762
2763
2764
2765perl v5.26.3                      2018-03-23                      PERLRETUT(1)