PERLRETUT(1)          Perl Programmers Reference Guide          PERLRETUT(1)

perlretut - Perl regular expressions tutorial
7
9 This page provides a basic tutorial on understanding, creating and
10 using regular expressions in Perl. It serves as a complement to the
11 reference page on regular expressions perlre. Regular expressions are
12 an integral part of the "m//", "s///", "qr//" and "split" operators and
13 so this tutorial also overlaps with "Regexp Quote-Like Operators" in
14 perlop and "split" in perlfunc.
15
16 Perl is widely renowned for excellence in text processing, and regular
17 expressions are one of the big factors behind this fame. Perl regular
18 expressions display an efficiency and flexibility unknown in most other
19 computer languages. Mastering even the basics of regular expressions
20 will allow you to manipulate text with surprising ease.
21
22 What is a regular expression? A regular expression is simply a string
23 that describes a pattern. Patterns are in common use these days; exam‐
24 ples are the patterns typed into a search engine to find web pages and
25 the patterns used to list files in a directory, e.g., "ls *.txt" or
26 "dir *.*". In Perl, the patterns described by regular expressions are
27 used to search strings, extract desired parts of strings, and to do
28 search and replace operations.
29
30 Regular expressions have the undeserved reputation of being abstract
31 and difficult to understand. Regular expressions are constructed using
32 simple concepts like conditionals and loops and are no more difficult
33 to understand than the corresponding "if" conditionals and "while"
34 loops in the Perl language itself. In fact, the main challenge in
35 learning regular expressions is just getting used to the terse notation
36 used to express these concepts.
37
38 This tutorial flattens the learning curve by discussing regular expres‐
39 sion concepts, along with their notation, one at a time and with many
40 examples. The first part of the tutorial will progress from the sim‐
41 plest word searches to the basic regular expression concepts. If you
42 master the first part, you will have all the tools needed to solve
43 about 98% of your needs. The second part of the tutorial is for those
44 comfortable with the basics and hungry for more power tools. It dis‐
45 cusses the more advanced regular expression operators and introduces
46 the latest cutting edge innovations in 5.6.0.
47
48 A note: to save time, 'regular expression' is often abbreviated as reg‐
49 exp or regex. Regexp is a more natural abbreviation than regex, but is
50 harder to pronounce. The Perl pod documentation is evenly split on
51 regexp vs regex; in Perl, there is more than one way to abbreviate it.
52 We'll use regexp in this tutorial.
53
55 Simple word matching
56
57 The simplest regexp is simply a word, or more generally, a string of
58 characters. A regexp consisting of a word matches any string that con‐
59 tains that word:
60
61 "Hello World" =~ /World/; # matches
62
63 What is this perl statement all about? "Hello World" is a simple double
64 quoted string. "World" is the regular expression and the "//" enclos‐
65 ing "/World/" tells perl to search a string for a match. The operator
66 "=~" associates the string with the regexp match and produces a true
67 value if the regexp matched, or false if the regexp did not match. In
68 our case, "World" matches the second word in "Hello World", so the
69 expression is true. Expressions like this are useful in conditionals:
70
    if ("Hello World" =~ /World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }
77
There are useful variations on this theme. The sense of the match can
be reversed by using the "!~" operator:
80
    if ("Hello World" !~ /World/) {
        print "It doesn't match\n";
    }
    else {
        print "It matches\n";
    }
87
88 The literal string in the regexp can be replaced by a variable:
89
    $greeting = "World";
    if ("Hello World" =~ /$greeting/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }
97
98 If you're matching against the special default variable $_, the "$_ =~"
99 part can be omitted:
100
    $_ = "Hello World";
    if (/World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }
108
109 And finally, the "//" default delimiters for a match can be changed to
110 arbitrary delimiters by putting an 'm' out front:
111
112 "Hello World" =~ m!World!; # matches, delimited by '!'
113 "Hello World" =~ m{World}; # matches, note the matching '{}'
114 "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
115 # '/' becomes an ordinary char
116
"/World/", "m!World!", and "m{World}" all represent the same thing.
When, e.g., the double quote (") is used as a delimiter, the forward
slash '/' becomes an ordinary character and can be used in a regexp
without trouble.
120
121 Let's consider how different regexps would match "Hello World":
122
123 "Hello World" =~ /world/; # doesn't match
124 "Hello World" =~ /o W/; # matches
125 "Hello World" =~ /oW/; # doesn't match
126 "Hello World" =~ /World /; # doesn't match
127
The first regexp "world" doesn't match because regexps are
case-sensitive. The second regexp matches because the substring 'o W'
occurs in the string "Hello World". The space character ' ' is treated
like any other character in a regexp and is needed to match in this
case. The lack of a space character is the reason the third regexp 'oW'
doesn't match. The fourth regexp 'World ' doesn't match because there is
a space at the end of the regexp, but not at the end of the string. The
lesson here is that regexps must match a part of the string exactly in
order for the statement to be true.
137
138 If a regexp matches in more than one place in the string, perl will
139 always match at the earliest possible point in the string:
140
141 "Hello World" =~ /o/; # matches 'o' in 'Hello'
142 "That hat is red" =~ /hat/; # matches 'hat' in 'That'
143
144 With respect to character matching, there are a few more points you
145 need to know about. First of all, not all characters can be used 'as
146 is' in a match. Some characters, called metacharacters, are reserved
147 for use in regexp notation. The metacharacters are
148
    {}[]()^$.|*+?\
150
151 The significance of each of these will be explained in the rest of the
152 tutorial, but for now, it is important only to know that a metacharac‐
153 ter can be matched by putting a backslash before it:
154
    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    "The interval is [0,1)." =~ /[0,1)./;     # is a syntax error!
    "The interval is [0,1)." =~ /\[0,1\)\./;  # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;    # matches
160
161 In the last regexp, the forward slash '/' is also backslashed, because
162 it is used to delimit the regexp. This can lead to LTS (leaning tooth‐
163 pick syndrome), however, and it is often more readable to change delim‐
164 iters.
165
166 "/usr/bin/perl" =~ m!/usr/bin/perl!; # easier to read
167
168 The backslash character '\' is a metacharacter itself and needs to be
169 backslashed:
170
171 'C:\WIN32' =~ /C:\\WIN/; # matches
172
173 In addition to the metacharacters, there are some ASCII characters
174 which don't have printable character equivalents and are instead repre‐
175 sented by escape sequences. Common examples are "\t" for a tab, "\n"
176 for a newline, "\r" for a carriage return and "\a" for a bell. If your
177 string is better thought of as a sequence of arbitrary bytes, the octal
178 escape sequence, e.g., "\033", or hexadecimal escape sequence, e.g.,
179 "\x1B" may be a more natural representation for your bytes. Here are
180 some examples of escapes:
181
    "1000\t2000" =~ m(0\t2);    # matches
    "1000\n2000" =~ /0\n20/;    # matches
    "1000\t2000" =~ /\000\t2/;  # doesn't match, "0" ne "\000"
    "cat" =~ /\143\x61\x74/;    # matches, but a weird way to spell cat
186
187 If you've been around Perl a while, all this talk of escape sequences
188 may seem familiar. Similar escape sequences are used in double-quoted
189 strings and in fact the regexps in Perl are mostly treated as double-
190 quoted strings. This means that variables can be used in regexps as
191 well. Just like double-quoted strings, the values of the variables in
192 the regexp will be substituted in before the regexp is evaluated for
193 matching purposes. So we have:
194
195 $foo = 'house';
196 'housecat' =~ /$foo/; # matches
197 'cathouse' =~ /cat$foo/; # matches
198 'housecat' =~ /${foo}cat/; # matches
199
200 So far, so good. With the knowledge above you can already perform
201 searches with just about any literal string regexp you can dream up.
202 Here is a very simple emulation of the Unix grep program:
203
    % cat > simple_grep
    #!/usr/bin/perl
    $regexp = shift;
    while (<>) {
        print if /$regexp/;
    }
    ^D

    % chmod +x simple_grep

    % simple_grep abba /usr/dict/words
    Babbage
    cabbage
    cabbages
    sabbath
    Sabbathize
    Sabbathizes
    sabbatical
    scabbard
    scabbards
224
225 This program is easy to understand. "#!/usr/bin/perl" is the standard
226 way to invoke a perl program from the shell. "$regexp = shift;" saves
227 the first command line argument as the regexp to be used, leaving the
228 rest of the command line arguments to be treated as files.
229 "while (<>)" loops over all the lines in all the files. For each
230 line, "print if /$regexp/;" prints the line if the regexp matches the
231 line. In this line, both "print" and "/$regexp/" use the default vari‐
232 able $_ implicitly.
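
To experiment with the same matching loop without creating a script
file, the logic can be wrapped in a subroutine. The following is only a
sketch: the name grep_lines and the short word list are made up for
illustration and are not part of simple_grep.

```perl
use strict;
use warnings;

# Hypothetical helper: return the elements of @lines that match $regexp,
# mirroring the "print if /$regexp/" loop of simple_grep.
sub grep_lines {
    my ($regexp, @lines) = @_;
    my @hits;
    for (@lines) {                      # each element is aliased to $_
        push @hits, $_ if /$regexp/;    # the match uses $_ implicitly
    }
    return @hits;
}

# Illustrative sample data, standing in for /usr/dict/words.
my @words = qw(Babbage cabbage rabbit sabbath);
print "$_\n" for grep_lines('abba', @words);
```

With this sample list, 'rabbit' is the only word that is not printed,
since it does not contain 'abba'.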
233
234 With all of the regexps above, if the regexp matched anywhere in the
235 string, it was considered a match. Sometimes, however, we'd like to
236 specify where in the string the regexp should try to match. To do
237 this, we would use the anchor metacharacters "^" and "$". The anchor
238 "^" means match at the beginning of the string and the anchor "$" means
239 match at the end of the string, or before a newline at the end of the
240 string. Here is how they are used:
241
242 "housekeeper" =~ /keeper/; # matches
243 "housekeeper" =~ /^keeper/; # doesn't match
244 "housekeeper" =~ /keeper$/; # matches
245 "housekeeper\n" =~ /keeper$/; # matches
246
247 The second regexp doesn't match because "^" constrains "keeper" to
248 match only at the beginning of the string, but "housekeeper" has keeper
249 starting in the middle. The third regexp does match, since the "$"
250 constrains "keeper" to match only at the end of the string.
251
252 When both "^" and "$" are used at the same time, the regexp has to
253 match both the beginning and the end of the string, i.e., the regexp
254 matches the whole string. Consider
255
256 "keeper" =~ /^keep$/; # doesn't match
257 "keeper" =~ /^keeper$/; # matches
258 "" =~ /^$/; # ^$ matches an empty string
259
260 The first regexp doesn't match because the string has more to it than
261 "keep". Since the second regexp is exactly the string, it matches.
262 Using both "^" and "$" in a regexp forces the complete string to match,
263 so it gives you complete control over which strings match and which
264 don't. Suppose you are looking for a fellow named bert, off in a
265 string by himself:
266
267 "dogbert" =~ /bert/; # matches, but not what you want
268
269 "dilbert" =~ /^bert/; # doesn't match, but ..
270 "bertram" =~ /^bert/; # matches, so still not good enough
271
272 "bertram" =~ /^bert$/; # doesn't match, good
273 "dilbert" =~ /^bert$/; # doesn't match, good
274 "bert" =~ /^bert$/; # matches, perfect
275
276 Of course, in the case of a literal string, one could just as easily
277 use the string equivalence "$string eq 'bert'" and it would be more
278 efficient. The "^...$" regexp really becomes useful when we add in
279 the more powerful regexp tools below.
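
One difference between "/^bert$/" and "eq" follows from the definition
of "$" given above: the regexp still matches when the string ends in a
newline, while "eq" does not. The sketch below compares the two; the
helper name is_bert is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the whole string is 'bert' (a single
# trailing newline is tolerated, because "$" can match before it).
sub is_bert {
    my ($string) = @_;
    return $string =~ /^bert$/ ? 1 : 0;
}

for my $name ("dogbert", "bertram", "bert", "bert\n") {
    (my $shown = $name) =~ s/\n/\\n/;    # make the newline visible
    printf "%-8s regexp: %d  eq: %d\n",
        $shown, is_bert($name), ($name eq 'bert' ? 1 : 0);
}
```

Only the last string, "bert\n", gives different answers for the two
tests.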
280
281 Using character classes
282
283 Although one can already do quite a lot with the literal string regexps
284 above, we've only scratched the surface of regular expression technol‐
285 ogy. In this and subsequent sections we will introduce regexp concepts
286 (and associated metacharacter notations) that will allow a regexp to
287 not just represent a single character sequence, but a whole class of
288 them.
289
290 One such concept is that of a character class. A character class
291 allows a set of possible characters, rather than just a single charac‐
292 ter, to match at a particular point in a regexp. Character classes are
293 denoted by brackets "[...]", with the set of characters to be possibly
294 matched inside. Here are some examples:
295
    /cat/;       # matches 'cat'
    /[bcr]at/;   # matches 'bat', 'cat', or 'rat'
    /item[0123456789]/;  # matches 'item0' or ... or 'item9'
    "abc" =~ /[cab]/;    # matches 'a'
300
301 In the last statement, even though 'c' is the first character in the
302 class, 'a' matches because the first character position in the string
303 is the earliest point at which the regexp can match.
304
305 /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
306 # 'yes', 'Yes', 'YES', etc.
307
This regexp displays a common task: perform a case-insensitive match.
Perl provides a way of avoiding all those brackets by simply appending
an 'i' to the end of the match. Then "/[yY][eE][sS]/;" can be rewritten
as "/yes/i;". The 'i' stands for case-insensitive and is an example of
a modifier of the matching operation. We will meet other modifiers
later in the tutorial.
314
315 We saw in the section above that there were ordinary characters, which
316 represented themselves, and special characters, which needed a back‐
317 slash "\" to represent themselves. The same is true in a character
318 class, but the sets of ordinary and special characters inside a charac‐
319 ter class are different than those outside a character class. The spe‐
320 cial characters for a character class are "-]\^$". "]" is special
321 because it denotes the end of a character class. "$" is special
322 because it denotes a scalar variable. "\" is special because it is
323 used in escape sequences, just like above. Here is how the special
324 characters "]$\" are handled:
325
    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The last two are a little tricky. In "[\$x]", the backslash protects
the dollar sign, so the character class has two members "$" and "x".
In "[\\$x]", the backslash is protected, so $x is treated as a variable
and substituted in double quote fashion.
336
337 The special character '-' acts as a range operator within character
338 classes, so that a contiguous set of characters can be written as a
339 range. With ranges, the unwieldy "[0123456789]" and "[abc...xyz]"
340 become the svelte "[0-9]" and "[a-z]". Some examples are
341
342 /item[0-9]/; # matches 'item0' or ... or 'item9'
343 /[0-9bx-z]aa/; # matches '0aa', ..., '9aa',
344 # 'baa', 'xaa', 'yaa', or 'zaa'
345 /[0-9a-fA-F]/; # matches a hexadecimal digit
346 /[0-9a-zA-Z_]/; # matches a "word" character,
347 # like those in a perl variable name
348
349 If '-' is the first or last character in a character class, it is
350 treated as an ordinary character; "[-ab]", "[ab-]" and "[a\-b]" are all
351 equivalent.
352
353 The special character "^" in the first position of a character class
354 denotes a negated character class, which matches any character but
355 those in the brackets. Both "[...]" and "[^...]" must match a charac‐
356 ter, or the match fails. Then
357
    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary
362
Now, even "[0-9]" can be a bother to write multiple times, so in the
interest of saving keystrokes and making regexps more readable, Perl
has several abbreviations for common character classes:
366
367 · \d is a digit and represents [0-9]
368
369 · \s is a whitespace character and represents [\ \t\r\n\f]
370
371 · \w is a word character (alphanumeric or _) and represents
372 [0-9a-zA-Z_]
373
374 · \D is a negated \d; it represents any character but a digit [^0-9]
375
376 · \S is a negated \s; it represents any non-whitespace character
377 [^\s]
378
379 · \W is a negated \w; it represents any non-word character [^\w]
380
381 · The period '.' matches any character but "\n"
382
383 The "\d\s\w\D\S\W" abbreviations can be used both inside and outside of
384 character classes. Here are some in use:
385
386 /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
387 /[\d\s]/; # matches any digit or whitespace character
388 /\w\W\w/; # matches a word char, followed by a
389 # non-word char, followed by a word char
390 /..rt/; # matches any two chars, followed by 'rt'
391 /end\./; # matches 'end.'
392 /end[.]/; # same thing, matches 'end.'
393
394 Because a period is a metacharacter, it needs to be escaped to match as
395 an ordinary period. Because, for example, "\d" and "\w" are sets of
396 characters, it is incorrect to think of "[^\d\w]" as "[\D\W]"; in fact
397 "[^\d\w]" is the same as "[^\w]", which is the same as "[\W]". Think
398 DeMorgan's laws.
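
The claim about "[^\d\w]" can be checked by brute force. The sketch
below tests every ASCII character (codes 0 to 127, a range chosen here
for illustration) against the three classes; the variable names are
made up for this example.

```perl
use strict;
use warnings;

# Compare the three classes on every ASCII character.
my ($same_as_W, $same_as_DW) = (1, 1);
for my $code (0 .. 127) {
    my $ch       = chr $code;
    my $neg_both = $ch =~ /[^\d\w]/ ? 1 : 0;   # neither digit nor word char
    my $non_word = $ch =~ /[\W]/    ? 1 : 0;   # not a word char
    my $either   = $ch =~ /[\D\W]/  ? 1 : 0;   # non-digit OR non-word
    $same_as_W  = 0 if $neg_both != $non_word;
    $same_as_DW = 0 if $neg_both != $either;
}
print "[^\\d\\w] agrees with [\\W]:    ", ($same_as_W  ? "yes" : "no"), "\n";
print "[^\\d\\w] agrees with [\\D\\W]: ", ($same_as_DW ? "yes" : "no"), "\n";
```

The first line reports "yes" and the second "no": a letter such as 'a'
is a non-digit, so it is a member of "[\D\W]" but not of "[^\d\w]".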
399
400 An anchor useful in basic regexps is the word anchor "\b". This
401 matches a boundary between a word character and a non-word character
402 "\w\W" or "\W\w":
403
    $x = "Housecat catenates house and cat";
    $x =~ /cat/;      # matches cat in 'Housecat'
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string
409
410 Note in the last example, the end of the string is considered a word
411 boundary.
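
To see which 'cat' each regexp actually picks, one can print the offset
at which the match starts. This sketch relies on the "qr//" operator
and the "@-" array, both covered elsewhere in the Perl documentation;
the helper name match_offset is hypothetical.

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";

# Hypothetical helper: offset at which $re first matches $str, or -1.
sub match_offset {
    my ($str, $re) = @_;
    return $str =~ $re ? $-[0] : -1;
}

printf "%-10s matches at offset %d\n", $_->[0], match_offset($x, $_->[1])
    for [ '/cat/',     qr/cat/     ],
        [ '/\bcat/',   qr/\bcat/   ],
        [ '/cat\b/',   qr/cat\b/   ],
        [ '/\bcat\b/', qr/\bcat\b/ ];
```

The offsets reported are 5, 9, 5 and 29, i.e. inside 'Housecat', at the
start of 'catenates', inside 'Housecat' again, and at the final 'cat'.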
412
You might wonder why '.' matches everything but "\n" - why not every
character? The reason is that often one is matching against lines and
would like to ignore the newline characters. For instance, while the
string "\n" represents one line, we would like to think of it as empty.
Then
418
419 "" =~ /^$/; # matches
420 "\n" =~ /^$/; # matches, "\n" is ignored
421
422 "" =~ /./; # doesn't match; it needs a char
423 "" =~ /^.$/; # doesn't match; it needs a char
424 "\n" =~ /^.$/; # doesn't match; it needs a char other than "\n"
425 "a" =~ /^.$/; # matches
426 "a\n" =~ /^.$/; # matches, ignores the "\n"
427
428 This behavior is convenient, because we usually want to ignore newlines
429 when we count and match characters in a line. Sometimes, however, we
430 want to keep track of newlines. We might even want "^" and "$" to
431 anchor at the beginning and end of lines within the string, rather than
432 just the beginning and end of the string. Perl allows us to choose
433 between ignoring and paying attention to newlines by using the "//s"
434 and "//m" modifiers. "//s" and "//m" stand for single line and multi-
435 line and they determine whether a string is to be treated as one con‐
436 tinuous string, or as a set of lines. The two modifiers affect two
437 aspects of how the regexp is interpreted: 1) how the '.' character
438 class is defined, and 2) where the anchors "^" and "$" are able to
439 match. Here are the four possible combinations:
440
441 · no modifiers (//): Default behavior. '.' matches any character
442 except "\n". "^" matches only at the beginning of the string and
443 "$" matches only at the end or before a newline at the end.
444
445 · s modifier (//s): Treat string as a single long line. '.' matches
446 any character, even "\n". "^" matches only at the beginning of the
447 string and "$" matches only at the end or before a newline at the
448 end.
449
450 · m modifier (//m): Treat string as a set of multiple lines. '.'
451 matches any character except "\n". "^" and "$" are able to match
452 at the start or end of any line within the string.
453
454 · both s and m modifiers (//sm): Treat string as a single long line,
455 but detect multiple lines. '.' matches any character, even "\n".
456 "^" and "$", however, are able to match at the start or end of any
457 line within the string.
458
459 Here are examples of "//s" and "//m" in action:
460
461 $x = "There once was a girl\nWho programmed in Perl\n";
462
463 $x =~ /^Who/; # doesn't match, "Who" not at start of string
464 $x =~ /^Who/s; # doesn't match, "Who" not at start of string
465 $x =~ /^Who/m; # matches, "Who" at start of second line
466 $x =~ /^Who/sm; # matches, "Who" at start of second line
467
468 $x =~ /girl.Who/; # doesn't match, "." doesn't match "\n"
469 $x =~ /girl.Who/s; # matches, "." matches "\n"
470 $x =~ /girl.Who/m; # doesn't match, "." doesn't match "\n"
471 $x =~ /girl.Who/sm; # matches, "." matches "\n"
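
The eight statements above can also be run as a small table. This is
only a sketch: the layout of @combos and the %result hash are choices
made for this example, and "qr//" is used to store each regexp together
with its modifiers.

```perl
use strict;
use warnings;

my $x = "There once was a girl\nWho programmed in Perl\n";

# One row per modifier combination: label, "^" test, "." test.
my @combos = (
    [ 'none', qr/^Who/,   qr/girl.Who/   ],
    [ 's',    qr/^Who/s,  qr/girl.Who/s  ],
    [ 'm',    qr/^Who/m,  qr/girl.Who/m  ],
    [ 'sm',   qr/^Who/sm, qr/girl.Who/sm ],
);

my %result;
for my $combo (@combos) {
    my ($name, $caret, $dot) = @$combo;
    $result{$name} = [ ($x =~ $caret ? 1 : 0), ($x =~ $dot ? 1 : 0) ];
    printf "%-4s  ^Who: %d  girl.Who: %d\n", $name, @{ $result{$name} };
}
```

The "^Who" column is 1 only for the rows with the m modifier, and the
"girl.Who" column is 1 only for the rows with the s modifier, which
shows that the two modifiers act independently.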
472
Most of the time, the default behavior is what is wanted, but "//s" and
"//m" are occasionally very useful. If "//m" is being used, the start
of the string can still be matched with "\A" and the end of string can
still be matched with the anchors "\Z" (matches both the end and the
newline before, like "$"), and "\z" (matches only the end):
478
479 $x =~ /^Who/m; # matches, "Who" at start of second line
480 $x =~ /\AWho/m; # doesn't match, "Who" is not at start of string
481
482 $x =~ /girl$/m; # matches, "girl" at end of first line
483 $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string
484
485 $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end
486 $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string
487
488 We now know how to create choices among classes of characters in a reg‐
489 exp. What about choices among words or character strings? Such choices
490 are described in the next section.
491
492 Matching this or that
493
Sometimes we would like our regexp to be able to match different
possible words or character strings. This is accomplished by using the
alternation metacharacter "|". To match "dog" or "cat", we form the
regexp "dog|cat". As before, perl will try to match the regexp at the
earliest possible point in the string. At each character position,
perl will first try to match the first alternative, "dog". If "dog"
doesn't match, perl will then try the next alternative, "cat". If
"cat" doesn't match either, then the match fails and perl moves to the
next position in the string. Some examples:
503
    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"
506
507 Even though "dog" is the first alternative in the second regexp, "cat"
508 is able to match earlier in the string.
509
    "cats" =~ /c|ca|cat|cats/; # matches "c"
    "cats" =~ /cats|cat|ca|c/; # matches "cats"
512
513 Here, all the alternatives match at the first string position, so the
514 first alternative is the one that matches. If some of the alternatives
515 are truncations of the others, put the longest ones first to give them
516 a chance to match.
517
    "cab" =~ /a|b|c/;  # matches "c"
                       # /a|b|c/ == /[abc]/
520
521 The last example points out that character classes are like alterna‐
522 tions of characters. At a given character position, the first alterna‐
523 tive that allows the regexp match to succeed will be the one that
524 matches.
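
The effect of position and of the order of alternatives can be observed
by extracting the matched text. The sketch below uses the "@-" and "@+"
arrays described under "Extracting matches"; the helper name
matched_text is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: the text matched by $re in $str (undef if none).
sub matched_text {
    my ($str, $re) = @_;
    return $str =~ $re ? substr($str, $-[0], $+[0] - $-[0]) : undef;
}

# Earliest position wins, even though "dog" is listed first.
print matched_text("cats and dogs", qr/dog|cat|bird/), "\n";

# At one position, the first alternative that works wins.
print matched_text("cats", qr/c|ca|cat|cats/), "\n";

# Listing the longest alternative first gives it a chance to match.
print matched_text("cats", qr/cats|cat|ca|c/), "\n";
```

The three lines printed are "cat", "c" and "cats".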
525
526 Grouping things and hierarchical matching
527
Alternation allows a regexp to choose among alternatives, but by itself
it is unsatisfying. The reason is that each alternative is a whole
regexp, but sometimes we want alternatives for just part of a regexp.
For instance, suppose we want to search for housecats or housekeepers.
The regexp "housecat|housekeeper" fits the bill, but is inefficient
because we had to type "house" twice. It would be nice to have parts of
the regexp be constant, like "house", and some parts have alternatives,
like "cat|keeper".
536
The grouping metacharacters "()" solve this problem. Grouping allows
parts of a regexp to be treated as a single unit. Parts of a regexp
are grouped by enclosing them in parentheses. Thus we could solve the
"housecat|housekeeper" problem by forming the regexp as
"house(cat|keeper)". The regexp "house(cat|keeper)" means match "house"
followed by either "cat" or "keeper". Some more examples are
543
    /(a|b)b/;    # matches 'ab' or 'bb'
    /(ac|b)b/;   # matches 'acb' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere
    /(a|[bc])d/; # matches 'ad', 'bd', or 'cd'

    /house(cat|)/;      # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'. Note groups can be nested.

    /(19|20|)\d\d/;  # match years 19xx, 20xx, or the Y2K problem, xx
    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match
556
557 Alternations behave the same way in groups as out of them: at a given
558 string position, the leftmost alternative that allows the regexp to
559 match is taken. So in the last example at the first string position,
560 "20" matches the second alternative, but there is nothing left over to
561 match the next two digits "\d\d". So perl moves on to the next alter‐
562 native, which is the null alternative and that works, since "20" is two
563 digits.
564
565 The process of trying one alternative, seeing if it matches, and moving
566 on to the next alternative if it doesn't, is called backtracking. The
567 term 'backtracking' comes from the idea that matching a regexp is like
568 a walk in the woods. Successfully matching a regexp is like arriving
569 at a destination. There are many possible trailheads, one for each
570 string position, and each one is tried in order, left to right. From
571 each trailhead there may be many paths, some of which get you there,
572 and some which are dead ends. When you walk along a trail and hit a
573 dead end, you have to backtrack along the trail to an earlier point to
574 try another trail. If you hit your destination, you stop immediately
575 and forget about trying all the other trails. You are persistent, and
576 only if you have tried all the trails from all the trailheads and not
577 arrived at your destination, do you declare failure. To be concrete,
578 here is a step-by-step analysis of what perl does when it tries to
579 match the regexp
580
    "abcde" =~ /(abd|abc)(df|d|de)/;
582
583 0 Start with the first letter in the string 'a'.
584
585 1 Try the first alternative in the first group 'abd'.
586
587 2 Match 'a' followed by 'b'. So far so good.
588
589 3 'd' in the regexp doesn't match 'c' in the string - a dead end. So
590 backtrack two characters and pick the second alternative in the
591 first group 'abc'.
592
593 4 Match 'a' followed by 'b' followed by 'c'. We are on a roll and
594 have satisfied the first group. Set $1 to 'abc'.
595
596 5 Move on to the second group and pick the first alternative 'df'.
597
598 6 Match the 'd'.
599
600 7 'f' in the regexp doesn't match 'e' in the string, so a dead end.
601 Backtrack one character and pick the second alternative in the sec‐
602 ond group 'd'.
603
604 8 'd' matches. The second grouping is satisfied, so set $2 to 'd'.
605
606 9 We are at the end of the regexp, so we are done! We have matched
607 'abcd' out of the string "abcde".
608
609 There are a couple of things to note about this analysis. First, the
610 third alternative in the second group 'de' also allows a match, but we
611 stopped before we got to it - at a given character position, leftmost
612 wins. Second, we were able to get a match at the first character posi‐
613 tion of the string 'a'. If there were no matches at the first posi‐
614 tion, perl would move to the second character position 'b' and attempt
615 the match all over again. Only when all possible paths at all possible
616 character positions have been exhausted does perl give up and declare
"$string =~ /(abd|abc)(df|d|de)/;" to be false.
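
The outcome of the step-by-step analysis can be confirmed by running
the match and printing what each group captured. This is a minimal
sketch; the variable names are chosen for this example, and the whole
match is recovered with the "@-" and "@+" arrays described in the next
section.

```perl
use strict;
use warnings;

my $string = "abcde";
my ($first, $second, $whole);

if ($string =~ /(abd|abc)(df|d|de)/) {
    ($first, $second) = ($1, $2);                     # what each group matched
    $whole = substr($string, $-[0], $+[0] - $-[0]);   # the entire match
    print "group 1: $first\n";
    print "group 2: $second\n";
    print "whole match: $whole\n";
}
```

This prints 'abc', 'd' and 'abcd', in agreement with steps 4, 8 and 9:
the alternative 'de' is never tried, because 'd' already succeeded.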
618
619 Even with all this work, regexp matching happens remarkably fast. To
620 speed things up, during compilation stage, perl compiles the regexp
621 into a compact sequence of opcodes that can often fit inside a proces‐
622 sor cache. When the code is executed, these opcodes can then run at
623 full throttle and search very quickly.
624
625 Extracting matches
626
627 The grouping metacharacters "()" also serve another completely differ‐
628 ent function: they allow the extraction of the parts of a string that
629 matched. This is very useful to find out what matched and for text
630 processing in general. For each grouping, the part that matched inside
631 goes into the special variables $1, $2, etc. They can be used just as
632 ordinary variables:
633
    # extract hours, minutes, seconds
    if ($time =~ /(\d\d):(\d\d):(\d\d)/) {    # match hh:mm:ss format
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }
640
641 Now, we know that in scalar context, "$time =~ /(\d\d):(\d\d):(\d\d)/"
642 returns a true or false value. In list context, however, it returns
643 the list of matched values "($1,$2,$3)". So we could write the code
644 more compactly as
645
    # extract hours, minutes, seconds
    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
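
The list-context form is convenient inside a subroutine, because a
failed match returns an empty list that the caller can test for. The
following is a sketch; the name parse_time and the sample strings are
made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return (hours, minutes, seconds) from the first
# hh:mm:ss time found in $time, or an empty list if there is none.
sub parse_time {
    my ($time) = @_;
    return $time =~ /(\d\d):(\d\d):(\d\d)/;   # evaluated in caller's context
}

my ($hours, $minutes, $seconds) = parse_time("Backup started at 23:05:59");
print "hours=$hours minutes=$minutes seconds=$seconds\n";

my @nothing = parse_time("no time here");
print "fields found: ", scalar(@nothing), "\n";
```

The first call yields 23, 05 and 59; the second yields no fields at
all.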
648
649 If the groupings in a regexp are nested, $1 gets the group with the
650 leftmost opening parenthesis, $2 the next opening parenthesis, etc.
651 For example, here is a complex regexp and the matching variables indi‐
652 cated below it:
653
    /(ab(cd|ef)((gi)|j))/;
     1  2      34
656
657 so that if the regexp matched, e.g., $2 would contain 'cd' or 'ef'. For
658 convenience, perl sets $+ to the string held by the highest numbered
659 $1, $2, ... that got assigned (and, somewhat related, $^N to the value
660 of the $1, $2, ... most-recently assigned; i.e. the $1, $2, ... associ‐
661 ated with the rightmost closing parenthesis used in the match).
662
663 Closely associated with the matching variables $1, $2, ... are the
664 backreferences "\1", "\2", ... . Backreferences are simply matching
665 variables that can be used inside a regexp. This is a really nice fea‐
666 ture - what matches later in a regexp can depend on what matched ear‐
667 lier in the regexp. Suppose we wanted to look for doubled words in
668 text, like 'the the'. The following regexp finds all 3-letter doubles
669 with a space in between:
670
671 /(\w\w\w)\s\1/;
672
673 The grouping assigns a value to \1, so that the same 3 letter sequence
674 is used for both parts. Here are some words with repeated parts:
675
    % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words
677 beriberi
678 booboo
679 coco
680 mama
681 murmur
682 papa
683
684 The regexp has a single grouping which considers 4-letter combinations,
685 then 3-letter combinations, etc. and uses "\1" to look for a repeat.
686 Although $1 and "\1" represent the same thing, care should be taken to
687 use matched variables $1, $2, ... only outside a regexp and backrefer‐
688 ences "\1", "\2", ... only inside a regexp; not doing so may lead to
689 surprising and/or undefined results.
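
The same regexp can be tried without a dictionary file. In this sketch
the helper name is_doubled and the short word list are made up for
illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $word is a sequence of one to four word
# characters followed immediately by a copy of itself, else 0.
sub is_doubled {
    my ($word) = @_;
    return $word =~ /^(\w\w\w\w|\w\w\w|\w\w|\w)\1$/ ? 1 : 0;
}

for my $word (qw(beriberi booboo coco murmur banana perl)) {
    print "$word\n" if is_doubled($word);
}
```

Of the six sample words, 'banana' and 'perl' are not printed, because
neither consists of a sequence followed by an exact repeat of itself.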
690
691 In addition to what was matched, Perl 5.6.0 also provides the positions
692 of what was matched with the "@-" and "@+" arrays. "$-[0]" is the posi‐
693 tion of the start of the entire match and $+[0] is the position of the
694 end. Similarly, "$-[n]" is the position of the start of the $n match
695 and $+[n] is the position of the end. If $n is undefined, so are
696 "$-[n]" and $+[n]. Then this code
697
    $x = "Mmm...donut, thought Homer";
    $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
    foreach $expr (1..$#-) {
        print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n";
    }
703
704 prints
705
706 Match 1: 'Mmm' at position (0,3)
707 Match 2: 'donut' at position (6,11)
708
709 Even if there are no groupings in a regexp, it is still possible to
710 find out what exactly matched in a string. If you use them, perl will
711 set $` to the part of the string before the match, will set $& to the
712 part of the string that matched, and will set $' to the part of the
713 string after the match. An example:
714
715 $x = "the cat caught the mouse";
716 $x =~ /cat/; # $` = 'the ', $& = 'cat', $' = ' caught the mouse'
717 $x =~ /the/; # $` = '', $& = 'the', $' = ' cat caught the mouse'
718
719 In the second match, "$` = ''" because the regexp matched at the first
720 character position in the string and stopped; it never saw the second
721 'the'. It is important to note that using $` and $' slows down regexp
722 matching quite a bit, and $& slows it down to a lesser extent,
723 because if they are used in one regexp in a program, they are generated
724 for all regexps in the program. So if raw performance is a goal of
725 your application, they should be avoided. If you need them, use "@-"
726 and "@+" instead:
727
728 $` is the same as substr( $x, 0, $-[0] )
729 $& is the same as substr( $x, $-[0], $+[0]-$-[0] )
730 $' is the same as substr( $x, $+[0] )
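
The equivalences above can be sketched as a small runnable script. The variable names $before, $matched and $after are hypothetical stand-ins for $`, $& and $'.

```perl
use strict;
use warnings;

my $x = "the cat caught the mouse";
my ($before, $matched, $after);

if ($x =~ /cat/) {
    $before  = substr($x, 0, $-[0]);               # same text as $`
    $matched = substr($x, $-[0], $+[0] - $-[0]);   # same text as $&
    $after   = substr($x, $+[0]);                  # same text as $'
    print "before='$before' matched='$matched' after='$after'\n";
}
```

Because only "@-" and "@+" are consulted, none of the three special variables is ever mentioned in the program.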
731
732 Matching repetitions
733
734 The examples in the previous section display an annoying weakness. We
735 were only matching 3-letter words, or syllables of 4 letters or less.
736 We'd like to be able to match words or syllables of any length, without
737 writing out tedious alternatives like "\w\w\w\w|\w\w\w|\w\w|\w".
738
739 This is exactly the problem the quantifier metacharacters "?", "*",
740 "+", and "{}" were created for. They allow us to determine the number
741 of repeats of a portion of a regexp we consider to be a match. Quanti‐
742 fiers are put immediately after the character, character class, or
743 grouping that we want to specify. They have the following meanings:
744
745 · "a?" = match 'a' 1 or 0 times
746
747 · "a*" = match 'a' 0 or more times, i.e., any number of times
748
749 · "a+" = match 'a' 1 or more times, i.e., at least once
750
751 · "a{n,m}" = match at least "n" times, but not more than "m" times.
752
753 · "a{n,}" = match at least "n" times
754
755 · "a{n}" = match exactly "n" times
756
757 Here are some examples:
758
759 /[a-z]+\s+\d*/; # match a lowercase word, at least some space, and
760 # any number of digits
761 /(\w+)\s+\1/; # match doubled words of arbitrary length
762 /y(es)?/i; # matches 'y', 'Y', or a case-insensitive 'yes'
763 $year =~ /^\d{2,4}$/; # make sure year is at least 2 but not more
764 # than 4 digits
765 $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3 digit dates
766 $year =~ /^\d{2}(\d{2})?$/; # same thing written differently. However,
767 # this produces $1 and the other does not.
768
769 % simple_grep '^(\w+)\1$' /usr/dict/words # isn't this easier?
770 beriberi
771 booboo
772 coco
773 mama
774 murmur
775 papa
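
To see the "{n}" and "?" quantifiers working together, here is a small sketch that filters a list of candidate years. The candidate list is made up, and the "^" and "$" anchors are added so that the whole string, not just part of it, has to be a 2-digit or 4-digit year.

```perl
use strict;
use warnings;

# Hypothetical input: some strings that may or may not be years.
my @candidates = ('99', '2001', '123', '12345', 'abcd');

# Two digits, optionally followed by exactly two more digits.
my @valid = grep { /^\d{2}(\d{2})?$/ } @candidates;

print "Valid years: @valid\n";
```

Without the anchors, a string such as '12345' would also be accepted, because the quantifiers only need to find a qualifying run of digits somewhere inside it.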
776
777 For all of these quantifiers, perl will try to match as much of the
778 string as possible, while still allowing the regexp to succeed. Thus
779 with "/a?.../", perl will first try to match the regexp with the "a"
780 present; if that fails, perl will try to match the regexp without the
781 "a" present. For the quantifier "*", we get the following:
782
783 $x = "the cat in the hat";
784 $x =~ /^(.*)(cat)(.*)$/; # matches,
785 # $1 = 'the '
786 # $2 = 'cat'
787 # $3 = ' in the hat'
788
789 This is what we might expect: the match finds the only "cat" in the
790 string and locks onto it. Consider, however, this regexp:
791
792 $x =~ /^(.*)(at)(.*)$/; # matches,
793 # $1 = 'the cat in the h'
794 # $2 = 'at'
795 # $3 = '' (0 matches)
796
797 One might initially guess that perl would find the "at" in "cat" and
798 stop there, but that wouldn't give the longest possible string to the
799 first quantifier ".*". Instead, the first quantifier ".*" grabs as
800 much of the string as possible while still having the regexp match. In
801 this example, that means matching the "at" element against the final "at"
802 in the string. The other important principle illustrated here is that
803 when there are two or more elements in a regexp, the leftmost quanti‐
804 fier, if there is one, gets to grab as much of the string as possible,
805 leaving the rest of the regexp to fight over scraps. Thus in our exam‐
806 ple, the first quantifier ".*" grabs most of the string, while the sec‐
807 ond quantifier ".*" gets the empty string. Quantifiers that grab as
808 much of the string as possible are called maximal match or greedy quan‐
809 tifiers.
810
811 When a regexp can match a string in several different ways, we can use
812 the principles above to predict which way the regexp will match:
813
814 · Principle 0: Taken as a whole, any regexp will be matched at the
815 earliest possible position in the string.
816
817 · Principle 1: In an alternation "a|b|c...", the leftmost alternative
818 that allows a match for the whole regexp will be the one used.
819
820 · Principle 2: The maximal matching quantifiers "?", "*", "+" and
821 "{n,m}" will in general match as much of the string as possible
822 while still allowing the whole regexp to match.
823
824 · Principle 3: If there are two or more elements in a regexp, the
825 leftmost greedy quantifier, if any, will match as much of the
826 string as possible while still allowing the whole regexp to match.
827 The next leftmost greedy quantifier, if any, will try to match as
828 much of the string remaining available to it as possible, while
829 still allowing the whole regexp to match. And so on, until all the
830 regexp elements are satisfied.
831
832 As we have seen above, Principle 0 overrides the others - the regexp
833 will be matched as early as possible, with the other principles deter‐
834 mining how the regexp matches at that earliest character position.
835
836 Here is an example of these principles in action:
837
838 $x = "The programming republic of Perl";
839 $x =~ /^(.+)(e|r)(.*)$/; # matches,
840 # $1 = 'The programming republic of Pe'
841 # $2 = 'r'
842 # $3 = 'l'
843
844 This regexp matches at the earliest string position, 'T'. One might
845 think that "e", being leftmost in the alternation, would be matched,
846 but "r" produces the longest string in the first quantifier.
847
848 $x =~ /(m{1,2})(.*)$/; # matches,
849 # $1 = 'mm'
850 # $2 = 'ing republic of Perl'
851
852 Here, the earliest possible match is at the first 'm' in "programming".
853 "m{1,2}" is the first quantifier, so it gets to match a maximal "mm".
854
855 $x =~ /.*(m{1,2})(.*)$/; # matches,
856 # $1 = 'm'
857 # $2 = 'ing republic of Perl'
858
859 Here, the regexp matches at the start of the string. The first quanti‐
860 fier ".*" grabs as much as possible, leaving just a single 'm' for the
861 second quantifier "m{1,2}".
862
863 $x =~ /(.?)(m{1,2})(.*)$/; # matches,
864 # $1 = 'a'
865 # $2 = 'mm'
866 # $3 = 'ing republic of Perl'
867
868 Here, ".?" eats its maximal one character at the earliest possible
869 position in the string, 'a' in "programming", leaving "m{1,2}" the
870 opportunity to match both "m"'s. Finally,
871
872 "aXXXb" =~ /(X*)/; # matches with $1 = ''
873
874 because it can match zero copies of 'X' at the beginning of the string.
875 If you definitely want to match at least one 'X', use "X+", not "X*".
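
The difference between "X*" and "X+" on this string can be checked directly. This is a minimal sketch; the variable names are illustrative.

```perl
use strict;
use warnings;

# X* is satisfied by zero X's, so it succeeds at position 0.
my ($star) = "aXXXb" =~ /(X*)/;

# X+ needs at least one X, so the match has to move ahead to the first X.
my ($plus) = "aXXXb" =~ /(X+)/;

print "X* captured '$star', X+ captured '$plus'\n";
```

Both matches succeed, but only "X+" is forced past the leading 'a' to where the X's actually are.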
876
877 Sometimes greed is not good. At times, we would like quantifiers to
878 match a minimal piece of string, rather than a maximal piece. For this
879 purpose, Larry Wall created the minimal match or non-greedy quanti‐
880 fiers "??", "*?", "+?", and "{}?". These are the usual quantifiers with
881 a "?" appended to them. They have the following meanings:
882
883 · "a??" = match 'a' 0 or 1 times. Try 0 first, then 1.
884
885 · "a*?" = match 'a' 0 or more times, i.e., any number of times, but
886 as few times as possible
887
888 · "a+?" = match 'a' 1 or more times, i.e., at least once, but as few
889 times as possible
890
891 · "a{n,m}?" = match at least "n" times, not more than "m" times, as
892 few times as possible
893
894 · "a{n,}?" = match at least "n" times, but as few times as possible
895
896 · "a{n}?" = match exactly "n" times. Because we match exactly "n"
897 times, "a{n}?" is equivalent to "a{n}" and is just there for nota‐
898 tional consistency.
899
900 Let's look at the example above, but with minimal quantifiers:
901
902 $x = "The programming republic of Perl";
903 $x =~ /^(.+?)(e|r)(.*)$/; # matches,
904 # $1 = 'Th'
905 # $2 = 'e'
906 # $3 = ' programming republic of Perl'
907
908 The minimal string that will allow both the start of the string "^" and
909 the alternation to match is "Th", with the alternation "e|r" matching
910 "e". The second quantifier ".*" is free to gobble up the rest of the
911 string.
912
913 $x =~ /(m{1,2}?)(.*?)$/; # matches,
914 # $1 = 'm'
915 # $2 = 'ming republic of Perl'
916
917 The first string position that this regexp can match is at the first
918 'm' in "programming". At this position, the minimal "m{1,2}?" matches
919 just one 'm'. Although the second quantifier ".*?" would prefer to
920 match no characters, it is constrained by the end-of-string anchor "$"
921 to match the rest of the string.
922
923 $x =~ /(.*?)(m{1,2}?)(.*)$/; # matches,
924 # $1 = 'The progra'
925 # $2 = 'm'
926 # $3 = 'ming republic of Perl'
927
928 In this regexp, you might expect the first minimal quantifier ".*?" to
929 match the empty string, because it is not constrained by a "^" anchor
930 to match the beginning of the word. Principle 0 applies here, however.
931 Because it is possible for the whole regexp to match at the start of
932 the string, it will match at the start of the string. Thus the first
933 quantifier has to match everything up to the first "m". The second
934 minimal quantifier matches just one "m" and the third quantifier
935 matches the rest of the string.
936
937 $x =~ /(.??)(m{1,2})(.*)$/; # matches,
938 # $1 = 'a'
939 # $2 = 'mm'
940 # $3 = 'ing republic of Perl'
941
942 Just as in the previous regexp, the first quantifier ".??" can match
943 earliest at position 'a', so it does. The second quantifier is greedy,
944 so it matches "mm", and the third matches the rest of the string.
945
946 We can modify principle 3 above to take into account non-greedy quanti‐
947 fiers:
948
949 · Principle 3: If there are two or more elements in a regexp, the
950 leftmost greedy (non-greedy) quantifier, if any, will match as much
951 (little) of the string as possible while still allowing the whole
952 regexp to match. The next leftmost greedy (non-greedy) quantifier,
953 if any, will try to match as much (little) of the string remaining
954 available to it as possible, while still allowing the whole regexp
955 to match. And so on, until all the regexp elements are satisfied.
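
A common place where the greedy versus non-greedy distinction matters is pulling out the first of several delimited pieces. The sketch below uses a made-up line containing two quoted strings; the names $greedy and $minimal are hypothetical.

```perl
use strict;
use warnings;

my $line = 'say "hello" and "goodbye"';

# Greedy: .* runs to the end and backtracks to the LAST double quote.
my ($greedy) = $line =~ /"(.*)"/;

# Non-greedy: .*? stops at the FIRST double quote that lets the match succeed.
my ($minimal) = $line =~ /"(.*?)"/;

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
```

In both cases the match starts at the first double quote (Principle 0); the quantifier only decides where the match ends.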
956
957 Just like alternation, quantifiers are also susceptible to backtrack‐
958 ing. Here is a step-by-step analysis of the example
959
960 $x = "the cat in the hat";
961 $x =~ /^(.*)(at)(.*)$/; # matches,
962 # $1 = 'the cat in the h'
963 # $2 = 'at'
964 # $3 = '' (0 matches)
965
966 0 Start with the first letter in the string 't'.
967
968 1 The first quantifier '.*' starts out by matching the whole string
969 'the cat in the hat'.
970
971 2 'a' in the regexp element 'at' doesn't match the end of the string.
972 Backtrack one character.
973
974 3 'a' in the regexp element 'at' still doesn't match the last letter
975 of the string 't', so backtrack one more character.
976
977 4 Now we can match the 'a' and the 't'.
978
979 5 Move on to the third element '.*'. Since we are at the end of the
980 string and '.*' can match 0 times, assign it the empty string.
981
982 6 We are done!
983
984 Most of the time, all this moving forward and backtracking happens
985 quickly and searching is fast. There are some pathological regexps,
986 however, whose execution time exponentially grows with the size of the
987 string. A typical structure that blows up in your face is of the form
988
989 /(a|b+)*/;
990
991 The problem is the nested indeterminate quantifiers. There are many
992 different ways of partitioning a string of length n between the "+" and
993 "*": one repetition with "b+" of length n, two repetitions with the
994 first "b+" length k and the second with length n-k, m repetitions whose
995 bits add up to length n, etc. In fact there are an exponential number
996 of ways to partition a string as a function of length. A regexp may
997 get lucky and match early in the process, but if there is no match,
998 perl will try every possibility before giving up. So be careful with
999 nested "*"'s, "{n,m}"'s, and "+"'s. The book Mastering Regular Expres‐
1000 sions by Jeffrey Friedl gives a wonderful discussion of this and other
1001 efficiency issues.
1002
1003 Building a regexp
1004
1005 At this point, we have all the basic regexp concepts covered, so let's
1006 give a more involved example of a regular expression. We will build a
1007 regexp that matches numbers.
1008
1009 The first task in building a regexp is to decide what we want to match
1010 and what we want to exclude. In our case, we want to match both inte‐
1011 gers and floating point numbers and we want to reject any string that
1012 isn't a number.
1013
1014 The next task is to break the problem down into smaller problems that
1015 are easily converted into a regexp.
1016
1017 The simplest case is integers. These consist of a sequence of digits,
1018 with an optional sign in front. The digits we can represent with "\d+"
1019 and the sign can be matched with "[+-]". Thus the integer regexp is
1020
1021 /[+-]?\d+/; # matches integers
1022
1023 A floating point number potentially has a sign, an integral part, a
1024 decimal point, a fractional part, and an exponent. One or more of
1025 these parts is optional, so we need to check out the different possi‐
1026 bilities. Floating point numbers which are in proper form include
1027 123., 0.345, .34, -1e6, and 25.4E-72. As with integers, the sign out
1028 front is completely optional and can be matched by "[+-]?". We can see
1029 that if there is no exponent, floating point numbers must have a deci‐
1030 mal point, otherwise they are integers. We might be tempted to model
1031 these with "\d*\.\d*", but this would also match just a single decimal
1032 point, which is not a number. So the three cases of floating point
1033 number sans exponent are
1034
1035 /[+-]?\d+\./; # 1., 321., etc.
1036 /[+-]?\.\d+/; # .1, .234, etc.
1037 /[+-]?\d+\.\d+/; # 1.0, 30.56, etc.
1038
1039 These can be combined into a single regexp with a three-way alterna‐
1040 tion:
1041
1042 /[+-]?(\d+\.\d+|\d+\.|\.\d+)/; # floating point, no exponent
1043
1044 In this alternation, it is important to put '\d+\.\d+' before '\d+\.'.
1045 If '\d+\.' were first, the regexp would happily match that and ignore
1046 the fractional part of the number.
1047
1048 Now consider floating point numbers with exponents. The key observa‐
1049 tion here is that both integers and numbers with decimal points are
1050 allowed in front of an exponent. Then exponents, like the overall
1051 sign, are independent of whether we are matching numbers with or with‐
1052 out decimal points, and can be 'decoupled' from the mantissa. The
1053 overall form of the regexp now becomes clear:
1054
1055 /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;
1056
1057 The exponent is an "e" or "E", followed by an integer. So the exponent
1058 regexp is
1059
1060 /[eE][+-]?\d+/; # exponent
1061
1062 Putting all the parts together, we get a regexp that matches numbers:
1063
1064 /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/; # Ta da!
1065
1066 Long regexps like this may impress your friends, but can be hard to
1067 decipher. In complex situations like this, the "//x" modifier for a
1068 match is invaluable. It allows one to put nearly arbitrary whitespace
1069 and comments into a regexp without affecting its meaning. Using it,
1070 we can rewrite our 'extended' regexp in the more pleasing form
1071
1072 /^
1073 [+-]? # first, match an optional sign
1074 ( # then match integers or f.p. mantissas:
1075 \d+\.\d+ # mantissa of the form a.b
1076 |\d+\. # mantissa of the form a.
1077 |\.\d+ # mantissa of the form .b
1078 |\d+ # integer of the form a
1079 )
1080 ([eE][+-]?\d+)? # finally, optionally match an exponent
1081 $/x;
1082
1083 If whitespace is mostly irrelevant, how does one include space charac‐
1084 ters in an extended regexp? The answer is to backslash it, '\ ', or put
1085 it in a character class, "[ ]". The same thing goes for pound signs:
1086 use "\#" or "[#]". For instance, Perl allows a space between the sign
1087 and the mantissa/integer, and we could add this to our regexp as fol‐
1088 lows:
1089
1090 /^
1091 [+-]?\ * # first, match an optional sign *and space*
1092 ( # then match integers or f.p. mantissas:
1093 \d+\.\d+ # mantissa of the form a.b
1094 |\d+\. # mantissa of the form a.
1095 |\.\d+ # mantissa of the form .b
1096 |\d+ # integer of the form a
1097 )
1098 ([eE][+-]?\d+)? # finally, optionally match an exponent
1099 $/x;
1100
1101 In this form, it is easier to see a way to simplify the alternation.
1102 Alternatives 1, 2, and 4 all start with "\d+", so it could be factored
1103 out:
1104
1105 /^
1106 [+-]?\ * # first, match an optional sign
1107 ( # then match integers or f.p. mantissas:
1108 \d+ # start out with a ...
1109 (
1110 \.\d* # mantissa of the form a.b or a.
1111 )? # ? takes care of integers of the form a
1112 |\.\d+ # mantissa of the form .b
1113 )
1114 ([eE][+-]?\d+)? # finally, optionally match an exponent
1115 $/x;
1116
1117 or written in the compact form,
1118
1119 /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;
1120
1121 This is our final regexp. To recap, we built a regexp by
1122
1123 · specifying the task in detail,
1124
1125 · breaking down the problem into smaller parts,
1126
1127 · translating the small parts into regexps,
1128
1129 · combining the regexps,
1130
1131 · and optimizing the final combined regexp.
1132
1133 These are also the typical steps involved in writing a computer pro‐
1134 gram. This makes perfect sense, because regular expressions are essen‐
1135 tially programs written in a little computer language that specifies pat‐
1136 terns.
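
The same divide-and-combine approach can be sketched in code by keeping each piece in its own "qr//" object and interpolating the pieces into the final regexp. The names $sign, $mantissa, $exponent and $number are hypothetical, the optional space after the sign is left out for brevity, and the test strings are made up.

```perl
use strict;
use warnings;

# The small parts, each translated into its own regexp.
my $sign     = qr/[+-]?/;
my $mantissa = qr/\d+(\.\d*)?|\.\d+/;   # a, a., a.b or .b
my $exponent = qr/[eE][+-]?\d+/;

# Combine the parts; the groupings keep the alternation contained.
my $number = qr/^$sign($mantissa)($exponent)?$/;

my @good = grep {  /$number/ } qw(123. 0.345 .34 -1e6 25.4E-72 42);
my @bad  = grep { !/$number/ } qw(. abc 1e --1);

print "accepted: @good\n";
print "rejected: @bad\n";
```

Keeping the parts separate makes each one easy to test on its own, at the cost of a few extra variables.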
1137
1138 Using regular expressions in Perl
1139
1140 The last topic of Part 1 briefly covers how regexps are used in Perl
1141 programs. Where do they fit into Perl syntax?
1142
1143 We have already introduced the matching operator in its default "/reg‐
1144 exp/" and arbitrary delimiter "m!regexp!" forms. We have used the
1145 binding operator "=~" and its negation "!~" to test for string matches.
1146 Associated with the matching operator, we have discussed the single
1147 line "//s", multi-line "//m", case-insensitive "//i" and extended "//x"
1148 modifiers.
1149
1150 There are a few more things you might want to know about matching oper‐
1151 ators. First, we pointed out earlier that variables in regexps are
1152 substituted before the regexp is evaluated:
1153
1154 $pattern = 'Seuss';
1155 while (<>) {
1156 print if /$pattern/;
1157 }
1158
1159 This will print any lines containing the word "Seuss". It is not as
1160 efficient as it could be, however, because perl has to re-evaluate
1161 $pattern each time through the loop. If $pattern won't be changing
1162 over the lifetime of the script, we can add the "//o" modifier, which
1163 directs perl to only perform variable substitutions once:
1164
1165 #!/usr/bin/perl
1166 # Improved simple_grep
1167 $regexp = shift;
1168 while (<>) {
1169 print if /$regexp/o; # a good deal faster
1170 }
1171
1172 If you change $pattern after the first substitution happens, perl will
1173 ignore it. If you don't want any substitutions at all, use the special
1174 delimiter "m''":
1175
1176 @pattern = ('Seuss');
1177 while (<>) {
1178 print if m'@pattern'; # matches literal '@pattern', not 'Seuss'
1179 }
1180
1181 "m''" acts like single quotes on a regexp; all other "m" delimiters act
1182 like double quotes. If the regexp evaluates to the empty string, the
1183 regexp in the last successful match is used instead. So we have
1184
1185 "dog" =~ /d/; # 'd' matches
1186 "dogbert" =~ //; # this matches the 'd' regexp used before
1187
1188 The final two modifiers "//g" and "//c" concern multiple matches. The
1189 modifier "//g" stands for global matching and allows the matching oper‐
1190 ator to match within a string as many times as possible. In scalar
1191 context, successive invocations against a string will have "//g" jump
1192 from match to match, keeping track of position in the string as it goes
1193 along. You can get or set the position with the "pos()" function.
1194
1195 The use of "//g" is shown in the following example. Suppose we have a
1196 string that consists of words separated by spaces. If we know how many
1197 words there are in advance, we could extract the words using groupings:
1198
1199 $x = "cat dog house"; # 3 words
1200 $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches,
1201 # $1 = 'cat'
1202 # $2 = 'dog'
1203 # $3 = 'house'
1204
1205 But what if we had an indeterminate number of words? This is the sort
1206 of task "//g" was made for. To extract all words, form the simple reg‐
1207 exp "(\w+)" and loop over all matches with "/(\w+)/g":
1208
1209 while ($x =~ /(\w+)/g) {
1210 print "Word is $1, ends at position ", pos $x, "\n";
1211 }
1212
1213 prints
1214
1215 Word is cat, ends at position 3
1216 Word is dog, ends at position 7
1217 Word is house, ends at position 13
1218
1219 A failed match or changing the target string resets the position. If
1220 you don't want the position reset after failure to match, add the
1221 "//c" modifier, as in "/regexp/gc". The current position in the string is asso‐
1222 ciated with the string, not the regexp. This means that different
1223 strings have different positions and their respective positions can be
1224 set or read independently.
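
The effect of "//c" on the saved position can be observed with "pos()". This is a minimal sketch; the variable names are illustrative, and each match is assigned to a scalar so that it is evaluated in scalar context.

```perl
use strict;
use warnings;

my $x = "cat dog house";

my $ok = $x =~ /\w+/g;          # matches 'cat'
my $after_match = pos $x;       # position just past 'cat'

$ok = $x =~ /\d+/gc;            # fails, but //c keeps the position
my $after_failed_gc = pos $x;

$ok = $x =~ /\d+/g;             # fails without //c, so the position is reset
my $after_failed_g = pos $x;

print "after match:       $after_match\n";
print "after failed //gc: $after_failed_gc\n";
print "after failed //g:  ",
      (defined $after_failed_g ? $after_failed_g : 'undef'), "\n";
```

After the reset, "pos" returns the undefined value, and the next "//g" match on $x starts again from the beginning of the string.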
1225
1226 In list context, "//g" returns a list of matched groupings, or if there
1227 are no groupings, a list of matches to the whole regexp. So if we
1228 wanted just the words, we could use
1229
1230 @words = ($x =~ /(\w+)/g); # matches,
1231 # $words[0] = 'cat'
1232 # $words[1] = 'dog'
1233 # $words[2] = 'house'
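
When the regexp has more than one grouping, the list returned by "//g" contains the groupings of every match in order, which makes it convenient for filling a hash. The following is a sketch with a made-up "name=value" string; $config and %setting are hypothetical names.

```perl
use strict;
use warnings;

my $config = "width=80 height=24 depth=8";

# Each match contributes two list elements: a name and its value.
my %setting = $config =~ /(\w+)=(\d+)/g;

for my $name (sort keys %setting) {
    print "$name is $setting{$name}\n";
}
```
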
1234
1235 Closely associated with the "//g" modifier is the "\G" anchor. The
1236 "\G" anchor matches at the point where the previous "//g" match left
1237 off. "\G" allows us to easily do context-sensitive matching:
1238
1239 $metric = 1; # use metric units
1240 ...
1241 $x = <FILE>; # read in measurement
1242 $x =~ /^([+-]?\d+)\s*/g; # get magnitude
1243 $weight = $1;
1244 if ($metric) { # error checking
1245 print "Units error!" unless $x =~ /\Gkg\./g;
1246 }
1247 else {
1248 print "Units error!" unless $x =~ /\Glbs\./g;
1249 }
1250 $x =~ /\G\s+(widget|sprocket)/g; # continue processing
1251
1252 The combination of "//g" and "\G" allows us to process the string a bit
1253 at a time and use arbitrary Perl logic to decide what to do next. Cur‐
1254 rently, the "\G" anchor is only fully supported when used to anchor to
1255 the start of the pattern.
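
One way to apply this bit-at-a-time style is a tiny tokenizer that tries several "\G"-anchored regexps in turn, using "//gc" so that a failed attempt does not lose the position. This is only a sketch: the input string, the token names NUM and OP, and the @tokens array are assumptions made for illustration.

```perl
use strict;
use warnings;

my $expr = "12 + 345";
my @tokens;

pos($expr) = 0;                      # start scanning at the beginning
while (pos($expr) < length $expr) {
    if    ($expr =~ /\G\s+/gc)    { next }                       # skip spaces
    elsif ($expr =~ /\G(\d+)/gc)  { push @tokens, "NUM($1)" }
    elsif ($expr =~ /\G([-+])/gc) { push @tokens, "OP($1)" }
    else {
        die "Unexpected character at position ", pos($expr), "\n";
    }
}

print "@tokens\n";
```

Each branch either consumes some text and advances the position, or fails and leaves the position untouched for the next branch to try.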
1256
1257 "\G" is also invaluable in processing fixed length records with reg‐
1258 exps. Suppose we have a snippet of coding region DNA, encoded as base
1259 pair letters "ATCGTTGAAT..." and we want to find all the stop codons
1260 "TGA". In a coding region, codons are 3-letter sequences, so we can
1261 think of the DNA snippet as a sequence of 3-letter records. The naive
1262 regexp
1263
1264 # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
1265 $dna = "ATCGTTGAATGCAAATGACATGAC";
1266 $dna =~ /TGA/;
1267
1268 doesn't work; it may match a "TGA", but there is no guarantee that the
1269 match is aligned with codon boundaries, e.g., the substring "GTT GAA"
1270 gives a match. A better solution is
1271
1272 while ($dna =~ /(\w\w\w)*?TGA/g) { # note the minimal *?
1273 print "Got a TGA stop codon at position ", pos $dna, "\n";
1274 }
1275
1276 which prints
1277
1278 Got a TGA stop codon at position 18
1279 Got a TGA stop codon at position 23
1280
1281 Position 18 is good, but position 23 is bogus. What happened?
1282
1283 The answer is that our regexp works well until we get past the last
1284 real match. Then the regexp will fail to match a synchronized "TGA"
1285 and start stepping ahead one character position at a time, not what we
1286 want. The solution is to use "\G" to anchor the match to the codon
1287 alignment:
1288
1289 while ($dna =~ /\G(\w\w\w)*?TGA/g) {
1290 print "Got a TGA stop codon at position ", pos $dna, "\n";
1291 }
1292
1293 This prints
1294
1295 Got a TGA stop codon at position 18
1296
1297 which is the correct answer. This example illustrates that it is
1298 important not only to match what is desired, but to reject what is not
1299 desired.
1300
1301 Search and replace
1302
1303 Regular expressions also play a big role in search and replace opera‐
1304 tions in Perl. Search and replace is accomplished with the "s///"
1305 operator. The general form is "s/regexp/replacement/modifiers", with
1306 everything we know about regexps and modifiers applying in this case as
1307 well. The "replacement" is a Perl double quoted string that replaces
1308 in the string whatever is matched with the "regexp". The operator "=~"
1309 is also used here to associate a string with "s///". If matching
1310 against $_, the "$_ =~" can be dropped. If there is a match, "s///"
1311 returns the number of substitutions made; otherwise it returns false.
1312 Here are a few examples:
1313
1314 $x = "Time to feed the cat!";
1315 $x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!"
1316 if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
1317 $more_insistent = 1;
1318 }
1319 $y = "'quoted words'";
1320 $y =~ s/^'(.*)'$/$1/; # strip single quotes,
1321 # $y contains "quoted words"
1322
1323 In the last example, the whole string was matched, but only the part
1324 inside the single quotes was grouped. With the "s///" operator, the
1325 matched variables $1, $2, etc. are immediately available for use in
1326 the replacement expression, so we use $1 to replace the quoted string
1327 with just what was quoted. With the global modifier, "s///g" will
1328 search and replace all occurrences of the regexp in the string:
1329
1330 $x = "I batted 4 for 4";
1331 $x =~ s/4/four/; # doesn't do it all:
1332 # $x contains "I batted four for 4"
1333 $x = "I batted 4 for 4";
1334 $x =~ s/4/four/g; # does it all:
1335 # $x contains "I batted four for four"
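
Since "s///" returns the number of substitutions made, the return value of "s///g" can double as a counter. A minimal sketch, with hypothetical variable names:

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";

my $count = ($x =~ s/4/four/g);    # number of replacements made
print "$count substitutions: $x\n";

my $none = ($x =~ s/5/five/g);     # nothing to replace: returns false
print "no 5s found\n" unless $none;
```
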
1336
1337 If you prefer 'regex' over 'regexp' in this tutorial, you could use the
1338 following program to replace it:
1339
1340 % cat > simple_replace
1341 #!/usr/bin/perl
1342 $regexp = shift;
1343 $replacement = shift;
1344 while (<>) {
1345 s/$regexp/$replacement/go;
1346 print;
1347 }
1348 ^D
1349
1350 % simple_replace regexp regex perlretut.pod
1351
1352 In "simple_replace" we used the "s///g" modifier to replace all occur‐
1353 rences of the regexp on each line and the "s///o" modifier to compile
1354 the regexp only once. As with "simple_grep", both the "print" and the
1355 "s/$regexp/$replacement/go" use $_ implicitly.
1356
1357 A modifier available specifically to search and replace is the "s///e"
1358 evaluation modifier. "s///e" wraps an "eval{...}" around the replace‐
1359 ment string and the evaluated result is substituted for the matched
1360 substring. "s///e" is useful if you need to do a bit of computation in
1361 the process of replacing text. This example counts character frequen‐
1362 cies in a line:
1363
1364 $x = "Bill the cat";
1365 $x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself
1366 print "frequency of '$_' is $chars{$_}\n"
1367 foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);
1368
1369 This prints
1370
1371 frequency of ' ' is 2
1372 frequency of 't' is 2
1373 frequency of 'l' is 2
1374 frequency of 'B' is 1
1375 frequency of 'c' is 1
1376 frequency of 'e' is 1
1377 frequency of 'h' is 1
1378 frequency of 'i' is 1
1379 frequency of 'a' is 1
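
As another sketch of "s///e", the replacement below is a small arithmetic expression rather than a string. The sample text is made up, and the copy into $doubled is only there to leave the original string intact.

```perl
use strict;
use warnings;

my $order = "widget 10, sprocket 25";

# Copy $order, then double every number in the copy.
(my $doubled = $order) =~ s/(\d+)/$1 * 2/ge;

print "$doubled\n";
```

Without the "e" modifier the replacement would be the literal text "10 * 2" and "25 * 2" instead of the computed values.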
1380
1381 As with the match "m//" operator, "s///" can use other delimiters, such
1382 as "s!!!" and "s{}{}", and even "s{}//". If single quotes are used
1383 "s'''", then the regexp and replacement are treated as single quoted
1384 strings and there are no substitutions. "s///" in list context returns
1385 the same thing as in scalar context, i.e., the number of matches.
1386
1387 The split operator
1388
1389 The "split" function can also optionally use a matching operator "m//"
1390 to split a string. "split /regexp/, string, limit" splits "string"
1391 into a list of substrings and returns that list. The regexp is used to
1392 match the character sequence that the "string" is split with respect
1393 to. The "limit", if present, constrains splitting into no more than
1394 "limit" number of strings. For example, to split a string into words,
1395 use
1396
1397 $x = "Calvin and Hobbes";
1398 @words = split /\s+/, $x; # $words[0] = 'Calvin'
1399 # $words[1] = 'and'
1400 # $words[2] = 'Hobbes'
1401
1402 If the empty regexp "//" is used, the regexp always matches and the
1403 string is split into individual characters. If the regexp has group‐
1404 ings, then the list produced contains the matched substrings from the
1405 groupings as well. For instance,
1406
1407 $x = "/usr/bin/perl";
1408 @dirs = split m!/!, $x; # $dirs[0] = ''
1409 # $dirs[1] = 'usr'
1410 # $dirs[2] = 'bin'
1411 # $dirs[3] = 'perl'
1412 @parts = split m!(/)!, $x; # $parts[0] = ''
1413 # $parts[1] = '/'
1414 # $parts[2] = 'usr'
1415 # $parts[3] = '/'
1416 # $parts[4] = 'bin'
1417 # $parts[5] = '/'
1418 # $parts[6] = 'perl'
1419
1420 Since the first character of $x matched the regexp, "split" prepended
1421 an empty initial element to the list.
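
The optional "limit" argument mentioned above is not shown in the examples, so here is a brief sketch. The colon-separated record is made up for illustration.

```perl
use strict;
use warnings;

my $record = "root:x:0:0:System Administrator:/root:/bin/sh";

my @all   = split /:/, $record;       # split at every colon
my @first = split /:/, $record, 3;    # stop after producing 3 fields

print scalar(@all), " fields without a limit\n";
print "third field with limit 3: $first[2]\n";
```

With a limit of 3, the last element keeps the rest of the string, colons included.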
1422
1423 If you have read this far, congratulations! You now have all the basic
1424 tools needed to use regular expressions to solve a wide range of text
1425 processing problems. If this is your first time through the tutorial,
1426 why not stop here and play around with regexps a while... Part 2 con‐
1427 cerns the more esoteric aspects of regular expressions and those con‐
1428 cepts certainly aren't needed right at the start.
1429
1431 OK, you know the basics of regexps and you want to know more. If
1432 matching regular expressions is analogous to a walk in the woods, then
1433 the tools discussed in Part 1 are analogous to topo maps and a compass,
1434 basic tools we use all the time. Most of the tools in part 2 are anal‐
1435 ogous to flare guns and satellite phones. They aren't used too often
1436 on a hike, but when we are stuck, they can be invaluable.
1437
1438 What follows are the more advanced, less used, or sometimes esoteric
1439 capabilities of perl regexps. In Part 2, we will assume you are com‐
1440 fortable with the basics and concentrate on the new features.
1441
1442 More on characters, strings, and character classes
1443
1444 There are a number of escape sequences and character classes that we
1445 haven't covered yet.
1446
1447 There are several escape sequences that convert characters or strings
1448 between upper and lower case. "\l" and "\u" convert the next character
1449 to lower or upper case, respectively:
1450
1451 $x = "perl";
1452 $string =~ /\u$x/; # matches 'Perl' in $string
1453 $x = "M(rs?|s)\\."; # note the double backslash
1454 $string =~ /\l$x/; # matches 'mr.', 'mrs.', and 'ms.'
1455
1456 "\L" and "\U" convert a whole substring, delimited by "\L" or "\U" and
1457 "\E", to lower or upper case:
1458
1459 $x = "This word is in lower case:\L SHOUT\E";
1460 $x =~ /shout/; # matches
1461 $x = "I STILL KEYPUNCH CARDS FOR MY 360";
1462 $x =~ /\Ukeypunch/; # matches punch card string
1463
1464 If there is no "\E", case is converted until the end of the string. The
1465 regexps "\L\u$word" or "\u\L$word" convert the first character of $word
1466 to uppercase and the rest of the characters to lowercase.
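
A short sketch of the "\u\L" combination, both in an ordinary double quoted string and inside a regexp. The badly capitalized input "hOMER" and the variable names are made up.

```perl
use strict;
use warnings;

my $word = "hOMER";

# Lowercase everything, then uppercase the first character.
my $name = "\u\L$word";

# The same escapes work on a variable interpolated into a regexp.
my $hit = ("Mmm...donut, thought Homer" =~ /\u\L$word/) ? 1 : 0;

print "name=$name hit=$hit\n";
```
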
1467
Control characters can be escaped with "\c", so that a control-Z
character would be matched with "\cZ". The escape sequence "\Q"..."\E"
quotes, or protects, most non-alphabetic characters. For instance,
1471
    $x = "That !^*&%~& cat!";  # plain string; "\Q" here would add backslashes
    $x =~ /\Q!^*&%~&\E/;       # check for rough language
1474
1475 It does not protect "$" or "@", so that variables can still be substi‐
1476 tuted.
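
Since variables are still substituted inside "\Q"..."\E", the escape is
handy for searching for a string exactly as given. Here is a minimal
sketch of that idea; the helper name "contains_literal" and the sample
strings are assumptions made for illustration.

```perl
use strict;
use warnings;

# contains_literal() is a hypothetical helper: it reports whether $literal
# occurs in $text as plain text, with no regexp metacharacters honored.
sub contains_literal {
    my ($text, $literal) = @_;
    # $literal is interpolated first, then "\Q" quotes the result.
    return $text =~ /\Q$literal\E/ ? 1 : 0;
}

# Single quotes keep '$5' from being interpolated in these sample strings.
print contains_literal('cost: $5.00 (approx.)', '$5.00'), "\n";  # 1
print contains_literal('cost: $5x00 (approx.)', '$5.00'), "\n";  # 0, '.' is literal
```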
1477
1478 With the advent of 5.6.0, perl regexps can handle more than just the
1479 standard ASCII character set. Perl now supports Unicode, a standard
1480 for encoding the character sets from many of the world's written lan‐
1481 guages. Unicode does this by allowing characters to be more than one
1482 byte wide. Perl uses the UTF-8 encoding, in which ASCII characters are
1483 still encoded as one byte, but characters greater than "chr(127)" may
1484 be stored as two or more bytes.
1485
1486 What does this mean for regexps? Well, regexp users don't need to know
1487 much about perl's internal representation of strings. But they do need
1488 to know 1) how to represent Unicode characters in a regexp and 2) when
1489 a matching operation will treat the string to be searched as a sequence
1490 of bytes (the old way) or as a sequence of Unicode characters (the new
1491 way). The answer to 1) is that Unicode characters greater than
1492 "chr(127)" may be represented using the "\x{hex}" notation, with "hex"
1493 a hexadecimal integer:
1494
1495 /\x{263a}/; # match a Unicode smiley face :)
1496
Unicode characters in the range of 128-255 use two hexadecimal digits
with braces: "\x{ab}". Note that this is different from "\xab", which
is just a hexadecimal byte with no Unicode significance.
1500
NOTE: in Perl 5.6.0 one needed to say "use utf8" to use any Unicode
features. This is no longer the case: for almost all Unicode
processing, the explicit "utf8" pragma is not needed. (The only case
where it matters is when your Perl script itself is written in Unicode
and encoded in UTF-8; then an explicit "use utf8" is needed.)
1506
1507 Figuring out the hexadecimal sequence of a Unicode character you want
1508 or deciphering someone else's hexadecimal Unicode regexp is about as
1509 much fun as programming in machine code. So another way to specify
1510 Unicode characters is to use the named character escape sequence
1511 "\N{name}". "name" is a name for the Unicode character, as specified
1512 in the Unicode standard. For instance, if we wanted to represent or
1513 match the astrological sign for the planet Mercury, we could use
1514
1515 use charnames ":full"; # use named chars with Unicode full names
1516 $x = "abc\N{MERCURY}def";
1517 $x =~ /\N{MERCURY}/; # matches
1518
1519 One can also use short names or restrict names to a certain alphabet:
1520
1521 use charnames ':full';
1522 print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n";
1523
1524 use charnames ":short";
1525 print "\N{greek:Sigma} is an upper-case sigma.\n";
1526
1527 use charnames qw(greek);
1528 print "\N{sigma} is Greek sigma\n";
1529
1530 A list of full names is found in the file Names.txt in the
1531 lib/perl5/5.X.X/unicore directory.
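
To see how the named and hexadecimal notations relate, here is a short
sketch. It assumes a perl recent enough to provide
"charnames::viacode", and the variable names are ours.

```perl
use strict;
use warnings;
use charnames ':full';

# MERCURY is code point U+263F, so both spellings denote the same character.
my $by_name = "\N{MERCURY}";
my $by_hex  = "\x{263f}";
print "same character\n" if $by_name eq $by_hex;

# charnames::viacode() goes the other way, from code point to full name.
my $name = charnames::viacode(0x263a);   # the smiley face used earlier
print "U+263A is named $name\n";
```

Looking a name up this way can save a trip to Names.txt when
deciphering someone else's hexadecimal regexp.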
1532
1533 The answer to requirement 2), as of 5.6.0, is that if a regexp contains
1534 Unicode characters, the string is searched as a sequence of Unicode
1535 characters. Otherwise, the string is searched as a sequence of bytes.
1536 If the string is being searched as a sequence of Unicode characters,
1537 but matching a single byte is required, we can use the "\C" escape
1538 sequence. "\C" is a character class akin to "." except that it matches
1539 any byte 0-255. So
1540
    use charnames ":full"; # use named chars with Unicode full names
    $x = "a";
    $x =~ /\C/;            # matches 'a', eats one byte
    $x = "";
    $x =~ /\C/;            # doesn't match, no bytes to match
    $x = "\N{MERCURY}";    # multi-byte Unicode character
    $x =~ /\C/;            # matches, but dangerous!
1548
1549 The last regexp matches, but is dangerous because the string character
1550 position is no longer synchronized to the string byte position. This
1551 generates the warning 'Malformed UTF-8 character'. The "\C" is best
1552 used for matching the binary data in strings with binary data inter‐
1553 mixed with Unicode characters.
1554
1555 Let us now discuss the rest of the character classes. Just as with
1556 Unicode characters, there are named Unicode character classes repre‐
1557 sented by the "\p{name}" escape sequence. Closely associated is the
1558 "\P{name}" character class, which is the negation of the "\p{name}"
1559 class. For example, to match lower and uppercase characters,
1560
1561 use charnames ":full"; # use named chars with Unicode full names
1562 $x = "BOB";
1563 $x =~ /^\p{IsUpper}/; # matches, uppercase char class
1564 $x =~ /^\P{IsUpper}/; # doesn't match, char class sans uppercase
1565 $x =~ /^\p{IsLower}/; # doesn't match, lowercase char class
1566 $x =~ /^\P{IsLower}/; # matches, char class sans lowercase
1567
1568 Here is the association between some Perl named classes and the tradi‐
1569 tional Unicode classes:
1570
    Perl class name  Unicode class name or regular expression

    IsAlpha          /^[LM]/
    IsAlnum          /^[LMN]/
    IsASCII          $code <= 127
    IsCntrl          /^C/
    IsBlank          $code =~ /^(0020|0009)$/ || /^Z[^lp]/
    IsDigit          Nd
    IsGraph          /^([LMNPS]|Co)/
    IsLower          Ll
    IsPrint          /^([LMNPS]|Co|Zs)/
    IsPunct          /^P/
    IsSpace          /^Z/ || ($code =~ /^(0009|000A|000B|000C|000D)$/)
    IsSpacePerl      /^Z/ || ($code =~ /^(0009|000A|000C|000D|0085|2028|2029)$/)
    IsUpper          /^L[ut]/
    IsWord           /^[LMN]/ || $code eq "005F"
    IsXDigit         $code =~ /^00(3[0-9]|[46][1-6])$/
1588
1589 You can also use the official Unicode class names with the "\p" and
1590 "\P", like "\p{L}" for Unicode 'letters', or "\p{Lu}" for uppercase
1591 letters, or "\P{Nd}" for non-digits. If a "name" is just one letter,
1592 the braces can be dropped. For instance, "\pM" is the character class
1593 of Unicode 'marks', for example accent marks. For the full list see
1594 perlunicode.
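
The official class names can be used like any other character class.
The following sketch counts uppercase letters with "\p{Lu}" and strips
non-digits with "\P{Nd}"; the helper name "count_upper" and the sample
strings are invented for this example.

```perl
use strict;
use warnings;

# count_upper() is a hypothetical helper that counts uppercase letters
# in any script, using the official Unicode class name Lu.
sub count_upper {
    my ($string) = @_;
    my $count = () = $string =~ /\p{Lu}/g;
    return $count;
}

print count_upper("Hello World"), "\n";       # 2
# GREEK CAPITAL LETTER ALPHA followed by GREEK SMALL LETTER BETA
print count_upper("\x{0391}\x{03b2}"), "\n";  # 1

# "\P" negates the class: strip everything that is not a decimal digit.
(my $digits = "tel. 555-0100") =~ s/\P{Nd}+//g;
print "$digits\n";                            # 5550100
```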
1595
Unicode has also been separated into various sets of characters
which you can test with "\p{In...}" (in) and "\P{In...}" (not in), for
example "\p{Latin}", "\p{Greek}", or "\P{Katakana}". For the full list
see perlunicode.
1600
"\X" is an abbreviation for a character class sequence that includes
the Unicode 'combining character sequences'. A 'combining character
sequence' is a base character followed by any number of combining
characters. An example of a combining character is an accent. Using
the Unicode full names, e.g., "A + COMBINING RING" is a combining
character sequence with base character "A" and combining character
"COMBINING RING", which translates in Danish to A with the circle atop
it, as in the word Angstrom. "\X" is equivalent to "\PM\pM*", i.e., a
non-mark followed by zero or more marks.
1610
1611 For the full and latest information about Unicode see the latest Uni‐
1612 code standard, or the Unicode Consortium's website http://www.uni‐
1613 code.org/
1614
1615 As if all those classes weren't enough, Perl also defines POSIX style
1616 character classes. These have the form "[:name:]", with "name" the
1617 name of the POSIX class. The POSIX classes are "alpha", "alnum",
1618 "ascii", "cntrl", "digit", "graph", "lower", "print", "punct", "space",
1619 "upper", and "xdigit", and two extensions, "word" (a Perl extension to
1620 match "\w"), and "blank" (a GNU extension). If "utf8" is being used,
1621 then these classes are defined the same as their corresponding perl
1622 Unicode classes: "[:upper:]" is the same as "\p{IsUpper}", etc. The
1623 POSIX character classes, however, don't require using "utf8". The
1624 "[:digit:]", "[:word:]", and "[:space:]" correspond to the familiar
1625 "\d", "\w", and "\s" character classes. To negate a POSIX class, put a
1626 "^" in front of the name, so that, e.g., "[:^digit:]" corresponds to
1627 "\D" and under "utf8", "\P{IsDigit}". The Unicode and POSIX character
1628 classes can be used just like "\d", with the exception that POSIX char‐
1629 acter classes can only be used inside of a character class:
1630
1631 /\s+[abc[:digit:]xyz]\s*/; # match a,b,c,x,y,z, or a digit
1632 /^=item\s[[:digit:]]/; # match '=item',
1633 # followed by a space and a digit
1634 use charnames ":full";
1635 /\s+[abc\p{IsDigit}xyz]\s+/; # match a,b,c,x,y,z, or a digit
1636 /^=item\s\p{IsDigit}/; # match '=item',
1637 # followed by a space and a digit
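
A common slip is to write "[:digit:]" on its own, without the enclosing
brackets of a character class. The sketch below shows the bracketed
form doing some everyday work; the helper name "is_identifier" and the
sample strings are assumptions for illustration.

```perl
use strict;
use warnings;

# is_identifier() is a hypothetical helper: a letter or underscore followed
# by any number of word characters.
sub is_identifier {
    my ($name) = @_;
    return $name =~ /^[[:alpha:]_][[:word:]]*$/ ? 1 : 0;
}

print is_identifier("multi_grep"), "\n";   # 1
print is_identifier("2fast"), "\n";        # 0, starts with a digit

# A negated POSIX class: collect the runs of non-space characters.
my @fields = "alpha  beta\tgamma" =~ /[[:^space:]]+/g;
print scalar(@fields), " fields\n";        # 3 fields
```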
1638
1639 Whew! That is all the rest of the characters and character classes.
1640
1641 Compiling and saving regular expressions
1642
1643 In Part 1 we discussed the "//o" modifier, which compiles a regexp just
1644 once. This suggests that a compiled regexp is some data structure that
1645 can be stored once and used again and again. The regexp quote "qr//"
1646 does exactly that: "qr/string/" compiles the "string" as a regexp and
1647 transforms the result into a form that can be assigned to a variable:
1648
1649 $reg = qr/foo+bar?/; # reg contains a compiled regexp
1650
1651 Then $reg can be used as a regexp:
1652
1653 $x = "fooooba";
1654 $x =~ $reg; # matches, just like /foo+bar?/
1655 $x =~ /$reg/; # same thing, alternate form
1656
1657 $reg can also be interpolated into a larger regexp:
1658
1659 $x =~ /(abc)?$reg/; # still matches
1660
1661 As with the matching operator, the regexp quote can use different
1662 delimiters, e.g., "qr!!", "qr{}" and "qr~~". The single quote delim‐
1663 iters "qr''" prevent any interpolation from taking place.
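
Because a "qr//" object interpolates cleanly, larger regexps can be
assembled from named pieces. This is a sketch under our own naming:
"$integer", "$range" and "parse_range" do not come from the tutorial.

```perl
use strict;
use warnings;

my $integer = qr/[+-]?\d+/;                     # building block
my $range   = qr/^($integer)\.\.($integer)$/;   # e.g. '-3..12'

# parse_range() is a hypothetical helper that returns the two endpoints,
# or an empty list if the string is not a range.
sub parse_range {
    my ($text) = @_;
    return $text =~ $range ? ($1, $2) : ();
}

my ($low, $high) = parse_range("-3..12");
print "from $low to $high\n";                   # from -3 to 12
```

Each piece is compiled once, and the pieces stay readable and testable
on their own.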
1664
Pre-compiled regexps are useful for creating dynamic matches that don't
need to be recompiled each time they are encountered. Using
pre-compiled regexps, the "simple_grep" program can be expanded into a
program that matches multiple patterns:
1669
1670 % cat > multi_grep
1671 #!/usr/bin/perl
1672 # multi_grep - match any of <number> regexps
1673 # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ...
1674
1675 $number = shift;
1676 $regexp[$_] = shift foreach (0..$number-1);
1677 @compiled = map qr/$_/, @regexp;
1678 while ($line = <>) {
1679 foreach $pattern (@compiled) {
1680 if ($line =~ /$pattern/) {
1681 print $line;
1682 last; # we matched, so move onto the next line
1683 }
1684 }
1685 }
1686 ^D
1687
1688 % multi_grep 2 last for multi_grep
1689 $regexp[$_] = shift foreach (0..$number-1);
1690 foreach $pattern (@compiled) {
1691 last;
1692
1693 Storing pre-compiled regexps in an array @compiled allows us to simply
1694 loop through the regexps without any recompilation, thus gaining flexi‐
1695 bility without sacrificing speed.
1696
1697 Embedding comments and modifiers in a regular expression
1698
1699 Starting with this section, we will be discussing Perl's set of
1700 extended patterns. These are extensions to the traditional regular
1701 expression syntax that provide powerful new tools for pattern matching.
1702 We have already seen extensions in the form of the minimal matching
1703 constructs "??", "*?", "+?", "{n,m}?", and "{n,}?". The rest of the
1704 extensions below have the form "(?char...)", where the "char" is a
1705 character that determines the type of extension.
1706
1707 The first extension is an embedded comment "(?#text)". This embeds a
1708 comment into the regular expression without affecting its meaning. The
1709 comment should not have any closing parentheses in the text. An exam‐
1710 ple is
1711
1712 /(?# Match an integer:)[+-]?\d+/;
1713
1714 This style of commenting has been largely superseded by the raw,
1715 freeform commenting that is allowed with the "//x" modifier.
1716
The modifiers "//i", "//m", "//s", and "//x" can also be embedded in a
regexp using "(?i)", "(?m)", "(?s)", and "(?x)". For instance,
1719
1720 /(?i)yes/; # match 'yes' case insensitively
1721 /yes/i; # same thing
1722 /(?x)( # freeform version of an integer regexp
1723 [+-]? # match an optional sign
1724 \d+ # match a sequence of digits
1725 )
1726 /x;
1727
Embedded modifiers have two important advantages over the usual
modifiers. First, embedded modifiers allow a custom set of modifiers
for each regexp pattern. This is great for matching an array of regexps
that must have different modifiers:
1732
1733 $pattern[0] = '(?i)doctor';
1734 $pattern[1] = 'Johnson';
1735 ...
1736 while (<>) {
1737 foreach $patt (@pattern) {
1738 print if /$patt/;
1739 }
1740 }
1741
1742 The second advantage is that embedded modifiers only affect the regexp
1743 inside the group the embedded modifier is contained in. So grouping
1744 can be used to localize the modifier's effects:
1745
1746 /Answer: ((?i)yes)/; # matches 'Answer: yes', 'Answer: YES', etc.
1747
1748 Embedded modifiers can also turn off any modifiers already present by
1749 using, e.g., "(?-i)". Modifiers can also be combined into a single
1750 expression, e.g., "(?s-i)" turns on single line mode and turns off case
1751 insensitivity.
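
Both advantages can be checked with a few lines of code. The sketch
below reuses the two patterns from the example above; the helper name
"matches_any" and the test strings are our own.

```perl
use strict;
use warnings;

# Each pattern carries its own modifiers, as in the @pattern example above.
my @patterns = ('(?i)doctor', 'Johnson');

# matches_any() is a hypothetical helper.
sub matches_any {
    my ($line) = @_;
    foreach my $patt (@patterns) {
        return 1 if $line =~ /$patt/;
    }
    return 0;
}

print matches_any("DOCTOR Who"), "\n";    # 1, first pattern ignores case
print matches_any("dr. johnson"), "\n";   # 0, second pattern is case sensitive

# Inside a group the modifier is local to that group.
print "scoped\n"      if "Answer: YES" =~ /Answer: ((?i)yes)/;
print "not reached\n" if "ANSWER: yes" =~ /Answer: ((?i)yes)/;
```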
1752
1753 Non-capturing groupings
1754
1755 We noted in Part 1 that groupings "()" had two distinct functions: 1)
1756 group regexp elements together as a single unit, and 2) extract, or
1757 capture, substrings that matched the regexp in the grouping. Non-cap‐
1758 turing groupings, denoted by "(?:regexp)", allow the regexp to be
1759 treated as a single unit, but don't extract substrings or set matching
1760 variables $1, etc. Both capturing and non-capturing groupings are
1761 allowed to co-exist in the same regexp. Because there is no extrac‐
1762 tion, non-capturing groupings are faster than capturing groupings.
1763 Non-capturing groupings are also handy for choosing exactly which parts
1764 of a regexp are to be extracted to matching variables:
1765
    # match a number, $1-$4 are set, but we only want $1
    /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/;

    # match a number faster, only $1 is set
    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/;

    # match a number, get $1 = whole number, $2 = exponent
    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;
1774
1775 Non-capturing groupings are also useful for removing nuisance elements
1776 gathered from a split operation:
1777
    $x = '12a34b5';
    @num = split /(a|b)/, $x;    # @num = ('12','a','34','b','5')
    @num = split /(?:a|b)/, $x;  # @num = ('12','34','5')
1781
Non-capturing groupings may also have embedded modifiers:
"(?i-m:regexp)" is a non-capturing grouping that matches "regexp" case
insensitively and turns off multi-line mode.
1785
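To make the numbering of the matching variables concrete, here is a
sketch that wraps the number-matching regexp from this section in a
function. The name "parse_number" and the added "^" and "$" anchors are
assumptions for illustration.

```perl
use strict;
use warnings;

# parse_number() is a hypothetical helper built on the number-matching
# regexp above, anchored at both ends. It returns the whole number and
# the exponent (undef when there is none), or an empty list on failure.
sub parse_number {
    my ($text) = @_;
    return unless $text =~ /^([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)$/;
    return ($1, $2);
}

my ($number, $exponent) = parse_number("6.02e23");
print "number $number, exponent $exponent\n";   # number 6.02e23, exponent 23
```

Only the two capturing groupings get numbers; the "(?:...)" groupings
are skipped when counting.
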
1786 Looking ahead and looking behind
1787
1788 This section concerns the lookahead and lookbehind assertions. First,
1789 a little background.
1790
In Perl regular expressions, most regexp elements 'eat up' a certain
amount of string when they match. For instance, the regexp element
"[abc]" eats up one character of the string when it matches, in the
sense that perl moves to the next character position in the string
after the match. There are some elements, however, that don't eat up
characters (advance the character position) if they match. The
examples we have seen so far are the anchors. The anchor "^" matches
the beginning of the line, but doesn't eat any characters. Similarly,
the word boundary anchor "\b" matches, e.g., if the character to the
left is a word character and the character to the right is a non-word
character, but it doesn't eat up any characters itself. Anchors are
examples of 'zero-width assertions'. Zero-width, because they consume
no characters, and assertions, because they test some property of the
string. In the context of our walk in the woods analogy to regexp
matching, most regexp elements move us along a trail, but anchors have
us stop a moment and check our surroundings. If the local environment
checks out, we can proceed forward. But if the local environment
doesn't satisfy us, we must backtrack.
1809
1810 Checking the environment entails either looking ahead on the trail,
1811 looking behind, or both. "^" looks behind, to see that there are no
1812 characters before. "$" looks ahead, to see that there are no charac‐
1813 ters after. "\b" looks both ahead and behind, to see if the characters
1814 on either side differ in their 'word'-ness.
1815
1816 The lookahead and lookbehind assertions are generalizations of the
1817 anchor concept. Lookahead and lookbehind are zero-width assertions
1818 that let us specify which characters we want to test for. The looka‐
1819 head assertion is denoted by "(?=regexp)" and the lookbehind assertion
1820 is denoted by "(?<=fixed-regexp)". Some examples are
1821
1822 $x = "I catch the housecat 'Tom-cat' with catnip";
1823 $x =~ /cat(?=\s+)/; # matches 'cat' in 'housecat'
1824 @catwords = ($x =~ /(?<=\s)cat\w+/g); # matches,
1825 # $catwords[0] = 'catch'
1826 # $catwords[1] = 'catnip'
1827 $x =~ /\bcat\b/; # matches 'cat' in 'Tom-cat'
1828 $x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in
1829 # middle of $x
1830
Note that the parentheses in "(?=regexp)" and "(?<=regexp)" are
non-capturing, since these are zero-width assertions. Thus in the
second regexp, the substrings captured are those of the whole regexp
itself. Lookahead "(?=regexp)" can match arbitrary regexps, but
lookbehind "(?<=fixed-regexp)" only works for regexps of fixed width,
i.e., a fixed number of characters long. Thus "(?<=(ab|bc))" is fine,
but "(?<=(ab)*)" is not. The negated versions of the lookahead and
lookbehind assertions are denoted by "(?!regexp)" and
"(?<!fixed-regexp)" respectively. They evaluate true if the regexps do
not match:
1840
1841 $x = "foobar";
1842 $x =~ /foo(?!bar)/; # doesn't match, 'bar' follows 'foo'
1843 $x =~ /foo(?!baz)/; # matches, 'baz' doesn't follow 'foo'
1844 $x =~ /(?<!\s)foo/; # matches, there is no \s before 'foo'
1845
1846 The "\C" is unsupported in lookbehind, because the already treacherous
1847 definition of "\C" would become even more so when going backwards.
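
Because the assertions consume nothing, a regexp made only of
assertions matches a position rather than any characters. One
well-known use is inserting thousands separators, sketched here with a
made-up helper name "commify".

```perl
use strict;
use warnings;

# commify() is a hypothetical helper. The regexp matches the empty string
# at every position that has a digit behind it and a whole number of
# three-digit groups ahead of it, and the substitution puts a comma there.
sub commify {
    my ($integer) = @_;
    $integer =~ s/(?<=\d)(?=(?:\d{3})+$)/,/g;
    return $integer;
}

print commify("1234567"), "\n";   # 1,234,567
print commify("123"), "\n";       # 123
```

The lookbehind is one character wide, so it satisfies the fixed-width
restriction, while the lookahead is free to use a quantifier.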
1848
1849 Using independent subexpressions to prevent backtracking
1850
1851 The last few extended patterns in this tutorial are experimental as of
1852 5.6.0. Play with them, use them in some code, but don't rely on them
1853 just yet for production code.
1854
1855 Independent subexpressions are regular expressions, in the context of
1856 a larger regular expression, that function independently of the larger
1857 regular expression. That is, they consume as much or as little of the
1858 string as they wish without regard for the ability of the larger regexp
1859 to match. Independent subexpressions are represented by "(?>regexp)".
1860 We can illustrate their behavior by first considering an ordinary reg‐
1861 exp:
1862
1863 $x = "ab";
1864 $x =~ /a*ab/; # matches
1865
1866 This obviously matches, but in the process of matching, the subexpres‐
1867 sion "a*" first grabbed the "a". Doing so, however, wouldn't allow the
1868 whole regexp to match, so after backtracking, "a*" eventually gave back
1869 the "a" and matched the empty string. Here, what "a*" matched was
1870 dependent on what the rest of the regexp matched.
1871
1872 Contrast that with an independent subexpression:
1873
1874 $x =~ /(?>a*)ab/; # doesn't match!
1875
1876 The independent subexpression "(?>a*)" doesn't care about the rest of
1877 the regexp, so it sees an "a" and grabs it. Then the rest of the reg‐
1878 exp "ab" cannot match. Because "(?>a*)" is independent, there is no
1879 backtracking and the independent subexpression does not give up its
1880 "a". Thus the match of the regexp as a whole fails. A similar behav‐
1881 ior occurs with completely independent regexps:
1882
1883 $x = "ab";
1884 $x =~ /a*/g; # matches, eats an 'a'
1885 $x =~ /\Gab/g; # doesn't match, no 'a' available
1886
1887 Here "//g" and "\G" create a 'tag team' handoff of the string from one
1888 regexp to the other. Regexps with an independent subexpression are
1889 much like this, with a handoff of the string to the independent subex‐
1890 pression, and a handoff of the string back to the enclosing regexp.
1891
1892 The ability of an independent subexpression to prevent backtracking can
1893 be quite useful. Suppose we want to match a non-empty string enclosed
1894 in parentheses up to two levels deep. Then the following regexp
1895 matches:
1896
    $x = "abc(de(fg)h";  # unbalanced parentheses
    $x =~ /\( ( [^()]+ | \([^()]*\) )+ \)/x;
1899
The regexp matches an open parenthesis, one or more copies of an
alternation, and a close parenthesis. The alternation is two-way, with
the first alternative "[^()]+" matching a substring with no parentheses
and the second alternative "\([^()]*\)" matching a substring delimited
by parentheses. The problem with this regexp is that it is
pathological: it has nested indeterminate quantifiers of the form
"(a+|b)+". We discussed in Part 1 how nested quantifiers like this
could take an exponentially long time to execute if there was no match
possible. To prevent the exponential blowup, we need to prevent useless
backtracking at some point. This can be done by enclosing the inner
quantifier as an independent subexpression:
1911
    $x =~ /\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;
1913
1914 Here, "(?>[^()]+)" breaks the degeneracy of string partitioning by gob‐
1915 bling up as much of the string as possible and keeping it. Then match
1916 failures fail much more quickly.
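
The behavior described in this section can be tried out directly. In
this sketch the helper name "first_group" is our own, and an extra
outer capturing grouping is added so that the matched text can be
returned.

```perl
use strict;
use warnings;

# The ordinary quantifier backtracks; the independent one never gives back.
print "plain matches\n"       if "ab" =~ /a*ab/;
print "independent matches\n" if "ab" =~ /(?>a*)ab/;   # not printed

# first_group() is a hypothetical helper that returns the first
# parenthesized group, nested up to two levels deep.
sub first_group {
    my ($text) = @_;
    return $text =~ /( \( (?: (?>[^()]+) | \([^()]*\) )+ \) )/x ? $1 : undef;
}

print first_group("abc(de(fg)h"), "\n";   # (fg)
print first_group("x(a(b)c)y"), "\n";     # (a(b)c)
```

In the first string the outer parenthesis is never closed, so the only
complete group is the inner one.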
1917
1918 Conditional expressions
1919
A conditional expression is a form of if-then-else statement that
allows one to choose which patterns are to be matched, based on some
condition. There are two types of conditional expression:
"(?(condition)yes-regexp)" and "(?(condition)yes-regexp|no-regexp)".
"(?(condition)yes-regexp)" is like an 'if () {}' statement in Perl. If
the "condition" is true, the "yes-regexp" will be matched. If the
"condition" is false, the "yes-regexp" will be skipped and perl will
move onto the next regexp element. The second form is like an
'if () {} else {}' statement in Perl. If the "condition" is true, the
"yes-regexp" will be matched, otherwise the "no-regexp" will be
matched.
1931
1932 The "condition" can have two forms. The first form is simply an inte‐
1933 ger in parentheses "(integer)". It is true if the corresponding back‐
1934 reference "\integer" matched earlier in the regexp. The second form is
1935 a bare zero width assertion "(?...)", either a lookahead, a lookbehind,
1936 or a code assertion (discussed in the next section).
1937
1938 The integer form of the "condition" allows us to choose, with more
1939 flexibility, what to match based on what matched earlier in the regexp.
1940 This searches for words of the form "$x$x" or "$x$y$y$x":
1941
    % simple_grep '^(\w+)(\w+)?(?(2)\2\1|\1)$' /usr/dict/words
    beriberi
    coco
    couscous
    deed
    ...
    toot
    toto
    tutu
1951
1952 The lookbehind "condition" allows, along with backreferences, an ear‐
1953 lier part of the match to influence a later part of the match. For
1954 instance,
1955
    /[ATGC]+(?(?<=AA)G|C)$/;
1957
matches a DNA sequence such that it either ends in "AAG", or some other
base pair combination and "C". Note that the form is "(?(?<=AA)G|C)"
and not "(?((?<=AA))G|C)"; for the lookahead, lookbehind or code
assertions, the parentheses around the conditional are not needed.
1962
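The one-armed form "(?(condition)yes-regexp)" with an integer condition
is often enough on its own. As a sketch, the following accepts a word
that is either bare or enclosed in double quotes, but not half-quoted;
the helper name "is_word_or_quoted" is invented for this example.

```perl
use strict;
use warnings;

# is_word_or_quoted() is a hypothetical helper. Group 1 captures an optional
# opening quote; the condition (1) then demands a closing quote only if the
# opening one was matched.
sub is_word_or_quoted {
    my ($text) = @_;
    return $text =~ /^(")?\w+(?(1)")$/ ? 1 : 0;
}

print is_word_or_quoted('"perl"'), "\n";   # 1
print is_word_or_quoted('perl'), "\n";     # 1
print is_word_or_quoted('"perl'), "\n";    # 0, closing quote missing
print is_word_or_quoted('perl"'), "\n";    # 0, opening quote missing
```
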
1963 A bit of magic: executing Perl code in a regular expression
1964
1965 Normally, regexps are a part of Perl expressions. Code evaluation
1966 expressions turn that around by allowing arbitrary Perl code to be a
1967 part of a regexp. A code evaluation expression is denoted "(?{code})",
1968 with "code" a string of Perl statements.
1969
1970 Code expressions are zero-width assertions, and the value they return
1971 depends on their environment. There are two possibilities: either the
1972 code expression is used as a conditional in a conditional expression
1973 "(?(condition)...)", or it is not. If the code expression is a condi‐
1974 tional, the code is evaluated and the result (i.e., the result of the
1975 last statement) is used to determine truth or falsehood. If the code
1976 expression is not used as a conditional, the assertion always evaluates
1977 true and the result is put into the special variable $^R. The variable
1978 $^R can then be used in code expressions later in the regexp. Here are
1979 some silly examples:
1980
1981 $x = "abcdef";
1982 $x =~ /abc(?{print "Hi Mom!";})def/; # matches,
1983 # prints 'Hi Mom!'
1984 $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
1985 # no 'Hi Mom!'
1986
1987 Pay careful attention to the next example:
1988
1989 $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match,
1990 # no 'Hi Mom!'
1991 # but why not?
1992
1993 At first glance, you'd think that it shouldn't print, because obviously
1994 the "ddd" isn't going to match the target string. But look at this
1995 example:
1996
1997 $x =~ /abc(?{print "Hi Mom!";})[d]dd/; # doesn't match,
1998 # but _does_ print
1999
2000 Hmm. What happened here? If you've been following along, you know that
2001 the above pattern should be effectively the same as the last one --
2002 enclosing the d in a character class isn't going to change what it
2003 matches. So why does the first not print while the second one does?
2004
2005 The answer lies in the optimizations the REx engine makes. In the first
2006 case, all the engine sees are plain old characters (aside from the
2007 "?{}" construct). It's smart enough to realize that the string 'ddd'
2008 doesn't occur in our target string before actually running the pattern
2009 through. But in the second case, we've tricked it into thinking that
2010 our pattern is more complicated than it is. It takes a look, sees our
2011 character class, and decides that it will have to actually run the pat‐
2012 tern to determine whether or not it matches, and in the process of run‐
2013 ning it hits the print statement before it discovers that we don't have
2014 a match.
2015
2016 To take a closer look at how the engine does optimizations, see the
2017 section "Pragmas and debugging" below.
2018
2019 More fun with "?{}":
2020
2021 $x =~ /(?{print "Hi Mom!";})/; # matches,
2022 # prints 'Hi Mom!'
2023 $x =~ /(?{$c = 1;})(?{print "$c";})/; # matches,
2024 # prints '1'
2025 $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,
2026 # prints '1'
2027
2028 The bit of magic mentioned in the section title occurs when the regexp
2029 backtracks in the process of searching for a match. If the regexp
2030 backtracks over a code expression and if the variables used within are
2031 localized using "local", the changes in the variables produced by the
2032 code expression are undone! Thus, if we wanted to count how many times
2033 a character got matched inside a group, we could use, e.g.,
2034
2035 $x = "aaaa";
2036 $count = 0; # initialize 'a' count
2037 $c = "bob"; # test if $c gets clobbered
2038 $x =~ /(?{local $c = 0;}) # initialize count
2039 ( a # match 'a'
2040 (?{local $c = $c + 1;}) # increment count
2041 )* # do this any number of times,
2042 aa # but match 'aa' at the end
2043 (?{$count = $c;}) # copy local $c var into $count
2044 /x;
2045 print "'a' count is $count, \$c variable is '$c'\n";
2046
2047 This prints
2048
2049 'a' count is 2, $c variable is 'bob'
2050
If we replace the "(?{local $c = $c + 1;})" with "(?{$c = $c + 1;})",
the variable changes are not undone during backtracking, and we get
2054
2055 'a' count is 4, $c variable is 'bob'
2056
2057 Note that only localized variable changes are undone. Other side
2058 effects of code expression execution are permanent. Thus
2059
2060 $x = "aaaa";
2061 $x =~ /(a(?{print "Yow\n";}))*aa/;
2062
2063 produces
2064
2065 Yow
2066 Yow
2067 Yow
2068 Yow
2069
2070 The result $^R is automatically localized, so that it will behave prop‐
2071 erly in the presence of backtracking.
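
The counting example from this section can be packaged so that the
localized and unlocalized variants run side by side. This is a sketch
only: the sub names are ours, and the variables are declared with "our"
so that the code passes "use strict" while still being package
variables that "local" can act on.

```perl
use strict;
use warnings;

# Package variables, since "local" does not apply to "my" variables.
our ($c, $count) = ("bob", 0);

# Both subs are hypothetical helpers that count the 'a's matched by the
# starred group; they differ only in whether the increment is localized.
sub count_localized {
    my ($text) = @_;
    $text =~ /(?{ local $c = 0; })
              ( a (?{ local $c = $c + 1; }) )*
              aa
              (?{ $count = $c; })
             /x;
    return $count;
}

sub count_unlocalized {
    my ($text) = @_;
    $text =~ /(?{ local $c = 0; })
              ( a (?{ $c = $c + 1; }) )*
              aa
              (?{ $count = $c; })
             /x;
    return $count;
}

print count_localized("aaaa"), "\n";     # 2
print count_unlocalized("aaaa"), "\n";   # 4
print "$c\n";                            # bob
```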
2072
2073 This example uses a code expression in a conditional to match the arti‐
2074 cle 'the' in either English or German:
2075
    $lang = 'DE';  # use German
    ...
    $text = "das";
    print "matched\n"
        if $text =~ /(?(?{
                          $lang eq 'EN'; # is the language English?
                         })
                       the |             # if so, then match 'the'
                       (die|das|der)     # else, match 'die|das|der'
                     )
                    /xi;
2087
Note that the syntax here is "(?(?{...})yes-regexp|no-regexp)", not
"(?((?{...}))yes-regexp|no-regexp)". In other words, in the case of a
code expression, we don't need the extra parentheses around the
conditional.
2092
2093 If you try to use code expressions with interpolating variables, perl
2094 may surprise you:
2095
2096 $bar = 5;
2097 $pat = '(?{ 1 })';
2098 /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated
2099 /foo(?{ 1 })$bar/; # compile error!
2100 /foo${pat}bar/; # compile error!
2101
2102 $pat = qr/(?{ $foo = 1 })/; # precompile code regexp
2103 /foo${pat}bar/; # compiles ok
2104
2105 If a regexp has (1) code expressions and interpolating variables, or
2106 (2) a variable that interpolates a code expression, perl treats the
2107 regexp as an error. If the code expression is precompiled into a vari‐
2108 able, however, interpolating is ok. The question is, why is this an
2109 error?
2110
2111 The reason is that variable interpolation and code expressions together
2112 pose a security risk. The combination is dangerous because many pro‐
2113 grammers who write search engines often take user input and plug it
2114 directly into a regexp:
2115
    $regexp = <>;        # read user-supplied regexp
    chomp $regexp;       # get rid of possible newline
    $text =~ /$regexp/;  # search $text for the $regexp
2119
If the $regexp variable contains a code expression, the user could then
execute arbitrary Perl code. For instance, some joker could search for
"(?{ system('rm -rf *'); })" to erase your files. In this sense, the
combination of interpolation and code expressions taints your regexp.
So by default, using both interpolation and code expressions in the
same regexp is not allowed. If you're not concerned about malicious
users, it is possible to bypass this security check by invoking
"use re 'eval'":
2127
2128 use re 'eval'; # throw caution out the door
2129 $bar = 5;
2130 $pat = '(?{ 1 })';
2131 /foo(?{ 1 })$bar/; # compiles ok
2132 /foo${pat}bar/; # compiles ok
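
When the pragma is left off, the default check can be used as a safety
net for user-supplied patterns. In this sketch the helper name
"safe_match" is an assumption; it relies only on the behavior described
above, namely that perl refuses to compile an interpolated code
expression.

```perl
use strict;
use warnings;

# safe_match() is a hypothetical helper. Without "use re 'eval'" in scope,
# perl refuses at run time to compile an interpolated pattern that
# contains a code expression; eval {} turns that fatal error into undef.
sub safe_match {
    my ($text, $user_pattern) = @_;
    my $result = eval { $text =~ /$user_pattern/ ? 1 : 0 };
    return $result;   # 1, 0, or undef if the pattern was rejected
}

print safe_match("foobar", 'o+b'), "\n";   # 1

my $verdict = safe_match("foobar", '(?{ print "gotcha" })');
print defined $verdict ? "pattern ran\n" : "pattern rejected\n";
```

Note that a pattern that is merely malformed also yields undef here, so
the helper cannot tell a malicious pattern from a broken one.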
2133
Another form of code expression is the pattern code expression. The
pattern code expression is like a regular code expression, except that
the result of the code evaluation is treated as a regular expression
and matched immediately. A simple example is
2138
2139 $length = 5;
2140 $char = 'a';
2141 $x = 'aaaaabb';
2142 $x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'
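
A pattern code expression can also build its regexp from something
matched earlier in the same regexp. As a sketch, the following checks
strings such as '3:abc', in which a leading number gives the length of
what follows; the helper name "is_counted_string" and the string format
are made up for this example.

```perl
use strict;
use warnings;

# is_counted_string() is a hypothetical helper for strings such as
# '3:abc', where the leading number gives the length of the payload.
# The pattern code expression builds '.{3}' from $1 at match time.
sub is_counted_string {
    my ($text) = @_;
    return $text =~ /^(\d+):(??{ '.{' . $1 . '}' })$/ ? 1 : 0;
}

print is_counted_string("3:abc"), "\n";     # 1
print is_counted_string("3:ab"), "\n";      # 0, payload too short
print is_counted_string("5:hello"), "\n";   # 1
```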
2143
2144 This final example contains both ordinary and pattern code expressions.
2145 It detects if a binary string 1101010010001... has a Fibonacci spacing
2146 0,1,1,2,3,5,... of the 1's:
2147
2148 $s0 = 0; $s1 = 1; # initial conditions
2149 $x = "1101010010001000001";
2150 print "It is a Fibonacci sequence\n"
2151 if $x =~ /^1 # match an initial '1'
2152 (
2153 (??{'0' x $s0}) # match $s0 of '0'
2154 1 # and then a '1'
2155 (?{
2156 $largest = $s0; # largest seq so far
2157 $s2 = $s1 + $s0; # compute next term
2158 $s0 = $s1; # in Fibonacci sequence
2159 $s1 = $s2;
2160 })
2161 )+ # repeat as needed
2162 $ # that is all there is
2163 /x;
2164 print "Largest sequence matched was $largest\n";
2165
2166 This prints
2167
2168 It is a Fibonacci sequence
2169 Largest sequence matched was 5
2170
2171 Ha! Try that with your garden variety regexp package...
2172
Note that the variables $s0 and $s1 are not substituted when the regexp
is compiled, as happens for ordinary variables outside a code
expression. Rather, the code expressions are evaluated when perl
encounters them during the search for a match.
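
The difference in timing can be seen by changing a variable after the regexp has been built. This is a minimal sketch with assumed names ($n, $interpolated, $deferred); it uses "qr//" so that both regexps exist before the variable changes.

```perl
use strict;
use warnings;

my $n = 2;

my $interpolated = qr/^a{$n}$/;              # $n is substituted now: ^a{2}$
my $deferred     = qr/^(??{ 'a' x $n })$/;   # $n is read during each match

$n = 3;   # changed after both regexps were built

my $fixed = 'aaa' =~ $interpolated ? 'match' : 'no match';
my $late  = 'aaa' =~ $deferred     ? 'match' : 'no match';

print "interpolated: $fixed\n";   # prints "interpolated: no match"
print "deferred:     $late\n";    # prints "deferred:     match"
```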
2177
2178 The regexp without the "//x" modifier is
2179
/^1((??{'0'x$s0})1(?{$largest=$s0;$s2=$s1+$s0;$s0=$s1;$s1=$s2;}))+$/;
2181
2182 and is a great start on an Obfuscated Perl entry :-) When working with
2183 code and conditional expressions, the extended form of regexps is
2184 almost necessary in creating and debugging regexps.
2185
2186 Pragmas and debugging
2187
Speaking of debugging, there are several pragmas available to control
and debug regexps in Perl. We have already encountered one pragma in
the previous section, "use re 'eval';", that allows variable
interpolation and code expressions to coexist in a regexp. The other
pragmas are
2193
2194 use re 'taint';
2195 $tainted = <>;
@parts = ($tainted =~ /(\w+)\s+(\w+)/); # @parts is now tainted
2197
The "taint" pragma causes any substrings from a match with a tainted
variable to be tainted as well. This is not normally the case, as
regexps are often used to extract the safe bits from a tainted variable.
Use "taint" when you are not extracting safe bits, but are performing
some other processing. Both "taint" and "eval" pragmas are lexically
scoped, which means they are in effect only until the end of the block
enclosing the pragmas.
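
A way to watch the pragma at work is to compare a capture made inside its scope with one made outside. The following is a sketch under stated assumptions: it relies on the core module Scalar::Util for its "tainted" function, it builds its input by appending a zero-length piece of $ENV{PATH} (assumed here to be set), and the names $input, $plain and $kept are illustrative.

```perl
use strict;
use warnings;
use Scalar::Util qw(tainted);

# Assumption: a zero-length piece of an environment value is enough to
# taint the string, but only when perl runs in taint mode (perl -T).
my $env_value = defined $ENV{PATH} ? $ENV{PATH} : '';
my $input     = 'alpha beta' . substr($env_value, 0, 0);

my ($plain) = $input =~ /(\w+)/;     # default: the capture is untainted

my $kept;
{
    use re 'taint';                  # in effect until the end of this block
    ($kept) = $input =~ /(\w+)/;     # the capture inherits any taint
}

printf "taint mode:         %s\n", ${^TAINT}        ? 'on'  : 'off';
printf "input tainted:      %s\n", tainted($input)  ? 'yes' : 'no';
printf "default capture:    %s\n", tainted($plain)  ? 'yes' : 'no';
printf "re 'taint' capture: %s\n", tainted($kept)   ? 'yes' : 'no';
```

Run without -T, no value is ever tainted and the last three lines all report "no". Run as "perl -T", the input and the capture made under "use re 'taint'" should be reported as tainted, while the default capture should not.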
2205
2206 use re 'debug';
2207 /^(.*)$/s; # output debugging info
2208
2209 use re 'debugcolor';
2210 /^(.*)$/s; # output debugging info in living color
2211
The global "debug" and "debugcolor" pragmas allow one to get detailed
debugging info about regexp compilation and execution. "debugcolor" is
the same as "debug", except the debugging information is displayed in
color on terminals that can display termcap color sequences. Here is
some example output:
2217
% perl -e 'use re "debug"; "abc" =~ /a*b+c/;'
Compiling REx `a*b+c'
size 9 first at 1
1: STAR(4)
2: EXACT <a>(0)
4: PLUS(7)
5: EXACT <b>(0)
7: EXACT <c>(9)
9: END(0)
floating `bc' at 0..2147483647 (checking floating) minlen 2
Guessing start of match, REx `a*b+c' against `abc'...
Found floating substr `bc' at offset 1...
Guessed: match at offset 0
Matching REx `a*b+c' against `abc'
Setting an EVAL scope, savestack=3
0 <> <abc> | 1: STAR
EXACT <a> can match 1 times out of 32767...
Setting an EVAL scope, savestack=3
1 <a> <bc> | 4: PLUS
EXACT <b> can match 1 times out of 32767...
Setting an EVAL scope, savestack=3
2 <ab> <c> | 7: EXACT <c>
3 <abc> <> | 9: END
Match successful!
Freeing REx: `a*b+c'
2243
2244 If you have gotten this far into the tutorial, you can probably guess
2245 what the different parts of the debugging output tell you. The first
2246 part
2247
2248 Compiling REx `a*b+c'
2249 size 9 first at 1
2250 1: STAR(4)
2251 2: EXACT <a>(0)
2252 4: PLUS(7)
2253 5: EXACT <b>(0)
2254 7: EXACT <c>(9)
2255 9: END(0)
2256
2257 describes the compilation stage. STAR(4) means that there is a starred
2258 object, in this case 'a', and if it matches, goto line 4, i.e.,
2259 PLUS(7). The middle lines describe some heuristics and optimizations
2260 performed before a match:
2261
2262 floating `bc' at 0..2147483647 (checking floating) minlen 2
2263 Guessing start of match, REx `a*b+c' against `abc'...
2264 Found floating substr `bc' at offset 1...
2265 Guessed: match at offset 0
2266
2267 Then the match is executed and the remaining lines describe the
2268 process:
2269
Matching REx `a*b+c' against `abc'
Setting an EVAL scope, savestack=3
0 <> <abc> | 1: STAR
EXACT <a> can match 1 times out of 32767...
Setting an EVAL scope, savestack=3
1 <a> <bc> | 4: PLUS
EXACT <b> can match 1 times out of 32767...
Setting an EVAL scope, savestack=3
2 <ab> <c> | 7: EXACT <c>
3 <abc> <> | 9: END
Match successful!
Freeing REx: `a*b+c'
2282
Each step is of the form "n <x> <y>", with "<x>" the part of the
string matched and "<y>" the part not yet matched. The "| 1: STAR"
says that perl is at line number 1 in the compilation list above. See
"Debugging regular expressions" in perldebguts for much more detail.
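
Because the trace is written to the standard error stream, it can help to capture it and keep only the lines of interest. The sketch below is one possible approach, not a recommended tool: it assumes a POSIX-style shell for the quoting and the "2>&1" redirection, the names $cmd, @lines and @nodes are illustrative, and the exact wording of the trace differs between perl versions.

```perl
use strict;
use warnings;

# Run a child perl with the pragma enabled and fold its STDERR into
# STDOUT. $^X is the path of the running perl; the quoting assumes a
# POSIX shell.
my $cmd   = qq{"$^X" -e 'use re "debug"; "abc" =~ /a*b+c/;' 2>&1};
my @lines = `$cmd`;

# Keep only the lines that mention a node of the compiled program.
my @nodes = grep { /\b(?:STAR|PLUS|EXACT|END)\b/ } @lines;

print @nodes;
```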
2287
An alternative method of debugging regexps is to embed "print"
statements within the regexp. This provides a blow-by-blow account of
the backtracking in an alternation:
2291
2292 "that this" =~ m@(?{print "Start at position ", pos, "\n";})
2293 t(?{print "t1\n";})
2294 h(?{print "h1\n";})
2295 i(?{print "i1\n";})
2296 s(?{print "s1\n";})
|
2298 t(?{print "t2\n";})
2299 h(?{print "h2\n";})
2300 a(?{print "a2\n";})
2301 t(?{print "t2\n";})
2302 (?{print "Done at position ", pos, "\n";})
2303 @x;
2304
2305 prints
2306
2307 Start at position 0
2308 t1
2309 h1
2310 t2
2311 h2
2312 a2
2313 t2
2314 Done at position 4
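
The same idea works with a quantifier, and the trace can be collected in an array instead of being printed. This is a minimal sketch with assumed names (@trace, $matched). How often the code expression runs on failed attempts may depend on optimizations in your version of perl, so only the last recorded position, the one on the successful path, is predictable.

```perl
use strict;
use warnings;

my @trace;

# a+ is greedy; each time the engine moves past it, the code expression
# records the current position before the literal "ab" is attempted.
my $matched = 'aaab' =~ /a+(?{ push @trace, pos() })ab/ ? 1 : 0;

print "matched: $matched\n";
print "positions recorded after a+: @trace\n";
```

For the match to succeed, a+ has to give back its last 'a', so the final entry in @trace is position 2.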
2315
2317 Code expressions, conditional expressions, and independent expressions
2318 are experimental. Don't use them in production code. Yet.
2319
This is just a tutorial. For the full story on perl regular
expressions, see the perlre regular expressions reference page.
2323
For more information on the matching "m//" and substitution "s///"
operators, see "Regexp Quote-Like Operators" in perlop. For
information on the "split" operation, see "split" in perlfunc.
2327
For an excellent all-around resource on the care and feeding of regular
expressions, see the book Mastering Regular Expressions by Jeffrey
Friedl (published by O'Reilly, ISBN 1-56592-257-3).
2331
Copyright (c) 2000 Mark Kvale. All rights reserved.
2334
2335 This document may be distributed under the same terms as Perl itself.
2336
2337 Acknowledgments
2338
2339 The inspiration for the stop codon DNA example came from the ZIP code
2340 example in chapter 7 of Mastering Regular Expressions.
2341
The author would like to thank Jeff Pinyan, Andrew Johnson, Peter
Haworth, Ronald J Kimball, and Joe Smith for all their helpful
comments.
2345
2346
2347
2348perl v5.8.8 2006-01-07 PERLRETUT(1)