PERLREQUICK(1)         Perl Programmers Reference Guide         PERLREQUICK(1)

NAME
perlrequick - Perl regular expressions quick start

DESCRIPTION
This page covers the very basics of understanding, creating and using
regular expressions ('regexes') in Perl.

The Guide
Simple word matching
The simplest regex is simply a word, or more generally, a string of
characters. A regex consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, "World" is a regex and the "//" enclosing "/World/"
tells perl to search a string for a match. The operator "=~"
associates the string with the regex match and produces a true value if
the regex matched, or false if the regex did not match. In our case,
"World" matches the second word in "Hello World", so the expression is
true. This idea has several variations.

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the "!~" operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regex can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against $_, the "$_ =~" part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the "//" default delimiters for a match can be changed to
arbitrary delimiters by putting an 'm' out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

Regexes must match a part of the string exactly in order for the
statement to be true:

    "Hello World" =~ /world/;  # doesn't match, case sensitive
    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
    "Hello World" =~ /World /; # doesn't match, no ' ' at end

perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'
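
As a rough illustration of the earliest-match rule, the sketch below
records where each match starts. It relies on the built-in array @-,
which this guide does not cover: its first element, $-[0], holds the
offset of the start of the last successful match. The variable names
are made up for the example.

```perl
use strict;
use warnings;

# $-[0] is the offset at which the last successful match began.
my $o_offset   = "Hello World"     =~ /o/   ? $-[0] : -1;
my $hat_offset = "That hat is red" =~ /hat/ ? $-[0] : -1;

print "first 'o' at offset $o_offset\n";      # 4, the 'o' in 'Hello'
print "first 'hat' at offset $hat_offset\n";  # 1, the 'hat' in 'That'
```

Offsets count from 0, so the 'hat' inside 'That' is reported at 1
rather than the later word 'hat' at 5.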

Not all characters can be used 'as is' in a match. Some characters,
called metacharacters, are reserved for use in regex notation. The
metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;     # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;    # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

In the last regex, the forward slash '/' is also backslashed, because
it is used to delimit the regex.
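
When the text to match comes from a variable, backslashing each
metacharacter by hand is not practical. One possible approach, sketched
here, uses two standard Perl features that this guide does not cover:
the "\Q...\E" escape inside a regex and the quotemeta() function. The
variable names are illustrative only.

```perl
use strict;
use warnings;

my $expr = "2+2";

# Interpolated as-is, '+' acts as a quantifier, so this does not match.
my $raw_matches = "2+2=4" =~ /$expr/ ? 1 : 0;

# \Q...\E backslashes the metacharacters in the interpolated text.
my $quoted_matches = "2+2=4" =~ /\Q$expr\E/ ? 1 : 0;

# quotemeta() returns the backslashed string itself.
my $escaped = quotemeta($expr);

print "raw: $raw_matches, quoted: $quoted_matches, escaped: $escaped\n";
```

With these inputs the unquoted pattern fails and the quoted one
succeeds, for the same reason as the /2+2/ and /2\+2/ examples above.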

Non-printable ASCII characters are represented by escape sequences.
Common examples are "\t" for a tab, "\n" for a newline, and "\r" for a
carriage return. Arbitrary bytes are represented by octal escape
sequences, e.g., "\033", or hexadecimal escape sequences, e.g., "\x1B":

    "1000\t2000" =~ m(0\t2);   # matches
    "cat" =~ /\143\x61\x74/;   # matches, but a weird way to spell cat

Regexes are treated mostly as double quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

With all of the regexes above, if the regex matched anywhere in the
string, it was considered a match. To specify where it should match,
we would use the anchor metacharacters "^" and "$". The anchor "^"
means match at the beginning of the string and the anchor "$" means
match at the end of the string, or before a newline at the end of the
string. Some examples:

    "housekeeper" =~ /keeper/;         # matches
    "housekeeper" =~ /^keeper/;        # doesn't match
    "housekeeper" =~ /keeper$/;        # matches
    "housekeeper\n" =~ /keeper$/;      # matches
    "housekeeper" =~ /^housekeeper$/;  # matches
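
Using both anchors together is a common way to test a whole string
rather than a substring. The following sketch, with made-up candidate
strings, checks which inputs consist of exactly the word 'keeper',
optionally followed by one trailing newline:

```perl
use strict;
use warnings;

my @results;
for my $candidate ("keeper", "keeper\n", "housekeeper", "keepers") {
    # ^ pins the match to the start, $ to the end (or just before a
    # newline at the end), so nothing else may surround 'keeper'.
    push @results, ($candidate =~ /^keeper$/ ? 1 : 0);
}

print "@results\n";  # 1 1 0 0
```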

Using character classes
A character class allows a set of possible characters, rather than just
a single character, to match at a particular point in a regex.
Character classes are denoted by brackets "[...]", with the set of
characters to be possibly matched inside. Here are some examples:

    /cat/;            # matches 'cat'
    /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/; # matches 'a'

In the last statement, even though 'c' is the first character in the
class, the earliest point at which the regex can match is 'a'.

    /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                    # 'yes', 'Yes', 'YES', etc.
    /yes/i;         # also match 'yes' in a case-insensitive way

The last example shows a match with an 'i' modifier, which makes the
match case-insensitive.

Character classes also have ordinary and special characters, but the
sets of ordinary and special characters inside a character class are
different from those outside a character class. The special characters
for a character class are "-]\^$" and are matched using an escape:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The special character '-' acts as a range operator within character
classes, so that the unwieldy "[0123456789]" and "[abc...xyz]" become
the svelte "[0-9]" and "[a-z]":

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If '-' is the first or last character in a character class, it is
treated as an ordinary character.

The special character "^" in the first position of a character class
denotes a negated character class, which matches any character but
those in the brackets. Both "[...]" and "[^...]" must match a
character, or the match fails. Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary
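
To see ranges and negation at work, here is a small sketch that runs
two of the classes above over a few sample strings. The samples are
invented, and the "^" and "$" anchors are added so that each test
applies to the whole string instead of any substring.

```perl
use strict;
use warnings;

# Which one-character strings are hexadecimal digits?
my @is_hex;
for my $candidate ("7", "c", "F", "g", "-") {
    push @is_hex, ($candidate =~ /^[0-9a-fA-F]$/ ? 1 : 0);
}

# Which strings are "any character except 'a'" followed by 'at'?
my @is_not_aat;
for my $candidate ("bat", "aat", "at", "%at") {
    push @is_not_aat, ($candidate =~ /^[^a]at$/ ? 1 : 0);
}

print "hex:     @is_hex\n";      # 1 1 1 0 0
print "not-aat: @is_not_aat\n";  # 1 0 0 1
```

Note that 'at' fails the second test: the negated class still has to
consume one character.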

Perl has several abbreviations for common character classes:

·   \d is a digit and represents

        [0-9]

·   \s is a whitespace character and represents

        [\ \t\r\n\f]

·   \w is a word character (alphanumeric or _) and represents

        [0-9a-zA-Z_]

·   \D is a negated \d; it represents any character but a digit

        [^0-9]

·   \S is a negated \s; it represents any non-whitespace character

        [^\s]

·   \W is a negated \w; it represents any non-word character

        [^\w]

·   The period '.' matches any character but "\n"

The "\d\s\w\D\S\W" abbreviations can be used both inside and outside of
character classes. Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'
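
As a quick check of the "\d" abbreviation, the sketch below applies
the hh:mm:ss regex to a few invented strings. Because the regex is not
anchored, it also accepts a time embedded in longer text.

```perl
use strict;
use warnings;

my @has_time;
for my $candidate ("12:34:56", "1:23:45", "ab:cd:ef", "at 09:00:00 sharp") {
    # Each \d must match exactly one digit, so '1:23:45' is rejected.
    push @has_time, ($candidate =~ /\d\d:\d\d:\d\d/ ? 1 : 0);
}

print "@has_time\n";  # 1 0 0 1
```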

The word anchor "\b" matches a boundary between a word character and a
non-word character "\w\W" or "\W\w":

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.
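
The three matches can be told apart by where they start. This sketch
assumes the built-in $-[0] (not covered in this guide), which holds the
offset of the start of the last successful match; the hash name is
arbitrary.

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";
my %offset;

# In each statement the match runs first, then $-[0] is read.
$offset{prefix} = $-[0] if $x =~ /\bcat/;    # start of 'catenates'
$offset{suffix} = $-[0] if $x =~ /cat\b/;    # 'cat' ending 'Housecat'
$offset{whole}  = $-[0] if $x =~ /\bcat\b/;  # the final word 'cat'

print "prefix $offset{prefix}, suffix $offset{suffix}, whole $offset{whole}\n";
# prefix 9, suffix 5, whole 29
```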

Matching this or that
We can match different character strings with the alternation
metacharacter '|'. To match "dog" or "cat", we form the regex
"dog|cat". As before, perl will try to match the regex at the earliest
possible point in the string. At each character position, perl will
first try to match the first alternative, "dog". If "dog" doesn't
match, perl will then try the next alternative, "cat". If "cat"
doesn't match either, then the match fails and perl moves to the next
position in the string. Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though "dog" is the first alternative in the second regex, "cat"
is able to match earlier in the string.

    "cats" =~ /c|ca|cat|cats/; # matches "c"
    "cats" =~ /cats|cat|ca|c/; # matches "cats"

At a given character position, the first alternative that allows the
regex match to succeed will be the one that matches. Here, all the
alternatives match at the first string position, so the first matches.
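
To observe which alternative wins, one option is to wrap the
alternation in parentheses and read $1 afterwards. Grouping and $1 are
explained in the next two sections, so treat this as a preview; the
variable names are illustrative.

```perl
use strict;
use warnings;

my @winner;
for my $alternatives ('c|ca|cat|cats', 'cats|cat|ca|c') {
    # The parentheses capture whichever alternative matched into $1.
    push @winner, $1 if "cats" =~ /($alternatives)/;
}
print "@winner\n";  # c cats

# Position in the string beats position in the list of alternatives.
my $first = "cats and dogs" =~ /(dog|cat|bird)/ ? $1 : '';
print "$first\n";   # cat
```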

Grouping things and hierarchical matching
The grouping metacharacters "()" allow a part of a regex to be treated
as a single unit. Parts of a regex are grouped by enclosing them in
parentheses. The regex "house(cat|keeper)" means match "house"
followed by either "cat" or "keeper". Some more examples are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

    /house(cat|)/;       # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;   # matches either 'housecats' or 'housecat' or
                         # 'house'. Note groups can be nested.

    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

Extracting matches
The grouping metacharacters "()" also allow the extraction of the parts
of a string that matched. For each grouping, the part that matched
inside goes into the special variables $1, $2, etc. They can be used
just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

In list context, a match "/regex/" with groupings will return the list
of matched values "($1,$2,...)". So we could rewrite it as

    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regex are nested, $1 gets the group with the
leftmost opening parenthesis, $2 the next opening parenthesis, etc.
For example, here is a complex regex and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

Associated with the matching variables $1, $2, ... are the
backreferences "\1", "\2", ... Backreferences are matching variables
that can be used inside a regex:

    /(\w\w\w)\s\1/; # find sequences like 'the the' in string

$1, $2, ... should only be used outside of a regex, and "\1", "\2", ...
only inside a regex.
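
The following sketch pulls these pieces together with invented input
strings. The doubled-word regex is a variant of the one above: it uses
"\w+" and "\b" so that words of any length are compared as whole
words.

```perl
use strict;
use warnings;

# List-context match: the three groups are returned in order.
my $time = "23:59:07";
my ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

# Nested groups are numbered by their opening parenthesis.
my @groups = ("abcdgi" =~ /(ab(cd|ef)((gi)|j))/);

# \1 inside the regex, $1 outside it.
my $text    = "Paris in the the spring";
my $doubled = $text =~ /\b(\w+)\s+\1\b/ ? $1 : '';

print "$hours h, $minutes m, $seconds s\n";  # 23 h, 59 m, 07 s
print join(",", @groups), "\n";              # abcdgi,cd,gi,gi
print "doubled word: $doubled\n";            # doubled word: the
```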

Matching repetitions
The quantifier metacharacters "?", "*", "+", and "{}" allow us to
determine the number of repeats of a portion of a regex we consider to
be a match. Quantifiers are put immediately after the character,
character class, or grouping that we want to specify. They have the
following meanings:

·   "a?" = match 'a' 1 or 0 times

·   "a*" = match 'a' 0 or more times, i.e., any number of times

·   "a+" = match 'a' 1 or more times, i.e., at least once

·   "a{n,m}" = match at least "n" times, but not more than "m" times

·   "a{n,}" = match at least "n" times

·   "a{n}" = match exactly "n" times

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\1/;    # match doubled words of arbitrary length
    $year =~ /^\d{2,4}$/;        # make sure year is at least 2 but not more
                                 # than 4 digits
    $year =~ /^\d{4}$|^\d{2}$/;  # better match; throw out 3 digit dates
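
A year check like the one above only limits the number of digits if
the regex is tied to both ends of the string. The sketch below assumes
that whole-string validation is the goal and therefore wraps the
alternation in "^(...)$"; the sample values are made up.

```perl
use strict;
use warnings;

my @is_valid_year;
for my $year ("85", "1985", "198", "19850") {
    # Exactly four digits or exactly two digits, and nothing else.
    push @is_valid_year, ($year =~ /^(\d{4}|\d{2})$/ ? 1 : 0);
}

print "@is_valid_year\n";  # 1 1 0 0
```

Without the anchors, '19850' would pass, since it contains a run of
four digits.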

These quantifiers will try to match as much of the string as possible,
while still allowing the regex to match. So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

The first quantifier ".*" grabs as much of the string as possible while
still having the regex match. The second quantifier ".*" has no string
left to it, so it matches 0 times.
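
This greediness can be put to practical use. As a sketch with an
invented path, the first ".*" below swallows everything up to the last
'/', which splits the string into a directory part and a file name:

```perl
use strict;
use warnings;

my $path = "/usr/local/bin/perl";

# The first .* takes as much as it can while still leaving a '/'
# for the regex to match, so it stops at the last slash.
my ($dir, $file) = ($path =~ m{^(.*)/(.*)$});

print "dir:  $dir\n";   # /usr/local/bin
print "file: $file\n";  # perl
```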

More matching
There are a few more things you might want to know about matching
operators. In the code

    $pattern = 'Seuss';
    while (<>) {
        print if /$pattern/;
    }

perl has to re-evaluate $pattern each time through the loop. If
$pattern won't be changing, use the "//o" modifier to perform
variable substitutions only once. If you don't want any substitutions
at all, use the special delimiter "m''":

    @pattern = ('Seuss');
    m/@pattern/; # matches 'Seuss'
    m'@pattern'; # matches the literal string '@pattern'

The global modifier "//g" allows the matching operator to match within
a string as many times as possible. In scalar context, successive
matches against a string will have "//g" jump from match to match,
keeping track of position in the string as it goes along. You can get
or set the position with the "pos()" function. For example,

    $x = "cat dog house"; # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position. If
you don't want the position reset after failure to match, add the
"//c" modifier, as in "/regex/gc".

In list context, "//g" returns a list of matched groupings, or if there
are no groupings, a list of matches to the whole regex. So

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'
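
Because a list-context "//g" match returns every grouping from every
match, a regex with two groups yields a flat list of pairs, which can
be assigned straight to a hash. The sketch below uses an invented
"name=value" string to show this alongside a plain word list.

```perl
use strict;
use warnings;

# No groupings: one list element per match of the whole regex.
my @words = ("cat dog house" =~ /\w+/g);

# Two groupings: (name, value, name, value, ...) fills the hash.
my $config  = "host=example.org port=8080 user=guest";
my %setting = ($config =~ /(\w+)=(\S+)/g);

print scalar(@words), " words\n";                    # 3 words
print "$_ is $setting{$_}\n" for sort keys %setting;
```

The last line prints one "name is value" line per setting, in
alphabetical order of the names.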

Search and replace
Search and replace is performed using "s/regex/replacement/modifiers".
The "replacement" is a Perl double quoted string that replaces in the
string whatever is matched with the "regex". The operator "=~" is also
used here to associate a string with "s///". If matching against $_,
the "$_ =~" can be dropped. If there is a match, "s///" returns the
number of substitutions made, otherwise it returns false. Here are a
few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

With the "s///" operator, the matched variables $1, $2, etc. are
immediately available for use in the replacement expression. With the
global modifier, "s///g" will search and replace all occurrences of the
regex in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"

The evaluation modifier "s///e" wraps an "eval{...}" around the
replacement string and the evaluated result is substituted for the
matched substring. Some examples:

    # reverse all the words in a string
    $x = "the cat in the hat";
    $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"

    # convert percentage to decimal
    $x = "A 39% hit rate";
    $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

The last example shows that "s///" can use other delimiters, such as
"s!!!" and "s{}{}", and even "s{}//". If single quotes are used, as in
"s'''", then the regex and replacement are treated as single quoted
strings.
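
Two points from this section are easy to verify in a short sketch: the
return value of "s///g" is the number of substitutions, and "s///ge"
evaluates the replacement as code for every match. The strings and
variable names are invented for the example.

```perl
use strict;
use warnings;

# s///g returns how many replacements were made.
my $score = "I batted 4 for 4";
my $count = ($score =~ s/4/four/g);

# s///ge runs the replacement as an expression for each match.
my $order = "3 apples and 12 pears";
$order =~ s/(\d+)/$1 * 2/ge;

print "$count substitutions: $score\n";  # 2 substitutions: I batted four for four
print "$order\n";                        # 6 apples and 24 pears
```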

The split operator
"split /regex/, string" splits "string" into a list of substrings and
returns that list. The regex determines the character sequence that
"string" is split with respect to. For example, to split a string into
words, use

    $x = "Calvin and Hobbes";
    @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                              # $word[1] = 'and'
                              # $word[2] = 'Hobbes'

To extract a comma-delimited list of numbers, use

    $x = "1.618,2.718,   3.142";
    @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                # $const[1] = '2.718'
                                # $const[2] = '3.142'

If the empty regex "//" is used, the string is split into individual
characters. If the regex has groupings, then the list produced
contains the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'

Since the first character of $x matched the regex, "split" prepended an
empty initial element to the list.
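
The three behaviours described here can be compared side by side. In
this sketch the results are joined with '|' purely to make the element
boundaries visible, including the empty first element; the input
strings are made up.

```perl
use strict;
use warnings;

# Separator with optional trailing whitespace.
my @const = split /,\s*/, "1.618,2.718,   3.142";

# Empty regex: one element per character.
my @chars = split //, "perl";

# Grouping in the regex: the separators are kept in the list.
my @parts = split m!(/)!, "/usr/bin";

print join("|", @const), "\n";  # 1.618|2.718|3.142
print join("|", @chars), "\n";  # p|e|r|l
print join("|", @parts), "\n";  # |/|usr|/|bin
```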

BUGS
None.

SEE ALSO
This is just a quick start guide. For a more in-depth tutorial on
regexes, see perlretut and for the reference page, see perlre.

AUTHOR AND COPYRIGHT
Copyright (c) 2000 Mark Kvale. All rights reserved.

This document may be distributed under the same terms as Perl itself.

Acknowledgments
The author would like to thank Mark-Jason Dominus, Tom Christiansen,
Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
comments.

perl v5.10.1                      2009-02-12                    PERLREQUICK(1)