PERLREQUICK(1)         Perl Programmers Reference Guide         PERLREQUICK(1)

NAME
perlrequick - Perl regular expressions quick start

DESCRIPTION
This page covers the very basics of understanding, creating and using
regular expressions ('regexes') in Perl.

The Guide

Simple word matching
The simplest regex is simply a word, or more generally, a string of
characters. A regex consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, "World" is a regex and the "//" enclosing "/World/"
tells Perl to search a string for a match. The operator "=~"
associates the string with the regex match and produces a true value if
the regex matched, or false if the regex did not match. In our case,
"World" matches the second word in "Hello World", so the expression is
true. This idea has several variations.

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the "!~" operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regex can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against $_, the "$_ =~" part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the "//" default delimiters for a match can be changed to
arbitrary delimiters by putting an 'm' out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

Regexes must match a part of the string exactly in order for the
statement to be true:

    "Hello World" =~ /world/;  # doesn't match, case sensitive
    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
    "Hello World" =~ /World /; # doesn't match, no ' ' at end

Perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

Not all characters can be used 'as is' in a match. Some characters,
called metacharacters, are reserved for use in regex notation. The
metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                       # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

In the last regex, the forward slash '/' is also backslashed, because
it is used to delimit the regex.
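
When the text to match comes from a variable, backslashing each
metacharacter by hand is easy to get wrong. As a minimal sketch, the
following uses Perl's built-in quotemeta function and the equivalent
"\Q...\E" escape to do the backslashing automatically. Neither is
introduced above, and the variable names are illustrative only.

```perl
use strict;
use warnings;

my $literal = '2+2';     # illustrative input containing the metacharacter '+'
my $target  = "2+2=4";

# Interpolated as-is, '+' acts as a quantifier, so this looks for '22'.
my $unescaped = $target =~ /$literal/ ? 1 : 0;

# \Q...\E backslashes the metacharacters in the interpolated text.
my $escaped = $target =~ /\Q$literal\E/ ? 1 : 0;

# quotemeta returns the backslashed text as an ordinary string.
my $quoted = quotemeta $literal;

print "unescaped: $unescaped, escaped: $escaped, quoted: $quoted\n";
```

Only the escaped form finds '2+2' in the target string.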

Non-printable ASCII characters are represented by escape sequences.
Common examples are "\t" for a tab, "\n" for a newline, and "\r" for a
carriage return. Arbitrary bytes are represented by octal escape
sequences, e.g., "\033", or hexadecimal escape sequences, e.g., "\x1B":

    "1000\t2000" =~ m(0\t2);     # matches
    "cat" =~ /\143\x61\x74/;     # matches in ASCII, but a weird way to spell cat

Regexes are treated mostly as double-quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

With all of the regexes above, if the regex matched anywhere in the
string, it was considered a match. To specify where it should match,
we would use the anchor metacharacters "^" and "$". The anchor "^"
means match at the beginning of the string and the anchor "$" means
match at the end of the string, or before a newline at the end of the
string. Some examples:

    "housekeeper" =~ /keeper/;         # matches
    "housekeeper" =~ /^keeper/;        # doesn't match
    "housekeeper" =~ /keeper$/;        # matches
    "housekeeper\n" =~ /keeper$/;      # matches
    "housekeeper" =~ /^housekeeper$/;  # matches
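
Combining both anchors is a common way to test a whole string rather
than a part of it. The sketch below wraps that idea in a hypothetical
helper, is_exactly, which is not part of Perl; it assumes the word
passed in contains no metacharacters.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if $string consists of $word and nothing
# else. Because of how "$" works, one trailing newline is still accepted.
# Assumes $word contains no regex metacharacters.
sub is_exactly {
    my ($string, $word) = @_;
    return $string =~ /^$word$/ ? 1 : 0;
}

my @results = (
    is_exactly("housekeeper",   "housekeeper"),  # whole string
    is_exactly("housekeeper\n", "housekeeper"),  # newline at the end
    is_exactly("housekeepers",  "housekeeper"),  # extra character at the end
    is_exactly("keeper",        "housekeeper"),  # too short
);

print "@results\n";
```

The first two calls succeed and the last two fail, which mirrors the
anchored examples above.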

Using character classes
A character class allows a set of possible characters, rather than just
a single character, to match at a particular point in a regex.
Character classes are denoted by brackets "[...]", with the set of
characters to be possibly matched inside. Here are some examples:

    /cat/;            # matches 'cat'
    /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/; # matches 'a'

In the last statement, even though 'c' is the first character in the
class, the earliest point at which the regex can match is 'a'.

    /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                    # 'yes', 'Yes', 'YES', etc.
    /yes/i;         # also match 'yes' in a case-insensitive way

The last example shows a match with an 'i' modifier, which makes the
match case-insensitive.

Character classes also have ordinary and special characters, but the
sets of ordinary and special characters inside a character class are
different from those outside a character class. The special characters
for a character class are "-]\^$" and are matched using an escape:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The special character '-' acts as a range operator within character
classes, so that the unwieldy "[0123456789]" and "[abc...xyz]" become
the svelte "[0-9]" and "[a-z]":

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If '-' is the first or last character in a character class, it is
treated as an ordinary character.

The special character "^" in the first position of a character class
denotes a negated character class, which matches any character but
those in the brackets. Both "[...]" and "[^...]" must match a
character, or the match fails. Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Perl has several abbreviations for common character classes. (These
definitions are those that Perl uses in ASCII-safe mode with the "/a"
modifier. Otherwise they could match many more non-ASCII Unicode
characters as well. See "Backslash sequences" in perlrecharclass for
details.)

·   \d is a digit and represents

        [0-9]

·   \s is a whitespace character and represents

        [\ \t\r\n\f]

·   \w is a word character (alphanumeric or _) and represents

        [0-9a-zA-Z_]

·   \D is a negated \d; it represents any character but a digit

        [^0-9]

·   \S is a negated \s; it represents any non-whitespace character

        [^\s]

·   \W is a negated \w; it represents any non-word character

        [^\w]

·   The period '.' matches any character but "\n"

The "\d\s\w\D\S\W" abbreviations can be used both inside and outside of
character classes. Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'
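
The difference the "/a" modifier makes to these abbreviations can be
seen with a digit from outside ASCII. This is a small sketch, assuming
Perl 5.14 or later (where "/a" is available); the sample character,
U+0663 ARABIC-INDIC DIGIT THREE, and the variable names are chosen
only for illustration.

```perl
use strict;
use warnings;

my $ascii_digit  = "7";
my $arabic_digit = "\x{0663}";   # ARABIC-INDIC DIGIT THREE, a Unicode digit

# Without /a, \d follows Unicode rules and accepts the non-ASCII digit.
my $unicode_rules = $arabic_digit =~ /^\d$/ ? 1 : 0;

# With /a, \d means [0-9] only, so the same character is rejected.
my $ascii_rules = $arabic_digit =~ /^\d$/a ? 1 : 0;

# An ASCII digit matches either way.
my $plain = $ascii_digit =~ /^\d$/a ? 1 : 0;

print "unicode rules: $unicode_rules, ascii rules: $ascii_rules, plain: $plain\n";
```

Input validation that must accept only 0 through 9 is one case where
"/a" or an explicit "[0-9]" may be the safer choice.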

The word anchor "\b" matches a boundary between a word character and a
non-word character "\w\W" or "\W\w":

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.
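
To confirm which occurrence each of these patterns picks, one option is
to look at where the match started. The sketch below relies on the
built-in array "@-", whose first element holds the start offset of the
last successful match; it is not covered in this guide, and the hash
keys are illustrative names.

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";
my %offset;

# Record the 0-based offset at which each pattern first matches.
$offset{start_of_word} = $-[0] if $x =~ /\bcat/;    # 'cat' in 'catenates'
$offset{end_of_word}   = $-[0] if $x =~ /cat\b/;    # 'cat' in 'Housecat'
$offset{whole_word}    = $-[0] if $x =~ /\bcat\b/;  # the final 'cat'

for my $name (sort keys %offset) {
    print "$name matched at offset $offset{$name}\n";
}
```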

Matching this or that
We can match different character strings with the alternation
metacharacter '|'. To match "dog" or "cat", we form the regex
"dog|cat". As before, Perl will try to match the regex at the earliest
possible point in the string. At each character position, Perl will
first try to match the first alternative, "dog". If "dog" doesn't
match, Perl will then try the next alternative, "cat". If "cat"
doesn't match either, then the match fails and Perl moves to the next
position in the string. Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though "dog" is the first alternative in the second regex, "cat"
is able to match earlier in the string.

    "cats" =~ /c|ca|cat|cats/; # matches "c"
    "cats" =~ /cats|cat|ca|c/; # matches "cats"

At a given character position, the first alternative that allows the
regex match to succeed will be the one that matches. Here, all the
alternatives match at the first string position, so the first matches.
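
Because the first successful alternative wins, the order of
alternatives matters when one is a prefix of another. The following
sketch shows the effect and one possible remedy, sorting a word list
longest-first before joining it with '|'. It uses capturing
parentheses, which this guide explains further on, and the word list
is made up for illustration.

```perl
use strict;
use warnings;

my $text = "cats";

# The parentheses capture whichever alternative matched.
my ($short_first) = $text =~ /(c|ca|cat|cats)/;
my ($long_first)  = $text =~ /(cats|cat|ca|c)/;

# Build an alternation from a list, longest word first, so that the
# longest candidate is tried before its own prefixes.
my @words       = qw(c ca cat cats);
my $alternation = join '|', sort { length($b) <=> length($a) } @words;
my ($longest)   = $text =~ /($alternation)/;

print "short first: $short_first\n";
print "long first:  $long_first\n";
print "sorted ($alternation): $longest\n";
```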

Grouping things and hierarchical matching
The grouping metacharacters "()" allow a part of a regex to be treated
as a single unit. Parts of a regex are grouped by enclosing them in
parentheses. The regex "house(cat|keeper)" means match "house"
followed by either "cat" or "keeper". Some more examples are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

    /house(cat|)/;  # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'.  Note groups can be nested.

    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

Extracting matches
The grouping metacharacters "()" also allow the extraction of the parts
of a string that matched. For each grouping, the part that matched
inside goes into the special variables $1, $2, etc. They can be used
just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

In list context, a match "/regex/" with groupings will return the list
of matched values "($1,$2,...)". So we could rewrite it as

    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regex are nested, $1 gets the group with the
leftmost opening parenthesis, $2 the next opening parenthesis, etc.
For example, here is a complex regex and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

Associated with the matching variables $1, $2, ... are the
backreferences "\g1", "\g2", ... Backreferences are matching variables
that can be used inside a regex:

    /(\w\w\w)\s\g1/; # find sequences like 'the the' in string

$1, $2, ... should only be used outside of a regex, and "\g1", "\g2",
... only inside a regex.
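
The extraction examples above assume the match succeeds. In practice it
is usually worth checking, since after a failed match $1 and friends
still hold values from an earlier successful match. Here is a minimal
sketch using a hypothetical helper, parse_time, together with a
backreference check; the helper and the sample strings are
illustrative assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash reference for an hh:mm:ss string,
# or nothing if the string does not have that shape.
sub parse_time {
    my ($time) = @_;
    # In list context the match returns the captured groups; an empty
    # list on failure makes the assignment false, so we return early.
    my ($hours, $minutes, $seconds) = $time =~ /^(\d\d):(\d\d):(\d\d)$/
        or return;
    return { hours => $hours, minutes => $minutes, seconds => $seconds };
}

my $parsed = parse_time("12:34:56");
my $bad    = parse_time("noon");

# \g1 must repeat whatever the first group captured.
my ($doubled) = "Paris in the the spring" =~ /\b(\w+)\s+\g1\b/;

print "hours=$parsed->{hours} minutes=$parsed->{minutes} seconds=$parsed->{seconds}\n";
print "bad input rejected\n" unless defined $bad;
print "doubled word: $doubled\n";
```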

Matching repetitions
The quantifier metacharacters "?", "*", "+", and "{}" allow us to
determine the number of repeats of a portion of a regex we consider to
be a match. Quantifiers are put immediately after the character,
character class, or grouping that we want to specify. They have the
following meanings:

·   "a?" = match 'a' 1 or 0 times

·   "a*" = match 'a' 0 or more times, i.e., any number of times

·   "a+" = match 'a' 1 or more times, i.e., at least once

·   "a{n,m}" = match at least "n" times, but not more than "m" times.

·   "a{n,}" = match at least "n" times

·   "a{n}" = match exactly "n" times

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\g1/;    # match doubled words of arbitrary length
    $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                           # than 4 digits
    $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3 digit dates

These quantifiers will try to match as much of the string as possible,
while still allowing the regex to match. So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

The first quantifier ".*" grabs as much of the string as possible while
still having the regex match. The second quantifier ".*" has no string
left to it, so it matches 0 times.
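
The quantifier examples above can be run as a small script. This sketch
collects the greedy captures in list context and wraps the two-or-four
digit year check in a hypothetical helper, valid_year; the helper name
and test values are assumptions made for illustration.

```perl
use strict;
use warnings;

# Greedy matching: the first .* takes as much as it can while still
# leaving an 'at' for the second group.
my $x = 'the cat in the hat';
my @captures = $x =~ /^(.*)(at)(.*)$/;

# Hypothetical helper: accept exactly two or exactly four digits.
sub valid_year {
    my ($year) = @_;
    return $year =~ /^\d{4}$|^\d{2}$/ ? 1 : 0;
}

my @checks = map { valid_year($_) } qw(1999 99 199 19999);

print join('|', @captures), "\n";
print "@checks\n";
```

The third capture is the empty string, and only the first two candidate
years pass the check.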

More matching
There are a few more things you might want to know about matching
operators. The global modifier "//g" allows the matching operator to
match within a string as many times as possible. In scalar context,
successive matches against a string will have "//g" jump from match to
match, keeping track of position in the string as it goes along. You
can get or set the position with the "pos()" function. For example,

    $x = "cat dog house"; # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position. If
you don't want the position reset after failure to match, add the
"//c" modifier, as in "/regex/gc".

In list context, "//g" returns a list of matched groupings, or if there
are no groupings, a list of matches to the whole regex. So

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'
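
The position bookkeeping described above, including the effect of
"//c", can be observed directly with pos(). The following is a sketch
only; the variable names are illustrative.

```perl
use strict;
use warnings;

my $x = "cat dog house";

# Scalar-context //g: each successful match advances pos($x).
my @ends;
while ($x =~ /(\w+)/g) {
    push @ends, pos($x);
}

# The loop ended on a failed match, which reset the position.
my $was_reset = defined(pos($x)) ? 0 : 1;

my $first  = $x =~ /\w+/g;    # starts over: matches 'cat', pos is now 3
my $failed = $x =~ /\d+/gc;   # no digits, so this fails; /c keeps pos
my $kept   = pos($x);         # still the position after 'cat'
my $second = $x =~ /(\w+)/g;  # resumes from the kept position
my $word   = $1;

print "ends: @ends\n";
print "reset after failure: $was_reset\n";
print "position kept by /c: $kept, next word: $word\n";
```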

Search and replace
Search and replace is performed using "s/regex/replacement/modifiers".
The "replacement" is a Perl double-quoted string that replaces in the
string whatever is matched with the "regex". The operator "=~" is also
used here to associate a string with "s///". If matching against $_,
the "$_ =~" can be dropped. If there is a match, "s///" returns the
number of substitutions made; otherwise it returns false. Here are a
few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

With the "s///" operator, the matched variables $1, $2, etc. are
immediately available for use in the replacement expression. With the
global modifier, "s///g" will search and replace all occurrences of the
regex in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"

The non-destructive modifier "s///r" causes the result of the
substitution to be returned instead of modifying $_ (or whatever
variable the substitute was bound to with "=~"):

    $x = "I like dogs.";
    $y = $x =~ s/dogs/cats/r;
    print "$x $y\n"; # prints "I like dogs. I like cats."

    $x = "Cats are great.";
    print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~ s/Frogs/Hedgehogs/r, "\n";
    # prints "Hedgehogs are great."

    @foo = map { s/[a-z]/X/r } qw(a b c 1 2 3);
    # @foo is now qw(X X X 1 2 3)

The evaluation modifier "s///e" wraps an "eval{...}" around the
replacement string and the evaluated result is substituted for the
matched substring. Some examples:

    # reverse all the words in a string
    $x = "the cat in the hat";
    $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"

    # convert percentage to decimal
    $x = "A 39% hit rate";
    $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

The last example shows that "s///" can use other delimiters, such as
"s!!!" and "s{}{}", and even "s{}//". If single quotes are used
"s'''", then the regex and replacement are treated as single-quoted
strings.
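
The return value of "s///" and the combination of the "g", "e" and "r"
modifiers are easiest to see side by side. This is a minimal sketch,
assuming Perl 5.14 or later for "/r"; the sample strings and variable
names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "I batted 4 for 4";

# s///g returns the number of substitutions it made...
my $count = $text =~ s/4/four/g;

# ...and a false value when nothing matched.
my $none = $text =~ s/5/five/g;

# /e evaluates the replacement as code, /g repeats it for every match,
# and /r returns the new string, leaving $price untouched.
my $price   = "3 apples and 12 pears";
my $doubled = $price =~ s/(\d+)/$1 * 2/ger;

print "$count substitutions: $text\n";
print "no match\n" unless $none;
print "$price -> $doubled\n";
```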

The split operator
"split /regex/, string" splits "string" into a list of substrings and
returns that list. The regex determines the character sequence that
"string" is split with respect to. For example, to split a string into
words, use

    $x = "Calvin and Hobbes";
    @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                              # $word[1] = 'and'
                              # $word[2] = 'Hobbes'

To extract a comma-delimited list of numbers, use

    $x = "1.618,2.718,   3.142";
    @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                # $const[1] = '2.718'
                                # $const[2] = '3.142'

If the empty regex "//" is used, the string is split into individual
characters. If the regex has groupings, then the list produced
contains the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'

Since the first character of $x matched the regex, "split" prepended an
empty initial element to the list.
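
The split behaviors described in this section can be checked with a
short script. The sketch below also shows one way to discard the empty
leading element, by filtering the result with grep; that filtering
step is an addition for illustration and not something split does by
itself.

```perl
use strict;
use warnings;

# The empty regex splits into individual characters.
my @chars = split //, "perl";

# A regex separator: a comma followed by optional whitespace.
my @const = split /,\s*/, "1.618,2.718,   3.142";

# With a grouping, the separators themselves appear in the result.
my @parts = split m!(/)!, "/usr/bin";

# Drop empty elements, such as the one before the leading '/'.
my @dirs = grep { length $_ } split m!/!, "/usr/bin";

print join(' ', @chars), "\n";
print join(' ', @const), "\n";
print join(' ', map { "'$_'" } @parts), "\n";
print join(' ', @dirs), "\n";
```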

BUGS
None.

SEE ALSO
This is just a quick start guide. For a more in-depth tutorial on
regexes, see perlretut and for the reference page, see perlre.

AUTHOR AND COPYRIGHT
Copyright (c) 2000 Mark Kvale All rights reserved.

This document may be distributed under the same terms as Perl itself.

Acknowledgments
The author would like to thank Mark-Jason Dominus, Tom Christiansen,
Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
comments.

perl v5.16.3                      2013-03-04                    PERLREQUICK(1)