perlrequick(1)

PERLREQUICK(1)         Perl Programmers Reference Guide         PERLREQUICK(1)

NAME

       perlrequick - Perl regular expressions quick start

DESCRIPTION

       This page covers the very basics of understanding, creating and using
       regular expressions ('regexes') in Perl.

The Guide

       Simple word matching

       The simplest regex is simply a word, or more generally, a string of
       characters.  A regex consisting of a word matches any string that
       contains that word:

           "Hello World" =~ /World/;  # matches

       In this statement, "World" is a regex and the "//" enclosing "/World/"
       tells perl to search a string for a match.  The operator "=~"
       associates the string with the regex match and produces a true value
       if the regex matched, or false if the regex did not match.  In our
       case, "World" matches the second word in "Hello World", so the
       expression is true.  This idea has several variations.

       Expressions like this are useful in conditionals:

           print "It matches\n" if "Hello World" =~ /World/;

       The sense of the match can be reversed by using the "!~" operator:

           print "It doesn't match\n" if "Hello World" !~ /World/;

       The literal string in the regex can be replaced by a variable:

           $greeting = "World";
           print "It matches\n" if "Hello World" =~ /$greeting/;

       If you're matching against $_, the "$_ =~" part can be omitted:

           $_ = "Hello World";
           print "It matches\n" if /World/;

       Finally, the "//" default delimiters for a match can be changed to
       arbitrary delimiters by putting an 'm' out front:

           "Hello World" =~ m!World!;   # matches, delimited by '!'
           "Hello World" =~ m{World};   # matches, note the matching '{}'
           "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                        # '/' becomes an ordinary char

       Regexes must match a part of the string exactly in order for the
       statement to be true:

           "Hello World" =~ /world/;  # doesn't match, case sensitive
           "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
           "Hello World" =~ /World /; # doesn't match, no ' ' at end

       perl will always match at the earliest possible point in the string:

           "Hello World" =~ /o/;       # matches 'o' in 'Hello'
           "That hat is red" =~ /hat/; # matches 'hat' in 'That'

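       The leftmost-match rule can be checked directly.  The following is a
       minimal sketch, not part of the original guide: it relies on Perl's
       built-in "@-" array, whose first element "$-[0]" holds the offset at
       which the last successful match began, and on "qr//" to pass a regex
       as a value.  The helper name "match_start" is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset at which $regex first matches
# $string, or undef if it does not match at all.
sub match_start {
    my ($string, $regex) = @_;
    return $string =~ $regex ? $-[0] : undef;
}

my $hello = match_start("Hello World", qr/o/);       # 'o' in 'Hello'
my $hat   = match_start("That hat is red", qr/hat/); # 'hat' in 'That'
my $none  = match_start("Hello World", qr/world/);   # case sensitive

print "o at $hello, hat at $hat\n";
print "no match for /world/\n" unless defined $none;
```

       Offsets count from 0, so the 'o' of 'Hello' is reported at 4 and the
       'hat' inside 'That' at 1.
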
       Not all characters can be used 'as is' in a match.  Some characters,
       called metacharacters, are reserved for use in regex notation.  The
       metacharacters are

           {}[]()^$.|*+?\

       A metacharacter can be matched by putting a backslash before it:

           "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
           "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
           'C:\WIN32' =~ /C:\\WIN/;                # matches
           "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

       In the last regex, the forward slash '/' is also backslashed, because
       it is used to delimit the regex.

       Non-printable ASCII characters are represented by escape sequences.
       Common examples are "\t" for a tab, "\n" for a newline, and "\r" for a
       carriage return.  Arbitrary bytes are represented by octal escape
       sequences, e.g., "\033", or hexadecimal escape sequences, e.g., "\x1B":

           "1000\t2000" =~ m(0\t2);        # matches
           "cat"        =~ /\143\x61\x74/; # matches, but a weird way to spell cat

       Regexes are treated mostly as double quoted strings, so variable
       substitution works:

           $foo = 'house';
           'cathouse' =~ /cat$foo/;   # matches
           'housecat' =~ /${foo}cat/; # matches

       With all of the regexes above, if the regex matched anywhere in the
       string, it was considered a match.  To specify where it should match,
       we would use the anchor metacharacters "^" and "$".  The anchor "^"
       means match at the beginning of the string and the anchor "$" means
       match at the end of the string, or before a newline at the end of the
       string.  Some examples:

           "housekeeper" =~ /keeper/;         # matches
           "housekeeper" =~ /^keeper/;        # doesn't match
           "housekeeper" =~ /keeper$/;        # matches
           "housekeeper\n" =~ /keeper$/;      # matches
           "housekeeper" =~ /^housekeeper$/;  # matches

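       Using both anchors together turns a "contains" test into a "whole
       string" test.  The sketch below is illustrative only; the candidate
       strings are made up, and "grep" is used simply to collect the strings
       that match.

```perl
use strict;
use warnings;

# Keep only the candidates that consist of exactly 'keeper'.
my @candidates = ("keeper", "keeper\n", "housekeeper", "keepers");
my @exact = grep { /^keeper$/ } @candidates;

# "keeper" and "keeper\n" both qualify: '$' also matches just before
# a newline that ends the string.
print scalar(@exact), " of ", scalar(@candidates), " match /^keeper\$/\n";
```

       The trailing-newline case is worth remembering when matching lines
       that have not been chomped.
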
       Using character classes

       A character class allows a set of possible characters, rather than just
       a single character, to match at a particular point in a regex.
       Character classes are denoted by brackets "[...]", with the set of
       characters to be possibly matched inside.  Here are some examples:

           /cat/;            # matches 'cat'
           /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
           "abc" =~ /[cab]/; # matches 'a'

       In the last statement, even though 'c' is the first character in the
       class, the earliest point at which the regex can match is 'a'.

           /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                           # 'yes', 'Yes', 'YES', etc.
           /yes/i;         # also match 'yes' in a case-insensitive way

       The last example shows a match with an 'i' modifier, which makes the
       match case-insensitive.

       Character classes also have ordinary and special characters, but the
       sets of ordinary and special characters inside a character class are
       different from those outside a character class.  The special characters
       for a character class are "-]\^$" and are matched using an escape:

           /[\]c]def/; # matches ']def' or 'cdef'
           $x = 'bcr';
           /[$x]at/;   # matches 'bat', 'cat', or 'rat'
           /[\$x]at/;  # matches '$at' or 'xat'
           /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

       The special character '-' acts as a range operator within character
       classes, so that the unwieldy "[0123456789]" and "[abc...xyz]" become
       the svelte "[0-9]" and "[a-z]":

           /item[0-9]/;    # matches 'item0' or ... or 'item9'
           /[0-9a-fA-F]/;  # matches a hexadecimal digit

       If '-' is the first or last character in a character class, it is
       treated as an ordinary character.

       The special character "^" in the first position of a character class
       denotes a negated character class, which matches any character but
       those in the brackets.  Both "[...]" and "[^...]" must match a
       character, or the match fails.  For example:

           /[^a]at/;  # doesn't match 'aat' or 'at', but matches
                      # all other 'bat', 'cat', '0at', '%at', etc.
           /[^0-9]/;  # matches a non-numeric character
           /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

       Perl has several abbreviations for common character classes:

       ·   \d is a digit and represents

               [0-9]

       ·   \s is a whitespace character and represents

               [\ \t\r\n\f]

       ·   \w is a word character (alphanumeric or _) and represents

               [0-9a-zA-Z_]

       ·   \D is a negated \d; it represents any character but a digit

               [^0-9]

       ·   \S is a negated \s; it represents any non-whitespace character

               [^\s]

       ·   \W is a negated \w; it represents any non-word character

               [^\w]

       ·   The period '.' matches any character but "\n"

       The "\d\s\w\D\S\W" abbreviations can be used both inside and outside of
       character classes.  Here are some in use:

           /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
           /[\d\s]/;         # matches any digit or whitespace character
           /\w\W\w/;         # matches a word char, followed by a
                             # non-word char, followed by a word char
           /..rt/;           # matches any two chars, followed by 'rt'
           /end\./;          # matches 'end.'
           /end[.]/;         # same thing, matches 'end.'

       The word anchor "\b" matches a boundary between a word character and a
       non-word character "\w\W" or "\W\w":

           $x = "Housecat catenates house and cat";
           $x =~ /\bcat/;    # matches cat in 'catenates'
           $x =~ /cat\b/;    # matches cat in 'Housecat'
           $x =~ /\bcat\b/;  # matches 'cat' at end of string

       In the last example, the end of the string is considered a word
       boundary.

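       The effect of "\b" and of the class abbreviations can be seen by
       comparing match offsets.  This is a sketch under the assumption that
       "$-[0]" (the built-in start offset of the last successful match) is
       an acceptable way to report where a match occurred; the variable
       names are illustrative.

```perl
use strict;
use warnings;

my $text = "Housecat catenates house and cat";

# Without word anchors, 'cat' is first found inside 'Housecat'.
my $plain = $text =~ /cat/ ? $-[0] : -1;

# With \b on both sides, only the standalone word at the end matches.
my $word = $text =~ /\bcat\b/ ? $-[0] : -1;

# \d with anchors checks that a string is exactly hh:mm:ss shaped.
my $is_time = "12:34:56" =~ /^\d\d:\d\d:\d\d$/ ? 1 : 0;

print "plain: $plain, whole word: $word, time: $is_time\n";
```

       The plain pattern matches at offset 5, while the word-anchored one
       has to travel to offset 29, the final word of the string.
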
       Matching this or that

       We can match different character strings with the alternation
       metacharacter '|'.  To match "dog" or "cat", we form the regex
       "dog|cat".  As before, perl will try to match the regex at the
       earliest possible point in the string.  At each character position,
       perl will first try to match the first alternative, "dog".  If "dog"
       doesn't match, perl will then try the next alternative, "cat".  If
       "cat" doesn't match either, then the match fails and perl moves to the
       next position in the string.  Some examples:

           "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
           "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

       Even though "dog" is the first alternative in the second regex, "cat"
       is able to match earlier in the string.

           "cats"          =~ /c|ca|cat|cats/; # matches "c"
           "cats"          =~ /cats|cat|ca|c/; # matches "cats"

       At a given character position, the first alternative that allows the
       regex match to succeed will be the one that matches.  Here, all the
       alternatives match at the first string position, so the first matches.

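       To see which alternative actually won, the matched text can be
       inspected.  The sketch below is illustrative and not from the original
       guide; it uses the built-in variable "$&", which holds the text of the
       last successful match, and a hypothetical helper named "matched_text".

```perl
use strict;
use warnings;

# Hypothetical helper: return the text matched by $regex in $string,
# or undef when there is no match.  $& holds the last matched text.
sub matched_text {
    my ($string, $regex) = @_;
    return $string =~ $regex ? $& : undef;
}

my $first = matched_text("cats and dogs", qr/dog|cat|bird/); # "cat"
my $short = matched_text("cats", qr/c|ca|cat|cats/);         # "c"
my $long  = matched_text("cats", qr/cats|cat|ca|c/);         # "cats"

print "$first $short $long\n";
```

       Position in the string is decided first and order of alternatives
       second, which is why listing longer alternatives first matters only
       when several of them can match at the same position.
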
       Grouping things and hierarchical matching

       The grouping metacharacters "()" allow a part of a regex to be treated
       as a single unit.  Parts of a regex are grouped by enclosing them in
       parentheses.  The regex "house(cat|keeper)" means match "house"
       followed by either "cat" or "keeper".  Some more examples are

           /(a|b)b/;    # matches 'ab' or 'bb'
           /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

           /house(cat|)/;  # matches either 'housecat' or 'house'
           /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                               # 'house'.  Note groups can be nested.

           "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                                    # because '20\d\d' can't match

       Extracting matches

       The grouping metacharacters "()" also allow the extraction of the parts
       of a string that matched.  For each grouping, the part that matched
       inside goes into the special variables $1, $2, etc.  They can be used
       just as ordinary variables:

           # extract hours, minutes, seconds
           $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
           $hours = $1;
           $minutes = $2;
           $seconds = $3;

       In list context, a match "/regex/" with groupings will return the list
       of matched values "($1,$2,...)".  So we could rewrite it as

           ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

       If the groupings in a regex are nested, $1 gets the group with the
       leftmost opening parenthesis, $2 the next opening parenthesis, etc.
       For example, here is a complex regex and the matching variables
       indicated below it:

           /(ab(cd|ef)((gi)|j))/;
            1  2      34

       Associated with the matching variables $1, $2, ... are the
       backreferences "\1", "\2", ...  Backreferences are matching variables
       that can be used inside a regex:

           /(\w\w\w)\s\1/; # find sequences like 'the the' in string

       $1, $2, ... should only be used outside of a regex, and "\1", "\2", ...
       only inside a regex.

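       A short sketch of both extraction styles follows.  The sample strings
       are made up for illustration, and the "defined" check is one possible
       way to notice a failed match, since the list returned by a failed
       match is empty and leaves the variables undefined.

```perl
use strict;
use warnings;

my $time = "Backup finished at 23:05:59 today";

# In list context the match returns ($1, $2, $3).
my ($hours, $minutes, $seconds) = $time =~ /(\d\d):(\d\d):(\d\d)/;
print "h=$hours m=$minutes s=$seconds\n" if defined $hours;

# A backreference finds a three-letter word repeated after whitespace;
# afterwards $1 holds the repeated word.
my $doubled = "Paris in the the spring" =~ /(\w\w\w)\s\1/ ? $1 : '';
print "doubled word: $doubled\n";
```

       Testing the result before using $1 avoids picking up a stale value
       left over from an earlier successful match.
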
       Matching repetitions

       The quantifier metacharacters "?", "*", "+", and "{}" allow us to
       determine the number of repeats of a portion of a regex we consider to
       be a match.  Quantifiers are put immediately after the character,
       character class, or grouping that we want to specify.  They have the
       following meanings:

       ·   "a?" = match 'a' 1 or 0 times

       ·   "a*" = match 'a' 0 or more times, i.e., any number of times

       ·   "a+" = match 'a' 1 or more times, i.e., at least once

       ·   "a{n,m}" = match at least "n" times, but not more than "m" times

       ·   "a{n,}" = match at least "n" times

       ·   "a{n}" = match exactly "n" times

       Here are some examples:

           /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                            # any number of digits
           /(\w+)\s+\1/;    # match doubled words of arbitrary length
           $year =~ /\d{2,4}/;  # make sure year is at least 2 but not more
                                # than 4 digits
           $year =~ /\d{4}|\d{2}/;    # better match; throw out 3 digit dates

       These quantifiers will try to match as much of the string as possible,
       while still allowing the regex to match.  So we have

           $x = 'the cat in the hat';
           $x =~ /^(.*)(at)(.*)$/; # matches,
                                   # $1 = 'the cat in the h'
                                   # $2 = 'at'
                                   # $3 = ''   (0 matches)

       The first quantifier ".*" grabs as much of the string as possible while
       still having the regex match.  The second quantifier ".*" has no string
       left to it, so it matches 0 times.

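       The year patterns above are unanchored, so they succeed on any string
       that contains a long enough run of digits.  The sketch below adds "^"
       and "$" to turn them into a whole-string check; that anchoring is an
       assumption about the intent rather than something the guide states.
       The second half repeats the greedy example using list-context
       captures.

```perl
use strict;
use warnings;

# Anchored year check: exactly 4 digits or exactly 2 digits.
my @years = ("98", "1998", "199", "19980");
my @valid = grep { /^(\d{4}|\d{2})$/ } @years;
print "valid years: @valid\n";

# Greedy matching: the first .* takes as much as it can while still
# letting 'at' match, so it runs up to the last 'at' in the string.
my $x = 'the cat in the hat';
my ($before, $at, $after) = $x =~ /^(.*)(at)(.*)$/;
print "before='$before' after='$after'\n";
```

       Only "98" and "1998" pass the anchored check, and the first capture
       swallows everything up to the final 'at', leaving the last capture
       empty.
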
       More matching

       There are a few more things you might want to know about matching
       operators.  In the code

           $pattern = 'Seuss';
           while (<>) {
               print if /$pattern/;
           }

       perl has to re-evaluate $pattern each time through the loop.  If
       $pattern won't be changing, use the "//o" modifier to perform variable
       substitutions only once.  If you don't want any substitutions at all,
       use the special delimiter "m''":

           @pattern = ('Seuss');
           m/@pattern/; # matches 'Seuss'
           m'@pattern'; # matches the literal string '@pattern'

       The global modifier "//g" allows the matching operator to match within
       a string as many times as possible.  In scalar context, successive
       matches against a string will have "//g" jump from match to match,
       keeping track of position in the string as it goes along.  You can get
       or set the position with the "pos()" function.  For example,

           $x = "cat dog house"; # 3 words
           while ($x =~ /(\w+)/g) {
               print "Word is $1, ends at position ", pos $x, "\n";
           }

       prints

           Word is cat, ends at position 3
           Word is dog, ends at position 7
           Word is house, ends at position 13

       A failed match or changing the target string resets the position.  If
       you don't want the position reset after failure to match, add the
       "//c" modifier, as in "/regex/gc".

       In list context, "//g" returns a list of matched groupings, or if there
       are no groupings, a list of matches to the whole regex.  So

           @words = ($x =~ /(\w+)/g);  # matches,
                                       # $words[0] = 'cat'
                                       # $words[1] = 'dog'
                                       # $words[2] = 'house'

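       When the regex has more than one grouping, list-context "//g" returns
       the groupings of every match in order, which fits naturally into a
       hash.  The following sketch uses a made-up "name=number" string; the
       variable names are illustrative.

```perl
use strict;
use warnings;

my $config = "width=80 height=24 depth=8";

# List context with two groupings: //g returns ($1, $2) for every
# match, flattened into one list, which suits a hash assignment.
my %setting = $config =~ /(\w+)=(\d+)/g;

# Scalar context: each iteration resumes where the previous match ended.
my @ends;
while ($config =~ /\d+/g) {
    push @ends, pos $config;
}

print "width is $setting{width}; numbers end at @ends\n";
```

       The loop records the positions 8, 18 and 26, the offsets just past
       each number in the string.
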
       Search and replace

       Search and replace is performed using "s/regex/replacement/modifiers".
       The "replacement" is a Perl double quoted string that replaces in the
       string whatever is matched with the "regex".  The operator "=~" is also
       used here to associate a string with "s///".  If matching against $_,
       the "$_ =~" can be dropped.  If there is a match, "s///" returns the
       number of substitutions made, otherwise it returns false.  Here are a
       few examples:

           $x = "Time to feed the cat!";
           $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
           $y = "'quoted words'";
           $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                                  # $y contains "quoted words"

       With the "s///" operator, the matched variables $1, $2, etc. are
       immediately available for use in the replacement expression.  With the
       global modifier, "s///g" will search and replace all occurrences of the
       regex in the string:

           $x = "I batted 4 for 4";
           $x =~ s/4/four/;   # $x contains "I batted four for 4"
           $x = "I batted 4 for 4";
           $x =~ s/4/four/g;  # $x contains "I batted four for four"

       The evaluation modifier "s///e" wraps an "eval{...}" around the
       replacement string and the evaluated result is substituted for the
       matched substring.  Some examples:

           # reverse all the words in a string
           $x = "the cat in the hat";
           $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"

           # convert percentage to decimal
           $x = "A 39% hit rate";
           $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

       The last example shows that "s///" can use other delimiters, such as
       "s!!!" and "s{}{}", and even "s{}//".  If single quotes are used, as in
       "s'''", then the regex and replacement are treated as single quoted
       strings.

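       The return value of "s///" is easy to overlook.  The sketch below,
       with made-up sample strings, captures the substitution count and then
       uses "s///ge" to compute each replacement from the matched text.

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";

# s///g returns the number of substitutions it performed.
my $count = ($x =~ s/4/four/g);
print "$count replacements: $x\n";

# With /e the replacement is Perl code; here each number is doubled.
my $y = "3 apples and 12 pears";
$y =~ s/(\d+)/$1 * 2/ge;
print "$y\n";
```

       Because the count is false when nothing was replaced, the same
       expression can also serve as the condition of an "if".
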
       The split operator

       "split /regex/, string" splits "string" into a list of substrings and
       returns that list.  The regex determines the character sequence that
       "string" is split with respect to.  For example, to split a string into
       words, use

           $x = "Calvin and Hobbes";
           @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                                     # $word[1] = 'and'
                                     # $word[2] = 'Hobbes'

       To extract a comma-delimited list of numbers, use

           $x = "1.618,2.718,   3.142";
           @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                       # $const[1] = '2.718'
                                       # $const[2] = '3.142'

       If the empty regex "//" is used, the string is split into individual
       characters.  If the regex has groupings, then the list produced
       contains the matched substrings from the groupings as well:

           $x = "/usr/bin";
           @parts = split m!(/)!, $x;  # $parts[0] = ''
                                       # $parts[1] = '/'
                                       # $parts[2] = 'usr'
                                       # $parts[3] = '/'
                                       # $parts[4] = 'bin'

       Since the first character of $x matched the regex, "split" prepended an
       empty initial element to the list.

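       The two special cases above can be combined in a short sketch.  It is
       illustrative only: the use of "join" to rebuild the string is an
       addition for demonstration, relying on the fact that a grouping keeps
       the separators in the list.

```perl
use strict;
use warnings;

# An empty regex splits a string into individual characters.
my @chars = split //, "perl";

# A grouping in the regex keeps the separators in the result, which
# lets the original string be rebuilt with join.
my $path  = "/usr/bin";
my @parts = split m!(/)!, $path;
my $again = join '', @parts;

print scalar(@chars), " chars, ", scalar(@parts), " parts, rebuilt: $again\n";
```

       Without the grouping the separators would be discarded and the
       original string could not be reassembled from the pieces alone.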

BUGS

       None.

AUTHOR AND COPYRIGHT

       Copyright (c) 2000 Mark Kvale All rights reserved.

       This document may be distributed under the same terms as Perl itself.

       Acknowledgments

       The author would like to thank Mark-Jason Dominus, Tom Christiansen,
       Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
       comments.

perl v5.8.8                       2006-01-07                    PERLREQUICK(1)
