perlrequick(1)

PERLREQUICK(1)         Perl Programmers Reference Guide         PERLREQUICK(1)

NAME

       perlrequick - Perl regular expressions quick start

DESCRIPTION

       This page covers the very basics of understanding, creating and using
       regular expressions ('regexes') in Perl.

The Guide

   Simple word matching
       The simplest regex is simply a word, or more generally, a string of
       characters.  A regex consisting of a word matches any string that
       contains that word:

           "Hello World" =~ /World/;  # matches

       In this statement, "World" is a regex and the "//" enclosing "/World/"
       tells Perl to search a string for a match.  The operator "=~"
       associates the string with the regex match and produces a true value if
       the regex matched, or false if the regex did not match.  In our case,
       "World" matches the second word in "Hello World", so the expression is
       true.  This idea has several variations.

       Expressions like this are useful in conditionals:

           print "It matches\n" if "Hello World" =~ /World/;

       The sense of the match can be reversed by using the "!~" operator:

           print "It doesn't match\n" if "Hello World" !~ /World/;

       The literal string in the regex can be replaced by a variable:

           $greeting = "World";
           print "It matches\n" if "Hello World" =~ /$greeting/;

       If you're matching against $_, the "$_ =~" part can be omitted:

           $_ = "Hello World";
           print "It matches\n" if /World/;

       Finally, the "//" default delimiters for a match can be changed to
       arbitrary delimiters by putting an 'm' out front:

           "Hello World" =~ m!World!;   # matches, delimited by '!'
           "Hello World" =~ m{World};   # matches, note the matching '{}'
           "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                        # '/' becomes an ordinary char

       Regexes must match a part of the string exactly in order for the
       statement to be true:

           "Hello World" =~ /world/;  # doesn't match, case sensitive
           "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
           "Hello World" =~ /World /; # doesn't match, no ' ' at end

       Perl will always match at the earliest possible point in the string:

           "Hello World" =~ /o/;       # matches 'o' in 'Hello'
           "That hat is red" =~ /hat/; # matches 'hat' in 'That'

       Not all characters can be used 'as is' in a match.  Some characters,
       called metacharacters, are reserved for use in regex notation.  The
       metacharacters are

           {}[]()^$.|*+?\

       A metacharacter can be matched by putting a backslash before it:

           "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
           "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
           'C:\WIN32' =~ /C:\\WIN/;               # matches
           "/usr/bin/perl" =~ /\/usr\/bin\/perl/; # matches

       In the last regex, the forward slash '/' is also backslashed, because
       it is used to delimit the regex.
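
       When the text to be matched arrives in a variable, adding the
       backslashes by hand is not practical.  As a minimal sketch, Perl's
       built-in "quotemeta" function, or the equivalent "\Q...\E" escape
       inside a regex, can do the escaping.  The variable names here are
       invented for the illustration:

```perl
use strict;
use warnings;

my $expr = "2+2";      # contains the metacharacter '+'
my $text = "2+2=4";

# Interpolated as-is, '+' acts as a quantifier, so this looks for '22', '222', ...
my $raw_matches = ($text =~ /$expr/) ? 1 : 0;

# quotemeta() backslashes the '+', giving the string 2\+2
my $escaped         = quotemeta($expr);
my $escaped_matches = ($text =~ /$escaped/) ? 1 : 0;

# \Q...\E applies the same escaping inline
my $inline_matches = ($text =~ /\Q$expr\E/) ? 1 : 0;

print "raw: $raw_matches, escaped: $escaped_matches, inline: $inline_matches\n";
# prints "raw: 0, escaped: 1, inline: 1"
```

       Only the unescaped form fails, because there "+" is read as regex
       notation rather than as a literal plus sign.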

       Non-printable ASCII characters are represented by escape sequences.
       Common examples are "\t" for a tab, "\n" for a newline, and "\r" for a
       carriage return.  Arbitrary bytes are represented by octal escape
       sequences, e.g., "\033", or hexadecimal escape sequences, e.g., "\x1B":

           "1000\t2000" =~ m(0\t2);      # matches
           "cat"      =~ /\143\x61\x74/; # matches in ASCII, but a weird way to spell cat

       Regexes are treated mostly as double-quoted strings, so variable
       substitution works:

           $foo = 'house';
           'cathouse' =~ /cat$foo/;   # matches
           'housecat' =~ /${foo}cat/; # matches

       With all of the regexes above, if the regex matched anywhere in the
       string, it was considered a match.  To specify where it should match,
       we would use the anchor metacharacters "^" and "$".  The anchor "^"
       means match at the beginning of the string and the anchor "$" means
       match at the end of the string, or before a newline at the end of the
       string.  Some examples:

           "housekeeper" =~ /keeper/;         # matches
           "housekeeper" =~ /^keeper/;        # doesn't match
           "housekeeper" =~ /keeper$/;        # matches
           "housekeeper\n" =~ /keeper$/;      # matches
           "housekeeper" =~ /^housekeeper$/;  # matches
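
       Using both anchors together turns a "contains" test into an "is
       exactly" test.  The following sketch wraps that idea in a small
       helper; the name "is_exactly" is hypothetical and not part of Perl:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $line consists of $word and nothing else,
# apart from one optional trailing newline, which '$' tolerates.
sub is_exactly {
    my ($line, $word) = @_;
    return $line =~ /^$word$/ ? 1 : 0;
}

my @results = map { is_exactly($_, "keeper") }
              ("keeper", "keeper\n", "housekeeper", "keepers");

print "@results\n";    # prints "1 1 0 0"
```

       The helper assumes $word holds only ordinary characters; any
       metacharacters in it would be interpreted as regex notation.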

   Using character classes
       A character class allows a set of possible characters, rather than just
       a single character, to match at a particular point in a regex.
       Character classes are denoted by brackets "[...]", with the set of
       characters to be possibly matched inside.  Here are some examples:

           /cat/;            # matches 'cat'
           /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
           "abc" =~ /[cab]/; # matches 'a'

       In the last statement, even though 'c' is the first character in the
       class, the earliest point at which the regex can match is 'a'.

           /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                           # 'yes', 'Yes', 'YES', etc.
           /yes/i;         # also match 'yes' in a case-insensitive way

       The last example shows a match with an 'i' modifier, which makes the
       match case-insensitive.

       Character classes also have ordinary and special characters, but the
       sets of ordinary and special characters inside a character class are
       different than those outside a character class.  The special characters
       for a character class are "-]\^$" and are matched using an escape:

           /[\]c]def/; # matches ']def' or 'cdef'
           $x = 'bcr';
           /[$x]at/;   # matches 'bat', 'cat', or 'rat'
           /[\$x]at/;  # matches '$at' or 'xat'
           /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

       The special character '-' acts as a range operator within character
       classes, so that the unwieldy "[0123456789]" and "[abc...xyz]" become
       the svelte "[0-9]" and "[a-z]":

           /item[0-9]/;    # matches 'item0' or ... or 'item9'
           /[0-9a-fA-F]/;  # matches a hexadecimal digit

       If '-' is the first or last character in a character class, it is
       treated as an ordinary character.

       The special character "^" in the first position of a character class
       denotes a negated character class, which matches any character but
       those in the brackets.  Both "[...]" and "[^...]" must match a
       character, or the match fails.  For example:

           /[^a]at/;  # doesn't match 'aat' or 'at', but matches
                      # all other 'bat', 'cat', '0at', '%at', etc.
           /[^0-9]/;  # matches a non-numeric character
           /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary
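
       The rules for ranges, negation, and the ordinary uses of '-' and '^'
       can be checked with a short script.  This is only a sketch; the
       variable names and sample strings are invented for the example:

```perl
use strict;
use warnings;

# Two hexadecimal digits and nothing else
my $hex_ok  = ("fF" =~ /^[0-9a-fA-F][0-9a-fA-F]$/) ? 1 : 0;   # 'f' and 'F' are both in range
my $hex_bad = ("fg" =~ /^[0-9a-fA-F][0-9a-fA-F]$/) ? 1 : 0;   # 'g' is outside a-f

# A negated class still has to consume one character
my $negated_empty = ("at"  =~ /[^a]at/) ? 1 : 0;   # nothing precedes 'at'
my $negated_other = ("%at" =~ /[^a]at/) ? 1 : 0;   # '%' is "not a"

# '-' first in the class and '^' not first are both ordinary characters
my $dash_literal  = ("a-b" =~ /a[-+]b/) ? 1 : 0;
my $caret_literal = ("^at" =~ /[a^]at/) ? 1 : 0;

print "$hex_ok $hex_bad $negated_empty $negated_other $dash_literal $caret_literal\n";
# prints "1 0 0 1 1 1"
```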

       Perl has several abbreviations for common character classes. (These
       definitions are those that Perl uses in ASCII-safe mode with the "/a"
       modifier.  Otherwise they could match many more non-ASCII Unicode
       characters as well.  See "Backslash sequences" in perlrecharclass for
       details.)

       ·   \d is a digit and represents

               [0-9]

       ·   \s is a whitespace character and represents

               [\ \t\r\n\f]

       ·   \w is a word character (alphanumeric or _) and represents

               [0-9a-zA-Z_]

       ·   \D is a negated \d; it represents any character but a digit

               [^0-9]

       ·   \S is a negated \s; it represents any non-whitespace character

               [^\s]

       ·   \W is a negated \w; it represents any non-word character

               [^\w]

       ·   The period '.' matches any character but "\n"

       The "\d\s\w\D\S\W" abbreviations can be used both inside and outside of
       character classes.  Here are some in use:

           /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
           /[\d\s]/;         # matches any digit or whitespace character
           /\w\W\w/;         # matches a word char, followed by a
                             # non-word char, followed by a word char
           /..rt/;           # matches any two chars, followed by 'rt'
           /end\./;          # matches 'end.'
           /end[.]/;         # same thing, matches 'end.'

       The word anchor "\b" matches a boundary between a word character and a
       non-word character "\w\W" or "\W\w":

           $x = "Housecat catenates house and cat";
           $x =~ /\bcat/;    # matches cat in 'catenates'
           $x =~ /cat\b/;    # matches cat in 'Housecat'
           $x =~ /\bcat\b/;  # matches 'cat' at end of string

       In the last example, the end of the string is considered a word
       boundary.
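
       As a small illustration, the abbreviations and "\b" combine with the
       anchors from earlier to build simple validity checks.  The sample
       strings below are made up for this sketch:

```perl
use strict;
use warnings;

# hh:mm:ss check built from \d; '^' and '$' keep extra characters out
my $is_time  = ("07:45:09"  =~ /^\d\d:\d\d:\d\d$/) ? 1 : 0;
my $not_time = ("7:45:09am" =~ /^\d\d:\d\d:\d\d$/) ? 1 : 0;

# \b separates the standalone word from words that merely contain it
my $whole_word = ("house and cat" =~ /\bcat\b/) ? 1 : 0;
my $embedded   = ("concatenate"   =~ /\bcat\b/) ? 1 : 0;
my $prefix     = ("catenates"     =~ /\bcat/)   ? 1 : 0;

print "$is_time $not_time $whole_word $embedded $prefix\n";   # prints "1 0 1 0 1"
```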

   Matching this or that
       We can match different character strings with the alternation
       metacharacter '|'.  To match "dog" or "cat", we form the regex
       "dog|cat".  As before, Perl will try to match the regex at the earliest
       possible point in the string.  At each character position, Perl will
       first try to match the first alternative, "dog".  If "dog" doesn't
       match, Perl will then try the next alternative, "cat".  If "cat"
       doesn't match either, then the match fails and Perl moves to the next
       position in the string.  Some examples:

           "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
           "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

       Even though "dog" is the first alternative in the second regex, "cat"
       is able to match earlier in the string.

           "cats"          =~ /c|ca|cat|cats/; # matches "c"
           "cats"          =~ /cats|cat|ca|c/; # matches "cats"

       At a given character position, the first alternative that allows the
       regex match to succeed will be the one that matches. Here, all the
       alternatives match at the first string position, so the first matches.
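
       To see which alternative actually matched, one option is Perl's
       built-in $& variable, which holds the text of the last successful
       match.  $& is not otherwise covered in this guide; the sketch below
       uses it only to display the results:

```perl
use strict;
use warnings;

my $text = "cats and dogs";
my @found;

# Each push runs only if the match succeeded, so $& is always fresh
push @found, $& if $text =~ /cat|dog|bird/;     # 'cat'
push @found, $& if $text =~ /dog|cat|bird/;     # 'cat' again: it starts earlier than 'dog'
push @found, $& if $text =~ /c|ca|cat|cats/;    # 'c': first alternative that succeeds
push @found, $& if $text =~ /cats|cat|ca|c/;    # 'cats': the longest is listed first

print join(", ", @found), "\n";                 # prints "cat, cat, c, cats"
```

       Position in the string takes priority over order in the regex; order
       among the alternatives only decides between candidates that start at
       the same position.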

   Grouping things and hierarchical matching
       The grouping metacharacters "()" allow a part of a regex to be treated
       as a single unit.  Parts of a regex are grouped by enclosing them in
       parentheses.  The regex "house(cat|keeper)" means match "house"
       followed by either "cat" or "keeper".  Some more examples are

           /(a|b)b/;    # matches 'ab' or 'bb'
           /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

           /house(cat|)/;  # matches either 'housecat' or 'house'
           /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                               # 'house'.  Note groups can be nested.

           "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                                    # because '20\d\d' can't match

   Extracting matches
       The grouping metacharacters "()" also allow the extraction of the parts
       of a string that matched.  For each grouping, the part that matched
       inside goes into the special variables $1, $2, etc.  They can be used
       just as ordinary variables:

           # extract hours, minutes, seconds
           $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
           $hours = $1;
           $minutes = $2;
           $seconds = $3;

       In list context, a match "/regex/" with groupings will return the list
       of matched values "($1,$2,...)".  So we could rewrite it as

           ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

       If the groupings in a regex are nested, $1 gets the group with the
       leftmost opening parenthesis, $2 the next opening parenthesis, etc.
       For example, here is a complex regex and the matching variables
       indicated below it:

           /(ab(cd|ef)((gi)|j))/;
            1  2      34

       Associated with the matching variables $1, $2, ... are the
       backreferences "\g1", "\g2", ...  Backreferences are matching variables
       that can be used inside a regex:

           /(\w\w\w)\s\g1/; # find sequences like 'the the' in string

       $1, $2, ... should only be used outside of a regex, and "\g1", "\g2",
       ... only inside a regex.
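
       The extraction examples above do not check whether the match
       succeeded.  A slightly more defensive sketch tests the match first,
       since after a failed match $1, $2, ... still hold the values from
       the last successful one.  The sample strings are invented for the
       illustration:

```perl
use strict;
use warnings;

my $line = "backup finished at 23:05:41";   # sample input for this sketch
my $total_seconds;

# Read $1, $2, $3 only inside the branch where the match succeeded
if ($line =~ /(\d\d):(\d\d):(\d\d)/) {
    my ($hours, $minutes, $seconds) = ($1, $2, $3);
    $total_seconds = $hours * 3600 + $minutes * 60 + $seconds;
}

# In list context a failed match returns an empty list
my @parts = ("no time here" =~ /(\d\d):(\d\d):(\d\d)/);

# \g1 must repeat exactly what group 1 captured
my ($doubled) = ("Paris in the the spring" =~ /\b(\w\w\w)\s\g1\b/);

print "$total_seconds ", scalar(@parts), " $doubled\n";   # prints "83141 0 the"
```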

   Matching repetitions
       The quantifier metacharacters "?", "*", "+", and "{}" allow us to
       determine the number of repeats of a portion of a regex we consider to
       be a match.  Quantifiers are put immediately after the character,
       character class, or grouping that we want to specify.  They have the
       following meanings:

       ·   "a?" = match 'a' 1 or 0 times

       ·   "a*" = match 'a' 0 or more times, i.e., any number of times

       ·   "a+" = match 'a' 1 or more times, i.e., at least once

       ·   "a{n,m}" = match at least "n" times, but not more than "m" times

       ·   "a{n,}" = match at least "n" times

       ·   "a{n}" = match exactly "n" times

       Here are some examples:

           /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                            # any number of digits
           /(\w+)\s+\g1/;   # match doubled words of arbitrary length
           $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                                  # than 4 digits
           $year =~ /^\d{4}$|^\d{2}$/;  # better match; throw out 3 digit dates

       These quantifiers will try to match as much of the string as possible,
       while still allowing the regex to match.  So we have

           $x = 'the cat in the hat';
           $x =~ /^(.*)(at)(.*)$/; # matches,
                                   # $1 = 'the cat in the h'
                                   # $2 = 'at'
                                   # $3 = ''   (0 matches)

       The first quantifier ".*" grabs as much of the string as possible while
       still having the regex match. The second quantifier ".*" has no string
       left to it, so it matches 0 times.
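
       The year check and the greedy match can be run together as a quick
       experiment.  This is a sketch only; the list of candidate years is
       made up:

```perl
use strict;
use warnings;

# Keep only 2-digit or 4-digit years
my @years = ("99", "1999", "199", "19999");
my @valid = grep { /^\d{4}$|^\d{2}$/ } @years;

# Greedy matching: the first .* takes everything it can while still
# leaving an 'at' for the second group
my ($before, $at, $after) = ('the cat in the hat' =~ /^(.*)(at)(.*)$/);

print "@valid | $before | $at | [$after]\n";
# prints "99 1999 | the cat in the h | at | []"
```

       Note that the first ".*" skips past the 'at' in 'cat', because a
       later 'at' still lets the whole regex succeed.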

   More matching
       There are a few more things you might want to know about matching
       operators.  The global modifier "//g" allows the matching operator to
       match within a string as many times as possible.  In scalar context,
       successive matches against a string will have "//g" jump from match to
       match, keeping track of position in the string as it goes along.  You
       can get or set the position with the "pos()" function.  For example,

           $x = "cat dog house"; # 3 words
           while ($x =~ /(\w+)/g) {
               print "Word is $1, ends at position ", pos $x, "\n";
           }

       prints

           Word is cat, ends at position 3
           Word is dog, ends at position 7
           Word is house, ends at position 13

       A failed match or changing the target string resets the position.  If
       you don't want the position reset after failure to match, add the
       "//c" modifier, as in "/regex/gc".

       In list context, "//g" returns a list of matched groupings, or if there
       are no groupings, a list of matches to the whole regex.  So

           @words = ($x =~ /(\w+)/g);  # matches,
                                       # $words[0] = 'cat'
                                       # $words[1] = 'dog'
                                       # $words[2] = 'house'
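
       The three behaviours of "//g" described above can be compared side by
       side.  The following is a minimal sketch with made-up sample strings:

```perl
use strict;
use warnings;

my $x = "cat dog house";

# Scalar context: each pass resumes where the previous match ended
my @ends;
while ($x =~ /(\w+)/g) {
    push @ends, pos($x);
}

# List context without groupings: every match of the whole regex
my @numbers = ("10 apples and 20 pears" =~ /\d+/g);

# List context with groupings: the captures of every match, flattened
my @pairs = ("a=1, b=2" =~ /(\w)=(\d)/g);

print "@ends | @numbers | @pairs\n";   # prints "3 7 13 | 10 20 | a 1 b 2"
```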

   Search and replace
       Search and replace is performed using "s/regex/replacement/modifiers".
       The "replacement" is a Perl double-quoted string that replaces in the
       string whatever is matched with the "regex".  The operator "=~" is also
       used here to associate a string with "s///".  If matching against $_,
       the "$_ =~" can be dropped.  If there is a match, "s///" returns the
       number of substitutions made; otherwise it returns false.  Here are a
       few examples:

           $x = "Time to feed the cat!";
           $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
           $y = "'quoted words'";
           $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                                  # $y contains "quoted words"

       With the "s///" operator, the matched variables $1, $2, etc. are
       immediately available for use in the replacement expression. With the
       global modifier, "s///g" will search and replace all occurrences of the
       regex in the string:

           $x = "I batted 4 for 4";
           $x =~ s/4/four/;   # $x contains "I batted four for 4"
           $x = "I batted 4 for 4";
           $x =~ s/4/four/g;  # $x contains "I batted four for four"

       The non-destructive modifier "s///r" causes the result of the
       substitution to be returned instead of modifying $_ (or whatever
       variable the substitute was bound to with "=~"):

           $x = "I like dogs.";
           $y = $x =~ s/dogs/cats/r;
           print "$x $y\n"; # prints "I like dogs. I like cats."

           $x = "Cats are great.";
           print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~ s/Frogs/Hedgehogs/r, "\n";
           # prints "Hedgehogs are great."

           @foo = map { s/[a-z]/X/r } qw(a b c 1 2 3);
           # @foo is now qw(X X X 1 2 3)

       The evaluation modifier "s///e" wraps an "eval{...}" around the
       replacement string and the evaluated result is substituted for the
       matched substring.  Some examples:

           # reverse all the words in a string
           $x = "the cat in the hat";
           $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"

           # convert percentage to decimal
           $x = "A 39% hit rate";
           $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

       The last example shows that "s///" can use other delimiters, such as
       "s!!!" and "s{}{}", and even "s{}//".  If single quotes are used
       "s'''", then the regex and replacement are treated as single-quoted
       strings.
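
       The return value of "s///" and the way the modifiers combine are
       easiest to see in a short script.  This is a sketch under the
       assumption of Perl 5.14 or later, where the "r" modifier is
       available; the sample strings are invented:

```perl
use strict;
use warnings;

# s///g returns the number of substitutions it made
my $x     = "I batted 4 for 4";
my $count = ($x =~ s/4/four/g);

# No match: the return value is false and the string is unchanged
my $y       = "no digits here";
my $changed = ($y =~ s/\d+/N/);

# /e evaluates the replacement as code, /g repeats, /r returns a copy
my $prices  = "apple 3, pear 12";
my $doubled = $prices =~ s/(\d+)/$1 * 2/ger;

print "$count | $x | $doubled | $prices\n";
# prints "2 | I batted four for four | apple 6, pear 24 | apple 3, pear 12"
```

       Because of "r", $prices keeps its original value and only $doubled
       holds the rewritten text.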

   The split operator
       "split /regex/, string" splits "string" into a list of substrings and
       returns that list.  The regex determines the character sequence that
       "string" is split with respect to.  For example, to split a string into
       words, use

           $x = "Calvin and Hobbes";
           @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                                     # $word[1] = 'and'
                                     # $word[2] = 'Hobbes'

       To extract a comma-delimited list of numbers, use

           $x = "1.618,2.718,   3.142";
           @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                       # $const[1] = '2.718'
                                       # $const[2] = '3.142'

       If the empty regex "//" is used, the string is split into individual
       characters.  If the regex has groupings, then the list produced
       contains the matched substrings from the groupings as well:

           $x = "/usr/bin";
           @parts = split m!(/)!, $x;  # $parts[0] = ''
                                       # $parts[1] = '/'
                                       # $parts[2] = 'usr'
                                       # $parts[3] = '/'
                                       # $parts[4] = 'bin'

       Since the first character of $x matched the regex, "split" prepended an
       empty initial element to the list.
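
       A few more "split" calls, written as a runnable sketch, show the same
       three cases: a plain separator, the empty regex, and a regex with a
       grouping.  The input strings are made up for the example:

```perl
use strict;
use warnings;

# Colon-separated record; the sample line is invented for this sketch
my @fields = split /:/, "root:x:0:0:root:/root:/bin/bash";

# Empty regex: one element per character
my @chars = split //, "abc";

# A grouping in the regex keeps the separators in the result
my @tokens = split /\s*([+*])\s*/, "1 + 2 * 3";

print scalar(@fields), " | @chars | @tokens\n";   # prints "7 | a b c | 1 + 2 * 3"
```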

BUGS

       None.

AUTHOR AND COPYRIGHT

       Copyright (c) 2000 Mark Kvale All rights reserved.

       This document may be distributed under the same terms as Perl itself.

   Acknowledgments
       The author would like to thank Mark-Jason Dominus, Tom Christiansen,
       Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
       comments.

perl v5.16.3                      2013-03-04                    PERLREQUICK(1)
