perlrequick(1)

PERLREQUICK(1)         Perl Programmers Reference Guide         PERLREQUICK(1)

NAME

       perlrequick - Perl regular expressions quick start

DESCRIPTION

       This page covers the very basics of understanding, creating and using
       regular expressions ('regexes') in Perl.

The Guide

   Simple word matching
       The simplest regex is simply a word, or more generally, a string of
       characters.  A regex consisting of a word matches any string that
       contains that word:

           "Hello World" =~ /World/;  # matches

       In this statement, "World" is a regex and the "//" enclosing "/World/"
       tells perl to search a string for a match.  The operator "=~"
       associates the string with the regex match and produces a true value if
       the regex matched, or false if the regex did not match.  In our case,
       "World" matches the second word in "Hello World", so the expression is
       true.  This idea has several variations.

       Expressions like this are useful in conditionals:

           print "It matches\n" if "Hello World" =~ /World/;

       The sense of the match can be reversed by using the "!~" operator:

           print "It doesn't match\n" if "Hello World" !~ /World/;

       The literal string in the regex can be replaced by a variable:

           $greeting = "World";
           print "It matches\n" if "Hello World" =~ /$greeting/;

       If you're matching against $_, the "$_ =~" part can be omitted:

           $_ = "Hello World";
           print "It matches\n" if /World/;

       Finally, the "//" default delimiters for a match can be changed to
       arbitrary delimiters by putting an 'm' out front:

           "Hello World" =~ m!World!;   # matches, delimited by '!'
           "Hello World" =~ m{World};   # matches, note the matching '{}'
           "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                        # '/' becomes an ordinary char

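       The variations above can be combined in a short script.  What follows
       is only a minimal sketch: the sample strings and the names @lines,
       $keyword, @hits and @misses are made up for illustration and are not
       part of this guide.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my @lines   = ("Hello World", "Goodbye Moon", "World peace");
my $keyword = "World";

# =~ keeps the lines that contain the keyword, !~ keeps the rest.
my @hits   = grep { $_ =~ /$keyword/ } @lines;
my @misses = grep { $_ !~ /$keyword/ } @lines;

# Inside the loop each element is aliased to $_, so "$_ =~" can be dropped.
# m{...} is used here only to show an alternative delimiter.
for (@lines) {
    print "contains '$keyword': $_\n" if m{$keyword};
}
```
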
       Regexes must match a part of the string exactly in order for the
       statement to be true:

           "Hello World" =~ /world/;  # doesn't match, case sensitive
           "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
           "Hello World" =~ /World /; # doesn't match, no ' ' at end

       perl will always match at the earliest possible point in the string:

           "Hello World" =~ /o/;       # matches 'o' in 'Hello'
           "That hat is red" =~ /hat/; # matches 'hat' in 'That'

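       One way to see the "earliest possible point" rule in action is to ask
       perl where the match landed.  The sketch below relies on the special
       arrays @- and @+, which are standard Perl but are not covered in this
       guide; the variable names $start and $end are illustrative.

```perl
use strict;
use warnings;

my $text = "That hat is red";
my ($start, $end);

if ($text =~ /hat/) {
    # $-[0] is the offset where the match starts, $+[0] where it ends.
    ($start, $end) = ($-[0], $+[0]);
    print "matched '", substr($text, $start, $end - $start),
          "' at offsets $start to $end\n";
}
```

       The reported offsets fall inside 'That', not on the separate word
       'hat' that follows it.
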
       Not all characters can be used 'as is' in a match.  Some characters,
       called metacharacters, are reserved for use in regex notation.  The
       metacharacters are

           {}[]()^$.|*+?\

       A metacharacter can be matched by putting a backslash before it:

           "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
           "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
           'C:\WIN32' =~ /C:\\WIN/;                # matches
           "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

       In the last regex, the forward slash '/' is also backslashed, because
       it is used to delimit the regex.

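       When the text to be matched sits in a variable, the backslashes can be
       added programmatically.  The sketch below uses the built-in quotemeta
       function and the \Q...\E escape, both standard Perl features that this
       guide does not otherwise discuss; the variable names are illustrative.

```perl
use strict;
use warnings;

my $literal = "2+2";                # contains the metacharacter '+'
my $escaped = quotemeta($literal);  # backslashes every non-word character

my $raw_matches     = "2+2=4" =~ /$literal/     ? 1 : 0;  # '+' acts as a quantifier
my $escaped_matches = "2+2=4" =~ /$escaped/     ? 1 : 0;  # '+' is now literal
my $quoted_matches  = "2+2=4" =~ /\Q$literal\E/ ? 1 : 0;  # same effect, inline

print "escaped form: $escaped\n";
print "raw: $raw_matches, escaped: $escaped_matches, quoted: $quoted_matches\n";
```
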
       Non-printable ASCII characters are represented by escape sequences.
       Common examples are "\t" for a tab, "\n" for a newline, and "\r" for a
       carriage return.  Arbitrary bytes are represented by octal escape
       sequences, e.g., "\033", or hexadecimal escape sequences, e.g., "\x1B":

           "1000\t2000" =~ m(0\t2);        # matches
           "cat"        =~ /\143\x61\x74/; # matches, but a weird way to spell cat

       Regexes are treated mostly as double quoted strings, so variable
       substitution works:

           $foo = 'house';
           'cathouse' =~ /cat$foo/;   # matches
           'housecat' =~ /${foo}cat/; # matches

       With all of the regexes above, if the regex matched anywhere in the
       string, it was considered a match.  To specify where it should match,
       we would use the anchor metacharacters "^" and "$".  The anchor "^"
       means match at the beginning of the string and the anchor "$" means
       match at the end of the string, or before a newline at the end of the
       string.  Some examples:

           "housekeeper" =~ /keeper/;         # matches
           "housekeeper" =~ /^keeper/;        # doesn't match
           "housekeeper" =~ /keeper$/;        # matches
           "housekeeper\n" =~ /keeper$/;      # matches
           "housekeeper" =~ /^housekeeper$/;  # matches

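       Using both anchors together turns a "contains" test into a "whole
       string" test.  A minimal sketch follows; is_whole_string is a
       hypothetical helper invented for this example, and it assumes the word
       passed in contains no metacharacters.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if $string consists of $word alone,
# optionally followed by one trailing newline.
sub is_whole_string {
    my ($string, $word) = @_;
    return $string =~ /^$word$/ ? 1 : 0;
}

my @results = (
    is_whole_string("housekeeper",   "keeper"),       # only part of the string
    is_whole_string("housekeeper",   "housekeeper"),  # the entire string
    is_whole_string("housekeeper\n", "housekeeper"),  # trailing newline allowed
);
print "@results\n";
```
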
   Using character classes
       A character class allows a set of possible characters, rather than just
       a single character, to match at a particular point in a regex.
       Character classes are denoted by brackets "[...]", with the set of
       characters to be possibly matched inside.  Here are some examples:

           /cat/;            # matches 'cat'
           /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
           "abc" =~ /[cab]/; # matches 'a'

       In the last statement, even though 'c' is the first character in the
       class, the earliest point at which the regex can match is 'a'.

           /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                           # 'yes', 'Yes', 'YES', etc.
           /yes/i;         # also match 'yes' in a case-insensitive way

       The last example shows a match with an 'i' modifier, which makes the
       match case-insensitive.

       Character classes also have ordinary and special characters, but the
       sets of ordinary and special characters inside a character class are
       different from those outside a character class.  The special characters
       for a character class are "-]\^$" and are matched using an escape:

           /[\]c]def/; # matches ']def' or 'cdef'
           $x = 'bcr';
           /[$x]at/;   # matches 'bat', 'cat', or 'rat'
           /[\$x]at/;  # matches '$at' or 'xat'
           /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

       The special character '-' acts as a range operator within character
       classes, so that the unwieldy "[0123456789]" and "[abc...xyz]" become
       the svelte "[0-9]" and "[a-z]":

           /item[0-9]/;    # matches 'item0' or ... or 'item9'
           /[0-9a-fA-F]/;  # matches a hexadecimal digit

       If '-' is the first or last character in a character class, it is
       treated as an ordinary character.

       The special character "^" in the first position of a character class
       denotes a negated character class, which matches any character but
       those in the brackets.  Both "[...]" and "[^...]" must match a
       character, or the match fails.  For example:

           /[^a]at/;  # doesn't match 'aat' or 'at', but matches
                      # all other 'bat', 'cat', '0at', '%at', etc.
           /[^0-9]/;  # matches a non-numeric character
           /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

       Perl has several abbreviations for common character classes:

       ·   \d is a digit and represents

               [0-9]

       ·   \s is a whitespace character and represents

               [\ \t\r\n\f]

       ·   \w is a word character (alphanumeric or _) and represents

               [0-9a-zA-Z_]

       ·   \D is a negated \d; it represents any character but a digit

               [^0-9]

       ·   \S is a negated \s; it represents any non-whitespace character

               [^\s]

       ·   \W is a negated \w; it represents any non-word character

               [^\w]

       ·   The period '.' matches any character but "\n"

       The "\d\s\w\D\S\W" abbreviations can be used both inside and outside of
       character classes.  Here are some in use:

           /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
           /[\d\s]/;         # matches any digit or whitespace character
           /\w\W\w/;         # matches a word char, followed by a
                             # non-word char, followed by a word char
           /..rt/;           # matches any two chars, followed by 'rt'
           /end\./;          # matches 'end.'
           /end[.]/;         # same thing, matches 'end.'

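       The class abbreviations and ranges can be tried out against a handful
       of strings.  This is a small sketch only; the candidate strings and the
       names @candidates, @verdicts and $has_hex are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical inputs, chosen to exercise the classes described above.
my @candidates = ("12:34:56", "1:2:3", "ab:cd:ef", "at 09:15:00 sharp");
my @verdicts;

for my $s (@candidates) {
    # The regex is unanchored, so it may match anywhere in the string.
    push @verdicts, ($s =~ /\d\d:\d\d:\d\d/ ? "yes" : "no");
}
print "@verdicts\n";

# A class built from ranges: does the string contain a hexadecimal digit?
my $has_hex = "xyz7" =~ /[0-9a-fA-F]/ ? 1 : 0;
print "hex digit present: $has_hex\n";
```
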
       The word anchor "\b" matches a boundary between a word character and a
       non-word character "\w\W" or "\W\w":

           $x = "Housecat catenates house and cat";
           $x =~ /\bcat/;  # matches cat in 'catenates'
           $x =~ /cat\b/;  # matches cat in 'Housecat'
           $x =~ /\bcat\b/;  # matches 'cat' at end of string

       In the last example, the end of the string is considered a word
       boundary.

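       The three "\b" examples can be checked by recording where each regex
       first matches.  The sketch assumes the special array @- (standard
       Perl, not described in this guide), whose first element holds the
       start offset of the last successful match; the scalar names are
       illustrative.

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";

my ($start_of_word, $end_of_word, $whole_word);
$start_of_word = $-[0] if $x =~ /\bcat/;    # 'cat' beginning 'catenates'
$end_of_word   = $-[0] if $x =~ /cat\b/;    # 'cat' ending 'Housecat'
$whole_word    = $-[0] if $x =~ /\bcat\b/;  # the final, standalone 'cat'

print "$start_of_word $end_of_word $whole_word\n";
```
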
   Matching this or that
       We can match different character strings with the alternation
       metacharacter '|'.  To match "dog" or "cat", we form the regex
       "dog|cat".  As before, perl will try to match the regex at the earliest
       possible point in the string.  At each character position, perl will
       first try to match the first alternative, "dog".  If "dog" doesn't
       match, perl will then try the next alternative, "cat".  If "cat"
       doesn't match either, then the match fails and perl moves to the next
       position in the string.  Some examples:

           "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
           "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

       Even though "dog" is the first alternative in the second regex, "cat"
       is able to match earlier in the string.

           "cats"          =~ /c|ca|cat|cats/; # matches "c"
           "cats"          =~ /cats|cat|ca|c/; # matches "cats"

       At a given character position, the first alternative that allows the
       regex match to succeed will be the one that matches. Here, all the
       alternatives match at the first string position, so the first matches.

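       To observe which alternative actually matched, the alternation can be
       wrapped in parentheses and the captured text inspected.  This sketch
       assumes the capturing behavior of "()" in list context; the variable
       names are illustrative.

```perl
use strict;
use warnings;

my $s = "cats";

# The parentheses only capture what the alternation matched.
my ($short_first) = $s =~ /(c|ca|cat|cats)/;
my ($long_first)  = $s =~ /(cats|cat|ca|c)/;

# Position beats order: 'cat' occurs earlier in the string than 'dog'.
my ($earliest) = "cats and dogs" =~ /(dog|cat|bird)/;

print "$short_first $long_first $earliest\n";
```
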
   Grouping things and hierarchical matching
       The grouping metacharacters "()" allow a part of a regex to be treated
       as a single unit.  Parts of a regex are grouped by enclosing them in
       parentheses.  The regex "house(cat|keeper)" means match "house"
       followed by either "cat" or "keeper".  Some more examples are

           /(a|b)b/;    # matches 'ab' or 'bb'
           /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

           /house(cat|)/;  # matches either 'housecat' or 'house'
           /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                               # 'house'.  Note groups can be nested.

           "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                                    # because '20\d\d' can't match

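       Nested groups with empty alternatives are easier to follow when the
       regex is anchored and run over a few words.  A minimal sketch, with
       sample words and the name @verdicts chosen purely for illustration:

```perl
use strict;
use warnings;

my @verdicts;
for my $word ("house", "housecat", "housecats", "houseboat") {
    # The anchors force the whole word to fit the nested, optional groups.
    push @verdicts, ($word =~ /^house(cat(s|)|)$/ ? "match" : "no match");
}
print join(", ", @verdicts), "\n";
```

       Without the anchors, 'houseboat' would also match, because the empty
       alternative lets the regex succeed on the leading 'house'.
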
   Extracting matches
       The grouping metacharacters "()" also allow the extraction of the parts
       of a string that matched.  For each grouping, the part that matched
       inside goes into the special variables $1, $2, etc.  They can be used
       just as ordinary variables:

           # extract hours, minutes, seconds
           $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
           $hours = $1;
           $minutes = $2;
           $seconds = $3;

       In list context, a match "/regex/" with groupings will return the list
       of matched values "($1,$2,...)".  So we could rewrite it as

           ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

       If the groupings in a regex are nested, $1 gets the group with the
       leftmost opening parenthesis, $2 the next opening parenthesis, etc.
       For example, here is a complex regex and the matching variables
       indicated below it:

           /(ab(cd|ef)((gi)|j))/;
            1  2      34

       Associated with the matching variables $1, $2, ... are the
       backreferences "\1", "\2", ...  Backreferences are matching variables
       that can be used inside a regex:

           /(\w\w\w)\s\1/; # find sequences like 'the the' in string

       $1, $2, ... should only be used outside of a regex, and "\1", "\2", ...
       only inside a regex.

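       In a real program it is usually worth checking that the match
       succeeded before reading $1, $2, and so on, since after a failed match
       they still hold values from an earlier success.  The following sketch
       shows that pattern together with a backreference; the input strings
       and variable names are invented for the example.

```perl
use strict;
use warnings;

my $time = "09:41:07";   # hypothetical input

# Only read $1, $2, $3 when the match succeeded.
my ($hours, $minutes, $seconds);
if ($time =~ /^(\d\d):(\d\d):(\d\d)$/) {
    ($hours, $minutes, $seconds) = ($1, $2, $3);
    print "h=$hours m=$minutes s=$seconds\n";
}

# \1 inside the regex refers back to whatever group 1 matched.
my $text = "Paris in the the spring";
my $doubled;
$doubled = $1 if $text =~ /\b(\w+)\s+\1\b/;
print "doubled word: $doubled\n" if defined $doubled;
```
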
   Matching repetitions
       The quantifier metacharacters "?", "*", "+", and "{}" allow us to
       determine the number of repeats of a portion of a regex we consider to
       be a match.  Quantifiers are put immediately after the character,
       character class, or grouping that we want to specify.  They have the
       following meanings:

       ·   "a?" = match 'a' 1 or 0 times

       ·   "a*" = match 'a' 0 or more times, i.e., any number of times

       ·   "a+" = match 'a' 1 or more times, i.e., at least once

       ·   "a{n,m}" = match at least "n" times, but not more than "m" times

       ·   "a{n,}" = match at least "n" times

       ·   "a{n}" = match exactly "n" times

       Here are some examples:

           /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                            # any number of digits
           /(\w+)\s+\1/;    # match doubled words of arbitrary length
           $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                                  # than 4 digits
           $year =~ /^\d{4}$|^\d{2}$/;  # better match; throw out 3 digit dates

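       A quantifier only constrains the part of the string that the regex
       actually matches, so checking a whole value generally needs the
       anchors "^" and "$" as well.  A minimal sketch follows; is_valid_year
       is a hypothetical helper and the test values are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: accept a year written with exactly 2 or 4 digits.
sub is_valid_year {
    my ($year) = @_;
    return $year =~ /^(\d{4}|\d{2})$/ ? 1 : 0;
}

my @inputs   = ("99", "1999", "123", "19999", "x99");
my @verdicts = map { is_valid_year($_) } @inputs;
print "@verdicts\n";
```
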
       These quantifiers will try to match as much of the string as possible,
       while still allowing the regex to match.  So we have

           $x = 'the cat in the hat';
           $x =~ /^(.*)(at)(.*)$/; # matches,
                                   # $1 = 'the cat in the h'
                                   # $2 = 'at'
                                   # $3 = ''   (0 matches)

       The first quantifier ".*" grabs as much of the string as possible while
       still having the regex match. The second quantifier ".*" has no string
       left to it, so it matches 0 times.

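       The greedy behavior can be confirmed by capturing the groups in list
       context and printing them.  This is a sketch only; @parts, $word and
       $rest are illustrative names.

```perl
use strict;
use warnings;

my $x = 'the cat in the hat';

# In list context the match returns the three groupings.
my @parts = $x =~ /^(.*)(at)(.*)$/;
print "[$_]\n" for @parts;

# \w+ is greedy too, but it can only extend as far as word characters go.
my ($word, $rest) = $x =~ /^(\w+)\s+(.*)$/;
print "first word: $word, rest: $rest\n";
```
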
   More matching
       There are a few more things you might want to know about matching
       operators.  In the code

           $pattern = 'Seuss';
           while (<>) {
               print if /$pattern/;
           }

       perl has to re-evaluate $pattern each time through the loop.  If
       $pattern won't be changing, use the "//o" modifier, to only perform
       variable substitutions once.  If you don't want any substitutions at
       all, use the special delimiter "m''":

           @pattern = ('Seuss');
           m/@pattern/; # matches 'Seuss'
           m'@pattern'; # matches the literal string '@pattern'

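       The difference between the two delimiters shows up when both regexes
       are run against the same strings.  A small sketch, with test strings
       and result names invented for the example:

```perl
use strict;
use warnings;

my @pattern = ('Seuss');

# m/.../ interpolates @pattern, so the regex becomes /Seuss/.
my $interpolated = 'Dr. Seuss' =~ m/@pattern/ ? 1 : 0;

# m'...' suppresses interpolation: the regex is the literal text '@pattern'.
my $literal_miss = 'Dr. Seuss'        =~ m'@pattern' ? 1 : 0;
my $literal_hit  = 'see @pattern now' =~ m'@pattern' ? 1 : 0;

print "$interpolated $literal_miss $literal_hit\n";
```
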
       The global modifier "//g" allows the matching operator to match within
       a string as many times as possible.  In scalar context, successive
       matches against a string will have "//g" jump from match to match,
       keeping track of position in the string as it goes along.  You can get
       or set the position with the "pos()" function.  For example,

           $x = "cat dog house"; # 3 words
           while ($x =~ /(\w+)/g) {
               print "Word is $1, ends at position ", pos $x, "\n";
           }

       prints

           Word is cat, ends at position 3
           Word is dog, ends at position 7
           Word is house, ends at position 13

       A failed match or changing the target string resets the position.  If
       you don't want the position reset after failure to match, add the
       "//c" modifier, as in "/regex/gc".

       In list context, "//g" returns a list of matched groupings, or if there
       are no groupings, a list of matches to the whole regex.  So

           @words = ($x =~ /(\w+)/g);  # matches,
                                       # $words[0] = 'cat'
                                       # $words[1] = 'dog'
                                       # $words[2] = 'house'

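       The scalar and list behaviors of "//g" can be compared side by side.
       The sketch below also feeds a two-group match into a hash, which works
       because the groupings come back as one flat list; the "width=80
       height=24" string and the names %setting and @ends are hypothetical.

```perl
use strict;
use warnings;

my $x = "cat dog house";

# List context, one grouping: every match of the group is returned.
my @words = $x =~ /(\w+)/g;

# List context, two groupings: the pairs come back as one flat list,
# which is convenient for filling a hash.
my %setting = "width=80 height=24" =~ /(\w+)=(\d+)/g;

# Scalar context: walk the matches one at a time and record pos().
my @ends;
while ($x =~ /\w+/g) {
    push @ends, pos($x);
}

print "@words | @ends | $setting{width}x$setting{height}\n";
```
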
   Search and replace
       Search and replace is performed using "s/regex/replacement/modifiers".
       The "replacement" is a Perl double quoted string that replaces in the
       string whatever is matched with the "regex".  The operator "=~" is also
       used here to associate a string with "s///".  If matching against $_,
       the "$_ =~" can be dropped.  If there is a match, "s///" returns the
       number of substitutions made, otherwise it returns false.  Here are a
       few examples:

           $x = "Time to feed the cat!";
           $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
           $y = "'quoted words'";
           $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                                  # $y contains "quoted words"

       With the "s///" operator, the matched variables $1, $2, etc. are
       immediately available for use in the replacement expression. With the
       global modifier, "s///g" will search and replace all occurrences of the
       regex in the string:

           $x = "I batted 4 for 4";
           $x =~ s/4/four/;   # $x contains "I batted four for 4"
           $x = "I batted 4 for 4";
           $x =~ s/4/four/g;  # $x contains "I batted four for four"

       The evaluation modifier "s///e" wraps an "eval{...}" around the
       replacement string and the evaluated result is substituted for the
       matched substring.  Some examples:

           # reverse all the words in a string
           $x = "the cat in the hat";
           $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"

           # convert percentage to decimal
           $x = "A 39% hit rate";
           $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

       The last example shows that "s///" can use other delimiters, such as
       "s!!!" and "s{}{}", and even "s{}//".  If single quotes are used
       "s'''", then the regex and replacement are treated as single quoted
       strings.

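       Two practical points about "s///" are its return value and the fact
       that it modifies its target in place.  The sketch below uses the
       count returned by "s///g" and a copy-then-substitute idiom to keep the
       original string intact; $prices and $doubled are hypothetical names.

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";
my $count = ($x =~ s/4/four/g);   # s///g returns the number of replacements
print "$count replacements: $x\n";

# Copy first, then substitute, so that $prices itself is left alone.
my $prices = "apple 3, pear 5";
(my $doubled = $prices) =~ s/(\d+)/$1 * 2/ge;
print "$prices -> $doubled\n";
```
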
   The split operator
       "split /regex/, string" splits "string" into a list of substrings and
       returns that list.  The regex determines the character sequence that
       "string" is split with respect to.  For example, to split a string into
       words, use

           $x = "Calvin and Hobbes";
           @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                                     # $word[1] = 'and'
                                     # $word[2] = 'Hobbes'

       To extract a comma-delimited list of numbers, use

           $x = "1.618,2.718,   3.142";
           @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                       # $const[1] = '2.718'
                                       # $const[2] = '3.142'

       If the empty regex "//" is used, the string is split into individual
       characters.  If the regex has groupings, then the list produced
       contains the matched substrings from the groupings as well:

           $x = "/usr/bin";
           @parts = split m!(/)!, $x;  # $parts[0] = ''
                                       # $parts[1] = '/'
                                       # $parts[2] = 'usr'
                                       # $parts[3] = '/'
                                       # $parts[4] = 'bin'

       Since the first character of $x matched the regex, "split" prepended an
       empty initial element to the list.

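       The three flavors of "split" described here can be exercised in one
       short script.  This is an illustrative sketch, not a complete
       treatment of "split"; the array names follow the examples above.

```perl
use strict;
use warnings;

# Separator with optional whitespace after each comma.
my @const = split /,\s*/, "1.618,2.718,   3.142";

# Empty regex: one element per character.
my @chars = split //, "abc";

# Grouping in the regex: the separators are kept in the result.
my @parts = split m!(/)!, "/usr/bin";

print scalar(@const), " constants, ", scalar(@chars), " chars, ",
      scalar(@parts), " parts\n";
print "last constant: $const[2]\n";
```
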

BUGS

       None.

AUTHOR AND COPYRIGHT

       Copyright (c) 2000 Mark Kvale.  All rights reserved.

       This document may be distributed under the same terms as Perl itself.

   Acknowledgments
       The author would like to thank Mark-Jason Dominus, Tom Christiansen,
       Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
       comments.

perl v5.10.1                      2009-02-12                    PERLREQUICK(1)