perlrequick(1)

PERLREQUICK(1)         Perl Programmers Reference Guide         PERLREQUICK(1)

NAME

       perlrequick - Perl regular expressions quick start

DESCRIPTION

       This page covers the very basics of understanding, creating and using
       regular expressions ('regexes') in Perl.

The Guide

       This page assumes you already know what a "pattern" is and the basic
       syntax for using one.  If you don't, see perlretut.

   Simple word matching
       The simplest regex is simply a word, or more generally, a string of
       characters.  A regex consisting of a word matches any string that
       contains that word:

           "Hello World" =~ /World/;  # matches

       In this statement, "World" is a regex and the "//" enclosing "/World/"
       tells Perl to search a string for a match.  The operator "=~"
       associates the string with the regex match and produces a true value if
       the regex matched, or false if the regex did not match.  In our case,
       "World" matches the second word in "Hello World", so the expression is
       true.  This idea has several variations.

       Expressions like this are useful in conditionals:

           print "It matches\n" if "Hello World" =~ /World/;

       The sense of the match can be reversed by using the "!~" operator:

           print "It doesn't match\n" if "Hello World" !~ /World/;

       The literal string in the regex can be replaced by a variable:

           $greeting = "World";
           print "It matches\n" if "Hello World" =~ /$greeting/;

       If you're matching against $_, the "$_ =~" part can be omitted:

           $_ = "Hello World";
           print "It matches\n" if /World/;

       Finally, the "//" default delimiters for a match can be changed to
       arbitrary delimiters by putting an 'm' out front:

           "Hello World" =~ m!World!;   # matches, delimited by '!'
           "Hello World" =~ m{World};   # matches, note the matching '{}'
           "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                        # '/' becomes an ordinary char
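
       As a minimal sketch, the variations described so far can be collected
       into one small script.  The names $text, $word and @results are
       illustrative only and do not come from the examples above:

```perl
use strict;
use warnings;

my $text = "Hello World";
my $word = "World";           # illustrative variable name

my @results;
push @results, "literal"    if $text =~ /World/;    # plain word match
push @results, "negated"    if $text !~ /Planet/;   # true because no match
push @results, "variable"   if $text =~ /$word/;    # regex from a variable
push @results, "alt-delims" if $text =~ m{World};   # 'm' with other delimiters

for ($text) {                 # aliases $_ to $text
    push @results, "default" if /World/;            # implicit match on $_
}

print join(", ", @results), "\n";
# prints "literal, negated, variable, alt-delims, default"
```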

       Regexes must match a part of the string exactly in order for the
       statement to be true:

           "Hello World" =~ /world/;  # doesn't match, case sensitive
           "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
           "Hello World" =~ /World /; # doesn't match, no ' ' at end

       Perl will always match at the earliest possible point in the string:

           "Hello World" =~ /o/;       # matches 'o' in 'Hello'
           "That hat is red" =~ /hat/; # matches 'hat' in 'That'

       Not all characters can be used 'as is' in a match.  Some characters,
       called metacharacters, are reserved for use in regex notation.  The
       metacharacters are

           {}[]()^$.|*+?\

       A metacharacter can be matched by putting a backslash before it:

           "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
           "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
           'C:\WIN32' =~ /C:\\WIN/;                # matches
           "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

       In the last regex, the forward slash '/' is also backslashed, because
       it is used to delimit the regex.
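
       The following sketch shows why the unescaped "+" fails on "2+2=4":
       it is read as a quantifier, so the regex looks for a run of '2's
       followed by another '2'.  The variable names are illustrative, and
       the "222" test string is an addition used only to make that reading
       visible:

```perl
use strict;
use warnings;

my $sum = "2+2=4";

# Unescaped '+' is a quantifier: /2+2/ means one or more '2', then a '2'.
my $unescaped = $sum  =~ /2+2/  ? 1 : 0;   # 0: "2+2=4" contains no "22"
my $escaped   = $sum  =~ /2\+2/ ? 1 : 0;   # 1: '\+' is a literal plus sign
my $twos      = "222" =~ /2+2/  ? 1 : 0;   # 1: the quantified form matches here

# Picking another delimiter avoids backslashing every '/'.
my $path_ok = "/usr/bin/perl" =~ m{/usr/bin/perl} ? 1 : 0;

print "$unescaped $escaped $twos $path_ok\n";   # prints "0 1 1 1"
```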

       Non-printable ASCII characters are represented by escape sequences.
       Common examples are "\t" for a tab, "\n" for a newline, and "\r" for a
       carriage return.  Arbitrary bytes are represented by octal escape
       sequences, e.g., "\033", or hexadecimal escape sequences, e.g., "\x1B":

           "1000\t2000" =~ m(0\t2);  # matches
           "cat" =~ /\143\x61\x74/;  # matches in ASCII, but
                                     # a weird way to spell cat

       Regexes are treated mostly as double-quoted strings, so variable
       substitution works:

           $foo = 'house';
           'cathouse' =~ /cat$foo/;   # matches
           'housecat' =~ /${foo}cat/; # matches

       With all of the regexes above, if the regex matched anywhere in the
       string, it was considered a match.  To specify where it should match,
       we would use the anchor metacharacters "^" and "$".  The anchor "^"
       means match at the beginning of the string and the anchor "$" means
       match at the end of the string, or before a newline at the end of the
       string.  Some examples:

           "housekeeper" =~ /keeper/;         # matches
           "housekeeper" =~ /^keeper/;        # doesn't match
           "housekeeper" =~ /keeper$/;        # matches
           "housekeeper\n" =~ /keeper$/;      # matches
           "housekeeper" =~ /^housekeeper$/;  # matches
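
       A short sketch of the two anchors applied to a list of strings.  The
       sample strings and the names @lines and @report are assumptions made
       for illustration:

```perl
use strict;
use warnings;

my @lines = ("housekeeper", "housekeeper\n", "keeper of the house");

my @report;
for my $line (@lines) {
    my $start = $line =~ /^keeper/ ? "start" : "-";   # anchored at beginning
    my $end   = $line =~ /keeper$/ ? "end"   : "-";   # anchored at end, or
                                                      # before a final newline
    push @report, "$start/$end";
}

print join(" ", @report), "\n";   # prints "-/end -/end start/-"
```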

   Using character classes
       A character class allows a set of possible characters, rather than just
       a single character, to match at a particular point in a regex.
       Character classes are denoted by brackets "[...]", with the set of
       characters to be possibly matched inside.  Here are some examples:

           /cat/;            # matches 'cat'
           /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
           "abc" =~ /[cab]/; # matches 'a'

       In the last statement, even though 'c' is the first character in the
       class, the earliest point at which the regex can match is 'a'.

           /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                           # 'yes', 'Yes', 'YES', etc.
           /yes/i;         # also match 'yes' in a case-insensitive way

       The last example shows a match with an 'i' modifier, which makes the
       match case-insensitive.

       Character classes also have ordinary and special characters, but the
       sets of ordinary and special characters inside a character class are
       different from those outside a character class.  The special characters
       for a character class are "-]\^$" and are matched using an escape:

           /[\]c]def/; # matches ']def' or 'cdef'
           $x = 'bcr';
           /[$x]at/;   # matches 'bat', 'cat', or 'rat'
           /[\$x]at/;  # matches '$at' or 'xat'
           /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

       The special character '-' acts as a range operator within character
       classes, so that the unwieldy "[0123456789]" and "[abc...xyz]" become
       the svelte "[0-9]" and "[a-z]":

           /item[0-9]/;    # matches 'item0' or ... or 'item9'
           /[0-9a-fA-F]/;  # matches a hexadecimal digit

       If '-' is the first or last character in a character class, it is
       treated as an ordinary character.

       The special character "^" in the first position of a character class
       denotes a negated character class, which matches any character but
       those in the brackets.  Both "[...]" and "[^...]" must match a
       character, or the match fails.  For example:

           /[^a]at/;  # doesn't match 'aat' or 'at', but matches
                      # all other 'bat', 'cat', '0at', '%at', etc.
           /[^0-9]/;  # matches a non-numeric character
           /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary
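
       To see these classes at work, the sketch below filters short word
       lists with "grep".  The word lists and array names are illustrative,
       and the "^" and "$" anchors are added so that each regex must cover
       the whole word:

```perl
use strict;
use warnings;

my @words = qw(bat cat rat aat at ^at);

my @bcr   = grep { /^[bcr]at$/ } @words;   # one of b, c, r, then 'at'
my @not_a = grep { /^[^a]at$/  } @words;   # any char except 'a', then 'at'
my @caret = grep { /^[a^]at$/  } @words;   # 'a' or a literal '^', then 'at'

# Two hexadecimal digits, using ranges inside the class.
my @hex = grep { /^[0-9a-fA-F][0-9a-fA-F]$/ } qw(ff 7E xy 10);

print "@bcr\n";     # prints "bat cat rat"
print "@not_a\n";   # prints "bat cat rat ^at"
print "@caret\n";   # prints "aat ^at"
print "@hex\n";     # prints "ff 7E 10"
```

       Note that 'at' is rejected by "[^a]at" because a negated class must
       still consume one character.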

       Perl has several abbreviations for common character classes. (These
       definitions are those that Perl uses in ASCII-safe mode with the "/a"
       modifier.  Otherwise they could match many more non-ASCII Unicode
       characters as well.  See "Backslash sequences" in perlrecharclass for
       details.)

       ·   \d is a digit and represents

               [0-9]

       ·   \s is a whitespace character and represents

               [\ \t\r\n\f]

       ·   \w is a word character (alphanumeric or _) and represents

               [0-9a-zA-Z_]

       ·   \D is a negated \d; it represents any character but a digit

               [^0-9]

       ·   \S is a negated \s; it represents any non-whitespace character

               [^\s]

       ·   \W is a negated \w; it represents any non-word character

               [^\w]

       ·   The period '.' matches any character but "\n"

       The "\d\s\w\D\S\W" abbreviations can be used both inside and outside of
       character classes.  Here are some in use:

           /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
           /[\d\s]/;         # matches any digit or whitespace character
           /\w\W\w/;         # matches a word char, followed by a
                             # non-word char, followed by a word char
           /..rt/;           # matches any two chars, followed by 'rt'
           /end\./;          # matches 'end.'
           /end[.]/;         # same thing, matches 'end.'
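
       The difference between the wildcard '.' and a literal period is easy
       to miss, so here is a small sketch that contrasts them alongside a
       "\d" check.  The test strings and variable names are assumptions
       chosen for illustration:

```perl
use strict;
use warnings;

my $time_like = "23:59:07" =~ /^\d\d:\d\d:\d\d$/ ? 1 : 0;   # 1: hh:mm:ss shape
my $not_time  = "23-59-07" =~ /^\d\d:\d\d:\d\d$/ ? 1 : 0;   # 0: '-' is not ':'

my $any_char  = "end!" =~ /end./   ? 1 : 0;   # 1: '.' matches any character
my $literal   = "end!" =~ /end\./  ? 1 : 0;   # 0: '\.' needs a real period
my $class_dot = "end." =~ /end[.]/ ? 1 : 0;   # 1: '.' is ordinary in a class

print "$time_like $not_time $any_char $literal $class_dot\n";
# prints "1 0 1 0 1"
```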

       The word anchor "\b" matches a boundary between a word character and a
       non-word character "\w\W" or "\W\w":

           $x = "Housecat catenates house and cat";
           $x =~ /\bcat/;    # matches cat in 'catenates'
           $x =~ /cat\b/;    # matches cat in 'Housecat'
           $x =~ /\bcat\b/;  # matches 'cat' at end of string

       In the last example, the end of the string is considered a word
       boundary.
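
       One way to confirm which 'cat' each regex picks is to look at the
       offset where the match starts.  The sketch below uses Perl's built-in
       "@-" array (whose first element holds the start of the last
       successful match) and precompiled "qr//" patterns; neither is
       introduced on this page, and the helper "match_start" is
       hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: offset where $re first matches in $str, or -1.
sub match_start {
    my ($str, $re) = @_;
    return $str =~ $re ? $-[0] : -1;
}

my $x = "Housecat catenates house and cat";

my $prefix = match_start($x, qr/\bcat/);     # 'cat' opening 'catenates'
my $suffix = match_start($x, qr/cat\b/);     # 'cat' closing 'Housecat'
my $alone  = match_start($x, qr/\bcat\b/);   # the standalone 'cat'

print "$prefix $suffix $alone\n";   # prints "9 5 29"
```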

       For natural language processing (so that, for example, apostrophes are
       included in words), use instead "\b{wb}":

           "don't" =~ / .+? \b{wb} /x;  # matches the whole string

   Matching this or that
       We can match different character strings with the alternation
       metacharacter '|'.  To match "dog" or "cat", we form the regex
       "dog|cat".  As before, Perl will try to match the regex at the earliest
       possible point in the string.  At each character position, Perl will
       first try to match the first alternative, "dog".  If "dog" doesn't
       match, Perl will then try the next alternative, "cat".  If "cat"
       doesn't match either, then the match fails and Perl moves to the next
       position in the string.  Some examples:

           "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
           "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

       Even though "dog" is the first alternative in the second regex, "cat"
       is able to match earlier in the string.

           "cats"          =~ /c|ca|cat|cats/; # matches "c"
           "cats"          =~ /cats|cat|ca|c/; # matches "cats"

       At a given character position, the first alternative that allows the
       regex match to succeed will be the one that matches. Here, all the
       alternatives match at the first string position, so the first matches.
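
       The ordering rules can be checked by printing the text that actually
       matched.  This sketch relies on the built-in variable "$&" (the text
       of the last successful match), which is not covered on this page;
       the helper "matched_text" is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: the text matched by $re in $str, or undef.
sub matched_text {
    my ($str, $re) = @_;
    return $str =~ $re ? $& : undef;
}

# 'cat' wins because it can match at an earlier position than 'dog'.
my $earliest = matched_text("cats and dogs", qr/dog|cat|bird/);

# At the same position, the first alternative that succeeds wins.
my $short = matched_text("cats", qr/c|ca|cat|cats/);
my $long  = matched_text("cats", qr/cats|cat|ca|c/);

print "$earliest $short $long\n";   # prints "cat c cats"
```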

   Grouping things and hierarchical matching
       The grouping metacharacters "()" allow a part of a regex to be treated
       as a single unit.  Parts of a regex are grouped by enclosing them in
       parentheses.  The regex "house(cat|keeper)" means match "house"
       followed by either "cat" or "keeper".  Some more examples are

           /(a|b)b/;    # matches 'ab' or 'bb'
           /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

           /house(cat|)/;  # matches either 'housecat' or 'house'
           /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                               # 'house'.  Note groups can be nested.

           "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                                    # because '20\d\d' can't match
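
       A sketch of nested groups and the null alternative in use.  The
       candidate words and variable names are illustrative, "^" and "$"
       anchors are added so the whole string must match, and the list
       assignment captures what the first group matched:

```perl
use strict;
use warnings;

my @candidates = qw(house housecat housecats housedog);
my @accepted   = grep { /^house(cat(s|)|)$/ } @candidates;
print "@accepted\n";   # prints "house housecat housecats"

# With four digits the group takes '20'; with two digits only the
# empty alternative lets the rest of the regex succeed.
my ($century) = "2024" =~ /^(19|20|)\d\d$/;
my ($empty)   = "24"   =~ /^(19|20|)\d\d$/;
print "[$century] [$empty]\n";   # prints "[20] []"
```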

   Extracting matches
       The grouping metacharacters "()" also allow the extraction of the parts
       of a string that matched.  For each grouping, the part that matched
       inside goes into the special variables $1, $2, etc.  They can be used
       just as ordinary variables:

           # extract hours, minutes, seconds
           $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
           $hours = $1;
           $minutes = $2;
           $seconds = $3;

       In list context, a match "/regex/" with groupings will return the list
       of matched values "($1,$2,...)".  So we could rewrite it as

           ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

       If the groupings in a regex are nested, $1 gets the group with the
       leftmost opening parenthesis, $2 the next opening parenthesis, etc.
       For example, here is a complex regex and the matching variables
       indicated below it:

           /(ab(cd|ef)((gi)|j))/;
            1  2      34

       Associated with the matching variables $1, $2, ... are the
       backreferences "\g1", "\g2", ...  Backreferences are matching variables
       that can be used inside a regex:

           /(\w\w\w)\s\g1/; # find sequences like 'the the' in string

       $1, $2, ... should only be used outside of a regex, and "\g1", "\g2",
       ... only inside a regex.
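
       The sketch below exercises the three ideas in this section: list
       context extraction, group numbering in a nested regex, and a
       backreference.  All input strings and variable names are assumptions
       chosen for illustration:

```perl
use strict;
use warnings;

# List context returns the captured groups in order.
my $time = "12:34:56";
my ($hours, $minutes, $seconds) = $time =~ /(\d\d):(\d\d):(\d\d)/;
print "$hours h, $minutes m, $seconds s\n";   # prints "12 h, 34 m, 56 s"

# Groups are numbered by the position of their opening parenthesis.
my @groups = "abcdgi" =~ /(ab(cd|ef)((gi)|j))/;
print "@groups\n";                            # prints "abcdgi cd gi gi"

# \g1 must repeat exactly what group 1 matched.
my ($doubled) = "Paris in the the spring" =~ /(\w\w\w)\s\g1/;
print "$doubled\n";                           # prints "the"
```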

   Matching repetitions
       The quantifier metacharacters "?", "*", "+", and "{}" allow us to
       determine the number of repeats of a portion of a regex we consider to
       be a match.  Quantifiers are put immediately after the character,
       character class, or grouping that we want to specify.  They have the
       following meanings:

       ·   "a?" = match 'a' 1 or 0 times

       ·   "a*" = match 'a' 0 or more times, i.e., any number of times

       ·   "a+" = match 'a' 1 or more times, i.e., at least once

       ·   "a{n,m}" = match at least "n" times, but not more than "m" times.

       ·   "a{n,}" = match at least "n" times

       ·   "a{n}" = match exactly "n" times

       Here are some examples:

           /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                            # any number of digits
           /(\w+)\s+\g1/;    # match doubled words of arbitrary length
           $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                                  # than 4 digits
           $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3 digit dates

       These quantifiers will try to match as much of the string as possible,
       while still allowing the regex to match.  So we have

           $x = 'the cat in the hat';
           $x =~ /^(.*)(at)(.*)$/; # matches,
                                   # $1 = 'the cat in the h'
                                   # $2 = 'at'
                                   # $3 = ''   (0 matches)

       The first quantifier ".*" grabs as much of the string as possible while
       still having the regex match. The second quantifier ".*" has no string
       left to it, so it matches 0 times.
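
       A runnable sketch of the quantifier examples, with illustrative
       inputs and variable names that are not part of the text above:

```perl
use strict;
use warnings;

# Accept only 2-digit or 4-digit years.
my @years = grep { /^\d{4}$|^\d{2}$/ } qw(99 1999 199 19999);
print "@years\n";                 # prints "99 1999"

# The first .* is greedy and leaves only the final 'at' for group 2.
my ($head, $at, $tail) = 'the cat in the hat' =~ /^(.*)(at)(.*)$/;
print "[$head] [$at] [$tail]\n";  # prints "[the cat in the h] [at] []"

# Doubled word of arbitrary length.
my ($word) = "it was the the best of times" =~ /(\w+)\s+\g1/;
print "$word\n";                  # prints "the"
```

       Because "\g1" only requires the captured text to reappear, this
       doubled-word regex can also match inside longer words; adding "\b"
       anchors around the group is one way to tighten it.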

   More matching
       There are a few more things you might want to know about matching
       operators.  The global modifier "/g" allows the matching operator to
       match within a string as many times as possible.  In scalar context,
       successive matches against a string will have "/g" jump from match to
       match, keeping track of position in the string as it goes along.  You
       can get or set the position with the "pos()" function.  For example,

           $x = "cat dog house"; # 3 words
           while ($x =~ /(\w+)/g) {
               print "Word is $1, ends at position ", pos $x, "\n";
           }

       prints

           Word is cat, ends at position 3
           Word is dog, ends at position 7
           Word is house, ends at position 13

       A failed match or changing the target string resets the position.  If
       you don't want the position reset after failure to match, add the "/c"
       modifier, as in "/regex/gc".

       In list context, "/g" returns a list of matched groupings, or if there
       are no groupings, a list of matches to the whole regex.  So

           @words = ($x =~ /(\w+)/g);  # matches,
                                       # $words[0] = 'cat'
                                       # $words[1] = 'dog'
                                       # $words[2] = 'house'
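
       Both contexts can be compared side by side in a short sketch.  The
       "a=1, b=2" string and the array names are assumptions added to show
       what list context returns when a regex has more than one grouping:

```perl
use strict;
use warnings;

my $x = "cat dog house";

# Scalar context: each iteration resumes where the last match ended.
my @ends;
while ($x =~ /(\w+)/g) {
    push @ends, $1 . ":" . pos($x);
}
print "@ends\n";    # prints "cat:3 dog:7 house:13"

# List context: all matches are returned at once.
my @words = $x =~ /(\w+)/g;
print "@words\n";   # prints "cat dog house"

# With several groupings, the groups of every match are flattened.
my @pairs = "a=1, b=2" =~ /(\w)=(\d)/g;
print "@pairs\n";   # prints "a 1 b 2"
```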

   Search and replace
       Search and replace is performed using "s/regex/replacement/modifiers".
       The "replacement" is a Perl double-quoted string that replaces in the
       string whatever is matched with the "regex".  The operator "=~" is also
       used here to associate a string with "s///".  If matching against $_,
       the "$_ =~" can be dropped.  If there is a match, "s///" returns the
       number of substitutions made; otherwise it returns false.  Here are a
       few examples:

           $x = "Time to feed the cat!";
           $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
           $y = "'quoted words'";
           $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                                  # $y contains "quoted words"

       With the "s///" operator, the matched variables $1, $2, etc. are
       immediately available for use in the replacement expression. With the
       global modifier, "s///g" will search and replace all occurrences of the
       regex in the string:

           $x = "I batted 4 for 4";
           $x =~ s/4/four/;   # $x contains "I batted four for 4"
           $x = "I batted 4 for 4";
           $x =~ s/4/four/g;  # $x contains "I batted four for four"

       The non-destructive modifier "s///r" causes the result of the
       substitution to be returned instead of modifying $_ (or whatever
       variable the substitute was bound to with "=~"):

           $x = "I like dogs.";
           $y = $x =~ s/dogs/cats/r;
           print "$x $y\n"; # prints "I like dogs. I like cats."

           $x = "Cats are great.";
           print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~
               s/Frogs/Hedgehogs/r, "\n";
           # prints "Hedgehogs are great."

           @foo = map { s/[a-z]/X/r } qw(a b c 1 2 3);
           # @foo is now qw(X X X 1 2 3)

       The evaluation modifier "s///e" wraps an "eval{...}" around the
       replacement string and the evaluated result is substituted for the
       matched substring.  Some examples:

           # reverse all the words in a string
           $x = "the cat in the hat";
           $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"

           # convert percentage to decimal
           $x = "A 39% hit rate";
           $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

       The last example shows that "s///" can use other delimiters, such as
       "s!!!" and "s{}{}", and even "s{}//".  If single quotes are used
       "s'''", then the regex and replacement are treated as single-quoted
       strings.
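
       The modifiers can be combined.  The sketch below pairs "/r" with
       "/g" and "/e" so that the original strings stay untouched, and also
       shows the substitution count that "s///" returns.  Variable names
       are illustrative:

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";

my $once  = $x =~ s/4/four/r;    # first occurrence only, $x unchanged
my $every = $x =~ s/4/four/gr;   # all occurrences, $x unchanged

my $copy  = $x;
my $count = ($copy =~ s/4/four/g);   # in-place; returns number replaced

my $phrase   = "the cat in the hat";
my $reversed = $phrase =~ s/(\w+)/reverse $1/ger;   # replacement is code

my $rate    = "A 39% hit rate";
my $decimal = $rate =~ s!(\d+)%!$1/100!er;          # other delimiters

print "$once\n";       # prints "I batted four for 4"
print "$every\n";      # prints "I batted four for four"
print "$count\n";      # prints "2"
print "$reversed\n";   # prints "eht tac ni eht tah"
print "$decimal\n";    # prints "A 0.39 hit rate"
```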

   The split operator
       "split /regex/, string" splits "string" into a list of substrings and
       returns that list.  The regex determines the character sequence that
       "string" is split with respect to.  For example, to split a string into
       words, use

           $x = "Calvin and Hobbes";
           @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                                     # $word[1] = 'and'
                                     # $word[2] = 'Hobbes'

       To extract a comma-delimited list of numbers, use

           $x = "1.618,2.718,   3.142";
           @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                       # $const[1] = '2.718'
                                       # $const[2] = '3.142'

       If the empty regex "//" is used, the string is split into individual
       characters.  If the regex has groupings, then the list produced
       contains the matched substrings from the groupings as well:

           $x = "/usr/bin";
           @parts = split m!(/)!, $x;  # $parts[0] = ''
                                       # $parts[1] = '/'
                                       # $parts[2] = 'usr'
                                       # $parts[3] = '/'
                                       # $parts[4] = 'bin'

       Since the first character of $x matched the regex, "split" prepended an
       empty initial element to the list.
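
       The four forms of "split" described above, gathered into one sketch.
       The "abc" input and the array names are assumptions added for
       illustration:

```perl
use strict;
use warnings;

my @words  = split /\s+/,  "Calvin and Hobbes";       # split on whitespace
my @consts = split /,\s*/, "1.618,2.718,   3.142";    # comma, optional space
my @chars  = split //,     "abc";                     # individual characters
my @parts  = split m!(/)!, "/usr/bin";                # keep the separators

print scalar(@words),  ": @words\n";    # prints "3: Calvin and Hobbes"
print scalar(@consts), ": @consts\n";   # prints "3: 1.618 2.718 3.142"
print scalar(@chars),  ": @chars\n";    # prints "3: a b c"
print scalar(@parts),  "\n";            # prints "5"
```

       The last count is 5 because the leading empty element and both '/'
       separators are part of the returned list.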

   "use re 'strict'"
       New in v5.22, this applies stricter rules than otherwise when compiling
       regular expression patterns.  It can find things that, while legal, may
       not be what you intended.

       See 'strict' in re.

BUGS

       None.

AUTHOR AND COPYRIGHT

       Copyright (c) 2000 Mark Kvale All rights reserved.

       This document may be distributed under the same terms as Perl itself.

   Acknowledgments
       The author would like to thank Mark-Jason Dominus, Tom Christiansen,
       Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
       comments.

perl v5.26.3                      2018-03-23                    PERLREQUICK(1)
