PERLREQUICK(1)         Perl Programmers Reference Guide         PERLREQUICK(1)

NAME

       perlrequick - Perl regular expressions quick start

DESCRIPTION

       This page covers the very basics of understanding, creating and using
       regular expressions ('regexes') in Perl.

The Guide

       This page assumes you already know the basics, such as what a "pattern"
       is and the basic syntax for using one.  If you don't, see perlretut.

   Simple word matching
       The simplest regex is simply a word, or more generally, a string of
       characters.  A regex consisting of a word matches any string that
       contains that word:

           "Hello World" =~ /World/;  # matches

       In this statement, "World" is a regex and the "//" enclosing "/World/"
       tells Perl to search a string for a match.  The operator "=~"
       associates the string with the regex match and produces a true value if
       the regex matched, or false if the regex did not match.  In our case,
       "World" matches the second word in "Hello World", so the expression is
       true.  This idea has several variations.

       Expressions like this are useful in conditionals:

           print "It matches\n" if "Hello World" =~ /World/;

       The sense of the match can be reversed by using the "!~" operator:

           print "It doesn't match\n" if "Hello World" !~ /World/;

       The literal string in the regex can be replaced by a variable:

           $greeting = "World";
           print "It matches\n" if "Hello World" =~ /$greeting/;

       If you're matching against $_, the "$_ =~" part can be omitted:

           $_ = "Hello World";
           print "It matches\n" if /World/;

       Finally, the "//" default delimiters for a match can be changed to
       arbitrary delimiters by putting an 'm' out front:

           "Hello World" =~ m!World!;   # matches, delimited by '!'
           "Hello World" =~ m{World};   # matches, note the matching '{}'
           "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                        # '/' becomes an ordinary char

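       As a rough, self-contained sketch, the variations above can be checked
       in one script.  The variable names and the use of a "for" loop to
       alias $_ are illustrative assumptions, not part of the original
       examples:

```perl
use strict;
use warnings;

# Sample data; the names are hypothetical.
my $text     = "Hello World";
my $greeting = "World";

my $plain    = $text =~ /World/     ? 1 : 0;  # literal word
my $negated  = $text !~ /World/     ? 1 : 0;  # sense reversed by !~
my $from_var = $text =~ /$greeting/ ? 1 : 0;  # pattern taken from a variable
my $braces   = $text =~ m{World}    ? 1 : 0;  # alternative delimiters

# Matching against $_ needs no explicit "=~"; "for" aliases $_ to $text.
my $implicit;
for ($text) { $implicit = /World/ ? 1 : 0 }

print "plain=$plain negated=$negated from_var=$from_var ",
      "braces=$braces implicit=$implicit\n";
```
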
       Regexes must match a part of the string exactly in order for the
       statement to be true:

           "Hello World" =~ /world/;  # doesn't match, case sensitive
           "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
           "Hello World" =~ /World /; # doesn't match, no ' ' at end

       Perl will always match at the earliest possible point in the string:

           "Hello World" =~ /o/;       # matches 'o' in 'Hello'
           "That hat is red" =~ /hat/; # matches 'hat' in 'That'

       Not all characters can be used 'as is' in a match.  Some characters,
       called metacharacters, are considered special, and reserved for use in
       regex notation.  The metacharacters are

           {}[]()^$.|*+?\

       A metacharacter can be matched literally by putting a backslash before
       it:

           "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
           "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
           'C:\WIN32' =~ /C:\\WIN/;               # matches
           "/usr/bin/perl" =~ /\/usr\/bin\/perl/; # matches

       In the last regex, the forward slash '/' is also backslashed, because
       it is used to delimit the regex.

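       To see why the unescaped "+" fails on "2+2=4", it may help to try both
       patterns against a few strings.  This is only a sketch; the extra test
       strings "222" and "22+2" are made up for illustration:

```perl
use strict;
use warnings;

my @results;
for my $s ("2+2=4", "222", "22+2") {
    # Unescaped: '+' is a quantifier, so /2+2/ means "one or more 2s, then a 2".
    my $unescaped = $s =~ /2+2/  ? "yes" : "no";
    # Escaped: '\+' is a literal plus sign between two 2s.
    my $escaped   = $s =~ /2\+2/ ? "yes" : "no";
    push @results, "$s: unescaped=$unescaped escaped=$escaped";
}
print "$_\n" for @results;
```
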
       Most of the metacharacters aren't always special, and other characters
       (such as the ones delimiting the pattern) become special under various
       circumstances.  This can be confusing and lead to unexpected results.
       "use re 'strict'" can notify you of potential pitfalls.

       Non-printable ASCII characters are represented by escape sequences.
       Common examples are "\t" for a tab, "\n" for a newline, and "\r" for a
       carriage return.  Arbitrary bytes are represented by octal escape
       sequences, e.g., "\033", or hexadecimal escape sequences, e.g., "\x1B":

           "1000\t2000" =~ m(0\t2);  # matches
           "cat" =~ /\143\x61\x74/;  # matches in ASCII, but
                                     # a weird way to spell cat

       Regexes are treated mostly as double-quoted strings, so variable
       substitution works:

           $foo = 'house';
           'cathouse' =~ /cat$foo/;   # matches
           'housecat' =~ /${foo}cat/; # matches

       With all of the regexes above, if the regex matched anywhere in the
       string, it was considered a match.  To specify where it should match,
       we would use the anchor metacharacters "^" and "$".  The anchor "^"
       means match at the beginning of the string and the anchor "$" means
       match at the end of the string, or before a newline at the end of the
       string.  Some examples:

           "housekeeper" =~ /keeper/;         # matches
           "housekeeper" =~ /^keeper/;        # doesn't match
           "housekeeper" =~ /keeper$/;        # matches
           "housekeeper\n" =~ /keeper$/;      # matches
           "housekeeper" =~ /^housekeeper$/;  # matches

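       Using both anchors together is a common way to require that the whole
       string match.  The helper "is_keeper" below is a hypothetical name
       used only for this sketch:

```perl
use strict;
use warnings;

# Hypothetical helper: true only if the whole string is the word 'keeper',
# optionally followed by one trailing newline (which "$" permits).
sub is_keeper {
    my ($s) = @_;
    return $s =~ /^keeper$/ ? 1 : 0;
}

my @checks = map { is_keeper($_) }
             ("keeper", "keeper\n", "housekeeper", "keepers");
print "@checks\n";   # prints "1 1 0 0"
```
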
   Using character classes
       A character class allows a set of possible characters, rather than just
       a single character, to match at a particular point in a regex.  There
       are a number of different types of character classes, but usually when
       people use this term, they are referring to the type described in this
       section, which is technically called a "bracketed character class"
       because it is denoted by brackets "[...]", with the set of characters
       to be possibly matched inside.  But we'll drop the "bracketed" below to
       correspond with common usage.  Here are some examples of (bracketed)
       character classes:

           /cat/;            # matches 'cat'
           /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
           "abc" =~ /[cab]/; # matches 'a'

       In the last statement, even though 'c' is the first character in the
       class, the earliest point at which the regex can match is 'a'.

           /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                           # 'yes', 'Yes', 'YES', etc.
           /yes/i;         # also match 'yes' in a case-insensitive way

       The last example shows a match with an 'i' modifier, which makes the
       match case-insensitive.

       Character classes also have ordinary and special characters, but the
       sets of ordinary and special characters inside a character class are
       different from those outside a character class.  The special characters
       for a character class are "-]\^$" and are matched using an escape:

           /[\]c]def/; # matches ']def' or 'cdef'
           $x = 'bcr';
           /[$x]at/;   # matches 'bat', 'cat', or 'rat'
           /[\$x]at/;  # matches '$at' or 'xat'
           /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

       The special character '-' acts as a range operator within character
       classes, so that the unwieldy "[0123456789]" and "[abc...xyz]" become
       the svelte "[0-9]" and "[a-z]":

           /item[0-9]/;    # matches 'item0' or ... or 'item9'
           /[0-9a-fA-F]/;  # matches a hexadecimal digit

       If '-' is the first or last character in a character class, it is
       treated as an ordinary character.

       The special character "^" in the first position of a character class
       denotes a negated character class, which matches any character but
       those in the brackets.  Both "[...]" and "[^...]" must match a
       character, or the match fails.  For example:

           /[^a]at/;  # doesn't match 'aat' or 'at', but matches
                      # all other 'bat', 'cat', '0at', '%at', etc.
           /[^0-9]/;  # matches a non-numeric character
           /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

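       The following sketch filters a few made-up strings through a range, a
       negated class, and a class with a leading '-'.  The input lists are
       assumptions chosen only to exercise each rule:

```perl
use strict;
use warnings;

# Range: keep strings containing at least one hexadecimal digit.
my @hex    = grep { /[0-9a-fA-F]/ } qw(7 c F g -);

# Negation: some character other than 'a', followed by 'at'.
my @not_a  = grep { /[^a]at/ } qw(aat at bat %at);

# A '-' in first position is an ordinary character, not a range operator.
my @dashes = grep { /[-+]/ } qw(a-b a+b ab);

print "hex: @hex\n";
print "not_a: @not_a\n";
print "dashes: @dashes\n";
```
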
       Perl has several abbreviations for common character classes. (These
       definitions are those that Perl uses in ASCII-safe mode with the "/a"
       modifier.  Otherwise they could match many more non-ASCII Unicode
       characters as well.  See "Backslash sequences" in perlrecharclass for
       details.)

       •   \d is a digit and represents

               [0-9]

       •   \s is a whitespace character and represents

               [\ \t\r\n\f]

       •   \w is a word character (alphanumeric or _) and represents

               [0-9a-zA-Z_]

       •   \D is a negated \d; it represents any character but a digit

               [^0-9]

       •   \S is a negated \s; it represents any non-whitespace character

               [^\s]

       •   \W is a negated \w; it represents any non-word character

               [^\w]

       •   The period '.' matches any character but "\n"

       The "\d\s\w\D\S\W" abbreviations can be used both inside and outside of
       character classes.  Here are some in use:

           /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
           /[\d\s]/;         # matches any digit or whitespace character
           /\w\W\w/;         # matches a word char, followed by a
                             # non-word char, followed by a word char
           /..rt/;           # matches any two chars, followed by 'rt'
           /end\./;          # matches 'end.'
           /end[.]/;         # same thing, matches 'end.'

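       A small sketch of the abbreviations in use, applying the "/a" modifier
       mentioned above so that "\d" is limited to ASCII digits.  The log line
       is a hypothetical input, not taken from the original:

```perl
use strict;
use warnings;

# Hypothetical log line used only for illustration.
my $line = "backup finished at 23:59:07 on host-12";

my $has_time = $line  =~ /\d\d:\d\d:\d\d/a ? 1 : 0;  # hh:mm:ss, ASCII digits
my $has_gap  = $line  =~ /\w\W\w/          ? 1 : 0;  # word, non-word, word
my $dotted   = "end." =~ /end[.]/          ? 1 : 0;  # '.' is literal in [...]
my $any_char = "endx" =~ /end[.]/          ? 1 : 0;  # so 'x' does not satisfy it

print "has_time=$has_time has_gap=$has_gap ",
      "dotted=$dotted any_char=$any_char\n";
```
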
       The word anchor "\b" matches a boundary between a word character and a
       non-word character "\w\W" or "\W\w":

           $x = "Housecat catenates house and cat";
           $x =~ /\bcat/;    # matches cat in 'catenates'
           $x =~ /cat\b/;    # matches cat in 'Housecat'
           $x =~ /\bcat\b/;  # matches 'cat' at end of string

       In the last example, the end of the string is considered a word
       boundary.

       For natural language processing (so that, for example, apostrophes are
       included in words), use instead "\b{wb}":

           "don't" =~ / .+? \b{wb} /x;  # matches the whole string

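       One way to confirm which "cat" each "\b" pattern finds is to record
       where the match began.  This sketch relies on the built-in array "@-",
       which is not covered on this page; "$-[0]" is the offset at which the
       most recent successful match started:

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";

# $-[0] is the start offset of the last successful match.
my ($word_start, $word_end, $whole_word);
$word_start = $-[0] if $x =~ /\bcat/;    # 'cat' in 'catenates'
$word_end   = $-[0] if $x =~ /cat\b/;    # 'cat' in 'Housecat'
$whole_word = $-[0] if $x =~ /\bcat\b/;  # the final, standalone 'cat'

print "word_start=$word_start word_end=$word_end whole_word=$whole_word\n";
# prints "word_start=9 word_end=5 whole_word=29"
```
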
   Matching this or that
       We can match different character strings with the alternation
       metacharacter '|'.  To match "dog" or "cat", we form the regex
       "dog|cat".  As before, Perl will try to match the regex at the earliest
       possible point in the string.  At each character position, Perl will
       first try to match the first alternative, "dog".  If "dog" doesn't
       match, Perl will then try the next alternative, "cat".  If "cat"
       doesn't match either, then the match fails and Perl moves to the next
       position in the string.  Some examples:

           "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
           "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

       Even though "dog" is the first alternative in the second regex, "cat"
       is able to match earlier in the string.

           "cats"          =~ /c|ca|cat|cats/; # matches "c"
           "cats"          =~ /cats|cat|ca|c/; # matches "cats"

       At a given character position, the first alternative that allows the
       regex match to succeed will be the one that matches. Here, all the
       alternatives match at the first string position, so the first matches.

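       The ordering rules above can be observed directly by printing the
       text that matched.  This sketch uses two features not introduced on
       this page, the "$&" variable (the text of the last successful match)
       and "qr//" (a precompiled pattern); "matched_text" is a hypothetical
       helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: return the text matched by the pattern, or undef.
sub matched_text {
    my ($string, $regex) = @_;
    return $string =~ $regex ? $& : undef;
}

my $short_first = matched_text("cats", qr/c|ca|cat|cats/);          # 'c'
my $long_first  = matched_text("cats", qr/cats|cat|ca|c/);          # 'cats'
my $earliest    = matched_text("cats and dogs", qr/dog|cat|bird/);  # 'cat'

print "$short_first $long_first $earliest\n";   # prints "c cats cat"
```
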
   Grouping things and hierarchical matching
       The grouping metacharacters "()" allow a part of a regex to be treated
       as a single unit.  Parts of a regex are grouped by enclosing them in
       parentheses.  The regex "house(cat|keeper)" means match "house"
       followed by either "cat" or "keeper".  Some more examples are:

           /(a|b)b/;    # matches 'ab' or 'bb'
           /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

           /house(cat|)/;      # matches either 'housecat' or 'house'
           /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                               # 'house'.  Note groups can be nested.

           "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                                    # because '20\d\d' can't match

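       A brief sketch of grouped alternation, including the null alternative.
       The "^" and "$" anchors are added here (an assumption beyond the
       examples above) so that each test string must match in full:

```perl
use strict;
use warnings;

# Only 'house' followed by exactly 'cat' or 'keeper' survives the filter.
my @hits = grep { /^house(cat|keeper)$/ }
           qw(housecat housekeeper house housecats);

# With an empty alternative the group can match nothing at all:
# '19' and '20' both fail to leave two digits, so the group matches ''.
my $century;
$century = $1 if "20" =~ /^(19|20|)\d\d$/;

print "hits: @hits\n";              # prints "hits: housecat housekeeper"
print "century: '$century'\n";      # prints "century: ''"
```
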
   Extracting matches
       The grouping metacharacters "()" also allow the extraction of the parts
       of a string that matched.  For each grouping, the part that matched
       inside goes into the special variables $1, $2, etc.  They can be used
       just as ordinary variables:

           # extract hours, minutes, seconds
           $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
           $hours = $1;
           $minutes = $2;
           $seconds = $3;

       In list context, a match "/regex/" with groupings will return the list
       of matched values "($1,$2,...)".  So we could rewrite it as

           ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

       If the groupings in a regex are nested, $1 gets the group with the
       leftmost opening parenthesis, $2 the next opening parenthesis, etc.
       For example, here is a complex regex and the matching variables
       indicated below it:

           /(ab(cd|ef)((gi)|j))/;
            1  2      34

       Associated with the matching variables $1, $2, ... are the
       backreferences "\g1", "\g2", ...  Backreferences are matching variables
       that can be used inside a regex:

           /(\w\w\w)\s\g1/; # find sequences like 'the the' in string

       $1, $2, ... should only be used outside of a regex, and "\g1", "\g2",
       ... only inside a regex.

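       In practice it is usually wise to check that the match succeeded
       before trusting the captured values.  A minimal sketch, with made-up
       input strings and added anchors:

```perl
use strict;
use warnings;

my $time = "07:45:30";   # hypothetical input

# In list context the match returns the captured groups, or an empty
# list on failure, so the "or die" fires when the format is wrong.
my ($hours, $minutes, $seconds) = $time =~ /^(\d\d):(\d\d):(\d\d)$/
    or die "not in hh:mm:ss format: $time\n";

# A backreference inside the regex finds a doubled word.
my $text = "Paris in the the spring";
my ($repeated) = $text =~ /\b(\w+)\s+\g1\b/;

print "$hours h, $minutes m, $seconds s; repeated word: $repeated\n";
```
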
   Matching repetitions
       The quantifier metacharacters "?", "*", "+", and "{}" allow us to
       determine the number of repeats of a portion of a regex we consider to
       be a match.  Quantifiers are put immediately after the character,
       character class, or grouping that we want to specify.  They have the
       following meanings:

       •   "a?" = match 'a' 1 or 0 times

       •   "a*" = match 'a' 0 or more times, i.e., any number of times

       •   "a+" = match 'a' 1 or more times, i.e., at least once

       •   "a{n,m}" = match at least "n" times, but not more than "m" times.

       •   "a{n,}" = match at least "n" times

       •   "a{,n}" = match "n" times or fewer

       •   "a{n}" = match exactly "n" times

       Here are some examples:

           /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                            # any number of digits
           /(\w+)\s+\g1/;   # match doubled words of arbitrary length
           $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                                  # than 4 digits
           $year =~ /^\d{ 4 }$|^\d{2}$/; # better match; throw out 3 digit dates

       These quantifiers will try to match as much of the string as possible,
       while still allowing the regex to match.  So we have

           $x = 'the cat in the hat';
           $x =~ /^(.*)(at)(.*)$/; # matches,
                                   # $1 = 'the cat in the h'
                                   # $2 = 'at'
                                   # $3 = ''   (0 matches)

       The first quantifier ".*" grabs as much of the string as possible while
       still having the regex match. The second quantifier ".*" has no string
       left to it, so it matches 0 times.

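       The year check and the greedy match above can be run as a short
       script.  "valid_year" is a hypothetical helper, and the pattern is
       written without spaces inside the braces so that it also compiles on
       older perls:

```perl
use strict;
use warnings;

# Hypothetical helper: accept exactly 2 or exactly 4 digits.
sub valid_year {
    my ($year) = @_;
    return $year =~ /^\d{4}$|^\d{2}$/ ? 1 : 0;
}

my @flags = map { valid_year($_) } qw(99 199 1999 19999);
print "@flags\n";   # prints "1 0 1 0"

# Greedy matching: the first .* takes as much as it can.
my ($before, $at, $after) = 'the cat in the hat' =~ /^(.*)(at)(.*)$/;
print "[$before] [$at] [$after]\n";   # prints "[the cat in the h] [at] []"
```
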
   More matching
       There are a few more things you might want to know about matching
       operators.  The global modifier "/g" allows the matching operator to
       match within a string as many times as possible.  In scalar context,
       successive matches against a string will have "/g" jump from match to
       match, keeping track of position in the string as it goes along.  You
       can get or set the position with the "pos()" function.  For example,

           $x = "cat dog house"; # 3 words
           while ($x =~ /(\w+)/g) {
               print "Word is $1, ends at position ", pos $x, "\n";
           }

       prints

           Word is cat, ends at position 3
           Word is dog, ends at position 7
           Word is house, ends at position 13

       A failed match or changing the target string resets the position.  If
       you don't want the position reset after failure to match, add the "/c"
       modifier, as in "/regex/gc".

       In list context, "/g" returns a list of matched groupings, or if there
       are no groupings, a list of matches to the whole regex.  So

           @words = ($x =~ /(\w+)/g);  # matches,
                                       # $words[0] = 'cat'
                                       # $words[1] = 'dog'
                                       # $words[2] = 'house'

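       Both contexts can be combined in one script.  This is a sketch only;
       the array name "@ends" is an illustrative assumption:

```perl
use strict;
use warnings;

my $x = "cat dog house";

# List context: all matches are returned at once.
my @words = $x =~ /(\w+)/g;

# Scalar context: one match per iteration; pos() reports where it ended.
my @ends;
while ($x =~ /\w+/g) {
    push @ends, pos($x);
}

print "words: @words\n";   # prints "words: cat dog house"
print "ends: @ends\n";     # prints "ends: 3 7 13"
```
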
   Search and replace
       Search and replace is performed using "s/regex/replacement/modifiers".
       The "replacement" is a Perl double-quoted string that replaces in the
       string whatever is matched with the "regex".  The operator "=~" is also
       used here to associate a string with "s///".  If matching against $_,
       the "$_ =~" can be dropped.  If there is a match, "s///" returns the
       number of substitutions made; otherwise it returns false.  Here are a
       few examples:

           $x = "Time to feed the cat!";
           $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
           $y = "'quoted words'";
           $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                                  # $y contains "quoted words"

       With the "s///" operator, the matched variables $1, $2, etc.  are
       immediately available for use in the replacement expression. With the
       global modifier, "s///g" will search and replace all occurrences of the
       regex in the string:

           $x = "I batted 4 for 4";
           $x =~ s/4/four/;   # $x contains "I batted four for 4"
           $x = "I batted 4 for 4";
           $x =~ s/4/four/g;  # $x contains "I batted four for four"

       The non-destructive modifier "s///r" causes the result of the
       substitution to be returned instead of modifying $_ (or whatever
       variable the substitute was bound to with "=~"):

           $x = "I like dogs.";
           $y = $x =~ s/dogs/cats/r;
           print "$x $y\n"; # prints "I like dogs. I like cats."

           $x = "Cats are great.";
           print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~
               s/Frogs/Hedgehogs/r, "\n";
           # prints "Hedgehogs are great."

           @foo = map { s/[a-z]/X/r } qw(a b c 1 2 3);
           # @foo is now qw(X X X 1 2 3)

       The evaluation modifier "s///e" wraps an "eval{...}" around the
       replacement string and the evaluated result is substituted for the
       matched substring.  Some examples:

           # reverse all the words in a string
           $x = "the cat in the hat";
           $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"

           # convert percentage to decimal
           $x = "A 39% hit rate";
           $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

       The last example shows that "s///" can use other delimiters, such as
       "s!!!" and "s{}{}", and even "s{}//".  If single quotes are used
       "s'''", then the regex and replacement are treated as single-quoted
       strings.

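       The return value and the "/r" and "/e" modifiers can be compared side
       by side.  A minimal sketch, using a made-up input string:

```perl
use strict;
use warnings;

my $template = "Total: 3 items at 20% off";   # hypothetical input

# /r returns the changed copy and leaves $template untouched.
my $spelled = $template =~ s/3/three/r;

# /e evaluates the replacement as Perl code; /g applies it to every match.
(my $doubled = $template) =~ s/(\d+)/$1 * 2/ge;

# Without /r, s/// returns the number of substitutions made.
my $copy  = $template;
my $count = ($copy =~ s/\d+/N/g);

print "$spelled\n";                      # Total: three items at 20% off
print "$doubled\n";                      # Total: 6 items at 40% off
print "$copy ($count substitutions)\n";  # Total: N items at N% off (2 substitutions)
```
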
   The split operator
       "split /regex/, string" splits "string" into a list of substrings and
       returns that list.  The regex determines the character sequence that
       "string" is split with respect to.  For example, to split a string into
       words, use

           $x = "Calvin and Hobbes";
           @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                                     # $word[1] = 'and'
                                     # $word[2] = 'Hobbes'

       To extract a comma-delimited list of numbers, use

           $x = "1.618,2.718,   3.142";
           @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                       # $const[1] = '2.718'
                                       # $const[2] = '3.142'

       If the empty regex "//" is used, the string is split into individual
       characters.  If the regex has groupings, then the list produced
       contains the matched substrings from the groupings as well:

           $x = "/usr/bin";
           @parts = split m!(/)!, $x;  # $parts[0] = ''
                                       # $parts[1] = '/'
                                       # $parts[2] = 'usr'
                                       # $parts[3] = '/'
                                       # $parts[4] = 'bin'

       Since the first character of $x matched the regex, "split" prepended an
       empty initial element to the list.

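       The three behaviours of "split" described above can be compared on a
       single record.  This is a sketch with a hypothetical input string:

```perl
use strict;
use warnings;

my $record = "alice, bob,carol";   # hypothetical input

my @names = split /,\s*/, $record;     # separators are discarded
my @kept  = split /(,)\s*/, $record;   # captured separators are kept
my @chars = split //, "abc";           # empty regex: one char per element

print scalar(@names), " names: @names\n";   # 3 names: alice bob carol
print scalar(@kept),  " parts: @kept\n";    # 5 parts: alice , bob , carol
print scalar(@chars), " chars: @chars\n";   # 3 chars: a b c
```
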
   "use re 'strict'"
       New in v5.22, this applies stricter rules than otherwise when compiling
       regular expression patterns.  It can find things that, while legal, may
       not be what you intended.

       See 'strict' in re.

BUGS

       None.

AUTHOR AND COPYRIGHT

       Copyright (c) 2000 Mark Kvale.  All rights reserved.

       This document may be distributed under the same terms as Perl itself.

   Acknowledgments
       The author would like to thank Mark-Jason Dominus, Tom Christiansen,
       Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
       comments.

perl v5.34.0                      2021-10-18                    PERLREQUICK(1)
