Regexp::Common(3)     User Contributed Perl Documentation    Regexp::Common(3)

NAME
       Regexp::Common - Provide commonly requested regular expressions

SYNOPSIS
        # STANDARD USAGE

        use Regexp::Common;

        while (<>) {
            /$RE{num}{real}/                and print q{a number};
            /$RE{quoted}/                   and print q{a ['"`] quoted string};
            m{$RE{delimited}{-delim=>'/'}}  and print q{a /.../ sequence};
            /$RE{balanced}{-parens=>'()'}/  and print q{balanced parentheses};
            /$RE{profanity}/                and print q{a #*@%-ing word};
        }


        # SUBROUTINE-BASED INTERFACE

        use Regexp::Common 'RE_ALL';

        while (<>) {
            $_ =~ RE_num_real()              and print q{a number};
            $_ =~ RE_quoted()                and print q{a ['"`] quoted string};
            $_ =~ RE_delimited(-delim=>'/')  and print q{a /.../ sequence};
            $_ =~ RE_balanced(-parens=>'()') and print q{balanced parentheses};
            $_ =~ RE_profanity()             and print q{a #*@%-ing word};
        }


        # IN-LINE MATCHING...

        if ( $RE{num}{int}->matches($text) ) {...}


        # ...AND SUBSTITUTION

        my $cropped = $RE{ws}{crop}->subs($uncropped);


        # ROLL-YOUR-OWN PATTERNS

        use Regexp::Common 'pattern';

        pattern name   => ['name', 'mine'],
                create => '(?i:J[.]?\s+A[.]?\s+Perl-Hacker)',
                ;

        my $name_matcher = $RE{name}{mine};

        pattern name    => [ 'lineof', '-char=_' ],
                create  => sub {
                               # Called with the pattern object, then the flags.
                               my ($self, $flags) = @_;
                               my $char = quotemeta $flags->{-char};
                               # Double quotes so that $char is interpolated.
                               return "(?:^$char+\$)";
                           },
                matches => sub {
                               my ($self, $str) = @_;
                               return $str !~ /[^$self->{flags}{-char}]/;
                           },
                subs    => sub {
                               my ($self, $str, $replacement) = @_;
                               # \Q...\E keeps characters such as '*' literal.
                               $_[1] =~ s/^\Q$self->{flags}{-char}\E+$//g;
                           },
                ;

        my $asterisks = $RE{lineof}{-char=>'*'};

        # DECIDING WHICH PATTERNS TO LOAD.

        use Regexp::Common qw /comment number/;  # Comment and number patterns.
        use Regexp::Common qw /no_defaults/;     # Don't load any patterns.
        use Regexp::Common qw /!delimited/;      # All, but delimited patterns.

DESCRIPTION
       By default, this module exports a single hash (%RE) that stores or
       generates commonly needed regular expressions (see "List of available
       patterns").

       There is an alternative, subroutine-based syntax described in
       "Subroutine-based interface".

   General syntax for requesting patterns
       To access a particular pattern, %RE is treated as a hierarchical hash
       of hashes (of hashes...), with each successive key being an identifier.
       For example, to access the pattern that matches real numbers, you
       specify:

           $RE{num}{real}

       and to access the pattern that matches integers:

           $RE{num}{int}

       Deeper layers of the hash are used to specify flags: arguments that
       modify the resulting pattern in some way. The keys used to access these
       layers are prefixed with a minus sign and may have a value; if a value
       is given, it's done by using a multidimensional key. For example, to
       access the pattern that matches base-2 real numbers with embedded
       commas separating groups of three digits (e.g. 10,101,110.110101101):

           $RE{num}{real}{-base => 2}{-sep => ','}{-group => 3}

       Through the magic of Perl, these flag layers may be specified in any
       order (and even interspersed through the identifier keys!) so you
       could get the same pattern with:

           $RE{num}{real}{-sep => ','}{-group => 3}{-base => 2}

       or:

           $RE{num}{-base => 2}{real}{-group => 3}{-sep => ','}

       or even:

           $RE{-base => 2}{-group => 3}{-sep => ','}{num}{real}

       etc.

       Note, however, that the relative order amongst the identifier keys
       is significant. That is:

           $RE{list}{set}

       would not be the same as:

           $RE{set}{list}
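
       The order-independence of flags can be checked directly. The
       following is a minimal sketch, assuming the number pattern set is
       installed; the variable names ($p1, $p2, $p3) are illustrative only.
       It stringifies three spellings of the same request and compares the
       generated pattern text.

```perl
use strict;
use warnings;
use Regexp::Common qw(number);

# Three spellings of the same request; only the flag positions differ.
my $p1 = $RE{num}{real}{-base => 2}{-sep => ','}{-group => 3};
my $p2 = $RE{num}{-base => 2}{real}{-group => 3}{-sep => ','};
my $p3 = $RE{-base => 2}{-group => 3}{-sep => ','}{num}{real};

# Stringifying each object yields the generated pattern text.
my $same = ("$p1" eq "$p2" && "$p2" eq "$p3") ? 1 : 0;
print "same pattern: ", ($same ? "yes" : "no"), "\n";

# Try the example number quoted in the text against the pattern.
my $matched = "10,101,110.110101101" =~ /^$p1$/ ? 1 : 0;
print "example number matched: $matched\n";
```

       Swapping the identifier keys instead (for example {real}{num}) asks
       for a different pattern altogether, which is why only the flags may
       move.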

   Flag syntax
       In versions prior to 2.113, flags could also be written as
       "{"-flag=value"}". This no longer works, although "{"-flag$;value"}"
       still does. However, "{-flag => 'value'}" is the preferred syntax.

   Universal flags
       Normally, flags are specific to a single pattern. However, there are
       two flags that all patterns may specify.

       "-keep"
           By default, the patterns provided by %RE contain no capturing
           parentheses. However, if the "-keep" flag is specified (it requires
           no value) then any significant substrings that the pattern matches
           are captured. For example:

               if ($str =~ $RE{num}{real}{-keep}) {
                   $number   = $1;
                   $whole    = $3;
                   $decimals = $5;
               }

           Special care is needed if a "kept" pattern is interpolated into a
           larger regular expression, as the presence of other capturing
           parentheses is likely to change the "number variables" into which
           significant substrings are saved.

           See also "Adding new regular expressions", which describes how to
           create new patterns with "optional" capturing brackets that respond
           to "-keep".

       "-i"
           Some patterns or subpatterns only match lowercase or uppercase
           letters. If one wants to do case insensitive matching, one option
           is to use the "/i" regexp modifier, or the special sequence "(?i)".
           But if the functional interface is used, one does not have this
           option. The "-i" switch solves this problem; by using it, the
           pattern will do case insensitive matching.
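
       The renumbering caveat for "-keep" is easiest to see side by side.
       This is a small sketch, not taken from the module's own examples. It
       assumes the capture layout documented in Regexp::Common::number for
       $RE{num}{int} ($1 the whole number, $2 the sign, $3 the digits); the
       input strings are made up for illustration.

```perl
use strict;
use warnings;
use Regexp::Common qw(number);

my $int = $RE{num}{int}{-keep};

# Used on its own, the kept groups start at $1.
my ($whole, $sign, $digits);
if ("-42" =~ /^$int$/) {
    ($whole, $sign, $digits) = ($1, $2, $3);
}

# Interpolated after another capture group, every kept group shifts by one:
# $1 is now the key, and the whole number has moved to $2.
my ($key, $value);
if ("offset=-42" =~ /^(\w+)=$int$/) {
    ($key, $value) = ($1, $2);
}

print "whole=$whole sign=$sign digits=$digits\n";
print "key=$key value=$value\n";
```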

   OO interface and inline matching/substitution
       The patterns returned from %RE are objects, so rather than writing:

           if ($str =~ /$RE{some}{pattern}/ ) {...}

       you can write:

           if ( $RE{some}{pattern}->matches($str) ) {...}

       For matching this would seem to have no great advantage apart from
       readability (but see below).

       For substitutions, it has other significant benefits. Frequently you
       want to perform a substitution on a string without changing the
       original. Most people use this:

           $changed = $original;
           $changed =~ s/$RE{some}{pattern}/$replacement/;

       The more adept use:

           ($changed = $original) =~ s/$RE{some}{pattern}/$replacement/;

       Regexp::Common allows you to write this:

           $changed = $RE{some}{pattern}->subs($original=>$replacement);

       Apart from reducing precedence-angst, this approach has the added
       advantages that the substitution behaviour can be optimized from the
       regular expression, and the replacement string can be provided by
       default (see "Adding new regular expressions").

       For example, in the implementation of this substitution:

           $cropped = $RE{ws}{crop}->subs($uncropped);

       the default empty string is provided automatically, and the
       substitution is optimized to use:

           $uncropped =~ s/^\s+//;
           $uncropped =~ s/\s+$//;

       rather than:

           $uncropped =~ s/^\s+|\s+$//g;
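
       A short, self-contained run of the two methods described above may
       help. It is only a sketch; the sample string is invented, and the
       ternaries simply turn the match results into 1 or 0 for printing.

```perl
use strict;
use warnings;
use Regexp::Common qw(number whitespace);

my $original = "   42 apples   ";

# subs() works on a copy and returns it; $original is left untouched.
my $cropped = $RE{ws}{crop}->subs($original);

# matches() applies the pattern unanchored, like a plain m/.../.
my $has_int = $RE{num}{int}->matches($cropped)  ? 1 : 0;
my $no_int  = $RE{num}{int}->matches("apples") ? 1 : 0;

print "cropped:  [$cropped]\n";
print "original: [$original]\n";
print "has_int=$has_int no_int=$no_int\n";
```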

   Subroutine-based interface
       The hash-based interface was chosen because it allows regexes to be
       effortlessly interpolated, and because it also allows them to be
       "curried". For example:

           my $num = $RE{num}{int};

           my $commad     = $num->{-sep=>','}{-group=>3};
           my $duodecimal = $num->{-base=>12};

       However, the use of tied hashes does make the access to Regexp::Common
       patterns slower than it might otherwise be. In contexts where
       impatience overrules laziness, Regexp::Common provides an additional
       subroutine-based interface.

       For each (sub-)entry in the %RE hash ($RE{key1}{key2}{etc}), there is a
       corresponding exportable subroutine: "RE_key1_key2_etc()". The name of
       each subroutine is the underscore-separated concatenation of the
       non-flag keys that locate the same pattern in %RE. Flags are passed to
       the subroutine in its argument list. Thus:

           use Regexp::Common qw( RE_ws_crop RE_num_real RE_profanity );

           $str =~ RE_ws_crop() and die "Surrounded by whitespace";

           $str =~ RE_num_real(-base=>8, -sep=>" ") or next;

           $offensive = RE_profanity(-keep);
           $str =~ s/$offensive/$bad{$1}++; "<expletive deleted>"/ge;

       Note that, unlike the hash-based interface (which returns objects),
       these subroutines return ordinary "qr"'d regular expressions. Hence
       they do not curry, nor do they provide the OO match and substitution
       inlining described in the previous section.
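
       As a rough illustration of the difference, the sketch below calls one
       exported subroutine with and without flags and inspects what comes
       back. The test strings are invented for the example.

```perl
use strict;
use warnings;
use Regexp::Common qw(RE_num_int);

# No flags: a plain compiled regex, not a Regexp::Common object.
my $int  = RE_num_int();
my $kind = ref $int;

# Flags go in the argument list as key/value pairs.
my $grouped = RE_num_int(-sep => ',', -group => 3);

my $ok  = "1,234,567" =~ /^$grouped$/ ? 1 : 0;   # groups of three digits
my $bad = "12,34"     =~ /^$grouped$/ ? 1 : 0;   # second group is too short

print "kind=$kind ok=$ok bad=$bad\n";
```

       Because the return value is an ordinary "qr" regex, it can be stored
       once and reused in a loop without going through the tied hash again.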

       It is also possible to export subroutines for all available patterns
       like so:

           use Regexp::Common 'RE_ALL';

       Or you can export all subroutines with a common prefix of keys like so:

           use Regexp::Common 'RE_num_ALL';

       which will export "RE_num_int" and "RE_num_real" (and if you have
       created more patterns whose first key is num, those will be exported
       as well). In general, RE_key1_..._keyn_ALL will export all subroutines
       whose pattern names have first keys key1 ... keyn.

   Adding new regular expressions
       You can add your own regular expressions to the %RE hash at run-time,
       using the exportable "pattern" subroutine. It expects a hash-like list
       of key/value pairs that specify the behaviour of the pattern. The
       various possible argument pairs are:

       "name => [ @list ]"
           A required argument that specifies the name of the pattern, and any
           flags it may take, via a reference to a list of strings. For
           example:

               pattern name => [qw( line of -char )],
                       # other args here
                       ;

           This specifies an entry $RE{line}{of}, which may take a "-char"
           flag.

           Flags may also be specified with a default value, which is then
           used whenever the flag is specified without an explicit value (but
           not when the flag is omitted). For example:

               pattern name => [qw( line of -char=_ )],
                       # default char is '_'
                       # other args here
                       ;

       "create => $sub_ref_or_string"
           A required argument that specifies either a string that is to be
           returned as the pattern:

               pattern name   => [qw( line of underscores )],
                       create => q/(?:^_+$)/
                       ;

           or a reference to a subroutine that will be called to create the
           pattern:

               pattern name   => [qw( line of -char=_ )],
                       create => sub {
                                     my ($self, $flags) = @_;
                                     my $char = quotemeta $flags->{-char};
                                     # Double quotes so $char is interpolated.
                                     return "(?:^$char+\$)";
                                 },
                       ;

           If the subroutine version is used, the subroutine will be called
           with three arguments: a reference to the pattern object itself, a
           reference to a hash containing the flags and their values, and a
           reference to an array containing the non-flag keys.

           Whatever the subroutine returns is stringified as the pattern.

           No matter how the pattern is created, it is immediately
           postprocessed to include or exclude capturing parentheses
           (according to the value of the "-keep" flag). To specify such
           "optional" capturing parentheses within the regular expression
           associated with "create", use the notation "(?k:...)". Any
           parentheses of this type will be converted to "(...)" when the
           "-keep" flag is specified, or "(?:...)" when it is not. It is a
           Regexp::Common convention that the outermost capturing parentheses
           always capture the entire pattern, but this is not enforced.

       "matches => $sub_ref"
           An optional argument that specifies a subroutine that is to be
           called when the "$RE{...}->matches(...)" method of this pattern is
           invoked.

           The subroutine should expect two arguments: a reference to the
           pattern object itself, and the string to be matched against.

           It should return the same types of values as a "m/.../" does.

               pattern name    => [qw( line of -char )],
                       create  => sub {...},
                       matches => sub {
                                      my ($self, $str) = @_;
                                      $str !~ /[^$self->{flags}{-char}]/;
                                  },
                       ;

       "subs => $sub_ref"
           An optional argument that specifies a subroutine that is to be
           called when the "$RE{...}->subs(...)" method of this pattern is
           invoked.

           The subroutine should expect three arguments: a reference to the
           pattern object itself, the string to be changed, and the value to
           be substituted into it. The third argument may be "undef",
           indicating the default substitution is required.

           The subroutine should return the same types of values as an
           "s/.../.../" does.

           For example:

               pattern name   => [ 'lineof', '-char=_' ],
                       create => sub {...},
                       subs   => sub {
                                     my ($self, $str, $ignore_replacement) = @_;
                                     # \Q...\E keeps characters such as '*' literal.
                                     $_[1] =~ s/^\Q$self->{flags}{-char}\E+$//g;
                                 },
                       ;

           Note that such a subroutine will almost always need to modify $_[1]
           directly, because the "subs" method returns that string to its
           caller.

       "version => $minimum_perl_version"
           If this argument is given, it specifies the minimum version of perl
           required to use the new pattern. Attempts to use the pattern with
           earlier versions of perl will generate a fatal diagnostic.
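
       The "create" and "subs" arguments described above can be exercised
       together. What follows is a sketch under stated assumptions, not the
       module's own code: the pattern name {demo}{lineof} and the sample
       strings are hypothetical, and the "-char" flag is always passed with
       an explicit value so the example does not depend on how defaults are
       resolved.

```perl
use strict;
use warnings;
use Regexp::Common qw(pattern no_defaults);

# Hypothetical pattern: a line made up of one repeated character.
pattern name   => [qw( demo lineof -char )],
        create => sub {
            my ($self, $flags) = @_;
            my $char = quotemeta $flags->{-char};
            # (?k:...) becomes (...) under -keep and (?:...) otherwise.
            return "(?k:^$char+\$)";
        },
        subs   => sub {
            my ($self, $str, $replacement) = @_;
            $replacement = '' unless defined $replacement;
            my $char = quotemeta $self->{flags}{-char};
            # Modify $_[1] in place; the subs() method returns that string.
            $_[1] =~ s/^$char+$/$replacement/;
        };

my $stars = $RE{demo}{lineof}{-char => '*'};

my $plain    = "***" =~ $stars ? 1 : 0;
my $kept     = "***" =~ $RE{demo}{lineof}{-char => '*'}{-keep} ? $1 : undef;
my $replaced = $stars->subs("***", "---");
my $emptied  = $stars->subs("***");          # no replacement given

print "plain=$plain kept=$kept replaced=$replaced emptied=[$emptied]\n";
```

       Loading with no_defaults keeps the example independent of the bundled
       pattern sets; only the user-defined entry is present in %RE.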

   Loading specific sets of patterns.
       By default, all the sets of patterns listed below are made available.
       However, it is possible to indicate which sets of patterns should be
       made available: the wanted sets should be given as arguments to "use".
       Alternatively, it is also possible to indicate which sets of patterns
       should not be made available: those sets are given as arguments to
       the "use" statement, but are preceded by an exclamation mark. The
       argument no_defaults indicates that none of the default patterns
       should be made available. This is useful, for instance, if all you
       want is the "pattern()" subroutine.

       Examples:

        use Regexp::Common qw /comment number/;  # Comment and number patterns.
        use Regexp::Common qw /no_defaults/;     # Don't load any patterns.
        use Regexp::Common qw /!delimited/;      # All, but delimited patterns.

       It's also possible to load your own set of patterns. If you have a
       module "Regexp::Common::my_patterns" that makes patterns available, you
       can have it made available with

        use Regexp::Common qw /my_patterns/;

       Note that the default patterns will still be made available. The
       defaults that are not mentioned are left out only if you use
       no_defaults or mention one of the default sets explicitly.
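
       The effect of naming a set explicitly can be probed at run-time. This
       is a hedged sketch: it assumes a fresh interpreter in which nothing
       else has loaded the whitespace set, and it relies on an unknown
       pattern raising an error when it is stringified.

```perl
use strict;
use warnings;
use Regexp::Common qw(number);   # naming one default set skips the others

# The requested set is available.
my $number_ok = "3.14" =~ /^$RE{num}{real}$/ ? 1 : 0;

# The whitespace set was not requested, so building its pattern is expected
# to fail here (unless some other code has already loaded that set).
my $ws_available = eval { my $p = "$RE{ws}{crop}"; 1 } ? 1 : 0;

print "number_ok=$number_ok ws_available=$ws_available\n";
```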

   List of available patterns
       The patterns listed below are currently available. Each set of patterns
       has its own manual page describing the details. For each pattern set
       named name, the manual page Regexp::Common::name describes the details.

       Currently available are:

       Regexp::Common::balanced
           Provides regexes for strings with balanced parenthesized
           delimiters.

       Regexp::Common::comment
           Provides regexes for comments of various languages (43 languages
           currently).

       Regexp::Common::delimited
           Provides regexes for delimited strings.

       Regexp::Common::lingua
           Provides regexes for palindromes.

       Regexp::Common::list
           Provides regexes for lists.

       Regexp::Common::net
           Provides regexes for IPv4 addresses and MAC addresses.

       Regexp::Common::number
           Provides regexes for numbers (integers and reals).

       Regexp::Common::profanity
           Provides regexes for profanity.

       Regexp::Common::whitespace
           Provides regexes for leading and trailing whitespace.

       Regexp::Common::zip
           Provides regexes for zip codes.

   Forthcoming patterns and features
       Future releases of the module will also provide patterns for the
       following:

           * email addresses
           * HTML/XML tags
           * more numerical matchers
           * mail headers (including multiline ones)
           * more URLs
           * telephone numbers of various countries
           * currency (universal 3 letter format, Latin-1, currency names)
           * dates
           * binary formats (e.g. UUencoded, MIMEd)

       If you have other patterns or pattern generators that you think would
       be generally useful, please send them to the maintainer -- preferably
       as source code using the "pattern" subroutine. Submissions that include
       a set of tests will be especially welcome.

DIAGNOSTICS
       "Can't export unknown subroutine %s"
           The subroutine-based interface didn't recognize the requested
           subroutine. Often caused by a spelling mistake or an incompletely
           specified name.

       "Can't create unknown regex: $RE{...}"
           Regexp::Common doesn't have a generator for the requested pattern.
           Often indicates a misspelt or missing parameter.

       "Perl %f does not support the pattern $RE{...}. You need Perl %f or
       later"
           The requested pattern requires advanced regex features (e.g.
           recursion) that are not available in your version of Perl. Time to
           upgrade.

       "pattern() requires argument: name => [ @list ]"
           Every user-defined pattern specification must have a name.

       "pattern() requires argument: create => $sub_ref_or_string"
           Every user-defined pattern specification must provide a pattern
           creation mechanism: either a pattern string or a reference to a
           subroutine that returns the pattern string.

       "Base must be between 1 and 36"
           The $RE{num}{real}{-base=>'N'} pattern uses the characters
           [0-9A-Z] to represent the digits of various bases. Hence it only
           produces regular expressions for bases up to hexatrigesimal
           (base 36).

       "Must specify delimiter in $RE{delimited}"
           The pattern has no default delimiter. You need to write:
           $RE{delimited}{-delim=>'X'} for some character X.
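
       These diagnostics are raised as fatal errors, so they can be trapped
       with "eval". The sketch below provokes the "unknown regex" message on
       purpose; the key {reel} is a deliberate, hypothetical misspelling of
       {real}.

```perl
use strict;
use warnings;
use Regexp::Common qw(number);

# The error is raised when the pattern is stringified, not when the hash
# keys are looked up, so the interpolation has to happen inside the eval.
my $pattern = eval { "$RE{num}{reel}" };
my $error   = $@;

if (!defined $pattern) {
    print "caught: $error";
}
```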

ACKNOWLEDGEMENTS
       Deepest thanks to the many people who have encouraged and contributed
       to this project, especially: Elijah, Jarkko, Tom, Nat, Ed, and Vivek.

       Further thanks go to: Alexandr Ciornii, Blair Zajac, Bob Stockdale,
       Charles Thomas, Chris Vertonghen, the CPAN Testers, David Hand, Fany,
       Geoffrey Leach, Hermann-Marcus Behrens, Jerome Quelin, Jim Cromie, Lars
       Wilke, Linda Julien, Mike Arms, Mike Castle, Mikko, Murat Uenalan,
       Rafaeel Garcia-Suarez, Ron Savage, Sam Vilain, Slaven Rezic, Smylers,
       Tim Maher, and all the others I've forgotten.

AUTHOR
       Damian Conway (damian@conway.org)

MAINTENANCE
       This package is maintained by Abigail (regexp-common@abigail.be).

BUGS AND IRRITATIONS
       Bound to be plenty.

       For a start, there are many common regexes missing. Send them in to
       regexp-common@abigail.be.

       There are some POD issues when installing this module using a pre-5.6.0
       perl; some manual pages may not install, or may not install correctly
       using a perl that is that old. You might consider upgrading your perl.

LICENSE AND COPYRIGHT
       This software is Copyright (c) 2001 - 2009, Damian Conway and Abigail.

       This module is free software, and may be used under any of the
       following licenses:

        1) The Perl Artistic License.     See the file COPYRIGHT.AL.
        2) The Perl Artistic License 2.0. See the file COPYRIGHT.AL2.
        3) The BSD Licence.               See the file COPYRIGHT.BSD.
        4) The MIT Licence.               See the file COPYRIGHT.MIT.



perl v5.12.0                      2010-01-02                 Regexp::Common(3)