Regexp::Common(3) User Contributed Perl Documentation Regexp::Common(3)
2
3
4
NAME
    Regexp::Common - Provide commonly requested regular expressions

SYNOPSIS
    # STANDARD USAGE

    use Regexp::Common;

    while (<>) {
        /$RE{num}{real}/                and print q{a number};
        /$RE{quoted}/                   and print q{a ['"`] quoted string};
        # m!...! is used here because the flag value itself contains a '/'
        m!$RE{delimited}{-delim=>'/'}!  and print q{a /.../ sequence};
        /$RE{balanced}{-parens=>'()'}/  and print q{balanced parentheses};
        /$RE{profanity}/                and print q{a #*@%-ing word};
    }
20
    # SUBROUTINE-BASED INTERFACE

    use Regexp::Common 'RE_ALL';

    while (<>) {
        $_ =~ RE_num_real()              and print q{a number};
        $_ =~ RE_quoted()                and print q{a ['"`] quoted string};
        $_ =~ RE_delimited(-delim=>'/')  and print q{a /.../ sequence};
        $_ =~ RE_balanced(-parens=>'()') and print q{balanced parentheses};
        $_ =~ RE_profanity()             and print q{a #*@%-ing word};
    }
32
33 # IN-LINE MATCHING...
34
35 if ( $RE{num}{int}->matches($text) ) {...}
36
37 # ...AND SUBSTITUTION
38
39 my $cropped = $RE{ws}{crop}->subs($uncropped);
40
41 # ROLL-YOUR-OWN PATTERNS
42
    use Regexp::Common 'pattern';

    pattern name   => ['name', 'mine'],
            create => '(?i:J[.]?\s+A[.]?\s+Perl-Hacker)',
            ;

    my $name_matcher = $RE{name}{mine};

    pattern name    => [ 'lineof', '-char=_' ],
            create  => sub {
                # Called with the pattern object, then the flags hash.
                my ($self, $flags) = @_;
                my $char = quotemeta $flags->{-char};
                # Double quotes so that $char interpolates;
                # \$ keeps the end-of-line anchor literal.
                return "(?:^$char+\$)";
            },
            matches => sub {
                my ($self, $str) = @_;
                return $str !~ /[^$self->{flags}{-char}]/;
            },
            subs    => sub {
                my ($self, $str, $replacement) = @_;
                $_[1] =~ s/^$self->{flags}{-char}+$//g;
            },
            ;

    my $asterisks = $RE{lineof}{-char=>'*'};
68
69 # DECIDING WHICH PATTERNS TO LOAD.
70
71 use Regexp::Common qw /comment number/; # Comment and number patterns.
72 use Regexp::Common qw /no_defaults/; # Don't load any patterns.
73 use Regexp::Common qw /!delimited/; # All, but delimited patterns.
74
DESCRIPTION
By default, this module exports a single hash (%RE) that stores or
generates commonly needed regular expressions (see "List of available
patterns").
79
80 There is an alternative, subroutine-based syntax described in "Subrou‐
81 tine-based interface".
82
83 General syntax for requesting patterns
84
85 To access a particular pattern, %RE is treated as a hierarchical hash
86 of hashes (of hashes...), with each successive key being an identifier.
87 For example, to access the pattern that matches real numbers, you spec‐
88 ify:
89
90 $RE{num}{real}
91
92 and to access the pattern that matches integers:
93
94 $RE{num}{int}
95
96 Deeper layers of the hash are used to specify flags: arguments that
97 modify the resulting pattern in some way. The keys used to access these
98 layers are prefixed with a minus sign and may have a value; if a value
99 is given, it's done by using a multidimensional key. For example, to
100 access the pattern that matches base-2 real numbers with embedded com‐
101 mas separating groups of three digits (e.g. 10,101,110.110101101):
102
103 $RE{num}{real}{-base => 2}{-sep => ','}{-group => 3}
104
105 Through the magic of Perl, these flag layers may be specified in any
106 order (and even interspersed through the identifier keys!) so you
107 could get the same pattern with:
108
109 $RE{num}{real}{-sep => ','}{-group => 3}{-base => 2}
110
111 or:
112
113 $RE{num}{-base => 2}{real}{-group => 3}{-sep => ','}
114
115 or even:
116
117 $RE{-base => 2}{-group => 3}{-sep => ','}{num}{real}
118
119 etc.
120
Note, however, that the relative order of the identifier keys is
significant. That is:
123
124 $RE{list}{set}
125
126 would not be the same as:
127
128 $RE{set}{list}
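
The ordering rules above can be checked with a minimal sketch, assuming
Regexp::Common is installed with its default number patterns loaded. It
compares the stringified patterns instead of relying on any particular
regex text; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Regexp::Common;

# Flags (keys that start with '-') may appear in any position.
my $in_order = $RE{num}{real}{-base => 2}{-sep => ','}{-group => 3};
my $shuffled = $RE{-base => 2}{-group => 3}{-sep => ','}{num}{real};

# Interpolating a pattern object into a string yields the regex text.
print "same pattern\n" if "$in_order" eq "$shuffled";

# The example number from the text, matched as a whole string.
print "matches\n" if '10,101,110.110101101' =~ /^$in_order$/;
```

Only the flag keys can be moved around freely; the identifier keys
("num", then "real") keep their relative order in both spellings.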
129
130 Flag syntax
131
132 In versions prior to 2.113, flags could also be written as
133 "{"-flag=value"}". This no longer works, although "{"-flag$;value"}"
134 still does. However, "{-flag => 'value'}" is the preferred syntax.
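
The "-flag$;value" form works because Perl joins a multi-element hash
subscript with the subscript separator $;. The sketch below uses an
ordinary hash (not %RE) to show what key a "{-flag => 'value'}"
subscript produces; the hash name %layer is purely illustrative.

```perl
use strict;
use warnings;

my %layer;
$layer{-base => 2} = 'flag with a value';     # two-element subscript
$layer{-keep}      = 'flag without a value';  # single-element subscript

my $sep = $;;   # the subscript separator, "\034" unless it was changed

for my $key (sort keys %layer) {
    # Split the stored key back into the flag name and its value, if any.
    my ($flag, $value) = split /\Q$sep\E/, $key, 2;
    printf "%-6s value=%s\n", $flag, defined $value ? $value : '(none)';
}
```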
135
136 Universal flags
137
Normally, flags are specific to a single pattern. However, there are
two flags that all patterns may specify.
140
141 "-keep"
142 By default, the patterns provided by %RE contain no capturing
143 parentheses. However, if the "-keep" flag is specified (it requires
144 no value) then any significant substrings that the pattern matches
145 are captured. For example:
146
147 if ($str =~ $RE{num}{real}{-keep}) {
148 $number = $1;
149 $whole = $3;
150 $decimals = $5;
151 }
152
153 Special care is needed if a "kept" pattern is interpolated into a
154 larger regular expression, as the presence of other capturing
155 parentheses is likely to change the "number variables" into which
156 significant substrings are saved.
157
158 See also "Adding new regular expressions", which describes how to
159 create new patterns with "optional" capturing brackets that respond
160 to "-keep".
161
162 "-i"
    Some patterns or subpatterns only match lowercase or uppercase
    letters. If one wants to do case-insensitive matching, one option
    is to use the "/i" regexp modifier, or the special sequence "(?i)".
    But if the functional interface is used, one does not have this
    option. The "-i" switch solves this problem; by using it, the
    pattern will do case-insensitive matching.
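
The renumbering hazard described under "-keep" can be seen in a minimal
sketch, assuming Regexp::Common is installed. It relies only on the
convention that the outermost kept group captures the entire match; the
sample strings are made up for illustration.

```perl
use strict;
use warnings;
use Regexp::Common;

my $int = $RE{num}{int}{-keep};

# Used on its own, $1 holds the entire matched integer.
if ('-42' =~ /^$int$/) {
    print "alone: \$1 = $1\n";
}

# Placed after another capturing group, every kept group shifts by one:
# $1 is now the label and $2 is the entire integer.
if ('offset=-42' =~ /^(\w+)=$int$/) {
    print "embedded: \$1 = $1, \$2 = $2\n";
}
```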
169
170 OO interface and inline matching/substitution
171
172 The patterns returned from %RE are objects, so rather than writing:
173
174 if ($str =~ /$RE{some}{pattern}/ ) {...}
175
176 you can write:
177
178 if ( $RE{some}{pattern}->matches($str) ) {...}
179
180 For matching this would seem to have no great advantage apart from
181 readability (but see below).
182
183 For substitutions, it has other significant benefits. Frequently you
184 want to perform a substitution on a string without changing the origi‐
185 nal. Most people use this:
186
187 $changed = $original;
188 $changed =~ s/$RE{some}{pattern}/$replacement/;
189
190 The more adept use:
191
192 ($changed = $original) =~ s/$RE{some}{pattern}/$replacement/;
193
Regexp::Common allows you to write this:
195
196 $changed = $RE{some}{pattern}->subs($original=>$replacement);
197
198 Apart from reducing precedence-angst, this approach has the added
199 advantages that the substitution behaviour can be optimized from the
200 regular expression, and the replacement string can be provided by
201 default (see "Adding new regular expressions").
202
203 For example, in the implementation of this substitution:
204
205 $cropped = $RE{ws}{crop}->subs($uncropped);
206
207 the default empty string is provided automatically, and the substitu‐
208 tion is optimized to use:
209
210 $uncropped =~ s/^\s+//;
211 $uncropped =~ s/\s+$//;
212
213 rather than:
214
    $uncropped =~ s/^\s+|\s+$//g;
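
The two forms have the same effect on the string. The following sketch
is plain Perl, independent of the module, and simply confirms that on
one sample value while leaving the original untouched; the variable
names are illustrative.

```perl
use strict;
use warnings;

my $uncropped = "  \t some text \n";

# Two anchored passes, one for each end of the string.
(my $two_pass = $uncropped) =~ s/^\s+//;
$two_pass =~ s/\s+$//;

# One pass with alternation; /g is needed so that both ends are reached.
(my $one_pass = $uncropped) =~ s/^\s+|\s+$//g;

print "identical results\n" if $two_pass eq $one_pass;
```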
216
217 Subroutine-based interface
218
219 The hash-based interface was chosen because it allows regexes to be
220 effortlessly interpolated, and because it also allows them to be "cur‐
221 ried". For example:
222
223 my $num = $RE{num}{int};
224
225 my $commad = $num->{-sep=>','}{-group=>3};
226 my $duodecimal = $num->{-base=>12};
227
228 However, the use of tied hashes does make the access to Regexp::Common
229 patterns slower than it might otherwise be. In contexts where impa‐
230 tience overrules laziness, Regexp::Common provides an additional sub‐
231 routine-based interface.
232
233 For each (sub-)entry in the %RE hash ($RE{key1}{key2}{etc}), there is a
234 corresponding exportable subroutine: "RE_key1_key2_etc()". The name of
235 each subroutine is the underscore-separated concatenation of the non-
236 flag keys that locate the same pattern in %RE. Flags are passed to the
237 subroutine in its argument list. Thus:
238
239 use Regexp::Common qw( RE_ws_crop RE_num_real RE_profanity );
240
241 $str =~ RE_ws_crop() and die "Surrounded by whitespace";
242
243 $str =~ RE_num_real(-base=>8, -sep=>" ") or next;
244
245 $offensive = RE_profanity(-keep);
246 $str =~ s/$offensive/$bad{$1}++; "<expletive deleted>"/ge;
247
248 Note that, unlike the hash-based interface (which returns objects),
249 these subroutines return ordinary "qr"'d regular expressions. Hence
250 they do not curry, nor do they provide the OO match and substitution
251 inlining described in the previous section.
252
253 It is also possible to export subroutines for all available patterns
254 like so:
255
256 use Regexp::Common 'RE_ALL';
257
258 Or you can export all subroutines with a common prefix of keys like so:
259
260 use Regexp::Common 'RE_num_ALL';
261
which will export "RE_num_int" and "RE_num_real" (and if you have
created more patterns whose first key is num, those will be exported as
well). In general, RE_key1_..._keyn_ALL will export all subroutines
whose pattern names have first keys key1 ... keyn.
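
The naming rule for these subroutines can be sketched in plain Perl.
The helper re_sub_name below is hypothetical and not part of the module;
it only mirrors the rule stated above: drop the flag keys, then join the
remaining keys with underscores after an "RE" prefix.

```perl
use strict;
use warnings;

# Hypothetical helper: derive the subroutine name for a list of %RE keys.
sub re_sub_name {
    my @keys = @_;
    my @identifiers = grep { !/^-/ } @keys;   # flags become arguments instead
    return join '_', 'RE', @identifiers;
}

print re_sub_name(qw(num real)), "\n";             # RE_num_real
print re_sub_name(qw(ws crop)), "\n";              # RE_ws_crop
print re_sub_name('num', '-base', 'real'), "\n";   # RE_num_real
```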
266
267 Adding new regular expressions
268
269 You can add your own regular expressions to the %RE hash at run-time,
270 using the exportable "pattern" subroutine. It expects a hash-like list
271 of key/value pairs that specify the behaviour of the pattern. The vari‐
272 ous possible argument pairs are:
273
274 "name => [ @list ]"
275 A required argument that specifies the name of the pattern, and
276 any flags it may take, via a reference to a list of strings.
277 For example:
278
279 pattern name => [qw( line of -char )],
280 # other args here
281 ;
282
283 This specifies an entry $RE{line}{of}, which may take a "-char"
284 flag.
285
286 Flags may also be specified with a default value, which is then
287 used whenever the flag is omitted, or specified without an
288 explicit value. For example:
289
290 pattern name => [qw( line of -char=_ )],
291 # default char is '_'
292 # other args here
293 ;
294
295 "create => $sub_ref_or_string"
296 A required argument that specifies either a string that is to
297 be returned as the pattern:
298
299 pattern name => [qw( line of underscores )],
300 create => q/(?:^_+$)/
301 ;
302
303 or a reference to a subroutine that will be called to create
304 the pattern:
305
        pattern name   => [qw( line of -char=_ )],
                create => sub {
                    my ($self, $flags) = @_;
                    my $char = quotemeta $flags->{-char};
                    # Double quotes so that $char interpolates;
                    # \$ keeps the end-of-line anchor literal.
                    return "(?:^$char+\$)";
                },
                ;
313
314 If the subroutine version is used, the subroutine will be
315 called with three arguments: a reference to the pattern object
316 itself, a reference to a hash containing the flags and their
317 values, and a reference to an array containing the non-flag
318 keys.
319
320 Whatever the subroutine returns is stringified as the pattern.
321
322 No matter how the pattern is created, it is immediately post‐
323 processed to include or exclude capturing parentheses (accord‐
324 ing to the value of the "-keep" flag). To specify such
325 "optional" capturing parentheses within the regular expression
326 associated with "create", use the notation "(?k:...)". Any
327 parentheses of this type will be converted to "(...)" when the
328 "-keep" flag is specified, or "(?:...)" when it is not. It is
329 a Regexp::Common convention that the outermost capturing paren‐
330 theses always capture the entire pattern, but this is not
331 enforced.
332
333 "matches => $sub_ref"
334 An optional argument that specifies a subroutine that is to be
335 called when the "$RE{...}->matches(...)" method of this pattern
336 is invoked.
337
338 The subroutine should expect two arguments: a reference to the
339 pattern object itself, and the string to be matched against.
340
341 It should return the same types of values as a "m/.../" does.
342
343 pattern name => [qw( line of -char )],
344 create => sub {...},
345 matches => sub {
346 my ($self, $str) = @_;
347 $str !~ /[^$self->{flags}{-char}]/;
348 },
349 ;
350
351 "subs => $sub_ref"
352 An optional argument that specifies a subroutine that is to be
353 called when the "$RE{...}->subs(...)" method of this pattern is
354 invoked.
355
356 The subroutine should expect three arguments: a reference to
357 the pattern object itself, the string to be changed, and the
358 value to be substituted into it. The third argument may be
359 "undef", indicating the default substitution is required.
360
361 The subroutine should return the same types of values as an
362 "s/.../.../" does.
363
364 For example:
365
366 pattern name => [ 'lineof', '-char=_' ],
367 create => sub {...},
368 subs => sub {
369 my ($self, $str, $ignore_replacement) = @_;
370 $_[1] =~ s/^$self->{flags}{-char}+$//g;
371 },
372 ;
373
374 Note that such a subroutine will almost always need to modify
375 $_[1] directly.
376
377 "version => $minimum_perl_version"
378 If this argument is given, it specifies the minimum version of
379 perl required to use the new pattern. Attempts to use the pat‐
380 tern with earlier versions of perl will generate a fatal diag‐
381 nostic.
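
The "name" and "create" arguments and the "(?k:...)" notation can be
combined as in the following minimal sketch, which assumes
Regexp::Common is installed. The pattern name and the sample strings
are illustrative; "no_defaults" is used only to avoid loading the
standard pattern sets.

```perl
use strict;
use warnings;
use Regexp::Common qw(pattern no_defaults);

pattern name   => [qw( line of -char=_ )],
        create => sub {
            my ($self, $flags) = @_;
            my $char = quotemeta $flags->{-char};
            # (?k:...) becomes (...) under -keep and (?:...) otherwise.
            return "(?k:^$char+\$)";
        },
        ;

print "default char matches\n"  if '____' =~ $RE{line}{of};
print "explicit char matches\n" if '****' =~ $RE{line}{of}{-char => '*'};

if ('====' =~ $RE{line}{of}{-char => '='}{-keep}) {
    print "kept: $1\n";
}
```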
382
383 Loading specific sets of patterns.
384
By default, all the sets of patterns listed below are made available.
However, it is possible to indicate which sets of patterns should be
made available: the wanted sets should be given as arguments to "use".
Alternatively, it is also possible to indicate which sets of patterns
should not be made available: those sets are given as arguments to the
"use" statement, but are preceded with an exclamation mark. The
argument no_defaults indicates that none of the default patterns should
be made available. This is useful, for instance, if all you want is the
"pattern()" subroutine.
394
395 Examples:
396
397 use Regexp::Common qw /comment number/; # Comment and number patterns.
398 use Regexp::Common qw /no_defaults/; # Don't load any patterns.
399 use Regexp::Common qw /!delimited/; # All, but delimited patterns.
400
401 It's also possible to load your own set of patterns. If you have a
402 module "Regexp::Common::my_patterns" that makes patterns available,
403 you can have it made available with
404
405 use Regexp::Common qw /my_patterns/;
406
Note that the default patterns will still be made available. The
defaults you do not mention are left out only if you use no_defaults or
mention one of the default sets explicitly.
410
411 List of available patterns
412
413 The patterns listed below are currently available. Each set of pat‐
414 terns has its own manual page describing the details. For each pat‐
415 tern set named name, the manual page Regexp::Common::name describes
416 the details.
417
418 Currently available are:
419
420 Regexp::Common::balanced
421 Provides regexes for strings with balanced parenthesized delim‐
422 iters.
423
424 Regexp::Common::comment
425 Provides regexes for comments of various languages (43 lan‐
426 guages currently).
427
428 Regexp::Common::delimited
429 Provides regexes for delimited strings.
430
431 Regexp::Common::lingua
432 Provides regexes for palindromes.
433
434 Regexp::Common::list
435 Provides regexes for lists.
436
437 Regexp::Common::net
438 Provides regexes for IPv4 addresses and MAC addresses.
439
440 Regexp::Common::number
441 Provides regexes for numbers (integers and reals).
442
443 Regexp::Common::profanity
444 Provides regexes for profanity.
445
446 Regexp::Common::whitespace
447 Provides regexes for leading and trailing whitespace.
448
449 Regexp::Common::zip
450 Provides regexes for zip codes.
451
452 Forthcoming patterns and features
453
454 Future releases of the module will also provide patterns for the
455 following:
456
457 * email addresses
458 * HTML/XML tags
459 * more numerical matchers,
460 * mail headers (including multiline ones),
461 * more URLS
462 * telephone numbers of various countries
463 * currency (universal 3 letter format, Latin-1, currency names)
464 * dates
465 * binary formats (e.g. UUencoded, MIMEd)
466
467 If you have other patterns or pattern generators that you think
468 would be generally useful, please send them to the maintainer --
469 preferably as source code using the "pattern" subroutine. Submis‐
470 sions that include a set of tests will be especially welcome.
471
DIAGNOSTICS
"Can't export unknown subroutine %s"
474 The subroutine-based interface didn't recognize the requested sub‐
475 routine. Often caused by a spelling mistake or an incompletely
476 specified name.
477
478 "Can't create unknown regex: $RE{...}"
479 Regexp::Common doesn't have a generator for the requested pattern.
    Often indicates a misspelt or missing parameter.
481
482 "Perl %f does not support the pattern $RE{...}. You need Perl %f or
483 later"
    The requested pattern requires advanced regex features (e.g.
    recursion) that are not available in your version of Perl. Time to
    upgrade.
486
487 "pattern() requires argument: name => [ @list ]"
488 Every user-defined pattern specification must have a name.
489
490 "pattern() requires argument: create => $sub_ref_or_string"
491 Every user-defined pattern specification must provide a pattern
492 creation mechanism: either a pattern string or a reference to a
493 subroutine that returns the pattern string.
494
495 "Base must be between 1 and 36"
    The $RE{num}{real}{-base=>'N'} pattern uses the characters [0-9A-Z]
    to represent the digits of various bases. Hence it only produces
    regular expressions for bases up to 36 (hexatrigesimal).
499
500 "Must specify delimiter in $RE{delimited}"
    The pattern has no default delimiter. You need to write:
    $RE{delimited}{-delim=>'X'} for some character X
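
The limit in "Base must be between 1 and 36" follows from the digit
alphabet: ten digits plus twenty-six letters. The sketch below is plain
Perl with a hypothetical helper, digits_for_base, that is not part of
the module; it only illustrates how the available digits run out after
base 36.

```perl
use strict;
use warnings;

# Hypothetical helper: the digit characters available for a given base.
sub digits_for_base {
    my ($base) = @_;
    die "Base must be between 1 and 36\n" if $base < 1 || $base > 36;
    my @alphabet = (0 .. 9, 'A' .. 'Z');   # 36 characters in total
    return join '', @alphabet[0 .. $base - 1];
}

print digits_for_base(2),  "\n";   # 01
print digits_for_base(16), "\n";   # 0123456789ABCDEF
```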
503
505 Deepest thanks to the many people who have encouraged and contributed
506 to this project, especially: Elijah, Jarkko, Tom, Nat, Ed, and Vivek.
507
509 $Log: Common.pm,v $
510 Revision 2.120 2005/03/16 00:24:45 abigail
511 Load Carp only on demand
512
513 Revision 2.119 2005/01/01 16:35:14 abigail
514 - Updated copyright notice. New release.
515
516 Revision 2.118 2004/12/14 23:17:57 abigail
517 Fixed the generic OO routines.
518
519 Revision 2.117 2004/06/30 15:01:35 abigail
520 Pod nits. (Jim Cromie)
521
522 Revision 2.116 2004/06/30 09:37:36 abigail
523 New version
524
525 Revision 2.115 2004/06/09 21:58:01 abigail
526 - 'SEN'
527 - New release.
528
529 Revision 2.114 2003/05/25 21:34:56 abigail
530 POD nits from Bryan C. Warnock
531
532 Revision 2.113 2003/04/02 21:23:48 abigail
533 Removed anything related to $; being '='
534
535 Revision 2.112 2003/03/25 23:27:27 abigail
536 New release
537
538 Revision 2.111 2003/03/12 22:37:13 abigail
539 + The -i switch.
540 + New release.
541
542 Revision 2.110 2003/02/21 14:55:31 abigail
543 New release
544
545 Revision 2.109 2003/02/10 21:36:58 abigail
546 New release
547
548 Revision 2.108 2003/02/09 21:45:07 abigail
549 New release
550
551 Revision 2.107 2003/02/07 15:23:03 abigail
552 New release
553
554 Revision 2.106 2003/02/02 17:44:58 abigail
555 New release
556
557 Revision 2.105 2003/02/02 03:20:32 abigail
558 New release
559
560 Revision 2.104 2003/01/24 15:43:40 abigail
561 New release
562
563 Revision 2.103 2003/01/23 02:19:01 abigail
564 New release
565
566 Revision 2.102 2003/01/22 17:32:34 abigail
567 New release
568
569 Revision 2.101 2003/01/21 23:52:18 abigail
570 POD fix.
571
572 Revision 2.100 2003/01/21 23:19:40 abigail
573 The whole world understands RCS/CVS version numbers, that 1.9 is an
574 older version than 1.10. Except CPAN. Curse the idiot(s) who think
575 that version numbers are floats (in which universe do floats have
576 more than one decimal dot?).
577 Everything is bumped to version 2.100 because CPAN couldn't deal
578 with the fact one file had version 1.10.
579
580 Revision 1.30 2003/01/17 13:19:04 abigail
581 New release
582
583 Revision 1.29 2003/01/16 11:08:41 abigail
584 New release
585
586 Revision 1.28 2003/01/01 23:03:53 abigail
587 New distribution
588
589 Revision 1.27 2003/01/01 17:09:07 abigail
590 lingua class added
591
592 Revision 1.26 2002/12/30 23:08:28 abigail
593 New module Regexp::Common::zip
594
595 Revision 1.25 2002/12/27 23:34:44 abigail
596 New release
597
598 Revision 1.24 2002/12/24 00:00:04 abigail
599 New release
600
601 Revision 1.23 2002/11/06 13:50:23 abigail
602 Minor POD changes.
603
604 Revision 1.22 2002/10/01 18:25:46 abigail
605 POD buglets.
606
607 Revision 1.21 2002/09/18 17:46:11 abigail
608 POD Typo fix (Douglas Hunter)
609
610 Revision 1.20 2002/08/27 17:04:29 abigail
611 VERSION is now extracted from the CVS revision number.
612
613 Revision 1.19 2002/08/06 14:46:49 abigail
614 Upped version number to 0.09.
615
616 Revision 1.18 2002/08/06 13:50:08 abigail
617 - Added HISTORY section with CVS log.
618 - Upped version number to 0.08.
619
620 Revision 1.17 2002/08/05 12:21:46 abigail
621 Upped version number to 0.07.
622
623 Revision 1.16 2002/08/05 12:16:30 abigail
624 Fixed 'Regex::' typo to 'Regexp::' (Found my Mike Castle).
625
626 Revision 1.15 2002/08/04 22:56:02 abigail
627 Upped version number to 0.06.
628
629 Revision 1.14 2002/08/04 19:33:33 abigail
630 Loaded URI by default.
631
632 Revision 1.13 2002/08/01 10:02:42 abigail
633 Upped version number.
634
635 Revision 1.12 2002/07/31 23:26:06 abigail
636 Upped version number.
637
638 Revision 1.11 2002/07/31 13:11:20 abigail
639 Removed URL from the list of default loaded regexes, as this one isn't
640 ready yet.
641
642 Upped the version number to 0.03.
643
644 Revision 1.10 2002/07/29 13:16:38 abigail
645 Introduced 'use strict' (which uncovered a bug, \@non_flags was used
646 when $spec{create} was called instead of \@nonflags).
647
648 Turned warnings on (using local $^W = 1; "use warnings" isn't available
649 in pre 5.6).
650
651 Revision 1.9 2002/07/28 23:02:54 abigail
652 Split out the remaining pattern groups to separate files.
653
654 Fixed a bug in _decache, changed the regex /$fpat=(.+)/ to
655 /$fpat=(.*)/, to be able to distinguish the case of a flag
656 set to the empty string, or a flag without an argument.
657
658 Added 'undef' to @_ in the sub_interface setting to avoid a warning
659 of setting a hash with an odd number of arguments.
660
661 POD fixes.
662
663 Revision 1.8 2002/07/25 23:55:54 abigail
664 Moved balanced, net and URL to separate files.
665
666 Revision 1.7 2002/07/25 20:01:40 abigail
667 Modified import() to deal with factoring out groups of related regexes.
668 Factored out comments into Common/comment.
669
670 Revision 1.6 2002/07/23 21:20:43 abigail
671 Upped version number to 0.02.
672
673 Revision 1.5 2002/07/23 21:14:55 abigail
674 Added $RE{comment}{HTML}.
675
676 Revision 1.4 2002/07/23 17:01:09 abigail
677 Added lines about new maintainer, and an email address to submit bugs
678 and new regexes to.
679
680 Revision 1.3 2002/07/23 13:58:58 abigail
681 Changed various occurences of C<... => ...> into C<< ... => ... >>.
682
683 Revision 1.2 2002/07/23 12:27:07 abigail
684 Line 733 was missing the closing > of a C<> in the POD.
685
686 Revision 1.1 2002/07/23 12:22:51 abigail
687 Initial revision
688
AUTHOR
Damian Conway (damian@conway.org)
691
693 This package is maintained by Abigail (regexp-common@abigail.nl).
694
BUGS
Bound to be plenty.
697
698 For a start, there are many common regexes missing. Send them in to
699 regexp-common@abigail.nl.
700
702 Copyright (c) 2001 - 2005, Damian Conway and Abigail. All Rights
703 Reserved. This module is free software. It may be used, redistributed
704 and/or modified under the terms of the Perl Artistic License
705 (see http://www.perl.com/perl/misc/Artistic.html)
706
707
708
perl v5.8.8 2003-03-23 Regexp::Common(3)