Assemble(3) User Contributed Perl Documentation Assemble(3)

6 Regexp::Assemble - Assemble multiple Regular Expressions into a single
7 RE
8
10 This document describes version 0.34 of Regexp::Assemble, released
11 2008-06-17.
12
    use Regexp::Assemble;

    my $ra = Regexp::Assemble->new;
    $ra->add( 'ab+c' );
    $ra->add( 'ab+-' );
    $ra->add( 'a\w\d+' );
    $ra->add( 'a\d+' );
    print $ra->re; # prints a(?:\w?\d+|b+[-c])
22
24 Regexp::Assemble takes an arbitrary number of regular expressions and
25 assembles them into a single regular expression (or RE) that matches
26 all that the individual REs match.
27
28 As a result, instead of having a large list of expressions to loop
29 over, a target string only needs to be tested against one expression.
30 This is interesting when you have several thousand patterns to deal
31 with. Serious effort is made to produce the smallest pattern possible.
32
33 It is also possible to track the original patterns, so that you can
34 determine which, among the source patterns that form the assembled
35 pattern, was the one that caused the match to occur.
36
You should realise that large numbers of alternations are processed in
perl's regular expression engine in O(n) time, not O(1). If you are
still having performance problems, you should look at using a trie.
Note that Perl's own regular expression engine will implement trie
optimisations in perl 5.10 (they are already available in perl 5.9.3 if
you want to try them out). "Regexp::Assemble" will do the right thing
when it knows it's running on a trie'd perl. (At least in some
version after this one).
45
Some more examples of usage appear in the accompanying README. If that
file isn't easy to access locally, you can find it on a web repository
such as <http://search.cpan.org/dist/Regexp-Assemble/README> or
<http://cpan.uwinnipeg.ca/htdocs/Regexp-Assemble/README.html>.
52
54 new Creates a new "Regexp::Assemble" object. The following optional
55 key/value parameters may be employed. All keys have a
56 corresponding method that can be used to change the behaviour
57 later on. As a general rule, especially if you're just starting
58 out, you don't have to bother with any of these.
59
60 anchor_*, a family of optional attributes that allow anchors
61 ("^", "\b", "\Z"...) to be added to the resulting pattern.
62
63 flags, sets the "imsx" flags to add to the assembled regular
64 expression. Warning: no error checking is done, you should
65 ensure that the flags you pass are understood by the version of
66 Perl you are using. modifiers exists as an alias, for users
67 familiar with Regexp::List.
68
69 chomp, controls whether the pattern should be chomped before
70 being lexed. Handy if you are reading patterns from a file. By
71 default, "chomp"ing is performed (this behaviour changed as of
72 version 0.24, prior versions did not chomp automatically). See
73 also the "file" attribute and the "add_file" method.
74
75 file, slurp the contents of the specified file and add them to
76 the assembly. Multiple files may be processed by using a list.
77
78 my $r = Regexp::Assemble->new(file => 're.list');
79
80 my $r = Regexp::Assemble->new(file => ['re.1', 're.2']);
81
82 If you really don't want chomping to occur, you will have to
83 set the "chomp" attribute to 0 (zero). You may also want to
84 look at the "input_record_separator" attribute, as well.
85
86 input_record_separator, controls what constitutes a record
87 separator when using the "file" attribute or the "add_file"
88 method. May be abbreviated to rs. See the $/ variable in
89 perlvar.
90
lookahead, controls whether the pattern should contain zero-width
lookahead assertions (for instance,
"(?=[abc])(?:bob|alice|charles)"). This is not activated by
default, because in many circumstances the cost of processing
the assertion itself outweighs the benefit of its ability to
short-circuit a match that will fail. This is sensitive to
the probability of a match succeeding, so if you're worried
about performance you'll have to benchmark a sample population
of targets to see which way the benefits lie.
100
track, controls whether you want to know which of the initial
patterns was the one that matched. See the "matched" method for
more details. Note that for version 5.8 of Perl and below, in this
mode of operation YOU SHOULD BE AWARE OF THE SECURITY
IMPLICATIONS that this entails. Perl 5.10 does not suffer from
any such restriction.
107
108 indent, the number of spaces used to indent nested grouping of
109 a pattern. Use this to produce a pretty-printed pattern. See
110 the "as_string" method for a more detailed explanation.
111
112 pre_filter, allows you to add a callback to enable sanity
113 checks on the pattern being loaded. This callback is triggered
114 before the pattern is split apart by the lexer. In other words,
115 it operates on the entire pattern. If you are loading patterns
116 from a file, this would be an appropriate place to remove
117 comments.
118
119 filter, allows you to add a callback to enable sanity checks on
120 the pattern being loaded. This callback is triggered after the
121 pattern has been split apart by the lexer.
122
123 unroll_plus, controls whether to unroll, for example, "x+" into
124 "x", "x*", which may allow additional reductions in the
125 resulting assembled pattern.
126
reduce, controls whether tail reduction occurs or not. If set,
patterns like "a(?:bc+d|ec+d)" will be reduced to "a[be]c+d".
That is, the end of the pattern in each part of the b... and
e... alternations is identical, and hence is hoisted out of
the alternation and placed after it. On by default. Turn it off
if you're really pressed for short assembly times.
133
134 lex, specifies the pattern used to lex the input lines into
135 tokens. You could replace the default pattern by a more
136 sophisticated version that matches arbitrarily nested
137 parentheses, for example.
138
debug, controls whether copious amounts of output are produced
during the loading stage or the reducing stage of assembly.

    my $ra = Regexp::Assemble->new;
    my $rb = Regexp::Assemble->new( chomp => 1, debug => 3 );
144
145 mutable, controls whether new patterns can be added to the
146 object after the assembled pattern is generated. DEPRECATED.
147
148 This method/attribute will be removed in a future release. It
149 doesn't really serve any purpose, and may be more effectively
150 replaced by cloning an existing "Regexp::Assemble" object and
151 spinning out a pattern from that instead.
152
153 A more detailed explanation of these attributes follows.
154
155 clone Clones the contents of a Regexp::Assemble object and creates a
156 new object (in other words it performs a deep copy).
157
158 If the Storable module is installed, its dclone method will be
159 used, otherwise the cloning will be performed using a pure perl
160 approach.
161
You can use this method to take a snapshot of the patterns that
have been added so far to an object, and generate an assembly
from the clone. Additional patterns may be added to the
original object afterwards.

    my $re = $main->clone->re();
    $main->add( 'another-pattern-\\d+' );
169
170 add(LIST)
Takes a string, breaks it apart into a set of tokens
(respecting meta characters) and inserts the resulting list
into the "R::A" object. It uses a naive regular expression to
lex the string, which may be fooled by complex expressions
(specifically, it will fail to lex nested parenthetical
expressions such as "ab(cd(ef)?gh)ij" correctly). If this is
the case, the end of the string will not be tokenised correctly
and will be returned as one long string.
179
180 On the one hand, this may indicate that the patterns you are
181 trying to feed the "R::A" object are too complex. Simpler
182 patterns might allow the algorithm to work more effectively and
183 perform more reductions in the resulting pattern.
184
185 On the other hand, you can supply your own pattern to perform
186 the lexing if you need. The test suite contains an example of a
187 lexer pattern that will match one level of nested parentheses.
188
Note that there is an internal optimisation that will bypass
much of the lexing process. If a string contains no "\"
(backslash), "[" (open square bracket), "(" (open paren), "?"
(question mark), "+" (plus), "*" (star) or "{" (open curly), a
character split will be performed directly.
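
The fast path described above can be pictured with a small sketch in
plain Perl. This is not the module's own code: the helper name
"naive_tokens" is hypothetical, and the character class simply mirrors
the list of characters given in the paragraph.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Regexp::Assemble: decide whether a
# pattern can be split character by character.
sub naive_tokens {
    my ($pattern) = @_;
    # Any of \ [ ( ? + * { means the pattern needs a real lexer.
    return undef if $pattern =~ /[\\\[(?+*{]/;
    return [ split //, $pattern ];
}

my $plain   = naive_tokens('abc.d');    # none of the listed characters
my $complex = naive_tokens('ab+c');     # contains "+"
print defined $plain   ? "split: @$plain\n"   : "needs lexing\n";
print defined $complex ? "split: @$complex\n" : "needs lexing\n";
```

Note that "." is not in the list, so under this sketch "abc.d" is still
split directly and the dot becomes a token of its own.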
194
195 A list of strings may be supplied, thus you can pass it a file
196 handle of a file opened for reading:
197
198 $re->add( '\d+-\d+-\d+-\d+\.example\.com' );
199 $re->add( <IN> );
200
201 If the file is very large, it may be more efficient to use a
202 "while" loop, to read the file line-by-line:
203
204 $re->add($_) while <IN>;
205
206 The "add" method will chomp the lines automatically. If you do
207 not want this to occur (you want to keep the record separator),
208 then disable "chomp"ing.
209
210 $re->chomp(0);
211 $re->add($_) while <IN>;
212
213 This method is chainable.
214
215 add_file(FILENAME [...])
216 Takes a list of file names. Each file is opened and read line
217 by line. Each line is added to the assembly.
218
219 $r->add_file( 'file.1', 'file.2' );
220
If a file cannot be opened, the method will croak. If you
cannot afford to let this happen then you should wrap the call
in an "eval" block.

Chomping happens automatically unless you call the chomp(0) method
to disable it. By default, input lines are read according to
the value of the "input_record_separator" attribute (if
defined), and will otherwise fall back to the current setting
of the system $/ variable. The record separator may also be
specified on each call to "add_file". Internally, the routine
"local"ises the value of $/ to whatever is required, for the
duration of the call.
233
234 An alternate calling mechanism using a hash reference is
235 available. The recognised keys are:
236
237 file
238 Reference to a list of file names, or the name of a single
239 file.
240
241 $r->add_file({file => ['file.1', 'file.2', 'file.3']});
242 $r->add_file({file => 'file.n'});
243
244 input_record_separator
245 If present, indicates what constitutes a line
246
247 $r->add_file({file => 'data.txt', input_record_separator => ':' });
248
rs  An alias for input_record_separator (mnemonic: same as the
    English variable names).

    $r->add_file( {
        file => [ 'pattern.txt', 'more.txt' ],
        rs   => "\r\n",
    });
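
To see what a record separator other than newline does to the input,
here is a minimal sketch in plain Perl. It does not call the module; it
reads hypothetical data from an in-memory file handle with $/
"local"ised to ":", in the way the "add_file" description above
outlines.

```perl
use strict;
use warnings;

# Hypothetical data: three patterns separated by ":" rather than newlines.
my $data = 'ab+c:de?f:gh*i';
my @records;
{
    local $/ = ':';    # record separator for this block only
    open my $fh, '<', \$data or die "cannot open in-memory file: $!";
    while ( my $rec = <$fh> ) {
        chomp $rec;    # chomp removes the current $/ (":"), if present
        push @records, $rec;
    }
    close $fh;
}
print join( ' | ', @records ), "\n";    # prints ab+c | de?f | gh*i
```

Because $/ is restored when the block ends, code that runs afterwards
still reads ordinary newline-terminated lines.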
256
257 insert(LIST)
258 Takes a list of tokens representing a regular expression and
259 stores them in the object. Note: you should not pass it a bare
260 regular expression, such as "ab+c?d*e". You must pass it as a
261 list of tokens, e.g. "('a', 'b+', 'c?', 'd*', 'e')".
262
263 This method is chainable, e.g.:
264
265 my $ra = Regexp::Assemble->new
266 ->insert( qw[ a b+ c? d* e ] )
267 ->insert( qw[ a c+ d+ e* f ] );
268
269 Lexing complex patterns with metacharacters and so on can
270 consume a significant proportion of the overall time to build
271 an assembly. If you have the information available in a
272 tokenised form, calling "insert" directly can be a big win.
273
lexstr  Use the "lexstr" method if you are curious to see how a pattern
gets tokenised. It takes a scalar on input, representing a
pattern, and returns a reference to an array, containing the
tokenised pattern. You can recover the original pattern by
performing a "join":

    my $tokens = $re->lexstr($pattern);
    my $new_pattern = join( '', @$tokens );

If the original pattern contains unnecessary backslashes, or
"\x4b" escapes, or quotemeta escapes ("\Q"..."\E") the
resulting pattern may not be identical.

Calling "lexstr" does not add the pattern to the object, it is
merely for exploratory purposes. It will, however, update
various statistical counters.
290
291 pre_filter(CODE)
Allows you to install a callback to check that the pattern
being loaded contains valid input. It receives the pattern as a
whole to be added, before it has been tokenised by the lexer. It
may return 0 or "undef" to indicate that the pattern should
not be added; any true value indicates that the contents are
fine.
298
299 A filter to strip out trailing comments (marked by #):
300
301 $re->pre_filter( sub { $_[0] =~ s/\s*#.*$//; 1 } );
302
303 A filter to ignore blank lines:
304
305 $re->pre_filter( sub { length(shift) } );
306
307 If you want to remove the filter, pass "undef" as a parameter.
308
309 $ra->pre_filter(undef);
310
311 This method is chainable.
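
The two filters shown above can be combined into a single callback. The
following sketch exercises such a callback on its own, in plain Perl and
without the module, to show that editing $_[0] changes the caller's copy
of the pattern. The input lines are hypothetical.

```perl
use strict;
use warnings;

# A callback in the style described above: it edits the pattern in place
# through $_[0] and returns true only if something is left.
my $pre_filter = sub {
    $_[0] =~ s/\s*#.*$//;    # strip a trailing comment
    return length $_[0];     # false (0) for a blank line
};

# Hypothetical input lines, as they might be read from a pattern file.
my @lines = ( 'ab+c   # first pattern', '# only a comment', '', 'x\d+' );
my @kept;
for my $line (@lines) {
    my $pattern = $line;     # work on a copy, @lines stays untouched
    push @kept, $pattern if $pre_filter->($pattern);
}
print "$_\n" for @kept;
```

A callback like this will also truncate a pattern that contains a
literal "#", so it only suits pattern files where that character is
reserved for comments.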
312
313 filter(CODE)
Allows you to install a callback to check that the pattern
being loaded contains valid input. It receives a list on input,
after it has been tokenised by the lexer. It may return 0 or
undef to indicate that the pattern should not be added; any
true value indicates that the contents are fine.

If you know that all patterns you expect to assemble contain a
restricted set of tokens (e.g. no spaces), you could do the
following:

    $ra->filter(sub { not grep { / / } @_ });

or

    sub only_spaces_and_digits {
        # reject the pattern if any token is not made up
        # solely of spaces and digits
        not grep { !/\A[\d ]+\z/ } @_
    }
    $ra->filter( \&only_spaces_and_digits );

These two examples will silently ignore faulty patterns. If you
want the user to be made aware of the problem you should raise
an error (via "warn" or "die"), log an error message, whatever
is best. If you want to remove a filter, pass "undef" as a
parameter.
338
339 $ra->filter(undef);
340
341 This method is chainable.
342
343 as_string
344 Assemble the expression and return it as a string. You may want
345 to do this if you are writing the pattern to a file. The
346 following arguments can be passed to control the aspect of the
347 resulting pattern:
348
indent, the number of spaces used to indent nested grouping of
a pattern. Use this to produce a pretty-printed pattern (for
some definition of "pretty"). The resulting output is rather
verbose. The reason is to ensure that the metacharacters "(?:"
and ")" always occur on otherwise empty lines. This allows you
to grep the result for an even more synthetic view of the pattern:
355
356 egrep -v '^ *[()]' <regexp.file>
357
358 The result of the above is quite readable. Remember to
359 backslash the spaces appearing in your own patterns if you wish
360 to use an indented pattern in an "m/.../x" construct. Indenting
361 is ignored if tracking is enabled.
362
363 The indent argument takes precedence over the "indent"
364 method/attribute of the object.
365
366 Calling this method will drain the internal data structure.
367 Large numbers of patterns can eat a significant amount of
368 memory, and this lets perl recover the memory used for other
369 purposes.
370
371 If you want to reduce the pattern and continue to add new
372 patterns, clone the object and reduce the clone, leaving the
373 original object intact.
374
re  Assembles the pattern and returns it as a compiled RE, using the
"qr//" operator.
377
378 As with "as_string", calling this method will reset the
379 internal data structures to free the memory used in assembling
380 the RE.
381
382 The indent attribute, documented in the "as_string" method, can
383 be used here (it will be ignored if tracking is enabled).
384
385 With method chaining, it is possible to produce a RE without
386 having a temporary "Regexp::Assemble" object lying around,
387 e.g.:
388
389 my $re = Regexp::Assemble->new
390 ->add( q[ab+cd+e] )
391 ->add( q[ac\\d+e] )
392 ->add( q[c\\d+e] )
393 ->re;
394
395 The $re variable now contains a Regexp object that can be used
396 directly:
397
    while( <> ) {
        /$re/ and print "Something in [$_] matched\n";
    }
401
402 The "re" method is called when the object is used in string
403 context (hence, within an "m//" operator), so by and large you
404 do not even need to save the RE in a separate variable. The
405 following will work as expected:
406
407 my $re = Regexp::Assemble->new->add( qw[ fee fie foe fum ] );
408 while( <IN> ) {
409 if( /($re)/ ) {
410 print "Here be giants: $1\n";
411 }
412 }
413
414 This approach does not work with tracked patterns. The "match"
415 and "matched" methods must be used instead, see below.
416
417 match(SCALAR)
418 The following information applies to Perl 5.8 and below. See
419 the section that follows for information on Perl 5.10.
420
421 If pattern tracking is in use, you must "use re 'eval'" in
422 order to make things work correctly. At a minimum, this will
423 make your code look like this:
424
    my $did_match = do { use re 'eval'; $target =~ /$ra/ };
    if( $did_match ) {
        print "matched ", $ra->matched, "\n";
    }
429
430 (The main reason is that the $^R variable is currently broken
431 and an ugly workaround that runs some Perl code during the
432 match is required, in order to simulate what $^R should be
433 doing. See Perl bug #32840 for more information if you are
434 curious. The README also contains more information). This bug
435 has been fixed in 5.10.
436
437 The important thing to note is that with "use re 'eval'", THERE
438 ARE SECURITY IMPLICATIONS WHICH YOU IGNORE AT YOUR PERIL. The
439 problem is this: if you do not have strict control over the
440 patterns being fed to "Regexp::Assemble" when tracking is
441 enabled, and someone slips you a pattern such as "/^(?{system
442 'rm -rf /'})/" and you attempt to match a string against the
443 resulting pattern, you will know Fear and Loathing.
444
What is more, the $^R workaround means that tracking does
not work if you perform a bare "/$re/" pattern match as shown
above. You have to instead call the "match" method, in order to
supply the necessary context to take care of the tracking
housekeeping details.
450
451 if( defined( my $match = $ra->match($_)) ) {
452 print " $_ matched by $match\n";
453 }
454
455 In the case of a successful match, the original matched pattern
456 is returned directly. The matched pattern will also be
457 available through the "matched" method.
458
459 (Except that the above is not true for 5.6.0: the "match"
460 method returns true or undef, and the "matched" method always
461 returns undef).
462
463 If you are capturing parts of the pattern e.g. "foo(bar)rat"
464 you will want to get at the captures. See the "mbegin", "mend",
465 "mvar" and "capture" methods. If you are not using captures
466 then you may safely ignore this section.
467
468 In 5.10, since the bug concerning $^R has been resolved, there
469 is no need to use "re 'eval'" and the assembled pattern does
470 not require any Perl code to be executed during the match.
471
source  When using tracked mode, after a successful match is made,
returns the original source pattern that caused the match. In
Perl 5.10, the $^R variable can be used as an index to fetch
the correct pattern from the object.
476
477 If no successful match has been performed, or the object is not
478 in tracked mode, this method returns "undef".
479
480 my $r = Regexp::Assemble->new->track(1)->add(qw(foo? bar{2} [Rr]at));
481
482 for my $w (qw(this food is rather barren)) {
483 if ($w =~ /$r/) {
484 print "$w matched by ", $r->source($^R), $/;
485 }
486 else {
487 print "$w no match\n";
488 }
489 }
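
The role of $^R as an index can be illustrated without the module. The
sketch below assumes Perl 5.10 or later and uses a hand-written pattern
in which each alternative ends with a "(?{N})" block. The pattern that
"Regexp::Assemble" actually generates in tracked mode will look
different; only the principle is shown here.

```perl
use strict;
use warnings;

# Hypothetical hand-written pattern; each branch records its own index.
my @source = ( 'foo?', 'bar{2}', '[Rr]at' );
my $re = qr/(?:foo?(?{0})|bar{2}(?{1})|[Rr]at(?{2}))/;

my %matched_by;
for my $w (qw(this food is rather barren)) {
    if ( $w =~ $re ) {
        my $index = $^R;    # value left by the last (?{...}) block that ran
        $matched_by{$w} = $source[$index];
        print "$w matched by $source[$index]\n";
    }
    else {
        print "$w no match\n";
    }
}
```

Since the code blocks are written literally in the "qr//" and nothing is
interpolated at run time, this sketch does not need "use re 'eval'".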
490
491 mbegin This method returns a copy of "@-" at the moment of the last
492 match. You should ordinarily not need to bother with this,
493 "mvar" should be able to supply all your needs.
494
495 mend This method returns a copy of "@+" at the moment of the last
496 match.
497
498 mvar(NUMBER)
499 The "mvar" method returns the captures of the last match.
500 mvar(1) corresponds to $1, mvar(2) to $2, and so on. mvar(0)
501 happens to return the target string matched, as a byproduct of
502 walking down the "@-" and "@+" arrays after the match.
503
504 If called without a parameter, "mvar" will return a reference
505 to an array containing all captures.
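
The relationship between the "@-" and "@+" arrays and the captures can
be checked in plain Perl. This sketch does not use the module; it only
shows how the offsets held in those arrays yield the matched text and
each capture, which is the walk that "mvar" is described as performing.
The target string is hypothetical.

```perl
use strict;
use warnings;

my $target = 'say foobarrat now';
my @capture;
if ( $target =~ /foo(bar)(rat)/ ) {
    # $-[$n] and $+[$n] are the start and end offsets of group $n;
    # group 0 is the whole match.
    for my $n ( 0 .. $#+ ) {
        push @capture, substr( $target, $-[$n], $+[$n] - $-[$n] );
    }
}
print join( ', ', @capture ), "\n";    # prints foobarrat, bar, rat
```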
506
capture The "capture" method returns the captures of the last match
as an array. Unlike "mvar", this method does not include the
matched string. It is equivalent to getting an array back that
contains "$1, $2, $3, ...".

If no captures were found in the match, an empty array is
returned, rather than "undef". You are therefore guaranteed to
be able to use "for my $c ($re->capture) { ..." without having
to check whether anything was captured.
516
matched If pattern tracking has been set, via the "track" attribute, or
through the "track" method, this method will return the
original pattern of the last successful match. Returns undef
if no match has yet been performed, or if tracking has not been
enabled.

See below in the NOTES section for additional subtleties of
which you should be aware when tracking patterns.
524
525 Note that this method is not available in 5.6.0, due to
526 limitations in the implementation of "(?{...})" at the time.
527
528 Statistics/Reporting routines
529 stats_add
530 Returns the number of patterns added to the assembly (whether
531 by "add" or "insert"). Duplicate patterns are not included in
532 this total.
533
534 stats_dup
535 Returns the number of duplicate patterns added to the assembly.
536 If non-zero, this may be a sign that something is wrong with
537 your data (or at the least, some needless redundancy). This may
538 occur when you have two patterns (for instance, "a\-b" and
539 "a-b") which map to the same result.
540
541 stats_raw
542 Returns the raw number of bytes in the patterns added to the
543 assembly. This includes both original and duplicate patterns.
544 For instance, adding the two patterns "ab" and "ab" will count
545 as 4 bytes.
546
547 stats_cooked
548 Return the true number of bytes added to the assembly. This
549 will not include duplicate patterns. Furthermore, it may differ
550 from the raw bytes due to quotemeta treatment. For instance,
551 "abc\,def" will count as 7 (not 8) bytes, because "\," will be
552 stored as ",". Also, "\Qa.b\E" is 7 bytes long, however, after
553 the quotemeta directives are processed, "a\.b" will be stored,
554 for a total of 4 bytes.
555
556 stats_length
557 Returns the length of the resulting assembled expression.
558 Until "as_string" or "re" have been called, the length will be
559 0 (since the assembly will have not yet been performed). The
560 length includes only the pattern, not the additional
561 ("(?-xism...") fluff added by the compilation.
562
563 dup_warn(NUMBER|CODEREF)
564 Turns warnings about duplicate patterns on or off. By default,
565 no warnings are emitted. If the method is called with no
566 parameters, or a true parameter, the object will carp about
567 patterns it has already seen. To turn off the warnings, use 0
568 as a parameter.
569
570 $r->dup_warn();
571
572 The method may also be passed a code block. In this case the
573 code will be executed and it will receive a reference to the
574 object in question, and the lexed pattern.
575
    $r->dup_warn(
        sub {
            my $self = shift;
            print $self->stats_add, " patterns added at line $.\n",
                join( '', @_ ), " added previously\n";
        }
    );
583
584 Anchor routines
585 Suppose you wish to assemble a series of patterns that all begin with
586 "^" and end with "$" (anchor pattern to the beginning and end of
587 line). Rather than add the anchors to each and every pattern (and
588 possibly forget to do so when a new entry is added), you may specify
589 the anchors in the object, and they will appear in the resulting
590 pattern, and you no longer need to (or should) put them in your source
591 patterns. For example, the two following snippets will produce
592 identical patterns:
593
594 $r->add(qw(^this ^that ^them))->as_string;
595
596 $r->add(qw(this that them))->anchor_line_begin->as_string;
597
598 # both techniques will produce ^th(?:at|em|is)
599
The available anchors are word boundaries ("\b"), line boundaries ("^"
and "$") and string boundaries ("\A" and "\Z" (or "\z" if you
absolutely need it)).

The shortcut "anchor_mumble", which implies both "anchor_mumble_begin"
and "anchor_mumble_end", is also available. If different anchors are
specified the most specific anchor wins. For instance, if both
"anchor_word_begin" and "anchor_line_begin" are specified,
"anchor_word_begin" takes precedence.
609
610 All the anchor methods are chainable.
611
612 anchor_word_begin
613 The resulting pattern will be prefixed with a "\b" word
614 boundary assertion when the value is true. Set to 0 to disable.
615
616 $r->add('pre')->anchor_word_begin->as_string;
617 # produces '\bpre'
618
619 anchor_word_end
620 The resulting pattern will be suffixed with a "\b" word
621 boundary assertion when the value is true. Set to 0 to disable.
622
623 $r->add(qw(ing tion))
624 ->anchor_word_end
625 ->as_string; # produces '(?:tion|ing)\b'
626
627 anchor_word
The resulting pattern will have "\b" word boundary
assertions at the beginning and end of the pattern when the
value is true. Set to 0 to disable.

    $r->add(qw(cat carrot))
      ->anchor_word(1)
      ->as_string; # produces '\bca(?:rro)?t\b'
635
636 anchor_line_begin
637 The resulting pattern will be prefixed with a "^" line boundary
638 assertion when the value is true. Set to 0 to disable.
639
640 $r->anchor_line_begin;
641 # or
642 $r->anchor_line_begin(1);
643
644 anchor_line_end
645 The resulting pattern will be suffixed with a "$" line boundary
646 assertion when the value is true. Set to 0 to disable.
647
648 # turn it off
649 $r->anchor_line_end(0);
650
651 anchor_line
The resulting pattern will have the "^" and "$" line
boundary assertions at the beginning and end of the pattern,
respectively, when the value is true. Set to 0 to disable.

    $r->add(qw(cat carrot))
      ->anchor_line
      ->as_string; # produces '^ca(?:rro)?t$'
659
660 anchor_string_begin
661 The resulting pattern will be prefixed with a "\A" string
662 boundary assertion when the value is true. Set to 0 to disable.
663
664 $r->anchor_string_begin(1);
665
666 anchor_string_end
667 The resulting pattern will be suffixed with a "\Z" string
668 boundary assertion when the value is true. Set to 0 to disable.
669
670 # disable the string boundary end anchor
671 $r->anchor_string_end(0);
672
673 anchor_string_end_absolute
674 The resulting pattern will be suffixed with a "\z" string
675 boundary assertion when the value is true. Set to 0 to disable.
676
677 # disable the string boundary absolute end anchor
678 $r->anchor_string_end_absolute(0);
679
680 If you don't understand the difference between "\Z" and "\z",
681 the former will probably do what you want.
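
For readers who want to see the difference, here is a minimal check in
plain Perl (no module involved): "\Z" also matches just before a
trailing newline, whereas "\z" matches only at the very end of the
string.

```perl
use strict;
use warnings;

my $line = "cat\n";    # a string that still has its trailing newline

my $Z = $line =~ /cat\Z/ ? 1 : 0;    # \Z also matches before a final newline
my $z = $line =~ /cat\z/ ? 1 : 0;    # \z matches only at the very end
print "\\Z: $Z, \\z: $z\n";          # prints \Z: 1, \z: 0
```

This is why "\Z" tends to be the convenient choice for lines read from a
file that have not been chomped.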
682
683 anchor_string
The resulting pattern will have the "\A" and "\Z" string
boundary assertions at the beginning and end of the pattern,
respectively, when the value is true. Set to 0 to disable.

    $r->add(qw(cat carrot))
      ->anchor_string
      ->as_string; # produces '\Aca(?:rro)?t\Z'
691
692 anchor_string_absolute
The resulting pattern will have the "\A" and "\z" string
boundary assertions at the beginning and end of the pattern,
respectively, when the value is true. Set to 0 to disable.

    $r->add(qw(cat carrot))
      ->anchor_string_absolute
      ->as_string; # produces '\Aca(?:rro)?t\z'
700
701 debug(NUMBER)
Turns debugging on or off. Statements are printed to the
currently selected file handle (STDOUT by default). If you are
already using this handle, you will have to arrange to select
an output handle to a file of your own choosing before calling
the "add", "as_string" or "re" functions, otherwise it will
scribble all over your carefully formatted output.
708
709 0 Off. Turns off all debugging output.
710
711 1 Add. Trace the addition of patterns.
712
713 2 Reduce. Trace the process of reduction and assembly.
714
4   Lex. Trace the lexing of the input patterns into their
    constituent tokens.
717
718 8 Time. Print to STDOUT the time taken to load all the
719 patterns. This is nothing more than the difference
720 between the time the object was instantiated and the
721 time reduction was initiated.
722
723 # load=<num>
724
725 Any lengthy computation performed in the client code
726 will be reflected in this value. Another line will be
727 printed after reduction is complete.
728
729 # reduce=<num>
730
731 The above output lines will be changed to "load-epoch"
732 and "reduce-epoch" if the internal state of the object
733 is corrupted and the initial timestamp is lost.
734
735 The code attempts to load Time::HiRes in order to
736 report fractional seconds. If this is not successful,
737 the elapsed time is displayed in whole seconds.
738
Values can be added (or or'ed together) to trace several stages at once:
740
741 $r->debug(7)->add( '\\d+abc' );
742
743 Calling "debug" with no arguments turns debugging off.
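
Since the debug levels are powers of two, they combine as a bitmask. The
sketch below uses hypothetical constant names (they are assumptions made
for illustration, not names taken from this documentation) with the
values listed above, and shows which stages a level of 7 selects.

```perl
use strict;
use warnings;

# Hypothetical constant names; the values are those listed above.
use constant {
    DEBUG_ADD    => 1,
    DEBUG_REDUCE => 2,
    DEBUG_LEX    => 4,
    DEBUG_TIME   => 8,
};

my $level = DEBUG_ADD | DEBUG_REDUCE | DEBUG_LEX;    # 7
my @on = grep { $level & $_->[1] }
    ( [ add => DEBUG_ADD ], [ reduce => DEBUG_REDUCE ],
      [ lex => DEBUG_LEX ], [ time => DEBUG_TIME ] );
print "level $level traces: ", join( ', ', map { $_->[0] } @on ), "\n";
```

A level of 7 therefore covers adding, reducing and lexing, but not the
timing output, which needs the value 8 as well (15 in total).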
744
745 dump Produces a synthetic view of the internal data structure. How
746 to interpret the results is left as an exercise to the reader.
747
748 print $r->dump;
749
750 chomp(0|1)
751 Turns chomping on or off.
752
753 IMPORTANT: As of version 0.24, chomping is now on by default as
754 it makes "add_file" Just Work. The only time you may run into
755 trouble is with "add("\\$/")". So don't do that, or else
756 explicitly turn off chomping.
757
758 To avoid incorporating (spurious) record separators (such as
759 "\n" on Unix) when reading from a file, "add()" "chomp"s its
760 input. If you don't want this to happen, call "chomp" with a
761 false value.
762
763 $re->chomp(0); # really want the record separators
764 $re->add(<DATA>);
765
766 fold_meta_pairs(NUMBER)
Determines whether "\s", "\S" and "\w", "\W" and "\d", "\D" are
folded into a "." (dot). Folding happens by default (for
reasons of backwards compatibility, even though it is wrong
unless the "/s" expression modifier is active).

Call this method with a false value to prevent this behaviour
(which is only a problem when dealing with "\n" if the "/s"
expression modifier is not set).
775
    $re->add( '\\w', '\\W' );
    my $clone = $re->clone;

    $clone->fold_meta_pairs(0);
    print $clone->as_string; # prints '[\W\w]'
    print $re->as_string; # prints '.'
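
The case where "." and a class such as "[\W\w]" differ is a newline in
the target string. The following check is plain Perl and independent of
the module; it relies only on the standard rule that "." does not match
"\n" unless the "/s" modifier is given.

```perl
use strict;
use warnings;

my $nl = "\n";
my %matches = (
    'dot'        => ( $nl =~ /\A.\z/      ? 1 : 0 ),    # 0: "." skips "\n"
    'dot with s' => ( $nl =~ /\A.\z/s     ? 1 : 0 ),    # 1: /s lets "." match it
    'class'      => ( $nl =~ /\A[\W\w]\z/ ? 1 : 0 ),    # 1: the class matches anything
);
print "$_: $matches{$_}\n" for sort keys %matches;
```

If your targets can contain newlines and you do not use "/s", turning
folding off keeps the assembled pattern equivalent to its sources.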
782
783 indent(NUMBER)
784 Sets the level of indent for pretty-printing nested groups
785 within a pattern. See the "as_string" method for more details.
786 When called without a parameter, no indenting is performed.
787
788 $re->indent( 4 );
789 print $re->as_string;
790
791 lookahead(0|1)
792 Turns on zero-width lookahead assertions. This is usually
793 beneficial when you expect that the pattern will usually fail.
794 If you expect that the pattern will usually match you will
795 probably be worse off.
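
Whether the lookahead pays off depends on your data, so it is worth
measuring. Here is a minimal benchmarking sketch using the core
Benchmark module. The two patterns are written by hand rather than
produced by "Regexp::Assemble", and the sample targets are hypothetical;
substitute a representative sample of your own.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $plain = qr/(?:bob|alice|charles)/;
my $ahead = qr/(?=[abc])(?:bob|alice|charles)/;

# Hypothetical sample: mostly strings that will not match.
my @targets = ( ('zzzzzzzzzzzzzzzz') x 50, 'say hello to alice' );

# Both patterns must agree on what matches; only the speed may differ.
my $plain_hits = grep { $_ =~ $plain } @targets;
my $ahead_hits = grep { $_ =~ $ahead } @targets;
print "hits: $plain_hits (plain), $ahead_hits (lookahead)\n";

# Run each variant for at least one CPU second and compare the rates.
cmpthese( -1, {
    plain     => sub { my $n = grep { $_ =~ $plain } @targets },
    lookahead => sub { my $n = grep { $_ =~ $ahead } @targets },
} );
```

The timings vary with the perl version and with the mix of matching and
non-matching targets, so no expected figures are given here.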
796
797 flags(STRING)
798 Sets the flags that govern how the pattern behaves (for
799 versions of Perl up to 5.9 or so, these are "imsx"). By default
800 no flags are enabled.
801
802 modifiers(STRING)
803 An alias of the "flags" method, for users familiar with
804 "Regexp::List".
805
806 track(0|1)
Turns tracking on or off. When this attribute is enabled,
additional housekeeping information is inserted into the
assembled expression using "(?{...})" embedded code constructs.
This provides the necessary information to determine which, of
the original patterns added, was the one that caused the match.
812
813 $re->track( 1 );
814 if( $target =~ /$re/ ) {
815 print "$target matched by ", $re->matched, "\n";
816 }
817
818 Note that when this functionality is enabled, no reduction is
819 performed and no character classes are generated. In other
820 words, "brag|tag" is not reduced down to "(?:br|t)ag" and
821 "dig|dim" is not reduced to "di[gm]".
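
To give a feel for the mechanism, here is a hand-written stand-in for
a tracked pattern, using only core Perl. It is a sketch of the idea
and not the pattern the module actually generates. The package
variable $which and the helper which_pattern are names invented for
this illustration.

```perl
use strict;
use warnings;

# Package variable set by the embedded code blocks below.
our $which;

# Each alternative records its own source pattern once it has matched
# in full. Note that the alternatives are kept whole: no shared "ag"
# suffix is factored out.
my $tracked = qr/(?:brag(?{ $which = 'brag' })|tag(?{ $which = 'tag' }))/;

# Hypothetical helper: return the source pattern that matched, or undef.
sub which_pattern {
    my ($target) = @_;
    $which = undef;
    return $target =~ $tracked ? $which : undef;
}

for my $target ('price tag', 'brag', 'dig') {
    my $hit = which_pattern($target);
    print "$target: ", (defined $hit ? $hit : '(no match)'), "\n";
}
```

Because each code block sits at the end of an unreduced alternative,
it only runs once that alternative has matched completely.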
822
823 unroll_plus(0|1)
824 Turns the unrolling of plus metacharacters on or off. When a
pattern is broken up, "a+" becomes "a", "a*" (and "b+?" becomes
"b", "b*?"). This may allow the freed "a" to assemble with other
827 patterns. Not enabled by default.
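
The rewrite is safe because the unrolled form matches exactly the same
text as the original. A quick check in core Perl (no module calls; the
variable names are only for illustration):

```perl
use strict;
use warnings;

# Greedy case: "a+" against its unrolled form "aa*".
my ($plus)     = 'xaaab' =~ /(a+)/;
my ($unrolled) = 'xaaab' =~ /(aa*)/;

# Non-greedy case: "b+?" against its unrolled form "bb*?".
my ($lazy)          = 'bbb' =~ /(b+?)/;
my ($lazy_unrolled) = 'bbb' =~ /(bb*?)/;

print "greedy: $plus / $unrolled\n";         # both capture 'aaa'
print "lazy:   $lazy / $lazy_unrolled\n";    # both capture 'b'
```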
828
829 lex(SCALAR)
830 Change the pattern used to break a string apart into tokens.
831 You can examine the "eg/naive" script as a starting point.
832
833 reduce(0|1)
834 Turns pattern reduction on or off. A reduced pattern may be
835 considerably shorter than an unreduced pattern. Consider
836 "/sl(?:ip|op|ap)/" versus "/sl[aio]p/". An unreduced pattern
837 will be very similar to those produced by "Regexp::Optimizer".
838 Reduction is on by default. Turning it off speeds assembly (but
839 assembly is pretty fast -- it's the breaking up of the initial
840 patterns in the lexing stage that can consume a non-negligible
841 amount of time).
842
843 mutable(0|1)
844 This method has been marked as DEPRECATED. It will be removed
845 in a future release. See the "clone" method for a technique to
846 replace its functionality.
847
848 reset Empties out the patterns that have been "add"ed or "insert"-ed
849 into the object. Does not modify the state of controller
850 attributes such as "debug", "lex", "reduce" and the like.
851
852 Default_Lexer
853 Warning: the "Default_Lexer" function is a class method, not an
854 object method. It is a fatal error to call it as an object
855 method.
856
857 The "Default_Lexer" method lets you replace the default pattern
858 used for all subsequently created "Regexp::Assemble" objects.
859 It will not have any effect on existing objects. (It is also
860 possible to override the lexer pattern used on a per-object
861 basis).
862
863 The parameter should be an ordinary scalar, not a compiled
864 pattern. If the pattern fails to match all parts of the string,
865 the missing parts will be returned as single chunks. Therefore
866 the following pattern is legal (albeit rather cork-brained):
867
868 Regexp::Assemble::Default_Lexer( '\\d' );
869
The above pattern will split up input strings digit by digit,
and return all non-digit characters as single chunks.
872
"Cannot pass a [refname] to Default_Lexer"

You tried to replace the default lexer pattern with an object instead
of a scalar. Solution: You probably tried to call
"$obj->Default_Lexer". Call the qualified class method
"Regexp::Assemble::Default_Lexer" instead.
880
881 "filter method not passed a coderef"
882
883 "pre_filter method not passed a coderef"
884
885 A reference to a subroutine (anonymous or otherwise) was expected.
886 Solution: read the documentation for the "filter" method.
887
888 "duplicate pattern added: /.../"
889
890 The "dup_warn" attribute is active, and a duplicate pattern was added
891 (well duh!). Solution: clean your data.
892
893 "cannot open [file] for input: [reason]"
894
895 The "add_file" method was unable to open the specified file for
896 whatever reason. Solution: make sure the file exists and the script has
897 the required privileges to read it.
898
900 This module has been tested successfully with a range of versions of
901 perl, from 5.005_03 to 5.9.3. Use of 5.6.0 is not recommended.
902
903 The expressions produced by this module can be used with the PCRE
904 library.
905
906 Remember to "double up" your backslashes if the patterns are hard-coded
907 as constants in your program. That is, you should literally
908 "add('a\\d+b')" rather than "add('a\d+b')". It usually will work either
909 way, but it's good practice to do so.
910
Where possible, supply the simplest tokens possible. Don't add
"X(?:-\d+){2}Y" when "X-\d+-\d+Y" will do. The reason is that if you
also add "X-\d+Z" the resulting assembly changes dramatically:
"X(?:(?:-\d+){2}Y|-\d+Z)" versus "X-\d+(?:-\d+Y|Z)". Since R::A doesn't
perform enough analysis, it won't "unroll" the "{2}" quantifier, and
will fail to notice the divergence after the first "-\d+".
917
918 Furthermore, when the string 'X-123000P' is matched against the first
919 assembly, the regexp engine will have to backtrack over each
920 alternation (the one that ends in Y and the one that ends in Z) before
921 determining that there is no match. No such backtracking occurs in the
922 second pattern: as soon as the engine encounters the 'P' in the target
923 string, neither of the alternations at that point ("-\d+Y" or "Z")
924 could succeed and so the match fails.
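
The two assemblies accept and reject the same strings; only the amount
of work the engine does differs. The core Perl sketch below checks the
first claim (it does not measure backtracking). The "\A" and "\z"
anchors and the helper name compare are additions made for this
illustration.

```perl
use strict;
use warnings;

# The two assemblies quoted above, anchored so that the whole target
# string has to match.
my $grouped  = qr/\AX(?:(?:-\d+){2}Y|-\d+Z)\z/;
my $unrolled = qr/\AX-\d+(?:-\d+Y|Z)\z/;

# Hypothetical helper: report how each assembly treats one target string.
sub compare {
    my ($target) = @_;
    return join ' ',
        map { $target =~ $_ ? 'match' : 'fail' } $grouped, $unrolled;
}

for my $target ('X-1-2Y', 'X-12Z', 'X-123000P') {
    printf "%-10s %s\n", $target, compare($target);
}
```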
925
926 "Regexp::Assemble" does, however, know how to build character classes.
927 Given "a-b", "axb" and "a\db", it will assemble these into "a[-\dx]b".
928 When "-" (dash) appears as a candidate for a character class it will be
929 the first character in the class. When "^" (circumflex) appears as a
930 candidate for a character class it will be the last character in the
931 class.
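
The placement matters because a dash anywhere else in a class can be
read as a range, and a leading circumflex negates the class. A small
core Perl check of both conventions (the second class, "[-x^]", is made
up for this illustration):

```perl
use strict;
use warnings;

# The class from the text: dash first, so it is a literal, not a range.
my $assembled = qr/\Aa[-\dx]b\z/;

# A class holding both special characters: dash first, circumflex last,
# so neither one acts as a range or a negation.
my $both = qr/\A[-x^]\z/;

for my $target ('a-b', 'axb', 'a7b', 'ayb') {
    print "$target: ", ($target =~ $assembled ? 'match' : 'no match'), "\n";
}
for my $char ('-', 'x', '^', 'y') {
    print "$char: ", ($char =~ $both ? 'match' : 'no match'), "\n";
}
```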
932
It also knows about meta-characters that can "absorb" regular
934 characters. For instance, given "X\d" and "X5", it knows that 5 can be
935 represented by "\d" and so the assembly is just "X\d". The "absorbent"
936 meta-characters it deals with are ".", "\d", "\s" and "\W" and their
937 complements. It will replace "\d"/"\D", "\s"/"\S" and "\w"/"\W" by "."
938 (dot), and it will drop "\d" if "\w" is also present (as will "\D" in
939 the presence of "\W").
940
941 "Regexp::Assemble" deals correctly with "quotemeta"'s propensity to
942 backslash many characters that have no need to be. Backslashes on non-
943 metacharacters will be removed. Similarly, in character classes, a
944 number of characters lose their magic and so no longer need to be
945 backslashed within a character class. Two common examples are "."
946 (dot) and "$". Such characters will lose their backslash.
947
948 At the same time, it will also process "\Q...\E" sequences. When such a
949 sequence is encountered, the inner section is extracted and "quotemeta"
950 is applied to the section. The resulting quoted text is then used in
951 place of the original unquoted text, and the "\Q" and "\E"
952 metacharacters are thrown away. Similar processing occurs with the
953 "\U...\E" and "\L...\E" sequences. This may have surprising effects
954 when using a dispatch table. In this case, you will need to know
955 exactly what the module makes of your input. Use the "lexstr" method to
956 find out what's going on:
957
958 $pattern = join( '', @{$re->lexstr($pattern)} );
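
As a rough picture of what the "\Q...\E" processing amounts to, the
following core Perl sketch applies "quotemeta" to the quoted section
and discards the markers. The helper expand_quoted is hypothetical and
is not the module's implementation; it ignores "\U" and "\L" entirely.

```perl
use strict;
use warnings;

# Hypothetical helper: expand \Q...\E by applying quotemeta to the inner
# text and throwing away the \Q and \E markers themselves.
sub expand_quoted {
    my ($pattern) = @_;
    $pattern =~ s/\\Q(.*?)\\E/quotemeta($1)/ge;
    return $pattern;
}

# Single quotes keep the backslashes, as in a hard-coded pattern.
print expand_quoted('x\Qa.b\Ey'), "\n";    # prints xa\.by
```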
959
960 If all the digits 0..9 appear in a character class, "Regexp::Assemble"
961 will replace them by "\d". I'd do it for letters as well, but thinking
962 about accented characters and other glyphs hurts my head.
963
964 In an alternation, the longest paths are chosen first (for example,
965 "horse|bird|dog"). When two paths have the same length, the path with
966 the most subpaths will appear first. This aims to put the "busiest"
967 paths to the front of the alternation. For example, the list "bad",
968 "bit", "few", "fig" and "fun" will produce the pattern
969 "(?:f(?:ew|ig|un)|b(?:ad|it))". See eg/tld for a real-world example of
970 how alternations are sorted. Once you have looked at that, everything
971 should be crystal clear.
972
When tracking is in use, no reduction is performed, nor are character
974 classes formed. The reason is that it is too difficult to determine the
975 original pattern afterwards. Consider the two patterns "pale" and
976 "palm". These should be reduced to "pal[em]". The final character
977 matches one of two possibilities. To resolve whether it matched an 'e'
978 or 'm' would require keeping track of the fact that the pattern
finished up in a character class, which would then require a whole lot
980 more work to figure out which character of the class matched. Without
981 character classes it becomes much easier. Instead, "pal(?:e|m)" is
982 produced, which lets us find out more simply where we ended up.
983
984 Similarly, "dogfood" and "seafood" should form "(?:dog|sea)food". When
985 the pattern is being assembled, the tracking decision needs to be made
986 at the end of the grouping, but the tail of the pattern has not yet
987 been visited. Deferring things to make this work correctly is a vast
hassle. In this case, the pattern becomes merely "(?:dogfood|seafood)".
989 Tracked patterns will therefore be bulkier than simple patterns.
990
991 There is an open bug on this issue:
992
993 <http://rt.perl.org/rt3/Ticket/Display.html?id=32840>
994
995 If this bug is ever resolved, tracking would become much easier to deal
996 with (none of the "match" hassle would be required - you could just
997 match like a regular RE and it would Just Work).
998
1000 perlre General information about Perl's regular expressions.
1001
1002 re Specific information about "use re 'eval'".
1003
1004 Regex::PreSuf
"Regex::PreSuf" takes a string and chops it into tokens of
length 1. Since it can't deal with tokens of more than one
character, it can't deal with meta-characters and thus cannot
handle regular expressions, which is the main reason why I
wrote this module.
1010
1011 Regexp::Optimizer
1012 "Regexp::Optimizer" produces regular expressions that are
1013 similar to those produced by R::A with reductions switched off.
Its biggest drawback is that it is exponentially slower than
1015 Regexp::Assemble on very large sets of patterns.
1016
1017 Regexp::Parser
1018 Fine grained analysis of regular expressions.
1019
1020 Regexp::Trie
1021 Funnily enough, this was my working name for "Regexp::Assemble"
during its development. I changed the name because I thought
1023 it was too obscure. Anyway, "Regexp::Trie" does much the same
1024 as "Regexp::Optimizer" and "Regexp::Assemble" except that it
1025 runs much faster (according to the author). It does not
1026 recognise meta characters (that is, 'a+b' is interpreted as
1027 'a\+b').
1028
1029 Text::Trie
1030 "Text::Trie" is well worth investigating. Tries can outperform
1031 very bushy (read: many alternations) patterns.
1032
1033 Tree::Trie
1034 "Tree::Trie" is another module that builds tries. The algorithm
1035 that "Regexp::Assemble" uses appears to be quite similar to the
1036 algorithm described therein, except that "R::A" solves its end-
1037 marker problem without having to rewrite the leaves.
1038
1040 "Regexp::Assemble" does not attempt to find common substrings. For
1041 instance, it will not collapse "/cabababc/" down to "/c(?:ab){3}c/".
1042 If there's a module out there that performs this sort of string
1043 analysis I'd like to know about it. But keep in mind that the
1044 algorithms that do this are very expensive: quadratic or worse.
1045
1046 "Regexp::Assemble" does not interpret meta-character modifiers. For
1047 instance, if the following two patterns are given: "X\d" and "X\d+", it
1048 will not determine that "\d" can be matched by "\d+". Instead, it will
produce "X(?:\d|\d+)". Along a similar line of reasoning, it will not
determine that "Z" and "Z\d+" are equivalent to "Z\d*". (It will produce
"Z(?:\d+)?" instead.)
1052
1053 You cannot remove a pattern that has been added to an object. You'll
1054 just have to start over again. Adding a pattern is difficult enough,
1055 I'd need a solid argument to convince me to add a "remove" method. If
1056 you need to do this you should read the documentation for the "clone"
1057 method.
1058
1059 "Regexp::Assemble" does not (yet)? employ the "(?>...)" construct.
1060
1061 The module does not produce POSIX-style regular expressions. This would
1062 be quite easy to add, if there was a demand for it.
1063
1065 Patterns that generate look-ahead assertions sometimes produce
1066 incorrect patterns in certain obscure corner cases. If you suspect that
1067 this is occurring in your pattern, disable lookaheads.
1068
1069 Tracking doesn't really work at all with 5.6.0. It works better in
1070 subsequent 5.6 releases. For maximum reliability, the use of a 5.8
1071 release is strongly recommended. Tracking barely works with 5.005_04.
1072 Of note, using "\d"-style meta-characters invariably causes panics.
1073 Tracking really comes into its own in Perl 5.10.
1074
1075 If you feed "Regexp::Assemble" patterns with nested parentheses, there
1076 is a chance that the resulting pattern will be uncompilable due to
1077 mismatched parentheses (not enough closing parentheses). This is
1078 normal, so long as the default lexer pattern is used. If you want to
find out which pattern among a list of 3000 patterns is to blame
1080 (speaking from experience here), the eg/debugging script offers a
1081 strategy for pinpointing the pattern at fault. While you may not be
1082 able to use the script directly, the general approach is easy to
1083 implement.
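
One possible approach (not necessarily the one eg/debugging takes) is
to assemble ever longer prefixes of the pattern list and stop at the
first one that no longer compiles. The sketch below is core Perl only.
The helper first_bad_pattern and the stand-in assembler $naive are
hypothetical names introduced for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: feed patterns to $assemble one more at a time and
# return the first pattern whose addition yields an uncompilable result.
# $assemble is any code reference that turns a list of patterns into a
# single pattern string.
sub first_bad_pattern {
    my ($assemble, @patterns) = @_;
    for my $n (1 .. scalar @patterns) {
        my $assembled = $assemble->(@patterns[0 .. $n - 1]);
        my $compiled  = eval { qr/$assembled/ };
        return $patterns[$n - 1] unless defined $compiled;
    }
    return;    # every prefix of the list compiled cleanly
}

# Stand-in assembler for the demonstration: plain alternation.
my $naive = sub { join '|', @_ };

my $culprit = first_bad_pattern($naive, 'ab+c', 'x(y', 'a\d+');
print defined $culprit ? "culprit: $culprit\n" : "all patterns compile\n";
```

In real use, the code reference would presumably create a fresh
"Regexp::Assemble" object, "add" its arguments and return the result
of "as_string".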
1084
1085 The algorithm used to assemble the regular expressions makes extensive
use of mutually-recursive functions (that is, A calls B, B calls A,
...). For deeply similar expressions, it may be possible to provoke
1088 "Deep recursion" warnings.
1089
1090 The module has been tested extensively, and has an extensive test suite
1091 (that achieves close to 100% statement coverage), but you never know...
1092 A bug may manifest itself in two ways: creating a pattern that cannot
1093 be compiled, such as "a\(bc)", or a pattern that compiles correctly but
1094 that either matches things it shouldn't, or doesn't match things it
should. It is assumed that such problems will occur when the reduction
1096 algorithm encounters some sort of edge case. A temporary work-around is
1097 to disable reductions:
1098
1099 my $pattern = $assembler->reduce(0)->re;
1100
1101 A discussion about implementation details and where bugs might lurk
1102 appears in the README file. If this file is not available locally, you
1103 should be able to find a copy on the Web at your nearest CPAN mirror.
1104
1105 Seriously, though, a number of people have been using this module to
1106 create expressions anywhere from 140Kb to 600Kb in size, and it seems
1107 to be working according to spec. Thus, I don't think there are any
1108 serious bugs remaining.
1109
1110 If you are feeling brave, extensive debugging traces are available to
1111 figure out where assembly goes wrong.
1112
1113 Please report all bugs at
1114 http://rt.cpan.org/NoAuth/Bugs.html?Dist=Regexp-Assemble
1115 <http://rt.cpan.org/NoAuth/Bugs.html?Dist=Regexp-Assemble>
1116
1117 Make sure you include the output from the following two commands:
1118
1119 perl -MRegexp::Assemble -le 'print $Regexp::Assemble::VERSION'
1120 perl -V
1121
1122 There is a mailing list for the discussion of "Regexp::Assemble".
1123 Subscription details are available at
1124 http://listes.mongueurs.net/mailman/listinfo/regexp-assemble
1125 <http://listes.mongueurs.net/mailman/listinfo/regexp-assemble>.
1126
1128 This module grew out of work I did building access maps for Postfix, a
1129 modern SMTP mail transfer agent. See <http://www.postfix.org/> for more
1130 information. I used Perl to build large regular expressions for
1131 blocking dynamic/residential IP addresses to cut down on spam and
1132 viruses. Once I had the code running for this, it was easy to start
1133 adding stuff to block really blatant spam subject lines, bogus HELO
1134 strings, spammer mailer-ids and more...
1135
1136 I presented the work at the French Perl Workshop in 2004, and the thing
1137 most people asked was whether the underlying mechanism for assembling
1138 the REs was available as a module. At that time it was nothing more
than a twisty maze of scripts, all different. The interest shown
1140 indicated that a module was called for. I'd like to thank the people
1141 who showed interest. Hey, it's going to make my messy scripts smaller,
1142 in any case.
1143
1144 Thomas Drugeon was a valuable sounding board for trying out early
1145 ideas. Jean Forget and Philippe Blayo looked over an early version.
1146 H.Merijn Brandt stopped over in Paris one evening, and discussed things
1147 over a few beers.
1148
1149 Nicholas Clark pointed out that while what this module does
1150 (?:c|sh)ould be done in perl's core, as per the 2004 TODO, he
1151 encouraged me to continue with the development of this module. In any
1152 event, this module allows one to gauge the difficulty of undertaking
1153 the endeavour in C. I'd rather gouge my eyes out with a blunt pencil.
1154
1155 Paul Johnson settled the question as to whether this module should live
1156 in the Regex:: namespace, or Regexp:: namespace. If you're not
1157 convinced, try running the following one-liner:
1158
1159 perl -le 'print ref qr//'
1160
1161 Philippe Bruhat found a couple of corner cases where this module could
1162 produce incorrect results. Such feedback is invaluable, and only
1163 improves the module's quality.
1164
1166 David Landgren
1167
1168 Copyright (C) 2004-2008. All rights reserved.
1169
1170 http://www.landgren.net/perl/
1171
1172 If you use this module, I'd love to hear about what you're using it
1173 for. If you want to be informed of updates, send me a note.
1174
1175 You can look at the latest working copy in the following Subversion
1176 repository:
1177
1178 http://svnweb.mongueurs.net/Regexp-Assemble
1179
1181 This library is free software; you can redistribute it and/or modify it
1182 under the same terms as Perl itself.
1183
1184
1185
1186perl v5.12.0 2008-06-17 Assemble(3)