Regexp::Assemble(3)   User Contributed Perl Documentation  Regexp::Assemble(3)

NAME
       Regexp::Assemble - Assemble multiple Regular Expressions into a single
       RE

SYNOPSIS
         use Regexp::Assemble;

         my $ra = Regexp::Assemble->new;
         $ra->add( 'ab+c' );
         $ra->add( 'ab+-' );
         $ra->add( 'a\w\d+' );
         $ra->add( 'a\d+' );
         print $ra->re; # prints a(?:\w?\d+|b+[-c])

DESCRIPTION
       Regexp::Assemble takes an arbitrary number of regular expressions and
       assembles them into a single regular expression (or RE) that matches
       all that the individual REs match.
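
       As a rough illustration of what "matches all that the individual REs
       match" means, the sketch below compares the four source patterns from
       the synopsis with the assembled pattern the synopsis reports. It does
       not call Regexp::Assemble; the assembled pattern is copied by hand, and
       the sample targets are assumptions chosen for this example.

```perl
use strict;
use warnings;

# The four source patterns from the synopsis, and the assembled pattern
# that the synopsis reports for them (copied by hand, not generated here).
my @sources   = ( 'ab+c', 'ab+-', 'a\w\d+', 'a\d+' );
my $assembled = qr/^a(?:\w?\d+|b+[-c])$/;

# Hypothetical sample targets, chosen only for illustration.
my @targets = qw( abc abbb- ax42 a7 ac b12 );

my %agree;
for my $t (@targets) {
    # Does any individual source pattern match the whole target?
    my $individual = ( grep { $t =~ /^(?:$_)$/ } @sources ) ? 1 : 0;
    my $combined   = $t =~ $assembled ? 1 : 0;
    $agree{$t} = $individual == $combined ? 1 : 0;
    printf "%-6s individual=%d assembled=%d\n", $t, $individual, $combined;
}
```

       Both columns should agree for every target, which is the property the
       assembly is meant to preserve.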

       As a result, instead of having a large list of expressions to loop
       over, a target string only needs to be tested against one expression.
       This is interesting when you have several thousand patterns to deal
       with. Serious effort is made to produce the smallest pattern possible.

       It is also possible to track the original patterns, so that you can
       determine which, among the source patterns that form the assembled
       pattern, was the one that caused the match to occur.

       You should realise that large numbers of alternations are processed in
       perl's regular expression engine in O(n) time, not O(1). If you are
       still having performance problems, you should look at using a trie.
       Note that Perl's own regular expression engine will implement trie
       optimisations in perl 5.10 (they are already available in perl 5.9.3 if
       you want to try them out). "Regexp::Assemble" will do the right thing
       when it knows it's running on a trie'd perl.  (At least in some version
       after this one).

       Some more examples of usage appear in the accompanying README. If that
       file is not easy to access locally, you can find it on a web repository
       such as <http://search.cpan.org/dist/Regexp-Assemble/README> or
       <http://cpan.uwinnipeg.ca/htdocs/Regexp-Assemble/README.html>.

       See also "LIMITATIONS".

METHODS
   add(LIST)
       Takes a string, breaks it apart into a set of tokens (respecting meta
       characters) and inserts the resulting list into the "R::A" object. It
       uses a naive regular expression to lex the string that may be fooled
       by complex expressions (specifically, it will fail to lex nested
       parenthetical expressions such as "ab(cd(ef)?gh)ij" correctly). If this
       is the case, the end of the string will not be tokenised correctly and
       will be returned as one long string.

       On the one hand, this may indicate that the patterns you are trying to
       feed the "R::A" object are too complex. Simpler patterns might allow
       the algorithm to work more effectively and perform more reductions in
       the resulting pattern.

       On the other hand, you can supply your own pattern to perform the
       lexing if you need. The test suite contains an example of a lexer
       pattern that will match one level of nested parentheses.

       Note that there is an internal optimisation that will bypass much of
       the lexing process. If a string contains no "\" (backslash), "[" (open
       square bracket), "(" (open paren), "?" (question mark), "+" (plus), "*"
       (star) or "{" (open curly), a character split will be performed
       directly.
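
       The shortcut described above can be pictured with the following sketch.
       It is not the module's own code: "split_if_plain" is a hypothetical
       helper written for this example, and it only mirrors the list of
       characters given in the text.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the characters of a "plain" string as
# tokens, or an empty list to signal that the full lexer would be needed.
sub split_if_plain {
    my ($pattern) = @_;

    # Any of \ [ ( ? + * { means the string needs real lexing.
    return () if $pattern =~ /[\\\[(?+*{]/;

    # Otherwise every character is its own token.
    return split //, $pattern;
}

my @plain   = split_if_plain('a-b_c');    # five one-character tokens
my @complex = split_if_plain('ab+c');     # empty list, because of the "+"

print scalar(@plain),   " tokens from the plain string\n";
print scalar(@complex), " tokens from the complex string\n";
```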

       A list of strings may be supplied, thus you can pass it a file handle
       of a file opened for reading:

           $re->add( '\d+-\d+-\d+-\d+\.example\.com' );
           $re->add( <IN> );

       If the file is very large, it may be more efficient to use a "while"
       loop, to read the file line-by-line:

           $re->add($_) while <IN>;

       The "add" method will chomp the lines automatically. If you do not want
       this to occur (you want to keep the record separator), then disable
       "chomp"ing.

           $re->chomp(0);
           $re->add($_) while <IN>;

       This method is chainable.

   add_file(FILENAME [...])
       Takes a list of file names. Each file is opened and read line by line.
       Each line is added to the assembly.

         $r->add_file( 'file.1', 'file.2' );

       If a file cannot be opened, the method will croak. If you cannot afford
       to let this happen then you should wrap the call in an "eval" block.

       Chomping happens automatically unless you use the chomp(0) method to
       disable it. By default, input lines are read according to the value of
       the "input_record_separator" attribute (if defined), and will otherwise
       fall back to the current setting of the system $/ variable. The record
       separator may also be specified on each call to "add_file". Internally,
       the routine "local"ises the value of $/ to whatever is required, for
       the duration of the call.
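
       The effect of localising $/ can be seen with plain Perl, without the
       module. The sketch below reads colon-separated records from an
       in-memory file; the data and variable names are assumptions made for
       this example.

```perl
use strict;
use warnings;

# Hypothetical data: three patterns separated by ':' instead of newlines.
my $data = 'ab+c:a\d+:xyz';
open my $fh, '<', \$data or die "cannot open in-memory file: $!";

my @records;
{
    # Localise $/ so the change is undone when this block exits.
    local $/ = ':';
    while ( my $rec = <$fh> ) {
        chomp $rec;    # chomp removes the current value of $/
        push @records, $rec;
    }
}
close $fh;

print "$_\n" for @records;
```

       Outside the block, $/ has its previous value again, so code that runs
       later is not affected by the temporary separator.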

       An alternate calling mechanism using a hash reference is available.
       The recognised keys are:

       file
           Reference to a list of file names, or the name of a single file.

             $r->add_file({file => ['file.1', 'file.2', 'file.3']});
             $r->add_file({file => 'file.n'});

       input_record_separator
           If present, indicates what constitutes a line.

             $r->add_file({file => 'data.txt', input_record_separator => ':' });

       rs  An alias for input_record_separator (mnemonic: same as the English
           variable names).

         $r->add_file( {
           file => [ 'pattern.txt', 'more.txt' ],
           input_record_separator  => "\r\n",
         });

   clone()
       Clones the contents of a Regexp::Assemble object and creates a new
       object (in other words it performs a deep copy).

       If the Storable module is installed, its dclone method will be used,
       otherwise the cloning will be performed using a pure perl approach.

       You can use this method to take a snapshot of the patterns that have
       been added so far to an object, and generate an assembly from the
       clone. Additional patterns may be added to the original object
       afterwards.

         my $re = $main->clone->re();
         $main->add( 'another-pattern-\\d+' );

   insert(LIST)
       Takes a list of tokens representing a regular expression and stores
       them in the object. Note: you should not pass it a bare regular
       expression, such as "ab+c?d*e". You must pass it as a list of tokens,
       e.g. "('a', 'b+', 'c?', 'd*', 'e')".

       This method is chainable, e.g.:

         my $ra = Regexp::Assemble->new
           ->insert( qw[ a b+ c? d* e ] )
           ->insert( qw[ a c+ d+ e* f ] );

       Lexing complex patterns with metacharacters and so on can consume a
       significant proportion of the overall time to build an assembly.  If
       you have the information available in a tokenised form, calling
       "insert" directly can be a big win.

   lexstr
       Use the "lexstr" method if you are curious to see how a pattern gets
       tokenised. It takes a scalar on input, representing a pattern, and
       returns a reference to an array, containing the tokenised pattern. You
       can recover the original pattern by performing a "join":

         my $token = $re->lexstr($pattern);
         my $new_pattern = join( '', @$token );

       If the original pattern contains unnecessary backslashes, or "\x4b"
       escapes, or quotemeta escapes ("\Q"..."\E") the resulting pattern may
       not be identical.

       Calling "lexstr" does not add the pattern to the object; it is merely
       for exploratory purposes. It will, however, update various statistical
       counters.

   pre_filter(CODE)
       Allows you to install a callback to check that the pattern being loaded
       contains valid input. It receives the pattern as a whole to be added,
       before it has been tokenised by the lexer. It may return 0 or "undef"
       to indicate that the pattern should not be added; any true value
       indicates that the contents are fine.

       A filter to strip out trailing comments (marked by #):

         $re->pre_filter( sub { $_[0] =~ s/\s*#.*$//; 1 } );

       A filter to ignore blank lines:

         $re->pre_filter( sub { length(shift) } );

       If you want to remove the filter, pass "undef" as a parameter.

         $ra->pre_filter(undef);

       This method is chainable.
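
       To see what the two callbacks above do to their input, they can be
       called directly, outside of any Regexp::Assemble object. The input line
       and the variable names below are assumptions made for this example.

```perl
use strict;
use warnings;

# The two callbacks from the text, stored in variables so that they can
# be called by hand.
my $strip_comment = sub { $_[0] =~ s/\s*#.*$//; 1 };
my $skip_blank    = sub { length(shift) };

# Hypothetical input line with a trailing comment.
my $line = 'ab+c   # matches abc, abbc and so on';

# $_[0] is an alias to $line, so the substitution edits $line itself.
my $keep = $strip_comment->($line);

# A blank line gives a false result, anything else a true one.
my $blank_ok    = $skip_blank->('')     ? 1 : 0;
my $nonblank_ok = $skip_blank->('ab+c') ? 1 : 0;

print "line is now [$line], keep=$keep\n";
print "blank=$blank_ok nonblank=$nonblank_ok\n";
```

       Note that the comment filter always returns 1, so it only edits the
       pattern and never rejects it.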

   filter(CODE)
       Allows you to install a callback to check that the pattern being loaded
       contains valid input. It receives a list on input, after it has been
       tokenised by the lexer. It may return 0 or undef to indicate that
       the pattern should not be added; any true value indicates that the
       contents are fine.

       If you know that all patterns you expect to assemble contain a
       restricted set of tokens (e.g. no spaces), you could do the
       following:

         $ra->filter(sub { not grep { / / } @_ });

       or

         sub only_spaces_and_digits {
           not grep { /[^\d ]/ } @_
         }
         $ra->filter( \&only_spaces_and_digits );

       These two examples will silently ignore faulty patterns. If you want
       the user to be made aware of the problem you should raise an error (via
       "warn" or "die"), log an error message, whatever is best. If you want
       to remove a filter, pass "undef" as a parameter.

         $ra->filter(undef);

       This method is chainable.
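
       A filter callback can be exercised by hand with a list of tokens. The
       sketch below shows the silent "no spaces" filter next to a variant that
       warns before rejecting; the token lists and the name "$no_spaces_loud"
       are assumptions made for this example.

```perl
use strict;
use warnings;

# A filter receives the tokens of one pattern in @_ and returns true to
# accept the pattern, false to reject it.
my $no_spaces = sub { not grep { / / } @_ };

# A noisier variant that warns about the pattern it rejects.
my $no_spaces_loud = sub {
    return 1 unless grep { / / } @_;
    warn "rejected pattern: ", join( '', @_ ), "\n";
    return 0;
};

# Hypothetical token lists, as a lexer might produce them.
my $accepted = $no_spaces->( 'a', 'b+', 'c' )       ? 1 : 0;
my $rejected = $no_spaces->( 'a', '\s', ' ', 'c' )  ? 1 : 0;

print "accepted=$accepted rejected=$rejected\n";
```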

   as_string
       Assemble the expression and return it as a string. You may want to do
       this if you are writing the pattern to a file. The following arguments
       can be passed to control the aspect of the resulting pattern:

       indent, the number of spaces used to indent nested grouping of a
       pattern. Use this to produce a pretty-printed pattern (for some
       definition of "pretty"). The resulting output is rather verbose. The
       reason is to ensure that the metacharacters "(?:" and ")" always occur
       on otherwise empty lines. This allows you to grep the result for an
       even more synthetic view of the pattern:

         egrep -v '^ *[()]' <regexp.file>

       The result of the above is quite readable. Remember to backslash the
       spaces appearing in your own patterns if you wish to use an indented
       pattern in an "m/.../x" construct. Indenting is ignored if tracking is
       enabled.
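
       The advice about backslashing spaces follows from how the "/x" modifier
       treats whitespace. The sketch below uses a hand-indented pattern (not
       output produced by "as_string") and made-up target strings to show the
       difference.

```perl
use strict;
use warnings;

# A compact pattern and a hand-indented equivalent for use with /x.
my $compact  = qr/^foo bar(?:baz|quux)$/;
my $indented = qr/
    ^ foo\ bar      # the literal space is backslashed
    (?:
        baz
      |
        quux
    )
    $
/x;

# Under x, an unescaped space is layout, not part of the pattern.
my $unescaped = qr/^foo bar$/x;

my $both_match    = ( 'foo barquux' =~ $compact && 'foo barquux' =~ $indented ) ? 1 : 0;
my $space_ignored = ( 'foobar' =~ $unescaped && 'foo bar' !~ $unescaped )       ? 1 : 0;

print "both_match=$both_match space_ignored=$space_ignored\n";
```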

       The indent argument takes precedence over the "indent" method/attribute
       of the object.

       Calling this method will drain the internal data structure. Large
       numbers of patterns can eat a significant amount of memory, and this
       lets perl recover the memory used for other purposes.

       If you want to reduce the pattern and continue to add new patterns,
       clone the object and reduce the clone, leaving the original object
       intact.

   re
       Assembles the pattern and returns it as a compiled RE, using the
       "qr//" operator.

       As with "as_string", calling this method will reset the internal data
       structures to free the memory used in assembling the RE.

       The indent attribute, documented in the "as_string" method, can be used
       here (it will be ignored if tracking is enabled).

       With method chaining, it is possible to produce a RE without having a
       temporary "Regexp::Assemble" object lying around, e.g.:

         my $re = Regexp::Assemble->new
           ->add( q[ab+cd+e] )
           ->add( q[ac\\d+e] )
           ->add( q[c\\d+e] )
           ->re;

       The $re variable now contains a Regexp object that can be used
       directly:

         while( <> ) {
           /$re/ and print "Something in [$_] matched\n";
         }

       The "re" method is called when the object is used in string context
       (hence, within an "m//" operator), so by and large you do not even need
       to save the RE in a separate variable. The following will work as
       expected:

         my $re = Regexp::Assemble->new->add( qw[ fee fie foe fum ] );
         while( <IN> ) {
           if( /($re)/ ) {
             print "Here be giants: $1\n";
           }
         }

       This approach does not work with tracked patterns. The "match" and
       "matched" methods must be used instead, see below.

   match(SCALAR)
       The following information applies to Perl 5.8 and below. See the
       section that follows for information on Perl 5.10.

       If pattern tracking is in use, you must "use re 'eval'" in order to
       make things work correctly. At a minimum, this will make your code look
       like this:

           my $did_match = do { use re 'eval'; $target =~ /$ra/ };
           if( $did_match ) {
               print "matched ", $ra->matched, "\n";
           }

       (The main reason is that the $^R variable is currently broken and an
       ugly workaround that runs some Perl code during the match is required,
       in order to simulate what $^R should be doing. See Perl bug #32840 for
       more information if you are curious. The README also contains more
       information). This bug has been fixed in 5.10.

       The important thing to note is that with "use re 'eval'", THERE ARE
       SECURITY IMPLICATIONS WHICH YOU IGNORE AT YOUR PERIL. The problem is
       this: if you do not have strict control over the patterns being fed to
       "Regexp::Assemble" when tracking is enabled, and someone slips you a
       pattern such as "/^(?{system 'rm -rf /'})/" and you attempt to match a
       string against the resulting pattern, you will know Fear and Loathing.

       What is more, the $^R workaround means that tracking does not work
       if you perform a bare "/$re/" pattern match as shown above. You have to
       instead call the "match" method, in order to supply the necessary
       context to take care of the tracking housekeeping details.

          if( defined( my $match = $ra->match($_)) ) {
              print "  $_ matched by $match\n";
          }

       In the case of a successful match, the original matched pattern is
       returned directly. The matched pattern will also be available through
       the "matched" method.

       (Except that the above is not true for 5.6.0: the "match" method
       returns true or undef, and the "matched" method always returns undef).

       If you are capturing parts of the pattern e.g. "foo(bar)rat" you will
       want to get at the captures. See the "mbegin", "mend", "mvar" and
       "capture" methods. If you are not using captures then you may safely
       ignore this section.

       In 5.10, since the bug concerning $^R has been resolved, there is no
       need to use "re 'eval'" and the assembled pattern does not require any
       Perl code to be executed during the match.

   new()
       Creates a new "Regexp::Assemble" object. The following optional
       key/value parameters may be employed. All keys have a corresponding
       method that can be used to change the behaviour later on. As a general
       rule, especially if you're just starting out, you don't have to bother
       with any of these.

       anchor_*, a family of optional attributes that allow anchors ("^",
       "\b", "\Z"...) to be added to the resulting pattern.

       flags, sets the "imsx" flags to add to the assembled regular
       expression.  Warning: no error checking is done, you should ensure that
       the flags you pass are understood by the version of Perl you are using.
       modifiers exists as an alias, for users familiar with Regexp::List.

       chomp, controls whether the pattern should be chomped before being
       lexed. Handy if you are reading patterns from a file. By default,
       "chomp"ing is performed (this behaviour changed as of version 0.24,
       prior versions did not chomp automatically).  See also the "file"
       attribute and the "add_file" method.

       file, slurp the contents of the specified file and add them to the
       assembly. Multiple files may be processed by using a list.

         my $r = Regexp::Assemble->new(file => 're.list');

         my $r = Regexp::Assemble->new(file => ['re.1', 're.2']);

       If you really don't want chomping to occur, you will have to set the
       "chomp" attribute to 0 (zero). You may also want to look at the
       "input_record_separator" attribute, as well.

       input_record_separator, controls what constitutes a record separator
       when using the "file" attribute or the "add_file" method. May be
       abbreviated to rs. See the $/ variable in perlvar.

       lookahead, controls whether the pattern should contain zero-width
       lookahead assertions (for instance: (?=[abc])(?:bob|alice|charles)).
       This is not activated by default, because in many circumstances the
       cost of processing the assertion itself outweighs the benefit of its
       faculty for short-circuiting a match that will fail. This is sensitive
       to the probability of a match succeeding, so if you're worried about
       performance you'll have to benchmark a sample population of targets to
       see which way the benefits lie.

       track, controls whether you want to know which of the initial patterns
       was the one that matched. See the "matched" method for more details.
       Note for version 5.8 of Perl and below, in this mode of operation YOU
       SHOULD BE AWARE OF THE SECURITY IMPLICATIONS that this entails. Perl
       5.10 does not suffer from any such restriction.

       indent, the number of spaces used to indent nested grouping of a
       pattern. Use this to produce a pretty-printed pattern. See the
       "as_string" method for a more detailed explanation.

       pre_filter, allows you to add a callback to enable sanity checks on the
       pattern being loaded. This callback is triggered before the pattern is
       split apart by the lexer. In other words, it operates on the entire
       pattern. If you are loading patterns from a file, this would be an
       appropriate place to remove comments.

       filter, allows you to add a callback to enable sanity checks on the
       pattern being loaded. This callback is triggered after the pattern has
       been split apart by the lexer.

       unroll_plus, controls whether to unroll, for example, "x+" into "x",
       "x*", which may allow additional reductions in the resulting assembled
       pattern.

       reduce, controls whether tail reduction occurs or not. If set, patterns
       like "a(?:bc+d|ec+d)" will be reduced to "a[be]c+d".  That is, the end
       of the pattern in each part of the b... and e...  alternations is
       identical, and hence is hoisted out of the alternation and placed after
       it. On by default. Turn it off if you're really pressed for short
       assembly times.

       lex, specifies the pattern used to lex the input lines into tokens. You
       could replace the default pattern by a more sophisticated version that
       matches arbitrarily nested parentheses, for example.

       debug, controls whether copious amounts of output are produced during
       the loading stage or the reducing stage of assembly.

         my $ra = Regexp::Assemble->new;
         my $rb = Regexp::Assemble->new( chomp => 1, debug => 3 );

       mutable, controls whether new patterns can be added to the object after
       the assembled pattern is generated. DEPRECATED.

       This method/attribute will be removed in a future release. It doesn't
       really serve any purpose, and may be more effectively replaced by
       cloning an existing "Regexp::Assemble" object and spinning out a
       pattern from that instead.
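
       The rewrites mentioned under unroll_plus and reduce do not change what
       a pattern matches. The sketch below checks this with plain Perl for the
       two examples given in the text; the patterns are written by hand rather
       than produced by the module, and the sample targets are assumptions.

```perl
use strict;
use warnings;

# Each pair holds a longer pattern and the equivalent form from the text.
my %pairs = (
    reduce      => [ qr/^a(?:bc+d|ec+d)$/, qr/^a[be]c+d$/ ],
    unroll_plus => [ qr/^x+$/,             qr/^xx*$/      ],
);

# Hypothetical sample targets, chosen only for illustration.
my @targets = qw( abcd aeccd abd ad acd x xxx xy );

my $disagreements = 0;
for my $name ( sort keys %pairs ) {
    my ( $long, $short ) = @{ $pairs{$name} };
    for my $t (@targets) {
        my $l = $t =~ $long  ? 1 : 0;
        my $s = $t =~ $short ? 1 : 0;
        $disagreements++ if $l != $s;
    }
}
print "disagreements: $disagreements\n";
```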

   source()
       When using tracked mode, after a successful match is made, returns the
       original source pattern that caused the match. In Perl 5.10, the $^R
       variable can be used as an index to fetch the correct pattern from
       the object.

       If no successful match has been performed, or the object is not in
       tracked mode, this method returns "undef".

         my $r = Regexp::Assemble->new->track(1)->add(qw(foo? bar{2} [Rr]at));

         for my $w (qw(this food is rather barren)) {
           if ($w =~ /$r/) {
             print "$w matched by ", $r->source($^R), $/;
           }
           else {
             print "$w no match\n";
           }
         }
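
       The idea of using $^R as an index can be tried out without the module
       on Perl 5.10 or later. The sketch below is only an approximation of a
       tracked pattern: the pattern is written by hand, each alternative ends
       in a code block that yields its index, and the "@source" array is an
       assumption standing in for the object's internal list.

```perl
use strict;
use warnings;

# Hand-written stand-in for a tracked pattern: each alternative ends with
# a code block whose value is the index of that alternative.
my @source  = ( 'foo?', 'bar{2}', '[Rr]at' );
my $tracked = qr/(?:foo?(?{0})|bar{2}(?{1})|[Rr]at(?{2}))/;

my %matched_by;
for my $w (qw(this food is rather barren)) {
    if ( $w =~ $tracked ) {
        # $^R holds the value of the last code block that was executed
        # in the successful match.
        $matched_by{$w} = $source[$^R];
        print "$w matched by $matched_by{$w}\n";
    }
    else {
        print "$w no match\n";
    }
}
```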

   mbegin()
       This method returns a copy of "@-" at the moment of the last match. You
       should ordinarily not need to bother with this, "mvar" should be able
       to supply all your needs.

   mend()
       This method returns a copy of "@+" at the moment of the last match.

   mvar(NUMBER)
       The "mvar" method returns the captures of the last match.  mvar(1)
       corresponds to $1, mvar(2) to $2, and so on.  mvar(0) happens to return
       the target string matched, as a byproduct of walking down the "@-" and
       "@+" arrays after the match.

       If called without a parameter, "mvar" will return a reference to an
       array containing all captures.
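
       The relationship between "@-", "@+" and the captures can be shown with
       core Perl alone. The sketch below rebuilds the captures by walking the
       two arrays, which is the kind of bookkeeping the text describes; it is
       not the module's code, and the target string is made up.

```perl
use strict;
use warnings;

my $target = 'xx foobarrat yy';    # hypothetical target string
my ( @begin, @end, @var );

if ( $target =~ /foo(bar)(rat)/ ) {
    # Copies of the match offsets, taken straight after the match.
    @begin = @-;
    @end   = @+;

    # Element 0 is the whole match, element N is capture group N.
    @var = map { substr( $target, $begin[$_], $end[$_] - $begin[$_] ) }
        0 .. $#begin;
}

print "$_: $var[$_]\n" for 0 .. $#var;
```

       Here element 0 is the whole matched text and elements 1 and 2
       correspond to $1 and $2.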

   capture
       The "capture" method returns the captures of the last match as an
       array. Unlike "mvar", this method does not include the matched string.
       It is equivalent to getting an array back that contains "$1, $2, $3,
       ...".

       If no captures were found in the match, an empty array is returned,
       rather than "undef". You are therefore guaranteed to be able to use
       "for my $c ($re->capture) { ..."  without having to check whether
       anything was captured.

   matched()
       If pattern tracking has been set, via the "track" attribute, or through
       the "track" method, this method will return the original pattern of the
       last successful match. Returns undef if no match has yet been
       performed, or tracking has not been enabled.

       See below in the NOTES section for additional subtleties of which you
       should be aware when tracking patterns.

       Note that this method is not available in 5.6.0, due to limitations in
       the implementation of "(?{...})" at the time.

   Statistics/Reporting routines
   stats_add
       Returns the number of patterns added to the assembly (whether by "add"
       or "insert"). Duplicate patterns are not included in this total.

   stats_dup
       Returns the number of duplicate patterns added to the assembly.  If
       non-zero, this may be a sign that something is wrong with your data (or
       at the least, some needless redundancy). This may occur when you have
       two patterns (for instance, "a\-b" and "a-b") which map to the same
       result.

   stats_raw()
       Returns the raw number of bytes in the patterns added to the assembly.
       This includes both original and duplicate patterns.  For instance,
       adding the two patterns "ab" and "ab" will count as 4 bytes.

   stats_cooked()
       Returns the true number of bytes added to the assembly. This will not
       include duplicate patterns. Furthermore, it may differ from the raw
       bytes due to quotemeta treatment. For instance, "abc\,def" will count
       as 7 (not 8) bytes, because "\," will be stored as ",". Also, "\Qa.b\E"
       is 7 bytes long, however, after the quotemeta directives are processed,
       "a\.b" will be stored, for a total of 4 bytes.
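
       The byte counts quoted above can be checked with "length" and
       "quotemeta" from core Perl. This only reproduces the arithmetic in the
       text; it does not call the statistics methods themselves.

```perl
use strict;
use warnings;

# Single-quoted strings keep their backslashes, so these are the raw forms.
my $raw_comma    = length 'abc\,def';        # backslash counted
my $cooked_comma = length 'abc,def';         # stored without the backslash

my $raw_quoted    = length '\Qa.b\E';        # directives counted as typed
my $cooked_quoted = length quotemeta 'a.b';  # what the directives expand to

print "comma:  raw=$raw_comma cooked=$cooked_comma\n";
print "quoted: raw=$raw_quoted cooked=$cooked_quoted\n";
```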

   stats_length()
       Returns the length of the resulting assembled expression.  Until
       "as_string" or "re" have been called, the length will be 0 (since the
       assembly will have not yet been performed). The length includes only
       the pattern, not the additional ("(?-xism...") fluff added by the
       compilation.

   dup_warn(NUMBER|CODEREF)
       Turns warnings about duplicate patterns on or off. By default, no
       warnings are emitted. If the method is called with no parameters, or a
       true parameter, the object will carp about patterns it has already
       seen. To turn off the warnings, use 0 as a parameter.

         $r->dup_warn();

       The method may also be passed a code block. In this case the code will
       be executed and it will receive a reference to the object in question,
       and the lexed pattern.

         $r->dup_warn(
           sub {
             my $self = shift;
             print $self->stats_add, " patterns added at line $.\n",
                 join( '', @_ ), " added previously\n";
           }
         );

   Anchor routines
       Suppose you wish to assemble a series of patterns that all begin with
       "^"  and end with "$" (anchor pattern to the beginning and end of
       line). Rather than add the anchors to each and every pattern (and
       possibly forget to do so when a new entry is added), you may specify
       the anchors in the object, and they will appear in the resulting
       pattern, and you no longer need to (or should) put them in your source
       patterns. For example, the two following snippets will produce
       identical patterns:

         $r->add(qw(^this ^that ^them))->as_string;

         $r->add(qw(this that them))->anchor_line_begin->as_string;

         # both techniques will produce ^th(?:at|em|is)

       The possible anchors are word ("\b") boundaries, line boundaries ("^"
       and "$") and string boundaries ("\A" and "\Z" (or "\z" if you
       absolutely need it)).

       The shortcut "anchor_mumble", which implies both "anchor_mumble_begin"
       and "anchor_mumble_end", is also available. If different anchors are
       specified the most specific anchor wins. For instance, if both
       "anchor_word_begin" and "anchor_line_begin" are specified,
       "anchor_word_begin" takes precedence.

       All the anchor methods are chainable.

   anchor_word_begin
       The resulting pattern will be prefixed with a "\b" word boundary
       assertion when the value is true. Set to 0 to disable.

         $r->add('pre')->anchor_word_begin->as_string;
         # produces '\bpre'

   anchor_word_end
       The resulting pattern will be suffixed with a "\b" word boundary
       assertion when the value is true. Set to 0 to disable.

         $r->add(qw(ing tion))
           ->anchor_word_end
           ->as_string; # produces '(?:tion|ing)\b'

   anchor_word
       The resulting pattern will have "\b" word boundary assertions at the
       beginning and end of the pattern when the value is true. Set to 0 to
       disable.

         $r->add(qw(cat carrot))
           ->anchor_word(1)
           ->as_string; # produces '\bca(?:rro)?t\b'

   anchor_line_begin
       The resulting pattern will be prefixed with a "^" line boundary
       assertion when the value is true. Set to 0 to disable.

         $r->anchor_line_begin;
         # or
         $r->anchor_line_begin(1);

   anchor_line_end
       The resulting pattern will be suffixed with a "$" line boundary
       assertion when the value is true. Set to 0 to disable.

         # turn it off
         $r->anchor_line_end(0);

   anchor_line
       The resulting pattern will have the "^" and "$" line boundary
       assertions at the beginning and end of the pattern, respectively, when
       the value is true. Set to 0 to disable.

         $r->add(qw(cat carrot))
           ->anchor_line
           ->as_string; # produces '^ca(?:rro)?t$'

   anchor_string_begin
       The resulting pattern will be prefixed with a "\A" string boundary
       assertion when the value is true. Set to 0 to disable.

         $r->anchor_string_begin(1);

   anchor_string_end
       The resulting pattern will be suffixed with a "\Z" string boundary
       assertion when the value is true. Set to 0 to disable.

         # disable the string boundary end anchor
         $r->anchor_string_end(0);

   anchor_string_end_absolute
       The resulting pattern will be suffixed with a "\z" string boundary
       assertion when the value is true. Set to 0 to disable.

         # disable the string boundary absolute end anchor
         $r->anchor_string_end_absolute(0);

       If you don't understand the difference between "\Z" and "\z", the
       former will probably do what you want.
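
       For readers who do want the difference spelled out, the sketch below
       shows it with plain Perl: "\Z" also matches just before a trailing
       newline, whereas "\z" matches only at the very end of the string. The
       target strings are made up for this example.

```perl
use strict;
use warnings;

my $bare         = "carrot";
my $with_newline = "carrot\n";    # e.g. a line that was not chomped

# \Z tolerates one trailing newline; \z does not.
my $Z_bare    = $bare         =~ /\Acarrot\Z/ ? 1 : 0;
my $z_bare    = $bare         =~ /\Acarrot\z/ ? 1 : 0;
my $Z_newline = $with_newline =~ /\Acarrot\Z/ ? 1 : 0;
my $z_newline = $with_newline =~ /\Acarrot\z/ ? 1 : 0;

print "bare:         \\Z=$Z_bare \\z=$z_bare\n";
print "with newline: \\Z=$Z_newline \\z=$z_newline\n";
```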

   anchor_string
       The resulting pattern will have the "\A" and "\Z" string boundary
       assertions at the beginning and end of the pattern, respectively, when
       the value is true. Set to 0 to disable.

         $r->add(qw(cat carrot))
           ->anchor_string
           ->as_string; # produces '\Aca(?:rro)?t\Z'

   anchor_string_absolute
       The resulting pattern will have the "\A" and "\z" string boundary
       assertions at the beginning and end of the pattern, respectively, when
       the value is true. Set to 0 to disable.

         $r->add(qw(cat carrot))
           ->anchor_string_absolute
           ->as_string; # produces '\Aca(?:rro)?t\z'

   debug(NUMBER)
       Turns debugging on or off. Statements are printed to the currently
       selected file handle (STDOUT by default).  If you are already using
       this handle, you will have to arrange to select an output handle to a
       file of your own choosing, before calling the "add", "as_string" or
       "re" functions, otherwise it will scribble all over your carefully
       formatted output.

       ·   0

           Off. Turns off all debugging output.

       ·   1

           Add. Trace the addition of patterns.

       ·   2

           Reduce. Trace the process of reduction and assembly.

       ·   4

           Lex. Trace the lexing of the input patterns into their constituent
           tokens.

       ·   8

           Time. Print to STDOUT the time taken to load all the patterns. This
           is nothing more than the difference between the time the object was
           instantiated and the time reduction was initiated.

             # load=<num>

           Any lengthy computation performed in the client code will be
           reflected in this value. Another line will be printed after
           reduction is complete.

             # reduce=<num>

           The above output lines will be changed to "load-epoch" and
           "reduce-epoch" if the internal state of the object is corrupted and
           the initial timestamp is lost.

           The code attempts to load Time::HiRes in order to report fractional
           seconds. If this is not successful, the elapsed time is displayed
           in whole seconds.

       Values can be added (or or'ed together) to trace everything:

         $r->debug(7)->add( '\\d+abc' );

       Calling "debug" with no arguments turns debugging off.
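
       The debug values combine as bit flags. The sketch below shows how the
       value 7 used above is composed; the constant names are hypothetical
       labels invented for this example and are not exported by the module.

```perl
use strict;
use warnings;

# Hypothetical names for the documented debug values.
use constant {
    DEBUG_ADD    => 1,
    DEBUG_REDUCE => 2,
    DEBUG_LEX    => 4,
    DEBUG_TIME   => 8,
};

# Adding or or'ing the values gives the same result.
my $level = DEBUG_ADD | DEBUG_REDUCE | DEBUG_LEX;

# Test individual bits to see which traces a given level enables.
my $traces_lex  = ( $level & DEBUG_LEX )  ? 1 : 0;
my $traces_time = ( $level & DEBUG_TIME ) ? 1 : 0;

print "level=$level lex=$traces_lex time=$traces_time\n";
```

       A level of 7 therefore covers the add, reduce and lex traces but not
       the timing output, which would need 8 to be included as well.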

   dump()
       Produces a synthetic view of the internal data structure. How to
       interpret the results is left as an exercise to the reader.

         print $r->dump;

   chomp(0|1)
       Turns chomping on or off.

       IMPORTANT: As of version 0.24, chomping is now on by default as it
       makes "add_file" Just Work. The only time you may run into trouble is
       with "add("\\$/")". So don't do that, or else explicitly turn off
       chomping.

       To avoid incorporating (spurious) record separators (such as "\n" on
       Unix) when reading from a file, "add()" "chomp"s its input. If you
       don't want this to happen, call "chomp" with a false value.

         $re->chomp(0); # really want the record separators
         $re->add(<DATA>);
749   fold_meta_pairs(NUMBER)
750       Determines whether "\s", "\S" and "\w", "\W" and "\d", "\D" are folded
751       into a "." (dot). Folding happens by default (for reasons of backwards
752       compatibility, even though it is wrong when the "/s" expression
753       modifier is active).
755       Call this method with a false value to prevent this behaviour (which is
756       only a problem when dealing with "\n" if the "/s" expression modifier
757       is also set).
759         $re->add( '\\w', '\\W' );
760         my $clone = $re->clone;
762         $clone->fold_meta_pairs(0);
         print $clone->as_string; # prints '[\W\w]'
         print $re->as_string;    # prints '.'
766   indent(NUMBER)
767       Sets the level of indent for pretty-printing nested groups within a
768       pattern. See the "as_string" method for more details.  When called
769       without a parameter, no indenting is performed.
771         $re->indent( 4 );
772         print $re->as_string;
774   lookahead(0|1)
775       Turns on zero-width lookahead assertions. This is usually beneficial
776       when you expect that the pattern will usually fail.  If you expect that
777       the pattern will usually match you will probably be worse off.
779   flags(STRING)
780       Sets the flags that govern how the pattern behaves (for versions of
781       Perl up to 5.9 or so, these are "imsx"). By default no flags are
782       enabled.
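       As a small sketch of how a flag affects matching, the following
       assumes that the "i" flag is wanted for the whole assembled pattern.
       The word list is made up for the illustration.

```perl
use strict;
use warnings;
use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
$ra->flags('i');              # case-insensitive matching
$ra->add('hello', 'help');

my $re = $ra->re;

# HELLO and Help differ from the added patterns only by case;
# held is a different word altogether.
for my $word (qw(HELLO Help held)) {
    my $verdict = $word =~ /^$re$/ ? 'matches' : 'does not match';
    print "$word $verdict\n";
}
```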
784   modifiers(STRING)
785       An alias of the "flags" method, for users familiar with "Regexp::List".
787   track(0|1)
       Turns tracking on or off. When this attribute is enabled, additional
       housekeeping information is inserted into the assembled expression
       using "(?{...})" embedded code constructs. This provides the necessary
       information to determine which, of the original patterns added, was the
       one that caused the match.
794         $re->track( 1 );
795         if( $target =~ /$re/ ) {
796           print "$target matched by ", $re->matched, "\n";
797         }
799       Note that when this functionality is enabled, no reduction is performed
800       and no character classes are generated. In other words, "brag|tag" is
801       not reduced down to "(?:br|t)ag" and "dig|dim" is not reduced to
802       "di[gm]".
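       A slightly fuller sketch of tracking follows. It goes through the
       module's "match" method instead of the "=~" operator, on the
       assumption that "match" returns a true value on success and that
       "matched" then reports the original pattern. The patterns and target
       strings are invented for the example.

```perl
use strict;
use warnings;
use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
$ra->track(1);
$ra->add('brag', 'tag', 'dig');

# Ask the object to perform the match, then ask which of the
# original patterns was responsible.
for my $target (qw(tag dig dog)) {
    if ($ra->match($target)) {
        print "$target matched by ", $ra->matched, "\n";
    }
    else {
        print "$target: no match\n";
    }
}
```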
804   unroll_plus(0|1)
       Turns the unrolling of plus metacharacters on or off. When a pattern is
       broken up, "a+" becomes "a", "a*" (and "b+?" becomes "b", "b*?"). This
       may allow the freed "a" to assemble with other patterns. Not enabled by
       default.
810   lex(SCALAR)
811       Change the pattern used to break a string apart into tokens.  You can
812       examine the "eg/naive" script as a starting point.
814   reduce(0|1)
815       Turns pattern reduction on or off. A reduced pattern may be
816       considerably shorter than an unreduced pattern. Consider
817       "/sl(?:ip|op|ap)/" versus "/sl[aio]p/". An unreduced pattern will be
818       very similar to those produced by "Regexp::Optimizer". Reduction is on
819       by default. Turning it off speeds assembly (but assembly is pretty fast
820       -- it's the breaking up of the initial patterns in the lexing stage
821       that can consume a non-negligible amount of time).
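       To see the effect of reduction on your own patterns, the two forms can
       be printed side by side. This is only a sketch: the word list is
       invented, and the exact strings produced may differ between versions
       of the module, so no output is shown.

```perl
use strict;
use warnings;
use Regexp::Assemble;

my @words = qw(slip slop slap);

# Reduction is on by default.
my $reduced = Regexp::Assemble->new;
$reduced->add(@words);

# Same patterns, with reduction switched off before assembly.
my $unreduced = Regexp::Assemble->new;
$unreduced->reduce(0);
$unreduced->add(@words);

my $short = $reduced->as_string;
my $long  = $unreduced->as_string;

print "reduced:   $short\n";
print "unreduced: $long\n";
```

       Both strings should accept exactly the same words. Only their size and
       shape differ.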
823   mutable(0|1)
824       This method has been marked as DEPRECATED. It will be removed in a
825       future release. See the "clone" method for a technique to replace its
826       functionality.
828   reset()
829       Empties out the patterns that have been "add"ed or "insert"-ed into the
830       object. Does not modify the state of controller attributes such as
831       "debug", "lex", "reduce" and the like.
833   Default_Lexer
834       Warning: the "Default_Lexer" function is a class method, not an object
835       method. It is a fatal error to call it as an object method.
837       The "Default_Lexer" method lets you replace the default pattern used
838       for all subsequently created "Regexp::Assemble" objects. It will not
839       have any effect on existing objects. (It is also possible to override
840       the lexer pattern used on a per-object basis).
842       The parameter should be an ordinary scalar, not a compiled pattern. If
843       the pattern fails to match all parts of the string, the missing parts
844       will be returned as single chunks. Therefore the following pattern is
845       legal (albeit rather cork-brained):
847           Regexp::Assemble::Default_Lexer( '\\d' );
       The above pattern will split up input strings digit by digit, and
       return all non-digit characters as single chunks.


         "Cannot pass a refname to Default_Lexer"
       You tried to replace the default lexer pattern with an object instead
       of a scalar. Solution: You probably tried to call
       "$obj->Default_Lexer". Call the qualified class method
       "Regexp::Assemble::Default_Lexer" instead.
860         "filter method not passed a coderef"
862         "pre_filter method not passed a coderef"
864       A reference to a subroutine (anonymous or otherwise) was expected.
865       Solution: read the documentation for the "filter" method.
867         "duplicate pattern added: /.../"
869       The "dup_warn" attribute is active, and a duplicate pattern was added
870       (well duh!). Solution: clean your data.
872         "cannot open [file] for input: [reason]"
874       The "add_file" method was unable to open the specified file for
875       whatever reason. Solution: make sure the file exists and the script has
876       the required privileges to read it.


879       This module has been tested successfully with a range of versions of
880       perl, from 5.005_03 to 5.9.3. Use of 5.6.0 is not recommended.
882       The expressions produced by this module can be used with the PCRE
883       library.
885       Remember to "double up" your backslashes if the patterns are hard-coded
886       as constants in your program. That is, you should literally
887       "add('a\\d+b')" rather than "add('a\d+b')". It usually will work either
888       way, but it's good practice to do so.
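       The reason it "usually will work either way" is Perl's own quoting
       rules, not anything this module does. The sketch below uses core Perl
       only. It shows that the two single-quoted forms are the same string,
       and that a double-quoted string silently loses the backslash.

```perl
use strict;
use warnings;

my $doubled = 'a\\d+b';    # backslash written twice
my $single  = 'a\d+b';     # backslash written once

# In single quotes only \\ and \' are special, so both are a\d+b.
print "single-quoted forms are identical\n" if $doubled eq $single;

# In double quotes an unrecognised escape drops its backslash,
# which is why hard-coded patterns are best written in single quotes.
my $interpolated = do { no warnings 'misc'; "a\d+b" };
print "double-quoted form: $interpolated\n";
```

       Doubling the backslash makes the intent explicit, and it is required
       when a backslash comes immediately before the closing quote.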
       Where possible, supply the simplest tokens possible. Don't add
       "X(?:-\d+){2}Y" when "X-\d+-\d+Y" will do. The reason is that if you
       also add "X-\d+Z" the resulting assembly changes dramatically:
       "X(?:(?:-\d+){2}Y|-\d+Z)" versus "X-\d+(?:-\d+Y|Z)". Since R::A doesn't
       perform enough analysis, it won't "unroll" the "{2}" quantifier, and
       will fail to notice the divergence after the first "-\d+".
897       Furthermore, when the string 'X-123000P' is matched against the first
898       assembly, the regexp engine will have to backtrack over each
899       alternation (the one that ends in Y and the one that ends in Z) before
900       determining that there is no match. No such backtracking occurs in the
901       second pattern: as soon as the engine encounters the 'P' in the target
902       string, neither of the alternations at that point ("-\d+Y" or "Z")
903       could succeed and so the match fails.
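       The two assemblies quoted above accept the same strings. Only the
       amount of work done on a failing match differs. This can be checked
       with plain Perl, without the module. In the sketch below the two
       regular expressions are copied from the text and the test strings are
       made up.

```perl
use strict;
use warnings;

# The two assembled forms discussed above.
my $nested = qr/^X(?:(?:-\d+){2}Y|-\d+Z)$/;
my $flat   = qr/^X-\d+(?:-\d+Y|Z)$/;

for my $target ('X-1-2Y', 'X-12Z', 'X-123000P') {
    my $n = $target =~ $nested ? 'match' : 'no match';
    my $f = $target =~ $flat   ? 'match' : 'no match';
    print "$target nested=$n flat=$f\n";
}
```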
905       "Regexp::Assemble" does, however, know how to build character classes.
906       Given "a-b", "axb" and "a\db", it will assemble these into "a[-\dx]b".
907       When "-" (dash) appears as a candidate for a character class it will be
908       the first character in the class. When "^" (circumflex) appears as a
909       candidate for a character class it will be the last character in the
910       class.
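       A short sketch of the character class case described above follows.
       The printed string is whatever the installed version of the module
       produces (the text above gives "a[-\dx]b" for this input). The checks
       that follow rely only on what the assembled pattern should accept.

```perl
use strict;
use warnings;
use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
$ra->add('a-b', 'axb', 'a\\db');

my $str = $ra->as_string;
print "$str\n";

# Anchor the assembled string to test whole words.
my $re = qr/^(?:$str)$/;
for my $target (qw(a-b axb a7b ayb)) {
    my $verdict = $target =~ $re ? 'accepted' : 'rejected';
    print "$target $verdict\n";
}
```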
       It also knows about meta-characters that can "absorb" regular
       characters. For instance, given "X\d" and "X5", it knows that 5 can be
       represented by "\d" and so the assembly is just "X\d".  The "absorbent"
       meta-characters it deals with are ".", "\d", "\s" and "\w" and their
       complements. It will replace "\d"/"\D", "\s"/"\S" and "\w"/"\W" by "."
       (dot), and it will drop "\d" if "\w" is also present (as will "\D" in
       the presence of "\W").
920       "Regexp::Assemble" deals correctly with "quotemeta"'s propensity to
921       backslash many characters that have no need to be. Backslashes on non-
922       metacharacters will be removed. Similarly, in character classes, a
923       number of characters lose their magic and so no longer need to be
924       backslashed within a character class. Two common examples are "."
925       (dot) and "$". Such characters will lose their backslash.
927       At the same time, it will also process "\Q...\E" sequences. When such a
928       sequence is encountered, the inner section is extracted and "quotemeta"
929       is applied to the section. The resulting quoted text is then used in
930       place of the original unquoted text, and the "\Q" and "\E"
931       metacharacters are thrown away. Similar processing occurs with the
932       "\U...\E" and "\L...\E" sequences. This may have surprising effects
933       when using a dispatch table. In this case, you will need to know
934       exactly what the module makes of your input. Use the "lexstr" method to
935       find out what's going on:
937         $pattern = join( '', @{$re->lexstr($pattern)} );
939       If all the digits 0..9 appear in a character class, "Regexp::Assemble"
940       will replace them by "\d". I'd do it for letters as well, but thinking
941       about accented characters and other glyphs hurts my head.
943       In an alternation, the longest paths are chosen first (for example,
944       "horse|bird|dog"). When two paths have the same length, the path with
945       the most subpaths will appear first. This aims to put the "busiest"
946       paths to the front of the alternation. For example, the list "bad",
947       "bit", "few", "fig" and "fun" will produce the pattern
948       "(?:f(?:ew|ig|un)|b(?:ad|it))". See eg/tld for a real-world example of
949       how alternations are sorted. Once you have looked at that, everything
950       should be crystal clear.
       When tracking is in use, no reduction is performed, nor are character
       classes formed. The reason is that it is too difficult to determine the
       original pattern afterwards. Consider the two patterns "pale" and
       "palm". These should be reduced to "pal[em]". The final character
       matches one of two possibilities.  To resolve whether it matched an 'e'
       or 'm' would require keeping track of the fact that the pattern
       finished up in a character class, which would then require a whole lot
959       more work to figure out which character of the class matched. Without
960       character classes it becomes much easier. Instead, "pal(?:e|m)" is
961       produced, which lets us find out more simply where we ended up.
963       Similarly, "dogfood" and "seafood" should form "(?:dog|sea)food".  When
964       the pattern is being assembled, the tracking decision needs to be made
965       at the end of the grouping, but the tail of the pattern has not yet
966       been visited. Deferring things to make this work correctly is a vast
       hassle. In this case, the pattern becomes merely "(?:dogfood|seafood)".
968       Tracked patterns will therefore be bulkier than simple patterns.
970       There is an open bug on this issue:
972       <http://rt.perl.org/rt3/Ticket/Display.html?id=32840>
974       If this bug is ever resolved, tracking would become much easier to deal
975       with (none of the "match" hassle would be required - you could just
976       match like a regular RE and it would Just Work).


979       perlre
980           General information about Perl's regular expressions.
982       re  Specific information about "use re 'eval'".
984       Regex::PreSuf
985           "Regex::PreSuf" takes a string and chops it itself into tokens of
986           length 1. Since it can't deal with tokens of more than one
987           character, it can't deal with meta-characters and thus no regular
988           expressions.  Which is the main reason why I wrote this module.
990       Regexp::Optimizer
           "Regexp::Optimizer" produces regular expressions that are similar
           to those produced by R::A with reductions switched off. Its
           biggest drawback is that it is exponentially slower than
           Regexp::Assemble on very large sets of patterns.
996       Regexp::Parser
997           Fine grained analysis of regular expressions.
999       Regexp::Trie
1000           Funnily enough, this was my working name for "Regexp::Assemble"
1001           during its development. I changed the name because I thought it was
1002           too obscure. Anyway, "Regexp::Trie" does much the same as
1003           "Regexp::Optimizer" and "Regexp::Assemble" except that it runs much
1004           faster (according to the author). It does not recognise meta
1005           characters (that is, 'a+b' is interpreted as 'a\+b').
1007       Text::Trie
1008           "Text::Trie" is well worth investigating. Tries can outperform very
1009           bushy (read: many alternations) patterns.
1011       Tree::Trie
1012           "Tree::Trie" is another module that builds tries. The algorithm
1013           that "Regexp::Assemble" uses appears to be quite similar to the
1014           algorithm described therein, except that "R::A" solves its end-
1015           marker problem without having to rewrite the leaves.

See Also

1018       For alternatives to this module, consider one of:
1020       o Data::Munge
1021       o OnSearch::Regex
1022       o Regex::PreSuf


1025       Some mildly complex cases are not handled well. See
1026       examples/failure.01.pl and
1027       <https://rt.cpan.org/Public/Bug/Display.html?id=104897>.
       See also <https://rt.cpan.org/Public/Bug/Display.html?id=106480> for a
       discussion of some of the issues arising with the use of a huge number
       of alternations. Thanks to Slaven Rezic for the details of trie versus
       non-trie operations within Perl which influence regexp handling of
       alternations.
       "Regexp::Assemble" does not attempt to find common substrings. For
1036       instance, it will not collapse "/cabababc/" down to "/c(?:ab){3}c/".
1037       If there's a module out there that performs this sort of string
1038       analysis I'd like to know about it. But keep in mind that the
1039       algorithms that do this are very expensive: quadratic or worse.
1041       "Regexp::Assemble" does not interpret meta-character modifiers.  For
1042       instance, if the following two patterns are given: "X\d" and "X\d+", it
1043       will not determine that "\d" can be matched by "\d+". Instead, it will
       produce "X(?:\d|\d+)". Along a similar line of reasoning, it will not
       determine that "Z" and "Z\d+" are equivalent to "Z\d*". (It will
       produce "Z(?:\d+)?" instead.)
1048       You cannot remove a pattern that has been added to an object. You'll
1049       just have to start over again. Adding a pattern is difficult enough,
1050       I'd need a solid argument to convince me to add a "remove" method.  If
1051       you need to do this you should read the documentation for the "clone"
1052       method.
1054       "Regexp::Assemble" does not (yet)? employ the "(?>...)"  construct.
1056       The module does not produce POSIX-style regular expressions. This would
1057       be quite easy to add, if there was a demand for it.


1060       Patterns that generate look-ahead assertions sometimes produce
1061       incorrect patterns in certain obscure corner cases. If you suspect that
1062       this is occurring in your pattern, disable lookaheads.
1064       Tracking doesn't really work at all with 5.6.0. It works better in
1065       subsequent 5.6 releases. For maximum reliability, the use of a 5.8
1066       release is strongly recommended. Tracking barely works with 5.005_04.
1067       Of note, using "\d"-style meta-characters invariably causes panics.
1068       Tracking really comes into its own in Perl 5.10.
1070       If you feed "Regexp::Assemble" patterns with nested parentheses, there
1071       is a chance that the resulting pattern will be uncompilable due to
1072       mismatched parentheses (not enough closing parentheses). This is
1073       normal, so long as the default lexer pattern is used. If you want to
       find out which pattern among a list of 3000 patterns is to blame
1075       (speaking from experience here), the eg/debugging script offers a
1076       strategy for pinpointing the pattern at fault. While you may not be
1077       able to use the script directly, the general approach is easy to
1078       implement.
       The algorithm used to assemble the regular expressions makes extensive
       use of mutually-recursive functions (that is, A calls B, B calls A,
       ...). For deeply similar expressions, it may be possible to provoke
       "Deep recursion" warnings.
1085       The module has been tested extensively, and has an extensive test suite
1086       (that achieves close to 100% statement coverage), but you never know...
1087       A bug may manifest itself in two ways: creating a pattern that cannot
1088       be compiled, such as "a\(bc)", or a pattern that compiles correctly but
1089       that either matches things it shouldn't, or doesn't match things it
       should. It is assumed that such problems will occur when the reduction
1091       algorithm encounters some sort of edge case. A temporary work-around is
1092       to disable reductions:
1094         my $pattern = $assembler->reduce(0)->re;
1096       A discussion about implementation details and where bugs might lurk
1097       appears in the README file. If this file is not available locally, you
1098       should be able to find a copy on the Web at your nearest CPAN mirror.
1100       Seriously, though, a number of people have been using this module to
1101       create expressions anywhere from 140Kb to 600Kb in size, and it seems
1102       to be working according to spec. Thus, I don't think there are any
1103       serious bugs remaining.
1105       If you are feeling brave, extensive debugging traces are available to
1106       figure out where assembly goes wrong.
1108       Please report all bugs at
1109       <http://rt.cpan.org/NoAuth/Bugs.html?Dist=Regexp-Assemble>
1111       Make sure you include the output from the following two commands:
1113         perl -MRegexp::Assemble -le 'print $Regexp::Assemble::VERSION'
1114         perl -V
1116       There is a mailing list for the discussion of "Regexp::Assemble".
1117       Subscription details are available at
1118       <http://listes.mongueurs.net/mailman/listinfo/regexp-assemble>.


1121       This module grew out of work I did building access maps for Postfix, a
1122       modern SMTP mail transfer agent. See <http://www.postfix.org/> for more
1123       information. I used Perl to build large regular expressions for
1124       blocking dynamic/residential IP addresses to cut down on spam and
1125       viruses. Once I had the code running for this, it was easy to start
1126       adding stuff to block really blatant spam subject lines, bogus HELO
1127       strings, spammer mailer-ids and more...
1129       I presented the work at the French Perl Workshop in 2004, and the thing
1130       most people asked was whether the underlying mechanism for assembling
1131       the REs was available as a module. At that time it was nothing more
       than a twisty maze of scripts, all different. The interest shown
1133       indicated that a module was called for. I'd like to thank the people
1134       who showed interest. Hey, it's going to make my messy scripts smaller,
1135       in any case.
1137       Thomas Drugeon was a valuable sounding board for trying out early
1138       ideas. Jean Forget and Philippe Blayo looked over an early version.
1139       H.Merijn Brandt stopped over in Paris one evening, and discussed things
1140       over a few beers.
1142       Nicholas Clark pointed out that while what this module does
1143       (?:c|sh)ould be done in perl's core, as per the 2004 TODO, he
1144       encouraged me to continue with the development of this module. In any
1145       event, this module allows one to gauge the difficulty of undertaking
1146       the endeavour in C. I'd rather gouge my eyes out with a blunt pencil.
1148       Paul Johnson settled the question as to whether this module should live
1149       in the Regex:: namespace, or Regexp:: namespace. If you're not
1150       convinced, try running the following one-liner:
1152         perl -le 'print ref qr//'
1154       Philippe Bruhat found a couple of corner cases where this module could
1155       produce incorrect results. Such feedback is invaluable, and only
1156       improves the module's quality.

Machine-Readable Change Log

1159       The file Changes was converted into Changelog.ini by
1160       Module::Metadata::Changes.


1163       David Landgren
1165       Copyright (C) 2004-2011. All rights reserved.
1167         http://www.landgren.net/perl/
1169       If you use this module, I'd love to hear about what you're using it
1170       for. If you want to be informed of updates, send me a note.
1172       Ron Savage is co-maint of the module, starting with V 0.36.


1175       <https://github.com/ronsavage/Regexp-Assemble.git>


1178       1. Tree equivalencies. Currently, /contend/ /content/ /resend/ /resent/
       produces (?:conten[dt]|resen[dt]) but it is possible to produce
1180       (?:cont|res)en[dt] if one can spot the common tail nodes (and walk back
1181       the equivalent paths). Or be by me my => /[bm][ey]/ in the simplest
1182       case.
1184       To do this requires a certain amount of restructuring of the code.
1185       Currently, the algorithm uses a two-phase approach. In the first phase,
1186       the trie is traversed and reductions are performed. In the second
1187       phase, the reduced trie is traversed and the pattern is emitted.
1189       What has to occur is that the reduction and emission have to occur
1190       together. As a node is completed, it is replaced by its string
1191       representation. This then allows child nodes to be compared for
1192       equality with a simple 'eq'. Since there is only a single traversal,
1193       the overall generation time might drop, even though the context baggage
1194       required to delve through the tree will be more expensive to carry
1195       along (a hash rather than a couple of scalars).
       Actually, a simpler approach is to tack on a secret sentinel atom at
1198       the end of every pattern, which gives the reduction algorithm
1199       sufficient traction to create a perfect trie.
1201       I'm rewriting the reduction code using this technique.
1203       2. Investigate how (?>foo) works. Can it be applied?
1205       5. How can a tracked pattern be serialised? (Add freeze and thaw
1206       methods).
1208       6. Store callbacks per tracked pattern.
1210       12. utf-8... hmmmm...
1212       14. Adding qr//'ed patterns. For example, consider
1213           $r->add ( qr/^abc/i )
1214               ->add( qr/^abd/ )
1215               ->add( qr/^ab e/x );
1216           this should admit abc abC aBc aBC abd abe as matches
1218       16. Allow a fast, unsafe tracking mode, that can be used if a(?bc)?
1219           can't happen. (Possibly carp if it does appear during traversal)?
1221       17. given a-\d+-\d+-\d+-\d+-b, produce a(?:-\d+){4}-b. Something
           along the lines of (.{4})(\1+) would let the regexp engine
1223           itself be brought to bear on the matter, which is a rather
1224           appealing idea. Consider
1226             while(/(?!\+)(\S{2,}?)(\1+)/g) { ... $1, $2 ... }
1228           as a starting point.
1230       19. The reduction code has become unbelievably baroque. Adding code
1231           to handle (sting,singing,sing) => s(?:(?:ing)?|t)ing was far
1232           too difficult. Adding more stuff just breaks existing behaviour.
1233           And fixing the ^abcd$ ... bug broke stuff all over again.
1234           Now that the corner cases are more clearly identified, a full
1235           rewrite of the reduction code is needed. And would admit the
1236           possibility of implementing items 1 and 17.
1238       20. Handle debug unrev with a separate bit
1240       23. Japhy's http://www.perlmonks.org/index.pl?node_id=90876 list2range
1241           regexp
       24. Lookahead assertions contain serious bugs (as shown by
           assembling powersets). Need to save more context during reduction,
1245           which in turn will simplify the preparation of the lookahead
1246           classes. See also 19.
       26. _lex() swamps the overall run-time. It stems from the decision
           to use a single regexp to pull apart any pattern. A suite of
           simpler regexps to pick off parens, char classes, quantifiers
           and bare tokens may be faster. (This has been implemented as
           _fastlex(), but it's only marginally faster. Perhaps
           split-by-char and lex a la C?)
1255       27. We don't, as yet, unroll_plus a paren e.g. (abc)+?
1257       28. We don't reroll unrolled a a* to a+ in indented or tracked
1258           output
1260       29. Use (*MARK n) in blead for tracked patterns, and use (*FAIL) for
1261           the unmatchable pattern.
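       The idea in item 17 can be tried out in plain Perl. The sketch below
       is not part of the module: "reroll" is a hypothetical helper that
       applies the back-reference pattern from that item as a substitution.
       It treats the pattern as ordinary text, so it is only safe on inputs
       where the repeated chunk is a complete sub-pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse a chunk that is immediately repeated
# into a single quantified group.
sub reroll {
    my ($pattern) = @_;
    $pattern =~ s{(?!\+)(\S{2,}?)(\1+)}
                 {sprintf '(?:%s){%d}', $1, 1 + length($2) / length($1)}ge;
    return $pattern;
}

my $rolled = reroll('a-\d+-\d+-\d+-\d+-b');
print "$rolled\n";    # prints a(?:-\d+){4}-b
```

       The repeat count is the first copy plus however many further copies
       the "\1+" back-reference consumed.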


1264       This library is free software; you can redistribute it and/or modify it
1265       under the same terms as Perl itself.
perl v5.30.1                      2020-01-30               Regexp::Assemble(3)