Assemble(3)           User Contributed Perl Documentation          Assemble(3)

NAME

       Regexp::Assemble - Assemble multiple Regular Expressions into a single
       RE

VERSION

       This document describes version 0.34 of Regexp::Assemble, released
       2008-06-17.

SYNOPSIS

         use Regexp::Assemble;

         my $ra = Regexp::Assemble->new;
         $ra->add( 'ab+c' );
         $ra->add( 'ab+-' );
         $ra->add( 'a\w\d+' );
         $ra->add( 'a\d+' );
         print $ra->re; # prints a(?:\w?\d+|b+[-c])

DESCRIPTION

       Regexp::Assemble takes an arbitrary number of regular expressions and
       assembles them into a single regular expression (or RE) that matches
       all that the individual REs match.

       As a result, instead of having a large list of expressions to loop
       over, a target string only needs to be tested against one expression.
       This is worthwhile when you have several thousand patterns to deal
       with. Serious effort is made to produce the smallest pattern possible.

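       The difference between the two approaches can be sketched as follows.
       The patterns are those of the SYNOPSIS; the target strings and the
       matches_any helper are made up for the illustration.

```perl
use strict;
use warnings;
use Regexp::Assemble;

# Patterns from the SYNOPSIS; the targets are hypothetical sample data.
my @patterns = ( 'ab+c', 'ab+-', 'a\w\d+', 'a\d+' );
my @targets  = ( 'abbc', 'ab-', 'ax42', 'a7', 'xyz', 'abd' );

# The approach the module replaces: try every pattern in turn.
sub matches_any {
    my ($target) = @_;
    for my $p (@patterns) {
        return 1 if $target =~ /$p/;
    }
    return 0;
}

# The assembled approach: one compiled expression, one test per target.
my $ra = Regexp::Assemble->new;
$ra->add($_) for @patterns;
my $re = $ra->re;

for my $target (@targets) {
    my $loop   = matches_any($target);
    my $single = $target =~ $re ? 1 : 0;
    printf "%-5s loop=%d assembled=%d\n", $target, $loop, $single;
}
```

       Both columns should agree for every target, since the assembled
       expression matches all that the individual expressions match.
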
       It is also possible to track the original patterns, so that you can
       determine which, among the source patterns that form the assembled
       pattern, was the one that caused the match to occur.

       You should realise that large numbers of alternations are processed in
       perl's regular expression engine in O(n) time, not O(1). If you are
       still having performance problems, you should look at using a trie.
       Note that Perl's own regular expression engine will implement trie
       optimisations in perl 5.10 (they are already available in perl 5.9.3 if
       you want to try them out). "Regexp::Assemble" will do the right thing
       when it knows it's running on a trie'd perl.  (At least in some
       version after this one).

       Some more examples of usage appear in the accompanying README. If that
       file isn't easy to access locally, you can find it on a web repository
       such as http://search.cpan.org/dist/Regexp-Assemble/README or
       http://cpan.uwinnipeg.ca/htdocs/Regexp-Assemble/README.html.

METHODS

       new     Creates a new "Regexp::Assemble" object. The following optional
               key/value parameters may be employed. All keys have a
               corresponding method that can be used to change the behaviour
               later on. As a general rule, especially if you're just starting
               out, you don't have to bother with any of these.

               anchor_*, a family of optional attributes that allow anchors
               ("^", "\b", "\Z"...) to be added to the resulting pattern.

               flags, sets the "imsx" flags to add to the assembled regular
               expression.  Warning: no error checking is done, you should
               ensure that the flags you pass are understood by the version of
               Perl you are using. modifiers exists as an alias, for users
               familiar with Regexp::List.

               chomp, controls whether the pattern should be chomped before
               being lexed. Handy if you are reading patterns from a file. By
               default, "chomp"ing is performed (this behaviour changed as of
               version 0.24, prior versions did not chomp automatically).  See
               also the "file" attribute and the "add_file" method.

               file, slurp the contents of the specified file and add them to
               the assembly. Multiple files may be processed by using a list.

                 my $r = Regexp::Assemble->new(file => 're.list');

                 my $r = Regexp::Assemble->new(file => ['re.1', 're.2']);

               If you really don't want chomping to occur, you will have to
               set the "chomp" attribute to 0 (zero). You may also want to
               look at the "input_record_separator" attribute, as well.

               input_record_separator, controls what constitutes a record
               separator when using the "file" attribute or the "add_file"
               method. May be abbreviated to rs. See the $/ variable in
               perlvar.

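               The effect of a record separator can be seen with plain Perl
               and the $/ variable alone. The following is a sketch that does
               not call the module: the read_records helper and the
               colon-separated data are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical pattern data, held in a string so the sketch needs no file.
my $data = 'ab+c:a\d+:x[yz]:';

# Hypothetical helper: read $text as records terminated by $separator.
sub read_records {
    my ( $text, $separator ) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    local $/ = $separator;    # restored automatically when the sub returns
    my @records = <$fh>;
    chomp @records;           # chomp strips the current value of $/
    close $fh;
    return @records;
}

my @records = read_records( $data, ':' );
print "$_\n" for @records;
```

               Each record comes back without its trailing colon, which is
               the form in which a pattern would be wanted for lexing.
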
               lookahead, controls whether the pattern should contain
               zero-width lookahead assertions (for instance,
               "(?=[abc])(?:bob|alice|charles)").  This is not activated by
               default, because in many circumstances the cost of processing
               the assertion itself outweighs the benefit of its faculty for
               short-circuiting a match that will fail. This is sensitive to
               the probability of a match succeeding, so if you're worried
               about performance you'll have to benchmark a sample population
               of targets to see which way the benefits lie.

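               A lookahead of this kind changes how quickly a failing match
               is abandoned, not what is matched. A small check in plain
               Perl, using the hand-written pattern quoted above and made-up
               target strings:

```perl
use strict;
use warnings;

my $plain     = qr/(?:bob|alice|charles)/;
my $lookahead = qr/(?=[abc])(?:bob|alice|charles)/;

# Hypothetical targets: some contain a name, some do not.
my @targets = ( 'alice', 'hello bob', 'charles', 'dave', 'zebra' );

my @disagree;
for my $t (@targets) {
    my $p = $t =~ $plain     ? 1 : 0;
    my $l = $t =~ $lookahead ? 1 : 0;
    print "$t plain=$p lookahead=$l\n";
    push @disagree, $t if $p != $l;
}
```

               The two expressions accept and reject the same targets; only a
               benchmark on representative data can tell which one is faster.
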
               track, controls whether you want to know which of the initial
               patterns was the one that matched. See the "matched" method for
               more details. Note for version 5.8 of Perl and below, in this
               mode of operation YOU SHOULD BE AWARE OF THE SECURITY
               IMPLICATIONS that this entails. Perl 5.10 does not suffer from
               any such restriction.

               indent, the number of spaces used to indent nested grouping of
               a pattern. Use this to produce a pretty-printed pattern. See
               the "as_string" method for a more detailed explanation.

               pre_filter, allows you to add a callback to enable sanity
               checks on the pattern being loaded. This callback is triggered
               before the pattern is split apart by the lexer. In other words,
               it operates on the entire pattern. If you are loading patterns
               from a file, this would be an appropriate place to remove
               comments.

               filter, allows you to add a callback to enable sanity checks on
               the pattern being loaded. This callback is triggered after the
               pattern has been split apart by the lexer.

               unroll_plus, controls whether to unroll, for example, "x+" into
               "x", "x*", which may allow additional reductions in the
               resulting assembled pattern.

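               The unrolling is safe because "x+" and "xx*" match the same
               strings. A minimal check in plain Perl, with hand-written
               patterns standing in for what the module would produce and
               made-up target strings:

```perl
use strict;
use warnings;

my $plus     = qr/^ax+b$/;
my $unrolled = qr/^axx*b$/;    # "x+" unrolled into "x" followed by "x*"

# Hypothetical targets covering zero, one and several occurrences of x.
my @targets = ( 'ab', 'axb', 'axxb', 'axxxb', 'xb' );

my @disagree =
    grep { ( $_ =~ $plus ? 1 : 0 ) != ( $_ =~ $unrolled ? 1 : 0 ) } @targets;

print @disagree
    ? "differs on: @disagree\n"
    : "both patterns agree on every target\n";
```
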
               reduce, controls whether tail reduction occurs or not. If set,
               patterns like "a(?:bc+d|ec+d)" will be reduced to "a[be]c+d".
               That is, the end of the pattern in each part of the b... and
               e...  alternations is identical, and hence is hoisted out of
               the alternation and placed after it. On by default. Turn it off
               if you're really pressed for short assembly times.

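               The two forms quoted above can be compared directly. This
               sketch uses the hand-written patterns from the text, not
               output generated by the module, and made-up target strings:

```perl
use strict;
use warnings;

my $unreduced = qr/^a(?:bc+d|ec+d)$/;
my $reduced   = qr/^a[be]c+d$/;      # the common tail "c+d" hoisted out

# Hypothetical targets: valid b... and e... forms plus near misses.
my @targets = ( 'abcd', 'aeccd', 'abd', 'acd', 'aecd', 'abcde' );

my @disagree =
    grep { ( $_ =~ $unreduced ? 1 : 0 ) != ( $_ =~ $reduced ? 1 : 0 ) }
    @targets;

print @disagree
    ? "differs on: @disagree\n"
    : "both patterns agree on every target\n";
```
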
               lex, specifies the pattern used to lex the input lines into
               tokens. You could replace the default pattern by a more
               sophisticated version that matches arbitrarily nested
               parentheses, for example.

               debug, controls whether copious amounts of output are produced
               during the loading stage or the reducing stage of assembly.

                 my $ra = Regexp::Assemble->new;
                 my $rb = Regexp::Assemble->new( chomp => 1, debug => 3 );

               mutable, controls whether new patterns can be added to the
               object after the assembled pattern is generated. DEPRECATED.

               This method/attribute will be removed in a future release. It
               doesn't really serve any purpose, and may be more effectively
               replaced by cloning an existing "Regexp::Assemble" object and
               spinning out a pattern from that instead.

               A more detailed explanation of these attributes follows.

       clone   Clones the contents of a Regexp::Assemble object and creates a
               new object (in other words it performs a deep copy).

               If the Storable module is installed, its dclone method will be
               used, otherwise the cloning will be performed using a pure perl
               approach.

               You can use this method to take a snapshot of the patterns that
               have been added so far to an object, and generate an assembly
               from the clone. Additional patterns may be added to the
               original object afterwards.

                 my $re = $main->clone->re();
                 $main->add( 'another-pattern-\\d+' );

       add(LIST)
               Takes a string, breaks it apart into a set of tokens
               (respecting meta characters) and inserts the resulting list
               into the "R::A" object. It uses a naive regular expression to
               lex the string that may be fooled by complex expressions
               (specifically, it will fail to lex nested parenthetical
               expressions such as "ab(cd(ef)?gh)ij" correctly). If this is
               the case, the end of the string will not be tokenised correctly
               and will be returned as one long string.

               On the one hand, this may indicate that the patterns you are
               trying to feed the "R::A" object are too complex. Simpler
               patterns might allow the algorithm to work more effectively and
               perform more reductions in the resulting pattern.

               On the other hand, you can supply your own pattern to perform
               the lexing if you need. The test suite contains an example of a
               lexer pattern that will match one level of nested parentheses.

               Note that there is an internal optimisation that will bypass
               much of the lexing process. If a string contains no "\"
               (backslash), "[" (open square bracket), "(" (open paren), "?"
               (question mark), "+" (plus), "*" (star) or "{" (open curly), a
               character split will be performed directly.

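               The rule just described can be sketched in a few lines. This
               is an illustration of the rule as stated, not the module's own
               code; the can_split_directly helper and the sample patterns
               are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true when none of \ [ ( ? + * { appears in the
# pattern, which is the condition given above for the shortcut.
sub can_split_directly {
    my ($pattern) = @_;
    return $pattern !~ /[\\\[(?+*{]/;
}

for my $pattern ( 'fee', 'a-b', 'ab+c', 'a\d' ) {
    if ( can_split_directly($pattern) ) {
        my @tokens = split //, $pattern;    # one token per character
        print "$pattern: split into @tokens\n";
    }
    else {
        print "$pattern: needs the lexer\n";
    }
}
```

               Plain words such as 'fee' take the cheap path, while anything
               carrying a quantifier or an escape goes through the lexer.
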
               A list of strings may be supplied, thus you can pass it a file
               handle of a file opened for reading:

                   $re->add( '\d+-\d+-\d+-\d+\.example\.com' );
                   $re->add( <IN> );

               If the file is very large, it may be more efficient to use a
               "while" loop, to read the file line-by-line:

                   $re->add($_) while <IN>;

               The "add" method will chomp the lines automatically. If you do
               not want this to occur (you want to keep the record separator),
               then disable "chomp"ing.

                   $re->chomp(0);
                   $re->add($_) while <IN>;

               This method is chainable.

       add_file(FILENAME [...])
               Takes a list of file names. Each file is opened and read line
               by line. Each line is added to the assembly.

                 $r->add_file( 'file.1', 'file.2' );

               If a file cannot be opened, the method will croak. If you
               cannot afford to let this happen then you should wrap the call
               in an "eval" block.

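               A minimal sketch of such a wrapper follows. The file name is
               hypothetical and is assumed not to exist, so that the failure
               path is the one exercised.

```perl
use strict;
use warnings;
use Regexp::Assemble;

my $r    = Regexp::Assemble->new;
my $file = 'no-such-file.list';    # hypothetical name, assumed to be absent

# The block returns 1 on success; if add_file croaks, eval returns undef
# and the reason is left in $@.
my $ok = eval { $r->add_file($file); 1 };

if ( !$ok ) {
    warn "could not load $file: $@";
}
```

               Returning an explicit true value from the block avoids having
               to test $@ alone to find out whether the call succeeded.
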
               Chomping happens automatically unless you use the chomp(0)
               method to disable it. By default, input lines are read
               according to the value of the "input_record_separator"
               attribute (if defined), and will otherwise fall back to the
               current setting of the system $/ variable. The record separator
               may also be specified on each call to "add_file". Internally,
               the routine "local"ises the value of $/ to whatever is
               required, for the duration of the call.

               An alternate calling mechanism using a hash reference is
               available.  The recognised keys are:

               file
                   Reference to a list of file names, or the name of a single
                   file.

                     $r->add_file({file => ['file.1', 'file.2', 'file.3']});
                     $r->add_file({file => 'file.n'});

               input_record_separator
                   If present, indicates what constitutes a line.

                     $r->add_file({file => 'data.txt', input_record_separator => ':' });

               rs  An alias for input_record_separator (mnemonic: same as the
                   English variable names).

                 $r->add_file( {
                   file => [ 'pattern.txt', 'more.txt' ],
                   input_record_separator  => "\r\n",
                 });

       insert(LIST)
               Takes a list of tokens representing a regular expression and
               stores them in the object. Note: you should not pass it a bare
               regular expression, such as "ab+c?d*e". You must pass it as a
               list of tokens, e.g. "('a', 'b+', 'c?', 'd*', 'e')".

               This method is chainable, e.g.:

                 my $ra = Regexp::Assemble->new
                   ->insert( qw[ a b+ c? d* e ] )
                   ->insert( qw[ a c+ d+ e* f ] );

               Lexing complex patterns with metacharacters and so on can
               consume a significant proportion of the overall time to build
               an assembly.  If you have the information available in a
               tokenised form, calling "insert" directly can be a big win.

       lexstr  Use the "lexstr" method if you are curious to see how a pattern
               gets tokenised. It takes a scalar on input, representing a
               pattern, and returns a reference to an array, containing the
               tokenised pattern. You can recover the original pattern by
               performing a "join":

                 my $tokens = $re->lexstr($pattern);    # array reference
                 my $new_pattern = join( '', @$tokens );

               If the original pattern contains unnecessary backslashes, or
               "\x4b" escapes, or quotemeta escapes ("\Q"..."\E") the
               resulting pattern may not be identical.

               Calling "lexstr" does not add the pattern to the object, it is
               merely for exploratory purposes. It will, however, update
               various statistical counters.

       pre_filter(CODE)
               Allows you to install a callback to check that the pattern
               being loaded contains valid input. It receives the pattern as a
               whole to be added, before it has been tokenised by the lexer.
               It may return 0 or "undef" to indicate that the pattern should
               not be added; any true value indicates that the contents are
               fine.

               A filter to strip out trailing comments (marked by #):

                 $re->pre_filter( sub { $_[0] =~ s/\s*#.*$//; 1 } );

               A filter to ignore blank lines:

                 $re->pre_filter( sub { length(shift) } );

               If you want to remove the filter, pass "undef" as a parameter.

                 $ra->pre_filter(undef);

               This method is chainable.

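               The two filters above can be combined in a single callback.
               The sketch below runs such a callback in plain Perl over
               made-up input lines. It assumes, as the comment-stripping
               example implies, that the callback sees the pattern through
               @_ aliasing, so that edits to $_[0] affect what is added.

```perl
use strict;
use warnings;

# One callback doing both jobs: strip a trailing comment, then reject the
# line if nothing is left.
my $pre_filter = sub {
    $_[0] =~ s/\s*#.*$//;    # edits the caller's variable through @_
    return length $_[0];     # 0 is false, so empty lines are rejected
};

# Hypothetical input lines, as they might be read from a pattern file.
my @lines = ( 'ab+c  # first pattern', '# only a comment', '', 'a\d+' );

# Stand-in for the loading loop: keep only what the callback accepts.
my @accepted;
for my $line (@lines) {
    push @accepted, $line if $pre_filter->($line);
}

print "$_\n" for @accepted;
```

               The same code reference could then be handed to "pre_filter".
               Note that this simple substitution would also truncate a
               pattern that legitimately contains a # character.
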
       filter(CODE)
               Allows you to install a callback to check that the pattern
               being loaded contains valid input. It receives a list on input,
               after it has been tokenised by the lexer. It may return 0 or
               undef to indicate that the pattern should not be added; any
               true value indicates that the contents are fine.

               If you know that all patterns you expect to assemble contain a
               restricted set of tokens (e.g. no spaces), you could do the
               following:

                 $ra->filter(sub { not grep { / / } @_ });

               or

                 sub only_spaces_and_digits {
                   # reject if any token holds something other than a
                   # digit or a space
                   not grep { /[^\d ]/ } @_
                 }
                 $ra->filter( \&only_spaces_and_digits );

               These two examples will silently ignore faulty patterns. If you
               want the user to be made aware of the problem you should raise
               an error (via "warn" or "die"), log an error message, whatever
               is best. If you want to remove a filter, pass "undef" as a
               parameter.

                 $ra->filter(undef);

               This method is chainable.

       as_string
               Assemble the expression and return it as a string. You may want
               to do this if you are writing the pattern to a file. The
               following arguments can be passed to control the aspect of the
               resulting pattern:

               indent, the number of spaces used to indent nested grouping of
               a pattern. Use this to produce a pretty-printed pattern (for
               some definition of "pretty"). The resulting output is rather
               verbose. The reason is to ensure that the metacharacters "(?:"
               and ")" always occur on otherwise empty lines. This allows you
               to grep the result for an even more synthetic view of the
               pattern:

                 egrep -v '^ *[()]' <regexp.file>

               The result of the above is quite readable. Remember to
               backslash the spaces appearing in your own patterns if you wish
               to use an indented pattern in an "m/.../x" construct. Indenting
               is ignored if tracking is enabled.

               The indent argument takes precedence over the "indent"
               method/attribute of the object.

               Calling this method will drain the internal data structure.
               Large numbers of patterns can eat a significant amount of
               memory, and this lets perl recover the memory used for other
               purposes.

               If you want to reduce the pattern and continue to add new
               patterns, clone the object and reduce the clone, leaving the
               original object intact.

       re      Assembles the pattern and returns it as a compiled RE, using
               the "qr//" operator.

               As with "as_string", calling this method will reset the
               internal data structures to free the memory used in assembling
               the RE.

               The indent attribute, documented in the "as_string" method, can
               be used here (it will be ignored if tracking is enabled).

               With method chaining, it is possible to produce a RE without
               having a temporary "Regexp::Assemble" object lying around,
               e.g.:

                 my $re = Regexp::Assemble->new
                   ->add( q[ab+cd+e] )
                   ->add( q[ac\\d+e] )
                   ->add( q[c\\d+e] )
                   ->re;

               The $re variable now contains a Regexp object that can be used
               directly:

                 while( <> ) {
                   /$re/ and print "Something in [$_] matched\n";
                 }

               The "re" method is called when the object is used in string
               context (hence, within an "m//" operator), so by and large you
               do not even need to save the RE in a separate variable. The
               following will work as expected:

                 my $re = Regexp::Assemble->new->add( qw[ fee fie foe fum ] );
                 while( <IN> ) {
                   if( /($re)/ ) {
                     print "Here be giants: $1\n";
                   }
                 }

               This approach does not work with tracked patterns. The "match"
               and "matched" methods must be used instead, see below.

       match(SCALAR)
               The following information applies to Perl 5.8 and below. See
               the section that follows for information on Perl 5.10.

               If pattern tracking is in use, you must "use re 'eval'" in
               order to make things work correctly. At a minimum, this will
               make your code look like this:

                   my $did_match = do { use re 'eval'; $target =~ /$ra/ };
                   if( $did_match ) {
                       print "matched ", $ra->matched, "\n";
                   }

               (The main reason is that the $^R variable is currently broken
               and an ugly workaround that runs some Perl code during the
               match is required, in order to simulate what $^R should be
               doing. See Perl bug #32840 for more information if you are
               curious. The README also contains more information). This bug
               has been fixed in 5.10.

               The important thing to note is that with "use re 'eval'", THERE
               ARE SECURITY IMPLICATIONS WHICH YOU IGNORE AT YOUR PERIL. The
               problem is this: if you do not have strict control over the
               patterns being fed to "Regexp::Assemble" when tracking is
               enabled, and someone slips you a pattern such as "/^(?{system
               'rm -rf /'})/" and you attempt to match a string against the
               resulting pattern, you will know Fear and Loathing.

               What is more, the $^R workaround means that tracking does
               not work if you perform a bare "/$re/" pattern match as shown
               above. You have to instead call the "match" method, in order to
               supply the necessary context to take care of the tracking
               housekeeping details.

                  if( defined( my $match = $ra->match($_)) ) {
                      print "  $_ matched by $match\n";
                  }

               In the case of a successful match, the original matched pattern
               is returned directly. The matched pattern will also be
               available through the "matched" method.

               (Except that the above is not true for 5.6.0: the "match"
               method returns true or undef, and the "matched" method always
               returns undef).

               If you are capturing parts of the pattern e.g. "foo(bar)rat"
               you will want to get at the captures. See the "mbegin", "mend",
               "mvar" and "capture" methods. If you are not using captures
               then you may safely ignore this section.

               In 5.10, since the bug concerning $^R has been resolved, there
               is no need to use "re 'eval'" and the assembled pattern does
               not require any Perl code to be executed during the match.

462
463               If you are capturing parts of the pattern e.g. "foo(bar)rat"
464               you will want to get at the captures. See the "mbegin", "mend",
465               "mvar" and "capture" methods. If you are not using captures
466               then you may safely ignore this section.
467
468               In 5.10, since the bug concerning $^R has been resolved, there
469               is no need to use "re 'eval'" and the assembled pattern does
470               not require any Perl code to be executed during the match.
471
472       source  When using tracked mode, after a successful match is made,
473               returns the original source pattern that caused the match. In
474               Perl 5.10, the $^R variable can be used to as an index to fetch
475               the correct pattern from the object.
476
477               If no successful match has been performed, or the object is not
478               in tracked mode, this method returns "undef".
479
480                 my $r = Regexp::Assemble->new->track(1)->add(qw(foo? bar{2} [Rr]at));
481
482                 for my $w (qw(this food is rather barren)) {
483                   if ($w =~ /$r/) {
484                     print "$w matched by ", $r->source($^R), $/;
485                   }
486                   else {
487                     print "$w no match\n";
488                   }
489                 }
490
       mbegin  This method returns a copy of "@-" at the moment of the last
               match. You should ordinarily not need to bother with this,
               "mvar" should be able to supply all your needs.

       mend    This method returns a copy of "@+" at the moment of the last
               match.

       mvar(NUMBER)
               The "mvar" method returns the captures of the last match.
               mvar(1) corresponds to $1, mvar(2) to $2, and so on.  mvar(0)
               happens to return the target string matched, as a byproduct of
               walking down the "@-" and "@+" arrays after the match.

               If called without a parameter, "mvar" will return a reference
               to an array containing all captures.

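               How captures can be recovered from "@-" and "@+" is easiest to
               see in plain Perl. The following sketch does not call the
               module; the pattern, the target string and the variable names
               are made up for the illustration.

```perl
use strict;
use warnings;

my $target = 'order 1234-ab shipped';    # made-up target string
my ( @begin, @end, @var );

if ( $target =~ /(\d+)-([a-z]+)/ ) {
    @begin = @-;    # start offset of the whole match and of each group
    @end   = @+;    # the corresponding end offsets

    # Index 0 is the whole match; index N is capture group N.
    @var = map { substr( $target, $begin[$_], $end[$_] - $begin[$_] ) }
        0 .. $#begin;
}

print "$_: $var[$_]\n" for 0 .. $#var;
```

               Element 0 holds the text matched by the whole pattern, and
               elements 1 and 2 hold what $1 and $2 contained.
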
       capture The "capture" method returns the captures of the last match
               as an array. Unlike "mvar", this method does not include the
               matched string. It is equivalent to getting an array back that
               contains "$1, $2, $3, ...".

               If no captures were found in the match, an empty array is
               returned, rather than "undef". You are therefore guaranteed to
               be able to use "for my $c ($re->capture) { ..."  without having
               to check whether anything was captured.

       matched If pattern tracking has been set, via the "track" attribute, or
               through the "track" method, this method will return the
               original pattern of the last successful match. Returns undef if
               no match has yet been performed, or tracking has not been
               enabled.

               See below in the NOTES section for additional subtleties of
               which you should be aware when tracking patterns.

               Note that this method is not available in 5.6.0, due to
               limitations in the implementation of "(?{...})" at the time.

   Statistics/Reporting routines
       stats_add
               Returns the number of patterns added to the assembly (whether
               by "add" or "insert"). Duplicate patterns are not included in
               this total.

       stats_dup
               Returns the number of duplicate patterns added to the assembly.
               If non-zero, this may be a sign that something is wrong with
               your data (or at the least, some needless redundancy). This may
               occur when you have two patterns (for instance, "a\-b" and
               "a-b") which map to the same result.

       stats_raw
               Returns the raw number of bytes in the patterns added to the
               assembly. This includes both original and duplicate patterns.
               For instance, adding the two patterns "ab" and "ab" will count
               as 4 bytes.

       stats_cooked
               Return the true number of bytes added to the assembly. This
               will not include duplicate patterns. Furthermore, it may differ
               from the raw bytes due to quotemeta treatment. For instance,
               "abc\,def" will count as 7 (not 8) bytes, because "\," will be
               stored as ",". Also, "\Qa.b\E" is 7 bytes long, however, after
               the quotemeta directives are processed, "a\.b" will be stored,
               for a total of 4 bytes.

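               The byte counts quoted above can be checked with "length" in
               plain Perl. The sketch measures the strings themselves and
               does not call the module; the variable names are arbitrary.

```perl
use strict;
use warnings;

# Single quotes keep every backslash exactly as written.
my $raw     = 'abc\,def';         # as supplied: 8 bytes
my $cooked  = 'abc,def';          # needless backslash dropped: 7 bytes
my $quoted  = '\Qa.b\E';          # as supplied: 7 bytes
my $literal = quotemeta 'a.b';    # what \Q...\E stands for, a\.b: 4 bytes

printf "%-9s %d\n", $_, length($_) for $raw, $cooked, $quoted, $literal;
```
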
       stats_length
               Returns the length of the resulting assembled expression.
               Until "as_string" or "re" have been called, the length will be
               0 (since the assembly will not yet have been performed). The
               length includes only the pattern, not the additional
               ("(?-xism...") fluff added by the compilation.

       dup_warn(NUMBER|CODEREF)
               Turns warnings about duplicate patterns on or off. By default,
               no warnings are emitted. If the method is called with no
               parameters, or a true parameter, the object will carp about
               patterns it has already seen. To turn off the warnings, use 0
               as a parameter.

                 $r->dup_warn();

               The method may also be passed a code block. In this case the
               code will be executed and it will receive a reference to the
               object in question, and the lexed pattern.

                 $r->dup_warn(
                   sub {
                     my $self = shift;
                     print $self->stats_add, " patterns added at line $.\n",
                         join( '', @_ ), " added previously\n";
                   }
                 );

   Anchor routines
       Suppose you wish to assemble a series of patterns that all begin with
       "^"  and end with "$" (anchor pattern to the beginning and end of
       line). Rather than add the anchors to each and every pattern (and
       possibly forget to do so when a new entry is added), you may specify
       the anchors in the object, and they will appear in the resulting
       pattern, and you no longer need to (or should) put them in your source
       patterns. For example, the two following snippets will produce
       identical patterns:

         $r->add(qw(^this ^that ^them))->as_string;

         $r->add(qw(this that them))->anchor_line_begin->as_string;

         # both techniques will produce ^th(?:at|em|is)

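       That the single anchored pattern accepts what the individually
       anchored patterns accept can be checked in plain Perl. The assembled
       form is copied from the comment above rather than generated, and the
       target strings are made up.

```perl
use strict;
use warnings;

my $separate  = qr/^this|^that|^them/;    # every alternative anchored
my $assembled = qr/^th(?:at|em|is)/;      # anchor factored out once

# Hypothetical targets: three hits, a near miss and an unanchored hit.
my @targets = ( 'this', 'that one', 'them', 'then', 'not this' );

my @disagree =
    grep { ( $_ =~ $separate ? 1 : 0 ) != ( $_ =~ $assembled ? 1 : 0 ) }
    @targets;

print @disagree
    ? "differs on: @disagree\n"
    : "both patterns agree on every target\n";
```
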
       All anchors are possible: word boundaries ("\b"), line boundaries ("^"
       and "$") and string boundaries ("\A" and "\Z" (or "\z" if you
       absolutely need it)).

       The shortcut "anchor_mumble", which implies both "anchor_mumble_begin"
       and "anchor_mumble_end", is also available. If different anchors are
       specified the most specific anchor wins. For instance, if both
       "anchor_word_begin" and "anchor_line_begin" are specified,
       "anchor_word_begin" takes precedence.

       All the anchor methods are chainable.

612       anchor_word_begin
613               The resulting pattern will be prefixed with a "\b" word
614               boundary assertion when the value is true. Set to 0 to disable.
615
616                 $r->add('pre')->anchor_word_begin->as_string;
617                 # produces '\bpre'
618
619       anchor_word_end
620               The resulting pattern will be suffixed with a "\b" word
621               boundary assertion when the value is true. Set to 0 to disable.
622
623                 $r->add(qw(ing tion))
624                   ->anchor_word_end
625                   ->as_string; # produces '(?:tion|ing)\b'
626
627       anchor_word
628               The resulting pattern will be have "\b" word boundary
629               assertions at the beginning and end of the pattern when the
630               value is true. Set to 0 to disable.
631
632                 $r->add(qw(cat carrot)
633                   ->anchor_word(1)
634                   ->as_string; # produces '\bca(?:rro)t\b'
635
636       anchor_line_begin
637               The resulting pattern will be prefixed with a "^" line boundary
638               assertion when the value is true. Set to 0 to disable.
639
640                 $r->anchor_line_begin;
641                 # or
642                 $r->anchor_line_begin(1);
643
644       anchor_line_end
645               The resulting pattern will be suffixed with a "$" line boundary
646               assertion when the value is true. Set to 0 to disable.
647
648                 # turn it off
649                 $r->anchor_line_end(0);
650
651       anchor_line
652               The resulting pattern will have the "^" and "$" line
653               boundary assertions at the beginning and end of the pattern,
654               respectively, when the value is true. Set to 0 to disable.
655
656                 $r->add(qw(cat carrot))
657                   ->anchor_line
658                   ->as_string; # produces '^ca(?:rro)?t$'
659
660       anchor_string_begin
661               The resulting pattern will be prefixed with a "\A" string
662               boundary assertion when the value is true. Set to 0 to disable.
663
664                 $r->anchor_string_begin(1);
665
666       anchor_string_end
667               The resulting pattern will be suffixed with a "\Z" string
668               boundary assertion when the value is true. Set to 0 to disable.
669
670                 # disable the string boundary end anchor
671                 $r->anchor_string_end(0);
672
673       anchor_string_end_absolute
674               The resulting pattern will be suffixed with a "\z" string
675               boundary assertion when the value is true. Set to 0 to disable.
676
677                 # disable the string boundary absolute end anchor
678                 $r->anchor_string_end_absolute(0);
679
680               If you don't understand the difference between "\Z" and "\z",
681               the former will probably do what you want.
682
683       anchor_string
684               The resulting pattern will have the "\A" and "\Z" string
685               boundary assertions at the beginning and end of the pattern,
686               respectively, when the value is true. Set to 0 to disable.
687
688                 $r->add(qw(cat carrot))
689                   ->anchor_string
690                   ->as_string; # produces '\Aca(?:rro)?t\Z'
691
692       anchor_string_absolute
693               The resulting pattern will have the "\A" and "\z" string
694               boundary assertions at the beginning and end of the pattern,
695               respectively, when the value is true. Set to 0 to disable.
696
697                 $r->add(qw(cat carrot))
698                   ->anchor_string_absolute
699                   ->as_string; # produces '\Aca(?:rro)?t\z'
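
                  The difference between "\Z" and "\z" only shows up when the
                  target ends in a newline. The sketch below, which assumes
                  the anchors are applied exactly as described above, builds
                  the same pattern with each pair of string anchors and tries
                  both against a target with and without a trailing newline.

```perl
use strict;
use warnings;
use Regexp::Assemble;

my @words = qw(cat carrot);

# Same patterns, two different end-of-string anchors.
my $relaxed  = Regexp::Assemble->new->add(@words)->anchor_string->re;
my $absolute = Regexp::Assemble->new->add(@words)->anchor_string_absolute->re;

for my $target ("cat", "cat\n") {
    (my $shown = $target) =~ s/\n/\\n/;    # make the newline visible
    printf "%-6s \\Z: %-3s \\z: %s\n", $shown,
        ($target =~ $relaxed  ? 'yes' : 'no'),
        ($target =~ $absolute ? 'yes' : 'no');
}
```

                  "\Z" tolerates one trailing newline, so both targets should
                  match the first pattern, whereas "\z" insists on the very
                  end of the string and should reject "cat\n".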
700
701       debug(NUMBER)
702               Turns debugging on or off. Statements are printed to the
703               currently selected file handle (STDOUT by default).  If you are
704               already using this handle, you will have to arrange to select
705               an output handle to a file of your own choosing, before calling
706               the "add", "as_string" or "re" functions, otherwise it will
707               scribble all over your carefully formatted output.
708
709               0       Off. Turns off all debugging output.
710
711               1       Add. Trace the addition of patterns.
712
713               2       Reduce. Trace the process of reduction and assembly.
714
715               4       Lex. Trace the lexing of the input patterns into its
716                       constituent tokens.
717
718               8       Time. Print to STDOUT the time taken to load all the
719                       patterns. This is nothing more than the difference
720                       between the time the object was instantiated and the
721                       time reduction was initiated.
722
723                         # load=<num>
724
725                       Any lengthy computation performed in the client code
726                       will be reflected in this value. Another line will be
727                       printed after reduction is complete.
728
729                         # reduce=<num>
730
731                       The above output lines will be changed to "load-epoch"
732                       and "reduce-epoch" if the internal state of the object
733                       is corrupted and the initial timestamp is lost.
734
735                       The code attempts to load Time::HiRes in order to
736                       report fractional seconds. If this is not successful,
737                       the elapsed time is displayed in whole seconds.
738
739               Values can be added (or or'ed together) to trace everything
740
741                 $r->debug(7)->add( '\\d+abc' );
742
743               Calling "debug" with no arguments turns debugging off.
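
                  One way to keep the trace away from your own output is to
                  select a different handle while the object is working. The
                  sketch below uses an in-memory handle; the variable names
                  are illustrative, and it relies on the statement above that
                  traces go to the currently selected handle.

```perl
use strict;
use warnings;
use Regexp::Assemble;

# Collect the trace in a scalar instead of letting it reach STDOUT.
my $trace = '';
open my $trace_fh, '>', \$trace
    or die "cannot open in-memory handle: $!";

my $previous = select $trace_fh;          # traces now go to $trace_fh
my $r = Regexp::Assemble->new->debug(3);  # 1 (add) + 2 (reduce)
$r->add('ab+c', 'ab+d');
my $re = $r->re;
select $previous;                         # restore the original handle
close $trace_fh;

print "pattern: $re\n";                   # clean output on STDOUT
print STDERR $trace;                      # inspect the trace separately
```

                  A real file opened for writing works the same way; the only
                  requirement is that the handle is selected before "add",
                  "as_string" or "re" is called.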
744
745       dump    Produces a synthetic view of the internal data structure. How
746               to interpret the results is left as an exercise to the reader.
747
748                 print $r->dump;
749
750       chomp(0|1)
751               Turns chomping on or off.
752
753               IMPORTANT: As of version 0.24, chomping is now on by default as
754               it makes "add_file" Just Work. The only time you may run into
755               trouble is with "add("\\$/")". So don't do that, or else
756               explicitly turn off chomping.
757
758               To avoid incorporating (spurious) record separators (such as
759               "\n" on Unix) when reading from a file, "add()" "chomp"s its
760               input. If you don't want this to happen, call "chomp" with a
761               false value.
762
763                 $re->chomp(0); # really want the record separators
764                 $re->add(<DATA>);
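
                  With chomping left on (the default), lines read from a
                  handle can be passed to "add" as they are. The sketch below
                  uses an in-memory file as a stand-in for a real pattern
                  file; the data is hypothetical.

```perl
use strict;
use warnings;
use Regexp::Assemble;

# An in-memory file stands in for a real pattern file.
my $data = "cat\ncarrot\n";
open my $fh, '<', \$data
    or die "cannot open in-memory handle: $!";

my $r = Regexp::Assemble->new;   # chomping is on by default
$r->add(<$fh>);                  # each line arrives with its trailing "\n"
close $fh;

my $re = $r->re;
for my $word (qw(cat carrot)) {
    print "$word matches exactly\n" if $word =~ /\A$re\z/;
}
```

                  Because the record separators were removed on the way in,
                  both words should match between "\A" and "\z" with no
                  newline in the target.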
765
766       fold_meta_pairs(NUMBER)
767               Determines whether "\s", "\S" and "\w", "\W" and "\d", "\D" are
768               folded into a "." (dot). Folding happens by default (for
769               reasons of backwards compatibility, even though it is wrong
770               when the "/s" expression modifier is not active).
771
772               Call this method with a false value to prevent this behaviour
773               (which is only a problem when dealing with "\n" if the "/s"
774               expression modifier is not set, since "." then fails to match "\n").
775
776                 $re->add( '\\w', '\\W' );
777                 my $clone = $re->clone;
778
779                 $clone->fold_meta_pairs(0);
780                 print $clone->as_string; # prints '[\W\w]'
781                 print $re->as_string;    # prints '.'
782
783       indent(NUMBER)
784               Sets the level of indent for pretty-printing nested groups
785               within a pattern. See the "as_string" method for more details.
786               When called without a parameter, no indenting is performed.
787
788                 $re->indent( 4 );
789                 print $re->as_string;
790
791       lookahead(0|1)
792               Turns on zero-width lookahead assertions. This is usually
793               beneficial when you expect that the pattern will usually fail.
794               If you expect that the pattern will usually match you will
795               probably be worse off.
796
797       flags(STRING)
798               Sets the flags that govern how the pattern behaves (for
799               versions of Perl up to 5.9 or so, these are "imsx"). By default
800               no flags are enabled.
801
802       modifiers(STRING)
803               An alias of the "flags" method, for users familiar with
804               "Regexp::List".
805
806       track(0|1)
807               Turns tracking on or off. When this attribute is enabled,
808               additional housekeeping information is inserted into the
809               assembled expression using "({...}" embedded code constructs.
810               This provides the necessary information to determine which, of
811               the original patterns added, was the one that caused the match.
812
813                 $re->track( 1 );
814                 if( $target =~ /$re/ ) {
815                   print "$target matched by ", $re->matched, "\n";
816                 }
817
818               Note that when this functionality is enabled, no reduction is
819               performed and no character classes are generated. In other
820               words, "brag|tag" is not reduced down to "(?:br|t)ag" and
821               "dig|dim" is not reduced to "di[gm]".
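
                  The following sketch shows tracking in use. It assumes that
                  the module's "match" method returns the original pattern
                  when the target matches and undef otherwise; check that
                  against the documentation of "match" for your version. The
                  words and targets are made up for illustration.

```perl
use strict;
use warnings;
use Regexp::Assemble;

# Tracking must be enabled before the pattern is assembled.
my $re = Regexp::Assemble->new->track(1)->add(qw(dogfood seafood));

for my $target (qw(seafood catfood)) {
    my $source = $re->match($target);    # assumed: pattern or undef
    if (defined $source) {
        print "$target matched by $source\n";
    }
    else {
        print "$target did not match\n";
    }
}
```

                  Going through "match" rather than a bare "=~" leaves the
                  handling of the embedded code constructs to the module.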
822
823       unroll_plus(0|1)
824               Turns the unrolling of plus metacharacters on or off. When a
825               pattern is broken up, "a+" becomes "a", "a*" (and "b+?" becomes
826               "b", "b*?"). This may allow the freed "a" to assemble with other
827               patterns. Not enabled by default.
828
829       lex(SCALAR)
830               Change the pattern used to break a string apart into tokens.
831               You can examine the "eg/naive" script as a starting point.
832
833       reduce(0|1)
834               Turns pattern reduction on or off. A reduced pattern may be
835               considerably shorter than an unreduced pattern. Consider
836               "/sl(?:ip|op|ap)/" versus "/sl[aio]p/". An unreduced pattern
837               will be very similar to those produced by "Regexp::Optimizer".
838               Reduction is on by default. Turning it off speeds assembly (but
839               assembly is pretty fast -- it's the breaking up of the initial
840               patterns in the lexing stage that can consume a non-negligible
841               amount of time).
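
                  Reduction changes the shape of the pattern, not what it
                  matches. The sketch below assembles the same words with and
                  without reduction and checks both results against the same
                  targets; the exact strings printed depend on the module
                  version, so none are shown here.

```perl
use strict;
use warnings;
use Regexp::Assemble;

my @words = qw(slip slop slap);
my $reduced   = Regexp::Assemble->new->add(@words);
my $unreduced = Regexp::Assemble->new->reduce(0)->add(@words);

# Assembling is the last step for each object.
my ($short, $long) = ($reduced->as_string, $unreduced->as_string);
print "reduced:   $short\n";
print "unreduced: $long\n";

# Both patterns should accept and reject the same targets.
for my $word (qw(slap slup)) {
    my @hits = grep { $word =~ /\A(?:$_)\z/ } $short, $long;
    printf "%s matched by %d of 2 patterns\n", $word, scalar @hits;
}
```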
842
843       mutable(0|1)
844               This method has been marked as DEPRECATED. It will be removed
845               in a future release. See the "clone" method for a technique to
846               replace its functionality.
847
848       reset   Empties out the patterns that have been "add"ed or "insert"-ed
849               into the object. Does not modify the state of controller
850               attributes such as "debug", "lex", "reduce" and the like.
851
852       Default_Lexer
853               Warning: the "Default_Lexer" function is a class method, not an
854               object method. It is a fatal error to call it as an object
855               method.
856
857               The "Default_Lexer" method lets you replace the default pattern
858               used for all subsequently created "Regexp::Assemble" objects.
859               It will not have any effect on existing objects. (It is also
860               possible to override the lexer pattern used on a per-object
861               basis).
862
863               The parameter should be an ordinary scalar, not a compiled
864               pattern. If the pattern fails to match all parts of the string,
865               the missing parts will be returned as single chunks. Therefore
866               the following pattern is legal (albeit rather cork-brained):
867
868                   Regexp::Assemble::Default_Lexer( '\\d' );
869
870               The above pattern will split up input strings digit by digit,
871               and all non-digit characters as single chunks.
872

DIAGNOSTICS

874         "Cannot pass a C<refname> to Default_Lexer"
875
876       You tried to replace the default lexer pattern with an object instead
877       of a scalar. Solution: You probably tried to call
878       "$obj->Default_Lexer". Call the qualified class method instead:
879       "Regexp::Assemble::Default_Lexer".
880
881         "filter method not passed a coderef"
882
883         "pre_filter method not passed a coderef"
884
885       A reference to a subroutine (anonymous or otherwise) was expected.
886       Solution: read the documentation for the "filter" method.
887
888         "duplicate pattern added: /.../"
889
890       The "dup_warn" attribute is active, and a duplicate pattern was added
891       (well duh!). Solution: clean your data.
892
893         "cannot open [file] for input: [reason]"
894
895       The "add_file" method was unable to open the specified file for
896       whatever reason. Solution: make sure the file exists and the script has
897       the required privileges to read it.
898

NOTES

900       This module has been tested successfully with a range of versions of
901       perl, from 5.005_03 to 5.9.3. Use of 5.6.0 is not recommended.
902
903       The expressions produced by this module can be used with the PCRE
904       library.
905
906       Remember to "double up" your backslashes if the patterns are hard-coded
907       as constants in your program. That is, you should literally
908       "add('a\\d+b')" rather than "add('a\d+b')". It usually will work either
909       way, but it's good practice to do so.
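
          The reason it usually works either way is plain Perl quoting and
          has nothing to do with this module. The short sketch below compares
          the spellings; the variable names are illustrative.

```perl
use strict;
use warnings;

my $doubled = 'a\\d+b';   # backslash written twice
my $single  = 'a\d+b';    # backslash written once

# In single quotes both spellings yield the same five characters.
print "single-quoted forms are identical\n" if $doubled eq $single;

# Double quotes are another matter: "\d" is not a string escape,
# so the backslash is dropped (perl normally warns about this).
my $interpolated = do { no warnings 'misc'; "a\d+b" };
print "double-quoted form is '$interpolated'\n";   # 'ad+b'
```

          Doubling the backslash keeps the pattern intact whichever kind of
          quotes it ends up in.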
910
911       Where possible, supply the simplest tokens possible. Don't add
912       "X(?:-\d+){2}Y" when "X-\d+-\d+Y" will do. The reason is that if you
913       also add "X-\d+Z" the resulting assembly changes dramatically:
914       "X(?:(?:-\d+){2}Y|-\d+Z)" versus "X-\d+(?:-\d+Y|Z)". Since R::A doesn't
915       perform enough analysis, it won't "unroll" the "{2}" quantifier, and
916       will fail to notice the divergence after the first "-\d+".
917
918       Furthermore, when the string 'X-123000P' is matched against the first
919       assembly, the regexp engine will have to backtrack over each
920       alternation (the one that ends in Y and the one that ends in Z) before
921       determining that there is no match. No such backtracking occurs in the
922       second pattern: as soon as the engine encounters the 'P' in the target
923       string, neither of the alternations at that point ("-\d+Y" or "Z")
924       could succeed and so the match fails.
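
          The two assemblies accept exactly the same strings; only the amount
          of work needed to reject a near miss differs. The sketch below
          checks the equivalence with ordinary Perl patterns, anchored with
          "\A" and "\z" for the comparison. The first two targets are made up
          for illustration.

```perl
use strict;
use warnings;

# The two assemblies discussed above, anchored for an exact comparison.
my $grouped  = qr/\AX(?:(?:-\d+){2}Y|-\d+Z)\z/;
my $unrolled = qr/\AX-\d+(?:-\d+Y|Z)\z/;

for my $target (qw(X-1-2Y X-12Z X-123000P)) {
    printf "%-10s grouped=%d unrolled=%d\n", $target,
        ($target =~ $grouped  ? 1 : 0),
        ($target =~ $unrolled ? 1 : 0);
}
```

          Both patterns accept the first two targets and reject 'X-123000P'.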
925
926       "Regexp::Assemble" does, however, know how to build character classes.
927       Given "a-b", "axb" and "a\db", it will assemble these into "a[-\dx]b".
928       When "-" (dash) appears as a candidate for a character class it will be
929       the first character in the class. When "^" (circumflex) appears as a
930       candidate for a character class it will be the last character in the
931       class.
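
          A quick way to convince yourself that the character class is
          equivalent to the three source patterns is to test it. This is a
          minimal sketch; the target strings are made up, and the printed
          pattern may vary between versions.

```perl
use strict;
use warnings;
use Regexp::Assemble;

my $r = Regexp::Assemble->new;
$r->add('a-b', 'axb', 'a\\db');
my $str = $r->as_string;
print "assembled: $str\n";

# The first three targets are covered by a source pattern, the last is not.
for my $target (qw(a-b axb a7b ayb)) {
    printf "%s %s\n", $target,
        ($target =~ /\A(?:$str)\z/ ? 'matches' : 'does not match');
}
```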
932
933       It also knows about meta-characters that can "absorb" regular
934       characters. For instance, given "X\d" and "X5", it knows that 5 can be
935       represented by "\d" and so the assembly is just "X\d".  The "absorbent"
936       meta-characters it deals with are ".", "\d", "\s" and "\w" and their
937       complements. It will replace "\d"/"\D", "\s"/"\S" and "\w"/"\W" by "."
938       (dot), and it will drop "\d" if "\w" is also present (as will "\D" in
939       the presence of "\W").
940
941       "Regexp::Assemble" deals correctly with "quotemeta"'s propensity to
942       backslash many characters that have no need to be. Backslashes on non-
943       metacharacters will be removed. Similarly, in character classes, a
944       number of characters lose their magic and so no longer need to be
945       backslashed within a character class. Two common examples are "."
946       (dot) and "$". Such characters will lose their backslash.
947
948       At the same time, it will also process "\Q...\E" sequences. When such a
949       sequence is encountered, the inner section is extracted and "quotemeta"
950       is applied to the section. The resulting quoted text is then used in
951       place of the original unquoted text, and the "\Q" and "\E"
952       metacharacters are thrown away. Similar processing occurs with the
953       "\U...\E" and "\L...\E" sequences. This may have surprising effects
954       when using a dispatch table. In this case, you will need to know
955       exactly what the module makes of your input. Use the "lexstr" method to
956       find out what's going on:
957
958         $pattern = join( '', @{$re->lexstr($pattern)} );
959
960       If all the digits 0..9 appear in a character class, "Regexp::Assemble"
961       will replace them by "\d". I'd do it for letters as well, but thinking
962       about accented characters and other glyphs hurts my head.
963
964       In an alternation, the longest paths are chosen first (for example,
965       "horse|bird|dog"). When two paths have the same length, the path with
966       the most subpaths will appear first. This aims to put the "busiest"
967       paths to the front of the alternation. For example, the list "bad",
968       "bit", "few", "fig" and "fun" will produce the pattern
969       "(?:f(?:ew|ig|un)|b(?:ad|it))". See eg/tld for a real-world example of
970       how alternations are sorted. Once you have looked at that, everything
971       should be crystal clear.
972
973       When tracking is in use, no reduction is performed, nor are character
974       classes formed. The reason is that it is too difficult to determine the
975       original pattern afterwards. Consider the two patterns "pale" and
976       "palm". These should be reduced to "pal[em]". The final character
977       matches one of two possibilities.  To resolve whether it matched an 'e'
978       or 'm' would require keeping track of the fact that the pattern
979       finished up in a character class, which would then require a whole lot
980       more work to figure out which character of the class matched. Without
981       character classes it becomes much easier. Instead, "pal(?:e|m)" is
982       produced, which lets us find out more simply where we ended up.
983
984       Similarly, "dogfood" and "seafood" should form "(?:dog|sea)food".  When
985       the pattern is being assembled, the tracking decision needs to be made
986       at the end of the grouping, but the tail of the pattern has not yet
987       been visited. Deferring things to make this work correctly is a vast
988       hassle. In this case, the pattern becomes merely "(?:dogfood|seafood)".
989       Tracked patterns will therefore be bulkier than simple patterns.
990
991       There is an open bug on this issue:
992
993       <http://rt.perl.org/rt3/Ticket/Display.html?id=32840>
994
995       If this bug is ever resolved, tracking would become much easier to deal
996       with (none of the "match" hassle would be required - you could just
997       match like a regular RE and it would Just Work).
998

SEE ALSO

1000       perlre  General information about Perl's regular expressions.
1001
1002       re      Specific information about "use re 'eval'".
1003
1004       Regex::PreSuf
1005               "Regex::PreSuf" takes a string and chops it itself into tokens
1006               of length 1. Since it can't deal with tokens of more than one
1007               character, it can't deal with meta-characters and thus no
1008               regular expressions.  Which is the main reason why I wrote this
1009               module.
1010
1011       Regexp::Optimizer
1012               "Regexp::Optimizer" produces regular expressions that are
1013               similar to those produced by R::A with reductions switched off.
1014               Its biggest drawback is that it is exponentially slower than
1015               Regexp::Assemble on very large sets of patterns.
1016
1017       Regexp::Parser
1018               Fine grained analysis of regular expressions.
1019
1020       Regexp::Trie
1021               Funnily enough, this was my working name for "Regexp::Assemble"
1022               during its development. I changed the name because I thought
1023               it was too obscure. Anyway, "Regexp::Trie" does much the same
1024               as "Regexp::Optimizer" and "Regexp::Assemble" except that it
1025               runs much faster (according to the author). It does not
1026               recognise meta characters (that is, 'a+b' is interpreted as
1027               'a\+b').
1028
1029       Text::Trie
1030               "Text::Trie" is well worth investigating. Tries can outperform
1031               very bushy (read: many alternations) patterns.
1032
1033       Tree::Trie
1034               "Tree::Trie" is another module that builds tries. The algorithm
1035               that "Regexp::Assemble" uses appears to be quite similar to the
1036               algorithm described therein, except that "R::A" solves its end-
1037               marker problem without having to rewrite the leaves.
1038

LIMITATIONS

1040       "Regexp::Assemble" does not attempt to find common substrings. For
1041       instance, it will not collapse "/cabababc/" down to "/c(?:ab){3}c/".
1042       If there's a module out there that performs this sort of string
1043       analysis I'd like to know about it. But keep in mind that the
1044       algorithms that do this are very expensive: quadratic or worse.
1045
1046       "Regexp::Assemble" does not interpret meta-character modifiers.  For
1047       instance, if the following two patterns are given: "X\d" and "X\d+", it
1048       will not determine that "\d" can be matched by "\d+". Instead, it will
1049       produce "X(?:\d|\d+)". Along a similar line of reasoning, it will not
1050       determine that "Z" and "Z\d+" are equivalent to "Z\d*" (it will produce
1051       "Z(?:\d+)?"  instead).
1052
1053       You cannot remove a pattern that has been added to an object. You'll
1054       just have to start over again. Adding a pattern is difficult enough,
1055       I'd need a solid argument to convince me to add a "remove" method.  If
1056       you need to do this you should read the documentation for the "clone"
1057       method.
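
           In practice this means taking a copy before adding a pattern you
           might want to drop later. The sketch below assumes "clone"
           returns an independent copy, as its use earlier in this manual
           suggests; the patterns are made up for illustration.

```perl
use strict;
use warnings;
use Regexp::Assemble;

my $base = Regexp::Assemble->new->add(qw(cat carrot));

# Clone before adding the pattern you may later regret.
my $trial = $base->clone->add('dog');

my $with    = $trial->re;   # includes 'dog'
my $without = $base->re;    # the original never saw 'dog'

print "trial matches dog\n"        if 'dog' =~ $with;
print "base does not match dog\n"  unless 'dog' =~ $without;
```

           Discarding the clone and continuing with the original object has
           the same effect as removing the pattern.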
1058
1059       "Regexp::Assemble" does not (yet)? employ the "(?>...)"  construct.
1060
1061       The module does not produce POSIX-style regular expressions. This would
1062       be quite easy to add, if there was a demand for it.
1063

BUGS

1065       Patterns that generate look-ahead assertions sometimes produce
1066       incorrect patterns in certain obscure corner cases. If you suspect that
1067       this is occurring in your pattern, disable lookaheads.
1068
1069       Tracking doesn't really work at all with 5.6.0. It works better in
1070       subsequent 5.6 releases. For maximum reliability, the use of a 5.8
1071       release is strongly recommended. Tracking barely works with 5.005_04.
1072       Of note, using "\d"-style meta-characters invariably causes panics.
1073       Tracking really comes into its own in Perl 5.10.
1074
1075       If you feed "Regexp::Assemble" patterns with nested parentheses, there
1076       is a chance that the resulting pattern will be uncompilable due to
1077       mismatched parentheses (not enough closing parentheses). This is
1078       normal, so long as the default lexer pattern is used. If you want to
1079       find out which pattern among a list of 3000 patterns is to blame
1080       (speaking from experience here), the eg/debugging script offers a
1081       strategy for pinpointing the pattern at fault. While you may not be
1082       able to use the script directly, the general approach is easy to
1083       implement.
1084
1085       The algorithm used to assemble the regular expressions makes extensive
1086       use of mutually-recursive functions (that is, A calls B, B calls A,
1087       ...). For deeply similar expressions, it may be possible to provoke
1088       "Deep recursion" warnings.
1089
1090       The module has been tested extensively, and has an extensive test suite
1091       (that achieves close to 100% statement coverage), but you never know...
1092       A bug may manifest itself in two ways: creating a pattern that cannot
1093       be compiled, such as "a\(bc)", or a pattern that compiles correctly but
1094       that either matches things it shouldn't, or doesn't match things it
1095       should. It is assumed that such problems will occur when the reduction
1096       algorithm encounters some sort of edge case. A temporary work-around is
1097       to disable reductions:
1098
1099         my $pattern = $assembler->reduce(0)->re;
1100
1101       A discussion about implementation details and where bugs might lurk
1102       appears in the README file. If this file is not available locally, you
1103       should be able to find a copy on the Web at your nearest CPAN mirror.
1104
1105       Seriously, though, a number of people have been using this module to
1106       create expressions anywhere from 140Kb to 600Kb in size, and it seems
1107       to be working according to spec. Thus, I don't think there are any
1108       serious bugs remaining.
1109
1110       If you are feeling brave, extensive debugging traces are available to
1111       figure out where assembly goes wrong.
1112
1113       Please report all bugs at
1115       <http://rt.cpan.org/NoAuth/Bugs.html?Dist=Regexp-Assemble>
1116
1117       Make sure you include the output from the following two commands:
1118
1119         perl -MRegexp::Assemble -le 'print $Regexp::Assemble::VERSION'
1120         perl -V
1121
1122       There is a mailing list for the discussion of "Regexp::Assemble".
1123       Subscription details are available at
1125       <http://listes.mongueurs.net/mailman/listinfo/regexp-assemble>.
1126

ACKNOWLEDGEMENTS

1128       This module grew out of work I did building access maps for Postfix, a
1129       modern SMTP mail transfer agent. See <http://www.postfix.org/> for more
1130       information. I used Perl to build large regular expressions for
1131       blocking dynamic/residential IP addresses to cut down on spam and
1132       viruses. Once I had the code running for this, it was easy to start
1133       adding stuff to block really blatant spam subject lines, bogus HELO
1134       strings, spammer mailer-ids and more...
1135
1136       I presented the work at the French Perl Workshop in 2004, and the thing
1137       most people asked was whether the underlying mechanism for assembling
1138       the REs was available as a module. At that time it was nothing more
1139       than a twisty maze of scripts, all different. The interest shown
1140       indicated that a module was called for. I'd like to thank the people
1141       who showed interest. Hey, it's going to make my messy scripts smaller,
1142       in any case.
1143
1144       Thomas Drugeon was a valuable sounding board for trying out early
1145       ideas. Jean Forget and Philippe Blayo looked over an early version.
1146       H.Merijn Brandt stopped over in Paris one evening, and discussed things
1147       over a few beers.
1148
1149       Nicholas Clark pointed out that while what this module does
1150       (?:c|sh)ould be done in perl's core, as per the 2004 TODO, he
1151       encouraged me to continue with the development of this module. In any
1152       event, this module allows one to gauge the difficulty of undertaking
1153       the endeavour in C. I'd rather gouge my eyes out with a blunt pencil.
1154
1155       Paul Johnson settled the question as to whether this module should live
1156       in the Regex:: namespace, or Regexp:: namespace. If you're not
1157       convinced, try running the following one-liner:
1158
1159         perl -le 'print ref qr//'
1160
1161       Philippe Bruhat found a couple of corner cases where this module could
1162       produce incorrect results. Such feedback is invaluable, and only
1163       improves the module's quality.
1164

AUTHOR

1166       David Landgren
1167
1168       Copyright (C) 2004-2008. All rights reserved.
1169
1170         http://www.landgren.net/perl/
1171
1172       If you use this module, I'd love to hear about what you're using it
1173       for. If you want to be informed of updates, send me a note.
1174
1175       You can look at the latest working copy in the following Subversion
1176       repository:
1177
1178         http://svnweb.mongueurs.net/Regexp-Assemble
1179

LICENSE

1181       This library is free software; you can redistribute it and/or modify it
1182       under the same terms as Perl itself.
1183
1184
1185
1186perl v5.12.0                      2008-06-17                       Assemble(3)