PPIx::Regexp(3)       User Contributed Perl Documentation      PPIx::Regexp(3)

NAME

6       PPIx::Regexp - Represent a regular expression of some sort
7

SYNOPSIS

        use PPIx::Regexp;
        use PPIx::Regexp::Dumper;
        my $re = PPIx::Regexp->new( 'qr{foo}smx' );
        PPIx::Regexp::Dumper->new( $re )
            ->print();
14

INHERITANCE

16       "PPIx::Regexp" is a PPIx::Regexp::Node.
17
18       "PPIx::Regexp" has no descendants.
19

DESCRIPTION

21       The purpose of the PPIx-Regexp package is to parse regular expressions
22       in a manner similar to the way the PPI package parses Perl. This class
23       forms the root of the parse tree, playing a role similar to
24       PPI::Document.
25
26       This package shares with PPI the property of being round-trip safe.
27       That is,
28
        my $expr = 's/ ( \d+ ) ( \D+ ) /$2$1/smxg';
        my $re = PPIx::Regexp->new( $expr );
        print $re->content() eq $expr ? "yes\n" : "no\n";
32
33       should print 'yes' for any valid regular expression.
34
35       Navigation is similar to that provided by PPI. That is to say, things
36       like "children", "find_first", "snext_sibling" and so on all work
37       pretty much the same way as in PPI.
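
       The navigation calls named above can be sketched as follows. This is
       a minimal illustration rather than a definitive recipe: the sample
       regular expression is made up, and the class name
       PPIx::Regexp::Structure::Capture passed to "find_first" is assumed
       to be the class that represents a capture group.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Parse a sample qr{} expression (the expression itself is illustrative).
my $re = PPIx::Regexp->new( 'qr{(foo)(bar)+}smx' )
    or die PPIx::Regexp->errstr();

# Top-level children of the parse tree, in source order.
my @kids = $re->children();
print ref( $_ ), "\t", $_->content(), "\n" for @kids;

# First capture group anywhere below the root. The class name is an
# assumption about how capture groups are represented.
my $capture = $re->find_first( 'PPIx::Regexp::Structure::Capture' );
my $capture_text = $capture ? $capture->content() : undef;

# Next significant sibling of that group, if there is one.
my $next = $capture ? $capture->snext_sibling() : undef;
my $next_text = $next ? $next->content() : undef;

print "First capture: $capture_text\n" if defined $capture_text;
print "Its next significant sibling: $next_text\n" if defined $next_text;
```

       As in PPI, "find_first" returns a false value when nothing matches,
       so the result is checked before it is used.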
38
39       The class hierarchy is also similar to PPI. Except for some utility
40       classes (the dumper, the lexer, and the tokenizer) all classes are
41       descended from PPIx::Regexp::Element, which provides basic navigation.
42       Tokens are descended from PPIx::Regexp::Token, which provides content.
43       All containers are descended from PPIx::Regexp::Node, which provides
44       for children, and all structure elements are descended from
45       PPIx::Regexp::Structure, which provides beginning and ending
46       delimiters, and a type.
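
       The relationships described above can be checked with "isa". The
       following sketch assumes only the class names given in this section
       and the accessor methods documented under METHODS; the sample
       substitution is illustrative.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $re = PPIx::Regexp->new( 's/(foo)/bar/smx' )
    or die PPIx::Regexp->errstr();

# Record which base classes each piece of the parse tree descends from.
my %is = (
    root_is_node        => $re->isa( 'PPIx::Regexp::Node' ) ? 1 : 0,
    root_is_element     => $re->isa( 'PPIx::Regexp::Element' ) ? 1 : 0,
    regexp_is_structure =>
        $re->regular_expression()->isa( 'PPIx::Regexp::Structure' ) ? 1 : 0,
    modifier_is_token   =>
        $re->modifier()->isa( 'PPIx::Regexp::Token' ) ? 1 : 0,
);

print "$_: $is{$_}\n" for sort keys %is;
```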
47
48       There are two features of PPI that this package does not provide -
49       mutability and operator overloading. There are no plans for serious
50       mutability, though something like PPI's "prune" functionality might be
51       considered. Similarly there are no plans for operator overloading,
52       which appears to the author to represent a performance hit for little
53       tangible gain.
54

NOTICE

       The use of this class to parse non-regexp quote-like strings was an
       experiment that I consider failed. Therefore this use is deprecated in
       favor of PPIx::QuoteLike. As of version 0.058_01, the first use of the
       "parse" argument to new() resulted in a warning. As of version
       0.062_01, all uses of the "parse" argument resulted in a warning. As of
       version 0.068_01, any use of the "parse" argument is fatal.
62
63       The author will attempt to preserve the documented interface, but if
64       the interface needs to change to correct some egregiously bad design or
65       implementation decision, then it will change.  Any incompatible changes
66       will go through a deprecation cycle.
67
68       The goal of this package is to parse well-formed regular expressions
69       correctly. A secondary goal is not to blow up on ill-formed regular
70       expressions. The correct identification and characterization of ill-
71       formed regular expressions is not a goal of this package, nor is the
72       consistent parsing of ill-formed regular expressions from release to
73       release.
74
75       This policy attempts to track features in development releases as well
76       as public releases. However, features added in a development release
77       and then removed before the next production release will not be
78       tracked, and any functionality relating to such features will be
79       removed. The issue here is the potential re-use (with different
80       semantics) of syntax that did not make it into the production release.
81
82       From time to time the Perl regular expression engine changes in ways
83       that change the parse of a given regular expression. When these changes
84       occur, "PPIx::Regexp" will be changed to produce the more modern parse.
85       Known examples of this include:
86
       $( no longer interpolates as of Perl 5.005, per "perl5005delta".
           Newer Perls seem to parse this as "qr{$}" (i.e. an end-of-string
           or newline assertion) followed by an open parenthesis, and that is
           what "PPIx::Regexp" does.
91
92       $) and $| also seem to parse as the "$" assertion
93           followed by the relevant meta-character, though I have no
94           documentation reference for this.
95
96       "@+" and "@-" no longer interpolate as of Perl 5.9.4
97           per "perl594delta". Subsequent Perls treat "@+" as a quantified
98           literal and "@-" as two literals, and that is what "PPIx::Regexp"
99           does. Note that subscripted references to these arrays do
100           interpolate, and are so parsed by "PPIx::Regexp".
101
102       Only space and horizontal tab are whitespace as of Perl 5.23.4
103           when inside a bracketed character class inside an extended
104           bracketed character class, per "perl5234delta". Formerly any white
105           space character parsed as whitespace. This change in "PPIx::Regexp"
106           will be reverted if the change in Perl does not make it into Perl
107           5.24.0.
108
109       Unescaped literal left curly brackets
110           These are being removed in positions where quantifiers are legal,
111           so that they can be used for new functionality. Some of them are
112           gone in 5.25.1, others will be removed in a future version of Perl.
113           In situations where they have been removed, perl_version_removed()
114           will return the version in which they were removed. When the new
115           functionality appears, the parse produced by this software will
116           reflect the new functionality.
117
118           NOTE that the situation with a literal left curly after a literal
119           character is complicated. It was made an error in Perl 5.25.1, and
120           remained so through all 5.26 releases, but became a warning again
121           in 5.27.1 due to its use in GNU Autoconf. Whether it will ever
122           become illegal again is not clear to me based on the contents of
123           perl5271delta. At the moment perl_version_removed() returns
124           "undef", but obviously that is not the whole story, and methods
125           accepts_perl() and requirements_for_perl() were introduced to deal
126           with this complication.
127
128       "\o{...}"
129           is parsed as the octal equivalent of "\x{...}". This is its meaning
130           as of perl 5.13.2. Before 5.13.2 it was simply literal 'o' and so
131           on.
132
133       There are very probably other examples of this. When they come to light
134       they will be documented as producing the modern parse, and the code
135       modified to produce this parse if necessary.
136

METHODS

138       This class provides the following public methods. Methods not
139       documented here are private, and unsupported in the sense that the
140       author reserves the right to change or remove them without notice.
141
142   new
        my $re = PPIx::Regexp->new('/foo/');
144
145       This method instantiates a "PPIx::Regexp" object from a string, a
146       PPI::Token::QuoteLike::Regexp, a PPI::Token::Regexp::Match, or a
147       PPI::Token::Regexp::Substitute.  Honestly, any PPI::Element will work,
148       but only the three Regexp classes mentioned previously are likely to do
149       anything useful.
150
151       Whatever form the argument takes, it is assumed to consist entirely of
152       a valid match, substitution, or "qr<>" string.
153
154       Optionally you can pass one or more name/value pairs after the regular
155       expression. The possible options are:
156
157       default_modifiers array_reference
158           This option specifies a reference to an array of default modifiers
159           to apply to the regular expression being parsed. Each modifier is
160           specified as a string. Any actual modifiers found supersede the
161           defaults.
162
163           When applying the defaults, '?' and '/' are completely ignored, and
164           '^' is ignored unless it occurs at the beginning of the modifier.
165           The first dash ('-') causes subsequent modifiers to be negated.
166
167           So, for example, if you wish to produce a "PPIx::Regexp" object
168           representing the regular expression in
169
            use re '/smx';
            {
               no re '/x';
               m/ foo /;
            }
175
176           you would (after some help from PPI in finding the relevant
177           statements), do something like
178
            my $re = PPIx::Regexp->new( 'm/ foo /',
                default_modifiers => [ '/smx', '-/x' ] );
181
182       encoding name
183           This option specifies the encoding of the regular expression. This
184           is passed to the tokenizer, which will "decode" the regular
185           expression string before it tokenizes it. For example:
186
            my $re = PPIx::Regexp->new( '/foo/',
                encoding => 'iso-8859-1',
            );
190
191       parse parse_type
192           This option specifies what kind of parse is to be done. Possible
193           values are 'regex', 'string', or 'guess'. Any value but 'regex' is
194           experimental.
195
196           As it turns out, I consider parsing non-regexp quote-like things
197           with this class to be a failed experiment, and the relevant
198           functionality is being deprecated and removed in favor of
199           PPIx::QuoteLike. See above for details. As of version 0.068_01, any
200           use of this option throws an exception.
201
202       postderef boolean
203           This option is passed on to the tokenizer, where it specifies
204           whether postfix dereferences are recognized in interpolations and
205           code. This experimental feature was introduced in Perl 5.19.5.
206
207           The default is the value of
208           $PPIx::Regexp::Tokenizer::DEFAULT_POSTDEREF, which is true. When
209           originally introduced this was false, but was documented as
210           becoming true when and if postfix dereferencing became mainstream.
211           The  intent to mainstream was announced with Perl 5.23.1, and
212           became official (so to speak) with Perl 5.24.0, so the default
213           became true with PPIx::Regexp 0.049_01.
214
215           Note that if PPI starts unconditionally recognizing postfix
216           dereferences, this argument will immediately become ignored, and
217           will be put through a deprecation cycle and removed.
218
219       strict boolean
220           This option is passed on to the tokenizer and lexer, where it
221           specifies whether the parse should assume "use re 'strict'" is in
222           effect.
223
224           The 'strict' pragma was introduced in Perl 5.22, and its
225           documentation says that it is experimental, and that there is no
226           commitment to backward compatibility. The same applies to the parse
227           produced when this option is asserted. Also, the usual caveat
228           applies: if "use re 'strict'" ends up being retracted, this option
229           and all related functionality will be also.
230
231           Given the nature of "use re 'strict'", you should expect that if
232           you assert this option, regular expressions that previously parsed
233           without error might no longer do so. If an element ends up being
234           declared an error because this option is set, its
235           "perl_version_introduced()" will be the Perl version at which "use
236           re 'strict'" started rejecting these elements.
237
238           The default is false.
239
240       trace number
241           If greater than zero, this option causes trace output from the
242           parse.  The author reserves the right to change or eliminate this
243           without notice.
244
245       Passing optional input other than the above is not an error, but
246       neither is it supported.
247
248   new_from_cache
249       This static method wraps "new" in a caching mechanism. Only one object
250       will be generated for a given PPI::Element, no matter how many times
251       this method is called. Calls after the first for a given PPI::Element
252       simply return the same "PPIx::Regexp" object.
253
254       When the "PPIx::Regexp" object is returned from cache, the values of
255       the optional arguments are ignored.
256
257       Calls to this method with the regular expression in a string rather
258       than a PPI::Element will not be cached.
259
260       Caveat: This method is provided for code like Perl::Critic which might
261       instantiate the same object multiple times. The cache will persist
262       until "flush_cache" is called.
263
264   flush_cache
        $re->flush_cache();            # Remove $re from cache
        PPIx::Regexp->flush_cache();   # Empty the cache
267
268       This method flushes the cache used by "new_from_cache". If called as a
269       static method with no arguments, the entire cache is emptied. Otherwise
270       any objects specified are removed from the cache.
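
       The caching behavior of "new_from_cache" and "flush_cache" can be
       sketched as below. The sample source line and variable names are
       invented for illustration; PPI::Document and its "find_first" method
       come from PPI, and refaddr() from the core Scalar::Util module is
       used only to compare object identity.

```perl
use strict;
use warnings;
use PPI::Document;
use PPIx::Regexp;
use Scalar::Util qw( refaddr );

# A one-line document containing a single match operator (illustrative).
my $source = 'my $ok = $text =~ m/foo/;';
my $doc = PPI::Document->new( \$source )
    or die "PPI could not parse the source\n";
my $elem = $doc->find_first( 'PPI::Token::Regexp::Match' )
    or die "No match operator found\n";

# Two calls for the same PPI::Element should yield the same object.
my $first  = PPIx::Regexp->new_from_cache( $elem );
my $second = PPIx::Regexp->new_from_cache( $elem );
my $same_object = refaddr( $first ) == refaddr( $second ) ? 1 : 0;

# After emptying the cache, a fresh object is built for the element.
PPIx::Regexp->flush_cache();
my $third = PPIx::Regexp->new_from_cache( $elem );
my $rebuilt = refaddr( $third ) != refaddr( $first ) ? 1 : 0;

print "Same object while cached: $same_object\n";
print "New object after flush:   $rebuilt\n";

# Empty the cache again so nothing lingers past this example.
PPIx::Regexp->flush_cache();
```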
271
272   capture_names
        foreach my $name ( $re->capture_names() ) {
            print "Capture name '$name'\n";
        }
276
277       This convenience method returns the capture names found in the regular
278       expression.
279
280       This method is equivalent to
281
        $self->regular_expression()->capture_names();
283
284       except that if "$self->regular_expression()" returns "undef" (meaning
285       that something went terribly wrong with the parse) this method will
286       simply return.
287
288   delimiters
        print join("\t", PPIx::Regexp->new('s/foo/bar/')->delimiters());
        # prints '//      //'
291
292       When called in list context, this method returns either one or two
293       strings, depending on whether the parsed expression has a replacement
294       string. In the case of non-bracketed substitutions, the start delimiter
295       of the replacement string is considered to be the same as its finish
296       delimiter, as illustrated by the above example.
297
298       When called in scalar context, you get the delimiters of the regular
299       expression; that is, element 0 of the array that is returned in list
300       context.
301
302       Optionally, you can pass an index value and the corresponding
303       delimiters will be returned; index 0 represents the regular
304       expression's delimiters, and index 1 represents the replacement
305       string's delimiters, which may be undef. For example,
306
        print PPIx::Regexp->new('s{foo}<bar>')->delimiters(1);
        # prints '<>'
309
310       If the object was not initialized with a valid regexp of some sort, the
311       results of this method are undefined.
312
313   errstr
314       This static method returns the error string from the most recent
315       attempt to instantiate a "PPIx::Regexp". It will be "undef" if the most
316       recent attempt succeeded.
317
318   extract_regexps
        my $doc = PPI::Document->new( $path );
        $doc->index_locations();
        my @res = PPIx::Regexp->extract_regexps( $doc );
322
323       This convenience (well, sort-of) static method takes as its argument a
324       PPI::Document object and returns "PPIx::Regexp" objects corresponding
325       to all regular expressions found in it, in the order in which they
326       occur in the document. You will need to keep a reference to the
327       original PPI::Document object if you wish to be able to recover the
328       original PPI::Element objects via the PPIx::Regexp source() method.
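
       A self-contained variant of the above, which parses Perl source held
       in a string rather than a file, might look like this. The source
       text is invented for illustration; passing a scalar reference to
       PPI::Document->new() is standard PPI usage.

```perl
use strict;
use warnings;
use PPI::Document;
use PPIx::Regexp;

# Two statements, each containing one regular expression (illustrative).
my $source = <<'EOD';
$line =~ m/foo/;
$line =~ s/bar/baz/;
EOD

# Keep $doc alive for as long as the PPIx::Regexp objects are in use,
# so that source() can still reach the original PPI elements.
my $doc = PPI::Document->new( \$source )
    or die "PPI could not parse the source\n";
$doc->index_locations();

my @res = PPIx::Regexp->extract_regexps( $doc );
my @contents = map { $_->content() } @res;

print "Found: $_\n" for @contents;
```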
329
330   failures
        print "There were ", $re->failures(), " parse failures\n";
332
333       This method returns the number of parse failures. This is a count of
334       the number of unknown tokens plus the number of unterminated structures
335       plus the number of unmatched right brackets of any sort.
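
       One way to combine "errstr" and "failures" when checking input is
       sketched below. The helper name parse_report() and its messages are
       hypothetical; it assumes that new() returns a false value when no
       object can be built, which is what makes errstr() meaningful.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Hypothetical helper: summarize how well a string parsed.
sub parse_report {
    my ( $string ) = @_;

    my $re = PPIx::Regexp->new( $string );
    if ( ! $re ) {
        # Assumption: a false return means instantiation failed.
        my $err = PPIx::Regexp->errstr();
        return 'not parsed: ' . ( defined $err ? $err : 'unknown error' );
    }

    my $failures = $re->failures();
    return $failures
        ? "parsed with $failures failure(s)"
        : 'parsed cleanly';
}

# A well-formed expression, then one with an unclosed parenthesis.
print "$_ => ", parse_report( $_ ), "\n"
    for 'qr{foo}smx', 'qr{(foo}';
```

       Because the consistent parsing of ill-formed expressions is not a
       goal of this package, the report for the second string may differ
       between releases.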
336
337   max_capture_number
        print "Highest used capture number ",
            $re->max_capture_number(), "\n";
340
341       This convenience method returns the highest capture number used by the
342       regular expression. If there are no captures, the return will be 0.
343
344       This method is equivalent to
345
        $self->regular_expression()->max_capture_number();
347
348       except that if "$self->regular_expression()" returns "undef" (meaning
349       that something went terribly wrong with the parse) this method will
350       too.
351
352   modifier
        my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
        print $re->modifier()->content(), "\n";
        # prints 'smx'.
356
357       This method retrieves the modifier of the object. This comes from the
358       end of the initializing string or object and will be a
359       PPIx::Regexp::Token::Modifier.
360
361       Note that this object represents the actual modifiers present on the
362       regexp, and does not take into account any that may have been applied
363       by default (i.e. via the "default_modifiers" argument to "new()"). For
364       something that takes account of default modifiers, see
365       modifier_asserted(), below.
366
367       In the event of a parse failure, there may not be a modifier present,
368       in which case nothing is returned.
369
370   modifier_asserted
        my $re = PPIx::Regexp->new( '/ . /',
            default_modifiers => [ 'smx' ] );
        print $re->modifier_asserted( 'x' ) ? "yes\n" : "no\n";
        # prints 'yes'.
375
376       This method returns true if the given modifier is asserted for the
377       regexp, whether explicitly or by the modifiers passed in the
378       "default_modifiers" argument.
379
380       Starting with version 0.036_01, if the argument is a single-character
381       modifier followed by an asterisk (intended as a wild card character),
382       the return is the number of times that modifier appears. In this case
383       an exception will be thrown if you specify a multi-character modifier
384       (e.g.  'ee*'), or if you specify one of the match semantics modifiers
385       (e.g.  'a*').
386
387   regular_expression
        my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
        print $re->regular_expression()->content(), "\n";
        # prints '/(foo)/'.
391
392       This method returns that portion of the object which actually
393       represents a regular expression.
394
395   replacement
        my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
        print $re->replacement()->content(), "\n";
        # prints '${1}bar/'.
399
400       This method returns that portion of the object which represents the
401       replacement string. This will be "undef" unless the regular expression
402       actually has a replacement string. Delimiters will be included, but
403       there will be no beginning delimiter unless the regular expression was
404       bracketed.
405
406   source
        my $source = $re->source();
408
409       This method returns the object or string that was used to instantiate
410       the object.
411
412   type
        my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
        print $re->type()->content(), "\n";
        # prints 's'.
416
417       This method retrieves the type of the object. This comes from the
418       beginning of the initializing string or object, and will be a
419       PPIx::Regexp::Token::Structure whose "content" is one of 's', 'm',
420       'qr', or ''.
421

RESTRICTIONS

423       By the nature of this module, it is never going to get everything
424       right.  Many of the known problem areas involve interpolations one way
425       or another.
426
427   Ambiguous Syntax
428       Perl's regular expressions contain cases where the syntax is ambiguous.
429       A particularly egregious example is an interpolation followed by square
430       or curly brackets, for example $foo[...]. There is nothing in the
431       syntax to say whether the programmer wanted to interpolate an element
432       of array @foo, or whether he wanted to interpolate scalar $foo, and
433       then follow that interpolation by a character class.
434
435       The perlop documentation notes that in this case what Perl does is to
436       guess. That is, it employs various heuristics on the code to try to
437       figure out what the programmer wanted. These heuristics are documented
438       as being undocumented (!) and subject to change without notice. As an
439       example of the problems even perl faces in parsing Perl, see
440       <https://github.com/perl/perl5/issues/16478>.
441
442       Given this situation, this module's chances of duplicating every Perl
443       version's interpretation of every regular expression are pretty much
444       nil.  What it does now is to assume that square brackets containing
445       only an integer or an interpolation represent a subscript; otherwise
446       they represent a character class. Similarly, curly brackets containing
447       only a bareword or an interpolation are a subscript; otherwise they
448       represent a quantifier.
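
       The square-bracket heuristic described above can be observed with a
       sketch like the following. The class name
       PPIx::Regexp::Token::Interpolation is assumed to be the class used
       for interpolated variables, and the two sample expressions are
       illustrative.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Map each sample expression to the text treated as the interpolation.
my %interpolated;

for my $string ( 'm/$foo[1]/', 'm/$foo[a-z]/' ) {
    my $re = PPIx::Regexp->new( $string )
        or die PPIx::Regexp->errstr();

    # Assumed class name for interpolation tokens.
    my $interp = $re->find_first( 'PPIx::Regexp::Token::Interpolation' );
    $interpolated{$string} = $interp ? $interp->content() : undef;
}

for my $string ( sort keys %interpolated ) {
    my $text = $interpolated{$string};
    print "$string interpolates ",
        defined $text ? $text : '(nothing)', "\n";
}
```

       Under the rule stated above, brackets holding only an integer are
       taken as a subscript and so belong to the interpolation, while
       "[a-z]" is left to be parsed as a character class.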
449
450   Changes in Syntax
451       Sometimes the introduction of new syntax changes the way a regular
452       expression is parsed. For example, the "\v" character class was
453       introduced in Perl 5.9.5. But it did not represent a syntax error prior
454       to that version of Perl, it was simply parsed as "v". So
455
        $ perl -le 'print "v" =~ m/\v/ ? "yes" : "no"'
457
458       prints "yes" under Perl 5.8.9, but "no" under 5.10.0. "PPIx::Regexp"
459       generally assumes the more modern parse in cases like this.
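
       To see which Perl versions this module associates with each piece of
       such an expression, something like the following sketch may help. It
       assumes that a "find" method exists alongside "find_first" (as it
       does in PPI) and that it accepts the base class name
       PPIx::Regexp::Token; the perl_version_introduced() and
       perl_version_removed() methods are the ones referred to elsewhere in
       this document.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $re = PPIx::Regexp->new( 'm/\v/' )
    or die PPIx::Regexp->errstr();

# Assumption: find() returns an array reference of matching elements,
# or a false value if there are none.
my $tokens = $re->find( 'PPIx::Regexp::Token' ) || [];

# Collect content, version introduced, and version removed per token.
my @report = map { [
    $_->content(),
    scalar $_->perl_version_introduced(),
    scalar $_->perl_version_removed(),
] } @{ $tokens };

for my $row ( @report ) {
    my ( $content, $introduced, $removed ) = @{ $row };
    printf "%-6s introduced %s, removed %s\n",
        "'$content'", $introduced,
        defined $removed ? $removed : 'never';
}
```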
460
461   Equivocation
462       Very occasionally, a construction will be removed and then added back
463       -- and then, conceivably, removed again. In this case, the plan is for
464       perl_version_introduced() to return the earliest version in which the
465       construction appeared, and perl_version_removed() to return the version
466       after the last version in which it appeared (whether production or
467       development), or "undef" if it is in the highest-numbered Perl.
468
469       The constructions involved in this are:
470
471       Un-escaped literal left curly after literal
472
473       That is, something like "qr<x{>".
474
       This was made an error in 5.25.1, and it was an error in 5.26.0. But
       it became a warning again in 5.27.1. The perl5271delta says it was
       reinstated because the changes broke GNU Autoconf, and the warning
       message says it will be removed in Perl 5.30.
479
       Accordingly, perl_version_introduced() returns 5.0. At the moment
       perl_version_removed() returns '5.025001'. But if it is present with or
       without warning in 5.28, perl_version_removed() will become "undef". If
       you need finer resolution than this, see PPIx::Regexp::Element methods
       accepts_perl() and requirements_for_perl().
486
487   Static Parsing
488       It is well known that Perl can not be statically parsed. That is, you
489       can not completely parse a piece of Perl code without executing that
490       same code.
491
492       Nevertheless, this class is trying to statically parse regular
493       expressions. The main problem with this is that there is no way to know
494       what is being interpolated into the regular expression by an
495       interpolated variable. This is a problem because the interpolated value
496       can change the interpretation of adjacent elements.
497
       This module deals with this by making assumptions about what is in an
       interpolated variable. These assumptions will not be enumerated here,
       but in general the principle is to assume the interpolated value does
       not change the interpretation of the regular expression. For example,
502
        my $foo = 'a-z]';
        my $re = qr{[$foo};
505
506       is fine with the Perl interpreter, but will confuse the dickens out of
507       this module. Similarly and more usefully, something like
508
        my $mods = 'i';
        my $re = qr{(?$mods:foo)};
511
512       or maybe
513
        my $mods = 'i';
        my $re = qr{(?$mods)$foo};
516
517       probably sets a modifier of some sort, and that is how this module
518       interprets it. If the interpolation is not about modifiers, this module
519       will get it wrong. Another such semi-benign example is
520
        my $foo = $] >= 5.010 ? '?<foo>' : '';
        my $re = qr{($foo\w+)};
523
524       which will parse, but this module will never realize that it might be
525       looking at a named capture.
526
527   Non-Standard Syntax
528       There are modules out there that alter the syntax of Perl. If the
529       syntax of a regular expression is altered, this module has no way to
530       understand that it has been altered, much less to adapt to the
531       alteration. The following modules are known to cause problems:
532
533       Acme::PerlML, which renders Perl as XML.
534
535       "Data::PostfixDeref", which causes Perl to interpret suffixed empty
536       brackets as dereferencing the thing they suffix. This module by Ben
537       Morrow ("BMORROW") appears to have been retracted.
538
539       Filter::Trigraph, which recognizes ANSI C trigraphs, allowing Perl to
540       be written in the ISO 646 character set.
541
542       Perl6::Pugs. Enough said.
543
544       Perl6::Rules, which back-ports some of the Perl 6 regular expression
545       syntax to Perl 5.
546
547       Regexp::Extended, which extends regular expressions in various ways,
548       some of which seem to conflict with Perl 5.010.
549

SEE ALSO

       Regexp::Parsertron, which uses Marpa::R2 to parse the regexp, and Tree
       for navigation. Unlike "PPIx::Regexp", Regexp::Parsertron supports
       modification of the parse tree.
554
555       Regexp::Parser, which parses a bare regular expression (without
556       enclosing "qr{}", "m//", or whatever) and uses a different navigation
557       model. After a long hiatus, this module has been adopted, and is again
558       supported.
559

SUPPORT

561       Support is by the author. Please file bug reports at
562       <https://rt.cpan.org>, or in electronic mail to the author.
563

AUTHOR

565       Thomas R. Wyant, III wyant at cpan dot org

COPYRIGHT AND LICENSE

568       Copyright (C) 2009-2020 by Thomas R. Wyant, III
569
570       This program is free software; you can redistribute it and/or modify it
571       under the same terms as Perl 5.10.0. For more details, see the full
572       text of the licenses in the directory LICENSES.
573
574       This program is distributed in the hope that it will be useful, but
575       without any warranty; without even the implied warranty of
576       merchantability or fitness for a particular purpose.
577
578
579
perl v5.30.1                      2020-02-10                   PPIx::Regexp(3)