PPIx::Regexp(3)       User Contributed Perl Documentation      PPIx::Regexp(3)

NAME

       PPIx::Regexp - Represent a regular expression of some sort

SYNOPSIS

        use PPIx::Regexp;
        use PPIx::Regexp::Dumper;
        my $re = PPIx::Regexp->new( 'qr{foo}smx' );
        PPIx::Regexp::Dumper->new( $re )
            ->print();

DEPRECATION NOTICE

       The "postderef" argument to new() is retracted, and postfix
       dereferences are always recognized.

       Starting with version 0.074_01, the first use of this argument warned.
       With version 0.079_01, all uses warned. With version 0.080_01, all uses
       became fatal. With version 0.084_01, all mention of this argument was
       removed, except for this notice.

INHERITANCE

       "PPIx::Regexp" is a PPIx::Regexp::Node.

       "PPIx::Regexp" has no descendants.

DESCRIPTION

       The purpose of the PPIx-Regexp package is to parse regular expressions
       in a manner similar to the way the PPI package parses Perl. This class
       forms the root of the parse tree, playing a role similar to
       PPI::Document.

       This package shares with PPI the property of being round-trip safe.
       That is,

        my $expr = 's/ ( \d+ ) ( \D+ ) /$2$1/smxg';
        my $re = PPIx::Regexp->new( $expr );
        print $re->content() eq $expr ? "yes\n" : "no\n";

       should print 'yes' for any valid regular expression.

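       The round-trip property can be spot-checked over several expressions
       at once. The following is a minimal sketch, not part of the
       distribution; the sample expressions and the $mismatches counter are
       chosen purely for illustration.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Sample expressions chosen for illustration only.
my @samples = (
    's/ ( \d+ ) ( \D+ ) /$2$1/smxg',
    'qr{foo}smx',
    'm/ foo /',
);

my $mismatches = 0;
foreach my $expr ( @samples ) {
    my $re = PPIx::Regexp->new( $expr )
        or die "Unable to parse '$expr'";
    # content() should reproduce the original string exactly.
    my $ok = $re->content() eq $expr;
    $mismatches++ unless $ok;
    printf "%-3s %s\n", ( $ok ? 'yes' : 'no' ), $expr;
}
```
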
       Navigation is similar to that provided by PPI. That is to say, things
       like "children", "find_first", "snext_sibling" and so on all work
       pretty much the same way as in PPI.

       The class hierarchy is also similar to PPI. Except for some utility
       classes (the dumper, the lexer, and the tokenizer) all classes are
       descended from PPIx::Regexp::Element, which provides basic navigation.
       Tokens are descended from PPIx::Regexp::Token, which provides content.
       All containers are descended from PPIx::Regexp::Node, which provides
       for children, and all structure elements are descended from
       PPIx::Regexp::Structure, which provides beginning and ending
       delimiters, and a type.

       There are two features of PPI that this package does not provide -
       mutability and operator overloading. There are no plans for serious
       mutability, though something like PPI's "prune" functionality might be
       considered. Similarly there are no plans for operator overloading,
       which appears to the author to represent a performance hit for little
       tangible gain.

NOTICE

       The author will attempt to preserve the documented interface, but if
       the interface needs to change to correct some egregiously bad design or
       implementation decision, then it will change.  Any incompatible changes
       will go through a deprecation cycle.

       The goal of this package is to parse well-formed regular expressions
       correctly. A secondary goal is not to blow up on ill-formed regular
       expressions. The correct identification and characterization of ill-
       formed regular expressions is not a goal of this package, nor is the
       consistent parsing of ill-formed regular expressions from release to
       release.

       This policy attempts to track features in development releases as well
       as public releases. However, features added in a development release
       and then removed before the next production release will not be
       tracked, and any functionality relating to such features will be
       removed. The issue here is the potential re-use (with different
       semantics) of syntax that did not make it into the production release.

       From time to time the Perl regular expression engine changes in ways
       that change the parse of a given regular expression. When these changes
       occur, "PPIx::Regexp" will be changed to produce the more modern parse.
       Known examples of this include:

       $( no longer interpolates as of Perl 5.005, per "perl5005delta".
           Newer Perls seem to parse this as "qr{$}" (i.e. an end-of-string or
           newline assertion) followed by an open parenthesis, and that is
           what "PPIx::Regexp" does.

       $) and $| also seem to parse as the "$" assertion
           followed by the relevant meta-character, though I have no
           documentation reference for this.

       "@+" and "@-" no longer interpolate as of Perl 5.9.4
           per "perl594delta". Subsequent Perls treat "@+" as a quantified
           literal and "@-" as two literals, and that is what "PPIx::Regexp"
           does. Note that subscripted references to these arrays do
           interpolate, and are so parsed by "PPIx::Regexp".

       Only space and horizontal tab are whitespace as of Perl 5.23.4
           when inside a bracketed character class inside an extended
           bracketed character class, per "perl5234delta". Formerly any white
           space character parsed as whitespace. This change in "PPIx::Regexp"
           will be reverted if the change in Perl does not make it into Perl
           5.24.0.

       Unescaped literal left curly brackets
           These are being removed in positions where quantifiers are legal,
           so that they can be used for new functionality. Some of them are
           gone in 5.25.1, others will be removed in a future version of Perl.
           In situations where they have been removed, perl_version_removed()
           will return the version in which they were removed. When the new
           functionality appears, the parse produced by this software will
           reflect the new functionality.

           NOTE that the situation with a literal left curly after a literal
           character is complicated. It was made an error in Perl 5.25.1, and
           remained so through all 5.26 releases, but became a warning again
           in 5.27.1 due to its use in GNU Autoconf. Whether it will ever
           become illegal again is not clear to me based on the contents of
           perl5271delta. At the moment perl_version_removed() returns
           "undef", but obviously that is not the whole story, and methods
           accepts_perl() and requirements_for_perl() were introduced to deal
           with this complication.

       "\o{...}"
           is parsed as the octal equivalent of "\x{...}". This is its meaning
           as of perl 5.13.2. Before 5.13.2 it was simply literal 'o' and so
           on.

       "x{,3}"
           (with first count omitted) is allowed as a quantifier as of Perl
           5.33.6.  The previous parse made this all literals.

       "x{ 0 , 3 }"
           (with spaces inside but adjacent to curly brackets, or around the
           comma if any) is allowed as a quantifier as of Perl 5.33.6. The
           previous parse made this all literals.

       There are very probably other examples of this. When they come to light
       they will be documented as producing the modern parse, and the code
       modified to produce this parse if necessary.

METHODS

       This class provides the following public methods. Methods not
       documented here are private, and unsupported in the sense that the
       author reserves the right to change or remove them without notice.

   new
        my $re = PPIx::Regexp->new('/foo/');

       This method instantiates a "PPIx::Regexp" object from a string, a
       PPI::Token::QuoteLike::Regexp, a PPI::Token::Regexp::Match, or a
       PPI::Token::Regexp::Substitute.  Honestly, any PPI::Element will work,
       but only the three Regexp classes mentioned previously are likely to do
       anything useful.

       Whatever form the argument takes, it is assumed to consist entirely of
       a valid match, substitution, or "qr<>" string.

       Optionally you can pass one or more name/value pairs after the regular
       expression. The possible options are:

       default_modifiers array_reference
           This option specifies a reference to an array of default modifiers
           to apply to the regular expression being parsed. Each modifier is
           specified as a string. Any actual modifiers found supersede the
           defaults.

           When applying the defaults, '?' and '/' are completely ignored, and
           '^' is ignored unless it occurs at the beginning of the modifier.
           The first dash ('-') causes subsequent modifiers to be negated.

           So, for example, if you wish to produce a "PPIx::Regexp" object
           representing the regular expression in

            use re '/smx';
            {
               no re '/x';
               m/ foo /;
            }

           you would (after some help from PPI in finding the relevant
           statements), do something like

            my $re = PPIx::Regexp->new( 'm/ foo /',
                default_modifiers => [ '/smx', '-/x' ] );

       encoding name
           This option specifies the encoding of the regular expression. This
           is passed to the tokenizer, which will "decode" the regular
           expression string before it tokenizes it. For example:

            my $re = PPIx::Regexp->new( '/foo/',
                encoding => 'iso-8859-1',
            );

       index_locations Boolean
           This Boolean option specifies whether the locations of the elements
           in the regular expression should be indexed.

           If unspecified or specified as "undef" a default value is used.
           This default is true if the argument is a PPI::Element or the
           "location" option was specified. Otherwise the default is false.

       location array_reference
           This option specifies the location of the new object in the
           document from which it was created. It is a reference to a five-
           element array compatible with that returned by the location()
           method of PPI::Element.

           If not specified, the location of the original string is used if it
           was specified as a PPI::Element.

           If no location can be determined, the various location() methods
           will return "undef".

       postderef Boolean
           THIS ARGUMENT IS DEPRECATED.  See DEPRECATION NOTICE above for the
           details.

           This option is passed on to the tokenizer, where it specifies
           whether postfix dereferences are recognized in interpolations and
           code. This experimental feature was introduced in Perl 5.19.5.

           As of version 0.074_01, the default is true.  Through release
           0.074, the default was the value of
           $PPIx::Regexp::Tokenizer::DEFAULT_POSTDEREF, which was true. When
           originally introduced this was false, but was documented as
           becoming true when and if postfix dereferencing became mainstream.
           The intent to mainstream was announced with Perl 5.23.1, and
           became official (so to speak) with Perl 5.24.0, so the default
           became true with PPIx::Regexp 0.049_01.

           Note that if PPI starts unconditionally recognizing postfix
           dereferences, this argument will immediately become ignored, and
           will be put through a deprecation cycle and removed.

       strict Boolean
           This option is passed on to the tokenizer and lexer, where it
           specifies whether the parse should assume "use re 'strict'" is in
           effect.

           The 'strict' pragma was introduced in Perl 5.22, and its
           documentation says that it is experimental, and that there is no
           commitment to backward compatibility. The same applies to the parse
           produced when this option is asserted. Also, the usual caveat
           applies: if "use re 'strict'" ends up being retracted, this option
           and all related functionality will be also.

           Given the nature of "use re 'strict'", you should expect that if
           you assert this option, regular expressions that previously parsed
           without error might no longer do so. If an element ends up being
           declared an error because this option is set, its
           perl_version_introduced() will be the Perl version at which "use re
           'strict'" started rejecting these elements.

           The default is false.

       trace number
           If greater than zero, this option causes trace output from the
           parse.  The author reserves the right to change or eliminate this
           without notice.

       Passing optional input other than the above is not an error, but
       neither is it supported.

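       As a rough sketch of how the "default_modifiers" option interacts
       with modifier_asserted(), the following reuses the 'm/ foo /' example
       above and asks about each default in turn. If the defaults behave as
       described, /s and /m should be reported as asserted and /x should
       not; the output wording is made up for illustration.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Same expression and defaults as in the default_modifiers example.
my $re = PPIx::Regexp->new( 'm/ foo /',
    default_modifiers => [ '/smx', '-/x' ] )
    or die 'Unable to parse regular expression';

# Ask about each modifier; defaults are taken into account here.
foreach my $mod ( qw{ s m x } ) {
    printf "/%s is %s\n", $mod,
        $re->modifier_asserted( $mod ) ? 'asserted' : 'not asserted';
}

# modifier() reports only what is literally present on the regexp.
my $explicit = $re->modifier();
printf "Explicit modifiers: '%s'\n",
    $explicit ? $explicit->content() : '';
```
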
   new_from_cache
       This static method wraps "new" in a caching mechanism. Only one object
       will be generated for a given PPI::Element, no matter how many times
       this method is called. Calls after the first for a given PPI::Element
       simply return the same "PPIx::Regexp" object.

       When the "PPIx::Regexp" object is returned from cache, the values of
       the optional arguments are ignored.

       Calls to this method with the regular expression in a string rather
       than a PPI::Element will not be cached.

       Caveat: This method is provided for code like Perl::Critic which might
       instantiate the same object multiple times. The cache will persist
       until "flush_cache" is called.

   flush_cache
        $re->flush_cache();            # Remove $re from cache
        PPIx::Regexp->flush_cache();   # Empty the cache

       This method flushes the cache used by "new_from_cache". If called as a
       static method with no arguments, the entire cache is emptied. Otherwise
       any objects specified are removed from the cache.

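       The caching behavior described above might be exercised as follows.
       This is only a sketch: the one-line Perl snippet is invented, and the
       use of Scalar::Util's refaddr() to compare object identity is a
       choice made for this example rather than anything the package
       requires.

```perl
use strict;
use warnings;
use PPI;
use PPIx::Regexp;
use Scalar::Util qw{ refaddr };

# Parse a small, made-up snippet of Perl from a string reference.
my $doc = PPI::Document->new( \'my $ok = $text =~ m/foo/;' )
    or die 'Unable to parse Perl source';

# Locate the PPI element for the match operator.
my $elem = $doc->find_first( 'PPI::Token::Regexp::Match' )
    or die 'No match operator found';

my $first = PPIx::Regexp->new_from_cache( $elem )
    or die 'Unable to parse regular expression';
my $second = PPIx::Regexp->new_from_cache( $elem );

# The second call should return the cached object, not a new one.
my $same = refaddr( $first ) == refaddr( $second );
print $same ? "same object\n" : "different objects\n";

# Empty the cache once the objects are no longer needed.
PPIx::Regexp->flush_cache();
```
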
   capture_names
        foreach my $name ( $re->capture_names() ) {
            print "Capture name '$name'\n";
        }

       This convenience method returns the capture names found in the regular
       expression.

       This method is equivalent to

        $self->regular_expression()->capture_names();

       except that if "$self->regular_expression()" returns "undef" (meaning
       that something went terribly wrong with the parse) this method will
       simply return.

   delimiters
        print join("\t", PPIx::Regexp->new('s/foo/bar/')->delimiters());
        # prints '//      //'

       When called in list context, this method returns either one or two
       strings, depending on whether the parsed expression has a replacement
       string. In the case of non-bracketed substitutions, the start delimiter
       of the replacement string is considered to be the same as its finish
       delimiter, as illustrated by the above example.

       When called in scalar context, you get the delimiters of the regular
       expression; that is, element 0 of the array that is returned in list
       context.

       Optionally, you can pass an index value and the corresponding
       delimiters will be returned; index 0 represents the regular
       expression's delimiters, and index 1 represents the replacement
       string's delimiters, which may be undef. For example,

        print PPIx::Regexp->new('s{foo}<bar>')->delimiters(1);
        # prints '<>'

       If the object was not initialized with a valid regexp of some sort, the
       results of this method are undefined.

   errstr
       This static method returns the error string from the most recent
       attempt to instantiate a "PPIx::Regexp". It will be "undef" if the most
       recent attempt succeeded.

   extract_regexps
        my $doc = PPI::Document->new( $path );
        $doc->index_locations();
        my @res = PPIx::Regexp->extract_regexps( $doc );

       This convenience (well, sort-of) static method takes as its argument a
       PPI::Document object and returns "PPIx::Regexp" objects corresponding
       to all regular expressions found in it, in the order in which they
       occur in the document. You will need to keep a reference to the
       original PPI::Document object if you wish to be able to recover the
       original PPI::Element objects via the PPIx::Regexp source() method.

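       A self-contained variant of the extract_regexps example, parsing an
       in-memory snippet instead of a file, might look like the following.
       The snippet in $code is invented for illustration; note that $doc
       stays in scope while source() is being called.

```perl
use strict;
use warnings;
use PPI;
use PPIx::Regexp;

# A made-up snippet of Perl containing two regular expressions.
my $code = 'my $re = qr{foo}smx; $text =~ s/bar/baz/;';
my $doc = PPI::Document->new( \$code )
    or die 'Unable to parse Perl source';
$doc->index_locations();

# Objects come back in document order.
my @res = PPIx::Regexp->extract_regexps( $doc );
foreach my $re ( @res ) {
    printf "%-20s from a %s\n", $re->content(), ref $re->source();
}
```
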
   failures
        print "There were ", $re->failures(), " parse failures\n";

       This method returns the number of parse failures. This is a count of
       the number of unknown tokens plus the number of unterminated structures
       plus the number of unmatched right brackets of any sort.

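       One way to combine errstr() and failures() when vetting input is
       sketched below. The helper parse_status() is hypothetical, not part
       of this package, and the sketch assumes that new() returns a false
       value when instantiation fails; the second sample expression is
       deliberately ill-formed, and what is reported for it is not
       guaranteed to be stable from release to release.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Hypothetical helper (not part of PPIx::Regexp): parse a string and
# return a one-line status message.
sub parse_status {
    my ( $expr ) = @_;
    my $re = PPIx::Regexp->new( $expr );
    if ( ! $re ) {
        # Assumption: a false return means instantiation failed.
        my $err = PPIx::Regexp->errstr();
        return 'could not parse: '
            . ( defined $err ? $err : 'unknown error' );
    }
    my $failures = $re->failures();
    return $failures
        ? "parsed with $failures failure(s)"
        : 'parsed cleanly';
}

foreach my $expr ( '/foo/', '/(foo/' ) {
    printf "%-8s %s\n", $expr, parse_status( $expr );
}
```
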
   max_capture_number
        print "Highest used capture number ",
            $re->max_capture_number(), "\n";

       This convenience method returns the highest capture number used by the
       regular expression. If there are no captures, the return will be 0.

       This method is equivalent to

        $self->regular_expression()->max_capture_number();

       except that if "$self->regular_expression()" returns "undef" (meaning
       that something went terribly wrong with the parse) this method will
       too.

   modifier
        my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
        print $re->modifier()->content(), "\n";
        # prints 'smx'.

       This method retrieves the modifier of the object. This comes from the
       end of the initializing string or object and will be a
       PPIx::Regexp::Token::Modifier.

       Note that this object represents the actual modifiers present on the
       regexp, and does not take into account any that may have been applied
       by default (i.e. via the "default_modifiers" argument to new()). For
       something that takes account of default modifiers, see
       modifier_asserted(), below.

       In the event of a parse failure, there may not be a modifier present,
       in which case nothing is returned.

   modifier_asserted
        my $re = PPIx::Regexp->new( '/ . /',
            default_modifiers => [ 'smx' ] );
        print $re->modifier_asserted( 'x' ) ? "yes\n" : "no\n";
        # prints 'yes'.

       This method returns true if the given modifier is asserted for the
       regexp, whether explicitly or by the modifiers passed in the
       "default_modifiers" argument.

       Starting with version 0.036_01, if the argument is a single-character
       modifier followed by an asterisk (intended as a wild card character),
       the return is the number of times that modifier appears. In this case
       an exception will be thrown if you specify a multi-character modifier
       (e.g.  'ee*'), or if you specify one of the match semantics modifiers
       (e.g.  'a*').

   regular_expression
        my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
        print $re->regular_expression()->content(), "\n";
        # prints '/(foo)/'.

       This method returns that portion of the object which actually
       represents a regular expression.

   replacement
        my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
        print $re->replacement()->content(), "\n";
        # prints '${1}bar/'.

       This method returns that portion of the object which represents the
       replacement string. This will be "undef" unless the regular expression
       actually has a replacement string. Delimiters will be included, but
       there will be no beginning delimiter unless the regular expression was
       bracketed.

   source
        my $source = $re->source();

       This method returns the object or string that was used to instantiate
       the object.

   type
        my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
        print $re->type()->content(), "\n";
        # prints 's'.

       This method retrieves the type of the object. This comes from the
       beginning of the initializing string or object, and will be a
       PPIx::Regexp::Token::Structure whose "content" is one of 's', 'm',
       'qr', or ''.

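       Taken together, type(), regular_expression(), replacement() and
       modifier() split a substitution into its top-level parts. The
       following sketch, which reuses the expression from the examples
       above, collects the content of each part and joins them back
       together; the variable names are chosen for illustration.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $expr = 's/(foo)/${1}bar/smx';
my $re = PPIx::Regexp->new( $expr )
    or die "Unable to parse '$expr'";

# Collect the content of each top-level part, skipping any that are absent
# (for example, a match has no replacement).
my @parts = map { $_->content() } grep { defined $_ } (
    $re->type(),
    $re->regular_expression(),
    $re->replacement(),
    $re->modifier(),
);
print join( ' | ', @parts ), "\n";

# For this expression the parts, joined in order, give back the original.
my $rebuilt = join '', @parts;
```
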

RESTRICTIONS

       By the nature of this module, it is never going to get everything
       right.  Many of the known problem areas involve interpolations one way
       or another.

   Ambiguous Syntax
       Perl's regular expressions contain cases where the syntax is ambiguous.
       A particularly egregious example is an interpolation followed by square
       or curly brackets, for example $foo[...]. There is nothing in the
       syntax to say whether the programmer wanted to interpolate an element
       of array @foo, or whether he wanted to interpolate scalar $foo, and
       then follow that interpolation by a character class.

       The perlop documentation notes that in this case what Perl does is to
       guess. That is, it employs various heuristics on the code to try to
       figure out what the programmer wanted. These heuristics are documented
       as being undocumented (!) and subject to change without notice. As an
       example of the problems even perl faces in parsing Perl, see
       <https://github.com/perl/perl5/issues/16478>.

       Given this situation, this module's chances of duplicating every Perl
       version's interpretation of every regular expression are pretty much
       nil.  What it does now is to assume that square brackets containing
       only an integer or an interpolation represent a subscript; otherwise
       they represent a character class. Similarly, curly brackets containing
       only a bareword or an interpolation are a subscript; otherwise they
       represent a quantifier.

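       To see which way a particular bracket is interpreted, the parse can
       be dumped with PPIx::Regexp::Dumper as in the SYNOPSIS. The sketch
       below dumps two invented expressions that differ only in what follows
       $foo; the exact dump format is not shown here because it may vary
       between releases.

```perl
use strict;
use warnings;
use PPIx::Regexp;
use PPIx::Regexp::Dumper;

# Two made-up expressions that differ only in the bracket contents.
my @samples = (
    'm/$foo[1]/',      # integer only: documented to parse as a subscript
    'm/$foo[a-z]/',    # anything else: documented to parse as a class
);

my $dumped = 0;
foreach my $expr ( @samples ) {
    my $re = PPIx::Regexp->new( $expr )
        or die "Unable to parse '$expr'";
    print "--- $expr\n";
    # Print the parse tree so the two interpretations can be compared.
    PPIx::Regexp::Dumper->new( $re )->print();
    $dumped++;
}
```
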
   Changes in Syntax
       Sometimes the introduction of new syntax changes the way a regular
       expression is parsed. For example, the "\v" character class was
       introduced in Perl 5.9.5. But it did not represent a syntax error prior
       to that version of Perl, it was simply parsed as "v". So

        $ perl -le 'print "v" =~ m/\v/ ? "yes" : "no"'

       prints "yes" under Perl 5.8.9, but "no" under 5.10.0. "PPIx::Regexp"
       generally assumes the more modern parse in cases like this.

   Equivocation
       Very occasionally, a construction will be removed and then added back
       -- and then, conceivably, removed again. In this case, the plan is for
       perl_version_introduced() to return the earliest version in which the
       construction appeared, and perl_version_removed() to return the version
       after the last version in which it appeared (whether production or
       development), or "undef" if it is in the highest-numbered Perl.

       The constructions involved in this are:

       Un-escaped literal left curly after literal

       That is, something like "qr<x{>".

       This was made an error in 5.25.1, and it was an error in 5.26.0.  But
       it became a warning again in 5.27.1. The perl5271delta says it was re-
       instated because the changes broke GNU Autoconf, and the warning
       message says it will be removed in Perl 5.30.

       Accordingly, perl_version_introduced() returns 5.0. At the moment
       perl_version_removed() returns '5.025001'. But if it is present with or
       without warning in 5.28, perl_version_removed() will become "undef". If
       you need finer resolution than this, see PPIx::Regexp::Element methods
       accepts_perl() and requirements_for_perl().

   Static Parsing
       It is well known that Perl can not be statically parsed. That is, you
       can not completely parse a piece of Perl code without executing that
       same code.

       Nevertheless, this class is trying to statically parse regular
       expressions. The main problem with this is that there is no way to know
       what is being interpolated into the regular expression by an
       interpolated variable. This is a problem because the interpolated value
       can change the interpretation of adjacent elements.

       This module deals with this by making assumptions about what is in an
       interpolated variable. These assumptions will not be enumerated here,
       but in general the principle is to assume the interpolated value does
       not change the interpretation of the regular expression. For example,

        my $foo = 'a-z]';
        my $re = qr{[$foo};

       is fine with the Perl interpreter, but will confuse the dickens out of
       this module. Similarly and more usefully, something like

        my $mods = 'i';
        my $re = qr{(?$mods:foo)};

       or maybe

        my $mods = 'i';
        my $re = qr{(?$mods)$foo};

       probably sets a modifier of some sort, and that is how this module
       interprets it. If the interpolation is not about modifiers, this module
       will get it wrong. Another such semi-benign example is

        my $foo = $] >= 5.010 ? '?<foo>' : '';
        my $re = qr{($foo\w+)};

       which will parse, but this module will never realize that it might be
       looking at a named capture.

   Non-Standard Syntax
       There are modules out there that alter the syntax of Perl. If the
       syntax of a regular expression is altered, this module has no way to
       understand that it has been altered, much less to adapt to the
       alteration. The following modules are known to cause problems:

       Acme::PerlML, which renders Perl as XML.

       "Data::PostfixDeref", which causes Perl to interpret suffixed empty
       brackets as dereferencing the thing they suffix. This module by Ben
       Morrow ("BMORROW") appears to have been retracted.

       Filter::Trigraph, which recognizes ANSI C trigraphs, allowing Perl to
       be written in the ISO 646 character set.

       Perl6::Pugs. Enough said.

       Perl6::Rules, which back-ports some of the Perl 6 regular expression
       syntax to Perl 5.

       Regexp::Extended, which extends regular expressions in various ways,
       some of which seem to conflict with Perl 5.010.

SEE ALSO

       Regexp::Parsertron, which uses Marpa::R2 to parse the regexp, and Tree
       for navigation. Unlike "PPIx::Regexp", Regexp::Parsertron supports
       modification of the parse tree.

       Regexp::Parser, which parses a bare regular expression (without
       enclosing "qr{}", "m//", or whatever) and uses a different navigation
       model. After a long hiatus, this module has been adopted, and is again
       supported.

       YAPE::Regex, which provides the parse tree, and has a mechanism to
       subclass the various element classes for customization. The most-recent
       release is 2011, but the CPAN testers results are still all green.
       Companion module YAPE::Regex::Explain says what the various pieces of a
       regex do, though constructs added in perl 5.10 and later are not
       supported. I have no idea how I missed this when I originally went
       looking for "Regexp" parsers.

       PPR, which recognizes Perl of all sorts, including regular expressions,
       but does not actually provide a parse of the recognized constructs.

SUPPORT

       Support is by the author. Please file bug reports at
       <https://rt.cpan.org/Public/Dist/Display.html?Name=PPIx-Regexp>,
       <https://github.com/trwyant/perl-PPIx-Regexp/issues>, or in electronic
       mail to the author.

AUTHOR

       Thomas R. Wyant, III wyant at cpan dot org

COPYRIGHT AND LICENSE

       Copyright (C) 2009-2023 by Thomas R. Wyant, III

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl 5.10.0. For more details, see the full
       text of the licenses in the directory LICENSES.

       This program is distributed in the hope that it will be useful, but
       without any warranty; without even the implied warranty of
       merchantability or fitness for a particular purpose.

perl v5.38.0                      2023-07-21                   PPIx::Regexp(3)