PPIx::Regexp(3)       User Contributed Perl Documentation      PPIx::Regexp(3)

NAME

       PPIx::Regexp - Represent a regular expression of some sort

SYNOPSIS

        use PPIx::Regexp;
        use PPIx::Regexp::Dumper;
        my $re = PPIx::Regexp->new( 'qr{foo}smx' );
        PPIx::Regexp::Dumper->new( $re )
            ->print();

DEPRECATION NOTICE

       The postderef argument to new() is being put through a deprecation
       cycle and retracted. After the retraction, postfix dereferences will
       always be recognized. This is the default behaviour now.

       Starting with version 0.074_01, the first use of this argument warned.
       With version 0.079_01, all uses will warn. With version 0.080_01, all
       uses will become fatal. With the first release on or after April 15
       2022 all mention of this argument will be removed.

INHERITANCE

       "PPIx::Regexp" is a PPIx::Regexp::Node.

       "PPIx::Regexp" has no descendants.

DESCRIPTION

       The purpose of the PPIx-Regexp package is to parse regular expressions
       in a manner similar to the way the PPI package parses Perl. This class
       forms the root of the parse tree, playing a role similar to
       PPI::Document.

       This package shares with PPI the property of being round-trip safe.
       That is,

        my $expr = 's/ ( \d+ ) ( \D+ ) /$2$1/smxg';
        my $re = PPIx::Regexp->new( $expr );
        print $re->content() eq $expr ? "yes\n" : "no\n";

       should print 'yes' for any valid regular expression.

       Navigation is similar to that provided by PPI. That is to say, things
       like "children", "find_first", "snext_sibling" and so on all work
       pretty much the same way as in PPI.

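       As a minimal sketch of this navigation (the sample expression is an
       illustrative assumption, not taken from the distribution), the
       following lists the root's children and then looks up the first
       capture group:

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Single quotes keep Perl from interpolating the sample expression.
my $re = PPIx::Regexp->new( 'qr{ (foo) \d+ }smx' )
    or die 'Parse failed: ', PPIx::Regexp->errstr();

# children() returns the immediate children of the root node.
foreach my $child ( $re->children() ) {
    printf "%-36s %s\n", ref $child, $child->content();
}

# find_first() takes a class name and returns the first match, if any.
my $capture = $re->find_first( 'PPIx::Regexp::Structure::Capture' );
if ( $capture ) {
    print 'First capture group: ', $capture->content(), "\n";
    # snext_sibling() skips insignificant elements such as white space.
    my $next = $capture->snext_sibling();
    print 'Next significant sibling: ', $next->content(), "\n"
        if $next;
}
```
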
       The class hierarchy is also similar to PPI. Except for some utility
       classes (the dumper, the lexer, and the tokenizer) all classes are
       descended from PPIx::Regexp::Element, which provides basic navigation.
       Tokens are descended from PPIx::Regexp::Token, which provides content.
       All containers are descended from PPIx::Regexp::Node, which provides
       for children, and all structure elements are descended from
       PPIx::Regexp::Structure, which provides beginning and ending
       delimiters, and a type.

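       One way to see this hierarchy is to walk a parse tree and classify
       each element with isa(). This is only a sketch: walk() is a
       hypothetical helper written for this example, and the sample
       expression is arbitrary. Note that a structure's delimiters are not
       among its children, so they do not appear in the output.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Hypothetical helper: print each element's class, indented by depth.
sub walk {
    my ( $elem, $depth ) = @_;
    # Structure is tested first because it is itself a kind of Node.
    my $kind = $elem->isa( 'PPIx::Regexp::Structure' ) ? 'structure'
        : $elem->isa( 'PPIx::Regexp::Node' )  ? 'node'
        : $elem->isa( 'PPIx::Regexp::Token' ) ? 'token'
        : 'element';
    printf "%s%s (%s)\n", '  ' x $depth, ref $elem, $kind;
    # Only containers have children to descend into.
    if ( $elem->isa( 'PPIx::Regexp::Node' ) ) {
        walk( $_, $depth + 1 ) for $elem->children();
    }
    return;
}

my $re = PPIx::Regexp->new( 'm/(a|b)+c/' )
    or die 'Parse failed: ', PPIx::Regexp->errstr();
walk( $re, 0 );
```
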
       There are two features of PPI that this package does not provide -
       mutability and operator overloading. There are no plans for serious
       mutability, though something like PPI's "prune" functionality might be
       considered. Similarly there are no plans for operator overloading,
       which appears to the author to represent a performance hit for little
       tangible gain.

NOTICE

       The author will attempt to preserve the documented interface, but if
       the interface needs to change to correct some egregiously bad design or
       implementation decision, then it will change.  Any incompatible changes
       will go through a deprecation cycle.

       The goal of this package is to parse well-formed regular expressions
       correctly. A secondary goal is not to blow up on ill-formed regular
       expressions. The correct identification and characterization of
       ill-formed regular expressions is not a goal of this package, nor is
       the consistent parsing of ill-formed regular expressions from release
       to release.

       This policy attempts to track features in development releases as well
       as public releases. However, features added in a development release
       and then removed before the next production release will not be
       tracked, and any functionality relating to such features will be
       removed. The issue here is the potential re-use (with different
       semantics) of syntax that did not make it into the production release.

       From time to time the Perl regular expression engine changes in ways
       that change the parse of a given regular expression. When these changes
       occur, "PPIx::Regexp" will be changed to produce the more modern parse.
       Known examples of this include:

       $( no longer interpolates as of Perl 5.005, per "perl5005delta".
           Newer Perls seem to parse this as "qr{$}" (i.e. an end-of-string or
           newline assertion) followed by an open parenthesis, and that is
           what "PPIx::Regexp" does.

       $) and $| also seem to parse as the "$" assertion
           followed by the relevant meta-character, though I have no
           documentation reference for this.

       "@+" and "@-" no longer interpolate as of Perl 5.9.4
           per "perl594delta". Subsequent Perls treat "@+" as a quantified
           literal and "@-" as two literals, and that is what "PPIx::Regexp"
           does. Note that subscripted references to these arrays do
           interpolate, and are so parsed by "PPIx::Regexp".

       Only space and horizontal tab are whitespace as of Perl 5.23.4
           when inside a bracketed character class inside an extended
           bracketed character class, per "perl5234delta". Formerly any white
           space character parsed as whitespace. This change in "PPIx::Regexp"
           will be reverted if the change in Perl does not make it into Perl
           5.24.0.

       Unescaped literal left curly brackets
           These are being removed in positions where quantifiers are legal,
           so that they can be used for new functionality. Some of them are
           gone in 5.25.1, others will be removed in a future version of Perl.
           In situations where they have been removed, perl_version_removed()
           will return the version in which they were removed. When the new
           functionality appears, the parse produced by this software will
           reflect the new functionality.

           NOTE that the situation with a literal left curly after a literal
           character is complicated. It was made an error in Perl 5.25.1, and
           remained so through all 5.26 releases, but became a warning again
           in 5.27.1 due to its use in GNU Autoconf. Whether it will ever
           become illegal again is not clear to me based on the contents of
           perl5271delta. At the moment perl_version_removed() returns
           "undef", but obviously that is not the whole story, and methods
           accepts_perl() and requirements_for_perl() were introduced to deal
           with this complication.

       "\o{...}"
           is parsed as the octal equivalent of "\x{...}". This is its meaning
           as of perl 5.13.2. Before 5.13.2 it was simply literal 'o' and so
           on.

       "x{,3}"
           (with first count omitted) is allowed as a quantifier as of Perl
           5.33.6.  The previous parse made this all literals.

       "x{ 0 , 3 }"
           (with spaces inside but adjacent to curly brackets, or around the
           comma if any) is allowed as a quantifier as of Perl 5.33.6. The
           previous parse made this all literals.

       There are very probably other examples of this. When they come to light
       they will be documented as producing the modern parse, and the code
       modified to produce this parse if necessary.

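       To see which Perl versions this package associates with the modern
       parse of a construct, each token can be asked for its
       perl_version_introduced() and perl_version_removed(). The following
       is only a sketch; the two sample expressions are taken from the list
       above, and the values printed depend on the installed version of this
       package.

```perl
use strict;
use warnings;
use PPIx::Regexp;

foreach my $expr ( 'm/\o{101}/', 'm/x{,3}/' ) {
    my $re = PPIx::Regexp->new( $expr )
        or die "Could not parse $expr";
    print "$expr\n";
    # The '|| []' guards against a false return when nothing matches.
    my $tokens = $re->find( 'PPIx::Regexp::Token' ) || [];
    foreach my $token ( @{ $tokens } ) {
        my $removed = $token->perl_version_removed();
        printf "    %-10s introduced %s, removed %s\n",
            $token->content(),
            $token->perl_version_introduced(),
            defined $removed ? $removed : 'never';
    }
}
```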

METHODS

       This class provides the following public methods. Methods not
       documented here are private, and unsupported in the sense that the
       author reserves the right to change or remove them without notice.

   new
        my $re = PPIx::Regexp->new('/foo/');

       This method instantiates a "PPIx::Regexp" object from a string, a
       PPI::Token::QuoteLike::Regexp, a PPI::Token::Regexp::Match, or a
       PPI::Token::Regexp::Substitute.  Honestly, any PPI::Element will work,
       but only the three Regexp classes mentioned previously are likely to do
       anything useful.

       Whatever form the argument takes, it is assumed to consist entirely of
       a valid match, substitution, or "qr<>" string.

       Optionally you can pass one or more name/value pairs after the regular
       expression. The possible options are:

       default_modifiers array_reference
           This option specifies a reference to an array of default modifiers
           to apply to the regular expression being parsed. Each modifier is
           specified as a string. Any actual modifiers found supersede the
           defaults.

           When applying the defaults, '?' and '/' are completely ignored, and
           '^' is ignored unless it occurs at the beginning of the modifier.
           The first dash ('-') causes subsequent modifiers to be negated.

           So, for example, if you wish to produce a "PPIx::Regexp" object
           representing the regular expression in

            use re '/smx';
            {
               no re '/x';
               m/ foo /;
            }

           you would (after some help from PPI in finding the relevant
           statements), do something like

            my $re = PPIx::Regexp->new( 'm/ foo /',
                default_modifiers => [ '/smx', '-/x' ] );

       encoding name
           This option specifies the encoding of the regular expression. This
           is passed to the tokenizer, which will "decode" the regular
           expression string before it tokenizes it. For example:

            my $re = PPIx::Regexp->new( '/foo/',
                encoding => 'iso-8859-1',
            );

       index_locations Boolean
           This Boolean option specifies whether the locations of the elements
           in the regular expression should be indexed.

           If unspecified or specified as "undef" a default value is used.
           This default is true if the argument is a PPI::Element or the
           "location" option was specified. Otherwise the default is false.

       location array_reference
           This option specifies the location of the new object in the
           document from which it was created. It is a reference to a
           five-element array compatible with that returned by the
           "location()" method of PPI::Element.

           If not specified, the location of the original string is used if it
           was specified as a PPI::Element.

           If no location can be determined, the various "location()" methods
           will return "undef".

       postderef Boolean
           THIS ARGUMENT IS DEPRECATED.  See DEPRECATION NOTICE above for the
           details.

           This option is passed on to the tokenizer, where it specifies
           whether postfix dereferences are recognized in interpolations and
           code. This experimental feature was introduced in Perl 5.19.5.

           As of version 0.074_01, the default is true.  Through release
           0.074, the default was the value of
           $PPIx::Regexp::Tokenizer::DEFAULT_POSTDEREF, which was true. When
           originally introduced this was false, but was documented as
           becoming true when and if postfix dereferencing became mainstream.
           The intent to mainstream was announced with Perl 5.23.1, and
           became official (so to speak) with Perl 5.24.0, so the default
           became true with PPIx::Regexp 0.049_01.

           Note that if PPI starts unconditionally recognizing postfix
           dereferences, this argument will immediately become ignored, and
           will be put through a deprecation cycle and removed.

       strict Boolean
           This option is passed on to the tokenizer and lexer, where it
           specifies whether the parse should assume "use re 'strict'" is in
           effect.

           The 'strict' pragma was introduced in Perl 5.22, and its
           documentation says that it is experimental, and that there is no
           commitment to backward compatibility. The same applies to the parse
           produced when this option is asserted. Also, the usual caveat
           applies: if "use re 'strict'" ends up being retracted, this option
           and all related functionality will be also.

           Given the nature of "use re 'strict'", you should expect that if
           you assert this option, regular expressions that previously parsed
           without error might no longer do so. If an element ends up being
           declared an error because this option is set, its
           "perl_version_introduced()" will be the Perl version at which "use
           re 'strict'" started rejecting these elements.

           The default is false.

       trace number
           If greater than zero, this option causes trace output from the
           parse.  The author reserves the right to change or eliminate this
           without notice.

       Passing optional input other than the above is not an error, but
       neither is it supported.

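       As a hedged sketch of the "location" option, the following passes a
       made-up five-element location (the numbers are assumptions chosen for
       illustration) and then asks a token for its location(). Because the
       return may be "undef" when no location can be determined, the result
       is checked before use.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Illustrative location in the PPI layout:
# [ line, character, visual column, logical line, logical file name ]
my $re = PPIx::Regexp->new( 'm/foo(bar)/',
    location => [ 12, 5, 5, 12, undef ],
) or die 'Parse failed: ', PPIx::Regexp->errstr();

my $token = $re->find_first( 'PPIx::Regexp::Token::Literal' );
my $where = $token ? $token->location() : undef;
if ( defined $where ) {
    printf "First literal is at line %d, character %d\n",
        $where->[0], $where->[1];
} else {
    print "No location available\n";
}
```
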
   new_from_cache
       This static method wraps "new" in a caching mechanism. Only one object
       will be generated for a given PPI::Element, no matter how many times
       this method is called. Calls after the first for a given PPI::Element
       simply return the same "PPIx::Regexp" object.

       When the "PPIx::Regexp" object is returned from cache, the values of
       the optional arguments are ignored.

       Calls to this method with the regular expression in a string rather
       than a PPI::Element will not be cached.

       Caveat: This method is provided for code like Perl::Critic which might
       instantiate the same object multiple times. The cache will persist
       until "flush_cache" is called.

   flush_cache
        $re->flush_cache();            # Remove $re from cache
        PPIx::Regexp->flush_cache();   # Empty the cache

       This method flushes the cache used by "new_from_cache". If called as a
       static method with no arguments, the entire cache is emptied. Otherwise
       any objects specified are removed from the cache.

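       The caching behaviour can be sketched as follows. The one-line PPI
       document is an assumption made up for this example; per the
       description above, both calls should hand back the same object, which
       is checked here by comparing reference addresses.

```perl
use strict;
use warnings;
use PPI::Document;
use PPIx::Regexp;
use Scalar::Util qw{ refaddr };

# A one-line document, parsed by PPI from a reference to its source.
my $code = 'my $word = qr{ \w+ }smx;';
my $doc = PPI::Document->new( \$code )
    or die 'PPI could not parse the sample code';
my $elem = $doc->find_first( 'PPI::Token::QuoteLike::Regexp' )
    or die 'No qr{} found in the sample code';

# Two calls with the same PPI::Element.
my $first  = PPIx::Regexp->new_from_cache( $elem );
my $second = PPIx::Regexp->new_from_cache( $elem );
print refaddr( $first ) == refaddr( $second )
    ? "same object\n"
    : "different objects\n";

# The cache persists until it is flushed.
PPIx::Regexp->flush_cache();
```
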
   capture_names
        foreach my $name ( $re->capture_names() ) {
            print "Capture name '$name'\n";
        }

       This convenience method returns the capture names found in the regular
       expression.

       This method is equivalent to

        $self->regular_expression()->capture_names();

       except that if "$self->regular_expression()" returns "undef" (meaning
       that something went terribly wrong with the parse) this method will
       simply return.

   delimiters
        print join("\t", PPIx::Regexp->new('s/foo/bar/')->delimiters());
        # prints '//      //'

       When called in list context, this method returns either one or two
       strings, depending on whether the parsed expression has a replacement
       string. In the case of non-bracketed substitutions, the start delimiter
       of the replacement string is considered to be the same as its finish
       delimiter, as illustrated by the above example.

       When called in scalar context, you get the delimiters of the regular
       expression; that is, element 0 of the array that is returned in list
       context.

       Optionally, you can pass an index value and the corresponding
       delimiters will be returned; index 0 represents the regular
       expression's delimiters, and index 1 represents the replacement
       string's delimiters, which may be undef. For example,

        print PPIx::Regexp->new('s{foo}<bar>')->delimiters(1);
        # prints '<>'

       If the object was not initialized with a valid regexp of some sort, the
       results of this method are undefined.

   errstr
       This static method returns the error string from the most recent
       attempt to instantiate a "PPIx::Regexp". It will be "undef" if the most
       recent attempt succeeded.

   extract_regexps
        my $doc = PPI::Document->new( $path );
        $doc->index_locations();
        my @res = PPIx::Regexp->extract_regexps( $doc );

       This convenience (well, sort-of) static method takes as its argument a
       PPI::Document object and returns "PPIx::Regexp" objects corresponding
       to all regular expressions found in it, in the order in which they
       occur in the document. You will need to keep a reference to the
       original PPI::Document object if you wish to be able to recover the
       original PPI::Element objects via the PPIx::Regexp source() method.

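       A slightly fuller sketch of extract_regexps() follows, using an
       in-memory document instead of a file. The sample code is an
       assumption made up for this example. The PPI::Document stays in scope
       for the whole loop so that source() can still return the original
       PPI::Element.

```perl
use strict;
use warnings;
use PPI::Document;
use PPIx::Regexp;

# Sample Perl code containing one match and one substitution.
my $code = <<'EOD';
my $name = 'x';
print "match\n" if $name =~ m/ \A x \z /smx;
( my $copy = $name ) =~ s/x/y/;
EOD

my $doc = PPI::Document->new( \$code )
    or die 'PPI could not parse the sample code';
$doc->index_locations();

my @res = PPIx::Regexp->extract_regexps( $doc );
foreach my $re ( @res ) {
    # source() returns the PPI::Element the object was built from.
    my $elem = $re->source();
    printf "line %d: %s\n", $elem->line_number(), $re->content();
}
```
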
   failures
        print "There were ", $re->failures(), " parse failures\n";

       This method returns the number of parse failures. This is a count of
       the number of unknown tokens plus the number of unterminated structures
       plus the number of unmatched right brackets of any sort.

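       The errstr() and failures() methods can be combined into a simple
       check, sketched below. The second sample expression is deliberately
       ill-formed; how it is reported is not guaranteed from release to
       release (see NOTICE above), so no particular count is assumed here.

```perl
use strict;
use warnings;
use PPIx::Regexp;

foreach my $expr ( 'm/(foo)/', 'm/(foo/' ) {
    my $re = PPIx::Regexp->new( $expr );
    if ( ! defined $re ) {
        # No object at all: ask the class why.
        my $why = PPIx::Regexp->errstr();
        print "$expr: could not instantiate: ",
            defined $why ? $why : 'unknown error', "\n";
        next;
    }
    # An object was built; report how clean the parse was.
    printf "%s: %d parse failure(s)\n", $expr, $re->failures();
}
```
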
   max_capture_number
        print "Highest used capture number ",
            $re->max_capture_number(), "\n";

       This convenience method returns the highest capture number used by the
       regular expression. If there are no captures, the return will be 0.

       This method is equivalent to

        $self->regular_expression()->max_capture_number();

       except that if "$self->regular_expression()" returns "undef" (meaning
       that something went terribly wrong with the parse) this method will
       too.

   modifier
        my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
        print $re->modifier()->content(), "\n";
        # prints 'smx'.

       This method retrieves the modifier of the object. This comes from the
       end of the initializing string or object and will be a
       PPIx::Regexp::Token::Modifier.

       Note that this object represents the actual modifiers present on the
       regexp, and does not take into account any that may have been applied
       by default (i.e. via the "default_modifiers" argument to "new()"). For
       something that takes account of default modifiers, see
       modifier_asserted(), below.

       In the event of a parse failure, there may not be a modifier present,
       in which case nothing is returned.

   modifier_asserted
        my $re = PPIx::Regexp->new( '/ . /',
            default_modifiers => [ 'smx' ] );
        print $re->modifier_asserted( 'x' ) ? "yes\n" : "no\n";
        # prints 'yes'.

       This method returns true if the given modifier is asserted for the
       regexp, whether explicitly or by the modifiers passed in the
       "default_modifiers" argument.

       Starting with version 0.036_01, if the argument is a single-character
       modifier followed by an asterisk (intended as a wild card character),
       the return is the number of times that modifier appears. In this case
       an exception will be thrown if you specify a multi-character modifier
       (e.g.  'ee*'), or if you specify one of the match semantics modifiers
       (e.g.  'a*').

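       The wild card form might be used as in the following sketch, where
       the substitution is an assumption made up for this example. The
       second call asks how many times 'e' appears among the modifiers.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $re = PPIx::Regexp->new( 's/(\w+)/ucfirst $1/eeg' )
    or die 'Parse failed: ', PPIx::Regexp->errstr();

# Plain form: is the modifier asserted at all?
print 'g asserted: ',
    ( $re->modifier_asserted( 'g' ) ? 'yes' : 'no' ), "\n";

# Wild card form: a single-character modifier followed by an asterisk.
print 'number of e modifiers: ', $re->modifier_asserted( 'e*' ), "\n";
```
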
   regular_expression
        my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
        print $re->regular_expression()->content(), "\n";
        # prints '/(foo)/'.

       This method returns that portion of the object which actually
       represents a regular expression.

   replacement
        my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
        print $re->replacement()->content(), "\n";
        # prints '${1}bar/'.

       This method returns that portion of the object which represents the
       replacement string. This will be "undef" unless the regular expression
       actually has a replacement string. Delimiters will be included, but
       there will be no beginning delimiter unless the regular expression was
       bracketed.

   source
        my $source = $re->source();

       This method returns the object or string that was used to instantiate
       the object.

   type
        my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
        print $re->type()->content(), "\n";
        # prints 's'.

       This method retrieves the type of the object. This comes from the
       beginning of the initializing string or object, and will be a
       PPIx::Regexp::Token::Structure whose "content" is one of 's', 'm',
       'qr', or ''.

RESTRICTIONS

       By the nature of this module, it is never going to get everything
       right.  Many of the known problem areas involve interpolations one way
       or another.

   Ambiguous Syntax
       Perl's regular expressions contain cases where the syntax is ambiguous.
       A particularly egregious example is an interpolation followed by square
       or curly brackets, for example $foo[...]. There is nothing in the
       syntax to say whether the programmer wanted to interpolate an element
       of array @foo, or whether he wanted to interpolate scalar $foo, and
       then follow that interpolation by a character class.

       The perlop documentation notes that in this case what Perl does is to
       guess. That is, it employs various heuristics on the code to try to
       figure out what the programmer wanted. These heuristics are documented
       as being undocumented (!) and subject to change without notice. As an
       example of the problems even perl faces in parsing Perl, see
       <https://github.com/perl/perl5/issues/16478>.

       Given this situation, this module's chances of duplicating every Perl
       version's interpretation of every regular expression are pretty much
       nil.  What it does now is to assume that square brackets containing
       only an integer or an interpolation represent a subscript; otherwise
       they represent a character class. Similarly, curly brackets containing
       only a bareword or an interpolation are a subscript; otherwise they
       represent a quantifier.

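       The effect of this heuristic can be inspected with the dumper shown
       in the SYNOPSIS. The sketch below dumps two sample expressions (both
       assumptions made up for this example) that differ only in what the
       square brackets contain, so the two parses can be compared.

```perl
use strict;
use warnings;
use PPIx::Regexp;
use PPIx::Regexp::Dumper;

# Single quotes: $foo must reach the parser, not be interpolated here.
foreach my $expr ( 'm/$foo[1]/', 'm/$foo[a-z]/' ) {
    my $re = PPIx::Regexp->new( $expr )
        or die "Could not parse $expr";
    print "Parse of $expr\n";
    PPIx::Regexp::Dumper->new( $re )->print();
    print "\n";
}
```
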
   Changes in Syntax
       Sometimes the introduction of new syntax changes the way a regular
       expression is parsed. For example, the "\v" character class was
       introduced in Perl 5.9.5. But it did not represent a syntax error prior
       to that version of Perl, it was simply parsed as "v". So

        $ perl -le 'print "v" =~ m/\v/ ? "yes" : "no"'

       prints "yes" under Perl 5.8.9, but "no" under 5.10.0. "PPIx::Regexp"
       generally assumes the more modern parse in cases like this.

   Equivocation
       Very occasionally, a construction will be removed and then added back
       -- and then, conceivably, removed again. In this case, the plan is for
       perl_version_introduced() to return the earliest version in which the
       construction appeared, and perl_version_removed() to return the version
       after the last version in which it appeared (whether production or
       development), or "undef" if it is in the highest-numbered Perl.

       The constructions involved in this are:

       Un-escaped literal left curly after literal

       That is, something like "qr<x{>".

       This was made an error in 5.25.1, and it was an error in 5.26.0.  But
       it became a warning again in 5.27.1. The perl5271delta says it was
       reinstated because the changes broke GNU Autoconf, and the warning
       message says it will be removed in Perl 5.30.

       Accordingly, perl_version_introduced() returns 5.0. At the moment
       perl_version_removed() returns '5.025001'. But if it is present with or
       without warning in 5.28, perl_version_removed() will become "undef". If
       you need finer resolution than this, see PPIx::Regexp::Element methods
       accepts_perl() and requirements_for_perl().

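       As a sketch of that finer resolution, the following asks whether the
       parse of "qr<x{>" is acceptable to several Perl versions and then
       prints the requirements text. The version list is an arbitrary
       assumption, and the answers depend on the installed version of this
       package, so none are shown.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $re = PPIx::Regexp->new( 'qr<x{>' )
    or die 'Parse failed: ', PPIx::Regexp->errstr();

# accepts_perl() answers yes or no for one specific Perl version.
foreach my $version ( qw{ 5.024 5.026 5.028 } ) {
    printf "Perl %s: %s\n", $version,
        $re->accepts_perl( $version ) ? 'accepted' : 'rejected';
}

# requirements_for_perl() describes the acceptable versions as text.
print 'Requirements: ', $re->requirements_for_perl(), "\n";
```
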
   Static Parsing
       It is well known that Perl can not be statically parsed. That is, you
       can not completely parse a piece of Perl code without executing that
       same code.

       Nevertheless, this class is trying to statically parse regular
       expressions. The main problem with this is that there is no way to know
       what is being interpolated into the regular expression by an
       interpolated variable. This is a problem because the interpolated value
       can change the interpretation of adjacent elements.

       This module deals with this by making assumptions about what is in an
       interpolated variable. These assumptions will not be enumerated here,
       but in general the principle is to assume the interpolated value does
       not change the interpretation of the regular expression. For example,

        my $foo = 'a-z]';
        my $re = qr{[$foo};

       is fine with the Perl interpreter, but will confuse the dickens out of
       this module. Similarly and more usefully, something like

        my $mods = 'i';
        my $re = qr{(?$mods:foo)};

       or maybe

        my $mods = 'i';
        my $re = qr{(?$mods)$foo};

       probably sets a modifier of some sort, and that is how this module
       interprets it. If the interpolation is not about modifiers, this module
       will get it wrong. Another such semi-benign example is

        my $foo = $] >= 5.010 ? '?<foo>' : '';
        my $re = qr{($foo\w+)};

       which will parse, but this module will never realize that it might be
       looking at a named capture.

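       The named-capture case can be sketched by comparing capture_names()
       for the interpolated form and for the form Perl would see after
       interpolation. Both expressions are passed as single-quoted strings,
       so this module sees the text before any interpolation.

```perl
use strict;
use warnings;
use PPIx::Regexp;

foreach my $expr ( 'qr{($foo\w+)}', 'qr{(?<foo>\w+)}' ) {
    my $re = PPIx::Regexp->new( $expr )
        or die "Could not parse $expr";
    # capture_names() only sees names that are literally in the text.
    my @names = $re->capture_names();
    printf "%-18s capture names: %s\n", $expr,
        @names ? join( ', ', @names ) : '(none)';
}
```
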
   Non-Standard Syntax
       There are modules out there that alter the syntax of Perl. If the
       syntax of a regular expression is altered, this module has no way to
       understand that it has been altered, much less to adapt to the
       alteration. The following modules are known to cause problems:

       Acme::PerlML, which renders Perl as XML.

       "Data::PostfixDeref", which causes Perl to interpret suffixed empty
       brackets as dereferencing the thing they suffix. This module by Ben
       Morrow ("BMORROW") appears to have been retracted.

       Filter::Trigraph, which recognizes ANSI C trigraphs, allowing Perl to
       be written in the ISO 646 character set.

       Perl6::Pugs. Enough said.

       Perl6::Rules, which back-ports some of the Perl 6 regular expression
       syntax to Perl 5.

       Regexp::Extended, which extends regular expressions in various ways,
       some of which seem to conflict with Perl 5.010.

SEE ALSO

       Regexp::Parsertron, which uses Marpa::R2 to parse the regexp, and Tree
       for navigation. Unlike "PPIx::Regexp", Regexp::Parsertron supports
       modification of the parse tree.

       Regexp::Parser, which parses a bare regular expression (without
       enclosing "qr{}", "m//", or whatever) and uses a different navigation
       model. After a long hiatus, this module has been adopted, and is again
       supported.

       YAPE::Regex, which provides the parse tree, and has a mechanism to
       subclass the various element classes for customization. The most-recent
       release is 2011, but the CPAN testers results are still all green.
       Companion module YAPE::Regex::Explain says what the various pieces of a
       regex do, though constructs added in perl 5.10 and later are not
       supported. I have no idea how I missed this when I originally went
       looking for "Regexp" parsers.

SUPPORT

       Support is by the author. Please file bug reports at
       <https://rt.cpan.org/Public/Dist/Display.html?Name=PPIx-Regexp>,
       <https://github.com/trwyant/perl-PPIx-Regexp/issues>, or in electronic
       mail to the author.

AUTHOR

       Thomas R. Wyant, III wyant at cpan dot org

COPYRIGHT AND LICENSE

       Copyright (C) 2009-2021 by Thomas R. Wyant, III

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl 5.10.0. For more details, see the full
       text of the licenses in the directory LICENSES.

       This program is distributed in the hope that it will be useful, but
       without any warranty; without even the implied warranty of
       merchantability or fitness for a particular purpose.

perl v5.34.0                      2021-10-25                   PPIx::Regexp(3)