PPIx::Regexp(3pm)

PPIx::Regexp(3)       User Contributed Perl Documentation      PPIx::Regexp(3)

NAME

       PPIx::Regexp - Represent a regular expression of some sort

SYNOPSIS

        use PPIx::Regexp;
        use PPIx::Regexp::Dumper;
        my $re = PPIx::Regexp->new( 'qr{foo}smx' );
        PPIx::Regexp::Dumper->new( $re )
            ->print();

DEPRECATION NOTICE

       The postderef argument to new() is being put through a deprecation
       cycle and retracted. After the retraction, postfix dereferences will
       always be recognized. This is the default behaviour now.

       Starting with the first release after October 1 2020, the first use of
       this argument will warn. Six months after that all uses will warn.
       After a further six months, all uses will become fatal.

INHERITANCE

       "PPIx::Regexp" is a PPIx::Regexp::Node.

       "PPIx::Regexp" has no descendants.

DESCRIPTION

       The purpose of the PPIx-Regexp package is to parse regular expressions
       in a manner similar to the way the PPI package parses Perl. This class
       forms the root of the parse tree, playing a role similar to
       PPI::Document.

       This package shares with PPI the property of being round-trip safe.
       That is,

        my $expr = 's/ ( \d+ ) ( \D+ ) /$2$1/smxg';
        my $re = PPIx::Regexp->new( $expr );
        print $re->content() eq $expr ? "yes\n" : "no\n";

       should print 'yes' for any valid regular expression.

       Navigation is similar to that provided by PPI. That is to say, things
       like "children", "find_first", "snext_sibling" and so on all work
       pretty much the same way as in PPI.
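
       As a minimal sketch of this style of navigation, the following walks
       the top-level children of a parse and then looks for the first capture
       group. The pattern and variable names are illustrative only, and the
       class name is spelled out in full on the assumption that find_first()
       accepts a class name the way PPI's does.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Illustrative pattern; any valid match, substitution, or qr string works.
my $re = PPIx::Regexp->new( 'qr{(foo)\d+}smx' )
    or die PPIx::Regexp->errstr();

# Walk the top-level elements, much as one would with a PPI::Document.
foreach my $child ( $re->children() ) {
    printf "%s\t%s\n", ref $child, $child->content();
}

# Look for the first capture group anywhere in the tree.
my $capture = $re->find_first( 'PPIx::Regexp::Structure::Capture' );
print "First capture: ", $capture->content(), "\n" if $capture;
```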

       The class hierarchy is also similar to PPI. Except for some utility
       classes (the dumper, the lexer, and the tokenizer) all classes are
       descended from PPIx::Regexp::Element, which provides basic navigation.
       Tokens are descended from PPIx::Regexp::Token, which provides content.
       All containers are descended from PPIx::Regexp::Node, which provides
       for children, and all structure elements are descended from
       PPIx::Regexp::Structure, which provides beginning and ending
       delimiters, and a type.

       There are two features of PPI that this package does not provide -
       mutability and operator overloading. There are no plans for serious
       mutability, though something like PPI's "prune" functionality might be
       considered. Similarly there are no plans for operator overloading,
       which appears to the author to represent a performance hit for little
       tangible gain.

NOTICE

       The use of this class to parse non-regexp quote-like strings was an
       experiment that I consider failed. Therefore this use is deprecated in
       favor of PPIx::QuoteLike. As of version 0.058_01, the first use of the
       "parse" argument to new() resulted in a warning. As of version
       0.062_01, all uses of the "parse" argument resulted in a warning. As of
       version 0.068_01, the "parse" argument became fatal.

       The author will attempt to preserve the documented interface, but if
       the interface needs to change to correct some egregiously bad design or
       implementation decision, then it will change.  Any incompatible changes
       will go through a deprecation cycle.

       The goal of this package is to parse well-formed regular expressions
       correctly. A secondary goal is not to blow up on ill-formed regular
       expressions. The correct identification and characterization of ill-
       formed regular expressions is not a goal of this package, nor is the
       consistent parsing of ill-formed regular expressions from release to
       release.

       This policy attempts to track features in development releases as well
       as public releases. However, features added in a development release
       and then removed before the next production release will not be
       tracked, and any functionality relating to such features will be
       removed. The issue here is the potential re-use (with different
       semantics) of syntax that did not make it into the production release.

       From time to time the Perl regular expression engine changes in ways
       that change the parse of a given regular expression. When these changes
       occur, "PPIx::Regexp" will be changed to produce the more modern parse.
       Known examples of this include:

       $( no longer interpolates as of Perl 5.005, per "perl5005delta".
           Newer Perls seem to parse this as "qr{$}" (i.e. an end-of-string
           or newline assertion) followed by an open parenthesis, and that is
           what "PPIx::Regexp" does.

       $) and $| also seem to parse as the "$" assertion
           followed by the relevant meta-character, though I have no
           documentation reference for this.

       "@+" and "@-" no longer interpolate as of Perl 5.9.4
           per "perl594delta". Subsequent Perls treat "@+" as a quantified
           literal and "@-" as two literals, and that is what "PPIx::Regexp"
           does. Note that subscripted references to these arrays do
           interpolate, and are so parsed by "PPIx::Regexp".

       Only space and horizontal tab are whitespace as of Perl 5.23.4
           when inside a bracketed character class inside an extended
           bracketed character class, per "perl5234delta". Formerly any white
           space character parsed as whitespace. This change in "PPIx::Regexp"
           will be reverted if the change in Perl does not make it into Perl
           5.24.0.

       Unescaped literal left curly brackets
           These are being removed in positions where quantifiers are legal,
           so that they can be used for new functionality. Some of them are
           gone in 5.25.1, others will be removed in a future version of Perl.
           In situations where they have been removed, perl_version_removed()
           will return the version in which they were removed. When the new
           functionality appears, the parse produced by this software will
           reflect the new functionality.

           NOTE that the situation with a literal left curly after a literal
           character is complicated. It was made an error in Perl 5.25.1, and
           remained so through all 5.26 releases, but became a warning again
           in 5.27.1 due to its use in GNU Autoconf. Whether it will ever
           become illegal again is not clear to me based on the contents of
           perl5271delta. At the moment perl_version_removed() returns
           "undef", but obviously that is not the whole story, and methods
           accepts_perl() and requirements_for_perl() were introduced to deal
           with this complication.

       "\o{...}"
           is parsed as the octal equivalent of "\x{...}". This is its meaning
           as of perl 5.13.2. Before 5.13.2 it was simply literal 'o' and so
           on.
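
       To illustrate the last item with plain Perl (no PPIx::Regexp involved),
       the sketch below checks that an octal escape and the corresponding
       hexadecimal escape match the same character. It assumes a Perl recent
       enough to understand "\o{...}"; the variable names are illustrative.

```perl
use strict;
use warnings;

# Octal 101 and hexadecimal 41 both denote code point 65, the letter 'A'.
my $octal_matches = 'A' =~ m/\A\o{101}\z/ ? 1 : 0;
my $hex_matches   = 'A' =~ m/\A\x{41}\z/  ? 1 : 0;

print "\\o{101} matches 'A': ", ( $octal_matches ? 'yes' : 'no' ), "\n";
print "\\x{41} matches 'A': ",  ( $hex_matches   ? 'yes' : 'no' ), "\n";
```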

       There are very probably other examples of this. When they come to light
       they will be documented as producing the modern parse, and the code
       modified to produce this parse if necessary.

METHODS

       This class provides the following public methods. Methods not
       documented here are private, and unsupported in the sense that the
       author reserves the right to change or remove them without notice.

   new
        my $re = PPIx::Regexp->new('/foo/');

       This method instantiates a "PPIx::Regexp" object from a string, a
       PPI::Token::QuoteLike::Regexp, a PPI::Token::Regexp::Match, or a
       PPI::Token::Regexp::Substitute.  Honestly, any PPI::Element will work,
       but only the three Regexp classes mentioned previously are likely to do
       anything useful.

       Whatever form the argument takes, it is assumed to consist entirely of
       a valid match, substitution, or "qr<>" string.

       Optionally you can pass one or more name/value pairs after the regular
       expression. The possible options are:

       default_modifiers array_reference
           This option specifies a reference to an array of default modifiers
           to apply to the regular expression being parsed. Each modifier is
           specified as a string. Any actual modifiers found supersede the
           defaults.

           When applying the defaults, '?' and '/' are completely ignored, and
           '^' is ignored unless it occurs at the beginning of the modifier.
           The first dash ('-') causes subsequent modifiers to be negated.

           So, for example, if you wish to produce a "PPIx::Regexp" object
           representing the regular expression in

            use re '/smx';
            {
               no re '/x';
               m/ foo /;
            }

           you would (after some help from PPI in finding the relevant
           statements), do something like

            my $re = PPIx::Regexp->new( 'm/ foo /',
                default_modifiers => [ '/smx', '-/x' ] );

       encoding name
           This option specifies the encoding of the regular expression. This
           is passed to the tokenizer, which will "decode" the regular
           expression string before it tokenizes it. For example:

            my $re = PPIx::Regexp->new( '/foo/',
                encoding => 'iso-8859-1',
            );

       index_locations Boolean
           This Boolean option specifies whether the locations of the elements
           in the regular expression should be indexed.

           If unspecified or specified as "undef" a default value is used.
           This default is true if the argument is a PPI::Element or the
           "location" option was specified. Otherwise the default is false.

       location array_reference
           This option specifies the location of the new object in the
           document from which it was created. It is a reference to a five-
           element array compatible with that returned by the "location()"
           method of PPI::Element.

           If not specified, the location of the original string is used if it
           was specified as a PPI::Element.

           If no location can be determined, the various "location()" methods
           will return "undef".

       parse parse_type
           This option specifies what kind of parse is to be done. Possible
           values are 'regex', 'string', or 'guess'. Any value but 'regex' is
           experimental.

           As it turns out, I consider parsing non-regexp quote-like things
           with this class to be a failed experiment, and the relevant
           functionality is being deprecated and removed in favor of
           PPIx::QuoteLike. See above for details. As of version 0.068_01, any
           use of this option throws an exception.

       postderef Boolean
           THIS ARGUMENT IS DEPRECATED.  See DEPRECATION NOTICE above for the
           details.

           This option is passed on to the tokenizer, where it specifies
           whether postfix dereferences are recognized in interpolations and
           code. This experimental feature was introduced in Perl 5.19.5.

           The default is the value of
           $PPIx::Regexp::Tokenizer::DEFAULT_POSTDEREF, which is true. When
           originally introduced this was false, but was documented as
           becoming true when and if postfix dereferencing became mainstream.
           The intent to mainstream was announced with Perl 5.23.1, and
           became official (so to speak) with Perl 5.24.0, so the default
           became true with PPIx::Regexp 0.049_01.

           Note that if PPI starts unconditionally recognizing postfix
           dereferences, this argument will immediately become ignored, and
           will be put through a deprecation cycle and removed.

       strict Boolean
           This option is passed on to the tokenizer and lexer, where it
           specifies whether the parse should assume "use re 'strict'" is in
           effect.

           The 'strict' pragma was introduced in Perl 5.22, and its
           documentation says that it is experimental, and that there is no
           commitment to backward compatibility. The same applies to the parse
           produced when this option is asserted. Also, the usual caveat
           applies: if "use re 'strict'" ends up being retracted, this option
           and all related functionality will be also.

           Given the nature of "use re 'strict'", you should expect that if
           you assert this option, regular expressions that previously parsed
           without error might no longer do so. If an element ends up being
           declared an error because this option is set, its
           "perl_version_introduced()" will be the Perl version at which "use
           re 'strict'" started rejecting these elements.

           The default is false.

       trace number
           If greater than zero, this option causes trace output from the
           parse.  The author reserves the right to change or eliminate this
           without notice.

       Passing optional input other than the above is not an error, but
       neither is it supported.
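
       Pulling a few of these pieces together, the sketch below parses the
       "default_modifiers" example given above, checks the result of new()
       against errstr(), and then asks modifier_asserted() about each default.
       The %asserted hash and the loop are illustrative scaffolding, not part
       of the interface.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Mirrors "use re '/smx'; { no re '/x'; m/ foo /; }" from the text above.
my $re = PPIx::Regexp->new( 'm/ foo /',
    default_modifiers => [ '/smx', '-/x' ] )
    or die PPIx::Regexp->errstr();

# Record which modifiers are in effect once the defaults are applied.
my %asserted;
foreach my $mod ( qw{ s m x } ) {
    $asserted{$mod} = $re->modifier_asserted( $mod ) ? 1 : 0;
    printf "/%s is %s\n", $mod,
        $asserted{$mod} ? 'asserted' : 'not asserted';
}
```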

   new_from_cache
       This static method wraps "new" in a caching mechanism. Only one object
       will be generated for a given PPI::Element, no matter how many times
       this method is called. Calls after the first for a given PPI::Element
       simply return the same "PPIx::Regexp" object.

       When the "PPIx::Regexp" object is returned from cache, the values of
       the optional arguments are ignored.

       Calls to this method with the regular expression in a string rather
       than a PPI::Element will not be cached.

       Caveat: This method is provided for code like Perl::Critic which might
       instantiate the same object multiple times. The cache will persist
       until "flush_cache" is called.

   flush_cache
        $re->flush_cache();            # Remove $re from cache
        PPIx::Regexp->flush_cache();   # Empty the cache

       This method flushes the cache used by "new_from_cache". If called as a
       static method with no arguments, the entire cache is emptied. Otherwise
       any objects specified are removed from the cache.
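
       A minimal sketch of the cache in use follows. It assumes PPI is
       installed and parses a hypothetical one-line source string; the same
       PPI::Element is handed to new_from_cache() twice, and the two results
       are compared by address.

```perl
use strict;
use warnings;
use PPI;
use PPIx::Regexp;

# Hypothetical source text; a real tool would load a file instead.
my $code = q{my $ok = $text =~ m/foo/smx;};
my $doc  = PPI::Document->new( \$code )
    or die "PPI could not parse the source";

my $elem = $doc->find_first( 'PPI::Token::Regexp::Match' )
    or die "No match operator found";

# Both calls receive the same PPI::Element, so the second is a cache hit.
my $first  = PPIx::Regexp->new_from_cache( $elem );
my $second = PPIx::Regexp->new_from_cache( $elem );
print "Same object: ", ( $first == $second ? 'yes' : 'no' ), "\n";

# Empty the cache once the objects are no longer needed.
PPIx::Regexp->flush_cache();
```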

   capture_names
        foreach my $name ( $re->capture_names() ) {
            print "Capture name '$name'\n";
        }

       This convenience method returns the capture names found in the regular
       expression.

       This method is equivalent to

        $self->regular_expression()->capture_names();

       except that if "$self->regular_expression()" returns "undef" (meaning
       that something went terribly wrong with the parse) this method will
       simply return.

   delimiters
        print join("\t", PPIx::Regexp->new('s/foo/bar/')->delimiters());
        # prints '//      //'

       When called in list context, this method returns either one or two
       strings, depending on whether the parsed expression has a replacement
       string. In the case of non-bracketed substitutions, the start delimiter
       of the replacement string is considered to be the same as its finish
       delimiter, as illustrated by the above example.

       When called in scalar context, you get the delimiters of the regular
       expression; that is, element 0 of the array that is returned in list
       context.

       Optionally, you can pass an index value and the corresponding
       delimiters will be returned; index 0 represents the regular
       expression's delimiters, and index 1 represents the replacement
       string's delimiters, which may be undef. For example,

        print PPIx::Regexp->new('s{foo}<bar>')->delimiters(1);
        # prints '<>'

       If the object was not initialized with a valid regexp of some sort, the
       results of this method are undefined.

   errstr
       This static method returns the error string from the most recent
       attempt to instantiate a "PPIx::Regexp". It will be "undef" if the most
       recent attempt succeeded.

   extract_regexps
        my $doc = PPI::Document->new( $path );
        $doc->index_locations();
        my @res = PPIx::Regexp->extract_regexps( $doc );

       This convenience (well, sort-of) static method takes as its argument a
       PPI::Document object and returns "PPIx::Regexp" objects corresponding
       to all regular expressions found in it, in the order in which they
       occur in the document. You will need to keep a reference to the
       original PPI::Document object if you wish to be able to recover the
       original PPI::Element objects via the PPIx::Regexp source() method.
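
       As a hedged illustration of the point about keeping the document
       alive, the sketch below parses a hypothetical three-line snippet from a
       string rather than a file, extracts its regular expressions, and uses
       source() to report the line each one came from.

```perl
use strict;
use warnings;
use PPI;
use PPIx::Regexp;

# Hypothetical snippet containing one match and one substitution.
my $code = <<'EOD';
my $x = 'abc';
$x =~ m/b/;
$x =~ s/c/d/;
EOD

# $doc stays in scope for as long as the source() elements are needed.
my $doc = PPI::Document->new( \$code )
    or die "PPI could not parse the snippet";
$doc->index_locations();

my @res = PPIx::Regexp->extract_regexps( $doc );
foreach my $re ( @res ) {
    my $elem = $re->source();    # the original PPI::Element
    printf "line %d: %s\n", $elem->line_number(), $re->content();
}
```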

   failures
        print "There were ", $re->failures(), " parse failures\n";

       This method returns the number of parse failures. This is a count of
       the number of unknown tokens plus the number of unterminated structures
       plus the number of unmatched right brackets of any sort.
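
       One way to use this count is as a quick screen before trusting a
       parse. The sketch below compares a well-formed expression with a
       deliberately unbalanced one; both inputs are illustrative, and no
       particular count is promised for the ill-formed case, since consistent
       parsing of ill-formed expressions is not a goal of this package.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# The second candidate is deliberately missing a right parenthesis.
my @candidates = ( '/foo(bar)/', '/foo(bar/' );

my %failures;
foreach my $expr ( @candidates ) {
    my $re = PPIx::Regexp->new( $expr );
    if ( ! $re ) {
        # new() did not produce an object; report why and move on.
        warn "Could not parse $expr: ", PPIx::Regexp->errstr(), "\n";
        next;
    }
    $failures{$expr} = $re->failures();
    printf "%-12s %d parse failure(s)\n", $expr, $failures{$expr};
}
```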

   max_capture_number
        print "Highest used capture number ",
            $re->max_capture_number(), "\n";

       This convenience method returns the highest capture number used by the
       regular expression. If there are no captures, the return will be 0.

       This method is equivalent to

        $self->regular_expression()->max_capture_number();

       except that if "$self->regular_expression()" returns "undef" (meaning
       that something went terribly wrong with the parse) this method will
       too.

   modifier
        my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
        print $re->modifier()->content(), "\n";
        # prints 'smx'.

       This method retrieves the modifier of the object. This comes from the
       end of the initializing string or object and will be a
       PPIx::Regexp::Token::Modifier.

       Note that this object represents the actual modifiers present on the
       regexp, and does not take into account any that may have been applied
       by default (i.e. via the "default_modifiers" argument to "new()"). For
       something that takes account of default modifiers, see
       modifier_asserted(), below.

       In the event of a parse failure, there may not be a modifier present,
       in which case nothing is returned.

   modifier_asserted
        my $re = PPIx::Regexp->new( '/ . /',
            default_modifiers => [ 'smx' ] );
        print $re->modifier_asserted( 'x' ) ? "yes\n" : "no\n";
        # prints 'yes'.

       This method returns true if the given modifier is asserted for the
       regexp, whether explicitly or by the modifiers passed in the
       "default_modifiers" argument.

       Starting with version 0.036_01, if the argument is a single-character
       modifier followed by an asterisk (intended as a wild card character),
       the return is the number of times that modifier appears. In this case
       an exception will be thrown if you specify a multi-character modifier
       (e.g.  'ee*'), or if you specify one of the match semantics modifiers
       (e.g.  'a*').

   regular_expression
        my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
        print $re->regular_expression()->content(), "\n";
        # prints '/(foo)/'.

       This method returns that portion of the object which actually
       represents a regular expression.

   replacement
        my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
        print $re->replacement()->content(), "\n";
        # prints '${1}bar/'.

       This method returns that portion of the object which represents the
       replacement string. This will be "undef" unless the regular expression
       actually has a replacement string. Delimiters will be included, but
       there will be no beginning delimiter unless the regular expression was
       bracketed.

   source
        my $source = $re->source();

       This method returns the object or string that was used to instantiate
       the object.

   type
        my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
        print $re->type()->content(), "\n";
        # prints 's'.

       This method retrieves the type of the object. This comes from the
       beginning of the initializing string or object, and will be a
       PPIx::Regexp::Token::Structure whose "content" is one of 's', 'm',
       'qr', or ''.

RESTRICTIONS

       By the nature of this module, it is never going to get everything
       right.  Many of the known problem areas involve interpolations one way
       or another.

   Ambiguous Syntax
       Perl's regular expressions contain cases where the syntax is ambiguous.
       A particularly egregious example is an interpolation followed by square
       or curly brackets, for example $foo[...]. There is nothing in the
       syntax to say whether the programmer wanted to interpolate an element
       of array @foo, or whether he wanted to interpolate scalar $foo, and
       then follow that interpolation by a character class.

       The perlop documentation notes that in this case what Perl does is to
       guess. That is, it employs various heuristics on the code to try to
       figure out what the programmer wanted. These heuristics are documented
       as being undocumented (!) and subject to change without notice. As an
       example of the problems even perl faces in parsing Perl, see
       <https://github.com/perl/perl5/issues/16478>.

       Given this situation, this module's chances of duplicating every Perl
       version's interpretation of every regular expression are pretty much
       nil.  What it does now is to assume that square brackets containing
       only an integer or an interpolation represent a subscript; otherwise
       they represent a character class. Similarly, curly brackets containing
       only a bareword or an interpolation are a subscript; otherwise they
       represent a quantifier.
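
       The rule just described can be sketched in a few lines of plain Perl.
       The classify_bracket() helper below is hypothetical and is not the
       module's actual code; in particular it recognizes only simple scalar
       interpolations, which is a simplifying assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of PPIx::Regexp: classify the bracketed
# text that directly follows an interpolated variable.
sub classify_bracket {
    my ( $text ) = @_;
    my $interpolation = qr/ \$ \w+ /smx;    # simplified: plain scalars only

    if ( $text =~ m/ \A \[ (.*) \] \z /smx ) {
        my $inside = $1;
        # Only an integer or an interpolation counts as a subscript.
        return $inside =~ m/ \A (?: -? \d+ | $interpolation ) \z /smx
            ? 'subscript' : 'character class';
    }
    if ( $text =~ m/ \A \{ (.*) \} \z /smx ) {
        my $inside = $1;
        # Only a bareword or an interpolation counts as a subscript.
        return $inside =~ m/ \A (?: [[:alpha:]_] \w* | $interpolation ) \z /smx
            ? 'subscript' : 'quantifier';
    }
    return 'neither';
}

foreach my $sample ( '[0]', '[$i]', '[a-z]', '{bar}', '{$key}', '{2,3}' ) {
    printf "\$foo%-8s %s\n", $sample, classify_bracket( $sample );
}
```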

   Changes in Syntax
       Sometimes the introduction of new syntax changes the way a regular
       expression is parsed. For example, the "\v" character class was
       introduced in Perl 5.9.5. But it did not represent a syntax error prior
       to that version of Perl, it was simply parsed as "v". So

        $ perl -le 'print "v" =~ m/\v/ ? "yes" : "no"'

       prints "yes" under Perl 5.8.9, but "no" under 5.10.0. "PPIx::Regexp"
       generally assumes the more modern parse in cases like this.

   Equivocation
       Very occasionally, a construction will be removed and then added back
       -- and then, conceivably, removed again. In this case, the plan is for
       perl_version_introduced() to return the earliest version in which the
       construction appeared, and perl_version_removed() to return the version
       after the last version in which it appeared (whether production or
       development), or "undef" if it is in the highest-numbered Perl.

       The constructions involved in this are:

       Un-escaped literal left curly after literal

       That is, something like "qr<x{>".

       This was made an error in 5.25.1, and it was an error in 5.26.0.  But
       it became a warning again in 5.27.1. The perl5271delta says it was re-
       instated because the changes broke GNU Autoconf, and the warning
       message says it will be removed in Perl 5.30.

       Accordingly, perl_version_introduced() returns 5.0. At the moment
       perl_version_removed() returns '5.025001'. But if it is present with or
       without warning in 5.28, perl_version_removed() will become "undef". If
       you need finer resolution than this, see PPIx::Regexp::Element methods
       accepts_perl() and requirements_for_perl().

   Static Parsing
       It is well known that Perl can not be statically parsed. That is, you
       can not completely parse a piece of Perl code without executing that
       same code.

       Nevertheless, this class is trying to statically parse regular
       expressions. The main problem with this is that there is no way to know
       what is being interpolated into the regular expression by an
       interpolated variable. This is a problem because the interpolated value
       can change the interpretation of adjacent elements.

       This module deals with this by making assumptions about what is in an
       interpolated variable. These assumptions will not be enumerated here,
       but in general the principle is to assume the interpolated value does
       not change the interpretation of the regular expression. For example,

        my $foo = 'a-z]';
        my $re = qr{[$foo};

       is fine with the Perl interpreter, but will confuse the dickens out of
       this module. Similarly and more usefully, something like

        my $mods = 'i';
        my $re = qr{(?$mods:foo)};

       or maybe

        my $mods = 'i';
        my $re = qr{(?$mods)$foo};

       probably sets a modifier of some sort, and that is how this module
       interprets it. If the interpolation is not about modifiers, this module
       will get it wrong. Another such semi-benign example is

        my $foo = $] >= 5.010 ? '?<foo>' : '';
        my $re = qr{($foo\w+)};

       which will parse, but this module will never realize that it might be
       looking at a named capture.
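
       To see why such cases are out of reach for a static parser, the plain
       Perl sketch below (no PPIx::Regexp involved) runs two of the fragments
       above and shows that the interpolated text really does supply the end
       of a character class and the name of a capture at run time. The
       variable names are illustrative, and the sketch assumes a Perl that
       supports named captures.

```perl
use strict;
use warnings;

# The closing bracket of the character class arrives via interpolation.
my $class    = 'a-z]';
my $re1      = qr{[$class};
my $class_ok = 'q' =~ $re1 ? 1 : 0;
print "Character class completed at run time: ",
    ( $class_ok ? 'yes' : 'no' ), "\n";

# The named-capture syntax also arrives via interpolation.
my $name     = '?<word>';
my $re2      = qr{($name\w+)};
my $captured = 'hello' =~ $re2 ? $+{word} : undef;
print "Named capture saw: ",
    ( defined $captured ? $captured : '(nothing)' ), "\n";
```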

   Non-Standard Syntax
       There are modules out there that alter the syntax of Perl. If the
       syntax of a regular expression is altered, this module has no way to
       understand that it has been altered, much less to adapt to the
       alteration. The following modules are known to cause problems:

       Acme::PerlML, which renders Perl as XML.

       "Data::PostfixDeref", which causes Perl to interpret suffixed empty
       brackets as dereferencing the thing they suffix. This module by Ben
       Morrow ("BMORROW") appears to have been retracted.

       Filter::Trigraph, which recognizes ANSI C trigraphs, allowing Perl to
       be written in the ISO 646 character set.

       Perl6::Pugs. Enough said.

       Perl6::Rules, which back-ports some of the Perl 6 regular expression
       syntax to Perl 5.

       Regexp::Extended, which extends regular expressions in various ways,
       some of which seem to conflict with Perl 5.010.

SUPPORT

       Support is by the author. Please file bug reports at
       <https://rt.cpan.org>, or in electronic mail to the author.

AUTHOR

       Thomas R. Wyant, III wyant at cpan dot org

COPYRIGHT AND LICENSE

       Copyright (C) 2009-2020 by Thomas R. Wyant, III

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl 5.10.0. For more details, see the full
       text of the licenses in the directory LICENSES.

       This program is distributed in the hope that it will be useful, but
       without any warranty; without even the implied warranty of
       merchantability or fitness for a particular purpose.

perl v5.32.0                      2020-07-29                   PPIx::Regexp(3)