PPIx::Regexp(3) User Contributed Perl Documentation PPIx::Regexp(3)
2
3
4
NAME
6 PPIx::Regexp - Represent a regular expression of some sort
7
SYNOPSIS
9 use PPIx::Regexp;
10 use PPIx::Regexp::Dumper;
11 my $re = PPIx::Regexp->new( 'qr{foo}smx' );
12 PPIx::Regexp::Dumper->new( $re )
13 ->print();
14
DEPRECATION NOTICE
The "postderef" argument to new() is retracted, and postfix
dereferences are always recognized.
18
19 Starting with version 0.074_01, the first use of this argument warned.
20 With version 0.079_01, all uses warned. With version 0.080_01, all uses
21 became fatal. With version 0.084_01, all mention of this argument was
22 removed, except for this notice.
23
INHERITANCE
25 "PPIx::Regexp" is a PPIx::Regexp::Node.
26
27 "PPIx::Regexp" has no descendants.
28
DESCRIPTION
30 The purpose of the PPIx-Regexp package is to parse regular expressions
31 in a manner similar to the way the PPI package parses Perl. This class
32 forms the root of the parse tree, playing a role similar to
33 PPI::Document.
34
35 This package shares with PPI the property of being round-trip safe.
36 That is,
37
    my $expr = 's/ ( \d+ ) ( \D+ ) /$2$1/smxg';
    my $re = PPIx::Regexp->new( $expr );
    print $re->content() eq $expr ? "yes\n" : "no\n";
41
42 should print 'yes' for any valid regular expression.
43
44 Navigation is similar to that provided by PPI. That is to say, things
45 like "children", "find_first", "snext_sibling" and so on all work
46 pretty much the same way as in PPI.
47
48 The class hierarchy is also similar to PPI. Except for some utility
49 classes (the dumper, the lexer, and the tokenizer) all classes are
50 descended from PPIx::Regexp::Element, which provides basic navigation.
51 Tokens are descended from PPIx::Regexp::Token, which provides content.
52 All containers are descended from PPIx::Regexp::Node, which provides
53 for children, and all structure elements are descended from
54 PPIx::Regexp::Structure, which provides beginning and ending
55 delimiters, and a type.
56
57 There are two features of PPI that this package does not provide -
58 mutability and operator overloading. There are no plans for serious
59 mutability, though something like PPI's "prune" functionality might be
60 considered. Similarly there are no plans for operator overloading,
61 which appears to the author to represent a performance hit for little
62 tangible gain.
63
NOTICE
65 The author will attempt to preserve the documented interface, but if
66 the interface needs to change to correct some egregiously bad design or
67 implementation decision, then it will change. Any incompatible changes
68 will go through a deprecation cycle.
69
70 The goal of this package is to parse well-formed regular expressions
71 correctly. A secondary goal is not to blow up on ill-formed regular
72 expressions. The correct identification and characterization of ill-
73 formed regular expressions is not a goal of this package, nor is the
74 consistent parsing of ill-formed regular expressions from release to
75 release.
76
77 This policy attempts to track features in development releases as well
78 as public releases. However, features added in a development release
79 and then removed before the next production release will not be
80 tracked, and any functionality relating to such features will be
81 removed. The issue here is the potential re-use (with different
82 semantics) of syntax that did not make it into the production release.
83
84 From time to time the Perl regular expression engine changes in ways
85 that change the parse of a given regular expression. When these changes
86 occur, "PPIx::Regexp" will be changed to produce the more modern parse.
87 Known examples of this include:
88
89 $( no longer interpolates as of Perl 5.005, per "perl5005delta".
90 Newer Perls seem to parse this as "qr{$}" (i.e. an end-of-string or
91 newline assertion) followed by an open parenthesis, and that is
92 what "PPIx::Regexp" does.
93
94 $) and $| also seem to parse as the "$" assertion
95 followed by the relevant meta-character, though I have no
96 documentation reference for this.
97
98 "@+" and "@-" no longer interpolate as of Perl 5.9.4
99 per "perl594delta". Subsequent Perls treat "@+" as a quantified
100 literal and "@-" as two literals, and that is what "PPIx::Regexp"
101 does. Note that subscripted references to these arrays do
102 interpolate, and are so parsed by "PPIx::Regexp".
103
104 Only space and horizontal tab are whitespace as of Perl 5.23.4
105 when inside a bracketed character class inside an extended
106 bracketed character class, per "perl5234delta". Formerly any white
107 space character parsed as whitespace. This change in "PPIx::Regexp"
108 will be reverted if the change in Perl does not make it into Perl
109 5.24.0.
110
111 Unescaped literal left curly brackets
112 These are being removed in positions where quantifiers are legal,
113 so that they can be used for new functionality. Some of them are
114 gone in 5.25.1, others will be removed in a future version of Perl.
115 In situations where they have been removed, perl_version_removed()
116 will return the version in which they were removed. When the new
117 functionality appears, the parse produced by this software will
118 reflect the new functionality.
119
120 NOTE that the situation with a literal left curly after a literal
121 character is complicated. It was made an error in Perl 5.25.1, and
122 remained so through all 5.26 releases, but became a warning again
123 in 5.27.1 due to its use in GNU Autoconf. Whether it will ever
124 become illegal again is not clear to me based on the contents of
125 perl5271delta. At the moment perl_version_removed() returns
126 "undef", but obviously that is not the whole story, and methods
127 accepts_perl() and requirements_for_perl() were introduced to deal
128 with this complication.
129
130 "\o{...}"
131 is parsed as the octal equivalent of "\x{...}". This is its meaning
132 as of perl 5.13.2. Before 5.13.2 it was simply literal 'o' and so
133 on.
134
135 "x{,3}"
136 (with first count omitted) is allowed as a quantifier as of Perl
137 5.33.6. The previous parse made this all literals.
138
139 "x{ 0 , 3 }"
140 (with spaces inside but adjacent to curly brackets, or around the
141 comma if any) is allowed as a quantifier as of Perl 5.33.6. The
142 previous parse made this all literals.
143
144 There are very probably other examples of this. When they come to light
145 they will be documented as producing the modern parse, and the code
146 modified to produce this parse if necessary.
147
METHODS
149 This class provides the following public methods. Methods not
150 documented here are private, and unsupported in the sense that the
151 author reserves the right to change or remove them without notice.
152
153 new
154 my $re = PPIx::Regexp->new('/foo/');
155
156 This method instantiates a "PPIx::Regexp" object from a string, a
157 PPI::Token::QuoteLike::Regexp, a PPI::Token::Regexp::Match, or a
158 PPI::Token::Regexp::Substitute. Honestly, any PPI::Element will work,
159 but only the three Regexp classes mentioned previously are likely to do
160 anything useful.
161
162 Whatever form the argument takes, it is assumed to consist entirely of
163 a valid match, substitution, or "qr<>" string.
164
165 Optionally you can pass one or more name/value pairs after the regular
166 expression. The possible options are:
167
168 default_modifiers array_reference
169 This option specifies a reference to an array of default modifiers
170 to apply to the regular expression being parsed. Each modifier is
171 specified as a string. Any actual modifiers found supersede the
172 defaults.
173
174 When applying the defaults, '?' and '/' are completely ignored, and
175 '^' is ignored unless it occurs at the beginning of the modifier.
176 The first dash ('-') causes subsequent modifiers to be negated.
177
178 So, for example, if you wish to produce a "PPIx::Regexp" object
179 representing the regular expression in
180
181 use re '/smx';
182 {
183 no re '/x';
184 m/ foo /;
185 }
186
187 you would (after some help from PPI in finding the relevant
188 statements), do something like
189
190 my $re = PPIx::Regexp->new( 'm/ foo /',
191 default_modifiers => [ '/smx', '-/x' ] );
192
193 encoding name
194 This option specifies the encoding of the regular expression. This
195 is passed to the tokenizer, which will "decode" the regular
196 expression string before it tokenizes it. For example:
197
198 my $re = PPIx::Regexp->new( '/foo/',
199 encoding => 'iso-8859-1',
200 );
201
202 index_locations Boolean
203 This Boolean option specifies whether the locations of the elements
204 in the regular expression should be indexed.
205
206 If unspecified or specified as "undef" a default value is used.
207 This default is true if the argument is a PPI::Element or the
208 "location" option was specified. Otherwise the default is false.
209
210 location array_reference
211 This option specifies the location of the new object in the
212 document from which it was created. It is a reference to a five-
213 element array compatible with that returned by the "location()"
214 method of PPI::Element.
215
216 If not specified, the location of the original string is used if it
217 was specified as a PPI::Element.
218
219 If no location can be determined, the various "location()" methods
220 will return "undef".
221
222 postderef Boolean
223 THIS ARGUMENT IS DEPRECATED. See DEPRECATION NOTICE above for the
224 details.
225
226 This option is passed on to the tokenizer, where it specifies
227 whether postfix dereferences are recognized in interpolations and
228 code. This experimental feature was introduced in Perl 5.19.5.
229
230 As of version 0.074_01, the default is true. Through release
231 0.074, the default was the value of
232 $PPIx::Regexp::Tokenizer::DEFAULT_POSTDEREF, which was true. When
233 originally introduced this was false, but was documented as
234 becoming true when and if postfix dereferencing became mainstream.
235 The intent to mainstream was announced with Perl 5.23.1, and
236 became official (so to speak) with Perl 5.24.0, so the default
237 became true with PPIx::Regexp 0.049_01.
238
239 Note that if PPI starts unconditionally recognizing postfix
240 dereferences, this argument will immediately become ignored, and
241 will be put through a deprecation cycle and removed.
242
243 strict Boolean
244 This option is passed on to the tokenizer and lexer, where it
245 specifies whether the parse should assume "use re 'strict'" is in
246 effect.
247
248 The 'strict' pragma was introduced in Perl 5.22, and its
249 documentation says that it is experimental, and that there is no
250 commitment to backward compatibility. The same applies to the parse
251 produced when this option is asserted. Also, the usual caveat
252 applies: if "use re 'strict'" ends up being retracted, this option
253 and all related functionality will be also.
254
255 Given the nature of "use re 'strict'", you should expect that if
256 you assert this option, regular expressions that previously parsed
257 without error might no longer do so. If an element ends up being
258 declared an error because this option is set, its
259 "perl_version_introduced()" will be the Perl version at which "use
260 re 'strict'" started rejecting these elements.
261
262 The default is false.
263
264 trace number
265 If greater than zero, this option causes trace output from the
266 parse. The author reserves the right to change or eliminate this
267 without notice.
268
269 Passing optional input other than the above is not an error, but
270 neither is it supported.
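
To make the "default_modifiers" option more concrete, here is a small sketch that repeats the "use re" example above and then asks which modifiers end up asserted. It relies only on "new" and "modifier_asserted" as documented in this page; the loop and its output format are illustrative.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Mimic "use re '/smx'" followed by "no re '/x'" in an inner scope.
my $re = PPIx::Regexp->new( 'm/ foo /',
    default_modifiers => [ '/smx', '-/x' ],
) or die 'Parse failed: ', PPIx::Regexp->errstr();

# Report the effective state of each modifier of interest.
foreach my $mod ( qw{ s m x } ) {
    printf "/%s is %s\n", $mod,
        $re->modifier_asserted( $mod ) ? 'asserted' : 'not asserted';
}
```

Because the regular expression itself carries no modifiers, whatever is reported here comes entirely from the defaults.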
271
272 new_from_cache
273 This static method wraps "new" in a caching mechanism. Only one object
274 will be generated for a given PPI::Element, no matter how many times
275 this method is called. Calls after the first for a given PPI::Element
276 simply return the same "PPIx::Regexp" object.
277
278 When the "PPIx::Regexp" object is returned from cache, the values of
279 the optional arguments are ignored.
280
281 Calls to this method with the regular expression in a string rather
282 than a PPI::Element will not be cached.
283
284 Caveat: This method is provided for code like Perl::Critic which might
285 instantiate the same object multiple times. The cache will persist
286 until "flush_cache" is called.
287
288 flush_cache
289 $re->flush_cache(); # Remove $re from cache
290 PPIx::Regexp->flush_cache(); # Empty the cache
291
292 This method flushes the cache used by "new_from_cache". If called as a
293 static method with no arguments, the entire cache is emptied. Otherwise
294 any objects specified are removed from the cache.
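
The caching behavior might be exercised along the following lines. This is a sketch only: it assumes PPI is installed (PPIx::Regexp depends on it), and the one-line document and the variable names are invented for the illustration.

```perl
use strict;
use warnings;
use PPI::Document;
use PPIx::Regexp;
use Scalar::Util qw{ refaddr };

# Parse a tiny document and locate its one match operator.
my $doc = PPI::Document->new( \'m/foo/;' )
    or die 'PPI could not parse the document';
my $elem = $doc->find_first( 'PPI::Token::Regexp::Match' )
    or die 'No match operator found';

# Two requests for the same PPI::Element should yield the same object.
my $first  = PPIx::Regexp->new_from_cache( $elem );
my $second = PPIx::Regexp->new_from_cache( $elem );
my $same   = refaddr( $first ) == refaddr( $second );
print $same ? "same object\n" : "different objects\n";

# Empty the cache once the objects are no longer needed.
PPIx::Regexp->flush_cache();
```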
295
296 capture_names
297 foreach my $name ( $re->capture_names() ) {
298 print "Capture name '$name'\n";
299 }
300
301 This convenience method returns the capture names found in the regular
302 expression.
303
304 This method is equivalent to
305
306 $self->regular_expression()->capture_names();
307
308 except that if "$self->regular_expression()" returns "undef" (meaning
309 that something went terribly wrong with the parse) this method will
310 simply return.
311
312 delimiters
313 print join("\t", PPIx::Regexp->new('s/foo/bar/')->delimiters());
314 # prints '// //'
315
316 When called in list context, this method returns either one or two
317 strings, depending on whether the parsed expression has a replacement
318 string. In the case of non-bracketed substitutions, the start delimiter
319 of the replacement string is considered to be the same as its finish
320 delimiter, as illustrated by the above example.
321
322 When called in scalar context, you get the delimiters of the regular
323 expression; that is, element 0 of the array that is returned in list
324 context.
325
326 Optionally, you can pass an index value and the corresponding
327 delimiters will be returned; index 0 represents the regular
328 expression's delimiters, and index 1 represents the replacement
329 string's delimiters, which may be undef. For example,
330
331 print PPIx::Regexp->new('s{foo}<bar>')->delimiters(1);
332 # prints '<>'
333
334 If the object was not initialized with a valid regexp of some sort, the
335 results of this method are undefined.
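
The three calling conventions can be compared side by side. The sketch below uses only "new" and "delimiters" as described above; the variable names are illustrative.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $subst = PPIx::Regexp->new( 's{foo}<bar>' )
    or die 'Parse failed: ', PPIx::Regexp->errstr();
my $match = PPIx::Regexp->new( 'm/foo/' )
    or die 'Parse failed: ', PPIx::Regexp->errstr();

# Scalar context: the delimiters of the regular expression only.
my $regex_delims = $subst->delimiters();

# Explicit index: 1 selects the replacement string's delimiters.
my $repl_delims = $subst->delimiters( 1 );

# List context: one string per delimited part of the expression.
my @subst_delims = $subst->delimiters();
my @match_delims = $match->delimiters();

print "regex: $regex_delims, replacement: $repl_delims\n";
printf "substitution has %d delimiter pair(s), match has %d\n",
    scalar @subst_delims, scalar @match_delims;
```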
336
337 errstr
338 This static method returns the error string from the most recent
339 attempt to instantiate a "PPIx::Regexp". It will be "undef" if the most
340 recent attempt succeeded.
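
A typical calling pattern, sketched under the assumption that a failed "new" returns a false value, checks the result of "new" and consults "errstr" only on failure:

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $re = PPIx::Regexp->new( '/foo/' );
defined $re
    or die 'Parse failed: ', PPIx::Regexp->errstr();

# After a successful instantiation there is no error to report.
my $err = PPIx::Regexp->errstr();
print defined $err ? "error: $err\n" : "no error recorded\n";
```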
341
342 extract_regexps
    my $doc = PPI::Document->new( $path );
    $doc->index_locations();
    my @res = PPIx::Regexp->extract_regexps( $doc );
346
347 This convenience (well, sort-of) static method takes as its argument a
348 PPI::Document object and returns "PPIx::Regexp" objects corresponding
349 to all regular expressions found in it, in the order in which they
350 occur in the document. You will need to keep a reference to the
351 original PPI::Document object if you wish to be able to recover the
352 original PPI::Element objects via the PPIx::Regexp source() method.
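
For experimentation without a file on disk, the document can be built from a string. The sketch below assumes PPI is installed; the sample code in the here-document is invented for the illustration.

```perl
use strict;
use warnings;
use PPI::Document;
use PPIx::Regexp;

# A small piece of Perl containing one match and one substitution.
my $code = <<'EOD';
my $text = 'abc';
$text =~ m/b/;
$text =~ s/c/d/;
EOD

# Keep $doc in scope for as long as the source() elements are wanted.
my $doc = PPI::Document->new( \$code )
    or die 'PPI could not parse the code';
$doc->index_locations();

my @res   = PPIx::Regexp->extract_regexps( $doc );
my @found = map { $_->content() } @res;
print "$_\n" for @found;
```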
353
354 failures
355 print "There were ", $re->failures(), " parse failures\n";
356
357 This method returns the number of parse failures. This is a count of
358 the number of unknown tokens plus the number of unterminated structures
359 plus the number of unmatched right brackets of any sort.
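
As a sketch of how this count might be used, the following compares a well-formed expression with one containing an unmatched right parenthesis. The two sample expressions and the %failures hash are illustrative only.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my %failures;
foreach my $expr ( '/(foo)/', '/foo)/' ) {
    my $re = PPIx::Regexp->new( $expr )
        or die 'Parse failed: ', PPIx::Regexp->errstr();

    # A non-zero count means some part of the parse is suspect.
    $failures{$expr} = $re->failures();
    printf "%-10s %d parse failure(s)\n", $expr, $failures{$expr};
}
```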
360
361 max_capture_number
362 print "Highest used capture number ",
363 $re->max_capture_number(), "\n";
364
365 This convenience method returns the highest capture number used by the
366 regular expression. If there are no captures, the return will be 0.
367
368 This method is equivalent to
369
370 $self->regular_expression()->max_capture_number();
371
372 except that if "$self->regular_expression()" returns "undef" (meaning
373 that something went terribly wrong with the parse) this method will
374 too.
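
The two capture-related convenience methods, capture_names() and max_capture_number(), can be tried together. The sketch below sorts the names explicitly because this page does not state the order in which they are returned; the sample expression is invented for the illustration.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $re = PPIx::Regexp->new( 'm/(?<year>\d{4})-(?<month>\d{2})/' )
    or die 'Parse failed: ', PPIx::Regexp->errstr();

# Sort with an explicit block, since no return order is documented.
my @names = sort { $a cmp $b } $re->capture_names();
my $max   = $re->max_capture_number();

print "Capture name '$_'\n" for @names;
print "Highest used capture number $max\n";
```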
375
376 modifier
377 my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
378 print $re->modifier()->content(), "\n";
379 # prints 'smx'.
380
381 This method retrieves the modifier of the object. This comes from the
382 end of the initializing string or object and will be a
383 PPIx::Regexp::Token::Modifier.
384
385 Note that this object represents the actual modifiers present on the
386 regexp, and does not take into account any that may have been applied
387 by default (i.e. via the "default_modifiers" argument to "new()"). For
388 something that takes account of default modifiers, see
389 modifier_asserted(), below.
390
391 In the event of a parse failure, there may not be a modifier present,
392 in which case nothing is returned.
393
394 modifier_asserted
395 my $re = PPIx::Regexp->new( '/ . /',
396 default_modifiers => [ 'smx' ] );
397 print $re->modifier_asserted( 'x' ) ? "yes\n" : "no\n";
398 # prints 'yes'.
399
400 This method returns true if the given modifier is asserted for the
401 regexp, whether explicitly or by the modifiers passed in the
402 "default_modifiers" argument.
403
404 Starting with version 0.036_01, if the argument is a single-character
405 modifier followed by an asterisk (intended as a wild card character),
406 the return is the number of times that modifier appears. In this case
407 an exception will be thrown if you specify a multi-character modifier
408 (e.g. 'ee*'), or if you specify one of the match semantics modifiers
409 (e.g. 'a*').
410
411 regular_expression
412 my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
413 print $re->regular_expression()->content(), "\n";
414 # prints '/(foo)/'.
415
416 This method returns that portion of the object which actually
417 represents a regular expression.
418
419 replacement
420 my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
421 print $re->replacement()->content(), "\n";
422 # prints '${1}bar/'.
423
424 This method returns that portion of the object which represents the
425 replacement string. This will be "undef" unless the regular expression
426 actually has a replacement string. Delimiters will be included, but
427 there will be no beginning delimiter unless the regular expression was
428 bracketed.
429
430 source
431 my $source = $re->source();
432
433 This method returns the object or string that was used to instantiate
434 the object.
435
436 type
437 my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
438 print $re->type()->content(), "\n";
439 # prints 's'.
440
441 This method retrieves the type of the object. This comes from the
442 beginning of the initializing string or object, and will be a
443 PPIx::Regexp::Token::Structure whose "content" is one of 's', 'm',
444 'qr', or ''.
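
Taken together, the type(), regular_expression(), replacement() and modifier() accessors split a substitution into its parts. The sketch below collects all four and checks that they reassemble into the original text; the %part hash is illustrative. For anything other than a substitution, replacement() returns "undef" and would need a guard.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $expr = 's/(foo)/${1}bar/smx';
my $re   = PPIx::Regexp->new( $expr )
    or die 'Parse failed: ', PPIx::Regexp->errstr();

# Fetch each part by calling the accessor of the same name.
my @order = qw{ type regular_expression replacement modifier };
my %part  = map { $_ => $re->$_()->content() } @order;
printf "%-18s %s\n", $_, $part{$_} for @order;

# The parts, in order, should reproduce the original string.
my $rebuilt = join '', @part{ @order };
print $rebuilt eq $expr ? "parts reassemble\n" : "parts differ\n";
```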
445
RESTRICTIONS
447 By the nature of this module, it is never going to get everything
448 right. Many of the known problem areas involve interpolations one way
449 or another.
450
451 Ambiguous Syntax
452 Perl's regular expressions contain cases where the syntax is ambiguous.
453 A particularly egregious example is an interpolation followed by square
454 or curly brackets, for example $foo[...]. There is nothing in the
455 syntax to say whether the programmer wanted to interpolate an element
456 of array @foo, or whether he wanted to interpolate scalar $foo, and
457 then follow that interpolation by a character class.
458
459 The perlop documentation notes that in this case what Perl does is to
460 guess. That is, it employs various heuristics on the code to try to
461 figure out what the programmer wanted. These heuristics are documented
462 as being undocumented (!) and subject to change without notice. As an
463 example of the problems even perl faces in parsing Perl, see
464 <https://github.com/perl/perl5/issues/16478>.
465
466 Given this situation, this module's chances of duplicating every Perl
467 version's interpretation of every regular expression are pretty much
468 nil. What it does now is to assume that square brackets containing
469 only an integer or an interpolation represent a subscript; otherwise
470 they represent a character class. Similarly, curly brackets containing
471 only a bareword or an interpolation are a subscript; otherwise they
472 represent a quantifier.
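
The effect of this heuristic can be observed by comparing two nearly identical expressions. The sketch below assumes that find_first() accepts a full class name, as it does in PPI, and that interpolations are represented by PPIx::Regexp::Token::Interpolation; the %seen hash is illustrative.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my %seen;
foreach my $expr ( '/$foo[1]/', '/$foo[a-z]/' ) {
    my $re = PPIx::Regexp->new( $expr )
        or die 'Parse failed: ', PPIx::Regexp->errstr();

    # Find the first interpolation and see how much text it claimed.
    my $interp = $re->find_first( 'PPIx::Regexp::Token::Interpolation' );
    $seen{$expr} = $interp ? $interp->content() : '';
    printf "%-14s interpolates %s\n", $expr, $seen{$expr};
}
```

If the heuristic behaves as described, the integer in brackets is absorbed into the interpolation as a subscript, while "[a-z]" is left to be parsed as a character class.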
473
474 Changes in Syntax
475 Sometimes the introduction of new syntax changes the way a regular
476 expression is parsed. For example, the "\v" character class was
477 introduced in Perl 5.9.5. But it did not represent a syntax error prior
478 to that version of Perl, it was simply parsed as "v". So
479
480 $ perl -le 'print "v" =~ m/\v/ ? "yes" : "no"'
481
482 prints "yes" under Perl 5.8.9, but "no" under 5.10.0. "PPIx::Regexp"
483 generally assumes the more modern parse in cases like this.
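
One way to see which Perl version the modern parse implies is to ask each token when it was introduced. This sketch assumes that find() accepts a class name and returns an array reference (or a false value if nothing is found), as in PPI; the output format is illustrative.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $re = PPIx::Regexp->new( 'm/\v/' )
    or die 'Parse failed: ', PPIx::Regexp->errstr();

# Collect every token in the parse; fall back to an empty list.
my $tokens = $re->find( 'PPIx::Regexp::Token' ) || [];

foreach my $tok ( @{ $tokens } ) {
    my $removed = $tok->perl_version_removed();
    printf "%-4s introduced %s, removed %s\n",
        $tok->content(),
        $tok->perl_version_introduced(),
        defined $removed ? $removed : 'never';
}
```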
484
485 Equivocation
486 Very occasionally, a construction will be removed and then added back
487 -- and then, conceivably, removed again. In this case, the plan is for
488 perl_version_introduced() to return the earliest version in which the
489 construction appeared, and perl_version_removed() to return the version
490 after the last version in which it appeared (whether production or
491 development), or "undef" if it is in the highest-numbered Perl.
492
493 The constructions involved in this are:
494
495 Un-escaped literal left curly after literal
496
497 That is, something like "qr<x{>".
498
499 This was made an error in 5.25.1, and it was an error in 5.26.0. But
500 it became a warning again in 5.27.1. The perl5271delta says it was re-
501 instated because the changes broke GNU Autoconf, and the warning
502 message says it will be removed in Perl 5.30.
503
504 Accordingly, perl_version_introduced() returns 5.0. At the moment
505 perl_version_removed() returns '5.025001'. But if it is present with or
506 without warning in 5.28, perl_version_removed() will become "undef". If
you need finer resolution than this, see the PPIx::Regexp::Element
methods accepts_perl() and requirements_for_perl().
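
A tentative sketch of how these two methods might be queried follows. It assumes that accepts_perl() takes a Perl version number such as '5.026' and returns a true or false value, and that requirements_for_perl() returns a descriptive string; both assumptions should be checked against the PPIx::Regexp::Element documentation. The version numbers chosen and the @rows array are illustrative.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $re = PPIx::Regexp->new( 'qr<x{>' )
    or die 'Parse failed: ', PPIx::Regexp->errstr();

# One row per token: content, requirements, and two acceptance checks.
my @rows;
foreach my $tok ( @{ $re->find( 'PPIx::Regexp::Token' ) || [] } ) {
    push @rows, [
        $tok->content(),
        $tok->requirements_for_perl(),
        map { $tok->accepts_perl( $_ ) ? 'yes' : 'no' } qw{ 5.026 5.030 },
    ];
}

printf "%-4s %-40s 5.26: %-3s 5.30: %s\n", @{ $_ } for @rows;
```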
510
511 Static Parsing
512 It is well known that Perl can not be statically parsed. That is, you
513 can not completely parse a piece of Perl code without executing that
514 same code.
515
516 Nevertheless, this class is trying to statically parse regular
517 expressions. The main problem with this is that there is no way to know
518 what is being interpolated into the regular expression by an
519 interpolated variable. This is a problem because the interpolated value
520 can change the interpretation of adjacent elements.
521
522 This module deals with this by making assumptions about what is in an
523 interpolated variable. These assumptions will not be enumerated here,
but in general the principle is to assume the interpolated value does
525 not change the interpretation of the regular expression. For example,
526
527 my $foo = 'a-z]';
528 my $re = qr{[$foo};
529
530 is fine with the Perl interpreter, but will confuse the dickens out of
531 this module. Similarly and more usefully, something like
532
533 my $mods = 'i';
534 my $re = qr{(?$mods:foo)};
535
536 or maybe
537
538 my $mods = 'i';
539 my $re = qr{(?$mods)$foo};
540
541 probably sets a modifier of some sort, and that is how this module
542 interprets it. If the interpolation is not about modifiers, this module
543 will get it wrong. Another such semi-benign example is
544
545 my $foo = $] >= 5.010 ? '?<foo>' : '';
546 my $re = qr{($foo\w+)};
547
548 which will parse, but this module will never realize that it might be
549 looking at a named capture.
550
551 Non-Standard Syntax
552 There are modules out there that alter the syntax of Perl. If the
553 syntax of a regular expression is altered, this module has no way to
554 understand that it has been altered, much less to adapt to the
555 alteration. The following modules are known to cause problems:
556
557 Acme::PerlML, which renders Perl as XML.
558
559 "Data::PostfixDeref", which causes Perl to interpret suffixed empty
560 brackets as dereferencing the thing they suffix. This module by Ben
561 Morrow ("BMORROW") appears to have been retracted.
562
563 Filter::Trigraph, which recognizes ANSI C trigraphs, allowing Perl to
564 be written in the ISO 646 character set.
565
566 Perl6::Pugs. Enough said.
567
568 Perl6::Rules, which back-ports some of the Perl 6 regular expression
569 syntax to Perl 5.
570
571 Regexp::Extended, which extends regular expressions in various ways,
572 some of which seem to conflict with Perl 5.010.
573
SEE ALSO
575 Regexp::Parsertron, which uses Marpa::R2 to parse the regexp, and Tree
for navigation. Unlike "PPIx::Regexp", Regexp::Parsertron
577 supports modification of the parse tree.
578
579 Regexp::Parser, which parses a bare regular expression (without
580 enclosing "qr{}", "m//", or whatever) and uses a different navigation
581 model. After a long hiatus, this module has been adopted, and is again
582 supported.
583
584 YAPE::Regex, which provides the parse tree, and has a mechanism to
subclass the various element classes for customization. The most recent
release dates from 2011, but the CPAN testers results are still all green.
587 Companion module YAPE::Regex::Explain says what the various pieces of a
588 regex do, though constructs added in perl 5.10 and later are not
589 supported. I have no idea how I missed this when I originally went
590 looking for "Regexp" parsers.
591
SUPPORT
593 Support is by the author. Please file bug reports at
594 <https://rt.cpan.org/Public/Dist/Display.html?Name=PPIx-Regexp>,
595 <https://github.com/trwyant/perl-PPIx-Regexp/issues>, or in electronic
596 mail to the author.
597
AUTHOR
599 Thomas R. Wyant, III wyant at cpan dot org
600
COPYRIGHT AND LICENSE
602 Copyright (C) 2009-2022 by Thomas R. Wyant, III
603
604 This program is free software; you can redistribute it and/or modify it
605 under the same terms as Perl 5.10.0. For more details, see the full
606 text of the licenses in the directory LICENSES.
607
608 This program is distributed in the hope that it will be useful, but
609 without any warranty; without even the implied warranty of
610 merchantability or fitness for a particular purpose.
611
612
613
perl v5.36.0 2022-07-22 PPIx::Regexp(3)