PPIx::Regexp(3) User Contributed Perl Documentation PPIx::Regexp(3)
2
3
4
NAME
6 PPIx::Regexp - Represent a regular expression of some sort
7
SYNOPSIS
9 use PPIx::Regexp;
10 use PPIx::Regexp::Dumper;
11 my $re = PPIx::Regexp->new( 'qr{foo}smx' );
12 PPIx::Regexp::Dumper->new( $re )
13 ->print();
14
DEPRECATION NOTICE
16 The postderef argument to new() is being put through a deprecation
17 cycle and retracted. After the retraction, postfix dereferences will
18 always be recognized. This is the default behaviour now.
19
Starting with version 0.074_01, the first use of this argument will
warn. With version 0.079_01, all uses will warn. With version 0.080_01,
all uses will become fatal. With the first release on or after April 15
2022 all mention of this argument will be removed.
24
INHERITANCE
26 "PPIx::Regexp" is a PPIx::Regexp::Node.
27
28 "PPIx::Regexp" has no descendants.
29
DESCRIPTION
31 The purpose of the PPIx-Regexp package is to parse regular expressions
32 in a manner similar to the way the PPI package parses Perl. This class
33 forms the root of the parse tree, playing a role similar to
34 PPI::Document.
35
36 This package shares with PPI the property of being round-trip safe.
37 That is,
38
    my $expr = 's/ ( \d+ ) ( \D+ ) /$2$1/smxg';
    my $re = PPIx::Regexp->new( $expr );
    print $re->content() eq $expr ? "yes\n" : "no\n";
42
43 should print 'yes' for any valid regular expression.
44
45 Navigation is similar to that provided by PPI. That is to say, things
46 like "children", "find_first", "snext_sibling" and so on all work
47 pretty much the same way as in PPI.
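
As a minimal sketch of this navigation, the following walks the top-level
children of a parse and then looks up the first literal token. It uses only
the methods named above; the class name PPIx::Regexp::Token::Literal is the
distribution's literal-token class, and the exact children printed depend
on the installed version, so no output is shown.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $re = PPIx::Regexp->new( 'qr{(foo)bar}smx' )
    or die 'Parse failed: ', PPIx::Regexp->errstr() // 'unknown error';

# Top-level children of the root object, with their classes.
foreach my $child ( $re->children() ) {
    printf "%s\t%s\n", ref $child, $child->content();
}

# First literal token anywhere in the tree, found by class name as in PPI.
my $literal = $re->find_first( 'PPIx::Regexp::Token::Literal' );
if ( $literal ) {
    print 'First literal: ', $literal->content(), "\n";

    # Next significant sibling, if there is one.
    my $next = $literal->snext_sibling();
    print 'Followed by: ', $next->content(), "\n" if $next;
}
```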
48
49 The class hierarchy is also similar to PPI. Except for some utility
50 classes (the dumper, the lexer, and the tokenizer) all classes are
51 descended from PPIx::Regexp::Element, which provides basic navigation.
52 Tokens are descended from PPIx::Regexp::Token, which provides content.
53 All containers are descended from PPIx::Regexp::Node, which provides
54 for children, and all structure elements are descended from
55 PPIx::Regexp::Structure, which provides beginning and ending
56 delimiters, and a type.
57
58 There are two features of PPI that this package does not provide -
59 mutability and operator overloading. There are no plans for serious
60 mutability, though something like PPI's "prune" functionality might be
61 considered. Similarly there are no plans for operator overloading,
62 which appears to the author to represent a performance hit for little
63 tangible gain.
64
NOTICE
66 The author will attempt to preserve the documented interface, but if
67 the interface needs to change to correct some egregiously bad design or
68 implementation decision, then it will change. Any incompatible changes
69 will go through a deprecation cycle.
70
71 The goal of this package is to parse well-formed regular expressions
72 correctly. A secondary goal is not to blow up on ill-formed regular
73 expressions. The correct identification and characterization of ill-
74 formed regular expressions is not a goal of this package, nor is the
75 consistent parsing of ill-formed regular expressions from release to
76 release.
77
78 This policy attempts to track features in development releases as well
79 as public releases. However, features added in a development release
80 and then removed before the next production release will not be
81 tracked, and any functionality relating to such features will be
82 removed. The issue here is the potential re-use (with different
83 semantics) of syntax that did not make it into the production release.
84
85 From time to time the Perl regular expression engine changes in ways
86 that change the parse of a given regular expression. When these changes
87 occur, "PPIx::Regexp" will be changed to produce the more modern parse.
88 Known examples of this include:
89
90 $( no longer interpolates as of Perl 5.005, per "perl5005delta".
91 Newer Perls seem to parse this as "qr{$}" (i.e. an end-of-string or
92 newline assertion) followed by an open parenthesis, and that is
93 what "PPIx::Regexp" does.
94
95 $) and $| also seem to parse as the "$" assertion
96 followed by the relevant meta-character, though I have no
97 documentation reference for this.
98
99 "@+" and "@-" no longer interpolate as of Perl 5.9.4
100 per "perl594delta". Subsequent Perls treat "@+" as a quantified
101 literal and "@-" as two literals, and that is what "PPIx::Regexp"
102 does. Note that subscripted references to these arrays do
103 interpolate, and are so parsed by "PPIx::Regexp".
104
105 Only space and horizontal tab are whitespace as of Perl 5.23.4
106 when inside a bracketed character class inside an extended
107 bracketed character class, per "perl5234delta". Formerly any white
108 space character parsed as whitespace. This change in "PPIx::Regexp"
109 will be reverted if the change in Perl does not make it into Perl
110 5.24.0.
111
112 Unescaped literal left curly brackets
113 These are being removed in positions where quantifiers are legal,
114 so that they can be used for new functionality. Some of them are
115 gone in 5.25.1, others will be removed in a future version of Perl.
116 In situations where they have been removed, perl_version_removed()
117 will return the version in which they were removed. When the new
118 functionality appears, the parse produced by this software will
119 reflect the new functionality.
120
121 NOTE that the situation with a literal left curly after a literal
122 character is complicated. It was made an error in Perl 5.25.1, and
123 remained so through all 5.26 releases, but became a warning again
124 in 5.27.1 due to its use in GNU Autoconf. Whether it will ever
125 become illegal again is not clear to me based on the contents of
126 perl5271delta. At the moment perl_version_removed() returns
127 "undef", but obviously that is not the whole story, and methods
128 accepts_perl() and requirements_for_perl() were introduced to deal
129 with this complication.
130
131 "\o{...}"
132 is parsed as the octal equivalent of "\x{...}". This is its meaning
133 as of perl 5.13.2. Before 5.13.2 it was simply literal 'o' and so
134 on.
135
136 "x{,3}"
137 (with first count omitted) is allowed as a quantifier as of Perl
138 5.33.6. The previous parse made this all literals.
139
140 "x{ 0 , 3 }"
141 (with spaces inside but adjacent to curly brackets, or around the
142 comma if any) is allowed as a quantifier as of Perl 5.33.6. The
143 previous parse made this all literals.
144
145 There are very probably other examples of this. When they come to light
146 they will be documented as producing the modern parse, and the code
147 modified to produce this parse if necessary.
148
METHODS
150 This class provides the following public methods. Methods not
151 documented here are private, and unsupported in the sense that the
152 author reserves the right to change or remove them without notice.
153
154 new
155 my $re = PPIx::Regexp->new('/foo/');
156
157 This method instantiates a "PPIx::Regexp" object from a string, a
158 PPI::Token::QuoteLike::Regexp, a PPI::Token::Regexp::Match, or a
159 PPI::Token::Regexp::Substitute. Honestly, any PPI::Element will work,
160 but only the three Regexp classes mentioned previously are likely to do
161 anything useful.
162
163 Whatever form the argument takes, it is assumed to consist entirely of
164 a valid match, substitution, or "qr<>" string.
165
166 Optionally you can pass one or more name/value pairs after the regular
167 expression. The possible options are:
168
169 default_modifiers array_reference
170 This option specifies a reference to an array of default modifiers
171 to apply to the regular expression being parsed. Each modifier is
172 specified as a string. Any actual modifiers found supersede the
173 defaults.
174
175 When applying the defaults, '?' and '/' are completely ignored, and
176 '^' is ignored unless it occurs at the beginning of the modifier.
177 The first dash ('-') causes subsequent modifiers to be negated.
178
179 So, for example, if you wish to produce a "PPIx::Regexp" object
180 representing the regular expression in
181
182 use re '/smx';
183 {
184 no re '/x';
185 m/ foo /;
186 }
187
188 you would (after some help from PPI in finding the relevant
189 statements), do something like
190
191 my $re = PPIx::Regexp->new( 'm/ foo /',
192 default_modifiers => [ '/smx', '-/x' ] );
193
194 encoding name
195 This option specifies the encoding of the regular expression. This
196 is passed to the tokenizer, which will "decode" the regular
197 expression string before it tokenizes it. For example:
198
199 my $re = PPIx::Regexp->new( '/foo/',
200 encoding => 'iso-8859-1',
201 );
202
203 index_locations Boolean
204 This Boolean option specifies whether the locations of the elements
205 in the regular expression should be indexed.
206
207 If unspecified or specified as "undef" a default value is used.
208 This default is true if the argument is a PPI::Element or the
209 "location" option was specified. Otherwise the default is false.
210
211 location array_reference
212 This option specifies the location of the new object in the
213 document from which it was created. It is a reference to a five-
214 element array compatible with that returned by the "location()"
215 method of PPI::Element.
216
217 If not specified, the location of the original string is used if it
218 was specified as a PPI::Element.
219
220 If no location can be determined, the various "location()" methods
221 will return "undef".
222
223 postderef Boolean
224 THIS ARGUMENT IS DEPRECATED. See DEPRECATION NOTICE above for the
225 details.
226
227 This option is passed on to the tokenizer, where it specifies
228 whether postfix dereferences are recognized in interpolations and
229 code. This experimental feature was introduced in Perl 5.19.5.
230
231 As of version 0.074_01, the default is true. Through release
232 0.074, the default was the value of
233 $PPIx::Regexp::Tokenizer::DEFAULT_POSTDEREF, which was true. When
234 originally introduced this was false, but was documented as
235 becoming true when and if postfix dereferencing became mainstream.
236 The intent to mainstream was announced with Perl 5.23.1, and
237 became official (so to speak) with Perl 5.24.0, so the default
238 became true with PPIx::Regexp 0.049_01.
239
240 Note that if PPI starts unconditionally recognizing postfix
241 dereferences, this argument will immediately become ignored, and
242 will be put through a deprecation cycle and removed.
243
244 strict Boolean
245 This option is passed on to the tokenizer and lexer, where it
246 specifies whether the parse should assume "use re 'strict'" is in
247 effect.
248
249 The 'strict' pragma was introduced in Perl 5.22, and its
250 documentation says that it is experimental, and that there is no
251 commitment to backward compatibility. The same applies to the parse
252 produced when this option is asserted. Also, the usual caveat
253 applies: if "use re 'strict'" ends up being retracted, this option
254 and all related functionality will be also.
255
256 Given the nature of "use re 'strict'", you should expect that if
257 you assert this option, regular expressions that previously parsed
258 without error might no longer do so. If an element ends up being
259 declared an error because this option is set, its
260 "perl_version_introduced()" will be the Perl version at which "use
261 re 'strict'" started rejecting these elements.
262
263 The default is false.
264
265 trace number
266 If greater than zero, this option causes trace output from the
267 parse. The author reserves the right to change or eliminate this
268 without notice.
269
270 Passing optional input other than the above is not an error, but
271 neither is it supported.
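
As a hedged illustration of the "location" option described above, this
sketch supplies a five-element location by hand and reads it back. The
line and column values are made up for the example, and the element order
is assumed to follow the PPI convention of line, character, column,
logical line, logical file name.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Pretend the regular expression was found at line 42, column 9 of
# some document. These numbers are purely illustrative.
my $re = PPIx::Regexp->new( 'm/ foo /smx',
    location => [ 42, 9, 9, 42, undef ],
) or die 'Parse failed: ', PPIx::Regexp->errstr() // 'unknown error';

# Specifying "location" makes index_locations default to true, so
# location() should return an array reference rather than undef.
my $loc = $re->location();
print defined $loc
    ? "line $loc->[0], column $loc->[2]\n"
    : "no location available\n";
```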
272
273 new_from_cache
274 This static method wraps "new" in a caching mechanism. Only one object
275 will be generated for a given PPI::Element, no matter how many times
276 this method is called. Calls after the first for a given PPI::Element
277 simply return the same "PPIx::Regexp" object.
278
279 When the "PPIx::Regexp" object is returned from cache, the values of
280 the optional arguments are ignored.
281
282 Calls to this method with the regular expression in a string rather
283 than a PPI::Element will not be cached.
284
285 Caveat: This method is provided for code like Perl::Critic which might
286 instantiate the same object multiple times. The cache will persist
287 until "flush_cache" is called.
288
289 flush_cache
290 $re->flush_cache(); # Remove $re from cache
291 PPIx::Regexp->flush_cache(); # Empty the cache
292
293 This method flushes the cache used by "new_from_cache". If called as a
294 static method with no arguments, the entire cache is emptied. Otherwise
295 any objects specified are removed from the cache.
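
The caching behavior can be sketched as follows. The one-line document is
invented for the example, and Scalar::Util's refaddr() is used to compare
object identity since this package provides no operator overloading.

```perl
use strict;
use warnings;
use PPI;
use PPIx::Regexp;
use Scalar::Util qw{ refaddr };

# A throwaway document containing one match operator.
my $code = 'my $ok = $text =~ m/ foo /smx;';
my $doc  = PPI::Document->new( \$code )
    or die 'PPI could not parse the example code';
my $elem = $doc->find_first( 'PPI::Token::Regexp::Match' )
    or die 'No match operator found';

# Two requests for the same PPI::Element should yield one object.
my $first  = PPIx::Regexp->new_from_cache( $elem );
my $second = PPIx::Regexp->new_from_cache( $elem );
print refaddr( $first ) == refaddr( $second )
    ? "same object\n"
    : "different objects\n";

# The cache persists until flushed, so empty it when done.
PPIx::Regexp->flush_cache();
```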
296
297 capture_names
298 foreach my $name ( $re->capture_names() ) {
299 print "Capture name '$name'\n";
300 }
301
302 This convenience method returns the capture names found in the regular
303 expression.
304
305 This method is equivalent to
306
307 $self->regular_expression()->capture_names();
308
309 except that if "$self->regular_expression()" returns "undef" (meaning
310 that something went terribly wrong with the parse) this method will
311 simply return.
312
313 delimiters
314 print join("\t", PPIx::Regexp->new('s/foo/bar/')->delimiters());
315 # prints '// //'
316
317 When called in list context, this method returns either one or two
318 strings, depending on whether the parsed expression has a replacement
319 string. In the case of non-bracketed substitutions, the start delimiter
320 of the replacement string is considered to be the same as its finish
321 delimiter, as illustrated by the above example.
322
323 When called in scalar context, you get the delimiters of the regular
324 expression; that is, element 0 of the array that is returned in list
325 context.
326
327 Optionally, you can pass an index value and the corresponding
328 delimiters will be returned; index 0 represents the regular
329 expression's delimiters, and index 1 represents the replacement
330 string's delimiters, which may be undef. For example,
331
332 print PPIx::Regexp->new('s{foo}<bar>')->delimiters(1);
333 # prints '<>'
334
335 If the object was not initialized with a valid regexp of some sort, the
336 results of this method are undefined.
337
338 errstr
339 This static method returns the error string from the most recent
340 attempt to instantiate a "PPIx::Regexp". It will be "undef" if the most
341 recent attempt succeeded.
342
343 extract_regexps
344 my $doc = PPI::Document->new( $path );
345 $doc->index_locations();
    my @res = PPIx::Regexp->extract_regexps( $doc );
347
348 This convenience (well, sort-of) static method takes as its argument a
349 PPI::Document object and returns "PPIx::Regexp" objects corresponding
350 to all regular expressions found in it, in the order in which they
351 occur in the document. You will need to keep a reference to the
352 original PPI::Document object if you wish to be able to recover the
353 original PPI::Element objects via the PPIx::Regexp source() method.
354
355 failures
356 print "There were ", $re->failures(), " parse failures\n";
357
358 This method returns the number of parse failures. This is a count of
359 the number of unknown tokens plus the number of unterminated structures
360 plus the number of unmatched right brackets of any sort.
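
A sketch of how errstr() and failures() might be combined when checking
input. It assumes that new() returns a false value when instantiation
fails, which is what the existence of errstr() suggests. The second
string is deliberately ill-formed; how many failures it reports is not
guaranteed from release to release.

```perl
use strict;
use warnings;
use PPIx::Regexp;

foreach my $string ( 'qr{ (foo) }smx', 'qr{ (foo }smx' ) {
    my $re = PPIx::Regexp->new( $string );
    if ( ! $re ) {
        # Instantiation itself failed; errstr() says why.
        warn "$string: ", PPIx::Regexp->errstr() // 'unknown error', "\n";
        next;
    }
    # An object was built, but the parse may still contain failures.
    my $failures = $re->failures();
    print "$string: ",
        $failures ? "$failures parse failure(s)" : 'parsed cleanly',
        "\n";
}
```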
361
362 max_capture_number
363 print "Highest used capture number ",
364 $re->max_capture_number(), "\n";
365
366 This convenience method returns the highest capture number used by the
367 regular expression. If there are no captures, the return will be 0.
368
369 This method is equivalent to
370
371 $self->regular_expression()->max_capture_number();
372
373 except that if "$self->regular_expression()" returns "undef" (meaning
374 that something went terribly wrong with the parse) this method will
375 too.
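
For example, the two capture-related convenience methods can be used
together as in this sketch. The regular expression is invented for the
illustration, and the names are sorted because no particular order is
assumed for the return from capture_names().

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Two named captures and one numbered capture.
my $re = PPIx::Regexp->new(
    'm{ (?<year> \d+ ) - (?<month> \d+ ) - ( \d+ ) }smx' )
    or die 'Parse failed: ', PPIx::Regexp->errstr() // 'unknown error';

my @names = $re->capture_names();
@names = sort @names;
print "Named captures: @names\n";

my $max = $re->max_capture_number();
print "Highest capture number: $max\n";
```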
376
377 modifier
378 my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
379 print $re->modifier()->content(), "\n";
380 # prints 'smx'.
381
382 This method retrieves the modifier of the object. This comes from the
383 end of the initializing string or object and will be a
384 PPIx::Regexp::Token::Modifier.
385
386 Note that this object represents the actual modifiers present on the
387 regexp, and does not take into account any that may have been applied
388 by default (i.e. via the "default_modifiers" argument to "new()"). For
389 something that takes account of default modifiers, see
390 modifier_asserted(), below.
391
392 In the event of a parse failure, there may not be a modifier present,
393 in which case nothing is returned.
394
395 modifier_asserted
396 my $re = PPIx::Regexp->new( '/ . /',
397 default_modifiers => [ 'smx' ] );
398 print $re->modifier_asserted( 'x' ) ? "yes\n" : "no\n";
399 # prints 'yes'.
400
401 This method returns true if the given modifier is asserted for the
402 regexp, whether explicitly or by the modifiers passed in the
403 "default_modifiers" argument.
404
405 Starting with version 0.036_01, if the argument is a single-character
406 modifier followed by an asterisk (intended as a wild card character),
407 the return is the number of times that modifier appears. In this case
408 an exception will be thrown if you specify a multi-character modifier
409 (e.g. 'ee*'), or if you specify one of the match semantics modifiers
410 (e.g. 'a*').
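
A short sketch of both forms of the argument, assuming a version of this
package recent enough to support the wild card and to recognize the
doubled "x" modifier. The regular expression and the default modifier are
illustrative only.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $re = PPIx::Regexp->new( 'm/ foo /xx',
    default_modifiers => [ 'i' ],
) or die 'Parse failed: ', PPIx::Regexp->errstr() // 'unknown error';

# Plain form: true or false, taking the defaults into account.
my $case_blind = $re->modifier_asserted( 'i' );
print 'Case-insensitive: ', $case_blind ? "yes\n" : "no\n";

# Wild card form: how many times the modifier appears.
my $x_count = $re->modifier_asserted( 'x*' );
print "Number of x modifiers: $x_count\n";
```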
411
412 regular_expression
413 my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
414 print $re->regular_expression()->content(), "\n";
415 # prints '/(foo)/'.
416
417 This method returns that portion of the object which actually
418 represents a regular expression.
419
420 replacement
421 my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
422 print $re->replacement()->content(), "\n";
423 # prints '${1}bar/'.
424
425 This method returns that portion of the object which represents the
426 replacement string. This will be "undef" unless the regular expression
427 actually has a replacement string. Delimiters will be included, but
428 there will be no beginning delimiter unless the regular expression was
429 bracketed.
430
431 source
432 my $source = $re->source();
433
434 This method returns the object or string that was used to instantiate
435 the object.
436
437 type
438 my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
439 print $re->type()->content(), "\n";
440 # prints 's'.
441
442 This method retrieves the type of the object. This comes from the
443 beginning of the initializing string or object, and will be a
444 PPIx::Regexp::Token::Structure whose "content" is one of 's', 'm',
445 'qr', or ''.
446
RESTRICTIONS
448 By the nature of this module, it is never going to get everything
449 right. Many of the known problem areas involve interpolations one way
450 or another.
451
452 Ambiguous Syntax
453 Perl's regular expressions contain cases where the syntax is ambiguous.
454 A particularly egregious example is an interpolation followed by square
455 or curly brackets, for example $foo[...]. There is nothing in the
456 syntax to say whether the programmer wanted to interpolate an element
457 of array @foo, or whether he wanted to interpolate scalar $foo, and
458 then follow that interpolation by a character class.
459
460 The perlop documentation notes that in this case what Perl does is to
461 guess. That is, it employs various heuristics on the code to try to
462 figure out what the programmer wanted. These heuristics are documented
463 as being undocumented (!) and subject to change without notice. As an
464 example of the problems even perl faces in parsing Perl, see
465 <https://github.com/perl/perl5/issues/16478>.
466
467 Given this situation, this module's chances of duplicating every Perl
468 version's interpretation of every regular expression are pretty much
469 nil. What it does now is to assume that square brackets containing
470 only an integer or an interpolation represent a subscript; otherwise
471 they represent a character class. Similarly, curly brackets containing
472 only a bareword or an interpolation are a subscript; otherwise they
473 represent a quantifier.
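
The square-bracket heuristic just described can be observed with a sketch
like the following, which asks what was taken as the interpolation in each
case. The class names PPIx::Regexp::Token::Interpolation and
PPIx::Regexp::Structure::CharClass come from this distribution. The
variable name is arbitrary.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Single quotes keep Perl itself from interpolating $foo here.
foreach my $string ( 'm/$foo[1]/', 'm/$foo[a-z]/' ) {
    my $re = PPIx::Regexp->new( $string )
        or die 'Parse failed: ', PPIx::Regexp->errstr() // 'unknown error';

    my $interp = $re->find_first( 'PPIx::Regexp::Token::Interpolation' );
    my $class  = $re->find_first( 'PPIx::Regexp::Structure::CharClass' );

    printf "%-14s interpolation: %-10s character class: %s\n",
        $string,
        $interp ? $interp->content() : '(none)',
        $class ? 'yes' : 'no';
}
```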
474
475 Changes in Syntax
476 Sometimes the introduction of new syntax changes the way a regular
477 expression is parsed. For example, the "\v" character class was
478 introduced in Perl 5.9.5. But it did not represent a syntax error prior
479 to that version of Perl, it was simply parsed as "v". So
480
481 $ perl -le 'print "v" =~ m/\v/ ? "yes" : "no"'
482
483 prints "yes" under Perl 5.8.9, but "no" under 5.10.0. "PPIx::Regexp"
484 generally assumes the more modern parse in cases like this.
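
One way to see which parse was chosen is to ask the element when it
became available. This sketch assumes that "\v" is represented by the
distribution's PPIx::Regexp::Token::CharClass::Simple class; the version
string returned is not shown because it is whatever the installed release
reports.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Single quotes leave the backslash in place.
my $re = PPIx::Regexp->new( 'm/\v/' )
    or die 'Parse failed: ', PPIx::Regexp->errstr() // 'unknown error';

my $elem = $re->find_first( 'PPIx::Regexp::Token::CharClass::Simple' );
if ( $elem ) {
    print $elem->content(), ' parsed as a character class, introduced in Perl ',
        $elem->perl_version_introduced(), "\n";
} else {
    print "No simple character class found\n";
}
```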
485
486 Equivocation
487 Very occasionally, a construction will be removed and then added back
488 -- and then, conceivably, removed again. In this case, the plan is for
489 perl_version_introduced() to return the earliest version in which the
490 construction appeared, and perl_version_removed() to return the version
491 after the last version in which it appeared (whether production or
492 development), or "undef" if it is in the highest-numbered Perl.
493
494 The constructions involved in this are:
495
496 Un-escaped literal left curly after literal
497
498 That is, something like "qr<x{>".
499
This was made an error in 5.25.1, and it was an error in 5.26.0. But
it became a warning again in 5.27.1. The perl5271delta says it was
reinstated because the changes broke GNU Autoconf, and the warning
message says it will be removed in Perl 5.30.
504
505 Accordingly, perl_version_introduced() returns 5.0. At the moment
506 perl_version_removed() returns '5.025001'. But if it is present with or
507 without warning in 5.28, perl_version_removed() will become "undef". If
you need finer resolution than this, see the PPIx::Regexp::Element methods
accepts_perl() and requirements_for_perl().
511
512 Static Parsing
It is well known that Perl cannot be statically parsed. That is, you
cannot completely parse a piece of Perl code without executing that
same code.
516
517 Nevertheless, this class is trying to statically parse regular
518 expressions. The main problem with this is that there is no way to know
519 what is being interpolated into the regular expression by an
520 interpolated variable. This is a problem because the interpolated value
521 can change the interpretation of adjacent elements.
522
This module deals with this by making assumptions about what is in an
interpolated variable. These assumptions will not be enumerated here,
but in general the principle is to assume the interpolated value does
not change the interpretation of the regular expression. For example,
527
528 my $foo = 'a-z]';
529 my $re = qr{[$foo};
530
531 is fine with the Perl interpreter, but will confuse the dickens out of
532 this module. Similarly and more usefully, something like
533
534 my $mods = 'i';
535 my $re = qr{(?$mods:foo)};
536
537 or maybe
538
539 my $mods = 'i';
540 my $re = qr{(?$mods)$foo};
541
542 probably sets a modifier of some sort, and that is how this module
543 interprets it. If the interpolation is not about modifiers, this module
544 will get it wrong. Another such semi-benign example is
545
546 my $foo = $] >= 5.010 ? '?<foo>' : '';
547 my $re = qr{($foo\w+)};
548
549 which will parse, but this module will never realize that it might be
550 looking at a named capture.
551
552 Non-Standard Syntax
553 There are modules out there that alter the syntax of Perl. If the
554 syntax of a regular expression is altered, this module has no way to
555 understand that it has been altered, much less to adapt to the
556 alteration. The following modules are known to cause problems:
557
558 Acme::PerlML, which renders Perl as XML.
559
560 "Data::PostfixDeref", which causes Perl to interpret suffixed empty
561 brackets as dereferencing the thing they suffix. This module by Ben
562 Morrow ("BMORROW") appears to have been retracted.
563
564 Filter::Trigraph, which recognizes ANSI C trigraphs, allowing Perl to
565 be written in the ISO 646 character set.
566
567 Perl6::Pugs. Enough said.
568
569 Perl6::Rules, which back-ports some of the Perl 6 regular expression
570 syntax to Perl 5.
571
572 Regexp::Extended, which extends regular expressions in various ways,
573 some of which seem to conflict with Perl 5.010.
574
SEE ALSO
Regexp::Parsertron, which uses Marpa::R2 to parse the regexp, and Tree
for navigation. Unlike "PPIx::Regexp", Regexp::Parsertron
supports modification of the parse tree.
579
580 Regexp::Parser, which parses a bare regular expression (without
581 enclosing "qr{}", "m//", or whatever) and uses a different navigation
582 model. After a long hiatus, this module has been adopted, and is again
583 supported.
584
YAPE::Regex, which provides the parse tree, and has a mechanism to
subclass the various element classes for customization. The most recent
release dates from 2011, but the CPAN testers results are still all green.
Companion module YAPE::Regex::Explain says what the various pieces of a
regex do, though constructs added in perl 5.10 and later are not
supported. I have no idea how I missed this when I originally went
looking for "Regexp" parsers.
592
SUPPORT
594 Support is by the author. Please file bug reports at
595 <https://rt.cpan.org/Public/Dist/Display.html?Name=PPIx-Regexp>,
596 <https://github.com/trwyant/perl-PPIx-Regexp/issues>, or in electronic
597 mail to the author.
598
AUTHOR
600 Thomas R. Wyant, III wyant at cpan dot org
601
COPYRIGHT AND LICENSE
603 Copyright (C) 2009-2022 by Thomas R. Wyant, III
604
605 This program is free software; you can redistribute it and/or modify it
606 under the same terms as Perl 5.10.0. For more details, see the full
607 text of the licenses in the directory LICENSES.
608
609 This program is distributed in the hope that it will be useful, but
610 without any warranty; without even the implied warranty of
611 merchantability or fitness for a particular purpose.
612
613
614
perl v5.34.1 2022-03-22 PPIx::Regexp(3)