PPIx::Regexp(3) User Contributed Perl Documentation PPIx::Regexp(3)

NAME
PPIx::Regexp - Represent a regular expression of some sort

SYNOPSIS
    use PPIx::Regexp;
    use PPIx::Regexp::Dumper;
    my $re = PPIx::Regexp->new( 'qr{foo}smx' );
    PPIx::Regexp::Dumper->new( $re )
        ->print();

DEPRECATION NOTICE
The postderef argument to new() is being put through a deprecation
cycle and retracted. After the retraction, postfix dereferences will
always be recognized. This is the default behaviour now.

Starting with the first release after October 1, 2020, the first use of
this argument will warn. Six months after that all uses will warn.
After a further six months, all uses will become fatal.

INHERITANCE
"PPIx::Regexp" is a PPIx::Regexp::Node.

"PPIx::Regexp" has no descendants.

DESCRIPTION
The purpose of the PPIx-Regexp package is to parse regular expressions
in a manner similar to the way the PPI package parses Perl. This class
forms the root of the parse tree, playing a role similar to
PPI::Document.

This package shares with PPI the property of being round-trip safe.
That is,

    my $expr = 's/ ( \d+ ) ( \D+ ) /$2$1/smxg';
    my $re = PPIx::Regexp->new( $expr );
    print $re->content() eq $expr ? "yes\n" : "no\n";

should print 'yes' for any valid regular expression.

Navigation is similar to that provided by PPI. That is to say, things
like "children", "find_first", "snext_sibling" and so on all work
pretty much the same way as in PPI.

The class hierarchy is also similar to PPI. Except for some utility
classes (the dumper, the lexer, and the tokenizer) all classes are
descended from PPIx::Regexp::Element, which provides basic navigation.
Tokens are descended from PPIx::Regexp::Token, which provides content.
All containers are descended from PPIx::Regexp::Node, which provides
for children, and all structure elements are descended from
PPIx::Regexp::Structure, which provides beginning and ending
delimiters, and a type.
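
The hierarchy and navigation described above can be explored with a
short script. This is only a sketch, assuming PPIx::Regexp is installed;
it uses nothing beyond new(), errstr(), children() and content(), and
the variable names are illustrative.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $expr = 'qr{foo}smx';
my $re = PPIx::Regexp->new( $expr )
    or die PPIx::Regexp->errstr();

# The root object is a container (Node), and therefore also an Element.
my $root_is_node    = $re->isa( 'PPIx::Regexp::Node' );
my $root_is_element = $re->isa( 'PPIx::Regexp::Element' );

# Walk the immediate children, reporting the class and content of each.
my @children = $re->children();
foreach my $child ( @children ) {
    printf "%-40s %s\n", ref $child, $child->content();
}
```

Because the parse is round-trip safe, concatenating the content of the
children should reproduce the original string.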

There are two features of PPI that this package does not provide -
mutability and operator overloading. There are no plans for serious
mutability, though something like PPI's "prune" functionality might be
considered. Similarly there are no plans for operator overloading,
which appears to the author to represent a performance hit for little
tangible gain.

NOTICE
The use of this class to parse non-regexp quote-like strings was an
experiment that I consider failed. Therefore this use is deprecated in
favor of PPIx::QuoteLike. As of version 0.058_01, the first use of the
"parse" argument to new() resulted in a warning. As of version
0.062_01, all uses of the "parse" argument resulted in a warning. As of
version 0.068_01, any use of the "parse" argument became fatal.

The author will attempt to preserve the documented interface, but if
the interface needs to change to correct some egregiously bad design or
implementation decision, then it will change. Any incompatible changes
will go through a deprecation cycle.

The goal of this package is to parse well-formed regular expressions
correctly. A secondary goal is not to blow up on ill-formed regular
expressions. The correct identification and characterization of
ill-formed regular expressions is not a goal of this package, nor is the
consistent parsing of ill-formed regular expressions from release to
release.

This policy attempts to track features in development releases as well
as public releases. However, features added in a development release
and then removed before the next production release will not be
tracked, and any functionality relating to such features will be
removed. The issue here is the potential re-use (with different
semantics) of syntax that did not make it into the production release.

From time to time the Perl regular expression engine changes in ways
that change the parse of a given regular expression. When these changes
occur, "PPIx::Regexp" will be changed to produce the more modern parse.
Known examples of this include:

$( no longer interpolates as of Perl 5.005, per "perl5005delta".
Newer Perls seem to parse this as "qr{$}" (i.e. an end-of-string
or newline assertion) followed by an open parenthesis, and that is
what "PPIx::Regexp" does.

$) and $| also seem to parse as the "$" assertion
followed by the relevant meta-character, though I have no
documentation reference for this.

"@+" and "@-" no longer interpolate as of Perl 5.9.4
per "perl594delta". Subsequent Perls treat "@+" as a quantified
literal and "@-" as two literals, and that is what "PPIx::Regexp"
does. Note that subscripted references to these arrays do
interpolate, and are so parsed by "PPIx::Regexp".

Only space and horizontal tab are whitespace as of Perl 5.23.4
when inside a bracketed character class inside an extended
bracketed character class, per "perl5234delta". Formerly any white
space character parsed as whitespace. This change in "PPIx::Regexp"
will be reverted if the change in Perl does not make it into Perl
5.24.0.

Unescaped literal left curly brackets
These are being removed in positions where quantifiers are legal,
so that they can be used for new functionality. Some of them are
gone in 5.25.1, others will be removed in a future version of Perl.
In situations where they have been removed, perl_version_removed()
will return the version in which they were removed. When the new
functionality appears, the parse produced by this software will
reflect the new functionality.

NOTE that the situation with a literal left curly after a literal
character is complicated. It was made an error in Perl 5.25.1, and
remained so through all 5.26 releases, but became a warning again
in 5.27.1 due to its use in GNU Autoconf. Whether it will ever
become illegal again is not clear to me based on the contents of
perl5271delta. At the moment perl_version_removed() returns
"undef", but obviously that is not the whole story, and methods
accepts_perl() and requirements_for_perl() were introduced to deal
with this complication.

"\o{...}"
is parsed as the octal equivalent of "\x{...}". This is its meaning
as of perl 5.13.2. Before 5.13.2 it was simply literal 'o' and so
on.

There are very probably other examples of this. When they come to light
they will be documented as producing the modern parse, and the code
modified to produce this parse if necessary.

METHODS
This class provides the following public methods. Methods not
documented here are private, and unsupported in the sense that the
author reserves the right to change or remove them without notice.

new
    my $re = PPIx::Regexp->new('/foo/');

This method instantiates a "PPIx::Regexp" object from a string, a
PPI::Token::QuoteLike::Regexp, a PPI::Token::Regexp::Match, or a
PPI::Token::Regexp::Substitute. Honestly, any PPI::Element will work,
but only the three Regexp classes mentioned previously are likely to do
anything useful.

Whatever form the argument takes, it is assumed to consist entirely of
a valid match, substitution, or "qr<>" string.

Optionally you can pass one or more name/value pairs after the regular
expression. The possible options are:

default_modifiers array_reference
This option specifies a reference to an array of default modifiers
to apply to the regular expression being parsed. Each modifier is
specified as a string. Any actual modifiers found supersede the
defaults.

When applying the defaults, '?' and '/' are completely ignored, and
'^' is ignored unless it occurs at the beginning of the modifier.
The first dash ('-') causes subsequent modifiers to be negated.

So, for example, if you wish to produce a "PPIx::Regexp" object
representing the regular expression in

    use re '/smx';
    {
        no re '/x';
        m/ foo /;
    }

you would (after some help from PPI in finding the relevant
statements), do something like

    my $re = PPIx::Regexp->new( 'm/ foo /',
        default_modifiers => [ '/smx', '-/x' ] );

encoding name
This option specifies the encoding of the regular expression. This
is passed to the tokenizer, which will "decode" the regular
expression string before it tokenizes it. For example:

    my $re = PPIx::Regexp->new( '/foo/',
        encoding => 'iso-8859-1',
    );

index_locations Boolean
This Boolean option specifies whether the locations of the elements
in the regular expression should be indexed.

If unspecified or specified as "undef" a default value is used.
This default is true if the argument is a PPI::Element or the
"location" option was specified. Otherwise the default is false.

location array_reference
This option specifies the location of the new object in the
document from which it was created. It is a reference to a
five-element array compatible with that returned by the "location()"
method of PPI::Element.

If not specified, the location of the original string is used if it
was specified as a PPI::Element.

If no location can be determined, the various "location()" methods
will return "undef".

parse parse_type
This option specifies what kind of parse is to be done. Possible
values are 'regex', 'string', or 'guess'. Any value but 'regex' is
experimental.

As it turns out, I consider parsing non-regexp quote-like things
with this class to be a failed experiment, and the relevant
functionality is being deprecated and removed in favor of
PPIx::QuoteLike. See above for details. As of version 0.068_01, any
use of this option throws an exception.

postderef Boolean
THIS ARGUMENT IS DEPRECATED. See DEPRECATION NOTICE above for the
details.

This option is passed on to the tokenizer, where it specifies
whether postfix dereferences are recognized in interpolations and
code. This experimental feature was introduced in Perl 5.19.5.

The default is the value of
$PPIx::Regexp::Tokenizer::DEFAULT_POSTDEREF, which is true. When
originally introduced this was false, but was documented as
becoming true when and if postfix dereferencing became mainstream.
The intent to mainstream was announced with Perl 5.23.1, and
became official (so to speak) with Perl 5.24.0, so the default
became true with PPIx::Regexp 0.049_01.

Note that if PPI starts unconditionally recognizing postfix
dereferences, this argument will immediately become ignored, and
will be put through a deprecation cycle and removed.

strict Boolean
This option is passed on to the tokenizer and lexer, where it
specifies whether the parse should assume "use re 'strict'" is in
effect.

The 'strict' pragma was introduced in Perl 5.22, and its
documentation says that it is experimental, and that there is no
commitment to backward compatibility. The same applies to the parse
produced when this option is asserted. Also, the usual caveat
applies: if "use re 'strict'" ends up being retracted, this option
and all related functionality will be also.

Given the nature of "use re 'strict'", you should expect that if
you assert this option, regular expressions that previously parsed
without error might no longer do so. If an element ends up being
declared an error because this option is set, its
"perl_version_introduced()" will be the Perl version at which "use
re 'strict'" started rejecting these elements.

The default is false.

trace number
If greater than zero, this option causes trace output from the
parse. The author reserves the right to change or eliminate this
without notice.

Passing optional input other than the above is not an error, but
neither is it supported.
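
As an illustration of the "default_modifiers" option, the following
sketch checks which modifiers end up asserted for the "use re" example
given above. It assumes PPIx::Regexp is installed and relies on the
modifier_asserted() method documented below; the %asserted hash is just
a convenience for the example.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Defaults as they would apply inside the block: /smx, then x negated.
my $re = PPIx::Regexp->new( 'm/ foo /',
    default_modifiers => [ '/smx', '-/x' ],
) or die PPIx::Regexp->errstr();

# Record whether each modifier is asserted, explicitly or by default.
my %asserted = map { $_ => ( $re->modifier_asserted( $_ ) ? 1 : 0 ) }
    qw{ s m x };

foreach my $mod ( qw{ s m x } ) {
    print "$mod: ", ( $asserted{$mod} ? 'asserted' : 'not asserted' ), "\n";
}
```

Given the negation rule stated for the first dash, 'x' would be expected
to come out as not asserted, while 's' and 'm' remain asserted.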

new_from_cache
This static method wraps "new" in a caching mechanism. Only one object
will be generated for a given PPI::Element, no matter how many times
this method is called. Calls after the first for a given PPI::Element
simply return the same "PPIx::Regexp" object.

When the "PPIx::Regexp" object is returned from cache, the values of
the optional arguments are ignored.

Calls to this method with the regular expression in a string rather
than a PPI::Element will not be cached.

Caveat: This method is provided for code like Perl::Critic which might
instantiate the same object multiple times. The cache will persist
until "flush_cache" is called.

flush_cache
    $re->flush_cache();            # Remove $re from cache
    PPIx::Regexp->flush_cache();   # Empty the cache

This method flushes the cache used by "new_from_cache". If called as a
static method with no arguments, the entire cache is emptied. Otherwise
any objects specified are removed from the cache.
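
The caching behaviour might be exercised as follows. This is a sketch
that assumes both PPI and PPIx::Regexp are installed; the code snippet
being parsed and the variable names are made up for the example.

```perl
use strict;
use warnings;
use PPI;
use PPIx::Regexp;
use Scalar::Util qw{ refaddr };

# Let PPI find a match operator in a small piece of Perl.
my $code = 'my $ok = $text =~ m/foo/;';
my $doc = PPI::Document->new( \$code )
    or die 'PPI could not parse the snippet';
my $elem = $doc->find_first( 'PPI::Token::Regexp::Match' )
    or die 'No match operator found';

# Two calls for the same PPI::Element should yield one object.
my $first  = PPIx::Regexp->new_from_cache( $elem );
my $second = PPIx::Regexp->new_from_cache( $elem );
my $same_object = refaddr( $first ) == refaddr( $second );
print $same_object ? "same object\n" : "different objects\n";

# Empty the whole cache once the objects are no longer needed.
PPIx::Regexp->flush_cache();
```

Since the cache persists until flushed, long-running code would
presumably want the final call once it has finished with a document.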

capture_names
    foreach my $name ( $re->capture_names() ) {
        print "Capture name '$name'\n";
    }

This convenience method returns the capture names found in the regular
expression.

This method is equivalent to

    $self->regular_expression()->capture_names();

except that if "$self->regular_expression()" returns "undef" (meaning
that something went terribly wrong with the parse) this method will
simply return.

delimiters
    print join("\t", PPIx::Regexp->new('s/foo/bar/')->delimiters());
    # prints '// //'

When called in list context, this method returns either one or two
strings, depending on whether the parsed expression has a replacement
string. In the case of non-bracketed substitutions, the start delimiter
of the replacement string is considered to be the same as its finish
delimiter, as illustrated by the above example.

When called in scalar context, you get the delimiters of the regular
expression; that is, element 0 of the array that is returned in list
context.

Optionally, you can pass an index value and the corresponding
delimiters will be returned; index 0 represents the regular
expression's delimiters, and index 1 represents the replacement
string's delimiters, which may be undef. For example,

    print PPIx::Regexp->new('s{foo}<bar>')->delimiters(1);
    # prints '<>'

If the object was not initialized with a valid regexp of some sort, the
results of this method are undefined.
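
The difference between the calling contexts can be seen in a short
sketch, assuming PPIx::Regexp is installed. The variable names are
illustrative only.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $subst = PPIx::Regexp->new( 's{foo}<bar>' )
    or die PPIx::Regexp->errstr();

# List context: one entry for the regexp, one for the replacement.
my @all = $subst->delimiters();

# Scalar context: only the delimiters of the regular expression.
my $regexp_only = $subst->delimiters();

# A plain match has no replacement, so list context gives one entry.
my $match = PPIx::Regexp->new( 'm/foo/' )
    or die PPIx::Regexp->errstr();
my @match_delims = $match->delimiters();

print "substitution: @all\n";
print "scalar:       $regexp_only\n";
print "match:        @match_delims\n";
```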

errstr
This static method returns the error string from the most recent
attempt to instantiate a "PPIx::Regexp". It will be "undef" if the most
recent attempt succeeded.

extract_regexps
    my $doc = PPI::Document->new( $path );
    $doc->index_locations();
    my @res = PPIx::Regexp->extract_regexps( $doc );

This convenience (well, sort-of) static method takes as its argument a
PPI::Document object and returns "PPIx::Regexp" objects corresponding
to all regular expressions found in it, in the order in which they
occur in the document. You will need to keep a reference to the
original PPI::Document object if you wish to be able to recover the
original PPI::Element objects via the PPIx::Regexp source() method.
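
A self-contained variant of the above, parsing Perl source held in a
string rather than a file, might look like this. It is a sketch that
assumes PPI and PPIx::Regexp are installed; the sample code in the
here-document is invented for the purpose.

```perl
use strict;
use warnings;
use PPI;
use PPIx::Regexp;

# Sample Perl source containing one match and one substitution.
my $code = <<'EOD';
my $line = 'foo bar';
if ( $line =~ m/foo/ ) {
    $line =~ s/bar/baz/;
}
EOD

# Keep $doc in scope for as long as the PPIx::Regexp objects are used.
my $doc = PPI::Document->new( \$code )
    or die 'PPI could not parse the sample';
$doc->index_locations();

my @res = PPIx::Regexp->extract_regexps( $doc );
my @found = map { $_->content() } @res;
print "$_\n" for @found;
```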

failures
    print "There were ", $re->failures(), " parse failures\n";

This method returns the number of parse failures. This is a count of
the number of unknown tokens plus the number of unterminated structures
plus the number of unmatched right brackets of any sort.
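
For instance, a well-formed expression can be compared with one that
has an unterminated group. This is a sketch assuming PPIx::Regexp is
installed; since consistent parsing of ill-formed expressions is
explicitly not a goal of the package, only the distinction between zero
and non-zero should be relied on here.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# A well-formed expression, and one whose capture group is never closed.
my $good = PPIx::Regexp->new( 'm/(foo)/' )
    or die PPIx::Regexp->errstr();
my $bad = PPIx::Regexp->new( 'm/(foo/' )
    or die PPIx::Regexp->errstr();

my $good_failures = $good->failures();
my $bad_failures  = $bad->failures();

print "well-formed: $good_failures parse failures\n";
print "ill-formed:  $bad_failures parse failures\n";
```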

max_capture_number
    print "Highest used capture number ",
        $re->max_capture_number(), "\n";

This convenience method returns the highest capture number used by the
regular expression. If there are no captures, the return will be 0.

This method is equivalent to

    $self->regular_expression()->max_capture_number();

except that if "$self->regular_expression()" returns "undef" (meaning
that something went terribly wrong with the parse) this method will
too.
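
The two capture-related convenience methods can be tried together on an
expression that mixes named and unnamed groups. This is a sketch
assuming PPIx::Regexp is installed; the names are sorted before
printing because no particular return order is documented here.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Two named captures followed by one unnamed capture.
my $re = PPIx::Regexp->new(
    'm/ (?<year> \d+ ) - (?<month> \d+ ) - ( \d+ ) /smx'
) or die PPIx::Regexp->errstr();

my @names   = sort $re->capture_names();
my $highest = $re->max_capture_number();

print "Capture names: @names\n";
print "Highest used capture number: $highest\n";
```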

modifier
    my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
    print $re->modifier()->content(), "\n";
    # prints 'smx'.

This method retrieves the modifier of the object. This comes from the
end of the initializing string or object and will be a
PPIx::Regexp::Token::Modifier.

Note that this object represents the actual modifiers present on the
regexp, and does not take into account any that may have been applied
by default (i.e. via the "default_modifiers" argument to "new()"). For
something that takes account of default modifiers, see
modifier_asserted(), below.

In the event of a parse failure, there may not be a modifier present,
in which case nothing is returned.

modifier_asserted
    my $re = PPIx::Regexp->new( '/ . /',
        default_modifiers => [ 'smx' ] );
    print $re->modifier_asserted( 'x' ) ? "yes\n" : "no\n";
    # prints 'yes'.

This method returns true if the given modifier is asserted for the
regexp, whether explicitly or by the modifiers passed in the
"default_modifiers" argument.

Starting with version 0.036_01, if the argument is a single-character
modifier followed by an asterisk (intended as a wild card character),
the return is the number of times that modifier appears. In this case
an exception will be thrown if you specify a multi-character modifier
(e.g. 'ee*'), or if you specify one of the match semantics modifiers
(e.g. 'a*').

regular_expression
    my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
    print $re->regular_expression()->content(), "\n";
    # prints '/(foo)/'.

This method returns that portion of the object which actually
represents a regular expression.

replacement
    my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
    print $re->replacement()->content(), "\n";
    # prints '${1}bar/'.

This method returns that portion of the object which represents the
replacement string. This will be "undef" unless the regular expression
actually has a replacement string. Delimiters will be included, but
there will be no beginning delimiter unless the regular expression was
bracketed.

source
    my $source = $re->source();

This method returns the object or string that was used to instantiate
the object.

type
    my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
    print $re->type()->content(), "\n";
    # prints 's'.

This method retrieves the type of the object. This comes from the
beginning of the initializing string or object, and will be a
PPIx::Regexp::Token::Structure whose "content" is one of 's', 'm',
'qr', or ''.

RESTRICTIONS
By the nature of this module, it is never going to get everything
right. Many of the known problem areas involve interpolations one way
or another.

Ambiguous Syntax
Perl's regular expressions contain cases where the syntax is ambiguous.
A particularly egregious example is an interpolation followed by square
or curly brackets, for example $foo[...]. There is nothing in the
syntax to say whether the programmer wanted to interpolate an element
of array @foo, or whether he wanted to interpolate scalar $foo, and
then follow that interpolation by a character class.

The perlop documentation notes that in this case what Perl does is to
guess. That is, it employs various heuristics on the code to try to
figure out what the programmer wanted. These heuristics are documented
as being undocumented (!) and subject to change without notice. As an
example of the problems even perl faces in parsing Perl, see
<https://github.com/perl/perl5/issues/16478>.

Given this situation, this module's chances of duplicating every Perl
version's interpretation of every regular expression are pretty much
nil. What it does now is to assume that square brackets containing
only an integer or an interpolation represent a subscript; otherwise
they represent a character class. Similarly, curly brackets containing
only a bareword or an interpolation are a subscript; otherwise they
represent a quantifier.
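
The rule just stated can be paraphrased in plain Perl. The two
functions below are hypothetical helpers written for this illustration;
they are not part of PPIx::Regexp and are not its actual
implementation. As a simplification, an "interpolation" is taken to
mean a simple scalar such as $i, and an "integer" to mean unsigned
digits.

```perl
use strict;
use warnings;

# Hypothetical helper: classify the text found between square brackets
# that immediately follow an interpolated variable.
sub classify_square_brackets {
    my ( $inside ) = @_;
    # Only an integer, or only a simple scalar interpolation.
    return 'subscript'
        if $inside =~ m/ \A (?: \d+ | \$ \w+ ) \z /smx;
    return 'character class';
}

# Hypothetical helper: the same idea for curly brackets.
sub classify_curly_brackets {
    my ( $inside ) = @_;
    # Only a bareword, or only a simple scalar interpolation.
    return 'subscript'
        if $inside =~ m/ \A (?: [[:alpha:]_] \w* | \$ \w+ ) \z /smx;
    return 'quantifier';
}

foreach my $text ( '0', '$i', 'a-z' ) {
    print "\$foo[$text] is taken as a ",
        classify_square_brackets( $text ), "\n";
}
foreach my $text ( 'bar', '$key', '2,3' ) {
    print "\$foo{$text} is taken as a ",
        classify_curly_brackets( $text ), "\n";
}
```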

Changes in Syntax
Sometimes the introduction of new syntax changes the way a regular
expression is parsed. For example, the "\v" character class was
introduced in Perl 5.9.5. But it did not represent a syntax error prior
to that version of Perl, it was simply parsed as "v". So

    $ perl -le 'print "v" =~ m/\v/ ? "yes" : "no"'

prints "yes" under Perl 5.8.9, but "no" under 5.10.0. "PPIx::Regexp"
generally assumes the more modern parse in cases like this.
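
The modern meaning of "\v" can be confirmed with a few matches. This
sketch uses only core Perl and assumes it is run under Perl 5.10.0 or
later, where "\v" matches vertical whitespace rather than the letter
"v".

```perl
use strict;
use warnings;

# Under the modern parse, \v is vertical whitespace, not the letter v.
my $letter_matches  = ( 'v'    =~ m/\v/ ) ? 1 : 0;
my $vtab_matches    = ( "\x0B" =~ m/\v/ ) ? 1 : 0;
my $newline_matches = ( "\n"   =~ m/\v/ ) ? 1 : 0;

print 'letter v:     ', ( $letter_matches  ? 'yes' : 'no' ), "\n";
print 'vertical tab: ', ( $vtab_matches    ? 'yes' : 'no' ), "\n";
print 'newline:      ', ( $newline_matches ? 'yes' : 'no' ), "\n";
```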

Equivocation
Very occasionally, a construction will be removed and then added back
-- and then, conceivably, removed again. In this case, the plan is for
perl_version_introduced() to return the earliest version in which the
construction appeared, and perl_version_removed() to return the version
after the last version in which it appeared (whether production or
development), or "undef" if it is in the highest-numbered Perl.

The constructions involved in this are:

Un-escaped literal left curly after literal

That is, something like "qr<x{>".

This was made an error in 5.25.1, and it was an error in 5.26.0. But
it became a warning again in 5.27.1. The perl5271delta says it was
reinstated because the changes broke GNU Autoconf, and the warning
message says it will be removed in Perl 5.30.

Accordingly, perl_version_introduced() returns 5.0. At the moment
perl_version_removed() returns '5.025001'. But if it is present with or
without warning in 5.28, perl_version_removed() will become "undef". If
you need finer resolution than this, see PPIx::Regexp::Element methods
accepts_perl() and requirements_for_perl().

Static Parsing
It is well known that Perl can not be statically parsed. That is, you
can not completely parse a piece of Perl code without executing that
same code.

Nevertheless, this class is trying to statically parse regular
expressions. The main problem with this is that there is no way to know
what is being interpolated into the regular expression by an
interpolated variable. This is a problem because the interpolated value
can change the interpretation of adjacent elements.

This module deals with this by making assumptions about what is in an
interpolated variable. These assumptions will not be enumerated here,
but in general the principle is to assume the interpolated value does
not change the interpretation of the regular expression. For example,

    my $foo = 'a-z]';
    my $re = qr{[$foo};

is fine with the Perl interpreter, but will confuse the dickens out of
this module. Similarly and more usefully, something like

    my $mods = 'i';
    my $re = qr{(?$mods:foo)};

or maybe

    my $mods = 'i';
    my $re = qr{(?$mods)$foo};

probably sets a modifier of some sort, and that is how this module
interprets it. If the interpolation is not about modifiers, this module
will get it wrong. Another such semi-benign example is

    my $foo = $] >= 5.010 ? '?<foo>' : '';
    my $re = qr{($foo\w+)};

which will parse, but this module will never realize that it might be
looking at a named capture.
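
The named-capture case can be demonstrated at run time with core Perl
alone. This is a sketch assuming Perl 5.10.0 or later; the variable
$prefix and the capture name "word" are chosen for the example. It
shows that the interpolated value really does turn an apparently plain
group into a named capture, which is information a static parse never
sees.

```perl
use strict;
use warnings;

# The capture name only exists once the variable has been interpolated.
my $prefix = '?<word>';
my $re = qr{($prefix\w+)};

my $captured;
if ( 'hello world' =~ $re ) {
    # %+ holds named captures, so this is defined only because the
    # interpolation changed the meaning of the parentheses.
    $captured = $+{word};
}

print 'Named capture "word": ',
    ( defined $captured ? $captured : '(none)' ), "\n";
```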

Non-Standard Syntax
There are modules out there that alter the syntax of Perl. If the
syntax of a regular expression is altered, this module has no way to
understand that it has been altered, much less to adapt to the
alteration. The following modules are known to cause problems:

Acme::PerlML, which renders Perl as XML.

"Data::PostfixDeref", which causes Perl to interpret suffixed empty
brackets as dereferencing the thing they suffix. This module by Ben
Morrow ("BMORROW") appears to have been retracted.

Filter::Trigraph, which recognizes ANSI C trigraphs, allowing Perl to
be written in the ISO 646 character set.

Perl6::Pugs. Enough said.

Perl6::Rules, which back-ports some of the Perl 6 regular expression
syntax to Perl 5.

Regexp::Extended, which extends regular expressions in various ways,
some of which seem to conflict with Perl 5.010.

SEE ALSO
Regexp::Parsertron, which uses Marpa::R2 to parse the regexp, and Tree
for navigation. Unlike "PPIx::Regexp", Regexp::Parsertron
supports modification of the parse tree.

Regexp::Parser, which parses a bare regular expression (without
enclosing "qr{}", "m//", or whatever) and uses a different navigation
model. After a long hiatus, this module has been adopted, and is again
supported.

SUPPORT
Support is by the author. Please file bug reports at
<https://rt.cpan.org>, or in electronic mail to the author.

AUTHOR
Thomas R. Wyant, III wyant at cpan dot org

COPYRIGHT AND LICENSE
Copyright (C) 2009-2020 by Thomas R. Wyant, III

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl 5.10.0. For more details, see the full
text of the licenses in the directory LICENSES.

This program is distributed in the hope that it will be useful, but
without any warranty; without even the implied warranty of
merchantability or fitness for a particular purpose.

perl v5.32.0 2020-07-29 PPIx::Regexp(3)