PPIx::Regexp(3) User Contributed Perl Documentation PPIx::Regexp(3)

NAME
PPIx::Regexp - Represent a regular expression of some sort

SYNOPSIS
    use PPIx::Regexp;
    use PPIx::Regexp::Dumper;
    my $re = PPIx::Regexp->new( 'qr{foo}smx' );
    PPIx::Regexp::Dumper->new( $re )
        ->print();

INHERITANCE
"PPIx::Regexp" is a PPIx::Regexp::Node.

"PPIx::Regexp" has no descendants.

DESCRIPTION
The purpose of the PPIx-Regexp package is to parse regular expressions
in a manner similar to the way the PPI package parses Perl. This class
forms the root of the parse tree, playing a role similar to
PPI::Document.

This package shares with PPI the property of being round-trip safe.
That is,

    my $expr = 's/ ( \d+ ) ( \D+ ) /$2$1/smxg';
    my $re = PPIx::Regexp->new( $expr );
    print $re->content() eq $expr ? "yes\n" : "no\n";

should print 'yes' for any valid regular expression.

Navigation is similar to that provided by PPI. That is to say, things
like "children", "find_first", "snext_sibling" and so on all work
pretty much the same way as in PPI.

The class hierarchy is also similar to PPI. Except for some utility
classes (the dumper, the lexer, and the tokenizer) all classes are
descended from PPIx::Regexp::Element, which provides basic navigation.
Tokens are descended from PPIx::Regexp::Token, which provides content.
All containers are descended from PPIx::Regexp::Node, which provides
for children, and all structure elements are descended from
PPIx::Regexp::Structure, which provides beginning and ending
delimiters, and a type.
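
As a rough illustration of the navigation and class hierarchy described
above, the following sketch locates the capture groups in a substitution
and prints the class and source text of each. The sample expression and
the variable names are illustrative only, and find() is assumed to behave
like its PPI namesake, returning an array reference or a false value.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $expr = 's/ ( \d+ ) ( \D+ ) /$2$1/smxg';
my $re = PPIx::Regexp->new( $expr )
    or die PPIx::Regexp->errstr();

# find() is given a class name; fall back to an empty list if
# nothing of that class is found.
my $captures = $re->find( 'PPIx::Regexp::Structure::Capture' ) || [];

foreach my $capture ( @{ $captures } ) {
    # Each element reports its own class and its exact source text.
    printf "%s: %s\n", ref $capture, $capture->content();
}
```

Because the parse is round-trip safe, the content() of each structure
includes its delimiters and any embedded white space.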

There are two features of PPI that this package does not provide:
mutability and operator overloading. There are no plans for serious
mutability, though something like PPI's "prune" functionality might be
considered. Similarly there are no plans for operator overloading,
which appears to the author to represent a performance hit for little
tangible gain.

NOTICE
The use of this class to parse non-regexp quote-like strings was an
experiment that I consider failed. Therefore this use is deprecated in
favor of PPIx::QuoteLike. As of version 0.058_01, the first use of the
"parse" argument to new() resulted in a warning. As of version
0.062_01, all uses of the "parse" argument resulted in a warning. As of
version 0.068_01, any use of the "parse" argument is fatal.

The author will attempt to preserve the documented interface, but if
the interface needs to change to correct some egregiously bad design or
implementation decision, then it will change. Any incompatible changes
will go through a deprecation cycle.

The goal of this package is to parse well-formed regular expressions
correctly. A secondary goal is not to blow up on ill-formed regular
expressions. The correct identification and characterization of
ill-formed regular expressions is not a goal of this package, nor is
the consistent parsing of ill-formed regular expressions from release
to release.

This policy attempts to track features in development releases as well
as public releases. However, features added in a development release
and then removed before the next production release will not be
tracked, and any functionality relating to such features will be
removed. The issue here is the potential re-use (with different
semantics) of syntax that did not make it into the production release.

From time to time the Perl regular expression engine changes in ways
that change the parse of a given regular expression. When these changes
occur, "PPIx::Regexp" will be changed to produce the more modern parse.
Known examples of this include:

$( no longer interpolates as of Perl 5.005, per "perl5005delta".
    Newer Perls seem to parse this as "qr{$}" (i.e. an end-of-string
    or newline assertion) followed by an open parenthesis, and that is
    what "PPIx::Regexp" does.

$) and $| also seem to parse as the "$" assertion
    followed by the relevant meta-character, though I have no
    documentation reference for this.

"@+" and "@-" no longer interpolate as of Perl 5.9.4
    per "perl594delta". Subsequent Perls treat "@+" as a quantified
    literal and "@-" as two literals, and that is what "PPIx::Regexp"
    does. Note that subscripted references to these arrays do
    interpolate, and are so parsed by "PPIx::Regexp".

Only space and horizontal tab are whitespace as of Perl 5.23.4
    when inside a bracketed character class inside an extended
    bracketed character class, per "perl5234delta". Formerly any white
    space character parsed as whitespace. This change in "PPIx::Regexp"
    will be reverted if the change in Perl does not make it into Perl
    5.24.0.

Unescaped literal left curly brackets
    These are being removed in positions where quantifiers are legal,
    so that they can be used for new functionality. Some of them are
    gone in 5.25.1, others will be removed in a future version of Perl.
    In situations where they have been removed, perl_version_removed()
    will return the version in which they were removed. When the new
    functionality appears, the parse produced by this software will
    reflect the new functionality.

    NOTE that the situation with a literal left curly after a literal
    character is complicated. It was made an error in Perl 5.25.1, and
    remained so through all 5.26 releases, but became a warning again
    in 5.27.1 due to its use in GNU Autoconf. Whether it will ever
    become illegal again is not clear to me based on the contents of
    perl5271delta. At the moment perl_version_removed() returns
    "undef", but obviously that is not the whole story, and methods
    accepts_perl() and requirements_for_perl() were introduced to deal
    with this complication.

"\o{...}"
    is parsed as the octal equivalent of "\x{...}". This is its meaning
    as of perl 5.13.2. Before 5.13.2 it was simply literal 'o' and so
    on.

There are very probably other examples of this. When they come to light
they will be documented as producing the modern parse, and the code
modified to produce this parse if necessary.
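
A minimal sketch of how the version information behind these parses
might be inspected: it walks every token of a sample expression and
reports perl_version_introduced() and perl_version_removed() for each.
The sample expression is an assumption, chosen only because it contains
"\o{...}" and "\v"; the values reported depend on the installed version
of this package, so none are shown here.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $re = PPIx::Regexp->new( 'm/\o{101}\v/' )
    or die PPIx::Regexp->errstr();

my @report;
foreach my $token ( @{ $re->find( 'PPIx::Regexp::Token' ) || [] } ) {
    my $introduced = $token->perl_version_introduced();
    my $removed    = $token->perl_version_removed();
    # An undefined removal version means the construct is still
    # present in the newest Perl this package knows about.
    push @report, sprintf '%-10s introduced %s, removed %s',
        $token->content(),
        defined $introduced ? $introduced : 'unknown',
        defined $removed    ? $removed    : 'not removed';
}
print "$_\n" for @report;
```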

METHODS
This class provides the following public methods. Methods not
documented here are private, and unsupported in the sense that the
author reserves the right to change or remove them without notice.

new
    my $re = PPIx::Regexp->new('/foo/');

This method instantiates a "PPIx::Regexp" object from a string, a
PPI::Token::QuoteLike::Regexp, a PPI::Token::Regexp::Match, or a
PPI::Token::Regexp::Substitute. Honestly, any PPI::Element will work,
but only the three Regexp classes mentioned previously are likely to do
anything useful.

Whatever form the argument takes, it is assumed to consist entirely of
a valid match, substitution, or "qr<>" string.

Optionally you can pass one or more name/value pairs after the regular
expression. The possible options are:

default_modifiers array_reference
    This option specifies a reference to an array of default modifiers
    to apply to the regular expression being parsed. Each modifier is
    specified as a string. Any actual modifiers found supersede the
    defaults.

    When applying the defaults, '?' and '/' are completely ignored, and
    '^' is ignored unless it occurs at the beginning of the modifier.
    The first dash ('-') causes subsequent modifiers to be negated.

    So, for example, if you wish to produce a "PPIx::Regexp" object
    representing the regular expression in

        use re '/smx';
        {
            no re '/x';
            m/ foo /;
        }

    you would (after some help from PPI in finding the relevant
    statements), do something like

        my $re = PPIx::Regexp->new( 'm/ foo /',
            default_modifiers => [ '/smx', '-/x' ] );

encoding name
    This option specifies the encoding of the regular expression. This
    is passed to the tokenizer, which will "decode" the regular
    expression string before it tokenizes it. For example:

        my $re = PPIx::Regexp->new( '/foo/',
            encoding => 'iso-8859-1',
        );

parse parse_type
    This option specifies what kind of parse is to be done. Possible
    values are 'regex', 'string', or 'guess'. Any value but 'regex' is
    experimental.

    As it turns out, I consider parsing non-regexp quote-like things
    with this class to be a failed experiment, and the relevant
    functionality is being deprecated and removed in favor of
    PPIx::QuoteLike. See above for details. As of version 0.068_01, any
    use of this option throws an exception.

postderef boolean
    This option is passed on to the tokenizer, where it specifies
    whether postfix dereferences are recognized in interpolations and
    code. This experimental feature was introduced in Perl 5.19.5.

    The default is the value of
    $PPIx::Regexp::Tokenizer::DEFAULT_POSTDEREF, which is true. When
    originally introduced this was false, but was documented as
    becoming true when and if postfix dereferencing became mainstream.
    The intent to mainstream was announced with Perl 5.23.1, and
    became official (so to speak) with Perl 5.24.0, so the default
    became true with PPIx::Regexp 0.049_01.

    Note that if PPI starts unconditionally recognizing postfix
    dereferences, this argument will immediately become ignored, and
    will be put through a deprecation cycle and removed.

strict boolean
    This option is passed on to the tokenizer and lexer, where it
    specifies whether the parse should assume "use re 'strict'" is in
    effect.

    The 'strict' pragma was introduced in Perl 5.22, and its
    documentation says that it is experimental, and that there is no
    commitment to backward compatibility. The same applies to the parse
    produced when this option is asserted. Also, the usual caveat
    applies: if "use re 'strict'" ends up being retracted, this option
    and all related functionality will be also.

    Given the nature of "use re 'strict'", you should expect that if
    you assert this option, regular expressions that previously parsed
    without error might no longer do so. If an element ends up being
    declared an error because this option is set, its
    "perl_version_introduced()" will be the Perl version at which "use
    re 'strict'" started rejecting these elements.

    The default is false.

trace number
    If greater than zero, this option causes trace output from the
    parse. The author reserves the right to change or eliminate this
    without notice.

Passing optional input other than the above is not an error, but
neither is it supported.
247
248 new_from_cache
249 This static method wraps "new" in a caching mechanism. Only one object
250 will be generated for a given PPI::Element, no matter how many times
251 this method is called. Calls after the first for a given PPI::Element
252 simply return the same "PPIx::Regexp" object.
253
254 When the "PPIx::Regexp" object is returned from cache, the values of
255 the optional arguments are ignored.
256
257 Calls to this method with the regular expression in a string rather
258 than a PPI::Element will not be cached.
259
260 Caveat: This method is provided for code like Perl::Critic which might
261 instantiate the same object multiple times. The cache will persist
262 until "flush_cache" is called.
263
264 flush_cache
265 $re->flush_cache(); # Remove $re from cache
266 PPIx::Regexp->flush_cache(); # Empty the cache
267
268 This method flushes the cache used by "new_from_cache". If called as a
269 static method with no arguments, the entire cache is emptied. Otherwise
270 any objects specified are removed from the cache.
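
The interaction between "new_from_cache" and "flush_cache" can be
sketched as follows. The sample code string and the variable names are
assumptions made for illustration; PPI is used only to obtain a
PPI::Element, since string arguments are not cached.

```perl
use strict;
use warnings;
use PPI::Document;
use PPIx::Regexp;
use Scalar::Util qw{ refaddr };

my $code = '$x =~ m/foo/;';
my $doc = PPI::Document->new( \$code )
    or die 'PPI could not parse the sample code';
my $elem = $doc->find_first( 'PPI::Token::Regexp::Match' )
    or die 'No match operator found';

# Two calls for the same PPI::Element should yield the same object.
my $first  = PPIx::Regexp->new_from_cache( $elem );
my $second = PPIx::Regexp->new_from_cache( $elem );
my $same_before = refaddr( $first ) == refaddr( $second );

# After the cache is emptied, a fresh object is built.
PPIx::Regexp->flush_cache();
my $third = PPIx::Regexp->new_from_cache( $elem );
my $same_after = refaddr( $first ) == refaddr( $third );

print 'Same object before flush: ', ( $same_before ? 'yes' : 'no' ), "\n";
print 'Same object after flush: ',  ( $same_after  ? 'yes' : 'no' ), "\n";
```

Note that $doc is kept in scope throughout, so that the PPI::Element
remains valid for as long as the cached objects are in use.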

capture_names
    foreach my $name ( $re->capture_names() ) {
        print "Capture name '$name'\n";
    }

This convenience method returns the capture names found in the regular
expression.

This method is equivalent to

    $self->regular_expression()->capture_names();

except that if "$self->regular_expression()" returns "undef" (meaning
that something went terribly wrong with the parse) this method will
simply return.

delimiters
    print join("\t", PPIx::Regexp->new('s/foo/bar/')->delimiters());
    # prints '// //'

When called in list context, this method returns either one or two
strings, depending on whether the parsed expression has a replacement
string. In the case of non-bracketed substitutions, the start delimiter
of the replacement string is considered to be the same as its finish
delimiter, as illustrated by the above example.

When called in scalar context, you get the delimiters of the regular
expression; that is, element 0 of the array that is returned in list
context.

Optionally, you can pass an index value and the corresponding
delimiters will be returned; index 0 represents the regular
expression's delimiters, and index 1 represents the replacement
string's delimiters, which may be undef. For example,

    print PPIx::Regexp->new('s{foo}<bar>')->delimiters(1);
    # prints '<>'

If the object was not initialized with a valid regexp of some sort, the
results of this method are undefined.

errstr
This static method returns the error string from the most recent
attempt to instantiate a "PPIx::Regexp". It will be "undef" if the most
recent attempt succeeded.

extract_regexps
    my $doc = PPI::Document->new( $path );
    $doc->index_locations();
    my @res = PPIx::Regexp->extract_regexps( $doc );

This convenience (well, sort-of) static method takes as its argument a
PPI::Document object and returns "PPIx::Regexp" objects corresponding
to all regular expressions found in it, in the order in which they
occur in the document. You will need to keep a reference to the
original PPI::Document object if you wish to be able to recover the
original PPI::Element objects via the PPIx::Regexp source() method.

failures
    print "There were ", $re->failures(), " parse failures\n";

This method returns the number of parse failures. This is a count of
the number of unknown tokens plus the number of unterminated structures
plus the number of unmatched right brackets of any sort.
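
One way "errstr" and "failures" might be used together is sketched
below. The two sample expressions are assumptions chosen for
illustration; the second has an unterminated parenthesis, and so is
expected to report at least one failure.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my %failures;
foreach my $expr ( '/foo/', '/(foo/' ) {
    my $re = PPIx::Regexp->new( $expr );
    if ( ! $re ) {
        # errstr() describes the most recent failed instantiation.
        print "$expr: ", PPIx::Regexp->errstr(), "\n";
        next;
    }
    # A successful instantiation can still contain parse failures.
    $failures{$expr} = $re->failures();
    print "$expr: $failures{$expr} parse failure(s)\n";
}
```

The distinction matters: a false return from new() means no object was
built at all, whereas a non-zero failures() means an object was built
from input that did not parse cleanly.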

max_capture_number
    print "Highest used capture number ",
        $re->max_capture_number(), "\n";

This convenience method returns the highest capture number used by the
regular expression. If there are no captures, the return will be 0.

This method is equivalent to

    $self->regular_expression()->max_capture_number();

except that if "$self->regular_expression()" returns "undef" (meaning
that something went terribly wrong with the parse) this method will
too.

modifier
    my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
    print $re->modifier()->content(), "\n";
    # prints 'smx'.

This method retrieves the modifier of the object. This comes from the
end of the initializing string or object and will be a
PPIx::Regexp::Token::Modifier.

Note that this object represents the actual modifiers present on the
regexp, and does not take into account any that may have been applied
by default (i.e. via the "default_modifiers" argument to "new()"). For
something that takes account of default modifiers, see
modifier_asserted(), below.

In the event of a parse failure, there may not be a modifier present,
in which case nothing is returned.

modifier_asserted
    my $re = PPIx::Regexp->new( '/ . /',
        default_modifiers => [ 'smx' ] );
    print $re->modifier_asserted( 'x' ) ? "yes\n" : "no\n";
    # prints 'yes'.

This method returns true if the given modifier is asserted for the
regexp, whether explicitly or by the modifiers passed in the
"default_modifiers" argument.

Starting with version 0.036_01, if the argument is a single-character
modifier followed by an asterisk (intended as a wild card character),
the return is the number of times that modifier appears. In this case
an exception will be thrown if you specify a multi-character modifier
(e.g. 'ee*'), or if you specify one of the match semantics modifiers
(e.g. 'a*').
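
The wild card form might be used like this to distinguish a single "e"
modifier from a doubled one. The sample substitutions are assumptions
chosen for illustration.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my %count;
foreach my $expr ( 's/foo/bar/e', 's/foo/bar/ee' ) {
    my $re = PPIx::Regexp->new( $expr )
        or die PPIx::Regexp->errstr();
    # 'e*' asks how many times the 'e' modifier appears.
    $count{$expr} = $re->modifier_asserted( 'e*' );
    print "$expr: 'e' appears $count{$expr} time(s)\n";
}
```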

regular_expression
    my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
    print $re->regular_expression()->content(), "\n";
    # prints '/(foo)/'.

This method returns that portion of the object which actually
represents a regular expression.

replacement
    my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
    print $re->replacement()->content(), "\n";
    # prints '${1}bar/'.

This method returns that portion of the object which represents the
replacement string. This will be "undef" unless the regular expression
actually has a replacement string. Delimiters will be included, but
there will be no beginning delimiter unless the regular expression was
bracketed.

source
    my $source = $re->source();

This method returns the object or string that was used to instantiate
the object.

type
    my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
    print $re->type()->content(), "\n";
    # prints 's'.

This method retrieves the type of the object. This comes from the
beginning of the initializing string or object, and will be a
PPIx::Regexp::Token::Structure whose "content" is one of 's', 'm',
'qr', or ''.

RESTRICTIONS
By the nature of this module, it is never going to get everything
right. Many of the known problem areas involve interpolations one way
or another.

Ambiguous Syntax
Perl's regular expressions contain cases where the syntax is ambiguous.
A particularly egregious example is an interpolation followed by square
or curly brackets, for example $foo[...]. There is nothing in the
syntax to say whether the programmer wanted to interpolate an element
of array @foo, or to interpolate scalar $foo and then follow that
interpolation with a character class.

The perlop documentation notes that in this case what Perl does is to
guess. That is, it employs various heuristics on the code to try to
figure out what the programmer wanted. These heuristics are documented
as being undocumented (!) and subject to change without notice. As an
example of the problems even perl faces in parsing Perl, see
<https://github.com/perl/perl5/issues/16478>.

Given this situation, this module's chances of duplicating every Perl
version's interpretation of every regular expression are pretty much
nil. What it does now is to assume that square brackets containing
only an integer or an interpolation represent a subscript; otherwise
they represent a character class. Similarly, curly brackets containing
only a bareword or an interpolation are a subscript; otherwise they
represent a quantifier.
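
The square-bracket rule just described can be observed with a sketch
like the following, which prints the text that this module treats as
the interpolation in each case. The two sample expressions are
assumptions chosen to fall on either side of the rule.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my %interpolated;
foreach my $expr ( '/$foo[1]/', '/$foo[a-z]/' ) {
    my $re = PPIx::Regexp->new( $expr )
        or die PPIx::Regexp->errstr();
    my $token = $re->find_first( 'PPIx::Regexp::Token::Interpolation' )
        or next;
    # A subscript is part of the interpolation; a character class
    # is not.
    $interpolated{$expr} = $token->content();
    print "$expr interpolates $interpolated{$expr}\n";
}
```

Under the stated rule, the first expression should interpolate an
element of @foo, and the second should interpolate scalar $foo followed
by a character class.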

Changes in Syntax
Sometimes the introduction of new syntax changes the way a regular
expression is parsed. For example, the "\v" character class was
introduced in Perl 5.9.5. But it did not represent a syntax error prior
to that version of Perl, it was simply parsed as "v". So

    $ perl -le 'print "v" =~ m/\v/ ? "yes" : "no"'

prints "yes" under Perl 5.8.9, but "no" under 5.10.0. "PPIx::Regexp"
generally assumes the more modern parse in cases like this.

Equivocation
Very occasionally, a construction will be removed and then added back,
and then, conceivably, removed again. In this case, the plan is for
perl_version_introduced() to return the earliest version in which the
construction appeared, and perl_version_removed() to return the version
after the last version in which it appeared (whether production or
development), or "undef" if it is in the highest-numbered Perl.

The constructions involved in this are:

Un-escaped literal left curly after literal

    That is, something like "qr<x{>".

    This was made an error in 5.25.1, and it was an error in 5.26.0.
    But it became a warning again in 5.27.1. The perl5271delta says it
    was reinstated because the changes broke GNU Autoconf, and the
    warning message says it will be removed in Perl 5.30.

    Accordingly, perl_version_introduced() returns 5.0. At the moment
    perl_version_removed() returns '5.025001'. But if it is present
    with or without warning in 5.28, perl_version_removed() will become
    "undef". If you need finer resolution than this, see
    PPIx::Regexp::Element methods accepts_perl() and
    requirements_for_perl().

Static Parsing
It is well known that Perl cannot be statically parsed. That is, you
cannot completely parse a piece of Perl code without executing that
same code.

Nevertheless, this class is trying to statically parse regular
expressions. The main problem with this is that there is no way to know
what is being interpolated into the regular expression by an
interpolated variable. This is a problem because the interpolated value
can change the interpretation of adjacent elements.

This module deals with this by making assumptions about what is in an
interpolated variable. These assumptions will not be enumerated here,
but in general the principle is to assume the interpolated value does
not change the interpretation of the regular expression. For example,

    my $foo = 'a-z]';
    my $re = qr{[$foo};

is fine with the Perl interpreter, but will confuse the dickens out of
this module. Similarly and more usefully, something like

    my $mods = 'i';
    my $re = qr{(?$mods:foo)};

or maybe

    my $mods = 'i';
    my $re = qr{(?$mods)$foo};

probably sets a modifier of some sort, and that is how this module
interprets it. If the interpolation is not about modifiers, this module
will get it wrong. Another such semi-benign example is

    my $foo = $] >= 5.010 ? '?<foo>' : '';
    my $re = qr{($foo\w+)};

which will parse, but this module will never realize that it might be
looking at a named capture.
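
The last case can be made concrete with a sketch that compares an
explicit named capture against the interpolated form. The variable
names are illustrative only.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# The name is visible in the source text of the first expression,
# but hidden inside $foo in the second.
my $explicit = PPIx::Regexp->new( 'qr{(?<foo>\w+)}' )
    or die PPIx::Regexp->errstr();
my $hidden = PPIx::Regexp->new( 'qr{($foo\w+)}' )
    or die PPIx::Regexp->errstr();

my @explicit_names = $explicit->capture_names();
my @hidden_names   = $hidden->capture_names();

print 'Explicit: ', scalar @explicit_names, " capture name(s)\n";
print 'Hidden: ',   scalar @hidden_names,   " capture name(s)\n";
```

Both expressions still contain one capture group; only the name is
lost, because a static parse never sees the value of $foo.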

Non-Standard Syntax
There are modules out there that alter the syntax of Perl. If the
syntax of a regular expression is altered, this module has no way to
understand that it has been altered, much less to adapt to the
alteration. The following modules are known to cause problems:

Acme::PerlML, which renders Perl as XML.

"Data::PostfixDeref", which causes Perl to interpret suffixed empty
brackets as dereferencing the thing they suffix. This module by Ben
Morrow ("BMORROW") appears to have been retracted.

Filter::Trigraph, which recognizes ANSI C trigraphs, allowing Perl to
be written in the ISO 646 character set.

Perl6::Pugs. Enough said.

Perl6::Rules, which back-ports some of the Perl 6 regular expression
syntax to Perl 5.

Regexp::Extended, which extends regular expressions in various ways,
some of which seem to conflict with Perl 5.010.

SEE ALSO
Regexp::Parsertron, which uses Marpa::R2 to parse the regexp, and Tree
for navigation. Unlike "PPIx::Regexp", Regexp::Parsertron supports
modification of the parse tree.

Regexp::Parser, which parses a bare regular expression (without
enclosing "qr{}", "m//", or whatever) and uses a different navigation
model. After a long hiatus, this module has been adopted, and is again
supported.

SUPPORT
Support is by the author. Please file bug reports at
<https://rt.cpan.org>, or by electronic mail to the author.

AUTHOR
Thomas R. Wyant, III wyant at cpan dot org

COPYRIGHT AND LICENSE
Copyright (C) 2009-2020 by Thomas R. Wyant, III

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl 5.10.0. For more details, see the full
text of the licenses in the directory LICENSES.

This program is distributed in the hope that it will be useful, but
without any warranty; without even the implied warranty of
merchantability or fitness for a particular purpose.

perl v5.30.1 2020-02-10 PPIx::Regexp(3)