PPIx::Regexp(3)       User Contributed Perl Documentation      PPIx::Regexp(3)
4
NAME
PPIx::Regexp - Represent a regular expression of some sort

SYNOPSIS
    use PPIx::Regexp;
    use PPIx::Regexp::Dumper;
    my $re = PPIx::Regexp->new( 'qr{foo}smx' );
    PPIx::Regexp::Dumper->new( $re )
        ->print();
14
INHERITANCE
"PPIx::Regexp" is a PPIx::Regexp::Node.

"PPIx::Regexp" has no descendants.
19
DESCRIPTION
The purpose of the PPIx-Regexp package is to parse regular expressions
in a manner similar to the way the PPI package parses Perl. This class
forms the root of the parse tree, playing a role similar to
PPI::Document.

This package shares with PPI the property of being round-trip safe.
That is,

    my $expr = 's/ ( \d+ ) ( \D+ ) /$2$1/smxg';
    my $re = PPIx::Regexp->new( $expr );
    print $re->content() eq $expr ? "yes\n" : "no\n";

should print 'yes' for any valid regular expression.
34
Navigation is similar to that provided by PPI. That is to say, things
like "children", "find_first", "snext_sibling" and so on all work
pretty much the same way as in PPI.

The class hierarchy is also similar to PPI. Except for some utility
classes (the dumper, the lexer, and the tokenizer) all classes are
descended from PPIx::Regexp::Element, which provides basic navigation.
Tokens are descended from PPIx::Regexp::Token, which provides content.
All containers are descended from PPIx::Regexp::Node, which provides
for children, and all structure elements are descended from
PPIx::Regexp::Structure, which provides beginning and ending
delimiters, and a type.
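The navigation and class hierarchy described above can be sketched as
follows. This is a minimal illustration, not a definitive recipe: it
assumes that "find" (inherited from PPIx::Regexp::Node) accepts a full
class name and returns an array reference, and that
PPIx::Regexp::Structure::Capture and PPIx::Regexp::Token::Literal are
the classes used for capture groups and literal characters. The sample
expression is invented for the example.

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $re = PPIx::Regexp->new( 'qr{(foo)(bar)}smx' )
    or die PPIx::Regexp->errstr();

# find() is assumed to return an array reference on success and a
# false value when nothing matches, hence the "|| []".
my $captures = $re->find( 'PPIx::Regexp::Structure::Capture' ) || [];

foreach my $capture ( @{ $captures } ) {
    # Every structure should also be a node (a container).
    printf "%s structure=%s node=%s\n",
        $capture->content(),
        $capture->isa( 'PPIx::Regexp::Structure' ) ? 'yes' : 'no',
        $capture->isa( 'PPIx::Regexp::Node' )      ? 'yes' : 'no';
}

# find_first() returns a single element, here the first literal token.
my $literal = $re->find_first( 'PPIx::Regexp::Token::Literal' );
print 'First literal: ', $literal->content(), "\n"
    if $literal;
```

Passing full class names keeps the example independent of any
shorthand the search methods may or may not accept.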
47
There are two features of PPI that this package does not provide -
mutability and operator overloading. There are no plans for serious
mutability, though something like PPI's "prune" functionality might be
considered. Similarly there are no plans for operator overloading,
which appears to the author to represent a performance hit for little
tangible gain.
54
NOTICE
The author will attempt to preserve the documented interface, but if
the interface needs to change to correct some egregiously bad design or
implementation decision, then it will change. Any incompatible changes
will go through a deprecation cycle.

The goal of this package is to parse well-formed regular expressions
correctly. A secondary goal is not to blow up on ill-formed regular
expressions. The correct identification and characterization of
ill-formed regular expressions is not a goal of this package.

This policy attempts to track features in development releases as well
as public releases. However, features added in a development release
and then removed before the next production release will not be
tracked, and any functionality relating to such features will be
removed. The issue here is the potential re-use (with different
semantics) of syntax that did not make it into the production release.
72
METHODS
This class provides the following public methods. Methods not
documented here are private, and unsupported in the sense that the
author reserves the right to change or remove them without notice.

new
    my $re = PPIx::Regexp->new('/foo/');

This method instantiates a "PPIx::Regexp" object from a string, a
PPI::Token::QuoteLike::Regexp, a PPI::Token::Regexp::Match, or a
PPI::Token::Regexp::Substitute. Honestly, any PPI::Element will do,
but only the three Regexp classes mentioned previously are likely to do
anything useful.
86
Optionally you can pass one or more name/value pairs after the regular
expression. The possible options are:

default_modifiers array_reference
    This option specifies a reference to an array of default modifiers
    to apply to the regular expression being parsed. Each modifier is
    specified as a string. Any actual modifiers found supersede the
    defaults.

    When applying the defaults, '?' and '/' are completely ignored, and
    '^' is ignored unless it occurs at the beginning of the modifier.
    The first dash ('-') causes subsequent modifiers to be negated.

    So, for example, if you wish to produce a "PPIx::Regexp" object
    representing the regular expression in

        use re '/smx';
        {
            no re '/x';
            m/ foo /;
        }

    you would (after some help from PPI in finding the relevant
    statements), do something like

        my $re = PPIx::Regexp->new( 'm/ foo /',
            default_modifiers => [ '/smx', '-/x' ] );

encoding name
    This option specifies the encoding of the regular expression. This
    is passed to the tokenizer, which will "decode" the regular
    expression string before it tokenizes it. For example:

        my $re = PPIx::Regexp->new( '/foo/',
            encoding => 'iso-8859-1',
        );

trace number
    If greater than zero, this option causes trace output from the
    parse. The author reserves the right to change or eliminate this
    without notice.

Passing optional input other than the above is not an error, but
neither is it supported.
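As a sketch of how the "default_modifiers" option plays out, the
following checks which modifiers end up asserted for the 'use re' /
'no re' example given above. It relies only on "modifier_asserted",
documented below; the loop and the output format are illustrative
choices, not part of the module.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Defaults corresponding to "use re '/smx'" followed by "no re '/x'".
my $re = PPIx::Regexp->new( 'm/ foo /',
    default_modifiers => [ '/smx', '-/x' ] )
    or die PPIx::Regexp->errstr();

# Report the effective state of each modifier of interest.
foreach my $modifier ( qw{ s m x } ) {
    printf "/%s is %s\n", $modifier,
        $re->modifier_asserted( $modifier ) ? 'asserted' : 'not asserted';
}
```

Given the rules above, 's' and 'm' would be expected to be reported as
asserted and 'x' as not asserted, since the dash negates it.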
132
new_from_cache
This static method wraps "new" in a caching mechanism. Only one object
will be generated for a given PPI::Element, no matter how many times
this method is called. Calls after the first for a given PPI::Element
simply return the same "PPIx::Regexp" object.

When the "PPIx::Regexp" object is returned from cache, the values of
the optional arguments are ignored.

Calls to this method with the regular expression in a string rather
than a PPI::Element will not be cached.

Caveat: This method is provided for code like Perl::Critic which might
instantiate the same object multiple times. The cache will persist
until "flush_cache" is called.
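A minimal sketch of the caching behavior, assuming PPI is installed
and used to locate the regular expression token. The one-line source
string is made up for the example; real code would normally load a
file into the PPI::Document.

```perl
use strict;
use warnings;
use PPI;
use PPIx::Regexp;

# A tiny in-memory document containing a single match operator.
my $source = 'my $ok = $text =~ m/ ( \d+ ) /smx;';
my $doc = PPI::Document->new( \$source )
    or die 'PPI could not parse the source';

# PPI's find() returns an array reference, or a false value if
# nothing was found.
my $matches = $doc->find( 'PPI::Token::Regexp::Match' ) || [];

foreach my $token ( @{ $matches } ) {
    my $first  = PPIx::Regexp->new_from_cache( $token );
    my $second = PPIx::Regexp->new_from_cache( $token );
    # With no operator overloading, == compares reference addresses.
    print $first == $second ? "same object\n" : "different objects\n";
}

# Release everything once the document has been processed.
PPIx::Regexp->flush_cache();
```

Flushing at the end matters in long-running code, because the cache
otherwise keeps every parsed object alive.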
148
flush_cache
    $re->flush_cache();            # Remove $re from cache
    PPIx::Regexp->flush_cache();   # Empty the cache

This method flushes the cache used by "new_from_cache". If called as a
static method with no arguments, the entire cache is emptied. Otherwise
any objects specified are removed from the cache.
156
capture_names
    foreach my $name ( $re->capture_names() ) {
        print "Capture name '$name'\n";
    }

This convenience method returns the capture names found in the regular
expression.

This method is equivalent to

    $self->regular_expression()->capture_names();

except that if "$self->regular_expression()" returns "undef" (meaning
that something went terribly wrong with the parse) this method will
simply return.
172
delimiters
    print join("\t", PPIx::Regexp->new('s/foo/bar/')->delimiters());
    # prints '// //'

When called in list context, this method returns either one or two
strings, depending on whether the parsed expression has a replacement
string. In the case of non-bracketed substitutions, the start delimiter
of the replacement string is considered to be the same as its finish
delimiter, as illustrated by the above example.

When called in scalar context, you get the delimiters of the regular
expression; that is, element 0 of the array that is returned in list
context.

Optionally, you can pass an index value and the corresponding
delimiters will be returned; index 0 represents the regular
expression's delimiters, and index 1 represents the replacement
string's delimiters, which may be undef. For example,

    print PPIx::Regexp->new('s{foo}<bar>')->delimiters(1);
    # prints '<>'

If the object was not initialized with a valid regexp of some sort, the
results of this method are undefined.
197
errstr
This static method returns the error string from the most recent
attempt to instantiate a "PPIx::Regexp". It will be "undef" if the most
recent attempt succeeded.

failures
    print "There were ", $re->failures(), " parse failures\n";

This method returns the number of parse failures. This is a count of
the number of unknown tokens plus the number of unterminated structures
plus the number of unmatched right brackets of any sort.
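The two methods above can be combined into a simple health check. This
is a sketch only: the two sample expressions are invented, and the
second is deliberately left with an unterminated group on the
assumption that it is counted as an unterminated structure.

```perl
use strict;
use warnings;
use PPIx::Regexp;

foreach my $expr ( '/(foo)/', '/(foo/' ) {
    my $re = PPIx::Regexp->new( $expr );
    if ( ! defined $re ) {
        # Instantiation itself failed; errstr() says why.
        print "$expr could not be instantiated: ",
            PPIx::Regexp->errstr(), "\n";
    } elsif ( my $count = $re->failures() ) {
        # An object was built, but the parse was not clean.
        print "$expr parsed with $count failure(s)\n";
    } else {
        print "$expr parsed cleanly\n";
    }
}
```

Note that a defined object is not proof of a clean parse, which is why
"failures" is checked separately from "errstr".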
209
max_capture_number
    print "Highest used capture number ",
        $re->max_capture_number(), "\n";

This convenience method returns the highest capture number used by the
regular expression. If there are no captures, the return will be 0.

This method is equivalent to

    $self->regular_expression()->max_capture_number();

except that if "$self->regular_expression()" returns "undef" (meaning
that something went terribly wrong with the parse) this method will
too.
224
modifier
    my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
    print $re->modifier()->content(), "\n";
    # prints 'smx'.

This method retrieves the modifier of the object. This comes from the
end of the initializing string or object and will be a
PPIx::Regexp::Token::Modifier.

Note that this object represents the actual modifiers present on the
regexp, and does not take into account any that may have been applied
by default (i.e. via the "default_modifiers" argument to "new()"). For
something that takes account of default modifiers, see
modifier_asserted(), below.

In the event of a parse failure, there may not be a modifier present,
in which case nothing is returned.
242
modifier_asserted
    my $re = PPIx::Regexp->new( '/ . /',
        default_modifiers => [ 'smx' ] );
    print $re->modifier_asserted( 'x' ) ? "yes\n" : "no\n";
    # prints 'yes'.

This method returns true if the given modifier is asserted for the
regexp, whether explicitly or by the modifiers passed in the
"default_modifiers" argument.
252
regular_expression
    my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
    print $re->regular_expression()->content(), "\n";
    # prints '/(foo)/'.

This method returns that portion of the object which actually
represents a regular expression.

replacement
    my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
    print $re->replacement()->content(), "\n";
    # prints '${1}bar/'.

This method returns that portion of the object which represents the
replacement string. This will be "undef" unless the regular expression
actually has a replacement string. Delimiters will be included, but
there will be no beginning delimiter unless the regular expression was
bracketed.
271
source
    my $source = $re->source();

This method returns the object or string that was used to instantiate
the object.

type
    my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
    print $re->type()->content(), "\n";
    # prints 's'.

This method retrieves the type of the object. This comes from the
beginning of the initializing string or object, and will be a
PPIx::Regexp::Token::Structure whose "content" is one of 's', 'm',
'qr', or ''.
287
RESTRICTIONS
By the nature of this module, it is never going to get everything
right. Many of the known problem areas involve interpolations one way
or another.

Ambiguous Syntax
Perl's regular expressions contain cases where the syntax is ambiguous.
A particularly egregious example is an interpolation followed by square
or curly brackets, for example $foo[...]. There is nothing in the
syntax to say whether the programmer wanted to interpolate an element
of array @foo, or wanted to interpolate scalar $foo and then follow
that interpolation by a character class.

The perlop documentation notes that in this case what Perl does is to
guess. That is, it employs various heuristics on the code to try to
figure out what the programmer wanted. These heuristics are documented
as being undocumented (!) and subject to change without notice.

Given this situation, this module's chances of duplicating every Perl
version's interpretation of every regular expression are pretty much
nil. What it does now is to assume that square brackets containing
only an integer or an interpolation represent a subscript; otherwise
they represent a character class. Similarly, curly brackets containing
only a bareword or an interpolation are a subscript; otherwise they
represent a quantifier.
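The subscript rules just described can be observed with a sketch like
the one below. It assumes that interpolations are represented by the
class PPIx::Regexp::Token::Interpolation, and that a bracket taken as
a subscript becomes part of that token's content. The four sample
expressions are invented to exercise each rule.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Single-quoted so that Perl itself does not interpolate anything.
my @samples = ( '/$foo[1]/', '/$foo[a-z]/', '/$foo{bar}/', '/$foo{1,3}/' );

foreach my $expr ( @samples ) {
    my $re = PPIx::Regexp->new( $expr )
        or die PPIx::Regexp->errstr();
    my $interp = $re->find_first( 'PPIx::Regexp::Token::Interpolation' )
        or next;
    # Whatever follows the variable in content() was taken as a subscript.
    printf "%-12s interpolates %s\n", $expr, $interp->content();
}
```

Under the stated rules one would expect the first and third samples to
keep their brackets as subscripts, and the second and fourth to
interpolate plain $foo.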
313
Changes in Syntax
Sometimes the introduction of new syntax changes the way a regular
expression is parsed. For example, the "\v" character class was
introduced in Perl 5.9.5. But it did not represent a syntax error prior
to that version of Perl, it was simply parsed as "v". So

    $ perl -le 'print "v" =~ m/\v/ ? "yes" : "no"'

prints "yes" under Perl 5.8.9, but "no" under 5.10.0. "PPIx::Regexp"
generally assumes the more modern parse in cases like this.
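For comparison, here is a small sketch of what the more modern parse
of "\v" means in practice. It uses only core Perl and assumes Perl
5.10 or later; the sample strings and their labels are made up for
the illustration.

```perl
use strict;
use warnings;

# Candidate strings, each one character long.
my %sample = (
    'letter v'     => 'v',
    'vertical tab' => "\x0B",
    'line feed'    => "\n",
);

foreach my $name ( sort keys %sample ) {
    # \v is the vertical white space class on 5.10 and later.
    printf "%-12s %s\n", $name,
        $sample{$name} =~ m/\A\v\z/ ? 'matches \v' : 'does not match \v';
}
```

On such a Perl the letter should not match, while the two vertical
white space characters should.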
324
Static Parsing
It is well known that Perl cannot be statically parsed. That is, you
cannot completely parse a piece of Perl code without executing that
same code.

Nevertheless, this class is trying to statically parse regular
expressions. The main problem with this is that there is no way to know
what is being interpolated into the regular expression by an
interpolated variable. This is a problem because the interpolated value
can change the interpretation of adjacent elements.

This module deals with this by making assumptions about what is in an
interpolated variable. These assumptions will not be enumerated here,
but in general the principle is to assume the interpolated value does
not change the interpretation of the regular expression. For example,

    my $foo = 'a-z]';
    my $re = qr{[$foo};

is fine with the Perl interpreter, but will confuse the dickens out of
this module. Similarly and more usefully, something like

    my $mods = 'i';
    my $re = qr{(?$mods:foo)};

or maybe

    my $mods = 'i';
    my $re = qr{(?$mods)$foo};

probably sets a modifier of some sort, and that is how this module
interprets it. If the interpolation is not about modifiers, this module
will get it wrong. Another such semi-benign example is

    my $foo = $] >= 5.010 ? '?<foo>' : '';
    my $re = qr{($foo\w+)};

which will parse, but this module will never realize that it might be
looking at a named capture.
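The last case can be made concrete with a sketch that compares the
interpolated form against an explicit named capture. Only methods
documented above are used; the second expression is added here purely
for contrast and does not come from the example above.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Single-quoted so that $foo reaches the parser uninterpolated.
foreach my $expr ( 'qr{($foo\w+)}', 'qr{(?<foo>\w+)}' ) {
    my $re = PPIx::Regexp->new( $expr )
        or die PPIx::Regexp->errstr();
    my @names = $re->capture_names();
    printf "%-18s failures=%d names=[%s] max capture=%d\n",
        $expr, $re->failures(), join( ',', @names ),
        $re->max_capture_number();
}
```

Both forms should parse without failures and report one capture group,
but only the explicit form is expected to yield a capture name.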
364
Non-Standard Syntax
There are modules out there that alter the syntax of Perl. If the
syntax of a regular expression is altered, this module has no way to
understand that it has been altered, much less to adapt to the
alteration. The following modules are known to cause problems:

Acme::PerlML, which renders Perl as XML.

Data::PostfixDeref, which causes Perl to interpret suffixed empty
brackets as dereferencing the thing they suffix.

Filter::Trigraph, which recognizes ANSI C trigraphs, allowing Perl to
be written in the ISO 646 character set.

Perl6::Pugs. Enough said.

Perl6::Rules, which back-ports some of the Perl 6 regular expression
syntax to Perl 5.

Regexp::Extended, which extends regular expressions in various ways,
some of which seem to conflict with Perl 5.010.
386
SEE ALSO
Regexp::Parser, which parses a bare regular expression (without
enclosing "qr{}", "m//", or whatever) and uses a different navigation
model.

SUPPORT
Support is by the author. Please file bug reports at
<http://rt.cpan.org>, or in electronic mail to the author.
395
AUTHOR
Thomas R. Wyant, III wyant at cpan dot org

COPYRIGHT AND LICENSE
Copyright (C) 2009-2013 by Thomas R. Wyant, III

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl 5.10.0. For more details, see the full
text of the licenses in the directory LICENSES.

This program is distributed in the hope that it will be useful, but
without any warranty; without even the implied warranty of
merchantability or fitness for a particular purpose.
409
perl v5.16.3                      2014-06-10                   PPIx::Regexp(3)