PPIx::Regexp(3)       User Contributed Perl Documentation      PPIx::Regexp(3)

NAME
       PPIx::Regexp - Represent a regular expression of some sort

SYNOPSIS
           use PPIx::Regexp;
           use PPIx::Regexp::Dumper;
           my $re = PPIx::Regexp->new( 'qr{foo}smx' );
           PPIx::Regexp::Dumper->new( $re )
               ->print();

INHERITANCE
       "PPIx::Regexp" is a PPIx::Regexp::Node.

       "PPIx::Regexp" has no descendants.

DESCRIPTION
       The purpose of the PPIx-Regexp package is to parse regular expressions
       in a manner similar to the way the PPI package parses Perl. This class
       forms the root of the parse tree, playing a role similar to
       PPI::Document.

       This package shares with PPI the property of being round-trip safe.
       That is,

           my $expr = 's/ ( \d+ ) ( \D+ ) /$2$1/smxg';
           my $re = PPIx::Regexp->new( $expr );
           print $re->content() eq $expr ? "yes\n" : "no\n"

       should print 'yes' for any valid regular expression.
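
       The round-trip property can be checked over several expressions at
       once. The following is a minimal sketch; the sample expressions and
       the $round_trips counter are chosen for illustration and are not part
       of the package.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Sample expressions chosen for illustration only.
my @samples = (
    'qr{foo}smx',
    'm/ \A \s* (\w+) /smx',
    's/ ( \d+ ) ( \D+ ) /$2$1/smxg',
);

my $round_trips = 0;
foreach my $expr ( @samples ) {
    my $re = PPIx::Regexp->new( $expr )
        or die "Could not parse '$expr': ",
            ( PPIx::Regexp->errstr() // 'unknown error' );
    # content() reassembles the text from the parse tree.
    my $same = $re->content() eq $expr;
    $round_trips++ if $same;
    printf "%-34s %s\n", $expr, $same ? 'yes' : 'no';
}
```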

       Navigation is similar to that provided by PPI. That is to say, things
       like "children", "find_first", "snext_sibling" and so on all work
       pretty much the same way as in PPI.

       The class hierarchy is also similar to PPI. Except for some utility
       classes (the dumper, the lexer, and the tokenizer) all classes are
       descended from PPIx::Regexp::Element, which provides basic navigation.
       Tokens are descended from PPIx::Regexp::Token, which provides content.
       All containers are descended from PPIx::Regexp::Node, which provides
       for children, and all structure elements are descended from
       PPIx::Regexp::Structure, which provides beginning and ending
       delimiters, and a type.
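
       As a rough illustration of navigation and the class hierarchy, the
       sketch below walks a parse tree and prints the class of each element.
       The walk() helper is hypothetical, not part of the package, and it
       assumes that "find_first" accepts a fully-qualified class name as it
       does in PPI. Because only children are visited, the delimiters that
       belong to a structure appear in that structure's content rather than
       on lines of their own.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Hypothetical helper, not part of the package: print the class and
# content of an element, then recurse into its children if it has any.
sub walk {
    my ( $elem, $depth ) = @_;
    print q{  } x $depth, ref $elem, "\t", $elem->content(), "\n";
    if ( $elem->isa( 'PPIx::Regexp::Node' ) ) {
        walk( $_, $depth + 1 ) foreach $elem->children();
    }
    return;
}

my $re = PPIx::Regexp->new( 'qr{(foo)+}smx' )
    or die 'Parse failed';
walk( $re, 0 );

# Assumption: a class name selects the first descendant of that class.
my $capture = $re->find_first( 'PPIx::Regexp::Structure::Capture' );
print 'First capture group: ', $capture->content(), "\n" if $capture;
```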

       There are two features of PPI that this package does not provide:
       mutability and operator overloading. There are no plans for serious
       mutability, though something like PPI's "prune" functionality might be
       considered. Similarly there are no plans for operator overloading,
       which appears to the author to represent a performance hit for little
       tangible gain.

NOTICE
       This is alpha code. The author will attempt to preserve the documented
       interface, but if the interface needs to change to correct some
       egregiously bad design or implementation decision, then it will change.

       The goal of this package is to parse well-formed regular expressions
       correctly. A secondary goal is not to blow up on ill-formed regular
       expressions. The correct identification and characterization of
       ill-formed regular expressions is not a goal of this package.

METHODS
       This class provides the following public methods. Methods not
       documented here are private, and unsupported in the sense that the
       author reserves the right to change or remove them without notice.

   new
           my $re = PPIx::Regexp->new('/foo/');

       This method instantiates a "PPIx::Regexp" object from a string, a
       PPI::Token::QuoteLike::Regexp, a PPI::Token::Regexp::Match, or a
       PPI::Token::Regexp::Substitute. Honestly, any PPI::Element will do,
       but only the three Regexp classes mentioned previously are likely to do
       anything useful.

       Optionally you can pass one or more name/value pairs after the regular
       expression. The possible options are:

       encoding name
           This option specifies the encoding of the regular expression. This
           is passed to the tokenizer, which will "decode" the regular
           expression string before it tokenizes it. For example:

               my $re = PPIx::Regexp->new( '/foo/',
                   encoding => 'iso-8859-1',
               );

       trace number
           If greater than zero, this option causes trace output from the
           parse. The author reserves the right to change or eliminate this
           without notice.

       Passing optional input other than the above is not an error, but
       neither is it supported.

   new_from_cache
       This static method wraps "new" in a caching mechanism. Only one object
       will be generated for a given PPI::Element, no matter how many times
       this method is called. Calls after the first for a given PPI::Element
       simply return the same "PPIx::Regexp" object.

       When the "PPIx::Regexp" object is returned from cache, the values of
       the optional arguments are ignored.

       Calls to this method with the regular expression in a string rather
       than a PPI::Element will not be cached.

       Caveat: This method is provided for code like Perl::Critic which might
       instantiate the same object multiple times. The cache will persist
       until "flush_cache" is called.

   flush_cache
           $re->flush_cache();            # Remove $re from cache
           PPIx::Regexp->flush_cache();   # Empty the cache

       This method flushes the cache used by "new_from_cache". If called as a
       static method with no arguments, the entire cache is emptied. Otherwise
       any objects specified are removed from the cache.
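
       The caching behavior described above might be exercised as follows.
       This is a sketch only: it assumes the PPI package is installed, and
       the sample code in $code is invented for illustration.

```perl
use strict;
use warnings;
use PPI;
use PPIx::Regexp;
use Scalar::Util qw{ refaddr };

# Invented sample code containing one match operator.
my $code = 'my $ok = $text =~ m/ \A foo /smx;';
my $doc = PPI::Document->new( \$code )
    or die 'PPI could not parse the sample code';
my $token = $doc->find_first( 'PPI::Token::Regexp::Match' )
    or die 'No match operator found';

# Two requests for the same PPI::Element should yield one object.
my $first = PPIx::Regexp->new_from_cache( $token )
    or die 'Parse failed';
my $second = PPIx::Regexp->new_from_cache( $token )
    or die 'Parse failed';
print "Both calls returned the same object\n"
    if refaddr( $first ) == refaddr( $second );

# The cache persists until flushed, so empty it when done.
PPIx::Regexp->flush_cache();
```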

   capture_names
           foreach my $name ( $re->capture_names() ) {
               print "Capture name '$name'\n";
           }

       This convenience method returns the capture names found in the regular
       expression.

       This method is equivalent to

           $self->regular_expression()->capture_names();

       except that if "$self->regular_expression()" returns "undef" (meaning
       that something went terribly wrong with the parse) this method will
       simply return.

   delimiters
           print join("\t", PPIx::Regexp->new('s/foo/bar/')->delimiters());
           # prints '// //'

       When called in list context, this method returns either one or two
       strings, depending on whether the parsed expression has a replacement
       string. In the case of non-bracketed substitutions, the start delimiter
       of the replacement string is considered to be the same as its finish
       delimiter, as illustrated by the above example.

       When called in scalar context, you get the delimiters of the regular
       expression; that is, element 0 of the array that is returned in list
       context.

       Optionally, you can pass an index value and the corresponding
       delimiters will be returned; index 0 represents the regular
       expression's delimiters, and index 1 represents the replacement
       string's delimiters, which may be undef. For example,

           print PPIx::Regexp->new('s{foo}<bar>')->delimiters(1);
           # prints '<>'

       If the object was not initialized with a valid regexp of some sort, the
       results of this method are undefined.

   errstr
       This static method returns the error string from the most recent
       attempt to instantiate a "PPIx::Regexp". It will be "undef" if the most
       recent attempt succeeded.

   failures
           print "There were ", $re->failures(), " parse failures\n";

       This method returns the number of parse failures. This is a count of
       the number of unknown tokens plus the number of unterminated structures
       plus the number of unmatched right brackets of any sort.
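
       The "errstr" and "failures" methods can be combined to report how well
       an expression parsed. The sketch below uses a hypothetical
       parse_status() helper, and it assumes that "new" returns a false value
       when instantiation fails. The second sample expression is deliberately
       ill-formed; what is reported for it is left to the module.

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Hypothetical helper: describe how well a candidate expression parsed.
sub parse_status {
    my ( $source ) = @_;
    my $re = PPIx::Regexp->new( $source );
    if ( ! $re ) {
        # Assumption: a failed instantiation returns false and sets errstr.
        my $why = PPIx::Regexp->errstr();
        return 'not instantiated: '
            . ( defined $why ? $why : 'no error string' );
    }
    my $failures = $re->failures();
    return $failures
        ? "parsed with $failures failure(s)"
        : 'parsed cleanly';
}

print "$_\t", parse_status( $_ ), "\n"
    foreach 'qr{foo}smx', 'm/(foo/';
```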

   max_capture_number
           print "Highest used capture number ",
               $re->max_capture_number(), "\n";

       This convenience method returns the highest capture number used by the
       regular expression. If there are no captures, the return will be 0.

       This method is equivalent to

           $self->regular_expression()->max_capture_number();

       except that if "$self->regular_expression()" returns "undef" (meaning
       that something went terribly wrong with the parse) this method will
       too.

   modifier
           my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
           print $re->modifier()->content(), "\n";
           # prints 'smx'.

       This method retrieves the modifier of the object. This comes from the
       end of the initializing string or object and will be a
       PPIx::Regexp::Token::Modifier.

       In the event of a parse failure, there may not be a modifier present,
       in which case nothing is returned.

   regular_expression
           my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
           print $re->regular_expression()->content(), "\n";
           # prints '/(foo)/'.

       This method returns that portion of the object which actually
       represents a regular expression.

   replacement
           my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
           print $re->replacement()->content(), "\n";
           # prints '${1}bar/'.

       This method returns that portion of the object which represents the
       replacement string. This will be "undef" unless the regular expression
       actually has a replacement string. Delimiters will be included, but
       there will be no beginning delimiter unless the regular expression was
       bracketed.

   source
           my $source = $re->source();

       This method returns the object or string that was used to instantiate
       the object.

   type
           my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
           print $re->type()->content(), "\n";
           # prints 's'.

       This method retrieves the type of the object. This comes from the
       beginning of the initializing string or object, and will be a
       PPIx::Regexp::Token::Structure whose "content" is one of 's', 'm',
       'qr', or ''.

RESTRICTIONS
       By the nature of this module, it is never going to get everything
       right. Many of the known problem areas involve interpolations one way
       or another.

   Ambiguous Syntax
       Perl's regular expressions contain cases where the syntax is ambiguous.
       A particularly egregious example is an interpolation followed by square
       or curly brackets, for example $foo[...]. There is nothing in the
       syntax to say whether the programmer wanted to interpolate an element
       of array @foo, or to interpolate scalar $foo and then follow that
       interpolation by a character class.

       The perlop documentation notes that in this case what Perl does is to
       guess. That is, it employs various heuristics on the code to try to
       figure out what the programmer wanted. These heuristics are documented
       as being undocumented (!) and subject to change without notice.

       Given this situation, this module's chances of duplicating every Perl
       version's interpretation of every regular expression are pretty much
       nil. What it does now is to assume that square brackets containing
       only an integer or an interpolation represent a subscript; otherwise
       they represent a character class. Similarly, curly brackets containing
       only a bareword or an interpolation are a subscript; otherwise they
       represent a quantifier.
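
       The rule just described can be sketched in a few lines of plain Perl.
       This is an illustration of the stated rule, not the module's actual
       code. The classify_bracket() helper is hypothetical, and it makes two
       simplifying assumptions: an integer may carry a leading minus sign,
       and an interpolation is a simple scalar such as $i.

```perl
use strict;
use warnings;

# Hypothetical helper: given the bracketed text that follows an
# interpolated variable, say how the rule above would treat it.
sub classify_bracket {
    my ( $bracketed ) = @_;
    my ( $open, $inner ) =
        $bracketed =~ m/ \A ( [\[{] ) ( .* ) [\]}] \z /smx
        or return 'unknown';

    # Assumption: an interpolation is a simple scalar like $i.
    my $interpolation = qr/ \$ \w+ /smx;

    if ( $open eq '[' ) {
        # Only an integer or an interpolation: treat as a subscript.
        return $inner =~ m/ \A (?: -? \d+ | $interpolation ) \z /smx
            ? 'subscript'
            : 'character class';
    }

    # Curly brackets: only a bareword or an interpolation is a subscript.
    return $inner =~ m/ \A (?: [^\W\d] \w* | $interpolation ) \z /smx
        ? 'subscript'
        : 'quantifier';
}

foreach my $text ( '[0]', '[$i]', '[a-z]', '{bar}', '{$key}', '{2,3}' ) {
    print "$text\t", classify_bracket( $text ), "\n";
}
```

       Under these assumptions '[0]', '[$i]', '{bar}' and '{$key}' are
       reported as subscripts, '[a-z]' as a character class, and '{2,3}' as a
       quantifier.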

   Static Parsing
       It is well known that Perl cannot be statically parsed. That is, you
       cannot completely parse a piece of Perl code without executing that
       same code.

       Nevertheless, this class is trying to statically parse regular
       expressions. The main problem with this is that there is no way to know
       what is being interpolated into the regular expression by an
       interpolated variable. This is a problem because the interpolated value
       can change the interpretation of adjacent elements.

       This module deals with this by making assumptions about what is in an
       interpolated variable. These assumptions will not be enumerated here,
       but in general the principle is to assume the interpolated value does
       not change the interpretation of the regular expression. For example,

           my $foo = 'a-z]';
           my $re = qr{[$foo};

       is fine with the Perl interpreter, but will confuse the dickens out of
       this module. Similarly and more usefully, something like

           my $mods = 'i';
           my $re = qr{(?$mods:foo)};

       or maybe

           my $mods = 'i';
           my $re = qr{(?$mods)$foo};

       probably sets a modifier of some sort, and that is how this module
       interprets it. If the interpolation is not about modifiers, this module
       will get it wrong. Another such semi-benign example is

           my $foo = $] >= 5.010 ? '?<foo>' : '';
           my $re = qr{($foo\w+)};

       which will parse, but this module will never realize that it might be
       looking at a named capture.
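
       The examples above can be run under the Perl interpreter itself to see
       why a static parse cannot know what they mean. The sketch below uses
       only core Perl, with variable names invented for illustration, and
       needs Perl 5.10 or later for the named capture. In each case the
       meaning of the pattern depends on a value that exists only at run
       time.

```perl
use strict;
use warnings;

# The character class is closed by the interpolated text, not by the
# pattern as written.
my $foo = 'a-z]';
my $class_re = qr{[$foo};
print "'m' matches the completed class\n" if 'm' =~ $class_re;

# The modifier only becomes known when the pattern is compiled.
my $mods = 'i';
my $mods_re = qr{(?$mods:foo)};
print "'FOO' matches case-insensitively\n" if 'FOO' =~ $mods_re;

# The group only becomes a named capture after interpolation.
my $name = '?<word>';
my $named_re = qr{($name\w+)};
print "Named capture holds '$+{word}'\n" if 'hello' =~ $named_re;
```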

   Non-Standard Syntax
       There are modules out there that alter the syntax of Perl. If the
       syntax of a regular expression is altered, this module has no way to
       understand that it has been altered, much less to adapt to the
       alteration. The following modules are known to cause problems:

       Acme::PerlML, which renders Perl as XML.

       Data::PostfixDeref, which causes Perl to interpret suffixed empty
       brackets as dereferencing the thing they suffix.

       Filter::Trigraph, which recognizes ANSI C trigraphs, allowing Perl to
       be written in the ISO 646 character set.

       Perl6::Pugs. Enough said.

       Perl6::Rules, which back-ports some of the Perl 6 regular expression
       syntax to Perl 5.

       Regexp::Extended, which extends regular expressions in various ways,
       some of which seem to conflict with Perl 5.010.

SEE ALSO
       Regexp::Parser, which parses a bare regular expression (without
       enclosing "qr{}", "m//", or whatever) and uses a different navigation
       model.

SUPPORT
       Support is by the author. Please file bug reports at
       <http://rt.cpan.org>, or by electronic mail to the author.

AUTHOR
       Thomas R. Wyant, III wyant at cpan dot org

COPYRIGHT
       Copyright (C) 2009-2010, Thomas R. Wyant, III

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl 5.10.0. For more details, see the full
       text of the licenses in the directory LICENSES.

       This program is distributed in the hope that it will be useful, but
       without any warranty; without even the implied warranty of
       merchantability or fitness for a particular purpose.

perl v5.12.0                      2010-06-08                   PPIx::Regexp(3)