PPIx::Regexp::Tokenizer(3)  User Contributed Perl Documentation  PPIx::Regexp::Tokenizer(3)

NAME
PPIx::Regexp::Tokenizer - Tokenize a regular expression

SYNOPSIS
    use PPIx::Regexp::Dumper;
    PPIx::Regexp::Dumper->new( 'qr{foo}smx' )
        ->print();

INHERITANCE
"PPIx::Regexp::Tokenizer" is a PPIx::Regexp::Support.

"PPIx::Regexp::Tokenizer" has no descendants.

DESCRIPTION
This class provides tokenization of the regular expression.

METHODS
This class provides the following public methods. Methods not
documented here (or documented below under "EXTERNAL TOKENIZERS") are
private, and unsupported in the sense that the author reserves the
right to change or remove them without notice.

new
    my $tokenizer = PPIx::Regexp::Tokenizer->new( 'xyzzy' );

This static method instantiates the tokenizer. You must pass it the
regular expression to be parsed, either as a string or as a
PPI::Element of some sort. You can also pass optional name/value pairs
of arguments. The option names are specified without a leading dash.
Supported options are:

  default_modifiers array_reference
    This argument specifies default statement modifiers. It is
    optional, but if specified must be an array reference. See the
    PPIx::Regexp new() documentation for the details.

  encoding name
    This option specifies the encoding of the string to be tokenized.
    If specified, an "Encode::decode" is done on the string (or the
    "content" of the PPI class) before it is tokenized.

  trace number
    Specifying a positive value for this option causes a trace of the
    tokenization. This option is unsupported in the sense that the
    author reserves the right to alter it without notice.

    If this option is unspecified, the value comes from environment
    variable "PPIX_REGEXP_TOKENIZER_TRACE" (see "ENVIRONMENT
    VARIABLES"). If this environment variable does not exist, the
    default is 0.

Undocumented options are unsupported.

The returned value is the instantiated tokenizer, or "undef" if
instantiation failed. In the latter case a call to "errstr" will return
the reason.
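
The handling of the "encoding" and "trace" options described above can
be sketched as follows. This is only an illustration of the documented
behavior, not the module's code: resolve_options() is a hypothetical
helper, and only "Encode::decode" and the environment variable name are
taken from the text.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of PPIx::Regexp::Tokenizer. It only
# mirrors the option handling described in the documentation.
sub resolve_options {
    my ( $content, %arg ) = @_;

    # "encoding": decode the raw string before it is tokenized.
    $content = Encode::decode( $arg{encoding}, $content )
        if defined $arg{encoding};

    # "trace": explicit option first, then the environment, then 0.
    my $trace = defined $arg{trace} ? $arg{trace}
        : defined $ENV{PPIX_REGEXP_TOKENIZER_TRACE}
            ? $ENV{PPIX_REGEXP_TOKENIZER_TRACE}
        : 0;

    return ( $content, $trace );
}

# Five UTF-8 octets decode to four characters.
my ( $text, $trace ) = resolve_options( "caf\xC3\xA9", encoding => 'UTF-8' );
print length( $text ), " characters, trace level $trace\n";
```

The point of the sketch is the precedence: an explicit "trace" value
wins over the environment variable, which in turn wins over the default.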

content
    print $tokenizer->content();

This method returns the string being tokenized. This will be the result
of the PPI::Element->content() method if the object was instantiated
with a PPI::Element.

default_modifiers
    print join ', ', @{ $tokenizer->default_modifiers() };

This method returns a reference to a copy of the array passed to the
"default_modifiers" argument to new(). If this argument was not used to
instantiate the object, the return is a reference to an empty array.

encoding
This method returns the encoding of the data being parsed, if one was
set when the class was instantiated; otherwise it simply returns undef.

errstr
    my $tokenizer = PPIx::Regexp::Tokenizer->new( 'xyzzy' )
        or die PPIx::Regexp::Tokenizer->errstr();

This static method returns an error description if tokenizer
instantiation failed.

failures
    print $tokenizer->failures(), " tokenization failures\n";

This method returns the number of tokenization failures encountered. A
tokenization failure is represented in the output token stream by a
PPIx::Regexp::Token::Unknown.

modifier
    $tokenizer->modifier( 'x' )
        and print "Tokenizing an extended regular expression\n";

This method returns true if the given modifier character was found on
the end of the regular expression, and false otherwise.

next_token
    my $token = $tokenizer->next_token();

This method returns the next token in the token stream, or nothing if
there are no more tokens.

significant
This method exists simply for the convenience of PPIx::Regexp::Dumper.
It always returns true.

tokens
    my @tokens = $tokenizer->tokens();

This method returns all remaining tokens in the token stream.
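
The public methods above can be combined into a small token dump. The
following is a sketch that assumes PPIx::Regexp is installed (the guard
simply skips the dump where it is not). dump_tokens() is a hypothetical
helper, and calling content() on each token is an assumption about the
token classes, which are documented elsewhere.

```perl
use strict;
use warnings;

# Hypothetical helper: print the class and text of each token and
# return the number of tokens seen.
sub dump_tokens {
    my ( $regexp ) = @_;
    my $tokenizer = PPIx::Regexp::Tokenizer->new( $regexp )
        or die PPIx::Regexp::Tokenizer->errstr();

    my $count = 0;
    # next_token() returns nothing once the stream is exhausted.
    while ( my $token = $tokenizer->next_token() ) {
        printf "%-40s %s\n", ref( $token ), $token->content();
        $count++;
    }

    print $tokenizer->failures(), " tokenization failures\n";
    return $count;
}

# Only attempt the dump if the module can actually be loaded.
my $have_module = eval { require PPIx::Regexp::Tokenizer; 1 };
dump_tokens( 'qr{ \A (foo) }smx' ) if $have_module;
```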

EXTERNAL TOKENIZERS
This class does very little of its own tokenization. Instead the token
classes contain external tokenization routines, whose name is
'__PPIX_TOKENIZER__' concatenated with the current mode of the
tokenizer ('regexp' for regular expressions, 'repl' for the replacement
string).

These external tokenizers are called as static methods, and passed the
"PPIx::Regexp::Tokenizer" object and the current character in the
character stream.

If the external tokenizer wants to make one or more tokens, it returns
an array containing either the length in characters for tokens of the
tokenizer's own class, or the results of one or more "make_token" calls
for tokens of an arbitrary class.

If the external tokenizer is not interested in the characters starting
at the current position it simply returns.
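
The calling convention just described might look like this in a token
class. This is a toy sketch: My::Token::Word and the hash standing in
for the tokenizer object are hypothetical, and only the method naming
rule and the two possible outcomes (a length, or a bare return) come
from the text.

```perl
use strict;
use warnings;

# Toy token class. The method name is '__PPIX_TOKENIZER__' plus the
# tokenizer mode; everything else in this class is hypothetical.
package My::Token::Word;

sub __PPIX_TOKENIZER__regexp {
    my ( $class, $tokenizer, $character ) = @_;

    # Not interested unless the current character is a word character.
    $character =~ m/ \A \w /smx
        or return;

    # Claim the whole run of word characters by returning its length.
    my ( $word ) = $tokenizer->{rest} =~ m/ \A ( \w+ ) /smx;
    return length $word;
}

package main;

# Stand-in for the tokenizer object: just enough state for the call.
my $tokenizer = { mode => 'regexp', rest => 'foo+bar' };
my $method    = "__PPIX_TOKENIZER__$tokenizer->{mode}";

# Called as a static method with the tokenizer and current character.
my @result = My::Token::Word->$method(
    $tokenizer, substr( $tokenizer->{rest}, 0, 1 ) );
print "claimed @result character(s)\n";
```

A real external tokenizer would use the methods documented below (for
example "find_regexp") rather than reaching into the object.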

The following methods are for the use of external tokenizers, and are
not part of the public interface to this class.

capture
    if ( $tokenizer->find_regexp( qr{ \A ( foo ) }smx ) ) {
        foreach ( $tokenizer->capture() ) {
            print "$_\n";
        }
    }

This method returns all the contents of any capture buffers from the
previous call to "find_regexp". The first element of the array (i.e.
element 0) corresponds to $1, and so on.

The captures are cleared by "make_token", as well as by another call to
"find_regexp".

cookie
    $tokenizer->cookie( foo => sub { 1 } );
    my $cookie = $tokenizer->cookie( 'foo' );
    my $old_hint = $tokenizer->cookie( foo => undef );

This method either creates, deletes, or accesses a cookie.

A cookie is a code reference which is called whenever the tokenizer
makes a token. If it returns a false value, it is deleted. Explicitly
setting the cookie to "undef" also deletes it.

When you call "$tokenizer->cookie( 'foo' )", the current cookie is
returned. If you pass a new value of "undef" to delete the cookie, the
deleted cookie (if any) is returned.

When the "make_token" method calls a cookie, it passes it the tokenizer
and the token just made. If a token calls a cookie, it is recommended
that it merely pass the tokenizer, though of course the token can do
whatever it wants.

The cookie mechanism seems to be a bit of a crock, but it appeared to
be more work to fix things up in the lexer after the tokenizer got
something wrong.

The recommended way to write a cookie is to use a closure to store any
necessary data, and have a call to the cookie return the data;
otherwise the ultimate consumer of the cookie has no way to access the
data. Of course, it may be that the presence of the cookie at a certain
point in the parse is all that is required.
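
The cookie life cycle can be modeled in a few lines. This is a sketch
under stated assumptions: the %cookie hash, cookie() and run_cookies()
are hypothetical stand-ins for the tokenizer's internals, written only
to show the set/fetch/delete rules and the closure idiom recommended
above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the tokenizer's cookie storage.
my %cookie;

# Set, fetch, or (with an explicit undef) delete a cookie by name.
sub cookie {
    my ( $name, @value ) = @_;
    @value
        or return $cookie{$name};           # fetch the current cookie
    defined $value[0]
        or return delete $cookie{$name};    # delete, returning the old one
    return ( $cookie{$name} = $value[0] );  # set
}

# Called once per token made. A cookie that returns false is dropped.
# The real call passes the tokenizer first; undef stands in for it here.
sub run_cookies {
    my ( $token ) = @_;
    foreach my $name ( sort keys %cookie ) {
        $cookie{$name}->( undef, $token )
            or delete $cookie{$name};
    }
    return;
}

# The closure carries the data; calling the cookie hands it back.
my $depth = 2;
cookie( nest => sub { return $depth-- } );

run_cookies( $_ ) for qw{ a b c };
my $state = exists $cookie{nest} ? 'survives' : 'deleted';
print "cookie $state\n";
```

Here the cookie returns 2, then 1, then 0, so it deletes itself when the
third token is made.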

expect
    $tokenizer->expect( 'PPIx::Regexp::Token::Code' );

This method inserts a given class at the head of the token scan, for
the next iteration only. More than one class can be specified. Class
names can be abbreviated by removing the leading 'PPIx::Regexp::'.

If no class is specified, this method does nothing.

The expectation lasts from the next time "get_token" is called until
the next time "make_token" makes a significant token, or until the next
"expect" call if that is done sooner.

find_regexp
    my $end = $tokenizer->find_regexp( qr{ \A \w+ }smx );
    my ( $begin, $end ) = $tokenizer->find_regexp(
        qr{ \A \w+ }smx );

This method finds the given regular expression in the content, starting
at the current position. If called in scalar context, the offset from
the current position to the end of the matched string is returned. If
called in list context, the offsets to both the beginning and the end
of the matched string are returned.
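
The context-sensitive return described above can be sketched with
Perl's "wantarray" and the @- and @+ match-offset arrays. The function
below is a hypothetical stand-alone version that takes the content and
cursor as arguments; the real method keeps them in the tokenizer
object.

```perl
use strict;
use warnings;

# Hypothetical stand-in: match $regexp against $content starting at
# $cursor, and report offsets relative to $cursor.
sub find_regexp {
    my ( $content, $cursor, $regexp ) = @_;
    substr( $content, $cursor ) =~ $regexp
        or return;
    # $-[0] and $+[0] are the start and end offsets of the last match.
    return wantarray ? ( $-[0], $+[0] ) : $+[0];
}

my $content = 'qr{foo bar}';

# Scalar context: offset to the end of the match only.
my $end = find_regexp( $content, 3, qr{ \A \w+ }smx );

# List context: offsets to both the beginning and the end.
my ( $begin, $stop ) = find_regexp( $content, 3, qr{ bar }smx );

print "scalar: $end; list: $begin, $stop\n";
```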

find_matching_delimiter
    my $offset = $tokenizer->find_matching_delimiter();

This method is used by tokenizers to find the delimiter matching the
character at the current position in the content string. If the
delimiter is an opening bracket of some sort, bracket nesting will be
taken into account.

When searching for the matching delimiter, the backslash character is
considered to escape the following character, so backslashed
delimiters will be ignored. No other quoting mechanisms are recognized,
though, so delimiters inside quotes still count. This is actually the
way Perl works, as

    $ perl -e 'qr<(?{ print "}" })>'

demonstrates.

This method returns the offset from the current position in the content
string to the matching delimiter (which will always be positive), or
undef if no match can be found.
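
The search rules above (bracket nesting, backslash escapes, no other
quoting) can be sketched as a plain scan. This is a hypothetical
stand-alone function, not the module's implementation; it assumes the
string passed in begins with the opening delimiter.

```perl
use strict;
use warnings;

# Closing characters for the bracketing delimiters.
my %closer = ( '(' => ')', '[' => ']', '{' => '}', '<' => '>' );

# Hypothetical stand-in: $content begins with the opening delimiter.
# Returns the offset of the matching delimiter, or nothing if none.
sub find_matching_delimiter {
    my ( $content ) = @_;
    my $open  = substr $content, 0, 1;
    my $close = exists $closer{$open} ? $closer{$open} : $open;
    my $nest  = 0;
    for ( my $inx = 1; $inx < length( $content ); $inx++ ) {
        my $char = substr $content, $inx, 1;
        if ( $char eq '\\' ) {
            $inx++;                     # skip the escaped character
        } elsif ( $char eq $close ) {
            $nest-- or return $inx;     # unnested closer: found it
        } elsif ( $char eq $open ) {
            $nest++;                    # only reachable for brackets
        }
    }
    return;
}

foreach my $string ( '{a{b}c}d', '/a\/b/', '(abc' ) {
    my $offset = find_matching_delimiter( $string );
    printf "%-10s %s\n", $string, defined $offset ? $offset : 'undef';
}
```

Because quotes are not recognized, a closing bracket inside a quoted
string is counted like any other.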

get_start_delimiter
    my $start_delimiter = $tokenizer->get_start_delimiter();

This method is used by tokenizers to access the start delimiter for the
regular expression.

get_token
    my $token = $tokenizer->make_token( 3 );
    my @tokens = $tokenizer->get_token();

This method returns the next token that can be made from the input
stream. It is not part of the external interface, but is intended for
the use of an external tokenizer which calls it after making and
retaining its own token to look at the next token (if any) in the
input stream.

If any external tokenizer calls "get_token" without first calling
"make_token", a fatal error occurs; this is better than the infinite
recursion which would occur if the condition were not trapped.

An external tokenizer must return anything returned by "get_token";
otherwise tokens get lost.

interpolates
This method returns true if the top-level structure being tokenized
interpolates; that is, if the delimiter is not a single quote.

make_token
    return $tokenizer->make_token( 3, 'PPIx::Regexp::Token::Unknown' );

This method is used by this class (and possibly by individual
tokenizers) to manufacture a token. Its arguments are the number of
characters to include in the token, and optionally the class of the
token. If no class name is given, the caller's class is used. Class
names may be shortened by removing the initial 'PPIx::Regexp::', which
will be restored by this method.

The token will be manufactured from the given number of characters
starting at the current cursor position, which will be adjusted.

If the given length would include characters past the end of the string
being tokenized, the length is reduced appropriately. If this means a
token with no characters, nothing is returned.
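
The length handling in "make_token" can be illustrated with a small
model. The %state hash and this make_token() are hypothetical, and the
sketch returns the token text rather than a token object; it only shows
the cursor adjustment and the clamping at the end of the string.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the tokenizer state.
my %state = ( content => 'foo+', cursor => 0 );

# Take up to $length characters at the cursor and advance past them.
sub make_token {
    my ( $length ) = @_;
    my $left = length( $state{content} ) - $state{cursor};
    $length = $left if $length > $left;     # clamp to end of string
    $length > 0
        or return;                          # no characters: no token
    my $text = substr $state{content}, $state{cursor}, $length;
    $state{cursor} += $length;
    return $text;
}

# The second request is cut down to one character; the third yields
# nothing, so only two tokens are collected.
my @made = ( make_token( 3 ), make_token( 3 ), make_token( 3 ) );
print join( ', ', map { "'$_'" } @made ), "\n";
```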

match
    if ( $tokenizer->find_regexp( qr{ \A \w+ }smx ) ) {
        print $tokenizer->match(), "\n";
    }

This method returns the string matched by the previous call to
"find_regexp".

The match is set to "undef" by "make_token", as well as by another call
to "find_regexp".

modifier_duplicate
    $tokenizer->modifier_duplicate();

This method duplicates the modifiers on the top of the modifier stack,
with the intent of creating a locally-scoped copy of the modifiers.
This should only be called by an external tokenizer that is actually
creating a modifier scope. In other words, only when creating a
PPIx::Regexp::Token::Structure token whose content is '('.

modifier_modify
    $tokenizer->modifier_modify( name => $value ... );

This method sets new values for the modifiers in the local scope. Only
the modifiers whose names are actually passed have their values
changed.

This method is intended to be called after manufacturing a
PPIx::Regexp::Token::Modifier token, and passed the results of its
"modifiers" method.

modifier_pop
    $tokenizer->modifier_pop();

This method removes the modifiers on the top of the modifier stack.
This should only be called by an external tokenizer that is ending a
modifier scope. In other words, only when creating a
PPIx::Regexp::Token::Structure token whose content is ')'.

Note that this method will never pop the last modifier item off the
stack, to guard against unmatched right parentheses.
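
The three modifier_* methods together behave like a scope stack. The
following sketch models that with an array of hashes; the @modifiers
array and these functions are hypothetical stand-ins, not the module's
data structures.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the modifier stack; index -1 is the top.
my @modifiers = ( { s => 1, m => 1, x => 1 } );

# Push a copy of the top entry, creating a local scope.
sub modifier_duplicate {
    push @modifiers, { %{ $modifiers[-1] } };
    return;
}

# Change only the named modifiers in the current scope.
sub modifier_modify {
    my ( %change ) = @_;
    @{ $modifiers[-1] }{ keys %change } = values %change;
    return;
}

# Never pops the last entry, so a stray ')' cannot empty the stack.
sub modifier_pop {
    pop @modifiers if @modifiers > 1;
    return;
}

# List the modifiers currently in effect.
sub in_effect {
    my $top = $modifiers[-1];
    return join '', grep { $top->{$_} } sort keys %{ $top };
}

modifier_duplicate();               # saw '('
modifier_modify( i => 1, x => 0 );  # saw a modifier change
my $inner = in_effect();
modifier_pop();                     # saw ')'
modifier_pop();                     # unmatched ')': no effect
my $outer = in_effect();
print "inner: $inner; outer: $outer\n";
```

The change made inside the parentheses disappears when the scope is
popped, which is the locally-scoped behavior described above.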

peek
    my $character = $tokenizer->peek();
    my $next_char = $tokenizer->peek( 1 );

This method returns the character at the given non-negative offset from
the current position. If no offset is given, an offset of 0 is used.

If you ask for a negative offset or an offset off the end of the string,
"undef" is returned.
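
The bounds rules for "peek" amount to two checks before a one-character
"substr". This is a hypothetical stand-alone version that takes the
content and cursor as arguments; it is meant to be called in scalar
context, as in the examples above.

```perl
use strict;
use warnings;

# Hypothetical stand-in: the character at $offset from $cursor, or
# undef (in scalar context) if the offset is out of range.
sub peek {
    my ( $content, $cursor, $offset ) = @_;
    defined $offset
        or $offset = 0;                 # default offset
    $offset < 0
        and return;                     # negative offsets are refused
    $cursor + $offset >= length( $content )
        and return;                     # off the end of the string
    return substr $content, $cursor + $offset, 1;
}

my $content = 'foo+';
foreach my $offset ( undef, 1, 4, -1 ) {
    my $char = peek( $content, 2, $offset );
    printf "offset %-5s => %s\n",
        defined $offset ? $offset : 'none',
        defined $char ? "'$char'" : 'undef';
}
```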

ppi_document
This method makes a PPI document out of the remainder of the string,
and returns it.

prior
    $tokenizer->prior( 'can_be_quantified' )
        and print "The prior token can be quantified.\n";

This method calls the named method on the most-recently-instantiated
significant token, and returns the result. Any arguments subsequent to
the method name will be passed to the method.

Because this method is designed to be used within the tokenizing
system, it will die horribly if the named method does not exist.

ENVIRONMENT VARIABLES
A tokenizer trace can be requested by setting environment variable
PPIX_REGEXP_TOKENIZER_TRACE to a numeric value other than 0. Use of
this environment variable is unsupported in the same sense that the
"trace" option of "new" is unsupported. Explicitly specifying the
"trace" option to "new" overrides the environment variable.

The real reason this is documented is to give the user a way to
troubleshoot funny output from the tokenizer.

SUPPORT
Support is by the author. Please file bug reports at
<http://rt.cpan.org>, or by electronic mail to the author.

AUTHOR
Thomas R. Wyant, III wyant at cpan dot org

COPYRIGHT AND LICENSE
Copyright (C) 2009-2013 by Thomas R. Wyant, III

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl 5.10.0. For more details, see the full
text of the licenses in the directory LICENSES.

This program is distributed in the hope that it will be useful, but
without any warranty; without even the implied warranty of
merchantability or fitness for a particular purpose.

perl v5.16.3                      2014-06-10        PPIx::Regexp::Tokenizer(3)