1 PPIx::Regexp::Tokenizer(3)   User Contributed Perl Documentation   PPIx::Regexp::Tokenizer(3)
2
3
4
6 PPIx::Regexp::Tokenizer - Tokenize a regular expression
7
9 use PPIx::Regexp::Dumper;
10 PPIx::Regexp::Dumper->new( 'qr{foo}smx' )
11 ->print();
12
14 "PPIx::Regexp::Tokenizer" is a PPIx::Regexp::Support.
15
16 "PPIx::Regexp::Tokenizer" has no descendants.
17
19 This class provides tokenization of the regular expression.
20
22 This class provides the following public methods. Methods not
23 documented here (or documented below under "EXTERNAL TOKENIZERS") are
24 private, and unsupported in the sense that the author reserves the
25 right to change or remove them without notice.
26
27 new
28 my $tokenizer = PPIx::Regexp::Tokenizer->new( 'xyzzy' );
29
30 This static method instantiates the tokenizer. You must pass it the
31 regular expression to be parsed, either as a string or as a
32 PPI::Element of some sort. You can also pass optional name/value pairs
33 of arguments. The option names are specified without a leading dash.
34 Supported options are:
35
36 encoding name
37 This option specifies the encoding of the string to be tokenized.
38 If specified, an "Encode::decode" is done on the string (or the
39 "content" of the PPI class) before it is tokenized.
40
41 trace number
42 Specifying a positive value for this option causes a trace of the
43 tokenization. This option is unsupported in the sense that the
44 author reserves the right to alter it without notice.
45
46 If this option is unspecified, the value comes from environment
47 variable "PPIX_REGEXP_TOKENIZER_TRACE" (see "ENVIRONMENT
48 VARIABLES"). If this environment variable does not exist, the
49 default is 0.
50
51 Undocumented options are unsupported.
52
53 The returned value is the instantiated tokenizer, or "undef" if
54 instantiation failed. In the latter case a call to "errstr" will return
55 the reason.
56
57 content
58 print $tokenizer->content();
59
60 This method returns the string being tokenized. This will be the result
61 of the PPI::Element->content() method if the object was instantiated
62 with a PPI::Element.
63
64 encoding
65 This method returns the encoding of the data being parsed, if one was
66 set when the class was instantiated; otherwise it simply returns undef.
67
68 errstr
69 my $tokenizer = PPIx::Regexp::Tokenizer->new( 'xyzzy' )
70 or die PPIx::Regexp::Tokenizer->errstr();
71
72 This static method returns an error description if tokenizer
73 instantiation failed.
74
75 failures
76 print $tokenizer->failures(), " tokenization failures\n";
77
78 This method returns the number of tokenization failures encountered. A
79 tokenization failure is represented in the output token stream by a
80 PPIx::Regexp::Token::Unknown.
81
82 modifier
83 $tokenizer->modifier( 'x' )
84 and print "Tokenizing an extended regular expression\n";
85
86 This method returns true if the given modifier character was found on
87 the end of the regular expression, and false otherwise.
88
89 next_token
90 my $token = $tokenizer->next_token();
91
92 This method returns the next token in the token stream, or nothing if
93 there are no more tokens.
94
95 significant
96 This method exists simply for the convenience of PPIx::Regexp::Dumper.
97 It always returns true.
98
99 tokens
100 my @tokens = $tokenizer->tokens();
101
102 This method returns all remaining tokens in the token stream.
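
The public methods above can be combined into a short driver. This is a minimal sketch rather than a reference usage: it reuses the 'qr{foo}smx' expression from the synopsis, and it assumes that each returned token offers a content() method giving the text it covers.

```perl
use strict;
use warnings;

use PPIx::Regexp::Tokenizer;

my $tokenizer = PPIx::Regexp::Tokenizer->new( 'qr{foo}smx' )
    or die PPIx::Regexp::Tokenizer->errstr();

# Take the first token by itself, then drain the rest of the stream.
my $first = $tokenizer->next_token();
my @rest  = $tokenizer->tokens();

foreach my $token ( grep { defined } $first, @rest ) {
    # Assumption: a token's content() returns the text it was made from.
    printf "%-36s %s\n", ref $token, $token->content();
}

print $tokenizer->failures(), " tokenization failures\n";
```

Because tokens() returns only the remaining tokens, calling it after next_token() does not repeat the token already taken.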
103
104 EXTERNAL TOKENIZERS
105 This class does very little of its own tokenization. Instead the token
106 classes contain external tokenization routines, each named
107 '__PPIX_TOKENIZER__' concatenated with the current mode of the
108 tokenizer ('regexp' for regular expressions, 'repl' for the replacement
109 string).
110
111 These external tokenizers are called as static methods, and passed the
112 "PPIx::Regexp::Tokenizer" object and the current character in the
113 character stream.
114
115 If the external tokenizer wants to make one or more tokens, it returns
116 an array containing either the length in characters for tokens of the
117 tokenizer's own class, or the results of one or more "make_token" calls
118 for tokens of an arbitrary class.
119
120 If the external tokenizer is not interested in the characters starting
121 at the current position it simply returns.
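
The calling convention described above might be sketched as follows. Both packages are hypothetical: My::Token::Digits stands in for a token class, and My::StubTokenizer is a stand-in that models only the scalar-style result of "find_regexp" so the example can run without the real tokenizer.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the tokenizer; the real object would be a
# PPIx::Regexp::Tokenizer.
package My::StubTokenizer;

sub new {
    my ( $class, $content ) = @_;
    return bless { content => $content, cursor => 0 }, $class;
}

# Models only the offset from the cursor to the end of the match.
sub find_regexp {
    my ( $self, $re ) = @_;
    my $rest = substr $self->{content}, $self->{cursor};
    return $rest =~ $re ? $+[0] : undef;
}

# Hypothetical token class with an external tokenizer for 'regexp' mode.
package My::Token::Digits;

sub __PPIX_TOKENIZER__regexp {
    my ( $class, $tokenizer, $character ) = @_;

    # Not interested in this character: simply return.
    $character =~ m/ \A [0-9] \z /smx or return;

    # Interested: return the length of a token of this class.
    return $tokenizer->find_regexp( qr{ \A [0-9]+ }smx );
}

package main;

my @made = My::Token::Digits->__PPIX_TOKENIZER__regexp(
    My::StubTokenizer->new( '123abc' ), '1' );
my @declined = My::Token::Digits->__PPIX_TOKENIZER__regexp(
    My::StubTokenizer->new( 'abc' ), 'a' );

print 'Lengths returned: ', scalar @made, "\n";
print 'Declined returns: ', scalar @declined, "\n";
```

The routine is invoked as a static method, so its first argument is the class name rather than an object.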
122
123 The following methods are for the use of external tokenizers, and are
124 not part of the public interface to this class.
125
126 capture
127 if ( $tokenizer->find_regexp( qr{ \A ( foo ) }smx ) ) {
128 foreach ( $tokenizer->capture() ) {
129 print "$_\n";
130 }
131 }
132
133 This method returns all the contents of any capture buffers from the
134 previous call to "find_regexp". The first element of the array (i.e.
135 element 0) corresponds to $1, and so on.
136
137 The captures are cleared by "make_token", as well as by another call to
138 "find_regexp".
139
140 cookie
141 $tokenizer->cookie( foo => sub { 1 } );
142 my $cookie = $tokenizer->cookie( 'foo' );
143 my $old_hint = $tokenizer->cookie( foo => undef );
144
145 This method creates, deletes, or accesses a cookie.
146
147 A cookie is a code reference which is called whenever the tokenizer
148 makes a token. If it returns a false value, it is deleted. Explicitly
149 setting the cookie to "undef" also deletes it.
150
151 When you call "$tokenizer->cookie( 'foo' )", the current cookie is
152 returned. If you pass a new value of "undef" to delete the cookie, the
153 deleted cookie (if any) is returned.
154
155 When the "make_token" method calls a cookie, it passes it the tokenizer
156 and the token just made. If a token calls a cookie, it is recommended
157 that it merely pass the tokenizer, though of course the token can do
158 whatever it wants.
159
160 The cookie mechanism seems to be a bit of a crock, but it appeared to
161 be more work to fix things up in the lexer after the tokenizer got
162 something wrong.
163
164 The recommended way to write a cookie is to use a closure to store any
165 necessary data, and have a call to the cookie return the data;
166 otherwise the ultimate consumer of the cookie has no way to access the
167 data. Of course, it may be that the presence of the cookie at a certain
168 point in the parse is all that is required.
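
The closure technique recommended above might look like the following. This is a self-contained model, not the tokenizer's own code: make_cookie(), token_made() and the %cookie registry are hypothetical names, and the string 'tokenizer' stands in for the tokenizer object.

```perl
use strict;
use warnings;

# Hypothetical helper: build a cookie that carries $data in a closure and
# expires after $lifetime tokens have been made.
sub make_cookie {
    my ( $data, $lifetime ) = @_;
    return sub {
        my ( $tokenizer, $token ) = @_;
        # Called without a token: a consumer is asking for the data.
        defined $token or return $data;
        # Called as a token is made: stay alive while lifetime remains.
        return $lifetime-- > 0 ? $data : undef;
    };
}

my %cookie = ( in_class => make_cookie( { kind => 'bracketed' }, 2 ) );

# Model of what happens each time a token is made: every cookie is
# called, and any cookie returning a false value is deleted.
sub token_made {
    my ( $token ) = @_;
    foreach my $name ( sort keys %cookie ) {
        $cookie{$name}->( 'tokenizer', $token ) or delete $cookie{$name};
    }
    return;
}

my $data = $cookie{in_class}->( 'tokenizer' );    # the consumer's view

token_made( $_ ) for qw{ a b };
my $alive_after_two = exists $cookie{in_class} ? 1 : 0;

token_made( 'c' );
my $alive_after_three = exists $cookie{in_class} ? 1 : 0;

print "kind: $data->{kind}\n";
print "alive after two tokens: $alive_after_two\n";
print "alive after three tokens: $alive_after_three\n";
```

Passing only the tokenizer lets the cookie tell a consumer's query apart from the call made as each token is created.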
169
170 expect
171 $tokenizer->expect( 'PPIx::Regexp::Token::Code' );
172
173 This method inserts a given class at the head of the token scan, for
174 the next iteration only. More than one class can be specified. Class
175 names can be abbreviated by removing the leading 'PPIx::Regexp::'.
176
177 The expectation lasts from the next time "get_token" is called until
178 the next time make_token makes a significant token, or until the next
179 "expect" call if that is done sooner.
180
181 find_regexp
182 my $end = $tokenizer->find_regexp( qr{ \A \w+ }smx );
183 my ( $begin, $end ) = $tokenizer->find_regexp(
184 qr{ \A \w+ }smx );
185
186 This method finds the given regular expression in the content, starting
187 at the current position. If called in scalar context, the offset from
188 the current position to the end of the matched string is returned. If
189 called in list context, the offsets to both the beginning and the end
190 of the matched string are returned.
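
The difference between the two calling contexts can be modeled in plain Perl. The function find_regexp_sketch() is hypothetical and takes the content and cursor as explicit arguments; it illustrates only the offsets described above, not the real method's bookkeeping.

```perl
use strict;
use warnings;

# Hypothetical model: match $re against the content from $cursor onward,
# returning offsets relative to the cursor.
sub find_regexp_sketch {
    my ( $content, $cursor, $re ) = @_;
    my $rest = substr $content, $cursor;
    $rest =~ $re or return;
    # List context: begin and end offsets. Scalar context: end offset.
    return wantarray ? ( $-[0], $+[0] ) : $+[0];
}

my $content = 'qr{foo}smx';
my $pattern = qr{ \A \w+ }smx;

my $end = find_regexp_sketch( $content, 3, $pattern );
my ( $begin, $finish ) = find_regexp_sketch( $content, 3, $pattern );

print "scalar context: $end\n";
print "list context: $begin, $finish\n";
```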
191
192 find_matching_delimiter
193 my $offset = $tokenizer->find_matching_delimiter();
194
195 This method is used by tokenizers to find the delimiter matching the
196 character at the current position in the content string. If the
197 delimiter is an opening bracket of some sort, bracket nesting will be
198 taken into account.
199
200 This method returns the offset from the current position in the content
201 string to the matching delimiter (which will always be positive), or
202 undef if no match can be found.
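
Bracket nesting of the kind described above might be handled as in this sketch. The function matching_delimiter_offset() is hypothetical, takes the content and position explicitly, and assumes that a backslash escapes the following character; the real method may differ in its details.

```perl
use strict;
use warnings;

my %closer = ( '(' => ')', '[' => ']', '{' => '}', '<' => '>' );

# Hypothetical model: offset from $pos to the delimiter matching the
# character at $pos, or undef if there is none.
sub matching_delimiter_offset {
    my ( $content, $pos ) = @_;
    my $open  = substr $content, $pos, 1;
    my $close = exists $closer{$open} ? $closer{$open} : $open;
    my $nests = $close ne $open;    # only true brackets nest
    my $depth = 0;
    for ( my $inx = $pos + 1; $inx < length( $content ); $inx++ ) {
        my $char = substr $content, $inx, 1;
        if ( $char eq '\\' ) {
            $inx++;                 # assumption: skip an escaped character
        } elsif ( $nests && $char eq $open ) {
            $depth++;
        } elsif ( $char eq $close ) {
            $depth-- > 0 or return $inx - $pos;
        }
    }
    return;
}

my $nested  = matching_delimiter_offset( '{a{b}c}smx', 0 );
my $slashes = matching_delimiter_offset( '/foo/', 0 );

print "nested braces: $nested\n";
print "slashes: $slashes\n";
```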
203
204 get_token
205 my $token = $tokenizer->make_token( 3 );
206 my @tokens = $tokenizer->get_token();
207
208 This method returns the next token that can be made from the input
209 stream. It is not part of the external interface, but is intended for
210 the use of an external tokenizer, which calls it after making and
211 retaining its own token in order to look at the next token (if any) in
212 the input stream.
213
214 If any external tokenizer calls get_token without first calling
215 make_token, a fatal error occurs; this is better than the infinite
216 recursion which would occur if the condition were not trapped.
217
218 An external tokenizer must return anything returned by get_token;
219 otherwise tokens get lost.
220
221 interpolates
222 This method returns true if the top-level structure being tokenized
223 interpolates; that is, if the delimiter is not a single quote.
224
225 make_token
226 return $tokenizer->make_token( 3, 'PPIx::Regexp::Token::Unknown' );
227
228 This method is used by this class (and possibly by individual
229 tokenizers) to manufacture a token. Its arguments are the number of
230 characters to include in the token, and optionally the class of the
231 token. If no class name is given, the caller's class is used. Class
232 names may be shortened by removing the initial 'PPIx::Regexp::', which
233 will be restored by this method.
234
235 The token will be manufactured from the given number of characters
236 starting at the current cursor position, which will be adjusted.
237
238 If the given length would include characters past the end of the string
239 being tokenized, the length is reduced appropriately. If this means a
240 token with no characters, nothing is returned.
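
The class-name and length rules above can be illustrated with a small model. Both functions are hypothetical, and the hash holding the content and cursor is an assumed representation, not the tokenizer's internal state.

```perl
use strict;
use warnings;

# Hypothetical helper: restore a 'PPIx::Regexp::' prefix that was left off.
sub full_class_name {
    my ( $class ) = @_;
    return $class =~ m/ \A PPIx::Regexp:: /smx
        ? $class
        : "PPIx::Regexp::$class";
}

# Hypothetical model of the cursor handling: returns the token text, or
# nothing if no characters are left.
sub take_characters {
    my ( $state, $length ) = @_;
    my $available = length( $state->{content} ) - $state->{cursor};
    $length = $available if $length > $available;    # do not run off the end
    $length > 0 or return;
    my $text = substr $state->{content}, $state->{cursor}, $length;
    $state->{cursor} += $length;                     # adjust the cursor
    return $text;
}

my $state = { content => 'foobar', cursor => 0 };
my $head  = take_characters( $state, 3 );
my $tail  = take_characters( $state, 10 );    # length is reduced
my @none  = take_characters( $state, 1 );     # nothing is left

print full_class_name( 'Token::Unknown' ), "\n";
print "head: $head, tail: $tail, leftover tokens: ", scalar @none, "\n";
```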
241
242 match
243 if ( $tokenizer->find_regexp( qr{ \A \w+ }smx ) ) {
244 print $tokenizer->match(), "\n";
245 }
246
247 This method returns the string matched by the previous call to
248 "find_regexp".
249
250 The match is set to "undef" by "make_token", as well as by another call
251 to "find_regexp".
252
253 modifier_duplicate
254 $tokenizer->modifier_duplicate();
255
256 This method duplicates the modifiers on the top of the modifier stack,
257 with the intent of creating a locally-scoped copy of the modifiers.
258 This should only be called by an external tokenizer that is actually
259 creating a modifier scope. In other words, only when creating a
260 PPIx::Regexp::Token::Structure token whose content is '('.
261
262 modifier_modify
263 $tokenizer->modifier_modify( name => $value ... );
264
265 This method sets new values for the modifiers in the local scope. Only
266 the modifiers whose names are actually passed have their values
267 changed.
268
269 This method is intended to be called after manufacturing a
270 PPIx::Regexp::Token::Modifier token, and passed the results of its
271 "modifiers" method.
272
273 modifier_pop
274 $tokenizer->modifier_pop();
275
276 This method removes the modifiers on the top of the modifier stack.
277 This should only be called by an external tokenizer that is ending a
278 modifier scope. In other words, only when creating a
279 PPIx::Regexp::Token::Structure token whose content is ')'.
280
281 Note that this method will never pop the last modifier item off the
282 stack, to guard against unmatched right parentheses.
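
Taken together, "modifier_duplicate", "modifier_modify" and "modifier_pop" behave like operations on a stack of scopes. The following is a sketch under assumptions: the stack of hashes and the function names are hypothetical, chosen only to show scoping and the guard against popping the last entry.

```perl
use strict;
use warnings;

# Assumed representation: one hash of modifier name to value per scope.
my @stack = ( { s => 1, m => 1, x => 1 } );

# Entering '(': copy the current scope so changes stay local.
sub duplicate_scope { push @stack, { %{ $stack[-1] } }; return; }

# Change only the modifiers actually named.
sub modify_scope {
    my ( %change ) = @_;
    @{ $stack[-1] }{ keys %change } = values %change;
    return;
}

# Leaving ')': drop the scope, but never the last one.
sub pop_scope { @stack > 1 and pop @stack; return; }

sub in_scope { my ( $name ) = @_; return $stack[-1]{$name}; }

duplicate_scope();
modify_scope( x => 0 );
my $inside = in_scope( 'x' );

pop_scope();
my $outside = in_scope( 'x' );

pop_scope() for 1 .. 3;    # unmatched right parentheses
my $depth = @stack;

print "x inside: $inside, x outside: $outside, stack depth: $depth\n";
```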
283
284 peek
285 my $character = $tokenizer->peek();
286 my $next_char = $tokenizer->peek( 1 );
287
288 This method returns the character at the given non-negative offset from
289 the current position. If no offset is given, an offset of 0 is used.
290
291 If you ask for a negative offset or an offset off the end of the string,
292 "undef" is returned.
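
The offset rules can be modeled in a few lines. The function peek_at() is hypothetical and takes the content and cursor as explicit arguments; it is meant only to illustrate the boundary cases.

```perl
use strict;
use warnings;

# Hypothetical model: the character at $offset from $cursor, or undef
# (in scalar context) if the offset is negative or past the end.
sub peek_at {
    my ( $content, $cursor, $offset ) = @_;
    $offset ||= 0;
    $offset < 0 and return;
    my $inx = $cursor + $offset;
    $inx < length( $content ) or return;
    return substr $content, $inx, 1;
}

my $current = peek_at( 'abc', 1 );
my $next    = peek_at( 'abc', 1, 1 );
my $beyond  = peek_at( 'abc', 1, 2 );

print "current: $current, next: $next, beyond: ",
    defined $beyond ? $beyond : 'undef', "\n";
```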
293
294 ppi_document
295 This method makes a PPI document out of the remainder of the string,
296 and returns it.
297
298 prior
299 $tokenizer->prior( 'can_be_quantified' )
300 and print "The prior token can be quantified.\n";
301
302 This method calls the named method on the most-recently-instantiated
303 significant token, and returns the result. Any arguments subsequent to
304 the method name will be passed to the method.
305
306 Because this method is designed to be used within the tokenizing
307 system, it will die horribly if the named method does not exist.
308
309 ENVIRONMENT VARIABLES
310 A tokenizer trace can be requested by setting environment variable
311 PPIX_REGEXP_TOKENIZER_TRACE to a numeric value other than 0. Use of
312 this environment variable is unsupported in the same sense that the
313 "trace" option of "new" is unsupported. Explicitly specifying the
314 "trace" option to "new" overrides the environment variable.
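
The precedence described here (explicit option, then environment variable, then 0) might be expressed as follows. The function effective_trace() is a hypothetical illustration, not part of the class.

```perl
use strict;
use warnings;

# Hypothetical helper: work out the trace level from name/value
# arguments like those accepted by new().
sub effective_trace {
    my ( %arg ) = @_;
    defined $arg{trace} and return $arg{trace};
    exists $ENV{PPIX_REGEXP_TOKENIZER_TRACE}
        and return $ENV{PPIX_REGEXP_TOKENIZER_TRACE};
    return 0;
}

{
    local $ENV{PPIX_REGEXP_TOKENIZER_TRACE} = 2;
    print 'environment only: ', effective_trace(), "\n";
    print 'explicit option:  ', effective_trace( trace => 1 ), "\n";
}
```

An explicit trace of 0 also overrides the environment variable in this model, because the option is tested for definedness rather than truth.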
315
316 The real reason this is documented is to give the user a way to
317 troubleshoot funny output from the tokenizer.
318
320 Support is by the author. Please file bug reports at
321 <http://rt.cpan.org>, or by electronic mail to the author.
322
324 Thomas R. Wyant, III wyant at cpan dot org
325
327 Copyright (C) 2009-2010, Thomas R. Wyant, III
328
329 This program is free software; you can redistribute it and/or modify it
330 under the same terms as Perl 5.10.0. For more details, see the full
331 text of the licenses in the directory LICENSES.
332
333 This program is distributed in the hope that it will be useful, but
334 without any warranty; without even the implied warranty of
335 merchantability or fitness for a particular purpose.
336
337
338
339 perl v5.12.0   2010-06-08   PPIx::Regexp::Tokenizer(3)