PPIx::Regexp::Tokenizer(3pm)

PPIx::Regexp::Tokenizer(3)  User Contributed Perl Documentation  PPIx::Regexp::Tokenizer(3)

NAME

       PPIx::Regexp::Tokenizer - Tokenize a regular expression

SYNOPSIS

        use PPIx::Regexp::Dumper;
        PPIx::Regexp::Dumper->new( 'qr{foo}smx' )
            ->print();

INHERITANCE

       "PPIx::Regexp::Tokenizer" is a PPIx::Regexp::Support.

       "PPIx::Regexp::Tokenizer" has no descendants.

DESCRIPTION

       This class provides tokenization of the regular expression.

METHODS

       This class provides the following public methods. Methods not
       documented here (or documented below under "EXTERNAL TOKENIZERS") are
       private, and unsupported in the sense that the author reserves the
       right to change or remove them without notice.

   new
        my $tokenizer = PPIx::Regexp::Tokenizer->new( 'xyzzy' );

       This static method instantiates the tokenizer. You must pass it the
       regular expression to be parsed, either as a string or as a
       PPI::Element of some sort. You can also pass optional name/value pairs
       of arguments. The option names are specified without a leading dash.
       Supported options are:

       default_modifiers array_reference
           This argument specifies default regular expression modifiers. It
           is optional, but if specified must be an array reference. See the
           PPIx::Regexp new() documentation for the details.

       encoding name
           This option specifies the encoding of the string to be tokenized.
           If specified, an "Encode::decode" is done on the string (or the
           "content" of the PPI class) before it is tokenized.

       trace number
           Specifying a positive value for this option causes a trace of the
           tokenization. This option is unsupported in the sense that the
           author reserves the right to alter it without notice.

           If this option is unspecified, the value comes from environment
           variable "PPIX_REGEXP_TOKENIZER_TRACE" (see "ENVIRONMENT
           VARIABLES"). If this environment variable does not exist, the
           default is 0.

       Undocumented options are unsupported.

       The returned value is the instantiated tokenizer, or "undef" if
       instantiation failed. In the latter case a call to "errstr" will return
       the reason.
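
       The instantiate-and-check pattern described above might be wrapped
       as follows. This is only a sketch: it assumes the PPIx::Regexp
       distribution is installed, and the helper name "make_tokenizer" is
       hypothetical, not part of this class.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PPIx::Regexp): build a tokenizer,
# or die with the reason reported by errstr().
sub make_tokenizer {
    my ( $source, %opt ) = @_;

    # Loaded at run time, so this file still compiles if the CPAN
    # distribution has not been installed yet.
    require PPIx::Regexp::Tokenizer;

    # Option names are passed without a leading dash.
    my $tokenizer = PPIx::Regexp::Tokenizer->new( $source, %opt )
        or die 'Tokenizer instantiation failed: ',
            PPIx::Regexp::Tokenizer->errstr(), "\n";

    return $tokenizer;
}
```

       A call such as make_tokenizer( 'qr{foo}smx', trace => 0 ) would then
       either return a usable tokenizer or die with the "errstr" text.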

   content
        print $tokenizer->content();

       This method returns the string being tokenized. This will be the result
       of the PPI::Element->content() method if the object was instantiated
       with a PPI::Element.

   default_modifiers
        print join ', ', @{ $tokenizer->default_modifiers() };

       This method returns a reference to a copy of the array passed to the
       "default_modifiers" argument to new(). If this argument was not used to
       instantiate the object, the return is a reference to an empty array.

   encoding
       This method returns the encoding of the data being parsed, if one was
       set when the class was instantiated; otherwise it simply returns undef.

   errstr
        my $tokenizer = PPIx::Regexp::Tokenizer->new( 'xyzzy' )
            or die PPIx::Regexp::Tokenizer->errstr();

       This static method returns an error description if tokenizer
       instantiation failed.

   failures
        print $tokenizer->failures(), " tokenization failures\n";

       This method returns the number of tokenization failures encountered. A
       tokenization failure is represented in the output token stream by a
       PPIx::Regexp::Token::Unknown.

   modifier
        $tokenizer->modifier( 'x' )
            and print "Tokenizing an extended regular expression\n";

       This method returns true if the given modifier character was found on
       the end of the regular expression, and false otherwise.

   next_token
        my $token = $tokenizer->next_token();

       This method returns the next token in the token stream, or nothing if
       there are no more tokens.

   significant
       This method exists simply for the convenience of PPIx::Regexp::Dumper.
       It always returns true.

   tokens
        my @tokens = $tokenizer->tokens();

       This method returns all remaining tokens in the token stream.
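
       The public methods above could be combined to walk the token stream
       one token at a time. This is a sketch under assumptions: it needs
       the PPIx::Regexp distribution installed, the helper name
       "token_summary" is hypothetical, and it assumes each token object
       provides a content() method.

```perl
use strict;
use warnings;

# Hypothetical helper: return one "Class: content" line per token.
sub token_summary {
    my ( $source ) = @_;

    # Loaded at run time; requires the PPIx::Regexp distribution.
    require PPIx::Regexp::Tokenizer;

    my $tokenizer = PPIx::Regexp::Tokenizer->new( $source )
        or die PPIx::Regexp::Tokenizer->errstr(), "\n";

    my @lines;

    # next_token() returns nothing (undef in scalar context) once the
    # token stream is exhausted.
    while ( defined( my $token = $tokenizer->next_token() ) ) {
        push @lines, sprintf '%s: %s', ref $token, $token->content();
    }

    # Failures show up in the stream as PPIx::Regexp::Token::Unknown.
    push @lines, sprintf '%d tokenization failures', $tokenizer->failures();

    return @lines;
}
```

       Printing the list returned by token_summary( 'qr{foo}smx' ) gives a
       rough view of how the expression was split, ending with the failure
       count.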

EXTERNAL TOKENIZERS

       This class does very little of its own tokenization. Instead the token
       classes contain external tokenization routines, whose names are
       '__PPIX_TOKENIZER__' concatenated with the current mode of the
       tokenizer ('regexp' for regular expressions, 'repl' for the replacement
       string).

       These external tokenizers are called as static methods, and passed the
       "PPIx::Regexp::Tokenizer" object and the current character in the
       character stream.

       If the external tokenizer wants to make one or more tokens, it returns
       an array containing either lengths in characters for tokens of the
       tokenizer's own class, or the results of one or more "make_token" calls
       for tokens of an arbitrary class.

       If the external tokenizer is not interested in the characters starting
       at the current position it simply returns.
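
       The calling convention just described might look like this in a
       token class. Everything here is illustrative: the class name
       "My::Token::Digits" is hypothetical, and "Stub::Tokenizer" is a
       stand-in that mimics only the documented scalar and list return
       values of "find_regexp" so that the sketch runs on its own.

```perl
use strict;
use warnings;

# Hypothetical token class; the name is illustrative only.
package My::Token::Digits;

# Called as a static method with the tokenizer object and the current
# character. The 'regexp' suffix is the tokenizer mode.
sub __PPIX_TOKENIZER__regexp {
    my ( $class, $tokenizer, $character ) = @_;

    # Not interested unless the current character is a digit: simply return.
    $character =~ m/ \A [0-9] \z /smx
        or return;

    # In scalar context find_regexp gives the offset to the end of the
    # match, which for an \A-anchored pattern is the token length.
    my $length = $tokenizer->find_regexp( qr{ \A [0-9]+ }smx )
        or return;

    # A bare length asks for a token of this tokenizer's own class.
    return $length;
}

# Stand-in for the real tokenizer, only so the sketch is self-contained.
package Stub::Tokenizer;

sub new {
    my ( $class, $content ) = @_;
    return bless { content => $content, cursor => 0 }, $class;
}

sub find_regexp {
    my ( $self, $regexp ) = @_;
    my $rest = substr $self->{content}, $self->{cursor};
    $rest =~ $regexp
        or return;
    return wantarray ? ( $-[0], $+[0] ) : $+[0];
}

package main;

my $stub = Stub::Tokenizer->new( '123abc' );
my @made = My::Token::Digits->__PPIX_TOKENIZER__regexp( $stub, '1' );
print "token length: @made\n";
```

       Returning a length keeps the token in the tokenizer's own class; a
       tokenizer that needs a different class would return the result of
       "make_token" instead.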

       The following methods are for the use of external tokenizers, and are
       not part of the public interface to this class.

   capture
        if ( $tokenizer->find_regexp( qr{ \A ( foo ) }smx ) ) {
            foreach ( $tokenizer->capture() ) {
                print "$_\n";
            }
        }

       This method returns all the contents of any capture buffers from the
       previous call to "find_regexp". The first element of the array (i.e.
       element 0) corresponds to $1, and so on.

       The captures are cleared by "make_token", as well as by another call to
       "find_regexp".

   cookie
        $tokenizer->cookie( foo => sub { 1 } );
        my $cookie = $tokenizer->cookie( 'foo' );
        my $old_hint = $tokenizer->cookie( foo => undef );

       This method either creates, deletes, or accesses a cookie.

       A cookie is a code reference which is called whenever the tokenizer
       makes a token. If it returns a false value, it is deleted. Explicitly
       setting the cookie to "undef" also deletes it.

       When you call "$tokenizer->cookie( 'foo' )", the current cookie is
       returned. If you pass a new value of "undef" to delete the cookie, the
       deleted cookie (if any) is returned.

       When the "make_token" method calls a cookie, it passes it the tokenizer
       and the token just made. If a token calls a cookie, it is recommended
       that it merely pass the tokenizer, though of course the token can do
       whatever it wants.

       The cookie mechanism seems to be a bit of a crock, but it appeared to
       be more work to fix things up in the lexer after the tokenizer got
       something wrong.

       The recommended way to write a cookie is to use a closure to store any
       necessary data, and have a call to the cookie return the data;
       otherwise the ultimate consumer of the cookie has no way to access the
       data. Of course, it may be that the presence of the cookie at a certain
       point in the parse is all that is required.
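
       The closure recommendation above can be sketched in plain Perl. The
       factory name "make_cookie" and the "depth" datum are hypothetical;
       no tokenizer is needed to show the pattern.

```perl
use strict;
use warnings;

# Hypothetical factory: the closure keeps $data private and hands it
# back whenever the cookie is called.
sub make_cookie {
    my ( $data ) = @_;
    return sub {
        # make_token passes the tokenizer and the token just made; a
        # token calling the cookie would normally pass only the tokenizer.
        my ( $tokenizer, $token ) = @_;

        # Returning the (true) data reference also keeps the cookie
        # alive, since a false return value deletes it.
        return $data;
    };
}

my $cookie = make_cookie( { depth => 1 } );

# With a real tokenizer the cookie would be installed with
#     $tokenizer->cookie( foo => $cookie );
# Here it is called directly to show how a consumer reads the data.
my $data = $cookie->();
print "depth: $data->{depth}\n";
```

       Because the data lives in the closure, any code that can fetch the
       cookie with "$tokenizer->cookie( 'foo' )" can also read it by calling
       the cookie.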

   expect
        $tokenizer->expect( 'PPIx::Regexp::Token::Code' );

       This method inserts a given class at the head of the token scan, for
       the next iteration only. More than one class can be specified. Class
       names can be abbreviated by removing the leading 'PPIx::Regexp::'.

       If no class is specified, this method does nothing.

       The expectation lasts from the next time "get_token" is called until
       the next time "make_token" makes a significant token, or until the next
       "expect" call if that is done sooner.

   find_regexp
        my $end = $tokenizer->find_regexp( qr{ \A \w+ }smx );
        my ( $begin, $end ) = $tokenizer->find_regexp(
            qr{ \A \w+ }smx );

       This method finds the given regular expression in the content, starting
       at the current position. If called in scalar context, the offset from
       the current position to the end of the matched string is returned. If
       called in list context, the offsets to both the beginning and the end
       of the matched string are returned.

   find_matching_delimiter
        my $offset = $tokenizer->find_matching_delimiter();

       This method is used by tokenizers to find the delimiter matching the
       character at the current position in the content string. If the
       delimiter is an opening bracket of some sort, bracket nesting will be
       taken into account.

       When searching for the matching delimiter, the backslash character is
       considered to escape the following character, so backslashed
       delimiters will be ignored. No other quoting mechanisms are recognized,
       though, so delimiters inside quotes still count. This is actually the
       way Perl works, as

        $ perl -e 'qr<(?{ print "}" })>'

       demonstrates.

       This method returns the offset from the current position in the content
       string to the matching delimiter (which will always be positive), or
       undef if no match can be found.
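
       The search rules above (bracket nesting, backslash escapes, no other
       quoting) can be illustrated with a stand-alone function. This is a
       sketch written from the description, not this class's code; the
       names "matching_delimiter_offset" and %closer_for are hypothetical.

```perl
use strict;
use warnings;

# Opening brackets and the closers that match them. Any other
# delimiter is assumed to match itself and not to nest.
my %closer_for = ( '(' => ')', '[' => ']', '{' => '}', '<' => '>' );

# Hypothetical stand-alone version: return the offset from $start to
# the matching delimiter, or nothing if there is none.
sub matching_delimiter_offset {
    my ( $content, $start ) = @_;

    my $open  = substr $content, $start, 1;
    my $close = $closer_for{$open};
    my $nests = defined $close;
    $close = $open unless $nests;

    my $depth = 0;
    my $limit = length $content;
    for ( my $pos = $start + 1; $pos < $limit; $pos++ ) {
        my $char = substr $content, $pos, 1;
        if ( $char eq '\\' ) {
            $pos++;         # a backslash escapes the next character
        } elsif ( $nests && $char eq $open ) {
            $depth++;       # nested opening bracket
        } elsif ( $char eq $close ) {
            return $pos - $start if $depth == 0;
            $depth--;       # closes a nested bracket
        }
    }
    return;
}

# Nested braces: the outer '{' at offset 0 matches the '}' at offset 6.
print matching_delimiter_offset( '{a{b}c}smx', 0 ), "\n";

# The content is /a\/b/i; the escaped slash at offset 3 is skipped.
print matching_delimiter_offset( '/a\\/b/i', 0 ), "\n";
```

       Quote characters get no special treatment in the loop, which mirrors
       the statement that delimiters inside quotes still count.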

   get_start_delimiter
        my $start_delimiter = $tokenizer->get_start_delimiter();

       This method is used by tokenizers to access the start delimiter for the
       regular expression.

   get_token
        my $token = $tokenizer->make_token( 3 );
        my @tokens = $tokenizer->get_token();

       This method returns the next token that can be made from the input
       stream. It is not part of the external interface, but is intended for
       the use of an external tokenizer which calls it after making and
       retaining its own token to look at the next token (if any) in the
       input stream.

       If any external tokenizer calls "get_token" without first calling
       "make_token", a fatal error occurs; this is better than the infinite
       recursion which would occur if the condition were not trapped.

       An external tokenizer must return anything returned by "get_token";
       otherwise tokens get lost.

   interpolates
       This method returns true if the top-level structure being tokenized
       interpolates; that is, if the delimiter is not a single quote.

   make_token
        return $tokenizer->make_token( 3, 'PPIx::Regexp::Token::Unknown' );

       This method is used by this class (and possibly by individual
       tokenizers) to manufacture a token. Its arguments are the number of
       characters to include in the token, and optionally the class of the
       token. If no class name is given, the caller's class is used. Class
       names may be shortened by removing the initial 'PPIx::Regexp::', which
       will be restored by this method.

       The token will be manufactured from the given number of characters
       starting at the current cursor position, which will be adjusted.

       If the given length would include characters past the end of the string
       being tokenized, the length is reduced appropriately. If this means a
       token with no characters, nothing is returned.

   match
        if ( $tokenizer->find_regexp( qr{ \A \w+ }smx ) ) {
            print $tokenizer->match(), "\n";
        }

       This method returns the string matched by the previous call to
       "find_regexp".

       The match is set to "undef" by "make_token", as well as by another call
       to "find_regexp".

   modifier_duplicate
        $tokenizer->modifier_duplicate();

       This method duplicates the modifiers on the top of the modifier stack,
       with the intent of creating a locally-scoped copy of the modifiers.
       This should only be called by an external tokenizer that is actually
       creating a modifier scope. In other words, only when creating a
       PPIx::Regexp::Token::Structure token whose content is '('.

   modifier_modify
        $tokenizer->modifier_modify( name => $value ... );

       This method sets new values for the modifiers in the local scope. Only
       the modifiers whose names are actually passed have their values
       changed.

       This method is intended to be called after manufacturing a
       PPIx::Regexp::Token::Modifier token, and passed the results of its
       "modifiers" method.

   modifier_pop
        $tokenizer->modifier_pop();

       This method removes the modifiers on the top of the modifier stack.
       This should only be called by an external tokenizer that is ending a
       modifier scope. In other words, only when creating a
       PPIx::Regexp::Token::Structure token whose content is ')'.

       Note that this method will never pop the last modifier item off the
       stack, to guard against unmatched right parentheses.
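
       The three modifier methods above describe a scope stack. The plain
       Perl model below is a sketch of that behavior and not this class's
       implementation; the stack layout (one hash of modifier values per
       scope) is an assumption, and the functions are free-standing
       stand-ins that reuse the method names.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the modifier stack: one hash of modifier
# values per scope, innermost scope last. The outer scope has /x set.
my @stack = ( { x => 1 } );

# Entering '(': copy the current modifiers into a new local scope.
sub modifier_duplicate {
    push @stack, { %{ $stack[-1] } };
    return;
}

# Only the modifiers actually passed have their values changed.
sub modifier_modify {
    my ( %change ) = @_;
    @{ $stack[-1] }{ keys %change } = values %change;
    return;
}

# Leaving ')': drop the local scope, but never the last entry, so an
# unmatched right parenthesis cannot empty the stack.
sub modifier_pop {
    pop @stack if @stack > 1;
    return;
}

# Look up a modifier in the innermost scope.
sub modifier {
    my ( $name ) = @_;
    return $stack[-1]{$name};
}

modifier_duplicate();                   # saw '('
modifier_modify( i => 1, x => 0 );      # saw something like (?i-x)
print 'inside:  i=', modifier( 'i' ), ' x=', modifier( 'x' ), "\n";
modifier_pop();                         # saw ')'
print 'outside: x=', modifier( 'x' ), "\n";
modifier_pop();                         # unmatched ')': stack unchanged
print 'scopes:  ', scalar @stack, "\n";
```

       Copying on entry and popping on exit is what keeps a change made
       inside a group from leaking into the enclosing expression.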

   peek
        my $character = $tokenizer->peek();
        my $next_char = $tokenizer->peek( 1 );

       This method returns the character at the given non-negative offset from
       the current position. If no offset is given, an offset of 0 is used.

       If you ask for a negative offset or an offset off the end of the
       string, "undef" is returned.

   ppi_document
       This method makes a PPI document out of the remainder of the string,
       and returns it.

   prior
        $tokenizer->prior( 'can_be_quantified' )
           and print "The prior token can be quantified.\n";

       This method calls the named method on the most-recently-instantiated
       significant token, and returns the result. Any arguments subsequent to
       the method name will be passed to the method.

       Because this method is designed to be used within the tokenizing
       system, it will die horribly if the named method does not exist.

ENVIRONMENT VARIABLES

       A tokenizer trace can be requested by setting environment variable
       PPIX_REGEXP_TOKENIZER_TRACE to a numeric value other than 0. Use of
       this environment variable is unsupported in the same sense that the
       "trace" option of "new" is unsupported. Explicitly specifying the
       "trace" option to "new" overrides the environment variable.

       The real reason this is documented is to give the user a way to
       troubleshoot funny output from the tokenizer.
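
       The precedence between the "trace" option and the environment
       variable can be modelled as below. This is a sketch of the
       documented rule only: the function name "effective_trace" is
       hypothetical, and treating an undefined "trace" value as
       "unspecified" is an assumption.

```perl
use strict;
use warnings;

# Hypothetical model of the documented precedence:
# explicit option, then environment variable, then 0.
sub effective_trace {
    my ( %opt ) = @_;

    # An explicit trace option overrides the environment variable.
    # (Assumption: an undefined value counts as unspecified.)
    return $opt{trace} if defined $opt{trace};

    # Otherwise the value comes from the environment, if it exists.
    return $ENV{PPIX_REGEXP_TOKENIZER_TRACE}
        if exists $ENV{PPIX_REGEXP_TOKENIZER_TRACE};

    # If the environment variable does not exist, the default is 0.
    return 0;
}

# The explicit option wins whatever the environment says.
print effective_trace( trace => 1 ), "\n";
```

       In practice this means the variable can be set for a whole shell
       session while individual calls to "new" still turn tracing off with
       trace => 0.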

SUPPORT

       Support is by the author. Please file bug reports at
       <http://rt.cpan.org>, or in electronic mail to the author.

AUTHOR

       Thomas R. Wyant, III wyant at cpan dot org

COPYRIGHT AND LICENSE

       Copyright (C) 2009-2013 by Thomas R. Wyant, III

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl 5.10.0. For more details, see the full
       text of the licenses in the directory LICENSES.

       This program is distributed in the hope that it will be useful, but
       without any warranty; without even the implied warranty of
       merchantability or fitness for a particular purpose.

perl v5.16.3                      2014-06-10        PPIx::Regexp::Tokenizer(3)