PPI::Tokenizer(3)     User Contributed Perl Documentation    PPI::Tokenizer(3)



NAME
       PPI::Tokenizer - The Perl Document Tokenizer

SYNOPSIS
       # Create a tokenizer for a file, array or string
       $Tokenizer = PPI::Tokenizer->new( 'filename.pl' );
       $Tokenizer = PPI::Tokenizer->new( \@lines );
       $Tokenizer = PPI::Tokenizer->new( \$source );

       # Return all the tokens for the document
       my $tokens = $Tokenizer->all_tokens;

       # Or we can use it as an iterator
       while ( my $Token = $Tokenizer->get_token ) {
               print "Found token '$Token'\n";
       }

       # If we REALLY need to manually nudge the cursor, we
       # can do that too (the lexer needs this ability to do rollbacks)
       $is_incremented = $Tokenizer->increment_cursor;
       $is_decremented = $Tokenizer->decrement_cursor;

DESCRIPTION
       PPI::Tokenizer is the class that provides Tokenizer objects for use
       in breaking strings of Perl source code into Tokens.

       By the time you are reading this, you probably need to know a little
       about the difference between how perl parses Perl "code" and how PPI
       parses Perl "documents".

       "perl" itself (the interpreter) uses a heavily modified lex
       specification to define its parsing logic, maintains several types
       of state as it goes, and both tokenizes and lexes AND EXECUTES at
       the same time.  In fact, it's provably impossible to use perl's
       parsing method without BEING perl.

       This is where the truism "Only perl can parse Perl" comes from.

       PPI takes a completely different approach: it abandons the ability
       to provably parse Perl, and instead parses the source as a document,
       aiming to get so close to perfect that unless you do insanely silly
       things, it can handle your code.

       It was touch and go for a long time whether we could get it close
       enough, but in the end it turned out that it could be done.

       In this approach, the tokenizer "PPI::Tokenizer" is separate from
       the lexer PPI::Lexer. The job of "PPI::Tokenizer" is to take pure
       source as a string and break it up into a stream/set of tokens.

       The Tokenizer uses a hell of a lot of heuristics, guessing, and
       cruft, supported by a very VERY flexible internal API, but
       fortunately there's not a lot that gets exposed to people using the
       "PPI::Tokenizer" itself.

METHODS
       Despite the incredible complexity, the Tokenizer itself only exposes
       a relatively small number of methods, with most of the complexity
       implemented in private methods.

   new $source | \@lines | \$source
       The main "new" constructor creates a new Tokenizer object. These
       objects have no configuration parameters, and can only be used once,
       to tokenize a single perl source file.

       It takes as argument either a normal scalar containing source code,
       a reference to a scalar containing source code, or a reference to an
       ARRAY containing newline-terminated lines of source code.

       Returns a new "PPI::Tokenizer" object on success, or "undef" on
       error.

   get_token
       When using the PPI::Tokenizer object as an iterator, the "get_token"
       method is the primary method that is used. It increments the cursor
       and returns the next Token in the output array.

       The actual parsing of the file is done only as-needed, and a line at
       a time. When "get_token" hits the end of the token array, it will
       cause the parser to pull in the next line and parse it, continuing
       as needed until there are more tokens on the output array that
       get_token can then return.

       This means that a number of Tokenizer objects can be created, and
       they won't consume significant CPU until you actually begin to pull
       tokens from them.

       Returns a PPI::Token object on success, 0 if the Tokenizer has
       reached the end of the file, or "undef" on error.

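       As a sketch of that three-way return protocol (this example assumes
       PPI is installed; it is illustrative rather than canonical):

       ```perl
       use strict;
       use warnings;
       use PPI::Tokenizer ();

       my $Tokenizer = PPI::Tokenizer->new( \'my $x = 42;' )
               or die "Failed to create Tokenizer: " . PPI::Tokenizer->errstr;

       my $Token;
       while ( $Token = $Tokenizer->get_token ) {
               # A real token object is always true
               printf "%-30s '%s'\n", ref($Token), $Token->content;
       }

       # The loop ended on a false value: 0 means normal end-of-file,
       # undef means a tokenization error.
       defined $Token or die "Tokenizer error: " . $Tokenizer->errstr;
       ```

       Because a PPI::Token object is always true, the numeric 0 and
       "undef" cases can both be caught after the loop, as above.
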
   all_tokens
       When not being used as an iterator, the "all_tokens" method tells
       the Tokenizer to parse the entire file and return all of the tokens
       in a single ARRAY reference.

       It should be noted that "all_tokens" does NOT interfere with the use
       of the Tokenizer object as an iterator (it does not modify the token
       cursor) and use of the two different mechanisms can be mixed safely.

       Returns a reference to an ARRAY of PPI::Token objects on success, 0
       in the special case that the file/string contains NO tokens at all,
       or "undef" on error.

   increment_cursor
       Although exposed as a public method, "increment_cursor" is
       implemented for expert use only, when writing lexers or other
       components that work directly on token streams.

       It manually increments the token cursor forward through the file, in
       effect "skipping" the next token.

       Returns true if the cursor is incremented, 0 if already at the end
       of the file, or "undef" on error.

   decrement_cursor
       Although exposed as a public method, "decrement_cursor" is
       implemented for expert use only, when writing lexers or other
       components that work directly on token streams.

       It manually decrements the token cursor backwards through the file,
       in effect "rolling back" the token stream. And indeed that is what
       it is primarily intended for, when the component that is consuming
       the token stream needs to implement some sort of "roll back" feature
       in its use of the token stream.

       Returns true if the cursor is decremented, 0 if already at the
       beginning of the file, or "undef" on error.

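       To illustrate the rollback idea, here is a hypothetical "peek"
       helper built on "decrement_cursor" (the "peek_token" function is not
       part of PPI; it is a sketch of the kind of thing a lexer might do):

       ```perl
       use strict;
       use warnings;
       use PPI::Tokenizer ();

       # Fetch the next token, then roll the cursor back so the stream
       # is unchanged from the caller's point of view.
       sub peek_token {
               my $tokenizer = shift;
               my $token     = $tokenizer->get_token;
               return $token unless $token;   # 0 (end of file) or undef (error)
               $tokenizer->decrement_cursor;  # roll the stream back one token
               return $token;
       }

       my $Tokenizer = PPI::Tokenizer->new( \'print "Hello";' );
       my $peeked = peek_token($Tokenizer);
       my $next   = $Tokenizer->get_token;
       # $peeked and $next now refer to the same token
       ```
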
   errstr
       For any error that occurs, you can use the "errstr" method, as
       either a static or object method, to access the error message.

       If no error occurs for any particular action, "errstr" will return
       false.

NOTES
   How the Tokenizer Works
       Understanding the Tokenizer is not for the faint-hearted. It is by
       far the most complex and twisty piece of perl I've ever written that
       is actually still built properly and isn't a terrible spaghetti-like
       mess. In fact, you probably want to skip this section.

       But if you really want to understand, well then here goes.

   Source Input and Clean Up
       The Tokenizer starts by taking source in a variety of forms, sucking
       it all in and merging it into one big string, and doing our own
       internal line split, using a "universal line separator" which allows
       the Tokenizer to take source for any platform (and even supports a
       few known types of broken newlines caused by mixed mac/pc/*nix
       editor screw ups).

       The resulting array of lines is used to feed the tokenizer, and is
       also accessed directly by the heredoc-logic to do the line-oriented
       part of here-doc support.

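       As an illustration (this is not necessarily PPI's literal regex), a
       "universal line separator" split treats CRLF, lone CR and lone LF as
       equivalent line endings:

       ```perl
       use strict;
       use warnings;

       # DOS (\015\012), Unix (\012) and classic Mac (\015) endings, mixed
       # in one string, all split identically.
       my $source = "unix\012dos\015\012mac\015end";
       my @lines  = split /\015{1,2}\012|\015|\012/, $source;
       # @lines is now ("unix", "dos", "mac", "end")
       ```

       Putting the CRLF alternative first matters: it must be tried before
       the lone CR, or a DOS line ending would split into two lines.
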
   Doing Things the Old Fashioned Way
       Due to the complexity of perl, and after 2 previously aborted parser
       attempts, in the end the tokenizer was fashioned around a
       line-buffered character-by-character method.

       That is, the Tokenizer pulls and holds a line at a time into a line
       buffer, and then iterates a cursor along it. At each cursor
       position, a method is called in whatever token class we are
       currently in, which will examine the character at the current
       position, and handle it.

       As the handler methods in the various token classes are called, they
       build up an output token array for the source code.

       Various parts of the Tokenizer use look-ahead, arbitrary-distance
       look-behind (although currently the maximum is three significant
       tokens), or both, and various other heuristic guesses.

       I've been told it is officially termed a "backtracking parser with
       infinite lookaheads".

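       A heavily simplified sketch of that char-by-char accumulation (this
       is illustrative only, with a toy two-class scheme rather than PPI's
       real token classes):

       ```perl
       use strict;
       use warnings;

       # At each cursor position, decide whether the character continues
       # the current token or finalises it and starts a new one.
       my @tokens;
       my $class = 'whitespace';
       for my $char ( split //, "my \$x" ) {
               my $new = $char =~ /\s/ ? 'whitespace' : 'word';
               if ( $new eq $class and @tokens ) {
                       $tokens[-1] .= $char;   # continue the current token
               }
               else {
                       push @tokens, $char;    # finalise; start a new token
                       $class = $new;
               }
       }
       # @tokens is now ("my", " ", "$x")
       ```

       The real Tokenizer dispatches to a per-class handler method at each
       position instead of a single inline test, which is what lets the
       "current class" change mid-token as described below.
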
   State Variables
       Aside from the current line and the character cursor, the Tokenizer
       maintains a number of different state variables.

       Current Class
           The Tokenizer maintains the current token class at all times.
           Much of the time it is just going to be the "Whitespace" class,
           which is what the base of a document is. As the tokenizer
           executes the various character handlers, the class changes a lot
           as it moves along. In fact, in some instances, the character
           handler may not handle the character directly itself, but rather
           change the "current class" and then hand off to the character
           handler for the new class.

           Because of this, and some other things I'll deal with later, the
           number of times the character handlers are called does not in
           fact have a direct relationship to the number of actual
           characters in the document.

       Current Zone
           Rather than create a class stack to allow for infinitely nested
           layers of classes, the Tokenizer recognises just a single layer.

           To put it a different way, in various parts of the file, the
           Tokenizer will recognise different "base" or "substrate"
           classes. When a Token such as a comment or a number is finalised
           by the tokenizer, it "falls back" to the base state.

           This allows proper tokenization of special areas such as
           __DATA__ and __END__ blocks, which also contain things like
           comments and POD, without allowing the creation of any
           significant Tokens inside these areas.

           For the main part of a document we use PPI::Token::Whitespace
           for this, with the idea being that code is "floating in a sea of
           whitespace".

       Current Token
           The final main state variable is the "current token". This is
           the Token that is currently being built by the Tokenizer. For
           certain types, it can be manipulated and morphed, and can change
           class quite a bit while being assembled, as the Tokenizer's
           understanding of the token content changes.

           When the Tokenizer is confident that it has seen the end of the
           Token, it will be "finalised", which adds it to the output token
           array and resets the current class to that of the zone that we
           are currently in.

           I should also note at this point that the "current token"
           variable is optional. The Tokenizer is capable of knowing what
           class it is currently set to, without actually having
           accumulated any characters in the Token.

   Making It Faster
       As I'm sure you can imagine, calling several different methods for
       each character and running regexes and other complex heuristics made
       the first fully working version of the tokenizer extremely slow.

       During testing, I created a metric to measure parsing speed called
       LPGC, or "lines per gigacycle". A gigacycle is simply a billion CPU
       cycles on a typical single-core CPU, and so a Tokenizer running at
       "1000 lines per gigacycle" should tokenize around 1200 lines of code
       per second when running on a 1200 MHz processor.

       The first working version of the tokenizer ran at only 350 LPGC, so
       to tokenize a typical large module such as ExtUtils::MakeMaker took
       10-15 seconds. This sluggishness made it impractical for many uses.

       So in the current parser, there are multiple layers of optimisation
       very carefully built into the basic design. This has brought the
       tokenizer up to a more reasonable 1000 LPGC, at the expense of
       making the code quite a bit twistier.

   Making It Faster - Whole Line Classification
       The first step in the optimisation process was to add a new handler
       to enable several of the more basic classes (whitespace, comments)
       to be parsed a line at a time. At the start of each line, a special
       optional handler (only supported by a few classes) is called to
       check and see if the entire line can be parsed in one go.

       This is used mainly to handle things like POD, comments, empty
       lines, and a few other minor special cases.

   Making It Faster - Inlining
       The second stage of the optimisation involved inlining a small
       number of critical methods that were repeated an extremely high
       number of times. Profiling suggested that there were about 1,000,000
       individual method calls per gigacycle, and by cutting these by two
       thirds a significant speed improvement was gained, in the order of
       about 50%.

       You may notice that many methods in the "PPI::Tokenizer" code look
       very nested and longhand. This is primarily due to this inlining.

       At around this time, some statistics code that existed in the early
       versions of the parser was also removed, as it was determined that
       it was consuming around 15% of the CPU for the entire parser, while
       making the core more complicated.

       A judgment call was made that, with the difficulties likely to be
       encountered with future planned enhancements, and given the
       relatively high cost involved, the statistics features would be
       removed from the Tokenizer.

   Making It Faster - Quote Engine
       Once inlining had reached diminishing returns, it became obvious
       from the profiling results that a huge amount of time was being
       spent stepping a char at a time through long, simple and
       "syntactically boring" code such as comments and strings.

       The existing regex engine was expanded to also encompass quotes and
       other quote-like things, and a special abstract base class was added
       that provided a number of specialised parsing methods that would
       "scan ahead", looking out ahead to find the end of a string, and
       updating the cursor to leave it in a valid position for the next
       call.

       This is also the point at which the number of character handler
       calls began to greatly differ from the number of characters. But it
       has been done in a way that allows the parser to retain the power of
       the original version at the critical points, while skipping through
       the "boring bits" as needed for additional speed.

       The addition of this feature allowed the tokenizer to exceed 1000
       LPGC for the first time.

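       An illustrative scan-ahead in the same spirit (this is not PPI's
       actual Quote Engine code): rather than stepping character by
       character, one anchored regex match jumps the cursor straight past a
       single-quoted string:

       ```perl
       use strict;
       use warnings;

       my $line   = q{my $s = 'hello world'; f();};
       my $cursor = 8;                # cursor sits on the opening quote
       pos($line) = $cursor + 1;      # scan from just after the open quote
       if ( $line =~ /\G(?:[^'\\]|\\.)*'/gc ) {
               $cursor = pos($line);  # cursor now sits past the close quote
       }
       # substr($line, 0, $cursor) is now "my $s = 'hello world'"
       ```

       The "(?:[^'\\]|\\.)*" part consumes ordinary characters and
       backslash escapes in one pass, leaving the cursor in a valid
       position for the next handler call, just as described above.
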
   Making It Faster - The "Complete" Mechanism
       As it became evident that great speed increases were available by
       using this "skipping ahead" mechanism, a new handler method was
       added that explicitly handles the parsing of an entire token, where
       the structure of the token is relatively simple. Tokens such as
       symbols fit this case, as once we are past the initial sigil and
       word character, we know that we can skip ahead and "complete" the
       rest of the token much more easily.

       A number of these have been added for most or possibly all of the
       common cases, with most of these "complete" handlers implemented
       using regular expressions.

       In fact, so many have been added that at this point, you could
       arguably reclassify the tokenizer as a "hybrid regex, char-by-char
       heuristic tokenizer". More tokens are now consumed in "complete"
       methods in a typical program than are handled by the normal
       char-by-char methods.

       Many of these complete-handlers were implemented during the writing
       of the Lexer, and this has allowed the full parser to maintain
       around 1000 LPGC despite the increasing weight of the Lexer.

   Making It Faster - Porting To C (In Progress)
       While it would be extraordinarily difficult to port all of the
       Tokenizer to C, work has started on a PPI::XS "accelerator" package
       which acts as a separate and automatically-detected add-on to the
       main PPI package.

       PPI::XS implements faster versions of a variety of functions
       scattered over the entire PPI codebase, from the Tokenizer Core,
       Quote Engine, and various other places, and implements them
       identically in XS/C.

       In particular, the skip-ahead methods from the Quote Engine would
       appear to be extremely amenable to being done in C, and a number of
       other functions could be cherry-picked one at a time and implemented
       in C.

       Each method is heavily tested to ensure that the functionality is
       identical, and a versioning mechanism is included to ensure that if
       a function gets out of sync, PPI::XS will degrade gracefully and
       just not replace that single method.

TO DO
       - Add an option to reset or seek the token stream...

       - Implement more Tokenizer functions in PPI::XS

SUPPORT
       See the support section in the main module.

AUTHOR
       Adam Kennedy <adamk@cpan.org>

COPYRIGHT
       Copyright 2001 - 2006 Adam Kennedy.

       This program is free software; you can redistribute it and/or modify
       it under the same terms as Perl itself.

       The full text of the license can be found in the LICENSE file
       included with this module.



perl v5.8.8                       2006-09-23                 PPI::Tokenizer(3)