PPI::Tokenizer(3)     User Contributed Perl Documentation    PPI::Tokenizer(3)



NAME
       PPI::Tokenizer - The Perl Document Tokenizer

SYNOPSIS
       # Create a tokenizer for a file, array or string
       $Tokenizer = PPI::Tokenizer->new( 'filename.pl' );
       $Tokenizer = PPI::Tokenizer->new( \@lines );
       $Tokenizer = PPI::Tokenizer->new( \$source );

       # Return all the tokens for the document
       my $tokens = $Tokenizer->all_tokens;

       # Or we can use it as an iterator
       while ( my $Token = $Tokenizer->get_token ) {
               print "Found token '$Token'\n";
       }

       # If we REALLY need to manually nudge the cursor, we
       # can do that too (the lexer needs this ability to do rollbacks)
       $is_incremented = $Tokenizer->increment_cursor;
       $is_decremented = $Tokenizer->decrement_cursor;

DESCRIPTION
       PPI::Tokenizer is the class that provides Tokenizer objects for use
       in breaking strings of Perl source code into Tokens.

       By the time you are reading this, you probably need to know a little
       about the difference between how perl parses Perl "code" and how PPI
       parses Perl "documents".

       "perl" itself (the interpreter) uses a heavily modified lex
       specification to define its parsing logic, maintains several types
       of state as it goes, and both tokenizes and lexes AND EXECUTES at
       the same time.  In fact, it's provably impossible to use perl's
       parsing method without BEING perl.

       This is where the truism "Only perl can parse Perl" comes from.

       PPI takes a completely different approach: it abandons the ability
       to provably parse Perl, and instead parses the source as a document,
       aiming to get so close to perfect that unless you do insanely silly
       things, it can handle your code.

       It was touch and go for a long time whether we could get it close
       enough, but in the end it turned out that it could be done.

       In this approach, the tokenizer "PPI::Tokenizer" is separate from
       the lexer PPI::Lexer. The job of "PPI::Tokenizer" is to take pure
       source as a string and break it up into a stream/set of tokens.

       The Tokenizer uses a hell of a lot of heuristics, guessing, and
       cruft, supported by a very VERY flexible internal API, but
       fortunately there's not a lot that gets exposed to people using the
       "PPI::Tokenizer" itself.

METHODS
       Despite the incredible complexity, the Tokenizer itself only exposes
       a relatively small number of methods, with most of the complexity
       implemented in private methods.

   new $source | \@lines | \$source
       The main "new" constructor creates a new Tokenizer object. These
       objects have no configuration parameters, and can only be used once,
       to tokenize a single perl source file.

       It takes as argument either a normal scalar containing source code,
       a reference to a scalar containing source code, or a reference to an
       ARRAY containing newline-terminated lines of source code.

       Returns a new "PPI::Tokenizer" object on success, or "undef" on
       error.

   get_token
       When using the PPI::Tokenizer object as an iterator, the "get_token"
       method is the primary method that is used. It increments the cursor
       and returns the next Token in the output array.

       The actual parsing of the file is done only as-needed, and a line at
       a time. When "get_token" hits the end of the token array, it will
       cause the parser to pull in the next line and parse it, continuing
       as needed until there are more tokens on the output array that
       get_token can then return.

       This means that a number of Tokenizer objects can be created, and
       they won't consume significant CPU until you actually begin to pull
       tokens from them.

       Returns a PPI::Token object on success, 0 if the Tokenizer has
       reached the end of the file, or "undef" on error.

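       As a sketch of that three-way return protocol (this example assumes
       PPI is installed; it is illustrative rather than canonical):

       ```perl
       use strict;
       use warnings;
       use PPI::Tokenizer ();

       my $Tokenizer = PPI::Tokenizer->new( \'my $x = 42;' )
               or die "Failed to create Tokenizer: " . PPI::Tokenizer->errstr;

       my $Token;
       while ( $Token = $Tokenizer->get_token ) {
               # A real token object is always true
               printf "%-30s '%s'\n", ref($Token), $Token->content;
       }

       # The loop ended on a false value: 0 means normal end-of-file,
       # undef means a tokenization error.
       defined $Token or die "Tokenizer error: " . $Tokenizer->errstr;
       ```

       Because a PPI::Token object is always true, the numeric 0 and
       "undef" cases can both be caught after the loop, as above.
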
   all_tokens
       When not being used as an iterator, the "all_tokens" method tells
       the Tokenizer to parse the entire file and return all of the tokens
       in a single ARRAY reference.

       It should be noted that "all_tokens" does NOT interfere with the use
       of the Tokenizer object as an iterator (it does not modify the token
       cursor) and use of the two different mechanisms can be mixed safely.

       Returns a reference to an ARRAY of PPI::Token objects on success, 0
       in the special case that the file/string contains NO tokens at all,
       or "undef" on error.

   increment_cursor
       Although exposed as a public method, "increment_cursor" is
       implemented for expert use only, when writing lexers or other
       components that work directly on token streams.

       It manually increments the token cursor forward through the file, in
       effect "skipping" the next token.

       Returns true if the cursor is incremented, 0 if already at the end
       of the file, or "undef" on error.

   decrement_cursor
       Although exposed as a public method, "decrement_cursor" is
       implemented for expert use only, when writing lexers or other
       components that work directly on token streams.

       It manually decrements the token cursor backwards through the file,
       in effect "rolling back" the token stream. And indeed that is what
       it is primarily intended for, when the component that is consuming
       the token stream needs to implement some sort of "roll back" feature
       in its use of the token stream.

       Returns true if the cursor is decremented, 0 if already at the
       beginning of the file, or "undef" on error.

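       To illustrate the rollback idea, here is a hypothetical "peek"
       helper built on "decrement_cursor" (the "peek_token" function is not
       part of PPI; it is a sketch of the kind of thing a lexer might do):

       ```perl
       use strict;
       use warnings;
       use PPI::Tokenizer ();

       # Fetch the next token, then roll the cursor back so the stream
       # is unchanged from the caller's point of view.
       sub peek_token {
               my $tokenizer = shift;
               my $token     = $tokenizer->get_token;
               return $token unless $token;   # 0 (end of file) or undef (error)
               $tokenizer->decrement_cursor;  # roll the stream back one token
               return $token;
       }

       my $Tokenizer = PPI::Tokenizer->new( \'print "Hello";' );
       my $peeked = peek_token($Tokenizer);
       my $next   = $Tokenizer->get_token;
       # $peeked and $next now refer to the same token
       ```
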
   errstr
       For any error that occurs, you can use the "errstr" method, as
       either a static or object method, to access the error message.

       If no error occurs for any particular action, "errstr" will return
       false.

NOTES
   How the Tokenizer Works
       Understanding the Tokenizer is not for the faint-hearted. It is by
       far the most complex and twisty piece of perl I've ever written that
       is actually still built properly and isn't a terrible spaghetti-like
       mess. In fact, you probably want to skip this section.

       But if you really want to understand, well then here goes.

   Source Input and Clean Up
       The Tokenizer starts by taking source in a variety of forms, sucking
       it all in and merging it into one big string, and doing our own
       internal line split, using a "universal line separator" which allows
       the Tokenizer to take source for any platform (and even supports a
       few known types of broken newlines caused by mixed mac/pc/*nix
       editor screw ups).

       The resulting array of lines is used to feed the tokenizer, and is
       also accessed directly by the heredoc-logic to do the line-oriented
       part of here-doc support.

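       As an illustration (this is not necessarily PPI's literal regex), a
       "universal line separator" split treats CRLF, lone CR and lone LF as
       equivalent line endings:

       ```perl
       use strict;
       use warnings;

       # DOS (\015\012), Unix (\012) and classic Mac (\015) endings, mixed
       # in one string, all split identically.
       my $source = "unix\012dos\015\012mac\015end";
       my @lines  = split /\015{1,2}\012|\015|\012/, $source;
       # @lines is now ("unix", "dos", "mac", "end")
       ```

       Putting the CRLF alternative first matters: it must be tried before
       the lone CR, or a DOS line ending would split into two lines.
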
   Doing Things the Old Fashioned Way
       Due to the complexity of perl, and after 2 previously aborted parser
       attempts, in the end the tokenizer was fashioned around a
       line-buffered character-by-character method.

       That is, the Tokenizer pulls and holds a line at a time into a line
       buffer, and then iterates a cursor along it. At each cursor
       position, a method is called in whatever token class we are
       currently in, which will examine the character at the current
       position, and handle it.

       As the handler methods in the various token classes are called, they
       build up an output token array for the source code.

       Various parts of the Tokenizer use look-ahead, arbitrary-distance
       look-behind (although currently the maximum is three significant
       tokens), or both, and various other heuristic guesses.

       I've been told it is officially termed a "backtracking parser with
       infinite lookaheads".

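       A heavily simplified sketch of that char-by-char accumulation (this
       is illustrative only, with a toy two-class scheme rather than PPI's
       real token classes):

       ```perl
       use strict;
       use warnings;

       # At each cursor position, decide whether the character continues
       # the current token or finalises it and starts a new one.
       my @tokens;
       my $class = 'whitespace';
       for my $char ( split //, "my \$x" ) {
               my $new = $char =~ /\s/ ? 'whitespace' : 'word';
               if ( $new eq $class and @tokens ) {
                       $tokens[-1] .= $char;   # continue the current token
               }
               else {
                       push @tokens, $char;    # finalise; start a new token
                       $class = $new;
               }
       }
       # @tokens is now ("my", " ", "$x")
       ```

       The real Tokenizer dispatches to a per-class handler method at each
       position instead of a single inline test, which is what lets the
       "current class" change mid-token as described below.
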
   State Variables
       Aside from the current line and the character cursor, the Tokenizer
       maintains a number of different state variables.

       Current Class
           The Tokenizer maintains the current token class at all times.
           Much of the time it is just going to be the "Whitespace" class,
           which is what the base of a document is. As the tokenizer
           executes the various character handlers, the class changes a lot
           as it moves along. In fact, in some instances, the character
           handler may not handle the character directly itself, but rather
           change the "current class" and then hand off to the character
           handler for the new class.

           Because of this, and some other things I'll deal with later, the
           number of times the character handlers are called does not in
           fact have a direct relationship to the number of actual
           characters in the document.

       Current Zone
           Rather than create a class stack to allow for infinitely nested
           layers of classes, the Tokenizer recognises just a single layer.

           To put it a different way, in various parts of the file, the
           Tokenizer will recognise different "base" or "substrate"
           classes. When a Token such as a comment or a number is finalised
           by the tokenizer, it "falls back" to the base state.

           This allows proper tokenization of special areas such as
           __DATA__ and __END__ blocks, which also contain things like
           comments and POD, without allowing the creation of any
           significant Tokens inside these areas.

           For the main part of a document we use PPI::Token::Whitespace
           for this, with the idea being that code is "floating in a sea of
           whitespace".

       Current Token
           The final main state variable is the "current token". This is
           the Token that is currently being built by the Tokenizer. For
           certain types, it can be manipulated and morphed, and can change
           class quite a bit while being assembled, as the Tokenizer's
           understanding of the token content changes.

           When the Tokenizer is confident that it has seen the end of the
           Token, it will be "finalised", which adds it to the output token
           array and resets the current class to that of the zone that we
           are currently in.

           I should also note at this point that the "current token"
           variable is optional. The Tokenizer is capable of knowing what
           class it is currently set to, without actually having
           accumulated any characters in the Token.

   Making It Faster
       As I'm sure you can imagine, calling several different methods for
       each character and running regexes and other complex heuristics made
       the first fully working version of the tokenizer extremely slow.

       During testing, I created a metric to measure parsing speed called
       LPGC, or "lines per gigacycle". A gigacycle is simply a billion CPU
       cycles on a typical single-core CPU, and so a Tokenizer running at
       "1000 lines per gigacycle" should tokenize around 1200 lines of code
       per second when running on a 1200 MHz processor.

       The first working version of the tokenizer ran at only 350 LPGC, so
       to tokenize a typical large module such as ExtUtils::MakeMaker took
       10-15 seconds. This sluggishness made it impractical for many uses.

       So in the current parser, there are multiple layers of optimisation
       very carefully built into the basic design. This has brought the
       tokenizer up to a more reasonable 1000 LPGC, at the expense of
       making the code quite a bit twistier.

   Making It Faster - Whole Line Classification
       The first step in the optimisation process was to add a new handler
       to enable several of the more basic classes (whitespace, comments)
       to be parsed a line at a time. At the start of each line, a special
       optional handler (only supported by a few classes) is called to
       check and see if the entire line can be parsed in one go.

       This is used mainly to handle things like POD, comments, empty
       lines, and a few other minor special cases.

   Making It Faster - Inlining
       The second stage of the optimisation involved inlining a small
       number of critical methods that were repeated an extremely high
       number of times. Profiling suggested that there were about 1,000,000
       individual method calls per gigacycle, and by cutting these by two
       thirds a significant speed improvement was gained, in the order of
       about 50%.

       You may notice that many methods in the "PPI::Tokenizer" code look
       very nested and longhand. This is primarily due to this inlining.

       At around this time, some statistics code that existed in the early
       versions of the parser was also removed, as it was determined that
       it was consuming around 15% of the CPU for the entire parser, while
       making the core more complicated.

       A judgment call was made that, with the difficulties likely to be
       encountered with future planned enhancements, and given the
       relatively high cost involved, the statistics features would be
       removed from the Tokenizer.

   Making It Faster - Quote Engine
       Once inlining had reached diminishing returns, it became obvious
       from the profiling results that a huge amount of time was being
       spent stepping a char at a time through long, simple and
       "syntactically boring" code such as comments and strings.

       The existing regex engine was expanded to also encompass quotes and
       other quote-like things, and a special abstract base class was added
       that provided a number of specialised parsing methods that would
       "scan ahead", looking out ahead to find the end of a string, and
       updating the cursor to leave it in a valid position for the next
       call.

       This is also the point at which the number of character handler
       calls began to greatly differ from the number of characters. But it
       has been done in a way that allows the parser to retain the power of
       the original version at the critical points, while skipping through
       the "boring bits" as needed for additional speed.

       The addition of this feature allowed the tokenizer to exceed 1000
       LPGC for the first time.

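       An illustrative scan-ahead in the same spirit (this is not PPI's
       actual Quote Engine code): rather than stepping character by
       character, one anchored regex match jumps the cursor straight past a
       single-quoted string:

       ```perl
       use strict;
       use warnings;

       my $line   = q{my $s = 'hello world'; f();};
       my $cursor = 8;                # cursor sits on the opening quote
       pos($line) = $cursor + 1;      # scan from just after the open quote
       if ( $line =~ /\G(?:[^'\\]|\\.)*'/gc ) {
               $cursor = pos($line);  # cursor now sits past the close quote
       }
       # substr($line, 0, $cursor) is now "my $s = 'hello world'"
       ```

       The "(?:[^'\\]|\\.)*" part consumes ordinary characters and
       backslash escapes in one pass, leaving the cursor in a valid
       position for the next handler call, just as described above.
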
   Making It Faster - The "Complete" Mechanism
       As it became evident that great speed increases were available by
       using this "skipping ahead" mechanism, a new handler method was
       added that explicitly handles the parsing of an entire token, where
       the structure of the token is relatively simple. Tokens such as
       symbols fit this case, as once we are past the initial sigil and
       word character, we know that we can skip ahead and "complete" the
       rest of the token much more easily.

       A number of these have been added for most or possibly all of the
       common cases, with most of these "complete" handlers implemented
       using regular expressions.

       In fact, so many have been added that at this point, you could
       arguably reclassify the tokenizer as a "hybrid regex, char-by-char
       heuristic tokenizer". More tokens are now consumed in "complete"
       methods in a typical program than are handled by the normal
       char-by-char methods.

       Many of these complete-handlers were implemented during the writing
       of the Lexer, and this has allowed the full parser to maintain
       around 1000 LPGC despite the increasing weight of the Lexer.

   Making It Faster - Porting To C (In Progress)
       While it would be extraordinarily difficult to port all of the
       Tokenizer to C, work has started on a PPI::XS "accelerator" package
       which acts as a separate and automatically-detected add-on to the
       main PPI package.

       PPI::XS implements faster versions of a variety of functions
       scattered over the entire PPI codebase, from the Tokenizer Core,
       Quote Engine, and various other places, and implements them
       identically in XS/C.

       In particular, the skip-ahead methods from the Quote Engine would
       appear to be extremely amenable to being done in C, and a number of
       other functions could be cherry-picked one at a time and implemented
       in C.

       Each method is heavily tested to ensure that the functionality is
       identical, and a versioning mechanism is included to ensure that if
       a function gets out of sync, PPI::XS will degrade gracefully and
       just not replace that single method.

TO DO
       - Add an option to reset or seek the token stream...

       - Implement more Tokenizer functions in PPI::XS

SUPPORT
       See the support section in the main module.

AUTHOR
       Adam Kennedy <adamk@cpan.org>

COPYRIGHT
       Copyright 2001 - 2006 Adam Kennedy.

       This program is free software; you can redistribute it and/or modify
       it under the same terms as Perl itself.

       The full text of the license can be found in the LICENSE file
       included with this module.



perl v5.8.8                       2006-09-23                 PPI::Tokenizer(3)