Parse::Token(3)       User Contributed Perl Documentation      Parse::Token(3)

NAME
    "Parse::Token" - Definition of tokens used by "Parse::Lex"

SYNOPSIS
        require 5.005;

        use Parse::Lex;
        @token = qw(
            ADDOP    [-+]
            INTEGER  [1-9][0-9]*
        );

        # Creating the lexer also creates the tokens $ADDOP and $INTEGER
        $lexer = Parse::Lex->new(@token);
        $lexer->from(\*DATA);

        # next() returns the text found; status() tells whether it matched
        $content = $INTEGER->next;
        if ($INTEGER->status) {
            print "$content\n";
        }
        $content = $ADDOP->next;
        if ($ADDOP->status) {
            print "$content\n";
        }
        # isnext() returns the status and stores the text in $content
        if ($INTEGER->isnext(\$content)) {
            print "$content\n";
        }
        __END__
        1+2

DESCRIPTION
    The "Parse::Token" class and its derived classes define the tokens
    used by "Parse::Lex" or "Parse::LexEvent".

    Tokens are created with the "new()" or "factory()" methods. The
    "new()" method of the "Parse::Lex" package also creates, indirectly,
    instances of the tokens to be recognized.

    The "next()" and "isnext()" methods of the "Parse::Token" package
    interface the lexical analyzer with a recursive-descent parser. For
    interfacing with "byacc", see the "Parse::YYLex" package.

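As a rough illustration of how a recursive-descent parser drives tokens
through "next()", "status()" and "isnext()", here is a minimal sketch. It
does not use "Parse::Lex" at all: the "Sketch::Token" package, its "from()"
class method and the tiny grammar are hypothetical stand-ins that only
imitate the calling style described above.

```perl
use strict;
use warnings;

package Sketch::Token;    # hypothetical stand-in, NOT Parse::Token itself

my $buffer = '';          # character stream shared by all tokens

# Class method: set the text to analyze (assumption of this sketch)
sub from { $buffer = $_[1]; pos($buffer) = 0 }

sub new {
    my ($class, $name, $regex) = @_;
    return bless { name => $name, regex => qr/$regex/, status => 0 }, $class;
}

sub status { $_[0]{status} }

# Try to consume the lexeme at the current position of the stream
sub next {
    my ($self) = @_;
    my $regex = $self->{regex};
    if ($buffer =~ /\G\s*($regex)/gc) {
        $self->{status} = 1;
        return $1;
    }
    $self->{status} = 0;
    return;
}

# Like next(), but returns the status and stores the text in $$ref
sub isnext {
    my ($self, $ref) = @_;
    my $text = $self->next;
    $$ref = $text if $self->{status} and ref $ref eq 'SCALAR';
    return $self->{status};
}

package main;

my $INTEGER = Sketch::Token->new(INTEGER => '[1-9][0-9]*');
my $ADDOP   = Sketch::Token->new(ADDOP   => '[-+]');

# expr ::= INTEGER ( ADDOP INTEGER )*
sub expr {
    my $value = $INTEGER->next;
    die "integer expected\n" unless $INTEGER->status;
    my ($op, $rhs);
    while ($ADDOP->isnext(\$op)) {
        $INTEGER->isnext(\$rhs) or die "integer expected after '$op'\n";
        $value = $op eq '+' ? $value + $rhs : $value - $rhs;
    }
    return $value;
}

Sketch::Token->from('1 + 2 - 10 + 42');
my $result = expr();
print "$result\n";    # 35
```

Each grammar rule becomes one subroutine, and each token test becomes one
"isnext()" call, which is why this token-level interface suits hand-written
parsers.
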
    "Parse::Token" is loaded indirectly by "use Parse::Lex" or
    "use Parse::LexEvent".

Methods
    action
        Returns the anonymous subroutine defined within the "Parse::Token"
        object.

    factory LIST
    factory ARRAY_REF
        The "factory(LIST)" method creates a list of tokens from a list of
        specifications. Each specification gives a name, a regular
        expression, and optionally an anonymous subroutine. The list can
        also include objects of class "Parse::Token" or of a class derived
        from it.

        The "factory(ARRAY_REF)" method creates tokens from specifications
        given as attribute-value pairs:

            Parse::Token->factory([Type  => 'Simple',
                                   Name  => 'EXAMPLE',
                                   Regex => '.+']);

        "Type" indicates the type of each token to be created, without the
        "Parse::Token::" package prefix.

        "factory()" creates a series of tokens but does not import these
        tokens into the calling package.

        You could for example write:

            %keywords =
              qw (
                  PROC    undef
                  FUNC    undef
                  RETURN  undef
                  IF      undef
                  ELSE    undef
                  WHILE   undef
                  PRINT   undef
                  READ    undef
                 );
            @tokens = Parse::Token->factory(%keywords);

        and install these tokens in a symbol table in the following manner:

            # Ask each token for its own name, so that the pairing does
            # not depend on the order in which the hash is traversed.
            foreach $token (@tokens) {
                $name = $token->name;
                ${$name} = $token;
                $symbol{"\L$name"} = [${$name}, ''];
            }

        "${$name}" is the token instance.

        During the lexical analysis phase, you can use the tokens in the
        following manner:

            qw(IDENT [a-zA-Z][a-zA-Z0-9_]*),  sub {
                $symbol{$_[1]} = [] unless defined $symbol{$_[1]};
                my $type = $symbol{$_[1]}[0];
                $lexer->setToken((not defined $type) ? $VAR : $type);
                $_[1];  # THE TOKEN TEXT
            }

        Any symbol of unknown type is thus treated as a variable.

        This example uses $_[1], which holds the text recognized by the
        regular expression. The anonymous subroutine must return the text
        to be associated with the token.

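The symbol-table lookup used above can be tried in isolation. In the
following sketch the token instances are replaced by plain strings, and the
"classify" helper is a hypothetical name; only the lookup logic mirrors the
example.

```perl
use strict;
use warnings;

# Plain strings stand in for the Parse::Token instances (assumption).
my @keywords = qw(PROC FUNC RETURN IF ELSE WHILE PRINT READ);
my %symbol   = map { lc($_) => [$_, ''] } @keywords;

# Return the registered type of an identifier, or 'VAR' if it has none.
sub classify {
    my ($text) = @_;
    $symbol{$text} = [] unless defined $symbol{$text};
    my $type = $symbol{$text}[0];
    return defined $type ? $type : 'VAR';
}

my @types = map { classify($_) } qw(while counter print);
print "@types\n";    # WHILE VAR PRINT
```

Note that an unknown identifier is entered into the table with an empty
entry on first sight, exactly as in the anonymous subroutine above.
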
    get EXPR
        "get" returns the value of the attribute named by the result of
        evaluating EXPR. You can also use the name of the attribute as a
        method name.

    getText
        Returns the character string that was recognized by means of this
        "Parse::Token" object.

        Same as the "text()" method.

    isnext EXPR
    isnext
        Returns the status of the token. The consumed string is stored in
        EXPR if EXPR is a reference to a scalar.

    name
        Returns the name of the token.

    next
        Activates the search for the lexeme defined by the regular
        expression contained in the object. If this lexeme is recognized
        in the character stream being analyzed, "next" returns the string
        found and sets the status of the object to true.

    new SYMBOL_NAME, REGEXP, SUB
    new SYMBOL_NAME, REGEXP
        Creates an object of type "Parse::Token::Simple" or
        "Parse::Token::Segmented". The arguments of the "new()" method are,
        in order: a symbolic name, a regular expression, and optionally an
        anonymous subroutine. The subclasses of "Parse::Token" also accept
        token specifications given as a list of attribute-value pairs.

        REGEXP is either a simple regular expression or a reference to an
        array containing from one to three regular expressions. In the
        first case, the instance belongs to the "Parse::Token::Simple"
        class. In the second case, it belongs to the
        "Parse::Token::Segmented" class. Tokens of this type recognize
        delimited structures such as character strings within quotation
        marks, comments in a C program, etc. The regular expressions are
        used to recognize:

        1. The beginning of the lexeme.

        2. The "body" of the lexeme. If this second expression is missing,
           "Parse::Lex" uses "(?:.*?)".

        3. The end of the lexeme. If this last expression is missing, the
           first one is used. (Note! The end of the lexeme cannot span
           several lines.)

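One way to picture the three segments and their defaults is as a single
pattern assembled from the parts. The sketch below is plain Perl; the helper
name "segmented_pattern" is hypothetical, and the way "Parse::Lex" actually
combines the expressions internally may well differ.

```perl
use strict;
use warnings;

# Build one pattern from 1 to 3 segment expressions, applying the
# defaults listed above (hypothetical helper, not part of Parse::Lex).
sub segmented_pattern {
    my ($start, $body, $end) = @_;
    $body = '(?:.*?)' unless defined $body;   # default body
    $end  = $start    unless defined $end;    # default end: same as start
    return qr/$start$body$end/;
}

my $percent = segmented_pattern('%');                        # %...%
my $comment = segmented_pattern('/[*]', '(?s:.*?)', '[*]/'); # /*...*/

my ($first)  = 'set %HOME% and %PATH%'   =~ /($percent)/;
my ($second) = "a /* two\nlines */ b"    =~ /($comment)/;
print "$first\n";     # %HOME%
print "$second\n";
```

The non-greedy default body stops at the first possible end delimiter, which
is why "%HOME%" is found rather than the whole "%HOME% and %PATH%".
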
        Example:

            qw(STRING), [qw(" (?:[^"\\\\]+|\\\\(?:.|\n))* ")],

        These regular expressions recognize multi-line strings delimited
        by quotation marks, where the backslash is used to quote the
        quotation marks appearing within the string. Notice the
        quadrupling of the backslash.

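The reason for the quadrupling is that "qw()" itself turns each pair of
backslashes into one, and the regular expression engine then needs two
backslashes to match a single literal one. The following plain-Perl sketch
shows both steps. Concatenating the three parts into one pattern is only an
illustration, not a statement about how "Parse::Lex" uses them.

```perl
use strict;
use warnings;

my ($start, $body, $end) = qw(" (?:[^"\\\\]+|\\\\(?:.|\n))* ");

# After qw(), every \\\\ has become \\ : this is what the regex engine sees.
print "$body\n";    # (?:[^"\\]+|\\(?:.|\n))*

my $string  = qr/$start$body$end/;
my $text    = 'say "a \"b\" c" now';    # single quotes keep the backslashes
my ($match) = $text =~ /($string)/;
print "$match\n";   # "a \"b\" c"
```
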
        Here is a variation of the previous example which uses the "s"
        option so that "." also matches newline:

            qw(STRING), [qw(" (?s:[^"\\\\]+|\\\\.)* ")],

        (Note: it is possible to write regular expressions that execute
        faster, but that is not the objective of this example. See
        Mastering Regular Expressions.)

        The anonymous subroutine is called when the lexeme is recognized by
        the lexical analyzer. This subroutine takes two arguments: $_[0]
        contains the token instance, and $_[1] contains the string
        recognized by the regular expression. The scalar returned by the
        anonymous subroutine becomes the character string stored in the
        token instance.

        In the anonymous subroutine you can use the positional variables
        $1, $2, etc., which correspond to the groups of parentheses in the
        regular expression.

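The calling convention can be imitated in plain Perl to see why $1 and $2
are usable inside the subroutine. In this sketch the %token hash and the
ASSIGN pattern are hypothetical stand-ins; no "Parse::Token" object is
involved.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a token; NOT a Parse::Token object.
my %token = (name => 'ASSIGN', regex => qr/(\w+)=(\w+)/, text => undef);

# Same convention as described above:
# $_[0] is the token, $_[1] the recognized string.
my $action = sub {
    my ($token, $text) = @_;
    # $1 and $2 still refer to the groups of the match that triggered
    # the call, because the subroutine performs no match of its own.
    return "$1 := $2";    # the returned scalar becomes the token text
};

my $regex = $token{regex};
if ('x=42;' =~ /\A$regex/) {
    $token{text} = $action->(\%token, $&);
}
print "$token{text}\n";    # x := 42
```
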
    regexp
        Returns the regular expression of the "Parse::Token" object.

    set LIST
        Marks a token with a list of attribute-value pairs.

        An attribute name can be used as a method name.

    setText EXPR
        The value of "EXPR" defines the character string associated with
        the lexeme.

        Same as the "text(EXPR)" method.

    status EXPR
    status
        Indicates whether the last search for the lexeme succeeded or
        failed. "status EXPR" overrides the existing value and sets it to
        the value of EXPR.

    text EXPR
    text
        "text()" returns the character string recognized by means of the
        token. "text(EXPR)" sets the character string associated with the
        lexeme to the value of "EXPR".

    trace OUTPUT
    trace
        Class method which activates or deactivates a trace of the lexical
        analysis.

        "OUTPUT" can be a file name or a reference to a filehandle to which
        the trace will be directed.

Subclasses of the Parse::Token class
    Subclasses of the "Parse::Token" class are being defined. They
    recognize specific structures such as, for example, strings within
    double quotes, C comments, etc. Here are the subclasses I am working
    on:

    "Parse::Token::Simple" : tokens of this class are defined by means of a
    single regular expression.

    "Parse::Token::Segmented" : tokens of this class are defined by means
    of three regular expressions. Reading of new data is done
    automatically.

    "Parse::Token::Delimited" : recognizes, for example, C language
    comments.

    "Parse::Token::Quoted" : recognizes, for example, character strings
    within quotation marks.

    "Parse::Token::Nested" : recognizes nested structures such as
    parenthesized expressions. NOT DEFINED.

    These classes were created recently and no doubt contain some bugs.

  Parse::Token::Action
    Tokens of the "Parse::Token::Action" class insert arbitrary Perl
    expressions into a lexical analyzer. An expression can be used, for
    instance, to print out internal variables of the analyzer:

    ·   $LEX_BUFFER : contents of the buffer to be analyzed

    ·   $LEX_LENGTH : length of the character string being analyzed

    ·   $LEX_RECORD : number of the record being analyzed

    ·   $LEX_OFFSET : number of characters already consumed since the
        start of the analysis

    ·   $LEX_POS : position reached by the analysis, as a number of
        characters since the start of the buffer

    The class constructor accepts the following attributes:

    ·   "Name" : the name of the token

    ·   "Expr" : a Perl expression

    Example:

        $ACTION = Parse::Token::Action->new(
            Name => 'ACTION',
            Expr => q!print "LEX_POS: $LEX_POS\n" .
                     "LEX_BUFFER: $LEX_BUFFER\n" .
                     "LEX_LENGTH: $LEX_LENGTH\n" .
                     "LEX_RECORD: $LEX_RECORD\n" .
                     "LEX_OFFSET: $LEX_OFFSET\n"
                     ;!,
        );

  Parse::Token::Simple
    The class constructor accepts the following attributes:

    ·   "Handler" : the value is the name of a function to call during an
        analysis performed by an analyzer of class "Parse::LexEvent".

    ·   "Name" : the associated value is the name of the token.

    ·   "Regex" : the associated value is a regular expression
        corresponding to the pattern to be recognized.

    ·   "ReadMore" : if the associated value is 1, the recognition of the
        token continues after reading a new record. The strings recognized
        are concatenated. This attribute only has an effect during
        analysis of a character stream.

    ·   "Sub" : the associated value must be an anonymous subroutine to be
        executed after the token is recognized. This function is only used
        with analyzers of class "Parse::Lex" or "Parse::CLex".

    Example:

        Parse::Token::Simple->new(Name     => 'remainder',
                                  Regex    => '[^/\'\"]+',
                                  ReadMore => 1);

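The effect described for "ReadMore" can be pictured with a small loop over
records, written here in plain Perl. This is only one reading of the
description above: the record list is made up, and the module's real
buffering logic is not shown in this document.

```perl
use strict;
use warnings;

my $regex   = qr/[^\/'"]+/;    # same pattern as 'remainder' above
my @records = ("alpha beta\n", "gamma", "/rest\n");   # made-up records

my $text = '';
for my $record (@records) {
    last unless $record =~ /\G($regex)/gc;   # token does not continue here
    $text .= $1;                             # concatenate the pieces
    last if pos($record) < length $record;   # token ended inside the record
}
print $text, "\n";
```

The token text ends up as "alpha beta", a newline, and "gamma": the match
reached the end of the first two records, so reading continued, and it
stopped at the "/" that opens the third one.
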
  Parse::Token::Segmented
    The definition of these tokens includes three regular expressions.
    During analysis of a data stream, new data is read as long as the end
    of the token has not been reached.

    The class constructor accepts the following attributes:

    ·   "Handler" : the value is the name of a function to call during an
        analysis performed by an analyzer of class "Parse::LexEvent".

    ·   "Name" : the associated value is the name of the token.

    ·   "Regex" : the associated value must be a reference to an array that
        contains three regular expressions.

    ·   "Sub" : the associated value must be an anonymous subroutine to be
        executed after the token is recognized. This function is only used
        with analyzers of class "Parse::Lex" or "Parse::CLex".

  Parse::Token::Quoted
    "Parse::Token::Quoted" is a subclass of "Parse::Token::Segmented". It
    recognizes character strings within double quotes or single quotes.

    Examples:

        ---------------------------------------------------------
         Start      End        Escaping
        ---------------------------------------------------------
          '          '          ''
          "          "          ""
          "          "          \
        ---------------------------------------------------------

    The class constructor accepts the following attributes:

    ·   "End" : the associated value is a regular expression that
        recognizes the end of the token.

    ·   "Escape" : the associated value indicates the character used to
        escape the delimiter. By default, a double occurrence of the
        terminating character escapes that character.

    ·   "Handler" : the value is the name of a function to call during an
        analysis performed by an analyzer of class "Parse::LexEvent".

    ·   "Name" : the associated value is the name of the token.

    ·   "Start" : the associated value is a regular expression that
        recognizes the start of the token.

    ·   "Sub" : the associated value must be an anonymous subroutine to be
        executed after the token is recognized. This function is only used
        with analyzers of class "Parse::Lex" or "Parse::CLex".

    Example:

        Parse::Token::Quoted->new(Name    => 'squotes',
                                  Handler => 'string',
                                  Escape  => '\\',
                                  Quote   => qq!\'!,
                                 );

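The two escaping conventions in the table (doubled delimiter and backslash)
correspond to two familiar regular-expression shapes. The plain-Perl sketch
below writes them out by hand for comparison. These patterns are
illustrative assumptions and are not claimed to be the ones the class
generates.

```perl
use strict;
use warnings;

# Doubled delimiter: a quote inside the string is written twice.
my $doubled   = qr/'(?:[^']|'')*'/;
# Backslash escape: a backslash quotes the character that follows it.
my $backslash = qr/"(?:[^"\\]|\\.)*"/s;

my ($sql) = q{x = 'it''s' and y}      =~ /($doubled)/;
my ($c)   = q{puts("say \"hi\"");}    =~ /($backslash)/;
print "$sql\n";    # 'it''s'
print "$c\n";      # "say \"hi\""
```
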
  Parse::Token::Delimited
    "Parse::Token::Delimited" is a subclass of "Parse::Token::Segmented".
    It recognizes, for example, C language comments.

    Examples:

        -------------------------------------------------------------------
         Start   End    Constraint        Structure
                        on the contents
        -------------------------------------------------------------------
         /*      */                       C comment
         <!--    -->    No '--'           XML comment
         <!--    -->                      SGML comment
         <?      ?>                       Processing instruction in
                                          SGML/XML
        -------------------------------------------------------------------

    The class constructor accepts the following attributes:

    ·   "End" : the associated value is a regular expression that
        recognizes the end of the token.

    ·   "Handler" : the value is the name of a function to call during an
        analysis performed by an analyzer of class "Parse::LexEvent".

    ·   "Name" : the associated value is the name of the token.

    ·   "Start" : the associated value is a regular expression that
        recognizes the start of the token.

    ·   "Sub" : the associated value must be an anonymous subroutine to be
        executed after the token is recognized. This function is only used
        with analyzers of class "Parse::Lex" or "Parse::CLex".

    Example:

        Parse::Token::Delimited->new(Name  => 'comment',
                                     Start => '/[*]',
                                     End   => '[*]/'
                                    );

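To make the "constraint on the contents" column concrete, the following
plain-Perl sketch contrasts a C comment, whose body is unconstrained, with
an XML comment, whose body must not contain "--". The patterns are
hand-written illustrations. They reuse the Start and End expressions of the
example above but are not taken from the module.

```perl
use strict;
use warnings;

# C comment: any content, possibly spanning lines, up to the first "*/".
my $c_comment   = qr{/[*](?s:.*?)[*]/};
# XML comment: same idea, but the content must not contain "--".
my $xml_comment = qr{<!--(?s:(?:(?!--).)*)-->};

my @c   = "a; /* one */ b; /* two\nlines */" =~ /($c_comment)/g;
my $ok  = '<!-- fine -->'           =~ /\A$xml_comment\z/ ? 1 : 0;
my $bad = '<!-- not -- allowed -->' =~ /$xml_comment/     ? 1 : 0;
print scalar(@c), " $ok $bad\n";    # 2 1 0
```
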
  Parse::Token::Nested - Not defined
    Examples:

        ----------------------------------------------------------
         Start      End
        ----------------------------------------------------------
          (          )          Symbolic expressions
          {          }          Rich Text Format groups
        ----------------------------------------------------------

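Since this class is not defined, nested structures have to be recognized by
other means for now. A regular expression with fixed start and end
delimiters cannot count nesting depth, so one simple approach is an
explicit depth counter. The plain-Perl sketch below shows the idea. The
"nested_span" function is a hypothetical helper and not part of the
distribution.

```perl
use strict;
use warnings;

# Return the first balanced $open ... $close span in $text, or undef.
# $open and $close are single characters (assumption of this sketch).
sub nested_span {
    my ($text, $open, $close) = @_;
    my $begin = index($text, $open);
    return undef if $begin < 0;
    my $depth = 0;
    for my $i ($begin .. length($text) - 1) {
        my $char = substr($text, $i, 1);
        $depth++ if $char eq $open;
        $depth-- if $char eq $close;
        return substr($text, $begin, $i - $begin + 1) if $depth == 0;
    }
    return undef;    # unbalanced: the end was never reached
}

my $sexp = nested_span('x (a (b c) (d)) y', '(', ')');
my $rtf  = nested_span('{\b bold {\i both}} tail', '{', '}');
print "$sexp\n";    # (a (b c) (d))
print "$rtf\n";     # {\b bold {\i both}}
```
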
BUGS
    The implementation of token subclasses is not complete for analyzers
    of the "Parse::CLex" class. I am not too keen to do it, since an
    implementation for classes "Parse::Lex" and "Parse::LexEvent" seems
    quite sufficient.

AUTHOR
    Philippe Verdret. Documentation translated to English by Vladimir
    Alexiev and Ocrat.

ACKNOWLEDGMENTS
    Version 2.0 owes much to suggestions made by Vladimir Alexiev. Ocrat
    has significantly contributed to improving this documentation. Thanks
    also to the numerous people who have made comments or sent bug fixes.

REFERENCES
    Friedl, J.E.F. Mastering Regular Expressions. O'Reilly & Associates,
    1996.

    Mason, T. & Brown, D. Lex & Yacc. O'Reilly & Associates, Inc., 1990.

COPYRIGHT
    Copyright (c) 1995-1999 Philippe Verdret. All rights reserved. This
    module is free software; you can redistribute it and/or modify it under
    the same terms as Perl itself.

perl v5.12.0                      2010-03-26                  Parse::Token(3)