Parse::Lex(3)         User Contributed Perl Documentation        Parse::Lex(3)

NAME
   "Parse::Lex" - Generator of lexical analyzers - moving pointer inside
   text

SYNOPSIS
       require 5.005;

       use Parse::Lex;
       @token = (
         qw(
            ADDOP    [-+]
            LEFTP    [\(]
            RIGHTP   [\)]
            INTEGER  [1-9][0-9]*
            NEWLINE  \n
           ),
         qw(STRING), [qw(" (?:[^"]+|"")* ")],
         qw(ERROR .*), sub {
           die qq!can't analyze: "$_[1]"!;
         }
       );

       Parse::Lex->trace;  # Class method
       $lexer = Parse::Lex->new(@token);
       $lexer->from(\*DATA);
       print "Tokenization of DATA:\n";

       TOKEN: while (1) {
         $token = $lexer->next;
         if (not $lexer->eoi) {
           print "Line $.\t";
           print "Type: ", $token->name, "\t";
           print "Content:->", $token->text, "<-\n";
         } else {
           last TOKEN;
         }
       }

       __END__
       1+2-5
       "a multiline
       string with an embedded "" in it"
       an invalid string with a "" in it"

DESCRIPTION
   The classes "Parse::Lex" and "Parse::CLex" create lexical analyzers.
   They use different analysis techniques:

   1. "Parse::Lex" steps through the analysis by moving a pointer within
      the character strings to be analyzed (use of "pos()" together with
      "\G"),

   2. "Parse::CLex" steps through the analysis by consuming the data
      recognized (use of "s///").

   Analyzers of the "Parse::CLex" class do not allow the use of anchoring
   in regular expressions. In addition, the subclasses of "Parse::Token"
   are not implemented for this type of analyzer.

   A lexical analyzer is specified by means of a list of tokens passed as
   arguments to the "new()" method. Tokens are instances of the
   "Parse::Token" class, which comes with "Parse::Lex". The definition of
   a token usually comprises two arguments: a symbolic name (like
   "INTEGER"), followed by a regular expression. If a sub ref (anonymous
   subroutine) is given as third argument, it is called when the token is
   recognized. Its arguments are the "Parse::Token" instance and the
   string recognized by the regular expression. The anonymous
   subroutine's return value is used as the new string contents of the
   "Parse::Token" instance.

   The order in which the lexical analyzer examines the regular
   expressions is determined by the order in which these expressions are
   passed as arguments to the "new()" method. The token returned by the
   lexical analyzer corresponds to the first regular expression which
   matches (this strategy is different from that used by Lex, which
   returns the longest match possible out of all that can be recognized).
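   Because the first match wins, rule order matters: a general pattern
   listed before a more specific one shadows it. A minimal sketch (the
   token names and input here are invented for illustration):

```perl
use Parse::Lex;

# IDENT is listed before the keyword rule IF, so the input "if" matches
# IDENT first and is never reported as IF; swapping the two rules would
# make the more specific IF rule win for this input.
my $lexer = Parse::Lex->new(
    qw(
       IDENT  [A-Za-z]+
       IF     if
      ));
$lexer->from("if x");
print $lexer->next->name, "\n";   # IDENT
```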

   The lexical analyzer can recognize tokens which span multiple records.
   If the definition of the token comprises more than one regular
   expression (placed within a reference to an anonymous array), the
   analyzer reads as many records as required to recognize the token (see
   the documentation for the "Parse::Token" class). When the start
   pattern is found, the analyzer looks for the end, and if necessary,
   reads more records. No backtracking is done in case of failure.

   The analyzer can be used to analyze an isolated character string or a
   stream of data coming from a file handle. At the end of the input data
   the analyzer returns a "Parse::Token" instance named "EOI" (End Of
   Input).

   Start Conditions
   You can associate start conditions with the token-recognition rules
   that comprise your lexical analyzer (this is similar to what Flex
   provides). When start conditions are used, the rule which succeeds is
   no longer necessarily the first rule that matches.

   A token symbol may be preceded by a start condition specifier for the
   associated recognition rule. For example:

       qw(C1:TERMINAL_1 REGEXP), sub { },  # associated action
       qw(TERMINAL_2 REGEXP), sub { },     # associated action

   Symbol "TERMINAL_1" will be recognized only if start condition "C1" is
   active. Start conditions are activated/deactivated using the
   "start(CONDITION_NAME)" and "end(CONDITION_NAME)" methods.

   "start('INITIAL')" resets the analysis automaton.

   Start conditions can be combined using AND/OR operators as follows:

       C1:SYMBOL       condition C1

       C1:C2:SYMBOL    condition C1 AND condition C2

       C1,C2:SYMBOL    condition C1 OR condition C2

   There are two types of start conditions: inclusive and exclusive, which
   are declared by the class methods "inclusive()" and "exclusive()"
   respectively. With an inclusive start condition, all rules are active
   regardless of whether or not they are qualified with the start
   condition. With an exclusive start condition, only the rules qualified
   with the start condition are active; all other rules are deactivated.

   Example (borrowed from the documentation of Flex):

       use Parse::Lex;
       @token = (
         'EXPECT', 'expect-floats', sub {
           $lexer->start('expect');
           $_[1]
         },
         'expect:FLOAT', '\d+\.\d+', sub {
           print "found a float: $_[1]\n";
           $_[1]
         },
         'expect:NEWLINE', '\n', sub {
           $lexer->end('expect');
           $_[1]
         },
         'NEWLINE2', '\n',
         'INT', '\d+', sub {
           print "found an integer: $_[1]\n";
           $_[1]
         },
         'DOT', '\.', sub {
           print "found a dot\n";
           $_[1]
         },
       );

       Parse::Lex->exclusive('expect');
       $lexer = Parse::Lex->new(@token);

   The special start condition "ALL" is always verified.

   Methods
   analyze EXPR
       Analyzes "EXPR" and returns a list of pairs consisting of a token
       name followed by the recognized text. "EXPR" can be a character
       string or a reference to a filehandle.

       Examples:

           @tokens = Parse::Lex->new(qw(PLUS [+] NUMBER \d+))->analyze("3+3+3");
           @tokens = Parse::Lex->new(qw(PLUS [+] NUMBER \d+))->analyze(\*STREAM);

   buffer EXPR
   buffer
       Returns the contents of the internal buffer of the lexical
       analyzer. With an expression as argument, places the result of the
       expression in the buffer.

       It is not advisable to change the contents of the buffer directly
       without also adjusting the position of the analysis pointer
       ("pos()") and the length of the buffer ("length()").

   configure(HASH)
       Instance method which permits specifying a lexical analyzer. This
       method accepts the following attribute-value pairs:

       From => EXPR
           This attribute plays the same role as the "from(EXPR)"
           method. "EXPR" can be a filehandle or a character string.

       Tokens => ARRAY_REF
           "ARRAY_REF" must contain the list of attribute values
           specifying the tokens to be recognized (see the
           documentation for "Parse::Token").

       Skip => REGEX
           This attribute plays the same role as the "skip(REGEX)"
           method. "REGEX" describes the patterns to skip over
           during the analysis.

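       A sketch of setting up an analyzer entirely through "configure()"
       (the token list, input and skip pattern here are illustrative):

```perl
use Parse::Lex;

my $lexer = Parse::Lex->new;
$lexer->configure(
    From   => "10 + 20",                       # same role as from()
    Tokens => [ qw(INTEGER \d+ ADDOP [-+]) ],  # same role as new()'s list
    Skip   => '\s+',                           # same role as skip()
);
```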
   end EXPR
       Deactivates condition "EXPR".

   eoi
       Returns TRUE when there is no more data to analyze.

   every SUB
       Avoids having to write a reading loop in order to analyze a
       stream of data. "SUB" is an anonymous subroutine executed after
       the recognition of each token. For example, to lex the string
       "1+2" you can write:

           use Parse::Lex;

           $lexer = Parse::Lex->new(
             qw(
                ADDOP   [-+]
                INTEGER \d+
               ));

           $lexer->from("1+2");
           $lexer->every(sub {
             print $_[0]->name, "\t";
             print $_[0]->text, "\n";
           });

       The first argument of the anonymous subroutine is the
       "Parse::Token" instance recognized.

   exclusive LIST
       Class method declaring the conditions present in LIST to be
       exclusive.

   flush
       If saving of the consumed strings is activated, "flush()" returns
       and clears the buffer containing the character strings recognized
       up to now. This is only useful if "hold()" has been called to
       activate saving of consumed strings.

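       A sketch combining "hold()" and "flush()" (the token names and
       input are invented for this example):

```perl
use Parse::Lex;

my $lexer = Parse::Lex->new(qw(INTEGER \d+ ADDOP [-+]));
$lexer->hold(1);              # activate saving of consumed strings
$lexer->from("1+2+3");
$lexer->next for 1 .. 3;      # consume the first three tokens
print $lexer->flush, "\n";    # print and empty the saved consumed text
```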
   from EXPR
   from
       "from(EXPR)" allows specifying the source of the data to be
       analyzed. The argument of this method can be a string (or list of
       strings), or a reference to a filehandle. If no argument is
       given, "from()" returns the filehandle if defined, or "undef" if
       input is a string. When an argument "EXPR" is used, the return
       value is the calling lexer object itself.

       By default it is assumed that data are read from "STDIN".

       Examples:

           $handle = new IO::File;
           $handle->open("< filename");
           $lexer->from($handle);

           $lexer->from(\*DATA);
           $lexer->from('the data to be analyzed');

   getSub
       "getSub" returns the anonymous subroutine that performs the
       lexical analysis.

       Example:

           my $token = '';
           my $sub = $lexer->getSub;
           while (($token = &$sub()) ne $Token::EOI) {
             print $token->name, "\t";
             print $token->text, "\n";
           }

           # or

           my $token = '';
           local *tokenizer = $lexer->getSub;
           while (($token = tokenizer()) ne $Token::EOI) {
             print $token->name, "\t";
             print $token->text, "\n";
           }

   getToken
       Same as the "token()" method.

   hold EXPR
   hold
       Activates/deactivates saving of the consumed strings. The return
       value is the current setting (TRUE or FALSE). Can be used as a
       class method.

       You can obtain the contents of the buffer using the "flush"
       method, which also empties the buffer.

   inclusive LIST
       Class method declaring the conditions present in LIST to be
       inclusive.

   length EXPR
   length
       Returns the length of the current record. "length EXPR" sets the
       length of the current record.

   line EXPR
   line
       Returns the line number of the current record. "line EXPR" sets
       the value of the line number. Always returns 1 if a character
       string is being analyzed. The "readline()" method increments the
       line number.

   name EXPR
   name
       "name EXPR" lets you give a name to the lexical analyzer.
       "name()" returns the value of this name.

   next
       Searches for the next token. Returns the recognized
       "Parse::Token" instance. Returns the "Token::EOI" instance at the
       end of the data.

       Examples:

           $lexer = Parse::Lex->new(@token);
           print $lexer->next->name;  # print the token type
           print $lexer->next->text;  # print the token content

   nextis SCALAR_REF
       Variant of the "next()" method. Tokens are placed in
       "SCALAR_REF". The method returns 1 as long as the token is not
       "EOI".

       Example:

           while ($lexer->nextis(\$token)) {
             print $token->text();
           }

   new LIST
       Creates and returns a new lexical analyzer. The argument of the
       method is a list of "Parse::Token" instances, or a list of
       triplets permitting their creation. The triplets consist of: the
       symbolic name of the token, the regular expression necessary for
       its recognition, and possibly an anonymous subroutine that is
       called when the token is recognized. For each triplet, an
       instance of type "Parse::Token" is created in the calling
       package.

   offset
       Returns the number of characters already consumed since the
       beginning of the analyzed data stream.

   pos EXPR
   pos
       "pos EXPR" sets the position of the beginning of the next token
       to be recognized in the current line (this doesn't work with
       analyzers of the "Parse::CLex" class). "pos()" returns the number
       of characters already consumed in the current line.

   readline
       Reads data from the input specified by the "from()" method.
       Returns the result of the reading.

       Example:

           use Parse::Lex;

           $lexer = Parse::Lex->new();
           while (not $lexer->eoi) {
             print $lexer->readline();  # read and print one line
           }

   reset
       Clears the internal buffer of the lexical analyzer and erases all
       tokens already recognized.

   restart
       Reinitializes the analysis automaton. The only active condition
       becomes the condition "INITIAL".

   setToken TOKEN
       Sets the token to "TOKEN". Useful to requalify a token inside the
       anonymous subroutine associated with this token.

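       A sketch of requalifying inside an action: an identifier that
       turns out to be a reserved word is replaced by another token (the
       %keywords hash and the KEYWORD token are invented for this
       example):

```perl
my %keywords = map { $_ => 1 } qw(if else while);
my $KEYWORD  = Parse::Token->new('KEYWORD', '\w+');

# In the token list passed to new():
qw(IDENT [A-Za-z_]\w*), sub {
    my ($token, $text) = @_;
    # Requalify the recognized token when the text is a reserved word.
    $token->lexer->setToken($KEYWORD) if $keywords{$text};
    $text;
},
```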
   skip EXPR
   skip
       "EXPR" is a regular expression defining the token separator
       pattern (by default "[ \t]+"). "skip('')" sets this to no
       pattern. With no argument, "skip()" returns the value of the
       pattern. "skip()" can be used as a class method.

       Changing the skip pattern causes recompilation of the lexical
       analyzer.

       Example:

           Parse::Lex->skip('\s*#(?s:.*)|\s+');
           @tokens = Parse::Lex->new('INTEGER' => '\d+')->analyze(\*DATA);
           print "@tokens\n"; # prints: INTEGER 1 INTEGER 2 INTEGER 3 INTEGER 4 EOI
           __END__
           1 # first string to skip
           2
           3# second string to skip
           4

   start EXPR
       Activates condition EXPR.

   state EXPR
       Returns the state of the condition represented by EXPR.

   token
       Returns the instance corresponding to the last recognized token.
       In case no token was recognized, returns the special token named
       "DEFAULT".

   tokenClass EXPR
   tokenClass
       Indicates which class the tokens created from the list passed to
       the "new()" method belong to. If no argument is given, returns
       the name of the class. By default the class is "Parse::Token".

   trace OUTPUT
   trace
       Class method which activates trace mode. The activation of trace
       mode must take place before the creation of the lexical analyzer.
       The mode can then be deactivated by another call of this method.

       "OUTPUT" can be a file name or a reference to a filehandle where
       the trace will be redirected.

ERROR HANDLING
   To handle the cases of token non-recognition, you can define a
   specific token at the end of the list of tokens that comprise your
   lexical analyzer. If searching for this token succeeds, it is then
   possible to call an error handling function:

       qw(ERROR (?s:.*)), sub {
         print STDERR "ERROR: buffer content->", $_[0]->lexer->buffer, "<-\n";
         die qq!can't analyze: "$_[1]"!;
       }

EXAMPLES
   ctokenizer.pl - Scan a stream of data using the "Parse::CLex" class.

   tokenizer.pl - Scan a stream of data using the "Parse::Lex" class.

   every.pl - Use of the "every" method.

   sexp.pl - Interpreter for prefix arithmetic expressions.

   sexpcond.pl - Interpreter for prefix arithmetic expressions, using
   conditions.

BUGS
   Analyzers of the "Parse::CLex" class do not allow the use of regular
   expressions with anchoring.

SEE ALSO
   "Parse::Token", "Parse::LexEvent", "Parse::YYLex".

AUTHOR
   Philippe Verdret. Documentation translated to English by Vladimir
   Alexiev and Ocrat.

ACKNOWLEDGMENTS
   Version 2.0 owes much to suggestions made by Vladimir Alexiev. Ocrat
   has significantly contributed to improving this documentation. Thanks
   also to the numerous people who have sent me bug reports and
   occasionally fixes.

REFERENCES
   Friedl, J.E.F. - Mastering Regular Expressions. O'Reilly & Associates,
   1996.

   Mason, T. & Brown, D. - Lex & Yacc. O'Reilly & Associates, 1990.

   FLEX - A scanner generator (available at ftp://ftp.ee.lbl.gov/ and
   elsewhere).

COPYRIGHT
   Copyright (c) 1995-1999 Philippe Verdret. All rights reserved. This
   module is free software; you can redistribute it and/or modify it
   under the same terms as Perl itself.

perl v5.12.0                      2010-03-26                    Parse::Lex(3)