Parse::Lex(3)         User Contributed Perl Documentation         Parse::Lex(3)

NAME
"Parse::Lex" - Generator of lexical analyzers - moving pointer inside text

SYNOPSIS
    require 5.005;

    use Parse::Lex;
    @token = (
      qw(
         ADDOP    [-+]
         LEFTP    [\(]
         RIGHTP   [\)]
         INTEGER  [1-9][0-9]*
         NEWLINE  \n

        ),
      qw(STRING),   [qw(" (?:[^"]+|"")* ")],
      qw(ERROR  .*), sub {
        die qq!can\'t analyze: "$_[1]"!;
      }
     );

    Parse::Lex->trace;  # Class method
    $lexer = Parse::Lex->new(@token);
    $lexer->from(\*DATA);
    print "Tokenization of DATA:\n";

    TOKEN:while (1) {
      $token = $lexer->next;
      if (not $lexer->eoi) {
        print "Line $.\t";
        print "Type: ", $token->name, "\t";
        print "Content:->", $token->text, "<-\n";
      } else {
        last TOKEN;
      }
    }

    __END__
    1+2-5
    "a multiline
    string with an embedded "" in it"
    an invalid string with a "" in it"

DESCRIPTION
The classes "Parse::Lex" and "Parse::CLex" create lexical analyzers.
They use different analysis techniques:

1.  "Parse::Lex" steps through the analysis by moving a pointer within
    the character strings to be analyzed (use of "pos()" together with
    "\G"),

2.  "Parse::CLex" steps through the analysis by consuming the data
    recognized (use of "s///").
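
The two techniques listed above can be illustrated in plain Perl. This is only a sketch of the general idea, not the code used inside either class; the rule table and the names "lex_by_pointer" and "lex_by_consuming" are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical two-rule token table shared by both sketches.
my @rules = ( [ INTEGER => qr/\d+/ ], [ ADDOP => qr/[-+]/ ] );

# Technique 1: leave the string intact and advance pos() using \G and /gc.
sub lex_by_pointer {
    my ($text) = @_;
    my @found;
    pos($text) = 0;
    SCAN: while ( pos($text) < length($text) ) {
        for my $rule (@rules) {
            my ( $name, $re ) = @$rule;
            if ( $text =~ /\G($re)/gc ) {
                push @found, "$name:$1";
                next SCAN;
            }
        }
        last;    # no rule matches at the current position
    }
    return @found;
}

# Technique 2: remove each recognized token from the front of the string.
sub lex_by_consuming {
    my ($text) = @_;
    my @found;
    SCAN: while ( length($text) ) {
        for my $rule (@rules) {
            my ( $name, $re ) = @$rule;
            if ( $text =~ s/^($re)// ) {
                push @found, "$name:$1";
                next SCAN;
            }
        }
        last;    # no rule matches the remaining text
    }
    return @found;
}

my @by_pointer   = lex_by_pointer('1+22-3');
my @by_consuming = lex_by_consuming('1+22-3');
print "@by_pointer\n@by_consuming\n";
```

Both subroutines produce the same token list here. The difference is that the first keeps the input string unchanged and only moves a position, while the second shortens the string as it goes.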

Analyzers of the "Parse::CLex" class do not allow the use of anchoring
in regular expressions. In addition, the subclasses of "Parse::Token"
are not implemented for this type of analyzer.

A lexical analyzer is specified by means of a list of tokens passed as
arguments to the "new()" method. Tokens are instances of the
"Parse::Token" class, which comes with "Parse::Lex". The definition of
a token usually comprises two arguments: a symbolic name (like
"INTEGER"), followed by a regular expression. If a sub ref (anonymous
subroutine) is given as a third argument, it is called when the token
is recognized. Its arguments are the "Parse::Token" instance and the
string recognized by the regular expression. The anonymous
subroutine's return value is used as the new string contents of the
"Parse::Token" instance.

The order in which the lexical analyzer examines the regular
expressions is determined by the order in which these expressions are
passed as arguments to the "new()" method. The token returned by the
lexical analyzer corresponds to the first regular expression which
matches (this strategy is different from that used by Lex, which
returns the longest match possible out of all that can be recognized).
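
The practical effect of the first-match strategy can be seen with a small plain-Perl sketch. It does not use "Parse::Lex"; the subroutine "first_match" and the two rule lists are hypothetical and exist only to show why rule order matters.

```perl
use strict;
use warnings;

# Return the name and text of the first rule, in list order, that
# matches at the start of $text. Rules are given as name/regex pairs.
sub first_match {
    my ( $text, @rules ) = @_;
    while ( my ( $name, $re ) = splice @rules, 0, 2 ) {
        return ( $name, $1 ) if $text =~ /\A($re)/;
    }
    return;
}

my @int_first   = ( INT   => '\d+',      FLOAT => '\d+\.\d+' );
my @float_first = ( FLOAT => '\d+\.\d+', INT   => '\d+' );

my @r1 = first_match( '3.14', @int_first );      # INT is tried first
my @r2 = first_match( '3.14', @float_first );    # FLOAT is tried first
print "@r1\n@r2\n";
```

With INT listed first the input "3.14" yields INT with text "3"; with FLOAT listed first it yields FLOAT with text "3.14". A longest-match scanner would return the FLOAT in both cases, so more specific rules generally need to be listed before more general ones.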

The lexical analyzer can recognize tokens which span multiple records.
If the definition of the token comprises more than one regular
expression (placed within a reference to an anonymous array), the
analyzer reads as many records as required to recognize the token (see
the documentation for the "Parse::Token" class). When the start
pattern is found, the analyzer looks for the end, and if necessary,
reads more records. No backtracking is done in case of failure.
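
The record-by-record behavior described above can be pictured with the following plain-Perl sketch. It is an illustration under simplifying assumptions, not the module's implementation: the input records and the name "read_string_token" are hypothetical, and the text left in the buffer after the closing quote is simply dropped.

```perl
use strict;
use warnings;

# Hypothetical input: the quoted string spans the first two records.
my @records = (
    qq{"a multiline\n},
    qq{string with "" inside" tail\n},
    qq{unused\n},
);

# Recognize a double-quoted string in which "" stands for an embedded quote.
sub read_string_token {
    my ($records) = @_;
    my $buffer = shift @$records;
    return unless defined $buffer && $buffer =~ /\A"/;    # start pattern
    my $text;
    # Look for the body and the end pattern; read more records if needed.
    until ( ($text) = $buffer =~ /\A"((?:[^"]|"")*)"/ ) {
        return unless @$records;    # input exhausted: fail without backtracking
        $buffer .= shift @$records;
    }
    return $text;
}

my $string = read_string_token( \@records );
print "$string\n";
```

Only the two records needed to close the string are consumed; the third record is still available for the next token.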

The analyzer can be used to analyze an isolated character string or a
stream of data coming from a file handle. At the end of the input data
the analyzer returns a "Parse::Token" instance named "EOI" (End Of
Input).

Start Conditions
You can associate start conditions with the token-recognition rules
that comprise your lexical analyzer (this is similar to what Flex
provides). When start conditions are used, the rule which succeeds is
no longer necessarily the first rule that matches.

A token symbol may be preceded by a start condition specifier for the
associated recognition rule. For example:

    qw(C1:TERMINAL_1  REGEXP), sub { ... },   # associated action
    qw(TERMINAL_2  REGEXP),    sub { ... },   # associated action

Symbol "TERMINAL_1" will be recognized only if start condition "C1" is
active. Start conditions are activated and deactivated using the
"start(CONDITION_NAME)" and "end(CONDITION_NAME)" methods.

"start('INITIAL')" resets the analysis automaton.

Start conditions can be combined using AND/OR operators as follows:

    C1:SYMBOL      condition C1

    C1:C2:SYMBOL   condition C1 AND condition C2

    C1,C2:SYMBOL   condition C1 OR condition C2
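
The three specifier forms above can be read as a small boolean expression. The following plain-Perl sketch evaluates such a specifier against a set of active conditions. The names "parse_specifier" and "rule_is_active" are hypothetical, and treating a mixed specifier such as "C1,C2:C3:SYMBOL" as an AND of OR-groups is an assumption of this sketch. It also ignores the distinction between inclusive and exclusive conditions.

```perl
use strict;
use warnings;

# Split "C1,C2:C3:SYMBOL" into the symbol and its condition groups.
# Colons separate groups (AND); commas separate alternatives (OR).
sub parse_specifier {
    my ($spec) = @_;
    my @parts  = split /:/, $spec;
    my $symbol = pop @parts;
    my @groups = map { [ split /,/, $_ ] } @parts;
    return ( $symbol, @groups );
}

# A rule is considered active when every group contains at least one
# active condition. A rule without any specifier is always active here.
sub rule_is_active {
    my ( $spec, %active ) = @_;
    my ( $symbol, @groups ) = parse_specifier($spec);
    for my $group (@groups) {
        return 0 unless grep { $active{$_} } @$group;
    }
    return 1;
}

for my $spec ( 'C1:SYMBOL', 'C1:C2:SYMBOL', 'C1,C2:SYMBOL' ) {
    printf "%-14s %s\n", $spec, rule_is_active( $spec, C2 => 1 ) ? 'active' : 'inactive';
}
```

With only "C2" active, the OR form is the only one of the three that is enabled.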

There are two types of start conditions: inclusive and exclusive, which
are declared by class methods "inclusive()" and "exclusive()"
respectively. With an inclusive start condition, all rules are active
regardless of whether or not they are qualified with the start
condition. With an exclusive start condition, only the rules qualified
with the start condition are active; all other rules are deactivated.

Example (borrowed from the documentation of Flex):

    use Parse::Lex;
    @token = (
      'EXPECT', 'expect-floats', sub {
        $lexer->start('expect');
        $_[1]
      },
      'expect:FLOAT', '\d+\.\d+', sub {
        print "found a float: $_[1]\n";
        $_[1]
      },
      'expect:NEWLINE', '\n', sub {
        $lexer->end('expect') ;
        $_[1]
      },
      'NEWLINE2', '\n',
      'INT', '\d+', sub {
        print "found an integer: $_[1] \n";
        $_[1]
      },
      'DOT', '\.', sub {
        print "found a dot\n";
        $_[1]
      },
    );

    Parse::Lex->exclusive('expect');
    $lexer = Parse::Lex->new(@token);

The special start condition "ALL" is always satisfied.

Methods
analyze EXPR
    Analyzes "EXPR" and returns a list of pairs consisting of a token
    name followed by recognized text. "EXPR" can be a character string
    or a reference to a filehandle.

    Examples:

        @tokens = Parse::Lex->new(qw(PLUS [+] NUMBER \d+))->analyze("3+3+3");
        @tokens = Parse::Lex->new(qw(PLUS [+] NUMBER \d+))->analyze(\*STREAM);
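
Because "analyze" returns a flat list of name/text pairs, the caller usually has to walk it two elements at a time. The sketch below shows one way to do that in plain Perl. The list is a hand-written sample shaped like the documented return value, not captured output, and the real list may also end with an "EOI" entry, as the "skip" example on this page suggests.

```perl
use strict;
use warnings;

# Hypothetical sample shaped like the result of analyze("3+3+3").
my @tokens = (
    NUMBER => '3', PLUS => '+',
    NUMBER => '3', PLUS => '+',
    NUMBER => '3',
);

# Regroup the flat list into [name, text] pairs.
my @pairs;
for ( my $i = 0 ; $i < $#tokens ; $i += 2 ) {
    push @pairs, [ @tokens[ $i, $i + 1 ] ];
}

printf "%-8s %s\n", @$_ for @pairs;
```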

buffer EXPR
buffer
    Returns the contents of the internal buffer of the lexical
    analyzer. With an expression as argument, places the result of the
    expression in the buffer.

    It is not advisable to directly change the contents of the buffer
    without also updating the position of the analysis pointer
    ("pos()") and the recorded length of the buffer ("length()").

configure(HASH)
    Instance method for configuring a lexical analyzer. It accepts the
    following attributes:

    From => EXPR
        This attribute plays the same role as the "from(EXPR)"
        method. "EXPR" can be a filehandle or a character string.

    Tokens => ARRAY_REF
        "ARRAY_REF" must contain the list of attribute values
        specifying the tokens to be recognized (see the
        documentation for "Parse::Token").

    Skip => REGEX
        This attribute plays the same role as the "skip(REGEX)"
        method. "REGEX" describes the patterns to skip over
        during the analysis.

end EXPR
    Deactivates condition "EXPR".

eoi
    Returns TRUE when there is no more data to analyze.

every SUB
    Avoids having to write a reading loop in order to analyze a stream
    of data. "SUB" is an anonymous subroutine executed after the
    recognition of each token. For example, to lex the string "1+2" you
    can write:

        use Parse::Lex;

        $lexer = Parse::Lex->new(
          qw(
             ADDOP [-+]
             INTEGER \d+
            ));

        $lexer->from("1+2");
        $lexer->every (sub {
          print $_[0]->name, "\t";
          print $_[0]->text, "\n";
        });

    The first argument of the anonymous subroutine is the
    "Parse::Token" instance recognized.

exclusive LIST
    Class method declaring the conditions present in LIST to be
    exclusive.

flush
    If saving of the consumed strings is activated, "flush()" returns
    and clears the buffer containing the character strings recognized
    up to now. This is only useful if "hold()" has been called to
    activate saving of consumed strings.

from EXPR
from
    "from(EXPR)" allows specifying the source of the data to be
    analyzed. The argument of this method can be a string (or list of
    strings), or a reference to a filehandle. If no argument is given,
    "from()" returns the filehandle if defined, or "undef" if input is
    a string. When an argument "EXPR" is used, the return value is the
    calling lexer object itself.

    By default it is assumed that data are read from "STDIN".

    Examples:

        use IO::File;
        $handle = IO::File->new;
        $handle->open("< filename") or die "can't open filename: $!";
        $lexer->from($handle);

        $lexer->from(\*DATA);
        $lexer->from('the data to be analyzed');

getSub
    "getSub" returns the anonymous subroutine that performs the lexical
    analysis.

    Example:

        my $token = '';
        my $sub = $lexer->getSub;
        while (($token = $sub->()) ne $Token::EOI) {
          print $token->name, "\t";
          print $token->text, "\n";
        }

        # or

        my $token = '';
        local *tokenizer = $lexer->getSub;
        while (($token = tokenizer()) ne $Token::EOI) {
          print $token->name, "\t";
          print $token->text, "\n";
        }
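
The idea of obtaining the analysis routine as a code reference and calling it in a loop can be sketched in plain Perl with a closure. This is only an illustration of the calling pattern: "make_tokenizer" is a hypothetical name, and unlike "getSub" the closure below signals the end of input by returning "undef" rather than an "EOI" token.

```perl
use strict;
use warnings;

# Build a closure that returns one [name, text] pair per call and
# undef once the input is exhausted. Rules are tried in list order.
sub make_tokenizer {
    my ( $text, @rules ) = @_;
    pos($text) = 0;
    return sub {
        $text =~ /\G[ \t]+/gc;    # skip separators between tokens
        return if pos($text) >= length($text);
        for my $rule (@rules) {
            my ( $name, $re ) = @$rule;
            return [ $name, $1 ] if $text =~ /\G($re)/gc;
        }
        die "can't analyze: " . substr( $text, pos($text) ) . "\n";
    };
}

my $next = make_tokenizer( '1 + 22', [ INTEGER => qr/\d+/ ], [ ADDOP => qr/[-+]/ ] );

my @seen;
while ( my $token = $next->() ) {
    push @seen, join '=', @$token;
}
print "@seen\n";
```

The scanning position lives inside the closure, so the caller only needs to keep the code reference.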

getToken
    Same as "token()" method.

hold EXPR
hold
    Activates/deactivates saving of the consumed strings. The return
    value is the current setting (TRUE or FALSE). Can be used as a
    class method.

    You can obtain the contents of the buffer using the "flush" method,
    which also empties the buffer.

inclusive LIST
    Class method declaring the conditions present in LIST to be
    inclusive.

length EXPR
length
    Returns the length of the current record. "length EXPR" sets the
    length of the current record.

line EXPR
line
    Returns the line number of the current record. "line EXPR" sets
    the value of the line number. Always returns 1 if a character
    string is being analyzed. The "readline()" method increments the
    line number.

name EXPR
name
    "name EXPR" lets you give a name to the lexical analyzer. "name()"
    returns the value of this name.

next
    Searches for the next token and returns the recognized
    "Parse::Token" instance. Returns the "Token::EOI" instance at the
    end of the data.

    Examples:

        $lexer = Parse::Lex->new(@token);
        print $lexer->next->name;   # print the token type
        print $lexer->next->text;   # print the token content

nextis SCALAR_REF
    Variant of the "next()" method. Tokens are placed in "SCALAR_REF".
    The method returns 1 as long as the token is not "EOI".

    Example:

        while($lexer->nextis(\$token)) {
           print $token->text();
        }

new LIST
    Creates and returns a new lexical analyzer. The argument of the
    method is a list of "Parse::Token" instances, or a list of triplets
    permitting their creation. The triplets consist of: the symbolic
    name of the token, the regular expression necessary for its
    recognition, and possibly an anonymous subroutine that is called
    when the token is recognized. For each triplet, an instance of type
    "Parse::Token" is created in the calling package.

offset
    Returns the number of characters already consumed since the
    beginning of the analyzed data stream.

pos EXPR
pos
    "pos EXPR" sets the position of the beginning of the next token to
    be recognized in the current line (this doesn't work with analyzers
    of the "Parse::CLex" class). "pos()" returns the number of
    characters already consumed in the current line.

readline
    Reads data from the input specified by the "from()" method. Returns
    the result of the reading.

    Example:

        use Parse::Lex;

        $lexer = Parse::Lex->new();    # no from(): input defaults to STDIN
        while (not $lexer->eoi) {
          print $lexer->readline();    # read and print one line
        }

reset
    Clears the internal buffer of the lexical analyzer and erases all
    tokens already recognized.

restart
    Reinitializes the analysis automaton. The only active condition
    becomes the condition "INITIAL".

setToken TOKEN
    Sets the token to "TOKEN". Useful to requalify a token inside the
    anonymous subroutine associated with this token.

skip EXPR
skip
    "EXPR" is a regular expression defining the token separator pattern
    (by default "[ \t]+"). "skip('')" sets this to no pattern. With no
    argument, "skip()" returns the value of the pattern. "skip()" can
    be used as a class method.

    Changing the skip pattern causes recompilation of the lexical
    analyzer.

    Example:

        Parse::Lex->skip('\s*#(?s:.*)|\s+');
        @tokens = Parse::Lex->new('INTEGER' => '\d+')->analyze(\*DATA);
        print "@tokens\n"; # print INTEGER 1 INTEGER 2 INTEGER 3 INTEGER 4 EOI
        __END__
        1 # first string to skip
        2
        3# second string to skip
        4
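
To see which parts of each record the skip pattern from the example above absorbs, the same regular expression can be applied by hand in plain Perl. This sketch does not call "Parse::Lex"; it only mimics the alternation between skipping separators and recognizing integers, using the same four records as the example.

```perl
use strict;
use warnings;

my $skip    = qr/\s*#(?s:.*)|\s+/;    # same pattern as in the skip() example
my @records = (
    "1 # first string to skip\n",
    "2\n",
    "3# second string to skip\n",
    "4\n",
);

my @found;
for my $record (@records) {
    pos($record) = 0;
    while ( pos($record) < length($record) ) {
        next if $record =~ /\G(?:$skip)/gc;    # separator or comment: discard
        if ( $record =~ /\G(\d+)/gc ) {
            push @found, "INTEGER $1";
            next;
        }
        last;                                  # unrecognized text
    }
}
print "@found\n";
```

The first alternative of the pattern swallows a comment through the end of the record, including the newline, because "(?s:.*)" lets the dot match "\n". The second alternative handles plain whitespace.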

start EXPR
    Activates condition EXPR.

state EXPR
    Returns the state of the condition represented by EXPR.

token
    Returns the instance corresponding to the last recognized token. In
    case no token was recognized, returns the special token named
    "DEFAULT".

tokenClass EXPR
tokenClass
    Indicates which is the class of the tokens to be created from the
    list passed as argument to the "new()" method. If no argument is
    given, returns the name of the class. By default the class is
    "Parse::Token".

trace OUTPUT
trace
    Class method which activates trace mode. The activation of trace
    mode must take place before the creation of the lexical analyzer.
    The mode can then be deactivated by another call of this method.

    "OUTPUT" can be a file name or a reference to a filehandle where
    the trace will be redirected.

ERROR HANDLING
To handle the cases of token non-recognition, you can define a specific
token at the end of the list of tokens that comprise your lexical
analyzer. If searching for this token succeeds, it is then possible to
call an error handling function:

    qw(ERROR  (?s:.*)), sub {
      print STDERR "ERROR: buffer content->", $_[0]->lexer->buffer, "<-\n";
      die qq!can\'t analyze: "$_[1]"!;
    }
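
The catch-all approach pairs naturally with "eval", so that the caller can recover from the "die". The sketch below shows that pattern in plain Perl with a hypothetical rule table and a hypothetical "tokenize" subroutine. It mirrors the idea of the "ERROR" rule above but is not the module's own dispatch code, and its action receives only the matched text.

```perl
use strict;
use warnings;

# Hypothetical ordered rule list; the catch-all ERROR rule comes last.
my @rules = (
    [ INTEGER => qr/\d+/ ],
    [ ADDOP   => qr/[-+]/ ],
    [ ERROR   => qr/(?s:.*)/, sub { die qq!can't analyze: "$_[0]"\n! } ],
);

# Return the names of the tokens found in $text, running any action
# attached to the matching rule.
sub tokenize {
    my ($text) = @_;
    my @names;
    pos($text) = 0;
    SCAN: while ( pos($text) < length($text) ) {
        for my $rule (@rules) {
            my ( $name, $re, $action ) = @$rule;
            next unless $text =~ /\G($re)/gc;
            $action->($1) if $action;
            push @names, $name;
            next SCAN;
        }
    }
    return @names;
}

my @ok    = tokenize('1+2');
my $error = eval { tokenize('1+x'); 1 } ? '' : $@;
print "@ok\n$error";
```

Since the "ERROR" rule matches any remaining text, it is reached only when every earlier rule has failed at the current position, which is why it has to be the last entry in the list.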

EXAMPLES
ctokenizer.pl - Scan a stream of data using the "Parse::CLex" class.

tokenizer.pl - Scan a stream of data using the "Parse::Lex" class.

every.pl - Use of the "every" method.

sexp.pl - Interpreter for prefix arithmetic expressions.

sexpcond.pl - Interpreter for prefix arithmetic expressions, using
conditions.

BUGS
Analyzers of the "Parse::CLex" class do not allow the use of regular
expressions with anchoring.

SEE ALSO
"Parse::Token", "Parse::LexEvent", "Parse::YYLex".

AUTHOR
Philippe Verdret. Documentation translated to English by Vladimir
Alexiev and Ocrat.

ACKNOWLEDGMENTS
Version 2.0 owes much to suggestions made by Vladimir Alexiev. Ocrat
has significantly contributed to improving this documentation. Thanks
also to the numerous people who have sent me bug reports and
occasionally fixes.

REFERENCES
Friedl, J.E.F. Mastering Regular Expressions. O'Reilly & Associates,
1996.

Mason, T. & Brown, D. Lex & Yacc. O'Reilly & Associates, Inc., 1990.

FLEX - A Scanner generator (available at ftp://ftp.ee.lbl.gov/ and
elsewhere)

COPYRIGHT
Copyright (c) 1995-1999 Philippe Verdret. All rights reserved. This
module is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.

perl v5.32.1                      2021-01-27                      Parse::Lex(3)