Parse::Lex(3)         User Contributed Perl Documentation        Parse::Lex(3)

NAME
   "Parse::Lex" - Generator of lexical analyzers - moving pointer inside
   text

SYNOPSIS
       require 5.005;

       use Parse::Lex;
       @token = (
         qw(
            ADDOP    [-+]
            LEFTP    [\(]
            RIGHTP   [\)]
            INTEGER  [1-9][0-9]*
            NEWLINE  \n
           ),
         qw(STRING), [qw(" (?:[^"]+|"")* ")],
         qw(ERROR .*), sub {
           die qq!can't analyze: "$_[1]"!;
         }
       );

       Parse::Lex->trace;  # Class method
       $lexer = Parse::Lex->new(@token);
       $lexer->from(\*DATA);
       print "Tokenization of DATA:\n";

       TOKEN: while (1) {
         $token = $lexer->next;
         if (not $lexer->eoi) {
           print "Line $.\t";
           print "Type: ", $token->name, "\t";
           print "Content:->", $token->text, "<-\n";
         } else {
           last TOKEN;
         }
       }

       __END__
       1+2-5
       "a multiline
       string with an embedded "" in it"
       an invalid string with a "" in it"

DESCRIPTION
   The classes "Parse::Lex" and "Parse::CLex" create lexical analyzers.
   They use different analysis techniques:

   1. "Parse::Lex" steps through the analysis by moving a pointer within
      the character strings to be analyzed (use of "pos()" together with
      "\G"),

   2. "Parse::CLex" steps through the analysis by consuming the data
      recognized (use of "s///").

   Analyzers of the "Parse::CLex" class do not allow the use of anchoring
   in regular expressions. In addition, the subclasses of "Parse::Token"
   are not implemented for this type of analyzer.

   A lexical analyzer is specified by means of a list of tokens passed as
   arguments to the "new()" method. Tokens are instances of the
   "Parse::Token" class, which comes with "Parse::Lex". The definition of
   a token usually comprises two arguments: a symbolic name (like
   "INTEGER"), followed by a regular expression. If a sub ref (anonymous
   subroutine) is given as third argument, it is called when the token is
   recognized. Its arguments are the "Parse::Token" instance and the
   string recognized by the regular expression. The anonymous
   subroutine's return value is used as the new string contents of the
   "Parse::Token" instance.

   The order in which the lexical analyzer examines the regular
   expressions is determined by the order in which these expressions are
   passed as arguments to the "new()" method. The token returned by the
   lexical analyzer corresponds to the first regular expression which
   matches (this strategy is different from that used by Lex, which
   returns the longest match possible out of all that can be recognized).
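   Because the first match wins, rule order matters: a general pattern
   listed before a more specific one shadows it. A minimal sketch (the
   token names and input here are invented for illustration):

```perl
use Parse::Lex;

# IDENT is listed before the keyword rule IF, so the input "if" matches
# IDENT first and is never reported as IF; swapping the two rules would
# make the more specific IF rule win for this input.
my $lexer = Parse::Lex->new(
    qw(
       IDENT  [A-Za-z]+
       IF     if
      ));
$lexer->from("if x");
print $lexer->next->name, "\n";   # IDENT
```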

   The lexical analyzer can recognize tokens which span multiple records.
   If the definition of the token comprises more than one regular
   expression (placed within a reference to an anonymous array), the
   analyzer reads as many records as required to recognize the token (see
   the documentation for the "Parse::Token" class). When the start
   pattern is found, the analyzer looks for the end, and if necessary,
   reads more records. No backtracking is done in case of failure.

   The analyzer can be used to analyze an isolated character string or a
   stream of data coming from a file handle. At the end of the input data
   the analyzer returns a "Parse::Token" instance named "EOI" (End Of
   Input).

   Start Conditions
   You can associate start conditions with the token-recognition rules
   that comprise your lexical analyzer (this is similar to what Flex
   provides). When start conditions are used, the rule which succeeds is
   no longer necessarily the first rule that matches.

   A token symbol may be preceded by a start condition specifier for the
   associated recognition rule. For example:

       qw(C1:TERMINAL_1 REGEXP), sub { },  # associated action
       qw(TERMINAL_2 REGEXP), sub { },     # associated action

   Symbol "TERMINAL_1" will be recognized only if start condition "C1" is
   active. Start conditions are activated/deactivated using the
   "start(CONDITION_NAME)" and "end(CONDITION_NAME)" methods.

   "start('INITIAL')" resets the analysis automaton.

   Start conditions can be combined using AND/OR operators as follows:

       C1:SYMBOL       condition C1

       C1:C2:SYMBOL    condition C1 AND condition C2

       C1,C2:SYMBOL    condition C1 OR condition C2

   There are two types of start conditions: inclusive and exclusive, which
   are declared by the class methods "inclusive()" and "exclusive()"
   respectively. With an inclusive start condition, all rules are active
   regardless of whether or not they are qualified with the start
   condition. With an exclusive start condition, only the rules qualified
   with the start condition are active; all other rules are deactivated.

   Example (borrowed from the documentation of Flex):

       use Parse::Lex;
       @token = (
         'EXPECT', 'expect-floats', sub {
           $lexer->start('expect');
           $_[1]
         },
         'expect:FLOAT', '\d+\.\d+', sub {
           print "found a float: $_[1]\n";
           $_[1]
         },
         'expect:NEWLINE', '\n', sub {
           $lexer->end('expect');
           $_[1]
         },
         'NEWLINE2', '\n',
         'INT', '\d+', sub {
           print "found an integer: $_[1]\n";
           $_[1]
         },
         'DOT', '\.', sub {
           print "found a dot\n";
           $_[1]
         },
       );

       Parse::Lex->exclusive('expect');
       $lexer = Parse::Lex->new(@token);

   The special start condition "ALL" is always verified.

   Methods
   analyze EXPR
       Analyzes "EXPR" and returns a list of pairs consisting of a token
       name followed by the recognized text. "EXPR" can be a character
       string or a reference to a filehandle.

       Examples:

           @tokens = Parse::Lex->new(qw(PLUS [+] NUMBER \d+))->analyze("3+3+3");
           @tokens = Parse::Lex->new(qw(PLUS [+] NUMBER \d+))->analyze(\*STREAM);

   buffer EXPR
   buffer
       Returns the contents of the internal buffer of the lexical
       analyzer. With an expression as argument, places the result of the
       expression in the buffer.

       It is not advisable to change the contents of the buffer directly
       without also adjusting the position of the analysis pointer
       ("pos()") and the length of the buffer ("length()").

   configure(HASH)
       Instance method which permits specifying a lexical analyzer. This
       method accepts the following attribute-value pairs:

       From => EXPR
           This attribute plays the same role as the "from(EXPR)"
           method. "EXPR" can be a filehandle or a character string.

       Tokens => ARRAY_REF
           "ARRAY_REF" must contain the list of attribute values
           specifying the tokens to be recognized (see the
           documentation for "Parse::Token").

       Skip => REGEX
           This attribute plays the same role as the "skip(REGEX)"
           method. "REGEX" describes the patterns to skip over
           during the analysis.

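       A sketch of setting up an analyzer entirely through "configure()"
       (the token list, input and skip pattern here are illustrative):

```perl
use Parse::Lex;

my $lexer = Parse::Lex->new;
$lexer->configure(
    From   => "10 + 20",                       # same role as from()
    Tokens => [ qw(INTEGER \d+ ADDOP [-+]) ],  # same role as new()'s list
    Skip   => '\s+',                           # same role as skip()
);
```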
   end EXPR
       Deactivates condition "EXPR".

   eoi
       Returns TRUE when there is no more data to analyze.

   every SUB
       Avoids having to write a reading loop in order to analyze a
       stream of data. "SUB" is an anonymous subroutine executed after
       the recognition of each token. For example, to lex the string
       "1+2" you can write:

           use Parse::Lex;

           $lexer = Parse::Lex->new(
             qw(
                ADDOP   [-+]
                INTEGER \d+
               ));

           $lexer->from("1+2");
           $lexer->every(sub {
             print $_[0]->name, "\t";
             print $_[0]->text, "\n";
           });

       The first argument of the anonymous subroutine is the
       "Parse::Token" instance recognized.

   exclusive LIST
       Class method declaring the conditions present in LIST to be
       exclusive.

   flush
       If saving of the consumed strings is activated, "flush()" returns
       and clears the buffer containing the character strings recognized
       up to now. This is only useful if "hold()" has been called to
       activate saving of consumed strings.

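       A sketch combining "hold()" and "flush()" (the token names and
       input are invented for this example):

```perl
use Parse::Lex;

my $lexer = Parse::Lex->new(qw(INTEGER \d+ ADDOP [-+]));
$lexer->hold(1);              # activate saving of consumed strings
$lexer->from("1+2+3");
$lexer->next for 1 .. 3;      # consume the first three tokens
print $lexer->flush, "\n";    # print and empty the saved consumed text
```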
   from EXPR
   from
       "from(EXPR)" allows specifying the source of the data to be
       analyzed. The argument of this method can be a string (or list of
       strings), or a reference to a filehandle. If no argument is
       given, "from()" returns the filehandle if defined, or "undef" if
       input is a string. When an argument "EXPR" is used, the return
       value is the calling lexer object itself.

       By default it is assumed that data are read from "STDIN".

       Examples:

           $handle = new IO::File;
           $handle->open("< filename");
           $lexer->from($handle);

           $lexer->from(\*DATA);
           $lexer->from('the data to be analyzed');

   getSub
       "getSub" returns the anonymous subroutine that performs the
       lexical analysis.

       Example:

           my $token = '';
           my $sub = $lexer->getSub;
           while (($token = &$sub()) ne $Token::EOI) {
             print $token->name, "\t";
             print $token->text, "\n";
           }

           # or

           my $token = '';
           local *tokenizer = $lexer->getSub;
           while (($token = tokenizer()) ne $Token::EOI) {
             print $token->name, "\t";
             print $token->text, "\n";
           }

   getToken
       Same as the "token()" method.

   hold EXPR
   hold
       Activates/deactivates saving of the consumed strings. The return
       value is the current setting (TRUE or FALSE). Can be used as a
       class method.

       You can obtain the contents of the buffer using the "flush"
       method, which also empties the buffer.

   inclusive LIST
       Class method declaring the conditions present in LIST to be
       inclusive.

   length EXPR
   length
       Returns the length of the current record. "length EXPR" sets the
       length of the current record.

   line EXPR
   line
       Returns the line number of the current record. "line EXPR" sets
       the value of the line number. Always returns 1 if a character
       string is being analyzed. The "readline()" method increments the
       line number.

   name EXPR
   name
       "name EXPR" lets you give a name to the lexical analyzer.
       "name()" returns the value of this name.

   next
       Searches for the next token. Returns the recognized
       "Parse::Token" instance. Returns the "Token::EOI" instance at the
       end of the data.

       Examples:

           $lexer = Parse::Lex->new(@token);
           print $lexer->next->name;  # print the token type
           print $lexer->next->text;  # print the token content

   nextis SCALAR_REF
       Variant of the "next()" method. Tokens are placed in
       "SCALAR_REF". The method returns 1 as long as the token is not
       "EOI".

       Example:

           while ($lexer->nextis(\$token)) {
             print $token->text();
           }

   new LIST
       Creates and returns a new lexical analyzer. The argument of the
       method is a list of "Parse::Token" instances, or a list of
       triplets permitting their creation. The triplets consist of: the
       symbolic name of the token, the regular expression necessary for
       its recognition, and possibly an anonymous subroutine that is
       called when the token is recognized. For each triplet, an
       instance of type "Parse::Token" is created in the calling
       package.

   offset
       Returns the number of characters already consumed since the
       beginning of the analyzed data stream.

   pos EXPR
   pos
       "pos EXPR" sets the position of the beginning of the next token
       to be recognized in the current line (this doesn't work with
       analyzers of the "Parse::CLex" class). "pos()" returns the number
       of characters already consumed in the current line.

   readline
       Reads data from the input specified by the "from()" method.
       Returns the result of the reading.

       Example:

           use Parse::Lex;

           $lexer = Parse::Lex->new();
           while (not $lexer->eoi) {
             print $lexer->readline();  # read and print one line
           }

   reset
       Clears the internal buffer of the lexical analyzer and erases all
       tokens already recognized.

   restart
       Reinitializes the analysis automaton. The only active condition
       becomes the condition "INITIAL".

   setToken TOKEN
       Sets the token to "TOKEN". Useful to requalify a token inside the
       anonymous subroutine associated with this token.

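       A sketch of requalifying inside an action: an identifier that
       turns out to be a reserved word is replaced by another token (the
       %keywords hash and the KEYWORD token are invented for this
       example):

```perl
my %keywords = map { $_ => 1 } qw(if else while);
my $KEYWORD  = Parse::Token->new('KEYWORD', '\w+');

# In the token list passed to new():
qw(IDENT [A-Za-z_]\w*), sub {
    my ($token, $text) = @_;
    # Requalify the recognized token when the text is a reserved word.
    $token->lexer->setToken($KEYWORD) if $keywords{$text};
    $text;
},
```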
   skip EXPR
   skip
       "EXPR" is a regular expression defining the token separator
       pattern (by default "[ \t]+"). "skip('')" sets this to no
       pattern. With no argument, "skip()" returns the value of the
       pattern. "skip()" can be used as a class method.

       Changing the skip pattern causes recompilation of the lexical
       analyzer.

       Example:

           Parse::Lex->skip('\s*#(?s:.*)|\s+');
           @tokens = Parse::Lex->new('INTEGER' => '\d+')->analyze(\*DATA);
           print "@tokens\n"; # prints: INTEGER 1 INTEGER 2 INTEGER 3 INTEGER 4 EOI
           __END__
           1 # first string to skip
           2
           3# second string to skip
           4

   start EXPR
       Activates condition EXPR.

   state EXPR
       Returns the state of the condition represented by EXPR.

   token
       Returns the instance corresponding to the last recognized token.
       In case no token was recognized, returns the special token named
       "DEFAULT".

   tokenClass EXPR
   tokenClass
       Indicates which class the tokens created from the list passed to
       the "new()" method belong to. If no argument is given, returns
       the name of the class. By default the class is "Parse::Token".

   trace OUTPUT
   trace
       Class method which activates trace mode. The activation of trace
       mode must take place before the creation of the lexical analyzer.
       The mode can then be deactivated by another call of this method.

       "OUTPUT" can be a file name or a reference to a filehandle where
       the trace will be redirected.

ERROR HANDLING
   To handle the cases of token non-recognition, you can define a
   specific token at the end of the list of tokens that comprise your
   lexical analyzer. If searching for this token succeeds, it is then
   possible to call an error handling function:

       qw(ERROR (?s:.*)), sub {
         print STDERR "ERROR: buffer content->", $_[0]->lexer->buffer, "<-\n";
         die qq!can't analyze: "$_[1]"!;
       }

EXAMPLES
   ctokenizer.pl - Scan a stream of data using the "Parse::CLex" class.

   tokenizer.pl - Scan a stream of data using the "Parse::Lex" class.

   every.pl - Use of the "every" method.

   sexp.pl - Interpreter for prefix arithmetic expressions.

   sexpcond.pl - Interpreter for prefix arithmetic expressions, using
   conditions.

BUGS
   Analyzers of the "Parse::CLex" class do not allow the use of regular
   expressions with anchoring.

SEE ALSO
   "Parse::Token", "Parse::LexEvent", "Parse::YYLex".

AUTHOR
   Philippe Verdret. Documentation translated to English by Vladimir
   Alexiev and Ocrat.

ACKNOWLEDGMENTS
   Version 2.0 owes much to suggestions made by Vladimir Alexiev. Ocrat
   has significantly contributed to improving this documentation. Thanks
   also to the numerous people who have sent me bug reports and
   occasionally fixes.

REFERENCES
   Friedl, J.E.F. - Mastering Regular Expressions. O'Reilly & Associates,
   1996.

   Mason, T. & Brown, D. - Lex & Yacc. O'Reilly & Associates, 1990.

   FLEX - A scanner generator (available at ftp://ftp.ee.lbl.gov/ and
   elsewhere).

COPYRIGHT
   Copyright (c) 1995-1999 Philippe Verdret. All rights reserved. This
   module is free software; you can redistribute it and/or modify it
   under the same terms as Perl itself.

perl v5.12.0                      2010-03-26                    Parse::Lex(3)