Parse::Lex(3)         User Contributed Perl Documentation        Parse::Lex(3)

NAME

       "Parse::Lex" - Generator of lexical analyzers - moving pointer inside
       text

SYNOPSIS

               require 5.005;

               use Parse::Lex;
               @token = (
                 qw(
                    ADDOP    [-+]
                    LEFTP    [\(]
                    RIGHTP   [\)]
                    INTEGER  [1-9][0-9]*
                    NEWLINE  \n

                   ),
                 qw(STRING),   [qw(" (?:[^"]+|"")* ")],
                 qw(ERROR  .*), sub {
                   die qq!can\'t analyze: "$_[1]"!;
                 }
                );

               Parse::Lex->trace;  # Class method
               $lexer = Parse::Lex->new(@token);
               $lexer->from(\*DATA);
               print "Tokenization of DATA:\n";

               TOKEN:while (1) {
                 $token = $lexer->next;
                 if (not $lexer->eoi) {
                   print "Line $.\t";
                   print "Type: ", $token->name, "\t";
                   print "Content:->", $token->text, "<-\n";
                 } else {
                   last TOKEN;
                 }
               }

               __END__
               1+2-5
               "a multiline
               string with an embedded "" in it"
               an invalid string with a "" in it"

DESCRIPTION

       The classes "Parse::Lex" and "Parse::CLex" create lexical analyzers.
       They use different analysis techniques:

       1.  "Parse::Lex" steps through the analysis by moving a pointer within
       the character strings to be analyzed (use of "pos()" together with
       "\G"),

       2.  "Parse::CLex" steps through the analysis by consuming the data
       recognized (use of "s///").

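       The two techniques can be sketched in core Perl, without either
       class.  This is an illustration only, not the code of "Parse::Lex"
       or "Parse::CLex"; the rule list and the names "scan_with_pointer"
       and "scan_by_consuming" are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical rule list: [ name, regular expression ], tried in order.
my @rules = ( [ INTEGER => qr/[1-9][0-9]*/ ], [ ADDOP => qr/[-+]/ ] );

# Technique 1: leave the text intact and advance pos() using \G and /gc.
sub scan_with_pointer {
    my ($text) = @_;
    my @out;
    pos($text) = 0;
    TOKEN: while (pos($text) < length($text)) {
        for my $rule (@rules) {
            my ($name, $re) = @$rule;
            if ($text =~ /\G($re)/gc) {      # /c keeps pos() when a rule fails
                push @out, "$name=$1";
                next TOKEN;
            }
        }
        die "cannot analyze: " . substr($text, pos($text)) . "\n";
    }
    return @out;
}

# Technique 2: remove the recognized prefix from the text with s///.
sub scan_by_consuming {
    my ($text) = @_;
    my @out;
    TOKEN: while (length($text)) {
        for my $rule (@rules) {
            my ($name, $re) = @$rule;
            if ($text =~ s/^($re)//) {
                push @out, "$name=$1";
                next TOKEN;
            }
        }
        die "cannot analyze: $text\n";
    }
    return @out;
}

print join(' ', scan_with_pointer('1+2-5')), "\n";
print join(' ', scan_by_consuming('1+2-5')), "\n";
```

       Both subroutines produce the same token sequence.  The first keeps
       the whole text available, so a position can be saved and restored;
       the second shortens the text as it goes.
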
       Analyzers of the "Parse::CLex" class do not allow the use of anchoring
       in regular expressions.  In addition, the subclasses of "Parse::Token"
       are not implemented for this type of analyzer.

       A lexical analyzer is specified by means of a list of tokens passed as
       arguments to the "new()" method. Tokens are instances of the
       "Parse::Token" class, which comes with "Parse::Lex". The definition of
       a token usually comprises two arguments: a symbolic name (like
       "INTEGER"), followed by a regular expression. If a subroutine
       reference (anonymous subroutine) is given as a third argument, it is
       called when the token is recognized.  Its arguments are the
       "Parse::Token" instance and the string recognized by the regular
       expression.  The anonymous subroutine's return value is used as the
       new string contents of the "Parse::Token" instance.

       The order in which the lexical analyzer examines the regular
       expressions is determined by the order in which these expressions are
       passed as arguments to the "new()" method. The token returned by the
       lexical analyzer corresponds to the first regular expression which
       matches (this strategy is different from that used by Lex, which
       returns the longest match possible out of all that can be recognized).

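       The difference between the two strategies can be shown with a small
       core-Perl sketch.  It does not use "Parse::Lex"; the subroutines
       "first_match" and "longest_match" and the two rules are hypothetical
       and exist only to contrast the selection policies.

```perl
use strict;
use warnings;

# First-match policy: return the first rule, in list order, that matches.
sub first_match {
    my ($text, @rules) = @_;
    for my $rule (@rules) {
        my ($name, $re) = @$rule;
        return [$name, $1] if $text =~ /^($re)/;
    }
    return;
}

# Longest-match policy (the Lex strategy): try every rule, keep the longest.
sub longest_match {
    my ($text, @rules) = @_;
    my $best;
    for my $rule (@rules) {
        my ($name, $re) = @$rule;
        next unless $text =~ /^($re)/;
        $best = [$name, $1] if !$best || length($1) > length($best->[1]);
    }
    return $best;
}

# INTEGER is listed before FLOAT on purpose.
my @rules = ( [ INTEGER => qr/\d+/ ], [ FLOAT => qr/\d+\.\d+/ ] );

my $first   = first_match('3.14', @rules);
my $longest = longest_match('3.14', @rules);
print "first match:   $first->[0] '$first->[1]'\n";
print "longest match: $longest->[0] '$longest->[1]'\n";
```

       Under the first-match policy the input "3.14" is recognized as
       INTEGER "3", so with an order-sensitive analyzer the more specific
       rule (here FLOAT) should be listed first.
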
       The lexical analyzer can recognize tokens which span multiple records.
       If the definition of the token comprises more than one regular
       expression (placed within a reference to an anonymous array), the
       analyzer reads as many records as required to recognize the token (see
       the documentation for the "Parse::Token" class).  When the start
       pattern is found, the analyzer looks for the end, and if necessary,
       reads more records.  No backtracking is done in case of failure.

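       The record-by-record search for the end pattern can be pictured
       with the following core-Perl sketch.  It is not the module's
       implementation; "read_string_token" and the in-memory record list
       are assumptions, and the three patterns are those of the STRING
       token in the SYNOPSIS.

```perl
use strict;
use warnings;

# Records as a filehandle would deliver them, one line at a time.
my @records = (
    qq("a multiline\n),
    qq(string with an embedded "" in it"\n),
);

# Start, body and end patterns of the STRING token from the SYNOPSIS.
my ($start, $body, $end) = (qr/"/, qr/(?:[^"]+|"")*/, qr/"/);

# Hypothetical helper: returns the text between the delimiters, reading
# more records until the end pattern is found or the input runs out.
sub read_string_token {
    my ($records) = @_;
    my $buffer = shift @$records;
    return unless defined $buffer;
    pos($buffer) = 0;
    return unless $buffer =~ /\G$start/gc;   # start pattern not found
    my $from = pos($buffer);                 # position just after the start
    while (1) {
        return $1 if $buffer =~ /\G($body)$end/gc;
        my $next = shift @$records;
        return unless defined $next;         # input exhausted: give up
        $buffer .= $next;                    # read one more record
        pos($buffer) = $from;                # set the pointer explicitly again
    }
}

my $text = read_string_token(\@records);
print "STRING: $text\n";
```

       The sketch gives up as soon as the input is exhausted and never
       re-examines the text before the start delimiter, which mirrors the
       "no backtracking" behaviour described above.
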
       The analyzer can be used to analyze an isolated character string or a
       stream of data coming from a file handle. At the end of the input data
       the analyzer returns a "Parse::Token" instance named "EOI" (End Of
       Input).

   Start Conditions
       You can associate start conditions with the token-recognition rules
       that comprise your lexical analyzer (this is similar to what Flex
       provides).  When start conditions are used, the rule which succeeds is
       no longer necessarily the first rule that matches.

       A token symbol may be preceded by a start condition specifier for the
       associated recognition rule. For example:

               qw(C1:TERMINAL_1  REGEXP), sub { ... },   # action for TERMINAL_1
               qw(TERMINAL_2  REGEXP), sub { ... },      # action for TERMINAL_2

       Symbol "TERMINAL_1" will be recognized only if start condition "C1" is
       active.  Start conditions are activated/deactivated using the
       "start(CONDITION_NAME)" and "end(CONDITION_NAME)" methods.

       "start('INITIAL')" resets the analysis automaton.

       Start conditions can be combined using AND/OR operators as follows:

               C1:SYMBOL      condition C1

               C1:C2:SYMBOL   condition C1 AND condition C2

               C1,C2:SYMBOL   condition C1 OR  condition C2

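       The AND/OR notation can be read as follows.  The sketch below is
       core Perl and only evaluates a specifier against a set of active
       conditions; "rule_is_active" is a hypothetical name, and the way
       mixed specifiers are grouped (":" separating AND groups, ","
       separating OR alternatives inside a group) is an assumption, not
       documented behaviour.  A rule without a specifier is treated as
       active here, which ignores exclusive conditions.

```perl
use strict;
use warnings;

# Hypothetical helper: is the rule named by $spec active, given the set of
# currently active conditions (passed as name => 1 pairs)?
sub rule_is_active {
    my ($spec, %active) = @_;
    my @groups = split /:/, $spec;
    pop @groups;                               # the last field is the symbol
    for my $group (@groups) {                  # every group must hold (AND)
        my $hits = grep { $active{$_} } split /,/, $group;   # any member (OR)
        return 0 unless $hits;
    }
    return 1;
}

for my $spec ('C1:SYMBOL', 'C1:C2:SYMBOL', 'C1,C2:SYMBOL') {
    printf "%-14s with C2 active only: %s\n",
        $spec, rule_is_active($spec, C2 => 1) ? 'active' : 'inactive';
}
```
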
       There are two types of start conditions: inclusive and exclusive, which
       are declared by class methods "inclusive()" and "exclusive()"
       respectively.  With an inclusive start condition, all rules are active
       regardless of whether or not they are qualified with the start
       condition.  With an exclusive start condition, only the rules qualified
       with the start condition are active; all other rules are deactivated.

       Example (borrowed from the documentation of Flex):

        use Parse::Lex;
        @token = (
                 'EXPECT', 'expect-floats', sub {
                   $lexer->start('expect');
                   $_[1]
                 },
                 'expect:FLOAT', '\d+\.\d+', sub {
                   print "found a float: $_[1]\n";
                   $_[1]
                 },
                 'expect:NEWLINE', '\n', sub {
                   $lexer->end('expect') ;
                   $_[1]
                 },
                 'NEWLINE2', '\n',
                 'INT', '\d+', sub {
                   print "found an integer: $_[1] \n";
                   $_[1]
                 },
                 'DOT', '\.', sub {
                   print "found a dot\n";
                   $_[1]
                 },
                );

        Parse::Lex->exclusive('expect');
        $lexer = Parse::Lex->new(@token);

       The special start condition "ALL" is always satisfied.

   Methods
       analyze EXPR
           Analyzes "EXPR" and returns a list of pairs consisting of a token
           name followed by recognized text. "EXPR" can be a character string
           or a reference to a filehandle.

           Examples:

            @tokens = Parse::Lex->new(qw(PLUS [+] NUMBER \d+))->analyze("3+3+3");
            @tokens = Parse::Lex->new(qw(PLUS [+] NUMBER \d+))->analyze(\*STREAM);

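           Because the result is a flat list of name/text pairs, it can be
           walked two elements at a time.  The sketch below uses a
           hand-written list as a stand-in for the return value of
           "analyze("3+3+3")"; the exact contents of the real list are an
           assumption (the "skip" example in this page suggests that it
           also ends with an "EOI" entry).

```perl
use strict;
use warnings;

# Stand-in for the list returned by analyze(): name, text, name, text, ...
my @tokens = (NUMBER => '3', PLUS => '+', NUMBER => '3', PLUS => '+', NUMBER => '3');

my @pairs;
while (my ($name, $text) = splice(@tokens, 0, 2)) {   # take one pair per pass
    push @pairs, "$name($text)";
}
print "@pairs\n";
```
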
       buffer EXPR
       buffer
           Returns the contents of the internal buffer of the lexical
           analyzer.  With an expression as argument, places the result of the
           expression in the buffer.

           It is not advisable to directly change the contents of the buffer
           without also updating the position of the analysis pointer
           ("pos()") and the recorded length of the buffer ("length()").

       configure(HASH)
           Instance method for configuring a lexical analyzer.  It accepts
           the following attributes:

           From => EXPR
                     This attribute plays the same role as the "from(EXPR)"
                     method.  "EXPR" can be a filehandle or a character
                     string.

           Tokens => ARRAY_REF
                     "ARRAY_REF" must contain the list of attribute values
                     specifying the tokens to be recognized (see the
                     documentation for "Parse::Token").

           Skip => REGEX
                     This attribute plays the same role as the "skip(REGEX)"
                     method. "REGEX" describes the patterns to skip over
                     during the analysis.

       end EXPR
           Deactivates condition "EXPR".

       eoi Returns TRUE when there is no more data to analyze.

       every SUB
           Avoids having to write a reading loop in order to analyze a stream
           of data. "SUB" is an anonymous subroutine executed after the
           recognition of each token. For example, to lex the string "1+2" you
           can write:

                   use Parse::Lex;

                   $lexer = Parse::Lex->new(
                     qw(
                        ADDOP [-+]
                        INTEGER \d+
                       ));

                   $lexer->from("1+2");
                   $lexer->every (sub {
                     print $_[0]->name, "\t";
                     print $_[0]->text, "\n";
                   });

           The first argument of the anonymous subroutine is the
           "Parse::Token" instance recognized.

       exclusive LIST
           Class method declaring the conditions present in LIST to be
           exclusive.

       flush
           If saving of the consumed strings is activated, "flush()" returns
           and clears the buffer containing the character strings recognized
           up to now.  This is only useful if "hold()" has been called to
           activate saving of consumed strings.

       from EXPR
       from
           "from(EXPR)" specifies the source of the data to be analyzed. The
           argument of this method can be a string (or list of strings), or a
           reference to a filehandle.  If no argument is given, "from()"
           returns the filehandle if defined, or "undef" if input is a
           string.  When an argument "EXPR" is used, the return value is the
           calling lexer object itself.

           By default it is assumed that data are read from "STDIN".

           Examples:

                   use IO::File;

                   $handle = IO::File->new("filename", "r")
                     or die "cannot open filename: $!";
                   $lexer->from($handle);

                   $lexer->from(\*DATA);
                   $lexer->from('the data to be analyzed');

       getSub
           "getSub" returns the anonymous subroutine that performs the lexical
           analysis.

           Example:

                   my $token = '';
                   my $sub = $lexer->getSub;
                   while (($token = $sub->()) ne $Token::EOI) {
                     print $token->name, "\t";
                     print $token->text, "\n";
                   }

              # or

                   my $token = '';
                   local *tokenizer = $lexer->getSub;
                   while (($token = tokenizer()) ne $Token::EOI) {
                     print $token->name, "\t";
                     print $token->text, "\n";
                   }

       getToken
           Same as the "token()" method.

       hold EXPR
       hold
           Activates/deactivates saving of the consumed strings.  The return
           value is the current setting (TRUE or FALSE).  Can be used as a
           class method.

           You can obtain the contents of the buffer using the "flush" method,
           which also empties the buffer.

       inclusive LIST
           Class method declaring the conditions present in LIST to be
           inclusive.

       length EXPR
       length
           Returns the length of the current record.  "length EXPR" sets the
           length of the current record.

       line EXPR
       line
           Returns the line number of the current record.  "line EXPR" sets
           the value of the line number.  Always returns 1 if a character
           string is being analyzed.  The "readline()" method increments the
           line number.

       name EXPR
       name
           "name EXPR" lets you give a name to the lexical analyzer.  "name()"
           returns the value of this name.

       next
           Searches for the next token and returns the recognized
           "Parse::Token" instance. Returns the "Token::EOI" instance at the
           end of the data.

           Examples:

                   $lexer = Parse::Lex->new(@token);
                   print $lexer->next->name;   # print the token type
                   print $lexer->next->text;   # print the token content

       nextis SCALAR_REF
           Variant of the "next()" method. The token is stored in the scalar
           referenced by "SCALAR_REF".  The method returns 1 as long as the
           token is not "EOI".

           Example:

                   while($lexer->nextis(\$token)) {
                      print $token->text();
                   }

       new LIST
           Creates and returns a new lexical analyzer. The argument of the
           method is a list of "Parse::Token" instances, or a list of triplets
           permitting their creation.  The triplets consist of: the symbolic
           name of the token, the regular expression necessary for its
           recognition, and possibly an anonymous subroutine that is called
           when the token is recognized. For each triplet, an instance of type
           "Parse::Token" is created in the calling package.

       offset
           Returns the number of characters already consumed since the
           beginning of the analyzed data stream.

       pos EXPR
       pos "pos EXPR" sets the position of the beginning of the next token to
           be recognized in the current line (this doesn't work with analyzers
           of the "Parse::CLex" class).  "pos()" returns the number of
           characters already consumed in the current line.

       readline
           Reads data from the input specified by the "from()" method. Returns
           the result of the reading.

           Example:

                   use Parse::Lex;

                   $lexer = Parse::Lex->new();
                   while (not $lexer->eoi) {
                     print $lexer->readline();  # read and print one line
                   }

       reset
           Clears the internal buffer of the lexical analyzer and erases all
           tokens already recognized.

       restart
           Reinitializes the analysis automaton. The only active condition
           becomes the condition "INITIAL".

       setToken TOKEN
           Sets the token to "TOKEN". Useful to requalify a token inside the
           anonymous subroutine associated with this token.

       skip EXPR
       skip
           "EXPR" is a regular expression defining the token separator pattern
           (by default "[ \t]+"). "skip('')" sets this to no pattern.  With no
           argument, "skip()" returns the value of the pattern.  "skip()" can
           be used as a class method.

           Changing the skip pattern causes recompilation of the lexical
           analyzer.

           Example:

             Parse::Lex->skip('\s*#(?s:.*)|\s+');
             @tokens = Parse::Lex->new('INTEGER' => '\d+')->analyze(\*DATA);
             print "@tokens\n"; # print INTEGER 1 INTEGER 2 INTEGER 3 INTEGER 4 EOI
             __END__
             1 # first string to skip
             2
             3# second string to skip
             4

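           The effect of this skip pattern can be reproduced in core Perl.
           The sketch below does not call "Parse::Lex"; it applies the same
           pattern record by record with "\G", on the assumption that each
           input line is one record, and collects the integers it finds.

```perl
use strict;
use warnings;

my $skip = qr/\s*#(?s:.*)|\s+/;     # the skip pattern from the example above

# The DATA section of the example, one record per line.
my @records = (
    "1 # first string to skip\n",
    "2\n",
    "3# second string to skip\n",
    "4\n",
);

my @found;
for my $record (@records) {
    pos($record) = 0;
    while (pos($record) < length($record)) {
        next if $record =~ /\G(?:$skip)/gc;          # drop separators, comments
        if ($record =~ /\G(\d+)/gc) {                # INTEGER
            push @found, $1;
            next;
        }
        die "cannot analyze: " . substr($record, pos($record));
    }
}
print "@found\n";
```

           Within a record, "(?s:.*)" lets the comment alternative run to
           the end of the record, newline included, so nothing is left
           over after a comment.
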
       start EXPR
           Activates condition EXPR.

       state EXPR
           Returns the state of the condition represented by EXPR.

       token
           Returns the instance corresponding to the last recognized token. If
           no token was recognized, returns the special token named
           "DEFAULT".

       tokenClass EXPR
       tokenClass
           Sets the class of the tokens to be created from the list passed as
           argument to the "new()" method.  If no argument is given, returns
           the name of the class.  By default the class is "Parse::Token".

       trace OUTPUT
       trace
           Class method which activates trace mode. The activation of trace
           mode must take place before the creation of the lexical analyzer.
           The mode can then be deactivated by another call of this method.

           "OUTPUT" can be a file name or a reference to a filehandle where
           the trace will be redirected.

ERROR HANDLING

       To handle input that matches no token, you can define a catch-all
       token at the end of the list of tokens that comprise your lexical
       analyzer.  If this token matches, its associated subroutine can call
       an error-handling function:

            qw(ERROR  (?s:.*)), sub {
              print STDERR "ERROR: buffer content->", $_[0]->lexer->buffer, "<-\n";
              die qq!can\'t analyze: "$_[1]"!;
            }

EXAMPLES

       ctokenizer.pl - Scan a stream of data using the "Parse::CLex" class.

       tokenizer.pl - Scan a stream of data using the "Parse::Lex" class.

       every.pl - Use of the "every" method.

       sexp.pl - Interpreter for prefix arithmetic expressions.

       sexpcond.pl - Interpreter for prefix arithmetic expressions, using
       conditions.

BUGS

       Analyzers of the "Parse::CLex" class do not allow the use of regular
       expressions with anchoring.

SEE ALSO

       "Parse::Token", "Parse::LexEvent", "Parse::YYLex".

AUTHOR

       Philippe Verdret. Documentation translated to English by Vladimir
       Alexiev and Ocrat.

ACKNOWLEDGMENTS

       Version 2.0 owes much to suggestions made by Vladimir Alexiev.  Ocrat
       has significantly contributed to improving this documentation.  Thanks
       also to the numerous people who have sent me bug reports and,
       occasionally, fixes.

REFERENCES

       Friedl, J.E.F. Mastering Regular Expressions. O'Reilly & Associates,
       1996.

       Mason, T. & Brown, D. Lex & Yacc. O'Reilly & Associates, 1990.

       FLEX - A Scanner generator (available at ftp://ftp.ee.lbl.gov/ and
       elsewhere)

COPYRIGHT

       Copyright (c) 1995-1999 Philippe Verdret. All rights reserved.  This
       module is free software; you can redistribute it and/or modify it under
       the same terms as Perl itself.

perl v5.28.0                      2011-12-31                     Parse::Lex(3)