Parse::Token(3)       User Contributed Perl Documentation      Parse::Token(3)

NAME

       "Parse::Token" - Definition of tokens used by "Parse::Lex"

SYNOPSIS

               require 5.005;

               use Parse::Lex;
               # Token specifications: name/regex pairs
               @token = qw(
                   ADDOP    [-+]
                   INTEGER  [1-9][0-9]*
                  );

               $lexer = Parse::Lex->new(@token);
               $lexer->from(\*DATA);    # analyze the text after __END__

               # next() returns the matched text; status() tells whether
               # the search succeeded
               $content = $INTEGER->next;
               if ($INTEGER->status) {
                 print "$content\n";
               }
               $content = $ADDOP->next;
               if ($ADDOP->status) {
                 print "$content\n";
               }
               # isnext() returns the status and stores the text in $content
               if ($INTEGER->isnext(\$content)) {
                 print "$content\n";
               }
               __END__
               1+2

DESCRIPTION

       The "Parse::Token" class and its derived classes define the tokens
       used by "Parse::Lex" or "Parse::LexEvent".

       Tokens are created by means of the "new()" or "factory()" methods.
       The "new()" method of the "Parse::Lex" package indirectly creates
       instances of the tokens to be recognized.

       The "next()" and "isnext()" methods of the "Parse::Token" package
       interface the lexical analyzer with a recursive-descent parser.  For
       interfacing with "byacc", see the "Parse::YYLex" package.

       "Parse::Token" is loaded indirectly by means of "use Parse::Lex" or
       "use Parse::LexEvent".

Methods

       action
           Returns the anonymous subroutine defined within the "Parse::Token"
           object.

       factory LIST
       factory ARRAY_REF
           The "factory(LIST)" method creates a list of tokens from a list of
           specifications, which include for each token: a name, a regular
           expression, and optionally an anonymous subroutine.  The list can
           also include objects of class "Parse::Token" or of a class derived
           from it.

           The "factory(ARRAY_REF)" method creates tokens from
           attribute-value specifications:

                   Parse::Token->factory([Type => 'Simple',
                                          Name => 'EXAMPLE',
                                          Regex => '.+']);

           "Type" indicates the type of each token to be created (without the
           package prefix).

           "factory()" creates a series of tokens but does not import these
           tokens into the calling package.

           You could for example write:

                   %keywords =
                     qw (
                         PROC  undef
                         FUNC  undef
                         RETURN undef
                         IF    undef
                         ELSE  undef
                         WHILE undef
                         PRINT undef
                         READ  undef
                        );
                   @tokens = Parse::Token->factory(%keywords);

           and install these tokens in a symbol table in the following manner:

                   foreach $name (keys %keywords) {
                     ${$name} = pop @tokens;
                     $symbol{"\L$name"} = [${$name}, ''];
                   }

           "${$name}" is the token instance.

           During the lexical analysis phase, you can use the tokens in the
           following manner:

                   qw(IDENT [a-zA-Z][a-zA-Z0-9_]*),  sub {
                      $symbol{$_[1]} = [] unless defined $symbol{$_[1]};
                      my $type = $symbol{$_[1]}[0];
                      $lexer->setToken((not defined $type) ? $VAR : $type);
                      $_[1];  # THE TOKEN TEXT
                    }

           This marks any symbol of unknown type as a variable.

           This example uses $_[1], which holds the text recognized by the
           regular expression.  The anonymous subroutine must return the
           text to be associated with the token.
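
           The lookup performed inside the IDENT subroutine can be tried in
           isolation.  The following is a minimal sketch in plain Perl.  It
           assumes, for illustration only, that plain strings stand in for
           the token objects, and it uses a helper named classify() that is
           hypothetical and not part of "Parse::Token":

```perl
use strict;
use warnings;

# Assumption: strings stand in for the token objects that factory()
# would create; the table layout follows the example above.
my %symbol = map { lc($_) => [ $_, '' ] }
             qw(PROC FUNC RETURN IF ELSE WHILE PRINT READ);

# Hypothetical helper mirroring the body of the IDENT subroutine:
# an unknown identifier gets an empty entry, so its type is undefined
# and it is classified as a variable.
sub classify {
    my ($text) = @_;
    $symbol{$text} = [] unless defined $symbol{$text};
    my $type = $symbol{$text}[0];
    return defined $type ? $type : 'VAR';
}

print classify('while'), "\n";   # an entry exists for this keyword
print classify('count'), "\n";   # no entry yet, so it is a variable
```

           After the first call with an unknown name, the table holds an
           empty entry for it, so later lookups of the same name still
           yield a variable.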

       get EXPR
           "get" returns the value of the attribute named by the result of
           evaluating EXPR.  You can also use the name of the attribute as a
           method name.

       getText
           Returns the character string that was recognized by means of this
           "Parse::Token" object.

           Same as the "text()" method.

       isnext EXPR
       isnext
           Returns the status of the token.  The consumed string is stored in
           EXPR if EXPR is a reference to a scalar.

       name
           Returns the name of the token.

       next
           Searches for the lexeme defined by the regular expression
           contained in the object.  If this lexeme is recognized in the
           character stream being analyzed, "next" returns the string found
           and sets the status of the object to true.

       new SYMBOL_NAME, REGEXP, SUB
       new SYMBOL_NAME, REGEXP
           Creates an object of type "Parse::Token::Simple" or
           "Parse::Token::Segmented".  The arguments of the "new()" method
           are, respectively: a symbolic name, a regular expression, and
           optionally an anonymous subroutine.  The subclasses of
           "Parse::Token" allow tokens to be specified by means of a list of
           attribute-value pairs.

           REGEXP is either a simple regular expression, or a reference to an
           array containing from one to three regular expressions.  In the
           first case, the instance belongs to the "Parse::Token::Simple"
           class.  In the second case, the instance belongs to the
           "Parse::Token::Segmented" class.  Tokens of this type recognize
           structures such as character strings delimited by quotation
           marks, comments in a C program, etc.  The regular expressions are
           used to recognize:

           1. The beginning of the lexeme,

           2. The "body" of the lexeme; if this second expression is missing,
           "Parse::Lex" uses "(?:.*?)",

           3. The end of the lexeme; if this last expression is missing, then
           the first one is used.  (Note: the end of the lexeme cannot span
           several lines.)

           Example:

                     qw(STRING), [qw(" (?:[^"\\\\]+|\\\\(?:.|\n))* ")],

           These regular expressions can recognize multi-line strings
           delimited by quotation marks, where the backslash is used to quote
           the quotation marks appearing within the string.  Notice the
           quadrupling of the backslash.

           Here is a variation of the previous example which uses the "s"
           option to include newline in the characters recognized by ".":

                     qw(STRING), [qw(" (?s:[^"\\\\]+|\\\\.)* ")],

           (Note: it is possible to write regular expressions which are more
           efficient in terms of execution time, but this is not our objective
           with this example.  See Mastering Regular Expressions.)
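
           To see why the backslash is quadrupled, it can help to inspect
           what "qw" actually produces.  The sketch below uses only core
           Perl.  Joining the three patterns by simple concatenation is an
           assumption made for illustration; it is not a description of how
           "Parse::Lex" combines them internally:

```perl
use strict;
use warnings;

# Same word list as the STRING example: start, body, end.
my ($start, $body, $end) = qw(" (?:[^"\\\\]+|\\\\(?:.|\n))* ");

# qw() turns each \\ into a single backslash, so the four backslashes
# written in the source become two here, and the regex engine then
# reads those two as one literal backslash.
print "$body\n";

# Assumption: plain concatenation of the three parts.
my $re = qr/$start$body$end/;

# A quoted string containing escaped quotes and a newline.
my $input = qq{say "a \\"quoted\\"\nword" now};
my ($match) = $input =~ /($re)/;
print "$match\n";
```

           The first line printed is the body pattern with doubled, not
           quadrupled, backslashes.  The match covers the whole quoted
           string, including the escaped quotation marks and the embedded
           newline.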

           The anonymous subroutine is called when the lexeme is recognized by
           the lexical analyzer.  This subroutine takes two arguments: $_[0]
           contains the token instance, and $_[1] contains the string
           recognized by the regular expression.  The scalar returned by the
           anonymous subroutine defines the character string stored in the
           token instance.

           In the anonymous subroutine you can use the positional variables
           $1, $2, etc., which correspond to the groups of parentheses in the
           regular expression.

       regexp
           Returns the regular expression of the "Token" object.

       set LIST
           Marks a token with a list of attribute-value pairs.

           An attribute name can be used as a method name.

       setText EXPR
           The value of "EXPR" defines the character string associated with
           the lexeme.

           Same as the "text(EXPR)" method.

       status EXPR
       status
           Indicates whether the last search for the lexeme succeeded or
           failed.  "status EXPR" overrides the existing value and sets it to
           the value of EXPR.

       text EXPR
       text
           "text()" returns the character string recognized by means of the
           token.  "text(EXPR)" sets the character string associated with
           the lexeme to the value of "EXPR".

       trace OUTPUT
       trace
           Class method which activates or deactivates a trace of the lexical
           analysis.

           "OUTPUT" can be a file name or a reference to a filehandle to which
           the trace will be directed.

Subclasses of Parse::Token

       Subclasses of the "Parse::Token" class are being defined.  They
       recognize specific structures such as strings within double quotes,
       C comments, etc.  Here are the subclasses which I am working on:

       "Parse::Token::Simple" : tokens of this class are defined by means of a
       single regular expression.

       "Parse::Token::Segmented" : tokens of this class are defined by means
       of three regular expressions.  New data is read automatically.

       "Parse::Token::Delimited" : recognizes, for example, C language
       comments.

       "Parse::Token::Quoted" : recognizes, for example, character strings
       within quotation marks.

       "Parse::Token::Nested" : recognizes nested structures such as
       parenthesized expressions.  NOT DEFINED.

       These classes were created recently and no doubt contain some bugs.

   Parse::Token::Action
       Tokens of the "Parse::Token::Action" class insert arbitrary Perl
       expressions into a lexical analyzer.  An expression can be used,
       for instance, to print out internal variables of the analyzer:

       •   $LEX_BUFFER : contents of the buffer to be analyzed

       •   $LEX_LENGTH : length of the character string being analyzed

       •   $LEX_RECORD : number of the record being analyzed

       •   $LEX_OFFSET : number of characters already consumed since the start
           of the analysis.

       •   $LEX_POS : position reached by the analysis as a number of
           characters since the start of the buffer.

       The class constructor accepts the following attributes:

       •   "Name" : the name of the token

       •   "Expr" : a Perl expression

       Example:

               $ACTION = Parse::Token::Action->new(
                                             Name => 'ACTION',
                                             Expr => q!print "LEX_POS: $LEX_POS\n" .
                                             "LEX_BUFFER: $LEX_BUFFER\n" .
                                             "LEX_LENGTH: $LEX_LENGTH\n" .
                                             "LEX_RECORD: $LEX_RECORD\n" .
                                             "LEX_OFFSET: $LEX_OFFSET\n"
                                             ;!,
                                            );

   Parse::Token::Simple
       The class constructor accepts the following attributes:

       •   "Handler" : the value is the name of a function to call during
           an analysis performed by an analyzer of class "Parse::LexEvent".

       •   "Name" : the associated value is the name of the token.

       •   "Regex" : the associated value is a regular expression
           corresponding to the pattern to be recognized.

       •   "ReadMore" : if the associated value is 1, the recognition of the
           token continues after reading a new record.  The strings recognized
           are concatenated.  This attribute only has an effect during
           analysis of a character stream.

       •   "Sub" : the associated value must be an anonymous subroutine to be
           executed after the token is recognized.  This function is only used
           with analyzers of class "Parse::Lex" or "Parse::CLex".

       Example:

             new Parse::Token::Simple(Name => 'remainder',
                                      Regex => '[^/\'\"]+',
                                      ReadMore => 1);

   Parse::Token::Segmented
       The definition of these tokens includes three regular expressions.
       During analysis of a data stream, new data is read as long as the end
       of the token has not been reached.

       The class constructor accepts the following attributes:

       •   "Handler" : the value is the name of a function to call during
           an analysis performed by an analyzer of class "Parse::LexEvent".

       •   "Name" : the associated value is the name of the token.

       •   "Regex" : the associated value must be a reference to an array that
           contains three regular expressions.

       •   "Sub" : the associated value must be an anonymous subroutine to be
           executed after the token is recognized.  This function is only used
           with analyzers of class "Parse::Lex" or "Parse::CLex".

   Parse::Token::Quoted
       "Parse::Token::Quoted" is a subclass of "Parse::Token::Segmented".  It
       recognizes character strings within double quotes or single quotes.

       Examples:

             ---------------------------------------------------------
              Start    End            Escaping
             ---------------------------------------------------------
               '        '              ''
               "        "              ""
               "        "              \
             ---------------------------------------------------------

       The class constructor accepts the following attributes:

       •   "End" : the associated value is a regular expression that
           recognizes the end of the token.

       •   "Escape" : the associated value indicates the character used to
           escape the delimiter.  By default, a double occurrence of the
           terminating character escapes that character.

       •   "Handler" : the value is the name of a function to call during
           an analysis performed by an analyzer of class "Parse::LexEvent".

       •   "Name" : the associated value is the name of the token.

       •   "Start" : the associated value is a regular expression that
           recognizes the start of the token.

       •   "Sub" : the associated value must be an anonymous subroutine to be
           executed after the token is recognized.  This function is only used
           with analyzers of class "Parse::Lex" or "Parse::CLex".

       Example:

             new Parse::Token::Quoted(Name => 'squotes',
                                      Handler => 'string',
                                      Escape => '\\',
                                      Quote => qq!\'!,
                                     );
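
       The two escaping conventions in the table above can be illustrated
       with hand-written regular expressions in core Perl.  This is only a
       sketch of what each convention accepts; the patterns are assumptions
       written for this illustration and are not the ones that
       "Parse::Token::Quoted" builds:

```perl
use strict;
use warnings;

# Doubled-delimiter escaping: '' inside '...' stands for one quote.
my $doubled = qr/'(?:[^']|'')*'/;

# Backslash escaping: \" inside "..." stands for one quote.
# The /s flag lets an escaped newline be part of the string.
my $backslash = qr/"(?:[^"\\]|\\.)*"/s;

my ($sq) = q{x = 'it''s' ;}        =~ /($doubled)/;
my ($dq) = q{y = "a \"b\" c" ;}    =~ /($backslash)/;

print "$sq\n";
print "$dq\n";
```

       In both cases the match runs from the opening delimiter to the
       first unescaped closing delimiter, so the escaped quotes stay inside
       the recognized string.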

   Parse::Token::Delimited
       "Parse::Token::Delimited" is a subclass of "Parse::Token::Segmented".
       It recognizes, for example, C language comments.

       Examples:

             ---------------------------------------------------------
               Start   End     Constraint
                               on the contents
             ---------------------------------------------------------
               /*       */                         C Comment
               <!--     -->      No '--'           XML Comment
               <!--     -->                        SGML Comment
               <?       ?>                         Processing instruction
                                                   in SGML/XML
             ---------------------------------------------------------

       The class constructor accepts the following attributes:

       •   "End" : the associated value is a regular expression that
           recognizes the end of the token.

       •   "Handler" : the value is the name of a function to call during
           an analysis performed by an analyzer of class "Parse::LexEvent".

       •   "Name" : the associated value is the name of the token.

       •   "Start" : the associated value is a regular expression that
           recognizes the start of the token.

       •   "Sub" : the associated value must be an anonymous subroutine to be
           executed after the token is recognized.  This function is only used
           with analyzers of class "Parse::Lex" or "Parse::CLex".

       Example:

             new Parse::Token::Delimited(Name => 'comment',
                                         Start => '/[*]',
                                         End => '[*]/'
                                        );
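
       To check what the "Start" and "End" patterns of this example match,
       they can be combined by hand in core Perl.  The non-greedy body
       "(?s:.*?)" is an assumption made for this sketch; it is not a
       description of how "Parse::Token::Delimited" scans the input:

```perl
use strict;
use warnings;

# Same Start and End values as in the example above.
my ($start, $end) = ('/[*]', '[*]/');

# Assumption: a non-greedy body that may span lines, so that each
# match stops at the first occurrence of End.
my $comment = qr/$start(?s:.*?)$end/;

my $code = "a = 1; /* first\n   comment */ b = 2; /* second */";
my @comments = $code =~ /($comment)/g;

print scalar(@comments), " comments found\n";
print "$_\n" for @comments;
```

       With a greedy body the first match would run to the last "*/" in
       the input and swallow the code between the two comments, which is
       why the sketch uses the non-greedy form.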

   Parse::Token::Nested - Not defined
       Examples:

             ----------------------------------------------------------
               Start   End
             ----------------------------------------------------------
               (        )                      Symbolic Expressions
               {        }                      Rich Text Format Groups
             ----------------------------------------------------------

BUGS

       The implementation of the token subclasses is not complete for
       analyzers of the "Parse::CLex" class.  I am not too keen to do it,
       since an implementation for the "Parse::Lex" and "Parse::LexEvent"
       classes seems quite sufficient.

AUTHOR

       Philippe Verdret. Documentation translated to English by Vladimir
       Alexiev and Ocrat.

ACKNOWLEDGMENTS

       Version 2.0 owes much to suggestions made by Vladimir Alexiev.  Ocrat
       has contributed significantly to improving this documentation.  Thanks
       also to the many people who have made comments or sent bug fixes.

REFERENCES

       Friedl, J.E.F. Mastering Regular Expressions. O'Reilly & Associates,
       1996.

       Mason, T. & Brown, D. Lex & Yacc. O'Reilly & Associates, Inc., 1990.

COPYRIGHT

       Copyright (c) 1995-1999 Philippe Verdret. All rights reserved. This
       module is free software; you can redistribute it and/or modify it under
       the same terms as Perl itself.

perl v5.34.0                      2022-01-21                   Parse::Token(3)