Parse::RecDescent(3)  User Contributed Perl Documentation  Parse::RecDescent(3)

NAME
Parse::RecDescent - Generate Recursive-Descent Parsers

VERSION
This document describes version 1.967009 of Parse::RecDescent released
March 16th, 2012.

SYNOPSIS
use Parse::RecDescent;
14
15 # Generate a parser from the specification in $grammar:
16
17 $parser = new Parse::RecDescent ($grammar);
18
19 # Generate a parser from the specification in $othergrammar
20
21 $anotherparser = new Parse::RecDescent ($othergrammar);
22
23
24 # Parse $text using rule 'startrule' (which must be
25 # defined in $grammar):
26
27 $parser->startrule($text);
28
29
30 # Parse $text using rule 'otherrule' (which must also
31 # be defined in $grammar):
32
33 $parser->otherrule($text);
34
35
36 # Change the universal token prefix pattern
37 # before building a grammar
38 # (the default is: '\s*'):
39
40 $Parse::RecDescent::skip = '[ \t]+';
41
42
43 # Replace productions of existing rules (or create new ones)
44 # with the productions defined in $newgrammar:
45
46 $parser->Replace($newgrammar);
47
48
49 # Extend existing rules (or create new ones)
50 # by adding extra productions defined in $moregrammar:
51
52 $parser->Extend($moregrammar);
53
54
    # Global flags (useful as command line arguments under -s):

    $::RD_ERRORS       # unless undefined, report fatal errors
    $::RD_WARN         # unless undefined, also report non-fatal problems
    $::RD_HINT         # if defined, also suggest remedies
    $::RD_TRACE        # if defined, also trace parsers' behaviour
    $::RD_AUTOSTUB     # if defined, generates "stubs" for undefined rules
    $::RD_AUTOACTION   # if defined, appends specified action to productions
63
DESCRIPTION
Overview
66 Parse::RecDescent incrementally generates top-down recursive-descent
67 text parsers from simple yacc-like grammar specifications. It provides:
68
69 · Regular expressions or literal strings as terminals (tokens),
70
71 · Multiple (non-contiguous) productions for any rule,
72
73 · Repeated and optional subrules within productions,
74
75 · Full access to Perl within actions specified as part of the
76 grammar,
77
78 · Simple automated error reporting during parser generation and
79 parsing,
80
81 · The ability to commit to, uncommit to, or reject particular
82 productions during a parse,
83
84 · The ability to pass data up and down the parse tree ("down" via
85 subrule argument lists, "up" via subrule return values)
86
87 · Incremental extension of the parsing grammar (even during a parse),
88
89 · Precompilation of parser objects,
90
91 · User-definable reduce-reduce conflict resolution via "scoring" of
92 matching productions.
93
94 Using "Parse::RecDescent"
95 Parser objects are created by calling "Parse::RecDescent::new", passing
96 in a grammar specification (see the following subsections). If the
97 grammar is correct, "new" returns a blessed reference which can then be
98 used to initiate parsing through any rule specified in the original
99 grammar. A typical sequence looks like this:
100
101 $grammar = q {
102 # GRAMMAR SPECIFICATION HERE
103 };
104
105 $parser = new Parse::RecDescent ($grammar) or die "Bad grammar!\n";
106
107 # acquire $text
108
109 defined $parser->startrule($text) or print "Bad text!\n";
110
The rule through which parsing is initiated must be explicitly defined
in the grammar (i.e. for the above example, the grammar must include a
rule of the form: "startrule: <subrules>").
114
115 If the starting rule succeeds, its value (see below) is returned.
116 Failure to generate the original parser or failure to match a text is
117 indicated by returning "undef". Note that it's easy to set up grammars
118 that can succeed, but which return a value of 0, "0", or "". So don't
119 be tempted to write:
120
121 $parser->startrule($text) or print "Bad text!\n";
122
123 Normally, the parser has no effect on the original text. So in the
124 previous example the value of $text would be unchanged after having
125 been parsed.
126
127 If, however, the text to be matched is passed by reference:
128
129 $parser->startrule(\$text)
130
131 then any text which was consumed during the match will be removed from
132 the start of $text.
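
The difference between the two calling styles can be sketched as follows.
This is only an illustration: the rule name "greeting" and the sample
strings are invented for the example and are not part of the module.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Hypothetical one-rule grammar: the word 'hello' followed by a name.
my $parser = Parse::RecDescent->new(q{
    greeting: 'hello' /\w+/
}) or die "Bad grammar!\n";

my $by_value = "hello world and more";
my $by_ref   = "hello world and more";

# Passed by value: the caller's string is left alone.
my $first = $parser->greeting($by_value);

# Passed by reference: the consumed prefix is removed from the caller's string.
my $second = $parser->greeting(\$by_ref);

(defined $first and defined $second) or die "Bad text!\n";

print "rule value:          $first\n";
print "after by-value call: '$by_value'\n";
print "after by-ref call:   '$by_ref'\n";
```

Note that the rule succeeds even though it does not consume the whole
input; the by-reference form is one way to find out what was left over.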
133
134 Rules
135 In the grammar from which the parser is built, rules are specified by
136 giving an identifier (which must satisfy /[A-Za-z]\w*/), followed by a
137 colon on the same line, followed by one or more productions, separated
138 by single vertical bars. The layout of the productions is entirely
139 free-format:
140
141 rule1: production1
142 | production2 |
143 production3 | production4
144
145 At any point in the grammar previously defined rules may be extended
146 with additional productions. This is achieved by redeclaring the rule
147 with the new productions. Thus:
148
149 rule1: a | b | c
150 rule2: d | e | f
151 rule1: g | h
152
153 is exactly equivalent to:
154
155 rule1: a | b | c | g | h
156 rule2: d | e | f
157
158 Each production in a rule consists of zero or more items, each of which
159 may be either: the name of another rule to be matched (a "subrule"), a
160 pattern or string literal to be matched directly (a "token"), a block
161 of Perl code to be executed (an "action"), a special instruction to the
162 parser (a "directive"), or a standard Perl comment (which is ignored).
163
164 A rule matches a text if one of its productions matches. A production
165 matches if each of its items match consecutive substrings of the text.
166 The productions of a rule being matched are tried in the same order
167 that they appear in the original grammar, and the first matching
168 production terminates the match attempt (successfully). If all
169 productions are tried and none matches, the match attempt fails.
170
171 Note that this behaviour is quite different from the "prefer the longer
172 match" behaviour of yacc. For example, if yacc were parsing the rule:
173
174 seq : 'A' 'B'
175 | 'A' 'B' 'C'
176
177 upon matching "AB" it would look ahead to see if a 'C' is next and, if
178 so, will match the second production in preference to the first. In
179 other words, yacc effectively tries all the productions of a rule
180 breadth-first in parallel, and selects the "best" match, where "best"
181 means longest (note that this is a gross simplification of the true
182 behaviour of yacc but it will do for our purposes).
183
184 In contrast, "Parse::RecDescent" tries each production depth-first in
185 sequence, and selects the "best" match, where "best" means first. This
186 is the fundamental difference between "bottom-up" and "recursive
187 descent" parsing.
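
A small sketch of the "first match wins" behaviour, assuming two
made-up rules ("short_first" and "long_first") that differ only in the
order of their productions:

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Two hypothetical rules with the same productions in opposite order.
my $parser = Parse::RecDescent->new(q{
    short_first: 'A' 'B'      { $return = 'AB'  }
               | 'A' 'B' 'C'  { $return = 'ABC' }

    long_first:  'A' 'B' 'C'  { $return = 'ABC' }
               | 'A' 'B'      { $return = 'AB'  }
}) or die "Bad grammar!\n";

# The first production that matches ends the attempt, so the longer
# alternative in short_first is never tried on this input.
my $short = $parser->short_first("ABC");
my $long  = $parser->long_first("ABC");

print "short_first matched: $short\n";
print "long_first matched:  $long\n";
```

In practice this means that, when one production is a prefix of another,
the longer one usually has to be listed first.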
188
189 Each successfully matched item in a production is assigned a value,
190 which can be accessed in subsequent actions within the same production
191 (or, in some cases, as the return value of a successful subrule call).
192 Unsuccessful items don't have an associated value, since the failure of
193 an item causes the entire surrounding production to immediately fail.
194 The following sections describe the various types of items and their
195 success values.
196
197 Subrules
198 A subrule which appears in a production is an instruction to the parser
199 to attempt to match the named rule at that point in the text being
200 parsed. If the named subrule is not defined when requested the
201 production containing it immediately fails (unless it was "autostubbed"
202 - see Autostubbing).
203
204 A rule may (recursively) call itself as a subrule, but not as the left-
205 most item in any of its productions (since such recursions are usually
206 non-terminating).
207
208 The value associated with a subrule is the value associated with its
209 $return variable (see "Actions" below), or with the last successfully
210 matched item in the subrule match.
211
212 Subrules may also be specified with a trailing repetition specifier,
213 indicating that they are to be (greedily) matched the specified number
214 of times. The available specifiers are:
215
216 subrule(?) # Match one-or-zero times
217 subrule(s) # Match one-or-more times
218 subrule(s?) # Match zero-or-more times
219 subrule(N) # Match exactly N times for integer N > 0
220 subrule(N..M) # Match between N and M times
221 subrule(..M) # Match between 1 and M times
222 subrule(N..) # Match at least N times
223
224 Repeated subrules keep matching until either the subrule fails to
225 match, or it has matched the minimal number of times but fails to
226 consume any of the parsed text (this second condition prevents the
227 subrule matching forever in some cases).
228
229 Since a repeated subrule may match many instances of the subrule
230 itself, the value associated with it is not a simple scalar, but rather
231 a reference to a list of scalars, each of which is the value associated
232 with one of the individual subrule matches. In other words in the rule:
233
234 program: statement(s)
235
236 the value associated with the repeated subrule "statement(s)" is a
237 reference to an array containing the values matched by each call to the
238 individual subrule "statement".
239
240 Repetition modifiers may include a separator pattern:
241
242 program: statement(s /;/)
243
244 specifying some sequence of characters to be skipped between each
245 repetition. This is really just a shorthand for the <leftop:...>
246 directive (see below).
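
As an illustration of the value produced by a repeated subrule with a
separator, here is a minimal sketch. The rule names "list" and "number"
are invented for the example.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Hypothetical grammar: one or more numbers separated by commas.
my $parser = Parse::RecDescent->new(q{
    list:   number(s /,/)
    number: /\d+/
}) or die "Bad grammar!\n";

# The value of the repeated subrule is a reference to an array holding
# the value of each individual "number" match; the separators are skipped.
my $values = $parser->list("1, 22, 333");
defined $values or die "Bad text!\n";

print "matched ", scalar(@$values), " numbers: @$values\n";
```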
247
248 Tokens
249 If a quote-delimited string or a Perl regex appears in a production,
250 the parser attempts to match that string or pattern at that point in
251 the text. For example:
252
253 typedef: "typedef" typename identifier ';'
254
255 identifier: /[A-Za-z_][A-Za-z0-9_]*/
256
257 As in regular Perl, a single quoted string is uninterpolated, whilst a
258 double-quoted string or a pattern is interpolated (at the time of
259 matching, not when the parser is constructed). Hence, it is possible to
260 define rules in which tokens can be set at run-time:
261
262 typedef: "$::typedefkeyword" typename identifier ';'
263
264 identifier: /$::identpat/
265
Note that, since each rule is implemented inside a special namespace
belonging to its parser, it is necessary to explicitly qualify
variables from the main package.
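
The following sketch shows a token being changed at run-time. It
assumes the script runs in package main; the variable $keyword and the
rule name "declaration" are hypothetical.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# A package variable in main; the grammar refers to it as $::keyword.
our $keyword = 'let';

my $parser = Parse::RecDescent->new(q{
    declaration: "$::keyword" /[a-z]\w*/i
}) or die "Bad grammar!\n";

# The double-quoted token is interpolated each time a match is attempted.
my $with_let = $parser->declaration("let total");

$keyword = 'var';    # change the token after the parser has been built

my $stale    = $parser->declaration("let total");   # 'let' no longer matches
my $with_var = $parser->declaration("var total");

print "with 'let': $with_let\n";
print "stale 'let' is ", (defined $stale ? "accepted" : "rejected"), "\n";
print "with 'var': $with_var\n";
```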
269
270 Regex tokens can be specified using just slashes as delimiters or with
271 the explicit "m<delimiter>......<delimiter>" syntax:
272
273 typedef: "typedef" typename identifier ';'
274
275 typename: /[A-Za-z_][A-Za-z0-9_]*/
276
277 identifier: m{[A-Za-z_][A-Za-z0-9_]*}
278
279 A regex of either type can also have any valid trailing parameter(s)
280 (that is, any of [cgimsox]):
281
282 typedef: "typedef" typename identifier ';'
283
284 identifier: / [a-z_] # LEADING ALPHA OR UNDERSCORE
285 [a-z0-9_]* # THEN DIGITS ALSO ALLOWED
286 /ix # CASE/SPACE/COMMENT INSENSITIVE
287
288 The value associated with any successfully matched token is a string
289 containing the actual text which was matched by the token.
290
291 It is important to remember that, since each grammar is specified in a
292 Perl string, all instances of the universal escape character '\' within
293 a grammar must be "doubled", so that they interpolate to single '\'s
294 when the string is compiled. For example, to use the grammar:
295
296 word: /\S+/ | backslash
297 line: prefix word(s) "\n"
298 backslash: '\\'
299
300 the following code is required:
301
302 $parser = new Parse::RecDescent (q{
303
304 word: /\\S+/ | backslash
305 line: prefix word(s) "\\n"
306 backslash: '\\\\'
307
308 });
309
310 Anonymous subrules
Parentheses introduce a nested scope that is very like a call to an
anonymous subrule. Hence they are useful for "in-lining" subrule
calls, and other kinds of grouping behaviour. For example, instead of:
314
315 word: /\S+/ | backslash
316 line: prefix word(s) "\n"
317
318 you could write:
319
320 line: prefix ( /\S+/ | backslash )(s) "\n"
321
322 and get exactly the same effects.
323
Parentheses are also used for collecting unrepeated alternations within
a single production.
326
327 secret_identity: "Mr" ("Incredible"|"Fantastic"|"Sheen") ", Esq."
328
329 Terminal Separators
330 For the purpose of matching, each terminal in a production is
331 considered to be preceded by a "prefix" - a pattern which must be
332 matched before a token match is attempted. By default, the prefix is
333 optional whitespace (which always matches, at least trivially), but
334 this default may be reset in any production.
335
336 The variable $Parse::RecDescent::skip stores the universal prefix,
337 which is the default for all terminal matches in all parsers built with
338 "Parse::RecDescent".
339
340 If you want to change the universal prefix using
341 $Parse::RecDescent::skip, be careful to set it before creating the
342 grammar object, because it is applied statically (when a grammar is
343 built) rather than dynamically (when the grammar is used).
344 Alternatively you can provide a global "<skip:...>" directive in your
345 grammar before any rules (described later).
346
347 The prefix for an individual production can be altered by using the
348 "<skip:...>" directive (described later). Setting this directive in
349 the top-level rule is an alternative approach to setting
350 $Parse::RecDescent::skip before creating the object, but in this case
351 you don't get the intended skipping behaviour if you directly invoke
352 methods different from the top-level rule.
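
A minimal sketch of a per-production prefix, assuming two made-up rules
that match the same tokens but differ in what may separate them:

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Hypothetical grammar. any_space uses the default prefix (optional
# whitespace); same_line restricts the prefix to spaces and tabs.
my $parser = Parse::RecDescent->new(q{
    any_space: 'a' 'b'
    same_line: <skip:'[ \t]*'> 'a' 'b'
}) or die "Bad grammar!\n";

my $across_lines = $parser->any_space("a\nb");     # newline is skipped
my $blocked      = $parser->same_line("a\nb");     # newline is not skipped
my $one_line     = $parser->same_line("a \t b");

print "any_space across lines: ", (defined $across_lines ? "ok" : "fail"), "\n";
print "same_line across lines: ", (defined $blocked      ? "ok" : "fail"), "\n";
print "same_line on one line:  ", (defined $one_line     ? "ok" : "fail"), "\n";
```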
353
354 Actions
355 An action is a block of Perl code which is to be executed (as the block
356 of a "do" statement) when the parser reaches that point in a
357 production. The action executes within a special namespace belonging to
358 the active parser, so care must be taken in correctly qualifying
359 variable names (see also "Start-up Actions" below).
360
361 The action is considered to succeed if the final value of the block is
362 defined (that is, if the implied "do" statement evaluates to a defined
363 value - even one which would be treated as "false"). Note that the
364 value associated with a successful action is also the final value in
365 the block.
366
367 An action will fail if its last evaluated value is "undef". This is
368 surprisingly easy to accomplish by accident. For instance, here's an
369 infuriating case of an action that makes its production fail, but only
370 when debugging isn't activated:
371
372 description: name rank serial_number
373 { print "Got $item[2] $item[1] ($item[3])\n"
374 if $::debugging
375 }
376
If $::debugging is false, no statement in the block is executed, so the
final value is "undef", and the entire production fails. The solution
is:
380
381 description: name rank serial_number
382 { print "Got $item[2] $item[1] ($item[3])\n"
383 if $::debugging;
384 1;
385 }
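
The same pitfall can be reproduced in a few lines. This sketch assumes
the script runs in package main and deliberately leaves $debugging
undefined; the rule names "fragile" and "robust" are invented.

```perl
use strict;
use warnings;
use Parse::RecDescent;

our $debugging;    # deliberately left undefined (debugging is off)

my $parser = Parse::RecDescent->new(q{
    fragile: /\w+/ { print "Got $item[1]\n" if $::debugging }
    robust:  /\w+/ { print "Got $item[1]\n" if $::debugging; 1 }
}) or die "Bad grammar!\n";

# The action in "fragile" ends with an undefined value, so the production
# fails even though the pattern matched.
my $fragile = $parser->fragile("sergeant");

# The trailing 1 in "robust" guarantees a defined final value.
my $robust = $parser->robust("sergeant");

print "fragile: ", (defined $fragile ? "matched" : "failed"), "\n";
print "robust:  ", (defined $robust  ? "matched" : "failed"), "\n";
```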
386
387 Within an action, a number of useful parse-time variables are available
388 in the special parser namespace (there are other variables also
389 accessible, but meddling with them will probably just break your
390 parser. As a general rule, if you avoid referring to unqualified
391 variables - especially those starting with an underscore - inside an
392 action, things should be okay):
393
394 @item and %item
395 The array slice @item[1..$#item] stores the value associated with
396 each item (that is, each subrule, token, or action) in the current
397 production. The analogy is to $1, $2, etc. in a yacc grammar. Note
398 that, for obvious reasons, @item only contains the values of items
399 before the current point in the production.
400
401 The first element ($item[0]) stores the name of the current rule
402 being matched.
403
404 @item is a standard Perl array, so it can also be indexed with
405 negative numbers, representing the number of items back from the
406 current position in the parse:
407
408 stuff: /various/ bits 'and' pieces "then" data 'end'
409 { print $item[-2] } # PRINTS data
410 # (EASIER THAN: $item[6])
411
The %item hash complements the @item array, providing named
access to the same item values:

    stuff: /various/ bits 'and' pieces "then" data 'end'
           { print $item{data} }   # PRINTS data
                                   # (EVEN EASIER THAN USING @item)
418
The results of named subrules are stored in the hash under each
subrule's name (including the repetition specifier, if any), whilst
all other items are stored under a "named positional" key that
indicates their ordinal position within their item type:
__STRINGn__, __PATTERNn__, __DIRECTIVEn__, __ACTIONn__:
424
425 stuff: /various/ bits 'and' pieces "then" data 'end' { save }
426 { print $item{__PATTERN1__}, # PRINTS 'various'
427 $item{__STRING2__}, # PRINTS 'then'
428 $item{__ACTION1__}, # PRINTS RETURN
429 # VALUE OF save
430 }
431
432 If you want proper named access to patterns or literals, you need
433 to turn them into separate rules:
434
435 stuff: various bits 'and' pieces "then" data 'end'
436 { print $item{various} # PRINTS various
437 }
438
439 various: /various/
440
The special entry $item{__RULE__} stores the name of the current
rule (i.e. the same value as $item[0]).

The advantage of using %item instead of @item is that it removes
the need to track item positions that may change as a grammar
evolves. For example, adding an interim "<skip>" directive or
action can silently ruin a trailing action, by moving an @item
element "down" the array one place. In contrast, the named entry of
%item is unaffected by such an insertion.
450
451 A limitation of the %item hash is that it only records the last
452 value of a particular subrule. For example:
453
    range: '(' number '..' number ')'
           { $return = $item{number} }
456
457 will return only the value corresponding to the second match of the
458 "number" subrule. In other words, successive calls to a subrule
459 overwrite the corresponding entry in %item. Once again, the
460 solution is to rename each subrule in its own rule:
461
462 range: '(' from_num '..' to_num ')'
463 { $return = $item{from_num} }
464
465 from_num: number
466 to_num: number
467
468 @arg and %arg
469 The array @arg and the hash %arg store any arguments passed to the
470 rule from some other rule (see "Subrule argument lists"). Changes
471 to the elements of either variable do not propagate back to the
472 calling rule (data can be passed back from a subrule via the
473 $return variable - see next item).
474
475 $return
476 If a value is assigned to $return within an action, that value is
477 returned if the production containing the action eventually matches
478 successfully. Note that setting $return doesn't cause the current
479 production to succeed. It merely tells it what to return if it does
480 succeed. Hence $return is analogous to $$ in a yacc grammar.
481
482 If $return is not assigned within a production, the value of the
483 last component of the production (namely: $item[$#item]) is
484 returned if the production succeeds.
485
486 $commit
487 The current state of commitment to the current production (see
488 "Directives" below).
489
490 $skip
491 The current terminal prefix (see "Directives" below).
492
493 $text
494 The remaining (unparsed) text. Changes to $text do not propagate
495 out of unsuccessful productions, but do survive successful
496 productions. Hence it is possible to dynamically alter the text
497 being parsed - for example, to provide a "#include"-like facility:
498
499 hash_include: '#include' filename
500 { $text = ::loadfile($item[2]) . $text }
501
502 filename: '<' /[a-z0-9._-]+/i '>' { $return = $item[2] }
503 | '"' /[a-z0-9._-]+/i '"' { $return = $item[2] }
504
505 $thisline and $prevline
506 $thisline stores the current line number within the current parse
507 (starting from 1). $prevline stores the line number for the last
508 character which was already successfully parsed (this will be
509 different from $thisline at the end of each line).
510
For efficiency, $thisline and $prevline are actually tied scalars,
and only recompute the required line number when the variable's
value is used.
514
515 Assignment to $thisline adjusts the line number calculator, so that
516 it believes that the current line number is the value being
517 assigned. Note that this adjustment will be reflected in all
518 subsequent line numbers calculations.
519
520 Modifying the value of the variable $text (as in the previous
521 "hash_include" example, for instance) will confuse the line
522 counting mechanism. To prevent this, you should call
523 "Parse::RecDescent::LineCounter::resync($thisline)" immediately
524 after any assignment to the variable $text (or, at least, before
525 the next attempt to use $thisline).
526
527 Note that if a production fails after assigning to or resync'ing
528 $thisline, the parser's line counter mechanism will usually be
529 corrupted.
530
531 Also see the entry for @itempos.
532
533 The line number can be set to values other than 1, by calling the
534 start rule with a second argument. For example:
535
536 $parser = new Parse::RecDescent ($grammar);
537
538 $parser->input($text, 10); # START LINE NUMBERS AT 10
539
540 $thiscolumn and $prevcolumn
541 $thiscolumn stores the current column number within the current
542 line being parsed (starting from 1). $prevcolumn stores the column
543 number of the last character which was actually successfully
544 parsed. Usually "$prevcolumn == $thiscolumn-1", but not at the end
545 of lines.
546
For efficiency, $thiscolumn and $prevcolumn are actually tied
scalars, and only recompute the required column number when the
variable's value is used.
550
551 Assignment to $thiscolumn or $prevcolumn is a fatal error.
552
553 Modifying the value of the variable $text (as in the previous
554 "hash_include" example, for instance) may confuse the column
555 counting mechanism.
556
557 Note that $thiscolumn reports the column number before any
558 whitespace that might be skipped before reading a token. Hence if
559 you wish to know where a token started (and ended) use something
560 like this:
561
562 rule: token1 token2 startcol token3 endcol token4
563 { print "token3: columns $item[3] to $item[5]"; }
564
565 startcol: '' { $thiscolumn } # NEED THE '' TO STEP PAST TOKEN SEP
566 endcol: { $prevcolumn }
567
568 Also see the entry for @itempos.
569
570 $thisoffset and $prevoffset
571 $thisoffset stores the offset of the current parsing position
572 within the complete text being parsed (starting from 0).
573 $prevoffset stores the offset of the last character which was
574 actually successfully parsed. In all cases "$prevoffset ==
575 $thisoffset-1".
576
For efficiency, $thisoffset and $prevoffset are actually tied
scalars, and only recompute the required offset when the variable's
value is used.

Assignment to $thisoffset or $prevoffset is a fatal error.
582
583 Modifying the value of the variable $text will not affect the
584 offset counting mechanism.
585
586 Also see the entry for @itempos.
587
588 @itempos
589 The array @itempos stores a hash reference corresponding to each
590 element of @item. The elements of the hash provide the following:
591
592 $itempos[$n]{offset}{from} # VALUE OF $thisoffset BEFORE $item[$n]
593 $itempos[$n]{offset}{to} # VALUE OF $prevoffset AFTER $item[$n]
594 $itempos[$n]{line}{from} # VALUE OF $thisline BEFORE $item[$n]
595 $itempos[$n]{line}{to} # VALUE OF $prevline AFTER $item[$n]
596 $itempos[$n]{column}{from} # VALUE OF $thiscolumn BEFORE $item[$n]
597 $itempos[$n]{column}{to} # VALUE OF $prevcolumn AFTER $item[$n]
598
599 Note that the various "$itempos[$n]...{from}" values record the
600 appropriate value after any token prefix has been skipped.
601
602 Hence, instead of the somewhat tedious and error-prone:
603
604 rule: startcol token1 endcol
605 startcol token2 endcol
606 startcol token3 endcol
607 { print "token1: columns $item[1]
608 to $item[3]
609 token2: columns $item[4]
610 to $item[6]
611 token3: columns $item[7]
612 to $item[9]" }
613
614 startcol: '' { $thiscolumn } # NEED THE '' TO STEP PAST TOKEN SEP
615 endcol: { $prevcolumn }
616
617 it is possible to write:
618
619 rule: token1 token2 token3
620 { print "token1: columns $itempos[1]{column}{from}
621 to $itempos[1]{column}{to}
622 token2: columns $itempos[2]{column}{from}
623 to $itempos[2]{column}{to}
624 token3: columns $itempos[3]{column}{from}
625 to $itempos[3]{column}{to}" }
626
627 Note however that (in the current implementation) the use of
628 @itempos anywhere in a grammar implies that item positioning
629 information is collected everywhere during the parse. Depending on
630 the grammar and the size of the text to be parsed, this may be
631 prohibitively expensive and the explicit use of $thisline,
632 $thiscolumn, etc. may be a better choice.
633
634 $thisparser
635 A reference to the "Parse::RecDescent" object through which parsing
636 was initiated.
637
638 The value of $thisparser propagates down the subrules of a parse
639 but not back up. Hence, you can invoke subrules from another parser
640 for the scope of the current rule as follows:
641
642 rule: subrule1 subrule2
643 | { $thisparser = $::otherparser } <reject>
644 | subrule3 subrule4
645 | subrule5
646
The result is that the first production calls "subrule1" and
"subrule2" of the current parser, and the remaining productions call
the named subrules from $::otherparser. Note, however, that "Bad
Things" will happen if $::otherparser isn't a blessed reference and/or
doesn't have methods with the same names as the required subrules!
652
653 $thisrule
654 A reference to the "Parse::RecDescent::Rule" object corresponding
655 to the rule currently being matched.
656
657 $thisprod
658 A reference to the "Parse::RecDescent::Production" object
659 corresponding to the production currently being matched.
660
661 $score and $score_return
662 $score stores the best production score to date, as specified by an
663 earlier "<score:...>" directive. $score_return stores the
664 corresponding return value for the successful production.
665
666 See "Scored productions".
667
668 Warning: the parser relies on the information in the various "this..."
669 objects in some non-obvious ways. Tinkering with the other members of
670 these objects will probably cause Bad Things to happen, unless you
671 really know what you're doing. The only exception to this advice is
672 that the use of "$this...->{local}" is always safe.
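
To make the @item, %item and $return entries above concrete, here is a
minimal sketch. The grammar mirrors the "range" example from the %item
entry; the rule name "lossy" is invented to show the overwriting
limitation side by side with the workaround.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    # Both "number" matches share one %item entry; the second overwrites the first.
    lossy: '(' number '..' number ')'
           { $return = $item{number} }

    # Renaming each subrule gives each value its own %item entry.
    range: '(' from_num '..' to_num ')'
           { $return = [ $item{from_num}, $item{to_num} ] }

    from_num: number
    to_num:   number
    number:   /\d+/
}) or die "Bad grammar!\n";

my $last = $parser->lossy("(3..9)");
my $both = $parser->range("(3..9)");

print "lossy returned: $last\n";
print "range returned: $both->[0] and $both->[1]\n";
```

Because "from_num", "to_num" and "number" contain no action, each simply
returns the value of its last (and only) item.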
673
674 Start-up Actions
675 Any actions which appear before the first rule definition in a grammar
676 are treated as "start-up" actions. Each such action is stripped of its
677 outermost brackets and then evaluated (in the parser's special
678 namespace) just before the rules of the grammar are first compiled.
679
680 The main use of start-up actions is to declare local variables within
681 the parser's special namespace:
682
683 { my $lastitem = '???'; }
684
685 list: item(s) { $return = $lastitem }
686
    item: book    { $lastitem = 'book';   }
        | bell    { $lastitem = 'bell';   }
        | candle  { $lastitem = 'candle'; }
690
691 but start-up actions can be used to execute any valid Perl code within
692 a parser's special namespace.
693
694 Start-up actions can appear within a grammar extension or replacement
695 (that is, a partial grammar installed via "Parse::RecDescent::Extend()"
696 or "Parse::RecDescent::Replace()" - see "Incremental Parsing"), and
697 will be executed before the new grammar is installed. Note, however,
698 that a particular start-up action is only ever executed once.
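
A runnable variation on the start-up action example, given as a sketch
only: the undefined subrules "book", "bell" and "candle" are replaced
here by string literals so that the grammar is self-contained.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    # Start-up action: declares a lexical shared by the rules below.
    { my $lastitem = '???'; }

    list: item(s) { $return = $lastitem }

    item: 'book'    { $lastitem = 'book'   }
        | 'bell'    { $lastitem = 'bell'   }
        | 'candle'  { $lastitem = 'candle' }
}) or die "Bad grammar!\n";

# Each successful "item" match updates $lastitem inside the parser's namespace.
my $last = $parser->list("book bell candle");
defined $last or die "Bad text!\n";

print "last item seen: $last\n";
```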
699
700 Autoactions
It is sometimes desirable to be able to specify a default action to be
taken at the end of every production (for example, in order to easily
build a parse tree). If the variable $::RD_AUTOACTION is defined when
"Parse::RecDescent::new()" is called, the contents of that variable are
treated as a specification of an action which is to be appended to each
production in the corresponding grammar.
707
708 Alternatively, you can hard-code the autoaction within a grammar, using
709 the "<autoaction:...>" directive.
710
711 So, for example, to construct a simple parse tree you could write:
712
    $::RD_AUTOACTION = q { [@item] };

    $parser = Parse::RecDescent->new(q{
        expression: and_expr '||' expression | and_expr
        and_expr:   not_expr '&&' and_expr   | not_expr
        not_expr:   '!' brack_expr           | brack_expr
        brack_expr: '(' expression ')'       | identifier
        identifier: /[a-z]+/i
    });
722
723 or:
724
    $parser = Parse::RecDescent->new(q{
        <autoaction: { [@item] } >

        expression: and_expr '||' expression | and_expr
        and_expr:   not_expr '&&' and_expr   | not_expr
        not_expr:   '!' brack_expr           | brack_expr
        brack_expr: '(' expression ')'       | identifier
        identifier: /[a-z]+/i
    });
734
735 Either of these is equivalent to:
736
    $parser = Parse::RecDescent->new(q{
        expression: and_expr '||' expression
                        { [@item] }
                  | and_expr
                        { [@item] }

        and_expr:   not_expr '&&' and_expr
                        { [@item] }
                  | not_expr
                        { [@item] }

        not_expr:   '!' brack_expr
                        { [@item] }
                  | brack_expr
                        { [@item] }

        brack_expr: '(' expression ')'
                        { [@item] }
                  | identifier
                        { [@item] }

        identifier: /[a-z]+/i
                        { [@item] }
    });
761
Alternatively, we could take an object-oriented approach, using a
different class for each node (and also eliminating redundant
intermediate nodes):

    $::RD_AUTOACTION = q
      { $#item==1 ? $item[1] : "$item[0]_node"->new(@item[1..$#item]) };

    $parser = Parse::RecDescent->new(q{
        expression: and_expr '||' expression | and_expr
        and_expr:   not_expr '&&' and_expr   | not_expr
        not_expr:   '!' brack_expr           | brack_expr
        brack_expr: '(' expression ')'       | identifier
        identifier: /[a-z]+/i
    });
776
777 or:
778
    $parser = Parse::RecDescent->new(q{
        <autoaction:
          $#item==1 ? $item[1] : "$item[0]_node"->new(@item[1..$#item])
        >

        expression: and_expr '||' expression | and_expr
        and_expr:   not_expr '&&' and_expr   | not_expr
        not_expr:   '!' brack_expr           | brack_expr
        brack_expr: '(' expression ')'       | identifier
        identifier: /[a-z]+/i
    });
790
791 which are equivalent to:
792
    $parser = Parse::RecDescent->new(q{
        expression: and_expr '||' expression
            { "expression_node"->new(@item[1..3]) }
        | and_expr

        and_expr:   not_expr '&&' and_expr
            { "and_expr_node"->new(@item[1..3]) }
        | not_expr

        not_expr:   '!' brack_expr
            { "not_expr_node"->new(@item[1..2]) }
        | brack_expr

        brack_expr: '(' expression ')'
            { "brack_expr_node"->new(@item[1..3]) }
        | identifier

        identifier: /[a-z]+/i
            { "identifier_node"->new($item[1]) }
    });
813
814 Note that, if a production already ends in an action, no autoaction is
815 appended to it. For example, in this version:
816
    $::RD_AUTOACTION = q
      { $#item==1 ? $item[1] : "$item[0]_node"->new(@item[1..$#item]) };

    $parser = Parse::RecDescent->new(q{
        expression: and_expr '||' expression | and_expr
        and_expr:   not_expr '&&' and_expr   | not_expr
        not_expr:   '!' brack_expr           | brack_expr
        brack_expr: '(' expression ')'       | identifier
        identifier: /[a-z]+/i
            { 'terminal_node'->new($item[1]) }
    });
828
829 each "identifier" match produces a "terminal_node" object, not an
830 "identifier_node" object.
831
832 A level 1 warning is issued each time an "autoaction" is added to some
833 production.
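
The first autoaction example can be tried out as follows. This is a
sketch under the assumption that the script runs in package main; the
input string is invented, and "local" is used only so that the global
flag does not leak into other code.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Every production without a trailing action gets this one appended.
local $::RD_AUTOACTION = q { [@item] };

my $parser = Parse::RecDescent->new(q{
    expression: and_expr '||' expression | and_expr
    and_expr:   not_expr '&&' and_expr   | not_expr
    not_expr:   '!' brack_expr           | brack_expr
    brack_expr: '(' expression ')'       | identifier
    identifier: /[a-z]+/i
}) or die "Bad grammar!\n";

my $tree = $parser->expression("a && !b");
defined $tree or die "Bad text!\n";

# Each node is an array reference: the rule name followed by its item values.
sub show {
    my ($node, $depth) = @_;
    my $indent = '  ' x $depth;
    if (ref $node) {
        my ($rule, @children) = @$node;
        print "$indent$rule\n";
        show($_, $depth + 1) for @children;
    }
    else {
        print "$indent'$node'\n";
    }
}

show($tree, 0);
```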
834
835 Autotrees
836 A commonly needed autoaction is one that builds a parse-tree. It is
837 moderately tricky to set up such an action (which must treat terminals
838 differently from non-terminals), so Parse::RecDescent simplifies the
839 process by providing the "<autotree>" directive.
840
If this directive appears at the start of a grammar, it causes
Parse::RecDescent to insert autoactions at the end of every production
except those which already end in an action. The action inserted
depends on whether the production is an intermediate rule (typically
two or more items, or a lone subrule), or a terminal of the grammar
(i.e. a single pattern or string item).
846
847 So, for example, the following grammar:
848
849 <autotree>
850
851 file : command(s)
852 command : get | set | vet
853 get : 'get' ident ';'
854 set : 'set' ident 'to' value ';'
855 vet : 'check' ident 'is' value ';'
856 ident : /\w+/
857 value : /\d+/
858
859 is equivalent to:
860
861 file : command(s) { bless \%item, $item[0] }
862 command : get { bless \%item, $item[0] }
863 | set { bless \%item, $item[0] }
864 | vet { bless \%item, $item[0] }
865 get : 'get' ident ';' { bless \%item, $item[0] }
866 set : 'set' ident 'to' value ';' { bless \%item, $item[0] }
867 vet : 'check' ident 'is' value ';' { bless \%item, $item[0] }
868
869 ident : /\w+/ { bless {__VALUE__=>$item[1]}, $item[0] }
870 value : /\d+/ { bless {__VALUE__=>$item[1]}, $item[0] }
871
872 Note that each node in the tree is blessed into a class of the same
873 name as the rule itself. This makes it easy to build object-oriented
874 processors for the parse-trees that the grammar produces. Note too that
875 the last two rules produce special objects with the single attribute
876 '__VALUE__'. This is because they consist solely of a single terminal.
877
878 This autoaction-ed grammar would then produce a parse tree in a data
879 structure like this:
880
881 {
882 file => {
883 command => {
884 [ get => {
885 ident => { __VALUE__ => 'a' },
886 },
887 set => {
888 ident => { __VALUE__ => 'b' },
889 value => { __VALUE__ => '7' },
890 },
891 vet => {
892 ident => { __VALUE__ => 'b' },
893 value => { __VALUE__ => '7' },
894 },
895 ],
896 },
897 }
898 }
899
900 (except, of course, that each nested hash would also be blessed into
901 the appropriate class).
902
903 You can also specify a base class for the "<autotree>" directive. The
904 supplied prefix will be prepended to the rule names when creating tree
905 nodes. The following are equivalent:
906
907 <autotree:MyBase::Class>
908 <autotree:MyBase::Class::>
909
910 Either form produces a root node blessed into the "MyBase::Class::file"
911 package in the example above.
912
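The blessing behaviour described in this section can be checked with a
small sketch such as the following. The grammar and its rule names
("assign", "ident", "value") are hypothetical and chosen only to keep
the resulting tree small.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    <autotree>

    assign: ident '=' value ';'
    ident:  /[a-z]\w*/i
    value:  /\d+/
}) or die "Bad grammar";

my $tree = $parser->assign("x = 7;") or die "Parse failed";

# Each node is blessed into a class named after the rule that built it.
print "root class:  ", ref($tree), "\n";
print "ident class: ", ref($tree->{ident}), "\n";

# Rules consisting of a single terminal keep their text under __VALUE__.
print $tree->{ident}{__VALUE__}, " = ", $tree->{value}{__VALUE__}, "\n";
```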
913 Autostubbing
914 Normally, if a subrule appears in some production, but no rule of that
915 name is ever defined in the grammar, the production which refers to the
916 non-existent subrule fails immediately. This typically occurs as a
917 result of misspellings, and is a sufficiently common occurrence that a
918 warning is generated for such situations.
919
920 However, when prototyping a grammar it is sometimes useful to be able
921 to use subrules before a proper specification of them is really
922 possible. For example, a grammar might include a section like:
923
924 function_call: identifier '(' arg(s?) ')'
925
926 identifier: /[a-z]\w*/i
927
928 where the possible format of an argument is sufficiently complex that
929 it is not worth specifying in full until the general function call
930 syntax has been debugged. In this situation it is convenient to leave
931 the real rule "arg" undefined and just slip in a placeholder (or
932 "stub"):
933
934 arg: 'arg'
935
936 so that the function call syntax can be tested with dummy input such
937 as:
938
939 f0()
940 f1(arg)
941 f2(arg arg)
942 f3(arg arg arg)
943
944 et cetera.
945
946 Early in prototyping, many such "stubs" may be required, so
947 "Parse::RecDescent" provides a means of automating their definition.
948 If the variable $::RD_AUTOSTUB is defined when a parser is built, a
949 subrule reference to any non-existent rule (say, "subrule"), will cause
950 a "stub" rule to be automatically defined in the generated parser. If
951 "$::RD_AUTOSTUB eq '1'" or is false, a stub rule of the form:
952
953 subrule: 'subrule'
954
955 will be generated. The special case for a value of '1' allows the
956 use of perl -s with -RD_AUTOSTUB without generating "subrule: '1'",
957 as described below. If $::RD_AUTOSTUB is true, a stub rule of the form:
958
959 subrule: $::RD_AUTOSTUB
960
961 will be generated. $::RD_AUTOSTUB must contain a valid production
962 item; no checking is performed. No lazy evaluation of $::RD_AUTOSTUB
963 is performed; it is evaluated at the time the parser is generated.
964
965 Hence, with $::RD_AUTOSTUB defined, it is possible to only partially
966 specify a grammar, and then "fake" matches of the unspecified
967 (sub)rules by just typing in their name, or a literal value that was
968 assigned to $::RD_AUTOSTUB.
969
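A minimal sketch of autostubbing, reusing the "function_call" grammar
from this section, might look like the following. The test inputs are
illustrative assumptions; the value 1 is used so that the default stub
form ("arg: 'arg'") is generated.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Must be defined before the parser is built.
$::RD_AUTOSTUB = 1;

my $parser = Parse::RecDescent->new(q{
    function_call: identifier '(' arg(s?) ')'
    identifier:    /[a-z]\w*/i
}) or die "Bad grammar";

# Do not stub undefined rules in parsers built later.
undef $::RD_AUTOSTUB;

# The undefined rule "arg" is stubbed, so it matches only its own name.
for my $input ('f0()', 'f2(arg arg)', 'f1(42)') {
    my $ok = defined $parser->function_call($input);
    printf "%-12s %s\n", $input, ($ok ? 'matches' : 'does not match');
}
```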
970 Look-ahead
971 If a subrule, token, or action is prefixed by "...", then it is treated
972 as a "look-ahead" request. That means that the current production can
973 (as usual) only succeed if the specified item is matched, but that the
974 matching does not consume any of the text being parsed. This is very
975 similar to the "/(?=...)/" look-ahead construct in Perl patterns. Thus,
976 the rule:
977
978 inner_word: word ...word
979
980 will match whatever the subrule "word" matches, provided that match is
981 followed by some more text which subrule "word" would also match
982 (although this second substring is not actually consumed by
983 "inner_word").
984
985 Likewise, a "...!" prefix causes the following item to succeed
986 (without consuming any text) if and only if it would normally fail.
987 Hence, a rule such as:
988
989 identifier: ...!keyword ...!'_' /[A-Za-z_]\w*/
990
991 matches a string of characters which satisfies the pattern
992 "/[A-Za-z_]\w*/", but only if the same sequence of characters would not
993 match either subrule "keyword" or the literal token '_'.
994
995 Sequences of look-ahead prefixes accumulate, multiplying their positive
996 and/or negative senses. Hence:
997
998 inner_word: word ...!......!word
999
1000 is exactly equivalent to the original example above (a warning is
1001 issued in cases like these, since they often indicate something left
1002 out, or misunderstood).
1003
1004 Note that actions can also be treated as look-aheads. In such cases,
1005 the state of the parser text (in the local variable $text) after the
1006 look-ahead action is guaranteed to be identical to its state before the
1007 action, regardless of how it's changed within the action (unless you
1008 actually undefine $text, in which case you get the disaster you deserve
1009 :-).
1010
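The negative look-ahead described above can be tried with a sketch
along these lines. The "keyword" rule and the two test words are
assumptions made for the example; the "\b" in the keyword pattern is
there so that a keyword is only recognised as a whole word.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    identifier: ...!keyword /[A-Za-z_]\w*/
    keyword:    /(?:if|while)\b/
}) or die "Bad grammar";

for my $word (qw(while whilst)) {
    # The look-ahead consumes no text; it only vetoes the production
    # when the upcoming text would match "keyword".
    my $match = $parser->identifier($word);
    print "$word: ", (defined $match ? 'identifier' : 'not an identifier'), "\n";
}
```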
1011 Directives
1012 Directives are special pre-defined actions which may be used to alter
1013 the behaviour of the parser. There are currently twenty-two
1014 directives: "<commit>", "<uncommit>", "<reject>", "<score>",
1015 "<autoscore>", "<skip>", "<resync>", "<error>", "<warn>", "<hint>",
1016 "<trace_build>", "<trace_parse>", "<nocheck>", "<rulevar>",
1017 "<matchrule>", "<leftop>", "<rightop>", "<defer>",
1018 "<perl_quotelike>", "<perl_codeblock>", "<perl_variable>", and
1019 "<token>".
1020
1021 Committing and uncommitting
1022 The "<commit>" and "<uncommit>" directives permit the recursive
1023 descent of the parse tree to be pruned (or "cut") for efficiency.
1024 Within a rule, a "<commit>" directive instructs the rule to ignore
1025 subsequent productions if the current production fails. For
1026 example:
1027
1028 command: 'find' <commit> filename
1029 | 'open' <commit> filename
1030 | 'move' filename filename
1031
1032 Clearly, if the leading token 'find' is matched in the first
1033 production but that production fails for some other reason, then
1034 the remaining productions cannot possibly match. The presence of
1035 the "<commit>" causes the "command" rule to fail immediately if an
1036 invalid "find" command is found, and likewise if an invalid "open"
1037 command is encountered.
1038
1039 It is also possible to revoke a previous commitment. For example:
1040
1041 if_statement: 'if' <commit> condition
1042 'then' block <uncommit>
1043 'else' block
1044 | 'if' <commit> condition
1045 'then' block
1046
1047 In this case, a failure to find an "else" block in the first
1048 production shouldn't preclude trying the second production, but a
1049 failure to find a "condition" certainly should.
1050
1051 As a special case, any production in which the first item is an
1052 "<uncommit>" immediately revokes a preceding "<commit>" (even
1053 though the production would not otherwise have been tried). For
1054 example, in the rule:
1055
1056 request: 'explain' expression
1057 | 'explain' <commit> keyword
1058 | 'save'
1059 | 'quit'
1060 | <uncommit> term '?'
1061
1062 if the text being matched was "explain?", and the first two
1063 productions failed, then the "<commit>" in production two would
1064 cause productions three and four to be skipped, but the leading
1065 "<uncommit>" in production five would allow that production to
1066 attempt a match.
1067
1068 Note that, in the preceding example, the "<commit>" was only placed
1069 in production two. If production one had been:
1070
1071 request: 'explain' <commit> expression
1072
1073 then production two would be (inappropriately) skipped if a leading
1074 "explain..." was encountered.
1075
1076 Both "<commit>" and "<uncommit>" directives always succeed, and
1077 their value is always 1.
1078
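The pruning effect of "<commit>" can be observed with a small sketch
like the one below. The grammar is a hypothetical variant of the
"command" rule above, with a catch-all second production added so that
the difference is visible.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    command:  'find' <commit> filename   { "find $item[3]" }
           |  /\w+/                      { "other $item[1]" }
    filename: /[\w.]+/
}) or die "Bad grammar";

# A bare 'find' commits the rule and then fails for lack of a filename,
# so the catch-all second production is never tried for it.
for my $input ('find notes.txt', 'list', 'find') {
    my $result = $parser->command($input);
    printf "%-16s => %s\n", "'$input'", (defined $result ? $result : 'no match');
}
```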
1079 Rejecting a production
1080 The "<reject>" directive immediately causes the current production
1081 to fail (it is exactly equivalent to, but more obvious than, the
1082 action "{undef}"). A "<reject>" is useful when it is desirable to
1083 get the side effects of the actions in one production, without
1084 prejudicing a match by some other production later in the rule. For
1085 example, to insert tracing code into the parse:
1086
1087 complex_rule: { print "In complex rule...\n"; } <reject>
1088
1089 complex_rule: simple_rule '+' 'i' '*' simple_rule
1090 | 'i' '*' simple_rule
1091 | simple_rule
1092
1093 It is also possible to specify a conditional rejection, using the
1094 form "<reject:condition>", which only rejects if the specified
1095 condition is true. This form of rejection is exactly equivalent to
1096 the action "{(condition)?undef:1}". For example:
1097
1098 command: save_command
1099 | restore_command
1100 | <reject: defined $::tolerant> { exit }
1101 | <error: Unknown command. Ignored.>
1102
1103 A "<reject>" directive never succeeds (and hence has no associated
1104 value). A conditional rejection may succeed (if its condition is
1105 not satisfied), in which case its value is 1.
1106
1107 As an extra optimization, "Parse::RecDescent" ignores any
1108 production which begins with an unconditional "<reject>" directive,
1109 since any such production can never successfully match or have any
1110 useful side-effects. A level 1 warning is issued in all such cases.
1111
1112 Note that productions beginning with conditional "<reject:...>"
1113 directives are never "optimized away" in this manner, even if they
1114 are always guaranteed to fail (for example: "<reject:1>").
1115
1116 Due to the way grammars are parsed, there is a minor restriction on
1117 the condition of a conditional "<reject:...>": it cannot contain
1118 any raw '<' or '>' characters. For example:
1119
1120 line: cmd <reject: $thiscolumn > max> data
1121
1122 results in an error when a parser is built from this grammar (since
1123 the grammar parser has no way of knowing whether the first > is a
1124 "greater than" or the end of the "<reject:...>").
1125
1126 To overcome this problem, put the condition inside a do{} block:
1127
1128 line: cmd <reject: do{$thiscolumn > max}> data
1129
1130 Note that the same problem may occur in other directives that take
1131 arguments. The same solution will work in all cases.
1132
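A conditional rejection can be sketched as follows. The "divisor" rule
is a hypothetical example; its condition deliberately avoids raw '<'
and '>' characters, in line with the restriction just described.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    divisor: /\d+/ <reject: $item[1] == 0> { $item[1] }
}) or die "Bad grammar";

for my $input ('10', '0') {
    # The production fails only when the condition is true.
    my $value = $parser->divisor($input);
    print "$input: ", (defined $value ? "accepted" : "rejected"), "\n";
}
```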
1133 Skipping between terminals
1134 The "<skip>" directive enables the terminal prefix used in a
1135 production to be changed. For example:
1136
1137 OneLiner: Command <skip:'[ \t]*'> Arg(s) /;/
1138
1139 causes only blanks and tabs to be skipped before terminals in the
1140 "Arg" subrule (and any of its subrules), and also before the final
1141 "/;/" terminal. Once the production is complete, the previous
1142 terminal prefix is reinstated. Note that this implies that distinct
1143 productions of a rule must reset their terminal prefixes
1144 individually.
1145
1146 The "<skip>" directive evaluates to the previous terminal prefix,
1147 so it's easy to reinstate a prefix later in a production:
1148
1149 Command: <skip:","> CSV(s) <skip:$item[1]> Modifier
1150
1151 The value specified after the colon is interpolated into a pattern,
1152 so all of the following are equivalent (though their efficiency
1153 increases down the list):
1154
1155 <skip: "$colon|$comma"> # ASSUMING THE VARS HOLD THE OBVIOUS VALUES
1156
1157 <skip: ':|,'>
1158
1159 <skip: q{[:,]}>
1160
1161 <skip: qr/[:,]/>
1162
1163 There is no way of directly setting the prefix for an entire rule,
1164 except as follows:
1165
1166 Rule: <skip: '[ \t]*'> Prod1
1167 | <skip: '[ \t]*'> Prod2a Prod2b
1168 | <skip: '[ \t]*'> Prod3
1169
1170 or, better:
1171
1172 Rule: <skip: '[ \t]*'>
1173 (
1174 Prod1
1175 | Prod2a Prod2b
1176 | Prod3
1177 )
1178
1179 The skip pattern is passed down to subrules, so setting the skip
1180 for the top-level rule as described above actually sets the prefix
1181 for the entire grammar (provided that you only call the method
1182 corresponding to the top-level rule itself). Alternatively, or if
1183 you have more than one top-level rule in your grammar, you can
1184 provide a global "<skip>" directive prior to defining any rules in
1185 the grammar. These are the preferred alternatives to setting
1186 $Parse::RecDescent::skip.
1187
1188 Additionally, using "<skip>" actually allows you to have a
1189 completely dynamic skipping behaviour. For example:
1190
1191 Rule_with_dynamic_skip: <skip: $::skip_pattern> Rule
1192
1193 Then you can set $::skip_pattern before invoking
1194 "Rule_with_dynamic_skip" and have it skip whatever you specified.
1195
1196 Note: Up to release 1.51 of Parse::RecDescent, an entirely
1197 different mechanism was used for specifying terminal prefixes. The
1198 current method is not backwards-compatible with that early
1199 approach. The current approach is stable and will not change
1200 again.
1201
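The effect of changing the terminal prefix part-way through a
production can be seen in a sketch like this one. The "setting" grammar
and the sample inputs are assumptions for illustration only.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    setting: name <skip: '[ \t]*'> '=' value
    name:    /\w+/
    value:   /\w+/
}) or die "Bad grammar";

# After the <skip> directive only blanks and tabs are skipped, so a
# newline before the '=' is no longer silently ignored.
my $same_line  = $parser->setting("colour = red");
my $split_line = $parser->setting("colour\n= red");

print "same line:  ", (defined $same_line  ? 'matches' : 'does not match'), "\n";
print "split line: ", (defined $split_line ? 'matches' : 'does not match'), "\n";
```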
1202 Resynchronization
1203 The "<resync>" directive provides a visually distinctive means of
1204 consuming some of the text being parsed, usually to skip an
1205 erroneous input. In its simplest form "<resync>" simply consumes
1206 text up to and including the next newline ("\n") character,
1207 succeeding only if the newline is found, in which case it causes
1208 its surrounding rule to return zero on success.
1209
1210 In other words, a "<resync>" is exactly equivalent to the token
1211 "/[^\n]*\n/" followed by the action "{ $return = 0 }" (except that
1212 productions beginning with a "<resync>" are ignored when generating
1213 error messages). A typical use might be:
1214
1215 script : command(s)
1216
1217 command: save_command
1218 | restore_command
1219 | <resync> # TRY NEXT LINE, IF POSSIBLE
1220
1221 It is also possible to explicitly specify a resynchronization
1222 pattern, using the "<resync:pattern>" variant. This version
1223 succeeds only if the specified pattern matches (and consumes) the
1224 parsed text. In other words, "<resync:pattern>" is exactly
1225 equivalent to the token "/pattern/" (followed by a
1226 "{ $return = 0 }" action). For example, if commands were terminated
1227 by newlines or semi-colons:
1228
1229 command: save_command
1230 | restore_command
1231 | <resync:[^;\n]*[;\n]>
1232
1233 The value of a successfully matched "<resync>" directive (of either
1234 type) is the text that it consumed. Note, however, that since the
1235 directive also sets $return, a production consisting of a lone
1236 "<resync>" succeeds but returns the value zero (which a calling
1237 rule may find useful to distinguish between "true" matches and
1238 "tolerant" matches). Remember that returning a zero value
1239 indicates that the rule succeeded (since only an "undef" denotes
1240 failure within "Parse::RecDescent" parsers).
1241
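As a sketch of how a calling rule might tell "true" matches from
"tolerant" ones, consider the following. The "script" grammar, the
semi-colon terminator and the sample input are hypothetical.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    script:  command(s)
    command: 'save' /\w+/ ';'   { "saved $item[2]" }
           | <resync:[^;]*;>
}) or die "Bad grammar";

my $results = $parser->script("save a; bogus input; save b;")
    or die "Parse failed";

# Genuine matches return a true value; a command recovered via
# <resync> shows up as a (defined but false) zero.
for my $value (@$results) {
    print $value ? "ok:      $value\n" : "skipped one malformed command\n";
}
```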
1242 Error handling
1243 The "<error>" directive provides automatic or user-defined
1244 generation of error messages during a parse. In its simplest form
1245 "<error>" prepares an error message based on the mismatch between
1246 the last item expected and the text which caused it to fail. For
1247 example, given the rule:
1248
1249 McCoy: curse ',' name ", I'm a doctor, not a" a_profession '!'
1250 | pronoun 'dead,' name '!'
1251 | <error>
1252
1253 the following strings would produce the following messages:
1254
1255 "Amen, Jim!"
1256 ERROR (line 1): Invalid McCoy: Expected curse or pronoun
1257 not found
1258
1259 "Dammit, Jim, I'm a doctor!"
1260 ERROR (line 1): Invalid McCoy: Expected ", I'm a doctor, not a"
1261 but found ", I'm a doctor!" instead
1262
1263 "He's dead,\n"
1264 ERROR (line 2): Invalid McCoy: Expected name not found
1265
1266 "He's alive!"
1267 ERROR (line 1): Invalid McCoy: Expected 'dead,' but found
1268 "alive!" instead
1269
1270 "Dammit, Jim, I'm a doctor, not a pointy-eared Vulcan!"
1271 ERROR (line 1): Invalid McCoy: Expected a profession but found
1272 "pointy-eared Vulcan!" instead
1273
1274 Note that, when autogenerating error messages, all underscores in
1275 any rule name used in a message are replaced by single spaces (for
1276 example "a_production" becomes "a production"). Judicious choice of
1277 rule names can therefore considerably improve the readability of
1278 automatic error messages (as well as the maintainability of the
1279 original grammar).
1280
1281 If the automatically generated error is not sufficient, it is
1282 possible to provide an explicit message as part of the error
1283 directive. For example:
1284
1285 Spock: "Fascinating" ',' (name | 'Captain') '.'
1286 | "Highly illogical, doctor."
1287 | <error: He never said that!>
1288
1289 which would result in all failures to parse a "Spock" subrule
1290 printing the following message:
1291
1292 ERROR (line <N>): Invalid Spock: He never said that!
1293
1294 The error message is treated as a "qq{...}" string and interpolated
1295 when the error is generated (not when the directive is specified!).
1296 Hence:
1297
1298 <error: Mystical error near "$text">
1299
1300 would correctly insert the ambient text string which caused the
1301 error.
1302
1303 There are two other forms of error directive: "<error?>" and
1304 "<error?: msg>". These behave just like "<error>" and
1305 "<error: msg>" respectively, except that they are only triggered if
1306 the rule is "committed" at the time they are encountered. For
1307 example:
1308
1309 Scotty: "Ya kenna change the Laws o' Phusics," <commit> name
1310 | name <commit> ',' "she's goanta blaw!"
1311 | <error?>
1312
1313 will only generate an error for a string beginning with "Ya kenna
1314 change the Laws o' Phusics," or a valid name, but which still fails
1315 to match the corresponding production. That is,
1316 "$parser->Scotty("Aye, Cap'ain")" will fail silently (since neither
1317 production will "commit" the rule on that input), whereas
1318 "$parser->Scotty("Mr Spock, ah jest kenna do'ut!")" will fail with
1319 the error message:
1320
1321 ERROR (line 1): Invalid Scotty: expected 'she's goanta blaw!'
1322 but found 'ah jest kenna do'ut!' instead.
1323
1324 since in that case the second production would commit after
1325 matching the leading name.
1326
1327 Note that to allow this behaviour, all "<error>" directives which
1328 are the first item in a production automatically uncommit the rule
1329 just long enough to allow their production to be attempted (that
1330 is, when their production fails, the commitment is reinstated so
1331 that subsequent productions are skipped).
1332
1333 In order to permanently uncommit the rule before an error message,
1334 it is necessary to put an explicit "<uncommit>" before the
1335 "<error>". For example:
1336
1337 line: 'Kirk:' <commit> Kirk
1338 | 'Spock:' <commit> Spock
1339 | 'McCoy:' <commit> McCoy
1340 | <uncommit> <error?> <reject>
1341 | <resync>
1342
1343 Error messages generated by the various "<error...>" directives are
1344 not displayed immediately. Instead, they are "queued" in a buffer
1345 and are only displayed once parsing ultimately fails. Moreover,
1346 "<error...>" directives that cause one production of a rule to fail
1347 are automatically removed from the message queue if another
1348 production subsequently causes the entire rule to succeed. This
1349 means that you can put "<error...>" directives wherever useful
1350 diagnosis can be done, and only those associated with actual parser
1351 failure will ever be displayed. Also see "GOTCHAS".
1352
1353 As a general rule, the most useful diagnostics are usually
1354 generated either at the very lowest level within the grammar, or at
1355 the very highest. A good rule of thumb is to identify those
1356 subrules which consist mainly (or entirely) of terminals, and then
1357 put an "<error...>" directive at the end of any other rule which
1358 calls one or more of those subrules.
1359
1360 There is one other situation in which the output of the various
1361 types of error directive is suppressed; namely, when the rule
1362 containing them is being parsed as part of a "look-ahead" (see
1363 "Look-ahead"). In this case, the error directive will still cause
1364 the rule to fail, but will do so silently.
1365
1366 An unconditional "<error>" directive always fails (and hence has no
1367 associated value). This means that encountering such a directive
1368 always causes the production containing it to fail. Hence an
1369 "<error>" directive will inevitably be the last (useful) item of a
1370 rule (a level 3 warning is issued if a production contains items
1371 after an unconditional "<error>" directive).
1372
1373 An "<error?>" directive will succeed (that is: fail to fail :-), if
1374 the current rule is uncommitted when the directive is encountered.
1375 In that case the directive's associated value is zero. Hence, this
1376 type of error directive can be used before the end of a production.
1377 For example:
1378
1379 command: 'do' <commit> something
1380 | 'report' <commit> something
1381 | <error?: Syntax error> <error: Unknown command>
1382
1383 Warning: The "<error?>" directive does not mean "always fail (but
1384 do so silently unless committed)". It actually means "only fail
1385 (and report) if committed, otherwise succeed". To achieve the "fail
1386 silently if uncommitted" semantics, it is necessary to use:
1387
1388 rule: item <commit> item(s)
1389 | <error?> <reject> # FAIL SILENTLY UNLESS COMMITTED
1390
1391 However, because people seem to expect a lone "<error?>" directive
1392 to work like this:
1393
1394 rule: item <commit> item(s)
1395 | <error?: Error message if committed>
1396 | <error: Error message if uncommitted>
1397
1398 Parse::RecDescent automatically appends a "<reject>" directive if
1399 the "<error?>" directive is the only item in a production. A level
1400 2 warning (see below) is issued when this happens.
1401
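The interplay of "<error?>" and "<error>" can be exercised with a
sketch based on the "command" rule shown earlier in this section. The
"/\w+/" terminal stands in for the "something" subrule and, like the
sample inputs, is an assumption. Any queued messages go to STDERR when
a parse fails.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    command: 'do' <commit> /\w+/
           | 'report' <commit> /\w+/
           | <error?: Syntax error> <error: Unknown command>
}) or die "Bad grammar";

# 'do' alone fails after committing, so the <error?> message applies;
# 'jump' never commits, so only the unconditional <error> applies.
for my $input ('do backup', 'do', 'jump') {
    my $ok = defined $parser->command($input);
    print "'$input': ", ($ok ? 'parsed' : 'failed'), "\n";
}
```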
1402 The level of error reporting during both parser construction and
1403 parsing is controlled by the presence or absence of four global
1404 variables: $::RD_ERRORS, $::RD_WARN, $::RD_HINT, and $::RD_TRACE.
1405 If $::RD_ERRORS is defined (and, by default, it is) then fatal
1406 errors are reported.
1407
1408 Whenever $::RD_WARN is defined, certain non-fatal problems are also
1409 reported.
1410
1411 Warnings have an associated "level": 1, 2, or 3. The higher the
1412 level, the more serious the warning. The value of the corresponding
1413 global variable ($::RD_WARN) determines the lowest level of warning
1414 to be displayed. Hence, to see all warnings, set $::RD_WARN to 1.
1415 To see only the most serious warnings set $::RD_WARN to 3. By
1416 default $::RD_WARN is initialized to 3, ensuring that serious but
1417 non-fatal errors are automatically reported.
1418
1419 There is also a grammar directive to turn on warnings from within
1420 the grammar: "<warn>". It takes an optional argument, which
1421 specifies the warning level: "<warn: 2>".
1422
1423 See "DIAGNOSTICS" for a list of the various error and warning
1424 messages that Parse::RecDescent generates when these two variables
1425 are defined.
1426
1427 Defining any of the remaining variables (which are not defined by
1428 default) further increases the amount of information reported.
1429 Defining $::RD_HINT causes the parser generator to offer more
1430 detailed analyses and hints on both errors and warnings. Note that
1431 setting $::RD_HINT at any point automagically sets $::RD_WARN to 1.
1432 There is also a "<hint>" directive, which can be hard-coded into a
1433 grammar.
1434
1435 Defining $::RD_TRACE causes the parser generator and the parser to
1436 report their progress to STDERR in excruciating detail (although,
1437 without hints unless $::RD_HINT is separately defined). This detail
1438 can be moderated in only one respect: if $::RD_TRACE has an integer
1439 value (N) greater than 1, only N characters of the "current
1440 parsing context" (that is, where in the input string we are at any
1441 point in the parse) are reported at any time.
1442
1443 $::RD_TRACE is mainly useful for debugging a grammar that isn't
1444 behaving as you expected it to. To this end, if $::RD_TRACE is
1445 defined when a parser is built, any actual parser code which is
1446 generated is also written to a file named "RD_TRACE" in the local
1447 directory.
1448
1449 There are two directives associated with the $::RD_TRACE variable.
1450 If a grammar contains a "<trace_build>" directive anywhere in its
1451 specification, $::RD_TRACE is turned on during the parser
1452 construction phase. If a grammar contains a "<trace_parse>"
1453 directive anywhere in its specification, $::RD_TRACE is turned on
1454 during any parse the parser performs.
1455
1456 Note that the four variables belong to the "main" package, which
1457 makes them easier to refer to in the code controlling the parser,
1458 and also makes it easy to turn them into command line flags
1459 ("-RD_ERRORS", "-RD_WARN", "-RD_HINT", "-RD_TRACE") under perl -s.
1460
1461 The corresponding directives are useful to "hardwire" the various
1462 debugging features into a particular grammar (rather than having to
1463 set and reset external variables).
1464
1465 Redirecting diagnostics
1466 The diagnostics provided by the tracing mechanism always go to
1467 STDERR. If you need them to go elsewhere, localize and reopen
1468 STDERR prior to the parse.
1469
1470 For example (this requires a prior "use IO::File;"):
1471
1472 {
1473 local *STDERR = IO::File->new(">$filename") or die $!;
1474
1475 my $result = $parser->startrule($text);
1476 }
1477
1478 Consistency checks
1479 Whenever a parser is built, Parse::RecDescent carries out a number
1480 of (potentially expensive) consistency checks. These include:
1481 verifying that the grammar is not left-recursive and that no rules
1482 have been left undefined.
1483
1484 These checks are important safeguards during development, but
1485 unnecessary overheads when the grammar is stable and ready to be
1486 deployed. So Parse::RecDescent provides a directive to disable
1487 them: "<nocheck>".
1488
1489 If a grammar contains a "<nocheck>" directive anywhere in its
1490 specification, the extra compile-time checks are by-passed.
1491
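A minimal sketch of a grammar that opts out of the consistency checks
is shown below. The "greeting" grammar is hypothetical, and the
directive is placed before the first rule here as a conservative
choice.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    <nocheck>

    greeting: 'hello' name
    name:     /\w+/
}) or die "Bad grammar";

# The grammar behaves as usual; only the build-time checks are skipped.
my $name = $parser->greeting("hello world");
print defined $name ? "parsed, name is $name\n" : "failed\n";
```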
1492 Specifying local variables
1493 It is occasionally convenient to specify variables which are local
1494 to a single rule. This may be achieved by including a
1495 "<rulevar:...>" directive anywhere in the rule. For example:
1496
1497 markup: <rulevar: $tag>
1498
1499 markup: tag {($tag=$item[1]) =~ s/^<|>$//g} body[$tag]
1500
1501 The example "<rulevar: $tag>" directive causes a "my" variable
1502 named $tag to be declared at the start of the subroutine
1503 implementing the "markup" rule (that is, before the first
1504 production, regardless of where in the rule it is specified).
1505
1506 Specifically, any directive of the form: "<rulevar:text>" causes a
1507 line of the form "my text;" to be added at the beginning of the
1508 rule subroutine, immediately after the definitions of the following
1509 local variables:
1510
1511 $thisparser $commit
1512 $thisrule @item
1513 $thisline @arg
1514 $text %arg
1515
1516 This means that the following "<rulevar>" directives work as
1517 expected:
1518
1519 <rulevar: $count = 0 >
1520
1521 <rulevar: $firstarg = $arg[0] || '' >
1522
1523 <rulevar: $myItems = \@item >
1524
1525 <rulevar: @context = ( $thisline, $text, @arg ) >
1526
1527 <rulevar: ($name,$age) = @arg{"name","age"} >
1528
1529 If a variable that is also visible to subrules is required, it
1530 needs to be "local"'d, not "my"'d. "rulevar" defaults to "my", but
1531 if "local" is explicitly specified:
1532
1533 <rulevar: local $count = 0 >
1534
1535 then a "local"-ized variable is declared instead, and will be
1536 available within subrules.
1537
1538 Note however that, because all such variables are "my" variables,
1539 their values do not persist between match attempts on a given rule.
1540 To preserve values between match attempts, values can be stored
1541 within the "local" member of the $thisrule object:
1542
1543 countedrule: { $thisrule->{"local"}{"count"}++ }
1544 <reject>
1545 | subrule1
1546 | subrule2
1547 | <reject: $thisrule->{"local"}{"count"} == 1>
1548 subrule3
1549
1550 When matching a rule, each "<rulevar>" directive is matched as if
1551 it were an unconditional "<reject>" directive (that is, it causes
1552 any production in which it appears to immediately fail to match).
1553 For this reason (and to improve readability) it is usual to specify
1554 any "<rulevar>" directive in a separate production at the start of
1555 the rule (this has the added advantage that it enables
1556 "Parse::RecDescent" to optimize away such productions, just as it
1557 does for the "<reject>" directive).
1558
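The difference between "my" and "local" rule variables matters as soon
as a subrule needs to see the variable. The following sketch uses a
"local" counter that a subrule increments; the "tally" and "entry"
rules are assumptions made for this example.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    tally: <rulevar: local $count = 0>
    tally: entry(s) { $count }
    entry: /\w+/ { ++$count }
}) or die "Bad grammar";

# The counter is re-initialised on every call to "tally", and the
# "entry" subrule can see it because it is local-ized rather than my-ed.
my $first  = $parser->tally("a b c");
my $second = $parser->tally("x y");
print "first call counted $first entries, second call counted $second\n";
```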
1559 Dynamically matched rules
1560 Because regexes and double-quoted strings are interpolated, it is
1561 relatively easy to specify productions with "context sensitive"
1562 tokens. For example:
1563
1564 command: keyword body "end $item[1]"
1565
1566 which ensures that a command block is bounded by a "<keyword>...end
1567 <same keyword>" pair.
1568
1569 Building productions in which subrules are context sensitive is
1570 also possible, via the "<matchrule:...>" directive. This directive
1571 behaves identically to a subrule item, except that the rule which
1572 is invoked to match it is determined by the string specified after
1573 the colon. For example, we could rewrite the "command" rule like
1574 this:
1575
1576 command: keyword <matchrule:body> "end $item[1]"
1577
1578 Whatever appears after the colon in the directive is treated as an
1579 interpolated string (that is, as if it appeared in a "qq{...}"
1580 operator) and the value of that interpolated string is the name of
1581 the subrule to be matched.
1582
1583 Of course, just putting a constant string like "body" in a
1584 "<matchrule:...>" directive is of little interest or benefit. The
1585 power of the directive is seen when we use a string that interpolates
1586 to something interesting. For example:
1587
1588 command: keyword <matchrule:$item[1]_body> "end $item[1]"
1589
1590 keyword: 'while' | 'if' | 'function'
1591
1592 while_body: condition block
1593
1594 if_body: condition block ('else' block)(?)
1595
1596 function_body: arglist block
1597
1598 Now the "command" rule selects how to proceed on the basis of the
1599 keyword that is found. It is as if "command" were declared:
1600
1601 command: 'while' while_body "end while"
1602 | 'if' if_body "end if"
1603 | 'function' function_body "end function"
1604
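A cut-down, runnable variant of this grammar might look like the
following sketch. The two body rules are deliberately trivial and, like
the 'print' keyword and the sample inputs, are assumptions for
illustration.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    command:    keyword <matchrule:$item[1]_body> "end $item[1]"
    keyword:    'while' | 'print'
    while_body: /\d+/
    print_body: /[a-z]+/
}) or die "Bad grammar";

# The keyword that was matched selects which body rule is tried next.
for my $input ('while 10 end while',
               'print hello end print',
               'while hello end while') {
    my $ok = defined $parser->command($input);
    printf "%-24s %s\n", $input, ($ok ? 'matches' : 'does not match');
}
```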
1605 When a "<matchrule:...>" directive is used as a repeated subrule,
1606 the rule name expression is "late-bound". That is, the name of the
1607 rule to be called is re-evaluated each time a match attempt is
1608 made. Hence, the following grammar:
1609
1610 { $::species = 'dogs' }
1611
1612 pair: 'two' <matchrule:$::species>(s)
1613
1614 dogs: /dogs/ { $::species = 'cats' }
1615
1616 cats: /cats/
1617
1618 will match the string "two dogs cats cats" completely, whereas it
1619 will only match the string "two dogs dogs dogs" up to the eighth
1620 letter. If the rule name were "early bound" (that is, evaluated
1621 only the first time the directive is encountered in a production),
1622 the reverse behaviour would be expected.
1623
1624 Note that the "matchrule" directive takes a string that is to be
1625 treated as a rule name, not as a rule invocation. That is, it's
1626 like a Perl symbolic reference, not an "eval". Just as you can say:
1627
1628 $subname = 'foo';
1629
1630 # and later...
1631
&{$subname}(@args);
1633
1634 but not:
1635
1636 $subname = 'foo(@args)';
1637
1638 # and later...
1639
&{$subname};
1641
1642 likewise you can say:
1643
1644 $rulename = 'foo';
1645
1646 # and in the grammar...
1647
1648 <matchrule:$rulename>[@args]
1649
1650 but not:
1651
1652 $rulename = 'foo[@args]';
1653
1654 # and in the grammar...
1655
1656 <matchrule:$rulename>
1657
1658 Deferred actions
1659 The "<defer:...>" directive is used to specify an action to be
1660 performed when (and only if!) the current production ultimately
1661 succeeds.
1662
1663 Whenever a "<defer:...>" directive appears, the code it specifies
1664 is converted to a closure (an anonymous subroutine reference) which
1665 is queued within the active parser object. Note that, because the
deferred code is converted to a closure, the values of any "local"
variables (such as $text, @item, etc.) are preserved until the
1668 deferred code is actually executed.
1669
1670 If the parse ultimately succeeds and the production in which the
1671 "<defer:...>" directive was evaluated formed part of the successful
1672 parse, then the deferred code is executed immediately before the
1673 parse returns. If however the production which queued a deferred
1674 action fails, or one of the higher-level rules which called that
1675 production fails, then the deferred action is removed from the
1676 queue, and hence is never executed.
1677
1678 For example, given the grammar:
1679
1680 sentence: noun trans noun
1681 | noun intrans
1682
1683 noun: 'the dog'
1684 { print "$item[1]\t(noun)\n" }
1685 | 'the meat'
1686 { print "$item[1]\t(noun)\n" }
1687
1688 trans: 'ate'
1689 { print "$item[1]\t(transitive)\n" }
1690
1691 intrans: 'ate'
1692 { print "$item[1]\t(intransitive)\n" }
1693 | 'barked'
1694 { print "$item[1]\t(intransitive)\n" }
1695
1696 then parsing the sentence "the dog ate" would produce the output:
1697
1698 the dog (noun)
1699 ate (transitive)
1700 the dog (noun)
1701 ate (intransitive)
1702
1703 This is because, even though the first production of "sentence"
1704 ultimately fails, its initial subrules "noun" and "trans" do match,
1705 and hence they execute their associated actions. Then the second
1706 production of "sentence" succeeds, causing the actions of the
1707 subrules "noun" and "intrans" to be executed as well.
1708
1709 On the other hand, if the actions were replaced by "<defer:...>"
1710 directives:
1711
1712 sentence: noun trans noun
1713 | noun intrans
1714
1715 noun: 'the dog'
1716 <defer: print "$item[1]\t(noun)\n" >
1717 | 'the meat'
1718 <defer: print "$item[1]\t(noun)\n" >
1719
1720 trans: 'ate'
1721 <defer: print "$item[1]\t(transitive)\n" >
1722
1723 intrans: 'ate'
1724 <defer: print "$item[1]\t(intransitive)\n" >
1725 | 'barked'
1726 <defer: print "$item[1]\t(intransitive)\n" >
1727
1728 the output would be:
1729
1730 the dog (noun)
1731 ate (intransitive)
1732
1733 since deferred actions are only executed if they were evaluated in
1734 a production which ultimately contributes to the successful parse.
1735
1736 In this case, even though the first production of "sentence" caused
1737 the subrules "noun" and "trans" to match, that production
1738 ultimately failed and so the deferred actions queued by those
subrules were subsequently discarded. The second production then
1740 succeeded, causing the entire parse to succeed, and so the deferred
1741 actions queued by the (second) match of the "noun" subrule and the
1742 subsequent match of "intrans" are preserved and eventually
1743 executed.
1744
1745 Deferred actions provide a means of improving the performance of a
1746 parser, by only executing those actions which are part of the final
1747 parse-tree for the input data.
1748
1749 Alternatively, deferred actions can be viewed as a mechanism for
1750 building (and executing) a customized subroutine corresponding to
1751 the given input data, much in the same way that autoactions (see
1752 "Autoactions") can be used to build a customized data structure for
1753 specific input.
1754
1755 Whether or not the action it specifies is ever executed, a
1756 "<defer:...>" directive always succeeds, returning the number of
1757 deferred actions currently queued at that point.
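
The difference between immediate and deferred actions can be checked
with a small script. The sketch below reuses the "sentence" grammar
from this section, but (as an assumption made purely so that the result
can be inspected) each action pushes onto the package array @::log
instead of printing. The helper "logged_by" is hypothetical.

```perl
use strict;
use warnings;
use Parse::RecDescent;

our @log;    # grammar code appends to this via its full name, @::log

# The grammar from the text, with each print replaced by a push onto @::log.
my $deferred = q{
    sentence: noun trans noun
            | noun intrans

    noun:     'the dog'   <defer: push @::log, "$item[1] (noun)" >
            | 'the meat'  <defer: push @::log, "$item[1] (noun)" >

    trans:    'ate'       <defer: push @::log, "$item[1] (transitive)" >

    intrans:  'ate'       <defer: push @::log, "$item[1] (intransitive)" >
            | 'barked'    <defer: push @::log, "$item[1] (intransitive)" >
};

# Same grammar with ordinary actions: <defer: CODE > becomes { CODE }.
(my $immediate = $deferred) =~ s/<defer:(.*?)>/{$1}/g;

# Parse the sample sentence and return whatever the grammar logged.
sub logged_by {
    my ($grammar) = @_;
    @log = ();
    my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar";
    defined $parser->sentence("the dog ate") or die "Parse failed";
    return [@log];
}

my $immediate_log = logged_by($immediate);
my $deferred_log  = logged_by($deferred);

print "immediate: $_\n" for @$immediate_log;
print "deferred:  $_\n" for @$deferred_log;
```

With immediate actions the log should hold four entries, including the
"transitive" reading that was later abandoned; with deferred actions
only the two entries from the successful production should remain.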
1758
1759 Parsing Perl
1760 Parse::RecDescent provides limited support for parsing subsets of
1761 Perl, namely: quote-like operators, Perl variables, and complete
1762 code blocks.
1763
1764 The "<perl_quotelike>" directive can be used to parse any Perl
1765 quote-like operator: 'a string', "m/a pattern/", "tr{ans}{lation}",
etc. It does this by calling Text::Balanced::extract_quotelike().
1767
1768 If a quote-like operator is found, a reference to an array of eight
1769 elements is returned. Those elements are identical to the last
1770 eight elements returned by Text::Balanced::extract_quotelike() in
1771 an array context, namely:
1772
1773 [0] the name of the quotelike operator -- 'q', 'qq', 'm', 's', 'tr'
1774 -- if the operator was named; otherwise "undef",
1775
1776 [1] the left delimiter of the first block of the operation,
1777
1778 [2] the text of the first block of the operation (that is, the
1779 contents of a quote, the regex of a match, or substitution or
1780 the target list of a translation),
1781
1782 [3] the right delimiter of the first block of the operation,
1783
1784 [4] the left delimiter of the second block of the operation if
1785 there is one (that is, if it is a "s", "tr", or "y"); otherwise
1786 "undef",
1787
1788 [5] the text of the second block of the operation if there is one
1789 (that is, the replacement of a substitution or the translation
1790 list of a translation); otherwise "undef",
1791
1792 [6] the right delimiter of the second block of the operation (if
1793 any); otherwise "undef",
1794
1795 [7] the trailing modifiers on the operation (if any); otherwise
1796 "undef".
1797
1798 If a quote-like expression is not found, the directive fails with
1799 the usual "undef" value.
1800
1801 The "<perl_variable>" directive can be used to parse any Perl
1802 variable: $scalar, @array, %hash, $ref->{field}[$index], etc. It
1803 does this by calling Text::Balanced::extract_variable().
1804
1805 If the directive matches text representing a valid Perl variable
1806 specification, it returns that text. Otherwise it fails with the
1807 usual "undef" value.
1808
The "<perl_codeblock>" directive can be used to parse a
curly-brace-delimited block of Perl code, such as: { $a = 1; f() =~ m/pat/; }.
1811 It does this by calling Text::Balanced::extract_codeblock().
1812
1813 If the directive matches text representing a valid Perl code block,
1814 it returns that text. Otherwise it fails with the usual "undef"
1815 value.
1816
1817 You can also tell it what kind of brackets to use as the outermost
1818 delimiters. For example:
1819
1820 arglist: <perl_codeblock ()>
1821
1822 causes an arglist to match a perl code block whose outermost
1823 delimiters are "(...)" (rather than the default "{...}").
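
As a rough illustration of the values these directives return, the
following sketch wraps two of them in rules of their own. The rule
names "quoted" and "block" are hypothetical, and the sample inputs were
chosen so that every element of the quote-like result is defined.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    quoted: <perl_quotelike>
    block:  <perl_codeblock>
}) or die "Bad grammar";

# A substitution has two blocks and trailing modifiers, so all eight
# slots of the returned array are expected to be filled in.
my $parts = $parser->quoted('s{foo}{bar}gi');
my ($name, $ldelim, $first, $rdelim, $ldelim2, $second, $rdelim2, $mods)
    = @$parts;

print "operator:    $name\n";
print "first block: $ldelim$first$rdelim\n";
print "second:      $ldelim2$second$rdelim2\n";
print "modifiers:   $mods\n";

# A code block is returned as the matched text itself.
my $code = $parser->block('{ $x = 1; }');
print "code block:  $code\n";
```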
1824
1825 Constructing tokens
1826 Eventually, Parse::RecDescent will be able to parse tokenized
1827 input, as well as ordinary strings. In preparation for this joyous
1828 day, the "<token:...>" directive has been provided. This directive
1829 creates a token which will be suitable for input to a
1830 Parse::RecDescent parser (when it eventually supports tokenized
1831 input).
1832
1833 The text of the token is the value of the immediately preceding
1834 item in the production. A "<token:...>" directive always succeeds
1835 with a return value which is the hash reference that is the new
1836 token. It also sets the return value for the production to that
1837 hash ref.
1838
1839 The "<token:...>" directive makes it easy to build a
1840 Parse::RecDescent-compatible lexer in Parse::RecDescent:
1841
1842 my $lexer = new Parse::RecDescent q
1843 {
1844 lex: token(s)
1845
1846 token: /a\b/ <token:INDEF>
1847 | /the\b/ <token:DEF>
1848 | /fly\b/ <token:NOUN,VERB>
1849 | /[a-z]+/i { lc $item[1] } <token:ALPHA>
1850 | <error: Unknown token>
1851
1852 };
1853
1854 which will eventually be able to be used with a regular
1855 Parse::RecDescent grammar:
1856
1857 my $parser = new Parse::RecDescent q
1858 {
startrule: subrule1 subrule2
1860
1861 # ETC...
1862 };
1863
1864 either with a pre-lexing phase:
1865
1866 $parser->startrule( $lexer->lex($data) );
1867
1868 or with a lex-on-demand approach:
1869
1870 $parser->startrule( sub{$lexer->token(\$data)} );
1871
1872 But at present, only the "<token:...>" directive is actually
1873 implemented. The rest is vapourware.
1874
1875 Specifying operations
1876 One of the commonest requirements when building a parser is to
1877 specify binary operators. Unfortunately, in a normal grammar, the
1878 rules for such things are awkward:
1879
1880 disjunction: conjunction ('or' conjunction)(s?)
1881 { $return = [ $item[1], @{$item[2]} ] }
1882
1883 conjunction: atom ('and' atom)(s?)
1884 { $return = [ $item[1], @{$item[2]} ] }
1885
1886 or inefficient:
1887
disjunction: conjunction 'or' disjunction
{ $return = [ $item[1], @{$item[3]} ] }
| conjunction
{ $return = [ $item[1] ] }

conjunction: atom 'and' conjunction
{ $return = [ $item[1], @{$item[3]} ] }
| atom
{ $return = [ $item[1] ] }
1897
1898 and either way is ugly and hard to get right.
1899
1900 The "<leftop:...>" and "<rightop:...>" directives provide an easier
1901 way of specifying such operations. Using "<leftop:...>" the above
1902 examples become:
1903
1904 disjunction: <leftop: conjunction 'or' conjunction>
1905 conjunction: <leftop: atom 'and' atom>
1906
1907 The "<leftop:...>" directive specifies a left-associative binary
1908 operator. It is specified around three other grammar elements
1909 (typically subrules or terminals), which match the left operand,
1910 the operator itself, and the right operand respectively.
1911
1912 A "<leftop:...>" directive such as:
1913
1914 disjunction: <leftop: conjunction 'or' conjunction>
1915
1916 is converted to the following:
1917
1918 disjunction: ( conjunction ('or' conjunction)(s?)
1919 { $return = [ $item[1], @{$item[2]} ] } )
1920
1921 In other words, a "<leftop:...>" directive matches the left operand
1922 followed by zero or more repetitions of both the operator and the
1923 right operand. It then flattens the matched items into an anonymous
1924 array which becomes the (single) value of the entire "<leftop:...>"
1925 directive.
1926
For example, a "<leftop:...>" directive such as:
1928
1929 output: <leftop: ident '<<' expr >
1930
1931 when given a string such as:
1932
1933 cout << var << "str" << 3
1934
1935 would match, and $item[1] would be set to:
1936
1937 [ 'cout', 'var', '"str"', '3' ]
1938
1939 In other words:
1940
1941 output: <leftop: ident '<<' expr >
1942
1943 is equivalent to a left-associative operator:
1944
1945 output: ident { $return = [$item[1]] }
1946 | ident '<<' expr { $return = [@item[1,3]] }
1947 | ident '<<' expr '<<' expr { $return = [@item[1,3,5]] }
1948 | ident '<<' expr '<<' expr '<<' expr { $return = [@item[1,3,5,7]] }
1949 # ...etc...
1950
1951 Similarly, the "<rightop:...>" directive takes a left operand, an
1952 operator, and a right operand:
1953
1954 assign: <rightop: var '=' expr >
1955
1956 and converts them to:
1957
1958 assign: ( (var '=' {$return=$item[1]})(s?) expr
1959 { $return = [ @{$item[1]}, $item[2] ] } )
1960
1961 which is equivalent to a right-associative operator:
1962
1963 assign: expr { $return = [$item[1]] }
1964 | var '=' expr { $return = [@item[1,3]] }
1965 | var '=' var '=' expr { $return = [@item[1,3,5]] }
1966 | var '=' var '=' var '=' expr { $return = [@item[1,3,5,7]] }
1967 # ...etc...
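
A chained assignment gives a quick way to see the flattened list that
"<rightop:...>" produces. In this sketch the "var" and "expr" rules are
deliberately minimal assumptions (an identifier and an integer), and
the hash %env is just a hypothetical way of consuming the result.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    assign: <rightop: var '=' expr>
    var:    /[a-z]\w*/
    expr:   /\d+/
}) or die "Bad grammar";

# The directive returns the operands only: every var, then the final expr.
my $operands = $parser->assign("a = b = 42");
print "operands: @$operands\n";

# Right-associativity means the rightmost value flows to every variable.
my @vars  = @$operands;
my $value = pop @vars;
my %env   = map { $_ => $value } @vars;
print "$_ = $env{$_}\n" for sort keys %env;
```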
1968
1969 Note that for both the "<leftop:...>" and "<rightop:...>"
1970 directives, the directive does not normally return the operator
1971 itself, just a list of the operands involved. This is particularly
1972 handy for specifying lists:
1973
1974 list: '(' <leftop: list_item ',' list_item> ')'
1975 { $return = $item[2] }
1976
1977 There is, however, a problem: sometimes the operator is itself
1978 significant. For example, in a Perl list a comma and a "=>" are
1979 both valid separators, but the "=>" has additional stringification
1980 semantics. Hence it's important to know which was used in each
1981 case.
1982
1983 To solve this problem the "<leftop:...>" and "<rightop:...>"
1984 directives do return the operator(s) as well, under two
1985 circumstances. The first case is where the operator is specified
1986 as a subrule. In that instance, whatever the operator matches is
1987 returned (on the assumption that if the operator is important
1988 enough to have its own subrule, then it's important enough to
1989 return).
1990
1991 The second case is where the operator is specified as a regular
1992 expression. In that case, if the first bracketed subpattern of the
1993 regular expression matches, that matching value is returned (this
1994 is analogous to the behaviour of the Perl "split" function, except
1995 that only the first subpattern is returned).
1996
1997 In other words, given the input:
1998
1999 ( a=>1, b=>2 )
2000
2001 the specifications:
2002
2003 list: '(' <leftop: list_item separator list_item> ')'
2004
2005 separator: ',' | '=>'
2006
2007 or:
2008
2009 list: '(' <leftop: list_item /(,|=>)/ list_item> ')'
2010
2011 cause the list separators to be interleaved with the operands in
2012 the anonymous array in $item[2]:
2013
2014 [ 'a', '=>', '1', ',', 'b', '=>', '2' ]
2015
2016 But the following version:
2017
2018 list: '(' <leftop: list_item /,|=>/ list_item> ')'
2019
returns only the operands:
2021
2022 [ 'a', '1', 'b', '2' ]
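
The effect of the capturing parentheses can be seen side by side in a
small arithmetic example. This is a sketch under assumptions: the rule
names "operands", "tokens" and "num" are hypothetical, and the fold at
the end is merely one way to consume the interleaved list.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    operands: <leftop: num /[-+]/ num>      # operator pattern, no capture
    tokens:   <leftop: num /([-+])/ num>    # first subpattern is captured
    num:      /\d+/
}) or die "Bad grammar";

my $operands = $parser->operands("1 + 2 - 3");
my $tokens   = $parser->tokens("1 + 2 - 3");

print "without capture: @$operands\n";
print "with capture:    @$tokens\n";

# Evaluate left to right, consuming one (operator, operand) pair at a time.
my ($value, @rest) = @$tokens;
while (my ($op, $rhs) = splice @rest, 0, 2) {
    $value = $op eq '+' ? $value + $rhs : $value - $rhs;
}
print "value: $value\n";
```

Because the operators are kept only in the capturing version, that is
the form to use whenever the result has to be evaluated rather than
merely collected.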
2023
2024 Of course, none of the above specifications handle the case of an
2025 empty list, since the "<leftop:...>" and "<rightop:...>" directives
2026 require at least a single right or left operand to match. To
2027 specify that the operator can match "trivially", it's necessary to
2028 add a "(s?)" qualifier to the directive:
2029
2030 list: '(' <leftop: list_item /(,|=>)/ list_item>(s?) ')'
2031
2032 Note that in almost all the above examples, the first and third
2033 arguments of the "<leftop:...>" directive were the same subrule.
That is because "<leftop:...>" directives are frequently used to specify
2035 "separated" lists of the same type of item. To make such lists
2036 easier to specify, the following syntax:
2037
2038 list: element(s /,/)
2039
2040 is exactly equivalent to:
2041
2042 list: <leftop: element /,/ element>
2043
2044 Note that the separator must be specified as a raw pattern (i.e.
2045 not a string or subrule).
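
A minimal sketch of the separated-list shorthand follows. The rule
"element" is assumed here to be a run of digits; any other subrule
would do.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    list:    element(s /,/)
    element: /\d+/
}) or die "Bad grammar";

# The separators are matched but not returned, only the elements are.
my $elements = $parser->list("1, 2, 3");
print "elements: @$elements\n";

# At least one element is required, so this input does not match.
my $nothing = $parser->list("x");
print "no match for 'x'\n" unless defined $nothing;
```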
2046
2047 Scored productions
2048 By default, Parse::RecDescent grammar rules always accept the first
2049 production that matches the input. But if two or more productions
2050 may potentially match the same input, choosing the first that does
2051 so may not be optimal.
2052
2053 For example, if you were parsing the sentence "time flies like an
2054 arrow", you might use a rule like this:
2055
2056 sentence: verb noun preposition article noun { [@item] }
2057 | adjective noun verb article noun { [@item] }
2058 | noun verb preposition article noun { [@item] }
2059
2060 Each of these productions matches the sentence, but the third one
2061 is the most likely interpretation. However, if the sentence had
2062 been "fruit flies like a banana", then the second production is
2063 probably the right match.
2064
To cater for such situations, the "<score:...>" directive can be used. The
2066 directive is equivalent to an unconditional "<reject>", except that
2067 it allows you to specify a "score" for the current production. If
2068 that score is numerically greater than the best score of any
2069 preceding production, the current production is cached for later
2070 consideration. If no later production matches, then the cached
2071 production is treated as having matched, and the value of the item
2072 immediately before its "<score:...>" directive is returned as the
2073 result.
2074
2075 In other words, by putting a "<score:...>" directive at the end of
2076 each production, you can select which production matches using
2077 criteria other than specification order. For example:
2078
2079 sentence: verb noun preposition article noun { [@item] } <score: sensible(@item)>
2080 | adjective noun verb article noun { [@item] } <score: sensible(@item)>
2081 | noun verb preposition article noun { [@item] } <score: sensible(@item)>
2082
2083 Now, when each production reaches its respective "<score:...>"
2084 directive, the subroutine "sensible" will be called to evaluate the
2085 matched items (somehow). Once all productions have been tried, the
2086 one which "sensible" scored most highly will be the one that is
2087 accepted as a match for the rule.
2088
2089 The variable $score always holds the current best score of any
2090 production, and the variable $score_return holds the corresponding
2091 return value.
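
Since "sensible" is left undefined above, a simpler scoring function
makes the mechanism easier to observe. The sketch below is an
assumption-laden toy: the rules "first" and "longest" and their three
patterns are hypothetical, and the score is just the length of the
matched text, so the highest-scoring production is the longest match.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    first:   /a+/ | /[ab]+/ | /[abc]+/

    longest: /a+/      <score: length($item[1])>
           | /[ab]+/   <score: length($item[1])>
           | /[abc]+/  <score: length($item[1])>
}) or die "Bad grammar";

# Unscored rule: the first production that matches wins.
my $first = $parser->first("aabbc");

# Scored rule: every production is tried, and the best score wins.
my $longest = $parser->longest("aabbc");

print "first match:   $first\n";
print "highest score: $longest\n";
```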
2092
As another example, the following grammar matches lines whose fields may be
separated by commas, colons, or spaces. This can be tricky if
2095 a colon-separated line also contains commas, or vice versa. The
2096 grammar resolves the ambiguity by selecting the rule that results
2097 in the fewest fields:
2098
2099 line: seplist[sep=>','] <score: -@{$item[1]}>
2100 | seplist[sep=>':'] <score: -@{$item[1]}>
2101 | seplist[sep=>" "] <score: -@{$item[1]}>
2102
2103 seplist: <skip:""> <leftop: /[^$arg{sep}]*/ "$arg{sep}" /[^$arg{sep}]*/>
2104
2105 Note the use of negation within the "<score:...>" directive to
2106 ensure that the seplist with the most items gets the lowest score.
2107
2108 As the above examples indicate, it is often the case that all
2109 productions in a rule use exactly the same "<score:...>" directive.
2110 It is tedious to have to repeat this identical directive in every
2111 production, so Parse::RecDescent also provides the
2112 "<autoscore:...>" directive.
2113
2114 If an "<autoscore:...>" directive appears in any production of a
2115 rule, the code it specifies is used as the scoring code for every
2116 production of that rule, except productions that already end with
2117 an explicit "<score:...>" directive. Thus the rules above could be
2118 rewritten:
2119
2120 line: <autoscore: -@{$item[1]}>
2121 line: seplist[sep=>',']
2122 | seplist[sep=>':']
2123 | seplist[sep=>" "]
2124
2125
2126 sentence: <autoscore: sensible(@item)>
2127 | verb noun preposition article noun { [@item] }
2128 | adjective noun verb article noun { [@item] }
2129 | noun verb preposition article noun { [@item] }
2130
2131 Note that the "<autoscore:...>" directive itself acts as an
2132 unconditional "<reject>", and (like the "<rulevar:...>" directive)
2133 is pruned at compile-time wherever possible.
2134
2135 Dispensing with grammar checks
2136 During the compilation phase of parser construction,
2137 Parse::RecDescent performs a small number of checks on the grammar
2138 it's given. Specifically it checks that the grammar is not left-
2139 recursive, that there are no "insatiable" constructs of the form:
2140
2141 rule: subrule(s) subrule
2142
2143 and that there are no rules missing (i.e. referred to, but never
2144 defined).
2145
2146 These checks are important during development, but can slow down
2147 parser construction in stable code. So Parse::RecDescent provides
2148 the <nocheck> directive to turn them off. The directive can only
2149 appear before the first rule definition, and switches off checking
2150 throughout the rest of the current grammar.
2151
2152 Typically, this directive would be added when a parser has been
2153 thoroughly tested and is ready for release.
2154
2155 Subrule argument lists
2156 It is occasionally useful to pass data to a subrule which is being
2157 invoked. For example, consider the following grammar fragment:
2158
2159 classdecl: keyword decl
2160
keyword: 'struct' | 'class'
2162
2163 decl: # WHATEVER
2164
2165 The "decl" rule might wish to know which of the two keywords was used
2166 (since it may affect some aspect of the way the subsequent declaration
2167 is interpreted). "Parse::RecDescent" allows the grammar designer to
2168 pass data into a rule, by placing that data in an argument list (that
2169 is, in square brackets) immediately after any subrule item in a
2170 production. Hence, we could pass the keyword to "decl" as follows:
2171
2172 classdecl: keyword decl[ $item[1] ]
2173
keyword: 'struct' | 'class'
2175
2176 decl: # WHATEVER
2177
2178 The argument list can consist of any number (including zero!) of comma-
2179 separated Perl expressions. In other words, it looks exactly like a
2180 Perl anonymous array reference. For example, we could pass the keyword,
2181 the name of the surrounding rule, and the literal 'keyword' to "decl"
2182 like so:
2183
2184 classdecl: keyword decl[$item[1],$item[0],'keyword']
2185
keyword: 'struct' | 'class'
2187
2188 decl: # WHATEVER
2189
2190 Within the rule to which the data is passed ("decl" in the above
2191 examples) that data is available as the elements of a local variable
2192 @arg. Hence "decl" might report its intentions as follows:
2193
2194 classdecl: keyword decl[$item[1],$item[0],'keyword']
2195
keyword: 'struct' | 'class'
2197
2198 decl: { print "Declaring $arg[0] (a $arg[2])\n";
2199 print "(this rule called by $arg[1])" }
2200
2201 Subrule argument lists can also be interpreted as hashes, simply by
2202 using the local variable %arg instead of @arg. Hence we could rewrite
2203 the previous example:
2204
2205 classdecl: keyword decl[keyword => $item[1],
2206 caller => $item[0],
2207 type => 'keyword']
2208
keyword: 'struct' | 'class'
2210
2211 decl: { print "Declaring $arg{keyword} (a $arg{type})\n";
2212 print "(this rule called by $arg{caller})" }
2213
2214 Both @arg and %arg are always available, so the grammar designer may
2215 choose whichever convention (or combination of conventions) suits best.
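
The hash-style convention can be exercised with a complete, if
artificial, grammar. In this sketch "decl" is assumed to match a single
identifier, and instead of printing it returns a summary string so that
the caller can inspect what was passed in.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    classdecl: keyword decl[ keyword => $item[1], caller => $item[0] ]

    keyword:   'struct' | 'class'

    decl:      /\w+/
                   { $return = "$arg{keyword} $item[1] (via $arg{caller})" }
}) or die "Bad grammar";

# $item[0] is the name of the enclosing rule, so "caller" is "classdecl".
my $summary = $parser->classdecl("class Widget");
print "$summary\n";
```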
2216
2217 Subrule argument lists are also useful for creating "rule templates"
2218 (especially when used in conjunction with the "<matchrule:...>"
2219 directive). For example, the subrule:
2220
2221 list: <matchrule:$arg{rule}> /$arg{sep}/ list[%arg]
2222 { $return = [ $item[1], @{$item[3]} ] }
2223 | <matchrule:$arg{rule}>
2224 { $return = [ $item[1]] }
2225
2226 is a handy template for the common problem of matching a separated
2227 list. For example:
2228
2229 function: 'func' name '(' list[rule=>'param',sep=>';'] ')'
2230
2231 param: list[rule=>'name',sep=>','] ':' typename
2232
2233 name: /\w+/
2234
2235 typename: name
2236
2237 When a subrule argument list is used with a repeated subrule, the
2238 argument list goes before the repetition specifier:
2239
2240 list: /some|many/ thing[ $item[1] ](s)
2241
2242 The argument list is "late bound". That is, it is re-evaluated for
2243 every repetition of the repeated subrule. This means that each
2244 repeated attempt to match the subrule may be passed a completely
2245 different set of arguments if the value of the expression in the
2246 argument list changes between attempts. So, for example, the grammar:
2247
2248 { $::species = 'dogs' }
2249
2250 pair: 'two' animal[$::species](s)
2251
2252 animal: /$arg[0]/ { $::species = 'cats' }
2253
2254 will match the string "two dogs cats cats" completely, whereas it will
2255 only match the string "two dogs dogs dogs" up to the eighth letter. If
2256 the value of the argument list were "early bound" (that is, evaluated
2257 only the first time a repeated subrule match is attempted), one would
2258 expect the matching behaviours to be reversed.
2259
2260 Of course, it is possible to effectively "early bind" such argument
2261 lists by passing them a value which does not change on each repetition.
2262 For example:
2263
2264 { $::species = 'dogs' }
2265
2266 pair: 'two' { $::species } animal[$item[2]](s)
2267
2268 animal: /$arg[0]/ { $::species = 'cats' }
2269
2270 Arguments can also be passed to the start rule, simply by appending
2271 them to the argument list with which the start rule is called (after
2272 the "line number" parameter). For example, given:
2273
2274 $parser = new Parse::RecDescent ( $grammar );
2275
2276 $parser->data($text, 1, "str", 2, \@arr);
2277
2278 # ^^^^^ ^ ^^^^^^^^^^^^^^^
2279 # | | |
2280 # TEXT TO BE PARSED | |
2281 # STARTING LINE NUMBER |
2282 # ELEMENTS OF @arg WHICH IS PASSED TO RULE data
2283
2284 then within the productions of the rule "data", the array @arg will
2285 contain "("str", 2, \@arr)".
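
A short sketch of the same calling convention, assuming a trivial
"data" rule that matches one word and returns it together with
whatever arrived in @arg:

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    data: /\w+/ { $return = [ $item[1], @arg ] }
}) or die "Bad grammar";

# Arguments: text, starting line number, then the elements of @arg.
my $result = $parser->data("hello", 1, "str", 2);

print "matched: $result->[0]\n";
print "args:    @{$result}[1 .. $#$result]\n";
```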
2286
2287 Alternations
2288 Alternations are implicit (unnamed) rules defined as part of a
2289 production. An alternation is defined as a series of '|'-separated
2290 productions inside a pair of round brackets. For example:
2291
2292 character: 'the' ( good | bad | ugly ) /dude/
2293
2294 Every alternation implicitly defines a new subrule, whose
2295 automatically-generated name indicates its origin:
"_alternation_<I>_of_production_<P>_of_rule_<R>" for the appropriate
2297 values of <I>, <P>, and <R>. A call to this implicit subrule is then
2298 inserted in place of the brackets. Hence the above example is merely a
2299 convenient short-hand for:
2300
2301 character: 'the'
2302 _alternation_1_of_production_1_of_rule_character
2303 /dude/
2304
2305 _alternation_1_of_production_1_of_rule_character:
2306 good | bad | ugly
2307
2308 Since alternations are parsed by recursively calling the parser
2309 generator, any type(s) of item can appear in an alternation. For
2310 example:
2311
2312 character: 'the' ( 'high' "plains" # Silent, with poncho
2313 | /no[- ]name/ # Silent, no poncho
2314 | vengeance_seeking # Poncho-optional
2315 | <error>
2316 ) drifter
2317
2318 In this case, if an error occurred, the automatically generated message
2319 would be:
2320
2321 ERROR (line <N>): Invalid implicit subrule: Expected
'high' or /no[- ]name/ or vengeance_seeking,
2323 but found "pacifist" instead
2324
2325 Since every alternation actually has a name, it's even possible to
2326 extend or replace them:
2327
$parser->Replace(
2329 "_alternation_1_of_production_1_of_rule_character:
2330 'generic Eastwood'"
2331 );
2332
2333 More importantly, since alternations are a form of subrule, they can be
2334 given repetition specifiers:
2335
2336 character: 'the' ( good | bad | ugly )(?) /dude/
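
Like any other repeated subrule, an alternation with a repetition
specifier yields an array reference. The sketch below assumes literal
terminals for "good", "bad" and "ugly" (the text leaves those rules
undefined) and adds an action that returns the alternation's value.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    character: 'the' ( 'good' | 'bad' | 'ugly' )(?) /dude/
                   { $return = $item[2] }
}) or die "Bad grammar";

my $bad   = $parser->character("the bad dude");    # alternation matched once
my $plain = $parser->character("the dude");        # alternation skipped
my $none  = $parser->character("the mean dude");   # no production fits

print "with adjective:    @$bad\n";
print "without adjective: ", scalar(@$plain), " items\n";
print "unknown adjective: ", (defined $none ? "matched" : "no match"), "\n";
```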
2337
2338 Incremental Parsing
2339 "Parse::RecDescent" provides two methods - "Extend" and "Replace" -
2340 which can be used to alter the grammar matched by a parser. Both
2341 methods take the same argument as "Parse::RecDescent::new", namely a
grammar specification string.
2343
2344 "Parse::RecDescent::Extend" interprets the grammar specification and
2345 adds any productions it finds to the end of the rules for which they
2346 are specified. For example:
2347
2348 $add = "name: 'Jimmy-Bob' | 'Bobby-Jim'\ndesc: colour /necks?/";
$parser->Extend($add);
2350
2351 adds two productions to the rule "name" (creating it if necessary) and
2352 one production to the rule "desc".
2353
"Parse::RecDescent::Replace" is identical, except that it first resets
any rule specified in the additional grammar, removing any existing
2356 productions. Hence after:
2357
2358 $add = "name: 'Jimmy-Bob' | 'Bobby-Jim'\ndesc: colour /necks?/";
$parser->Replace($add);
2360
those are the only valid "name"s and the only possible description.
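
The contrast between the two methods can be sketched with a single
rule. The rule name "person", the third name and the helper "accepts"
are all hypothetical, chosen only to keep the demonstration short.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{ person: 'Jimmy-Bob' })
    or die "Bad grammar";

# Report whether the current grammar accepts a given name.
sub accepts { return defined $parser->person($_[0]) ? 'yes' : 'no' }

my %seen;
$seen{before_extend} = accepts('Bobby-Jim');

# Extend appends a production; the original one is kept.
$parser->Extend(q{ person: 'Bobby-Jim' });
$seen{extended_new} = accepts('Bobby-Jim');
$seen{extended_old} = accepts('Jimmy-Bob');

# Replace discards the existing productions of the rule first.
$parser->Replace(q{ person: 'Billy-Joe' });
$seen{replaced_new} = accepts('Billy-Joe');
$seen{replaced_old} = accepts('Jimmy-Bob');

print "$_: $seen{$_}\n" for sort keys %seen;
```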
2362
2363 A more interesting use of the "Extend" and "Replace" methods is to call
2364 them inside the action of an executing parser. For example:
2365
2366 typedef: 'typedef' type_name identifier ';'
2367 { $thisparser->Extend("type_name: '$item[3]'") }
2368 | <error>
2369
identifier: ...!type_name /[A-Za-z_]\w*/
2371
2372 which automatically prevents type names from being typedef'd, or:
2373
2374 command: 'map' key_name 'to' abort_key
2375 { $thisparser->Replace("abort_key: '$item[2]'") }
2376 | 'map' key_name 'to' key_name
2377 { map_key($item[2],$item[4]) }
2378 | abort_key
2379 { exit if confirm("abort?") }
2380
2381 abort_key: 'q'
2382
2383 key_name: ...!abort_key /[A-Za-z]/
2384
2385 which allows the user to change the abort key binding, but not to
2386 unbind it.
2387
The careful use of such constructs makes it possible to reconfigure a
2389 running parser, eliminating the need for semantic feedback by providing
2390 syntactic feedback instead. However, as currently implemented,
2391 "Replace()" and "Extend()" have to regenerate and re-"eval" the entire
2392 parser whenever they are called. This makes them quite slow for large
2393 grammars.
2394
2395 In such cases, the judicious use of an interpolated regex is likely to
2396 be far more efficient:
2397
typedef: 'typedef' type_name identifier ';'
2399 { $thisparser->{local}{type_name} .= "|$item[3]" }
2400 | <error>
2401
identifier: ...!type_name /[A-Za-z_]\w*/
2403
2404 type_name: /$thisparser->{local}{type_name}/
2405
2406 Precompiling parsers
2407 Normally Parse::RecDescent builds a parser from a grammar at run-time.
2408 That approach simplifies the design and implementation of parsing code,
2409 but has the disadvantage that it slows the parsing process down - you
2410 have to wait for Parse::RecDescent to build the parser every time the
2411 program runs. Long or complex grammars can be particularly slow to
2412 build, leading to unacceptable delays at start-up.
2413
2414 To overcome this, the module provides a way of "pre-building" a parser
2415 object and saving it in a separate module. That module can then be used
2416 to create clones of the original parser.
2417
2418 A grammar may be precompiled using the "Precompile" class method. For
2419 example, to precompile a grammar stored in the scalar $grammar, and
2420 produce a class named PreGrammar in a module file named PreGrammar.pm,
2421 you could use:
2422
2423 use Parse::RecDescent;
2424
2425 Parse::RecDescent->Precompile([$options_hashref], $grammar, "PreGrammar");
2426
2427 The first required argument is the grammar string, the second is the
2428 name of the class to be built. The name of the module file is generated
2429 automatically by appending ".pm" to the last element of the class name.
2430 Thus
2431
2432 Parse::RecDescent->Precompile($grammar, "My::New::Parser");
2433
2434 would produce a module file named Parser.pm.
2435
2436 An optional hash reference may be supplied as the first argument to
2437 "Precompile". This argument is currently EXPERIMENTAL, and may change
2438 in a future release of Parse::RecDescent. The only supported option is
2439 currently "-standalone", see "Standalone Precompiled Parsers".
2440
2441 It is somewhat tedious to have to write a small Perl program just to
2442 generate a precompiled grammar class, so Parse::RecDescent has some
2443 special magic that allows you to do the job directly from the command-
2444 line.
2445
2446 If your grammar is specified in a file named grammar, you can generate
2447 a class named Yet::Another::Grammar like so:
2448
2449 > perl -MParse::RecDescent - grammar Yet::Another::Grammar
2450
2451 This would produce a file named Grammar.pm containing the full
2452 definition of a class called Yet::Another::Grammar. Of course, to use
2453 that class, you would need to put the Grammar.pm file in a directory
2454 named Yet/Another, somewhere in your Perl include path.
2455
2456 Having created the new class, it's very easy to use it to build a
2457 parser. You simply "use" the new module, and then call its "new" method
2458 to create a parser object. For example:
2459
2460 use Yet::Another::Grammar;
2461 my $parser = Yet::Another::Grammar->new();
2462
2463 The effect of these two lines is exactly the same as:
2464
2465 use Parse::RecDescent;
2466
    open my $grammar_fh, '<', 'grammar' or die "Cannot open grammar: $!";
    local $/;                          # slurp mode: read the whole file at once
    my $grammar = <$grammar_fh>;
2470
2471 my $parser = Parse::RecDescent->new($grammar);
2472
2473 only considerably faster.
2474
2475 Note however that the parsers produced by either approach are exactly
2476 the same, so whilst precompilation has an effect on set-up speed, it
2477 has no effect on parsing speed. RecDescent 2.0 will address that
2478 problem.
2479
2480 Standalone Precompiled Parsers
2481
2482 Until version 1.967003 of Parse::RecDescent, parser modules built with
2483 "Precompile" were dependent on Parse::RecDescent. Future
2484 Parse::RecDescent releases with different internal implementations
2485 would break pre-existing precompiled parsers.
2486
2487 Version 1.967_005 added the ability for Parse::RecDescent to include
2488 itself in the resulting .pm file if you pass the boolean option
2489 "-standalone" to "Precompile":
2490
    Parse::RecDescent->Precompile({ -standalone => 1, },
        $grammar, "My::New::Parser");
2493
2494 Parse::RecDescent is included as Parse::RecDescent::_Runtime in order
2495 to avoid conflicts between an installed version of Parse::RecDescent
2496 and a precompiled, standalone parser made with another version of
2497 Parse::RecDescent. This renaming is experimental, and is subject to
2498 change in future versions.
2499
2500 Precompiled parsers remain dependent on Parse::RecDescent by default,
2501 as this feature is still considered experimental. In the future,
2502 standalone parsers will become the default.
2503
2505 This section describes common mistakes that grammar writers seem to
2506 make on a regular basis.
2507
2508 1. Expecting an error to always invalidate a parse
2509 A common mistake when using error messages is to write the grammar like
2510 this:
2511
2512 file: line(s)
2513
2514 line: line_type_1
2515 | line_type_2
2516 | line_type_3
2517 | <error>
2518
2519 The expectation seems to be that any line that is not of type 1, 2 or 3
2520 will invoke the "<error>" directive and thereby cause the parse to
2521 fail.
2522
2523 Unfortunately, that only happens if the error occurs in the very first
2524 line. The first rule states that a "file" is matched by one or more
2525 lines, so if even a single line succeeds, the first rule is completely
2526 satisfied and the parse as a whole succeeds. That means that any error
2527 messages generated by subsequent failures in the "line" rule are
2528 quietly ignored.
2529
2530 Typically what's really needed is this:
2531
2532 file: line(s) eofile { $return = $item[1] }
2533
2534 line: line_type_1
2535 | line_type_2
2536 | line_type_3
2537 | <error>
2538
2539 eofile: /^\Z/
2540
2541 The addition of the "eofile" subrule to the first production means
2542 that a file only matches a series of successful "line" matches that
2543 consume the complete input text. If any input text remains after the
2544 lines are matched, there must have been an error in the last "line". In
2545 that case the "eofile" rule will fail, causing the entire "file" rule
2546 to fail too.
2547
2548 Note too that "eofile" must match "/^\Z/" (end-of-text), not "/^\cZ/"
2549 or "/^\cD/" (end-of-file).
2550
2551 And don't forget the action at the end of the production. If you just
2552 write:
2553
2554 file: line(s) eofile
2555
2556 then the value returned by the "file" rule will be the value of its
2557 last item: "eofile". Since "eofile" always returns an empty string on
2558 success, that will cause the "file" rule to return that empty string.
2559 Apart from returning the wrong value, returning an empty string will
2560 trip up code such as:
2561
2562 $parser->file($filetext) || die;
2563
2564 (since "" is false).
2565
2566 Remember that Parse::RecDescent returns undef on failure, so the only
2567 safe test for failure is:
2568
2569 defined($parser->file($filetext)) || die;
2570
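The difference between the two grammars can be seen in a small, self-contained sketch. Here a "line" is simply a run of digits, which is an assumption made to keep the example short; the rule names file, line and eofile follow the text above.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Without eofile: any leading run of valid lines satisfies "file".
my $loose = Parse::RecDescent->new(q{
    file: line(s)
    line: /\d+/
        | <error>
}) or die "Bad grammar";

# With eofile: "file" only matches if the whole text is consumed.
my $strict = Parse::RecDescent->new(q{
    file:   line(s) eofile { $return = $item[1] }
    line:   /\d+/
          | <error>
    eofile: /^\Z/
}) or die "Bad grammar";

my $bad_text  = "1\n2\noops\n";
my $good_text = "1\n2\n";

my $loose_result = $loose->file($bad_text);
my $strict_bad   = $strict->file($bad_text);
my $strict_good  = $strict->file($good_text);

# Always test with defined(), never with plain truth.
print "loose on bad text:   ", (defined $loose_result ? "succeeds" : "fails"), "\n";
print "strict on bad text:  ", (defined $strict_bad   ? "succeeds" : "fails"), "\n";
print "strict on good text: ", (defined $strict_good  ? "succeeds" : "fails"), "\n";
```

The strict parser may also write an error message to STDERR for the bad text, since its "<error>" directive is no longer masked by an overall success.
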
2571 2. Using a "return" in an action
2572 An action is like a "do" block inside the subroutine implementing the
2573 surrounding rule. So if you put a "return" statement in an action:
2574
    range: '(' start '..' end ')'
        { return $item{end} }
        /\s+/
2578
2579 that subroutine will immediately return, without checking the rest of
2580 the items in the current production (e.g. the "/\s+/") and without
2581 setting up the necessary data structures to tell the parser that the
2582 rule has succeeded.
2583
2584 The correct way to set a return value in an action is to set the
2585 $return variable:
2586
    range: '(' start '..' end ')'
        { $return = $item{end} }
        /\s+/
2590
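A runnable sketch of the $return mechanism follows. The two rule names range_default and range_end, and the definitions of start and end as runs of digits, are assumptions added for illustration. The action is placed before the closing parenthesis to show that matching continues after $return has been set.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    range_default: '(' start '..' end ')'

    range_end: '(' start '..' end
               { $return = $item{end} }
               ')'

    start: /\d+/
    end:   /\d+/
}) or die "Bad grammar";

# With no action, the value of a rule is the value of its last item,
# which here is the literal ')'.
my $default = $parser->range_default("(1..10)");

# With $return set, the rule still has to match the closing ')',
# but its value is whatever was stored in $return.
my $end = $parser->range_end("(1..10)");

print "default value: $default\n";
print "with \$return:  $end\n";
```
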
3. Setting $Parse::RecDescent::skip at parse time
2592 If you want to change the default skipping behaviour (see "Terminal
2593 Separators" and the "<skip:...>" directive) by setting
2594 $Parse::RecDescent::skip you have to remember to set this variable
2595 before creating the grammar object.
2596
2597 For example, you might want to skip all Perl-like comments with this
2598 regular expression:
2599
    my $skip_spaces_and_comments = qr/
        (?:
            \s+         # either spaces
          | \# .*?$     # or a hash and whatever follows, up to the end of the line
        )*              # repeated at will (in whatever order)
    /mxs;
2606
2607 And then:
2608
2609 my $parser1 = Parse::RecDescent->new($grammar);
2610
2611 $Parse::RecDescent::skip = $skip_spaces_and_comments;
2612
2613 my $parser2 = Parse::RecDescent->new($grammar);
2614
2615 $parser1->parse($text); # this does not cope with comments
2616 $parser2->parse($text); # this skips comments correctly
2617
2618 The two parsers behave differently, because any skipping behaviour
2619 specified via $Parse::RecDescent::skip is hard-coded when the grammar
2620 object is built, not at parse time.
2621
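The build-time capture of the skip pattern can be checked with a short sketch. The grammar, the rule names words and word, and the simplified single-line skip pattern are all illustrative assumptions. The "local" deliberately restores the default before any parsing happens, so the two parsers can only differ in what was captured when each one was built.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $grammar = q{
    words: word(s) /\Z/ { $return = $item[1] }
    word:  /\w+/
};

my $text = "alpha # a comment\nbeta";

# Built with the default skip pattern.
my $parser1 = Parse::RecDescent->new($grammar) or die "Bad grammar";

# Built while a comment-aware skip pattern is in force; the global
# reverts to its previous value as soon as the do-block ends.
my $parser2 = do {
    local $Parse::RecDescent::skip = '(?:\s+|#[^\n]*)*';
    Parse::RecDescent->new($grammar) or die "Bad grammar";
};

my $result1 = $parser1->words($text);   # stumbles over the comment
my $result2 = $parser2->words($text);   # skips the comment

print "parser1: ", (defined $result1 ? "@$result1" : "parse failed"), "\n";
print "parser2: ", (defined $result2 ? "@$result2" : "parse failed"), "\n";
```
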
2623 Diagnostics are intended to be self-explanatory (particularly if you
2624 use -RD_HINT (under perl -s) or define $::RD_HINT inside the program).
2625
2626 "Parse::RecDescent" currently diagnoses the following:
2627
2628 · Invalid regular expressions used as pattern terminals (fatal
2629 error).
2630
2631 · Invalid Perl code in code blocks (fatal error).
2632
2633 · Lookahead used in the wrong place or in a nonsensical way (fatal
2634 error).
2635
2636 · "Obvious" cases of left-recursion (fatal error).
2637
2638 · Missing or extra components in a "<leftop>" or "<rightop>"
2639 directive.
2640
2641 · Unrecognisable components in the grammar specification (fatal
2642 error).
2643
2644 · "Orphaned" rule components specified before the first rule (fatal
2645 error) or after an "<error>" directive (level 3 warning).
2646
2647 · Missing rule definitions (this only generates a level 3 warning,
2648 since you may be providing them later via
2649 "Parse::RecDescent::Extend()").
2650
· Instances where greedy repetition behaviour will almost certainly
  cause the failure of a production (a level 3 warning - see
  "ON-GOING ISSUES AND FUTURE DIRECTIONS" below).
2654
2655 · Attempts to define rules named 'Replace' or 'Extend', which cannot
2656 be called directly through the parser object because of the
2657 predefined meaning of "Parse::RecDescent::Replace" and
2658 "Parse::RecDescent::Extend". (Only a level 2 warning is generated,
2659 since such rules can still be used as subrules).
2660
2661 · Productions which consist of a single "<error?>" directive, and
2662 which therefore may succeed unexpectedly (a level 2 warning, since
2663 this might conceivably be the desired effect).
2664
2665 · Multiple consecutive lookahead specifiers (a level 1 warning only,
2666 since their effects simply accumulate).
2667
2668 · Productions which start with a "<reject>" or "<rulevar:...>"
2669 directive. Such productions are optimized away (a level 1 warning).
2670
· Rules which are autogenerated under $::RD_AUTOSTUB (a level 1
  warning).
2673
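As a sketch of how a fatal diagnostic surfaces in a program, the script below tries to build a directly left-recursive grammar and then an equivalent one written with "<leftop>". It assumes that "new" returns undef when the grammar contains a fatal error; the rule names expr and term are illustrative.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Directly left-recursive: "expr" calls itself as its first item.
my $bad = Parse::RecDescent->new(q{
    expr: expr '+' term
        | term
    term: /\d+/
});

# The same language written with <leftop>, which iterates
# instead of recursing.
my $good = Parse::RecDescent->new(q{
    expr: <leftop: term '+' term>
    term: /\d+/
});

print "left-recursive grammar: ", (defined $bad  ? "built" : "rejected"), "\n";
print "leftop grammar:         ", (defined $good ? "built" : "rejected"), "\n";

my $terms = defined $good ? $good->expr("1+2+3") : undef;
print "terms: @$terms\n" if defined $terms;
```

The diagnostic itself is written to STDERR while the first grammar is being built; defining $::RD_HINT beforehand should add a suggested remedy.
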
Damian Conway (damian@conway.org)
Jeremy T. Braun (JTBRAUN@CPAN.org) [current maintainer]
2677
2679 There are undoubtedly serious bugs lurking somewhere in this much code
2680 :-) Bug reports, test cases and other feedback are most welcome.
2681
2682 Ongoing annoyances include:
2683
2684 · There's no support for parsing directly from an input stream. If
2685 and when the Perl Gods give us regular expressions on streams, this
2686 should be trivial (ahem!) to implement.
2687
2688 · The parser generator can get confused if actions aren't properly
2689 closed or if they contain particularly nasty Perl syntax errors
2690 (especially unmatched curly brackets).
2691
2692 · The generator only detects the most obvious form of left recursion
2693 (potential recursion on the first subrule in a rule). More subtle
2694 forms of left recursion (for example, through the second item in a
2695 rule after a "zero" match of a preceding "zero-or-more" repetition,
2696 or after a match of a subrule with an empty production) are not
2697 found.
2698
2699 · Instead of complaining about left-recursion, the generator should
2700 silently transform the grammar to remove it. Don't expect this
2701 feature any time soon as it would require a more sophisticated
2702 approach to parser generation than is currently used.
2703
2704 · The generated parsers don't always run as fast as might be wished.
2705
2706 · The meta-parser should be bootstrapped using "Parse::RecDescent"
2707 :-)
2708
ON-GOING ISSUES AND FUTURE DIRECTIONS
2710 1. Repetitions are "incorrigibly greedy" in that they will eat
2711 everything they can and won't backtrack if that behaviour causes a
2712 production to fail needlessly. So, for example:
2713
2714 rule: subrule(s) subrule
2715
2716 will never succeed, because the repetition will eat all the
2717 subrules it finds, leaving none to match the second item. Such
2718 constructions are relatively rare (and "Parse::RecDescent::new"
2719 generates a warning whenever they occur) so this may not be a
2720 problem, especially since the insatiable behaviour can be overcome
2721 "manually" by writing:
2722
2723 rule: penultimate_subrule(s) subrule
2724
2725 penultimate_subrule: subrule ...subrule
2726
2727 The issue is that this construction is exactly twice as expensive
2728 as the original, whereas backtracking would add only 1/N to the
2729 cost (for matching N repetitions of "subrule"). I would welcome
2730 feedback on the need for backtracking; particularly on cases where
2731 the lack of it makes parsing performance problematical.
2732
2733 2. Having opened that can of worms, it's also necessary to consider
2734 whether there is a need for non-greedy repetition specifiers.
2735 Again, it's possible (at some cost) to manually provide the
2736 required functionality:
2737
2738 rule: nongreedy_subrule(s) othersubrule
2739
2740 nongreedy_subrule: subrule ...!othersubrule
2741
2742 Overall, the issue is whether the benefit of this extra
2743 functionality outweighs the drawbacks of further complicating the
2744 (currently minimalist) grammar specification syntax, and (worse)
2745 introducing more overhead into the generated parsers.
2746
2747 3. An "<autocommit>" directive would be nice. That is, it would be
2748 useful to be able to say:
2749
2750 command: <autocommit>
2751 command: 'find' name
2752 | 'find' address
2753 | 'do' command 'at' time 'if' condition
2754 | 'do' command 'at' time
2755 | 'do' command
2756 | unusual_command
2757
2758 and have the generator work out that this should be "pruned" thus:
2759
2760 command: 'find' name
2761 | 'find' <commit> address
2762 | 'do' <commit> command <uncommit>
2763 'at' time
2764 'if' <commit> condition
2765 | 'do' <commit> command <uncommit>
2766 'at' <commit> time
2767 | 'do' <commit> command
2768 | unusual_command
2769
2770 There are several issues here. Firstly, should the "<autocommit>"
2771 automatically install an "<uncommit>" at the start of the last
2772 production (on the grounds that the "command" rule doesn't know
2773 whether an "unusual_command" might start with "find" or "do") or
2774 should the "unusual_command" subgraph be analysed (to see if it
2775 might be viable after a "find" or "do")?
2776
2777 The second issue is how regular expressions should be treated. The
2778 simplest approach would be simply to uncommit before them (on the
2779 grounds that they might match). Better efficiency would be obtained
2780 by analyzing all preceding literal tokens to determine whether the
2781 pattern would match them.
2782
2783 Overall, the issues are: can such automated "pruning" approach a
2784 hand-tuned version sufficiently closely to warrant the extra set-up
2785 expense, and (more importantly) is the problem important enough to
2786 even warrant the non-trivial effort of building an automated
2787 solution?
2788
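The greedy-repetition problem from item 1 above, and its manual workaround, can be reproduced with a short sketch. The rule names list, leading and item are illustrative stand-ins for rule, penultimate_subrule and subrule, and the actions are added only so that the result is easy to print.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# The repetition eats every item, leaving nothing for the final "item".
my $greedy = Parse::RecDescent->new(q{
    list: item(s) item
    item: /\w+/
}) or die "Bad grammar";

# Workaround: a "leading" item is one that is followed by another
# item (checked with the ... lookahead), so the last one is left over.
my $patient = Parse::RecDescent->new(q{
    list:    leading(s) item { $return = [ @{ $item[1] }, $item[2] ] }
    leading: item ...item    { $return = $item[1] }
    item:    /\w+/
}) or die "Bad grammar";

my $text = "a b c";

my $greedy_result  = $greedy->list($text);
my $patient_result = $patient->list($text);

print "greedy:  ", (defined $greedy_result  ? "matched" : "failed"), "\n";
print "patient: ", (defined $patient_result ? "@$patient_result" : "failed"), "\n";
```

Building the first grammar may print the greedy-repetition warning described under the diagnostics, depending on the value of $::RD_WARN.
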
2790 Source Code Repository
2791 <http://github.com/jtbraun/Parse-RecDescent>
2792
2793 Mailing List
2794 Visit <http://www.perlfoundation.org/perl5/index.cgi?parse_recdescent>
2795 to sign up for the mailing list.
2796
2797 <http://www.PerlMonks.org> is also a good place to ask questions.
2798 Previous posts about Parse::RecDescent can typically be found with this
2799 search: <http://perlmonks.org/index.pl?node=recdescent>.
2800
2801 FAQ
2802 Visit Parse::RecDescent::FAQ for answers to frequently (and not so
2803 frequently) asked questions about Parse::RecDescent.
2804
2805 View/Report Bugs
2806 To view the current bug list or report a new issue visit
2807 <https://rt.cpan.org/Public/Dist/Display.html?Name=Parse-RecDescent>.
2808
2810 Regexp::Grammars provides Parse::RecDescent style parsing using native
2811 Perl 5.10 regular expressions.
2812
Copyright (c) 1997-2007, Damian Conway <DCONWAY@CPAN.org>. All rights
reserved.
2816
2817 This module is free software; you can redistribute it and/or modify it
2818 under the same terms as Perl itself. See perlartistic.
2819
2821 BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
2822 FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT
2823 WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER
2824 PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND,
2825 EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
2826 WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
2827 ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH
2828 YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL
2829 NECESSARY SERVICING, REPAIR, OR CORRECTION.
2830
2831 IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
2832 WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
2833 REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE
2834 TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR
2835 CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
2836 SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
2837 RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
2838 FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
2839 SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
2840 DAMAGES.
2841
2842
2843
2844perl v5.16.3 2014-06-09 Parse::RecDescent(3)