Regexp::Grammars(3)   User Contributed Perl Documentation   Regexp::Grammars(3)

6 Regexp::Grammars - Add grammatical parsing features to Perl 5.10
7 regexes
8
10 This document describes Regexp::Grammars version 1.049
11
    use Regexp::Grammars;

    my $parser = qr{
        (?:
            <Verb>               # Parse and save a Verb in a scalar
            <.ws>                # Parse but don't save whitespace
            <Noun>               # Parse and save a Noun in a scalar

            <type=(?{ rand > 0.5 ? 'VN' : 'VerbNoun' })>
                                 # Save result of expression in a scalar
        |
            (?:
                <[Noun]>         # Parse a Noun and save result in a list
                                 #   (saved under the key 'Noun')
                <[PostNoun=ws]>  # Parse whitespace, save it in a list
                                 #   (saved under the key 'PostNoun')
            )+

            <Verb>               # Parse a Verb and save result in a scalar
                                 #   (saved under the key 'Verb')

            <type=(?{ 'VN' })>   # Save a literal in a scalar
        |
            <debug: match>       # Turn on the integrated debugger here
            <.Cmd= (?: mv? )>    # Parse but don't capture a subpattern
                                 #   (name it 'Cmd' for debugging purposes)
            <[File]>+            # Parse 1+ Files and save them in a list
                                 #   (saved under the key 'File')
            <debug: off>         # Turn off the integrated debugger here
            <Dest=File>          # Parse a File and save it in a scalar
                                 #   (saved under the key 'Dest')
        )

        ################################################################

        <token: File>              # Define a subrule named File
            <.ws>                  #  - Parse but don't capture whitespace
            <MATCH= ([\w-]+) >     #  - Parse the subpattern and capture
                                   #    matched text as the result of the
                                   #    subrule

        <token: Noun>              # Define a subrule named Noun
            cat | dog | fish       #  - Match an alternative (as usual)

        <rule: Verb>               # Define a whitespace-sensitive subrule
            eats                   #  - Match a literal (after any space)
            <Object=Noun>?         #  - Parse optional subrule Noun and
                                   #    save result under the key 'Object'
        |                          #  Or else...
            <AUX>                  #  - Parse subrule AUX and save result
            <part= (eaten|seen) >  #  - Match a literal, save under 'part'

        <token: AUX>               # Define a whitespace-insensitive subrule
            (has | is)             #  - Match an alternative and capture
            (?{ $MATCH = uc $^N }) #  - Use captured text as subrule result

    }x;

    # Match the grammar against some text...
    if ($text =~ $parser) {
        # If successful, the hash %/ will have the hierarchy of results...
        process_data_in( %/ );
    }
76
78 In your program...
79 use Regexp::Grammars; Allow enhanced regexes in lexical scope
80 %/ Result-hash for successful grammar match
81
82 Defining and using named grammars...
83 <grammar: GRAMMARNAME> Define a named grammar that can be inherited
84 <extends: GRAMMARNAME> Current grammar inherits named grammar's rules
85
86 Defining rules in your grammar...
87 <rule: RULENAME> Define rule with magic whitespace
88 <token: RULENAME> Define rule without magic whitespace
89
90 <objrule: CLASS= NAME> Define rule that blesses return-hash into class
91 <objtoken: CLASS= NAME> Define token that blesses return-hash into class
92
93 <objrule: CLASS> Shortcut for above (rule name derived from class)
94 <objtoken: CLASS> Shortcut for above (token name derived from class)
95
96 Matching rules in your grammar...
97 <RULENAME> Call named subrule (may be fully qualified)
98 save result to $MATCH{RULENAME}
99
100 <RULENAME(...)> Call named subrule, passing args to it
101
102 <!RULENAME> Call subrule and fail if it matches
103 <!RULENAME(...)> (shorthand for (?!<.RULENAME>) )
104
105 <:IDENT> Match contents of $ARG{IDENT} as a pattern
106 <\:IDENT> Match contents of $ARG{IDENT} as a literal
107 </:IDENT> Match closing delimiter for $ARG{IDENT}
108
109 <%HASH> Match longest possible key of hash
110 <%HASH {PAT}> Match any key of hash that also matches PAT
111
112 </IDENT> Match closing delimiter for $MATCH{IDENT}
113 <\_IDENT> Match the literal contents of $MATCH{IDENT}
114
115 <ALIAS= RULENAME> Call subrule, save result in $MATCH{ALIAS}
116 <ALIAS= %HASH> Match a hash key, save key in $MATCH{ALIAS}
117 <ALIAS= ( PATTERN )> Match pattern, save match in $MATCH{ALIAS}
118 <ALIAS= (?{ CODE })> Execute code, save value in $MATCH{ALIAS}
119 <ALIAS= 'STR' > Save specified string in $MATCH{ALIAS}
120 <ALIAS= 42 > Save specified number in $MATCH{ALIAS}
121 <ALIAS= /IDENT> Match closing delim, save as $MATCH{ALIAS}
122 <ALIAS= \_IDENT> Match '$MATCH{IDENT}', save as $MATCH{ALIAS}
123
124 <.SUBRULE> Call subrule (one of the above forms),
125 but don't save the result in %MATCH
126
127
128 <[SUBRULE]> Call subrule (one of the above forms), but
129 append result instead of overwriting it
130
131 <SUBRULE1>+ % <SUBRULE2> Match one or more repetitions of SUBRULE1
132 as long as they're separated by SUBRULE2
133 <SUBRULE1> ** <SUBRULE2> Same (only for backwards compatibility)
134
135 <SUBRULE1>* % <SUBRULE2> Match zero or more repetitions of SUBRULE1
136 as long as they're separated by SUBRULE2
137
138 In your grammar's code blocks...
139 $CAPTURE Alias for $^N (the most recent paren capture)
140 $CONTEXT Another alias for $^N
141 $INDEX Current index of next matching position in string
142 %MATCH Current rule's result-hash
143 $MATCH Magic override value (returned instead of result-hash)
144 %ARG Current rule's argument hash
145 $DEBUG Current match-time debugging mode
146
147 Directives...
148 <require: (?{ CODE }) > Fail if code evaluates false
149 <timeout: INT > Fail after specified number of seconds
150 <debug: COMMAND > Change match-time debugging mode
151 <logfile: LOGFILE > Change debugging log file (default: STDERR)
152 <fatal: TEXT|(?{CODE})> Queue error message and fail parse
153 <error: TEXT|(?{CODE})> Queue error message and backtrack
154 <warning: TEXT|(?{CODE})> Queue warning message and continue
155 <log: TEXT|(?{CODE})> Explicitly add a message to debugging log
156 <ws: PATTERN > Override automatic whitespace matching
157 <minimize:> Simplify the result of a subrule match
158 <context:> Switch on context substring retention
159 <nocontext:> Switch off context substring retention
160
162 This module adds a small number of new regex constructs that can be
163 used within Perl 5.10 patterns to implement complete recursive-descent
164 parsing.
165
Perl 5.10 already supports recursive-descent matching, via the new
"(?<name>...)" and "(?&name)" constructs. For example, here is a simple
matcher for a subset of the LaTeX markup language:
169
170 $matcher = qr{
171 (?&File)
172
173 (?(DEFINE)
174 (?<File> (?&Element)* )
175
176 (?<Element> \s* (?&Command)
177 | \s* (?&Literal)
178 )
179
180 (?<Command> \\ \s* (?&Literal) \s* (?&Options)? \s* (?&Args)? )
181
182 (?<Options> \[ \s* (?:(?&Option) (?:\s*,\s* (?&Option) )*)? \s* \])
183
184 (?<Args> \{ \s* (?&Element)* \s* \} )
185
186 (?<Option> \s* [^][\$&%#_{}~^\s,]+ )
187
188 (?<Literal> \s* [^][\$&%#_{}~^\s]+ )
189 )
190 }xms
191
192 This technique makes it possible to use regexes to recognize complex,
193 hierarchical--and even recursive--textual structures. The problem is
194 that Perl 5.10 doesn't provide any support for extracting that
195 hierarchical data into nested data structures. In other words, using
196 Perl 5.10 you can match complex data, but not parse it into an
197 internally useful form.
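
The limitation can be seen in a much smaller case. The following is a
minimal sketch using only core Perl 5.10 constructs; the "List" and
"Item" group names and the sample input are illustrative assumptions,
not part of the module. The pattern recognizes a nested list, but the
named groups inside "(?(DEFINE)...)" are only ever reached by
recursion, so they should leave nothing behind in "%+" once the match
completes:

```perl
use strict;
use warnings;

# Hypothetical recognizer for nested parenthesised lists, using only
# core Perl 5.10 regex features (no Regexp::Grammars involved)
my $nested = qr{
    \A (?&List) \z

    (?(DEFINE)
        (?<List> \( \s* (?: (?&Item) \s* )* \) )
        (?<Item> \w+ | (?&List) )
    )
}x;

my $matched = 0;
my %named;
if ('(a (b c) d)' =~ $nested) {
    $matched = 1;
    %named   = %+;    # copy whichever named captures survived the match
}

printf "matched: %s, surviving named captures: %d\n",
    ($matched ? 'yes' : 'no'), scalar keys %named;
```

The text is accepted, yet the structure of the list (which items sit
inside which sublist) is not available to the program afterwards.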
198
199 An additional problem when using Perl 5.10 regexes to match complex
200 data formats is that you have to make sure you remember to insert
201 whitespace-matching constructs (such as "\s*") at every possible
202 position where the data might contain ignorable whitespace. This
203 reduces the readability of such patterns, and increases the chance of
204 errors (typically caused by overlooking a location where whitespace
205 might appear).
206
207 The Regexp::Grammars module solves both those problems.
208
If you import the module into a particular lexical scope, it
preprocesses any regex in that scope, so as to implement a number of
extensions to the standard Perl 5.10 regex syntax. These extensions
simplify the task of defining and calling subrules within a grammar,
and allow those subrule calls to capture and retain the components
they match in a proper hierarchical manner.
215
216 For example, the above LaTeX matcher could be converted to a full LaTeX
217 parser (and considerably tidied up at the same time), like so:
218
219 use Regexp::Grammars;
220 $parser = qr{
221 <File>
222
223 <rule: File> <[Element]>*
224
225 <rule: Element> <Command> | <Literal>
226
227 <rule: Command> \\ <Literal> <Options>? <Args>?
228
229 <rule: Options> \[ <[Option]>+ % (,) \]
230
231 <rule: Args> \{ <[Element]>* \}
232
233 <rule: Option> [^][\$&%#_{}~^\s,]+
234
235 <rule: Literal> [^][\$&%#_{}~^\s]+
236 }xms
237
238 Note that there is no need to explicitly place "\s*" subpatterns
239 throughout the rules; that is taken care of automatically.
240
241 If the Regexp::Grammars version of this regex were successfully matched
242 against some appropriate LaTeX document, each rule would call the
243 subrules specified within it, and then return a hash containing
244 whatever result each of those subrules returned, with each result
245 indexed by the subrule's name.
246
247 That is, if the rule named "Command" were invoked, it would first try
248 to match a backslash, then it would call the three subrules
249 "<Literal>", "<Options>", and "<Args>" (in that sequence). If they all
250 matched successfully, the "Command" rule would then return a hash with
251 three keys: 'Literal', 'Options', and 'Args'. The value for each of
252 those hash entries would be whatever result-hash the subrules
253 themselves had returned when matched.
254
255 In this way, each level of the hierarchical regex can generate hashes
256 recording everything its own subrules matched, so when the entire
257 pattern matches, it produces a tree of nested hashes that represent the
258 structured data the pattern matched.
259
260 For example, if the previous regex grammar were matched against a
261 string containing:
262
263 \documentclass[a4paper,11pt]{article}
264 \author{D. Conway}
265
266 it would automatically extract a data structure equivalent to the
267 following (but with several extra "empty" keys, which are described in
268 "Subrule results"):
269
    {
        'File' => {
            'Element' => [
                {
                    'Command' => {
                        'Literal' => 'documentclass',
                        'Options' => {
                            'Option' => [ 'a4paper', '11pt' ],
                        },
                        'Args' => {
                            'Element' => [
                                {
                                    'Literal' => 'article',
                                }
                            ],
                        }
                    }
                },
                {
                    'Command' => {
                        'Literal' => 'author',
                        'Args' => {
                            'Element' => [
                                {
                                    'Literal' => 'D.',
                                },
                                {
                                    'Literal' => 'Conway',
                                }
                            ]
                        }
                    }
                }
            ]
        }
    }
302
303 The data structure that Regexp::Grammars produces from a regex match is
304 available to the surrounding program in the magic variable "%/".
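
As a small illustration of reading that variable, here is a sketch
with a deliberately tiny grammar. The rule names "Assignment", "Name"
and "Value", and the input string, are assumptions made up for this
example:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: a single "name = number" assignment
my $parser = qr{
    <Assignment>

    <rule: Assignment>  <Name> = <Value>

    <token: Name>       [A-Za-z_]\w*

    <token: Value>      \d+
}xms;

my %tree;
if ('width = 42' =~ $parser) {
    %tree = %/;    # copy the result-hash before any later match replaces it
}

# Each subrule's result is stored under the name it was called by
my $name  = $tree{Assignment}{Name};
my $value = $tree{Assignment}{Value};

print "$name is set to $value\n";
```

Copying "%/" into a lexical hash straight after the match is a
defensive habit rather than a requirement: it keeps the parse tree
safe from being replaced by a later grammar match.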
305
306 Regexp::Grammars provides many features that simplify the extraction of
307 hierarchical data via a regex match, and also some features that can
308 simplify the processing of that data once it has been extracted. The
309 following sections explain each of those features, and some of the
310 parsing techniques they support.
311
312 Setting up the module
313 Just add:
314
315 use Regexp::Grammars;
316
317 to any lexical scope. Any regexes within that scope will automatically
318 now implement the new parsing constructs:
319
320 use Regexp::Grammars;
321
322 my $parser = qr/ regex with $extra <chocolatey> grammar bits /;
323
Note that you do not need to use the "/x" modifier when declaring a
regex grammar (though you certainly may). Even if you don't, the module
quietly adds a "/x" to every regex within the scope of its usage.
Otherwise, the default "a whitespace character matches exactly that
whitespace character" behaviour of Perl regexes would mess up your
grammar's parsing. If you need the non-"/x" behaviour, you can still
use the "(?-x)" or "(?-x:...)" directives to switch off "/x" within one
or more of your grammar's components.
332
333 Once the grammar has been processed, you can then match text against
334 the extended regexes, in the usual manner (i.e. via a "=~" match):
335
336 if ($input_text =~ $parser) {
337 ...
338 }
339
340 After a successful match, the variable "%/" will contain a series of
341 nested hashes representing the structured hierarchical data captured
342 during the parse.
343
344 Structure of a Regexp::Grammars grammar
345 A Regexp::Grammars specification consists of a start-pattern (which may
346 include both standard Perl 5.10 regex syntax, as well as special
347 Regexp::Grammars directives), followed by one or more rule or token
348 definitions.
349
350 For example:
351
352 use Regexp::Grammars;
353 my $balanced_brackets = qr{
354
355 # Start-pattern...
356 <paren_pair> | <brace_pair>
357
358 # Rule definition...
359 <rule: paren_pair>
360 \( (?: <escape> | <paren_pair> | <brace_pair> | [^()] )* \)
361
362 # Rule definition...
363 <rule: brace_pair>
364 \{ (?: <escape> | <paren_pair> | <brace_pair> | [^{}] )* \}
365
366 # Token definition...
367 <token: escape>
368 \\ .
369 }xms;
370
371 The start-pattern at the beginning of the grammar acts like the "top"
372 token of the grammar, and must be matched completely for the grammar to
373 match.
374
375 This pattern is treated like a token for whitespace matching behaviour
376 (see "Tokens vs rules (whitespace handling)"). That is, whitespace in
377 the start-pattern is treated like whitespace in any normal Perl regex.
378
379 The rules and tokens are declarations only and they are not directly
380 matched. Instead, they act like subroutines, and are invoked by name
381 from the initial pattern (or from within a rule or token).
382
383 Each rule or token extends from the directive that introduces it up to
384 either the next rule or token directive, or (in the case of the final
385 rule or token) to the end of the grammar.
386
387 Tokens vs rules (whitespace handling)
388 The difference between a token and a rule is that a token treats any
389 whitespace within it exactly as a normal Perl regular expression would.
390 That is, a sequence of whitespace in a token is ignored if the "/x"
391 modifier is in effect, or else matches the same literal sequence of
392 whitespace characters (if "/x" is not in effect).
393
394 In a rule, most sequences of whitespace are treated as matching the
395 implicit subrule "<.ws>", which is automatically predefined to match
396 optional whitespace (i.e. "\s*").
397
The exceptions to this behaviour are whitespace before a "|", before a
code block, or before an explicit space-matcher (such as "<ws>" or
"\s"), and whitespace at the very end of the rule.
401
402 In other words, a rule such as:
403
404 <rule: sentence> <noun> <verb>
405 | <verb> <noun>
406
407 is equivalent to a token with added non-capturing whitespace matching:
408
409 <token: sentence> <.ws> <noun> <.ws> <verb>
410 | <.ws> <verb> <.ws> <noun>
411
412 You can explicitly define a "<ws>" token to change that default
413 behaviour. For example, you could alter the definition of "whitespace"
414 to include Perlish comments, by adding an explicit "<token: ws>":
415
    <token: ws>
        (?: \s+ | \#[^\n]* )*
418
419 But be careful not to define "<ws>" as a rule, as this will lead to all
420 kinds of infinitely recursive unpleasantness.
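
The following sketch shows such a redefinition in a complete, if tiny,
grammar. The rule names "Pair", "Key" and "Val" and the input text are
assumptions for illustration only. Note that the "#" is
backslash-escaped, because under "/x" an unescaped "#" would start a
regex comment:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar in which "whitespace" also covers Perlish comments
my $parser = qr{
    <Pair>

    <rule: Pair>    <Key> = <Val>

    <token: Key>    [a-z]+

    <token: Val>    \d+

    # Override the implicit whitespace matcher used inside every rule
    <token: ws>     (?: \s+ | \# [^\n]* )*
}xms;

my $input = "name   # the key\n  =  # the separator\n  42";

my ($key, $val);
if ($input =~ $parser) {
    ($key, $val) = @{ $/{Pair} }{qw( Key Val )};
}

print "$key => $val\n" if defined $key;
```

Because "Pair" is a rule, each gap between its components is matched
by the redefined "<ws>", so the comments are skipped without being
mentioned anywhere in the rule body.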
421
422 Per-rule whitespace handling
423
424 Redefining the "<ws>" token changes its behaviour throughout the entire
425 grammar, within every rule definition. Usually that's appropriate, but
426 sometimes you need finer-grained control over whitespace handling.
427
428 So Regexp::Grammars provides the "<ws:>" directive, which allows you to
429 override the implicit whitespace-matches-whitespace behaviour only
430 within the current rule.
431
Note that this directive does not redefine "<ws>" within the rule; it
simply specifies what to replace each whitespace sequence with (instead
of replacing each with a "<ws>" call).
435
436 For example, if a language allows one kind of comment between
437 statements and another within statements, you could parse it with:
438
    <rule: program>
        # One type of comment between...
        <ws: (\s++ | \# .*? \n)* >

        # ...semicolon-separated statements...
        <[statement]>+ % ( ; )


    <rule: statement>
        # Another type of comment...
        <ws: (\s++ | \#{ .*? }\# )* >

        # ...between comma-separated commands...
        <cmd> <[arg]>+ % ( , )
453
454 Note that each directive only applies to the rule in which it is
455 specified. In every other rule in the grammar, whitespace would still
456 match the usual "<ws>" subrule.
457
458 Calling subrules
To invoke a rule to match at any point, just enclose the rule's name in
angle brackets (like in Perl 6). There must be no space between the
opening bracket and the rulename. For example:
462
463 qr{
464 file: # Match literal sequence 'f' 'i' 'l' 'e' ':'
465 <name> # Call <rule: name>
466 <options>? # Call <rule: options> (it's okay if it fails)
467
468 <rule: name>
469 # etc.
470 }x;
471
472 If you need to match a literal pattern that would otherwise look like a
473 subrule call, just backslash-escape the leading angle:
474
475 qr{
476 file: # Match literal sequence 'f' 'i' 'l' 'e' ':'
477 \<name> # Match literal sequence '<' 'n' 'a' 'm' 'e' '>'
478 <options>? # Call <rule: options> (it's okay if it fails)
479
480 <rule: name>
481 # etc.
482 }x;
483
484 Subrule results
485 If a subrule call successfully matches, the result of that match is a
486 reference to a hash. That hash reference is stored in the current
487 rule's own result-hash, under the name of the subrule that was invoked.
488 The hash will, in turn, contain the results of any more deeply nested
489 subrule calls, each stored under the name by which the nested subrule
490 was invoked.
491
492 In other words, if the rule "sentence" is defined:
493
494 <rule: sentence>
495 <noun> <verb> <object>
496
497 then successfully calling the rule:
498
499 <sentence>
500
501 causes a new hash entry at the current nesting level. That entry's key
502 will be 'sentence' and its value will be a reference to a hash, which
503 in turn will have keys: 'noun', 'verb', and 'object'.
504
505 In addition each result-hash has one extra key: the empty string. The
506 value for this key is whatever substring the entire subrule call
507 matched. This value is known as the context substring.
508
509 So, for example, a successful call to "<sentence>" might add something
510 like the following to the current result-hash:
511
512 sentence => {
513 "" => 'I saw a dog',
514 noun => 'I',
515 verb => 'saw',
516 object => {
517 "" => 'a dog',
518 article => 'a',
519 noun => 'dog',
520 },
521 }
522
523 Note, however, that if the result-hash at any level contains only the
524 empty-string key (i.e. the subrule did not call any sub-subrules or
525 save any of their nested result-hashes), then the hash is "unpacked"
526 and just the context substring itself is returned.
527
528 For example, if "<rule: sentence>" had been defined:
529
530 <rule: sentence>
531 I see dead people
532
533 then a successful call to the rule would only add:
534
535 sentence => 'I see dead people'
536
537 to the current result-hash.
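
The difference between the two kinds of result can be checked
directly. This is a minimal sketch under assumed names ("Pair", "Key",
"Val") and an assumed input; tokens are used so that no implicit
whitespace matching is involved:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: Pair calls subrules, Key and Val do not
my $parser = qr{
    <Pair>

    <token: Pair>   <Key> = <Val>

    <token: Key>    [a-z]+

    <token: Val>    \d+
}xms;

my %result;
%result = %/ if 'id=7' =~ $parser;

# Pair called subrules, so its result stays a hash that also carries
# the context substring under the empty-string key...
my $pair_is_hash = ref($result{Pair}) eq 'HASH';
my $pair_context = $result{Pair}{""};

# ...whereas Key and Val called no subrules, so each result is
# unpacked to the plain substring it matched
my $key_is_plain = !ref $result{Pair}{Key};
my $val_is_plain = !ref $result{Pair}{Val};

print "Pair matched '$pair_context'\n" if $pair_is_hash;
```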
538
539 This is a useful feature because it prevents a series of nested subrule
540 calls from producing very unwieldy data structures. For example,
541 without this automatic unpacking, even the simple earlier example:
542
543 <rule: sentence>
544 <noun> <verb> <object>
545
546 would produce something needlessly complex, such as:
547
548 sentence => {
549 "" => 'I saw a dog',
550 noun => {
551 "" => 'I',
552 },
553 verb => {
554 "" => 'saw',
555 },
556 object => {
557 "" => 'a dog',
558 article => {
559 "" => 'a',
560 },
561 noun => {
562 "" => 'dog',
563 },
564 },
565 }
566
567 Turning off the context substring
568
569 The context substring is convenient for debugging and for generating
570 error messages but, in a large grammar, or when parsing a long string,
571 the capture and storage of many nested substrings may quickly become
572 prohibitively expensive.
573
574 So Regexp::Grammars provides a directive to prevent context substrings
575 from being retained. Any rule or token that includes the directive
576 "<nocontext:>" anywhere in the rule's body will not retain any context
577 substring it matches...unless that substring would be the only entry in
578 its result hash (which only happens within objrules and objtokens).
579
580 If a "<nocontext:>" directive appears before the first rule or token
581 definition (i.e. as part of the main pattern), then the entire grammar
582 will discard all context substrings from every one of its rules and
583 tokens.
584
585 However, you can override this universal prohibition with a second
586 directive: "<context:>". If this directive appears in any rule or
587 token, that rule or token will save its context substring, even if a
588 global "<nocontext:>" is in effect.
589
590 This means that this grammar:
591
592 qr{
593 <Command>
594
595 <rule: Command>
596 <nocontext:>
597 <Keyword> <arg=(\S+)>+ % <.ws>
598
599 <token: Keyword>
600 <Move> | <Copy> | <Delete>
601
602 # etc.
603 }x
604
605 and this grammar:
606
607 qr{
608 <nocontext:>
609 <Command>
610
611 <rule: Command>
612 <Keyword> <arg=(\S+)>+ % <.ws>
613
614 <token: Keyword>
615 <context:>
616 <Move> | <Copy> | <Delete>
617
618 # etc.
619 }x
620
621 will behave identically (saving context substrings for keywords, but
622 not for commands), except that the first version will also retain the
623 global context substring (i.e. $/{""}), whereas the second version will
624 not.
625
626 Note that "<context:>" and "<nocontext:>" have no effect on, or even
627 any interaction with, the various result distillation mechanisms, which
628 continue to work in the usual way when either or both of the directives
629 is used.
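
A quick way to see the effect of a global "<nocontext:>" is to list
the keys that remain after a match. The grammar below is a sketch
only; its rule names and input are assumptions for illustration:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar with context substrings switched off globally
my $parser = qr{
    <nocontext:>
    <Pair>

    <token: Pair>   <Key> = <Val>

    <token: Key>    [a-z]+

    <token: Val>    \d+
}xms;

my %result;
%result = %/ if 'id=7' =~ $parser;

# With <nocontext:> in the start-pattern, neither the top-level hash
# nor the Pair hash should carry an empty-string key
my $top_has_context  = exists $result{""}       ? 1 : 0;
my $pair_has_context = exists $result{Pair}{""} ? 1 : 0;

print "top-level keys: @{[ sort keys %result ]}\n";
print "Pair keys:      @{[ sort keys %{ $result{Pair} } ]}\n";
```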
630
631 Renaming subrule results
632 It is not always convenient to have subrule results stored under the
633 same name as the rule itself. Rule names should be optimized for
634 understanding the behaviour of the parser, whereas result names should
635 be optimized for understanding the structure of the data. Often those
636 two goals are identical, but not always; sometimes rule names need to
637 describe what the data looks like, while result names need to describe
638 what the data means.
639
For example, sometimes you need to call the same rule twice, to match
two syntactically identical components whose positions give them
semantically distinct meanings:
643
644 <rule: copy_cmd>
645 copy <file> <file>
646
647 The problem here is that, if the second call to "<file>" succeeds, its
648 result-hash will be stored under the key 'file', clobbering the data
649 that was returned from the first call to "<file>".
650
651 To avoid such problems, Regexp::Grammars allows you to alias any
652 subrule call, so that it is still invoked by the original name, but its
653 result-hash is stored under a different key. The syntax for that is:
654 "<alias=rulename>". For example:
655
656 <rule: copy_cmd>
657 copy <from=file> <to=file>
658
659 Here, "<rule: file>" is called twice, with the first result-hash being
660 stored under the key 'from', and the second result-hash being stored
661 under the key 'to'.
662
663 Note, however, that the alias before the "=" must be a proper
664 identifier (i.e. a letter or underscore, followed by letters, digits,
665 and/or underscores). Aliases that start with an underscore and aliases
666 named "MATCH" have special meaning (see "Private subrule calls" and
667 "Result distillation" respectively).
668
669 Aliases can also be useful for normalizing data that may appear in
670 different formats and sequences. For example:
671
672 <rule: copy_cmd>
673 copy <from=file> <to=file>
674 | dup <to=file> as <from=file>
675 | <from=file> -> <to=file>
676 | <to=file> <- <from=file>
677
678 Here, regardless of which order the old and new files are specified,
679 the result-hash always gets:
680
681 copy_cmd => {
682 from => 'oldfile',
683 to => 'newfile',
684 }
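
The normalizing effect can be sketched with just two of those
alternatives. The "file" token definition, the helper subroutine
"parse_copy" and the sample commands are assumptions added for this
example:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: two surface syntaxes, one result layout
my $parser = qr{
    <copy_cmd>

    <rule: copy_cmd>
          copy <from=file> <to=file>
        | dup  <to=file> as <from=file>

    <token: file>
        [\w.]+
}xms;

# Illustrative helper: return a copy of the copy_cmd result-hash,
# or undef if the text does not parse
sub parse_copy {
    my ($text) = @_;
    return $text =~ $parser ? { %{ $/{copy_cmd} } } : undef;
}

my $first  = parse_copy('copy a.txt b.txt');
my $second = parse_copy('dup b.txt as a.txt');

for my $cmd ($first, $second) {
    print "from $cmd->{from} to $cmd->{to}\n" if $cmd;
}
```

Both commands name the files in opposite orders, but code that
consumes the result only ever needs to look at 'from' and 'to'.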
685
686 List-like subrule calls
687 If a subrule call is quantified with a repetition specifier:
688
689 <rule: file_sequence>
690 <file>+
691
692 then each repeated match overwrites the corresponding entry in the
693 surrounding rule's result-hash, so only the result of the final
694 repetition will be retained. That is, if the above example matched the
695 string "foo.pl bar.py baz.php", then the result-hash would contain:
696
    file_sequence => {
698 "" => 'foo.pl bar.py baz.php',
699 file => 'baz.php',
700 }
701
702 Usually, that's not the desired outcome, so Regexp::Grammars provides
703 another mechanism by which to call a subrule; one that saves all
704 repetitions of its results.
705
706 A regular subrule call consists of the rule's name surrounded by angle
707 brackets. If, instead, you surround the rule's name with "<[...]>"
708 (angle and square brackets) like so:
709
710 <rule: file_sequence>
711 <[file]>+
712
713 then the rule is invoked in exactly the same way, but the result of
714 that submatch is pushed onto an array nested inside the appropriate
715 result-hash entry. In other words, if the above example matched the
716 same "foo.pl bar.py baz.php" string, the result-hash would contain:
717
    file_sequence => {
719 "" => 'foo.pl bar.py baz.php',
720 file => [ 'foo.pl', 'bar.py', 'baz.php' ],
721 }
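
Here is a runnable sketch of a listifying call. To keep the whitespace
handling explicit it departs from the example above in one respect: it
declares "file_sequence" as a token and separates the repetitions with
"% <.ws>", rather than relying on a rule's implicit whitespace. The
"file" token definition is likewise an assumption for illustration:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: collect every file name, not just the last
my $parser = qr{
    <file_sequence>

    <token: file_sequence>
        <[file]>+ % <.ws>

    <token: file>
        [\w.]+
}xms;

my @files;
if ('foo.pl bar.py baz.php' =~ $parser) {
    # The 'file' entry is an array reference, one element per repetition
    @files = @{ $/{file_sequence}{file} };
}

print "$_\n" for @files;
```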
722
723 This "listifying subrule call" can also be useful for non-repeated
724 subrule calls, if the same subrule is invoked in several places in a
725 grammar. For example if a cmdline option could be given either one or
726 two values, you might parse it:
727
728 <rule: size_option>
729 -size <[size]> (?: x <[size]> )?
730
731 The result-hash entry for 'size' would then always contain an array,
732 with either one or two elements, depending on the input being parsed.
733
734 Listifying subrules can also be given aliases, just like ordinary
735 subrules. The alias is always specified inside the square brackets:
736
737 <rule: size_option>
738 -size <[size=pos_integer]> (?: x <[size=pos_integer]> )?
739
740 Here, the sizes are parsed using the "pos_integer" rule, but saved in
741 the result-hash in an array under the key 'size'.
742
743 Parametric subrules
When a subrule is invoked, it can be passed a set of named arguments
(specified as "key=>value" pairs). This argument list is placed in a
normal Perl regex code block and must appear immediately after the
subrule name, before the closing angle bracket.
748
749 Within the subrule that has been invoked, the arguments can be accessed
750 via the special hash %ARG. For example:
751
752 <rule: block>
753 <tag>
754 <[block]>*
755 <end_tag(?{ tag=>$MATCH{tag} })> # ...call subrule with argument
756
757 <token: end_tag>
758 end_ (??{ quotemeta $ARG{tag} })
759
760 Here the "block" rule first matches a "<tag>", and the corresponding
761 substring is saved in $MATCH{tag}. It then matches any number of nested
762 blocks. Finally it invokes the "<end_tag>" subrule, passing it an
763 argument whose name is 'tag' and whose value is the current value of
764 $MATCH{tag} (i.e. the original opening tag).
765
766 When it is thus invoked, the "end_tag" token first matches 'end_', then
767 interpolates the literal value of the 'tag' argument and attempts to
768 match it.
769
770 Any number of named arguments can be passed when a subrule is invoked.
771 For example, we could generalize the "end_tag" rule to allow any prefix
772 (not just 'end_'), and also to allow for 'if...fi'-style reversed tags,
773 like so:
774
775 <rule: block>
776 <tag>
777 <[block]>*
778 <end_tag (?{ prefix=>'end', tag=>$MATCH{tag} })>
779
780 <token: end_tag>
781 (??{ $ARG{prefix} // q{(?!)} }) # ...prefix as pattern
782 (??{ quotemeta $ARG{tag} }) # ...tag as literal
783 |
784 (??{ quotemeta reverse $ARG{tag} }) # ...reversed tag
785
786 Note that, if you do not need to interpolate values (such as
787 $MATCH{tag}) into a subrule's argument list, you can use simple
788 parentheses instead of "(?{...})", like so:
789
790 <end_tag( prefix=>'end', tag=>'head' )>
791
792 The only types of values you can use in this simplified syntax are
793 numbers and single-quote-delimited strings. For anything more complex,
794 put the argument list in a full "(?{...})".
795
As the earlier examples show, the single most common type of argument
is one of the form: "IDENTIFIER => $MATCH{IDENTIFIER}". That is, it's
a common requirement to pass an element of %MATCH into a subrule, named
with its own key.
800
801 Because this is such a common usage, Regexp::Grammars provides a
802 shortcut. If you use simple parentheses (instead of "(?{...})"
803 parentheses) then instead of a pair, you can specify an argument using
804 a colon followed by an identifier. This argument is replaced by a
805 named argument whose name is the identifier and whose value is the
806 corresponding item from %MATCH. So, for example, instead of:
807
808 <end_tag(?{ prefix=>'end', tag=>$MATCH{tag} })>
809
810 you can just write:
811
812 <end_tag( prefix=>'end', :tag )>
813
814 Note that, from Perl 5.20 onwards, due to changes in the way that Perl
815 parses regexes, Regexp::Grammars does not support explicitly passing
816 elements of %MATCH as argument values within a list subrule (yeah, it's
817 a very specific and obscure edge-case):
818
819 <[end_tag(?{ prefix=>'end', tag=>$MATCH{tag} })]> # Does not work
820
821 Note, however, that the shortcut:
822
823 <[end_tag( prefix=>'end', :tag )]>
824
825 still works correctly.
826
827 Accessing subrule arguments more cleanly
828
829 As the preceding examples illustrate, using subrule arguments
830 effectively generally requires the use of run-time interpolated
831 subpatterns via the "(??{...})" construct.
832
833 This produces ugly rule bodies such as:
834
835 <token: end_tag>
836 (??{ $ARG{prefix} // q{(?!)} }) # ...prefix as pattern
837 (??{ quotemeta $ARG{tag} }) # ...tag as literal
838 |
839 (??{ quotemeta reverse $ARG{tag} }) # ...reversed tag
840
841 To simplify these common usages, Regexp::Grammars provides three
842 convenience constructs.
843
A subrule call of the form "<:identifier>" is equivalent to:
845
846 (??{ $ARG{'identifier'} // q{(?!)} })
847
848 Namely: "Match the contents of $ARG{'identifier'}, treating those
849 contents as a pattern."
850
A subrule call of the form "<\:identifier>" (that is: a matchref with
a colon after the backslash) is equivalent to:
853
854 (??{ defined $ARG{'identifier'}
855 ? quotemeta($ARG{'identifier'})
856 : '(?!)'
857 })
858
859 Namely: "Match the contents of $ARG{'identifier'}, treating those
860 contents as a literal."
861
A subrule call of the form "</:identifier>" (that is: an invertref
with a colon after the forward slash) is equivalent to:
864
865 (??{ defined $ARG{'identifier'}
866 ? quotemeta(reverse $ARG{'identifier'})
867 : '(?!)'
868 })
869
870 Namely: "Match the closing delimiter corresponding to the contents of
871 $ARG{'identifier'}, as if it were a literal".
872
The availability of these three constructs means that we could rewrite
the above "<end_tag>" token much more cleanly as:
875
876 <token: end_tag>
877 <:prefix> # ...prefix as pattern
878 <\:tag> # ...tag as a literal
879 |
880 </:tag> # ...reversed tag
881
In general these constructs mean that, within a subrule, if you want to
match an argument passed to that subrule, you use "<:ARGNAME>" (to
match the argument as a pattern) or "<\:ARGNAME>" (to match the
argument as a literal).
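
Putting the argument-passing shortcut and the literal-match construct
together, a complete grammar might look like the following sketch. The
"body" alias, the sample inputs and the "\A ... \z" anchoring of the
start-pattern are assumptions chosen to keep the example small:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: a block must be closed by end_ plus its own tag
my $parser = qr{
    \A <block> \z

    <token: block>
        <tag> \s+ <body=(\w+)> \s+ <end_tag( :tag )>

    <token: tag>
        [a-z]+

    <token: end_tag>
        end_ <\:tag>       # match the 'tag' argument as a literal
}xms;

# A closing tag that agrees with the opening tag should parse...
my $good_ok  = 'loop work end_loop' =~ $parser ? 1 : 0;
my $good_end = $good_ok ? $/{block}{end_tag} : undef;

# ...and a mismatched closing tag should not
my $bad_ok   = 'loop work end_if' =~ $parser ? 1 : 0;

print "matched closing tag: $good_end\n" if $good_ok;
```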
886
887 Note the consistent mnemonic in these various subrule-like
888 interpolations of named arguments: the name is always prefixed by a
889 colon.
890
In other words, the "<:ARGNAME>" form works just like a "<RULENAME>",
except that the leading colon tells Regexp::Grammars to use the
contents of $ARG{'ARGNAME'} as the subpattern, instead of the contents
of "(?&RULENAME)".
895
896 Likewise, the "<\:ARGNAME>" and "</:ARGNAME>" constructs work exactly
897 like "<\_MATCHNAME>" and "</INVERTNAME>" respectively, except that the
898 leading colon indicates that the matchref or invertref should be taken
899 from %ARG instead of from %MATCH.
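The three equivalences above can be tried out in plain Perl, without the module. This is only a sketch: the lexical hash %arg is a hypothetical stand-in for the %ARG hash that Regexp::Grammars supplies inside a subrule, and the reversed form follows the simplified "reverse" expansion quoted above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for %ARG (an ordinary lexical hash, for illustration).
my %arg = ( sep => '.', tag => 'a+b' );

# Like <:sep> : use the argument as a pattern ('.' matches any character).
my $as_pattern = qr{ \A x (??{ $arg{sep} // q{(?!)} }) y \z }x;

# Like <\:sep> : use the argument as a literal string.
my $as_literal = qr{
    \A x (??{ defined $arg{sep} ? quotemeta($arg{sep}) : '(?!)' }) y \z
}x;

# Like </:tag> as expanded above: the reversed argument, as a literal.
my $as_reversed = qr{
    \A (??{ defined $arg{tag} ? quotemeta(reverse $arg{tag}) : '(?!)' }) \z
}x;

# A missing argument produces (?!), which can never match.
my $missing = qr{ \A (??{ $arg{no_such_arg} // q{(?!)} }) }x;

my $pattern_hit  = 'x-y' =~ $as_pattern  ? 1 : 0;   # '.' as a pattern matches '-'
my $literal_miss = 'x-y' =~ $as_literal  ? 1 : 0;   # a literal '.' does not
my $literal_hit  = 'x.y' =~ $as_literal  ? 1 : 0;
my $reversed_hit = 'b+a' =~ $as_reversed ? 1 : 0;   # 'a+b' reversed, '+' literal
my $missing_hit  = 'any' =~ $missing     ? 1 : 0;

print "$pattern_hit $literal_miss $literal_hit $reversed_hit $missing_hit\n";
```

The patterns use brace delimiters so that the "//" operator inside the code blocks is not mistaken for the end of the regex.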
900
901 Pseudo-subrules
902 Aliases can also be given to standard Perl subpatterns, as well as to
903 code blocks within a regex. The syntax for subpatterns is:
904
905 <ALIAS= (SUBPATTERN) >
906
907 In other words, the syntax is exactly like an aliased subrule call,
908 except that the rule name is replaced with a set of parentheses
909 containing the subpattern. Any parentheses--capturing or
910 non-capturing--will do.
911
912 The effect of aliasing a standard subpattern is to cause whatever that
913 subpattern matches to be saved in the result-hash, using the alias as
914 its key. For example:
915
916 <rule: file_command>
917
918 <cmd=(mv|cp|ln)> <from=file> <to=file>
919
920 Here, the "<cmd=(mv|cp|ln)>" is treated exactly like a regular
921 "(mv|cp|ln)", but whatever substring it matches is saved in the result-
922 hash under the key 'cmd'.
923
924 The syntax for aliasing code blocks is:
925
926 <ALIAS= (?{ your($code->here) }) >
927
928 Note, however, that the code block must be specified in the standard
929 Perl 5.10 regex notation: "(?{...})". A common mistake is to write:
930
931 <ALIAS= { your($code->here) } >
932
933 instead, which will attempt to interpolate $code before the regex is
934 even compiled, as such variables are only "protected" from
935 interpolation inside a "(?{...})".
936
937 When correctly specified, this construct executes the code in the block
938 and saves the result of that execution in the result-hash, using the
939 alias as its key. Aliased code blocks are useful for adding semantic
940 information based on which branch of a rule is executed. For example,
941 consider the "copy_cmd" alternatives shown earlier:
942
943 <rule: copy_cmd>
944 copy <from=file> <to=file>
945 | dup <to=file> as <from=file>
946 | <from=file> -> <to=file>
947 | <to=file> <- <from=file>
948
949 Using aliased code blocks, you could add an extra field to the result-
950 hash to describe which form of the command was detected, like so:
951
952 <rule: copy_cmd>
953 copy <from=file> <to=file> <type=(?{ 'std' })>
954 | dup <to=file> as <from=file> <type=(?{ 'rev' })>
955 | <from=file> -> <to=file> <type=(?{ +1 })>
956 | <to=file> <- <from=file> <type=(?{ -1 })>
957
958 Now, if the rule matched, the result-hash would contain something like:
959
960 copy_cmd => {
961 from => 'oldfile',
962 to => 'newfile',
963 type => 'std',
964 }
965
966 Note that, in addition to the semantics described above, aliased
967 subpatterns and code blocks also become visible to Regexp::Grammars'
968 integrated debugger (see Debugging).
969
970 Aliased literals
971 As the previous example illustrates, it is inconveniently verbose to
972 assign constants via aliased code blocks. So Regexp::Grammars provides
973 a short-cut. It is possible to directly alias a numeric literal or a
974 single-quote delimited literal string, without putting either inside a
975 code block. For example, the previous example could also be written:
976
977 <rule: copy_cmd>
978 copy <from=file> <to=file> <type='std'>
979 | dup <to=file> as <from=file> <type='rev'>
980 | <from=file> -> <to=file> <type= +1 >
981 | <to=file> <- <from=file> <type= -1 >
982
983 Note that only these two forms of literal are supported in this
984 abbreviated syntax.
985
986 Amnesiac subrule calls
987 By default, every subrule call saves its result into the result-hash,
988 either under its own name, or under an alias.
989
990 However, sometimes you may want to refactor some literal part of a rule
991 into one or more subrules, without having those submatches added to the
992 result-hash. The syntax for calling a subrule, but ignoring its return
993 value is:
994
995 <.SUBRULE>
996
997 (which is stolen directly from Perl 6).
998
999 For example, you may prefer to rewrite a rule such as:
1000
1001 <rule: paren_pair>
1002
1003 \(
1004 (?: <escape> | <paren_pair> | <brace_pair> | [^()] )*
1005 \)
1006
1007 without any literal matching, like so:
1008
1009 <rule: paren_pair>
1010
1011 <.left_paren>
1012 (?: <escape> | <paren_pair> | <brace_pair> | <.non_paren> )*
1013 <.right_paren>
1014
1015 <token: left_paren> \(
1016 <token: right_paren> \)
1017 <token: non_paren> [^()]
1018
1019 Moreover, as the individual components inside the parentheses probably
1020 aren't being captured for any useful purpose either, you could further
1021 optimize that to:
1022
1023 <rule: paren_pair>
1024
1025 <.left_paren>
1026 (?: <.escape> | <.paren_pair> | <.brace_pair> | <.non_paren> )*
1027 <.right_paren>
1028
1029 Note that you can also use the dot modifier on an aliased subpattern:
1030
1031 <.Alias= (SUBPATTERN) >
1032
1033 This seemingly contradictory behaviour (of giving a subpattern a name,
1034 then deliberately ignoring that name) actually does make sense in one
1035 situation. Providing the alias makes the subpattern visible to the
1036 debugger, while using the dot stops it from affecting the result-hash.
1037 See "Debugging non-grammars" for an example of this usage.
1038
1039 Private subrule calls
1040 If a rule name (or an alias) begins with an underscore:
1041
1042 <_RULENAME> <_ALIAS=RULENAME>
1043 <[_RULENAME]> <[_ALIAS=RULENAME]>
1044
1045 then matching proceeds as normal, and any result that is returned is
1046 stored in the current result-hash in the usual way.
1047
1048 However, when any rule finishes (and just before it returns) it first
1049 filters its result-hash, removing any entries whose keys begin with an
1050 underscore. This means that any subrule with an underscored name (or
1051 with an underscored alias) remembers its result, but only until the end
1052 of the current rule. Its results are effectively private to the current
1053 rule.
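The filtering step described above amounts to deleting every underscore-prefixed key just before the rule returns. A minimal plain-Perl sketch of that idea follows; the hash contents are invented for illustration, and this is not the module's actual code.

```perl
use strict;
use warnings;

# A made-up result-hash, as it might look just before a rule returns.
my %result = (
    name     => 'x',
    value    => 42,
    _scratch => 'tmp',       # stored by a hypothetical <_scratch> call
    _parts   => [ 1, 2 ],    # stored by a hypothetical <[_parts]> call
);

# Drop every entry whose key begins with an underscore.
delete @result{ grep { /\A_/ } keys %result };

my @kept = sort keys %result;
print "@kept\n";
```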
1054
1055 This is especially useful in conjunction with result distillation.
1056
1057 Lookahead (zero-width) subrules
1058 Non-capturing subrule calls can be used in normal lookaheads:
1059
1060 <rule: qualified_typename>
1061 # A valid typename and has a :: in it...
1062 (?= <.typename> ) [^\s:]+ :: \S+
1063
1064 <rule: identifier>
1065 # An alpha followed by alnums (but not a valid typename)...
1066 (?! <.typename> ) [^\W\d]\w*
1067
1068 but the syntax is a little unwieldy. More importantly, an internal
1069 problem with backtracking causes positive lookaheads to mess up the
1070 module's named capturing mechanism.
1071
1072 So Regexp::Grammars provides two shorthands:
1073
1074 <!typename> same as: (?! <.typename> )
1075 <?typename> same as: (?= <.typename> ) ...but works correctly!
1076
1077 These two constructs can also be called with arguments, if necessary:
1078
1079 <rule: Command>
1080 <Keyword>
1081 (?:
1082 <!Terminator(:Keyword)> <Args=(\S+)>
1083 )?
1084 <Terminator(:Keyword)>
1085
1086 Note that, as the above equivalences imply, neither of these forms of a
1087 subrule call ever captures what it matches.
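For comparison, the negative-lookahead form shown at the start of this section can be written in core Perl using a named subpattern and "(?&name)". This is a sketch only: the three typenames are invented for illustration, and the "\b" is added so that identifiers which merely begin with a typename are still accepted.

```perl
use strict;
use warnings;

my $identifier = qr{
    \A
    (?! (?&typename) \b )     # not a complete typename
    [^\W\d] \w*               # an alpha followed by alnums
    \z

    (?(DEFINE)
        (?<typename> int | char | float )
    )
}x;

my $plain_ok    = 'count'   =~ $identifier ? 1 : 0;
my $typename_ok = 'int'     =~ $identifier ? 1 : 0;
my $prefixed_ok = 'integer' =~ $identifier ? 1 : 0;

print "$plain_ok $typename_ok $prefixed_ok\n";
```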
1088
1089 Matching separated lists
1090 One of the commonest tasks in text parsing is to match a list of
1091 unspecified length, in which items are separated by a fixed token.
1092 Things like:
1093
1094 1, 2, 3 , 4 ,13, 91 # Numbers separated by commas and spaces
1095
1096 g-c-a-g-t-t-a-c-a # DNA bases separated by dashes
1097
1098 /usr/local/bin # Names separated by directory markers
1099
1100 /usr:/usr/local:bin # Directories separated by colons
1101
1102 The usual construct required to parse these kinds of structures is
1103 either:
1104
1105 <rule: list>
1106
1107 <item> <separator> <list> # recursive definition
1108 | <item> # base case
1109
1110 or, if you want to allow zero-or-more items instead of requiring one-
1111 or-more:
1112
1113 <rule: list_opt>
1114 <list>? # entire list may be missing
1115
1116 <rule: list> # as before...
1117 <item> <separator> <list> # recursive definition
1118 | <item> # base case
1119
1120 Or, more efficiently, but less prettily:
1121
1122 <rule: list>
1123 <[item]> (?: <separator> <[item]> )* # one-or-more
1124
1125 <rule: list_opt>
1126 (?: <[item]> (?: <separator> <[item]> )* )? # zero-or-more
1127
1128 Because separated lists are such a common component of grammars,
1129 Regexp::Grammars provides cleaner ways to specify them:
1130
1131 <rule: list>
1132 <[item]>+ % <separator> # one-or-more
1133
1134 <rule: list_zom>
1135 <[item]>* % <separator> # zero-or-more
1136
1137 Note that these are just regular repetition qualifiers (i.e. "+" and
1138 "*") applied to a subrule ("<[item]>"), with a "%" modifier after them
1139 to specify the required separator between the repeated matches.
1140
1141 The number of repetitions matched is controlled both by the nature of
1142 the qualifier ("+" vs "*") and by the subrule specified after the "%".
1143 The qualified subrule will be repeatedly matched for as long as its
1144 qualifier allows, provided that the second subrule also matches between
1145 those repetitions.
1146
1147 For example, you can match a parenthesized sequence of one-or-more
1148 numbers separated by commas, such as:
1149
1150 (1, 2, 3, 4, 13, 91) # Numbers separated by commas (and spaces)
1151
1152 with:
1153
1154 <rule: number_list>
1155
1156 \( <[number]>+ % <comma> \)
1157
1158 <token: number> \d+
1159 <token: comma> ,
1160
1161 Note that any spaces round the commas will be ignored because
1162 "<number_list>" is specified as a rule and the "+%" specifier has
1163 spaces within and around it. To disallow spaces around the commas, make
1164 sure there are no spaces in or around the "+%":
1165
1166 <rule: number_list_no_spaces>
1167
1168 \( <[number]>+%<comma> \)
1169
1170 (or else specify the rule as a token instead).
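To make the desugaring concrete, here is roughly what the "<[number]>+ % <comma>" example corresponds to when written by hand in core Perl. It is a sketch under stated assumptions: $number and $comma are illustrative stand-ins for the subrules, and the whitespace that a rule would skip implicitly is spelled out explicitly.

```perl
use strict;
use warnings;

# Illustrative stand-ins for the <number> and <comma> subrules.
my $number = qr/\d+/;
my $comma  = qr/\s*,\s*/;    # optional spaces around the comma, made explicit

# Hand-written form of:  \( <[number]>+ % <comma> \)
my $number_list = qr/\A \( \s* ( $number (?: $comma $number )* ) \s* \) \z/x;

my @numbers;
if ( '(1, 2, 3 , 4 ,13, 91)' =~ $number_list ) {
    # Collect the items, much as <[number]> would collect them in a list.
    @numbers = split $comma, $1;
}

print "@numbers\n";
```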
1171
1172 Because the "%" is a modifier applied to a qualifier, you can modify
1173 any other repetition qualifier in the same way. For example:
1174
1175 <[item]>{2,4} % <sep> # two-to-four items, separated
1176
1177 <[item]>{7} % <sep> # exactly 7 items, separated
1178
1179 <[item]>{10,}? % <sep> # 10 or more items (minimal match), separated
1180
1181 You can even do this:
1182
1183 <[item]>? % <sep> # one-or-zero items, (theoretically) separated
1184
1185 though the separator specification is, of course, meaningless in that
1186 case as it will never be needed to separate a maximum of one item.
1187
1188 If a "%" appears anywhere else in a grammar (i.e. not immediately after
1189 a repetition qualifier), it is treated normally (i.e. as a self-
1190 matching literal character):
1191
1192 <token: perl_hash>
1193 % <ident> # match "%foo", "%bar", etc.
1194
1195 <token: perl_mod>
1196 <expr> % <expr> # match "$n % 2", "($n+3) % ($n-1)", etc.
1197
1198 If you need to match a literal "%" immediately after a repetition,
1199 either quote it:
1200
1201 <token: percentage>
1202 \d{1,3} \% solution # match "7% solution", etc.
1203
1204 or refactor the "%" character:
1205
1206 <token: percentage>
1207 \d{1,3} <percent_sign> solution # match "7% solution", etc.
1208
1209 <token: percent_sign>
1210 %
1211
1212 Note that it's usually necessary to use the "<[...]>" form for the
1213 repeated items being matched, so that all of them are saved in the
1214 result hash. You can also save all the separators (if they're
1215 important) by specifying them as a list-like subrule too:
1216
1217 \( <[number]>* % <[comma]> \) # save numbers *and* separators
1218
1219 The repeated item must be specified as a subrule call of some kind
1220 (i.e. in angles), but the separators may be specified either as a
1221 subrule or as a raw bracketed pattern. For example:
1222
1223 <[number]>* % ( , | : ) # Numbers separated by commas or colons
1224
1225 <[number]>* % [,:] # Same, but more efficiently matched
1226
1227 The separator should always be specified within matched delimiters of
1228 some kind: either matching "<...>" or matching "(...)" or matching
1229 "[...]". Simple, non-bracketed separators will sometimes also work:
1230
1231 <[number]>+ % ,
1232
1233 but not always:
1234
1235 <[number]>+ % ,\s+ # Oops! Separator is just: ,
1236
1237 This is because of the limited way in which the module internally
1238 parses ordinary regex components (i.e. without full understanding of
1239 their implicit precedence). As a consequence, consistently placing
1240 brackets around any separator is a much safer approach:
1241
1242 <[number]>+ % (,\s+)
1243
1244 You can also use a simple pattern on the left of the "%" as the item
1245 matcher, but in this case it must always be aliased into a list-
1246 collecting subrule, like so:
1247
1248 <[item=(\d+)]>* % [,]
1249
1250 Note that, for backwards compatibility with earlier versions of
1251 Regexp::Grammars, the "+%" operator can also be written: "**".
1252 However, there can be no space between the two asterisks of this
1253 variant. That is:
1254
1255 <[item]> ** <sep> # same as <[item]>+ % <sep>
1256
1257 <[item]>* * <sep> # error (two * qualifiers in a row)
1258
1259 Matching hash keys
1260 In some situations a grammar may need a rule that matches dozens,
1261 hundreds, or even thousands of one-word alternatives. For example, when
1262 matching command names, or valid userids, or English words. In such
1263 cases it is often impractical (and always inefficient) to list all the
1264 alternatives between "|" alternators:
1265
1266 <rule: shell_cmd>
1267 a2p | ac | apply | ar | automake | awk | ...
1268 # ...and 400 lines later
1269 ... | zdiff | zgrep | zip | zmore | zsh
1270
1271 <rule: valid_word>
1272 a | aa | aal | aalii | aam | aardvark | aardwolf | aba | ...
1273 # ...and 40,000 lines later...
1274 ... | zymotize | zymotoxic | zymurgy | zythem | zythum
1275
1276 To simplify such cases, Regexp::Grammars provides a special construct
1277 that allows you to specify all the alternatives as the keys of a normal
1278 hash. The syntax for that construct is simply to put the hash name
1279 inside angle brackets (with no space between the angles and the hash
1280 name).
1281
1282 Which means that the rules in the previous example could also be
1283 written:
1284
1285 <rule: shell_cmd>
1286 <%cmds>
1287
1288 <rule: valid_word>
1289 <%dict>
1290
1291 provided that the two hashes (%cmds and %dict) are visible in the scope
1292 where the grammar is created.
1293
1294 Matching a hash key in this way is typically significantly faster than
1295 matching a large set of alternations. Specifically, it is O(length of
1296 longest potential key) ^ 2, instead of O(number of keys).
1297
1298 Internally, the construct is converted to something equivalent to:
1299
1300 <rule: shell_cmd>
1301 (<.hk>) <require: (?{ exists $cmds{$CAPTURE} })>
1302
1303 <rule: valid_word>
1304 (<.hk>) <require: (?{ exists $dict{$CAPTURE} })>
1305
1306 The special "<hk>" rule is created automatically, and defaults to
1307 "\S+", but you can also define it explicitly to handle other kinds of
1308 keys. For example:
1309
1310 <rule: hk>
1311 [^\n]+ # Key may be any number of chars on a single line
1312
1313 <rule: hk>
1314 [ACGT]{10,} # Key is a base sequence of at least 10 pairs
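The "match a candidate key, then require that it exists in the hash" idea behind the conversion shown above can be approximated in core Perl with a conditional code block. This is only a sketch: the contents of %cmds are invented, and "\S+" plays the role of the default "<hk>" pattern.

```perl
use strict;
use warnings;

my %cmds = map { $_ => 1 } qw( awk zip zsh );

my $shell_cmd = qr{
    \A
    ( \S+ )                              # candidate key
    (?(?{ !exists $cmds{$1} }) (?!) )    # fail unless it is a key of the hash
    \z
}x;

my $known_ok   = 'zip'    =~ $shell_cmd ? 1 : 0;
my $unknown_ok = 'zap'    =~ $shell_cmd ? 1 : 0;
my $longer_ok  = 'zipper' =~ $shell_cmd ? 1 : 0;

print "$known_ok $unknown_ok $longer_ok\n";
```

Each attempt costs one hash lookup per candidate substring rather than one comparison per alternative, which is why the number of keys does not affect the matching time.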
1315
1316 Alternatively, you can specify a different key-matching pattern for
1317 each hash you're matching, by placing the required pattern in braces
1318 immediately after the hash name. For example:
1319
1320 <rule: client_name>
1321 # Valid keys match <.hk> (default or explicitly specified)
1322 <%clients>
1323
1324 <rule: shell_cmd>
1325 # Valid keys contain only word chars, hyphen, slash, or dot...
1326 <%cmds { [\w-/.]+ }>
1327
1328 <rule: valid_word>
1329 # Valid keys contain only alphas or internal hyphen or apostrophe...
1330 <%dict{ (?i: (?:[a-z]+[-'])* [a-z]+ ) }>
1331
1332 <rule: DNA_sequence>
1333 # Valid keys are base sequences of at least 10 pairs...
1334 <%sequences{[ACGT]{10,}}>
1335
1336 This second approach to key-matching is preferred, because it localizes
1337 any non-standard key-matching behaviour to each individual hash.
1338
1339 Note that changes in the compilation process from Perl 5.18 onwards
1340 mean that in some cases the "<%hash>" construct only works reliably if
1341 the hash itself is declared at the outermost lexical scope (i.e. file
1342 scope).
1343
1344 Specifically, if the regex grammar does not include any interpolated
1345 scalars or arrays and the hash was declared within a subroutine (even
1346 within the same subroutine as the regex grammar that uses it), the
1347 regex will not be able to "see" the hash variable at compile-time. This
1348 will produce a "Global symbol "%hash" requires explicit package name"
1349 compile-time error. For example:
1350
1351 sub build_keyword_parser {
1352 # Hash declared inside subroutine...
1353 my %keywords = (foo => 1, bar => 1);
1354
1355 # ...then used in <%hash> construct within uninterpolated regex...
1356 return qr{
1357 ^<keyword>$
1358 <rule: keyword> <%keywords>
1359 }x;
1360
1361 # ...produces compile-time error
1362 }
1363
1364 The solution is to place the hash outside the subroutine containing the
1365 grammar:
1366
1367 # Hash declared OUTSIDE subroutine...
1368 my %keywords = (foo => 1, bar => 1);
1369
1370 sub build_keyword_parser {
1371 return qr{
1372 ^<keyword>$
1373 <rule: keyword> <%keywords>
1374 }x;
1375 }
1376
1377 ...or else to explicitly interpolate at least one scalar (even just a
1378 scalar containing an empty string):
1379
1380 sub build_keyword_parser {
1381 my %keywords = (foo => 1, bar => 1);
1382 my $DEFER_REGEX_COMPILATION = "";
1383
1384 return qr{
1385 ^<keyword>$
1386 <rule: keyword> <%keywords>
1387
1388 $DEFER_REGEX_COMPILATION
1389 }x;
1390 }
1391
1392 Rematching subrule results
1393 Sometimes it is useful to be able to rematch a string that has
1394 previously been matched by some earlier subrule. For example, consider
1395 a rule to match shell-like control blocks:
1396
1397 <rule: control_block>
1398 for <expr> <[command]>+ endfor
1399 | while <expr> <[command]>+ endwhile
1400 | if <expr> <[command]>+ endif
1401 | with <expr> <[command]>+ endwith
1402
1403 This would be much tidier if we could factor out the command names
1404 (which are the only differences between the four alternatives). The
1405 problem is that the obvious solution:
1406
1407 <rule: control_block>
1408 <keyword> <expr>
1409 <[command]>+
1410 end<keyword>
1411
1412 doesn't work, because it would also match an incorrect input like:
1413
1414 for 1..10
1415 echo $n
1416 ls subdir/$n
1417 endif
1418
1419 We need some way to ensure that the "<keyword>" matched immediately
1420 after "end" is the same "<keyword>" that was initially matched.
1421
1422 That's not difficult, because the first "<keyword>" will have captured
1423 what it matched into $MATCH{keyword}, so we could just write:
1424
1425 <rule: control_block>
1426 <keyword> <expr>
1427 <[command]>+
1428 end(??{quotemeta $MATCH{keyword}})
1429
1430 This is such a useful technique, yet so ugly, scary, and prone to
1431 error, that Regexp::Grammars provides a cleaner equivalent:
1432
1433 <rule: control_block>
1434 <keyword> <expr>
1435 <[command]>+
1436 end<\_keyword>
1437
1438 A directive of the form "<\_IDENTIFIER>" is known as a "matchref" (an
1439 abbreviation of "%MATCH-supplied backreference"). Matchrefs always
1440 attempt to match, as a literal, the current value of
1441 $MATCH{IDENTIFIER}.
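The same "match the earlier text again, literally" idea exists in core Perl as a named backreference. The following sketch uses "\k<keyword>" as an analogue of "<\_keyword>"; the control-block pattern is deliberately simplified and does not parse the expression or the commands.

```perl
use strict;
use warnings;

my $control_block = qr{
    \A
    (?<keyword> for | while | if | with ) \b
    .*?
    end \k<keyword>        # must repeat whatever <keyword> matched
    \s* \z
}xs;

my $good = "for 1..10\n  echo \$n\n  ls subdir/\$n\nendfor\n";
my $bad  = "for 1..10\n  echo \$n\n  ls subdir/\$n\nendif\n";

my $good_ok = $good =~ $control_block ? 1 : 0;
my $bad_ok  = $bad  =~ $control_block ? 1 : 0;

print "$good_ok $bad_ok\n";
```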
1442
1443 By default, a matchref does not capture what it matches, but you can
1444 have it do so by giving it an alias:
1445
1446 <token: delimited_string>
1447 <ldelim=str_delim> .*? <rdelim=\_ldelim>
1448
1449 <token: str_delim> ["'`]
1450
1451 At first glance this doesn't seem very useful as, by definition,
1452 $MATCH{ldelim} and $MATCH{rdelim} must necessarily always end up with
1453 identical values. However, it can be useful if the rule also has other
1454 alternatives and you want to create a consistent internal
1455 representation for those alternatives, like so:
1456
1457 <token: delimited_string>
1458 <ldelim=str_delim> .*? <rdelim=\_ldelim>
1459 | <ldelim=( \[ ) .*? <rdelim=( \] )
1460 | <ldelim=( \{ ) .*? <rdelim=( \} )
1461 | <ldelim=( \( ) .*? <rdelim=( \) )
1462 | <ldelim=( \< ) .*? <rdelim=( \> )
1463
1464 You can also force a matchref to save repeated matches as a nested
1465 array, in the usual way:
1466
1467 <token: marked_text>
1468 <marker> <text> <[endmarkers=\_marker]>+
1469
1470 Be careful though, as the following will not do as you may expect:
1471
1472 <[marker]>+ <text> <[endmarkers=\_marker]>+
1473
1474 because the value of $MATCH{marker} will be an array reference, which
1475 the matchref will flatten and concatenate, then match the resulting
1476 string as a literal. As a result, the previous example will match
1477 endmarkers that are exact multiples of the complete start marker,
1478 rather than endmarkers that consist of any number of repetitions of the
1479 individual start marker delimiter. So it matches:
1480
1481 ""text here""
1482 ""text here""""
1483 ""text here""""""
1484
1485 but not:
1486
1487 ""text here"""
1488 ""text here"""""
1489
1490 Uneven start and end markers such as these are extremely unusual, so
1491 this problem rarely arises in practice.
1492
1493 Note: Prior to Regexp::Grammars version 1.020, the syntax for matchrefs
1494 was "<\IDENTIFIER>" instead of "<\_IDENTIFIER>". This created problems
1495 when the identifier started with any of "l", "u", "L", "U", "Q", or
1496 "E", so the syntax has had to be altered in a backwards incompatible
1497 way. It will not be altered again.
1498
1499 Rematching balanced delimiters
1500 Consider the example in the previous section:
1501
1502 <token: delimited_string>
1503 <ldelim=str_delim> .*? <rdelim=\_ldelim>
1504 | <ldelim=( \[ ) .*? <rdelim=( \] )
1505 | <ldelim=( \{ ) .*? <rdelim=( \} )
1506 | <ldelim=( \( ) .*? <rdelim=( \) )
1507 | <ldelim=( \< ) .*? <rdelim=( \> )
1508
1509 The repeated pattern of the last four alternatives is galling, but we
1510 can't just refactor those delimiters as well:
1511
1512 <token: delimited_string>
1513 <ldelim=str_delim> .*? <rdelim=\_ldelim>
1514 | <ldelim=bracket> .*? <rdelim=\_ldelim>
1515
1516 because that would incorrectly match:
1517
1518 { delimited content here {
1519
1520 while failing to match:
1521
1522 { delimited content here }
1523
1524 To refactor balanced delimiters like those, we need a second kind of
1525 matchref; one that's a little smarter.
1526
1527 Or, preferably, a lot smarter...because there are many other kinds of
1528 balanced delimiters, apart from single brackets. For example:
1529
1530 {{{ delimited content here }}}
1531 /* delimited content here */
1532 (* delimited content here *)
1533 `` delimited content here ''
1534 if delimited content here fi
1535
1536 The common characteristic of these delimiter pairs is that the closing
1537 delimiter is the inverse of the opening delimiter: the sequence of
1538 characters is reversed and certain characters (mainly brackets, but
1539 also single-quotes/backticks) are mirror-reflected.
1540
1541 Regexp::Grammars supports the parsing of such delimiters with a
1542 construct known as an invertref, which is specified using the
1543 "</IDENT>" directive. An invertref acts very like a matchref, except
1544 that it does not convert to:
1545
1546 (??{ quotemeta( $MATCH{IDENT} ) })
1547
1548 but rather to:
1549
1550 (??{ quotemeta( inverse( $MATCH{IDENT} ) ) })
1551
1552 With this directive available, the balanced delimiters of the previous
1553 example can be refactored to:
1554
1555 <token: delimited_string>
1556 <ldelim=str_delim> .*? <rdelim=\_ldelim>
1557 | <ldelim=( [[{(<] ) .*? <rdelim=/ldelim>
1558
1559 Like matchrefs, invertrefs come in the usual range of flavours:
1560
1561 </ident> # Match the inverse of $MATCH{ident}
1562 <ALIAS=/ident> # Match inverse and capture to $MATCH{ALIAS}
1563 <[ALIAS=/ident]> # Match inverse and push on @{$MATCH{ALIAS}}
1564
1565 The character pairs that are reversed during mirroring are: "{" and
1566 "}", "[" and "]", "(" and ")", "<" and ">", "«" and "»", "`" and "'".
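As a rough illustration of the mirroring described above, the following core-Perl sketch reverses a delimiter and swaps the ASCII pairs. The inverse() function here is a hypothetical helper written for this example, not the module's internal routine, and it handles only the ASCII pairs.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse the characters, then mirror each bracket,
# backtick, or single-quote (ASCII pairs only).
sub inverse {
    my ($delim) = @_;
    my $mirrored = reverse $delim;            # scalar context: reversed string
    $mirrored =~ tr/{}[]()<>`'/}{][)(><'`/;
    return $mirrored;
}

my %closer;
my $all_matched = 1;
for my $open ( '{{{', '/*', '(*', '``', 'if' ) {
    my $close = $closer{$open} = inverse($open);

    # Check that the computed closer really terminates a delimited string.
    my $delimited = qr/\A \Q$open\E \s .*? \s \Q$close\E \z/xs;
    $all_matched = 0 unless "$open delimited content here $close" =~ $delimited;
}

print "$_ => $closer{$_}\n" for sort keys %closer;
```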
1567
1568 The following mnemonics may be useful in distinguishing invertrefs
1569 from matchrefs: a matchref starts with a "\" (just like the standard Perl
1570 regex backrefs "\1" and "\g{-2}" and "\k<name>"), whereas an invertref
1571 starts with a "/" (like an HTML or XML closing tag). Or just remember
1572 that "<\_IDENT>" is "match the same again", and if you want "the same
1573 again, only mirrored" instead, just mirror the "\" to get "</IDENT>".
1574
1575 Rematching parametric results and delimiters
1576 The "<\_IDENTIFIER>" and "</IDENTIFIER>" mechanisms normally locate the
1577 literal to be matched by looking in $MATCH{IDENTIFIER}.
1578
1579 However, you can cause them to look in $ARG{IDENTIFIER} instead, by
1580 prefixing the identifier with a single ":". This is especially useful
1581 when refactoring subrules. For example, instead of:
1582
1583 <rule: Command>
1584 <Keyword> <CommandBody> end_ <\_Keyword>
1585
1586 <rule: Placeholder>
1587 <Keyword> \.\.\. end_ <\_Keyword>
1588
1589 you could parameterize the Terminator rule, like so:
1590
1591 <rule: Command>
1592 <Keyword> <CommandBody> <Terminator(:Keyword)>
1593
1594 <rule: Placeholder>
1595 <Keyword> \.\.\. <Terminator(:Keyword)>
1596
1597 <token: Terminator>
1598 end_ <\:Keyword>
1599
1600 Tracking and reporting match positions
1601 Regexp::Grammars automatically predefines a special token that makes it
1602 easy to track exactly where in its input a particular subrule matches.
1603 That token is: "<matchpos>".
1604
1605 The "<matchpos>" token implements a zero-width match that never fails.
1606 It always returns the current index within the string that the grammar
1607 is matching.
1608
1609 So, for example you could have your "<delimited_text>" subrule detect
1610 and report unterminated text like so:
1611
1612 <token: delimited_text>
1613 qq? <delim> <text=(.*?)> </delim>
1614 |
1615 <matchpos> qq? <delim>
1616 <error: (?{"Unterminated string starting at index $MATCH{matchpos}"})>
1617
1618 Matching "<matchpos>" in the second alternative causes $MATCH{matchpos}
1619 to contain the position in the string at which the "<matchpos>" subrule
1620 was matched (in this example: the start of the unterminated text).
1621
1622 If you want the line number instead of the string index, use the
1623 predefined "<matchline>" subrule instead:
1624
1625 <token: delimited_text>
1626 qq? <delim> <text=(.*?)> </delim>
1627 | <matchline> qq? <delim>
1628 <error: (?{"Unterminated string starting at line $MATCH{matchline}"})>
1629
1630 Note that the line numbers returned by "<matchline>" start at 1 (not at
1631 zero, as with "<matchpos>").
1632
1633 The "<matchpos>" and "<matchline>" subrules are just like any other
1634 subrules; you can alias them ("<started_at=matchpos>") or match them
1635 repeatedly ("(?: <[matchline]> <[item]> )++"), etc.
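The difference between the two values (a zero-based string index versus a one-based line number) can be illustrated in core Perl. This sketch only mimics what the two subrules report; the sample text and the search pattern are invented for illustration.

```perl
use strict;
use warnings;

my $text = "alpha\nbeta\ngamma delta\n";

my ( $pos, $line );
if ( $text =~ /delta/ ) {
    $pos = $-[0];                         # zero-based index of the match start

    # Line number: one more than the number of newlines before that index.
    my $before = substr( $text, 0, $pos );
    $line = 1 + ( $before =~ tr/\n// );
}

print "index $pos, line $line\n";
```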
1636
1638 The module also supports event-based parsing. You can specify a grammar
1639 in the usual way and then, for a particular parse, layer a collection
1640 of call-backs (known as "autoactions") over the grammar to handle the
1641 data as it is parsed.
1642
1643 Normally, a grammar rule returns the result hash it has accumulated (or
1644 whatever else was aliased to "MATCH=" within the rule). However, you
1645 can specify an autoaction object before the grammar is matched.
1646
1647 Once the autoaction object is specified, every time a rule succeeds
1648 during the parse, its result is passed to the object via one of its
1649 methods; specifically it is passed to the method whose name is the same
1650 as the rule's.
1651
1652 For example, suppose you had a grammar that recognizes simple algebraic
1653 expressions:
1654
1655 my $expr_parser = do{
1656 use Regexp::Grammars;
1657 qr{
1657 <Answer>
1658
1659 <rule: Answer> <[Operand=Mult]>+ % <[Op=(\+|\-)]>
1660
1661 <rule: Mult> <[Operand=Pow]>+ % <[Op=(\*|/|%)]>
1662
1663 <rule: Pow> <[Operand=Term]>+ % <Op=(\^)>
1664
1665 <rule: Term> <MATCH=Literal>
1666 | \( <MATCH=Answer> \)
1668
1669 <token: Literal> <MATCH=( [+-]? \d++ (?: \. \d++ )?+ )>
1670 }xms
1671 };
1672
1673 You could convert this grammar to a calculator, by installing a set of
1674 autoactions that convert each rule's result hash to the corresponding
1675 value of the sub-expression that the rule just parsed. To do that, you
1676 would create a class with methods whose names match the rules whose
1677 results you want to change. For example:
1678
1679 package Calculator;
1680 use List::Util qw< reduce >;
1681
1682 sub new {
1683 my ($class) = @_;
1684
1685 return bless {}, $class
1686 }
1687
1688 sub Answer {
1689 my ($self, $result_hash) = @_;
1690
1691 my $sum = shift @{$result_hash->{Operand}};
1692
1693 for my $term (@{$result_hash->{Operand}}) {
1694 my $op = shift @{$result_hash->{Op}};
1695 if ($op eq '+') { $sum += $term; }
1696 else { $sum -= $term; }
1697 }
1698
1699 return $sum;
1700 }
1701
1702 sub Mult {
1703 my ($self, $result_hash) = @_;
1704
1705 return reduce { eval($a . shift(@{$result_hash->{Op}}) . $b) }
1706 @{$result_hash->{Operand}};
1707 }
1708
1709 sub Pow {
1710 my ($self, $result_hash) = @_;
1711
1712 return reduce { $b ** $a } reverse @{$result_hash->{Operand}};
1713 }
1714
1715 Objects of this class (and indeed the class itself) now have methods
1716 corresponding to some of the rules in the expression grammar. To apply
1717 those methods to the results of the rules (as they parse) you simply
1718 install an object as the "autoaction" handler, immediately before you
1719 initiate the parse:
1720
1721 if ($text =~ $expr_parser->with_actions(Calculator->new)) {
1722 say $/{Answer}; # Now prints the result of the expression
1723 }
1724
1725 The "with_actions()" method expects to be passed an object or
1726 classname. This object or class will be installed as the autoaction
1727 handler for the next match against any grammar. After that match, the
1728 handler will be uninstalled. "with_actions()" returns the grammar it's
1729 called on, making it easy to call it as part of a match (which is the
1730 recommended idiom).
1731
1732 With a "Calculator" object set as the autoaction handler, whenever the
1733 "Answer", "Mult", or "Pow" rule of the grammar matches, the
1734 corresponding "Answer", "Mult", or "Pow" method of the "Calculator"
1735 object will be called (with the rule's result value passed as its only
1736 argument), and the result of the method will be used as the result of
1737 the rule.
1738
1739 Note that nothing new happens when a "Term" or "Literal" rule matches,
1740 because the "Calculator" object doesn't have methods with those names.
1741
1742 The overall effect, then, is to allow you to specify a grammar without
1743 rule-specific behaviours and then, later, specify a set of final
1744 actions (as methods) for some or all of the rules of the grammar.
1745
1746 Note that, if a particular callback method returns "undef", the result
1747 of the corresponding rule will be passed through without modification.
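The dispatch behaviour described in this section (call the method named after the rule if one exists, and keep the original result if the method is absent or returns undef) can be sketched in core Perl as follows. The Demo::Actions class and the apply_autoaction() function are hypothetical names invented for this illustration; they are not part of the module's interface.

```perl
use strict;
use warnings;

package Demo::Actions {
    sub new { my ($class) = @_; return bless {}, $class }

    # Replaces the result of a rule named 'Sum' with the total of its operands.
    sub Sum {
        my ( $self, $result ) = @_;
        my $total = 0;
        $total += $_ for @{ $result->{Operand} };
        return $total;
    }

    # Returning undef leaves the rule's original result in place.
    sub Note { return undef }
}

# Hypothetical dispatcher: mimics what happens when a rule succeeds.
sub apply_autoaction {
    my ( $actions, $rule_name, $result ) = @_;

    my $method = $actions->can($rule_name)
        or return $result;                    # no such method: pass through

    my $replacement = $actions->$method($result);
    return defined $replacement ? $replacement : $result;
}

my $actions = Demo::Actions->new;
my $sum  = apply_autoaction( $actions, 'Sum',  { Operand => [ 1, 2, 3 ] } );
my $note = apply_autoaction( $actions, 'Note', 'unchanged' );
my $term = apply_autoaction( $actions, 'Term', 'literal' );

print "$sum $note $term\n";
```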
1748
1750 All the grammars shown so far are confined to a single regex. However,
1751 Regexp::Grammars also provides a mechanism that allows you to define
1752 named grammars, which can then be imported into other regexes. This
1753 gives you a way of modularizing common grammatical components.
1754
1755 Defining a named grammar
1756 You can create a named grammar using the "<grammar:...>" directive.
1757 This directive must appear before the first rule definition in the
1758 grammar, and instead of any start-rule. For example:
1759
1760 qr{
1761 <grammar: List::Generic>
1762
1763 <rule: List>
1764 <[MATCH=Item]>+ % <Separator>
1765
1766 <rule: Item>
1767 \S++
1768
1769 <token: Separator>
1770 \s* , \s*
1771 }x;
1772
1773 This creates a grammar named "List::Generic", and installs it in the
1774 module's internal caches, for future reference.
1775
1776 Note that there is no need (or reason) to assign the resulting regex to
1777 a variable, as the named grammar cannot itself be matched against.
1778
1779 Using a named grammar
1780 To make use of a named grammar, you need to incorporate it into another
1781 grammar, by inheritance. To do that, use the "<extends:...>" directive,
1782 like so:
1783
1784 my $parser = qr{
1785 <extends: List::Generic>
1786
1787 <List>
1788 }x;
1789
1790 The "<extends:...>" directive incorporates the rules defined in the
1791 specified grammar into the current regex. You can then call any of
1792 those rules in the start-pattern.
1793
1794 Overriding an inherited rule or token
1795 Subrule dispatch within a grammar is always polymorphic. That is, when
1796 a subrule is called, the most-derived rule of the same name within the
1797 grammar's hierarchy is invoked.
1798
So, to replace a particular rule within a grammar, you simply need to
inherit that grammar and specify new, more-specific versions of any
rules you want to change. For example:
1802
1803 my $list_of_integers = qr{
1804 <List>
1805
1806 # Inherit rules from base grammar...
1807 <extends: List::Generic>
1808
1809 # Replace Item rule from List::Generic...
1810 <rule: Item>
1811 [+-]? \d++
1812 }x;
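
Putting the pieces together, a complete script might look something like the following sketch. It reuses the "List::Generic" grammar and the replacement "Item" rule shown above; the "\A" and "\z" anchors and the two sample inputs are assumptions added here so that a partial match is not reported as a success.

```perl
use strict;
use warnings;
use Regexp::Grammars;
use Data::Dumper;

# Base grammar, as defined earlier (it has no start-pattern of its own)...
qr{
    <grammar: List::Generic>

    <rule: List>
        <[MATCH=Item]>+ % <Separator>

    <rule: Item>
        \S++

    <token: Separator>
        \s* , \s*
}x;

# Derived regex: inherits List and Separator, but replaces Item...
my $list_of_integers = qr{
    \A <List> \z

    <extends: List::Generic>

    <rule: Item>
        [+-]? \d++
}x;

for my $input ('1, -2, +3', '1, two, 3') {
    if ($input =~ $list_of_integers) {
        print "'$input' parsed as: ", Dumper($/{List});
    }
    else {
        print "'$input' is not a list of integers\n";
    }
}
```

The second input is expected to be rejected, because the overriding "Item" rule only accepts optionally signed digits.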
1813
1814 You can also use "<extends:...>" in other named grammars, to create
1815 hierarchies:
1816
1817 qr{
1818 <grammar: List::Integral>
1819 <extends: List::Generic>
1820
1821 <token: Item>
1822 [+-]? <MATCH=(<.Digit>+)>
1823
1824 <token: Digit>
1825 \d
1826 }x;
1827
1828 qr{
1829 <grammar: List::ColonSeparated>
1830 <extends: List::Generic>
1831
1832 <token: Separator>
1833 \s* : \s*
1834 }x;
1835
1836 qr{
1837 <grammar: List::Integral::ColonSeparated>
1838 <extends: List::Integral>
1839 <extends: List::ColonSeparated>
1840 }x;
1841
1842 As shown in the previous example, Regexp::Grammars allows you to
1843 multiply inherit two (or more) base grammars. For example, the
1844 "List::Integral::ColonSeparated" grammar takes the definitions of
1845 "List" and "Item" from the "List::Integral" grammar, and the definition
1846 of "Separator" from "List::ColonSeparated".
1847
1848 Note that grammars dispatch subrule calls using C3 method lookup,
1849 rather than Perl's older DFS lookup. That's why
1850 "List::Integral::ColonSeparated" correctly gets the more-specific
1851 "Separator" rule defined in "List::ColonSeparated", rather than the
1852 more-generic version defined in "List::Generic" (via "List::Integral").
1853 See "perldoc mro" for more discussion of the C3 dispatch algorithm.
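
The difference between the two lookup orders can be seen with ordinary Perl classes. The sketch below mirrors the grammar hierarchy above using hypothetical packages whose methods stand in for rules; it uses only the core "mro" module and does not involve Regexp::Grammars at all.

```perl
use strict;
use warnings;
use mro;

# Hypothetical packages mirroring the grammar hierarchy (methods stand in for rules)...
{ package List::Generic;
  sub Item      { return 'non-space' }
  sub Separator { return 'comma' }
}
{ package List::Integral;
  our @ISA = ('List::Generic');
  sub Item { return 'integer' }
}
{ package List::ColonSeparated;
  our @ISA = ('List::Generic');
  sub Separator { return 'colon' }
}
{ package List::Integral::ColonSeparated;
  use mro 'c3';    # this class resolves methods using C3
  our @ISA = ('List::Integral', 'List::ColonSeparated');
}

my $class = 'List::Integral::ColonSeparated';

# Compute both linearizations of the same hierarchy...
my @dfs = @{ mro::get_linear_isa($class, 'dfs') };
my @c3  = @{ mro::get_linear_isa($class, 'c3')  };

print "DFS: @dfs\n";    # List::Generic comes before List::ColonSeparated
print "C3:  @c3\n";     # List::ColonSeparated comes before List::Generic

# Under C3 the more-specific Separator is found first...
print "Separator: ", $class->Separator, "\n";    # colon
print "Item:      ", $class->Item,      "\n";    # integer
```

Under depth-first search, "List::Generic" is reached (via "List::Integral") before "List::ColonSeparated", so the generic "Separator" would win; C3 defers the shared base class until all of its subclasses have been searched.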
1854
1855 Augmenting an inherited rule or token
1856 Instead of replacing an inherited rule, you can augment it.
1857
For example, if you need a grammar for lists of hexadecimal numbers,
you could inherit the behaviour of "List::Integral" and add the hex
digits to its "Digit" token:
1861
1862 my $list_of_hexadecimal = qr{
1863 <List>
1864
1865 <extends: List::Integral>
1866
1867 <token: Digit>
1868 <List::Integral::Digit>
1869 | [A-Fa-f]
1870 }x;
1871
1872 If you call a subrule using a fully qualified name (such as
1873 "<List::Integral::Digit>"), the grammar calls that version of the rule,
1874 rather than the most-derived version.
1875
1876 Debugging named grammars
1877 Named grammars are independent of each other, even when inherited. This
1878 means that, if debugging is enabled in a derived grammar, it will not
1879 be active in any rules inherited from a base grammar, unless the base
1880 grammar also included a "<debug:...>" directive.
1881
1882 This is a deliberate design decision, as activating the debugger adds a
1883 significant amount of code to each grammar's implementation, which is
1884 detrimental to the matching performance of the resulting regexes.
1885
1886 If you need to debug a named grammar, the best approach is to include a
1887 "<debug: same>" directive at the start of the grammar. The presence of
1888 this directive will ensure the necessary extra debugging code is
1889 included in the regex implementing the grammar, while setting "same"
1890 mode will ensure that the debugging mode isn't altered when the matcher
1891 uses the inherited rules.
1892
1894 Result distillation
1895 Normally, calls to subrules produce nested result-hashes within the
1896 current result-hash. Those nested hashes always have at least one
1897 automatically supplied key (""), whose value is the entire substring
1898 that the subrule matched.
1899
1900 If there are no other nested captures within the subrule, there will be
1901 no other keys in the result-hash. This would be annoying as a typical
1902 nested grammar would then produce results consisting of hashes of
1903 hashes, with each nested hash having only a single key (""). This in
1904 turn would make postprocessing the result-hash (in "%/") far more
1905 complicated than it needs to be.
1906
1907 To avoid this behaviour, if a subrule's result-hash doesn't contain any
1908 keys except "", the module "flattens" the result-hash, by replacing it
1909 with the value of its single key.
1910
1911 So, for example, the grammar:
1912
1913 mv \s* <from> \s* <to>
1914
1915 <rule: from> [\w/.-]+
1916 <rule: to> [\w/.-]+
1917
1918 doesn't return a result-hash like this:
1919
1920 {
1921 "" => 'mv /usr/local/lib/libhuh.dylib /dev/null/badlib',
1922 'from' => { "" => '/usr/local/lib/libhuh.dylib' },
1923 'to' => { "" => '/dev/null/badlib' },
1924 }
1925
1926 Instead, it returns:
1927
1928 {
1929 "" => 'mv /usr/local/lib/libhuh.dylib /dev/null/badlib',
1930 'from' => '/usr/local/lib/libhuh.dylib',
1931 'to' => '/dev/null/badlib',
1932 }
1933
1934 That is, because the 'from' and 'to' subhashes each have only a single
1935 entry, they are each "flattened" to the value of that entry.
1936
1937 This flattening also occurs if a result-hash contains only "private"
1938 keys (i.e. keys starting with underscores). For example:
1939
1940 mv \s* <from> \s* <to>
1941
1942 <rule: from> <_dir=path>? <_file=filename>
1943 <rule: to> <_dir=path>? <_file=filename>
1944
1945 <token: path> [\w/.-]*/
1946 <token: filename> [\w.-]+
1947
1948 Here, the "from" rule produces a result like this:
1949
1950 from => {
1951 "" => '/usr/local/bin/perl',
1952 _dir => '/usr/local/bin/',
1953 _file => 'perl',
1954 }
1955
1956 which is automatically stripped of "private" keys, leaving:
1957
1958 from => {
1959 "" => '/usr/local/bin/perl',
1960 }
1961
1962 which is then automatically flattened to:
1963
1964 from => '/usr/local/bin/perl'
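
The flattening rule can be mimicked on ordinary hashes. The following is a plain-Perl sketch of the behaviour described above, not the module's implementation; "distill()" is a hypothetical helper name.

```perl
use strict;
use warnings;

# Illustrative only: apply the flattening rule to a plain hash.
sub distill {
    my ($result) = @_;
    return $result if ref $result ne 'HASH';

    # Keys that are neither the "" context key nor "private" (leading underscore)...
    my @public_keys = grep { $_ ne q{} && substr($_, 0, 1) ne '_' } keys %{$result};

    # If nothing public was captured, only the matched text is worth keeping...
    return @public_keys ? $result : $result->{q{}};
}

my $from = distill({
    q{}   => '/usr/local/bin/perl',
    _dir  => '/usr/local/bin/',
    _file => 'perl',
});

my $cmd = distill({
    q{}  => 'mv a b',
    from => 'a',
    to   => 'b',
});

print "from: $from\n";                    # flattened to a string
print "cmd:  still a ", ref($cmd), "\n";  # has public keys, so stays a hash
```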
1965
1966 List result distillation
1967
1968 A special case of result distillation occurs in a separated list, such
1969 as:
1970
1971 <rule: List>
1972
1973 <[Item]>+ % <[Sep=(,)]>
1974
1975 If this construct matches just a single item, the result hash will
1976 contain a single entry consisting of a nested array with a single
1977 value, like so:
1978
1979 { Item => [ 'data' ] }
1980
1981 Instead of returning this annoyingly nested data structure, you can
1982 tell Regexp::Grammars to flatten it to just the inner data with a
1983 special directive:
1984
1985 <rule: List>
1986
1987 <[Item]>+ % <[Sep=(,)]>
1988
1989 <minimize:>
1990
1991 The "<minimize:>" directive examines the result hash (i.e. %MATCH). If
1992 that hash contains only a single entry, which is a reference to an
1993 array with a single value, then the directive assigns that single value
1994 directly to $MATCH, so that it will be returned instead of the usual
1995 result hash.
1996
1997 This means that a normal separated list still results in a hash
1998 containing all elements and separators, but a "degenerate" list of only
1999 one item results in just that single item.
2000
2001 Manual result distillation
2002
2003 Regexp::Grammars also offers full manual control over the distillation
2004 process. If you use the reserved word "MATCH" as the alias for a
2005 subrule call:
2006
2007 <MATCH=filename>
2008
2009 or a subpattern match:
2010
2011 <MATCH=( \w+ )>
2012
2013 or a code block:
2014
2015 <MATCH=(?{ 42 })>
2016
2017 then the current rule will treat the return value of that subrule,
2018 pattern, or code block as its complete result, and return that value
2019 instead of the usual result-hash it constructs. This is the case even
2020 if the result has other entries that would normally also be returned.
2021
2022 For example, in a rule like:
2023
2024 <rule: term>
2025 <MATCH=literal>
2026 | <left_paren> <MATCH=expr> <right_paren>
2027
2028 The use of "MATCH" aliases causes the rule to return either whatever
2029 "<literal>" returns, or whatever "<expr>" returns (provided it's
2030 between left and right parentheses).
2031
2032 Note that, in this second case, even though "<left_paren>" and
2033 "<right_paren>" are captured to the result-hash, they are not returned,
2034 because the "MATCH" alias overrides the normal "return the result-hash"
2035 semantics and returns only what its associated subrule (i.e. "<expr>")
2036 produces.
2037
Note also that the return value is only assigned if the subrule call
actually matches. For example:
2040
2041 <rule: optional_names>
2042 <[MATCH=name]>*
2043
2044 If the repeated subrule call to "<name>" matches zero times, the return
2045 value of the "optional_names" rule will not be an empty array, because
2046 the "MATCH=" will not have executed at all. Instead, the default return
2047 value (an empty string) will be returned. If you had specifically
2048 wanted to return an empty array, you could use any of the following:
2049
2050 <rule: optional_names>
2051 <MATCH=(?{ [] })> # Set up empty array before first match attempt
2052 <[MATCH=name]>*
2053
2054 or:
2055
2056 <rule: optional_names>
2057 <[MATCH=name]>+ # Match one or more times
2058 | # or
2059 <MATCH=(?{ [] })> # Set up empty array, if no match
2060
2061 Programmatic result distillation
2062
2063 It's also possible to control what a rule returns from within a code
2064 block. Regexp::Grammars provides a set of reserved variables that give
2065 direct access to the result-hash.
2066
2067 The result-hash itself can be accessed as %MATCH within any code block
2068 inside a rule. For example:
2069
2070 <rule: sum>
2071 <X=product> \+ <Y=product>
2072 <MATCH=(?{ $MATCH{X} + $MATCH{Y} })>
2073
2074 Here, the rule matches a product (aliased 'X' in the result-hash), then
2075 a literal '+', then another product (aliased to 'Y' in the result-
2076 hash). The rule then executes the code block, which accesses the two
2077 saved values (as $MATCH{X} and $MATCH{Y}), adding them together.
2078 Because the block is itself aliased to "MATCH", the sum produced by the
2079 block becomes the (only) result of the rule.
2080
2081 It is also possible to set the rule result from within a code block
2082 (instead of aliasing it). The special "override" return value is
2083 represented by the special variable $MATCH. So the previous example
2084 could be rewritten:
2085
2086 <rule: sum>
2087 <X=product> \+ <Y=product>
2088 (?{ $MATCH = $MATCH{X} + $MATCH{Y} })
2089
2090 Both forms are identical in effect. Any assignment to $MATCH overrides
2091 the normal "return all subrule results" behaviour.
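
As a self-contained illustration, the second form can be exercised with a deliberately simplified grammar. In this sketch "sum" is written as a token with explicit "\s*" so that the whitespace handling is visible, and "product" is an assumed stand-in that matches only a run of digits.

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $adder = qr{
    \A <sum> \z

    <token: sum>
        <X=product> \s* \+ \s* <Y=product>
        (?{ $MATCH = $MATCH{X} + $MATCH{Y} })

    <token: product>
        \d+
}x;

if ('2 + 40' =~ $adder) {
    # The rule's result is the number, not a hash with X and Y entries...
    print "sum: $/{sum}\n";
}
```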
2092
2093 Assigning to $MATCH directly is particularly handy if the result may
2094 not always be "distillable", for example:
2095
2096 <rule: sum>
2097 <X=product> \+ <Y=product>
2098 (?{ if (!ref $MATCH{X} && !ref $MATCH{Y}) {
2099 # Reduce to sum, if both terms are simple scalars...
2100 $MATCH = $MATCH{X} + $MATCH{Y};
2101 }
2102 else {
2103 # Return full syntax tree for non-simple case...
2104 $MATCH{op} = '+';
2105 }
2106 })
2107
2108 Note that you can also partially override the subrule return behaviour.
2109 Normally, the subrule returns the complete text it matched as its
2110 context substring (i.e. under the "empty key") in its result-hash. That
2111 is, of course, $MATCH{""}, so you can override just that behaviour by
2112 directly assigning to that entry.
2113
2114 For example, if you have a rule that matches key/value pairs from a
2115 configuration file, you might prefer that any trailing comments not be
2116 included in the "matched text" entry of the rule's result-hash. You
2117 could hide such comments like so:
2118
<rule: config_line>
    <key> : <value> <comment>?
    (?{
        # Edit trailing comments out of "matched text" entry...
        $MATCH{""} = "$MATCH{key} : $MATCH{value}";
    })
2125
2126 Some more examples of the uses of $MATCH:
2127
<rule: FuncDecl>
    # Match the keyword and the name, but return only the name (as a string)...
    func <Identifier> ; (?{ $MATCH = $MATCH{'Identifier'} })
2131
2132
2133 <rule: NumList>
2134 # Numbers in square brackets...
2135 \[
2136 ( \d+ (?: , \d+)* )
2137 \]
2138
2139 # Return only the numbers...
2140 (?{ $MATCH = $CAPTURE })
2141
2142
2143 <token: Cmd>
2144 # Match standard variants then standardize the keyword...
2145 (?: mv | move | rename ) (?{ $MATCH = 'mv'; })
2146
2147 Parse-time data processing
2148 Using code blocks in rules, it's often possible to fully process data
2149 as you parse it. For example, the "<sum>" rule shown in the previous
2150 section might be part of a simple calculator, implemented entirely in a
2151 single grammar. Such a calculator might look like this:
2152
2153 my $calculator = do{
2154 use Regexp::Grammars;
2155 qr{
2156 <Answer>
2157
2158 <rule: Answer>
2159 ( <.Mult>+ % <.Op=([+-])> )
2160 <MATCH= (?{ eval $CAPTURE })>
2161
2162 <rule: Mult>
2163 ( <.Pow>+ % <.Op=([*/%])> )
2164 <MATCH= (?{ eval $CAPTURE })>
2165
2166 <rule: Pow>
2167 <X=Term> \^ <Y=Pow>
2168 <MATCH= (?{ $MATCH{X} ** $MATCH{Y}; })>
2169 |
2170 <MATCH=Term>
2171
2172 <rule: Term>
2173 <MATCH=Literal>
2174 | \( <MATCH=Answer> \)
2175
2176 <token: Literal>
2177 <MATCH= ( [+-]? \d++ (?: \. \d++ )?+ )>
2178 }xms
2179 };
2180
2181 while (my $input = <>) {
2182 if ($input =~ $calculator) {
2183 say "--> $/{Answer}";
2184 }
2185 }
2186
2187 Because every rule computes a value using the results of the subrules
2188 below it, and aliases that result to its "MATCH", each rule returns a
2189 complete evaluation of the subexpression it matches, passing that back
2190 to higher-level rules, which then do the same.
2191
2192 Hence, the result returned to the very top-level rule (i.e. to
2193 "<Answer>") is the complete evaluation of the entire expression that
2194 was matched. That means that, in the very process of having matched a
2195 valid expression, the calculator has also computed the value of that
2196 expression, which can then simply be printed directly.
2197
2198 It is often possible to have a grammar fully (or sometimes at least
2199 partially) evaluate or transform the data it is parsing, and this
2200 usually leads to very efficient and easy-to-maintain implementations.
2201
2202 The main limitation of this technique is that the data has to be in a
2203 well-structured form, where subsets of the data can be evaluated using
2204 only local information. In cases where the meaning of the data is
2205 distributed through that data non-hierarchically, or relies on global
2206 state, or on external information, it is often better to have the
2207 grammar simply construct a complete syntax tree for the data first, and
2208 then evaluate that syntax tree separately, after parsing is complete.
2209 The following section describes a feature of Regexp::Grammars that can
2210 make this second style of data processing simpler and more
2211 maintainable.
2212
2213 Object-oriented parsing
2214 When a grammar has parsed successfully, the "%/" variable will contain
2215 a series of nested hashes (and possibly arrays) representing the
2216 hierarchical structure of the parsed data.
2217
2218 Typically, the next step is to walk that tree, extracting or converting
2219 or otherwise processing that information. If the tree has nodes of many
2220 different types, it can be difficult to build a recursive subroutine
2221 that can navigate it easily.
2222
2223 A much cleaner solution is possible if the nodes of the tree are proper
2224 objects. In that case, you just define a "process()" or "traverse()"
method for each of the classes, and have every node call that method on
2226 each of its children. For example, if the parser were to return a tree
2227 of nodes representing the contents of a LaTeX file, then you could
2228 define the following methods:
2229
2230 sub Latex::file::explain
2231 {
2232 my ($self, $level) = @_;
2233 for my $element (@{$self->{element}}) {
2234 $element->explain($level);
2235 }
2236 }
2237
2238 sub Latex::element::explain {
2239 my ($self, $level) = @_;
2240 ( $self->{command} || $self->{literal})->explain($level)
2241 }
2242
2243 sub Latex::command::explain {
2244 my ($self, $level) = @_;
2245 say "\t"x$level, "Command:";
2246 say "\t"x($level+1), "Name: $self->{name}";
2247 if ($self->{options}) {
2248 say "\t"x$level, "\tOptions:";
2249 $self->{options}->explain($level+2)
2250 }
2251
2252 for my $arg (@{$self->{arg}}) {
2253 say "\t"x$level, "\tArg:";
2254 $arg->explain($level+2)
2255 }
2256 }
2257
2258 sub Latex::options::explain {
2259 my ($self, $level) = @_;
2260 $_->explain($level) foreach @{$self->{option}};
2261 }
2262
2263 sub Latex::literal::explain {
2264 my ($self, $level, $label) = @_;
2265 $label //= 'Literal';
2266 say "\t"x$level, "$label: ", $self->{q{}};
2267 }
2268
2269 and then simply write:
2270
2271 if ($text =~ $LaTeX_parser) {
2272 $/{LaTeX_file}->explain();
2273 }
2274
2275 and the chain of "explain()" calls would cascade down the nodes of the
2276 tree, each one invoking the appropriate "explain()" method according to
2277 the type of node encountered.
2278
2279 The only problem is that, by default, Regexp::Grammars returns a tree
2280 of plain-old hashes, not LaTeX::Whatever objects. Fortunately, it's
2281 easy to request that the result hashes be automatically blessed into
2282 the appropriate classes, using the "<objrule:...>" and "<objtoken:...>"
2283 directives.
2284
2285 These directives are identical to the "<rule:...>" and "<token:...>"
2286 directives (respectively), except that the rule or token they create
2287 will also convert the hash it normally returns into an object of a
2288 specified class. This conversion is done by passing the result hash to
2289 the class's constructor:
2290
2291 $class->new(\%result_hash)
2292
2293 if the class has a constructor method named "new()", or else (if the
2294 class doesn't provide a constructor) by directly blessing the result
2295 hash:
2296
2297 bless \%result_hash, $class
2298
Note that, even if the object is constructed via its own constructor,
the module still expects the new object to be hash-based, and will fail
if the object is anything but a blessed hash. The module issues an
error in this case.
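
The conversion just described amounts to something like the following plain-Perl sketch. It only illustrates the stated rules and is not the module's own code; "objectify()" and the two "Local::" classes are hypothetical names.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed reftype);

{
    package Local::WithNew;       # hypothetical class that has a constructor
    sub new {
        my ($class, $fields) = @_;
        return bless { %{$fields}, built => 1 }, $class;
    }
}
{
    package Local::WithoutNew;    # hypothetical class with no constructor
    sub name { return $_[0]{name} }
}

# Use the class's new() if it has one, otherwise bless the hash directly,
# and insist that the outcome is a hash-based object either way.
sub objectify {
    my ($class, $result_hash) = @_;
    my $object = $class->can('new')
               ? $class->new($result_hash)
               : bless($result_hash, $class);
    die "$class object is not a blessed hash\n"
        if !blessed($object) || reftype($object) ne 'HASH';
    return $object;
}

my $built   = objectify('Local::WithNew',    { name => 'section' });
my $blessed = objectify('Local::WithoutNew', { name => 'atom' });

print ref($built),   " built via new(): $built->{built}\n";
print ref($blessed), " blessed directly: ", $blessed->name, "\n";
```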
2303
2304 The generic syntax for these types of rules and tokens is:
2305
2306 <objrule: CLASS::NAME = RULENAME >
2307 <objtoken: CLASS::NAME = TOKENNAME >
2308
2309 For example:
2310
2311 <objrule: LaTeX::Element=component>
2312 # ...Defines a rule that can be called as <component>
2313 # ...and which returns a hash-based LaTeX::Element object
2314
<objtoken: LaTeX::Literal=atom>
    # ...Defines a token that can be called as <atom>
    # ...and which returns a hash-based LaTeX::Literal object
2318
2319 Note that, just as in aliased subrule calls, the name by which
2320 something is referred to outside the grammar (in this case, the class
2321 name) comes before the "=", whereas the name that it is referred to
2322 inside the grammar comes after the "=".
2323
2324 You can freely mix object-returning and plain-old-hash-returning rules
2325 and tokens within a single grammar, though you have to be careful not
2326 to subsequently try to call a method on any of the unblessed nodes.
2327
2328 An important caveat regarding OO rules
2329
2330 Prior to Perl 5.14.0, Perl's regex engine was not fully re-entrant.
2331 This means that in older versions of Perl, it is not possible to re-
2332 invoke the regex engine when already inside the regex engine.
2333
2334 This means that you need to be careful that the "new()" constructors
2335 that are called by your object-rules do not themselves use regexes in
2336 any way, unless you're running under Perl 5.14 or later (in which case
2337 you can ignore what follows).
2338
2339 The two ways this is most likely to happen are:
2340
1. If you're using a class built on Moose, where one or more of the
   "has" declarations uses a type constraint (such as 'Int') that is
   implemented via regex matching. For example:
2344
2345 has 'id' => (is => 'rw', isa => 'Int');
2346
2347 The workaround (for pre-5.14 Perls) is to replace the type
2348 constraint with one that doesn't use a regex. For example:
2349
2350 has 'id' => (is => 'rw', isa => 'Num');
2351
2352 Alternatively, you could define your own type constraint that
2353 avoids regexes:
2354
2355 use Moose::Util::TypeConstraints;
2356
2357 subtype 'Non::Regex::Int',
2358 as 'Num',
2359 where { int($_) == $_ };
2360
2361 no Moose::Util::TypeConstraints;
2362
2363 # and later...
2364
2365 has 'id' => (is => 'rw', isa => 'Non::Regex::Int');
2366
2367 2. If your class uses an "AUTOLOAD()" method to implement its
2368 constructor and that method uses the typical:
2369
2370 $AUTOLOAD =~ s/.*://;
2371
2372 technique. The workaround here is to achieve the same effect
2373 without a regex. For example:
2374
2375 my $last_colon_pos = rindex($AUTOLOAD, ':');
2376 substr $AUTOLOAD, 0, $last_colon_pos+1, q{};
2377
2378 Note that this caveat against using nested regexes also applies to any
2379 code blocks executed inside a rule or token (whether or not those rules
2380 or tokens are object-oriented).
2381
2382 A naming shortcut
2383
2384 If an "<objrule:...>" or "<objtoken:...>" is defined with a class name
2385 that is not followed by "=" and a rule name, then the rule name is
2386 determined automatically from the classname. Specifically, the final
2387 component of the classname (i.e. after the last "::", if any) is used.
2388
2389 For example:
2390
2391 <objrule: LaTeX::Element>
2392 # ...Defines a rule that can be called as <Element>
2393 # ...and which returns a hash-based LaTeX::Element object
2394
<objtoken: LaTeX::Literal>
    # ...Defines a token that can be called as <Literal>
    # ...and which returns a hash-based LaTeX::Literal object
2398
2399 <objtoken: Comment>
2400 # ...Defines a token that can be called as <Comment>
2401 # ...and which returns a hash-based Comment object
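
The naming rule is simple enough to express in a few lines of plain Perl. The helper below is a hypothetical illustration (it is not part of the module's interface), and it deliberately uses "rindex" rather than a regex, in keeping with the caveat above.

```perl
use strict;
use warnings;

# Hypothetical helper: the default rule name is whatever follows the last "::".
sub default_rule_name {
    my ($class_name) = @_;
    my $last_colon_pos = rindex($class_name, ':');    # -1 if there is no "::"
    return substr($class_name, $last_colon_pos + 1);
}

my @names = map { default_rule_name($_) }
            ('LaTeX::Element', 'LaTeX::Literal', 'Comment');

print "$_\n" for @names;    # Element, Literal, Comment
```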
2402
2404 Regexp::Grammars provides a number of features specifically designed to
2405 help debug both grammars and the data they parse.
2406
2407 All debugging messages are written to a log file (which, by default, is
2408 just STDERR). However, you can specify a disk file explicitly by
2409 placing a "<logfile:...>" directive at the start of your grammar:
2410
2411 $grammar = qr{
2412
2413 <logfile: LaTeX_parser_log >
2414
2415 \A <LaTeX_file> \Z # Pattern to match
2416
2417 <rule: LaTeX_file>
2418 # etc.
2419 }x;
2420
2421 You can also explicitly specify that messages go to the terminal:
2422
2423 <logfile: - >
2424
2425 Debugging grammar creation with "<logfile:...>"
2426 Whenever a log file has been directly specified, Regexp::Grammars
2427 automatically does verbose static analysis of your grammar. That is,
2428 whenever it compiles a grammar containing an explicit "<logfile:...>"
2429 directive it logs a series of messages explaining how it has
2430 interpreted the various components of that grammar. For example, the
2431 following grammar:
2432
2433 <logfile: parser_log >
2434
2435 <cmd>
2436
2437 <rule: cmd>
2438 mv <from=file> <to=file>
2439 | cp <source> <[file]> <.comment>?
2440
2441 would produce the following analysis in the 'parser_log' file:
2442
2443 info | Processing the main regex before any rule definitions
2444 | |
2445 | |...Treating <cmd> as:
2446 | | | match the subrule <cmd>
2447 | | \ saving the match in $MATCH{'cmd'}
2448 | |
2449 | \___End of main regex
2450 |
2451 info | Defining a rule: <cmd>
2452 | |...Returns: a hash
2453 | |
2454 | |...Treating ' mv ' as:
2455 | | \ normal Perl regex syntax
2456 | |
2457 | |...Treating <from=file> as:
2458 | | | match the subrule <file>
2459 | | \ saving the match in $MATCH{'from'}
2460 | |
2461 | |...Treating <to=file> as:
2462 | | | match the subrule <file>
2463 | | \ saving the match in $MATCH{'to'}
2464 | |
2465 | |...Treating ' | cp ' as:
2466 | | \ normal Perl regex syntax
2467 | |
2468 | |...Treating <source> as:
2469 | | | match the subrule <source>
2470 | | \ saving the match in $MATCH{'source'}
2471 | |
2472 | |...Treating <[file]> as:
2473 | | | match the subrule <file>
2474 | | \ appending the match to $MATCH{'file'}
2475 | |
2476 | |...Treating <.comment>? as:
2477 | | | match the subrule <comment> if possible
2478 | | \ but don't save anything
2479 | |
2480 | \___End of rule definition
2481
2482 This kind of static analysis is a useful starting point in debugging a
2483 miscreant grammar, because it enables you to see what you actually
2484 specified (as opposed to what you thought you'd specified).
2485
2486 Debugging grammar execution with "<debug:...>"
2487 Regexp::Grammars also provides a simple interactive debugger, with
2488 which you can observe the process of parsing and the data being
2489 collected in any result-hash.
2490
2491 To initiate debugging, place a "<debug:...>" directive anywhere in your
2492 grammar. When parsing reaches that directive the debugger will be
2493 activated, and the command specified in the directive immediately
2494 executed. The available commands are:
2495
2496 <debug: on> - Enable debugging, stop when a rule matches
2497 <debug: match> - Enable debugging, stop when a rule matches
2498 <debug: try> - Enable debugging, stop when a rule is tried
2499 <debug: run> - Enable debugging, run until the match completes
2500 <debug: same> - Continue debugging (or not) as currently
2501 <debug: off> - Disable debugging and continue parsing silently
2502
2503 <debug: continue> - Synonym for <debug: run>
2504 <debug: step> - Synonym for <debug: try>
2505
2506 These directives can be placed anywhere within a grammar and take
2507 effect when that point is reached in the parsing. Hence, adding a
2508 "<debug:step>" directive is very much like setting a breakpoint at that
2509 point in the grammar. Indeed, a common debugging strategy is to turn
2510 debugging on and off only around a suspect part of the grammar:
2511
2512 <rule: tricky> # This is where we think the problem is...
2513 <debug:step>
2514 <preamble> <text> <postscript>
2515 <debug:off>
2516
2517 Once the debugger is active, it steps through the parse, reporting
2518 rules that are tried, matches and failures, backtracking and restarts,
2519 and the parser's location within both the grammar and the text being
2520 matched. That report looks like this:
2521
===============> Trying <grammar> from position 0
> cp file1 file2   |...Trying <cmd>
                   |   |...Trying <cmd=(cp)>
                   |   |    \FAIL <cmd=(cp)>
                   |    \FAIL <cmd>
                    \FAIL <grammar>
===============> Trying <grammar> from position 1
cp file1 file2     |...Trying <cmd>
                   |   |...Trying <cmd=(cp)>
file1 file2        |   |    \_____<cmd=(cp)> matched 'cp'
file1 file2        |   |...Trying <[file]>+
file2              |   |    \_____<[file]>+ matched 'file1'
                   |   |...Trying <[file]>+
[eos]              |   |    \_____<[file]>+ matched ' file2'
                   |   |...Trying <[file]>+
                   |   |    \FAIL <[file]>+
                   |   |...Trying <target>
                   |   |   |...Trying <file>
                   |   |   |    \FAIL <file>
                   |   |    \FAIL <target>
<~~~~~~~~~~~~~~    |   |...Backtracking 5 chars and trying new match
file2              |   |...Trying <target>
                   |   |   |...Trying <file>
                   |   |   |    \____ <file> matched 'file2'
[eos]              |   |    \_____<target> matched 'file2'
                   |    \_____<cmd> matched ' cp file1 file2'
                    \_____<grammar> matched ' cp file1 file2'
2549
2550 The first column indicates the point in the input at which the parser
2551 is trying to match, as well as any backtracking or forward searching it
2552 may need to do. The remainder of the columns track the parser's
2553 hierarchical traversal of the grammar, indicating which rules are
2554 tried, which succeed, and what they match.
2555
2556 Provided the logfile is a terminal (as it is by default), the debugger
2557 also pauses at various points in the parsing process--before trying a
2558 rule, after a rule succeeds, or at the end of the parse--according to
2559 the most recent command issued. When it pauses, you can issue a new
2560 command by entering a single letter:
2561
2562 m - to continue until the next subrule matches
2563 t or s - to continue until the next subrule is tried
2564 r or c - to continue to the end of the grammar
2565 o - to switch off debugging
2566
2567 Note that these are the first letters of the corresponding
2568 "<debug:...>" commands, listed earlier. Just hitting ENTER while the
2569 debugger is paused repeats the previous command.
2570
2571 While the debugger is paused you can also type a 'd', which will
2572 display the result-hash for the current rule. This can be useful for
2573 detecting which rule isn't returning the data you expected.
2574
2575 Resizing the context string
2576
2577 By default, the first column of the debugger output (which shows the
2578 current matching position within the string) is limited to a width of
2579 20 columns.
2580
However, you can change that limit by calling the
"Regexp::Grammars::set_context_width()" subroutine. You have to specify
the fully qualified name, however, as Regexp::Grammars does not export
this (or any other) subroutine.
2585
2586 "set_context_width()" expects a single argument: a positive integer
2587 indicating the maximal allowable width for the context column. It
2588 issues a warning if an invalid value is passed, and ignores it.
2589
2590 If called in a void context, "set_context_width()" changes the context
2591 width permanently throughout your application. If called in a scalar or
2592 list context, "set_context_width()" returns an object whose destructor
2593 will cause the context width to revert to its previous value. This
2594 means you can temporarily change the context width within a given block
2595 with something like:
2596
2597 {
2598 my $temporary = Regexp::Grammars::set_context_width(50);
2599
2600 if ($text =~ $parser) {
2601 do_stuff_with( %/ );
2602 }
2603
2604 } # <--- context width automagically reverts at this point
2605
2606 and the context width will change back to its previous value when
2607 $temporary goes out of scope at the end of the block.
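
The mechanism behind this (a returned object whose destructor undoes the change) is a general Perl idiom. The sketch below shows the idiom in isolation, under stated assumptions: "$width", "set_width()", and "Local::Restorer" are hypothetical stand-ins, and none of this is the module's actual code.

```perl
use strict;
use warnings;

my $width = 20;    # stand-in for an internal setting

{
    package Local::Restorer;    # hypothetical guard class
    sub new     { my ($class, $undo) = @_; return bless { undo => $undo }, $class }
    sub DESTROY { $_[0]{undo}->() }    # run the undo action when the guard dies
}

sub set_width {
    my ($new_width) = @_;
    my $old_width = $width;
    $width = $new_width;

    # Void context: the change is permanent, so no guard is needed...
    return if !defined wantarray;

    # Otherwise, hand back a guard that restores the old value...
    return Local::Restorer->new(sub { $width = $old_width });
}

my @seen;
{
    my $temporary = set_width(50);
    push @seen, $width;    # 50 while $temporary is alive
}
push @seen, $width;        # back to 20 once the guard is destroyed

set_width(30);             # void context
push @seen, $width;        # 30, and it stays that way

print "@seen\n";           # 50 20 30
```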
2608
2609 User-defined logging with "<log:...>"
2610 Both static and interactive debugging send a series of predefined log
2611 messages to whatever log file you have specified. It is also possible
2612 to send additional, user-defined messages to the log, using the
2613 "<log:...>" directive.
2614
This directive expects either simple text or a code block as its
single argument. If the argument is a code block, that code is expected
to return the text of the message; if the argument is anything else,
that something else is the literal message. For example:
2619
2620 <rule: ListElem>
2621
2622 <Elem= ( [a-z]\d+) >
2623 <log: Checking for a suffix, too...>
2624
2625 <Suffix= ( : \d+ ) >?
2626 <log: (?{ "ListElem: $MATCH{Elem} and $MATCH{Suffix}" })>
2627
2628 User-defined log messages implemented using a codeblock can also
2629 specify a severity level. If the codeblock of a "<log:...>" directive
2630 returns two or more values, the first is treated as a log message
2631 severity indicator, and the remaining values as separate lines of text
2632 to be logged. For example:
2633
2634 <rule: ListElem>
2635 <Elem= ( [a-z]\d+) >
2636 <Suffix= ( : \d+ ) >?
2637
2638 <log: (?{
2639 warn => "Elem was: $MATCH{Elem}",
2640 "Suffix was $MATCH{Suffix}",
2641 })>
2642
2643 When they are encountered, user-defined log messages are interspersed
2644 between any automatic log messages (i.e. from the debugger), at the
2645 correct level of nesting for the current rule.
2646
2647 Debugging non-grammars
2648 [Note that, with the release in 2012 of the Regexp::Debugger module (on
2649 CPAN) the techniques described below are unnecessary. If you need to
2650 debug plain Perl regexes, use Regexp::Debugger instead.]
2651
2652 It is possible to use Regexp::Grammars without creating any subrule
2653 definitions, simply to debug a recalcitrant regex. For example, if the
2654 following regex wasn't working as expected:
2655
2656 my $balanced_brackets = qr{
2657 \( # left delim
2658 (?:
2659 \\ # escape or
2660 | (?R) # recurse or
2661 | . # whatever
2662 )*
2663 \) # right delim
2664 }xms;
2665
2666 you could instrument it with aliased subpatterns and then debug it
2667 step-by-step, using Regexp::Grammars:
2668
    use 5.010;              # enables say()
    use Regexp::Grammars;

    my $balanced_brackets = qr{
        <debug:step>

        <.left_delim= ( \( )>
        (?:
            <.escape= ( \\ )>
        |   <.recurse= ( (?R) )>
        |   <.whatever=( . )>
        )*
        <.right_delim= ( \) )>
    }xms;

    while (<>) {
        say 'matched' if /$balanced_brackets/;
    }
2686
2687 Note the use of amnesiac aliased subpatterns to avoid needlessly
2688 building a result-hash. Alternatively, you could use listifying aliases
2689 to preserve the matching structure as an additional debugging aid:
2690
2691 use Regexp::Grammars;
2692
2693 my $balanced_brackets = qr{
2694 <debug:step>
2695
2696 <[left_delim= ( \( )]>
2697 (?:
2698 <[escape= ( \\ )]>
2699 | <[recurse= ( (?R) )]>
2700 | <[whatever=( . )]>
2701 )*
2702 <[right_delim= ( \) )]>
2703 }xms;
2704
2705 if ( '(a(bc)d)' =~ /$balanced_brackets/) {
2706 use Data::Dumper 'Dumper';
2707 warn Dumper \%/;
2708 }
2709
2711 Assuming you have correctly debugged your grammar, the next source of
2712 problems will probably be invalid input (especially if that input is
2713 being provided interactively). So Regexp::Grammars also provides some
2714 support for detecting when a parse is likely to fail...and informing
2715 the user why.
2716
2717 Requirements
The "<require:...>" directive is useful for testing conditions that
it's not easy (or even possible) to check within the syntax of the
regex itself. For example:
2721
    <rule: IPV4_Octet_Decimal>
        # Up to three digits...
        <MATCH= ( \d{1,3}+ )>

        # ...but less than 256...
        <require: (?{ $MATCH <= 255 })>
2728
2729 A require expects a regex codeblock as its argument and succeeds if the
2730 final value of that codeblock is true. If the final value is false, the
2731 directive fails and the rule starts backtracking.
2732
Note that, in this example, the digits are matched with " \d{1,3}+ ".
The trailing "+" prevents the "{1,3}" repetition from backtracking to a
smaller number of digits if the "<require:...>" fails.
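
The effect of the trailing "+" can be observed without the module. The
following is a minimal sketch in plain Perl, in which a conditional
"(*FAIL)" stands in for the "<require:...>" check; the variable names
are illustrative only and are not part of Regexp::Grammars:

```perl
use strict;
use warnings;

# Stand-in for <require: (?{ $MATCH <= 255 })>: fail whenever the text
# just captured (available in $^N) is numerically greater than 255.

# {1,3} may backtrack: after '256' is rejected, it retries with '25'
my $backtracking = qr/\A (\d{1,3})  (?(?{ $^N > 255 }) (*FAIL) )/x;

# {1,3}+ is possessive: once '256' is rejected, the match simply fails
my $possessive   = qr/\A (\d{1,3}+) (?(?{ $^N > 255 }) (*FAIL) )/x;

for my $octet ('192', '256') {
    my $loose  = $octet =~ $backtracking ? "matched '$1'" : 'no match';
    my $strict = $octet =~ $possessive   ? "matched '$1'" : 'no match';
    print "$octet: backtracking $loose; possessive $strict\n";
}
```

For the input '256', the backtracking version quietly accepts the
prefix '25', which is rarely what is wanted when validating an octet;
the possessive version rejects the input outright.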
2736
2737 Handling failure
The module has limited support for error reporting from within a
grammar, in the form of the "<error:...>" and "<warning:...>"
directives and their shortcuts: "<...>", "<!!!>", and "<???>".
2741
2742 Error messages
2743
2744 The "<error: MSG>" directive queues a conditional error message within
2745 "@!" and then fails to match (that is, it is equivalent to a "(?!)"
2746 when matching). For example:
2747
2748 <rule: ListElem>
2749 <SerialNumber>
2750 | <ClientName>
2751 | <error: (?{ $errcount++ . ': Missing list element' })>
2752
2753 So a common code pattern when using grammars that do this kind of error
2754 detection is:
2755
2756 if ($text =~ $grammar) {
2757 # Do something with the data collected in %/
2758 }
2759 else {
2760 say {*STDERR} $_ for @!; # i.e. report all errors
2761 }
2762
2763 Each error message is conditional in the sense that, if any surrounding
2764 rule subsequently matches, the message is automatically removed from
2765 "@!". This implies that you can queue up as many error messages as you
2766 wish, but they will only remain in "@!" if the match ultimately fails.
2767 Moreover, only those error messages originating from rules that
2768 actually contributed to the eventual failure-to-match will remain in
2769 "@!".
2770
2771 If a code block is specified as the argument, the error message is
2772 whatever final value is produced when the block is executed. Note that
2773 this final value does not have to be a string (though it does have to
2774 be a scalar).
2775
2776 <rule: ListElem>
2777 <SerialNumber>
2778 | <ClientName>
2779 | <error: (?{
2780 # Return a hash, with the error information...
2781 { errnum => $errcount++, msg => 'Missing list element' }
2782 })>
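
Because queued errors may be either strings or other scalars such as
hash references, reporting code needs to handle both. The following
plain-Perl sketch assumes a hypothetical format_error() helper and
uses a hand-built list in place of the real contents of "@!":

```perl
use strict;
use warnings;

# Hypothetical contents of @! after a failed match: one plain string
# and one hash-based error of the kind returned by the code block above.
my @errors = (
    'Expected a valid expression',
    { errnum => 3, msg => 'Missing list element' },
);

# Render each queued error as a single line of text
sub format_error {
    my ($error) = @_;
    return ref($error) eq 'HASH'
        ? "Error $error->{errnum}: $error->{msg}"
        : "Error: $error";
}

print {*STDERR} format_error($_), "\n" for @errors;
```

In a real program the loop would iterate over "@!" itself after an
unsuccessful match.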
2783
2784 If anything else is specified as the argument, it is treated as a
2785 literal error string (and may not contain an unbalanced '<' or '>', nor
2786 any interpolated variables).
2787
2788 However, if the literal error string begins with "Expected " or
2789 "Expecting ", then the error string automatically has the following
2790 "context suffix" appended:
2791
2792 , but found '$CONTEXT' instead
2793
2794 For example:
2795
2796 qr{ <Arithmetic_Expression> # ...Match arithmetic expression
2797 | # Or else
2798 <error: Expected a valid expression> # ...Report error, and fail
2799
2800 # Rule definitions here...
2801 }xms;
2802
2803 On an invalid input this example might produce an error message like:
2804
2805 "Expected a valid expression, but found '(2+3]*7/' instead"
2806
2807 The value of the special $CONTEXT variable is found by looking ahead in
2808 the string being matched against, to locate the next sequence of non-
2809 blank characters after the current parsing position. This variable may
2810 also be explicitly used within the "<error: (?{...})>" form of the
2811 directive.
2812
2813 As a special case, if you omit the message entirely from the directive,
2814 it is supplied automatically, derived from the name of the current
2815 rule. For example, if the following rule were to fail to match:
2816
2817 <rule: Arithmetic_expression>
2818 <Multiplicative_Expression>+ % ([+-])
2819 | <error:>
2820
2821 the error message queued would be:
2822
2823 "Expected arithmetic expression, but found 'one plus two' instead"
2824
2825 Note however, that it is still essential to include the colon in the
2826 directive. A common mistake is to write:
2827
2828 <rule: Arithmetic_expression>
2829 <Multiplicative_Expression>+ % ([+-])
2830 | <error>
2831
2832 which merely attempts to call "<rule: error>" if the first alternative
2833 fails.
2834
2835 Warning messages
2836
2837 Sometimes, you want to detect problems, but not invalidate the entire
2838 parse as a result. For those occasions, the module provides a "less
2839 stringent" form of error reporting: the "<warning:...>" directive.
2840
2841 This directive is exactly the same as an "<error:...>" in every respect
2842 except that it does not induce a failure to match at the point it
2843 appears.
2844
2845 The directive is, therefore, useful for reporting non-fatal problems in
2846 a parse. For example:
2847
2848 qr{ \A # ...Match only at start of input
2849 <ArithExpr> # ...Match a valid arithmetic expression
2850
2851 (?:
2852 # Should be at end of input...
2853 \s* \Z
2854 |
2855 # If not, report the fact but don't fail...
2856 <warning: Expected end-of-input>
2857 <warning: (?{ "Extra junk at index $INDEX: $CONTEXT" })>
2858 )
2859
2860 # Rule definitions here...
2861 }xms;
2862
2863 Note that, because they do not induce failure, two or more
2864 "<warning:...>" directives can be "stacked" in sequence, as in the
2865 previous example.
2866
2867 Stubbing
2868
2869 The module also provides three useful shortcuts, specifically to make
2870 it easy to declare, but not define, rules and tokens.
2871
The "<...>" and "<!!!>" directives are equivalent to the directive:
2873
2874 <error: Cannot match RULENAME (not implemented)>
2875
2876 The "<???>" is equivalent to the directive:
2877
2878 <warning: Cannot match RULENAME (not implemented)>
2879
2880 For example, in the following grammar:
2881
2882 <grammar: List::Generic>
2883
2884 <rule: List>
2885 <[Item]>+ % (\s*,\s*)
2886
2887 <rule: Item>
2888 <...>
2889
the "Item" rule is declared but not defined. That means the grammar
will compile correctly (the "List" rule won't complain about a call to
a non-existent "Item"), but if the "Item" rule isn't overridden in some
derived grammar, a match-time error will occur when "List" tries to
match the "<...>" within "Item".
2895
2896 Localizing the (semi-)automatic error messages
2897
2898 Error directives of any of the following forms:
2899
2900 <error: Expecting identifier>
2901
2902 <error: >
2903
2904 <...>
2905
2906 <!!!>
2907
2908 or their warning equivalents:
2909
2910 <warning: Expecting identifier>
2911
2912 <warning: >
2913
2914 <???>
2915
2916 each autogenerate part or all of the actual error message they produce.
2917 By default, that autogenerated message is always produced in English.
2918
2919 However, the module provides a mechanism by which you can intercept
2920 every error or warning that is queued to "@!" via these
2921 directives...and localize those messages.
2922
2923 To do this, you call "Regexp::Grammars::set_error_translator()" (with
2924 the full qualification, since Regexp::Grammars does not export it...nor
2925 anything else, for that matter).
2926
The "set_error_translator()" subroutine expects a single argument,
which must be a reference to another subroutine. This subroutine is
then called whenever an error or warning message is queued to "@!".
2930
2931 The subroutine is passed three arguments:
2932
2933 · the message string,
2934
2935 · the name of the rule from which the error or warning was queued,
2936 and
2937
2938 · the value of $CONTEXT when the error or warning was encountered
2939
2940 The subroutine is expected to return the final version of the message
2941 that is actually to be appended to "@!". To accomplish this it may make
2942 use of one of the many internationalization/localization modules
2943 available in Perl, or it may do the conversion entirely by itself.
2944
2945 The first argument is always exactly what appeared as a message in the
2946 original directive (regardless of whether that message is supposed to
2947 trigger autogeneration, or is just a "regular" error message). That
2948 is:
2949
2950 Directive 1st argument
2951
2952 <error: Expecting identifier> "Expecting identifier"
2953 <warning: That's not a moon!> "That's not a moon!"
2954 <error: > ""
2955 <warning: > ""
2956 <...> ""
2957 <!!!> ""
2958 <???> ""
2959
2960 The second argument always contains the name of the rule in which the
2961 directive was encountered. For example, when invoked from within
2962 "<rule: Frinstance>" the following directives produce:
2963
2964 Directive 2nd argument
2965
2966 <error: Expecting identifier> "Frinstance"
2967 <warning: That's not a moon!> "Frinstance"
2968 <error: > "Frinstance"
2969 <warning: > "Frinstance"
2970 <...> "-Frinstance"
2971 <!!!> "-Frinstance"
2972 <???> "-Frinstance"
2973
2974 Note that the "unimplemented" markers pass the rule name with a
2975 preceding '-'. This allows your translator to distinguish between
2976 "empty" messages (which should then be generated automatically) and the
2977 "unimplemented" markers (which should report that the rule is not yet
2978 properly defined).
2979
2980 If you call "Regexp::Grammars::set_error_translator()" in a void
2981 context, the error translator is permanently replaced (at least, until
2982 the next call to "set_error_translator()").
2983
2984 However, if you call "Regexp::Grammars::set_error_translator()" in a
2985 scalar or list context, it returns an object whose destructor will
2986 restore the previous translator. This allows you to install a
2987 translator only within a given scope, like so:
2988
2989 {
2990 my $temporary
2991 = Regexp::Grammars::set_error_translator(\&my_translator);
2992
2993 if ($text =~ $parser) {
2994 do_stuff_with( %/ );
2995 }
2996 else {
2997 report_errors_in( @! );
2998 }
2999
3000 } # <--- error translator automagically reverts at this point
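
The scoped behaviour described above rests on a general Perl idiom:
return an object whose destructor undoes the change. The following is
a minimal sketch of that idiom, not the module's actual
implementation; My::ScopeGuard, set_translator(), and $current are
hypothetical names used only for illustration:

```perl
use strict;
use warnings;

{
    # Hypothetical guard class: runs a callback when it is destroyed
    package My::ScopeGuard;

    sub new {
        my ($class, $on_exit) = @_;
        return bless { on_exit => $on_exit }, $class;
    }

    sub DESTROY {
        my ($self) = @_;
        $self->{on_exit}->();
    }
}

my $current = sub { $_[0] };        # stand-in for the installed translator

sub set_translator {
    my ($new) = @_;
    my $previous = $current;
    $current = $new;

    # Void context: nobody holds a guard, so the change is permanent
    return if !defined wantarray;

    # Scalar or list context: the caller receives a guard object
    # whose destruction restores the previous setting
    return My::ScopeGuard->new(sub { $current = $previous });
}

{
    my $temporary = set_translator(sub { uc $_[0] });
    print $current->('inside block'), "\n";     # prints "INSIDE BLOCK"
}
print $current->('after block'), "\n";          # prints "after block"
```

The guard must be stored in a lexical variable; if the return value
were discarded immediately, the restoration would happen at once.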
3001
3002 Warning: any error translation subroutine you install will be called
3003 during the grammar's parsing phase (i.e. as the grammar's regex is
3004 matching). You should therefore ensure that your translator does not
3005 itself use regular expressions, as nested evaluations of regexes inside
3006 other regexes are extremely problematical (i.e. almost always
3007 disastrous) in Perl.
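
A translator that respects this restriction can be written using only
string functions. The following is a sketch under stated assumptions:
my_translator() is a hypothetical subroutine, the German wording is
purely illustrative, and the handling of the three arguments follows
the tables given earlier in this section:

```perl
use strict;
use warnings;

# Hypothetical translator producing German messages. It receives
# (message, rule name, context) and returns the text to queue in @!.
# Only substr(), index(), lc(), and tr/// are used: no regex matching.
sub my_translator {
    my ($message, $rule, $context) = @_;

    # <...>, <!!!>, and <???> pass the rule name with a leading '-'
    if (substr($rule, 0, 1) eq '-') {
        return 'Regel ' . substr($rule, 1) . ' ist noch nicht implementiert';
    }

    # <error: > and <warning: > pass an empty message, so describe
    # the expected item using the rule name (underscores become spaces)
    if ($message eq '') {
        (my $expected = lc $rule) =~ tr/_/ /;
        return "Erwartet: $expected, gefunden: '$context'";
    }

    # "Expected ..." and "Expecting ..." messages get a context suffix
    for my $prefix ('Expected ', 'Expecting ') {
        next if index($message, $prefix) != 0;
        my $expected = substr($message, length $prefix);
        return "Erwartet: $expected, gefunden: '$context'";
    }

    return $message;    # any other message is passed through unchanged
}

print my_translator('', 'Arithmetic_expression', 'one plus two'), "\n";
print my_translator('', '-Item', ''), "\n";
```

Such a subroutine would be installed by passing "\&my_translator" to
"Regexp::Grammars::set_error_translator()", as in the scoped example
above.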
3008
3009 Restricting how long a parse runs
3010 Like the core Perl 5 regex engine on which they are built, the grammars
3011 implemented by Regexp::Grammars are essentially top-down parsers. This
3012 means that they may occasionally require an exponentially long time to
3013 parse a particular input. This usually occurs if a particular grammar
3014 includes a lot of recursion or nested backtracking, especially if the
3015 grammar is then matched against a long string.
3016
3017 The judicious use of non-backtracking repetitions (i.e. "x*+" and
3018 "x++") can significantly improve parsing performance in many such
3019 cases. Likewise, carefully reordering any high-level alternatives (so
3020 as to test simple common cases first) can substantially reduce parsing
3021 times.
3022
3023 However, some languages are just intrinsically slow to parse using top-
3024 down techniques (or, at least, may have slow-to-parse corner cases).
3025
3026 To help cope with this constraint, Regexp::Grammars provides a
3027 mechanism by which you can limit the total effort that a given grammar
3028 will expend in attempting to match. The "<timeout:...>" directive
3029 allows you to specify how long a grammar is allowed to continue trying
3030 to match before giving up. It expects a single argument, which must be
3031 an unsigned integer, and it treats this integer as the number of
3032 seconds to continue attempting to match.
3033
3034 For example:
3035
3036 <timeout: 10> # Give up after 10 seconds
3037
3038 indicates that the grammar should keep attempting to match for another
3039 10 seconds from the point where the directive is encountered during a
3040 parse. If the complete grammar has not matched in that time, the entire
3041 match is considered to have failed, the matching process is immediately
3042 terminated, and a standard error message ('Internal error: Timed out
3043 after 10 seconds (as requested)') is returned in "@!".
3044
3045 A "<timeout:...>" directive can be placed anywhere in a grammar, but is
3046 most usually placed at the very start, so that the entire grammar is
3047 governed by the specified time limit. The second most common
3048 alternative is to place the timeout at the start of a particular
3049 subrule that is known to be potentially very slow.
3050
3051 A common mistake is to put the timeout specification at the top level
3052 of the grammar, but place it after the actual subrule to be matched,
3053 like so:
3054
3055 my $grammar = qr{
3056
3057 <Text_Corpus> # Subrule to be matched
3058 <timeout: 10> # Useless use of timeout
3059
3060 <rule: Text_Corpus>
3061 # et cetera...
3062 }xms;
3063
3064 Since the parser will only reach the "<timeout: 10>" directive after it
3065 has completely matched "<Text_Corpus>", the timeout is only initiated
3066 at the very end of the matching process and so does not limit that
3067 process in any useful way.
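
For comparison, here is a sketch of the same grammar with the
directive placed first, so that the clock starts before
"<Text_Corpus>" is attempted. The body given for "Text_Corpus" is a
placeholder invented for this illustration, and the example requires
Regexp::Grammars to be installed:

```perl
use strict;
use warnings;

my $grammar = do {
    use Regexp::Grammars;
    qr{
        <timeout: 10>       # Start the clock before any matching begins
        <Text_Corpus>       # Subrule to be matched (now time-limited)

        <token: Text_Corpus>
            (?: \w+ \s* )+  # Placeholder definition, for illustration only
    }xms;
};

my $text = 'some sample text';

if ($text =~ $grammar) {
    print 'Parsed: ', $/{Text_Corpus}, "\n";
}
else {
    print {*STDERR} "$_\n" for @!;   # would include any timeout message
}
```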
3068
3069 Immediate timeouts
3070
3071 As you might expect, a "<timeout: 0>" directive tells the parser to
3072 keep trying for only zero more seconds, and therefore will immediately
3073 cause the entire surrounding grammar to fail (no matter how deeply
3074 within that grammar the directive is encountered).
3075
This can occasionally be extremely useful. If you know that detecting
3077 a particular datum means that the grammar will never match, no matter
3078 how many other alternatives may subsequently be tried, you can short-
3079 circuit the parser by injecting a "<timeout: 0>" immediately after the
3080 offending datum is detected.
3081
3082 For example, if your grammar only accepts certain versions of the
3083 language being parsed, you could write:
3084
3085 <rule: Valid_Language_Version>
3086 vers = <%AcceptableVersions>
3087 |
3088 vers = <bad_version=(\S++)>
3089 <warning: (?{ "Cannot parse language version $MATCH{bad_version}" })>
3090 <timeout: 0>
3091
3092 In fact, this "<warning: MSG> <timeout: 0>" sequence is sufficiently
3093 useful, sufficiently complex, and sufficiently easy to get wrong, that
3094 Regexp::Grammars provides a handy shortcut for it: the "<fatal:...>"
3095 directive. A "<fatal:...>" is exactly equivalent to a "<warning:...>"
3096 followed by a zero-timeout, so the previous example could also be
3097 written:
3098
3099 <rule: Valid_Language_Version>
3100 vers = <%AcceptableVersions>
3101 |
3102 vers = <bad_version=(\S++)>
3103 <fatal: (?{ "Cannot parse language version $MATCH{bad_version}" })>
3104
3105 Like "<error:...>" and "<warning:...>", "<fatal:...>" also provides its
3106 own failure context in $CONTEXT, so the previous example could be
3107 further simplified to:
3108
3109 <rule: Valid_Language_Version>
3110 vers = <%AcceptableVersions>
3111 |
3112 vers = <fatal:(?{ "Cannot parse language version $CONTEXT" })>
3113
3114 Also like "<error:...>", "<fatal:...>" can autogenerate an error
3115 message if none is provided, so the example could be still further
3116 reduced to:
3117
3118 <rule: Valid_Language_Version>
3119 vers = <%AcceptableVersions>
3120 |
3121 vers = <fatal:>
3122
3123 In this last case, however, the error message returned in "@!" would no
3124 longer be:
3125
3126 Cannot parse language version 0.95
3127
3128 It would now be:
3129
3130 Expected valid language version, but found '0.95' instead
3131
3133 If you intend to use a grammar as part of a larger program that
3134 contains other (non-grammatical) regexes, it is more efficient--and
3135 less error-prone--to avoid having Regexp::Grammars process those
3136 regexes as well. So it's often a good idea to declare your grammar in a
3137 "do" block, thereby restricting the scope of the module's effects.
3138
3139 For example:
3140
3141 my $grammar = do {
3142 use Regexp::Grammars;
3143 qr{
3144 <file>
3145
3146 <rule: file>
3147 <prelude>
3148 <data>
3149 <postlude>
3150
3151 <rule: prelude>
3152 # etc.
3153 }x;
3154 };
3155
3156 Because the effects of Regexp::Grammars are lexically scoped, any
3157 regexes defined outside that "do" block will be unaffected by the
3158 module.
3159
3161 Perl API
3162 "use Regexp::Grammars;"
3163 Causes all regexes in the current lexical scope to be compile-time
3164 processed for grammar elements.
3165
3166 "$str =~ $grammar"
3167 "$str =~ /$grammar/"
3168 Attempt to match the grammar against the string, building a nested
3169 data structure from it.
3170
3171 "%/"
3172 This hash is assigned the nested data structure created by any
3173 successful match of a grammar regex.
3174
3175 "@!"
3176 This array is assigned the queue of error messages created by any
3177 unsuccessful match attempt of a grammar regex.
3178
3179 Grammar syntax
3180 Directives
3181
3182 "<rule: IDENTIFIER>"
3183 Define a rule whose name is specified by the supplied identifier.
3184
3185 Everything following the "<rule:...>" directive (up to the next
3186 "<rule:...>" or "<token:...>" directive) is treated as part of the
3187 rule being defined.
3188
3189 Any whitespace in the rule is replaced by a call to the "<.ws>"
3190 subrule (which defaults to matching "\s*", but may be explicitly
3191 redefined).
3192
3193 "<token: IDENTIFIER>"
3194 Define a rule whose name is specified by the supplied identifier.
3195
3196 Everything following the "<token:...>" directive (up to the next
3197 "<rule:...>" or "<token:...>" directive) is treated as part of the
3198 rule being defined.
3199
3200 Any whitespace in the rule is ignored (under the "/x" modifier), or
3201 explicitly matched (if "/x" is not used).
3202
3203 "<objrule: IDENTIFIER>"
3204 "<objtoken: IDENTIFIER>"
3205 Identical to a "<rule: IDENTIFIER>" or "<token: IDENTIFIER>"
3206 declaration, except that the rule or token will also bless the hash
3207 it normally returns, converting it to an object of a class whose
3208 name is the same as the rule or token itself.
3209
3210 "<require: (?{ CODE }) >"
3211 The code block is executed and if its final value is true, matching
3212 continues from the same position. If the block's final value is
3213 false, the match fails at that point and starts backtracking.
3214
3215 "<error: (?{ CODE }) >"
3216 "<error: LITERAL TEXT >"
3217 "<error: >"
3218 This directive queues a conditional error message within the global
3219 special variable "@!" and then fails to match at that point (that
3220 is, it is equivalent to a "(?!)" or "(*FAIL)" when matching).
3221
3222 "<fatal: (?{ CODE }) >"
3223 "<fatal: LITERAL TEXT >"
3224 "<fatal: >"
This directive is exactly the same as an "<error:...>" in every
respect except that it immediately causes the entire surrounding
grammar to fail, and parsing to cease immediately.
3228
3229 "<warning: (?{ CODE }) >"
3230 "<warning: LITERAL TEXT >"
3231 This directive is exactly the same as an "<error:...>" in every
3232 respect except that it does not induce a failure to match at the
3233 point it appears. That is, it is equivalent to a "(?=)" ["succeed
3234 and continue matching"], rather than a "(?!)" ["fail and
3235 backtrack"].
3236
3237 "<debug: COMMAND >"
3238 During the matching of grammar regexes send debugging and warning
3239 information to the specified log file (see "<logfile: LOGFILE>").
3240
3241 The available "COMMAND"'s are:
3242
    <debug: continue>    ___ Debug until end of complete parse
    <debug: run>         _/

    <debug: on>          ___ Debug until next subrule match
    <debug: match>       _/

    <debug: try>         ___ Debug until next subrule call or match
    <debug: step>        _/

    <debug: same>        ___ Maintain current debugging mode

    <debug: off>         ___ No debugging
3255
3256 See also the $DEBUG special variable.
3257
3258 "<logfile: LOGFILE>"
3259 "<logfile: - >"
3260 During the compilation of grammar regexes, send debugging and
3261 warning information to the specified LOGFILE (or to *STDERR if "-"
3262 is specified).
3263
3264 If the specified LOGFILE name contains a %t, it is replaced with a
3265 (sortable) "YYYYMMDD.HHMMSS" timestamp. For example:
3266
3267 <logfile: test-run-%t >
3268
3269 executed at around 9.30pm on the 21st of March 2009, would generate
3270 a log file named: "test-run-20090321.213056"
3271
3272 "<log: (?{ CODE }) >"
3273 "<log: LITERAL TEXT >"
3274 Append a message to the log file. If the argument is a code block,
3275 that code is expected to return the text of the message; if the
3276 argument is anything else, that something else is the literal
3277 message.
3278
3279 If the block returns two or more values, the first is treated as a
3280 log message severity indicator, and the remaining values as
3281 separate lines of text to be logged.
3282
3283 "<timeout: INT >"
Restrict the match-time of the parse to the specified number of
seconds. Queues an error message and terminates the entire match
process if the parse does not complete within the nominated time
limit.
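
The "YYYYMMDD.HHMMSS" layout that replaces %t in a "<logfile:...>"
name (see above) can be reproduced with the core POSIX module. This
sketch only illustrates the format; how the module itself builds the
timestamp, and whether it uses local time, are assumptions not
confirmed here:

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Fields for 21 March 2009, 21:30:56, in the order strftime() expects:
# seconds, minutes, hours, day of month, month (0-based), year - 1900
my @when = (56, 30, 21, 21, 2, 109);

my $stamp = strftime('%Y%m%d.%H%M%S', @when);

# Substitute the timestamp for the %t placeholder in the file name
(my $name = 'test-run-%t') =~ s/%t/$stamp/;

print "$name\n";    # prints "test-run-20090321.213056"
```

Because the fields run from most to least significant, sorting such
file names alphabetically also sorts them chronologically.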
3288
3289 Subrule calls
3290
3291 "<IDENTIFIER>"
3292 Call the subrule whose name is IDENTIFIER.
3293
3294 If it matches successfully, save the hash it returns in the current
3295 scope's result-hash, under the key 'IDENTIFIER'.
3296
"<IDENTIFIER_1=IDENTIFIER_2>"
Call the subrule whose name is IDENTIFIER_2.

If it matches successfully, save the hash it returns in the current
scope's result-hash, under the key 'IDENTIFIER_1'.

In other words, the "IDENTIFIER_1=" prefix changes the key under
which the result of calling a subrule is stored.
3305
3306 "<.IDENTIFIER>"
3307 Call the subrule whose name is IDENTIFIER. Don't save the hash it
3308 returns.
3309
3310 In other words, the "dot" prefix disables saving of subrule
3311 results.
3312
3313 "<IDENTIFIER= ( PATTERN )>"
3314 Match the subpattern PATTERN.
3315
3316 If it matches successfully, capture the substring it matched and
3317 save that substring in the current scope's result-hash, under the
3318 key 'IDENTIFIER'.
3319
3320 "<.IDENTIFIER= ( PATTERN )>"
3321 Match the subpattern PATTERN. Don't save the substring it matched.
3322
3323 "<IDENTIFIER= %HASH>"
Match a sequence of non-whitespace characters, then verify that the
sequence is a key in the specified hash.
3326
3327 If it matches successfully, capture the sequence it matched and
3328 save that substring in the current scope's result-hash, under the
3329 key 'IDENTIFIER'.
3330
3331 "<%HASH>"
3332 Match a key from the hash. Don't save the substring it matched.
3333
3334 "<IDENTIFIER= (?{ CODE })>"
3335 Execute the specified CODE.
3336
3337 Save the result (of the final expression that the CODE evaluates)
3338 in the current scope's result-hash, under the key 'IDENTIFIER'.
3339
3340 "<[IDENTIFIER]>"
3341 Call the subrule whose name is IDENTIFIER.
3342
If it matches successfully, append the hash it returns to a nested
array within the current scope's result-hash, under the key
'IDENTIFIER'.
3346
"<[IDENTIFIER_1=IDENTIFIER_2]>"
Call the subrule whose name is IDENTIFIER_2.

If it matches successfully, append the hash it returns to a nested
array within the current scope's result-hash, under the key
'IDENTIFIER_1'.
3353
3354 "<ANY_SUBRULE>+ % <ANY_OTHER_SUBRULE>"
3355 "<ANY_SUBRULE>* % <ANY_OTHER_SUBRULE>"
3356 "<ANY_SUBRULE>+ % (PATTERN)"
3357 "<ANY_SUBRULE>* % (PATTERN)"
3358 Repeatedly call the first subrule. Keep matching as long as the
3359 subrule matches, provided successive matches are separated by
3360 matches of the second subrule or the pattern.
3361
3362 In other words, match a list of ANY_SUBRULE's separated by
3363 ANY_OTHER_SUBRULE's or PATTERN's.
3364
3365 Note that, if a pattern is used to specify the separator, it must
3366 be specified in some kind of matched parentheses. These may be
3367 capturing ["(...)"], non-capturing ["(?:...)"], non-backtracking
3368 ["(?>...)"], or any other construct enclosed by an opening and
3369 closing paren.
3370
3371 "<ANY_SUBRULE>+ %% <ANY_OTHER_SUBRULE>"
3372 "<ANY_SUBRULE>* %% <ANY_OTHER_SUBRULE>"
3373 "<ANY_SUBRULE>+ %% (PATTERN)"
3374 "<ANY_SUBRULE>* %% (PATTERN)"
3375 Repeatedly call the first subrule. Keep matching as long as the
3376 subrule matches, provided successive matches are separated by
3377 matches of the second subrule or the pattern.
3378
3379 Also allow an optional final trailing instance of the second
3380 subrule or pattern (this is where "%%" differs from "%").
3381
3382 In other words, match a list of ANY_SUBRULE's separated by
3383 ANY_OTHER_SUBRULE's or PATTERN's, with a possible final separator.
3384
3385 As for the single "%" operator, if a pattern is used to specify the
3386 separator, it must be specified in some kind of matched
3387 parentheses. These may be capturing ["(...)"], non-capturing
3388 ["(?:...)"], non-backtracking ["(?>...)"], or any other construct
3389 enclosed by an opening and closing paren.
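
The difference between "%" and "%%" can be pictured in terms of
ordinary regexes. The following plain-Perl sketch shows only the
shape of what is matched (the module additionally collects results);
$item and $sep are illustrative stand-ins for the subrule and the
separator pattern:

```perl
use strict;
use warnings;

my $item = qr/\d+/;           # stands in for <ANY_SUBRULE>
my $sep  = qr/\s*,\s*/;       # stands in for (PATTERN)

# <ANY_SUBRULE>+ % (PATTERN): items separated by separators
my $separated  = qr/\A $item (?: $sep $item )*       \z/x;

# <ANY_SUBRULE>+ %% (PATTERN): same, plus an optional trailing separator
my $terminated = qr/\A $item (?: $sep $item )* $sep? \z/x;

for my $text ('1, 2, 3', '1, 2, 3,') {
    my $single = $text =~ $separated  ? 'matches' : 'fails';
    my $double = $text =~ $terminated ? 'matches' : 'fails';
    print "'$text': % $single, %% $double\n";
}
```

Only the "%%" form accepts the input with the trailing comma. The "*"
variants behave the same way, except that the entire list may also be
empty.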
3390
3391 Special variables within grammar actions
3392 $CAPTURE
3393 $CONTEXT
3394 These are both aliases for the built-in read-only $^N variable,
3395 which always contains the substring matched by the nearest
3396 preceding "(...)" capture. $^N still works perfectly well, but
3397 these are provided to improve the readability of code blocks and
3398 error messages respectively.
3399
3400 $INDEX
3401 This variable contains the index at which the next match will be
3402 attempted within the string being parsed. It is most commonly used
3403 in "<error:...>" or "<log:...>" directives:
3404
3405 <rule: ListElem>
3406 <log: (?{ "Trying words at index $INDEX" })>
3407 <MATCH=( \w++ )>
3408 |
3409 <log: (?{ "Trying digits at index $INDEX" })>
3410 <MATCH=( \d++ )>
3411 |
3412 <error: (?{ "Missing ListElem near index $INDEX" })>
3413
3414 %MATCH
This variable contains all the saved results of any subrules called
from the current rule. In other words, subrule calls like:

    <ListElem> <Separator= (,)>

store their respective match results in $MATCH{'ListElem'} and
$MATCH{'Separator'}.
3422
3423 $MATCH
3424 This variable is an alias for $MATCH{"="}. This is the %MATCH entry
3425 for the special "override value". If this entry is defined, its
3426 value overrides the usual "return \%MATCH" semantics of a
3427 successful rule.
3428
%ARG
This variable contains all the key/value pairs that were passed
into a particular subrule call. For example, given the subrule
calls:

    <Keyword> <Command> <Terminator(:Keyword)>

the "Terminator" rule could get access to the text matched by
"<Keyword>" like so:
3437
3438 <token: Terminator>
3439 end_ (??{ $ARG{'Keyword'} })
3440
Note that to match against the calling subrule's 'Keyword' value,
it's necessary to use either a deferred interpolation ("(??{...})")
or a qualified matchref:
3444
3445 <token: Terminator>
3446 end_ <\:Keyword>
3447
3448 A common mistake is to attempt to directly interpolate the
3449 argument:
3450
3451 <token: Terminator>
3452 end_ $ARG{'Keyword'}
3453
3454 This evaluates $ARG{'Keyword'} when the grammar is compiled, rather
3455 than when the rule is matched.
3456
3457 $_ At the start of any code blocks inside any regex, the variable $_
3458 contains the complete string being matched against. The current
3459 matching position within that string is given by: "pos($_)".
3460
3461 $DEBUG
3462 This variable stores the current debugging mode (which may be any
3463 of: 'off', 'on', 'run', 'continue', 'match', 'step', or 'try'). It
3464 is set automatically by the "<debug:...>" command, but may also be
3465 set manually in a code block (which can be useful for conditional
3466 debugging). For example:
3467
3468 <rule: ListElem>
3469 <Identifier>
3470
3471 # Conditionally debug if 'foobar' encountered...
3472 (?{ $DEBUG = $MATCH{Identifier} eq 'foobar' ? 'step' : 'off' })
3473
3474 <Modifier>?
3475
See also: the "<logfile: LOGFILE>" and "<debug: DEBUG_CMD>" directives.
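
The distinction drawn under %ARG between direct and deferred
interpolation applies to ordinary Perl regexes as well. In the
following plain-Perl sketch, a package variable $keyword (a
hypothetical stand-in for $ARG{'Keyword'}) is set by a code block
during matching:

```perl
use strict;
use warnings;
use re 'eval';   # harmless here; older perls need it when code blocks
                 # and interpolated variables share a pattern

our $keyword = '';   # still empty when the patterns below are compiled

# Direct interpolation: $keyword is expanded once, when the pattern is
# compiled, so this pattern ends up looking for a bare 'end_'
my $direct = qr/\A (\w+) (?{ $keyword = $^N }) \s+ end_ $keyword \z/x;

# Deferred interpolation: (??{...}) is evaluated during matching,
# after the code block has stored the captured keyword
my $deferred = qr/\A (\w+) (?{ $keyword = $^N }) \s+ end_ (??{ $keyword }) \z/x;

my $text = 'loop end_loop';
print 'direct:   ', ($text =~ $direct   ? 'matched' : 'no match'), "\n";
print 'deferred: ', ($text =~ $deferred ? 'matched' : 'no match'), "\n";
```

Only the deferred version matches, because only it consults the
variable at the moment the terminator is being matched.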
3477
· Prior to Perl 5.14, the Perl 5 regex engine was not reentrant. So
3480 any attempt to perform a regex match inside a "(?{ ... })" or "(??{
3481 ... })" under Perl 5.12 or earlier will almost certainly lead to
3482 either weird data corruption or a segfault.
3483
3484 The same calamities can also occur in any constructor called by
3485 "<objrule:>". If the constructor invokes another regex in any way,
3486 it will most likely fail catastrophically. In particular, this
3487 means that Moose constructors will frequently crash and burn within
a Regexp::Grammars grammar (for example, if the Moose-based class
3489 declares an attribute type constraint such as 'Int', which Moose
3490 checks using a regex).
3491
3492 · The additional regex constructs this module provides are
3493 implemented by rewriting regular expressions. This is a (safer)
3494 form of source filtering, but still subject to all the same
3495 limitations and fallibilities of any other macro-based solution.
3496
3497 · In particular, rewriting the macros involves the insertion of (a
3498 lot of) extra capturing parentheses. This means you can no longer
3499 assume that particular capturing parens correspond to particular
numeric variables: i.e. to $1, $2, $3 etc. If you want to capture
directly, use Perl 5.10's named capture construct:
3502
3503 (?<name> [^\W\d]\w* )
3504
3505 Better still, capture the data in its correct hierarchical context
3506 using the module's "named subpattern" construct:
3507
3508 <name= ([^\W\d]\w*) >
3509
3510 · No recursive descent parser--including those created with
3511 Regexp::Grammars--can directly handle left-recursive grammars with
3512 rules of the form:
3513
3514 <rule: List>
3515 <List> , <ListElem>
3516
3517 If you find yourself attempting to write a left-recursive grammar
3518 (which Perl 5.10 may or may not complain about, but will never
3519 successfully parse with), then you probably need to use the
3520 "separated list" construct instead:
3521
3522 <rule: List>
3523 <[ListElem]>+ % (,)
3524
3525 · Grammatical parsing with Regexp::Grammars can fail if your grammar
3526 uses "non-backtracking" directives (i.e. the "(?>...)" block or the
3527 "?+", "*+", or "++" repetition specifiers). The problem appears to
3528 be that preventing the regex from backtracking through the in-regex
3529 actions that Regexp::Grammars adds causes the module's internal
3530 stack to fall out of sync with the regex match.
3531
3532 For the time being, if your grammar does not work as expected, you
3533 may need to replace one or more "non-backtracking" directives with
3534 their regular (i.e. backtracking) equivalents.
3535
3536 · Similarly, parsing with Regexp::Grammars will fail if your grammar
3537 places a subrule call within a positive lookahead, since these
3538 don't play nicely with the data stack.
3539
3540 This seems to be an internal problem with perl itself.
3541 Investigations, and attempts at a workaround, are proceeding.
3542
3543 For the time being, you need to make sure that grammar rules don't
3544 appear inside a positive lookahead, or else use the "<?RULENAME>"
3545 construct instead.
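
As a minimal sketch of two of the workarounds above (named subpatterns in place of numbered captures, and a separated list in place of left recursion), the following grammar parses a small assignment. The rule names Assignment, List and ListElem and the sample input are illustrative assumptions, not part of the module:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: the rule names are chosen only for this example
my $parser = qr{
    <Assignment>

    <token: Assignment>
        <name= ([^\W\d]\w*) > = <List>   # named subpattern, no $1 needed

    <token: List>
        <[ListElem]>+ % (,)              # separated list, no left recursion

    <token: ListElem>
        \d+
}x;

if ('primes=2,3,5' =~ $parser) {
    my $assignment = $/{Assignment};
    print "$assignment->{name}: @{ $assignment->{List}{ListElem} }\n";
}
```

Every value is retrieved by key from the result hash, so the extra capturing parentheses that the module inserts do not affect the calling code.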
3546
3548 Note that (because the author cannot find a way to throw exceptions
3549 from within a regex) none of the following diagnostics actually throws
3550 an exception.
3551
3552 Instead, these messages are simply written to the specified parser
3553 logfile (or to *STDERR, if no logfile is specified).
3554
3555 However, any fatal match-time message will immediately terminate the
3556 parser matching and will still set $@ (as if an exception had been
3557 thrown and caught at that point in the code). You then have the option
3558 to check $@ immediately after matching with the grammar, and rethrow if
3559 necessary:
3560
3561 if ($input =~ $grammar) {
3562 process_data_in(\%/);
3563 }
3564 else {
3565 die if $@;
3566 }
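
The same check can be wrapped in a helper so that every caller treats fatal diagnostics uniformly. This is only a sketch: parse_or_die() and the one-rule grammar are hypothetical names introduced for illustration, and clearing $@ before the match is a defensive assumption rather than something the module requires:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical one-rule grammar, used only to exercise the helper
my $grammar = qr{
    <number>

    <token: number>
        \d+
}x;

# parse_or_die() is an illustrative helper, not part of Regexp::Grammars
sub parse_or_die {
    my ($input) = @_;

    $@ = q{};                  # discard any stale error from earlier code
    if ($input =~ $grammar) {
        my %result = %/;       # copy the result hash before another match replaces it
        return \%result;
    }
    die $@ if $@;              # a fatal match-time diagnostic was recorded
    return;                    # otherwise the input simply did not match
}

my $result = parse_or_die('42');
print "number: $result->{number}\n";
```

An ordinary failed match returns nothing, so callers can still distinguish "did not parse" from "the grammar itself is broken".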
3567
3568 "Found call to %s, but no %s was defined in the grammar"
3569 You specified a call to a subrule for which there was no definition
3570 in the grammar. Typically that's either because you forgot to
3571 define the rule, or because you misspelled either the definition or
3572 the subrule call. For example:
3573
3574 <file>
3575
3576 <rule: fiel> <---- misspelled rule
3577 <lines> <---- used but never defined
3578
3579 Regexp::Grammars converts any such subrule call attempt to an
3580 instant catastrophic failure of the entire parse, so if your parser
3581 ever actually tries to perform that call, Very Bad Things will
3582 happen.
3583
3584 "Entire parse terminated prematurely while attempting to call
3585 non-existent rule: %s"
3586 You ignored the previous error and actually tried to call a
3587 subrule for which there was no definition in the grammar. Very Bad
3588 Things are now happening. The parser got very upset, took its ball,
3589 and went home. See the preceding diagnostic for remedies.
3590
3591 This diagnostic should throw an exception, but can't. So it sets $@
3592 instead, allowing you to trap the error manually if you wish.
3593
3594 "Fatal error: <objrule: %s> returned a non-hash-based object"
3595 An <objrule:> was specified and returned a blessed object that
3596 wasn't a hash. This will break the behaviour of the grammar, so the
3597 module immediately reports the problem and gives up.
3598
3599 The solution is to use only hash-based classes with <objrule:>.
3600
3601 "Can't match against <grammar: %s>"
3602 The regex you attempted to match against defined a pure grammar,
3603 using the "<grammar:...>" directive. Pure grammars have no start-
3604 pattern and hence cannot be matched against directly.
3605
3606 You need to define a matchable grammar that inherits from your pure
3607 grammar and then calls one of its rules. For example, instead of:
3608
3609 my $greeting = qr{
3610 <grammar: Greeting>
3611
3612 <rule: greet>
3613 Hi there
3614 | Hello
3615 | Yo!
3616 }xms;
3617
3618 you need:
3619
3620 qr{
3621 <grammar: Greeting>
3622
3623 <rule: greet>
3624 Hi there
3625 | Hello
3626 | Yo!
3627 }xms;
3628
3629 my $greeting = qr{
3630 <extends: Greeting>
3631 <greet>
3632 }xms;
3633
3634 "Inheritance from unknown grammar requested by <%s>"
3635 You used an "<extends:...>" directive to request that your grammar
3636 inherit from another, but the grammar you asked to inherit from
3637 doesn't exist.
3638
3639 Check the spelling of the grammar name, and that it's already been
3640 defined somewhere earlier in your program.
3641
3642 "Redeclaration of <%s> will be ignored"
3643 You defined two or more rules or tokens with the same name. The
3644 first one defined in the grammar will be used; the rest will be
3645 ignored.
3646
3647 To get rid of the warning, get rid of the extra definitions (or, at
3648 least, comment them out or rename the rules).
3649
3650 "Possible invalid subrule call %s"
3651 Your grammar contained something of the form:
3652
3653 <identifier
3654 <.identifier
3655 <[identifier
3656
3657 which you might have intended to be a subrule call, but which
3658 didn't correctly parse as one. If it was supposed to be a
3659 Regexp::Grammars subrule call, you need to check the syntax you
3660 used. If it wasn't supposed to be a subrule call, you can silence
3661 the warning by rewriting it and quoting the leading angle:
3662
3663 \<identifier
3664 \<.identifier
3665 \<[identifier
3666
3667 "Possible failed attempt to specify a directive: %s"
3668 Your grammar contained something of the form:
3669
3670 <identifier:...
3671
3672 but which wasn't a known directive like "<rule:...>" or
3673 "<debug:...>". If it was supposed to be a Regexp::Grammars
3674 directive, check the spelling of the directive name. If it wasn't
3675 supposed to be a directive, you can silence the warning by
3676 rewriting it and quoting the leading angle:
3677
3678 \<identifier:
3679
3680 "Possible failed attempt to specify a subrule call %s"
3681 Your grammar contained something of the form:
3682
3683 <identifier...
3684
3685 but which wasn't a call to a known subrule like "<ident>" or
3686 "<name>". If it was supposed to be a Regexp::Grammars subrule call,
3687 check the spelling of the rule name in the angles. If it wasn't
3688 supposed to be a subrule call, you can silence the warning by
3689 rewriting it and quoting the leading angle:
3690
3691 \<identifier...
3692
3693 "Repeated subrule %s will only capture its final match"
3694 You specified a subrule call with a repetition qualifier, such as:
3695
3696 <ListElem>*
3697
3698 or:
3699
3700 <ListElem>+
3701
3702 Because each subrule call saves its result in a hash entry of the
3703 same name, each repeated match will overwrite the previous ones, so
3704 only the last match will ultimately be saved. If you want to save
3705 all the matches, you need to tell Regexp::Grammars to save the
3706 sequence of results as a nested array within the hash entry, like
3707 so:
3708
3709 <[ListElem]>*
3710
3711 or:
3712
3713 <[ListElem]>+
3714
3715 If you really did intend to throw away every result but the final
3716 one, you can silence the warning by placing the subrule call inside
3717 any kind of parentheses. For example:
3718
3719 (<ListElem>)*
3720
3721 or:
3722
3723 (?: <ListElem> )+
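
A minimal sketch of the list-capturing form, assuming a hypothetical Word token and sample input: each repetition is appended to an array stored under the subrule's name, rather than overwriting a single scalar entry.

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: Word is an illustrative token name
my $parser = qr{
    <[Word]>+                  # save every match in a list

    <token: Word>
        \s* <MATCH= (\w+) >    # skip leading spaces, keep only the word
}x;

if ('alpha beta gamma' =~ $parser) {
    print join('|', @{ $/{Word} }), "\n";
}
```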
3724
3725 "Unable to open log file '$filename' (%s)"
3726 You specified a "<logfile:...>" directive but the file whose name
3727 you specified could not be opened for writing (for the reason given
3728 in the parens).
3729
3730 Did you misspell the filename, or get the permissions wrong
3731 somewhere in the filepath?
3732
3733 "Non-backtracking subrule %s may not revert correctly during
3734 backtracking"
3735 Because of inherent limitations in the Perl regex engine, non-
3736 backtracking constructs like "++", "*+", "?+", and "(?>...)" do not
3737 always work correctly when applied to subrule calls, especially in
3738 earlier versions of Perl.
3739
3740 If the grammar doesn't work properly, replace the offending
3741 constructs with regular backtracking versions instead. If the
3742 grammar does work, you can silence the warning by enclosing the
3743 subrule call in any kind of parentheses. For example, change:
3744
3745 <[ListElem]>++
3746
3747 to:
3748
3749 (?: <[ListElem]> )++
3750
3751 "Unexpected item before first subrule specification in definition of
3752 <grammar: %s>"
3753 Named grammar definitions must consist only of rule and token
3754 definitions. They cannot have patterns before the first
3755 definitions. You had some kind of pattern before the first
3756 definition, which will be completely ignored within the grammar.
3757
3758 To silence the warning, either comment out or delete whatever is
3759 before the first rule/token definition.
3760
3761 "No main regex specified before rule definitions"
3762 You specified an unnamed grammar (i.e. no "<grammar:...>"
3763 directive), but didn't specify anything for it to actually match,
3764 just some rules that you don't actually call. For example:
3765
3766 my $grammar = qr{
3767
3768 <rule: list> \( <item> +% [,] \)
3769
3770 <token: item> <list> | \d+
3771 }x;
3772
3773 You have to provide something before the first rule to start the
3774 matching off. For example:
3775
3776 my $grammar = qr{
3777
3778 <list> # <--- This tells the grammar how to start matching
3779
3780 <rule: list> \( <item> +% [,] \)
3781
3782 <token: item> <list> | \d+
3783 }x;
3784
3785 "Ignoring useless empty <ws:> directive"
3786 The "<ws:...>" directive specifies what whitespace matches within
3787 the current rule. An empty "<ws:>" directive would cause whitespace
3788 to match nothing at all, which is what happens in a token
3789 definition, not in a rule definition.
3790
3791 Either put some subpattern inside the empty "<ws:...>" or, if you
3792 really do want whitespace to match nothing at all, remove the
3793 directive completely and change the rule definition to a token
3794 definition.
3795
3796 "Ignoring useless <ws: %s > directive in a token definition"
3797 The "<ws:...>" directive is used to specify what whitespace matches
3798 within a rule. Since whitespace never matches anything inside
3799 tokens, putting a "<ws:...>" directive in a token is a waste of
3800 time.
3801
3802 Either remove the useless directive, or else change the surrounding
3803 token definition to a rule definition.
3804
3805 "Quantifier that doesn't quantify anything: <%s>"
3806 You specified a rule or token something like:
3807
3808 <token: star> *
3809
3810 or:
3811
3812 <rule: add_op> plus | add | +
3813
3814 but the "*" and "+" in those examples are both regex meta-
3815 operators: quantifiers that usually cause what precedes them to
3816 match repeatedly. In these cases however, nothing is preceding the
3817 quantifier, so it's a Perl syntax error.
3818
3819 You almost certainly need to escape the meta-characters in some
3820 way. For example:
3821
3822 <token: star> \*
3823
3824 <rule: add_op> plus | add | [+]
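
The escaped form can be checked in a small grammar. This is a sketch only: the number token, the aliases left, op and right, and the sample input are assumptions made for the example.

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: only add_op comes from the example above
my $parser = qr{
    \A <left=number> \s* <op=add_op> \s* <right=number> \z

    <token: number>
        \d+

    <token: add_op>
        plus | add | [+]       # [+] matches a literal plus sign
}x;

for my $input ('2 + 3', '2 plus 3') {
    if ($input =~ $parser) {
        print "$input: operator is '$/{op}'\n";
    }
}
```

Writing the operator as a one-character class ("[+]") rather than with a backslash keeps it readable under the /x flag. Both spellings match the same literal character.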
3825
3827 Regexp::Grammars requires no configuration files or environment
3828 variables.
3829
3831 This module only works under Perl 5.10 or later.
3832
3834 This module is likely to be incompatible with any other module that
3835 automagically rewrites regexes. For example it may conflict with
3836 Regexp::DefaultFlags, Regexp::DeferredExecution, or Regexp::Extended.
3837
3839 No bugs have been reported.
3840
3841 Please report any bugs or feature requests to
3842 "bug-regexp-grammars@rt.cpan.org", or through the web interface at
3843 <http://rt.cpan.org>.
3844
3846 Damian Conway "<DCONWAY@CPAN.org>"
3847
3849 Copyright (c) 2009, Damian Conway "<DCONWAY@CPAN.org>". All rights
3850 reserved.
3851
3852 This module is free software; you can redistribute it and/or modify it
3853 under the same terms as Perl itself. See perlartistic.
3854
3856 BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
3857 FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT
3858 WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER
3859 PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND,
3860 EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
3861 WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
3862 ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH
3863 YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL
3864 NECESSARY SERVICING, REPAIR, OR CORRECTION.
3865
3866 IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
3867 WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
3868 REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE
3869 TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR
3870 CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
3871 SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
3872 RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
3873 FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
3874 SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
3875 DAMAGES.
3876
3877
3878
3879perl v5.28.1 2019-02-02 Regexp::Grammars(3)