Regexp::Grammars(3pm)

Regexp::Grammars(3)   User Contributed Perl Documentation  Regexp::Grammars(3)

NAME

       Regexp::Grammars - Add grammatical parsing features to Perl 5.10
       regexes

VERSION

       This document describes Regexp::Grammars version 1.049

SYNOPSIS

           use Regexp::Grammars;

           my $parser = qr{
               (?:
                   <Verb>               # Parse and save a Verb in a scalar
                   <.ws>                # Parse but don't save whitespace
                   <Noun>               # Parse and save a Noun in a scalar

                   <type=(?{ rand > 0.5 ? 'VN' : 'VerbNoun' })>
                                        # Save result of expression in a scalar
               |
                   (?:
                       <[Noun]>         # Parse a Noun and save result in a list
                                        #   (saved under the key 'Noun')
                       <[PostNoun=ws]>  # Parse whitespace, save it in a list
                                        #   (saved under the key 'PostNoun')
                   )+

                   <Verb>               # Parse a Verb and save result in a scalar
                                        #   (saved under the key 'Verb')

                   <type=(?{ 'VN' })>   # Save a literal in a scalar
               |
                   <debug: match>       # Turn on the integrated debugger here
                   <.Cmd= (?: mv? )>    # Parse but don't capture a subpattern
                                        #   (name it 'Cmd' for debugging purposes)
                   <[File]>+            # Parse 1+ Files and save them in a list
                                        #   (saved under the key 'File')
                   <debug: off>         # Turn off the integrated debugger here
                   <Dest=File>          # Parse a File and save it in a scalar
                                        #   (saved under the key 'Dest')
               )

               ################################################################

               <token: File>              # Define a subrule named File
                   <.ws>                  #  - Parse but don't capture whitespace
                   <MATCH= ([\w-]+) >     #  - Parse the subpattern and capture
                                          #    matched text as the result of the
                                          #    subrule

               <token: Noun>              # Define a subrule named Noun
                   cat | dog | fish       #  - Match an alternative (as usual)

               <rule: Verb>               # Define a whitespace-sensitive subrule
                   eats                   #  - Match a literal (after any space)
                   <Object=Noun>?         #  - Parse optional subrule Noun and
                                          #    save result under the key 'Object'
               |                          #  Or else...
                   <AUX>                  #  - Parse subrule AUX and save result
                   <part= (eaten|seen) >  #  - Match a literal, save under 'part'

               <token: AUX>               # Define a whitespace-insensitive subrule
                   (has | is)             #  - Match an alternative and capture
                   (?{ $MATCH = uc $^N }) #  - Use captured text as subrule result

           }x;

           # Match the grammar against some text...
           if ($text =~ $parser) {
               # If successful, the hash %/ will have the hierarchy of results...
               process_data_in( %/ );
           }

QUICKSTART CHEATSHEET

   In your program...
           use Regexp::Grammars;    Allow enhanced regexes in lexical scope
           %/                       Result-hash for successful grammar match

   Defining and using named grammars...
           <grammar:  GRAMMARNAME>  Define a named grammar that can be inherited
           <extends:  GRAMMARNAME>  Current grammar inherits named grammar's rules

   Defining rules in your grammar...
           <rule:     RULENAME>     Define rule with magic whitespace
           <token:    RULENAME>     Define rule without magic whitespace

           <objrule:  CLASS= NAME>  Define rule that blesses return-hash into class
           <objtoken: CLASS= NAME>  Define token that blesses return-hash into class

           <objrule:  CLASS>        Shortcut for above (rule name derived from class)
           <objtoken: CLASS>        Shortcut for above (token name derived from class)

   Matching rules in your grammar...
           <RULENAME>               Call named subrule (may be fully qualified)
                                    save result to $MATCH{RULENAME}

           <RULENAME(...)>          Call named subrule, passing args to it

           <!RULENAME>              Call subrule and fail if it matches
           <!RULENAME(...)>         (shorthand for (?!<.RULENAME>) )

           <:IDENT>                 Match contents of $ARG{IDENT} as a pattern
           <\:IDENT>                Match contents of $ARG{IDENT} as a literal
           </:IDENT>                Match closing delimiter for $ARG{IDENT}

           <%HASH>                  Match longest possible key of hash
           <%HASH {PAT}>            Match any key of hash that also matches PAT

           </IDENT>                 Match closing delimiter for $MATCH{IDENT}
           <\_IDENT>                Match the literal contents of $MATCH{IDENT}

           <ALIAS= RULENAME>        Call subrule, save result in $MATCH{ALIAS}
           <ALIAS= %HASH>           Match a hash key, save key in $MATCH{ALIAS}
           <ALIAS= ( PATTERN )>     Match pattern, save match in $MATCH{ALIAS}
           <ALIAS= (?{ CODE })>     Execute code, save value in $MATCH{ALIAS}
           <ALIAS= 'STR' >          Save specified string in $MATCH{ALIAS}
           <ALIAS= 42 >             Save specified number in $MATCH{ALIAS}
           <ALIAS= /IDENT>          Match closing delim, save as $MATCH{ALIAS}
           <ALIAS= \_IDENT>         Match '$MATCH{IDENT}', save as $MATCH{ALIAS}

           <.SUBRULE>               Call subrule (one of the above forms),
                                    but don't save the result in %MATCH

           <[SUBRULE]>              Call subrule (one of the above forms), but
                                    append result instead of overwriting it

           <SUBRULE1>+ % <SUBRULE2> Match one or more repetitions of SUBRULE1
                                    as long as they're separated by SUBRULE2
           <SUBRULE1> ** <SUBRULE2> Same (only for backwards compatibility)

           <SUBRULE1>* % <SUBRULE2> Match zero or more repetitions of SUBRULE1
                                    as long as they're separated by SUBRULE2

   In your grammar's code blocks...
           $CAPTURE    Alias for $^N (the most recent paren capture)
           $CONTEXT    Another alias for $^N
           $INDEX      Current index of next matching position in string
           %MATCH      Current rule's result-hash
           $MATCH      Magic override value (returned instead of result-hash)
           %ARG        Current rule's argument hash
           $DEBUG      Current match-time debugging mode

   Directives...
           <require: (?{ CODE })   >  Fail if code evaluates false
           <timeout: INT           >  Fail after specified number of seconds
           <debug:   COMMAND       >  Change match-time debugging mode
           <logfile: LOGFILE       >  Change debugging log file (default: STDERR)
           <fatal:   TEXT|(?{CODE})>  Queue error message and fail parse
           <error:   TEXT|(?{CODE})>  Queue error message and backtrack
           <warning: TEXT|(?{CODE})>  Queue warning message and continue
           <log:     TEXT|(?{CODE})>  Explicitly add a message to debugging log
           <ws:      PATTERN       >  Override automatic whitespace matching
           <minimize:>                Simplify the result of a subrule match
           <context:>                 Switch on context substring retention
           <nocontext:>               Switch off context substring retention

DESCRIPTION

       This module adds a small number of new regex constructs that can be
       used within Perl 5.10 patterns to implement complete recursive-descent
       parsing.

       Perl 5.10 already supports recursive-descent matching, via the new
       "(?<name>...)" and "(?&name)" constructs. For example, here is a simple
       matcher for a subset of the LaTeX markup language:

           $matcher = qr{
               (?&File)

               (?(DEFINE)
                   (?<File>     (?&Element)* )

                   (?<Element>  \s* (?&Command)
                             |  \s* (?&Literal)
                   )

                   (?<Command>  \\ \s* (?&Literal) \s* (?&Options)? \s* (?&Args)? )

                   (?<Options>  \[ \s* (?:(?&Option) (?:\s*,\s* (?&Option) )*)? \s* \])

                   (?<Args>     \{ \s* (?&Element)* \s* \}  )

                   (?<Option>   \s* [^][\$&%#_{}~^\s,]+     )

                   (?<Literal>  \s* [^][\$&%#_{}~^\s]+      )
               )
           }xms

       This technique makes it possible to use regexes to recognize complex,
       hierarchical--and even recursive--textual structures. The problem is
       that Perl 5.10 doesn't provide any support for extracting that
       hierarchical data into nested data structures. In other words, using
       Perl 5.10 you can match complex data, but not parse it into an
       internally useful form.
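
       To make that limitation concrete, here is a minimal sketch that uses
       only core Perl 5.10 constructs (no Regexp::Grammars). The subpattern
       name "group" and the two sample strings are invented for this
       illustration. The pattern can decide whether parentheses balance, but
       the match itself produces no nested data structure:

```perl
use strict;
use warnings;
use 5.010;

# Recognizer for balanced parentheses, built only from core constructs:
# (?(DEFINE)...) declares a named subpattern; (?&group) calls it recursively.
my $balanced = qr{
    \A (?&group) \z

    (?(DEFINE)
        (?<group>  \(  (?: [^()]++ | (?&group) )*  \)  )
    )
}xms;

my @samples = ( '(a(b)c)', '(a(b c' );    # hypothetical sample inputs
my %verdict = map { $_ => ( $_ =~ $balanced ? 1 : 0 ) } @samples;

for my $text (@samples) {
    say "$text: ", $verdict{$text} ? 'balanced' : 'not balanced';
}

# All we learn is a yes/no answer per string; recovering the nesting
# would require extra hand-written capture logic.
```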

       An additional problem when using Perl 5.10 regexes to match complex
       data formats is that you have to make sure you remember to insert
       whitespace-matching constructs (such as "\s*") at every possible
       position where the data might contain ignorable whitespace. This
       reduces the readability of such patterns, and increases the chance of
       errors (typically caused by overlooking a location where whitespace
       might appear).

       The Regexp::Grammars module solves both those problems.

       If you import the module into a particular lexical scope, it
       preprocesses any regex in that scope, so as to implement a number of
       extensions to the standard Perl 5.10 regex syntax. These extensions
       simplify the task of defining and calling subrules within a grammar,
       and allow those subrule calls to capture and retain the components
       they match in a proper hierarchical manner.

       For example, the above LaTeX matcher could be converted to a full LaTeX
       parser (and considerably tidied up at the same time), like so:

           use Regexp::Grammars;
           $parser = qr{
               <File>

               <rule: File>       <[Element]>*

               <rule: Element>    <Command> | <Literal>

               <rule: Command>    \\  <Literal>  <Options>?  <Args>?

               <rule: Options>    \[  <[Option]>+ % (,)  \]

               <rule: Args>       \{  <[Element]>*  \}

               <rule: Option>     [^][\$&%#_{}~^\s,]+

               <rule: Literal>    [^][\$&%#_{}~^\s]+
           }xms

       Note that there is no need to explicitly place "\s*" subpatterns
       throughout the rules; that is taken care of automatically.

       If the Regexp::Grammars version of this regex were successfully matched
       against some appropriate LaTeX document, each rule would call the
       subrules specified within it, and then return a hash containing
       whatever result each of those subrules returned, with each result
       indexed by the subrule's name.

       That is, if the rule named "Command" were invoked, it would first try
       to match a backslash, then it would call the three subrules
       "<Literal>", "<Options>", and "<Args>" (in that sequence). If they all
       matched successfully, the "Command" rule would then return a hash with
       three keys: 'Literal', 'Options', and 'Args'. The value for each of
       those hash entries would be whatever result-hash the subrules
       themselves had returned when matched.

       In this way, each level of the hierarchical regex can generate hashes
       recording everything its own subrules matched, so when the entire
       pattern matches, it produces a tree of nested hashes that represent the
       structured data the pattern matched.

       For example, if the previous regex grammar were matched against a
       string containing:

           \documentclass[a4paper,11pt]{article}
           \author{D. Conway}

       it would automatically extract a data structure equivalent to the
       following (but with several extra "empty" keys, which are described in
       "Subrule results"):

           {
               'File' => {
                   'Element' => [
                       {
                           'Command' => {
                               'Literal' => 'documentclass',
                               'Options' => {
                                   'Option'  => [ 'a4paper', '11pt' ],
                               },
                               'Args'    => {
                                   'Element' => [
                                       {
                                           'Literal' => 'article',
                                       }
                                   ]
                               }
                           }
                       },
                       {
                           'Command' => {
                               'Literal' => 'author',
                               'Args' => {
                                   'Element' => [
                                       {
                                           'Literal' => 'D.',
                                       },
                                       {
                                           'Literal' => 'Conway',
                                       }
                                   ]
                               }
                           }
                       }
                   ]
               }
           }

       The data structure that Regexp::Grammars produces from a regex match is
       available to the surrounding program in the magic variable "%/".

       Regexp::Grammars provides many features that simplify the extraction of
       hierarchical data via a regex match, and also some features that can
       simplify the processing of that data once it has been extracted. The
       following sections explain each of those features, and some of the
       parsing techniques they support.

   Setting up the module
       Just add:

           use Regexp::Grammars;

       to any lexical scope. Any regexes within that scope will automatically
       now implement the new parsing constructs:

           use Regexp::Grammars;

           my $parser = qr/ regex with $extra <chocolatey> grammar bits /;

       Note that you do not need to use the "/x" modifier when declaring a
       regex grammar (though you certainly may). But even if you don't, the
       module quietly adds a "/x" to every regex within the scope of its
       usage. Otherwise, the default "a whitespace character matches exactly
       that whitespace character" behaviour of Perl regexes would mess up your
       grammar's parsing. If you need the non-"/x" behaviour, you can still
       use the "(?-x)" or "(?-x:...)" directives to switch off "/x" within one
       or more of your grammar's components.

       Once the grammar has been processed, you can then match text against
       the extended regexes, in the usual manner (i.e. via a "=~" match):

           if ($input_text =~ $parser) {
               ...
           }

       After a successful match, the variable "%/" will contain a series of
       nested hashes representing the structured hierarchical data captured
       during the parse.
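
       The steps above can be put together in one small program. This is only
       a sketch: the "greeting" grammar, its rule names, and the sample input
       are all invented for illustration, and copying "%/" into a lexical hash
       is a defensive habit assumed here rather than a requirement stated
       above.

```perl
use strict;
use warnings;
use Regexp::Grammars;
use Data::Dumper;

# Hypothetical two-word grammar, invented for this illustration only.
my $parser = qr{
    <greeting>

    <rule: greeting>      <salutation> <name>
    <token: salutation>   hello | hi
    <token: name>         \w+
}xms;

my $input_text = 'hello world';    # hypothetical sample input
my %result;

if ($input_text =~ $parser) {
    %result = %/;                  # keep a private copy of the parse tree
    print Dumper(\%result);
}
else {
    warn "No match\n";
}
```

       The nested hashes can then be walked like any other Perl data
       structure, for example via $result{greeting}{name}.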

   Structure of a Regexp::Grammars grammar
       A Regexp::Grammars specification consists of a start-pattern (which may
       include both standard Perl 5.10 regex syntax, as well as special
       Regexp::Grammars directives), followed by one or more rule or token
       definitions.

       For example:

           use Regexp::Grammars;
           my $balanced_brackets = qr{

               # Start-pattern...
               <paren_pair> | <brace_pair>

               # Rule definition...
               <rule: paren_pair>
                   \(  (?: <escape> | <paren_pair> | <brace_pair> | [^()] )*  \)

               # Rule definition...
               <rule: brace_pair>
                   \{  (?: <escape> | <paren_pair> | <brace_pair> | [^{}] )*  \}

               # Token definition...
               <token: escape>
                   \\ .
           }xms;

       The start-pattern at the beginning of the grammar acts like the "top"
       token of the grammar, and must be matched completely for the grammar to
       match.

       This pattern is treated like a token for whitespace matching behaviour
       (see "Tokens vs rules (whitespace handling)").  That is, whitespace in
       the start-pattern is treated like whitespace in any normal Perl regex.

       The rules and tokens are declarations only and they are not directly
       matched.  Instead, they act like subroutines, and are invoked by name
       from the initial pattern (or from within a rule or token).

       Each rule or token extends from the directive that introduces it up to
       either the next rule or token directive, or (in the case of the final
       rule or token) to the end of the grammar.

   Tokens vs rules (whitespace handling)
       The difference between a token and a rule is that a token treats any
       whitespace within it exactly as a normal Perl regular expression would.
       That is, a sequence of whitespace in a token is ignored if the "/x"
       modifier is in effect, or else matches the same literal sequence of
       whitespace characters (if "/x" is not in effect).

       In a rule, most sequences of whitespace are treated as matching the
       implicit subrule "<.ws>", which is automatically predefined to match
       optional whitespace (i.e. "\s*").

       The exceptions to this behaviour are whitespace before a "|", before a
       code block, or before an explicit space-matcher (such as "<ws>" or
       "\s"), and whitespace at the very end of the rule.

       In other words, a rule such as:

           <rule: sentence>   <noun> <verb>
                          |   <verb> <noun>

       is equivalent to a token with added non-capturing whitespace matching:

           <token: sentence>  <.ws> <noun> <.ws> <verb>
                           |  <.ws> <verb> <.ws> <noun>
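
       The practical difference can be seen by compiling the same body once
       as a rule and once as a token. The following is a minimal sketch; the
       "noun" and "verb" vocabularies and the sample strings are invented for
       illustration.

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Same body, declared as a rule (whitespace becomes <.ws>)...
my $with_rule = qr{
    <sentence>

    <rule: sentence>    <noun> <verb>
    <token: noun>       cats | dogs
    <token: verb>       sleep | bark
}xms;

# ...and as a token (whitespace in the body is ignored under /x).
my $with_token = qr{
    <sentence>

    <token: sentence>   <noun> <verb>
    <token: noun>       cats | dogs
    <token: verb>       sleep | bark
}xms;

my %matched;
for my $text ('dogs bark', 'dogsbark') {    # hypothetical sample inputs
    $matched{$text} = {
        rule  => ( $text =~ $with_rule  ? 1 : 0 ),
        token => ( $text =~ $with_token ? 1 : 0 ),
    };
    printf "%-12s rule=%d token=%d\n",
        "'$text'", @{ $matched{$text} }{qw(rule token)};
}
```

       Under these assumptions, only the rule version should accept the input
       containing a space, because the token version never asks for
       whitespace between the noun and the verb.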

       You can explicitly define a "<ws>" token to change that default
       behaviour. For example, you could alter the definition of "whitespace"
       to include Perlish comments, by adding an explicit "<token: ws>":

           <token: ws>
               (?: \s+ | \#[^\n]* )*

       But be careful not to define "<ws>" as a rule, as this will lead to all
       kinds of infinitely recursive unpleasantness.
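
       As a rough sketch of how such an override might be used, the grammar
       below lets a "#" comment appear wherever a rule would otherwise accept
       whitespace. The "setting" rule, its "name" and "value" aliases, and
       the sample text are invented for illustration. Note that the "#" is
       backslash-escaped, since an unescaped "#" starts a comment under "/x".

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: "<name> is <number>", with comments allowed
# anywhere the rule would normally skip whitespace.
my $parser = qr{
    <setting>

    <rule: setting>
        <name=(\w+)> is <value=(\d+)>

    <token: ws>
        (?: \s+ | \#[^\n]* )*
}xms;

my $text = "width   # measured in pixels\n is 640";    # hypothetical input
my %result;

if ($text =~ $parser) {
    %result = %/;
    print "$result{setting}{name} => $result{setting}{value}\n";
}
else {
    warn "No match\n";
}
```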

       Per-rule whitespace handling

       Redefining the "<ws>" token changes its behaviour throughout the entire
       grammar, within every rule definition. Usually that's appropriate, but
       sometimes you need finer-grained control over whitespace handling.

       So Regexp::Grammars provides the "<ws:>" directive, which allows you to
       override the implicit whitespace-matches-whitespace behaviour only
       within the current rule.

       Note that this directive does not redefine "<ws>" within the rule; it
       simply specifies what to replace each whitespace sequence with (instead
       of replacing each with a "<ws>" call).

       For example, if a language allows one kind of comment between
       statements and another within statements, you could parse it with:

           <rule: program>
               # One type of comment between...
               <ws: (\s++ | \# .*? \n)* >

               # ...semicolon-separated statements...
               <[statement]>+ % ( ; )

           <rule: statement>
               # Another type of comment...
               <ws: (\s*+ | \#{ .*? }\# )* >

               # ...between comma-separated commands...
               <cmd>  <[arg]>+ % ( , )

       Note that each directive only applies to the rule in which it is
       specified. In every other rule in the grammar, whitespace would still
       match the usual "<ws>" subrule.

   Calling subrules
       To invoke a rule to match at any point, just enclose the rule's name in
       angle brackets (as in Perl 6). There must be no space between the
       opening bracket and the rule name. For example:

           qr{
               file:             # Match literal sequence 'f' 'i' 'l' 'e' ':'
               <name>            # Call <rule: name>
               <options>?        # Call <rule: options> (it's okay if it fails)

               <rule: name>
                   # etc.
           }x;

       If you need to match a literal pattern that would otherwise look like a
       subrule call, just backslash-escape the leading angle:

           qr{
               file:             # Match literal sequence 'f' 'i' 'l' 'e' ':'
               \<name>           # Match literal sequence '<' 'n' 'a' 'm' 'e' '>'
               <options>?        # Call <rule: options> (it's okay if it fails)

               <rule: name>
                   # etc.
           }x;

   Subrule results
       If a subrule call successfully matches, the result of that match is a
       reference to a hash. That hash reference is stored in the current
       rule's own result-hash, under the name of the subrule that was invoked.
       The hash will, in turn, contain the results of any more deeply nested
       subrule calls, each stored under the name by which the nested subrule
       was invoked.

       In other words, if the rule "sentence" is defined:

           <rule: sentence>
               <noun> <verb> <object>

       then successfully calling the rule:

           <sentence>

       causes a new hash entry at the current nesting level. That entry's key
       will be 'sentence' and its value will be a reference to a hash, which
       in turn will have keys: 'noun', 'verb', and 'object'.

       In addition each result-hash has one extra key: the empty string. The
       value for this key is whatever substring the entire subrule call
       matched.  This value is known as the context substring.

       So, for example, a successful call to "<sentence>" might add something
       like the following to the current result-hash:

           sentence => {
               ""     => 'I saw a dog',
               noun   => 'I',
               verb   => 'saw',
               object => {
                   ""      => 'a dog',
                   article => 'a',
                   noun    => 'dog',
               },
           }

       Note, however, that if the result-hash at any level contains only the
       empty-string key (i.e. the subrule did not call any sub-subrules or
       save any of their nested result-hashes), then the hash is "unpacked"
       and just the context substring itself is returned.

       For example, if "<rule: sentence>" had been defined:

           <rule: sentence>
               I see dead people

       then a successful call to the rule would only add:

           sentence => 'I see dead people'

       to the current result-hash.

       This is a useful feature because it prevents a series of nested subrule
       calls from producing very unwieldy data structures. For example,
       without this automatic unpacking, even the simple earlier example:

           <rule: sentence>
               <noun> <verb> <object>

       would produce something needlessly complex, such as:

           sentence => {
               ""     => 'I saw a dog',
               noun   => {
                   "" => 'I',
               },
               verb   => {
                   "" => 'saw',
               },
               object => {
                   ""      => 'a dog',
                   article => {
                       "" => 'a',
                   },
                   noun    => {
                       "" => 'dog',
                   },
               },
           }
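
       A short sketch of how the context substring and the automatic
       unpacking might be observed from calling code follows. The token
       definitions for "noun", "verb", and "article", the "object" rule, and
       the sample sentence are assumptions made for this example.

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical vocabulary, just large enough to parse the sample sentence.
my $parser = qr{
    <sentence>

    <rule: sentence>    <noun> <verb> <object>
    <rule: object>      <article> <noun>
    <token: noun>       I | dog
    <token: verb>       saw
    <token: article>    a | the
}xms;

my %result;
if ('I saw a dog' =~ $parser) {
    %result = %/;
}

my $sentence = $result{sentence};

# The empty-string key holds the context substring for the whole rule.
print "context: $sentence->{''}\n";

# 'noun' called no subrules, so it is unpacked to a plain string...
print "noun is a ", ( ref $sentence->{noun} || 'plain string' ), "\n";

# ...whereas 'object' saved nested results, so it stays a hash reference.
print "object is a ", ref $sentence->{object}, " reference\n";
print "object noun: $sentence->{object}{noun}\n";
```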

       Turning off the context substring

       The context substring is convenient for debugging and for generating
       error messages but, in a large grammar, or when parsing a long string,
       the capture and storage of many nested substrings may quickly become
       prohibitively expensive.

       So Regexp::Grammars provides a directive to prevent context substrings
       from being retained. Any rule or token that includes the directive
       "<nocontext:>" anywhere in the rule's body will not retain any context
       substring it matches...unless that substring would be the only entry in
       its result hash (which only happens within objrules and objtokens).

       If a "<nocontext:>" directive appears before the first rule or token
       definition (i.e. as part of the main pattern), then the entire grammar
       will discard all context substrings from every one of its rules and
       tokens.

       However, you can override this universal prohibition with a second
       directive: "<context:>". If this directive appears in any rule or
       token, that rule or token will save its context substring, even if a
       global "<nocontext:>" is in effect.

       This means that this grammar:

           qr{
               <Command>

               <rule: Command>
                   <nocontext:>
                   <Keyword> <arg=(\S+)>+ % <.ws>

               <token: Keyword>
                   <Move> | <Copy> | <Delete>

               # etc.
           }x

       and this grammar:

           qr{
               <nocontext:>
               <Command>

               <rule: Command>
                   <Keyword> <arg=(\S+)>+ % <.ws>

               <token: Keyword>
                   <context:>
                   <Move> | <Copy> | <Delete>

               # etc.
           }x

       will behave identically (saving context substrings for keywords, but
       not for commands), except that the first version will also retain the
       global context substring (i.e. $/{""}), whereas the second version will
       not.

       Note that "<context:>" and "<nocontext:>" have no effect on, or even
       any interaction with, the various result distillation mechanisms, which
       continue to work in the usual way when either or both of the directives
       is used.
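
       One way to check the effect of a per-rule "<nocontext:>" is to test
       for the empty-string key after a match. The sketch below assumes a
       tiny made-up grammar ("command", "keyword", "target") and sample
       input; it illustrates the behaviour described above and is not taken
       from the module's own test suite.

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: only the 'command' rule drops its context substring.
my $parser = qr{
    <command>

    <rule: command>
        <nocontext:>
        <keyword> <target>

    <token: keyword>    move | copy
    <token: target>     \w+
}xms;

my %result;
if ('copy notes' =~ $parser) {
    %result = %/;
}

# Report which levels of the result tree still carry a context substring.
printf "global context kept:  %s\n", exists $result{''}          ? 'yes' : 'no';
printf "command context kept: %s\n", exists $result{command}{''} ? 'yes' : 'no';
print  "keyword: $result{command}{keyword}\n";
print  "target:  $result{command}{target}\n";
```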

   Renaming subrule results
       It is not always convenient to have subrule results stored under the
       same name as the rule itself. Rule names should be optimized for
       understanding the behaviour of the parser, whereas result names should
       be optimized for understanding the structure of the data. Often those
       two goals are identical, but not always; sometimes rule names need to
       describe what the data looks like, while result names need to describe
       what the data means.

       For example, sometimes you need to call the same rule twice, to match
       two syntactically identical components whose positions give them
       semantically distinct meanings:

           <rule: copy_cmd>
               copy <file> <file>

       The problem here is that, if the second call to "<file>" succeeds, its
       result-hash will be stored under the key 'file', clobbering the data
       that was returned from the first call to "<file>".

       To avoid such problems, Regexp::Grammars allows you to alias any
       subrule call, so that it is still invoked by the original name, but its
       result-hash is stored under a different key. The syntax for that is:
       "<alias=rulename>". For example:

           <rule: copy_cmd>
               copy <from=file> <to=file>

       Here, "<rule: file>" is called twice, with the first result-hash being
       stored under the key 'from', and the second result-hash being stored
       under the key 'to'.

       Note, however, that the alias before the "=" must be a proper
       identifier (i.e. a letter or underscore, followed by letters, digits,
       and/or underscores). Aliases that start with an underscore and aliases
       named "MATCH" have special meaning (see "Private subrule calls" and
       "Result distillation" respectively).

       Aliases can also be useful for normalizing data that may appear in
       different formats and sequences. For example:

           <rule: copy_cmd>
               copy <from=file>        <to=file>
             | dup    <to=file>  as  <from=file>
             |      <from=file>  ->    <to=file>
             |        <to=file>  <-  <from=file>

       Here, regardless of which order the old and new files are specified,
       the result-hash always gets:

           copy_cmd => {
               from => 'oldfile',
                 to => 'newfile',
           }
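
       The normalizing effect of aliases can be sketched with a cut-down
       version of the rule above. Only the first two alternatives are used
       here, and the "file" token and the sample commands are assumptions
       made for this example.

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Two surface syntaxes, one result shape (hypothetical 'file' token).
my $parser = qr{
    <copy_cmd>

    <rule: copy_cmd>
          copy <from=file>       <to=file>
        | dup    <to=file>  as  <from=file>

    <token: file>
        [\w.]+
}xms;

my @results;
for my $command ('copy old.txt new.txt', 'dup new.txt as old.txt') {
    if ($command =~ $parser) {
        my %copy = %{ $/{copy_cmd} };    # copy before the next match
        push @results, { from => $copy{from}, to => $copy{to} };
        print "$command: from=$copy{from} to=$copy{to}\n";
    }
}
```

       Code that consumes @results never needs to know which syntax was used,
       since both commands should store the source under 'from' and the
       destination under 'to'.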
685
686   List-like subrule calls
687       If a subrule call is quantified with a repetition specifier:
688
689           <rule: file_sequence>
690               <file>+
691
692       then each repeated match overwrites the corresponding entry in the
693       surrounding rule's result-hash, so only the result of the final
694       repetition will be retained. That is, if the above example matched the
695       string "foo.pl bar.py baz.php", then the result-hash would contain:
696
697           file_sequence {
698               ""   => 'foo.pl bar.py baz.php',
699               file => 'baz.php',
700           }
701
702       Usually, that's not the desired outcome, so Regexp::Grammars provides
703       another mechanism by which to call a subrule; one that saves all
704       repetitions of its results.
705
706       A regular subrule call consists of the rule's name surrounded by angle
707       brackets. If, instead, you surround the rule's name with "<[...]>"
708       (angle and square brackets) like so:
709
710           <rule: file_sequence>
711               <[file]>+
712
713       then the rule is invoked in exactly the same way, but the result of
714       that submatch is pushed onto an array nested inside the appropriate
715       result-hash entry. In other words, if the above example matched the
716       same "foo.pl bar.py baz.php" string, the result-hash would contain:
717
718           file_sequence => {
719               ""   => 'foo.pl bar.py baz.php',
720               file => [ 'foo.pl', 'bar.py', 'baz.php' ],
721           }
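
As a rough illustration of the difference described above, the following sketch wraps the listifying call in a complete program. The "file" token definition, the helper name parse_files(), and the sample input are assumptions made for this example only; they are not part of the module.

```perl
use strict;
use warnings;
use Regexp::Grammars;    # CPAN module; must be in scope where the qr{} is compiled

# Hypothetical grammar: the definition of "file" is an assumption for this sketch
my $parser = qr{
    <file_sequence>

    <rule: file_sequence>
        <[file]>+

    <token: file>
        [\w.]+
}xms;

# Hypothetical helper: returns the array of file names, or undef on failure
sub parse_files {
    my ($text) = @_;
    return $text =~ $parser ? $/{file_sequence}{file} : undef;
}

my $files = parse_files('foo.pl bar.py baz.php');
print join(', ', @{$files}), "\n" if defined $files;
```

Replacing "<[file]>+" with "<file>+" in this sketch should leave only the last file name in the result-hash, as explained earlier in this section.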
722
723       This "listifying subrule call" can also be useful for non-repeated
724       subrule calls, if the same subrule is invoked in several places in a
725       grammar. For example, if a command-line option could be given either
726       one or two values, you might parse it like this:
727
728           <rule: size_option>
729               -size <[size]> (?: x <[size]> )?
730
731       The result-hash entry for 'size' would then always contain an array,
732       with either one or two elements, depending on the input being parsed.
733
734       Listifying subrules can also be given aliases, just like ordinary
735       subrules. The alias is always specified inside the square brackets:
736
737           <rule: size_option>
738               -size <[size=pos_integer]> (?: x <[size=pos_integer]> )?
739
740       Here, the sizes are parsed using the "pos_integer" rule, but saved in
741       the result-hash in an array under the key 'size'.
742
743   Parametric subrules
744       When a subrule is invoked, it can be passed a set of named arguments
745       (specified as "key=>value" pairs). This argument list is placed in a
746       normal Perl regex code block and must appear immediately after the
747       subrule name, before the closing angle bracket.
748
749       Within the subrule that has been invoked, the arguments can be accessed
750       via the special hash %ARG. For example:
751
752           <rule: block>
753               <tag>
754                   <[block]>*
755               <end_tag(?{ tag=>$MATCH{tag} })>  # ...call subrule with argument
756
757           <token: end_tag>
758               end_ (??{ quotemeta $ARG{tag} })
759
760       Here the "block" rule first matches a "<tag>", and the corresponding
761       substring is saved in $MATCH{tag}. It then matches any number of nested
762       blocks. Finally it invokes the "<end_tag>" subrule, passing it an
763       argument whose name is 'tag' and whose value is the current value of
764       $MATCH{tag} (i.e. the original opening tag).
765
766       When it is thus invoked, the "end_tag" token first matches 'end_', then
767       interpolates the literal value of the 'tag' argument and attempts to
768       match it.
769
770       Any number of named arguments can be passed when a subrule is invoked.
771       For example, we could generalize the "end_tag" rule to allow any prefix
772       (not just 'end_'), and also to allow for 'if...fi'-style reversed tags,
773       like so:
774
775           <rule: block>
776               <tag>
777                   <[block]>*
778               <end_tag (?{ prefix=>'end', tag=>$MATCH{tag} })>
779
780           <token: end_tag>
781               (??{ $ARG{prefix} // q{(?!)} })      # ...prefix as pattern
782               (??{ quotemeta $ARG{tag} })          # ...tag as literal
783             |
784               (??{ quotemeta reverse $ARG{tag} })  # ...reversed tag
785
786       Note that, if you do not need to interpolate values (such as
787       $MATCH{tag}) into a subrule's argument list, you can use simple
788       parentheses instead of "(?{...})", like so:
789
790               <end_tag( prefix=>'end', tag=>'head' )>
791
792       The only types of values you can use in this simplified syntax are
793       numbers and single-quote-delimited strings.  For anything more complex,
794       put the argument list in a full "(?{...})".
795
796       As the earlier examples show, the single most common type of argument
797       is one of the form: "IDENTIFIER => $MATCH{IDENTIFIER}". That is, it's
798       a common requirement to pass an element of %MATCH into a subrule, named
799       with its own key.
800
801       Because this is such a common usage, Regexp::Grammars provides a
802       shortcut. If you use simple parentheses (instead of "(?{...})"
803       parentheses) then instead of a pair, you can specify an argument using
804       a colon followed by an identifier.  This argument is replaced by a
805       named argument whose name is the identifier and whose value is the
806       corresponding item from %MATCH. So, for example, instead of:
807
808               <end_tag(?{ prefix=>'end', tag=>$MATCH{tag} })>
809
810       you can just write:
811
812               <end_tag( prefix=>'end', :tag )>
813
814       Note that, from Perl 5.20 onwards, due to changes in the way that Perl
815       parses regexes, Regexp::Grammars does not support explicitly passing
816       elements of %MATCH as argument values within a list subrule (yeah, it's
817       a very specific and obscure edge-case):
818
819               <[end_tag(?{ prefix=>'end', tag=>$MATCH{tag} })]>   # Does not work
820
821       Note, however, that the shortcut:
822
823               <[end_tag( prefix=>'end', :tag )]>
824
825       still works correctly.
826
827       Accessing subrule arguments more cleanly
828
829       As the preceding examples illustrate, using subrule arguments
830       effectively generally requires the use of run-time interpolated
831       subpatterns via the "(??{...})" construct.
832
833       This produces ugly rule bodies such as:
834
835           <token: end_tag>
836               (??{ $ARG{prefix} // q{(?!)} })      # ...prefix as pattern
837               (??{ quotemeta $ARG{tag} })          # ...tag as literal
838             |
839               (??{ quotemeta reverse $ARG{tag} })  # ...reversed tag
840
841       To simplify these common usages, Regexp::Grammars provides three
842       convenience constructs.
843
844       A subrule call of the form "<:identifier>" is equivalent to:
845
846           (??{ $ARG{'identifier'} // q{(?!)} })
847
848       Namely: "Match the contents of $ARG{'identifier'}, treating those
849       contents as a pattern."
850
851       A subrule call of the form "<\:identifier>" (that is: a matchref with
852       a colon after the backslash) is equivalent to:
853
854           (??{ defined $ARG{'identifier'}
855                   ? quotemeta($ARG{'identifier'})
856                   : '(?!)'
857           })
858
859       Namely: "Match the contents of $ARG{'identifier'}, treating those
860       contents as a literal."
861
862       A subrule call of the form "</:identifier>" (that is: an invertref
863       with a colon after the forward slash) is equivalent to:
864
865           (??{ defined $ARG{'identifier'}
866                   ? quotemeta(reverse $ARG{'identifier'})
867                   : '(?!)'
868           })
869
870       Namely: "Match the closing delimiter corresponding to the contents of
871       $ARG{'identifier'}, as if it were a literal".
872
873       The availability of these three constructs means that we could
874       rewrite the above "<end_tag>" token much more cleanly as:
875
876           <token: end_tag>
877               <:prefix>      # ...prefix as pattern
878               <\:tag>        # ...tag as a literal
879             |
880               </:tag>        # ...reversed tag
881
882       In general these constructs mean that, within a subrule, if you want to
883       match an argument passed to that subrule, you use "<:ARGNAME>" (to
884       match the argument as a pattern) or "<\:ARGNAME>" (to match the
885       argument as a literal).
886
887       Note the consistent mnemonic in these various subrule-like
888       interpolations of named arguments: the name is always prefixed by a
889       colon.
890
891       In other words, the "<:ARGNAME>" form works just like a "<RULENAME>",
892       except that the leading colon tells Regexp::Grammars to use the
893       contents of $ARG{'ARGNAME'} as the subpattern, instead of the contents
894       of "(?&RULENAME)".
895
896       Likewise, the "<\:ARGNAME>" and "</:ARGNAME>" constructs work exactly
897       like "<\_MATCHNAME>" and "</INVERTNAME>" respectively, except that the
898       leading colon indicates that the matchref or invertref should be taken
899       from %ARG instead of from %MATCH.
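
The following sketch puts the ":name" argument shortcut and the "<\:name>" construct together in a small, self-contained grammar. The rule and token names ("block", "name", "stmt", "closer"), the helper parse_block(), and the "begin ... end" input format are all assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Regexp::Grammars;    # CPAN module; must be in scope where the qr{} is compiled

# Hypothetical grammar: a named block that must be closed with the same name
my $parser = qr{
    <block>

    <rule: block>
        begin <name>
            <[stmt]>*
        <closer(:name)>

    <token: name>    [a-z]+
    <token: stmt>    \d+
    <token: closer>  end \s+ <\:name>
}xms;

# Hypothetical helper: returns the result-hash for the block, or undef
sub parse_block {
    my ($text) = @_;
    return $text =~ $parser ? $/{block} : undef;
}

for my $input ('begin loop 1 2 end loop', 'begin loop 1 2 end other') {
    my $result = parse_block($input);
    printf "%-26s => %s\n", $input, defined $result ? 'matched' : 'no match';
}
```

Here "<closer(:name)>" passes the current value of $MATCH{name} to the "closer" token, which then matches it as a literal via "<\:name>". The second input is therefore expected to fail, because its closing name differs from the opening one.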
900
901   Pseudo-subrules
902       Aliases can also be given to standard Perl subpatterns, as well as to
903       code blocks within a regex. The syntax for subpatterns is:
904
905           <ALIAS= (SUBPATTERN) >
906
907       In other words, the syntax is exactly like an aliased subrule call,
908       except that the rule name is replaced with a set of parentheses
909       containing the subpattern. Any parentheses--capturing or
910       non-capturing--will do.
911
912       The effect of aliasing a standard subpattern is to cause whatever that
913       subpattern matches to be saved in the result-hash, using the alias as
914       its key. For example:
915
916           <rule: file_command>
917
918               <cmd=(mv|cp|ln)>  <from=file>  <to=file>
919
920       Here, the "<cmd=(mv|cp|ln)>" is treated exactly like a regular
921       "(mv|cp|ln)", but whatever substring it matches is saved in the result-
922       hash under the key 'cmd'.
923
924       The syntax for aliasing code blocks is:
925
926           <ALIAS= (?{ your($code->here) }) >
927
928       Note, however, that the code block must be specified in the standard
929       Perl 5.10 regex notation: "(?{...})". A common mistake is to write:
930
931           <ALIAS= { your($code->here) } >
932
933       instead, which will attempt to interpolate $code before the regex is
934       even compiled, as such variables are only "protected" from
935       interpolation inside a "(?{...})".
936
937       When correctly specified, this construct executes the code in the block
938       and saves the result of that execution in the result-hash, using the
939       alias as its key. Aliased code blocks are useful for adding semantic
940       information based on which branch of a rule is executed. For example,
941       consider the "copy_cmd" alternatives shown earlier:
942
943           <rule: copy_cmd>
944               copy <from=file>        <to=file>
945             | dup    <to=file>  as  <from=file>
946             |      <from=file>  ->    <to=file>
947             |        <to=file>  <-  <from=file>
948
949       Using aliased code blocks, you could add an extra field to the result-
950       hash to describe which form of the command was detected, like so:
951
952           <rule: copy_cmd>
953               copy <from=file>        <to=file>  <type=(?{ 'std' })>
954             | dup    <to=file>  as  <from=file>  <type=(?{ 'rev' })>
955             |      <from=file>  ->    <to=file>  <type=(?{  +1   })>
956             |        <to=file>  <-  <from=file>  <type=(?{  -1   })>
957
958       Now, if the rule matched, the result-hash would contain something like:
959
960           copy_cmd => {
961               from => 'oldfile',
962                 to => 'newfile',
963               type => 'std',
964           }
965
966       Note that, in addition to the semantics described above, aliased
967       subpatterns and code blocks also become visible to Regexp::Grammars'
968       integrated debugger (see Debugging).
969
970   Aliased literals
971       As the previous example illustrates, it is inconveniently verbose to
972       assign constants via aliased code blocks. So Regexp::Grammars provides
973       a short-cut. It is possible to directly alias a numeric literal or a
974       single-quote delimited literal string, without putting either inside a
975       code block. For example, the previous example could also be written:
976
977           <rule: copy_cmd>
978               copy <from=file>        <to=file>  <type='std'>
979             | dup    <to=file>  as  <from=file>  <type='rev'>
980             |      <from=file>  ->    <to=file>  <type= +1  >
981             |        <to=file>  <-  <from=file>  <type= -1  >
982
983       Note that only these two forms of literal are supported in this
984       abbreviated syntax.
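
A minimal sketch of aliased literals in a complete program follows. Only the first two alternatives of "copy_cmd" are kept for brevity; the "file" token definition and the helper parse_copy() are assumptions made for this example.

```perl
use strict;
use warnings;
use Regexp::Grammars;    # CPAN module; must be in scope where the qr{} is compiled

# Hypothetical grammar: two of the copy_cmd alternatives, tagged with literals
my $parser = qr{
    <copy_cmd>

    <rule: copy_cmd>
        copy <from=file>        <to=file>  <type='std'>
      | dup    <to=file>  as  <from=file>  <type='rev'>

    <token: file>
        [\w.]+
}xms;

# Hypothetical helper: returns the result-hash for the command, or undef
sub parse_copy {
    my ($text) = @_;
    return $text =~ $parser ? $/{copy_cmd} : undef;
}

for my $input ('copy old.txt new.txt', 'dup new.txt as old.txt') {
    my $cmd = parse_copy($input) or next;
    print "from=$cmd->{from} to=$cmd->{to} type=$cmd->{type}\n";
}
```

Both inputs should yield the same 'from' and 'to' entries, with only the 'type' entry recording which alternative matched.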
985
986   Amnesiac subrule calls
987       By default, every subrule call saves its result into the result-hash,
988       either under its own name, or under an alias.
989
990       However, sometimes you may want to refactor some literal part of a rule
991       into one or more subrules, without having those submatches added to the
992       result-hash. The syntax for calling a subrule, but ignoring its return
993       value is:
994
995           <.SUBRULE>
996
997       (which is stolen directly from Perl 6).
998
999       For example, you may prefer to rewrite a rule such as:
1000
1001           <rule: paren_pair>
1002
1003               \(
1004                   (?: <escape> | <paren_pair> | <brace_pair> | [^()] )*
1005               \)
1006
1007       without any literal matching, like so:
1008
1009           <rule: paren_pair>
1010
1011               <.left_paren>
1012                   (?: <escape> | <paren_pair> | <brace_pair> | <.non_paren> )*
1013               <.right_paren>
1014
1015           <token: left_paren>   \(
1016           <token: right_paren>  \)
1017           <token: non_paren>    [^()]
1018
1019       Moreover, as the individual components inside the parentheses probably
1020       aren't being captured for any useful purpose either, you could further
1021       optimize that to:
1022
1023           <rule: paren_pair>
1024
1025               <.left_paren>
1026                   (?: <.escape> | <.paren_pair> | <.brace_pair> | <.non_paren> )*
1027               <.right_paren>
1028
1029       Note that you can also use the dot modifier on an aliased subpattern:
1030
1031           <.Alias= (SUBPATTERN) >
1032
1033       This seemingly contradictory behaviour (of giving a subpattern a name,
1034       then deliberately ignoring that name) actually does make sense in one
1035       situation. Providing the alias makes the subpattern visible to the
1036       debugger, while using the dot stops it from affecting the result-hash.
1037       See "Debugging non-grammars" for an example of this usage.
1038
1039   Private subrule calls
1040       If a rule name (or an alias) begins with an underscore:
1041
1042            <_RULENAME>       <_ALIAS=RULENAME>
1043           <[_RULENAME]>     <[_ALIAS=RULENAME]>
1044
1045       then matching proceeds as normal, and any result that is returned is
1046       stored in the current result-hash in the usual way.
1047
1048       However, when any rule finishes (and just before it returns) it first
1049       filters its result-hash, removing any entries whose keys begin with an
1050       underscore. This means that any subrule with an underscored name (or
1051       with an underscored alias) remembers its result, but only until the end
1052       of the current rule. Its results are effectively private to the current
1053       rule.
1054
1055       This is especially useful in conjunction with result distillation.
1056
1057   Lookahead (zero-width) subrules
1058       Non-capturing subrule calls can be used in normal lookaheads:
1059
1060           <rule: qualified_typename>
1061               # A valid typename that has a :: in it...
1062               (?= <.typename> )  [^\s:]+ :: \S+
1063
1064           <rule: identifier>
1065               # An alpha followed by alnums (but not a valid typename)...
1066               (?! <.typename> )    [^\W\d]\w*
1067
1068       but the syntax is a little unwieldy. More importantly, an internal
1069       problem with backtracking causes positive lookaheads to mess up the
1070       module's named capturing mechanism.
1071
1072       So Regexp::Grammars provides two shorthands:
1073
1074           <!typename>        same as: (?! <.typename> )
1075           <?typename>        same as: (?= <.typename> ) ...but works correctly!
1076
1077       These two constructs can also be called with arguments, if necessary:
1078
1079           <rule: Command>
1080               <Keyword>
1081               (?:
1082                   <!Terminator(:Keyword)>  <Args=(\S+)>
1083               )?
1084               <Terminator(:Keyword)>
1085
1086       Note that, as the above equivalences imply, neither of these forms of a
1087       subrule call ever captures what it matches.
1088
1089   Matching separated lists
1090       One of the commonest tasks in text parsing is to match a list of
1091       unspecified length, in which items are separated by a fixed token.
1092       Things like:
1093
1094           1, 2, 3 , 4 ,13, 91        # Numbers separated by commas and spaces
1095
1096           g-c-a-g-t-t-a-c-a          # DNA bases separated by dashes
1097
1098           /usr/local/bin             # Names separated by directory markers
1099
1100           /usr:/usr/local:bin        # Directories separated by colons
1101
1102       The usual construct required to parse these kinds of structures is
1103       either:
1104
1105           <rule: list>
1106
1107               <item> <separator> <list>     # recursive definition
1108             | <item>                        # base case
1109
1110       or, if you want to allow zero-or-more items instead of requiring one-
1111       or-more:
1112
1113           <rule: list_opt>
1114               <list>?                       # entire list may be missing
1115
1116           <rule: list>                      # as before...
1117               <item> <separator> <list>     #   recursive definition
1118             | <item>                        #   base case
1119
1120       Or, more efficiently, but less prettily:
1121
1122           <rule: list>
1123               <[item]> (?: <separator> <[item]> )*           # one-or-more
1124
1125           <rule: list_opt>
1126               (?: <[item]> (?: <separator> <[item]> )* )?    # zero-or-more
1127
1128       Because separated lists are such a common component of grammars,
1129       Regexp::Grammars provides cleaner ways to specify them:
1130
1131           <rule: list>
1132               <[item]>+ % <separator>      # one-or-more
1133
1134           <rule: list_zom>
1135               <[item]>* % <separator>      # zero-or-more
1136
1137       Note that these are just regular repetition qualifiers (i.e. "+" and
1138       "*") applied to a subrule ("<[item]>"), with a "%" modifier after them
1139       to specify the required separator between the repeated matches.
1140
1141       The number of repetitions matched is controlled both by the nature of
1142       the qualifier ("+" vs "*") and by the subrule specified after the "%".
1143       The qualified subrule will be repeatedly matched for as long as its
1144       qualifier allows, provided that the second subrule also matches between
1145       those repetitions.
1146
1147       For example, you can match a parenthesized sequence of one-or-more
1148       numbers separated by commas, such as:
1149
1150           (1, 2, 3, 4, 13, 91)        # Numbers separated by commas (and spaces)
1151
1152       with:
1153
1154           <rule: number_list>
1155
1156               \(  <[number]>+ % <comma>  \)
1157
1158           <token: number>  \d+
1159           <token: comma>   ,
1160
1161       Note that any spaces around the commas will be ignored because
1162       "<number_list>" is specified as a rule and the "+%" specifier has
1163       spaces within and around it. To disallow spaces around the commas, make
1164       sure there are no spaces in or around the "+%":
1165
1166           <rule: number_list_no_spaces>
1167
1168               \( <[number]>+%<comma> \)
1169
1170       (or else specify the rule as a token instead).
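
To see the separated-list construct in context, here is a sketch that embeds the "number_list" rule shown above in a runnable program. The helper name parse_numbers() and the sample inputs are assumptions for illustration.

```perl
use strict;
use warnings;
use Regexp::Grammars;    # CPAN module; must be in scope where the qr{} is compiled

my $parser = qr{
    <number_list>

    <rule: number_list>
        \(  <[number]>+ % <comma>  \)

    <token: number>  \d+
    <token: comma>   ,
}xms;

# Hypothetical helper: returns the array of numbers, or undef on failure
sub parse_numbers {
    my ($text) = @_;
    return $text =~ $parser ? $/{number_list}{number} : undef;
}

for my $input ('(1, 2, 3, 4, 13, 91)', '()') {
    my $numbers = parse_numbers($input);
    print "$input => ",
          (defined $numbers ? join(' ', @{$numbers}) : 'no match'), "\n";
}
```

Because the "+" qualifier requires at least one item, the empty list "()" is not expected to match here; changing "+" to "*" should allow it.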
1171
1172       Because the "%" is a modifier applied to a qualifier, you can modify
1173       any other repetition qualifier in the same way. For example:
1174
1175           <[item]>{2,4} % <sep>   # two-to-four items, separated
1176
1177           <[item]>{7}   % <sep>   # exactly 7 items, separated
1178
1179           <[item]>{10,}? % <sep>  # at least 10 items, matched non-greedily
1180
1181       You can even do this:
1182
1183           <[item]>? % <sep>       # one-or-zero items, (theoretically) separated
1184
1185       though the separator specification is, of course, meaningless in that
1186       case as it will never be needed to separate a maximum of one item.
1187
1188       If a "%" appears anywhere else in a grammar (i.e. not immediately after
1189       a repetition qualifier), it is treated normally (i.e. as a self-
1190       matching literal character):
1191
1192           <token: perl_hash>
1193               % <ident>                # match "%foo", "%bar", etc.
1194
1195           <token: perl_mod>
1196               <expr> % <expr>          # match "$n % 2", "($n+3) % ($n-1)", etc.
1197
1198       If you need to match a literal "%" immediately after a repetition,
1199       either quote it:
1200
1201           <token: percentage>
1202               \d{1,3} \% solution                  # match "7% solution", etc.
1203
1204       or refactor the "%" character:
1205
1206           <token: percentage>
1207               \d{1,3} <percent_sign> solution      # match "7% solution", etc.
1208
1209           <token: percent_sign>
1210               %
1211
1212       Note that it's usually necessary to use the "<[...]>" form for the
1213       repeated items being matched, so that all of them are saved in the
1214       result hash. You can also save all the separators (if they're
1215       important) by specifying them as a list-like subrule too:
1216
1217           \(  <[number]>* % <[comma]>  \)  # save numbers *and* separators
1218
1219       The repeated item must be specified as a subrule call of some kind
1220       (i.e. in angles), but the separators may be specified either as a
1221       subrule or as a raw bracketed pattern. For example:
1222
1223           <[number]>* % ( , | : )    # Numbers separated by commas or colons
1224
1225           <[number]>* % [,:]         # Same, but more efficiently matched
1226
1227       The separator should always be specified within matched delimiters of
1228       some kind: either matching "<...>" or matching "(...)" or matching
1229       "[...]". Simple, non-bracketed separators will sometimes also work:
1230
1231           <[number]>+ % ,
1232
1233       but not always:
1234
1235           <[number]>+ % ,\s+     # Oops! Separator is just: ,
1236
1237       This is because of the limited way in which the module internally
1238       parses ordinary regex components (i.e. without full understanding of
1239       their implicit precedence). As a consequence, consistently placing
1240       brackets around any separator is a much safer approach:
1241
1242           <[number]>+ % (,\s+)
1243
1244       You can also use a simple pattern on the left of the "%" as the item
1245       matcher, but in this case it must always be aliased into a list-
1246       collecting subrule, like so:
1247
1248           <[item=(\d+)]>* % [,]
1249
1250       Note that, for backwards compatibility with earlier versions of
1251       Regexp::Grammars, the "+%" operator can also be written: "**".
1252       However, there can be no space between the two asterisks of this
1253       variant. That is:
1254
1255           <[item]> ** <sep>      # same as <[item]>+ % <sep>
1256
1257           <[item]>* * <sep>      # error (two * qualifiers in a row)
1258
1259   Matching hash keys
1260       In some situations a grammar may need a rule that matches dozens,
1261       hundreds, or even thousands of one-word alternatives. For example, when
1262       matching command names, or valid userids, or English words. In such
1263       cases it is often impractical (and always inefficient) to list all the
1264       alternatives between "|" alternators:
1265
1266           <rule: shell_cmd>
1267               a2p | ac | apply | ar | automake | awk | ...
1268               # ...and 400 lines later
1269               ... | zdiff | zgrep | zip | zmore | zsh
1270
1271           <rule: valid_word>
1272               a | aa | aal | aalii | aam | aardvark | aardwolf | aba | ...
1273               # ...and 40,000 lines later...
1274               ... | zymotize | zymotoxic | zymurgy | zythem | zythum
1275
1276       To simplify such cases, Regexp::Grammars provides a special construct
1277       that allows you to specify all the alternatives as the keys of a normal
1278       hash. The syntax for that construct is simply to put the hash name
1279       inside angle brackets (with no space between the angles and the hash
1280       name).
1281
1282       Which means that the rules in the previous example could also be
1283       written:
1284
1285           <rule: shell_cmd>
1286               <%cmds>
1287
1288           <rule: valid_word>
1289               <%dict>
1290
1291       provided that the two hashes (%cmds and %dict) are visible in the scope
1292       where the grammar is created.
1293
1294       Matching a hash key in this way is typically significantly faster than
1295       matching a large set of alternations. Specifically, it is O(length of
1296       longest potential key) ^ 2, instead of O(number of keys).
1297
1298       Internally, the construct is converted to something equivalent to:
1299
1300           <rule: shell_cmd>
1301               (<.hk>)  <require: (?{ exists $cmds{$CAPTURE} })>
1302
1303           <rule: valid_word>
1304               (<.hk>)  <require: (?{ exists $dict{$CAPTURE} })>
1305
1306       The special "<hk>" rule is created automatically, and defaults to
1307       "\S+", but you can also define it explicitly to handle other kinds of
1308       keys. For example:
1309
1310           <rule: hk>
1311               [^\n]+        # Key may be any number of chars on a single line
1312
1313           <rule: hk>
1314               [ACGT]{10,}   # Key is a base sequence of at least 10 pairs
1315
1316       Alternatively, you can specify a different key-matching pattern for
1317       each hash you're matching, by placing the required pattern in braces
1318       immediately after the hash name. For example:
1319
1320           <rule: client_name>
1321               # Valid keys match <.hk> (default or explicitly specified)
1322               <%clients>
1323
1324           <rule: shell_cmd>
1325               # Valid keys contain only word chars, hyphen, slash, or dot...
1326               <%cmds { [\w-/.]+ }>
1327
1328           <rule: valid_word>
1329               # Valid keys contain only alphas or internal hyphen or apostrophe...
1330               <%dict{ (?i: (?:[a-z]+[-'])* [a-z]+ ) }>
1331
1332           <rule: DNA_sequence>
1333               # Valid keys are base sequences of at least 10 pairs...
1334               <%sequences{[ACGT]{10,}}>
1335
1336       This second approach to key-matching is preferred, because it localizes
1337       any non-standard key-matching behaviour to each individual hash.
1338
1339       Note that changes in the compilation process from Perl 5.18 onwards
1340       mean that in some cases the "<%hash>" construct only works reliably if
1341       the hash itself is declared at the outermost lexical scope (i.e. file
1342       scope).
1343
1344       Specifically, if the regex grammar does not include any interpolated
1345       scalars or arrays and the hash was declared within a subroutine (even
1346       within the same subroutine as the regex grammar that uses it), the
1347       regex will not be able to "see" the hash variable at compile-time. This
1348       will produce a "Global symbol "%hash" requires explicit package name"
1349       compile-time error. For example:
1350
1351           sub build_keyword_parser {
1352               # Hash declared inside subroutine...
1353               my %keywords = (foo => 1, bar => 1);
1354
1355               # ...then used in <%hash> construct within uninterpolated regex...
1356               return qr{
1357                           ^<keyword>$
1358                           <rule: keyword> <%keywords>
1359                        }x;
1360
1361               # ...produces compile-time error
1362           }
1363
1364       The solution is to place the hash outside the subroutine containing the
1365       grammar:
1366
1367           # Hash declared OUTSIDE subroutine...
1368           my %keywords = (foo => 1, bar => 1);
1369
1370           sub build_keyword_parser {
1371               return qr{
1372                           ^<keyword>$
1373                           <rule: keyword> <%keywords>
1374                        }x;
1375           }
1376
1377       ...or else to explicitly interpolate at least one scalar (even just a
1378       scalar containing an empty string):
1379
1380           sub build_keyword_parser {
1381               my %keywords = (foo => 1, bar => 1);
1382               my $DEFER_REGEX_COMPILATION = "";
1383
1384               return qr{
1385                           ^<keyword>$
1386                           <rule: keyword> <%keywords>
1387
1388                           $DEFER_REGEX_COMPILATION
1389                        }x;
1390           }
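
For completeness, here is a sketch of a whole program that uses the "<%hash>" construct with the hash declared at file scope, in line with the note above. The contents of %cmds, the helper is_known_cmd(), and the anchoring with "\A" and "\z" are assumptions made for this example.

```perl
use strict;
use warnings;
use Regexp::Grammars;    # CPAN module; must be in scope where the qr{} is compiled

# Declared at file scope so that the grammar can see it at compile-time
my %cmds = map { $_ => 1 } qw( ls cp mv );

my $parser = qr{
    \A <shell_cmd> \z

    <rule: shell_cmd>
        <%cmds>
}xms;

# Hypothetical helper: true if the whole string is a key of %cmds
sub is_known_cmd {
    my ($word) = @_;
    return $word =~ $parser ? 1 : 0;
}

for my $word (qw( cp rm )) {
    print "$word: ", (is_known_cmd($word) ? 'known' : 'unknown'), "\n";
}
```

Only strings that exist as keys of %cmds should be accepted; the values stored in the hash play no part in the match.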
1391
1392   Rematching subrule results
1393       Sometimes it is useful to be able to rematch a string that has
1394       previously been matched by some earlier subrule. For example, consider
1395       a rule to match shell-like control blocks:
1396
1397           <rule: control_block>
1398                 for   <expr> <[command]>+ endfor
1399               | while <expr> <[command]>+ endwhile
1400               | if    <expr> <[command]>+ endif
1401               | with  <expr> <[command]>+ endwith
1402
1403       This would be much tidier if we could factor out the command names
1404       (which are the only differences between the four alternatives). The
1405       problem is that the obvious solution:
1406
1407           <rule: control_block>
1408               <keyword> <expr>
1409                   <[command]>+
1410               end<keyword>
1411
1412       doesn't work, because it would also match an incorrect input like:
1413
1414           for 1..10
1415               echo $n
1416               ls subdir/$n
1417           endif
1418
1419       We need some way to ensure that the "<keyword>" matched immediately
1420       after "end" is the same "<keyword>" that was initially matched.
1421
1422       That's not difficult, because the first "<keyword>" will have captured
1423       what it matched into $MATCH{keyword}, so we could just write:
1424
1425           <rule: control_block>
1426               <keyword> <expr>
1427                   <[command]>+
1428               end(??{quotemeta $MATCH{keyword}})
1429
1430       This is such a useful technique, yet so ugly, scary, and prone to
1431       error, that Regexp::Grammars provides a cleaner equivalent:
1432
1433           <rule: control_block>
1434               <keyword> <expr>
1435                   <[command]>+
1436               end<\_keyword>
1437
1438       A directive of the form "<\_IDENTIFIER>" is known as a "matchref" (an
1439       abbreviation of "%MATCH-supplied backreference").  Matchrefs always
1440       attempt to match, as a literal, the current value of
1441       $MATCH{IDENTIFIER}.
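
The sketch below shows a matchref in a complete, deliberately simplified grammar. The "<expr>" subrule is omitted, and the "keyword" and "command" token definitions, the helper matches_block(), and the sample inputs are assumptions chosen to keep the example short.

```perl
use strict;
use warnings;
use Regexp::Grammars;    # CPAN module; must be in scope where the qr{} is compiled

# Hypothetical grammar: commands are reduced to bare numbers for brevity
my $parser = qr{
    <control_block>

    <rule: control_block>
        <keyword>
            <[command]>+
        end<\_keyword>

    <token: keyword>  for | while | if
    <token: command>  \d+
}xms;

# Hypothetical helper: true if the text contains a well-formed control block
sub matches_block {
    my ($text) = @_;
    return $text =~ $parser ? 1 : 0;
}

for my $input ('for 1 2 endfor', 'for 1 2 endif') {
    print "$input: ", (matches_block($input) ? 'matched' : 'no match'), "\n";
}
```

The second input is expected to fail, because the text after "end" must be exactly whatever the earlier "<keyword>" call matched.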
1442
1443       By default, a matchref does not capture what it matches, but you can
1444       have it do so by giving it an alias:
1445
1446           <token: delimited_string>
1447               <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
1448
1449           <token: str_delim> ["'`]
1450
1451       At first glance this doesn't seem very useful as, by definition,
1452       $MATCH{ldelim} and $MATCH{rdelim} must necessarily always end up with
1453       identical values. However, it can be useful if the rule also has other
1454       alternatives and you want to create a consistent internal
1455       representation for those alternatives, like so:
1456
1457           <token: delimited_string>
1458                 <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
1459               | <ldelim=( \[ )      .*?  <rdelim=( \] )
1460               | <ldelim=( \{ )      .*?  <rdelim=( \} )
1461               | <ldelim=( \( )      .*?  <rdelim=( \) )
1462               | <ldelim=( \< )      .*?  <rdelim=( \> )
1463
1464       You can also force a matchref to save repeated matches as a nested
1465       array, in the usual way:
1466
1467           <token: marked_text>
1468               <marker> <text> <[endmarkers=\_marker]>+
1469
1470       Be careful though, as the following will not do what you might expect:
1471
1472               <[marker]>+ <text> <[endmarkers=\_marker]>+
1473
1474       because the value of $MATCH{marker} will be an array reference. The
1475       matchref flattens and concatenates that array, then matches the
1476       resulting string as a literal. As a result, the previous example
1477       matches only endmarkers that are exact multiples of the complete start
1478       marker, rather than any number of repetitions of the individual start
1479       marker delimiter. So it will match:
1480
1481               ""text here""
1482               ""text here""""
1483               ""text here""""""
1484
1485       but not:
1486
1487               ""text here"""
1488               ""text here"""""
1489
1490       Uneven start and end markers such as these are extremely unusual, so
1491       this problem rarely arises in practice.
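
       The effect of that flattening can be reproduced in plain Perl. This is
       a sketch only: @markers stands in for the array that "<[marker]>+"
       would collect from a doubled-quote start marker, and all the variable
       names are invented for the illustration.

```perl
use strict;
use warnings;

my @markers   = ('"', '"');                    # two captured start-marker delimiters
my $flattened = quotemeta join '', @markers;   # one literal: the complete start marker
my $endmarker = qr/\A (?: $flattened )+ \z/x;  # one or more copies of that literal

for my $candidate ('""', '""""', '"""') {
    my $verdict = $candidate =~ $endmarker ? 'matches' : 'does not match';
    print "$candidate $verdict\n";
}
```

       Only the even-length endmarkers are accepted, because the unit being
       repeated is the whole two-character start marker.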
1492
1493       Note: Prior to Regexp::Grammars version 1.020, the syntax for matchrefs
1494       was "<\IDENTIFIER>" instead of "<\_IDENTIFIER>". This created problems
1495       when the identifier started with any of "l", "u", "L", "U", "Q", or
1496       "E", so the syntax has had to be altered in a backwards incompatible
1497       way. It will not be altered again.
1498
1499   Rematching balanced delimiters
1500       Consider the example in the previous section:
1501
1502           <token: delimited_string>
1503                 <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
1504               | <ldelim=( \[ )      .*?  <rdelim=( \] )
1505               | <ldelim=( \{ )      .*?  <rdelim=( \} )
1506               | <ldelim=( \( )      .*?  <rdelim=( \) )
1507               | <ldelim=( \< )      .*?  <rdelim=( \> )
1508
1509       The repeated pattern of the last four alternatives is galling, but we
1510       can't just refactor those delimiters as well:
1511
1512           <token: delimited_string>
1513                 <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
1514               | <ldelim=bracket>    .*?  <rdelim=\_ldelim>
1515
1516       because that would incorrectly match:
1517
1518           { delimited content here {
1519
1520       while failing to match:
1521
1522           { delimited content here }
1523
1524       To refactor balanced delimiters like those, we need a second kind of
1525       matchref; one that's a little smarter.
1526
1527       Or, preferably, a lot smarter...because there are many other kinds of
1528       balanced delimiters, apart from single brackets. For example:
1529
1530             {{{ delimited content here }}}
1531              /* delimited content here */
1532              (* delimited content here *)
1533              `` delimited content here ''
1534              if delimited content here fi
1535
1536       The common characteristic of these delimiter pairs is that the closing
1537       delimiter is the inverse of the opening delimiter: the sequence of
1538       characters is reversed and certain characters (mainly brackets, but
1539       also single-quotes/backticks) are mirror-reflected.
1540
1541       Regexp::Grammars supports the parsing of such delimiters with a
1542       construct known as an invertref, which is specified using the
1543       "</IDENT>" directive. An invertref acts very much like a matchref,
1544       except that it does not convert to:
1545
1546           (??{ quotemeta( $MATCH{IDENT} ) })
1547
1548       but rather to:
1549
1550           (??{ quotemeta( inverse( $MATCH{IDENT} )) })
1551
1552       With this directive available, the balanced delimiters of the previous
1553       example can be refactored to:
1554
1555           <token: delimited_string>
1556                 <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
1557               | <ldelim=( [[{(<] )  .*?  <rdelim=/ldelim>
1558
1559       Like matchrefs, invertrefs come in the usual range of flavours:
1560
1561           </ident>            # Match the inverse of $MATCH{ident}
1562           <ALIAS=/ident>      # Match inverse and capture to $MATCH{ALIAS}
1563           <[ALIAS=/ident]>    # Match inverse and push on @{$MATCH{ALIAS}}
1564
1565       The character pairs that are reversed during mirroring are: "{" and
1566       "}", "[" and "]", "(" and ")", "<" and ">", "«" and "»", "`" and "'".
1567
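       The inversion described above (reverse the characters, then mirror the
       brackets and quotes) can be sketched in plain Perl. The inverse()
       function below is a hypothetical stand-in written for this
       illustration, not the module's own code, and it handles only the ASCII
       pairs. The pattern again uses $1 where a grammar would use
       $MATCH{ldelim}.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the inversion step (ASCII pairs only).
sub inverse {
    my ($delim) = @_;
    my $mirrored = reverse $delim;             # scalar context: reverse the characters
    $mirrored =~ tr/{}[]()<>`'/}{][)(><'`/;    # then mirror each bracket or quote
    return $mirrored;
}

my $delimited = qr{
    \A
    ( \{+ | /\* | \(\* | `` | if )             # group 1: the opening delimiter
    .*?
    (??{ quotemeta inverse($1) })              # its inverse, matched as a literal
    \z
}xms;

sub is_delimited {
    my ($text) = @_;
    return $text =~ $delimited ? 1 : 0;
}

print is_delimited('{{{ delimited content here }}}'), "\n";   # prints 1
print is_delimited('{ delimited content here {'),     "\n";   # prints 0
```
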
1568       The following mnemonics may be useful in distinguishing invertrefs
1569       from matchrefs: a matchref starts with a "\" (like the standard Perl
1570       regex backrefs "\1", "\g{-2}", and "\k<name>"), whereas an invertref
1571       starts with a "/" (like an HTML or XML closing tag). Or just remember
1572       that "<\_IDENT>" is "match the same again", and if you want "the same
1573       again, only mirrored" instead, just mirror the "\" to get "</IDENT>".
1574
1575   Rematching parametric results and delimiters
1576       The "<\_IDENTIFIER>" and "</IDENTIFIER>" mechanisms normally locate the
1577       literal to be matched by looking in $MATCH{IDENTIFIER}.
1578
1579       However, you can cause them to look in $ARG{IDENTIFIER} instead, by
1580       prefixing the identifier with a single ":". This is especially useful
1581       when refactoring subrules. For example, instead of:
1582
1583           <rule: Command>
1584               <Keyword>  <CommandBody>  end_ <\_Keyword>
1585
1586           <rule: Placeholder>
1587               <Keyword>    \.\.\.   end_ <\_Keyword>
1588
1589       you could factor out a parameterized Terminator rule, like so:
1590
1591           <rule: Command>
1592               <Keyword>  <CommandBody>  <Terminator(:Keyword)>
1593
1594           <rule: Placeholder>
1595               <Keyword>    \.\.\.   <Terminator(:Keyword)>
1596
1597           <token: Terminator>
1598               end_ <\:Keyword>
1599
1600   Tracking and reporting match positions
1601       Regexp::Grammars automatically predefines a special token that makes it
1602       easy to track exactly where in its input a particular subrule matches.
1603       That token is: "<matchpos>".
1604
1605       The "<matchpos>" token implements a zero-width match that never fails.
1606       It always returns the current index within the string that the grammar
1607       is matching.
1608
1609       So, for example you could have your "<delimited_text>" subrule detect
1610       and report unterminated text like so:
1611
1612           <token: delimited_text>
1613               qq? <delim> <text=(.*?)> </delim>
1614           |
1615               <matchpos> qq? <delim>
1616               <error: (?{"Unterminated string starting at index $MATCH{matchpos}"})>
1617
1618       Matching "<matchpos>" in the second alternative causes $MATCH{matchpos}
1619       to contain the position in the string at which the "<matchpos>" subrule
1620       was matched (in this example: the start of the unterminated text).
1621
1622       If you want the line number instead of the string index, use the
1623       predefined "<matchline>" subrule instead:
1624
1625           <token: delimited_text>
1626                     qq? <delim> <text=(.*?)> </delim>
1627           |   <matchline> qq? <delim>
1628               <error: (?{"Unterminated string starting at line $MATCH{matchline}"})>
1629
1630       Note that the line numbers returned by "<matchline>" start at 1 (not at
1631       zero, as with "<matchpos>").
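
       The relationship between the two numbering schemes can be sketched in
       plain Perl. This is not how the module computes its values; it is an
       assumed equivalent in which the index is the zero-based offset of a
       match and the line number is one more than the count of newlines
       before that offset. The function name index_and_line() is invented for
       the illustration.

```perl
use strict;
use warnings;

sub index_and_line {
    my ($text, $pattern) = @_;
    return if $text !~ $pattern;

    my $index  = $-[0];                       # zero-based offset of the match start
    my $before = substr $text, 0, $index;     # everything preceding the match
    my $line   = 1 + ($before =~ tr/\n//);    # one-based: newlines before it, plus one
    return ($index, $line);
}

my ($index, $line) = index_and_line(qq{a\nb\n"unterminated}, qr/"/);
print "index $index, line $line\n";           # prints: index 4, line 3
```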
1632
1633       The "<matchpos>" and "<matchline>" subrules are just like any other
1634       subrules; you can alias them ("<started_at=matchpos>") or match them
1635       repeatedly ("(?: <[matchline]> <[item]> )++"), etc.
1636

Autoactions

1638       The module also supports event-based parsing. You can specify a grammar
1639       in the usual way and then, for a particular parse, layer a collection
1640       of call-backs (known as "autoactions") over the grammar to handle the
1641       data as it is parsed.
1642
1643       Normally, a grammar rule returns the result hash it has accumulated (or
1644       whatever else was aliased to "MATCH=" within the rule). However, you
1645       can specify an autoaction object before the grammar is matched.
1646
1647       Once the autoaction object is specified, every time a rule succeeds
1648       during the parse, its result is passed to the object via one of its
1649       methods; specifically it is passed to the method whose name is the same
1650       as the rule's.
1651
1652       For example, suppose you had a grammar that recognizes simple algebraic
1653       expressions:
1654
1655           my $expr_parser = do{
1656               use Regexp::Grammars;
1657               qr{
1658                   <Answer>
1659
1660                   <rule: Answer>     <[Operand=Mult]>+ % <[Op=(\+|\-)]>
1661
1662                   <rule: Mult>       <[Operand=Pow]>+  % <[Op=(\*|/|%)]>
1663
1664                   <rule: Pow>        <[Operand=Term]>+ % <Op=(\^)>
1665
1666                   <rule: Term>          <MATCH=Literal>
1667                              |       \( <MATCH=Answer> \)
1668
1669                   <token: Literal>   <MATCH=( [+-]? \d++ (?: \. \d++ )?+ )>
1670               }xms
1671           };
1672
1673       You could convert this grammar to a calculator, by installing a set of
1674       autoactions that convert each rule's result hash to the corresponding
1675       value of the sub-expression that the rule just parsed. To do that, you
1676       would create a class with methods whose names match the rules whose
1677       results you want to change. For example:
1678
1679           package Calculator;
1680           use List::Util qw< reduce >;
1681
1682           sub new {
1683               my ($class) = @_;
1684
1685               return bless {}, $class
1686           }
1687
1688           sub Answer {
1689               my ($self, $result_hash) = @_;
1690
1691               my $sum = shift @{$result_hash->{Operand}};
1692
1693               for my $term (@{$result_hash->{Operand}}) {
1694                   my $op = shift @{$result_hash->{Op}};
1695                   if ($op eq '+') { $sum += $term; }
1696                   else            { $sum -= $term; }
1697               }
1698
1699               return $sum;
1700           }
1701
1702           sub Mult {
1703               my ($self, $result_hash) = @_;
1704
1705               return reduce { eval($a . shift(@{$result_hash->{Op}}) . $b) }
1706                             @{$result_hash->{Operand}};
1707           }
1708
1709           sub Pow {
1710               my ($self, $result_hash) = @_;
1711
1712               return reduce { $b ** $a } reverse @{$result_hash->{Operand}};
1713           }
1714
1715       Objects of this class (and indeed the class itself) now have methods
1716       corresponding to some of the rules in the expression grammar. To apply
1717       those methods to the results of the rules (as they parse) you simply
1718       install an object as the "autoaction" handler, immediately before you
1719       initiate the parse:
1720
1721           if ($text =~ $expr_parser->with_actions(Calculator->new)) {
1722               say $/{Answer};   # Now prints the result of the expression
1723           }
1724
1725       The "with_actions()" method expects to be passed an object or
1726       classname. This object or class will be installed as the autoaction
1727       handler for the next match against any grammar. After that match, the
1728       handler will be uninstalled. "with_actions()" returns the grammar it's
1729       called on, making it easy to call it as part of a match (which is the
1730       recommended idiom).
1731
1732       With a "Calculator" object set as the autoaction handler, whenever the
1733       "Answer", "Mult", or "Pow" rule of the grammar matches, the
1734       corresponding "Answer", "Mult", or "Pow" method of the "Calculator"
1735       object will be called (with the rule's result value passed as its only
1736       argument), and the result of the method will be used as the result of
1737       the rule.
1738
1739       Note that nothing new happens when a "Term" or "Literal" rule matches,
1740       because the "Calculator" object doesn't have methods with those names.
1741
1742       The overall effect, then, is to allow you to specify a grammar without
1743       rule-specific behaviours and then, later, specify a set of final
1744       actions (as methods) for some or all of the rules of the grammar.
1745
1746       Note that, if a particular callback method returns "undef", the result
1747       of the corresponding rule will be passed through without modification.
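
       The dispatch rules just described (call the method named after the
       rule if the handler has one, and keep the original result if there is
       no such method or if it returns "undef") can be modelled in a few
       lines of plain Perl. This is a sketch of the behaviour only, not the
       module's code; apply_autoaction() and the Doubler class are invented
       for the illustration.

```perl
use strict;
use warnings;

package Doubler;

sub new    { my ($class) = @_; return bless {}, $class; }
sub Number { my ($self, $result) = @_; return $result * 2; }   # replaces the result
sub Skip   { return undef; }                                   # leaves the result alone

package main;

# Model of one autoaction call for a rule that has just succeeded.
sub apply_autoaction {
    my ($handler, $rule_name, $result) = @_;

    my $method = $handler->can($rule_name)      # works for objects and class names
        or return $result;                      # no such method: nothing new happens

    my $replacement = $handler->$method($result);
    return defined $replacement ? $replacement : $result;
}

print apply_autoaction(Doubler->new, 'Number', 21),    "\n";   # prints 42
print apply_autoaction(Doubler->new, 'Word',   'abc'), "\n";   # prints abc
print apply_autoaction('Doubler',    'Skip',   'xyz'), "\n";   # prints xyz
```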
1748

Named grammars

1750       All the grammars shown so far are confined to a single regex. However,
1751       Regexp::Grammars also provides a mechanism that allows you to define
1752       named grammars, which can then be imported into other regexes. This
1753       gives you a way of modularizing common grammatical components.
1754
1755   Defining a named grammar
1756       You can create a named grammar using the "<grammar:...>" directive.
1757       This directive must appear before the first rule definition in the
1758       grammar, and instead of any start-rule. For example:
1759
1760           qr{
1761               <grammar: List::Generic>
1762
1763               <rule: List>
1764                   <[MATCH=Item]>+ % <Separator>
1765
1766               <rule: Item>
1767                   \S++
1768
1769               <token: Separator>
1770                   \s* , \s*
1771           }x;
1772
1773       This creates a grammar named "List::Generic", and installs it in the
1774       module's internal caches, for future reference.
1775
1776       Note that there is no need (or reason) to assign the resulting regex to
1777       a variable, as the named grammar cannot itself be matched against.
1778
1779   Using a named grammar
1780       To make use of a named grammar, you need to incorporate it into another
1781       grammar, by inheritance. To do that, use the "<extends:...>" directive,
1782       like so:
1783
1784           my $parser = qr{
1785               <extends: List::Generic>
1786
1787               <List>
1788           }x;
1789
1790       The "<extends:...>" directive incorporates the rules defined in the
1791       specified grammar into the current regex. You can then call any of
1792       those rules in the start-pattern.
1793
1794   Overriding an inherited rule or token
1795       Subrule dispatch within a grammar is always polymorphic. That is, when
1796       a subrule is called, the most-derived rule of the same name within the
1797       grammar's hierarchy is invoked.
1798
1799       So, to replace a particular rule within a grammar, you simply need to
1800       inherit that grammar and specify new, more-specific versions of any
1801       rules you want to change. For example:
1802
1803           my $list_of_integers = qr{
1804               <List>
1805
1806               # Inherit rules from base grammar...
1807               <extends: List::Generic>
1808
1809               # Replace Item rule from List::Generic...
1810               <rule: Item>
1811                   [+-]? \d++
1812           }x;
1813
1814       You can also use "<extends:...>" in other named grammars, to create
1815       hierarchies:
1816
1817           qr{
1818               <grammar: List::Integral>
1819               <extends: List::Generic>
1820
1821               <token: Item>
1822                   [+-]? <MATCH=(<.Digit>+)>
1823
1824               <token: Digit>
1825                   \d
1826           }x;
1827
1828           qr{
1829               <grammar: List::ColonSeparated>
1830               <extends: List::Generic>
1831
1832               <token: Separator>
1833                   \s* : \s*
1834           }x;
1835
1836           qr{
1837               <grammar: List::Integral::ColonSeparated>
1838               <extends: List::Integral>
1839               <extends: List::ColonSeparated>
1840           }x;
1841
1842       As shown in the previous example, Regexp::Grammars allows you to
1843       multiply inherit two (or more) base grammars. For example, the
1844       "List::Integral::ColonSeparated" grammar takes the definitions of
1845       "List" and "Item" from the "List::Integral" grammar, and the definition
1846       of "Separator" from "List::ColonSeparated".
1847
1848       Note that grammars dispatch subrule calls using C3 method lookup,
1849       rather than Perl's older DFS lookup. That's why
1850       "List::Integral::ColonSeparated" correctly gets the more-specific
1851       "Separator" rule defined in "List::ColonSeparated", rather than the
1852       more-generic version defined in "List::Generic" (via "List::Integral").
1853       See "perldoc mro" for more discussion of the C3 dispatch algorithm.
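
       The difference between the two lookup orders can be seen with ordinary
       Perl classes that mirror the grammar hierarchy above. This is an
       analogy using Perl's own method resolution (via the core "mro"
       pragma), not a demonstration of the module's internals; the Demo::*
       class names are invented for the illustration.

```perl
use strict;
use warnings;

package Demo::Generic;
sub separator { return ','; }

package Demo::Integral;
our @ISA = ('Demo::Generic');

package Demo::ColonSeparated;
our @ISA = ('Demo::Generic');
sub separator { return ':'; }

package Demo::IntegralColon;        # same parents, resolved with C3
use mro 'c3';
our @ISA = ('Demo::Integral', 'Demo::ColonSeparated');

package Demo::IntegralColonDFS;     # same parents, Perl's default DFS order
our @ISA = ('Demo::Integral', 'Demo::ColonSeparated');

package main;

print join(' ', @{ mro::get_linear_isa('Demo::IntegralColon') }), "\n";
print Demo::IntegralColon->separator(),    "\n";   # prints :
print Demo::IntegralColonDFS->separator(), "\n";   # prints ,
```

       Under C3, "Demo::ColonSeparated" is searched before the shared base
       class, so its more-specific method wins. Under DFS, the base class is
       reached first through "Demo::Integral".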
1854
1855   Augmenting an inherited rule or token
1856       Instead of replacing an inherited rule, you can augment it.
1857
1858       For example, if you need a grammar for lists of hexadecimal numbers,
1859       you could inherit the behaviour of "List::Integral" and add the hex
1860       digits to its "Digit" token:
1861
1862           my $list_of_hexadecimal = qr{
1863               <List>
1864
1865               <extends: List::Integral>
1866
1867               <token: Digit>
1868                   <List::Integral::Digit>
1869                 | [A-Fa-f]
1870           }x;
1871
1872       If you call a subrule using a fully qualified name (such as
1873       "<List::Integral::Digit>"), the grammar calls that version of the rule,
1874       rather than the most-derived version.
1875
1876   Debugging named grammars
1877       Named grammars are independent of each other, even when inherited. This
1878       means that, if debugging is enabled in a derived grammar, it will not
1879       be active in any rules inherited from a base grammar, unless the base
1880       grammar also included a "<debug:...>" directive.
1881
1882       This is a deliberate design decision, as activating the debugger adds a
1883       significant amount of code to each grammar's implementation, which is
1884       detrimental to the matching performance of the resulting regexes.
1885
1886       If you need to debug a named grammar, the best approach is to include a
1887       "<debug: same>" directive at the start of the grammar. The presence of
1888       this directive will ensure the necessary extra debugging code is
1889       included in the regex implementing the grammar, while setting "same"
1890       mode will ensure that the debugging mode isn't altered when the matcher
1891       uses the inherited rules.
1892

Common parsing techniques

1894   Result distillation
1895       Normally, calls to subrules produce nested result-hashes within the
1896       current result-hash. Those nested hashes always have at least one
1897       automatically supplied key (""), whose value is the entire substring
1898       that the subrule matched.
1899
1900       If there are no other nested captures within the subrule, there will be
1901       no other keys in the result-hash. This would be annoying as a typical
1902       nested grammar would then produce results consisting of hashes of
1903       hashes, with each nested hash having only a single key (""). This in
1904       turn would make postprocessing the result-hash (in "%/") far more
1905       complicated than it needs to be.
1906
1907       To avoid this behaviour, if a subrule's result-hash doesn't contain any
1908       keys except "", the module "flattens" the result-hash, by replacing it
1909       with the value of its single key.
1910
1911       So, for example, the grammar:
1912
1913           mv \s* <from> \s* <to>
1914
1915           <rule: from>   [\w/.-]+
1916           <rule: to>     [\w/.-]+
1917
1918       doesn't return a result-hash like this:
1919
1920           {
1921               ""     => 'mv /usr/local/lib/libhuh.dylib  /dev/null/badlib',
1922               'from' => { "" => '/usr/local/lib/libhuh.dylib' },
1923               'to'   => { "" => '/dev/null/badlib'            },
1924           }
1925
1926       Instead, it returns:
1927
1928           {
1929               ""     => 'mv /usr/local/lib/libhuh.dylib  /dev/null/badlib',
1930               'from' => '/usr/local/lib/libhuh.dylib',
1931               'to'   => '/dev/null/badlib',
1932           }
1933
1934       That is, because the 'from' and 'to' subhashes each have only a single
1935       entry, they are each "flattened" to the value of that entry.
1936
1937       This flattening also occurs if a result-hash contains only "private"
1938       keys (i.e. keys starting with underscores). For example:
1939
1940           mv \s* <from> \s* <to>
1941
1942           <rule: from>   <_dir=path>? <_file=filename>
1943           <rule: to>     <_dir=path>? <_file=filename>
1944
1945           <token: path>      [\w/.-]*/
1946           <token: filename>  [\w.-]+
1947
1948       Here, the "from" rule produces a result like this:
1949
1950           from => {
1951                 "" => '/usr/local/bin/perl',
1952               _dir => '/usr/local/bin/',
1953              _file => 'perl',
1954           }
1955
1956       which is automatically stripped of "private" keys, leaving:
1957
1958           from => {
1959                 "" => '/usr/local/bin/perl',
1960           }
1961
1962       which is then automatically flattened to:
1963
1964           from => '/usr/local/bin/perl'
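
       The two steps above (strip "private" keys, then flatten a hash whose
       only remaining key is "") can be modelled as a small recursive
       function. This is a sketch of the observable behaviour, applied after
       the fact to a complete tree; the module itself distills each result as
       the rule returns. The name distill() is invented for the illustration.

```perl
use strict;
use warnings;

sub distill {
    my ($result) = @_;
    return $result unless ref $result eq 'HASH';

    my %public;
    for my $key (keys %{$result}) {
        next if $key =~ /\A_/;                      # drop "private" keys
        $public{$key} = distill($result->{$key});   # distill nested results first
    }

    my @keys = keys %public;
    return $public{""} if @keys == 1 && $keys[0] eq "";   # only the context string is left
    return \%public;
}

my $tree = distill({
    ""   => 'mv /usr/local/bin/perl /dev/null/badlib',
    from => { "" => '/usr/local/bin/perl', _dir => '/usr/local/bin/', _file => 'perl' },
    to   => { "" => '/dev/null/badlib' },
});

print "$tree->{from}\n";   # prints /usr/local/bin/perl
print "$tree->{to}\n";     # prints /dev/null/badlib
```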
1965
1966       List result distillation
1967
1968       A special case of result distillation occurs in a separated list, such
1969       as:
1970
1971           <rule: List>
1972
1973               <[Item]>+ % <[Sep=(,)]>
1974
1975       If this construct matches just a single item, the result hash will
1976       contain a single entry consisting of a nested array with a single
1977       value, like so:
1978
1979           { Item => [ 'data' ] }
1980
1981       Instead of returning this annoyingly nested data structure, you can
1982       tell Regexp::Grammars to flatten it to just the inner data with a
1983       special directive:
1984
1985           <rule: List>
1986
1987               <[Item]>+ % <[Sep=(,)]>
1988
1989               <minimize:>
1990
1991       The "<minimize:>" directive examines the result hash (i.e.  %MATCH). If
1992       that hash contains only a single entry, which is a reference to an
1993       array with a single value, then the directive assigns that single value
1994       directly to $MATCH, so that it will be returned instead of the usual
1995       result hash.
1996
1997       This means that a normal separated list still results in a hash
1998       containing all elements and separators, but a "degenerate" list of only
1999       one item results in just that single item.
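
       The test that "<minimize:>" is described as performing can be modelled
       in plain Perl. This is a sketch under a simplifying assumption: the
       context-string entry (the "" key) is ignored when counting entries,
       which the text above does not spell out. The function name minimize()
       is invented for the illustration.

```perl
use strict;
use warnings;

sub minimize {
    my ($match) = @_;

    # Assumption: the "" (matched text) entry does not count as an entry here.
    my @keys = grep { $_ ne "" } keys %{$match};

    if (@keys == 1) {
        my $only = $match->{ $keys[0] };
        return $only->[0] if ref $only eq 'ARRAY' && @{$only} == 1;
    }
    return $match;   # anything else keeps the usual result hash
}

print minimize({ Item => ['data'] }), "\n";                          # prints data
print ref minimize({ Item => ['a', 'b'], Sep => [','] }), "\n";      # prints HASH
```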
2000
2001       Manual result distillation
2002
2003       Regexp::Grammars also offers full manual control over the distillation
2004       process. If you use the reserved word "MATCH" as the alias for a
2005       subrule call:
2006
2007           <MATCH=filename>
2008
2009       or a subpattern match:
2010
2011           <MATCH=( \w+ )>
2012
2013       or a code block:
2014
2015           <MATCH=(?{ 42 })>
2016
2017       then the current rule will treat the return value of that subrule,
2018       pattern, or code block as its complete result, and return that value
2019       instead of the usual result-hash it constructs. This is the case even
2020       if the result has other entries that would normally also be returned.
2021
2022       For example, in a rule like:
2023
2024           <rule: term>
2025                 <MATCH=literal>
2026               | <left_paren> <MATCH=expr> <right_paren>
2027
2028       The use of "MATCH" aliases causes the rule to return either whatever
2029       "<literal>" returns, or whatever "<expr>" returns (provided it's
2030       between left and right parentheses).
2031
2032       Note that, in this second case, even though "<left_paren>" and
2033       "<right_paren>" are captured to the result-hash, they are not returned,
2034       because the "MATCH" alias overrides the normal "return the result-hash"
2035       semantics and returns only what its associated subrule (i.e. "<expr>")
2036       produces.
2037
2038       Note also that the return value is only assigned if the subrule call
2039       actually matches. For example:
2040
2041           <rule: optional_names>
2042               <[MATCH=name]>*
2043
2044       If the repeated subrule call to "<name>" matches zero times, the return
2045       value of the "optional_names" rule will not be an empty array, because
2046       the "MATCH=" will not have executed at all. Instead, the default return
2047       value (an empty string) will be returned.  If you had specifically
2048       wanted to return an empty array, you could use any of the following:
2049
2050           <rule: optional_names>
2051               <MATCH=(?{ [] })>     # Set up empty array before first match attempt
2052               <[MATCH=name]>*
2053
2054       or:
2055
2056           <rule: optional_names>
2057               <[MATCH=name]>+       # Match one or more times
2058             |                       #          or
2059               <MATCH=(?{ [] })>     # Set up empty array, if no match
2060
2061       Programmatic result distillation
2062
2063       It's also possible to control what a rule returns from within a code
2064       block.  Regexp::Grammars provides a set of reserved variables that give
2065       direct access to the result-hash.
2066
2067       The result-hash itself can be accessed as %MATCH within any code block
2068       inside a rule. For example:
2069
2070           <rule: sum>
2071               <X=product> \+ <Y=product>
2072                   <MATCH=(?{ $MATCH{X} + $MATCH{Y} })>
2073
2074       Here, the rule matches a product (aliased 'X' in the result-hash), then
2075       a literal '+', then another product (aliased to 'Y' in the result-
2076       hash). The rule then executes the code block, which accesses the two
2077       saved values (as $MATCH{X} and $MATCH{Y}), adding them together.
2078       Because the block is itself aliased to "MATCH", the sum produced by the
2079       block becomes the (only) result of the rule.
2080
2081       It is also possible to set the rule result from within a code block
2082       (instead of aliasing it). The special "override" return value is
2083       represented by the special variable $MATCH. So the previous example
2084       could be rewritten:
2085
2086           <rule: sum>
2087               <X=product> \+ <Y=product>
2088                   (?{ $MATCH = $MATCH{X} + $MATCH{Y} })
2089
2090       Both forms are identical in effect. Any assignment to $MATCH overrides
2091       the normal "return all subrule results" behaviour.
2092
2093       Assigning to $MATCH directly is particularly handy if the result may
2094       not always be "distillable", for example:
2095
2096           <rule: sum>
2097               <X=product> \+ <Y=product>
2098                   (?{ if (!ref $MATCH{X} && !ref $MATCH{Y}) {
2099                           # Reduce to sum, if both terms are simple scalars...
2100                           $MATCH = $MATCH{X} + $MATCH{Y};
2101                       }
2102                       else {
2103                           # Return full syntax tree for non-simple case...
2104                           $MATCH{op} = '+';
2105                       }
2106                   })
2107
2108       Note that you can also partially override the subrule return behaviour.
2109       Normally, the subrule returns the complete text it matched as its
2110       context substring (i.e. under the "empty key") in its result-hash. That
2111       is, of course, $MATCH{""}, so you can override just that behaviour by
2112       directly assigning to that entry.
2113
2114       For example, if you have a rule that matches key/value pairs from a
2115       configuration file, you might prefer that any trailing comments not be
2116       included in the "matched text" entry of the rule's result-hash. You
2117       could hide such comments like so:
2118
2119           <rule: config_line>
2120               <key> : <value>  <comment>?
2121                   (?{
2122                       # Edit trailing comments out of "matched text" entry...
2123                       $MATCH{""} = "$MATCH{key} : $MATCH{value}";
2124                   })
2125
2126       Some more examples of the uses of $MATCH:
2127
2128           <rule: FuncDecl>
2129             # Keyword  Name               Return only the name (as a string)...
2130               func     <Identifier> ;     (?{ $MATCH = $MATCH{'Identifier'} })
2131
2132
2133           <rule: NumList>
2134             # Numbers in square brackets...
2135               \[
2136                   ( \d+ (?: , \d+)* )
2137               \]
2138
2139             # Return only the numbers...
2140               (?{ $MATCH = $CAPTURE })
2141
2142
2143           <token: Cmd>
2144             # Match standard variants then standardize the keyword...
2145               (?: mv | move | rename )      (?{ $MATCH = 'mv'; })
2146
2147   Parse-time data processing
2148       Using code blocks in rules, it's often possible to fully process data
2149       as you parse it. For example, the "<sum>" rule shown in the previous
2150       section might be part of a simple calculator, implemented entirely in a
2151       single grammar. Such a calculator might look like this:
2152
2153           my $calculator = do{
2154               use Regexp::Grammars;
2155               qr{
2156                   <Answer>
2157
2158                   <rule: Answer>
2159                       ( <.Mult>+ % <.Op=([+-])> )
2160                           <MATCH= (?{ eval $CAPTURE })>
2161
2162                   <rule: Mult>
2163                       ( <.Pow>+ % <.Op=([*/%])> )
2164                           <MATCH= (?{ eval $CAPTURE })>
2165
2166                   <rule: Pow>
2167                       <X=Term> \^ <Y=Pow>
2168                           <MATCH= (?{ $MATCH{X} ** $MATCH{Y}; })>
2169                     |
2170                           <MATCH=Term>
2171
2172                   <rule: Term>
2173                           <MATCH=Literal>
2174                     | \(  <MATCH=Answer>  \)
2175
2176                   <token: Literal>
2177                           <MATCH= ( [+-]? \d++ (?: \. \d++ )?+ )>
2178               }xms
2179           };
2180
2181           while (my $input = <>) {
2182               if ($input =~ $calculator) {
2183                   say "--> $/{Answer}";
2184               }
2185           }
2186
2187       Because every rule computes a value using the results of the subrules
2188       below it, and aliases that result to its "MATCH", each rule returns a
2189       complete evaluation of the subexpression it matches, passing that back
2190       to higher-level rules, which then do the same.
2191
2192       Hence, the result returned to the very top-level rule (i.e. to
2193       "<Answer>") is the complete evaluation of the entire expression that
2194       was matched. That means that, in the very process of having matched a
2195       valid expression, the calculator has also computed the value of that
2196       expression, which can then simply be printed directly.
2197
2198       It is often possible to have a grammar fully (or sometimes at least
2199       partially) evaluate or transform the data it is parsing, and this
2200       usually leads to very efficient and easy-to-maintain implementations.
2201
2202       The main limitation of this technique is that the data has to be in a
2203       well-structured form, where subsets of the data can be evaluated using
2204       only local information. In cases where the meaning of the data is
2205       distributed through that data non-hierarchically, or relies on global
2206       state, or on external information, it is often better to have the
2207       grammar simply construct a complete syntax tree for the data first, and
2208       then evaluate that syntax tree separately, after parsing is complete.
2209       The following section describes a feature of Regexp::Grammars that can
2210       make this second style of data processing simpler and more
2211       maintainable.
2212
2213   Object-oriented parsing
2214       When a grammar has parsed successfully, the "%/" variable will contain
2215       a series of nested hashes (and possibly arrays) representing the
2216       hierarchical structure of the parsed data.
2217
2218       Typically, the next step is to walk that tree, extracting or converting
2219       or otherwise processing that information. If the tree has nodes of many
2220       different types, it can be difficult to build a recursive subroutine
2221       that can navigate it easily.
2222
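       To see why this gets awkward, here is a minimal sketch of such a
       recursive subroutine working on plain hashes and arrays. The tree is
       hand-built for illustration (a real one would come from "%/"), and the
       names "%tree" and "collect_leaves()" are assumptions, not part of the
       module. Every kind of node has to be recognised by inspecting "ref".

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical result tree, hand-built in the nested hash/array shape
# described above (a real one would come from %/ after a successful match).
my %tree = (
    command => {
        name => 'section',
        arg  => [
            { literal => 'Introduction' },
            { literal => 'Overview'     },
        ],
    },
);

# Recursively collect a "path = value" string for every scalar leaf.
sub collect_leaves {
    my ($node, $path) = @_;

    # Hash node: descend into each key, in sorted order
    if (ref $node eq 'HASH') {
        return map { collect_leaves($node->{$_}, "$path/$_") }
               sort keys %{$node};
    }

    # Array node: descend into each element, recording its index
    if (ref $node eq 'ARRAY') {
        return map { collect_leaves($node->[$_], "$path\[$_]") }
               0 .. $#{$node};
    }

    # Anything else is treated as a plain scalar leaf
    return "$path = $node";
}

say for collect_leaves(\%tree, q{});
```

       Any behaviour that depends on what a node means (rather than on
       whether it is a hash or an array) would need further tests on key
       names inside this one subroutine.
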
2223       A much cleaner solution is possible if the nodes of the tree are proper
2224       objects.  In that case, you just define a "process()" or "traverse()"
2225       method for each of the classes, and have every node call that method on
2226       each of its children. For example, if the parser were to return a tree
2227       of nodes representing the contents of a LaTeX file, then you could
2228       define the following methods:
2229
2230           sub Latex::file::explain
2231           {
2232               my ($self, $level) = @_;
2233               for my $element (@{$self->{element}}) {
2234                   $element->explain($level);
2235               }
2236           }
2237
2238           sub Latex::element::explain {
2239               my ($self, $level) = @_;
2240               (  $self->{command} || $self->{literal})->explain($level)
2241           }
2242
2243           sub Latex::command::explain {
2244               my ($self, $level) = @_;
2245               say "\t"x$level, "Command:";
2246               say "\t"x($level+1), "Name: $self->{name}";
2247               if ($self->{options}) {
2248                   say "\t"x$level, "\tOptions:";
2249                   $self->{options}->explain($level+2)
2250               }
2251
2252               for my $arg (@{$self->{arg}}) {
2253                   say "\t"x$level, "\tArg:";
2254                   $arg->explain($level+2)
2255               }
2256           }
2257
2258           sub Latex::options::explain {
2259               my ($self, $level) = @_;
2260               $_->explain($level) foreach @{$self->{option}};
2261           }
2262
2263           sub Latex::literal::explain {
2264               my ($self, $level, $label) = @_;
2265               $label //= 'Literal';
2266               say "\t"x$level, "$label: ", $self->{q{}};
2267           }
2268
2269       and then simply write:
2270
2271           if ($text =~ $LaTeX_parser) {
2272               $/{LaTeX_file}->explain(0);
2273           }
2274
2275       and the chain of "explain()" calls would cascade down the nodes of the
2276       tree, each one invoking the appropriate "explain()" method according to
2277       the type of node encountered.
2278
2279       The only problem is that, by default, Regexp::Grammars returns a tree
2280       of plain-old hashes, not LaTeX::Whatever objects. Fortunately, it's
2281       easy to request that the result hashes be automatically blessed into
2282       the appropriate classes, using the "<objrule:...>" and "<objtoken:...>"
2283       directives.
2284
2285       These directives are identical to the "<rule:...>" and "<token:...>"
2286       directives (respectively), except that the rule or token they create
2287       will also convert the hash it normally returns into an object of a
2288       specified class. This conversion is done by passing the result hash to
2289       the class's constructor:
2290
2291           $class->new(\%result_hash)
2292
2293       if the class has a constructor method named "new()", or else (if the
2294       class doesn't provide a constructor) by directly blessing the result
2295       hash:
2296
2297           bless \%result_hash, $class
2298
2299       Note that, even if the object is constructed via its own constructor, the
2300       module still expects the new object to be hash-based, and will fail if
2301       the object is anything but a blessed hash. The module issues an error
2302       in this case.
2303
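       The conversion rule just described can be sketched in plain Perl as
       follows. This is only an illustration of the documented behaviour, not
       the module's own code; "objectify()", "Demo::WithNew" and
       "Demo::NoNew" are hypothetical names.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed reftype);

# Hypothetical class that provides its own constructor
package Demo::WithNew;
sub new {
    my ($class, $result_hash) = @_;
    my %copy = (%{$result_hash}, built_by => 'new');
    return bless \%copy, $class;
}

# Hypothetical class with no constructor at all
package Demo::NoNew;
sub name { return $_[0]{name} }

package main;

# Sketch of the conversion: use new() if the class has one, else bless
sub objectify {
    my ($class, $result_hash) = @_;

    my $object = $class->can('new')
               ? $class->new($result_hash)
               : bless($result_hash, $class);

    # The result is expected to be a blessed hash in either case
    die "$class object is not a blessed hash\n"
        if !blessed($object) || reftype($object) ne 'HASH';

    return $object;
}

my $via_new   = objectify('Demo::WithNew', { name => 'section' });
my $via_bless = objectify('Demo::NoNew',   { name => 'title'   });

print ref($via_new),   " was built by its constructor\n";
print ref($via_bless), " was blessed directly, name is ", $via_bless->name, "\n";
```
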
2304       The generic syntax for these types of rules and tokens is:
2305
2306           <objrule:  CLASS::NAME = RULENAME  >
2307           <objtoken: CLASS::NAME = TOKENNAME >
2308
2309       For example:
2310
2311           <objrule: LaTeX::Element=component>
2312               # ...Defines a rule that can be called as <component>
2313               # ...and which returns a hash-based LaTeX::Element object
2314
2315           <objtoken: LaTeX::Literal=atom>
2316               # ...Defines a token that can be called as <atom>
2317               # ...and which returns a hash-based LaTeX::Literal object
2318
2319       Note that, just as in aliased subrule calls, the name by which
2320       something is referred to outside the grammar (in this case, the class
2321       name) comes before the "=", whereas the name that it is referred to
2322       inside the grammar comes after the "=".
2323
2324       You can freely mix object-returning and plain-old-hash-returning rules
2325       and tokens within a single grammar, though you have to be careful not
2326       to subsequently try to call a method on any of the unblessed nodes.
2327
2328       An important caveat regarding OO rules
2329
2330       Prior to Perl 5.14.0, Perl's regex engine was not fully re-entrant.
2331       This means that in older versions of Perl, it is not possible to re-
2332       invoke the regex engine when already inside the regex engine.
2333
2334       This means that you need to be careful that the "new()" constructors
2335       that are called by your object-rules do not themselves use regexes in
2336       any way, unless you're running under Perl 5.14 or later (in which case
2337       you can ignore what follows).
2338
2339       The two ways this is most likely to happen are:
2340
2341       1.  If you're using a class built on Moose, where one or more of the
2342           "has" uses a type constraint (such as 'Int') that is implemented
2343           via regex matching. For example:
2344
2345               has 'id' => (is => 'rw', isa => 'Int');
2346
2347           The workaround (for pre-5.14 Perls) is to replace the type
2348           constraint with one that doesn't use a regex. For example:
2349
2350               has 'id' => (is => 'rw', isa => 'Num');
2351
2352           Alternatively, you could define your own type constraint that
2353           avoids regexes:
2354
2355               use Moose::Util::TypeConstraints;
2356
2357               subtype 'Non::Regex::Int',
2358                    as 'Num',
2359                 where { int($_) == $_ };
2360
2361               no Moose::Util::TypeConstraints;
2362
2363               # and later...
2364
2365               has 'id' => (is => 'rw', isa => 'Non::Regex::Int');
2366
2367       2.  If your class uses an "AUTOLOAD()" method to implement its
2368           constructor and that method uses the typical:
2369
2370               $AUTOLOAD =~ s/.*://;
2371
2372           technique. The workaround here is to achieve the same effect
2373           without a regex. For example:
2374
2375               my $last_colon_pos = rindex($AUTOLOAD, ':');
2376               substr $AUTOLOAD, 0, $last_colon_pos+1, q{};
2377
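       For context, a complete regex-free "AUTOLOAD()" constructor built on
       that workaround might look like the following sketch. The class
       "Demo::Node" and its behaviour are assumptions made purely for
       illustration.

```perl
use strict;
use warnings;

package Demo::Node;    # hypothetical class, for illustration only

our $AUTOLOAD;

sub AUTOLOAD {
    my ($invocant, @args) = @_;

    # Strip the package qualifier without invoking the regex engine
    my $method = $AUTOLOAD;
    my $last_colon_pos = rindex($method, ':');
    substr $method, 0, $last_colon_pos + 1, q{};

    # Object destruction also arrives here; there is nothing to clean up
    return if $method eq 'DESTROY';

    # Constructor: copy the supplied hash and bless the copy
    if ($method eq 'new') {
        my %fields = %{ $args[0] || {} };
        return bless \%fields, $invocant;
    }

    die "No method '$method' in class " . (ref($invocant) || $invocant) . "\n";
}

package main;

my $node = Demo::Node->new({ id => 42 });
print ref($node), " id=$node->{id}\n";
```
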
2378       Note that this caveat against using nested regexes also applies to any
2379       code blocks executed inside a rule or token (whether or not those rules
2380       or tokens are object-oriented).
2381
2382       A naming shortcut
2383
2384       If an "<objrule:...>" or "<objtoken:...>" is defined with a class name
2385       that is not followed by "=" and a rule name, then the rule name is
2386       determined automatically from the classname.  Specifically, the final
2387       component of the classname (i.e. after the last "::", if any) is used.
2388
2389       For example:
2390
2391           <objrule: LaTeX::Element>
2392               # ...Defines a rule that can be called as <Element>
2393               # ...and which returns a hash-based LaTeX::Element object
2394
2395           <objtoken: LaTeX::Literal>
2396               # ...Defines a token that can be called as <Literal>
2397               # ...and which returns a hash-based LaTeX::Literal object
2398
2399           <objtoken: Comment>
2400               # ...Defines a token that can be called as <Comment>
2401               # ...and which returns a hash-based Comment object
2402

Debugging

2404       Regexp::Grammars provides a number of features specifically designed to
2405       help debug both grammars and the data they parse.
2406
2407       All debugging messages are written to a log file (which, by default, is
2408       just STDERR). However, you can specify a disk file explicitly by
2409       placing a "<logfile:...>" directive at the start of your grammar:
2410
2411           $grammar = qr{
2412
2413               <logfile: LaTeX_parser_log >
2414
2415               \A <LaTeX_file> \Z    # Pattern to match
2416
2417               <rule: LaTeX_file>
2418                   # etc.
2419           }x;
2420
2421       You can also explicitly specify that messages go to the terminal:
2422
2423               <logfile: - >
2424
2425   Debugging grammar creation with "<logfile:...>"
2426       Whenever a log file has been directly specified, Regexp::Grammars
2427       automatically does verbose static analysis of your grammar.  That is,
2428       whenever it compiles a grammar containing an explicit "<logfile:...>"
2429       directive it logs a series of messages explaining how it has
2430       interpreted the various components of that grammar. For example, the
2431       following grammar:
2432
2433           <logfile: parser_log >
2434
2435           <cmd>
2436
2437           <rule: cmd>
2438               mv <from=file> <to=file>
2439             | cp <source> <[file]>  <.comment>?
2440
2441       would produce the following analysis in the 'parser_log' file:
2442
2443           info | Processing the main regex before any rule definitions
2444                |    |
2445                |    |...Treating <cmd> as:
2446                |    |      |  match the subrule <cmd>
2447                |    |       \ saving the match in $MATCH{'cmd'}
2448                |    |
2449                |     \___End of main regex
2450                |
2451           info | Defining a rule: <cmd>
2452                |    |...Returns: a hash
2453                |    |
2454                |    |...Treating ' mv ' as:
2455                |    |       \ normal Perl regex syntax
2456                |    |
2457                |    |...Treating <from=file> as:
2458                |    |      |  match the subrule <file>
2459                |    |       \ saving the match in $MATCH{'from'}
2460                |    |
2461                |    |...Treating <to=file> as:
2462                |    |      |  match the subrule <file>
2463                |    |       \ saving the match in $MATCH{'to'}
2464                |    |
2465                |    |...Treating ' | cp ' as:
2466                |    |       \ normal Perl regex syntax
2467                |    |
2468                |    |...Treating <source> as:
2469                |    |      |  match the subrule <source>
2470                |    |       \ saving the match in $MATCH{'source'}
2471                |    |
2472                |    |...Treating <[file]> as:
2473                |    |      |  match the subrule <file>
2474                |    |       \ appending the match to $MATCH{'file'}
2475                |    |
2476                |    |...Treating <.comment>? as:
2477                |    |      |  match the subrule <comment> if possible
2478                |    |       \ but don't save anything
2479                |    |
2480                |     \___End of rule definition
2481
2482       This kind of static analysis is a useful starting point in debugging a
2483       miscreant grammar, because it enables you to see what you actually
2484       specified (as opposed to what you thought you'd specified).
2485
2486   Debugging grammar execution with "<debug:...>"
2487       Regexp::Grammars also provides a simple interactive debugger, with
2488       which you can observe the process of parsing and the data being
2489       collected in any result-hash.
2490
2491       To initiate debugging, place a "<debug:...>" directive anywhere in your
2492       grammar. When parsing reaches that directive the debugger will be
2493       activated, and the command specified in the directive immediately
2494       executed. The available commands are:
2495
2496           <debug: on>    - Enable debugging, stop when a rule matches
2497           <debug: match> - Enable debugging, stop when a rule matches
2498           <debug: try>   - Enable debugging, stop when a rule is tried
2499           <debug: run>   - Enable debugging, run until the match completes
2500           <debug: same>  - Continue debugging (or not) as currently
2501           <debug: off>   - Disable debugging and continue parsing silently
2502
2503           <debug: continue> - Synonym for <debug: run>
2504           <debug: step>     - Synonym for <debug: try>
2505
2506       These directives can be placed anywhere within a grammar and take
2507       effect when that point is reached in the parsing. Hence, adding a
2508       "<debug:step>" directive is very much like setting a breakpoint at that
2509       point in the grammar. Indeed, a common debugging strategy is to turn
2510       debugging on and off only around a suspect part of the grammar:
2511
2512           <rule: tricky>   # This is where we think the problem is...
2513               <debug:step>
2514               <preamble> <text> <postscript>
2515               <debug:off>
2516
2517       Once the debugger is active, it steps through the parse, reporting
2518       rules that are tried, matches and failures, backtracking and restarts,
2519       and the parser's location within both the grammar and the text being
2520       matched. That report looks like this:
2521
2522           ===============> Trying <grammar> from position 0
2523           > cp file1 file2 |...Trying <cmd>
2524                            |   |...Trying <cmd=(cp)>
2525                            |   |    \FAIL <cmd=(cp)>
2526                            |    \FAIL <cmd>
2527                             \FAIL <grammar>
2528           ===============> Trying <grammar> from position 1
2529            cp file1 file2  |...Trying <cmd>
2530                            |   |...Trying <cmd=(cp)>
2531            file1 file2     |   |    \_____<cmd=(cp)> matched 'cp'
2532           file1 file2      |   |...Trying <[file]>+
2533            file2           |   |    \_____<[file]>+ matched 'file1'
2534                            |   |...Trying <[file]>+
2535           [eos]            |   |    \_____<[file]>+ matched ' file2'
2536                            |   |...Trying <[file]>+
2537                            |   |    \FAIL <[file]>+
2538                            |   |...Trying <target>
2539                            |   |   |...Trying <file>
2540                            |   |   |    \FAIL <file>
2541                            |   |    \FAIL <target>
2542            <~~~~~~~~~~~~~~ |   |...Backtracking 5 chars and trying new match
2543           file2            |   |...Trying <target>
2544                            |   |   |...Trying <file>
2545                            |   |   |    \____ <file> matched 'file2'
2546           [eos]            |   |    \_____<target> matched 'file2'
2547                            |    \_____<cmd> matched ' cp file1 file2'
2548                             \_____<grammar> matched ' cp file1 file2'
2549
2550       The first column indicates the point in the input at which the parser
2551       is trying to match, as well as any backtracking or forward searching it
2552       may need to do. The remainder of the columns track the parser's
2553       hierarchical traversal of the grammar, indicating which rules are
2554       tried, which succeed, and what they match.
2555
2556       Provided the logfile is a terminal (as it is by default), the debugger
2557       also pauses at various points in the parsing process--before trying a
2558       rule, after a rule succeeds, or at the end of the parse--according to
2559       the most recent command issued. When it pauses, you can issue a new
2560       command by entering a single letter:
2561
2562           m       - to continue until the next subrule matches
2563           t or s  - to continue until the next subrule is tried
2564           r or c  - to continue to the end of the grammar
2565           o       - to switch off debugging
2566
2567       Note that these are the first letters of the corresponding
2568       "<debug:...>" commands, listed earlier. Just hitting ENTER while the
2569       debugger is paused repeats the previous command.
2570
2571       While the debugger is paused you can also type a 'd', which will
2572       display the result-hash for the current rule. This can be useful for
2573       detecting which rule isn't returning the data you expected.
2574
2575       Resizing the context string
2576
2577       By default, the first column of the debugger output (which shows the
2578       current matching position within the string) is limited to a width of
2579       20 columns.
2580
2581       However, you can change that limit by calling the
2582       "Regexp::Grammars::set_context_width()" subroutine. You have to specify
2583       the fully qualified name, however, as Regexp::Grammars does not export
2584       this (or any other) subroutine.
2585
2586       "set_context_width()" expects a single argument: a positive integer
2587       indicating the maximal allowable width for the context column. It
2588       issues a warning if an invalid value is passed, and ignores it.
2589
2590       If called in a void context, "set_context_width()" changes the context
2591       width permanently throughout your application. If called in a scalar or
2592       list context, "set_context_width()" returns an object whose destructor
2593       will cause the context width to revert to its previous value. This
2594       means you can temporarily change the context width within a given block
2595       with something like:
2596
2597           {
2598               my $temporary = Regexp::Grammars::set_context_width(50);
2599
2600               if ($text =~ $parser) {
2601                   do_stuff_with( %/ );
2602               }
2603
2604           } # <--- context width automagically reverts at this point
2605
2606       and the context width will change back to its previous value when
2607       $temporary goes out of scope at the end of the block.
2608
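       The void-versus-non-void behaviour described above is a common Perl
       idiom. The following plain-Perl sketch shows one way such a subroutine
       could be written; it is not the module's actual implementation, and
       "set_width()", "$context_width" and "Demo::RestoreGuard" are
       hypothetical names.

```perl
use strict;
use warnings;

my $context_width = 20;    # stand-in for the module's internal setting

package Demo::RestoreGuard;    # hypothetical helper class
sub new     { my ($class, $restore) = @_; return bless { restore => $restore }, $class }
sub DESTROY { my ($self) = @_; $self->{restore}->(); return }

package main;

sub set_width {
    my ($new_width) = @_;

    # Warn about, and ignore, anything that is not a positive integer
    if (!defined $new_width || $new_width !~ /\A[1-9][0-9]*\z/) {
        warn "Ignoring invalid width\n";
        return;
    }

    my $old_width = $context_width;
    $context_width = $new_width;

    # Void context: the change is permanent
    return if !defined wantarray;

    # Otherwise: return an object whose destructor restores the old value
    return Demo::RestoreGuard->new(sub { $context_width = $old_width });
}

set_width(30);                        # permanent change
{
    my $temporary = set_width(50);    # temporary change
    print "inside block: $context_width\n";
}
print "after block: $context_width\n";
```
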
2609   User-defined logging with "<log:...>"
2610       Both static and interactive debugging send a series of predefined log
2611       messages to whatever log file you have specified. It is also possible
2612       to send additional, user-defined messages to the log, using the
2613       "<log:...>" directive.
2614
2615       This directive expects either simple text or a code block as its
2616       single argument. If the argument is a code block, that code is expected
2617       to return the text of the message; if the argument is anything else,
2618       that something else is the literal message. For example:
2619
2620           <rule: ListElem>
2621
2622               <Elem=   ( [a-z]\d+) >
2623                   <log: Checking for a suffix, too...>
2624
2625               <Suffix= ( : \d+   ) >?
2626                   <log: (?{ "ListElem: $MATCH{Elem} and $MATCH{Suffix}" })>
2627
2628       User-defined log messages implemented using a codeblock can also
2629       specify a severity level. If the codeblock of a "<log:...>" directive
2630       returns two or more values, the first is treated as a log message
2631       severity indicator, and the remaining values as separate lines of text
2632       to be logged. For example:
2633
2634           <rule: ListElem>
2635               <Elem=   ( [a-z]\d+) >
2636               <Suffix= ( : \d+   ) >?
2637
2638                   <log: (?{
2639                       warn => "Elem was: $MATCH{Elem}",
2640                               "Suffix was $MATCH{Suffix}",
2641                   })>
2642
2643       When they are encountered, user-defined log messages are interspersed
2644       between any automatic log messages (i.e. from the debugger), at the
2645       correct level of nesting for the current rule.
2646
2647   Debugging non-grammars
2648       [Note that, with the release in 2012 of the Regexp::Debugger module (on
2649       CPAN) the techniques described below are unnecessary. If you need to
2650       debug plain Perl regexes, use Regexp::Debugger instead.]
2651
2652       It is possible to use Regexp::Grammars without creating any subrule
2653       definitions, simply to debug a recalcitrant regex. For example, if the
2654       following regex wasn't working as expected:
2655
2656           my $balanced_brackets = qr{
2657               \(             # left delim
2658               (?:
2659                   \\         # escape or
2660               |   (?R)       # recurse or
2661               |   .          # whatever
2662               )*
2663               \)             # right delim
2664           }xms;
2665
2666       you could instrument it with aliased subpatterns and then debug it
2667       step-by-step, using Regexp::Grammars:
2668
2669           use Regexp::Grammars;
2670
2671           my $balanced_brackets = qr{
2672               <debug:step>
2673
2674               <.left_delim=  (  \(  )>
2675               (?:
2676                   <.escape=  (  \\  )>
2677               |   <.recurse= ( (?R) )>
2678               |   <.whatever=(  .   )>
2679               )*
2680               <.right_delim= (  \)  )>
2681           }xms;
2682
2683           while (<>) {
2684               say 'matched' if /$balanced_brackets/;
2685           }
2686
2687       Note the use of amnesiac aliased subpatterns to avoid needlessly
2688       building a result-hash. Alternatively, you could use listifying aliases
2689       to preserve the matching structure as an additional debugging aid:
2690
2691           use Regexp::Grammars;
2692
2693           my $balanced_brackets = qr{
2694               <debug:step>
2695
2696               <[left_delim=  (  \(  )]>
2697               (?:
2698                   <[escape=  (  \\  )]>
2699               |   <[recurse= ( (?R) )]>
2700               |   <[whatever=(  .   )]>
2701               )*
2702               <[right_delim= (  \)  )]>
2703           }xms;
2704
2705           if ( '(a(bc)d)' =~ /$balanced_brackets/) {
2706               use Data::Dumper 'Dumper';
2707               warn Dumper \%/;
2708           }
2709

Handling errors when parsing

2711       Assuming you have correctly debugged your grammar, the next source of
2712       problems will probably be invalid input (especially if that input is
2713       being provided interactively). So Regexp::Grammars also provides some
2714       support for detecting when a parse is likely to fail...and informing
2715       the user why.
2716
2717   Requirements
2718       The "<require:...>" directive is useful for testing conditions that
2719       it's not easy (or even possible) to check within the syntax of the
2720       regex itself. For example:
2721
2722           <rule: IPV4_Octet_Decimal>
2723               # Up to three digits...
2724               <MATCH= ( \d{1,3}+ )>
2725
2726               # ...but less than 256...
2727               <require: (?{ $MATCH <= 255 })>
2728
2729       A require expects a regex codeblock as its argument and succeeds if the
2730       final value of that codeblock is true. If the final value is false, the
2731       directive fails and the rule starts backtracking.
2732
2733       Note that, in this example, the digits are matched with " \d{1,3}+ ".
2734       The trailing "+" prevents the "{1,3}" repetition from backtracking to a
2735       smaller number of digits if the "<require:...>" fails.
2736
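       The effect of that trailing "+" can be observed without the module at
       all. The sketch below is a plain-Perl analogue, not a use of
       "<require:...>": it emulates the requirement with a conditional code
       block that forces a failure when the digits just captured exceed 255.
       The variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# Plain-Perl analogue of the rule: capture digits, then fail via (?!)
# if the group that has just closed ($^N) holds a value above 255.
my $backtracking = qr/ \A ( \d{1,3}  ) (?(?{ $^N > 255 })(?!)) /x;
my $possessive   = qr/ \A ( \d{1,3}+ ) (?(?{ $^N > 255 })(?!)) /x;

for my $input ('255', '256') {
    # In list context a match returns its captures (or an empty list)
    my ($loose) = $input =~ $backtracking;
    my ($tight) = $input =~ $possessive;

    printf "%s: backtracking=%s possessive=%s\n",
        $input,
        defined $loose ? $loose : 'no match',
        defined $tight ? $tight : 'no match';
}
```

       Without the possessive "+", the input '256' still matches, because the
       repetition backs off to '25', which satisfies the test. With it, the
       same input fails outright, which is the behaviour wanted for an octet.
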
2737   Handling failure
2738       The module has limited support for error reporting from within a
2739       grammar, in the form of the "<error:...>" and "<warning:...>"
2740       directives and their shortcuts: "<...>", "<!!!>", and "<???>".
2741
2742       Error messages
2743
2744       The "<error: MSG>" directive queues a conditional error message within
2745       "@!" and then fails to match (that is, it is equivalent to a "(?!)"
2746       when matching). For example:
2747
2748           <rule: ListElem>
2749               <SerialNumber>
2750             | <ClientName>
2751             | <error: (?{ $errcount++ . ': Missing list element' })>
2752
2753       So a common code pattern when using grammars that do this kind of error
2754       detection is:
2755
2756           if ($text =~ $grammar) {
2757               # Do something with the data collected in %/
2758           }
2759           else {
2760               say {*STDERR} $_ for @!;   # i.e. report all errors
2761           }
2762
2763       Each error message is conditional in the sense that, if any surrounding
2764       rule subsequently matches, the message is automatically removed from
2765       "@!". This implies that you can queue up as many error messages as you
2766       wish, but they will only remain in "@!" if the match ultimately fails.
2767       Moreover, only those error messages originating from rules that
2768       actually contributed to the eventual failure-to-match will remain in
2769       "@!".
2770
2771       If a code block is specified as the argument, the error message is
2772       whatever final value is produced when the block is executed. Note that
2773       this final value does not have to be a string (though it does have to
2774       be a scalar).
2775
2776           <rule: ListElem>
2777               <SerialNumber>
2778             | <ClientName>
2779             | <error: (?{
2780                   # Return a hash, with the error information...
2781                   { errnum => $errcount++, msg => 'Missing list element' }
2782               })>
2783
2784       If anything else is specified as the argument, it is treated as a
2785       literal error string (and may not contain an unbalanced '<' or '>', nor
2786       any interpolated variables).
2787
2788       However, if the literal error string begins with "Expected " or
2789       "Expecting ", then the error string automatically has the following
2790       "context suffix" appended:
2791
2792           , but found '$CONTEXT' instead
2793
2794       For example:
2795
2796           qr{ <Arithmetic_Expression>                # ...Match arithmetic expression
2797             |                                        # Or else
2798               <error: Expected a valid expression>   # ...Report error, and fail
2799
2800               # Rule definitions here...
2801           }xms;
2802
2803       On an invalid input this example might produce an error message like:
2804
2805           "Expected a valid expression, but found '(2+3]*7/' instead"
2806
2807       The value of the special $CONTEXT variable is found by looking ahead in
2808       the string being matched against, to locate the next sequence of non-
2809       blank characters after the current parsing position. This variable may
2810       also be explicitly used within the "<error: (?{...})>" form of the
2811       directive.
2812
2813       As a special case, if you omit the message entirely from the directive,
2814       it is supplied automatically, derived from the name of the current
2815       rule.  For example, if the following rule were to fail to match:
2816
2817           <rule: Arithmetic_expression>
2818                 <Multiplicative_Expression>+ % ([+-])
2819               | <error:>
2820
2821       the error message queued would be:
2822
2823           "Expected arithmetic expression, but found 'one plus two' instead"
2824
2825       Note, however, that it is still essential to include the colon in the
2826       directive. A common mistake is to write:
2827
2828           <rule: Arithmetic_expression>
2829                 <Multiplicative_Expression>+ % ([+-])
2830               | <error>
2831
2832       which merely attempts to call "<rule: error>" if the first alternative
2833       fails.
2834
2835       Warning messages
2836
2837       Sometimes, you want to detect problems, but not invalidate the entire
2838       parse as a result. For those occasions, the module provides a "less
2839       stringent" form of error reporting: the "<warning:...>" directive.
2840
2841       This directive is exactly the same as an "<error:...>" in every respect
2842       except that it does not induce a failure to match at the point it
2843       appears.
2844
2845       The directive is, therefore, useful for reporting non-fatal problems in
2846       a parse. For example:
2847
2848           qr{ \A            # ...Match only at start of input
2849               <ArithExpr>   # ...Match a valid arithmetic expression
2850
2851               (?:
2852                   # Should be at end of input...
2853                   \s* \Z
2854                 |
2855                   # If not, report the fact but don't fail...
2856                   <warning: Expected end-of-input>
2857                   <warning: (?{ "Extra junk at index $INDEX: $CONTEXT" })>
2858               )
2859
2860               # Rule definitions here...
2861           }xms;
2862
2863       Note that, because they do not induce failure, two or more
2864       "<warning:...>" directives can be "stacked" in sequence, as in the
2865       previous example.
2866
2867       Stubbing
2868
2869       The module also provides three useful shortcuts, specifically to make
2870       it easy to declare, but not define, rules and tokens.
2871
2872       The "<...>" and "<!!!>" directives are equivalent to the directive:
2873
2874           <error: Cannot match RULENAME (not implemented)>
2875
2876       The "<???>" is equivalent to the directive:
2877
2878           <warning: Cannot match RULENAME (not implemented)>
2879
2880       For example, in the following grammar:
2881
2882           <grammar: List::Generic>
2883
2884           <rule: List>
2885               <[Item]>+ % (\s*,\s*)
2886
2887           <rule: Item>
2888               <...>
2889
2890       the "Item" rule is declared but not defined. That means the grammar
2891       will compile correctly (the "List" rule won't complain about a call to
2892       a non-existent "Item"), but if the "Item" rule isn't overridden in some
2893       derived grammar, a match-time error will occur when "List" tries to
2894       match the "<...>" within "Item".
2895
2896       Localizing the (semi-)automatic error messages
2897
2898       Error directives of any of the following forms:
2899
2900           <error: Expecting identifier>
2901
2902           <error: >
2903
2904           <...>
2905
2906           <!!!>
2907
2908       or their warning equivalents:
2909
2910           <warning: Expecting identifier>
2911
2912           <warning: >
2913
2914           <???>
2915
2916       each autogenerate part or all of the actual error message they produce.
2917       By default, that autogenerated message is always produced in English.
2918
2919       However, the module provides a mechanism by which you can intercept
2920       every error or warning that is queued to "@!"  via these
2921       directives...and localize those messages.
2922
2923       To do this, you call "Regexp::Grammars::set_error_translator()" (with
2924       the full qualification, since Regexp::Grammars does not export it...nor
2925       anything else, for that matter).
2926
2927       The "set_error_translator()" subroutine expects a single argument,
2928       which must be a reference to another subroutine.  This subroutine is
2929       then called whenever an error or warning message is queued to "@!".
2930
2931       The subroutine is passed three arguments:
2932
2933       ·   the message string,
2934
2935       ·   the name of the rule from which the error or warning was queued,
2936           and
2937
2938       ·   the value of $CONTEXT when the error or warning was encountered
2939
2940       The subroutine is expected to return the final version of the message
2941       that is actually to be appended to "@!". To accomplish this it may make
2942       use of one of the many internationalization/localization modules
2943       available in Perl, or it may do the conversion entirely by itself.
2944
2945       The first argument is always exactly what appeared as a message in the
2946       original directive (regardless of whether that message is supposed to
2947       trigger autogeneration, or is just a "regular" error message).  That
2948       is:
2949
2950           Directive                         1st argument
2951
2952           <error: Expecting identifier>     "Expecting identifier"
2953           <warning: That's not a moon!>     "That's not a moon!"
2954           <error: >                         ""
2955           <warning: >                       ""
2956           <...>                             ""
2957           <!!!>                             ""
2958           <???>                             ""
2959
2960       The second argument always contains the name of the rule in which the
2961       directive was encountered. For example, when invoked from within
2962       "<rule: Frinstance>" the following directives produce:
2963
2964           Directive                         2nd argument
2965
2966           <error: Expecting identifier>     "Frinstance"
2967           <warning: That's not a moon!>     "Frinstance"
2968           <error: >                         "Frinstance"
2969           <warning: >                       "Frinstance"
2970           <...>                             "-Frinstance"
2971           <!!!>                             "-Frinstance"
2972           <???>                             "-Frinstance"
2973
2974       Note that the "unimplemented" markers pass the rule name with a
2975       preceding '-'. This allows your translator to distinguish between
2976       "empty" messages (which should then be generated automatically) and the
2977       "unimplemented" markers (which should report that the rule is not yet
2978       properly defined).
2979
2980       If you call "Regexp::Grammars::set_error_translator()" in a void
2981       context, the error translator is permanently replaced (at least, until
2982       the next call to "set_error_translator()").
2983
2984       However, if you call "Regexp::Grammars::set_error_translator()" in a
2985       scalar or list context, it returns an object whose destructor will
2986       restore the previous translator. This allows you to install a
2987       translator only within a given scope, like so:
2988
2989           {
2990               my $temporary
2991                   = Regexp::Grammars::set_error_translator(\&my_translator);
2992
2993               if ($text =~ $parser) {
2994                   do_stuff_with( %/ );
2995               }
2996               else {
2997                   report_errors_in( @! );
2998               }
2999
3000           } # <--- error translator automagically reverts at this point
3001
3002       Warning: any error translation subroutine you install will be called
3003       during the grammar's parsing phase (i.e. as the grammar's regex is
3004       matching). You should therefore ensure that your translator does not
3005       itself use regular expressions, as nested evaluations of regexes inside
3006       other regexes are extremely problematical (i.e. almost always
3007       disastrous) in Perl.
3008
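The argument conventions above can be sketched as a small translator. This is only an illustration: the %SPANISH_FOR table, the Spanish wording, and the fallback behaviour are assumptions made for the example, not part of the module. The subroutine deliberately uses only "eq", "substr", and hash lookups, so that no regex is run while the grammar is matching.

```perl
use strict;
use warnings;

# Hypothetical phrase table for this sketch: English message => Spanish message
my %SPANISH_FOR = (
    'Expecting identifier' => 'Se esperaba un identificador',
);

# Receives the message, the rule name, and the value of $CONTEXT;
# returns the text that should finally be appended to @!
sub my_translator {
    my ($message, $rule_name, $context) = @_;
    $context = '' if !defined $context;

    # A leading '-' on the rule name marks an "unimplemented" marker
    if (substr($rule_name, 0, 1) eq '-') {
        my $rule = substr($rule_name, 1);
        return "No se puede reconocer $rule (sin implementar)";
    }

    # An empty message asks the translator to generate one itself
    if ($message eq '') {
        return "Se esperaba $rule_name cerca de '$context'";
    }

    # Translate known messages; pass unknown ones through unchanged
    return exists $SPANISH_FOR{$message} ? $SPANISH_FOR{$message} : $message;
}

print my_translator('Expecting identifier', 'Frinstance', 'xyz'), "\n";
print my_translator('',                     'Frinstance', 'xyz'), "\n";
print my_translator('',                     '-Frinstance', ''),   "\n";
```

Such a subroutine would then be installed with "Regexp::Grammars::set_error_translator(\&my_translator)", as described above.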
3009   Restricting how long a parse runs
3010       Like the core Perl 5 regex engine on which they are built, the grammars
3011       implemented by Regexp::Grammars are essentially top-down parsers. This
3012       means that they may occasionally require an exponentially long time to
3013       parse a particular input. This usually occurs if a particular grammar
3014       includes a lot of recursion or nested backtracking, especially if the
3015       grammar is then matched against a long string.
3016
3017       The judicious use of non-backtracking repetitions (i.e. "x*+" and
3018       "x++") can significantly improve parsing performance in many such
3019       cases. Likewise, carefully reordering any high-level alternatives (so
3020       as to test simple common cases first) can substantially reduce parsing
3021       times.
3022
3023       However, some languages are just intrinsically slow to parse using top-
3024       down techniques (or, at least, may have slow-to-parse corner cases).
3025
3026       To help cope with this constraint, Regexp::Grammars provides a
3027       mechanism by which you can limit the total effort that a given grammar
3028       will expend in attempting to match. The "<timeout:...>" directive
3029       allows you to specify how long a grammar is allowed to continue trying
3030       to match before giving up. It expects a single argument, which must be
3031       an unsigned integer, and it treats this integer as the number of
3032       seconds to continue attempting to match.
3033
3034       For example:
3035
3036           <timeout: 10>    # Give up after 10 seconds
3037
3038       indicates that the grammar should keep attempting to match for another
3039       10 seconds from the point where the directive is encountered during a
3040       parse. If the complete grammar has not matched in that time, the entire
3041       match is considered to have failed, the matching process is immediately
3042       terminated, and a standard error message ('Internal error: Timed out
3043       after 10 seconds (as requested)') is returned in "@!".
3044
3045       A "<timeout:...>" directive can be placed anywhere in a grammar, but is
3046       most usually placed at the very start, so that the entire grammar is
3047       governed by the specified time limit. The second most common
3048       alternative is to place the timeout at the start of a particular
3049       subrule that is known to be potentially very slow.
3050
3051       A common mistake is to put the timeout specification at the top level
3052       of the grammar, but place it after the actual subrule to be matched,
3053       like so:
3054
3055           my $grammar = qr{
3056
3057               <Text_Corpus>      # Subrule to be matched
3058               <timeout: 10>      # Useless use of timeout
3059
3060               <rule: Text_Corpus>
3061                   # et cetera...
3062           }xms;
3063
3064       Since the parser will only reach the "<timeout: 10>" directive after it
3065       has completely matched "<Text_Corpus>", the timeout is only initiated
3066       at the very end of the matching process and so does not limit that
3067       process in any useful way.
3068
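By contrast, a correctly placed timeout comes before the subrule it is meant to govern. The following is a minimal sketch, assuming Regexp::Grammars is installed; the grammar, the rule names "Corpus" and "Word", and the helper count_words() are invented for the illustration.

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr{
    <timeout: 5>       # Start the clock first...
    <Corpus>           # ...so that this match is governed by it

    <token: Corpus>
        <[Word]>+ % (\s+)

    <token: Word>
        \w+
}xms;

# Hypothetical helper: returns the number of words parsed, or undef on failure
sub count_words {
    my ($text) = @_;
    return undef if !($text =~ $parser);
    return scalar @{ $/{Corpus}{Word} };
}

my $count = count_words('the quick brown fox');
print defined $count ? "Parsed $count words\n" : "Parse failed or timed out\n";
```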
3069       Immediate timeouts
3070
3071       As you might expect, a "<timeout: 0>" directive tells the parser to
3072       keep trying for only zero more seconds, and therefore will immediately
3073       cause the entire surrounding grammar to fail (no matter how deeply
3074       within that grammar the directive is encountered).
3075
3076       This can occasionally be extremely useful. If you know that detecting
3077       a particular datum means that the grammar will never match, no matter
3078       how many other alternatives may subsequently be tried, you can short-
3079       circuit the parser by injecting a "<timeout: 0>" immediately after the
3080       offending datum is detected.
3081
3082       For example, if your grammar only accepts certain versions of the
3083       language being parsed, you could write:
3084
3085           <rule: Valid_Language_Version>
3086                   vers = <%AcceptableVersions>
3087               |
3088                   vers = <bad_version=(\S++)>
3089                   <warning: (?{ "Cannot parse language version $MATCH{bad_version}" })>
3090                   <timeout: 0>
3091
3092       In fact, this "<warning: MSG> <timeout: 0>" sequence is sufficiently
3093       useful, sufficiently complex, and sufficiently easy to get wrong, that
3094       Regexp::Grammars provides a handy shortcut for it: the "<fatal:...>"
3095       directive. A "<fatal:...>" is exactly equivalent to a "<warning:...>"
3096       followed by a zero-timeout, so the previous example could also be
3097       written:
3098
3099           <rule: Valid_Language_Version>
3100                   vers = <%AcceptableVersions>
3101               |
3102                   vers = <bad_version=(\S++)>
3103                   <fatal: (?{ "Cannot parse language version $MATCH{bad_version}" })>
3104
3105       Like "<error:...>" and "<warning:...>", "<fatal:...>" also provides its
3106       own failure context in $CONTEXT, so the previous example could be
3107       further simplified to:
3108
3109           <rule: Valid_Language_Version>
3110                   vers = <%AcceptableVersions>
3111               |
3112                   vers = <fatal:(?{ "Cannot parse language version $CONTEXT" })>
3113
3114       Also like "<error:...>", "<fatal:...>" can autogenerate an error
3115       message if none is provided, so the example could be still further
3116       reduced to:
3117
3118           <rule: Valid_Language_Version>
3119                   vers = <%AcceptableVersions>
3120               |
3121                   vers = <fatal:>
3122
3123       In this last case, however, the error message returned in "@!" would no
3124       longer be:
3125
3126           Cannot parse language version 0.95
3127
3128       It would now be:
3129
3130           Expected valid language version, but found '0.95' instead
3131

Scoping considerations

3133       If you intend to use a grammar as part of a larger program that
3134       contains other (non-grammatical) regexes, it is more efficient--and
3135       less error-prone--to avoid having Regexp::Grammars process those
3136       regexes as well. So it's often a good idea to declare your grammar in a
3137       "do" block, thereby restricting the scope of the module's effects.
3138
3139       For example:
3140
3141           my $grammar = do {
3142               use Regexp::Grammars;
3143               qr{
3144                   <file>
3145
3146                   <rule: file>
3147                       <prelude>
3148                       <data>
3149                       <postlude>
3150
3151                   <rule: prelude>
3152                       # etc.
3153               }x;
3154           };
3155
3156       Because the effects of Regexp::Grammars are lexically scoped, any
3157       regexes defined outside that "do" block will be unaffected by the
3158       module.
3159

INTERFACE

3161   Perl API
3162       "use Regexp::Grammars;"
3163           Causes all regexes in the current lexical scope to be compile-time
3164           processed for grammar elements.
3165
3166       "$str =~ $grammar"
3167       "$str =~ /$grammar/"
3168           Attempt to match the grammar against the string, building a nested
3169           data structure from it.
3170
3171       "%/"
3172           This hash is assigned the nested data structure created by any
3173           successful match of a grammar regex.
3174
3175       "@!"
3176           This array is assigned the queue of error messages created by any
3177           unsuccessful match attempt of a grammar regex.
3178
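A minimal sketch of how these four pieces fit together, assuming Regexp::Grammars is installed. The grammar and the helper parse_greeting() are invented for the illustration.

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $grammar = qr{
    <Greeting>

    <rule: Greeting>
        <Salutation> <Name>

    <token: Salutation>
        Hello | Hi

    <token: Name>
        \w+
}xms;

# Hypothetical helper: returns a copy of the result for the Greeting rule,
# or an empty list (after reporting any queued messages) if the match fails
sub parse_greeting {
    my ($text) = @_;
    if ($text =~ $grammar) {
        my %result = %/;              # copy before another match replaces it
        return %{ $result{Greeting} };
    }
    print "Error: $_\n" for @!;
    return;
}

my %greeting = parse_greeting('Hello World');
print "Salutation: $greeting{Salutation}\n";
print "Name:       $greeting{Name}\n";
```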
3179   Grammar syntax
3180       Directives
3181
3182       "<rule: IDENTIFIER>"
3183           Define a rule whose name is specified by the supplied identifier.
3184
3185           Everything following the "<rule:...>" directive (up to the next
3186           "<rule:...>" or "<token:...>" directive) is treated as part of the
3187           rule being defined.
3188
3189           Any whitespace in the rule is replaced by a call to the "<.ws>"
3190           subrule (which defaults to matching "\s*", but may be explicitly
3191           redefined).
3192
3193       "<token: IDENTIFIER>"
3194           Define a rule whose name is specified by the supplied identifier.
3195
3196           Everything following the "<token:...>" directive (up to the next
3197           "<rule:...>" or "<token:...>" directive) is treated as part of the
3198           rule being defined.
3199
3200           Any whitespace in the rule is ignored (under the "/x" modifier), or
3201           explicitly matched (if "/x" is not used).
3202
3203       "<objrule:  IDENTIFIER>"
3204       "<objtoken: IDENTIFIER>"
3205           Identical to a "<rule: IDENTIFIER>" or "<token: IDENTIFIER>"
3206           declaration, except that the rule or token will also bless the hash
3207           it normally returns, converting it to an object of a class whose
3208           name is the same as the rule or token itself.
3209
3210       "<require: (?{ CODE }) >"
3211           The code block is executed and if its final value is true, matching
3212           continues from the same position. If the block's final value is
3213           false, the match fails at that point and starts backtracking.
3214
3215       "<error: (?{ CODE })  >"
3216       "<error: LITERAL TEXT >"
3217       "<error: >"
3218           This directive queues a conditional error message within the global
3219           special variable "@!" and then fails to match at that point (that
3220           is, it is equivalent to a "(?!)" or "(*FAIL)" when matching).
3221
3222       "<fatal: (?{ CODE })  >"
3223       "<fatal: LITERAL TEXT >"
3224       "<fatal: >"
3225           This directive is exactly the same as an "<error:...>" in every
3226           respect except that it immediately causes the entire surrounding
3227           grammar to fail, and parsing to immediately cease.
3228
3229       "<warning: (?{ CODE })  >"
3230       "<warning: LITERAL TEXT >"
3231           This directive is exactly the same as an "<error:...>" in every
3232           respect except that it does not induce a failure to match at the
3233           point it appears. That is, it is equivalent to a "(?=)" ["succeed
3234           and continue matching"], rather than a "(?!)" ["fail and
3235           backtrack"].
3236
3237       "<debug: COMMAND >"
3238           During the matching of grammar regexes send debugging and warning
3239           information to the specified log file (see "<logfile: LOGFILE>").
3240
3241           The available "COMMAND"'s are:
3242
3243               <debug: continue>    ___ Debug until end of complete parse
3244               <debug: run>         _/
3245
3246               <debug: on>          ___ Debug until next subrule match
3247               <debug: match>       _/
3248
3249               <debug: try>         ___ Debug until next subrule call or match
3250               <debug: step>        _/
3251
3252               <debug: same>        ___ Maintain current debugging mode
3253
3254               <debug: off>         ___ No debugging
3255
3256           See also the $DEBUG special variable.
3257
3258       "<logfile: LOGFILE>"
3259       "<logfile:    -   >"
3260           During the compilation of grammar regexes, send debugging and
3261           warning information to the specified LOGFILE (or to *STDERR if "-"
3262           is specified).
3263
3264           If the specified LOGFILE name contains a %t, it is replaced with a
3265           (sortable) "YYYYMMDD.HHMMSS" timestamp. For example:
3266
3267               <logfile: test-run-%t >
3268
3269           executed at around 9.30pm on the 21st of March 2009, would generate
3270           a log file named: "test-run-20090321.213056"
3271
3272       "<log: (?{ CODE })  >"
3273       "<log: LITERAL TEXT >"
3274           Append a message to the log file. If the argument is a code block,
3275           that code is expected to return the text of the message; if the
3276           argument is anything else, that something else is the literal
3277           message.
3278
3279           If the block returns two or more values, the first is treated as a
3280           log message severity indicator, and the remaining values as
3281           separate lines of text to be logged.
3282
3283       "<timeout: INT >"
3284           Restrict the match-time of the parse to the specified number of
3285           seconds.  Queues an error message and terminates the entire match
3286           process if the parse does not complete within the nominated time
3287           limit.
3288
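As an illustration of the "<require:...>" directive listed above, the following sketch accepts a run of digits only if its value fits in a byte. It assumes Regexp::Grammars is installed; the rule name "Byte", the key "value", and the helper parse_byte() are invented for the example.

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr{
    \A <Byte> \z

    <token: Byte>
        <value=( \d+ )>
        <require: (?{ $MATCH{value} <= 255 })>
}xms;

# Hypothetical helper: returns the parsed value, or undef if the input
# is not a number in the range 0..255
sub parse_byte {
    my ($input) = @_;
    return $input =~ $parser ? $/{Byte}{value} : undef;
}

for my $input ('200', '300') {
    my $value = parse_byte($input);
    print "$input: ", (defined $value ? "accepted" : "rejected"), "\n";
}
```

The anchors matter here: without "\A" and "\z", the regex could backtrack and accept a shorter run of digits (such as the "30" in "300") that does satisfy the requirement.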
3289       Subrule calls
3290
3291       "<IDENTIFIER>"
3292           Call the subrule whose name is IDENTIFIER.
3293
3294           If it matches successfully, save the hash it returns in the current
3295           scope's result-hash, under the key 'IDENTIFIER'.
3296
3297       "<IDENTIFIER_1=IDENTIFIER_2>"
3298           Call the subrule whose name is IDENTIFIER_2.
3299
3300           If it matches successfully, save the hash it returns in the current
3301           scope's result-hash, under the key 'IDENTIFIER_1'.
3302
3303           In other words, the "IDENTIFIER_1=" prefix changes the key under
3304           which the result of calling a subrule is stored.
3305
3306       "<.IDENTIFIER>"
3307           Call the subrule whose name is IDENTIFIER.  Don't save the hash it
3308           returns.
3309
3310           In other words, the "dot" prefix disables saving of subrule
3311           results.
3312
3313       "<IDENTIFIER= ( PATTERN )>"
3314           Match the subpattern PATTERN.
3315
3316           If it matches successfully, capture the substring it matched and
3317           save that substring in the current scope's result-hash, under the
3318           key 'IDENTIFIER'.
3319
3320       "<.IDENTIFIER= ( PATTERN )>"
3321           Match the subpattern PATTERN.  Don't save the substring it matched.
3322
3323       "<IDENTIFIER= %HASH>"
3324           Match a sequence of non-whitespace then verify that the sequence is
3325           a key in the specified hash.
3326
3327           If it matches successfully, capture the sequence it matched and
3328           save that substring in the current scope's result-hash, under the
3329           key 'IDENTIFIER'.
3330
3331       "<%HASH>"
3332           Match a key from the hash.  Don't save the substring it matched.
3333
3334       "<IDENTIFIER= (?{ CODE })>"
3335           Execute the specified CODE.
3336
3337           Save the result (of the final expression that the CODE evaluates)
3338           in the current scope's result-hash, under the key 'IDENTIFIER'.
3339
3340       "<[IDENTIFIER]>"
3341           Call the subrule whose name is IDENTIFIER.
3342
3343           If it matches successfully, append the hash it returns to a nested
3344           array within the current scope's result-hash, under the key
3345           'IDENTIFIER'.
3346
3347       "<[IDENTIFIER_1=IDENTIFIER_2]>"
3348           Call the subrule whose name is IDENTIFIER_2.
3349
3350           If it matches successfully, append the hash it returns to a nested
3351           array within the current scope's result-hash, under the key
3352           'IDENTIFIER_1'.
3353
3354       "<ANY_SUBRULE>+ % <ANY_OTHER_SUBRULE>"
3355       "<ANY_SUBRULE>* % <ANY_OTHER_SUBRULE>"
3356       "<ANY_SUBRULE>+ % (PATTERN)"
3357       "<ANY_SUBRULE>* % (PATTERN)"
3358           Repeatedly call the first subrule.  Keep matching as long as the
3359           subrule matches, provided successive matches are separated by
3360           matches of the second subrule or the pattern.
3361
3362           In other words, match a list of ANY_SUBRULE's separated by
3363           ANY_OTHER_SUBRULE's or PATTERN's.
3364
3365           Note that, if a pattern is used to specify the separator, it must
3366           be specified in some kind of matched parentheses. These may be
3367           capturing ["(...)"], non-capturing ["(?:...)"], non-backtracking
3368           ["(?>...)"], or any other construct enclosed by an opening and
3369           closing paren.
3370
3371       "<ANY_SUBRULE>+ %% <ANY_OTHER_SUBRULE>"
3372       "<ANY_SUBRULE>* %% <ANY_OTHER_SUBRULE>"
3373       "<ANY_SUBRULE>+ %% (PATTERN)"
3374       "<ANY_SUBRULE>* %% (PATTERN)"
3375           Repeatedly call the first subrule.  Keep matching as long as the
3376           subrule matches, provided successive matches are separated by
3377           matches of the second subrule or the pattern.
3378
3379           Also allow an optional final trailing instance of the second
3380           subrule or pattern (this is where "%%" differs from "%").
3381
3382           In other words, match a list of ANY_SUBRULE's separated by
3383           ANY_OTHER_SUBRULE's or PATTERN's, with a possible final separator.
3384
3385           As for the single "%" operator, if a pattern is used to specify the
3386           separator, it must be specified in some kind of matched
3387           parentheses.  These may be capturing ["(...)"], non-capturing
3388           ["(?:...)"], non-backtracking ["(?>...)"], or any other construct
3389           enclosed by an opening and closing paren.
3390
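The list-capture and separated-list forms above can be combined as in the following sketch, which assumes Regexp::Grammars is installed. The rule names and the helper parse_list() are invented for the illustration.

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr{
    <List>

    <rule: List>
        <[Item]>+ % (\s*,\s*)

    <token: Item>
        \d+
}xms;

# Hypothetical helper: returns the items of a comma-separated list,
# or an empty list if nothing could be parsed
sub parse_list {
    my ($text) = @_;
    return if !($text =~ $parser);
    return @{ $/{List}{Item} };    # each <[Item]> match was appended here
}

my @items = parse_list('1, 2, 3');
print "Items: @items\n";
```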
3391   Special variables within grammar actions
3392       $CAPTURE
3393       $CONTEXT
3394           These are both aliases for the built-in read-only $^N variable,
3395           which always contains the substring matched by the nearest
3396           preceding "(...)"  capture. $^N still works perfectly well, but
3397           these are provided to improve the readability of code blocks and
3398           error messages respectively.
3399
3400       $INDEX
3401           This variable contains the index at which the next match will be
3402           attempted within the string being parsed. It is most commonly used
3403           in "<error:...>" or "<log:...>" directives:
3404
3405               <rule: ListElem>
3406                   <log: (?{ "Trying words at index $INDEX" })>
3407                   <MATCH=( \w++ )>
3408                 |
3409                   <log: (?{ "Trying digits at index $INDEX" })>
3410                   <MATCH=( \d++ )>
3411                 |
3412                   <error: (?{ "Missing ListElem near index $INDEX" })>
3413
3414       %MATCH
3415           This variable contains all the saved results of any subrules called
3416           from the current rule. In other words, subrule calls like:
3417
3418               <ListElem>  <Separator= (,)>
3419
3420           store their respective match results in $MATCH{'ListElem'} and
3421           $MATCH{'Separator'}.
3422
3423       $MATCH
3424           This variable is an alias for $MATCH{"="}. This is the %MATCH entry
3425           for the special "override value". If this entry is defined, its
3426           value overrides the usual "return \%MATCH" semantics of a
3427           successful rule.
3428
3429       %ARG
3430           This variable contains all the key/value pairs that were passed
3431           into a particular subrule call. For example, given the calls:
3432
3433               <Keyword>  <Command>  <Terminator(:Keyword)>
3434
3435           the "Terminator" rule could get access to the text matched by
3436           "<Keyword>" like so:
3437
3438               <token: Terminator>
3439                   end_ (??{ $ARG{'Keyword'} })
3440
3441           Note that to match against the calling subrule's 'Keyword' value,
3442           it's necessary to use either a deferred interpolation ("(??{...})")
3443           or a qualified matchref:
3444
3445               <token: Terminator>
3446                   end_ <\:Keyword>
3447
3448           A common mistake is to attempt to directly interpolate the
3449           argument:
3450
3451               <token: Terminator>
3452                   end_ $ARG{'Keyword'}
3453
3454           This evaluates $ARG{'Keyword'} when the grammar is compiled, rather
3455           than when the rule is matched.
3456
3457       $_  At the start of any code blocks inside any regex, the variable $_
3458           contains the complete string being matched against. The current
3459           matching position within that string is given by: "pos($_)".
3460
3461       $DEBUG
3462           This variable stores the current debugging mode (which may be any
3463           of: 'off', 'on', 'run', 'continue', 'match', 'step', or 'try'). It
3464           is set automatically by the "<debug:...>" command, but may also be
3465           set manually in a code block (which can be useful for conditional
3466           debugging). For example:
3467
3468               <rule: ListElem>
3469                   <Identifier>
3470
3471                   # Conditionally debug if 'foobar' encountered...
3472                   (?{ $DEBUG = $MATCH{Identifier} eq 'foobar' ? 'step' : 'off' })
3473
3474                   <Modifier>?
3475
3476           See also: the "<logfile: LOGFILE>" and "<debug: DEBUG_CMD>" directives.
3477
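The %ARG mechanism described above can be tried out with a sketch like the following, which assumes Regexp::Grammars is installed. The "Block" and "Body" rules and the helper block_is_valid() are invented for the example; the "Terminator" token uses the qualified matchref form shown earlier.

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr{
    \A <Block> \z

    <rule: Block>
        <Keyword> <Body> <Terminator(:Keyword)>

    <token: Keyword>
        begin | loop

    <token: Body>
        \d+

    <token: Terminator>
        end_ <\:Keyword>
}xms;

# Hypothetical helper: true if the block ends with the same keyword it began with
sub block_is_valid {
    my ($text) = @_;
    return $text =~ $parser ? 1 : 0;
}

for my $text ('loop 42 end_loop', 'loop 42 end_begin') {
    print "$text: ", (block_is_valid($text) ? "valid" : "invalid"), "\n";
}
```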

IMPORTANT CONSTRAINTS AND LIMITATIONS

3479       ·   Prior to Perl 5.14, the Perl 5 regex engine was not reentrant. So
3480           any attempt to perform a regex match inside a "(?{ ... })" or "(??{
3481           ... })" under Perl 5.12 or earlier will almost certainly lead to
3482           either weird data corruption or a segfault.
3483
3484           The same calamities can also occur in any constructor called by
3485           "<objrule:>". If the constructor invokes another regex in any way,
3486           it will most likely fail catastrophically. In particular, this
3487           means that Moose constructors will frequently crash and burn within
3488           a Regexp::Grammars grammar (for example, if the Moose-based class
3489           declares an attribute type constraint such as 'Int', which Moose
3490           checks using a regex).
3491
3492       ·   The additional regex constructs this module provides are
3493           implemented by rewriting regular expressions. This is a (safer)
3494           form of source filtering, but still subject to all the same
3495           limitations and fallibilities of any other macro-based solution.
3496
3497       ·   In particular, rewriting the macros involves the insertion of (a
3498           lot of) extra capturing parentheses. This means you can no longer
3499           assume that particular capturing parens correspond to particular
3500           numeric variables: i.e. to $1, $2, $3 etc. If you want to capture
3501           directly, use Perl 5.10's named capture construct:
3502
3503               (?<name> [^\W\d]\w* )
3504
3505           Better still, capture the data in its correct hierarchical context
3506           using the module's "named subpattern" construct:
3507
3508               <name= ([^\W\d]\w*) >
3509
3510       ·   No recursive descent parser--including those created with
3511           Regexp::Grammars--can directly handle left-recursive grammars with
3512           rules of the form:
3513
3514               <rule: List>
3515                   <List> , <ListElem>
3516
3517           If you find yourself attempting to write a left-recursive grammar
3518           (which Perl 5.10 may or may not complain about, but will never
3519           successfully parse with), then you probably need to use the
3520           "separated list" construct instead:
3521
3522               <rule: List>
3523                   <[ListElem]>+ % (,)
3524
3525       ·   Grammatical parsing with Regexp::Grammars can fail if your grammar
3526           uses "non-backtracking" directives (i.e. the "(?>...)" block or the
3527           "?+", "*+", or "++" repetition specifiers). The problem appears to
3528           be that preventing the regex from backtracking through the in-regex
3529           actions that Regexp::Grammars adds causes the module's internal
3530           stack to fall out of sync with the regex match.
3531
3532           For the time being, if your grammar does not work as expected, you
3533           may need to replace one or more "non-backtracking" directives, with
3534           their regular (i.e. backtracking) equivalents.
3535
3536       ·   Similarly, parsing with Regexp::Grammars will fail if your grammar
3537           places a subrule call within a positive look-ahead, since these
3538           don't play nicely with the data stack.
3539
3540           This seems to be an internal problem with perl itself.
3541           Investigations, and attempts at a workaround, are proceeding.
3542
3543           For the time being, you need to make sure that grammar rules don't
3544           appear inside a positive lookahead or use the "<?RULENAME>"
3545           construct instead.
3546

DIAGNOSTICS

3548       Note that (because the author cannot find a way to throw exceptions
3549       from within a regex) none of the following diagnostics actually throws
3550       an exception.
3551
3552       Instead, these messages are simply written to the specified parser
3553       logfile (or to *STDERR, if no logfile is specified).
3554
3555       However, any fatal match-time message will immediately terminate the
3556       parser matching and will still set $@ (as if an exception had been
3557       thrown and caught at that point in the code). You then have the option
3558       to check $@ immediately after matching with the grammar, and rethrow if
3559       necessary:
3560
3561           if ($input =~ $grammar) {
3562               process_data_in(\%/);
3563           }
3564           else {
3565               die if $@;
3566           }
3567
3568       "Found call to %s, but no %s was defined in the grammar"
3569           You specified a call to a subrule for which there was no definition
3570           in the grammar. Typically that's either because you forgot to
3571           define the rule, or because you misspelled either the definition or
3572           the subrule call. For example:
3573
3574               <file>
3575
3576               <rule: fiel>            <---- misspelled rule
3577                   <lines>             <---- used but never defined
3578
3579           Regexp::Grammars converts any such subrule call attempt to an
3580           instant catastrophic failure of the entire parse, so if your parser
3581           ever actually tries to perform that call, Very Bad Things will
3582           happen.
3583
3584       "Entire parse terminated prematurely while attempting to call
3585       non-existent rule: %s"
3586           You ignored the previous error and actually tried to call a
3587           subrule for which there was no definition in the grammar. Very Bad
3588           Things are now happening. The parser got very upset, took its ball,
3589           and went home.  See the preceding diagnostic for remedies.
3590
3591           This diagnostic should throw an exception, but can't. So it sets $@
3592           instead, allowing you to trap the error manually if you wish.
3593
3594       "Fatal error: <objrule: %s> returned a non-hash-based object"
3595           An <objrule:> was specified and returned a blessed object that
3596           wasn't a hash. This will break the behaviour of the grammar, so the
3597           module immediately reports the problem and gives up.
3598
3599           The solution is to use only hash-based classes with <objrule:>.
3600
3601       "Can't match against <grammar: %s>"
3602           The regex you attempted to match against defined a pure grammar,
3603           using the "<grammar:...>" directive. Pure grammars have no start-
3604           pattern and hence cannot be matched against directly.
3605
3606           You need to define a matchable grammar that inherits from your pure
3607           grammar and then calls one of its rules. For example, instead of:
3608
3609               my $greeting = qr{
3610                   <grammar: Greeting>
3611
3612                   <rule: greet>
3613                       Hi there
3614                       | Hello
3615                       | Yo!
3616               }xms;
3617
3618           you need:
3619
3620               qr{
3621                   <grammar: Greeting>
3622
3623                   <rule: greet>
3624                       Hi there
3625                     | Hello
3626                     | Yo!
3627               }xms;
3628
3629               my $greeting = qr{
3630                   <extends: Greeting>
3631                   <greet>
3632               }xms;
3633
3634       "Inheritance from unknown grammar requested by <%s>"
3635           You used an "<extends:...>" directive to request that your grammar
3636           inherit from another, but the grammar you asked to inherit from
3637           doesn't exist.
3638
3639           Check the spelling of the grammar name, and that it's already been
3640           defined somewhere earlier in your program.
3641
3642       "Redeclaration of <%s> will be ignored"
3643           You defined two or more rules or tokens with the same name.  The
3644           first one defined in the grammar will be used; the rest will be
3645           ignored.
3646
3647           To get rid of the warning, get rid of the extra definitions (or, at
3648           least, comment them out or rename the rules).
3649
3650       "Possible invalid subrule call %s"
3651           Your grammar contained something of the form:
3652
3653               <identifier
3654               <.identifier
3655               <[identifier
3656
3657           which you might have intended to be a subrule call, but which
3658           didn't correctly parse as one. If it was supposed to be a
3659           Regexp::Grammars subrule call, you need to check the syntax you
3660           used. If it wasn't supposed to be a subrule call, you can silence
3661           the warning by rewriting it and quoting the leading angle:
3662
3663               \<identifier
3664               \<.identifier
3665               \<[identifier
3666
3667       "Possible failed attempt to specify a directive: %s"
3668           Your grammar contained something of the form:
3669
3670               <identifier:...
3671
3672           that wasn't a known directive like "<rule:...>" or
3673           "<debug:...>". If it was supposed to be a Regexp::Grammars
3674           directive, check the spelling of the directive name. If it wasn't
3675           supposed to be a directive, you can silence the warning by
3676           rewriting it and quoting the leading angle:
3677
3678               \<identifier:
3679
3680       "Possible failed attempt to specify a subrule call %s"
3681           Your grammar contained something of the form:
3682
3683               <identifier...
3684
3685           that wasn't a call to a known subrule like "<ident>" or
3686           "<name>". If it was supposed to be a Regexp::Grammars subrule call,
3687           check the spelling of the rule name in the angles. If it wasn't
3688           supposed to be a subrule call, you can silence the warning by
3689           rewriting it and quoting the leading angle:
3690
3691               \<identifier...
3692
3693       "Repeated subrule %s will only capture its final match"
3694           You specified a subrule call with a repetition qualifier, such as:
3695
3696               <ListElem>*
3697
3698           or:
3699
3700               <ListElem>+
3701
3702           Because each subrule call saves its result in a hash entry of the
3703           same name, each repeated match will overwrite the previous ones, so
3704           only the last match will ultimately be saved. If you want to save
3705           all the matches, you need to tell Regexp::Grammars to save the
3706           sequence of results as a nested array within the hash entry, like
3707           so:
3708
3709               <[ListElem]>*
3710
3711           or:
3712
3713               <[ListElem]>+
3714
3715           If you really did intend to throw away every result but the final
3716           one, you can silence the warning by placing the subrule call inside
3717           any kind of parentheses. For example:
3718
3719               (<ListElem>)*
3720
3721           or:
3722
3723               (?: <ListElem> )+
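
               The difference between the two capture styles can be seen in a
               small sketch. The variable names and the comma-separated input
               are assumptions made for illustration; both grammars wrap the
               call in parentheses so that neither triggers the warning:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Scalar call: every repetition overwrites the 'ListElem' hash entry.
my $last_only = qr{
    \A (?: <ListElem> ,? )+ \z

    <token: ListElem>  \w+
}xms;

# List call: every repetition is appended to an array under 'ListElem'.
my $keep_all = qr{
    \A (?: <[ListElem]> ,? )+ \z

    <token: ListElem>  \w+
}xms;

# Copy %/ after each match, because the next match replaces its contents.
my (%scalar_result, %list_result);
%scalar_result = %/ if 'a,b,c' =~ $last_only;
%list_result   = %/ if 'a,b,c' =~ $keep_all;

print "scalar: $scalar_result{ListElem}\n";
print "list:   @{ $list_result{ListElem} }\n";
```

               With the scalar call only the final element is expected to
               survive, whereas the list call should keep all three.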
3724
3725       "Unable to open log file '$filename' (%s)"
3726           You specified a "<logfile:...>" directive but the file whose name
3727           you specified could not be opened for writing (for the reason given
3728           in the parens).
3729
3730           Did you misspell the filename, or get the permissions wrong
3731           somewhere in the filepath?
3732
3733       "Non-backtracking subrule %s may not revert correctly during
3734       backtracking"
3735           Because of inherent limitations in the Perl regex engine, non-
3736           backtracking constructs like "++", "*+", "?+", and "(?>...)" do not
3737           always work correctly when applied to subrule calls, especially in
3738           earlier versions of Perl.
3739
3740           If the grammar doesn't work properly, replace the offending
3741           constructs with regular backtracking versions instead. If the
3742           grammar does work, you can silence the warning by enclosing the
3743           subrule call in any kind of parentheses. For example, change:
3744
3745               <[ListElem]>++
3746
3747           to:
3748
3749               (?: <[ListElem]> )++
3750
3751       "Unexpected item before first subrule specification in definition of
3752       <grammar: %s>"
3753           Named grammar definitions must consist only of rule and token
3754           definitions.  They cannot have patterns before the first
3755           definition.  You had some kind of pattern before the first
3756           definition, which will be completely ignored within the grammar.
3757
3758           To silence the warning, either comment out or delete whatever is
3759           before the first rule/token definition.
3760
3761       "No main regex specified before rule definitions"
3762           You specified an unnamed grammar (i.e. no "<grammar:...>"
3763           directive), but didn't specify anything for it to actually match,
3764           just some rules that you don't actually call. For example:
3765
3766               my $grammar = qr{
3767
3768                   <rule: list>    \( <item> +% [,] \)
3769
3770                   <token: item>   <list> | \d+
3771               }x;
3772
3773           You have to provide something before the first rule to start the
3774           matching off. For example:
3775
3776               my $grammar = qr{
3777
3778                   <list>   # <--- This tells the grammar how to start matching
3779
3780                   <rule: list>    \( <item> +% [,] \)
3781
3782                   <token: item>   <list> | \d+
3783               }x;
3784
3785       "Ignoring useless empty <ws:> directive"
3786           The "<ws:...>" directive specifies what whitespace matches within
3787           the current rule. An empty "<ws:>" directive would cause whitespace
3788           to match nothing at all, which is what happens in a token
3789           definition, not in a rule definition.
3790
3791           Either put some subpattern inside the empty "<ws:...>" or, if you
3792           really do want whitespace to match nothing at all, remove the
3793           directive completely and change the rule definition to a token
3794           definition.
3795
3796       "Ignoring useless <ws: %s > directive in a token definition"
3797           The "<ws:...>" directive is used to specify what whitespace matches
3798           within a rule. Since whitespace never matches anything inside
3799           tokens, putting a "<ws:...>" directive in a token is a waste of
3800           time.
3801
3802           Either remove the useless directive, or else change the surrounding
3803           token definition to a rule definition.
3804
3805       "Quantifier that doesn't quantify anything: <%s>"
3806           You specified a rule or token something like:
3807
3808               <token: star>  *
3809
3810           or:
3811
3812               <rule: add_op>  plus | add | +
3813
3814           but the "*" and "+" in those examples are both regex meta-
3815           operators: quantifiers that usually cause what precedes them to
3816           match repeatedly.  In these cases however, nothing is preceding the
3817           quantifier, so it's a Perl syntax error.
3818
3819           You almost certainly need to escape the meta-characters in some
3820           way.  For example:
3821
3822               <token: star>  \*
3823
3824               <rule: add_op>  plus | add | [+]
3825

CONFIGURATION AND ENVIRONMENT

3827       Regexp::Grammars requires no configuration files or environment
3828       variables.
3829

DEPENDENCIES

3831       This module only works under Perl 5.10 or later.
3832

INCOMPATIBILITIES

3834       This module is likely to be incompatible with any other module that
3835       automagically rewrites regexes. For example it may conflict with
3836       Regexp::DefaultFlags, Regexp::DeferredExecution, or Regexp::Extended.
3837

BUGS

3839       No bugs have been reported.
3840
3841       Please report any bugs or feature requests to
3842       "bug-regexp-grammars@rt.cpan.org", or through the web interface at
3843       <http://rt.cpan.org>.
3844

AUTHOR

3846       Damian Conway  "<DCONWAY@CPAN.org>"
3847

LICENCE AND COPYRIGHT

3849       Copyright (c) 2009, Damian Conway "<DCONWAY@CPAN.org>". All rights
3850       reserved.
3851
3852       This module is free software; you can redistribute it and/or modify it
3853       under the same terms as Perl itself. See perlartistic.
3854

DISCLAIMER OF WARRANTY

3856       BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
3857       FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT
3858       WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER
3859       PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND,
3860       EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
3861       WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
3862       ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH
3863       YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL
3864       NECESSARY SERVICING, REPAIR, OR CORRECTION.
3865
3866       IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
3867       WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
3868       REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE
3869       TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR
3870       CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
3871       SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
3872       RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
3873       FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
3874       SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
3875       DAMAGES.
3876
3877
3878
3879perl v5.28.1                      2019-02-02               Regexp::Grammars(3)