Regexp::Grammars(3pm)

Regexp::Grammars(3)   User Contributed Perl Documentation  Regexp::Grammars(3)

NAME

       Regexp::Grammars - Add grammatical parsing features to Perl 5.10
       regexes

VERSION

       This document describes Regexp::Grammars version 1.052

SYNOPSIS

           use Regexp::Grammars;

           my $parser = qr{
               (?:
                   <Verb>               # Parse and save a Verb in a scalar
                   <.ws>                # Parse but don't save whitespace
                   <Noun>               # Parse and save a Noun in a scalar

                   <type=(?{ rand > 0.5 ? 'VN' : 'VerbNoun' })>
                                        # Save result of expression in a scalar
               |
                   (?:
                       <[Noun]>         # Parse a Noun and save result in a list
                                        #   (saved under the key 'Noun')
                       <[PostNoun=ws]>  # Parse whitespace, save it in a list
                                        #   (saved under the key 'PostNoun')
                   )+

                   <Verb>               # Parse a Verb and save result in a scalar
                                        #   (saved under the key 'Verb')

                   <type=(?{ 'VN' })>   # Save a literal in a scalar
               |
                   <debug: match>       # Turn on the integrated debugger here
                   <.Cmd= (?: mv? )>    # Parse but don't capture a subpattern
                                        #   (name it 'Cmd' for debugging purposes)
                   <[File]>+            # Parse 1+ Files and save them in a list
                                        #   (saved under the key 'File')
                   <debug: off>         # Turn off the integrated debugger here
                   <Dest=File>          # Parse a File and save it in a scalar
                                        #   (saved under the key 'Dest')
               )

               ################################################################

               <token: File>              # Define a subrule named File
                   <.ws>                  #  - Parse but don't capture whitespace
                   <MATCH= ([\w-]+) >     #  - Parse the subpattern and capture
                                          #    matched text as the result of the
                                          #    subrule

               <token: Noun>              # Define a subrule named Noun
                   cat | dog | fish       #  - Match an alternative (as usual)

               <rule: Verb>               # Define a whitespace-sensitive subrule
                   eats                   #  - Match a literal (after any space)
                   <Object=Noun>?         #  - Parse optional subrule Noun and
                                          #    save result under the key 'Object'
               |                          #  Or else...
                   <AUX>                  #  - Parse subrule AUX and save result
                   <part= (eaten|seen) >  #  - Match a literal, save under 'part'

               <token: AUX>               # Define a whitespace-insensitive subrule
                   (has | is)             #  - Match an alternative and capture
                   (?{ $MATCH = uc $^N }) #  - Use captured text as subrule result

           }x;

           # Match the grammar against some text...
           if ($text =~ $parser) {
               # If successful, the hash %/ will have the hierarchy of results...
               process_data_in( %/ );
           }

QUICKSTART CHEATSHEET

   In your program...
           use Regexp::Grammars;    Allow enhanced regexes in lexical scope
           %/                       Result-hash for successful grammar match

   Defining and using named grammars...
           <grammar:  GRAMMARNAME>  Define a named grammar that can be inherited
           <extends:  GRAMMARNAME>  Current grammar inherits named grammar's rules

   Defining rules in your grammar...
           <rule:     RULENAME>     Define rule with magic whitespace
           <token:    RULENAME>     Define rule without magic whitespace

           <objrule:  CLASS= NAME>  Define rule that blesses return-hash into class
           <objtoken: CLASS= NAME>  Define token that blesses return-hash into class

           <objrule:  CLASS>        Shortcut for above (rule name derived from class)
           <objtoken: CLASS>        Shortcut for above (token name derived from class)

   Matching rules in your grammar...
           <RULENAME>                Call named subrule (may be fully qualified)
                                     save result to $MATCH{RULENAME}

           <RULENAME(...)>           Call named subrule, passing args to it

           <!RULENAME>               Call subrule and fail if it matches
           <!RULENAME(...)>          (shorthand for (?!<.RULENAME>) )

           <:IDENT>                  Match contents of $ARG{IDENT} as a pattern
           <\:IDENT>                 Match contents of $ARG{IDENT} as a literal
           </:IDENT>                 Match closing delimiter for $ARG{IDENT}

           <%HASH>                   Match longest possible key of hash
           <%HASH {PAT}>             Match any key of hash that also matches PAT

           </IDENT>                  Match closing delimiter for $MATCH{IDENT}
           <\_IDENT>                 Match the literal contents of $MATCH{IDENT}

           <ALIAS= RULENAME>         Call subrule, save result in $MATCH{ALIAS}
           <ALIAS= %HASH>            Match a hash key, save key in $MATCH{ALIAS}
           <ALIAS= ( PATTERN )>      Match pattern, save match in $MATCH{ALIAS}
           <ALIAS= (?{ CODE })>      Execute code, save value in $MATCH{ALIAS}
           <ALIAS= 'STR' >           Save specified string in $MATCH{ALIAS}
           <ALIAS= 42 >              Save specified number in $MATCH{ALIAS}
           <ALIAS= /IDENT>           Match closing delim, save as $MATCH{ALIAS}
           <ALIAS= \_IDENT>          Match '$MATCH{IDENT}', save as $MATCH{ALIAS}

           <.SUBRULE>                Call subrule (one of the above forms),
                                     but don't save the result in %MATCH

           <[SUBRULE]>               Call subrule (one of the above forms), but
                                     append result instead of overwriting it

           <SUBRULE1>+ % <SUBRULE2>  Match one or more repetitions of SUBRULE1
                                     as long as they're separated by SUBRULE2
           <SUBRULE1> ** <SUBRULE2>  Same (only for backwards compatibility)

           <SUBRULE1>* % <SUBRULE2>  Match zero or more repetitions of SUBRULE1
                                     as long as they're separated by SUBRULE2

           <SUBRULE1>* %% <SUBRULE2> Match zero or more repetitions of SUBRULE1
                                     as long as they're separated by SUBRULE2
                                     and allow an optional trailing SUBRULE2

   In your grammar's code blocks...
           $CAPTURE    Alias for $^N (the most recent paren capture)
           $CONTEXT    Another alias for $^N
           $INDEX      Current index of next matching position in string
           %MATCH      Current rule's result-hash
           $MATCH      Magic override value (returned instead of result-hash)
           %ARG        Current rule's argument hash
           $DEBUG      Current match-time debugging mode

   Directives...
           <require: (?{ CODE })   >  Fail if code evaluates false
           <timeout: INT           >  Fail after specified number of seconds
           <debug:   COMMAND       >  Change match-time debugging mode
           <logfile: LOGFILE       >  Change debugging log file (default: STDERR)
           <fatal:   TEXT|(?{CODE})>  Queue error message and fail parse
           <error:   TEXT|(?{CODE})>  Queue error message and backtrack
           <warning: TEXT|(?{CODE})>  Queue warning message and continue
           <log:     TEXT|(?{CODE})>  Explicitly add a message to debugging log
           <ws:      PATTERN       >  Override automatic whitespace matching
           <minimize:>                Simplify the result of a subrule match
           <context:>                 Switch on context substring retention
           <nocontext:>               Switch off context substring retention

DESCRIPTION

       This module adds a small number of new regex constructs that can be
       used within Perl 5.10 patterns to implement complete recursive-descent
       parsing.

       Perl 5.10 already supports recursive-descent matching, via the new
       "(?<name>...)" and "(?&name)" constructs. For example, here is a simple
       matcher for a subset of the LaTeX markup language:

           $matcher = qr{
               (?&File)

               (?(DEFINE)
                   (?<File>     (?&Element)* )

                   (?<Element>  \s* (?&Command)
                             |  \s* (?&Literal)
                   )

                   (?<Command>  \\ \s* (?&Literal) \s* (?&Options)? \s* (?&Args)? )

                   (?<Options>  \[ \s* (?:(?&Option) (?:\s*,\s* (?&Option) )*)? \s* \])

                   (?<Args>     \{ \s* (?&Element)* \s* \}  )

                   (?<Option>   \s* [^][\$&%#_{}~^\s,]+     )

                   (?<Literal>  \s* [^][\$&%#_{}~^\s]+      )
               )
           }xms

       This technique makes it possible to use regexes to recognize complex,
       hierarchical--and even recursive--textual structures. The problem is
       that Perl 5.10 doesn't provide any support for extracting that
       hierarchical data into nested data structures. In other words, using
       Perl 5.10 you can match complex data, but not parse it into an
       internally useful form.

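       As a minimal sketch of that limitation (plain Perl, without
       Regexp::Grammars; the group name "Group" and the sample strings are
       only illustrative), the following pattern recognizes balanced
       parentheses, yet none of the nested structure is available once the
       match is over:

```perl
use strict;
use warnings;

# Recognise balanced parentheses using plain Perl 5.10 recursion.
my $balanced = qr{
    \A (?&Group) \z

    (?(DEFINE)
        (?<Group>  \(  (?: [^()]++ | (?&Group) )*  \)  )
    )
}x;

# Record whether each sample string is accepted.
my %matched;
for my $text ('(a (b) c)', '(a (b c)') {
    $matched{$text} = $text =~ $balanced ? 1 : 0;
    printf "%-10s => %s\n", $text, $matched{$text} ? 'balanced' : 'not balanced';
}

# The match answers yes or no, but captures made inside a (?&Group) call
# are discarded when the call returns, so no tree of groups survives.
my $group_kept = ('(a (b) c)' =~ $balanced && defined $+{Group}) ? 1 : 0;
print 'Nested structure retained: ', ($group_kept ? 'yes' : 'no'), "\n";
```

       The program learns only whether the text matched; recovering the
       nesting would require extra code embedded in the pattern.
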
       An additional problem when using Perl 5.10 regexes to match complex
       data formats is that you have to make sure you remember to insert
       whitespace-matching constructs (such as "\s*") at every possible
       position where the data might contain ignorable whitespace. This
       reduces the readability of such patterns, and increases the chance of
       errors (typically caused by overlooking a location where whitespace
       might appear).

       The Regexp::Grammars module solves both those problems.

       If you import the module into a particular lexical scope, it
       preprocesses any regex in that scope, so as to implement a number of
       extensions to the standard Perl 5.10 regex syntax. These extensions
       simplify the task of defining and calling subrules within a grammar,
       and allow those subrule calls to capture and retain the components
       they match in a proper hierarchical manner.

       For example, the above LaTeX matcher could be converted to a full LaTeX
       parser (and considerably tidied up at the same time), like so:

           use Regexp::Grammars;
           $parser = qr{
               <File>

               <rule: File>       <[Element]>*

               <rule: Element>    <Command> | <Literal>

               <rule: Command>    \\  <Literal>  <Options>?  <Args>?

               <rule: Options>    \[  <[Option]>+ % (,)  \]

               <rule: Args>       \{  <[Element]>*  \}

               <rule: Option>     [^][\$&%#_{}~^\s,]+

               <rule: Literal>    [^][\$&%#_{}~^\s]+
           }xms

       Note that there is no need to explicitly place "\s*" subpatterns
       throughout the rules; that is taken care of automatically.

       If the Regexp::Grammars version of this regex were successfully matched
       against some appropriate LaTeX document, each rule would call the
       subrules specified within it, and then return a hash containing
       whatever result each of those subrules returned, with each result
       indexed by the subrule's name.

       That is, if the rule named "Command" were invoked, it would first try
       to match a backslash, then it would call the three subrules
       "<Literal>", "<Options>", and "<Args>" (in that sequence). If they all
       matched successfully, the "Command" rule would then return a hash with
       three keys: 'Literal', 'Options', and 'Args'. The value for each of
       those hash entries would be whatever result-hash the subrules
       themselves had returned when matched.

       In this way, each level of the hierarchical regex can generate hashes
       recording everything its own subrules matched, so when the entire
       pattern matches, it produces a tree of nested hashes that represent the
       structured data the pattern matched.

       For example, if the previous regex grammar were matched against a
       string containing:

           \documentclass[a4paper,11pt]{article}
           \author{D. Conway}

       it would automatically extract a data structure equivalent to the
       following (but with several extra "empty" keys, which are described in
       "Subrule results"):

           {
               'File' => {
                   'Element' => [
                       {
                           'Command' => {
                               'Literal' => 'documentclass',
                               'Options' => {
                                   'Option'  => [ 'a4paper', '11pt' ],
                               },
                               'Args'    => {
                                   'Element' => [
                                       {
                                           'Literal' => 'article',
                                       }
                                   ],
                               }
                           }
                       },
                       {
                           'Command' => {
                               'Literal' => 'author',
                               'Args' => {
                                   'Element' => [
                                       {
                                           'Literal' => 'D.',
                                       },
                                       {
                                           'Literal' => 'Conway',
                                       }
                                   ]
                               }
                           }
                       }
                   ]
               }
           }

       The data structure that Regexp::Grammars produces from a regex match is
       available to the surrounding program in the magic variable "%/".

       Regexp::Grammars provides many features that simplify the extraction of
       hierarchical data via a regex match, and also some features that can
       simplify the processing of that data once it has been extracted. The
       following sections explain each of those features, and some of the
       parsing techniques they support.

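       The following sketch shows one way a program might walk the tree left
       in "%/" after matching the LaTeX grammar given earlier in this
       section. It assumes Regexp::Grammars is installed; the @commands array
       and its "name" and "options" keys are hypothetical names chosen for
       the illustration, not part of the module:

```perl
use strict;
use warnings;
use Regexp::Grammars;   # assumed to be installed from CPAN

# The LaTeX grammar from the DESCRIPTION, unchanged.
my $parser = qr{
    <File>

    <rule: File>       <[Element]>*

    <rule: Element>    <Command> | <Literal>

    <rule: Command>    \\  <Literal>  <Options>?  <Args>?

    <rule: Options>    \[  <[Option]>+ % (,)  \]

    <rule: Args>       \{  <[Element]>*  \}

    <rule: Option>     [^][\$&%#_{}~^\s,]+

    <rule: Literal>    [^][\$&%#_{}~^\s]+
}xms;

my $latex = "\\documentclass[a4paper,11pt]{article}\n\\author{D. Conway}\n";

# Collect the name and options of every top-level command.
my @commands;
if ($latex =~ $parser) {
    for my $element (@{ $/{File}{Element} }) {
        my $command = $element->{Command} or next;   # skip bare literals
        my $options = $command->{Options} ? $command->{Options}{Option} : [];
        push @commands, { name => $command->{Literal}, options => $options };
    }
}

for my $command (@commands) {
    print $command->{name}, ': ', join(', ', @{ $command->{options} }), "\n";
}
```

       Because "<Options>" is optional in the grammar, the 'Options' key is
       simply absent for a command such as "\author", which is why the loop
       checks for it before dereferencing.
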
   Setting up the module
       Just add:

           use Regexp::Grammars;

       to any lexical scope. Any regexes within that scope will automatically
       now implement the new parsing constructs:

           use Regexp::Grammars;

           my $parser = qr/ regex with $extra <chocolatey> grammar bits /;

       Note that you do not need to use the "/x" modifier when declaring a
       regex grammar (though you certainly may). But even if you don't, the
       module quietly adds a "/x" to every regex within the scope of its
       usage. Otherwise, the default "a whitespace character matches exactly
       that whitespace character" behaviour of Perl regexes would mess up
       your grammar's parsing. If you need the non-"/x" behaviour, you can
       still use the "(?-x)" or "(?-x:...)" directives to switch off "/x"
       within one or more of your grammar's components.

       Once the grammar has been processed, you can then match text against
       the extended regexes, in the usual manner (i.e. via a "=~" match):

           if ($input_text =~ $parser) {
               ...
           }

       After a successful match, the variable "%/" will contain a series of
       nested hashes representing the structured hierarchical data captured
       during the parse.

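       Putting those steps together, a complete program might look like the
       sketch below. It assumes Regexp::Grammars is installed; the rule names
       ("greeting", "salutation", "name") and the input string are invented
       for the illustration:

```perl
use strict;
use warnings;
use Regexp::Grammars;   # assumed to be installed from CPAN

my $parser = qr{
    <greeting>

    <rule: greeting>
        <salutation> , <name>

    <token: salutation>
        hello | goodbye

    <token: name>
        \w+
}x;

# Copy %/ straight after the match, before another match can replace it.
my %result;
if ('hello, world' =~ $parser) {
    %result = %/;
    print "salutation: $result{greeting}{salutation}\n";
    print "name:       $result{greeting}{name}\n";
}
```

       Copying "%/" into a lexical hash is a design choice rather than a
       requirement: it keeps the results safe from any later grammar match
       in the same program.
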
   Structure of a Regexp::Grammars grammar
       A Regexp::Grammars specification consists of a start-pattern (which may
       include both standard Perl 5.10 regex syntax, as well as special
       Regexp::Grammars directives), followed by one or more rule or token
       definitions.

       For example:

           use Regexp::Grammars;
           my $balanced_brackets = qr{

               # Start-pattern...
               <paren_pair> | <brace_pair>

               # Rule definition...
               <rule: paren_pair>
                   \(  (?: <escape> | <paren_pair> | <brace_pair> | [^()] )*  \)

               # Rule definition...
               <rule: brace_pair>
                   \{  (?: <escape> | <paren_pair> | <brace_pair> | [^{}] )*  \}

               # Token definition...
               <token: escape>
                   \\ .
           }xms;

       The start-pattern at the beginning of the grammar acts like the "top"
       token of the grammar, and must be matched completely for the grammar to
       match.

       This pattern is treated like a token for whitespace matching behaviour
       (see "Tokens vs rules (whitespace handling)").  That is, whitespace in
       the start-pattern is treated like whitespace in any normal Perl regex.

       The rules and tokens are declarations only and they are not directly
       matched.  Instead, they act like subroutines, and are invoked by name
       from the initial pattern (or from within a rule or token).

       Each rule or token extends from the directive that introduces it up to
       either the next rule or token directive, or (in the case of the final
       rule or token) to the end of the grammar.

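       As a sketch of how such a grammar might be used, the variant below
       anchors the start-pattern with "\A" and "\z" (an addition made for
       this illustration, so that the whole string must be a single bracket
       pair) and records which top-level subrule matched. The %kind hash and
       the sample strings are hypothetical:

```perl
use strict;
use warnings;
use Regexp::Grammars;   # assumed to be installed from CPAN

my $balanced_brackets = qr{
    # Start-pattern, anchored at both ends for this example...
    \A (?: <paren_pair> | <brace_pair> ) \z

    <rule: paren_pair>
        \(  (?: <escape> | <paren_pair> | <brace_pair> | [^()] )*  \)

    <rule: brace_pair>
        \{  (?: <escape> | <paren_pair> | <brace_pair> | [^{}] )*  \}

    <token: escape>
        \\ .
}xms;

# Classify each sample by the subrule the start-pattern invoked.
my %kind;
for my $text ('(a {b} (c))', '{x (y)}', '(a {b}') {
    if ($text =~ $balanced_brackets) {
        $kind{$text} = exists $/{paren_pair} ? 'paren_pair' : 'brace_pair';
    }
    else {
        $kind{$text} = 'unbalanced';
    }
    print "$text => $kind{$text}\n";
}
```

       Only the start-pattern is matched directly; the two rules and the
       token run solely because the start-pattern (or another rule) calls
       them by name.
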
   Tokens vs rules (whitespace handling)
       The difference between a token and a rule is that a token treats any
       whitespace within it exactly as a normal Perl regular expression would.
       That is, a sequence of whitespace in a token is ignored if the "/x"
       modifier is in effect, or else matches the same literal sequence of
       whitespace characters (if "/x" is not in effect).

       In a rule, most sequences of whitespace are treated as matching the
       implicit subrule "<.ws>", which is automatically predefined to match
       optional whitespace (i.e. "\s*").

       Exceptions to this behaviour are whitespace sequences before a "|", a
       code block, or an explicit space-matcher (such as "<ws>" or "\s"), and
       whitespace at the very end of the rule.

       In other words, a rule such as:

           <rule: sentence>   <noun> <verb>
                          |   <verb> <noun>

       is equivalent to a token with added non-capturing whitespace matching:

           <token: sentence>  <.ws> <noun> <.ws> <verb>
                           |  <.ws> <verb> <.ws> <noun>

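       The difference can be observed directly. In the sketch below (which
       assumes Regexp::Grammars is installed; the "pair" subrule and the
       %outcome hash are illustrative names), the same body is declared once
       as a rule and once as a token, and both parsers are tried on input
       with and without spaces around the "=":

```perl
use strict;
use warnings;
use Regexp::Grammars;   # assumed to be installed from CPAN

# Same body, declared as a rule: whitespace in the body matches <.ws>.
my $rule_parser = qr{
    \A <pair> \z

    <rule: pair>
        <key=(\w+)> = <value=(\w+)>
}x;

# Same body, declared as a token: whitespace in the body is ignored by /x.
my $token_parser = qr{
    \A <pair> \z

    <token: pair>
        <key=(\w+)> = <value=(\w+)>
}x;

my %outcome;
for my $text ('name=perl', 'name = perl') {
    $outcome{rule}{$text}  = $text =~ $rule_parser
                           ? join('/', $/{pair}{key}, $/{pair}{value})
                           : 'no match';
    $outcome{token}{$text} = $text =~ $token_parser
                           ? join('/', $/{pair}{key}, $/{pair}{value})
                           : 'no match';
    print "rule:  '$text' => $outcome{rule}{$text}\n";
    print "token: '$text' => $outcome{token}{$text}\n";
}
```

       The token version accepts only the compact form, because nothing in
       its body can consume the spaces around the "=".
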
       You can explicitly define a "<ws>" token to change that default
       behaviour. For example, you could alter the definition of "whitespace"
       to include Perlish comments, by adding an explicit "<token: ws>":

           <token: ws>
               (?: \s+ | \#[^\n]* )*

       But be careful not to define "<ws>" as a rule, as this will lead to all
       kinds of infinitely recursive unpleasantness.

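       A short sketch of the effect, assuming Regexp::Grammars is installed:
       the two grammars below differ only in the explicit "<token: ws>", and
       the "assignment" rule, the "name" and "value" aliases, and the sample
       text are all invented for the illustration:

```perl
use strict;
use warnings;
use Regexp::Grammars;   # assumed to be installed from CPAN

# Default whitespace: comments in the input are not skipped.
my $plain = qr{
    \A <assignment> \z

    <rule: assignment>
        <name=(\w+)> = <value=(\d+)>
}x;

# Redefined whitespace: spaces and comments to end of line are both skipped.
my $with_comments = qr{
    \A <assignment> \z

    <rule: assignment>
        <name=(\w+)> = <value=(\d+)>

    <token: ws>
        (?: \s+ | \#[^\n]* )*
}x;

my $source = "answer   # the variable\n  =      # is assigned\n  42";

my $plain_matched = $source =~ $plain ? 1 : 0;

my %result;
if ($source =~ $with_comments) {
    %result = %{ $/{assignment} };
    print "$result{name} is $result{value}\n";
}
print 'default <ws> matched: ', ($plain_matched ? 'yes' : 'no'), "\n";
```

       Note that the "#" in the token is backslashed: under "/x" an
       unescaped "#" would start a regex comment instead of matching one in
       the input.
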
       Per-rule whitespace handling

       Redefining the "<ws>" token changes its behaviour throughout the entire
       grammar, within every rule definition. Usually that's appropriate, but
       sometimes you need finer-grained control over whitespace handling.

       So Regexp::Grammars provides the "<ws:>" directive, which allows you to
       override the implicit whitespace-matches-whitespace behaviour only
       within the current rule.

       Note that this directive does not redefine "<ws>" within the rule; it
       simply specifies what to replace each whitespace sequence with (instead
       of replacing each with a "<ws>" call).

       For example, if a language allows one kind of comment between
       statements and another within statements, you could parse it with:

           <rule: program>
               # One type of comment between...
               <ws: (\s++ | \# .*? \n)* >

               # ...semicolon-separated statements...
               <[statement]>+ % ( ; )


           <rule: statement>
               # Another type of comment...
               <ws: (\s*+ | \#{ .*? }\# )* >

               # ...between comma-separated commands...
               <cmd>  <[arg]>+ % ( , )

       Note that each directive only applies to the rule in which it is
       specified. In every other rule in the grammar, whitespace would still
       match the usual "<ws>" subrule.

   Calling subrules
       To invoke a rule to match at any point, just enclose the rule's name in
       angle brackets (like in Perl 6). There must be no space between the
       opening bracket and the rulename. For example:

           qr{
               file:             # Match literal sequence 'f' 'i' 'l' 'e' ':'
               <name>            # Call <rule: name>
               <options>?        # Call <rule: options> (it's okay if it fails)

               <rule: name>
                   # etc.
           }x;

       If you need to match a literal pattern that would otherwise look like a
       subrule call, just backslash-escape the leading angle:

           qr{
               file:             # Match literal sequence 'f' 'i' 'l' 'e' ':'
               \<name>           # Match literal sequence '<' 'n' 'a' 'm' 'e' '>'
               <options>?        # Call <rule: options> (it's okay if it fails)

               <rule: name>
                   # etc.
           }x;

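       The two forms can be combined in one pattern. The sketch below assumes
       Regexp::Grammars is installed; the "name" token body and the sample
       strings are hypothetical. It requires the literal text "<name>" in
       the input and then calls the subrule of the same name:

```perl
use strict;
use warnings;
use Regexp::Grammars;   # assumed to be installed from CPAN

my $parser = qr{
    \A
    file:    \s*
    \<name>  \s*     # literal text, not a subrule call
    <name>           # call to the token defined below
    \z

    <token: name>
        [\w.]+
}x;

# Record the captured name, or undef when the literal marker is missing.
my %captured;
for my $text ('file: <name> report.txt', 'file: report.txt') {
    $captured{$text} = $text =~ $parser ? $/{name} : undef;
    print "$text => ", (defined $captured{$text} ? $captured{$text} : 'no match'), "\n";
}
```

       Because the start-pattern is treated like a token, the optional
       spaces between its components have to be matched explicitly with
       "\s*".
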
   Subrule results
       If a subrule call successfully matches, the result of that match is a
       reference to a hash. That hash reference is stored in the current
       rule's own result-hash, under the name of the subrule that was invoked.
       The hash will, in turn, contain the results of any more deeply nested
       subrule calls, each stored under the name by which the nested subrule
       was invoked.

       In other words, if the rule "sentence" is defined:

           <rule: sentence>
               <noun> <verb> <object>

       then successfully calling the rule:

           <sentence>

       causes a new hash entry at the current nesting level. That entry's key
       will be 'sentence' and its value will be a reference to a hash, which
       in turn will have keys: 'noun', 'verb', and 'object'.

       In addition each result-hash has one extra key: the empty string. The
       value for this key is whatever substring the entire subrule call
       matched.  This value is known as the context substring.

       So, for example, a successful call to "<sentence>" might add something
       like the following to the current result-hash:

           sentence => {
               ""     => 'I saw a dog',
               noun   => 'I',
               verb   => 'saw',
               object => {
                   ""      => 'a dog',
                   article => 'a',
                   noun    => 'dog',
               },
           }

       Note, however, that if the result-hash at any level contains only the
       empty-string key (i.e. the subrule did not call any sub-subrules or
       save any of their nested result-hashes), then the hash is "unpacked"
       and just the context substring itself is returned.

       For example, if "<rule: sentence>" had been defined:

           <rule: sentence>
               I see dead people

       then a successful call to the rule would only add:

           sentence => 'I see dead people'

       to the current result-hash.

       This is a useful feature because it prevents a series of nested subrule
       calls from producing very unwieldy data structures. For example,
       without this automatic unpacking, even the simple earlier example:

           <rule: sentence>
               <noun> <verb> <object>

       would produce something needlessly complex, such as:

           sentence => {
               ""     => 'I saw a dog',
               noun   => {
                   "" => 'I',
               },
               verb   => {
                   "" => 'saw',
               },
               object => {
                   ""      => 'a dog',
                   article => {
                       "" => 'a',
                   },
                   noun    => {
                       "" => 'dog',
                   },
               },
           }

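       The unpacking can be checked by inspecting the reference type of each
       entry. This is a minimal sketch assuming Regexp::Grammars is
       installed; the token bodies are invented so that the grammar accepts
       the sample sentence used above:

```perl
use strict;
use warnings;
use Regexp::Grammars;   # assumed to be installed from CPAN

my $parser = qr{
    <sentence>

    <rule: sentence>
        <noun> <verb> <object>

    <rule: object>
        <article> <noun>

    <token: noun>     I | dog
    <token: verb>     saw
    <token: article>  a
}x;

my %sentence;
if ('I saw a dog' =~ $parser) {
    %sentence = %{ $/{sentence} };

    # Subrules that called other subrules yield hashes...
    print 'object is a ', (ref $sentence{object} || 'plain string'), "\n";

    # ...whereas leaf subrules are unpacked to their context substring.
    print 'noun is a ',   (ref $sentence{noun}   || 'plain string'), "\n";
    print "context: $sentence{''}\n";
}
```

       The "object" entry stays a hash because its rule called "<article>"
       and "<noun>", while the leaf tokens collapse to plain strings.
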
       Turning off the context substring

       The context substring is convenient for debugging and for generating
       error messages but, in a large grammar, or when parsing a long string,
       the capture and storage of many nested substrings may quickly become
       prohibitively expensive.

       So Regexp::Grammars provides a directive to prevent context substrings
       from being retained. Any rule or token that includes the directive
       "<nocontext:>" anywhere in the rule's body will not retain any context
       substring it matches...unless that substring would be the only entry in
       its result hash (which only happens within objrules and objtokens).

       If a "<nocontext:>" directive appears before the first rule or token
       definition (i.e. as part of the main pattern), then the entire grammar
       will discard all context substrings from every one of its rules and
       tokens.

       However, you can override this universal prohibition with a second
       directive: "<context:>". If this directive appears in any rule or
       token, that rule or token will save its context substring, even if a
       global "<nocontext:>" is in effect.

       This means that this grammar:

           qr{
               <Command>

               <rule: Command>
                   <nocontext:>
                   <Keyword> <arg=(\S+)>+ % <.ws>

               <token: Keyword>
                   <Move> | <Copy> | <Delete>

               # etc.
           }x

       and this grammar:

           qr{
               <nocontext:>
               <Command>

               <rule: Command>
                   <Keyword> <arg=(\S+)>+ % <.ws>

               <token: Keyword>
                   <context:>
                   <Move> | <Copy> | <Delete>

               # etc.
           }x

       will behave identically (saving context substrings for keywords, but
       not for commands), except that the first version will also retain the
       global context substring (i.e. $/{""}), whereas the second version will
       not.

       Note that "<context:>" and "<nocontext:>" have no effect on, or even
       any interaction with, the various result distillation mechanisms, which
       continue to work in the usual way when either or both of the directives
       is used.

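       One way to confirm the effect of a global "<nocontext:>" is to test
       for the empty-string key after a match. The sketch below assumes
       Regexp::Grammars is installed and that a discarded context substring
       leaves no defined value under that key; the "pair" rule and the
       %has_context hash are illustrative names:

```perl
use strict;
use warnings;
use Regexp::Grammars;   # assumed to be installed from CPAN

# Default behaviour: every result-hash keeps its context substring.
my $with_context = qr{
    <pair>

    <rule: pair>
        <key=(\w+)> = <value=(\d+)>
}x;

# Global <nocontext:> in the start-pattern: no context substrings are kept.
my $without_context = qr{
    <nocontext:>
    <pair>

    <rule: pair>
        <key=(\w+)> = <value=(\d+)>
}x;

my %has_context;
if ('size = 42' =~ $with_context) {
    $has_context{default} = defined $/{pair}{''} ? 1 : 0;
}
if ('size = 42' =~ $without_context) {
    $has_context{nocontext} = defined $/{pair}{''} ? 1 : 0;
}

for my $mode (sort keys %has_context) {
    print "$mode: context substring ", ($has_context{$mode} ? 'kept' : 'dropped'), "\n";
}
```

       The named entries ('key' and 'value') are unaffected by the
       directive; only the context substring is omitted.
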
   Renaming subrule results
       It is not always convenient to have subrule results stored under the
       same name as the rule itself. Rule names should be optimized for
       understanding the behaviour of the parser, whereas result names should
       be optimized for understanding the structure of the data. Often those
       two goals are identical, but not always; sometimes rule names need to
       describe what the data looks like, while result names need to describe
       what the data means.

       For example, sometimes you need to call the same rule twice, to match
       two syntactically identical components whose positions give them
       semantically distinct meanings:

           <rule: copy_cmd>
               copy <file> <file>

       The problem here is that, if the second call to "<file>" succeeds, its
       result-hash will be stored under the key 'file', clobbering the data
       that was returned from the first call to "<file>".

       To avoid such problems, Regexp::Grammars allows you to alias any
       subrule call, so that it is still invoked by the original name, but its
       result-hash is stored under a different key. The syntax for that is:
       "<alias=rulename>". For example:

           <rule: copy_cmd>
               copy <from=file> <to=file>

       Here, "<rule: file>" is called twice, with the first result-hash being
       stored under the key 'from', and the second result-hash being stored
       under the key 'to'.

       Note, however, that the alias before the "=" must be a proper
       identifier (i.e. a letter or underscore, followed by letters, digits,
       and/or underscores). Aliases that start with an underscore and aliases
       named "MATCH" have special meaning (see "Private subrule calls" and
       "Result distillation" respectively).

       Aliases can also be useful for normalizing data that may appear in
       different formats and sequences. For example:

           <rule: copy_cmd>
               copy <from=file>        <to=file>
             | dup    <to=file>  as  <from=file>
             |      <from=file>  ->    <to=file>
             |        <to=file>  <-  <from=file>

       Here, regardless of which order the old and new files are specified,
       the result-hash always gets:

           copy_cmd => {
               from => 'oldfile',
                 to => 'newfile',
           }

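       A runnable sketch of that normalization, assuming Regexp::Grammars is
       installed. It keeps only the first two alternatives, and the "file"
       token body and the sample commands are invented for the illustration:

```perl
use strict;
use warnings;
use Regexp::Grammars;   # assumed to be installed from CPAN

my $parser = qr{
    \A <copy_cmd> \z

    <rule: copy_cmd>
          copy <from=file>      <to=file>
        | dup    <to=file>  as  <from=file>

    <token: file>
        [\w.]+
}x;

# Both spellings should produce the same from/to pair.
my %parsed;
for my $command ('copy old.txt new.txt', 'dup new.txt as old.txt') {
    next unless $command =~ $parser;
    $parsed{$command} = { from => $/{copy_cmd}{from}, to => $/{copy_cmd}{to} };
    print "$command => from $parsed{$command}{from} to $parsed{$command}{to}\n";
}
```

       Code that consumes the result never needs to know which syntax the
       user typed; it reads 'from' and 'to' in either case.
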
690   List-like subrule calls
691       If a subrule call is quantified with a repetition specifier:
692
693           <rule: file_sequence>
694               <file>+
695
696       then each repeated match overwrites the corresponding entry in the
697       surrounding rule's result-hash, so only the result of the final
698       repetition will be retained. That is, if the above example matched the
699       string "foo.pl bar.py baz.php", then the result-hash would contain:
700
701           file_sequence {
702               ""   => 'foo.pl bar.py baz.php',
703               file => 'baz.php',
704           }
705
706       Usually, that's not the desired outcome, so Regexp::Grammars provides
707       another mechanism by which to call a subrule; one that saves all
708       repetitions of its results.
709
710       A regular subrule call consists of the rule's name surrounded by angle
711       brackets. If, instead, you surround the rule's name with "<[...]>"
712       (angle and square brackets) like so:
713
714           <rule: file_sequence>
715               <[file]>+
716
717       then the rule is invoked in exactly the same way, but the result of
718       that submatch is pushed onto an array nested inside the appropriate
719       result-hash entry. In other words, if the above example matched the
720       same "foo.pl bar.py baz.php" string, the result-hash would contain:
721
722           file_sequence => {
723               ""   => 'foo.pl bar.py baz.php',
724               file => [ 'foo.pl', 'bar.py', 'baz.php' ],
725           }
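
       A minimal sketch of a listifying call follows. To keep it independent
       of implicit whitespace handling, it assumes a token with an explicit
       repetition group rather than the rule shown above; the "file" token
       and the parse_files() helper are likewise hypothetical:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Assumption: whitespace between files is matched explicitly, so the
# sequence is declared as a token instead of a rule.
my $sequence_parser = qr{
    \A <file_sequence> \z

    <token: file_sequence>
        (?: <[file]> \s* )+

    <token: file>
        [\w.]+
}xms;

# Returns an array ref of every file matched, or undef on failure.
sub parse_files {
    my ($input) = @_;
    return $input =~ $sequence_parser ? $/{file_sequence}{file} : undef;
}

my $files = parse_files('foo.pl bar.py baz.php');
print join(', ', @{$files}), "\n" if $files;
```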
726
727       This "listifying subrule call" can also be useful for non-repeated
728       subrule calls, if the same subrule is invoked in several places in a
729       grammar. For example, if a command-line option could be given either
730       one or two values, you might parse it:
731
732           <rule: size_option>
733               -size <[size]> (?: x <[size]> )?
734
735       The result-hash entry for 'size' would then always contain an array,
736       with either one or two elements, depending on the input being parsed.
737
738       Listifying subrules can also be given aliases, just like ordinary
739       subrules. The alias is always specified inside the square brackets:
740
741           <rule: size_option>
742               -size <[size=pos_integer]> (?: x <[size=pos_integer]> )?
743
744       Here, the sizes are parsed using the "pos_integer" rule, but saved in
745       the result-hash in an array under the key 'size'.
746
747   Parametric subrules
748       When a subrule is invoked, it can be passed a set of named arguments
749       (specified as key"=>"value pairs). This argument list is placed in a
750       normal Perl regex code block and must appear immediately after the
751       subrule name, before the closing angle bracket.
752
753       Within the subrule that has been invoked, the arguments can be accessed
754       via the special hash %ARG. For example:
755
756           <rule: block>
757               <tag>
758                   <[block]>*
759               <end_tag(?{ tag=>$MATCH{tag} })>  # ...call subrule with argument
760
761           <token: end_tag>
762               end_ (??{ quotemeta $ARG{tag} })
763
764       Here the "block" rule first matches a "<tag>", and the corresponding
765       substring is saved in $MATCH{tag}. It then matches any number of nested
766       blocks. Finally it invokes the "<end_tag>" subrule, passing it an
767       argument whose name is 'tag' and whose value is the current value of
768       $MATCH{tag} (i.e. the original opening tag).
769
770       When it is thus invoked, the "end_tag" token first matches 'end_', then
771       interpolates the literal value of the 'tag' argument and attempts to
772       match it.
773
774       Any number of named arguments can be passed when a subrule is invoked.
775       For example, we could generalize the "end_tag" rule to allow any prefix
776       (not just 'end_'), and also to allow for 'if...fi'-style reversed tags,
777       like so:
778
779           <rule: block>
780               <tag>
781                   <[block]>*
782               <end_tag (?{ prefix=>'end', tag=>$MATCH{tag} })>
783
784           <token: end_tag>
785               (??{ $ARG{prefix} // q{(?!)} })      # ...prefix as pattern
786               (??{ quotemeta $ARG{tag} })          # ...tag as literal
787             |
788               (??{ quotemeta reverse $ARG{tag} })  # ...reversed tag
789
790       Note that, if you do not need to interpolate values (such as
791       $MATCH{tag}) into a subrule's argument list, you can use simple
792       parentheses instead of "(?{...})", like so:
793
794               <end_tag( prefix=>'end', tag=>'head' )>
795
796       The only types of values you can use in this simplified syntax are
797       numbers and single-quote-delimited strings.  For anything more complex,
798       put the argument list in a full "(?{...})".
799
800       As the earlier examples show, the single most common type of argument
801       is one of the form: IDENTIFIER "=> $MATCH{"IDENTIFIER"}". That is, it's
802       a common requirement to pass an element of %MATCH into a subrule, named
803       with its own key.
804
805       Because this is such a common usage, Regexp::Grammars provides a
806       shortcut. If you use simple parentheses (instead of "(?{...})"
807       parentheses) then instead of a pair, you can specify an argument using
808       a colon followed by an identifier.  This argument is replaced by a
809       named argument whose name is the identifier and whose value is the
810       corresponding item from %MATCH. So, for example, instead of:
811
812               <end_tag(?{ prefix=>'end', tag=>$MATCH{tag} })>
813
814       you can just write:
815
816               <end_tag( prefix=>'end', :tag )>
817
818       Note that, from Perl 5.20 onwards, due to changes in the way that Perl
819       parses regexes, Regexp::Grammars does not support explicitly passing
820       elements of %MATCH as argument values within a list subrule (yeah, it's
821       a very specific and obscure edge-case):
822
823               <[end_tag(?{ prefix=>'end', tag=>$MATCH{tag} })]>   # Does not work
824
825       Note, however, that the shortcut:
826
827               <[end_tag( prefix=>'end', :tag )]>
828
829       still works correctly.
830
831       Accessing subrule arguments more cleanly
832
833       As the preceding examples illustrate, using subrule arguments
834       effectively generally requires the use of run-time interpolated
835       subpatterns via the "(??{...})" construct.
836
837       This produces ugly rule bodies such as:
838
839           <token: end_tag>
840               (??{ $ARG{prefix} // q{(?!)} })      # ...prefix as pattern
841               (??{ quotemeta $ARG{tag} })          # ...tag as literal
842             |
843               (??{ quotemeta reverse $ARG{tag} })  # ...reversed tag
844
845       To simplify these common usages, Regexp::Grammars provides three
846       convenience constructs.
847
848       A subrule call of the form "<:"identifier">" is equivalent to:
849
850           (??{ $ARG{'identifier'} // q{(?!)} })
851
852       Namely: "Match the contents of $ARG{'identifier'}, treating those
853       contents as a pattern."
854
855       A subrule call of the form "<\:"identifier">" (that is: a matchref with
856       a colon after the backslash) is equivalent to:
857
858           (??{ defined $ARG{'identifier'}
859                   ? quotemeta($ARG{'identifier'})
860                   : '(?!)'
861           })
862
863       Namely: "Match the contents of $ARG{'identifier'}, treating those
864       contents as a literal."
865
866       A subrule call of the form "</:"identifier">" (that is: an invertref
867       with a colon after the forward slash) is equivalent to:
868
869           (??{ defined $ARG{'identifier'}
870                   ? quotemeta(reverse $ARG{'identifier'})
871                   : '(?!)'
872           })
873
874       Namely: "Match the closing delimiter corresponding to the contents of
875       $ARG{'identifier'}, as if it were a literal".
876
877       The availability of these three constructs means that we could rewrite
878       the above "<end_tag>" token much more cleanly as:
879
880           <token: end_tag>
881               <:prefix>      # ...prefix as pattern
882               <\:tag>        # ...tag as a literal
883             |
884               </:tag>        # ...reversed tag
885
886       In general these constructs mean that, within a subrule, if you want to
887       match an argument passed to that subrule, you use "<:"ARGNAME">" (to
888       match the argument as a pattern) or "<\:"ARGNAME">" (to match the
889       argument as a literal).
890
891       Note the consistent mnemonic in these various subrule-like
892       interpolations of named arguments: the name is always prefixed by a
893       colon.
894
895       In other words, the "<:ARGNAME>" form works just like a "<RULENAME>",
896       except that the leading colon tells Regexp::Grammars to use the
897       contents of $ARG{'ARGNAME'} as the subpattern, instead of the contents
898       of "(?&RULENAME)".
899
900       Likewise, the "<\:ARGNAME>" and "</:ARGNAME>" constructs work exactly
901       like "<\_MATCHNAME>" and "</INVERTNAME>" respectively, except that the
902       leading colon indicates that the matchref or invertref should be taken
903       from %ARG instead of from %MATCH.
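
       Putting the ":tag" shortcut and the "<\:tag>" construct together, a
       small sketch of a parametric subrule might look like this. The "tag"
       token (which refuses anything starting with 'end_') and the
       parse_block() helper are assumptions made for this illustration:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: <tag> is assumed to be any word that does not
# itself look like a closing tag.
my $block_parser = qr{
    \A <block> \z

    <rule: block>
        <tag>
            <[block]>*
        <end_tag(:tag)>

    <token: tag>
        (?! end_ ) \w+

    <token: end_tag>
        end_ <\:tag>
}xms;

# Returns the block result-hash, or undef if the tags do not nest properly.
sub parse_block {
    my ($input) = @_;
    return $input =~ $block_parser ? $/{block} : undef;
}

for my $input ('foo bar end_bar end_foo', 'foo bar end_foo end_bar') {
    print "$input: ", (parse_block($input) ? 'parsed' : 'rejected'), "\n";
}
```

       The closing tag is never hard-coded: each "end_tag" call matches
       whatever opening tag its own enclosing block passed in.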
904
905   Pseudo-subrules
906       Aliases can also be given to standard Perl subpatterns, as well as to
907       code blocks within a regex. The syntax for subpatterns is:
908
909           <ALIAS= (SUBPATTERN) >
910
911       In other words, the syntax is exactly like an aliased subrule call,
912       except that the rule name is replaced with a set of parentheses
913       containing the subpattern. Any parentheses--capturing or
914       non-capturing--will do.
915
916       The effect of aliasing a standard subpattern is to cause whatever that
917       subpattern matches to be saved in the result-hash, using the alias as
918       its key. For example:
919
920           <rule: file_command>
921
922               <cmd=(mv|cp|ln)>  <from=file>  <to=file>
923
924       Here, the "<cmd=(mv|cp|ln)>" is treated exactly like a regular
925       "(mv|cp|ln)", but whatever substring it matches is saved in the result-
926       hash under the key 'cmd'.
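
       As a rough, runnable sketch of an aliased subpattern (the "file"
       token and the parse_file_command() helper are assumptions added for
       the example):

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $command_parser = qr{
    \A <file_command> \z

    <rule: file_command>
        <cmd=(mv|cp|ln)>  <from=file>  <to=file>

    <token: file>
        [\w.]+
}xms;

# Returns the file_command result-hash, or undef if the input does not parse.
sub parse_file_command {
    my ($input) = @_;
    return $input =~ $command_parser ? $/{file_command} : undef;
}

if (my $result = parse_file_command('cp notes.txt backup.txt')) {
    print "cmd=$result->{cmd} from=$result->{from} to=$result->{to}\n";
}
```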
927
928       The syntax for aliasing code blocks is:
929
930           <ALIAS= (?{ your($code->here) }) >
931
932       Note, however, that the code block must be specified in the standard
933       Perl 5.10 regex notation: "(?{...})". A common mistake is to write:
934
935           <ALIAS= { your($code->here) } >
936
937       instead, which will attempt to interpolate $code before the regex is
938       even compiled, as such variables are only "protected" from
939       interpolation inside a "(?{...})".
940
941       When correctly specified, this construct executes the code in the block
942       and saves the result of that execution in the result-hash, using the
943       alias as its key. Aliased code blocks are useful for adding semantic
944       information based on which branch of a rule is executed. For example,
945       consider the "copy_cmd" alternatives shown earlier:
946
947           <rule: copy_cmd>
948               copy <from=file>        <to=file>
949             | dup    <to=file>  as  <from=file>
950             |      <from=file>  ->    <to=file>
951             |        <to=file>  <-  <from=file>
952
953       Using aliased code blocks, you could add an extra field to the result-
954       hash to describe which form of the command was detected, like so:
955
956           <rule: copy_cmd>
957               copy <from=file>        <to=file>  <type=(?{ 'std' })>
958             | dup    <to=file>  as  <from=file>  <type=(?{ 'rev' })>
959             |      <from=file>  ->    <to=file>  <type=(?{  +1   })>
960             |        <to=file>  <-  <from=file>  <type=(?{  -1   })>
961
962       Now, if the rule matched, the result-hash would contain something like:
963
964           copy_cmd => {
965               from => 'oldfile',
966                 to => 'newfile',
967               type => 'std',
968           }
969
970       Note that, in addition to the semantics described above, aliased
971       subpatterns and code blocks also become visible to Regexp::Grammars'
972       integrated debugger (see Debugging).
973
974   Aliased literals
975       As the previous example illustrates, it is inconveniently verbose to
976       assign constants via aliased code blocks. So Regexp::Grammars provides
977       a short-cut. It is possible to directly alias a numeric literal or a
978       single-quote delimited literal string, without putting either inside a
979       code block. For example, the previous example could also be written:
980
981           <rule: copy_cmd>
982               copy <from=file>        <to=file>  <type='std'>
983             | dup    <to=file>  as  <from=file>  <type='rev'>
984             |      <from=file>  ->    <to=file>  <type= +1  >
985             |        <to=file>  <-  <from=file>  <type= -1  >
986
987       Note that only these two forms of literal are supported in this
988       abbreviated syntax.
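
       A brief sketch of aliased string literals in use is shown below. It
       covers only the two string-valued alternatives; the "file" token and
       the parse_typed_copy() helper are hypothetical:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $typed_copy_parser = qr{
    \A <copy_cmd> \z

    <rule: copy_cmd>
        copy <from=file>        <to=file>  <type='std'>
      | dup    <to=file>  as  <from=file>  <type='rev'>

    <token: file>
        [\w.]+
}xms;

# Returns the copy_cmd result-hash, including the 'type' entry set by
# whichever alternative matched, or undef if nothing matched.
sub parse_typed_copy {
    my ($input) = @_;
    return $input =~ $typed_copy_parser ? $/{copy_cmd} : undef;
}

for my $input ('copy old.txt new.txt', 'dup new.txt as old.txt') {
    my $result = parse_typed_copy($input) or next;
    print "type=$result->{type} from=$result->{from} to=$result->{to}\n";
}
```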
989
990   Amnesiac subrule calls
991       By default, every subrule call saves its result into the result-hash,
992       either under its own name, or under an alias.
993
994       However, sometimes you may want to refactor some literal part of a rule
995       into one or more subrules, without having those submatches added to the
996       result-hash. The syntax for calling a subrule, but ignoring its return
997       value is:
998
999           <.SUBRULE>
1000
1001       (which is stolen directly from Perl 6).
1002
1003       For example, you may prefer to rewrite a rule such as:
1004
1005           <rule: paren_pair>
1006
1007               \(
1008                   (?: <escape> | <paren_pair> | <brace_pair> | [^()] )*
1009               \)
1010
1011       without any literal matching, like so:
1012
1013           <rule: paren_pair>
1014
1015               <.left_paren>
1016                   (?: <escape> | <paren_pair> | <brace_pair> | <.non_paren> )*
1017               <.right_paren>
1018
1019           <token: left_paren>   \(
1020           <token: right_paren>  \)
1021           <token: non_paren>    [^()]
1022
1023       Moreover, as the individual components inside the parentheses probably
1024       aren't being captured for any useful purpose either, you could further
1025       optimize that to:
1026
1027           <rule: paren_pair>
1028
1029               <.left_paren>
1030                   (?: <.escape> | <.paren_pair> | <.brace_pair> | <.non_paren> )*
1031               <.right_paren>
1032
1033       Note that you can also use the dot modifier on an aliased subpattern:
1034
1035           <.Alias= (SUBPATTERN) >
1036
1037       This seemingly contradictory behaviour (of giving a subpattern a name,
1038       then deliberately ignoring that name) actually does make sense in one
1039       situation. Providing the alias makes the subpattern visible to the
1040       debugger, while using the dot stops it from affecting the result-hash.
1041       See "Debugging non-grammars" for an example of this usage.
1042
1043   Private subrule calls
1044       If a rule name (or an alias) begins with an underscore:
1045
1046            <_RULENAME>       <_ALIAS=RULENAME>
1047           <[_RULENAME]>     <[_ALIAS=RULENAME]>
1048
1049       then matching proceeds as normal, and any result that is returned is
1050       stored in the current result-hash in the usual way.
1051
1052       However, when any rule finishes (and just before it returns) it first
1053       filters its result-hash, removing any entries whose keys begin with an
1054       underscore. This means that any subrule with an underscored name (or
1055       with an underscored alias) remembers its result, but only until the end
1056       of the current rule. Its results are effectively private to the current
1057       rule.
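
       The filtering step can be observed with a sketch such as the
       following, in which every rule, token, and helper name is an
       assumption chosen for illustration:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $assignment_parser = qr{
    \A <assignment> \z

    <rule: assignment>
        <_kw=keyword> <name=ident> = <value=number>

    <token: keyword>
        (?: let | var )

    <token: ident>
        [^\W\d]\w*

    <token: number>
        \d+
}xms;

# Returns the assignment result-hash, or undef if the input does not parse.
sub parse_assignment {
    my ($input) = @_;
    return $input =~ $assignment_parser ? $/{assignment} : undef;
}

if (my $result = parse_assignment('let x = 42')) {
    print "name=$result->{name} value=$result->{value}\n";
    print "private entry kept? ", (exists $result->{_kw} ? 'yes' : 'no'), "\n";
}
```

       The keyword still has to match for the rule to succeed; it simply
       leaves no trace in the result-hash that the caller receives.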
1058
1059       This is especially useful in conjunction with result distillation.
1060
1061   Lookahead (zero-width) subrules
1062       Non-capturing subrule calls can be used in normal lookaheads:
1063
1064           <rule: qualified_typename>
1065               # A valid typename and has a :: in it...
1066               (?= <.typename> )  [^\s:]+ :: \S+
1067
1068           <rule: identifier>
1069               # An alpha followed by alnums (but not a valid typename)...
1070               (?! <.typename> )    [^\W\d]\w*
1071
1072       but the syntax is a little unwieldy. More importantly, an internal
1073       problem with backtracking causes positive lookaheads to mess up the
1074       module's named capturing mechanism.
1075
1076       So Regexp::Grammars provides two shorthands:
1077
1078           <!typename>        same as: (?! <.typename> )
1079           <?typename>        same as: (?= <.typename> ) ...but works correctly!
1080
1081       These two constructs can also be called with arguments, if necessary:
1082
1083           <rule: Command>
1084               <Keyword>
1085               (?:
1086                   <!Terminator(:Keyword)>  <Args=(\S+)>
1087               )?
1088               <Terminator(:Keyword)>
1089
1090       Note that, as the above equivalences imply, neither of these forms of a
1091       subrule call ever captures what it matches.
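
       As an informal sketch of the negative form, the grammar below accepts
       an identifier only if it is not a reserved word. The "keyword" token
       and the is_identifier() helper are assumptions for this example:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $identifier_parser = qr{
    \A <identifier> \z

    <token: identifier>
        <!keyword> [^\W\d]\w*

    <token: keyword>
        (?: if | while | for ) \b
}xms;

# True if the whole input is an identifier that is not a keyword.
sub is_identifier {
    my ($input) = @_;
    return $input =~ $identifier_parser ? 1 : 0;
}

for my $word (qw( counter while format )) {
    print "$word: ", (is_identifier($word) ? 'identifier' : 'rejected'), "\n";
}
```

       Because the "keyword" token ends in "\b", a word such as 'format'
       that merely begins with a keyword is still treated as an identifier.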
1092
1093   Matching separated lists
1094       One of the commonest tasks in text parsing is to match a list of
1095       unspecified length, in which items are separated by a fixed token.
1096       Things like:
1097
1098           1, 2, 3 , 4 ,13, 91        # Numbers separated by commas and spaces
1099
1100           g-c-a-g-t-t-a-c-a          # DNA bases separated by dashes
1101
1102           /usr/local/bin             # Names separated by directory markers
1103
1104           /usr:/usr/local:bin        # Directories separated by colons
1105
1106       The usual construct required to parse these kinds of structures is
1107       either:
1108
1109           <rule: list>
1110
1111               <item> <separator> <list>     # recursive definition
1112             | <item>                        # base case
1113
1114       or, if you want to allow zero-or-more items instead of requiring one-
1115       or-more:
1116
1117           <rule: list_opt>
1118               <list>?                       # entire list may be missing
1119
1120           <rule: list>                      # as before...
1121               <item> <separator> <list>     #   recursive definition
1122             | <item>                        #   base case
1123
1124       Or, more efficiently, but less prettily:
1125
1126           <rule: list>
1127               <[item]> (?: <separator> <[item]> )*           # one-or-more
1128
1129           <rule: list_opt>
1130               (?: <[item]> (?: <separator> <[item]> )* )?    # zero-or-more
1131
1132       Because separated lists are such a common component of grammars,
1133       Regexp::Grammars provides cleaner ways to specify them:
1134
1135           <rule: list>
1136               <[item]>+ % <separator>      # one-or-more
1137
1138           <rule: list_zom>
1139               <[item]>* % <separator>      # zero-or-more
1140
1141       Note that these are just regular repetition qualifiers (i.e. "+" and
1142       "*") applied to a subrule ("<[item]>"), with a "%" modifier after them
1143       to specify the required separator between the repeated matches.
1144
1145       The number of repetitions matched is controlled both by the nature of
1146       the qualifier ("+" vs "*") and by the subrule specified after the "%".
1147       The qualified subrule will be repeatedly matched for as long as its
1148       qualifier allows, provided that the second subrule also matches between
1149       those repetitions.
1150
1151       For example, you can match a parenthesized sequence of one-or-more
1152       numbers separated by commas, such as:
1153
1154           (1, 2, 3, 4, 13, 91)        # Numbers separated by commas (and spaces)
1155
1156       with:
1157
1158           <rule: number_list>
1159
1160               \(  <[number]>+ % <comma>  \)
1161
1162           <token: number>  \d+
1163           <token: comma>   ,
1164
1165       Note that any spaces round the commas will be ignored because
1166       "<number_list>" is specified as a rule and the "+%" specifier has
1167       spaces within and around it. To disallow spaces around the commas, make
1168       sure there are no spaces in or around the "+%":
1169
1170           <rule: number_list_no_spaces>
1171
1172               \( <[number]>+%<comma> \)
1173
1174       (or else specify the rule as a token instead).
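
       The "number_list" rule can be tried out with a short script along
       these lines. Only the anchors and the parse_numbers() helper are
       additions made for the sketch:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $list_parser = qr{
    \A <number_list> \z

    <rule: number_list>
        \(  <[number]>+ % <comma>  \)

    <token: number>  \d+
    <token: comma>   ,
}xms;

# Returns an array ref of the numbers in the list, or undef on failure.
sub parse_numbers {
    my ($input) = @_;
    return $input =~ $list_parser ? $/{number_list}{number} : undef;
}

my $numbers = parse_numbers('(1, 2, 3, 4, 13, 91)');
print join(' ', @{$numbers}), "\n" if $numbers;
```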
1175
1176       Because the "%" is a modifier applied to a qualifier, you can modify
1177       any other repetition qualifier in the same way. For example:
1178
1179           <[item]>{2,4} % <sep>   # two-to-four items, separated
1180
1181           <[item]>{7}   % <sep>   # exactly 7 items, separated
1182
1183           <[item]>{10,}? % <sep>   # 10 or more items (as few as possible), separated
1184
1185       You can even do this:
1186
1187           <[item]>? % <sep>       # one-or-zero items, (theoretically) separated
1188
1189       though the separator specification is, of course, meaningless in that
1190       case as it will never be needed to separate a maximum of one item.
1191
1192       If a "%" appears anywhere else in a grammar (i.e. not immediately after
1193       a repetition qualifier), it is treated normally (i.e. as a self-
1194       matching literal character):
1195
1196           <token: perl_hash>
1197               % <ident>                # match "%foo", "%bar", etc.
1198
1199           <token: perl_mod>
1200               <expr> % <expr>          # match "$n % 2", "($n+3) % ($n-1)", etc.
1201
1202       If you need to match a literal "%" immediately after a repetition,
1203       either quote it:
1204
1205           <token: percentage>
1206               \d{1,3} \% solution                  # match "7% solution", etc.
1207
1208       or refactor the "%" character:
1209
1210           <token: percentage>
1211               \d{1,3} <percent_sign> solution      # match "7% solution", etc.
1212
1213           <token: percent_sign>
1214               %
1215
1216       Note that it's usually necessary to use the "<[...]>" form for the
1217       repeated items being matched, so that all of them are saved in the
1218       result hash. You can also save all the separators (if they're
1219       important) by specifying them as a list-like subrule too:
1220
1221           \(  <[number]>* % <[comma]>  \)  # save numbers *and* separators
1222
1223       The repeated item must be specified as a subrule call of some kind
1224       (i.e. in angles), but the separators may be specified either as a
1225       subrule or as a raw bracketed pattern. For example:
1226
1227           <[number]>* % ( , | : )    # Numbers separated by commas or colons
1228
1229           <[number]>* % [,:]         # Same, but more efficiently matched
1230
1231       The separator should always be specified within matched delimiters of
1232       some kind: either matching "<...>" or matching "(...)" or matching
1233       "[...]". Simple, non-bracketed separators will sometimes also work:
1234
1235           <[number]>+ % ,
1236
1237       but not always:
1238
1239           <[number]>+ % ,\s+     # Oops! Separator is just: ,
1240
1241       This is because of the limited way in which the module internally
1242       parses ordinary regex components (i.e. without full understanding of
1243       their implicit precedence). As a consequence, consistently placing
1244       brackets around any separator is a much safer approach:
1245
1246           <[number]>+ % (,\s+)
1247
1248       You can also use a simple pattern on the left of the "%" as the item
1249       matcher, but in this case it must always be aliased into a list-
1250       collecting subrule, like so:
1251
1252           <[item=(\d+)]>* % [,]
1253
1254       Note that, for backwards compatibility with earlier versions of
1255       Regexp::Grammars, the "+%" operator can also be written: "**".
1256       However, there can be no space between the two asterisks of this
1257       variant. That is:
1258
1259           <[item]> ** <sep>      # same as <[item]>+ % <sep>
1260
1261           <[item]>* * <sep>      # error (two * qualifiers in a row)
1262
1263       Matching separated lists with a trailing separator
1264
1265       Some languages allow a separated list to include an extra trailing
1266       separator. For example:
1267
1268           ~/bin/perl5/        # Trailing /-separator in filepath
1269           (1,2,3,)            # Trailing ,-separator in Perl list
1270
1271       To match such constructs using the "%" operator, you would need to add
1272       something to explicitly match the optional trailing separator:
1273
1274           <dir>+ % [/] [/]?    # Slash-separated dirs, then optional final slash
1275
1276           <elem>+ % [,] [,]?   # Comma-separated elems, then optional final comma
1277
1278       which is tedious.
1279
1280       So the module also supports a second kind of "separated list" operator,
1281       that allows an optional trailing separator as well: the "%%" operator.
1282       This operator behaves exactly like the "%" operator, except that it
1283       also matches a final trailing separator, if one is present.
1284
1285       So the previous examples could be (better) written as:
1286
1287           <dir>+ %% [/]     # Slash-separated dirs, with optional final slash
1288
1289           <elem>+ %% [,]    # Comma-separated elems, with optional final comma
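
       A small sketch of the "%%" operator, assuming a hypothetical
       "elem_list" rule whose elements are collected with "<[elem]>" so that
       all of them are kept:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $elem_parser = qr{
    \A <elem_list> \z

    <rule: elem_list>
        \(  <[elem]>+ %% [,]  \)

    <token: elem>
        \d+
}xms;

# Returns an array ref of the elements, or undef on failure.
sub parse_elems {
    my ($input) = @_;
    return $input =~ $elem_parser ? $/{elem_list}{elem} : undef;
}

for my $input ('(1,2,3)', '(1,2,3,)') {
    my $elems = parse_elems($input) or next;
    print "$input has ", scalar @{$elems}, " elements\n";
}
```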
1290
1291   Matching hash keys
1292       In some situations a grammar may need a rule that matches dozens,
1293       hundreds, or even thousands of one-word alternatives. For example, when
1294       matching command names, or valid userids, or English words. In such
1295       cases it is often impractical (and always inefficient) to list all the
1296       alternatives between "|" alternators:
1297
1298           <rule: shell_cmd>
1299               a2p | ac | apply | ar | automake | awk | ...
1300               # ...and 400 lines later
1301               ... | zdiff | zgrep | zip | zmore | zsh
1302
1303           <rule: valid_word>
1304               a | aa | aal | aalii | aam | aardvark | aardwolf | aba | ...
1305               # ...and 40,000 lines later...
1306               ... | zymotize | zymotoxic | zymurgy | zythem | zythum
1307
1308       To simplify such cases, Regexp::Grammars provides a special construct
1309       that allows you to specify all the alternatives as the keys of a normal
1310       hash. The syntax for that construct is simply to put the hash name
1311       inside angle brackets (with no space between the angles and the hash
1312       name).
1313
1314       Which means that the rules in the previous example could also be
1315       written:
1316
1317           <rule: shell_cmd>
1318               <%cmds>
1319
1320           <rule: valid_word>
1321               <%dict>
1322
1323       provided that the two hashes (%cmds and %dict) are visible in the scope
1324       where the grammar is created.
1325
1326       Matching a hash key in this way is typically significantly faster than
1327       matching a large set of alternations. Specifically, it is O(length of
1328       longest potential key) ^ 2, instead of O(number of keys).
1329
1330       Internally, the construct is converted to something equivalent to:
1331
1332           <rule: shell_cmd>
1333               (<.hk>)  <require: (?{ exists $cmds{$CAPTURE} })>
1334
1335           <rule: valid_word>
1336               (<.hk>)  <require: (?{ exists $dict{$CAPTURE} })>
1337
1338       The special "<hk>" rule is created automatically, and defaults to
1339       "\S+", but you can also define it explicitly to handle other kinds of
1340       keys. For example:
1341
1342           <rule: hk>
1343               [^\n]+        # Key may be any number of chars on a single line
1344
1345           <rule: hk>
1346               [ACGT]{10,}   # Key is a base sequence of at least 10 pairs
1347
1348       Alternatively, you can specify a different key-matching pattern for
1349       each hash you're matching, by placing the required pattern in braces
1350       immediately after the hash name. For example:
1351
1352           <rule: client_name>
1353               # Valid keys match <.hk> (default or explicitly specified)
1354               <%clients>
1355
1356           <rule: shell_cmd>
1357               # Valid keys contain only word chars, hyphen, slash, or dot...
1358               <%cmds { [\w-/.]+ }>
1359
1360           <rule: valid_word>
1361               # Valid keys contain only alphas or internal hyphen or apostrophe...
1362               <%dict{ (?i: (?:[a-z]+[-'])* [a-z]+ ) }>
1363
1364           <rule: DNA_sequence>
1365               # Valid keys are base sequences of at least 10 pairs...
1366               <%sequences{[ACGT]{10,}}>
1367
1368       This second approach to key-matching is preferred, because it localizes
1369       any non-standard key-matching behaviour to each individual hash.
1370
1371       Note that changes in the compilation process from Perl 5.18 onwards
1372       mean that in some cases the "<%hash>" construct only works reliably if
1373       the hash itself is declared at the outermost lexical scope (i.e. file
1374       scope).
1375
1376       Specifically, if the regex grammar does not include any interpolated
1377       scalars or arrays and the hash was declared within a subroutine (even
1378       within the same subroutine as the regex grammar that uses it), the
1379       regex will not be able to "see" the hash variable at compile-time. This
1380       will produce a "Global symbol "%hash" requires explicit package name"
1381       compile-time error. For example:
1382
1383           sub build_keyword_parser {
1384               # Hash declared inside subroutine...
1385               my %keywords = (foo => 1, bar => 1);
1386
1387               # ...then used in <%hash> construct within uninterpolated regex...
1388               return qr{
1389                           ^<keyword>$
1390                           <rule: keyword> <%keywords>
1391                        }x;
1392
1393               # ...produces compile-time error
1394           }
1395
1396       The solution is to place the hash outside the subroutine containing the
1397       grammar:
1398
1399           # Hash declared OUTSIDE subroutine...
1400           my %keywords = (foo => 1, bar => 1);
1401
1402           sub build_keyword_parser {
1403               return qr{
1404                           ^<keyword>$
1405                           <rule: keyword> <%keywords>
1406                        }x;
1407           }
1408
1409       ...or else to explicitly interpolate at least one scalar (even just a
1410       scalar containing an empty string):
1411
1412           sub build_keyword_parser {
1413               my %keywords = (foo => 1, bar => 1);
1414               my $DEFER_REGEX_COMPILATION = "";
1415
1416               return qr{
1417                           ^<keyword>$
1418                           <rule: keyword> <%keywords>
1419
1420                           $DEFER_REGEX_COMPILATION
1421                        }x;
1422           }
1423
1424   Rematching subrule results
1425       Sometimes it is useful to be able to rematch a string that has
1426       previously been matched by some earlier subrule. For example, consider
1427       a rule to match shell-like control blocks:
1428
1429           <rule: control_block>
1430                 for   <expr> <[command]>+ endfor
1431               | while <expr> <[command]>+ endwhile
1432               | if    <expr> <[command]>+ endif
1433               | with  <expr> <[command]>+ endwith
1434
1435       This would be much tidier if we could factor out the command names
1436       (which are the only differences between the four alternatives). The
1437       problem is that the obvious solution:
1438
1439           <rule: control_block>
1440               <keyword> <expr>
1441                   <[command]>+
1442               end<keyword>
1443
1444       doesn't work, because it would also match an incorrect input like:
1445
1446           for 1..10
1447               echo $n
1448               ls subdir/$n
1449           endif
1450
1451       We need some way to ensure that the "<keyword>" matched immediately
1452       after "end" is the same "<keyword>" that was initially matched.
1453
1454       That's not difficult, because the first "<keyword>" will have captured
1455       what it matched into $MATCH{keyword}, so we could just write:
1456
1457           <rule: control_block>
1458               <keyword> <expr>
1459                   <[command]>+
1460               end(??{quotemeta $MATCH{keyword}})
1461
1462       This is such a useful technique, yet so ugly, scary, and prone to
1463       error, that Regexp::Grammars provides a cleaner equivalent:
1464
1465           <rule: control_block>
1466               <keyword> <expr>
1467                   <[command]>+
1468               end<\_keyword>
1469
1470       A directive of the form "<\_IDENTIFIER>" is known as a "matchref" (an
1471       abbreviation of "%MATCH-supplied backreference").  Matchrefs always
1472       attempt to match, as a literal, the current value of
1473       $MATCH{IDENTIFIER}.
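
       The "(??{quotemeta ...})" idiom that a matchref abbreviates can be
       tried out in plain Perl, without the module. The following is only a
       sketch of that idiom: it substitutes a core named capture (%+) for
       %MATCH, and the name $control_block and the sample inputs are
       illustrative assumptions, not part of Regexp::Grammars.

```perl
use strict;
use warnings;

# Plain-Perl analogue of:  <keyword> <expr> <[command]>+ end<\_keyword>
my $control_block = qr{
    \A
    (?<keyword> for | while | if | with ) \s+
    (?<body> .*? ) \s+
    end (??{ quotemeta $+{keyword} })      # rematch the saved keyword, literally
    \z
}xs;

my $good = "for 1..10\n    echo \$n\n    ls subdir/\$n\nendfor";
my $bad  = "for 1..10\n    echo \$n\n    ls subdir/\$n\nendif";

print "good: ", ($good =~ $control_block ? "matched" : "rejected"), "\n";
print "bad:  ", ($bad  =~ $control_block ? "matched" : "rejected"), "\n";
```

       The second input is rejected because the text after "end" must equal
       whatever the "keyword" group captured earlier in the same match.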
1474
1475       By default, a matchref does not capture what it matches, but you can
1476       have it do so by giving it an alias:
1477
1478           <token: delimited_string>
1479               <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
1480
1481           <token: str_delim> ["'`]
1482
1483       At first glance this doesn't seem very useful as, by definition,
1484       $MATCH{ldelim} and $MATCH{rdelim} must necessarily always end up with
1485       identical values. However, it can be useful if the rule also has other
1486       alternatives and you want to create a consistent internal
1487       representation for those alternatives, like so:
1488
1489           <token: delimited_string>
1490                 <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
1491               | <ldelim=( \[ )      .*?  <rdelim=( \] )
1492               | <ldelim=( \{ )      .*?  <rdelim=( \} )
1493               | <ldelim=( \( )      .*?  <rdelim=( \) )
1494               | <ldelim=( \< )      .*?  <rdelim=( \> )
1495
1496       You can also force a matchref to save repeated matches as a nested
1497       array, in the usual way:
1498
1499           <token: marked_text>
1500               <marker> <text> <[endmarkers=\_marker]>+
1501
1502       Be careful though, as the following will not do as you may expect:
1503
1504               <[marker]>+ <text> <[endmarkers=\_marker]>+
1505
       because the value of $MATCH{marker} will be an array reference. The
       matchref flattens and concatenates that array, then matches the
       resulting string as a literal. The example above will therefore match
       endmarkers that are exact multiples of the complete start marker,
       rather than endmarkers that consist of any number of repetitions of the
       individual start marker delimiter. So it will match:
1512
1513               ""text here""
1514               ""text here""""
1515               ""text here""""""
1516
1517       but not:
1518
1519               ""text here"""
1520               ""text here"""""
1521
1522       Uneven start and end markers such as these are extremely unusual, so
1523       this problem rarely arises in practice.
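
       The "exact multiples" behaviour can be reproduced in plain Perl. This
       sketch does not use the module: @markers stands in for the array that
       "<[marker]>+" would collect, and the pattern is anchored at both ends
       (an assumption made purely so that a leftover quote shows up as a
       failed match).

```perl
use strict;
use warnings;

my @markers = ('"', '"');             # what <[marker]>+ would have collected
my $literal = join '', @markers;      # flattened and concatenated: ""

# One or more repetitions of the whole concatenated literal.
my $endmarkers = qr/\A (?: \Q$literal\E )+ \z/x;

for my $candidate ('""', '""""', '"""') {
    printf "%-8s %s\n", $candidate,
        $candidate =~ $endmarkers ? 'matches' : 'does not match';
}
```

       Two or four quotes satisfy the pattern; three do not, because the
       repeated unit is the full two-character start marker.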
1524
1525       Note: Prior to Regexp::Grammars version 1.020, the syntax for matchrefs
1526       was "<\IDENTIFIER>" instead of "<\_IDENTIFIER>". This created problems
1527       when the identifier started with any of "l", "u", "L", "U", "Q", or
1528       "E", so the syntax has had to be altered in a backwards incompatible
1529       way. It will not be altered again.
1530
1531   Rematching balanced delimiters
1532       Consider the example in the previous section:
1533
1534           <token: delimited_string>
1535                 <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
1536               | <ldelim=( \[ )      .*?  <rdelim=( \] )
1537               | <ldelim=( \{ )      .*?  <rdelim=( \} )
1538               | <ldelim=( \( )      .*?  <rdelim=( \) )
1539               | <ldelim=( \< )      .*?  <rdelim=( \> )
1540
       The repeated pattern of the last four alternatives is galling, but we
       can't just refactor those delimiters as well:
1543
1544           <token: delimited_string>
1545                 <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
1546               | <ldelim=bracket>    .*?  <rdelim=\_ldelim>
1547
1548       because that would incorrectly match:
1549
1550           { delimited content here {
1551
1552       while failing to match:
1553
1554           { delimited content here }
1555
1556       To refactor balanced delimiters like those, we need a second kind of
1557       matchref; one that's a little smarter.
1558
1559       Or, preferably, a lot smarter...because there are many other kinds of
1560       balanced delimiters, apart from single brackets. For example:
1561
1562             {{{ delimited content here }}}
1563              /* delimited content here */
1564              (* delimited content here *)
1565              `` delimited content here ''
1566              if delimited content here fi
1567
1568       The common characteristic of these delimiter pairs is that the closing
1569       delimiter is the inverse of the opening delimiter: the sequence of
1570       characters is reversed and certain characters (mainly brackets, but
1571       also single-quotes/backticks) are mirror-reflected.
1572
1573       Regexp::Grammars supports the parsing of such delimiters with a
1574       construct known as an invertref, which is specified using the
1575       "</IDENT>" directive. An invertref acts very like a matchref, except
1576       that it does not convert to:
1577
           (??{ quotemeta( $MATCH{IDENT} ) })

       but rather to:

           (??{ quotemeta( inverse( $MATCH{IDENT} ) ) })
1583
1584       With this directive available, the balanced delimiters of the previous
1585       example can be refactored to:
1586
1587           <token: delimited_string>
1588                 <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
1589               | <ldelim=( [[{(<] )  .*?  <rdelim=/ldelim>
1590
1591       Like matchrefs, invertrefs come in the usual range of flavours:
1592
           </ident>            # Match the inverse of $MATCH{ident}
           <ALIAS=/ident>      # Match inverse and capture to $MATCH{ALIAS}
           <[ALIAS=/ident]>    # Match inverse and push on @{$MATCH{ALIAS}}
1596
       The character pairs that are reversed during mirroring are: "{" and
       "}", "[" and "]", "(" and ")", "<" and ">", "«" and "»", "`" and "'".
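
       The reverse-then-mirror operation can be sketched in plain Perl. The
       inverse() function below is a hypothetical stand-in written for this
       illustration (the module's own routine is internal and may differ),
       and it handles only the ASCII pairs listed above. The pattern
       $delimited likewise uses a core named capture in place of %MATCH.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse the delimiter, then mirror each bracket or quote.
sub inverse {
    my ($delim) = @_;
    my $mirrored = reverse $delim;            # scalar context: reverses the string
    $mirrored =~ tr/{}[]()<>`'/}{][)(><'`/;
    return $mirrored;
}

# Plain-Perl analogue of:  <ldelim=( ... )> .*? <rdelim=/ldelim>
my $delimited = qr!
    \A
    (?<ldelim> [\[({<]+ )                     # one or more opening brackets
    (?<text> .*? )
    (??{ quotemeta inverse($+{ldelim}) })     # the mirrored closing delimiter
    \z
!xs;

for my $input ('{ content }', '{ content {', '<[ content ]>') {
    printf "%-14s %s\n", $input, $input =~ $delimited ? 'matches' : 'does not match';
}
```

       With this sketch, inverse('(*') yields '*)' and inverse('if') yields
       'fi', which agrees with the delimiter pairs shown earlier.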
1599
       The following mnemonics may be useful in distinguishing invertrefs
       from matchrefs: a matchref starts with a "\" (just like the standard
       Perl regex backrefs "\1" and "\g{-2}" and "\k<name>"), whereas an
       invertref starts with a "/" (like an HTML or XML closing tag). Or just
       remember that "<\_IDENT>" is "match the same again", and if you want
       "the same again, only mirrored" instead, just mirror the "\" to get
       "</IDENT>".
1606
1607   Rematching parametric results and delimiters
1608       The "<\_IDENTIFIER>" and "</IDENTIFIER>" mechanisms normally locate the
1609       literal to be matched by looking in $MATCH{IDENTIFIER}.
1610
1611       However, you can cause them to look in $ARG{IDENTIFIER} instead, by
1612       prefixing the identifier with a single ":". This is especially useful
1613       when refactoring subrules. For example, instead of:
1614
1615           <rule: Command>
1616               <Keyword>  <CommandBody>  end_ <\_Keyword>
1617
1618           <rule: Placeholder>
1619               <Keyword>    \.\.\.   end_ <\_Keyword>
1620
1621       you could parameterize the Terminator rule, like so:
1622
1623           <rule: Command>
1624               <Keyword>  <CommandBody>  <Terminator(:Keyword)>
1625
1626           <rule: Placeholder>
1627               <Keyword>    \.\.\.   <Terminator(:Keyword)>
1628
1629           <token: Terminator>
1630               end_ <\:Keyword>
1631
1632   Tracking and reporting match positions
1633       Regexp::Grammars automatically predefines a special token that makes it
1634       easy to track exactly where in its input a particular subrule matches.
1635       That token is: "<matchpos>".
1636
1637       The "<matchpos>" token implements a zero-width match that never fails.
1638       It always returns the current index within the string that the grammar
1639       is matching.
1640
1641       So, for example you could have your "<delimited_text>" subrule detect
1642       and report unterminated text like so:
1643
1644           <token: delimited_text>
1645               qq? <delim> <text=(.*?)> </delim>
1646           |
1647               <matchpos> qq? <delim>
1648               <error: (?{"Unterminated string starting at index $MATCH{matchpos}"})>
1649
1650       Matching "<matchpos>" in the second alternative causes $MATCH{matchpos}
1651       to contain the position in the string at which the "<matchpos>" subrule
1652       was matched (in this example: the start of the unterminated text).
1653
1654       If you want the line number instead of the string index, use the
1655       predefined "<matchline>" subrule instead:
1656
1657           <token: delimited_text>
1658                     qq? <delim> <text=(.*?)> </delim>
1659           |   <matchline> qq? <delim>
1660               <error: (?{"Unterminated string starting at line $MATCH{matchline}"})>
1661
1662       Note that the line numbers returned by "<matchline>" start at 1 (not at
1663       zero, as with "<matchpos>").
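
       The difference between the two numbering schemes can be illustrated in
       plain Perl. This sketch does not use the module: line_at() is a
       hypothetical helper showing one plausible way to turn a zero-based
       string index (the kind of value "<matchpos>" reports) into a one-based
       line number (the kind of value "<matchline>" reports), by counting the
       newlines that precede the index.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a zero-based string index to a one-based line number.
sub line_at {
    my ($text, $index) = @_;
    my $prefix = substr($text, 0, $index);
    return 1 + ($prefix =~ tr/\n//);    # count the newlines before the index
}

my $source = "say 'ok'\nsay \"oops\n";
my $index  = index($source, '"');       # where the unterminated string starts

printf "index %d, line %d\n", $index, line_at($source, $index);
# prints: index 13, line 2
```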
1664
       The "<matchpos>" and "<matchline>" subrules are just like any other
       subrules; you can alias them ("<started_at=matchpos>") or match them
       repeatedly ("(?: <[matchline]> <[item]> )++"), etc.
1668

Autoactions

1670       The module also supports event-based parsing. You can specify a grammar
1671       in the usual way and then, for a particular parse, layer a collection
1672       of call-backs (known as "autoactions") over the grammar to handle the
1673       data as it is parsed.
1674
1675       Normally, a grammar rule returns the result hash it has accumulated (or
1676       whatever else was aliased to "MATCH=" within the rule). However, you
1677       can specify an autoaction object before the grammar is matched.
1678
1679       Once the autoaction object is specified, every time a rule succeeds
1680       during the parse, its result is passed to the object via one of its
1681       methods; specifically it is passed to the method whose name is the same
1682       as the rule's.
1683
1684       For example, suppose you had a grammar that recognizes simple algebraic
1685       expressions:
1686
           my $expr_parser = do{
               use Regexp::Grammars;
               qr{
                   <Answer>

                   <rule: Answer>     <[Operand=Mult]>+ % <[Op=(\+|\-)]>

                   <rule: Mult>       <[Operand=Pow]>+  % <[Op=(\*|/|%)]>

                   <rule: Pow>        <[Operand=Term]>+ % <Op=(\^)>

                   <rule: Term>          <MATCH=Literal>
                              |       \( <MATCH=Answer> \)

                   <token: Literal>   <MATCH=( [+-]? \d++ (?: \. \d++ )?+ )>
               }xms
           };
1704
1705       You could convert this grammar to a calculator, by installing a set of
1706       autoactions that convert each rule's result hash to the corresponding
1707       value of the sub-expression that the rule just parsed. To do that, you
1708       would create a class with methods whose names match the rules whose
1709       results you want to change. For example:
1710
1711           package Calculator;
1712           use List::Util qw< reduce >;
1713
1714           sub new {
1715               my ($class) = @_;
1716
1717               return bless {}, $class
1718           }
1719
1720           sub Answer {
1721               my ($self, $result_hash) = @_;
1722
1723               my $sum = shift @{$result_hash->{Operand}};
1724
1725               for my $term (@{$result_hash->{Operand}}) {
1726                   my $op = shift @{$result_hash->{Op}};
1727                   if ($op eq '+') { $sum += $term; }
1728                   else            { $sum -= $term; }
1729               }
1730
1731               return $sum;
1732           }
1733
1734           sub Mult {
1735               my ($self, $result_hash) = @_;
1736
1737               return reduce { eval($a . shift(@{$result_hash->{Op}}) . $b) }
1738                             @{$result_hash->{Operand}};
1739           }
1740
1741           sub Pow {
1742               my ($self, $result_hash) = @_;
1743
1744               return reduce { $b ** $a } reverse @{$result_hash->{Operand}};
1745           }
1746
1747       Objects of this class (and indeed the class itself) now have methods
1748       corresponding to some of the rules in the expression grammar. To apply
1749       those methods to the results of the rules (as they parse) you simply
1750       install an object as the "autoaction" handler, immediately before you
1751       initiate the parse:
1752
           if ($text =~ $expr_parser->with_actions(Calculator->new)) {
               say $/{Answer};   # Now prints the result of the expression
           }
1756
1757       The "with_actions()" method expects to be passed an object or
1758       classname. This object or class will be installed as the autoaction
1759       handler for the next match against any grammar. After that match, the
1760       handler will be uninstalled. "with_actions()" returns the grammar it's
1761       called on, making it easy to call it as part of a match (which is the
1762       recommended idiom).
1763
1764       With a "Calculator" object set as the autoaction handler, whenever the
1765       "Answer", "Mult", or "Pow" rule of the grammar matches, the
1766       corresponding "Answer", "Mult", or "Pow" method of the "Calculator"
1767       object will be called (with the rule's result value passed as its only
1768       argument), and the result of the method will be used as the result of
1769       the rule.
1770
1771       Note that nothing new happens when a "Term" or "Literal" rule matches,
1772       because the "Calculator" object doesn't have methods with those names.
1773
       The overall effect, then, is to allow you to specify a grammar without
       rule-specific behaviours and then, later, specify a set of final
       actions (as methods) for some or all of the rules of the grammar.
1777
1778       Note that, if a particular callback method returns "undef", the result
1779       of the corresponding rule will be passed through without modification.
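
       The dispatch rules described in this section can be summarized in a
       few lines of plain Perl. This is a sketch of the described behaviour
       only, not the module's implementation: apply_autoaction() and the
       "Doubler" class are hypothetical names invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the dispatch step described above.
sub apply_autoaction {
    my ($handler, $rule_name, $result) = @_;

    # No method named after the rule: the result is left alone.
    my $method = $handler->can($rule_name) or return $result;

    # A method exists: its return value replaces the result, unless it is undef.
    my $replacement = $handler->$method($result);
    return defined $replacement ? $replacement : $result;
}

{
    package Doubler;
    sub Number { my ($self, $result) = @_; return 2 * $result }
    sub Name   { return undef }           # undef: pass the result through
}

print apply_autoaction('Doubler', 'Number', 21),     "\n";   # 42
print apply_autoaction('Doubler', 'Name',   'perl'), "\n";   # perl
print apply_autoaction('Doubler', 'Other',  'as-is'), "\n";  # as-is
```

       As in the text, the handler may be either an object or a class name;
       the sketch passes a class name for brevity.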
1780

Named grammars

       All the grammars shown so far are confined to a single regex. However,
       Regexp::Grammars also provides a mechanism that allows you to define
       named grammars, which can then be imported into other regexes. This
       gives you a way of modularizing common grammatical components.
1786
1787   Defining a named grammar
1788       You can create a named grammar using the "<grammar:...>" directive.
1789       This directive must appear before the first rule definition in the
1790       grammar, and instead of any start-rule. For example:
1791
1792           qr{
1793               <grammar: List::Generic>
1794
1795               <rule: List>
1796                   <[MATCH=Item]>+ % <Separator>
1797
1798               <rule: Item>
1799                   \S++
1800
1801               <token: Separator>
1802                   \s* , \s*
1803           }x;
1804
1805       This creates a grammar named "List::Generic", and installs it in the
1806       module's internal caches, for future reference.
1807
1808       Note that there is no need (or reason) to assign the resulting regex to
1809       a variable, as the named grammar cannot itself be matched against.
1810
1811   Using a named grammar
1812       To make use of a named grammar, you need to incorporate it into another
1813       grammar, by inheritance. To do that, use the "<extends:...>" directive,
1814       like so:
1815
1816           my $parser = qr{
1817               <extends: List::Generic>
1818
1819               <List>
1820           }x;
1821
1822       The "<extends:...>" directive incorporates the rules defined in the
1823       specified grammar into the current regex. You can then call any of
1824       those rules in the start-pattern.
1825
1826   Overriding an inherited rule or token
1827       Subrule dispatch within a grammar is always polymorphic. That is, when
1828       a subrule is called, the most-derived rule of the same name within the
1829       grammar's hierarchy is invoked.
1830
       So, to replace a particular rule within a grammar, you simply need to
       inherit that grammar and specify new, more-specific versions of any
       rules you want to change. For example:
1834
1835           my $list_of_integers = qr{
1836               <List>
1837
1838               # Inherit rules from base grammar...
1839               <extends: List::Generic>
1840
1841               # Replace Item rule from List::Generic...
1842               <rule: Item>
1843                   [+-]? \d++
1844           }x;
1845
1846       You can also use "<extends:...>" in other named grammars, to create
1847       hierarchies:
1848
1849           qr{
1850               <grammar: List::Integral>
1851               <extends: List::Generic>
1852
1853               <token: Item>
1854                   [+-]? <MATCH=(<.Digit>+)>
1855
1856               <token: Digit>
1857                   \d
1858           }x;
1859
1860           qr{
1861               <grammar: List::ColonSeparated>
1862               <extends: List::Generic>
1863
1864               <token: Separator>
1865                   \s* : \s*
1866           }x;
1867
1868           qr{
1869               <grammar: List::Integral::ColonSeparated>
1870               <extends: List::Integral>
1871               <extends: List::ColonSeparated>
1872           }x;
1873
1874       As shown in the previous example, Regexp::Grammars allows you to
1875       multiply inherit two (or more) base grammars. For example, the
1876       "List::Integral::ColonSeparated" grammar takes the definitions of
1877       "List" and "Item" from the "List::Integral" grammar, and the definition
1878       of "Separator" from "List::ColonSeparated".
1879
1880       Note that grammars dispatch subrule calls using C3 method lookup,
1881       rather than Perl's older DFS lookup. That's why
1882       "List::Integral::ColonSeparated" correctly gets the more-specific
1883       "Separator" rule defined in "List::ColonSeparated", rather than the
1884       more-generic version defined in "List::Generic" (via "List::Integral").
1885       See "perldoc mro" for more discussion of the C3 dispatch algorithm.
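
       The difference between the two lookup orders can be seen with ordinary
       Perl packages. This sketch mirrors the grammar hierarchy above as
       empty classes (an analogy only; it makes no claim about how the module
       stores grammars internally) and asks the core "mro" module for both
       linearizations.

```perl
use strict;
use warnings;
use mro;

# Empty packages that mirror the inheritance graph of the named grammars.
@List::Integral::ISA                 = ('List::Generic');
@List::ColonSeparated::ISA           = ('List::Generic');
@List::Integral::ColonSeparated::ISA = ('List::Integral', 'List::ColonSeparated');

my $class = 'List::Integral::ColonSeparated';
my @dfs = @{ mro::get_linear_isa($class, 'dfs') };
my @c3  = @{ mro::get_linear_isa($class, 'c3')  };

print "DFS: @dfs\n";
print "C3:  @c3\n";
# DFS: List::Integral::ColonSeparated List::Integral List::Generic List::ColonSeparated
# C3:  List::Integral::ColonSeparated List::Integral List::ColonSeparated List::Generic
```

       Under DFS, "List::Generic" (and hence its generic "Separator") is
       reached before "List::ColonSeparated"; under C3 it is visited last.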
1886
1887   Augmenting an inherited rule or token
1888       Instead of replacing an inherited rule, you can augment it.
1889
       For example, if you need a grammar for lists of hexadecimal numbers,
       you could inherit the behaviour of "List::Integral" and add the hex
       digits to its "Digit" token:
1893
1894           my $list_of_hexadecimal = qr{
1895               <List>
1896
1897               <extends: List::Integral>
1898
1899               <token: Digit>
1900                   <List::Integral::Digit>
1901                 | [A-Fa-f]
1902           }x;
1903
1904       If you call a subrule using a fully qualified name (such as
1905       "<List::Integral::Digit>"), the grammar calls that version of the rule,
1906       rather than the most-derived version.
1907
1908   Debugging named grammars
1909       Named grammars are independent of each other, even when inherited. This
1910       means that, if debugging is enabled in a derived grammar, it will not
1911       be active in any rules inherited from a base grammar, unless the base
1912       grammar also included a "<debug:...>" directive.
1913
1914       This is a deliberate design decision, as activating the debugger adds a
1915       significant amount of code to each grammar's implementation, which is
1916       detrimental to the matching performance of the resulting regexes.
1917
1918       If you need to debug a named grammar, the best approach is to include a
1919       "<debug: same>" directive at the start of the grammar. The presence of
1920       this directive will ensure the necessary extra debugging code is
1921       included in the regex implementing the grammar, while setting "same"
1922       mode will ensure that the debugging mode isn't altered when the matcher
1923       uses the inherited rules.
1924

Common parsing techniques

1926   Result distillation
1927       Normally, calls to subrules produce nested result-hashes within the
1928       current result-hash. Those nested hashes always have at least one
1929       automatically supplied key (""), whose value is the entire substring
1930       that the subrule matched.
1931
1932       If there are no other nested captures within the subrule, there will be
1933       no other keys in the result-hash. This would be annoying as a typical
1934       nested grammar would then produce results consisting of hashes of
1935       hashes, with each nested hash having only a single key (""). This in
1936       turn would make postprocessing the result-hash (in "%/") far more
1937       complicated than it needs to be.
1938
1939       To avoid this behaviour, if a subrule's result-hash doesn't contain any
1940       keys except "", the module "flattens" the result-hash, by replacing it
1941       with the value of its single key.
1942
1943       So, for example, the grammar:
1944
1945           mv \s* <from> \s* <to>
1946
1947           <rule: from>   [\w/.-]+
1948           <rule: to>     [\w/.-]+
1949
1950       doesn't return a result-hash like this:
1951
1952           {
1953               ""     => 'mv /usr/local/lib/libhuh.dylib  /dev/null/badlib',
1954               'from' => { "" => '/usr/local/lib/libhuh.dylib' },
1955               'to'   => { "" => '/dev/null/badlib'            },
1956           }
1957
1958       Instead, it returns:
1959
1960           {
1961               ""     => 'mv /usr/local/lib/libhuh.dylib  /dev/null/badlib',
1962               'from' => '/usr/local/lib/libhuh.dylib',
1963               'to'   => '/dev/null/badlib',
1964           }
1965
1966       That is, because the 'from' and 'to' subhashes each have only a single
1967       entry, they are each "flattened" to the value of that entry.
1968
1969       This flattening also occurs if a result-hash contains only "private"
1970       keys (i.e. keys starting with underscores). For example:
1971
1972           mv \s* <from> \s* <to>
1973
1974           <rule: from>   <_dir=path>? <_file=filename>
1975           <rule: to>     <_dir=path>? <_file=filename>
1976
1977           <token: path>      [\w/.-]*/
1978           <token: filename>  [\w.-]+
1979
1980       Here, the "from" rule produces a result like this:
1981
1982           from => {
1983                 "" => '/usr/local/bin/perl',
1984               _dir => '/usr/local/bin/',
1985              _file => 'perl',
1986           }
1987
1988       which is automatically stripped of "private" keys, leaving:
1989
1990           from => {
1991                 "" => '/usr/local/bin/perl',
1992           }
1993
1994       which is then automatically flattened to:
1995
1996           from => '/usr/local/bin/perl'
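
       The two distillation steps (strip private keys, then flatten a hash
       whose only key is "") can be modelled in plain Perl. The distill()
       function below is a hypothetical illustration of the rules just
       described, applied recursively to an already-built data structure; it
       is not how the module itself is implemented, and the sample data is
       invented.

```perl
use strict;
use warnings;

# Hypothetical helper: drop "private" keys, then flatten a hash whose only
# remaining key is "" down to that key's value.
sub distill {
    my ($result) = @_;
    return $result unless ref($result) eq 'HASH';

    my %public;
    for my $key (keys %$result) {
        next if $key =~ /\A_/;                        # strip private keys
        $public{$key} = distill($result->{$key});     # distill nested results
    }

    my @keys = keys %public;
    return $public{""} if @keys == 1 && $keys[0] eq "";
    return \%public;
}

my $raw = {
    ""   => 'mv /usr/local/bin/perl /tmp/perl',
    from => { "" => '/usr/local/bin/perl', _dir => '/usr/local/bin/', _file => 'perl' },
    to   => { "" => '/tmp/perl',           _dir => '/tmp/',           _file => 'perl' },
};

my $distilled = distill($raw);
print "from => $distilled->{from}\n";    # from => /usr/local/bin/perl
print "to   => $distilled->{to}\n";      # to   => /tmp/perl
```

       The top-level hash keeps its three keys, so it is returned as a hash
       rather than being flattened.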
1997
1998       List result distillation
1999
2000       A special case of result distillation occurs in a separated list, such
2001       as:
2002
2003           <rule: List>
2004
2005               <[Item]>+ % <[Sep=(,)]>
2006
2007       If this construct matches just a single item, the result hash will
2008       contain a single entry consisting of a nested array with a single
2009       value, like so:
2010
2011           { Item => [ 'data' ] }
2012
2013       Instead of returning this annoyingly nested data structure, you can
2014       tell Regexp::Grammars to flatten it to just the inner data with a
2015       special directive:
2016
2017           <rule: List>
2018
2019               <[Item]>+ % <[Sep=(,)]>
2020
2021               <minimize:>
2022
2023       The "<minimize:>" directive examines the result hash (i.e.  %MATCH). If
2024       that hash contains only a single entry, which is a reference to an
2025       array with a single value, then the directive assigns that single value
2026       directly to $MATCH, so that it will be returned instead of the usual
2027       result hash.
2028
2029       This means that a normal separated list still results in a hash
2030       containing all elements and separators, but a "degenerate" list of only
2031       one item results in just that single item.
2032
2033       Manual result distillation
2034
2035       Regexp::Grammars also offers full manual control over the distillation
2036       process. If you use the reserved word "MATCH" as the alias for a
2037       subrule call:
2038
2039           <MATCH=filename>
2040
2041       or a subpattern match:
2042
2043           <MATCH=( \w+ )>
2044
2045       or a code block:
2046
2047           <MATCH=(?{ 42 })>
2048
2049       then the current rule will treat the return value of that subrule,
2050       pattern, or code block as its complete result, and return that value
2051       instead of the usual result-hash it constructs. This is the case even
2052       if the result has other entries that would normally also be returned.
2053
2054       For example, consider a rule like:
2055
2056           <rule: term>
2057                 <MATCH=literal>
2058               | <left_paren> <MATCH=expr> <right_paren>
2059
2060       The use of "MATCH" aliases causes the rule to return either whatever
2061       "<literal>" returns, or whatever "<expr>" returns (provided it's
2062       between left and right parentheses).
2063
2064       Note that, in this second case, even though "<left_paren>" and
2065       "<right_paren>" are captured to the result-hash, they are not returned,
2066       because the "MATCH" alias overrides the normal "return the result-hash"
2067       semantics and returns only what its associated subrule (i.e. "<expr>")
2068       produces.
2069
       Note also that the return value is assigned only if the subrule call
       actually matches. For example:
2072
2073           <rule: optional_names>
2074               <[MATCH=name]>*
2075
2076       If the repeated subrule call to "<name>" matches zero times, the return
2077       value of the "optional_names" rule will not be an empty array, because
2078       the "MATCH=" will not have executed at all. Instead, the default return
2079       value (an empty string) will be returned.  If you had specifically
2080       wanted to return an empty array, you could use any of the following:
2081
2082           <rule: optional_names>
2083               <MATCH=(?{ [] })>     # Set up empty array before first match attempt
2084               <[MATCH=name]>*
2085
2086       or:
2087
2088           <rule: optional_names>
2089               <[MATCH=name]>+       # Match one or more times
2090             |                       #          or
2091               <MATCH=(?{ [] })>     # Set up empty array, if no match
2092
2093       Programmatic result distillation
2094
2095       It's also possible to control what a rule returns from within a code
2096       block.  Regexp::Grammars provides a set of reserved variables that give
2097       direct access to the result-hash.
2098
2099       The result-hash itself can be accessed as %MATCH within any code block
2100       inside a rule. For example:
2101
2102           <rule: sum>
2103               <X=product> \+ <Y=product>
2104                   <MATCH=(?{ $MATCH{X} + $MATCH{Y} })>
2105
       Here, the rule matches a product (aliased to 'X' in the result-hash),
       then a literal '+', then another product (aliased to 'Y' in the
       result-hash). The rule then executes the code block, which accesses the
       two saved values (as $MATCH{X} and $MATCH{Y}), adding them together.
       Because the block is itself aliased to "MATCH", the sum produced by the
       block becomes the (only) result of the rule.
2112
2113       It is also possible to set the rule result from within a code block
2114       (instead of aliasing it). The special "override" return value is
2115       represented by the special variable $MATCH. So the previous example
2116       could be rewritten:
2117
2118           <rule: sum>
2119               <X=product> \+ <Y=product>
2120                   (?{ $MATCH = $MATCH{X} + $MATCH{Y} })
2121
2122       Both forms are identical in effect. Any assignment to $MATCH overrides
2123       the normal "return all subrule results" behaviour.
2124
2125       Assigning to $MATCH directly is particularly handy if the result may
2126       not always be "distillable", for example:
2127
2128           <rule: sum>
2129               <X=product> \+ <Y=product>
2130                   (?{ if (!ref $MATCH{X} && !ref $MATCH{Y}) {
2131                           # Reduce to sum, if both terms are simple scalars...
2132                           $MATCH = $MATCH{X} + $MATCH{Y};
2133                       }
2134                       else {
2135                           # Return full syntax tree for non-simple case...
2136                           $MATCH{op} = '+';
2137                       }
2138                   })
2139
2140       Note that you can also partially override the subrule return behaviour.
2141       Normally, the subrule returns the complete text it matched as its
2142       context substring (i.e. under the "empty key") in its result-hash. That
2143       is, of course, $MATCH{""}, so you can override just that behaviour by
2144       directly assigning to that entry.
2145
2146       For example, if you have a rule that matches key/value pairs from a
2147       configuration file, you might prefer that any trailing comments not be
2148       included in the "matched text" entry of the rule's result-hash. You
2149       could hide such comments like so:
2150
           <rule: config_line>
               <key> : <value>  <comment>?
                   (?{
                       # Edit trailing comments out of "matched text" entry...
                       $MATCH{""} = "$MATCH{key} : $MATCH{value}";
                   })
2157
2158       Some more examples of the uses of $MATCH:
2159
           <rule: FuncDecl>
             # Keyword  Name               Return only the name (as a string)...
               func     <Identifier> ;     (?{ $MATCH = $MATCH{'Identifier'} })
2163
2164
2165           <rule: NumList>
2166             # Numbers in square brackets...
2167               \[
2168                   ( \d+ (?: , \d+)* )
2169               \]
2170
2171             # Return only the numbers...
2172               (?{ $MATCH = $CAPTURE })
2173
2174
2175           <token: Cmd>
2176             # Match standard variants then standardize the keyword...
2177               (?: mv | move | rename )      (?{ $MATCH = 'mv'; })
2178
2179   Parse-time data processing
2180       Using code blocks in rules, it's often possible to fully process data
2181       as you parse it. For example, the "<sum>" rule shown in the previous
2182       section might be part of a simple calculator, implemented entirely in a
2183       single grammar. Such a calculator might look like this:
2184
2185           my $calculator = do{
2186               use Regexp::Grammars;
2187               qr{
2188                   <Answer>
2189
2190                   <rule: Answer>
2191                       ( <.Mult>+ % <.Op=([+-])> )
2192                           <MATCH= (?{ eval $CAPTURE })>
2193
2194                   <rule: Mult>
2195                       ( <.Pow>+ % <.Op=([*/%])> )
2196                           <MATCH= (?{ eval $CAPTURE })>
2197
2198                   <rule: Pow>
2199                       <X=Term> \^ <Y=Pow>
2200                           <MATCH= (?{ $MATCH{X} ** $MATCH{Y}; })>
2201                     |
2202                           <MATCH=Term>
2203
2204                   <rule: Term>
2205                           <MATCH=Literal>
2206                     | \(  <MATCH=Answer>  \)
2207
2208                   <token: Literal>
2209                           <MATCH= ( [+-]? \d++ (?: \. \d++ )?+ )>
2210               }xms
2211           };
2212
2213           while (my $input = <>) {
2214               if ($input =~ $calculator) {
2215                   say "--> $/{Answer}";
2216               }
2217           }
2218
2219       Because every rule computes a value using the results of the subrules
2220       below it, and aliases that result to its "MATCH", each rule returns a
2221       complete evaluation of the subexpression it matches, passing that back
2222       to higher-level rules, which then do the same.
2223
2224       Hence, the result returned to the very top-level rule (i.e. to
2225       "<Answer>") is the complete evaluation of the entire expression that
2226       was matched. That means that, in the very process of having matched a
2227       valid expression, the calculator has also computed the value of that
2228       expression, which can then simply be printed directly.
2229
2230       It is often possible to have a grammar fully (or sometimes at least
2231       partially) evaluate or transform the data it is parsing, and this
2232       usually leads to very efficient and easy-to-maintain implementations.
2233
2234       The main limitation of this technique is that the data has to be in a
2235       well-structured form, where subsets of the data can be evaluated using
2236       only local information. In cases where the meaning of the data is
2237       distributed through that data non-hierarchically, or relies on global
2238       state, or on external information, it is often better to have the
2239       grammar simply construct a complete syntax tree for the data first, and
2240       then evaluate that syntax tree separately, after parsing is complete.
2241       The following section describes a feature of Regexp::Grammars that can
2242       make this second style of data processing simpler and more
2243       maintainable.
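
       The two-phase style can be sketched as follows. The grammar, the rule
       names and the evaluate_sum() helper are all hypothetical; the point
       is only that the grammar records structure, and a separate
       subroutine interprets it once parsing has finished.

```perl
use strict;
use warnings;
use List::Util qw( sum );
use Regexp::Grammars;

# Phase 1: the grammar only collects the numbers; it computes nothing
my $sum_parser = qr{
    \A <Sum> \z

    <rule: Sum>
        <[Number]>+ % <.Plus>

    <token: Number>  \d+
    <token: Plus>    [+]
}xms;

# Phase 2: a separate pass evaluates the finished tree
sub evaluate_sum {
    my ($sum_node) = @_;
    return sum @{ $sum_node->{Number} };
}

if ('1+2+39' =~ $sum_parser) {
    print evaluate_sum($/{Sum}), "\n";
}
```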
2244
2245   Object-oriented parsing
2246       When a grammar has parsed successfully, the "%/" variable will contain
2247       a series of nested hashes (and possibly arrays) representing the
2248       hierarchical structure of the parsed data.
2249
2250       Typically, the next step is to walk that tree, extracting or converting
2251       or otherwise processing that information. If the tree has nodes of many
2252       different types, it can be difficult to build a recursive subroutine
2253       that can navigate it easily.
2254
2255       A much cleaner solution is possible if the nodes of the tree are proper
2256       objects.  In that case, you just define a "process()" or "traverse()"
2257       method for each of the classes, and have every node call that method on
2258       each of its children. For example, if the parser were to return a tree
2259       of nodes representing the contents of a LaTeX file, then you could
2260       define the following methods:
2261
2262           sub Latex::file::explain
2263           {
2264               my ($self, $level) = @_;
2265               for my $element (@{$self->{element}}) {
2266                   $element->explain($level);
2267               }
2268           }
2269
2270           sub Latex::element::explain {
2271               my ($self, $level) = @_;
2272               (  $self->{command} || $self->{literal})->explain($level)
2273           }
2274
2275           sub Latex::command::explain {
2276               my ($self, $level) = @_;
2277               say "\t"x$level, "Command:";
2278               say "\t"x($level+1), "Name: $self->{name}";
2279               if ($self->{options}) {
2280                   say "\t"x$level, "\tOptions:";
2281                   $self->{options}->explain($level+2)
2282               }
2283
2284               for my $arg (@{$self->{arg}}) {
2285                   say "\t"x$level, "\tArg:";
2286                   $arg->explain($level+2)
2287               }
2288           }
2289
2290           sub Latex::options::explain {
2291               my ($self, $level) = @_;
2292               $_->explain($level) foreach @{$self->{option}};
2293           }
2294
2295           sub Latex::literal::explain {
2296               my ($self, $level, $label) = @_;
2297               $label //= 'Literal';
2298               say "\t"x$level, "$label: ", $self->{q{}};
2299           }
2300
2301       and then simply write:
2302
2303           if ($text =~ $LaTeX_parser) {
2304               $/{LaTeX_file}->explain();
2305           }
2306
2307       and the chain of "explain()" calls would cascade down the nodes of the
2308       tree, each one invoking the appropriate "explain()" method according to
2309       the type of node encountered.
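
       The cascade itself needs nothing from the parser, so it can be tried
       out on a hand-built tree. In this sketch the Demo::* class names and
       the tree contents are invented for illustration, and each "explain()"
       returns its report lines instead of printing them.

```perl
use strict;
use warnings;

# Hypothetical node classes; each explain() returns a list of report lines
sub Demo::file::explain {
    my ($self, $level) = @_;
    return map { $_->explain($level) } @{ $self->{element} };
}

sub Demo::command::explain {
    my ($self, $level) = @_;
    return (
        "\t" x $level . "Command: $self->{name}",
        map { $_->explain($level + 1) } @{ $self->{arg} },
    );
}

sub Demo::literal::explain {
    my ($self, $level) = @_;
    return "\t" x $level . "Literal: $self->{q{}}";
}

# Hand-built tree standing in for a blessed parse result
my $tree = bless {
    element => [
        bless( { q{} => 'Hello' }, 'Demo::literal' ),
        bless( {
            name => 'emph',
            arg  => [ bless( { q{} => 'world' }, 'Demo::literal' ) ],
        }, 'Demo::command' ),
    ],
}, 'Demo::file';

print "$_\n" for $tree->explain(0);
```

       No node ever tests the type of its children: method dispatch selects
       the right "explain()" for each one.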
2310
2311       The only problem is that, by default, Regexp::Grammars returns a tree
2312       of plain-old hashes, not LaTeX::Whatever objects. Fortunately, it's
2313       easy to request that the result hashes be automatically blessed into
2314       the appropriate classes, using the "<objrule:...>" and "<objtoken:...>"
2315       directives.
2316
2317       These directives are identical to the "<rule:...>" and "<token:...>"
2318       directives (respectively), except that the rule or token they create
2319       will also convert the hash it normally returns into an object of a
2320       specified class. This conversion is done by passing the result hash to
2321       the class's constructor:
2322
2323           $class->new(\%result_hash)
2324
2325       if the class has a constructor method named "new()", or else (if the
2326       class doesn't provide a constructor) by directly blessing the result
2327       hash:
2328
2329           bless \%result_hash, $class
2330
2331       Note that, even if the object is constructed via its own constructor,
2332       the module still expects the new object to be hash-based, and will fail
2333       (issuing an error) if the object is anything but a blessed hash.
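
       The conversion rules just described can be restated in plain Perl.
       This is only a sketch of the documented behaviour, not the module's
       own code; the objectify() helper and the Demo::* classes are
       hypothetical.

```perl
use strict;
use warnings;
use Scalar::Util qw( blessed reftype );

# Hypothetical helper that mirrors the documented rules
sub objectify {
    my ($class, $result_hash) = @_;

    my $object = $class->can('new')
               ? $class->new($result_hash)       # class supplies a constructor
               : bless($result_hash, $class);    # otherwise bless directly

    die "$class did not produce a blessed hash\n"
        if !blessed($object) || reftype($object) ne 'HASH';

    return $object;
}

# Demo::WithNew has a constructor; Demo::Plain does not
sub Demo::WithNew::new {
    my ($class, $data) = @_;
    return bless { %{$data}, built_by => 'new' }, $class;
}
sub Demo::Plain::name { return $_[0]{name} }

for my $class ('Demo::WithNew', 'Demo::Plain') {
    my $object = objectify($class, { name => 'atom' });
    print ref($object), ' built by ', $object->{built_by} // 'bless', "\n";
}
```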
2335
2336       The generic syntax for these types of rules and tokens is:
2337
2338           <objrule:  CLASS::NAME = RULENAME  >
2339           <objtoken: CLASS::NAME = TOKENNAME >
2340
2341       For example:
2342
2343           <objrule: LaTeX::Element=component>
2344               # ...Defines a rule that can be called as <component>
2345               # ...and which returns a hash-based LaTeX::Element object
2346
2347           <objtoken: LaTeX::Literal=atom>
2348               # ...Defines a token that can be called as <atom>
2349               # ...and which returns a hash-based LaTeX::Literal object
2350
2351       Note that, just as in aliased subrule calls, the name by which
2352       something is referred to outside the grammar (in this case, the class
2353       name) comes before the "=", whereas the name by which it is referred to
2354       inside the grammar comes after the "=".
2355
2356       You can freely mix object-returning and plain-old-hash-returning rules
2357       and tokens within a single grammar, though you have to be careful not
2358       to subsequently try to call a method on any of the unblessed nodes.
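
       A small end-to-end sketch of mixing the two kinds of rule follows.
       The Demo::Greeting class, its describe() method and the input string
       are assumptions made for illustration. Because the class defines no
       "new()", the result hash should simply be blessed.

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $greeting_parser = qr{
    \A <Greeting> \z

    # Object-returning rule, blessed into Demo::Greeting
    <objrule: Demo::Greeting = Greeting>
        hello <Name>

    # Plain token, returning an ordinary unblessed string
    <token: Name>
        \w+
}xms;

sub Demo::Greeting::describe {
    my ($self) = @_;
    return "Greeting addressed to $self->{Name}";
}

if ('hello world' =~ $greeting_parser) {
    print $/{Greeting}->describe(), "\n";
}
```

       Here "describe()" may be called on $/{Greeting}, but not on
       $/{Greeting}{Name}, which is a plain string.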
2359
2360       An important caveat regarding OO rules
2361
2362       Prior to Perl 5.14.0, Perl's regex engine was not fully re-entrant.
2363       This means that in older versions of Perl, it is not possible to re-
2364       invoke the regex engine when already inside the regex engine.
2365
2366       This means that you need to be careful that the "new()" constructors
2367       that are called by your object-rules do not themselves use regexes in
2368       any way, unless you're running under Perl 5.14 or later (in which case
2369       you can ignore what follows).
2370
2371       The two ways this is most likely to happen are:
2372
2373       1.  If you're using a class built on Moose, where one or more of the
2374           "has" declarations uses a type constraint (such as 'Int') that is
2375           implemented via regex matching. For example:
2376
2377               has 'id' => (is => 'rw', isa => 'Int');
2378
2379           The workaround (for pre-5.14 Perls) is to replace the type
2380           constraint with one that doesn't use a regex. For example:
2381
2382               has 'id' => (is => 'rw', isa => 'Num');
2383
2384           Alternatively, you could define your own type constraint that
2385           avoids regexes:
2386
2387               use Moose::Util::TypeConstraints;
2388
2389               subtype 'Non::Regex::Int',
2390                    as 'Num',
2391                 where { int($_) == $_ };
2392
2393               no Moose::Util::TypeConstraints;
2394
2395               # and later...
2396
2397               has 'id' => (is => 'rw', isa => 'Non::Regex::Int');
2398
2399       2.  If your class uses an "AUTOLOAD()" method to implement its
2400           constructor and that method uses the typical:
2401
2402               $AUTOLOAD =~ s/.*://;
2403
2404           technique. The workaround here is to achieve the same effect
2405           without a regex. For example:
2406
2407               my $last_colon_pos = rindex($AUTOLOAD, ':');
2408               substr $AUTOLOAD, 0, $last_colon_pos+1, q{};
2409
2410       Note that this caveat against using nested regexes also applies to any
2411       code blocks executed inside a rule or token (whether or not those rules
2412       or tokens are object-oriented).
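
       The "rindex"/"substr" workaround can be checked outside any grammar.
       This sketch wraps it in a hypothetical strip_package() helper and
       compares it against the regex version on a few made-up names.

```perl
use strict;
use warnings;

# Regex-free equivalent of  s/.*://  (strip everything up to the last colon)
sub strip_package {
    my ($name) = @_;
    my $last_colon_pos = rindex($name, ':');      # -1 if there is no colon
    substr $name, 0, $last_colon_pos + 1, q{};
    return $name;
}

for my $full_name ('My::Class::new', 'main::helper', 'unqualified') {
    (my $via_regex = $full_name) =~ s/.*://;
    die "mismatch for $full_name" if $via_regex ne strip_package($full_name);
    printf "%-16s => %s\n", $full_name, strip_package($full_name);
}
```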
2413
2414       A naming shortcut
2415
2416       If an "<objrule:...>" or "<objtoken:...>" is defined with a class name
2417       that is not followed by "=" and a rule name, then the rule name is
2418       determined automatically from the classname.  Specifically, the final
2419       component of the classname (i.e. after the last "::", if any) is used.
2420
2421       For example:
2422
2423           <objrule: LaTeX::Element>
2424               # ...Defines a rule that can be called as <Element>
2425               # ...and which returns a hash-based LaTeX::Element object
2426
2427           <objtoken: LaTeX::Literal>
2428               # ...Defines a token that can be called as <Literal>
2429               # ...and which returns a hash-based LaTeX::Literal object
2430
2431           <objtoken: Comment>
2432               # ...Defines a token that can be called as <Comment>
2433               # ...and which returns a hash-based Comment object
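
       The naming rule amounts to keeping whatever follows the last "::".
       The default_rule_name() helper below is a hypothetical restatement
       of that rule in plain Perl, not a subroutine provided by the module.

```perl
use strict;
use warnings;

# Hypothetical helper: derive the default rule name from a class name
sub default_rule_name {
    my ($class_name) = @_;
    my $last_colon_pos = rindex($class_name, ':');   # -1 if no "::" present
    return substr $class_name, $last_colon_pos + 1;
}

for my $class_name ('LaTeX::Element', 'LaTeX::Literal', 'Comment') {
    print "<objrule: $class_name> can be called as <",
          default_rule_name($class_name), ">\n";
}
```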
2434

Debugging

2436       Regexp::Grammars provides a number of features specifically designed to
2437       help debug both grammars and the data they parse.
2438
2439       All debugging messages are written to a log file (which, by default, is
2440       just STDERR). However, you can specify a disk file explicitly by
2441       placing a "<logfile:...>" directive at the start of your grammar:
2442
2443           $grammar = qr{
2444
2445               <logfile: LaTeX_parser_log >
2446
2447               \A <LaTeX_file> \Z    # Pattern to match
2448
2449               <rule: LaTeX_file>
2450                   # etc.
2451           }x;
2452
2453       You can also explicitly specify that messages go to the terminal:
2454
2455               <logfile: - >
2456
2457   Debugging grammar creation with "<logfile:...>"
2458       Whenever a log file has been directly specified, Regexp::Grammars
2459       automatically does verbose static analysis of your grammar.  That is,
2460       whenever it compiles a grammar containing an explicit "<logfile:...>"
2461       directive it logs a series of messages explaining how it has
2462       interpreted the various components of that grammar. For example, the
2463       following grammar:
2464
2465           <logfile: parser_log >
2466
2467           <cmd>
2468
2469           <rule: cmd>
2470               mv <from=file> <to=file>
2471             | cp <source> <[file]>  <.comment>?
2472
2473       would produce the following analysis in the 'parser_log' file:
2474
2475           info | Processing the main regex before any rule definitions
2476                |    |
2477                |    |...Treating <cmd> as:
2478                |    |      |  match the subrule <cmd>
2479                |    |       \ saving the match in $MATCH{'cmd'}
2480                |    |
2481                |     \___End of main regex
2482                |
2483           info | Defining a rule: <cmd>
2484                |    |...Returns: a hash
2485                |    |
2486                |    |...Treating ' mv ' as:
2487                |    |       \ normal Perl regex syntax
2488                |    |
2489                |    |...Treating <from=file> as:
2490                |    |      |  match the subrule <file>
2491                |    |       \ saving the match in $MATCH{'from'}
2492                |    |
2493                |    |...Treating <to=file> as:
2494                |    |      |  match the subrule <file>
2495                |    |       \ saving the match in $MATCH{'to'}
2496                |    |
2497                |    |...Treating ' | cp ' as:
2498                |    |       \ normal Perl regex syntax
2499                |    |
2500                |    |...Treating <source> as:
2501                |    |      |  match the subrule <source>
2502                |    |       \ saving the match in $MATCH{'source'}
2503                |    |
2504                |    |...Treating <[file]> as:
2505                |    |      |  match the subrule <file>
2506                |    |       \ appending the match to $MATCH{'file'}
2507                |    |
2508                |    |...Treating <.comment>? as:
2509                |    |      |  match the subrule <comment> if possible
2510                |    |       \ but don't save anything
2511                |    |
2512                |     \___End of rule definition
2513
2514       This kind of static analysis is a useful starting point in debugging a
2515       miscreant grammar, because it enables you to see what you actually
2516       specified (as opposed to what you thought you'd specified).
2517
2518   Debugging grammar execution with "<debug:...>"
2519       Regexp::Grammars also provides a simple interactive debugger, with
2520       which you can observe the process of parsing and the data being
2521       collected in any result-hash.
2522
2523       To initiate debugging, place a "<debug:...>" directive anywhere in your
2524       grammar. When parsing reaches that directive the debugger will be
2525       activated, and the command specified in the directive immediately
2526       executed. The available commands are:
2527
2528           <debug: on>    - Enable debugging, stop when a rule matches
2529           <debug: match> - Enable debugging, stop when a rule matches
2530           <debug: try>   - Enable debugging, stop when a rule is tried
2531           <debug: run>   - Enable debugging, run until the match completes
2532           <debug: same>  - Continue debugging (or not) as currently
2533           <debug: off>   - Disable debugging and continue parsing silently
2534
2535           <debug: continue> - Synonym for <debug: run>
2536           <debug: step>     - Synonym for <debug: try>
2537
2538       These directives can be placed anywhere within a grammar and take
2539       effect when that point is reached in the parsing. Hence, adding a
2540       "<debug:step>" directive is very much like setting a breakpoint at that
2541       point in the grammar. Indeed, a common debugging strategy is to turn
2542       debugging on and off only around a suspect part of the grammar:
2543
2544           <rule: tricky>   # This is where we think the problem is...
2545               <debug:step>
2546               <preamble> <text> <postscript>
2547               <debug:off>
2548
2549       Once the debugger is active, it steps through the parse, reporting
2550       rules that are tried, matches and failures, backtracking and restarts,
2551       and the parser's location within both the grammar and the text being
2552       matched. That report looks like this:
2553
2554           ===============> Trying <grammar> from position 0
2555           > cp file1 file2 |...Trying <cmd>
2556                            |   |...Trying <cmd=(cp)>
2557                            |   |    \FAIL <cmd=(cp)>
2558                            |    \FAIL <cmd>
2559                             \FAIL <grammar>
2560           ===============> Trying <grammar> from position 1
2561            cp file1 file2  |...Trying <cmd>
2562                            |   |...Trying <cmd=(cp)>
2563            file1 file2     |   |    \_____<cmd=(cp)> matched 'cp'
2564           file1 file2      |   |...Trying <[file]>+
2565            file2           |   |    \_____<[file]>+ matched 'file1'
2566                            |   |...Trying <[file]>+
2567           [eos]            |   |    \_____<[file]>+ matched ' file2'
2568                            |   |...Trying <[file]>+
2569                            |   |    \FAIL <[file]>+
2570                            |   |...Trying <target>
2571                            |   |   |...Trying <file>
2572                            |   |   |    \FAIL <file>
2573                            |   |    \FAIL <target>
2574            <~~~~~~~~~~~~~~ |   |...Backtracking 5 chars and trying new match
2575           file2            |   |...Trying <target>
2576                            |   |   |...Trying <file>
2577                            |   |   |    \____ <file> matched 'file2'
2578           [eos]            |   |    \_____<target> matched 'file2'
2579                            |    \_____<cmd> matched ' cp file1 file2'
2580                             \_____<grammar> matched ' cp file1 file2'
2581
2582       The first column indicates the point in the input at which the parser
2583       is trying to match, as well as any backtracking or forward searching it
2584       may need to do. The remainder of the columns track the parser's
2585       hierarchical traversal of the grammar, indicating which rules are
2586       tried, which succeed, and what they match.
2587
2588       Provided the logfile is a terminal (as it is by default), the debugger
2589       also pauses at various points in the parsing process--before trying a
2590       rule, after a rule succeeds, or at the end of the parse--according to
2591       the most recent command issued. When it pauses, you can issue a new
2592       command by entering a single letter:
2593
2594           m       - to continue until the next subrule matches
2595           t or s  - to continue until the next subrule is tried
2596           r or c  - to continue to the end of the grammar
2597           o       - to switch off debugging
2598
2599       Note that these are the first letters of the corresponding
2600       "<debug:...>" commands, listed earlier. Just hitting ENTER while the
2601       debugger is paused repeats the previous command.
2602
2603       While the debugger is paused you can also type a 'd', which will
2604       display the result-hash for the current rule. This can be useful for
2605       detecting which rule isn't returning the data you expected.
2606
2607       Resizing the context string
2608
2609       By default, the first column of the debugger output (which shows the
2610       current matching position within the string) is limited to a width of
2611       20 columns.
2612
2613       However, you can change that limit by calling the
2614       "Regexp::Grammars::set_context_width()" subroutine. You have to specify
2615       the fully qualified name, however, as Regexp::Grammars does not export
2616       this (or any other) subroutine.
2617
2618       "set_context_width()" expects a single argument: a positive integer
2619       indicating the maximal allowable width for the context column. If an
2620       invalid value is passed, it issues a warning and ignores that value.
2621
2622       If called in a void context, "set_context_width()" changes the context
2623       width permanently throughout your application. If called in a scalar or
2624       list context, "set_context_width()" returns an object whose destructor
2625       will cause the context width to revert to its previous value. This
2626       means you can temporarily change the context width within a given block
2627       with something like:
2628
2629           {
2630               my $temporary = Regexp::Grammars::set_context_width(50);
2631
2632               if ($text =~ $parser) {
2633                   do_stuff_with( %/ );
2634               }
2635
2636           } # <--- context width automagically reverts at this point
2637
2638       and the context width will change back to its previous value when
2639       $temporary goes out of scope at the end of the block.
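
       The revert-on-scope-exit behaviour is the familiar "guard object"
       idiom. The following sketch shows how such a mechanism can work in
       general; Demo::WidthGuard and its subroutines are hypothetical and
       say nothing about how Regexp::Grammars implements it internally.

```perl
use strict;
use warnings;

package Demo::WidthGuard;   # hypothetical class, for illustration only

my $context_width = 20;     # stand-in for an internal setting

sub current_width { return $context_width }

# Set a new width. In void context the change is permanent; otherwise
# return a guard object that restores the old width when it is destroyed.
sub set_width {
    my ($new_width) = @_;
    my $old_width = $context_width;
    $context_width = $new_width;
    return if !defined wantarray;
    return bless { old => $old_width }, __PACKAGE__;
}

sub DESTROY {
    my ($self) = @_;
    $context_width = $self->{old};
}

package main;

{
    my $temporary = Demo::WidthGuard::set_width(50);
    print "inside block: ", Demo::WidthGuard::current_width(), "\n";
}
print "after block:  ", Demo::WidthGuard::current_width(), "\n";
```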
2640
2641   User-defined logging with "<log:...>"
2642       Both static and interactive debugging send a series of predefined log
2643       messages to whatever log file you have specified. It is also possible
2644       to send additional, user-defined messages to the log, using the
2645       "<log:...>" directive.
2646
2647       This directive expects either simple text or a code block as its
2648       single argument. If the argument is a code block, that code is expected
2649       to return the text of the message; if the argument is anything else,
2650       that something else is the literal message. For example:
2651
2652           <rule: ListElem>
2653
2654               <Elem=   ( [a-z]\d+) >
2655                   <log: Checking for a suffix, too...>
2656
2657               <Suffix= ( : \d+   ) >?
2658                   <log: (?{ "ListElem: $MATCH{Elem} and $MATCH{Suffix}" })>
2659
2660       User-defined log messages implemented using a codeblock can also
2661       specify a severity level. If the codeblock of a "<log:...>" directive
2662       returns two or more values, the first is treated as a log message
2663       severity indicator, and the remaining values as separate lines of text
2664       to be logged. For example:
2665
2666           <rule: ListElem>
2667               <Elem=   ( [a-z]\d+) >
2668               <Suffix= ( : \d+   ) >?
2669
2670                   <log: (?{
2671                       warn => "Elem was: $MATCH{Elem}",
2672                               "Suffix was $MATCH{Suffix}",
2673                   })>
2674
2675       When they are encountered, user-defined log messages are interspersed
2676       between any automatic log messages (i.e. from the debugger), at the
2677       correct level of nesting for the current rule.
2678
2679   Debugging non-grammars
2680       [Note that, with the release in 2012 of the Regexp::Debugger module (on
2681       CPAN), the techniques described below are unnecessary. If you need to
2682       debug plain Perl regexes, use Regexp::Debugger instead.]
2683
2684       It is possible to use Regexp::Grammars without creating any subrule
2685       definitions, simply to debug a recalcitrant regex. For example, if the
2686       following regex wasn't working as expected:
2687
2688           my $balanced_brackets = qr{
2689               \(             # left delim
2690               (?:
2691                   \\         # escape or
2692               |   (?R)       # recurse or
2693               |   .          # whatever
2694               )*
2695               \)             # right delim
2696           }xms;
2697
2698       you could instrument it with aliased subpatterns and then debug it
2699       step-by-step, using Regexp::Grammars:
2700
2701           use Regexp::Grammars;
2702
2703           my $balanced_brackets = qr{
2704               <debug:step>
2705
2706               <.left_delim=  (  \(  )>
2707               (?:
2708                   <.escape=  (  \\  )>
2709               |   <.recurse= ( (?R) )>
2710               |   <.whatever=(  .   )>
2711               )*
2712               <.right_delim= (  \)  )>
2713           }xms;
2714
2715           while (<>) {
2716               say 'matched' if /$balanced_brackets/;
2717           }
2718
2719       Note the use of amnesiac aliased subpatterns to avoid needlessly
2720       building a result-hash. Alternatively, you could use listifying aliases
2721       to preserve the matching structure as an additional debugging aid:
2722
2723           use Regexp::Grammars;
2724
2725           my $balanced_brackets = qr{
2726               <debug:step>
2727
2728               <[left_delim=  (  \(  )]>
2729               (?:
2730                   <[escape=  (  \\  )]>
2731               |   <[recurse= ( (?R) )]>
2732               |   <[whatever=(  .   )]>
2733               )*
2734               <[right_delim= (  \)  )]>
2735           }xms;
2736
2737           if ( '(a(bc)d)' =~ /$balanced_brackets/) {
2738               use Data::Dumper 'Dumper';
2739               warn Dumper \%/;
2740           }
2741

Handling errors when parsing

2743       Assuming you have correctly debugged your grammar, the next source of
2744       problems will probably be invalid input (especially if that input is
2745       being provided interactively). So Regexp::Grammars also provides some
2746       support for detecting when a parse is likely to fail...and informing
2747       the user why.
2748
2749   Requirements
2750       The "<require:...>" directive is useful for testing conditions that
2751       it's not easy (or even possible) to check within the syntax of the
2752       regex itself. For example:
2753
2754           <rule: IPV4_Octet_Decimal>
2755               # Up to three digits...
2756               <MATCH= ( \d{1,3}+ )>
2757
2758               # ...but less than 256...
2759               <require: (?{ $MATCH <= 255 })>
2760
2761       A require expects a regex codeblock as its argument and succeeds if the
2762       final value of that codeblock is true. If the final value is false, the
2763       directive fails and the rule starts backtracking.
2764
2765       Note that, in this example, the digits are matched with " \d{1,3}+ ".
2766       The trailing "+" prevents the "{1,3}" repetition from backtracking to a
2767       smaller number of digits if the "<require:...>" fails.
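
       The effect of the possessive "+" can be seen without any grammar. The
       sketch below is a plain-Perl analogue of the rule, not a use of
       "<require:...>" itself: a conditional code block rejects values over
       255, and the two patterns differ only in the trailing "+".

```perl
use strict;
use warnings;

# Capture up to three digits, then fail if the captured value exceeds 255.
# $^N holds the text of the most recently closed capture group.
my $backtracking = qr{ \A ( \d{1,3}  ) (?(?{ $^N > 255 }) (?!) ) }xms;
my $possessive   = qr{ \A ( \d{1,3}+ ) (?(?{ $^N > 255 }) (?!) ) }xms;

for my $input ('192', '999') {
    my $loose = $input =~ $backtracking ? "matched '$1'" : 'no match';
    my $tight = $input =~ $possessive   ? "matched '$1'" : 'no match';
    print "$input: plain {1,3} $loose; possessive {1,3}+ $tight\n";
}
```

       Without the "+", the input '999' is quietly accepted as '99', because
       the repetition backtracks to two digits; with it, the match fails
       outright.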
2768
2769   Handling failure
2770       The module has limited support for error reporting from within a
2771       grammar, in the form of the "<error:...>" and "<warning:...>"
2772       directives and their shortcuts: "<...>", "<!!!>", and "<???>".
2773
2774       Error messages
2775
2776       The "<error: MSG>" directive queues a conditional error message within
2777       "@!" and then fails to match (that is, it is equivalent to a "(?!)"
2778       when matching). For example:
2779
2780           <rule: ListElem>
2781               <SerialNumber>
2782             | <ClientName>
2783             | <error: (?{ $errcount++ . ': Missing list element' })>
2784
2785       So a common code pattern when using grammars that do this kind of error
2786       detection is:
2787
2788           if ($text =~ $grammar) {
2789               # Do something with the data collected in %/
2790           }
2791           else {
2792               say {*STDERR} $_ for @!;   # i.e. report all errors
2793           }
2794
2795       Each error message is conditional in the sense that, if any surrounding
2796       rule subsequently matches, the message is automatically removed from
2797       "@!". This implies that you can queue up as many error messages as you
2798       wish, but they will only remain in "@!" if the match ultimately fails.
2799       Moreover, only those error messages originating from rules that
2800       actually contributed to the eventual failure-to-match will remain in
2801       "@!".
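
       The following sketch puts the "ListElem" rule into a complete script
       so that the queue can be observed. The two token definitions and the
       sample inputs are assumptions added for illustration.

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $list_parser = qr{
    \A <ListElem> \z

    <rule: ListElem>
        <SerialNumber>
      | <ClientName>
      | <error: Missing list element>

    <token: SerialNumber>  \d+
    <token: ClientName>    [A-Za-z]+
}xms;

for my $text ('12345', 'Smith', '???') {
    if ($text =~ $list_parser) {
        print "'$text' parsed OK\n";
    }
    else {
        print "'$text' failed: $_\n" for @!;
    }
}
```

       Only the last input reaches the "<error:...>" alternative, so only
       that input should leave a message in "@!".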
2802
2803       If a code block is specified as the argument, the error message is
2804       whatever final value is produced when the block is executed. Note that
2805       this final value does not have to be a string (though it does have to
2806       be a scalar).
2807
2808           <rule: ListElem>
2809               <SerialNumber>
2810             | <ClientName>
2811             | <error: (?{
2812                   # Return a hash, with the error information...
2813                   { errnum => $errcount++, msg => 'Missing list element' }
2814               })>
2815
2816       If anything else is specified as the argument, it is treated as a
2817       literal error string (and may not contain an unbalanced '<' or '>', nor
2818       any interpolated variables).
2819
2820       However, if the literal error string begins with "Expected " or
2821       "Expecting ", then the error string automatically has the following
2822       "context suffix" appended:
2823
2824           , but found '$CONTEXT' instead
2825
2826       For example:
2827
2828           qr{ <Arithmetic_Expression>                # ...Match arithmetic expression
2829             |                                        # Or else
2830               <error: Expected a valid expression>   # ...Report error, and fail
2831
2832               # Rule definitions here...
2833           }xms;
2834
2835       On an invalid input this example might produce an error message like:
2836
2837           "Expected a valid expression, but found '(2+3]*7/' instead"
2838
2839       The value of the special $CONTEXT variable is found by looking ahead in
2840       the string being matched against, to locate the next sequence of non-
2841       blank characters after the current parsing position. This variable may
2842       also be explicitly used within the "<error: (?{...})>" form of the
2843       directive.
2844
2845       As a special case, if you omit the message entirely from the directive,
2846       it is supplied automatically, derived from the name of the current
2847       rule.  For example, if the following rule were to fail to match:
2848
2849           <rule: Arithmetic_expression>
2850                 <Multiplicative_Expression>+ % ([+-])
2851               | <error:>
2852
2853       the error message queued would be:
2854
2855           "Expected arithmetic expression, but found 'one plus two' instead"
2856
2857       Note, however, that it is still essential to include the colon in the
2858       directive. A common mistake is to write:
2859
2860           <rule: Arithmetic_expression>
2861                 <Multiplicative_Expression>+ % ([+-])
2862               | <error>
2863
2864       which merely attempts to call "<rule: error>" if the first alternative
2865       fails.
2866
2867       Warning messages
2868
2869       Sometimes, you want to detect problems, but not invalidate the entire
2870       parse as a result. For those occasions, the module provides a "less
2871       stringent" form of error reporting: the "<warning:...>" directive.
2872
2873       This directive is exactly the same as an "<error:...>" in every respect
2874       except that it does not induce a failure to match at the point it
2875       appears.
2876
2877       The directive is, therefore, useful for reporting non-fatal problems in
2878       a parse. For example:
2879
2880           qr{ \A            # ...Match only at start of input
2881               <ArithExpr>   # ...Match a valid arithmetic expression
2882
2883               (?:
2884                   # Should be at end of input...
2885                   \s* \Z
2886                 |
2887                   # If not, report the fact but don't fail...
2888                   <warning: Expected end-of-input>
2889                   <warning: (?{ "Extra junk at index $INDEX: $CONTEXT" })>
2890               )
2891
2892               # Rule definitions here...
2893           }xms;
2894
2895       Note that, because they do not induce failure, two or more
2896       "<warning:...>" directives can be "stacked" in sequence, as in the
2897       previous example.
2898
2899       Stubbing
2900
2901       The module also provides three useful shortcuts, specifically to make
2902       it easy to declare, but not define, rules and tokens.
2903
2904       The "<...>" and "<!!!>" directives are equivalent to the directive:
2905
2906           <error: Cannot match RULENAME (not implemented)>
2907
2908       The "<???>" directive is equivalent to the directive:
2909
2910           <warning: Cannot match RULENAME (not implemented)>
2911
2912       For example, in the following grammar:
2913
2914           <grammar: List::Generic>
2915
2916           <rule: List>
2917               <[Item]>+ % (\s*,\s*)
2918
2919           <rule: Item>
2920               <...>
2921
2922       the "Item" rule is declared but not defined. That means the grammar
2923       will compile correctly (the "List" rule won't complain about a call to
2924       a non-existent "Item"), but if the "Item" rule isn't overridden in some
2925       derived grammar, a match-time error will occur when "List" tries to
2926       match the "<...>" within "Item".
2927
2928       Localizing the (semi-)automatic error messages
2929
2930       Error directives of any of the following forms:
2931
2932           <error: Expecting identifier>
2933
2934           <error: >
2935
2936           <...>
2937
2938           <!!!>
2939
2940       or their warning equivalents:
2941
2942           <warning: Expecting identifier>
2943
2944           <warning: >
2945
2946           <???>
2947
2948       each autogenerate part or all of the actual error message they produce.
2949       By default, that autogenerated message is always produced in English.
2950
2951       However, the module provides a mechanism by which you can intercept
2952       every error or warning that is queued to "@!"  via these
2953       directives...and localize those messages.
2954
2955       To do this, you call "Regexp::Grammars::set_error_translator()" (with
2956       the full qualification, since Regexp::Grammars does not export it...nor
2957       anything else, for that matter).
2958
2959       The "set_error_translator()" subroutine expects a single argument,
2960       which must be a reference to another subroutine.  This subroutine is
2961       then called whenever an error or warning message is queued to "@!".
2962
2963       The subroutine is passed three arguments:
2964
2965       ·   the message string,
2966
2967       ·   the name of the rule from which the error or warning was queued,
2968           and
2969
2970       ·   the value of $CONTEXT when the error or warning was encountered
2971
2972       The subroutine is expected to return the final version of the message
2973       that is actually to be appended to "@!". To accomplish this it may make
2974       use of one of the many internationalization/localization modules
2975       available in Perl, or it may do the conversion entirely by itself.
2976
2977       The first argument is always exactly what appeared as a message in the
2978       original directive (regardless of whether that message is supposed to
2979       trigger autogeneration, or is just a "regular" error message).  That
2980       is:
2981
2982           Directive                         1st argument
2983
2984           <error: Expecting identifier>     "Expecting identifier"
2985           <warning: That's not a moon!>     "That's not a moon!"
2986           <error: >                         ""
2987           <warning: >                       ""
2988           <...>                             ""
2989           <!!!>                             ""
2990           <???>                             ""
2991
2992       The second argument always contains the name of the rule in which the
2993       directive was encountered. For example, when invoked from within
2994       "<rule: Frinstance>" the following directives produce:
2995
2996           Directive                         2nd argument
2997
2998           <error: Expecting identifier>     "Frinstance"
2999           <warning: That's not a moon!>     "Frinstance"
3000           <error: >                         "Frinstance"
3001           <warning: >                       "Frinstance"
3002           <...>                             "-Frinstance"
3003           <!!!>                             "-Frinstance"
3004           <???>                             "-Frinstance"
3005
3006       Note that the "unimplemented" markers pass the rule name with a
3007       preceding '-'. This allows your translator to distinguish between
3008       "empty" messages (which should then be generated automatically) and the
3009       "unimplemented" markers (which should report that the rule is not yet
3010       properly defined).
3011
3012       If you call "Regexp::Grammars::set_error_translator()" in a void
3013       context, the error translator is permanently replaced (at least, until
3014       the next call to "set_error_translator()").
3015
3016       However, if you call "Regexp::Grammars::set_error_translator()" in a
3017       scalar or list context, it returns an object whose destructor will
3018       restore the previous translator. This allows you to install a
3019       translator only within a given scope, like so:
3020
3021           {
3022               my $temporary
3023                   = Regexp::Grammars::set_error_translator(\&my_translator);
3024
3025               if ($text =~ $parser) {
3026                   do_stuff_with( %/ );
3027               }
3028               else {
3029                   report_errors_in( @! );
3030               }
3031
3032           } # <--- error translator automagically reverts at this point
3033
3034       Warning: any error translation subroutine you install will be called
3035       during the grammar's parsing phase (i.e. as the grammar's regex is
3036       matching). You should therefore ensure that your translator does not
3037       itself use regular expressions, as nested evaluations of regexes inside
3038       other regexes are extremely problematical (i.e. almost always
3039       disastrous) in Perl.
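
           A translator that respects this restriction might look like the
           following sketch. The %GERMAN_FOR lookup table, the German wording,
           and the decision to pass unknown messages through unchanged are all
           assumptions made for illustration. The subroutine deliberately uses
           only "substr" and string comparisons, never a regex.

```perl
use strict;
use warnings;

# Hypothetical lookup table of fixed messages and their translations
my %GERMAN_FOR = (
    'Expecting identifier' => 'Bezeichner erwartet',
);

sub my_translator {
    my ($message, $rule_name, $context) = @_;
    $context = '' if !defined $context;

    # "Unimplemented" markers pass the rule name with a leading '-'
    if (substr($rule_name, 0, 1) eq '-') {
        my $name = substr($rule_name, 1);
        return "Regel '$name' ist noch nicht implementiert";
    }

    # Empty message: generate the whole text from rule name and context
    if ($message eq '') {
        return "Erwartet: $rule_name, gefunden: '$context'";
    }

    # Fixed message: translate it if known, otherwise pass it through
    return exists $GERMAN_FOR{$message} ? $GERMAN_FOR{$message} : $message;
}

print my_translator('', '-Frinstance', ''), "\n";
```

           Such a subroutine would be installed by passing "\&my_translator" to
           "Regexp::Grammars::set_error_translator()", as in the scoped example
           above.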
3040
3041   Restricting how long a parse runs
3042       Like the core Perl 5 regex engine on which they are built, the grammars
3043       implemented by Regexp::Grammars are essentially top-down parsers. This
3044       means that they may occasionally require an exponentially long time to
3045       parse a particular input. This usually occurs if a particular grammar
3046       includes a lot of recursion or nested backtracking, especially if the
3047       grammar is then matched against a long string.
3048
3049       The judicious use of non-backtracking repetitions (i.e. "x*+" and
3050       "x++") can significantly improve parsing performance in many such
3051       cases. Likewise, carefully reordering any high-level alternatives (so
3052       as to test simple common cases first) can substantially reduce parsing
3053       times.
3054
3055       However, some languages are just intrinsically slow to parse using top-
3056       down techniques (or, at least, may have slow-to-parse corner cases).
3057
3058       To help cope with this constraint, Regexp::Grammars provides a
3059       mechanism by which you can limit the total effort that a given grammar
3060       will expend in attempting to match. The "<timeout:...>" directive
3061       allows you to specify how long a grammar is allowed to continue trying
3062       to match before giving up. It expects a single argument, which must be
3063       an unsigned integer, and it treats this integer as the number of
3064       seconds to continue attempting to match.
3065
3066       For example:
3067
3068           <timeout: 10>    # Give up after 10 seconds
3069
3070       indicates that the grammar should keep attempting to match for another
3071       10 seconds from the point where the directive is encountered during a
3072       parse. If the complete grammar has not matched in that time, the entire
3073       match is considered to have failed, the matching process is immediately
3074       terminated, and a standard error message ('Internal error: Timed out
3075       after 10 seconds (as requested)') is returned in "@!".
3076
3077       A "<timeout:...>" directive can be placed anywhere in a grammar, but is
3078       most usually placed at the very start, so that the entire grammar is
3079       governed by the specified time limit. The second most common
3080       alternative is to place the timeout at the start of a particular
3081       subrule that is known to be potentially very slow.
3082
3083       A common mistake is to put the timeout specification at the top level
3084       of the grammar, but place it after the actual subrule to be matched,
3085       like so:
3086
3087           my $grammar = qr{
3088
3089               <Text_Corpus>      # Subrule to be matched
3090               <timeout: 10>      # Useless use of timeout
3091
3092               <rule: Text_Corpus>
3093                   # et cetera...
3094           }xms;
3095
3096       Since the parser will only reach the "<timeout: 10>" directive after it
3097       has completely matched "<Text_Corpus>", the timeout is only initiated
3098       at the very end of the matching process and so does not limit that
3099       process in any useful way.
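
           For contrast, a minimal sketch of the usual placement puts the
           directive ahead of the subrule call, so that the clock is started
           before any matching is attempted. The body given to "Text_Corpus"
           and the "Word" token are placeholders assumed for illustration.

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $grammar = qr{

    <timeout: 10>      # Start the clock first
    <Text_Corpus>      # Subrule to be matched, now under the time limit

    <rule: Text_Corpus>
        <[Word]>+

    <token: Word>
        \w+
}xms;

my $matched = 'some short text' =~ $grammar ? 1 : 0;
print $matched ? "Matched within the time limit\n" : "Failed: @!\n";
```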
3100
3101       Immediate timeouts
3102
3103       As you might expect, a "<timeout: 0>" directive tells the parser to
3104       keep trying for only zero more seconds, and therefore will immediately
3105       cause the entire surrounding grammar to fail (no matter how deeply
3106       within that grammar the directive is encountered).
3107
3108       This can occasionally be extremely useful. If you know that detecting
3109       a particular datum means that the grammar will never match, no matter
3110       how many other alternatives may subsequently be tried, you can short-
3111       circuit the parser by injecting a "<timeout: 0>" immediately after the
3112       offending datum is detected.
3113
3114       For example, if your grammar only accepts certain versions of the
3115       language being parsed, you could write:
3116
3117           <rule: Valid_Language_Version>
3118                   vers = <%AcceptableVersions>
3119               |
3120                   vers = <bad_version=(\S++)>
3121                   <warning: (?{ "Cannot parse language version $MATCH{bad_version}" })>
3122                   <timeout: 0>
3123
3124       In fact, this "<warning: MSG> <timeout: 0>" sequence is sufficiently
3125       useful, sufficiently complex, and sufficiently easy to get wrong, that
3126       Regexp::Grammars provides a handy shortcut for it: the "<fatal:...>"
3127       directive. A "<fatal:...>" is exactly equivalent to a "<warning:...>"
3128       followed by a zero-timeout, so the previous example could also be
3129       written:
3130
3131           <rule: Valid_Language_Version>
3132                   vers = <%AcceptableVersions>
3133               |
3134                   vers = <bad_version=(\S++)>
3135                   <fatal: (?{ "Cannot parse language version $MATCH{bad_version}" })>
3136
3137       Like "<error:...>" and "<warning:...>", "<fatal:...>" also provides its
3138       own failure context in $CONTEXT, so the previous example could be
3139       further simplified to:
3140
3141           <rule: Valid_Language_Version>
3142                   vers = <%AcceptableVersions>
3143               |
3144                   vers = <fatal:(?{ "Cannot parse language version $CONTEXT" })>
3145
3146       Also like "<error:...>", "<fatal:...>" can autogenerate an error
3147       message if none is provided, so the example could be still further
3148       reduced to:
3149
3150           <rule: Valid_Language_Version>
3151                   vers = <%AcceptableVersions>
3152               |
3153                   vers = <fatal:>
3154
3155       In this last case, however, the error message returned in "@!" would no
3156       longer be:
3157
3158           Cannot parse language version 0.95
3159
3160       It would now be:
3161
3162           Expected valid language version, but found '0.95' instead
3163

Scoping considerations

3165       If you intend to use a grammar as part of a larger program that
3166       contains other (non-grammatical) regexes, it is more efficient--and
3167       less error-prone--to avoid having Regexp::Grammars process those
3168       regexes as well. So it's often a good idea to declare your grammar in a
3169       "do" block, thereby restricting the scope of the module's effects.
3170
3171       For example:
3172
3173           my $grammar = do {
3174               use Regexp::Grammars;
3175               qr{
3176                   <file>
3177
3178                   <rule: file>
3179                       <prelude>
3180                       <data>
3181                       <postlude>
3182
3183                   <rule: prelude>
3184                       # etc.
3185               }x;
3186           };
3187
3188       Because the effects of Regexp::Grammars are lexically scoped, any
3189       regexes defined outside that "do" block will be unaffected by the
3190       module.
3191

INTERFACE

3193   Perl API
3194       "use Regexp::Grammars;"
3195           Causes all regexes in the current lexical scope to be compile-time
3196           processed for grammar elements.
3197
3198       "$str =~ $grammar"
3199       "$str =~ /$grammar/"
3200           Attempt to match the grammar against the string, building a nested
3201           data structure from it.
3202
3203       "%/"
3204           This hash is assigned the nested data structure created by any
3205           successful match of a grammar regex.
3206
3207       "@!"
3208           This array is assigned the queue of error messages created by any
3209           unsuccessful match attempt of a grammar regex.
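
           The pieces of this API fit together roughly as in the following
           sketch. The "Pair", "Key", and "Value" names and the sample input
           are assumptions made for illustration. Since "%/" is reassigned by
           any later successful grammar match, the sketch copies it into a
           lexical hash straight away.

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr{
    <Pair>

    <rule: Pair>
        <Key> = <Value>

    <token: Key>
        \w+

    <token: Value>
        \d+
}xms;

my %result;

if ('width = 42' =~ $parser) {
    %result = %/;            # Keep a private copy of the result hash
    print "Key:   $result{Pair}{Key}\n";
    print "Value: $result{Pair}{Value}\n";
}
else {
    print "$_\n" for @!;     # Report whatever errors were queued
}
```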
3210
3211   Grammar syntax
3212       Directives
3213
3214       "<rule: IDENTIFIER>"
3215           Define a rule whose name is specified by the supplied identifier.
3216
3217           Everything following the "<rule:...>" directive (up to the next
3218           "<rule:...>" or "<token:...>" directive) is treated as part of the
3219           rule being defined.
3220
3221           Any whitespace in the rule is replaced by a call to the "<.ws>"
3222           subrule (which defaults to matching "\s*", but may be explicitly
3223           redefined).
3224
3225       "<token: IDENTIFIER>"
3226           Define a token whose name is specified by the supplied identifier.
3227
3228           Everything following the "<token:...>" directive (up to the next
3229           "<rule:...>" or "<token:...>" directive) is treated as part of the
3230           token being defined.
3231
3232           Any whitespace in the rule is ignored (under the "/x" modifier), or
3233           explicitly matched (if "/x" is not used).
3234
3235       "<objrule:  IDENTIFIER>"
3236       "<objtoken: IDENTIFIER>"
3237           Identical to a "<rule: IDENTIFIER>" or "<token: IDENTIFIER>"
3238           declaration, except that the rule or token will also bless the hash
3239           it normally returns, converting it to an object of a class whose
3240           name is the same as the rule or token itself.
3241
3242       "<require: (?{ CODE }) >"
3243           The code block is executed and if its final value is true, matching
3244           continues from the same position. If the block's final value is
3245           false, the match fails at that point and starts backtracking.
3246
3247       "<error: (?{ CODE })  >"
3248       "<error: LITERAL TEXT >"
3249       "<error: >"
3250           This directive queues a conditional error message within the global
3251           special variable "@!" and then fails to match at that point (that
3252           is, it is equivalent to a "(?!)" or "(*FAIL)" when matching).
3253
3254       "<fatal: (?{ CODE })  >"
3255       "<fatal: LITERAL TEXT >"
3256       "<fatal: >"
3257           This directive is exactly the same as an "<error:...>" in every
3258           respect except that it immediately causes the entire surrounding
3259           grammar to fail, and parsing to cease immediately.
3260
3261       "<warning: (?{ CODE })  >"
3262       "<warning: LITERAL TEXT >"
3263           This directive is exactly the same as an "<error:...>" in every
3264           respect except that it does not induce a failure to match at the
3265           point it appears. That is, it is equivalent to a "(?=)" ["succeed
3266           and continue matching"], rather than a "(?!)" ["fail and
3267           backtrack"].
3268
3269       "<debug: COMMAND >"
3270           During the matching of grammar regexes send debugging and warning
3271           information to the specified log file (see "<logfile: LOGFILE>").
3272
3273           The available "COMMAND"s are:
3274
3275               <debug: continue>    ___ Debug until end of complete parse
3276               <debug: run>         _/
3277
3278               <debug: on>          ___ Debug until next subrule match
3279               <debug: match>       _/
3280
3281               <debug: try>         ___ Debug until next subrule call or match
3282               <debug: step>        _/
3283
3284               <debug: same>        ___ Maintain current debugging mode
3285
3286               <debug: off>         ___ No debugging
3287
3288           See also the $DEBUG special variable.
3289
3290       "<logfile: LOGFILE>"
3291       "<logfile:    -   >"
3292           During the compilation of grammar regexes, send debugging and
3293           warning information to the specified LOGFILE (or to *STDERR if "-"
3294           is specified).
3295
3296           If the specified LOGFILE name contains a %t, it is replaced with a
3297           (sortable) "YYYYMMDD.HHMMSS" timestamp. For example:
3298
3299               <logfile: test-run-%t >
3300
3301           executed at around 9.30pm on the 21st of March 2009, would generate
3302           a log file named: "test-run-20090321.213056"
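
               The shape of that timestamp can be reproduced with the core
               POSIX module. This is only an illustration of the format, not
               the module's own implementation, and the variable names are
               hypothetical.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Build a "YYYYMMDD.HHMMSS" stamp for 21 March 2009, 21:30:56
# (strftime takes sec, min, hour, mday, zero-based month, year - 1900)
my $stamp   = strftime('%Y%m%d.%H%M%S', 56, 30, 21, 21, 2, 109);
my $logname = "test-run-$stamp";

print "$logname\n";    # prints "test-run-20090321.213056"
```

               Because the fields run from most to least significant, sorting
               such file names as plain strings also sorts them by time.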
3303
3304       "<log: (?{ CODE })  >"
3305       "<log: LITERAL TEXT >"
3306           Append a message to the log file. If the argument is a code block,
3307           that code is expected to return the text of the message; if the
3308           argument is anything else, that something else is the literal
3309           message.
3310
3311           If the block returns two or more values, the first is treated as a
3312           log message severity indicator, and the remaining values as
3313           separate lines of text to be logged.
3314
3315       "<timeout: INT >"
3316           Restrict the match-time of the parse to the specified number of
3317           seconds.  Queues an error message and terminates the entire match
3318           process if the parse does not complete within the nominated time
3319           limit.
3320
3321       Subrule calls
3322
3323       "<IDENTIFIER>"
3324           Call the subrule whose name is IDENTIFIER.
3325
3326           If it matches successfully, save the hash it returns in the current
3327           scope's result-hash, under the key 'IDENTIFIER'.
3328
3329       "<IDENTIFIER_1=IDENTIFIER_2>"
3330           Call the subrule whose name is IDENTIFIER_2.
3331
3332           If it matches successfully, save the hash it returns in the current
3333           scope's result-hash, under the key 'IDENTIFIER_1'.
3334
3335           In other words, the "IDENTIFIER_1=" prefix changes the key under
3336           which the result of calling a subrule is stored.
3337
3338       "<.IDENTIFIER>"
3339           Call the subrule whose name is IDENTIFIER.  Don't save the hash it
3340           returns.
3341
3342           In other words, the "dot" prefix disables saving of subrule
3343           results.
3344
3345       "<IDENTIFIER= ( PATTERN )>"
3346           Match the subpattern PATTERN.
3347
3348           If it matches successfully, capture the substring it matched and
3349           save that substring in the current scope's result-hash, under the
3350           key 'IDENTIFIER'.
3351
3352       "<.IDENTIFIER= ( PATTERN )>"
3353           Match the subpattern PATTERN.  Don't save the substring it matched.
3354
3355       "<IDENTIFIER= %HASH>"
3356           Match a sequence of non-whitespace then verify that the sequence is
3357           a key in the specified hash.
3358
3359           If it matches successfully, capture the sequence it matched and
3360           save that substring in the current scope's result-hash, under the
3361           key 'IDENTIFIER'.
3362
3363       "<%HASH>"
3364           Match a key from the hash.  Don't save the substring it matched.
3365
3366       "<IDENTIFIER= (?{ CODE })>"
3367           Execute the specified CODE.
3368
3369           Save the result (of the final expression that the CODE evaluates)
3370           in the current scope's result-hash, under the key 'IDENTIFIER'.
3371
3372       "<[IDENTIFIER]>"
3373           Call the subrule whose name is IDENTIFIER.
3374
3375           If it matches successfully, append the hash it returns to a nested
3376           array within the current scope's result-hash, under the key
3377           'IDENTIFIER'.
3378
3379       "<[IDENTIFIER_1=IDENTIFIER_2]>"
3380           Call the subrule whose name is IDENTIFIER_2.
3381
3382           If it matches successfully, append the hash it returns to a nested
3383           array within the current scope's result-hash, under the key
3384           'IDENTIFIER_1'.
3385
3386       "<ANY_SUBRULE>+ % <ANY_OTHER_SUBRULE>"
3387       "<ANY_SUBRULE>* % <ANY_OTHER_SUBRULE>"
3388       "<ANY_SUBRULE>+ % (PATTERN)"
3389       "<ANY_SUBRULE>* % (PATTERN)"
3390           Repeatedly call the first subrule.  Keep matching as long as the
3391           subrule matches, provided successive matches are separated by
3392           matches of the second subrule or the pattern.
3393
3394           In other words, match a list of ANY_SUBRULE's separated by
3395           ANY_OTHER_SUBRULE's or PATTERN's.
3396
3397           Note that, if a pattern is used to specify the separator, it must
3398           be specified in some kind of matched parentheses. These may be
3399           capturing ["(...)"], non-capturing ["(?:...)"], non-backtracking
3400           ["(?>...)"], or any other construct enclosed by an opening and
3401           closing paren.
3402
3403       "<ANY_SUBRULE>+ %% <ANY_OTHER_SUBRULE>"
3404       "<ANY_SUBRULE>* %% <ANY_OTHER_SUBRULE>"
3405       "<ANY_SUBRULE>+ %% (PATTERN)"
3406       "<ANY_SUBRULE>* %% (PATTERN)"
3407           Repeatedly call the first subrule.  Keep matching as long as the
3408           subrule matches, provided successive matches are separated by
3409           matches of the second subrule or the pattern.
3410
3411           Also allow an optional final trailing instance of the second
3412           subrule or pattern (this is where "%%" differs from "%").
3413
3414           In other words, match a list of ANY_SUBRULE's separated by
3415           ANY_OTHER_SUBRULE's or PATTERN's, with a possible final separator.
3416
3417           As for the single "%" operator, if a pattern is used to specify the
3418           separator, it must be specified in some kind of matched
3419           parentheses.  These may be capturing ["(...)"], non-capturing
3420           ["(?:...)"], non-backtracking ["(?>...)"], or any other construct
3421           enclosed by an opening and closing paren.
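
               The difference between the two operators can be seen by running
               the same inputs through two otherwise identical grammars. This
               is a sketch only: the "List" and "Item" names, the anchoring
               with "\A" and "\z", and the %accepted bookkeeping hash are
               assumptions made for illustration.

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Items separated by commas, with no trailing separator allowed
my $strict_list = qr{
    \A <List> \z

    <rule: List>
        <[Item]>+ % (,)

    <token: Item>
        \w+
}xms;

# Same grammar, except that a final trailing comma is tolerated
my $lenient_list = qr{
    \A <List> \z

    <rule: List>
        <[Item]>+ %% (,)

    <token: Item>
        \w+
}xms;

my %accepted;
for my $input ('a,b,c', 'a,b,c,') {
    $accepted{$input} = {
        strict  => ($input =~ $strict_list  ? 1 : 0),
        lenient => ($input =~ $lenient_list ? 1 : 0),
    };
    print "${input}: strict=$accepted{$input}{strict}",
          " lenient=$accepted{$input}{lenient}\n";
}
```

               Both grammars should accept the first input, while only the
               "%%" version should accept the one with the trailing comma.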
3422
3423   Special variables within grammar actions
3424       $CAPTURE
3425       $CONTEXT
3426           These are both aliases for the built-in read-only $^N variable,
3427           which always contains the substring matched by the nearest
3428           preceding "(...)"  capture. $^N still works perfectly well, but
3429           these are provided to improve the readability of code blocks and
3430           error messages respectively.
3431
3432       $INDEX
3433           This variable contains the index at which the next match will be
3434           attempted within the string being parsed. It is most commonly used
3435           in "<error:...>" or "<log:...>" directives:
3436
3437               <rule: ListElem>
3438                   <log: (?{ "Trying words at index $INDEX" })>
3439                   <MATCH=( \w++ )>
3440                 |
3441                   <log: (?{ "Trying digits at index $INDEX" })>
3442                   <MATCH=( \d++ )>
3443                 |
3444                   <error: (?{ "Missing ListElem near index $INDEX" })>
3445
3446       %MATCH
3447           This variable contains all the saved results of any subrules called
3448           from the current rule. In other words, subrule calls like:
3449
3450               <ListElem>  <Separator= (,)>
3451
3452           store their respective match results in $MATCH{'ListElem'} and
3453           $MATCH{'Separator'}.
3454
3455       $MATCH
3456           This variable is an alias for $MATCH{"="}. This is the %MATCH entry
3457           for the special "override value". If this entry is defined, its
3458           value overrides the usual "return \%MATCH" semantics of a
3459           successful rule.
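
               As a sketch of how the override behaves, the following grammar
               stores the sum of two captured numbers in the override entry,
               so that the rule returns that number instead of its usual hash.
               The "Sum", "X", and "Y" names and the sample input are
               assumptions made for illustration.

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $adder = qr{
    \A <Sum> \z

    <rule: Sum>
        <X=(\d+)> \+ <Y=(\d+)>
        <MATCH= (?{ $MATCH{X} + $MATCH{Y} })>
}xms;

my $total;

if ('2 + 3' =~ $adder) {
    # With the override set, the 'Sum' entry holds the computed value
    $total = $/{Sum};
    print "Total: $total\n";
}
else {
    print "$_\n" for @!;
}
```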
3460
3461       %ARG
3462           This variable contains all the key/value pairs that were passed
3463           into a particular subrule call. For example, given the call:
3464
3465               <Keyword>  <Command>  <Terminator(:Keyword)>
3466
3467           the "Terminator" rule could get access to the text matched by
3468           "<Keyword>" like so:
3469
3470               <token: Terminator>
3471                   end_ (??{ $ARG{'Keyword'} })
3472
3473           Note that to match against the calling subrule's 'Keyword' value,
3474           it's necessary to use either a deferred interpolation ("(??{...})")
3475           or a qualified matchref:
3476
3477               <token: Terminator>
3478                   end_ <\:Keyword>
3479
3480           A common mistake is to attempt to directly interpolate the
3481           argument:
3482
3483               <token: Terminator>
3484                   end_ $ARG{'Keyword'}
3485
3486           This evaluates $ARG{'Keyword'} when the grammar is compiled, rather
3487           than when the rule is matched.
3488
3489       $_  At the start of any code blocks inside any regex, the variable $_
3490           contains the complete string being matched against. The current
3491           matching position within that string is given by: "pos($_)".
3492
3493       $DEBUG
3494           This variable stores the current debugging mode (which may be any
3495           of: 'off', 'on', 'run', 'continue', 'match', 'step', or 'try'). It
3496           is set automatically by the "<debug:...>" command, but may also be
3497           set manually in a code block (which can be useful for conditional
3498           debugging). For example:
3499
3500               <rule: ListElem>
3501                   <Identifier>
3502
3503                   # Conditionally debug if 'foobar' encountered...
3504                   (?{ $DEBUG = $MATCH{Identifier} eq 'foobar' ? 'step' : 'off' })
3505
3506                   <Modifier>?
3507
3508           See also: the "<logfile: LOGFILE>" and "<debug: DEBUG_CMD>" directives.
3509

IMPORTANT CONSTRAINTS AND LIMITATIONS

3511       ·   Prior to Perl 5.14, the Perl 5 regex engine was not reentrant. So
3512           any attempt to perform a regex match inside a "(?{ ... })" or "(??{
3513           ... })" under Perl 5.12 or earlier will almost certainly lead to
3514           either weird data corruption or a segfault.
3515
3516           The same calamities can also occur in any constructor called by
3517           "<objrule:>". If the constructor invokes another regex in any way,
3518           it will most likely fail catastrophically. In particular, this
3519           means that Moose constructors will frequently crash and burn within
3520           a Regexp::Grammars grammar (for example, if the Moose-based class
3521           declares an attribute type constraint such as 'Int', which Moose
3522           checks using a regex).
3523
3524       ·   The additional regex constructs this module provides are
3525           implemented by rewriting regular expressions. This is a (safer)
3526           form of source filtering, but still subject to all the same
3527           limitations and fallibilities of any other macro-based solution.
3528
3529       ·   In particular, rewriting the macros involves the insertion of (a
3530           lot of) extra capturing parentheses. This means you can no longer
3531           assume that particular capturing parens correspond to particular
3532           numeric variables: i.e. to $1, $2, $3 etc. If you want to capture
3533           directly, use Perl 5.10's named capture construct:
3534
3535               (?<name> [^\W\d]\w* )
3536
3537           Better still, capture the data in its correct hierarchical context
3538           using the module's "named subpattern" construct:
3539
3540               <name= ([^\W\d]\w*) >
3541
3542       ·   No recursive descent parser--including those created with
3543           Regexp::Grammars--can directly handle left-recursive grammars with
3544           rules of the form:
3545
3546               <rule: List>
3547                   <List> , <ListElem>
3548
3549           If you find yourself attempting to write a left-recursive grammar
3550           (which Perl 5.10 may or may not complain about, but will never
3551           successfully parse with), then you probably need to use the
3552           "separated list" construct instead:
3553
3554               <rule: List>
3555                   <[ListElem]>+ % (,)
3556
3557       ·   Grammatical parsing with Regexp::Grammars can fail if your grammar
3558           uses "non-backtracking" directives (i.e. the "(?>...)" block or the
3559           "?+", "*+", or "++" repetition specifiers). The problem appears to
3560           be that preventing the regex from backtracking through the in-regex
3561           actions that Regexp::Grammars adds causes the module's internal
3562           stack to fall out of sync with the regex match.
3563
3564           For the time being, if your grammar does not work as expected, you
3565           may need to replace one or more "non-backtracking" directives with
3566           their regular (i.e. backtracking) equivalents.
3567
3568       ·   Similarly, parsing with Regexp::Grammars will fail if your grammar
3569           places a subrule call within a positive look-ahead, since these
3570           don't play nicely with the data stack.
3571
3572           This seems to be an internal problem with perl itself.
3573           Investigations, and attempts at a workaround, are proceeding.
3574
3575           For the time being, you need to make sure that grammar rules don't
3576           appear inside a positive lookahead or use the "<?RULENAME>"
3577           construct instead.
3578

DIAGNOSTICS

3580       Note that (because the author cannot find a way to throw exceptions
3581       from within a regex) none of the following diagnostics actually throws
3582       an exception.
3583
3584       Instead, these messages are simply written to the specified parser
3585       logfile (or to *STDERR, if no logfile is specified).
3586
3587       However, any fatal match-time message will immediately terminate the
3588       parser matching and will still set $@ (as if an exception had been
3589       thrown and caught at that point in the code). You then have the option
3590       to check $@ immediately after matching with the grammar, and rethrow if
3591       necessary:
3592
3593           if ($input =~ $grammar) {
3594               process_data_in(\%/);
3595           }
3596           else {
3597               die if $@;
3598           }
3599
3600       "Found call to %s, but no %s was defined in the grammar"
3601           You specified a call to a subrule for which there was no definition
3602           in the grammar. Typically that's either because you forgot to
3603           define the rule, or because you misspelled either the definition or
3604           the subrule call. For example:
3605
3606               <file>
3607
3608               <rule: fiel>            <---- misspelled rule
3609                   <lines>             <---- used but never defined
3610
3611           Regexp::Grammars converts any such subrule call attempt to an
3612           instant catastrophic failure of the entire parse, so if your parser
3613           ever actually tries to perform that call, Very Bad Things will
3614           happen.
3615
3616       "Entire parse terminated prematurely while attempting to call
3617       non-existent rule: %s"
3618           You ignored the previous error and actually tried to call a
3619           subrule for which there was no definition in the grammar. Very Bad
3620           Things are now happening. The parser got very upset, took its ball,
3621           and went home.  See the preceding diagnostic for remedies.
3622
3623           This diagnostic should throw an exception, but can't. So it sets $@
3624           instead, allowing you to trap the error manually if you wish.
3625
3626       "Fatal error: <objrule: %s> returned a non-hash-based object"
3627           An <objrule:> was specified and returned a blessed object that
3628           wasn't a hash. This will break the behaviour of the grammar, so the
3629           module immediately reports the problem and gives up.
3630
3631           The solution is to use only hash-based classes with <objrule:>.
3632
3633       "Can't match against <grammar: %s>"
3634           The regex you attempted to match against defined a pure grammar,
3635           using the "<grammar:...>" directive. Pure grammars have no start-
3636           pattern and hence cannot be matched against directly.
3637
3638           You need to define a matchable grammar that inherits from your pure
3639           grammar and then calls one of its rules. For example, instead of:
3640
3641               my $greeting = qr{
3642                   <grammar: Greeting>
3643
3644                   <rule: greet>
3645                       Hi there
3646                       | Hello
3647                       | Yo!
3648               }xms;
3649
3650           you need:
3651
3652               qr{
3653                   <grammar: Greeting>
3654
3655                   <rule: greet>
3656                       Hi there
3657                     | Hello
3658                     | Yo!
3659               }xms;
3660
3661               my $greeting = qr{
3662                   <extends: Greeting>
3663                   <greet>
3664               }xms;
3665
3666       "Inheritance from unknown grammar requested by <%s>"
3667           You used an "<extends:...>" directive to request that your grammar
3668           inherit from another, but the grammar you asked to inherit from
3669           doesn't exist.
3670
3671           Check the spelling of the grammar name, and that it's already been
3672           defined somewhere earlier in your program.
3673
3674       "Redeclaration of <%s> will be ignored"
3675           You defined two or more rules or tokens with the same name.  The
3676           first one defined in the grammar will be used; the rest will be
3677           ignored.
3678
3679           To get rid of the warning, get rid of the extra definitions (or, at
3680           least, comment them out or rename the rules).
3681
3682       "Possible invalid subrule call %s"
3683           Your grammar contained something of the form:
3684
3685               <identifier
3686               <.identifier
3687               <[identifier
3688
3689           which you might have intended to be a subrule call, but which
3690           didn't correctly parse as one. If it was supposed to be a
3691           Regexp::Grammars subrule call, you need to check the syntax you
3692           used. If it wasn't supposed to be a subrule call, you can silence
3693           the warning by rewriting it and quoting the leading angle:
3694
3695               \<identifier
3696               \<.identifier
3697               \<[identifier
3698
3699       "Possible failed attempt to specify a directive: %s"
3700           Your grammar contained something of the form:
3701
3702               <identifier:...
3703
3704           but which wasn't a known directive like "<rule:...>" or
3705           "<debug:...>". If it was supposed to be a Regexp::Grammars
3706           directive, check the spelling of the directive name. If it wasn't
3707           supposed to be a directive, you can silence the warning by
3708           rewriting it and quoting the leading angle:
3709
3710               \<identifier:
3711
3712       "Possible failed attempt to specify a subrule call %s"
3713           Your grammar contained something of the form:
3714
3715               <identifier...
3716
3717           but which wasn't a call to a known subrule like "<ident>" or
3718           "<name>". If it was supposed to be a Regexp::Grammars subrule call,
3719           check the spelling of the rule name in the angles. If it wasn't
3720           supposed to be a subrule call, you can silence the warning by
3721           rewriting it and quoting the leading angle:
3722
3723               \<identifier...
3724
3725       "Repeated subrule %s will only capture its final match"
3726           You specified a subrule call with a repetition qualifier, such as:
3727
3728               <ListElem>*
3729
3730           or:
3731
3732               <ListElem>+
3733
3734           Because each subrule call saves its result in a hash entry of the
3735           same name, each repeated match will overwrite the previous ones, so
3736           only the last match will ultimately be saved. If you want to save
3737           all the matches, you need to tell Regexp::Grammars to save the
3738           sequence of results as a nested array within the hash entry, like
3739           so:
3740
3741               <[ListElem]>*
3742
3743           or:
3744
3745               <[ListElem]>+
3746
3747           If you really did intend to throw away every result but the final
3748           one, you can silence the warning by placing the subrule call inside
3749           any kind of parentheses. For example:
3750
3751               (<ListElem>)*
3752
3753           or:
3754
3755               (?: <ListElem> )+
3756
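           The difference between the two forms can be seen by matching the
           same input against each. The following is a minimal sketch; the
           Digit token and the helper subroutines all_digits() and
           last_digit() are hypothetical names chosen for this example.

```perl
use strict;
use warnings;
use Regexp::Grammars;

# List capture: every repetition is appended to an array under 'Digit'
my $keep_all = qr{
    \A <[Digit]>+ \z

    <token: Digit>
        \d
}xms;

# Scalar capture in parens: each repetition overwrites the previous result
my $keep_last = qr{
    \A (?: <Digit> )+ \z

    <token: Digit>
        \d
}xms;

# Return an array ref of all matched digits, or undef if there is no match
sub all_digits {
    my ($input) = @_;
    return $input =~ $keep_all ? [ @{ $/{Digit} } ] : undef;
}

# Return only the final matched digit, or undef if there is no match
sub last_digit {
    my ($input) = @_;
    return $input =~ $keep_last ? $/{Digit} : undef;
}

print join(',', @{ all_digits('123') }), "\n";
print last_digit('123'), "\n";
```
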
3757       "Unable to open log file '$filename' (%s)"
3758           You specified a "<logfile:...>" directive but the file whose name
3759           you specified could not be opened for writing (for the reason given
3760           in the parens).
3761
3762           Did you misspell the filename, or get the permissions wrong
3763           somewhere in the filepath?
3764
3765       "Non-backtracking subrule %s may not revert correctly during
3766       backtracking"
3767           Because of inherent limitations in the Perl regex engine, non-
3768           backtracking constructs like "++", "*+", "?+", and "(?>...)" do not
3769           always work correctly when applied to subrule calls, especially in
3770           earlier versions of Perl.
3771
3772           If the grammar doesn't work properly, replace the offending
3773           constructs with regular backtracking versions instead. If the
3774           grammar does work, you can silence the warning by enclosing the
3775           subrule call in any kind of parentheses. For example, change:
3776
3777               <[ListElem]>++
3778
3779           to:
3780
3781               (?: <[ListElem]> )++
3782
3783       "Unexpected item before first subrule specification in definition of
3784       <grammar: %s>"
3785           Named grammar definitions must consist only of rule and token
3786           definitions.  They cannot have patterns before the first
3787           definitions.  You had some kind of pattern before the first
3788           definition, which will be completely ignored within the grammar.
3789
3790           To silence the warning, either comment out or delete whatever is
3791           before the first rule/token definition.
3792
3793       "No main regex specified before rule definitions"
3794           You specified an unnamed grammar (i.e. no "<grammar:...>"
3795           directive), but didn't specify anything for it to actually match,
3796           just some rules that you don't actually call. For example:
3797
3798               my $grammar = qr{
3799
3800                   <rule: list>    \( <item> +% [,] \)
3801
3802                   <token: item>   <list> | \d+
3803               }x;
3804
3805           You have to provide something before the first rule to start the
3806           matching off. For example:
3807
3808               my $grammar = qr{
3809
3810                   <list>   # <--- This tells the grammar how to start matching
3811
3812                   <rule: list>    \( <item> +% [,] \)
3813
3814                   <token: item>   <list> | \d+
3815               }x;
3816
3817       "Ignoring useless empty <ws:> directive"
3818           The "<ws:...>" directive specifies what whitespace matches within
3819           the current rule. An empty "<ws:>" directive would cause whitespace
3820           to match nothing at all, which is what happens in a token
3821           definition, not in a rule definition.
3822
3823           Either put some subpattern inside the empty "<ws:...>" or, if you
3824           really do want whitespace to match nothing at all, remove the
3825           directive completely and change the rule definition to a token
3826           definition.
3827
3828       "Ignoring useless <ws: %s > directive in a token definition"
3829           The "<ws:...>" directive is used to specify what whitespace matches
3830           within a rule. Since whitespace never matches anything inside
3831           tokens, putting a "<ws:...>" directive in a token is a waste of
3832           time.
3833
3834           Either remove the useless directive, or else change the surrounding
3835           token definition to a rule definition.
3836
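           As an illustration of why the directive only makes sense in a
           rule, the sketch below defines the same pattern once as a rule
           and once as a token. The names Pair, Key and Value, and the two
           helper subroutines, are assumptions made for this example only.

```perl
use strict;
use warnings;
use Regexp::Grammars;

# In a rule, whitespace in the definition matches optional input whitespace
my $as_rule = qr{
    \A <Pair> \z

    <rule: Pair>
        <Key> = <Value>

    <token: Key>    \w+
    <token: Value>  \w+
}xms;

# In a token, whitespace in the definition is ignored and matches nothing
my $as_token = qr{
    \A <Pair> \z

    <token: Pair>
        <Key> = <Value>

    <token: Key>    \w+
    <token: Value>  \w+
}xms;

# Each helper returns 1 if the input matches, 0 otherwise
sub matches_as_rule  { my ($input) = @_; return $input =~ $as_rule  ? 1 : 0 }
sub matches_as_token { my ($input) = @_; return $input =~ $as_token ? 1 : 0 }

for my $input ('a=b', 'a = b') {
    printf "%-7s rule:%d token:%d\n",
        "'$input'", matches_as_rule($input), matches_as_token($input);
}
```
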
3837       "Quantifier that doesn't quantify anything: <%s>"
3838           You specified a rule or token something like:
3839
3840               <token: star>  *
3841
3842           or:
3843
3844               <rule: add_op>  plus | add | +
3845
3846           but the "*" and "+" in those examples are both regex meta-
3847           operators: quantifiers that usually cause what precedes them to
3848           match repeatedly.  In these cases however, nothing is preceding the
3849           quantifier, so it's a Perl syntax error.
3850
3851           You almost certainly need to escape the meta-characters in some
3852           way.  For example:
3853
3854               <token: star>  \*
3855
3856               <rule: add_op>  plus | add | [+]
3857

CONFIGURATION AND ENVIRONMENT

3859       Regexp::Grammars requires no configuration files or environment
3860       variables.
3861

DEPENDENCIES

3863       This module only works under Perl 5.10 or later.
3864

INCOMPATIBILITIES

3866       This module is likely to be incompatible with any other module that
3867       automagically rewrites regexes. For example it may conflict with
3868       Regexp::DefaultFlags, Regexp::DeferredExecution, or Regexp::Extended.
3869

BUGS

3871       No bugs have been reported.
3872
3873       Please report any bugs or feature requests to
3874       "bug-regexp-grammars@rt.cpan.org", or through the web interface at
3875       <http://rt.cpan.org>.
3876

AUTHOR

3878       Damian Conway  "<DCONWAY@CPAN.org>"
3879

LICENCE AND COPYRIGHT

3881       Copyright (c) 2009, Damian Conway "<DCONWAY@CPAN.org>". All rights
3882       reserved.
3883
3884       This module is free software; you can redistribute it and/or modify it
3885       under the same terms as Perl itself. See perlartistic.
3886

DISCLAIMER OF WARRANTY

3888       BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
3889       FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT
3890       WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER
3891       PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND,
3892       EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
3893       WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
3894       ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH
3895       YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL
3896       NECESSARY SERVICING, REPAIR, OR CORRECTION.
3897
3898       IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
3899       WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
3900       REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE
3901       TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR
3902       CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
3903       SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
3904       RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
3905       FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
3906       SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
3907       DAMAGES.
3908
3909
3910
3911perl v5.30.1                      2020-01-30               Regexp::Grammars(3)