Regexp::Grammars(3)   User Contributed Perl Documentation  Regexp::Grammars(3)

NAME

       Regexp::Grammars - Add grammatical parsing features to Perl 5.10
       regexes

VERSION

       This document describes Regexp::Grammars version 1.057

SYNOPSIS

           use Regexp::Grammars;

           my $parser = qr{
               (?:
                   <Verb>               # Parse and save a Verb in a scalar
                   <.ws>                # Parse but don't save whitespace
                   <Noun>               # Parse and save a Noun in a scalar

                   <type=(?{ rand > 0.5 ? 'VN' : 'VerbNoun' })>
                                        # Save result of expression in a scalar
               |
                   (?:
                       <[Noun]>         # Parse a Noun and save result in a list
                                        #   (saved under the key 'Noun')
                       <[PostNoun=ws]>  # Parse whitespace, save it in a list
                                        #   (saved under the key 'PostNoun')
                   )+

                   <Verb>               # Parse a Verb and save result in a scalar
                                        #   (saved under the key 'Verb')

                   <type=(?{ 'VN' })>   # Save a literal in a scalar
               |
                   <debug: match>       # Turn on the integrated debugger here
                   <.Cmd= (?: mv? )>    # Parse but don't capture a subpattern
                                        #   (name it 'Cmd' for debugging purposes)
                   <[File]>+            # Parse 1+ Files and save them in a list
                                        #   (saved under the key 'File')
                   <debug: off>         # Turn off the integrated debugger here
                   <Dest=File>          # Parse a File and save it in a scalar
                                        #   (saved under the key 'Dest')
               )

               ################################################################

               <token: File>              # Define a subrule named File
                   <.ws>                  #  - Parse but don't capture whitespace
                   <MATCH= ([\w-]+) >     #  - Parse the subpattern and capture
                                          #    matched text as the result of the
                                          #    subrule

               <token: Noun>              # Define a subrule named Noun
                   cat | dog | fish       #  - Match an alternative (as usual)

               <rule: Verb>               # Define a whitespace-sensitive subrule
                   eats                   #  - Match a literal (after any space)
                   <Object=Noun>?         #  - Parse optional subrule Noun and
                                          #    save result under the key 'Object'
               |                          #  Or else...
                   <AUX>                  #  - Parse subrule AUX and save result
                   <part= (eaten|seen) >  #  - Match a literal, save under 'part'

               <token: AUX>               # Define a whitespace-insensitive subrule
                   (has | is)             #  - Match an alternative and capture
                   (?{ $MATCH = uc $^N }) #  - Use captured text as subrule result

           }x;

           # Match the grammar against some text...
           if ($text =~ $parser) {
               # If successful, the hash %/ will have the hierarchy of results...
               process_data_in( %/ );
           }

QUICKSTART CHEATSHEET

   In your program...
           use Regexp::Grammars;    Allow enhanced regexes in lexical scope
           %/                       Result-hash for successful grammar match

   Defining and using named grammars...
           <grammar:  GRAMMARNAME>  Define a named grammar that can be inherited
           <extends:  GRAMMARNAME>  Current grammar inherits named grammar's rules

   Defining rules in your grammar...
           <rule:     RULENAME>     Define rule with magic whitespace
           <token:    RULENAME>     Define rule without magic whitespace

           <objrule:  CLASS= NAME>  Define rule that blesses return-hash into class
           <objtoken: CLASS= NAME>  Define token that blesses return-hash into class

           <objrule:  CLASS>        Shortcut for above (rule name derived from class)
           <objtoken: CLASS>        Shortcut for above (token name derived from class)

   Matching rules in your grammar...
           <RULENAME>                Call named subrule (may be fully qualified)
                                     save result to $MATCH{RULENAME}

           <RULENAME(...)>           Call named subrule, passing args to it

           <!RULENAME>               Call subrule and fail if it matches
           <!RULENAME(...)>          (shorthand for (?!<.RULENAME>) )

           <:IDENT>                  Match contents of $ARG{IDENT} as a pattern
           <\:IDENT>                 Match contents of $ARG{IDENT} as a literal
           </:IDENT>                 Match closing delimiter for $ARG{IDENT}

           <%HASH>                   Match longest possible key of hash
           <%HASH {PAT}>             Match any key of hash that also matches PAT

           </IDENT>                  Match closing delimiter for $MATCH{IDENT}
           <\_IDENT>                 Match the literal contents of $MATCH{IDENT}

           <ALIAS= RULENAME>         Call subrule, save result in $MATCH{ALIAS}
           <ALIAS= %HASH>            Match a hash key, save key in $MATCH{ALIAS}
           <ALIAS= ( PATTERN )>      Match pattern, save match in $MATCH{ALIAS}
           <ALIAS= (?{ CODE })>      Execute code, save value in $MATCH{ALIAS}
           <ALIAS= 'STR' >           Save specified string in $MATCH{ALIAS}
           <ALIAS= 42 >              Save specified number in $MATCH{ALIAS}
           <ALIAS= /IDENT>           Match closing delim, save as $MATCH{ALIAS}
           <ALIAS= \_IDENT>          Match '$MATCH{IDENT}', save as $MATCH{ALIAS}

           <.SUBRULE>                Call subrule (one of the above forms),
                                     but don't save the result in %MATCH

           <[SUBRULE]>               Call subrule (one of the above forms), but
                                     append result instead of overwriting it

           <SUBRULE1>+ % <SUBRULE2>  Match one or more repetitions of SUBRULE1
                                     as long as they're separated by SUBRULE2
           <SUBRULE1> ** <SUBRULE2>  Same (only for backwards compatibility)

           <SUBRULE1>* % <SUBRULE2>  Match zero or more repetitions of SUBRULE1
                                     as long as they're separated by SUBRULE2

           <SUBRULE1>* %% <SUBRULE2> Match zero or more repetitions of SUBRULE1
                                     as long as they're separated by SUBRULE2
                                     and allow an optional trailing SUBRULE2

   In your grammar's code blocks...
           $CAPTURE    Alias for $^N (the most recent paren capture)
           $CONTEXT    Another alias for $^N
           $INDEX      Current index of next matching position in string
           %MATCH      Current rule's result-hash
           $MATCH      Magic override value (returned instead of result-hash)
           %ARG        Current rule's argument hash
           $DEBUG      Current match-time debugging mode

   Directives...
           <require: (?{ CODE })   >  Fail if code evaluates false
           <timeout: INT           >  Fail after specified number of seconds
           <debug:   COMMAND       >  Change match-time debugging mode
           <logfile: LOGFILE       >  Change debugging log file (default: STDERR)
           <fatal:   TEXT|(?{CODE})>  Queue error message and fail parse
           <error:   TEXT|(?{CODE})>  Queue error message and backtrack
           <warning: TEXT|(?{CODE})>  Queue warning message and continue
           <log:     TEXT|(?{CODE})>  Explicitly add a message to debugging log
           <ws:      PATTERN       >  Override automatic whitespace matching
           <minimize:>                Simplify the result of a subrule match
           <context:>                 Switch on context substring retention
           <nocontext:>               Switch off context substring retention

DESCRIPTION

       This module adds a small number of new regex constructs that can be
       used within Perl 5.10 patterns to implement complete recursive-descent
       parsing.

       Perl 5.10 already supports recursive-descent matching, via the new
       "(?<name>...)" and "(?&name)" constructs. For example, here is a simple
       matcher for a subset of the LaTeX markup language:

           $matcher = qr{
               (?&File)

               (?(DEFINE)
                   (?<File>     (?&Element)* )

                   (?<Element>  \s* (?&Command)
                             |  \s* (?&Literal)
                   )

                   (?<Command>  \\ \s* (?&Literal) \s* (?&Options)? \s* (?&Args)? )

                   (?<Options>  \[ \s* (?:(?&Option) (?:\s*,\s* (?&Option) )*)? \s* \])

                   (?<Args>     \{ \s* (?&Element)* \s* \}  )

                   (?<Option>   \s* [^][\$&%#_{}~^\s,]+     )

                   (?<Literal>  \s* [^][\$&%#_{}~^\s]+      )
               )
           }xms

       This technique makes it possible to use regexes to recognize complex,
       hierarchical--and even recursive--textual structures. The problem is
       that Perl 5.10 doesn't provide any support for extracting that
       hierarchical data into nested data structures. In other words, using
       Perl 5.10 you can match complex data, but not parse it into an
       internally useful form.

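       As an illustration of that limitation, here is a minimal sketch that
       uses only core Perl 5.10 syntax (no Regexp::Grammars). The names
       $balanced and "Group" are invented for this example. The pattern
       recognizes nested parentheses, but a successful match only says that
       the text is well-formed; it does not hand back a tree describing the
       nesting:

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical example pattern: recognizes balanced, possibly nested parens.
my $balanced = qr{
    \A (?&Group) \z

    (?(DEFINE)
        (?<Group>  \( (?: [^()]++ | (?&Group) )* \)  )
    )
}x;

# Each test string is reported as balanced or not; no structure is extracted.
for my $text ('(a(b)c)', '(a(b c') {
    say "$text: ", ($text =~ $balanced ? 'balanced' : 'not balanced');
}
```
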
       An additional problem when using Perl 5.10 regexes to match complex
       data formats is that you have to make sure you remember to insert
       whitespace-matching constructs (such as "\s*") at every possible
       position where the data might contain ignorable whitespace. This
       reduces the readability of such patterns, and increases the chance of
       errors (typically caused by overlooking a location where whitespace
       might appear).

       The Regexp::Grammars module solves both those problems.

       If you import the module into a particular lexical scope, it
       preprocesses any regex in that scope, so as to implement a number of
       extensions to the standard Perl 5.10 regex syntax. These extensions
       simplify the task of defining and calling subrules within a grammar,
       and allow those subrule calls to capture and retain the components
       they match in a proper hierarchical manner.

       For example, the above LaTeX matcher could be converted to a full LaTeX
       parser (and considerably tidied up at the same time), like so:

           use Regexp::Grammars;
           $parser = qr{
               <File>

               <rule: File>       <[Element]>*

               <rule: Element>    <Command> | <Literal>

               <rule: Command>    \\  <Literal>  <Options>?  <Args>?

               <rule: Options>    \[  <[Option]>+ % (,)  \]

               <rule: Args>       \{  <[Element]>*  \}

               <rule: Option>     [^][\$&%#_{}~^\s,]+

               <rule: Literal>    [^][\$&%#_{}~^\s]+
           }xms

       Note that there is no need to explicitly place "\s*" subpatterns
       throughout the rules; that is taken care of automatically.

       If the Regexp::Grammars version of this regex were successfully matched
       against some appropriate LaTeX document, each rule would call the
       subrules specified within it, and then return a hash containing
       whatever result each of those subrules returned, with each result
       indexed by the subrule's name.

       That is, if the rule named "Command" were invoked, it would first try
       to match a backslash, then it would call the three subrules
       "<Literal>", "<Options>", and "<Args>" (in that sequence). If they all
       matched successfully, the "Command" rule would then return a hash with
       three keys: 'Literal', 'Options', and 'Args'. The value for each of
       those hash entries would be whatever result-hash the subrules
       themselves had returned when matched.

       In this way, each level of the hierarchical regex can generate hashes
       recording everything its own subrules matched, so when the entire
       pattern matches, it produces a tree of nested hashes that represent the
       structured data the pattern matched.

       For example, if the previous regex grammar were matched against a
       string containing:

           \documentclass[a4paper,11pt]{article}
           \author{D. Conway}

       it would automatically extract a data structure equivalent to the
       following (but with several extra "empty" keys, which are described in
       "Subrule results"):

           {
               'File' => {
                   'Element' => [
                       {
                           'Command' => {
                               'Literal' => 'documentclass',
                               'Options' => {
                                   'Option'  => [ 'a4paper', '11pt' ],
                               },
                               'Args'    => {
                                   'Element' => [
                                       {
                                           'Literal' => 'article',
                                       }
                                   ]
                               }
                           }
                       },
                       {
                           'Command' => {
                               'Literal' => 'author',
                               'Args' => {
                                   'Element' => [
                                       {
                                           'Literal' => 'D.',
                                       },
                                       {
                                           'Literal' => 'Conway',
                                       }
                                   ]
                               }
                           }
                       }
                   ]
               }
           }

       The data structure that Regexp::Grammars produces from a regex match is
       available to the surrounding program in the magic variable "%/".

       Regexp::Grammars provides many features that simplify the extraction of
       hierarchical data via a regex match, and also some features that can
       simplify the processing of that data once it has been extracted. The
       following sections explain each of those features, and some of the
       parsing techniques they support.

   Setting up the module
       Just add:

           use Regexp::Grammars;

       to any lexical scope. Any regexes within that scope will now
       automatically implement the new parsing constructs:

           use Regexp::Grammars;

           my $parser = qr/ regex with $extra <chocolatey> grammar bits /;

       Note that you do not need to use the "/x" modifier when declaring a
       regex grammar (though you certainly may). But even if you don't, the
       module quietly adds a "/x" to every regex within the scope of its usage.
       Otherwise, the default "a whitespace character matches exactly that
       whitespace character" behaviour of Perl regexes would mess up your
       grammar's parsing. If you need the non-"/x" behaviour, you can still
       use the "(?-x)" or "(?-x:...)" directives to switch off "/x" within one
       or more of your grammar's components.

       Once the grammar has been processed, you can then match text against
       the extended regexes, in the usual manner (i.e. via a "=~" match):

           if ($input_text =~ $parser) {
               ...
           }

       After a successful match, the variable "%/" will contain a series of
       nested hashes representing the structured hierarchical data captured
       during the parse.

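       The steps above can be put together in a short, self-contained
       sketch. It assumes that Regexp::Grammars is installed from CPAN; the
       rule names "pair", "key" and "value" and the sample input are invented
       for illustration and are not part of the module:

```perl
use strict;
use warnings;
use Regexp::Grammars;
use Data::Dumper;

# Hypothetical grammar: a lowercase key, an equals sign, and a number.
my $parser = qr{
    <pair>

    <rule: pair>     <key> = <value>

    <token: key>     [a-z]+

    <token: value>   \d+
}x;

# Match in the usual way, then read the nested results from %/.
if ('width = 42' =~ $parser) {
    print Dumper(\%/);
    print "key:   $/{pair}{key}\n";
    print "value: $/{pair}{value}\n";
}
```

       The Dumper output also shows the extra empty-string keys that are
       described under "Subrule results".
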
   Structure of a Regexp::Grammars grammar
       A Regexp::Grammars specification consists of a start-pattern (which may
       include both standard Perl 5.10 regex syntax, as well as special
       Regexp::Grammars directives), followed by one or more rule or token
       definitions.

       For example:

           use Regexp::Grammars;
           my $balanced_brackets = qr{

               # Start-pattern...
               <paren_pair> | <brace_pair>

               # Rule definition...
               <rule: paren_pair>
                   \(  (?: <escape> | <paren_pair> | <brace_pair> | [^()] )*  \)

               # Rule definition...
               <rule: brace_pair>
                   \{  (?: <escape> | <paren_pair> | <brace_pair> | [^{}] )*  \}

               # Token definition...
               <token: escape>
                   \\ .
           }xms;

       The start-pattern at the beginning of the grammar acts like the "top"
       token of the grammar, and must be matched completely for the grammar to
       match.

       This pattern is treated like a token for whitespace matching behaviour
       (see "Tokens vs rules (whitespace handling)").  That is, whitespace in
       the start-pattern is treated like whitespace in any normal Perl regex.

       The rules and tokens are declarations only and they are not directly
       matched.  Instead, they act like subroutines, and are invoked by name
       from the initial pattern (or from within a rule or token).

       Each rule or token extends from the directive that introduces it up to
       either the next rule or token directive, or (in the case of the final
       rule or token) to the end of the grammar.

   Tokens vs rules (whitespace handling)
       The difference between a token and a rule is that a token treats any
       whitespace within it exactly as a normal Perl regular expression would.
       That is, a sequence of whitespace in a token is ignored if the "/x"
       modifier is in effect, or else matches the same literal sequence of
       whitespace characters (if "/x" is not in effect).

       In a rule, most sequences of whitespace are treated as matching the
       implicit subrule "<.ws>", which is automatically predefined to match
       optional whitespace (i.e. "\s*").

       The exceptions to this behaviour are whitespace before a "|", before
       a code block, before an explicit space-matcher (such as "<ws>" or
       "\s"), or at the very end of the rule.

       In other words, a rule such as:

           <rule: sentence>   <noun> <verb>
                          |   <verb> <noun>

       is equivalent to a token with added non-capturing whitespace matching:

           <token: sentence>  <.ws> <noun> <.ws> <verb>
                           |  <.ws> <verb> <.ws> <noun>

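       The following sketch contrasts the two declaration styles. It assumes
       Regexp::Grammars is installed; the grammars, the variable names
       $as_rule and $as_token, and the test strings are all invented for this
       example. The same body is declared once as a rule and once as a token,
       so the only difference is how the space between "<noun>" and "<verb>"
       is interpreted:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Declared as a rule: whitespace in the body becomes an implicit <.ws> call.
my $as_rule = qr{
    \A <sentence> \z

    <rule: sentence>   <noun> <verb>

    <token: noun>      dog | cat

    <token: verb>      barks | purrs
}x;

# Declared as a token: whitespace in the body is ignored (because of /x).
my $as_token = qr{
    \A <sentence> \z

    <token: sentence>  <noun> <verb>

    <token: noun>      dog | cat

    <token: verb>      barks | purrs
}x;

for my $text ('dog barks', 'dogbarks') {
    printf "%-12s rule: %-3s  token: %s\n", "'$text'",
        ($text =~ $as_rule  ? 'yes' : 'no'),
        ($text =~ $as_token ? 'yes' : 'no');
}
```

       Because the default "<ws>" matches optional whitespace, the rule form
       should accept both inputs, whereas the token form should accept only
       the input with no space between the two words.
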
       You can explicitly define a "<ws>" token to change that default
       behaviour. For example, you could alter the definition of "whitespace"
       to include Perlish comments, by adding an explicit "<token: ws>":

           <token: ws>
               (?: \s+ | \#[^\n]* )*

       (Note that the "#" must be backslash-escaped because, under "/x", an
       unescaped "#" would start a regex comment.)

       But be careful not to define "<ws>" as a rule, as this will lead to all
       kinds of infinitely recursive unpleasantness.

       Per-rule whitespace handling

       Redefining the "<ws>" token changes its behaviour throughout the entire
       grammar, within every rule definition. Usually that's appropriate, but
       sometimes you need finer-grained control over whitespace handling.

       So Regexp::Grammars provides the "<ws:>" directive, which allows you to
       override the implicit whitespace-matches-whitespace behaviour only
       within the current rule.

       Note that this directive does not redefine "<ws>" within the rule; it
       simply specifies what to replace each whitespace sequence with (instead
       of replacing each with a "<ws>" call).

       For example, if a language allows one kind of comment between
       statements and another within statements, you could parse it with:

           <rule: program>
               # One type of comment between...
               <ws: (\s++ | \# .*? \n)* >

               # ...semicolon-separated statements...
               <[statement]>+ % ( ; )


           <rule: statement>
               # Another type of comment...
               <ws: (\s*+ | \#{ .*? }\# )* >

               # ...between comma-separated commands...
               <cmd>  <[arg]>+ % ( , )

       Note that each directive only applies to the rule in which it is
       specified. In every other rule in the grammar, whitespace would still
       match the usual "<ws>" subrule.

   Calling subrules
       To invoke a rule to match at any point, just enclose the rule's name in
       angle brackets (like in Perl 6). There must be no space between the
       opening bracket and the rulename. For example:

           qr{
               file:             # Match literal sequence 'f' 'i' 'l' 'e' ':'
               <name>            # Call <rule: name>
               <options>?        # Call <rule: options> (it's okay if it fails)

               <rule: name>
                   # etc.
           }x;

       If you need to match a literal pattern that would otherwise look like a
       subrule call, just backslash-escape the leading angle:

           qr{
               file:             # Match literal sequence 'f' 'i' 'l' 'e' ':'
               \<name>           # Match literal sequence '<' 'n' 'a' 'm' 'e' '>'
               <options>?        # Call <rule: options> (it's okay if it fails)

               <rule: name>
                   # etc.
           }x;

   Subrule results
       If a subrule call successfully matches, the result of that match is a
       reference to a hash. That hash reference is stored in the current
       rule's own result-hash, under the name of the subrule that was invoked.
       The hash will, in turn, contain the results of any more deeply nested
       subrule calls, each stored under the name by which the nested subrule
       was invoked.

       In other words, if the rule "sentence" is defined:

           <rule: sentence>
               <noun> <verb> <object>

       then successfully calling the rule:

           <sentence>

       causes a new hash entry at the current nesting level. That entry's key
       will be 'sentence' and its value will be a reference to a hash, which
       in turn will have keys: 'noun', 'verb', and 'object'.

       In addition each result-hash has one extra key: the empty string. The
       value for this key is whatever substring the entire subrule call
       matched.  This value is known as the context substring.

       So, for example, a successful call to "<sentence>" might add something
       like the following to the current result-hash:

           sentence => {
               ""     => 'I saw a dog',
               noun   => 'I',
               verb   => 'saw',
               object => {
                   ""      => 'a dog',
                   article => 'a',
                   noun    => 'dog',
               },
           }

       Note, however, that if the result-hash at any level contains only the
       empty-string key (i.e. the subrule did not call any sub-subrules or
       save any of their nested result-hashes), then the hash is "unpacked"
       and just the context substring itself is returned.

       For example, if "<rule: sentence>" had been defined:

           <rule: sentence>
               I see dead people

       then a successful call to the rule would only add:

           sentence => 'I see dead people'

       to the current result-hash.

       This is a useful feature because it prevents a series of nested subrule
       calls from producing very unwieldy data structures. For example,
       without this automatic unpacking, even the simple earlier example:

           <rule: sentence>
               <noun> <verb> <object>

       would produce something needlessly complex, such as:

           sentence => {
               ""     => 'I saw a dog',
               noun   => {
                   "" => 'I',
               },
               verb   => {
                   "" => 'saw',
               },
               object => {
                   ""      => 'a dog',
                   article => {
                       "" => 'a',
                   },
                   noun    => {
                       "" => 'dog',
                   },
               },
           }

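       A small sketch can make the context substring and the automatic
       unpacking visible. It assumes Regexp::Grammars is installed; the
       grammar below is a made-up completion of the "sentence" example (the
       "subject", "article" and other token definitions are assumptions, not
       taken from the module):

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: 'object' calls subrules, the other parts do not.
my $parser = qr{
    <sentence>

    <rule: sentence>    <subject> <verb> <object>

    <rule: object>      <article> <noun>

    <token: subject>    I | you

    <token: verb>       saw | heard

    <token: article>    a | the

    <token: noun>       dog | cat
}x;

if ('I saw a dog' =~ $parser) {
    my $sentence = $/{sentence};

    # The empty-string key holds the context substring of the whole rule.
    print "context: $sentence->{''}\n";

    # Leaf subrules are unpacked to plain strings; 'object' stays a hash.
    for my $key (qw(subject verb object)) {
        my $kind = ref $sentence->{$key} ? 'hash' : 'string';
        print "$key is stored as a $kind\n";
    }
    print "object noun: $sentence->{object}{noun}\n";
}
```
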
       Turning off the context substring

       The context substring is convenient for debugging and for generating
       error messages but, in a large grammar, or when parsing a long string,
       the capture and storage of many nested substrings may quickly become
       prohibitively expensive.

       So Regexp::Grammars provides a directive to prevent context substrings
       from being retained. Any rule or token that includes the directive
       "<nocontext:>" anywhere in the rule's body will not retain any context
       substring it matches...unless that substring would be the only entry in
       its result hash (which only happens within objrules and objtokens).

       If a "<nocontext:>" directive appears before the first rule or token
       definition (i.e. as part of the main pattern), then the entire grammar
       will discard all context substrings from every one of its rules and
       tokens.

       However, you can override this universal prohibition with a second
       directive: "<context:>". If this directive appears in any rule or
       token, that rule or token will save its context substring, even if a
       global "<nocontext:>" is in effect.

       This means that this grammar:

           qr{
               <Command>

               <rule: Command>
                   <nocontext:>
                   <Keyword> <arg=(\S+)>+ % <.ws>

               <token: Keyword>
                   <Move> | <Copy> | <Delete>

               # etc.
           }x

       and this grammar:

           qr{
               <nocontext:>
               <Command>

               <rule: Command>
                   <Keyword> <arg=(\S+)>+ % <.ws>

               <token: Keyword>
                   <context:>
                   <Move> | <Copy> | <Delete>

               # etc.
           }x

       will behave identically (saving context substrings for keywords, but
       not for commands), except that the first version will also retain the
       global context substring (i.e. $/{""}), whereas the second version will
       not.

       Note that "<context:>" and "<nocontext:>" have no effect on, or even
       any interaction with, the various result distillation mechanisms, which
       continue to work in the usual way when either or both of the directives
       is used.

   Renaming subrule results
       It is not always convenient to have subrule results stored under the
       same name as the rule itself. Rule names should be optimized for
       understanding the behaviour of the parser, whereas result names should
       be optimized for understanding the structure of the data. Often those
       two goals are identical, but not always; sometimes rule names need to
       describe what the data looks like, while result names need to describe
       what the data means.

       For example, sometimes you need to call the same rule twice, to match
       two syntactically identical components whose positions give them
       semantically distinct meanings:

           <rule: copy_cmd>
               copy <file> <file>

       The problem here is that, if the second call to "<file>" succeeds, its
       result-hash will be stored under the key 'file', clobbering the data
       that was returned from the first call to "<file>".

       To avoid such problems, Regexp::Grammars allows you to alias any
       subrule call, so that it is still invoked by the original name, but its
       result-hash is stored under a different key. The syntax for that is:
       "<alias=rulename>". For example:

           <rule: copy_cmd>
               copy <from=file> <to=file>

       Here, "<rule: file>" is called twice, with the first result-hash being
       stored under the key 'from', and the second result-hash being stored
       under the key 'to'.

       Note, however, that the alias before the "=" must be a proper
       identifier (i.e. a letter or underscore, followed by letters, digits,
       and/or underscores). Aliases that start with an underscore and aliases
       named "MATCH" have special meaning (see "Private subrule calls" and
       "Result distillation" respectively).

       Aliases can also be useful for normalizing data that may appear in
       different formats and sequences. For example:

           <rule: copy_cmd>
               copy <from=file>        <to=file>
             | dup    <to=file>  as  <from=file>
             |      <from=file>  ->    <to=file>
             |        <to=file>  <-  <from=file>

       Here, regardless of which order the old and new files are specified,
       the result-hash always gets:

           copy_cmd => {
               from => 'oldfile',
                 to => 'newfile',
           }

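       The normalizing effect of aliases can be tried out with a reduced
       version of that rule. This is only a sketch, assuming Regexp::Grammars
       is installed; the "file" token definition and the sample commands are
       assumptions added for the example:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Two of the alternatives from the copy_cmd rule, plus a hypothetical
# definition of the 'file' token (word characters and dots).
my $parser = qr{
    <copy_cmd>

    <rule: copy_cmd>
          copy <from=file> <to=file>
        | dup  <to=file> as <from=file>

    <token: file>
        [\w.]+
}x;

# Both spellings should populate the same 'from' and 'to' keys.
for my $cmd ('copy old.txt new.txt', 'dup new.txt as old.txt') {
    if ($cmd =~ $parser) {
        my $result = $/{copy_cmd};
        print "$cmd\n";
        print "    from => $result->{from}, to => $result->{to}\n";
    }
}
```
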
690   List-like subrule calls
691       If a subrule call is quantified with a repetition specifier:
692
693           <rule: file_sequence>
694               <file>+
695
696       then each repeated match overwrites the corresponding entry in the
697       surrounding rule's result-hash, so only the result of the final
698       repetition will be retained. That is, if the above example matched the
699       string "foo.pl bar.py baz.php", then the result-hash would contain:
700
701           file_sequence {
702               ""   => 'foo.pl bar.py baz.php',
703               file => 'baz.php',
704           }
705
706       Usually, that's not the desired outcome, so Regexp::Grammars provides
707       another mechanism by which to call a subrule; one that saves all
708       repetitions of its results.
709
710       A regular subrule call consists of the rule's name surrounded by angle
711       brackets. If, instead, you surround the rule's name with "<[...]>"
712       (angle and square brackets) like so:
713
714           <rule: file_sequence>
715               <[file]>+
716
717       then the rule is invoked in exactly the same way, but the result of
718       that submatch is pushed onto an array nested inside the appropriate
719       result-hash entry. In other words, if the above example matched the
720       same "foo.pl bar.py baz.php" string, the result-hash would contain:
721
           file_sequence => {
723               ""   => 'foo.pl bar.py baz.php',
724               file => [ 'foo.pl', 'bar.py', 'baz.php' ],
725           }
726
727       This "listifying subrule call" can also be useful for non-repeated
728       subrule calls, if the same subrule is invoked in several places in a
       grammar. For example, if a command-line option could be given either
       one or two values, you might parse it like so:
731
732           <rule: size_option>
733               -size <[size]> (?: x <[size]> )?
734
735       The result-hash entry for 'size' would then always contain an array,
736       with either one or two elements, depending on the input being parsed.
737
738       Listifying subrules can also be given aliases, just like ordinary
739       subrules. The alias is always specified inside the square brackets:
740
741           <rule: size_option>
742               -size <[size=pos_integer]> (?: x <[size=pos_integer]> )?
743
744       Here, the sizes are parsed using the "pos_integer" rule, but saved in
745       the result-hash in an array under the key 'size'.
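
       The following sketch exercises that aliased, listifying call. The
       definition of "pos_integer", the sample inputs, and the parse_sizes()
       helper are assumptions added for the example:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $size_parser = qr{
    \A <size_option> \z

    <rule: size_option>
        -size <[size=pos_integer]> (?: x <[size=pos_integer]> )?

    <token: pos_integer>
        [1-9]\d*
}xms;

# Illustrative helper: returns the list of sizes, or an empty list
# if the option does not parse.
sub parse_sizes {
    my ($text) = @_;
    return unless $text =~ $size_parser;
    return @{ $/{size_option}{size} };   # always an array ref, even for one value
}

print join(' by ', parse_sizes('-size 640 x 480')), "\n";
print join(' by ', parse_sizes('-size 100')), "\n";
```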
746
747   Parametric subrules
748       When a subrule is invoked, it can be passed a set of named arguments
       (specified as "key => value" pairs). This argument list is placed in a
750       normal Perl regex code block and must appear immediately after the
751       subrule name, before the closing angle bracket.
752
753       Within the subrule that has been invoked, the arguments can be accessed
754       via the special hash %ARG. For example:
755
756           <rule: block>
757               <tag>
758                   <[block]>*
759               <end_tag(?{ tag=>$MATCH{tag} })>  # ...call subrule with argument
760
761           <token: end_tag>
762               end_ (??{ quotemeta $ARG{tag} })
763
764       Here the "block" rule first matches a "<tag>", and the corresponding
765       substring is saved in $MATCH{tag}. It then matches any number of nested
766       blocks. Finally it invokes the "<end_tag>" subrule, passing it an
767       argument whose name is 'tag' and whose value is the current value of
768       $MATCH{tag} (i.e. the original opening tag).
769
770       When it is thus invoked, the "end_tag" token first matches 'end_', then
771       interpolates the literal value of the 'tag' argument and attempts to
772       match it.
773
774       Any number of named arguments can be passed when a subrule is invoked.
775       For example, we could generalize the "end_tag" rule to allow any prefix
776       (not just 'end_'), and also to allow for 'if...fi'-style reversed tags,
777       like so:
778
779           <rule: block>
780               <tag>
781                   <[block]>*
782               <end_tag (?{ prefix=>'end', tag=>$MATCH{tag} })>
783
784           <token: end_tag>
785               (??{ $ARG{prefix} // q{(?!)} })      # ...prefix as pattern
786               (??{ quotemeta $ARG{tag} })          # ...tag as literal
787             |
788               (??{ quotemeta reverse $ARG{tag} })  # ...reversed tag
789
790       Note that, if you do not need to interpolate values (such as
791       $MATCH{tag}) into a subrule's argument list, you can use simple
792       parentheses instead of "(?{...})", like so:
793
794               <end_tag( prefix=>'end', tag=>'head' )>
795
796       The only types of values you can use in this simplified syntax are
797       numbers and single-quote-delimited strings.  For anything more complex,
798       put the argument list in a full "(?{...})".
799
800       As the earlier examples show, the single most common type of argument
       is one of the form: "IDENTIFIER => $MATCH{IDENTIFIER}". That is, it's
802       a common requirement to pass an element of %MATCH into a subrule, named
803       with its own key.
804
805       Because this is such a common usage, Regexp::Grammars provides a
806       shortcut. If you use simple parentheses (instead of "(?{...})"
807       parentheses) then instead of a pair, you can specify an argument using
808       a colon followed by an identifier.  This argument is replaced by a
809       named argument whose name is the identifier and whose value is the
810       corresponding item from %MATCH. So, for example, instead of:
811
812               <end_tag(?{ prefix=>'end', tag=>$MATCH{tag} })>
813
814       you can just write:
815
816               <end_tag( prefix=>'end', :tag )>
817
818       Note that, from Perl 5.20 onwards, due to changes in the way that Perl
819       parses regexes, Regexp::Grammars does not support explicitly passing
820       elements of %MATCH as argument values within a list subrule (yeah, it's
821       a very specific and obscure edge-case):
822
823               <[end_tag(?{ prefix=>'end', tag=>$MATCH{tag} })]>   # Does not work
824
825       Note, however, that the shortcut:
826
827               <[end_tag( prefix=>'end', :tag )]>
828
829       still works correctly.
830
831       Accessing subrule arguments more cleanly
832
833       As the preceding examples illustrate, using subrule arguments
834       effectively generally requires the use of run-time interpolated
835       subpatterns via the "(??{...})" construct.
836
837       This produces ugly rule bodies such as:
838
839           <token: end_tag>
840               (??{ $ARG{prefix} // q{(?!)} })      # ...prefix as pattern
841               (??{ quotemeta $ARG{tag} })          # ...tag as literal
842             |
843               (??{ quotemeta reverse $ARG{tag} })  # ...reversed tag
844
845       To simplify these common usages, Regexp::Grammars provides three
846       convenience constructs.
847
       A subrule call of the form "<:identifier>" is equivalent to:
849
850           (??{ $ARG{'identifier'} // q{(?!)} })
851
852       Namely: "Match the contents of $ARG{'identifier'}, treating those
853       contents as a pattern."
854
       A subrule call of the form "<\:identifier>" (that is: a matchref with
856       a colon after the backslash) is equivalent to:
857
858           (??{ defined $ARG{'identifier'}
859                   ? quotemeta($ARG{'identifier'})
860                   : '(?!)'
861           })
862
863       Namely: "Match the contents of $ARG{'identifier'}, treating those
864       contents as a literal."
865
       A subrule call of the form "</:identifier>" (that is: an invertref
867       with a colon after the forward slash) is equivalent to:
868
869           (??{ defined $ARG{'identifier'}
870                   ? quotemeta(reverse $ARG{'identifier'})
871                   : '(?!)'
872           })
873
874       Namely: "Match the closing delimiter corresponding to the contents of
875       $ARG{'identifier'}, as if it were a literal".
876
       The availability of these three constructs means that we could rewrite
878       the above "<end_tag>" token much more cleanly as:
879
880           <token: end_tag>
881               <:prefix>      # ...prefix as pattern
882               <\:tag>        # ...tag as a literal
883             |
884               </:tag>        # ...reversed tag
885
       In general these constructs mean that, within a subrule, if you want to
       match an argument passed to that subrule, you use "<:ARGNAME>" (to
       match the argument as a pattern), "<\:ARGNAME>" (to match the argument
       as a literal), or "</:ARGNAME>" (to match the reversed argument as a
       literal).
890
891       Note the consistent mnemonic in these various subrule-like
892       interpolations of named arguments: the name is always prefixed by a
893       colon.
894
895       In other words, the "<:ARGNAME>" form works just like a "<RULENAME>",
896       except that the leading colon tells Regexp::Grammars to use the
897       contents of $ARG{'ARGNAME'} as the subpattern, instead of the contents
       of "(?&RULENAME)".
899
900       Likewise, the "<\:ARGNAME>" and "</:ARGNAME>" constructs work exactly
901       like "<\_MATCHNAME>" and "</INVERTNAME>" respectively, except that the
902       leading colon indicates that the matchref or invertref should be taken
903       from %ARG instead of from %MATCH.
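
       To see an argument passed and then matched as a literal, here is a
       deliberately small sketch. The "tag" and "body" tokens, the sample
       inputs, and the is_balanced_block() helper are assumptions, and the
       grammar uses tokens with explicit "\s+" so that whitespace handling is
       spelled out:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $block_parser = qr{
    \A <block> \z

    <token: block>
        <tag> \s+ <body> \s+ <end_tag(:tag)>

    <token: tag>
        [a-z]+

    <token: body>
        \d+

    <token: end_tag>
        end_ <\:tag>
}xms;

# Illustrative helper: true if the closing tag repeats the opening tag.
sub is_balanced_block {
    my ($text) = @_;
    return $text =~ $block_parser ? 1 : 0;
}

for my $text ('loop 42 end_loop', 'loop 42 end_if') {
    printf "%-18s %s\n", $text, is_balanced_block($text) ? 'ok' : 'rejected';
}
```

       The ":tag" shortcut passes $MATCH{tag} under the name 'tag', and
       "<\:tag>" then matches that value literally inside "end_tag".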
904
905   Pseudo-subrules
906       Aliases can also be given to standard Perl subpatterns, as well as to
907       code blocks within a regex. The syntax for subpatterns is:
908
909           <ALIAS= (SUBPATTERN) >
910
911       In other words, the syntax is exactly like an aliased subrule call,
912       except that the rule name is replaced with a set of parentheses
913       containing the subpattern. Any parentheses--capturing or
914       non-capturing--will do.
915
916       The effect of aliasing a standard subpattern is to cause whatever that
917       subpattern matches to be saved in the result-hash, using the alias as
918       its key. For example:
919
920           <rule: file_command>
921
922               <cmd=(mv|cp|ln)>  <from=file>  <to=file>
923
924       Here, the "<cmd=(mv|cp|ln)>" is treated exactly like a regular
925       "(mv|cp|ln)", but whatever substring it matches is saved in the result-
926       hash under the key 'cmd'.
927
928       The syntax for aliasing code blocks is:
929
930           <ALIAS= (?{ your($code->here) }) >
931
932       Note, however, that the code block must be specified in the standard
933       Perl 5.10 regex notation: "(?{...})". A common mistake is to write:
934
           <ALIAS= { your($code->here) } >
936
937       instead, which will attempt to interpolate $code before the regex is
938       even compiled, as such variables are only "protected" from
939       interpolation inside a "(?{...})".
940
941       When correctly specified, this construct executes the code in the block
942       and saves the result of that execution in the result-hash, using the
943       alias as its key. Aliased code blocks are useful for adding semantic
944       information based on which branch of a rule is executed. For example,
945       consider the "copy_cmd" alternatives shown earlier:
946
947           <rule: copy_cmd>
948               copy <from=file>        <to=file>
949             | dup    <to=file>  as  <from=file>
950             |      <from=file>  ->    <to=file>
951             |        <to=file>  <-  <from=file>
952
953       Using aliased code blocks, you could add an extra field to the result-
954       hash to describe which form of the command was detected, like so:
955
956           <rule: copy_cmd>
957               copy <from=file>        <to=file>  <type=(?{ 'std' })>
958             | dup    <to=file>  as  <from=file>  <type=(?{ 'rev' })>
959             |      <from=file>  ->    <to=file>  <type=(?{  +1   })>
960             |        <to=file>  <-  <from=file>  <type=(?{  -1   })>
961
962       Now, if the rule matched, the result-hash would contain something like:
963
964           copy_cmd => {
965               from => 'oldfile',
966                 to => 'newfile',
               type => 'std',
968           }
969
970       Note that, in addition to the semantics described above, aliased
971       subpatterns and code blocks also become visible to Regexp::Grammars'
972       integrated debugger (see Debugging).
973
974   Aliased literals
975       As the previous example illustrates, it is inconveniently verbose to
976       assign constants via aliased code blocks. So Regexp::Grammars provides
977       a short-cut. It is possible to directly alias a numeric literal or a
978       single-quote delimited literal string, without putting either inside a
979       code block. For example, the previous example could also be written:
980
981           <rule: copy_cmd>
982               copy <from=file>        <to=file>  <type='std'>
983             | dup    <to=file>  as  <from=file>  <type='rev'>
984             |      <from=file>  ->    <to=file>  <type= +1  >
985             |        <to=file>  <-  <from=file>  <type= -1  >
986
987       Note that only these two forms of literal are supported in this
988       abbreviated syntax.
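
       The sketch below combines an aliased subpattern with aliased string
       literals. The "verb" alias, the "file" token, the sample inputs, and
       the parse_cmd() helper are assumptions introduced for illustration:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $cmd_parser = qr{
    \A <copy_cmd> \z

    <rule: copy_cmd>
        <verb=(copy|cp)>  <from=file>      <to=file>    <type='std'>
      | dup               <to=file>   as   <from=file>  <type='rev'>

    <token: file>
        [\w.]+
}xms;

# Illustrative helper: returns a copy of the copy_cmd result-hash,
# or nothing if the text does not match.
sub parse_cmd {
    my ($text) = @_;
    return unless $text =~ $cmd_parser;
    my %result = %{ $/{copy_cmd} };
    return \%result;
}

for my $cmd ('cp old.txt new.txt', 'dup new.txt as old.txt') {
    my $result = parse_cmd($cmd) or next;
    print "$cmd: type=$result->{type} from=$result->{from}\n";
}
```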
989
990   Amnesiac subrule calls
991       By default, every subrule call saves its result into the result-hash,
992       either under its own name, or under an alias.
993
994       However, sometimes you may want to refactor some literal part of a rule
995       into one or more subrules, without having those submatches added to the
996       result-hash. The syntax for calling a subrule, but ignoring its return
997       value is:
998
999           <.SUBRULE>
1000
1001       (which is stolen directly from Perl 6).
1002
1003       For example, you may prefer to rewrite a rule such as:
1004
1005           <rule: paren_pair>
1006
1007               \(
1008                   (?: <escape> | <paren_pair> | <brace_pair> | [^()] )*
1009               \)
1010
1011       without any literal matching, like so:
1012
1013           <rule: paren_pair>
1014
1015               <.left_paren>
1016                   (?: <escape> | <paren_pair> | <brace_pair> | <.non_paren> )*
1017               <.right_paren>
1018
1019           <token: left_paren>   \(
1020           <token: right_paren>  \)
1021           <token: non_paren>    [^()]
1022
1023       Moreover, as the individual components inside the parentheses probably
1024       aren't being captured for any useful purpose either, you could further
1025       optimize that to:
1026
1027           <rule: paren_pair>
1028
1029               <.left_paren>
1030                   (?: <.escape> | <.paren_pair> | <.brace_pair> | <.non_paren> )*
1031               <.right_paren>
1032
1033       Note that you can also use the dot modifier on an aliased subpattern:
1034
1035           <.Alias= (SUBPATTERN) >
1036
1037       This seemingly contradictory behaviour (of giving a subpattern a name,
1038       then deliberately ignoring that name) actually does make sense in one
1039       situation. Providing the alias makes the subpattern visible to the
1040       debugger, while using the dot stops it from affecting the result-hash.
1041       See "Debugging non-grammars" for an example of this usage.
1042
1043   Private subrule calls
1044       If a rule name (or an alias) begins with an underscore:
1045
1046            <_RULENAME>       <_ALIAS=RULENAME>
1047           <[_RULENAME]>     <[_ALIAS=RULENAME]>
1048
1049       then matching proceeds as normal, and any result that is returned is
1050       stored in the current result-hash in the usual way.
1051
1052       However, when any rule finishes (and just before it returns) it first
1053       filters its result-hash, removing any entries whose keys begin with an
1054       underscore. This means that any subrule with an underscored name (or
1055       with an underscored alias) remembers its result, but only until the end
1056       of the current rule. Its results are effectively private to the current
1057       rule.
1058
1059       This is especially useful in conjunction with result distillation.
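
       As a sketch of how amnesiac and private calls affect the result-hash,
       the grammar below ignores the "equals" token and keeps the right-hand
       side only in a private entry. The rule and token names, the sample
       input, and the parse_assignment() helper are all assumptions:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $assign_parser = qr{
    \A <assignment> \z

    <rule: assignment>
        <name=word> <.equals> <_raw=word>

    <token: word>
        \w+

    <token: equals>
        =
}xms;

# Illustrative helper: returns a copy of the assignment result-hash,
# or nothing if the text does not match.
sub parse_assignment {
    my ($text) = @_;
    return unless $text =~ $assign_parser;
    my %result = %{ $/{assignment} };
    return \%result;
}

my $result = parse_assignment('width = 80');
print join(', ', map { "'$_'" } sort keys %{$result}), "\n";
```

       The printed key list should contain neither 'equals' (never saved)
       nor '_raw' (filtered out when the rule returned).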
1060
1061   Lookahead (zero-width) subrules
1062       Non-capturing subrule calls can be used in normal lookaheads:
1063
1064           <rule: qualified_typename>
               # A valid typename that has a :: in it...
1066               (?= <.typename> )  [^\s:]+ :: \S+
1067
1068           <rule: identifier>
1069               # An alpha followed by alnums (but not a valid typename)...
1070               (?! <.typename> )    [^\W\d]\w*
1071
1072       but the syntax is a little unwieldy. More importantly, an internal
1073       problem with backtracking causes positive lookaheads to mess up the
1074       module's named capturing mechanism.
1075
1076       So Regexp::Grammars provides two shorthands:
1077
1078           <!typename>        same as: (?! <.typename> )
1079           <?typename>        same as: (?= <.typename> ) ...but works correctly!
1080
1081       These two constructs can also be called with arguments, if necessary:
1082
1083           <rule: Command>
1084               <Keyword>
1085               (?:
1086                   <!Terminator(:Keyword)>  <Args=(\S+)>
1087               )?
1088               <Terminator(:Keyword)>
1089
1090       Note that, as the above equivalences imply, neither of these forms of a
       subrule call ever captures what it matches.
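
       A small sketch of the negative form, in which an identifier is any
       word that is not a keyword. The keyword list, the sample inputs, and
       the is_identifier() helper are assumptions made for the example:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $ident_parser = qr{
    \A <identifier> \z

    <token: identifier>
        <!keyword> [^\W\d]\w*

    <token: keyword>
        (?: if | while | for ) \b
}xms;

# Illustrative helper: true if the text is a non-keyword identifier.
sub is_identifier {
    my ($text) = @_;
    return $text =~ $ident_parser ? 1 : 0;
}

for my $word (qw( width while format 9lives )) {
    printf "%-8s %s\n", $word, is_identifier($word) ? 'identifier' : 'rejected';
}
```

       The "\b" in the hypothetical "keyword" token keeps words such as
       "format" from being rejected merely because they start with "for".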
1092
1093   Matching separated lists
1094       One of the commonest tasks in text parsing is to match a list of
1095       unspecified length, in which items are separated by a fixed token.
1096       Things like:
1097
1098           1, 2, 3 , 4 ,13, 91        # Numbers separated by commas and spaces
1099
1100           g-c-a-g-t-t-a-c-a          # DNA bases separated by dashes
1101
1102           /usr/local/bin             # Names separated by directory markers
1103
1104           /usr:/usr/local:bin        # Directories separated by colons
1105
1106       The usual construct required to parse these kinds of structures is
1107       either:
1108
1109           <rule: list>
1110
1111               <item> <separator> <list>     # recursive definition
1112             | <item>                        # base case
1113
1114       or, if you want to allow zero-or-more items instead of requiring one-
1115       or-more:
1116
1117           <rule: list_opt>
1118               <list>?                       # entire list may be missing
1119
1120           <rule: list>                      # as before...
1121               <item> <separator> <list>     #   recursive definition
1122             | <item>                        #   base case
1123
1124       Or, more efficiently, but less prettily:
1125
1126           <rule: list>
1127               <[item]> (?: <separator> <[item]> )*           # one-or-more
1128
1129           <rule: list_opt>
1130               (?: <[item]> (?: <separator> <[item]> )* )?    # zero-or-more
1131
1132       Because separated lists are such a common component of grammars,
1133       Regexp::Grammars provides cleaner ways to specify them:
1134
1135           <rule: list>
1136               <[item]>+ % <separator>      # one-or-more
1137
1138           <rule: list_zom>
1139               <[item]>* % <separator>      # zero-or-more
1140
1141       Note that these are just regular repetition qualifiers (i.e. "+" and
1142       "*") applied to a subrule ("<[item]>"), with a "%" modifier after them
1143       to specify the required separator between the repeated matches.
1144
1145       The number of repetitions matched is controlled both by the nature of
1146       the qualifier ("+" vs "*") and by the subrule specified after the "%".
1147       The qualified subrule will be repeatedly matched for as long as its
1148       qualifier allows, provided that the second subrule also matches between
1149       those repetitions.
1150
1151       For example, you can match a parenthesized sequence of one-or-more
1152       numbers separated by commas, such as:
1153
1154           (1, 2, 3, 4, 13, 91)        # Numbers separated by commas (and spaces)
1155
1156       with:
1157
1158           <rule: number_list>
1159
1160               \(  <[number]>+ % <comma>  \)
1161
1162           <token: number>  \d+
1163           <token: comma>   ,
1164
1165       Note that any spaces round the commas will be ignored because
1166       "<number_list>" is specified as a rule and the "+%" specifier has
1167       spaces within and around it. To disallow spaces around the commas, make
1168       sure there are no spaces in or around the "+%":
1169
1170           <rule: number_list_no_spaces>
1171
1172               \( <[number]>+%<comma> \)
1173
1174       (or else specify the rule as a token instead).
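
       The "number_list" rule can be tried out with a sketch like the
       following. The \A and \z anchors, the sample inputs, and the
       parse_numbers() helper are assumptions added for illustration:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $list_parser = qr{
    \A <number_list> \z

    <rule: number_list>
        \(  <[number]>+ % <comma>  \)

    <token: number>  \d+
    <token: comma>   ,
}xms;

# Illustrative helper: returns the numbers in the list, or an empty
# list if the text does not match.
sub parse_numbers {
    my ($text) = @_;
    return unless $text =~ $list_parser;
    return @{ $/{number_list}{number} };
}

print join(' ', parse_numbers('(1, 2, 3, 4, 13, 91)')), "\n";
```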
1175
1176       Because the "%" is a modifier applied to a qualifier, you can modify
1177       any other repetition qualifier in the same way. For example:
1178
1179           <[item]>{2,4} % <sep>   # two-to-four items, separated
1180
1181           <[item]>{7}   % <sep>   # exactly 7 items, separated
1182
           <[item]>{10,}? % <sep>   # at least 10 items, separated (matched non-greedily)
1184
1185       You can even do this:
1186
1187           <[item]>? % <sep>       # one-or-zero items, (theoretically) separated
1188
1189       though the separator specification is, of course, meaningless in that
1190       case as it will never be needed to separate a maximum of one item.
1191
1192       Within a Regexp::Grammars regex a simple "%" is always metasyntax, so
1193       it cannot be used to match a literal '%'. Any attempt to do so is
1194       immediately fatal when the regex is compiled:
1195
1196           <token: percentage>
1197               \d{1,3} %                # Fatal. Will not match "7%", "100%", etc.
1198
1199           <token: perl_hash>
1200               % <ident>                # Fatal. Will not match "%foo", "%bar", etc.
1201
1202           <token: perl_mod>
1203               <expr> % <expr>          # Fatal. Will not match "$n % 2", etc.
1204
1205       If you need to match a literal "%" immediately after a repetition,
1206       quote it with a backslash:
1207
1208           <token: percentage>
1209               \d{1,3} \%               # Okay. Will match "7%", "100%", etc.
1210
1211           <token: perl_hash>
1212               \% <ident>               # Okay. Will match "%foo", "%bar", etc.
1213
1214           <token: perl_mod>
1215               <expr> \% <expr>         # Okay. Will match "$n % 2", etc.
1216
1217       Note that it's usually necessary to use the "<[...]>" form for the
1218       repeated items being matched, so that all of them are saved in the
1219       result hash. You can also save all the separators (if they're
1220       important) by specifying them as a list-like subrule too:
1221
1222           \(  <[number]>* % <[comma]>  \)  # save numbers *and* separators
1223
1224       The repeated item must be specified as a subrule call of some kind
1225       (i.e. in angles), but the separators may be specified either as a
1226       subrule or as a raw bracketed pattern (i.e. brackets without any nested
1227       subrule calls). For example:
1228
1229           <[number]>* % ( , | : )    # Numbers separated by commas or colons
1230
1231           <[number]>* % [,:]         # Same, but more efficiently matched
1232
1233       The separator should always be specified within matched delimiters of
1234       some kind: either matching "<...>" or matching "(...)" or matching
1235       "[...]". Simple, non-bracketed separators will sometimes also work:
1236
1237           <[number]>+ % ,
1238
1239       but not always:
1240
1241           <[number]>+ % ,\s+     # Oops! Separator is just: ,
1242
1243       This is because of the limited way in which the module internally
1244       parses ordinary regex components (i.e. without full understanding of
1245       their implicit precedence). As a consequence, consistently placing
1246       brackets around any separator is a much safer approach:
1247
1248           <[number]>+ % (,\s+)
1249
1250       You can also use a simple pattern on the left of the "%" as the item
1251       matcher, but in this case it must always be aliased into a list-
1252       collecting subrule, like so:
1253
1254           <[item=(\d+)]>* % [,]
1255
1256       Note that, for backwards compatibility with earlier versions of
1257       Regexp::Grammars, the "+%" operator can also be written: "**".
1258       However, there can be no space between the two asterisks of this
1259       variant. That is:
1260
           <[item]> ** <sep>      # same as <[item]>+ % <sep>
1262
1263           <[item]>* * <sep>      # error (two * qualifiers in a row)
1264
1265       Matching separated lists with a trailing separator
1266
1267       Some languages allow a separated list to include an extra trailing
1268       separator. For example:
1269
1270           ~/bin/perl5/        # Trailing /-separator in filepath
1271           (1,2,3,)            # Trailing ,-separator in Perl list
1272
1273       To match such constructs using the "%" operator, you would need to add
1274       something to explicitly match the optional trailing separator:
1275
1276           <dir>+ % [/] [/]?    # Slash-separated dirs, then optional final slash
1277
1278           <elem>+ % [,] [,]?   # Comma-separated elems, then optional final comma
1279
1280       which is tedious.
1281
1282       So the module also supports a second kind of "separated list" operator,
1283       that allows an optional trailing separator as well: the "%%" operator.
       This operator behaves exactly like the "%" operator, except that it
1285       also matches a final trailing separator, if one is present.
1286
1287       So the previous examples could be (better) written as:
1288
1289           <dir>+ %% [/]     # Slash-separated dirs, with optional final slash
1290
1291           <elem>+ %% [,]    # Comma-separated elems, with optional final comma
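
       As a sketch of the "%%" operator, the token below accepts a comma-
       separated list with or without a trailing comma. The "list" and "elem"
       names, the sample inputs, and the parse_elems() helper are
       assumptions, and the list-collecting "<[elem]>" form is used so that
       every element is kept:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $elems_parser = qr{
    \A <list> \z

    <token: list>
        \( <[elem]>+ %% [,] \)

    <token: elem>
        \d+
}xms;

# Illustrative helper: returns the elements, or an empty list if the
# text does not match.
sub parse_elems {
    my ($text) = @_;
    return unless $text =~ $elems_parser;
    return @{ $/{list}{elem} };
}

for my $text ('(1,2,3)', '(1,2,3,)') {
    print "$text: ", join(' ', parse_elems($text)), "\n";
}
```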
1292
1293   Matching hash keys
1294       In some situations a grammar may need a rule that matches dozens,
1295       hundreds, or even thousands of one-word alternatives. For example, when
1296       matching command names, or valid userids, or English words. In such
1297       cases it is often impractical (and always inefficient) to list all the
1298       alternatives between "|" alternators:
1299
1300           <rule: shell_cmd>
1301               a2p | ac | apply | ar | automake | awk | ...
1302               # ...and 400 lines later
1303               ... | zdiff | zgrep | zip | zmore | zsh
1304
1305           <rule: valid_word>
1306               a | aa | aal | aalii | aam | aardvark | aardwolf | aba | ...
1307               # ...and 40,000 lines later...
1308               ... | zymotize | zymotoxic | zymurgy | zythem | zythum
1309
1310       To simplify such cases, Regexp::Grammars provides a special construct
1311       that allows you to specify all the alternatives as the keys of a normal
1312       hash. The syntax for that construct is simply to put the hash name
1313       inside angle brackets (with no space between the angles and the hash
1314       name).
1315
1316       Which means that the rules in the previous example could also be
1317       written:
1318
1319           <rule: shell_cmd>
1320               <%cmds>
1321
1322           <rule: valid_word>
1323               <%dict>
1324
1325       provided that the two hashes (%cmds and %dict) are visible in the scope
1326       where the grammar is created.
1327
1328       Matching a hash key in this way is typically significantly faster than
       matching a large set of alternatives. Specifically, it is O((length of
       longest potential key)^2), instead of O(number of keys).
1331
1332       Internally, the construct is converted to something equivalent to:
1333
1334           <rule: shell_cmd>
1335               (<.hk>)  <require: (?{ exists $cmds{$CAPTURE} })>
1336
1337           <rule: valid_word>
1338               (<.hk>)  <require: (?{ exists $dict{$CAPTURE} })>
1339
1340       The special "<hk>" rule is created automatically, and defaults to
1341       "\S+", but you can also define it explicitly to handle other kinds of
1342       keys. For example:
1343
1344           <rule: hk>
1345               [^\n]+        # Key may be any number of chars on a single line
1346
1347           <rule: hk>
1348               [ACGT]{10,}   # Key is a base sequence of at least 10 pairs
1349
1350       Alternatively, you can specify a different key-matching pattern for
1351       each hash you're matching, by placing the required pattern in braces
1352       immediately after the hash name. For example:
1353
1354           <rule: client_name>
1355               # Valid keys match <.hk> (default or explicitly specified)
1356               <%clients>
1357
1358           <rule: shell_cmd>
1359               # Valid keys contain only word chars, hyphen, slash, or dot...
1360               <%cmds { [\w-/.]+ }>
1361
1362           <rule: valid_word>
1363               # Valid keys contain only alphas or internal hyphen or apostrophe...
1364               <%dict{ (?i: (?:[a-z]+[-'])* [a-z]+ ) }>
1365
1366           <rule: DNA_sequence>
1367               # Valid keys are base sequences of at least 10 pairs...
1368               <%sequences{[ACGT]{10,}}>
1369
1370       This second approach to key-matching is preferred, because it localizes
1371       any non-standard key-matching behaviour to each individual hash.
1372
1373       Note that changes in the compilation process from Perl 5.18 onwards
1374       mean that in some cases the "<%hash>" construct only works reliably if
1375       the hash itself is declared at the outermost lexical scope (i.e. file
1376       scope).
1377
1378       Specifically, if the regex grammar does not include any interpolated
1379       scalars or arrays and the hash was declared within a subroutine (even
1380       within the same subroutine as the regex grammar that uses it), the
1381       regex will not be able to "see" the hash variable at compile-time. This
1382       will produce a "Global symbol "%hash" requires explicit package name"
1383       compile-time error. For example:
1384
1385           sub build_keyword_parser {
1386               # Hash declared inside subroutine...
1387               my %keywords = (foo => 1, bar => 1);
1388
1389               # ...then used in <%hash> construct within uninterpolated regex...
1390               return qr{
1391                           ^<keyword>$
1392                           <rule: keyword> <%keywords>
1393                        }x;
1394
1395               # ...produces compile-time error
1396           }
1397
1398       The solution is to place the hash outside the subroutine containing the
1399       grammar:
1400
1401           # Hash declared OUTSIDE subroutine...
1402           my %keywords = (foo => 1, bar => 1);
1403
1404           sub build_keyword_parser {
1405               return qr{
1406                           ^<keyword>$
1407                           <rule: keyword> <%keywords>
1408                        }x;
1409           }
1410
1411       ...or else to explicitly interpolate at least one scalar (even just a
1412       scalar containing an empty string):
1413
1414           sub build_keyword_parser {
1415               my %keywords = (foo => 1, bar => 1);
1416               my $DEFER_REGEX_COMPILATION = "";
1417
1418               return qr{
1419                           ^<keyword>$
1420                           <rule: keyword> <%keywords>
1421
1422                           $DEFER_REGEX_COMPILATION
1423                        }x;
1424           }
1425
1426   Rematching subrule results
1427       Sometimes it is useful to be able to rematch a string that has
1428       previously been matched by some earlier subrule. For example, consider
1429       a rule to match shell-like control blocks:
1430
1431           <rule: control_block>
1432                 for   <expr> <[command]>+ endfor
1433               | while <expr> <[command]>+ endwhile
1434               | if    <expr> <[command]>+ endif
1435               | with  <expr> <[command]>+ endwith
1436
1437       This would be much tidier if we could factor out the command names
1438       (which are the only differences between the four alternatives). The
1439       problem is that the obvious solution:
1440
1441           <rule: control_block>
1442               <keyword> <expr>
1443                   <[command]>+
1444               end<keyword>
1445
1446       doesn't work, because it would also match an incorrect input like:
1447
1448           for 1..10
1449               echo $n
1450               ls subdir/$n
1451           endif
1452
1453       We need some way to ensure that the "<keyword>" matched immediately
1454       after "end" is the same "<keyword>" that was initially matched.
1455
1456       That's not difficult, because the first "<keyword>" will have captured
1457       what it matched into $MATCH{keyword}, so we could just write:
1458
1459           <rule: control_block>
1460               <keyword> <expr>
1461                   <[command]>+
1462               end(??{quotemeta $MATCH{keyword}})
1463
1464       This is such a useful technique, yet so ugly, scary, and prone to
1465       error, that Regexp::Grammars provides a cleaner equivalent:
1466
1467           <rule: control_block>
1468               <keyword> <expr>
1469                   <[command]>+
1470               end<\_keyword>
1471
1472       A directive of the form "<\_IDENTIFIER>" is known as a "matchref" (an
1473       abbreviation of "%MATCH-supplied backreference").  Matchrefs always
1474       attempt to match, as a literal, the current value of
1475       $MATCH{IDENTIFIER}.
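
       The underlying "(??{...})" technique can be tried in plain Perl,
       without the module. The following is only an analogue, not
       Regexp::Grammars syntax: it assumes a named capture (read back via
       %+) in place of %MATCH, and the names $control_block and
       is_control_block are hypothetical.

```perl
use strict;
use warnings;

# Plain-Perl analogue: remember the keyword in a named capture, then
# rematch it, as a literal, immediately after "end".
my $control_block = qr{
    \A (?<keyword> for | while | if | with ) \b
    .*?
    \b end (??{ quotemeta $+{keyword} }) \s* \z
}xms;

# Hypothetical helper: 1 if the text is a correctly closed block, else 0
sub is_control_block {
    my ($text) = @_;
    return $text =~ $control_block ? 1 : 0;
}

print is_control_block("for 1..10\n    echo \$n\nendfor\n"), "\n";   # prints 1
print is_control_block("for 1..10\n    echo \$n\nendif\n"),  "\n";   # prints 0
```

       In ordinary Perl a "\k<keyword>" backreference would do the same
       job; the "(??{...})" form is shown because it is the one the
       matchref is described as replacing.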
1476
1477       By default, a matchref does not capture what it matches, but you can
1478       have it do so by giving it an alias:
1479
1480           <token: delimited_string>
1481               <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
1482
1483           <token: str_delim> ["'`]
1484
1485       At first glance this doesn't seem very useful as, by definition,
1486       $MATCH{ldelim} and $MATCH{rdelim} must necessarily always end up with
1487       identical values. However, it can be useful if the rule also has other
1488       alternatives and you want to create a consistent internal
1489       representation for those alternatives, like so:
1490
1491           <token: delimited_string>
1492                 <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
1493               | <ldelim=( \[ )      .*?  <rdelim=( \] )
1494               | <ldelim=( \{ )      .*?  <rdelim=( \} )
1495               | <ldelim=( \( )      .*?  <rdelim=( \) )
1496               | <ldelim=( \< )      .*?  <rdelim=( \> )
1497
1498       You can also force a matchref to save repeated matches as a nested
1499       array, in the usual way:
1500
1501           <token: marked_text>
1502               <marker> <text> <[endmarkers=\_marker]>+
1503
1504       Be careful though, as the following will not do as you may expect:
1505
1506               <[marker]>+ <text> <[endmarkers=\_marker]>+
1507
       because the value of $MATCH{marker} will be an array reference. The
       matchref flattens and concatenates that array, then matches the
       resulting string as a literal. The example above therefore matches
       only endmarkers that are exact multiples of the complete start marker,
       rather than endmarkers that consist of any number of repetitions of
       the individual start marker delimiter. So these match:
1514
1515               ""text here""
1516               ""text here""""
1517               ""text here""""""
1518
1519       but not:
1520
1521               ""text here"""
1522               ""text here"""""
1523
1524       Uneven start and end markers such as these are extremely unusual, so
1525       this problem rarely arises in practice.
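
       The flatten-and-concatenate behaviour described above can be
       modelled in plain Perl. This is a sketch of the described semantics
       only, not the module's code; matchref_literal is a hypothetical
       helper, and the two-element array stands in for what "<[marker]>+"
       would have captured.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a captured value (string or array of strings)
# into the single literal pattern that would be rematched.
sub matchref_literal {
    my ($value) = @_;
    my $literal = ref($value) eq 'ARRAY' ? join('', @{$value}) : $value;
    return qr/\Q$literal\E/;
}

my $marker    = matchref_literal([ '"', '"' ]);   # two captured quote marks
my $endmarker = qr/\A (?: $marker )+ \z/x;        # one or more whole copies

for my $candidate ('""', '""""', '"""') {
    printf "%-6s %s\n", $candidate,
        ($candidate =~ $endmarker ? 'accepted' : 'rejected');
}
```

       Only even-length runs of quotes are accepted, because the unit being
       repeated is the complete two-character start marker.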
1526
1527       Note: Prior to Regexp::Grammars version 1.020, the syntax for matchrefs
1528       was "<\IDENTIFIER>" instead of "<\_IDENTIFIER>". This created problems
1529       when the identifier started with any of "l", "u", "L", "U", "Q", or
1530       "E", so the syntax has had to be altered in a backwards incompatible
1531       way. It will not be altered again.
1532
1533   Rematching balanced delimiters
1534       Consider the example in the previous section:
1535
1536           <token: delimited_string>
1537                 <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
1538               | <ldelim=( \[ )      .*?  <rdelim=( \] )
1539               | <ldelim=( \{ )      .*?  <rdelim=( \} )
1540               | <ldelim=( \( )      .*?  <rdelim=( \) )
1541               | <ldelim=( \< )      .*?  <rdelim=( \> )
1542
       The repeated pattern of the last four alternatives is galling, but we
       can't just refactor those delimiters as well:
1545
1546           <token: delimited_string>
1547                 <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
1548               | <ldelim=bracket>    .*?  <rdelim=\_ldelim>
1549
1550       because that would incorrectly match:
1551
1552           { delimited content here {
1553
1554       while failing to match:
1555
1556           { delimited content here }
1557
1558       To refactor balanced delimiters like those, we need a second kind of
1559       matchref; one that's a little smarter.
1560
1561       Or, preferably, a lot smarter...because there are many other kinds of
1562       balanced delimiters, apart from single brackets. For example:
1563
1564             {{{ delimited content here }}}
1565              /* delimited content here */
1566              (* delimited content here *)
1567              `` delimited content here ''
1568              if delimited content here fi
1569
1570       The common characteristic of these delimiter pairs is that the closing
1571       delimiter is the inverse of the opening delimiter: the sequence of
1572       characters is reversed and certain characters (mainly brackets, but
1573       also single-quotes/backticks) are mirror-reflected.
1574
1575       Regexp::Grammars supports the parsing of such delimiters with a
1576       construct known as an invertref, which is specified using the
1577       "</IDENT>" directive. An invertref acts very like a matchref, except
1578       that it does not convert to:
1579
           (??{ quotemeta( $MATCH{IDENT} ) })

       but rather to:

           (??{ quotemeta( inverse( $MATCH{IDENT} ) ) })
1585
1586       With this directive available, the balanced delimiters of the previous
1587       example can be refactored to:
1588
1589           <token: delimited_string>
1590                 <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
1591               | <ldelim=( [[{(<] )  .*?  <rdelim=/ldelim>
1592
1593       Like matchrefs, invertrefs come in the usual range of flavours:
1594
           </ident>            # Match the inverse of $MATCH{ident}
           <ALIAS=/ident>      # Match inverse and capture to $MATCH{ALIAS}
           <[ALIAS=/ident]>    # Match inverse and push on @{$MATCH{ALIAS}}
1598
       The character pairs that are reversed during mirroring are: "{" and
       "}", "[" and "]", "(" and ")", "<" and ">", "«" and "»", "`" and "'".
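
       The inversion itself (reverse the characters, then mirror each
       bracket or quote) is easy to sketch in plain Perl. The function
       below is an illustration of the description above, not the module's
       own "inverse()"; the name inverse_delim is hypothetical, and only
       ASCII pairs are handled.

```perl
use strict;
use warnings;

# Hypothetical helper: compute the closing delimiter for an opening one
sub inverse_delim {
    my ($delim) = @_;

    my $mirrored = reverse $delim;              # reverse character order
    $mirrored =~ tr/{}[]()<>`'/}{][)(><'`/;     # swap each mirrored pair
    return $mirrored;
}

for my $open ('{{{', '/*', '(*', '``', 'if') {
    printf "%-4s closes with %s\n", $open, inverse_delim($open);
}
```

       For the five opening delimiters listed earlier in this section, the
       results are "}}}", "*/", "*)", "''", and "fi" respectively.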
1601
       The following mnemonics may be useful in distinguishing invertrefs
       from matchrefs: a matchref starts with a "\" (just like the standard
       Perl regex backrefs "\1" and "\g{-2}" and "\k<name>"), whereas an
       invertref starts with a "/" (like an HTML or XML closing tag). Or just
       remember that "<\_IDENT>" is "match the same again", and if you want
       "the same again, only mirrored" instead, just mirror the "\" to get
       "</IDENT>".
1608
1609   Rematching parametric results and delimiters
1610       The "<\_IDENTIFIER>" and "</IDENTIFIER>" mechanisms normally locate the
1611       literal to be matched by looking in $MATCH{IDENTIFIER}.
1612
1613       However, you can cause them to look in $ARG{IDENTIFIER} instead, by
1614       prefixing the identifier with a single ":". This is especially useful
1615       when refactoring subrules. For example, instead of:
1616
1617           <rule: Command>
1618               <Keyword>  <CommandBody>  end_ <\_Keyword>
1619
1620           <rule: Placeholder>
1621               <Keyword>    \.\.\.   end_ <\_Keyword>
1622
1623       you could parameterize the Terminator rule, like so:
1624
1625           <rule: Command>
1626               <Keyword>  <CommandBody>  <Terminator(:Keyword)>
1627
1628           <rule: Placeholder>
1629               <Keyword>    \.\.\.   <Terminator(:Keyword)>
1630
1631           <token: Terminator>
1632               end_ <\:Keyword>
1633
1634   Tracking and reporting match positions
1635       Regexp::Grammars automatically predefines a special token that makes it
1636       easy to track exactly where in its input a particular subrule matches.
1637       That token is: "<matchpos>".
1638
1639       The "<matchpos>" token implements a zero-width match that never fails.
1640       It always returns the current index within the string that the grammar
1641       is matching.
1642
       So, for example, you could have your "<delimited_text>" subrule detect
       and report unterminated text like so:
1645
1646           <token: delimited_text>
1647               qq? <delim> <text=(.*?)> </delim>
1648           |
1649               <matchpos> qq? <delim>
1650               <error: (?{"Unterminated string starting at index $MATCH{matchpos}"})>
1651
1652       Matching "<matchpos>" in the second alternative causes $MATCH{matchpos}
1653       to contain the position in the string at which the "<matchpos>" subrule
1654       was matched (in this example: the start of the unterminated text).
1655
1656       If you want the line number instead of the string index, use the
1657       predefined "<matchline>" subrule instead:
1658
1659           <token: delimited_text>
1660                     qq? <delim> <text=(.*?)> </delim>
1661           |   <matchline> qq? <delim>
1662               <error: (?{"Unterminated string starting at line $MATCH{matchline}"})>
1663
1664       Note that the line numbers returned by "<matchline>" start at 1 (not at
1665       zero, as with "<matchpos>").
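
       The relationship between the two numbers (a 0-based string index
       versus a 1-based line number) can be sketched in plain Perl. This
       does not use the module or its "<matchpos>" and "<matchline>"
       subrules; locate_match is a hypothetical helper that derives both
       values from Perl's own "@-" array.

```perl
use strict;
use warnings;

# Hypothetical helper: report where $pattern first matches in $text,
# as a 0-based string index and a 1-based line number.
sub locate_match {
    my ($text, $pattern) = @_;

    if ($text =~ $pattern) {
        my $index  = $-[0];                     # offset of the match start
        my $before = substr($text, 0, $index);  # everything preceding it
        my $line   = 1 + ($before =~ tr/\n//);  # newlines so far, plus one
        return ($index, $line);
    }
    return;
}

my ($index, $line) = locate_match("alpha\nbeta\nqq{unterminated", qr/qq\{/);
print "index $index, line $line\n";   # prints "index 11, line 3"
```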
1666
       The "<matchpos>" and "<matchline>" subrules are just like any other
       subrules; you can alias them ("<started_at=matchpos>") or match them
       repeatedly ("(?: <[matchline]> <[item]> )++"), etc.
1670

Autoactions

1672       The module also supports event-based parsing. You can specify a grammar
1673       in the usual way and then, for a particular parse, layer a collection
1674       of call-backs (known as "autoactions") over the grammar to handle the
1675       data as it is parsed.
1676
1677       Normally, a grammar rule returns the result hash it has accumulated (or
1678       whatever else was aliased to "MATCH=" within the rule). However, you
1679       can specify an autoaction object before the grammar is matched.
1680
1681       Once the autoaction object is specified, every time a rule succeeds
1682       during the parse, its result is passed to the object via one of its
1683       methods; specifically it is passed to the method whose name is the same
1684       as the rule's.
1685
1686       For example, suppose you had a grammar that recognizes simple algebraic
1687       expressions:
1688
           my $expr_parser = do{
               use Regexp::Grammars;
               qr{
                   <Answer>

                   <rule: Answer>     <[Operand=Mult]>+ % <[Op=(\+|\-)]>

                   <rule: Mult>       <[Operand=Pow]>+  % <[Op=(\*|/|%)]>

                   <rule: Pow>        <[Operand=Term]>+ % <Op=(\^)>

                   <rule: Term>          <MATCH=Literal>
                              |       \( <MATCH=Answer> \)

                   <token: Literal>   <MATCH=( [+-]? \d++ (?: \. \d++ )?+ )>
               }xms
           };
1706
1707       You could convert this grammar to a calculator, by installing a set of
1708       autoactions that convert each rule's result hash to the corresponding
1709       value of the sub-expression that the rule just parsed. To do that, you
1710       would create a class with methods whose names match the rules whose
1711       results you want to change. For example:
1712
1713           package Calculator;
1714           use List::Util qw< reduce >;
1715
1716           sub new {
1717               my ($class) = @_;
1718
1719               return bless {}, $class
1720           }
1721
1722           sub Answer {
1723               my ($self, $result_hash) = @_;
1724
1725               my $sum = shift @{$result_hash->{Operand}};
1726
1727               for my $term (@{$result_hash->{Operand}}) {
1728                   my $op = shift @{$result_hash->{Op}};
1729                   if ($op eq '+') { $sum += $term; }
1730                   else            { $sum -= $term; }
1731               }
1732
1733               return $sum;
1734           }
1735
1736           sub Mult {
1737               my ($self, $result_hash) = @_;
1738
1739               return reduce { eval($a . shift(@{$result_hash->{Op}}) . $b) }
1740                             @{$result_hash->{Operand}};
1741           }
1742
1743           sub Pow {
1744               my ($self, $result_hash) = @_;
1745
1746               return reduce { $b ** $a } reverse @{$result_hash->{Operand}};
1747           }
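
       The "Pow" method above folds its operands from the right, since
       exponentiation is right-associative. A minimal plain-Perl sketch of
       that idiom follows; power_tower is a hypothetical helper, and the
       operand list stands in for @{$result_hash->{Operand}}.

```perl
use strict;
use warnings;
use List::Util qw< reduce >;

# Hypothetical helper: evaluate x ** y ** z ... right-to-left.
# After reverse(), $a holds the exponent accumulated so far and
# $b is the next base to its left.
sub power_tower {
    my @operands = @_;
    return reduce { $b ** $a } reverse @operands;
}

print power_tower(4, 3, 2), "\n";   # 4 ** (3 ** 2), prints 262144
print +(4 ** 3) ** 2,       "\n";   # left-to-right instead, prints 4096
```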
1748
1749       Objects of this class (and indeed the class itself) now have methods
1750       corresponding to some of the rules in the expression grammar. To apply
1751       those methods to the results of the rules (as they parse) you simply
1752       install an object as the "autoaction" handler, immediately before you
1753       initiate the parse:
1754
           if ($text =~ $expr_parser->with_actions(Calculator->new)) {
               say $/{Answer};   # Now prints the result of the expression
           }
1758
1759       The "with_actions()" method expects to be passed an object or
1760       classname. This object or class will be installed as the autoaction
1761       handler for the next match against any grammar. After that match, the
1762       handler will be uninstalled. "with_actions()" returns the grammar it's
1763       called on, making it easy to call it as part of a match (which is the
1764       recommended idiom).
1765
1766       With a "Calculator" object set as the autoaction handler, whenever the
1767       "Answer", "Mult", or "Pow" rule of the grammar matches, the
1768       corresponding "Answer", "Mult", or "Pow" method of the "Calculator"
1769       object will be called (with the rule's result value passed as its only
1770       argument), and the result of the method will be used as the result of
1771       the rule.
1772
1773       Note that nothing new happens when a "Term" or "Literal" rule matches,
1774       because the "Calculator" object doesn't have methods with those names.
1775
       The overall effect, then, is to allow you to specify a grammar without
       rule-specific behaviours and then, later, specify a set of final
       actions (as methods) for some or all of the rules of the grammar.
1779
1780       Note that, if a particular callback method returns "undef", the result
1781       of the corresponding rule will be passed through without modification.
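
       The dispatch rules described in this section (call the method named
       after the rule if there is one; keep the original result if there is
       no such method or if it returns "undef") can be modelled in a few
       lines of plain Perl. This is a sketch of the described behaviour
       only, not the module's implementation; Demo::Actions and
       apply_autoaction are hypothetical names.

```perl
use strict;
use warnings;

package Demo::Actions;

sub new    { my ($class) = @_; return bless {}, $class }
sub Number { my ($self, $result) = @_; return 2 * $result }  # replaces result
sub Label  { return undef }                                  # declines to

package main;

# Hypothetical helper: apply a handler to one rule's result
sub apply_autoaction {
    my ($handler, $rule_name, $result) = @_;

    # No method with the rule's name: the result passes through
    my $method = $handler->can($rule_name)
        or return $result;

    # Method exists: use its return value, unless that value is undef
    my $replacement = $handler->$method($result);
    return defined $replacement ? $replacement : $result;
}

my $actions = Demo::Actions->new;
print apply_autoaction($actions, 'Number', 21),     "\n";   # prints 42
print apply_autoaction($actions, 'Label',  'name'), "\n";   # prints name
print apply_autoaction($actions, 'Other',  'raw'),  "\n";   # prints raw
```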
1782

Named grammars

       All the grammars shown so far are confined to a single regex. However,
       Regexp::Grammars also provides a mechanism that allows you to define
       named grammars, which can then be imported into other regexes. This
       gives you a way of modularizing common grammatical components.
1788
1789   Defining a named grammar
1790       You can create a named grammar using the "<grammar:...>" directive.
1791       This directive must appear before the first rule definition in the
1792       grammar, and instead of any start-rule. For example:
1793
1794           qr{
1795               <grammar: List::Generic>
1796
1797               <rule: List>
1798                   <[MATCH=Item]>+ % <Separator>
1799
1800               <rule: Item>
1801                   \S++
1802
1803               <token: Separator>
1804                   \s* , \s*
1805           }x;
1806
1807       This creates a grammar named "List::Generic", and installs it in the
1808       module's internal caches, for future reference.
1809
1810       Note that there is no need (or reason) to assign the resulting regex to
1811       a variable, as the named grammar cannot itself be matched against.
1812
1813   Using a named grammar
1814       To make use of a named grammar, you need to incorporate it into another
1815       grammar, by inheritance. To do that, use the "<extends:...>" directive,
1816       like so:
1817
1818           my $parser = qr{
1819               <extends: List::Generic>
1820
1821               <List>
1822           }x;
1823
1824       The "<extends:...>" directive incorporates the rules defined in the
1825       specified grammar into the current regex. You can then call any of
1826       those rules in the start-pattern.
1827
1828   Overriding an inherited rule or token
1829       Subrule dispatch within a grammar is always polymorphic. That is, when
1830       a subrule is called, the most-derived rule of the same name within the
1831       grammar's hierarchy is invoked.
1832
       So, to replace a particular rule within a grammar, you simply need to
       inherit that grammar and specify new, more-specific versions of any
       rules you want to change. For example:
1836
1837           my $list_of_integers = qr{
1838               <List>
1839
1840               # Inherit rules from base grammar...
1841               <extends: List::Generic>
1842
1843               # Replace Item rule from List::Generic...
1844               <rule: Item>
1845                   [+-]? \d++
1846           }x;
1847
1848       You can also use "<extends:...>" in other named grammars, to create
1849       hierarchies:
1850
1851           qr{
1852               <grammar: List::Integral>
1853               <extends: List::Generic>
1854
1855               <token: Item>
1856                   [+-]? <MATCH=(<.Digit>+)>
1857
1858               <token: Digit>
1859                   \d
1860           }x;
1861
1862           qr{
1863               <grammar: List::ColonSeparated>
1864               <extends: List::Generic>
1865
1866               <token: Separator>
1867                   \s* : \s*
1868           }x;
1869
1870           qr{
1871               <grammar: List::Integral::ColonSeparated>
1872               <extends: List::Integral>
1873               <extends: List::ColonSeparated>
1874           }x;
1875
1876       As shown in the previous example, Regexp::Grammars allows you to
1877       multiply inherit two (or more) base grammars. For example, the
1878       "List::Integral::ColonSeparated" grammar takes the definitions of
1879       "List" and "Item" from the "List::Integral" grammar, and the definition
1880       of "Separator" from "List::ColonSeparated".
1881
1882       Note that grammars dispatch subrule calls using C3 method lookup,
1883       rather than Perl's older DFS lookup. That's why
1884       "List::Integral::ColonSeparated" correctly gets the more-specific
1885       "Separator" rule defined in "List::ColonSeparated", rather than the
1886       more-generic version defined in "List::Generic" (via "List::Integral").
1887       See "perldoc mro" for more discussion of the C3 dispatch algorithm.
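
       The difference between the two lookup orders can be seen with
       ordinary Perl classes. The sketch below does not use the module:
       the hypothetical Demo:: packages merely mirror the shape of the
       grammar hierarchy above, with methods standing in for rules.

```perl
use strict;
use warnings;
use mro;

package Demo::Generic;
sub Item      { return 'generic item' }
sub Separator { return ',' }

package Demo::Integral;
our @ISA = ('Demo::Generic');
sub Item { return 'integer item' }

package Demo::ColonSeparated;
our @ISA = ('Demo::Generic');
sub Separator { return ':' }

package Demo::Both;
use mro 'c3';                      # this class resolves methods via C3
our @ISA = ('Demo::Integral', 'Demo::ColonSeparated');

package main;

# Compare the two linearizations of the same hierarchy
my $dfs_order = join ' ', @{ mro::get_linear_isa('Demo::Both', 'dfs') };
my $c3_order  = join ' ', @{ mro::get_linear_isa('Demo::Both', 'c3')  };
print "DFS: $dfs_order\n";
print "C3:  $c3_order\n";

my $separator = Demo::Both->Separator;
print "Separator under C3: $separator\n";   # prints "Separator under C3: :"
```

       Under DFS, Demo::Generic is visited before Demo::ColonSeparated, so
       the generic separator would be found first; under C3 the
       more-specific class comes first.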
1888
1889   Augmenting an inherited rule or token
1890       Instead of replacing an inherited rule, you can augment it.
1891
       For example, if you need a grammar for lists of hexadecimal numbers,
       you could inherit the behaviour of "List::Integral" and add the hex
       digits to its "Digit" token:
1895
1896           my $list_of_hexadecimal = qr{
1897               <List>
1898
1899               <extends: List::Integral>
1900
1901               <token: Digit>
1902                   <List::Integral::Digit>
1903                 | [A-Fa-f]
1904           }x;
1905
1906       If you call a subrule using a fully qualified name (such as
1907       "<List::Integral::Digit>"), the grammar calls that version of the rule,
1908       rather than the most-derived version.
1909
1910   Debugging named grammars
1911       Named grammars are independent of each other, even when inherited. This
1912       means that, if debugging is enabled in a derived grammar, it will not
1913       be active in any rules inherited from a base grammar, unless the base
1914       grammar also included a "<debug:...>" directive.
1915
1916       This is a deliberate design decision, as activating the debugger adds a
1917       significant amount of code to each grammar's implementation, which is
1918       detrimental to the matching performance of the resulting regexes.
1919
1920       If you need to debug a named grammar, the best approach is to include a
1921       "<debug: same>" directive at the start of the grammar. The presence of
1922       this directive will ensure the necessary extra debugging code is
1923       included in the regex implementing the grammar, while setting "same"
1924       mode will ensure that the debugging mode isn't altered when the matcher
1925       uses the inherited rules.
1926

Common parsing techniques

1928   Result distillation
1929       Normally, calls to subrules produce nested result-hashes within the
1930       current result-hash. Those nested hashes always have at least one
1931       automatically supplied key (""), whose value is the entire substring
1932       that the subrule matched.
1933
1934       If there are no other nested captures within the subrule, there will be
1935       no other keys in the result-hash. This would be annoying as a typical
1936       nested grammar would then produce results consisting of hashes of
1937       hashes, with each nested hash having only a single key (""). This in
1938       turn would make postprocessing the result-hash (in "%/") far more
1939       complicated than it needs to be.
1940
1941       To avoid this behaviour, if a subrule's result-hash doesn't contain any
1942       keys except "", the module "flattens" the result-hash, by replacing it
1943       with the value of its single key.
1944
1945       So, for example, the grammar:
1946
1947           mv \s* <from> \s* <to>
1948
1949           <rule: from>   [\w/.-]+
1950           <rule: to>     [\w/.-]+
1951
1952       doesn't return a result-hash like this:
1953
1954           {
1955               ""     => 'mv /usr/local/lib/libhuh.dylib  /dev/null/badlib',
1956               'from' => { "" => '/usr/local/lib/libhuh.dylib' },
1957               'to'   => { "" => '/dev/null/badlib'            },
1958           }
1959
1960       Instead, it returns:
1961
1962           {
1963               ""     => 'mv /usr/local/lib/libhuh.dylib  /dev/null/badlib',
1964               'from' => '/usr/local/lib/libhuh.dylib',
1965               'to'   => '/dev/null/badlib',
1966           }
1967
1968       That is, because the 'from' and 'to' subhashes each have only a single
1969       entry, they are each "flattened" to the value of that entry.
1970
1971       This flattening also occurs if a result-hash contains only "private"
1972       keys (i.e. keys starting with underscores). For example:
1973
1974           mv \s* <from> \s* <to>
1975
1976           <rule: from>   <_dir=path>? <_file=filename>
1977           <rule: to>     <_dir=path>? <_file=filename>
1978
1979           <token: path>      [\w/.-]*/
1980           <token: filename>  [\w.-]+
1981
1982       Here, the "from" rule produces a result like this:
1983
1984           from => {
1985                 "" => '/usr/local/bin/perl',
1986               _dir => '/usr/local/bin/',
1987              _file => 'perl',
1988           }
1989
1990       which is automatically stripped of "private" keys, leaving:
1991
1992           from => {
1993                 "" => '/usr/local/bin/perl',
1994           }
1995
1996       which is then automatically flattened to:
1997
1998           from => '/usr/local/bin/perl'
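
       The flattening rule can be modelled as a small recursive function
       in plain Perl. This is a sketch of the behaviour described above,
       not the module's code; distill is a hypothetical name, and hashes
       that mix private and public keys are simply left with all their
       keys, since that case is not covered here.

```perl
use strict;
use warnings;

# Hypothetical helper modelling the flattening of result-hashes
sub distill {
    my ($result) = @_;
    return $result if ref($result) ne 'HASH';

    # Keys other than "" that do not start with an underscore
    my @public = grep { $_ ne "" && !/^_/ } keys %{$result};

    # Only "" (and perhaps private keys) present: collapse to matched text
    return $result->{""} if !@public && exists $result->{""};

    # Otherwise keep the hash, distilling each nested result in turn
    my %distilled = map { $_ => distill($result->{$_}) } keys %{$result};
    return \%distilled;
}

my $from = distill({
    ""    => '/usr/local/bin/perl',
    _dir  => '/usr/local/bin/',
    _file => 'perl',
});
print "$from\n";   # prints /usr/local/bin/perl
```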
1999
2000       List result distillation
2001
2002       A special case of result distillation occurs in a separated list, such
2003       as:
2004
2005           <rule: List>
2006
2007               <[Item]>+ % <[Sep=(,)]>
2008
2009       If this construct matches just a single item, the result hash will
2010       contain a single entry consisting of a nested array with a single
2011       value, like so:
2012
2013           { Item => [ 'data' ] }
2014
2015       Instead of returning this annoyingly nested data structure, you can
2016       tell Regexp::Grammars to flatten it to just the inner data with a
2017       special directive:
2018
2019           <rule: List>
2020
2021               <[Item]>+ % <[Sep=(,)]>
2022
2023               <minimize:>
2024
2025       The "<minimize:>" directive examines the result hash (i.e.  %MATCH). If
2026       that hash contains only a single entry, which is a reference to an
2027       array with a single value, then the directive assigns that single value
2028       directly to $MATCH, so that it will be returned instead of the usual
2029       result hash.
2030
2031       This means that a normal separated list still results in a hash
2032       containing all elements and separators, but a "degenerate" list of only
2033       one item results in just that single item.
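
       The test that "<minimize:>" is described as performing can be
       sketched as an ordinary function. This is an illustration, not the
       directive's implementation; minimize_result is a hypothetical name,
       and the sketch assumes that the "" entry (the matched text), if
       present, is not counted as one of the hash's entries.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse { Key => [ $only_value ] } to $only_value
sub minimize_result {
    my ($match) = @_;

    # Assumption: ignore the "" entry holding the matched text
    my @keys = grep { $_ ne "" } keys %{$match};

    if (@keys == 1) {
        my $value = $match->{ $keys[0] };
        return $value->[0]
            if ref($value) eq 'ARRAY' && @{$value} == 1;
    }
    return $match;   # anything else is returned unchanged
}

my $single = minimize_result({ Item => ['data'] });
my $list   = minimize_result({ Item => ['a', 'b'], Sep => [','] });

print "$single\n";        # prints data
print ref($list), "\n";   # prints HASH
```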
2034
2035       Manual result distillation
2036
2037       Regexp::Grammars also offers full manual control over the distillation
2038       process. If you use the reserved word "MATCH" as the alias for a
2039       subrule call:
2040
2041           <MATCH=filename>
2042
2043       or a subpattern match:
2044
2045           <MATCH=( \w+ )>
2046
2047       or a code block:
2048
2049           <MATCH=(?{ 42 })>
2050
2051       then the current rule will treat the return value of that subrule,
2052       pattern, or code block as its complete result, and return that value
2053       instead of the usual result-hash it constructs. This is the case even
2054       if the result has other entries that would normally also be returned.
2055
2056       For example, consider a rule like:
2057
2058           <rule: term>
2059                 <MATCH=literal>
2060               | <left_paren> <MATCH=expr> <right_paren>
2061
2062       The use of "MATCH" aliases causes the rule to return either whatever
2063       "<literal>" returns, or whatever "<expr>" returns (provided it's
2064       between left and right parentheses).
2065
2066       Note that, in this second case, even though "<left_paren>" and
2067       "<right_paren>" are captured to the result-hash, they are not returned,
2068       because the "MATCH" alias overrides the normal "return the result-hash"
2069       semantics and returns only what its associated subrule (i.e. "<expr>")
2070       produces.
2071
       Note also that the return value is only assigned if the subrule call
       actually matches. For example:
2074
2075           <rule: optional_names>
2076               <[MATCH=name]>*
2077
2078       If the repeated subrule call to "<name>" matches zero times, the return
2079       value of the "optional_names" rule will not be an empty array, because
2080       the "MATCH=" will not have executed at all. Instead, the default return
2081       value (an empty string) will be returned.  If you had specifically
2082       wanted to return an empty array, you could use any of the following:
2083
2084           <rule: optional_names>
2085               <MATCH=(?{ [] })>     # Set up empty array before first match attempt
2086               <[MATCH=name]>*
2087
2088       or:
2089
2090           <rule: optional_names>
2091               <[MATCH=name]>+       # Match one or more times
2092             |                       #          or
2093               <MATCH=(?{ [] })>     # Set up empty array, if no match
2094
2095       Programmatic result distillation
2096
2097       It's also possible to control what a rule returns from within a code
2098       block.  Regexp::Grammars provides a set of reserved variables that give
2099       direct access to the result-hash.
2100
2101       The result-hash itself can be accessed as %MATCH within any code block
2102       inside a rule. For example:
2103
2104           <rule: sum>
2105               <X=product> \+ <Y=product>
2106                   <MATCH=(?{ $MATCH{X} + $MATCH{Y} })>
2107
2108       Here, the rule matches a product (aliased 'X' in the result-hash), then
2109       a literal '+', then another product (aliased to 'Y' in the result-
2110       hash). The rule then executes the code block, which accesses the two
2111       saved values (as $MATCH{X} and $MATCH{Y}), adding them together.
2112       Because the block is itself aliased to "MATCH", the sum produced by the
2113       block becomes the (only) result of the rule.
2114
2115       It is also possible to set the rule result from within a code block
2116       (instead of aliasing it). The special "override" return value is
2117       represented by the special variable $MATCH. So the previous example
2118       could be rewritten:
2119
2120           <rule: sum>
2121               <X=product> \+ <Y=product>
2122                   (?{ $MATCH = $MATCH{X} + $MATCH{Y} })
2123
2124       Both forms are identical in effect. Any assignment to $MATCH overrides
2125       the normal "return all subrule results" behaviour.
2126
2127       Assigning to $MATCH directly is particularly handy if the result may
2128       not always be "distillable", for example:
2129
2130           <rule: sum>
2131               <X=product> \+ <Y=product>
2132                   (?{ if (!ref $MATCH{X} && !ref $MATCH{Y}) {
2133                           # Reduce to sum, if both terms are simple scalars...
2134                           $MATCH = $MATCH{X} + $MATCH{Y};
2135                       }
2136                       else {
2137                           # Return full syntax tree for non-simple case...
2138                           $MATCH{op} = '+';
2139                       }
2140                   })
2141
2142       Note that you can also partially override the subrule return behaviour.
2143       Normally, the subrule returns the complete text it matched as its
2144       context substring (i.e. under the "empty key") in its result-hash. That
2145       is, of course, $MATCH{""}, so you can override just that behaviour by
2146       directly assigning to that entry.
2147
2148       For example, if you have a rule that matches key/value pairs from a
2149       configuration file, you might prefer that any trailing comments not be
2150       included in the "matched text" entry of the rule's result-hash. You
2151       could hide such comments like so:
2152
           <rule: config_line>
               <key> : <value>  <comment>?
                   (?{
                       # Edit trailing comments out of "matched text" entry...
                       $MATCH{""} = "$MATCH{key} : $MATCH{value}";
                   })
2159
2160       Some more examples of the uses of $MATCH:
2161
2162           <rule: FuncDecl>
             # Keyword  Name               Return only the name (as a string)...
               func     <Identifier> ;     (?{ $MATCH = $MATCH{'Identifier'} })
2165
2166
2167           <rule: NumList>
2168             # Numbers in square brackets...
2169               \[
2170                   ( \d+ (?: , \d+)* )
2171               \]
2172
2173             # Return only the numbers...
2174               (?{ $MATCH = $CAPTURE })
2175
2176
2177           <token: Cmd>
2178             # Match standard variants then standardize the keyword...
2179               (?: mv | move | rename )      (?{ $MATCH = 'mv'; })
2180
2181   Parse-time data processing
2182       Using code blocks in rules, it's often possible to fully process data
2183       as you parse it. For example, the "<sum>" rule shown in the previous
2184       section might be part of a simple calculator, implemented entirely in a
2185       single grammar. Such a calculator might look like this:
2186
2187           my $calculator = do{
2188               use Regexp::Grammars;
2189               qr{
2190                   <Answer>
2191
2192                   <rule: Answer>
2193                       ( <.Mult>+ % <.Op=([+-])> )
2194                           <MATCH= (?{ eval $CAPTURE })>
2195
2196                   <rule: Mult>
2197                       ( <.Pow>+ % <.Op=([*/%])> )
2198                           <MATCH= (?{ eval $CAPTURE })>
2199
2200                   <rule: Pow>
2201                       <X=Term> \^ <Y=Pow>
2202                           <MATCH= (?{ $MATCH{X} ** $MATCH{Y}; })>
2203                     |
2204                           <MATCH=Term>
2205
2206                   <rule: Term>
2207                           <MATCH=Literal>
2208                     | \(  <MATCH=Answer>  \)
2209
2210                   <token: Literal>
2211                           <MATCH= ( [+-]? \d++ (?: \. \d++ )?+ )>
2212               }xms
2213           };
2214
2215           while (my $input = <>) {
2216               if ($input =~ $calculator) {
2217                   say "--> $/{Answer}";
2218               }
2219           }
2220
2221       Because every rule computes a value using the results of the subrules
2222       below it, and aliases that result to its "MATCH", each rule returns a
2223       complete evaluation of the subexpression it matches, passing that back
2224       to higher-level rules, which then do the same.
2225
2226       Hence, the result returned to the very top-level rule (i.e. to
2227       "<Answer>") is the complete evaluation of the entire expression that
2228       was matched. That means that, in the very process of having matched a
2229       valid expression, the calculator has also computed the value of that
2230       expression, which can then simply be printed directly.
2231
2232       It is often possible to have a grammar fully (or sometimes at least
2233       partially) evaluate or transform the data it is parsing, and this
2234       usually leads to very efficient and easy-to-maintain implementations.
2235
2236       The main limitation of this technique is that the data has to be in a
2237       well-structured form, where subsets of the data can be evaluated using
2238       only local information. In cases where the meaning of the data is
2239       distributed through that data non-hierarchically, or relies on global
2240       state, or on external information, it is often better to have the
2241       grammar simply construct a complete syntax tree for the data first, and
2242       then evaluate that syntax tree separately, after parsing is complete.
2243       The following section describes a feature of Regexp::Grammars that can
2244       make this second style of data processing simpler and more
2245       maintainable.
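
       As a minimal sketch of that second style, the following code walks a
       syntax tree after parsing is complete and evaluates it using external
       information (a table of variable values). The tree shape, the
       "evaluate()" subroutine, and the %value_of table are all hypothetical,
       and the tree is built by hand so that the sketch runs without a
       grammar:

```perl
use strict;
use warnings;

# Hypothetical tree shape, hand-built so the sketch runs without a parser.
my $tree = {
    op   => '+',
    args => [
        { num => 2 },
        { op => '*', args => [ { var => 'x' }, { num => 4 } ] },
    ],
};

# External information that is not available while parsing.
my %value_of = ( x => 10 );

sub evaluate {
    my ($node) = @_;

    # Leaf nodes: literal numbers and variable lookups.
    return $node->{num}              if exists $node->{num};
    return $value_of{ $node->{var} } if exists $node->{var};

    # Interior nodes: evaluate the children, then fold them with the operator.
    my @operands = map { evaluate($_) } @{ $node->{args} };
    my $result   = shift @operands;
    for my $operand (@operands) {
        if    ($node->{op} eq '+') { $result += $operand }
        elsif ($node->{op} eq '*') { $result *= $operand }
        else                       { die "Unknown operator: $node->{op}\n" }
    }
    return $result;
}

print evaluate($tree), "\n";    # 2 + (10 * 4), i.e. prints 42
```

       Because the variable table is consulted only during the walk, it can
       be changed between evaluations without reparsing the input.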
2246
2247   Object-oriented parsing
2248       When a grammar has parsed successfully, the "%/" variable will contain
2249       a series of nested hashes (and possibly arrays) representing the
2250       hierarchical structure of the parsed data.
2251
2252       Typically, the next step is to walk that tree, extracting or converting
2253       or otherwise processing that information. If the tree has nodes of many
2254       different types, it can be difficult to build a recursive subroutine
2255       that can navigate it easily.
2256
       A much cleaner solution is possible if the nodes of the tree are proper
       objects.  In that case, you just define a "process()" or "traverse()"
       method for each of the classes, and have every node call that method on
       each of its children. For example, if the parser were to return a tree
       of nodes representing the contents of a LaTeX file, then you could
       define the following methods:
2263
2264           sub Latex::file::explain
2265           {
2266               my ($self, $level) = @_;
2267               for my $element (@{$self->{element}}) {
2268                   $element->explain($level);
2269               }
2270           }
2271
2272           sub Latex::element::explain {
2273               my ($self, $level) = @_;
2274               (  $self->{command} || $self->{literal})->explain($level)
2275           }
2276
2277           sub Latex::command::explain {
2278               my ($self, $level) = @_;
2279               say "\t"x$level, "Command:";
2280               say "\t"x($level+1), "Name: $self->{name}";
2281               if ($self->{options}) {
2282                   say "\t"x$level, "\tOptions:";
2283                   $self->{options}->explain($level+2)
2284               }
2285
2286               for my $arg (@{$self->{arg}}) {
2287                   say "\t"x$level, "\tArg:";
2288                   $arg->explain($level+2)
2289               }
2290           }
2291
2292           sub Latex::options::explain {
2293               my ($self, $level) = @_;
2294               $_->explain($level) foreach @{$self->{option}};
2295           }
2296
2297           sub Latex::literal::explain {
2298               my ($self, $level, $label) = @_;
2299               $label //= 'Literal';
2300               say "\t"x$level, "$label: ", $self->{q{}};
2301           }
2302
2303       and then simply write:
2304
           if ($text =~ $LaTeX_parser) {
               $/{LaTeX_file}->explain(0);    # Start at indentation level 0
           }
2308
2309       and the chain of "explain()" calls would cascade down the nodes of the
2310       tree, each one invoking the appropriate "explain()" method according to
2311       the type of node encountered.
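
       The cascade can be tried out without a grammar by blessing a small
       tree by hand. In this sketch the Demo::List and Demo::Leaf classes are
       hypothetical, and each "explain()" method returns its report as a list
       of lines rather than printing it:

```perl
use strict;
use warnings;

# Hypothetical node classes. Each method reports on its own node and then
# delegates to the explain() method of each child.
sub Demo::List::explain {
    my ($self, $level) = @_;
    my $indent = "\t" x $level;
    return ( "${indent}List:",
             map { $_->explain($level + 1) } @{ $self->{item} } );
}

sub Demo::Leaf::explain {
    my ($self, $level) = @_;
    my $indent = "\t" x $level;
    return "${indent}Leaf: $self->{text}";
}

# Hand-blessed tree, standing in for what an object-returning parser
# might produce.
my $tree = bless {
    item => [
        bless( { text => 'a' }, 'Demo::Leaf' ),
        bless( { item => [ bless( { text => 'b' }, 'Demo::Leaf' ) ] },
               'Demo::List' ),
    ],
}, 'Demo::List';

print "$_\n" for $tree->explain(0);
```

       No node needs to know the class of its children; method dispatch
       selects the right "explain()" for each one.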
2312
2313       The only problem is that, by default, Regexp::Grammars returns a tree
2314       of plain-old hashes, not LaTeX::Whatever objects. Fortunately, it's
2315       easy to request that the result hashes be automatically blessed into
2316       the appropriate classes, using the "<objrule:...>" and "<objtoken:...>"
2317       directives.
2318
2319       These directives are identical to the "<rule:...>" and "<token:...>"
2320       directives (respectively), except that the rule or token they create
2321       will also convert the hash it normally returns into an object of a
2322       specified class. This conversion is done by passing the result hash to
2323       the class's constructor:
2324
2325           $class->new(\%result_hash)
2326
2327       if the class has a constructor method named "new()", or else (if the
2328       class doesn't provide a constructor) by directly blessing the result
2329       hash:
2330
2331           bless \%result_hash, $class
2332
       Note that, even if the object is constructed via its own constructor,
       the module still expects the new object to be hash-based. If the
       object is anything other than a blessed hash, the module issues an
       error.
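
       The conversion rules just described can be summarized in a few lines
       of plain Perl. This is only a sketch of the documented behaviour, not
       the module's own code; the "objectify()" subroutine and both Demo::
       classes are hypothetical:

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed reftype);

{
    package Demo::WithConstructor;
    sub new {
        my ($class, $result_hash) = @_;
        return bless { %{$result_hash}, built_by => 'new' }, $class;
    }
}

{
    package Demo::NoConstructor;
    sub name { my ($self) = @_; return $self->{name} }
}

# Hypothetical helper mirroring the documented rules: prefer the class's
# own new(), otherwise bless the result hash directly.
sub objectify {
    my ($class, $result_hash) = @_;

    my $object = $class->can('new') ? $class->new($result_hash)
               :                      bless($result_hash, $class);

    # The result must still be a hash-based object.
    die "$class object must be a blessed hash\n"
        if !blessed($object) || reftype($object) ne 'HASH';

    return $object;
}

my $built   = objectify('Demo::WithConstructor', { name => 'x' });
my $blessed = objectify('Demo::NoConstructor',   { name => 'y' });

print join(': ', ref($built),   $built->{built_by}), "\n";
print join(': ', ref($blessed), $blessed->name()),   "\n";
```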
2337
2338       The generic syntax for these types of rules and tokens is:
2339
2340           <objrule:  CLASS::NAME = RULENAME  >
2341           <objtoken: CLASS::NAME = TOKENNAME >
2342
2343       For example:
2344
2345           <objrule: LaTeX::Element=component>
2346               # ...Defines a rule that can be called as <component>
2347               # ...and which returns a hash-based LaTeX::Element object
2348
           <objtoken: LaTeX::Literal=atom>
               # ...Defines a token that can be called as <atom>
               # ...and which returns a hash-based LaTeX::Literal object
2352
2353       Note that, just as in aliased subrule calls, the name by which
2354       something is referred to outside the grammar (in this case, the class
2355       name) comes before the "=", whereas the name that it is referred to
2356       inside the grammar comes after the "=".
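
       A small end-to-end sketch may make the two names easier to tell apart.
       Here the class name (Demo::Pair) is used only outside the grammar,
       while the rule name (Pair) is used inside it and as the key in "%/".
       The Demo::Pair class, its "summary()" method, and the grammar itself
       are hypothetical. Because the class has no "new()" method, the result
       hash should simply be blessed into it:

```perl
use strict;
use warnings;
use 5.010;

{
    package Demo::Pair;

    # Hypothetical method, callable on the node returned by the rule.
    sub summary {
        my ($self) = @_;
        return "$self->{key} is set to $self->{value}";
    }
}

my $parser = do {
    use Regexp::Grammars;
    qr{
        <Pair>

        <objrule: Demo::Pair=Pair>
            <key> = <value>

        <token: key>    \w+
        <token: value>  \w+
    }xms;
};

if ('colour = red' =~ $parser) {
    say ref $/{Pair};            # the class the node was blessed into
    say $/{Pair}->summary();     # a method call on the parse-tree node
}
```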
2357
2358       You can freely mix object-returning and plain-old-hash-returning rules
2359       and tokens within a single grammar, though you have to be careful not
2360       to subsequently try to call a method on any of the unblessed nodes.
2361
2362       An important caveat regarding OO rules
2363
       Prior to Perl 5.14.0, Perl's regex engine was not fully re-entrant.
       In older versions of Perl it is therefore not possible to re-invoke
       the regex engine when already inside the regex engine.

       Consequently, you need to be careful that the "new()" constructors
       called by your object-rules do not themselves use regexes in any way,
       unless you're running under Perl 5.14 or later (in which case you can
       ignore what follows).
2372
2373       The two ways this is most likely to happen are:
2374
       1.  If you're using a class built on Moose, where one or more of the
           "has" declarations uses a type constraint (such as 'Int') that is
           implemented via regex matching. For example:
2378
2379               has 'id' => (is => 'rw', isa => 'Int');
2380
2381           The workaround (for pre-5.14 Perls) is to replace the type
2382           constraint with one that doesn't use a regex. For example:
2383
2384               has 'id' => (is => 'rw', isa => 'Num');
2385
2386           Alternatively, you could define your own type constraint that
2387           avoids regexes:
2388
2389               use Moose::Util::TypeConstraints;
2390
2391               subtype 'Non::Regex::Int',
2392                    as 'Num',
2393                 where { int($_) == $_ };
2394
2395               no Moose::Util::TypeConstraints;
2396
2397               # and later...
2398
2399               has 'id' => (is => 'rw', isa => 'Non::Regex::Int');
2400
2401       2.  If your class uses an "AUTOLOAD()" method to implement its
2402           constructor and that method uses the typical:
2403
2404               $AUTOLOAD =~ s/.*://;
2405
2406           technique. The workaround here is to achieve the same effect
2407           without a regex. For example:
2408
2409               my $last_colon_pos = rindex($AUTOLOAD, ':');
2410               substr $AUTOLOAD, 0, $last_colon_pos+1, q{};
2411
2412       Note that this caveat against using nested regexes also applies to any
2413       code blocks executed inside a rule or token (whether or not those rules
2414       or tokens are object-oriented).
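
       The regex-free replacement suggested for "AUTOLOAD()" above can be
       checked in isolation. This sketch compares it against the usual
       substitution; the "method_name_without_regex()" subroutine is a
       hypothetical wrapper added for illustration:

```perl
use strict;
use warnings;

# Strip everything up to and including the last colon, without a regex.
sub method_name_without_regex {
    my ($qualified_name) = @_;
    my $last_colon_pos = rindex($qualified_name, ':');   # -1 if no colon
    substr $qualified_name, 0, $last_colon_pos + 1, q{};
    return $qualified_name;
}

for my $name ('LaTeX::Element::new', 'new') {
    # The conventional regex-based version, for comparison only.
    (my $via_regex = $name) =~ s/.*://;

    my $via_rindex = method_name_without_regex($name);
    die "Mismatch for $name\n" if $via_regex ne $via_rindex;

    print "$name -> $via_rindex\n";
}
```

       When the name contains no colon, "rindex()" returns -1, so the
       "substr" call removes zero characters and the name is left unchanged.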
2415
2416       A naming shortcut
2417
2418       If an "<objrule:...>" or "<objtoken:...>" is defined with a class name
2419       that is not followed by "=" and a rule name, then the rule name is
2420       determined automatically from the classname.  Specifically, the final
2421       component of the classname (i.e. after the last "::", if any) is used.
2422
2423       For example:
2424
2425           <objrule: LaTeX::Element>
2426               # ...Defines a rule that can be called as <Element>
2427               # ...and which returns a hash-based LaTeX::Element object
2428
           <objtoken: LaTeX::Literal>
               # ...Defines a token that can be called as <Literal>
               # ...and which returns a hash-based LaTeX::Literal object
2432
2433           <objtoken: Comment>
2434               # ...Defines a token that can be called as <Comment>
2435               # ...and which returns a hash-based Comment object
2436

Debugging

2438       Regexp::Grammars provides a number of features specifically designed to
2439       help debug both grammars and the data they parse.
2440
2441       All debugging messages are written to a log file (which, by default, is
2442       just STDERR). However, you can specify a disk file explicitly by
2443       placing a "<logfile:...>" directive at the start of your grammar:
2444
2445           $grammar = qr{
2446
2447               <logfile: LaTeX_parser_log >
2448
2449               \A <LaTeX_file> \Z    # Pattern to match
2450
2451               <rule: LaTeX_file>
2452                   # etc.
2453           }x;
2454
2455       You can also explicitly specify that messages go to the terminal:
2456
2457               <logfile: - >
2458
2459   Debugging grammar creation with "<logfile:...>"
2460       Whenever a log file has been directly specified, Regexp::Grammars
2461       automatically does verbose static analysis of your grammar.  That is,
2462       whenever it compiles a grammar containing an explicit "<logfile:...>"
2463       directive it logs a series of messages explaining how it has
2464       interpreted the various components of that grammar. For example, the
2465       following grammar:
2466
2467           <logfile: parser_log >
2468
2469           <cmd>
2470
2471           <rule: cmd>
2472               mv <from=file> <to=file>
2473             | cp <source> <[file]>  <.comment>?
2474
2475       would produce the following analysis in the 'parser_log' file:
2476
2477           info | Processing the main regex before any rule definitions
2478                |    |
2479                |    |...Treating <cmd> as:
2480                |    |      |  match the subrule <cmd>
2481                |    |       \ saving the match in $MATCH{'cmd'}
2482                |    |
2483                |     \___End of main regex
2484                |
2485           info | Defining a rule: <cmd>
2486                |    |...Returns: a hash
2487                |    |
2488                |    |...Treating ' mv ' as:
2489                |    |       \ normal Perl regex syntax
2490                |    |
2491                |    |...Treating <from=file> as:
2492                |    |      |  match the subrule <file>
2493                |    |       \ saving the match in $MATCH{'from'}
2494                |    |
2495                |    |...Treating <to=file> as:
2496                |    |      |  match the subrule <file>
2497                |    |       \ saving the match in $MATCH{'to'}
2498                |    |
2499                |    |...Treating ' | cp ' as:
2500                |    |       \ normal Perl regex syntax
2501                |    |
2502                |    |...Treating <source> as:
2503                |    |      |  match the subrule <source>
2504                |    |       \ saving the match in $MATCH{'source'}
2505                |    |
2506                |    |...Treating <[file]> as:
2507                |    |      |  match the subrule <file>
2508                |    |       \ appending the match to $MATCH{'file'}
2509                |    |
2510                |    |...Treating <.comment>? as:
2511                |    |      |  match the subrule <comment> if possible
2512                |    |       \ but don't save anything
2513                |    |
2514                |     \___End of rule definition
2515
2516       This kind of static analysis is a useful starting point in debugging a
2517       miscreant grammar, because it enables you to see what you actually
2518       specified (as opposed to what you thought you'd specified).
2519
2520   Debugging grammar execution with "<debug:...>"
2521       Regexp::Grammars also provides a simple interactive debugger, with
2522       which you can observe the process of parsing and the data being
2523       collected in any result-hash.
2524
2525       To initiate debugging, place a "<debug:...>" directive anywhere in your
2526       grammar. When parsing reaches that directive the debugger will be
2527       activated, and the command specified in the directive immediately
2528       executed. The available commands are:
2529
2530           <debug: on>    - Enable debugging, stop when a rule matches
2531           <debug: match> - Enable debugging, stop when a rule matches
2532           <debug: try>   - Enable debugging, stop when a rule is tried
2533           <debug: run>   - Enable debugging, run until the match completes
2534           <debug: same>  - Continue debugging (or not) as currently
2535           <debug: off>   - Disable debugging and continue parsing silently
2536
2537           <debug: continue> - Synonym for <debug: run>
2538           <debug: step>     - Synonym for <debug: try>
2539
2540       These directives can be placed anywhere within a grammar and take
2541       effect when that point is reached in the parsing. Hence, adding a
2542       "<debug:step>" directive is very much like setting a breakpoint at that
2543       point in the grammar. Indeed, a common debugging strategy is to turn
2544       debugging on and off only around a suspect part of the grammar:
2545
2546           <rule: tricky>   # This is where we think the problem is...
2547               <debug:step>
2548               <preamble> <text> <postscript>
2549               <debug:off>
2550
2551       Once the debugger is active, it steps through the parse, reporting
2552       rules that are tried, matches and failures, backtracking and restarts,
2553       and the parser's location within both the grammar and the text being
2554       matched. That report looks like this:
2555
2556           ===============> Trying <grammar> from position 0
2557           > cp file1 file2 |...Trying <cmd>
2558                            |   |...Trying <cmd=(cp)>
2559                            |   |    \FAIL <cmd=(cp)>
2560                            |    \FAIL <cmd>
2561                             \FAIL <grammar>
2562           ===============> Trying <grammar> from position 1
2563            cp file1 file2  |...Trying <cmd>
2564                            |   |...Trying <cmd=(cp)>
2565            file1 file2     |   |    \_____<cmd=(cp)> matched 'cp'
2566           file1 file2      |   |...Trying <[file]>+
2567            file2           |   |    \_____<[file]>+ matched 'file1'
2568                            |   |...Trying <[file]>+
2569           [eos]            |   |    \_____<[file]>+ matched ' file2'
2570                            |   |...Trying <[file]>+
2571                            |   |    \FAIL <[file]>+
2572                            |   |...Trying <target>
2573                            |   |   |...Trying <file>
2574                            |   |   |    \FAIL <file>
2575                            |   |    \FAIL <target>
2576            <~~~~~~~~~~~~~~ |   |...Backtracking 5 chars and trying new match
2577           file2            |   |...Trying <target>
2578                            |   |   |...Trying <file>
2579                            |   |   |    \____ <file> matched 'file2'
2580           [eos]            |   |    \_____<target> matched 'file2'
2581                            |    \_____<cmd> matched ' cp file1 file2'
2582                             \_____<grammar> matched ' cp file1 file2'
2583
2584       The first column indicates the point in the input at which the parser
2585       is trying to match, as well as any backtracking or forward searching it
2586       may need to do. The remainder of the columns track the parser's
2587       hierarchical traversal of the grammar, indicating which rules are
2588       tried, which succeed, and what they match.
2589
2590       Provided the logfile is a terminal (as it is by default), the debugger
2591       also pauses at various points in the parsing process--before trying a
2592       rule, after a rule succeeds, or at the end of the parse--according to
2593       the most recent command issued. When it pauses, you can issue a new
2594       command by entering a single letter:
2595
2596           m       - to continue until the next subrule matches
2597           t or s  - to continue until the next subrule is tried
2598           r or c  - to continue to the end of the grammar
2599           o       - to switch off debugging
2600
2601       Note that these are the first letters of the corresponding
2602       "<debug:...>" commands, listed earlier. Just hitting ENTER while the
2603       debugger is paused repeats the previous command.
2604
2605       While the debugger is paused you can also type a 'd', which will
2606       display the result-hash for the current rule. This can be useful for
2607       detecting which rule isn't returning the data you expected.
2608
2609       Resizing the context string
2610
2611       By default, the first column of the debugger output (which shows the
2612       current matching position within the string) is limited to a width of
2613       20 columns.
2614
       However, you can change that limit by calling the
       "Regexp::Grammars::set_context_width()" subroutine. You have to specify
       the fully qualified name, as Regexp::Grammars does not export this (or
       any other) subroutine.
2619
2620       "set_context_width()" expects a single argument: a positive integer
2621       indicating the maximal allowable width for the context column. It
2622       issues a warning if an invalid value is passed, and ignores it.
2623
2624       If called in a void context, "set_context_width()" changes the context
2625       width permanently throughout your application. If called in a scalar or
2626       list context, "set_context_width()" returns an object whose destructor
2627       will cause the context width to revert to its previous value. This
2628       means you can temporarily change the context width within a given block
2629       with something like:
2630
2631           {
2632               my $temporary = Regexp::Grammars::set_context_width(50);
2633
2634               if ($text =~ $parser) {
2635                   do_stuff_with( %/ );
2636               }
2637
2638           } # <--- context width automagically reverts at this point
2639
2640       and the context width will change back to its previous value when
2641       $temporary goes out of scope at the end of the block.
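
       The revert-on-scope-exit behaviour is an instance of the general
       "guard object" idiom. The following pure-Perl sketch imitates it with
       a stand-in setting; the "set_width()" subroutine and the
       Demo::WidthGuard class are hypothetical and are not the module's
       actual implementation:

```perl
use strict;
use warnings;

my $width = 20;    # stand-in for an internal, application-wide setting

{
    package Demo::WidthGuard;

    # Remember how to undo the change...
    sub new {
        my ($class, $restore) = @_;
        return bless { restore => $restore }, $class;
    }

    # ...and undo it when the guard object is destroyed.
    sub DESTROY {
        my ($self) = @_;
        $self->{restore}->();
    }
}

sub set_width {
    my ($new_width) = @_;
    my $old_width = $width;
    $width = $new_width;

    # Void context: the change is permanent, so no guard is needed.
    return if !defined wantarray;

    # Scalar or list context: return a guard that reverts the change.
    return Demo::WidthGuard->new( sub { $width = $old_width } );
}

{
    my $temporary = set_width(50);
    print "inside block:    $width\n";    # 50
}
print "after block:     $width\n";        # 20 again

set_width(30);
print "after void call: $width\n";        # 30, and it stays that way
```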
2642
2643   User-defined logging with "<log:...>"
2644       Both static and interactive debugging send a series of predefined log
2645       messages to whatever log file you have specified. It is also possible
2646       to send additional, user-defined messages to the log, using the
2647       "<log:...>" directive.
2648
       This directive expects either simple text or a code block as its
       single argument. If the argument is a code block, that code is expected
       to return the text of the message; if the argument is anything else,
       it is used as the literal message. For example:
2653
2654           <rule: ListElem>
2655
2656               <Elem=   ( [a-z]\d+) >
2657                   <log: Checking for a suffix, too...>
2658
2659               <Suffix= ( : \d+   ) >?
2660                   <log: (?{ "ListElem: $MATCH{Elem} and $MATCH{Suffix}" })>
2661
2662       User-defined log messages implemented using a codeblock can also
2663       specify a severity level. If the codeblock of a "<log:...>" directive
2664       returns two or more values, the first is treated as a log message
2665       severity indicator, and the remaining values as separate lines of text
2666       to be logged. For example:
2667
2668           <rule: ListElem>
2669               <Elem=   ( [a-z]\d+) >
2670               <Suffix= ( : \d+   ) >?
2671
2672                   <log: (?{
2673                       warn => "Elem was: $MATCH{Elem}",
2674                               "Suffix was $MATCH{Suffix}",
2675                   })>
2676
2677       When they are encountered, user-defined log messages are interspersed
2678       between any automatic log messages (i.e. from the debugger), at the
2679       correct level of nesting for the current rule.
2680
2681   Debugging non-grammars
2682       [Note that, with the release in 2012 of the Regexp::Debugger module (on
2683       CPAN) the techniques described below are unnecessary. If you need to
2684       debug plain Perl regexes, use Regexp::Debugger instead.]
2685
2686       It is possible to use Regexp::Grammars without creating any subrule
2687       definitions, simply to debug a recalcitrant regex. For example, if the
2688       following regex wasn't working as expected:
2689
2690           my $balanced_brackets = qr{
2691               \(             # left delim
2692               (?:
2693                   \\         # escape or
2694               |   (?R)       # recurse or
2695               |   .          # whatever
2696               )*
2697               \)             # right delim
2698           }xms;
2699
2700       you could instrument it with aliased subpatterns and then debug it
2701       step-by-step, using Regexp::Grammars:
2702
2703           use Regexp::Grammars;
2704
2705           my $balanced_brackets = qr{
2706               <debug:step>
2707
2708               <.left_delim=  (  \(  )>
2709               (?:
2710                   <.escape=  (  \\  )>
2711               |   <.recurse= ( (?R) )>
2712               |   <.whatever=(  .   )>
2713               )*
2714               <.right_delim= (  \)  )>
2715           }xms;
2716
2717           while (<>) {
2718               say 'matched' if /$balanced_brackets/;
2719           }
2720
2721       Note the use of amnesiac aliased subpatterns to avoid needlessly
2722       building a result-hash. Alternatively, you could use listifying aliases
2723       to preserve the matching structure as an additional debugging aid:
2724
2725           use Regexp::Grammars;
2726
2727           my $balanced_brackets = qr{
2728               <debug:step>
2729
2730               <[left_delim=  (  \(  )]>
2731               (?:
2732                   <[escape=  (  \\  )]>
2733               |   <[recurse= ( (?R) )]>
2734               |   <[whatever=(  .   )]>
2735               )*
2736               <[right_delim= (  \)  )]>
2737           }xms;
2738
2739           if ( '(a(bc)d)' =~ /$balanced_brackets/) {
2740               use Data::Dumper 'Dumper';
2741               warn Dumper \%/;
2742           }
2743

Handling errors when parsing

2745       Assuming you have correctly debugged your grammar, the next source of
2746       problems will probably be invalid input (especially if that input is
2747       being provided interactively). So Regexp::Grammars also provides some
2748       support for detecting when a parse is likely to fail...and informing
2749       the user why.
2750
2751   Requirements
       The "<require:...>" directive is useful for testing conditions that
       it's not easy (or even possible) to check within the syntax of the
       regex itself. For example:
2755
           <rule: IPV4_Octet_Decimal>
               # Up to three digits...
               <MATCH= ( \d{1,3}+ )>

               # ...but less than 256...
               <require: (?{ $MATCH <= 255 })>
2762
2763       A require expects a regex codeblock as its argument and succeeds if the
2764       final value of that codeblock is true. If the final value is false, the
2765       directive fails and the rule starts backtracking.
2766
       Note that, in this example, the digits are matched with " \d{1,3}+ ".
       The trailing "+" prevents the "{1,3}" repetition from backtracking to a
       smaller number of digits if the "<require:...>" fails.
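
       To see a requirement at work in a complete grammar, the octet check
       can be embedded in a small address validator. This is only a sketch:
       the "IPv4" and "Octet" token names and the test addresses are
       hypothetical, chosen for illustration:

```perl
use strict;
use warnings;
use 5.010;

my $ipv4 = do {
    use Regexp::Grammars;
    qr{
        \A <IPv4> \z

        <token: IPv4>
            <[Octet]> \. <[Octet]> \. <[Octet]> \. <[Octet]>

        <token: Octet>
            # Up to three digits, matched possessively...
            <MATCH= ( \d{1,3}+ )>

            # ...which must denote a value no greater than 255
            <require: (?{ $MATCH <= 255 })>
    }xms;
};

for my $candidate ('192.168.0.1', '192.168.0.256') {
    say "$candidate: ", ( $candidate =~ $ipv4 ? 'valid' : 'invalid' );
}
```

       The second address is syntactically well-formed, so it is the
       "<require:...>" directive, rather than the digit pattern, that is
       expected to reject it.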
2770
2771   Handling failure
       The module has limited support for error reporting from within a
       grammar, in the form of the "<error:...>" and "<warning:...>"
       directives and their shortcuts: "<...>", "<!!!>", and "<???>".
2775
2776       Error messages
2777
2778       The "<error: MSG>" directive queues a conditional error message within
2779       "@!" and then fails to match (that is, it is equivalent to a "(?!)"
2780       when matching). For example:
2781
2782           <rule: ListElem>
2783               <SerialNumber>
2784             | <ClientName>
2785             | <error: (?{ $errcount++ . ': Missing list element' })>
2786
2787       So a common code pattern when using grammars that do this kind of error
2788       detection is:
2789
2790           if ($text =~ $grammar) {
2791               # Do something with the data collected in %/
2792           }
2793           else {
2794               say {*STDERR} $_ for @!;   # i.e. report all errors
2795           }
2796
2797       Each error message is conditional in the sense that, if any surrounding
2798       rule subsequently matches, the message is automatically removed from
2799       "@!". This implies that you can queue up as many error messages as you
2800       wish, but they will only remain in "@!" if the match ultimately fails.
2801       Moreover, only those error messages originating from rules that
2802       actually contributed to the eventual failure-to-match will remain in
2803       "@!".
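
       The following sketch puts the directive and the reporting pattern
       together in one runnable script. The "Command", "Verb", and "Target"
       rules and the sample inputs are hypothetical; the exact wording of
       any queued message depends on the module:

```perl
use strict;
use warnings;
use 5.010;

my $grammar = do {
    use Regexp::Grammars;
    qr{
        \A <Command> \z

        <rule: Command>
            <Verb> <Target>

        <token: Verb>
            open | close

        <rule: Target>
            door | window
          | <error: Expected a target>
    }xms;
};

for my $text ('open door', 'open sesame') {
    if ($text =~ $grammar) {
        say "'$text' parsed successfully";
    }
    else {
        # Report whatever messages were left in the queue by the failed parse
        say "'$text' failed: $_" for @!;
    }
}
```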
2804
2805       If a code block is specified as the argument, the error message is
2806       whatever final value is produced when the block is executed. Note that
2807       this final value does not have to be a string (though it does have to
2808       be a scalar).
2809
2810           <rule: ListElem>
2811               <SerialNumber>
2812             | <ClientName>
2813             | <error: (?{
2814                   # Return a hash, with the error information...
2815                   { errnum => $errcount++, msg => 'Missing list element' }
2816               })>
2817
2818       If anything else is specified as the argument, it is treated as a
2819       literal error string (and may not contain an unbalanced '<' or '>', nor
2820       any interpolated variables).
2821
2822       However, if the literal error string begins with "Expected " or
2823       "Expecting ", then the error string automatically has the following
2824       "context suffix" appended:
2825
2826           , but found '$CONTEXT' instead
2827
2828       For example:
2829
2830           qr{ <Arithmetic_Expression>                # ...Match arithmetic expression
2831             |                                        # Or else
2832               <error: Expected a valid expression>   # ...Report error, and fail
2833
2834               # Rule definitions here...
2835           }xms;
2836
2837       On an invalid input this example might produce an error message like:
2838
2839           "Expected a valid expression, but found '(2+3]*7/' instead"
2840
2841       The value of the special $CONTEXT variable is found by looking ahead in
2842       the string being matched against, to locate the next sequence of non-
2843       blank characters after the current parsing position. This variable may
2844       also be explicitly used within the "<error: (?{...})>" form of the
2845       directive.
2846
2847       As a special case, if you omit the message entirely from the directive,
2848       it is supplied automatically, derived from the name of the current
2849       rule.  For example, if the following rule were to fail to match:
2850
2851           <rule: Arithmetic_expression>
2852                 <Multiplicative_Expression>+ % ([+-])
2853               | <error:>
2854
2855       the error message queued would be:
2856
2857           "Expected arithmetic expression, but found 'one plus two' instead"
2858
2859       Note however, that it is still essential to include the colon in the
2860       directive. A common mistake is to write:
2861
2862           <rule: Arithmetic_expression>
2863                 <Multiplicative_Expression>+ % ([+-])
2864               | <error>
2865
2866       which merely attempts to call "<rule: error>" if the first alternative
2867       fails.
2868
2869       Warning messages
2870
2871       Sometimes, you want to detect problems, but not invalidate the entire
2872       parse as a result. For those occasions, the module provides a "less
2873       stringent" form of error reporting: the "<warning:...>" directive.
2874
2875       This directive is exactly the same as an "<error:...>" in every respect
2876       except that it does not induce a failure to match at the point it
2877       appears.
2878
2879       The directive is, therefore, useful for reporting non-fatal problems in
2880       a parse. For example:
2881
2882           qr{ \A            # ...Match only at start of input
2883               <ArithExpr>   # ...Match a valid arithmetic expression
2884
2885               (?:
2886                   # Should be at end of input...
2887                   \s* \Z
2888                 |
2889                   # If not, report the fact but don't fail...
2890                   <warning: Expected end-of-input>
2891                   <warning: (?{ "Extra junk at index $INDEX: $CONTEXT" })>
2892               )
2893
2894               # Rule definitions here...
2895           }xms;
2896
2897       Note that, because they do not induce failure, two or more
2898       "<warning:...>" directives can be "stacked" in sequence, as in the
2899       previous example.
2900
2901       Stubbing
2902
2903       The module also provides three useful shortcuts, specifically to make
2904       it easy to declare, but not define, rules and tokens.
2905
2906       The "<...>" and "<!!!>" directives are equivalent to the directive:
2907
2908           <error: Cannot match RULENAME (not implemented)>
2909
2910       The "<???>" is equivalent to the directive:
2911
2912           <warning: Cannot match RULENAME (not implemented)>
2913
2914       For example, in the following grammar:
2915
2916           <grammar: List::Generic>
2917
2918           <rule: List>
2919               <[Item]>+ % (\s*,\s*)
2920
2921           <rule: Item>
2922               <...>
2923
       the "Item" rule is declared but not defined. That means the grammar
       will compile correctly (the "List" rule won't complain about a call to
       a non-existent "Item"), but if the "Item" rule isn't overridden in some
       derived grammar, a match-time error will occur when "List" tries to
       match the "<...>" within "Item".
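
       A derived grammar that supplies the missing rule might be sketched
       as follows. The "<extends:...>" directive is the module's grammar
       inheritance mechanism; the digits-only "Item" token and the sample
       input are assumptions chosen purely for illustration.

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Base grammar: "Item" is declared, but only as a stub...
my $generic = qr{
    <grammar: List::Generic>

    <rule: List>
        <[Item]>+ % (\s*,\s*)

    <rule: Item>
        <...>
}x;

# Derived grammar: inherits "List" and overrides the stubbed "Item"...
my $parser = qr{
    <extends: List::Generic>

    \A <List> \z

    <token: Item>
        \d+
}x;

my @items;
if ('1, 2, 3' =~ $parser) {
    @items = @{ $/{List}{Item} };
    print "Items: @items\n";
}
```

       Because the derived grammar defines its own "Item", the stub in the
       base grammar should never be reached during this match.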
2929
2930       Localizing the (semi-)automatic error messages
2931
2932       Error directives of any of the following forms:
2933
2934           <error: Expecting identifier>
2935
2936           <error: >
2937
2938           <...>
2939
2940           <!!!>
2941
2942       or their warning equivalents:
2943
2944           <warning: Expecting identifier>
2945
2946           <warning: >
2947
2948           <???>
2949
2950       each autogenerate part or all of the actual error message they produce.
2951       By default, that autogenerated message is always produced in English.
2952
2953       However, the module provides a mechanism by which you can intercept
2954       every error or warning that is queued to "@!"  via these
2955       directives...and localize those messages.
2956
2957       To do this, you call "Regexp::Grammars::set_error_translator()" (with
2958       the full qualification, since Regexp::Grammars does not export it...nor
2959       anything else, for that matter).
2960
       The "set_error_translator()" subroutine expects a single argument,
       which must be a reference to another subroutine.  This subroutine is
       then called whenever an error or warning message is queued to "@!".
2964
2965       The subroutine is passed three arguments:
2966
2967       •   the message string,
2968
2969       •   the name of the rule from which the error or warning was queued,
2970           and
2971
2972       •   the value of $CONTEXT when the error or warning was encountered
2973
2974       The subroutine is expected to return the final version of the message
2975       that is actually to be appended to "@!". To accomplish this it may make
2976       use of one of the many internationalization/localization modules
2977       available in Perl, or it may do the conversion entirely by itself.
2978
2979       The first argument is always exactly what appeared as a message in the
2980       original directive (regardless of whether that message is supposed to
2981       trigger autogeneration, or is just a "regular" error message).  That
2982       is:
2983
2984           Directive                         1st argument
2985
2986           <error: Expecting identifier>     "Expecting identifier"
2987           <warning: That's not a moon!>     "That's not a moon!"
2988           <error: >                         ""
2989           <warning: >                       ""
2990           <...>                             ""
2991           <!!!>                             ""
2992           <???>                             ""
2993
2994       The second argument always contains the name of the rule in which the
2995       directive was encountered. For example, when invoked from within
2996       "<rule: Frinstance>" the following directives produce:
2997
2998           Directive                         2nd argument
2999
3000           <error: Expecting identifier>     "Frinstance"
3001           <warning: That's not a moon!>     "Frinstance"
3002           <error: >                         "Frinstance"
3003           <warning: >                       "Frinstance"
3004           <...>                             "-Frinstance"
3005           <!!!>                             "-Frinstance"
3006           <???>                             "-Frinstance"
3007
3008       Note that the "unimplemented" markers pass the rule name with a
3009       preceding '-'. This allows your translator to distinguish between
3010       "empty" messages (which should then be generated automatically) and the
3011       "unimplemented" markers (which should report that the rule is not yet
3012       properly defined).
3013
3014       If you call "Regexp::Grammars::set_error_translator()" in a void
3015       context, the error translator is permanently replaced (at least, until
3016       the next call to "set_error_translator()").
3017
3018       However, if you call "Regexp::Grammars::set_error_translator()" in a
3019       scalar or list context, it returns an object whose destructor will
3020       restore the previous translator. This allows you to install a
3021       translator only within a given scope, like so:
3022
3023           {
3024               my $temporary
3025                   = Regexp::Grammars::set_error_translator(\&my_translator);
3026
3027               if ($text =~ $parser) {
3028                   do_stuff_with( %/ );
3029               }
3030               else {
3031                   report_errors_in( @! );
3032               }
3033
3034           } # <--- error translator automagically reverts at this point
3035
3036       Warning: any error translation subroutine you install will be called
3037       during the grammar's parsing phase (i.e. as the grammar's regex is
3038       matching). You should therefore ensure that your translator does not
3039       itself use regular expressions, as nested evaluations of regexes inside
3040       other regexes are extremely problematical (i.e. almost always
3041       disastrous) in Perl.
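
       As an illustration of the interface described above, here is a
       minimal sketch of a translator. The German wording, the %german_for
       lookup table, and the name "german_translator" are all hypothetical;
       only the three-argument calling convention and the leading '-' on
       "unimplemented" rule names are taken from the description above. In
       keeping with the preceding warning, the sketch relies only on
       "substr", "lc", and "tr///", and never performs a regex match.

```perl
use strict;
use warnings;

# Hypothetical table mapping literal messages to their translations.
my %german_for = (
    'Expecting identifier' => 'Bezeichner erwartet',
);

# Receives (message, rule name, context) and returns the final message.
sub german_translator {
    my ($message, $rulename, $context) = @_;

    # "Unimplemented" markers pass the rule name with a leading '-'...
    if (substr($rulename, 0, 1) eq '-') {
        my $name = substr($rulename, 1);
        return "Regel '$name' ist noch nicht implementiert";
    }

    # Empty message: build one from the rule name and the context...
    if ($message eq '') {
        (my $description = lc $rulename) =~ tr/_/ /;
        return "Erwartet: $description, gefunden: '$context'";
    }

    # Literal message: translate it if known, otherwise pass it through...
    return $german_for{$message} // $message;
}

# The translator can be exercised directly, without parsing anything...
print german_translator('', 'Arithmetic_expression', 'one plus two'), "\n";
print german_translator('', '-Item', ''), "\n";
print german_translator('Expecting identifier', 'Frinstance', 'foo'), "\n";

# To install it for the current scope only (requires Regexp::Grammars):
#
#     my $temporary
#         = Regexp::Grammars::set_error_translator(\&german_translator);
```

       Checking for the leading '-' first matters here, because the
       "unimplemented" markers also pass an empty message.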
3042
3043   Restricting how long a parse runs
3044       Like the core Perl 5 regex engine on which they are built, the grammars
3045       implemented by Regexp::Grammars are essentially top-down parsers. This
3046       means that they may occasionally require an exponentially long time to
3047       parse a particular input. This usually occurs if a particular grammar
3048       includes a lot of recursion or nested backtracking, especially if the
3049       grammar is then matched against a long string.
3050
3051       The judicious use of non-backtracking repetitions (i.e. "x*+" and
3052       "x++") can significantly improve parsing performance in many such
3053       cases. Likewise, carefully reordering any high-level alternatives (so
3054       as to test simple common cases first) can substantially reduce parsing
3055       times.
3056
3057       However, some languages are just intrinsically slow to parse using top-
3058       down techniques (or, at least, may have slow-to-parse corner cases).
3059
3060       To help cope with this constraint, Regexp::Grammars provides a
3061       mechanism by which you can limit the total effort that a given grammar
3062       will expend in attempting to match. The "<timeout:...>" directive
3063       allows you to specify how long a grammar is allowed to continue trying
3064       to match before giving up. It expects a single argument, which must be
3065       an unsigned integer, and it treats this integer as the number of
3066       seconds to continue attempting to match.
3067
3068       For example:
3069
3070           <timeout: 10>    # Give up after 10 seconds
3071
3072       indicates that the grammar should keep attempting to match for another
3073       10 seconds from the point where the directive is encountered during a
3074       parse. If the complete grammar has not matched in that time, the entire
3075       match is considered to have failed, the matching process is immediately
3076       terminated, and a standard error message ('Internal error: Timed out
3077       after 10 seconds (as requested)') is returned in "@!".
3078
3079       A "<timeout:...>" directive can be placed anywhere in a grammar, but is
3080       most usually placed at the very start, so that the entire grammar is
3081       governed by the specified time limit. The second most common
3082       alternative is to place the timeout at the start of a particular
3083       subrule that is known to be potentially very slow.
3084
3085       A common mistake is to put the timeout specification at the top level
3086       of the grammar, but place it after the actual subrule to be matched,
3087       like so:
3088
3089           my $grammar = qr{
3090
3091               <Text_Corpus>      # Subrule to be matched
3092               <timeout: 10>      # Useless use of timeout
3093
3094               <rule: Text_Corpus>
3095                   # et cetera...
3096           }xms;
3097
3098       Since the parser will only reach the "<timeout: 10>" directive after it
3099       has completely matched "<Text_Corpus>", the timeout is only initiated
3100       at the very end of the matching process and so does not limit that
3101       process in any useful way.
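
       By contrast, a sketch of the effective placement puts the directive
       ahead of the subrule call it is meant to govern. The "Word" token,
       the whitespace separator, and the sample input are assumptions made
       for illustration only.

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $grammar = qr{
    <timeout: 10>      # Time limit starts before any matching is attempted
    <Text_Corpus>      # Subrule to be matched, now governed by the limit

    <token: Text_Corpus>
        <[Word]>+ % (\s+)

    <token: Word>
        \w+
}xms;

my @words;
if ('the quick brown fox' =~ $grammar) {
    @words = @{ $/{Text_Corpus}{Word} };
    print scalar(@words), " words parsed\n";
}
else {
    # A timeout, like any other failure, is reported via the error queue...
    print "Parse failed: $_\n" for @!;
}
```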
3102
3103       Immediate timeouts
3104
3105       As you might expect, a "<timeout: 0>" directive tells the parser to
3106       keep trying for only zero more seconds, and therefore will immediately
3107       cause the entire surrounding grammar to fail (no matter how deeply
3108       within that grammar the directive is encountered).
3109
       This can occasionally be extremely useful. If you know that detecting
       a particular datum means that the grammar will never match, no matter
       how many other alternatives may subsequently be tried, you can short-
       circuit the parser by injecting a "<timeout: 0>" immediately after the
       offending datum is detected.
3115
3116       For example, if your grammar only accepts certain versions of the
3117       language being parsed, you could write:
3118
3119           <rule: Valid_Language_Version>
3120                   vers = <%AcceptableVersions>
3121               |
3122                   vers = <bad_version=(\S++)>
3123                   <warning: (?{ "Cannot parse language version $MATCH{bad_version}" })>
3124                   <timeout: 0>
3125
3126       In fact, this "<warning: MSG> <timeout: 0>" sequence is sufficiently
3127       useful, sufficiently complex, and sufficiently easy to get wrong, that
3128       Regexp::Grammars provides a handy shortcut for it: the "<fatal:...>"
3129       directive. A "<fatal:...>" is exactly equivalent to a "<warning:...>"
3130       followed by a zero-timeout, so the previous example could also be
3131       written:
3132
3133           <rule: Valid_Language_Version>
3134                   vers = <%AcceptableVersions>
3135               |
3136                   vers = <bad_version=(\S++)>
3137                   <fatal: (?{ "Cannot parse language version $MATCH{bad_version}" })>
3138
3139       Like "<error:...>" and "<warning:...>", "<fatal:...>" also provides its
3140       own failure context in $CONTEXT, so the previous example could be
3141       further simplified to:
3142
3143           <rule: Valid_Language_Version>
3144                   vers = <%AcceptableVersions>
3145               |
3146                   vers = <fatal:(?{ "Cannot parse language version $CONTEXT" })>
3147
3148       Also like "<error:...>", "<fatal:...>" can autogenerate an error
3149       message if none is provided, so the example could be still further
3150       reduced to:
3151
3152           <rule: Valid_Language_Version>
3153                   vers = <%AcceptableVersions>
3154               |
3155                   vers = <fatal:>
3156
3157       In this last case, however, the error message returned in "@!" would no
3158       longer be:
3159
3160           Cannot parse language version 0.95
3161
3162       It would now be:
3163
3164           Expected valid language version, but found '0.95' instead
3165

Scoping considerations

3167       If you intend to use a grammar as part of a larger program that
3168       contains other (non-grammatical) regexes, it is more efficient--and
3169       less error-prone--to avoid having Regexp::Grammars process those
3170       regexes as well. So it's often a good idea to declare your grammar in a
3171       "do" block, thereby restricting the scope of the module's effects.
3172
3173       For example:
3174
3175           my $grammar = do {
3176               use Regexp::Grammars;
3177               qr{
3178                   <file>
3179
3180                   <rule: file>
3181                       <prelude>
3182                       <data>
3183                       <postlude>
3184
3185                   <rule: prelude>
3186                       # etc.
3187               }x;
3188           };
3189
3190       Because the effects of Regexp::Grammars are lexically scoped, any
3191       regexes defined outside that "do" block will be unaffected by the
3192       module.
3193

INTERFACE

3195   Perl API
3196       "use Regexp::Grammars;"
3197           Causes all regexes in the current lexical scope to be compile-time
3198           processed for grammar elements.
3199
3200       "$str =~ $grammar"
3201       "$str =~ /$grammar/"
3202           Attempt to match the grammar against the string, building a nested
3203           data structure from it.
3204
3205       "%/"
3206           This hash is assigned the nested data structure created by any
3207           successful match of a grammar regex.
3208
3209       "@!"
3210           This array is assigned the queue of error messages created by any
3211           unsuccessful match attempt of a grammar regex.
3212
3213   Grammar syntax
3214       Directives
3215
3216       "<rule: IDENTIFIER>"
3217           Define a rule whose name is specified by the supplied identifier.
3218
3219           Everything following the "<rule:...>" directive (up to the next
3220           "<rule:...>" or "<token:...>" directive) is treated as part of the
3221           rule being defined.
3222
3223           Any whitespace in the rule is replaced by a call to the "<.ws>"
3224           subrule (which defaults to matching "\s*", but may be explicitly
3225           redefined).
3226
3227       "<token: IDENTIFIER>"
           Define a token whose name is specified by the supplied identifier.

           Everything following the "<token:...>" directive (up to the next
           "<rule:...>" or "<token:...>" directive) is treated as part of the
           token being defined.

           Any whitespace in the token is ignored (under the "/x" modifier), or
           explicitly matched (if "/x" is not used).
3236
3237       "<objrule:  IDENTIFIER>"
3238       "<objtoken: IDENTIFIER>"
3239           Identical to a "<rule: IDENTIFIER>" or "<token: IDENTIFIER>"
3240           declaration, except that the rule or token will also bless the hash
3241           it normally returns, converting it to an object of a class whose
3242           name is the same as the rule or token itself.
3243
3244       "<require: (?{ CODE }) >"
3245           The code block is executed and if its final value is true, matching
3246           continues from the same position. If the block's final value is
3247           false, the match fails at that point and starts backtracking.
3248
3249       "<error: (?{ CODE })  >"
3250       "<error: LITERAL TEXT >"
3251       "<error: >"
3252           This directive queues a conditional error message within the global
3253           special variable "@!" and then fails to match at that point (that
3254           is, it is equivalent to a "(?!)" or "(*FAIL)" when matching).
3255
3256       "<fatal: (?{ CODE })  >"
3257       "<fatal: LITERAL TEXT >"
3258       "<fatal: >"
           This directive is exactly the same as an "<error:...>" in every
           respect except that it immediately causes the entire surrounding
           grammar to fail, and parsing to immediately cease.
3262
3263       "<warning: (?{ CODE })  >"
3264       "<warning: LITERAL TEXT >"
3265           This directive is exactly the same as an "<error:...>" in every
3266           respect except that it does not induce a failure to match at the
3267           point it appears. That is, it is equivalent to a "(?=)" ["succeed
3268           and continue matching"], rather than a "(?!)" ["fail and
3269           backtrack"].
3270
3271       "<debug: COMMAND >"
3272           During the matching of grammar regexes send debugging and warning
3273           information to the specified log file (see "<logfile: LOGFILE>").
3274
3275           The available "COMMAND"'s are:
3276
3277               <debug: continue>    ___ Debug until end of complete parse
3278               <debug: run>         _/
3279
3280               <debug: on>          ___ Debug until next subrule match
3281               <debug: match>       _/
3282
3283               <debug: try>         ___ Debug until next subrule call or match
3284               <debug: step>        _/
3285
3286               <debug: same>        ___ Maintain current debugging mode
3287
3288               <debug: off>         ___ No debugging
3289
3290           See also the $DEBUG special variable.
3291
3292       "<logfile: LOGFILE>"
3293       "<logfile:    -   >"
3294           During the compilation of grammar regexes, send debugging and
3295           warning information to the specified LOGFILE (or to *STDERR if "-"
3296           is specified).
3297
3298           If the specified LOGFILE name contains a %t, it is replaced with a
3299           (sortable) "YYYYMMDD.HHMMSS" timestamp. For example:
3300
3301               <logfile: test-run-%t >
3302
3303           executed at around 9.30pm on the 21st of March 2009, would generate
3304           a log file named: "test-run-20090321.213056"
3305
3306       "<log: (?{ CODE })  >"
3307       "<log: LITERAL TEXT >"
3308           Append a message to the log file. If the argument is a code block,
3309           that code is expected to return the text of the message; if the
3310           argument is anything else, that something else is the literal
3311           message.
3312
3313           If the block returns two or more values, the first is treated as a
3314           log message severity indicator, and the remaining values as
3315           separate lines of text to be logged.
3316
3317       "<timeout: INT >"
           Restrict the match-time of the parse to the specified number of
           seconds.  Queues an error message and terminates the entire match
           process if the parse does not complete within the nominated time
           limit.
3322
3323       Subrule calls
3324
3325       "<IDENTIFIER>"
3326           Call the subrule whose name is IDENTIFIER.
3327
3328           If it matches successfully, save the hash it returns in the current
3329           scope's result-hash, under the key 'IDENTIFIER'.
3330
3331       "<IDENTIFIER_1=IDENTIFIER_2>"
           Call the subrule whose name is IDENTIFIER_2.

           If it matches successfully, save the hash it returns in the current
           scope's result-hash, under the key 'IDENTIFIER_1'.
3336
3337           In other words, the "IDENTIFIER_1=" prefix changes the key under
3338           which the result of calling a subrule is stored.
3339
3340       "<.IDENTIFIER>"
3341           Call the subrule whose name is IDENTIFIER.  Don't save the hash it
3342           returns.
3343
3344           In other words, the "dot" prefix disables saving of subrule
3345           results.
3346
3347       "<IDENTIFIER= ( PATTERN )>"
3348           Match the subpattern PATTERN.
3349
3350           If it matches successfully, capture the substring it matched and
3351           save that substring in the current scope's result-hash, under the
3352           key 'IDENTIFIER'.
3353
3354       "<.IDENTIFIER= ( PATTERN )>"
3355           Match the subpattern PATTERN.  Don't save the substring it matched.
3356
3357       "<IDENTIFIER= %HASH>"
3358           Match a sequence of non-whitespace then verify that the sequence is
3359           a key in the specified hash
3360
3361           If it matches successfully, capture the sequence it matched and
3362           save that substring in the current scope's result-hash, under the
3363           key 'IDENTIFIER'.
3364
3365       "<%HASH>"
3366           Match a key from the hash.  Don't save the substring it matched.
3367
3368       "<IDENTIFIER= (?{ CODE })>"
3369           Execute the specified CODE.
3370
3371           Save the result (of the final expression that the CODE evaluates)
3372           in the current scope's result-hash, under the key 'IDENTIFIER'.
3373
3374       "<[IDENTIFIER]>"
3375           Call the subrule whose name is IDENTIFIER.
3376
           If it matches successfully, append the hash it returns to a nested
           array within the current scope's result-hash, under the key
           'IDENTIFIER'.
3380
3381       "<[IDENTIFIER_1=IDENTIFIER_2]>"
           Call the subrule whose name is IDENTIFIER_2.

           If it matches successfully, append the hash it returns to a nested
           array within the current scope's result-hash, under the key
           'IDENTIFIER_1'.
3387
3388       "<ANY_SUBRULE>+ % <ANY_OTHER_SUBRULE>"
3389       "<ANY_SUBRULE>* % <ANY_OTHER_SUBRULE>"
3390       "<ANY_SUBRULE>+ % (PATTERN)"
3391       "<ANY_SUBRULE>* % (PATTERN)"
3392           Repeatedly call the first subrule.  Keep matching as long as the
3393           subrule matches, provided successive matches are separated by
3394           matches of the second subrule or the pattern.
3395
3396           In other words, match a list of ANY_SUBRULE's separated by
3397           ANY_OTHER_SUBRULE's or PATTERN's.
3398
3399           Note that, if a pattern is used to specify the separator, it must
3400           be specified in some kind of matched parentheses. These may be
3401           capturing ["(...)"], non-capturing ["(?:...)"], non-backtracking
3402           ["(?>...)"], or any other construct enclosed by an opening and
3403           closing paren.
3404
3405       "<ANY_SUBRULE>+ %% <ANY_OTHER_SUBRULE>"
3406       "<ANY_SUBRULE>* %% <ANY_OTHER_SUBRULE>"
3407       "<ANY_SUBRULE>+ %% (PATTERN)"
3408       "<ANY_SUBRULE>* %% (PATTERN)"
3409           Repeatedly call the first subrule.  Keep matching as long as the
3410           subrule matches, provided successive matches are separated by
3411           matches of the second subrule or the pattern.
3412
3413           Also allow an optional final trailing instance of the second
3414           subrule or pattern (this is where "%%" differs from "%").
3415
3416           In other words, match a list of ANY_SUBRULE's separated by
3417           ANY_OTHER_SUBRULE's or PATTERN's, with a possible final separator.
3418
3419           As for the single "%" operator, if a pattern is used to specify the
3420           separator, it must be specified in some kind of matched
3421           parentheses.  These may be capturing ["(...)"], non-capturing
3422           ["(?:...)"], non-backtracking ["(?>...)"], or any other construct
3423           enclosed by an opening and closing paren.
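
       Several of the call forms listed above can be combined in a single
       rule. The following is a minimal sketch only: the rule names
       ("Assignment", "Name", "Number"), the keys ("target", "value"), and
       the sample input are hypothetical.

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr{
    \A <Assignment> \z

    <rule: Assignment>
        <target=Name>               # Call Name, save under the key 'target'
        =
        <[value=Number]>+ %% (,)    # Call Number repeatedly, save as a list
                                    # under 'value', allow a trailing comma

    <token: Name>
        [A-Za-z_]\w*

    <token: Number>
        \d+
}xms;

my %assignment;
if ('xs = 1,2,3,' =~ $parser) {
    %assignment = %{ $/{Assignment} };
    print "target: $assignment{target}\n";
    print "values: @{ $assignment{value} }\n";
}
```

       With "%" in place of "%%", the trailing comma in this input would be
       left unmatched, so the "\z" anchor would cause the match to fail.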
3424
3425   Special variables within grammar actions
3426       $CAPTURE
3427       $CONTEXT
3428           These are both aliases for the built-in read-only $^N variable,
3429           which always contains the substring matched by the nearest
3430           preceding "(...)"  capture. $^N still works perfectly well, but
3431           these are provided to improve the readability of code blocks and
3432           error messages respectively.
3433
3434       $INDEX
3435           This variable contains the index at which the next match will be
3436           attempted within the string being parsed. It is most commonly used
3437           in "<error:...>" or "<log:...>" directives:
3438
3439               <rule: ListElem>
3440                   <log: (?{ "Trying words at index $INDEX" })>
3441                   <MATCH=( \w++ )>
3442                 |
3443                   <log: (?{ "Trying digits at index $INDEX" })>
3444                   <MATCH=( \d++ )>
3445                 |
3446                   <error: (?{ "Missing ListElem near index $INDEX" })>
3447
3448       %MATCH
           This variable contains all the saved results of any subrules called
           from the current rule. In other words, subrule calls like:

               <ListElem>  <Separator= (,)>

           store their respective match results in $MATCH{'ListElem'} and
           $MATCH{'Separator'}.
3456
3457       $MATCH
3458           This variable is an alias for $MATCH{"="}. This is the %MATCH entry
3459           for the special "override value". If this entry is defined, its
3460           value overrides the usual "return \%MATCH" semantics of a
3461           successful rule.
3462
3463       %ARG
           This variable contains all the key/value pairs that were passed
           into a particular subrule call. For example, given the calls:

               <Keyword>  <Command>  <Terminator(:Keyword)>

           the "Terminator" rule could get access to the text matched by
           "<Keyword>" like so:
3471
3472               <token: Terminator>
3473                   end_ (??{ $ARG{'Keyword'} })
3474
           Note that to match against the calling subrule's 'Keyword' value,
           it's necessary to use either a deferred interpolation ("(??{...})")
           or a qualified matchref:
3478
3479               <token: Terminator>
3480                   end_ <\:Keyword>
3481
3482           A common mistake is to attempt to directly interpolate the
3483           argument:
3484
3485               <token: Terminator>
3486                   end_ $ARG{'Keyword'}
3487
3488           This evaluates $ARG{'Keyword'} when the grammar is compiled, rather
3489           than when the rule is matched.
3490
3491       $_  At the start of any code blocks inside any regex, the variable $_
3492           contains the complete string being matched against. The current
3493           matching position within that string is given by: "pos($_)".
3494
3495       $DEBUG
3496           This variable stores the current debugging mode (which may be any
3497           of: 'off', 'on', 'run', 'continue', 'match', 'step', or 'try'). It
3498           is set automatically by the "<debug:...>" command, but may also be
3499           set manually in a code block (which can be useful for conditional
3500           debugging). For example:
3501
3502               <rule: ListElem>
3503                   <Identifier>
3504
3505                   # Conditionally debug if 'foobar' encountered...
3506                   (?{ $DEBUG = $MATCH{Identifier} eq 'foobar' ? 'step' : 'off' })
3507
3508                   <Modifier>?
3509
           See also: the "<logfile: LOGFILE>" and "<debug: DEBUG_CMD>"
           directives.
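
       To show how %MATCH and $MATCH interact, the following sketch sums a
       list of integers and returns the total in place of the usual result
       hash. The "Sum" and "Term" names and the sample input are
       assumptions made for illustration.

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr{
    \A <Sum> \z

    <rule: Sum>
        <[Term]>+ % ([+])

        # Replace the usual hash result with the computed total...
        (?{
            my $total = 0;
            for my $term (@{ $MATCH{Term} }) {
                $total += $term;
            }
            $MATCH = $total;
        })

    <token: Term>
        \d+
}xms;

my $sum;
if ('1+2+3' =~ $parser) {
    $sum = $/{Sum};     # A plain number, not a hash reference
    print "Sum: $sum\n";
}
```

       Without the assignment to $MATCH, the value stored under 'Sum' would
       instead be a hash containing the list of matched terms.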
3511

IMPORTANT CONSTRAINTS AND LIMITATIONS

       •   Prior to Perl 5.14, the Perl 5 regex engine was not reentrant. So
           any attempt to perform a regex match inside a "(?{ ... })" or "(??{
           ... })" under Perl 5.12 or earlier will almost certainly lead to
           either weird data corruption or a segfault.
3517
           The same calamities can also occur in any constructor called by
           "<objrule:>". If the constructor invokes another regex in any way,
           it will most likely fail catastrophically. In particular, this
           means that Moose constructors will frequently crash and burn within
           a Regexp::Grammars grammar (for example, if the Moose-based class
           declares an attribute type constraint such as 'Int', which Moose
           checks using a regex).
3525
3526       •   The additional regex constructs this module provides are
3527           implemented by rewriting regular expressions. This is a (safer)
3528           form of source filtering, but still subject to all the same
3529           limitations and fallibilities of any other macro-based solution.
3530
       •   In particular, rewriting the macros involves the insertion of (a
           lot of) extra capturing parentheses. This means you can no longer
           assume that particular capturing parens correspond to particular
           numeric variables: i.e. to $1, $2, $3 etc. If you want to capture
           directly, use Perl 5.10's named capture construct:
3536
3537               (?<name> [^\W\d]\w* )
3538
3539           Better still, capture the data in its correct hierarchical context
3540           using the module's "named subpattern" construct:
3541
3542               <name= ([^\W\d]\w*) >
3543
3544       •   No recursive descent parser--including those created with
3545           Regexp::Grammars--can directly handle left-recursive grammars with
3546           rules of the form:
3547
3548               <rule: List>
3549                   <List> , <ListElem>
3550
3551           If you find yourself attempting to write a left-recursive grammar
3552           (which Perl 5.10 may or may not complain about, but will never
3553           successfully parse with), then you probably need to use the
3554           "separated list" construct instead:
3555
3556               <rule: List>
3557                   <[ListElem]>+ % (,)
3558
3559       •   Grammatical parsing with Regexp::Grammars can fail if your grammar
3560           uses "non-backtracking" directives (i.e. the "(?>...)" block or the
3561           "?+", "*+", or "++" repetition specifiers). The problem appears to
3562           be that preventing the regex from backtracking through the in-regex
3563           actions that Regexp::Grammars adds causes the module's internal
3564           stack to fall out of sync with the regex match.
3565
           For the time being, if your grammar does not work as expected, you
           may need to replace one or more "non-backtracking" directives with
           their regular (i.e. backtracking) equivalents.
3569
3570       •   Similarly, parsing with Regexp::Grammars will fail if your grammar
3571           places a subrule call within a positive look-ahead, since these
3572           don't play nicely with the data stack.
3573
3574           This seems to be an internal problem with perl itself.
3575           Investigations, and attempts at a workaround, are proceeding.
3576
           For the time being, you need to make sure that grammar rules don't
           appear inside a positive lookahead, or use the "<?RULENAME>"
           construct instead.
3580
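       As a minimal sketch of the "separated list" rewrite suggested above for
       left-recursive grammars (the ListElem token and the input string are
       illustrative assumptions, not taken from the module):

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: a comma-separated list of words.
# The left-recursive form (<List> , <ListElem>) is replaced by a
# repeated, list-capturing subrule call with a separator.
my $list_parser = qr{
    \A <List> \z

    <rule: List>
        <[ListElem]>+ % (,)

    <token: ListElem>
        \w+
}xms;

my @elems;
if ('a,b,c' =~ $list_parser) {
    # <[ListElem]> saves every match in an array under the key 'ListElem'
    @elems = @{ $/{List}{ListElem} };
}

print "Parsed ", scalar(@elems), " element(s): @elems\n";
```

       Each element matched by <[ListElem]> should be appended, in order, to
       the array stored under the 'ListElem' key of the List result.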

DIAGNOSTICS

3582       Note that (because the author cannot find a way to throw exceptions
3583       from within a regex) none of the following diagnostics actually throws
3584       an exception.
3585
3586       Instead, these messages are simply written to the specified parser
3587       logfile (or to *STDERR, if no logfile is specified).
3588
3589       However, any fatal match-time message will immediately terminate the
3590       parser matching and will still set $@ (as if an exception had been
3591       thrown and caught at that point in the code). You then have the option
3592       to check $@ immediately after matching with the grammar, and rethrow if
3593       necessary:
3594
3595           if ($input =~ $grammar) {
3596               process_data_in(\%/);
3597           }
3598           else {
3599               die if $@;
3600           }
3601
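       If you check $@ after every match, the pattern above could be wrapped
       in a small helper. The following is only a sketch: parse_or_die() is a
       hypothetical name, not something the module provides, and it clears $@
       before matching on the assumption that a stale error from earlier code
       should not be mistaken for a parse failure:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Regexp::Grammars): match $input
# against $grammar, then rethrow any fatal match-time error that the
# module recorded in $@ instead of throwing.
sub parse_or_die {
    my ($input, $grammar) = @_;

    $@ = '';                    # discard any stale error before matching

    if ($input =~ $grammar) {
        my %result = %/;        # shallow copy of the result-hash
        return \%result;
    }

    die $@ if $@;               # fatal parse error: rethrow it
    return;                     # ordinary match failure: no exception
}
```

       A caller can then distinguish the three outcomes: a hash reference on
       success, an empty return on an ordinary failed match, and an exception
       when the parse was terminated by a fatal diagnostic.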
3602       "Found call to %s, but no %s was defined in the grammar"
           You specified a call to a subrule for which there was no definition
           in the grammar. Typically that's either because you forgot to
           define the rule, or because you misspelled either the definition or
           the subrule call. For example:
3607
3608               <file>
3609
3610               <rule: fiel>            <---- misspelled rule
3611                   <lines>             <---- used but never defined
3612
3613           Regexp::Grammars converts any such subrule call attempt to an
3614           instant catastrophic failure of the entire parse, so if your parser
3615           ever actually tries to perform that call, Very Bad Things will
3616           happen.
3617
3618       "Entire parse terminated prematurely while attempting to call
3619       non-existent rule: %s"
           You ignored the previous error and actually tried to call a
           subrule for which there was no definition in the grammar. Very Bad
           Things are now happening. The parser got very upset, took its ball,
           and went home. See the preceding diagnostic for remedies.
3624
3625           This diagnostic should throw an exception, but can't. So it sets $@
3626           instead, allowing you to trap the error manually if you wish.
3627
3628       "Fatal error: <objrule: %s> returned a non-hash-based object"
3629           An <objrule:> was specified and returned a blessed object that
3630           wasn't a hash. This will break the behaviour of the grammar, so the
3631           module immediately reports the problem and gives up.
3632
           The solution is to use only hash-based classes with <objrule:>.
3634
3635       "Can't match against <grammar: %s>"
3636           The regex you attempted to match against defined a pure grammar,
3637           using the "<grammar:...>" directive. Pure grammars have no start-
3638           pattern and hence cannot be matched against directly.
3639
3640           You need to define a matchable grammar that inherits from your pure
3641           grammar and then calls one of its rules. For example, instead of:
3642
3643               my $greeting = qr{
3644                   <grammar: Greeting>
3645
                   <rule: greet>
                       Hi there
                     | Hello
                     | Yo!
3650               }xms;
3651
3652           you need:
3653
3654               qr{
3655                   <grammar: Greeting>
3656
3657                   <rule: greet>
3658                       Hi there
3659                     | Hello
3660                     | Yo!
3661               }xms;
3662
3663               my $greeting = qr{
3664                   <extends: Greeting>
3665                   <greet>
3666               }xms;
3667
3668       "Inheritance from unknown grammar requested by <%s>"
3669           You used an "<extends:...>" directive to request that your grammar
3670           inherit from another, but the grammar you asked to inherit from
3671           doesn't exist.
3672
3673           Check the spelling of the grammar name, and that it's already been
3674           defined somewhere earlier in your program.
3675
3676       "Redeclaration of <%s> will be ignored"
3677           You defined two or more rules or tokens with the same name.  The
3678           first one defined in the grammar will be used; the rest will be
3679           ignored.
3680
3681           To get rid of the warning, get rid of the extra definitions (or, at
3682           least, comment them out or rename the rules).
3683
3684       "Possible invalid subrule call %s"
3685           Your grammar contained something of the form:
3686
3687               <identifier
3688               <.identifier
3689               <[identifier
3690
3691           which you might have intended to be a subrule call, but which
3692           didn't correctly parse as one. If it was supposed to be a
3693           Regexp::Grammars subrule call, you need to check the syntax you
3694           used. If it wasn't supposed to be a subrule call, you can silence
3695           the warning by rewriting it and quoting the leading angle:
3696
3697               \<identifier
3698               \<.identifier
3699               \<[identifier
3700
3701       "Possible failed attempt to specify a subrule call or directive: %s"
3702           Your grammar contained something of the form:
3703
3704               <identifier...
3705
3706           but which wasn't a call to a known subrule or directive. If it was
3707           supposed to be a subrule call, check the spelling of the rule name
3708           in the angles. If it was supposed to be a Regexp::Grammars
3709           directive, check the spelling of the directive name. If it wasn't
3710           supposed to be a subrule call or directive, you can silence the
3711           warning by rewriting it and quoting the leading angle:
3712
3713               \<identifier
3714
3715       "Invalid < metacharacter"
           The "<" character is always special in Regexp::Grammars regexes: it
           introduces either a subrule call, a rule/token declaration, or a
           directive.
3719
3720           If you need to match a literal '<', use "\<" in your regex.
3721
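           For instance, a pattern that matches an angle-bracketed name might
           be sketched as follows (the input string and the variable names are
           illustrative assumptions):

```perl
use strict;
use warnings;
use Regexp::Grammars;

# The leading angle of the literal text is escaped as \< so that it is
# not taken for a subrule call; <name= (...) > is a named subpattern.
my $tag = qr{
    \< <name= ([^\W\d]\w*) > \>
}xms;

my $tag_name;
if ('<title>' =~ $tag) {
    $tag_name = $/{name};       # text captured by the named subpattern
}

print "Tag name: ", (defined $tag_name ? $tag_name : '(no match)'), "\n";
```

           Only the opening angle needs the backslash; escaping the closing
           ">" as well is harmless and keeps the pair visually symmetric.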
3722       "Invalid separation specifier: %s"
3723           You used a "%" or a "%%" in the regex, but in a way that won't do
3724           what you expect. "%" and "%%" are metacharacters in
3725           Regexp::Grammars regexes, and can only be placed between a repeated
3726           atom (that matches a list of items) and a simple atom (that matches
3727           the separator between list items). See "Matching separated lists".
3728
3729           If you were using "%" or "%%" as a metacharacter, then you either
3730           forgot the repetition quantifier ("*", "+", "{0,9}", etc.) on the
3731           preceding list-matching atom, or you specified the following
3732           separator atom as something too complex for the module to parse
3733           (for example, a set of parens with nested subrule calls).
3734
           On the other hand, if you were intending to match a literal "%" or
           "%%" within a Regexp::Grammars regex, then you must explicitly
           specify it as being a literal by quotemeta'ing it, like so: "\%" or
           "\%\%".
3739
3740       "Repeated subrule %s will only capture its final match"
           You specified a subrule call with a repetition quantifier, such as:
3742
3743               <ListElem>*
3744
3745           or:
3746
3747               <ListElem>+
3748
3749           Because each subrule call saves its result in a hash entry of the
3750           same name, each repeated match will overwrite the previous ones, so
3751           only the last match will ultimately be saved. If you want to save
3752           all the matches, you need to tell Regexp::Grammars to save the
3753           sequence of results as a nested array within the hash entry, like
3754           so:
3755
3756               <[ListElem]>*
3757
3758           or:
3759
3760               <[ListElem]>+
3761
3762           If you really did intend to throw away every result but the final
3763           one, you can silence the warning by placing the subrule call inside
3764           any kind of parentheses. For example:
3765
3766               (<ListElem>)*
3767
3768           or:
3769
3770               (?: <ListElem> )+
3771
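           The difference between the two forms can be seen in a sketch like
           the following, where the Digit token and the input are assumptions
           made purely for illustration:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# List-capturing call: every repetition is saved.
my $keep_all = qr{
    \A <[Digit]>+ \z

    <token: Digit>
        \d
}xms;

# Parenthesized scalar call: each repetition overwrites the previous
# one, and the parentheses silence the warning.
my $keep_last = qr{
    \A (?: <Digit> )+ \z

    <token: Digit>
        \d
}xms;

my ($all, $last);
if ('123' =~ $keep_all)  { $all  = $/{Digit}; }   # array reference
if ('123' =~ $keep_last) { $last = $/{Digit}; }   # single scalar

print "All matches: @{ $all || [] }\n";
print "Last match:  ", (defined $last ? $last : ''), "\n";
```

           With the list form, $/{Digit} should hold a reference to an array
           of all the matches; with the parenthesized scalar form it should
           hold only the final one.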
3772       "Unable to open log file '$filename' (%s)"
3773           You specified a "<logfile:...>" directive but the file whose name
3774           you specified could not be opened for writing (for the reason given
3775           in the parens).
3776
3777           Did you misspell the filename, or get the permissions wrong
3778           somewhere in the filepath?
3779
3780       "Non-backtracking subrule %s may not revert correctly during
3781       backtracking"
3782           Because of inherent limitations in the Perl regex engine, non-
3783           backtracking constructs like "++", "*+", "?+", and "(?>...)" do not
3784           always work correctly when applied to subrule calls, especially in
3785           earlier versions of Perl.
3786
3787           If the grammar doesn't work properly, replace the offending
3788           constructs with regular backtracking versions instead. If the
3789           grammar does work, you can silence the warning by enclosing the
3790           subrule call in any kind of parentheses. For example, change:
3791
3792               <[ListElem]>++
3793
3794           to:
3795
3796               (?: <[ListElem]> )++
3797
3798       "Unexpected item before first subrule specification in definition of
3799       <grammar: %s>"
           Named grammar definitions must consist only of rule and token
           definitions. They cannot have patterns before the first
           definition. You had some kind of pattern before the first
           definition, which will be completely ignored within the grammar.
3804
3805           To silence the warning, either comment out or delete whatever is
3806           before the first rule/token definition.
3807
3808       "No main regex specified before rule definitions"
3809           You specified an unnamed grammar (i.e. no "<grammar:...>"
3810           directive), but didn't specify anything for it to actually match,
3811           just some rules that you don't actually call. For example:
3812
3813               my $grammar = qr{
3814
3815                   <rule: list>    \( <item> +% [,] \)
3816
3817                   <token: item>   <list> | \d+
3818               }x;
3819
3820           You have to provide something before the first rule to start the
3821           matching off. For example:
3822
3823               my $grammar = qr{
3824
3825                   <list>   # <--- This tells the grammar how to start matching
3826
3827                   <rule: list>    \( <item> +% [,] \)
3828
3829                   <token: item>   <list> | \d+
3830               }x;
3831
3832       "Ignoring useless empty <ws:> directive"
3833           The "<ws:...>" directive specifies what whitespace matches within
3834           the current rule. An empty "<ws:>" directive would cause whitespace
3835           to match nothing at all, which is what happens in a token
3836           definition, not in a rule definition.
3837
3838           Either put some subpattern inside the empty "<ws:...>" or, if you
3839           really do want whitespace to match nothing at all, remove the
3840           directive completely and change the rule definition to a token
3841           definition.
3842
3843       "Ignoring useless <ws: %s > directive in a token definition"
3844           The "<ws:...>" directive is used to specify what whitespace matches
3845           within a rule. Since whitespace never matches anything inside
3846           tokens, putting a "<ws:...>" directive in a token is a waste of
3847           time.
3848
3849           Either remove the useless directive, or else change the surrounding
3850           token definition to a rule definition.
3851
3852       "Quantifier that doesn't quantify anything: <%s>"
3853           You specified a rule or token something like:
3854
3855               <token: star>  *
3856
3857           or:
3858
3859               <rule: add_op>  plus | add | +
3860
3861           but the "*" and "+" in those examples are both regex meta-
3862           operators: quantifiers that usually cause what precedes them to
3863           match repeatedly.  In these cases however, nothing is preceding the
3864           quantifier, so it's a Perl syntax error.
3865
3866           You almost certainly need to escape the meta-characters in some
3867           way.  For example:
3868
3869               <token: star>  \*
3870
3871               <rule: add_op>  plus | add | [+]
3872

CONFIGURATION AND ENVIRONMENT

3874       Regexp::Grammars requires no configuration files or environment
3875       variables.
3876

DEPENDENCIES

3878       This module only works under Perl 5.10 or later.
3879

INCOMPATIBILITIES

3881       This module is likely to be incompatible with any other module that
3882       automagically rewrites regexes. For example it may conflict with
3883       Regexp::DefaultFlags, Regexp::DeferredExecution, or Regexp::Extended.
3884

BUGS

3886       No bugs have been reported.
3887
3888       Please report any bugs or feature requests to
3889       "bug-regexp-grammars@rt.cpan.org", or through the web interface at
3890       <http://rt.cpan.org>.
3891

AUTHOR

3893       Damian Conway  "<DCONWAY@CPAN.org>"
3894

LICENCE AND COPYRIGHT

3896       Copyright (c) 2009, Damian Conway "<DCONWAY@CPAN.org>". All rights
3897       reserved.
3898
3899       This module is free software; you can redistribute it and/or modify it
3900       under the same terms as Perl itself. See perlartistic.
3901

DISCLAIMER OF WARRANTY

3903       BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
3904       FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT
3905       WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER
3906       PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND,
3907       EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
3908       WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
3909       ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH
3910       YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL
3911       NECESSARY SERVICING, REPAIR, OR CORRECTION.
3912
3913       IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
3914       WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
3915       REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE
3916       TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR
3917       CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
3918       SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
3919       RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
3920       FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
3921       SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
3922       DAMAGES.
3923
3924
3925
3926perl v5.34.0                      2022-01-21               Regexp::Grammars(3)