Regexp::Grammars(3)   User Contributed Perl Documentation   Regexp::Grammars(3)

Regexp::Grammars - Add grammatical parsing features to Perl 5.10 regexes

This document describes Regexp::Grammars version 1.057

    use Regexp::Grammars;

    my $parser = qr{
        (?:
            <Verb>               # Parse and save a Verb in a scalar
            <.ws>                # Parse but don't save whitespace
            <Noun>               # Parse and save a Noun in a scalar

            <type=(?{ rand > 0.5 ? 'VN' : 'VerbNoun' })>
                                 # Save result of expression in a scalar
        |
            (?:
                <[Noun]>         # Parse a Noun and save result in a list
                                 #   (saved under the key 'Noun')
                <[PostNoun=ws]>  # Parse whitespace, save it in a list
                                 #   (saved under the key 'PostNoun')
            )+

            <Verb>               # Parse a Verb and save result in a scalar
                                 #   (saved under the key 'Verb')

            <type=(?{ 'VN' })>   # Save a literal in a scalar
        |
            <debug: match>       # Turn on the integrated debugger here
            <.Cmd= (?: mv? )>    # Parse but don't capture a subpattern
                                 #   (name it 'Cmd' for debugging purposes)
            <[File]>+            # Parse 1+ Files and save them in a list
                                 #   (saved under the key 'File')
            <debug: off>         # Turn off the integrated debugger here
            <Dest=File>          # Parse a File and save it in a scalar
                                 #   (saved under the key 'Dest')
        )

        ################################################################

        <token: File>              # Define a subrule named File
            <.ws>                  #  - Parse but don't capture whitespace
            <MATCH= ([\w-]+) >     #  - Parse the subpattern and capture
                                   #    matched text as the result of the
                                   #    subrule

        <token: Noun>              # Define a subrule named Noun
            cat | dog | fish       #  - Match an alternative (as usual)

        <rule: Verb>               # Define a whitespace-sensitive subrule
            eats                   #  - Match a literal (after any space)
            <Object=Noun>?         #  - Parse optional subrule Noun and
                                   #    save result under the key 'Object'
        |                          #  Or else...
            <AUX>                  #  - Parse subrule AUX and save result
            <part= (eaten|seen) >  #  - Match a literal, save under 'part'

        <token: AUX>               # Define a whitespace-insensitive subrule
            (has | is)             #  - Match an alternative and capture
            (?{ $MATCH = uc $^N }) #  - Use captured text as subrule result

    }x;

    # Match the grammar against some text...
    if ($text =~ $parser) {
        # If successful, the hash %/ will have the hierarchy of results...
        process_data_in( %/ );
    }
76
In your program...
    use Regexp::Grammars;       Allow enhanced regexes in lexical scope
    %/                          Result-hash for successful grammar match

Defining and using named grammars...
    <grammar: GRAMMARNAME>      Define a named grammar that can be inherited
    <extends: GRAMMARNAME>      Current grammar inherits named grammar's rules

Defining rules in your grammar...
    <rule: RULENAME>            Define rule with magic whitespace
    <token: RULENAME>           Define rule without magic whitespace

    <objrule: CLASS= NAME>      Define rule that blesses return-hash into class
    <objtoken: CLASS= NAME>     Define token that blesses return-hash into class

    <objrule: CLASS>            Shortcut for above (rule name derived from class)
    <objtoken: CLASS>           Shortcut for above (token name derived from class)

Matching rules in your grammar...
    <RULENAME>                  Call named subrule (may be fully qualified)
                                save result to $MATCH{RULENAME}

    <RULENAME(...)>             Call named subrule, passing args to it

    <!RULENAME>                 Call subrule and fail if it matches
    <!RULENAME(...)>            (shorthand for (?!<.RULENAME>) )

    <:IDENT>                    Match contents of $ARG{IDENT} as a pattern
    <\:IDENT>                   Match contents of $ARG{IDENT} as a literal
    </:IDENT>                   Match closing delimiter for $ARG{IDENT}

    <%HASH>                     Match longest possible key of hash
    <%HASH {PAT}>               Match any key of hash that also matches PAT

    </IDENT>                    Match closing delimiter for $MATCH{IDENT}
    <\_IDENT>                   Match the literal contents of $MATCH{IDENT}

    <ALIAS= RULENAME>           Call subrule, save result in $MATCH{ALIAS}
    <ALIAS= %HASH>              Match a hash key, save key in $MATCH{ALIAS}
    <ALIAS= ( PATTERN )>        Match pattern, save match in $MATCH{ALIAS}
    <ALIAS= (?{ CODE })>        Execute code, save value in $MATCH{ALIAS}
    <ALIAS= 'STR' >             Save specified string in $MATCH{ALIAS}
    <ALIAS= 42 >                Save specified number in $MATCH{ALIAS}
    <ALIAS= /IDENT>             Match closing delim, save as $MATCH{ALIAS}
    <ALIAS= \_IDENT>            Match '$MATCH{IDENT}', save as $MATCH{ALIAS}

    <.SUBRULE>                  Call subrule (one of the above forms),
                                but don't save the result in %MATCH

    <[SUBRULE]>                 Call subrule (one of the above forms), but
                                append result instead of overwriting it

    <SUBRULE1>+ % <SUBRULE2>    Match one or more repetitions of SUBRULE1
                                as long as they're separated by SUBRULE2
    <SUBRULE1> ** <SUBRULE2>    Same (only for backwards compatibility)

    <SUBRULE1>* % <SUBRULE2>    Match zero or more repetitions of SUBRULE1
                                as long as they're separated by SUBRULE2

    <SUBRULE1>* %% <SUBRULE2>   Match zero or more repetitions of SUBRULE1
                                as long as they're separated by SUBRULE2
                                and allow an optional trailing SUBRULE2

In your grammar's code blocks...
    $CAPTURE                    Alias for $^N (the most recent paren capture)
    $CONTEXT                    Another alias for $^N
    $INDEX                      Current index of next matching position in string
    %MATCH                      Current rule's result-hash
    $MATCH                      Magic override value (returned instead of result-hash)
    %ARG                        Current rule's argument hash
    $DEBUG                      Current match-time debugging mode

Directives...
    <require: (?{ CODE }) >     Fail if code evaluates false
    <timeout: INT >             Fail after specified number of seconds
    <debug: COMMAND >           Change match-time debugging mode
    <logfile: LOGFILE >         Change debugging log file (default: STDERR)
    <fatal: TEXT|(?{CODE})>     Queue error message and fail parse
    <error: TEXT|(?{CODE})>     Queue error message and backtrack
    <warning: TEXT|(?{CODE})>   Queue warning message and continue
    <log: TEXT|(?{CODE})>       Explicitly add a message to debugging log
    <ws: PATTERN >              Override automatic whitespace matching
    <minimize:>                 Simplify the result of a subrule match
    <context:>                  Switch on context substring retention
    <nocontext:>                Switch off context substring retention
164
166 This module adds a small number of new regex constructs that can be
167 used within Perl 5.10 patterns to implement complete recursive-descent
168 parsing.
169
170 Perl 5.10 already supports recursive-descent matching, via the new
171 "(?<name>...)" and "(?&name)" constructs. For example, here is a simple
172 matcher for a subset of the LaTeX markup language:
173
174 $matcher = qr{
175 (?&File)
176
177 (?(DEFINE)
178 (?<File> (?&Element)* )
179
180 (?<Element> \s* (?&Command)
181 | \s* (?&Literal)
182 )
183
184 (?<Command> \\ \s* (?&Literal) \s* (?&Options)? \s* (?&Args)? )
185
186 (?<Options> \[ \s* (?:(?&Option) (?:\s*,\s* (?&Option) )*)? \s* \])
187
188 (?<Args> \{ \s* (?&Element)* \s* \} )
189
190 (?<Option> \s* [^][\$&%#_{}~^\s,]+ )
191
192 (?<Literal> \s* [^][\$&%#_{}~^\s]+ )
193 )
194 }xms
195
196 This technique makes it possible to use regexes to recognize complex,
197 hierarchical--and even recursive--textual structures. The problem is
198 that Perl 5.10 doesn't provide any support for extracting that
199 hierarchical data into nested data structures. In other words, using
200 Perl 5.10 you can match complex data, but not parse it into an
201 internally useful form.
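
To make that limitation concrete, here is a minimal sketch that uses only the core Perl 5.10 constructs mentioned above. The grammar itself, and the names $list_matcher, List, Item, and Atom, are illustrative assumptions rather than anything supplied by the module. The pattern recognizes a nested list, yet nothing about the nesting is available once the match is over:

```perl
use strict;
use warnings;

# A recognizer for nested, comma-separated lists such as "(a, (b, c), d)".
# All rule names here are hypothetical and chosen only for illustration.
my $list_matcher = qr{
    \A (?&List) \z

    (?(DEFINE)
        (?<List>  \( \s* (?&Item) (?: \s* , \s* (?&Item) )* \s* \)  )
        (?<Item>  (?&List) | (?&Atom)                                )
        (?<Atom>  \w+                                                )
    )
}xms;

my $input   = '(a, (b, c), d)';
my $matched = ($input =~ $list_matcher) ? 1 : 0;

# The named groups were only ever entered via (?&name) calls, so none of
# them holds a capture afterwards: the structure was matched, not parsed.
my @captured = keys %+;

printf "matched: %s, named captures retained: %d\n",
    ($matched ? 'yes' : 'no'), scalar @captured;
```

The match succeeds, but the program is left knowing only that the text was well-formed.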
202
203 An additional problem when using Perl 5.10 regexes to match complex
204 data formats is that you have to make sure you remember to insert
205 whitespace-matching constructs (such as "\s*") at every possible
206 position where the data might contain ignorable whitespace. This
207 reduces the readability of such patterns, and increases the chance of
208 errors (typically caused by overlooking a location where whitespace
209 might appear).
210
211 The Regexp::Grammars module solves both those problems.
212
213 If you import the module into a particular lexical scope, it
214 preprocesses any regex in that scope, so as to implement a number of
215 extensions to the standard Perl 5.10 regex syntax. These extensions
216 simplify the task of defining and calling subrules within a grammar,
and allow those subrule calls to capture and retain the components
they match in a proper hierarchical manner.
219
220 For example, the above LaTeX matcher could be converted to a full LaTeX
221 parser (and considerably tidied up at the same time), like so:
222
223 use Regexp::Grammars;
224 $parser = qr{
225 <File>
226
227 <rule: File> <[Element]>*
228
229 <rule: Element> <Command> | <Literal>
230
231 <rule: Command> \\ <Literal> <Options>? <Args>?
232
233 <rule: Options> \[ <[Option]>+ % (,) \]
234
235 <rule: Args> \{ <[Element]>* \}
236
237 <rule: Option> [^][\$&%#_{}~^\s,]+
238
239 <rule: Literal> [^][\$&%#_{}~^\s]+
240 }xms
241
242 Note that there is no need to explicitly place "\s*" subpatterns
243 throughout the rules; that is taken care of automatically.
244
245 If the Regexp::Grammars version of this regex were successfully matched
246 against some appropriate LaTeX document, each rule would call the
247 subrules specified within it, and then return a hash containing
248 whatever result each of those subrules returned, with each result
249 indexed by the subrule's name.
250
251 That is, if the rule named "Command" were invoked, it would first try
252 to match a backslash, then it would call the three subrules
253 "<Literal>", "<Options>", and "<Args>" (in that sequence). If they all
254 matched successfully, the "Command" rule would then return a hash with
255 three keys: 'Literal', 'Options', and 'Args'. The value for each of
256 those hash entries would be whatever result-hash the subrules
257 themselves had returned when matched.
258
259 In this way, each level of the hierarchical regex can generate hashes
260 recording everything its own subrules matched, so when the entire
261 pattern matches, it produces a tree of nested hashes that represent the
262 structured data the pattern matched.
263
264 For example, if the previous regex grammar were matched against a
265 string containing:
266
267 \documentclass[a4paper,11pt]{article}
268 \author{D. Conway}
269
270 it would automatically extract a data structure equivalent to the
271 following (but with several extra "empty" keys, which are described in
272 "Subrule results"):
273
    {
        'File' => {
            'Element' => [
                {
                    'Command' => {
                        'Literal' => 'documentclass',
                        'Options' => {
                            'Option' => [ 'a4paper', '11pt' ],
                        },
                        'Args' => {
                            'Element' => [
                                {
                                    'Literal' => 'article',
                                }
                            ]
                        }
                    }
                },
                {
                    'Command' => {
                        'Literal' => 'author',
                        'Args' => {
                            'Element' => [
                                {
                                    'Literal' => 'D.',
                                },
                                {
                                    'Literal' => 'Conway',
                                }
                            ]
                        }
                    }
                }
            ]
        }
    }
306
307 The data structure that Regexp::Grammars produces from a regex match is
308 available to the surrounding program in the magic variable "%/".
309
310 Regexp::Grammars provides many features that simplify the extraction of
311 hierarchical data via a regex match, and also some features that can
312 simplify the processing of that data once it has been extracted. The
313 following sections explain each of those features, and some of the
314 parsing techniques they support.
315
316 Setting up the module
317 Just add:
318
319 use Regexp::Grammars;
320
321 to any lexical scope. Any regexes within that scope will automatically
322 now implement the new parsing constructs:
323
324 use Regexp::Grammars;
325
326 my $parser = qr/ regex with $extra <chocolatey> grammar bits /;
327
Note that you do not need to use the "/x" modifier when declaring a
regex grammar (though you certainly may). Even if you don't, the module
quietly adds a "/x" to every regex within the scope of its usage.
Otherwise, the default "a whitespace character matches exactly that
whitespace character" behaviour of Perl regexes would mess up your
grammar's parsing. If you need the non-"/x" behaviour, you can still
use the "(?-x)" or "(?-x:...)" directives to switch off "/x" within one
or more of your grammar's components.
336
337 Once the grammar has been processed, you can then match text against
338 the extended regexes, in the usual manner (i.e. via a "=~" match):
339
340 if ($input_text =~ $parser) {
341 ...
342 }
343
344 After a successful match, the variable "%/" will contain a series of
345 nested hashes representing the structured hierarchical data captured
346 during the parse.
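
As a minimal end-to-end sketch of those steps (the grammar, the rule names pair, key, and value, and the input string are all assumptions made up for this illustration):

```perl
use strict;
use warnings;
use Regexp::Grammars;

# A tiny, hypothetical grammar for "name = number" settings
my $parser = qr{
    <pair>

    <rule: pair>     <key> = <value>

    <token: key>     [A-Za-z_]\w*
    <token: value>   \d+
}xms;

my %result;
if ('width = 42' =~ $parser) {
    # Copy %/ straight away, since the next successful grammar match
    # will replace its contents
    %result = %/;
}

print "key:   $result{pair}{key}\n";
print "value: $result{pair}{value}\n";
```

Copying %/ into a lexical hash immediately after the match is a defensive habit rather than a requirement of the module.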
347
348 Structure of a Regexp::Grammars grammar
349 A Regexp::Grammars specification consists of a start-pattern (which may
350 include both standard Perl 5.10 regex syntax, as well as special
351 Regexp::Grammars directives), followed by one or more rule or token
352 definitions.
353
354 For example:
355
356 use Regexp::Grammars;
357 my $balanced_brackets = qr{
358
359 # Start-pattern...
360 <paren_pair> | <brace_pair>
361
362 # Rule definition...
363 <rule: paren_pair>
364 \( (?: <escape> | <paren_pair> | <brace_pair> | [^()] )* \)
365
366 # Rule definition...
367 <rule: brace_pair>
368 \{ (?: <escape> | <paren_pair> | <brace_pair> | [^{}] )* \}
369
370 # Token definition...
371 <token: escape>
372 \\ .
373 }xms;
374
375 The start-pattern at the beginning of the grammar acts like the "top"
376 token of the grammar, and must be matched completely for the grammar to
377 match.
378
379 This pattern is treated like a token for whitespace matching behaviour
380 (see "Tokens vs rules (whitespace handling)"). That is, whitespace in
381 the start-pattern is treated like whitespace in any normal Perl regex.
382
383 The rules and tokens are declarations only and they are not directly
384 matched. Instead, they act like subroutines, and are invoked by name
385 from the initial pattern (or from within a rule or token).
386
387 Each rule or token extends from the directive that introduces it up to
388 either the next rule or token directive, or (in the case of the final
389 rule or token) to the end of the grammar.
390
391 Tokens vs rules (whitespace handling)
392 The difference between a token and a rule is that a token treats any
393 whitespace within it exactly as a normal Perl regular expression would.
394 That is, a sequence of whitespace in a token is ignored if the "/x"
395 modifier is in effect, or else matches the same literal sequence of
396 whitespace characters (if "/x" is not in effect).
397
398 In a rule, most sequences of whitespace are treated as matching the
399 implicit subrule "<.ws>", which is automatically predefined to match
400 optional whitespace (i.e. "\s*").
401
The exceptions to this behaviour are whitespace before a "|", before a
code block, or before an explicit space-matcher (such as "<ws>" or
"\s"), and whitespace at the very end of the rule.
405
406 In other words, a rule such as:
407
408 <rule: sentence> <noun> <verb>
409 | <verb> <noun>
410
411 is equivalent to a token with added non-capturing whitespace matching:
412
413 <token: sentence> <.ws> <noun> <.ws> <verb>
414 | <.ws> <verb> <.ws> <noun>
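
The difference can be observed directly. The following sketch declares the same body once as a rule and once as a token; the grammars, the names $pair_as_rule and $pair_as_token, and the test inputs are assumptions invented for the illustration:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Whitespace in the body of a rule becomes an implicit <.ws> call...
my $pair_as_rule = qr{
    \A <pair> \z

    <rule: pair>     <key> = <value>

    <token: key>     [a-z]+
    <token: value>   \d+
}xms;

# ...whereas in a token (under /x) the same whitespace is simply ignored
my $pair_as_token = qr{
    \A <pair> \z

    <token: pair>    <key> = <value>

    <token: key>     [a-z]+
    <token: value>   \d+
}xms;

my %accepted;
for my $input ('width=42', 'width = 42') {
    $accepted{$input} = {
        rule  => ($input =~ $pair_as_rule  ? 1 : 0),
        token => ($input =~ $pair_as_token ? 1 : 0),
    };
}

for my $input (sort keys %accepted) {
    printf "%-12s rule: %d  token: %d\n",
        "'$input'", $accepted{$input}{rule}, $accepted{$input}{token};
}
```

Under these assumptions the rule accepts both inputs, while the token accepts only the form with no spaces around the "=".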
415
You can explicitly define a "<ws>" token to change that default
behaviour. For example, you could alter the definition of "whitespace"
to include Perlish comments, by adding an explicit "<token: ws>" (note
that the "#" must be backslashed, because under "/x" an unescaped "#"
would start a regex comment):

    <token: ws>
        (?: \s+ | \#[^\n]* )*
422
423 But be careful not to define "<ws>" as a rule, as this will lead to all
424 kinds of infinitely recursive unpleasantness.
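
A minimal sketch of such an override in use follows. The assignment grammar, its rule names, and the input text are hypothetical; only the "<token: ws>" mechanism comes from the description above:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# The explicit ws token treats "#"-to-end-of-line comments as whitespace.
# The "#" is backslashed because the grammar is compiled under /x.
my $parser = qr{
    \A <assignment> \z

    <rule: assignment>   <name> = <value>

    <token: name>    [a-z]+
    <token: value>   \d+

    <token: ws>      (?: \s+ | \#[^\n]* )*
}xms;

my $source = "count   # how many items\n    = 3";

my %assignment;
if ($source =~ $parser) {
    %assignment = %{ $/{assignment} };
}

print "name:  $assignment{name}\n";
print "value: $assignment{value}\n";
```

Because every gap in the rule now calls the redefined token, the comment between the name and the "=" is skipped without the rule itself mentioning comments at all.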
425
426 Per-rule whitespace handling
427
428 Redefining the "<ws>" token changes its behaviour throughout the entire
429 grammar, within every rule definition. Usually that's appropriate, but
430 sometimes you need finer-grained control over whitespace handling.
431
432 So Regexp::Grammars provides the "<ws:>" directive, which allows you to
433 override the implicit whitespace-matches-whitespace behaviour only
434 within the current rule.
435
436 Note that this directive does not redefine "<ws>" within the rule; it
437 simply specifies what to replace each whitespace sequence with (instead
438 of replacing each with a "<ws>" call).
439
For example, if a language allows one kind of comment between
statements and another within statements, you could parse it with:

    <rule: program>
        # One type of comment between...
        <ws: (\s++ | \# .*? \n)* >

        # ...semicolon-separated statements...
        <[statement]>+ % ( ; )


    <rule: statement>
        # Another type of comment...
        <ws: (\s*+ | \#{ .*? }\# )* >

        # ...between comma-separated commands...
        <cmd> <[arg]>+ % ( , )
457
458 Note that each directive only applies to the rule in which it is
459 specified. In every other rule in the grammar, whitespace would still
460 match the usual "<ws>" subrule.
461
462 Calling subrules
To invoke a rule to match at any point, just enclose the rule's name in
angle brackets (as in Perl 6). There must be no space between the
opening bracket and the rulename. For example:
466
467 qr{
468 file: # Match literal sequence 'f' 'i' 'l' 'e' ':'
469 <name> # Call <rule: name>
470 <options>? # Call <rule: options> (it's okay if it fails)
471
472 <rule: name>
473 # etc.
474 }x;
475
476 If you need to match a literal pattern that would otherwise look like a
477 subrule call, just backslash-escape the leading angle:
478
479 qr{
480 file: # Match literal sequence 'f' 'i' 'l' 'e' ':'
481 \<name> # Match literal sequence '<' 'n' 'a' 'm' 'e' '>'
482 <options>? # Call <rule: options> (it's okay if it fails)
483
484 <rule: name>
485 # etc.
486 }x;
487
488 Subrule results
489 If a subrule call successfully matches, the result of that match is a
490 reference to a hash. That hash reference is stored in the current
491 rule's own result-hash, under the name of the subrule that was invoked.
492 The hash will, in turn, contain the results of any more deeply nested
493 subrule calls, each stored under the name by which the nested subrule
494 was invoked.
495
496 In other words, if the rule "sentence" is defined:
497
498 <rule: sentence>
499 <noun> <verb> <object>
500
501 then successfully calling the rule:
502
503 <sentence>
504
505 causes a new hash entry at the current nesting level. That entry's key
506 will be 'sentence' and its value will be a reference to a hash, which
507 in turn will have keys: 'noun', 'verb', and 'object'.
508
509 In addition each result-hash has one extra key: the empty string. The
510 value for this key is whatever substring the entire subrule call
511 matched. This value is known as the context substring.
512
513 So, for example, a successful call to "<sentence>" might add something
514 like the following to the current result-hash:
515
516 sentence => {
517 "" => 'I saw a dog',
518 noun => 'I',
519 verb => 'saw',
520 object => {
521 "" => 'a dog',
522 article => 'a',
523 noun => 'dog',
524 },
525 }
526
527 Note, however, that if the result-hash at any level contains only the
528 empty-string key (i.e. the subrule did not call any sub-subrules or
529 save any of their nested result-hashes), then the hash is "unpacked"
530 and just the context substring itself is returned.
531
532 For example, if "<rule: sentence>" had been defined:
533
534 <rule: sentence>
535 I see dead people
536
537 then a successful call to the rule would only add:
538
539 sentence => 'I see dead people'
540
541 to the current result-hash.
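
Both behaviours (the extra empty-string key, and the unpacking of results that contain nothing else) can be seen in a small sketch. The two-word grammar and its rule names are assumptions chosen for brevity:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# A hypothetical two-word sentence grammar
my $parser = qr{
    <sentence>

    <rule: sentence>   <noun> <verb>

    <token: noun>      cat | dog
    <token: verb>      runs | sleeps
}xms;

my %sentence;
if ('dog sleeps' =~ $parser) {
    %sentence = %{ $/{sentence} };
}

# The rule's own result-hash keeps its context substring under ""...
print "context: $sentence{''}\n";

# ...whereas each token's result held nothing but its context substring,
# so it was unpacked to a plain string instead of a nested hash
print "noun is a ", (ref $sentence{noun} ? 'reference' : 'plain string'), "\n";
```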
542
543 This is a useful feature because it prevents a series of nested subrule
544 calls from producing very unwieldy data structures. For example,
545 without this automatic unpacking, even the simple earlier example:
546
547 <rule: sentence>
548 <noun> <verb> <object>
549
550 would produce something needlessly complex, such as:
551
552 sentence => {
553 "" => 'I saw a dog',
554 noun => {
555 "" => 'I',
556 },
557 verb => {
558 "" => 'saw',
559 },
560 object => {
561 "" => 'a dog',
562 article => {
563 "" => 'a',
564 },
565 noun => {
566 "" => 'dog',
567 },
568 },
569 }
570
571 Turning off the context substring
572
573 The context substring is convenient for debugging and for generating
574 error messages but, in a large grammar, or when parsing a long string,
575 the capture and storage of many nested substrings may quickly become
576 prohibitively expensive.
577
578 So Regexp::Grammars provides a directive to prevent context substrings
579 from being retained. Any rule or token that includes the directive
580 "<nocontext:>" anywhere in the rule's body will not retain any context
581 substring it matches...unless that substring would be the only entry in
582 its result hash (which only happens within objrules and objtokens).
583
584 If a "<nocontext:>" directive appears before the first rule or token
585 definition (i.e. as part of the main pattern), then the entire grammar
586 will discard all context substrings from every one of its rules and
587 tokens.
588
589 However, you can override this universal prohibition with a second
590 directive: "<context:>". If this directive appears in any rule or
591 token, that rule or token will save its context substring, even if a
592 global "<nocontext:>" is in effect.
593
594 This means that this grammar:
595
596 qr{
597 <Command>
598
599 <rule: Command>
600 <nocontext:>
601 <Keyword> <arg=(\S+)>+ % <.ws>
602
603 <token: Keyword>
604 <Move> | <Copy> | <Delete>
605
606 # etc.
607 }x
608
609 and this grammar:
610
611 qr{
612 <nocontext:>
613 <Command>
614
615 <rule: Command>
616 <Keyword> <arg=(\S+)>+ % <.ws>
617
618 <token: Keyword>
619 <context:>
620 <Move> | <Copy> | <Delete>
621
622 # etc.
623 }x
624
625 will behave identically (saving context substrings for keywords, but
626 not for commands), except that the first version will also retain the
627 global context substring (i.e. $/{""}), whereas the second version will
628 not.
629
630 Note that "<context:>" and "<nocontext:>" have no effect on, or even
631 any interaction with, the various result distillation mechanisms, which
632 continue to work in the usual way when either or both of the directives
633 is used.
634
635 Renaming subrule results
636 It is not always convenient to have subrule results stored under the
637 same name as the rule itself. Rule names should be optimized for
638 understanding the behaviour of the parser, whereas result names should
639 be optimized for understanding the structure of the data. Often those
640 two goals are identical, but not always; sometimes rule names need to
641 describe what the data looks like, while result names need to describe
642 what the data means.
643
For example, sometimes you need to call the same rule twice, to match
two syntactically identical components whose positions give them
semantically distinct meanings:
647
648 <rule: copy_cmd>
649 copy <file> <file>
650
651 The problem here is that, if the second call to "<file>" succeeds, its
652 result-hash will be stored under the key 'file', clobbering the data
653 that was returned from the first call to "<file>".
654
655 To avoid such problems, Regexp::Grammars allows you to alias any
656 subrule call, so that it is still invoked by the original name, but its
657 result-hash is stored under a different key. The syntax for that is:
658 "<alias=rulename>". For example:
659
660 <rule: copy_cmd>
661 copy <from=file> <to=file>
662
663 Here, "<rule: file>" is called twice, with the first result-hash being
664 stored under the key 'from', and the second result-hash being stored
665 under the key 'to'.
666
667 Note, however, that the alias before the "=" must be a proper
668 identifier (i.e. a letter or underscore, followed by letters, digits,
669 and/or underscores). Aliases that start with an underscore and aliases
670 named "MATCH" have special meaning (see "Private subrule calls" and
671 "Result distillation" respectively).
672
673 Aliases can also be useful for normalizing data that may appear in
674 different formats and sequences. For example:
675
676 <rule: copy_cmd>
677 copy <from=file> <to=file>
678 | dup <to=file> as <from=file>
679 | <from=file> -> <to=file>
680 | <to=file> <- <from=file>
681
682 Here, regardless of which order the old and new files are specified,
683 the result-hash always gets:
684
685 copy_cmd => {
686 from => 'oldfile',
687 to => 'newfile',
688 }
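
The following sketch exercises two of those alternatives. The definition of the file token and the sample commands are assumptions added so that the example is complete:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr{
    \A <copy_cmd> \z

    <rule: copy_cmd>
          copy <from=file> <to=file>
        | dup  <to=file> as <from=file>

    <token: file>   [\w.]+
}xms;

# Two different surface syntaxes for the same operation
my @results;
for my $command ('copy old.txt new.txt', 'dup new.txt as old.txt') {
    if ($command =~ $parser) {
        push @results, { %{ $/{copy_cmd} } };
    }
}

# Both parses are expected to normalize to the same from/to pair
print "$_->{from} -> $_->{to}\n" for @results;
```

Code that consumes the result only ever has to look at the 'from' and 'to' keys, whichever syntax the user typed.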
689
690 List-like subrule calls
691 If a subrule call is quantified with a repetition specifier:
692
693 <rule: file_sequence>
694 <file>+
695
696 then each repeated match overwrites the corresponding entry in the
697 surrounding rule's result-hash, so only the result of the final
698 repetition will be retained. That is, if the above example matched the
699 string "foo.pl bar.py baz.php", then the result-hash would contain:
700
    file_sequence => {
        "" => 'foo.pl bar.py baz.php',
        file => 'baz.php',
    }
705
706 Usually, that's not the desired outcome, so Regexp::Grammars provides
707 another mechanism by which to call a subrule; one that saves all
708 repetitions of its results.
709
710 A regular subrule call consists of the rule's name surrounded by angle
711 brackets. If, instead, you surround the rule's name with "<[...]>"
712 (angle and square brackets) like so:
713
714 <rule: file_sequence>
715 <[file]>+
716
717 then the rule is invoked in exactly the same way, but the result of
718 that submatch is pushed onto an array nested inside the appropriate
719 result-hash entry. In other words, if the above example matched the
720 same "foo.pl bar.py baz.php" string, the result-hash would contain:
721
    file_sequence => {
        "" => 'foo.pl bar.py baz.php',
        file => [ 'foo.pl', 'bar.py', 'baz.php' ],
    }
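
Here is a small runnable sketch of a listifying call. To keep the whitespace handling fully explicit, this version (an assumption of the sketch, not the form used above) declares file_sequence as a token and spells out the separating whitespace itself:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr{
    \A <file_sequence> \z

    <token: file_sequence>
        <[file]> (?: \s+ <[file]> )*

    <token: file>   [\w.]+
}xms;

my @files;
if ('foo.pl bar.py baz.php' =~ $parser) {
    # Every <[file]> call appended its result to the same nested array
    @files = @{ $/{file_sequence}{file} };
}

print "$_\n" for @files;
```

Note that the two separate "<[file]>" calls both append to the one array stored under the key 'file'.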
726
727 This "listifying subrule call" can also be useful for non-repeated
728 subrule calls, if the same subrule is invoked in several places in a
729 grammar. For example if a cmdline option could be given either one or
730 two values, you might parse it:
731
732 <rule: size_option>
733 -size <[size]> (?: x <[size]> )?
734
735 The result-hash entry for 'size' would then always contain an array,
736 with either one or two elements, depending on the input being parsed.
737
738 Listifying subrules can also be given aliases, just like ordinary
739 subrules. The alias is always specified inside the square brackets:
740
741 <rule: size_option>
742 -size <[size=pos_integer]> (?: x <[size=pos_integer]> )?
743
744 Here, the sizes are parsed using the "pos_integer" rule, but saved in
745 the result-hash in an array under the key 'size'.
746
747 Parametric subrules
When a subrule is invoked, it can be passed a set of named arguments
(specified as "key => value" pairs). This argument list is placed in a
normal Perl regex code block and must appear immediately after the
subrule name, before the closing angle bracket.
752
753 Within the subrule that has been invoked, the arguments can be accessed
754 via the special hash %ARG. For example:
755
756 <rule: block>
757 <tag>
758 <[block]>*
759 <end_tag(?{ tag=>$MATCH{tag} })> # ...call subrule with argument
760
761 <token: end_tag>
762 end_ (??{ quotemeta $ARG{tag} })
763
764 Here the "block" rule first matches a "<tag>", and the corresponding
765 substring is saved in $MATCH{tag}. It then matches any number of nested
766 blocks. Finally it invokes the "<end_tag>" subrule, passing it an
767 argument whose name is 'tag' and whose value is the current value of
768 $MATCH{tag} (i.e. the original opening tag).
769
770 When it is thus invoked, the "end_tag" token first matches 'end_', then
771 interpolates the literal value of the 'tag' argument and attempts to
772 match it.
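
A cut-down, non-recursive variant of this idea can be run as follows. The body token, the sample inputs, and the names %block and $mismatch_accepted are assumptions introduced to keep the sketch short:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr{
    \A <block> \z

    <rule: block>
        <tag> <body> <end_tag(?{ tag => $MATCH{tag} })>

    <token: tag>       [a-z]+
    <token: body>      \d+

    <token: end_tag>
        end_ (??{ quotemeta $ARG{tag} })
}xms;

# The closing tag has to repeat whatever the opening tag was
my %block;
if ('loop 42 end_loop' =~ $parser) {
    %block = %{ $/{block} };
}

# A mismatched closing tag is expected to make the whole parse fail
my $mismatch_accepted = ('loop 42 end_if' =~ $parser) ? 1 : 0;

print "tag: $block{tag}, end_tag: $block{end_tag}\n";
print "mismatched input accepted: $mismatch_accepted\n";
```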
773
774 Any number of named arguments can be passed when a subrule is invoked.
775 For example, we could generalize the "end_tag" rule to allow any prefix
776 (not just 'end_'), and also to allow for 'if...fi'-style reversed tags,
777 like so:
778
779 <rule: block>
780 <tag>
781 <[block]>*
782 <end_tag (?{ prefix=>'end', tag=>$MATCH{tag} })>
783
784 <token: end_tag>
785 (??{ $ARG{prefix} // q{(?!)} }) # ...prefix as pattern
786 (??{ quotemeta $ARG{tag} }) # ...tag as literal
787 |
788 (??{ quotemeta reverse $ARG{tag} }) # ...reversed tag
789
790 Note that, if you do not need to interpolate values (such as
791 $MATCH{tag}) into a subrule's argument list, you can use simple
792 parentheses instead of "(?{...})", like so:
793
794 <end_tag( prefix=>'end', tag=>'head' )>
795
796 The only types of values you can use in this simplified syntax are
797 numbers and single-quote-delimited strings. For anything more complex,
798 put the argument list in a full "(?{...})".
799
As the earlier examples show, the single most common type of argument
is one of the form: "IDENTIFIER => $MATCH{IDENTIFIER}". That is, it's
a common requirement to pass an element of %MATCH into a subrule, named
with its own key.
804
805 Because this is such a common usage, Regexp::Grammars provides a
806 shortcut. If you use simple parentheses (instead of "(?{...})"
807 parentheses) then instead of a pair, you can specify an argument using
808 a colon followed by an identifier. This argument is replaced by a
809 named argument whose name is the identifier and whose value is the
810 corresponding item from %MATCH. So, for example, instead of:
811
812 <end_tag(?{ prefix=>'end', tag=>$MATCH{tag} })>
813
814 you can just write:
815
816 <end_tag( prefix=>'end', :tag )>
817
818 Note that, from Perl 5.20 onwards, due to changes in the way that Perl
819 parses regexes, Regexp::Grammars does not support explicitly passing
820 elements of %MATCH as argument values within a list subrule (yeah, it's
821 a very specific and obscure edge-case):
822
823 <[end_tag(?{ prefix=>'end', tag=>$MATCH{tag} })]> # Does not work
824
825 Note, however, that the shortcut:
826
827 <[end_tag( prefix=>'end', :tag )]>
828
829 still works correctly.
830
831 Accessing subrule arguments more cleanly
832
833 As the preceding examples illustrate, using subrule arguments
834 effectively generally requires the use of run-time interpolated
835 subpatterns via the "(??{...})" construct.
836
837 This produces ugly rule bodies such as:
838
839 <token: end_tag>
840 (??{ $ARG{prefix} // q{(?!)} }) # ...prefix as pattern
841 (??{ quotemeta $ARG{tag} }) # ...tag as literal
842 |
843 (??{ quotemeta reverse $ARG{tag} }) # ...reversed tag
844
845 To simplify these common usages, Regexp::Grammars provides three
846 convenience constructs.
847
A subrule call of the form "<:identifier>" is equivalent to:
849
850 (??{ $ARG{'identifier'} // q{(?!)} })
851
852 Namely: "Match the contents of $ARG{'identifier'}, treating those
853 contents as a pattern."
854
A subrule call of the form "<\:identifier>" (that is: a matchref with
a colon after the backslash) is equivalent to:
857
858 (??{ defined $ARG{'identifier'}
859 ? quotemeta($ARG{'identifier'})
860 : '(?!)'
861 })
862
863 Namely: "Match the contents of $ARG{'identifier'}, treating those
864 contents as a literal."
865
A subrule call of the form "</:identifier>" (that is: an invertref
with a colon after the forward slash) is equivalent to:
868
869 (??{ defined $ARG{'identifier'}
870 ? quotemeta(reverse $ARG{'identifier'})
871 : '(?!)'
872 })
873
874 Namely: "Match the closing delimiter corresponding to the contents of
875 $ARG{'identifier'}, as if it were a literal".
876
The availability of these three constructs means that we could rewrite
the above "<end_tag>" token much more cleanly as:
879
880 <token: end_tag>
881 <:prefix> # ...prefix as pattern
882 <\:tag> # ...tag as a literal
883 |
884 </:tag> # ...reversed tag
885
In general these constructs mean that, within a subrule, if you want to
match an argument passed to that subrule, you use "<:ARGNAME>" (to
match the argument as a pattern) or "<\:ARGNAME>" (to match the
argument as a literal).
890
891 Note the consistent mnemonic in these various subrule-like
892 interpolations of named arguments: the name is always prefixed by a
893 colon.
894
895 In other words, the "<:ARGNAME>" form works just like a "<RULENAME>",
896 except that the leading colon tells Regexp::Grammars to use the
897 contents of $ARG{'ARGNAME'} as the subpattern, instead of the contents
of "(?&RULENAME)".
899
900 Likewise, the "<\:ARGNAME>" and "</:ARGNAME>" constructs work exactly
901 like "<\_MATCHNAME>" and "</INVERTNAME>" respectively, except that the
902 leading colon indicates that the matchref or invertref should be taken
903 from %ARG instead of from %MATCH.
904
905 Pseudo-subrules
906 Aliases can also be given to standard Perl subpatterns, as well as to
907 code blocks within a regex. The syntax for subpatterns is:
908
909 <ALIAS= (SUBPATTERN) >
910
911 In other words, the syntax is exactly like an aliased subrule call,
912 except that the rule name is replaced with a set of parentheses
913 containing the subpattern. Any parentheses--capturing or
914 non-capturing--will do.
915
916 The effect of aliasing a standard subpattern is to cause whatever that
917 subpattern matches to be saved in the result-hash, using the alias as
918 its key. For example:
919
920 <rule: file_command>
921
922 <cmd=(mv|cp|ln)> <from=file> <to=file>
923
924 Here, the "<cmd=(mv|cp|ln)>" is treated exactly like a regular
925 "(mv|cp|ln)", but whatever substring it matches is saved in the result-
926 hash under the key 'cmd'.
927
928 The syntax for aliasing code blocks is:
929
930 <ALIAS= (?{ your($code->here) }) >
931
932 Note, however, that the code block must be specified in the standard
933 Perl 5.10 regex notation: "(?{...})". A common mistake is to write:
934
<ALIAS= { your($code->here) } >
936
937 instead, which will attempt to interpolate $code before the regex is
938 even compiled, as such variables are only "protected" from
939 interpolation inside a "(?{...})".
940
941 When correctly specified, this construct executes the code in the block
942 and saves the result of that execution in the result-hash, using the
943 alias as its key. Aliased code blocks are useful for adding semantic
944 information based on which branch of a rule is executed. For example,
945 consider the "copy_cmd" alternatives shown earlier:
946
947 <rule: copy_cmd>
948 copy <from=file> <to=file>
949 | dup <to=file> as <from=file>
950 | <from=file> -> <to=file>
951 | <to=file> <- <from=file>
952
953 Using aliased code blocks, you could add an extra field to the result-
954 hash to describe which form of the command was detected, like so:
955
956 <rule: copy_cmd>
957 copy <from=file> <to=file> <type=(?{ 'std' })>
958 | dup <to=file> as <from=file> <type=(?{ 'rev' })>
959 | <from=file> -> <to=file> <type=(?{ +1 })>
960 | <to=file> <- <from=file> <type=(?{ -1 })>
961
962 Now, if the rule matched, the result-hash would contain something like:
963
964 copy_cmd => {
965 from => 'oldfile',
966 to => 'newfile',
type => 'std',
968 }
969
970 Note that, in addition to the semantics described above, aliased
971 subpatterns and code blocks also become visible to Regexp::Grammars'
972 integrated debugger (see Debugging).
973
974 Aliased literals
975 As the previous example illustrates, it is inconveniently verbose to
976 assign constants via aliased code blocks. So Regexp::Grammars provides
977 a short-cut. It is possible to directly alias a numeric literal or a
978 single-quote delimited literal string, without putting either inside a
979 code block. For example, the previous example could also be written:
980
981 <rule: copy_cmd>
982 copy <from=file> <to=file> <type='std'>
983 | dup <to=file> as <from=file> <type='rev'>
984 | <from=file> -> <to=file> <type= +1 >
985 | <to=file> <- <from=file> <type= -1 >
986
987 Note that only these two forms of literal are supported in this
988 abbreviated syntax.
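
The following is a minimal runnable sketch of aliased literals, assuming a
hypothetical "file" token (a run of word characters and dots) and a
hypothetical helper named parse_copy(). Only the first two alternatives of
the rule above are used, to keep the example short:

```perl
use strict;
use warnings;

my $parser = do {
    use Regexp::Grammars;
    qr{
        \A <copy_cmd> \z

        <rule: copy_cmd>
              copy <from=file> <to=file>      <type='std'>
            | dup <to=file> as <from=file>    <type='rev'>

        <token: file>   [\w.]+    # hypothetical filename pattern
    }xms;
};

# Hypothetical helper: returns a copy of the copy_cmd result-hash,
# or undef if the text is not a recognized copy command
sub parse_copy {
    my ($text) = @_;
    return $text =~ $parser ? { %{ $/{copy_cmd} } } : undef;
}

my $std = parse_copy('copy oldfile newfile');
my $rev = parse_copy('dup newfile as oldfile');

printf "%s: %s -> %s\n", @{$_}{qw(type from to)} for $std, $rev;
```

Both commands yield the same 'from' and 'to' entries; only the 'type' entry
records which syntactic form was seen.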
989
990 Amnesiac subrule calls
991 By default, every subrule call saves its result into the result-hash,
992 either under its own name, or under an alias.
993
994 However, sometimes you may want to refactor some literal part of a rule
995 into one or more subrules, without having those submatches added to the
996 result-hash. The syntax for calling a subrule, but ignoring its return
997 value is:
998
999 <.SUBRULE>
1000
1001 (which is stolen directly from Perl 6).
1002
1003 For example, you may prefer to rewrite a rule such as:
1004
1005 <rule: paren_pair>
1006
1007 \(
1008 (?: <escape> | <paren_pair> | <brace_pair> | [^()] )*
1009 \)
1010
1011 without any literal matching, like so:
1012
1013 <rule: paren_pair>
1014
1015 <.left_paren>
1016 (?: <escape> | <paren_pair> | <brace_pair> | <.non_paren> )*
1017 <.right_paren>
1018
1019 <token: left_paren> \(
1020 <token: right_paren> \)
1021 <token: non_paren> [^()]
1022
1023 Moreover, as the individual components inside the parentheses probably
1024 aren't being captured for any useful purpose either, you could further
1025 optimize that to:
1026
1027 <rule: paren_pair>
1028
1029 <.left_paren>
1030 (?: <.escape> | <.paren_pair> | <.brace_pair> | <.non_paren> )*
1031 <.right_paren>
1032
1033 Note that you can also use the dot modifier on an aliased subpattern:
1034
1035 <.Alias= (SUBPATTERN) >
1036
1037 This seemingly contradictory behaviour (of giving a subpattern a name,
1038 then deliberately ignoring that name) actually does make sense in one
1039 situation. Providing the alias makes the subpattern visible to the
1040 debugger, while using the dot stops it from affecting the result-hash.
1041 See "Debugging non-grammars" for an example of this usage.
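
As a small illustration of the dot modifier, the sketch below (with
hypothetical rule names "pair" and "content") parses a parenthesized string
and then lists which named keys ended up in the result-hash:

```perl
use strict;
use warnings;

my $parser = do {
    use Regexp::Grammars;
    qr{
        \A <pair> \z

        <token: pair>
            <.left_paren> <content> <.right_paren>

        <token: left_paren>    \(
        <token: right_paren>   \)
        <token: content>       [^()]*
    }xms;
};

my %pair = '(abc)' =~ $parser ? %{ $/{pair} } : ();

# Ignore the empty-string key, which holds the full text matched by the rule
my @saved = grep { length } sort keys %pair;
print "saved keys: @saved\n";
```

Only the undotted "<content>" call contributes a named entry; the two
delimiter subrules are matched but leave no trace in the result-hash.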
1042
1043 Private subrule calls
1044 If a rule name (or an alias) begins with an underscore:
1045
1046 <_RULENAME> <_ALIAS=RULENAME>
1047 <[_RULENAME]> <[_ALIAS=RULENAME]>
1048
1049 then matching proceeds as normal, and any result that is returned is
1050 stored in the current result-hash in the usual way.
1051
1052 However, when any rule finishes (and just before it returns) it first
1053 filters its result-hash, removing any entries whose keys begin with an
1054 underscore. This means that any subrule with an underscored name (or
1055 with an underscored alias) remembers its result, but only until the end
1056 of the current rule. Its results are effectively private to the current
1057 rule.
1058
1059 This is especially useful in conjunction with result distillation.
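
A minimal sketch of a private alias, assuming hypothetical rules named
"assignment", "ident", and "value". The underscored "_lhs" entry is
available to the code block while the rule is still matching, but is
filtered out before the rule returns:

```perl
use strict;
use warnings;

my $parser = do {
    use Regexp::Grammars;
    qr{
        \A <assignment> \z

        <rule: assignment>
            <_lhs=ident> = <value>
            <name=(?{ uc $MATCH{_lhs} })>

        <token: ident>   [a-z]\w*
        <token: value>   \d+
    }xms;
};

my %assignment = 'width = 42' =~ $parser ? %{ $/{assignment} } : ();

# Ignore the empty-string key, which holds the full text matched by the rule
print join(', ', grep { length } sort keys %assignment), "\n";
```

The "name" entry is derived from the private "_lhs" entry, so the
intermediate value is used without leaking into the final result.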
1060
1061 Lookahead (zero-width) subrules
1062 Non-capturing subrule calls can be used in normal lookaheads:
1063
1064 <rule: qualified_typename>
1065 # A valid typename and has a :: in it...
1066 (?= <.typename> ) [^\s:]+ :: \S+
1067
1068 <rule: identifier>
1069 # An alpha followed by alnums (but not a valid typename)...
1070 (?! <.typename> ) [^\W\d]\w*
1071
1072 but the syntax is a little unwieldy. More importantly, an internal
1073 problem with backtracking causes positive lookaheads to mess up the
1074 module's named capturing mechanism.
1075
1076 So Regexp::Grammars provides two shorthands:
1077
1078 <!typename> same as: (?! <.typename> )
1079 <?typename> same as: (?= <.typename> ) ...but works correctly!
1080
1081 These two constructs can also be called with arguments, if necessary:
1082
1083 <rule: Command>
1084 <Keyword>
1085 (?:
1086 <!Terminator(:Keyword)> <Args=(\S+)>
1087 )?
1088 <Terminator(:Keyword)>
1089
Note that, as the above equivalences imply, neither of these forms of a
subrule call ever captures what it matches.
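
The negative lookahead shorthand can be sketched as follows, assuming a
hypothetical "keyword" token and a hypothetical helper named
is_identifier(). An identifier is accepted only if no keyword matches at
the same position:

```perl
use strict;
use warnings;

my $parser = do {
    use Regexp::Grammars;
    qr{
        \A <identifier> \z

        <token: identifier>   <!keyword> [^\W\d]\w*
        <token: keyword>      (?: if | while | for ) \b
    }xms;
};

# Hypothetical helper: true if the whole string is a non-keyword identifier
sub is_identifier {
    my ($text) = @_;
    return $text =~ $parser ? 1 : 0;
}

printf "%-6s => %d\n", $_, is_identifier($_) for qw(whilst while form for);
```

Because "<!keyword>" never captures, the "identifier" token still returns
just its matched text rather than a hash.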
1092
1093 Matching separated lists
1094 One of the commonest tasks in text parsing is to match a list of
1095 unspecified length, in which items are separated by a fixed token.
1096 Things like:
1097
1098 1, 2, 3 , 4 ,13, 91 # Numbers separated by commas and spaces
1099
1100 g-c-a-g-t-t-a-c-a # DNA bases separated by dashes
1101
1102 /usr/local/bin # Names separated by directory markers
1103
1104 /usr:/usr/local:bin # Directories separated by colons
1105
1106 The usual construct required to parse these kinds of structures is
1107 either:
1108
1109 <rule: list>
1110
1111 <item> <separator> <list> # recursive definition
1112 | <item> # base case
1113
1114 or, if you want to allow zero-or-more items instead of requiring one-
1115 or-more:
1116
1117 <rule: list_opt>
1118 <list>? # entire list may be missing
1119
1120 <rule: list> # as before...
1121 <item> <separator> <list> # recursive definition
1122 | <item> # base case
1123
1124 Or, more efficiently, but less prettily:
1125
1126 <rule: list>
1127 <[item]> (?: <separator> <[item]> )* # one-or-more
1128
1129 <rule: list_opt>
1130 (?: <[item]> (?: <separator> <[item]> )* )? # zero-or-more
1131
1132 Because separated lists are such a common component of grammars,
1133 Regexp::Grammars provides cleaner ways to specify them:
1134
1135 <rule: list>
1136 <[item]>+ % <separator> # one-or-more
1137
1138 <rule: list_zom>
1139 <[item]>* % <separator> # zero-or-more
1140
1141 Note that these are just regular repetition qualifiers (i.e. "+" and
1142 "*") applied to a subrule ("<[item]>"), with a "%" modifier after them
1143 to specify the required separator between the repeated matches.
1144
1145 The number of repetitions matched is controlled both by the nature of
1146 the qualifier ("+" vs "*") and by the subrule specified after the "%".
1147 The qualified subrule will be repeatedly matched for as long as its
1148 qualifier allows, provided that the second subrule also matches between
1149 those repetitions.
1150
1151 For example, you can match a parenthesized sequence of one-or-more
1152 numbers separated by commas, such as:
1153
1154 (1, 2, 3, 4, 13, 91) # Numbers separated by commas (and spaces)
1155
1156 with:
1157
1158 <rule: number_list>
1159
1160 \( <[number]>+ % <comma> \)
1161
1162 <token: number> \d+
1163 <token: comma> ,
1164
1165 Note that any spaces round the commas will be ignored because
1166 "<number_list>" is specified as a rule and the "+%" specifier has
1167 spaces within and around it. To disallow spaces around the commas, make
1168 sure there are no spaces in or around the "+%":
1169
1170 <rule: number_list_no_spaces>
1171
1172 \( <[number]>+%<comma> \)
1173
1174 (or else specify the rule as a token instead).
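
Putting the "number_list" rule above into a complete program gives a sketch
like the following (the anchors and variable names are illustrative
choices):

```perl
use strict;
use warnings;

my $parser = do {
    use Regexp::Grammars;
    qr{
        \A <number_list> \z

        <rule: number_list>
            \( <[number]>+ % <comma> \)

        <token: number>   \d+
        <token: comma>    ,
    }xms;
};

my $text    = '(1, 2, 3 , 4 ,13, 91)';
my @numbers = $text =~ $parser ? @{ $/{number_list}{number} } : ();

print "@numbers\n";
```

Each repetition of "<[number]>" appends to the array stored under the
'number' key, while the separators are matched but play no part in that
list.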
1175
1176 Because the "%" is a modifier applied to a qualifier, you can modify
1177 any other repetition qualifier in the same way. For example:
1178
1179 <[item]>{2,4} % <sep> # two-to-four items, separated
1180
1181 <[item]>{7} % <sep> # exactly 7 items, separated
1182
1183 <[item]>{10,}? % <sep> # minimum of 10 or more items, separated
1184
1185 You can even do this:
1186
1187 <[item]>? % <sep> # one-or-zero items, (theoretically) separated
1188
1189 though the separator specification is, of course, meaningless in that
1190 case as it will never be needed to separate a maximum of one item.
1191
1192 Within a Regexp::Grammars regex a simple "%" is always metasyntax, so
1193 it cannot be used to match a literal '%'. Any attempt to do so is
1194 immediately fatal when the regex is compiled:
1195
1196 <token: percentage>
1197 \d{1,3} % # Fatal. Will not match "7%", "100%", etc.
1198
1199 <token: perl_hash>
1200 % <ident> # Fatal. Will not match "%foo", "%bar", etc.
1201
1202 <token: perl_mod>
1203 <expr> % <expr> # Fatal. Will not match "$n % 2", etc.
1204
1205 If you need to match a literal "%" immediately after a repetition,
1206 quote it with a backslash:
1207
1208 <token: percentage>
1209 \d{1,3} \% # Okay. Will match "7%", "100%", etc.
1210
1211 <token: perl_hash>
1212 \% <ident> # Okay. Will match "%foo", "%bar", etc.
1213
1214 <token: perl_mod>
1215 <expr> \% <expr> # Okay. Will match "$n % 2", etc.
1216
1217 Note that it's usually necessary to use the "<[...]>" form for the
1218 repeated items being matched, so that all of them are saved in the
1219 result hash. You can also save all the separators (if they're
1220 important) by specifying them as a list-like subrule too:
1221
1222 \( <[number]>* % <[comma]> \) # save numbers *and* separators
1223
1224 The repeated item must be specified as a subrule call of some kind
1225 (i.e. in angles), but the separators may be specified either as a
1226 subrule or as a raw bracketed pattern (i.e. brackets without any nested
1227 subrule calls). For example:
1228
1229 <[number]>* % ( , | : ) # Numbers separated by commas or colons
1230
1231 <[number]>* % [,:] # Same, but more efficiently matched
1232
1233 The separator should always be specified within matched delimiters of
1234 some kind: either matching "<...>" or matching "(...)" or matching
1235 "[...]". Simple, non-bracketed separators will sometimes also work:
1236
1237 <[number]>+ % ,
1238
1239 but not always:
1240
1241 <[number]>+ % ,\s+ # Oops! Separator is just: ,
1242
1243 This is because of the limited way in which the module internally
1244 parses ordinary regex components (i.e. without full understanding of
1245 their implicit precedence). As a consequence, consistently placing
1246 brackets around any separator is a much safer approach:
1247
1248 <[number]>+ % (,\s+)
1249
1250 You can also use a simple pattern on the left of the "%" as the item
1251 matcher, but in this case it must always be aliased into a list-
1252 collecting subrule, like so:
1253
1254 <[item=(\d+)]>* % [,]
1255
1256 Note that, for backwards compatibility with earlier versions of
1257 Regexp::Grammars, the "+%" operator can also be written: "**".
1258 However, there can be no space between the two asterisks of this
1259 variant. That is:
1260
<[item]> ** <sep> # same as <[item]>+ % <sep>
1262
1263 <[item]>* * <sep> # error (two * qualifiers in a row)
1264
1265 Matching separated lists with a trailing separator
1266
1267 Some languages allow a separated list to include an extra trailing
1268 separator. For example:
1269
1270 ~/bin/perl5/ # Trailing /-separator in filepath
1271 (1,2,3,) # Trailing ,-separator in Perl list
1272
1273 To match such constructs using the "%" operator, you would need to add
1274 something to explicitly match the optional trailing separator:
1275
1276 <dir>+ % [/] [/]? # Slash-separated dirs, then optional final slash
1277
1278 <elem>+ % [,] [,]? # Comma-separated elems, then optional final comma
1279
1280 which is tedious.
1281
1282 So the module also supports a second kind of "separated list" operator,
1283 that allows an optional trailing separator as well: the "%%" operator.
This operator behaves exactly like the "%" operator, except that it
1285 also matches a final trailing separator, if one is present.
1286
1287 So the previous examples could be (better) written as:
1288
1289 <dir>+ %% [/] # Slash-separated dirs, with optional final slash
1290
1291 <elem>+ %% [,] # Comma-separated elems, with optional final comma
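
A short sketch of the "%%" operator, assuming hypothetical "list" and
"elem" tokens and a hypothetical helper named elems(). The same grammar
accepts a list with or without the trailing comma:

```perl
use strict;
use warnings;

my $parser = do {
    use Regexp::Grammars;
    qr{
        \A <list> \z

        <token: list>   \( <[elem]>+ %% [,] \)
        <token: elem>   \d+
    }xms;
};

# Hypothetical helper: returns the matched elements, or an empty list
sub elems {
    my ($text) = @_;
    return $text =~ $parser ? @{ $/{list}{elem} } : ();
}

my @with    = elems('(1,2,3,)');
my @without = elems('(1,2,3)');

print "with trailing comma:    @with\n";
print "without trailing comma: @without\n";
```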
1292
1293 Matching hash keys
1294 In some situations a grammar may need a rule that matches dozens,
1295 hundreds, or even thousands of one-word alternatives. For example, when
1296 matching command names, or valid userids, or English words. In such
1297 cases it is often impractical (and always inefficient) to list all the
1298 alternatives between "|" alternators:
1299
1300 <rule: shell_cmd>
1301 a2p | ac | apply | ar | automake | awk | ...
1302 # ...and 400 lines later
1303 ... | zdiff | zgrep | zip | zmore | zsh
1304
1305 <rule: valid_word>
1306 a | aa | aal | aalii | aam | aardvark | aardwolf | aba | ...
1307 # ...and 40,000 lines later...
1308 ... | zymotize | zymotoxic | zymurgy | zythem | zythum
1309
1310 To simplify such cases, Regexp::Grammars provides a special construct
1311 that allows you to specify all the alternatives as the keys of a normal
1312 hash. The syntax for that construct is simply to put the hash name
1313 inside angle brackets (with no space between the angles and the hash
1314 name).
1315
1316 Which means that the rules in the previous example could also be
1317 written:
1318
1319 <rule: shell_cmd>
1320 <%cmds>
1321
1322 <rule: valid_word>
1323 <%dict>
1324
1325 provided that the two hashes (%cmds and %dict) are visible in the scope
1326 where the grammar is created.
1327
1328 Matching a hash key in this way is typically significantly faster than
1329 matching a large set of alternations. Specifically, it is O(length of
1330 longest potential key) ^ 2, instead of O(number of keys).
1331
1332 Internally, the construct is converted to something equivalent to:
1333
1334 <rule: shell_cmd>
1335 (<.hk>) <require: (?{ exists $cmds{$CAPTURE} })>
1336
1337 <rule: valid_word>
1338 (<.hk>) <require: (?{ exists $dict{$CAPTURE} })>
1339
1340 The special "<hk>" rule is created automatically, and defaults to
1341 "\S+", but you can also define it explicitly to handle other kinds of
1342 keys. For example:
1343
1344 <rule: hk>
1345 [^\n]+ # Key may be any number of chars on a single line
1346
1347 <rule: hk>
1348 [ACGT]{10,} # Key is a base sequence of at least 10 pairs
1349
1350 Alternatively, you can specify a different key-matching pattern for
1351 each hash you're matching, by placing the required pattern in braces
1352 immediately after the hash name. For example:
1353
1354 <rule: client_name>
1355 # Valid keys match <.hk> (default or explicitly specified)
1356 <%clients>
1357
1358 <rule: shell_cmd>
1359 # Valid keys contain only word chars, hyphen, slash, or dot...
1360 <%cmds { [\w-/.]+ }>
1361
1362 <rule: valid_word>
1363 # Valid keys contain only alphas or internal hyphen or apostrophe...
1364 <%dict{ (?i: (?:[a-z]+[-'])* [a-z]+ ) }>
1365
1366 <rule: DNA_sequence>
1367 # Valid keys are base sequences of at least 10 pairs...
1368 <%sequences{[ACGT]{10,}}>
1369
1370 This second approach to key-matching is preferred, because it localizes
1371 any non-standard key-matching behaviour to each individual hash.
1372
1373 Note that changes in the compilation process from Perl 5.18 onwards
1374 mean that in some cases the "<%hash>" construct only works reliably if
1375 the hash itself is declared at the outermost lexical scope (i.e. file
1376 scope).
1377
1378 Specifically, if the regex grammar does not include any interpolated
1379 scalars or arrays and the hash was declared within a subroutine (even
1380 within the same subroutine as the regex grammar that uses it), the
1381 regex will not be able to "see" the hash variable at compile-time. This
1382 will produce a "Global symbol "%hash" requires explicit package name"
1383 compile-time error. For example:
1384
1385 sub build_keyword_parser {
1386 # Hash declared inside subroutine...
1387 my %keywords = (foo => 1, bar => 1);
1388
1389 # ...then used in <%hash> construct within uninterpolated regex...
1390 return qr{
1391 ^<keyword>$
1392 <rule: keyword> <%keywords>
1393 }x;
1394
1395 # ...produces compile-time error
1396 }
1397
1398 The solution is to place the hash outside the subroutine containing the
1399 grammar:
1400
1401 # Hash declared OUTSIDE subroutine...
1402 my %keywords = (foo => 1, bar => 1);
1403
1404 sub build_keyword_parser {
1405 return qr{
1406 ^<keyword>$
1407 <rule: keyword> <%keywords>
1408 }x;
1409 }
1410
1411 ...or else to explicitly interpolate at least one scalar (even just a
1412 scalar containing an empty string):
1413
1414 sub build_keyword_parser {
1415 my %keywords = (foo => 1, bar => 1);
1416 my $DEFER_REGEX_COMPILATION = "";
1417
1418 return qr{
1419 ^<keyword>$
1420 <rule: keyword> <%keywords>
1421
1422 $DEFER_REGEX_COMPILATION
1423 }x;
1424 }
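
For reference, the first workaround (declaring the hash at file scope) can
be assembled into a complete sketch such as the one below. The anchors and
the result variables are illustrative assumptions, and the behaviour may
depend on the Perl version in use, as the note above explains:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Declared at file scope, so that the grammar can see it when it is compiled
my %keywords = (foo => 1, bar => 1);

sub build_keyword_parser {
    return qr{
        \A <keyword> \z

        <rule: keyword> <%keywords>
    }x;
}

my $parser  = build_keyword_parser();
my $known   = 'foo' =~ $parser ? 1 : 0;   # 'foo' is a key of %keywords
my $unknown = 'baz' =~ $parser ? 1 : 0;   # 'baz' is not

print "foo: $known, baz: $unknown\n";
```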
1425
1426 Rematching subrule results
1427 Sometimes it is useful to be able to rematch a string that has
1428 previously been matched by some earlier subrule. For example, consider
1429 a rule to match shell-like control blocks:
1430
1431 <rule: control_block>
1432 for <expr> <[command]>+ endfor
1433 | while <expr> <[command]>+ endwhile
1434 | if <expr> <[command]>+ endif
1435 | with <expr> <[command]>+ endwith
1436
1437 This would be much tidier if we could factor out the command names
1438 (which are the only differences between the four alternatives). The
1439 problem is that the obvious solution:
1440
1441 <rule: control_block>
1442 <keyword> <expr>
1443 <[command]>+
1444 end<keyword>
1445
1446 doesn't work, because it would also match an incorrect input like:
1447
1448 for 1..10
1449 echo $n
1450 ls subdir/$n
1451 endif
1452
1453 We need some way to ensure that the "<keyword>" matched immediately
1454 after "end" is the same "<keyword>" that was initially matched.
1455
1456 That's not difficult, because the first "<keyword>" will have captured
1457 what it matched into $MATCH{keyword}, so we could just write:
1458
1459 <rule: control_block>
1460 <keyword> <expr>
1461 <[command]>+
1462 end(??{quotemeta $MATCH{keyword}})
1463
1464 This is such a useful technique, yet so ugly, scary, and prone to
1465 error, that Regexp::Grammars provides a cleaner equivalent:
1466
1467 <rule: control_block>
1468 <keyword> <expr>
1469 <[command]>+
1470 end<\_keyword>
1471
1472 A directive of the form "<\_IDENTIFIER>" is known as a "matchref" (an
1473 abbreviation of "%MATCH-supplied backreference"). Matchrefs always
1474 attempt to match, as a literal, the current value of
1475 $MATCH{IDENTIFIER}.
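
A minimal runnable sketch of a matchref, using a cut-down version of the
"control_block" rule (the "expr" token and the helper named
is_control_block() are hypothetical):

```perl
use strict;
use warnings;

my $parser = do {
    use Regexp::Grammars;
    qr{
        \A <control_block> \z

        <rule: control_block>
            <keyword> <expr> end<\_keyword>

        <token: keyword>   for | while | if | with
        <token: expr>      \d+ \.\. \d+     # hypothetical range expression
    }xms;
};

# Hypothetical helper: true if the text is a correctly terminated block
sub is_control_block {
    my ($text) = @_;
    return $text =~ $parser ? 1 : 0;
}

print "endfor: ", is_control_block('for 1..10 endfor'), "\n";
print "endif:  ", is_control_block('for 1..10 endif'),  "\n";
```

The second input fails because the matchref demands the literal text that
"<keyword>" captured earlier in the same rule.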
1476
1477 By default, a matchref does not capture what it matches, but you can
1478 have it do so by giving it an alias:
1479
1480 <token: delimited_string>
1481 <ldelim=str_delim> .*? <rdelim=\_ldelim>
1482
1483 <token: str_delim> ["'`]
1484
1485 At first glance this doesn't seem very useful as, by definition,
1486 $MATCH{ldelim} and $MATCH{rdelim} must necessarily always end up with
1487 identical values. However, it can be useful if the rule also has other
1488 alternatives and you want to create a consistent internal
1489 representation for those alternatives, like so:
1490
1491 <token: delimited_string>
1492 <ldelim=str_delim> .*? <rdelim=\_ldelim>
| <ldelim=( \[ )> .*? <rdelim=( \] )>
| <ldelim=( \{ )> .*? <rdelim=( \} )>
| <ldelim=( \( )> .*? <rdelim=( \) )>
| <ldelim=( \< )> .*? <rdelim=( \> )>
1497
1498 You can also force a matchref to save repeated matches as a nested
1499 array, in the usual way:
1500
1501 <token: marked_text>
1502 <marker> <text> <[endmarkers=\_marker]>+
1503
1504 Be careful though, as the following will not do as you may expect:
1505
1506 <[marker]>+ <text> <[endmarkers=\_marker]>+
1507
because the value of $MATCH{marker} will be an array reference. The
matchref flattens and concatenates that array, then matches the
resulting string as a literal. So the previous example matches
endmarkers that are exact multiples of the complete start marker,
rather than endmarkers that consist of any number of repetitions of the
individual start marker delimiter. That is, it will match:
1514
1515 ""text here""
1516 ""text here""""
1517 ""text here""""""
1518
1519 but not:
1520
1521 ""text here"""
1522 ""text here"""""
1523
1524 Uneven start and end markers such as these are extremely unusual, so
1525 this problem rarely arises in practice.
1526
1527 Note: Prior to Regexp::Grammars version 1.020, the syntax for matchrefs
1528 was "<\IDENTIFIER>" instead of "<\_IDENTIFIER>". This created problems
1529 when the identifier started with any of "l", "u", "L", "U", "Q", or
1530 "E", so the syntax has had to be altered in a backwards incompatible
1531 way. It will not be altered again.
1532
1533 Rematching balanced delimiters
1534 Consider the example in the previous section:
1535
1536 <token: delimited_string>
1537 <ldelim=str_delim> .*? <rdelim=\_ldelim>
| <ldelim=( \[ )> .*? <rdelim=( \] )>
| <ldelim=( \{ )> .*? <rdelim=( \} )>
| <ldelim=( \( )> .*? <rdelim=( \) )>
| <ldelim=( \< )> .*? <rdelim=( \> )>
1542
The repeated pattern of the last four alternatives is galling, but we
can't just refactor those delimiters as well:
1545
1546 <token: delimited_string>
1547 <ldelim=str_delim> .*? <rdelim=\_ldelim>
1548 | <ldelim=bracket> .*? <rdelim=\_ldelim>
1549
1550 because that would incorrectly match:
1551
1552 { delimited content here {
1553
1554 while failing to match:
1555
1556 { delimited content here }
1557
1558 To refactor balanced delimiters like those, we need a second kind of
1559 matchref; one that's a little smarter.
1560
1561 Or, preferably, a lot smarter...because there are many other kinds of
1562 balanced delimiters, apart from single brackets. For example:
1563
1564 {{{ delimited content here }}}
1565 /* delimited content here */
1566 (* delimited content here *)
1567 `` delimited content here ''
1568 if delimited content here fi
1569
1570 The common characteristic of these delimiter pairs is that the closing
1571 delimiter is the inverse of the opening delimiter: the sequence of
1572 characters is reversed and certain characters (mainly brackets, but
1573 also single-quotes/backticks) are mirror-reflected.
1574
1575 Regexp::Grammars supports the parsing of such delimiters with a
1576 construct known as an invertref, which is specified using the
1577 "</IDENT>" directive. An invertref acts very like a matchref, except
1578 that it does not convert to:
1579
(??{ quotemeta( $MATCH{IDENT} ) })

but rather to:

(??{ quotemeta( inverse( $MATCH{IDENT} ) ) })
1585
1586 With this directive available, the balanced delimiters of the previous
1587 example can be refactored to:
1588
1589 <token: delimited_string>
1590 <ldelim=str_delim> .*? <rdelim=\_ldelim>
| <ldelim=( [[{(<] )> .*? <rdelim=/ldelim>
1592
1593 Like matchrefs, invertrefs come in the usual range of flavours:
1594
1595 </ident> # Match the inverse of $MATCH{ident}
<ALIAS=/ident> # Match inverse and capture to $MATCH{ALIAS}
<[ALIAS=/ident]> # Match inverse and push on @{$MATCH{ALIAS}}
1598
The character pairs that are reversed during mirroring are: "{" and
"}", "[" and "]", "(" and ")", "<" and ">", "«" and "»", "`" and "'".
1601
The following mnemonics may be useful in distinguishing invertrefs
from matchrefs: a matchref starts with a "\" (just like the standard Perl
regex backrefs "\1" and "\g{-2}" and "\k<name>"), whereas an invertref
starts with a "/" (like an HTML or XML closing tag). Or just remember
that "<\_IDENT>" is "match the same again", and if you want "the same
again, only mirrored" instead, just mirror the "\" to get "</IDENT>".
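
An invertref can be sketched as follows, assuming a hypothetical
"delimited" token that accepts any run of opening brackets as its left
delimiter, and a hypothetical helper named parse_delimited():

```perl
use strict;
use warnings;

my $parser = do {
    use Regexp::Grammars;
    qr{
        \A <delimited> \z

        <token: delimited>
            <ldelim=( [\[(<]+ )>  <body=( .*? )>  <rdelim=/ldelim>
    }xms;
};

# Hypothetical helper: returns a copy of the result-hash, or undef on failure
sub parse_delimited {
    my ($text) = @_;
    return $text =~ $parser ? { %{ $/{delimited} } } : undef;
}

my $same  = parse_delimited('((inner))');
my $mixed = parse_delimited('[(x)]');

print "$_->{ldelim} ... $_->{rdelim} around '$_->{body}'\n" for $same, $mixed;
```

In the second input the opening delimiter is "[(", so the invertref looks
for ")]": the characters are reversed and each bracket is mirrored.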
1608
1609 Rematching parametric results and delimiters
1610 The "<\_IDENTIFIER>" and "</IDENTIFIER>" mechanisms normally locate the
1611 literal to be matched by looking in $MATCH{IDENTIFIER}.
1612
1613 However, you can cause them to look in $ARG{IDENTIFIER} instead, by
1614 prefixing the identifier with a single ":". This is especially useful
1615 when refactoring subrules. For example, instead of:
1616
1617 <rule: Command>
1618 <Keyword> <CommandBody> end_ <\_Keyword>
1619
1620 <rule: Placeholder>
1621 <Keyword> \.\.\. end_ <\_Keyword>
1622
1623 you could parameterize the Terminator rule, like so:
1624
1625 <rule: Command>
1626 <Keyword> <CommandBody> <Terminator(:Keyword)>
1627
1628 <rule: Placeholder>
1629 <Keyword> \.\.\. <Terminator(:Keyword)>
1630
1631 <token: Terminator>
1632 end_ <\:Keyword>
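
The parameterized "Terminator" token can be exercised with a sketch like
this one, in which the "Keyword" and "Body" tokens and the helper named
parse_command() are hypothetical stand-ins:

```perl
use strict;
use warnings;

my $parser = do {
    use Regexp::Grammars;
    qr{
        \A <Command> \z

        <rule: Command>
            <Keyword> <Body> <Terminator(:Keyword)>

        <token: Keyword>      for | while
        <token: Body>         \d+
        <token: Terminator>   end_ <\:Keyword>
    }xms;
};

# Hypothetical helper: returns the keyword of a well-formed command, or undef
sub parse_command {
    my ($text) = @_;
    return $text =~ $parser ? $/{Command}{Keyword} : undef;
}

my $ok  = parse_command('for 42 end_for');
my $bad = parse_command('for 42 end_while');

print 'matched keyword: ', $ok // 'none', "\n";
print 'mismatched terminator: ', defined $bad ? 'accepted' : 'rejected', "\n";
```

The "(:Keyword)" argument passes the caller's $MATCH{Keyword} into the
subrule, where "<\:Keyword>" matches it as a literal.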
1633
1634 Tracking and reporting match positions
1635 Regexp::Grammars automatically predefines a special token that makes it
1636 easy to track exactly where in its input a particular subrule matches.
1637 That token is: "<matchpos>".
1638
1639 The "<matchpos>" token implements a zero-width match that never fails.
1640 It always returns the current index within the string that the grammar
1641 is matching.
1642
1643 So, for example you could have your "<delimited_text>" subrule detect
1644 and report unterminated text like so:
1645
1646 <token: delimited_text>
1647 qq? <delim> <text=(.*?)> </delim>
1648 |
1649 <matchpos> qq? <delim>
1650 <error: (?{"Unterminated string starting at index $MATCH{matchpos}"})>
1651
1652 Matching "<matchpos>" in the second alternative causes $MATCH{matchpos}
1653 to contain the position in the string at which the "<matchpos>" subrule
1654 was matched (in this example: the start of the unterminated text).
1655
1656 If you want the line number instead of the string index, use the
1657 predefined "<matchline>" subrule instead:
1658
1659 <token: delimited_text>
1660 qq? <delim> <text=(.*?)> </delim>
1661 | <matchline> qq? <delim>
1662 <error: (?{"Unterminated string starting at line $MATCH{matchline}"})>
1663
1664 Note that the line numbers returned by "<matchline>" start at 1 (not at
1665 zero, as with "<matchpos>").
1666
1667 The "<matchpos>" and "<matchline>" subrules are just like any other
1668 subrules; you can alias them ("<started_at=matchpos>") or match them
1669 repeatedly ( "(?: <[matchline]> <[item]> )++"), etc.
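
As a small sketch of position tracking, the grammar below (with a
hypothetical "word" token) records where each word of a multi-line string
starts:

```perl
use strict;
use warnings;

my $parser = do {
    use Regexp::Grammars;
    qr{
        \A (?: <[word]> \s* )+ \z

        <token: word>
            <matchpos> <matchline> <text=(\w+)>
    }xms;
};

my $text  = "one\ntwo\n  three";
my @words = $text =~ $parser ? @{ $/{word} } : ();

printf "%-5s index %2d, line %d\n", @{$_}{qw(text matchpos matchline)}
    for @words;
```

Each element of @words is a hash holding the word's text together with the
string index and line number at which the "word" token began matching.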
1670
Autoactions
1672 The module also supports event-based parsing. You can specify a grammar
1673 in the usual way and then, for a particular parse, layer a collection
1674 of call-backs (known as "autoactions") over the grammar to handle the
1675 data as it is parsed.
1676
1677 Normally, a grammar rule returns the result hash it has accumulated (or
1678 whatever else was aliased to "MATCH=" within the rule). However, you
1679 can specify an autoaction object before the grammar is matched.
1680
1681 Once the autoaction object is specified, every time a rule succeeds
1682 during the parse, its result is passed to the object via one of its
1683 methods; specifically it is passed to the method whose name is the same
1684 as the rule's.
1685
1686 For example, suppose you had a grammar that recognizes simple algebraic
1687 expressions:
1688
1689 my $expr_parser = do{
1690 use Regexp::Grammars;
1691 qr{
<Answer>

<rule: Answer> <[Operand=Mult]>+ % <[Op=(\+|\-)]>

<rule: Mult> <[Operand=Pow]>+ % <[Op=(\*|/|%)]>

<rule: Pow> <[Operand=Term]>+ % <Op=(\^)>

<rule: Term> <MATCH=Literal>
| \( <MATCH=Answer> \)
1702
1703 <token: Literal> <MATCH=( [+-]? \d++ (?: \. \d++ )?+ )>
1704 }xms
1705 };
1706
1707 You could convert this grammar to a calculator, by installing a set of
1708 autoactions that convert each rule's result hash to the corresponding
1709 value of the sub-expression that the rule just parsed. To do that, you
1710 would create a class with methods whose names match the rules whose
1711 results you want to change. For example:
1712
1713 package Calculator;
1714 use List::Util qw< reduce >;
1715
1716 sub new {
1717 my ($class) = @_;
1718
1719 return bless {}, $class
1720 }
1721
1722 sub Answer {
1723 my ($self, $result_hash) = @_;
1724
1725 my $sum = shift @{$result_hash->{Operand}};
1726
1727 for my $term (@{$result_hash->{Operand}}) {
1728 my $op = shift @{$result_hash->{Op}};
1729 if ($op eq '+') { $sum += $term; }
1730 else { $sum -= $term; }
1731 }
1732
1733 return $sum;
1734 }
1735
1736 sub Mult {
1737 my ($self, $result_hash) = @_;
1738
1739 return reduce { eval($a . shift(@{$result_hash->{Op}}) . $b) }
1740 @{$result_hash->{Operand}};
1741 }
1742
1743 sub Pow {
1744 my ($self, $result_hash) = @_;
1745
1746 return reduce { $b ** $a } reverse @{$result_hash->{Operand}};
1747 }
1748
1749 Objects of this class (and indeed the class itself) now have methods
1750 corresponding to some of the rules in the expression grammar. To apply
1751 those methods to the results of the rules (as they parse) you simply
1752 install an object as the "autoaction" handler, immediately before you
1753 initiate the parse:
1754
if ($text =~ $expr_parser->with_actions(Calculator->new)) {
1756 say $/{Answer}; # Now prints the result of the expression
1757 }
1758
1759 The "with_actions()" method expects to be passed an object or
1760 classname. This object or class will be installed as the autoaction
1761 handler for the next match against any grammar. After that match, the
1762 handler will be uninstalled. "with_actions()" returns the grammar it's
1763 called on, making it easy to call it as part of a match (which is the
1764 recommended idiom).
1765
1766 With a "Calculator" object set as the autoaction handler, whenever the
1767 "Answer", "Mult", or "Pow" rule of the grammar matches, the
1768 corresponding "Answer", "Mult", or "Pow" method of the "Calculator"
1769 object will be called (with the rule's result value passed as its only
1770 argument), and the result of the method will be used as the result of
1771 the rule.
1772
1773 Note that nothing new happens when a "Term" or "Literal" rule matches,
1774 because the "Calculator" object doesn't have methods with those names.
1775
The overall effect, then, is to allow you to specify a grammar without
rule-specific behaviours and then, later, specify a set of final
actions (as methods) for some or all of the rules of the grammar.
1779
1780 Note that, if a particular callback method returns "undef", the result
1781 of the corresponding rule will be passed through without modification.
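
The dispatch behaviour described above can be pictured in a few lines of
plain Perl. This is only a sketch of the observable behaviour, not the
module's implementation: the helper "apply_autoaction()" and the class
"Calculator::Sketch" are hypothetical names invented for this example.

```perl
use strict;
use warnings;

# Hypothetical handler class: it has methods for 'Pow' and 'Literal',
# but none for 'Term'.
{
    package Calculator::Sketch;

    sub new { return bless {}, shift }

    # Returns a defined value, so that value replaces the rule's result.
    sub Pow {
        my ($self, $result_hash) = @_;
        return $result_hash->{X} ** $result_hash->{Y};
    }

    # Returns undef, so the rule's own result is passed through unchanged.
    sub Literal { return }
}

# Hypothetical helper mimicking the behaviour described in the text.
sub apply_autoaction {
    my ($handler, $rule_name, $result) = @_;

    # No method of that name: nothing new happens.
    my $method = $handler->can($rule_name)
        or return $result;

    my $new_result = $handler->$method($result);

    # An undef return value leaves the rule's result unmodified.
    return defined $new_result ? $new_result : $result;
}

my $calc    = Calculator::Sketch->new;
my @results = (
    apply_autoaction($calc, 'Pow',     { X => 2, Y => 3 }),   # 8
    apply_autoaction($calc, 'Term',    'unchanged'),          # no such method
    apply_autoaction($calc, 'Literal', '42'),                 # method returns undef
);
print "$_\n" for @results;
```

Only the "Pow" call changes its result. The other two results come back
exactly as they were passed in.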
1782
All the grammars shown so far are confined to a single regex. However,
Regexp::Grammars also provides a mechanism that allows you to define
named grammars, which can then be imported into other regexes. This
gives you a way of modularizing common grammatical components.
1788
1789 Defining a named grammar
1790 You can create a named grammar using the "<grammar:...>" directive.
1791 This directive must appear before the first rule definition in the
1792 grammar, and instead of any start-rule. For example:
1793
1794 qr{
1795 <grammar: List::Generic>
1796
1797 <rule: List>
1798 <[MATCH=Item]>+ % <Separator>
1799
1800 <rule: Item>
1801 \S++
1802
1803 <token: Separator>
1804 \s* , \s*
1805 }x;
1806
1807 This creates a grammar named "List::Generic", and installs it in the
1808 module's internal caches, for future reference.
1809
1810 Note that there is no need (or reason) to assign the resulting regex to
1811 a variable, as the named grammar cannot itself be matched against.
1812
1813 Using a named grammar
1814 To make use of a named grammar, you need to incorporate it into another
1815 grammar, by inheritance. To do that, use the "<extends:...>" directive,
1816 like so:
1817
1818 my $parser = qr{
1819 <extends: List::Generic>
1820
1821 <List>
1822 }x;
1823
1824 The "<extends:...>" directive incorporates the rules defined in the
1825 specified grammar into the current regex. You can then call any of
1826 those rules in the start-pattern.
1827
1828 Overriding an inherited rule or token
1829 Subrule dispatch within a grammar is always polymorphic. That is, when
1830 a subrule is called, the most-derived rule of the same name within the
1831 grammar's hierarchy is invoked.
1832
So, to replace a particular rule within a grammar, you simply need to
inherit that grammar and specify new, more-specific versions of any
rules you want to change. For example:
1836
1837 my $list_of_integers = qr{
1838 <List>
1839
1840 # Inherit rules from base grammar...
1841 <extends: List::Generic>
1842
1843 # Replace Item rule from List::Generic...
1844 <rule: Item>
1845 [+-]? \d++
1846 }x;
1847
1848 You can also use "<extends:...>" in other named grammars, to create
1849 hierarchies:
1850
1851 qr{
1852 <grammar: List::Integral>
1853 <extends: List::Generic>
1854
1855 <token: Item>
1856 [+-]? <MATCH=(<.Digit>+)>
1857
1858 <token: Digit>
1859 \d
1860 }x;
1861
1862 qr{
1863 <grammar: List::ColonSeparated>
1864 <extends: List::Generic>
1865
1866 <token: Separator>
1867 \s* : \s*
1868 }x;
1869
1870 qr{
1871 <grammar: List::Integral::ColonSeparated>
1872 <extends: List::Integral>
1873 <extends: List::ColonSeparated>
1874 }x;
1875
1876 As shown in the previous example, Regexp::Grammars allows you to
1877 multiply inherit two (or more) base grammars. For example, the
1878 "List::Integral::ColonSeparated" grammar takes the definitions of
1879 "List" and "Item" from the "List::Integral" grammar, and the definition
1880 of "Separator" from "List::ColonSeparated".
1881
1882 Note that grammars dispatch subrule calls using C3 method lookup,
1883 rather than Perl's older DFS lookup. That's why
1884 "List::Integral::ColonSeparated" correctly gets the more-specific
1885 "Separator" rule defined in "List::ColonSeparated", rather than the
1886 more-generic version defined in "List::Generic" (via "List::Integral").
1887 See "perldoc mro" for more discussion of the C3 dispatch algorithm.
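
The difference between the two lookup orders can be seen with ordinary
Perl packages and the core "mro" module. The packages below merely
mirror the names of the grammars above, as an analogy. Named grammars
are not Perl classes, and nothing here touches Regexp::Grammars itself.

```perl
use strict;
use warnings;
use mro;    # core module; provides mro::get_linear_isa()

# Plain packages arranged in the same diamond as the grammars above.
{ package List::Generic; }
{ package List::Integral;       our @ISA = ('List::Generic'); }
{ package List::ColonSeparated; our @ISA = ('List::Generic'); }
{
    package List::Integral::ColonSeparated;
    our @ISA = ('List::Integral', 'List::ColonSeparated');
}

my $class = 'List::Integral::ColonSeparated';

# Depth-first search reaches List::Generic before List::ColonSeparated...
my @dfs = @{ mro::get_linear_isa($class, 'dfs') };

# ...whereas C3 visits both derived classes before their shared base.
my @c3  = @{ mro::get_linear_isa($class, 'c3') };

print "DFS: @dfs\n";
print "C3:  @c3\n";
```

Under DFS, "List::Generic" comes before "List::ColonSeparated" in the
search order, so a lookup of "Separator" would find the generic version
first. Under C3 the order is reversed and the more-specific version is
found first.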
1888
1889 Augmenting an inherited rule or token
1890 Instead of replacing an inherited rule, you can augment it.
1891
For example, if you need a grammar for lists of hexadecimal numbers,
you could inherit the behaviour of "List::Integral" and add the hex
digits to its "Digit" token:
1895
1896 my $list_of_hexadecimal = qr{
1897 <List>
1898
1899 <extends: List::Integral>
1900
1901 <token: Digit>
1902 <List::Integral::Digit>
1903 | [A-Fa-f]
1904 }x;
1905
1906 If you call a subrule using a fully qualified name (such as
1907 "<List::Integral::Digit>"), the grammar calls that version of the rule,
1908 rather than the most-derived version.
1909
1910 Debugging named grammars
1911 Named grammars are independent of each other, even when inherited. This
1912 means that, if debugging is enabled in a derived grammar, it will not
1913 be active in any rules inherited from a base grammar, unless the base
1914 grammar also included a "<debug:...>" directive.
1915
1916 This is a deliberate design decision, as activating the debugger adds a
1917 significant amount of code to each grammar's implementation, which is
1918 detrimental to the matching performance of the resulting regexes.
1919
1920 If you need to debug a named grammar, the best approach is to include a
1921 "<debug: same>" directive at the start of the grammar. The presence of
1922 this directive will ensure the necessary extra debugging code is
1923 included in the regex implementing the grammar, while setting "same"
1924 mode will ensure that the debugging mode isn't altered when the matcher
1925 uses the inherited rules.
1926
1928 Result distillation
1929 Normally, calls to subrules produce nested result-hashes within the
1930 current result-hash. Those nested hashes always have at least one
1931 automatically supplied key (""), whose value is the entire substring
1932 that the subrule matched.
1933
1934 If there are no other nested captures within the subrule, there will be
1935 no other keys in the result-hash. This would be annoying as a typical
1936 nested grammar would then produce results consisting of hashes of
1937 hashes, with each nested hash having only a single key (""). This in
1938 turn would make postprocessing the result-hash (in "%/") far more
1939 complicated than it needs to be.
1940
1941 To avoid this behaviour, if a subrule's result-hash doesn't contain any
1942 keys except "", the module "flattens" the result-hash, by replacing it
1943 with the value of its single key.
1944
1945 So, for example, the grammar:
1946
1947 mv \s* <from> \s* <to>
1948
1949 <rule: from> [\w/.-]+
1950 <rule: to> [\w/.-]+
1951
1952 doesn't return a result-hash like this:
1953
1954 {
1955 "" => 'mv /usr/local/lib/libhuh.dylib /dev/null/badlib',
1956 'from' => { "" => '/usr/local/lib/libhuh.dylib' },
1957 'to' => { "" => '/dev/null/badlib' },
1958 }
1959
1960 Instead, it returns:
1961
1962 {
1963 "" => 'mv /usr/local/lib/libhuh.dylib /dev/null/badlib',
1964 'from' => '/usr/local/lib/libhuh.dylib',
1965 'to' => '/dev/null/badlib',
1966 }
1967
1968 That is, because the 'from' and 'to' subhashes each have only a single
1969 entry, they are each "flattened" to the value of that entry.
1970
1971 This flattening also occurs if a result-hash contains only "private"
1972 keys (i.e. keys starting with underscores). For example:
1973
1974 mv \s* <from> \s* <to>
1975
1976 <rule: from> <_dir=path>? <_file=filename>
1977 <rule: to> <_dir=path>? <_file=filename>
1978
1979 <token: path> [\w/.-]*/
1980 <token: filename> [\w.-]+
1981
1982 Here, the "from" rule produces a result like this:
1983
1984 from => {
1985 "" => '/usr/local/bin/perl',
1986 _dir => '/usr/local/bin/',
1987 _file => 'perl',
1988 }
1989
1990 which is automatically stripped of "private" keys, leaving:
1991
1992 from => {
1993 "" => '/usr/local/bin/perl',
1994 }
1995
1996 which is then automatically flattened to:
1997
1998 from => '/usr/local/bin/perl'
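
The two steps described above (strip private keys, then flatten a hash
that has only the "" key left) can be sketched in plain Perl. The
"distill()" helper is a hypothetical name used for illustration only. It
models the documented outcome, not the module's internal code.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the two distillation steps to one result.
sub distill {
    my ($result) = @_;
    return $result if ref $result ne 'HASH';    # already a plain value

    # Step 1: strip "private" keys (those starting with an underscore).
    my %public = map  { $_ => $result->{$_} }
                 grep { index($_, '_') != 0 } keys %{$result};

    # Step 2: if only the "" key is left, flatten to its value.
    return $public{''} if keys(%public) == 1 && exists $public{''};

    return \%public;
}

my $from = distill({
    ''    => '/usr/local/bin/perl',
    _dir  => '/usr/local/bin/',
    _file => 'perl',
});
print "$from\n";    # /usr/local/bin/perl
```

A result-hash that still has at least one public key after step 1 is
returned as a hash, which is why rules with ordinary named captures keep
their structure.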
1999
2000 List result distillation
2001
2002 A special case of result distillation occurs in a separated list, such
2003 as:
2004
2005 <rule: List>
2006
2007 <[Item]>+ % <[Sep=(,)]>
2008
2009 If this construct matches just a single item, the result hash will
2010 contain a single entry consisting of a nested array with a single
2011 value, like so:
2012
2013 { Item => [ 'data' ] }
2014
2015 Instead of returning this annoyingly nested data structure, you can
2016 tell Regexp::Grammars to flatten it to just the inner data with a
2017 special directive:
2018
2019 <rule: List>
2020
2021 <[Item]>+ % <[Sep=(,)]>
2022
2023 <minimize:>
2024
2025 The "<minimize:>" directive examines the result hash (i.e. %MATCH). If
2026 that hash contains only a single entry, which is a reference to an
2027 array with a single value, then the directive assigns that single value
2028 directly to $MATCH, so that it will be returned instead of the usual
2029 result hash.
2030
2031 This means that a normal separated list still results in a hash
2032 containing all elements and separators, but a "degenerate" list of only
2033 one item results in just that single item.
2034
2035 Manual result distillation
2036
2037 Regexp::Grammars also offers full manual control over the distillation
2038 process. If you use the reserved word "MATCH" as the alias for a
2039 subrule call:
2040
2041 <MATCH=filename>
2042
2043 or a subpattern match:
2044
2045 <MATCH=( \w+ )>
2046
2047 or a code block:
2048
2049 <MATCH=(?{ 42 })>
2050
2051 then the current rule will treat the return value of that subrule,
2052 pattern, or code block as its complete result, and return that value
2053 instead of the usual result-hash it constructs. This is the case even
2054 if the result has other entries that would normally also be returned.
2055
2056 For example, consider a rule like:
2057
2058 <rule: term>
2059 <MATCH=literal>
2060 | <left_paren> <MATCH=expr> <right_paren>
2061
2062 The use of "MATCH" aliases causes the rule to return either whatever
2063 "<literal>" returns, or whatever "<expr>" returns (provided it's
2064 between left and right parentheses).
2065
2066 Note that, in this second case, even though "<left_paren>" and
2067 "<right_paren>" are captured to the result-hash, they are not returned,
2068 because the "MATCH" alias overrides the normal "return the result-hash"
2069 semantics and returns only what its associated subrule (i.e. "<expr>")
2070 produces.
2071
Note also that the return value is only assigned if the subrule call
actually matches. For example:
2074
2075 <rule: optional_names>
2076 <[MATCH=name]>*
2077
2078 If the repeated subrule call to "<name>" matches zero times, the return
2079 value of the "optional_names" rule will not be an empty array, because
2080 the "MATCH=" will not have executed at all. Instead, the default return
value (an empty string) will be returned. If you had specifically
wanted to return an empty array, you could use either of the following:
2083
2084 <rule: optional_names>
2085 <MATCH=(?{ [] })> # Set up empty array before first match attempt
2086 <[MATCH=name]>*
2087
2088 or:
2089
2090 <rule: optional_names>
2091 <[MATCH=name]>+ # Match one or more times
2092 | # or
2093 <MATCH=(?{ [] })> # Set up empty array, if no match
2094
2095 Programmatic result distillation
2096
2097 It's also possible to control what a rule returns from within a code
2098 block. Regexp::Grammars provides a set of reserved variables that give
2099 direct access to the result-hash.
2100
2101 The result-hash itself can be accessed as %MATCH within any code block
2102 inside a rule. For example:
2103
2104 <rule: sum>
2105 <X=product> \+ <Y=product>
2106 <MATCH=(?{ $MATCH{X} + $MATCH{Y} })>
2107
Here, the rule matches a product (aliased 'X' in the result-hash), then
a literal '+', then another product (aliased to 'Y' in the
result-hash). The rule then executes the code block, which accesses the
two saved values (as $MATCH{X} and $MATCH{Y}), adding them together.
Because the block is itself aliased to "MATCH", the sum produced by the
block becomes the (only) result of the rule.
2114
2115 It is also possible to set the rule result from within a code block
2116 (instead of aliasing it). The special "override" return value is
2117 represented by the special variable $MATCH. So the previous example
2118 could be rewritten:
2119
2120 <rule: sum>
2121 <X=product> \+ <Y=product>
2122 (?{ $MATCH = $MATCH{X} + $MATCH{Y} })
2123
2124 Both forms are identical in effect. Any assignment to $MATCH overrides
2125 the normal "return all subrule results" behaviour.
2126
2127 Assigning to $MATCH directly is particularly handy if the result may
2128 not always be "distillable", for example:
2129
2130 <rule: sum>
2131 <X=product> \+ <Y=product>
2132 (?{ if (!ref $MATCH{X} && !ref $MATCH{Y}) {
2133 # Reduce to sum, if both terms are simple scalars...
2134 $MATCH = $MATCH{X} + $MATCH{Y};
2135 }
2136 else {
2137 # Return full syntax tree for non-simple case...
2138 $MATCH{op} = '+';
2139 }
2140 })
2141
2142 Note that you can also partially override the subrule return behaviour.
2143 Normally, the subrule returns the complete text it matched as its
2144 context substring (i.e. under the "empty key") in its result-hash. That
2145 is, of course, $MATCH{""}, so you can override just that behaviour by
2146 directly assigning to that entry.
2147
2148 For example, if you have a rule that matches key/value pairs from a
2149 configuration file, you might prefer that any trailing comments not be
2150 included in the "matched text" entry of the rule's result-hash. You
2151 could hide such comments like so:
2152
    <rule: config_line>
        <key> : <value> <comment>?
        (?{
            # Edit trailing comments out of "matched text" entry...
            $MATCH{""} = "$MATCH{key} : $MATCH{value}";
        })
2159
2160 Some more examples of the uses of $MATCH:
2161
    <rule: FuncDecl>
        # Keyword  Name            Return only the name (as a string)...
        func       <Identifier> ;  (?{ $MATCH = $MATCH{'Identifier'} })


    <rule: NumList>
        # Numbers in square brackets...
        \[
            ( \d+ (?: , \d+)* )
        \]

        # Return only the numbers...
        (?{ $MATCH = $CAPTURE })


    <token: Cmd>
        # Match standard variants then standardize the keyword...
        (?: mv | move | rename )  (?{ $MATCH = 'mv'; })
2180
2181 Parse-time data processing
2182 Using code blocks in rules, it's often possible to fully process data
2183 as you parse it. For example, the "<sum>" rule shown in the previous
2184 section might be part of a simple calculator, implemented entirely in a
2185 single grammar. Such a calculator might look like this:
2186
2187 my $calculator = do{
2188 use Regexp::Grammars;
2189 qr{
2190 <Answer>
2191
2192 <rule: Answer>
2193 ( <.Mult>+ % <.Op=([+-])> )
2194 <MATCH= (?{ eval $CAPTURE })>
2195
2196 <rule: Mult>
2197 ( <.Pow>+ % <.Op=([*/%])> )
2198 <MATCH= (?{ eval $CAPTURE })>
2199
2200 <rule: Pow>
2201 <X=Term> \^ <Y=Pow>
2202 <MATCH= (?{ $MATCH{X} ** $MATCH{Y}; })>
2203 |
2204 <MATCH=Term>
2205
2206 <rule: Term>
2207 <MATCH=Literal>
2208 | \( <MATCH=Answer> \)
2209
2210 <token: Literal>
2211 <MATCH= ( [+-]? \d++ (?: \. \d++ )?+ )>
2212 }xms
2213 };
2214
2215 while (my $input = <>) {
2216 if ($input =~ $calculator) {
2217 say "--> $/{Answer}";
2218 }
2219 }
2220
2221 Because every rule computes a value using the results of the subrules
2222 below it, and aliases that result to its "MATCH", each rule returns a
2223 complete evaluation of the subexpression it matches, passing that back
2224 to higher-level rules, which then do the same.
2225
2226 Hence, the result returned to the very top-level rule (i.e. to
2227 "<Answer>") is the complete evaluation of the entire expression that
2228 was matched. That means that, in the very process of having matched a
2229 valid expression, the calculator has also computed the value of that
2230 expression, which can then simply be printed directly.
2231
2232 It is often possible to have a grammar fully (or sometimes at least
2233 partially) evaluate or transform the data it is parsing, and this
2234 usually leads to very efficient and easy-to-maintain implementations.
2235
2236 The main limitation of this technique is that the data has to be in a
2237 well-structured form, where subsets of the data can be evaluated using
2238 only local information. In cases where the meaning of the data is
2239 distributed through that data non-hierarchically, or relies on global
2240 state, or on external information, it is often better to have the
2241 grammar simply construct a complete syntax tree for the data first, and
2242 then evaluate that syntax tree separately, after parsing is complete.
2243 The following section describes a feature of Regexp::Grammars that can
2244 make this second style of data processing simpler and more
2245 maintainable.
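
As a rough illustration of that second style, the following sketch
evaluates a finished syntax tree in a separate pass. The tree is built
by hand here and its shape (the keys "op", "left" and "right") is purely
hypothetical. A real grammar would produce whatever structure its rules
and aliases define.

```perl
use strict;
use warnings;

# Hypothetical syntax tree for "2 + 3 * 4", hand-built for illustration.
# A leaf is a plain number; an inner node has an 'op' and two operands.
my $tree = {
    op    => '+',
    left  => 2,
    right => { op => '*', left => 3, right => 4 },
};

# One handler per operator, looked up by the node's 'op' entry.
my %apply = (
    '+' => sub { $_[0] + $_[1] },
    '-' => sub { $_[0] - $_[1] },
    '*' => sub { $_[0] * $_[1] },
    '/' => sub { $_[0] / $_[1] },
);

# Walk the finished tree after parsing is complete.
sub evaluate {
    my ($node) = @_;
    return $node if !ref $node;    # leaf: a literal number

    my $handler = $apply{ $node->{op} }
        or die "Unknown operator: $node->{op}\n";

    return $handler->( evaluate($node->{left}), evaluate($node->{right}) );
}

my $value = evaluate($tree);
print "$value\n";    # 14
```

Because evaluation happens after the parse, the evaluator is free to
consult global state or external information that would not have been
available to an individual rule.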
2246
2247 Object-oriented parsing
2248 When a grammar has parsed successfully, the "%/" variable will contain
2249 a series of nested hashes (and possibly arrays) representing the
2250 hierarchical structure of the parsed data.
2251
2252 Typically, the next step is to walk that tree, extracting or converting
2253 or otherwise processing that information. If the tree has nodes of many
2254 different types, it can be difficult to build a recursive subroutine
2255 that can navigate it easily.
2256
A much cleaner solution is possible if the nodes of the tree are proper
objects. In that case, you just define a "process()" or "traverse()"
method for each of the classes, and have every node call that method on
each of its children. For example, if the parser were to return a tree
of nodes representing the contents of a LaTeX file, then you could
define the following methods:
2263
2264 sub Latex::file::explain
2265 {
2266 my ($self, $level) = @_;
2267 for my $element (@{$self->{element}}) {
2268 $element->explain($level);
2269 }
2270 }
2271
2272 sub Latex::element::explain {
2273 my ($self, $level) = @_;
2274 ( $self->{command} || $self->{literal})->explain($level)
2275 }
2276
2277 sub Latex::command::explain {
2278 my ($self, $level) = @_;
2279 say "\t"x$level, "Command:";
2280 say "\t"x($level+1), "Name: $self->{name}";
2281 if ($self->{options}) {
2282 say "\t"x$level, "\tOptions:";
2283 $self->{options}->explain($level+2)
2284 }
2285
2286 for my $arg (@{$self->{arg}}) {
2287 say "\t"x$level, "\tArg:";
2288 $arg->explain($level+2)
2289 }
2290 }
2291
2292 sub Latex::options::explain {
2293 my ($self, $level) = @_;
2294 $_->explain($level) foreach @{$self->{option}};
2295 }
2296
2297 sub Latex::literal::explain {
2298 my ($self, $level, $label) = @_;
2299 $label //= 'Literal';
2300 say "\t"x$level, "$label: ", $self->{q{}};
2301 }
2302
2303 and then simply write:
2304
2305 if ($text =~ $LaTeX_parser) {
2306 $/{LaTeX_file}->explain();
2307 }
2308
2309 and the chain of "explain()" calls would cascade down the nodes of the
2310 tree, each one invoking the appropriate "explain()" method according to
2311 the type of node encountered.
2312
2313 The only problem is that, by default, Regexp::Grammars returns a tree
2314 of plain-old hashes, not LaTeX::Whatever objects. Fortunately, it's
2315 easy to request that the result hashes be automatically blessed into
2316 the appropriate classes, using the "<objrule:...>" and "<objtoken:...>"
2317 directives.
2318
2319 These directives are identical to the "<rule:...>" and "<token:...>"
2320 directives (respectively), except that the rule or token they create
2321 will also convert the hash it normally returns into an object of a
2322 specified class. This conversion is done by passing the result hash to
2323 the class's constructor:
2324
2325 $class->new(\%result_hash)
2326
2327 if the class has a constructor method named "new()", or else (if the
2328 class doesn't provide a constructor) by directly blessing the result
2329 hash:
2330
2331 bless \%result_hash, $class
2332
Note that, even if the object is constructed via its own constructor,
the module still expects the new object to be hash-based, and will
issue an error if the object is anything but a blessed hash.
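
The conversion rule just described can be sketched as follows. The
helper "to_object()" and the two "Sketch::" classes are hypothetical,
and the error text is invented. The sketch only mirrors the documented
behaviour: use "new()" if the class has one, otherwise bless the hash
directly, and insist on a blessed hash either way.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed reftype);

# Hypothetical helper mirroring the conversion described above.
sub to_object {
    my ($class, $result_hash) = @_;

    # Prefer the class's own constructor, if it has one;
    # otherwise bless the result hash directly.
    my $object = $class->can('new')
        ? $class->new($result_hash)
        : bless($result_hash, $class);

    # Either way, the node must end up as a blessed hash.
    die "$class object is not a blessed hash\n"
        if !blessed($object) || reftype($object) ne 'HASH';

    return $object;
}

# Two hypothetical node classes: one with a constructor, one without.
{
    package Sketch::WithNew;
    sub new {
        my ($class, $hash) = @_;
        return bless { %{$hash}, built => 1 }, $class;
    }
}
{
    package Sketch::NoNew;
    sub name { return $_[0]{name} }
}

my $built   = to_object('Sketch::WithNew', { name => 'section' });
my $blessed = to_object('Sketch::NoNew',   { name => 'emph' });

print ref($built),   " built=$built->{built}\n";
print ref($blessed), " name=", $blessed->name, "\n";
```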
2337
2338 The generic syntax for these types of rules and tokens is:
2339
2340 <objrule: CLASS::NAME = RULENAME >
2341 <objtoken: CLASS::NAME = TOKENNAME >
2342
2343 For example:
2344
    <objrule: LaTeX::Element=component>
        # ...Defines a rule that can be called as <component>
        # ...and which returns a hash-based LaTeX::Element object

    <objtoken: LaTeX::Literal=atom>
        # ...Defines a token that can be called as <atom>
        # ...and which returns a hash-based LaTeX::Literal object
2352
2353 Note that, just as in aliased subrule calls, the name by which
2354 something is referred to outside the grammar (in this case, the class
2355 name) comes before the "=", whereas the name that it is referred to
2356 inside the grammar comes after the "=".
2357
2358 You can freely mix object-returning and plain-old-hash-returning rules
2359 and tokens within a single grammar, though you have to be careful not
2360 to subsequently try to call a method on any of the unblessed nodes.
2361
2362 An important caveat regarding OO rules
2363
Prior to Perl 5.14.0, Perl's regex engine was not fully re-entrant.
This means that in older versions of Perl, it is not possible to
re-invoke the regex engine when already inside the regex engine.
2367
2368 This means that you need to be careful that the "new()" constructors
2369 that are called by your object-rules do not themselves use regexes in
2370 any way, unless you're running under Perl 5.14 or later (in which case
2371 you can ignore what follows).
2372
2373 The two ways this is most likely to happen are:
2374
1.  If you're using a class built on Moose, where one or more of the
    "has" declarations uses a type constraint (such as 'Int') that is
    implemented via regex matching. For example:
2378
2379 has 'id' => (is => 'rw', isa => 'Int');
2380
2381 The workaround (for pre-5.14 Perls) is to replace the type
2382 constraint with one that doesn't use a regex. For example:
2383
2384 has 'id' => (is => 'rw', isa => 'Num');
2385
2386 Alternatively, you could define your own type constraint that
2387 avoids regexes:
2388
2389 use Moose::Util::TypeConstraints;
2390
2391 subtype 'Non::Regex::Int',
2392 as 'Num',
2393 where { int($_) == $_ };
2394
2395 no Moose::Util::TypeConstraints;
2396
2397 # and later...
2398
2399 has 'id' => (is => 'rw', isa => 'Non::Regex::Int');
2400
2401 2. If your class uses an "AUTOLOAD()" method to implement its
2402 constructor and that method uses the typical:
2403
2404 $AUTOLOAD =~ s/.*://;
2405
2406 technique. The workaround here is to achieve the same effect
2407 without a regex. For example:
2408
2409 my $last_colon_pos = rindex($AUTOLOAD, ':');
2410 substr $AUTOLOAD, 0, $last_colon_pos+1, q{};
2411
2412 Note that this caveat against using nested regexes also applies to any
2413 code blocks executed inside a rule or token (whether or not those rules
2414 or tokens are object-oriented).
2415
2416 A naming shortcut
2417
2418 If an "<objrule:...>" or "<objtoken:...>" is defined with a class name
2419 that is not followed by "=" and a rule name, then the rule name is
2420 determined automatically from the classname. Specifically, the final
2421 component of the classname (i.e. after the last "::", if any) is used.
2422
2423 For example:
2424
    <objrule: LaTeX::Element>
        # ...Defines a rule that can be called as <Element>
        # ...and which returns a hash-based LaTeX::Element object

    <objtoken: LaTeX::Literal>
        # ...Defines a token that can be called as <Literal>
        # ...and which returns a hash-based LaTeX::Literal object

    <objtoken: Comment>
        # ...Defines a token that can be called as <Comment>
        # ...and which returns a hash-based Comment object
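
The name derivation is simple enough to sketch in plain Perl. The
function "default_rule_name()" is a hypothetical name for this example.
It uses "rindex" rather than a regex, in keeping with the caveat above
about nested regexes on older Perls.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the final component of a class name,
# i.e. everything after the last "::" (or the whole name if none).
sub default_rule_name {
    my ($class_name) = @_;

    my $last_sep = rindex($class_name, '::');
    return $class_name if $last_sep < 0;

    return substr($class_name, $last_sep + 2);
}

print default_rule_name($_), "\n"
    for 'LaTeX::Element', 'LaTeX::Literal', 'Comment';
```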
2436
2438 Regexp::Grammars provides a number of features specifically designed to
2439 help debug both grammars and the data they parse.
2440
2441 All debugging messages are written to a log file (which, by default, is
2442 just STDERR). However, you can specify a disk file explicitly by
2443 placing a "<logfile:...>" directive at the start of your grammar:
2444
2445 $grammar = qr{
2446
2447 <logfile: LaTeX_parser_log >
2448
2449 \A <LaTeX_file> \Z # Pattern to match
2450
2451 <rule: LaTeX_file>
2452 # etc.
2453 }x;
2454
2455 You can also explicitly specify that messages go to the terminal:
2456
2457 <logfile: - >
2458
2459 Debugging grammar creation with "<logfile:...>"
2460 Whenever a log file has been directly specified, Regexp::Grammars
2461 automatically does verbose static analysis of your grammar. That is,
2462 whenever it compiles a grammar containing an explicit "<logfile:...>"
2463 directive it logs a series of messages explaining how it has
2464 interpreted the various components of that grammar. For example, the
2465 following grammar:
2466
2467 <logfile: parser_log >
2468
2469 <cmd>
2470
2471 <rule: cmd>
2472 mv <from=file> <to=file>
2473 | cp <source> <[file]> <.comment>?
2474
2475 would produce the following analysis in the 'parser_log' file:
2476
    info | Processing the main regex before any rule definitions
         |    |
         |    |...Treating <cmd> as:
         |    |      |  match the subrule <cmd>
         |    |       \ saving the match in $MATCH{'cmd'}
         |    |
         |     \___End of main regex
         |
    info | Defining a rule: <cmd>
         |    |...Returns: a hash
         |    |
         |    |...Treating ' mv ' as:
         |    |       \ normal Perl regex syntax
         |    |
         |    |...Treating <from=file> as:
         |    |      |  match the subrule <file>
         |    |       \ saving the match in $MATCH{'from'}
         |    |
         |    |...Treating <to=file> as:
         |    |      |  match the subrule <file>
         |    |       \ saving the match in $MATCH{'to'}
         |    |
         |    |...Treating ' | cp ' as:
         |    |       \ normal Perl regex syntax
         |    |
         |    |...Treating <source> as:
         |    |      |  match the subrule <source>
         |    |       \ saving the match in $MATCH{'source'}
         |    |
         |    |...Treating <[file]> as:
         |    |      |  match the subrule <file>
         |    |       \ appending the match to $MATCH{'file'}
         |    |
         |    |...Treating <.comment>? as:
         |    |      |  match the subrule <comment> if possible
         |    |       \ but don't save anything
         |    |
         |     \___End of rule definition
2515
2516 This kind of static analysis is a useful starting point in debugging a
2517 miscreant grammar, because it enables you to see what you actually
2518 specified (as opposed to what you thought you'd specified).
2519
2520 Debugging grammar execution with "<debug:...>"
2521 Regexp::Grammars also provides a simple interactive debugger, with
2522 which you can observe the process of parsing and the data being
2523 collected in any result-hash.
2524
2525 To initiate debugging, place a "<debug:...>" directive anywhere in your
2526 grammar. When parsing reaches that directive the debugger will be
2527 activated, and the command specified in the directive immediately
2528 executed. The available commands are:
2529
2530 <debug: on> - Enable debugging, stop when a rule matches
2531 <debug: match> - Enable debugging, stop when a rule matches
2532 <debug: try> - Enable debugging, stop when a rule is tried
2533 <debug: run> - Enable debugging, run until the match completes
2534 <debug: same> - Continue debugging (or not) as currently
2535 <debug: off> - Disable debugging and continue parsing silently
2536
2537 <debug: continue> - Synonym for <debug: run>
2538 <debug: step> - Synonym for <debug: try>
2539
2540 These directives can be placed anywhere within a grammar and take
2541 effect when that point is reached in the parsing. Hence, adding a
2542 "<debug:step>" directive is very much like setting a breakpoint at that
2543 point in the grammar. Indeed, a common debugging strategy is to turn
2544 debugging on and off only around a suspect part of the grammar:
2545
2546 <rule: tricky> # This is where we think the problem is...
2547 <debug:step>
2548 <preamble> <text> <postscript>
2549 <debug:off>
2550
2551 Once the debugger is active, it steps through the parse, reporting
2552 rules that are tried, matches and failures, backtracking and restarts,
2553 and the parser's location within both the grammar and the text being
2554 matched. That report looks like this:
2555
    ===============> Trying <grammar> from position 0
    > cp file1 file2 |...Trying <cmd>
                     |   |...Trying <cmd=(cp)>
                     |   |    \FAIL <cmd=(cp)>
                     |    \FAIL <cmd>
                      \FAIL <grammar>
    ===============> Trying <grammar> from position 1
    cp file1 file2   |...Trying <cmd>
                     |   |...Trying <cmd=(cp)>
    file1 file2      |   |    \_____<cmd=(cp)> matched 'cp'
    file1 file2      |   |...Trying <[file]>+
    file2            |   |    \_____<[file]>+ matched 'file1'
                     |   |...Trying <[file]>+
    [eos]            |   |    \_____<[file]>+ matched ' file2'
                     |   |...Trying <[file]>+
                     |   |    \FAIL <[file]>+
                     |   |...Trying <target>
                     |   |   |...Trying <file>
                     |   |   |    \FAIL <file>
                     |   |    \FAIL <target>
    <~~~~~~~~~~~~~~  |   |...Backtracking 5 chars and trying new match
    file2            |   |...Trying <target>
                     |   |   |...Trying <file>
                     |   |   |    \____ <file> matched 'file2'
    [eos]            |   |    \_____<target> matched 'file2'
                     |    \_____<cmd> matched ' cp file1 file2'
                      \_____<grammar> matched ' cp file1 file2'
2583
2584 The first column indicates the point in the input at which the parser
2585 is trying to match, as well as any backtracking or forward searching it
2586 may need to do. The remainder of the columns track the parser's
2587 hierarchical traversal of the grammar, indicating which rules are
2588 tried, which succeed, and what they match.
2589
2590 Provided the logfile is a terminal (as it is by default), the debugger
2591 also pauses at various points in the parsing process--before trying a
2592 rule, after a rule succeeds, or at the end of the parse--according to
2593 the most recent command issued. When it pauses, you can issue a new
2594 command by entering a single letter:
2595
2596 m - to continue until the next subrule matches
2597 t or s - to continue until the next subrule is tried
2598 r or c - to continue to the end of the grammar
2599 o - to switch off debugging
2600
2601 Note that these are the first letters of the corresponding
2602 "<debug:...>" commands, listed earlier. Just hitting ENTER while the
2603 debugger is paused repeats the previous command.
2604
2605 While the debugger is paused you can also type a 'd', which will
2606 display the result-hash for the current rule. This can be useful for
2607 detecting which rule isn't returning the data you expected.
2608
2609 Resizing the context string
2610
2611 By default, the first column of the debugger output (which shows the
2612 current matching position within the string) is limited to a width of
2613 20 columns.
2614
However, you can change that limit by calling the
"Regexp::Grammars::set_context_width()" subroutine. You have to specify
the fully qualified name, as Regexp::Grammars does not export this (or
any other) subroutine.
2619
"set_context_width()" expects a single argument: a positive integer
indicating the maximal allowable width for the context column. If an
invalid value is passed, it issues a warning and ignores that value.
2623
2624 If called in a void context, "set_context_width()" changes the context
2625 width permanently throughout your application. If called in a scalar or
2626 list context, "set_context_width()" returns an object whose destructor
2627 will cause the context width to revert to its previous value. This
2628 means you can temporarily change the context width within a given block
2629 with something like:
2630
2631 {
2632 my $temporary = Regexp::Grammars::set_context_width(50);
2633
2634 if ($text =~ $parser) {
2635 do_stuff_with( %/ );
2636 }
2637
2638 } # <--- context width automagically reverts at this point
2639
2640 and the context width will change back to its previous value when
2641 $temporary goes out of scope at the end of the block.
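
The context-sensitive behaviour described above relies on a common Perl
idiom: check "wantarray" and, outside void context, return a guard
object whose destructor undoes the change. The sketch below shows that
idiom in isolation. The names "set_width()", "Sketch::WidthGuard" and
the variable $width are hypothetical stand-ins, not the module's actual
internals.

```perl
use strict;
use warnings;

my $width = 20;    # stand-in for the module's internal setting

# Hypothetical guard class: restores the saved width when destroyed.
{
    package Sketch::WidthGuard;
    sub new     { my ($class, $old) = @_; return bless { old => $old }, $class }
    sub DESTROY { $width = $_[0]{old} }
}

# Hypothetical stand-in for set_context_width(), showing only the
# context-sensitive return behaviour.
sub set_width {
    my ($new_width) = @_;
    my $old_width = $width;
    $width = $new_width;

    # Void context: the change is permanent and nothing is returned.
    return if !defined wantarray;

    # Scalar or list context: hand back a guard that undoes the change.
    return Sketch::WidthGuard->new($old_width);
}

{
    my $temporary = set_width(50);
    print "inside:  $width\n";    # 50
}
print "outside: $width\n";        # 20 again

set_width(30);                    # void context
print "void:    $width\n";        # 30, and it stays that way
```

Note that the guard must be stored in a variable. If the return value is
requested but immediately discarded, the guard is destroyed at once and
the change is undone straight away.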
2642
2643 User-defined logging with "<log:...>"
2644 Both static and interactive debugging send a series of predefined log
2645 messages to whatever log file you have specified. It is also possible
2646 to send additional, user-defined messages to the log, using the
2647 "<log:...>" directive.
2648
2649 This directive expects either simple text or a code block as its
2650 single argument. If the argument is a code block, that code is expected
2651 to return the text of the message; if the argument is anything else,
2652 that something else is the literal message. For example:
2653
2654 <rule: ListElem>
2655
2656 <Elem= ( [a-z]\d+) >
2657 <log: Checking for a suffix, too...>
2658
2659 <Suffix= ( : \d+ ) >?
2660 <log: (?{ "ListElem: $MATCH{Elem} and $MATCH{Suffix}" })>
2661
2662 User-defined log messages implemented using a codeblock can also
2663 specify a severity level. If the codeblock of a "<log:...>" directive
2664 returns two or more values, the first is treated as a log message
2665 severity indicator, and the remaining values as separate lines of text
2666 to be logged. For example:
2667
2668 <rule: ListElem>
2669 <Elem= ( [a-z]\d+) >
2670 <Suffix= ( : \d+ ) >?
2671
2672 <log: (?{
2673 warn => "Elem was: $MATCH{Elem}",
2674 "Suffix was $MATCH{Suffix}",
2675 })>
2676
2677 When they are encountered, user-defined log messages are interspersed
2678 between any automatic log messages (i.e. from the debugger), at the
2679 correct level of nesting for the current rule.
2680
2681 Debugging non-grammars
2682 [Note that, with the release in 2012 of the Regexp::Debugger module (on
2683 CPAN) the techniques described below are unnecessary. If you need to
2684 debug plain Perl regexes, use Regexp::Debugger instead.]
2685
2686 It is possible to use Regexp::Grammars without creating any subrule
2687 definitions, simply to debug a recalcitrant regex. For example, if the
2688 following regex wasn't working as expected:
2689
2690 my $balanced_brackets = qr{
2691 \( # left delim
2692 (?:
2693 \\ # escape or
2694 | (?R) # recurse or
2695 | . # whatever
2696 )*
2697 \) # right delim
2698 }xms;
2699
2700 you could instrument it with aliased subpatterns and then debug it
2701 step-by-step, using Regexp::Grammars:
2702
2703 use Regexp::Grammars;
2704
2705 my $balanced_brackets = qr{
2706 <debug:step>
2707
2708 <.left_delim= ( \( )>
2709 (?:
2710 <.escape= ( \\ )>
2711 | <.recurse= ( (?R) )>
2712 | <.whatever=( . )>
2713 )*
2714 <.right_delim= ( \) )>
2715 }xms;
2716
2717 while (<>) {
2718 say 'matched' if /$balanced_brackets/;
2719 }
2720
2721 Note the use of amnesiac aliased subpatterns to avoid needlessly
2722 building a result-hash. Alternatively, you could use listifying aliases
2723 to preserve the matching structure as an additional debugging aid:
2724
2725 use Regexp::Grammars;
2726
2727 my $balanced_brackets = qr{
2728 <debug:step>
2729
2730 <[left_delim= ( \( )]>
2731 (?:
2732 <[escape= ( \\ )]>
2733 | <[recurse= ( (?R) )]>
2734 | <[whatever=( . )]>
2735 )*
2736 <[right_delim= ( \) )]>
2737 }xms;
2738
2739 if ( '(a(bc)d)' =~ /$balanced_brackets/) {
2740 use Data::Dumper 'Dumper';
2741 warn Dumper \%/;
2742 }
2743
2745 Assuming you have correctly debugged your grammar, the next source of
2746 problems will probably be invalid input (especially if that input is
2747 being provided interactively). So Regexp::Grammars also provides some
2748 support for detecting when a parse is likely to fail...and informing
2749 the user why.
2750
2751 Requirements
2752 The "<require:...>" directive is useful for testing conditions that
2753 it's not easy (or even possible) to check within the syntax of the
2754 regex itself. For example:
2755
2756 <rule: IPV4_Octet_Decimal>
2757 # Up to three digits...
2758 <MATCH= ( \d{1,3}+ )>
2759
2760 # ...but less than 256...
2761 <require: (?{ $MATCH <= 255 })>
2762
2763 A require expects a regex codeblock as its argument and succeeds if the
2764 final value of that codeblock is true. If the final value is false, the
2765 directive fails and the rule starts backtracking.
2766
2767 Note, in this example, that the digits are matched with " \d{1,3}+ ".
2768 The trailing "+" prevents the "{1,3}" repetition from backtracking to a
2769 smaller number of digits if the "<require:...>" fails.
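
As a hedged illustration, the following minimal sketch embeds that kind of check in a complete program. The names $ipv4_address and Octet are invented for this example, and the grammar deliberately makes no attempt to verify that exactly four octets are present:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: dot-separated octets, each checked by <require:...>
my $ipv4_address = qr{
    \A <[Octet]>+ % (\.) \z

    <token: Octet>
        # Up to three digits (possessive, so no backtracking to fewer)...
        <MATCH= ( \d{1,3}+ )>

        # ...but the value must not exceed 255
        <require: (?{ $MATCH <= 255 })>
}xms;

for my $candidate ('192.168.0.1', '10.0.0.256') {
    if ($candidate =~ $ipv4_address) {
        print "$candidate: octets are ", join(', ', @{ $/{Octet} }), "\n";
    }
    else {
        print "$candidate: rejected\n";
    }
}
```

The second candidate should be rejected, because its final octet passes the syntactic test (three digits) but fails the semantic one (a value no greater than 255).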
2770
2771 Handling failure
2772 The module has limited support for error reporting from within a
2773 grammar, in the form of the "<error:...>" and "<warning:...>"
2774 directives and their shortcuts: "<...>", "<!!!>", and "<???>".
2775
2776 Error messages
2777
2778 The "<error: MSG>" directive queues a conditional error message within
2779 "@!" and then fails to match (that is, it is equivalent to a "(?!)"
2780 when matching). For example:
2781
2782 <rule: ListElem>
2783 <SerialNumber>
2784 | <ClientName>
2785 | <error: (?{ $errcount++ . ': Missing list element' })>
2786
2787 So a common code pattern when using grammars that do this kind of error
2788 detection is:
2789
2790 if ($text =~ $grammar) {
2791 # Do something with the data collected in %/
2792 }
2793 else {
2794 say {*STDERR} $_ for @!; # i.e. report all errors
2795 }
2796
2797 Each error message is conditional in the sense that, if any surrounding
2798 rule subsequently matches, the message is automatically removed from
2799 "@!". This implies that you can queue up as many error messages as you
2800 wish, but they will only remain in "@!" if the match ultimately fails.
2801 Moreover, only those error messages originating from rules that
2802 actually contributed to the eventual failure-to-match will remain in
2803 "@!".
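
The pieces above can be assembled into a small complete program. This is only a sketch under stated assumptions: the grammar, the Greeting token, and the message text are all invented for illustration, and the exact contents of "@!" after a failed match are left for the module to determine:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: a greeting at the start of the input, or else an error
my $greeting_parser = qr{
    \A
    (?:
        <Greeting>
      |
        <error: Missing greeting>
    )

    <token: Greeting>
        hello | goodbye
}xms;

for my $text ('hello world', 'howdy') {
    if ($text =~ $greeting_parser) {
        # Successful parse: the data is available in %/
        print "Matched: ", $/{Greeting}, "\n";
    }
    else {
        # Failed parse: report whatever messages remain queued in @!
        print {*STDERR} "Error: $_\n" for @!;
    }
}
```

Anchoring the pattern with "\A" keeps the regex from retrying the whole grammar at every later position in the string, so the error alternative is attempted only once per input.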
2804
2805 If a code block is specified as the argument, the error message is
2806 whatever final value is produced when the block is executed. Note that
2807 this final value does not have to be a string (though it does have to
2808 be a scalar).
2809
2810 <rule: ListElem>
2811 <SerialNumber>
2812 | <ClientName>
2813 | <error: (?{
2814 # Return a hash, with the error information...
2815 { errnum => $errcount++, msg => 'Missing list element' }
2816 })>
2817
2818 If anything else is specified as the argument, it is treated as a
2819 literal error string (and may not contain an unbalanced '<' or '>', nor
2820 any interpolated variables).
2821
2822 However, if the literal error string begins with "Expected " or
2823 "Expecting ", then the error string automatically has the following
2824 "context suffix" appended:
2825
2826 , but found '$CONTEXT' instead
2827
2828 For example:
2829
2830 qr{ <Arithmetic_Expression> # ...Match arithmetic expression
2831 | # Or else
2832 <error: Expected a valid expression> # ...Report error, and fail
2833
2834 # Rule definitions here...
2835 }xms;
2836
2837 On an invalid input this example might produce an error message like:
2838
2839 "Expected a valid expression, but found '(2+3]*7/' instead"
2840
2841 The value of the special $CONTEXT variable is found by looking ahead in
2842 the string being matched against, to locate the next sequence of non-
2843 blank characters after the current parsing position. This variable may
2844 also be explicitly used within the "<error: (?{...})>" form of the
2845 directive.
2846
2847 As a special case, if you omit the message entirely from the directive,
2848 it is supplied automatically, derived from the name of the current
2849 rule. For example, if the following rule were to fail to match:
2850
2851 <rule: Arithmetic_expression>
2852 <Multiplicative_Expression>+ % ([+-])
2853 | <error:>
2854
2855 the error message queued would be:
2856
2857 "Expected arithmetic expression, but found 'one plus two' instead"
2858
2859 Note however, that it is still essential to include the colon in the
2860 directive. A common mistake is to write:
2861
2862 <rule: Arithmetic_expression>
2863 <Multiplicative_Expression>+ % ([+-])
2864 | <error>
2865
2866 which merely attempts to call "<rule: error>" if the first alternative
2867 fails.
2868
2869 Warning messages
2870
2871 Sometimes, you want to detect problems, but not invalidate the entire
2872 parse as a result. For those occasions, the module provides a "less
2873 stringent" form of error reporting: the "<warning:...>" directive.
2874
2875 This directive is exactly the same as an "<error:...>" in every respect
2876 except that it does not induce a failure to match at the point it
2877 appears.
2878
2879 The directive is, therefore, useful for reporting non-fatal problems in
2880 a parse. For example:
2881
2882 qr{ \A # ...Match only at start of input
2883 <ArithExpr> # ...Match a valid arithmetic expression
2884
2885 (?:
2886 # Should be at end of input...
2887 \s* \Z
2888 |
2889 # If not, report the fact but don't fail...
2890 <warning: Expected end-of-input>
2891 <warning: (?{ "Extra junk at index $INDEX: $CONTEXT" })>
2892 )
2893
2894 # Rule definitions here...
2895 }xms;
2896
2897 Note that, because they do not induce failure, two or more
2898 "<warning:...>" directives can be "stacked" in sequence, as in the
2899 previous example.
2900
2901 Stubbing
2902
2903 The module also provides three useful shortcuts, specifically to make
2904 it easy to declare, but not define, rules and tokens.
2905
2906 The "<...>" and "<!!!>" directives are equivalent to the directive:
2907
2908 <error: Cannot match RULENAME (not implemented)>
2909
2910 The "<???>" is equivalent to the directive:
2911
2912 <warning: Cannot match RULENAME (not implemented)>
2913
2914 For example, in the following grammar:
2915
2916 <grammar: List::Generic>
2917
2918 <rule: List>
2919 <[Item]>+ % (\s*,\s*)
2920
2921 <rule: Item>
2922 <...>
2923
2924 the "Item" rule is declared but not defined. That means the grammar
2925 will compile correctly, (the "List" rule won't complain about a call to
2926 a non-existent "Item"), but if the "Item" rule isn't overridden in some
2927 derived grammar, a match-time error will occur when "List" tries to
2928 match the "<...>" within "Item".
2929
2930 Localizing the (semi-)automatic error messages
2931
2932 Error directives of any of the following forms:
2933
2934 <error: Expecting identifier>
2935
2936 <error: >
2937
2938 <...>
2939
2940 <!!!>
2941
2942 or their warning equivalents:
2943
2944 <warning: Expecting identifier>
2945
2946 <warning: >
2947
2948 <???>
2949
2950 each autogenerate part or all of the actual error message they produce.
2951 By default, that autogenerated message is always produced in English.
2952
2953 However, the module provides a mechanism by which you can intercept
2954 every error or warning that is queued to "@!" via these
2955 directives...and localize those messages.
2956
2957 To do this, you call "Regexp::Grammars::set_error_translator()" (with
2958 the full qualification, since Regexp::Grammars does not export it...nor
2959 anything else, for that matter).
2960
2961 The "set_error_translator()" subroutine expects a single argument,
2962 which must be a reference to another subroutine. This subroutine is
2963 then called whenever an error or warning message is queued to "@!".
2964
2965 The subroutine is passed three arguments:
2966
2967 • the message string,
2968
2969 • the name of the rule from which the error or warning was queued,
2970 and
2971
2972 • the value of $CONTEXT when the error or warning was encountered
2973
2974 The subroutine is expected to return the final version of the message
2975 that is actually to be appended to "@!". To accomplish this it may make
2976 use of one of the many internationalization/localization modules
2977 available in Perl, or it may do the conversion entirely by itself.
2978
2979 The first argument is always exactly what appeared as a message in the
2980 original directive (regardless of whether that message is supposed to
2981 trigger autogeneration, or is just a "regular" error message). That
2982 is:
2983
2984 Directive 1st argument
2985
2986 <error: Expecting identifier> "Expecting identifier"
2987 <warning: That's not a moon!> "That's not a moon!"
2988 <error: > ""
2989 <warning: > ""
2990 <...> ""
2991 <!!!> ""
2992 <???> ""
2993
2994 The second argument always contains the name of the rule in which the
2995 directive was encountered. For example, when invoked from within
2996 "<rule: Frinstance>" the following directives produce:
2997
2998 Directive 2nd argument
2999
3000 <error: Expecting identifier> "Frinstance"
3001 <warning: That's not a moon!> "Frinstance"
3002 <error: > "Frinstance"
3003 <warning: > "Frinstance"
3004 <...> "-Frinstance"
3005 <!!!> "-Frinstance"
3006 <???> "-Frinstance"
3007
3008 Note that the "unimplemented" markers pass the rule name with a
3009 preceding '-'. This allows your translator to distinguish between
3010 "empty" messages (which should then be generated automatically) and the
3011 "unimplemented" markers (which should report that the rule is not yet
3012 properly defined).
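
To make the three-argument protocol concrete, here is a minimal sketch of a translator. Everything in it is an assumption made for illustration: the name my_translator, the German wording, and the choice to append the context to ordinary messages. It inspects its arguments with index() and substr() rather than with regexes, since the translator is called while the grammar's regex is still matching:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical translator: receives (message, rule name, context)
# and returns the text that should actually be queued in @!
sub my_translator {
    my ($message, $rule_name, $context) = @_;
    $context = '' if !defined $context;

    # "Unimplemented" markers pass the rule name with a leading '-'
    if (index($rule_name, '-') == 0) {
        my $name = substr($rule_name, 1);
        return "Regel '$name' ist noch nicht implementiert";
    }

    # An empty message requests a fully autogenerated message
    if ($message eq '') {
        return "Erwartet wurde $rule_name, gefunden wurde '$context'";
    }

    # Anything else is a literal message from the directive
    return "$message (bei '$context')";
}

# Install the translator; in non-void context the returned object
# restores the previous translator when it is destroyed
my $translator_guard
    = Regexp::Grammars::set_error_translator(\&my_translator);

print my_translator('', '-Frinstance', 'xyz'), "\n";
print my_translator('', 'Frinstance',  'xyz'), "\n";
```

Because the translator is an ordinary subroutine, it can be exercised directly (as in the last two lines) before it is ever used during a parse.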
3013
3014 If you call "Regexp::Grammars::set_error_translator()" in a void
3015 context, the error translator is permanently replaced (at least, until
3016 the next call to "set_error_translator()").
3017
3018 However, if you call "Regexp::Grammars::set_error_translator()" in a
3019 scalar or list context, it returns an object whose destructor will
3020 restore the previous translator. This allows you to install a
3021 translator only within a given scope, like so:
3022
3023 {
3024 my $temporary
3025 = Regexp::Grammars::set_error_translator(\&my_translator);
3026
3027 if ($text =~ $parser) {
3028 do_stuff_with( %/ );
3029 }
3030 else {
3031 report_errors_in( @! );
3032 }
3033
3034 } # <--- error translator automagically reverts at this point
3035
3036 Warning: any error translation subroutine you install will be called
3037 during the grammar's parsing phase (i.e. as the grammar's regex is
3038 matching). You should therefore ensure that your translator does not
3039 itself use regular expressions, as nested evaluations of regexes inside
3040 other regexes are extremely problematical (i.e. almost always
3041 disastrous) in Perl.
3042
3043 Restricting how long a parse runs
3044 Like the core Perl 5 regex engine on which they are built, the grammars
3045 implemented by Regexp::Grammars are essentially top-down parsers. This
3046 means that they may occasionally require an exponentially long time to
3047 parse a particular input. This usually occurs if a particular grammar
3048 includes a lot of recursion or nested backtracking, especially if the
3049 grammar is then matched against a long string.
3050
3051 The judicious use of non-backtracking repetitions (i.e. "x*+" and
3052 "x++") can significantly improve parsing performance in many such
3053 cases. Likewise, carefully reordering any high-level alternatives (so
3054 as to test simple common cases first) can substantially reduce parsing
3055 times.
3056
3057 However, some languages are just intrinsically slow to parse using top-
3058 down techniques (or, at least, may have slow-to-parse corner cases).
3059
3060 To help cope with this constraint, Regexp::Grammars provides a
3061 mechanism by which you can limit the total effort that a given grammar
3062 will expend in attempting to match. The "<timeout:...>" directive
3063 allows you to specify how long a grammar is allowed to continue trying
3064 to match before giving up. It expects a single argument, which must be
3065 an unsigned integer, and it treats this integer as the number of
3066 seconds to continue attempting to match.
3067
3068 For example:
3069
3070 <timeout: 10> # Give up after 10 seconds
3071
3072 indicates that the grammar should keep attempting to match for another
3073 10 seconds from the point where the directive is encountered during a
3074 parse. If the complete grammar has not matched in that time, the entire
3075 match is considered to have failed, the matching process is immediately
3076 terminated, and a standard error message ('Internal error: Timed out
3077 after 10 seconds (as requested)') is returned in "@!".
3078
3079 A "<timeout:...>" directive can be placed anywhere in a grammar, but is
3080 most usually placed at the very start, so that the entire grammar is
3081 governed by the specified time limit. The second most common
3082 alternative is to place the timeout at the start of a particular
3083 subrule that is known to be potentially very slow.
3084
3085 A common mistake is to put the timeout specification at the top level
3086 of the grammar, but place it after the actual subrule to be matched,
3087 like so:
3088
3089 my $grammar = qr{
3090
3091 <Text_Corpus> # Subrule to be matched
3092 <timeout: 10> # Useless use of timeout
3093
3094 <rule: Text_Corpus>
3095 # et cetera...
3096 }xms;
3097
3098 Since the parser will only reach the "<timeout: 10>" directive after it
3099 has completely matched "<Text_Corpus>", the timeout is only initiated
3100 at the very end of the matching process and so does not limit that
3101 process in any useful way.
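
For contrast, a minimal sketch of the more useful placement puts the directive ahead of the subrule call, so that the clock starts before any real matching is attempted. The Word token and the sample input are invented for this illustration:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $grammar = qr{
    <timeout: 10>       # Start the clock first...
    <Text_Corpus>       # ...then attempt the potentially slow subrule

    <token: Text_Corpus>
        <[Word]>+ % (\s+)

    <token: Word>
        \w+
}xms;

if ('the quick brown fox' =~ $grammar) {
    print "Parsed ", scalar @{ $/{Text_Corpus}{Word} }, " word(s)\n";
}
else {
    # On a timeout, the module's standard message is reported via @!
    print {*STDERR} "$_\n" for @!;
}
```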
3102
3103 Immediate timeouts
3104
3105 As you might expect, a "<timeout: 0>" directive tells the parser to
3106 keep trying for only zero more seconds, and therefore will immediately
3107 cause the entire surrounding grammar to fail (no matter how deeply
3108 within that grammar the directive is encountered).
3109
3110 This can occasionally be extremely useful. If you know that detecting
3111 a particular datum means that the grammar will never match, no matter
3112 how many other alternatives may subsequently be tried, you can short-
3113 circuit the parser by injecting a "<timeout: 0>" immediately after the
3114 offending datum is detected.
3115
3116 For example, if your grammar only accepts certain versions of the
3117 language being parsed, you could write:
3118
3119 <rule: Valid_Language_Version>
3120 vers = <%AcceptableVersions>
3121 |
3122 vers = <bad_version=(\S++)>
3123 <warning: (?{ "Cannot parse language version $MATCH{bad_version}" })>
3124 <timeout: 0>
3125
3126 In fact, this "<warning: MSG> <timeout: 0>" sequence is sufficiently
3127 useful, sufficiently complex, and sufficiently easy to get wrong, that
3128 Regexp::Grammars provides a handy shortcut for it: the "<fatal:...>"
3129 directive. A "<fatal:...>" is exactly equivalent to a "<warning:...>"
3130 followed by a zero-timeout, so the previous example could also be
3131 written:
3132
3133 <rule: Valid_Language_Version>
3134 vers = <%AcceptableVersions>
3135 |
3136 vers = <bad_version=(\S++)>
3137 <fatal: (?{ "Cannot parse language version $MATCH{bad_version}" })>
3138
3139 Like "<error:...>" and "<warning:...>", "<fatal:...>" also provides its
3140 own failure context in $CONTEXT, so the previous example could be
3141 further simplified to:
3142
3143 <rule: Valid_Language_Version>
3144 vers = <%AcceptableVersions>
3145 |
3146 vers = <fatal:(?{ "Cannot parse language version $CONTEXT" })>
3147
3148 Also like "<error:...>", "<fatal:...>" can autogenerate an error
3149 message if none is provided, so the example could be still further
3150 reduced to:
3151
3152 <rule: Valid_Language_Version>
3153 vers = <%AcceptableVersions>
3154 |
3155 vers = <fatal:>
3156
3157 In this last case, however, the error message returned in "@!" would no
3158 longer be:
3159
3160 Cannot parse language version 0.95
3161
3162 It would now be:
3163
3164 Expected valid language version, but found '0.95' instead
3165
3167 If you intend to use a grammar as part of a larger program that
3168 contains other (non-grammatical) regexes, it is more efficient--and
3169 less error-prone--to avoid having Regexp::Grammars process those
3170 regexes as well. So it's often a good idea to declare your grammar in a
3171 "do" block, thereby restricting the scope of the module's effects.
3172
3173 For example:
3174
3175 my $grammar = do {
3176 use Regexp::Grammars;
3177 qr{
3178 <file>
3179
3180 <rule: file>
3181 <prelude>
3182 <data>
3183 <postlude>
3184
3185 <rule: prelude>
3186 # etc.
3187 }x;
3188 };
3189
3190 Because the effects of Regexp::Grammars are lexically scoped, any
3191 regexes defined outside that "do" block will be unaffected by the
3192 module.
3193
3195 Perl API
3196 "use Regexp::Grammars;"
3197 Causes all regexes in the current lexical scope to be compile-time
3198 processed for grammar elements.
3199
3200 "$str =~ $grammar"
3201 "$str =~ /$grammar/"
3202 Attempt to match the grammar against the string, building a nested
3203 data structure from it.
3204
3205 "%/"
3206 This hash is assigned the nested data structure created by any
3207 successful match of a grammar regex.
3208
3209 "@!"
3210 This array is assigned the queue of error messages created by any
3211 unsuccessful match attempt of a grammar regex.
3212
3213 Grammar syntax
3214 Directives
3215
3216 "<rule: IDENTIFIER>"
3217 Define a rule whose name is specified by the supplied identifier.
3218
3219 Everything following the "<rule:...>" directive (up to the next
3220 "<rule:...>" or "<token:...>" directive) is treated as part of the
3221 rule being defined.
3222
3223 Any whitespace in the rule is replaced by a call to the "<.ws>"
3224 subrule (which defaults to matching "\s*", but may be explicitly
3225 redefined).
3226
3227 "<token: IDENTIFIER>"
3228 Define a rule whose name is specified by the supplied identifier.
3229
3230 Everything following the "<token:...>" directive (up to the next
3231 "<rule:...>" or "<token:...>" directive) is treated as part of the
3232 rule being defined.
3233
3234 Any whitespace in the rule is ignored (under the "/x" modifier), or
3235 explicitly matched (if "/x" is not used).
3236
3237 "<objrule: IDENTIFIER>"
3238 "<objtoken: IDENTIFIER>"
3239 Identical to a "<rule: IDENTIFIER>" or "<token: IDENTIFIER>"
3240 declaration, except that the rule or token will also bless the hash
3241 it normally returns, converting it to an object of a class whose
3242 name is the same as the rule or token itself.
3243
3244 "<require: (?{ CODE }) >"
3245 The code block is executed and if its final value is true, matching
3246 continues from the same position. If the block's final value is
3247 false, the match fails at that point and starts backtracking.
3248
3249 "<error: (?{ CODE }) >"
3250 "<error: LITERAL TEXT >"
3251 "<error: >"
3252 This directive queues a conditional error message within the global
3253 special variable "@!" and then fails to match at that point (that
3254 is, it is equivalent to a "(?!)" or "(*FAIL)" when matching).
3255
3256 "<fatal: (?{ CODE }) >"
3257 "<fatal: LITERAL TEXT >"
3258 "<fatal: >"
3259 This directive is exactly the same as an "<error:...>" in every
3260 respect except that it immediately causes the entire surrounding
3261 grammar to fail, and parsing to immediately cease.
3262
3263 "<warning: (?{ CODE }) >"
3264 "<warning: LITERAL TEXT >"
3265 This directive is exactly the same as an "<error:...>" in every
3266 respect except that it does not induce a failure to match at the
3267 point it appears. That is, it is equivalent to a "(?=)" ["succeed
3268 and continue matching"], rather than a "(?!)" ["fail and
3269 backtrack"].
3270
3271 "<debug: COMMAND >"
3272 During the matching of grammar regexes send debugging and warning
3273 information to the specified log file (see "<logfile: LOGFILE>").
3274
3275 The available "COMMAND"'s are:
3276
3277 <debug: continue> ___ Debug until end of complete parse
3278 <debug: run> _/
3279
3280 <debug: on> ___ Debug until next subrule match
3281 <debug: match> _/
3282
3283 <debug: try> ___ Debug until next subrule call or match
3284 <debug: step> _/
3285
3286 <debug: same> ___ Maintain current debugging mode
3287
3288 <debug: off> ___ No debugging
3289
3290 See also the $DEBUG special variable.
3291
3292 "<logfile: LOGFILE>"
3293 "<logfile: - >"
3294 During the compilation of grammar regexes, send debugging and
3295 warning information to the specified LOGFILE (or to *STDERR if "-"
3296 is specified).
3297
3298 If the specified LOGFILE name contains a %t, it is replaced with a
3299 (sortable) "YYYYMMDD.HHMMSS" timestamp. For example:
3300
3301 <logfile: test-run-%t >
3302
3303 executed at around 9.30pm on the 21st of March 2009, would generate
3304 a log file named: "test-run-20090321.213056"
3305
3306 "<log: (?{ CODE }) >"
3307 "<log: LITERAL TEXT >"
3308 Append a message to the log file. If the argument is a code block,
3309 that code is expected to return the text of the message; if the
3310 argument is anything else, that something else is the literal
3311 message.
3312
3313 If the block returns two or more values, the first is treated as a
3314 log message severity indicator, and the remaining values as
3315 separate lines of text to be logged.
3316
3317 "<timeout: INT >"
3318 Restrict the match-time of the parse to the specified number of
3319 seconds. Queues an error message and terminates the entire match
3320 process if the parse does not complete within the nominated time
3321 limit.
3322
3323 Subrule calls
3324
3325 "<IDENTIFIER>"
3326 Call the subrule whose name is IDENTIFIER.
3327
3328 If it matches successfully, save the hash it returns in the current
3329 scope's result-hash, under the key 'IDENTIFIER'.
3330
3331 "<IDENTIFIER_1=IDENTIFIER_2>"
3332 Call the subrule whose name is IDENTIFIER_2.
3333
3334 If it matches successfully, save the hash it returns in the current
3335 scope's result-hash, under the key 'IDENTIFIER_1'.
3336
3337 In other words, the "IDENTIFIER_1=" prefix changes the key under
3338 which the result of calling a subrule is stored.
3339
3340 "<.IDENTIFIER>"
3341 Call the subrule whose name is IDENTIFIER. Don't save the hash it
3342 returns.
3343
3344 In other words, the "dot" prefix disables saving of subrule
3345 results.
3346
3347 "<IDENTIFIER= ( PATTERN )>"
3348 Match the subpattern PATTERN.
3349
3350 If it matches successfully, capture the substring it matched and
3351 save that substring in the current scope's result-hash, under the
3352 key 'IDENTIFIER'.
3353
3354 "<.IDENTIFIER= ( PATTERN )>"
3355 Match the subpattern PATTERN. Don't save the substring it matched.
3356
3357 "<IDENTIFIER= %HASH>"
3358 Match a sequence of non-whitespace then verify that the sequence is
3359 a key in the specified hash.
3360
3361 If it matches successfully, capture the sequence it matched and
3362 save that substring in the current scope's result-hash, under the
3363 key 'IDENTIFIER'.
3364
3365 "<%HASH>"
3366 Match a key from the hash. Don't save the substring it matched.
3367
3368 "<IDENTIFIER= (?{ CODE })>"
3369 Execute the specified CODE.
3370
3371 Save the result (of the final expression that the CODE evaluates)
3372 in the current scope's result-hash, under the key 'IDENTIFIER'.
3373
3374 "<[IDENTIFIER]>"
3375 Call the subrule whose name is IDENTIFIER.
3376
3377 If it matches successfully, append the hash it returns to a nested
3378 array within the current scope's result-hash, under the key
3379 'IDENTIFIER'.
3380
3381 "<[IDENTIFIER_1=IDENTIFIER_2]>"
3382 Call the subrule whose name is IDENTIFIER_2.
3383
3384 If it matches successfully, append the hash it returns to a nested
3385 array within the current scope's result-hash, under the key
3386 'IDENTIFIER_1'.
3387
3388 "<ANY_SUBRULE>+ % <ANY_OTHER_SUBRULE>"
3389 "<ANY_SUBRULE>* % <ANY_OTHER_SUBRULE>"
3390 "<ANY_SUBRULE>+ % (PATTERN)"
3391 "<ANY_SUBRULE>* % (PATTERN)"
3392 Repeatedly call the first subrule. Keep matching as long as the
3393 subrule matches, provided successive matches are separated by
3394 matches of the second subrule or the pattern.
3395
3396 In other words, match a list of ANY_SUBRULE's separated by
3397 ANY_OTHER_SUBRULE's or PATTERN's.
3398
3399 Note that, if a pattern is used to specify the separator, it must
3400 be specified in some kind of matched parentheses. These may be
3401 capturing ["(...)"], non-capturing ["(?:...)"], non-backtracking
3402 ["(?>...)"], or any other construct enclosed by an opening and
3403 closing paren.
3404
3405 "<ANY_SUBRULE>+ %% <ANY_OTHER_SUBRULE>"
3406 "<ANY_SUBRULE>* %% <ANY_OTHER_SUBRULE>"
3407 "<ANY_SUBRULE>+ %% (PATTERN)"
3408 "<ANY_SUBRULE>* %% (PATTERN)"
3409 Repeatedly call the first subrule. Keep matching as long as the
3410 subrule matches, provided successive matches are separated by
3411 matches of the second subrule or the pattern.
3412
3413 Also allow an optional final trailing instance of the second
3414 subrule or pattern (this is where "%%" differs from "%").
3415
3416 In other words, match a list of ANY_SUBRULE's separated by
3417 ANY_OTHER_SUBRULE's or PATTERN's, with a possible final separator.
3418
3419 As for the single "%" operator, if a pattern is used to specify the
3420 separator, it must be specified in some kind of matched
3421 parentheses. These may be capturing ["(...)"], non-capturing
3422 ["(?:...)"], non-backtracking ["(?>...)"], or any other construct
3423 enclosed by an opening and closing paren.
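
Several of the call forms listed above can be seen working together in the following minimal sketch. The hash %Known_Commands, the File token, and the sample input are all invented for illustration. In each aliased call, the name to the left of the "=" is the key under which the result is saved, and whatever appears to the right of it is what gets matched:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical lookup table for the hash-matching form
my %Known_Commands = map { $_ => 1 } qw( mv cp );

my $command_parser = qr{
    \A
    <Cmd= %Known_Commands>      # Match a key of the hash, save it under Cmd
    \s+
    <[Source=File]>+ %% (,)     # Call File repeatedly, list results under Source
    \s+
    <Dest=File>                 # Call File once more, save result under Dest
    \z

    <token: File>
        [\w.-]+
}xms;

# Note the trailing comma after the last source file
if ('cp a.txt,b.txt, backup' =~ $command_parser) {
    print "Command: ", $/{Cmd}, "\n";
    print "Sources: ", join(' ', @{ $/{Source} }), "\n";
    print "Dest:    ", $/{Dest}, "\n";
}
```

The input ends its list of sources with a comma, which is why the sketch uses the doubled separator operator: with a single "%" that trailing separator would not be consumed as part of the list.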
3424
3425 Special variables within grammar actions
3426 $CAPTURE
3427 $CONTEXT
3428 These are both aliases for the built-in read-only $^N variable,
3429 which always contains the substring matched by the nearest
3430 preceding "(...)" capture. $^N still works perfectly well, but
3431 these are provided to improve the readability of code blocks and
3432 error messages respectively.
3433
3434 $INDEX
3435 This variable contains the index at which the next match will be
3436 attempted within the string being parsed. It is most commonly used
3437 in "<error:...>" or "<log:...>" directives:
3438
3439 <rule: ListElem>
3440 <log: (?{ "Trying words at index $INDEX" })>
3441 <MATCH=( \w++ )>
3442 |
3443 <log: (?{ "Trying digits at index $INDEX" })>
3444 <MATCH=( \d++ )>
3445 |
3446 <error: (?{ "Missing ListElem near index $INDEX" })>
3447
3448 %MATCH
3449 This variable contains all the saved results of any subrules called
3450 from the current rule. In other words, subrule calls like:
3451
3452 <ListElem> <Separator= (,)>
3453
3454 store their respective match results in $MATCH{'ListElem'} and
3455 $MATCH{'Separator'}.
3456
3457 $MATCH
3458 This variable is an alias for $MATCH{"="}. This is the %MATCH entry
3459 for the special "override value". If this entry is defined, its
3460 value overrides the usual "return \%MATCH" semantics of a
3461 successful rule.
3462
3463 %ARG
3464 This variable contains all the key/value pairs that were passed
3465 into a particular subrule call. For example, given the calls:
3466
3467 <Keyword> <Command> <Terminator(:Keyword)>
3468
3469 the "Terminator" rule could get access to the text matched by
3470 "<Keyword>" like so:
3471
3472 <token: Terminator>
3473 end_ (??{ $ARG{'Keyword'} })
3474
3475 Note that to match against the calling subrule's 'Keyword' value,
3476 it's necessary to use either a deferred interpolation ("(??{...})")
3477 or a qualified matchref:
3478
3479 <token: Terminator>
3480 end_ <\:Keyword>
3481
3482 A common mistake is to attempt to directly interpolate the
3483 argument:
3484
3485 <token: Terminator>
3486 end_ $ARG{'Keyword'}
3487
3488 This evaluates $ARG{'Keyword'} when the grammar is compiled, rather
3489 than when the rule is matched.
3490
3491 $_ At the start of any code blocks inside any regex, the variable $_
3492 contains the complete string being matched against. The current
3493 matching position within that string is given by: "pos($_)".
3494
3495 $DEBUG
3496 This variable stores the current debugging mode (which may be any
3497 of: 'off', 'on', 'run', 'continue', 'match', 'step', or 'try'). It
3498 is set automatically by the "<debug:...>" command, but may also be
3499 set manually in a code block (which can be useful for conditional
3500 debugging). For example:
3501
3502 <rule: ListElem>
3503 <Identifier>
3504
3505 # Conditionally debug if 'foobar' encountered...
3506 (?{ $DEBUG = $MATCH{Identifier} eq 'foobar' ? 'step' : 'off' })
3507
3508 <Modifier>?
3509
3510 See also: the "<logfile: LOGFILE>" and "<debug: DEBUG_CMD>" directives.
3511
3513 • Prior to Perl 5.14, the Perl 5 regex engine was not reentrant. So
3514 any attempt to perform a regex match inside a "(?{ ... })" or "(??{
3515 ... })" under Perl 5.12 or earlier will almost certainly lead to
3516 either weird data corruption or a segfault.
3517
3518 The same calamities can also occur in any constructor called by
3519 "<objrule:>". If the constructor invokes another regex in any way,
3520 it will most likely fail catastrophically. In particular, this
3521 means that Moose constructors will frequently crash and burn within
3522 a Regexp::Grammars grammar (for example, if the Moose-based class
3523 declares an attribute type constraint such as 'Int', which Moose
3524 checks using a regex).
3525
3526 • The additional regex constructs this module provides are
3527 implemented by rewriting regular expressions. This is a (safer)
3528 form of source filtering, but still subject to all the same
3529 limitations and fallibilities of any other macro-based solution.
3530
3531 • In particular, rewriting the macros involves the insertion of (a
3532 lot of) extra capturing parentheses. This means you can no longer
3533 assume that particular capturing parens correspond to particular
3534 numeric variables: i.e. to $1, $2, $3 etc. If you want to capture
3535 directly, use Perl 5.10's named capture construct:
3536
3537 (?<name> [^\W\d]\w* )
3538
3539 Better still, capture the data in its correct hierarchical context
3540 using the module's "named subpattern" construct:
3541
3542 <name= ([^\W\d]\w*) >
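
As a rough illustration of the first suggestion, the sketch below uses named captures in a plain Perl regex, so the result does not depend on how many capturing parentheses precede it. The input string and the variable names are hypothetical and chosen only for this example:

```perl
use strict;
use warnings;

my $input = 'total = 42';
my ($name, $value);

# Named captures are read back through %+ instead of $1, $2, ...
if ($input =~ m{ (?<name> [^\W\d]\w* ) \s* = \s* (?<value> \d+ ) }x) {
    ($name, $value) = @+{qw(name value)};
    print "$name => $value\n";
}
```

Inside a grammar, the "named subpattern" form is still preferable, because it stores the captured text in the result hash at the right level of the parse tree.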
3543
3544 • No recursive descent parser--including those created with
3545 Regexp::Grammars--can directly handle left-recursive grammars with
3546 rules of the form:
3547
3548 <rule: List>
3549 <List> , <ListElem>
3550
3551 If you find yourself attempting to write a left-recursive grammar
3552 (which Perl 5.10 may or may not complain about, but will never
3553 successfully parse with), then you probably need to use the
3554 "separated list" construct instead:
3555
3556 <rule: List>
3557 <[ListElem]>+ % (,)
3558
3559 • Grammatical parsing with Regexp::Grammars can fail if your grammar
3560 uses "non-backtracking" directives (i.e. the "(?>...)" block or the
3561 "?+", "*+", or "++" repetition specifiers). The problem appears to
3562 be that preventing the regex from backtracking through the in-regex
3563 actions that Regexp::Grammars adds causes the module's internal
3564 stack to fall out of sync with the regex match.
3565
3566 For the time being, if your grammar does not work as expected, you
3567 may need to replace one or more "non-backtracking" directives with
3568 their regular (i.e. backtracking) equivalents.
3569
3570 • Similarly, parsing with Regexp::Grammars will fail if your grammar
3571 places a subrule call within a positive look-ahead, since these
3572 don't play nicely with the data stack.
3573
3574 This seems to be an internal problem with perl itself.
3575 Investigations, and attempts at a workaround, are proceeding.
3576
3577 For the time being, you need to make sure that grammar rules don't
3578 appear inside a positive lookahead, or use the "<?RULENAME>"
3579 construct instead.
3580
3582 Note that (because the author cannot find a way to throw exceptions
3583 from within a regex) none of the following diagnostics actually throws
3584 an exception.
3585
3586 Instead, these messages are simply written to the specified parser
3587 logfile (or to *STDERR, if no logfile is specified).
3588
3589 However, any fatal match-time message will immediately terminate the
3590 parser matching and will still set $@ (as if an exception had been
3591 thrown and caught at that point in the code). You then have the option
3592 to check $@ immediately after matching with the grammar, and rethrow if
3593 necessary:
3594
3595 if ($input =~ $grammar) {
3596 process_data_in(\%/);
3597 }
3598 else {
3599 die if $@;
3600 }
3601
3602 "Found call to %s, but no %s was defined in the grammar"
3603 You specified a call to a subrule for which there was no definition
3604 in the grammar. Typically that's either because you forgot to
3605 define the rule, or because you misspelled either the definition or
3606 the subrule call. For example:
3607
3608 <file>
3609
3610 <rule: fiel> <---- misspelled rule
3611 <lines> <---- used but never defined
3612
3613 Regexp::Grammars converts any such subrule call attempt to an
3614 instant catastrophic failure of the entire parse, so if your parser
3615 ever actually tries to perform that call, Very Bad Things will
3616 happen.
3617
3618 "Entire parse terminated prematurely while attempting to call
3619 non-existent rule: %s"
3620 You ignored the previous error and actually tried to call a
3621 subrule for which there was no definition in the grammar. Very Bad
3622 Things are now happening. The parser got very upset, took its ball,
3623 and went home. See the preceding diagnostic for remedies.
3624
3625 This diagnostic should throw an exception, but can't. So it sets $@
3626 instead, allowing you to trap the error manually if you wish.
3627
3628 "Fatal error: <objrule: %s> returned a non-hash-based object"
3629 An <objrule:> was specified and returned a blessed object that
3630 wasn't a hash. This will break the behaviour of the grammar, so the
3631 module immediately reports the problem and gives up.
3632
3633 The solution is to use only hash-based classes with <objrule:>.
3634
3635 "Can't match against <grammar: %s>"
3636 The regex you attempted to match against defined a pure grammar,
3637 using the "<grammar:...>" directive. Pure grammars have no start-
3638 pattern and hence cannot be matched against directly.
3639
3640 You need to define a matchable grammar that inherits from your pure
3641 grammar and then calls one of its rules. For example, instead of:
3642
3643 my $greeting = qr{
3644 <grammar: Greeting>
3645
3646 <rule: greet>
3647 Hi there
3648 | Hello
3649 | Yo!
3650 }xms;
3651
3652 you need:
3653
3654 qr{
3655 <grammar: Greeting>
3656
3657 <rule: greet>
3658 Hi there
3659 | Hello
3660 | Yo!
3661 }xms;
3662
3663 my $greeting = qr{
3664 <extends: Greeting>
3665 <greet>
3666 }xms;
3667
3668 "Inheritance from unknown grammar requested by <%s>"
3669 You used an "<extends:...>" directive to request that your grammar
3670 inherit from another, but the grammar you asked to inherit from
3671 doesn't exist.
3672
3673 Check the spelling of the grammar name, and that it's already been
3674 defined somewhere earlier in your program.
3675
3676 "Redeclaration of <%s> will be ignored"
3677 You defined two or more rules or tokens with the same name. The
3678 first one defined in the grammar will be used; the rest will be
3679 ignored.
3680
3681 To get rid of the warning, get rid of the extra definitions (or, at
3682 least, comment them out or rename the rules).
3683
3684 "Possible invalid subrule call %s"
3685 Your grammar contained something of the form:
3686
3687 <identifier
3688 <.identifier
3689 <[identifier
3690
3691 which you might have intended to be a subrule call, but which
3692 didn't correctly parse as one. If it was supposed to be a
3693 Regexp::Grammars subrule call, you need to check the syntax you
3694 used. If it wasn't supposed to be a subrule call, you can silence
3695 the warning by rewriting it and quoting the leading angle:
3696
3697 \<identifier
3698 \<.identifier
3699 \<[identifier
3700
3701 "Possible failed attempt to specify a subrule call or directive: %s"
3702 Your grammar contained something of the form:
3703
3704 <identifier...
3705
3706 but which wasn't a call to a known subrule or directive. If it was
3707 supposed to be a subrule call, check the spelling of the rule name
3708 in the angles. If it was supposed to be a Regexp::Grammars
3709 directive, check the spelling of the directive name. If it wasn't
3710 supposed to be a subrule call or directive, you can silence the
3711 warning by rewriting it and quoting the leading angle:
3712
3713 \<identifier
3714
3715 "Invalid < metacharacter"
3716 The "<" character is always special in Regexp::Grammars regexes: it
3717 either introduces a subrule call, or a rule/token declaration, or a
3718 directive.
3719
3720 If you need to match a literal '<', use "\<" in your regex.
3721
3722 "Invalid separation specifier: %s"
3723 You used a "%" or a "%%" in the regex, but in a way that won't do
3724 what you expect. "%" and "%%" are metacharacters in
3725 Regexp::Grammars regexes, and can only be placed between a repeated
3726 atom (that matches a list of items) and a simple atom (that matches
3727 the separator between list items). See "Matching separated lists".
3728
3729 If you were using "%" or "%%" as a metacharacter, then you either
3730 forgot the repetition quantifier ("*", "+", "{0,9}", etc.) on the
3731 preceding list-matching atom, or you specified the following
3732 separator atom as something too complex for the module to parse
3733 (for example, a set of parens with nested subrule calls).
3734
3735 On the other hand, if you were intending to match a literal "%" or
3736 "%%" within a Regexp::Grammars regex, then you must explicitly
3737 specify it as being a literal by quotemeta'ing it, like so: "\%" or
3738 "\%\%".
3739
3740 "Repeated subrule %s will only capture its final match"
3741 You specified a subrule call with a repetition qualifier, such as:
3742
3743 <ListElem>*
3744
3745 or:
3746
3747 <ListElem>+
3748
3749 Because each subrule call saves its result in a hash entry of the
3750 same name, each repeated match will overwrite the previous ones, so
3751 only the last match will ultimately be saved. If you want to save
3752 all the matches, you need to tell Regexp::Grammars to save the
3753 sequence of results as a nested array within the hash entry, like
3754 so:
3755
3756 <[ListElem]>*
3757
3758 or:
3759
3760 <[ListElem]>+
3761
3762 If you really did intend to throw away every result but the final
3763 one, you can silence the warning by placing the subrule call inside
3764 any kind of parentheses. For example:
3765
3766 (<ListElem>)*
3767
3768 or:
3769
3770 (?: <ListElem> )+
3771
3772 "Unable to open log file '$filename' (%s)"
3773 You specified a "<logfile:...>" directive but the file whose name
3774 you specified could not be opened for writing (for the reason given
3775 in the parens).
3776
3777 Did you misspell the filename, or get the permissions wrong
3778 somewhere in the filepath?
3779
3780 "Non-backtracking subrule %s may not revert correctly during
3781 backtracking"
3782 Because of inherent limitations in the Perl regex engine, non-
3783 backtracking constructs like "++", "*+", "?+", and "(?>...)" do not
3784 always work correctly when applied to subrule calls, especially in
3785 earlier versions of Perl.
3786
3787 If the grammar doesn't work properly, replace the offending
3788 constructs with regular backtracking versions instead. If the
3789 grammar does work, you can silence the warning by enclosing the
3790 subrule call in any kind of parentheses. For example, change:
3791
3792 <[ListElem]>++
3793
3794 to:
3795
3796 (?: <[ListElem]> )++
3797
3798 "Unexpected item before first subrule specification in definition of
3799 <grammar: %s>"
3800 Named grammar definitions must consist only of rule and token
3801 definitions. They cannot have patterns before the first
3802 definitions. You had some kind of pattern before the first
3803 definition, which will be completely ignored within the grammar.
3804
3805 To silence the warning, either comment out or delete whatever is
3806 before the first rule/token definition.
3807
3808 "No main regex specified before rule definitions"
3809 You specified an unnamed grammar (i.e. no "<grammar:...>"
3810 directive), but didn't specify anything for it to actually match,
3811 just some rules that you don't actually call. For example:
3812
3813 my $grammar = qr{
3814
3815 <rule: list> \( <item> +% [,] \)
3816
3817 <token: item> <list> | \d+
3818 }x;
3819
3820 You have to provide something before the first rule to start the
3821 matching off. For example:
3822
3823 my $grammar = qr{
3824
3825 <list> # <--- This tells the grammar how to start matching
3826
3827 <rule: list> \( <item> +% [,] \)
3828
3829 <token: item> <list> | \d+
3830 }x;
3831
3832 "Ignoring useless empty <ws:> directive"
3833 The "<ws:...>" directive specifies what whitespace matches within
3834 the current rule. An empty "<ws:>" directive would cause whitespace
3835 to match nothing at all, which is what happens in a token
3836 definition, not in a rule definition.
3837
3838 Either put some subpattern inside the empty "<ws:...>" or, if you
3839 really do want whitespace to match nothing at all, remove the
3840 directive completely and change the rule definition to a token
3841 definition.
3842
3843 "Ignoring useless <ws: %s > directive in a token definition"
3844 The "<ws:...>" directive is used to specify what whitespace matches
3845 within a rule. Since whitespace never matches anything inside
3846 tokens, putting a "<ws:...>" directive in a token is a waste of
3847 time.
3848
3849 Either remove the useless directive, or else change the surrounding
3850 token definition to a rule definition.
3851
3852 "Quantifier that doesn't quantify anything: <%s>"
3853 You specified a rule or token something like:
3854
3855 <token: star> *
3856
3857 or:
3858
3859 <rule: add_op> plus | add | +
3860
3861 but the "*" and "+" in those examples are both regex meta-
3862 operators: quantifiers that usually cause what precedes them to
3863 match repeatedly. In these cases however, nothing is preceding the
3864 quantifier, so it's a Perl syntax error.
3865
3866 You almost certainly need to escape the meta-characters in some
3867 way. For example:
3868
3869 <token: star> \*
3870
3871 <rule: add_op> plus | add | [+]
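
The underlying behaviour can be reproduced in plain Perl, without the module. The sketch below is an illustration only, with hypothetical variable names: it compiles the unescaped alternation inside an "eval" to capture Perl's own complaint, then shows that the bracketed form matches a literal plus sign.

```perl
use strict;
use warnings;

# The unescaped form: the trailing "+" has nothing to quantify.
my $unescaped = 'plus | add | +';
my $error = '';
eval { my $re = qr/$unescaped/x; 1 } or $error = $@;
print "Compile error: $error" if $error;

# The escaped form: "[+]" matches a literal plus sign.
my $add_op  = qr/ \A (?: plus | add | [+] ) \z /x;
my @matched = grep { $_ =~ $add_op } ('plus', 'add', '+', '-');
print "Matched: @matched\n";
```

Either "\+" or "[+]" works for the escape. The character-class form is often easier to read under the "/x" flag.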
3872
3874 Regexp::Grammars requires no configuration files or environment
3875 variables.
3876
3878 This module only works under Perl 5.10 or later.
3879
3881 This module is likely to be incompatible with any other module that
3882 automagically rewrites regexes. For example it may conflict with
3883 Regexp::DefaultFlags, Regexp::DeferredExecution, or Regexp::Extended.
3884
3886 No bugs have been reported.
3887
3888 Please report any bugs or feature requests to
3889 "bug-regexp-grammars@rt.cpan.org", or through the web interface at
3890 <http://rt.cpan.org>.
3891
3893 Damian Conway "<DCONWAY@CPAN.org>"
3894
3896 Copyright (c) 2009, Damian Conway "<DCONWAY@CPAN.org>". All rights
3897 reserved.
3898
3899 This module is free software; you can redistribute it and/or modify it
3900 under the same terms as Perl itself. See perlartistic.
3901
3903 BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
3904 FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT
3905 WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER
3906 PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND,
3907 EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
3908 WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
3909 ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH
3910 YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL
3911 NECESSARY SERVICING, REPAIR, OR CORRECTION.
3912
3913 IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
3914 WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
3915 REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE
3916 TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR
3917 CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
3918 SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
3919 RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
3920 FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
3921 SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
3922 DAMAGES.
3923
3924
3925
3926perl v5.32.1 2021-01-27 Regexp::Grammars(3)