Regexp::Grammars(3)    User Contributed Perl Documentation    Regexp::Grammars(3)

NAME
Regexp::Grammars - Add grammatical parsing features to Perl 5.10 regexes

VERSION
This document describes Regexp::Grammars version 1.052

SYNOPSIS
    use Regexp::Grammars;

    my $parser = qr{
        (?:
            <Verb>               # Parse and save a Verb in a scalar
            <.ws>                # Parse but don't save whitespace
            <Noun>               # Parse and save a Noun in a scalar

            <type=(?{ rand > 0.5 ? 'VN' : 'VerbNoun' })>
                                 # Save result of expression in a scalar
        |
            (?:
                <[Noun]>         # Parse a Noun and save result in a list
                                 #   (saved under the key 'Noun')
                <[PostNoun=ws]>  # Parse whitespace, save it in a list
                                 #   (saved under the key 'PostNoun')
            )+

            <Verb>               # Parse a Verb and save result in a scalar
                                 #   (saved under the key 'Verb')

            <type=(?{ 'VN' })>   # Save a literal in a scalar
        |
            <debug: match>       # Turn on the integrated debugger here
            <.Cmd= (?: mv? )>    # Parse but don't capture a subpattern
                                 #   (name it 'Cmd' for debugging purposes)
            <[File]>+            # Parse 1+ Files and save them in a list
                                 #   (saved under the key 'File')
            <debug: off>         # Turn off the integrated debugger here
            <Dest=File>          # Parse a File and save it in a scalar
                                 #   (saved under the key 'Dest')
        )

        ################################################################

        <token: File>              # Define a subrule named File
            <.ws>                  #  - Parse but don't capture whitespace
            <MATCH= ([\w-]+) >     #  - Parse the subpattern and capture
                                   #    matched text as the result of the
                                   #    subrule

        <token: Noun>              # Define a subrule named Noun
            cat | dog | fish       #  - Match an alternative (as usual)

        <rule: Verb>               # Define a whitespace-sensitive subrule
            eats                   #  - Match a literal (after any space)
            <Object=Noun>?         #  - Parse optional subrule Noun and
                                   #    save result under the key 'Object'
        |                          #  Or else...
            <AUX>                  #  - Parse subrule AUX and save result
            <part= (eaten|seen) >  #  - Match a literal, save under 'part'

        <token: AUX>               # Define a whitespace-insensitive subrule
            (has | is)             #  - Match an alternative and capture
            (?{ $MATCH = uc $^N }) #  - Use captured text as subrule result

    }x;

    # Match the grammar against some text...
    if ($text =~ $parser) {
        # If successful, the hash %/ will have the hierarchy of results...
        process_data_in( %/ );
    }

QUICKSTART CHEATSHEET
In your program...
    use Regexp::Grammars;      Allow enhanced regexes in lexical scope
    %/                         Result-hash for successful grammar match

Defining and using named grammars...
    <grammar: GRAMMARNAME>     Define a named grammar that can be inherited
    <extends: GRAMMARNAME>     Current grammar inherits named grammar's rules

Defining rules in your grammar...
    <rule: RULENAME>           Define rule with magic whitespace
    <token: RULENAME>          Define rule without magic whitespace

    <objrule: CLASS= NAME>     Define rule that blesses return-hash into class
    <objtoken: CLASS= NAME>    Define token that blesses return-hash into class

    <objrule: CLASS>           Shortcut for above (rule name derived from class)
    <objtoken: CLASS>          Shortcut for above (token name derived from class)

Matching rules in your grammar...
    <RULENAME>                 Call named subrule (may be fully qualified)
                               save result to $MATCH{RULENAME}

    <RULENAME(...)>            Call named subrule, passing args to it

    <!RULENAME>                Call subrule and fail if it matches
    <!RULENAME(...)>           (shorthand for (?!<.RULENAME>) )

    <:IDENT>                   Match contents of $ARG{IDENT} as a pattern
    <\:IDENT>                  Match contents of $ARG{IDENT} as a literal
    </:IDENT>                  Match closing delimiter for $ARG{IDENT}

    <%HASH>                    Match longest possible key of hash
    <%HASH {PAT}>              Match any key of hash that also matches PAT

    </IDENT>                   Match closing delimiter for $MATCH{IDENT}
    <\_IDENT>                  Match the literal contents of $MATCH{IDENT}

    <ALIAS= RULENAME>          Call subrule, save result in $MATCH{ALIAS}
    <ALIAS= %HASH>             Match a hash key, save key in $MATCH{ALIAS}
    <ALIAS= ( PATTERN )>       Match pattern, save match in $MATCH{ALIAS}
    <ALIAS= (?{ CODE })>       Execute code, save value in $MATCH{ALIAS}
    <ALIAS= 'STR' >            Save specified string in $MATCH{ALIAS}
    <ALIAS= 42 >               Save specified number in $MATCH{ALIAS}
    <ALIAS= /IDENT>            Match closing delim, save as $MATCH{ALIAS}
    <ALIAS= \_IDENT>           Match '$MATCH{IDENT}', save as $MATCH{ALIAS}

    <.SUBRULE>                 Call subrule (one of the above forms),
                               but don't save the result in %MATCH

    <[SUBRULE]>                Call subrule (one of the above forms), but
                               append result instead of overwriting it

    <SUBRULE1>+ % <SUBRULE2>   Match one or more repetitions of SUBRULE1
                               as long as they're separated by SUBRULE2
    <SUBRULE1> ** <SUBRULE2>   Same (only for backwards compatibility)

    <SUBRULE1>* % <SUBRULE2>   Match zero or more repetitions of SUBRULE1
                               as long as they're separated by SUBRULE2

    <SUBRULE1>* %% <SUBRULE2>  Match zero or more repetitions of SUBRULE1
                               as long as they're separated by SUBRULE2
                               and allow an optional trailing SUBRULE2

In your grammar's code blocks...
    $CAPTURE                   Alias for $^N (the most recent paren capture)
    $CONTEXT                   Another alias for $^N
    $INDEX                     Current index of next matching position in string
    %MATCH                     Current rule's result-hash
    $MATCH                     Magic override value (returned instead of result-hash)
    %ARG                       Current rule's argument hash
    $DEBUG                     Current match-time debugging mode

Directives...
    <require: (?{ CODE }) >    Fail if code evaluates false
    <timeout: INT >            Fail after specified number of seconds
    <debug: COMMAND >          Change match-time debugging mode
    <logfile: LOGFILE >        Change debugging log file (default: STDERR)
    <fatal: TEXT|(?{CODE})>    Queue error message and fail parse
    <error: TEXT|(?{CODE})>    Queue error message and backtrack
    <warning: TEXT|(?{CODE})>  Queue warning message and continue
    <log: TEXT|(?{CODE})>      Explicitly add a message to debugging log
    <ws: PATTERN >             Override automatic whitespace matching
    <minimize:>                Simplify the result of a subrule match
    <context:>                 Switch on context substring retention
    <nocontext:>               Switch off context substring retention

DESCRIPTION
This module adds a small number of new regex constructs that can be
used within Perl 5.10 patterns to implement complete recursive-descent
parsing.
169
170 Perl 5.10 already supports recursive-descent matching, via the new
171 "(?<name>...)" and "(?&name)" constructs. For example, here is a simple
172 matcher for a subset of the LaTeX markup language:
173
174 $matcher = qr{
175 (?&File)
176
177 (?(DEFINE)
178 (?<File> (?&Element)* )
179
180 (?<Element> \s* (?&Command)
181 | \s* (?&Literal)
182 )
183
184 (?<Command> \\ \s* (?&Literal) \s* (?&Options)? \s* (?&Args)? )
185
186 (?<Options> \[ \s* (?:(?&Option) (?:\s*,\s* (?&Option) )*)? \s* \])
187
188 (?<Args> \{ \s* (?&Element)* \s* \} )
189
190 (?<Option> \s* [^][\$&%#_{}~^\s,]+ )
191
192 (?<Literal> \s* [^][\$&%#_{}~^\s]+ )
193 )
194 }xms
195
196 This technique makes it possible to use regexes to recognize complex,
197 hierarchical--and even recursive--textual structures. The problem is
198 that Perl 5.10 doesn't provide any support for extracting that
199 hierarchical data into nested data structures. In other words, using
200 Perl 5.10 you can match complex data, but not parse it into an
201 internally useful form.
202
203 An additional problem when using Perl 5.10 regexes to match complex
204 data formats is that you have to make sure you remember to insert
205 whitespace-matching constructs (such as "\s*") at every possible
206 position where the data might contain ignorable whitespace. This
207 reduces the readability of such patterns, and increases the chance of
208 errors (typically caused by overlooking a location where whitespace
209 might appear).
210
211 The Regexp::Grammars module solves both those problems.
212
213 If you import the module into a particular lexical scope, it
214 preprocesses any regex in that scope, so as to implement a number of
215 extensions to the standard Perl 5.10 regex syntax. These extensions
216 simplify the task of defining and calling subrules within a grammar,
and allow those subrule calls to capture and retain the components
they match in a proper hierarchical manner.
219
220 For example, the above LaTeX matcher could be converted to a full LaTeX
221 parser (and considerably tidied up at the same time), like so:
222
223 use Regexp::Grammars;
224 $parser = qr{
225 <File>
226
227 <rule: File> <[Element]>*
228
229 <rule: Element> <Command> | <Literal>
230
231 <rule: Command> \\ <Literal> <Options>? <Args>?
232
233 <rule: Options> \[ <[Option]>+ % (,) \]
234
235 <rule: Args> \{ <[Element]>* \}
236
237 <rule: Option> [^][\$&%#_{}~^\s,]+
238
239 <rule: Literal> [^][\$&%#_{}~^\s]+
240 }xms
241
242 Note that there is no need to explicitly place "\s*" subpatterns
243 throughout the rules; that is taken care of automatically.
244
245 If the Regexp::Grammars version of this regex were successfully matched
246 against some appropriate LaTeX document, each rule would call the
247 subrules specified within it, and then return a hash containing
248 whatever result each of those subrules returned, with each result
249 indexed by the subrule's name.
250
251 That is, if the rule named "Command" were invoked, it would first try
252 to match a backslash, then it would call the three subrules
253 "<Literal>", "<Options>", and "<Args>" (in that sequence). If they all
254 matched successfully, the "Command" rule would then return a hash with
255 three keys: 'Literal', 'Options', and 'Args'. The value for each of
256 those hash entries would be whatever result-hash the subrules
257 themselves had returned when matched.
258
259 In this way, each level of the hierarchical regex can generate hashes
260 recording everything its own subrules matched, so when the entire
261 pattern matches, it produces a tree of nested hashes that represent the
262 structured data the pattern matched.
263
264 For example, if the previous regex grammar were matched against a
265 string containing:
266
267 \documentclass[a4paper,11pt]{article}
268 \author{D. Conway}
269
270 it would automatically extract a data structure equivalent to the
271 following (but with several extra "empty" keys, which are described in
272 "Subrule results"):
273
    {
        'File' => {
            'Element' => [
                {
                    'Command' => {
                        'Literal' => 'documentclass',
                        'Options' => {
                            'Option' => [ 'a4paper', '11pt' ],
                        },
                        'Args' => {
                            'Element' => [
                                {
                                    'Literal' => 'article',
                                }
                            ],
                        }
                    }
                },
                {
                    'Command' => {
                        'Literal' => 'author',
                        'Args' => {
                            'Element' => [
                                {
                                    'Literal' => 'D.',
                                },
                                {
                                    'Literal' => 'Conway',
                                }
                            ]
                        }
                    }
                }
            ]
        }
    }
306
307 The data structure that Regexp::Grammars produces from a regex match is
308 available to the surrounding program in the magic variable "%/".
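
As a rough illustration of working with such a tree, the sketch below walks a hand-written copy of the structure shown above and lists the command names it contains. The %tree literal and the command_names helper are hypothetical, written only for this example. The keys are assumed to be spelled exactly like the rule names in the grammar (File, Element, Command, Literal). A real %/ would also carry the extra "" keys, which this walker passes over because it only descends into hashes and arrays.

```perl
use strict;
use warnings;

# Hand-written stand-in for %/ after parsing the two-command LaTeX sample.
my %tree = (
    File => {
        Element => [
            { Command => {
                Literal => 'documentclass',
                Options => { Option => [ 'a4paper', '11pt' ] },
                Args    => { Element => [ { Literal => 'article' } ] },
            } },
            { Command => {
                Literal => 'author',
                Args    => { Element => [ { Literal => 'D.' }, { Literal => 'Conway' } ] },
            } },
        ],
    },
);

# Collect the name of every command in the tree, depth first.
sub command_names {
    my ($node) = @_;
    my @names;
    if (ref $node eq 'HASH') {
        # A hash with a 'Command' entry contributes that command's name...
        push @names, $node->{Command}{Literal} if ref $node->{Command} eq 'HASH';
        # ...and every value may contain further commands.
        push @names, command_names($node->{$_}) for sort keys %{$node};
    }
    elsif (ref $node eq 'ARRAY') {
        push @names, command_names($_) for @{$node};
    }
    return @names;    # Plain strings contribute nothing
}

my @names = command_names(\%tree);
print join(', ', @names), "\n";
```

Because arrays preserve match order, the commands come out in the order they appeared in the input.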
309
310 Regexp::Grammars provides many features that simplify the extraction of
311 hierarchical data via a regex match, and also some features that can
312 simplify the processing of that data once it has been extracted. The
313 following sections explain each of those features, and some of the
314 parsing techniques they support.
315
316 Setting up the module
317 Just add:
318
319 use Regexp::Grammars;
320
321 to any lexical scope. Any regexes within that scope will automatically
322 now implement the new parsing constructs:
323
324 use Regexp::Grammars;
325
326 my $parser = qr/ regex with $extra <chocolatey> grammar bits /;
327
Note that you do not need to use the "/x" modifier when declaring a regex
grammar (though you certainly may). Even if you don't, the module
quietly adds a "/x" to every regex within the scope of its usage.
Otherwise, the default "a whitespace character matches exactly that
whitespace character" behaviour of Perl regexes would mess up your
grammar's parsing. If you need the non-"/x" behaviour, you can still
use the "(?-x)" or "(?-x:...)" directives to switch off "/x" within one
or more of your grammar's components.
336
337 Once the grammar has been processed, you can then match text against
338 the extended regexes, in the usual manner (i.e. via a "=~" match):
339
340 if ($input_text =~ $parser) {
341 ...
342 }
343
344 After a successful match, the variable "%/" will contain a series of
345 nested hashes representing the structured hierarchical data captured
346 during the parse.
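
Put together, a complete script might look like the following minimal sketch. It assumes Regexp::Grammars is installed. The greeting grammar and its rule names (greeting, salutation, name) are invented for this illustration only.

```perl
use strict;
use warnings;
use Regexp::Grammars;

# A deliberately tiny grammar: a salutation, a comma, then a name.
my $parser = qr{
    \A <greeting> \z

    <rule: greeting>     <salutation> , <name>
    <token: salutation>  hello | goodbye
    <token: name>        \w+
}xms;

my %result;
if ('hello, world' =~ $parser) {
    %result = %/;    # Copy at once: %/ is a global that a later grammar match replaces
}

print "salutation: $result{greeting}{salutation}\n";
print "name:       $result{greeting}{name}\n";
```

Copying %/ into a lexical hash straight after the match is a defensive habit rather than a requirement; it keeps the parse tree safe if more grammar matches follow.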
347
348 Structure of a Regexp::Grammars grammar
349 A Regexp::Grammars specification consists of a start-pattern (which may
350 include both standard Perl 5.10 regex syntax, as well as special
351 Regexp::Grammars directives), followed by one or more rule or token
352 definitions.
353
354 For example:
355
356 use Regexp::Grammars;
357 my $balanced_brackets = qr{
358
359 # Start-pattern...
360 <paren_pair> | <brace_pair>
361
362 # Rule definition...
363 <rule: paren_pair>
364 \( (?: <escape> | <paren_pair> | <brace_pair> | [^()] )* \)
365
366 # Rule definition...
367 <rule: brace_pair>
368 \{ (?: <escape> | <paren_pair> | <brace_pair> | [^{}] )* \}
369
370 # Token definition...
371 <token: escape>
372 \\ .
373 }xms;
374
375 The start-pattern at the beginning of the grammar acts like the "top"
376 token of the grammar, and must be matched completely for the grammar to
377 match.
378
379 This pattern is treated like a token for whitespace matching behaviour
380 (see "Tokens vs rules (whitespace handling)"). That is, whitespace in
381 the start-pattern is treated like whitespace in any normal Perl regex.
382
383 The rules and tokens are declarations only and they are not directly
384 matched. Instead, they act like subroutines, and are invoked by name
385 from the initial pattern (or from within a rule or token).
386
387 Each rule or token extends from the directive that introduces it up to
388 either the next rule or token directive, or (in the case of the final
389 rule or token) to the end of the grammar.
390
391 Tokens vs rules (whitespace handling)
392 The difference between a token and a rule is that a token treats any
393 whitespace within it exactly as a normal Perl regular expression would.
394 That is, a sequence of whitespace in a token is ignored if the "/x"
395 modifier is in effect, or else matches the same literal sequence of
396 whitespace characters (if "/x" is not in effect).
397
398 In a rule, most sequences of whitespace are treated as matching the
399 implicit subrule "<.ws>", which is automatically predefined to match
400 optional whitespace (i.e. "\s*").
401
The exceptions to this behaviour are whitespace before a "|", before a
code block, or before an explicit space-matcher (such as "<ws>" or
"\s"), and whitespace at the very end of the rule.
405
406 In other words, a rule such as:
407
408 <rule: sentence> <noun> <verb>
409 | <verb> <noun>
410
411 is equivalent to a token with added non-capturing whitespace matching:
412
413 <token: sentence> <.ws> <noun> <.ws> <verb>
414 | <.ws> <verb> <.ws> <noun>
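
The difference is easy to observe by writing the same body once as a rule and once as a token. The sketch below assumes Regexp::Grammars is installed. The pair, key, and value names are made up for the example.

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Same body twice: as a rule, the gaps between components match optional whitespace...
my $as_rule = qr{
    \A <pair> \z
    <rule: pair>   <key=(\w+)> : <value=(\w+)>
}xms;

# ...as a token, the gaps are ordinary /x whitespace and match nothing in the input.
my $as_token = qr{
    \A <pair> \z
    <token: pair>  <key=(\w+)> : <value=(\w+)>
}xms;

for my $text ('a:b', 'a : b') {
    printf "%-8s rule: %-8s token: %s\n", "'$text'",
        ($text =~ $as_rule  ? 'match' : 'no match'),
        ($text =~ $as_token ? 'match' : 'no match');
}
```

Both versions accept the compact input, but only the rule accepts the input with spaces around the colon.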
415
416 You can explicitly define a "<ws>" token to change that default
417 behaviour. For example, you could alter the definition of "whitespace"
418 to include Perlish comments, by adding an explicit "<token: ws>":
419
    <token: ws>
        (?: \s+ | \#[^\n]* )*
422
423 But be careful not to define "<ws>" as a rule, as this will lead to all
424 kinds of infinitely recursive unpleasantness.
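
Here is a small sketch of a grammar that redefines "<ws>" so that Perl-style comments count as whitespace. It assumes Regexp::Grammars is installed. The assignment rule and its aliases are hypothetical. Note that the "#" is written as "\#", since under "/x" an unescaped "#" would start a regex comment.

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr{
    \A <assignment> \z

    <rule: assignment>  <name=(\w+)> = <value=(\d+)>

    # Whitespace now means: any mix of spaces and end-of-line comments.
    <token: ws>         (?: \s+ | \#[^\n]* )*
}xms;

my $source = "answer  # what we are defining\n   = 42";

my %assignment;
if ($source =~ $parser) {
    %assignment = %{ $/{assignment} };
}

print "$assignment{name} is $assignment{value}\n";
```

The comment between the name and the equals sign is consumed by the implicit whitespace match inside the rule, so it never reaches the result-hash.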
425
426 Per-rule whitespace handling
427
428 Redefining the "<ws>" token changes its behaviour throughout the entire
429 grammar, within every rule definition. Usually that's appropriate, but
430 sometimes you need finer-grained control over whitespace handling.
431
432 So Regexp::Grammars provides the "<ws:>" directive, which allows you to
433 override the implicit whitespace-matches-whitespace behaviour only
434 within the current rule.
435
436 Note that this directive does not redefine "<ws>" within the rule; it
437 simply specifies what to replace each whitespace sequence with (instead
438 of replacing each with a "<ws>" call).
439
440 For example, if a language allows one kind of comment between
441 statements and another within statements, you could parse it with:
442
    <rule: program>
        # One type of comment between...
        <ws: (\s++ | \# .*? \n)* >

        # ...semicolon-separated statements...
        <[statement]>+ % ( ; )


    <rule: statement>
        # Another type of comment...
        <ws: (\s*+ | \#{ .*? }\# )* >

        # ...between comma-separated commands...
        <cmd> <[arg]>+ % ( , )
457
458 Note that each directive only applies to the rule in which it is
459 specified. In every other rule in the grammar, whitespace would still
460 match the usual "<ws>" subrule.
461
462 Calling subrules
To invoke a rule to match at any point, just enclose the rule's name in
angle brackets (like in Perl 6). There must be no space between the
opening bracket and the rulename. For example:
466
467 qr{
468 file: # Match literal sequence 'f' 'i' 'l' 'e' ':'
469 <name> # Call <rule: name>
470 <options>? # Call <rule: options> (it's okay if it fails)
471
472 <rule: name>
473 # etc.
474 }x;
475
476 If you need to match a literal pattern that would otherwise look like a
477 subrule call, just backslash-escape the leading angle:
478
479 qr{
480 file: # Match literal sequence 'f' 'i' 'l' 'e' ':'
481 \<name> # Match literal sequence '<' 'n' 'a' 'm' 'e' '>'
482 <options>? # Call <rule: options> (it's okay if it fails)
483
484 <rule: name>
485 # etc.
486 }x;
487
488 Subrule results
489 If a subrule call successfully matches, the result of that match is a
490 reference to a hash. That hash reference is stored in the current
491 rule's own result-hash, under the name of the subrule that was invoked.
492 The hash will, in turn, contain the results of any more deeply nested
493 subrule calls, each stored under the name by which the nested subrule
494 was invoked.
495
496 In other words, if the rule "sentence" is defined:
497
498 <rule: sentence>
499 <noun> <verb> <object>
500
501 then successfully calling the rule:
502
503 <sentence>
504
505 causes a new hash entry at the current nesting level. That entry's key
506 will be 'sentence' and its value will be a reference to a hash, which
507 in turn will have keys: 'noun', 'verb', and 'object'.
508
509 In addition each result-hash has one extra key: the empty string. The
510 value for this key is whatever substring the entire subrule call
511 matched. This value is known as the context substring.
512
513 So, for example, a successful call to "<sentence>" might add something
514 like the following to the current result-hash:
515
516 sentence => {
517 "" => 'I saw a dog',
518 noun => 'I',
519 verb => 'saw',
520 object => {
521 "" => 'a dog',
522 article => 'a',
523 noun => 'dog',
524 },
525 }
526
527 Note, however, that if the result-hash at any level contains only the
528 empty-string key (i.e. the subrule did not call any sub-subrules or
529 save any of their nested result-hashes), then the hash is "unpacked"
530 and just the context substring itself is returned.
531
532 For example, if "<rule: sentence>" had been defined:
533
534 <rule: sentence>
535 I see dead people
536
537 then a successful call to the rule would only add:
538
539 sentence => 'I see dead people'
540
541 to the current result-hash.
542
543 This is a useful feature because it prevents a series of nested subrule
544 calls from producing very unwieldy data structures. For example,
545 without this automatic unpacking, even the simple earlier example:
546
547 <rule: sentence>
548 <noun> <verb> <object>
549
550 would produce something needlessly complex, such as:
551
552 sentence => {
553 "" => 'I saw a dog',
554 noun => {
555 "" => 'I',
556 },
557 verb => {
558 "" => 'saw',
559 },
560 object => {
561 "" => 'a dog',
562 article => {
563 "" => 'a',
564 },
565 noun => {
566 "" => 'dog',
567 },
568 },
569 }
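
The unpacking behaviour can be checked with a short script. This sketch assumes Regexp::Grammars is installed and fills in hypothetical definitions for the noun, verb, article, and object subrules, which the examples above leave unspecified.

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr{
    \A <sentence> \z

    <rule: sentence>  <noun> <verb> <object>
    <rule: object>    <article> <noun>

    <token: noun>     I | dog
    <token: verb>     saw
    <token: article>  a | the
}xms;

my %sentence;
if ('I saw a dog' =~ $parser) {
    %sentence = %{ $/{sentence} };
}

# Leaf subrules are unpacked to plain strings; 'object' called subrules, so it stays a hash.
for my $key (qw( noun verb object )) {
    my $kind = ref $sentence{$key} ? 'hash' : 'string';
    print "$key => $kind\n";
}
```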
570
571 Turning off the context substring
572
573 The context substring is convenient for debugging and for generating
574 error messages but, in a large grammar, or when parsing a long string,
575 the capture and storage of many nested substrings may quickly become
576 prohibitively expensive.
577
578 So Regexp::Grammars provides a directive to prevent context substrings
579 from being retained. Any rule or token that includes the directive
580 "<nocontext:>" anywhere in the rule's body will not retain any context
581 substring it matches...unless that substring would be the only entry in
582 its result hash (which only happens within objrules and objtokens).
583
584 If a "<nocontext:>" directive appears before the first rule or token
585 definition (i.e. as part of the main pattern), then the entire grammar
586 will discard all context substrings from every one of its rules and
587 tokens.
588
589 However, you can override this universal prohibition with a second
590 directive: "<context:>". If this directive appears in any rule or
591 token, that rule or token will save its context substring, even if a
592 global "<nocontext:>" is in effect.
593
594 This means that this grammar:
595
596 qr{
597 <Command>
598
599 <rule: Command>
600 <nocontext:>
601 <Keyword> <arg=(\S+)>+ % <.ws>
602
603 <token: Keyword>
604 <Move> | <Copy> | <Delete>
605
606 # etc.
607 }x
608
609 and this grammar:
610
611 qr{
612 <nocontext:>
613 <Command>
614
615 <rule: Command>
616 <Keyword> <arg=(\S+)>+ % <.ws>
617
618 <token: Keyword>
619 <context:>
620 <Move> | <Copy> | <Delete>
621
622 # etc.
623 }x
624
625 will behave identically (saving context substrings for keywords, but
626 not for commands), except that the first version will also retain the
627 global context substring (i.e. $/{""}), whereas the second version will
628 not.
629
630 Note that "<context:>" and "<nocontext:>" have no effect on, or even
631 any interaction with, the various result distillation mechanisms, which
632 continue to work in the usual way when either or both of the directives
633 is used.
634
635 Renaming subrule results
636 It is not always convenient to have subrule results stored under the
637 same name as the rule itself. Rule names should be optimized for
638 understanding the behaviour of the parser, whereas result names should
639 be optimized for understanding the structure of the data. Often those
640 two goals are identical, but not always; sometimes rule names need to
641 describe what the data looks like, while result names need to describe
642 what the data means.
643
For example, sometimes you need to call the same rule twice, to match
two syntactically identical components whose positions give them
semantically distinct meanings:
647
648 <rule: copy_cmd>
649 copy <file> <file>
650
651 The problem here is that, if the second call to "<file>" succeeds, its
652 result-hash will be stored under the key 'file', clobbering the data
653 that was returned from the first call to "<file>".
654
655 To avoid such problems, Regexp::Grammars allows you to alias any
656 subrule call, so that it is still invoked by the original name, but its
657 result-hash is stored under a different key. The syntax for that is:
658 "<alias=rulename>". For example:
659
660 <rule: copy_cmd>
661 copy <from=file> <to=file>
662
663 Here, "<rule: file>" is called twice, with the first result-hash being
664 stored under the key 'from', and the second result-hash being stored
665 under the key 'to'.
666
667 Note, however, that the alias before the "=" must be a proper
668 identifier (i.e. a letter or underscore, followed by letters, digits,
669 and/or underscores). Aliases that start with an underscore and aliases
670 named "MATCH" have special meaning (see "Private subrule calls" and
671 "Result distillation" respectively).
672
673 Aliases can also be useful for normalizing data that may appear in
674 different formats and sequences. For example:
675
676 <rule: copy_cmd>
677 copy <from=file> <to=file>
678 | dup <to=file> as <from=file>
679 | <from=file> -> <to=file>
680 | <to=file> <- <from=file>
681
682 Here, regardless of which order the old and new files are specified,
683 the result-hash always gets:
684
685 copy_cmd => {
686 from => 'oldfile',
687 to => 'newfile',
688 }
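
A runnable sketch of this normalization, using just the first two alternatives, is shown below. It assumes Regexp::Grammars is installed. The file token definition and the parse_copy helper are hypothetical additions for the example.

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $copy_parser = qr{
    \A <copy_cmd> \z

    <rule: copy_cmd>
          copy <from=file> <to=file>
        | dup <to=file> as <from=file>

    <token: file>  [\w.]+
}xms;

# Return a copy of the copy_cmd result-hash, or undef if the text does not parse.
sub parse_copy {
    my ($text) = @_;
    return $text =~ $copy_parser ? +{ %{ $/{copy_cmd} } } : undef;
}

for my $command ('copy old.txt new.txt', 'dup new.txt as old.txt') {
    my $parsed = parse_copy($command);
    print "$command: from=$parsed->{from} to=$parsed->{to}\n";
}
```

Both spellings of the command yield the same 'from' and 'to' entries, so code that consumes the result never needs to know which syntax was used.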
689
690 List-like subrule calls
691 If a subrule call is quantified with a repetition specifier:
692
693 <rule: file_sequence>
694 <file>+
695
696 then each repeated match overwrites the corresponding entry in the
697 surrounding rule's result-hash, so only the result of the final
698 repetition will be retained. That is, if the above example matched the
699 string "foo.pl bar.py baz.php", then the result-hash would contain:
700
    file_sequence => {
        ""   => 'foo.pl bar.py baz.php',
        file => 'baz.php',
    }
705
706 Usually, that's not the desired outcome, so Regexp::Grammars provides
707 another mechanism by which to call a subrule; one that saves all
708 repetitions of its results.
709
710 A regular subrule call consists of the rule's name surrounded by angle
711 brackets. If, instead, you surround the rule's name with "<[...]>"
712 (angle and square brackets) like so:
713
714 <rule: file_sequence>
715 <[file]>+
716
717 then the rule is invoked in exactly the same way, but the result of
718 that submatch is pushed onto an array nested inside the appropriate
719 result-hash entry. In other words, if the above example matched the
720 same "foo.pl bar.py baz.php" string, the result-hash would contain:
721
    file_sequence => {
        ""   => 'foo.pl bar.py baz.php',
        file => [ 'foo.pl', 'bar.py', 'baz.php' ],
    }
726
This "listifying subrule call" can also be useful for non-repeated
subrule calls, if the same subrule is invoked in several places in a
grammar. For example, if a command-line option could be given either
one or two values, you might parse it:
731
732 <rule: size_option>
733 -size <[size]> (?: x <[size]> )?
734
735 The result-hash entry for 'size' would then always contain an array,
736 with either one or two elements, depending on the input being parsed.
737
738 Listifying subrules can also be given aliases, just like ordinary
739 subrules. The alias is always specified inside the square brackets:
740
741 <rule: size_option>
742 -size <[size=pos_integer]> (?: x <[size=pos_integer]> )?
743
744 Here, the sizes are parsed using the "pos_integer" rule, but saved in
745 the result-hash in an array under the key 'size'.
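
The contrast between scalar and listifying calls can be seen side by side in the following sketch, which assumes Regexp::Grammars is installed. To avoid depending on how a rule distributes implicit whitespace over repetitions, file_sequence is written here as a token with an explicit "<.ws>" before each file; that is a choice made for this example, not the form used above.

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Scalar call: each repetition overwrites the 'file' entry.
my $last_only = qr{
    \A <file_sequence> \z
    <token: file_sequence>  (?: <.ws> <file> )+
    <token: file>           [\w.]+
}xms;

# List call: each repetition is pushed onto an array stored under 'file'.
my $all_files = qr{
    \A <file_sequence> \z
    <token: file_sequence>  (?: <.ws> <[file]> )+
    <token: file>           [\w.]+
}xms;

my $text = 'foo.pl bar.py baz.php';

my $last;
if ($text =~ $last_only) {
    $last = $/{file_sequence}{file};
}

my @files;
if ($text =~ $all_files) {
    @files = @{ $/{file_sequence}{file} };
}

print "scalar call kept: $last\n";
print "list call kept:   @files\n";
```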
746
747 Parametric subrules
When a subrule is invoked, it can be passed a set of named arguments
(specified as "key => value" pairs). This argument list is placed in a
normal Perl regex code block and must appear immediately after the
subrule name, before the closing angle bracket.
752
753 Within the subrule that has been invoked, the arguments can be accessed
754 via the special hash %ARG. For example:
755
756 <rule: block>
757 <tag>
758 <[block]>*
759 <end_tag(?{ tag=>$MATCH{tag} })> # ...call subrule with argument
760
761 <token: end_tag>
762 end_ (??{ quotemeta $ARG{tag} })
763
764 Here the "block" rule first matches a "<tag>", and the corresponding
765 substring is saved in $MATCH{tag}. It then matches any number of nested
766 blocks. Finally it invokes the "<end_tag>" subrule, passing it an
767 argument whose name is 'tag' and whose value is the current value of
768 $MATCH{tag} (i.e. the original opening tag).
769
770 When it is thus invoked, the "end_tag" token first matches 'end_', then
771 interpolates the literal value of the 'tag' argument and attempts to
772 match it.
773
774 Any number of named arguments can be passed when a subrule is invoked.
775 For example, we could generalize the "end_tag" rule to allow any prefix
776 (not just 'end_'), and also to allow for 'if...fi'-style reversed tags,
777 like so:
778
779 <rule: block>
780 <tag>
781 <[block]>*
782 <end_tag (?{ prefix=>'end', tag=>$MATCH{tag} })>
783
784 <token: end_tag>
785 (??{ $ARG{prefix} // q{(?!)} }) # ...prefix as pattern
786 (??{ quotemeta $ARG{tag} }) # ...tag as literal
787 |
788 (??{ quotemeta reverse $ARG{tag} }) # ...reversed tag
789
790 Note that, if you do not need to interpolate values (such as
791 $MATCH{tag}) into a subrule's argument list, you can use simple
792 parentheses instead of "(?{...})", like so:
793
794 <end_tag( prefix=>'end', tag=>'head' )>
795
796 The only types of values you can use in this simplified syntax are
797 numbers and single-quote-delimited strings. For anything more complex,
798 put the argument list in a full "(?{...})".
799
As the earlier examples show, the single most common type of argument
is one of the form: "IDENTIFIER => $MATCH{IDENTIFIER}". That is, it's
a common requirement to pass an element of %MATCH into a subrule, named
with its own key.
804
805 Because this is such a common usage, Regexp::Grammars provides a
806 shortcut. If you use simple parentheses (instead of "(?{...})"
807 parentheses) then instead of a pair, you can specify an argument using
808 a colon followed by an identifier. This argument is replaced by a
809 named argument whose name is the identifier and whose value is the
810 corresponding item from %MATCH. So, for example, instead of:
811
812 <end_tag(?{ prefix=>'end', tag=>$MATCH{tag} })>
813
814 you can just write:
815
816 <end_tag( prefix=>'end', :tag )>
817
818 Note that, from Perl 5.20 onwards, due to changes in the way that Perl
819 parses regexes, Regexp::Grammars does not support explicitly passing
820 elements of %MATCH as argument values within a list subrule (yeah, it's
821 a very specific and obscure edge-case):
822
823 <[end_tag(?{ prefix=>'end', tag=>$MATCH{tag} })]> # Does not work
824
825 Note, however, that the shortcut:
826
827 <[end_tag( prefix=>'end', :tag )]>
828
829 still works correctly.
830
831 Accessing subrule arguments more cleanly
832
833 As the preceding examples illustrate, using subrule arguments
834 effectively generally requires the use of run-time interpolated
835 subpatterns via the "(??{...})" construct.
836
837 This produces ugly rule bodies such as:
838
839 <token: end_tag>
840 (??{ $ARG{prefix} // q{(?!)} }) # ...prefix as pattern
841 (??{ quotemeta $ARG{tag} }) # ...tag as literal
842 |
843 (??{ quotemeta reverse $ARG{tag} }) # ...reversed tag
844
845 To simplify these common usages, Regexp::Grammars provides three
846 convenience constructs.
847
A subrule call of the form "<:identifier>" is equivalent to:
849
850 (??{ $ARG{'identifier'} // q{(?!)} })
851
852 Namely: "Match the contents of $ARG{'identifier'}, treating those
853 contents as a pattern."
854
A subrule call of the form "<\:identifier>" (that is: a matchref with
a colon after the backslash) is equivalent to:
857
858 (??{ defined $ARG{'identifier'}
859 ? quotemeta($ARG{'identifier'})
860 : '(?!)'
861 })
862
863 Namely: "Match the contents of $ARG{'identifier'}, treating those
864 contents as a literal."
865
A subrule call of the form "</:identifier>" (that is: an invertref
with a colon after the forward slash) is equivalent to:
868
869 (??{ defined $ARG{'identifier'}
870 ? quotemeta(reverse $ARG{'identifier'})
871 : '(?!)'
872 })
873
874 Namely: "Match the closing delimiter corresponding to the contents of
875 $ARG{'identifier'}, as if it were a literal".
876
The availability of these three constructs means that we could rewrite
the above "<end_tag>" token much more cleanly as:
879
880 <token: end_tag>
881 <:prefix> # ...prefix as pattern
882 <\:tag> # ...tag as a literal
883 |
884 </:tag> # ...reversed tag
885
In general these constructs mean that, within a subrule, if you want to
match an argument passed to that subrule, you use "<:ARGNAME>" (to
match the argument as a pattern) or "<\:ARGNAME>" (to match the
argument as a literal).
890
891 Note the consistent mnemonic in these various subrule-like
892 interpolations of named arguments: the name is always prefixed by a
893 colon.
894
In other words, the "<:ARGNAME>" form works just like a "<RULENAME>",
except that the leading colon tells Regexp::Grammars to use the
contents of $ARG{'ARGNAME'} as the subpattern, instead of the contents
of "(?&RULENAME)".
899
900 Likewise, the "<\:ARGNAME>" and "</:ARGNAME>" constructs work exactly
901 like "<\_MATCHNAME>" and "</INVERTNAME>" respectively, except that the
902 leading colon indicates that the matchref or invertref should be taken
903 from %ARG instead of from %MATCH.
904
905 Pseudo-subrules
906 Aliases can also be given to standard Perl subpatterns, as well as to
907 code blocks within a regex. The syntax for subpatterns is:
908
909 <ALIAS= (SUBPATTERN) >
910
911 In other words, the syntax is exactly like an aliased subrule call,
912 except that the rule name is replaced with a set of parentheses
913 containing the subpattern. Any parentheses--capturing or
914 non-capturing--will do.
915
916 The effect of aliasing a standard subpattern is to cause whatever that
917 subpattern matches to be saved in the result-hash, using the alias as
918 its key. For example:
919
920 <rule: file_command>
921
922 <cmd=(mv|cp|ln)> <from=file> <to=file>
923
924 Here, the "<cmd=(mv|cp|ln)>" is treated exactly like a regular
925 "(mv|cp|ln)", but whatever substring it matches is saved in the result-
926 hash under the key 'cmd'.
927
928 The syntax for aliasing code blocks is:
929
930 <ALIAS= (?{ your($code->here) }) >
931
932 Note, however, that the code block must be specified in the standard
933 Perl 5.10 regex notation: "(?{...})". A common mistake is to write:
934
    <ALIAS= { your($code->here) } >
936
937 instead, which will attempt to interpolate $code before the regex is
938 even compiled, as such variables are only "protected" from
939 interpolation inside a "(?{...})".
940
941 When correctly specified, this construct executes the code in the block
942 and saves the result of that execution in the result-hash, using the
943 alias as its key. Aliased code blocks are useful for adding semantic
944 information based on which branch of a rule is executed. For example,
945 consider the "copy_cmd" alternatives shown earlier:
946
947 <rule: copy_cmd>
948 copy <from=file> <to=file>
949 | dup <to=file> as <from=file>
950 | <from=file> -> <to=file>
951 | <to=file> <- <from=file>
952
953 Using aliased code blocks, you could add an extra field to the result-
954 hash to describe which form of the command was detected, like so:
955
956 <rule: copy_cmd>
957 copy <from=file> <to=file> <type=(?{ 'std' })>
958 | dup <to=file> as <from=file> <type=(?{ 'rev' })>
959 | <from=file> -> <to=file> <type=(?{ +1 })>
960 | <to=file> <- <from=file> <type=(?{ -1 })>
961
962 Now, if the rule matched, the result-hash would contain something like:
963
    copy_cmd => {
        from => 'oldfile',
        to   => 'newfile',
        type => 'std',
    }
969
970 Note that, in addition to the semantics described above, aliased
971 subpatterns and code blocks also become visible to Regexp::Grammars'
972 integrated debugger (see Debugging).
973
974 Aliased literals
975 As the previous example illustrates, it is inconveniently verbose to
976 assign constants via aliased code blocks. So Regexp::Grammars provides
977 a short-cut. It is possible to directly alias a numeric literal or a
978 single-quote delimited literal string, without putting either inside a
979 code block. For example, the previous example could also be written:
980
981 <rule: copy_cmd>
982 copy <from=file> <to=file> <type='std'>
983 | dup <to=file> as <from=file> <type='rev'>
984 | <from=file> -> <to=file> <type= +1 >
985 | <to=file> <- <from=file> <type= -1 >
986
987 Note that only these two forms of literal are supported in this
988 abbreviated syntax.
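
To make the abbreviated syntax concrete, here is a minimal sketch of a
complete parser built around the first two "copy_cmd" alternatives. The
start pattern, the simple "<file>" token, and the "parse_copy()" helper
are assumptions made for illustration; they are not part of the module.

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: only the "copy" and "dup" forms are handled,
# and the <file> token is an assumed, deliberately simple definition.
my $copy_parser = qr{
    \A <copy_cmd> \z

    <rule: copy_cmd>
          copy <from=file> <to=file>       <type='std'>
        | dup  <to=file> as <from=file>    <type='rev'>

    <token: file>
        [\w.]+
}xms;

# Return a copy of the copy_cmd result-hash, or undef if the parse fails
sub parse_copy {
    my ($text) = @_;
    return $text =~ $copy_parser ? +{ %{ $/{copy_cmd} } } : undef;
}

my $result = parse_copy('dup new.txt as old.txt');
print "$result->{type}: $result->{from} to $result->{to}\n" if $result;
```

The aliased literal ends up in the result-hash under its alias, just as
the value of the equivalent aliased code block would.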
989
990 Amnesiac subrule calls
991 By default, every subrule call saves its result into the result-hash,
992 either under its own name, or under an alias.
993
994 However, sometimes you may want to refactor some literal part of a rule
995 into one or more subrules, without having those submatches added to the
996 result-hash. The syntax for calling a subrule, but ignoring its return
997 value is:
998
999 <.SUBRULE>
1000
1001 (which is stolen directly from Perl 6).
1002
1003 For example, you may prefer to rewrite a rule such as:
1004
1005 <rule: paren_pair>
1006
1007 \(
1008 (?: <escape> | <paren_pair> | <brace_pair> | [^()] )*
1009 \)
1010
1011 without any literal matching, like so:
1012
1013 <rule: paren_pair>
1014
1015 <.left_paren>
1016 (?: <escape> | <paren_pair> | <brace_pair> | <.non_paren> )*
1017 <.right_paren>
1018
1019 <token: left_paren> \(
1020 <token: right_paren> \)
1021 <token: non_paren> [^()]
1022
1023 Moreover, as the individual components inside the parentheses probably
1024 aren't being captured for any useful purpose either, you could further
1025 optimize that to:
1026
1027 <rule: paren_pair>
1028
1029 <.left_paren>
1030 (?: <.escape> | <.paren_pair> | <.brace_pair> | <.non_paren> )*
1031 <.right_paren>
1032
1033 Note that you can also use the dot modifier on an aliased subpattern:
1034
1035 <.Alias= (SUBPATTERN) >
1036
1037 This seemingly contradictory behaviour (of giving a subpattern a name,
1038 then deliberately ignoring that name) actually does make sense in one
1039 situation. Providing the alias makes the subpattern visible to the
1040 debugger, while using the dot stops it from affecting the result-hash.
1041 See "Debugging non-grammars" for an example of this usage.
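
As a minimal sketch of the effect of the dot modifier, the following
grammar (whose rule and token names are invented for this example)
calls two punctuation subrules amnesiacally, so that only the aliased
words reach the result-hash:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: <pair>, <word>, and the paren tokens are
# illustrative names only.
my $pair_parser = qr{
    \A <pair> \z

    <rule: pair>
        <.left_paren> <key=word> , <value=word> <.right_paren>

    <token: left_paren>   \(
    <token: right_paren>  \)
    <token: word>         \w+
}xms;

if ('(colour, red)' =~ $pair_parser) {
    my $pair = $/{pair};
    print "$pair->{key} = $pair->{value}\n";

    # The dotted calls matched, but left no entries behind
    print "left_paren was not saved\n" if !exists $pair->{left_paren};
}
```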
1042
1043 Private subrule calls
1044 If a rule name (or an alias) begins with an underscore:
1045
1046 <_RULENAME> <_ALIAS=RULENAME>
1047 <[_RULENAME]> <[_ALIAS=RULENAME]>
1048
1049 then matching proceeds as normal, and any result that is returned is
1050 stored in the current result-hash in the usual way.
1051
1052 However, when any rule finishes (and just before it returns) it first
1053 filters its result-hash, removing any entries whose keys begin with an
1054 underscore. This means that any subrule with an underscored name (or
1055 with an underscored alias) remembers its result, but only until the end
1056 of the current rule. Its results are effectively private to the current
1057 rule.
1058
1059 This is especially useful in conjunction with result distillation.
1060
1061 Lookahead (zero-width) subrules
1062 Non-capturing subrule calls can be used in normal lookaheads:
1063
1064 <rule: qualified_typename>
1065 # A valid typename and has a :: in it...
1066 (?= <.typename> ) [^\s:]+ :: \S+
1067
1068 <rule: identifier>
1069 # An alpha followed by alnums (but not a valid typename)...
1070 (?! <.typename> ) [^\W\d]\w*
1071
1072 but the syntax is a little unwieldy. More importantly, an internal
1073 problem with backtracking causes positive lookaheads to mess up the
1074 module's named capturing mechanism.
1075
1076 So Regexp::Grammars provides two shorthands:
1077
1078 <!typename> same as: (?! <.typename> )
1079 <?typename> same as: (?= <.typename> ) ...but works correctly!
1080
1081 These two constructs can also be called with arguments, if necessary:
1082
1083 <rule: Command>
1084 <Keyword>
1085 (?:
1086 <!Terminator(:Keyword)> <Args=(\S+)>
1087 )?
1088 <Terminator(:Keyword)>
1089
Note that, as the above equivalences imply, neither of these forms of a
subrule call ever captures what it matches.
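
The negative form can be sketched as follows. The grammar is an
assumption made for illustration: it accepts an identifier only if the
text at that point is not one of three hypothetical keywords.

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: the keyword list is invented for this example
my $ident_parser = qr{
    \A <identifier> \z

    <token: identifier>
        <!keyword> [^\W\d]\w*

    <token: keyword>
        (?: if | while | for ) \b
}xms;

for my $word (qw( while whilst format )) {
    my $verdict = $word =~ $ident_parser ? 'identifier' : 'rejected';
    print "$word: $verdict\n";
}
```

Because "<!keyword>" never captures, the "<identifier>" token still
returns a simple string rather than a result-hash.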
1092
1093 Matching separated lists
1094 One of the commonest tasks in text parsing is to match a list of
1095 unspecified length, in which items are separated by a fixed token.
1096 Things like:
1097
1098 1, 2, 3 , 4 ,13, 91 # Numbers separated by commas and spaces
1099
1100 g-c-a-g-t-t-a-c-a # DNA bases separated by dashes
1101
1102 /usr/local/bin # Names separated by directory markers
1103
1104 /usr:/usr/local:bin # Directories separated by colons
1105
1106 The usual construct required to parse these kinds of structures is
1107 either:
1108
1109 <rule: list>
1110
1111 <item> <separator> <list> # recursive definition
1112 | <item> # base case
1113
1114 or, if you want to allow zero-or-more items instead of requiring one-
1115 or-more:
1116
1117 <rule: list_opt>
1118 <list>? # entire list may be missing
1119
1120 <rule: list> # as before...
1121 <item> <separator> <list> # recursive definition
1122 | <item> # base case
1123
1124 Or, more efficiently, but less prettily:
1125
1126 <rule: list>
1127 <[item]> (?: <separator> <[item]> )* # one-or-more
1128
1129 <rule: list_opt>
1130 (?: <[item]> (?: <separator> <[item]> )* )? # zero-or-more
1131
1132 Because separated lists are such a common component of grammars,
1133 Regexp::Grammars provides cleaner ways to specify them:
1134
1135 <rule: list>
1136 <[item]>+ % <separator> # one-or-more
1137
1138 <rule: list_zom>
1139 <[item]>* % <separator> # zero-or-more
1140
Note that these are just regular repetition qualifiers (i.e. "+" and
"*") applied to a subrule ("<[item]>"), with a "%" modifier after them
to specify the required separator between the repeated matches.
1144
1145 The number of repetitions matched is controlled both by the nature of
1146 the qualifier ("+" vs "*") and by the subrule specified after the "%".
1147 The qualified subrule will be repeatedly matched for as long as its
1148 qualifier allows, provided that the second subrule also matches between
1149 those repetitions.
1150
1151 For example, you can match a parenthesized sequence of one-or-more
1152 numbers separated by commas, such as:
1153
1154 (1, 2, 3, 4, 13, 91) # Numbers separated by commas (and spaces)
1155
1156 with:
1157
1158 <rule: number_list>
1159
1160 \( <[number]>+ % <comma> \)
1161
1162 <token: number> \d+
1163 <token: comma> ,
1164
1165 Note that any spaces round the commas will be ignored because
1166 "<number_list>" is specified as a rule and the "+%" specifier has
1167 spaces within and around it. To disallow spaces around the commas, make
1168 sure there are no spaces in or around the "+%":
1169
1170 <rule: number_list_no_spaces>
1171
1172 \( <[number]>+%<comma> \)
1173
1174 (or else specify the rule as a token instead).
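
Putting the pieces together, a minimal sketch of a complete program
using the "<number_list>" rule might look like this (the anchored
start pattern and the variable names are assumptions for the example):

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $list_parser = qr{
    \A <number_list> \z

    <rule: number_list>
        \( <[number]>+ % <comma> \)

    <token: number>  \d+
    <token: comma>   ,
}xms;

if ('(1, 2, 3, 4, 13, 91)' =~ $list_parser) {
    # Every repetition of <[number]> was pushed onto one array
    my @numbers = @{ $/{number_list}{number} };
    print scalar(@numbers), " numbers: @numbers\n";
}
```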
1175
1176 Because the "%" is a modifier applied to a qualifier, you can modify
1177 any other repetition qualifier in the same way. For example:
1178
1179 <[item]>{2,4} % <sep> # two-to-four items, separated
1180
1181 <[item]>{7} % <sep> # exactly 7 items, separated
1182
    <[item]>{10,}? % <sep>    # ten or more items (as few as possible), separated
1184
1185 You can even do this:
1186
1187 <[item]>? % <sep> # one-or-zero items, (theoretically) separated
1188
1189 though the separator specification is, of course, meaningless in that
1190 case as it will never be needed to separate a maximum of one item.
1191
1192 If a "%" appears anywhere else in a grammar (i.e. not immediately after
1193 a repetition qualifier), it is treated normally (i.e. as a self-
1194 matching literal character):
1195
1196 <token: perl_hash>
1197 % <ident> # match "%foo", "%bar", etc.
1198
1199 <token: perl_mod>
1200 <expr> % <expr> # match "$n % 2", "($n+3) % ($n-1)", etc.
1201
1202 If you need to match a literal "%" immediately after a repetition,
1203 either quote it:
1204
1205 <token: percentage>
1206 \d{1,3} \% solution # match "7% solution", etc.
1207
1208 or refactor the "%" character:
1209
1210 <token: percentage>
1211 \d{1,3} <percent_sign> solution # match "7% solution", etc.
1212
1213 <token: percent_sign>
1214 %
1215
1216 Note that it's usually necessary to use the "<[...]>" form for the
1217 repeated items being matched, so that all of them are saved in the
1218 result hash. You can also save all the separators (if they're
1219 important) by specifying them as a list-like subrule too:
1220
1221 \( <[number]>* % <[comma]> \) # save numbers *and* separators
1222
1223 The repeated item must be specified as a subrule call of some kind
1224 (i.e. in angles), but the separators may be specified either as a
1225 subrule or as a raw bracketed pattern. For example:
1226
1227 <[number]>* % ( , | : ) # Numbers separated by commas or colons
1228
1229 <[number]>* % [,:] # Same, but more efficiently matched
1230
1231 The separator should always be specified within matched delimiters of
1232 some kind: either matching "<...>" or matching "(...)" or matching
1233 "[...]". Simple, non-bracketed separators will sometimes also work:
1234
1235 <[number]>+ % ,
1236
1237 but not always:
1238
1239 <[number]>+ % ,\s+ # Oops! Separator is just: ,
1240
1241 This is because of the limited way in which the module internally
1242 parses ordinary regex components (i.e. without full understanding of
1243 their implicit precedence). As a consequence, consistently placing
1244 brackets around any separator is a much safer approach:
1245
1246 <[number]>+ % (,\s+)
1247
1248 You can also use a simple pattern on the left of the "%" as the item
1249 matcher, but in this case it must always be aliased into a list-
1250 collecting subrule, like so:
1251
1252 <[item=(\d+)]>* % [,]
1253
1254 Note that, for backwards compatibility with earlier versions of
1255 Regexp::Grammars, the "+%" operator can also be written: "**".
1256 However, there can be no space between the two asterisks of this
1257 variant. That is:
1258
    <[item]> ** <sep>      # same as <[item]>+ % <sep>
1260
1261 <[item]>* * <sep> # error (two * qualifiers in a row)
1262
1263 Matching separated lists with a trailing separator
1264
1265 Some languages allow a separated list to include an extra trailing
1266 separator. For example:
1267
1268 ~/bin/perl5/ # Trailing /-separator in filepath
1269 (1,2,3,) # Trailing ,-separator in Perl list
1270
1271 To match such constructs using the "%" operator, you would need to add
1272 something to explicitly match the optional trailing separator:
1273
1274 <dir>+ % [/] [/]? # Slash-separated dirs, then optional final slash
1275
1276 <elem>+ % [,] [,]? # Comma-separated elems, then optional final comma
1277
1278 which is tedious.
1279
So the module also supports a second kind of "separated list" operator,
that allows an optional trailing separator as well: the "%%" operator.
This operator behaves exactly like the "%" operator, except that it
also matches a final trailing separator, if one is present.
1284
1285 So the previous examples could be (better) written as:
1286
1287 <dir>+ %% [/] # Slash-separated dirs, with optional final slash
1288
1289 <elem>+ %% [,] # Comma-separated elems, with optional final comma
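
As a rough sketch of the difference, the following grammar (with
invented rule names, and assuming a version of the module that supports
"%%") accepts a parenthesized list with or without the final comma:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: <list>, <elem>, and <comma> are example names
my $elems_parser = qr{
    \A <list> \z

    <token: list>
        \( <[elem]>+ %% <comma> \)

    <token: elem>   \d+
    <token: comma>  ,
}xms;

for my $text ('(1,2,3)', '(1,2,3,)') {
    if ($text =~ $elems_parser) {
        my $count = scalar @{ $/{list}{elem} };
        print "$text contains $count elems\n";
    }
}
```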
1290
1291 Matching hash keys
1292 In some situations a grammar may need a rule that matches dozens,
1293 hundreds, or even thousands of one-word alternatives. For example, when
1294 matching command names, or valid userids, or English words. In such
1295 cases it is often impractical (and always inefficient) to list all the
1296 alternatives between "|" alternators:
1297
1298 <rule: shell_cmd>
1299 a2p | ac | apply | ar | automake | awk | ...
1300 # ...and 400 lines later
1301 ... | zdiff | zgrep | zip | zmore | zsh
1302
1303 <rule: valid_word>
1304 a | aa | aal | aalii | aam | aardvark | aardwolf | aba | ...
1305 # ...and 40,000 lines later...
1306 ... | zymotize | zymotoxic | zymurgy | zythem | zythum
1307
1308 To simplify such cases, Regexp::Grammars provides a special construct
1309 that allows you to specify all the alternatives as the keys of a normal
1310 hash. The syntax for that construct is simply to put the hash name
1311 inside angle brackets (with no space between the angles and the hash
1312 name).
1313
1314 Which means that the rules in the previous example could also be
1315 written:
1316
1317 <rule: shell_cmd>
1318 <%cmds>
1319
1320 <rule: valid_word>
1321 <%dict>
1322
1323 provided that the two hashes (%cmds and %dict) are visible in the scope
1324 where the grammar is created.
1325
1326 Matching a hash key in this way is typically significantly faster than
1327 matching a large set of alternations. Specifically, it is O(length of
1328 longest potential key) ^ 2, instead of O(number of keys).
1329
1330 Internally, the construct is converted to something equivalent to:
1331
1332 <rule: shell_cmd>
1333 (<.hk>) <require: (?{ exists $cmds{$CAPTURE} })>
1334
1335 <rule: valid_word>
1336 (<.hk>) <require: (?{ exists $dict{$CAPTURE} })>
1337
1338 The special "<hk>" rule is created automatically, and defaults to
1339 "\S+", but you can also define it explicitly to handle other kinds of
1340 keys. For example:
1341
1342 <rule: hk>
1343 [^\n]+ # Key may be any number of chars on a single line
1344
1345 <rule: hk>
1346 [ACGT]{10,} # Key is a base sequence of at least 10 pairs
1347
1348 Alternatively, you can specify a different key-matching pattern for
1349 each hash you're matching, by placing the required pattern in braces
1350 immediately after the hash name. For example:
1351
1352 <rule: client_name>
1353 # Valid keys match <.hk> (default or explicitly specified)
1354 <%clients>
1355
1356 <rule: shell_cmd>
1357 # Valid keys contain only word chars, hyphen, slash, or dot...
        <%cmds { [\w/.-]+ }>
1359
1360 <rule: valid_word>
1361 # Valid keys contain only alphas or internal hyphen or apostrophe...
1362 <%dict{ (?i: (?:[a-z]+[-'])* [a-z]+ ) }>
1363
1364 <rule: DNA_sequence>
1365 # Valid keys are base sequences of at least 10 pairs...
1366 <%sequences{[ACGT]{10,}}>
1367
1368 This second approach to key-matching is preferred, because it localizes
1369 any non-standard key-matching behaviour to each individual hash.
1370
1371 Note that changes in the compilation process from Perl 5.18 onwards
1372 mean that in some cases the "<%hash>" construct only works reliably if
1373 the hash itself is declared at the outermost lexical scope (i.e. file
1374 scope).
1375
1376 Specifically, if the regex grammar does not include any interpolated
1377 scalars or arrays and the hash was declared within a subroutine (even
1378 within the same subroutine as the regex grammar that uses it), the
1379 regex will not be able to "see" the hash variable at compile-time. This
1380 will produce a "Global symbol "%hash" requires explicit package name"
1381 compile-time error. For example:
1382
1383 sub build_keyword_parser {
1384 # Hash declared inside subroutine...
1385 my %keywords = (foo => 1, bar => 1);
1386
1387 # ...then used in <%hash> construct within uninterpolated regex...
1388 return qr{
1389 ^<keyword>$
1390 <rule: keyword> <%keywords>
1391 }x;
1392
1393 # ...produces compile-time error
1394 }
1395
1396 The solution is to place the hash outside the subroutine containing the
1397 grammar:
1398
1399 # Hash declared OUTSIDE subroutine...
1400 my %keywords = (foo => 1, bar => 1);
1401
1402 sub build_keyword_parser {
1403 return qr{
1404 ^<keyword>$
1405 <rule: keyword> <%keywords>
1406 }x;
1407 }
1408
1409 ...or else to explicitly interpolate at least one scalar (even just a
1410 scalar containing an empty string):
1411
1412 sub build_keyword_parser {
1413 my %keywords = (foo => 1, bar => 1);
1414 my $DEFER_REGEX_COMPILATION = "";
1415
1416 return qr{
1417 ^<keyword>$
1418 <rule: keyword> <%keywords>
1419
1420 $DEFER_REGEX_COMPILATION
1421 }x;
1422 }
1423
1424 Rematching subrule results
1425 Sometimes it is useful to be able to rematch a string that has
1426 previously been matched by some earlier subrule. For example, consider
1427 a rule to match shell-like control blocks:
1428
1429 <rule: control_block>
1430 for <expr> <[command]>+ endfor
1431 | while <expr> <[command]>+ endwhile
1432 | if <expr> <[command]>+ endif
1433 | with <expr> <[command]>+ endwith
1434
1435 This would be much tidier if we could factor out the command names
1436 (which are the only differences between the four alternatives). The
1437 problem is that the obvious solution:
1438
1439 <rule: control_block>
1440 <keyword> <expr>
1441 <[command]>+
1442 end<keyword>
1443
1444 doesn't work, because it would also match an incorrect input like:
1445
1446 for 1..10
1447 echo $n
1448 ls subdir/$n
1449 endif
1450
1451 We need some way to ensure that the "<keyword>" matched immediately
1452 after "end" is the same "<keyword>" that was initially matched.
1453
1454 That's not difficult, because the first "<keyword>" will have captured
1455 what it matched into $MATCH{keyword}, so we could just write:
1456
1457 <rule: control_block>
1458 <keyword> <expr>
1459 <[command]>+
1460 end(??{quotemeta $MATCH{keyword}})
1461
1462 This is such a useful technique, yet so ugly, scary, and prone to
1463 error, that Regexp::Grammars provides a cleaner equivalent:
1464
1465 <rule: control_block>
1466 <keyword> <expr>
1467 <[command]>+
1468 end<\_keyword>
1469
1470 A directive of the form "<\_IDENTIFIER>" is known as a "matchref" (an
1471 abbreviation of "%MATCH-supplied backreference"). Matchrefs always
1472 attempt to match, as a literal, the current value of
1473 $MATCH{IDENTIFIER}.
1474
1475 By default, a matchref does not capture what it matches, but you can
1476 have it do so by giving it an alias:
1477
1478 <token: delimited_string>
1479 <ldelim=str_delim> .*? <rdelim=\_ldelim>
1480
1481 <token: str_delim> ["'`]
1482
1483 At first glance this doesn't seem very useful as, by definition,
1484 $MATCH{ldelim} and $MATCH{rdelim} must necessarily always end up with
1485 identical values. However, it can be useful if the rule also has other
1486 alternatives and you want to create a consistent internal
1487 representation for those alternatives, like so:
1488
    <token: delimited_string>
          <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
        | <ldelim=( \[ )>     .*?  <rdelim=( \] )>
        | <ldelim=( \{ )>     .*?  <rdelim=( \} )>
        | <ldelim=( \( )>     .*?  <rdelim=( \) )>
        | <ldelim=( \< )>     .*?  <rdelim=( \> )>
1495
1496 You can also force a matchref to save repeated matches as a nested
1497 array, in the usual way:
1498
1499 <token: marked_text>
1500 <marker> <text> <[endmarkers=\_marker]>+
1501
1502 Be careful though, as the following will not do as you may expect:
1503
1504 <[marker]>+ <text> <[endmarkers=\_marker]>+
1505
because the value of $MATCH{marker} will be an array reference, which
the matchref will flatten and concatenate before matching the resulting
string as a literal. As a result, the previous example will match
endmarkers that are exact multiples of the complete start marker,
rather than endmarkers that consist of any number of repetitions of the
individual start marker delimiter. So it will match:
1512
1513 ""text here""
1514 ""text here""""
1515 ""text here""""""
1516
1517 but not:
1518
1519 ""text here"""
1520 ""text here"""""
1521
1522 Uneven start and end markers such as these are extremely unusual, so
1523 this problem rarely arises in practice.
1524
1525 Note: Prior to Regexp::Grammars version 1.020, the syntax for matchrefs
1526 was "<\IDENTIFIER>" instead of "<\_IDENTIFIER>". This created problems
1527 when the identifier started with any of "l", "u", "L", "U", "Q", or
1528 "E", so the syntax has had to be altered in a backwards incompatible
1529 way. It will not be altered again.
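
As a minimal sketch of an aliased matchref at work, the following
program uses the "<delimited_string>" token from this section, with the
content captured under an assumed "text" alias so that it can be
reported:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $string_parser = qr{
    \A <delimited_string> \z

    <token: delimited_string>
        <ldelim=str_delim> <text=(.*?)> <rdelim=\_ldelim>

    <token: str_delim>
        ["'`]
}xms;

for my $input (q{"it's"}, q{'mismatched"}) {
    if ($input =~ $string_parser) {
        my $text = $/{delimited_string}{text};
        print "matched: $text\n";
    }
    else {
        # The closing delimiter differs from the opening one
        print "rejected: $input\n";
    }
}
```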
1530
1531 Rematching balanced delimiters
1532 Consider the example in the previous section:
1533
    <token: delimited_string>
          <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
        | <ldelim=( \[ )>     .*?  <rdelim=( \] )>
        | <ldelim=( \{ )>     .*?  <rdelim=( \} )>
        | <ldelim=( \( )>     .*?  <rdelim=( \) )>
        | <ldelim=( \< )>     .*?  <rdelim=( \> )>
1540
The repeated pattern of the last four alternatives is galling, but we
can't just refactor those delimiters as well:
1543
1544 <token: delimited_string>
1545 <ldelim=str_delim> .*? <rdelim=\_ldelim>
1546 | <ldelim=bracket> .*? <rdelim=\_ldelim>
1547
1548 because that would incorrectly match:
1549
1550 { delimited content here {
1551
1552 while failing to match:
1553
1554 { delimited content here }
1555
1556 To refactor balanced delimiters like those, we need a second kind of
1557 matchref; one that's a little smarter.
1558
1559 Or, preferably, a lot smarter...because there are many other kinds of
1560 balanced delimiters, apart from single brackets. For example:
1561
1562 {{{ delimited content here }}}
1563 /* delimited content here */
1564 (* delimited content here *)
1565 `` delimited content here ''
1566 if delimited content here fi
1567
1568 The common characteristic of these delimiter pairs is that the closing
1569 delimiter is the inverse of the opening delimiter: the sequence of
1570 characters is reversed and certain characters (mainly brackets, but
1571 also single-quotes/backticks) are mirror-reflected.
1572
1573 Regexp::Grammars supports the parsing of such delimiters with a
1574 construct known as an invertref, which is specified using the
1575 "</IDENT>" directive. An invertref acts very like a matchref, except
1576 that it does not convert to:
1577
    (??{ quotemeta( $MATCH{IDENT} ) })

but rather to:

    (??{ quotemeta( inverse( $MATCH{IDENT} ) ) })
1583
1584 With this directive available, the balanced delimiters of the previous
1585 example can be refactored to:
1586
    <token: delimited_string>
          <ldelim=str_delim>      .*?  <rdelim=\_ldelim>
        | <ldelim=( [[{(<] )>     .*?  <rdelim=/ldelim>
1590
1591 Like matchrefs, invertrefs come in the usual range of flavours:
1592
    </ident>            # Match the inverse of $MATCH{ident}
    <ALIAS=/ident>      # Match inverse and capture to $MATCH{ALIAS}
    <[ALIAS=/ident]>    # Match inverse and push on @{$MATCH{ALIAS}}
1596
The character pairs that are reversed during mirroring are: "{" and
"}", "[" and "]", "(" and ")", "<" and ">", "«" and "»", "`" and "'".
1599
The following mnemonics may be useful in distinguishing invertrefs
from matchrefs: a matchref starts with a "\" (just like the standard
Perl regex backrefs "\1" and "\g{-2}" and "\k<name>"), whereas an
invertref starts with a "/" (like an HTML or XML closing tag). Or just
remember that "<\_IDENT>" is "match the same again", and if you want
"the same again, only mirrored" instead, just mirror the "\" to get
"</IDENT>".
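
A minimal sketch of an invertref in use is shown below. The grammar is
hypothetical: the "<opener>" token and the "text" alias are invented
for the example, and the opening delimiter is matched by a subrule
rather than by an inline subpattern.

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: an opener is a run of left braces, or the
# start of a (* ... *) or /* ... */ style comment.
my $block_parser = qr{
    \A <delimited> \z

    <token: delimited>
        <ldelim=opener> <text=(.*?)> <rdelim=/ldelim>

    <token: opener>
        \{+ | \(\* | /\*
}xms;

for my $input ('{{ nested }}', '/* comment */', '(* remark (*') {
    if ($input =~ $block_parser) {
        my $closer = $/{delimited}{rdelim};
        print "$input closes with $closer\n";
    }
    else {
        print "$input is not balanced\n";
    }
}
```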
1606
1607 Rematching parametric results and delimiters
1608 The "<\_IDENTIFIER>" and "</IDENTIFIER>" mechanisms normally locate the
1609 literal to be matched by looking in $MATCH{IDENTIFIER}.
1610
1611 However, you can cause them to look in $ARG{IDENTIFIER} instead, by
1612 prefixing the identifier with a single ":". This is especially useful
1613 when refactoring subrules. For example, instead of:
1614
1615 <rule: Command>
1616 <Keyword> <CommandBody> end_ <\_Keyword>
1617
1618 <rule: Placeholder>
1619 <Keyword> \.\.\. end_ <\_Keyword>
1620
you could factor the terminator out into a parameterized Terminator rule, like so:
1622
1623 <rule: Command>
1624 <Keyword> <CommandBody> <Terminator(:Keyword)>
1625
1626 <rule: Placeholder>
1627 <Keyword> \.\.\. <Terminator(:Keyword)>
1628
1629 <token: Terminator>
1630 end_ <\:Keyword>
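
The following sketch shows the parameterized "<Terminator>" token in a
complete, if tiny, grammar. The two keywords and the "Body" subpattern
are assumptions chosen to keep the example short.

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical grammar: the keywords and the one-word body are
# invented for this example.
my $command_parser = qr{
    \A <Command> \z

    <rule: Command>
        <Keyword> <Body=(\w+)> <Terminator(:Keyword)>

    <token: Keyword>
        loop | block

    <token: Terminator>
        end_ <\:Keyword>
}xms;

for my $input ('loop work end_loop', 'loop work end_block') {
    my $verdict = $input =~ $command_parser ? 'accepted' : 'rejected';
    print "$input: $verdict\n";
}
```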
1631
1632 Tracking and reporting match positions
1633 Regexp::Grammars automatically predefines a special token that makes it
1634 easy to track exactly where in its input a particular subrule matches.
1635 That token is: "<matchpos>".
1636
1637 The "<matchpos>" token implements a zero-width match that never fails.
1638 It always returns the current index within the string that the grammar
1639 is matching.
1640
1641 So, for example you could have your "<delimited_text>" subrule detect
1642 and report unterminated text like so:
1643
1644 <token: delimited_text>
1645 qq? <delim> <text=(.*?)> </delim>
1646 |
1647 <matchpos> qq? <delim>
1648 <error: (?{"Unterminated string starting at index $MATCH{matchpos}"})>
1649
1650 Matching "<matchpos>" in the second alternative causes $MATCH{matchpos}
1651 to contain the position in the string at which the "<matchpos>" subrule
1652 was matched (in this example: the start of the unterminated text).
1653
1654 If you want the line number instead of the string index, use the
1655 predefined "<matchline>" subrule instead:
1656
1657 <token: delimited_text>
1658 qq? <delim> <text=(.*?)> </delim>
1659 | <matchline> qq? <delim>
1660 <error: (?{"Unterminated string starting at line $MATCH{matchline}"})>
1661
1662 Note that the line numbers returned by "<matchline>" start at 1 (not at
1663 zero, as with "<matchpos>").
1664
1665 The "<matchpos>" and "<matchline>" subrules are just like any other
1666 subrules; you can alias them ("<started_at=matchpos>") or match them
1667 repeatedly ( "(?: <[matchline]> <[item]> )++"), etc.
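
As a minimal sketch of aliasing both subrules, the following
hypothetical grammar records the index and line number at which each
word of its input starts (the "at", "line", and "text" aliases are
invented for the example):

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $word_parser = qr{
    \A <[word]>+ \s* \z

    <token: word>
        \s* <at=matchpos> <line=matchline> <text=(\w+)>
}xms;

if ("alpha beta\ngamma" =~ $word_parser) {
    for my $word (@{ $/{word} }) {
        print "$word->{text}: index $word->{at}, line $word->{line}\n";
    }
}
```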
1668
Autoactions
1670 The module also supports event-based parsing. You can specify a grammar
1671 in the usual way and then, for a particular parse, layer a collection
1672 of call-backs (known as "autoactions") over the grammar to handle the
1673 data as it is parsed.
1674
1675 Normally, a grammar rule returns the result hash it has accumulated (or
1676 whatever else was aliased to "MATCH=" within the rule). However, you
1677 can specify an autoaction object before the grammar is matched.
1678
1679 Once the autoaction object is specified, every time a rule succeeds
1680 during the parse, its result is passed to the object via one of its
1681 methods; specifically it is passed to the method whose name is the same
1682 as the rule's.
1683
1684 For example, suppose you had a grammar that recognizes simple algebraic
1685 expressions:
1686
    my $expr_parser = do{
        use Regexp::Grammars;
        qr{
            <Answer>

            <rule: Answer>  <[Operand=Mult]>+ % <[Op=(\+|\-)]>

            <rule: Mult>    <[Operand=Pow]>+  % <[Op=(\*|/|%)]>

            <rule: Pow>     <[Operand=Term]>+ % <Op=(\^)>

            <rule: Term>       <MATCH=Literal>
                          | \( <MATCH=Answer> \)

            <token: Literal>  <MATCH=( [+-]? \d++ (?: \. \d++ )?+ )>
        }xms
    };
1704
1705 You could convert this grammar to a calculator, by installing a set of
1706 autoactions that convert each rule's result hash to the corresponding
1707 value of the sub-expression that the rule just parsed. To do that, you
1708 would create a class with methods whose names match the rules whose
1709 results you want to change. For example:
1710
1711 package Calculator;
1712 use List::Util qw< reduce >;
1713
1714 sub new {
1715 my ($class) = @_;
1716
1717 return bless {}, $class
1718 }
1719
1720 sub Answer {
1721 my ($self, $result_hash) = @_;
1722
1723 my $sum = shift @{$result_hash->{Operand}};
1724
1725 for my $term (@{$result_hash->{Operand}}) {
1726 my $op = shift @{$result_hash->{Op}};
1727 if ($op eq '+') { $sum += $term; }
1728 else { $sum -= $term; }
1729 }
1730
1731 return $sum;
1732 }
1733
1734 sub Mult {
1735 my ($self, $result_hash) = @_;
1736
1737 return reduce { eval($a . shift(@{$result_hash->{Op}}) . $b) }
1738 @{$result_hash->{Operand}};
1739 }
1740
1741 sub Pow {
1742 my ($self, $result_hash) = @_;
1743
1744 return reduce { $b ** $a } reverse @{$result_hash->{Operand}};
1745 }
1746
1747 Objects of this class (and indeed the class itself) now have methods
1748 corresponding to some of the rules in the expression grammar. To apply
1749 those methods to the results of the rules (as they parse) you simply
1750 install an object as the "autoaction" handler, immediately before you
1751 initiate the parse:
1752
    if ($text =~ $expr_parser->with_actions(Calculator->new)) {
        say $/{Answer};    # Now prints the result of the expression
    }
1756
1757 The "with_actions()" method expects to be passed an object or
1758 classname. This object or class will be installed as the autoaction
1759 handler for the next match against any grammar. After that match, the
1760 handler will be uninstalled. "with_actions()" returns the grammar it's
1761 called on, making it easy to call it as part of a match (which is the
1762 recommended idiom).
1763
1764 With a "Calculator" object set as the autoaction handler, whenever the
1765 "Answer", "Mult", or "Pow" rule of the grammar matches, the
1766 corresponding "Answer", "Mult", or "Pow" method of the "Calculator"
1767 object will be called (with the rule's result value passed as its only
1768 argument), and the result of the method will be used as the result of
1769 the rule.
1770
1771 Note that nothing new happens when a "Term" or "Literal" rule matches,
1772 because the "Calculator" object doesn't have methods with those names.
1773
The overall effect, then, is to allow you to specify a grammar without
rule-specific behaviours and then, later, specify a set of final
actions (as methods) for some or all of the rules of the grammar.
1777
1778 Note that, if a particular callback method returns "undef", the result
1779 of the corresponding rule will be passed through without modification.
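The dispatch rule described above can be modelled in plain Perl. This
is only a sketch of the documented behaviour, not the module's own
implementation: "apply_autoaction()" and the "Demo::Calculator" class
are hypothetical names introduced for illustration.

```perl
use strict;
use warnings;

{
    package Demo::Calculator;
    sub new    { return bless {}, shift }
    # Replaces the rule's result with a computed value.
    sub Answer { my ($self, $result) = @_; return $result->{X} + $result->{Y} }
    # Returns undef, so the rule's result is passed through unchanged.
    sub Pow    { return undef }
}

# Hypothetical helper: apply a handler (object or classname) to the
# result of the rule named $rule_name.
sub apply_autoaction {
    my ($handler, $rule_name, $result) = @_;

    # No method of that name: nothing new happens.
    my $method = $handler->can($rule_name)
        or return $result;

    # A defined return value replaces the result; undef leaves it alone.
    my $replacement = $handler->$method($result);
    return defined $replacement ? $replacement : $result;
}

my $handler = Demo::Calculator->new;
my @demo = (
    apply_autoaction($handler, 'Answer', { X => 2, Y => 3 }),
    apply_autoaction($handler, 'Term',   'unchanged'),
    apply_autoaction($handler, 'Pow',    'passed through'),
);
print "$_\n" for @demo;
```

Because the lookup uses "can()", the same helper works whether the
handler is an object or a classname, matching what "with_actions()" is
documented to accept.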
1780
All the grammars shown so far are confined to a single regex. However,
Regexp::Grammars also provides a mechanism that allows you to define
named grammars, which can then be imported into other regexes. This
gives you a way of modularizing common grammatical components.
1786
1787 Defining a named grammar
1788 You can create a named grammar using the "<grammar:...>" directive.
1789 This directive must appear before the first rule definition in the
1790 grammar, and instead of any start-rule. For example:
1791
1792 qr{
1793 <grammar: List::Generic>
1794
1795 <rule: List>
1796 <[MATCH=Item]>+ % <Separator>
1797
1798 <rule: Item>
1799 \S++
1800
1801 <token: Separator>
1802 \s* , \s*
1803 }x;
1804
1805 This creates a grammar named "List::Generic", and installs it in the
1806 module's internal caches, for future reference.
1807
1808 Note that there is no need (or reason) to assign the resulting regex to
1809 a variable, as the named grammar cannot itself be matched against.
1810
1811 Using a named grammar
1812 To make use of a named grammar, you need to incorporate it into another
1813 grammar, by inheritance. To do that, use the "<extends:...>" directive,
1814 like so:
1815
1816 my $parser = qr{
1817 <extends: List::Generic>
1818
1819 <List>
1820 }x;
1821
1822 The "<extends:...>" directive incorporates the rules defined in the
1823 specified grammar into the current regex. You can then call any of
1824 those rules in the start-pattern.
1825
1826 Overriding an inherited rule or token
1827 Subrule dispatch within a grammar is always polymorphic. That is, when
1828 a subrule is called, the most-derived rule of the same name within the
1829 grammar's hierarchy is invoked.
1830
So, to replace a particular rule within a grammar, you simply need to
inherit that grammar and specify new, more-specific versions of any
rules you want to change. For example:
1834
1835 my $list_of_integers = qr{
1836 <List>
1837
1838 # Inherit rules from base grammar...
1839 <extends: List::Generic>
1840
1841 # Replace Item rule from List::Generic...
1842 <rule: Item>
1843 [+-]? \d++
1844 }x;
1845
1846 You can also use "<extends:...>" in other named grammars, to create
1847 hierarchies:
1848
1849 qr{
1850 <grammar: List::Integral>
1851 <extends: List::Generic>
1852
1853 <token: Item>
1854 [+-]? <MATCH=(<.Digit>+)>
1855
1856 <token: Digit>
1857 \d
1858 }x;
1859
1860 qr{
1861 <grammar: List::ColonSeparated>
1862 <extends: List::Generic>
1863
1864 <token: Separator>
1865 \s* : \s*
1866 }x;
1867
1868 qr{
1869 <grammar: List::Integral::ColonSeparated>
1870 <extends: List::Integral>
1871 <extends: List::ColonSeparated>
1872 }x;
1873
1874 As shown in the previous example, Regexp::Grammars allows you to
1875 multiply inherit two (or more) base grammars. For example, the
1876 "List::Integral::ColonSeparated" grammar takes the definitions of
1877 "List" and "Item" from the "List::Integral" grammar, and the definition
1878 of "Separator" from "List::ColonSeparated".
1879
1880 Note that grammars dispatch subrule calls using C3 method lookup,
1881 rather than Perl's older DFS lookup. That's why
1882 "List::Integral::ColonSeparated" correctly gets the more-specific
1883 "Separator" rule defined in "List::ColonSeparated", rather than the
1884 more-generic version defined in "List::Generic" (via "List::Integral").
1885 See "perldoc mro" for more discussion of the C3 dispatch algorithm.
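The difference between the two lookup orders can be seen with ordinary
Perl classes and the core "mro" module. The sketch below mirrors the
shape of the grammar hierarchy using hypothetical "Demo::" packages; it
illustrates C3 versus DFS linearization in general and does not touch
Regexp::Grammars itself.

```perl
use strict;
use warnings;
use mro;

# A diamond with the same shape as the List::* grammars above.
{ package Demo::Generic;        sub Separator { return ',' } sub Item { return 'anything' } }
{ package Demo::Integral;       our @ISA = ('Demo::Generic'); sub Item { return 'integer' } }
{ package Demo::ColonSeparated; our @ISA = ('Demo::Generic'); sub Separator { return ':' } }
{ package Demo::Both;           our @ISA = ('Demo::Integral', 'Demo::ColonSeparated') }

# Hypothetical helper: the first class in a linearization that defines
# $method itself (inherited definitions are ignored).
sub first_definer {
    my ($method, @classes) = @_;
    no strict 'refs';
    for my $class (@classes) {
        return $class if defined &{"${class}::${method}"};
    }
    return;
}

my @dfs = @{ mro::get_linear_isa('Demo::Both', 'dfs') };
my @c3  = @{ mro::get_linear_isa('Demo::Both', 'c3') };

print "DFS order: @dfs\n";
print "C3 order:  @c3\n";
print "DFS finds Separator in: ", first_definer('Separator', @dfs), "\n";
print "C3 finds Separator in:  ", first_definer('Separator', @c3),  "\n";
```

Under DFS the generic base class is reached through the first parent
before the second parent is considered, so the generic "Separator"
wins; under C3 the shared base class is visited last.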
1886
1887 Augmenting an inherited rule or token
1888 Instead of replacing an inherited rule, you can augment it.
1889
For example, if you need a grammar for lists of hexadecimal numbers,
you could inherit the behaviour of "List::Integral" and add the hex
digits to its "Digit" token:
1893
1894 my $list_of_hexadecimal = qr{
1895 <List>
1896
1897 <extends: List::Integral>
1898
1899 <token: Digit>
1900 <List::Integral::Digit>
1901 | [A-Fa-f]
1902 }x;
1903
1904 If you call a subrule using a fully qualified name (such as
1905 "<List::Integral::Digit>"), the grammar calls that version of the rule,
1906 rather than the most-derived version.
1907
1908 Debugging named grammars
1909 Named grammars are independent of each other, even when inherited. This
1910 means that, if debugging is enabled in a derived grammar, it will not
1911 be active in any rules inherited from a base grammar, unless the base
1912 grammar also included a "<debug:...>" directive.
1913
1914 This is a deliberate design decision, as activating the debugger adds a
1915 significant amount of code to each grammar's implementation, which is
1916 detrimental to the matching performance of the resulting regexes.
1917
1918 If you need to debug a named grammar, the best approach is to include a
1919 "<debug: same>" directive at the start of the grammar. The presence of
1920 this directive will ensure the necessary extra debugging code is
1921 included in the regex implementing the grammar, while setting "same"
1922 mode will ensure that the debugging mode isn't altered when the matcher
1923 uses the inherited rules.
1924
1926 Result distillation
1927 Normally, calls to subrules produce nested result-hashes within the
1928 current result-hash. Those nested hashes always have at least one
1929 automatically supplied key (""), whose value is the entire substring
1930 that the subrule matched.
1931
1932 If there are no other nested captures within the subrule, there will be
1933 no other keys in the result-hash. This would be annoying as a typical
1934 nested grammar would then produce results consisting of hashes of
1935 hashes, with each nested hash having only a single key (""). This in
1936 turn would make postprocessing the result-hash (in "%/") far more
1937 complicated than it needs to be.
1938
1939 To avoid this behaviour, if a subrule's result-hash doesn't contain any
1940 keys except "", the module "flattens" the result-hash, by replacing it
1941 with the value of its single key.
1942
1943 So, for example, the grammar:
1944
1945 mv \s* <from> \s* <to>
1946
1947 <rule: from> [\w/.-]+
1948 <rule: to> [\w/.-]+
1949
1950 doesn't return a result-hash like this:
1951
1952 {
1953 "" => 'mv /usr/local/lib/libhuh.dylib /dev/null/badlib',
1954 'from' => { "" => '/usr/local/lib/libhuh.dylib' },
1955 'to' => { "" => '/dev/null/badlib' },
1956 }
1957
1958 Instead, it returns:
1959
1960 {
1961 "" => 'mv /usr/local/lib/libhuh.dylib /dev/null/badlib',
1962 'from' => '/usr/local/lib/libhuh.dylib',
1963 'to' => '/dev/null/badlib',
1964 }
1965
1966 That is, because the 'from' and 'to' subhashes each have only a single
1967 entry, they are each "flattened" to the value of that entry.
1968
1969 This flattening also occurs if a result-hash contains only "private"
1970 keys (i.e. keys starting with underscores). For example:
1971
1972 mv \s* <from> \s* <to>
1973
1974 <rule: from> <_dir=path>? <_file=filename>
1975 <rule: to> <_dir=path>? <_file=filename>
1976
1977 <token: path> [\w/.-]*/
1978 <token: filename> [\w.-]+
1979
1980 Here, the "from" rule produces a result like this:
1981
1982 from => {
1983 "" => '/usr/local/bin/perl',
1984 _dir => '/usr/local/bin/',
1985 _file => 'perl',
1986 }
1987
1988 which is automatically stripped of "private" keys, leaving:
1989
1990 from => {
1991 "" => '/usr/local/bin/perl',
1992 }
1993
1994 which is then automatically flattened to:
1995
1996 from => '/usr/local/bin/perl'
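The two distillation steps shown above (strip private keys, then
flatten a hash that has only the "" key left) can be modelled in a few
lines of plain Perl. This is a sketch of the documented behaviour only;
"distill()" is a hypothetical helper, not a function provided by the
module.

```perl
use strict;
use warnings;

# Hypothetical helper modelling the distillation rule described above.
sub distill {
    my ($result) = @_;
    return $result unless ref($result) eq 'HASH';

    # Drop "private" keys (those starting with an underscore)...
    my %public = map  { ($_ => $result->{$_}) }
                 grep { !/^_/ }
                 keys %{$result};

    # ...then flatten if only the context-string entry ("") is left.
    return $public{''} if keys(%public) == 1 && exists $public{''};
    return \%public;
}

my $from = distill({
    ''    => '/usr/local/bin/perl',
    _dir  => '/usr/local/bin/',
    _file => 'perl',
});
my $cmd = distill({ '' => 'mv a b', from => 'a', to => 'b' });

print "from: $from\n";
print "cmd is still a ", ref($cmd), " with ", scalar(keys %{$cmd}), " keys\n";
```

A result with any public key besides "" stays a hash, which is why the
top-level "mv" result keeps its 'from' and 'to' entries.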
1997
1998 List result distillation
1999
2000 A special case of result distillation occurs in a separated list, such
2001 as:
2002
2003 <rule: List>
2004
2005 <[Item]>+ % <[Sep=(,)]>
2006
2007 If this construct matches just a single item, the result hash will
2008 contain a single entry consisting of a nested array with a single
2009 value, like so:
2010
2011 { Item => [ 'data' ] }
2012
2013 Instead of returning this annoyingly nested data structure, you can
2014 tell Regexp::Grammars to flatten it to just the inner data with a
2015 special directive:
2016
2017 <rule: List>
2018
2019 <[Item]>+ % <[Sep=(,)]>
2020
2021 <minimize:>
2022
2023 The "<minimize:>" directive examines the result hash (i.e. %MATCH). If
2024 that hash contains only a single entry, which is a reference to an
2025 array with a single value, then the directive assigns that single value
2026 directly to $MATCH, so that it will be returned instead of the usual
2027 result hash.
2028
2029 This means that a normal separated list still results in a hash
2030 containing all elements and separators, but a "degenerate" list of only
2031 one item results in just that single item.
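As a rough model of that check, the sketch below applies the same test
to an ordinary hash reference. "minimize()" is a hypothetical helper,
and skipping the "" context entry when counting keys is an assumption
made here, since the description above does not say how that entry is
treated.

```perl
use strict;
use warnings;

# Hypothetical helper modelling the check that <minimize:> is described
# as performing on %MATCH.
sub minimize {
    my ($match) = @_;

    # Assumption: the "" (matched text) entry is not counted.
    my @keys = grep { $_ ne '' } keys %{$match};

    if (@keys == 1) {
        my $only = $match->{ $keys[0] };
        # A single entry holding a one-element array: return the element.
        return $only->[0] if ref($only) eq 'ARRAY' && @{$only} == 1;
    }

    # Anything else: return the usual result hash.
    return $match;
}

my $degenerate = minimize({ Item => ['data'] });
my $normal     = minimize({ Item => ['a', 'b'], Sep => [','] });

print "degenerate list: $degenerate\n";
print "normal list has keys: ", join(', ', sort keys %{$normal}), "\n";
```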
2032
2033 Manual result distillation
2034
2035 Regexp::Grammars also offers full manual control over the distillation
2036 process. If you use the reserved word "MATCH" as the alias for a
2037 subrule call:
2038
2039 <MATCH=filename>
2040
2041 or a subpattern match:
2042
2043 <MATCH=( \w+ )>
2044
2045 or a code block:
2046
2047 <MATCH=(?{ 42 })>
2048
2049 then the current rule will treat the return value of that subrule,
2050 pattern, or code block as its complete result, and return that value
2051 instead of the usual result-hash it constructs. This is the case even
2052 if the result has other entries that would normally also be returned.
2053
2054 For example, consider a rule like:
2055
2056 <rule: term>
2057 <MATCH=literal>
2058 | <left_paren> <MATCH=expr> <right_paren>
2059
2060 The use of "MATCH" aliases causes the rule to return either whatever
2061 "<literal>" returns, or whatever "<expr>" returns (provided it's
2062 between left and right parentheses).
2063
2064 Note that, in this second case, even though "<left_paren>" and
2065 "<right_paren>" are captured to the result-hash, they are not returned,
2066 because the "MATCH" alias overrides the normal "return the result-hash"
2067 semantics and returns only what its associated subrule (i.e. "<expr>")
2068 produces.
2069
Note also that the return value is only assigned if the subrule call
actually matches. For example:
2072
2073 <rule: optional_names>
2074 <[MATCH=name]>*
2075
2076 If the repeated subrule call to "<name>" matches zero times, the return
2077 value of the "optional_names" rule will not be an empty array, because
2078 the "MATCH=" will not have executed at all. Instead, the default return
2079 value (an empty string) will be returned. If you had specifically
2080 wanted to return an empty array, you could use any of the following:
2081
2082 <rule: optional_names>
2083 <MATCH=(?{ [] })> # Set up empty array before first match attempt
2084 <[MATCH=name]>*
2085
2086 or:
2087
2088 <rule: optional_names>
2089 <[MATCH=name]>+ # Match one or more times
2090 | # or
2091 <MATCH=(?{ [] })> # Set up empty array, if no match
2092
2093 Programmatic result distillation
2094
2095 It's also possible to control what a rule returns from within a code
2096 block. Regexp::Grammars provides a set of reserved variables that give
2097 direct access to the result-hash.
2098
2099 The result-hash itself can be accessed as %MATCH within any code block
2100 inside a rule. For example:
2101
2102 <rule: sum>
2103 <X=product> \+ <Y=product>
2104 <MATCH=(?{ $MATCH{X} + $MATCH{Y} })>
2105
Here, the rule matches a product (aliased to 'X' in the result-hash),
then a literal '+', then another product (aliased to 'Y' in the
result-hash). The rule then executes the code block, which accesses the
two saved values (as $MATCH{X} and $MATCH{Y}), adding them together.
Because the block is itself aliased to "MATCH", the sum produced by the
block becomes the (only) result of the rule.
2112
2113 It is also possible to set the rule result from within a code block
2114 (instead of aliasing it). The special "override" return value is
2115 represented by the special variable $MATCH. So the previous example
2116 could be rewritten:
2117
2118 <rule: sum>
2119 <X=product> \+ <Y=product>
2120 (?{ $MATCH = $MATCH{X} + $MATCH{Y} })
2121
2122 Both forms are identical in effect. Any assignment to $MATCH overrides
2123 the normal "return all subrule results" behaviour.
2124
2125 Assigning to $MATCH directly is particularly handy if the result may
2126 not always be "distillable", for example:
2127
2128 <rule: sum>
2129 <X=product> \+ <Y=product>
2130 (?{ if (!ref $MATCH{X} && !ref $MATCH{Y}) {
2131 # Reduce to sum, if both terms are simple scalars...
2132 $MATCH = $MATCH{X} + $MATCH{Y};
2133 }
2134 else {
2135 # Return full syntax tree for non-simple case...
2136 $MATCH{op} = '+';
2137 }
2138 })
2139
2140 Note that you can also partially override the subrule return behaviour.
2141 Normally, the subrule returns the complete text it matched as its
2142 context substring (i.e. under the "empty key") in its result-hash. That
2143 is, of course, $MATCH{""}, so you can override just that behaviour by
2144 directly assigning to that entry.
2145
2146 For example, if you have a rule that matches key/value pairs from a
2147 configuration file, you might prefer that any trailing comments not be
2148 included in the "matched text" entry of the rule's result-hash. You
2149 could hide such comments like so:
2150
2151 <rule: config_line>
2152 <key> : <value> <comment>?
2153 (?{
2154 # Edit trailing comments out of "matched text" entry...
$MATCH{""} = "$MATCH{key} : $MATCH{value}";
2156 })
2157
2158 Some more examples of the uses of $MATCH:
2159
2160 <rule: FuncDecl>
# Match the keyword and a name, but return only the name (as a string)...
2162 func <Identifier> ; (?{ $MATCH = $MATCH{'Identifier'} })
2163
2164
2165 <rule: NumList>
2166 # Numbers in square brackets...
2167 \[
2168 ( \d+ (?: , \d+)* )
2169 \]
2170
2171 # Return only the numbers...
2172 (?{ $MATCH = $CAPTURE })
2173
2174
2175 <token: Cmd>
2176 # Match standard variants then standardize the keyword...
2177 (?: mv | move | rename ) (?{ $MATCH = 'mv'; })
2178
2179 Parse-time data processing
2180 Using code blocks in rules, it's often possible to fully process data
2181 as you parse it. For example, the "<sum>" rule shown in the previous
2182 section might be part of a simple calculator, implemented entirely in a
2183 single grammar. Such a calculator might look like this:
2184
2185 my $calculator = do{
2186 use Regexp::Grammars;
2187 qr{
2188 <Answer>
2189
2190 <rule: Answer>
2191 ( <.Mult>+ % <.Op=([+-])> )
2192 <MATCH= (?{ eval $CAPTURE })>
2193
2194 <rule: Mult>
2195 ( <.Pow>+ % <.Op=([*/%])> )
2196 <MATCH= (?{ eval $CAPTURE })>
2197
2198 <rule: Pow>
2199 <X=Term> \^ <Y=Pow>
2200 <MATCH= (?{ $MATCH{X} ** $MATCH{Y}; })>
2201 |
2202 <MATCH=Term>
2203
2204 <rule: Term>
2205 <MATCH=Literal>
2206 | \( <MATCH=Answer> \)
2207
2208 <token: Literal>
2209 <MATCH= ( [+-]? \d++ (?: \. \d++ )?+ )>
2210 }xms
2211 };
2212
2213 while (my $input = <>) {
2214 if ($input =~ $calculator) {
2215 say "--> $/{Answer}";
2216 }
2217 }
2218
2219 Because every rule computes a value using the results of the subrules
2220 below it, and aliases that result to its "MATCH", each rule returns a
2221 complete evaluation of the subexpression it matches, passing that back
2222 to higher-level rules, which then do the same.
2223
2224 Hence, the result returned to the very top-level rule (i.e. to
2225 "<Answer>") is the complete evaluation of the entire expression that
2226 was matched. That means that, in the very process of having matched a
2227 valid expression, the calculator has also computed the value of that
2228 expression, which can then simply be printed directly.
2229
2230 It is often possible to have a grammar fully (or sometimes at least
2231 partially) evaluate or transform the data it is parsing, and this
2232 usually leads to very efficient and easy-to-maintain implementations.
2233
2234 The main limitation of this technique is that the data has to be in a
2235 well-structured form, where subsets of the data can be evaluated using
2236 only local information. In cases where the meaning of the data is
2237 distributed through that data non-hierarchically, or relies on global
2238 state, or on external information, it is often better to have the
2239 grammar simply construct a complete syntax tree for the data first, and
2240 then evaluate that syntax tree separately, after parsing is complete.
2241 The following section describes a feature of Regexp::Grammars that can
2242 make this second style of data processing simpler and more
2243 maintainable.
2244
2245 Object-oriented parsing
2246 When a grammar has parsed successfully, the "%/" variable will contain
2247 a series of nested hashes (and possibly arrays) representing the
2248 hierarchical structure of the parsed data.
2249
2250 Typically, the next step is to walk that tree, extracting or converting
2251 or otherwise processing that information. If the tree has nodes of many
2252 different types, it can be difficult to build a recursive subroutine
2253 that can navigate it easily.
2254
A much cleaner solution is possible if the nodes of the tree are proper
objects. In that case, you just define a "process()" or "traverse()"
method for each of the classes, and have every node call that method on
each of its children. For example, if the parser were to return a tree
of nodes representing the contents of a LaTeX file, then you could
define the following methods:
2261
2262 sub Latex::file::explain
2263 {
2264 my ($self, $level) = @_;
2265 for my $element (@{$self->{element}}) {
2266 $element->explain($level);
2267 }
2268 }
2269
2270 sub Latex::element::explain {
2271 my ($self, $level) = @_;
2272 ( $self->{command} || $self->{literal})->explain($level)
2273 }
2274
2275 sub Latex::command::explain {
2276 my ($self, $level) = @_;
2277 say "\t"x$level, "Command:";
2278 say "\t"x($level+1), "Name: $self->{name}";
2279 if ($self->{options}) {
2280 say "\t"x$level, "\tOptions:";
2281 $self->{options}->explain($level+2)
2282 }
2283
2284 for my $arg (@{$self->{arg}}) {
2285 say "\t"x$level, "\tArg:";
2286 $arg->explain($level+2)
2287 }
2288 }
2289
2290 sub Latex::options::explain {
2291 my ($self, $level) = @_;
2292 $_->explain($level) foreach @{$self->{option}};
2293 }
2294
2295 sub Latex::literal::explain {
2296 my ($self, $level, $label) = @_;
2297 $label //= 'Literal';
2298 say "\t"x$level, "$label: ", $self->{q{}};
2299 }
2300
2301 and then simply write:
2302
2303 if ($text =~ $LaTeX_parser) {
2304 $/{LaTeX_file}->explain();
2305 }
2306
2307 and the chain of "explain()" calls would cascade down the nodes of the
2308 tree, each one invoking the appropriate "explain()" method according to
2309 the type of node encountered.
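The cascade can be tried without a parser by blessing a small tree by
hand. In this sketch the "Demo::" classes, the hand-built tree, and the
@lines buffer are all hypothetical stand-ins for what an object-
returning grammar would produce.

```perl
use strict;
use warnings;

my @lines;   # collected output, so the traversal order is easy to inspect

{
    package Demo::File;
    sub explain {
        my ($self, $level) = @_;
        $_->explain($level) for @{ $self->{element} };
    }
}
{
    package Demo::Command;
    sub explain {
        my ($self, $level) = @_;
        push @lines, ("\t" x $level) . "Command: $self->{name}";
        $_->explain($level + 1) for @{ $self->{arg} };
    }
}
{
    package Demo::Literal;
    sub explain {
        my ($self, $level) = @_;
        push @lines, ("\t" x $level) . 'Literal: ' . $self->{''};
    }
}

# A hand-built tree: one command with a literal argument, then a literal.
my $tree = bless({
    element => [
        bless({
            name => 'section',
            arg  => [ bless({ '' => 'Intro' }, 'Demo::Literal') ],
        }, 'Demo::Command'),
        bless({ '' => 'Hello' }, 'Demo::Literal'),
    ],
}, 'Demo::File');

$tree->explain(0);
print "$_\n" for @lines;
```

No node needs to know the type of its children: each call to
"explain()" is resolved by the class the child was blessed into.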
2310
2311 The only problem is that, by default, Regexp::Grammars returns a tree
2312 of plain-old hashes, not LaTeX::Whatever objects. Fortunately, it's
2313 easy to request that the result hashes be automatically blessed into
2314 the appropriate classes, using the "<objrule:...>" and "<objtoken:...>"
2315 directives.
2316
2317 These directives are identical to the "<rule:...>" and "<token:...>"
2318 directives (respectively), except that the rule or token they create
2319 will also convert the hash it normally returns into an object of a
2320 specified class. This conversion is done by passing the result hash to
2321 the class's constructor:
2322
2323 $class->new(\%result_hash)
2324
2325 if the class has a constructor method named "new()", or else (if the
2326 class doesn't provide a constructor) by directly blessing the result
2327 hash:
2328
2329 bless \%result_hash, $class
2330
Note that, even if the object is constructed via its own constructor,
the module still expects the new object to be hash-based, and issues an
error if the object is anything but a blessed hash.
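The construction rule just described (use "new()" if the class has one,
otherwise bless the hash directly, and reject anything that is not a
blessed hash) can be sketched in plain Perl as follows.
"construct_node()" and the "Demo::" classes are hypothetical; the
module's real implementation and error text may differ.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed reftype);

# Hypothetical helper modelling the documented construction rule.
sub construct_node {
    my ($class, $result_hash) = @_;

    my $object = $class->can('new')
               ? $class->new($result_hash)      # class supplies a constructor
               : bless($result_hash, $class);   # otherwise bless directly

    die "$class object is not a blessed hash\n"
        unless blessed($object) && reftype($object) eq 'HASH';

    return $object;
}

{ package Demo::WithNew; sub new { my ($class, $h) = @_; return bless({ %{$h}, built => 1 }, $class) } }
{ package Demo::Plain;   sub describe { return 'plain' } }
{ package Demo::BadNew;  sub new { return bless([], shift) } }

my $via_new   = construct_node('Demo::WithNew', { name => 'a' });
my $via_bless = construct_node('Demo::Plain',   { name => 'b' });
my $ok        = eval { construct_node('Demo::BadNew', {}); 1 };

print ref($via_new), " built by new(): $via_new->{built}\n";
print ref($via_bless), " blessed directly\n";
print "array-based object rejected\n" if !$ok;
```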
2335
2336 The generic syntax for these types of rules and tokens is:
2337
2338 <objrule: CLASS::NAME = RULENAME >
2339 <objtoken: CLASS::NAME = TOKENNAME >
2340
2341 For example:
2342
2343 <objrule: LaTeX::Element=component>
2344 # ...Defines a rule that can be called as <component>
2345 # ...and which returns a hash-based LaTeX::Element object
2346
<objtoken: LaTeX::Literal=atom>
2348 # ...Defines a token that can be called as <atom>
2349 # ...and which returns a hash-based LaTeX::Literal object
2350
2351 Note that, just as in aliased subrule calls, the name by which
2352 something is referred to outside the grammar (in this case, the class
2353 name) comes before the "=", whereas the name that it is referred to
2354 inside the grammar comes after the "=".
2355
2356 You can freely mix object-returning and plain-old-hash-returning rules
2357 and tokens within a single grammar, though you have to be careful not
2358 to subsequently try to call a method on any of the unblessed nodes.
2359
2360 An important caveat regarding OO rules
2361
2362 Prior to Perl 5.14.0, Perl's regex engine was not fully re-entrant.
2363 This means that in older versions of Perl, it is not possible to re-
2364 invoke the regex engine when already inside the regex engine.
2365
2366 This means that you need to be careful that the "new()" constructors
2367 that are called by your object-rules do not themselves use regexes in
2368 any way, unless you're running under Perl 5.14 or later (in which case
2369 you can ignore what follows).
2370
2371 The two ways this is most likely to happen are:
2372
1. If you're using a class built on Moose, where one or more of the
"has" declarations uses a type constraint (such as 'Int') that is
implemented via regex matching. For example:
2376
2377 has 'id' => (is => 'rw', isa => 'Int');
2378
2379 The workaround (for pre-5.14 Perls) is to replace the type
2380 constraint with one that doesn't use a regex. For example:
2381
2382 has 'id' => (is => 'rw', isa => 'Num');
2383
2384 Alternatively, you could define your own type constraint that
2385 avoids regexes:
2386
2387 use Moose::Util::TypeConstraints;
2388
2389 subtype 'Non::Regex::Int',
2390 as 'Num',
2391 where { int($_) == $_ };
2392
2393 no Moose::Util::TypeConstraints;
2394
2395 # and later...
2396
2397 has 'id' => (is => 'rw', isa => 'Non::Regex::Int');
2398
2399 2. If your class uses an "AUTOLOAD()" method to implement its
2400 constructor and that method uses the typical:
2401
2402 $AUTOLOAD =~ s/.*://;
2403
2404 technique. The workaround here is to achieve the same effect
2405 without a regex. For example:
2406
2407 my $last_colon_pos = rindex($AUTOLOAD, ':');
2408 substr $AUTOLOAD, 0, $last_colon_pos+1, q{};
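The regex-free technique from item 2 can be checked against the usual
substitution in isolation. The helper name "method_name_of()" is
hypothetical, and the sketch works on a copy of the name rather than on
$AUTOLOAD itself.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the package qualifier without using a regex.
sub method_name_of {
    my ($qualified_name) = @_;

    # rindex() returns -1 when there is no colon, so substr() then
    # starts at offset 0 and the whole name is returned.
    my $last_colon_pos = rindex($qualified_name, ':');
    return substr($qualified_name, $last_colon_pos + 1);
}

for my $name ('My::Class::new', 'plain_name') {
    (my $via_regex = $name) =~ s/.*://;     # the usual technique
    my $via_rindex = method_name_of($name);
    print "$name -> $via_rindex (regex gives: $via_regex)\n";
}
```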
2409
2410 Note that this caveat against using nested regexes also applies to any
2411 code blocks executed inside a rule or token (whether or not those rules
2412 or tokens are object-oriented).
2413
2414 A naming shortcut
2415
2416 If an "<objrule:...>" or "<objtoken:...>" is defined with a class name
2417 that is not followed by "=" and a rule name, then the rule name is
2418 determined automatically from the classname. Specifically, the final
2419 component of the classname (i.e. after the last "::", if any) is used.
2420
2421 For example:
2422
2423 <objrule: LaTeX::Element>
2424 # ...Defines a rule that can be called as <Element>
2425 # ...and which returns a hash-based LaTeX::Element object
2426
<objtoken: LaTeX::Literal>
2428 # ...Defines a token that can be called as <Literal>
2429 # ...and which returns a hash-based LaTeX::Literal object
2430
2431 <objtoken: Comment>
2432 # ...Defines a token that can be called as <Comment>
2433 # ...and which returns a hash-based Comment object
2434
2436 Regexp::Grammars provides a number of features specifically designed to
2437 help debug both grammars and the data they parse.
2438
2439 All debugging messages are written to a log file (which, by default, is
2440 just STDERR). However, you can specify a disk file explicitly by
2441 placing a "<logfile:...>" directive at the start of your grammar:
2442
2443 $grammar = qr{
2444
2445 <logfile: LaTeX_parser_log >
2446
2447 \A <LaTeX_file> \Z # Pattern to match
2448
2449 <rule: LaTeX_file>
2450 # etc.
2451 }x;
2452
2453 You can also explicitly specify that messages go to the terminal:
2454
2455 <logfile: - >
2456
2457 Debugging grammar creation with "<logfile:...>"
2458 Whenever a log file has been directly specified, Regexp::Grammars
2459 automatically does verbose static analysis of your grammar. That is,
2460 whenever it compiles a grammar containing an explicit "<logfile:...>"
2461 directive it logs a series of messages explaining how it has
2462 interpreted the various components of that grammar. For example, the
2463 following grammar:
2464
2465 <logfile: parser_log >
2466
2467 <cmd>
2468
2469 <rule: cmd>
2470 mv <from=file> <to=file>
2471 | cp <source> <[file]> <.comment>?
2472
2473 would produce the following analysis in the 'parser_log' file:
2474
2475 info | Processing the main regex before any rule definitions
2476 | |
2477 | |...Treating <cmd> as:
2478 | | | match the subrule <cmd>
2479 | | \ saving the match in $MATCH{'cmd'}
2480 | |
2481 | \___End of main regex
2482 |
2483 info | Defining a rule: <cmd>
2484 | |...Returns: a hash
2485 | |
2486 | |...Treating ' mv ' as:
2487 | | \ normal Perl regex syntax
2488 | |
2489 | |...Treating <from=file> as:
2490 | | | match the subrule <file>
2491 | | \ saving the match in $MATCH{'from'}
2492 | |
2493 | |...Treating <to=file> as:
2494 | | | match the subrule <file>
2495 | | \ saving the match in $MATCH{'to'}
2496 | |
2497 | |...Treating ' | cp ' as:
2498 | | \ normal Perl regex syntax
2499 | |
2500 | |...Treating <source> as:
2501 | | | match the subrule <source>
2502 | | \ saving the match in $MATCH{'source'}
2503 | |
2504 | |...Treating <[file]> as:
2505 | | | match the subrule <file>
2506 | | \ appending the match to $MATCH{'file'}
2507 | |
2508 | |...Treating <.comment>? as:
2509 | | | match the subrule <comment> if possible
2510 | | \ but don't save anything
2511 | |
2512 | \___End of rule definition
2513
2514 This kind of static analysis is a useful starting point in debugging a
2515 miscreant grammar, because it enables you to see what you actually
2516 specified (as opposed to what you thought you'd specified).
2517
2518 Debugging grammar execution with "<debug:...>"
2519 Regexp::Grammars also provides a simple interactive debugger, with
2520 which you can observe the process of parsing and the data being
2521 collected in any result-hash.
2522
2523 To initiate debugging, place a "<debug:...>" directive anywhere in your
2524 grammar. When parsing reaches that directive the debugger will be
2525 activated, and the command specified in the directive immediately
2526 executed. The available commands are:
2527
2528 <debug: on> - Enable debugging, stop when a rule matches
2529 <debug: match> - Enable debugging, stop when a rule matches
2530 <debug: try> - Enable debugging, stop when a rule is tried
2531 <debug: run> - Enable debugging, run until the match completes
2532 <debug: same> - Continue debugging (or not) as currently
2533 <debug: off> - Disable debugging and continue parsing silently
2534
2535 <debug: continue> - Synonym for <debug: run>
2536 <debug: step> - Synonym for <debug: try>
2537
2538 These directives can be placed anywhere within a grammar and take
2539 effect when that point is reached in the parsing. Hence, adding a
2540 "<debug:step>" directive is very much like setting a breakpoint at that
2541 point in the grammar. Indeed, a common debugging strategy is to turn
2542 debugging on and off only around a suspect part of the grammar:
2543
2544 <rule: tricky> # This is where we think the problem is...
2545 <debug:step>
2546 <preamble> <text> <postscript>
2547 <debug:off>
2548
2549 Once the debugger is active, it steps through the parse, reporting
2550 rules that are tried, matches and failures, backtracking and restarts,
2551 and the parser's location within both the grammar and the text being
2552 matched. That report looks like this:
2553
2554 ===============> Trying <grammar> from position 0
2555 > cp file1 file2 |...Trying <cmd>
2556 | |...Trying <cmd=(cp)>
2557 | | \FAIL <cmd=(cp)>
2558 | \FAIL <cmd>
2559 \FAIL <grammar>
2560 ===============> Trying <grammar> from position 1
2561 cp file1 file2 |...Trying <cmd>
2562 | |...Trying <cmd=(cp)>
2563 file1 file2 | | \_____<cmd=(cp)> matched 'cp'
2564 file1 file2 | |...Trying <[file]>+
2565 file2 | | \_____<[file]>+ matched 'file1'
2566 | |...Trying <[file]>+
2567 [eos] | | \_____<[file]>+ matched ' file2'
2568 | |...Trying <[file]>+
2569 | | \FAIL <[file]>+
2570 | |...Trying <target>
2571 | | |...Trying <file>
2572 | | | \FAIL <file>
2573 | | \FAIL <target>
2574 <~~~~~~~~~~~~~~ | |...Backtracking 5 chars and trying new match
2575 file2 | |...Trying <target>
2576 | | |...Trying <file>
2577 | | | \____ <file> matched 'file2'
2578 [eos] | | \_____<target> matched 'file2'
2579 | \_____<cmd> matched ' cp file1 file2'
2580 \_____<grammar> matched ' cp file1 file2'
2581
2582 The first column indicates the point in the input at which the parser
2583 is trying to match, as well as any backtracking or forward searching it
2584 may need to do. The remainder of the columns track the parser's
2585 hierarchical traversal of the grammar, indicating which rules are
2586 tried, which succeed, and what they match.
2587
2588 Provided the logfile is a terminal (as it is by default), the debugger
2589 also pauses at various points in the parsing process--before trying a
2590 rule, after a rule succeeds, or at the end of the parse--according to
2591 the most recent command issued. When it pauses, you can issue a new
2592 command by entering a single letter:
2593
2594 m - to continue until the next subrule matches
2595 t or s - to continue until the next subrule is tried
2596 r or c - to continue to the end of the grammar
2597 o - to switch off debugging
2598
2599 Note that these are the first letters of the corresponding
2600 "<debug:...>" commands, listed earlier. Just hitting ENTER while the
2601 debugger is paused repeats the previous command.
2602
2603 While the debugger is paused you can also type a 'd', which will
2604 display the result-hash for the current rule. This can be useful for
2605 detecting which rule isn't returning the data you expected.
2606
2607 Resizing the context string
2608
2609 By default, the first column of the debugger output (which shows the
2610 current matching position within the string) is limited to a width of
2611 20 columns.
2612
However, you can change that limit by calling the
"Regexp::Grammars::set_context_width()" subroutine. You have to specify
the fully qualified name, as Regexp::Grammars does not export this (or
any other) subroutine.
2617
2618 "set_context_width()" expects a single argument: a positive integer
2619 indicating the maximal allowable width for the context column. It
2620 issues a warning if an invalid value is passed, and ignores it.
2621
2622 If called in a void context, "set_context_width()" changes the context
2623 width permanently throughout your application. If called in a scalar or
2624 list context, "set_context_width()" returns an object whose destructor
2625 will cause the context width to revert to its previous value. This
2626 means you can temporarily change the context width within a given block
2627 with something like:
2628
2629 {
2630 my $temporary = Regexp::Grammars::set_context_width(50);
2631
2632 if ($text =~ $parser) {
2633 do_stuff_with( %/ );
2634 }
2635
2636 } # <--- context width automagically reverts at this point
2637
2638 and the context width will change back to its previous value when
2639 $temporary goes out of scope at the end of the block.
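The void-versus-non-void behaviour is an instance of a general Perl
idiom: test "wantarray" and, outside void context, return a guard
object whose destructor undoes the change. The sketch below shows that
idiom with hypothetical names ("set_width()", "current_width()",
"Demo::RestoreGuard"); it is not the module's own code, and it omits
the argument validation described above.

```perl
use strict;
use warnings;

{
    package Demo::RestoreGuard;
    sub new     { my ($class, $restore) = @_; return bless({ restore => $restore }, $class) }
    sub DESTROY { my ($self) = @_; $self->{restore}->(); return }
}

my $context_width = 20;
sub current_width { return $context_width }

# Hypothetical setter: permanent in void context, temporary otherwise.
sub set_width {
    my ($new_width) = @_;
    my $old_width = $context_width;
    $context_width = $new_width;

    # Void context: nobody can hold a guard, so the change is permanent.
    return if !defined wantarray;

    # Otherwise hand back a guard that restores the old value when it
    # goes out of scope.
    return Demo::RestoreGuard->new(sub { $context_width = $old_width });
}

{
    my $temporary = set_width(50);
    print "inside block:    ", current_width(), "\n";
}
print "after block:     ", current_width(), "\n";

set_width(30);
print "after void call: ", current_width(), "\n";
```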
2640
2641 User-defined logging with "<log:...>"
2642 Both static and interactive debugging send a series of predefined log
2643 messages to whatever log file you have specified. It is also possible
2644 to send additional, user-defined messages to the log, using the
2645 "<log:...>" directive.
2646
This directive expects either simple text or a code block as its single
argument. If the argument is a code block, that code is expected to
return the text of the message; otherwise, the argument itself is the
literal message. For example:
2651
2652 <rule: ListElem>
2653
2654 <Elem= ( [a-z]\d+) >
2655 <log: Checking for a suffix, too...>
2656
2657 <Suffix= ( : \d+ ) >?
2658 <log: (?{ "ListElem: $MATCH{Elem} and $MATCH{Suffix}" })>
2659
2660 User-defined log messages implemented using a codeblock can also
2661 specify a severity level. If the codeblock of a "<log:...>" directive
2662 returns two or more values, the first is treated as a log message
2663 severity indicator, and the remaining values as separate lines of text
2664 to be logged. For example:
2665
2666 <rule: ListElem>
2667 <Elem= ( [a-z]\d+) >
2668 <Suffix= ( : \d+ ) >?
2669
2670 <log: (?{
2671 warn => "Elem was: $MATCH{Elem}",
2672 "Suffix was $MATCH{Suffix}",
2673 })>
2674
2675 When they are encountered, user-defined log messages are interspersed
2676 between any automatic log messages (i.e. from the debugger), at the
2677 correct level of nesting for the current rule.
2678
2679 Debugging non-grammars
2680 [Note that, with the release in 2012 of the Regexp::Debugger module (on
2681 CPAN) the techniques described below are unnecessary. If you need to
2682 debug plain Perl regexes, use Regexp::Debugger instead.]
2683
2684 It is possible to use Regexp::Grammars without creating any subrule
2685 definitions, simply to debug a recalcitrant regex. For example, if the
2686 following regex wasn't working as expected:
2687
2688 my $balanced_brackets = qr{
2689 \( # left delim
2690 (?:
2691 \\ # escape or
2692 | (?R) # recurse or
2693 | . # whatever
2694 )*
2695 \) # right delim
2696 }xms;
2697
2698 you could instrument it with aliased subpatterns and then debug it
2699 step-by-step, using Regexp::Grammars:
2700
2701 use Regexp::Grammars;
2702
2703 my $balanced_brackets = qr{
2704 <debug:step>
2705
2706 <.left_delim= ( \( )>
2707 (?:
2708 <.escape= ( \\ )>
2709 | <.recurse= ( (?R) )>
2710 | <.whatever=( . )>
2711 )*
2712 <.right_delim= ( \) )>
2713 }xms;
2714
2715 while (<>) {
2716 say 'matched' if /$balanced_brackets/;
2717 }
2718
2719 Note the use of amnesiac aliased subpatterns to avoid needlessly
2720 building a result-hash. Alternatively, you could use listifying aliases
2721 to preserve the matching structure as an additional debugging aid:
2722
2723 use Regexp::Grammars;
2724
2725 my $balanced_brackets = qr{
2726 <debug:step>
2727
2728 <[left_delim= ( \( )]>
2729 (?:
2730 <[escape= ( \\ )]>
2731 | <[recurse= ( (?R) )]>
2732 | <[whatever=( . )]>
2733 )*
2734 <[right_delim= ( \) )]>
2735 }xms;
2736
2737 if ( '(a(bc)d)' =~ /$balanced_brackets/) {
2738 use Data::Dumper 'Dumper';
2739 warn Dumper \%/;
2740 }
2741
2743 Assuming you have correctly debugged your grammar, the next source of
2744 problems will probably be invalid input (especially if that input is
2745 being provided interactively). So Regexp::Grammars also provides some
2746 support for detecting when a parse is likely to fail...and informing
2747 the user why.
2748
2749 Requirements
The "<require:...>" directive is useful for testing conditions that
it's not easy (or even possible) to check within the syntax of the
regex itself. For example:
2753
2754 <rule: IPV4_Octet_Decimal>
# Up to three digits...
2756 <MATCH= ( \d{1,3}+ )>
2757
2758 # ...but less than 256...
2759 <require: (?{ $MATCH <= 255 })>
2760
2761 A require expects a regex codeblock as its argument and succeeds if the
2762 final value of that codeblock is true. If the final value is false, the
2763 directive fails and the rule starts backtracking.
2764
Note that, in this example, the digits are matched with " \d{1,3}+ ".
The trailing "+" prevents the "{1,3}" repetition from backtracking to a
smaller number of digits if the "<require:...>" fails.
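
The effect of that trailing "+" can be seen without the module at all.
The sketch below is plain Perl, not Regexp::Grammars syntax: the range
test is approximated with a regex conditional that forces failure via
"(?!)" when the captured digits exceed 255. The variable names are
hypothetical.

```perl
use strict;
use warnings;

# Succeed if the most recently captured digits ($^N) are at most 255;
# otherwise force a failure with (?!), much as a failed requirement would
my $possessive   = qr/ \A ( \d{1,3}+ ) (?(?{ $^N <= 255 }) | (?!) ) /x;
my $backtracking = qr/ \A ( \d{1,3}  ) (?(?{ $^N <= 255 }) | (?!) ) /x;

for my $octet ('192', '256') {
    my $strict = $octet =~ $possessive   ? "matched '$1'" : 'no match';
    my $loose  = $octet =~ $backtracking ? "matched '$1'" : 'no match';
    print "$octet: possessive $strict, backtracking $loose\n";
}
```

For the input '256' the possessive version fails outright, whereas the
backtracking version gives up a digit and settles for '25', which is
rarely what an octet rule intends.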
2768
2769 Handling failure
The module has limited support for error reporting from within a
grammar, in the form of the "<error:...>" and "<warning:...>"
directives and their shortcuts: "<...>", "<!!!>", and "<???>".
2773
2774 Error messages
2775
2776 The "<error: MSG>" directive queues a conditional error message within
2777 "@!" and then fails to match (that is, it is equivalent to a "(?!)"
2778 when matching). For example:
2779
2780 <rule: ListElem>
2781 <SerialNumber>
2782 | <ClientName>
2783 | <error: (?{ $errcount++ . ': Missing list element' })>
2784
2785 So a common code pattern when using grammars that do this kind of error
2786 detection is:
2787
2788 if ($text =~ $grammar) {
2789 # Do something with the data collected in %/
2790 }
2791 else {
2792 say {*STDERR} $_ for @!; # i.e. report all errors
2793 }
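
Putting an error alternative and this reporting pattern together gives
something like the following sketch. The grammar, its rule name
("Setting") and the sample inputs are hypothetical, chosen only to make
the example self-contained.

```perl
use strict;
use warnings;

# Hypothetical grammar: a "name=digits" setting, with an error alternative
my $setting = do {
    use Regexp::Grammars;
    qr{
        \A <Setting> \z

        <rule: Setting>
              <Name=(\w+)> = <Value=(\d+)>
            | <error: Expected a numeric setting>
    }xms;
};

for my $text ('depth=3', 'depth=deep') {
    if ($text =~ $setting) {
        my %result = %/;    # copy the result-hash before any other match
        print "$text: $result{Setting}{Name} is $result{Setting}{Value}\n";
    }
    else {
        print "$text: $_\n" for @!;    # report every queued error
    }
}
```

The first input should parse; the second should fail and leave the
queued message in "@!".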
2794
2795 Each error message is conditional in the sense that, if any surrounding
2796 rule subsequently matches, the message is automatically removed from
2797 "@!". This implies that you can queue up as many error messages as you
2798 wish, but they will only remain in "@!" if the match ultimately fails.
2799 Moreover, only those error messages originating from rules that
2800 actually contributed to the eventual failure-to-match will remain in
2801 "@!".
2802
2803 If a code block is specified as the argument, the error message is
2804 whatever final value is produced when the block is executed. Note that
2805 this final value does not have to be a string (though it does have to
2806 be a scalar).
2807
2808 <rule: ListElem>
2809 <SerialNumber>
2810 | <ClientName>
2811 | <error: (?{
2812 # Return a hash, with the error information...
2813 { errnum => $errcount++, msg => 'Missing list element' }
2814 })>
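
Since a queued error may be a string or some other scalar, such as the
hash reference above, reporting code may need to handle both. A minimal
sketch, in which @queued is hypothetical sample data standing in for
the contents of "@!" and format_error() is an invented helper:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for @! after a failed match
my @queued = (
    "Expected a valid expression, but found '(2+3]*7/' instead",
    { errnum => 3, msg => 'Missing list element' },
);

# Turn one queued error (string or hash reference) into a printable line
sub format_error {
    my ($error) = @_;
    return ref($error) eq 'HASH'
        ? "[$error->{errnum}] $error->{msg}"
        : $error;
}

print format_error($_), "\n" for @queued;
```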
2815
2816 If anything else is specified as the argument, it is treated as a
2817 literal error string (and may not contain an unbalanced '<' or '>', nor
2818 any interpolated variables).
2819
2820 However, if the literal error string begins with "Expected " or
2821 "Expecting ", then the error string automatically has the following
2822 "context suffix" appended:
2823
2824 , but found '$CONTEXT' instead
2825
2826 For example:
2827
2828 qr{ <Arithmetic_Expression> # ...Match arithmetic expression
2829 | # Or else
2830 <error: Expected a valid expression> # ...Report error, and fail
2831
2832 # Rule definitions here...
2833 }xms;
2834
2835 On an invalid input this example might produce an error message like:
2836
2837 "Expected a valid expression, but found '(2+3]*7/' instead"
2838
2839 The value of the special $CONTEXT variable is found by looking ahead in
2840 the string being matched against, to locate the next sequence of non-
2841 blank characters after the current parsing position. This variable may
2842 also be explicitly used within the "<error: (?{...})>" form of the
2843 directive.
2844
2845 As a special case, if you omit the message entirely from the directive,
2846 it is supplied automatically, derived from the name of the current
2847 rule. For example, if the following rule were to fail to match:
2848
2849 <rule: Arithmetic_expression>
2850 <Multiplicative_Expression>+ % ([+-])
2851 | <error:>
2852
2853 the error message queued would be:
2854
2855 "Expected arithmetic expression, but found 'one plus two' instead"
2856
Note, however, that it is still essential to include the colon in the
directive. A common mistake is to write:
2859
2860 <rule: Arithmetic_expression>
2861 <Multiplicative_Expression>+ % ([+-])
2862 | <error>
2863
2864 which merely attempts to call "<rule: error>" if the first alternative
2865 fails.
2866
2867 Warning messages
2868
2869 Sometimes, you want to detect problems, but not invalidate the entire
2870 parse as a result. For those occasions, the module provides a "less
2871 stringent" form of error reporting: the "<warning:...>" directive.
2872
2873 This directive is exactly the same as an "<error:...>" in every respect
2874 except that it does not induce a failure to match at the point it
2875 appears.
2876
2877 The directive is, therefore, useful for reporting non-fatal problems in
2878 a parse. For example:
2879
2880 qr{ \A # ...Match only at start of input
2881 <ArithExpr> # ...Match a valid arithmetic expression
2882
2883 (?:
2884 # Should be at end of input...
2885 \s* \Z
2886 |
2887 # If not, report the fact but don't fail...
2888 <warning: Expected end-of-input>
2889 <warning: (?{ "Extra junk at index $INDEX: $CONTEXT" })>
2890 )
2891
2892 # Rule definitions here...
2893 }xms;
2894
2895 Note that, because they do not induce failure, two or more
2896 "<warning:...>" directives can be "stacked" in sequence, as in the
2897 previous example.
2898
2899 Stubbing
2900
2901 The module also provides three useful shortcuts, specifically to make
2902 it easy to declare, but not define, rules and tokens.
2903
The "<...>" and "<!!!>" directives are equivalent to the directive:
2905
2906 <error: Cannot match RULENAME (not implemented)>
2907
2908 The "<???>" is equivalent to the directive:
2909
2910 <warning: Cannot match RULENAME (not implemented)>
2911
2912 For example, in the following grammar:
2913
2914 <grammar: List::Generic>
2915
2916 <rule: List>
2917 <[Item]>+ % (\s*,\s*)
2918
2919 <rule: Item>
2920 <...>
2921
the "Item" rule is declared but not defined. That means the grammar
will compile correctly (the "List" rule won't complain about a call to
a non-existent "Item"), but if the "Item" rule isn't overridden in some
derived grammar, a match-time error will occur when "List" tries to
match the "<...>" within "Item".
2927
2928 Localizing the (semi-)automatic error messages
2929
2930 Error directives of any of the following forms:
2931
2932 <error: Expecting identifier>
2933
2934 <error: >
2935
2936 <...>
2937
2938 <!!!>
2939
2940 or their warning equivalents:
2941
2942 <warning: Expecting identifier>
2943
2944 <warning: >
2945
2946 <???>
2947
2948 each autogenerate part or all of the actual error message they produce.
2949 By default, that autogenerated message is always produced in English.
2950
2951 However, the module provides a mechanism by which you can intercept
2952 every error or warning that is queued to "@!" via these
2953 directives...and localize those messages.
2954
2955 To do this, you call "Regexp::Grammars::set_error_translator()" (with
2956 the full qualification, since Regexp::Grammars does not export it...nor
2957 anything else, for that matter).
2958
The "set_error_translator()" subroutine expects a single argument,
which must be a reference to another subroutine. This subroutine is
then called whenever an error or warning message is queued to "@!".
2962
2963 The subroutine is passed three arguments:
2964
2965 · the message string,
2966
2967 · the name of the rule from which the error or warning was queued,
2968 and
2969
2970 · the value of $CONTEXT when the error or warning was encountered
2971
2972 The subroutine is expected to return the final version of the message
2973 that is actually to be appended to "@!". To accomplish this it may make
2974 use of one of the many internationalization/localization modules
2975 available in Perl, or it may do the conversion entirely by itself.
2976
2977 The first argument is always exactly what appeared as a message in the
2978 original directive (regardless of whether that message is supposed to
2979 trigger autogeneration, or is just a "regular" error message). That
2980 is:
2981
2982 Directive 1st argument
2983
2984 <error: Expecting identifier> "Expecting identifier"
2985 <warning: That's not a moon!> "That's not a moon!"
2986 <error: > ""
2987 <warning: > ""
2988 <...> ""
2989 <!!!> ""
2990 <???> ""
2991
2992 The second argument always contains the name of the rule in which the
2993 directive was encountered. For example, when invoked from within
2994 "<rule: Frinstance>" the following directives produce:
2995
2996 Directive 2nd argument
2997
2998 <error: Expecting identifier> "Frinstance"
2999 <warning: That's not a moon!> "Frinstance"
3000 <error: > "Frinstance"
3001 <warning: > "Frinstance"
3002 <...> "-Frinstance"
3003 <!!!> "-Frinstance"
3004 <???> "-Frinstance"
3005
3006 Note that the "unimplemented" markers pass the rule name with a
3007 preceding '-'. This allows your translator to distinguish between
3008 "empty" messages (which should then be generated automatically) and the
3009 "unimplemented" markers (which should report that the rule is not yet
3010 properly defined).
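
A translator built around these three arguments might look like the
sketch below. The name my_translator, the German wording and the
%translation_for catalogue are all hypothetical. The sketch uses only
substr(), string comparison and a hash lookup, so the translator itself
performs no regex matching. It is exercised here by calling it directly;
in a real program it would be installed with
"Regexp::Grammars::set_error_translator(\&my_translator)".

```perl
use strict;
use warnings;

# Hypothetical message catalogue, keyed by the original message text
my %translation_for = (
    'Expecting identifier' => 'Bezeichner erwartet',
);

sub my_translator {
    my ($message, $rule_name, $context) = @_;

    # "Unimplemented" markers pass the rule name with a leading '-'
    if (substr($rule_name, 0, 1) eq '-') {
        my $name = substr($rule_name, 1);
        return "Regel '$name' ist noch nicht implementiert";
    }

    # Empty message: generate one from the rule name and the context
    if ($message eq '') {
        return "'$rule_name' erwartet, aber '$context' gefunden";
    }

    # Anything else: look it up, falling back to the original text
    return $translation_for{$message} // $message;
}

# Direct calls, mimicking the arguments listed in the tables above
print my_translator('Expecting identifier', 'Frinstance', 'xyz'), "\n";
print my_translator('',                     'Frinstance', 'xyz'), "\n";
print my_translator('',                     '-Frinstance', 'xyz'), "\n";
```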
3011
3012 If you call "Regexp::Grammars::set_error_translator()" in a void
3013 context, the error translator is permanently replaced (at least, until
3014 the next call to "set_error_translator()").
3015
3016 However, if you call "Regexp::Grammars::set_error_translator()" in a
3017 scalar or list context, it returns an object whose destructor will
3018 restore the previous translator. This allows you to install a
3019 translator only within a given scope, like so:
3020
3021 {
3022 my $temporary
3023 = Regexp::Grammars::set_error_translator(\&my_translator);
3024
3025 if ($text =~ $parser) {
3026 do_stuff_with( %/ );
3027 }
3028 else {
3029 report_errors_in( @! );
3030 }
3031
3032 } # <--- error translator automagically reverts at this point
3033
3034 Warning: any error translation subroutine you install will be called
3035 during the grammar's parsing phase (i.e. as the grammar's regex is
3036 matching). You should therefore ensure that your translator does not
3037 itself use regular expressions, as nested evaluations of regexes inside
3038 other regexes are extremely problematical (i.e. almost always
3039 disastrous) in Perl.
3040
3041 Restricting how long a parse runs
3042 Like the core Perl 5 regex engine on which they are built, the grammars
3043 implemented by Regexp::Grammars are essentially top-down parsers. This
3044 means that they may occasionally require an exponentially long time to
3045 parse a particular input. This usually occurs if a particular grammar
3046 includes a lot of recursion or nested backtracking, especially if the
3047 grammar is then matched against a long string.
3048
3049 The judicious use of non-backtracking repetitions (i.e. "x*+" and
3050 "x++") can significantly improve parsing performance in many such
3051 cases. Likewise, carefully reordering any high-level alternatives (so
3052 as to test simple common cases first) can substantially reduce parsing
3053 times.
3054
3055 However, some languages are just intrinsically slow to parse using top-
3056 down techniques (or, at least, may have slow-to-parse corner cases).
3057
3058 To help cope with this constraint, Regexp::Grammars provides a
3059 mechanism by which you can limit the total effort that a given grammar
3060 will expend in attempting to match. The "<timeout:...>" directive
3061 allows you to specify how long a grammar is allowed to continue trying
3062 to match before giving up. It expects a single argument, which must be
3063 an unsigned integer, and it treats this integer as the number of
3064 seconds to continue attempting to match.
3065
3066 For example:
3067
3068 <timeout: 10> # Give up after 10 seconds
3069
3070 indicates that the grammar should keep attempting to match for another
3071 10 seconds from the point where the directive is encountered during a
3072 parse. If the complete grammar has not matched in that time, the entire
3073 match is considered to have failed, the matching process is immediately
3074 terminated, and a standard error message ('Internal error: Timed out
3075 after 10 seconds (as requested)') is returned in "@!".
3076
3077 A "<timeout:...>" directive can be placed anywhere in a grammar, but is
3078 most usually placed at the very start, so that the entire grammar is
3079 governed by the specified time limit. The second most common
3080 alternative is to place the timeout at the start of a particular
3081 subrule that is known to be potentially very slow.
3082
3083 A common mistake is to put the timeout specification at the top level
3084 of the grammar, but place it after the actual subrule to be matched,
3085 like so:
3086
3087 my $grammar = qr{
3088
3089 <Text_Corpus> # Subrule to be matched
3090 <timeout: 10> # Useless use of timeout
3091
3092 <rule: Text_Corpus>
3093 # et cetera...
3094 }xms;
3095
3096 Since the parser will only reach the "<timeout: 10>" directive after it
3097 has completely matched "<Text_Corpus>", the timeout is only initiated
3098 at the very end of the matching process and so does not limit that
3099 process in any useful way.
3100
3101 Immediate timeouts
3102
3103 As you might expect, a "<timeout: 0>" directive tells the parser to
3104 keep trying for only zero more seconds, and therefore will immediately
3105 cause the entire surrounding grammar to fail (no matter how deeply
3106 within that grammar the directive is encountered).
3107
This can occasionally be extremely useful. If you know that detecting
3109 a particular datum means that the grammar will never match, no matter
3110 how many other alternatives may subsequently be tried, you can short-
3111 circuit the parser by injecting a "<timeout: 0>" immediately after the
3112 offending datum is detected.
3113
3114 For example, if your grammar only accepts certain versions of the
3115 language being parsed, you could write:
3116
3117 <rule: Valid_Language_Version>
3118 vers = <%AcceptableVersions>
3119 |
3120 vers = <bad_version=(\S++)>
3121 <warning: (?{ "Cannot parse language version $MATCH{bad_version}" })>
3122 <timeout: 0>
3123
3124 In fact, this "<warning: MSG> <timeout: 0>" sequence is sufficiently
3125 useful, sufficiently complex, and sufficiently easy to get wrong, that
3126 Regexp::Grammars provides a handy shortcut for it: the "<fatal:...>"
3127 directive. A "<fatal:...>" is exactly equivalent to a "<warning:...>"
3128 followed by a zero-timeout, so the previous example could also be
3129 written:
3130
3131 <rule: Valid_Language_Version>
3132 vers = <%AcceptableVersions>
3133 |
3134 vers = <bad_version=(\S++)>
3135 <fatal: (?{ "Cannot parse language version $MATCH{bad_version}" })>
3136
3137 Like "<error:...>" and "<warning:...>", "<fatal:...>" also provides its
3138 own failure context in $CONTEXT, so the previous example could be
3139 further simplified to:
3140
3141 <rule: Valid_Language_Version>
3142 vers = <%AcceptableVersions>
3143 |
3144 vers = <fatal:(?{ "Cannot parse language version $CONTEXT" })>
3145
3146 Also like "<error:...>", "<fatal:...>" can autogenerate an error
3147 message if none is provided, so the example could be still further
3148 reduced to:
3149
3150 <rule: Valid_Language_Version>
3151 vers = <%AcceptableVersions>
3152 |
3153 vers = <fatal:>
3154
3155 In this last case, however, the error message returned in "@!" would no
3156 longer be:
3157
3158 Cannot parse language version 0.95
3159
3160 It would now be:
3161
3162 Expected valid language version, but found '0.95' instead
3163
3165 If you intend to use a grammar as part of a larger program that
3166 contains other (non-grammatical) regexes, it is more efficient--and
3167 less error-prone--to avoid having Regexp::Grammars process those
3168 regexes as well. So it's often a good idea to declare your grammar in a
3169 "do" block, thereby restricting the scope of the module's effects.
3170
3171 For example:
3172
3173 my $grammar = do {
3174 use Regexp::Grammars;
3175 qr{
3176 <file>
3177
3178 <rule: file>
3179 <prelude>
3180 <data>
3181 <postlude>
3182
3183 <rule: prelude>
3184 # etc.
3185 }x;
3186 };
3187
3188 Because the effects of Regexp::Grammars are lexically scoped, any
3189 regexes defined outside that "do" block will be unaffected by the
3190 module.
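
As a small illustration of that scoping (a sketch with a hypothetical
"greeting" token), the same angle-bracket text is a subrule call inside
the "do" block and a sequence of literal characters outside it:

```perl
use strict;
use warnings;

my $grammar = do {
    use Regexp::Grammars;
    qr{
        <greeting>

        <token: greeting>
            hello | goodbye
    }xms;
};

# Outside the "do" block this is an ordinary regex, so the angle brackets
# are literal characters rather than a subrule call
my $literal = qr{<greeting>};

print "grammar matched\n" if 'hello world'      =~ $grammar;
print "literal matched\n" if 'a <greeting> tag' =~ $literal;
```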
3191
3193 Perl API
3194 "use Regexp::Grammars;"
3195 Causes all regexes in the current lexical scope to be compile-time
3196 processed for grammar elements.
3197
3198 "$str =~ $grammar"
3199 "$str =~ /$grammar/"
3200 Attempt to match the grammar against the string, building a nested
3201 data structure from it.
3202
3203 "%/"
3204 This hash is assigned the nested data structure created by any
3205 successful match of a grammar regex.
3206
3207 "@!"
3208 This array is assigned the queue of error messages created by any
3209 unsuccessful match attempt of a grammar regex.
3210
3211 Grammar syntax
3212 Directives
3213
3214 "<rule: IDENTIFIER>"
3215 Define a rule whose name is specified by the supplied identifier.
3216
3217 Everything following the "<rule:...>" directive (up to the next
3218 "<rule:...>" or "<token:...>" directive) is treated as part of the
3219 rule being defined.
3220
3221 Any whitespace in the rule is replaced by a call to the "<.ws>"
3222 subrule (which defaults to matching "\s*", but may be explicitly
3223 redefined).
3224
3225 "<token: IDENTIFIER>"
3226 Define a rule whose name is specified by the supplied identifier.
3227
3228 Everything following the "<token:...>" directive (up to the next
3229 "<rule:...>" or "<token:...>" directive) is treated as part of the
3230 rule being defined.
3231
3232 Any whitespace in the rule is ignored (under the "/x" modifier), or
3233 explicitly matched (if "/x" is not used).
3234
3235 "<objrule: IDENTIFIER>"
3236 "<objtoken: IDENTIFIER>"
3237 Identical to a "<rule: IDENTIFIER>" or "<token: IDENTIFIER>"
3238 declaration, except that the rule or token will also bless the hash
3239 it normally returns, converting it to an object of a class whose
3240 name is the same as the rule or token itself.
3241
3242 "<require: (?{ CODE }) >"
3243 The code block is executed and if its final value is true, matching
3244 continues from the same position. If the block's final value is
3245 false, the match fails at that point and starts backtracking.
3246
3247 "<error: (?{ CODE }) >"
3248 "<error: LITERAL TEXT >"
3249 "<error: >"
3250 This directive queues a conditional error message within the global
3251 special variable "@!" and then fails to match at that point (that
3252 is, it is equivalent to a "(?!)" or "(*FAIL)" when matching).
3253
3254 "<fatal: (?{ CODE }) >"
3255 "<fatal: LITERAL TEXT >"
3256 "<fatal: >"
This directive is exactly the same as an "<error:...>" in every
respect except that it causes the entire surrounding grammar to
fail, and parsing to cease immediately.
3260
3261 "<warning: (?{ CODE }) >"
3262 "<warning: LITERAL TEXT >"
3263 This directive is exactly the same as an "<error:...>" in every
3264 respect except that it does not induce a failure to match at the
3265 point it appears. That is, it is equivalent to a "(?=)" ["succeed
3266 and continue matching"], rather than a "(?!)" ["fail and
3267 backtrack"].
3268
3269 "<debug: COMMAND >"
3270 During the matching of grammar regexes send debugging and warning
3271 information to the specified log file (see "<logfile: LOGFILE>").
3272
3273 The available "COMMAND"'s are:
3274
3275 <debug: continue> ___ Debug until end of complete parse
3276 <debug: run> _/
3277
3278 <debug: on> ___ Debug until next subrule match
3279 <debug: match> _/
3280
3281 <debug: try> ___ Debug until next subrule call or match
3282 <debug: step> _/
3283
3284 <debug: same> ___ Maintain current debugging mode
3285
3286 <debug: off> ___ No debugging
3287
3288 See also the $DEBUG special variable.
3289
3290 "<logfile: LOGFILE>"
3291 "<logfile: - >"
3292 During the compilation of grammar regexes, send debugging and
3293 warning information to the specified LOGFILE (or to *STDERR if "-"
3294 is specified).
3295
3296 If the specified LOGFILE name contains a %t, it is replaced with a
3297 (sortable) "YYYYMMDD.HHMMSS" timestamp. For example:
3298
3299 <logfile: test-run-%t >
3300
3301 executed at around 9.30pm on the 21st of March 2009, would generate
3302 a log file named: "test-run-20090321.213056"
3303
3304 "<log: (?{ CODE }) >"
3305 "<log: LITERAL TEXT >"
3306 Append a message to the log file. If the argument is a code block,
3307 that code is expected to return the text of the message; if the
3308 argument is anything else, that something else is the literal
3309 message.
3310
3311 If the block returns two or more values, the first is treated as a
3312 log message severity indicator, and the remaining values as
3313 separate lines of text to be logged.
3314
3315 "<timeout: INT >"
Restrict the match-time of the parse to the specified number of
seconds. Queues an error message and terminates the entire match
process if the parse does not complete within the nominated time
limit.
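
The "YYYYMMDD.HHMMSS" timestamp substituted for %t in a "<logfile:...>"
name has the shape produced by the sketch below. How the module builds
the timestamp internally is not stated here, so this only illustrates
the format, using the standard POSIX strftime() function and the date
from the example above.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Format a broken-down time as YYYYMMDD.HHMMSS
# Arguments: sec, min, hour, day of month, month (0-based), year - 1900
my $stamp = strftime('%Y%m%d.%H%M%S', 56, 30, 21, 21, 2, 109);

print "test-run-$stamp\n";
```

Because the fields run from most to least significant, sorting such
file names as plain strings also sorts them chronologically.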
3320
3321 Subrule calls
3322
3323 "<IDENTIFIER>"
3324 Call the subrule whose name is IDENTIFIER.
3325
3326 If it matches successfully, save the hash it returns in the current
3327 scope's result-hash, under the key 'IDENTIFIER'.
3328
3329 "<IDENTIFIER_1=IDENTIFIER_2>"
Call the subrule whose name is IDENTIFIER_2.

If it matches successfully, save the hash it returns in the current
scope's result-hash, under the key 'IDENTIFIER_1'.
3334
3335 In other words, the "IDENTIFIER_1=" prefix changes the key under
3336 which the result of calling a subrule is stored.
3337
3338 "<.IDENTIFIER>"
3339 Call the subrule whose name is IDENTIFIER. Don't save the hash it
3340 returns.
3341
3342 In other words, the "dot" prefix disables saving of subrule
3343 results.
3344
3345 "<IDENTIFIER= ( PATTERN )>"
3346 Match the subpattern PATTERN.
3347
3348 If it matches successfully, capture the substring it matched and
3349 save that substring in the current scope's result-hash, under the
3350 key 'IDENTIFIER'.
3351
3352 "<.IDENTIFIER= ( PATTERN )>"
3353 Match the subpattern PATTERN. Don't save the substring it matched.
3354
3355 "<IDENTIFIER= %HASH>"
Match a sequence of non-whitespace characters, then verify that the
sequence is a key in the specified hash.
3358
3359 If it matches successfully, capture the sequence it matched and
3360 save that substring in the current scope's result-hash, under the
3361 key 'IDENTIFIER'.
3362
3363 "<%HASH>"
3364 Match a key from the hash. Don't save the substring it matched.
3365
3366 "<IDENTIFIER= (?{ CODE })>"
3367 Execute the specified CODE.
3368
3369 Save the result (of the final expression that the CODE evaluates)
3370 in the current scope's result-hash, under the key 'IDENTIFIER'.
3371
3372 "<[IDENTIFIER]>"
3373 Call the subrule whose name is IDENTIFIER.
3374
If it matches successfully, append the hash it returns to a nested
array within the current scope's result-hash, under the key
'IDENTIFIER'.
3378
3379 "<[IDENTIFIER_1=IDENTIFIER_2]>"
Call the subrule whose name is IDENTIFIER_2.

If it matches successfully, append the hash it returns to a nested
array within the current scope's result-hash, under the key
'IDENTIFIER_1'.
3385
3386 "<ANY_SUBRULE>+ % <ANY_OTHER_SUBRULE>"
3387 "<ANY_SUBRULE>* % <ANY_OTHER_SUBRULE>"
3388 "<ANY_SUBRULE>+ % (PATTERN)"
3389 "<ANY_SUBRULE>* % (PATTERN)"
3390 Repeatedly call the first subrule. Keep matching as long as the
3391 subrule matches, provided successive matches are separated by
3392 matches of the second subrule or the pattern.
3393
3394 In other words, match a list of ANY_SUBRULE's separated by
3395 ANY_OTHER_SUBRULE's or PATTERN's.
3396
3397 Note that, if a pattern is used to specify the separator, it must
3398 be specified in some kind of matched parentheses. These may be
3399 capturing ["(...)"], non-capturing ["(?:...)"], non-backtracking
3400 ["(?>...)"], or any other construct enclosed by an opening and
3401 closing paren.
3402
3403 "<ANY_SUBRULE>+ %% <ANY_OTHER_SUBRULE>"
3404 "<ANY_SUBRULE>* %% <ANY_OTHER_SUBRULE>"
3405 "<ANY_SUBRULE>+ %% (PATTERN)"
3406 "<ANY_SUBRULE>* %% (PATTERN)"
3407 Repeatedly call the first subrule. Keep matching as long as the
3408 subrule matches, provided successive matches are separated by
3409 matches of the second subrule or the pattern.
3410
3411 Also allow an optional final trailing instance of the second
3412 subrule or pattern (this is where "%%" differs from "%").
3413
3414 In other words, match a list of ANY_SUBRULE's separated by
3415 ANY_OTHER_SUBRULE's or PATTERN's, with a possible final separator.
3416
3417 As for the single "%" operator, if a pattern is used to specify the
3418 separator, it must be specified in some kind of matched
3419 parentheses. These may be capturing ["(...)"], non-capturing
3420 ["(?:...)"], non-backtracking ["(?>...)"], or any other construct
3421 enclosed by an opening and closing paren.
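
As a sketch of the single "%" separated-list form described above,
combined with a listifying call, the following hypothetical grammar
collects comma-separated words into an array. The rule names and the
sample input are invented for the illustration.

```perl
use strict;
use warnings;

my $list_parser = do {
    use Regexp::Grammars;
    qr{
        \A <List> \z

        <rule: List>
            <[Item]>+ % (\s*,\s*)

        <token: Item>
            \w+
    }xms;
};

if ('apple, banana,cherry' =~ $list_parser) {
    # Each successful <[Item]> call appends to the array under 'Item'
    my @items = @{ $/{List}{Item} };
    print scalar(@items), " items: @items\n";
}
```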
3422
3423 Special variables within grammar actions
3424 $CAPTURE
3425 $CONTEXT
3426 These are both aliases for the built-in read-only $^N variable,
3427 which always contains the substring matched by the nearest
3428 preceding "(...)" capture. $^N still works perfectly well, but
3429 these are provided to improve the readability of code blocks and
3430 error messages respectively.
3431
3432 $INDEX
3433 This variable contains the index at which the next match will be
3434 attempted within the string being parsed. It is most commonly used
3435 in "<error:...>" or "<log:...>" directives:
3436
3437 <rule: ListElem>
3438 <log: (?{ "Trying words at index $INDEX" })>
3439 <MATCH=( \w++ )>
3440 |
3441 <log: (?{ "Trying digits at index $INDEX" })>
3442 <MATCH=( \d++ )>
3443 |
3444 <error: (?{ "Missing ListElem near index $INDEX" })>
3445
3446 %MATCH
3447 This variable contains all the saved results of any subrules called
3448 from the current rule. In other words, subrule calls like:
3449
3450 <ListElem> <Separator= (,)>
3451
store their respective match results in $MATCH{'ListElem'} and
$MATCH{'Separator'}.
3454
3455 $MATCH
3456 This variable is an alias for $MATCH{"="}. This is the %MATCH entry
3457 for the special "override value". If this entry is defined, its
3458 value overrides the usual "return \%MATCH" semantics of a
3459 successful rule.
3460
3461 %ARG
This variable contains all the key/value pairs that were passed
into a particular subrule call. For example, given the subrule calls:
3464
3465 <Keyword> <Command> <Terminator(:Keyword)>
3466
3467 the "Terminator" rule could get access to the text matched by
3468 "<Keyword>" like so:
3469
3470 <token: Terminator>
3471 end_ (??{ $ARG{'Keyword'} })
3472
Note that to match against the calling subrule's 'Keyword' value,
it's necessary to use either a deferred interpolation ("(??{...})")
or a qualified matchref:
3476
3477 <token: Terminator>
3478 end_ <\:Keyword>
3479
3480 A common mistake is to attempt to directly interpolate the
3481 argument:
3482
3483 <token: Terminator>
3484 end_ $ARG{'Keyword'}
3485
3486 This evaluates $ARG{'Keyword'} when the grammar is compiled, rather
3487 than when the rule is matched.
3488
3489 $_ At the start of any code blocks inside any regex, the variable $_
3490 contains the complete string being matched against. The current
3491 matching position within that string is given by: "pos($_)".
3492
3493 $DEBUG
3494 This variable stores the current debugging mode (which may be any
3495 of: 'off', 'on', 'run', 'continue', 'match', 'step', or 'try'). It
3496 is set automatically by the "<debug:...>" command, but may also be
3497 set manually in a code block (which can be useful for conditional
3498 debugging). For example:
3499
3500 <rule: ListElem>
3501 <Identifier>
3502
3503 # Conditionally debug if 'foobar' encountered...
3504 (?{ $DEBUG = $MATCH{Identifier} eq 'foobar' ? 'step' : 'off' })
3505
3506 <Modifier>?
3507
See also: the "<logfile: LOGFILE>" and "<debug: DEBUG_CMD>" directives.
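
Two of the variables above come from core Perl rather than from the
module: $^N (the text of the most recently closed capture, for which
$CAPTURE is described as an alias) and $_ with "pos($_)". The sketch
below therefore uses a plain regex with no grammar, and shows what a
code block can see at each step. The @seen array is a hypothetical
name.

```perl
use strict;
use warnings;

my @seen;

# After each word is captured, record the captured text ($^N) and the
# current match position (pos($_)) from inside a code block
'alpha beta' =~ m{
    (?: ( \w+ ) (?{ push @seen, [ $^N, pos($_) ] }) \s* )+
}x or die "no match";

printf "captured '%s', position now %d\n", @$_ for @seen;
```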
3509
· Prior to Perl 5.14, the Perl 5 regex engine was not reentrant. So
any attempt to perform a regex match inside a "(?{ ... })" or "(??{
... })" under Perl 5.12 or earlier will almost certainly lead to
either weird data corruption or a segfault.
3515
3516 The same calamities can also occur in any constructor called by
3517 "<objrule:>". If the constructor invokes another regex in any way,
3518 it will most likely fail catastrophically. In particular, this
3519 means that Moose constructors will frequently crash and burn within
a Regexp::Grammars grammar (for example, if the Moose-based class
3521 declares an attribute type constraint such as 'Int', which Moose
3522 checks using a regex).
3523
3524 · The additional regex constructs this module provides are
3525 implemented by rewriting regular expressions. This is a (safer)
3526 form of source filtering, but still subject to all the same
3527 limitations and fallibilities of any other macro-based solution.
3528
3529 · In particular, rewriting the macros involves the insertion of (a
3530 lot of) extra capturing parentheses. This means you can no longer
3531 assume that particular capturing parens correspond to particular
numeric variables: i.e. to $1, $2, $3 etc. If you want to capture
directly, use Perl 5.10's named capture construct:
3534
3535 (?<name> [^\W\d]\w* )
3536
3537 Better still, capture the data in its correct hierarchical context
3538 using the module's "named subpattern" construct:
3539
3540 <name= ([^\W\d]\w*) >
3541
3542 · No recursive descent parser--including those created with
3543 Regexp::Grammars--can directly handle left-recursive grammars with
3544 rules of the form:
3545
3546 <rule: List>
3547 <List> , <ListElem>
3548
3549 If you find yourself attempting to write a left-recursive grammar
3550 (which Perl 5.10 may or may not complain about, but will never
3551 successfully parse with), then you probably need to use the
3552 "separated list" construct instead:
3553
3554 <rule: List>
3555 <[ListElem]>+ % (,)
3556
3557 · Grammatical parsing with Regexp::Grammars can fail if your grammar
3558 uses "non-backtracking" directives (i.e. the "(?>...)" block or the
3559 "?+", "*+", or "++" repetition specifiers). The problem appears to
3560 be that preventing the regex from backtracking through the in-regex
3561 actions that Regexp::Grammars adds causes the module's internal
3562 stack to fall out of sync with the regex match.
3563
For the time being, if your grammar does not work as expected, you
may need to replace one or more "non-backtracking" directives with
their regular (i.e. backtracking) equivalents.
3567
3568 · Similarly, parsing with Regexp::Grammars will fail if your grammar
3569 places a subrule call within a positive look-ahead, since these
3570 don't play nicely with the data stack.
3571
3572 This seems to be an internal problem with perl itself.
3573 Investigations, and attempts at a workaround, are proceeding.
3574
For the time being, you need to make sure that grammar rules don't
appear inside a positive lookahead, or use the "<?RULENAME>"
construct instead.
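
The advice in the list above about preferring named captures over numbered ones can be illustrated in plain Perl. This sketch does not use Regexp::Grammars; it simulates the effect of an inserted capture group by adding one by hand, and the variable names are illustrative only.

```perl
use strict;
use warnings;

my $input = '  total_42';

# The same identifier pattern, without and with an extra leading capture
# group (standing in for parentheses added by a regex rewriter).
my $plain   = qr{ (?<name> [^\W\d]\w* ) }x;
my $shifted = qr{ (\s*) (?<name> [^\W\d]\w* ) }x;

my (@by_name, @by_number);
for my $re ($plain, $shifted) {
    if ($input =~ $re) {
        push @by_name,   $+{name};   # stable: looked up by name
        push @by_number, $1;         # fragile: depends on group position
    }
}

print "by name:   [$_]\n" for @by_name;
print "by number: [$_]\n" for @by_number;
```

The named lookup yields the identifier in both cases, whereas $1 changes meaning as soon as another capturing group appears ahead of it.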
3578
3580 Note that (because the author cannot find a way to throw exceptions
3581 from within a regex) none of the following diagnostics actually throws
3582 an exception.
3583
3584 Instead, these messages are simply written to the specified parser
3585 logfile (or to *STDERR, if no logfile is specified).
3586
3587 However, any fatal match-time message will immediately terminate the
3588 parser matching and will still set $@ (as if an exception had been
3589 thrown and caught at that point in the code). You then have the option
3590 to check $@ immediately after matching with the grammar, and rethrow if
3591 necessary:
3592
3593 if ($input =~ $grammar) {
3594 process_data_in(\%/);
3595 }
3596 else {
3597 die if $@;
3598 }
3599
3600 "Found call to %s, but no %s was defined in the grammar"
3601 You specified a call to a subrule for which there was no definition
in the grammar. Typically that's either because you forgot to
3603 define the rule, or because you misspelled either the definition or
3604 the subrule call. For example:
3605
3606 <file>
3607
3608 <rule: fiel> <---- misspelled rule
3609 <lines> <---- used but never defined
3610
3611 Regexp::Grammars converts any such subrule call attempt to an
3612 instant catastrophic failure of the entire parse, so if your parser
3613 ever actually tries to perform that call, Very Bad Things will
3614 happen.
3615
3616 "Entire parse terminated prematurely while attempting to call
3617 non-existent rule: %s"
You ignored the previous error and actually tried to call a
3619 subrule for which there was no definition in the grammar. Very Bad
3620 Things are now happening. The parser got very upset, took its ball,
3621 and went home. See the preceding diagnostic for remedies.
3622
3623 This diagnostic should throw an exception, but can't. So it sets $@
3624 instead, allowing you to trap the error manually if you wish.
3625
3626 "Fatal error: <objrule: %s> returned a non-hash-based object"
3627 An <objrule:> was specified and returned a blessed object that
3628 wasn't a hash. This will break the behaviour of the grammar, so the
3629 module immediately reports the problem and gives up.
3630
The solution is to use only hash-based classes with <objrule:>.
3632
3633 "Can't match against <grammar: %s>"
3634 The regex you attempted to match against defined a pure grammar,
3635 using the "<grammar:...>" directive. Pure grammars have no start-
3636 pattern and hence cannot be matched against directly.
3637
3638 You need to define a matchable grammar that inherits from your pure
3639 grammar and then calls one of its rules. For example, instead of:
3640
3641 my $greeting = qr{
3642 <grammar: Greeting>
3643
3644 <rule: greet>
3645 Hi there
3646 | Hello
3647 | Yo!
3648 }xms;
3649
3650 you need:
3651
3652 qr{
3653 <grammar: Greeting>
3654
3655 <rule: greet>
3656 Hi there
3657 | Hello
3658 | Yo!
3659 }xms;
3660
3661 my $greeting = qr{
3662 <extends: Greeting>
3663 <greet>
3664 }xms;
3665
3666 "Inheritance from unknown grammar requested by <%s>"
3667 You used an "<extends:...>" directive to request that your grammar
3668 inherit from another, but the grammar you asked to inherit from
3669 doesn't exist.
3670
3671 Check the spelling of the grammar name, and that it's already been
3672 defined somewhere earlier in your program.
3673
3674 "Redeclaration of <%s> will be ignored"
3675 You defined two or more rules or tokens with the same name. The
3676 first one defined in the grammar will be used; the rest will be
3677 ignored.
3678
3679 To get rid of the warning, get rid of the extra definitions (or, at
3680 least, comment them out or rename the rules).
3681
3682 "Possible invalid subrule call %s"
3683 Your grammar contained something of the form:
3684
3685 <identifier
3686 <.identifier
3687 <[identifier
3688
3689 which you might have intended to be a subrule call, but which
3690 didn't correctly parse as one. If it was supposed to be a
3691 Regexp::Grammars subrule call, you need to check the syntax you
3692 used. If it wasn't supposed to be a subrule call, you can silence
3693 the warning by rewriting it and quoting the leading angle:
3694
3695 \<identifier
3696 \<.identifier
3697 \<[identifier
3698
3699 "Possible failed attempt to specify a directive: %s"
3700 Your grammar contained something of the form:
3701
3702 <identifier:...
3703
3704 but which wasn't a known directive like "<rule:...>" or
3705 "<debug:...>". If it was supposed to be a Regexp::Grammars
3706 directive, check the spelling of the directive name. If it wasn't
3707 supposed to be a directive, you can silence the warning by
3708 rewriting it and quoting the leading angle:
3709
3710 \<identifier:
3711
3712 "Possible failed attempt to specify a subrule call %s"
3713 Your grammar contained something of the form:
3714
3715 <identifier...
3716
3717 but which wasn't a call to a known subrule like "<ident>" or
3718 "<name>". If it was supposed to be a Regexp::Grammars subrule call,
3719 check the spelling of the rule name in the angles. If it wasn't
3720 supposed to be a subrule call, you can silence the warning by
3721 rewriting it and quoting the leading angle:
3722
3723 \<identifier...
3724
3725 "Repeated subrule %s will only capture its final match"
3726 You specified a subrule call with a repetition qualifier, such as:
3727
3728 <ListElem>*
3729
3730 or:
3731
3732 <ListElem>+
3733
3734 Because each subrule call saves its result in a hash entry of the
3735 same name, each repeated match will overwrite the previous ones, so
3736 only the last match will ultimately be saved. If you want to save
3737 all the matches, you need to tell Regexp::Grammars to save the
3738 sequence of results as a nested array within the hash entry, like
3739 so:
3740
3741 <[ListElem]>*
3742
3743 or:
3744
3745 <[ListElem]>+
3746
3747 If you really did intend to throw away every result but the final
3748 one, you can silence the warning by placing the subrule call inside
3749 any kind of parentheses. For example:
3750
3751 (<ListElem>)*
3752
3753 or:
3754
3755 (?: <ListElem> )+
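
The overwriting described here resembles the way plain Perl treats a quantified capture group, where each repetition replaces the previous capture. The sketch below shows only that plain-Perl analogue, not the module's own result hash; the variable names are illustrative.

```perl
use strict;
use warnings;

my $input = 'a b c';

# A capture group under a quantifier keeps only its final iteration.
my ($last) = $input =~ m{ (?: (\w) \s* )+ }x;

# Collecting every match requires asking for a list explicitly,
# here with a global match in list context.
my @all = $input =~ m{ (\w) }gx;

print "last only: $last\n";
print "all:       @all\n";
```

In a grammar, the "<[ListElem]>+" form plays the role of the second pattern, gathering every match instead of just the final one.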
3756
3757 "Unable to open log file '$filename' (%s)"
3758 You specified a "<logfile:...>" directive but the file whose name
3759 you specified could not be opened for writing (for the reason given
3760 in the parens).
3761
3762 Did you misspell the filename, or get the permissions wrong
3763 somewhere in the filepath?
3764
3765 "Non-backtracking subrule %s may not revert correctly during
3766 backtracking"
3767 Because of inherent limitations in the Perl regex engine, non-
3768 backtracking constructs like "++", "*+", "?+", and "(?>...)" do not
3769 always work correctly when applied to subrule calls, especially in
3770 earlier versions of Perl.
3771
3772 If the grammar doesn't work properly, replace the offending
3773 constructs with regular backtracking versions instead. If the
3774 grammar does work, you can silence the warning by enclosing the
3775 subrule call in any kind of parentheses. For example, change:
3776
3777 <[ListElem]>++
3778
3779 to:
3780
3781 (?: <[ListElem]> )++
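
For readers unfamiliar with these constructs, the following plain-Perl sketch shows what "non-backtracking" means in practice. It does not reproduce the module's diagnostic; it only contrasts an ordinary quantifier with its possessive and atomic-group counterparts.

```perl
use strict;
use warnings;

my $input = 'aaa';

# Ordinary quantifier: a+ first takes all three characters, then gives
# one back so that the final 'a' can match.
my $backtracking = $input =~ m{ \A a+ a \z }x ? 1 : 0;

# Possessive quantifier: a++ never gives anything back, so the final
# 'a' has nothing left to match.
my $possessive = $input =~ m{ \A a++ a \z }x ? 1 : 0;

# Atomic group: behaves the same way as the possessive form here.
my $atomic = $input =~ m{ \A (?> a+ ) a \z }x ? 1 : 0;

print "backtracking: $backtracking, possessive: $possessive, atomic: $atomic\n";
```

Replacing a possessive form with its backtracking equivalent can therefore change which inputs match, so any such substitution is worth re-testing.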
3782
3783 "Unexpected item before first subrule specification in definition of
3784 <grammar: %s>"
3785 Named grammar definitions must consist only of rule and token
3786 definitions. They cannot have patterns before the first
3787 definitions. You had some kind of pattern before the first
3788 definition, which will be completely ignored within the grammar.
3789
3790 To silence the warning, either comment out or delete whatever is
3791 before the first rule/token definition.
3792
3793 "No main regex specified before rule definitions"
3794 You specified an unnamed grammar (i.e. no "<grammar:...>"
3795 directive), but didn't specify anything for it to actually match,
3796 just some rules that you don't actually call. For example:
3797
3798 my $grammar = qr{
3799
3800 <rule: list> \( <item> +% [,] \)
3801
3802 <token: item> <list> | \d+
3803 }x;
3804
3805 You have to provide something before the first rule to start the
3806 matching off. For example:
3807
3808 my $grammar = qr{
3809
3810 <list> # <--- This tells the grammar how to start matching
3811
3812 <rule: list> \( <item> +% [,] \)
3813
3814 <token: item> <list> | \d+
3815 }x;
3816
3817 "Ignoring useless empty <ws:> directive"
3818 The "<ws:...>" directive specifies what whitespace matches within
3819 the current rule. An empty "<ws:>" directive would cause whitespace
3820 to match nothing at all, which is what happens in a token
3821 definition, not in a rule definition.
3822
3823 Either put some subpattern inside the empty "<ws:...>" or, if you
3824 really do want whitespace to match nothing at all, remove the
3825 directive completely and change the rule definition to a token
3826 definition.
3827
3828 "Ignoring useless <ws: %s > directive in a token definition"
3829 The "<ws:...>" directive is used to specify what whitespace matches
3830 within a rule. Since whitespace never matches anything inside
3831 tokens, putting a "<ws:...>" directive in a token is a waste of
3832 time.
3833
3834 Either remove the useless directive, or else change the surrounding
3835 token definition to a rule definition.
3836
3837 "Quantifier that doesn't quantify anything: <%s>"
3838 You specified a rule or token something like:
3839
3840 <token: star> *
3841
3842 or:
3843
3844 <rule: add_op> plus | add | +
3845
but the "*" and "+" in those examples are both regex meta-operators:
quantifiers that usually cause what precedes them to match
repeatedly. In these cases, however, nothing precedes the
quantifier, so it's a Perl syntax error.
3850
3851 You almost certainly need to escape the meta-characters in some
3852 way. For example:
3853
3854 <token: star> \*
3855
3856 <rule: add_op> plus | add | [+]
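
The underlying Perl error can be reproduced without Regexp::Grammars. In this sketch the patterns are held in strings (with hypothetical names) so that they are compiled at run time, which lets eval trap the failure.

```perl
use strict;
use warnings;

# Pattern sources kept as strings so that compilation happens at run time.
my $bad_source  = 'plus | add | +';     # bare quantifier with nothing before it
my $good_source = 'plus | add | [+]';   # literal plus sign inside a character class

my $bad   = eval { qr/$bad_source/x };
my $error = $@;                         # Perl's complaint about the bare quantifier

my $good = eval { qr/$good_source/x };

print "bad pattern rejected\n" if !defined $bad;
print "good pattern matches '+'\n" if '+' =~ /\A(?:$good)\z/;
```

Either escaping style shown above ("\*" or "[+]") avoids the error; the character-class form is often easier to read under the /x flag.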
3857
3859 Regexp::Grammars requires no configuration files or environment
3860 variables.
3861
3863 This module only works under Perl 5.10 or later.
3864
3866 This module is likely to be incompatible with any other module that
3867 automagically rewrites regexes. For example it may conflict with
3868 Regexp::DefaultFlags, Regexp::DeferredExecution, or Regexp::Extended.
3869
3871 No bugs have been reported.
3872
3873 Please report any bugs or feature requests to
3874 "bug-regexp-grammars@rt.cpan.org", or through the web interface at
3875 <http://rt.cpan.org>.
3876
3878 Damian Conway "<DCONWAY@CPAN.org>"
3879
3881 Copyright (c) 2009, Damian Conway "<DCONWAY@CPAN.org>". All rights
3882 reserved.
3883
3884 This module is free software; you can redistribute it and/or modify it
3885 under the same terms as Perl itself. See perlartistic.
3886
3888 BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
3889 FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT
3890 WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER
3891 PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND,
3892 EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
3893 WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
3894 ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH
3895 YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL
3896 NECESSARY SERVICING, REPAIR, OR CORRECTION.
3897
3898 IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
3899 WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
3900 REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE
3901 TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR
3902 CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
3903 SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
3904 RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
3905 FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
3906 SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
3907 DAMAGES.
3908
3909
3910
3911perl v5.30.1 2020-01-30 Regexp::Grammars(3)