1Marpa::XS::Grammar(3) User Contributed Perl DocumentationMarpa::XS::Grammar(3)
2
3
4
6 Marpa::XS::Grammar - Marpa grammars
7
9 my $grammar = Marpa::XS::Grammar->new(
10 { start => 'Expression',
11 actions => 'My_Actions',
12 default_action => 'first_arg',
13 rules => [
14 { lhs => 'Expression', rhs => [qw/Term/] },
15 { lhs => 'Term', rhs => [qw/Factor/] },
16 { lhs => 'Factor', rhs => [qw/Number/] },
17 { lhs => 'Term', rhs => [qw/Term Add Term/], action => 'do_add' },
18 { lhs => 'Factor',
19 rhs => [qw/Factor Multiply Factor/],
20 action => 'do_multiply'
21 },
22 ],
23 }
24 );
25
26 $grammar->precompute();
27
29 To create a Marpa grammar object, use the "new" method. Rules and
30 symbols may be specified when the grammar is created.
31
32 To change a Marpa grammar object, use the "set" method. New rules may
33 be added until a grammar is precomputed.
34
35 A grammar cannot be used for parsing until it is precomputed. To
36 precompute a Marpa grammar object, use the "precompute" method. After
37 precomputation, no new rules may added and most other changes are
38 forbidden.
39
40 Symbol names
41 Symbol names that end in right square brackets are reserved for Marpa's
42 internal use. Any other valid Perl string is an acceptable symbol
43 name.
44
45 Terminal symbols
46 If a grammar has no empty rules, all symbols are terminals by default.
47 Unlike most parser generators, Marpa will allow terminals to appear on
48 the left hand side of rules. Marpa defines a terminal as a symbol
49 which is valid as an input token symbol.
50
51 Marpa allows direct marking of terminals using the "terminals" named
52 argument, and the "terminal" property. If any terminal is directly
53 marked, only directly marked symbols will be terminals.
54
55 Marpa's behavior can be changed by unsetting the "lhs_terminals" named
56 argument. When the "lhs_terminals" named argument is unset, only
57 symbols which do not appear on the LHS of a rule can be terminals. If
58 no symbols are directly marked, Marpa will implicitly mark all the non-
59 LHS symbols as terminals.
60
61 Marpa is stricter for grammars with empty rules -- it does not allow
62 them to take the default. There is an efficiency hit whenever the LHS
63 of an empty rule is a terminal as well, so Marpa does not allow this to
64 happen by default -- the user has to be explicitly indicate that that
65 is what she wants. If a grammar has any empty rules, it must either
66 directly mark its terminals, or must unset the "lhs_terminals" named
67 argument.
68
69 One advantage of not taking the default is efficiency. Precomputation
70 is faster for grammars which mark their terminals. Also, the
71 recognizer checks that input tokens are terminals, so that being
72 selective about which symbols are terminals and which are not can allow
73 better input checking.
74
75 Summary
76
77 The following summarizes the logic which determines whether a symbol
78 "S" is terminal. As a reminder, a symbol is said to be directly marked
79 as a terminal if it is one of the symbols in the "terminals" named
80 argument, or if the symbol has the "terminal" property set.
81
82 • If any symbol is directly marked as a terminal, then a symbol "S"
83 is a terminal if and only if it is also directly marked as a
84 terminal.
85
86 • If no symbol is directly marked as a terminal, and the
87 "lhs_terminals" named argument is unset, then all non-LHS symbols
88 are terminals.
89
90 • If the grammar contains no empty rules, no symbol is directly
91 marked as a terminal, and the "lhs_terminals" named argument is set
92 or left at its default, then all symbols are terminals.
93
94 • The only case not covered above is that in which a grammar contains
95 one or more empty rules, no symbol is directly marked as a
96 terminal, and the "lhs_terminals" named argument is set or left at
97 its default. This is a fatal error.
98
99 Sequence rules
100 It is very common in a grammar for one symbol to produce a repeating
101 sequence. Marpa allows a shorthand for this: sequence rules. The RHS
102 of a sequence rule will be repeated, as specified by the "min" rule
103 property. In sequence rules the RHS must always be one symbol in
104 length, and that symbol may not be a nullable symbol.
105
106 A rule is a sequence rule if the "min" rule property is defined. "min"
107 can be 0 or 1, and specifies the minimum number of times that the
108 sequence is allowed to repeat. As of this writing, the maximum number
109 of repetitions is always infinite.
110
111 { lhs => 'sequence', rhs => ['item'], min => 0 }
112
113 A "min" of zero indicates a sequence that repeats zero or more times.
114 This is the equivalent of using the star quantifier (""*"") in the
115 standard regular expression notation.
116
117 { lhs => 'sequence', rhs => ['item'], min => 1 }
118
119 A "min" of one indicates a sequence that repeats one or more times.
120 This is the equivalent of using the plus quantifier (""+"") in the
121 standard regular expression notation.
122
123 Sequences can have a separator, specified with the "separator" rule
124 property. By default, separation is Perl-style: trailing separators
125 are allowed. In ""proper"" separation, a separator must actually
126 separate two sequence items and therefore is not allowed after the last
127 item of a sequence. If you prefer ""proper"" separation, you can set
128 the "proper" rule property.
129
130 Advantages of sequence rules
131
132 You are never forced to use sequence rules, but it's usually better if
133 you do. When a sequence is written as a sequence rule, Marpa optimizes
134 it.
135
136 When a sequence is written using non-sequence rules, the semantics
137 typically wind up being spread over two or three Perl closures. The
138 semantic action for a sequence rule is a single Perl closure. Putting
139 the semantics into a single Perl closure often results in simpler and
140 more natural code. See the section on sequences in the semantics
141 document.
142
143 Caveats
144
145 Marpa throws an exception if you try to use a nullable symbol as the
146 right hand side of a sequence rule, or as the separator for a sequence
147 rule. The ban on nullables in sequences only applies to sequences when
148 they are written using sequence rules. Nothing prevents you from
149 specifying a sequence of nullables using non-sequence rules. But
150 usually there is no good reason to do this, and sequences of nullable
151 can be highly ambiguous, which makes them a good thing to avoid for
152 efficiency reasons.
153
154 To keep things simple, the right hand side of a sequence rule must be a
155 single symbol. Of course, applications will often want to repeat
156 sequences of multiple symbols. That is easy to do indirectly:
157
158 { lhs => 'sequence', rhs => [qw(item)], min => 0 },
159 { lhs => 'item', rhs => [qw(part1 part2)], },
160
162 new
163 my $grammar = Marpa::XS::Grammar->new(
164 { start => 'Expression',
165 actions => 'My_Actions',
166 default_action => 'first_arg',
167 rules => [
168 { lhs => 'Expression', rhs => [qw/Term/] },
169 { lhs => 'Term', rhs => [qw/Factor/] },
170 { lhs => 'Factor', rhs => [qw/Number/] },
171 { lhs => 'Term', rhs => [qw/Term Add Term/], action => 'do_add' },
172 { lhs => 'Factor',
173 rhs => [qw/Factor Multiply Factor/],
174 action => 'do_multiply'
175 },
176 ],
177 }
178 );
179
180 "Marpa::XS::Grammar::new" returns a new Marpa grammar object or throws
181 an exception. The arguments to "Marpa::XS::Grammar::new" are
182 references to hashes of named arguments. In each key/value pair of
183 this hash, the hash key is the argument name and the hash value is the
184 value of the named argument. The available named arguments are
185 described below.
186
188 precompute
189 $grammar->precompute();
190
191 The "precompute" method compiles data structures that the recognizer
192 will need. It returns the grammar object or throws an exception.
193
194 set
195 $grammar->set( { trace_file_handle => $trace_fh } );
196
197 The arguments to the "set" method are references to hashes of named
198 arguments. The available named arguments are described below. "set"
199 either returns true or throws an exception.
200
202 check_terminal
203 Returns a Perl true when its argument is the name of a terminal symbol.
204 Otherwise, returns a Perl false. Not often needed, but a lexer may
205 find this the most convenient way to determine if a symbol is a
206 terminal.
207
209 show_AHFA
210 print $grammar->show_AHFA()
211 or die "print failed: $ERRNO";
212
213 Not of interest to most users. Returns a multi-line string listing the
214 states of an internal data structure called the Aycock-Horspool Finite
215 Automaton. Not useful before the grammar is precomputed.
216
217 show_problems
218 print $grammar->show_problems()
219 or die "print failed: $ERRNO";
220
221 Usually the application does not call this method directly. Returns a
222 string describing any serious but non-fatal problems a grammar had in
223 the precomputation phase. A serious problem is one that will prevent
224 parsing. Warnings are not serious problems in this sense. If there
225 were no serious problems, returns a string saying so. This method is
226 not useful before precomputation.
227
228 In Marpa, most serious grammar problems are not immediately thrown as
229 exceptions. This is because there can be a number of serious problems
230 in a grammar, particularly one that is large or in an early draft. If
231 each serious problem caused an immediate exception, the user would have
232 to fix them one at a time -- very tedious.
233
234 The recognizer throws an exception when the user attempts to create a
235 parse from a grammar with serious problems. When that happens, the
236 string returned by "show_problems" is part of the error message.
237
238 show_rules
239 print $grammar->show_rules()
240 or die "print failed: $ERRNO";
241
242 Returns a string listing the rules. Each rule is shown with comments
243 which indicate either rule properties or internal settings.
244 "show_rules" is useful in debugging grammars.
245
246 Marpa does extensive rewriting of its grammars, and both the original
247 rules and the rewritten rules appear in the "show_rules" list. When a
248 rule is rewritten, the original rule is often not used. In that case,
249 ""!used"" will be one of the comments for the original rule. The
250 ""!used"" comment also marks rules not used for reasons other than
251 rewrites. For example, inaccessible and unproductive rules are also
252 marked ""!used"".
253
254 The ""discard_sep"" comment indicates that the rule discards separators
255 This is only relevant in sequence rules. Other comments indicate
256 whether rules were nullable, unproductive, inaccessible, or empty.
257
258 The "vlhs", "vrhs" and "real" comments show rule settings relevant in
259 tracking "virtual" internal symbols. These are used internally, to
260 optimize evaluation.
261
262 show_symbols
263 print $grammar->show_symbols()
264 or die "print failed: $ERRNO";
265
266 Returns a string listing the symbols, along with comments indicating
267 whether they were terminal, nulling, nullable, unproductive or
268 inaccessible. Also shown is a list of rules with that symbol on the
269 left hand side, and a list of rules which have that symbol anywhere on
270 the right hand side. Useful for debugging grammars.
271
273 action_object
274 The "action_object" named argument specifies a class name to be used in
275 resolving action names to Perl closures. A "new" constructor must be
276 defined in the "action_object" package. It will be used to create the
277 per-parse-tree variables. Details are in the document on semantics.
278
279 actions
280 actions => 'My_Actions',
281
282 The "actions" named argument specifies the Perl package that Marpa will
283 use when resolving action names to Perl closures. If both an "actions"
284 named argument and an "action_object" named argument are specified, the
285 package from the "actions" named argument is the only one used to
286 resolve action names. The "actions" package is treated only as a
287 package, and not as a class. Any "new" constructor in the "actions"
288 package is ignored. Details are given in the document on semantics.
289
290 default_action
291 default_action => 'first_arg',
292
293 The "default_action" named argument specifies the value action name for
294 rules without an "action" property. Details are given in the document
295 on semantics.
296
297 default_null_value
298 The "default_null_value" named argument specifies the null value for
299 symbols without a "null_value" property. Details are given in the
300 document on semantics.
301
302 inaccessible_ok
303 The value must be a reference to an array of symbol names. By default,
304 Marpa warns if a symbol is inaccessible, but the warning is suppressed
305 for any symbol named in the array. Setting the "inaccessible_ok" named
306 argument after grammar precomputation is useless, and itself results in
307 a warning.
308
309 Inaccessible symbols are symbols which cannot be derived from the start
310 symbol, and which therefore can never be part of a successful parse.
311 Inaccessible symbols often indicate errors in grammar design. But a
312 user may have plans for these symbols, may wish to keep them as notes,
313 or may simply wish to deal with them later.
314
315 infinite_action
316 Takes as its value a string specifying what Marpa should do if it
317 discovers that its grammar is infinitely ambiguous. The value must be
318 one of ""fatal"", ""warn"" or ""quiet"". A grammar is infinitely
319 ambiguous if there is some input for which it produces an endless
320 number of parses.
321
322 If the value is ""fatal"", Marpa throws an exception when it encounters
323 an infinitely ambiguous grammar. This is the default and will usually
324 be what the user wants. In most cases, an infinitely ambiguous grammar
325 is simply a mistake.
326
327 ""quiet"" indicates that the user wants to allow infinitely ambiguous
328 grammars. ""warn"" indicates that the user wants to allow infinitely
329 ambiguous grammars, but wants a warning message to be printed to the
330 trace file handle.
331
332 lhs_terminals
333 The value of the "lhs_terminals" named argument is a Boolean. If true,
334 symbols which appear on the LHS of a rule are allowed to be terminals.
335 "lhs_terminals" is true by default.
336
337 If "lhs_terminals" is unset and no symbols are directly marked as
338 terminals, Marpa will mark all non-LHS symbols as terminals. If any
339 symbol is directly marked, all terminals must be directly marked. If
340 "lhs_terminals" is unset, but some symbols are directly marked, it is a
341 fatal error for a terminal to appear on the LHS of a rule. For more,
342 see the discussion of terminals above.
343
344 rules
345 The value of the "rules" named argument is a reference to an array of
346 rule descriptors. The "rules" named argument may be specified multiple
347 times, adding new rules to the grammar each time. New rules may be
348 added until the grammar is precomputed. The format of rule descriptors
349 is explained below.
350
351 start
352 start => 'Expression',
353
354 The value of the "start" named argument must be a symbol name. It will
355 be used as the start symbol for the grammar. The "start" named
356 argument is required.
357
358 symbols
359 The value of the "symbols" named arguments must be a reference to a
360 hash. In each key/value pair of this hash, the hash key is the symbol
361 property name and the hash value is the symbol descriptor. Symbol
362 descriptors are described below.
363
364 Note that the value of "symbols" named argument is a hash, but the
365 value of the "rules" named argument is an array. This is because
366 symbol names make convenient hash keys. For rules, there is no equally
367 natural choice for a hash key.
368
369 terminals
370 The value of the "terminals" named argument must be a reference to an
371 array of symbol names. Specifying a symbol's name in a "terminals"
372 named argument is one way of directly marking it as a terminal. See
373 the discussion of terminals above.
374
375 trace_file_handle
376 The value is a file handle. Trace output and warning messages go to
377 the trace file handle. By default the trace file handle is "STDERR".
378
379 trace_rules
380 Traces rules as they are added to the grammar. Useful, but you will
381 usually prefer the "show_rules" method. A "trace_rules" setting
382 becomes effective within the named argument hash which sets it. A
383 trace message warns the user if he turns on rule tracing when rules
384 have already been added.
385
386 unproductive_ok
387 The value must be a reference to an array of symbol names. By default,
388 Marpa warns if a symbol is unproductive, but the warning is suppressed
389 for any symbol named in the array. Setting the "unproductive_ok" named
390 argument after grammar precomputation is useless, and itself results in
391 a warning.
392
393 Unproductive symbols are symbols which can never derive a sentence. (A
394 sentence is a string of zero or more terminals.) That means that
395 unproductive symbols can never be part of a successful parse.
396 Unproductive symbols often indicate errors in grammar design. But a
397 user may have plans for these symbols, may wish to keep them as notes,
398 or may simply wish to deal with them later.
399
400 warnings
401 The value is a boolean. Warnings are written to the trace file handle.
402 By default, warnings are on. Usually, an application will want to
403 leave them on. If warnings are turned off, turning them back on after
404 grammar precomputation is useless, and itself results in a warning.
405
407 rules => [
408 { lhs => 'Expression', rhs => [qw/Term/] },
409 { lhs => 'Term', rhs => [qw/Factor/] },
410 { lhs => 'Factor', rhs => [qw/Number/] },
411 { lhs => 'Term', rhs => [qw/Term Add Term/], action => 'do_add' },
412 { lhs => 'Factor',
413 rhs => [qw/Factor Multiply Factor/],
414 action => 'do_multiply'
415 },
416 ],
417
418 Rule descriptors as hashes
419 The long form descriptor of a rule is a reference to a hash of rule
420 properties. In each key/value pair of this hash, the hash key is the
421 rule property name and the hash value is the value of that property.
422
423 action
424 The value of the "action" rule property is a string which specifies the
425 semantics for the rule. For details, see the document on semantics.
426
427 The semantics of nulling symbols are dealt with on a per-symbol basis,
428 rather than a per-rule basis. For this reason the "action" rule
429 property is useless for empty rules. An exception is thrown if an
430 "action" property is defined for an empty rule.
431
432 keep
433 Separators in sequence rules are usually not semantically significant.
434 By default, Marpa throws away separators during parse tree traversal
435 and before node evaluation time, so that the semantic actions do not
436 see the separators.
437
438 If the value of the "keep" rule property is a Perl true, Marpa keeps
439 separators. This allows the semantic actions to examine them. The
440 downside is that the work of distinguishing sequence separators from
441 sequence items is pushed into the semantic actions. For details about
442 the semantics, see the document on semantics.
443
444 lhs
445 The value of the "lhs" rule property must be a string containing the
446 name of the rule's left hand side symbol. Every Marpa rule must have a
447 left hand side symbol.
448
449 The "lhs" plays a role in the semantics. If no "action" rule property
450 is defined, Marpa looks in either the grammar's "actions" package or
451 the grammar's "action_object" for a Perl closure that has the name of
452 the "lhs" symbol. For details, see the document on semantics.
453
454 min
455 "min" must be 0, 1, or undefined. If "min" is 0 or 1, the rule is a
456 sequence rule. If "min" is undefined, the rule is an ordinary, non-
457 sequence rule.
458
459 Only one symbol, called the sequence item, is allowed on the right hand
460 side of a sequence rule. The sequence item may not be a nullable
461 symbol. The input will be required to match the sequence item at least
462 "min" times and will be allowed to match the sequence item an unlimited
463 number of times.
464
465 null_ranking
466 "null_ranking" is ignored unless the recognizer's "ranking_method"
467 named argument is set to something other than its default. The
468 "null_ranking" named argument allows the application to control the
469 order in which rules with nullable symbols are returned by the "value"
470 method. Such rules can match the same input in several ways depending
471 on which symbols are nulled. These different ways of nulling symbols
472 in a rule are called its null variants.
473
474 If "null_ranking" is undefined, the order of the null variants will be
475 arbitrary. This is the default, and is acceptable to most
476 applications. For details on using the "null_ranking" named argument,
477 see the document on parse order.
478
479 proper
480 By default, sequence rules with separators allow trailing separators,
481 Perl-style. If the "proper" rule property is a Perl true, ""proper""
482 separation is enforced. In proper separation, separation must actually
483 separate sequence items, and trailing separators are not allowed.
484
485 rank
486 "rank" is ignored unless the recognizer's "ranking_method" named
487 argument is set to something other than its default. "rank", when
488 specified, must be an integer. It may be negative. "rank" is 0 by
489 default. For details on using the "rank" named argument, see the
490 document on parse order.
491
492 rhs
493 The value of the "rhs" property is a reference to an array of strings
494 containing the names of the rule's right hand symbols, in order. This
495 array may be zero length, in which case this is an empty rule -- a rule
496 with no symbols on the right hand side. A rule is also empty if there
497 is no "rhs" specifier in its descriptor.
498
499 separator
500 Any sequence rule may have a "separator" defined. The value must be a
501 symbol name. By default, Marpa allows trailing separators. This is
502 the usual style in Perl. The separator must not be a nullable symbol.
503
504 Rule descriptors as arrays
505 rules => [
506 [ 'E', [qw/E Add E/], 'do_add' ],
507 [ 'E', [qw/E Multiply E/], 'do_multiply' ],
508 [ 'E', [qw/Number/], ],
509 ],
510
511 Rule descriptors may be given in "short form" -- as a reference to an
512 array. The elements of the array, in order, are the "lhs" property,
513 the "rhs" property, and the "action" property. The last two are
514 optional. Omission of an optional property in a short form descriptor
515 has the same effect that omitting the same optional property would have
516 in the long form.
517
518 Duplicate rules
519 Marpa throws an exception if a duplicate rule is added. Two rules are
520 considered duplicates if
521
522 • Both rules have the same left hand symbol.
523
524 • Both rules have the same right hand symbols in the same order.
525
526 This definition applies to sequence rules, as well as to ordinary
527 rules. As a consequence, sequence rules can be considered duplicates
528 even when they have different separators and/or different minimum
529 counts.
530
532 symbols => {
533 L => { null_value => 'null L' },
534 R => { null_value => 'null R' },
535 A => { null_value => 'null A' },
536 B => { null_value => 'null B' },
537 X => { null_value => 'null X', terminal => 1 },
538 Y => { null_value => 'null Y', terminal => 1 },
539 },
540
541 A symbol descriptor is a hash. In the key/value pairs of this hash,
542 the hash key is the symbol property name and the hash value is the
543 value of that property. The available symbol properties are as
544 follows:
545
546 null_value
547 The "null_value" symbol property specifies the null value of its
548 symbol. Details are given in the document on semantics.
549
550 terminal
551 A boolean. If true, it directly marks the symbol as a terminal. If
552 false, it unmarks the symbol as a terminal. For details, see the
553 section on terminals.
554
556 Copyright 2012 Jeffrey Kegler
557 This file is part of Marpa::XS. Marpa::XS is free software: you can
558 redistribute it and/or modify it under the terms of the GNU Lesser
559 General Public License as published by the Free Software Foundation,
560 either version 3 of the License, or (at your option) any later version.
561
562 Marpa::XS is distributed in the hope that it will be useful,
563 but WITHOUT ANY WARRANTY; without even the implied warranty of
564 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
565 Lesser General Public License for more details.
566
567 You should have received a copy of the GNU Lesser
568 General Public License along with Marpa::XS. If not, see
569 http://www.gnu.org/licenses/.
570
571
572
573perl v5.36.0 2023-01-20 Marpa::XS::Grammar(3)