1Marpa::XS::Debug(3) User Contributed Perl Documentation Marpa::XS::Debug(3)
2
3
4
6 Marpa::XS::Debug - Debugging your Marpa grammar
7
9 This document describes the more powerful of Marpa's tracing and
10 debugging techniques. It assumes that you have written a grammar for
11 your Marpa application, and that something is going wrong. You should
12 read the overview document for tracing problems before reading this
13 document.
14
16 To read the "show_progress" output, it is important to have a basic
17 idea of what Earley items are, and of what the information in them
18 means. Everything that the user needs to know is explained in this
19 section.
20
21 Dotted rules
22 The idea behind Earley's algorithm is that you can parse by building a
23 table of rules and where you are in those rules. "Where" means two
24 things: location in the rule relative to the rule's symbols, and
25 location relative to the parse's input stream.
26
27 Let's look at an example of a rule in a context-free grammar. Here's
28 the rule for assignment from the Perl distribution's "perly.y"
29
30 " termbinop -> term ASSIGNOP term"
31
32 "ASSIGNOP" is "perly.y"'s internal name for the assignment operator.
33 In plain Perl terms, this is the ""="" character.
34
35 In parsing this rule, we can be at any of four possible locations. One
36 location is at the beginning, before all of the symbols. The other
37 three locations are immediately after each of the rule's three symbols.
38
39 Within a rule, position relative to the symbols of the rule is
40 traditionally indicated with a dot. In fact, the symbol-relative rule
41 position is very often called the dot location. Taken as a pair, a
42 rule and a dot location are called a dotted rule.
43
44 Here's our rule with a dot location indicated:
45
46 " termbinop -> X term ASSIGNOP term"
47
48 The dot location in this dotted rule is at the beginning. A dot
49 location at the beginning of a dotted rule means that we have not
50 recognized any symbols in the rule yet. All we are doing is predicting
51 that the rule will occur. A dotted rule with the dot before all of its
52 symbols is called a prediction or a predicted rule.
53
54 Here's another dotted rule:
55
56 " termbinop -> term X ASSIGNOP term"
57
58 In this dotted rule, we are saying we have seen a "term", but have not
59 yet recognized an "ASSIGNOP".
60
61 There's another special kind of dotted rule, a completion. A
62 completion (also called a completed rule) is a dotted rule with the dot
63 after all of the symbols. Here is the completion for the rule that we
64 have been using as an example:
65
66 " termbinop -> term ASSIGNOP term X"
67
68 A completion indicates that a rule has been fully recognized.
69
70 Earley items
71 The dotted rules contain all but one piece of the information that
72 Earley's algorithm needs to track. The missing piece is the second of
73 the two "wheres": where in the input stream. To associate input stream
74 location and dotted rules, Earley's algorithm uses what are now called
75 Earley items.
76
77 A convenient way to think of an Earley item is as a triple, or 3-tuple,
78 consisting of dotted rule, origin and current location. The origin is
79 the location in the input stream where the dotted rule starts. The
80 current location (also called the dot location) is the location in the
81 input stream which corresponds to the dot position.
82
83 Two noteworthy consequences follow from the way in which origin and
84 current location are defined. First, if a dotted rule is a prediction,
85 then origin and current location will always be the same. Second, the
86 input stream location where a rule ends is not tracked unless the
87 dotted rule is a completion. In other cases, an Earley item does not
88 tell us if a rule will ever be completed, much less at which location.
89
91 For this example of debugging, I've taken a very common example of a
92 grammar and deliberately introduced a problem. (All the code and the
93 full trace outputs for this example are in the Appendix.) I've
94 commented out the correct start rule:
95
96 ## { lhs => 'Expression', rhs => [qw/Term/] },
97
98 and replaced it with another start rule, one which will cause problems:
99
100 { lhs => 'Expression', rhs => [qw/Factor/] },
101
102 In what follows, we'll pretend we don't already know where the problem
103 is, and use the Marpa diagnostics and tracing facilities to "discover"
104 it.
105
107 Right off the bat, we get two warning messages:
108
109 Inaccessible symbol: Add
110 Inaccessible symbol: Term
111
112 If we were alert, these would be enough to tell us there is a serious
113 problem. "Inaccessible" symbols are symbols which cannot be reached
114 from the start symbol. This means that the grammar will never produce
115 them, and that parses will never find them in the input.
116
117 Since "Add" and "Term" are both important symbols in our application,
118 that should tell us our grammar has a serious problem. In fact, these
119 warning messages would often be enough to point us to the error. But,
120 in order to look at more of Marpa's tracing facilities, let's pretend
121 we have not had our morning coffee, and that we miss the significance
122 of these warning messages.
123
125 Before looking at Marpa's progress reports, it is usually best to
126 orient yourself by looking at the output from "trace_terminals".
127 Typically, you will be interested in the last tokens to be accepted,
128 and perhaps also in the tokens the recognizer was looking for when it
129 didn't find what it wanted. Sometimes that information alone is enough
130 to make it clear where the problem is.
131
132 The full "trace_terminals" output for this example is in the Appendix.
133 We see that the recognizer seems to accept ""42*1"" but it fails when
134 it confronts the plus sign (""+""). The last two lines are:
135
136 Accepted "Number" at 2-3
137 Expecting "Multiply" at 3
138
139 A note in passing: Marpa shows the location of the tokens it accepts as
140 a range of locations. For "Number", the range is ""2-3"", indicating
141 that the token starts at location 2 and ends at location 3. In its
142 default input model, all tokens have length 1, so this is clearly
143 information overkill. But Marpa allows other input models, and in
144 those models the information about start and end location of the token
145 is important.
146
147 Returning to the problem at hand: We notice that at location 3 we are
148 expecting a "Multiply" operator, but not an "Add" operator. That
149 should strike us as strange, and send us back to the grammar. But for
150 the sake of our example we will assume that we are slow on the uptake
151 today, and that this does not clue us in. We move on.
152
154 Marpa's most powerful tool for debugging grammars is its progress
155 report, which shows the Earley items being worked on. In the Appendix,
156 progress reports for the entire parse are shown. Our example in this
157 document is a very small one, so that producing progress reports for
158 the entire parse is a reasonable thing to do in this case. If a parse
159 is at all large, you will usually need to be selective.
160
161 The progress report that is usually of most interest is the one for the
162 Earley set that you were working on when the error occurred. This is
163 called the current location. In our example the current location is
164 location 3. By default, "show_progress" prints out only the progress
165 reports for the current location.
166
167 Here are the progress reports for the current location, location 3,
168 from our example.
169
170 F0 @0-3 Expression -> Factor .
171 F2 @2-3 Factor -> Number .
172 R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
173 F4 @0-3 Factor -> Factor Multiply Factor .
174 F5 @0-3 Expression['] -> Expression .
175
176 Progress report lines
177 The last field of each Progress Report line shows, in fully expanded
178 form, the dotted rule we were working on. Since that is the most
179 important information, it may be tempting to skip the rest of this
180 section, and move directly forward with the debugging.
181
182 In fact, you might want to do exactly that -- skip to the beginning of
183 the next section. What follows talks about the details of the format
184 of the first few fields in each progress report line. These first few
185 fields, while helpful, are also usually one or more of obvious in their
186 meaning, not relevant to our example, and repetitive of information
187 which can be deduced from other fields.
188
189 F5 @0-3 Expression['] -> Expression .
190
191 Prefixed to the dotted rule are two fields: ""F5 @0-3"". The ""F5""
192 says that this is a completed or final rule, and that it is rule number
193 5. The rule number is used in other tracing and debugging output, when
194 displaying the whole rule would take too much space. In what follows
195 we won't need the rule number.
196
197 The ""@0-3"" describes the location of the dotted rule in the parse.
198 In its simplest form, the location field is two location numbers,
199 separated by a hyphen. The first location number is the origin, the
200 place where Marpa first started recognizing the rule. The last
201 location number is the dot location, the location location of the dot
202 in a dotted rule. ""@0-3"" say that this rule began at location 0, and
203 that the dot is at location 3.
204
205 The current location is location 3, and this is no coincidence.
206 Whenever we are displaying the progress report for an location, all the
207 progress report lines will have their dot location at that location.
208
209 As an aside, notice that the left hand side symbol is "Expression[']".
210 That is is Marpa's special start symbol. The presence of a completed
211 start rule in our progress report indicates that if our input ended at
212 location 3, it would be a valid sentence in the language of our
213 grammar.
214
215 Let's look at another progress report line:
216
217 R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
218
219 Here the ""R4:1"" indicates that this is rule number 4 (the ""R""
220 stands for rule number) and that its dot position is after the first
221 symbol on the right hand side. Symbol positions are numbered using the
222 ordinal of the symbol just before the position. Symbols are numbered
223 starting with 1, and symbol position 1 is the position immediately
224 after symbol 1.
225
226 The next field (""x2"") is new. It is a count. A progress report can
227 contain multiple instances of the same dotted rule, and when there is
228 more than one, a count field is included in the progress report line.
229 Here the ""x2"" indicates that there are two instances of "Factor ->
230 Factor . Multiply Factor" at this location.
231
232 Multiple instances of a dotted rule will differ in their origin, and
233 where they do, this is shown in the location field of the progress
234 report line. Here the location field is ""@0,2-3"", which indicates
235 that one instance of this dotted rule has its origin at location 0, and
236 the other has its origin at location 2. All instances reported on a
237 single progress report line will always have the same dot location, and
238 in this case it is location 3.
239
240 Predicted rules also appear in progress reports:
241
242 P2 @2-2 Factor -> . Number
243
244 Here the ""P"" in the summary field means "predicted". As with much of
245 the information in the summary field, this only repeats what is obvious
246 from the full expansion of the dotted rule later in the line. But
247 final (or completed) and predicted rules can be important and the
248 initial "F" and "P" make these lines easy to spot.
249
250 Notice that in the predicted rule, the origin is the same as the dot
251 location. This will always be the case with predicted rules.
252
253 For any given location, no predicted rule has more than one instance.
254 For other dotted rules, there may be many instances of the dotted rule
255 at a single location. In grammars with right recursion, the number of
256 instances is limited only by the length of the recursion. The length
257 of a recursion is limited primarily by the available memory.
258
259 When there are many instances of a dotted rule at a single location, it
260 is inconvenient to show all the origins in a comma-separated list. In
261 that case the origins in the location field are shown as a range, with
262 the earliest separated from the most recent by a ""..."". The example
263 in this document contains no lines with a large number of instances,
264 but here is an example from another grammar. This is the progress
265 report line for the completed rule in a right recursion of length 20.
266
267 F1 x20 @0...19-20 Top_sequence -> Top Top_sequence .
268
269 OK! Now to find the bug
270 Here again are progress reports at the location where things went
271 wrong:
272
273 F0 @0-3 Expression -> Factor .
274 F2 @2-3 Factor -> Number .
275 R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
276 F4 @0-3 Factor -> Factor Multiply Factor .
277 F5 @0-3 Expression['] -> Expression .
278
279 We see that we have completed rules for "Expression", and "Factor", as
280 expected. We also see two Earley items that show that we are in the
281 process of building another "Factor", and that it is expecting a
282 "Multiply" symbol. This is not the rule we want, but it explains why
283 the "trace_terminals" output showed that the recognizer was expecting a
284 "Multiply" symbol.
285
286 What we want to know is, why is the recognizer not expecting an "Add"
287 symbol? Looking back at the grammar, we see that only one rule uses
288 the "Add" symbol: the rule ""Term -> Term Add Term"". The next step is
289 to look at the Earley items for this rule. But there is a problem. We
290 don't find any.
291
292 Next, we ask ourselves, what is the earliest place the ""Term -> Term
293 Add Term"" rule should be appearing? The answer is that there should
294 be a prediction of ""Term -> Term Add Term"" at location 0. So we look
295 at the predictions at location 0.
296
297 P0 @0-0 Expression -> . Factor
298 P2 @0-0 Factor -> . Number
299 P4 @0-0 Factor -> . Factor Multiply Factor
300 P5 @0-0 Expression['] -> . Expression
301
302 No ""Term -> Term Add Term"" rule. We are never even predicting a
303 ""Term -> Term Add Term"" rule. We look back at the grammar, and start
304 from the beginning.
305
306 { lhs => 'Expression', rhs => [qw/Factor/] },
307 { lhs => 'Term', rhs => [qw/Factor/] },
308 { lhs => 'Factor', rhs => [qw/Number/] },
309 { lhs => 'Term',
310 rhs => [qw/Term Add Term/],
311 action => 'do_add'
312 },
313 { lhs => 'Factor',
314 rhs => [qw/Factor Multiply Factor/],
315 action => 'do_multiply'
316 },
317
318 Our special start symbol is "Expression[']" and we do see a rule with
319 "Expression[']" on the left hand side. This rule in turn produces an
320 "Expression" symbol, and there is a rule with "Expression" on the left
321 hand side. "Expression" in turn produces a "Factor" symbol, and there
322 are two rules with "Factor" on the left hand side.
323
324 But none of these rules ever produce a "Term". In fact, however far we
325 follow the productions, no rule ever produces a "Term". At this point
326 we see the problem: If we start at the start symbol, and follow the
327 rules of our grammar, we will never get to a "Term" symbol. Which is
328 exactly what that first warning message was saying.
329
330 Now that we know what is wrong, we can reread our grammar, and see that
331 our "Expression -> Factor" rule is wrong. It should be "Expression ->
332 Term". Change that and the problem is fixed.
333
335 Internally, Marpa rewrites Earley items and grammars. "show_progress"
336 hides most of this from the user, but not all of it. Some aspects of
337 Marpa's rewrites are useful to know. For those interested, Marpa's
338 grammar rewrites are described in complete detail in
339 Marpa::XS::Rewrite.
340
341 Special symbols
342 Marpa uses a few special symbols internally which it is useful for the
343 user of "show_progress" to be aware of. To distinguish them, Marpa's
344 internal symbols end in a right square bracket (""]""). No user-
345 defined symbol is allowed to end in a right square bracket.
346
347 One of these special symbols is Marpa's special start symbol, which
348 always ends in ""[']"". Marpa augments all of its grammars with a
349 special start rule, which will have the special start symbol on its
350 left hand side. We saw this above with the "Expression['] ->
351 Expression" rule.
352
353 If the empty, or null, string is a sentence in the language of the
354 grammar, Marpa will add a special empty start rule. The special empty
355 start rule will have its own special null start symbol on its left hand
356 side. The special null start symbol ends in ""['][]"".
357
358 Empty rules
359 Marpa removes all of the empty and nulling rules in the original
360 grammar. Internally, Marpa marks symbols as nulling and this produces
361 the same result much more efficiently. Outwardly, the effect is the
362 same, so much so that you might not even notice the absence of the
363 original grammar's nulling and empty rules from the progress reports.
364
365 The only nulling rule that Marpa allows is a rule that it creates
366 internally, not one specified by the user. If the grammar accepts the
367 null string as valid input, Marpa creates a nulling start rule.
368
369 Marpa's removal of nulling rules is recursive, as it needs to be.
370 Removing rules that are nulling reveals that the left hand side symbol
371 of those rules is also nulling. This in turn can reveal other nulling
372 rules.
373
374 Sequences
375 Marpa allows the user to explicitly specify sequences, rather than
376 write them out in BNF. Marpa is able to optimize explicitly specified
377 sequences. For the actual parsing, Marpa rewrites sequences into BNF.
378
379 In the Earley items, the rules will have been translated into BNF, so
380 that is how they appear in "show_progress". Marpa's rewritten sequence
381 rules take much the same form that a programmer's rewritten rules
382 would, if she had to do the rewrite by hand.
383
384 Here's are the rules of a Marpa grammar, with a sequence:
385
386 my $grammar = Marpa::XS::Grammar->new(
387 { start => 'Document',
388 strip => 0,
389 lhs_terminals => 0,
390 rules => [
391 { lhs => 'Document', rhs => [qw/Stuff/], min => 1 },
392 ],
393 }
394 );
395
396 And here is how Marpa translates this sequence:
397
398 P1 @0-0 Document -> . Document[Subseq:0:1]
399 P2 @0-0 Document[Subseq:0:1] -> . Stuff
400 P3 @0-0 Document[Subseq:0:1] -> . Document[Subseq:0:1] Stuff
401 P4 @0-0 Document['] -> . Document
402
404 Below are the code, the trace outputs and the progress report for the
405 example used in this document.
406
407 Code
408 my $grammar = Marpa::XS::Grammar->new(
409 { start => 'Expression',
410 actions => 'My_Actions',
411 default_action => 'first_arg',
412 strip => 0,
413 rules => [
414 ## This is a deliberate error in the grammar
415 ## The next line should be:
416 ## { lhs => 'Expression', rhs => [qw/Term/] },
417 ## I have changed the Term to 'Factor' which
418 ## will cause problems.
419 { lhs => 'Expression', rhs => [qw/Factor/] },
420 { lhs => 'Term', rhs => [qw/Factor/] },
421 { lhs => 'Factor', rhs => [qw/Number/] },
422 { lhs => 'Term',
423 rhs => [qw/Term Add Term/],
424 action => 'do_add'
425 },
426 { lhs => 'Factor',
427 rhs => [qw/Factor Multiply Factor/],
428 action => 'do_multiply'
429 },
430 ],
431 }
432 );
433
434 $grammar->precompute();
435
436 my @tokens = (
437 [ 'Number', 42 ],
438 [ 'Multiply', q{*} ],
439 [ 'Number', 1 ],
440 [ 'Add', q{+} ],
441 [ 'Number', 7 ],
442 );
443
444 sub My_Actions::do_add {
445 my ( undef, $t1, undef, $t2 ) = @_;
446 return $t1 + $t2;
447 }
448
449 sub My_Actions::do_multiply {
450 my ( undef, $t1, undef, $t2 ) = @_;
451 return $t1 * $t2;
452 }
453
454 sub My_Actions::first_arg { shift; return shift; }
455
456 my $recce = Marpa::XS::Recognizer->new(
457 { grammar => $grammar, trace_terminals => 2 } );
458
459 my $token_ix = 0;
460
461 TOKEN: for my $token_and_value (@tokens) {
462 last TOKEN if not defined $recce->read( @{$token_and_value} );
463 }
464
465 $progress_report = $recce->show_progress( 0, -1 );
466
467 Trace output
468 Inaccessible symbol: Add
469 Inaccessible symbol: Term
470 Setting trace_terminals option
471 Expecting "Expression" at earleme 0
472 Expecting "Factor" at earleme 0
473 Expecting "Number" at earleme 0
474 Accepted "Number" at 0-1
475 Expecting "Multiply" at 1
476 Accepted "Multiply" at 1-2
477 Expecting "Factor" at 2
478 Expecting "Number" at 2
479 Accepted "Number" at 2-3
480 Expecting "Multiply" at 3
481 Rejected "Add" at 3-4
482
483 Note the use of the term "earleme". If you are using the default input
484 model, you can assume that earleme means "location": the earleme and
485 the location will always be exactly the same. Advanced users, using
486 alternative input models, may set it up so that earleme and location
487 are two different things, and in that case the distinction will matter.
488
489 Progress output
490 P0 @0-0 Expression -> . Factor
491 P2 @0-0 Factor -> . Number
492 P4 @0-0 Factor -> . Factor Multiply Factor
493 P5 @0-0 Expression['] -> . Expression
494 F0 @0-1 Expression -> Factor .
495 F2 @0-1 Factor -> Number .
496 R4:1 @0-1 Factor -> Factor . Multiply Factor
497 F5 @0-1 Expression['] -> Expression .
498 P2 @2-2 Factor -> . Number
499 P4 @2-2 Factor -> . Factor Multiply Factor
500 R4:2 @0-2 Factor -> Factor Multiply . Factor
501 F0 @0-3 Expression -> Factor .
502 F2 @2-3 Factor -> Number .
503 R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
504 F4 @0-3 Factor -> Factor Multiply Factor .
505 F5 @0-3 Expression['] -> Expression .
506
508 Copyright 2012 Jeffrey Kegler
509 This file is part of Marpa::XS. Marpa::XS is free software: you can
510 redistribute it and/or modify it under the terms of the GNU Lesser
511 General Public License as published by the Free Software Foundation,
512 either version 3 of the License, or (at your option) any later version.
513
514 Marpa::XS is distributed in the hope that it will be useful,
515 but WITHOUT ANY WARRANTY; without even the implied warranty of
516 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
517 Lesser General Public License for more details.
518
519 You should have received a copy of the GNU Lesser
520 General Public License along with Marpa::XS. If not, see
521 http://www.gnu.org/licenses/.
522
523
524
525perl v5.28.0 2018-07-14 Marpa::XS::Debug(3)