1Marpa::XS::Debug(3)   User Contributed Perl Documentation  Marpa::XS::Debug(3)
2
3
4

NAME

6       Marpa::XS::Debug - Debugging your Marpa grammar
7

OVERVIEW

9       This document describes the more powerful of Marpa's tracing and
10       debugging techniques.  It assumes that you have written a grammar for
11       your Marpa application, and that something is going wrong.  You should
12       read the overview document for tracing problems before reading this
13       document.
14

INTRODUCTION TO EARLEY PARSING

16       To read the "show_progress" output, it is important to have a basic
17       idea of what Earley items are, and of what the information in them
18       means.  Everything that the user needs to know is explained in this
19       section.
20
21   Dotted rules
22       The idea behind Earley's algorithm is that you can parse by building a
23       table of rules and where you are in those rules.  "Where" means two
24       things: location in the rule relative to the rule's symbols, and
25       location relative to the parse's input stream.
26
27       Let's look at an example of a rule in a context-free grammar.  Here's
28       the rule for assignment from the Perl distribution's "perly.y"
29
30       "    termbinop -> term ASSIGNOP term"
31
32       "ASSIGNOP" is "perly.y"'s internal name for the assignment operator.
33       In plain Perl terms, this is the ""="" character.
34
35       In parsing this rule, we can be at any of four possible locations.  One
36       location is at the beginning, before all of the symbols.  The other
37       three locations are immediately after each of the rule's three symbols.
38
39       Within a rule, position relative to the symbols of the rule is
40       traditionally indicated with a dot.  In fact, the symbol-relative rule
41       position is very often called the dot location.  Taken as a pair, a
42       rule and a dot location are called a dotted rule.
43
44       Here's our rule with a dot location indicated:
45
46       "    termbinop -> X term ASSIGNOP term"
47
48       The dot location in this dotted rule is at the beginning.  A dot
49       location at the beginning of a dotted rule means that we have not
50       recognized any symbols in the rule yet.  All we are doing is predicting
51       that the rule will occur.  A dotted rule with the dot before all of its
52       symbols is called a prediction or a predicted rule.
53
54       Here's another dotted rule:
55
56       "    termbinop -> term X ASSIGNOP term"
57
58       In this dotted rule, we are saying we have seen a "term", but have not
59       yet recognized an "ASSIGNOP".
60
61       There's another special kind of dotted rule, a completion.  A
62       completion (also called a completed rule) is a dotted rule with the dot
63       after all of the symbols.  Here is the completion for the rule that we
64       have been using as an example:
65
66       "    termbinop -> term ASSIGNOP term X"
67
68       A completion indicates that a rule has been fully recognized.
69
70   Earley items
71       The dotted rules contain all but one piece of the information that
72       Earley's algorithm needs to track.  The missing piece is the second of
73       the two "wheres": where in the input stream.  To associate input stream
74       location and dotted rules, Earley's algorithm uses what are now called
75       Earley items.
76
77       A convenient way to think of an Earley item is as a triple, or 3-tuple,
78       consisting of dotted rule, origin and current location.  The origin is
79       the location in the input stream where the dotted rule starts.  The
80       current location (also called the dot location) is the location in the
81       input stream which corresponds to the dot position.
82
83       Two noteworthy consequences follow from the way in which origin and
84       current location are defined.  First, if a dotted rule is a prediction,
85       then origin and current location will always be the same.  Second, the
86       input stream location where a rule ends is not tracked unless the
87       dotted rule is a completion.  In other cases, an Earley item does not
88       tell us if a rule will ever be completed, much less at which location.
89

THE EXAMPLE

91       For this example of debugging, I've taken a very common example of a
92       grammar and deliberately introduced a problem.  (All the code and the
93       full trace outputs for this example are in the Appendix.)  I've
94       commented out the correct start rule:
95
96               ## { lhs => 'Expression', rhs => [qw/Term/] },
97
98       and replaced it with another start rule, one which will cause problems:
99
100               { lhs => 'Expression', rhs => [qw/Factor/] },
101
102       In what follows, we'll pretend we don't already know where the problem
103       is, and use the Marpa diagnostics and tracing facilities to "discover"
104       it.
105

FIRST WARNING

107       Right off the bat, we get two warning messages:
108
109           Inaccessible symbol: Add
110           Inaccessible symbol: Term
111
112       If we were alert, these would be enough to tell us there is a serious
113       problem.  "Inaccessible" symbols are symbols which cannot be reached
114       from the start symbol.  This means that the grammar will never produce
115       them, and that parses will never find them in the input.
116
117       Since "Add" and "Term" are both important symbols in our application,
118       that should tell us our grammar has a serious problem.  In fact, these
119       warning messages would often be enough to point us to the error.  But,
120       in order to look at more of Marpa's tracing facilities, let's pretend
121       we have not had our morning coffee, and that we miss the significance
122       of these warning messages.
123

OUTPUT FROM trace_terminals

125       Before looking at Marpa's progress reports, it is usually best to
126       orient yourself by looking at the output from "trace_terminals".
127       Typically, you will be interested in the last tokens to be accepted,
128       and perhaps also in the tokens the recognizer was looking for when it
129       didn't find what it wanted.  Sometimes that information alone is enough
130       to make it clear where the problem is.
131
132       The full "trace_terminals" output for this example is in the Appendix.
133       We see that the recognizer seems to accept ""42*1"" but it fails when
134       it confronts the plus sign (""+"").  The last two lines are:
135
136           Accepted "Number" at 2-3
137           Expecting "Multiply" at 3
138
139       A note in passing: Marpa shows the location of the tokens it accepts as
140       a range of locations.  For "Number", the range is ""2-3"", indicating
141       that the token starts at location 2 and ends at location 3.  In its
142       default input model, all tokens have length 1, so this is clearly
143       information overkill.  But Marpa allows other input models, and in
144       those models the information about start and end location of the token
145       is important.
146
147       Returning to the problem at hand: We notice that at location 3 we are
148       expecting a "Multiply" operator, but not an "Add" operator.  That
149       should strike us as strange, and send us back to the grammar.  But for
150       the sake of our example we will assume that we are slow on the uptake
151       today, and that this does not clue us in.  We move on.
152

OUTPUT FROM show_progress

154       Marpa's most powerful tool for debugging grammars is its progress
155       report, which shows the Earley items being worked on.  In the Appendix,
156       progress reports for the entire parse are shown.  Our example in this
157       document is a very small one, so that producing progress reports for
158       the entire parse is a reasonable thing to do in this case.  If a parse
159       is at all large, you will usually need to be selective.
160
161       The progress report that is usually of most interest is the one for the
162       Earley set that you were working on when the error occurred.  This is
163       called the current location.  In our example the current location is
164       location 3.  By default, "show_progress" prints out only the progress
165       reports for the current location.
166
167       Here are the progress reports for the current location, location 3,
168       from our example.
169
170             F0 @0-3 Expression -> Factor .
171             F2 @2-3 Factor -> Number .
172             R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
173             F4 @0-3 Factor -> Factor Multiply Factor .
174             F5 @0-3 Expression['] -> Expression .
175
176   Progress report lines
177       The last field of each Progress Report line shows, in fully expanded
178       form, the dotted rule we were working on.  Since that is the most
179       important information, it may be tempting to skip the rest of this
180       section, and move directly forward with the debugging.
181
182       In fact, you might want to do exactly that -- skip to the beginning of
183       the next section.  What follows talks about the details of the format
184       of the first few fields in each progress report line.  These first few
185       fields, while helpful, are also usually one or more of obvious in their
186       meaning, not relevant to our example, and repetitive of information
187       which can be deduced from other fields.
188
189             F5 @0-3 Expression['] -> Expression .
190
191       Prefixed to the dotted rule are two fields: ""F5 @0-3"".  The ""F5""
192       says that this is a completed or final rule, and that it is rule number
193       5.  The rule number is used in other tracing and debugging output, when
194       displaying the whole rule would take too much space.  In what follows
195       we won't need the rule number.
196
197       The ""@0-3"" describes the location of the dotted rule in the parse.
198       In its simplest form, the location field is two location numbers,
199       separated by a hyphen.  The first location number is the origin, the
200       place where Marpa first started recognizing the rule.  The last
201       location number is the dot location, the location location of the dot
202       in a dotted rule.  ""@0-3"" say that this rule began at location 0, and
203       that the dot is at location 3.
204
205       The current location is location 3, and this is no coincidence.
206       Whenever we are displaying the progress report for an location, all the
207       progress report lines will have their dot location at that location.
208
209       As an aside, notice that the left hand side symbol is "Expression[']".
210       That is is Marpa's special start symbol.  The presence of a completed
211       start rule in our progress report indicates that if our input ended at
212       location 3, it would be a valid sentence in the language of our
213       grammar.
214
215       Let's look at another progress report line:
216
217             R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
218
219       Here the ""R4:1"" indicates that this is rule number 4 (the ""R""
220       stands for rule number) and that its dot position is after the first
221       symbol on the right hand side.  Symbol positions are numbered using the
222       ordinal of the symbol just before the position.  Symbols are numbered
223       starting with 1, and symbol position 1 is the position immediately
224       after symbol 1.
225
226       The next field (""x2"") is new.  It is a count.  A progress report can
227       contain multiple instances of the same dotted rule, and when there is
228       more than one, a count field is included in the progress report line.
229       Here the ""x2"" indicates that there are two instances of "Factor ->
230       Factor . Multiply Factor" at this location.
231
232       Multiple instances of a dotted rule will differ in their origin, and
233       where they do, this is shown in the location field of the progress
234       report line.  Here the location field is ""@0,2-3"", which indicates
235       that one instance of this dotted rule has its origin at location 0, and
236       the other has its origin at location 2.  All instances reported on a
237       single progress report line will always have the same dot location, and
238       in this case it is location 3.
239
240       Predicted rules also appear in progress reports:
241
242           P2 @2-2 Factor -> . Number
243
244       Here the ""P"" in the summary field means "predicted".  As with much of
245       the information in the summary field, this only repeats what is obvious
246       from the full expansion of the dotted rule later in the line.  But
247       final (or completed) and predicted rules can be important and the
248       initial "F" and "P" make these lines easy to spot.
249
250       Notice that in the predicted rule, the origin is the same as the dot
251       location.  This will always be the case with predicted rules.
252
253       For any given location, no predicted rule has more than one instance.
254       For other dotted rules, there may be many instances of the dotted rule
255       at a single location.  In grammars with right recursion, the number of
256       instances is limited only by the length of the recursion.  The length
257       of a recursion is limited primarily by the available memory.
258
259       When there are many instances of a dotted rule at a single location, it
260       is inconvenient to show all the origins in a comma-separated list.  In
261       that case the origins in the location field are shown as a range, with
262       the earliest separated from the most recent by a ""..."".  The example
263       in this document contains no lines with a large number of instances,
264       but here is an example from another grammar.  This is the progress
265       report line for the completed rule in a right recursion of length 20.
266
267           F1 x20 @0...19-20 Top_sequence -> Top Top_sequence .
268
269   OK!  Now to find the bug
270       Here again are progress reports at the location where things went
271       wrong:
272
273             F0 @0-3 Expression -> Factor .
274             F2 @2-3 Factor -> Number .
275             R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
276             F4 @0-3 Factor -> Factor Multiply Factor .
277             F5 @0-3 Expression['] -> Expression .
278
279       We see that we have completed rules for "Expression", and "Factor", as
280       expected.  We also see two Earley items that show that we are in the
281       process of building another "Factor", and that it is expecting a
282       "Multiply" symbol.  This is not the rule we want, but it explains why
283       the "trace_terminals" output showed that the recognizer was expecting a
284       "Multiply" symbol.
285
286       What we want to know is, why is the recognizer not expecting an "Add"
287       symbol?  Looking back at the grammar, we see that only one rule uses
288       the "Add" symbol: the rule ""Term -> Term Add Term"".  The next step is
289       to look at the Earley items for this rule.  But there is a problem.  We
290       don't find any.
291
292       Next, we ask ourselves, what is the earliest place the ""Term -> Term
293       Add Term"" rule should be appearing?  The answer is that there should
294       be a prediction of ""Term -> Term Add Term"" at location 0.  So we look
295       at the predictions at location 0.
296
297             P0 @0-0 Expression -> . Factor
298             P2 @0-0 Factor -> . Number
299             P4 @0-0 Factor -> . Factor Multiply Factor
300             P5 @0-0 Expression['] -> . Expression
301
302       No ""Term -> Term Add Term"" rule.  We are never even predicting a
303       ""Term -> Term Add Term"" rule.  We look back at the grammar, and start
304       from the beginning.
305
306           { lhs     => 'Expression', rhs => [qw/Factor/] },
307           { lhs => 'Term',       rhs => [qw/Factor/] },
308           { lhs => 'Factor',     rhs => [qw/Number/] },
309           {   lhs    => 'Term',
310               rhs    => [qw/Term Add Term/],
311               action => 'do_add'
312           },
313           {   lhs    => 'Factor',
314               rhs    => [qw/Factor Multiply Factor/],
315               action => 'do_multiply'
316           },
317
318       Our special start symbol is "Expression[']" and we do see a rule with
319       "Expression[']" on the left hand side.  This rule in turn produces an
320       "Expression" symbol, and there is a rule with "Expression" on the left
321       hand side.  "Expression" in turn produces a "Factor" symbol, and there
322       are two rules with "Factor" on the left hand side.
323
324       But none of these rules ever produce a "Term".  In fact, however far we
325       follow the productions, no rule ever produces a "Term".  At this point
326       we see the problem: If we start at the start symbol, and follow the
327       rules of our grammar, we will never get to a "Term" symbol.  Which is
328       exactly what that first warning message was saying.
329
330       Now that we know what is wrong, we can reread our grammar, and see that
331       our "Expression -> Factor" rule is wrong.  It should be "Expression ->
332       Term".  Change that and the problem is fixed.
333

GRAMMAR REWRITING

335       Internally, Marpa rewrites Earley items and grammars.  "show_progress"
336       hides most of this from the user, but not all of it.  Some aspects of
337       Marpa's rewrites are useful to know.  For those interested, Marpa's
338       grammar rewrites are described in complete detail in
339       Marpa::XS::Rewrite.
340
341   Special symbols
342       Marpa uses a few special symbols internally which it is useful for the
343       user of "show_progress" to be aware of.  To distinguish them, Marpa's
344       internal symbols end in a right square bracket (""]"").  No user-
345       defined symbol is allowed to end in a right square bracket.
346
347       One of these special symbols is Marpa's special start symbol, which
348       always ends in ""[']"".  Marpa augments all of its grammars with a
349       special start rule, which will have the special start symbol on its
350       left hand side.  We saw this above with the "Expression['] ->
351       Expression" rule.
352
353       If the empty, or null, string is a sentence in the language of the
354       grammar, Marpa will add a special empty start rule.  The special empty
355       start rule will have its own special null start symbol on its left hand
356       side.  The special null start symbol ends in ""['][]"".
357
358   Empty rules
359       Marpa removes all of the empty and nulling rules in the original
360       grammar.  Internally, Marpa marks symbols as nulling and this produces
361       the same result much more efficiently.  Outwardly, the effect is the
362       same, so much so that you might not even notice the absence of the
363       original grammar's nulling and empty rules from the progress reports.
364
365       The only nulling rule that Marpa allows is a rule that it creates
366       internally, not one specified by the user.  If the grammar accepts the
367       null string as valid input, Marpa creates a nulling start rule.
368
369       Marpa's removal of nulling rules is recursive, as it needs to be.
370       Removing rules that are nulling reveals that the left hand side symbol
371       of those rules is also nulling.  This in turn can reveal other nulling
372       rules.
373
374   Sequences
375       Marpa allows the user to explicitly specify sequences, rather than
376       write them out in BNF.  Marpa is able to optimize explicitly specified
377       sequences.  For the actual parsing, Marpa rewrites sequences into BNF.
378
379       In the Earley items, the rules will have been translated into BNF, so
380       that is how they appear in "show_progress".  Marpa's rewritten sequence
381       rules take much the same form that a programmer's rewritten rules
382       would, if she had to do the rewrite by hand.
383
384       Here's are the rules of a Marpa grammar, with a sequence:
385
386           my $grammar = Marpa::XS::Grammar->new(
387               {   start         => 'Document',
388                   strip         => 0,
389                   lhs_terminals => 0,
390                   rules         => [
391                       { lhs => 'Document', rhs => [qw/Stuff/], min => 1 },
392                   ],
393               }
394           );
395
396       And here is how Marpa translates this sequence:
397
398             P1 @0-0 Document -> . Document[Subseq:0:1]
399             P2 @0-0 Document[Subseq:0:1] -> . Stuff
400             P3 @0-0 Document[Subseq:0:1] -> . Document[Subseq:0:1] Stuff
401             P4 @0-0 Document['] -> . Document
402

APPENDIX: FULL CODE AND OUTPUT FOR THE EXAMPLE

404       Below are the code, the trace outputs and the progress report for the
405       example used in this document.
406
407   Code
408           my $grammar = Marpa::XS::Grammar->new(
409               {   start          => 'Expression',
410                   actions        => 'My_Actions',
411                   default_action => 'first_arg',
412                   strip          => 0,
413                   rules          => [
414                       ## This is a deliberate error in the grammar
415                       ## The next line should be:
416                       ## { lhs => 'Expression', rhs => [qw/Term/] },
417                       ## I have changed the Term to 'Factor' which
418                       ## will cause problems.
419                       { lhs => 'Expression', rhs => [qw/Factor/] },
420                       { lhs => 'Term',       rhs => [qw/Factor/] },
421                       { lhs => 'Factor',     rhs => [qw/Number/] },
422                       {   lhs    => 'Term',
423                           rhs    => [qw/Term Add Term/],
424                           action => 'do_add'
425                       },
426                       {   lhs    => 'Factor',
427                           rhs    => [qw/Factor Multiply Factor/],
428                           action => 'do_multiply'
429                       },
430                   ],
431               }
432           );
433
434           $grammar->precompute();
435
436           my @tokens = (
437               [ 'Number',   42 ],
438               [ 'Multiply', q{*} ],
439               [ 'Number',   1 ],
440               [ 'Add',      q{+} ],
441               [ 'Number',   7 ],
442           );
443
444           sub My_Actions::do_add {
445               my ( undef, $t1, undef, $t2 ) = @_;
446               return $t1 + $t2;
447           }
448
449           sub My_Actions::do_multiply {
450               my ( undef, $t1, undef, $t2 ) = @_;
451               return $t1 * $t2;
452           }
453
454           sub My_Actions::first_arg { shift; return shift; }
455
456           my $recce = Marpa::XS::Recognizer->new(
457               { grammar => $grammar, trace_terminals => 2 } );
458
459           my $token_ix = 0;
460
461           TOKEN: for my $token_and_value (@tokens) {
462               last TOKEN if not defined $recce->read( @{$token_and_value} );
463           }
464
465           $progress_report = $recce->show_progress( 0, -1 );
466
467   Trace output
468           Inaccessible symbol: Add
469           Inaccessible symbol: Term
470           Setting trace_terminals option
471           Expecting "Expression" at earleme 0
472           Expecting "Factor" at earleme 0
473           Expecting "Number" at earleme 0
474           Accepted "Number" at 0-1
475           Expecting "Multiply" at 1
476           Accepted "Multiply" at 1-2
477           Expecting "Factor" at 2
478           Expecting "Number" at 2
479           Accepted "Number" at 2-3
480           Expecting "Multiply" at 3
481           Rejected "Add" at 3-4
482
483       Note the use of the term "earleme".  If you are using the default input
484       model, you can assume that earleme means "location": the earleme and
485       the location will always be exactly the same.  Advanced users, using
486       alternative input models, may set it up so that earleme and location
487       are two different things, and in that case the distinction will matter.
488
489   Progress output
490           P0 @0-0 Expression -> . Factor
491           P2 @0-0 Factor -> . Number
492           P4 @0-0 Factor -> . Factor Multiply Factor
493           P5 @0-0 Expression['] -> . Expression
494           F0 @0-1 Expression -> Factor .
495           F2 @0-1 Factor -> Number .
496           R4:1 @0-1 Factor -> Factor . Multiply Factor
497           F5 @0-1 Expression['] -> Expression .
498           P2 @2-2 Factor -> . Number
499           P4 @2-2 Factor -> . Factor Multiply Factor
500           R4:2 @0-2 Factor -> Factor Multiply . Factor
501           F0 @0-3 Expression -> Factor .
502           F2 @2-3 Factor -> Number .
503           R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
504           F4 @0-3 Factor -> Factor Multiply Factor .
505           F5 @0-3 Expression['] -> Expression .
506
508         Copyright 2012 Jeffrey Kegler
509         This file is part of Marpa::XS.  Marpa::XS is free software: you can
510         redistribute it and/or modify it under the terms of the GNU Lesser
511         General Public License as published by the Free Software Foundation,
512         either version 3 of the License, or (at your option) any later version.
513
514         Marpa::XS is distributed in the hope that it will be useful,
515         but WITHOUT ANY WARRANTY; without even the implied warranty of
516         MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
517         Lesser General Public License for more details.
518
519         You should have received a copy of the GNU Lesser
520         General Public License along with Marpa::XS.  If not, see
521         http://www.gnu.org/licenses/.
522
523
524
525perl v5.32.1                      2021-01-27               Marpa::XS::Debug(3)
Impressum