Marpa::XS::Advanced::Models(3pm)

1Marpa::XS::Advanced::MoUdseelrs(C3opnmt)ributed Perl DocMuamrepnat:a:tXiSo:n:Advanced::Models(3pm)
2
3
4

NAME

6       Marpa::XS::Advanced::Models - Other Input Models
7

ABOUT THIS DOCUMENT

9       The alternative input models described in this document are an advanced
10       technique.  If you are starting out with Marpa, you probably want to
11       ignore this document.  If you are an experienced Marpa user, it is
12       still safe to ignore this document, but you might find the
13       possibilities it discusses interesting.
14

DESCRIPTION

16   What is an alternative input model?
17       An alternative input model is anything that is not the default, token-
18       stream model.  More helpfully, Marpa allows variable-length tokens and
19       ambiguous tokens, and an alternative input model is any input model
20       which
21
22       •   Allows a token whose length is not exactly 1, or
23
24       •   Allows locations which have more than one token.
25
26       To do either of these things, a user must use the recognizer's
27       "alternative" method.  In other words, if you are not using the
28       recognizer's "alternative" method call, you are not using an
29       alternative input method.
30
31       Many concepts, such as parsing location, parse exhaustion, and the end
32       of parsing, are somewhat more complicated when alternative input models
33       are involved.  These concepts were explained in the main document for
34       the recognizer on the assumption that the default input model was in
35       use.  This document revises those explanations as necessary to take
36       into account the alternative input models.
37
38   Token streams
39       Marpa's default input model is the traditional one -- a token stream.
40       Token streams are very standard in parsing applications -- so much so
41       that most texts do not take the trouble of defining the term.  A token
42       stream is input structured as a sequence of tokens, where each token
43       occupies one location and every location has a token.  In the token
44       stream model, all tokens are of the same length.
45
46       Conventionally, all tokens are of length 1, and the token stream starts
47       at location 0.  Following this convention, the Nth token would start at
48       location N-1 and end at location N.  For example, the first token would
49       start at location 0 and end at location 1.  The second token would
50       start at location 1 and end at location 2.
51
52   Earlemes
53       For most parsers, position is location in a token stream.  To deal with
54       variable-length and overlapping tokens, Marpa needs a more flexible
55       idea of location.
56
57       Marpa's tracks position in terms of earlemes.  Earlemes are named after
58       Jay Earley, the inventor of the first algorithm in Marpa's lineage.
59       Every token has a start earleme and an end earleme.
60
61       The token stream model may also be called the token-per-earleme model.
62       In the token stream model, token location and earleme location are
63       exactly identical an one-to-one basis.  If a user's application uses
64       the token stream model, he can ignore earlemes, and think entirely in
65       terms of tokens and position in a token stream.  Because of this, the
66       main Marpa documents speak simply of the "location" in the parse.
67
68   The furthest earleme
69       The furthest earleme is the last earleme at which a token ends.  In the
70       default input model, the furthest earleme is not an important concept.
71       In the default input model, the furthest earleme and the current
72       earleme are always the same.
73
74       In alternative input models, tokens may be longer than 1 earleme, and
75       the furthest earleme and the current earleme may be far apart.  This
76       becomes an issue when parsing is finished.  Alternative input models
77       use the recognizer's "end_input" method to ensure that processing of
78       input catches up to the furthest earleme.
79

METHODS

81   alternative
82           defined $recce->alternative( 'a', 42, 1 )
83               or return 'First alternative failed';
84
85       The "alternative" method is the most general method for reading input,
86       and is used in alternative input models.  It takes three arguments,
87       only the first of which is required.
88
89       The first two arguments are similar to those of the recognizer's "read"
90       method: the token type and the token value.  As with the "read" method,
91       the token value is optional and, if omitted, defaults to a Perl
92       "undef".
93
94       The third argument to the "alternative" method is a token length.  If
95       omitted, token length defaults to 1, which is the correct value for the
96       token stream model.  Its value can be any integer greater than zero.
97       Marpa does not allow zero length tokens in any input model.
98
99       Unlike the "read" method, the "alternative" method does not advance the
100       current location (current earleme) on each call.  This allows the
101       application to read several tokens at the same earleme.  This is how
102       ambiguous input is created.  To advance the current earleme when input
103       is read using the "alternative" method, the "complete_earleme" method
104       must be called.
105
106       On success, "alternative" returns the current earleme.  On failures,
107       other than the rejection of a token in interactive mode, "alternative"
108       throws an exception.  If the token is rejected, "alternative" returns a
109       Perl "undef".
110
111   current_earleme
112           $current_earleme = $recce->current_earleme();
113
114       Returns the current parse location, also known as the current earleme.
115       Not often needed.
116
117   earleme_complete
118           $recce->earleme_complete();
119
120       Processes all tokens at the current earleme and advances the current
121       earleme by 1.  Returns the number of terminals acceptable at the new
122       current earleme.  Note that, in alternative input models, a return
123       value of 0 does not necessarily indicate an exhausted parser.
124
125       When reading input using the "alternative" method, "earleme_complete"
126       is used to move forward in the input stream.  All tokens read using the
127       "alternative" method are read at the same location -- the current
128       earleme.  A "earleme_complete" call causes the tokens read using the
129       "alternative" method to be processed, and the current earleme to be
130       advanced.
131
132       "earleme_complete" may be called even if the "alternative" method has
133       been not called since the last call to "earleme_complete".  This will
134       create an earleme with no tokens, which is often useful.
135
136   exhausted
137               $recce->exhausted() and die 'Recognizer exhausted';
138
139       The "exhausted" method returns a Perl true if parsing in a recognizer
140       is exhausted, and a Perl false otherwise.  Parsing is exhausted when
141       the recognizer will not accept any further input.  This method is
142       seldom used.  For most applications, the return values of the other
143       methods provide sufficient information about the parse status, and
144       there is no need to specifically test for parse exhaustion.
145
146       In the default input model, a recognizer was exhausted if the "read"
147       method returned 0, indicating that no tokens would be accepted at the
148       new current earleme.  In alternative input models, "earleme_complete"
149       will always return 0 when the parser is exhausted, but the converse is
150       not always true: "earleme_complete" may return 0 even in some cases
151       when the parser is not exhausted.
152
153       When "earleme_complete" returns 0, no input will be accepted at the
154       current earleme.  But in input models which allow tokens longer than 1,
155       it is possible for input to be accepted at earlemes after the current
156       earleme, even if no input is accepted at the current earleme.
157
158   end_input
159           $recce->end_input();
160
161       Indicates that input is finished.  Calling "end_input" is not necessary
162       or useful in the default input model, because in the default input
163       model no token has a length greater than 1.
164
165       The "end_input" method takes no arguments.  The "end_input" method
166       returns a Perl true value on success.  On failure, it throws an
167       exception.
168
169       In alternative input models, calling the "earleme_complete" method once
170       input is finished does not ensure that all input has been processed.
171       The "earleme_complete" method completes the current earleme, but in
172       alternative models, tokens may extend well past the current earleme.
173       The "end_input" method ensures that all input is processed.
174
175       Calling "end_input" multiple times on the same recognizer object is
176       harmless, but useless.  The second and subsequent calls will return a
177       Perl true, but will have no effect.
178

ALTERNATIVE MODELS AND THE "read" METHOD

180       A recognizer can mix calls to its "read" method with calls to its
181       "alternative" method.  The "read" method has the same effect as a
182       single call to the "alternative" method, followed immediately by a call
183       of the "earleme_complete" method.  More precisely, the "read" method is
184       equivalent to
185
186           sub Marpa::XS::Recognizer::read {
187               my $recce = shift;
188               return defined $recce->alternative(@_) ? $recce->earleme_complete() : undef;
189           }
190

AMBIGUOUS LEXING

192       Marpa allows ambiguous tokens.  Several Marpa tokens can start at a
193       single parsing location.  Ambiguous tokens can be of various lengths.
194       Tokens can also overlap.
195
196       Potentially ambiguous lexing occurs when more than one token starts at
197       a single earleme.  When potentially ambiguous lexing occurs, it becomes
198       possible for there to be more than one sequence of tokens.
199
200       An actual lexical ambiguity only occurs if more than one of the
201       potential token sequences is consistent with the grammar.  If there is
202       no actual lexical ambiguity, Marpa will use the only token choice that
203       is consistent with the grammar.
204
205       When lexing is actually ambiguous, Marpa will use all the alternatives
206       consistent with the grammar.  When the lexing in a parse is actually
207       ambiguous, the parse will be ambiguous.  From the point of view of
208       Marpa's semantics, ambiguities caused by lexing look the same as
209       ambiguities caused by an ambiguous grammar.
210
211       In the standard terminology, if a grammar produces more than one parse
212       tree for any input, then that grammar must be ambiguous.  In Marpa this
213       is not strictly true.  In Marpa, if the input is ambiguous, even an
214       unambiguous grammar can produce than one parse.
215

DUPLICATE TOKENS

217       A duplicate token is a token of the same type and the same length as
218       another that was read at the same earleme.  Duplicate tokens are
219       impossible in the default, token-stream, model.  This is because in the
220       token-stream model only one token can be read at each earleme.
221
222       In alternative models, more than one token may be read at an earleme,
223       and duplicates are possible.  Marpa detects duplicate tokens and treats
224       them as "hard errors" -- Marpa throws an exception when it sees a
225       duplicate token.  Marpa's assumption is that duplicate tokens indicate
226       an error at the application level.
227
228       An application can retry input after a duplicate token, if it catches
229       the exception.  In the future, if recovery from duplicate tokens is
230       found to be a useful technique, Marpa may provide an option to change
231       its behavior, so that a soft failure is returned when there is a
232       duplicate token.
233

EARLEMES: THE DETAILS

235       While scanning, Marpa keeps track of the current earleme.  Earlemes in
236       a parse start at earleme 0 and increase numerically.  The earleme
237       immediately following earleme 0 is earleme 1, the earleme immediately
238       following earleme 1 is earleme 2, and so on.  The earleme immediately
239       following earleme N is always earleme N+1.
240
241       Distance in the earleme stream is what you would intuitively expect it
242       to be.  The distance between earleme X and earleme Y is the absolute
243       value of the difference between X and Y, |X-Y|.  The distance from
244       earleme 3 to earleme 6, for example, is 3 earlemes.
245
246       Whenever a token is given to Marpa to be scanned, it starts at the
247       current earleme.  In addition to the type and value of the token, Marpa
248       must be told token's length in earlemes.  The length of a Marpa token
249       must be greater than zero.
250
251       This earleme length will become the distance from the start of the
252       token to the end of the token.  If the length of the token is L, and
253       the number of the current earleme is C, the end of the token will be at
254       the earleme whose number is C+L.
255

THE CHARACTER-PER-EARLEME MODEL

257       Many different models of the relationship between tokens and earlemes
258       are possible, but two are particularly important.  One is the one-
259       token-per-earleme model, which is the default, and which has already
260       been described.  The other is the one-character-per-earleme model.
261
262       In the one-character-per-earleme model, every character will be treated
263       as being exactly one earleme in length.  If a token is more than one
264       character in length, that token will span earlemes.  When the lexing is
265       ambiguous, tokens may overlap.
266
267       When a one-character-per-earleme model of input is used, there may be
268       many earlemes at which no tokens start.  For example, in a
269       straightforward character-per-earleme implementation of a grammar for a
270       language which allows comments, no tokens will start at any earlemes
271       which corresponds to character locations inside a comment.
272

OTHER INPUT MODELS

274       So far only the token-per-earleme and character-per-earleme models have
275       seen any real use in Marpa programs.  But other models are certainly
276       possible.  Using earlemes, you can structure your input in almost any
277       way you like.
278
279       There are only three restrictions:
280
281       1.  Scanning always starts at earleme 0.
282
283       2.  All tokens starting at earleme N must be scanned before any tokens
284           starting at earleme N+1.  In other words, the tokens must be
285           scanned in non-decreasing order by start earleme.
286
287       3.  Every token must have a length, in earlemes, which is greater than
288           zero.  In other words, token length can never be zero or negative.
289

COPYRIGHT AND LICENSE

291         Copyright 2012 Jeffrey Kegler
292         This file is part of Marpa::XS.  Marpa::XS is free software: you can
293         redistribute it and/or modify it under the terms of the GNU Lesser
294         General Public License as published by the Free Software Foundation,
295         either version 3 of the License, or (at your option) any later version.
296
297         Marpa::XS is distributed in the hope that it will be useful,
298         but WITHOUT ANY WARRANTY; without even the implied warranty of
299         MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
300         Lesser General Public License for more details.
301
302         You should have received a copy of the GNU Lesser
303         General Public License along with Marpa::XS.  If not, see
304         http://www.gnu.org/licenses/.
305
306
307
308perl v5.38.0                      2023-07-20  Marpa::XS::Advanced::Models(3pm)