1Marpa::XS::Advanced::MoUdseelrs(C3o)ntributed Perl DocumMeanrtpaat:i:oXnS::Advanced::Models(3)
2
3
4
6 Marpa::XS::Advanced::Models - Other Input Models
7
9 The alternative input models described in this document are an advanced
10 technique. If you are starting out with Marpa, you probably want to
11 ignore this document. If you are an experienced Marpa user, it is
12 still safe to ignore this document, but you might find the
13 possibilities it discusses interesting.
14
16 What is an alternative input model?
17 An alternative input model is anything that is not the default, token-
18 stream model. More helpfully, Marpa allows variable-length tokens and
19 ambiguous tokens, and an alternative input model is any input model
20 which
21
22 • Allows a token whose length is not exactly 1, or
23
24 • Allows locations which have more than one token.
25
26 To do either of these things, a user must use the recognizer's
27 "alternative" method. In other words, if you are not using the
28 recognizer's "alternative" method call, you are not using an
29 alternative input method.
30
31 Many concepts, such as parsing location, parse exhaustion, and the end
32 of parsing, are somewhat more complicated when alternative input models
33 are involved. These concepts were explained in the main document for
34 the recognizer on the assumption that the default input model was in
35 use. This document revises those explanations as necessary to take
36 into account the alternative input models.
37
38 Token streams
39 Marpa's default input model is the traditional one -- a token stream.
40 Token streams are very standard in parsing applications -- so much so
41 that most texts do not take the trouble of defining the term. A token
42 stream is input structured as a sequence of tokens, where each token
43 occupies one location and every location has a token. In the token
44 stream model, all tokens are of the same length.
45
46 Conventionally, all tokens are of length 1, and the token stream starts
47 at location 0. Following this convention, the Nth token would start at
48 location N-1 and end at location N. For example, the first token would
49 start at location 0 and end at location 1. The second token would
50 start at location 1 and end at location 2.
51
52 Earlemes
53 For most parsers, position is location in a token stream. To deal with
54 variable-length and overlapping tokens, Marpa needs a more flexible
55 idea of location.
56
57 Marpa's tracks position in terms of earlemes. Earlemes are named after
58 Jay Earley, the inventor of the first algorithm in Marpa's lineage.
59 Every token has a start earleme and an end earleme.
60
61 The token stream model may also be called the token-per-earleme model.
62 In the token stream model, token location and earleme location are
63 exactly identical an one-to-one basis. If a user's application uses
64 the token stream model, he can ignore earlemes, and think entirely in
65 terms of tokens and position in a token stream. Because of this, the
66 main Marpa documents speak simply of the "location" in the parse.
67
68 The furthest earleme
69 The furthest earleme is the last earleme at which a token ends. In the
70 default input model, the furthest earleme is not an important concept.
71 In the default input model, the furthest earleme and the current
72 earleme are always the same.
73
74 In alternative input models, tokens may be longer than 1 earleme, and
75 the furthest earleme and the current earleme may be far apart. This
76 becomes an issue when parsing is finished. Alternative input models
77 use the recognizer's "end_input" method to ensure that processing of
78 input catches up to the furthest earleme.
79
81 alternative
82 defined $recce->alternative( 'a', 42, 1 )
83 or return 'First alternative failed';
84
85 The "alternative" method is the most general method for reading input,
86 and is used in alternative input models. It takes three arguments,
87 only the first of which is required.
88
89 The first two arguments are similar to those of the recognizer's "read"
90 method: the token type and the token value. As with the "read" method,
91 the token value is optional and, if omitted, defaults to a Perl
92 "undef".
93
94 The third argument to the "alternative" method is a token length. If
95 omitted, token length defaults to 1, which is the correct value for the
96 token stream model. Its value can be any integer greater than zero.
97 Marpa does not allow zero length tokens in any input model.
98
99 Unlike the "read" method, the "alternative" method does not advance the
100 current location (current earleme) on each call. This allows the
101 application to read several tokens at the same earleme. This is how
102 ambiguous input is created. To advance the current earleme when input
103 is read using the "alternative" method, the "complete_earleme" method
104 must be called.
105
106 On success, "alternative" returns the current earleme. On failures,
107 other than the rejection of a token in interactive mode, "alternative"
108 throws an exception. If the token is rejected, "alternative" returns a
109 Perl "undef".
110
111 current_earleme
112 $current_earleme = $recce->current_earleme();
113
114 Returns the current parse location, also known as the current earleme.
115 Not often needed.
116
117 earleme_complete
118 $recce->earleme_complete();
119
120 Processes all tokens at the current earleme and advances the current
121 earleme by 1. Returns the number of terminals acceptable at the new
122 current earleme. Note that, in alternative input models, a return
123 value of 0 does not necessarily indicate an exhausted parser.
124
125 When reading input using the "alternative" method, "earleme_complete"
126 is used to move forward in the input stream. All tokens read using the
127 "alternative" method are read at the same location -- the current
128 earleme. A "earleme_complete" call causes the tokens read using the
129 "alternative" method to be processed, and the current earleme to be
130 advanced.
131
132 "earleme_complete" may be called even if the "alternative" method has
133 been not called since the last call to "earleme_complete". This will
134 create an earleme with no tokens, which is often useful.
135
136 exhausted
137 $recce->exhausted() and die 'Recognizer exhausted';
138
139 The "exhausted" method returns a Perl true if parsing in a recognizer
140 is exhausted, and a Perl false otherwise. Parsing is exhausted when
141 the recognizer will not accept any further input. This method is
142 seldom used. For most applications, the return values of the other
143 methods provide sufficient information about the parse status, and
144 there is no need to specifically test for parse exhaustion.
145
146 In the default input model, a recognizer was exhausted if the "read"
147 method returned 0, indicating that no tokens would be accepted at the
148 new current earleme. In alternative input models, "earleme_complete"
149 will always return 0 when the parser is exhausted, but the converse is
150 not always true: "earleme_complete" may return 0 even in some cases
151 when the parser is not exhausted.
152
153 When "earleme_complete" returns 0, no input will be accepted at the
154 current earleme. But in input models which allow tokens longer than 1,
155 it is possible for input to be accepted at earlemes after the current
156 earleme, even if no input is accepted at the current earleme.
157
158 end_input
159 $recce->end_input();
160
161 Indicates that input is finished. Calling "end_input" is not necessary
162 or useful in the default input model, because in the default input
163 model no token has a length greater than 1.
164
165 The "end_input" method takes no arguments. The "end_input" method
166 returns a Perl true value on success. On failure, it throws an
167 exception.
168
169 In alternative input models, calling the "earleme_complete" method once
170 input is finished does not ensure that all input has been processed.
171 The "earleme_complete" method completes the current earleme, but in
172 alternative models, tokens may extend well past the current earleme.
173 The "end_input" method ensures that all input is processed.
174
175 Calling "end_input" multiple times on the same recognizer object is
176 harmless, but useless. The second and subsequent calls will return a
177 Perl true, but will have no effect.
178
180 A recognizer can mix calls to its "read" method with calls to its
181 "alternative" method. The "read" method has the same effect as a
182 single call to the "alternative" method, followed immediately by a call
183 of the "earleme_complete" method. More precisely, the "read" method is
184 equivalent to
185
186 sub Marpa::XS::Recognizer::read {
187 my $recce = shift;
188 return defined $recce->alternative(@_) ? $recce->earleme_complete() : undef;
189 }
190
192 Marpa allows ambiguous tokens. Several Marpa tokens can start at a
193 single parsing location. Ambiguous tokens can be of various lengths.
194 Tokens can also overlap.
195
196 Potentially ambiguous lexing occurs when more than one token starts at
197 a single earleme. When potentially ambiguous lexing occurs, it becomes
198 possible for there to be more than one sequence of tokens.
199
200 An actual lexical ambiguity only occurs if more than one of the
201 potential token sequences is consistent with the grammar. If there is
202 no actual lexical ambiguity, Marpa will use the only token choice that
203 is consistent with the grammar.
204
205 When lexing is actually ambiguous, Marpa will use all the alternatives
206 consistent with the grammar. When the lexing in a parse is actually
207 ambiguous, the parse will be ambiguous. From the point of view of
208 Marpa's semantics, ambiguities caused by lexing look the same as
209 ambiguities caused by an ambiguous grammar.
210
211 In the standard terminology, if a grammar produces more than one parse
212 tree for any input, then that grammar must be ambiguous. In Marpa this
213 is not strictly true. In Marpa, if the input is ambiguous, even an
214 unambiguous grammar can produce than one parse.
215
217 A duplicate token is a token of the same type and the same length as
218 another that was read at the same earleme. Duplicate tokens are
219 impossible in the default, token-stream, model. This is because in the
220 token-stream model only one token can be read at each earleme.
221
222 In alternative models, more than one token may be read at an earleme,
223 and duplicates are possible. Marpa detects duplicate tokens and treats
224 them as "hard errors" -- Marpa throws an exception when it sees a
225 duplicate token. Marpa's assumption is that duplicate tokens indicate
226 an error at the application level.
227
228 An application can retry input after a duplicate token, if it catches
229 the exception. In the future, if recovery from duplicate tokens is
230 found to be a useful technique, Marpa may provide an option to change
231 its behavior, so that a soft failure is returned when there is a
232 duplicate token.
233
235 While scanning, Marpa keeps track of the current earleme. Earlemes in
236 a parse start at earleme 0 and increase numerically. The earleme
237 immediately following earleme 0 is earleme 1, the earleme immediately
238 following earleme 1 is earleme 2, and so on. The earleme immediately
239 following earleme N is always earleme N+1.
240
241 Distance in the earleme stream is what you would intuitively expect it
242 to be. The distance between earleme X and earleme Y is the absolute
243 value of the difference between X and Y, |X-Y|. The distance from
244 earleme 3 to earleme 6, for example, is 3 earlemes.
245
246 Whenever a token is given to Marpa to be scanned, it starts at the
247 current earleme. In addition to the type and value of the token, Marpa
248 must be told token's length in earlemes. The length of a Marpa token
249 must be greater than zero.
250
251 This earleme length will become the distance from the start of the
252 token to the end of the token. If the length of the token is L, and
253 the number of the current earleme is C, the end of the token will be at
254 the earleme whose number is C+L.
255
257 Many different models of the relationship between tokens and earlemes
258 are possible, but two are particularly important. One is the one-
259 token-per-earleme model, which is the default, and which has already
260 been described. The other is the one-character-per-earleme model.
261
262 In the one-character-per-earleme model, every character will be treated
263 as being exactly one earleme in length. If a token is more than one
264 character in length, that token will span earlemes. When the lexing is
265 ambiguous, tokens may overlap.
266
267 When a one-character-per-earleme model of input is used, there may be
268 many earlemes at which no tokens start. For example, in a
269 straightforward character-per-earleme implementation of a grammar for a
270 language which allows comments, no tokens will start at any earlemes
271 which corresponds to character locations inside a comment.
272
274 So far only the token-per-earleme and character-per-earleme models have
275 seen any real use in Marpa programs. But other models are certainly
276 possible. Using earlemes, you can structure your input in almost any
277 way you like.
278
279 There are only three restrictions:
280
281 1. Scanning always starts at earleme 0.
282
283 2. All tokens starting at earleme N must be scanned before any tokens
284 starting at earleme N+1. In other words, the tokens must be
285 scanned in non-decreasing order by start earleme.
286
287 3. Every token must have a length, in earlemes, which is greater than
288 zero. In other words, token length can never be zero or negative.
289
291 Copyright 2012 Jeffrey Kegler
292 This file is part of Marpa::XS. Marpa::XS is free software: you can
293 redistribute it and/or modify it under the terms of the GNU Lesser
294 General Public License as published by the Free Software Foundation,
295 either version 3 of the License, or (at your option) any later version.
296
297 Marpa::XS is distributed in the hope that it will be useful,
298 but WITHOUT ANY WARRANTY; without even the implied warranty of
299 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
300 Lesser General Public License for more details.
301
302 You should have received a copy of the GNU Lesser
303 General Public License along with Marpa::XS. If not, see
304 http://www.gnu.org/licenses/.
305
306
307
308perl v5.32.1 2021-01-27 Marpa::XS::Advanced::Models(3)