1Marpa::XS::Vocabulary(3U)ser Contributed Perl DocumentatiMoanrpa::XS::Vocabulary(3)
2
3
4
6 Marpa::XS::Vocabulary - Standard parsing terms as used within Marpa
7
9 The definitions in this document are of standard parsing terms as they
10 are used in the Marpa context. I put defining uses of terms in
11 boldface, for easy skimming. Where a narrow or specialized sense of
12 the term is the one that applies within Marpa, that is the only
13 definition given. Marpa also sometimes uses a standard term with a
14 definition which is slightly different from the standard one.
15 ("Ambiguous grammar" is one example, and "grammar" itself is another.)
16 When this is the case, it is explicitly pointed out.
17
18 A reader totally new to parsing will find this document too terse to
19 act as a textbook or tutorial. They will need to look elsewhere first.
20 As an introduction, I recommend Mark Jason Dominus's excellent chapter
21 on parsing in the Perl context. It's available on-line. Wikipedia is
22 also an excellent place to start.
23
24 Basic
25 A grammar is a set of rules, associated with a set of symbols, one of
26 which is distinguished as the default start symbol. A symbol string,
27 or simply string where the meaning is clear, is an ordered series of
28 symbols. The length of a string is the number of symbols in it.
29
30 A language is a set of symbol strings. A grammar defines a language,
31 as will be described later.
32
33 It is important to note that the term language, as it is used in
34 parsing theory, means something very different from what it means in
35 ordinary use. The meaning of the strings is an essential part of the
36 ordinary idea of what a language is. In ordinary use, the word
37 "language" means far more than a unordered list of its sentences. In
38 parsing terminology, meaning (or semantics as it is called) is a
39 separate issue. For parsing theory a language is exactly a set of
40 strings -- that and nothing more.
41
42 The Marpa definition of a grammar differs slightly from the various
43 standard ones. Standard definitions usually sharply distinguish
44 terminal symbols from non-terminals. Marpa does not. In standard
45 definitions, the start symbol is fixed. In Marpa, it is merely a
46 default.
47
48 Stages of parsing
49 A recognizer is a program that determines whether its input is in the
50 language of a grammar and a start symbol. A parser is a program which
51 finds the structure of that input.
52
53 The term parsing is used in a strict and a loose sense. Parsing in the
54 loose sense is all phases of finding a grammar's structure, including a
55 separate recognition phase if the parser has one. (Marpa does.) If a
56 parser has phases, parsing in the strict sense refers specifically to
57 the phase that finds the structure of the input. When the Marpa
58 documents use the term parsing in its strict sense, they will speak
59 explicitly of "parsing in the strict sense". Otherwise, parsing will
60 mean parsing in the loose sense.
61
62 Parsers often use a lexical analyzer to convert raw input, usually
63 input text, into a token stream, which is a series of tokens. Each
64 token represents a symbol of the grammar and has a value. A lexical
65 analyzer is often called a lexer or a scanner, and lexical analysis is
66 often called lexing or scanning.
67
68 The series of symbols represented by the series of tokens becomes the
69 symbol string input seen by the recognizer. The symbol string input is
70 more often called the input sentence.
71
72 By default, Marpa uses the token stream model of input. Marpa also
73 allows alternative input models. These are new to Marpa, so that their
74 terminology is of necessity non-standard. The terminology needed for
75 alternative input models is explained in the document that introduces
76 them.
77
78 Rules
79 A standard way of describing rules is Backus-Naur Form, or BNF. A rule
80 of a grammar is also often called a production. In one common way of
81 writing BNF, a production looks like this:
82
83 Expression ::= Term Factor
84
85 In the production above, "Expression", "Term" and "Factor" are symbols.
86 A production consists of a left hand side and a right hand side. In a
87 context-free grammar, like those Marpa parses, the left hand side of a
88 production is always a symbol string of length 1. The right hand side
89 of a production is a symbol string of zero or more symbols. In the
90 example, "Expression" is the left hand side, and "Term" and "Factor"
91 are right hand side symbols.
92
93 Left hand side and right hand side are often abbreviated as RHS and
94 LHS. If the RHS of a production has no symbols, the production is
95 called an empty production or an empty rule.
96
97 Any symbol which is allowed to occur in the symbol string input is
98 called a terminal symbol. If the symbols in a symbol string are all
99 terminals, that symbol string is also called a sentence.
100
101 Derivations
102 A step of a derivation, or derivation step, is a change made to a
103 symbol string by applying one of the productions from the grammar. The
104 production must be one of those with a LHS that occurs in the symbol
105 string. The result of the derivation step is another symbol string,
106 one in which every occurence of the LHS symbol from the production is
107 replaced by the RHS of the production. For example, if "A", "B", "C",
108 "D", and "X" are symbols, and
109
110 X ::= B C
111
112 is a production, then
113
114 A X D -> A B C D
115
116 is a derivation step, with ""A X D"" as its beginning and ""A B C D""
117 as its end or result. We say that the symbol string ""A X D"" derives
118 the symbol string ""A B C D"".
119
120 A derivation is a sequence of derivation steps. The length of a
121 derivation is its length in steps.
122
123 · We say that a first symbol string directly derives a second symbol
124 string if and only if there is a derivation of length 1 from the
125 first symbol string to the second symbol string.
126
127 · Every symbol string is said to derive itself in a derivation of
128 length 0. A zero length derivation is a trivial derivation.
129
130 · A non-trivial derivation of a symbol string from itself is called a
131 cycle. Grammars which contain cycles are traditionally considered
132 useless, but Marpa will parse with such grammars.
133
134 · If a derivation is not trivial or direct, that is, if it has more
135 than one step, then it is an indirect derivation.
136
137 · A derivation which is not trivial (that is, a derivation which has
138 one or more steps) is a non-trivial derivation.
139
140 Technically, a symbol "X" and a string that consists of only that
141 symbol are two different things. But we often say "the symbol "X"" as
142 shorthand for "the string of length 1 whose only symbol is "X"". For
143 example, if the string containing only the symbol "X" derives a string
144 "Y", we will usually say simply that ""X" derives "Y"".
145
146 Wherever symbol or string "X" derives "Y", we may also say "X" produces
147 "Y". Derivations are often described as symbol matches. Wherever
148 symbol or string "X" derives "Y", we may also say that "Y" matches "X"
149 or that "X" matches "Y". It is particularly common to say that "X"
150 matches "Y" when "X" or "Y" is a sentence.
151
152 In any parse, one symbol is distinguished as the start symbol. The
153 parse of an input is successful if and only if the start symbol
154 produces the input sentence according to the grammar. The set of all
155 input sentences that a grammar and a start symbol will successfully
156 parse is their language. Traditionally, the start symbol is part of a
157 grammar's definition. The language of a grammar is the language of the
158 grammar and its default start symbol.
159
160 Nulling
161 The zero length symbol string is called the empty string. The empty
162 string can be considered to be a sentence, in which case it is the
163 empty sentence. A string of one or more symbols is non-empty. A
164 derivation which produces the empty string is a null derivation. A
165 derivation from the start symbol which produces the empty string is a
166 null parse.
167
168 If in a particular grammar, a symbol has a null derivation, it is a
169 nullable symbol. If, in a particular grammar, the only sentence
170 produced by a symbol is the empty sentence, it is a nulling symbol.
171 All nulling symbols are nullable symbols.
172
173 If a symbol is not nullable, it is non-nullable. If a symbol is not
174 nulling, it is non-nulling. In any instance where a symbol produces
175 the empty string, it is said to be nulled, or to be a null symbol.
176
177 Useless rules
178 If any derivation from the start symbol uses a rule, that rule is
179 called reachable or accessible. A rule that is not accessible is
180 called unreachable or inaccessible. If any derivation which results in
181 a sentence uses a rule, that rule is said to be productive. A rule
182 that is not productive is called unproductive. For example, a rule is
183 unproductive unless every symbol on its RHS either is a terminal or is
184 the LHS of some other rule. A rule which is inaccessible or
185 unproductive is called a useless rule. Marpa can handle grammars with
186 useless rules.
187
188 A symbol is reachable or accessible if it appears in a reachable
189 production. If a symbol is not reachable, it is unreachable or
190 inaccessible. A symbol is productive if it appears on the LHS of a
191 productive rule, or if it is a nullable symbol. If a symbol is not
192 productive, it is unproductive. A symbol which is inaccessible or
193 unproductive is called a useless symbol. Marpa can handle grammars
194 with useless symbols.
195
196 Recursion and cycles
197 If any symbol in the grammar non-trivially produces a symbol string
198 containing itself, the grammar is said to be recursive. If any symbol
199 non-trivially produces a symbol string with itself on the left, the
200 grammar is said to be left-recursive. If any symbol non-trivially
201 produces a symbol string with itself on the right, the grammar is said
202 to be right-recursive. Marpa can handle all recursive grammars,
203 including grammars which are left-recursive, grammars which are right-
204 recursive, and grammars which contain both left- and right-recursion.
205
206 A cycle is a non-trivial derivation of a string of symbols from itself.
207 If it is not possible for any derivation using a grammar to contain a
208 cycle, then that grammar is said to be cycle-free. Traditionally, a
209 grammar is considered useless if it is not cycle-free.
210
211 The traditional deprecation of cycles is well-founded. A cycle is the
212 parsing equivalent of an infinite loop. Once a cycle appears, it can
213 be repeated over and over again. Even a very short input sentence can
214 have an infinite number of parses when the grammar is not cycle-free.
215
216 For that reason, a grammar which contains a cycle is also called
217 infinitely ambiguous. Marpa can parse with grammars which are not
218 cycle-free, and will even parse inputs that cause cycles. When a parse
219 is infinitely ambiguous, Marpa limits cycles to a single loop, so that
220 only a finite number of parses is returned.
221
222 Parse structure
223 The structure of a parse can be represented as a series of derivation
224 steps from the start symbol to the input. Another way to represent
225 structure is as a parse tree. Every symbol used in the parse is
226 represented by a node of the parse tree. Wherever a production is used
227 in the parse, its LHS is represented by a parent node and the RHS
228 symbols are represented by child nodes. The start symbol is the root
229 of the tree. The node at the root of the tree is called the start
230 node.
231
232 Traditionally, grammars divide all symbols sharply into terminals and
233 non-terminals. A terminal symbol must ALWAYS be used as a terminal. A
234 non-terminal symbol can NEVER be used as a terminal.
235
236 Marpa's use of terminals is non-traditional, and its terminology is
237 different accordingly. As in the traditional approach, Marpa's non-
238 terminals can never be used as terminals. But Marpa terminals can be
239 used anywhere, even in places where the traditional approach requires a
240 a non-terminal symbol. In particular, a Marpa terminal can be the LHS
241 of a rule.
242
243 Traditionally, and in Marpa as well, every node is either a inner node
244 or a leaf node. In Marpa, leaf nodes are of two kinds:
245
246 · Nodes for nulled symbols. A node for a nulled symbol is called a
247 nulled node.
248
249 · Nodes for symbols being used as terminals.
250
251 Ambiguity
252 Marpa allows ambiguous grammars. Traditionally we say that a parse is
253 ambiguous if, for a given grammar and a given input, more than one
254 derivation tree is possible. However, Marpa allows ambiguous input
255 tokens, which the traditional definition does not take into account.
256 If Marpa used the traditional definition, all grammars would be
257 ambiguous except those grammars which allowed only the null parse.
258 Another complication is that in Marpa the start symbol of a grammar is
259 not fixed.
260
261 It is easiest if the Marpa definition and the traditional definition
262 were extensionally equivalent --- that is, if Marpa's set of ambiguous
263 grammars was exactly the same as the set of traditionally ambiguous
264 grammars. This can be accomplished by using a slightly altered
265 definition. In the Marpa context, a grammar with an associated start
266 symbol is ambiguous if and only if, for some UNAMBIGUOUS stream of
267 input tokens, that grammar with that start symbol produces more than
268 one parse tree. A Marpa grammar is ambiguous if that Marpa grammar
269 with its default start symbol is ambiguous.
270
271 Semantics
272 In real life, the structure of a parse is usually a means to an end.
273 Grammars usually have a semantics associated with them, and what the
274 user actually wants is the value of the parse according to the
275 semantics.
276
277 The tree representation is especially useful when evaluating a parse.
278 In the traditional method of evaluating a parse tree, every node which
279 represents a terminal symbol has a value associated with it on input.
280 Non-null inner nodes take their semantics from the production whose LHS
281 they represent. Nulled nodes are dealt with as special cases.
282
283 The semantics for a production describe how to calculate the value of
284 the node which represents the LHS (the parent node) from the values of
285 zero or more of the nodes which represent the RHS symbols (child
286 nodes). Values are computed recursively, bottom-up. The value of a
287 parse is the value of its start symbol.
288
290 Copyright 2012 Jeffrey Kegler
291 This file is part of Marpa::XS. Marpa::XS is free software: you can
292 redistribute it and/or modify it under the terms of the GNU Lesser
293 General Public License as published by the Free Software Foundation,
294 either version 3 of the License, or (at your option) any later version.
295
296 Marpa::XS is distributed in the hope that it will be useful,
297 but WITHOUT ANY WARRANTY; without even the implied warranty of
298 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
299 Lesser General Public License for more details.
300
301 You should have received a copy of the GNU Lesser
302 General Public License along with Marpa::XS. If not, see
303 http://www.gnu.org/licenses/.
304
305
306
307perl v5.32.0 2020-07-28 Marpa::XS::Vocabulary(3)