1Marpa::XS::Vocabulary(3U)ser Contributed Perl DocumentatiMoanrpa::XS::Vocabulary(3)
2
3
4

NAME

6       Marpa::XS::Vocabulary - Standard parsing terms as used within Marpa
7

DESCRIPTION

9       The definitions in this document are of standard parsing terms as they
10       are used in the Marpa context.  I put defining uses of terms in
11       boldface, for easy skimming.  Where a narrow or specialized sense of
12       the term is the one that applies within Marpa, that is the only
13       definition given.  Marpa also sometimes uses a standard term with a
14       definition which is slightly different from the standard one.
15       ("Ambiguous grammar" is one example, and "grammar" itself is another.)
16       When this is the case, it is explicitly pointed out.
17
18       A reader totally new to parsing will find this document too terse to
19       act as a textbook or tutorial.  They will need to look elsewhere first.
20       As an introduction, I recommend Mark Jason Dominus's excellent chapter
21       on parsing in the Perl context.  It's available on-line.  Wikipedia is
22       also an excellent place to start.
23
24   Basic
25       A grammar is a set of rules, associated with a set of symbols, one of
26       which is distinguished as the default start symbol.  A symbol string,
27       or simply string where the meaning is clear, is an ordered series of
28       symbols.  The length of a string is the number of symbols in it.
29
30       A language is a set of symbol strings.  A grammar defines a language,
31       as will be described later.
32
33       It is important to note that the term language, as it is used in
34       parsing theory, means something very different from what it means in
35       ordinary use.  The meaning of the strings is an essential part of the
36       ordinary idea of what a language is.  In ordinary use, the word
37       "language" means far more than a unordered list of its sentences.  In
38       parsing terminology, meaning (or semantics as it is called) is a
39       separate issue.  For parsing theory a language is exactly a set of
40       strings -- that and nothing more.
41
42       The Marpa definition of a grammar differs slightly from the various
43       standard ones.  Standard definitions usually sharply distinguish
44       terminal symbols from non-terminals.  Marpa does not.  In standard
45       definitions, the start symbol is fixed.  In Marpa, it is merely a
46       default.
47
48   Stages of parsing
49       A recognizer is a program that determines whether its input is in the
50       language of a grammar and a start symbol.  A parser is a program which
51       finds the structure of that input.
52
53       The term parsing is used in a strict and a loose sense.  Parsing in the
54       loose sense is all phases of finding a grammar's structure, including a
55       separate recognition phase if the parser has one.  (Marpa does.)  If a
56       parser has phases, parsing in the strict sense refers specifically to
57       the phase that finds the structure of the input.  When the Marpa
58       documents use the term parsing in its strict sense, they will speak
59       explicitly of "parsing in the strict sense".  Otherwise, parsing will
60       mean parsing in the loose sense.
61
62       Parsers often use a lexical analyzer to convert raw input, usually
63       input text, into a token stream, which is a series of tokens.  Each
64       token represents a symbol of the grammar and has a value.  A lexical
65       analyzer is often called a lexer or a scanner, and lexical analysis is
66       often called lexing or scanning.
67
68       The series of symbols represented by the series of tokens becomes the
69       symbol string input seen by the recognizer.  The symbol string input is
70       more often called the input sentence.
71
72       By default, Marpa uses the token stream model of input.  Marpa also
73       allows alternative input models.  These are new to Marpa, so that their
74       terminology is of necessity non-standard.  The terminology needed for
75       alternative input models is explained in the document that introduces
76       them.
77
78   Rules
79       A standard way of describing rules is Backus-Naur Form, or BNF.  A rule
80       of a grammar is also often called a production.  In one common way of
81       writing BNF, a production looks like this:
82
83           Expression ::= Term Factor
84
85       In the production above, "Expression", "Term" and "Factor" are symbols.
86       A production consists of a left hand side and a right hand side.  In a
87       context-free grammar, like those Marpa parses, the left hand side of a
88       production is always a symbol string of length 1.  The right hand side
89       of a production is a symbol string of zero or more symbols.  In the
90       example, "Expression" is the left hand side, and "Term" and "Factor"
91       are right hand side symbols.
92
93       Left hand side and right hand side are often abbreviated as RHS and
94       LHS.  If the RHS of a production has no symbols, the production is
95       called an empty production or an empty rule.
96
97       Any symbol which is allowed to occur in the symbol string input is
98       called a terminal symbol.  If the symbols in a symbol string are all
99       terminals, that symbol string is also called a sentence.
100
101   Derivations
102       A step of a derivation, or derivation step, is a change made to a
103       symbol string by applying one of the productions from the grammar.  The
104       production must be one of those with a LHS that occurs in the symbol
105       string.  The result of the derivation step is another symbol string,
106       one in which every occurence of the LHS symbol from the production is
107       replaced by the RHS of the production.  For example, if "A", "B", "C",
108       "D", and "X" are symbols, and
109
110           X ::= B C
111
112       is a production, then
113
114           A X D -> A B C D
115
116       is a derivation step, with ""A X D"" as its beginning and ""A B C D""
117       as its end or result.  We say that the symbol string ""A X D"" derives
118       the symbol string ""A B C D"".
119
120       A derivation is a sequence of derivation steps.  The length of a
121       derivation is its length in steps.
122
123       •   We say that a first symbol string directly derives a second symbol
124           string if and only if there is a derivation of length 1 from the
125           first symbol string to the second symbol string.
126
127       •   Every symbol string is said to derive itself in a derivation of
128           length 0.  A zero length derivation is a trivial derivation.
129
130       •   A non-trivial derivation of a symbol string from itself is called a
131           cycle.  Grammars which contain cycles are traditionally considered
132           useless, but Marpa will parse with such grammars.
133
134       •   If a derivation is not trivial or direct, that is, if it has more
135           than one step, then it is an indirect derivation.
136
137       •   A derivation which is not trivial (that is, a derivation which has
138           one or more steps) is a non-trivial derivation.
139
140       Technically, a symbol "X" and a string that consists of only that
141       symbol are two different things.  But we often say "the symbol "X"" as
142       shorthand for "the string of length 1 whose only symbol is "X"".  For
143       example, if the string containing only the symbol "X" derives a string
144       "Y", we will usually say simply that ""X" derives "Y"".
145
146       Wherever symbol or string "X" derives "Y", we may also say "X" produces
147       "Y".  Derivations are often described as symbol matches.  Wherever
148       symbol or string "X" derives "Y", we may also say that "Y" matches "X"
149       or that "X" matches "Y".  It is particularly common to say that "X"
150       matches "Y" when "X" or "Y" is a sentence.
151
152       In any parse, one symbol is distinguished as the start symbol.  The
153       parse of an input is successful if and only if the start symbol
154       produces the input sentence according to the grammar.  The set of all
155       input sentences that a grammar and a start symbol will successfully
156       parse is their language.  Traditionally, the start symbol is part of a
157       grammar's definition.  The language of a grammar is the language of the
158       grammar and its default start symbol.
159
160   Nulling
161       The zero length symbol string is called the empty string.  The empty
162       string can be considered to be a sentence, in which case it is the
163       empty sentence.  A string of one or more symbols is non-empty.  A
164       derivation which produces the empty string is a null derivation.  A
165       derivation from the start symbol which produces the empty string is a
166       null parse.
167
168       If in a particular grammar, a symbol has a null derivation, it is a
169       nullable symbol.  If, in a particular grammar, the only sentence
170       produced by a symbol is the empty sentence, it is a nulling symbol.
171       All nulling symbols are nullable symbols.
172
173       If a symbol is not nullable, it is non-nullable.  If a symbol is not
174       nulling, it is non-nulling.  In any instance where a symbol produces
175       the empty string, it is said to be nulled, or to be a null symbol.
176
177   Useless rules
178       If any derivation from the start symbol uses a rule, that rule is
179       called reachable or accessible.  A rule that is not accessible is
180       called unreachable or inaccessible.  If any derivation which results in
181       a sentence uses a rule, that rule is said to be productive.  A rule
182       that is not productive is called unproductive.  For example, a rule is
183       unproductive unless every symbol on its RHS either is a terminal or is
184       the LHS of some other rule.  A rule which is inaccessible or
185       unproductive is called a useless rule.  Marpa can handle grammars with
186       useless rules.
187
188       A symbol is reachable or accessible if it appears in a reachable
189       production.  If a symbol is not reachable, it is unreachable or
190       inaccessible.  A symbol is productive if it appears on the LHS of a
191       productive rule, or if it is a nullable symbol.  If a symbol is not
192       productive, it is unproductive.  A symbol which is inaccessible or
193       unproductive is called a useless symbol.  Marpa can handle grammars
194       with useless symbols.
195
196   Recursion and cycles
197       If any symbol in the grammar non-trivially produces a symbol string
198       containing itself, the grammar is said to be recursive.  If any symbol
199       non-trivially produces a symbol string with itself on the left, the
200       grammar is said to be left-recursive.  If any symbol non-trivially
201       produces a symbol string with itself on the right, the grammar is said
202       to be right-recursive.  Marpa can handle all recursive grammars,
203       including grammars which are left-recursive, grammars which are right-
204       recursive, and grammars which contain both left- and right-recursion.
205
206       A cycle is a non-trivial derivation of a string of symbols from itself.
207       If it is not possible for any derivation using a grammar to contain a
208       cycle, then that grammar is said to be cycle-free.  Traditionally, a
209       grammar is considered useless if it is not cycle-free.
210
211       The traditional deprecation of cycles is well-founded.  A cycle is the
212       parsing equivalent of an infinite loop.  Once a cycle appears, it can
213       be repeated over and over again.  Even a very short input sentence can
214       have an infinite number of parses when the grammar is not cycle-free.
215
216       For that reason, a grammar which contains a cycle is also called
217       infinitely ambiguous.  Marpa can parse with grammars which are not
218       cycle-free, and will even parse inputs that cause cycles.  When a parse
219       is infinitely ambiguous, Marpa limits cycles to a single loop, so that
220       only a finite number of parses is returned.
221
222   Parse structure
223       The structure of a parse can be represented as a series of derivation
224       steps from the start symbol to the input.  Another way to represent
225       structure is as a parse tree.  Every symbol used in the parse is
226       represented by a node of the parse tree.  Wherever a production is used
227       in the parse, its LHS is represented by a parent node and the RHS
228       symbols are represented by child nodes.  The start symbol is the root
229       of the tree.  The node at the root of the tree is called the start
230       node.
231
232       Traditionally, grammars divide all symbols sharply into terminals and
233       non-terminals.  A terminal symbol must ALWAYS be used as a terminal.  A
234       non-terminal symbol can NEVER be used as a terminal.
235
236       Marpa's use of terminals is non-traditional, and its terminology is
237       different accordingly.  As in the traditional approach, Marpa's non-
238       terminals can never be used as terminals.  But Marpa terminals can be
239       used anywhere, even in places where the traditional approach requires a
240       a non-terminal symbol.  In particular, a Marpa terminal can be the LHS
241       of a rule.
242
243       Traditionally, and in Marpa as well, every node is either a inner node
244       or a leaf node.  In Marpa, leaf nodes are of two kinds:
245
246       •   Nodes for nulled symbols.  A node for a nulled symbol is called a
247           nulled node.
248
249       •   Nodes for symbols being used as terminals.
250
251   Ambiguity
252       Marpa allows ambiguous grammars.  Traditionally we say that a parse is
253       ambiguous if, for a given grammar and a given input, more than one
254       derivation tree is possible.  However, Marpa allows ambiguous input
255       tokens, which the traditional definition does not take into account.
256       If Marpa used the traditional definition, all grammars would be
257       ambiguous except those grammars which allowed only the null parse.
258       Another complication is that in Marpa the start symbol of a grammar is
259       not fixed.
260
261       It is easiest if the Marpa definition and the traditional definition
262       were extensionally equivalent --- that is, if Marpa's set of ambiguous
263       grammars was exactly the same as the set of traditionally ambiguous
264       grammars.  This can be accomplished by using a slightly altered
265       definition.  In the Marpa context, a grammar with an associated start
266       symbol is ambiguous if and only if, for some UNAMBIGUOUS stream of
267       input tokens, that grammar with that start symbol produces more than
268       one parse tree.  A Marpa grammar is ambiguous if that Marpa grammar
269       with its default start symbol is ambiguous.
270
271   Semantics
272       In real life, the structure of a parse is usually a means to an end.
273       Grammars usually have a semantics associated with them, and what the
274       user actually wants is the value of the parse according to the
275       semantics.
276
277       The tree representation is especially useful when evaluating a parse.
278       In the traditional method of evaluating a parse tree, every node which
279       represents a terminal symbol has a value associated with it on input.
280       Non-null inner nodes take their semantics from the production whose LHS
281       they represent.  Nulled nodes are dealt with as special cases.
282
283       The semantics for a production describe how to calculate the value of
284       the node which represents the LHS (the parent node) from the values of
285       zero or more of the nodes which represent the RHS symbols (child
286       nodes).  Values are computed recursively, bottom-up.  The value of a
287       parse is the value of its start symbol.
288
290         Copyright 2012 Jeffrey Kegler
291         This file is part of Marpa::XS.  Marpa::XS is free software: you can
292         redistribute it and/or modify it under the terms of the GNU Lesser
293         General Public License as published by the Free Software Foundation,
294         either version 3 of the License, or (at your option) any later version.
295
296         Marpa::XS is distributed in the hope that it will be useful,
297         but WITHOUT ANY WARRANTY; without even the implied warranty of
298         MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
299         Lesser General Public License for more details.
300
301         You should have received a copy of the GNU Lesser
302         General Public License along with Marpa::XS.  If not, see
303         http://www.gnu.org/licenses/.
304
305
306
307perl v5.32.1                      2021-01-27          Marpa::XS::Vocabulary(3)
Impressum