1grammar::peg(n) Grammar operations and usage grammar::peg(n)
2
3
4
5______________________________________________________________________________
6
8 grammar::peg - Create and manipulate parsing expression grammars
9
11 package require Tcl 8.4
12
13 package require snit
14
15 package require grammar::peg ?0.1?
16
17 ::grammar::peg pegName ?=|:=|<--|as|deserialize src?
18
19 pegName destroy
20
21 pegName clear
22
23 pegName = srcPEG
24
25 pegName --> dstPEG
26
27 pegName serialize
28
29 pegName deserialize serialization
30
31 pegName is valid
32
33 pegName start ?pe?
34
35 pegName nonterminals
36
37 pegName nonterminal add nt pe
38
39 pegName nonterminal delete nt1 ?nt2 ...?
40
41 pegName nonterminal exists nt
42
43 pegName nonterminal rename nt ntnew
44
45 pegName nonterminal mode nt ?mode?
46
47 pegName nonterminal rule nt
48
49 pegName unknown nonterminals
50
51_________________________________________________________________
52
54 This package provides a container class for parsing expression grammars
55 (Short: PEG). It allows the incremental definition of the grammar, its
56 manipulation and querying of the definition. The package neither pro‐
57 vides complex operations on the grammar, nor has it the ability to exe‐
58 cute a grammar definition for a stream of symbols. Two packages
59 related to this one are grammar::mengine and grammar::peg::interpreter.
60 The first of them defines a general virtual machine for the matching of
61 a character stream, and the second implements an interpreter for pars‐
62 ing expression grammars on top of that virtual machine.
63
64 TERMS & CONCEPTS
65 The formal mathematical definition of parsing expressions and parsing
66 expression grammars can be found in section PARSING EXPRESSION GRAM‐
67 MARS.
68
69 In short, we have terminal symbols, which are the most basic building
70 blocks for sentences, and nonterminal symbols with associated parsing
71 expressions, defining the grammatical structure of the sentences. The
72 two sets of symbols are distinctive, and do not overlap. When speaking
73 about symbols the word "symbol" is often left out. The union of the
74 sets of terminal and nonterminal symbols is called the set of symbols.
75
76 Here the set of terminal symbols is not explicitly managed, but implic‐
77 itly defined as the set of all characters. Note that this means that we
78 inherit from Tcl the ability to handle all of Unicode.
79
80 A pair of nonterminal and parsing expression is also called a grammati‐
81 cal rule, or rule for short. In the context of a rule the nonterminal
82 is often called the left-hand-side (LHS), and the parsing expression
83 the right-hand-side (RHS).
84
85 The start expression of a grammar is a parsing expression from which
86 all the sentences contained in the language specified by the grammar
87 are derived. To make the understanding of this term easier let us
88 assume for a moment that the RHS of each rule, and the start expres‐
89 sion, is either a sequence of symbols, or a series of alternate parsing
90 expressions. In the latter case the rule can be seen as a set of
91 rules, each providing one alternative for the nonterminal. A parsing
92 expression A' is now a derivation of a parsing expression A if we pick
93 one of the nonterminals N in the expression, and one of the alternative
94 rules R for N, and then replace the nonterminal in A with the RHS of
95 the chosen rule. Here we can see why the terminal symbols are called
96 such. They cannot be expanded any further, thus terminate the process
97 of deriving new expressions. An example
98
99 Rules
100 (1) A <- a B c
101 (2a) B <- d B
102 (2b) B <- e
103
104 Some derivations, using starting expression A.
105
106 A -/1/-> a B c -/2a/-> a d B c -/2b/-> a d e c
107
108
109 A derived expression containing only terminal symbols is a sentence.
110 The set of all sentences which can be derived from the start expression
111 is the language of the grammar.
112
113 Some definitions for nonterminals and expressions:
114
115 [1] A nonterminal A is called reachable if it is possible to derive
116 a parsing expression from the start expression which contains A.
117
118 [2] A nonterminal A is called useful if it is possible to derive a
119 sentence from it.
120
121 [3] A nonterminal A is called recursive if it is possible to derive
122 a parsing expression from it which contains A, again.
123
124 [4] The FIRST set of a nonterminal A contains all the symbols which
125 can occur of as the leftmost symbol in a parsing expression
126 derived from A. If the FIRST set contains A itself then that
127 nonterminal is called left-recursive.
128
129 [5] The LAST set of a nonterminal A contains all the symbols which
130 can occur of as the rightmost symbol in a parsing expression
131 derived from A. If the LAST set contains A itself then that non‐
132 terminal is called right-recursive.
133
134 [6] The FOLLOW set of a nonterminal A contains all the symbols which
135 can occur after A in a parsing expression derived from the start
136 expression.
137
138 [7] A nonterminal (or parsing expression) is called nullable if the
139 empty sentence can be derived from it.
140
141 And based on the above definitions for grammars:
142
143 [1] A grammar G is recursive if and only if it contains a nontermi‐
144 nal A which is recursive. The terms left- and right-recursive,
145 and useful are analogously defined.
146
147 [2] A grammar is minimal if it contains only reachable and useful
148 nonterminals.
149
150 [3] A grammar is wellformed if it is not left-recursive. Such gram‐
151 mars are also complete, which means that they always succeed or
152 fail on all input sentences. For an incomplete grammar on the
153 other hand input sentences exist for which an attempt to match
154 them against the grammar will not terminate.
155
156 [4] As we wish to allow ourselves to build a grammar incrementally
157 in a container object we will encounter stages where the RHS of
158 one or more rules reference symbols which are not yet known to
159 the container. Such a grammar we call invalid. We cannot use
160 the term incomplete as this term is already taken, see the last
161 item.
162
163 CONTAINER CLASS API
164 The package exports the API described here.
165
166 ::grammar::peg pegName ?=|:=|<--|as|deserialize src?
167 The command creates a new container object for a parsing expres‐
168 sion grammar and returns the fully qualified name of the object
169 command as its result. The API the returned command is following
170 is described in the section CONTAINER OBJECT API. It may be used
171 to invoke various operations on the container and the grammar
172 within.
173
174 The new container, i.e. grammar will be empty if no src is spec‐
175 ified. Otherwise it will contain a copy of the grammar contained
176 in the src. The src has to be a container object reference for
177 all operators except deserialize. The deserialize operator
178 requires src to be the serialization of a parsing expression
179 grammar instead.
180
181 An empty grammar has no nonterminal symbols, and the start
182 expression is the empty expression, i.e. epsilon. It is valid,
183 but not useful.
184
185 CONTAINER OBJECT API
186 All grammar container objects provide the following methods for the
187 manipulation of their contents:
188
189 pegName destroy
190 Destroys the grammar, including its storage space and associated
191 command.
192
193 pegName clear
194 Clears out the definition of the grammar contained in pegName,
195 but does not destroy the object.
196
197 pegName = srcPEG
198 Assigns the contents of the grammar contained in srcPEG to peg‐
199 Name, overwriting any existing definition. This is the assign‐
200 ment operator for grammars. It copies the grammar contained in
201 the grammar object srcPEG over the grammar definition in peg‐
202 Name. The old contents of pegName are deleted by this operation.
203
204 This operation is in effect equivalent to
205
206
207 pegName deserialize [srcPEG serialize]
208
209
210 pegName --> dstPEG
211 This is the reverse assignment operator for grammars. It copies
212 the automation contained in the object pegName over the grammar
213 definition in the object dstPEG. The old contents of dstPEG are
214 deleted by this operation.
215
216 This operation is in effect equivalent to
217
218
219 dstPEG deserialize [pegName serialize]
220
221
222 pegName serialize
223 This method serializes the grammar stored in pegName. In other
224 words it returns a tcl value completely describing that grammar.
225 This allows, for example, the transfer of grammars over arbi‐
226 trary channels, persistence, etc. This method is also the basis
227 for both the copy constructor and the assignment operator.
228
229 The result of this method has to be semantically identical over
230 all implementations of the grammar::peg interface. This is what
231 will enable us to copy grammars between different implementa‐
232 tions of the same interface.
233
234 The result is a list of four elements with the following struc‐
235 ture:
236
237 [1] The constant string grammar::peg.
238
239 [2] A dictionary. Its keys are the names of all known nonter‐
240 minal symbols, and their associated values are the pars‐
241 ing expressions describing their sentennial structure.
242
243 [3] A dictionary. Its keys are the names of all known nonter‐
244 minal symbols, and their associated values hints to a
245 matcher regarding the semantic values produced by the
246 symbol.
247
248 [4] The last item is a parsing expression, the start expres‐
249 sion of the grammar.
250
251 Assuming the following PEG for simple mathematical expressions
252
253
254 Digit <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'
255 Sign <- '+' / '-'
256 Number <- Sign? Digit+
257 Expression <- '(' Expression ')' / (Factor (MulOp Factor)*)
258 MulOp <- '*' / '/'
259 Factor <- Term (AddOp Term)*
260 AddOp <- '+'/'-'
261 Term <- Number
262
263
264 a possible serialization is
265
266
267 grammar::peg \\
268 {Expression {/ {x ( Expression )} {x Factor {* {x MulOp Factor}}}} \\
269 Factor {x Term {* {x AddOp Term}}} \\
270 Term Number \\
271 MulOp {/ * /} \\
272 AddOp {/ + -} \\
273 Number {x {? Sign} {+ Digit}} \\
274 Sign {/ + -} \\
275 Digit {/ 0 1 2 3 4 5 6 7 8 9} \\
276 } \\
277 {Expression value Factor value \\
278 Term value MulOp value \\
279 AddOp value Number value \\
280 Sign value Digit value \\
281 }
282 Expression
283
284
285 A possible one, because the order of the nonterminals in the dictionary
286 is not relevant.
287
288 pegName deserialize serialization
289 This is the complement to serialize. It replaces the grammar
290 definition in pegName with the grammar described by the serial‐
291 ization value. The old contents of pegName are deleted by this
292 operation.
293
294 pegName is valid
295 A predicate. It tests whether the PEG in pegName is valid. See
296 section TERMS & CONCEPTS for the definition of this grammar
297 property. The result is a boolean value. It will be set to true
298 if the PEG has the tested property, and false otherwise.
299
300 pegName start ?pe?
301 This method defines the start expression of the grammar. It
302 replaces the previously defined start expression with the pars‐
303 ing expression pe. The method fails and throws an error if pe
304 does not contain a valid parsing expression as specified in the
305 section PARSING EXPRESSIONS. In that case the existing start
306 expression is not changed. The method returns the empty string
307 as its result.
308
309 If the method is called without an argument it will return the
310 currently defined start expression.
311
312 pegName nonterminals
313 Returns the set of all nonterminal symbols known to the grammar.
314
315 pegName nonterminal add nt pe
316 This method adds the nonterminal nt and its associated parsing
317 expression pe to the set of nonterminal symbols and rules of the
318 PEG contained in the object pegName. The method fails and
319 throws an error if either the string nt is already known as a
320 symbol of the grammar, or if pe does not contain a valid parsing
321 expression as specified in the section PARSING EXPRESSIONS. In
322 that case the current set of nonterminal symbols and rules is
323 not changed. The method returns the empty string as its result.
324
325 pegName nonterminal delete nt1 ?nt2 ...?
326 This method removes the named symbols nt1, nt2 from the set of
327 nonterminal symbols of the PEG contained in the object pegName.
328 The method fails and throws an error if any of the strings is
329 not known as a nonterminal symbol. In that case the current set
330 of nonterminal symbols is not changed. The method returns the
331 empty string as its result.
332
333 The stored grammar becomes invalid if the deleted nonterminals
334 are referenced by the RHS of still-known rules.
335
336 pegName nonterminal exists nt
337 A predicate. It tests whether the nonterminal symbol nt is known
338 to the PEG in pegName. The result is a boolean value. It will
339 be set to true if the symbol nt is known, and false otherwise.
340
341 pegName nonterminal rename nt ntnew
342 This method renames the nonterminal symbol nt to ntnew. The
343 method fails and throws an error if either nt is not known as a
344 nonterminal, or if ntnew is a known symbol. The method returns
345 the empty string as its result.
346
347 pegName nonterminal mode nt ?mode?
348 This mode returns or sets the semantic mode associated with the
349 nonterminal symbol nt. If no mode is specified the current mode
350 of the nonterminal is returned. Otherwise the current mode is
351 set to mode. The method fails and throws an error if nt is not
352 known as a nonterminal. The grammar interpreter implemented by
353 the package grammar::peg::interpreter recognizes the following
354 modes:
355
356 value The semantic value of the nonterminal is the abstract
357 syntax tree created from the AST's of the RHS and a node
358 for the nonterminal itself.
359
360 match The semantic value of the nonterminal is an the abstract
361 syntax tree consisting of single a node for the string
362 matched by the RHS. The ASTs generated by the RHS are
363 discarded.
364
365 leaf The semantic value of the nonterminal is an the abstract
366 syntax tree consisting of single a node for the nontermi‐
367 nal itself. The ASTs generated by the RHS are discarded.
368
369 discard
370 The nonterminal has no semantic value. The ASTs generated
371 by the RHS are discarded (as well).
372
373 pegName nonterminal rule nt
374 This method returns the parsing expression associated with the
375 nonterminal nt. The method fails and throws an error if nt is
376 not known as a nonterminal.
377
378 pegName unknown nonterminals
379 This method returns a list containing the names of all nontermi‐
380 nal symbols which are referenced on the RHS of a grammatical
381 rule, but have no rule definining their structure. In other
382 words, a list of the nonterminal symbols which make the grammar
383 invalid. The grammar is valid if this list is empty.
384
385 PARSING EXPRESSIONS
386 Various methods of PEG container objects expect a parsing expression as
387 their argument, or will return such. This section specifies the format
388 such parsing expressions are in.
389
390 [1] The string epsilon is an atomic parsing expression. It matches
391 the empty string.
392
393 [2] The string alnum is an atomic parsing expression. It matches any
394 alphanumeric character.
395
396 [3] The string alpha is an atomic parsing expression. It matches any
397 alphabetical character.
398
399 [4] The string dot is an atomic parsing expression. It matches any
400 character.
401
402 [5] The expression [list t x] is an atomic parsing expression. It
403 matches the terminal string x.
404
405 [6] The expression [list n A] is an atomic parsing expression. It
406 matches the nonterminal A.
407
408 [7] For parsing expressions e1, e2, ... the result of [list / e1 e2
409 ... ] is a parsing expression as well. This is the ordered
410 choice, aka prioritized choice.
411
412 [8] For parsing expressions e1, e2, ... the result of [list x e1 e2
413 ... ] is a parsing expression as well. This is the sequence.
414
415 [9] For a parsing expression e the result of [list * e] is a parsing
416 expression as well. This is the kleene closure, describing zero
417 or more repetitions.
418
419 [10] For a parsing expression e the result of [list + e] is a parsing
420 expression as well. This is the positive kleene closure,
421 describing one or more repetitions.
422
423 [11] For a parsing expression e the result of [list & e] is a parsing
424 expression as well. This is the and lookahead predicate.
425
426 [12] For a parsing expression e the result of [list ! e] is a parsing
427 expression as well. This is the not lookahead predicate.
428
429 [13] For a parsing expression e the result of [list ? e] is a parsing
430 expression as well. This is the optional input.
431
432 Examples of parsing expressions where already shown, in the description
433 of the method serialize.
434
436 For the mathematically inclined, a PEG is a 4-tuple (VN,VT,R,eS) where
437
438 · VN is a set of nonterminal symbols,
439
440 · VT is a set of terminal symbols,
441
442 · R is a finite set of rules, where each rule is a pair (A,e), A
443 in VN, and e a parsing expression.
444
445 · eS is a parsing expression, the start expression.
446
447 Further constraints are
448
449 · The intersection of VN and VT is empty.
450
451 · For all A in VT exists exactly one pair (A,e) in R. In other
452 words, R is a function from nonterminal symbols to parsing
453 expressions.
454
455 Parsing expression are inductively defined via
456
457 · The empty string (epsilon) is a parsing expression.
458
459 · A terminal symbol a is a parsing expression.
460
461 · A nonterminal symbol A is a parsing expression.
462
463 · e1e2 is a parsing expression for parsing expressions e1 and 2.
464 This is called sequence.
465
466 · e1/e2 is a parsing expression for parsing expressions e1 and 2.
467 This is called ordered choice.
468
469 · e* is a parsing expression for parsing expression e. This is
470 called zero-or-more repetitions, also known as kleene closure.
471
472 · e+ is a parsing expression for parsing expression e. This is
473 called one-or-more repetitions, also known as positive kleene
474 closure.
475
476 · !e is a parsing expression for parsing expression e1. This is
477 called a not lookahead predicate.
478
479 · &e is a parsing expression for parsing expression e1. This is
480 called an and lookahead predicate.
481
482 PEGs are used to define a grammatical structure for streams of symbols
483 over VT. They are a modern phrasing of older formalisms invented by
484 Alexander Birham. These formalisms were called TS (TMG recognition
485 scheme), and gTS (generalized TS). Later they were renamed to TPDL
486 (Top-Down Parsing Languages) and gTPDL (generalized TPDL).
487
488 They can be easily implemented by recursive descent parsers with back‐
489 tracking. This makes them relatives of LL(k) Context-Free Grammars.
490
492 [1] The Packrat Parsing and Parsing Expression Grammars Page
493 [http://www.pdos.lcs.mit.edu/~baford/packrat/], by Bryan Ford,
494 Massachusetts Institute of Technology. This is the main entry
495 page to PEGs, and their realization through Packrat Parsers.
496
497 [2] Parsing Techniques - A Practical Guide
498 [http://www.cs.vu.nl/~dick/PTAPG.html], an online book offering
499 a clear, accessible, and thorough discussion of many different
500 parsing techniques with their interrelations and applicabili‐
501 ties, including error recovery techniques.
502
503 [3] Compilers and Compiler Generators [http://scifac.ru.ac.za/com‐
504 pilers/], an online book using CoCo/R, a generator for recursive
505 descent parsers.
506
508 LL(k), TDPL, context-free languages, expression, grammar, parsing,
509 parsing expression, parsing expression grammar, push down automaton,
510 recursive descent, state, top-down parsing languages, transducer
511
513 Copyright (c) 2005 Andreas Kupries <andreas_kupries@users.sourceforge.net>
514
515
516
517
518grammar_peg 0.1 grammar::peg(n)