1grammar::peg(n) Grammar operations and usage grammar::peg(n)
2
3
4
5______________________________________________________________________________
6
8 grammar::peg - Create and manipulate parsing expression grammars
9
11 package require Tcl 8.4
12
13 package require snit
14
15 package require grammar::peg ?0.1?
16
17 ::grammar::peg pegName ?=|:=|<--|as|deserialize src?
18
19 pegName destroy
20
21 pegName clear
22
23 pegName = srcPEG
24
25 pegName --> dstPEG
26
27 pegName serialize
28
29 pegName deserialize serialization
30
31 pegName is valid
32
33 pegName start ?pe?
34
35 pegName nonterminals
36
37 pegName nonterminal add nt pe
38
39 pegName nonterminal delete nt1 ?nt2 ...?
40
41 pegName nonterminal exists nt
42
43 pegName nonterminal rename nt ntnew
44
45 pegName nonterminal mode nt ?mode?
46
47 pegName nonterminal rule nt
48
49 pegName unknown nonterminals
50
51_________________________________________________________________
52
54 This package provides a container class for parsing expression grammars
55 (Short: PEG). It allows the incremental definition of the grammar, its
56 manipulation and querying of the definition. The package neither pro‐
57 vides complex operations on the grammar, nor has it the ability to exe‐
58 cute a grammar definition for a stream of symbols. Two packages
59 related to this one are grammar::mengine and grammar::peg::interpreter.
60 The first of them defines a general virtual machine for the matching of
61 a character stream, and the second implements an interpreter for pars‐
62 ing expression grammars on top of that virtual machine.
63
64 TERMS & CONCEPTS
65 PEGs are similar to context-free grammars, but not equivalent; in some
66 cases PEGs are strictly more powerful than context-free grammars (there
67 exist PEGs for some non-context-free languages). The formal mathemati‐
68 cal definition of parsing expressions and parsing expression grammars
69 can be found in section PARSING EXPRESSION GRAMMARS.
70
71 In short, we have terminal symbols, which are the most basic building
72 blocks for sentences, and nonterminal symbols with associated parsing
73 expressions, defining the grammatical structure of the sentences. The
74 two sets of symbols are distinctive, and do not overlap. When speaking
75 about symbols the word "symbol" is often left out. The union of the
76 sets of terminal and nonterminal symbols is called the set of symbols.
77
78 Here the set of terminal symbols is not explicitly managed, but implic‐
79 itly defined as the set of all characters. Note that this means that we
80 inherit from Tcl the ability to handle all of Unicode.
81
82 A pair of nonterminal and parsing expression is also called a grammati‐
83 cal rule, or rule for short. In the context of a rule the nonterminal
84 is often called the left-hand-side (LHS), and the parsing expression
85 the right-hand-side (RHS).
86
87 The start expression of a grammar is a parsing expression from which
88 all the sentences contained in the language specified by the grammar
89 are derived. To make the understanding of this term easier let us
90 assume for a moment that the RHS of each rule, and the start expres‐
91 sion, is either a sequence of symbols, or a series of alternate parsing
92 expressions. In the latter case the rule can be seen as a set of
93 rules, each providing one alternative for the nonterminal. A parsing
94 expression A' is now a derivation of a parsing expression A if we pick
95 one of the nonterminals N in the expression, and one of the alternative
96 rules R for N, and then replace the nonterminal in A with the RHS of
97 the chosen rule. Here we can see why the terminal symbols are called
98 such. They cannot be expanded any further, thus terminate the process
99 of deriving new expressions. An example
100
101 Rules
102 (1) A <- a B c
103 (2a) B <- d B
104 (2b) B <- e
105
106 Some derivations, using starting expression A.
107
108 A -/1/-> a B c -/2a/-> a d B c -/2b/-> a d e c
109
110
111 A derived expression containing only terminal symbols is a sentence.
112 The set of all sentences which can be derived from the start expression
113 is the language of the grammar.
114
115 Some definitions for nonterminals and expressions:
116
117 [1] A nonterminal A is called reachable if it is possible to derive
118 a parsing expression from the start expression which contains A.
119
120 [2] A nonterminal A is called useful if it is possible to derive a
121 sentence from it.
122
123 [3] A nonterminal A is called recursive if it is possible to derive
124 a parsing expression from it which contains A, again.
125
126 [4] The FIRST set of a nonterminal A contains all the symbols which
127 can occur of as the leftmost symbol in a parsing expression
128 derived from A. If the FIRST set contains A itself then that
129 nonterminal is called left-recursive.
130
131 [5] The LAST set of a nonterminal A contains all the symbols which
132 can occur of as the rightmost symbol in a parsing expression
133 derived from A. If the LAST set contains A itself then that non‐
134 terminal is called right-recursive.
135
136 [6] The FOLLOW set of a nonterminal A contains all the symbols which
137 can occur after A in a parsing expression derived from the start
138 expression.
139
140 [7] A nonterminal (or parsing expression) is called nullable if the
141 empty sentence can be derived from it.
142
143 And based on the above definitions for grammars:
144
145 [1] A grammar G is recursive if and only if it contains a nontermi‐
146 nal A which is recursive. The terms left- and right-recursive,
147 and useful are analogously defined.
148
149 [2] A grammar is minimal if it contains only reachable and useful
150 nonterminals.
151
152 [3] A grammar is wellformed if it is not left-recursive. Such gram‐
153 mars are also complete, which means that they always succeed or
154 fail on all input sentences. For an incomplete grammar on the
155 other hand input sentences exist for which an attempt to match
156 them against the grammar will not terminate.
157
158 [4] As we wish to allow ourselves to build a grammar incrementally
159 in a container object we will encounter stages where the RHS of
160 one or more rules reference symbols which are not yet known to
161 the container. Such a grammar we call invalid. We cannot use
162 the term incomplete as this term is already taken, see the last
163 item.
164
165 CONTAINER CLASS API
166 The package exports the API described here.
167
168 ::grammar::peg pegName ?=|:=|<--|as|deserialize src?
169 The command creates a new container object for a parsing expres‐
170 sion grammar and returns the fully qualified name of the object
171 command as its result. The API the returned command is following
172 is described in the section CONTAINER OBJECT API. It may be used
173 to invoke various operations on the container and the grammar
174 within.
175
176 The new container, i.e. grammar will be empty if no src is spec‐
177 ified. Otherwise it will contain a copy of the grammar contained
178 in the src. The src has to be a container object reference for
179 all operators except deserialize. The deserialize operator
180 requires src to be the serialization of a parsing expression
181 grammar instead.
182
183 An empty grammar has no nonterminal symbols, and the start
184 expression is the empty expression, i.e. epsilon. It is valid,
185 but not useful.
186
187 CONTAINER OBJECT API
188 All grammar container objects provide the following methods for the
189 manipulation of their contents:
190
191 pegName destroy
192 Destroys the grammar, including its storage space and associated
193 command.
194
195 pegName clear
196 Clears out the definition of the grammar contained in pegName,
197 but does not destroy the object.
198
199 pegName = srcPEG
200 Assigns the contents of the grammar contained in srcPEG to peg‐
201 Name, overwriting any existing definition. This is the assign‐
202 ment operator for grammars. It copies the grammar contained in
203 the grammar object srcPEG over the grammar definition in peg‐
204 Name. The old contents of pegName are deleted by this operation.
205
206 This operation is in effect equivalent to
207
208
209 pegName deserialize [srcPEG serialize]
210
211
212 pegName --> dstPEG
213 This is the reverse assignment operator for grammars. It copies
214 the automation contained in the object pegName over the grammar
215 definition in the object dstPEG. The old contents of dstPEG are
216 deleted by this operation.
217
218 This operation is in effect equivalent to
219
220
221 dstPEG deserialize [pegName serialize]
222
223
224 pegName serialize
225 This method serializes the grammar stored in pegName. In other
226 words it returns a tcl value completely describing that grammar.
227 This allows, for example, the transfer of grammars over arbi‐
228 trary channels, persistence, etc. This method is also the basis
229 for both the copy constructor and the assignment operator.
230
231 The result of this method has to be semantically identical over
232 all implementations of the grammar::peg interface. This is what
233 will enable us to copy grammars between different implementa‐
234 tions of the same interface.
235
236 The result is a list of four elements with the following struc‐
237 ture:
238
239 [1] The constant string grammar::peg.
240
241 [2] A dictionary. Its keys are the names of all known nonter‐
242 minal symbols, and their associated values are the pars‐
243 ing expressions describing their sentennial structure.
244
245 [3] A dictionary. Its keys are the names of all known nonter‐
246 minal symbols, and their associated values hints to a
247 matcher regarding the semantic values produced by the
248 symbol.
249
250 [4] The last item is a parsing expression, the start expres‐
251 sion of the grammar.
252
253 Assuming the following PEG for simple mathematical expressions
254
255
256 Digit <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'
257 Sign <- '+' / '-'
258 Number <- Sign? Digit+
259 Expression <- '(' Expression ')' / (Factor (MulOp Factor)*)
260 MulOp <- '*' / '/'
261 Factor <- Term (AddOp Term)*
262 AddOp <- '+'/'-'
263 Term <- Number
264
265
266 a possible serialization is
267
268
269 grammar::peg \\
270 {Expression {/ {x ( Expression )} {x Factor {* {x MulOp Factor}}}} \\
271 Factor {x Term {* {x AddOp Term}}} \\
272 Term Number \\
273 MulOp {/ * /} \\
274 AddOp {/ + -} \\
275 Number {x {? Sign} {+ Digit}} \\
276 Sign {/ + -} \\
277 Digit {/ 0 1 2 3 4 5 6 7 8 9} \\
278 } \\
279 {Expression value Factor value \\
280 Term value MulOp value \\
281 AddOp value Number value \\
282 Sign value Digit value \\
283 }
284 Expression
285
286
287 A possible one, because the order of the nonterminals in the dictionary
288 is not relevant.
289
290 pegName deserialize serialization
291 This is the complement to serialize. It replaces the grammar
292 definition in pegName with the grammar described by the serial‐
293 ization value. The old contents of pegName are deleted by this
294 operation.
295
296 pegName is valid
297 A predicate. It tests whether the PEG in pegName is valid. See
298 section TERMS & CONCEPTS for the definition of this grammar
299 property. The result is a boolean value. It will be set to true
300 if the PEG has the tested property, and false otherwise.
301
302 pegName start ?pe?
303 This method defines the start expression of the grammar. It
304 replaces the previously defined start expression with the pars‐
305 ing expression pe. The method fails and throws an error if pe
306 does not contain a valid parsing expression as specified in the
307 section PARSING EXPRESSIONS. In that case the existing start
308 expression is not changed. The method returns the empty string
309 as its result.
310
311 If the method is called without an argument it will return the
312 currently defined start expression.
313
314 pegName nonterminals
315 Returns the set of all nonterminal symbols known to the grammar.
316
317 pegName nonterminal add nt pe
318 This method adds the nonterminal nt and its associated parsing
319 expression pe to the set of nonterminal symbols and rules of the
320 PEG contained in the object pegName. The method fails and
321 throws an error if either the string nt is already known as a
322 symbol of the grammar, or if pe does not contain a valid parsing
323 expression as specified in the section PARSING EXPRESSIONS. In
324 that case the current set of nonterminal symbols and rules is
325 not changed. The method returns the empty string as its result.
326
327 pegName nonterminal delete nt1 ?nt2 ...?
328 This method removes the named symbols nt1, nt2 from the set of
329 nonterminal symbols of the PEG contained in the object pegName.
330 The method fails and throws an error if any of the strings is
331 not known as a nonterminal symbol. In that case the current set
332 of nonterminal symbols is not changed. The method returns the
333 empty string as its result.
334
335 The stored grammar becomes invalid if the deleted nonterminals
336 are referenced by the RHS of still-known rules.
337
338 pegName nonterminal exists nt
339 A predicate. It tests whether the nonterminal symbol nt is known
340 to the PEG in pegName. The result is a boolean value. It will
341 be set to true if the symbol nt is known, and false otherwise.
342
343 pegName nonterminal rename nt ntnew
344 This method renames the nonterminal symbol nt to ntnew. The
345 method fails and throws an error if either nt is not known as a
346 nonterminal, or if ntnew is a known symbol. The method returns
347 the empty string as its result.
348
349 pegName nonterminal mode nt ?mode?
350 This mode returns or sets the semantic mode associated with the
351 nonterminal symbol nt. If no mode is specified the current mode
352 of the nonterminal is returned. Otherwise the current mode is
353 set to mode. The method fails and throws an error if nt is not
354 known as a nonterminal. The grammar interpreter implemented by
355 the package grammar::peg::interpreter recognizes the following
356 modes:
357
358 value The semantic value of the nonterminal is the abstract
359 syntax tree created from the AST's of the RHS and a node
360 for the nonterminal itself.
361
362 match The semantic value of the nonterminal is an the abstract
363 syntax tree consisting of single a node for the string
364 matched by the RHS. The ASTs generated by the RHS are
365 discarded.
366
367 leaf The semantic value of the nonterminal is an the abstract
368 syntax tree consisting of single a node for the nontermi‐
369 nal itself. The ASTs generated by the RHS are discarded.
370
371 discard
372 The nonterminal has no semantic value. The ASTs generated
373 by the RHS are discarded (as well).
374
375 pegName nonterminal rule nt
376 This method returns the parsing expression associated with the
377 nonterminal nt. The method fails and throws an error if nt is
378 not known as a nonterminal.
379
380 pegName unknown nonterminals
381 This method returns a list containing the names of all nontermi‐
382 nal symbols which are referenced on the RHS of a grammatical
383 rule, but have no rule definining their structure. In other
384 words, a list of the nonterminal symbols which make the grammar
385 invalid. The grammar is valid if this list is empty.
386
387 PARSING EXPRESSIONS
388 Various methods of PEG container objects expect a parsing expression as
389 their argument, or will return such. This section specifies the format
390 such parsing expressions are in.
391
392 [1] The string epsilon is an atomic parsing expression. It matches
393 the empty string.
394
395 [2] The string alnum is an atomic parsing expression. It matches any
396 alphanumeric character.
397
398 [3] The string alpha is an atomic parsing expression. It matches any
399 alphabetical character.
400
401 [4] The string dot is an atomic parsing expression. It matches any
402 character.
403
404 [5] The expression [list t x] is an atomic parsing expression. It
405 matches the terminal string x.
406
407 [6] The expression [list n A] is an atomic parsing expression. It
408 matches the nonterminal A.
409
410 [7] For parsing expressions e1, e2, ... the result of [list / e1 e2
411 ... ] is a parsing expression as well. This is the ordered
412 choice, aka prioritized choice.
413
414 [8] For parsing expressions e1, e2, ... the result of [list x e1 e2
415 ... ] is a parsing expression as well. This is the sequence.
416
417 [9] For a parsing expression e the result of [list * e] is a parsing
418 expression as well. This is the kleene closure, describing zero
419 or more repetitions.
420
421 [10] For a parsing expression e the result of [list + e] is a parsing
422 expression as well. This is the positive kleene closure,
423 describing one or more repetitions.
424
425 [11] For a parsing expression e the result of [list & e] is a parsing
426 expression as well. This is the and lookahead predicate.
427
428 [12] For a parsing expression e the result of [list ! e] is a parsing
429 expression as well. This is the not lookahead predicate.
430
431 [13] For a parsing expression e the result of [list ? e] is a parsing
432 expression as well. This is the optional input.
433
434 Examples of parsing expressions where already shown, in the description
435 of the method serialize.
436
438 For the mathematically inclined, a PEG is a 4-tuple (VN,VT,R,eS) where
439
440 · VN is a set of nonterminal symbols,
441
442 · VT is a set of terminal symbols,
443
444 · R is a finite set of rules, where each rule is a pair (A,e), A
445 in VN, and e a parsing expression.
446
447 · eS is a parsing expression, the start expression.
448
449 Further constraints are
450
451 · The intersection of VN and VT is empty.
452
453 · For all A in VT exists exactly one pair (A,e) in R. In other
454 words, R is a function from nonterminal symbols to parsing
455 expressions.
456
457 Parsing expression are inductively defined via
458
459 · The empty string (epsilon) is a parsing expression.
460
461 · A terminal symbol a is a parsing expression.
462
463 · A nonterminal symbol A is a parsing expression.
464
465 · e1e2 is a parsing expression for parsing expressions e1 and 2.
466 This is called sequence.
467
468 · e1/e2 is a parsing expression for parsing expressions e1 and 2.
469 This is called ordered choice.
470
471 · e* is a parsing expression for parsing expression e. This is
472 called zero-or-more repetitions, also known as kleene closure.
473
474 · e+ is a parsing expression for parsing expression e. This is
475 called one-or-more repetitions, also known as positive kleene
476 closure.
477
478 · !e is a parsing expression for parsing expression e1. This is
479 called a not lookahead predicate.
480
481 · &e is a parsing expression for parsing expression e1. This is
482 called an and lookahead predicate.
483
484 PEGs are used to define a grammatical structure for streams of symbols
485 over VT. They are a modern phrasing of older formalisms invented by
486 Alexander Birham. These formalisms were called TS (TMG recognition
487 scheme), and gTS (generalized TS). Later they were renamed to TPDL
488 (Top-Down Parsing Languages) and gTPDL (generalized TPDL).
489
490 They can be easily implemented by recursive descent parsers with back‐
491 tracking. This makes them relatives of LL(k) Context-Free Grammars.
492
494 [1] The Packrat Parsing and Parsing Expression Grammars Page
495 [http://www.pdos.lcs.mit.edu/~baford/packrat/], by Bryan Ford,
496 Massachusetts Institute of Technology. This is the main entry
497 page to PEGs, and their realization through Packrat Parsers.
498
499 [2] Parsing Techniques - A Practical Guide
500 [http://www.cs.vu.nl/~dick/PTAPG.html], an online book offering
501 a clear, accessible, and thorough discussion of many different
502 parsing techniques with their interrelations and applicabili‐
503 ties, including error recovery techniques.
504
505 [3] Compilers and Compiler Generators [http://scifac.ru.ac.za/com‐
506 pilers/], an online book using CoCo/R, a generator for recursive
507 descent parsers.
508
510 This document, and the package it describes, will undoubtedly contain
511 bugs and other problems. Please report such in the category gram‐
512 mar_peg of the Tcllib SF Trackers [http://source‐
513 forge.net/tracker/?group_id=12883]. Please also report any ideas for
514 enhancements you may have for either package and/or documentation.
515
517 LL(k), TDPL, context-free languages, expression, grammar, parsing,
518 parsing expression, parsing expression grammar, push down automaton,
519 recursive descent, state, top-down parsing languages, transducer
520
522 Copyright (c) 2005 Andreas Kupries <andreas_kupries@users.sourceforge.net>
523
524
525
526
527grammar_peg 0.1 grammar::peg(n)