1grammar::peg(n) Grammar operations and usage grammar::peg(n)
2
3
4
5______________________________________________________________________________
6
8 grammar::peg - Create and manipulate parsing expression grammars
9
11 package require Tcl 8.4
12
13 package require snit
14
15 package require grammar::peg ?0.1?
16
17 ::grammar::peg pegName ?=|:=|<--|as|deserialize src?
18
19 pegName destroy
20
21 pegName clear
22
23 pegName = srcPEG
24
25 pegName --> dstPEG
26
27 pegName serialize
28
29 pegName deserialize serialization
30
31 pegName is valid
32
33 pegName start ?pe?
34
35 pegName nonterminals
36
37 pegName nonterminal add nt pe
38
39 pegName nonterminal delete nt1 ?nt2 ...?
40
41 pegName nonterminal exists nt
42
43 pegName nonterminal rename nt ntnew
44
45 pegName nonterminal mode nt ?mode?
46
47 pegName nonterminal rule nt
48
49 pegName unknown nonterminals
50
51______________________________________________________________________________
52
54 This package provides a container class for parsing expression grammars
55 (Short: PEG). It allows the incremental definition of the grammar, its
56 manipulation and querying of the definition. The package neither pro‐
57 vides complex operations on the grammar, nor has it the ability to exe‐
58 cute a grammar definition for a stream of symbols. Two packages
59 related to this one are grammar::mengine and grammar::peg::interpreter.
60 The first of them defines a general virtual machine for the matching of
61 a character stream, and the second implements an interpreter for pars‐
62 ing expression grammars on top of that virtual machine.
63
64 TERMS & CONCEPTS
65 PEGs are similar to context-free grammars, but not equivalent; in some
66 cases PEGs are strictly more powerful than context-free grammars (there
67 exist PEGs for some non-context-free languages). The formal mathemati‐
68 cal definition of parsing expressions and parsing expression grammars
69 can be found in section PARSING EXPRESSION GRAMMARS.
70
71 In short, we have terminal symbols, which are the most basic building
72 blocks for sentences, and nonterminal symbols with associated parsing
73 expressions, defining the grammatical structure of the sentences. The
74 two sets of symbols are distinctive, and do not overlap. When speaking
75 about symbols the word "symbol" is often left out. The union of the
76 sets of terminal and nonterminal symbols is called the set of symbols.
77
78 Here the set of terminal symbols is not explicitly managed, but implic‐
79 itly defined as the set of all characters. Note that this means that we
80 inherit from Tcl the ability to handle all of Unicode.
81
82 A pair of nonterminal and parsing expression is also called a grammati‐
83 cal rule, or rule for short. In the context of a rule the nonterminal
84 is often called the left-hand-side (LHS), and the parsing expression
85 the right-hand-side (RHS).
86
87 The start expression of a grammar is a parsing expression from which
88 all the sentences contained in the language specified by the grammar
89 are derived. To make the understanding of this term easier let us
90 assume for a moment that the RHS of each rule, and the start expres‐
91 sion, is either a sequence of symbols, or a series of alternate parsing
92 expressions. In the latter case the rule can be seen as a set of
93 rules, each providing one alternative for the nonterminal. A parsing
94 expression A' is now a derivation of a parsing expression A if we pick
95 one of the nonterminals N in the expression, and one of the alternative
96 rules R for N, and then replace the nonterminal in A with the RHS of
97 the chosen rule. Here we can see why the terminal symbols are called
98 such. They cannot be expanded any further, thus terminate the process
99 of deriving new expressions. An example
100
101 Rules
102 (1) A <- a B c
103 (2a) B <- d B
104 (2b) B <- e
105
106 Some derivations, using starting expression A.
107
108 A -/1/-> a B c -/2a/-> a d B c -/2b/-> a d e c
109
110
111 A derived expression containing only terminal symbols is a sentence.
112 The set of all sentences which can be derived from the start expression
113 is the language of the grammar.
114
115 Some definitions for nonterminals and expressions:
116
117 [1] A nonterminal A is called reachable if it is possible to derive
118 a parsing expression from the start expression which contains A.
119
120 [2] A nonterminal A is called useful if it is possible to derive a
121 sentence from it.
122
123 [3] A nonterminal A is called recursive if it is possible to derive
124 a parsing expression from it which contains A, again.
125
126 [4] The FIRST set of a nonterminal A contains all the symbols which
127 can occur of as the leftmost symbol in a parsing expression
128 derived from A. If the FIRST set contains A itself then that
129 nonterminal is called left-recursive.
130
131 [5] The LAST set of a nonterminal A contains all the symbols which
132 can occur of as the rightmost symbol in a parsing expression
133 derived from A. If the LAST set contains A itself then that non‐
134 terminal is called right-recursive.
135
136 [6] The FOLLOW set of a nonterminal A contains all the symbols which
137 can occur after A in a parsing expression derived from the start
138 expression.
139
140 [7] A nonterminal (or parsing expression) is called nullable if the
141 empty sentence can be derived from it.
142
143 And based on the above definitions for grammars:
144
145 [1] A grammar G is recursive if and only if it contains a nontermi‐
146 nal A which is recursive. The terms left- and right-recursive,
147 and useful are analogously defined.
148
149 [2] A grammar is minimal if it contains only reachable and useful
150 nonterminals.
151
152 [3] A grammar is wellformed if it is not left-recursive. Such gram‐
153 mars are also complete, which means that they always succeed or
154 fail on all input sentences. For an incomplete grammar on the
155 other hand input sentences exist for which an attempt to match
156 them against the grammar will not terminate.
157
158 [4] As we wish to allow ourselves to build a grammar incrementally
159 in a container object we will encounter stages where the RHS of
160 one or more rules reference symbols which are not yet known to
161 the container. Such a grammar we call invalid. We cannot use
162 the term incomplete as this term is already taken, see the last
163 item.
164
165 CONTAINER CLASS API
166 The package exports the API described here.
167
168 ::grammar::peg pegName ?=|:=|<--|as|deserialize src?
169 The command creates a new container object for a parsing expres‐
170 sion grammar and returns the fully qualified name of the object
171 command as its result. The API the returned command is following
172 is described in the section CONTAINER OBJECT API. It may be used
173 to invoke various operations on the container and the grammar
174 within.
175
176 The new container, i.e. grammar will be empty if no src is spec‐
177 ified. Otherwise it will contain a copy of the grammar contained
178 in the src. The src has to be a container object reference for
179 all operators except deserialize. The deserialize operator
180 requires src to be the serialization of a parsing expression
181 grammar instead.
182
183 An empty grammar has no nonterminal symbols, and the start
184 expression is the empty expression, i.e. epsilon. It is valid,
185 but not useful.
186
187 CONTAINER OBJECT API
188 All grammar container objects provide the following methods for the
189 manipulation of their contents:
190
191 pegName destroy
192 Destroys the grammar, including its storage space and associated
193 command.
194
195 pegName clear
196 Clears out the definition of the grammar contained in pegName,
197 but does not destroy the object.
198
199 pegName = srcPEG
200 Assigns the contents of the grammar contained in srcPEG to peg‐
201 Name, overwriting any existing definition. This is the assign‐
202 ment operator for grammars. It copies the grammar contained in
203 the grammar object srcPEG over the grammar definition in peg‐
204 Name. The old contents of pegName are deleted by this operation.
205
206 This operation is in effect equivalent to
207
208
209
210 pegName deserialize [srcPEG serialize]
211
212
213 pegName --> dstPEG
214 This is the reverse assignment operator for grammars. It copies
215 the automation contained in the object pegName over the grammar
216 definition in the object dstPEG. The old contents of dstPEG are
217 deleted by this operation.
218
219 This operation is in effect equivalent to
220
221
222
223 dstPEG deserialize [pegName serialize]
224
225
226 pegName serialize
227 This method serializes the grammar stored in pegName. In other
228 words it returns a tcl value completely describing that grammar.
229 This allows, for example, the transfer of grammars over arbi‐
230 trary channels, persistence, etc. This method is also the basis
231 for both the copy constructor and the assignment operator.
232
233 The result of this method has to be semantically identical over
234 all implementations of the grammar::peg interface. This is what
235 will enable us to copy grammars between different implementa‐
236 tions of the same interface.
237
238 The result is a list of four elements with the following struc‐
239 ture:
240
241 [1] The constant string grammar::peg.
242
243 [2] A dictionary. Its keys are the names of all known nonter‐
244 minal symbols, and their associated values are the pars‐
245 ing expressions describing their sentennial structure.
246
247 [3] A dictionary. Its keys are the names of all known nonter‐
248 minal symbols, and their associated values hints to a
249 matcher regarding the semantic values produced by the
250 symbol.
251
252 [4] The last item is a parsing expression, the start expres‐
253 sion of the grammar.
254
255 Assuming the following PEG for simple mathematical expressions
256
257
258
259 Digit <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'
260 Sign <- '+' / '-'
261 Number <- Sign? Digit+
262 Expression <- '(' Expression ')' / (Factor (MulOp Factor)*)
263 MulOp <- '*' / '/'
264 Factor <- Term (AddOp Term)*
265 AddOp <- '+'/'-'
266 Term <- Number
267
268
269 a possible serialization is
270
271
272
273 grammar::peg \\
274 {Expression {/ {x ( Expression )} {x Factor {* {x MulOp Factor}}}} \\
275 Factor {x Term {* {x AddOp Term}}} \\
276 Term Number \\
277 MulOp {/ * /} \\
278 AddOp {/ + -} \\
279 Number {x {? Sign} {+ Digit}} \\
280 Sign {/ + -} \\
281 Digit {/ 0 1 2 3 4 5 6 7 8 9} \\
282 } \\
283 {Expression value Factor value \\
284 Term value MulOp value \\
285 AddOp value Number value \\
286 Sign value Digit value \\
287 }
288 Expression
289
290
291 A possible one, because the order of the nonterminals in the dictionary
292 is not relevant.
293
294 pegName deserialize serialization
295 This is the complement to serialize. It replaces the grammar
296 definition in pegName with the grammar described by the serial‐
297 ization value. The old contents of pegName are deleted by this
298 operation.
299
300 pegName is valid
301 A predicate. It tests whether the PEG in pegName is valid. See
302 section TERMS & CONCEPTS for the definition of this grammar
303 property. The result is a boolean value. It will be set to true
304 if the PEG has the tested property, and false otherwise.
305
306 pegName start ?pe?
307 This method defines the start expression of the grammar. It
308 replaces the previously defined start expression with the pars‐
309 ing expression pe. The method fails and throws an error if pe
310 does not contain a valid parsing expression as specified in the
311 section PARSING EXPRESSIONS. In that case the existing start
312 expression is not changed. The method returns the empty string
313 as its result.
314
315 If the method is called without an argument it will return the
316 currently defined start expression.
317
318 pegName nonterminals
319 Returns the set of all nonterminal symbols known to the grammar.
320
321 pegName nonterminal add nt pe
322 This method adds the nonterminal nt and its associated parsing
323 expression pe to the set of nonterminal symbols and rules of the
324 PEG contained in the object pegName. The method fails and
325 throws an error if either the string nt is already known as a
326 symbol of the grammar, or if pe does not contain a valid parsing
327 expression as specified in the section PARSING EXPRESSIONS. In
328 that case the current set of nonterminal symbols and rules is
329 not changed. The method returns the empty string as its result.
330
331 pegName nonterminal delete nt1 ?nt2 ...?
332 This method removes the named symbols nt1, nt2 from the set of
333 nonterminal symbols of the PEG contained in the object pegName.
334 The method fails and throws an error if any of the strings is
335 not known as a nonterminal symbol. In that case the current set
336 of nonterminal symbols is not changed. The method returns the
337 empty string as its result.
338
339 The stored grammar becomes invalid if the deleted nonterminals
340 are referenced by the RHS of still-known rules.
341
342 pegName nonterminal exists nt
343 A predicate. It tests whether the nonterminal symbol nt is known
344 to the PEG in pegName. The result is a boolean value. It will
345 be set to true if the symbol nt is known, and false otherwise.
346
347 pegName nonterminal rename nt ntnew
348 This method renames the nonterminal symbol nt to ntnew. The
349 method fails and throws an error if either nt is not known as a
350 nonterminal, or if ntnew is a known symbol. The method returns
351 the empty string as its result.
352
353 pegName nonterminal mode nt ?mode?
354 This mode returns or sets the semantic mode associated with the
355 nonterminal symbol nt. If no mode is specified the current mode
356 of the nonterminal is returned. Otherwise the current mode is
357 set to mode. The method fails and throws an error if nt is not
358 known as a nonterminal. The grammar interpreter implemented by
359 the package grammar::peg::interpreter recognizes the following
360 modes:
361
362 value The semantic value of the nonterminal is the abstract
363 syntax tree created from the AST's of the RHS and a node
364 for the nonterminal itself.
365
366 match The semantic value of the nonterminal is an the abstract
367 syntax tree consisting of single a node for the string
368 matched by the RHS. The ASTs generated by the RHS are
369 discarded.
370
371 leaf The semantic value of the nonterminal is an the abstract
372 syntax tree consisting of single a node for the nontermi‐
373 nal itself. The ASTs generated by the RHS are discarded.
374
375 discard
376 The nonterminal has no semantic value. The ASTs generated
377 by the RHS are discarded (as well).
378
379 pegName nonterminal rule nt
380 This method returns the parsing expression associated with the
381 nonterminal nt. The method fails and throws an error if nt is
382 not known as a nonterminal.
383
384 pegName unknown nonterminals
385 This method returns a list containing the names of all nontermi‐
386 nal symbols which are referenced on the RHS of a grammatical
387 rule, but have no rule definining their structure. In other
388 words, a list of the nonterminal symbols which make the grammar
389 invalid. The grammar is valid if this list is empty.
390
391 PARSING EXPRESSIONS
392 Various methods of PEG container objects expect a parsing expression as
393 their argument, or will return such. This section specifies the format
394 such parsing expressions are in.
395
396 [1] The string epsilon is an atomic parsing expression. It matches
397 the empty string.
398
399 [2] The string alnum is an atomic parsing expression. It matches any
400 alphanumeric character.
401
402 [3] The string alpha is an atomic parsing expression. It matches any
403 alphabetical character.
404
405 [4] The string dot is an atomic parsing expression. It matches any
406 character.
407
408 [5] The expression [list t x] is an atomic parsing expression. It
409 matches the terminal string x.
410
411 [6] The expression [list n A] is an atomic parsing expression. It
412 matches the nonterminal A.
413
414 [7] For parsing expressions e1, e2, ... the result of [list / e1 e2
415 ... ] is a parsing expression as well. This is the ordered
416 choice, aka prioritized choice.
417
418 [8] For parsing expressions e1, e2, ... the result of [list x e1 e2
419 ... ] is a parsing expression as well. This is the sequence.
420
421 [9] For a parsing expression e the result of [list * e] is a parsing
422 expression as well. This is the kleene closure, describing zero
423 or more repetitions.
424
425 [10] For a parsing expression e the result of [list + e] is a parsing
426 expression as well. This is the positive kleene closure,
427 describing one or more repetitions.
428
429 [11] For a parsing expression e the result of [list & e] is a parsing
430 expression as well. This is the and lookahead predicate.
431
432 [12] For a parsing expression e the result of [list ! e] is a parsing
433 expression as well. This is the not lookahead predicate.
434
435 [13] For a parsing expression e the result of [list ? e] is a parsing
436 expression as well. This is the optional input.
437
438 Examples of parsing expressions where already shown, in the description
439 of the method serialize.
440
442 For the mathematically inclined, a PEG is a 4-tuple (VN,VT,R,eS) where
443
444 · VN is a set of nonterminal symbols,
445
446 · VT is a set of terminal symbols,
447
448 · R is a finite set of rules, where each rule is a pair (A,e), A
449 in VN, and e a parsing expression.
450
451 · eS is a parsing expression, the start expression.
452
453 Further constraints are
454
455 · The intersection of VN and VT is empty.
456
457 · For all A in VT exists exactly one pair (A,e) in R. In other
458 words, R is a function from nonterminal symbols to parsing
459 expressions.
460
461 Parsing expression are inductively defined via
462
463 · The empty string (epsilon) is a parsing expression.
464
465 · A terminal symbol a is a parsing expression.
466
467 · A nonterminal symbol A is a parsing expression.
468
469 · e1e2 is a parsing expression for parsing expressions e1 and 2.
470 This is called sequence.
471
472 · e1/e2 is a parsing expression for parsing expressions e1 and 2.
473 This is called ordered choice.
474
475 · e* is a parsing expression for parsing expression e. This is
476 called zero-or-more repetitions, also known as kleene closure.
477
478 · e+ is a parsing expression for parsing expression e. This is
479 called one-or-more repetitions, also known as positive kleene
480 closure.
481
482 · !e is a parsing expression for parsing expression e1. This is
483 called a not lookahead predicate.
484
485 · &e is a parsing expression for parsing expression e1. This is
486 called an and lookahead predicate.
487
488 PEGs are used to define a grammatical structure for streams of symbols
489 over VT. They are a modern phrasing of older formalisms invented by
490 Alexander Birham. These formalisms were called TS (TMG recognition
491 scheme), and gTS (generalized TS). Later they were renamed to TPDL
492 (Top-Down Parsing Languages) and gTPDL (generalized TPDL).
493
494 They can be easily implemented by recursive descent parsers with back‐
495 tracking. This makes them relatives of LL(k) Context-Free Grammars.
496
498 [1] The Packrat Parsing and Parsing Expression Grammars Page
499 [http://www.pdos.lcs.mit.edu/~baford/packrat/], by Bryan Ford,
500 Massachusetts Institute of Technology. This is the main entry
501 page to PEGs, and their realization through Packrat Parsers.
502
503 [2] Parsing Techniques - A Practical Guide
504 [http://www.cs.vu.nl/~dick/PTAPG.html], an online book offering
505 a clear, accessible, and thorough discussion of many different
506 parsing techniques with their interrelations and applicabili‐
507 ties, including error recovery techniques.
508
509 [3] Compilers and Compiler Generators [http://scifac.ru.ac.za/com‐
510 pilers/], an online book using CoCo/R, a generator for recursive
511 descent parsers.
512
514 This document, and the package it describes, will undoubtedly contain
515 bugs and other problems. Please report such in the category gram‐
516 mar_peg of the Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist].
517 Please also report any ideas for enhancements you may have for either
518 package and/or documentation.
519
520 When proposing code changes, please provide unified diffs, i.e the out‐
521 put of diff -u.
522
523 Note further that attachments are strongly preferred over inlined
524 patches. Attachments can be made by going to the Edit form of the
525 ticket immediately after its creation, and then using the left-most
526 button in the secondary navigation bar.
527
529 LL(k), TDPL, context-free languages, expression, grammar, parsing,
530 parsing expression, parsing expression grammar, push down automaton,
531 recursive descent, state, top-down parsing languages, transducer
532
534 Grammars and finite automata
535
537 Copyright (c) 2005 Andreas Kupries <andreas_kupries@users.sourceforge.net>
538
539
540
541
542tcllib 0.1 grammar::peg(n)