peg(n) - f14

1grammar::peg(n)          Grammar operations and usage          grammar::peg(n)
2
3
4
5______________________________________________________________________________
6

NAME

8       grammar::peg - Create and manipulate parsing expression grammars
9

SYNOPSIS

11       package require Tcl  8.4
12
13       package require snit
14
15       package require grammar::peg  ?0.1?
16
17       ::grammar::peg pegName ?=|:=|<--|as|deserialize src?
18
19       pegName destroy
20
21       pegName clear
22
23       pegName = srcPEG
24
25       pegName --> dstPEG
26
27       pegName serialize
28
29       pegName deserialize serialization
30
31       pegName is valid
32
33       pegName start ?pe?
34
35       pegName nonterminals
36
37       pegName nonterminal add nt pe
38
39       pegName nonterminal delete nt1 ?nt2 ...?
40
41       pegName nonterminal exists nt
42
43       pegName nonterminal rename nt ntnew
44
45       pegName nonterminal mode nt ?mode?
46
47       pegName nonterminal rule nt
48
49       pegName unknown nonterminals
50
51_________________________________________________________________
52

DESCRIPTION

54       This package provides a container class for parsing expression grammars
55       (Short: PEG).  It allows the incremental definition of the grammar, its
56       manipulation  and querying of the definition.  The package neither pro‐
57       vides complex operations on the grammar, nor has it the ability to exe‐
58       cute  a  grammar  definition  for  a  stream  of symbols.  Two packages
59       related to this one are grammar::mengine and grammar::peg::interpreter.
60       The first of them defines a general virtual machine for the matching of
61       a character stream, and the second implements an interpreter for  pars‐
62       ing expression grammars on top of that virtual machine.
63
64   TERMS & CONCEPTS
65       PEGs  are similar to context-free grammars, but not equivalent; in some
66       cases PEGs are strictly more powerful than context-free grammars (there
67       exist PEGs for some non-context-free languages).  The formal mathemati‐
68       cal definition of parsing expressions and parsing  expression  grammars
69       can be found in section PARSING EXPRESSION GRAMMARS.
70
71       In  short,  we have terminal symbols, which are the most basic building
72       blocks for sentences, and nonterminal symbols with  associated  parsing
73       expressions,  defining  the grammatical structure of the sentences. The
74       two sets of symbols are distinctive, and do not overlap. When  speaking
75       about  symbols  the  word  "symbol" is often left out. The union of the
76       sets of terminal and nonterminal symbols is called the set of symbols.
77
78       Here the set of terminal symbols is not explicitly managed, but implic‐
79       itly defined as the set of all characters. Note that this means that we
80       inherit from Tcl the ability to handle all of Unicode.
81
82       A pair of nonterminal and parsing expression is also called a grammati‐
83       cal  rule,  or rule for short. In the context of a rule the nonterminal
84       is often called the left-hand-side (LHS), and  the  parsing  expression
85       the right-hand-side (RHS).
86
87       The  start  expression  of a grammar is a parsing expression from which
88       all the sentences contained in the language specified  by  the  grammar
89       are  derived.   To  make  the  understanding of this term easier let us
90       assume for a moment that the RHS of each rule, and  the  start  expres‐
91       sion, is either a sequence of symbols, or a series of alternate parsing
92       expressions.  In the latter case the rule can  be  seen  as  a  set  of
93       rules,  each  providing one alternative for the nonterminal.  A parsing
94       expression A' is now a derivation of a parsing expression A if we  pick
95       one of the nonterminals N in the expression, and one of the alternative
96       rules R for N, and then replace the nonterminal in A with  the  RHS  of
97       the  chosen  rule.  Here we can see why the terminal symbols are called
98       such. They cannot be expanded any further, thus terminate  the  process
99       of deriving new expressions.  An example
100
101           Rules
102             (1)  A <- a B c
103             (2a) B <- d B
104             (2b) B <- e
105
106           Some derivations, using starting expression A.
107
108             A -/1/-> a B c -/2a/-> a d B c -/2b/-> a d e c
109
110
111       A  derived  expression  containing only terminal symbols is a sentence.
112       The set of all sentences which can be derived from the start expression
113       is the language of the grammar.
114
115       Some definitions for nonterminals and expressions:
116
117       [1]    A  nonterminal A is called reachable if it is possible to derive
118              a parsing expression from the start expression which contains A.
119
120       [2]    A nonterminal A is called useful if it is possible to  derive  a
121              sentence from it.
122
123       [3]    A  nonterminal A is called recursive if it is possible to derive
124              a parsing expression from it which contains A, again.
125
126       [4]    The FIRST set of a nonterminal A contains all the symbols  which
127              can  occur  of  as  the  leftmost symbol in a parsing expression
128              derived from A. If the FIRST set contains  A  itself  then  that
129              nonterminal is called left-recursive.
130
131       [5]    The  LAST  set of a nonterminal A contains all the symbols which
132              can occur of as the rightmost symbol  in  a  parsing  expression
133              derived from A. If the LAST set contains A itself then that non‐
134              terminal is called right-recursive.
135
136       [6]    The FOLLOW set of a nonterminal A contains all the symbols which
137              can occur after A in a parsing expression derived from the start
138              expression.
139
140       [7]    A nonterminal (or parsing expression) is called nullable if  the
141              empty sentence can be derived from it.
142
143       And based on the above definitions for grammars:
144
145       [1]    A  grammar G is recursive if and only if it contains a nontermi‐
146              nal A which is recursive. The terms left-  and  right-recursive,
147              and useful are analogously defined.
148
149       [2]    A  grammar  is  minimal if it contains only reachable and useful
150              nonterminals.
151
152       [3]    A grammar is wellformed if it is not left-recursive. Such  gram‐
153              mars  are also complete, which means that they always succeed or
154              fail on all input sentences. For an incomplete  grammar  on  the
155              other  hand  input sentences exist for which an attempt to match
156              them against the grammar will not terminate.
157
158       [4]    As we wish to allow ourselves to build a  grammar  incrementally
159              in  a container object we will encounter stages where the RHS of
160              one or more rules reference symbols which are not yet  known  to
161              the  container.  Such  a grammar we call invalid.  We cannot use
162              the term incomplete as this term is already taken, see the  last
163              item.
164
165   CONTAINER CLASS API
166       The package exports the API described here.
167
168       ::grammar::peg pegName ?=|:=|<--|as|deserialize src?
169              The command creates a new container object for a parsing expres‐
170              sion grammar and returns the fully qualified name of the  object
171              command as its result. The API the returned command is following
172              is described in the section CONTAINER OBJECT API. It may be used
173              to  invoke  various  operations on the container and the grammar
174              within.
175
176              The new container, i.e. grammar will be empty if no src is spec‐
177              ified. Otherwise it will contain a copy of the grammar contained
178              in the src.  The src has to be a container object reference  for
179              all  operators  except  deserialize.   The  deserialize operator
180              requires src to be the serialization  of  a  parsing  expression
181              grammar instead.
182
183              An  empty  grammar  has  no  nonterminal  symbols, and the start
184              expression is the empty expression, i.e. epsilon. It  is  valid,
185              but not useful.
186
187   CONTAINER OBJECT API
188       All  grammar  container  objects  provide the following methods for the
189       manipulation of their contents:
190
191       pegName destroy
192              Destroys the grammar, including its storage space and associated
193              command.
194
195       pegName clear
196              Clears  out  the definition of the grammar contained in pegName,
197              but does not destroy the object.
198
199       pegName = srcPEG
200              Assigns the contents of the grammar contained in srcPEG to  peg‐
201              Name,  overwriting any existing definition.  This is the assign‐
202              ment operator for grammars. It copies the grammar  contained  in
203              the  grammar  object  srcPEG over the grammar definition in peg‐
204              Name. The old contents of pegName are deleted by this operation.
205
206              This operation is in effect equivalent to
207
208
209                  pegName deserialize [srcPEG serialize]
210
211
212       pegName --> dstPEG
213              This is the reverse assignment operator for grammars. It  copies
214              the  automation contained in the object pegName over the grammar
215              definition in the object dstPEG.  The old contents of dstPEG are
216              deleted by this operation.
217
218              This operation is in effect equivalent to
219
220
221                  dstPEG deserialize [pegName serialize]
222
223
224       pegName serialize
225              This  method  serializes the grammar stored in pegName. In other
226              words it returns a tcl value completely describing that grammar.
227              This  allows,  for  example, the transfer of grammars over arbi‐
228              trary channels, persistence, etc.  This method is also the basis
229              for both the copy constructor and the assignment operator.
230
231              The  result of this method has to be semantically identical over
232              all implementations of the grammar::peg interface. This is  what
233              will  enable  us  to copy grammars between different implementa‐
234              tions of the same interface.
235
236              The result is a list of four elements with the following  struc‐
237              ture:
238
239              [1]    The constant string grammar::peg.
240
241              [2]    A dictionary. Its keys are the names of all known nonter‐
242                     minal symbols, and their associated values are the  pars‐
243                     ing expressions describing their sentennial structure.
244
245              [3]    A dictionary. Its keys are the names of all known nonter‐
246                     minal symbols, and their associated  values  hints  to  a
247                     matcher  regarding  the  semantic  values produced by the
248                     symbol.
249
250              [4]    The last item is a parsing expression, the start  expres‐
251                     sion of the grammar.
252
253       Assuming the following PEG for simple mathematical expressions
254
255
256           Digit      <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'
257           Sign       <- '+' / '-'
258           Number     <- Sign? Digit+
259           Expression <- '(' Expression ')' / (Factor (MulOp Factor)*)
260           MulOp      <- '*' / '/'
261           Factor     <- Term (AddOp Term)*
262           AddOp      <- '+'/'-'
263           Term       <- Number
264
265
266       a possible serialization is
267
268
269           grammar::peg \\
270           {Expression {/ {x ( Expression )} {x Factor {* {x MulOp Factor}}}} \\
271            Factor     {x Term {* {x AddOp Term}}} \\
272            Term       Number \\
273            MulOp      {/ * /} \\
274            AddOp      {/ + -} \\
275            Number     {x {? Sign} {+ Digit}} \\
276            Sign       {/ + -} \\
277            Digit      {/ 0 1 2 3 4 5 6 7 8 9} \\
278           } \\
279           {Expression value     Factor     value \\
280            Term       value     MulOp      value \\
281            AddOp      value     Number     value \\
282            Sign       value     Digit      value \\
283           }
284           Expression
285
286
287       A possible one, because the order of the nonterminals in the dictionary
288       is not relevant.
289
290       pegName deserialize serialization
291              This is the complement to serialize.  It  replaces  the  grammar
292              definition  in pegName with the grammar described by the serial‐
293              ization value. The old contents of pegName are deleted  by  this
294              operation.
295
296       pegName is valid
297              A  predicate. It tests whether the PEG in pegName is valid.  See
298              section TERMS & CONCEPTS for  the  definition  of  this  grammar
299              property.  The result is a boolean value. It will be set to true
300              if the PEG has the tested property, and false otherwise.
301
302       pegName start ?pe?
303              This method defines the start  expression  of  the  grammar.  It
304              replaces  the previously defined start expression with the pars‐
305              ing expression pe.  The method fails and throws an error  if  pe
306              does  not contain a valid parsing expression as specified in the
307              section PARSING EXPRESSIONS. In that  case  the  existing  start
308              expression  is not changed.  The method returns the empty string
309              as its result.
310
311              If the method is called without an argument it will  return  the
312              currently defined start expression.
313
314       pegName nonterminals
315              Returns the set of all nonterminal symbols known to the grammar.
316
317       pegName nonterminal add nt pe
318              This  method  adds the nonterminal nt and its associated parsing
319              expression pe to the set of nonterminal symbols and rules of the
320              PEG  contained  in  the  object  pegName.   The method fails and
321              throws an error if either the string nt is already  known  as  a
322              symbol of the grammar, or if pe does not contain a valid parsing
323              expression as specified in the section PARSING  EXPRESSIONS.  In
324              that  case  the  current set of nonterminal symbols and rules is
325              not changed.  The method returns the empty string as its result.
326
327       pegName nonterminal delete nt1 ?nt2 ...?
328              This method removes the named symbols nt1, nt2 from the  set  of
329              nonterminal  symbols of the PEG contained in the object pegName.
330              The method fails and throws an error if any of  the  strings  is
331              not  known as a nonterminal symbol. In that case the current set
332              of nonterminal symbols is not changed.  The method  returns  the
333              empty string as its result.
334
335              The  stored  grammar becomes invalid if the deleted nonterminals
336              are referenced by the RHS of still-known rules.
337
338       pegName nonterminal exists nt
339              A predicate. It tests whether the nonterminal symbol nt is known
340              to  the  PEG in pegName.  The result is a boolean value. It will
341              be set to true if the symbol nt is known, and false otherwise.
342
343       pegName nonterminal rename nt ntnew
344              This method renames the nonterminal symbol  nt  to  ntnew.   The
345              method  fails and throws an error if either nt is not known as a
346              nonterminal, or if ntnew is a known symbol.  The method  returns
347              the empty string as its result.
348
349       pegName nonterminal mode nt ?mode?
350              This  mode returns or sets the semantic mode associated with the
351              nonterminal symbol nt. If no mode is specified the current  mode
352              of  the  nonterminal  is returned. Otherwise the current mode is
353              set to mode.  The method fails and throws an error if nt is  not
354              known  as a nonterminal.  The grammar interpreter implemented by
355              the package grammar::peg::interpreter recognizes  the  following
356              modes:
357
358              value  The  semantic  value  of  the nonterminal is the abstract
359                     syntax tree created from the AST's of the RHS and a  node
360                     for the nonterminal itself.
361
362              match  The  semantic value of the nonterminal is an the abstract
363                     syntax tree consisting of single a node  for  the  string
364                     matched  by  the  RHS.  The ASTs generated by the RHS are
365                     discarded.
366
367              leaf   The semantic value of the nonterminal is an the  abstract
368                     syntax tree consisting of single a node for the nontermi‐
369                     nal itself. The ASTs generated by the RHS are discarded.
370
371              discard
372                     The nonterminal has no semantic value. The ASTs generated
373                     by the RHS are discarded (as well).
374
375       pegName nonterminal rule nt
376              This  method  returns the parsing expression associated with the
377              nonterminal nt.  The method fails and throws an error if  nt  is
378              not known as a nonterminal.
379
380       pegName unknown nonterminals
381              This method returns a list containing the names of all nontermi‐
382              nal symbols which are referenced on the  RHS  of  a  grammatical
383              rule,  but  have  no  rule  definining their structure. In other
384              words, a list of the nonterminal symbols which make the  grammar
385              invalid. The grammar is valid if this list is empty.
386
387   PARSING EXPRESSIONS
388       Various methods of PEG container objects expect a parsing expression as
389       their argument, or will return such. This section specifies the  format
390       such parsing expressions are in.
391
392       [1]    The  string  epsilon is an atomic parsing expression. It matches
393              the empty string.
394
395       [2]    The string alnum is an atomic parsing expression. It matches any
396              alphanumeric character.
397
398       [3]    The string alpha is an atomic parsing expression. It matches any
399              alphabetical character.
400
401       [4]    The string dot is an atomic parsing expression. It  matches  any
402              character.
403
404       [5]    The  expression  [list  t x] is an atomic parsing expression. It
405              matches the terminal string x.
406
407       [6]    The expression [list n A] is an atomic  parsing  expression.  It
408              matches the nonterminal A.
409
410       [7]    For  parsing expressions e1, e2, ... the result of [list / e1 e2
411              ... ] is a parsing expression as  well.   This  is  the  ordered
412              choice, aka prioritized choice.
413
414       [8]    For  parsing expressions e1, e2, ... the result of [list x e1 e2
415              ... ] is a parsing expression as well.  This is the sequence.
416
417       [9]    For a parsing expression e the result of [list * e] is a parsing
418              expression as well.  This is the kleene closure, describing zero
419              or more repetitions.
420
421       [10]   For a parsing expression e the result of [list + e] is a parsing
422              expression  as  well.   This  is  the  positive  kleene closure,
423              describing one or more repetitions.
424
425       [11]   For a parsing expression e the result of [list & e] is a parsing
426              expression as well.  This is the and lookahead predicate.
427
428       [12]   For a parsing expression e the result of [list ! e] is a parsing
429              expression as well.  This is the not lookahead predicate.
430
431       [13]   For a parsing expression e the result of [list ? e] is a parsing
432              expression as well.  This is the optional input.
433
434       Examples of parsing expressions where already shown, in the description
435       of the method serialize.
436

PARSING EXPRESSION GRAMMARS

438       For the mathematically inclined, a PEG is a 4-tuple (VN,VT,R,eS) where
439
440       ·      VN is a set of nonterminal symbols,
441
442       ·      VT is a set of terminal symbols,
443
444       ·      R is a finite set of rules, where each rule is a pair  (A,e),  A
445              in VN, and e a parsing expression.
446
447       ·      eS is a parsing expression, the start expression.
448
449       Further constraints are
450
451       ·      The intersection of VN and VT is empty.
452
453       ·      For  all  A  in  VT exists exactly one pair (A,e) in R. In other
454              words, R is a  function  from  nonterminal  symbols  to  parsing
455              expressions.
456
457       Parsing expression are inductively defined via
458
459       ·      The empty string (epsilon) is a parsing expression.
460
461       ·      A terminal symbol a is a parsing expression.
462
463       ·      A nonterminal symbol A is a parsing expression.
464
465       ·      e1e2  is  a parsing expression for parsing expressions e1 and 2.
466              This is called sequence.
467
468       ·      e1/e2 is a parsing expression for parsing expressions e1 and  2.
469              This is called ordered choice.
470
471       ·      e*  is  a  parsing  expression for parsing expression e. This is
472              called zero-or-more repetitions, also known as kleene closure.
473
474       ·      e+ is a parsing expression for parsing  expression  e.  This  is
475              called  one-or-more  repetitions,  also known as positive kleene
476              closure.
477
478       ·      !e is a parsing expression for parsing expression  e1.  This  is
479              called a not lookahead predicate.
480
481       ·      &e  is  a  parsing expression for parsing expression e1. This is
482              called an and lookahead predicate.
483
484       PEGs are used to define a grammatical structure for streams of  symbols
485       over  VT.  They  are  a modern phrasing of older formalisms invented by
486       Alexander Birham. These formalisms  were  called  TS  (TMG  recognition
487       scheme),  and  gTS  (generalized  TS).  Later they were renamed to TPDL
488       (Top-Down Parsing Languages) and gTPDL (generalized TPDL).
489
490       They can be easily implemented by recursive descent parsers with  back‐
491       tracking. This makes them relatives of LL(k) Context-Free Grammars.
492

REFERENCES

494       [1]    The   Packrat  Parsing  and  Parsing  Expression  Grammars  Page
495              [http://www.pdos.lcs.mit.edu/~baford/packrat/], by  Bryan  Ford,
496              Massachusetts  Institute  of  Technology. This is the main entry
497              page to PEGs, and their realization through Packrat Parsers.
498
499       [2]    Parsing      Techniques      -      A      Practical       Guide
500              [http://www.cs.vu.nl/~dick/PTAPG.html],  an online book offering
501              a clear, accessible, and thorough discussion of  many  different
502              parsing  techniques  with  their interrelations and applicabili‐
503              ties, including error recovery techniques.
504
505       [3]    Compilers and Compiler  Generators  [http://scifac.ru.ac.za/com‐
506              pilers/], an online book using CoCo/R, a generator for recursive
507              descent parsers.
508

BUGS, IDEAS, FEEDBACK

510       This document, and the package it describes, will  undoubtedly  contain
511       bugs  and  other  problems.   Please  report such in the category gram‐
512       mar_peg    of    the     Tcllib     SF     Trackers     [http://source‐
513       forge.net/tracker/?group_id=12883].   Please  also report any ideas for
514       enhancements you may have for either package and/or documentation.
515

KEYWORDS

517       LL(k), TDPL,  context-free  languages,  expression,  grammar,  parsing,
518       parsing  expression,  parsing  expression grammar, push down automaton,
519       recursive descent, state, top-down parsing languages, transducer
520

COPYRIGHT

522       Copyright (c) 2005 Andreas Kupries <andreas_kupries@users.sourceforge.net>
523
524
525
526
527grammar_peg                           0.1                      grammar::peg(n)