peg(n) - f34

1grammar::peg(n)          Grammar operations and usage          grammar::peg(n)
2
3
4
5______________________________________________________________________________
6

NAME

8       grammar::peg - Create and manipulate parsing expression grammars
9

SYNOPSIS

11       package require Tcl  8.4
12
13       package require snit
14
15       package require grammar::peg  ?0.1?
16
17       ::grammar::peg pegName ?=|:=|<--|as|deserialize src?
18
19       pegName destroy
20
21       pegName clear
22
23       pegName = srcPEG
24
25       pegName --> dstPEG
26
27       pegName serialize
28
29       pegName deserialize serialization
30
31       pegName is valid
32
33       pegName start ?pe?
34
35       pegName nonterminals
36
37       pegName nonterminal add nt pe
38
39       pegName nonterminal delete nt1 ?nt2 ...?
40
41       pegName nonterminal exists nt
42
43       pegName nonterminal rename nt ntnew
44
45       pegName nonterminal mode nt ?mode?
46
47       pegName nonterminal rule nt
48
49       pegName unknown nonterminals
50
51______________________________________________________________________________
52

DESCRIPTION

54       This package provides a container class for parsing expression grammars
55       (Short: PEG).  It allows the incremental definition of the grammar, its
56       manipulation  and querying of the definition.  The package neither pro‐
57       vides complex operations on the grammar, nor has it the ability to exe‐
58       cute  a  grammar  definition  for  a  stream  of symbols.  Two packages
59       related to this one are grammar::mengine and grammar::peg::interpreter.
60       The first of them defines a general virtual machine for the matching of
61       a character stream, and the second implements an interpreter for  pars‐
62       ing expression grammars on top of that virtual machine.
63
64   TERMS & CONCEPTS
65       PEGs  are similar to context-free grammars, but not equivalent; in some
66       cases PEGs are strictly more powerful than context-free grammars (there
67       exist PEGs for some non-context-free languages).  The formal mathemati‐
68       cal definition of parsing expressions and parsing  expression  grammars
69       can be found in section PARSING EXPRESSION GRAMMARS.
70
71       In  short,  we have terminal symbols, which are the most basic building
72       blocks for sentences, and nonterminal symbols with  associated  parsing
73       expressions,  defining  the grammatical structure of the sentences. The
74       two sets of symbols are distinctive, and do not overlap. When  speaking
75       about  symbols  the  word  "symbol" is often left out. The union of the
76       sets of terminal and nonterminal symbols is called the set of symbols.
77
78       Here the set of terminal symbols is not explicitly managed, but implic‐
79       itly defined as the set of all characters. Note that this means that we
80       inherit from Tcl the ability to handle all of Unicode.
81
82       A pair of nonterminal and parsing expression is also called a grammati‐
83       cal  rule,  or rule for short. In the context of a rule the nonterminal
84       is often called the left-hand-side (LHS), and  the  parsing  expression
85       the right-hand-side (RHS).
86
87       The  start  expression  of a grammar is a parsing expression from which
88       all the sentences contained in the language specified  by  the  grammar
89       are  derived.   To  make  the  understanding of this term easier let us
90       assume for a moment that the RHS of each rule, and  the  start  expres‐
91       sion, is either a sequence of symbols, or a series of alternate parsing
92       expressions.  In the latter case the rule can  be  seen  as  a  set  of
93       rules,  each  providing one alternative for the nonterminal.  A parsing
94       expression A' is now a derivation of a parsing expression A if we  pick
95       one of the nonterminals N in the expression, and one of the alternative
96       rules R for N, and then replace the nonterminal in A with  the  RHS  of
97       the  chosen  rule.  Here we can see why the terminal symbols are called
98       such. They cannot be expanded any further, thus terminate  the  process
99       of deriving new expressions.  An example
100
101                  Rules
102                    (1)  A <- a B c
103                    (2a) B <- d B
104                    (2b) B <- e
105
106                  Some derivations, using starting expression A.
107
108                    A -/1/-> a B c -/2a/-> a d B c -/2b/-> a d e c
109
110
111       A  derived  expression  containing only terminal symbols is a sentence.
112       The set of all sentences which can be derived from the start expression
113       is the language of the grammar.
114
115       Some definitions for nonterminals and expressions:
116
117       [1]    A  nonterminal A is called reachable if it is possible to derive
118              a parsing expression from the start expression which contains A.
119
120       [2]    A nonterminal A is called useful if it is possible to  derive  a
121              sentence from it.
122
123       [3]    A  nonterminal A is called recursive if it is possible to derive
124              a parsing expression from it which contains A, again.
125
126       [4]    The FIRST set of a nonterminal A contains all the symbols  which
127              can  occur  of  as  the  leftmost symbol in a parsing expression
128              derived from A. If the FIRST set contains  A  itself  then  that
129              nonterminal is called left-recursive.
130
131       [5]    The  LAST  set of a nonterminal A contains all the symbols which
132              can occur of as the rightmost symbol  in  a  parsing  expression
133              derived from A. If the LAST set contains A itself then that non‐
134              terminal is called right-recursive.
135
136       [6]    The FOLLOW set of a nonterminal A contains all the symbols which
137              can occur after A in a parsing expression derived from the start
138              expression.
139
140       [7]    A nonterminal (or parsing expression) is called nullable if  the
141              empty sentence can be derived from it.
142
143       And based on the above definitions for grammars:
144
145       [1]    A  grammar G is recursive if and only if it contains a nontermi‐
146              nal A which is recursive. The terms left-  and  right-recursive,
147              and useful are analogously defined.
148
149       [2]    A  grammar  is  minimal if it contains only reachable and useful
150              nonterminals.
151
152       [3]    A grammar is wellformed if it is not left-recursive. Such  gram‐
153              mars  are also complete, which means that they always succeed or
154              fail on all input sentences. For an incomplete  grammar  on  the
155              other  hand  input sentences exist for which an attempt to match
156              them against the grammar will not terminate.
157
158       [4]    As we wish to allow ourselves to build a  grammar  incrementally
159              in  a container object we will encounter stages where the RHS of
160              one or more rules reference symbols which are not yet  known  to
161              the  container.  Such  a grammar we call invalid.  We cannot use
162              the term incomplete as this term is already taken, see the  last
163              item.
164
165   CONTAINER CLASS API
166       The package exports the API described here.
167
168       ::grammar::peg pegName ?=|:=|<--|as|deserialize src?
169              The command creates a new container object for a parsing expres‐
170              sion grammar and returns the fully qualified name of the  object
171              command as its result. The API the returned command is following
172              is described in the section CONTAINER OBJECT API. It may be used
173              to  invoke  various  operations on the container and the grammar
174              within.
175
176              The new container, i.e. grammar will be empty if no src is spec‐
177              ified. Otherwise it will contain a copy of the grammar contained
178              in the src.  The src has to be a container object reference  for
179              all  operators  except  deserialize.   The  deserialize operator
180              requires src to be the serialization  of  a  parsing  expression
181              grammar instead.
182
183              An  empty  grammar  has  no  nonterminal  symbols, and the start
184              expression is the empty expression, i.e. epsilon. It  is  valid,
185              but not useful.
186
187   CONTAINER OBJECT API
188       All  grammar  container  objects  provide the following methods for the
189       manipulation of their contents:
190
191       pegName destroy
192              Destroys the grammar, including its storage space and associated
193              command.
194
195       pegName clear
196              Clears  out  the definition of the grammar contained in pegName,
197              but does not destroy the object.
198
199       pegName = srcPEG
200              Assigns the contents of the grammar contained in srcPEG to  peg‐
201              Name,  overwriting any existing definition.  This is the assign‐
202              ment operator for grammars. It copies the grammar  contained  in
203              the  grammar  object  srcPEG over the grammar definition in peg‐
204              Name. The old contents of pegName are deleted by this operation.
205
206              This operation is in effect equivalent to
207
208
209
210                  pegName deserialize [srcPEG serialize]
211
212
213       pegName --> dstPEG
214              This is the reverse assignment operator for grammars. It  copies
215              the  automation contained in the object pegName over the grammar
216              definition in the object dstPEG.  The old contents of dstPEG are
217              deleted by this operation.
218
219              This operation is in effect equivalent to
220
221
222
223                  dstPEG deserialize [pegName serialize]
224
225
226       pegName serialize
227              This  method  serializes the grammar stored in pegName. In other
228              words it returns a tcl value completely describing that grammar.
229              This  allows,  for  example, the transfer of grammars over arbi‐
230              trary channels, persistence, etc.  This method is also the basis
231              for both the copy constructor and the assignment operator.
232
233              The  result of this method has to be semantically identical over
234              all implementations of the grammar::peg interface. This is  what
235              will  enable  us  to copy grammars between different implementa‐
236              tions of the same interface.
237
238              The result is a list of four elements with the following  struc‐
239              ture:
240
241              [1]    The constant string grammar::peg.
242
243              [2]    A dictionary. Its keys are the names of all known nonter‐
244                     minal symbols, and their associated values are the  pars‐
245                     ing expressions describing their sentennial structure.
246
247              [3]    A dictionary. Its keys are the names of all known nonter‐
248                     minal symbols, and their associated  values  hints  to  a
249                     matcher  regarding  the  semantic  values produced by the
250                     symbol.
251
252              [4]    The last item is a parsing expression, the start  expres‐
253                     sion of the grammar.
254
255       Assuming the following PEG for simple mathematical expressions
256
257
258
259                  Digit      <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'
260                  Sign       <- '+' / '-'
261                  Number     <- Sign? Digit+
262                  Expression <- '(' Expression ')' / (Factor (MulOp Factor)*)
263                  MulOp      <- '*' / '/'
264                  Factor     <- Term (AddOp Term)*
265                  AddOp      <- '+'/'-'
266                  Term       <- Number
267
268
269       a possible serialization is
270
271
272
273                  grammar::peg \\
274                  {Expression {/ {x ( Expression )} {x Factor {* {x MulOp Factor}}}} \\
275                   Factor     {x Term {* {x AddOp Term}}} \\
276                   Term       Number \\
277                   MulOp      {/ * /} \\
278                   AddOp      {/ + -} \\
279                   Number     {x {? Sign} {+ Digit}} \\
280                   Sign       {/ + -} \\
281                   Digit      {/ 0 1 2 3 4 5 6 7 8 9} \\
282                  } \\
283                  {Expression value     Factor     value \\
284                   Term       value     MulOp      value \\
285                   AddOp      value     Number     value \\
286                   Sign       value     Digit      value \\
287                  }
288                  Expression
289
290
291       A possible one, because the order of the nonterminals in the dictionary
292       is not relevant.
293
294       pegName deserialize serialization
295              This is the complement to serialize.  It  replaces  the  grammar
296              definition  in pegName with the grammar described by the serial‐
297              ization value. The old contents of pegName are deleted  by  this
298              operation.
299
300       pegName is valid
301              A  predicate. It tests whether the PEG in pegName is valid.  See
302              section TERMS & CONCEPTS for  the  definition  of  this  grammar
303              property.  The result is a boolean value. It will be set to true
304              if the PEG has the tested property, and false otherwise.
305
306       pegName start ?pe?
307              This method defines the start  expression  of  the  grammar.  It
308              replaces  the previously defined start expression with the pars‐
309              ing expression pe.  The method fails and throws an error  if  pe
310              does  not contain a valid parsing expression as specified in the
311              section PARSING EXPRESSIONS. In that  case  the  existing  start
312              expression  is not changed.  The method returns the empty string
313              as its result.
314
315              If the method is called without an argument it will  return  the
316              currently defined start expression.
317
318       pegName nonterminals
319              Returns the set of all nonterminal symbols known to the grammar.
320
321       pegName nonterminal add nt pe
322              This  method  adds the nonterminal nt and its associated parsing
323              expression pe to the set of nonterminal symbols and rules of the
324              PEG  contained  in  the  object  pegName.   The method fails and
325              throws an error if either the string nt is already  known  as  a
326              symbol of the grammar, or if pe does not contain a valid parsing
327              expression as specified in the section PARSING  EXPRESSIONS.  In
328              that  case  the  current set of nonterminal symbols and rules is
329              not changed.  The method returns the empty string as its result.
330
331       pegName nonterminal delete nt1 ?nt2 ...?
332              This method removes the named symbols nt1, nt2 from the  set  of
333              nonterminal  symbols of the PEG contained in the object pegName.
334              The method fails and throws an error if any of  the  strings  is
335              not  known as a nonterminal symbol. In that case the current set
336              of nonterminal symbols is not changed.  The method  returns  the
337              empty string as its result.
338
339              The  stored  grammar becomes invalid if the deleted nonterminals
340              are referenced by the RHS of still-known rules.
341
342       pegName nonterminal exists nt
343              A predicate. It tests whether the nonterminal symbol nt is known
344              to  the  PEG in pegName.  The result is a boolean value. It will
345              be set to true if the symbol nt is known, and false otherwise.
346
347       pegName nonterminal rename nt ntnew
348              This method renames the nonterminal symbol  nt  to  ntnew.   The
349              method  fails and throws an error if either nt is not known as a
350              nonterminal, or if ntnew is a known symbol.  The method  returns
351              the empty string as its result.
352
353       pegName nonterminal mode nt ?mode?
354              This  mode returns or sets the semantic mode associated with the
355              nonterminal symbol nt. If no mode is specified the current  mode
356              of  the  nonterminal  is returned. Otherwise the current mode is
357              set to mode.  The method fails and throws an error if nt is  not
358              known  as a nonterminal.  The grammar interpreter implemented by
359              the package grammar::peg::interpreter recognizes  the  following
360              modes:
361
362              value  The  semantic  value  of  the nonterminal is the abstract
363                     syntax tree created from the AST's of the RHS and a  node
364                     for the nonterminal itself.
365
366              match  The  semantic value of the nonterminal is an the abstract
367                     syntax tree consisting of single a node  for  the  string
368                     matched  by  the  RHS.  The ASTs generated by the RHS are
369                     discarded.
370
371              leaf   The semantic value of the nonterminal is an the  abstract
372                     syntax tree consisting of single a node for the nontermi‐
373                     nal itself. The ASTs generated by the RHS are discarded.
374
375              discard
376                     The nonterminal has no semantic value. The ASTs generated
377                     by the RHS are discarded (as well).
378
379       pegName nonterminal rule nt
380              This  method  returns the parsing expression associated with the
381              nonterminal nt.  The method fails and throws an error if  nt  is
382              not known as a nonterminal.
383
384       pegName unknown nonterminals
385              This method returns a list containing the names of all nontermi‐
386              nal symbols which are referenced on the  RHS  of  a  grammatical
387              rule,  but  have  no  rule  definining their structure. In other
388              words, a list of the nonterminal symbols which make the  grammar
389              invalid. The grammar is valid if this list is empty.
390
391   PARSING EXPRESSIONS
392       Various methods of PEG container objects expect a parsing expression as
393       their argument, or will return such. This section specifies the  format
394       such parsing expressions are in.
395
396       [1]    The  string  epsilon is an atomic parsing expression. It matches
397              the empty string.
398
399       [2]    The string alnum is an atomic parsing expression. It matches any
400              alphanumeric character.
401
402       [3]    The string alpha is an atomic parsing expression. It matches any
403              alphabetical character.
404
405       [4]    The string dot is an atomic parsing expression. It  matches  any
406              character.
407
408       [5]    The  expression  [list  t x] is an atomic parsing expression. It
409              matches the terminal string x.
410
411       [6]    The expression [list n A] is an atomic  parsing  expression.  It
412              matches the nonterminal A.
413
414       [7]    For  parsing expressions e1, e2, ... the result of [list / e1 e2
415              ... ] is a parsing expression as  well.   This  is  the  ordered
416              choice, aka prioritized choice.
417
418       [8]    For  parsing expressions e1, e2, ... the result of [list x e1 e2
419              ... ] is a parsing expression as well.  This is the sequence.
420
421       [9]    For a parsing expression e the result of [list * e] is a parsing
422              expression as well.  This is the kleene closure, describing zero
423              or more repetitions.
424
425       [10]   For a parsing expression e the result of [list + e] is a parsing
426              expression  as  well.   This  is  the  positive  kleene closure,
427              describing one or more repetitions.
428
429       [11]   For a parsing expression e the result of [list & e] is a parsing
430              expression as well.  This is the and lookahead predicate.
431
432       [12]   For a parsing expression e the result of [list ! e] is a parsing
433              expression as well.  This is the not lookahead predicate.
434
435       [13]   For a parsing expression e the result of [list ? e] is a parsing
436              expression as well.  This is the optional input.
437
438       Examples of parsing expressions where already shown, in the description
439       of the method serialize.
440

PARSING EXPRESSION GRAMMARS

442       For the mathematically inclined, a PEG is a 4-tuple (VN,VT,R,eS) where
443
444       ·      VN is a set of nonterminal symbols,
445
446       ·      VT is a set of terminal symbols,
447
448       ·      R is a finite set of rules, where each rule is a pair  (A,e),  A
449              in VN, and e a parsing expression.
450
451       ·      eS is a parsing expression, the start expression.
452
453       Further constraints are
454
455       ·      The intersection of VN and VT is empty.
456
457       ·      For  all  A  in  VT exists exactly one pair (A,e) in R. In other
458              words, R is a  function  from  nonterminal  symbols  to  parsing
459              expressions.
460
461       Parsing expression are inductively defined via
462
463       ·      The empty string (epsilon) is a parsing expression.
464
465       ·      A terminal symbol a is a parsing expression.
466
467       ·      A nonterminal symbol A is a parsing expression.
468
469       ·      e1e2  is  a parsing expression for parsing expressions e1 and 2.
470              This is called sequence.
471
472       ·      e1/e2 is a parsing expression for parsing expressions e1 and  2.
473              This is called ordered choice.
474
475       ·      e*  is  a  parsing  expression for parsing expression e. This is
476              called zero-or-more repetitions, also known as kleene closure.
477
478       ·      e+ is a parsing expression for parsing  expression  e.  This  is
479              called  one-or-more  repetitions,  also known as positive kleene
480              closure.
481
482       ·      !e is a parsing expression for parsing expression  e1.  This  is
483              called a not lookahead predicate.
484
485       ·      &e  is  a  parsing expression for parsing expression e1. This is
486              called an and lookahead predicate.
487
488       PEGs are used to define a grammatical structure for streams of  symbols
489       over  VT.  They  are  a modern phrasing of older formalisms invented by
490       Alexander Birham. These formalisms  were  called  TS  (TMG  recognition
491       scheme),  and  gTS  (generalized  TS).  Later they were renamed to TPDL
492       (Top-Down Parsing Languages) and gTPDL (generalized TPDL).
493
494       They can be easily implemented by recursive descent parsers with  back‐
495       tracking. This makes them relatives of LL(k) Context-Free Grammars.
496

REFERENCES

498       [1]    The   Packrat  Parsing  and  Parsing  Expression  Grammars  Page
499              [http://www.pdos.lcs.mit.edu/~baford/packrat/], by  Bryan  Ford,
500              Massachusetts  Institute  of  Technology. This is the main entry
501              page to PEGs, and their realization through Packrat Parsers.
502
503       [2]    Parsing      Techniques      -      A      Practical       Guide
504              [http://www.cs.vu.nl/~dick/PTAPG.html],  an online book offering
505              a clear, accessible, and thorough discussion of  many  different
506              parsing  techniques  with  their interrelations and applicabili‐
507              ties, including error recovery techniques.
508
509       [3]    Compilers and Compiler  Generators  [http://scifac.ru.ac.za/com‐
510              pilers/], an online book using CoCo/R, a generator for recursive
511              descent parsers.
512

BUGS, IDEAS, FEEDBACK

514       This document, and the package it describes, will  undoubtedly  contain
515       bugs  and  other  problems.   Please  report such in the category gram‐
516       mar_peg of the Tcllib Trackers  [http://core.tcl.tk/tcllib/reportlist].
517       Please  also  report any ideas for enhancements you may have for either
518       package and/or documentation.
519
520       When proposing code changes, please provide unified diffs, i.e the out‐
521       put of diff -u.
522
523       Note  further  that  attachments  are  strongly  preferred over inlined
524       patches. Attachments can be made by going  to  the  Edit  form  of  the
525       ticket  immediately  after  its  creation, and then using the left-most
526       button in the secondary navigation bar.
527

KEYWORDS

529       LL(k), TDPL,  context-free  languages,  expression,  grammar,  parsing,
530       parsing  expression,  parsing  expression grammar, push down automaton,
531       recursive descent, state, top-down parsing languages, transducer
532

COPYRIGHT

537       Copyright (c) 2005 Andreas Kupries <andreas_kupries@users.sourceforge.net>
538
539
540
541
542tcllib                                0.1                      grammar::peg(n)