peg(n) - f7

1grammar::peg(n)          Grammar operations and usage          grammar::peg(n)
2
3
4
5______________________________________________________________________________
6

NAME

8       grammar::peg - Create and manipulate parsing expression grammars
9

SYNOPSIS

11       package require Tcl  8.4
12
13       package require snit
14
15       package require grammar::peg  ?0.1?
16
17       ::grammar::peg pegName ?=|:=|<--|as|deserialize src?
18
19       pegName destroy
20
21       pegName clear
22
23       pegName = srcPEG
24
25       pegName --> dstPEG
26
27       pegName serialize
28
29       pegName deserialize serialization
30
31       pegName is valid
32
33       pegName start ?pe?
34
35       pegName nonterminals
36
37       pegName nonterminal add nt pe
38
39       pegName nonterminal delete nt1 ?nt2 ...?
40
41       pegName nonterminal exists nt
42
43       pegName nonterminal rename nt ntnew
44
45       pegName nonterminal mode nt ?mode?
46
47       pegName nonterminal rule nt
48
49       pegName unknown nonterminals
50
51_________________________________________________________________
52

DESCRIPTION

54       This package provides a container class for parsing expression grammars
55       (Short: PEG).  It allows the incremental definition of the grammar, its
56       manipulation  and querying of the definition.  The package neither pro‐
57       vides complex operations on the grammar, nor has it the ability to exe‐
58       cute  a  grammar  definition  for  a  stream  of symbols.  Two packages
59       related to this one are grammar::mengine and grammar::peg::interpreter.
60       The first of them defines a general virtual machine for the matching of
61       a character stream, and the second implements an interpreter for  pars‐
62       ing expression grammars on top of that virtual machine.
63
64   TERMS & CONCEPTS
65       The  formal  mathematical definition of parsing expressions and parsing
66       expression grammars can be found in section  PARSING  EXPRESSION  GRAM‐
67       MARS.
68
69       In  short,  we have terminal symbols, which are the most basic building
70       blocks for sentences, and nonterminal symbols with  associated  parsing
71       expressions,  defining  the grammatical structure of the sentences. The
72       two sets of symbols are distinctive, and do not overlap. When  speaking
73       about  symbols  the  word  "symbol" is often left out. The union of the
74       sets of terminal and nonterminal symbols is called the set of symbols.
75
76       Here the set of terminal symbols is not explicitly managed, but implic‐
77       itly defined as the set of all characters. Note that this means that we
78       inherit from Tcl the ability to handle all of Unicode.
79
80       A pair of nonterminal and parsing expression is also called a grammati‐
81       cal  rule,  or rule for short. In the context of a rule the nonterminal
82       is often called the left-hand-side (LHS), and  the  parsing  expression
83       the right-hand-side (RHS).
84
85       The  start  expression  of a grammar is a parsing expression from which
86       all the sentences contained in the language specified  by  the  grammar
87       are  derived.   To  make  the  understanding of this term easier let us
88       assume for a moment that the RHS of each rule, and  the  start  expres‐
89       sion, is either a sequence of symbols, or a series of alternate parsing
90       expressions.  In the latter case the rule can  be  seen  as  a  set  of
91       rules,  each  providing one alternative for the nonterminal.  A parsing
92       expression A' is now a derivation of a parsing expression A if we  pick
93       one of the nonterminals N in the expression, and one of the alternative
94       rules R for N, and then replace the nonterminal in A with  the  RHS  of
95       the  chosen  rule.  Here we can see why the terminal symbols are called
96       such. They cannot be expanded any further, thus terminate  the  process
97       of deriving new expressions.  An example
98
99           Rules
100             (1)  A <- a B c
101             (2a) B <- d B
102             (2b) B <- e
103
104           Some derivations, using starting expression A.
105
106             A -/1/-> a B c -/2a/-> a d B c -/2b/-> a d e c
107
108
109       A  derived  expression  containing only terminal symbols is a sentence.
110       The set of all sentences which can be derived from the start expression
111       is the language of the grammar.
112
113       Some definitions for nonterminals and expressions:
114
115       [1]    A  nonterminal A is called reachable if it is possible to derive
116              a parsing expression from the start expression which contains A.
117
118       [2]    A nonterminal A is called useful if it is possible to  derive  a
119              sentence from it.
120
121       [3]    A  nonterminal A is called recursive if it is possible to derive
122              a parsing expression from it which contains A, again.
123
124       [4]    The FIRST set of a nonterminal A contains all the symbols  which
125              can  occur  of  as  the  leftmost symbol in a parsing expression
126              derived from A. If the FIRST set contains  A  itself  then  that
127              nonterminal is called left-recursive.
128
129       [5]    The  LAST  set of a nonterminal A contains all the symbols which
130              can occur of as the rightmost symbol  in  a  parsing  expression
131              derived from A. If the LAST set contains A itself then that non‐
132              terminal is called right-recursive.
133
134       [6]    The FOLLOW set of a nonterminal A contains all the symbols which
135              can occur after A in a parsing expression derived from the start
136              expression.
137
138       [7]    A nonterminal (or parsing expression) is called nullable if  the
139              empty sentence can be derived from it.
140
141       And based on the above definitions for grammars:
142
143       [1]    A  grammar G is recursive if and only if it contains a nontermi‐
144              nal A which is recursive. The terms left-  and  right-recursive,
145              and useful are analogously defined.
146
147       [2]    A  grammar  is  minimal if it contains only reachable and useful
148              nonterminals.
149
150       [3]    A grammar is wellformed if it is not left-recursive. Such  gram‐
151              mars  are also complete, which means that they always succeed or
152              fail on all input sentences. For an incomplete  grammar  on  the
153              other  hand  input sentences exist for which an attempt to match
154              them against the grammar will not terminate.
155
156       [4]    As we wish to allow ourselves to build a  grammar  incrementally
157              in  a container object we will encounter stages where the RHS of
158              one or more rules reference symbols which are not yet  known  to
159              the  container.  Such  a grammar we call invalid.  We cannot use
160              the term incomplete as this term is already taken, see the  last
161              item.
162
163   CONTAINER CLASS API
164       The package exports the API described here.
165
166       ::grammar::peg pegName ?=|:=|<--|as|deserialize src?
167              The command creates a new container object for a parsing expres‐
168              sion grammar and returns the fully qualified name of the  object
169              command as its result. The API the returned command is following
170              is described in the section CONTAINER OBJECT API. It may be used
171              to  invoke  various  operations on the container and the grammar
172              within.
173
174              The new container, i.e. grammar will be empty if no src is spec‐
175              ified. Otherwise it will contain a copy of the grammar contained
176              in the src.  The src has to be a container object reference  for
177              all  operators  except  deserialize.   The  deserialize operator
178              requires src to be the serialization  of  a  parsing  expression
179              grammar instead.
180
181              An  empty  grammar  has  no  nonterminal  symbols, and the start
182              expression is the empty expression, i.e. epsilon. It  is  valid,
183              but not useful.
184
185   CONTAINER OBJECT API
186       All  grammar  container  objects  provide the following methods for the
187       manipulation of their contents:
188
189       pegName destroy
190              Destroys the grammar, including its storage space and associated
191              command.
192
193       pegName clear
194              Clears  out  the definition of the grammar contained in pegName,
195              but does not destroy the object.
196
197       pegName = srcPEG
198              Assigns the contents of the grammar contained in srcPEG to  peg‐
199              Name,  overwriting any existing definition.  This is the assign‐
200              ment operator for grammars. It copies the grammar  contained  in
201              the  grammar  object  srcPEG over the grammar definition in peg‐
202              Name. The old contents of pegName are deleted by this operation.
203
204              This operation is in effect equivalent to
205
206
207                  pegName deserialize [srcPEG serialize]
208
209
210       pegName --> dstPEG
211              This is the reverse assignment operator for grammars. It  copies
212              the  automation contained in the object pegName over the grammar
213              definition in the object dstPEG.  The old contents of dstPEG are
214              deleted by this operation.
215
216              This operation is in effect equivalent to
217
218
219                  dstPEG deserialize [pegName serialize]
220
221
222       pegName serialize
223              This  method  serializes the grammar stored in pegName. In other
224              words it returns a tcl value completely describing that grammar.
225              This  allows,  for  example, the transfer of grammars over arbi‐
226              trary channels, persistence, etc.  This method is also the basis
227              for both the copy constructor and the assignment operator.
228
229              The  result of this method has to be semantically identical over
230              all implementations of the grammar::peg interface. This is  what
231              will  enable  us  to copy grammars between different implementa‐
232              tions of the same interface.
233
234              The result is a list of four elements with the following  struc‐
235              ture:
236
237              [1]    The constant string grammar::peg.
238
239              [2]    A dictionary. Its keys are the names of all known nonter‐
240                     minal symbols, and their associated values are the  pars‐
241                     ing expressions describing their sentennial structure.
242
243              [3]    A dictionary. Its keys are the names of all known nonter‐
244                     minal symbols, and their associated  values  hints  to  a
245                     matcher  regarding  the  semantic  values produced by the
246                     symbol.
247
248              [4]    The last item is a parsing expression, the start  expres‐
249                     sion of the grammar.
250
251       Assuming the following PEG for simple mathematical expressions
252
253
254           Digit      <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'
255           Sign       <- '+' / '-'
256           Number     <- Sign? Digit+
257           Expression <- '(' Expression ')' / (Factor (MulOp Factor)*)
258           MulOp      <- '*' / '/'
259           Factor     <- Term (AddOp Term)*
260           AddOp      <- '+'/'-'
261           Term       <- Number
262
263
264       a possible serialization is
265
266
267           grammar::peg \\
268           {Expression {/ {x ( Expression )} {x Factor {* {x MulOp Factor}}}} \\
269            Factor     {x Term {* {x AddOp Term}}} \\
270            Term       Number \\
271            MulOp      {/ * /} \\
272            AddOp      {/ + -} \\
273            Number     {x {? Sign} {+ Digit}} \\
274            Sign       {/ + -} \\
275            Digit      {/ 0 1 2 3 4 5 6 7 8 9} \\
276           } \\
277           {Expression value     Factor     value \\
278            Term       value     MulOp      value \\
279            AddOp      value     Number     value \\
280            Sign       value     Digit      value \\
281           }
282           Expression
283
284
285       A possible one, because the order of the nonterminals in the dictionary
286       is not relevant.
287
288       pegName deserialize serialization
289              This is the complement to serialize.  It  replaces  the  grammar
290              definition  in pegName with the grammar described by the serial‐
291              ization value. The old contents of pegName are deleted  by  this
292              operation.
293
294       pegName is valid
295              A  predicate. It tests whether the PEG in pegName is valid.  See
296              section TERMS & CONCEPTS for  the  definition  of  this  grammar
297              property.  The result is a boolean value. It will be set to true
298              if the PEG has the tested property, and false otherwise.
299
300       pegName start ?pe?
301              This method defines the start  expression  of  the  grammar.  It
302              replaces  the previously defined start expression with the pars‐
303              ing expression pe.  The method fails and throws an error  if  pe
304              does  not contain a valid parsing expression as specified in the
305              section PARSING EXPRESSIONS. In that  case  the  existing  start
306              expression  is not changed.  The method returns the empty string
307              as its result.
308
309              If the method is called without an argument it will  return  the
310              currently defined start expression.
311
312       pegName nonterminals
313              Returns the set of all nonterminal symbols known to the grammar.
314
315       pegName nonterminal add nt pe
316              This  method  adds the nonterminal nt and its associated parsing
317              expression pe to the set of nonterminal symbols and rules of the
318              PEG  contained  in  the  object  pegName.   The method fails and
319              throws an error if either the string nt is already  known  as  a
320              symbol of the grammar, or if pe does not contain a valid parsing
321              expression as specified in the section PARSING  EXPRESSIONS.  In
322              that  case  the  current set of nonterminal symbols and rules is
323              not changed.  The method returns the empty string as its result.
324
325       pegName nonterminal delete nt1 ?nt2 ...?
326              This method removes the named symbols nt1, nt2 from the  set  of
327              nonterminal  symbols of the PEG contained in the object pegName.
328              The method fails and throws an error if any of  the  strings  is
329              not  known as a nonterminal symbol. In that case the current set
330              of nonterminal symbols is not changed.  The method  returns  the
331              empty string as its result.
332
333              The  stored  grammar becomes invalid if the deleted nonterminals
334              are referenced by the RHS of still-known rules.
335
336       pegName nonterminal exists nt
337              A predicate. It tests whether the nonterminal symbol nt is known
338              to  the  PEG in pegName.  The result is a boolean value. It will
339              be set to true if the symbol nt is known, and false otherwise.
340
341       pegName nonterminal rename nt ntnew
342              This method renames the nonterminal symbol  nt  to  ntnew.   The
343              method  fails and throws an error if either nt is not known as a
344              nonterminal, or if ntnew is a known symbol.  The method  returns
345              the empty string as its result.
346
347       pegName nonterminal mode nt ?mode?
348              This  mode returns or sets the semantic mode associated with the
349              nonterminal symbol nt. If no mode is specified the current  mode
350              of  the  nonterminal  is returned. Otherwise the current mode is
351              set to mode.  The method fails and throws an error if nt is  not
352              known  as a nonterminal.  The grammar interpreter implemented by
353              the package grammar::peg::interpreter recognizes  the  following
354              modes:
355
356              value  The  semantic  value  of  the nonterminal is the abstract
357                     syntax tree created from the AST's of the RHS and a  node
358                     for the nonterminal itself.
359
360              match  The  semantic value of the nonterminal is an the abstract
361                     syntax tree consisting of single a node  for  the  string
362                     matched  by  the  RHS.  The ASTs generated by the RHS are
363                     discarded.
364
365              leaf   The semantic value of the nonterminal is an the  abstract
366                     syntax tree consisting of single a node for the nontermi‐
367                     nal itself. The ASTs generated by the RHS are discarded.
368
369              discard
370                     The nonterminal has no semantic value. The ASTs generated
371                     by the RHS are discarded (as well).
372
373       pegName nonterminal rule nt
374              This  method  returns the parsing expression associated with the
375              nonterminal nt.  The method fails and throws an error if  nt  is
376              not known as a nonterminal.
377
378       pegName unknown nonterminals
379              This method returns a list containing the names of all nontermi‐
380              nal symbols which are referenced on the  RHS  of  a  grammatical
381              rule,  but  have  no  rule  definining their structure. In other
382              words, a list of the nonterminal symbols which make the  grammar
383              invalid. The grammar is valid if this list is empty.
384
385   PARSING EXPRESSIONS
386       Various methods of PEG container objects expect a parsing expression as
387       their argument, or will return such. This section specifies the  format
388       such parsing expressions are in.
389
390       [1]    The  string  epsilon is an atomic parsing expression. It matches
391              the empty string.
392
393       [2]    The string alnum is an atomic parsing expression. It matches any
394              alphanumeric character.
395
396       [3]    The string alpha is an atomic parsing expression. It matches any
397              alphabetical character.
398
399       [4]    The string dot is an atomic parsing expression. It  matches  any
400              character.
401
402       [5]    The  expression  [list  t x] is an atomic parsing expression. It
403              matches the terminal string x.
404
405       [6]    The expression [list n A] is an atomic  parsing  expression.  It
406              matches the nonterminal A.
407
408       [7]    For  parsing expressions e1, e2, ... the result of [list / e1 e2
409              ... ] is a parsing expression as  well.   This  is  the  ordered
410              choice, aka prioritized choice.
411
412       [8]    For  parsing expressions e1, e2, ... the result of [list x e1 e2
413              ... ] is a parsing expression as well.  This is the sequence.
414
415       [9]    For a parsing expression e the result of [list * e] is a parsing
416              expression as well.  This is the kleene closure, describing zero
417              or more repetitions.
418
419       [10]   For a parsing expression e the result of [list + e] is a parsing
420              expression  as  well.   This  is  the  positive  kleene closure,
421              describing one or more repetitions.
422
423       [11]   For a parsing expression e the result of [list & e] is a parsing
424              expression as well.  This is the and lookahead predicate.
425
426       [12]   For a parsing expression e the result of [list ! e] is a parsing
427              expression as well.  This is the not lookahead predicate.
428
429       [13]   For a parsing expression e the result of [list ? e] is a parsing
430              expression as well.  This is the optional input.
431
432       Examples of parsing expressions where already shown, in the description
433       of the method serialize.
434

PARSING EXPRESSION GRAMMARS

436       For the mathematically inclined, a PEG is a 4-tuple (VN,VT,R,eS) where
437
438       ·      VN is a set of nonterminal symbols,
439
440       ·      VT is a set of terminal symbols,
441
442       ·      R is a finite set of rules, where each rule is a pair  (A,e),  A
443              in VN, and e a parsing expression.
444
445       ·      eS is a parsing expression, the start expression.
446
447       Further constraints are
448
449       ·      The intersection of VN and VT is empty.
450
451       ·      For  all  A  in  VT exists exactly one pair (A,e) in R. In other
452              words, R is a  function  from  nonterminal  symbols  to  parsing
453              expressions.
454
455       Parsing expression are inductively defined via
456
457       ·      The empty string (epsilon) is a parsing expression.
458
459       ·      A terminal symbol a is a parsing expression.
460
461       ·      A nonterminal symbol A is a parsing expression.
462
463       ·      e1e2  is  a parsing expression for parsing expressions e1 and 2.
464              This is called sequence.
465
466       ·      e1/e2 is a parsing expression for parsing expressions e1 and  2.
467              This is called ordered choice.
468
469       ·      e*  is  a  parsing  expression for parsing expression e. This is
470              called zero-or-more repetitions, also known as kleene closure.
471
472       ·      e+ is a parsing expression for parsing  expression  e.  This  is
473              called  one-or-more  repetitions,  also known as positive kleene
474              closure.
475
476       ·      !e is a parsing expression for parsing expression  e1.  This  is
477              called a not lookahead predicate.
478
479       ·      &e  is  a  parsing expression for parsing expression e1. This is
480              called an and lookahead predicate.
481
482       PEGs are used to define a grammatical structure for streams of  symbols
483       over  VT.  They  are  a modern phrasing of older formalisms invented by
484       Alexander Birham. These formalisms  were  called  TS  (TMG  recognition
485       scheme),  and  gTS  (generalized  TS).  Later they were renamed to TPDL
486       (Top-Down Parsing Languages) and gTPDL (generalized TPDL).
487
488       They can be easily implemented by recursive descent parsers with  back‐
489       tracking. This makes them relatives of LL(k) Context-Free Grammars.
490

REFERENCES

492       [1]    The   Packrat  Parsing  and  Parsing  Expression  Grammars  Page
493              [http://www.pdos.lcs.mit.edu/~baford/packrat/], by  Bryan  Ford,
494              Massachusetts  Institute  of  Technology. This is the main entry
495              page to PEGs, and their realization through Packrat Parsers.
496
497       [2]    Parsing      Techniques      -      A      Practical       Guide
498              [http://www.cs.vu.nl/~dick/PTAPG.html],  an online book offering
499              a clear, accessible, and thorough discussion of  many  different
500              parsing  techniques  with  their interrelations and applicabili‐
501              ties, including error recovery techniques.
502
503       [3]    Compilers and Compiler  Generators  [http://scifac.ru.ac.za/com‐
504              pilers/], an online book using CoCo/R, a generator for recursive
505              descent parsers.
506

KEYWORDS

508       LL(k), TDPL,  context-free  languages,  expression,  grammar,  parsing,
509       parsing  expression,  parsing  expression grammar, push down automaton,
510       recursive descent, state, top-down parsing languages, transducer
511

COPYRIGHT

513       Copyright (c) 2005 Andreas Kupries <andreas_kupries@users.sourceforge.net>
514
515
516
517
518grammar_peg                           0.1                      grammar::peg(n)