1pt::peg(n)                       Parser Tools                       pt::peg(n)
2
3
4
5______________________________________________________________________________
6

NAME

8       pt::peg - Parsing Expression Grammar Serialization
9

SYNOPSIS

11       package require Tcl  8.5
12
13       package require pt::peg  ?1?
14
15       package require pt::pe
16
17       ::pt::peg verify serial ?canonvar?
18
19       ::pt::peg verify-as-canonical serial
20
21       ::pt::peg canonicalize serial
22
23       ::pt::peg print serial
24
25       ::pt::peg merge seriala serialb
26
27       ::pt::peg equal seriala serialb
28
29______________________________________________________________________________
30

DESCRIPTION

32       Are  you  lost ?  Do you have trouble understanding this document ?  In
33       that case please read the overview  provided  by  the  Introduction  to
34       Parser  Tools.  This document is the entrypoint to the whole system the
35       current package is a part of.
36
37       This package provides commands to work with the serializations of pars‐
38       ing  expression  grammars as managed by the Parser Tools, and specified
39       in section PEG serialization format.
40
41       This is a supporting package in the Core Layer of Parser Tools.
42
43       IMAGE: arch_core_support
44

API

46       ::pt::peg verify serial ?canonvar?
47              This command verifies that the content  of  serial  is  a  valid
48              serialization of a parsing expression and will throw an error if
49              that is not the case. The result of the  command  is  the  empty
50              string.
51
52              If  the  argument canonvar is specified it is interpreted as the
53              name of a variable in the calling context. This variable will be
54              written  to  if and only if serial is a valid regular serializa‐
55              tion. Its value will be a boolean, with True indicating that the
56              serialization  is not only valid, but also canonical. False will
57              be written for a valid, but non-canonical serialization.
58
59              For the specification of serializations see the section PE seri‐
60              alization format.
61
62       ::pt::peg verify-as-canonical serial
63              This  command  verifies  that  the  content of serial is a valid
64              canonical serialization of a PEG and will throw an error if that
65              is not the case. The result of the command is the empty string.
66
67              For  the  specification of canonical serializations see the sec‐
68              tion PEG serialization format.
69
70       ::pt::peg canonicalize serial
71              This command assumes that the content of serial is a valid regu‐
72              lar  serialization  of  a PEG and will throw an error if that is
73              not the case.
74
75              It will then convert the input into the canonical  serialization
76              of  the  contained PEG and return it as its result. If the input
77              is already canonical it will be returned unchanged.
78
79              For the specification of regular  and  canonical  serializations
80              see the section PEG serialization format.
81
82       ::pt::peg print serial
83              This  command  assumes that the argument serial contains a valid
84              serialization of a parsing expression and returns a string  con‐
85              taining that PE in a human readable form.
86
87              The  exact  format  of  this form is not specified and cannot be
88              relied on for parsing or other machine-based activities.
89
90              For the specification of  serializations  see  the  section  PEG
91              serialization format.
92
93       ::pt::peg merge seriala serialb
94              This  command accepts the regular serializations of two grammars
95              and uses them to create their union.  The result of the  command
96              is the canonical serialization of this unified grammar.
97
98              A  merge  errors occurs if for any nonterminal symbol S occuring
99              in both input grammars the two input grammars specify  different
100              semantic modes.
101
102              The  semantic  mode of each nonterminal symbol S is the semantic
103              mode of S in any of its input grammars. The previous  rule  made
104              sure that for symbols occuring in both grammars these values are
105              identical.
106
107              The right-hand side of each nonterminal  symbol  S  occuring  in
108              both  input  grammars is the choice between the right-hand sides
109              of S in the input grammars, with the parsing expression of S  in
110              seriala  coming first, except if both expressions are identical.
111              In that case the first expression is taken.
112
113              The right-hand side of each nonterminal  symbol  S  occuring  in
114              only  one  of  the input grammars is the right-hand side of S in
115              its input grammar.
116
117              The start expression  of  the  unified  grammar  is  the  choice
118              between  the  start  expressions of the input grammars, with the
119              start expression of seriala coming first, except if both expres‐
120              sions are identical.  In that case the first expression is taken
121
122       ::pt::peg equal seriala serialb
123              This  command  tests  the  two  grammars seriala and serialb for
124              structural equality. The result of  the  command  is  a  boolean
125              value.  It will be set to true if the expressions are identical,
126              and false otherwise.
127
128              String equality is usable only if we can  assume  that  the  two
129              grammars are pure Tcl lists and dictionaries.
130

PEG SERIALIZATION FORMAT

132       Here  we specify the format used by the Parser Tools to serialize Pars‐
133       ing Expression Grammars as immutable values for transport,  comparison,
134       etc.
135
136       We  distinguish  between regular and canonical serializations.  While a
137       PEG may have more than one regular serialization only  exactly  one  of
138       them will be canonical.
139
140       regular serialization
141
142              [1]    The serialization of any PEG is a nested Tcl dictionary.
143
144              [2]    This dictionary holds a single key, pt::grammar::peg, and
145                     its value. This value holds the contents of the grammar.
146
147              [3]    The contents of the grammar are a Tcl dictionary  holding
148                     the  set  of nonterminal symbols and the starting expres‐
149                     sion. The relevant keys and their values are
150
151                     rules  The value is a Tcl dictionary whose keys  are  the
152                            names  of  the  nonterminal  symbols  known to the
153                            grammar.
154
155                            [1]    Each  nonterminal  symbol  may  occur  only
156                                   once.
157
158                            [2]    The empty string is not a legal nonterminal
159                                   symbol.
160
161                            [3]    The value for each symbol is a Tcl  dictio‐
162                                   nary  itself.  The  relevant keys and their
163                                   values in this dictionary are
164
165                                   is     The value is  the  serialization  of
166                                          the  parsing  expression  describing
167                                          the symbols sentennial structure, as
168                                          specified  in the section PE serial‐
169                                          ization format.
170
171                                   mode   The value can be one of three values
172                                          specifying  how a parser should han‐
173                                          dle the semantic value  produced  by
174                                          the symbol.
175
176                                          value  The  semantic  value  of  the
177                                                 nonterminal  symbol   is   an
178                                                 abstract syntax tree consist‐
179                                                 ing of a single node node for
180                                                 the nonterminal itself, which
181                                                 has the ASTs of the  symbol's
182                                                 right  hand side as its chil‐
183                                                 dren.
184
185                                          leaf   The  semantic  value  of  the
186                                                 nonterminal   symbol   is  an
187                                                 abstract syntax tree consist‐
188                                                 ing of a single node node for
189                                                 the nonterminal, without  any
190                                                 children.  Any ASTs generated
191                                                 by the  symbol's  right  hand
192                                                 side are discarded.
193
194                                          void   The nonterminal has no seman‐
195                                                 tic value. Any ASTs generated
196                                                 by  the  symbol's  right hand
197                                                 side are discarded (as well).
198
199                     start  The value is the serialization of the start  pars‐
200                            ing expression of the grammar, as specified in the
201                            section PE serialization format.
202
203              [4]    The terminal symbols of the grammar are specified implic‐
204                     itly as the set of all terminal symbols used in the start
205                     expression and on the RHS of the grammar rules.
206
207       canonical serialization
208              The canonical serialization of a grammar has the format as spec‐
209              ified  in the previous item, and then additionally satisfies the
210              constraints below, which make it unique among all  the  possible
211              serializations of this grammar.
212
213              [1]    The  keys  found  in  all the nested Tcl dictionaries are
214                     sorted in ascending dictionary  order,  as  generated  by
215                     Tcl's builtin command lsort -increasing -dict.
216
217              [2]    The  string  representation of the value is the canonical
218                     representation of a Tcl dictionary. I.e. it does not con‐
219                     tain superfluous whitespace.
220
221   EXAMPLE
222       Assuming the following PEG for simple mathematical expressions
223
224              PEG calculator (Expression)
225                  Digit      <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'       ;
226                  Sign       <- '-' / '+'                                     ;
227                  Number     <- Sign? Digit+                                  ;
228                  Expression <- Term (AddOp Term)*                            ;
229                  MulOp      <- '*' / '/'                                     ;
230                  Term       <- Factor (MulOp Factor)*                        ;
231                  AddOp      <- '+'/'-'                                       ;
232                  Factor     <- '(' Expression ')' / Number                   ;
233              END;
234
235
236       then its canonical serialization (except for whitespace) is
237
238              pt::grammar::peg {
239                  rules {
240                      AddOp      {is {/ {t -} {t +}}                                                                mode value}
241                      Digit      {is {/ {t 0} {t 1} {t 2} {t 3} {t 4} {t 5} {t 6} {t 7} {t 8} {t 9}}                mode value}
242                      Expression {is {x {n Term} {* {x {n AddOp} {n Term}}}}                                        mode value}
243                      Factor     {is {/ {x {t (} {n Expression} {t )}} {n Number}}                                  mode value}
244                      MulOp      {is {/ {t *} {t /}}                                                                mode value}
245                      Number     {is {x {? {n Sign}} {+ {n Digit}}}                                                 mode value}
246                      Sign       {is {/ {t -} {t +}}                                                                mode value}
247                      Term       {is {x {n Factor} {* {x {n MulOp} {n Factor}}}}                                    mode value}
248                  }
249                  start {n Expression}
250              }
251
252

PE SERIALIZATION FORMAT

254       Here  we specify the format used by the Parser Tools to serialize Pars‐
255       ing Expressions as immutable values for transport, comparison, etc.
256
257       We distinguish between regular and canonical serializations.   While  a
258       parsing  expression  may  have more than one regular serialization only
259       exactly one of them will be canonical.
260
261       Regular serialization
262
263              Atomic Parsing Expressions
264
265                     [1]    The string epsilon is an  atomic  parsing  expres‐
266                            sion. It matches the empty string.
267
268                     [2]    The string dot is an atomic parsing expression. It
269                            matches any character.
270
271                     [3]    The string alnum is an atomic parsing  expression.
272                            It  matches  any Unicode alphabet or digit charac‐
273                            ter. This is a custom extension of  PEs  based  on
274                            Tcl's builtin command string is.
275
276                     [4]    The  string alpha is an atomic parsing expression.
277                            It matches any Unicode alphabet character. This is
278                            a  custom  extension of PEs based on Tcl's builtin
279                            command string is.
280
281                     [5]    The string ascii is an atomic parsing  expression.
282                            It matches any Unicode character below U0080. This
283                            is a  custom  extension  of  PEs  based  on  Tcl's
284                            builtin command string is.
285
286                     [6]    The  string  control  is an atomic parsing expres‐
287                            sion. It matches any  Unicode  control  character.
288                            This  is  a custom extension of PEs based on Tcl's
289                            builtin command string is.
290
291                     [7]    The string digit is an atomic parsing  expression.
292                            It  matches any Unicode digit character. Note that
293                            this includes characters  outside  of  the  [0..9]
294                            range.  This is a custom extension of PEs based on
295                            Tcl's builtin command string is.
296
297                     [8]    The string graph is an atomic parsing  expression.
298                            It  matches any Unicode printing character, except
299                            for space. This is a custom extension of PEs based
300                            on Tcl's builtin command string is.
301
302                     [9]    The  string lower is an atomic parsing expression.
303                            It matches any Unicode lower-case alphabet charac‐
304                            ter.  This  is  a custom extension of PEs based on
305                            Tcl's builtin command string is.
306
307                     [10]   The string print is an atomic parsing  expression.
308                            It matches any Unicode printing character, includ‐
309                            ing space. This is a custom extension of PEs based
310                            on Tcl's builtin command string is.
311
312                     [11]   The  string punct is an atomic parsing expression.
313                            It matches any Unicode punctuation character. This
314                            is  a  custom  extension  of  PEs  based  on Tcl's
315                            builtin command string is.
316
317                     [12]   The string space is an atomic parsing  expression.
318                            It  matches any Unicode space character. This is a
319                            custom extension of PEs  based  on  Tcl's  builtin
320                            command string is.
321
322                     [13]   The  string upper is an atomic parsing expression.
323                            It matches any Unicode upper-case alphabet charac‐
324                            ter.  This  is  a custom extension of PEs based on
325                            Tcl's builtin command string is.
326
327                     [14]   The string wordchar is an atomic  parsing  expres‐
328                            sion.  It matches any Unicode word character. This
329                            is any alphanumeric character (see alnum), and any
330                            connector  punctuation  characters  (e.g.   under‐
331                            score). This is a custom extension of PEs based on
332                            Tcl's builtin command string is.
333
334                     [15]   The string xdigit is an atomic parsing expression.
335                            It matches any hexadecimal digit  character.  This
336                            is  a  custom  extension  of  PEs  based  on Tcl's
337                            builtin command string is.
338
339                     [16]   The string ddigit is an atomic parsing expression.
340                            It  matches any decimal digit character. This is a
341                            custom extension of PEs  based  on  Tcl's  builtin
342                            command regexp.
343
344                     [17]   The  expression  [list  t  x] is an atomic parsing
345                            expression. It matches the terminal string x.
346
347                     [18]   The expression [list n A]  is  an  atomic  parsing
348                            expression. It matches the nonterminal A.
349
350              Combined Parsing Expressions
351
352                     [1]    For  parsing expressions e1, e2, ... the result of
353                            [list / e1 e2 ... ] is  a  parsing  expression  as
354                            well.  This is the ordered choice, aka prioritized
355                            choice.
356
357                     [2]    For parsing expressions e1, e2, ... the result  of
358                            [list  x  e1  e2  ... ] is a parsing expression as
359                            well.  This is the sequence.
360
361                     [3]    For a parsing expression e the result of  [list  *
362                            e]  is  a parsing expression as well.  This is the
363                            kleene closure, describing zero  or  more  repeti‐
364                            tions.
365
366                     [4]    For  a  parsing expression e the result of [list +
367                            e] is a parsing expression as well.  This  is  the
368                            positive  kleene  closure,  describing one or more
369                            repetitions.
370
371                     [5]    For a parsing expression e the result of  [list  &
372                            e]  is  a parsing expression as well.  This is the
373                            and lookahead predicate.
374
375                     [6]    For a parsing expression e the result of  [list  !
376                            e]  is  a parsing expression as well.  This is the
377                            not lookahead predicate.
378
379                     [7]    For a parsing expression e the result of  [list  ?
380                            e]  is  a parsing expression as well.  This is the
381                            optional input.
382
383       Canonical serialization
384              The canonical serialization of a parsing expression has the for‐
385              mat  as  specified  in  the previous item, and then additionally
386              satisfies the constraints below, which make it unique among  all
387              the possible serializations of this parsing expression.
388
389              [1]    The  string  representation of the value is the canonical
390                     representation of a pure Tcl list. I.e. it does not  con‐
391                     tain superfluous whitespace.
392
393              [2]    Terminals  are not encoded as ranges (where start and end
394                     of the range are identical).
395
396   EXAMPLE
397       Assuming the parsing expression shown on the  right-hand  side  of  the
398       rule
399
400                  Expression <- Term (AddOp Term)*
401
402
403       then its canonical serialization (except for whitespace) is
404
405                  {x {n Term} {* {x {n AddOp} {n Term}}}}
406
407

BUGS, IDEAS, FEEDBACK

409       This  document,  and the package it describes, will undoubtedly contain
410       bugs and other problems.  Please report such in the category pt of  the
411       Tcllib  Trackers  [http://core.tcl.tk/tcllib/reportlist].   Please also
412       report any ideas for enhancements  you  may  have  for  either  package
413       and/or documentation.
414
415       When proposing code changes, please provide unified diffs, i.e the out‐
416       put of diff -u.
417
418       Note further that  attachments  are  strongly  preferred  over  inlined
419       patches.  Attachments  can  be  made  by  going to the Edit form of the
420       ticket immediately after its creation, and  then  using  the  left-most
421       button in the secondary navigation bar.
422

KEYWORDS

424       EBNF,  LL(k),  PEG,  TDPL, context-free languages, expression, grammar,
425       matching, parser, parsing expression, parsing expression grammar,  push
426       down  automaton,  recursive descent, state, top-down parsing languages,
427       transducer
428

CATEGORY

430       Parsing and Grammars
431
433       Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
434
435
436
437
438tcllib                                 1                            pt::peg(n)
Impressum