1pt::peg::from::peg(n)            Parser Tools            pt::peg::from::peg(n)
2
3
4
5______________________________________________________________________________
6

NAME

8       pt::peg::from::peg - PEG Conversion. Read PEG format
9

SYNOPSIS

11       package require Tcl  8.5
12
13       package require pt::peg::from::peg  ?1.0.3?
14
15       pt::peg::from::peg convert text
16
17______________________________________________________________________________
18

DESCRIPTION

20       Are  you  lost ?  Do you have trouble understanding this document ?  In
21       that case please read the overview  provided  by  the  Introduction  to
22       Parser  Tools.  This document is the entrypoint to the whole system the
23       current package is a part of.
24
25       This package implements  the  converter  from  PEG  markup  to  parsing
26       expression grammars.
27
28       It resides in the Import section of the Core Layer of Parser Tools, and
29       can be used either directly with the other packages of this  layer,  or
30       indirectly  through the import manager provided by pt::peg::import. The
31       latter is intented for use in untrusted environments and  done  through
32       the  corresponding  import  plugin pt::peg::import::peg sitting between
33       converter and import manager.
34
35       IMAGE: arch_core_iplugins
36

API

38       The API provided by this package satisfies  the  specification  of  the
39       Converter API found in the Parser Tools Import API specification.
40
41       pt::peg::from::peg convert text
42              This  command takes the PEG markup encoding a parsing expression
43              grammar and contained in text, and generates the canonical seri‐
44              alization  of  said grammar, as specified in section PEG serial‐
45              ization format.  The created  value  is  then  returned  as  the
46              result of the command.
47

PEG SPECIFICATION LANGUAGE

49       peg, a language for the specification of parsing expression grammars is
50       meant to be human readable, and writable as well, yet strict enough  to
51       allow  its  processing  by  machine. Like any computer language. It was
52       defined to make writing the specification of a grammar easy,  something
53       the other formats found in the Parser Tools do not lend themselves too.
54
55       It is formally specified by the grammar shown below, written in itself.
56       For a tutorial / introduction to the language please go  and  read  the
57       PEG Language Tutorial.
58
59              PEG pe-grammar-for-peg (Grammar)
60
61                # --------------------------------------------------------------------
62                      # Syntactical constructs
63
64                      Grammar         <- WHITESPACE Header Definition* Final EOF ;
65
66                      Header          <- PEG Identifier StartExpr ;
67                      Definition      <- Attribute? Identifier IS Expression SEMICOLON ;
68                      Attribute       <- (VOID / LEAF) COLON ;
69                      Expression      <- Sequence (SLASH Sequence)* ;
70                      Sequence        <- Prefix+ ;
71                      Prefix          <- (AND / NOT)? Suffix ;
72                      Suffix          <- Primary (QUESTION / STAR / PLUS)? ;
73                      Primary         <- ALNUM / ALPHA / ASCII / CONTROL / DDIGIT / DIGIT
74                                      /  GRAPH / LOWER / PRINTABLE / PUNCT / SPACE / UPPER
75                                      /  WORDCHAR / XDIGIT
76                                      / Identifier
77                                      /  OPEN Expression CLOSE
78                                      /  Literal
79                                      /  Class
80                                      /  DOT
81                                      ;
82                      Literal         <- APOSTROPH  (!APOSTROPH  Char)* APOSTROPH  WHITESPACE
83                                      /  DAPOSTROPH (!DAPOSTROPH Char)* DAPOSTROPH WHITESPACE ;
84                      Class           <- OPENB (!CLOSEB Range)* CLOSEB WHITESPACE ;
85                      Range           <- Char TO Char / Char ;
86
87                      StartExpr       <- OPEN Expression CLOSE ;
88              void:   Final           <- "END" WHITESPACE SEMICOLON WHITESPACE ;
89
90                      # --------------------------------------------------------------------
91                      # Lexing constructs
92
93                      Identifier      <- Ident WHITESPACE ;
94              leaf:   Ident           <- ([_:] / <alpha>) ([_:] / <alnum>)* ;
95                      Char            <- CharSpecial / CharOctalFull / CharOctalPart
96                                      /  CharUnicode / CharUnescaped
97                                      ;
98
99              leaf:   CharSpecial     <- "\\" [nrt'"\[\]\\] ;
100              leaf:   CharOctalFull   <- "\\" [0-2][0-7][0-7] ;
101              leaf:   CharOctalPart   <- "\\" [0-7][0-7]? ;
102              leaf:   CharUnicode     <- "\\" 'u' HexDigit (HexDigit (HexDigit HexDigit?)?)? ;
103              leaf:   CharUnescaped   <- !"\\" . ;
104
105              void:   HexDigit        <- [0-9a-fA-F] ;
106
107              void:   TO              <- '-'           ;
108              void:   OPENB           <- "["           ;
109              void:   CLOSEB          <- "]"           ;
110              void:   APOSTROPH       <- "'"           ;
111              void:   DAPOSTROPH      <- '"'           ;
112              void:   PEG             <- "PEG" !([_:] / <alnum>) WHITESPACE ;
113              void:   IS              <- "<-"    WHITESPACE ;
114              leaf:   VOID            <- "void"  WHITESPACE ; # Implies that definition has no semantic value.
115              leaf:   LEAF            <- "leaf"  WHITESPACE ; # Implies that definition has no terminals.
116              void:   SEMICOLON       <- ";"     WHITESPACE ;
117              void:   COLON           <- ":"     WHITESPACE ;
118              void:   SLASH           <- "/"     WHITESPACE ;
119              leaf:   AND             <- "&"     WHITESPACE ;
120              leaf:   NOT             <- "!"     WHITESPACE ;
121              leaf:   QUESTION        <- "?"     WHITESPACE ;
122              leaf:   STAR            <- "*"     WHITESPACE ;
123              leaf:   PLUS            <- "+"     WHITESPACE ;
124              void:   OPEN            <- "("     WHITESPACE ;
125              void:   CLOSE           <- ")"     WHITESPACE ;
126              leaf:   DOT             <- "."     WHITESPACE ;
127
128              leaf:   ALNUM           <- "<alnum>"    WHITESPACE ;
129              leaf:   ALPHA           <- "<alpha>"    WHITESPACE ;
130              leaf:   ASCII           <- "<ascii>"    WHITESPACE ;
131              leaf:   CONTROL         <- "<control>"  WHITESPACE ;
132              leaf:   DDIGIT          <- "<ddigit>"   WHITESPACE ;
133              leaf:   DIGIT           <- "<digit>"    WHITESPACE ;
134              leaf:   GRAPH           <- "<graph>"    WHITESPACE ;
135              leaf:   LOWER           <- "<lower>"    WHITESPACE ;
136              leaf:   PRINTABLE       <- "<print>"    WHITESPACE ;
137              leaf:   PUNCT           <- "<punct>"    WHITESPACE ;
138              leaf:   SPACE           <- "<space>"    WHITESPACE ;
139              leaf:   UPPER           <- "<upper>"    WHITESPACE ;
140              leaf:   WORDCHAR        <- "<wordchar>" WHITESPACE ;
141              leaf:   XDIGIT          <- "<xdigit>"   WHITESPACE ;
142
143              void:   WHITESPACE      <- (" " / "\t" / EOL / COMMENT)* ;
144              void:   COMMENT         <- '#' (!EOL .)* EOL ;
145              void:   EOL             <- "\n\r" / "\n" / "\r" ;
146              void:   EOF             <- !. ;
147
148                      # --------------------------------------------------------------------
149              END;
150
151
152   EXAMPLE
153       Our example specifies the grammar for a basic 4-operation calculator.
154
155              PEG calculator (Expression)
156                  Digit      <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'       ;
157                  Sign       <- '-' / '+'                                     ;
158                  Number     <- Sign? Digit+                                  ;
159                  Expression <- Term (AddOp Term)*                            ;
160                  MulOp      <- '*' / '/'                                     ;
161                  Term       <- Factor (MulOp Factor)*                        ;
162                  AddOp      <- '+'/'-'                                       ;
163                  Factor     <- '(' Expression ')' / Number                   ;
164              END;
165
166
167       Using higher-level features of the notation, i.e. the character classes
168       (predefined and custom), this example can be rewritten as
169
170              PEG calculator (Expression)
171                  Sign       <- [-+] ;
172                  Number     <- Sign? <ddigit>+;
173                  Expression <- '(' Expression ')' / (Factor (MulOp Factor)*);
174                  MulOp      <- [*/];
175                  Factor     <- Term (AddOp Term)*;
176                  AddOp      <- [-+];
177                  Term       <- Number;
178              END;
179
180

PEG SERIALIZATION FORMAT

182       Here we specify the format used by the Parser Tools to serialize  Pars‐
183       ing  Expression Grammars as immutable values for transport, comparison,
184       etc.
185
186       We distinguish between regular and canonical serializations.   While  a
187       PEG  may  have  more than one regular serialization only exactly one of
188       them will be canonical.
189
190       regular serialization
191
192              [1]    The serialization of any PEG is a nested Tcl dictionary.
193
194              [2]    This dictionary holds a single key, pt::grammar::peg, and
195                     its value. This value holds the contents of the grammar.
196
197              [3]    The  contents of the grammar are a Tcl dictionary holding
198                     the set of nonterminal symbols and the  starting  expres‐
199                     sion. The relevant keys and their values are
200
201                     rules  The  value  is a Tcl dictionary whose keys are the
202                            names of the  nonterminal  symbols  known  to  the
203                            grammar.
204
205                            [1]    Each  nonterminal  symbol  may  occur  only
206                                   once.
207
208                            [2]    The empty string is not a legal nonterminal
209                                   symbol.
210
211                            [3]    The  value for each symbol is a Tcl dictio‐
212                                   nary itself. The relevant  keys  and  their
213                                   values in this dictionary are
214
215                                   is     The  value  is  the serialization of
216                                          the  parsing  expression  describing
217                                          the symbols sentennial structure, as
218                                          specified in the section PE  serial‐
219                                          ization format.
220
221                                   mode   The value can be one of three values
222                                          specifying how a parser should  han‐
223                                          dle  the  semantic value produced by
224                                          the symbol.
225
226                                          value  The  semantic  value  of  the
227                                                 nonterminal   symbol   is  an
228                                                 abstract syntax tree consist‐
229                                                 ing of a single node node for
230                                                 the nonterminal itself, which
231                                                 has  the ASTs of the symbol's
232                                                 right hand side as its  chil‐
233                                                 dren.
234
235                                          leaf   The  semantic  value  of  the
236                                                 nonterminal  symbol   is   an
237                                                 abstract syntax tree consist‐
238                                                 ing of a single node node for
239                                                 the  nonterminal, without any
240                                                 children. Any ASTs  generated
241                                                 by  the  symbol's  right hand
242                                                 side are discarded.
243
244                                          void   The nonterminal has no seman‐
245                                                 tic value. Any ASTs generated
246                                                 by the  symbol's  right  hand
247                                                 side are discarded (as well).
248
249                     start  The  value is the serialization of the start pars‐
250                            ing expression of the grammar, as specified in the
251                            section PE serialization format.
252
253              [4]    The terminal symbols of the grammar are specified implic‐
254                     itly as the set of all terminal symbols used in the start
255                     expression and on the RHS of the grammar rules.
256
257       canonical serialization
258              The canonical serialization of a grammar has the format as spec‐
259              ified in the previous item, and then additionally satisfies  the
260              constraints  below,  which make it unique among all the possible
261              serializations of this grammar.
262
263              [1]    The keys found in all the  nested  Tcl  dictionaries  are
264                     sorted  in  ascending  dictionary  order, as generated by
265                     Tcl's builtin command lsort -increasing -dict.
266
267              [2]    The string representation of the value is  the  canonical
268                     representation of a Tcl dictionary. I.e. it does not con‐
269                     tain superfluous whitespace.
270
271   EXAMPLE
272       Assuming the following PEG for simple mathematical expressions
273
274              PEG calculator (Expression)
275                  Digit      <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'       ;
276                  Sign       <- '-' / '+'                                     ;
277                  Number     <- Sign? Digit+                                  ;
278                  Expression <- Term (AddOp Term)*                            ;
279                  MulOp      <- '*' / '/'                                     ;
280                  Term       <- Factor (MulOp Factor)*                        ;
281                  AddOp      <- '+'/'-'                                       ;
282                  Factor     <- '(' Expression ')' / Number                   ;
283              END;
284
285
286       then its canonical serialization (except for whitespace) is
287
288              pt::grammar::peg {
289                  rules {
290                      AddOp      {is {/ {t -} {t +}}                                                                mode value}
291                      Digit      {is {/ {t 0} {t 1} {t 2} {t 3} {t 4} {t 5} {t 6} {t 7} {t 8} {t 9}}                mode value}
292                      Expression {is {x {n Term} {* {x {n AddOp} {n Term}}}}                                        mode value}
293                      Factor     {is {/ {x {t (} {n Expression} {t )}} {n Number}}                                  mode value}
294                      MulOp      {is {/ {t *} {t /}}                                                                mode value}
295                      Number     {is {x {? {n Sign}} {+ {n Digit}}}                                                 mode value}
296                      Sign       {is {/ {t -} {t +}}                                                                mode value}
297                      Term       {is {x {n Factor} {* {x {n MulOp} {n Factor}}}}                                    mode value}
298                  }
299                  start {n Expression}
300              }
301
302

PE SERIALIZATION FORMAT

304       Here we specify the format used by the Parser Tools to serialize  Pars‐
305       ing Expressions as immutable values for transport, comparison, etc.
306
307       We  distinguish  between regular and canonical serializations.  While a
308       parsing expression may have more than one  regular  serialization  only
309       exactly one of them will be canonical.
310
311       Regular serialization
312
313              Atomic Parsing Expressions
314
315                     [1]    The  string  epsilon  is an atomic parsing expres‐
316                            sion. It matches the empty string.
317
318                     [2]    The string dot is an atomic parsing expression. It
319                            matches any character.
320
321                     [3]    The  string alnum is an atomic parsing expression.
322                            It matches any Unicode alphabet or  digit  charac‐
323                            ter.  This  is  a custom extension of PEs based on
324                            Tcl's builtin command string is.
325
326                     [4]    The string alpha is an atomic parsing  expression.
327                            It matches any Unicode alphabet character. This is
328                            a custom extension of PEs based on  Tcl's  builtin
329                            command string is.
330
331                     [5]    The  string ascii is an atomic parsing expression.
332                            It matches any Unicode character below U0080. This
333                            is  a  custom  extension  of  PEs  based  on Tcl's
334                            builtin command string is.
335
336                     [6]    The string control is an  atomic  parsing  expres‐
337                            sion.  It  matches  any Unicode control character.
338                            This is a custom extension of PEs based  on  Tcl's
339                            builtin command string is.
340
341                     [7]    The  string digit is an atomic parsing expression.
342                            It matches any Unicode digit character. Note  that
343                            this  includes  characters  outside  of the [0..9]
344                            range. This is a custom extension of PEs based  on
345                            Tcl's builtin command string is.
346
347                     [8]    The  string graph is an atomic parsing expression.
348                            It matches any Unicode printing character,  except
349                            for space. This is a custom extension of PEs based
350                            on Tcl's builtin command string is.
351
352                     [9]    The string lower is an atomic parsing  expression.
353                            It matches any Unicode lower-case alphabet charac‐
354                            ter. This is a custom extension of  PEs  based  on
355                            Tcl's builtin command string is.
356
357                     [10]   The  string print is an atomic parsing expression.
358                            It matches any Unicode printing character, includ‐
359                            ing space. This is a custom extension of PEs based
360                            on Tcl's builtin command string is.
361
362                     [11]   The string punct is an atomic parsing  expression.
363                            It matches any Unicode punctuation character. This
364                            is a  custom  extension  of  PEs  based  on  Tcl's
365                            builtin command string is.
366
367                     [12]   The  string space is an atomic parsing expression.
368                            It matches any Unicode space character. This is  a
369                            custom  extension  of  PEs  based on Tcl's builtin
370                            command string is.
371
372                     [13]   The string upper is an atomic parsing  expression.
373                            It matches any Unicode upper-case alphabet charac‐
374                            ter. This is a custom extension of  PEs  based  on
375                            Tcl's builtin command string is.
376
377                     [14]   The  string  wordchar is an atomic parsing expres‐
378                            sion. It matches any Unicode word character.  This
379                            is any alphanumeric character (see alnum), and any
380                            connector  punctuation  characters  (e.g.   under‐
381                            score). This is a custom extension of PEs based on
382                            Tcl's builtin command string is.
383
384                     [15]   The string xdigit is an atomic parsing expression.
385                            It  matches  any hexadecimal digit character. This
386                            is a  custom  extension  of  PEs  based  on  Tcl's
387                            builtin command string is.
388
389                     [16]   The string ddigit is an atomic parsing expression.
390                            It matches any decimal digit character. This is  a
391                            custom  extension  of  PEs  based on Tcl's builtin
392                            command regexp.
393
394                     [17]   The expression [list t x]  is  an  atomic  parsing
395                            expression. It matches the terminal string x.
396
397                     [18]   The  expression  [list  n  A] is an atomic parsing
398                            expression. It matches the nonterminal A.
399
400              Combined Parsing Expressions
401
402                     [1]    For parsing expressions e1, e2, ... the result  of
403                            [list  /  e1  e2  ... ] is a parsing expression as
404                            well.  This is the ordered choice, aka prioritized
405                            choice.
406
407                     [2]    For  parsing expressions e1, e2, ... the result of
408                            [list x e1 e2 ... ] is  a  parsing  expression  as
409                            well.  This is the sequence.
410
411                     [3]    For  a  parsing expression e the result of [list *
412                            e] is a parsing expression as well.  This  is  the
413                            kleene  closure,  describing  zero or more repeti‐
414                            tions.
415
416                     [4]    For a parsing expression e the result of  [list  +
417                            e]  is  a parsing expression as well.  This is the
418                            positive kleene closure, describing  one  or  more
419                            repetitions.
420
421                     [5]    For  a  parsing expression e the result of [list &
422                            e] is a parsing expression as well.  This  is  the
423                            and lookahead predicate.
424
425                     [6]    For  a  parsing expression e the result of [list !
426                            e] is a parsing expression as well.  This  is  the
427                            not lookahead predicate.
428
429                     [7]    For  a  parsing expression e the result of [list ?
430                            e] is a parsing expression as well.  This  is  the
431                            optional input.
432
433       Canonical serialization
434              The canonical serialization of a parsing expression has the for‐
435              mat as specified in the previous  item,  and  then  additionally
436              satisfies  the constraints below, which make it unique among all
437              the possible serializations of this parsing expression.
438
439              [1]    The string representation of the value is  the  canonical
440                     representation  of a pure Tcl list. I.e. it does not con‐
441                     tain superfluous whitespace.
442
443              [2]    Terminals are not encoded as ranges (where start and  end
444                     of the range are identical).
445
446   EXAMPLE
447       Assuming  the  parsing  expression  shown on the right-hand side of the
448       rule
449
450                  Expression <- Term (AddOp Term)*
451
452
453       then its canonical serialization (except for whitespace) is
454
455                  {x {n Term} {* {x {n AddOp} {n Term}}}}
456
457

BUGS, IDEAS, FEEDBACK

459       This document, and the package it describes, will  undoubtedly  contain
460       bugs  and other problems.  Please report such in the category pt of the
461       Tcllib Trackers  [http://core.tcl.tk/tcllib/reportlist].   Please  also
462       report  any  ideas  for  enhancements  you  may have for either package
463       and/or documentation.
464
465       When proposing code changes, please provide unified diffs, i.e the out‐
466       put of diff -u.
467
468       Note  further  that  attachments  are  strongly  preferred over inlined
469       patches. Attachments can be made by going  to  the  Edit  form  of  the
470       ticket  immediately  after  its  creation, and then using the left-most
471       button in the secondary navigation bar.
472

KEYWORDS

474       EBNF, LL(k), PEG, TDPL, context-free languages, conversion, expression,
475       format conversion, grammar, matching, parser, parsing expression, pars‐
476       ing expression grammar, push down automaton, recursive descent, serial‐
477       ization, state, top-down parsing languages, transducer
478

CATEGORY

480       Parsing and Grammars
481
483       Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
484
485
486
487
488tcllib                               1.0.3               pt::peg::from::peg(n)
Impressum