pt_peg_import_peg(n)

1pt::peg::import::peg(n)          Parser Tools          pt::peg::import::peg(n)
2
3
4
5______________________________________________________________________________
6

NAME

8       pt::peg::import::peg - PEG Import Plugin. Read PEG format
9

SYNOPSIS

11       package require Tcl  8.5
12
13       package require pt::peg::import::peg  ?1?
14
15       package require pt::peg::to::peg
16
17       import text
18
19______________________________________________________________________________
20

DESCRIPTION

22       Are  you  lost ?  Do you have trouble understanding this document ?  In
23       that case please read the overview  provided  by  the  Introduction  to
24       Parser  Tools.  This document is the entrypoint to the whole system the
25       current package is a part of.
26
27       This package implements the parsing expression  grammar  import  plugin
28       processing PEG markup.
29
30       It  resides in the Import section of the Core Layer of Parser Tools and
31       is intended to be used by pt::peg::import, the import manager,  sitting
32       between it and the corresponding core conversion functionality provided
33       by pt::peg::from::peg.
34
35       IMAGE: arch_core_iplugins
36
37       While the direct use of this package with a regular interpreter is pos‐
38       sible, this is strongly disrecommended and requires a number of contor‐
39       tions to provide the expected environment.  The proper way to use  this
40       functionality depends on the situation:
41
42       [1]    In  an  untrusted  environment  the proper access is through the
43              package pt::peg::import and the import manager objects  it  pro‐
44              vides.
45
46       [2]    In   a  trusted  environment  however  simply  use  the  package
47              pt::peg::from::peg and access the core conversion  functionality
48              directly.
49

API

51       The  API  provided  by  this package satisfies the specification of the
52       Plugin API found in the Parser Tools Import API specification.
53
54       import text
55              This command takes the PEG markup encoding a parsing  expression
56              grammar and contained in text, and generates the canonical seri‐
57              alization of said grammar, as specified in section  PEG  serial‐
58              ization  format.   The  created  value  is  then returned as the
59              result of the command.
60

PEG SPECIFICATION LANGUAGE

62       peg, a language for the specification of parsing expression grammars is
63       meant  to be human readable, and writable as well, yet strict enough to
64       allow its processing by machine. Like any  computer  language.  It  was
65       defined  to make writing the specification of a grammar easy, something
66       the other formats found in the Parser Tools do not lend themselves too.
67
68       It is formally specified by the grammar shown below, written in itself.
69       For  a  tutorial  / introduction to the language please go and read the
70       PEG Language Tutorial.
71
72              PEG pe-grammar-for-peg (Grammar)
73
74                # --------------------------------------------------------------------
75                      # Syntactical constructs
76
77                      Grammar         <- WHITESPACE Header Definition* Final EOF ;
78
79                      Header          <- PEG Identifier StartExpr ;
80                      Definition      <- Attribute? Identifier IS Expression SEMICOLON ;
81                      Attribute       <- (VOID / LEAF) COLON ;
82                      Expression      <- Sequence (SLASH Sequence)* ;
83                      Sequence        <- Prefix+ ;
84                      Prefix          <- (AND / NOT)? Suffix ;
85                      Suffix          <- Primary (QUESTION / STAR / PLUS)? ;
86                      Primary         <- ALNUM / ALPHA / ASCII / CONTROL / DDIGIT / DIGIT
87                                      /  GRAPH / LOWER / PRINTABLE / PUNCT / SPACE / UPPER
88                                      /  WORDCHAR / XDIGIT
89                                      / Identifier
90                                      /  OPEN Expression CLOSE
91                                      /  Literal
92                                      /  Class
93                                      /  DOT
94                                      ;
95                      Literal         <- APOSTROPH  (!APOSTROPH  Char)* APOSTROPH  WHITESPACE
96                                      /  DAPOSTROPH (!DAPOSTROPH Char)* DAPOSTROPH WHITESPACE ;
97                      Class           <- OPENB (!CLOSEB Range)* CLOSEB WHITESPACE ;
98                      Range           <- Char TO Char / Char ;
99
100                      StartExpr       <- OPEN Expression CLOSE ;
101              void:   Final           <- "END" WHITESPACE SEMICOLON WHITESPACE ;
102
103                      # --------------------------------------------------------------------
104                      # Lexing constructs
105
106                      Identifier      <- Ident WHITESPACE ;
107              leaf:   Ident           <- ([_:] / <alpha>) ([_:] / <alnum>)* ;
108                      Char            <- CharSpecial / CharOctalFull / CharOctalPart
109                                      /  CharUnicode / CharUnescaped
110                                      ;
111
112              leaf:   CharSpecial     <- "\\" [nrt'"\[\]\\] ;
113              leaf:   CharOctalFull   <- "\\" [0-2][0-7][0-7] ;
114              leaf:   CharOctalPart   <- "\\" [0-7][0-7]? ;
115              leaf:   CharUnicode     <- "\\" 'u' HexDigit (HexDigit (HexDigit HexDigit?)?)? ;
116              leaf:   CharUnescaped   <- !"\\" . ;
117
118              void:   HexDigit        <- [0-9a-fA-F] ;
119
120              void:   TO              <- '-'           ;
121              void:   OPENB           <- "["           ;
122              void:   CLOSEB          <- "]"           ;
123              void:   APOSTROPH       <- "'"           ;
124              void:   DAPOSTROPH      <- '"'           ;
125              void:   PEG             <- "PEG" !([_:] / <alnum>) WHITESPACE ;
126              void:   IS              <- "<-"    WHITESPACE ;
127              leaf:   VOID            <- "void"  WHITESPACE ; # Implies that definition has no semantic value.
128              leaf:   LEAF            <- "leaf"  WHITESPACE ; # Implies that definition has no terminals.
129              void:   SEMICOLON       <- ";"     WHITESPACE ;
130              void:   COLON           <- ":"     WHITESPACE ;
131              void:   SLASH           <- "/"     WHITESPACE ;
132              leaf:   AND             <- "&"     WHITESPACE ;
133              leaf:   NOT             <- "!"     WHITESPACE ;
134              leaf:   QUESTION        <- "?"     WHITESPACE ;
135              leaf:   STAR            <- "*"     WHITESPACE ;
136              leaf:   PLUS            <- "+"     WHITESPACE ;
137              void:   OPEN            <- "("     WHITESPACE ;
138              void:   CLOSE           <- ")"     WHITESPACE ;
139              leaf:   DOT             <- "."     WHITESPACE ;
140
141              leaf:   ALNUM           <- "<alnum>"    WHITESPACE ;
142              leaf:   ALPHA           <- "<alpha>"    WHITESPACE ;
143              leaf:   ASCII           <- "<ascii>"    WHITESPACE ;
144              leaf:   CONTROL         <- "<control>"  WHITESPACE ;
145              leaf:   DDIGIT          <- "<ddigit>"   WHITESPACE ;
146              leaf:   DIGIT           <- "<digit>"    WHITESPACE ;
147              leaf:   GRAPH           <- "<graph>"    WHITESPACE ;
148              leaf:   LOWER           <- "<lower>"    WHITESPACE ;
149              leaf:   PRINTABLE       <- "<print>"    WHITESPACE ;
150              leaf:   PUNCT           <- "<punct>"    WHITESPACE ;
151              leaf:   SPACE           <- "<space>"    WHITESPACE ;
152              leaf:   UPPER           <- "<upper>"    WHITESPACE ;
153              leaf:   WORDCHAR        <- "<wordchar>" WHITESPACE ;
154              leaf:   XDIGIT          <- "<xdigit>"   WHITESPACE ;
155
156              void:   WHITESPACE      <- (" " / "\t" / EOL / COMMENT)* ;
157              void:   COMMENT         <- '#' (!EOL .)* EOL ;
158              void:   EOL             <- "\n\r" / "\n" / "\r" ;
159              void:   EOF             <- !. ;
160
161                      # --------------------------------------------------------------------
162              END;
163
164
165   EXAMPLE
166       Our example specifies the grammar for a basic 4-operation calculator.
167
168              PEG calculator (Expression)
169                  Digit      <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'       ;
170                  Sign       <- '-' / '+'                                     ;
171                  Number     <- Sign? Digit+                                  ;
172                  Expression <- Term (AddOp Term)*                            ;
173                  MulOp      <- '*' / '/'                                     ;
174                  Term       <- Factor (MulOp Factor)*                        ;
175                  AddOp      <- '+'/'-'                                       ;
176                  Factor     <- '(' Expression ')' / Number                   ;
177              END;
178
179
180       Using higher-level features of the notation, i.e. the character classes
181       (predefined and custom), this example can be rewritten as
182
183              PEG calculator (Expression)
184                  Sign       <- [-+] ;
185                  Number     <- Sign? <ddigit>+;
186                  Expression <- '(' Expression ')' / (Factor (MulOp Factor)*);
187                  MulOp      <- [*/];
188                  Factor     <- Term (AddOp Term)*;
189                  AddOp      <- [-+];
190                  Term       <- Number;
191              END;
192
193

PEG SERIALIZATION FORMAT

195       Here  we specify the format used by the Parser Tools to serialize Pars‐
196       ing Expression Grammars as immutable values for transport,  comparison,
197       etc.
198
199       We  distinguish  between regular and canonical serializations.  While a
200       PEG may have more than one regular serialization only  exactly  one  of
201       them will be canonical.
202
203       regular serialization
204
205              [1]    The serialization of any PEG is a nested Tcl dictionary.
206
207              [2]    This dictionary holds a single key, pt::grammar::peg, and
208                     its value. This value holds the contents of the grammar.
209
210              [3]    The contents of the grammar are a Tcl dictionary  holding
211                     the  set  of nonterminal symbols and the starting expres‐
212                     sion. The relevant keys and their values are
213
214                     rules  The value is a Tcl dictionary whose keys  are  the
215                            names  of  the  nonterminal  symbols  known to the
216                            grammar.
217
218                            [1]    Each  nonterminal  symbol  may  occur  only
219                                   once.
220
221                            [2]    The empty string is not a legal nonterminal
222                                   symbol.
223
224                            [3]    The value for each symbol is a Tcl  dictio‐
225                                   nary  itself.  The  relevant keys and their
226                                   values in this dictionary are
227
228                                   is     The value is  the  serialization  of
229                                          the  parsing  expression  describing
230                                          the symbols sentennial structure, as
231                                          specified  in the section PE serial‐
232                                          ization format.
233
234                                   mode   The value can be one of three values
235                                          specifying  how a parser should han‐
236                                          dle the semantic value  produced  by
237                                          the symbol.
238
239                                          value  The  semantic  value  of  the
240                                                 nonterminal  symbol   is   an
241                                                 abstract syntax tree consist‐
242                                                 ing of a single node node for
243                                                 the nonterminal itself, which
244                                                 has the ASTs of the  symbol's
245                                                 right  hand side as its chil‐
246                                                 dren.
247
248                                          leaf   The  semantic  value  of  the
249                                                 nonterminal   symbol   is  an
250                                                 abstract syntax tree consist‐
251                                                 ing of a single node node for
252                                                 the nonterminal, without  any
253                                                 children.  Any ASTs generated
254                                                 by the  symbol's  right  hand
255                                                 side are discarded.
256
257                                          void   The nonterminal has no seman‐
258                                                 tic value. Any ASTs generated
259                                                 by  the  symbol's  right hand
260                                                 side are discarded (as well).
261
262                     start  The value is the serialization of the start  pars‐
263                            ing expression of the grammar, as specified in the
264                            section PE serialization format.
265
266              [4]    The terminal symbols of the grammar are specified implic‐
267                     itly as the set of all terminal symbols used in the start
268                     expression and on the RHS of the grammar rules.
269
270       canonical serialization
271              The canonical serialization of a grammar has the format as spec‐
272              ified  in the previous item, and then additionally satisfies the
273              constraints below, which make it unique among all  the  possible
274              serializations of this grammar.
275
276              [1]    The  keys  found  in  all the nested Tcl dictionaries are
277                     sorted in ascending dictionary  order,  as  generated  by
278                     Tcl's builtin command lsort -increasing -dict.
279
280              [2]    The  string  representation of the value is the canonical
281                     representation of a Tcl dictionary. I.e. it does not con‐
282                     tain superfluous whitespace.
283
284   EXAMPLE
285       Assuming the following PEG for simple mathematical expressions
286
287              PEG calculator (Expression)
288                  Digit      <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'       ;
289                  Sign       <- '-' / '+'                                     ;
290                  Number     <- Sign? Digit+                                  ;
291                  Expression <- Term (AddOp Term)*                            ;
292                  MulOp      <- '*' / '/'                                     ;
293                  Term       <- Factor (MulOp Factor)*                        ;
294                  AddOp      <- '+'/'-'                                       ;
295                  Factor     <- '(' Expression ')' / Number                   ;
296              END;
297
298
299       then its canonical serialization (except for whitespace) is
300
301              pt::grammar::peg {
302                  rules {
303                      AddOp      {is {/ {t -} {t +}}                                                                mode value}
304                      Digit      {is {/ {t 0} {t 1} {t 2} {t 3} {t 4} {t 5} {t 6} {t 7} {t 8} {t 9}}                mode value}
305                      Expression {is {x {n Term} {* {x {n AddOp} {n Term}}}}                                        mode value}
306                      Factor     {is {/ {x {t (} {n Expression} {t )}} {n Number}}                                  mode value}
307                      MulOp      {is {/ {t *} {t /}}                                                                mode value}
308                      Number     {is {x {? {n Sign}} {+ {n Digit}}}                                                 mode value}
309                      Sign       {is {/ {t -} {t +}}                                                                mode value}
310                      Term       {is {x {n Factor} {* {x {n MulOp} {n Factor}}}}                                    mode value}
311                  }
312                  start {n Expression}
313              }
314
315

PE SERIALIZATION FORMAT

317       Here  we specify the format used by the Parser Tools to serialize Pars‐
318       ing Expressions as immutable values for transport, comparison, etc.
319
320       We distinguish between regular and canonical serializations.   While  a
321       parsing  expression  may  have more than one regular serialization only
322       exactly one of them will be canonical.
323
324       Regular serialization
325
326              Atomic Parsing Expressions
327
328                     [1]    The string epsilon is an  atomic  parsing  expres‐
329                            sion. It matches the empty string.
330
331                     [2]    The string dot is an atomic parsing expression. It
332                            matches any character.
333
334                     [3]    The string alnum is an atomic parsing  expression.
335                            It  matches  any Unicode alphabet or digit charac‐
336                            ter. This is a custom extension of  PEs  based  on
337                            Tcl's builtin command string is.
338
339                     [4]    The  string alpha is an atomic parsing expression.
340                            It matches any Unicode alphabet character. This is
341                            a  custom  extension of PEs based on Tcl's builtin
342                            command string is.
343
344                     [5]    The string ascii is an atomic parsing  expression.
345                            It matches any Unicode character below U0080. This
346                            is a  custom  extension  of  PEs  based  on  Tcl's
347                            builtin command string is.
348
349                     [6]    The  string  control  is an atomic parsing expres‐
350                            sion. It matches any  Unicode  control  character.
351                            This  is  a custom extension of PEs based on Tcl's
352                            builtin command string is.
353
354                     [7]    The string digit is an atomic parsing  expression.
355                            It  matches any Unicode digit character. Note that
356                            this includes characters  outside  of  the  [0..9]
357                            range.  This is a custom extension of PEs based on
358                            Tcl's builtin command string is.
359
360                     [8]    The string graph is an atomic parsing  expression.
361                            It  matches any Unicode printing character, except
362                            for space. This is a custom extension of PEs based
363                            on Tcl's builtin command string is.
364
365                     [9]    The  string lower is an atomic parsing expression.
366                            It matches any Unicode lower-case alphabet charac‐
367                            ter.  This  is  a custom extension of PEs based on
368                            Tcl's builtin command string is.
369
370                     [10]   The string print is an atomic parsing  expression.
371                            It matches any Unicode printing character, includ‐
372                            ing space. This is a custom extension of PEs based
373                            on Tcl's builtin command string is.
374
375                     [11]   The  string punct is an atomic parsing expression.
376                            It matches any Unicode punctuation character. This
377                            is  a  custom  extension  of  PEs  based  on Tcl's
378                            builtin command string is.
379
380                     [12]   The string space is an atomic parsing  expression.
381                            It  matches any Unicode space character. This is a
382                            custom extension of PEs  based  on  Tcl's  builtin
383                            command string is.
384
385                     [13]   The  string upper is an atomic parsing expression.
386                            It matches any Unicode upper-case alphabet charac‐
387                            ter.  This  is  a custom extension of PEs based on
388                            Tcl's builtin command string is.
389
390                     [14]   The string wordchar is an atomic  parsing  expres‐
391                            sion.  It matches any Unicode word character. This
392                            is any alphanumeric character (see alnum), and any
393                            connector  punctuation  characters  (e.g.   under‐
394                            score). This is a custom extension of PEs based on
395                            Tcl's builtin command string is.
396
397                     [15]   The string xdigit is an atomic parsing expression.
398                            It matches any hexadecimal digit  character.  This
399                            is  a  custom  extension  of  PEs  based  on Tcl's
400                            builtin command string is.
401
402                     [16]   The string ddigit is an atomic parsing expression.
403                            It  matches any decimal digit character. This is a
404                            custom extension of PEs  based  on  Tcl's  builtin
405                            command regexp.
406
407                     [17]   The  expression  [list  t  x] is an atomic parsing
408                            expression. It matches the terminal string x.
409
410                     [18]   The expression [list n A]  is  an  atomic  parsing
411                            expression. It matches the nonterminal A.
412
413              Combined Parsing Expressions
414
415                     [1]    For  parsing expressions e1, e2, ... the result of
416                            [list / e1 e2 ... ] is  a  parsing  expression  as
417                            well.  This is the ordered choice, aka prioritized
418                            choice.
419
420                     [2]    For parsing expressions e1, e2, ... the result  of
421                            [list  x  e1  e2  ... ] is a parsing expression as
422                            well.  This is the sequence.
423
424                     [3]    For a parsing expression e the result of  [list  *
425                            e]  is  a parsing expression as well.  This is the
426                            kleene closure, describing zero  or  more  repeti‐
427                            tions.
428
429                     [4]    For  a  parsing expression e the result of [list +
430                            e] is a parsing expression as well.  This  is  the
431                            positive  kleene  closure,  describing one or more
432                            repetitions.
433
434                     [5]    For a parsing expression e the result of  [list  &
435                            e]  is  a parsing expression as well.  This is the
436                            and lookahead predicate.
437
438                     [6]    For a parsing expression e the result of  [list  !
439                            e]  is  a parsing expression as well.  This is the
440                            not lookahead predicate.
441
442                     [7]    For a parsing expression e the result of  [list  ?
443                            e]  is  a parsing expression as well.  This is the
444                            optional input.
445
446       Canonical serialization
447              The canonical serialization of a parsing expression has the for‐
448              mat  as  specified  in  the previous item, and then additionally
449              satisfies the constraints below, which make it unique among  all
450              the possible serializations of this parsing expression.
451
452              [1]    The  string  representation of the value is the canonical
453                     representation of a pure Tcl list. I.e. it does not  con‐
454                     tain superfluous whitespace.
455
456              [2]    Terminals  are not encoded as ranges (where start and end
457                     of the range are identical).
458
459   EXAMPLE
460       Assuming the parsing expression shown on the  right-hand  side  of  the
461       rule
462
463                  Expression <- Term (AddOp Term)*
464
465
466       then its canonical serialization (except for whitespace) is
467
468                  {x {n Term} {* {x {n AddOp} {n Term}}}}
469
470

BUGS, IDEAS, FEEDBACK

472       This  document,  and the package it describes, will undoubtedly contain
473       bugs and other problems.  Please report such in the category pt of  the
474       Tcllib  Trackers  [http://core.tcl.tk/tcllib/reportlist].   Please also
475       report any ideas for enhancements  you  may  have  for  either  package
476       and/or documentation.
477
478       When proposing code changes, please provide unified diffs, i.e the out‐
479       put of diff -u.
480
481       Note further that  attachments  are  strongly  preferred  over  inlined
482       patches.  Attachments  can  be  made  by  going to the Edit form of the
483       ticket immediately after its creation, and  then  using  the  left-most
484       button in the secondary navigation bar.
485

KEYWORDS

487       EBNF,  LL(k),  PEG,  TDPL, context-free languages, expression, grammar,
488       import, matching, parser, parsing expression, parsing expression  gram‐
489       mar,  plugin,  push  down  automaton, recursive descent, serialization,
490       state, top-down parsing languages, transducer
491

COPYRIGHT

496       Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
497
498
499
500
501tcllib                                 1               pt::peg::import::peg(n)