1pt::json_language(n)             Parser Tools             pt::json_language(n)
2
3
4
5______________________________________________________________________________
6

NAME

8       pt::json_language - The JSON Grammar Exchange Format
9

SYNOPSIS

11       package require Tcl  8.5
12
13______________________________________________________________________________
14

DESCRIPTION

16       Are  you  lost ?  Do you have trouble understanding this document ?  In
17       that case please read the overview  provided  by  the  Introduction  to
18       Parser  Tools.  This document is the entrypoint to the whole system the
19       current package is a part of.
20
21       The json format for parsing expression grammars was written as  a  data
22       exchange  format not bound to Tcl. It was defined to allow the exchange
23       of grammars with PackRat/PEG based parser  generators  for  other  lan‐
24       guages.
25
26       It is formally specified by the rules below:
27
28       [1]    The JSON of any PEG is a JSON object.
29
30       [2]    This object holds a single key, pt::grammar::peg, and its value.
31              This value holds the contents of the grammar.
32
33       [3]    The contents of the grammar are a JSON object holding the set of
34              nonterminal  symbols  and  the starting expression. The relevant
35              keys and their values are
36
37              rules  The value is a JSON object whose keys are  the  names  of
38                     the nonterminal symbols known to the grammar.
39
40                     [1]    Each nonterminal symbol may occur only once.
41
42                     [2]    The  empty  string is not a legal nonterminal sym‐
43                            bol.
44
45                     [3]    The value for each symbol is a JSON object itself.
46                            The relevant keys and their values in this dictio‐
47                            nary are
48
49                            is     The value is a JSON string holding the  Tcl
50                                   serialization  of  the  parsing  expression
51                                   describing the  symbols  sentennial  struc‐
52                                   ture,  as specified in the section PE seri‐
53                                   alization format.
54
55                            mode   The value is a JSON holding holding one  of
56                                   three values specifying how a parser should
57                                   handle the semantic value produced  by  the
58                                   symbol.
59
60                                   value  The  semantic value of the nontermi‐
61                                          nal symbol  is  an  abstract  syntax
62                                          tree  consisting  of  a  single node
63                                          node  for  the  nonterminal  itself,
64                                          which  has  the ASTs of the symbol's
65                                          right hand side as its children.
66
67                                   leaf   The semantic value of the  nontermi‐
68                                          nal  symbol  is  an  abstract syntax
69                                          tree consisting  of  a  single  node
70                                          node  for  the  nonterminal, without
71                                          any children. Any ASTs generated  by
72                                          the  symbol's  right  hand  side are
73                                          discarded.
74
75                                   void   The  nonterminal  has  no   semantic
76                                          value.  Any  ASTs  generated  by the
77                                          symbol's right hand  side  are  dis‐
78                                          carded (as well).
79
80              start  The  value is a JSON string holding the Tcl serialization
81                     of the start parsing expression of the grammar, as speci‐
82                     fied in the section PE serialization format.
83
84       [4]    The  terminal symbols of the grammar are specified implicitly as
85              the set of all terminal symbols used in the start expression and
86              on the RHS of the grammar rules.
87
88       As an aside to the advanced reader, this is pretty much the same as the
89       Tcl serialization of PE grammars, as specified in section  PEG  serial‐
90       ization format, except that the Tcl dictionaries and lists of that for‐
91       mat are mapped to JSON objects and arrays. Only the parsing expressions
92       themselves  are  not  translated further, but kept as JSON strings con‐
93       taining a nested Tcl list, and there is no concept  of  canonicity  for
94       the JSON either.
95
96   EXAMPLE
97       Assuming the following PEG for simple mathematical expressions
98
99              PEG calculator (Expression)
100                  Digit      <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'       ;
101                  Sign       <- '-' / '+'                                     ;
102                  Number     <- Sign? Digit+                                  ;
103                  Expression <- Term (AddOp Term)*                            ;
104                  MulOp      <- '*' / '/'                                     ;
105                  Term       <- Factor (MulOp Factor)*                        ;
106                  AddOp      <- '+'/'-'                                       ;
107                  Factor     <- '(' Expression ')' / Number                   ;
108              END;
109
110
111       a JSON serialization for it is
112
113              {
114                  "pt::grammar::peg" : {
115                      "rules" : {
116                          "AddOp"     : {
117                              "is"   : "\/ {t -} {t +}",
118                              "mode" : "value"
119                          },
120                          "Digit"     : {
121                              "is"   : "\/ {t 0} {t 1} {t 2} {t 3} {t 4} {t 5} {t 6} {t 7} {t 8} {t 9}",
122                              "mode" : "value"
123                          },
124                          "Expression" : {
125                              "is"   : "\/ {x {t (} {n Expression} {t )}} {x {n Factor} {* {x {n MulOp} {n Factor}}}}",
126                              "mode" : "value"
127                          },
128                          "Factor"    : {
129                              "is"   : "x {n Term} {* {x {n AddOp} {n Term}}}",
130                              "mode" : "value"
131                          },
132                          "MulOp"     : {
133                              "is"   : "\/ {t *} {t \/}",
134                              "mode" : "value"
135                          },
136                          "Number"    : {
137                              "is"   : "x {? {n Sign}} {+ {n Digit}}",
138                              "mode" : "value"
139                          },
140                          "Sign"      : {
141                              "is"   : "\/ {t -} {t +}",
142                              "mode" : "value"
143                          },
144                          "Term"      : {
145                              "is"   : "n Number",
146                              "mode" : "value"
147                          }
148                      },
149                      "start" : "n Expression"
150                  }
151              }
152
153
154       and a Tcl serialization of the same is
155
156              pt::grammar::peg {
157                  rules {
158                      AddOp      {is {/ {t -} {t +}}                                                                mode value}
159                      Digit      {is {/ {t 0} {t 1} {t 2} {t 3} {t 4} {t 5} {t 6} {t 7} {t 8} {t 9}}                mode value}
160                      Expression {is {x {n Term} {* {x {n AddOp} {n Term}}}}                                        mode value}
161                      Factor     {is {/ {x {t (} {n Expression} {t )}} {n Number}}                                  mode value}
162                      MulOp      {is {/ {t *} {t /}}                                                                mode value}
163                      Number     {is {x {? {n Sign}} {+ {n Digit}}}                                                 mode value}
164                      Sign       {is {/ {t -} {t +}}                                                                mode value}
165                      Term       {is {x {n Factor} {* {x {n MulOp} {n Factor}}}}                                    mode value}
166                  }
167                  start {n Expression}
168              }
169
170
171       The similarity of the latter to the JSON should be quite obvious.
172

PEG SERIALIZATION FORMAT

174       Here  we specify the format used by the Parser Tools to serialize Pars‐
175       ing Expression Grammars as immutable values for transport,  comparison,
176       etc.
177
178       We  distinguish  between regular and canonical serializations.  While a
179       PEG may have more than one regular serialization only  exactly  one  of
180       them will be canonical.
181
182       regular serialization
183
184              [1]    The serialization of any PEG is a nested Tcl dictionary.
185
186              [2]    This dictionary holds a single key, pt::grammar::peg, and
187                     its value. This value holds the contents of the grammar.
188
189              [3]    The contents of the grammar are a Tcl dictionary  holding
190                     the  set  of nonterminal symbols and the starting expres‐
191                     sion. The relevant keys and their values are
192
193                     rules  The value is a Tcl dictionary whose keys  are  the
194                            names  of  the  nonterminal  symbols  known to the
195                            grammar.
196
197                            [1]    Each  nonterminal  symbol  may  occur  only
198                                   once.
199
200                            [2]    The empty string is not a legal nonterminal
201                                   symbol.
202
203                            [3]    The value for each symbol is a Tcl  dictio‐
204                                   nary  itself.  The  relevant keys and their
205                                   values in this dictionary are
206
207                                   is     The value is  the  serialization  of
208                                          the  parsing  expression  describing
209                                          the symbols sentennial structure, as
210                                          specified  in the section PE serial‐
211                                          ization format.
212
213                                   mode   The value can be one of three values
214                                          specifying  how a parser should han‐
215                                          dle the semantic value  produced  by
216                                          the symbol.
217
218                                          value  The  semantic  value  of  the
219                                                 nonterminal  symbol   is   an
220                                                 abstract syntax tree consist‐
221                                                 ing of a single node node for
222                                                 the nonterminal itself, which
223                                                 has the ASTs of the  symbol's
224                                                 right  hand side as its chil‐
225                                                 dren.
226
227                                          leaf   The  semantic  value  of  the
228                                                 nonterminal   symbol   is  an
229                                                 abstract syntax tree consist‐
230                                                 ing of a single node node for
231                                                 the nonterminal, without  any
232                                                 children.  Any ASTs generated
233                                                 by the  symbol's  right  hand
234                                                 side are discarded.
235
236                                          void   The nonterminal has no seman‐
237                                                 tic value. Any ASTs generated
238                                                 by  the  symbol's  right hand
239                                                 side are discarded (as well).
240
241                     start  The value is the serialization of the start  pars‐
242                            ing expression of the grammar, as specified in the
243                            section PE serialization format.
244
245              [4]    The terminal symbols of the grammar are specified implic‐
246                     itly as the set of all terminal symbols used in the start
247                     expression and on the RHS of the grammar rules.
248
249       canonical serialization
250              The canonical serialization of a grammar has the format as spec‐
251              ified  in the previous item, and then additionally satisfies the
252              constraints below, which make it unique among all  the  possible
253              serializations of this grammar.
254
255              [1]    The  keys  found  in  all the nested Tcl dictionaries are
256                     sorted in ascending dictionary  order,  as  generated  by
257                     Tcl's builtin command lsort -increasing -dict.
258
259              [2]    The  string  representation of the value is the canonical
260                     representation of a Tcl dictionary. I.e. it does not con‐
261                     tain superfluous whitespace.
262
263   EXAMPLE
264       Assuming the following PEG for simple mathematical expressions
265
266              PEG calculator (Expression)
267                  Digit      <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'       ;
268                  Sign       <- '-' / '+'                                     ;
269                  Number     <- Sign? Digit+                                  ;
270                  Expression <- Term (AddOp Term)*                            ;
271                  MulOp      <- '*' / '/'                                     ;
272                  Term       <- Factor (MulOp Factor)*                        ;
273                  AddOp      <- '+'/'-'                                       ;
274                  Factor     <- '(' Expression ')' / Number                   ;
275              END;
276
277
278       then its canonical serialization (except for whitespace) is
279
280              pt::grammar::peg {
281                  rules {
282                      AddOp      {is {/ {t -} {t +}}                                                                mode value}
283                      Digit      {is {/ {t 0} {t 1} {t 2} {t 3} {t 4} {t 5} {t 6} {t 7} {t 8} {t 9}}                mode value}
284                      Expression {is {x {n Term} {* {x {n AddOp} {n Term}}}}                                        mode value}
285                      Factor     {is {/ {x {t (} {n Expression} {t )}} {n Number}}                                  mode value}
286                      MulOp      {is {/ {t *} {t /}}                                                                mode value}
287                      Number     {is {x {? {n Sign}} {+ {n Digit}}}                                                 mode value}
288                      Sign       {is {/ {t -} {t +}}                                                                mode value}
289                      Term       {is {x {n Factor} {* {x {n MulOp} {n Factor}}}}                                    mode value}
290                  }
291                  start {n Expression}
292              }
293
294

PE SERIALIZATION FORMAT

296       Here  we specify the format used by the Parser Tools to serialize Pars‐
297       ing Expressions as immutable values for transport, comparison, etc.
298
299       We distinguish between regular and canonical serializations.   While  a
300       parsing  expression  may  have more than one regular serialization only
301       exactly one of them will be canonical.
302
303       Regular serialization
304
305              Atomic Parsing Expressions
306
307                     [1]    The string epsilon is an  atomic  parsing  expres‐
308                            sion. It matches the empty string.
309
310                     [2]    The string dot is an atomic parsing expression. It
311                            matches any character.
312
313                     [3]    The string alnum is an atomic parsing  expression.
314                            It  matches  any Unicode alphabet or digit charac‐
315                            ter. This is a custom extension of  PEs  based  on
316                            Tcl's builtin command string is.
317
318                     [4]    The  string alpha is an atomic parsing expression.
319                            It matches any Unicode alphabet character. This is
320                            a  custom  extension of PEs based on Tcl's builtin
321                            command string is.
322
323                     [5]    The string ascii is an atomic parsing  expression.
324                            It matches any Unicode character below U0080. This
325                            is a  custom  extension  of  PEs  based  on  Tcl's
326                            builtin command string is.
327
328                     [6]    The  string  control  is an atomic parsing expres‐
329                            sion. It matches any  Unicode  control  character.
330                            This  is  a custom extension of PEs based on Tcl's
331                            builtin command string is.
332
333                     [7]    The string digit is an atomic parsing  expression.
334                            It  matches any Unicode digit character. Note that
335                            this includes characters  outside  of  the  [0..9]
336                            range.  This is a custom extension of PEs based on
337                            Tcl's builtin command string is.
338
339                     [8]    The string graph is an atomic parsing  expression.
340                            It  matches any Unicode printing character, except
341                            for space. This is a custom extension of PEs based
342                            on Tcl's builtin command string is.
343
344                     [9]    The  string lower is an atomic parsing expression.
345                            It matches any Unicode lower-case alphabet charac‐
346                            ter.  This  is  a custom extension of PEs based on
347                            Tcl's builtin command string is.
348
349                     [10]   The string print is an atomic parsing  expression.
350                            It matches any Unicode printing character, includ‐
351                            ing space. This is a custom extension of PEs based
352                            on Tcl's builtin command string is.
353
354                     [11]   The  string punct is an atomic parsing expression.
355                            It matches any Unicode punctuation character. This
356                            is  a  custom  extension  of  PEs  based  on Tcl's
357                            builtin command string is.
358
359                     [12]   The string space is an atomic parsing  expression.
360                            It  matches any Unicode space character. This is a
361                            custom extension of PEs  based  on  Tcl's  builtin
362                            command string is.
363
364                     [13]   The  string upper is an atomic parsing expression.
365                            It matches any Unicode upper-case alphabet charac‐
366                            ter.  This  is  a custom extension of PEs based on
367                            Tcl's builtin command string is.
368
369                     [14]   The string wordchar is an atomic  parsing  expres‐
370                            sion.  It matches any Unicode word character. This
371                            is any alphanumeric character (see alnum), and any
372                            connector  punctuation  characters  (e.g.   under‐
373                            score). This is a custom extension of PEs based on
374                            Tcl's builtin command string is.
375
376                     [15]   The string xdigit is an atomic parsing expression.
377                            It matches any hexadecimal digit  character.  This
378                            is  a  custom  extension  of  PEs  based  on Tcl's
379                            builtin command string is.
380
381                     [16]   The string ddigit is an atomic parsing expression.
382                            It  matches any decimal digit character. This is a
383                            custom extension of PEs  based  on  Tcl's  builtin
384                            command regexp.
385
386                     [17]   The  expression  [list  t  x] is an atomic parsing
387                            expression. It matches the terminal string x.
388
389                     [18]   The expression [list n A]  is  an  atomic  parsing
390                            expression. It matches the nonterminal A.
391
392              Combined Parsing Expressions
393
394                     [1]    For  parsing expressions e1, e2, ... the result of
395                            [list / e1 e2 ... ] is  a  parsing  expression  as
396                            well.  This is the ordered choice, aka prioritized
397                            choice.
398
399                     [2]    For parsing expressions e1, e2, ... the result  of
400                            [list  x  e1  e2  ... ] is a parsing expression as
401                            well.  This is the sequence.
402
403                     [3]    For a parsing expression e the result of  [list  *
404                            e]  is  a parsing expression as well.  This is the
405                            kleene closure, describing zero  or  more  repeti‐
406                            tions.
407
408                     [4]    For  a  parsing expression e the result of [list +
409                            e] is a parsing expression as well.  This  is  the
410                            positive  kleene  closure,  describing one or more
411                            repetitions.
412
413                     [5]    For a parsing expression e the result of  [list  &
414                            e]  is  a parsing expression as well.  This is the
415                            and lookahead predicate.
416
417                     [6]    For a parsing expression e the result of  [list  !
418                            e]  is  a parsing expression as well.  This is the
419                            not lookahead predicate.
420
421                     [7]    For a parsing expression e the result of  [list  ?
422                            e]  is  a parsing expression as well.  This is the
423                            optional input.
424
425       Canonical serialization
426              The canonical serialization of a parsing expression has the for‐
427              mat  as  specified  in  the previous item, and then additionally
428              satisfies the constraints below, which make it unique among  all
429              the possible serializations of this parsing expression.
430
431              [1]    The  string  representation of the value is the canonical
432                     representation of a pure Tcl list. I.e. it does not  con‐
433                     tain superfluous whitespace.
434
435              [2]    Terminals  are not encoded as ranges (where start and end
436                     of the range are identical).
437
438   EXAMPLE
439       Assuming the parsing expression shown on the  right-hand  side  of  the
440       rule
441
442                  Expression <- Term (AddOp Term)*
443
444
445       then its canonical serialization (except for whitespace) is
446
447                  {x {n Term} {* {x {n AddOp} {n Term}}}}
448
449

BUGS, IDEAS, FEEDBACK

451       This  document,  and the package it describes, will undoubtedly contain
452       bugs and other problems.  Please report such in the category pt of  the
453       Tcllib  Trackers  [http://core.tcl.tk/tcllib/reportlist].   Please also
454       report any ideas for enhancements  you  may  have  for  either  package
455       and/or documentation.
456
457       When proposing code changes, please provide unified diffs, i.e the out‐
458       put of diff -u.
459
460       Note further that  attachments  are  strongly  preferred  over  inlined
461       patches.  Attachments  can  be  made  by  going to the Edit form of the
462       ticket immediately after its creation, and  then  using  the  left-most
463       button in the secondary navigation bar.
464

KEYWORDS

466       EBNF,  LL(k),  PEG,  TDPL, context-free languages, expression, grammar,
467       matching, parser, parsing expression, parsing expression grammar,  push
468       down  automaton,  recursive descent, state, top-down parsing languages,
469       transducer
470

CATEGORY

472       Parsing and Grammars
473
475       Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
476
477
478
479
480tcllib                                 1                  pt::json_language(n)
Impressum