1pt::json_language(n) Parser Tools pt::json_language(n)
2
3
4
5______________________________________________________________________________
6
8 pt::json_language - The JSON Grammar Exchange Format
9
11 package require Tcl 8.5
12
13______________________________________________________________________________
14
16 Are you lost ? Do you have trouble understanding this document ? In
17 that case please read the overview provided by the Introduction to
18 Parser Tools. This document is the entrypoint to the whole system the
19 current package is a part of.
20
21 The json format for parsing expression grammars was written as a data
22 exchange format not bound to Tcl. It was defined to allow the exchange
23 of grammars with PackRat/PEG based parser generators for other lan‐
24 guages.
25
26 It is formally specified by the rules below:
27
28 [1] The JSON of any PEG is a JSON object.
29
30 [2] This object holds a single key, pt::grammar::peg, and its value.
31 This value holds the contents of the grammar.
32
33 [3] The contents of the grammar are a JSON object holding the set of
34 nonterminal symbols and the starting expression. The relevant
35 keys and their values are
36
37 rules The value is a JSON object whose keys are the names of
38 the nonterminal symbols known to the grammar.
39
40 [1] Each nonterminal symbol may occur only once.
41
42 [2] The empty string is not a legal nonterminal sym‐
43 bol.
44
45 [3] The value for each symbol is a JSON object itself.
46 The relevant keys and their values in this dictio‐
47 nary are
48
49 is The value is a JSON string holding the Tcl
50 serialization of the parsing expression de‐
51 scribing the symbols sentennial structure,
52 as specified in the section PE serializa‐
53 tion format.
54
55 mode The value is a JSON holding holding one of
56 three values specifying how a parser should
57 handle the semantic value produced by the
58 symbol.
59
60 value The semantic value of the nontermi‐
61 nal symbol is an abstract syntax
62 tree consisting of a single node
63 node for the nonterminal itself,
64 which has the ASTs of the symbol's
65 right hand side as its children.
66
67 leaf The semantic value of the nontermi‐
68 nal symbol is an abstract syntax
69 tree consisting of a single node
70 node for the nonterminal, without
71 any children. Any ASTs generated by
72 the symbol's right hand side are
73 discarded.
74
75 void The nonterminal has no semantic
76 value. Any ASTs generated by the
77 symbol's right hand side are dis‐
78 carded (as well).
79
80 start The value is a JSON string holding the Tcl serialization
81 of the start parsing expression of the grammar, as speci‐
82 fied in the section PE serialization format.
83
84 [4] The terminal symbols of the grammar are specified implicitly as
85 the set of all terminal symbols used in the start expression and
86 on the RHS of the grammar rules.
87
88 As an aside to the advanced reader, this is pretty much the same as the
89 Tcl serialization of PE grammars, as specified in section PEG serial‐
90 ization format, except that the Tcl dictionaries and lists of that for‐
91 mat are mapped to JSON objects and arrays. Only the parsing expressions
92 themselves are not translated further, but kept as JSON strings con‐
93 taining a nested Tcl list, and there is no concept of canonicity for
94 the JSON either.
95
96 EXAMPLE
97 Assuming the following PEG for simple mathematical expressions
98
99 PEG calculator (Expression)
100 Digit <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9' ;
101 Sign <- '-' / '+' ;
102 Number <- Sign? Digit+ ;
103 Expression <- Term (AddOp Term)* ;
104 MulOp <- '*' / '/' ;
105 Term <- Factor (MulOp Factor)* ;
106 AddOp <- '+'/'-' ;
107 Factor <- '(' Expression ')' / Number ;
108 END;
109
110
111 a JSON serialization for it is
112
113 {
114 "pt::grammar::peg" : {
115 "rules" : {
116 "AddOp" : {
117 "is" : "\/ {t -} {t +}",
118 "mode" : "value"
119 },
120 "Digit" : {
121 "is" : "\/ {t 0} {t 1} {t 2} {t 3} {t 4} {t 5} {t 6} {t 7} {t 8} {t 9}",
122 "mode" : "value"
123 },
124 "Expression" : {
125 "is" : "\/ {x {t (} {n Expression} {t )}} {x {n Factor} {* {x {n MulOp} {n Factor}}}}",
126 "mode" : "value"
127 },
128 "Factor" : {
129 "is" : "x {n Term} {* {x {n AddOp} {n Term}}}",
130 "mode" : "value"
131 },
132 "MulOp" : {
133 "is" : "\/ {t *} {t \/}",
134 "mode" : "value"
135 },
136 "Number" : {
137 "is" : "x {? {n Sign}} {+ {n Digit}}",
138 "mode" : "value"
139 },
140 "Sign" : {
141 "is" : "\/ {t -} {t +}",
142 "mode" : "value"
143 },
144 "Term" : {
145 "is" : "n Number",
146 "mode" : "value"
147 }
148 },
149 "start" : "n Expression"
150 }
151 }
152
153
154 and a Tcl serialization of the same is
155
156 pt::grammar::peg {
157 rules {
158 AddOp {is {/ {t -} {t +}} mode value}
159 Digit {is {/ {t 0} {t 1} {t 2} {t 3} {t 4} {t 5} {t 6} {t 7} {t 8} {t 9}} mode value}
160 Expression {is {x {n Term} {* {x {n AddOp} {n Term}}}} mode value}
161 Factor {is {/ {x {t (} {n Expression} {t )}} {n Number}} mode value}
162 MulOp {is {/ {t *} {t /}} mode value}
163 Number {is {x {? {n Sign}} {+ {n Digit}}} mode value}
164 Sign {is {/ {t -} {t +}} mode value}
165 Term {is {x {n Factor} {* {x {n MulOp} {n Factor}}}} mode value}
166 }
167 start {n Expression}
168 }
169
170
171 The similarity of the latter to the JSON should be quite obvious.
172
174 Here we specify the format used by the Parser Tools to serialize Pars‐
175 ing Expression Grammars as immutable values for transport, comparison,
176 etc.
177
178 We distinguish between regular and canonical serializations. While a
179 PEG may have more than one regular serialization only exactly one of
180 them will be canonical.
181
182 regular serialization
183
184 [1] The serialization of any PEG is a nested Tcl dictionary.
185
186 [2] This dictionary holds a single key, pt::grammar::peg, and
187 its value. This value holds the contents of the grammar.
188
189 [3] The contents of the grammar are a Tcl dictionary holding
190 the set of nonterminal symbols and the starting expres‐
191 sion. The relevant keys and their values are
192
193 rules The value is a Tcl dictionary whose keys are the
194 names of the nonterminal symbols known to the
195 grammar.
196
197 [1] Each nonterminal symbol may occur only
198 once.
199
200 [2] The empty string is not a legal nonterminal
201 symbol.
202
203 [3] The value for each symbol is a Tcl dictio‐
204 nary itself. The relevant keys and their
205 values in this dictionary are
206
207 is The value is the serialization of
208 the parsing expression describing
209 the symbols sentennial structure, as
210 specified in the section PE serial‐
211 ization format.
212
213 mode The value can be one of three values
214 specifying how a parser should han‐
215 dle the semantic value produced by
216 the symbol.
217
218 value The semantic value of the
219 nonterminal symbol is an ab‐
220 stract syntax tree consisting
221 of a single node node for the
222 nonterminal itself, which has
223 the ASTs of the symbol's
224 right hand side as its chil‐
225 dren.
226
227 leaf The semantic value of the
228 nonterminal symbol is an ab‐
229 stract syntax tree consisting
230 of a single node node for the
231 nonterminal, without any
232 children. Any ASTs generated
233 by the symbol's right hand
234 side are discarded.
235
236 void The nonterminal has no seman‐
237 tic value. Any ASTs generated
238 by the symbol's right hand
239 side are discarded (as well).
240
241 start The value is the serialization of the start pars‐
242 ing expression of the grammar, as specified in the
243 section PE serialization format.
244
245 [4] The terminal symbols of the grammar are specified implic‐
246 itly as the set of all terminal symbols used in the start
247 expression and on the RHS of the grammar rules.
248
249 canonical serialization
250 The canonical serialization of a grammar has the format as spec‐
251 ified in the previous item, and then additionally satisfies the
252 constraints below, which make it unique among all the possible
253 serializations of this grammar.
254
255 [1] The keys found in all the nested Tcl dictionaries are
256 sorted in ascending dictionary order, as generated by
257 Tcl's builtin command lsort -increasing -dict.
258
259 [2] The string representation of the value is the canonical
260 representation of a Tcl dictionary. I.e. it does not con‐
261 tain superfluous whitespace.
262
263 EXAMPLE
264 Assuming the following PEG for simple mathematical expressions
265
266 PEG calculator (Expression)
267 Digit <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9' ;
268 Sign <- '-' / '+' ;
269 Number <- Sign? Digit+ ;
270 Expression <- Term (AddOp Term)* ;
271 MulOp <- '*' / '/' ;
272 Term <- Factor (MulOp Factor)* ;
273 AddOp <- '+'/'-' ;
274 Factor <- '(' Expression ')' / Number ;
275 END;
276
277
278 then its canonical serialization (except for whitespace) is
279
280 pt::grammar::peg {
281 rules {
282 AddOp {is {/ {t -} {t +}} mode value}
283 Digit {is {/ {t 0} {t 1} {t 2} {t 3} {t 4} {t 5} {t 6} {t 7} {t 8} {t 9}} mode value}
284 Expression {is {x {n Term} {* {x {n AddOp} {n Term}}}} mode value}
285 Factor {is {/ {x {t (} {n Expression} {t )}} {n Number}} mode value}
286 MulOp {is {/ {t *} {t /}} mode value}
287 Number {is {x {? {n Sign}} {+ {n Digit}}} mode value}
288 Sign {is {/ {t -} {t +}} mode value}
289 Term {is {x {n Factor} {* {x {n MulOp} {n Factor}}}} mode value}
290 }
291 start {n Expression}
292 }
293
294
296 Here we specify the format used by the Parser Tools to serialize Pars‐
297 ing Expressions as immutable values for transport, comparison, etc.
298
299 We distinguish between regular and canonical serializations. While a
300 parsing expression may have more than one regular serialization only
301 exactly one of them will be canonical.
302
303 Regular serialization
304
305 Atomic Parsing Expressions
306
307 [1] The string epsilon is an atomic parsing expres‐
308 sion. It matches the empty string.
309
310 [2] The string dot is an atomic parsing expression. It
311 matches any character.
312
313 [3] The string alnum is an atomic parsing expression.
314 It matches any Unicode alphabet or digit charac‐
315 ter. This is a custom extension of PEs based on
316 Tcl's builtin command string is.
317
318 [4] The string alpha is an atomic parsing expression.
319 It matches any Unicode alphabet character. This is
320 a custom extension of PEs based on Tcl's builtin
321 command string is.
322
323 [5] The string ascii is an atomic parsing expression.
324 It matches any Unicode character below U0080. This
325 is a custom extension of PEs based on Tcl's
326 builtin command string is.
327
328 [6] The string control is an atomic parsing expres‐
329 sion. It matches any Unicode control character.
330 This is a custom extension of PEs based on Tcl's
331 builtin command string is.
332
333 [7] The string digit is an atomic parsing expression.
334 It matches any Unicode digit character. Note that
335 this includes characters outside of the [0..9]
336 range. This is a custom extension of PEs based on
337 Tcl's builtin command string is.
338
339 [8] The string graph is an atomic parsing expression.
340 It matches any Unicode printing character, except
341 for space. This is a custom extension of PEs based
342 on Tcl's builtin command string is.
343
344 [9] The string lower is an atomic parsing expression.
345 It matches any Unicode lower-case alphabet charac‐
346 ter. This is a custom extension of PEs based on
347 Tcl's builtin command string is.
348
349 [10] The string print is an atomic parsing expression.
350 It matches any Unicode printing character, includ‐
351 ing space. This is a custom extension of PEs based
352 on Tcl's builtin command string is.
353
354 [11] The string punct is an atomic parsing expression.
355 It matches any Unicode punctuation character. This
356 is a custom extension of PEs based on Tcl's
357 builtin command string is.
358
359 [12] The string space is an atomic parsing expression.
360 It matches any Unicode space character. This is a
361 custom extension of PEs based on Tcl's builtin
362 command string is.
363
364 [13] The string upper is an atomic parsing expression.
365 It matches any Unicode upper-case alphabet charac‐
366 ter. This is a custom extension of PEs based on
367 Tcl's builtin command string is.
368
369 [14] The string wordchar is an atomic parsing expres‐
370 sion. It matches any Unicode word character. This
371 is any alphanumeric character (see alnum), and any
372 connector punctuation characters (e.g. under‐
373 score). This is a custom extension of PEs based on
374 Tcl's builtin command string is.
375
376 [15] The string xdigit is an atomic parsing expression.
377 It matches any hexadecimal digit character. This
378 is a custom extension of PEs based on Tcl's
379 builtin command string is.
380
381 [16] The string ddigit is an atomic parsing expression.
382 It matches any decimal digit character. This is a
383 custom extension of PEs based on Tcl's builtin
384 command regexp.
385
386 [17] The expression [list t x] is an atomic parsing ex‐
387 pression. It matches the terminal string x.
388
389 [18] The expression [list n A] is an atomic parsing ex‐
390 pression. It matches the nonterminal A.
391
392 Combined Parsing Expressions
393
394 [1] For parsing expressions e1, e2, ... the result of
395 [list / e1 e2 ... ] is a parsing expression as
396 well. This is the ordered choice, aka prioritized
397 choice.
398
399 [2] For parsing expressions e1, e2, ... the result of
400 [list x e1 e2 ... ] is a parsing expression as
401 well. This is the sequence.
402
403 [3] For a parsing expression e the result of [list *
404 e] is a parsing expression as well. This is the
405 kleene closure, describing zero or more repeti‐
406 tions.
407
408 [4] For a parsing expression e the result of [list +
409 e] is a parsing expression as well. This is the
410 positive kleene closure, describing one or more
411 repetitions.
412
413 [5] For a parsing expression e the result of [list &
414 e] is a parsing expression as well. This is the
415 and lookahead predicate.
416
417 [6] For a parsing expression e the result of [list !
418 e] is a parsing expression as well. This is the
419 not lookahead predicate.
420
421 [7] For a parsing expression e the result of [list ?
422 e] is a parsing expression as well. This is the
423 optional input.
424
425 Canonical serialization
426 The canonical serialization of a parsing expression has the for‐
427 mat as specified in the previous item, and then additionally
428 satisfies the constraints below, which make it unique among all
429 the possible serializations of this parsing expression.
430
431 [1] The string representation of the value is the canonical
432 representation of a pure Tcl list. I.e. it does not con‐
433 tain superfluous whitespace.
434
435 [2] Terminals are not encoded as ranges (where start and end
436 of the range are identical).
437
438 EXAMPLE
439 Assuming the parsing expression shown on the right-hand side of the
440 rule
441
442 Expression <- Term (AddOp Term)*
443
444
445 then its canonical serialization (except for whitespace) is
446
447 {x {n Term} {* {x {n AddOp} {n Term}}}}
448
449
451 This document, and the package it describes, will undoubtedly contain
452 bugs and other problems. Please report such in the category pt of the
453 Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist]. Please also
454 report any ideas for enhancements you may have for either package
455 and/or documentation.
456
457 When proposing code changes, please provide unified diffs, i.e the out‐
458 put of diff -u.
459
460 Note further that attachments are strongly preferred over inlined
461 patches. Attachments can be made by going to the Edit form of the
462 ticket immediately after its creation, and then using the left-most
463 button in the secondary navigation bar.
464
466 EBNF, LL(k), PEG, TDPL, context-free languages, expression, grammar,
467 matching, parser, parsing expression, parsing expression grammar, push
468 down automaton, recursive descent, state, top-down parsing languages,
469 transducer
470
472 Parsing and Grammars
473
475 Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
476
477
478
479
480tcllib 1 pt::json_language(n)