pt::peg::import::json(n)         Parser Tools         pt::peg::import::json(n)

______________________________________________________________________________

NAME
       pt::peg::import::json - PEG Import Plugin. Read JSON format

SYNOPSIS
       package require Tcl 8.5

       package require pt::peg::import::json ?1?

       package require pt::peg::from::json

       import text

______________________________________________________________________________

DESCRIPTION
       Are you lost? Do you have trouble understanding this document? In
       that case please read the overview provided by the Introduction to
       Parser Tools. That document is the entry point to the whole system of
       which the current package is a part.

       This package implements the parsing expression grammar import plugin
       processing JSON markup.

       It resides in the Import section of the Core Layer of Parser Tools and
       is intended to be used by pt::peg::import, the import manager, sitting
       between it and the corresponding core conversion functionality
       provided by pt::peg::from::json.

       IMAGE: arch_core_iplugins

       While the direct use of this package with a regular interpreter is
       possible, this is strongly discouraged and requires a number of
       contortions to provide the expected environment. The proper way to
       use this functionality depends on the situation:

       [1]    In an untrusted environment the proper access is through the
              package pt::peg::import and the import manager objects it
              provides.

       [2]    In a trusted environment, however, simply use the package
              pt::peg::from::json and access the core conversion
              functionality directly.
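
       For the trusted case, a minimal sketch of such direct access might
       look as follows. It assumes that Tcllib is installed and that
       pt::peg::from::json provides a convert command taking the JSON text,
       as that package's own documentation describes. The one-rule grammar
       and the variable names are made up for illustration.

```tcl
# Sketch only. Assumes Tcllib is installed and that pt::peg::from::json
# provides a "convert" command taking JSON text; the grammar is made up.
set json {{
    "pt::grammar::peg" : {
        "rules" : {
            "Sign" : { "is" : "\/ {t -} {t +}", "mode" : "value" }
        },
        "start" : "n Sign"
    }
}}

if {[catch {package require pt::peg::from::json} msg]} {
    puts "pt::peg::from::json is not available: $msg"
} elseif {[catch {pt::peg::from::json convert $json} serial]} {
    puts "conversion failed: $serial"
} else {
    # $serial now holds the Tcl serialization of the grammar.
    puts $serial
}
```

       Both failure paths are caught so that the sketch degrades to a
       message when the package is missing or rejects the input.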

API
       The API provided by this package satisfies the specification of the
       Plugin API found in the Parser Tools Import API specification.

       import text
              This command takes the JSON markup contained in text, which
              encodes a parsing expression grammar, and generates the
              canonical serialization of said grammar, as specified in
              section PEG serialization format. The created value is then
              returned as the result of the command.

JSON GRAMMAR EXCHANGE FORMAT
       The json format for parsing expression grammars was written as a data
       exchange format not bound to Tcl. It was defined to allow the exchange
       of grammars with PackRat/PEG based parser generators for other
       languages.

       It is formally specified by the rules below:

       [1]    The JSON of any PEG is a JSON object.

       [2]    This object holds a single key, pt::grammar::peg, and its value.
              This value holds the contents of the grammar.

       [3]    The contents of the grammar are a JSON object holding the set of
              nonterminal symbols and the starting expression. The relevant
              keys and their values are

              rules  The value is a JSON object whose keys are the names of
                     the nonterminal symbols known to the grammar.

                     [1]    Each nonterminal symbol may occur only once.

                     [2]    The empty string is not a legal nonterminal
                            symbol.

                     [3]    The value for each symbol is a JSON object
                            itself. The relevant keys and their values in
                            this object are

                            is     The value is a JSON string holding the
                                   Tcl serialization of the parsing
                                   expression describing the symbol's
                                   sentential structure, as specified in the
                                   section PE serialization format.

                            mode   The value is a JSON string holding one of
                                   three values specifying how a parser
                                   should handle the semantic value produced
                                   by the symbol.

                                   value  The semantic value of the
                                          nonterminal symbol is an abstract
                                          syntax tree consisting of a single
                                          node for the nonterminal itself,
                                          which has the ASTs of the symbol's
                                          right hand side as its children.

                                   leaf   The semantic value of the
                                          nonterminal symbol is an abstract
                                          syntax tree consisting of a single
                                          node for the nonterminal, without
                                          any children. Any ASTs generated by
                                          the symbol's right hand side are
                                          discarded.

                                   void   The nonterminal has no semantic
                                          value. Any ASTs generated by the
                                          symbol's right hand side are
                                          discarded (as well).

              start  The value is a JSON string holding the Tcl serialization
                     of the start parsing expression of the grammar, as
                     specified in the section PE serialization format.

       [4]    The terminal symbols of the grammar are specified implicitly as
              the set of all terminal symbols used in the start expression
              and on the RHS of the grammar rules.
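
       Rule [4] can be made concrete with a small sketch that walks the
       parsing expressions of a grammar and collects the terminals they
       mention. It works on the Tcl serialization of the grammar, i.e. on
       the same structure the JSON maps to. The procedures terminals and
       grammarTerminals are hypothetical helpers written for this
       illustration, not commands of Parser Tools, and only the operators
       described in this document are handled.

```tcl
# Hypothetical helper: list the terminals used in one parsing expression.
proc terminals {pe} {
    set op [lindex $pe 0]
    switch -exact -- $op {
        t { return [list [lindex $pe 1]] }
        n { return {} }
        / - x - * - + - & - ! - ? {
            # Combined expression: recurse into every operand.
            set result {}
            foreach sub [lrange $pe 1 end] {
                lappend result {*}[terminals $sub]
            }
            return $result
        }
        default {
            # epsilon, dot, alnum, ...: no literal terminal string.
            return {}
        }
    }
}

# Hypothetical helper: the implicit terminal set of a whole grammar.
proc grammarTerminals {serial} {
    set g   [dict get $serial pt::grammar::peg]
    set all [terminals [dict get $g start]]
    dict for {sym def} [dict get $g rules] {
        lappend all {*}[terminals [dict get $def is]]
    }
    return [lsort -unique $all]
}

set serial {pt::grammar::peg {
    rules {
        Digit  {is {/ {t 0} {t 1}}                mode value}
        Number {is {x {? {n Sign}} {+ {n Digit}}} mode value}
        Sign   {is {/ {t -} {t +}}                mode value}
    }
    start {n Number}
}}
puts [grammarTerminals $serial]   ;# prints: + - 0 1
```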

       As an aside to the advanced reader, this is pretty much the same as the
       Tcl serialization of PE grammars, as specified in section PEG
       serialization format, except that the Tcl dictionaries and lists of
       that format are mapped to JSON objects and arrays. Only the parsing
       expressions themselves are not translated further, but kept as JSON
       strings containing a nested Tcl list, and there is no concept of
       canonicity for the JSON either.

   EXAMPLE
       Assuming the following PEG for simple mathematical expressions

              PEG calculator (Expression)
                  Digit      <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9' ;
                  Sign       <- '-' / '+'                               ;
                  Number     <- Sign? Digit+                            ;
                  Expression <- Term (AddOp Term)*                      ;
                  MulOp      <- '*' / '/'                               ;
                  Term       <- Factor (MulOp Factor)*                  ;
                  AddOp      <- '+'/'-'                                 ;
                  Factor     <- '(' Expression ')' / Number             ;
              END;

       a JSON serialization for it is

              {
                  "pt::grammar::peg" : {
                      "rules" : {
                          "AddOp"      : {
                              "is"   : "\/ {t -} {t +}",
                              "mode" : "value"
                          },
                          "Digit"      : {
                              "is"   : "\/ {t 0} {t 1} {t 2} {t 3} {t 4} {t 5} {t 6} {t 7} {t 8} {t 9}",
                              "mode" : "value"
                          },
                          "Expression" : {
                              "is"   : "x {n Term} {* {x {n AddOp} {n Term}}}",
                              "mode" : "value"
                          },
                          "Factor"     : {
                              "is"   : "\/ {x {t (} {n Expression} {t )}} {n Number}",
                              "mode" : "value"
                          },
                          "MulOp"      : {
                              "is"   : "\/ {t *} {t \/}",
                              "mode" : "value"
                          },
                          "Number"     : {
                              "is"   : "x {? {n Sign}} {+ {n Digit}}",
                              "mode" : "value"
                          },
                          "Sign"       : {
                              "is"   : "\/ {t -} {t +}",
                              "mode" : "value"
                          },
                          "Term"       : {
                              "is"   : "x {n Factor} {* {x {n MulOp} {n Factor}}}",
                              "mode" : "value"
                          }
                      },
                      "start" : "n Expression"
                  }
              }

       and a Tcl serialization of the same is

              pt::grammar::peg {
                  rules {
                      AddOp      {is {/ {t -} {t +}}                                                                mode value}
                      Digit      {is {/ {t 0} {t 1} {t 2} {t 3} {t 4} {t 5} {t 6} {t 7} {t 8} {t 9}}                mode value}
                      Expression {is {x {n Term} {* {x {n AddOp} {n Term}}}}                                        mode value}
                      Factor     {is {/ {x {t (} {n Expression} {t )}} {n Number}}                                  mode value}
                      MulOp      {is {/ {t *} {t /}}                                                                mode value}
                      Number     {is {x {? {n Sign}} {+ {n Digit}}}                                                 mode value}
                      Sign       {is {/ {t -} {t +}}                                                                mode value}
                      Term       {is {x {n Factor} {* {x {n MulOp} {n Factor}}}}                                    mode value}
                  }
                  start {n Expression}
              }

       The similarity of the latter to the JSON should be quite obvious.
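
       To make the correspondence between the two forms concrete, the sketch
       below turns a Tcl serialization into JSON text by hand. It is an
       illustration only: jsonString and pegToJson are hypothetical helpers,
       the string quoting covers just the backslash, double quote and slash
       characters, and real code would more likely rely on the conversion
       packages of Parser Tools.

```tcl
# Hypothetical helper: quote a Tcl string as a JSON string.
# Only backslash, double quote and slash are escaped in this sketch.
proc jsonString {s} {
    set map [list "\\" "\\\\" "\"" "\\\"" "/" "\\/"]
    return "\"[string map $map $s]\""
}

# Hypothetical helper: map the dictionaries to JSON objects and keep each
# parsing expression as one JSON string holding the nested Tcl list.
proc pegToJson {serial} {
    set g [dict get $serial pt::grammar::peg]
    set rules {}
    dict for {sym def} [dict get $g rules] {
        set is   [jsonString [dict get $def is]]
        set mode [jsonString [dict get $def mode]]
        lappend rules "[jsonString $sym]:{\"is\":$is,\"mode\":$mode}"
    }
    set start [jsonString [dict get $g start]]
    return "{\"pt::grammar::peg\":{\"rules\":{[join $rules ,]},\"start\":$start}}"
}

set serial {pt::grammar::peg {rules {MulOp {is {/ {t *} {t /}} mode value}} start {n MulOp}}}
puts [pegToJson $serial]
```

       For the one-rule grammar above the "is" value comes out as
       "\/ {t *} {t \/}", matching the MulOp entry of the example.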

PEG SERIALIZATION FORMAT
       Here we specify the format used by the Parser Tools to serialize
       Parsing Expression Grammars as immutable values for transport,
       comparison, etc.

       We distinguish between regular and canonical serializations. While a
       PEG may have more than one regular serialization, exactly one of them
       will be canonical.

       regular serialization

              [1]    The serialization of any PEG is a nested Tcl dictionary.

              [2]    This dictionary holds a single key, pt::grammar::peg,
                     and its value. This value holds the contents of the
                     grammar.

              [3]    The contents of the grammar are a Tcl dictionary holding
                     the set of nonterminal symbols and the starting
                     expression. The relevant keys and their values are

                     rules  The value is a Tcl dictionary whose keys are the
                            names of the nonterminal symbols known to the
                            grammar.

                            [1]    Each nonterminal symbol may occur only
                                   once.

                            [2]    The empty string is not a legal
                                   nonterminal symbol.

                            [3]    The value for each symbol is a Tcl
                                   dictionary itself. The relevant keys and
                                   their values in this dictionary are

                                   is     The value is the serialization of
                                          the parsing expression describing
                                          the symbol's sentential structure,
                                          as specified in the section PE
                                          serialization format.

                                   mode   The value can be one of three
                                          values specifying how a parser
                                          should handle the semantic value
                                          produced by the symbol.

                                          value  The semantic value of the
                                                 nonterminal symbol is an
                                                 abstract syntax tree
                                                 consisting of a single node
                                                 for the nonterminal itself,
                                                 which has the ASTs of the
                                                 symbol's right hand side as
                                                 its children.

                                          leaf   The semantic value of the
                                                 nonterminal symbol is an
                                                 abstract syntax tree
                                                 consisting of a single node
                                                 for the nonterminal, without
                                                 any children. Any ASTs
                                                 generated by the symbol's
                                                 right hand side are
                                                 discarded.

                                          void   The nonterminal has no
                                                 semantic value. Any ASTs
                                                 generated by the symbol's
                                                 right hand side are
                                                 discarded (as well).

                     start  The value is the serialization of the start
                            parsing expression of the grammar, as specified
                            in the section PE serialization format.

              [4]    The terminal symbols of the grammar are specified
                     implicitly as the set of all terminal symbols used in
                     the start expression and on the RHS of the grammar
                     rules.

       canonical serialization
              The canonical serialization of a grammar has the format as
              specified in the previous item, and then additionally satisfies
              the constraints below, which make it unique among all the
              possible serializations of this grammar.

              [1]    The keys found in all the nested Tcl dictionaries are
                     sorted in ascending dictionary order, as generated by
                     Tcl's builtin command lsort -increasing -dict.

              [2]    The string representation of the value is the canonical
                     representation of a Tcl dictionary. I.e. it does not
                     contain superfluous whitespace.
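
       The two constraints can be sketched in plain Tcl as follows. The
       procedures sortedDict and canonicalPeg are hypothetical helpers for
       this illustration, not part of Parser Tools. They only reorder the
       dictionaries of a grammar and rebuild them as pure lists; the parsing
       expressions stored under is and start are passed through unchanged,
       so they are assumed to be in canonical form already.

```tcl
# Hypothetical helper: rebuild a dictionary with its keys in
# "lsort -increasing -dict" order. Building it with lappend yields the
# canonical list form, without superfluous whitespace.
proc sortedDict {d} {
    set out {}
    foreach k [lsort -increasing -dict [dict keys $d]] {
        lappend out $k [dict get $d $k]
    }
    return $out
}

# Hypothetical helper: apply the ordering to all dictionaries of a PEG
# serialization. Parsing expressions are not touched.
proc canonicalPeg {serial} {
    set g [dict get $serial pt::grammar::peg]
    set rules {}
    foreach {sym def} [sortedDict [dict get $g rules]] {
        lappend rules $sym [sortedDict $def]
    }
    return [list pt::grammar::peg \
                [list rules $rules start [dict get $g start]]]
}

set messy {pt::grammar::peg {start {n A}   rules {B {mode value is {t b}}   A {mode leaf is {n B}}}}}
puts [canonicalPeg $messy]
# prints: pt::grammar::peg {rules {A {is {n B} mode leaf} B {is {t b} mode value}} start {n A}}
```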

   EXAMPLE
       Assuming the following PEG for simple mathematical expressions

              PEG calculator (Expression)
                  Digit      <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9' ;
                  Sign       <- '-' / '+'                               ;
                  Number     <- Sign? Digit+                            ;
                  Expression <- Term (AddOp Term)*                      ;
                  MulOp      <- '*' / '/'                               ;
                  Term       <- Factor (MulOp Factor)*                  ;
                  AddOp      <- '+'/'-'                                 ;
                  Factor     <- '(' Expression ')' / Number             ;
              END;

       then its canonical serialization (except for whitespace) is

              pt::grammar::peg {
                  rules {
                      AddOp      {is {/ {t -} {t +}}                                                                mode value}
                      Digit      {is {/ {t 0} {t 1} {t 2} {t 3} {t 4} {t 5} {t 6} {t 7} {t 8} {t 9}}                mode value}
                      Expression {is {x {n Term} {* {x {n AddOp} {n Term}}}}                                        mode value}
                      Factor     {is {/ {x {t (} {n Expression} {t )}} {n Number}}                                  mode value}
                      MulOp      {is {/ {t *} {t /}}                                                                mode value}
                      Number     {is {x {? {n Sign}} {+ {n Digit}}}                                                 mode value}
                      Sign       {is {/ {t -} {t +}}                                                                mode value}
                      Term       {is {x {n Factor} {* {x {n MulOp} {n Factor}}}}                                    mode value}
                  }
                  start {n Expression}
              }

PE SERIALIZATION FORMAT
       Here we specify the format used by the Parser Tools to serialize
       Parsing Expressions as immutable values for transport, comparison,
       etc.

       We distinguish between regular and canonical serializations. While a
       parsing expression may have more than one regular serialization,
       exactly one of them will be canonical.

       Regular serialization

              Atomic Parsing Expressions

                     [1]    The string epsilon is an atomic parsing
                            expression. It matches the empty string.

                     [2]    The string dot is an atomic parsing expression.
                            It matches any character.

                     [3]    The string alnum is an atomic parsing expression.
                            It matches any Unicode alphabetic or digit
                            character. This is a custom extension of PEs
                            based on Tcl's builtin command string is.

                     [4]    The string alpha is an atomic parsing expression.
                            It matches any Unicode alphabetic character. This
                            is a custom extension of PEs based on Tcl's
                            builtin command string is.

                     [5]    The string ascii is an atomic parsing expression.
                            It matches any Unicode character below U+0080.
                            This is a custom extension of PEs based on Tcl's
                            builtin command string is.

                     [6]    The string control is an atomic parsing
                            expression. It matches any Unicode control
                            character. This is a custom extension of PEs
                            based on Tcl's builtin command string is.

                     [7]    The string digit is an atomic parsing expression.
                            It matches any Unicode digit character. Note that
                            this includes characters outside of the [0..9]
                            range. This is a custom extension of PEs based on
                            Tcl's builtin command string is.

                     [8]    The string graph is an atomic parsing expression.
                            It matches any Unicode printing character, except
                            for space. This is a custom extension of PEs
                            based on Tcl's builtin command string is.

                     [9]    The string lower is an atomic parsing expression.
                            It matches any Unicode lower-case alphabetic
                            character. This is a custom extension of PEs
                            based on Tcl's builtin command string is.

                     [10]   The string print is an atomic parsing expression.
                            It matches any Unicode printing character,
                            including space. This is a custom extension of
                            PEs based on Tcl's builtin command string is.

                     [11]   The string punct is an atomic parsing expression.
                            It matches any Unicode punctuation character.
                            This is a custom extension of PEs based on Tcl's
                            builtin command string is.

                     [12]   The string space is an atomic parsing expression.
                            It matches any Unicode space character. This is a
                            custom extension of PEs based on Tcl's builtin
                            command string is.

                     [13]   The string upper is an atomic parsing expression.
                            It matches any Unicode upper-case alphabetic
                            character. This is a custom extension of PEs
                            based on Tcl's builtin command string is.

                     [14]   The string wordchar is an atomic parsing
                            expression. It matches any Unicode word
                            character. This is any alphanumeric character
                            (see alnum), and any connector punctuation
                            characters (e.g. underscore). This is a custom
                            extension of PEs based on Tcl's builtin command
                            string is.

                     [15]   The string xdigit is an atomic parsing
                            expression. It matches any hexadecimal digit
                            character. This is a custom extension of PEs
                            based on Tcl's builtin command string is.

                     [16]   The string ddigit is an atomic parsing
                            expression. It matches any decimal digit
                            character. This is a custom extension of PEs
                            based on Tcl's builtin command regexp.

                     [17]   The expression [list t x] is an atomic parsing
                            expression. It matches the terminal string x.

                     [18]   The expression [list n A] is an atomic parsing
                            expression. It matches the nonterminal A.

              Combined Parsing Expressions

                     [1]    For parsing expressions e1, e2, ... the result of
                            [list / e1 e2 ... ] is a parsing expression as
                            well. This is the ordered choice, aka prioritized
                            choice.

                     [2]    For parsing expressions e1, e2, ... the result of
                            [list x e1 e2 ... ] is a parsing expression as
                            well. This is the sequence.

                     [3]    For a parsing expression e the result of
                            [list * e] is a parsing expression as well. This
                            is the Kleene closure, describing zero or more
                            repetitions.

                     [4]    For a parsing expression e the result of
                            [list + e] is a parsing expression as well. This
                            is the positive Kleene closure, describing one or
                            more repetitions.

                     [5]    For a parsing expression e the result of
                            [list & e] is a parsing expression as well. This
                            is the and lookahead predicate.

                     [6]    For a parsing expression e the result of
                            [list ! e] is a parsing expression as well. This
                            is the not lookahead predicate.

                     [7]    For a parsing expression e the result of
                            [list ? e] is a parsing expression as well. This
                            is the optional input.

       Canonical serialization
              The canonical serialization of a parsing expression has the
              format as specified in the previous item, and then additionally
              satisfies the constraints below, which make it unique among all
              the possible serializations of this parsing expression.

              [1]    The string representation of the value is the canonical
                     representation of a pure Tcl list. I.e. it does not
                     contain superfluous whitespace.

              [2]    Terminals are not encoded as ranges (where start and end
                     of the range are identical).

   EXAMPLE
       Assuming the parsing expression shown on the right-hand side of the
       rule

              Expression <- Term (AddOp Term)*

       then its canonical serialization (except for whitespace) is

              {x {n Term} {* {x {n AddOp} {n Term}}}}
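
       The same value can be assembled with the list command, following the
       rules for atomic and combined parsing expressions given above. This
       is a small sketch in plain Tcl; the variable names are arbitrary.
       Because list generates the canonical string representation of a pure
       list, the result contains no superfluous whitespace.

```tcl
# Atomic expressions: the nonterminals Term and AddOp.
set term  [list n Term]
set addop [list n AddOp]

# Combined expressions: (AddOp Term)* and the enclosing sequence.
set repeat [list * [list x $addop $term]]
set pe     [list x $term $repeat]

puts $pe   ;# prints: x {n Term} {* {x {n AddOp} {n Term}}}
```

       The braces around the whole expression in the example above appear
       when the value is embedded as one element of a larger list or
       dictionary, for instance as the value of the is key of a rule.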

BUGS, IDEAS, FEEDBACK
       This document, and the package it describes, will undoubtedly contain
       bugs and other problems. Please report such in the category pt of the
       Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist]. Please also
       report any ideas for enhancements you may have for either package
       and/or documentation.

       When proposing code changes, please provide unified diffs, i.e. the
       output of diff -u.

       Note further that attachments are strongly preferred over inlined
       patches. Attachments can be made by going to the Edit form of the
       ticket immediately after its creation, and then using the left-most
       button in the secondary navigation bar.

KEYWORDS
       EBNF, JSON, LL(k), PEG, TDPL, context-free languages, expression,
       grammar, import, matching, parser, parsing expression, parsing
       expression grammar, plugin, push down automaton, recursive descent,
       serialization, state, top-down parsing languages, transducer

CATEGORY
       Parsing and Grammars

COPYRIGHT
       Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>

tcllib                                 1              pt::peg::import::json(n)