1pt::peg(n) Parser Tools pt::peg(n)
2
3
4
5______________________________________________________________________________
6
8 pt::peg - Parsing Expression Grammar Serialization
9
11 package require Tcl 8.5
12
13 package require pt::peg ?1?
14
15 package require pt::pe
16
17 ::pt::peg verify serial ?canonvar?
18
19 ::pt::peg verify-as-canonical serial
20
21 ::pt::peg canonicalize serial
22
23 ::pt::peg print serial
24
25 ::pt::peg merge seriala serialb
26
27 ::pt::peg equal seriala serialb
28
29______________________________________________________________________________
30
32 Are you lost ? Do you have trouble understanding this document ? In
33 that case please read the overview provided by the Introduction to
34 Parser Tools. This document is the entrypoint to the whole system the
35 current package is a part of.
36
37 This package provides commands to work with the serializations of pars‐
38 ing expression grammars as managed by the Parser Tools, and specified
39 in section PEG serialization format.
40
41 This is a supporting package in the Core Layer of Parser Tools.
42
43 IMAGE: arch_core_support
44
46 ::pt::peg verify serial ?canonvar?
47 This command verifies that the content of serial is a valid se‐
48 rialization of a parsing expression and will throw an error if
49 that is not the case. The result of the command is the empty
50 string.
51
52 If the argument canonvar is specified it is interpreted as the
53 name of a variable in the calling context. This variable will be
54 written to if and only if serial is a valid regular serializa‐
55 tion. Its value will be a boolean, with True indicating that the
56 serialization is not only valid, but also canonical. False will
57 be written for a valid, but non-canonical serialization.
58
59 For the specification of serializations see the section PE seri‐
60 alization format.
61
62 ::pt::peg verify-as-canonical serial
63 This command verifies that the content of serial is a valid
64 canonical serialization of a PEG and will throw an error if that
65 is not the case. The result of the command is the empty string.
66
67 For the specification of canonical serializations see the sec‐
68 tion PEG serialization format.
69
70 ::pt::peg canonicalize serial
71 This command assumes that the content of serial is a valid regu‐
72 lar serialization of a PEG and will throw an error if that is
73 not the case.
74
75 It will then convert the input into the canonical serialization
76 of the contained PEG and return it as its result. If the input
77 is already canonical it will be returned unchanged.
78
79 For the specification of regular and canonical serializations
80 see the section PEG serialization format.
81
82 ::pt::peg print serial
83 This command assumes that the argument serial contains a valid
84 serialization of a parsing expression and returns a string con‐
85 taining that PE in a human readable form.
86
87 The exact format of this form is not specified and cannot be re‐
88 lied on for parsing or other machine-based activities.
89
90 For the specification of serializations see the section PEG se‐
91 rialization format.
92
93 ::pt::peg merge seriala serialb
94 This command accepts the regular serializations of two grammars
95 and uses them to create their union. The result of the command
96 is the canonical serialization of this unified grammar.
97
98 A merge errors occurs if for any nonterminal symbol S occuring
99 in both input grammars the two input grammars specify different
100 semantic modes.
101
102 The semantic mode of each nonterminal symbol S is the semantic
103 mode of S in any of its input grammars. The previous rule made
104 sure that for symbols occuring in both grammars these values are
105 identical.
106
107 The right-hand side of each nonterminal symbol S occuring in
108 both input grammars is the choice between the right-hand sides
109 of S in the input grammars, with the parsing expression of S in
110 seriala coming first, except if both expressions are identical.
111 In that case the first expression is taken.
112
113 The right-hand side of each nonterminal symbol S occuring in
114 only one of the input grammars is the right-hand side of S in
115 its input grammar.
116
117 The start expression of the unified grammar is the choice be‐
118 tween the start expressions of the input grammars, with the
119 start expression of seriala coming first, except if both expres‐
120 sions are identical. In that case the first expression is taken
121
122 ::pt::peg equal seriala serialb
123 This command tests the two grammars seriala and serialb for
124 structural equality. The result of the command is a boolean
125 value. It will be set to true if the expressions are identical,
126 and false otherwise.
127
128 String equality is usable only if we can assume that the two
129 grammars are pure Tcl lists and dictionaries.
130
132 Here we specify the format used by the Parser Tools to serialize Pars‐
133 ing Expression Grammars as immutable values for transport, comparison,
134 etc.
135
136 We distinguish between regular and canonical serializations. While a
137 PEG may have more than one regular serialization only exactly one of
138 them will be canonical.
139
140 regular serialization
141
142 [1] The serialization of any PEG is a nested Tcl dictionary.
143
144 [2] This dictionary holds a single key, pt::grammar::peg, and
145 its value. This value holds the contents of the grammar.
146
147 [3] The contents of the grammar are a Tcl dictionary holding
148 the set of nonterminal symbols and the starting expres‐
149 sion. The relevant keys and their values are
150
151 rules The value is a Tcl dictionary whose keys are the
152 names of the nonterminal symbols known to the
153 grammar.
154
155 [1] Each nonterminal symbol may occur only
156 once.
157
158 [2] The empty string is not a legal nonterminal
159 symbol.
160
161 [3] The value for each symbol is a Tcl dictio‐
162 nary itself. The relevant keys and their
163 values in this dictionary are
164
165 is The value is the serialization of
166 the parsing expression describing
167 the symbols sentennial structure, as
168 specified in the section PE serial‐
169 ization format.
170
171 mode The value can be one of three values
172 specifying how a parser should han‐
173 dle the semantic value produced by
174 the symbol.
175
176 value The semantic value of the
177 nonterminal symbol is an ab‐
178 stract syntax tree consisting
179 of a single node node for the
180 nonterminal itself, which has
181 the ASTs of the symbol's
182 right hand side as its chil‐
183 dren.
184
185 leaf The semantic value of the
186 nonterminal symbol is an ab‐
187 stract syntax tree consisting
188 of a single node node for the
189 nonterminal, without any
190 children. Any ASTs generated
191 by the symbol's right hand
192 side are discarded.
193
194 void The nonterminal has no seman‐
195 tic value. Any ASTs generated
196 by the symbol's right hand
197 side are discarded (as well).
198
199 start The value is the serialization of the start pars‐
200 ing expression of the grammar, as specified in the
201 section PE serialization format.
202
203 [4] The terminal symbols of the grammar are specified implic‐
204 itly as the set of all terminal symbols used in the start
205 expression and on the RHS of the grammar rules.
206
207 canonical serialization
208 The canonical serialization of a grammar has the format as spec‐
209 ified in the previous item, and then additionally satisfies the
210 constraints below, which make it unique among all the possible
211 serializations of this grammar.
212
213 [1] The keys found in all the nested Tcl dictionaries are
214 sorted in ascending dictionary order, as generated by
215 Tcl's builtin command lsort -increasing -dict.
216
217 [2] The string representation of the value is the canonical
218 representation of a Tcl dictionary. I.e. it does not con‐
219 tain superfluous whitespace.
220
221 EXAMPLE
222 Assuming the following PEG for simple mathematical expressions
223
224 PEG calculator (Expression)
225 Digit <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9' ;
226 Sign <- '-' / '+' ;
227 Number <- Sign? Digit+ ;
228 Expression <- Term (AddOp Term)* ;
229 MulOp <- '*' / '/' ;
230 Term <- Factor (MulOp Factor)* ;
231 AddOp <- '+'/'-' ;
232 Factor <- '(' Expression ')' / Number ;
233 END;
234
235
236 then its canonical serialization (except for whitespace) is
237
238 pt::grammar::peg {
239 rules {
240 AddOp {is {/ {t -} {t +}} mode value}
241 Digit {is {/ {t 0} {t 1} {t 2} {t 3} {t 4} {t 5} {t 6} {t 7} {t 8} {t 9}} mode value}
242 Expression {is {x {n Term} {* {x {n AddOp} {n Term}}}} mode value}
243 Factor {is {/ {x {t (} {n Expression} {t )}} {n Number}} mode value}
244 MulOp {is {/ {t *} {t /}} mode value}
245 Number {is {x {? {n Sign}} {+ {n Digit}}} mode value}
246 Sign {is {/ {t -} {t +}} mode value}
247 Term {is {x {n Factor} {* {x {n MulOp} {n Factor}}}} mode value}
248 }
249 start {n Expression}
250 }
251
252
254 Here we specify the format used by the Parser Tools to serialize Pars‐
255 ing Expressions as immutable values for transport, comparison, etc.
256
257 We distinguish between regular and canonical serializations. While a
258 parsing expression may have more than one regular serialization only
259 exactly one of them will be canonical.
260
261 Regular serialization
262
263 Atomic Parsing Expressions
264
265 [1] The string epsilon is an atomic parsing expres‐
266 sion. It matches the empty string.
267
268 [2] The string dot is an atomic parsing expression. It
269 matches any character.
270
271 [3] The string alnum is an atomic parsing expression.
272 It matches any Unicode alphabet or digit charac‐
273 ter. This is a custom extension of PEs based on
274 Tcl's builtin command string is.
275
276 [4] The string alpha is an atomic parsing expression.
277 It matches any Unicode alphabet character. This is
278 a custom extension of PEs based on Tcl's builtin
279 command string is.
280
281 [5] The string ascii is an atomic parsing expression.
282 It matches any Unicode character below U0080. This
283 is a custom extension of PEs based on Tcl's
284 builtin command string is.
285
286 [6] The string control is an atomic parsing expres‐
287 sion. It matches any Unicode control character.
288 This is a custom extension of PEs based on Tcl's
289 builtin command string is.
290
291 [7] The string digit is an atomic parsing expression.
292 It matches any Unicode digit character. Note that
293 this includes characters outside of the [0..9]
294 range. This is a custom extension of PEs based on
295 Tcl's builtin command string is.
296
297 [8] The string graph is an atomic parsing expression.
298 It matches any Unicode printing character, except
299 for space. This is a custom extension of PEs based
300 on Tcl's builtin command string is.
301
302 [9] The string lower is an atomic parsing expression.
303 It matches any Unicode lower-case alphabet charac‐
304 ter. This is a custom extension of PEs based on
305 Tcl's builtin command string is.
306
307 [10] The string print is an atomic parsing expression.
308 It matches any Unicode printing character, includ‐
309 ing space. This is a custom extension of PEs based
310 on Tcl's builtin command string is.
311
312 [11] The string punct is an atomic parsing expression.
313 It matches any Unicode punctuation character. This
314 is a custom extension of PEs based on Tcl's
315 builtin command string is.
316
317 [12] The string space is an atomic parsing expression.
318 It matches any Unicode space character. This is a
319 custom extension of PEs based on Tcl's builtin
320 command string is.
321
322 [13] The string upper is an atomic parsing expression.
323 It matches any Unicode upper-case alphabet charac‐
324 ter. This is a custom extension of PEs based on
325 Tcl's builtin command string is.
326
327 [14] The string wordchar is an atomic parsing expres‐
328 sion. It matches any Unicode word character. This
329 is any alphanumeric character (see alnum), and any
330 connector punctuation characters (e.g. under‐
331 score). This is a custom extension of PEs based on
332 Tcl's builtin command string is.
333
334 [15] The string xdigit is an atomic parsing expression.
335 It matches any hexadecimal digit character. This
336 is a custom extension of PEs based on Tcl's
337 builtin command string is.
338
339 [16] The string ddigit is an atomic parsing expression.
340 It matches any decimal digit character. This is a
341 custom extension of PEs based on Tcl's builtin
342 command regexp.
343
344 [17] The expression [list t x] is an atomic parsing ex‐
345 pression. It matches the terminal string x.
346
347 [18] The expression [list n A] is an atomic parsing ex‐
348 pression. It matches the nonterminal A.
349
350 Combined Parsing Expressions
351
352 [1] For parsing expressions e1, e2, ... the result of
353 [list / e1 e2 ... ] is a parsing expression as
354 well. This is the ordered choice, aka prioritized
355 choice.
356
357 [2] For parsing expressions e1, e2, ... the result of
358 [list x e1 e2 ... ] is a parsing expression as
359 well. This is the sequence.
360
361 [3] For a parsing expression e the result of [list *
362 e] is a parsing expression as well. This is the
363 kleene closure, describing zero or more repeti‐
364 tions.
365
366 [4] For a parsing expression e the result of [list +
367 e] is a parsing expression as well. This is the
368 positive kleene closure, describing one or more
369 repetitions.
370
371 [5] For a parsing expression e the result of [list &
372 e] is a parsing expression as well. This is the
373 and lookahead predicate.
374
375 [6] For a parsing expression e the result of [list !
376 e] is a parsing expression as well. This is the
377 not lookahead predicate.
378
379 [7] For a parsing expression e the result of [list ?
380 e] is a parsing expression as well. This is the
381 optional input.
382
383 Canonical serialization
384 The canonical serialization of a parsing expression has the for‐
385 mat as specified in the previous item, and then additionally
386 satisfies the constraints below, which make it unique among all
387 the possible serializations of this parsing expression.
388
389 [1] The string representation of the value is the canonical
390 representation of a pure Tcl list. I.e. it does not con‐
391 tain superfluous whitespace.
392
393 [2] Terminals are not encoded as ranges (where start and end
394 of the range are identical).
395
396 EXAMPLE
397 Assuming the parsing expression shown on the right-hand side of the
398 rule
399
400 Expression <- Term (AddOp Term)*
401
402
403 then its canonical serialization (except for whitespace) is
404
405 {x {n Term} {* {x {n AddOp} {n Term}}}}
406
407
409 This document, and the package it describes, will undoubtedly contain
410 bugs and other problems. Please report such in the category pt of the
411 Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist]. Please also
412 report any ideas for enhancements you may have for either package
413 and/or documentation.
414
415 When proposing code changes, please provide unified diffs, i.e the out‐
416 put of diff -u.
417
418 Note further that attachments are strongly preferred over inlined
419 patches. Attachments can be made by going to the Edit form of the
420 ticket immediately after its creation, and then using the left-most
421 button in the secondary navigation bar.
422
424 EBNF, LL(k), PEG, TDPL, context-free languages, expression, grammar,
425 matching, parser, parsing expression, parsing expression grammar, push
426 down automaton, recursive descent, state, top-down parsing languages,
427 transducer
428
430 Parsing and Grammars
431
433 Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
434
435
436
437
438tcllib 1 pt::peg(n)