1pt::peg::interp(n) Parser Tools pt::peg::interp(n)
2
3
4
5______________________________________________________________________________
6
8 pt::peg::interp - Interpreter for parsing expression grammars
9
11 package require Tcl 8.5
12
13 package require pt::peg::interp ?1.0.1?
14
15 package require pt::rde ?1?
16
17 package require snit
18
19 ::pt::peg::interp objectName grammar
20
21 objectName use grammar
22
23 objectName destroy
24
25 objectName parse chan
26
27 objectName parset text
28
29______________________________________________________________________________
30
32 Are you lost ? Do you have trouble understanding this document ? In
33 that case please read the overview provided by the Introduction to
34 Parser Tools. This document is the entrypoint to the whole system the
35 current package is a part of.
36
37 This package provides a class whose instances are Packrat parsers con‐
38 figurable with a parsing expression grammar. The grammar is executed
39 directly, i.e. interpreted, with the underlying runtime provided by the
40 package pt::rde, basing everything on the PARAM.
41
42 Like the supporting runtime this package resides in the Execution sec‐
43 tion of the Core Layer of Parser Tools.
44
45 IMAGE: arch_core_transform
46
47 The interpreted grammar is copied from an instance of pt::peg::con‐
48 tainer, or anything providing the same API, like the container classes
49 created by pt::peg::to::container or the associated export plugin
50 pt::peg::export::container.
51
52 CLASS API
53 The package exports the API described here.
54
55 ::pt::peg::interp objectName grammar
56 The command creates a new parser object and returns the fully
57 qualified name of the object command as its result. The API of
58 this object command is described in the section Object API. It
59 may be used to invoke various operations on the object.
60
61 This new parser is configured for the execution of an empty PEG.
62 To configure the object for any other PEG use the method use of
63 the Object API.
64
65 OBJECT API
66 All objects created by this package provide the following methods.
67
68 objectName use grammar
69 This method configures the grammar interpreter / parser for the
70 execution of the PEG stored in grammar, an object which is API-
71 compatible to instances of pt::peg::container. The parser copies
72 the relevant information of the grammar, and does not take own‐
73 ership of the object.
74
75 The information of any previously used grammar is overwritten.
76
77 The result of the method the empty string.
78
79 objectName destroy
80 This method destroys the parser instance, releasing all claimed
81 memory and other resources, and deleting the instance command.
82
83 The result of the command is the empty string.
84
85 objectName parse chan
86 This method runs the parser using the contents of chan as input
87 (starting at the current location in the channel), until parsing
88 is not possible anymore, either because parsing has completed,
89 or run into a syntax error.
90
91 Note here that the Parser Tools are based on Tcl 8.5+. In other
92 words, the channel argument is not restricted to files, sockets,
93 etc. We have the full power of reflected channels available.
94
95 It should also be noted that the parser pulls the characters
96 from the input stream as it needs them. If a parser created by
97 this package has to be operated in a push aka event-driven man‐
98 ner it will be necessary to go to Tcl 8.6+ and use the corou‐
99 tine::auto to wrap it into a coroutine where read is properly
100 changed for push-operation.
101
102 Upon successful completion the command returns an abstract syn‐
103 tax tree as its result. This AST is in the form specified in
104 section AST serialization format. As a plain nested Tcl-list it
105 can then be processed with any Tcl commands the user likes,
106 doing transformations, semantic checks, etc. To help in this
107 the package pt::ast provides a set of convenience commands for
108 validation of the tree's basic structure, printing it for debug‐
109 ging, and walking it either from the bottom up, or top down.
110
111 When encountering a syntax error the command will throw an error
112 instead. This error will be a 4-element Tcl-list, containing,
113 in the order listed below:
114
115 [1] The string pt::rde identifying it as parser runtime
116 error.
117
118 [2] The location of the parse error, as character offset from
119 the beginning of the parsed input.
120
121 [3] The location of parse error, now as a 2-element list con‐
122 taining line-number and column in the line.
123
124 [4] A set of atomic parsing expressions indicating encoding
125 the characters and/or nonterminal symbols the parser
126 expected to see at the location of the parse error, but
127 did not get. For the specification of atomic parsing
128 expressions please see the section PE serialization for‐
129 mat.
130
131 objectName parset text
132 This method runs the parser using the string in text as input.
133 In all other ways it behaves like the method parse, shown above.
134
136 Here we specify the format used by the Parser Tools to serialize
137 Abstract Syntax Trees (ASTs) as immutable values for transport, compar‐
138 ison, etc.
139
140 Each node in an AST represents a nonterminal symbol of a grammar, and
141 the range of tokens/characters in the input covered by it. ASTs do not
142 contain terminal symbols, i.e. tokens/characters. These can be recov‐
143 ered from the input given a symbol's location.
144
145 We distinguish between regular and canonical serializations. While a
146 tree may have more than one regular serialization only exactly one of
147 them will be canonical.
148
149 Regular serialization
150
151 [1] The serialization of any AST is the serialization of its
152 root node.
153
154 [2] The serialization of any node is a Tcl list containing at
155 least three elements.
156
157 [1] The first element is the name of the nonterminal
158 symbol stored in the node.
159
160 [2] The second and third element are the locations of
161 the first and last token in the token stream the
162 node represents (covers).
163
164 [1] Locations are provided as non-negative
165 integer offsets from the beginning of the
166 token stream, with the first token found in
167 the stream located at offset 0 (zero).
168
169 [2] The end location has to be equal to or
170 larger than the start location.
171
172 [3] All elements after the first three represent the
173 children of the node, which are themselves nodes.
174 This means that the serializations of nodes with‐
175 out children, i.e. leaf nodes, have exactly three
176 elements. The children are stored in the list
177 with the leftmost child first, and the rightmost
178 child last.
179
180 Canonical serialization
181 The canonical serialization of an abstract syntax tree has the
182 format as specified in the previous item, and then additionally
183 satisfies the constraints below, which make it unique among all
184 the possible serializations of this tree.
185
186 [1] The string representation of the value is the canonical
187 representation of a pure Tcl list. I.e. it does not con‐
188 tain superfluous whitespace.
189
190 EXAMPLE
191 Assuming the parsing expression grammar below
192
193 PEG calculator (Expression)
194 Digit <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9' ;
195 Sign <- '-' / '+' ;
196 Number <- Sign? Digit+ ;
197 Expression <- Term (AddOp Term)* ;
198 MulOp <- '*' / '/' ;
199 Term <- Factor (MulOp Factor)* ;
200 AddOp <- '+'/'-' ;
201 Factor <- '(' Expression ')' / Number ;
202 END;
203
204
205 and the input string
206
207 120+5
208 then a parser should deliver the abstract syntax tree below (except for
209 whitespace)
210
211 set ast {Expression 0 4
212 {Factor 0 4
213 {Term 0 2
214 {Number 0 2
215 {Digit 0 0}
216 {Digit 1 1}
217 {Digit 2 2}
218 }
219 }
220 {AddOp 3 3}
221 {Term 4 4
222 {Number 4 4
223 {Digit 4 4}
224 }
225 }
226 }
227 }
228
229
230 Or, more graphical
231
232 .nf +- Digit 0 0 | 1 | | +- Term 0 2 --- Number 0 2 -+-
233 Digit 1 1 | 2 | | | |
234 +- Digit 2 2 | 0 | | Expression
235 0 4 --- Factor 0 4 -+----------------------------- AddOp 3 3 | + |
236 | +- Term 4 4 --- Number 4 4 --- Digit 4 4 | 5 .fi
237
239 Here we specify the format used by the Parser Tools to serialize Pars‐
240 ing Expressions as immutable values for transport, comparison, etc.
241
242 We distinguish between regular and canonical serializations. While a
243 parsing expression may have more than one regular serialization only
244 exactly one of them will be canonical.
245
246 Regular serialization
247
248 Atomic Parsing Expressions
249
250 [1] The string epsilon is an atomic parsing expres‐
251 sion. It matches the empty string.
252
253 [2] The string dot is an atomic parsing expression. It
254 matches any character.
255
256 [3] The string alnum is an atomic parsing expression.
257 It matches any Unicode alphabet or digit charac‐
258 ter. This is a custom extension of PEs based on
259 Tcl's builtin command string is.
260
261 [4] The string alpha is an atomic parsing expression.
262 It matches any Unicode alphabet character. This is
263 a custom extension of PEs based on Tcl's builtin
264 command string is.
265
266 [5] The string ascii is an atomic parsing expression.
267 It matches any Unicode character below U0080. This
268 is a custom extension of PEs based on Tcl's
269 builtin command string is.
270
271 [6] The string control is an atomic parsing expres‐
272 sion. It matches any Unicode control character.
273 This is a custom extension of PEs based on Tcl's
274 builtin command string is.
275
276 [7] The string digit is an atomic parsing expression.
277 It matches any Unicode digit character. Note that
278 this includes characters outside of the [0..9]
279 range. This is a custom extension of PEs based on
280 Tcl's builtin command string is.
281
282 [8] The string graph is an atomic parsing expression.
283 It matches any Unicode printing character, except
284 for space. This is a custom extension of PEs based
285 on Tcl's builtin command string is.
286
287 [9] The string lower is an atomic parsing expression.
288 It matches any Unicode lower-case alphabet charac‐
289 ter. This is a custom extension of PEs based on
290 Tcl's builtin command string is.
291
292 [10] The string print is an atomic parsing expression.
293 It matches any Unicode printing character, includ‐
294 ing space. This is a custom extension of PEs based
295 on Tcl's builtin command string is.
296
297 [11] The string punct is an atomic parsing expression.
298 It matches any Unicode punctuation character. This
299 is a custom extension of PEs based on Tcl's
300 builtin command string is.
301
302 [12] The string space is an atomic parsing expression.
303 It matches any Unicode space character. This is a
304 custom extension of PEs based on Tcl's builtin
305 command string is.
306
307 [13] The string upper is an atomic parsing expression.
308 It matches any Unicode upper-case alphabet charac‐
309 ter. This is a custom extension of PEs based on
310 Tcl's builtin command string is.
311
312 [14] The string wordchar is an atomic parsing expres‐
313 sion. It matches any Unicode word character. This
314 is any alphanumeric character (see alnum), and any
315 connector punctuation characters (e.g. under‐
316 score). This is a custom extension of PEs based on
317 Tcl's builtin command string is.
318
319 [15] The string xdigit is an atomic parsing expression.
320 It matches any hexadecimal digit character. This
321 is a custom extension of PEs based on Tcl's
322 builtin command string is.
323
324 [16] The string ddigit is an atomic parsing expression.
325 It matches any decimal digit character. This is a
326 custom extension of PEs based on Tcl's builtin
327 command regexp.
328
329 [17] The expression [list t x] is an atomic parsing
330 expression. It matches the terminal string x.
331
332 [18] The expression [list n A] is an atomic parsing
333 expression. It matches the nonterminal A.
334
335 Combined Parsing Expressions
336
337 [1] For parsing expressions e1, e2, ... the result of
338 [list / e1 e2 ... ] is a parsing expression as
339 well. This is the ordered choice, aka prioritized
340 choice.
341
342 [2] For parsing expressions e1, e2, ... the result of
343 [list x e1 e2 ... ] is a parsing expression as
344 well. This is the sequence.
345
346 [3] For a parsing expression e the result of [list *
347 e] is a parsing expression as well. This is the
348 kleene closure, describing zero or more repeti‐
349 tions.
350
351 [4] For a parsing expression e the result of [list +
352 e] is a parsing expression as well. This is the
353 positive kleene closure, describing one or more
354 repetitions.
355
356 [5] For a parsing expression e the result of [list &
357 e] is a parsing expression as well. This is the
358 and lookahead predicate.
359
360 [6] For a parsing expression e the result of [list !
361 e] is a parsing expression as well. This is the
362 not lookahead predicate.
363
364 [7] For a parsing expression e the result of [list ?
365 e] is a parsing expression as well. This is the
366 optional input.
367
368 Canonical serialization
369 The canonical serialization of a parsing expression has the for‐
370 mat as specified in the previous item, and then additionally
371 satisfies the constraints below, which make it unique among all
372 the possible serializations of this parsing expression.
373
374 [1] The string representation of the value is the canonical
375 representation of a pure Tcl list. I.e. it does not con‐
376 tain superfluous whitespace.
377
378 [2] Terminals are not encoded as ranges (where start and end
379 of the range are identical).
380
381 EXAMPLE
382 Assuming the parsing expression shown on the right-hand side of the
383 rule
384
385 Expression <- Term (AddOp Term)*
386
387
388 then its canonical serialization (except for whitespace) is
389
390 {x {n Term} {* {x {n AddOp} {n Term}}}}
391
392
394 This document, and the package it describes, will undoubtedly contain
395 bugs and other problems. Please report such in the category pt of the
396 Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist]. Please also
397 report any ideas for enhancements you may have for either package
398 and/or documentation.
399
400 When proposing code changes, please provide unified diffs, i.e the out‐
401 put of diff -u.
402
403 Note further that attachments are strongly preferred over inlined
404 patches. Attachments can be made by going to the Edit form of the
405 ticket immediately after its creation, and then using the left-most
406 button in the secondary navigation bar.
407
409 EBNF, LL(k), PEG, TDPL, context-free languages, expression, grammar,
410 matching, parser, parsing expression, parsing expression grammar, push
411 down automaton, recursive descent, state, top-down parsing languages,
412 transducer
413
415 Parsing and Grammars
416
418 Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
419
420
421
422
423tcllib 1.0.1 pt::peg::interp(n)