1pt::peg::interp(n) Parser Tools pt::peg::interp(n)
2
3
4
5______________________________________________________________________________
6
8 pt::peg::interp - Interpreter for parsing expression grammars
9
11 package require Tcl 8.5
12
13 package require pt::peg::interp ?1.0.1?
14
15 package require pt::rde ?1?
16
17 package require snit
18
19 ::pt::peg::interp objectName grammar
20
21 objectName use grammar
22
23 objectName destroy
24
25 objectName parse chan
26
27 objectName parset text
28
29______________________________________________________________________________
30
32 Are you lost ? Do you have trouble understanding this document ? In
33 that case please read the overview provided by the Introduction to
34 Parser Tools. This document is the entrypoint to the whole system the
35 current package is a part of.
36
37 This package provides a class whose instances are Packrat parsers con‐
38 figurable with a parsing expression grammar. The grammar is executed
39 directly, i.e. interpreted, with the underlying runtime provided by the
40 package pt::rde, basing everything on the PARAM.
41
42 Like the supporting runtime this package resides in the Execution sec‐
43 tion of the Core Layer of Parser Tools.
44
45 IMAGE: arch_core_transform
46
47 The interpreted grammar is copied from an instance of pt::peg::con‐
48 tainer, or anything providing the same API, like the container classes
49 created by pt::peg::to::container or the associated export plugin
50 pt::peg::export::container.
51
52 CLASS API
53 The package exports the API described here.
54
55 ::pt::peg::interp objectName grammar
56 The command creates a new parser object and returns the fully
57 qualified name of the object command as its result. The API of
58 this object command is described in the section Object API. It
59 may be used to invoke various operations on the object.
60
61 This new parser is configured for the execution of an empty PEG.
62 To configure the object for any other PEG use the method use of
63 the Object API.
64
65 OBJECT API
66 All objects created by this package provide the following methods.
67
68 objectName use grammar
69 This method configures the grammar interpreter / parser for the
70 execution of the PEG stored in grammar, an object which is API-
71 compatible to instances of pt::peg::container. The parser copies
72 the relevant information of the grammar, and does not take own‐
73 ership of the object.
74
75 The information of any previously used grammar is overwritten.
76
77 The result of the method the empty string.
78
79 objectName destroy
80 This method destroys the parser instance, releasing all claimed
81 memory and other resources, and deleting the instance command.
82
83 The result of the command is the empty string.
84
85 objectName parse chan
86 This method runs the parser using the contents of chan as input
87 (starting at the current location in the channel), until parsing
88 is not possible anymore, either because parsing has completed,
89 or run into a syntax error.
90
91 Note here that the Parser Tools are based on Tcl 8.5+. In other
92 words, the channel argument is not restricted to files, sockets,
93 etc. We have the full power of reflected channels available.
94
95 It should also be noted that the parser pulls the characters
96 from the input stream as it needs them. If a parser created by
97 this package has to be operated in a push aka event-driven man‐
98 ner it will be necessary to go to Tcl 8.6+ and use the corou‐
99 tine::auto to wrap it into a coroutine where read is properly
100 changed for push-operation.
101
102 Upon successful completion the command returns an abstract syn‐
103 tax tree as its result. This AST is in the form specified in
104 section AST serialization format. As a plain nested Tcl-list it
105 can then be processed with any Tcl commands the user likes, do‐
106 ing transformations, semantic checks, etc. To help in this the
107 package pt::ast provides a set of convenience commands for vali‐
108 dation of the tree's basic structure, printing it for debugging,
109 and walking it either from the bottom up, or top down.
110
111 When encountering a syntax error the command will throw an error
112 instead. This error will be a 4-element Tcl-list, containing,
113 in the order listed below:
114
115 [1] The string pt::rde identifying it as parser runtime er‐
116 ror.
117
118 [2] The location of the parse error, as character offset from
119 the beginning of the parsed input.
120
121 [3] The location of parse error, now as a 2-element list con‐
122 taining line-number and column in the line.
123
124 [4] A set of atomic parsing expressions indicating encoding
125 the characters and/or nonterminal symbols the parser ex‐
126 pected to see at the location of the parse error, but did
127 not get. For the specification of atomic parsing expres‐
128 sions please see the section PE serialization format.
129
130 objectName parset text
131 This method runs the parser using the string in text as input.
132 In all other ways it behaves like the method parse, shown above.
133
135 Here we specify the format used by the Parser Tools to serialize Ab‐
136 stract Syntax Trees (ASTs) as immutable values for transport, compari‐
137 son, etc.
138
139 Each node in an AST represents a nonterminal symbol of a grammar, and
140 the range of tokens/characters in the input covered by it. ASTs do not
141 contain terminal symbols, i.e. tokens/characters. These can be recov‐
142 ered from the input given a symbol's location.
143
144 We distinguish between regular and canonical serializations. While a
145 tree may have more than one regular serialization only exactly one of
146 them will be canonical.
147
148 Regular serialization
149
150 [1] The serialization of any AST is the serialization of its
151 root node.
152
153 [2] The serialization of any node is a Tcl list containing at
154 least three elements.
155
156 [1] The first element is the name of the nonterminal
157 symbol stored in the node.
158
159 [2] The second and third element are the locations of
160 the first and last token in the token stream the
161 node represents (covers).
162
163 [1] Locations are provided as non-negative in‐
164 teger offsets from the beginning of the to‐
165 ken stream, with the first token found in
166 the stream located at offset 0 (zero).
167
168 [2] The end location has to be equal to or
169 larger than the start location.
170
171 [3] All elements after the first three represent the
172 children of the node, which are themselves nodes.
173 This means that the serializations of nodes with‐
174 out children, i.e. leaf nodes, have exactly three
175 elements. The children are stored in the list
176 with the leftmost child first, and the rightmost
177 child last.
178
179 Canonical serialization
180 The canonical serialization of an abstract syntax tree has the
181 format as specified in the previous item, and then additionally
182 satisfies the constraints below, which make it unique among all
183 the possible serializations of this tree.
184
185 [1] The string representation of the value is the canonical
186 representation of a pure Tcl list. I.e. it does not con‐
187 tain superfluous whitespace.
188
189 EXAMPLE
190 Assuming the parsing expression grammar below
191
192 PEG calculator (Expression)
193 Digit <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9' ;
194 Sign <- '-' / '+' ;
195 Number <- Sign? Digit+ ;
196 Expression <- Term (AddOp Term)* ;
197 MulOp <- '*' / '/' ;
198 Term <- Factor (MulOp Factor)* ;
199 AddOp <- '+'/'-' ;
200 Factor <- '(' Expression ')' / Number ;
201 END;
202
203
204 and the input string
205
206 120+5
207 then a parser should deliver the abstract syntax tree below (except for
208 whitespace)
209
210 set ast {Expression 0 4
211 {Factor 0 4
212 {Term 0 2
213 {Number 0 2
214 {Digit 0 0}
215 {Digit 1 1}
216 {Digit 2 2}
217 }
218 }
219 {AddOp 3 3}
220 {Term 4 4
221 {Number 4 4
222 {Digit 4 4}
223 }
224 }
225 }
226 }
227
228
229 Or, more graphical
230
231 .nf +- Digit 0 0 | 1 | | +- Term 0 2 --- Number 0 2 -+-
232 Digit 1 1 | 2 | | | |
233 +- Digit 2 2 | 0 | | Expression
234 0 4 --- Factor 0 4 -+----------------------------- AddOp 3 3 | + |
235 | +- Term 4 4 --- Number 4 4 --- Digit 4 4 | 5 .fi
236
238 Here we specify the format used by the Parser Tools to serialize Pars‐
239 ing Expressions as immutable values for transport, comparison, etc.
240
241 We distinguish between regular and canonical serializations. While a
242 parsing expression may have more than one regular serialization only
243 exactly one of them will be canonical.
244
245 Regular serialization
246
247 Atomic Parsing Expressions
248
249 [1] The string epsilon is an atomic parsing expres‐
250 sion. It matches the empty string.
251
252 [2] The string dot is an atomic parsing expression. It
253 matches any character.
254
255 [3] The string alnum is an atomic parsing expression.
256 It matches any Unicode alphabet or digit charac‐
257 ter. This is a custom extension of PEs based on
258 Tcl's builtin command string is.
259
260 [4] The string alpha is an atomic parsing expression.
261 It matches any Unicode alphabet character. This is
262 a custom extension of PEs based on Tcl's builtin
263 command string is.
264
265 [5] The string ascii is an atomic parsing expression.
266 It matches any Unicode character below U0080. This
267 is a custom extension of PEs based on Tcl's
268 builtin command string is.
269
270 [6] The string control is an atomic parsing expres‐
271 sion. It matches any Unicode control character.
272 This is a custom extension of PEs based on Tcl's
273 builtin command string is.
274
275 [7] The string digit is an atomic parsing expression.
276 It matches any Unicode digit character. Note that
277 this includes characters outside of the [0..9]
278 range. This is a custom extension of PEs based on
279 Tcl's builtin command string is.
280
281 [8] The string graph is an atomic parsing expression.
282 It matches any Unicode printing character, except
283 for space. This is a custom extension of PEs based
284 on Tcl's builtin command string is.
285
286 [9] The string lower is an atomic parsing expression.
287 It matches any Unicode lower-case alphabet charac‐
288 ter. This is a custom extension of PEs based on
289 Tcl's builtin command string is.
290
291 [10] The string print is an atomic parsing expression.
292 It matches any Unicode printing character, includ‐
293 ing space. This is a custom extension of PEs based
294 on Tcl's builtin command string is.
295
296 [11] The string punct is an atomic parsing expression.
297 It matches any Unicode punctuation character. This
298 is a custom extension of PEs based on Tcl's
299 builtin command string is.
300
301 [12] The string space is an atomic parsing expression.
302 It matches any Unicode space character. This is a
303 custom extension of PEs based on Tcl's builtin
304 command string is.
305
306 [13] The string upper is an atomic parsing expression.
307 It matches any Unicode upper-case alphabet charac‐
308 ter. This is a custom extension of PEs based on
309 Tcl's builtin command string is.
310
311 [14] The string wordchar is an atomic parsing expres‐
312 sion. It matches any Unicode word character. This
313 is any alphanumeric character (see alnum), and any
314 connector punctuation characters (e.g. under‐
315 score). This is a custom extension of PEs based on
316 Tcl's builtin command string is.
317
318 [15] The string xdigit is an atomic parsing expression.
319 It matches any hexadecimal digit character. This
320 is a custom extension of PEs based on Tcl's
321 builtin command string is.
322
323 [16] The string ddigit is an atomic parsing expression.
324 It matches any decimal digit character. This is a
325 custom extension of PEs based on Tcl's builtin
326 command regexp.
327
328 [17] The expression [list t x] is an atomic parsing ex‐
329 pression. It matches the terminal string x.
330
331 [18] The expression [list n A] is an atomic parsing ex‐
332 pression. It matches the nonterminal A.
333
334 Combined Parsing Expressions
335
336 [1] For parsing expressions e1, e2, ... the result of
337 [list / e1 e2 ... ] is a parsing expression as
338 well. This is the ordered choice, aka prioritized
339 choice.
340
341 [2] For parsing expressions e1, e2, ... the result of
342 [list x e1 e2 ... ] is a parsing expression as
343 well. This is the sequence.
344
345 [3] For a parsing expression e the result of [list *
346 e] is a parsing expression as well. This is the
347 kleene closure, describing zero or more repeti‐
348 tions.
349
350 [4] For a parsing expression e the result of [list +
351 e] is a parsing expression as well. This is the
352 positive kleene closure, describing one or more
353 repetitions.
354
355 [5] For a parsing expression e the result of [list &
356 e] is a parsing expression as well. This is the
357 and lookahead predicate.
358
359 [6] For a parsing expression e the result of [list !
360 e] is a parsing expression as well. This is the
361 not lookahead predicate.
362
363 [7] For a parsing expression e the result of [list ?
364 e] is a parsing expression as well. This is the
365 optional input.
366
367 Canonical serialization
368 The canonical serialization of a parsing expression has the for‐
369 mat as specified in the previous item, and then additionally
370 satisfies the constraints below, which make it unique among all
371 the possible serializations of this parsing expression.
372
373 [1] The string representation of the value is the canonical
374 representation of a pure Tcl list. I.e. it does not con‐
375 tain superfluous whitespace.
376
377 [2] Terminals are not encoded as ranges (where start and end
378 of the range are identical).
379
380 EXAMPLE
381 Assuming the parsing expression shown on the right-hand side of the
382 rule
383
384 Expression <- Term (AddOp Term)*
385
386
387 then its canonical serialization (except for whitespace) is
388
389 {x {n Term} {* {x {n AddOp} {n Term}}}}
390
391
393 This document, and the package it describes, will undoubtedly contain
394 bugs and other problems. Please report such in the category pt of the
395 Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist]. Please also
396 report any ideas for enhancements you may have for either package
397 and/or documentation.
398
399 When proposing code changes, please provide unified diffs, i.e the out‐
400 put of diff -u.
401
402 Note further that attachments are strongly preferred over inlined
403 patches. Attachments can be made by going to the Edit form of the
404 ticket immediately after its creation, and then using the left-most
405 button in the secondary navigation bar.
406
408 EBNF, LL(k), PEG, TDPL, context-free languages, expression, grammar,
409 matching, parser, parsing expression, parsing expression grammar, push
410 down automaton, recursive descent, state, top-down parsing languages,
411 transducer
412
414 Parsing and Grammars
415
417 Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
418
419
420
421
422tcllib 1.0.1 pt::peg::interp(n)