1pt_parser_api(i) Parser Tools pt_parser_api(i)
2
3
4
5______________________________________________________________________________
6
8 pt_parser_api - Parser API
9
11 package require Tcl 8.5
12
13 className ?objectName?
14
15 objectName destroy
16
17 objectName parse chan
18
19 objectName parset text
20
21______________________________________________________________________________
22
24 Are you lost ? Do you have trouble understanding this document ? In
25 that case please read the overview provided by the Introduction to
26 Parser Tools. This document is the entrypoint to the whole system the
27 current package is a part of.
28
29 This document describes the API shared by the grammar interpreter pro‐
30 vided by the package pt::peg::interp and the parsers generated by the
31 pt application for the result formats critcl, snit, and oo regarding
32 access to the actual parsing functionality.
33
34 Its intended audience are people who wish to create a parser for some
35 language of theirs and then use that parser within a Tcl-based package
36 or application.
37
38 It resides in the User Layer of Parser Tools.
39
40 IMAGE: arch_user_pkg
41
43 className ?objectName?
44 The class command constructs parser instances, i.e. objects. The
45 result of the command is the fully-qualified name of the in‐
46 stance command.
47
48 If no objectName is specified the class will generate and use an
49 automatic name. If the objectName was specified, but is not
50 fully qualified the command will be created in the current name‐
51 space.
52
54 All parser instances provide at least the methods shown below:
55
56 objectName destroy
57 This method destroys the parser instance, releasing all claimed
58 memory and other resources, and deleting the instance command.
59
60 The result of the command is the empty string.
61
62 objectName parse chan
63 This method runs the parser using the contents of chan as input
64 (starting at the current location in the channel), until parsing
65 is not possible anymore, either because parsing has completed,
66 or run into a syntax error.
67
68 Note here that the Parser Tools are based on Tcl 8.5+. In other
69 words, the channel argument is not restricted to files, sockets,
70 etc. We have the full power of reflected channels available.
71
72 It should also be noted that the parser pulls the characters
73 from the input stream as it needs them. If a parser created by
74 this package has to be operated in a push aka event-driven man‐
75 ner it will be necessary to go to Tcl 8.6+ and use the corou‐
76 tine::auto to wrap it into a coroutine where read is properly
77 changed for push-operation.
78
79 Upon successful completion the command returns an abstract syn‐
80 tax tree as its result. This AST is in the form specified in
81 section AST serialization format. As a plain nested Tcl-list it
82 can then be processed with any Tcl commands the user likes, do‐
83 ing transformations, semantic checks, etc. To help in this the
84 package pt::ast provides a set of convenience commands for vali‐
85 dation of the tree's basic structure, printing it for debugging,
86 and walking it either from the bottom up, or top down.
87
88 When encountering a syntax error the command will throw an error
89 instead. This error will be a 4-element Tcl-list, containing,
90 in the order listed below:
91
92 [1] The string pt::rde identifying it as parser runtime er‐
93 ror.
94
95 [2] The location of the parse error, as character offset from
96 the beginning of the parsed input.
97
98 [3] The location of parse error, now as a 2-element list con‐
99 taining line-number and column in the line.
100
101 [4] A set of atomic parsing expressions indicating encoding
102 the characters and/or nonterminal symbols the parser ex‐
103 pected to see at the location of the parse error, but did
104 not get. For the specification of atomic parsing expres‐
105 sions please see the section PE serialization format.
106
107 objectName parset text
108 This method runs the parser using the string in text as input.
109 In all other ways it behaves like the method parse, shown above.
110
112 A generated parser is used like this
113
114
115 package require the-parser-package ;# Generated by result-formats 'critcl', 'snit' or 'oo' of 'pt'.
116 set parser [the-parser-class]
117
118 set ast [$parser parse $channel]
119 ... process the abstract syntax tree ...
120
121 When using a grammar interpreter for parsing some differences creep in
122
123
124 package require the-grammar-package ;# Generated by result-format 'container' of 'pt'.
125 set grammar [the-grammar-class]
126
127 package require pt::peg::interp
128 set parser [pt::peg::interp]
129
130 $parser use $grammar
131
132 set ast [$parser parse $channel]
133 $parser destroy
134
135 ... process the abstract syntax tree ...
136
137
139 Here we specify the format used by the Parser Tools to serialize Ab‐
140 stract Syntax Trees (ASTs) as immutable values for transport, compari‐
141 son, etc.
142
143 Each node in an AST represents a nonterminal symbol of a grammar, and
144 the range of tokens/characters in the input covered by it. ASTs do not
145 contain terminal symbols, i.e. tokens/characters. These can be recov‐
146 ered from the input given a symbol's location.
147
148 We distinguish between regular and canonical serializations. While a
149 tree may have more than one regular serialization only exactly one of
150 them will be canonical.
151
152 Regular serialization
153
154 [1] The serialization of any AST is the serialization of its
155 root node.
156
157 [2] The serialization of any node is a Tcl list containing at
158 least three elements.
159
160 [1] The first element is the name of the nonterminal
161 symbol stored in the node.
162
163 [2] The second and third element are the locations of
164 the first and last token in the token stream the
165 node represents (covers).
166
167 [1] Locations are provided as non-negative in‐
168 teger offsets from the beginning of the to‐
169 ken stream, with the first token found in
170 the stream located at offset 0 (zero).
171
172 [2] The end location has to be equal to or
173 larger than the start location.
174
175 [3] All elements after the first three represent the
176 children of the node, which are themselves nodes.
177 This means that the serializations of nodes with‐
178 out children, i.e. leaf nodes, have exactly three
179 elements. The children are stored in the list
180 with the leftmost child first, and the rightmost
181 child last.
182
183 Canonical serialization
184 The canonical serialization of an abstract syntax tree has the
185 format as specified in the previous item, and then additionally
186 satisfies the constraints below, which make it unique among all
187 the possible serializations of this tree.
188
189 [1] The string representation of the value is the canonical
190 representation of a pure Tcl list. I.e. it does not con‐
191 tain superfluous whitespace.
192
193 EXAMPLE
194 Assuming the parsing expression grammar below
195
196 PEG calculator (Expression)
197 Digit <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9' ;
198 Sign <- '-' / '+' ;
199 Number <- Sign? Digit+ ;
200 Expression <- Term (AddOp Term)* ;
201 MulOp <- '*' / '/' ;
202 Term <- Factor (MulOp Factor)* ;
203 AddOp <- '+'/'-' ;
204 Factor <- '(' Expression ')' / Number ;
205 END;
206
207
208 and the input string
209
210 120+5
211 then a parser should deliver the abstract syntax tree below (except for
212 whitespace)
213
214 set ast {Expression 0 4
215 {Factor 0 4
216 {Term 0 2
217 {Number 0 2
218 {Digit 0 0}
219 {Digit 1 1}
220 {Digit 2 2}
221 }
222 }
223 {AddOp 3 3}
224 {Term 4 4
225 {Number 4 4
226 {Digit 4 4}
227 }
228 }
229 }
230 }
231
232
233 Or, more graphical
234
235 .nf +- Digit 0 0 | 1 | | +- Term 0 2 --- Number 0 2 -+-
236 Digit 1 1 | 2 | | | |
237 +- Digit 2 2 | 0 | | Expression
238 0 4 --- Factor 0 4 -+----------------------------- AddOp 3 3 | + |
239 | +- Term 4 4 --- Number 4 4 --- Digit 4 4 | 5 .fi
240
242 Here we specify the format used by the Parser Tools to serialize Pars‐
243 ing Expressions as immutable values for transport, comparison, etc.
244
245 We distinguish between regular and canonical serializations. While a
246 parsing expression may have more than one regular serialization only
247 exactly one of them will be canonical.
248
249 Regular serialization
250
251 Atomic Parsing Expressions
252
253 [1] The string epsilon is an atomic parsing expres‐
254 sion. It matches the empty string.
255
256 [2] The string dot is an atomic parsing expression. It
257 matches any character.
258
259 [3] The string alnum is an atomic parsing expression.
260 It matches any Unicode alphabet or digit charac‐
261 ter. This is a custom extension of PEs based on
262 Tcl's builtin command string is.
263
264 [4] The string alpha is an atomic parsing expression.
265 It matches any Unicode alphabet character. This is
266 a custom extension of PEs based on Tcl's builtin
267 command string is.
268
269 [5] The string ascii is an atomic parsing expression.
270 It matches any Unicode character below U0080. This
271 is a custom extension of PEs based on Tcl's
272 builtin command string is.
273
274 [6] The string control is an atomic parsing expres‐
275 sion. It matches any Unicode control character.
276 This is a custom extension of PEs based on Tcl's
277 builtin command string is.
278
279 [7] The string digit is an atomic parsing expression.
280 It matches any Unicode digit character. Note that
281 this includes characters outside of the [0..9]
282 range. This is a custom extension of PEs based on
283 Tcl's builtin command string is.
284
285 [8] The string graph is an atomic parsing expression.
286 It matches any Unicode printing character, except
287 for space. This is a custom extension of PEs based
288 on Tcl's builtin command string is.
289
290 [9] The string lower is an atomic parsing expression.
291 It matches any Unicode lower-case alphabet charac‐
292 ter. This is a custom extension of PEs based on
293 Tcl's builtin command string is.
294
295 [10] The string print is an atomic parsing expression.
296 It matches any Unicode printing character, includ‐
297 ing space. This is a custom extension of PEs based
298 on Tcl's builtin command string is.
299
300 [11] The string punct is an atomic parsing expression.
301 It matches any Unicode punctuation character. This
302 is a custom extension of PEs based on Tcl's
303 builtin command string is.
304
305 [12] The string space is an atomic parsing expression.
306 It matches any Unicode space character. This is a
307 custom extension of PEs based on Tcl's builtin
308 command string is.
309
310 [13] The string upper is an atomic parsing expression.
311 It matches any Unicode upper-case alphabet charac‐
312 ter. This is a custom extension of PEs based on
313 Tcl's builtin command string is.
314
315 [14] The string wordchar is an atomic parsing expres‐
316 sion. It matches any Unicode word character. This
317 is any alphanumeric character (see alnum), and any
318 connector punctuation characters (e.g. under‐
319 score). This is a custom extension of PEs based on
320 Tcl's builtin command string is.
321
322 [15] The string xdigit is an atomic parsing expression.
323 It matches any hexadecimal digit character. This
324 is a custom extension of PEs based on Tcl's
325 builtin command string is.
326
327 [16] The string ddigit is an atomic parsing expression.
328 It matches any decimal digit character. This is a
329 custom extension of PEs based on Tcl's builtin
330 command regexp.
331
332 [17] The expression [list t x] is an atomic parsing ex‐
333 pression. It matches the terminal string x.
334
335 [18] The expression [list n A] is an atomic parsing ex‐
336 pression. It matches the nonterminal A.
337
338 Combined Parsing Expressions
339
340 [1] For parsing expressions e1, e2, ... the result of
341 [list / e1 e2 ... ] is a parsing expression as
342 well. This is the ordered choice, aka prioritized
343 choice.
344
345 [2] For parsing expressions e1, e2, ... the result of
346 [list x e1 e2 ... ] is a parsing expression as
347 well. This is the sequence.
348
349 [3] For a parsing expression e the result of [list *
350 e] is a parsing expression as well. This is the
351 kleene closure, describing zero or more repeti‐
352 tions.
353
354 [4] For a parsing expression e the result of [list +
355 e] is a parsing expression as well. This is the
356 positive kleene closure, describing one or more
357 repetitions.
358
359 [5] For a parsing expression e the result of [list &
360 e] is a parsing expression as well. This is the
361 and lookahead predicate.
362
363 [6] For a parsing expression e the result of [list !
364 e] is a parsing expression as well. This is the
365 not lookahead predicate.
366
367 [7] For a parsing expression e the result of [list ?
368 e] is a parsing expression as well. This is the
369 optional input.
370
371 Canonical serialization
372 The canonical serialization of a parsing expression has the for‐
373 mat as specified in the previous item, and then additionally
374 satisfies the constraints below, which make it unique among all
375 the possible serializations of this parsing expression.
376
377 [1] The string representation of the value is the canonical
378 representation of a pure Tcl list. I.e. it does not con‐
379 tain superfluous whitespace.
380
381 [2] Terminals are not encoded as ranges (where start and end
382 of the range are identical).
383
384 EXAMPLE
385 Assuming the parsing expression shown on the right-hand side of the
386 rule
387
388 Expression <- Term (AddOp Term)*
389
390
391 then its canonical serialization (except for whitespace) is
392
393 {x {n Term} {* {x {n AddOp} {n Term}}}}
394
395
397 This document, and the package it describes, will undoubtedly contain
398 bugs and other problems. Please report such in the category pt of the
399 Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist]. Please also
400 report any ideas for enhancements you may have for either package
401 and/or documentation.
402
403 When proposing code changes, please provide unified diffs, i.e the out‐
404 put of diff -u.
405
406 Note further that attachments are strongly preferred over inlined
407 patches. Attachments can be made by going to the Edit form of the
408 ticket immediately after its creation, and then using the left-most
409 button in the secondary navigation bar.
410
412 EBNF, LL(k), PEG, TDPL, context-free languages, expression, grammar,
413 matching, parser, parsing expression, parsing expression grammar, push
414 down automaton, recursive descent, state, top-down parsing languages,
415 transducer
416
418 Parsing and Grammars
419
421 Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
422
423
424
425
426tcllib 1 pt_parser_api(i)