1pt_parser_api(i) Parser Tools pt_parser_api(i)
2
3
4
5______________________________________________________________________________
6
8 pt_parser_api - Parser API
9
11 package require Tcl 8.5
12
13 className ?objectName?
14
15 objectName destroy
16
17 objectName parse chan
18
19 objectName parset text
20
21______________________________________________________________________________
22
24 Are you lost ? Do you have trouble understanding this document ? In
25 that case please read the overview provided by the Introduction to
26 Parser Tools. This document is the entrypoint to the whole system the
27 current package is a part of.
28
29 This document describes the API shared by the grammar interpreter pro‐
30 vided by the package pt::peg::interp and the parsers generated by the
31 pt application for the result formats critcl, snit, and oo regarding
32 access to the actual parsing functionality.
33
34 Its intended audience are people who wish to create a parser for some
35 language of theirs and then use that parser within a Tcl-based package
36 or application.
37
38 It resides in the User Layer of Parser Tools.
39
40 IMAGE: arch_user_pkg
41
43 className ?objectName?
44 The class command constructs parser instances, i.e. objects. The
45 result of the command is the fully-qualified name of the
46 instance command.
47
48 If no objectName is specified the class will generate and use an
49 automatic name. If the objectName was specified, but is not
50 fully qualified the command will be created in the current
51 namespace.
52
54 All parser instances provide at least the methods shown below:
55
56 objectName destroy
57 This method destroys the parser instance, releasing all claimed
58 memory and other resources, and deleting the instance command.
59
60 The result of the command is the empty string.
61
62 objectName parse chan
63 This method runs the parser using the contents of chan as input
64 (starting at the current location in the channel), until parsing
65 is not possible anymore, either because parsing has completed,
66 or run into a syntax error.
67
68 Note here that the Parser Tools are based on Tcl 8.5+. In other
69 words, the channel argument is not restricted to files, sockets,
70 etc. We have the full power of reflected channels available.
71
72 It should also be noted that the parser pulls the characters
73 from the input stream as it needs them. If a parser created by
74 this package has to be operated in a push aka event-driven man‐
75 ner it will be necessary to go to Tcl 8.6+ and use the corou‐
76 tine::auto to wrap it into a coroutine where read is properly
77 changed for push-operation.
78
79 Upon successful completion the command returns an abstract syn‐
80 tax tree as its result. This AST is in the form specified in
81 section AST serialization format. As a plain nested Tcl-list it
82 can then be processed with any Tcl commands the user likes,
83 doing transformations, semantic checks, etc. To help in this
84 the package pt::ast provides a set of convenience commands for
85 validation of the tree's basic structure, printing it for debug‐
86 ging, and walking it either from the bottom up, or top down.
87
88 When encountering a syntax error the command will throw an error
89 instead. This error will be a 4-element Tcl-list, containing,
90 in the order listed below:
91
92 [1] The string pt::rde identifying it as parser runtime
93 error.
94
95 [2] The location of the parse error, as character offset from
96 the beginning of the parsed input.
97
98 [3] The location of parse error, now as a 2-element list con‐
99 taining line-number and column in the line.
100
101 [4] A set of atomic parsing expressions indicating encoding
102 the characters and/or nonterminal symbols the parser
103 expected to see at the location of the parse error, but
104 did not get. For the specification of atomic parsing
105 expressions please see the section PE serialization for‐
106 mat.
107
108 objectName parset text
109 This method runs the parser using the string in text as input.
110 In all other ways it behaves like the method parse, shown above.
111
113 A generated parser is used like this
114
115
116 package require the-parser-package ;# Generated by result-formats 'critcl', 'snit' or 'oo' of 'pt'.
117 set parser [the-parser-class]
118
119 set ast [$parser parse $channel]
120 ... process the abstract syntax tree ...
121
122 When using a grammar interpreter for parsing some differences creep in
123
124
125 package require the-grammar-package ;# Generated by result-format 'container' of 'pt'.
126 set grammar [the-grammar-class]
127
128 package require pt::peg::interp
129 set parser [pt::peg::interp]
130
131 $parser use $grammar
132
133 set ast [$parser parse $channel]
134 $parser destroy
135
136 ... process the abstract syntax tree ...
137
138
140 Here we specify the format used by the Parser Tools to serialize
141 Abstract Syntax Trees (ASTs) as immutable values for transport, compar‐
142 ison, etc.
143
144 Each node in an AST represents a nonterminal symbol of a grammar, and
145 the range of tokens/characters in the input covered by it. ASTs do not
146 contain terminal symbols, i.e. tokens/characters. These can be recov‐
147 ered from the input given a symbol's location.
148
149 We distinguish between regular and canonical serializations. While a
150 tree may have more than one regular serialization only exactly one of
151 them will be canonical.
152
153 Regular serialization
154
155 [1] The serialization of any AST is the serialization of its
156 root node.
157
158 [2] The serialization of any node is a Tcl list containing at
159 least three elements.
160
161 [1] The first element is the name of the nonterminal
162 symbol stored in the node.
163
164 [2] The second and third element are the locations of
165 the first and last token in the token stream the
166 node represents (covers).
167
168 [1] Locations are provided as non-negative
169 integer offsets from the beginning of the
170 token stream, with the first token found in
171 the stream located at offset 0 (zero).
172
173 [2] The end location has to be equal to or
174 larger than the start location.
175
176 [3] All elements after the first three represent the
177 children of the node, which are themselves nodes.
178 This means that the serializations of nodes with‐
179 out children, i.e. leaf nodes, have exactly three
180 elements. The children are stored in the list
181 with the leftmost child first, and the rightmost
182 child last.
183
184 Canonical serialization
185 The canonical serialization of an abstract syntax tree has the
186 format as specified in the previous item, and then additionally
187 satisfies the constraints below, which make it unique among all
188 the possible serializations of this tree.
189
190 [1] The string representation of the value is the canonical
191 representation of a pure Tcl list. I.e. it does not con‐
192 tain superfluous whitespace.
193
194 EXAMPLE
195 Assuming the parsing expression grammar below
196
197 PEG calculator (Expression)
198 Digit <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9' ;
199 Sign <- '-' / '+' ;
200 Number <- Sign? Digit+ ;
201 Expression <- Term (AddOp Term)* ;
202 MulOp <- '*' / '/' ;
203 Term <- Factor (MulOp Factor)* ;
204 AddOp <- '+'/'-' ;
205 Factor <- '(' Expression ')' / Number ;
206 END;
207
208
209 and the input string
210
211 120+5
212 then a parser should deliver the abstract syntax tree below (except for
213 whitespace)
214
215 set ast {Expression 0 4
216 {Factor 0 4
217 {Term 0 2
218 {Number 0 2
219 {Digit 0 0}
220 {Digit 1 1}
221 {Digit 2 2}
222 }
223 }
224 {AddOp 3 3}
225 {Term 4 4
226 {Number 4 4
227 {Digit 4 4}
228 }
229 }
230 }
231 }
232
233
234 Or, more graphical
235
236 .nf +- Digit 0 0 | 1 | | +- Term 0 2 --- Number 0 2 -+-
237 Digit 1 1 | 2 | | | |
238 +- Digit 2 2 | 0 | | Expression
239 0 4 --- Factor 0 4 -+----------------------------- AddOp 3 3 | + |
240 | +- Term 4 4 --- Number 4 4 --- Digit 4 4 | 5 .fi
241
243 Here we specify the format used by the Parser Tools to serialize Pars‐
244 ing Expressions as immutable values for transport, comparison, etc.
245
246 We distinguish between regular and canonical serializations. While a
247 parsing expression may have more than one regular serialization only
248 exactly one of them will be canonical.
249
250 Regular serialization
251
252 Atomic Parsing Expressions
253
254 [1] The string epsilon is an atomic parsing expres‐
255 sion. It matches the empty string.
256
257 [2] The string dot is an atomic parsing expression. It
258 matches any character.
259
260 [3] The string alnum is an atomic parsing expression.
261 It matches any Unicode alphabet or digit charac‐
262 ter. This is a custom extension of PEs based on
263 Tcl's builtin command string is.
264
265 [4] The string alpha is an atomic parsing expression.
266 It matches any Unicode alphabet character. This is
267 a custom extension of PEs based on Tcl's builtin
268 command string is.
269
270 [5] The string ascii is an atomic parsing expression.
271 It matches any Unicode character below U0080. This
272 is a custom extension of PEs based on Tcl's
273 builtin command string is.
274
275 [6] The string control is an atomic parsing expres‐
276 sion. It matches any Unicode control character.
277 This is a custom extension of PEs based on Tcl's
278 builtin command string is.
279
280 [7] The string digit is an atomic parsing expression.
281 It matches any Unicode digit character. Note that
282 this includes characters outside of the [0..9]
283 range. This is a custom extension of PEs based on
284 Tcl's builtin command string is.
285
286 [8] The string graph is an atomic parsing expression.
287 It matches any Unicode printing character, except
288 for space. This is a custom extension of PEs based
289 on Tcl's builtin command string is.
290
291 [9] The string lower is an atomic parsing expression.
292 It matches any Unicode lower-case alphabet charac‐
293 ter. This is a custom extension of PEs based on
294 Tcl's builtin command string is.
295
296 [10] The string print is an atomic parsing expression.
297 It matches any Unicode printing character, includ‐
298 ing space. This is a custom extension of PEs based
299 on Tcl's builtin command string is.
300
301 [11] The string punct is an atomic parsing expression.
302 It matches any Unicode punctuation character. This
303 is a custom extension of PEs based on Tcl's
304 builtin command string is.
305
306 [12] The string space is an atomic parsing expression.
307 It matches any Unicode space character. This is a
308 custom extension of PEs based on Tcl's builtin
309 command string is.
310
311 [13] The string upper is an atomic parsing expression.
312 It matches any Unicode upper-case alphabet charac‐
313 ter. This is a custom extension of PEs based on
314 Tcl's builtin command string is.
315
316 [14] The string wordchar is an atomic parsing expres‐
317 sion. It matches any Unicode word character. This
318 is any alphanumeric character (see alnum), and any
319 connector punctuation characters (e.g. under‐
320 score). This is a custom extension of PEs based on
321 Tcl's builtin command string is.
322
323 [15] The string xdigit is an atomic parsing expression.
324 It matches any hexadecimal digit character. This
325 is a custom extension of PEs based on Tcl's
326 builtin command string is.
327
328 [16] The string ddigit is an atomic parsing expression.
329 It matches any decimal digit character. This is a
330 custom extension of PEs based on Tcl's builtin
331 command regexp.
332
333 [17] The expression [list t x] is an atomic parsing
334 expression. It matches the terminal string x.
335
336 [18] The expression [list n A] is an atomic parsing
337 expression. It matches the nonterminal A.
338
339 Combined Parsing Expressions
340
341 [1] For parsing expressions e1, e2, ... the result of
342 [list / e1 e2 ... ] is a parsing expression as
343 well. This is the ordered choice, aka prioritized
344 choice.
345
346 [2] For parsing expressions e1, e2, ... the result of
347 [list x e1 e2 ... ] is a parsing expression as
348 well. This is the sequence.
349
350 [3] For a parsing expression e the result of [list *
351 e] is a parsing expression as well. This is the
352 kleene closure, describing zero or more repeti‐
353 tions.
354
355 [4] For a parsing expression e the result of [list +
356 e] is a parsing expression as well. This is the
357 positive kleene closure, describing one or more
358 repetitions.
359
360 [5] For a parsing expression e the result of [list &
361 e] is a parsing expression as well. This is the
362 and lookahead predicate.
363
364 [6] For a parsing expression e the result of [list !
365 e] is a parsing expression as well. This is the
366 not lookahead predicate.
367
368 [7] For a parsing expression e the result of [list ?
369 e] is a parsing expression as well. This is the
370 optional input.
371
372 Canonical serialization
373 The canonical serialization of a parsing expression has the for‐
374 mat as specified in the previous item, and then additionally
375 satisfies the constraints below, which make it unique among all
376 the possible serializations of this parsing expression.
377
378 [1] The string representation of the value is the canonical
379 representation of a pure Tcl list. I.e. it does not con‐
380 tain superfluous whitespace.
381
382 [2] Terminals are not encoded as ranges (where start and end
383 of the range are identical).
384
385 EXAMPLE
386 Assuming the parsing expression shown on the right-hand side of the
387 rule
388
389 Expression <- Term (AddOp Term)*
390
391
392 then its canonical serialization (except for whitespace) is
393
394 {x {n Term} {* {x {n AddOp} {n Term}}}}
395
396
398 This document, and the package it describes, will undoubtedly contain
399 bugs and other problems. Please report such in the category pt of the
400 Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist]. Please also
401 report any ideas for enhancements you may have for either package
402 and/or documentation.
403
404 When proposing code changes, please provide unified diffs, i.e the out‐
405 put of diff -u.
406
407 Note further that attachments are strongly preferred over inlined
408 patches. Attachments can be made by going to the Edit form of the
409 ticket immediately after its creation, and then using the left-most
410 button in the secondary navigation bar.
411
413 EBNF, LL(k), PEG, TDPL, context-free languages, expression, grammar,
414 matching, parser, parsing expression, parsing expression grammar, push
415 down automaton, recursive descent, state, top-down parsing languages,
416 transducer
417
419 Parsing and Grammars
420
422 Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
423
424
425
426
427tcllib 1 pt_parser_api(i)