1pt::param(n) Parser Tools pt::param(n)
2
3
4
5______________________________________________________________________________
6
8 pt::param - PackRat Machine Specification
9
11 package require Tcl 8.5
12
13______________________________________________________________________________
14
16 Are you lost ? Do you have trouble understanding this document ? In
17 that case please read the overview provided by the Introduction to
18 Parser Tools. This document is the entrypoint to the whole system the
19 current package is a part of.
20
21 Welcome to the PackRat Machine (short: PARAM), a virtual machine geared
22 towards the support of recursive descent parsers, especially packrat
23 parsers. Towards this end it has features like the caching and reuse of
24 partial results, the caching of the encountered input, and the ability
25 to backtrack in both input and AST creation.
26
27 This document specifies the machine in terms of its architectural state
28 and instruction set.
29
31 Any PARAM implementation has to manage at least the following state:
32
33 Input (IN)
34 This is the channel the characters to process are read from.
35
36 This part of the machine's state is used and modified by the
37 instructions defined in the section Input Handling.
38
39 Current Character (CC)
40 The character from the input currently tested against its possi‐
41 ble alternatives.
42
43 This part of the machine's state is used and modified by the
44 instructions defined in the section Character Processing.
45
46 Current Location (CL)
47 The location of the current character in the input, as offset
48 relative to the beginning of the input. Character offsets are
49 counted from 0.
50
51 This part of the machine's state is used and modified by the
52 instructions defined in the sections Character Processing, Loca‐
53 tion Handling, and Nonterminal Execution.
54
55 Location Stack (LS)
56 A stack of locations in the input, saved for possible backtrack‐
57 ing.
58
59 This part of the machine's state is used and modified by the
60 instructions defined in the sections Character Processing, Loca‐
61 tion Handling, and Nonterminal Execution.
62
63 Status (ST)
64 The status of the last attempt of testing the input, indicating
65 either success or failure.
66
67 This part of the machine's state is used and modified by the
68 instructions defined in the sections Status Control, Character
69 Processing, and Nonterminal Execution.
70
71 Semantic Value (SV)
72 The current semantic value, either empty, or a node for AST con‐
73 structed from the input.
74
75 This part of the machine's state is used and modified by the
76 instructions defined in the sections Value Construction, and AST
77 Construction.
78
79 AST Reduction Stack (ARS)
80 The stack of partial ASTs constructed during the processing of
81 nonterminal symbols.
82
83 This part of the machine's state is used and modified by the
84 instructions defined in the sections Value Construction, and AST
85 Construction.
86
87 AST Stack (AS)
88 The stack of reduction stacks, saved for possible backtracking.
89
90 This part of the machine's state is used and modified by the
91 instructions defined in the sections Value Construction, and AST
92 Construction.
93
94 Error Status (ER)
95 The machine's current knowledge of errors. This is either empty,
96 or set to a pair of location in the input and the set of mes‐
97 sages for that location.
98
99 Note that this part of the machine's state can be set even if
100 the last test of the current character was successful. For exam‐
101 ple, the *-operator (matching a sub-expression zero or more
102 times) in a PEG is always successful, even if it encounters a
103 problem further in the input and has to backtrack. Such problems
104 must not be forgotten when continuing the parsing.
105
106 This part of the machine's state is used and modified by the
107 instructions defined in the sections Error Handling, Character
108 Processing, and Nonterminal Execution.
109
110 Error Stack (ES)
111 The stack of error stati, saved for backtracking. This enables
112 the machine to merge current and older error stati when perform‐
113 ing backtracking in choices after an failed match.
114
115 This part of the machine's state is used and modified by the
116 instructions defined in the sections Error Handling, Character
117 Processing, and Nonterminal Execution.
118
119 Nonterminal Cache (NC)
120 A cache of machine states keyed by pairs name of nonterminal
121 symbol and location in the input. Each pair (N, L) is associated
122 with a 4-tuple holding the values to use for CL, ST, SV, and ER
123 after the nonterminal N was parsed starting from the location L.
124 It is a performance aid for backtracking parsers, allowing them
125 to avoid an expensive reparsing of complex nonterminal symbols
126 if they have been encountered before at a given location.
127
128 The key location is where machine started the attempt to match
129 the named nonterminal symbol, and the location in the saved
130 4-tuple is where machine ended up after the attempt completed,
131 independent of the success of the attempt.
132
133 This part of the machine's state is used and modified by the
134 instructions defined in the section Nonterminal Execution.
135
136 Terminal Cache (TC)
137 A cache of characters read from IN, with their location in IN as
138 pair of line and column, keyed by the location in IN, this time
139 as character offset from the beginning of IN. It is a perfor‐
140 mance aid for backtracking parsers, allowing them to avoid a
141 possibly expensive rereading of characters from IN, or even
142 enabling backtracking at, i.e. in the case of IN not randomly
143 seekable.
144
145 This part of the machine's state is used and modified by the
146 instructions defined in the section Input Handling.
147
149 With the machine's architectural state specified it is now possible to
150 specify the instruction set operating on that state and to be imple‐
151 mented by any realization of the PARAM. The 37 instructions are grouped
152 roughly by the state they influence and/or query during their execu‐
153 tion.
154
155 INPUT HANDLING
156 The instructions in this section mainly access IN, pulling the charac‐
157 ters to process into the machine.
158
159 input_next msg
160 This method reads the next character, i.e. the character after
161 CL, from IN. If successful this character becomes CC, CL is
162 advanced by one, ES is cleared, and the operation is recorded as
163 a success in ST.
164
165 The operation may read the character from IN if the next charac‐
166 ter is not yet known to TC. If successful the new character is
167 stored in TC, with its location (line, column), and the opera‐
168 tion otherwise behaves as specified above. Future reads from the
169 same location, possible due to backtracking, will then be satis‐
170 fied from TC instead of IN.
171
172 If, on the other hand, the end of IN was reached, the operation
173 is recorded as failed in ST, CL is left unchanged, and the pair
174 of CL and msg becomes the new ES.
175
176 CHARACTER PROCESSING
177 The instructions in this section mainly access CC, testing it against
178 character classes, ranges, and individual characters.
179
180 test_alnum
181 This instruction implements the special PE operator "alnum",
182 which checks if CC falls into the character class of the same
183 name, or not.
184
185 Success and failure of the test are both recorded directly in
186 ST. Success further clears ES, wheras failure sets the pair of
187 CL and expected input (encoded as a leaf parsing expression) as
188 the new ES and then rewinds CL by one character, preparing the
189 machine for another parse attempt by a possible alternative.
190
191 test_alpha
192 This instruction implements the special PE operator "alpha",
193 which checks if CC falls into the character class of the same
194 name, or not.
195
196 Success and failure of the test are both recorded directly in
197 ST. Success further clears ES, wheras failure sets the pair of
198 CL and expected input (encoded as a leaf parsing expression) as
199 the new ES and then rewinds CL by one character, preparing the
200 machine for another parse attempt by a possible alternative.
201
202 test_ascii
203 This instruction implements the special PE operator "ascii",
204 which checks if CC falls into the character class of the same
205 name, or not.
206
207 Success and failure of the test are both recorded directly in
208 ST. Success further clears ES, wheras failure sets the pair of
209 CL and expected input (encoded as a leaf parsing expression) as
210 the new ES and then rewinds CL by one character, preparing the
211 machine for another parse attempt by a possible alternative.
212
213 test_char char
214 This instruction implements the character matching operator,
215 i.e. it checks if CC is char.
216
217 Success and failure of the test are both recorded directly in
218 ST. Success further clears ES, wheras failure sets the pair of
219 CL and expected input (encoded as a leaf parsing expression) as
220 the new ES and then rewinds CL by one character, preparing the
221 machine for another parse attempt by a possible alternative.
222
223 test_ddigit
224 This instruction implements the special PE operator "ddigit",
225 which checks if CC falls into the character class of the same
226 name, or not.
227
228 Success and failure of the test are both recorded directly in
229 ST. Success further clears ES, wheras failure sets the pair of
230 CL and expected input (encoded as a leaf parsing expression) as
231 the new ES and then rewinds CL by one character, preparing the
232 machine for another parse attempt by a possible alternative.
233
234 test_digit
235 This instruction implements the special PE operator "digit",
236 which checks if CC falls into the character class of the same
237 name, or not.
238
239 Success and failure of the test are both recorded directly in
240 ST. Success further clears ES, wheras failure sets the pair of
241 CL and expected input (encoded as a leaf parsing expression) as
242 the new ES and then rewinds CL by one character, preparing the
243 machine for another parse attempt by a possible alternative.
244
245 test_graph
246 This instruction implements the special PE operator "graph",
247 which checks if CC falls into the character class of the same
248 name, or not.
249
250 Success and failure of the test are both recorded directly in
251 ST. Success further clears ES, wheras failure sets the pair of
252 CL and expected input (encoded as a leaf parsing expression) as
253 the new ES and then rewinds CL by one character, preparing the
254 machine for another parse attempt by a possible alternative.
255
256 test_lower
257 This instruction implements the special PE operator "lower",
258 which checks if CC falls into the character class of the same
259 name, or not.
260
261 Success and failure of the test are both recorded directly in
262 ST. Success further clears ES, wheras failure sets the pair of
263 CL and expected input (encoded as a leaf parsing expression) as
264 the new ES and then rewinds CL by one character, preparing the
265 machine for another parse attempt by a possible alternative.
266
267 test_print
268 This instruction implements the special PE operator "print",
269 which checks if CC falls into the character class of the same
270 name, or not.
271
272 Success and failure of the test are both recorded directly in
273 ST. Success further clears ES, wheras failure sets the pair of
274 CL and expected input (encoded as a leaf parsing expression) as
275 the new ES and then rewinds CL by one character, preparing the
276 machine for another parse attempt by a possible alternative.
277
278 test_punct
279 This instruction implements the special PE operator "punct",
280 which checks if CC falls into the character class of the same
281 name, or not.
282
283 Success and failure of the test are both recorded directly in
284 ST. Success further clears ES, wheras failure sets the pair of
285 CL and expected input (encoded as a leaf parsing expression) as
286 the new ES and then rewinds CL by one character, preparing the
287 machine for another parse attempt by a possible alternative.
288
289 test_range chars chare
290 This instruction implements the range matching operator, i.e. it
291 checks if CC falls into the interval of characters spanned up by
292 the two characters from chars to chare, both inclusive.
293
294 Success and failure of the test are both recorded directly in
295 ST. Success further clears ES, wheras failure sets the pair of
296 CL and expected input (encoded as a leaf parsing expression) as
297 the new ES and then rewinds CL by one character, preparing the
298 machine for another parse attempt by a possible alternative.
299
300 test_space
301 This instruction implements the special PE operator "space",
302 which checks if CC falls into the character class of the same
303 name, or not.
304
305 Success and failure of the test are both recorded directly in
306 ST. Success further clears ES, wheras failure sets the pair of
307 CL and expected input (encoded as a leaf parsing expression) as
308 the new ES and then rewinds CL by one character, preparing the
309 machine for another parse attempt by a possible alternative.
310
311 test_upper
312 This instruction implements the special PE operator "upper",
313 which checks if CC falls into the character class of the same
314 name, or not.
315
316 Success and failure of the test are both recorded directly in
317 ST. Success further clears ES, wheras failure sets the pair of
318 CL and expected input (encoded as a leaf parsing expression) as
319 the new ES and then rewinds CL by one character, preparing the
320 machine for another parse attempt by a possible alternative.
321
322 test_wordchar
323 This instruction implements the special PE operator "wordchar",
324 which checks if CC falls into the character class of the same
325 name, or not.
326
327 Success and failure of the test are both recorded directly in
328 ST. Success further clears ES, wheras failure sets the pair of
329 CL and expected input (encoded as a leaf parsing expression) as
330 the new ES and then rewinds CL by one character, preparing the
331 machine for another parse attempt by a possible alternative.
332
333 test_xdigit
334 This instruction implements the special PE operator "xdigit",
335 which checks if CC falls into the character class of the same
336 name, or not.
337
338 Success and failure of the test are both recorded directly in
339 ST. Success further clears ES, wheras failure sets the pair of
340 CL and expected input (encoded as a leaf parsing expression) as
341 the new ES and then rewinds CL by one character, preparing the
342 machine for another parse attempt by a possible alternative.
343
344 ERROR HANDLING
345 The instructions in this section mainly access ER and ES.
346
347 error_clear
348 This instruction clears ER.
349
350 error_push
351 This instruction makes a copy of ER and pushes it on ES.
352
353 error_pop_merge
354 This instruction takes the topmost entry of ES and merges the
355 error status it contains with ES, making the result the new ES.
356
357 The merge is governed by four rules, with the merge result
358
359 [1] Empty if both states are empty.
360
361 [2] The non-empty state if only one of the two states is non-
362 empty.
363
364 [3] The state with the larger location, if the two states
365 specify different locations.
366
367 [4] The pair of the location shared by the two states, and
368 the set-union of their messages for states at the same
369 location.
370
371 error_nonterminal symbol
372 This is a guarded instruction. It does nothing if either ES is
373 empty, or if the location in ES is not just past the last loca‐
374 tion saved in LS. Otherwise it sets the pair of that location
375 and the nonterminal symbol as the new ES.
376
377 Note: In the above "just past" means "that location plus one",
378 or also "the location of the next character after that loca‐
379 tion".
380
381 STATUS CONTROL
382 The instructions in this section directly manipulate ST.
383
384 status_ok
385 This instruction sets ST to true, recording a success.
386
387 status_fail
388 This instruction sets ST to false, recording a failure.
389
390 status_negate
391 This instruction negates ST, turning a failure into a success
392 and vice versa.
393
394 LOCATION HANDLING
395 The instructions in this section access CL and LS.
396
397 loc_push
398 This instruction makes a copy of CL and pushes it on LS.
399
400 loc_pop_discard
401 This instructions pops the last saved location from LS.
402
403 loc_pop_rewind
404 This instruction pops the last saved location from LS and
405 restores it as CL.
406
407 NONTERMINAL EXECUTION
408 The instructions in this section access and manipulate NC.
409
410 symbol_restore symbol
411 This instruction checks if NC contains data for the nonterminal
412 symbol at CL, or not. The result of the instruction is a boolean
413 flag, with True indicating that data was found in the cache. In
414 that case the instruction has further updated the architectural
415 state of the machine with the cached information, namely CL, ST,
416 ER, and SV.
417
418 The method with which the instruction's result is transformed
419 into control flow is left undefined and the responsibility of
420 the implementation.
421
422 symbol_save symbol
423 This instructions saves the current settings of CL, ST, ER, and
424 SV in NC, using the pair of nonterminal symbol and the last
425 location saved in LS as key.
426
427 VALUE CONSTRUCTION
428 The instructions in this section manipulate SV.
429
430 value_clear
431 This instruction clears SV.
432
433 value_leaf symbol
434 This instruction constructs an AST node for symbol covering the
435 range of IN from one character after the last location saved on
436 LS to CL and stores it in SV. ...
437
438 value_reduce symbol
439 This instruction generally behaves like value_nonterminal_leaf,
440 except that it takes all AST nodes on ARS, if any, and makes
441 them the children of the new node, with the last node saved on
442 ARS becoming the right-most / last child. Note that ARS is not
443 modfied by this operation.
444
445 AST CONSTRUCTION
446 The instructions in this section manipulate ARS and AS.
447
448 ast_value_push
449 This instruction makes a copy of SV and pushes it on ARS.
450
451 ast_push
452 This instruction pushes the current state of ARS on AS and then
453 clears ARS.
454
455 ast_pop_rewind
456 This instruction pops the last entry saved on AS and restores it
457 as the new state of ARS.
458
459 ast_pop_discard
460 This instruction pops the last entry saved on AS.
461
462 CONTROL FLOW
463 Normally this section would contain the specifications of the control
464 flow instructions of the PARAM, i.e. (un)conditional jumps and the
465 like. However, this part of the PARAM is intentionally left unspeci‐
466 fied. This allows the implementations to freely choose how to implement
467 control flow.
468
469 The implementation of this machine in Parser Tools, i.e the package
470 pt::rde, is not only coded in Tcl, but also relies on Tcl commands to
471 provide it with control flow (instructions).
472
474 InstructionInputsOutputs
475 ======================= ===========================================
476 ast_pop_discardAS->AS
477 ast_pop_rewindAS->AS, ARS
478 ast_push ARS, AS->AS
479 ast_value_pushSV, ARS->ARS
480 ======================= ===========================================
481 error_clear-->ER
482 error_nonterminal symER, LS->ER
483 error_pop_merge ES, ER->ER
484 error_pushES, ER->ES
485 ======================= ===========================================
486 input_next msgIN->TC, CL, CC, ST, ER
487 ======================= ===========================================
488 loc_pop_discardLS->LS
489 loc_pop_rewindLS->LS, CL
490 loc_push CL, LS->LS
491 ======================= ===========================================
492 status_fail-->ST
493 status_negateST->ST
494 status_ok -->ST
495 ======================= ===========================================
496 symbol_restore symNC->CL, ST, ER, SV
497 symbol_save symCL, ST, ER, SV LS->NC
498 ======================= ===========================================
499 test_alnum CC->ST, ER
500 test_alphaCC->ST, ER
501 test_asciiCC->ST, ER
502 test_char charCC->ST, ER
503 test_ddigitCC->ST, ER
504 test_digitCC->ST, ER
505 test_graphCC->ST, ER
506 test_lowerCC->ST, ER
507 test_printCC->ST, ER
508 test_punctCC->ST, ER
509 test_range chars chareCC->ST, ER
510 test_spaceCC->ST, ER
511 test_upperCC->ST, ER
512 test_wordcharCC->ST, ER
513 test_xdigitCC->ST, ER
514 ======================= ===========================================
515 value_clear-->SV
516 value_leaf symbolLS, CL->SV
517 value_reduce symbolARS, LS, CL->SV
518 ======================= ===========================================
519
520
522 This document, and the package it describes, will undoubtedly contain
523 bugs and other problems. Please report such in the category pt of the
524 Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist]. Please also
525 report any ideas for enhancements you may have for either package
526 and/or documentation.
527
528 When proposing code changes, please provide unified diffs, i.e the out‐
529 put of diff -u.
530
531 Note further that attachments are strongly preferred over inlined
532 patches. Attachments can be made by going to the Edit form of the
533 ticket immediately after its creation, and then using the left-most
534 button in the secondary navigation bar.
535
537 EBNF, LL(k), PEG, TDPL, context-free languages, expression, grammar,
538 matching, parser, parsing expression, parsing expression grammar, push
539 down automaton, recursive descent, state, top-down parsing languages,
540 transducer, virtual machine
541
543 Parsing and Grammars
544
546 Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
547
548
549
550
551tcllib 1 pt::param(n)