1pt::param(n)                     Parser Tools                     pt::param(n)
2
3
4
5______________________________________________________________________________
6

NAME

8       pt::param - PackRat Machine Specification
9

SYNOPSIS

11       package require Tcl  8.5
12
13______________________________________________________________________________
14

DESCRIPTION

16       Are  you  lost ?  Do you have trouble understanding this document ?  In
17       that case please read the overview  provided  by  the  Introduction  to
18       Parser  Tools.  This document is the entrypoint to the whole system the
19       current package is a part of.
20
21       Welcome to the PackRat Machine (short: PARAM), a virtual machine geared
22       towards  the  support  of recursive descent parsers, especially packrat
23       parsers. Towards this end it has features like the caching and reuse of
24       partial  results, the caching of the encountered input, and the ability
25       to backtrack in both input and AST creation.
26
27       This document specifies the machine in terms of its architectural state
28       and instruction set.
29

ARCHITECTURAL STATE

31       Any PARAM implementation has to manage at least the following state:
32
33       Input (IN)
34              This is the channel the characters to process are read from.
35
36              This part of the machine's state is used and modified by the in‐
37              structions defined in the section Input Handling.
38
39       Current Character (CC)
40              The character from the input currently tested against its possi‐
41              ble alternatives.
42
43              This part of the machine's state is used and modified by the in‐
44              structions defined in the section Character Processing.
45
46       Current Location (CL)
47              The location of the current character in the  input,  as  offset
48              relative  to  the  beginning of the input. Character offsets are
49              counted from 0.
50
51              This part of the machine's state is used and modified by the in‐
52              structions  defined  in the sections Character Processing, Loca‐
53              tion Handling, and Nonterminal Execution.
54
55       Location Stack (LS)
56              A stack of locations in the input, saved for possible backtrack‐
57              ing.
58
59              This part of the machine's state is used and modified by the in‐
60              structions defined in the sections Character  Processing,  Loca‐
61              tion Handling, and Nonterminal Execution.
62
63       Status (ST)
64              The  status of the last attempt of testing the input, indicating
65              either success or failure.
66
67              This part of the machine's state is used and modified by the in‐
68              structions  defined  in  the  sections Status Control, Character
69              Processing, and Nonterminal Execution.
70
71       Semantic Value (SV)
72              The current semantic value, either empty, or a node for AST con‐
73              structed from the input.
74
75              This part of the machine's state is used and modified by the in‐
76              structions defined in the sections Value Construction,  and  AST
77              Construction.
78
79       AST Reduction Stack (ARS)
80              The  stack  of partial ASTs constructed during the processing of
81              nonterminal symbols.
82
83              This part of the machine's state is used and modified by the in‐
84              structions  defined  in the sections Value Construction, and AST
85              Construction.
86
87       AST Stack (AS)
88              The stack of reduction stacks, saved for possible backtracking.
89
90              This part of the machine's state is used and modified by the in‐
91              structions  defined  in the sections Value Construction, and AST
92              Construction.
93
94       Error Status (ER)
95              The machine's current knowledge of errors. This is either empty,
96              or  set  to  a pair of location in the input and the set of mes‐
97              sages for that location.
98
99              Note that this part of the machine's state can be  set  even  if
100              the last test of the current character was successful. For exam‐
101              ple, the *-operator (matching  a  sub-expression  zero  or  more
102              times)  in  a  PEG is always successful, even if it encounters a
103              problem further in the input and has to backtrack. Such problems
104              must not be forgotten when continuing the parsing.
105
106              This part of the machine's state is used and modified by the in‐
107              structions defined in the  sections  Error  Handling,  Character
108              Processing, and Nonterminal Execution.
109
110       Error Stack (ES)
111              The  stack  of error stati, saved for backtracking. This enables
112              the machine to merge current and older error stati when perform‐
113              ing backtracking in choices after an failed match.
114
115              This part of the machine's state is used and modified by the in‐
116              structions defined in the  sections  Error  Handling,  Character
117              Processing, and Nonterminal Execution.
118
119       Nonterminal Cache (NC)
120              A  cache  of  machine  states keyed by pairs name of nonterminal
121              symbol and location in the input. Each pair (N, L) is associated
122              with  a 4-tuple holding the values to use for CL, ST, SV, and ER
123              after the nonterminal N was parsed starting from the location L.
124              It  is a performance aid for backtracking parsers, allowing them
125              to avoid an expensive reparsing of complex  nonterminal  symbols
126              if they have been encountered before at a given location.
127
128              The  key  location is where machine started the attempt to match
129              the named nonterminal symbol, and  the  location  in  the  saved
130              4-tuple  is  where machine ended up after the attempt completed,
131              independent of the success of the attempt.
132
133              This part of the machine's state is used and modified by the in‐
134              structions defined in the section Nonterminal Execution.
135
136       Terminal Cache (TC)
137              A cache of characters read from IN, with their location in IN as
138              pair of line and column, keyed by the location in IN, this  time
139              as  character  offset from the beginning of IN.  It is a perfor‐
140              mance aid for backtracking parsers, allowing  them  to  avoid  a
141              possibly  expensive rereading of characters from IN, or even en‐
142              abling backtracking at, i.e. in the  case  of  IN  not  randomly
143              seekable.
144
145              This part of the machine's state is used and modified by the in‐
146              structions defined in the section Input Handling.
147

INSTRUCTION SET

149       With the machine's architectural state specified it is now possible  to
150       specify  the  instruction  set operating on that state and to be imple‐
151       mented by any realization of the PARAM. The 37 instructions are grouped
152       roughly  by  the  state they influence and/or query during their execu‐
153       tion.
154
155   INPUT HANDLING
156       The instructions in this section mainly access IN, pulling the  charac‐
157       ters to process into the machine.
158
159       input_next msg
160              This  method  reads the next character, i.e. the character after
161              CL, from IN. If successful this character becomes CC, CL is  ad‐
162              vanced by one, ES is cleared, and the operation is recorded as a
163              success in ST.
164
165              The operation may read the character from IN if the next charac‐
166              ter  is  not yet known to TC. If successful the new character is
167              stored in TC, with its location (line, column), and  the  opera‐
168              tion otherwise behaves as specified above. Future reads from the
169              same location, possible due to backtracking, will then be satis‐
170              fied from TC instead of IN.
171
172              If,  on the other hand, the end of IN was reached, the operation
173              is recorded as failed in ST, CL is left unchanged, and the  pair
174              of CL and msg becomes the new ES.
175
176   CHARACTER PROCESSING
177       The  instructions  in this section mainly access CC, testing it against
178       character classes, ranges, and individual characters.
179
180       test_alnum
181              This instruction implements the  special  PE  operator  "alnum",
182              which  checks  if  CC falls into the character class of the same
183              name, or not.
184
185              Success and failure of the test are both  recorded  directly  in
186              ST.   Success further clears ES, wheras failure sets the pair of
187              CL and expected input (encoded as a leaf parsing expression)  as
188              the  new  ES and then rewinds CL by one character, preparing the
189              machine for another parse attempt by a possible alternative.
190
191       test_alpha
192              This instruction implements the  special  PE  operator  "alpha",
193              which  checks  if  CC falls into the character class of the same
194              name, or not.
195
196              Success and failure of the test are both  recorded  directly  in
197              ST.   Success further clears ES, wheras failure sets the pair of
198              CL and expected input (encoded as a leaf parsing expression)  as
199              the  new  ES and then rewinds CL by one character, preparing the
200              machine for another parse attempt by a possible alternative.
201
202       test_ascii
203              This instruction implements the  special  PE  operator  "ascii",
204              which  checks  if  CC falls into the character class of the same
205              name, or not.
206
207              Success and failure of the test are both  recorded  directly  in
208              ST.   Success further clears ES, wheras failure sets the pair of
209              CL and expected input (encoded as a leaf parsing expression)  as
210              the  new  ES and then rewinds CL by one character, preparing the
211              machine for another parse attempt by a possible alternative.
212
213       test_char char
214              This instruction implements  the  character  matching  operator,
215              i.e. it checks if CC is char.
216
217              Success  and  failure  of the test are both recorded directly in
218              ST.  Success further clears ES, wheras failure sets the pair  of
219              CL  and expected input (encoded as a leaf parsing expression) as
220              the new ES and then rewinds CL by one character,  preparing  the
221              machine for another parse attempt by a possible alternative.
222
223       test_ddigit
224              This  instruction  implements  the special PE operator "ddigit",
225              which checks if CC falls into the character class  of  the  same
226              name, or not.
227
228              Success  and  failure  of the test are both recorded directly in
229              ST.  Success further clears ES, wheras failure sets the pair  of
230              CL  and expected input (encoded as a leaf parsing expression) as
231              the new ES and then rewinds CL by one character,  preparing  the
232              machine for another parse attempt by a possible alternative.
233
234       test_digit
235              This  instruction  implements  the  special PE operator "digit",
236              which checks if CC falls into the character class  of  the  same
237              name, or not.
238
239              Success  and  failure  of the test are both recorded directly in
240              ST.  Success further clears ES, wheras failure sets the pair  of
241              CL  and expected input (encoded as a leaf parsing expression) as
242              the new ES and then rewinds CL by one character,  preparing  the
243              machine for another parse attempt by a possible alternative.
244
245       test_graph
246              This  instruction  implements  the  special PE operator "graph",
247              which checks if CC falls into the character class  of  the  same
248              name, or not.
249
250              Success  and  failure  of the test are both recorded directly in
251              ST.  Success further clears ES, wheras failure sets the pair  of
252              CL  and expected input (encoded as a leaf parsing expression) as
253              the new ES and then rewinds CL by one character,  preparing  the
254              machine for another parse attempt by a possible alternative.
255
256       test_lower
257              This  instruction  implements  the  special PE operator "lower",
258              which checks if CC falls into the character class  of  the  same
259              name, or not.
260
261              Success  and  failure  of the test are both recorded directly in
262              ST.  Success further clears ES, wheras failure sets the pair  of
263              CL  and expected input (encoded as a leaf parsing expression) as
264              the new ES and then rewinds CL by one character,  preparing  the
265              machine for another parse attempt by a possible alternative.
266
267       test_print
268              This  instruction  implements  the  special PE operator "print",
269              which checks if CC falls into the character class  of  the  same
270              name, or not.
271
272              Success  and  failure  of the test are both recorded directly in
273              ST.  Success further clears ES, wheras failure sets the pair  of
274              CL  and expected input (encoded as a leaf parsing expression) as
275              the new ES and then rewinds CL by one character,  preparing  the
276              machine for another parse attempt by a possible alternative.
277
278       test_punct
279              This  instruction  implements  the  special PE operator "punct",
280              which checks if CC falls into the character class  of  the  same
281              name, or not.
282
283              Success  and  failure  of the test are both recorded directly in
284              ST.  Success further clears ES, wheras failure sets the pair  of
285              CL  and expected input (encoded as a leaf parsing expression) as
286              the new ES and then rewinds CL by one character,  preparing  the
287              machine for another parse attempt by a possible alternative.
288
289       test_range chars chare
290              This instruction implements the range matching operator, i.e. it
291              checks if CC falls into the interval of characters spanned up by
292              the two characters from chars to chare, both inclusive.
293
294              Success  and  failure  of the test are both recorded directly in
295              ST.  Success further clears ES, wheras failure sets the pair  of
296              CL  and expected input (encoded as a leaf parsing expression) as
297              the new ES and then rewinds CL by one character,  preparing  the
298              machine for another parse attempt by a possible alternative.
299
300       test_space
301              This  instruction  implements  the  special PE operator "space",
302              which checks if CC falls into the character class  of  the  same
303              name, or not.
304
305              Success  and  failure  of the test are both recorded directly in
306              ST.  Success further clears ES, wheras failure sets the pair  of
307              CL  and expected input (encoded as a leaf parsing expression) as
308              the new ES and then rewinds CL by one character,  preparing  the
309              machine for another parse attempt by a possible alternative.
310
311       test_upper
312              This  instruction  implements  the  special PE operator "upper",
313              which checks if CC falls into the character class  of  the  same
314              name, or not.
315
316              Success  and  failure  of the test are both recorded directly in
317              ST.  Success further clears ES, wheras failure sets the pair  of
318              CL  and expected input (encoded as a leaf parsing expression) as
319              the new ES and then rewinds CL by one character,  preparing  the
320              machine for another parse attempt by a possible alternative.
321
322       test_wordchar
323              This  instruction implements the special PE operator "wordchar",
324              which checks if CC falls into the character class  of  the  same
325              name, or not.
326
327              Success  and  failure  of the test are both recorded directly in
328              ST.  Success further clears ES, wheras failure sets the pair  of
329              CL  and expected input (encoded as a leaf parsing expression) as
330              the new ES and then rewinds CL by one character,  preparing  the
331              machine for another parse attempt by a possible alternative.
332
333       test_xdigit
334              This  instruction  implements  the special PE operator "xdigit",
335              which checks if CC falls into the character class  of  the  same
336              name, or not.
337
338              Success  and  failure  of the test are both recorded directly in
339              ST.  Success further clears ES, wheras failure sets the pair  of
340              CL  and expected input (encoded as a leaf parsing expression) as
341              the new ES and then rewinds CL by one character,  preparing  the
342              machine for another parse attempt by a possible alternative.
343
344   ERROR HANDLING
345       The instructions in this section mainly access ER and ES.
346
347       error_clear
348              This instruction clears ER.
349
350       error_push
351              This instruction makes a copy of ER and pushes it on ES.
352
353       error_pop_merge
354              This  instruction  takes  the topmost entry of ES and merges the
355              error status it contains with ES, making the result the new ES.
356
357              The merge is governed by four rules, with the merge result
358
359              [1]    Empty if both states are empty.
360
361              [2]    The non-empty state if only one of the two states is non-
362                     empty.
363
364              [3]    The  state  with  the  larger location, if the two states
365                     specify different locations.
366
367              [4]    The pair of the location shared by the  two  states,  and
368                     the  set-union  of  their messages for states at the same
369                     location.
370
371       error_nonterminal symbol
372              This is a guarded instruction. It does nothing if either  ES  is
373              empty,  or if the location in ES is not just past the last loca‐
374              tion saved in LS. Otherwise it sets the pair  of  that  location
375              and the nonterminal symbol as the new ES.
376
377              Note:  In  the above "just past" means "that location plus one",
378              or also "the location of the next  character  after  that  loca‐
379              tion".
380
381   STATUS CONTROL
382       The instructions in this section directly manipulate ST.
383
384       status_ok
385              This instruction sets ST to true, recording a success.
386
387       status_fail
388              This instruction sets ST to false, recording a failure.
389
390       status_negate
391              This  instruction  negates  ST, turning a failure into a success
392              and vice versa.
393
394   LOCATION HANDLING
395       The instructions in this section access CL and LS.
396
397       loc_push
398              This instruction makes a copy of CL and pushes it on LS.
399
400       loc_pop_discard
401              This instructions pops the last saved location from LS.
402
403       loc_pop_rewind
404              This instruction pops the last saved location from  LS  and  re‐
405              stores it as CL.
406
407   NONTERMINAL EXECUTION
408       The instructions in this section access and manipulate NC.
409
410       symbol_restore symbol
411              This  instruction checks if NC contains data for the nonterminal
412              symbol at CL, or not. The result of the instruction is a boolean
413              flag,  with True indicating that data was found in the cache. In
414              that case the instruction has further updated the  architectural
415              state of the machine with the cached information, namely CL, ST,
416              ER, and SV.
417
418              The method with which the instruction's  result  is  transformed
419              into  control  flow  is left undefined and the responsibility of
420              the implementation.
421
422       symbol_save symbol
423              This instructions saves the current settings of CL, ST, ER,  and
424              SV  in NC, using the pair of nonterminal symbol and the last lo‐
425              cation saved in LS as key.
426
427   VALUE CONSTRUCTION
428       The instructions in this section manipulate SV.
429
430       value_clear
431              This instruction clears SV.
432
433       value_leaf symbol
434              This instruction constructs an AST node for symbol covering  the
435              range  of IN from one character after the last location saved on
436              LS to CL and stores it in SV. ...
437
438       value_reduce symbol
439              This instruction generally behaves like  value_nonterminal_leaf,
440              except  that  it  takes  all AST nodes on ARS, if any, and makes
441              them the children of the new node, with the last node  saved  on
442              ARS  becoming  the right-most / last child. Note that ARS is not
443              modfied by this operation.
444
445   AST CONSTRUCTION
446       The instructions in this section manipulate ARS and AS.
447
448       ast_value_push
449              This instruction makes a copy of SV and pushes it on ARS.
450
451       ast_push
452              This instruction pushes the current state of ARS on AS and  then
453              clears ARS.
454
455       ast_pop_rewind
456              This instruction pops the last entry saved on AS and restores it
457              as the new state of ARS.
458
459       ast_pop_discard
460              This instruction pops the last entry saved on AS.
461
462   CONTROL FLOW
463       Normally this section would contain the specifications of  the  control
464       flow  instructions  of  the  PARAM,  i.e. (un)conditional jumps and the
465       like. However, this part of the PARAM is  intentionally  left  unspeci‐
466       fied. This allows the implementations to freely choose how to implement
467       control flow.
468
469       The implementation of this machine in Parser  Tools,  i.e  the  package
470       pt::rde,  is  not only coded in Tcl, but also relies on Tcl commands to
471       provide it with control flow (instructions).
472

INTERACTION OF THE INSTRUCTIONS WITH THE ARCHITECTURAL STATE

474              InstructionInputsOutputs
475              ======================= ===========================================
476              ast_pop_discardAS->AS
477              ast_pop_rewindAS->AS, ARS
478              ast_push  ARS, AS->AS
479              ast_value_pushSV, ARS->ARS
480              ======================= ===========================================
481              error_clear-->ER
482              error_nonterminal symER, LS->ER
483              error_pop_merge   ES, ER->ER
484              error_pushES, ER->ES
485              ======================= ===========================================
486              input_next msgIN->TC, CL, CC, ST, ER
487              ======================= ===========================================
488              loc_pop_discardLS->LS
489              loc_pop_rewindLS->LS, CL
490              loc_push  CL, LS->LS
491              ======================= ===========================================
492              status_fail-->ST
493              status_negateST->ST
494              status_ok -->ST
495              ======================= ===========================================
496              symbol_restore symNC->CL, ST, ER, SV
497              symbol_save    symCL, ST, ER, SV LS->NC
498              ======================= ===========================================
499              test_alnum  CC->ST, ER
500              test_alphaCC->ST, ER
501              test_asciiCC->ST, ER
502              test_char charCC->ST, ER
503              test_ddigitCC->ST, ER
504              test_digitCC->ST, ER
505              test_graphCC->ST, ER
506              test_lowerCC->ST, ER
507              test_printCC->ST, ER
508              test_punctCC->ST, ER
509              test_range chars chareCC->ST, ER
510              test_spaceCC->ST, ER
511              test_upperCC->ST, ER
512              test_wordcharCC->ST, ER
513              test_xdigitCC->ST, ER
514              ======================= ===========================================
515              value_clear-->SV
516              value_leaf symbolLS, CL->SV
517              value_reduce symbolARS, LS, CL->SV
518              ======================= ===========================================
519
520

BUGS, IDEAS, FEEDBACK

522       This document, and the package it describes, will  undoubtedly  contain
523       bugs  and other problems.  Please report such in the category pt of the
524       Tcllib Trackers  [http://core.tcl.tk/tcllib/reportlist].   Please  also
525       report  any  ideas  for  enhancements  you  may have for either package
526       and/or documentation.
527
528       When proposing code changes, please provide unified diffs, i.e the out‐
529       put of diff -u.
530
531       Note  further  that  attachments  are  strongly  preferred over inlined
532       patches. Attachments can be made by going  to  the  Edit  form  of  the
533       ticket  immediately  after  its  creation, and then using the left-most
534       button in the secondary navigation bar.
535

KEYWORDS

537       EBNF, LL(k), PEG, TDPL, context-free  languages,  expression,  grammar,
538       matching,  parser, parsing expression, parsing expression grammar, push
539       down automaton, recursive descent, state, top-down  parsing  languages,
540       transducer, virtual machine
541

CATEGORY

543       Parsing and Grammars
544
546       Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
547
548
549
550
551tcllib                                 1                          pt::param(n)
Impressum