1grammar::me_vm(n)        Grammar operations and usage        grammar::me_vm(n)
2
3
4
5______________________________________________________________________________
6

NAME

8       grammar::me_vm - Virtual machine for parsing token streams
9

DESCRIPTION

11       Please go and read the document grammar::me_intro first for an overview
12       of the various documents and their relations.
13
14       This document specifies a virtual machine for the  controlled  matching
15       and  parsing  of token streams, creating an abstract syntax tree (short
16       AST) reflecting the structure of the input.  Special  machine  features
17       are  the  caching  and reuse of partial results, caching of the encoun‐
18       tered input, and the ability to backtrack in both input  and  AST  cre‐
19       ation.
20
21       These  features make the specified virtual machine especially useful to
22       packrat parsers based on parsing expression grammars. It is however not
23       restricted  to  this  type  of  parser. Normal LL and LR parsers can be
24       implemented with it as well.
25
26       The following sections will discuss first the abstract state kept by ME
27       virtual machines, and then their instruction set.
28

MACHINE STATE

30       A ME virtual machine manages the following state:
31
32       Current token CT
33              The token from the input under consideration by the machine.
34
35              This  information  is  used  and  modified  by  the instructions
36              defined in the section TERMINAL MATCHING.
37
38       Current location CL
39              The location of the current token in the input stream, as offset
40              relative to the beginning of the stream. The first token is con‐
41              sidered to be at offset 0.
42
43              This information is implicitly used and modified by the instruc‐
44              tions  defined in the sections TERMINAL MATCHING and NONTERMINAL
45              MATCHING, and can  be  directly  queried  and  modified  by  the
46              instructions defined in section INPUT LOCATION HANDLING.
47
48       Location stack LS
49              In addition to the above a stack of locations, for backtracking.
50              Locations can put on the stack, removed  from  it,  and  removed
51              with setting the current location.
52
53              This information is implicitly used and modified by the instruc‐
54              tions defined in the sections TERMINAL MATCHING and  NONTERMINAL
55              MATCHING,  and  can  be  directly  queried  and  modified by the
56              instructions defined in section INPUT LOCATION HANDLING.
57
58       Match status OK
59              A boolean value, the result of  the  last  attempt  at  matching
60              input.   It  is  set to true if that attempt was successful, and
61              false otherwise.
62
63              This information is influenced by the  instructions  defined  in
64              the sections TERMINAL MATCHING, NONTERMINAL MATCHING, and UNCON‐
65              DITIONAL MATCHING.  It is queried by the instructions defined in
66              the section CONTROL FLOW.
67
68       Semantic value SV
69              The  semantic  value  associated  with  (generated  by) the last
70              attempt at matching input. Contains either the empty string or a
71              node for the abstract syntax tree constructed from the input.
72
73              This  information  is  influenced by the instructions defined in
74              the sections SEMANTIC VALUES, and AST STACK HANDLING.
75
76       AST stack AS
77              A stack of partial abstract  syntax  trees  constructed  by  the
78              machine during matching.
79
80              This  information  is  influenced by the instructions defined in
81              the sections SEMANTIC VALUES, and AST STACK HANDLING.
82
83       AST Marker stack MS
84              In addition to the above a stack of  stacks,  for  backtracking.
85              This  is  actually  a  stack of markers into the AST stack, thus
86              implicitly snapshooting the state of the AST stack at some point
87              in  time.  Markers can be put on the stack, dropped from it, and
88              used to roll back the AST stack to an earlier state.
89
90              This information is influenced by the  instructions  defined  in
91              the sections SEMANTIC VALUES, and AST STACK HANDLING.
92
93       Error status ER
94              Error  information  associated with the last attempt at matching
95              input. Contains either the empty string or a list of 2 elements,
96              a  location in the input and a list of error messages associated
97              with it, in this order.
98
99              Note that error information can be set even if the last  attempt
100              at  matching  input  was  successful. For example the *-operator
101              (matching a sub-expression zero or  more  times)  in  a  parsing
102              expression grammar is always successful, even if it encounters a
103              problem further in the input and has to backtrack. Such problems
104              must not be forgotten when continuing to match.
105
106              This  information  is queried and influenced by the instructions
107              defined in the sections TERMINAL MATCHING, NONTERMINAL MATCHING,
108              and ERROR HANDLING.
109
110       Error stack ES
111              In  addition to the above a stack of error information, to allow
112              the merging of current and older error information when perform‐
113              ing backtracking in choices after an unsucessful match.
114
115              This  information  is queried and influenced by the instructions
116              defined in the sections TERMINAL MATCHING, NONTERMINAL MATCHING,
117              and ERROR HANDLING.
118
119       Return stack RS
120              A  stack  of  program counter values, i.e. locations in the code
121              controlling the virtual machine, for the management  of  subrou‐
122              tine calls, i.e. the matching of nonterminal symbols.
123
124              This  information  is queried and influenced by the instructions
125              defined in the section NONTERMINAL MATCHING.
126
127       Nonterminal cache NC
128              A cache of machine states (A 4-tuple containing  a  location  in
129              the  input, match status OK, semantic value SV, and error status
130              ER) keyed by name of nonterminal  symbol  and  location  in  the
131              input stream.
132
133              The  key  location is where machine started the attempt to match
134              the named nonterminal symbol, and the location in the  value  is
135              where  machine ended up after the attempt completed, independent
136              of the success of the attempt.
137
138              This status  is  queried  and  influenced  by  the  instructions
139              defined in the section NONTERMINAL MATCHING.
140

MACHINE INSTRUCTIONS

142       With  the  machine  state  specified  it is now possible to explain the
143       instruction set of ME virtual machines. They are grouped roughly by the
144       machine state they influence and/or query.
145
146   TERMINAL MATCHING
147       First  the  instructions  to match tokens from the input stream, and by
148       extension all terminal symbols.
149
150       These instructions are the only ones which may  retrieve  a  new  token
151       from  the  input  stream.  This  is  a  may  and not a will because the
152       instructions will a retrieve new token if,  and  only  if  the  current
153       location  CL  is  at  the head of the stream.  If the machine has back‐
154       tracked (see icl_rewind) the instructions will retrieve  the  token  to
155       compare against from the internal cache.
156
157       ict_advance message
158              This instruction tries to advance to the next token in the input
159              stream, i.e. the one after the current location CL. The instruc‐
160              tion  will  fail  if, and only if the end of the input stream is
161              reached, i.e. if there is no next token.
162
163              The sucess/failure of the instruction is remembered in the match
164              status  OK. In the case of failure the error status ER is set to
165              the current location and the message message.  In  the  case  of
166              success  the  error  status ER is cleared, the new token is made
167              the current token CT, and the new location is made  the  current
168              location CL.
169
170              The  argument  message  is a reference to the string to put into
171              the error status ER, if such is needed.
172
173       ict_match_token tok message
174              This instruction tests the current token CT  for  equality  with
175              the  argument tok and records the result in the match status OK.
176              The instruction fails if the current token is not equal to tok.
177
178              In case of failure the error status ER is  set  to  the  current
179              location CL and the message message, and the current location CL
180              is moved one token backwards.  Otherwise, i.e. upon success, the
181              error  status  ER  is cleared and the current location CL is not
182              touched.
183
184       ict_match_tokrange tokbegin tokend message
185              This instruction tests the current token CT  for  being  in  the
186              range  of tokens from tokbegin to tokend (inclusive) and records
187              the result in the match status OK. The instruction fails if  the
188              current token is not inside the range.
189
190              In  case  of  failure  the error status ER is set to the current
191              location CL and the message message, and the current location CL
192              is moved one token backwards.  Otherwise, i.e. upon success, the
193              error status ER is cleared and the current location  CL  is  not
194              touched.
195
196       ict_match_tokclass code message
197              This  instruction  tests the current token CT for being a member
198              of the token class code and records the result in the match sta‐
199              tus OK. The instruction fails if the current token is not a mem‐
200              ber of the specified class.
201
202              In case of failure the error status ER is  set  to  the  current
203              location CL and the message message, and the current location CL
204              is moved one token backwards.  Otherwise, i.e. upon success, the
205              error  status  ER  is cleared and the current location CL is not
206              touched.
207
208              Currently the following classes are legal:
209
210              alnum  A token is accepted if it is a unicode alphabetical char‐
211                     acter, or a digit.
212
213              alpha  A token is accepted if it is a unicode alphabetical char‐
214                     acter.
215
216              digit  A token is accepted if it is a unicode digit character.
217
218              xdigit A token is accepted if it is a hexadecimal digit  charac‐
219                     ter.
220
221              punct  A  token is accepted if it is a unicode punctuation char‐
222                     acter.
223
224              space  A token is accepted if it is a unicode space character.
225
226   NONTERMINAL MATCHING
227       The instructions in this section handle  the  matching  of  nonterminal
228       symbols. They query the nonterminal cache NC for saved information, and
229       put such information into the cache.
230
231       The usage of the cache is a performance aid for  backtracking  parsers,
232       allowing them to avoid an expensive rematch of complex nonterminal sym‐
233       bols if they have been encountered before.
234
235       inc_restore branchlabel nt
236              This instruction checks if the  nonterminal  cache  NC  contains
237              information  about  the  nonterminal  symbol  nt, at the current
238              location CL. If that is the case the instruction will update the
239              machine  state  (current  location CL, match status OK, semantic
240              value SV, and error status ER) with the  found  information  and
241              continue  execution at the instruction refered to by the branch‐
242              label. The new current  location  CL  will  be  the  last  token
243              matched by the nonterminal symbol, i.e. belonging to it.
244
245              If no information was found the instruction will continue execu‐
246              tion at the next instruction.
247
248              Together with icf_ntcall it is possible  to  generate  code  for
249              memoized  and  non-memoized  matching  of  nonterminal  symbols,
250              either as subroutine calls, or inlined in the caller.
251
252       inc_save nt
253              This instruction saves the current state of the machine (current
254              location  CL, match status OK, semantic value SV, and error sta‐
255              tus ER), to the nonterminal cache NC. It will also pop an  entry
256              from  the location stack LS and save it as the start location of
257              the match.
258
259              It is expected to be called at the end of matching a nonterminal
260              symbol,  with nt the name of the nonterminal symbol the code was
261              working on. This allows the instruction inc_restore to check for
262              and  retrieve the data, should we have to match this nonterminal
263              symbol at the same location again, during backtracking.
264
265       icf_ntcall branchlabel
266              This instruction invokes the code for matching  the  nonterminal
267              symbol  nt  as  a  subroutine. To this end it stores the current
268              program counter PC on the return stack RS, the current  location
269              CL on the location stack LS, and then continues execution at the
270              address branchlabel.
271
272              The next matching icf_ntreturn will cause the execution to  con‐
273              tinue at the instruction coming after the call.
274
275       icf_ntreturn
276              This  instruction  will  pop  an entry from the return stack RS,
277              assign it to the program counter PC, and then continue execution
278              at the new address.
279
280   UNCONDITIONAL MATCHING
281       The  instructions  in  this  section are the remaining match operators.
282       They change the match status OK directly and unconditionally.
283
284       iok_ok This instruction sets the match status OK to true, indicating  a
285              successful match.
286
287       iok_fail
288              This instruction sets the match status OK to false, indicating a
289              failed match.
290
291       iok_negate
292              This instruction negates the match status OK, turning a  failure
293              into a success and vice versa.
294
295   CONTROL FLOW
296       The  instructions in this section implement both conditional and uncon‐
297       ditional control flow. The conditional jumps query the match status OK.
298
299       icf_jalways branchlabel
300              This instruction sets the program  counter  PC  to  the  address
301              specified  by  branchlabel  and  then  continues  execution from
302              there. This is an unconditional jump.
303
304       icf_jok branchlabel
305              This instruction sets the program  counter  PC  to  the  address
306              specified by branchlabel. This happens if, and only if the match
307              status OK indicates a success.  Otherwise  it  simply  continues
308              execution at the next instruction. This is a conditional jump.
309
310       icf_jfail branchlabel
311              This  instruction  sets  the  program  counter PC to the address
312              specified by branchlabel. This happens if, and only if the match
313              status  OK  indicates  a  failure. Otherwise it simply continues
314              execution at the next instruction. This is a conditional jump.
315
316       icf_halt
317              This instruction halts the machine and blocks any further execu‐
318              tion.
319
320   INPUT LOCATION HANDLING
321       The  instructions in this section are for backtracking, they manipulate
322       the current location CL of the machine state.  They allow a user of the
323       machine  to  query  and  save locations in the input, and to rewind the
324       current location CL to saved locations, making them one of  the  compo‐
325       nents enabling the implementation of backtracking parsers.
326
327       icl_push
328              This instruction pushes a copy of the current location CL on the
329              location stack LS.
330
331       icl_rewind
332              This instruction pops an entry from the location  stack  LS  and
333              then  moves  the  current  location CL back to this point in the
334              input.
335
336       icl_pop
337              This instruction pops an entry from the location  stack  LS  and
338              discards it.
339
340   ERROR HANDLING
341       The  instructions  in this section provide read and write access to the
342       error status ER of the machine.
343
344       ier_push
345              This instruction pushes a copy of the current error status ER on
346              the error stack ES.
347
348       ier_clear
349              This instruction clears the error status ER.
350
351       ier_nonterminal message
352              This instruction checks if the error status ER contains an error
353              whose location is just past the location found in the top  entry
354              of  the  location stack LS.  Nothing happens if no such error is
355              found.  Otherwise the found error is replaced by an error at the
356              location found on the stack, having the message message.
357
358       ier_merge
359              This  instruction  pops an entry from the error stack ES, merges
360              it with the current error status ER and stores the result of the
361              merge as the new error status ER.
362
363              The merge is performed as described below:
364
365              If  one of the two error states is empty the other is chosen. If
366              neither error state is empty, and refering  to  different  loca‐
367              tions,  then  the  error  state with the location further in the
368              input is chosen. If both error states refer to the same location
369              their messages are merged (with removing duplicates).
370
371   SEMANTIC VALUES
372       The instructions in this section manipulate the semantic value SV.
373
374       isv_clear
375              This instruction clears the semantic value SV.
376
377       isv_terminal
378              This  instruction  creates  a  terminal AST node for the current
379              token CT, makes it the semantic value SV, and  also  pushes  the
380              node on the AST stack AS.
381
382       isv_nonterminal_leaf nt
383              This  instruction  creates  a  nonterminal  AST node without any
384              children for the nonterminal nt, and makes it the semantic value
385              SV.
386
387              This  instruction  should  be executed if, and only if the match
388              status OK indicates  a  success.   In  the  case  of  a  failure
389              isv_clear should be called.
390
391       isv_nonterminal_range nt
392              This  instruction creates a nonterminal AST node for the nonter‐
393              minal nt, with a single terminal node as its  child,  and  makes
394              this  AST the semantic value SV. The terminal node refers to the
395              input string from the location found  on  top  of  the  location
396              stack LS to the current location CL (both inclusive).
397
398              This  instruction  should  be executed if, and only if the match
399              status OK indicates  a  success.   In  the  case  of  a  failure
400              isv_clear should be called.
401
402       isv_nonterminal_reduce nt
403              This  instruction creates a nonterminal AST node for the nonter‐
404              minal nt and makes it the semantic value SV.
405
406              All entries on the AST stack AS above the marker  found  in  the
407              top  entry of the AST Marker stack MS become children of the new
408              node, with the entry at the stack  top  becoming  the  rightmost
409              child.  If  the  AST Marker stack MS is empty the whole stack is
410              used. The AST marker stack MS is left unchanged.
411
412              This instruction should be executed if, and only  if  the  match
413              status  OK  indicates  a  success.   In  the  case  of a failure
414              isv_clear should be called.
415
416   AST STACK HANDLING
417       The instructions in this section manipulate the AST stack AS,  and  the
418       AST Marker stack MS.
419
420       ias_push
421              This  instruction  pushes the semantic value SV on the AST stack
422              AS.
423
424       ias_mark
425              This instruction pushes a marker for the current  state  of  the
426              AST stack AS on the AST Marker stack MS.
427
428       ias_mrewind
429              This  instruction pops an entry from the AST Marker stack MS and
430              then proceeds to pop entries from the AST  stack  AS  until  the
431              state  represented  by the popped marker has been reached again.
432              Nothing is done if the AST stack  AS  is  already  smaller  than
433              indicated by the popped marker.
434
435       ias_mpop
436              This  instruction pops an entry from the AST Marker stack MS and
437              discards it.
438

BUGS, IDEAS, FEEDBACK

440       This document, and the package it describes, will  undoubtedly  contain
441       bugs and other problems.  Please report such in the category grammar_me
442       of the Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist].   Please
443       also  report any ideas for enhancements you may have for either package
444       and/or documentation.
445
446       When proposing code changes, please provide unified diffs, i.e the out‐
447       put of diff -u.
448
449       Note  further  that  attachments  are  strongly  preferred over inlined
450       patches. Attachments can be made by going  to  the  Edit  form  of  the
451       ticket  immediately  after  its  creation, and then using the left-most
452       button in the secondary navigation bar.
453

KEYWORDS

455       grammar, parsing, virtual machine
456

CATEGORY

458       Grammars and finite automata
459
461       Copyright (c) 2005 Andreas Kupries <andreas_kupries@users.sourceforge.net>
462
463
464
465
466tcllib                                0.1                    grammar::me_vm(n)
Impressum