1grammar::me_vm(n) Grammar operations and usage grammar::me_vm(n)
2
3
4
5______________________________________________________________________________
6
8 grammar::me_vm - Virtual machine for parsing token streams
9
11 Please go and read the document grammar::me_intro first for an overview
12 of the various documents and their relations.
13
14 This document specifies a virtual machine for the controlled matching
15 and parsing of token streams, creating an abstract syntax tree (short
16 AST) reflecting the structure of the input. Special machine features
17 are the caching and reuse of partial results, caching of the encoun‐
18 tered input, and the ability to backtrack in both input and AST cre‐
19 ation.
20
21 These features make the specified virtual machine especially useful to
22 packrat parsers based on parsing expression grammars. It is however not
23 restricted to this type of parser. Normal LL and LR parsers can be
24 implemented with it as well.
25
26 The following sections will discuss first the abstract state kept by ME
27 virtual machines, and then their instruction set.
28
30 A ME virtual machine manages the following state:
31
32 Current token CT
33 The token from the input under consideration by the machine.
34
35 This information is used and modified by the instructions
36 defined in the section TERMINAL MATCHING.
37
38 Current location CL
39 The location of the current token in the input stream, as offset
40 relative to the beginning of the stream. The first token is con‐
41 sidered to be at offset 0.
42
43 This information is implicitly used and modified by the instruc‐
44 tions defined in the sections TERMINAL MATCHING and NONTERMINAL
45 MATCHING, and can be directly queried and modified by the
46 instructions defined in section INPUT LOCATION HANDLING.
47
48 Location stack LS
49 In addition to the above a stack of locations, for backtracking.
50 Locations can put on the stack, removed from it, and removed
51 with setting the current location.
52
53 This information is implicitly used and modified by the instruc‐
54 tions defined in the sections TERMINAL MATCHING and NONTERMINAL
55 MATCHING, and can be directly queried and modified by the
56 instructions defined in section INPUT LOCATION HANDLING.
57
58 Match status OK
59 A boolean value, the result of the last attempt at matching
60 input. It is set to true if that attempt was successful, and
61 false otherwise.
62
63 This information is influenced by the instructions defined in
64 the sections TERMINAL MATCHING, NONTERMINAL MATCHING, and UNCON‐
65 DITIONAL MATCHING. It is queried by the instructions defined in
66 the section CONTROL FLOW.
67
68 Semantic value SV
69 The semantic value associated with (generated by) the last
70 attempt at matching input. Contains either the empty string or a
71 node for the abstract syntax tree constructed from the input.
72
73 This information is influenced by the instructions defined in
74 the sections SEMANTIC VALUES, and AST STACK HANDLING.
75
76 AST stack AS
77 A stack of partial abstract syntax trees constructed by the
78 machine during matching.
79
80 This information is influenced by the instructions defined in
81 the sections SEMANTIC VALUES, and AST STACK HANDLING.
82
83 AST Marker stack MS
84 In addition to the above a stack of stacks, for backtracking.
85 This is actually a stack of markers into the AST stack, thus
86 implicitly snapshooting the state of the AST stack at some point
87 in time. Markers can be put on the stack, dropped from it, and
88 used to roll back the AST stack to an earlier state.
89
90 This information is influenced by the instructions defined in
91 the sections SEMANTIC VALUES, and AST STACK HANDLING.
92
93 Error status ER
94 Error information associated with the last attempt at matching
95 input. Contains either the empty string or a list of 2 elements,
96 a location in the input and a list of error messages associated
97 with it, in this order.
98
99 Note that error information can be set even if the last attempt
100 at matching input was successful. For example the *-operator
101 (matching a sub-expression zero or more times) in a parsing
102 expression grammar is always successful, even if it encounters a
103 problem further in the input and has to backtrack. Such problems
104 must not be forgotten when continuing to match.
105
106 This information is queried and influenced by the instructions
107 defined in the sections TERMINAL MATCHING, NONTERMINAL MATCHING,
108 and ERROR HANDLING.
109
110 Error stack ES
111 In addition to the above a stack of error information, to allow
112 the merging of current and older error information when perform‐
113 ing backtracking in choices after an unsucessful match.
114
115 This information is queried and influenced by the instructions
116 defined in the sections TERMINAL MATCHING, NONTERMINAL MATCHING,
117 and ERROR HANDLING.
118
119 Return stack RS
120 A stack of program counter values, i.e. locations in the code
121 controlling the virtual machine, for the management of subrou‐
122 tine calls, i.e. the matching of nonterminal symbols.
123
124 This information is queried and influenced by the instructions
125 defined in the section NONTERMINAL MATCHING.
126
127 Nonterminal cache NC
128 A cache of machine states (A 4-tuple containing a location in
129 the input, match status OK, semantic value SV, and error status
130 ER) keyed by name of nonterminal symbol and location in the
131 input stream.
132
133 The key location is where machine started the attempt to match
134 the named nonterminal symbol, and the location in the value is
135 where machine ended up after the attempt completed, independent
136 of the success of the attempt.
137
138 This status is queried and influenced by the instructions
139 defined in the section NONTERMINAL MATCHING.
140
142 With the machine state specified it is now possible to explain the
143 instruction set of ME virtual machines. They are grouped roughly by the
144 machine state they influence and/or query.
145
146 TERMINAL MATCHING
147 First the instructions to match tokens from the input stream, and by
148 extension all terminal symbols.
149
150 These instructions are the only ones which may retrieve a new token
151 from the input stream. This is a may and not a will because the
152 instructions will a retrieve new token if, and only if the current
153 location CL is at the head of the stream. If the machine has back‐
154 tracked (see icl_rewind) the instructions will retrieve the token to
155 compare against from the internal cache.
156
157 ict_advance message
158 This instruction tries to advance to the next token in the input
159 stream, i.e. the one after the current location CL. The instruc‐
160 tion will fail if, and only if the end of the input stream is
161 reached, i.e. if there is no next token.
162
163 The sucess/failure of the instruction is remembered in the match
164 status OK. In the case of failure the error status ER is set to
165 the current location and the message message. In the case of
166 success the error status ER is cleared, the new token is made
167 the current token CT, and the new location is made the current
168 location CL.
169
170 The argument message is a reference to the string to put into
171 the error status ER, if such is needed.
172
173 ict_match_token tok message
174 This instruction tests the current token CT for equality with
175 the argument tok and records the result in the match status OK.
176 The instruction fails if the current token is not equal to tok.
177
178 In case of failure the error status ER is set to the current
179 location CL and the message message, and the current location CL
180 is moved one token backwards. Otherwise, i.e. upon success, the
181 error status ER is cleared and the current location CL is not
182 touched.
183
184 ict_match_tokrange tokbegin tokend message
185 This instruction tests the current token CT for being in the
186 range of tokens from tokbegin to tokend (inclusive) and records
187 the result in the match status OK. The instruction fails if the
188 current token is not inside the range.
189
190 In case of failure the error status ER is set to the current
191 location CL and the message message, and the current location CL
192 is moved one token backwards. Otherwise, i.e. upon success, the
193 error status ER is cleared and the current location CL is not
194 touched.
195
196 ict_match_tokclass code message
197 This instruction tests the current token CT for being a member
198 of the token class code and records the result in the match sta‐
199 tus OK. The instruction fails if the current token is not a mem‐
200 ber of the specified class.
201
202 In case of failure the error status ER is set to the current
203 location CL and the message message, and the current location CL
204 is moved one token backwards. Otherwise, i.e. upon success, the
205 error status ER is cleared and the current location CL is not
206 touched.
207
208 Currently the following classes are legal:
209
210 alnum A token is accepted if it is a unicode alphabetical char‐
211 acter, or a digit.
212
213 alpha A token is accepted if it is a unicode alphabetical char‐
214 acter.
215
216 digit A token is accepted if it is a unicode digit character.
217
218 xdigit A token is accepted if it is a hexadecimal digit charac‐
219 ter.
220
221 punct A token is accepted if it is a unicode punctuation char‐
222 acter.
223
224 space A token is accepted if it is a unicode space character.
225
226 NONTERMINAL MATCHING
227 The instructions in this section handle the matching of nonterminal
228 symbols. They query the nonterminal cache NC for saved information, and
229 put such information into the cache.
230
231 The usage of the cache is a performance aid for backtracking parsers,
232 allowing them to avoid an expensive rematch of complex nonterminal sym‐
233 bols if they have been encountered before.
234
235 inc_restore branchlabel nt
236 This instruction checks if the nonterminal cache NC contains
237 information about the nonterminal symbol nt, at the current
238 location CL. If that is the case the instruction will update the
239 machine state (current location CL, match status OK, semantic
240 value SV, and error status ER) with the found information and
241 continue execution at the instruction refered to by the branch‐
242 label. The new current location CL will be the last token
243 matched by the nonterminal symbol, i.e. belonging to it.
244
245 If no information was found the instruction will continue execu‐
246 tion at the next instruction.
247
248 Together with icf_ntcall it is possible to generate code for
249 memoized and non-memoized matching of nonterminal symbols,
250 either as subroutine calls, or inlined in the caller.
251
252 inc_save nt
253 This instruction saves the current state of the machine (current
254 location CL, match status OK, semantic value SV, and error sta‐
255 tus ER), to the nonterminal cache NC. It will also pop an entry
256 from the location stack LS and save it as the start location of
257 the match.
258
259 It is expected to be called at the end of matching a nonterminal
260 symbol, with nt the name of the nonterminal symbol the code was
261 working on. This allows the instruction inc_restore to check for
262 and retrieve the data, should we have to match this nonterminal
263 symbol at the same location again, during backtracking.
264
265 icf_ntcall branchlabel
266 This instruction invokes the code for matching the nonterminal
267 symbol nt as a subroutine. To this end it stores the current
268 program counter PC on the return stack RS, the current location
269 CL on the location stack LS, and then continues execution at the
270 address branchlabel.
271
272 The next matching icf_ntreturn will cause the execution to con‐
273 tinue at the instruction coming after the call.
274
275 icf_ntreturn
276 This instruction will pop an entry from the return stack RS,
277 assign it to the program counter PC, and then continue execution
278 at the new address.
279
280 UNCONDITIONAL MATCHING
281 The instructions in this section are the remaining match operators.
282 They change the match status OK directly and unconditionally.
283
284 iok_ok This instruction sets the match status OK to true, indicating a
285 successful match.
286
287 iok_fail
288 This instruction sets the match status OK to false, indicating a
289 failed match.
290
291 iok_negate
292 This instruction negates the match status OK, turning a failure
293 into a success and vice versa.
294
295 CONTROL FLOW
296 The instructions in this section implement both conditional and uncon‐
297 ditional control flow. The conditional jumps query the match status OK.
298
299 icf_jalways branchlabel
300 This instruction sets the program counter PC to the address
301 specified by branchlabel and then continues execution from
302 there. This is an unconditional jump.
303
304 icf_jok branchlabel
305 This instruction sets the program counter PC to the address
306 specified by branchlabel. This happens if, and only if the match
307 status OK indicates a success. Otherwise it simply continues
308 execution at the next instruction. This is a conditional jump.
309
310 icf_jfail branchlabel
311 This instruction sets the program counter PC to the address
312 specified by branchlabel. This happens if, and only if the match
313 status OK indicates a failure. Otherwise it simply continues
314 execution at the next instruction. This is a conditional jump.
315
316 icf_halt
317 This instruction halts the machine and blocks any further execu‐
318 tion.
319
320 INPUT LOCATION HANDLING
321 The instructions in this section are for backtracking, they manipulate
322 the current location CL of the machine state. They allow a user of the
323 machine to query and save locations in the input, and to rewind the
324 current location CL to saved locations, making them one of the compo‐
325 nents enabling the implementation of backtracking parsers.
326
327 icl_push
328 This instruction pushes a copy of the current location CL on the
329 location stack LS.
330
331 icl_rewind
332 This instruction pops an entry from the location stack LS and
333 then moves the current location CL back to this point in the
334 input.
335
336 icl_pop
337 This instruction pops an entry from the location stack LS and
338 discards it.
339
340 ERROR HANDLING
341 The instructions in this section provide read and write access to the
342 error status ER of the machine.
343
344 ier_push
345 This instruction pushes a copy of the current error status ER on
346 the error stack ES.
347
348 ier_clear
349 This instruction clears the error status ER.
350
351 ier_nonterminal message
352 This instruction checks if the error status ER contains an error
353 whose location is just past the location found in the top entry
354 of the location stack LS. Nothing happens if no such error is
355 found. Otherwise the found error is replaced by an error at the
356 location found on the stack, having the message message.
357
358 ier_merge
359 This instruction pops an entry from the error stack ES, merges
360 it with the current error status ER and stores the result of the
361 merge as the new error status ER.
362
363 The merge is performed as described below:
364
365 If one of the two error states is empty the other is chosen. If
366 neither error state is empty, and refering to different loca‐
367 tions, then the error state with the location further in the
368 input is chosen. If both error states refer to the same location
369 their messages are merged (with removing duplicates).
370
371 SEMANTIC VALUES
372 The instructions in this section manipulate the semantic value SV.
373
374 isv_clear
375 This instruction clears the semantic value SV.
376
377 isv_terminal
378 This instruction creates a terminal AST node for the current
379 token CT, makes it the semantic value SV, and also pushes the
380 node on the AST stack AS.
381
382 isv_nonterminal_leaf nt
383 This instruction creates a nonterminal AST node without any
384 children for the nonterminal nt, and makes it the semantic value
385 SV.
386
387 This instruction should be executed if, and only if the match
388 status OK indicates a success. In the case of a failure
389 isv_clear should be called.
390
391 isv_nonterminal_range nt
392 This instruction creates a nonterminal AST node for the nonter‐
393 minal nt, with a single terminal node as its child, and makes
394 this AST the semantic value SV. The terminal node refers to the
395 input string from the location found on top of the location
396 stack LS to the current location CL (both inclusive).
397
398 This instruction should be executed if, and only if the match
399 status OK indicates a success. In the case of a failure
400 isv_clear should be called.
401
402 isv_nonterminal_reduce nt
403 This instruction creates a nonterminal AST node for the nonter‐
404 minal nt and makes it the semantic value SV.
405
406 All entries on the AST stack AS above the marker found in the
407 top entry of the AST Marker stack MS become children of the new
408 node, with the entry at the stack top becoming the rightmost
409 child. If the AST Marker stack MS is empty the whole stack is
410 used. The AST marker stack MS is left unchanged.
411
412 This instruction should be executed if, and only if the match
413 status OK indicates a success. In the case of a failure
414 isv_clear should be called.
415
416 AST STACK HANDLING
417 The instructions in this section manipulate the AST stack AS, and the
418 AST Marker stack MS.
419
420 ias_push
421 This instruction pushes the semantic value SV on the AST stack
422 AS.
423
424 ias_mark
425 This instruction pushes a marker for the current state of the
426 AST stack AS on the AST Marker stack MS.
427
428 ias_mrewind
429 This instruction pops an entry from the AST Marker stack MS and
430 then proceeds to pop entries from the AST stack AS until the
431 state represented by the popped marker has been reached again.
432 Nothing is done if the AST stack AS is already smaller than
433 indicated by the popped marker.
434
435 ias_mpop
436 This instruction pops an entry from the AST Marker stack MS and
437 discards it.
438
440 grammar, parsing, virtual machine
441
443 Copyright (c) 2005 Andreas Kupries <andreas_kupries@users.sourceforge.net>
444
445
446
447
448grammar_me 0.1 grammar::me_vm(n)