interop(3) ANTLR3C interop(3)

interop - Interacting with the Generated Code

The main way to interact with the generated code is via action code
placed within { and } characters in your rules. In general, you are
advised to keep the code you embed within these actions, and the
grammar itself, to an absolute minimum. Rather than embed code directly
in your grammar, you should construct an API that is called from the
actions within your grammar. This way you will keep the grammar clean
and maintainable and separate the code generators or other code from
the definition of the grammar itself.

However, when you wish to call your API functions, or insert small
pieces of code that do not warrant external functions, you will need to
access elements of tokens, return elements from parser rules and
perhaps the internals of the recognizer itself. The C runtime provides
a number of MACROs that you can use within your action code. It also
provides a number of performant structures that you may find useful for
building symbol tables, lists, tries, stacks, arrays and so on (all of
which are managed so that your memory allocation problems are
minimized).

The C target does not differ from the Java target in any major ways
here, and you should consult the standard documentation for the use of
parameters on rules and the returns clause. You should be aware, though,
that the rules generate C function calls and therefore the input and
returns clauses are subject to the constraints of C scoping.

You should note that if your parser rule returns more than a single
entity, then the return type of the generated rule function is a
struct, which is returned by value. This is also the case if your rule
is part of a tree building grammar (uses the output=AST; option).

Other than the notes above, you can use any pre-declared type as an
input or output parameter for your rule.
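As a rough illustration of the struct-by-value convention described above, the following plain C sketch mimics what a rule with two return elements might look like. The names sum_return and sum_rule are hypothetical stand-ins, not identifiers produced by ANTLR, and the real generated struct carries additional members.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the struct generated for a rule that
 * declares two return elements, e.g. returns [int value, int count]. */
typedef struct sum_return {
    int value;   /* first declared return element  */
    int count;   /* second declared return element */
} sum_return;

/* Stand-in for a generated rule function: the struct is filled in
 * locally and handed back to the caller by value. */
static sum_return sum_rule(const int *items, int n)
{
    sum_return retval = { 0, 0 };

    assert(items != NULL || n == 0);
    for (int i = 0; i < n; i++) {
        retval.value += items[i];
        retval.count++;
    }
    return retval;   /* copied out; no heap allocation is involved */
}
```

Because the struct is copied on return, the caller owns its copy outright and nothing needs to be freed.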
43
You are responsible for allocating and freeing any memory used by your
own constructs. ANTLR will track and release any memory allocated
internally for tokens, trees, stacks, scopes and so on. This memory is
returned to the malloc pool when you call the free method of any ANTLR3
produced structure.

For performance reasons, and to avoid thrashing the malloc allocation
system, memory for many elements of your generated parser is allocated
in chunks and parcelled out by factories. For instance, memory for
tokens is created as an array of tokens, and a token factory hands out
the next available slot to the lexer. When you free the lexer, the
allocated memory is returned to the pool. The same applies to 'strings'
that contain the token text and various other text elements accessed
within the lexer.

The only side effect of this is that after your parse and analysis is
complete, if you wish to retain anything generated automatically, you
must copy it before freeing the recognizer structures. In practice it
is usually practical to retain the recognizer context objects until
your processing is complete or to use your own allocation scheme for
generating output etc.

The advantage of using object factories is of course that memory leaks
and accessing de-allocated memory are bugs that rarely occur within the
ANTLR3 C runtime. Further, allocating memory for tokens, trees and so
on is very fast.
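The chunk-and-factory idea can be sketched in a few lines of standalone C. This is a toy model only: token_factory, factory_new_token and POOL_CHUNK are names invented for the illustration and do not correspond to the runtime's actual factory types or sizes.

```c
#include <assert.h>
#include <stdlib.h>

#define POOL_CHUNK 4   /* tokens per chunk; deliberately tiny for the demo */

typedef struct token { int type; int start; int stop; } token;

typedef struct token_factory {
    token **chunks;    /* array of pointers to allocated chunks */
    int     nchunks;   /* number of chunks allocated so far     */
    int     next;      /* next free slot in the newest chunk    */
} token_factory;

static token_factory *factory_new(void)
{
    token_factory *f = calloc(1, sizeof *f);
    if (f != NULL)
        f->next = POOL_CHUNK;   /* forces a chunk on first request */
    return f;
}

/* Hand out the next free slot, allocating a new chunk only when the
 * current one is full. Returns NULL if memory is exhausted. */
static token *factory_new_token(token_factory *f)
{
    assert(f != NULL);
    if (f->next == POOL_CHUNK) {
        token **grown = realloc(f->chunks,
                                (size_t)(f->nchunks + 1) * sizeof *grown);
        if (grown == NULL)
            return NULL;
        f->chunks = grown;
        token *chunk = calloc(POOL_CHUNK, sizeof *chunk);
        if (chunk == NULL)
            return NULL;
        f->chunks[f->nchunks++] = chunk;
        f->next = 0;
    }
    return &f->chunks[f->nchunks - 1][f->next++];
}

/* Releasing the factory releases every token it ever handed out. */
static void factory_free(token_factory *f)
{
    if (f == NULL)
        return;
    for (int i = 0; i < f->nchunks; i++)
        free(f->chunks[i]);
    free(f->chunks);
    free(f);
}
```

The sketch also shows why anything you want to keep must be copied first: once factory_free runs, every token pointer obtained from that factory is dangling.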
71
The CTX macro is a fundamental parameter that is passed as the first
parameter to any generated function concerned with your lexer, parser,
or tree parser. This is the context pointer for your generated
recognizer and is how you invoke the generated functions and access
the data embedded within your generated recognizer. While you can use
it to directly access stacks, scopes and so on, this is not really
recommended as you should use the $xxx references that are available
generically within ANTLR grammars.

The context pointer is used because this removes the need for any
global/static variables at all, either within the generated code or
the C runtime. This is of course fundamental to creating free threading
recognizers. Wherever a function call or rule call requires the ctx
parameter, you either reference it via the CTX macro, or the ctx
parameter is in fact the value returned from calling the 'constructor'
function for your parser/lexer/tree parser (see the code example in 'How
to build Generated Code').
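The pattern of threading a context pointer through every call, rather than keeping global state, can be sketched as follows. Everything here (toy_parser, toy_number, toy_parser_new and the toy CTX definition) is hypothetical and only imitates the shape of a generated recognizer.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct toy_parser toy_parser;

struct toy_parser {
    int   errors;                                /* per-instance state, no globals */
    int  (*number)(toy_parser *ctx, int input);  /* a "rule" function              */
    void (*destroy)(toy_parser *ctx);            /* releases this instance         */
};

#define CTX ctx   /* toy equivalent: names the context parameter in scope */

static int toy_number(toy_parser *ctx, int input)
{
    assert(ctx != NULL);
    if (input < 0) {
        CTX->errors++;   /* state is reached only through the context */
        return 0;
    }
    return input;
}

static void toy_destroy(toy_parser *ctx)
{
    free(ctx);
}

/* The 'constructor': its return value is the context pointer that is
 * passed as the first argument of every later call. */
static toy_parser *toy_parser_new(void)
{
    toy_parser *ctx = calloc(1, sizeof *ctx);
    if (ctx == NULL)
        return NULL;
    ctx->number  = toy_number;
    ctx->destroy = toy_destroy;
    return ctx;
}
```

Two instances created this way share nothing, which is what makes it safe to run one recognizer per thread.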
90
While the author is not fond of using C MACROs to hide code or
structure access, in the case of generated code, they serve two useful
purposes. The first is to simplify the references to internal
constructs, the second is to facilitate the change of any internal
interface without requiring you to port grammars from earlier versions
(just regenerate and recompile). As of release 3.1, these macros are
stable and will only change their usage interface in the event of bugs
being discovered. You are encouraged to use these macros in your code,
rather than access the raw interface.

NB: Macros that act like statements must be terminated with a ';'.
The macro body does not supply this, nor should it. Macros that call
functions are declared with () even if they have no parameters; macros
that reference fields do not have a () declaration.
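The two macro conventions can be seen side by side in a small standalone sketch. CONSUME() and LINE below are toy macros defined for this example over a hypothetical toy_lexer struct; they are not the runtime's definitions.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_lexer {
    int line;       /* a plain field                */
    int consumed;   /* count of characters consumed */
} toy_lexer;

static void toy_consume(toy_lexer *lx)
{
    assert(lx != NULL);
    lx->consumed++;
}

/* Function-style macro: declared with (), body has no trailing ';'. */
#define CONSUME()  toy_consume(ctx)

/* Field-style macro: no (), usable on either side of an assignment. */
#define LINE       (ctx->line)

static int toy_action(toy_lexer *ctx)
{
    CONSUME();          /* the caller supplies the ';' */
    CONSUME();
    LINE = LINE + 1;    /* field macros read and assign like variables */
    return LINE;
}
```

Leaving the ';' out of the macro body keeps constructs such as if (x) CONSUME(); else ... well formed.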
106
There are a number of macros that are useful exclusively within lexer
rules. There are additional macros, common to all recognizers, and these
are documented in the section Common Macros.

LEXER
The LEXER macro returns a pointer to the base lexer object, which is of
type pANTLR3_LEXER. This is not the pointer to your generated lexer,
which is supplied by the CTX macro, but to the common implementation of
a lexer interface, which is supplied to all generated lexers.

LEXSTATE
Provides a pointer to the lexer shared state structure, which is where
the tokens for a rule are constructed and the status elements of the
lexer are kept. This pointer is of type
pANTLR3_RECOGNIZER_SHARED_STATE. In general you should only access
elements of this structure if there is not already another MACRO or
standard $xxxx antlr reference that refers to it.

LA(n)
The LA macro returns the character at offset n from the current input
stream index. The return type is ANTLR3_UINT32. Hence LA(1) returns the
character at the current input position (the character that will be
consumed next), LA(-1) returns the character that has just been
consumed and so on. The LA(n) macro is useful for constructing semantic
predicates in lexer rules. The reference LA(0) is undefined and will
cause an error in your lexer.
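The indexing convention for LA(n) is easy to get wrong by one, so here is a toy model of it in standalone C. The toy_stream type, toy_la function and TOY_EOF value are invented for the sketch; the behaviour at the stream boundaries is an assumption of the example, not something the text above specifies.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_EOF 0xFFFFFFFFu   /* assumed out-of-range marker for this sketch */

typedef struct toy_stream {
    const char *data;
    size_t      len;
    size_t      pos;   /* index of the next character to be consumed */
} toy_stream;

/* n == 1 is the next unconsumed character, n == -1 the character just
 * consumed. n == 0 has no meaning and is rejected outright. */
static unsigned int toy_la(const toy_stream *s, int n)
{
    assert(s != NULL && n != 0);

    long idx = (long)s->pos + (n > 0 ? n - 1 : n);
    if (idx < 0 || (size_t)idx >= s->len)
        return TOY_EOF;
    return (unsigned char)s->data[idx];
}
```

Note the asymmetry: positive offsets subtract one because LA(1) is the character at the cursor, while negative offsets are used as they are.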
134
GETCHARINDEX()
The GETCHARINDEX macro returns the index of the current character
position as a 0 based offset from the start of the input stream. It
returns a value type of ANTLR3_UINT32.

GETLINE()
The GETLINE macro returns the line number of the current character
(LA(1)) in the input stream. It returns a value type of ANTLR3_UINT32.
Note that the line number is incremented automatically by an input
stream when it sees the input character '\n'.

GETTEXT()
The GETTEXT macro returns the text currently matched by the lexer rule.
In general you should use the generic $text reference in ANTLR to
retrieve this. The return type is a reference type of pANTLR3_STRING
which allows you to manipulate the text you have retrieved (NB this
does not change the input stream, only the text you copy from the input
stream when you use this MACRO or $text).

The reference $text->chars or GETTEXT()->chars will reference a pointer
to the '\0' terminated character string that the ANTLR3 pANTLR3_STRING
represents. String space is allocated automatically as well as the
structure that holds the string. The pANTLR3_STRING_FACTORY associated
with the lexer handles this and when you close the lexer, it will
automatically free any space allocated for strings and their
structures.

GETCHARPOSITIONINLINE()
The GETCHARPOSITIONINLINE macro returns the zero based offset of
character LA(1) from the start of the current input line. See the macro
GETLINE for details on what the line number means.
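How the three position values relate to one another can be shown with a toy input stream. The names toy_input, toy_input_init and toy_consume are hypothetical, and starting the line count at 1 is an assumption of this sketch rather than something stated above.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_input {
    const char  *data;
    size_t       len;
    size_t       pos;    /* 0-based offset from the start of the input  */
    unsigned int line;   /* line of the next character (assumed from 1) */
    unsigned int col;    /* 0-based offset within the current line      */
} toy_input;

static void toy_input_init(toy_input *in, const char *data, size_t len)
{
    assert(in != NULL);
    in->data = data;
    in->len  = len;
    in->pos  = 0;
    in->line = 1;
    in->col  = 0;
}

/* Consume one character, keeping all three position values in step. */
static void toy_consume(toy_input *in)
{
    assert(in != NULL);
    if (in->pos >= in->len)
        return;                 /* nothing left to consume */
    if (in->data[in->pos] == '\n') {
        in->line++;             /* newline: next char starts a new line */
        in->col = 0;
    } else {
        in->col++;
    }
    in->pos++;
}
```

In this model pos plays the role of GETCHARINDEX(), line of GETLINE() and col of GETCHARPOSITIONINLINE().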
166
EMIT()
The macro EMIT causes the text range currently matched by the lexer
rule to be emitted immediately as the token for the rule. Subsequent
text is matched but ignored. The type used for the token is the
name of the lexer rule or, if you have changed this by using $type =
XXX;, the type XXX is used.

EMITNEW(t)
The macro EMITNEW causes the supplied token reference t to be used as
the token emitted by the rule. The parameter t must be of type
pANTLR3_COMMON_TOKEN.

INDEX()
The INDEX macro returns the current input position according to the
input stream. It is not guaranteed to be the character offset in the
input stream but is instead used as a value for marking and rewinding
to specific points in the input stream. Use the macro GETCHARINDEX() to
find out the position of LA(1) in the input stream.

PUSHSTREAM(str)
The PUSHSTREAM macro, in conjunction with the POPSTREAM macro (usually
called internally in the runtime), can be used to stack many input
streams to the lexer, and implement constructs such as the C
pre-processor #include directive.

An input stream that is pushed on to the stack becomes the current
input stream for the lexer and the state of the previous stream is
automatically saved. The input stream will be automatically popped from
the stack when it is exhausted by the lexer. You may use the macro
POPSTREAM to return to the previous input stream prior to exhausting
the currently stacked input stream.

Here is an example of using the macro in a lexer to implement the C
#include pre-processor directive:
201
    fragment
    STRING_GUTS : (~('\\'|'"') )* ;

    LINE_COMMAND
        : '#' (' ' | '\t')*
          (
              'include' (' ' | '\t')+ '"' file = STRING_GUTS '"' (' ' | '\t')* '\r'? '\n'
              {
                  pANTLR3_STRING        fName;
                  pANTLR3_INPUT_STREAM  in;

                  // Create an initial string, then take a substring
                  // We can do this by messing with the start and end
                  // pointers of tokens and so on. This shows a reasonable way to
                  // manipulate strings.
                  //
                  fName = $file.text;
                  printf("Including file '%s'\n", fName->chars);

                  // Create a new input stream and take advantage of built in stream stacking
                  // in C target runtime.
                  //
                  in = antlr38BitFileStreamNew(fName->chars);
                  PUSHSTREAM(in);

                  // Note that the input stream is not closed when it EOFs, I don't bother
                  // to do it here, but it is up to you to track streams created like this
                  // and destroy them when the whole parse session is complete. Remember that you
                  // don't want to do this until all tokens have been manipulated all the way through
                  // your tree parsers etc as the token does not store the text it just refers
                  // back to the input stream and trying to get the text for it will abort if you
                  // close the input stream too early.
                  //
              }
            | (('0'..'9')=>('0'..'9'))+ ~('\n'|'\r')* '\r'? '\n'
          )
          {$channel=HIDDEN;}
        ;
241
POPSTREAM()
Assuming that you have stacked an input stream using the PUSHSTREAM
macro, you can remove it from the stream stack and revert to the
previous input stream. You should be careful to pop the stream at an
appropriate point in your lexer action, so you do not match characters
from one stream with those from another in the same rule (unless this
is what you want to do).
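The stacking behaviour described for PUSHSTREAM and POPSTREAM can be modelled with a small fixed-depth stack. This is a toy sketch under stated assumptions: toy_lexer, toy_push, toy_pop and toy_next are invented names, streams are plain NUL-terminated strings, and the fixed MAX_DEPTH limit exists only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 8
#define TOY_EOF   (-1)

typedef struct toy_stream {
    const char *data;   /* NUL-terminated text, borrowed from the caller */
    size_t      pos;    /* saved read position within this stream        */
} toy_stream;

typedef struct toy_lexer {
    toy_stream stack[MAX_DEPTH];
    int        depth;   /* number of streams currently stacked */
} toy_lexer;

/* Make text the current input; the previous stream keeps its position. */
static int toy_push(toy_lexer *lx, const char *text)
{
    assert(lx != NULL && text != NULL);
    if (lx->depth >= MAX_DEPTH)
        return 0;
    lx->stack[lx->depth].data = text;
    lx->stack[lx->depth].pos  = 0;
    lx->depth++;
    return 1;
}

/* Explicit pop: abandon the current stream before it is exhausted. */
static void toy_pop(toy_lexer *lx)
{
    assert(lx != NULL);
    if (lx->depth > 0)
        lx->depth--;
}

/* Read one character, popping exhausted streams automatically. */
static int toy_next(toy_lexer *lx)
{
    assert(lx != NULL);
    while (lx->depth > 0) {
        toy_stream *s = &lx->stack[lx->depth - 1];
        if (s->data[s->pos] != '\0')
            return (unsigned char)s->data[s->pos++];
        toy_pop(lx);   /* exhausted: resume the stream underneath */
    }
    return TOY_EOF;
}
```

Reading resumes in the outer stream exactly where it left off, which is the behaviour an #include style directive relies on.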
249
SETTEXT(str)
A token manufactured by the lexer does not actually physically store
the text from the input stream to which it matches. The token string is
instead created only if you ask for the text. However, if you wish to
change the text that the token represents you can use this macro to set
it explicitly. Note that this does not change the input stream text but
associates the supplied pANTLR3_STRING with the token. This string is
then returned when the parser and tree parser reference the token via
the $xxx.text reference.
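The lazy-text behaviour, and the way an explicit override leaves the input untouched, can be sketched with a toy token type. The names toy_token, toy_get_text, toy_set_text and toy_token_clear are hypothetical, and plain C strings stand in for pANTLR3_STRING.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct toy_token {
    const char *input;   /* the input text, borrowed and never modified */
    size_t      start;   /* first matched character (inclusive)         */
    size_t      stop;    /* last matched character (inclusive)          */
    char       *text;    /* NULL until requested or explicitly set      */
} toy_token;

static char *toy_copy(const char *src, size_t n)
{
    char *out = malloc(n + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, src, n);
    out[n] = '\0';
    return out;
}

/* The text is materialised from the input only on first request. */
static const char *toy_get_text(toy_token *t)
{
    assert(t != NULL);
    if (t->text == NULL)
        t->text = toy_copy(t->input + t->start, t->stop - t->start + 1);
    return t->text;   /* NULL only if allocation failed */
}

/* Override the token's text; the input stream is left as it was. */
static int toy_set_text(toy_token *t, const char *s)
{
    assert(t != NULL && s != NULL);
    char *copy = toy_copy(s, strlen(s));
    if (copy == NULL)
        return 0;
    free(t->text);
    t->text = copy;
    return 1;
}

static void toy_token_clear(toy_token *t)
{
    assert(t != NULL);
    free(t->text);
    t->text = NULL;
}
```

After toy_set_text, later calls to toy_get_text return the override rather than the matched range, mirroring what $xxx.text sees after SETTEXT.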
259
USER1 USER2 USER3 and CUSTOM
While you can create your own custom token class and have the lexer
deal with this, this is a lot of work compared to the trivial
inheritance that can be achieved in the Java target. In many cases
though, all that is needed is the addition of a few data items such as
an integer or a pointer. Rather than require C programmers to create
complicated structures just to add a few data items, the C target
provides a few custom fields in the standard token, which will fulfil
the needs of most lexers and parsers.

The token fields user1, user2, and user3 are all value types of
ANTLR3_UINT32. In the parser you can reference these fields directly
from the token: x=TOKNAME { $x->user1 ... but when you are building the
token in the lexer, you must assign to the fields using the macros
USER1, USER2, or USER3. As in:

    LEXTOK: 'AAAAA' { USER1 = 99; } ;

PARSER
The PARSER macro returns a pointer to the base parser or tree parser
object, which is of type pANTLR3_PARSER or pANTLR3_TREE_PARSER. This
is not the pointer to your generated parser, which is supplied by the
CTX macro, but to the common implementation of a parser or tree parser
interface, which is supplied to all generated parsers.

INDEX()
When used in the parser, the INDEX macro returns the position of the
current token (LT(1)) in the input token stream. It can be used for
MARK and REWIND operations.

LT(n) and LA(n)
In the parser, the macro LT(n) returns the pANTLR3_COMMON_TOKEN at
offset n from the current token stream input position. The macro LA(n)
returns the token type of the token at position n. The value n cannot
be zero, and such a reference will return NULL and possibly cause an
error. LA(1) refers to the token that is about to be recognized and
LA(-1) to the token that has just been recognized. Values of n that
exceed the limits of the token stream boundaries will return NULL.

PSRSTATE
Returns the shared state pointer of type
pANTLR3_RECOGNIZER_SHARED_STATE. This is not generally useful to the
grammar programmer as the useful elements have generic $xxx references
built in to ANTLR.

ADAPTOR
When building an AST via a parser, the work of constructing and
manipulating trees is done by a supplied adaptor class. The default
class is usually fine for most tree operations but if you wish to build
your own specialized linked/tree structure, then you may need to
reference the adaptor you supply directly. The ADAPTOR macro returns
the reference to the tree adaptor which is always of type
pANTLR3_BASE_TREE_ADAPTOR, even if it is your custom adaptor.
314
RECOGNIZER
Returns a reference type of pANTLR3_BASE_RECOGNIZER, which is the base
functionality supplied to all recognizers, whether lexers, parsers or
tree parsers. You can override methods in this interface by installing
your own function pointers (once you know what you are doing).

INPUT
Returns a reference to the input stream of the appropriate type for the
recognizer. In a lexer this macro returns a reference type of
pANTLR3_INPUT_STREAM, in a parser this is type pANTLR3_TOKEN_STREAM and
in a tree parser this is type pANTLR3_COMMON_TREE_NODE_STREAM. You can
of course provide your own implementations of any of these interfaces.

MARK()
This macro will cause the input stream for the current recognizer to be
marked with a checkpoint. It will return a value type of ANTLR3_MARKER
which you can use as the parameter to a REWIND macro to return to the
marked point in the input.

If you know you will only ever rewind to the last MARK, then you can
ignore the return value of this macro and just use the REWINDLAST macro
to return to the last MARK that was set in the input stream.

REWIND(m)
Rewinds the appropriate input stream back to the marked checkpoint
returned from a prior MARK macro call and supplied as the parameter m
to the REWIND(m) macro.

REWINDLAST()
Rewinds the current input stream (characters, tokens, tree nodes) back
to the last checkpoint marker created by a MARK macro call. Fails
silently if there was no prior MARK call.

SEEK(n)
Causes the input stream to position itself directly at offset n in the
stream. Works for all input stream types: lexer, parser and tree
parser.
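The relationship between MARK, REWIND, REWINDLAST and SEEK can be sketched with a toy stream. The names toy_stream, toy_mark, toy_rewind, toy_rewind_last and toy_seek are invented for the example, and the toy marker is simply a stream offset, which real input streams do not guarantee.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_stream {
    const char *data;
    size_t      len;
    size_t      pos;         /* current position in the stream   */
    size_t      last_mark;   /* most recent checkpoint, if any   */
    int         has_mark;    /* non-zero once a mark has been set */
} toy_stream;

/* Record a checkpoint and return a marker for a later rewind. */
static size_t toy_mark(toy_stream *s)
{
    assert(s != NULL);
    s->last_mark = s->pos;
    s->has_mark  = 1;
    return s->pos;
}

/* Return to a checkpoint previously obtained from toy_mark(). */
static void toy_rewind(toy_stream *s, size_t marker)
{
    assert(s != NULL && marker <= s->len);
    s->pos = marker;
}

/* Return to the most recent checkpoint; does nothing without one. */
static void toy_rewind_last(toy_stream *s)
{
    assert(s != NULL);
    if (s->has_mark)
        s->pos = s->last_mark;
}

/* Jump straight to offset n, clamped to the end of the stream. */
static void toy_seek(toy_stream *s, size_t n)
{
    assert(s != NULL);
    s->pos = (n <= s->len) ? n : s->len;
}
```

Keeping the marker returned by toy_mark matters only when several checkpoints are live at once; with a single checkpoint, toy_rewind_last is enough.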
353
Version 3.3.1 Fri May 3 2019 interop(3)