antlr3-interop(3)

interop(3)                          ANTLR3C                         interop(3)

NAME

       interop - Interoperation Within Rule Actions

Introduction

       The main way to interact with the generated code is via action code
       placed within { and } characters in your rules. In general, you are
       advised to keep the code you embed within these actions, and within
       the grammar itself, to an absolute minimum. Rather than embed code
       directly in your grammar, you should construct an API that is called
       from the actions within your grammar. This way you will keep the
       grammar clean and maintainable, and separate the code generators or
       other code from the definition of the grammar itself.

       However, when you wish to call your API functions, or insert small
       pieces of code that do not warrant external functions, you will need to
       access elements of tokens, return elements from parser rules and
       perhaps the internals of the recognizer itself. The C runtime provides
       a number of MACROs that you can use within your action code. It also
       provides a number of performant structures that you may find useful for
       building symbol tables, lists, tries, stacks, arrays and so on (all of
       which are managed so that your memory allocation problems are
       minimized).
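
       As an illustration of keeping actions small, the sketch below shows
       the kind of stand-alone API a grammar action might call. It is not
       part of the ANTLR3 C runtime: SymTab, symtab_define and symtab_lookup
       are hypothetical names invented for this example, and the table stores
       the name pointers without copying them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical application API; none of these names come from ANTLR. */
#define SYMTAB_MAX 64

typedef struct {
    const char *names[SYMTAB_MAX];   /* borrowed pointers, not copies */
    int         values[SYMTAB_MAX];
    size_t      count;
} SymTab;

/* Returns 1 on success, 0 if the table is full. */
static int symtab_define(SymTab *t, const char *name, int value)
{
    if (t->count >= SYMTAB_MAX) return 0;
    t->names[t->count]  = name;
    t->values[t->count] = value;
    t->count++;
    return 1;
}

/* Returns 1 and stores the value in *out if name is known, else 0. */
static int symtab_lookup(const SymTab *t, const char *name, int *out)
{
    for (size_t i = t->count; i-- > 0; ) {   /* newest definition wins */
        if (strcmp(t->names[i], name) == 0) {
            *out = t->values[i];
            return 1;
        }
    }
    return 0;
}
```

       With an API of this shape, an action in the grammar reduces to a
       single call such as symtab_define(...), and the table logic can be
       compiled and tested without regenerating the recognizer.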

Parameters and Returns from Parser Rules

       The C target does not differ from the Java target in any major ways
       here, and you should consult the standard documentation for the use of
       parameters on rules and the returns clause. You should be aware,
       though, that the rules generate C function calls and therefore the
       input and returns clauses are subject to the constraints of C scoping.

       You should note that if your parser rule returns more than a single
       entity, then the return type of the generated rule function is a
       struct, which is returned by value. This is also the case if your rule
       is part of a tree building grammar (one that uses the output=AST;
       option).

       Other than the notes above, you can use any pre-declared type as an
       input or output parameter for your rule.
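
       The by-value struct return described in this section can be pictured
       with plain C. This is only a sketch: range_return and range_rule are
       hypothetical stand-ins, and the type and function names that ANTLR
       actually generates for a rule will differ.

```c
#include <assert.h>

/* Hypothetical stand-in for the struct generated for a rule such as
 *     range returns [int lo, int hi] : ... ;
 * The real generated names are different. */
typedef struct {
    int lo;
    int hi;
} range_return;

/* The rule function fills in a local struct and returns it by value. */
static range_return range_rule(int a, int b)
{
    range_return retval;
    retval.lo = a < b ? a : b;
    retval.hi = a < b ? b : a;
    return retval;   /* copied to the caller; no heap allocation involved */
}
```

       Because the struct is copied on return, the caller owns its copy and
       nothing needs to be freed afterwards.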

Memory Management

       You are responsible for allocating and freeing any memory used by your
       own constructs; ANTLR will track and release any memory allocated
       internally for tokens, trees, stacks, scopes and so on. This memory is
       returned to the malloc pool when you call the free method of any ANTLR3
       produced structure.

       For performance reasons, and to avoid thrashing the malloc allocation
       system, memory for many elements of your generated parser is allocated
       in chunks and parcelled out by factories. For instance, memory for
       tokens is created as an array of tokens, and a token factory hands out
       the next available slot to the lexer. When you free the lexer, the
       allocated memory is returned to the pool. The same applies to 'strings'
       that contain the token text and various other text elements accessed
       within the lexer.

       The only side effect of this is that after your parse and analysis is
       complete, if you wish to retain anything generated automatically, you
       must copy it before freeing the recognizer structures. In practice it
       is usually simplest to retain the recognizer context objects until
       your processing is complete, or to use your own allocation scheme for
       generating output etc.

       The advantage of using object factories is of course that memory leaks
       and accessing de-allocated memory are bugs that rarely occur within the
       ANTLR3 C runtime. Further, allocating memory for tokens, trees and so
       on is very fast.
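
       The chunk-and-factory idea can be sketched in a few lines of C. This
       is not the runtime's implementation; Token, TokenFactory, POOL_CHUNK
       and the factory_* functions are assumptions made for the example. It
       shows slots being handed out from fixed-size chunks and everything
       being released by one call.

```c
#include <assert.h>
#include <stdlib.h>

#define POOL_CHUNK 4   /* tokens per chunk; tiny so the example grows quickly */

typedef struct { int type; int start; int stop; } Token;

typedef struct {
    Token **chunks;    /* array of chunk pointers */
    size_t  nchunks;   /* chunks allocated so far */
    size_t  next;      /* next free slot in the newest chunk */
} TokenFactory;

static void factory_init(TokenFactory *f)
{
    f->chunks  = NULL;
    f->nchunks = 0;
    f->next    = POOL_CHUNK;   /* forces a chunk allocation on first use */
}

/* Hand out the next free slot, allocating a new chunk only when needed. */
static Token *factory_new_token(TokenFactory *f)
{
    if (f->next == POOL_CHUNK) {
        Token **grown = realloc(f->chunks, (f->nchunks + 1) * sizeof *grown);
        if (grown == NULL) return NULL;
        f->chunks = grown;
        Token *chunk = calloc(POOL_CHUNK, sizeof *chunk);
        if (chunk == NULL) return NULL;
        f->chunks[f->nchunks++] = chunk;
        f->next = 0;
    }
    return &f->chunks[f->nchunks - 1][f->next++];
}

/* One call releases every token; outstanding Token pointers become invalid. */
static void factory_free(TokenFactory *f)
{
    for (size_t i = 0; i < f->nchunks; i++) free(f->chunks[i]);
    free(f->chunks);
    factory_init(f);
}
```

       Tokens keep their addresses as the pool grows because only the array
       of chunk pointers is reallocated, never the chunks themselves. The
       final comment on factory_free is the reason anything you wish to keep
       must be copied before the owning structure is freed.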

The CTX Macro

       The CTX macro is a fundamental parameter that is passed as the first
       parameter to any generated function concerned with your lexer, parser,
       or tree parser. This is the context pointer for your generated
       recognizer and is how you invoke the generated functions, and access
       the data embedded within your generated recognizer. While you can use
       it to directly access stacks, scopes and so on, this is not really
       recommended as you should use the $xxx references that are available
       generically within ANTLR grammars.

       The context pointer is used because this removes the need for any
       global/static variables at all, either within the generated code, or
       the C runtime. This is of course fundamental to creating free threading
       recognizers. Wherever a function call or rule call requires the ctx
       parameter, you either reference it via the CTX macro, or the ctx
       parameter is in fact the value returned by the 'constructor' function
       for your parser/lexer/tree parser (see code example in 'How to build
       Generated Code').
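
       The context-pointer pattern itself is ordinary C and can be sketched
       without the runtime. Recognizer, recognizer_new and enter_rule are
       hypothetical names for this illustration only; the point is that all
       state hangs off the pointer returned by the 'constructor', so two
       instances never share data.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical recognizer context: all state lives here, none in globals. */
typedef struct Recognizer {
    int depth;                                  /* per-instance state */
    int (*enter_rule)(struct Recognizer *ctx);  /* "method" taking ctx first */
} Recognizer, *pRecognizer;

static int enter_rule_impl(pRecognizer ctx)
{
    return ++ctx->depth;
}

/* 'Constructor': its return value is the ctx pointer for every later call. */
static pRecognizer recognizer_new(void)
{
    pRecognizer ctx = malloc(sizeof *ctx);
    if (ctx != NULL) {
        ctx->depth      = 0;
        ctx->enter_rule = enter_rule_impl;
    }
    return ctx;
}
```

       A call then has the shape ctx->enter_rule(ctx), which is the style of
       invocation that a macro such as CTX is there to abbreviate inside
       action code.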

Macro Changes

       While the author is not fond of using C MACROs to hide code or
       structure access, in the case of generated code, they serve two useful
       purposes. The first is to simplify the references to internal
       constructs, the second is to facilitate the change of any internal
       interface without requiring you to port grammars from earlier versions
       (just regenerate and recompile). As of release 3.1, these macros are
       stable and will only change their usage interface in the event of bugs
       being discovered. You are encouraged to use these macros in your code,
       rather than access the raw interface.

       NB: Macros that act like statements must be terminated with a ';'.
       The macro body does not supply this, nor should it. Macros that call
       functions are declared with () even if they have no parameters; macros
       that reference fields do not have a () declaration.
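
       The three conventions in the note above can be shown with a few
       made-up macros. The DEMO_* macros, the Ctx struct and the helper
       functions are hypothetical and are not the definitions used by the
       generated code; they only mirror the calling style.

```c
#include <assert.h>

typedef struct { int line; int emitted; } Ctx;

static int  get_line(Ctx *ctx) { return ctx->line; }
static void emit(Ctx *ctx)     { ctx->emitted++; }

/* Function-like: declared with (), even though it takes no arguments. */
#define DEMO_GETLINE()  get_line(ctx)
/* Statement-like: no trailing ';' in the body; the caller supplies it. */
#define DEMO_EMIT()     emit(ctx)
/* Field reference: no (), so it can be read or assigned. */
#define DEMO_LINE       ctx->line

static int action(Ctx *ctx)
{
    DEMO_LINE = 12;          /* field macro used as an lvalue */
    DEMO_EMIT();             /* terminated by the caller's ';' */
    return DEMO_GETLINE();   /* function macro used in an expression */
}
```

       Leaving the ';' out of the macro body is what allows a function-like
       macro to appear inside a larger expression as well as on a line of its
       own.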

Lexer Macros

       There are a number of macros that are useful exclusively within lexer
       rules. There are additional macros, common to all recognizers, and
       these are documented in the section Macros Common to All Recognizers.

   LEXER
       The LEXER macro returns a pointer to the base lexer object, which is of
       type pANTLR3_LEXER. This is not the pointer to your generated lexer,
       which is supplied by the CTX macro, but to the common implementation of
       a lexer interface, which is supplied to all generated lexers.

   LEXSTATE
       Provides a pointer to the lexer shared state structure, which is where
       the tokens for a rule are constructed and the status elements of the
       lexer are kept. This pointer is of type
       pANTLR3_RECOGNIZER_SHARED_STATE. In general you should only access
       elements of this structure if there is not already another MACRO or
       standard $xxxx antlr reference that refers to it.

   LA(n)
       The LA macro returns the character at offset n relative to the current
       input stream index. The return type is ANTLR3_UINT32. Hence LA(1)
       returns the character at the current input position (the character
       that will be consumed next), LA(-1) returns the character that has just
       been consumed and so on. The LA(n) macro is useful for constructing
       semantic predicates in lexer rules. The reference LA(0) is undefined
       and will cause an error in your lexer.
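
       The indexing convention for LA(n) can be illustrated with a small
       stand-alone sketch. It does not use the runtime: CharStream, la,
       consume and the CHARSTREAM_EOF sentinel are names invented here, and
       returning a sentinel for out-of-range offsets is an assumption of this
       sketch rather than documented behaviour.

```c
#include <assert.h>
#include <stddef.h>

#define CHARSTREAM_EOF 0xFFFFFFFFu   /* assumed sentinel for this sketch */

typedef struct {
    const unsigned char *data;
    size_t               size;
    size_t               pos;   /* index of the next character to consume */
} CharStream;

/* la(s, 1) is the next character, la(s, -1) the one just consumed.
 * la(s, 0) has no meaning; this sketch rejects it with an assertion. */
static unsigned int la(const CharStream *s, int n)
{
    assert(n != 0);
    long idx = (long)s->pos + (n > 0 ? n - 1 : n);
    if (idx < 0 || (size_t)idx >= s->size) return CHARSTREAM_EOF;
    return s->data[idx];
}

static void consume(CharStream *s)
{
    if (s->pos < s->size) s->pos++;
}
```

       Positive offsets are shifted by one so that 1 means "next", while
       negative offsets are used as they are; this leaves no natural meaning
       for 0.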

   GETCHARINDEX()
       The GETCHARINDEX macro returns the index of the current character
       position as a 0 based offset from the start of the input stream. It
       returns a value type of ANTLR3_UINT32.

   GETLINE()
       The GETLINE macro returns the line number of the current character
       (LA(1)) in the input stream. It returns a value type of ANTLR3_UINT32.
       Note that the line number is incremented automatically by an input
       stream when it sees the input character '\n'.

   GETTEXT()
       The GETTEXT macro returns the text currently matched by the lexer rule.
       In general you should use the generic $text reference in ANTLR to
       retrieve this. The return type is a reference type of pANTLR3_STRING
       which allows you to manipulate the text you have retrieved (NB this
       does not change the input stream, only the text you copy from the input
       stream when you use this MACRO or $text).

       The reference $text->chars or GETTEXT()->chars yields a pointer to the
       '\0' terminated character string that the ANTLR3 pANTLR3_STRING
       represents. String space is allocated automatically, as is the
       structure that holds the string. The pANTLR3_STRING_FACTORY associated
       with the lexer handles this, and when you close the lexer it will
       automatically free any space allocated for strings and their
       structures.

   GETCHARPOSITIONINLINE()
       The GETCHARPOSITIONINLINE macro returns the zero based offset of
       character LA(1) from the start of the current input line. See the macro
       GETLINE for details on what the line number means.

   EMIT()
       The macro EMIT causes the text range currently matched by the lexer
       rule to be emitted immediately as the token for the rule. Subsequent
       text is matched but ignored. The type used for the token is the name of
       the lexer rule or, if you have changed this by using $type = XXX;, the
       type XXX is used.

   EMITNEW(t)
       The macro EMITNEW causes the supplied token reference t to be used as
       the token emitted by the rule. The parameter t must be of type
       pANTLR3_COMMON_TOKEN.

   INDEX()
       The INDEX macro returns the current input position according to the
       input stream. It is not guaranteed to be the character offset in the
       input stream but is instead used as a value for marking and rewinding
       to specific points in the input stream. Use the macro GETCHARINDEX() to
       find out the position of the LA(1) in the input stream.

   PUSHSTREAM(str)
       The PUSHSTREAM macro, in conjunction with the POPSTREAM macro (usually
       called internally by the runtime), can be used to stack many input
       streams to the lexer, and implement constructs such as the C
       pre-processor #include directive.

       An input stream that is pushed on to the stack becomes the current
       input stream for the lexer and the state of the previous stream is
       automatically saved. The input stream will be automatically popped from
       the stack when it is exhausted by the lexer. You may use the macro
       POPSTREAM to return to the previous input stream prior to exhausting
       the currently stacked input stream.

       Here is an example of using the macro in a lexer to implement the C
       #include pre-processor directive:

       fragment
       STRING_GUTS :   (~('\\'|'"') )* ;

       LINE_COMMAND
           : '#' (' ' | '\t')*
               (
                   'include' (' ' | '\t')+ '"' file = STRING_GUTS '"' (' ' | '\t')* '\r'? '\n'
                   {
                       pANTLR3_STRING          fName;
                       pANTLR3_INPUT_STREAM    in;

                       // Create an initial string, then take a substring
                       // We can do this by messing with the start and end
                       // pointers of tokens and so on. This shows a reasonable way to
                       // manipulate strings.
                       //
                       fName = $file.text;
                       printf("Including file '%s'\n", fName->chars);

                       // Create a new input stream and take advantage of built in stream stacking
                       // in C target runtime.
                       //
                       in = antlr38BitFileStreamNew(fName->chars);
                       PUSHSTREAM(in);

                       // Note that the input stream is not closed when it EOFs, I don't bother
                       // to do it here, but it is up to you to track streams created like this
                       // and destroy them when the whole parse session is complete. Remember that you
                       // don't want to do this until all tokens have been manipulated all the way through
                       // your tree parsers etc as the token does not store the text it just refers
                       // back to the input stream and trying to get the text for it will abort if you
                       // close the input stream too early.
                       //
                   }
                 | (('0'..'9')=>('0'..'9'))+ ~('\n'|'\r')* '\r'? '\n'
               )
               {$channel=HIDDEN;}
           ;

   POPSTREAM()
       Assuming that you have stacked an input stream using the PUSHSTREAM
       macro, you can remove it from the stream stack and revert to the
       previous input stream. You should be careful to pop the stream at an
       appropriate point in your lexer action, so you do not match characters
       from one stream with those from another in the same rule (unless this
       is what you want to do).
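
       The stacking behaviour described for PUSHSTREAM and POPSTREAM can be
       modelled in a few lines of C. This is a sketch under stated
       assumptions, not the runtime's code: Stream, Lexer, push_stream,
       pop_stream and next_char are hypothetical names, and the streams here
       are plain NUL-terminated strings.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_STREAMS 8

typedef struct { const char *text; size_t pos; } Stream;

typedef struct {
    Stream stack[MAX_STREAMS];
    int    top;                /* index of the current stream, -1 when empty */
} Lexer;

/* Make 'text' the current input; the previous stream keeps its position. */
static int push_stream(Lexer *lx, const char *text)
{
    if (lx->top + 1 >= MAX_STREAMS) return 0;
    lx->top++;
    lx->stack[lx->top].text = text;
    lx->stack[lx->top].pos  = 0;
    return 1;
}

/* Return to the previous stream before the current one is exhausted. */
static void pop_stream(Lexer *lx)
{
    if (lx->top >= 0) lx->top--;
}

/* Next character, popping exhausted streams automatically; -1 at the end. */
static int next_char(Lexer *lx)
{
    while (lx->top >= 0) {
        Stream *s = &lx->stack[lx->top];
        if (s->text[s->pos] != '\0') return (unsigned char)s->text[s->pos++];
        pop_stream(lx);
    }
    return -1;
}
```

       Reading "a" from an outer stream, pushing an inner stream "X", and
       continuing to read yields 'X' and then the rest of the outer stream,
       which is the effect an #include directive needs.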

   SETTEXT(str)
       A token manufactured by the lexer does not actually physically store
       the text from the input stream that it matches. The token string is
       instead created only if you ask for the text. However, if you wish to
       change the text that the token represents, you can use this macro to
       set it explicitly. Note that this does not change the input stream text
       but associates the supplied pANTLR3_STRING with the token. This string
       is then returned when the parser and tree parser reference the token
       via the $xxx.text reference.
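
       The lazy-text behaviour can be pictured with a stand-alone token type.
       The names Token, token_get_text and token_set_text are invented for
       this sketch and say nothing about how the runtime lays out
       pANTLR3_COMMON_TOKEN; the sketch only shows text being built on first
       request and an explicit override leaving the input untouched.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    const char *input;   /* shared input buffer; the token does not own it */
    size_t      start;   /* first character of the match */
    size_t      stop;    /* last character of the match (inclusive) */
    char       *text;    /* NULL until requested or explicitly set */
} Token;

/* Build the text on first request by copying the matched range. */
static const char *token_get_text(Token *t)
{
    if (t->text == NULL) {
        size_t len = t->stop - t->start + 1;
        t->text = malloc(len + 1);
        if (t->text == NULL) return NULL;
        memcpy(t->text, t->input + t->start, len);
        t->text[len] = '\0';
    }
    return t->text;
}

/* Override the text; the input buffer itself is left untouched. */
static int token_set_text(Token *t, const char *s)
{
    char *copy = malloc(strlen(s) + 1);
    if (copy == NULL) return 0;
    strcpy(copy, s);
    free(t->text);
    t->text = copy;
    return 1;
}
```

       In this sketch the caller frees the text field; in the runtime that
       job belongs to the string factory described under GETTEXT.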

   USER1 USER2 USER3 and CUSTOM
       While you can create your own custom token class and have the lexer
       deal with this, this is a lot of work compared to the trivial
       inheritance that can be achieved in the Java target. In many cases
       though, all that is needed is the addition of a few data items such as
       an integer or a pointer. Rather than require C programmers to create
       complicated structures just to add a few data items, the C target
       provides a few custom fields in the standard token, which will fulfil
       the needs of most lexers and parsers.

       The token fields user1, user2, and user3 are all value types of
       ANTLR3_UINT32. In the parser you can reference these fields directly
       from the token: x=TOKNAME { $x->user1 ... but when you are building the
       token in the lexer, you must assign to the fields using the macros
       USER1, USER2, or USER3. As in:

       LEXTOK: 'AAAAA' { USER1 = 99; } ;

Parser and Tree Parser Macros

   PARSER
       The PARSER macro returns a pointer to the base parser or tree parser
       object, which is of type pANTLR3_PARSER or pANTLR3_TREE_PARSER. This
       is not the pointer to your generated parser, which is supplied by the
       CTX macro, but to the common implementation of a parser or tree parser
       interface, which is supplied to all generated parsers.

   INDEX()
       When used in the parser, the INDEX macro returns the position of the
       current token ( LT(1) ) in the input token stream. It can be used for
       MARK and REWIND operations.

   LT(n) and LA(n)
       In the parser, the macro LT(n) returns the pANTLR3_COMMON_TOKEN at
       offset n from the current token stream input position. The macro LA(n)
       returns the token type of the token at position n. The value n cannot
       be zero, and such a reference will return NULL and possibly cause an
       error. LA(1) refers to the token that is about to be recognized and
       LA(-1) to the token that has just been recognized. Values of n that
       exceed the limits of the token stream boundaries will return NULL.

   PSRSTATE
       Returns the shared state pointer of type
       pANTLR3_RECOGNIZER_SHARED_STATE. This is not generally useful to the
       grammar programmer as the useful elements have generic $xxx references
       built in to ANTLR.

   ADAPTOR
       When building an AST via a parser, the work of constructing and
       manipulating trees is done by a supplied adaptor class. The default
       class is usually fine for most tree operations but if you wish to build
       your own specialized linked/tree structure, then you may need to
       reference the adaptor you supply directly. The ADAPTOR macro returns
       the reference to the tree adaptor which is always of type
       pANTLR3_BASE_TREE_ADAPTOR, even if it is your custom adaptor.

Macros Common to All Recognizers

   RECOGNIZER
       Returns a reference type of pANTLR3_BASE_RECOGNIZER, which is the base
       functionality supplied to all recognizers, whether lexers, parsers or
       tree parsers. You can override methods in this interface by installing
       your own function pointers (once you know what you are doing).

   INPUT
       Returns a reference to the input stream of the appropriate type for the
       recognizer. In a lexer this macro returns a reference type of
       pANTLR3_INPUT_STREAM, in a parser this is type pANTLR3_TOKEN_STREAM and
       in a tree parser this is type pANTLR3_COMMON_TREE_NODE_STREAM. You can
       of course provide your own implementations of any of these interfaces.

   MARK()
       This macro will cause the input stream for the current recognizer to be
       marked with a checkpoint. It will return a value type of ANTLR3_MARKER
       which you can use as the parameter to a REWIND macro to return to the
       marked point in the input.

       If you know you will only ever rewind to the last MARK, then you can
       ignore the return value of this macro and just use the REWINDLAST macro
       to return to the last MARK that was set in the input stream.

   REWIND(m)
       Rewinds the appropriate input stream back to the marked checkpoint
       returned from a prior MARK macro call and supplied as the parameter m
       to the REWIND(m) macro.

   REWINDLAST()
       Rewinds the current input stream (character, tokens, tree nodes) back
       to the last checkpoint marker created by a MARK macro call. Fails
       silently if there was no prior MARK call.
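
       The relationship between MARK, REWIND and REWINDLAST can be sketched
       with a stream that records only a position. Marker, Stream and the
       functions below are hypothetical; a real input stream may save more
       state than a single offset at each checkpoint.

```c
#include <assert.h>
#include <stddef.h>

typedef long Marker;   /* stand-in for ANTLR3_MARKER in this sketch */

typedef struct {
    size_t pos;         /* current input position */
    Marker last_mark;   /* most recent checkpoint, -1 if none */
} Stream;

/* Record the current position and return it as a checkpoint. */
static Marker mark(Stream *s)
{
    s->last_mark = (Marker)s->pos;
    return s->last_mark;
}

/* Return to an explicit checkpoint obtained from mark(). */
static void rewind_to(Stream *s, Marker m)
{
    s->pos = (size_t)m;
}

/* Does nothing when no mark has been set, mirroring the silent failure. */
static void rewind_last(Stream *s)
{
    if (s->last_mark >= 0) rewind_to(s, s->last_mark);
}

static void consume(Stream *s)
{
    s->pos++;
}
```

       Keeping the returned marker is only necessary when more than one
       checkpoint is live at a time; otherwise rewind_last is sufficient.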

   SEEK(n)
       Causes the input stream to position itself directly at offset n in the
       stream. Works for all input stream types: lexer, parser and tree
       parser.

Version 3.3.1                   Wed Jan 18 2023                     interop(3)