RE2C(1)                  General Commands Manual                  RE2C(1)

6 re2c - convert regular expressions to C/C++
7
8
10 re2c [-bdefghisuvVw1] [-o output] file
11
12
14 re2c is a preprocessor that generates C-based recognizers from regular
15 expressions. The input to re2c consists of C/C++ source interleaved
16 with comments of the form /*!re2c ... */ which contain scanner specifi‐
17 cations. In the output these comments are replaced with code that,
18 when executed, will find the next input token and then execute some
19 user-supplied token-specific code.
20
21 For example, given the following code
22
    char *scan(char *p)
    {
    /*!re2c
            re2c:define:YYCTYPE  = "unsigned char";
            re2c:define:YYCURSOR = p;
            re2c:yyfill:enable   = 0;
            re2c:yych:conversion = 1;
            re2c:indent:top      = 1;
            [0-9]+          {return p;}
            [\000-\377]     {return (char*)0;}
    */
    }
35
36 re2c -is will generate
37
    /* Generated by re2c on Sat Apr 16 11:40:58 1994 */
    char *scan(char *p)
    {
        {
            unsigned char yych;

            yych = (unsigned char)*p;
            if(yych <= '/') goto yy4;
            if(yych >= ':') goto yy4;
            ++p;
            yych = (unsigned char)*p;
            goto yy7;
    yy3:
            {return p;}
    yy4:
            ++p;
            yych = (unsigned char)*p;
            {return (char*)0;}
    yy6:
            ++p;
            yych = (unsigned char)*p;
    yy7:
            if(yych <= '/') goto yy3;
            if(yych <= '9') goto yy6;
            goto yy3;
        }

    }
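
The behaviour of the generated scanner can be checked against a small hand-written equivalent. The sketch below is not re2c output, and the name scan_digits is an assumption made for illustration. It returns a pointer just past a leading run of digits, or a null pointer when the first character is not a digit, which is what the two rules in the example specify.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-written stand-in for the generated scan() (hypothetical name).
 * Rule 1: [0-9]+       -> return the position after the digits.
 * Rule 2: [\000-\377]  -> return a null pointer. */
static const char *scan_digits(const char *p)
{
    unsigned char yych;

    assert(p != NULL);
    yych = (unsigned char)*p;

    if (yych < '0' || yych > '9')
        return NULL;                 /* any other single character */

    do {
        ++p;
        yych = (unsigned char)*p;
    } while (yych >= '0' && yych <= '9');

    return p;                        /* first character after the number */
}
```

Called on "123abc" it returns a pointer to the 'a'; called on "abc" it returns a null pointer.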
66
You can place one /*!max:re2c */ comment in the input. It is replaced
by a "#define YYMAXFILL <n>" line, where <n> is the maximum number of
characters required to parse the input, that is, the maximum value
YYFILL(n) will receive. If -1 is in effect, then YYMAXFILL can only be
triggered once, after the last /*!re2c */ block.

You can also use /*!ignore:re2c */ blocks, which let you document the
scanner code and are not part of the output.
75
76
78 re2c provides the following options:
79
80 -? -h Invoke a short help.
81
82 -b Implies -s. Use bit vectors as well in the attempt to coax bet‐
83 ter code out of the compiler. Most useful for specifications
84 with more than a few keywords (e.g. for most programming lan‐
85 guages).
86
87 -d Creates a parser that dumps information about the current posi‐
88 tion and in which state the parser is while parsing the input.
89 This is useful to debug parser issues and states. If you use
90 this switch you need to define a macro YYDEBUG that is called
91 like a function with two parameters: void YYDEBUG(int state,
92 char current). The first parameter receives the state or -1 and
93 the second parameter receives the input at the current cursor.
94
95 -e Cross-compile from an ASCII platform to an EBCDIC one.
96
97 -f Generate a scanner with support for storable state. For details
98 see below at SCANNER WITH STORABLE STATES.
99
100 -g Generate a scanner that utilizes GCC's computed goto feature.
101 That is re2c generates jump tables whenever a decision is of a
102 certain complexity (e.g. a lot of if conditions are otherwise
103 necessary). This is only useable with GCC and produces output
104 that cannot be compiled with any other compiler. Note that this
105 implies -b and that the complexity threshold can be configured
106 using the inplace configuration "cgoto:threshold".
107
-i     Do not output #line information. This is useful when you want
       to use a CMS tool with the re2c output, which you might want if
       you do not require your users to have re2c themselves when
       building from your source.

-o output
       Specify the output file.
112
113 -s Generate nested ifs for some switches. Many compilers need this
114 assist to generate better code.
115
116 -u Generate a parser that supports Unicode chars (UTF-32). This
117 means the generated code can deal with any valid Unicode charac‐
118 ter up to 0x10FFFF. When UTF-8 or UTF-16 needs to be supported
119 you need to convert the incoming stream to UTF-32 upon input
120 yourself.
121
122 -v Show version information.
123
124 -V Show the version as a number XXYYZZ.
125
126 -w Create a parser that supports wide chars (UCS-2). This implies
127 -s and cannot be used together with -e switch.
128
129 -1 Force single pass generation, this cannot be combined with -f
130 and disables YYMAXFILL generation prior to last re2c block.
131
132 --no-generation-date
133 Suppress date output in the generated output so that it only
134 shows the re2c version.
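
The -d option described above expects a YYDEBUG macro that takes the state and the current input character. A minimal sketch of such a macro follows; the helper debug_trace and the counter debug_calls are hypothetical names introduced here, not part of re2c.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical trace helper: counts calls and prints state and input. */
static int debug_calls = 0;

static void debug_trace(int state, char current)
{
    assert(state >= -1);    /* the state is a state number or -1 */
    ++debug_calls;
    fprintf(stderr, "state %d, input 0x%02x\n",
            state, (unsigned)(unsigned char)current);
}

/* With -d the generated code "calls" YYDEBUG(state, current). */
#define YYDEBUG(state, current) debug_trace((state), (current))
```

Routing the macro through a function keeps the trace format in one place and lets a debugger break on every state transition.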
135
137 Unlike other scanner generators, re2c does not generate complete scan‐
138 ners: the user must supply some interface code. In particular, the
139 user must define the following macros or use the corresponding inplace
140 configurations:
141
142 YYCTYPE
143 Type used to hold an input symbol. Usually char or unsigned
144 char.
145
146 YYCURSOR
147 l-expression of type *YYCTYPE that points to the current input
148 symbol. The generated code advances YYCURSOR as symbols are
149 matched. On entry, YYCURSOR is assumed to point to the first
150 character of the current token. On exit, YYCURSOR will point to
151 the first character of the following token.
152
153 YYLIMIT
154 Expression of type *YYCTYPE that marks the end of the buffer
155 (YYLIMIT[-1] is the last character in the buffer). The gener‐
156 ated code repeatedly compares YYCURSOR to YYLIMIT to determine
157 when the buffer needs (re)filling.
158
159 YYMARKER
160 l-expression of type *YYCTYPE. The generated code saves back‐
161 tracking information in YYMARKER. Some easy scanners might not
162 use this.
163
164 YYCTXMARKER
165 l-expression of type *YYCTYPE. The generated code saves trail‐
166 ing context backtracking information in YYCTXMARKER. The user
167 only needs to define this macro if a scanner specification uses
168 trailing context in one or more of its regular expressions.
169
170 YYFILL(n)
171 The generated code "calls" YYFILL(n) when the buffer needs
172 (re)filling: at least n additional characters should be pro‐
173 vided. YYFILL(n) should adjust YYCURSOR, YYLIMIT, YYMARKER and
174 YYCTXMARKER as needed. Note that for typical programming lan‐
175 guages n will be the length of the longest keyword plus one.
       The user can place a comment of the form /*!max:re2c */ once to
       insert a YYMAXFILL definition that is set to the maximum
       length value. If the -1 switch is used, then YYMAXFILL can be
       triggered only once, after the last /*!re2c */ block.
180
181 YYGETSTATE()
182 The user only needs to define this macro if the -f flag was
183 specified. In that case, the generated code "calls" YYGET‐
184 STATE() at the very beginning of the scanner in order to obtain
185 the saved state. YYGETSTATE() must return a signed integer. The
186 value must be either -1, indicating that the scanner is entered
187 for the first time, or a value previously saved by YYSET‐
188 STATE(s). In the second case, the scanner will resume opera‐
189 tions right after where the last YYFILL(n) was called.
190
YYSETSTATE(s)
       The user only needs to define this macro if the -f flag was
       specified. In that case, the generated code "calls" YYSETSTATE(s)
       just before calling YYFILL(n). The parameter to YYSETSTATE is a
       signed integer that uniquely identifies the specific instance of
       YYFILL(n) that is about to be called. Should the user wish to
       save the state of the scanner and have YYFILL(n) return to the
       caller, all that is needed is to store that unique identifier in
       a variable. Later, when the scanner is called again, it will
       call YYGETSTATE() and resume execution right where it left off.
       The generated code will contain both YYSETSTATE(s) and
       YYGETSTATE() even if YYFILL(n) is disabled.
203
YYDEBUG(state,current)
       This is only needed if the -d flag was specified. It makes the
       generated parser easy to debug by calling a user-defined
       function for every state. The function should have the following
       signature: void YYDEBUG(int state, char current). The first
       parameter receives the state or -1 and the second parameter
       receives the input at the current cursor.
211
212 YYMAXFILL
213 This will be automatically defined by /*!max:re2c */ blocks as
214 explained above.
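
The buffer handling that YYFILL(n) is expected to perform can be sketched as follows. This is only an illustration under stated assumptions: struct input, input_init, input_fill, the token field and the 16-byte buffer are all hypothetical, and an in-memory string stands in for a file. The cursor, marker and limit fields play the roles of YYCURSOR, YYMARKER and YYLIMIT.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { BUF_SIZE = 16 };

struct input {
    const char *src;      /* unread source data (stand-in for a file) */
    size_t src_left;
    char buf[BUF_SIZE];
    char *token;          /* start of the token being scanned */
    char *cursor;         /* role of YYCURSOR */
    char *marker;         /* role of YYMARKER */
    char *limit;          /* role of YYLIMIT */
};

static void input_init(struct input *in, const char *src, size_t len)
{
    in->src = src;
    in->src_left = len;
    in->token = in->cursor = in->marker = in->limit = in->buf;
}

/* Returns 1 if at least `need` characters are available at cursor. */
static int input_fill(struct input *in, size_t need)
{
    size_t shift, kept, room, n;

    assert(need <= BUF_SIZE);
    if (in->marker < in->token)
        in->marker = in->token;   /* stale marker from an earlier token */

    /* Drop consumed data and move the live part to the buffer start. */
    shift = (size_t)(in->token - in->buf);
    kept = (size_t)(in->limit - in->token);
    memmove(in->buf, in->token, kept);
    in->token -= shift;
    in->cursor -= shift;
    in->marker -= shift;
    in->limit -= shift;

    /* Append as much new data as fits. */
    room = (size_t)(in->buf + BUF_SIZE - in->limit);
    n = in->src_left < room ? in->src_left : room;
    memcpy(in->limit, in->src, n);
    in->src += n;
    in->src_left -= n;
    in->limit += n;

    return (size_t)(in->limit - in->cursor) >= need;
}
```

A YYFILL(n) macro could then expand to a call such as input_fill(in, n) followed by whatever end-of-input handling the scanner needs when the call returns 0.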
215
216
218 When the -f flag is specified, re2c generates a scanner that can store
219 its current state, return to the caller, and later resume operations
220 exactly where it left off.
221
The default operation of re2c is a "pull" model, where the scanner asks
for extra input whenever it needs it. However, this mode of operation
assumes that the scanner is the "owner" of the parsing loop, and that
may not always be convenient.

Typically, if there is a preprocessor ahead of the scanner in the
stream, or for that matter any other procedural source of data, the
scanner cannot "ask" for more data unless scanner and source live in
separate threads.

The -f flag is useful for just this situation: it lets users design
scanners that work in a "push" model, i.e. where data is fed to the
scanner chunk by chunk. When the scanner runs out of data to consume,
it just stores its state and returns to the caller. When more input
data is fed to the scanner, it resumes operations exactly where it left
off.
238
239 When using the -f option re2c does not accept stdin because it has to
240 do the full generation process twice which means it has to read the
241 input twice. That means re2c would fail in case it cannot open the
242 input twice or reading the input for the first time influences the sec‐
243 ond read attempt.
244
Changes needed compared to the "pull" model:

1. The user has to supply the macros YYSETSTATE(state) and YYGETSTATE().
248
2. The -f option inhibits the declaration of yych and yyaccept, so the
user has to declare them and also to save and restore them. In the
example examples/push.re they are declared as fields of the (C++) class
of which the scanner is a method, so they do not need to be saved and
restored explicitly. For C they could, for example, be made macros that
select fields from a structure passed in as a parameter. Alternatively,
they could be declared as local variables, saved by YYFILL(n) when it
decides to return, and restored on entry to the function. It could also
be more efficient to save the state from YYFILL(n), because
YYSETSTATE(state) is called unconditionally. YYFILL(n), however, does
not receive state as a parameter, so YYSETSTATE(state) would have to
store state in a local variable first.
261
262 3. Modify YYFILL(n) to return (from the function calling it) if more
263 input is needed.
264
265 4. Modify caller to recognise "more input is needed" and respond appro‐
266 priately.
267
5. The generated code will contain a switch block that is used to
restore the last state by jumping behind the corresponding YYFILL(n)
call. This code is automatically generated in the epilog of the first
"/*!re2c */" block. It is possible to trigger generation of the
YYGETSTATE() block earlier by placing a "/*!getstate:re2c */" comment.
This is especially useful when the scanner code should be wrapped
inside a loop.
275
Please see examples/push.re for a push-model scanner. The generated
code can be tweaked using the inplace configurations "state:abort" and
"state:nextlabel".
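
The control flow described in this section can be sketched by hand. The code below is not re2c output: struct scanner, scanner_init, scanner_feed, scanner_run and the two return codes are hypothetical, and the two macros are defined here only to mirror the roles of YYGETSTATE() and YYSETSTATE(s). The scanner counts runs of digits in input that arrives chunk by chunk and treats a NUL byte as the end of input.

```c
#include <assert.h>
#include <stddef.h>

enum { NEED_MORE_INPUT = 0, END_OF_INPUT = 1 };

struct scanner {
    int state;            /* -1 on first entry, else id of the fill point */
    const char *cursor;   /* current chunk */
    const char *limit;
    int numbers;          /* completed runs of digits */
};

/* Both macros assume a variable `s` pointing at the scanner. */
#define YYGETSTATE()  (s->state)
#define YYSETSTATE(x) (s->state = (x))

static void scanner_init(struct scanner *s)
{
    s->state = -1;
    s->cursor = s->limit = NULL;
    s->numbers = 0;
}

static void scanner_feed(struct scanner *s, const char *chunk, size_t len)
{
    s->cursor = chunk;
    s->limit = chunk + len;
}

static int scanner_run(struct scanner *s)
{
    /* Restore the last state by jumping behind the suspended fill point. */
    switch (YYGETSTATE()) {
    case -1: break;
    case 0:  goto resume_outside;
    case 1:  goto resume_inside;
    default: assert(0 && "invalid saved state"); return END_OF_INPUT;
    }

    for (;;) {
        YYSETSTATE(0);                       /* outside a number */
resume_outside:
        if (s->cursor == s->limit) return NEED_MORE_INPUT;
        if (*s->cursor == '\0') return END_OF_INPUT;
        if (*s->cursor < '0' || *s->cursor > '9') { ++s->cursor; continue; }

        for (;;) {
            ++s->cursor;                     /* consume one digit */
            YYSETSTATE(1);                   /* inside a number */
resume_inside:
            if (s->cursor == s->limit) return NEED_MORE_INPUT;
            if (*s->cursor < '0' || *s->cursor > '9') break;
        }
        ++s->numbers;
    }
}
```

The caller owns the loop: it feeds a chunk, calls scanner_run, and feeds the next chunk whenever NEED_MORE_INPUT comes back. A number split across two chunks is counted once, because the scanner resumes inside the number.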
279
280
282 Each scanner specification consists of a set of rules, named defini‐
283 tions and configurations.
284
285 Rules consist of a regular expression along with a block of C/C++ code
286 that is to be executed when the associated regular expression is
287 matched.
288
289 regular expression { C/C++ code }
290
291 Named definitions are of the form:
292
293 name = regular expression;
294
295 Configurations look like named definitions whose names start with
296 "re2c:":
297
298 re2c:name = value;
299 re2c:name = "value";
300
301
303 "foo" the literal string foo. ANSI-C escape sequences can be used.
304
305 'foo' the literal string foo (characters [a-zA-Z] treated case-insen‐
306 sitive). ANSI-C escape sequences can be used.
307
308 [xyz] a "character class"; in this case, the regular expression
309 matches either an 'x', a 'y', or a 'z'.
310
311 [abj-oZ]
312 a "character class" with a range in it; matches an 'a', a 'b',
313 any letter from 'j' through 'o', or a 'Z'.
314
315 [^class]
316 an inverted "character class".
317
318 r\s match any r which isn't an s. r and s must be regular expres‐
319 sions which can be expressed as character classes.
320
321 r* zero or more r's, where r is any regular expression
322
323 r+ one or more r's
324
325 r? zero or one r's (that is, "an optional r")
326
327 name the expansion of the "named definition" (see above)
328
329 (r) an r; parentheses are used to override precedence (see below)
330
331 rs an r followed by an s ("concatenation")
332
333 r|s either an r or an s
334
335 r/s an r but only if it is followed by an s. The s is not part of
336 the matched text. This type of regular expression is called
337 "trailing context". A trailing context can only be the end of a
338 rule and not part of a named definition.
339
340 r{n} matches r exactly n times.
341
342 r{n,} matches r at least n times.
343
344 r{n,m} matches r at least n but not more than m times.
345
346 . match any character except newline (\n).
347
348 def matches named definition as specified by def.
349
Character classes and string literals may contain octal or hexadecimal
character definitions and the following set of escape sequences (\n,
\t, \v, \b, \r, \f, \a, \\). An octal character is defined by a
backslash followed by its three octal digits, and a hexadecimal
character is defined by a backslash, a lowercase x and its two
hexadecimal digits, or by a backslash, an uppercase X and its four
hexadecimal digits.

re2c furthermore supports the C/C++ Unicode notation, that is, a
backslash followed by either a lowercase u and its four hexadecimal
digits or an uppercase U and its eight hexadecimal digits. However,
only in -u mode can the generated code deal with any valid Unicode
character up to 0x10FFFF.

Since characters greater than \X00FF are not allowed in non-Unicode
mode, the only portable "any" rules are (.|"\n") and [^].
365
366 The regular expressions listed above are grouped according to prece‐
367 dence, from highest precedence at the top to lowest at the bottom.
368 Those grouped together have equal precedence.
369
370
372 It is possible to configure code generation inside re2c blocks. The
373 following lists the available configurations:
374
re2c:indent:top = 0 ;
       Specifies the minimum amount of indentation to use. Requires a
       numeric value greater than or equal to zero.

re2c:indent:string = "\t" ;
       Specifies the string to use for indentation. Requires a string
       that should contain only whitespace unless you need this for
       external tools. The easiest way to specify spaces is to enclose
       them in single or double quotes. If you do not want any
       indentation at all you can simply set this to "".
385
386 re2c:yybm:hex = 0 ;
387 If set to zero then a decimal table is being used else a hexa‐
388 decimal table will be generated.
389
re2c:yyfill:enable = 1 ;
       Set this to zero to suppress generation of YYFILL(n). When using
       this, be sure to verify that the generated scanner does not read
       beyond the end of the input. Allowing this behavior might
       introduce severe security issues into your programs.

re2c:yyfill:parameter = 1 ;
       Allows parameter passing to YYFILL calls to be suppressed. If set
       to zero, then no parameter is passed to YYFILL. If set to a
       non-zero value, then each use of YYFILL is followed by the number
       of requested characters in parentheses.
401
402 re2c:startlabel = 0 ;
403 If set to a non zero integer then the start label of the next
404 scanner blocks will be generated even if not used by the scanner
405 itself. Otherwise the normal yy0 like start label is only being
406 generated if needed. If set to a text value then a label with
407 that text will be generated regardless of whether the normal
408 start label is being used or not. This setting is being reset to
409 0 after a start label has been generated.
410
re2c:labelprefix = yy ;
       Changes the prefix of numbered labels. The default is yy; it can
       be set to any string that is a valid label.
414
415 re2c:state:abort = 0 ;
416 When not zero and switch -f is active then the YYGETSTATE block
417 will contain a default case that aborts and a -1 case is used
418 for initialization.
419
420 re2c:state:nextlabel = 0 ;
421 Used when -f is active to control whether the YYGETSTATE block
422 is followed by a yyNext: label line. Instead of using yyNext you
423 can usually also use configuration startlabel to force a spe‐
424 cific start label or default to yy0 as start label. Instead of
425 using a dedicated label it is often better to separate the
426 YYGETSTATE code from the actual scanner code by placing a
427 "/*!getstate:re2c */" comment.
428
429 re2c:cgoto:threshold = 9 ;
430 When -g is active this value specifies the complexity threshold
431 that triggers generation of jump tables rather than using nested
432 if's and decision bitfields. The threshold is compared against
433 a calculated estimation of if-s needed where every used bitmap
434 divides the threshold by 2.
435
re2c:yych:conversion = 0 ;
       When the input uses signed characters and the -s or -b switch is
       in effect, re2c can automatically convert each input character
       to the unsigned character type that its internal
       single-character variable then requires. When this setting is
       zero or an empty string, the conversion is disabled. With a
       non-zero number, the conversion is taken from YYCTYPE: if
       YYCTYPE is given by an inplace configuration, that value is
       used; otherwise the cast is (YYCTYPE) and later changes to that
       configuration are no longer possible. When this setting is a
       string, it must include the parentheses. For example, if your
       input is a char* buffer and you use the switches mentioned
       above, you can set YYCTYPE to unsigned char and this setting to
       either 1 or "(unsigned char)".
449
re2c:define:YYCTXMARKER = YYCTXMARKER ;
       Replaces the define YYCTXMARKER with the given value, so the
       macro can be avoided by setting the value to the actual code
       needed.

re2c:define:YYCTYPE = YYCTYPE ;
       Replaces the define YYCTYPE with the given value, so the macro
       can be avoided by setting the value to the actual code needed.

re2c:define:YYCURSOR = YYCURSOR ;
       Replaces the define YYCURSOR with the given value, so the macro
       can be avoided by setting the value to the actual code needed.

re2c:define:YYDEBUG = YYDEBUG ;
       Replaces the define YYDEBUG with the given value, so the macro
       can be avoided by setting the value to the actual code needed.

re2c:define:YYFILL = YYFILL ;
       Replaces the define YYFILL with the given value, so the macro
       can be avoided by setting the value to the actual code needed.

re2c:define:YYGETSTATE = YYGETSTATE ;
       Replaces the define YYGETSTATE with the given value, so the
       macro can be avoided by setting the value to the actual code
       needed.

re2c:define:YYLIMIT = YYLIMIT ;
       Replaces the define YYLIMIT with the given value, so the macro
       can be avoided by setting the value to the actual code needed.

re2c:define:YYMARKER = YYMARKER ;
       Replaces the define YYMARKER with the given value, so the macro
       can be avoided by setting the value to the actual code needed.

re2c:define:YYSETSTATE = YYSETSTATE ;
       Replaces the define YYSETSTATE with the given value, so the
       macro can be avoided by setting the value to the actual code
       needed.

re2c:label:yyFillLabel = yyFillLabel ;
       Changes the name of the label yyFillLabel.

re2c:label:yyNext = yyNext ;
       Changes the name of the label yyNext.

re2c:variable:yyaccept = yyaccept ;
       Changes the name of the variable yyaccept.

re2c:variable:yybm = yybm ;
       Changes the name of the variable yybm.

re2c:variable:yych = yych ;
       Changes the name of the variable yych.

re2c:variable:yytarget = yytarget ;
       Changes the name of the variable yytarget.
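
The need for the conversion controlled by re2c:yych:conversion can be shown in plain C. On platforms where char is signed, a byte such as 0xE9 read through a char* is negative, so a range test against values above 0x7F fails unless the byte is first converted to unsigned char. The function name is_high_byte is a hypothetical one chosen for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the byte at p lies in [0x80, 0xFF], else 0. */
static int is_high_byte(const char *p)
{
    unsigned char yych;

    assert(p != NULL);
    yych = (unsigned char)*p;   /* same cast as "(unsigned char)" */
    return yych >= 0x80;
}
```

Without the cast, the comparison *p >= 0x80 is always false where char is signed, which is the situation the configuration is meant to handle.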
503
504
506 The subdirectory lessons of the re2c distribution contains a few step
507 by step lessons to get you started with re2c. All examples in the
508 lessons subdirectory can be compiled and actually work.
509
510
512 re2c does not provide a default action: the generated code assumes that
513 the input will consist of a sequence of tokens. Typically this can be
514 dealt with by adding a rule such as the one for unexpected characters
515 in the example above.
516
517 The user must arrange for a sentinel token to appear at the end of
518 input (and provide a rule for matching it): re2c does not provide an
519 <<EOF>> expression. If the source is from a null-byte terminated
520 string, a rule matching a null character will suffice. If the source
521 is from a file then you could pad the input with a newline (or some
522 other character that cannot appear within another token); upon recog‐
523 nizing such a character check to see if it is the sentinel and act
accordingly. You can also use YYFILL(n) to end the scanner when not
enough characters are available, which is nothing other than the
detection of the end of the data or file.
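
For a null-byte terminated string, the sentinel approach amounts to the following hand-written loop. It is a sketch, not re2c output, and count_numbers is a hypothetical name. The NUL byte plays the role of the sentinel rule, digits form a token, and every other character falls to a catch-all rule.

```c
#include <assert.h>
#include <stddef.h>

/* Counts runs of decimal digits in a NUL-terminated string. */
static int count_numbers(const char *cursor)
{
    int count = 0;

    assert(cursor != NULL);
    for (;;) {
        unsigned char ch = (unsigned char)*cursor;

        if (ch == '\0')
            return count;               /* sentinel rule: end of input */

        if (ch >= '0' && ch <= '9') {   /* token rule: [0-9]+ */
            do {
                ++cursor;
            } while (*cursor >= '0' && *cursor <= '9');
            ++count;
        } else {
            ++cursor;                   /* catch-all rule */
        }
    }
}
```

No bounds check is needed because the sentinel matches neither the token rule nor the catch-all rule, so every loop stops at it.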
527
528 re2c does not provide start conditions: use a separate scanner speci‐
529 fication for each start condition (as illustrated in the above exam‐
530 ple).
531
532
534 Difference only works for character sets.
535
536 The re2c internal algorithms need documentation.
537
538
540 flex(1), lex(1).
541
542 More information on re2c can be found here:
543 http://re2c.org/
544
545
547 Peter Bumbulis <peter@csg.uwaterloo.ca>
548 Brian Young <bayoung@acm.org>
549 Dan Nuffer <nuffer@users.sourceforge.net>
550 Marcus Boerger <helly@users.sourceforge.net>
551 Hartmut Kaiser <hkaiser@users.sourceforge.net>
552 Emmanuel Mogenet <mgix@mgix.com> added storable state
554 This manpage describes re2c, version 0.12.1.

Version 0.12.1                 22 April 2005                      RE2C(1)