RE2C(1)                                                                RE2C(1)

6 re2c - compile regular expressions to code
7
9 re2c [OPTIONS] INPUT [-o OUTPUT]
10
11 re2go [OPTIONS] INPUT [-o OUTPUT]
12
14 re2c is a tool for generating fast lexical analyzers for C, C++ and Go.
15
A re2c program consists of normal code intermixed with re2c blocks and
directives. Each re2c block may contain definitions, configurations
and rules.

Definitions are of the form name = regexp; where name is an identifier
that consists of letters, digits and underscores, and regexp is a
regular expression. Regular expressions may contain other definitions,
but recursion is not allowed and each name should be defined before it
is used. Configurations are of the form re2c:config = value; where
config is the configuration descriptor and value can be a number, a
string or a special word.

Rules consist of a regular expression followed by a semantic action (a
block of code enclosed in curly braces { and }, or a raw one line of
code preceded with := and ended with a newline that is not followed by
a whitespace). If the input matches the regular expression, the
associated semantic action is executed. If multiple rules match, the
longest match takes precedence. If multiple rules match the same
string, the earlier rule takes precedence. There are two special rules:
the default rule * and the EOF rule $. The default rule should always be
defined; it has the lowest priority regardless of its place and matches
any code unit (not necessarily a valid character, see encoding
support). The EOF rule matches the end of input; it should be defined if
the corresponding EOF handling method is used. If start conditions are
used, rules have more complex syntax.

All rules of a single block are compiled into a deterministic
finite-state automaton (DFA) and encoded in the form of a program in
the target language. The generated code interfaces with the outer
program by means of a few user-defined primitives (see the program
interface section). Reusable blocks allow sharing rules, definitions
and configurations between different blocks.
44
46 Input file
// re2c $INPUT -o $OUTPUT -i
#include <assert.h>                 //
                                    // C/C++ code
int lex(const char *YYCURSOR)       //
{
    /*!re2c                         // start of re2c block
    re2c:define:YYCTYPE = char;     // configuration
    re2c:yyfill:enable = 0;         // configuration
    re2c:flags:case-ranges = 1;     // configuration
                                    //
    ident = [a-zA-Z_][a-zA-Z_0-9]*; // named definition
                                    //
    ident { return 0; }             // normal rule
    *     { return 1; }             // default rule
    */
}                                   //
                                    //
int main()                          //
{                                   // C/C++ code
    assert(lex("_Zer0") == 0);      //
    return 0;                       //
}                                   //
69
70
71 Output file
/* Generated by re2c */
// re2c $INPUT -o $OUTPUT -i
#include <assert.h>                 //
                                    // C/C++ code
int lex(const char *YYCURSOR)       //
{

{
    char yych;
    yych = *YYCURSOR;
    switch (yych) {
    case 'A' ... 'Z':
    case '_':
    case 'a' ... 'z':   goto yy4;
    default:    goto yy2;
    }
yy2:
    ++YYCURSOR;
    { return 1; }
yy4:
    yych = *++YYCURSOR;
    switch (yych) {
    case '0' ... '9':
    case 'A' ... 'Z':
    case '_':
    case 'a' ... 'z':   goto yy4;
    default:    goto yy6;
    }
yy6:
    { return 0; }
}

}                                   //
                                    //
int main()                          //
{                                   // C/C++ code
    assert(lex("_Zer0") == 0);      //
    return 0;                       //
}                                   //
111
112
114 -? -h --help
115 Show help message.
116
117 -1 --single-pass
118 Deprecated. Does nothing (single pass is the default now).
119
120 -8 --utf-8
121 Generate a lexer that reads input in UTF-8 encoding. re2c
122 assumes that character range is 0 -- 0x10FFFF and character size
123 is 1 byte.
124
125 -b --bit-vectors
126 Optimize conditional jumps using bit masks. Implies -s.
127
128 -c --conditions --start-conditions
129 Enable support of Flex-like "conditions": multiple interrelated
130 lexers within one block. Option --start-conditions is a legacy
131 alias; use --conditions instead.
132
133 --case-insensitive
134 Treat single-quoted and double-quoted strings as case-insensi‐
135 tive.
136
137 --case-inverted
138 Invert the meaning of single-quoted and double-quoted strings:
139 treat single-quoted strings as case-sensitive and double-quoted
140 strings as case-insensitive.
141
--case-ranges
Collapse consecutive cases in a switch statement into a range of
the form case low ... high:. This syntax is an extension of the
C/C++ language, supported by compilers like GCC, Clang and Tcc.
The main advantage over using single cases is smaller generated C
code and faster generation time, although for some compilers like
Tcc it also results in smaller binary size. This option doesn't
work for the Go backend.
150
-e --ecb
Generate a lexer that reads input in EBCDIC encoding. re2c
assumes that character range is 0 -- 0xFF and character size is 1
byte.
155
--empty-class <match-empty | match-none | error>
Define the way re2c treats empty character classes. With
match-empty (the default) an empty class matches empty input (which
is illogical, but backwards-compatible). With match-none an
empty class always fails to match. With error an empty class
raises a compilation error.
162
163 --encoding-policy <fail | substitute | ignore>
164 Define the way re2c treats Unicode surrogates. With fail re2c
165 aborts with an error when a surrogate is encountered. With sub‐
166 stitute re2c silently replaces surrogates with the error code
167 point 0xFFFD. With ignore (the default) re2c treats surrogates
168 as normal code points. The Unicode standard says that standalone
169 surrogates are invalid, but real-world libraries and programs
170 behave in different ways.
171
172 -f --storable-state
173 Generate a lexer which can store its inner state. This is use‐
174 ful in push-model lexers which are stopped by an outer program
175 when there is not enough input, and then resumed when more input
176 becomes available. In this mode users should additionally define
177 YYGETSTATE() and YYSETSTATE(state) macros and variables yych,
178 yyaccept and state as part of the lexer state.
179
180 -F --flex-syntax
181 Partial support for Flex syntax: in this mode named definitions
182 don't need the equal sign and the terminating semicolon, and
183 when used they must be surrounded by curly braces. Names without
184 curly braces are treated as double-quoted strings.
185
186 -g --computed-gotos
187 Optimize conditional jumps using non-standard "computed goto"
188 extension (which must be supported by the compiler). re2c gener‐
189 ates jump tables only in complex cases with a lot of conditional
190 branches. Complexity threshold can be configured with
191 cgoto:threshold configuration. This option implies -b. This
192 option doesn't work for the Go backend.
193
194 -I PATH
195 Add PATH to the list of locations which are used when searching
196 for include files. This option is useful in combination with
197 /*!include:re2c ... */ directive. Re2c looks for FILE in the
198 directory of including file and in the list of include paths
199 specified by -I option.
200
201 -i --no-debug-info
202 Do not output #line information. This is useful when the gener‐
203 ated code is tracked by some version control system or IDE.
204
--input <default | custom>
Specify the API used by the generated code to interface with
user-defined code. Option default is the C API based on pointer
arithmetic (it is the default for the C backend). Option custom
is the generic API (it is the default for the Go backend).
210
211 --input-encoding <ascii | utf8>
212 Specify the way re2c parses regular expressions. With ascii
213 (the default) re2c handles input as ASCII-encoded: any sequence
214 of code units is a sequence of standalone 1-byte characters.
215 With utf8 re2c handles input as UTF8-encoded and recognizes
216 multibyte characters.
217
218 --lang <c | go>
219 Specify the output language. Supported languages are C and Go
220 (the default is C).
221
222 --location-format <gnu | msvc>
223 Specify location format in messages. With gnu locations are
224 printed as 'filename:line:column: ...'. With msvc locations are
225 printed as 'filename(line,column) ...'. Default is gnu.
226
227 --no-generation-date
228 Suppress date output in the generated file.
229
230 --no-version
231 Suppress version output in the generated file.
232
233 -o OUTPUT --output=OUTPUT
234 Specify the OUTPUT file.
235
236 -P --posix-captures
237 Enable submatch extraction with POSIX-style capturing groups.
238
239 -r --reusable
240 Allows reuse of re2c rules with /*!rules:re2c */ and /*!use:re2c
241 */ blocks. Exactly one rules-block must be present. The rules
242 are saved and used by every use-block that follows, which may
243 add its own rules and configurations.
244
245 -S --skeleton
246 Ignore user-defined interface code and generate a self-contained
247 "skeleton" program. Additionally, generate input files with
248 strings derived from the regular grammar and compressed match
249 results that are used to verify "skeleton" behavior on all
250 inputs. This option is useful for finding bugs in optimizations
251 and code generation. This option doesn't work for the Go back‐
252 end.
253
254 -s --nested-ifs
255 Use nested if statements instead of switch statements in condi‐
256 tional jumps. This usually results in more efficient code with
257 non-optimizing compilers.
258
259 -T --tags
260 Enable submatch extraction with tags.
261
262 -t HEADER --type-header=HEADER
263 Generate a HEADER file that contains enum with condition names.
264 Requires -c option.
265
266 -u --unicode
267 Generate a lexer that reads UTF32-encoded input. Re2c assumes
268 that character range is 0 -- 0x10FFFF and character size is 4
269 bytes. This option implies -s.
270
271 -V --vernum
272 Show version information in MMmmpp format (major, minor, patch).
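
The MMmmpp layout packs the major, minor and patch numbers into three
two-digit fields. As a minimal sketch, a build script helper could decode
such a string as shown below. The names struct vernum and parse_vernum are
hypothetical (they are not part of re2c), and the sample value "010300" is
only an illustration of the format.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type, not part of re2c: a version split into fields. */
struct vernum {
    int major;
    int minor;
    int patch;
};

/* Decode a six-digit MMmmpp string such as "010300".
 * Returns 0 on success and -1 if the string is not exactly six digits. */
static int parse_vernum(const char *s, struct vernum *out)
{
    int i;

    assert(s != NULL && out != NULL);
    if (strlen(s) != 6) return -1;
    for (i = 0; i < 6; ++i) {
        if (!isdigit((unsigned char)s[i])) return -1;
    }
    out->major = (s[0] - '0') * 10 + (s[1] - '0');
    out->minor = (s[2] - '0') * 10 + (s[3] - '0');
    out->patch = (s[4] - '0') * 10 + (s[5] - '0');
    return 0;
}
```

A caller that reads the output of the command would need to strip any
trailing newline before passing the string to this helper.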
273
274 --verbose
275 Output a short message in case of success.
276
277 -v --version
278 Show version information.
279
280 -w --wide-chars
281 Generate a lexer that reads UCS2-encoded input. Re2c assumes
282 that character range is 0 -- 0xFFFF and character size is 2
283 bytes. This option implies -s.
284
285 -x --utf-16
286 Generate a lexer that reads UTF16-encoded input. Re2c assumes
287 that character range is 0 -- 0x10FFFF and character size is 2
288 bytes. This option implies -s.
289
290 Debug options
291 -D --emit-dot
292 Instead of normal output generate lexer graph in .dot format.
293 The output can be converted to an image with the help of
294 Graphviz (e.g. something like dot -Tpng -odfa.png dfa.dot).
295
296 -d --debug-output
297 Emit YYDEBUG in the generated code. YYDEBUG should be defined
298 by the user in the form of a void function with two parameters:
299 state (lexer state or -1) and symbol (current input symbol of
300 type YYCTYPE).
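
A possible definition of YYDEBUG is sketched below. It assumes 1-byte code
units (YYCTYPE as unsigned char) and logs to stderr; the helper names
format_debug and yydebug, and the trace format itself, are illustrative
choices rather than anything prescribed by re2c.

```c
#include <assert.h>
#include <stdio.h>

typedef unsigned char YYCTYPE; /* assumption: 1-byte code units */

/* Format one trace record into buf; kept separate so it is easy to test.
 * Returns the number of characters written (as snprintf does). */
static int format_debug(char *buf, size_t size, int state, YYCTYPE symbol)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "state=%d symbol=0x%02X",
                    state, (unsigned int)symbol);
}

/* A void function with two parameters: the lexer state (or -1) and the
 * current input symbol. */
static void yydebug(int state, YYCTYPE symbol)
{
    char buf[64];
    format_debug(buf, sizeof buf, state, symbol);
    fprintf(stderr, "%s\n", buf);
}

/* The generated code refers to YYDEBUG; map it to the function above. */
#define YYDEBUG(state, symbol) yydebug((state), (symbol))
```

Routing the macro through a function keeps the trace format in one place,
so it can be changed without regenerating the lexer.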
301
302 --dump-adfa
303 Debug option: output DFA after tunneling (in .dot format).
304
305 --dump-cfg
306 Debug option: output control flow graph of tag variables (in
307 .dot format).
308
309 --dump-closure-stats
310 Debug option: output statistics on the number of states in clo‐
311 sure.
312
313 --dump-dfa-det
314 Debug option: output DFA immediately after determinization (in
315 .dot format).
316
317 --dump-dfa-min
318 Debug option: output DFA after minimization (in .dot format).
319
320 --dump-dfa-tagopt
321 Debug option: output DFA after tag optimizations (in .dot for‐
322 mat).
323
324 --dump-dfa-tree
325 Debug option: output DFA under construction with states repre‐
326 sented as tag history trees (in .dot format).
327
328 --dump-dfa-raw
329 Debug option: output DFA under construction with expanded
330 state-sets (in .dot format).
331
332 --dump-interf
333 Debug option: output interference table produced by liveness
334 analysis of tag variables.
335
336 --dump-nfa
337 Debug option: output NFA (in .dot format).
338
339 Internal options
340 --dfa-minimization <moore | table>
341 Internal option: DFA minimization algorithm used by re2c. The
342 moore option is the Moore algorithm (it is the default). The ta‐
343 ble option is the "table filling" algorithm. Both algorithms
344 should produce the same DFA up to states relabeling; table fill‐
345 ing is simpler and much slower and serves as a reference imple‐
346 mentation.
347
348 --eager-skip
349 Internal option: make the generated lexer advance the input
350 position eagerly -- immediately after reading the input symbol.
351 This changes the default behavior when the input position is
352 advanced lazily -- after transition to the next state. This
353 option is implied by --no-lookahead.
354
355 --no-lookahead
356 Internal option: use TDFA(0) instead of TDFA(1). This option
357 has effect only with --tags or --posix-captures options.
358
--no-optimize-tags
Internal option: suppress optimization of tag variables (useful
for debugging).
362
363 --posix-closure <gor1 | gtop>
364 Internal option: specify shortest-path algorithm used for the
365 construction of epsilon-closure with POSIX disambiguation seman‐
366 tics: gor1 (the default) stands for Goldberg-Radzik algorithm,
367 and gtop stands for "global topological order" algorithm.
368
369 --posix-prectable <complex | naive>
370 Internal option: specify the algorithm used to compute POSIX
371 precedence table. The complex algorithm computes precedence ta‐
372 ble in one traversal of tag history tree and has quadratic com‐
373 plexity in the number of TNFA states; it is the default. The
374 naive algorithm has worst-case cubic complexity in the number of
375 TNFA states, but it is much simpler than complex and may be
376 slightly faster in non-pathological cases.
377
378 --stadfa
379 Internal option: use staDFA algorithm for submatch extraction.
380 The main difference with TDFA is that tag operations in staDFA
381 are placed in states, not on transitions.
382
383 Warnings
384 -W Turn on all warnings.
385
386 -Werror
387 Turn warnings into errors. Note that this option alone doesn't
388 turn on any warnings; it only affects those warnings that have
389 been turned on so far or will be turned on later.
390
391 -W<warning>
392 Turn on warning.
393
394 -Wno-<warning>
395 Turn off warning.
396
397 -Werror-<warning>
398 Turn on warning and treat it as an error (this implies -W<warn‐
399 ing>).
400
401 -Wno-error-<warning>
402 Don't treat this particular warning as an error. This doesn't
403 turn off the warning itself.
404
405 -Wcondition-order
406 Warn if the generated program makes implicit assumptions about
407 condition numbering. One should use either the -t, --type-header
408 option or the /*!types:re2c*/ directive to generate a mapping of
409 condition names to numbers and then use the autogenerated condi‐
410 tion names.
411
412 -Wempty-character-class
413 Warn if a regular expression contains an empty character class.
414 Trying to match an empty character class makes no sense: it
415 should always fail. However, for backwards compatibility rea‐
416 sons re2c allows empty character classes and treats them as
417 empty strings. Use the --empty-class option to change the
418 default behavior.
419
420 -Wmatch-empty-string
421 Warn if a rule is nullable (matches an empty string). If the
422 lexer runs in a loop and the empty match is unintentional, the
423 lexer may unexpectedly hang in an infinite loop.
424
425 -Wswapped-range
426 Warn if the lower bound of a range is greater than its upper
427 bound. The default behavior is to silently swap the range
428 bounds.
429
430 -Wundefined-control-flow
431 Warn if some input strings cause undefined control flow in the
432 lexer (the faulty patterns are reported). This is the most dan‐
433 gerous and most common mistake. It can be easily fixed by adding
434 the default rule * which has the lowest priority, matches any
435 code unit, and consumes exactly one code unit.
436
437 -Wunreachable-rules
438 Warn about rules that are shadowed by other rules and will never
439 match.
440
441 -Wuseless-escape
442 Warn if a symbol is escaped when it shouldn't be. By default,
443 re2c silently ignores such escapes, but this may as well indi‐
444 cate a typo or an error in the escape sequence.
445
446 -Wnondeterministic-tags
447 Warn if a tag has n-th degree of nondeterminism, where n is
448 greater than 1.
449
450 -Wsentinel-in-midrule
451 Warn if the sentinel symbol occurs in the middle of a rule ---
452 this may cause reads past the end of buffer, crashes or memory
453 corruption in the generated lexer. This warning is only applica‐
454 ble if the sentinel method of checking for the end of input is
455 used. It is set to an error if re2c:sentinel configuration is
456 used.
457
459 Re2c has a flexible interface that gives the user both the freedom and
460 the responsibility to define how the generated code interacts with the
461 outer program. There are two major options:
462
· Pointer API. It is also called "default API", since it was
historically the first, and for a long time the only one. This is a
more restricted API based on C pointer arithmetic. It consists of
pointer-like primitives YYCURSOR, YYMARKER, YYCTXMARKER and YYLIMIT,
which are normally defined as pointers of type YYCTYPE*. Pointer API
is enabled by default for the C backend, and it cannot be used with
other backends that do not have pointer arithmetic.
470
471
472
473 · Generic API. This is a less restricted API that does not assume
474 pointer semantics. It consists of primitives YYPEEK, YYSKIP,
475 YYBACKUP, YYBACKUPCTX, YYSTAGP, YYSTAGN, YYMTAGP, YYMTAGN, YYRESTORE,
476 YYRESTORECTX, YYRESTORETAG, YYSHIFT, YYSHIFTSTAG, YYSHIFTMTAG and
477 YYLESSTHAN. For the C backend generic API is enabled with --input
478 custom option or re2c:flags:input = custom; configuration; for the Go
479 backend it is enabled by default. Generic API was added in version
480 0.14. It is intentionally designed to give the user as much freedom
481 as possible in redefining the input model and the semantics of dif‐
482 ferent actions performed by the generated code. As an example, one
483 can override YYPEEK to check for the end of input before reading the
484 input character, or do some logging, etc.
485
486 Generic API has two styles:
487
488 · Function-like. This style is enabled with re2c:api:style = func‐
489 tions; configuration, and it is the default for C backend. In this
490 style API primitives should be defined as functions or macros with
491 parentheses, accepting the necessary arguments. For example, in C the
492 default pointer API can be defined in function-like style generic API
493 as follows:
494
#define YYPEEK()                 *YYCURSOR
#define YYSKIP()                 ++YYCURSOR
#define YYBACKUP()               YYMARKER = YYCURSOR
#define YYBACKUPCTX()            YYCTXMARKER = YYCURSOR
#define YYRESTORE()              YYCURSOR = YYMARKER
#define YYRESTORECTX()           YYCURSOR = YYCTXMARKER
#define YYRESTORETAG(tag)        YYCURSOR = tag
#define YYLESSTHAN(len)          YYLIMIT - YYCURSOR < len
#define YYSTAGP(tag)             tag = YYCURSOR
#define YYSTAGN(tag)             tag = NULL
#define YYSHIFT(shift)           YYCURSOR += shift
#define YYSHIFTSTAG(tag, shift)  tag += shift
507
508
509
510 · Free-form. This style is enabled with re2c:api:style = free-form;
511 configuration, and it is the default for Go backend. In this style
512 API primitives can be defined as free-form pieces of code, and
513 instead of arguments they have interpolated variables of the form
514 @@{name}, or optionally just @@ if there is only one argument. The @@
515 text is called "sigil". It can be redefined to any other text with
516 re2c:api:sigil configuration. For example, the default pointer API
517 can be defined in free-form style generic API as follows:
518
re2c:define:YYPEEK       = "*YYCURSOR";
re2c:define:YYSKIP       = "++YYCURSOR";
re2c:define:YYBACKUP     = "YYMARKER = YYCURSOR";
re2c:define:YYBACKUPCTX  = "YYCTXMARKER = YYCURSOR";
re2c:define:YYRESTORE    = "YYCURSOR = YYMARKER";
re2c:define:YYRESTORECTX = "YYCURSOR = YYCTXMARKER";
re2c:define:YYRESTORETAG = "YYCURSOR = @@{tag}";
re2c:define:YYLESSTHAN   = "YYLIMIT - YYCURSOR < @@{len}";
re2c:define:YYSTAGP      = "@@{tag} = YYCURSOR";
re2c:define:YYSTAGN      = "@@{tag} = NULL";
re2c:define:YYSHIFT      = "YYCURSOR += @@{shift}";
re2c:define:YYSHIFTSTAG  = "@@{tag} += @@{shift}";
531
532 API primitives
533 Here is a list of API primitives that may be used by the generated code
534 in order to interface with the outer program. Which primitives are
535 needed depends on multiple factors, including the complexity of regular
536 expressions, input representation, buffering, the use of various fea‐
537 tures and so on. All the necessary primitives should be defined by the
538 user in the form of macros, functions, variables, free-form pieces of
539 code or any other suitable form. Re2c does not (and cannot) check the
540 definitions, so if anything is missing or defined incorrectly the gen‐
541 erated code will not compile.
542
543 YYCTYPE
544 The type of the input characters (code units). For ASCII,
545 EBCDIC and UTF-8 encodings it should be 1-byte unsigned integer.
546 For UTF-16 or UCS-2 it should be 2-byte unsigned integer. For
547 UTF-32 it should be 4-byte unsigned integer.
548
549 YYCURSOR
550 A pointer-like l-value that stores the current input position
551 (usually a pointer of type YYCTYPE*). Initially YYCURSOR should
552 point to the first input character. It is advanced by the gener‐
553 ated code. When a rule matches, YYCURSOR points to the one
554 after the last matched character. It is used only in the default
555 C API.
556
YYLIMIT
A pointer-like r-value that stores the end of input position
(usually a pointer of type YYCTYPE*). Initially YYLIMIT should
point to the one after the last available input character. It is
not changed by the generated code. The lexer compares YYCURSOR to
YYLIMIT in order to determine if there are enough input
characters left. YYLIMIT is used only in the default C API.
564
YYMARKER
A pointer-like l-value (usually a pointer of type YYCTYPE*) that
stores the position of the latest matched rule. It is used to
restore the YYCURSOR position if the longer match fails and the
lexer needs to roll back. Initialization is not needed. YYMARKER
is used only in the default C API.
571
572 YYCTXMARKER
573 A pointer-like l-value that stores the position of the trailing
574 context (usually a pointer of type YYCTYPE*). No initialization
575 is needed. It is used only in the default C API, and only with
576 the lookahead operator /.
577
578 YYFILL API primitive with one argument len. The meaning of YYFILL is
579 to provide at least len more input characters or fail. If EOF
580 rule is used, YYFILL should always return to the calling func‐
581 tion; the return value should be zero on success and non-zero on
582 failure. If EOF rule is not used, YYFILL return value is ignored
583 and it should not return on failure. Maximal value of len is
584 YYMAXFILL, which can be generated with /*!max:re2c*/ directive.
585 The definition of YYFILL can be either function-like or
586 free-form depending on the API style (see re2c:api:style and
587 re2c:define:YYFILL:naked).
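
One way to implement YYFILL for a buffered lexer is sketched below. It
follows the convention for lexers that use the EOF rule: return zero on
success and non-zero on failure. Everything else is an assumption made for
illustration: the struct input layout, the tok pointer that marks the start
of the current lexeme, the tiny BUFSIZE, and the in-memory source that
stands in for a file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUFSIZE 16 /* small on purpose, to exercise refilling */

/* Hypothetical lexer input: a fixed buffer refilled from a memory source. */
struct input {
    char buf[BUFSIZE];
    char *cur;        /* plays the role of YYCURSOR */
    char *lim;        /* plays the role of YYLIMIT */
    char *tok;        /* start of the current lexeme; must stay buffered */
    const char *src;  /* unread part of the source (stand-in for a file) */
    size_t src_left;  /* bytes remaining in src */
};

static void input_init(struct input *in, const char *src, size_t len)
{
    in->cur = in->lim = in->tok = in->buf;
    in->src = src;
    in->src_left = len;
}

/* Try to make at least `need` bytes available starting at in->cur.
 * Returns 0 on success and non-zero on failure. */
static int fill(struct input *in, size_t need)
{
    size_t used, kept, free_space, n;

    assert(in->buf <= in->tok && in->tok <= in->cur && in->cur <= in->lim);
    used = (size_t)(in->tok - in->buf); /* bytes before the lexeme */
    kept = (size_t)(in->lim - in->tok); /* bytes that must be preserved */

    /* Discard consumed bytes by shifting the live part to the front. */
    memmove(in->buf, in->tok, kept);
    in->cur -= used;
    in->lim -= used;
    in->tok -= used;

    /* Copy as much of the source as fits into the freed space. */
    free_space = BUFSIZE - kept;
    n = in->src_left < free_space ? in->src_left : free_space;
    memcpy(in->lim, in->src, n);
    in->lim += n;
    in->src += n;
    in->src_left -= n;

    return (size_t)(in->lim - in->cur) >= need ? 0 : 1;
}

/* Function-like definition; `in` must be in scope where the lexer runs. */
#define YYFILL(len) fill(in, (len))
```

Because the buffer contents move, any other saved positions (markers or
tags) would have to be shifted by the same amount; the sketch omits them.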
588
589 YYMAXFILL
590 An integral constant equal to the maximal value of YYFILL argu‐
591 ment. It can be generated with /*!max:re2c*/ directive.
592
YYLESSTHAN
A generic API primitive with one argument len. It should be
defined as an r-value of boolean type that equals true if and
only if there are fewer than len input characters left. The
definition can be either function-like or free-form depending on
the API style (see re2c:api:style).
599
600 YYPEEK A generic API primitive with no arguments. It should be defined
601 as an r-value of type YYCTYPE that is equal to the character at
602 the current input position. The definition can be either func‐
603 tion-like or free-form depending on the API style (see
604 re2c:api:style).
605
606 YYSKIP A generic API primitive with no arguments. The meaning of
607 YYSKIP is to advance the current input position by one charac‐
608 ter. The definition can be either function-like or free-form
609 depending on the API style (see re2c:api:style).
610
611 YYBACKUP
612 A generic API primitive with no arguments. The meaning of
613 YYBACKUP is to save the current input position, which is later
614 restored with YYRESTORE. The definition should be either func‐
615 tion-like or free-form depending on the API style (see
616 re2c:api:style).
617
618 YYRESTORE
619 A generic API primitive with no arguments. The meaning of YYRE‐
620 STORE is to restore the current input position to the value
621 saved by YYBACKUP. The definition should be either func‐
622 tion-like or free-form depending on the API style (see
623 re2c:api:style).
624
625 YYBACKUPCTX
626 A generic API primitive with zero arguments. The meaning of
627 YYBACKUPCTX is to save the current input position as the posi‐
628 tion of the trailing context, which is later restored by YYRE‐
629 STORECTX. The definition should be either function-like or
630 free-form depending on the API style (see re2c:api:style).
631
632 YYRESTORECTX
633 A generic API primitive with no arguments. The meaning of YYRE‐
634 STORECTX is to restore the trailing context position saved with
635 YYBACKUPCTX. The definition should be either function-like or
636 free-form depending on the API style (see re2c:api:style).
637
638 YYRESTORETAG
639 A generic API primitive with one argument tag. The meaning of
640 YYRESTORETAG is to restore the trailing context position to the
641 value of tag. The definition should be either function-like or
642 free-form depending on the API style (see re2c:api:style).
643
644 YYSTAGP
645 A generic API primitive with one argument tag. The meaning of
646 YYSTAGP is to set tag value to the current input position. The
647 definition should be either function-like or free-form depending
648 on the API style (see re2c:api:style).
649
YYSTAGN
A generic API primitive with one argument tag. The meaning of
YYSTAGN is to set tag value to null (or some default value). The
definition should be either function-like or free-form depending
on the API style (see re2c:api:style).
655
656 YYMTAGP
657 A generic API primitive with one argument tag. The meaning of
658 YYMTAGP is to append the current position to the history of tag.
659 The definition should be either function-like or free-form
660 depending on the API style (see re2c:api:style).
661
662 YYMTAGN
663 A generic API primitive with one argument tag. The meaning of
664 YYMTAGN is to append null (or some other default) value to the
665 history of tag. The definition can be either function-like or
666 free-form depending on the API style (see re2c:api:style).
667
668 YYSHIFT
669 A generic API primitive with one argument shift. The meaning of
670 YYSHIFT is to shift the current input position by shift charac‐
671 ters (the shift value may be negative). The definition can be
672 either function-like or free-form depending on the API style
673 (see re2c:api:style).
674
675 YYSHIFTSTAG
676 A generic API primitive with two arguments, tag and shift. The
677 meaning of YYSHIFTSTAG is to shift tag by shift characters (the
678 shift value may be negative). The definition can be either
679 function-like or free-form depending on the API style (see
680 re2c:api:style).
681
682 YYSHIFTMTAG
683 A generic API primitive with two arguments, tag and shift. The
684 meaning of YYSHIFTMTAG is to shift the latest value in the his‐
685 tory of tag by shift characters (the shift value may be nega‐
686 tive). The definition should be either function-like or
687 free-form depending on the API style (see re2c:api:style).
688
689 YYMAXNMATCH
690 An integral constant equal to the maximal number of POSIX cap‐
691 turing groups in a rule. It is generated with /*!maxn‐
692 match:re2c*/ directive.
693
694 YYCONDTYPE
695 The type of the condition enum. It should be generated either
696 with /*!types:re2c*/ directive or -t --type-header option.
697
698 YYGETCONDITION
699 An API primitive with zero arguments. It should be defined as
700 an r-value of type YYCONDTYPE that is equal to the current con‐
701 dition identifier. The definition can be either function-like or
702 free-form depending on the API style (see re2c:api:style and
703 re2c:define:YYGETCONDITION:naked).
704
705 YYSETCONDITION
706 An API primitive with one argument cond. The meaning of YYSET‐
707 CONDITION is to set the current condition identifier to cond.
708 The definition should be either function-like or free-form
709 depending on the API style (see re2c:api:style and
710 re2c:define:YYSETCONDITION@cond).
711
712 YYGETSTATE
713 An API primitive with zero arguments. It should be defined as
714 an r-value of integer type that is equal to the current lexer
715 state. Should be initialized to -1. The definition can be either
716 function-like or free-form depending on the API style (see
717 re2c:api:style and re2c:define:YYGETSTATE:naked).
718
719 YYSETSTATE
720 An API primitive with one argument state. The meaning of YYSET‐
721 STATE is to set the current lexer state to state. The defini‐
722 tion should be either function-like or free-form depending on
723 the API style (see re2c:api:style and re2c:define:YYSET‐
724 STATE@state).
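
With -f --storable-state, the lexer state has to live outside the lexing
function so that it survives between calls. A minimal sketch of such a
container and the matching function-like definitions follows. The struct
lexer_state, its initializer and the suspend helper are hypothetical;
suspend only imitates the point where generated code records a state and
returns to the caller.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char YYCTYPE; /* assumption: 1-byte code units */

/* Hypothetical container for everything that must survive between calls. */
struct lexer_state {
    int state;             /* current lexer state; -1 means "start afresh" */
    YYCTYPE yych;          /* current input character */
    unsigned int yyaccept; /* number of the latest matched rule */
};

static void lexer_state_init(struct lexer_state *st)
{
    assert(st != NULL);
    st->state = -1;        /* the state should be initialized to -1 */
    st->yych = 0;
    st->yyaccept = 0;
}

/* Function-like definitions; `st` must be in scope where the lexer runs. */
#define YYGETSTATE()      (st->state)
#define YYSETSTATE(value) (st->state = (value))

/* Stand-in for a point where generated code runs out of input: it records
 * the state to resume from and hands control back to the caller. */
static int suspend(struct lexer_state *st, int resume_state)
{
    YYSETSTATE(resume_state);
    return YYGETSTATE();
}
```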
725
726 YYDEBUG
727 A debug API primitive with two arguments. It can be used to
728 debug the generated code (with -d --debug-output option). YYDE‐
729 BUG should return no value and accept two arguments: state
730 (either a DFA state index or -1) and symbol (the current input
731 symbol).
732
733 yych An l-value of type YYCTYPE that stores the current input charac‐
734 ter. User definition is necessary only with -f --storable-state
735 option.
736
737 yyaccept
738 An l-value of unsigned integral type that stores the number of
739 the latest matched rule. User definition is necessary only with
740 -f --storable-state option.
741
742 yynmatch
743 An l-value of unsigned integral type that stores the number of
744 POSIX capturing groups in the matched rule. Used only with -P
745 --posix-captures option.
746
747 yypmatch
748 An array of l-values that are used to hold the tag values corre‐
749 sponding to the capturing parentheses in the matching rule.
750 Array length must be at least yynmatch * 2 (usually YYMAXNMATCH
751 * 2 is a good choice). Used only with -P --posix-captures
752 option.
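
The sketch below shows one way a semantic action might read a capturing
group out of yypmatch. It assumes that group i is delimited by
yypmatch[2 * i] (start) and yypmatch[2 * i + 1] (end), that the values are
pointers into the input, and that an unmatched group is marked with NULL.
The helper copy_group and the YYMAXNMATCH value used here are illustrative;
in a real program YYMAXNMATCH comes from the maxnmatch directive.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define YYMAXNMATCH 3 /* assumption: normally generated by re2c */

/* Copy capturing group `i` into `out` as a NUL-terminated string.
 * Returns the group length, or -1 if the group does not exist, did not
 * match, or does not fit into `out`. */
static long copy_group(const char **yypmatch, size_t yynmatch,
                       size_t i, char *out, size_t out_size)
{
    const char *start, *end;
    size_t len;

    assert(yypmatch != NULL && out != NULL);
    if (i >= yynmatch) return -1;
    start = yypmatch[2 * i];
    end = yypmatch[2 * i + 1];
    if (start == NULL || end == NULL) return -1;
    len = (size_t)(end - start);
    if (len + 1 > out_size) return -1;
    memcpy(out, start, len);
    out[len] = '\0';
    return (long)len;
}
```

Sizing the array as YYMAXNMATCH * 2 elements keeps it large enough for any
rule in the block, as suggested above.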
753
754 Directives
755 Below is the list of all directives provided by re2c (in no particular
756 order). More information on each directive can be found in the related
757 sections.
758
759 /*!re2c ... */
760 A standard re2c block.
761
%{ ... %}
A standard re2c block in -F --flex-syntax mode.

/*!rules:re2c ... */
A reusable re2c block (requires -r --reusable option).

/*!use:re2c ... */
A block that reuses previous rules-block specified with
/*!rules:re2c ... */ (requires -r --reusable option).
771
/*!ignore:re2c ... */
A block whose contents are ignored and cut off from the output
file.
775
776 /*!max:re2c*/
777 This directive is substituted with the macro-definition of
778 YYMAXFILL.
779
780 /*!maxnmatch:re2c*/
781 This directive is substituted with the macro-definition of
782 YYMAXNMATCH (requires -P --posix-captures option).
783
784 /*!getstate:re2c*/
785 This directive is substituted with conditional dispatch on lexer
786 state (requires -f --storable-state option).
787
788 /*!types:re2c ... */
789 This directive is substituted with the definition of condition
790 enum (requires -c --conditions option).
791
792 /*!stags:re2c ... */, /*!mtags:re2c ... */
793 These directives allow one to specify a template piece of code
794 that is expanded for each s-tag/m-tag variable generated by
795 re2c. This block has two optional configurations: format = "@@";
796 (specifies the template where @@ is substituted with the name of
797 each tag variable), and separator = ""; (specifies the piece of
798 code used to join the generated pieces for different tag vari‐
799 ables).
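As an illustration, a lexer that keeps its s-tags in local variables might declare them with a fragment like the one below. This is only a sketch: it assumes the default tag prefix yyt, and the type const char * is a choice made for the example rather than something re2c requires.

```c
/* Expands the template once per s-tag variable; with two s-tags and the
   default prefix this would yield something like
   "const char *yyt1;const char *yyt2;". */
/*!stags:re2c format = "const char *@@;"; */
```

A non-empty separator (for instance separator = ", ";) is useful when the expansion should form a single comma-separated list rather than a sequence of statements.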
800
801 /*!include:re2c FILE */
802 This directive allows one to include FILE (in the same sense as
803 #include directive in C/C++).
804
805 /*!header:re2c:on*/
806 This directive marks the start of header file. Everything after
807 it and up to the following /*!header:re2c:off*/ directive is
808 processed by re2c and written to the header file specified with
809 -t --type-header option.
810
811 /*!header:re2c:off*/
812 This directive marks the end of header file started with
813 /*!header:re2c:on*/.
814
815 Configurations
816 re2c:flags:t, re2c:flags:type-header
817 Specify the name of the generated header file relative to the
818 directory of the output file. (Same as -t, --type-header com‐
819 mand-line option except that the filepath is relative.)
820
821 re2c:flags:input
822 Same as --input command-line option.
823
824 re2c:api:style
825 Allows one to specify the style of generic API. Possible values
826 are functions and free-form. With functions style (the default
827 for the C backend) API primitives behave like functions, and
828 re2c generates parentheses with an argument list after the name
829 of each primitive. With free-form style (the default for the Go
830 backend) re2c treats API definitions as interpolated strings and
831 substitutes argument placeholders with the actual argument val‐
832 ues. This option can be overridden by options for individual
833 API primitives, e.g. re2c:define:YYFILL:naked for YYFILL.
834
835 re2c:api:sigil
836 Allows one to specify the "sigil" symbol (or string) that is
837 used to recognize argument placeholders in the definitions of
838 generic API primitives. The default value is @@. Placeholders
839 start with sigil, followed by the argument name in curly braces.
840 For example, if sigil is set to $, then placeholders will have
841 the form ${name}. Single-argument APIs may use shorthand nota‐
842 tion without the name in braces. This option can be overridden
843 by options for individual API primitives, e.g.
844 re2c:define:YYFILL@len for YYFILL.
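The substitution rule can be modelled in a few lines of C. This is a toy model written for illustration, not re2c's implementation: the function expand and its parameters are hypothetical, and it handles a single named argument only.

```c
#include <assert.h>
#include <string.h>

/* Expand placeholders in tmpl into out (capacity cap). Both the long form
 * <sigil>{<name>} and the bare <sigil> shorthand are replaced with value.
 * Returns 0 on success, -1 if out is too small or the sigil is empty. */
static int expand(const char *tmpl, const char *sigil, const char *name,
                  const char *value, char *out, size_t cap)
{
    assert(tmpl && sigil && name && value && out);
    const size_t slen = strlen(sigil), nlen = strlen(name), vlen = strlen(value);
    size_t n = 0;
    if (slen == 0 || cap == 0) {
        return -1;
    }
    while (*tmpl != '\0') {
        if (strncmp(tmpl, sigil, slen) == 0) {
            const char *p = tmpl + slen;
            /* Long form: the sigil is followed by {name}. */
            if (p[0] == '{' && strncmp(p + 1, name, nlen) == 0 && p[1 + nlen] == '}') {
                p += nlen + 2;
            }
            if (n + vlen >= cap) {
                return -1;
            }
            memcpy(out + n, value, vlen);
            n += vlen;
            tmpl = p;
        } else {
            if (n + 1 >= cap) {
                return -1;
            }
            out[n++] = *tmpl++;
        }
    }
    out[n] = '\0';
    return 0;
}
```

For example, with the default sigil the template "fill(@@{len});" and the value "4" give "fill(4);", and with the sigil set to $ the template "goto $;" expands through the shorthand form.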
845
846 re2c:define:YYCTYPE
847 Defines YYCTYPE (see the user interface section).
848
849 re2c:define:YYCURSOR
850 Defines C API primitive YYCURSOR (see the user interface sec‐
851 tion).
852
853 re2c:define:YYLIMIT
854 Defines C API primitive YYLIMIT (see the user interface sec‐
855 tion).
856
857 re2c:define:YYMARKER
858 Defines C API primitive YYMARKER (see the user interface sec‐
859 tion).
860
861 re2c:define:YYCTXMARKER
862 Defines C API primitive YYCTXMARKER (see the user interface sec‐
863 tion).
864
865 re2c:define:YYFILL
866 Defines API primitive YYFILL (see the user interface section).
867
868 re2c:define:YYFILL@len
869 Specifies the sigil used for argument substitution in YYFILL
870 definition. Defaults to @@. Overrides the more generic
871 re2c:api:sigil configuration.
872
873 re2c:define:YYFILL:naked
874 Allows one to override re2c:api:style for YYFILL. Value 0 cor‐
875 responds to free-form API style.
876
877 re2c:yyfill:enable
878 Defaults to 1 (YYFILL is enabled). Set this to zero to suppress
879 the generation of YYFILL. Use warnings (-W option) and re2c:sen‐
880 tinel configuration to verify that the generated lexer cannot
881 read past the end of input, as this might introduce severe secu‐
882 rity issues to your programs.
883
884 re2c:yyfill:parameter
885 Controls the argument in the parentheses that follow YYFILL.
886 Defaults to 1, which means that the argument is generated. If
887 zero, the argument is omitted. Can be overridden with
888 re2c:define:YYFILL:naked or re2c:api:style.
889
890 re2c:eof
891 Specifies the sentinel symbol used with EOF rule $ to check for
892 the end of input in the generated lexer. The default value is -1
893 (EOF rule is not used). Other possible values include all valid
894 code units. Only decimal numbers are recognized.
895
896 re2c:sentinel
897 Specifies the sentinel symbol used with the sentinel method of
898 checking for the end of input in the generated lexer (the case
899 when bounds checking is disabled with re2c:yyfill:enable = 0;
900 and EOF rule $ is not used). This configuration does not affect
901 code generation. It is used by re2c to verify that the sentinel
902 symbol is not allowed in the middle of the rule, and prevent
903 possible reads past the end of buffer in the generated lexer.
904 The default value is -1 (re2c assumes that the sentinel symbol
905 is 0, which is the most common case). Other possible values
906 include all valid code units. Only decimal numbers are recog‐
907 nized.
908
909 re2c:define:YYLESSTHAN
910 Defines generic API primitive YYLESSTHAN (see the user interface
911 section).
912
913 re2c:yyfill:check
914 Setting this to zero suppresses the generation of the YYFILL
915 check (YYLESSTHAN in generic API or YYLIMIT-based comparison in
916 default C API). This configuration is useful when the necessary
917 input is always available. It defaults to 1 (the check is
918 generated).
919
920 re2c:label:yyFillLabel
921 Allows one to change the prefix of YYFILL labels (used with EOF
922 rule or with storable states).
923
924 re2c:define:YYPEEK
925 Defines generic API primitive YYPEEK (see the user interface
926 section).
927
928 re2c:define:YYSKIP
929 Defines generic API primitive YYSKIP (see the user interface
930 section).
931
932 re2c:define:YYBACKUP
933 Defines generic API primitive YYBACKUP (see the user interface
934 section).
935
936 re2c:define:YYBACKUPCTX
937 Defines generic API primitive YYBACKUPCTX (see the user inter‐
938 face section).
939
940 re2c:define:YYRESTORE
941 Defines generic API primitive YYRESTORE (see the user interface
942 section).
943
944 re2c:define:YYRESTORECTX
945 Defines generic API primitive YYRESTORECTX (see the user inter‐
946 face section).
947
948 re2c:define:YYRESTORETAG
949 Defines generic API primitive YYRESTORETAG (see the user inter‐
950 face section).
951
952 re2c:define:YYSHIFT
953 Defines generic API primitive YYSHIFT (see the user interface
954 section).
955
956 re2c:define:YYSHIFTMTAG
957 Defines generic API primitive YYSHIFTMTAG (see the user inter‐
958 face section).
959
960 re2c:define:YYSHIFTSTAG
961 Defines generic API primitive YYSHIFTSTAG (see the user inter‐
962 face section).
963
964 re2c:define:YYSTAGN
965 Defines generic API primitive YYSTAGN (see the user interface
966 section).
967
968 re2c:define:YYSTAGP
969 Defines generic API primitive YYSTAGP (see the user interface
970 section).
971
972 re2c:define:YYMTAGN
973 Defines generic API primitive YYMTAGN (see the user interface
974 section).
975
976 re2c:define:YYMTAGP
977 Defines generic API primitive YYMTAGP (see the user interface
978 section).
979
980 re2c:flags:T, re2c:flags:tags
981 Same as -T --tags command-line option.
982
983 re2c:flags:P, re2c:flags:posix-captures
984 Same as -P --posix-captures command-line option.
985
986 re2c:tags:expression
987 Allows one to customize the way re2c addresses tag variables.
988 By default re2c generates expressions of the form yyt<N>. This
989 might be inconvenient, for example if tag variables are defined
990 as fields in a struct. Re2c recognizes placeholder of the form
991 @@{tag} or @@ and replaces it with the actual tag name. Sigil
992 @@ can be redefined with re2c:api:sigil configuration. For
993 example, setting re2c:tags:expression = "p->@@"; results in
994 expressions of the form p->yyt<N> in the generated code.
995
996 re2c:tags:prefix
997 Allows one to override the prefix of tag variables (defaults to
998 yyt).
999
1000 re2c:flags:lookahead
1001 Same as inverted --no-lookahead command-line option.
1002
1003 re2c:flags:optimize-tags
1004 Same as inverted --no-optimize-tags command-line option.
1005
1006 re2c:define:YYCONDTYPE
1007 Defines YYCONDTYPE (see the user interface section).
1008
1009 re2c:define:YYGETCONDITION
1010 Defines API primitive YYGETCONDITION (see the user interface
1011 section).
1012
1013 re2c:define:YYGETCONDITION:naked
1014 Allows one to override re2c:api:style for YYGETCONDITION. Value
1015 0 corresponds to free-form API style.
1016
1017 re2c:define:YYSETCONDITION
1018 Defines API primitive YYSETCONDITION (see the user interface
1019 section).
1020
1021 re2c:define:YYSETCONDITION@cond
1022 Specifies the sigil used for argument substitution in YYSETCON‐
1023 DITION definition. The default value is @@. Overrides the more
1024 generic re2c:api:sigil configuration.
1025
1026 re2c:define:YYSETCONDITION:naked
1027 Allows one to override re2c:api:style for YYSETCONDITION. Value
1028 0 corresponds to free-form API style.
1029
1030 re2c:cond:goto
1031 Allows one to customize the goto statements used with the short‐
1032 cut :=> rules in conditions. The default value is goto @@;.
1033 Placeholders are substituted with condition name (see
1034 re2c:api:sigil and re2c:cond:goto@cond).
1035
1036 re2c:cond:goto@cond
1037 Specifies the sigil used for argument substitution in
1038 re2c:cond:goto definition. The default value is @@. Overrides
1039 the more generic re2c:api:sigil configuration.
1040
1041 re2c:cond:divider
1042 Defines the divider for condition blocks. The default value is
1043 /* *********************************** */. Placeholders are
1044 substituted with condition name (see re2c:api:sigil and
1045 re2c:cond:divider@cond).
1046
1047 re2c:cond:divider@cond
1048 Specifies the sigil used for argument substitution in
1049 re2c:cond:divider definition. The default value is @@. Over‐
1050 rides the more generic re2c:api:sigil configuration.
1051
1052 re2c:condprefix
1053 Specifies the prefix used for condition labels. The default
1054 value is yyc_.
1055
1056 re2c:condenumprefix
1057 Specifies the prefix used for condition identifiers. The
1058 default value is yyc.
1059
1060 re2c:define:YYGETSTATE
1061 Defines API primitive YYGETSTATE (see the user interface sec‐
1062 tion).
1063
1064 re2c:define:YYGETSTATE:naked
1065 Allows one to override re2c:api:style for YYGETSTATE. Value 0
1066 corresponds to free-form API style.
1067
1068 re2c:define:YYSETSTATE
1069 Defines API primitive YYSETSTATE (see the user interface sec‐
1070 tion).
1071
1072 re2c:define:YYSETSTATE@state
1073 Specifies the sigil used for argument substitution in YYSETSTATE
1074 definition. The default value is @@. Overrides the more generic
1075 re2c:api:sigil configuration.
1076
1077 re2c:define:YYSETSTATE:naked
1078 Allows one to override re2c:api:style for YYSETSTATE. Value 0
1079 corresponds to free-form API style.
1080
1081 re2c:state:abort
1082 If set to a positive integer value, changes the form of the
1083 YYGETSTATE switch: instead of using default case to jump to the
1084 beginning of the lexer block, a -1 case is used, and the default
1085 case aborts the program.
1086
1087 re2c:state:nextlabel
1088 With storable states, controls whether the YYGETSTATE block
1089 is followed by a yyNext label (the default value is zero, which
1090 corresponds to no label). Instead of using yyNext it is possible
1091 to use re2c:startlabel to force the generation of a specific
1092 start label. Instead of using labels it is often more conve‐
1093 nient to generate YYGETSTATE code using /*!getstate:re2c*/.
1094
1095 re2c:label:yyNext
1096 Allows one to change the name of the yyNext label.
1097
1098 re2c:startlabel
1099 Controls the generation of start label for the next lexer block.
1100 The default value is zero, which means that the start label is
1101 generated only if it is used. An integer value greater than zero
1102 forces the generation of start label even if it is unused by the
1103 lexer. A string value also forces start label generation and
1104 sets the label name to the specified string. This configuration
1105 applies only to the current block (it is reset to default for
1106 the next block).
1107
1108 re2c:flags:s, re2c:flags:nested-ifs
1109 Same as -s --nested-ifs command-line option.
1110
1111 re2c:flags:b, re2c:flags:bit-vectors
1112 Same as -b --bit-vectors command-line option.
1113
1114 re2c:variable:yybm
1115 Overrides the name of the yybm variable.
1116
1117 re2c:yybm:hex
1118 Defaults to zero (a decimal bitmap table is generated). If set
1119 to nonzero, a hexadecimal table is generated.
1120
1121 re2c:flags:g, re2c:flags:computed-gotos
1122 Same as -g --computed-gotos command-line option.
1123
1124 re2c:cgoto:threshold
1125 With -g --computed-gotos option this value specifies the com‐
1126 plexity threshold that triggers the generation of jump tables
1127 instead of nested if statements and bitmaps. The default value
1128 is 9.
1129
1130 re2c:flags:case-ranges
1131 Same as --case-ranges command-line option.
1132
1133 re2c:flags:e, re2c:flags:ecb
1134 Same as -e --ecb command-line option.
1135
1136 re2c:flags:8, re2c:flags:utf-8
1137 Same as -8 --utf-8 command-line option.
1138
1139 re2c:flags:w, re2c:flags:wide-chars
1140 Same as -w --wide-chars command-line option.
1141
1142 re2c:flags:x, re2c:flags:utf-16
1143 Same as -x --utf-16 command-line option.
1144
1145 re2c:flags:u, re2c:flags:unicode
1146 Same as -u --unicode command-line option.
1147
1148 re2c:flags:encoding-policy
1149 Same as --encoding-policy command-line option.
1150
1151 re2c:flags:empty-class
1152 Same as --empty-class command-line option.
1153
1154 re2c:flags:case-insensitive
1155 Same as --case-insensitive command-line option.
1156
1157 re2c:flags:case-inverted
1158 Same as --case-inverted command-line option.
1159
1160 re2c:flags:i, re2c:flags:no-debug-info
1161 Same as -i --no-debug-info command-line option.
1162
1163 re2c:indent:string
1164 Specifies the string to use for indentation. The default value
1165 is "\t". Indent string should contain only whitespace charac‐
1166 ters. To disable indentation entirely, set this configuration
1167 to empty string "".
1168
1169 re2c:indent:top
1170 Specifies the minimum amount of indentation to use. The default
1171 value is zero. The value should be a non-negative integer num‐
1172 ber.
1173
1174 re2c:labelprefix
1175 Allows one to change the prefix of DFA state labels. The
1176 default value is yy.
1177
1178 re2c:yych:emit
1179 Set this to zero to suppress the generation of yych definition.
1180 Defaults to 1 (the definition is generated).
1181
1182 re2c:variable:yych
1183 Overrides the name of the yych variable.
1184
1185 re2c:yych:conversion
1186 If set to nonzero, re2c automatically generates a cast to YYC‐
1187 TYPE every time yych is read. Defaults to zero (no cast).
1188
1189 re2c:variable:yyaccept
1190 Overrides the name of the yyaccept variable.
1191
1192 re2c:variable:yytarget
1193 Overrides the name of the yytarget variable.
1194
1195 re2c:variable:yystable
1196 Deprecated.
1197
1198 re2c:variable:yyctable
1199 When both -c --conditions and -g --computed-gotos are active,
1200 re2c will use this variable to generate a static jump table for
1201 YYGETCONDITION.
1202
1203 re2c:define:YYDEBUG
1204 Defines YYDEBUG (see the user interface section).
1205
1206 re2c:flags:d, re2c:flags:debug-output
1207 Same as -d --debug-output command-line option.
1208
1209 re2c:flags:dfa-minimization
1210 Same as --dfa-minimization command-line option.
1211
1212 re2c:flags:eager-skip
1213 Same as --eager-skip command-line option.
1214
1216 re2c uses the following syntax for regular expressions:
1217
1218 · "foo" case-sensitive string literal
1219
1220 · 'foo' case-insensitive string literal
1221
1222 · [a-xyz], [^a-xyz] character class (possibly negated)
1223
1224 · . any character except newline
1225
1226 · R \ S difference of character classes R and S
1227
1228 · R* zero or more occurrences of R
1229
1230 · R+ one or more occurrences of R
1231
1232 · R? optional R
1233
1234 · R{n} repetition of R exactly n times
1235
1236 · R{n,} repetition of R at least n times
1237
1238 · R{n,m} repetition of R from n to m times
1239
1240 · (R) just R; parentheses are used to override precedence or for
1241 POSIX-style submatch
1242
1243 · R S concatenation: R followed by S
1244
1245 · R | S alternative: R or S
1246
1247 · R / S lookahead: R followed by S, but S is not consumed
1248
1249 · name the regular expression defined as name (or literal string "name"
1250 in Flex compatibility mode)
1251
1252 · {name} the regular expression defined as name in Flex compatibility
1253 mode
1254
1255 · @stag an s-tag: saves the last input position at which @stag matches
1256 in a variable named stag
1257
1258 · #mtag an m-tag: saves all input positions at which #mtag matches in a
1259 variable named mtag
1260
1261 Character classes and string literals may contain the following escape
1262 sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and hexa‐
1263 decimal escapes \xhh, \uhhhh and \Uhhhhhhhh.
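The fragment below combines several of these constructs in named definitions. It is a sketch only: the names are invented for the example, and the definitions are meant to sit in a block that also contains rules using them.

```c
/*!re2c
    digit   = [0-9];
    dec     = [1-9] digit*;            // concatenation and repetition
    hex     = "0x" [0-9a-fA-F]{1,8};   // case-sensitive literal, bounded repetition
    kw_if   = 'if';                    // case-insensitive literal: if, If, iF, IF
    nonquot = [^\n] \ ["];             // class difference: no newline, no quote
    str     = ["] nonquot* ["];        // uses an earlier definition by name
*/
```

Each name is defined before it is used, as required for definitions.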
1264
1266 Re2c provides a number of ways to handle end-of-input situation. Which
1267 way to use depends on the complexity of regular expressions, perfor‐
1268 mance considerations, the need for input buffering and various other
1269 factors. EOF handling is probably the most complex part of re2c user
1270 interface --- it definitely requires a bit of understanding of how the
1271 generated lexer works. But in return it allows the user to customize
1272 lexer for a particular environment and avoid the unnecessary overhead
1273 of generic methods when a simpler method is sufficient. Roughly speak‐
1274 ing, there are four main methods:
1275
1276 · using sentinel symbol (simple and efficient, but limited)
1277
1278 · bounds checking with padding (generic, but complex)
1279
1280 · EOF rule: a combination of sentinel symbol and bounds checking
1281 (generic and simple, can be more or less efficient than bounds check‐
1282 ing with padding depending on the grammar)
1283
1284 · using generic API (user-defined, so may be incorrect ;])
1285
1286 Using sentinel symbol
1287 This is the simplest and the most efficient method. It is applicable in
1288 cases when the input is small enough to fit into a contiguous memory
1289 buffer and there is a natural "sentinel" symbol --- a code unit that is
1290 not allowed by any of the regular expressions in grammar (except possi‐
1291 bly as a terminating character). Sentinel symbol never appears in
1292 well-formed input, therefore it can be appended at the end of input and
1293 used as a stop signal by the lexer. A good example of such input is a
1294 null-terminated C-string, provided that the grammar does not allow NUL
1295 in the middle of lexemes. Sentinel method is very efficient, because
1296 the lexer does not need to perform any additional checks for the end of
1297 input --- it comes naturally as a part of processing the next charac‐
1298 ter. It is very important that the sentinel symbol is not allowed in
1299 the middle of the rule --- otherwise on some inputs the lexer may read
1300 past the end of buffer and crash or cause memory corruption. Re2c veri‐
1301 fies this automatically. Use re2c:sentinel configuration to specify
1302 which sentinel symbol is used.
1303
1304 Below is an example of using sentinel method. Configuration
1305 re2c:yyfill:enable = 0; suppresses generation of end-of-input checks
1306 and YYFILL calls.
1307
    // re2c $INPUT -o $OUTPUT
    #include <assert.h>

    // expect a null-terminated string
    static int lex(const char *YYCURSOR)
    {
        int count = 0;
    loop:
        /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable = 0;

        *      { return -1; }
        [\x00] { return count; }
        [a-z]+ { ++count; goto loop; }
        [ ]+   { goto loop; }

        */
    }

    int main()
    {
        assert(lex("") == 0);
        assert(lex("one two three") == 3);
        assert(lex("f0ur") == -1);
        return 0;
    }
1335
1336
1337 Bounds checking with padding
1338 Bounds checking is a generic method: it can be used with any input
1339 grammar. The basic idea is simple: we need to check for the end of
1340 input before reading the next input character. However, if implemented
1341 in a straightforward way, this would be quite inefficient: checking on
1342 each input character would cause a major slowdown. Re2c avoids slowdown
1343 by generating checks only in certain key states of the lexer, and let‐
1344 ting it run without checks in-between the key states. More precisely,
1345 re2c computes strongly connected components (SCCs) of the underlying
1346 DFA (which roughly correspond to loops), and generates only a few
1347 checks per each SCC (usually just one, but in general enough to make
1348 the SCC acyclic). The check is of the form (YYLIMIT - YYCURSOR) < n,
1349 where n is the maximal length of a simple path in the corresponding
1350 SCC. If this condition is true, the lexer calls YYFILL(n), which must
1351 either supply at least n input characters or not return. When the
1352 lexer continues after the check, it is certain that the next n charac‐
1353 ters can be read safely without checks.
1354
1355 This approach reduces the number of checks significantly (and makes the
1356 lexer much faster as a result), but it has a downside. Since the lexer
1357 checks for multiple characters at once, it may end up in a situation
1358 when there are a few remaining input characters (less than n) corre‐
1359 sponding to a short path in the SCC, but the lexer cannot proceed
1360 because of the check, and YYFILL cannot supply more characters because
1361 it is the end of input. To solve this problem, re2c requires that addi‐
1362 tional padding consisting of fake characters is appended at the end of
1363 input. The length of padding should be YYMAXFILL, which equals the
1364 maximum n parameter to YYFILL and must be generated by re2c using
1365 /*!max:re2c*/ directive. The fake characters should not form a valid
1366 lexeme suffix, otherwise the lexer may be fooled into matching a fake
1367 lexeme. Usually it's a good idea to use NUL characters for padding.
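The interplay of the multi-character check and the padding can be shown with a hand-written scanner. This is a sketch under stated assumptions, not code generated by re2c: MAXFILL is a stand-in for YYMAXFILL, the grammar (the lexeme "true" and spaces) is invented for the example, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAXFILL 4 /* hypothetical stand-in for YYMAXFILL: the longest lexeme here */

/* Copy len bytes of input and append MAXFILL zero bytes of padding. */
static char *pad_copy(const char *src, size_t len)
{
    char *buf = malloc(len + MAXFILL);
    assert(buf != NULL);
    memcpy(buf, src, len);
    memset(buf + len, 0, MAXFILL);
    return buf;
}

/* Count lexemes "true" in a padded buffer, skipping spaces. The caller
 * passes lim = start + len + MAXFILL. Returns the count, or -1 on error. */
static int count_true(const char *cur, const char *lim)
{
    int count = 0;
    for (;;) {
        /* One check covers the next MAXFILL reads. A buffering lexer would
         * call YYFILL(MAXFILL) at this point; this sketch simply gives up. */
        if ((lim - cur) < MAXFILL) {
            return -1;
        }
        switch (cur[0]) {
        case '\0': /* only the first padding byte is a valid end of input */
            return cur == lim - MAXFILL ? count : -1;
        case ' ':
            cur += 1;
            break;
        case 't': /* cur[1..3] may be padding, but never out of bounds */
            if (cur[1] != 'r' || cur[2] != 'u' || cur[3] != 'e') {
                return -1;
            }
            cur += 4;
            ++count;
            break;
        default:
            return -1;
        }
    }
}

/* Convenience wrapper: pad the input, run the scanner, release the copy. */
static int run(const char *src, size_t len)
{
    char *buf = pad_copy(src, len);
    int result = count_true(buf, buf + len + MAXFILL);
    free(buf);
    return result;
}
```

On the truncated input "tr" the scanner reads two padding bytes while trying to match "true". The padding keeps those reads inside the buffer, and because zero bytes do not continue any lexeme the match fails cleanly.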
1368
1369 Below is an example of using bounds checking with padding. Note that
1370 the grammar rule for single-quoted strings allows arbitrary symbols in
1371 the middle of lexeme, so there is no natural sentinel in the grammar.
1372 Strings like "aha\0ha" are perfectly valid, but ill-formed strings like
1373 "aha\0 are also possible and shouldn’t crash the lexer. In this example
1374 we do not use buffer refilling, therefore YYFILL definition simply
1375 returns an error. Note that YYFILL will only be called after the lexer
1376 reaches padding, because only then will the check condition be satis‐
1377 fied.
1378
    // re2c $INPUT -o $OUTPUT
    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    /*!max:re2c*/

    // expect YYMAXFILL-padded string
    static int lex(const char *str, unsigned int len)
    {
        const char *YYCURSOR = str, *YYLIMIT = str + len + YYMAXFILL;
        int count = 0;

    loop:
        /*!re2c
        re2c:api:style = free-form;
        re2c:define:YYCTYPE = char;
        re2c:define:YYFILL = "return -1;";

        *                           { return -1; }
        [\x00]                      { return YYCURSOR == YYLIMIT ? count : -1; }
        ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
        [ ]+                        { goto loop; }

        */
    }

    // make a copy of the string with YYMAXFILL zeroes at the end
    static void test(const char *str, unsigned int len, int res)
    {
        char *s = (char*) malloc(len + YYMAXFILL);
        memcpy(s, str, len);
        memset(s + len, 0, YYMAXFILL);
        int r = lex(s, len);
        free(s);
        assert(r == res);
    }

    #define TEST(s, r) test(s, sizeof(s) - 1, r)
    int main()
    {
        TEST("", 0);
        TEST("'qu\0tes' 'are' 'fine: \\'' ", 3);
        TEST("'unterminated\\'", -1);
        return 0;
    }
1425
1426
1427 EOF rule
1428 EOF rule $ was introduced in version 1.2. It is a hybrid approach that
1429 tries to take the best of both worlds: simplicity and efficiency of the
1430 sentinel method combined with the generality of bounds-checking method.
1431 The idea is to appoint an arbitrary symbol to be the sentinel, and only
1432 perform further bounds checking if the sentinel symbol matches (more
1433 precisely, if the symbol class that contains it matches). The check is
1434 of the form YYLIMIT <= YYCURSOR. If this condition is not satisfied,
1435 then the sentinel is just an ordinary input character and the lexer
1436 continues. Otherwise this is a real sentinel, and the lexer calls
1437 YYFILL(). If YYFILL returns zero, the lexer assumes that it has more
1438 input and tries to re-match. Otherwise YYFILL returns non-zero and the
1439 lexer knows that it has reached the end of input. At this point there
1440 are three possibilities. First, it might have already matched a shorter
1441 lexeme --- in this case it just rolls back to the last accepting state.
1442 Second, it might have consumed some characters, but failed to match ---
1443 in this case it falls back to default rule *. Finally, it might be in
1444 the initial state --- in this (and only this!) case it matches EOF rule
1445 $.
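The core of the method, a bounds check that runs only when the current character looks like the sentinel, can be sketched in plain C. The code below is a hand-written illustration with no refilling (comparable to re2c:yyfill:enable = 0), not output of re2c; SENTINEL and count_words are names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

#define SENTINEL 0 /* plays the role of the re2c:eof value */

/* Count runs of non-space bytes in buf[0..len). The caller must store the
 * sentinel at buf[len]; a sentinel byte before that is ordinary input. */
static int count_words(const char *buf, size_t len)
{
    const char *cur = buf, *lim = buf + len;
    int count = 0, in_word = 0;
    assert(buf[len] == SENTINEL);
    for (;;) {
        const char c = *cur;
        /* The bounds check runs only when the byte looks like the sentinel. */
        if (c == SENTINEL && lim <= cur) {
            return count + in_word; /* real end of input */
        }
        if (c == ' ') {
            count += in_word;
            in_word = 0;
        } else {
            in_word = 1;
        }
        ++cur;
    }
}
```

A zero byte in the middle of the input fails the lim <= cur test and is treated like any other character, so only the final sentinel stops the loop.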
1446
1447 Below is an example of using EOF rule. Configuration re2c:yyfill:enable
1448 = 0; suppresses generation of YYFILL calls (but not the bounds checks).
1449
    // re2c $INPUT -o $OUTPUT
    #include <assert.h>

    // expect a null-terminated string
    static int lex(const char *str, unsigned int len)
    {
        const char *YYCURSOR = str, *YYLIMIT = str + len, *YYMARKER;
        int count = 0;

    loop:
        /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable = 0;
        re2c:eof = 0;

        *                           { return -1; }
        $                           { return count; }
        ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
        [ ]+                        { goto loop; }

        */
    }

    #define TEST(s, r) assert(lex(s, sizeof(s) - 1) == r)
    int main()
    {
        TEST("", 0);
        TEST("'qu\0tes' 'are' 'fine: \\'' ", 3);
        TEST("'unterminated\\'", -1);
        return 0;
    }
1481
1482
1483 Using generic API
1484 Generic API can be used with any of the above methods. It also allows
1485 one to use a user-defined method by placing EOF checks in one of the
1486 basic primitives. Usually this is either YYSKIP (the check is per‐
1487 formed when advancing to the next input character), or YYPEEK (the
1488 check is performed when reading the next input character). The result‐
1489 ing methods are inefficient, as they check on each input character.
1490 However, they can be useful in cases when the input cannot be buffered
1491 or padded and does not contain a sentinel character at the end. One
1492 should be cautious when using such ad-hoc methods, as it is easy to
1493 overlook some corner cases and come up with a method that only par‐
1494 tially works. Also it should be noted that not everything can be
1495 expressed via generic API: for example, it is impossible to reimplement
1496 the way EOF rule works (in particular, it is impossible to re-match the
1497 character after successful YYFILL).
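A per-character check of this kind can be sketched without re2c. In the example below every read goes through a checked primitive that returns an out-of-band value at the end of input. The names peek and count_digits and the -1 end marker are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical checked primitive: the next byte, or -1 at the end of input. */
static int peek(const char *cur, const char *lim)
{
    assert(cur <= lim);
    return cur < lim ? (unsigned char)*cur : -1;
}

/* Count decimal digits in an unterminated buffer; every read is checked. */
static int count_digits(const char *buf, size_t len)
{
    const char *cur = buf, *lim = buf + len;
    int count = 0, c;
    while ((c = peek(cur, lim)) != -1) {
        if (c >= '0' && c <= '9') {
            ++count;
        }
        ++cur;
    }
    return count;
}
```

The cost is one comparison per character, which is the inefficiency mentioned above; in return the buffer needs neither padding nor a terminating sentinel.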
1498
1499 Below is an example of using YYPEEK to perform bounds checking without
1500 padding. YYFILL generation is suppressed using re2c:yyfill:enable = 0;
1501 configuration. Note that if the grammar were more complex, this method
1502 might not work when two rules overlap and the EOF check fails after
1503 a shorter lexeme has already been matched (our example happens to
1504 have no overlapping rules).
1505
    // re2c $INPUT -o $OUTPUT
    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    // expect a string without terminating null
    static int lex(const char *str, unsigned int len)
    {
        const char *cur = str, *lim = str + len, *mar;
        int count = 0;

    loop:
        /*!re2c
        re2c:yyfill:enable = 0;
        re2c:eof = 0;
        re2c:flags:input = custom;
        re2c:api:style = free-form;
        re2c:define:YYCTYPE    = char;
        re2c:define:YYLESSTHAN = "cur >= lim";
        re2c:define:YYPEEK     = "cur < lim ? *cur : 0";  // fake null
        re2c:define:YYSKIP     = "++cur;";
        re2c:define:YYBACKUP   = "mar = cur;";
        re2c:define:YYRESTORE  = "cur = mar;";

        *                           { return -1; }
        $                           { return count; }
        ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
        [ ]+                        { goto loop; }

        */
    }

    // make a copy of the string without terminating null
    static void test(const char *str, unsigned int len, int res)
    {
        char *s = (char*) malloc(len);
        memcpy(s, str, len);
        int r = lex(s, len);
        free(s);
        assert(r == res);
    }

    #define TEST(s, r) test(s, sizeof(s) - 1, r)
    int main()
    {
        TEST("", 0);
        TEST("'qu\0tes' 'are' 'fine: \\'' ", 3);
        TEST("'unterminated\\'", -1);
        return 0;
    }
1556
1557
1559 The need for buffering arises when the input cannot be mapped in memory
1560 all at once: either it is too large, or it comes in a streaming fashion
1561 (like reading from a socket). The usual technique in such cases is to
1562 allocate a fixed-sized memory buffer and process input in chunks that
1563 fit into the buffer. When the current chunk is processed, it is moved
1564 out and new data is moved in. In practice it is somewhat more complex,
1565 because lexer state consists not of a single input position, but a set
1566 of interrelated positions:
1567
1568 · cursor: the next input character to be read (YYCURSOR in default API
1569 or YYSKIP/YYPEEK in generic API)
1570
1571 · limit: the position after the last available input character (YYLIMIT
1572 in default API, implicitly handled by YYLESSTHAN in generic API)
1573
1574 · marker: the position of the most recent match, if any (YYMARKER in
1575 default API or YYBACKUP/YYRESTORE in generic API)
1576
1577 · token: the start of the current lexeme (implicit in re2c API, as it
1578 is not needed for the normal lexer operation and can be defined and
1579 updated by the user)
1580
1581 · context marker: the position of the trailing context (YYCTXMARKER in
1582 default API or YYBACKUPCTX/YYRESTORECTX in generic API)
1583
1584 · tag variables: submatch positions (defined with /*!stags:re2c*/ and
1585 /*!mtags:re2c*/ directives and YYSTAGP/YYSTAGN/YYMTAGP/YYMTAGN in
1586 generic API)
1587
1588 Not all these are used in every case, but if used, they must be updated
1589 by YYFILL. All active positions are contained in the segment between
1590 token and cursor, therefore everything between buffer start and token
1591 can be discarded, the segment from token and up to limit should be
1592 moved to the beginning of buffer, and the free space at the end of buf‐
1593 fer should be filled with new data. In order to avoid frequent YYFILL
1594 calls it is best to fill in as many input characters as possible (even
1595 though fewer characters might suffice to resume the lexer). The details
1596 of YYFILL implementation are slightly different depending on which EOF
1597 handling method is used: the case of EOF rule is somewhat simpler than
1598 the case of bounds-checking with padding. Also note that if -f
1599 --storable-state option is used, YYFILL has slightly different
1600 semantics (described in the section about storable state).
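The shift-and-refill step described above can be sketched in isolation. The code below is an illustration rather than a definitive YYFILL: it reads from a memory block instead of a file or socket so that it stays self-contained, it keeps one spare byte for a terminating zero, and the names Buffer, buffer_init, buffer_fill and BUFSIZE are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUFSIZE 8 /* deliberately tiny so that the shift is easy to follow */

typedef struct {
    const char *src;       /* unread source data (stands in for a file) */
    size_t src_len;
    char buf[BUFSIZE + 1]; /* one extra byte for a terminating zero */
    char *tok, *mar, *cur, *lim;
} Buffer;

static void buffer_init(Buffer *b, const char *src, size_t len)
{
    b->src = src;
    b->src_len = len;
    /* All positions start at the end, so the first fill shifts everything. */
    b->tok = b->mar = b->cur = b->lim = b->buf + BUFSIZE;
    *b->lim = 0;
}

/* Discard the data before tok, move [tok, lim) to the buffer start, rebase
 * all positions and append as much new data as fits. Returns 0 on success,
 * 1 if the source is exhausted, 2 if the current lexeme fills the buffer. */
static int buffer_fill(Buffer *b)
{
    assert(b->buf <= b->tok && b->tok <= b->cur && b->cur <= b->lim);
    const size_t shift = (size_t)(b->tok - b->buf);
    if (b->src_len == 0) {
        return 1;
    }
    if (shift == 0) {
        return 2;
    }
    memmove(b->buf, b->tok, (size_t)(b->lim - b->tok));
    b->tok -= shift;
    b->mar -= shift;
    b->cur -= shift;
    b->lim -= shift;
    const size_t room = (size_t)(b->buf + BUFSIZE - b->lim);
    const size_t n = b->src_len < room ? b->src_len : room;
    memcpy(b->lim, b->src, n);
    b->src += n;
    b->src_len -= n;
    b->lim += n;
    *b->lim = 0;
    return 0;
}
```

Every position is moved by the same distance, so the relative layout of token, marker, cursor and limit is preserved across the call.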
1601
1602 YYFILL with EOF rule
1603 If EOF rule is used, YYFILL is a function-like primitive that accepts
1604 no arguments and returns a value which is checked against zero. YYFILL
1605 invocation is triggered by condition YYLIMIT <= YYCURSOR in default API
1606 and YYLESSTHAN() in generic API. A non-zero return value means that
1607 YYFILL has failed. A successful YYFILL call must supply at least one
1608 character and adjust input positions accordingly. Limit must always be
1609 set to one after the last input position in buffer, and the character
1610 at the limit position must be the sentinel symbol specified by re2c:eof
1611 configuration. The pictures below show the relative locations of input
1612 positions in buffer before and after YYFILL call (sentinel symbol is
1613 marked with #, and the second picture shows the case when there is not
1614 enough input to fill the whole buffer).
1615
               <-- shift -->
             >-A------------B---------C-------------D#-----------E->
             buffer       token    marker         limit,
                                                  cursor
>-A------------B---------C-------------D------------E#->
             buffer,  marker        cursor        limit
             token

               <-- shift -->
             >-A------------B---------C-------------D#--E (EOF)
             buffer       token    marker         limit,
                                                  cursor
>-A------------B---------C-------------D---E#........
             buffer,  marker        cursor limit
             token
1631
Here is an example of a program that creates a file named input, reads
it back in chunks of 4096 bytes and uses EOF rule.
1634
1635 // re2c $INPUT -o $OUTPUT
1636 #include <assert.h>
1637 #include <stdio.h>
1638 #include <string.h>
1639
1640 #define SIZE 4096
1641
1642 typedef struct {
1643 FILE *file;
1644 char buf[SIZE + 1], *lim, *cur, *mar, *tok;
1645 int eof;
1646 } Input;
1647
1648 static int fill(Input *in)
1649 {
1650 if (in->eof) {
1651 return 1;
1652 }
1653 const size_t free = in->tok - in->buf;
1654 if (free < 1) {
1655 return 2;
1656 }
1657 memmove(in->buf, in->tok, in->lim - in->tok);
1658 in->lim -= free;
1659 in->cur -= free;
1660 in->mar -= free;
1661 in->tok -= free;
1662 in->lim += fread(in->lim, 1, free, in->file);
1663 in->lim[0] = 0;
1664 in->eof |= in->lim < in->buf + SIZE;
1665 return 0;
1666 }
1667
1668 static void init(Input *in, FILE *file)
1669 {
1670 in->file = file;
1671 in->cur = in->mar = in->tok = in->lim = in->buf + SIZE;
1672 in->eof = 0;
1673 fill(in);
1674 }
1675
1676 static int lex(Input *in)
1677 {
1678 int count = 0;
1679 loop:
1680 in->tok = in->cur;
1681 /*!re2c
1682 re2c:eof = 0;
1683 re2c:api:style = free-form;
1684 re2c:define:YYCTYPE = char;
1685 re2c:define:YYCURSOR = in->cur;
1686 re2c:define:YYMARKER = in->mar;
1687 re2c:define:YYLIMIT = in->lim;
1688 re2c:define:YYFILL = "fill(in) == 0";
1689
1690 * { return -1; }
1691 $ { return count; }
1692 ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
1693 [ ]+ { goto loop; }
1694
1695 */
1696 }
1697
1698 int main()
1699 {
1700 const char *fname = "input";
1701 const char str[] = "'qu\0tes' 'are' 'fine: \\'' ";
1702 FILE *f;
1703 Input in;
1704
1705 // prepare input file: a few times the size of the buffer,
1706 // containing strings with zeroes and escaped quotes
1707 f = fopen(fname, "w");
1708 for (int i = 0; i < SIZE; ++i) {
1709 fwrite(str, 1, sizeof(str) - 1, f);
1710 }
1711 fclose(f);
1712
1713 f = fopen(fname, "r");
1714 init(&in, f);
1715 assert(lex(&in) == SIZE * 3);
1716 fclose(f);
1717
1718 remove(fname);
1719 return 0;
1720 }
1721
1722
1723 YYFILL with padding
1724 In the default case (when EOF rule is not used) YYFILL is a func‐
1725 tion-like primitive that accepts a single argument and does not return
1726 any value. YYFILL invocation is triggered by condition (YYLIMIT -
1727 YYCURSOR) < n in default API and YYLESSTHAN(n) in generic API. The
1728 argument passed to YYFILL is the minimal number of characters that must
1729 be supplied. If it fails to do so, YYFILL must not return to the lexer
1730 (for that reason it is best implemented as a macro that returns from
1731 the calling function on failure). In case of a successful YYFILL invo‐
1732 cation the limit position must be set either to one after the last
1733 input position in buffer, or to the end of YYMAXFILL padding (in case
1734 YYFILL has successfully read at least n characters, but not enough to
fill the entire buffer). The pictures below show the relative locations
of input positions in buffer before and after YYFILL invocation
(YYMAXFILL padding on the second picture is marked with # symbols).
1738
  <-- shift -->                 <-- need -->
>-A------------B---------C-----D-------E---F--------G->
buffer       token    marker cursor  limit

>-A------------B---------C-----D-------E---F--------G->
             buffer,  marker cursor               limit
             token

  <-- shift -->                 <-- need -->
>-A------------B---------C-----D-------E-F (EOF)
buffer       token    marker cursor  limit

>-A------------B---------C-----D-------E-F###############
             buffer,  marker cursor                 limit
             token                        <- YYMAXFILL ->
1754
Here is an example of a program that creates a file named input, reads
it back in chunks of 4096 bytes and uses bounds-checking with padding.
1757
1758 // re2c $INPUT -o $OUTPUT
1759 #include <assert.h>
1760 #include <stdio.h>
1761 #include <string.h>
1762
1763 /*!max:re2c*/
1764 #define SIZE 4096
1765
1766 typedef struct {
1767 FILE *file;
1768 char buf[SIZE + YYMAXFILL], *lim, *cur, *mar, *tok;
1769 int eof;
1770 } Input;
1771
1772 static int fill(Input *in, size_t need)
1773 {
1774 if (in->eof) {
1775 return 1;
1776 }
1777 const size_t free = in->tok - in->buf;
1778 if (free < need) {
1779 return 2;
1780 }
1781 memmove(in->buf, in->tok, in->lim - in->tok);
1782 in->lim -= free;
1783 in->cur -= free;
1784 in->mar -= free;
1785 in->tok -= free;
1786 in->lim += fread(in->lim, 1, free, in->file);
1787 if (in->lim < in->buf + SIZE) {
1788 in->eof = 1;
1789 memset(in->lim, 0, YYMAXFILL);
1790 in->lim += YYMAXFILL;
1791 }
1792 return 0;
1793 }
1794
1795 static void init(Input *in, FILE *file)
1796 {
1797 in->file = file;
1798 in->cur = in->mar = in->tok = in->lim = in->buf + SIZE;
1799 in->eof = 0;
1800 fill(in, 1);
1801 }
1802
1803 static int lex(Input *in)
1804 {
1805 int count = 0;
1806 loop:
1807 in->tok = in->cur;
1808 /*!re2c
1809 re2c:api:style = free-form;
1810 re2c:define:YYCTYPE = char;
1811 re2c:define:YYCURSOR = in->cur;
1812 re2c:define:YYMARKER = in->mar;
1813 re2c:define:YYLIMIT = in->lim;
1814 re2c:define:YYFILL = "if (fill(in, @@) != 0) return -1;";
1815
1816 * { return -1; }
1817 [\x00] { return (YYMAXFILL == in->lim - in->tok) ? count : -1; }
1818 ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
1819 [ ]+ { goto loop; }
1820
1821 */
1822 }
1823
1824 int main()
1825 {
1826 const char *fname = "input";
1827 const char str[] = "'qu\0tes' 'are' 'fine: \\'' ";
1828 FILE *f;
1829 Input in;
1830
1831 // prepare input file: a few times the size of the buffer,
1832 // containing strings with zeroes and escaped quotes
1833 f = fopen(fname, "w");
1834 for (int i = 0; i < SIZE; ++i) {
1835 fwrite(str, 1, sizeof(str) - 1, f);
1836 }
1837 fclose(f);
1838
1839 f = fopen(fname, "r");
1840 init(&in, f);
1841 assert(lex(&in) == SIZE * 3);
1842 fclose(f);
1843
1844 remove(fname);
1845 return 0;
1846 }
1847
1848
1850 Re2c allows one to include other files using directive /*!include:re2c
1851 FILE */, where FILE is the name of file to be included. Re2c looks for
1852 included files in the directory of the including file and in include
1853 locations, which can be specified with -I option. Re2c include direc‐
1854 tive works in the same way as C/C++ #include: the contents of FILE are
1855 copy-pasted verbatim in place of the directive. Include files may have
1856 further includes of their own. Re2c provides some predefined include
1857 files that can be found in the include/ subdirectory of the project.
1858 These files contain definitions that can be useful to other projects
1859 (such as Unicode categories) and form something like a standard library
1860 for re2c. Here is an example:
1861
1862 Include file (definitions.h)
1863 typedef enum { OK, FAIL } Result;
1864
1865 /*!re2c
1866 number = [1-9][0-9]*;
1867 */
1868
1869
1870 Input file
1871 // re2c $INPUT -o $OUTPUT -i
1872 #include <assert.h>
1873 /*!include:re2c "definitions.h" */
1874
1875 Result lex(const char *YYCURSOR)
1876 {
1877 /*!re2c
1878 re2c:define:YYCTYPE = char;
1879 re2c:yyfill:enable = 0;
1880
1881 number { return OK; }
1882 * { return FAIL; }
1883 */
1884 }
1885
1886 int main()
1887 {
1888 assert(lex("123") == OK);
1889 return 0;
1890 }
1891
1892
Re2c allows one to generate a header file from the input .re file using
1895 option -t, --type-header or configuration re2c:flags:type-header and
1896 directives /*!header:re2c:on*/ and /*!header:re2c:off*/. The first
1897 directive marks the beginning of header file, and the second directive
1898 marks the end of it. Everything between these directives is processed
1899 by re2c, and the generated code is written to the file specified by the
1900 -t --type-header option (or stdout if this option was not used). Auto‐
1901 generated header file may be needed in cases when re2c is used to gen‐
1902 erate definitions of constants, variables and structs that must be vis‐
1903 ible from other translation units.
1904
Here is an example of generating a header file that contains definition
of the lexer state with tag variables (the number of variables depends
on the regular grammar and is unknown to the programmer).
1908
1909 Input file
1910 // re2c $INPUT -o $OUTPUT -i --type-header src/lexer/lexer.h
1911 #include <assert.h>
1912 #include "src/lexer/lexer.h" // generated by re2c
1913
1914 /*!header:re2c:on*/
1915
1916 typedef struct {
1917 const char *str, *cur, *mar;
1918 /*!stags:re2c format = "const char *@@{tag}; "; */
1919 } LexerState;
1920
1921 /*!header:re2c:off*/
1922
1923 int lex(LexerState *st)
1924 {
1925 /*!re2c
1926 re2c:flags:type-header = "src/lexer/lexer.h";
1927 re2c:yyfill:enable = 0;
1928 re2c:flags:tags = 1;
1929 re2c:define:YYCTYPE = char;
1930 re2c:define:YYCURSOR = "st->cur";
1931 re2c:define:YYMARKER = "st->mar";
1932 re2c:tags:expression = "st->@@{tag}";
1933
1934 [x]{1,4} / [x]{3,5} { return 0; } // ambiguous trailing context
1935 * { return 1; }
1936 */
1937 }
1938
1939 int main()
1940 {
1941 LexerState st;
1942 st.str = st.cur = "xxxxxxxx";
1943 assert(lex(&st) == 0 && st.cur - st.str == 4);
1944 return 0;
1945 }
1946
1947
1948 Header file
1949 /* Generated by re2c */
1950
1951
1952 typedef struct {
1953 const char *str, *cur, *mar;
1954 const char *yyt1; const char *yyt2; const char *yyt3;
1955 } LexerState;
1956
1957
1958
1960 Re2c has two options for submatch extraction.
1961
1962 The first option is -T --tags. With this option one can use standalone
1963 tags of the form @stag and #mtag, where stag and mtag are arbitrary
user-defined names. Tags can be used anywhere inside of a regular
1965 expression; semantically they are just position markers. Tags of the
1966 form @stag are called s-tags: they denote a single submatch value (the
1967 last input position where this tag matched). Tags of the form #mtag are
1968 called m-tags: they denote multiple submatch values (the whole history
1969 of repetitions of this tag). All tags should be defined by the user as
1970 variables with the corresponding names. With standalone tags re2c uses
1971 leftmost greedy disambiguation: submatch positions correspond to the
1972 leftmost matching path through the regular expression.
1973
1974 The second option is -P --posix-captures: it enables POSIX-compliant
1975 capturing groups. In this mode parentheses in regular expressions
1976 denote the beginning and the end of capturing groups; the whole regular
1977 expression is group number zero. The number of groups for the matching
1978 rule is stored in a variable yynmatch, and submatch results are stored
1979 in yypmatch array. Both yynmatch and yypmatch should be defined by the
1980 user, and yypmatch size must be at least [yynmatch * 2]. Re2c provides
1981 a directive /*!maxnmatch:re2c*/ that defines YYMAXNMATCH: a constant
1982 equal to the maximal value of yynmatch among all rules. Note that re2c
1983 implements POSIX-compliant disambiguation: each subexpression matches
1984 as long as possible, and subexpressions that start earlier in regular
1985 expression have priority over those starting later. Capturing groups
1986 are translated into s-tags under the hood, therefore we use the word
1987 "tag" to describe them as well.
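
The layout of yypmatch can be illustrated in plain C. This is a minimal
sketch, assuming that the start of group i is stored in yypmatch[2 * i]
and its end in yypmatch[2 * i + 1] (the indexing used by the POSIX
capture example in this section). The helper name group_copy is
hypothetical, and the array is meant to be filled in by a lexer or, for
experiments, by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the text of capturing group `i` into `out` as a NUL-terminated
 * string. Returns the length of the group, or -1 if the group does not
 * exist, has no value, or does not fit into `out`.
 * `group_copy` is a hypothetical helper, not a part of re2c. */
static int group_copy(const char **yypmatch, size_t yynmatch, size_t i,
                      char *out, size_t outsize)
{
    assert(yypmatch != NULL && out != NULL);
    if (i >= yynmatch) return -1;

    /* Group i occupies two consecutive slots: start and end. */
    const char *s = yypmatch[2 * i], *e = yypmatch[2 * i + 1];
    if (s == NULL || e == NULL || e < s) return -1;

    const size_t len = (size_t)(e - s);
    if (len + 1 > outsize) return -1;

    memcpy(out, s, len);
    out[len] = '\0';
    return (int)len;
}
```

Group zero covers the whole match, so group_copy(yypmatch, yynmatch, 0,
...) yields the full lexeme, and the array needs yynmatch * 2 slots.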
1988
With both -P --posix-captures and -T --tags options re2c uses an efficient
1990 submatch extraction algorithm described in the Tagged Deterministic
1991 Finite Automata with Lookahead paper. The overhead on submatch extrac‐
1992 tion in the generated lexer grows with the number of tags --- if this
1993 number is moderate, the overhead is barely noticeable. In the lexer
1994 tags are implemented using a number of tag variables generated by re2c.
1995 There is no one-to-one correspondence between tag variables and tags: a
1996 single variable may be reused for different tags, and one tag may
1997 require multiple variables to hold all its ambiguous values. Eventually
1998 ambiguity is resolved, and only one final variable per tag survives.
1999 When a rule matches, all its tags are set to the values of the corre‐
2000 sponding tag variables. The exact number of tag variables is unknown
2001 to the user; this number is determined by re2c. However, tag variables
2002 should be defined by the user as a part of the lexer state and updated
2003 by YYFILL, therefore re2c provides directives /*!stags:re2c*/ and
2004 /*!mtags:re2c*/ that can be used to declare, initialize and manipulate
2005 tag variables. These directives have two optional configurations: for‐
2006 mat = "@@"; (specifies the template where @@ is substituted with the
2007 name of each tag variable), and separator = ""; (specifies the piece of
2008 code used to join the generated pieces for different tag variables).
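
Since tag variables hold input positions, YYFILL has to shift them
together with cursor and token when it moves buffer contents. The
sketch below shows one way this could look in plain C. The struct, the
field names yyt1 and yyt2 and the function shift_tags are assumptions
made for illustration: in a real lexer the number of tag variables is
not known in advance, so both the field list and the shifting
statements would come from the /*!stags:re2c*/ directive with a
suitable format.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lexer state with two s-tag variables. */
typedef struct {
    char buf[16];
    char *cur, *tok;
    char *yyt1, *yyt2; /* NULL means "no value" */
} State;

/* Adjust all positions after the buffer contents were moved left by
 * `shift` bytes. Tag variables that hold the default value (NULL) are
 * left untouched. */
static void shift_tags(State *st, size_t shift)
{
    assert(shift <= (size_t)(st->tok - st->buf));
    st->cur -= shift;
    st->tok -= shift;
    if (st->yyt1) st->yyt1 -= shift;
    if (st->yyt2) st->yyt2 -= shift;
}
```

The NULL check matters: a tag variable that has not matched yet must
keep its default value instead of being turned into a bogus pointer.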
2009
2010 S-tags support the following operations:
2011
2012 · save input position to an s-tag: t = YYCURSOR with default API or a
2013 user-defined operation YYSTAGP(t) with generic API
2014
2015 · save default value to an s-tag: t = NULL with default API or a
2016 user-defined operation YYSTAGN(t) with generic API
2017
2018 · copy one s-tag to another: t1 = t2
2019
2020 M-tags support the following operations:
2021
· append input position to an m-tag: a user-defined operation
YYMTAGP(t) with both default and generic API
2024
2025 · append default value to an m-tag: a user-defined operation YYMTAGN(t)
2026 with both default and generic API
2027
2028 · copy one m-tag to another: t1 = t2
2029
2030 S-tags can be implemented as scalar values (pointers or offsets).
2031 M-tags need a more complex representation, as they need to store a
2032 sequence of tag values. The most naive and inefficient representation
2033 of an m-tag is a list (array, vector) of tag values; a more efficient
2034 representation is to store all m-tags in a prefix-tree represented as
2035 array of nodes (v, p), where v is tag value and p is a pointer to par‐
2036 ent node.
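
The prefix-tree representation can be sketched in plain C as follows.
This is an illustration under stated assumptions, not the layout
required by re2c: the names MtagNode, MtagTree, mtag_append and
mtag_history are hypothetical, the tree has a fixed capacity, and an
m-tag is represented by the index of its last node (MTAG_ROOT for an
empty history).

```c
#include <assert.h>
#include <stddef.h>

#define MTAG_ROOT (-1)
#define MTAG_MAX 64

/* Node (v, p): v is a tag value, p is the index of the parent node. */
typedef struct { const char *v; int p; } MtagNode;
typedef struct { MtagNode nodes[MTAG_MAX]; int size; } MtagTree;

/* Append value `v` to the history of the m-tag `*t`. */
static void mtag_append(MtagTree *tree, int *t, const char *v)
{
    assert(tree->size < MTAG_MAX);
    tree->nodes[tree->size].v = v;
    tree->nodes[tree->size].p = *t;
    *t = tree->size++;
}

/* Write the history of m-tag `t` to `out`, oldest value first.
 * Returns the number of values, or -1 if there are more than `max`. */
static int mtag_history(const MtagTree *tree, int t, const char **out, int max)
{
    int n = 0;
    for (int i = t; i != MTAG_ROOT; i = tree->nodes[i].p) ++n;
    if (n > max) return -1;
    int k = n;
    for (int i = t; i != MTAG_ROOT; i = tree->nodes[i].p) {
        out[--k] = tree->nodes[i].v;
    }
    return n;
}
```

With this representation copying one m-tag to another is a plain
integer assignment, and the copies share their common history instead
of duplicating it.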
2037
2038 Here is an example of using s-tags to parse an IPv4 address.
2039
2040 // re2c $INPUT -o $OUTPUT
2041 #include <assert.h>
2042 #include <stdint.h>
2043
2044 static uint32_t num(const char *s, const char *e)
2045 {
2046 uint32_t n = 0;
2047 for (; s < e; ++s) n = n * 10 + (*s - '0');
2048 return n;
2049 }
2050
2051 static const uint64_t ERROR = ~0lu;
2052
2053 static uint64_t lex(const char *YYCURSOR)
2054 {
2055 const char *YYMARKER, *o1, *o2, *o3, *o4;
2056 /*!stags:re2c format = 'const char *@@;'; */
2057
2058 /*!re2c
2059 re2c:yyfill:enable = 0;
2060 re2c:flags:tags = 1;
2061 re2c:define:YYCTYPE = char;
2062
2063 octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
2064 dot = [.];
2065 end = [\x00];
2066
2067 @o1 octet dot @o2 octet dot @o3 octet dot @o4 octet end {
2068 return num(o4, YYCURSOR - 1)
2069 + (num(o3, o4 - 1) << 8)
2070 + (num(o2, o3 - 1) << 16)
2071 + (num(o1, o2 - 1) << 24);
2072 }
2073 * { return ERROR; }
2074 */
2075 }
2076
2077 int main()
2078 {
2079 assert(lex("1.2.3.4") == 0x01020304);
2080 assert(lex("127.0.0.1") == 0x7f000001);
2081 assert(lex("255.255.255.255") == 0xffffffff);
2082 assert(lex("1.2.3.") == ERROR);
2083 assert(lex("1.2.3.256") == ERROR);
2084 return 0;
2085 }
2086
2087
2088 Here is an example of using POSIX capturing groups to parse an IPv4
2089 address.
2090
2091 // re2c $INPUT -o $OUTPUT
2092 #include <assert.h>
2093 #include <stdint.h>
2094
2095 static uint32_t num(const char *s, const char *e)
2096 {
2097 uint32_t n = 0;
2098 for (; s < e; ++s) n = n * 10 + (*s - '0');
2099 return n;
2100 }
2101
2102 /*!maxnmatch:re2c*/
2103 static const uint64_t ERROR = ~0lu;
2104
2105 static uint64_t lex(const char *YYCURSOR)
2106 {
2107 const char *YYMARKER;
2108 const char *yypmatch[YYMAXNMATCH * 2];
2109 uint32_t yynmatch;
2110 /*!stags:re2c format = 'const char *@@;'; */
2111
2112 /*!re2c
2113 re2c:yyfill:enable = 0;
2114 re2c:flags:posix-captures = 1;
2115 re2c:define:YYCTYPE = char;
2116
2117 octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
2118 dot = [.];
2119 end = [\x00];
2120
2121 (octet) dot (octet) dot (octet) dot (octet) end {
2122 assert(yynmatch == 5);
2123 return num(yypmatch[8], yypmatch[9])
2124 + (num(yypmatch[6], yypmatch[7]) << 8)
2125 + (num(yypmatch[4], yypmatch[5]) << 16)
2126 + (num(yypmatch[2], yypmatch[3]) << 24);
2127 }
2128 * { return ERROR; }
2129 */
2130 }
2131
2132 int main()
2133 {
2134 assert(lex("1.2.3.4") == 0x01020304);
2135 assert(lex("127.0.0.1") == 0x7f000001);
2136 assert(lex("255.255.255.255") == 0xffffffff);
2137 assert(lex("1.2.3.") == ERROR);
2138 assert(lex("1.2.3.256") == ERROR);
2139 return 0;
2140 }
2141
2142
2143 Here is an example of using m-tags to parse a semicolon-separated
2144 sequence of words (C++). Tag variables are stored in a tree that is
2145 packed in a vector.
2146
2147 // re2c $INPUT -o $OUTPUT
2148 #include <assert.h>
2149 #include <vector>
2150 #include <string>
2151
2152 static const int ROOT = -1;
2153
2154 struct Mtag {
2155 int pred;
2156 const char *tag;
2157 };
2158
2159 typedef std::vector<Mtag> MtagTree;
2160 typedef std::vector<std::string> Words;
2161
2162 static void mtag(int *pt, const char *t, MtagTree *tree)
2163 {
2164 Mtag m = {*pt, t};
2165 *pt = (int)tree->size();
2166 tree->push_back(m);
2167 }
2168
2169 static void unfold(const MtagTree &tree, int x, int y, Words &words)
2170 {
2171 if (x == ROOT) return;
2172 unfold(tree, tree[x].pred, tree[y].pred, words);
2173 const char *px = tree[x].tag, *py = tree[y].tag;
2174 words.push_back(std::string(px, py - px));
2175 }
2176
2177 #define YYMTAGP(t) mtag(&t, YYCURSOR, &tree)
2178 #define YYMTAGN(t) mtag(&t, NULL, &tree)
2179 static bool lex(const char *YYCURSOR, Words &words)
2180 {
2181 const char *YYMARKER;
2182 /*!mtags:re2c format = "int @@ = ROOT;"; */
2183 MtagTree tree;
2184 int x, y;
2185
2186 /*!re2c
2187 re2c:yyfill:enable = 0;
2188 re2c:flags:tags = 1;
2189 re2c:define:YYCTYPE = char;
2190
2191 (#x [a-z]+ #y [;])+ {
2192 words.clear();
2193 unfold(tree, x, y, words);
2194 return true;
2195 }
2196 * { return false; }
2197 */
2198 }
2199
2200 int main()
2201 {
2202 Words w;
2203 assert(lex("one;two;three;", w) && w == Words({"one", "two", "three"}));
2204 return 0;
2205 }
2206
2207
2209 With -f --storable-state option re2c generates a lexer that can store
2210 its current state, return to the caller, and later resume operations
2211 exactly where it left off. The default mode of operation in re2c is a
2212 "pull" model, in which the lexer "pulls" more input whenever it needs
2213 it. This may be unacceptable in cases when the input becomes available
2214 piece by piece (for example, if the lexer is invoked by the parser, or
2215 if the lexer program communicates via a socket protocol with some other
2216 program that must wait for a reply from the lexer before it transmits
2217 the next message). Storable state feature is intended exactly for such
2218 cases: it allows one to generate lexers that work in a "push" model.
2219 When the lexer needs more input, it stores its state and returns to the
2220 caller. Later, when more input becomes available, the caller resumes
2221 the lexer exactly where it stopped. There are a few changes necessary
2222 compared to the "pull" model:
2223
· Define YYGETSTATE() and YYSETSTATE(state) primitives.
2225
2226 · Define yych, yyaccept and state variables as a part of persistent
2227 lexer state. The state variable should be initialized to -1.
2228
2229 · YYFILL should return to the outer program instead of trying to supply
2230 more input. Return code should indicate that lexer needs more input.
2231
2232 · The outer program should recognize situations when lexer needs more
2233 input and respond appropriately.
2234
2235 · Use /*!getstate:re2c*/ directive if it is necessary to execute any
2236 code before entering the lexer.
2237
2238 · Use configurations state:abort and state:nextlabel to further tweak
2239 the generated code.
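
The idea behind these changes can be shown without re2c by a small
hand-written state machine. This is only an analogy, not the code that
re2c generates: the names Lexer, lexer_init, lexer_feed and lexer_run
are hypothetical, and the convention that a NUL byte at a packet
boundary ends the input is an assumption of this sketch. It keeps a
persistent state variable initialized to -1 and returns to the caller
whenever it runs out of input.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DONE, WAITING, BAD } Status;

typedef struct {
    const char *cur, *lim; /* unread part of the current chunk */
    int state;             /* -1 before the first run, then the resume point */
    unsigned count;        /* packets of the form [a-z]+; seen so far */
} Lexer;

static void lexer_init(Lexer *lx)
{
    lx->cur = lx->lim = NULL;
    lx->state = -1;
    lx->count = 0;
}

/* The caller pushes a new chunk once the previous one is consumed. */
static void lexer_feed(Lexer *lx, const char *chunk, size_t len)
{
    assert(lx->cur == lx->lim);
    lx->cur = chunk;
    lx->lim = chunk + len;
}

/* Run until the input is exhausted (WAITING), a NUL byte is found at a
 * packet boundary (DONE) or the input is ill-formed (BAD). */
static Status lexer_run(Lexer *lx)
{
    if (lx->state == -1) lx->state = 0; /* first entry */
    for (;;) {
        /* The state is already stored, so it is safe to return. */
        if (lx->cur == lx->lim) return WAITING;
        const char c = *lx->cur++;
        switch (lx->state) {
        case 0: /* at the start of a packet */
            if (c == '\0') return DONE;
            if (c < 'a' || c > 'z') return BAD;
            lx->state = 1;
            break;
        case 1: /* inside a word */
            if (c == ';') { ++lx->count; lx->state = 0; }
            else if (c < 'a' || c > 'z') return BAD;
            break;
        default:
            return BAD;
        }
    }
}
```

A packet may be split across chunks at any point: the caller reacts to
WAITING by calling lexer_feed with the next chunk and then lexer_run
again, and the lexer continues from the stored state.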
2240
Here is an example of a "push"-model lexer that receives input in
packets and expects every packet to be a word of lowercase letters
terminated by a semicolon. The test program plays the role of the
sender: packets are written to a temporary file, and the lexer reads
them back through a buffer of 10 bytes. When the lexer needs more
input, it saves its state and returns WAITING to the caller. The caller
then sends the next packet (if there is one), calls fill to shift the
buffer contents and read new data, and invokes the lexer again. The
lexer terminates successfully with END when EOF rule matches: this
happens only in case the lexer fails to read any input while being in
the initial state. Abnormal termination happens in case of an
ill-formed packet (BAD_PACKET) or in case the buffer is too small to
hold a packet (BIG_PACKET).
2261
2262 Example
2263 // re2c $INPUT -o $OUTPUT -f
2264 #include <assert.h>
2265 #include <stdio.h>
2266 #include <string.h>
2267
2268 #define DEBUG 0
2269 #define LOG(...) if (DEBUG) fprintf(stderr, __VA_ARGS__);
2270 #define BUFSIZE 10
2271
2272 typedef struct {
2273 FILE *file;
2274 char buf[BUFSIZE + 1], *lim, *cur, *mar, *tok;
2275 unsigned yyaccept;
2276 int state;
2277 } Input;
2278
2279 static void init(Input *in, FILE *f)
2280 {
2281 in->file = f;
2282 in->cur = in->mar = in->tok = in->lim = in->buf + BUFSIZE;
2283 in->lim[0] = 0; // append sentinel symbol
2284 in->yyaccept = 0;
2285 in->state = -1;
2286 }
2287
2288 typedef enum {END, READY, WAITING, BAD_PACKET, BIG_PACKET} Status;
2289
2290 static Status fill(Input *in)
2291 {
2292 const size_t shift = in->tok - in->buf;
2293 const size_t free = BUFSIZE - (in->lim - in->tok);
2294
2295 if (free < 1) return BIG_PACKET;
2296
2297 memmove(in->buf, in->tok, BUFSIZE - shift);
2298 in->lim -= shift;
2299 in->cur -= shift;
2300 in->mar -= shift;
2301 in->tok -= shift;
2302
2303 const size_t read = fread(in->lim, 1, free, in->file);
2304 in->lim += read;
2305 in->lim[0] = 0; // append sentinel symbol
2306
2307 return READY;
2308 }
2309
2310 static Status lex(Input *in, unsigned int *recv)
2311 {
2312 char yych;
2313 /*!getstate:re2c*/
2314 loop:
2315 in->tok = in->cur;
2316 /*!re2c
2317 re2c:eof = 0;
2318 re2c:api:style = free-form;
2319 re2c:define:YYCTYPE = "char";
2320 re2c:define:YYCURSOR = "in->cur";
2321 re2c:define:YYMARKER = "in->mar";
2322 re2c:define:YYLIMIT = "in->lim";
2323 re2c:define:YYGETSTATE = "in->state";
2324 re2c:define:YYSETSTATE = "in->state = @@;";
2325 re2c:define:YYFILL = "return WAITING;";
2326
2327 packet = [a-z]+[;];
2328
2329 * { return BAD_PACKET; }
2330 $ { return END; }
2331 packet { *recv = *recv + 1; goto loop; }
2332 */
2333 }
2334
2335 void test(const char **packets, Status status)
2336 {
2337 const char *fname = "pipe";
2338 FILE *fw = fopen(fname, "w");
2339 FILE *fr = fopen(fname, "r");
2340 setvbuf(fw, NULL, _IONBF, 0);
2341 setvbuf(fr, NULL, _IONBF, 0);
2342
2343 Input in;
2344 init(&in, fr);
2345 Status st;
2346 unsigned int send = 0, recv = 0;
2347
2348 for (;;) {
2349 st = lex(&in, &recv);
2350 if (st == END) {
2351 LOG("done: got %u packets\n", recv);
2352 break;
2353 } else if (st == WAITING) {
2354 LOG("waiting...\n");
2355 if (*packets) {
2356 LOG("sent packet %u\n", send);
2357 fprintf(fw, "%s", *packets++);
2358 ++send;
2359 }
2360 st = fill(&in);
2361 LOG("queue: '%s'\n", in.buf);
2362 if (st == BIG_PACKET) {
2363 LOG("error: packet too big\n");
2364 break;
2365 }
2366 assert(st == READY);
2367 } else {
2368 assert(st == BAD_PACKET);
2369 LOG("error: ill-formed packet\n");
2370 break;
2371 }
2372 }
2373
2374 LOG("\n");
2375 assert(st == status);
2376 if (st == END) assert(recv == send);
2377
2378 fclose(fw);
2379 fclose(fr);
2380 remove(fname);
2381 }
2382
2383 int main()
2384 {
2385 const char *packets1[] = {0};
2386 const char *packets2[] = {"zero;", "one;", "two;", "three;", "four;", 0};
2387 const char *packets3[] = {"zer0;", 0};
2388 const char *packets4[] = {"goooooooooogle;", 0};
2389
2390 test(packets1, END);
2391 test(packets2, END);
2392 test(packets3, BAD_PACKET);
2393 test(packets4, BIG_PACKET);
2394
2395 return 0;
2396 }
2397
2398
2400 Reuse mode is enabled with the -r --reusable option. In this mode re2c
2401 allows one to reuse definitions, configurations and rules specified by
a /*!rules:re2c*/ block in subsequent /*!use:re2c*/ blocks. As of
re2c-1.2 it is possible to mix such blocks with normal /*!re2c*/
blocks; prior to that re2c expected a single rules-block followed by
use-blocks (normal blocks were disallowed). Use-blocks can have
additional definitions, configurations and rules: they are merged with
those specified by the rules-block. A very common use case for -r --reusable
2408 option is a lexer that supports multiple input encodings: lexer rules
2409 are defined once and reused multiple times with encoding-specific con‐
2410 figurations, such as re2c:flags:utf-8.
2411
Below is an example of a multi-encoding lexer: it reads a phrase with
Unicode math symbols and accepts input either in UTF-8 or in UTF-32.
Note that the --input-encoding utf8 option allows us to write
UTF-8 encoded symbols in the regular expressions; without this option
re2c would parse them as a plain ASCII byte sequence (and we would have
to use hexadecimal escape sequences).
2418
2419 Example
2420 // re2c $INPUT -o $OUTPUT -r --input-encoding utf8
2421 #include <assert.h>
2422 #include <stdint.h>
2423
2424 /*!rules:re2c
2425 re2c:yyfill:enable = 0;
2426
2427 "∀x ∃y: p(x, y)" { return 0; }
2428 * { return 1; }
2429 */
2430
2431 static int lex_utf8(const uint8_t *YYCURSOR)
2432 {
2433 const uint8_t *YYMARKER;
2434 /*!use:re2c
2435 re2c:define:YYCTYPE = uint8_t;
2436 re2c:flags:8 = 1;
2437 */
2438 }
2439
2440 static int lex_utf32(const uint32_t *YYCURSOR)
2441 {
2442 const uint32_t *YYMARKER;
2443 /*!use:re2c
2444 re2c:define:YYCTYPE = uint32_t;
2445 re2c:flags:8 = 0;
2446 re2c:flags:u = 1;
2447 */
2448 }
2449
2450 int main()
2451 {
2452 static const uint8_t s8[] = // UTF-8
2453 { 0xe2, 0x88, 0x80, 0x78, 0x20, 0xe2, 0x88, 0x83, 0x79
2454 , 0x3a, 0x20, 0x70, 0x28, 0x78, 0x2c, 0x20, 0x79, 0x29 };
2455
2456 static const uint32_t s32[] = // UTF32
2457 { 0x00002200, 0x00000078, 0x00000020, 0x00002203
2458 , 0x00000079, 0x0000003a, 0x00000020, 0x00000070
2459 , 0x00000028, 0x00000078, 0x0000002c, 0x00000020
2460 , 0x00000079, 0x00000029 };
2461
2462 assert(lex_utf8(s8) == 0);
2463 assert(lex_utf32(s32) == 0);
2464 return 0;
2465 }
2466
2467
2468
2470 re2c supports the following encodings: ASCII (default), EBCDIC (-e),
2471 UCS-2 (-w), UTF-16 (-x), UTF-32 (-u) and UTF-8 (-8). See also inplace
2472 configuration re2c:flags.
2473
2474 The following concepts should be clarified when talking about encod‐
2475 ings. A code point is an abstract number that represents a single sym‐
2476 bol. A code unit is the smallest unit of memory, which is used in the
2477 encoded text (it corresponds to one character in the input stream). One
2478 or more code units may be needed to represent a single code point,
2479 depending on the encoding. In a fixed-length encoding, each code point
2480 is represented with an equal number of code units. In variable-length
2481 encodings, different code points can be represented with different num‐
2482 ber of code units.
2483
2484 · ASCII is a fixed-length encoding. Its code space includes 0x100 code
2485 points, from 0 to 0xFF. A code point is represented with exactly one
2486 1-byte code unit, which has the same value as the code point. The
2487 size of YYCTYPE must be 1 byte.
2488
2489 · EBCDIC is a fixed-length encoding. Its code space includes 0x100 code
2490 points, from 0 to 0xFF. A code point is represented with exactly one
2491 1-byte code unit, which has the same value as the code point. The
2492 size of YYCTYPE must be 1 byte.
2493
2494 · UCS-2 is a fixed-length encoding. Its code space includes 0x10000
2495 code points, from 0 to 0xFFFF. One code point is represented with
2496 exactly one 2-byte code unit, which has the same value as the code
2497 point. The size of YYCTYPE must be 2 bytes.
2498
2499 · UTF-16 is a variable-length encoding. Its code space includes all
2500 Unicode code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF.
2501 One code point is represented with one or two 2-byte code units. The
2502 size of YYCTYPE must be 2 bytes.
2503
2504 · UTF-32 is a fixed-length encoding. Its code space includes all Uni‐
2505 code code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One
2506 code point is represented with exactly one 4-byte code unit. The size
2507 of YYCTYPE must be 4 bytes.
2508
2509 · UTF-8 is a variable-length encoding. Its code space includes all Uni‐
2510 code code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One
2511 code point is represented with a sequence of one, two, three, or four
2512 1-byte code units. The size of YYCTYPE must be 1 byte.
2513
2514 In Unicode, values from range 0xD800 to 0xDFFF (surrogates) are not
2515 valid Unicode code points. Any encoded sequence of code units that
2516 would map to Unicode code points in the range 0xD800-0xDFFF, is
2517 ill-formed. The user can control how re2c treats such ill-formed
2518 sequences with the --encoding-policy <policy> switch.
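
The relation between code points and code units can be made concrete
with a pair of small encoders. This is a sketch in plain C that is
independent of re2c; the function names utf8_encode and utf16_encode
are hypothetical. Both functions reject surrogates and values above
0x10FFFF, which corresponds to the code space described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode code point `cp` into 1-byte code units (`out` must have room
 * for 4). Returns the number of code units, or 0 if `cp` is not a
 * valid Unicode code point. */
static int utf8_encode(uint32_t cp, uint8_t *out)
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates */
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Encode code point `cp` into 2-byte code units (`out` must have room
 * for 2). Returns the number of code units, or 0 on invalid `cp`. */
static int utf16_encode(uint32_t cp, uint16_t *out)
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {
        cp -= 0x10000;
        out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
        return 2;
    }
    return 0;
}
```

For example, code point 0x2200 takes three code units in UTF-8 and a
single code unit in UTF-16, while in UTF-32 every code point takes
exactly one code unit.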
2519
2520 For some encodings, there are code units that never occur in a valid
2521 encoded stream (e.g., 0xFF byte in UTF-8). If the generated scanner
2522 must check for invalid input, the only correct way to do so is to use
2523 the default rule (*). Note that the full range rule ([^]) won't catch
2524 invalid code units when a variable-length encoding is used ([^] means
2525 "any valid code point", whereas the default rule (*) means "any possi‐
2526 ble code unit").
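
To illustrate what "code units that never occur" means for UTF-8, here
is a small check written in plain C. It is not a part of re2c and the
function names are hypothetical. It only detects bytes that cannot
appear anywhere in well-formed UTF-8 (0xC0, 0xC1 and 0xF5 to 0xFF); it
does not detect truncated or otherwise ill-formed sequences, so it is
not a substitute for the default rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if byte `b` can never occur in well-formed UTF-8. */
static int utf8_never_occurs(uint8_t b)
{
    return b == 0xC0 || b == 0xC1 || b >= 0xF5;
}

/* Returns the index of the first such byte in `s`, or -1 if none. */
static long first_impossible_byte(const uint8_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (utf8_never_occurs(s[i])) return (long)i;
    }
    return -1;
}
```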
2527
2529 Conditions are enabled with -c --conditions. This option allows one to
2530 encode multiple interrelated lexers within the same re2c block.
2531
2532 Each lexer corresponds to a single condition. It starts with a label
2533 of the form yyc_name, where name is condition name and yyc prefix can
2534 be adjusted with configuration re2c:condprefix. Different lexers are
2535 separated with a comment /* *********************************** */
2536 which can be adjusted with configuration re2c:cond:divider.
2537
Furthermore, each condition has a unique identifier of the form
yycname, where name is condition name and yyc prefix can be adjusted
with configuration re2c:condenumprefix. Identifiers have the type
YYCONDTYPE and should be generated with /*!types:re2c*/ directive or -t
--type-header option. Users shouldn't define these identifiers
manually, as the order of conditions is not specified.
2544
2545 Before all conditions re2c generates entry code that checks the current
2546 condition identifier and transfers control flow to the start label of
2547 the active condition. After matching some rule of this condition,
2548 lexer may either transfer control flow back to the entry code (after
2549 executing the associated action and optionally setting another condi‐
2550 tion with =>), or use :=> shortcut and transition directly to the start
2551 label of another condition (skipping the action and the entry code).
2552 Configuration re2c:cond:goto allows one to change the default behavior.
2553
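The control flow described above can be modelled by hand. The following is a simplified sketch, not actual re2c output: the condition names normal and quoted, the function count_quoted and the hand-written enum are all hypothetical (in a real program the enum comes from /*!types:re2c*/). It shows the entry code dispatching on the current condition, a transition that goes back through the entry code after setting a new condition (as with =>), and a direct jump to a start label (as with :=>).

```cpp
#include <cstddef>

// Hypothetical condition identifiers, written by hand only for this sketch.
enum Cond { yycnormal, yycquoted };

// Counts the characters that appear between double quotes.
static std::size_t count_quoted(const char *s)
{
    Cond c = yycnormal;
    std::size_t n = 0;
    for (;;) {
        // Entry code: check the current condition, jump to its start label.
        switch (c) {
        case yycnormal: goto yyc_normal;
        case yycquoted: goto yyc_quoted;
        }
        return n; // not reached while c holds a valid condition
yyc_normal:
        switch (*s) {
        case '\0': return n;
        case '"':  ++s; c = yycquoted; continue; // set condition, re-dispatch
        default:   ++s; continue;                // same condition, re-dispatch
        }
yyc_quoted:
        switch (*s) {
        case '\0': return n;
        case '"':  ++s; c = yycnormal; continue; // set condition, re-dispatch
        default:   ++s; ++n; goto yyc_quoted;    // direct jump, no dispatch
        }
    }
}
```

The direct jump saves one dispatch per character, which is the usual reason to prefer it when the next condition is known statically.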
2554 Syntactically each rule must be preceded with a list of comma-separated
2555 condition names or a wildcard * enclosed in angle brackets < and >.
2556 Wildcard means "any condition" and is semantically equivalent to list‐
2557 ing all condition names. Here regexp is a regular expression, default
2558 refers to the default rule *, and action is a block of code.
2559
2560 · <conditions-or-wildcard> regexp-or-default action
2561
2562 · <conditions-or-wildcard> regexp-or-default => condition action
2563
2564 · <conditions-or-wildcard> regexp-or-default :=> condition
2565
2566 Rules with an exclamation mark ! in front of the condition list have a
2567 special meaning: they have no regular expression, and the associated
2568 action is merged as entry code into the actions of normal rules. This
2569 might be a convenient place to perform a routine task that is common to
2570 all rules.
2571
2572 · <!conditions-or-wildcard> action
2573
2574 Another special form of rules with an empty condition list <> and no
2575 regular expression allows one to specify an "entry condition" that can
2576 be used to execute code before entering the lexer. It is semantically
2577 equivalent to a condition with number zero, name 0 and an empty regular
2578 expression.
2579
2580 · <> action
2581
2582 · <> => condition action
2583
2584 · <> :=> condition
2585
2586 Example
// re2c $INPUT -o $OUTPUT -ci
#include <stdint.h>
#include <limits.h>
#include <assert.h>

static const uint64_t ERROR = ~0lu;
/*!types:re2c*/

template<int BASE> static void adddgt(uint64_t &u, unsigned int d)
{
    u = u * BASE + d;
    if (u > UINT32_MAX) u = ERROR;
}

static uint64_t parse_u32(const char *s)
{
    const char *YYMARKER;
    int c = yycinit;
    uint64_t u = 0;

    /*!re2c
    re2c:yyfill:enable = 0;
    re2c:api:style = free-form;
    re2c:define:YYCTYPE = char;
    re2c:define:YYCURSOR = s;
    re2c:define:YYGETCONDITION = "c";
    re2c:define:YYSETCONDITION = "c = @@;";

    <*> * { return ERROR; }

    <init> '0b' / [01]        :=> bin
    <init> "0"                :=> oct
    <init> "" / [1-9]         :=> dec
    <init> '0x' / [0-9a-fA-F] :=> hex

    <bin, oct, dec, hex> "\x00" { return u; }

    <bin> [01]  { adddgt<2> (u, s[-1] - '0');      goto yyc_bin; }
    <oct> [0-7] { adddgt<8> (u, s[-1] - '0');      goto yyc_oct; }
    <dec> [0-9] { adddgt<10>(u, s[-1] - '0');      goto yyc_dec; }
    <hex> [0-9] { adddgt<16>(u, s[-1] - '0');      goto yyc_hex; }
    <hex> [a-f] { adddgt<16>(u, s[-1] - 'a' + 10); goto yyc_hex; }
    <hex> [A-F] { adddgt<16>(u, s[-1] - 'A' + 10); goto yyc_hex; }
    */
}

int main()
{
    assert(parse_u32("1234567890") == 1234567890);
    assert(parse_u32("0b1101") == 13);
    assert(parse_u32("0x7Fe") == 2046);
    assert(parse_u32("0644") == 420);
    assert(parse_u32("9999999999") == ERROR);
    assert(parse_u32("") == ERROR);
    return 0;
}
2643
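To make it easier to see what the conditions in the example encode, here is a hand-written reference parser in plain C++ that needs no re2c. It is a sketch that is assumed to accept the same inputs as the example, based on a reading of its rules; the names kError, digit_value and parse_u32_ref are hypothetical. The prefix checks play the role of the init condition, and the digit loop plays the role of bin, oct, dec and hex.

```cpp
#include <cstdint>

static const std::uint64_t kError = ~std::uint64_t(0); // stands in for ERROR

// Returns the value of ch as a digit in the given base, or -1 if it is none.
static int digit_value(char ch, int base)
{
    int d;
    if (ch >= '0' && ch <= '9') d = ch - '0';
    else if (ch >= 'a' && ch <= 'f') d = ch - 'a' + 10;
    else if (ch >= 'A' && ch <= 'F') d = ch - 'A' + 10;
    else return -1;
    return d < base ? d : -1;
}

static std::uint64_t parse_u32_ref(const char *s)
{
    int base;
    // Choose the base from the prefix, as the <init> rules do. The prefixes
    // are matched case-insensitively and must be followed by a valid digit.
    if (s[0] == '0' && (s[1] == 'b' || s[1] == 'B') && digit_value(s[2], 2) >= 0) {
        base = 2; s += 2;
    } else if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X') && digit_value(s[2], 16) >= 0) {
        base = 16; s += 2;
    } else if (s[0] == '0') {
        base = 8; s += 1;
    } else if (s[0] >= '1' && s[0] <= '9') {
        base = 10;
    } else {
        return kError; // corresponds to the default rule in <init>
    }

    std::uint64_t u = 0;
    for (; *s != '\0'; ++s) {
        const int d = digit_value(*s, base);
        if (d < 0) return kError;          // not a digit in this base
        u = u * base + d;
        if (u > UINT32_MAX) return kError; // does not fit into 32 bits
    }
    return u;
}
```

A reference function like this can also serve as an oracle when testing the generated lexer on additional inputs.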
2644
2646 With the -S, --skeleton option, re2c ignores all non-re2c code and gen‐
2647 erates a self-contained C program that can be further compiled and exe‐
2648 cuted. The program consists of lexer code and input data. For each con‐
2649 structed DFA (block or condition) re2c generates a standalone lexer and
2650 two files: an .input file with strings derived from the DFA and a .keys
2651 file with expected match results. The program runs each lexer on the
2652 corresponding .input file and compares results with the expectations.
2653 Skeleton programs are very useful for a number of reasons:
2654
2655 · They can check correctness of various re2c optimizations (the data is
2656 generated early in the process, before any DFA transformations have
2657 taken place).
2658
2659 · Generating a set of input data with good coverage may be useful for
2660 both testing and benchmarking.
2661
2662 · Generating self-contained executable programs allows one to get mini‐
2663 mized test cases (the original code may be large or have a lot of
2664 dependencies).
2665
2666 The difficulty with generating input data is that for all but the most
2667 trivial cases the number of possible input strings is too large (even
2668 if the string length is limited). Re2c solves this difficulty by gener‐
2669 ating sufficiently many strings to cover almost all DFA transitions. It
2670 uses the following algorithm. First, it constructs a skeleton of the
2671 DFA. For encodings with 1-byte code unit size (such as ASCII, UTF-8 and
2672 EBCDIC) the skeleton is an exact copy of the original DFA. For encodings
2673 with multibyte code units the skeleton is a copy of the DFA with certain
2674 transitions omitted: namely, re2c takes at most 256 code units for each
2675 disjoint contiguous range that corresponds to a DFA transition. The
2676 chosen values are evenly distributed and include range bounds. Instead
2677 of trying to cover all possible paths in the skeleton (which is infea‐
2678 sible) re2c generates sufficiently many paths to cover all skeleton
2679 transitions, and thus trigger the corresponding conditional jumps in
2680 the lexer. The algorithm implementation is limited to ~1Gb of transi‐
2681 tions and consumes a constant amount of memory (re2c writes data to file
2682 as soon as it is generated).
2683
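The selection of code units for a range can be illustrated with a short sketch. This shows one way to pick at most 256 evenly distributed values that include both range bounds; the formula actually used by re2c may differ, and the function name sample_range and its parameters are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Picks at most max_samples values from the closed range [lo, hi], evenly
// spread and always including both bounds. Assumes lo <= hi, max_samples >= 2.
static std::vector<std::uint32_t> sample_range(std::uint32_t lo, std::uint32_t hi,
                                               std::uint32_t max_samples = 256)
{
    std::vector<std::uint32_t> out;
    const std::uint64_t width = std::uint64_t(hi) - lo + 1;
    if (width <= max_samples) {
        // Small range: take every code unit.
        for (std::uint64_t v = lo; v <= hi; ++v) {
            out.push_back(static_cast<std::uint32_t>(v));
        }
    } else {
        // Large range: interpolate between the bounds in 64-bit arithmetic.
        for (std::uint32_t i = 0; i < max_samples; ++i) {
            const std::uint64_t off = (width - 1) * i / (max_samples - 1);
            out.push_back(lo + static_cast<std::uint32_t>(off));
        }
    }
    return out;
}
```

For a 16-bit range such as 0x0000-0xFFFF this yields 256 values spaced 257 apart, so a single wide transition contributes a bounded number of test inputs.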
2685 With the -D, --emit-dot option, re2c does not generate code. Instead,
2686 it dumps the generated DFA in DOT format. One can convert this dump to
2687 an image of the DFA using Graphviz or another library. Note that this
2688 option shows the final DFA after it has gone through a number of opti‐
2689 mizations and transformations. Earlier stages can be dumped with vari‐
2690 ous debug options, such as --dump-nfa, --dump-dfa-raw etc. (see the
2691 full list of options).
2692
2694 You can find more information about re2c at the official website:
2695 http://re2c.org. Similar programs are flex(1), lex(1) and quex
2696 (http://quex.sourceforge.net).
2697
2699 Re2c was originally written by Peter Bumbulis in 1993. Since then it
2700 has been developed and maintained by multiple volunteers; most notably,
2701 Brian Young, Marcus Boerger, Dan Nuffer and Ulya Trofimovich.
2702