RE2C(1)                                                          RE2C(1)

re2c - compile regular expressions to code

re2c [OPTIONS] INPUT [-o OUTPUT]

re2go [OPTIONS] INPUT [-o OUTPUT]

re2c is a tool for generating fast lexical analyzers for C, C++ and Go.
15
A re2c program consists of normal code intermixed with re2c blocks and
directives. Each re2c block may contain definitions, configurations
and rules.

Definitions are of the form name = regexp; where name is an identifier
that consists of letters, digits and underscores, and regexp is a
regular expression. Regular expressions may contain other definitions,
but recursion is not allowed and each name should be defined before it
is used.

Configurations are of the form re2c:config = value; where config is the
configuration descriptor and value can be a number, a string or a
special word.

Rules consist of a regular expression followed by a semantic action (a
block of code enclosed in curly braces { and }, or a raw one-line piece
of code preceded by := and ended with a newline that is not followed by
whitespace). If the input matches the regular expression, the
associated semantic action is executed. If multiple rules match, the
longest match takes precedence. If multiple rules match the same
string, the earlier rule takes precedence.

There are two special rules: the default rule * and the EOF rule $. The
default rule should always be defined; it has the lowest priority
regardless of its place in the block and matches any code unit (not
necessarily a valid character, see encoding support). The EOF rule
matches the end of input; it should be defined if the corresponding
method for handling the end of input is used. If start conditions are
used, rules have a more complex syntax.

All rules of a single block are compiled into a deterministic
finite-state automaton (DFA) and encoded in the form of a program in
the target language. The generated code interfaces with the outer
program by means of a few user-defined primitives (see the program
interface section). Reusable blocks allow sharing rules, definitions
and configurations between different blocks.
44
Input file
// re2c $INPUT -o $OUTPUT -i
#include <assert.h>                 //
                                    // C/C++ code
int lex(const char *YYCURSOR)       //
{
    /*!re2c                         // start of re2c block
    re2c:define:YYCTYPE = char;     // configuration
    re2c:yyfill:enable = 0;         // configuration
    re2c:flags:case-ranges = 1;     // configuration
                                    //
    ident = [a-zA-Z_][a-zA-Z_0-9]*; // named definition
                                    //
    ident { return 0; }             // normal rule
    *     { return 1; }             // default rule
    */
}                                   //
                                    //
int main()                          //
{                                   // C/C++ code
    assert(lex("_Zer0") == 0);      //
    return 0;                       //
}                                   //
69
70
Output file
/* Generated by re2c */
// re2c $INPUT -o $OUTPUT -i
#include <assert.h>                 //
                                    // C/C++ code
int lex(const char *YYCURSOR)       //
{

{
    char yych;
    yych = *YYCURSOR;
    switch (yych) {
    case 'A' ... 'Z':
    case '_':
    case 'a' ... 'z': goto yy4;
    default: goto yy2;
    }
yy2:
    ++YYCURSOR;
    { return 1; }
yy4:
    yych = *++YYCURSOR;
    switch (yych) {
    case '0' ... '9':
    case 'A' ... 'Z':
    case '_':
    case 'a' ... 'z': goto yy4;
    default: goto yy6;
    }
yy6:
    { return 0; }
}

}                                   //
                                    //
int main()                          //
{                                   // C/C++ code
    assert(lex("_Zer0") == 0);      //
    return 0;                       //
}                                   //
111
112
114 -? -h --help
115 Show help message.
116
117 -1 --single-pass
118 Deprecated. Does nothing (single pass is the default now).
119
120 -8 --utf-8
121 Generate a lexer that reads input in UTF-8 encoding. re2c as‐
122 sumes that character range is 0 -- 0x10FFFF and character size
123 is 1 byte.
124
125 -b --bit-vectors
126 Optimize conditional jumps using bit masks. Implies -s.
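
To illustrate the idea behind bit masks, here is a minimal hand-written
sketch (not actual re2c output). It assumes an ASCII execution character
set; the table name yybm, the bit values and both function names are
chosen for illustration only. Each bit of a table entry answers one
"is this byte in the class?" question, so a range test becomes a single
table lookup and a bitwise AND.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical 256-entry table: bit 0x80 marks bytes that may start an
 * identifier, bit 0x40 marks bytes that may continue one. */
static unsigned char yybm[256];

void init_bitmap(void)
{
    int c;
    memset(yybm, 0, sizeof(yybm));
    for (c = 'a'; c <= 'z'; ++c) yybm[c] |= 0x80 | 0x40;
    for (c = 'A'; c <= 'Z'; ++c) yybm[c] |= 0x80 | 0x40;
    yybm['_'] |= 0x80 | 0x40;
    for (c = '0'; c <= '9'; ++c) yybm[c] |= 0x40;
}

/* Returns the length of the identifier at the start of the
 * NUL-terminated string s, or 0 if s does not start with one. */
size_t ident_length(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    assert(s != NULL);
    if (!(yybm[*p] & 0x80)) return 0;  /* one lookup replaces three range tests */
    ++p;
    while (yybm[*p] & 0x40) ++p;       /* stops at NUL, whose entry is 0 */
    return (size_t)(p - (const unsigned char *)s);
}
```

init_bitmap() must be called once before ident_length() is used.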
127
128 -c --conditions --start-conditions
129 Enable support of Flex-like "conditions": multiple interrelated
130 lexers within one block. Option --start-conditions is a legacy
131 alias; use --conditions instead.
132
133 --case-insensitive
134 Treat single-quoted and double-quoted strings as case-insensi‐
135 tive.
136
137 --case-inverted
138 Invert the meaning of single-quoted and double-quoted strings:
139 treat single-quoted strings as case-sensitive and double-quoted
140 strings as case-insensitive.
141
--case-ranges
Collapse consecutive cases in switch statements into a range of the
form case low ... high:. This syntax is an extension of the C/C++
language, supported by compilers like GCC, Clang and Tcc. The main
advantage over using single cases is smaller generated C code and
faster generation time, although for some compilers like Tcc it also
results in smaller binary size. This option doesn't work for the Go
backend.
150
--depfile FILE
Write dependency information to FILE in the form of a Makefile rule
<output-file> : <input-file> [include-file ...]. This makes it possible
to track build dependencies in the presence of /*!include:re2c*/
directives, so that updating include files triggers regeneration of the
output file. This option requires that the -o --output option is
specified.
158
-e --ecb
Generate a lexer that reads input in EBCDIC encoding. re2c assumes that
character range is 0 -- 0xFF and character size is 1 byte.
163
--empty-class <match-empty | match-none | error>
Define the way re2c treats empty character classes. With match-empty
(the default) empty class matches empty input (which is illogical, but
backwards-compatible). With match-none empty class always fails to
match. With error empty class raises a compilation error.
170
171 --encoding-policy <fail | substitute | ignore>
172 Define the way re2c treats Unicode surrogates. With fail re2c
173 aborts with an error when a surrogate is encountered. With sub‐
174 stitute re2c silently replaces surrogates with the error code
175 point 0xFFFD. With ignore (the default) re2c treats surrogates
176 as normal code points. The Unicode standard says that standalone
177 surrogates are invalid, but real-world libraries and programs
178 behave in different ways.
179
180 -f --storable-state
181 Generate a lexer which can store its inner state. This is use‐
182 ful in push-model lexers which are stopped by an outer program
183 when there is not enough input, and then resumed when more input
184 becomes available. In this mode users should additionally define
185 YYGETSTATE() and YYSETSTATE(state) macros and variables yych,
186 yyaccept and state as part of the lexer state.
187
188 -F --flex-syntax
189 Partial support for Flex syntax: in this mode named definitions
190 don't need the equal sign and the terminating semicolon, and
191 when used they must be surrounded by curly braces. Names without
192 curly braces are treated as double-quoted strings.
193
194 -g --computed-gotos
195 Optimize conditional jumps using non-standard "computed goto"
196 extension (which must be supported by the compiler). re2c gener‐
197 ates jump tables only in complex cases with a lot of conditional
198 branches. Complexity threshold can be configured with
199 cgoto:threshold configuration. This option implies -b. This op‐
200 tion doesn't work for the Go backend.
201
-I PATH
Add PATH to the list of locations which are used when searching for
include files. This option is useful in combination with the
/*!include:re2c ... */ directive. re2c looks for FILE in the directory
of the including file and in the list of include paths specified by the
-I option.
208
209 -i --no-debug-info
210 Do not output #line information. This is useful when the gener‐
211 ated code is tracked by some version control system or IDE.
212
--input <default | custom>
Specify the API used by the generated code to interface with
user-defined code. Option default is the C API based on pointer
arithmetic (it is the default for the C backend). Option custom is the
generic API (it is the default for the Go backend).
218
219 --input-encoding <ascii | utf8>
220 Specify the way re2c parses regular expressions. With ascii
221 (the default) re2c handles input as ASCII-encoded: any sequence
222 of code units is a sequence of standalone 1-byte characters.
223 With utf8 re2c handles input as UTF8-encoded and recognizes
224 multibyte characters.
225
226 --lang <c | go>
227 Specify the output language. Supported languages are C and Go
228 (the default is C).
229
230 --location-format <gnu | msvc>
231 Specify location format in messages. With gnu locations are
232 printed as 'filename:line:column: ...'. With msvc locations are
233 printed as 'filename(line,column) ...'. Default is gnu.
234
235 --no-generation-date
236 Suppress date output in the generated file.
237
238 --no-version
239 Suppress version output in the generated file.
240
241 -o OUTPUT --output=OUTPUT
242 Specify the OUTPUT file.
243
244 -P --posix-captures
245 Enable submatch extraction with POSIX-style capturing groups.
246
247 -r --reusable
248 Allows reuse of re2c rules with /*!rules:re2c */ and /*!use:re2c
249 */ blocks. Exactly one rules-block must be present. The rules
250 are saved and used by every use-block that follows, which may
251 add its own rules and configurations.
252
253 -S --skeleton
254 Ignore user-defined interface code and generate a self-contained
255 "skeleton" program. Additionally, generate input files with
256 strings derived from the regular grammar and compressed match
257 results that are used to verify "skeleton" behavior on all in‐
258 puts. This option is useful for finding bugs in optimizations
259 and code generation. This option doesn't work for the Go back‐
260 end.
261
262 -s --nested-ifs
263 Use nested if statements instead of switch statements in condi‐
264 tional jumps. This usually results in more efficient code with
265 non-optimizing compilers.
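
The difference between the two code shapes can be sketched by hand as
follows. This is not re2c output; the function names and the classified
character sets are made up for illustration, and ASCII is assumed.

```c
#include <assert.h>

/* Returns 1 for a decimal digit, 2 for a sign character, 0 otherwise.
 * Variant with a switch statement. */
int classify_switch(unsigned char c)
{
    switch (c) {
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
        return 1;
    case '+': case '-':
        return 2;
    default:
        return 0;
    }
}

/* Same classification expressed with nested if statements that
 * compare against range boundaries ('/' immediately precedes '0'). */
int classify_ifs(unsigned char c)
{
    if (c <= '/') {
        if (c == '+') return 2;
        if (c == '-') return 2;
        return 0;
    }
    if (c <= '9') return 1;
    return 0;
}

/* Self-check: both variants agree on every byte value. */
void check_classifiers(void)
{
    int c;
    for (c = 0; c < 256; ++c)
        assert(classify_switch((unsigned char)c) == classify_ifs((unsigned char)c));
}
```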
266
267 -T --tags
268 Enable submatch extraction with tags.
269
270 -t HEADER --type-header=HEADER
271 Generate a HEADER file that contains enum with condition names.
272 Requires -c option.
273
274 -u --unicode
275 Generate a lexer that reads UTF32-encoded input. Re2c assumes
276 that character range is 0 -- 0x10FFFF and character size is 4
277 bytes. This option implies -s.
278
279 -V --vernum
280 Show version information in MMmmpp format (major, minor, patch).
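
A build script may want to split the MMmmpp string into its components.
The sketch below assumes two decimal digits per component and that any
trailing newline has already been removed; the struct and the function
name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct re2c_version { int major_num, minor_num, patch_num; };

/* Parses a six-digit MMmmpp string. Returns 0 on success and -1 if the
 * string is not exactly six decimal digits. */
int parse_vernum(const char *s, struct re2c_version *v)
{
    int i;
    assert(s != NULL && v != NULL);
    for (i = 0; i < 6; ++i) {
        /* A shorter string fails here on its terminating NUL. */
        if (s[i] < '0' || s[i] > '9') return -1;
    }
    if (s[6] != '\0') return -1;
    v->major_num = (s[0] - '0') * 10 + (s[1] - '0');
    v->minor_num = (s[2] - '0') * 10 + (s[3] - '0');
    v->patch_num = (s[4] - '0') * 10 + (s[5] - '0');
    return 0;
}
```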
281
282 --verbose
283 Output a short message in case of success.
284
285 -v --version
286 Show version information.
287
288 -w --wide-chars
289 Generate a lexer that reads UCS2-encoded input. Re2c assumes
290 that character range is 0 -- 0xFFFF and character size is 2
291 bytes. This option implies -s.
292
293 -x --utf-16
294 Generate a lexer that reads UTF16-encoded input. Re2c assumes
295 that character range is 0 -- 0x10FFFF and character size is 2
296 bytes. This option implies -s.
297
298 Debug options
299 -D --emit-dot
300 Instead of normal output generate lexer graph in .dot format.
301 The output can be converted to an image with the help of
302 Graphviz (e.g. something like dot -Tpng -odfa.png dfa.dot).
303
304 -d --debug-output
305 Emit YYDEBUG in the generated code. YYDEBUG should be defined
306 by the user in the form of a void function with two parameters:
307 state (lexer state or -1) and symbol (current input symbol of
308 type YYCTYPE).
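
A possible definition of YYDEBUG is sketched below. It assumes that
YYCTYPE is unsigned char; the call counter is a hypothetical addition
that is convenient for tests and is not required by re2c.

```c
#include <assert.h>
#include <stdio.h>

typedef unsigned char YYCTYPE;   /* assumption: 1-byte code units */

static int yydebug_calls;        /* hypothetical counter, for illustration */

/* Logs the lexer state (or -1) and the current input symbol. */
void YYDEBUG(int state, YYCTYPE symbol)
{
    assert(state >= -1);
    ++yydebug_calls;
    fprintf(stderr, "state %d, symbol 0x%02X\n", state, (unsigned)symbol);
}
```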
309
310 --dump-adfa
311 Debug option: output DFA after tunneling (in .dot format).
312
313 --dump-cfg
314 Debug option: output control flow graph of tag variables (in
315 .dot format).
316
317 --dump-closure-stats
318 Debug option: output statistics on the number of states in clo‐
319 sure.
320
321 --dump-dfa-det
322 Debug option: output DFA immediately after determinization (in
323 .dot format).
324
325 --dump-dfa-min
326 Debug option: output DFA after minimization (in .dot format).
327
328 --dump-dfa-tagopt
329 Debug option: output DFA after tag optimizations (in .dot for‐
330 mat).
331
332 --dump-dfa-tree
333 Debug option: output DFA under construction with states repre‐
334 sented as tag history trees (in .dot format).
335
336 --dump-dfa-raw
337 Debug option: output DFA under construction with expanded
338 state-sets (in .dot format).
339
340 --dump-interf
341 Debug option: output interference table produced by liveness
342 analysis of tag variables.
343
344 --dump-nfa
345 Debug option: output NFA (in .dot format).
346
347 Internal options
348 --dfa-minimization <moore | table>
349 Internal option: DFA minimization algorithm used by re2c. The
350 moore option is the Moore algorithm (it is the default). The ta‐
351 ble option is the "table filling" algorithm. Both algorithms
352 should produce the same DFA up to states relabeling; table fill‐
353 ing is simpler and much slower and serves as a reference imple‐
354 mentation.
355
356 --eager-skip
357 Internal option: make the generated lexer advance the input po‐
358 sition eagerly -- immediately after reading the input symbol.
359 This changes the default behavior when the input position is ad‐
360 vanced lazily -- after transition to the next state. This option
361 is implied by --no-lookahead.
362
363 --no-lookahead
364 Internal option: use TDFA(0) instead of TDFA(1). This option
365 has effect only with --tags or --posix-captures options.
366
--no-optimize-tags
Internal option: suppress optimization of tag variables (useful for
debugging).
370
371 --posix-closure <gor1 | gtop>
372 Internal option: specify shortest-path algorithm used for the
373 construction of epsilon-closure with POSIX disambiguation seman‐
374 tics: gor1 (the default) stands for Goldberg-Radzik algorithm,
375 and gtop stands for "global topological order" algorithm.
376
377 --posix-prectable <complex | naive>
378 Internal option: specify the algorithm used to compute POSIX
379 precedence table. The complex algorithm computes precedence ta‐
380 ble in one traversal of tag history tree and has quadratic com‐
381 plexity in the number of TNFA states; it is the default. The
382 naive algorithm has worst-case cubic complexity in the number of
383 TNFA states, but it is much simpler than complex and may be
384 slightly faster in non-pathological cases.
385
386 --stadfa
387 Internal option: use staDFA algorithm for submatch extraction.
388 The main difference with TDFA is that tag operations in staDFA
389 are placed in states, not on transitions.
390
--fixed-tags <none | toplevel | all>
Internal option: specify whether the fixed-tag optimization should be
applied to all tags (all), none of them (none), or only those in
toplevel concatenation (toplevel). The default is all. "Fixed" tags are
those that are located within a fixed distance to some other tag
(called "base"). In such cases only the base tag needs to be tracked,
and the value of the fixed tag can be computed as the value of the base
tag plus a static offset. For tags that are under alternative or
repetition it is also necessary to check if the base tag has a no-match
value (in that case the fixed tag should also be set to no-match,
disregarding the offset). For tags in top-level concatenation the check
is not needed, because they always match.
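
The computation described above can be sketched as follows. Tag values
are modelled as offsets into the input, and TAG_NONE is a hypothetical
no-match value; neither the names nor the representation are taken from
re2c itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical no-match value for a tag stored as an input offset. */
#define TAG_NONE ((ptrdiff_t)-1)

/* Fixed tag under alternative or repetition: the base tag may be
 * unset, in which case the fixed tag is unset too and the offset is
 * disregarded. */
ptrdiff_t fixed_tag_checked(ptrdiff_t base, ptrdiff_t offset)
{
    return base == TAG_NONE ? TAG_NONE : base + offset;
}

/* Fixed tag in top-level concatenation: the base tag always matches,
 * so no check is needed. */
ptrdiff_t fixed_tag_toplevel(ptrdiff_t base, ptrdiff_t offset)
{
    assert(base != TAG_NONE);
    return base + offset;
}
```

For instance, if a tag always sits four characters before a base tag
whose value is 10, its value is 10 + (-4) = 6.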
404
405 Warnings
406 -W Turn on all warnings.
407
408 -Werror
409 Turn warnings into errors. Note that this option alone doesn't
410 turn on any warnings; it only affects those warnings that have
411 been turned on so far or will be turned on later.
412
413 -W<warning>
414 Turn on warning.
415
416 -Wno-<warning>
417 Turn off warning.
418
419 -Werror-<warning>
420 Turn on warning and treat it as an error (this implies -W<warn‐
421 ing>).
422
423 -Wno-error-<warning>
424 Don't treat this particular warning as an error. This doesn't
425 turn off the warning itself.
426
427 -Wcondition-order
428 Warn if the generated program makes implicit assumptions about
429 condition numbering. One should use either the -t, --type-header
430 option or the /*!types:re2c*/ directive to generate a mapping of
431 condition names to numbers and then use the autogenerated condi‐
432 tion names.
433
434 -Wempty-character-class
435 Warn if a regular expression contains an empty character class.
436 Trying to match an empty character class makes no sense: it
437 should always fail. However, for backwards compatibility rea‐
438 sons re2c allows empty character classes and treats them as
439 empty strings. Use the --empty-class option to change the de‐
440 fault behavior.
441
442 -Wmatch-empty-string
443 Warn if a rule is nullable (matches an empty string). If the
444 lexer runs in a loop and the empty match is unintentional, the
445 lexer may unexpectedly hang in an infinite loop.
446
447 -Wswapped-range
448 Warn if the lower bound of a range is greater than its upper
449 bound. The default behavior is to silently swap the range
450 bounds.
451
452 -Wundefined-control-flow
453 Warn if some input strings cause undefined control flow in the
454 lexer (the faulty patterns are reported). This is the most dan‐
455 gerous and most common mistake. It can be easily fixed by adding
456 the default rule * which has the lowest priority, matches any
457 code unit, and consumes exactly one code unit.
458
459 -Wunreachable-rules
460 Warn about rules that are shadowed by other rules and will never
461 match.
462
463 -Wuseless-escape
464 Warn if a symbol is escaped when it shouldn't be. By default,
465 re2c silently ignores such escapes, but this may as well indi‐
466 cate a typo or an error in the escape sequence.
467
468 -Wnondeterministic-tags
469 Warn if a tag has n-th degree of nondeterminism, where n is
470 greater than 1.
471
-Wsentinel-in-midrule
Warn if the sentinel symbol occurs in the middle of a rule, because
this may cause reads past the end of the buffer, crashes or memory
corruption in the generated lexer. This warning is only applicable if
the sentinel method of checking for the end of input is used. It is set
to an error if the re2c:sentinel configuration is used.
479
481 Re2c has a flexible interface that gives the user both the freedom and
482 the responsibility to define how the generated code interacts with the
483 outer program. There are two major options:
484
• Pointer API. It is also called "default API", since it was
historically the first, and for a long time the only one. This is a
more restricted API based on C pointer arithmetic. It consists of
pointer-like primitives YYCURSOR, YYMARKER, YYCTXMARKER and YYLIMIT,
which are normally defined as pointers of type YYCTYPE*. Pointer API is
enabled by default for the C backend, and it cannot be used with other
backends that do not have pointer arithmetic.
492
493
494
495 • Generic API. This is a less restricted API that does not assume
496 pointer semantics. It consists of primitives YYPEEK, YYSKIP, YY‐
497 BACKUP, YYBACKUPCTX, YYSTAGP, YYSTAGN, YYMTAGP, YYMTAGN, YYRESTORE,
498 YYRESTORECTX, YYRESTORETAG, YYSHIFT, YYSHIFTSTAG, YYSHIFTMTAG and YY‐
499 LESSTHAN. For the C backend generic API is enabled with --input cus‐
500 tom option or re2c:flags:input = custom; configuration; for the Go
501 backend it is enabled by default. Generic API was added in version
502 0.14. It is intentionally designed to give the user as much freedom
503 as possible in redefining the input model and the semantics of dif‐
504 ferent actions performed by the generated code. As an example, one
505 can override YYPEEK to check for the end of input before reading the
506 input character, or do some logging, etc.
507
508 Generic API has two styles:
509
510 • Function-like. This style is enabled with re2c:api:style = func‐
511 tions; configuration, and it is the default for C backend. In this
512 style API primitives should be defined as functions or macros with
513 parentheses, accepting the necessary arguments. For example, in C the
514 default pointer API can be defined in function-like style generic API
515 as follows:
516
#define YYPEEK()                 *YYCURSOR
#define YYSKIP()                 ++YYCURSOR
#define YYBACKUP()               YYMARKER = YYCURSOR
#define YYBACKUPCTX()            YYCTXMARKER = YYCURSOR
#define YYRESTORE()              YYCURSOR = YYMARKER
#define YYRESTORECTX()           YYCURSOR = YYCTXMARKER
#define YYRESTORETAG(tag)        YYCURSOR = tag
#define YYLESSTHAN(len)          YYLIMIT - YYCURSOR < len
#define YYSTAGP(tag)             tag = YYCURSOR
#define YYSTAGN(tag)             tag = NULL
#define YYSHIFT(shift)           YYCURSOR += shift
#define YYSHIFTSTAG(tag, shift)  tag += shift
529
530
531
532 • Free-form. This style is enabled with re2c:api:style = free-form;
533 configuration, and it is the default for Go backend. In this style
534 API primitives can be defined as free-form pieces of code, and in‐
535 stead of arguments they have interpolated variables of the form
536 @@{name}, or optionally just @@ if there is only one argument. The @@
537 text is called "sigil". It can be redefined to any other text with
538 re2c:api:sigil configuration. For example, the default pointer API
539 can be defined in free-form style generic API as follows:
540
re2c:define:YYPEEK       = "*YYCURSOR";
re2c:define:YYSKIP       = "++YYCURSOR";
re2c:define:YYBACKUP     = "YYMARKER = YYCURSOR";
re2c:define:YYBACKUPCTX  = "YYCTXMARKER = YYCURSOR";
re2c:define:YYRESTORE    = "YYCURSOR = YYMARKER";
re2c:define:YYRESTORECTX = "YYCURSOR = YYCTXMARKER";
re2c:define:YYRESTORETAG = "YYCURSOR = @@{tag}";
re2c:define:YYLESSTHAN   = "YYLIMIT - YYCURSOR < @@{len}";
re2c:define:YYSTAGP      = "@@{tag} = YYCURSOR";
re2c:define:YYSTAGN      = "@@{tag} = NULL";
re2c:define:YYSHIFT      = "YYCURSOR += @@{shift}";
re2c:define:YYSHIFTSTAG  = "@@{tag} += @@{shift}";
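
As mentioned in the description of the generic API, YYPEEK can be
overridden to check for the end of input before reading a character.
The sketch below shows one way this might look in function-like style
over an index-based input. The struct, its field names and the
hand-written matcher lex_number (which stands in for generated code)
are all assumptions made for illustration. Note that this YYPEEK
returns 0 past the end of input, so a real NUL byte in the input is
indistinguishable from the end.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char YYCTYPE;

/* Hypothetical input model: a buffer with explicit length, the current
 * position and a backup position. */
struct input {
    const YYCTYPE *buf;
    size_t len, pos, mark;
};

/* Generic API primitives; they expect a variable `in` in scope. */
#define YYPEEK()    (in->pos < in->len ? in->buf[in->pos] : 0)
#define YYSKIP()    (++in->pos)
#define YYBACKUP()  (in->mark = in->pos)
#define YYRESTORE() (in->pos = in->mark)

/* Hand-written stand-in for generated code: matches [0-9]+ ("." [0-9]+)?
 * at the current position and returns the number of characters matched. */
size_t lex_number(struct input *in)
{
    size_t start;
    YYCTYPE yych;

    assert(in != NULL && in->pos <= in->len);
    start = in->pos;
    yych = YYPEEK();
    if (yych < '0' || yych > '9') return 0;
    do {
        YYSKIP();
        yych = YYPEEK();
    } while (yych >= '0' && yych <= '9');
    if (yych != '.') return in->pos - start;

    YYBACKUP();                      /* the integer part is a valid match */
    YYSKIP();
    yych = YYPEEK();
    if (yych < '0' || yych > '9') {
        YYRESTORE();                 /* no fraction: roll back to the backup */
        return in->pos - start;
    }
    do {
        YYSKIP();
        yych = YYPEEK();
    } while (yych >= '0' && yych <= '9');
    return in->pos - start;
}
```

Because all access to the input goes through the four primitives, the
same matcher would work unchanged with a different input model.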
553
554 API primitives
555 Here is a list of API primitives that may be used by the generated code
556 in order to interface with the outer program. Which primitives are
557 needed depends on multiple factors, including the complexity of regular
558 expressions, input representation, buffering, the use of various fea‐
559 tures and so on. All the necessary primitives should be defined by the
560 user in the form of macros, functions, variables, free-form pieces of
561 code or any other suitable form. Re2c does not (and cannot) check the
562 definitions, so if anything is missing or defined incorrectly the gen‐
563 erated code will not compile.
564
565 YYCTYPE
566 The type of the input characters (code units). For ASCII,
567 EBCDIC and UTF-8 encodings it should be 1-byte unsigned integer.
568 For UTF-16 or UCS-2 it should be 2-byte unsigned integer. For
569 UTF-32 it should be 4-byte unsigned integer.
570
571 YYCURSOR
572 A pointer-like l-value that stores the current input position
573 (usually a pointer of type YYCTYPE*). Initially YYCURSOR should
574 point to the first input character. It is advanced by the gener‐
575 ated code. When a rule matches, YYCURSOR points to the one af‐
576 ter the last matched character. It is used only in the default C
577 API.
578
YYLIMIT
A pointer-like r-value that stores the end of input position (usually a
pointer of type YYCTYPE*). Initially YYLIMIT should point to the
position one after the last available input character. It is not
changed by the generated code. The lexer compares YYCURSOR to YYLIMIT
in order to determine if there are enough input characters left.
YYLIMIT is used only in the default C API.
586
YYMARKER
A pointer-like l-value (usually a pointer of type YYCTYPE*) that stores
the position of the latest matched rule. It is used to restore the
YYCURSOR position if the longer match fails and the lexer needs to roll
back. Initialization is not needed. YYMARKER is used only in the
default C API.
593
594 YYCTXMARKER
595 A pointer-like l-value that stores the position of the trailing
596 context (usually a pointer of type YYCTYPE*). No initialization
597 is needed. It is used only in the default C API, and only with
598 the lookahead operator /.
599
600 YYFILL API primitive with one argument len. The meaning of YYFILL is
601 to provide at least len more input characters or fail. If EOF
602 rule is used, YYFILL should always return to the calling func‐
603 tion; the return value should be zero on success and non-zero on
604 failure. If EOF rule is not used, YYFILL return value is ignored
605 and it should not return on failure. Maximal value of len is YY‐
606 MAXFILL, which can be generated with /*!max:re2c*/ directive.
607 The definition of YYFILL can be either function-like or
608 free-form depending on the API style (see re2c:api:style and
609 re2c:define:YYFILL:naked).
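
A refill routine that could back YYFILL in the pointer API is sketched
below, following the "return zero on success, non-zero on failure"
convention used with the EOF rule. Everything here is an assumption
made for illustration: the struct layout, the in-memory source that
stands in for a file, the buffer size, and the choice to discard
everything before the start of the current lexeme. A real lexer would
also have to shift YYMARKER and any tag variables by the same amount.

```c
#include <assert.h>
#include <string.h>

#define BUFSIZE 16            /* hypothetical buffer size */

struct lexer {
    const char *src;          /* unread source data (stand-in for a file) */
    size_t src_left;          /* number of unread source bytes */
    char buf[BUFSIZE];
    char *tok;                /* start of the current lexeme */
    char *cur;                /* plays the role of YYCURSOR */
    char *lim;                /* plays the role of YYLIMIT */
};

void lexer_init(struct lexer *lx, const char *src, size_t len)
{
    lx->src = src;
    lx->src_left = len;
    lx->tok = lx->cur = lx->lim = lx->buf;
}

/* Tries to make at least `need` characters available after lx->cur.
 * Returns 0 on success and 1 on failure. */
int fill(struct lexer *lx, size_t need)
{
    size_t shift = (size_t)(lx->tok - lx->buf);  /* bytes that can be dropped */
    size_t used = (size_t)(lx->lim - lx->tok);   /* bytes that must be kept */
    size_t free_space, n;

    assert(need > 0);
    /* Move the kept bytes to the start of the buffer. */
    memmove(lx->buf, lx->tok, used);
    lx->tok -= shift;
    lx->cur -= shift;
    lx->lim -= shift;
    /* Append as much new data as fits. */
    free_space = BUFSIZE - used;
    n = lx->src_left < free_space ? lx->src_left : free_space;
    memcpy(lx->lim, lx->src, n);
    lx->src += n;
    lx->src_left -= n;
    lx->lim += n;
    return (size_t)(lx->lim - lx->cur) >= need ? 0 : 1;
}
```

With such a routine in place, YYFILL could be defined along the lines of
#define YYFILL(n) fill(lx, n), where lx is a pointer to the lexer state
that is in scope at the point of use.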
610
611 YYMAXFILL
612 An integral constant equal to the maximal value of YYFILL argu‐
613 ment. It can be generated with /*!max:re2c*/ directive.
614
YYLESSTHAN
A generic API primitive with one argument len. It should be defined as
an r-value of boolean type that equals true if and only if there are
fewer than len input characters left. The definition can be either
function-like or free-form depending on the API style (see
re2c:api:style).
621
622 YYPEEK A generic API primitive with no arguments. It should be defined
623 as an r-value of type YYCTYPE that is equal to the character at
624 the current input position. The definition can be either func‐
625 tion-like or free-form depending on the API style (see
626 re2c:api:style).
627
628 YYSKIP A generic API primitive with no arguments. The meaning of
629 YYSKIP is to advance the current input position by one charac‐
630 ter. The definition can be either function-like or free-form de‐
631 pending on the API style (see re2c:api:style).
632
633 YYBACKUP
634 A generic API primitive with no arguments. The meaning of YY‐
635 BACKUP is to save the current input position, which is later re‐
636 stored with YYRESTORE. The definition should be either func‐
637 tion-like or free-form depending on the API style (see
638 re2c:api:style).
639
640 YYRESTORE
641 A generic API primitive with no arguments. The meaning of YYRE‐
642 STORE is to restore the current input position to the value
643 saved by YYBACKUP. The definition should be either func‐
644 tion-like or free-form depending on the API style (see
645 re2c:api:style).
646
647 YYBACKUPCTX
648 A generic API primitive with zero arguments. The meaning of YY‐
649 BACKUPCTX is to save the current input position as the position
650 of the trailing context, which is later restored by YYRE‐
651 STORECTX. The definition should be either function-like or
652 free-form depending on the API style (see re2c:api:style).
653
654 YYRESTORECTX
655 A generic API primitive with no arguments. The meaning of YYRE‐
656 STORECTX is to restore the trailing context position saved with
657 YYBACKUPCTX. The definition should be either function-like or
658 free-form depending on the API style (see re2c:api:style).
659
660 YYRESTORETAG
661 A generic API primitive with one argument tag. The meaning of
662 YYRESTORETAG is to restore the trailing context position to the
663 value of tag. The definition should be either function-like or
664 free-form depending on the API style (see re2c:api:style).
665
666 YYSTAGP
667 A generic API primitive with one argument tag. The meaning of
668 YYSTAGP is to set tag value to the current input position. The
669 definition should be either function-like or free-form depending
670 on the API style (see re2c:api:style).
671
672 YYSTAGN
673 A generic API primitive with one argument tag. The meaning of
674 YYSTAGN is to set tag value to null (or some default value). The
675 definition should be either function-like or free-form depending
676 on the API style (see re2c:api:style).
677
678 YYMTAGP
679 A generic API primitive with one argument tag. The meaning of
680 YYMTAGP is to append the current position to the history of tag.
681 The definition should be either function-like or free-form de‐
682 pending on the API style (see re2c:api:style).
683
684 YYMTAGN
685 A generic API primitive with one argument tag. The meaning of
686 YYMTAGN is to append null (or some other default) value to the
687 history of tag. The definition can be either function-like or
688 free-form depending on the API style (see re2c:api:style).
689
690 YYSHIFT
691 A generic API primitive with one argument shift. The meaning of
692 YYSHIFT is to shift the current input position by shift charac‐
693 ters (the shift value may be negative). The definition can be
694 either function-like or free-form depending on the API style
695 (see re2c:api:style).
696
697 YYSHIFTSTAG
698 A generic API primitive with two arguments, tag and shift. The
699 meaning of YYSHIFTSTAG is to shift tag by shift characters (the
700 shift value may be negative). The definition can be either
701 function-like or free-form depending on the API style (see
702 re2c:api:style).
703
704 YYSHIFTMTAG
705 A generic API primitive with two arguments, tag and shift. The
706 meaning of YYSHIFTMTAG is to shift the latest value in the his‐
707 tory of tag by shift characters (the shift value may be nega‐
708 tive). The definition should be either function-like or
709 free-form depending on the API style (see re2c:api:style).
710
711 YYMAXNMATCH
712 An integral constant equal to the maximal number of POSIX cap‐
713 turing groups in a rule. It is generated with /*!maxn‐
714 match:re2c*/ directive.
715
716 YYCONDTYPE
717 The type of the condition enum. It should be generated either
718 with /*!types:re2c*/ directive or -t --type-header option.
719
720 YYGETCONDITION
721 An API primitive with zero arguments. It should be defined as
722 an r-value of type YYCONDTYPE that is equal to the current con‐
723 dition identifier. The definition can be either function-like or
724 free-form depending on the API style (see re2c:api:style and
725 re2c:define:YYGETCONDITION:naked).
726
727 YYSETCONDITION
728 An API primitive with one argument cond. The meaning of YYSET‐
729 CONDITION is to set the current condition identifier to cond.
730 The definition should be either function-like or free-form de‐
731 pending on the API style (see re2c:api:style and re2c:define:YY‐
732 SETCONDITION@cond).
733
734 YYGETSTATE
735 An API primitive with zero arguments. It should be defined as
736 an r-value of integer type that is equal to the current lexer
737 state. Should be initialized to -1. The definition can be either
738 function-like or free-form depending on the API style (see
739 re2c:api:style and re2c:define:YYGETSTATE:naked).
740
741 YYSETSTATE
742 An API primitive with one argument state. The meaning of YYSET‐
743 STATE is to set the current lexer state to state. The defini‐
744 tion should be either function-like or free-form depending on
745 the API style (see re2c:api:style and re2c:define:YYSET‐
746 STATE@state).
747
748 YYDEBUG
749 A debug API primitive with two arguments. It can be used to de‐
750 bug the generated code (with -d --debug-output option). YYDEBUG
751 should return no value and accept two arguments: state (either a
752 DFA state index or -1) and symbol (the current input symbol).
753
754 yych An l-value of type YYCTYPE that stores the current input charac‐
755 ter. User definition is necessary only with -f --storable-state
756 option.
757
758 yyaccept
759 An l-value of unsigned integral type that stores the number of
760 the latest matched rule. User definition is necessary only with
761 -f --storable-state option.
762
763 yynmatch
764 An l-value of unsigned integral type that stores the number of
765 POSIX capturing groups in the matched rule. Used only with -P
766 --posix-captures option.
767
768 yypmatch
769 An array of l-values that are used to hold the tag values corre‐
770 sponding to the capturing parentheses in the matching rule. Ar‐
771 ray length must be at least yynmatch * 2 (usually YYMAXNMATCH *
772 2 is a good choice). Used only with -P --posix-captures option.
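
The sketch below shows how a semantic action might read a capturing
group out of yypmatch. It assumes the pointer API, that the start and
the end of group i are stored in yypmatch[2 * i] and
yypmatch[2 * i + 1], that group 0 is the whole match, and that groups
which did not participate in the match hold NULL. The helper name is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies capturing group `group` into `out` as a NUL-terminated string.
 * Returns the group length, or -1 if the group index is out of range,
 * the group did not match, or `out` is too small. */
long copy_group(const char *const *yypmatch, size_t yynmatch,
                size_t group, char *out, size_t out_size)
{
    const char *start, *end;
    size_t len;

    assert(yypmatch != NULL && out != NULL);
    if (group >= yynmatch) return -1;
    start = yypmatch[2 * group];
    end = yypmatch[2 * group + 1];
    if (start == NULL || end == NULL) return -1;
    len = (size_t)(end - start);
    if (len + 1 > out_size) return -1;
    memcpy(out, start, len);
    out[len] = '\0';
    return (long)len;
}
```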
773
774 Directives
775 Below is the list of all directives provided by re2c (in no particular
776 order). More information on each directive can be found in the related
777 sections.
778
779 /*!re2c ... */
780 A standard re2c block.
781
782 %{ ... %}
783 A standard re2c block in -F --flex-support mode.
784
785 /*!rules:re2c ... */
786 A reusable re2c block (requires -r --reuse option).
787
788 /*!use:re2c ... */
789 A block that reuses previous rules-block specified with
790 /*!rules:re2c ... */ (requires -r --reuse option).
791
792 /*!ignore:re2c ... */
793 A block whose contents are ignored and removed from the output
794 file.
795
796 /*!max:re2c*/
797 This directive is substituted with the macro-definition of YY‐
798 MAXFILL.
799
800 /*!maxnmatch:re2c*/
801 This directive is substituted with the macro-definition of YY‐
802 MAXNMATCH (requires -P --posix-captures option).
803
804 /*!getstate:re2c*/
805 This directive is substituted with conditional dispatch on lexer
806 state (requires -f --storable-state option).
807
808 /*!types:re2c ... */
809 This directive is substituted with the definition of condition
810 enum (requires -c --conditions option).
811
812 /*!stags:re2c ... */, /*!mtags:re2c ... */
813 These directives allow one to specify a template piece of code
814 that is expanded for each s-tag/m-tag variable generated by
815 re2c. This block has two optional configurations: format = "@@";
816 (specifies the template where @@ is substituted with the name of
817 each tag variable), and separator = ""; (specifies the piece of
818 code used to join the generated pieces for different tag vari‐
819 ables).
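
For example, the following fragment is a sketch of how tag variables might be declared, assuming they are plain pointers into the input. The declaration text inside format is an illustration; only the format and separator configurations themselves come from the description above.

```c
/* Declare one pointer per s-tag variable, one declaration per line. */
/*!stags:re2c format = "const char *@@;"; separator = "\n"; */
```

If re2c generated the tag variables yyt1 and yyt2, this should expand to something like const char *yyt1; and const char *yyt2; on separate lines.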
820
821 /*!include:re2c FILE */
822 This directive allows one to include FILE (in the same sense as
823 #include directive in C/C++).
824
825 /*!header:re2c:on*/
826 This directive marks the start of header file. Everything after
827 it and up to the following /*!header:re2c:off*/ directive is
828 processed by re2c and written to the header file specified with
829 -t --type-header option.
830
831 /*!header:re2c:off*/
832 This directive marks the end of header file started with
833 /*!header:re2c:on*/.
834
835 Configurations
836 re2c:flags:t, re2c:flags:type-header
837 Specify the name of the generated header file relative to the
838 directory of the output file. (Same as -t, --type-header com‐
839 mand-line option except that the filepath is relative.)
840
841 re2c:flags:input
842 Same as --input command-line option.
843
844 re2c:api:style
845 Allows one to specify the style of generic API. Possible values
846 are functions and free-form. With functions style (the default
847 for the C backend) API primitives behave like functions, and
848 re2c generates parentheses with an argument list after the name
849 of each primitive. With free-form style (the default for the Go
850 backend) re2c treats API definitions as interpolated strings and
851 substitutes argument placeholders with the actual argument val‐
852 ues. This option can be overridden by options for individual
853 API primitives, e.g. re2c:define:YYFILL:naked for YYFILL.
854
855 re2c:api:sigil
856 Allows one to specify the "sigil" symbol (or string) that is
857 used to recognize argument placeholders in the definitions of
858 generic API primitives. The default value is @@. Placeholders
859 start with sigil, followed by the argument name in curly braces.
860 For example, if sigil is set to $, then placeholders will have
861 the form ${name}. Single-argument APIs may use shorthand nota‐
862 tion without the name in braces. This option can be overridden
863 by options for individual API primitives, e.g. re2c:define:YY‐
864 FILL@len for YYFILL.
865
866 re2c:define:YYCTYPE
867 Defines YYCTYPE (see the user interface section).
868
869 re2c:define:YYCURSOR
870 Defines C API primitive YYCURSOR (see the user interface sec‐
871 tion).
872
873 re2c:define:YYLIMIT
874 Defines C API primitive YYLIMIT (see the user interface sec‐
875 tion).
876
877 re2c:define:YYMARKER
878 Defines C API primitive YYMARKER (see the user interface sec‐
879 tion).
880
881 re2c:define:YYCTXMARKER
882 Defines C API primitive YYCTXMARKER (see the user interface sec‐
883 tion).
884
885 re2c:define:YYFILL
886 Defines API primitive YYFILL (see the user interface section).
887
888 re2c:define:YYFILL@len
889 Specifies the sigil used for argument substitution in YYFILL
890 definition. Defaults to @@. Overrides the more generic
891 re2c:api:sigil configuration.
892
893 re2c:define:YYFILL:naked
894 Allows one to override re2c:api:style for YYFILL. Value 0 cor‐
895 responds to free-form API style.
896
897 re2c:yyfill:enable
898 Defaults to 1 (YYFILL is enabled). Set this to zero to suppress
899 the generation of YYFILL. Use warnings (-W option) and re2c:sen‐
900 tinel configuration to verify that the generated lexer cannot
901 read past the end of input, as this might introduce severe secu‐
902 rity issues to your programs.
903
904 re2c:yyfill:parameter
905 Controls the argument in the parentheses that follow YYFILL. De‐
906 faults to 1, which means that the argument is generated. If
907 zero, the argument is omitted. Can be overridden with re2c:de‐
908 fine:YYFILL:naked or re2c:api:style.
909
910 re2c:eof
911 Specifies the sentinel symbol used with EOF rule $ to check for
912 the end of input in the generated lexer. The default value is -1
913 (EOF rule is not used). Other possible values include all valid
914 code units. Only decimal numbers are recognized.
915
916 re2c:sentinel
917 Specifies the sentinel symbol used with the sentinel method of
918 checking for the end of input in the generated lexer (the case
919 when bounds checking is disabled with re2c:yyfill:enable = 0;
920 and EOF rule $ is not used). This configuration does not affect
921 code generation. It is used by re2c to verify that the sentinel
922 symbol is not allowed in the middle of the rule, and prevent
923 possible reads past the end of buffer in the generated lexer.
924 The default value is -1 (re2c assumes that the sentinel symbol
925 is 0, which is the most common case). Other possible values in‐
926 clude all valid code units. Only decimal numbers are recognized.
927
928 re2c:define:YYLESSTHAN
929 Defines generic API primitive YYLESSTHAN (see the user interface
930 section).
931
932 re2c:yyfill:check
933 Setting this to zero suppresses the generation of the YYFILL
934 check (YYLESSTHAN in generic API or a YYLIMIT-based comparison in
935 default C API). This configuration is useful when the necessary
936 input is always available. It defaults to 1 (the check is
937 generated).
938
939 re2c:label:yyFillLabel
940 Allows one to change the prefix of YYFILL labels (used with EOF
941 rule or with storable states).
942
943 re2c:define:YYPEEK
944 Defines generic API primitive YYPEEK (see the user interface
945 section).
946
947 re2c:define:YYSKIP
948 Defines generic API primitive YYSKIP (see the user interface
949 section).
950
951 re2c:define:YYBACKUP
952 Defines generic API primitive YYBACKUP (see the user interface
953 section).
954
955 re2c:define:YYBACKUPCTX
956 Defines generic API primitive YYBACKUPCTX (see the user inter‐
957 face section).
958
959 re2c:define:YYRESTORE
960 Defines generic API primitive YYRESTORE (see the user interface
961 section).
962
963 re2c:define:YYRESTORECTX
964 Defines generic API primitive YYRESTORECTX (see the user inter‐
965 face section).
966
967 re2c:define:YYRESTORETAG
968 Defines generic API primitive YYRESTORETAG (see the user inter‐
969 face section).
970
971 re2c:define:YYSHIFT
972 Defines generic API primitive YYSHIFT (see the user interface
973 section).
974
975 re2c:define:YYSHIFTMTAG
976 Defines generic API primitive YYSHIFTMTAG (see the user inter‐
977 face section).
978
979 re2c:define:YYSHIFTSTAG
980 Defines generic API primitive YYSHIFTSTAG (see the user inter‐
981 face section).
982
983 re2c:define:YYSTAGN
984 Defines generic API primitive YYSTAGN (see the user interface
985 section).
986
987 re2c:define:YYSTAGP
988 Defines generic API primitive YYSTAGP (see the user interface
989 section).
990
991 re2c:define:YYMTAGN
992 Defines generic API primitive YYMTAGN (see the user interface
993 section).
994
995 re2c:define:YYMTAGP
996 Defines generic API primitive YYMTAGP (see the user interface
997 section).
998
999 re2c:flags:T, re2c:flags:tags
1000 Same as -T --tags command-line option.
1001
1002 re2c:flags:P, re2c:flags:posix-captures
1003 Same as -P --posix-captures command-line option.
1004
1005 re2c:tags:expression
1006 Allows one to customize the way re2c addresses tag variables.
1007 By default re2c generates expressions of the form yyt<N>. This
1008 might be inconvenient, for example if tag variables are defined
1009 as fields in a struct. re2c recognizes placeholders of the form
1010 @@{tag} or @@ and replaces it with the actual tag name. Sigil
1011 @@ can be redefined with re2c:api:sigil configuration. For ex‐
1012 ample, setting re2c:tags:expression = "p->@@"; results in ex‐
1013 pressions of the form p->yyt<N> in the generated code.
1014
1015 re2c:tags:prefix
1016 Allows one to override the prefix of tag variables (defaults to
1017 yyt).
1018
1019 re2c:flags:lookahead
1020 Same as inverted --no-lookahead command-line option.
1021
1022 re2c:flags:optimize-tags
1023 Same as inverted --no-optimize-tags command-line option.
1024
1025 re2c:define:YYCONDTYPE
1026 Defines YYCONDTYPE (see the user interface section).
1027
1028 re2c:define:YYGETCONDITION
1029 Defines API primitive YYGETCONDITION (see the user interface
1030 section).
1031
1032 re2c:define:YYGETCONDITION:naked
1033 Allows one to override re2c:api:style for YYGETCONDITION. Value
1034 0 corresponds to free-form API style.
1035
1036 re2c:define:YYSETCONDITION
1037 Defines API primitive YYSETCONDITION (see the user interface
1038 section).
1039
1040 re2c:define:YYSETCONDITION@cond
1041 Specifies the sigil used for argument substitution in YYSETCON‐
1042 DITION definition. The default value is @@. Overrides the more
1043 generic re2c:api:sigil configuration.
1044
1045 re2c:define:YYSETCONDITION:naked
1046 Allows one to override re2c:api:style for YYSETCONDITION. Value
1047 0 corresponds to free-form API style.
1048
1049 re2c:cond:goto
1050 Allows one to customize the goto statements used with the short‐
1051 cut :=> rules in conditions. The default value is goto @@;.
1052 Placeholders are substituted with condition name (see
1053 re2c:api:sigil and re2c:cond:goto@cond).
1054
1055 re2c:cond:goto@cond
1056 Specifies the sigil used for argument substitution in
1057 re2c:cond:goto definition. The default value is @@. Overrides
1058 the more generic re2c:api:sigil configuration.
1059
1060 re2c:cond:divider
1061 Defines the divider for condition blocks. The default value is
1062 /* *********************************** */. Placeholders are
1063 substituted with condition name (see re2c:api:sigil and
1064 re2c:cond:divider@cond).
1065
1066 re2c:cond:divider@cond
1067 Specifies the sigil used for argument substitution in
1068 re2c:cond:divider definition. The default value is @@. Over‐
1069 rides the more generic re2c:api:sigil configuration.
1070
1071 re2c:condprefix
1072 Specifies the prefix used for condition labels. The default
1073 value is yyc_.
1074
1075 re2c:condenumprefix
1076 Specifies the prefix used for condition identifiers. The de‐
1077 fault value is yyc.
1078
1079 re2c:define:YYGETSTATE
1080 Defines API primitive YYGETSTATE (see the user interface sec‐
1081 tion).
1082
1083 re2c:define:YYGETSTATE:naked
1084 Allows one to override re2c:api:style for YYGETSTATE. Value 0
1085 corresponds to free-form API style.
1086
1087 re2c:define:YYSETSTATE
1088 Defines API primitive YYSETSTATE (see the user interface sec‐
1089 tion).
1090
1091 re2c:define:YYSETSTATE@state
1092 Specifies the sigil used for argument substitution in YYSETSTATE
1093 definition. The default value is @@. Overrides the more generic
1094 re2c:api:sigil configuration.
1095
1096 re2c:define:YYSETSTATE:naked
1097 Allows one to override re2c:api:style for YYSETSTATE. Value 0
1098 corresponds to free-form API style.
1099
1100 re2c:state:abort
1101 If set to a positive integer value, changes the form of the
1102 YYGETSTATE switch: instead of using default case to jump to the
1103 beginning of the lexer block, a -1 case is used, and the default
1104 case aborts the program.
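
The difference between the two forms of the YYGETSTATE switch can be sketched in plain C. This is a hand-written illustration, not re2c output: the generated code jumps to labels with goto, whereas the hypothetical dispatch function below returns made-up resume codes so that the behavior can be observed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical resume points standing in for the labels re2c would use. */
enum { RESUME_START = 100, RESUME_FILL_0 = 101, RESUME_FILL_1 = 102 };

/* Maps a stored lexer state to a resume point.
 * abort_on_unknown mimics a positive re2c:state:abort value. */
static int dispatch(int state, int abort_on_unknown)
{
    if (abort_on_unknown) {
        switch (state) {
        case -1: return RESUME_START;   /* explicit case for the initial state */
        case 0:  return RESUME_FILL_0;
        case 1:  return RESUME_FILL_1;
        default: abort();               /* unknown state is a hard error */
        }
    }
    switch (state) {
    default: return RESUME_START;       /* anything unknown restarts the lexer */
    case 0:  return RESUME_FILL_0;
    case 1:  return RESUME_FILL_1;
    }
}
```

The aborting form is useful for catching a corrupted or uninitialized state early, instead of silently restarting the lexer.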
1105
1106 re2c:state:nextlabel
1107 With storable states, controls whether the YYGETSTATE block
1108 is followed by a yyNext label (the default value is zero, which
1109 corresponds to no label). Instead of using yyNext it is possible
1110 to use re2c:startlabel to force the generation of a specific
1111 start label. Instead of using labels it is often more conve‐
1112 nient to generate YYGETSTATE code using /*!getstate:re2c*/.
1113
1114 re2c:label:yyNext
1115 Allows one to change the name of the yyNext label.
1116
1117 re2c:startlabel
1118 Controls the generation of start label for the next lexer block.
1119 The default value is zero, which means that the start label is
1120 generated only if it is used. An integer value greater than zero
1121 forces the generation of start label even if it is unused by the
1122 lexer. A string value also forces start label generation and
1123 sets the label name to the specified string. This configuration
1124 applies only to the current block (it is reset to default for
1125 the next block).
1126
1127 re2c:flags:s, re2c:flags:nested-ifs
1128 Same as -s --nested-ifs command-line option.
1129
1130 re2c:flags:b, re2c:flags:bit-vectors
1131 Same as -b --bit-vectors command-line option.
1132
1133 re2c:variable:yybm
1134 Overrides the name of the yybm variable.
1135
1136 re2c:yybm:hex
1137 Defaults to zero (a decimal bitmap table is generated). If set
1138 to nonzero, a hexadecimal table is generated.
1139
1140 re2c:flags:g, re2c:flags:computed-gotos
1141 Same as -g --computed-gotos command-line option.
1142
1143 re2c:cgoto:threshold
1144 With -g --computed-gotos option this value specifies the com‐
1145 plexity threshold that triggers the generation of jump tables
1146 instead of nested if statements and bitmaps. The default value
1147 is 9.
1148
1149 re2c:flags:case-ranges
1150 Same as --case-ranges command-line option.
1151
1152 re2c:flags:e, re2c:flags:ecb
1153 Same as -e --ecb command-line option.
1154
1155 re2c:flags:8, re2c:flags:utf-8
1156 Same as -8 --utf-8 command-line option.
1157
1158 re2c:flags:w, re2c:flags:wide-chars
1159 Same as -w --wide-chars command-line option.
1160
1161 re2c:flags:x, re2c:flags:utf-16
1162 Same as -x --utf-16 command-line option.
1163
1164 re2c:flags:u, re2c:flags:unicode
1165 Same as -u --unicode command-line option.
1166
1167 re2c:flags:encoding-policy
1168 Same as --encoding-policy command-line option.
1169
1170 re2c:flags:empty-class
1171 Same as --empty-class command-line option.
1172
1173 re2c:flags:case-insensitive
1174 Same as --case-insensitive command-line option.
1175
1176 re2c:flags:case-inverted
1177 Same as --case-inverted command-line option.
1178
1179 re2c:flags:i, re2c:flags:no-debug-info
1180 Same as -i --no-debug-info command-line option.
1181
1182 re2c:indent:string
1183 Specifies the string to use for indentation. The default value
1184 is "\t". Indent string should contain only whitespace charac‐
1185 ters. To disable indentation entirely, set this configuration
1186 to empty string "".
1187
1188 re2c:indent:top
1189 Specifies the minimum amount of indentation to use. The default
1190 value is zero. The value should be a non-negative integer num‐
1191 ber.
1192
1193 re2c:labelprefix
1194 Allows one to change the prefix of DFA state labels. The de‐
1195 fault value is yy.
1196
1197 re2c:yych:emit
1198 Set this to zero to suppress the generation of yych definition.
1199 Defaults to 1 (the definition is generated).
1200
1201 re2c:variable:yych
1202 Overrides the name of the yych variable.
1203
1204 re2c:yych:conversion
1205 If set to nonzero, re2c automatically generates a cast to YYC‐
1206 TYPE every time yych is read. Defaults to zero (no cast).
1207
1208 re2c:variable:yyaccept
1209 Overrides the name of the yyaccept variable.
1210
1211 re2c:variable:yytarget
1212 Overrides the name of the yytarget variable.
1213
1214 re2c:variable:yystable
1215 Deprecated.
1216
1217 re2c:variable:yyctable
1218 When both -c --conditions and -g --computed-gotos are active,
1219 re2c will use this variable to generate a static jump table for
1220 YYGETCONDITION.
1221
1222 re2c:define:YYDEBUG
1223 Defines YYDEBUG (see the user interface section).
1224
1225 re2c:flags:d, re2c:flags:debug-output
1226 Same as -d --debug-output command-line option.
1227
1228 re2c:flags:dfa-minimization
1229 Same as --dfa-minimization command-line option.
1230
1231 re2c:flags:eager-skip
1232 Same as --eager-skip command-line option.
1233
1235 re2c uses the following syntax for regular expressions:
1236
1237 • "foo" case-sensitive string literal
1238
1239 • 'foo' case-insensitive string literal
1240
1241 • [a-xyz], [^a-xyz] character class (possibly negated)
1242
1243 • . any character except newline
1244
1245 • R \ S difference of character classes R and S
1246
1247 • R* zero or more occurrences of R
1248
1249 • R+ one or more occurrences of R
1250
1251 • R? optional R
1252
1253 • R{n} repetition of R exactly n times
1254
1255 • R{n,} repetition of R at least n times
1256
1257 • R{n,m} repetition of R from n to m times
1258
1259 • (R) just R; parentheses are used to override precedence or for
1260 POSIX-style submatch
1261
1262 • R S concatenation: R followed by S
1263
1264 • R | S alternative: R or S
1265
1266 • R / S lookahead: R followed by S, but S is not consumed
1267
1268 • name the regular expression defined as name (or literal string "name"
1269 in Flex compatibility mode)
1270
1271 • {name} the regular expression defined as name in Flex compatibility
1272 mode
1273
1274 • @stag an s-tag: saves the last input position at which @stag matches
1275 in a variable named stag
1276
1277 • #mtag an m-tag: saves all input positions at which #mtag matches in a
1278 variable named mtag
1279
1280 Character classes and string literals may contain the following escape
1281 sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and hexa‐
1282 decimal escapes \xhh, \uhhhh and \Uhhhhhhhh.
1283
1285 One of the main problems for the lexer is to know when to stop. There
1286 are a few terminating conditions:
1287
1288 • the lexer may match some rule (including default rule *) and come to
1289 a final state
1290
1291 • the lexer may fail to match any rule and come to a default state
1292
1293 • the lexer may reach the end of input
1294
1295 The first two conditions terminate the lexer in a "natural" way: it
1296 comes to a state with no outgoing transitions, and the matching auto‐
1297 matically stops. The third condition, end of input, is different: it
1298 may happen in any state, and the lexer should be able to handle it.
1299 Checking for the end of input interrupts the normal lexer workflow and
1300 adds conditional branches to the generated program, therefore it is
1301 necessary to minimize the number of such checks. re2c supports a few
1302 different methods for end of input handling. Which one to use depends
1303 on the complexity of regular expressions, the need for buffering, per‐
1304 formance considerations and other factors. Here is a list of all meth‐
1305 ods:
1306
1307 • Sentinel character. This method eliminates the need for the end of
1308 input checks altogether. It is simple and efficient, but limited to
1309 the case when there is a natural "sentinel" character that can never
1310 occur in valid input. This character may still occur in invalid in‐
1311 put, but it is not allowed by the regular expressions, except perhaps
1312 as the last character of a rule. The sentinel character is appended
1313 at the end of input and serves as a stop signal: when the lexer reads
1314 it, it must be either the end of input, or a syntax error. In both
1315 cases the lexer stops. This method is used if YYFILL is disabled
1316 with re2c:yyfill:enable = 0; and re2c:eof has the default value -1.
1317
1318
1319
1320 • Sentinel character with bounds checks. This method is generic: it
1321 can handle any input without restrictions on the regular
1322 expressions. The idea is to reduce the number of end of input checks
1323 by performing them only on certain characters. Similar to the "sen‐
1324 tinel character" method, one of the characters is chosen as a "sen‐
1325 tinel" and appended at the end of input. However, there is no re‐
1326 striction on where the sentinel character may occur (in fact, any
1327 character can be chosen for a sentinel). When the lexer reads this
1328 character, it additionally performs a bounds check. If the current
1329 position is within bounds, the lexer will resume matching and handle
1330 the sentinel character as a regular one. Otherwise it will try to
1331 get more input with YYFILL (unless YYFILL is disabled). If more in‐
1332 put is available, the lexer will rematch the last character and con‐
1333 tinue as if the sentinel never occurred. Otherwise it is the real
1334 end of input, and the lexer will stop. This method is used if
1335 re2c:eof has non-negative value (it should be set to the ordinal of
1336 the sentinel character). YYFILL must be either defined or disabled
1337 with re2c:yyfill:enable = 0;.
1338
1339
1340
1341 • Bounds checks with padding. This method is the default one. It is
1342 generic, and it is usually faster than the "sentinel character with
1343 bounds checks" method, but also more complex to use. The idea is to
1344 partition the underlying finite-state automaton into strongly con‐
1345 nected components (SCCs), and generate only one bounds check per SCC,
1346 but make it check for multiple characters at once (enough to cover
1347 the longest non-looping path in the SCC). This way the checks are
1348 less frequent, which makes the lexer run much faster. If a check
1349 shows that there is not enough input, the lexer will invoke YYFILL,
1350 which should either supply enough input or not return at all (in
1351 the latter case the lexer will stop). This approach has a problem
1352 with matching short lexemes at the end of input, because the
1353 multi-character check requires enough characters to cover the longest
1354 possible lexeme. To fix this problem, it is necessary to append a
1355 few fake characters at the end of input. The padding should not form
1356 a valid lexeme suffix to avoid fooling the lexer into matching it as
1357 part of the input. The minimum sufficient length of padding is YY‐
1358 MAXFILL and it is autogenerated by re2c with /*!max:re2c*/. This
1359 method is used if re2c:yyfill:enable has the default nonzero value,
1360 and re2c:eof has the default value -1. YYFILL must be defined.
1361
1362
1363
1364 • Custom methods with generic API. Generic API allows one to override
1365 basic operations like reading a character, which makes it possible to
1366 include the end of input checks as part of them. Such methods are
1367 error-prone and should be used with caution, only if other methods
1368 cannot be used. These methods are used if generic API is enabled
1369 with --input custom or re2c:flags:input = custom; and default bounds
1370 checks are disabled with re2c:yyfill:enable = 0;. Note that the use
1371 of generic API does not imply the use of custom methods, it merely
1372 allows it.
1373
1374 The following subsections contain an example of each method.
1375
1376 Sentinel character
1377 In this example the lexer uses a sentinel character to handle the end
1378 of input. The program counts space-separated words in a null-termi‐
1379 nated string. Configuration re2c:yyfill:enable = 0; suppresses the
1380 generation of bounds checks and YYFILL invocations. The sentinel char‐
1381 acter is null. It is the last character of each input string, and it
1382 is not allowed in the middle of a lexeme by any of the rules (in par‐
1383 ticular, it is not included in the character ranges, where it is easy
1384 to overlook). If a null occurs in the middle of a string, it is a syn‐
1385 tax error and the lexer will match default rule *, but it won't read
1386 past the end of input or crash. -Wsentinel-in-midrule warning verifies
1387 that the rules do not allow sentinel in the middle (it is possible to
1388 tell re2c which character is used as a sentinel with re2c:sentinel con‐
1389 figuration --- the default assumption is null, since this is the most
1390 common case).
1391
1392 // re2c $INPUT -o $OUTPUT
1393 #include <assert.h>
1394
1395 // expect a null-terminated string
1396 static int lex(const char *YYCURSOR)
1397 {
1398 int count = 0;
1399 loop:
1400 /*!re2c
1401 re2c:define:YYCTYPE = char;
1402 re2c:yyfill:enable = 0;
1403
1404 * { return -1; }
1405 [\x00] { return count; }
1406 [a-z]+ { ++count; goto loop; }
1407 [ ]+ { goto loop; }
1408
1409 */
1410 }
1411
1412 int main()
1413 {
1414 assert(lex("") == 0);
1415 assert(lex("one two three") == 3);
1416 assert(lex("f0ur") == -1);
1417 return 0;
1418 }
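
To show why the sentinel makes bounds checks unnecessary, here is a hand-written plain C equivalent of the word counter. It is a sketch of the idea, not the code that re2c generates, and the function name count_words is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Counts space-separated lowercase words in a null-terminated string.
 * Returns -1 on any other character. There are no bounds checks: every
 * loop below stops on the terminating null, because null is neither a
 * letter nor a space. */
static int count_words(const char *cursor)
{
    assert(cursor != NULL);
    int count = 0;
    for (;;) {
        char c = *cursor;
        if (c == '\0') {
            return count;                       /* sentinel: end of input */
        } else if (c >= 'a' && c <= 'z') {
            do { ++cursor; } while (*cursor >= 'a' && *cursor <= 'z');
            ++count;
        } else if (c == ' ') {
            do { ++cursor; } while (*cursor == ' ');
        } else {
            return -1;                          /* corresponds to default rule * */
        }
    }
}
```

If a rule accepted null in the middle of a lexeme, one of the inner loops could run past the terminator, which is the situation that -Wsentinel-in-midrule reports.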
1419
1420
1421 Sentinel character with bounds checks
1422 In this example the lexer uses sentinel character with bounds checks to
1423 handle the end of input (this method was added in version 1.2). The
1424 program counts single-quoted strings separated with spaces. The sen‐
1425 tinel character is null, which is specified with re2c:eof = 0; configu‐
1426 ration. Null is the last character of each input string --- this is
1427 essential to detect the end of input. Null, as well as any other char‐
1428 acter, is allowed in the middle of a rule (for example, 'aaa\0aa'\0 is
1429 valid input, but 'aaa\0 is a syntax error). Bounds checks are gener‐
1430 ated in each state that has a switch on an input character, in the con‐
1431 ditional branch that corresponds to null (that branch may also cover
1432 other characters --- re2c does not split out a separate branch for sen‐
1433 tinel, because increasing the number of branches degrades performance
1434 more than bounds checks do). Bounds checks are of the form YYLIMIT <=
1435 YYCURSOR or YYLESSTHAN(1) with generic API. If the current position is
1436 within bounds, the lexer will continue matching. Otherwise the
1437 lexer has reached the end of input, and it should stop. In this
1438 example YYFILL is disabled with re2c:yyfill:enable = 0; and the lexer does
1439 not attempt to get more input (see another example that uses YYFILL in
1440 the YYFILL with sentinel character section). When the end of input has
1441 been reached, there are three possibilities: if the lexer is in the
1442 initial state, it will match the end of input rule $, otherwise it will
1443 either fall back to a previously matched rule (including default rule *)
1444 or go to a default state, causing -Wundefined-control-flow.
1445
1446 // re2c $INPUT -o $OUTPUT
1447 #include <assert.h>
1448
1449 // expect a null-terminated string
1450 static int lex(const char *str, unsigned int len)
1451 {
1452 const char *YYCURSOR = str, *YYLIMIT = str + len, *YYMARKER;
1453 int count = 0;
1454
1455 loop:
1456 /*!re2c
1457 re2c:define:YYCTYPE = char;
1458 re2c:yyfill:enable = 0;
1459 re2c:eof = 0;
1460
1461 * { return -1; }
1462 $ { return count; }
1463 ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
1464 [ ]+ { goto loop; }
1465
1466 */
1467 }
1468
1469 #define TEST(s, r) assert(lex(s, sizeof(s) - 1) == r)
1470 int main()
1471 {
1472 TEST("", 0);
1473 TEST("'qu\0tes' 'are' 'fine: \\'' ", 3);
1474 TEST("'unterminated\\'", -1);
1475 return 0;
1476 }
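
The same logic can be written by hand to show where the bounds checks sit. This is a sketch of the technique, not re2c output; the function name count_quoted is hypothetical. The bounds check is evaluated only when the current character is null, and a null inside the buffer is treated as an ordinary character.

```c
#include <assert.h>
#include <stddef.h>

/* Counts single-quoted strings separated with spaces in str[0..len).
 * Requires str[len] == '\0' (the sentinel). Returns -1 on a syntax error. */
static int count_quoted(const char *str, size_t len)
{
    assert(str != NULL && str[len] == '\0');
    const char *cur = str, *lim = str + len;
    int count = 0;
    for (;;) {
        char c = *cur;
        if (c == '\0' && cur >= lim) return count;  /* end of input, initial state */
        if (c == ' ') { ++cur; continue; }
        if (c != '\'') return -1;                   /* includes a stray null */
        ++cur;                                      /* skip the opening quote */
        for (;;) {
            c = *cur;
            if (c == '\0' && cur >= lim) return -1; /* unterminated string */
            ++cur;
            if (c == '\'') break;                   /* closing quote */
            if (c == '\\') {                        /* escape: skip one character */
                if (*cur == '\0' && cur >= lim) return -1;
                ++cur;
            }
        }
        ++count;
    }
}
```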
1477
1478
1479 Bounds checks with padding
1480 In this example the lexer uses bounds checking with padding to handle
1481 the end of input (it is the default method). The program counts sin‐
1482 gle-quoted strings separated with spaces. There is a padding of YYMAX‐
1483 FILL null characters appended at the end of input, where YYMAXFILL
1484 value is autogenerated with /*!max:re2c*/ directive. It is not neces‐
1485 sary to use null for padding --- any characters can be used, as long as
1486 they do not form a valid lexeme suffix (in this example padding should
1487 not contain single quotes, as they may be mistaken for a suffix of a
1488 single-quoted string). There is a "stop" rule that matches the first
1489 padding character (null) and terminates the lexer (it returns success
1490 only if it has matched at the beginning of padding, otherwise a stray
1491 null is a syntax error). Bounds checks are generated only in some states
1492 that depend on the strongly connected components of the underlying au‐
1493 tomaton. They are of the form (YYLIMIT - YYCURSOR) < n or YY‐
1494 LESSTHAN(n) with generic API, where n is the minimum number of charac‐
1495 ters that are needed for the lexer to proceed (it also means that the
1496 next bounds check will occur in at most n characters). If at least n
1497 characters are available, the lexer will continue matching. Otherwise
1498 the lexer has reached the end of input and will invoke YYFILL(n),
1499 which should either supply at least n input characters or not
1500 return. In this example YYFILL always fails and terminates
1501 the lexer with an error. This is fine, because in this example YYFILL
1502 can only be called when the lexer has advanced into the padding, which
1503 means that it has encountered an unterminated string and should return
1504 a syntax error. See the YYFILL with padding section for an example
1505 that refills the input buffer with YYFILL.
1506
1507 // re2c $INPUT -o $OUTPUT
1508 #include <assert.h>
1509 #include <stdlib.h>
1510 #include <string.h>
1511
1512 /*!max:re2c*/
1513
1514 // expect YYMAXFILL-padded string
1515 static int lex(const char *str, unsigned int len)
1516 {
1517 const char *YYCURSOR = str, *YYLIMIT = str + len + YYMAXFILL;
1518 int count = 0;
1519
1520 loop:
1521 /*!re2c
1522 re2c:api:style = free-form;
1523 re2c:define:YYCTYPE = char;
1524 re2c:define:YYFILL = "return -1;";
1525
1526 * { return -1; }
1527 [\x00] { return YYCURSOR + YYMAXFILL - 1 == YYLIMIT ? count : -1; }
1528 ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
1529 [ ]+ { goto loop; }
1530
1531 */
1532 }
1533
1534 // make a copy of the string with YYMAXFILL zeroes at the end
1535 static void test(const char *str, unsigned int len, int res)
1536 {
1537 char *s = (char*) malloc(len + YYMAXFILL);
1538 memcpy(s, str, len);
1539 memset(s + len, 0, YYMAXFILL);
1540 int r = lex(s, len);
1541 free(s);
1542 assert(r == res);
1543 }
1544
1545 #define TEST(s, r) test(s, sizeof(s) - 1, r)
1546 int main()
1547 {
1548 TEST("", 0);
1549 TEST("'qu\0tes' 'are' 'fine: \\'' ", 3);
1550 TEST("'unterminated\\'", -1);
1551 return 0;
1552 }
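
The effect of padding on a multi-character check can be seen in isolation. In this sketch the helper names pad_copy and less_than are hypothetical, and the padding length is passed as a parameter that stands in for YYMAXFILL (the real value is generated by re2c).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a malloc'ed copy of str[0..len) followed by `padding` null
 * bytes, or NULL on allocation failure. The caller frees the result. */
static char *pad_copy(const char *str, size_t len, size_t padding)
{
    assert(str != NULL && padding > 0);
    char *buf = (char *) malloc(len + padding);
    if (buf == NULL) return NULL;
    memcpy(buf, str, len);
    memset(buf + len, 0, padding);
    return buf;
}

/* Nonzero if fewer than n characters remain before limit; this mirrors
 * the (YYLIMIT - YYCURSOR) < n form of the check. */
static int less_than(const char *cursor, const char *limit, size_t n)
{
    return (size_t) (limit - cursor) < n;
}
```

With a two-character input and a check for three characters, less_than reports a shortage when the limit is the end of the real input, but not when the limit includes three characters of padding. The check starts to fail only after the lexer has advanced into the padding.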
1553
1554
1555 Custom methods with generic API
1556 In this example the lexer uses a custom end of input handling method
1557 based on generic API. The program counts single-quoted strings sepa‐
1558 rated with spaces. It is the same as the sentinel character with
1559 bounds checks example, except that the input is not null-terminated (so
1560 this method can be used if it's not possible to have any padding at
1561 all, not even a single sentinel character). To compensate for the
1562 absence of a sentinel character at the end of input, YYPEEK is redefined to
1563 perform a bounds check before it reads the next input character. This
1564 is inefficient, because checks are done very often. If the check suc‐
1565 ceeds, YYPEEK returns the real character, otherwise it returns a fake
1566 sentinel character.
1567
1568 // re2c $INPUT -o $OUTPUT
1569 #include <assert.h>
1570 #include <stdlib.h>
1571 #include <string.h>
1572
1573 // expect a string without terminating null
1574 static int lex(const char *str, unsigned int len)
1575 {
1576 const char *cur = str, *lim = str + len, *mar;
1577 int count = 0;
1578
1579 loop:
1580 /*!re2c
1581 re2c:yyfill:enable = 0;
1582 re2c:eof = 0;
1583 re2c:flags:input = custom;
1584 re2c:api:style = free-form;
1585 re2c:define:YYCTYPE = char;
1586 re2c:define:YYLESSTHAN = "cur >= lim";
1587 re2c:define:YYPEEK = "cur < lim ? *cur : 0"; // fake null
1588 re2c:define:YYSKIP = "++cur;";
1589 re2c:define:YYBACKUP = "mar = cur;";
1590 re2c:define:YYRESTORE = "cur = mar;";
1591
1592 * { return -1; }
1593 $ { return count; }
1594 ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
1595 [ ]+ { goto loop; }
1596
1597 */
1598 }
1599
1600 // make a copy of the string without terminating null
1601 static void test(const char *str, unsigned int len, int res)
1602 {
1603 char *s = (char*) malloc(len);
1604 memcpy(s, str, len);
1605 int r = lex(s, len);
1606 free(s);
1607 assert(r == res);
1608 }
1609
1610 #define TEST(s, r) test(s, sizeof(s) - 1, r)
1611 int main()
1612 {
1613 TEST("", 0);
1614 TEST("'qu\0tes' 'are' 'fine: \\'' ", 3);
1615 TEST("'unterminated\\'", -1);
1616 return 0;
1617 }
1618
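To see the effect of the bounds-checked read in isolation, here is a minimal hand-written sketch in plain C (it is not re2c output). The names peek and count_quoted are illustrative assumptions: peek plays the role of the redefined YYPEEK, and the cur >= lim comparisons play the role of YYLESSTHAN.

```c
#include <assert.h>
#include <stddef.h>

// Bounds-checked read: the real character inside the buffer, a fake 0 past it.
static char peek(const char *cur, const char *lim)
{
    return cur < lim ? *cur : 0;
}

// Hand-written stand-in for the generated lexer (an assumption, not re2c
// output): counts single-quoted strings separated by spaces and returns -1
// on a syntax error. The input does not need a terminating null.
static int count_quoted(const char *str, size_t len)
{
    const char *cur = str, *lim = str + len;
    int count = 0;

    assert(str != NULL);
    for (;;) {
        char c = peek(cur, lim);
        if (c == 0 && cur >= lim) return count;  // fake sentinel: end of input
        if (c == ' ') { ++cur; continue; }       // skip a separator
        if (c != '\'') return -1;                // includes a real 0 outside quotes
        ++cur;                                   // skip the opening quote
        for (;;) {
            c = peek(cur, lim);
            if (c == 0 && cur >= lim) return -1; // input ended inside a string
            ++cur;
            if (c == '\'') break;                // closing quote
            if (c == '\\') {                     // escape: consume one more character
                if (cur >= lim) return -1;
                ++cur;
            }
        }
        ++count;
    }
}
```

A zero byte is treated as the end of input only when the cursor has also reached the limit, so a real null inside a quoted string is still accepted, as in the re2c example above.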
1619
1621 The need for buffering arises when the input cannot be mapped in memory
1622 all at once: either it is too large, or it comes in a streaming fashion
1623 (like reading from a socket). The usual technique in such cases is to
1624 allocate a fixed-sized memory buffer and process input in chunks that
1625 fit into the buffer. When the current chunk is processed, it is moved
1626 out and new data is moved in. In practice it is somewhat more complex,
1627 because lexer state consists not of a single input position, but a set
1628 of interrelated positions:
1629
1630 • cursor: the next input character to be read (YYCURSOR in default API
1631 or YYSKIP/YYPEEK in generic API)
1632
1633 • limit: the position after the last available input character (YYLIMIT
1634 in default API, implicitly handled by YYLESSTHAN in generic API)
1635
1636 • marker: the position of the most recent match, if any (YYMARKER in
1637 default API or YYBACKUP/YYRESTORE in generic API)
1638
1639 • token: the start of the current lexeme (implicit in re2c API, as it
1640 is not needed for the normal lexer operation and can be defined and
1641 updated by the user)
1642
1643 • context marker: the position of the trailing context (YYCTXMARKER in
1644 default API or YYBACKUPCTX/YYRESTORECTX in generic API)
1645
1646 • tag variables: submatch positions (defined with /*!stags:re2c*/ and
1647 /*!mtags:re2c*/ directives and YYSTAGP/YYSTAGN/YYMTAGP/YYMTAGN in
1648 generic API)
1649
1650 Not all these are used in every case, but if used, they must be updated
1651 by YYFILL. All active positions are contained in the segment between
1652 token and cursor, therefore everything between buffer start and token
1653 can be discarded, the segment from token and up to limit should be
1654 moved to the beginning of buffer, and the free space at the end of buf‐
1655 fer should be filled with new data. In order to avoid frequent YYFILL
1656 calls it is best to fill in as many input characters as possible (even
1657 though fewer characters might suffice to resume the lexer). The details
1658 of YYFILL implementation are slightly different depending on which EOF
1659 handling method is used: the case of EOF rule is somewhat simpler than
1660 the case of bounds-checking with padding. Also note that if -f
1661 --storable-state option is used, YYFILL has slightly different seman‐
1662 tics (described in the section about storable state).
1663
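The shift-and-refill step described above can be sketched in plain C as follows. This is only an illustration under simplifying assumptions: Buffer, buffer_init and buffer_refill are hypothetical names, the input comes from an in-memory string instead of a file, and only token, cursor and limit are tracked (marker, context marker and tag variables would be shifted in the same way).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CAP = 8 }; // tiny capacity, chosen only to make the shift easy to follow

typedef struct {
    char buf[CAP + 1];     // +1 leaves room for a 0 right after limit
    char *tok, *cur, *lim; // token, cursor, limit
    const char *src;       // hypothetical in-memory input source
    size_t src_len, src_pos;
} Buffer;

static void buffer_init(Buffer *b, const char *src, size_t src_len)
{
    // An empty buffer: all positions sit at the end, so the first refill
    // treats the whole buffer as discardable.
    b->tok = b->cur = b->lim = b->buf + CAP;
    b->buf[CAP] = 0;
    b->src = src;
    b->src_len = src_len;
    b->src_pos = 0;
}

// Returns the number of characters added, or 0 if nothing could be added
// (the current lexeme fills the buffer, or the source is exhausted).
static size_t buffer_refill(Buffer *b)
{
    const size_t shift = (size_t)(b->tok - b->buf); // discardable prefix
    const size_t used = (size_t)(b->lim - b->tok);  // segment that must be kept
    size_t n = b->src_len - b->src_pos;             // what the source still has

    assert(b->buf <= b->tok && b->tok <= b->cur && b->cur <= b->lim);
    if (n > (size_t)CAP - used) n = (size_t)CAP - used; // fill as much as fits
    if (n == 0) return 0;

    memmove(b->buf, b->tok, used); // move [token, limit) to the buffer start
    b->tok -= shift;
    b->cur -= shift;
    b->lim -= shift;
    memcpy(b->lim, b->src + b->src_pos, n); // fill the freed space
    b->src_pos += n;
    b->lim += n;
    *b->lim = 0;
    return n;
}
```

The sketch fills all the free space on every call rather than the minimum needed, in line with the advice above to avoid frequent YYFILL calls.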
1664 YYFILL with sentinel character
1665 If EOF rule is used, YYFILL is a function-like primitive that accepts
1666 no arguments and returns a value which is checked against zero. YYFILL
1667 invocation is triggered by condition YYLIMIT <= YYCURSOR in default API
1668 and YYLESSTHAN() in generic API. A non-zero return value means that YY‐
1669 FILL has failed. A successful YYFILL call must supply at least one
1670 character and adjust input positions accordingly. Limit must always be
1671 set to one after the last input position in buffer, and the character
1672 at the limit position must be the sentinel symbol specified by re2c:eof
1673 configuration. The pictures below show the relative locations of input
1674 positions in buffer before and after YYFILL call (sentinel symbol is
1675 marked with #, and the second picture shows the case when there is not
1676 enough input to fill the whole buffer).
1677
1678 <-- shift -->
1679 >-A------------B---------C-------------D#-----------E->
1680 buffer token marker limit,
1681 cursor
1682 >-A------------B---------C-------------D------------E#->
1683 buffer, marker cursor limit
1684 token
1685
1686 <-- shift -->
1687 >-A------------B---------C-------------D#--E (EOF)
1688 buffer token marker limit,
1689 cursor
1690 >-A------------B---------C-------------D---E#........
1691 buffer, marker cursor limit
1692 token
1693
1694 Here is an example of a program that reads an input file (named input)
1695 in chunks of up to 4096 bytes and uses the EOF rule.
1696
1697 // re2c $INPUT -o $OUTPUT
1698 #include <assert.h>
1699 #include <stdio.h>
1700 #include <string.h>
1701
1702 #define SIZE 4096
1703
1704 typedef struct {
1705 FILE *file;
1706 char buf[SIZE + 1], *lim, *cur, *mar, *tok;
1707 int eof;
1708 } Input;
1709
1710 static int fill(Input *in)
1711 {
1712 if (in->eof) {
1713 return 1;
1714 }
1715 const size_t free = in->tok - in->buf;
1716 if (free < 1) {
1717 return 2;
1718 }
1719 memmove(in->buf, in->tok, in->lim - in->tok);
1720 in->lim -= free;
1721 in->cur -= free;
1722 in->mar -= free;
1723 in->tok -= free;
1724 in->lim += fread(in->lim, 1, free, in->file);
1725 in->lim[0] = 0;
1726 in->eof |= in->lim < in->buf + SIZE;
1727 return 0;
1728 }
1729
1730 static void init(Input *in, FILE *file)
1731 {
1732 in->file = file;
1733 in->cur = in->mar = in->tok = in->lim = in->buf + SIZE;
1734 in->eof = 0;
1735 fill(in);
1736 }
1737
1738 static int lex(Input *in)
1739 {
1740 int count = 0;
1741 loop:
1742 in->tok = in->cur;
1743 /*!re2c
1744 re2c:eof = 0;
1745 re2c:api:style = free-form;
1746 re2c:define:YYCTYPE = char;
1747 re2c:define:YYCURSOR = in->cur;
1748 re2c:define:YYMARKER = in->mar;
1749 re2c:define:YYLIMIT = in->lim;
1750 re2c:define:YYFILL = "fill(in) == 0";
1751
1752 * { return -1; }
1753 $ { return count; }
1754 ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
1755 [ ]+ { goto loop; }
1756
1757 */
1758 }
1759
1760 int main()
1761 {
1762 const char *fname = "input";
1763 const char str[] = "'qu\0tes' 'are' 'fine: \\'' ";
1764 FILE *f;
1765 Input in;
1766
1767 // prepare input file: a few times the size of the buffer,
1768 // containing strings with zeroes and escaped quotes
1769 f = fopen(fname, "w");
1770 for (int i = 0; i < SIZE; ++i) {
1771 fwrite(str, 1, sizeof(str) - 1, f);
1772 }
1773 fclose(f);
1774
1775 f = fopen(fname, "r");
1776 init(&in, f);
1777 assert(lex(&in) == SIZE * 3);
1778 fclose(f);
1779
1780 remove(fname);
1781 return 0;
1782 }
1783
1784
1785 YYFILL with padding
1786 In the default case (when EOF rule is not used) YYFILL is a func‐
1787 tion-like primitive that accepts a single argument and does not return
1788 any value. YYFILL invocation is triggered by condition (YYLIMIT - YY‐
1789 CURSOR) < n in default API and YYLESSTHAN(n) in generic API. The argu‐
1790 ment passed to YYFILL is the minimal number of characters that must be
1791 supplied. If it fails to do so, YYFILL must not return to the lexer
1792 (for that reason it is best implemented as a macro that returns from
1793 the calling function on failure). In case of a successful YYFILL invo‐
1794 cation the limit position must be set either to one after the last in‐
1795 put position in buffer, or to the end of YYMAXFILL padding (in case YY‐
1796 FILL has successfully read at least n characters, but not enough to
1797 fill the entire buffer). The pictures below show the relative locations
1798 of input positions in buffer before and after YYFILL invocation (YYMAX‐
1799 FILL padding on the second picture is marked with # symbols).
1800
1801 <-- shift --> <-- need -->
1802 >-A------------B---------C-----D-------E---F--------G->
1803 buffer token marker cursor limit
1804
1805 >-A------------B---------C-----D-------E---F--------G->
1806 buffer, marker cursor limit
1807 token
1808
1809 <-- shift --> <-- need -->
1810 >-A------------B---------C-----D-------E-F (EOF)
1811 buffer token marker cursor limit
1812
1813 >-A------------B---------C-----D-------E-F###############
1814 buffer, marker cursor limit
1815 token <- YYMAXFILL ->
1816
1817 Here is an example of a program that reads an input file (named input)
1818 in chunks of up to 4096 bytes and uses bounds-checking with padding.
1819
1820 // re2c $INPUT -o $OUTPUT
1821 #include <assert.h>
1822 #include <stdio.h>
1823 #include <string.h>
1824
1825 /*!max:re2c*/
1826 #define SIZE 4096
1827
1828 typedef struct {
1829 FILE *file;
1830 char buf[SIZE + YYMAXFILL], *lim, *cur, *mar, *tok;
1831 int eof;
1832 } Input;
1833
1834 static int fill(Input *in, size_t need)
1835 {
1836 if (in->eof) {
1837 return 1;
1838 }
1839 const size_t free = in->tok - in->buf;
1840 if (free < need) {
1841 return 2;
1842 }
1843 memmove(in->buf, in->tok, in->lim - in->tok);
1844 in->lim -= free;
1845 in->cur -= free;
1846 in->mar -= free;
1847 in->tok -= free;
1848 in->lim += fread(in->lim, 1, free, in->file);
1849 if (in->lim < in->buf + SIZE) {
1850 in->eof = 1;
1851 memset(in->lim, 0, YYMAXFILL);
1852 in->lim += YYMAXFILL;
1853 }
1854 return 0;
1855 }
1856
1857 static void init(Input *in, FILE *file)
1858 {
1859 in->file = file;
1860 in->cur = in->mar = in->tok = in->lim = in->buf + SIZE;
1861 in->eof = 0;
1862 fill(in, 1);
1863 }
1864
1865 static int lex(Input *in)
1866 {
1867 int count = 0;
1868 loop:
1869 in->tok = in->cur;
1870 /*!re2c
1871 re2c:api:style = free-form;
1872 re2c:define:YYCTYPE = char;
1873 re2c:define:YYCURSOR = in->cur;
1874 re2c:define:YYMARKER = in->mar;
1875 re2c:define:YYLIMIT = in->lim;
1876 re2c:define:YYFILL = "if (fill(in, @@) != 0) return -1;";
1877
1878 * { return -1; }
1879 [\x00] { return (in->lim - in->cur == YYMAXFILL - 1) ? count : -1; }
1880 ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
1881 [ ]+ { goto loop; }
1882
1883 */
1884 }
1885
1886 int main()
1887 {
1888 const char *fname = "input";
1889 const char str[] = "'qu\0tes' 'are' 'fine: \\'' ";
1890 FILE *f;
1891 Input in;
1892
1893 // prepare input file: a few times the size of the buffer,
1894 // containing strings with zeroes and escaped quotes
1895 f = fopen(fname, "w");
1896 for (int i = 0; i < SIZE; ++i) {
1897 fwrite(str, 1, sizeof(str) - 1, f);
1898 }
1899 fclose(f);
1900
1901 f = fopen(fname, "r");
1902 init(&in, f);
1903 assert(lex(&in) == SIZE * 3);
1904 fclose(f);
1905
1906 remove(fname);
1907 return 0;
1908 }
1909
1910
1912 re2c allows one to include other files using directive /*!include:re2c
1913 FILE */, where FILE is the name of file to be included. re2c looks for
1914 included files in the directory of the including file and in include
1915 locations, which can be specified with -I option. Include directives
1916 in re2c work in the same way as C/C++ #include: the contents of FILE
1917 are copy-pasted verbatim in place of the directive. Include files may
1918 have further includes of their own. Use --depfile option to track build
1919 dependencies of the output file on include files. re2c provides some
1920 predefined include files that can be found in the include/ subdirectory
1921 of the project. These files contain definitions that can be useful to
1922 other projects (such as Unicode categories) and form something like a
1923 standard library for re2c. Below is an example of using include direc‐
1924 tive.
1925
1926 Include file (definitions.h)
1927 typedef enum { OK, FAIL } Result;
1928
1929 /*!re2c
1930 number = [1-9][0-9]*;
1931 */
1932
1933
1934 Input file
1935 // re2c $INPUT -o $OUTPUT -i
1936 #include <assert.h>
1937 /*!include:re2c "definitions.h" */
1938
1939 Result lex(const char *YYCURSOR)
1940 {
1941 /*!re2c
1942 re2c:define:YYCTYPE = char;
1943 re2c:yyfill:enable = 0;
1944
1945 number { return OK; }
1946 * { return FAIL; }
1947 */
1948 }
1949
1950 int main()
1951 {
1952 assert(lex("123") == OK);
1953 return 0;
1954 }
1955
1956
1958 Re2c allows one to generate a header file from the input .re file using
1959 option -t, --type-header or configuration re2c:flags:type-header and
1960 directives /*!header:re2c:on*/ and /*!header:re2c:off*/. The first di‐
1961 rective marks the beginning of header file, and the second directive
1962 marks the end of it. Everything between these directives is processed
1963 by re2c, and the generated code is written to the file specified by the
1964 -t --type-header option (or stdout if this option was not used). Auto‐
1965 generated header file may be needed in cases when re2c is used to gen‐
1966 erate definitions of constants, variables and structs that must be vis‐
1967 ible from other translation units.
1968
1969 Here is an example of generating a header file that contains the
1970 definition of the lexer state with tag variables (the number of
1971 variables depends on the regular grammar and is unknown to the programmer).
1972
1973 Input file
1974 // re2c $INPUT -o $OUTPUT -i --type-header src/lexer/lexer.h
1975 #include <assert.h>
1976 #include "src/lexer/lexer.h" // generated by re2c
1977
1978 /*!header:re2c:on*/
1979
1980 typedef struct {
1981 const char *str, *cur, *mar;
1982 /*!stags:re2c format = "const char *@@{tag}; "; */
1983 } LexerState;
1984
1985 /*!header:re2c:off*/
1986
1987 int lex(LexerState *st)
1988 {
1989 /*!re2c
1990 re2c:flags:type-header = "src/lexer/lexer.h";
1991 re2c:yyfill:enable = 0;
1992 re2c:flags:tags = 1;
1993 re2c:define:YYCTYPE = char;
1994 re2c:define:YYCURSOR = "st->cur";
1995 re2c:define:YYMARKER = "st->mar";
1996 re2c:tags:expression = "st->@@{tag}";
1997
1998 [x]{1,4} / [x]{3,5} { return 0; } // ambiguous trailing context
1999 * { return 1; }
2000 */
2001 }
2002
2003 int main()
2004 {
2005 LexerState st;
2006 st.str = st.cur = "xxxxxxxx";
2007 assert(lex(&st) == 0 && st.cur - st.str == 4);
2008 return 0;
2009 }
2010
2011
2012 Header file
2013 /* Generated by re2c */
2014
2015
2016 typedef struct {
2017 const char *str, *cur, *mar;
2018 const char *yyt1; const char *yyt2; const char *yyt3;
2019 } LexerState;
2020
2021
2022
2024 Re2c has two options for submatch extraction.
2025
2026 The first option is -T --tags. With this option one can use standalone
2027 tags of the form @stag and #mtag, where stag and mtag are arbitrary
2028 user-defined names. Tags can be used anywhere inside of a regular ex‐
2029 pression; semantically they are just position markers. Tags of the form
2030 @stag are called s-tags: they denote a single submatch value (the last
2031 input position where this tag matched). Tags of the form #mtag are
2032 called m-tags: they denote multiple submatch values (the whole history
2033 of repetitions of this tag). All tags should be defined by the user as
2034 variables with the corresponding names. With standalone tags re2c uses
2035 leftmost greedy disambiguation: submatch positions correspond to the
2036 leftmost matching path through the regular expression.
2037
2038 The second option is -P --posix-captures: it enables POSIX-compliant
2039 capturing groups. In this mode parentheses in regular expressions de‐
2040 note the beginning and the end of capturing groups; the whole regular
2041 expression is group number zero. The number of groups for the matching
2042 rule is stored in a variable yynmatch, and submatch results are stored
2043 in yypmatch array. Both yynmatch and yypmatch should be defined by the
2044 user, and yypmatch size must be at least [yynmatch * 2]. Re2c provides
2045 a directive /*!maxnmatch:re2c*/ that defines YYMAXNMATCH: a constant
2046 equal to the maximal value of yynmatch among all rules. Note that re2c
2047 implements POSIX-compliant disambiguation: each subexpression matches
2048 as long as possible, and subexpressions that start earlier in regular
2049 expression have priority over those starting later. Capturing groups
2050 are translated into s-tags under the hood, therefore we use the word
2051 "tag" to describe them as well.
2052
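As a small illustration of the yypmatch layout, the sketch below copies one capturing group out of the array. It assumes pointer-valued positions (default API), where group i occupies yypmatch[2 * i] (start) and yypmatch[2 * i + 1] (end) and a group that did not match holds NULL. The helper group_copy is hypothetical, not something provided by re2c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Copies capturing group `i` into `out` as a 0-terminated string and returns
// its length, or -1 if there is no such group, the group did not match, or
// `out` is too small.
static int group_copy(const char **yypmatch, unsigned yynmatch, unsigned i,
                      char *out, size_t outsz)
{
    assert(yypmatch != NULL && out != NULL);
    if (i >= yynmatch) return -1;

    const char *s = yypmatch[2 * i];       // start of group i
    const char *e = yypmatch[2 * i + 1];   // one past its last character
    if (s == NULL || e == NULL) return -1; // assumed default value: no match

    size_t len = (size_t)(e - s);
    if (len + 1 > outsz) return -1;
    memcpy(out, s, len);
    out[len] = 0;
    return (int)len;
}
```

In a semantic action this could be called as group_copy(yypmatch, yynmatch, 1, buf, sizeof(buf)) to get the text of the first parenthesized group. Group 0 always covers the whole match.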
2053 With both -P --posix-captures and -T --tags options re2c uses an efficient
2054 submatch extraction algorithm described in the Tagged Deterministic Fi‐
2055 nite Automata with Lookahead paper. The overhead on submatch extraction
2056 in the generated lexer grows with the number of tags --- if this number
2057 is moderate, the overhead is barely noticeable. In the lexer tags are
2058 implemented using a number of tag variables generated by re2c. There is
2059 no one-to-one correspondence between tag variables and tags: a single
2060 variable may be reused for different tags, and one tag may require mul‐
2061 tiple variables to hold all its ambiguous values. Eventually ambiguity
2062 is resolved, and only one final variable per tag survives. When a rule
2063 matches, all its tags are set to the values of the corresponding tag
2064 variables. The exact number of tag variables is unknown to the user;
2065 this number is determined by re2c. However, tag variables should be de‐
2066 fined by the user as a part of the lexer state and updated by YYFILL,
2067 therefore re2c provides directives /*!stags:re2c*/ and /*!mtags:re2c*/
2068 that can be used to declare, initialize and manipulate tag variables.
2069 These directives have two optional configurations: format = "@@";
2070 (specifies the template where @@ is substituted with the name of each
2071 tag variable), and separator = ""; (specifies the piece of code used to
2072 join the generated pieces for different tag variables).
2073
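To make the format configuration more concrete, here is a hand-written sketch of what such directives might expand to for a lexer with three tag variables. The names yyt1, yyt2, yyt3 and their count are assumptions for illustration (re2c chooses the real ones), and TagState and shift_tags are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    char buf[16];
    // Roughly what a directive with format = "const char *@@;" might emit:
    // one declaration per tag variable, joined by the (empty) separator.
    const char *yyt1; const char *yyt2; const char *yyt3;
} TagState;

// Roughly what a directive with format = "if (in->@@) in->@@ -= shift;"
// might emit inside YYFILL: every tag variable that holds a position is
// moved together with the buffer contents, NULL values are left alone.
static void shift_tags(TagState *in, size_t shift)
{
    assert(in != NULL);
    if (in->yyt1) in->yyt1 -= shift;
    if (in->yyt2) in->yyt2 -= shift;
    if (in->yyt3) in->yyt3 -= shift;
}
```

Because the template is instantiated once per tag variable, the same directive can be reused for declaration, initialization and shifting without the user knowing how many variables there are.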
2074 S-tags support the following operations:
2075
2076 • save input position to an s-tag: t = YYCURSOR with default API or a
2077 user-defined operation YYSTAGP(t) with generic API
2078
2079 • save default value to an s-tag: t = NULL with default API or a
2080 user-defined operation YYSTAGN(t) with generic API
2081
2082 • copy one s-tag to another: t1 = t2
2083
2084 M-tags support the following operations:
2085
2086 • append input position to an m-tag: a user-defined operation YYM‐
2087 TAGP(t) with both default and generic API
2088
2089 • append default value to an m-tag: a user-defined operation YYMTAGN(t)
2090 with both default and generic API
2091
2092 • copy one m-tag to another: t1 = t2
2093
2094 S-tags can be implemented as scalar values (pointers or offsets).
2095 M-tags need a more complex representation, as they need to store a se‐
2096 quence of tag values. The most naive and inefficient representation of
2097 an m-tag is a list (array, vector) of tag values; a more efficient rep‐
2098 resentation is to store all m-tags in a prefix-tree represented as ar‐
2099 ray of nodes (v, p), where v is tag value and p is a pointer to parent
2100 node.
2101
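The prefix-tree representation of m-tags can be sketched in plain C as follows. This is a minimal illustration with a fixed-size node array. The names Node, Tree, mtag_append and mtag_history are hypothetical, and the append function corresponds to what a user-defined YYMTAGP or YYMTAGN would do.

```c
#include <assert.h>
#include <stddef.h>

enum { ROOT = -1, MAX_NODES = 64 };

typedef struct { int parent; const char *value; } Node; // the (v, p) pair
typedef struct { Node nodes[MAX_NODES]; int size; } Tree;

// Append `value` to the history of m-tag `*t`: add a node whose parent is
// the old head and make it the new head. Copying an m-tag is just t1 = t2,
// because both then point to the same shared history.
static void mtag_append(Tree *tree, int *t, const char *value)
{
    assert(tree->size < MAX_NODES);
    tree->nodes[tree->size].parent = *t;
    tree->nodes[tree->size].value = value;
    *t = tree->size++;
}

// Write the history of m-tag `t` into `out` in chronological order.
// Returns the number of values, or -1 if `out` (capacity `cap`) is too small.
static int mtag_history(const Tree *tree, int t, const char **out, int cap)
{
    int len = 0, i, k;
    for (i = t; i != ROOT; i = tree->nodes[i].parent) ++len;
    if (len > cap) return -1;
    k = len;
    for (i = t; i != ROOT; i = tree->nodes[i].parent) {
        out[--k] = tree->nodes[i].value; // walk back from the newest value
    }
    return len;
}
```

Each append costs one node regardless of how long the history is, which is why this layout is cheaper than copying a whole list of values whenever one m-tag is copied to another.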
2102 Here is a simple example of using s-tags to parse an IPv4 address (see
2103 below for a more complex example that uses YYFILL).
2104
2105 // re2c $INPUT -o $OUTPUT
2106 #include <assert.h>
2107 #include <stdint.h>
2108
2109 static uint32_t num(const char *s, const char *e)
2110 {
2111 uint32_t n = 0;
2112 for (; s < e; ++s) n = n * 10 + (*s - '0');
2113 return n;
2114 }
2115
2116 static const uint64_t ERROR = ~0lu;
2117
2118 static uint64_t lex(const char *YYCURSOR)
2119 {
2120 const char *YYMARKER, *o1, *o2, *o3, *o4;
2121 /*!stags:re2c format = 'const char *@@;'; */
2122
2123 /*!re2c
2124 re2c:yyfill:enable = 0;
2125 re2c:flags:tags = 1;
2126 re2c:define:YYCTYPE = char;
2127
2128 octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
2129 dot = [.];
2130 end = [\x00];
2131
2132 @o1 octet dot @o2 octet dot @o3 octet dot @o4 octet end {
2133 return num(o4, YYCURSOR - 1)
2134 + (num(o3, o4 - 1) << 8)
2135 + (num(o2, o3 - 1) << 16)
2136 + (num(o1, o2 - 1) << 24);
2137 }
2138 * { return ERROR; }
2139 */
2140 }
2141
2142 int main()
2143 {
2144 assert(lex("1.2.3.4") == 0x01020304);
2145 assert(lex("127.0.0.1") == 0x7f000001);
2146 assert(lex("255.255.255.255") == 0xffffffff);
2147 assert(lex("1.2.3.") == ERROR);
2148 assert(lex("1.2.3.256") == ERROR);
2149 return 0;
2150 }
2151
2152
2153 Here is a more complex example of using s-tags with YYFILL to parse a
2154 file with IPv4 addresses. Tag variables are part of the lexer state,
2155 and they are adjusted in YYFILL like other input positions. Note that
2156 it is necessary for s-tags because their values are invalidated after
2157 shifting buffer contents. It may not be necessary in a custom implemen‐
2158 tation where tag variables store offsets relative to the start of the
2159 input string rather than buffer, which may be the case with m-tags.
2160
2161 // re2c $INPUT -o $OUTPUT --tags
2162 #include <assert.h>
2163 #include <stdint.h>
2164 #include <stdio.h>
2165 #include <string.h>
2166 #include <vector>
2167
2168 #define SIZE 4096
2169
2170 typedef struct {
2171 FILE *file;
2172 char buf[SIZE + 1], *lim, *cur, *mar, *tok;
2173 // Tag variables must be part of the lexer state passed to YYFILL.
2174 // They don't correspond to tags and should be autogenerated by re2c.
2175 /*!stags:re2c format = 'const char *@@;'; */
2176 int eof;
2177 } Input;
2178
2179 static int fill(Input *in)
2180 {
2181 if (in->eof) return 1;
2182
2183 const size_t free = in->tok - in->buf;
2184 if (free < 1) return 2;
2185
2186 memmove(in->buf, in->tok, in->lim - in->tok);
2187
2188 in->lim -= free;
2189 in->cur -= free;
2190 in->mar -= free;
2191 in->tok -= free;
2192 // Tag variables need to be shifted like other input positions. The check
2193 // for non-NULL is only needed if some tags are nested inside of alternative
2194 // or repetition, so that they can have NULL value.
2195 /*!stags:re2c format = "if (in->@@) in->@@ -= free;"; */
2196
2197 in->lim += fread(in->lim, 1, free, in->file);
2198 in->lim[0] = 0;
2199 in->eof |= in->lim < in->buf + SIZE;
2200
2201 return 0;
2202 }
2203
2204 static void init(Input *in, FILE *file)
2205 {
2206 in->file = file;
2207 in->cur = in->mar = in->tok = in->lim = in->buf + SIZE;
2208 // Initialization is only needed to avoid "use of uninitialized" warnings
2209 // when shifting tags in YYFILL. In the lexer tags are guaranteed to be
2210 // set before they are used (either to a valid input position, or NULL).
2211 /*!stags:re2c format = "in->@@ = in->lim;"; */
2212 in->eof = 0;
2213 fill(in);
2214 }
2215
2216 static uint32_t num(const char *s, const char *e)
2217 {
2218 uint32_t n = 0;
2219 for (; s < e; ++s) n = n * 10 + (*s - '0');
2220 return n;
2221 }
2222
2223 static bool lex(Input *in, std::vector<uint32_t> &ips)
2224 {
2225 // User-defined local variables that store final tag values.
2226 // They are different from tag variables autogenerated with /*!stags:re2c*/,
2227 // as they are set at the end of match and used only in semantic actions.
2228 const char *o1, *o2, *o3, *o4;
2229 loop:
2230 in->tok = in->cur;
2231 /*!re2c
2232 re2c:eof = 0;
2233 re2c:api:style = free-form;
2234 re2c:define:YYCTYPE = char;
2235 re2c:define:YYCURSOR = in->cur;
2236 re2c:define:YYMARKER = in->mar;
2237 re2c:define:YYLIMIT = in->lim;
2238 re2c:define:YYFILL = "fill(in) == 0";
2239
2240 // The way tag variables are accessed from the lexer (not needed if tag
2241 // variables are defined as local variables).
2242 re2c:tags:expression = "in->@@";
2243
2244 octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
2245 dot = [.];
2246 eol = [\n];
2247
2248 @o1 octet dot @o2 octet dot @o3 octet dot @o4 octet eol {
2249 ips.push_back(num(o4, in->cur - 1)
2250 + (num(o3, o4 - 1) << 8)
2251 + (num(o2, o3 - 1) << 16)
2252 + (num(o1, o2 - 1) << 24));
2253 goto loop;
2254 }
2255 $ { return true; }
2256 * { return false; }
2257 */
2258 }
2259
2260 int main()
2261 {
2262 const char *fname = "input";
2263 FILE *f;
2264 Input in;
2265 std::vector<uint32_t> have, want;
2266
2267 // Write a few IPv4 addresses to the input file and save them to compare
2268 // against parse results.
2269 f = fopen(fname, "w");
2270 for (int i = 0; i < 256; ++i) {
2271 fprintf(f, "%d.%d.%d.%d\n", i, i, i, i);
2272 want.push_back(i + (i << 8) + (i << 16) + (i << 24));
2273 }
2274 fclose(f);
2275
2276 f = fopen(fname, "r");
2277 init(&in, f);
2278
2279 assert(lex(&in, have) && have == want);
2280
2281 fclose(f);
2282 remove(fname);
2283 return 0;
2284 }
2285
2286
2287 Here is an example of using POSIX capturing groups to parse an IPv4 ad‐
2288 dress.
2289
2290 // re2c $INPUT -o $OUTPUT
2291 #include <assert.h>
2292 #include <stdint.h>
2293
2294 static uint32_t num(const char *s, const char *e)
2295 {
2296 uint32_t n = 0;
2297 for (; s < e; ++s) n = n * 10 + (*s - '0');
2298 return n;
2299 }
2300
2301 /*!maxnmatch:re2c*/
2302 static const uint64_t ERROR = ~0lu;
2303
2304 static uint64_t lex(const char *YYCURSOR)
2305 {
2306 const char *YYMARKER;
2307 const char *yypmatch[YYMAXNMATCH * 2];
2308 uint32_t yynmatch;
2309 /*!stags:re2c format = 'const char *@@;'; */
2310
2311 /*!re2c
2312 re2c:yyfill:enable = 0;
2313 re2c:flags:posix-captures = 1;
2314 re2c:define:YYCTYPE = char;
2315
2316 octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
2317 dot = [.];
2318 end = [\x00];
2319
2320 (octet) dot (octet) dot (octet) dot (octet) end {
2321 assert(yynmatch == 5);
2322 return num(yypmatch[8], yypmatch[9])
2323 + (num(yypmatch[6], yypmatch[7]) << 8)
2324 + (num(yypmatch[4], yypmatch[5]) << 16)
2325 + (num(yypmatch[2], yypmatch[3]) << 24);
2326 }
2327 * { return ERROR; }
2328 */
2329 }
2330
2331 int main()
2332 {
2333 assert(lex("1.2.3.4") == 0x01020304);
2334 assert(lex("127.0.0.1") == 0x7f000001);
2335 assert(lex("255.255.255.255") == 0xffffffff);
2336 assert(lex("1.2.3.") == ERROR);
2337 assert(lex("1.2.3.256") == ERROR);
2338 return 0;
2339 }
2340
2341
2342 Here is an example of using m-tags to parse a semicolon-separated se‐
2343 quence of words (C++). Tag variables are stored in a tree that is
2344 packed in a vector.
2345
2346 // re2c $INPUT -o $OUTPUT
2347 #include <assert.h>
2348 #include <vector>
2349 #include <string>
2350
2351 static const int ROOT = -1;
2352
2353 struct Mtag {
2354 int pred;
2355 const char *tag;
2356 };
2357
2358 typedef std::vector<Mtag> MtagTree;
2359 typedef std::vector<std::string> Words;
2360
2361 static void mtag(int *pt, const char *t, MtagTree *tree)
2362 {
2363 Mtag m = {*pt, t};
2364 *pt = (int)tree->size();
2365 tree->push_back(m);
2366 }
2367
2368 static void unfold(const MtagTree &tree, int x, int y, Words &words)
2369 {
2370 if (x == ROOT) return;
2371 unfold(tree, tree[x].pred, tree[y].pred, words);
2372 const char *px = tree[x].tag, *py = tree[y].tag;
2373 words.push_back(std::string(px, py - px));
2374 }
2375
2376 #define YYMTAGP(t) mtag(&t, YYCURSOR, &tree)
2377 #define YYMTAGN(t) mtag(&t, NULL, &tree)
2378 static bool lex(const char *YYCURSOR, Words &words)
2379 {
2380 const char *YYMARKER;
2381 /*!mtags:re2c format = "int @@ = ROOT;"; */
2382 MtagTree tree;
2383 int x, y;
2384
2385 /*!re2c
2386 re2c:yyfill:enable = 0;
2387 re2c:flags:tags = 1;
2388 re2c:define:YYCTYPE = char;
2389
2390 (#x [a-z]+ #y [;])+ {
2391 words.clear();
2392 unfold(tree, x, y, words);
2393 return true;
2394 }
2395 * { return false; }
2396 */
2397 }
2398
2399 int main()
2400 {
2401 Words w;
2402 assert(lex("one;two;three;", w) && w == Words({"one", "two", "three"}));
2403 return 0;
2404 }
2405
2406
2408 With -f --storable-state option re2c generates a lexer that can store
2409 its current state, return to the caller, and later resume operations
2410 exactly where it left off. The default mode of operation in re2c is a
2411 "pull" model, in which the lexer "pulls" more input whenever it needs
2412 it. This may be unacceptable in cases when the input becomes available
2413 piece by piece (for example, if the lexer is invoked by the parser, or
2414 if the lexer program communicates via a socket protocol with some other
2415 program that must wait for a reply from the lexer before it transmits
2416 the next message). Storable state feature is intended exactly for such
2417 cases: it allows one to generate lexers that work in a "push" model.
2418 When the lexer needs more input, it stores its state and returns to the
2419 caller. Later, when more input becomes available, the caller resumes
2420 the lexer exactly where it stopped. There are a few changes necessary
2421 compared to the "pull" model:
2422
2423 • Define YYGETSTATE() and YYSETSTATE(state) primitives.
2424
2425 • Define yych, yyaccept and state variables as a part of persistent
2426 lexer state. The state variable should be initialized to -1.
2427
2428 • YYFILL should return to the outer program instead of trying to supply
2429 more input. Return code should indicate that lexer needs more input.
2430
2431 • The outer program should recognize situations when lexer needs more
2432 input and respond appropriately.
2433
2434 • Use /*!getstate:re2c*/ directive if it is necessary to execute any
2435 code before entering the lexer.
2436
2437 • Use configurations state:abort and state:nextlabel to further tweak
2438 the generated code.
2439
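The idea behind these changes can be sketched with a small hand-written resumable lexer in plain C. This is an illustration only, not re2c output: Lexer, lexer_init, feed and lex are hypothetical names, the buffer is never shifted, and the lexer recognizes packets made of lowercase letters followed by a semicolon. The switch at the top plays the role of the YYGETSTATE dispatch, and each assignment to state before returning plays the role of YYSETSTATE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { END, WAITING, BAD_PACKET } Status;

typedef struct {
    char buf[64];
    size_t cur, lim; // offsets into buf
    int state;       // -1: fresh start, otherwise the resume point
    unsigned count;  // number of complete packets seen so far
} Lexer;

static void lexer_init(Lexer *lx)
{
    lx->cur = lx->lim = 0;
    lx->state = -1;
    lx->count = 0;
}

// The caller pushes more input (sketch: append only, no buffer shifting).
static void feed(Lexer *lx, const char *data)
{
    size_t n = strlen(data);
    assert(lx->lim + n <= sizeof lx->buf);
    memcpy(lx->buf + lx->lim, data, n);
    lx->lim += n;
}

static Status lex(Lexer *lx)
{
    char c;
    switch (lx->state) { // jump back to where the lexer stopped
    case 0: goto resume_start;
    case 1: goto resume_word;
    default: break;      // -1: start from the beginning
    }
start:
    if (lx->cur >= lx->lim) {
        lx->state = 0;   // store the state and return to the caller
        return WAITING;
resume_start:
        if (lx->cur >= lx->lim) return END; // no new input in the initial state
    }
    c = lx->buf[lx->cur];
    if (c < 'a' || c > 'z') return BAD_PACKET;
    ++lx->cur;
word:
    if (lx->cur >= lx->lim) {
        lx->state = 1;
        return WAITING;
resume_word:
        if (lx->cur >= lx->lim) return BAD_PACKET; // input ended mid-packet
    }
    c = lx->buf[lx->cur++];
    if (c >= 'a' && c <= 'z') goto word;
    if (c != ';') return BAD_PACKET;
    ++lx->count;
    goto start;
}
```

The caller is expected to loop: call lex, and on WAITING call feed with whatever input has arrived (possibly nothing) before calling lex again. All data that must survive between calls lives in the Lexer structure, which is why the list above asks for the state variables to be part of the persistent lexer state.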
2440 Here is an example of a "push"-model lexer that reads its input from a
2441 file used as a pipe and expects a sequence of packets, each made of
2442 lowercase letters terminated by a semicolon. When the lexer needs more
2443 input, it stores its state and YYFILL returns WAITING to the caller.
2444 The caller (the test function) writes the next packet to the pipe, if
2445 there is one, calls fill to move new data into the buffer, and invokes
2446 the lexer again, which resumes exactly where it stopped. The lexer
2447 terminates successfully with END when the EOF rule matches: this hap‐
2448 pens only if the lexer is in the initial state and fill has supplied no
2449 new input. In that case the number of received packets must be equal
2450 to the number of packets sent. Abnormal termination happens in case of
2451 an ill-formed packet (BAD_PACKET) or in case the buffer is too small to
2452 hold a lexeme (BIG_PACKET, for example if a packet exceeds the buffer
2453 size of 10 characters).
2459
2460 Example
2461 // re2c $INPUT -o $OUTPUT -f
2462 #include <assert.h>
2463 #include <stdio.h>
2464 #include <string.h>
2465
2466 #define DEBUG 0
2467 #define LOG(...) if (DEBUG) fprintf(stderr, __VA_ARGS__);
2468 #define BUFSIZE 10
2469
2470 typedef struct {
2471 FILE *file;
2472 char buf[BUFSIZE + 1], *lim, *cur, *mar, *tok;
2473 unsigned yyaccept;
2474 int state;
2475 } Input;
2476
2477 static void init(Input *in, FILE *f)
2478 {
2479 in->file = f;
2480 in->cur = in->mar = in->tok = in->lim = in->buf + BUFSIZE;
2481 in->lim[0] = 0; // append sentinel symbol
2482 in->yyaccept = 0;
2483 in->state = -1;
2484 }
2485
2486 typedef enum {END, READY, WAITING, BAD_PACKET, BIG_PACKET} Status;
2487
2488 static Status fill(Input *in)
2489 {
2490 const size_t shift = in->tok - in->buf;
2491 const size_t free = BUFSIZE - (in->lim - in->tok);
2492
2493 if (free < 1) return BIG_PACKET;
2494
2495 memmove(in->buf, in->tok, BUFSIZE - shift);
2496 in->lim -= shift;
2497 in->cur -= shift;
2498 in->mar -= shift;
2499 in->tok -= shift;
2500
2501 const size_t read = fread(in->lim, 1, free, in->file);
2502 in->lim += read;
2503 in->lim[0] = 0; // append sentinel symbol
2504
2505 return READY;
2506 }
2507
2508 static Status lex(Input *in, unsigned int *recv)
2509 {
2510 char yych;
2511 /*!getstate:re2c*/
2512 loop:
2513 in->tok = in->cur;
2514 /*!re2c
2515 re2c:eof = 0;
2516 re2c:api:style = free-form;
2517 re2c:define:YYCTYPE = "char";
2518 re2c:define:YYCURSOR = "in->cur";
2519 re2c:define:YYMARKER = "in->mar";
2520 re2c:define:YYLIMIT = "in->lim";
2521 re2c:define:YYGETSTATE = "in->state";
2522 re2c:define:YYSETSTATE = "in->state = @@;";
2523 re2c:define:YYFILL = "return WAITING;";
2524
2525 packet = [a-z]+[;];
2526
2527 * { return BAD_PACKET; }
2528 $ { return END; }
2529 packet { *recv = *recv + 1; goto loop; }
2530 */
2531 }
2532
2533 void test(const char **packets, Status status)
2534 {
2535 const char *fname = "pipe";
2536 FILE *fw = fopen(fname, "w");
2537 FILE *fr = fopen(fname, "r");
2538 setvbuf(fw, NULL, _IONBF, 0);
2539 setvbuf(fr, NULL, _IONBF, 0);
2540
2541 Input in;
2542 init(&in, fr);
2543 Status st;
2544 unsigned int send = 0, recv = 0;
2545
2546 for (;;) {
2547 st = lex(&in, &recv);
2548 if (st == END) {
2549 LOG("done: got %u packets\n", recv);
2550 break;
2551 } else if (st == WAITING) {
2552 LOG("waiting...\n");
2553 if (*packets) {
2554 LOG("sent packet %u\n", send);
2555 fprintf(fw, "%s", *packets++);
2556 ++send;
2557 }
2558 st = fill(&in);
2559 LOG("queue: '%s'\n", in.buf);
2560 if (st == BIG_PACKET) {
2561 LOG("error: packet too big\n");
2562 break;
2563 }
2564 assert(st == READY);
2565 } else {
2566 assert(st == BAD_PACKET);
2567 LOG("error: ill-formed packet\n");
2568 break;
2569 }
2570 }
2571
2572 LOG("\n");
2573 assert(st == status);
2574 if (st == END) assert(recv == send);
2575
2576 fclose(fw);
2577 fclose(fr);
2578 remove(fname);
2579 }
2580
2581 int main()
2582 {
2583 const char *packets1[] = {0};
2584 const char *packets2[] = {"zero;", "one;", "two;", "three;", "four;", 0};
2585 const char *packets3[] = {"zer0;", 0};
2586 const char *packets4[] = {"goooooooooogle;", 0};
2587
2588 test(packets1, END);
2589 test(packets2, END);
2590 test(packets3, BAD_PACKET);
2591 test(packets4, BIG_PACKET);
2592
2593 return 0;
2594 }
2595
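
The push-model idea behind storable state can also be sketched by hand in
plain C. The sketch below is not code generated by re2c, and all names in it
(PushLexer, push_init, push_feed, NEED_MORE, BAD_INPUT) are hypothetical. It
recognizes the same packet = [a-z]+[;] language as the example above, keeps
its position in a state field between calls, and returns to the caller
whenever the current chunk is exhausted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes: NEED_MORE plays the role of WAITING above. */
typedef enum { NEED_MORE, BAD_INPUT } PushStatus;

typedef struct {
    int state;        /* 0: at the start of a packet, 1: inside [a-z]+ */
    unsigned packets; /* number of complete packets seen so far */
} PushLexer;

void push_init(PushLexer *lx)
{
    assert(lx != 0);
    lx->state = 0;
    lx->packets = 0;
}

/* Consume one chunk of input. The saved state lets the next call resume
 * in the middle of a packet, so chunk boundaries may fall anywhere. */
PushStatus push_feed(PushLexer *lx, const char *chunk, size_t len)
{
    assert(lx != 0 && (chunk != 0 || len == 0));
    for (size_t i = 0; i < len; ++i) {
        const char c = chunk[i];
        if (c >= 'a' && c <= 'z') {
            lx->state = 1;            /* inside the letters of a packet */
        } else if (c == ';' && lx->state == 1) {
            lx->state = 0;            /* packet complete */
            ++lx->packets;
        } else {
            return BAD_INPUT;         /* ill-formed packet */
        }
    }
    return NEED_MORE;                 /* chunk exhausted, wait for more */
}
```

Feeding "ze", then "ro;on", then "e;" yields two packets even though both
packets are split across chunks. A generated lexer achieves the same effect
with YYGETSTATE and YYSETSTATE rather than a hand-maintained state field.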
2596
2598 Reuse mode is enabled with the -r --reusable option. In this mode re2c
2599 allows one to reuse definitions, configurations and rules specified by
2600 a /*!rules:re2c*/ block in subsequent /*!use:re2c*/ blocks. As of
2601 re2c-1.2 it is possible to mix such blocks with normal /*!re2c*/
2602 blocks; prior to that re2c expects a single rules-block followed by
2603 use-blocks (normal blocks are disallowed). Use-blocks can have addi‐
2604 tional definitions, configurations and rules: they are merged to those
2605 specified by the rules-block. A very common use case for -r --reusable
2606 option is a lexer that supports multiple input encodings: lexer rules
2607 are defined once and reused multiple times with encoding-specific con‐
2608 figurations, such as re2c:flags:utf-8.
2609
2610 Below is an example of a multi-encoding lexer: it reads a phrase with
2611 Unicode math symbols and accepts input either in UTF8 or in UTF32. Note
2612 that the --input-encoding utf8 option allows us to write UTF8-encoded
2613 symbols in the regular expressions; without this option re2c would
2614 parse them as a plain ASCII byte sequence (and we would have to use
2615 hexadecimal escape sequences).
2616
2617 Example
2618 // re2c $INPUT -o $OUTPUT -r --input-encoding utf8
2619 #include <assert.h>
2620 #include <stdint.h>
2621
2622 /*!rules:re2c
2623 re2c:yyfill:enable = 0;
2624
2625 "∀x ∃y: p(x, y)" { return 0; }
2626 * { return 1; }
2627 */
2628
2629 static int lex_utf8(const uint8_t *YYCURSOR)
2630 {
2631 const uint8_t *YYMARKER;
2632 /*!use:re2c
2633 re2c:define:YYCTYPE = uint8_t;
2634 re2c:flags:8 = 1;
2635 */
2636 }
2637
2638 static int lex_utf32(const uint32_t *YYCURSOR)
2639 {
2640 const uint32_t *YYMARKER;
2641 /*!use:re2c
2642 re2c:define:YYCTYPE = uint32_t;
2643 re2c:flags:8 = 0;
2644 re2c:flags:u = 1;
2645 */
2646 }
2647
2648 int main()
2649 {
2650 static const uint8_t s8[] = // UTF-8
2651 { 0xe2, 0x88, 0x80, 0x78, 0x20, 0xe2, 0x88, 0x83, 0x79
2652 , 0x3a, 0x20, 0x70, 0x28, 0x78, 0x2c, 0x20, 0x79, 0x29 };
2653
2654 static const uint32_t s32[] = // UTF32
2655 { 0x00002200, 0x00000078, 0x00000020, 0x00002203
2656 , 0x00000079, 0x0000003a, 0x00000020, 0x00000070
2657 , 0x00000028, 0x00000078, 0x0000002c, 0x00000020
2658 , 0x00000079, 0x00000029 };
2659
2660 assert(lex_utf8(s8) == 0);
2661 assert(lex_utf32(s32) == 0);
2662 return 0;
2663 }
2664
2665
2666
2668 Speaking of encodings, it is necessary to understand the difference be‐
2669 tween code points and code units. Code point is an abstract symbol.
2670 Code unit is the smallest atomic unit of storage in the encoded text.
2671 A single code point may be represented with one or more code units. In
2672 a fixed-length encoding all code points are represented with the same
2673 number of code units. In a variable-length encoding code points may be
2674 represented with a different number of code units. Note that the "any"
2675 rule [^] matches any code point, but not necessarily any code unit.
2676 The only way to match any code unit regardless of the encoding is the
2677 default rule *. YYCTYPE size should be equal to the size of code unit.
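
The relation between code points and code units can be made concrete with a
small sketch. The helper names below (utf8_units, utf16_units, utf32_units)
are hypothetical and are not part of re2c; they only count how many code
units the standard Unicode encoding forms need for one code point.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 1-byte code units that encode code point cp in UTF-8. */
int utf8_units(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Number of 2-byte code units that encode code point cp in UTF-16. */
int utf16_units(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 1 : 2;
}

/* UTF-32 is fixed-length: always one 4-byte code unit per code point. */
int utf32_units(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return 1;
}
```

For instance, the symbol U+2200 takes three code units in UTF8 (the bytes
0xe2 0x88 0x80) but a single code unit in UTF16 and in UTF32, while U+0078
takes one code unit in all three.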
2678
2679 Re2c supports the following encodings: ASCII, EBCDIC, UCS2, UTF8, UTF16
2680 and UTF32.
2681
2682 • ASCII is enabled by default. It is a fixed-length encoding with code
2683 space [0-255] and 1-byte code points and code units.
2684
2685 • EBCDIC is enabled with -e, --ecb option. It is a fixed-length encoding
2686 with code space [0-255] and 1-byte code points and code units.
2687
2688 • UCS2 is enabled with -w, --wide-chars option. It is a fixed-length
2689 encoding with code space [0-0xFFFF] and 2-byte code points and code
2690 units.
2691
2692 • UTF8 is enabled with -8, --utf-8 option. It is a variable-length
2693 Unicode encoding with code space [0-0x10FFFF]. Code points are rep‐
2694 resented with one, two, three or four 1-byte code units.
2695
2696 • UTF16 is enabled with -x, --utf-16 option. It is a variable-length
2697 Unicode encoding with code space [0-0x10FFFF]. Code points are rep‐
2698 resented with one or two 2-byte code units.
2699
2700 • UTF32 is enabled with -u, --unicode option. It is a fixed-length
2701 Unicode encoding with code space [0-0x10FFFF] and 4-byte code points
2702 and code units.
2703
2704 Encodings can also be set or unset using re2c:flags configuration, for
2705 example re2c:flags:8 = 1; enables UTF8.
2706
2707 Include file include/unicode_categories.re provides re2c definitions
2708 for the standard Unicode categories.
2709
2710 Option --input-encoding utf8 enables Unicode literals in regular ex‐
2711 pressions.
2712
2713 Option --encoding-policy <fail | substitute | ignore> specifies the way
2714 re2c handles Unicode surrogates: code points in the range
2715 [0xD800-0xDFFF].
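
A rough sketch of what the three policy names suggest is given below. The
behavior shown is an assumption based on the names alone (fail rejects a
surrogate, substitute replaces it with the Unicode replacement character
U+FFFD, ignore passes it through); it is not taken from the re2c sources,
and the names Policy, is_surrogate and apply_policy are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the three option values. */
typedef enum { POLICY_FAIL, POLICY_SUBSTITUTE, POLICY_IGNORE } Policy;

/* Surrogates are the code points in the range [0xD800-0xDFFF]. */
int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Stores the resulting code point in *out and returns 0, or returns -1
 * if the code point is rejected. Assumed semantics, see the text above. */
int apply_policy(Policy p, uint32_t cp, uint32_t *out)
{
    assert(out != 0);
    if (is_surrogate(cp)) {
        if (p == POLICY_FAIL) return -1;           /* reject */
        if (p == POLICY_SUBSTITUTE) cp = 0xFFFD;   /* replace */
        /* POLICY_IGNORE: keep the surrogate as is */
    }
    *out = cp;
    return 0;
}
```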
2716
2717 Example
2718 // re2c $INPUT -o $OUTPUT -8 --case-ranges -i
2719 //
2720 // Simplified "Unicode Identifier and Pattern Syntax"
2721 // (see https://unicode.org/reports/tr31)
2722
2723 #include <assert.h>
2724 #include <stdint.h>
2725
2726 /*!include:re2c "unicode_categories.re" */
2727
2728 static int lex(const char *YYCURSOR)
2729 {
2730 const char *YYMARKER;
2731 /*!re2c
2732 re2c:define:YYCTYPE = 'unsigned char';
2733 re2c:yyfill:enable = 0;
2734
2735 id_start = L | Nl | [$_];
2736 id_continue = id_start | Mn | Mc | Nd | Pc | [\u200D\u05F3];
2737 identifier = id_start id_continue*;
2738
2739 identifier { return 0; }
2740 * { return 1; }
2741 */
2742 }
2743
2744 int main()
2745 {
2746 assert(lex("_Ыдентификатор") == 0);
2747 return 0;
2748 }
2749
2750
2752 Conditions are enabled with -c --conditions. This option allows one to
2753 encode multiple interrelated lexers within the same re2c block.
2754
2755 Each lexer corresponds to a single condition. It starts with a label
2756 of the form yyc_name, where name is condition name and yyc prefix can
2757 be adjusted with configuration re2c:condprefix. Different lexers are
2758 separated with a comment /* *********************************** */
2759 which can be adjusted with configuration re2c:cond:divider.
2760
2761 Furthermore, each condition has a unique identifier of the form yyc‐
2762 name, where name is condition name and yyc prefix can be adjusted with
2763 configuration re2c:condenumprefix. Identifiers have the type YYCOND‐
2764 TYPE and should be generated with /*!types:re2c*/ directive or -t
2765 --type-header option. Users shouldn't define these identifiers manu‐
2766 ally, as the order of conditions is not specified.
2767
2768 Before all conditions re2c generates entry code that checks the current
2769 condition identifier and transfers control flow to the start label of
2770 the active condition. After matching some rule of this condition,
2771 lexer may either transfer control flow back to the entry code (after
2772 executing the associated action and optionally setting another condi‐
2773 tion with =>), or use :=> shortcut and transition directly to the start
2774 label of another condition (skipping the action and the entry code).
2775 Configuration re2c:cond:goto allows one to change the default behavior.
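
The control flow described above can be approximated by hand in plain C.
The sketch below is an illustration of the shape of the code, not actual
re2c output: the enum stands in for the identifiers that the types
directive would generate, and parse_number and hex_value are hypothetical
names. It shows the entry code that dispatches on the condition identifier,
a transition that returns to the entry code, and a transition that jumps
directly to a start label.

```c
#include <assert.h>

// Hand-written stand-in for the generated condition identifiers; the
// names follow the yyc prefix convention described above.
enum YYCONDTYPE { yycinit, yycdec, yychex };

// Hypothetical helper: value of a hexadecimal digit, or -1 if c is not one.
int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Parses a decimal ("123") or hexadecimal ("0x7b") number.
// Returns -1 on malformed input; overflow is not checked in this sketch.
long parse_number(const char *s)
{
    assert(s != 0);
    enum YYCONDTYPE c = yycinit;
    long u = 0;

entry: // entry code: dispatch on the current condition identifier
    switch (c) {
    case yycinit: goto yyc_init;
    case yycdec:  goto yyc_dec;
    case yychex:  goto yyc_hex;
    }
    return -1; // not reached for valid identifiers

yyc_init:
    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X') && hex_value(s[2]) >= 0) {
        s += 2;
        c = yychex;   // like "=> hex": set the new condition ...
        goto entry;   // ... and go back through the entry code
    }
    if (*s >= '1' && *s <= '9') {
        c = yycdec;
        goto yyc_dec; // like ":=> dec": jump directly to the start label
    }
    return -1;

yyc_dec:
    if (*s == '\0') return u;
    if (*s >= '0' && *s <= '9') { u = u * 10 + (*s++ - '0'); goto yyc_dec; }
    return -1;

yyc_hex:
    if (*s == '\0') return u;
    if (hex_value(*s) >= 0) { u = u * 16 + hex_value(*s++); goto yyc_hex; }
    return -1;
}
```

With this sketch parse_number("1234") yields 1234 and parse_number("0x7Fe")
yields 2046, while "0x" and "12a" are rejected.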
2776
2777 Syntactically each rule must be preceded with a list of comma-separated
2778 condition names or a wildcard * enclosed in angle brackets < and >.
2779 Wildcard means "any condition" and is semantically equivalent to list‐
2780 ing all condition names. Here regexp is a regular expression, default
2781 refers to the default rule *, and action is a block of code.
2782
2783 • <conditions-or-wildcard> regexp-or-default action
2784
2785 • <conditions-or-wildcard> regexp-or-default => condition action
2786
2787 • <conditions-or-wildcard> regexp-or-default :=> condition
2788
2789 Rules with an exclamation mark ! in front of condition list have a spe‐
2790 cial meaning: they have no regular expression, and the associated ac‐
2791 tion is merged as an entry code to actions of normal rules. This might
2792 be a convenient place to perform a routine task that is common to all
2793 rules.
2794
2795 • <!conditions-or-wildcard> action
2796
2797 Another special form of rules with an empty condition list <> and no
2798 regular expression allows one to specify an "entry condition" that can
2799 be used to execute code before entering the lexer. It is semantically
2800 equivalent to a condition with number zero, name 0 and an empty regular
2801 expression.
2802
2803 • <> action
2804
2805 • <> => condition action
2806
2807 • <> :=> condition
2808
2809 Example
2810 // re2c $INPUT -o $OUTPUT -ci
2811 #include <stdint.h>
2812 #include <limits.h>
2813 #include <assert.h>
2814
2815 static const uint64_t ERROR = ~0lu;
2816 /*!types:re2c*/
2817
2818 template<int BASE> static void adddgt(uint64_t &u, unsigned int d)
2819 {
2820 u = u * BASE + d;
2821 if (u > UINT32_MAX) u = ERROR;
2822 }
2823
2824 static uint64_t parse_u32(const char *s)
2825 {
2826 const char *YYMARKER;
2827 int c = yycinit;
2828 uint64_t u = 0;
2829
2830 /*!re2c
2831 re2c:yyfill:enable = 0;
2832 re2c:api:style = free-form;
2833 re2c:define:YYCTYPE = char;
2834 re2c:define:YYCURSOR = s;
2835 re2c:define:YYGETCONDITION = "c";
2836 re2c:define:YYSETCONDITION = "c = @@;";
2837
2838 <*> * { return ERROR; }
2839
2840 <init> '0b' / [01] :=> bin
2841 <init> "0" :=> oct
2842 <init> "" / [1-9] :=> dec
2843 <init> '0x' / [0-9a-fA-F] :=> hex
2844
2845 <bin, oct, dec, hex> "\x00" { return u; }
2846
2847 <bin> [01] { adddgt<2> (u, s[-1] - '0'); goto yyc_bin; }
2848 <oct> [0-7] { adddgt<8> (u, s[-1] - '0'); goto yyc_oct; }
2849 <dec> [0-9] { adddgt<10>(u, s[-1] - '0'); goto yyc_dec; }
2850 <hex> [0-9] { adddgt<16>(u, s[-1] - '0'); goto yyc_hex; }
2851 <hex> [a-f] { adddgt<16>(u, s[-1] - 'a' + 10); goto yyc_hex; }
2852 <hex> [A-F] { adddgt<16>(u, s[-1] - 'A' + 10); goto yyc_hex; }
2853 */
2854 }
2855
2856 int main()
2857 {
2858 assert(parse_u32("1234567890") == 1234567890);
2859 assert(parse_u32("0b1101") == 13);
2860 assert(parse_u32("0x7Fe") == 2046);
2861 assert(parse_u32("0644") == 420);
2862 assert(parse_u32("9999999999") == ERROR);
2863 assert(parse_u32("") == ERROR);
2864 return 0;
2865 }
2866
2867
2869 With the -S, --skeleton option, re2c ignores all non-re2c code and gen‐
2870 erates a self-contained C program that can be further compiled and exe‐
2871 cuted. The program consists of lexer code and input data. For each con‐
2872 structed DFA (block or condition) re2c generates a standalone lexer and
2873 two files: an .input file with strings derived from the DFA and a .keys
2874 file with expected match results. The program runs each lexer on the
2875 corresponding .input file and compares results with the expectations.
2876 Skeleton programs are very useful for a number of reasons:
2877
2878 • They can check correctness of various re2c optimizations (the data is
2879 generated early in the process, before any DFA transformations have
2880 taken place).
2881
2882 • Generating a set of input data with good coverage may be useful for
2883 both testing and benchmarking.
2884
2885 • Generating self-contained executable programs allows one to get mini‐
2886 mized test cases (the original code may be large or have a lot of de‐
2887 pendencies).
2888
2889 The difficulty with generating input data is that for all but the most
2890 trivial cases the number of possible input strings is too large (even
2891 if the string length is limited). Re2c solves this difficulty by gener‐
2892 ating sufficiently many strings to cover almost all DFA transitions. It
2893 uses the following algorithm. First, it constructs a skeleton of the
2894 DFA. For encodings with 1-byte code unit size (such as ASCII, UTF-8 and
2895 EBCDIC) skeleton is just an exact copy of the original DFA. For encod‐
2896 ings with multibyte code units skeleton is a copy of DFA with certain
2897 transitions omitted: namely, re2c takes at most 256 code units for each
2898 disjoint continuous range that corresponds to a DFA transition. The
2899 chosen values are evenly distributed and include range bounds. Instead
2900 of trying to cover all possible paths in the skeleton (which is infea‐
2901 sible) re2c generates sufficiently many paths to cover all skeleton
2902 transitions, and thus trigger the corresponding conditional jumps in
2903 the lexer. The algorithm implementation is limited by ~1Gb of
2904 transitions and consumes a constant amount of memory (re2c writes data to file
2905 as soon as it is generated).
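
The range-sampling step (at most 256 evenly distributed code units per
range, including the range bounds) can be sketched as follows. This is only
one plausible way to pick such values; the exact selection used by re2c may
differ, and the name sample_range is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Writes at most max_n values from the range [lo, hi] into out and returns
// how many were written. The values are evenly spaced and always include
// both bounds. The caller provides room for max_n values.
size_t sample_range(uint32_t lo, uint32_t hi, size_t max_n, uint32_t *out)
{
    assert(lo <= hi && max_n >= 2 && out != 0);
    const uint64_t width = (uint64_t)hi - lo; // distance between the bounds

    if (width < max_n) {
        // Small range: it has at most max_n values, so take all of them.
        for (uint64_t i = 0; i <= width; ++i) out[i] = lo + (uint32_t)i;
        return (size_t)width + 1;
    }
    // Large range: i = 0 gives lo, i = max_n - 1 gives hi.
    for (size_t i = 0; i < max_n; ++i) {
        out[i] = lo + (uint32_t)(width * i / (max_n - 1));
    }
    return max_n;
}
```

For the UCS2 range [0-0xFFFF] and max_n = 256 this picks 0, 257, 514 and so
on up to 0xFFFF, while a three-value range such as [10-12] is kept whole.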
2906
2908 With the -D, --emit-dot option, re2c does not generate code. Instead,
2909 it dumps the generated DFA in DOT format. One can convert this dump to
2910 an image of the DFA using Graphviz or another library. Note that this
2911 option shows the final DFA after it has gone through a number of opti‐
2912 mizations and transformations. Earlier stages can be dumped with vari‐
2913 ous debug options, such as --dump-nfa, --dump-dfa-raw etc. (see the
2914 full list of options).
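
To give an idea of what a DOT description of a small automaton looks like,
the sketch below prints a hand-written transition list as a DOT digraph.
The layout of the actual re2c dump (node names, labels, attributes) is not
reproduced here, and the names Edge and dfa_to_dot are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

// Hypothetical transition record: from state, to state, edge label.
typedef struct { int from, to; const char *label; } Edge;

// Writes a DOT digraph for the given edges into buf. Returns the number of
// characters written (not counting the terminating NUL), or -1 if the
// output does not fit into size bytes.
int dfa_to_dot(const Edge *edges, int n, char *buf, size_t size)
{
    assert(edges != 0 && buf != 0 && size > 0);
    size_t used = 0;

    int k = snprintf(buf, size, "digraph dfa {\n");
    if (k < 0 || (size_t)k >= size) return -1;
    used += (size_t)k;

    for (int i = 0; i < n; ++i) {
        k = snprintf(buf + used, size - used, "  %d -> %d [label=\"%s\"];\n",
                     edges[i].from, edges[i].to, edges[i].label);
        if (k < 0 || (size_t)k >= size - used) return -1;
        used += (size_t)k;
    }

    k = snprintf(buf + used, size - used, "}\n");
    if (k < 0 || (size_t)k >= size - used) return -1;
    used += (size_t)k;
    return (int)used;
}
```

Text of this form can be rendered to an image with the Graphviz dot tool.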
2915
2917 You can find more information about re2c at the official website:
2918 http://re2c.org. Similar programs are flex(1), lex(1), and quex
2919 (http://quex.sourceforge.net).
2920
2922 Re2c was originally written by Peter Bumbulis in 1993. Since then it
2923 has been developed and maintained by multiple volunteers; most notably,
2924 Brian Young, Marcus Boerger, Dan Nuffer and Ulya Trofimovich.
2925
2926
2927
2928
2929 RE2C(1)