RE2C(1) RE2C(1)

6 re2c - compile regular expressions to code
7
9 re2c [OPTIONS] INPUT [-o OUTPUT]
10
11 re2go [OPTIONS] INPUT [-o OUTPUT]
12
14 re2c is a tool for generating fast lexical analyzers for C, C++ and Go.
15
16 Note: This manual includes examples for Go, but it refers to re2c
17 (rather than re2go) as the name of the program in general.
18
A re2c program consists of normal code intermixed with re2c blocks and
directives. Each re2c block may contain definitions, configurations
and rules. Definitions are of the form name = regexp; where name is
an identifier that consists of letters, digits and underscores, and
regexp is a regular expression. Regular expressions may contain other
definitions, but recursion is not allowed and each name should be
defined before it is used. Configurations are of the form
re2c:config = value; where config is the configuration descriptor and
value can be a number, a string or a special word. Rules consist of a
regular expression followed by a semantic action (a block of code
enclosed in curly braces { and }, or a raw one line of code preceded
with := and ended with a newline that is not followed by whitespace).
If the input matches the regular expression, the associated semantic
action is executed. If multiple rules match, the longest match takes
precedence. If multiple rules match the same string, the earlier rule
takes precedence. There are two special rules: the default rule * and
the EOF rule $. The default rule should always be defined; it has the
lowest priority regardless of its place and matches any code unit (not
necessarily a valid character, see encoding support). The EOF rule
matches the end of input; it should be defined if the corresponding
method for handling the end of input is used. If start conditions are
used, rules have more complex syntax. All rules of a single block are
compiled into a deterministic finite-state automaton (DFA) and encoded
in the form of a program in the target language. The generated code
interfaces with the outer program by means of a few user-defined
primitives (see the program interface section). Reusable blocks allow
sharing rules, definitions and configurations between different blocks.
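
The two precedence rules above (longest match first, then the earlier
rule) can be illustrated with a small hand-written C sketch. This is
not how re2c works internally (re2c compiles all rules into a DFA);
the helper pick_rule and the restriction to literal-string rules are
assumptions made only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: rules are plain literal strings listed in
 * priority order. Returns the index of the winning rule and stores the
 * match length in *len. If no rule matches, returns -1 (the role of the
 * default rule `*`) and sets *len to 1, i.e. one code unit is consumed.
 * Callers are expected to stop at the end of input before calling. */
int pick_rule(const char *input, const char *const rules[],
              size_t nrules, size_t *len)
{
    assert(input != NULL && rules != NULL && len != NULL);
    int best = -1;
    size_t best_len = 0;
    for (size_t i = 0; i < nrules; ++i) {
        size_t n = strlen(rules[i]);
        if (strncmp(input, rules[i], n) != 0) {
            continue;               /* rule i does not match here */
        }
        /* A strictly longer match wins; on a tie the earlier rule stays. */
        if (best == -1 || n > best_len) {
            best = (int)i;
            best_len = n;
        }
    }
    *len = (best == -1) ? 1 : best_len;
    return best;
}
```

With the rules "if", "iff", "if" (in that order), the input "iffy"
selects the second rule because its match is longer, while the input
"if x" selects the first rule because it precedes the identical third
rule.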
47
Input file
    //go:generate re2go $INPUT -o $OUTPUT -i
    package main                                //
                                                //
    func lex(str string) int {                  // Go code
        var cursor int                          //

        /*!re2c                                 // start of re2c block
        re2c:define:YYCTYPE = byte;             // configuration
        re2c:define:YYPEEK = "str[cursor]";     // configuration
        re2c:define:YYSKIP = "cursor += 1";     // configuration
        re2c:yyfill:enable = 0;                 // configuration
        re2c:flags:nested-ifs = 1;              // configuration
                                                //
        number = [1-9][0-9]*;                   // named definition
                                                //
        number { return 0; }                    // normal rule
        *      { return 1; }                    // default rule
        */
    }                                           //
                                                //
    func main() {                               //
        if lex("1234\x00") != 0 {               // Go code
            panic("failed!")                    //
        }                                       //
    }                                           //
75
76
Output file
    // Code generated by re2c, DO NOT EDIT.
    //go:generate re2go $INPUT -o $OUTPUT -i
    package main                                //
                                                //
    func lex(str string) int {                  // Go code
        var cursor int                          //


    {
        var yych byte
        yych = str[cursor]
        if (yych <= '0') {
            goto yy2
        }
        if (yych <= '9') {
            goto yy4
        }
    yy2:
        cursor += 1
        { return 1; }
    yy4:
        cursor += 1
        yych = str[cursor]
        if (yych <= '/') {
            goto yy6
        }
        if (yych <= '9') {
            goto yy4
        }
    yy6:
        { return 0; }
    }

    }                                           //
                                                //
    func main() {                               //
        if lex("1234\x00") != 0 {               // Go code
            panic("failed!")                    //
        }                                       //
    }                                           //
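
For readers who work with the C backend, the automaton in the output
file can be transcribed by hand into C using a plain pointer as the
cursor. The sketch below is a manual port of the Go code above, not
actual re2c output; the function name lex_number is chosen for the
illustration only. It assumes a NUL-terminated input, which plays the
same role as the trailing \x00 in the Go example.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 0 if the input starts with [1-9][0-9]* (rule `number`),
 * and 1 otherwise (default rule). The input must be NUL-terminated:
 * the NUL byte is what stops the digit loop. */
int lex_number(const char *str)
{
    assert(str != NULL);
    const char *cursor = str;
    unsigned char yych = (unsigned char)*cursor;
    if (yych <= '0') goto yy2;      /* below '1': not a number */
    if (yych <= '9') goto yy4;      /* '1'..'9': first digit */
yy2:
    ++cursor;                       /* default rule consumes one unit */
    return 1;
yy4:
    ++cursor;
    yych = (unsigned char)*cursor;
    if (yych <= '/') goto yy6;      /* below '0': the number ends here */
    if (yych <= '9') goto yy4;      /* '0'..'9': keep consuming digits */
yy6:
    return 0;                       /* rule `number` matched */
}
```

As in the Go version, a leading '0' falls through to the default rule,
because the definition of number requires the first digit to be in the
range 1 to 9.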
118
119
121 -? -h --help
122 Show help message.
123
124 -1 --single-pass
125 Deprecated. Does nothing (single pass is the default now).
126
127 -8 --utf-8
128 Generate a lexer that reads input in UTF-8 encoding. re2c as‐
129 sumes that character range is 0 -- 0x10FFFF and character size
130 is 1 byte.
131
132 -b --bit-vectors
133 Optimize conditional jumps using bit masks. Implies -s.
134
135 -c --conditions --start-conditions
136 Enable support of Flex-like "conditions": multiple interrelated
137 lexers within one block. Option --start-conditions is a legacy
138 alias; use --conditions instead.
139
140 --case-insensitive
141 Treat single-quoted and double-quoted strings as case-insensi‐
142 tive.
143
144 --case-inverted
145 Invert the meaning of single-quoted and double-quoted strings:
146 treat single-quoted strings as case-sensitive and double-quoted
147 strings as case-insensitive.
148
149 --case-ranges
Collapse consecutive cases in a switch statement into a range
151 of the form case low ... high:. This syntax is an extension of
152 the C/C++ language, supported by compilers like GCC, Clang and
153 Tcc. The main advantage over using single cases is smaller gen‐
154 erated C code and faster generation time, although for some com‐
155 pilers like Tcc it also results in smaller binary size. This
156 option doesn't work for the Go backend.
157
158 --depfile FILE
Write dependency information to FILE in the form of a Makefile
rule <output-file> : <input-file> [include-file ...]. This makes
it possible to track build dependencies in the presence of
/*!include:re2c*/ directives, so that updating include files
triggers regeneration of the output file. This option requires
the -o --output option to be specified.
165
166 -e --ecb
Generate a lexer that reads input in EBCDIC encoding. re2c
assumes that character range is 0 -- 0xFF and character size is 1
byte.
170
171 --empty-class <match-empty | match-none | error>
Define the way re2c treats empty character classes. With
match-empty (the default) empty class matches empty input (which
is illogical, but backwards-compatible). With match-none
empty class always fails to match. With error empty class
raises a compilation error.
177
178 --encoding-policy <fail | substitute | ignore>
179 Define the way re2c treats Unicode surrogates. With fail re2c
180 aborts with an error when a surrogate is encountered. With sub‐
181 stitute re2c silently replaces surrogates with the error code
182 point 0xFFFD. With ignore (the default) re2c treats surrogates
183 as normal code points. The Unicode standard says that standalone
184 surrogates are invalid, but real-world libraries and programs
185 behave in different ways.
186
187 -f --storable-state
188 Generate a lexer which can store its inner state. This is use‐
189 ful in push-model lexers which are stopped by an outer program
190 when there is not enough input, and then resumed when more input
191 becomes available. In this mode users should additionally define
192 YYGETSTATE() and YYSETSTATE(state) macros and variables yych,
193 yyaccept and state as part of the lexer state.
194
195 -F --flex-syntax
196 Partial support for Flex syntax: in this mode named definitions
197 don't need the equal sign and the terminating semicolon, and
198 when used they must be surrounded by curly braces. Names without
199 curly braces are treated as double-quoted strings.
200
201 -g --computed-gotos
202 Optimize conditional jumps using non-standard "computed goto"
203 extension (which must be supported by the compiler). re2c gener‐
204 ates jump tables only in complex cases with a lot of conditional
205 branches. Complexity threshold can be configured with
206 cgoto:threshold configuration. This option implies -b. This op‐
207 tion doesn't work for the Go backend.
208
209 -I PATH
210 Add PATH to the list of locations which are used when searching
211 for include files. This option is useful in combination with
212 /*!include:re2c ... */ directive. Re2c looks for FILE in the di‐
213 rectory of including file and in the list of include paths spec‐
214 ified by -I option.
215
216 -i --no-debug-info
217 Do not output #line information. This is useful when the gener‐
218 ated code is tracked by some version control system or IDE.
219
220 --input <default | custom>
221 Specify the API used by the generated code to interface with
user-defined code. Option default is the C API based on pointer
223 arithmetic (it is the default for the C backend). Option custom
224 is the generic API (it is the default for the Go backend).
225
226 --input-encoding <ascii | utf8>
227 Specify the way re2c parses regular expressions. With ascii
228 (the default) re2c handles input as ASCII-encoded: any sequence
229 of code units is a sequence of standalone 1-byte characters.
230 With utf8 re2c handles input as UTF8-encoded and recognizes
231 multibyte characters.
232
233 --lang <c | go>
234 Specify the output language. Supported languages are C and Go
235 (the default is C).
236
237 --location-format <gnu | msvc>
238 Specify location format in messages. With gnu locations are
239 printed as 'filename:line:column: ...'. With msvc locations are
240 printed as 'filename(line,column) ...'. Default is gnu.
241
242 --no-generation-date
243 Suppress date output in the generated file.
244
245 --no-version
246 Suppress version output in the generated file.
247
248 -o OUTPUT --output=OUTPUT
249 Specify the OUTPUT file.
250
251 -P --posix-captures
252 Enable submatch extraction with POSIX-style capturing groups.
253
254 -r --reusable
255 Allows reuse of re2c rules with /*!rules:re2c */ and /*!use:re2c
256 */ blocks. Exactly one rules-block must be present. The rules
257 are saved and used by every use-block that follows, which may
258 add its own rules and configurations.
259
260 -S --skeleton
261 Ignore user-defined interface code and generate a self-contained
262 "skeleton" program. Additionally, generate input files with
263 strings derived from the regular grammar and compressed match
264 results that are used to verify "skeleton" behavior on all in‐
265 puts. This option is useful for finding bugs in optimizations
266 and code generation. This option doesn't work for the Go back‐
267 end.
268
269 -s --nested-ifs
270 Use nested if statements instead of switch statements in condi‐
271 tional jumps. This usually results in more efficient code with
272 non-optimizing compilers.
273
274 -T --tags
275 Enable submatch extraction with tags.
276
277 -t HEADER --type-header=HEADER
278 Generate a HEADER file that contains enum with condition names.
279 Requires -c option.
280
281 -u --unicode
282 Generate a lexer that reads UTF32-encoded input. Re2c assumes
283 that character range is 0 -- 0x10FFFF and character size is 4
284 bytes. This option implies -s.
285
286 -V --vernum
287 Show version information in MMmmpp format (major, minor, patch).
288
289 --verbose
290 Output a short message in case of success.
291
292 -v --version
293 Show version information.
294
295 -w --wide-chars
296 Generate a lexer that reads UCS2-encoded input. Re2c assumes
297 that character range is 0 -- 0xFFFF and character size is 2
298 bytes. This option implies -s.
299
300 -x --utf-16
301 Generate a lexer that reads UTF16-encoded input. Re2c assumes
302 that character range is 0 -- 0x10FFFF and character size is 2
303 bytes. This option implies -s.
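
The encoding options above distinguish the character range (code
points) from the character size (code units). The following C sketch
counts how many code units one code point occupies in UTF-8 and in
UTF-16, and tests for the surrogate range that --encoding-policy
refers to. The function names are hypothetical and are not part of
re2c.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 1-byte code units that code point cp occupies in UTF-8
 * (the encoding selected by -8). */
int utf8_units(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Number of 2-byte code units that cp occupies in UTF-16 (-x).
 * Code points above 0xFFFF need a surrogate pair. */
int utf16_units(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 1 : 2;
}

/* Non-zero if cp lies in the surrogate range 0xD800..0xDFFF. */
int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}
```

In UTF-32 (-u) every code point occupies exactly one 4-byte code unit,
so no counting function is needed there.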
304
305 Debug options
306 -D --emit-dot
307 Instead of normal output generate lexer graph in .dot format.
308 The output can be converted to an image with the help of
309 Graphviz (e.g. something like dot -Tpng -odfa.png dfa.dot).
310
311 -d --debug-output
312 Emit YYDEBUG in the generated code. YYDEBUG should be defined
313 by the user in the form of a void function with two parameters:
314 state (lexer state or -1) and symbol (current input symbol of
315 type YYCTYPE).
316
317 --dump-adfa
318 Debug option: output DFA after tunneling (in .dot format).
319
320 --dump-cfg
321 Debug option: output control flow graph of tag variables (in
322 .dot format).
323
324 --dump-closure-stats
325 Debug option: output statistics on the number of states in clo‐
326 sure.
327
328 --dump-dfa-det
329 Debug option: output DFA immediately after determinization (in
330 .dot format).
331
332 --dump-dfa-min
333 Debug option: output DFA after minimization (in .dot format).
334
335 --dump-dfa-tagopt
336 Debug option: output DFA after tag optimizations (in .dot for‐
337 mat).
338
339 --dump-dfa-tree
340 Debug option: output DFA under construction with states repre‐
341 sented as tag history trees (in .dot format).
342
343 --dump-dfa-raw
344 Debug option: output DFA under construction with expanded
345 state-sets (in .dot format).
346
347 --dump-interf
348 Debug option: output interference table produced by liveness
349 analysis of tag variables.
350
351 --dump-nfa
352 Debug option: output NFA (in .dot format).
353
354 Internal options
355 --dfa-minimization <moore | table>
356 Internal option: DFA minimization algorithm used by re2c. The
357 moore option is the Moore algorithm (it is the default). The ta‐
358 ble option is the "table filling" algorithm. Both algorithms
359 should produce the same DFA up to states relabeling; table fill‐
360 ing is simpler and much slower and serves as a reference imple‐
361 mentation.
362
363 --eager-skip
364 Internal option: make the generated lexer advance the input po‐
365 sition eagerly -- immediately after reading the input symbol.
366 This changes the default behavior when the input position is ad‐
367 vanced lazily -- after transition to the next state. This option
368 is implied by --no-lookahead.
369
370 --no-lookahead
371 Internal option: use TDFA(0) instead of TDFA(1). This option
372 has effect only with --tags or --posix-captures options.
373
374 --no-optimize-tags
Internal option: suppress optimization of tag variables (useful
376 for debugging).
377
378 --posix-closure <gor1 | gtop>
379 Internal option: specify shortest-path algorithm used for the
380 construction of epsilon-closure with POSIX disambiguation seman‐
381 tics: gor1 (the default) stands for Goldberg-Radzik algorithm,
382 and gtop stands for "global topological order" algorithm.
383
384 --posix-prectable <complex | naive>
385 Internal option: specify the algorithm used to compute POSIX
386 precedence table. The complex algorithm computes precedence ta‐
387 ble in one traversal of tag history tree and has quadratic com‐
388 plexity in the number of TNFA states; it is the default. The
389 naive algorithm has worst-case cubic complexity in the number of
390 TNFA states, but it is much simpler than complex and may be
391 slightly faster in non-pathological cases.
392
393 --stadfa
394 Internal option: use staDFA algorithm for submatch extraction.
395 The main difference with TDFA is that tag operations in staDFA
396 are placed in states, not on transitions.
397
398 --fixed-tags <none | toplevel | all>
399 Internal option: specify whether the fixed-tag optimization
400 should be applied to all tags (all), none of them (none), or
401 only those in toplevel concatenation (toplevel). The default is
402 all. "Fixed" tags are those that are located within a fixed
distance to some other tag (called "base"). In such cases only
the base tag needs to be tracked, and the value of the fixed tag
can be computed as the value of the base tag plus a static
offset. For tags that are under alternative or repetition it is
407 also necessary to check if the base tag has a no-match value (in
408 that case fixed tag should also be set to no-match, disregarding
409 the offset). For tags in top-level concatenation the check is
410 not needed, because they always match.
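
The computation described for --fixed-tags can be sketched in C as
follows. This is an illustration of the idea only, not code generated
by re2c; the helper fixed_tag and its check_nomatch parameter are
assumptions, and NULL is used here as the no-match value.

```c
#include <assert.h>
#include <stddef.h>

/* Value of a fixed tag derived from its base tag: the base value plus
 * a static offset. check_nomatch should be non-zero for tags under
 * alternative or repetition, where the base tag may hold the no-match
 * value (NULL in this sketch); in that case the fixed tag is also
 * no-match and the offset is disregarded. */
const char *fixed_tag(const char *base, ptrdiff_t offset, int check_nomatch)
{
    if (check_nomatch && base == NULL) {
        return NULL;
    }
    assert(base != NULL);   /* top-level tags always match */
    return base + offset;
}
```

For example, in a pattern that matches four digits, a dash and two
more digits, a tag placed before the first digit lies exactly 5
characters before a tag placed after the dash, so only the latter
needs to be tracked and the former is obtained with an offset of -5.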
411
412 Warnings
413 -W Turn on all warnings.
414
415 -Werror
416 Turn warnings into errors. Note that this option alone doesn't
417 turn on any warnings; it only affects those warnings that have
418 been turned on so far or will be turned on later.
419
420 -W<warning>
421 Turn on warning.
422
423 -Wno-<warning>
424 Turn off warning.
425
426 -Werror-<warning>
427 Turn on warning and treat it as an error (this implies -W<warn‐
428 ing>).
429
430 -Wno-error-<warning>
431 Don't treat this particular warning as an error. This doesn't
432 turn off the warning itself.
433
434 -Wcondition-order
435 Warn if the generated program makes implicit assumptions about
436 condition numbering. One should use either the -t, --type-header
437 option or the /*!types:re2c*/ directive to generate a mapping of
438 condition names to numbers and then use the autogenerated condi‐
439 tion names.
440
441 -Wempty-character-class
442 Warn if a regular expression contains an empty character class.
443 Trying to match an empty character class makes no sense: it
444 should always fail. However, for backwards compatibility rea‐
445 sons re2c allows empty character classes and treats them as
446 empty strings. Use the --empty-class option to change the de‐
447 fault behavior.
448
449 -Wmatch-empty-string
450 Warn if a rule is nullable (matches an empty string). If the
451 lexer runs in a loop and the empty match is unintentional, the
452 lexer may unexpectedly hang in an infinite loop.
453
454 -Wswapped-range
455 Warn if the lower bound of a range is greater than its upper
456 bound. The default behavior is to silently swap the range
457 bounds.
458
459 -Wundefined-control-flow
460 Warn if some input strings cause undefined control flow in the
461 lexer (the faulty patterns are reported). This is the most dan‐
462 gerous and most common mistake. It can be easily fixed by adding
463 the default rule * which has the lowest priority, matches any
464 code unit, and consumes exactly one code unit.
465
466 -Wunreachable-rules
467 Warn about rules that are shadowed by other rules and will never
468 match.
469
470 -Wuseless-escape
471 Warn if a symbol is escaped when it shouldn't be. By default,
472 re2c silently ignores such escapes, but this may as well indi‐
473 cate a typo or an error in the escape sequence.
474
475 -Wnondeterministic-tags
476 Warn if a tag has n-th degree of nondeterminism, where n is
477 greater than 1.
478
479 -Wsentinel-in-midrule
Warn if the sentinel symbol occurs in the middle of a rule: this
may cause reads past the end of buffer, crashes or memory
corruption in the generated lexer. This warning is only
applicable if the sentinel method of checking for the end of
input is used. It is set to an error if re2c:sentinel
configuration is used.
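
The danger behind -Wsentinel-in-midrule can be seen in a hand-written
C scan for a double-quoted string. This is a sketch under the
assumption that NUL is the sentinel; the function scan_quoted is
hypothetical and not generated by re2c. A character class meaning
"anything except a quote" would include NUL, so the loop must test for
the sentinel explicitly, otherwise an unterminated string would be
scanned past the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Scans a double-quoted string at the start of a NUL-terminated buffer.
 * Returns the match length including both quotes, or -1 if the input
 * does not start with a quote or the closing quote is missing. */
long scan_quoted(const char *s)
{
    assert(s != NULL);
    if (*s != '"') {
        return -1;
    }
    const char *p = s + 1;
    while (*p != '"') {
        if (*p == '\0') {
            return -1;      /* sentinel reached: unterminated string */
        }
        ++p;
    }
    return (long)(p - s) + 1;
}
```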
486
488 Re2c has a flexible interface that gives the user both the freedom and
489 the responsibility to define how the generated code interacts with the
490 outer program. There are two major options:
491
• Pointer API. It is also called "default API", since it was
historically the first, and for a long time the only one. This is a
more restricted API based on C pointer arithmetic. It consists of
pointer-like primitives YYCURSOR, YYMARKER, YYCTXMARKER and YYLIMIT,
which are normally defined as pointers of type YYCTYPE*. Pointer API
is enabled by default for the C backend, and it cannot be used with
other backends that do not have pointer arithmetic.
499
500
501
502 • Generic API. This is a less restricted API that does not assume
503 pointer semantics. It consists of primitives YYPEEK, YYSKIP, YY‐
504 BACKUP, YYBACKUPCTX, YYSTAGP, YYSTAGN, YYMTAGP, YYMTAGN, YYRESTORE,
505 YYRESTORECTX, YYRESTORETAG, YYSHIFT, YYSHIFTSTAG, YYSHIFTMTAG and YY‐
506 LESSTHAN. For the C backend generic API is enabled with --input cus‐
507 tom option or re2c:flags:input = custom; configuration; for the Go
508 backend it is enabled by default. Generic API was added in version
509 0.14. It is intentionally designed to give the user as much freedom
510 as possible in redefining the input model and the semantics of dif‐
511 ferent actions performed by the generated code. As an example, one
512 can override YYPEEK to check for the end of input before reading the
513 input character, or do some logging, etc.
514
515 Generic API has two styles:
516
517 • Function-like. This style is enabled with re2c:api:style = func‐
518 tions; configuration, and it is the default for C backend. In this
519 style API primitives should be defined as functions or macros with
520 parentheses, accepting the necessary arguments. For example, in C the
521 default pointer API can be defined in function-like style generic API
522 as follows:
523
    #define YYPEEK()                 *YYCURSOR
    #define YYSKIP()                 ++YYCURSOR
    #define YYBACKUP()               YYMARKER = YYCURSOR
    #define YYBACKUPCTX()            YYCTXMARKER = YYCURSOR
    #define YYRESTORE()              YYCURSOR = YYMARKER
    #define YYRESTORECTX()           YYCURSOR = YYCTXMARKER
    #define YYRESTORETAG(tag)        YYCURSOR = tag
    #define YYLESSTHAN(len)          YYLIMIT - YYCURSOR < len
    #define YYSTAGP(tag)             tag = YYCURSOR
    #define YYSTAGN(tag)             tag = NULL
    #define YYSHIFT(shift)           YYCURSOR += shift
    #define YYSHIFTSTAG(tag, shift)  tag += shift
536
537
538
539 • Free-form. This style is enabled with re2c:api:style = free-form;
540 configuration, and it is the default for Go backend. In this style
541 API primitives can be defined as free-form pieces of code, and in‐
542 stead of arguments they have interpolated variables of the form
543 @@{name}, or optionally just @@ if there is only one argument. The @@
544 text is called "sigil". It can be redefined to any other text with
545 re2c:api:sigil configuration. For example, the default pointer API
546 can be defined in free-form style generic API as follows:
547
    re2c:define:YYPEEK       = "*YYCURSOR";
    re2c:define:YYSKIP       = "++YYCURSOR";
    re2c:define:YYBACKUP     = "YYMARKER = YYCURSOR";
    re2c:define:YYBACKUPCTX  = "YYCTXMARKER = YYCURSOR";
    re2c:define:YYRESTORE    = "YYCURSOR = YYMARKER";
    re2c:define:YYRESTORECTX = "YYCURSOR = YYCTXMARKER";
    re2c:define:YYRESTORETAG = "YYCURSOR = @@{tag}";
    re2c:define:YYLESSTHAN   = "YYLIMIT - YYCURSOR < @@{len}";
    re2c:define:YYSTAGP      = "@@{tag} = YYCURSOR";
    re2c:define:YYSTAGN      = "@@{tag} = NULL";
    re2c:define:YYSHIFT      = "YYCURSOR += @@{shift}";
    re2c:define:YYSHIFTSTAG  = "@@{tag} += @@{shift}";
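
Because the generic API does not assume pointer semantics, the
primitives can equally be defined over an index into a bounded buffer.
The C sketch below defines YYPEEK, YYSKIP, YYBACKUP and YYRESTORE in
function-like style and uses them in a hand-written matcher for the
two rules "abc" and "a". The Input structure, the function
lex_a_or_abc and the matcher body are assumptions for illustration;
real re2c output would differ in layout. YYPEEK checks for the end of
input before reading, as suggested in the description of the generic
API.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *buf;
    size_t len;
    size_t cur;     /* current position (the role of YYCURSOR) */
    size_t mar;     /* backup position (the role of YYMARKER) */
} Input;

/* Function-like generic API over an index instead of a pointer. */
#define YYPEEK()    (in->cur < in->len ? in->buf[in->cur] : '\0')
#define YYSKIP()    (++in->cur)
#define YYBACKUP()  (in->mar = in->cur)
#define YYRESTORE() (in->cur = in->mar)

/* Returns 0 for rule "abc", 1 for rule "a", 2 for the default rule.
 * Must not be called at the end of input. */
int lex_a_or_abc(Input *in)
{
    assert(in != NULL && in->cur < in->len);
    char yych = YYPEEK();
    YYSKIP();                       /* every rule consumes the first unit */
    if (yych != 'a') return 2;      /* default rule */
    YYBACKUP();                     /* "a" has matched; remember where */
    if (YYPEEK() != 'b') return 1;
    YYSKIP();
    if (YYPEEK() != 'c') {
        YYRESTORE();                /* longer match failed: roll back */
        return 1;
    }
    YYSKIP();
    return 0;
}
```

On the input "abd" the matcher reads past "ab", fails to find "c", and
restores the position saved right after "a", so the shorter rule wins
and exactly one character is consumed.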
560
561 API primitives
562 Here is a list of API primitives that may be used by the generated code
563 in order to interface with the outer program. Which primitives are
564 needed depends on multiple factors, including the complexity of regular
565 expressions, input representation, buffering, the use of various fea‐
566 tures and so on. All the necessary primitives should be defined by the
567 user in the form of macros, functions, variables, free-form pieces of
568 code or any other suitable form. Re2c does not (and cannot) check the
569 definitions, so if anything is missing or defined incorrectly the gen‐
570 erated code will not compile.
571
572 YYCTYPE
573 The type of the input characters (code units). For ASCII,
574 EBCDIC and UTF-8 encodings it should be 1-byte unsigned integer.
575 For UTF-16 or UCS-2 it should be 2-byte unsigned integer. For
576 UTF-32 it should be 4-byte unsigned integer.
577
578 YYCURSOR
579 A pointer-like l-value that stores the current input position
580 (usually a pointer of type YYCTYPE*). Initially YYCURSOR should
581 point to the first input character. It is advanced by the gener‐
582 ated code. When a rule matches, YYCURSOR points to the one af‐
583 ter the last matched character. It is used only in the default C
584 API.
585
586 YYLIMIT
587 A pointer-like r-value that stores the end of input position
588 (usually a pointer of type YYCTYPE*). Initially YYLIMIT should
589 point to the one after the last available input character. It is
not changed by the generated code. Lexer compares YYCURSOR to
YYLIMIT in order to determine if there are enough input
characters left. YYLIMIT is used only in the default C API.
593
594 YYMARKER
595 A pointer-like l-value (usually a pointer of type YYCTYPE*) that
stores the position of the latest matched rule. It is used to
restore the YYCURSOR position if the longer match fails and the
lexer needs to roll back. Initialization is not needed. YYMARKER is
599 used only in the default C API.
600
601 YYCTXMARKER
602 A pointer-like l-value that stores the position of the trailing
603 context (usually a pointer of type YYCTYPE*). No initialization
604 is needed. It is used only in the default C API, and only with
605 the lookahead operator /.
606
607 YYFILL API primitive with one argument len. The meaning of YYFILL is
608 to provide at least len more input characters or fail. If EOF
609 rule is used, YYFILL should always return to the calling func‐
610 tion; the return value should be zero on success and non-zero on
611 failure. If EOF rule is not used, YYFILL return value is ignored
612 and it should not return on failure. Maximal value of len is YY‐
613 MAXFILL, which can be generated with /*!max:re2c*/ directive.
614 The definition of YYFILL can be either function-like or
615 free-form depending on the API style (see re2c:api:style and
616 re2c:define:YYFILL:naked).
617
618 YYMAXFILL
619 An integral constant equal to the maximal value of YYFILL argu‐
620 ment. It can be generated with /*!max:re2c*/ directive.
621
622 YYLESSTHAN
623 A generic API primitive with one argument len. It should be de‐
624 fined as an r-value of boolean type that equals true if and only
if there are fewer than len input characters left. The definition
626 can be either function-like or free-form depending on the API
627 style (see re2c:api:style).
628
629 YYPEEK A generic API primitive with no arguments. It should be defined
630 as an r-value of type YYCTYPE that is equal to the character at
631 the current input position. The definition can be either func‐
632 tion-like or free-form depending on the API style (see
633 re2c:api:style).
634
635 YYSKIP A generic API primitive with no arguments. The meaning of
636 YYSKIP is to advance the current input position by one charac‐
637 ter. The definition can be either function-like or free-form de‐
638 pending on the API style (see re2c:api:style).
639
640 YYBACKUP
641 A generic API primitive with no arguments. The meaning of YY‐
642 BACKUP is to save the current input position, which is later re‐
643 stored with YYRESTORE. The definition should be either func‐
644 tion-like or free-form depending on the API style (see
645 re2c:api:style).
646
647 YYRESTORE
648 A generic API primitive with no arguments. The meaning of YYRE‐
649 STORE is to restore the current input position to the value
650 saved by YYBACKUP. The definition should be either func‐
651 tion-like or free-form depending on the API style (see
652 re2c:api:style).
653
654 YYBACKUPCTX
655 A generic API primitive with zero arguments. The meaning of YY‐
656 BACKUPCTX is to save the current input position as the position
657 of the trailing context, which is later restored by YYRE‐
658 STORECTX. The definition should be either function-like or
659 free-form depending on the API style (see re2c:api:style).
660
661 YYRESTORECTX
662 A generic API primitive with no arguments. The meaning of YYRE‐
663 STORECTX is to restore the trailing context position saved with
664 YYBACKUPCTX. The definition should be either function-like or
665 free-form depending on the API style (see re2c:api:style).
666
667 YYRESTORETAG
668 A generic API primitive with one argument tag. The meaning of
669 YYRESTORETAG is to restore the trailing context position to the
670 value of tag. The definition should be either function-like or
671 free-form depending on the API style (see re2c:api:style).
672
673 YYSTAGP
674 A generic API primitive with one argument tag. The meaning of
675 YYSTAGP is to set tag value to the current input position. The
676 definition should be either function-like or free-form depending
677 on the API style (see re2c:api:style).
678
679 YYSTAGN
680 A generic API primitive with one argument tag. The meaning of
681 YYSTAGN is to set tag value to null (or some default value). The
682 definition should be either function-like or free-form depending
683 on the API style (see re2c:api:style).
684
685 YYMTAGP
686 A generic API primitive with one argument tag. The meaning of
687 YYMTAGP is to append the current position to the history of tag.
688 The definition should be either function-like or free-form de‐
689 pending on the API style (see re2c:api:style).
690
691 YYMTAGN
692 A generic API primitive with one argument tag. The meaning of
693 YYMTAGN is to append null (or some other default) value to the
694 history of tag. The definition can be either function-like or
695 free-form depending on the API style (see re2c:api:style).
696
697 YYSHIFT
698 A generic API primitive with one argument shift. The meaning of
699 YYSHIFT is to shift the current input position by shift charac‐
700 ters (the shift value may be negative). The definition can be
701 either function-like or free-form depending on the API style
702 (see re2c:api:style).
703
704 YYSHIFTSTAG
705 A generic API primitive with two arguments, tag and shift. The
706 meaning of YYSHIFTSTAG is to shift tag by shift characters (the
707 shift value may be negative). The definition can be either
708 function-like or free-form depending on the API style (see
709 re2c:api:style).
710
711 YYSHIFTMTAG
712 A generic API primitive with two arguments, tag and shift. The
713 meaning of YYSHIFTMTAG is to shift the latest value in the his‐
714 tory of tag by shift characters (the shift value may be nega‐
715 tive). The definition should be either function-like or
716 free-form depending on the API style (see re2c:api:style).
717
718 YYMAXNMATCH
719 An integral constant equal to the maximal number of POSIX cap‐
720 turing groups in a rule. It is generated with /*!maxn‐
721 match:re2c*/ directive.
722
723 YYCONDTYPE
724 The type of the condition enum. It should be generated either
725 with /*!types:re2c*/ directive or -t --type-header option.
726
727 YYGETCONDITION
728 An API primitive with zero arguments. It should be defined as
729 an r-value of type YYCONDTYPE that is equal to the current con‐
730 dition identifier. The definition can be either function-like or
731 free-form depending on the API style (see re2c:api:style and
732 re2c:define:YYGETCONDITION:naked).
733
734 YYSETCONDITION
735 An API primitive with one argument cond. The meaning of YYSET‐
736 CONDITION is to set the current condition identifier to cond.
737 The definition should be either function-like or free-form de‐
738 pending on the API style (see re2c:api:style and re2c:define:YY‐
739 SETCONDITION@cond).
740
741 YYGETSTATE
742 An API primitive with zero arguments. It should be defined as
743 an r-value of integer type that is equal to the current lexer
744 state. Should be initialized to -1. The definition can be either
745 function-like or free-form depending on the API style (see
746 re2c:api:style and re2c:define:YYGETSTATE:naked).
747
748 YYSETSTATE
749 An API primitive with one argument state. The meaning of YYSET‐
750 STATE is to set the current lexer state to state. The defini‐
751 tion should be either function-like or free-form depending on
752 the API style (see re2c:api:style and re2c:define:YYSET‐
753 STATE@state).
754
755 YYDEBUG
756 A debug API primitive with two arguments. It can be used to de‐
757 bug the generated code (with -d --debug-output option). YYDEBUG
758 should return no value and accept two arguments: state (either a
759 DFA state index or -1) and symbol (the current input symbol).
760
761 yych An l-value of type YYCTYPE that stores the current input charac‐
762 ter. User definition is necessary only with -f --storable-state
763 option.
764
765 yyaccept
766 An l-value of unsigned integral type that stores the number of
767 the latest matched rule. User definition is necessary only with
768 -f --storable-state option.
769
770 yynmatch
771 An l-value of unsigned integral type that stores the number of
772 POSIX capturing groups in the matched rule. Used only with -P
773 --posix-captures option.
774
775 yypmatch
776 An array of l-values that are used to hold the tag values corre‐
777 sponding to the capturing parentheses in the matching rule. Ar‐
778 ray length must be at least yynmatch * 2 (usually YYMAXNMATCH *
779 2 is a good choice). Used only with -P --posix-captures option.
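
The contract of YYFILL described above (provide at least len more
input characters, return zero on success and non-zero on failure) can
be sketched in C with a small fixed-size buffer. Everything in this
sketch is an assumption made for illustration: the Input structure,
the in-memory source, the buffer size and the function names are not
part of re2c. The function discards the data before the current token,
moves the rest to the start of the buffer and appends new input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUFSIZE 8

typedef struct {
    const char *src;    /* hypothetical in-memory input source */
    size_t src_len;
    size_t src_pos;     /* how much of src was copied into buf */
    char buf[BUFSIZE];
    char *tok;          /* start of the current token */
    char *cur;          /* the role of YYCURSOR */
    char *lim;          /* the role of YYLIMIT */
} Input;

void input_init(Input *in, const char *src, size_t src_len)
{
    in->src = src;
    in->src_len = src_len;
    in->src_pos = 0;
    in->tok = in->cur = in->lim = in->buf;  /* buffer starts empty */
}

/* The role of YYFILL(len): returns 0 if at least `need` characters are
 * available at in->cur afterwards, and non-zero otherwise. */
int fill(Input *in, size_t need)
{
    assert(in->buf <= in->tok && in->tok <= in->cur && in->cur <= in->lim);
    size_t dead = (size_t)(in->tok - in->buf);  /* already lexed */
    size_t kept = (size_t)(in->lim - in->tok);  /* still needed */

    /* Shift the live part of the buffer to the beginning. */
    memmove(in->buf, in->tok, kept);
    in->tok -= dead;
    in->cur -= dead;
    in->lim -= dead;

    /* Append as much new input as fits. */
    size_t room = BUFSIZE - kept;
    size_t left = in->src_len - in->src_pos;
    size_t n = left < room ? left : room;
    memcpy(in->lim, in->src + in->src_pos, n);
    in->src_pos += n;
    in->lim += n;

    return (size_t)(in->lim - in->cur) >= need ? 0 : 1;
}
```

A buffer of this kind must be at least as large as the longest token
plus the largest value that can be requested, which is why YYMAXFILL
is useful when choosing the buffer size.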
780
781 Directives
782 Below is the list of all directives provided by re2c (in no particular
783 order). More information on each directive can be found in the related
784 sections.
785
786 /*!re2c ... */
787 A standard re2c block.
788
789 %{ ... %}
790 A standard re2c block in -F --flex-support mode.
791
792 /*!rules:re2c ... */
793 A reusable re2c block (requires -r --reuse option).
794
795 /*!use:re2c ... */
796 A block that reuses previous rules-block specified with
797 /*!rules:re2c ... */ (requires -r --reuse option).
798
799 /*!ignore:re2c ... */
A block whose contents are ignored and removed from the output
file.
802
803 /*!max:re2c*/
804 This directive is substituted with the macro-definition of YY‐
805 MAXFILL.
806
807 /*!maxnmatch:re2c*/
808 This directive is substituted with the macro-definition of YY‐
809 MAXNMATCH (requires -P --posix-captures option).
810
811 /*!getstate:re2c*/
812 This directive is substituted with conditional dispatch on lexer
813 state (requires -f --storable-state option).
814
815 /*!types:re2c ... */
816 This directive is substituted with the definition of condition
817 enum (requires -c --conditions option).
818
819 /*!stags:re2c ... */, /*!mtags:re2c ... */
820 These directives allow one to specify a template piece of code
821 that is expanded for each s-tag/m-tag variable generated by
822 re2c. This block has two optional configurations: format = "@@";
823 (specifies the template where @@ is substituted with the name of
824 each tag variable), and separator = ""; (specifies the piece of
825 code used to join the generated pieces for different tag vari‐
826 ables).
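
The effect of format and separator can be modelled with a few lines of Go. This is only a sketch of the expansion described above, not the code re2c uses; the function name expandTags and the tag names in the usage note are assumptions made for illustration.

```go
package main

import "strings"

// expandTags is a hypothetical model of the stags/mtags expansion: the
// format template is instantiated once per tag variable (with the sigil
// "@@" replaced by the tag name) and the pieces are joined with separator.
func expandTags(format, separator string, tags []string) string {
	pieces := make([]string, len(tags))
	for i, tag := range tags {
		pieces[i] = strings.ReplaceAll(format, "@@", tag)
	}
	return strings.Join(pieces, separator)
}
```

With format "var @@ int" and a newline as separator, two tag variables named yyt1 and yyt2 would expand to two declarations, one per line.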
827
828 /*!include:re2c FILE */
829 This directive allows one to include FILE (in the same sense as
830 #include directive in C/C++).
831
832 /*!header:re2c:on*/
833 This directive marks the start of header file. Everything after
834 it and up to the following /*!header:re2c:off*/ directive is
835 processed by re2c and written to the header file specified with
836 -t --type-header option.
837
838 /*!header:re2c:off*/
839 This directive marks the end of header file started with
840 /*!header:re2c:on*/.
841
842 Configurations
843 re2c:flags:t, re2c:flags:type-header
844 Specify the name of the generated header file relative to the
845 directory of the output file. (Same as -t, --type-header com‐
846 mand-line option except that the filepath is relative.)
847
848 re2c:flags:input
849 Same as --input command-line option.
850
851 re2c:api:style
852 Allows one to specify the style of generic API. Possible values
853 are functions and free-form. With functions style (the default
854 for the C backend) API primitives behave like functions, and
855 re2c generates parentheses with an argument list after the name
856 of each primitive. With free-form style (the default for the Go
857 backend) re2c treats API definitions as interpolated strings and
858 substitutes argument placeholders with the actual argument val‐
859 ues. This option can be overridden by options for individual
860 API primitives, e.g. re2c:define:YYFILL:naked for YYFILL.
861
862 re2c:api:sigil
863 Allows one to specify the "sigil" symbol (or string) that is
864 used to recognize argument placeholders in the definitions of
865 generic API primitives. The default value is @@. Placeholders
866 start with sigil, followed by the argument name in curly braces.
867 For example, if sigil is set to $, then placeholders will have
868 the form ${name}. Single-argument APIs may use shorthand nota‐
869 tion without the name in braces. This option can be overridden
870 by options for individual API primitives, e.g. re2c:define:YY‐
871 FILL@len for YYFILL.
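
The placeholder rules can be sketched in Go as follows. The sketch models a single-argument primitive only, and the function name substitute is hypothetical; it illustrates the long form (sigil followed by the name in braces) and the shorthand form (the bare sigil), not the actual substitution code in re2c.

```go
package main

import "strings"

// substitute is a hypothetical model of placeholder expansion for a
// single-argument API primitive. It handles the long form sigil+"{name}"
// first, so that the shorthand pass cannot split it apart. It assumes that
// the argument value does not itself contain the sigil.
func substitute(definition, sigil, name, value string) string {
	definition = strings.ReplaceAll(definition, sigil+"{"+name+"}", value)
	return strings.ReplaceAll(definition, sigil, value)
}
```

For example, with the default sigil the definition "limit - cursor < @@{len}" and the argument 3 would give "limit - cursor < 3", and with the sigil set to $ the definition "fill($)" would give "fill(3)".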
872
873 re2c:define:YYCTYPE
874 Defines YYCTYPE (see the user interface section).
875
876 re2c:define:YYCURSOR
877 Defines C API primitive YYCURSOR (see the user interface sec‐
878 tion).
879
880 re2c:define:YYLIMIT
881 Defines C API primitive YYLIMIT (see the user interface sec‐
882 tion).
883
884 re2c:define:YYMARKER
885 Defines C API primitive YYMARKER (see the user interface sec‐
886 tion).
887
888 re2c:define:YYCTXMARKER
889 Defines C API primitive YYCTXMARKER (see the user interface sec‐
890 tion).
891
892 re2c:define:YYFILL
893 Defines API primitive YYFILL (see the user interface section).
894
895 re2c:define:YYFILL@len
896 Specifies the sigil used for argument substitution in YYFILL
897 definition. Defaults to @@. Overrides the more generic
898 re2c:api:sigil configuration.
899
900 re2c:define:YYFILL:naked
901 Allows one to override re2c:api:style for YYFILL. Value 0 cor‐
902 responds to free-form API style.
903
904 re2c:yyfill:enable
905 Defaults to 1 (YYFILL is enabled). Set this to zero to suppress
906 the generation of YYFILL. Use warnings (-W option) and re2c:sen‐
907 tinel configuration to verify that the generated lexer cannot
908 read past the end of input, as this might introduce severe secu‐
909 rity issues to your programs.
910
911 re2c:yyfill:parameter
912 Controls the argument in the parentheses that follow YYFILL. De‐
913 faults to 1, which means that the argument is generated. If
914 zero, the argument is omitted. Can be overridden with re2c:de‐
915 fine:YYFILL:naked or re2c:api:style.
916
917 re2c:eof
918 Specifies the sentinel symbol used with EOF rule $ to check for
919 the end of input in the generated lexer. The default value is -1
920 (EOF rule is not used). Other possible values include all valid
921 code units. Only decimal numbers are recognized.
922
923 re2c:sentinel
924 Specifies the sentinel symbol used with the sentinel method of
925 checking for the end of input in the generated lexer (the case
926 when bounds checking is disabled with re2c:yyfill:enable = 0;
927 and EOF rule $ is not used). This configuration does not affect
928 code generation. It is used by re2c to verify that the sentinel
929 symbol is not allowed in the middle of the rule, and prevent
930 possible reads past the end of buffer in the generated lexer.
931 The default value is -1 (re2c assumes that the sentinel symbol
932 is 0, which is the most common case). Other possible values in‐
933 clude all valid code units. Only decimal numbers are recognized.
934
935 re2c:define:YYLESSTHAN
936 Defines generic API primitive YYLESSTHAN (see the user interface
937 section).
938
939 re2c:yyfill:check
Setting this to zero suppresses the generation of the YYFILL
check (YYLESSTHAN in generic API or a YYLIMIT-based comparison in
default C API). This configuration is useful when the necessary
input is always available. It defaults to 1 (the check is
generated).
945
946 re2c:label:yyFillLabel
947 Allows one to change the prefix of YYFILL labels (used with EOF
948 rule or with storable states).
949
950 re2c:define:YYPEEK
951 Defines generic API primitive YYPEEK (see the user interface
952 section).
953
954 re2c:define:YYSKIP
955 Defines generic API primitive YYSKIP (see the user interface
956 section).
957
958 re2c:define:YYBACKUP
959 Defines generic API primitive YYBACKUP (see the user interface
960 section).
961
962 re2c:define:YYBACKUPCTX
963 Defines generic API primitive YYBACKUPCTX (see the user inter‐
964 face section).
965
966 re2c:define:YYRESTORE
967 Defines generic API primitive YYRESTORE (see the user interface
968 section).
969
970 re2c:define:YYRESTORECTX
971 Defines generic API primitive YYRESTORECTX (see the user inter‐
972 face section).
973
974 re2c:define:YYRESTORETAG
975 Defines generic API primitive YYRESTORETAG (see the user inter‐
976 face section).
977
978 re2c:define:YYSHIFT
979 Defines generic API primitive YYSHIFT (see the user interface
980 section).
981
982 re2c:define:YYSHIFTMTAG
983 Defines generic API primitive YYSHIFTMTAG (see the user inter‐
984 face section).
985
986 re2c:define:YYSHIFTSTAG
987 Defines generic API primitive YYSHIFTSTAG (see the user inter‐
988 face section).
989
990 re2c:define:YYSTAGN
991 Defines generic API primitive YYSTAGN (see the user interface
992 section).
993
994 re2c:define:YYSTAGP
995 Defines generic API primitive YYSTAGP (see the user interface
996 section).
997
998 re2c:define:YYMTAGN
999 Defines generic API primitive YYMTAGN (see the user interface
1000 section).
1001
1002 re2c:define:YYMTAGP
1003 Defines generic API primitive YYMTAGP (see the user interface
1004 section).
1005
1006 re2c:flags:T, re2c:flags:tags
1007 Same as -T --tags command-line option.
1008
1009 re2c:flags:P, re2c:flags:posix-captures
1010 Same as -P --posix-captures command-line option.
1011
1012 re2c:tags:expression
Allows one to customize the way re2c addresses tag variables.
By default re2c generates expressions of the form yyt<N>. This
might be inconvenient, for example if tag variables are defined
as fields in a struct. re2c recognizes placeholders of the form
@@{tag} or @@ and replaces them with the actual tag name. Sigil
@@ can be redefined with re2c:api:sigil configuration. For
example, setting re2c:tags:expression = "p->@@"; results in
expressions of the form p->yyt<N> in the generated code.
1021
1022 re2c:tags:prefix
1023 Allows one to override the prefix of tag variables (defaults to
1024 yyt).
1025
1026 re2c:flags:lookahead
1027 Same as inverted --no-lookahead command-line option.
1028
1029 re2c:flags:optimize-tags
1030 Same as inverted --no-optimize-tags command-line option.
1031
1032 re2c:define:YYCONDTYPE
1033 Defines YYCONDTYPE (see the user interface section).
1034
1035 re2c:define:YYGETCONDITION
1036 Defines API primitive YYGETCONDITION (see the user interface
1037 section).
1038
1039 re2c:define:YYGETCONDITION:naked
1040 Allows one to override re2c:api:style for YYGETCONDITION. Value
1041 0 corresponds to free-form API style.
1042
1043 re2c:define:YYSETCONDITION
1044 Defines API primitive YYSETCONDITION (see the user interface
1045 section).
1046
1047 re2c:define:YYSETCONDITION@cond
1048 Specifies the sigil used for argument substitution in YYSETCON‐
1049 DITION definition. The default value is @@. Overrides the more
1050 generic re2c:api:sigil configuration.
1051
1052 re2c:define:YYSETCONDITION:naked
1053 Allows one to override re2c:api:style for YYSETCONDITION. Value
1054 0 corresponds to free-form API style.
1055
1056 re2c:cond:goto
1057 Allows one to customize the goto statements used with the short‐
1058 cut :=> rules in conditions. The default value is goto @@;.
Placeholders are substituted with the condition name (see
re2c:api:sigil and re2c:cond:goto@cond).
1061
1062 re2c:cond:goto@cond
1063 Specifies the sigil used for argument substitution in
1064 re2c:cond:goto definition. The default value is @@. Overrides
1065 the more generic re2c:api:sigil configuration.
1066
1067 re2c:cond:divider
Defines the divider for condition blocks. The default value is
/* *********************************** */. Placeholders are
substituted with the condition name (see re2c:api:sigil and
re2c:cond:divider@cond).
1072
1073 re2c:cond:divider@cond
1074 Specifies the sigil used for argument substitution in
1075 re2c:cond:divider definition. The default value is @@. Over‐
1076 rides the more generic re2c:api:sigil configuration.
1077
1078 re2c:condprefix
1079 Specifies the prefix used for condition labels. The default
1080 value is yyc_.
1081
1082 re2c:condenumprefix
1083 Specifies the prefix used for condition identifiers. The de‐
1084 fault value is yyc.
1085
1086 re2c:define:YYGETSTATE
1087 Defines API primitive YYGETSTATE (see the user interface sec‐
1088 tion).
1089
1090 re2c:define:YYGETSTATE:naked
1091 Allows one to override re2c:api:style for YYGETSTATE. Value 0
1092 corresponds to free-form API style.
1093
1094 re2c:define:YYSETSTATE
1095 Defines API primitive YYSETSTATE (see the user interface sec‐
1096 tion).
1097
1098 re2c:define:YYSETSTATE@state
1099 Specifies the sigil used for argument substitution in YYSETSTATE
1100 definition. The default value is @@. Overrides the more generic
1101 re2c:api:sigil configuration.
1102
1103 re2c:define:YYSETSTATE:naked
1104 Allows one to override re2c:api:style for YYSETSTATE. Value 0
1105 corresponds to free-form API style.
1106
1107 re2c:state:abort
1108 If set to a positive integer value, changes the form of the
1109 YYGETSTATE switch: instead of using default case to jump to the
1110 beginning of the lexer block, a -1 case is used, and the default
1111 case aborts the program.
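
The difference between the two forms of the switch can be sketched in Go. The returned strings stand in for jump targets, and the label names (start, yyFillLabel0, yyFillLabel1) as well as the function name dispatch are illustrative assumptions; the real dispatch code is generated by re2c and uses goto.

```go
package main

// dispatch is a hypothetical model of the YYGETSTATE switch. The returned
// strings stand in for the labels that the generated code would jump to.
func dispatch(state int, abort bool) string {
	if abort {
		// Form used when re2c:state:abort is positive: -1 is an explicit
		// case, and any unknown state aborts the program.
		switch state {
		case -1:
			return "start"
		case 0:
			return "yyFillLabel0"
		case 1:
			return "yyFillLabel1"
		default:
			panic("invalid lexer state")
		}
	}
	// Default form: any state without its own case (including -1) jumps
	// to the beginning of the lexer block.
	switch state {
	case 0:
		return "yyFillLabel0"
	case 1:
		return "yyFillLabel1"
	default:
		return "start"
	}
}
```

In the default form a corrupted state value silently restarts the lexer, while in the abort form it is reported immediately, which may be easier to debug.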
1112
1113 re2c:state:nextlabel
With storable states, controls whether the YYGETSTATE block
is followed by a yyNext label (the default value is zero, which
corresponds to no label). Instead of using yyNext it is possible
to use re2c:startlabel to force the generation of a specific
start label. Instead of using labels it is often more convenient
to generate YYGETSTATE code using /*!getstate:re2c*/.
1120
1121 re2c:label:yyNext
1122 Allows one to change the name of the yyNext label.
1123
1124 re2c:startlabel
1125 Controls the generation of start label for the next lexer block.
1126 The default value is zero, which means that the start label is
1127 generated only if it is used. An integer value greater than zero
1128 forces the generation of start label even if it is unused by the
1129 lexer. A string value also forces start label generation and
1130 sets the label name to the specified string. This configuration
1131 applies only to the current block (it is reset to default for
1132 the next block).
1133
1134 re2c:flags:s, re2c:flags:nested-ifs
1135 Same as -s --nested-ifs command-line option.
1136
1137 re2c:flags:b, re2c:flags:bit-vectors
1138 Same as -b --bit-vectors command-line option.
1139
1140 re2c:variable:yybm
1141 Overrides the name of the yybm variable.
1142
1143 re2c:yybm:hex
1144 Defaults to zero (a decimal bitmap table is generated). If set
1145 to nonzero, a hexadecimal table is generated.
1146
1147 re2c:flags:g, re2c:flags:computed-gotos
1148 Same as -g --computed-gotos command-line option.
1149
1150 re2c:cgoto:threshold
1151 With -g --computed-gotos option this value specifies the com‐
1152 plexity threshold that triggers the generation of jump tables
1153 instead of nested if statements and bitmaps. The default value
1154 is 9.
1155
1156 re2c:flags:case-ranges
1157 Same as --case-ranges command-line option.
1158
1159 re2c:flags:e, re2c:flags:ecb
1160 Same as -e --ecb command-line option.
1161
1162 re2c:flags:8, re2c:flags:utf-8
1163 Same as -8 --utf-8 command-line option.
1164
1165 re2c:flags:w, re2c:flags:wide-chars
1166 Same as -w --wide-chars command-line option.
1167
1168 re2c:flags:x, re2c:flags:utf-16
1169 Same as -x --utf-16 command-line option.
1170
1171 re2c:flags:u, re2c:flags:unicode
1172 Same as -u --unicode command-line option.
1173
1174 re2c:flags:encoding-policy
1175 Same as --encoding-policy command-line option.
1176
1177 re2c:flags:empty-class
1178 Same as --empty-class command-line option.
1179
1180 re2c:flags:case-insensitive
1181 Same as --case-insensitive command-line option.
1182
1183 re2c:flags:case-inverted
1184 Same as --case-inverted command-line option.
1185
1186 re2c:flags:i, re2c:flags:no-debug-info
1187 Same as -i --no-debug-info command-line option.
1188
1189 re2c:indent:string
1190 Specifies the string to use for indentation. The default value
1191 is "\t". Indent string should contain only whitespace charac‐
1192 ters. To disable indentation entirely, set this configuration
1193 to empty string "".
1194
1195 re2c:indent:top
1196 Specifies the minimum amount of indentation to use. The default
1197 value is zero. The value should be a non-negative integer num‐
1198 ber.
1199
1200 re2c:labelprefix
1201 Allows one to change the prefix of DFA state labels. The de‐
1202 fault value is yy.
1203
1204 re2c:yych:emit
1205 Set this to zero to suppress the generation of yych definition.
1206 Defaults to 1 (the definition is generated).
1207
1208 re2c:variable:yych
1209 Overrides the name of the yych variable.
1210
1211 re2c:yych:conversion
1212 If set to nonzero, re2c automatically generates a cast to YYC‐
1213 TYPE every time yych is read. Defaults to zero (no cast).
1214
1215 re2c:variable:yyaccept
1216 Overrides the name of the yyaccept variable.
1217
1218 re2c:variable:yytarget
1219 Overrides the name of the yytarget variable.
1220
1221 re2c:variable:yystable
1222 Deprecated.
1223
1224 re2c:variable:yyctable
1225 When both -c --conditions and -g --computed-gotos are active,
1226 re2c will use this variable to generate a static jump table for
1227 YYGETCONDITION.
1228
1229 re2c:define:YYDEBUG
1230 Defines YYDEBUG (see the user interface section).
1231
1232 re2c:flags:d, re2c:flags:debug-output
1233 Same as -d --debug-output command-line option.
1234
1235 re2c:flags:dfa-minimization
1236 Same as --dfa-minimization command-line option.
1237
1238 re2c:flags:eager-skip
1239 Same as --eager-skip command-line option.
1240
1242 re2c uses the following syntax for regular expressions:
1243
1244 • "foo" case-sensitive string literal
1245
1246 • 'foo' case-insensitive string literal
1247
1248 • [a-xyz], [^a-xyz] character class (possibly negated)
1249
1250 • . any character except newline
1251
1252 • R \ S difference of character classes R and S
1253
1254 • R* zero or more occurrences of R
1255
1256 • R+ one or more occurrences of R
1257
1258 • R? optional R
1259
1260 • R{n} repetition of R exactly n times
1261
1262 • R{n,} repetition of R at least n times
1263
1264 • R{n,m} repetition of R from n to m times
1265
1266 • (R) just R; parentheses are used to override precedence or for
1267 POSIX-style submatch
1268
1269 • R S concatenation: R followed by S
1270
1271 • R | S alternative: R or S
1272
1273 • R / S lookahead: R followed by S, but S is not consumed
1274
1275 • name the regular expression defined as name (or literal string "name"
1276 in Flex compatibility mode)
1277
1278 • {name} the regular expression defined as name in Flex compatibility
1279 mode
1280
1281 • @stag an s-tag: saves the last input position at which @stag matches
1282 in a variable named stag
1283
1284 • #mtag an m-tag: saves all input positions at which #mtag matches in a
1285 variable named mtag
1286
1287 Character classes and string literals may contain the following escape
1288 sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and hexa‐
1289 decimal escapes \xhh, \uhhhh and \Uhhhhhhhh.
1290
1292 One of the main problems for the lexer is to know when to stop. There
1293 are a few terminating conditions:
1294
1295 • the lexer may match some rule (including default rule *) and come to
1296 a final state
1297
1298 • the lexer may fail to match any rule and come to a default state
1299
1300 • the lexer may reach the end of input
1301
1302 The first two conditions terminate the lexer in a "natural" way: it
1303 comes to a state with no outgoing transitions, and the matching auto‐
1304 matically stops. The third condition, end of input, is different: it
1305 may happen in any state, and the lexer should be able to handle it.
1306 Checking for the end of input interrupts the normal lexer workflow and
1307 adds conditional branches to the generated program, therefore it is
1308 necessary to minimize the number of such checks. re2c supports a few
1309 different methods for end of input handling. Which one to use depends
1310 on the complexity of regular expressions, the need for buffering, per‐
1311 formance considerations and other factors. Here is a list of all meth‐
1312 ods:
1313
1314 • Sentinel character. This method eliminates the need for the end of
1315 input checks altogether. It is simple and efficient, but limited to
1316 the case when there is a natural "sentinel" character that can never
1317 occur in valid input. This character may still occur in invalid in‐
1318 put, but it is not allowed by the regular expressions, except perhaps
1319 as the last character of a rule. The sentinel character is appended
1320 at the end of input and serves as a stop signal: when the lexer reads
1321 it, it must be either the end of input, or a syntax error. In both
1322 cases the lexer stops. This method is used if YYFILL is disabled
1323 with re2c:yyfill:enable = 0; and re2c:eof has the default value -1.
1324
1325
1326
• Sentinel character with bounds checks. This method is generic: it
allows one to handle any input without restrictions on the regular
expressions. The idea is to reduce the number of end of input checks
by performing them only on certain characters. Similar to the
"sentinel character" method, one of the characters is chosen as a
"sentinel" and appended at the end of input. However, there is no
restriction on where the sentinel character may occur (in fact, any
character can be chosen for a sentinel). When the lexer reads this
character, it additionally performs a bounds check. If the current
position is within bounds, the lexer will resume matching and handle
the sentinel character as a regular one. Otherwise it will try to
get more input with YYFILL (unless YYFILL is disabled). If more
input is available, the lexer will rematch the last character and
continue as if the sentinel never occurred. Otherwise it is the real
end of input, and the lexer will stop. This method is used if
re2c:eof has a non-negative value (it should be set to the ordinal of
the sentinel character). YYFILL must be either defined or disabled
with re2c:yyfill:enable = 0;.
1345
1346
1347
• Bounds checks with padding. This method is the default one. It is
generic, and it is usually faster than the "sentinel character with
bounds checks" method, but also more complex to use. The idea is to
partition the underlying finite-state automaton into strongly
connected components (SCCs), and generate only one bounds check per SCC,
but make it check for multiple characters at once (enough to cover
the longest non-looping path in the SCC). This way the checks are
less frequent, which makes the lexer run much faster. If a check
shows that there is not enough input, the lexer will invoke YYFILL,
which must either supply enough input or not return (in the latter
case the lexer will stop). This approach has a problem with matching
short lexemes at the end of input, because the multi-character check
requires enough characters to cover the longest possible lexeme. To
fix this problem, it is necessary to append a few fake characters at
the end of input. The padding should not form a valid lexeme suffix
to avoid fooling the lexer into matching it as part of the input. The
minimum sufficient length of padding is YYMAXFILL and it is
autogenerated by re2c with /*!max:re2c*/. This method is used if
re2c:yyfill:enable has the default nonzero value, and re2c:eof has
the default value -1. YYFILL must be defined.
1368
1369
1370
• Custom methods with generic API. Generic API allows one to override
basic operations like reading a character, which makes it possible to
include the end of input checks as part of them. Such methods are
error-prone and should be used with caution, only if other methods
cannot be used. These methods are used if generic API is enabled
with --input custom or re2c:flags:input = custom; and default bounds
checks are disabled with re2c:yyfill:enable = 0;. Note that the use
of generic API does not imply the use of custom methods; it merely
allows it.
1380
1381 The following subsections contain an example of each method.
1382
1383 Sentinel character
1384 In this example the lexer uses a sentinel character to handle the end
1385 of input. The program counts space-separated words in a null-termi‐
1386 nated string. Configuration re2c:yyfill:enable = 0; suppresses the
1387 generation of bounds checks and YYFILL invocations. The sentinel char‐
1388 acter is null. It is the last character of each input string, and it
1389 is not allowed in the middle of a lexeme by any of the rules (in par‐
1390 ticular, it is not included in the character ranges, where it is easy
1391 to overlook). If a null occurs in the middle of a string, it is a syn‐
1392 tax error and the lexer will match default rule *, but it won't read
1393 past the end of input or crash. -Wsentinel-in-midrule warning verifies
1394 that the rules do not allow sentinel in the middle (it is possible to
1395 tell re2c which character is used as a sentinel with re2c:sentinel con‐
1396 figuration --- the default assumption is null, since this is the most
1397 common case).
1398
//go:generate re2go $INPUT -o $OUTPUT
package main

import "testing"

// Expects a null-terminated string.
func lex(str string) int {
	var cursor int
	count := 0
loop:
	/*!re2c
	re2c:yyfill:enable = 0;
	re2c:define:YYCTYPE = byte;
	re2c:define:YYPEEK = "str[cursor]";
	re2c:define:YYSKIP = "cursor += 1";

	*      { return -1 }
	[\x00] { return count }
	[a-z]+ { count += 1; goto loop }
	[ ]+   { goto loop }
	*/
}

func TestLex(t *testing.T) {
	var tests = []struct {
		res int
		str string
	}{
		{0, "\000"},
		{3, "one two three\000"},
		{-1, "f0ur\000"},
	}

	for _, x := range tests {
		t.Run(x.str, func(t *testing.T) {
			res := lex(x.str)
			if res != x.res {
				t.Errorf("got %d, want %d", res, x.res)
			}
		})
	}
}
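
For comparison, the same rules can be written by hand in plain Go, which makes it visible why no explicit end of input check is needed. This is only a sketch of the idea and not the code that re2go generates; the function name countWords is an assumption made for illustration.

```go
package main

// countWords is a hand-written sketch of the sentinel method. It expects a
// null-terminated string and never compares the cursor against the length:
// every loop stops at the null because null belongs to neither [a-z] nor
// the space class.
func countWords(str string) int {
	cursor, count := 0, 0
	for {
		c := str[cursor]
		switch {
		case c == 0:
			// Sentinel: the end of input.
			return count
		case c >= 'a' && c <= 'z':
			// Rule [a-z]+
			for str[cursor] >= 'a' && str[cursor] <= 'z' {
				cursor++
			}
			count++
		case c == ' ':
			// Rule [ ]+
			for str[cursor] == ' ' {
				cursor++
			}
		default:
			// Default rule *: a syntax error.
			return -1
		}
	}
}
```

If the terminating null were missing, the loops would run past the end of the string (in Go this ends in a runtime panic), which is why the sentinel must always be appended.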
1441
1442
1443 Sentinel character with bounds checks
In this example the lexer uses a sentinel character with bounds checks
to handle the end of input (this method was added in version 1.2). The
program counts single-quoted strings separated with spaces. The
sentinel character is null, which is specified with re2c:eof = 0;
configuration. Null is the last character of each input string; this
is essential to detect the end of input. Null, as well as any other
character, is allowed in the middle of a rule (for example,
'aaa\0aa'\0 is valid input, but 'aaa\0 is a syntax error). Bounds
checks are generated in each state that has a switch on an input
character, in the conditional branch that corresponds to null (that
branch may also cover other characters: re2c does not split out a
separate branch for the sentinel, because increasing the number of
branches degrades performance more than bounds checks do). Bounds
checks are of the form YYLIMIT <= YYCURSOR or YYLESSTHAN(1) with
generic API. If the check shows that the current position is within
bounds, the lexer will continue matching. Otherwise the lexer has
reached the end of input, and it should stop. In this example YYFILL
is disabled with re2c:yyfill:enable = 0; and the lexer does not
attempt to get more input (see another example that uses YYFILL in the
YYFILL with sentinel character section). When the end of input has
been reached, there are three possibilities: if the lexer is in the
initial state, it will match the end of input rule $, otherwise it
will either fall back to a previously matched rule (including default
rule *) or go to a default state, causing -Wundefined-control-flow.
1467
//go:generate re2go $INPUT -o $OUTPUT
package main

import "testing"

// Expects a null-terminated string.
func lex(str string) int {
	var cursor, marker int
	limit := len(str) - 1 // limit points at the terminating null
	count := 0
loop:
	/*!re2c
	re2c:yyfill:enable = 0;
	re2c:eof = 0;
	re2c:define:YYCTYPE = byte;
	re2c:define:YYPEEK = "str[cursor]";
	re2c:define:YYSKIP = "cursor += 1";
	re2c:define:YYBACKUP = "marker = cursor";
	re2c:define:YYRESTORE = "cursor = marker";
	re2c:define:YYLESSTHAN = "limit <= cursor";

	*                           { return -1 }
	$                           { return count }
	['] ([^'\\] | [\\][^])* ['] { count += 1; goto loop }
	[ ]+                        { goto loop }
	*/
}

func TestLex(t *testing.T) {
	var tests = []struct {
		res int
		str string
	}{
		{0, "\000"},
		{3, "'qu\000tes' 'are' 'fine: \\'' \000"},
		{-1, "'unterminated\\'\000"},
	}

	for _, x := range tests {
		t.Run(x.str, func(t *testing.T) {
			res := lex(x.str)
			if res != x.res {
				t.Errorf("got %d, want %d", res, x.res)
			}
		})
	}
}
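
The placement of the bounds check can be shown with a hand-written sketch in plain Go. It scans one single-quoted string and performs the check only in the branch that handles null, as described above. This is an illustration under the same assumptions as the example (null-terminated input, limit pointing at the terminating null); the function name skipQuoted is hypothetical and the code is not what re2go generates.

```go
package main

// skipQuoted is a hand-written sketch of the "sentinel with bounds checks"
// idea. It expects str[cursor] to be an opening single quote and limit to
// be the index of the terminating null. It returns the position after the
// closing quote, or -1 if the end of input is reached first.
func skipQuoted(str string, cursor, limit int) int {
	cursor++ // skip the opening quote
	for {
		switch str[cursor] {
		case 0:
			// Only the null branch pays for a bounds check.
			if cursor >= limit {
				return -1 // the real end of input
			}
			cursor++ // a null in the middle is an ordinary character
		case '\\':
			cursor++ // skip the backslash
			if str[cursor] == 0 && cursor >= limit {
				return -1 // the end of input right after a backslash
			}
			cursor++ // skip the escaped character
		case '\'':
			return cursor + 1 // the closing quote
		default:
			cursor++
		}
	}
}
```

All other characters are handled without any comparison against limit, which is where this method saves work compared to checking on every read.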
1515
1516
1517 Bounds checks with padding
In this example the lexer uses bounds checking with padding to handle
the end of input (it is the default method). The program counts
single-quoted strings separated with spaces. There is a padding of
YYMAXFILL null characters appended at the end of input, where the
YYMAXFILL value is autogenerated with /*!max:re2c*/ directive. It is
not necessary to use null for padding: any characters can be used, as
long as they do not form a valid lexeme suffix (in this example padding
should not contain single quotes, as they may be mistaken for a suffix
of a single-quoted string). There is a "stop" rule that matches the
first padding character (null) and terminates the lexer (it returns
success only if it has matched at the beginning of padding, otherwise a
stray null is a syntax error). Bounds checks are generated only in some
states that depend on the strongly connected components of the
underlying automaton. They are of the form (YYLIMIT - YYCURSOR) < n or
YYLESSTHAN(n) with generic API, where n is the minimum number of
characters that are needed for the lexer to proceed (it also means that
the next bounds check will occur in at most n characters). If the check
shows that at least n characters are available, the lexer will continue
matching. Otherwise the lexer has reached the end of input and will
invoke YYFILL(n), which should either supply at least n input
characters or not return. In this example YYFILL always fails and
terminates the lexer with an error. This is fine, because in this
example YYFILL can only be called when the lexer has advanced into the
padding, which means that it has encountered an unterminated string and
should return a syntax error. See the YYFILL with padding section for
an example that refills the input buffer with YYFILL.
1544
//go:generate re2go $INPUT -o $OUTPUT
package main

import (
	"strings"
	"testing"
)

/*!max:re2c*/

// Expects YYMAXFILL-padded string.
func lex(str string) int {
	var cursor int
	limit := len(str)
	count := 0
loop:
	/*!re2c
	re2c:define:YYCTYPE = byte;
	re2c:define:YYPEEK = "str[cursor]";
	re2c:define:YYSKIP = "cursor += 1";
	re2c:define:YYLESSTHAN = "limit - cursor < @@{len}";
	re2c:define:YYFILL = "return -1";

	* {
		return -1
	}
	[\x00] {
		if limit - cursor == YYMAXFILL - 1 {
			return count
		} else {
			return -1
		}
	}
	['] ([^'\\] | [\\][^])* ['] {
		count += 1
		goto loop
	}
	[ ]+ {
		goto loop
	}
	*/
}

// Pad string with YYMAXFILL zeroes at the end.
func pad(str string) string {
	return str + strings.Repeat("\000", YYMAXFILL)
}

func TestLex(t *testing.T) {
	var tests = []struct {
		res int
		str string
	}{
		{0, ""},
		{3, "'qu\000tes' 'are' 'fine: \\'' "},
		{-1, "'unterminated\\'"},
	}

	for _, x := range tests {
		t.Run(x.str, func(t *testing.T) {
			res := lex(pad(x.str))
			if res != x.res {
				t.Errorf("got %d, want %d", res, x.res)
			}
		})
	}
}
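
The reason why a padding of YYMAXFILL characters is sufficient can be checked with a small sketch in plain Go. The constant maxFill is a hypothetical stand-in for the generated YYMAXFILL value (3 is an arbitrary choice), and the helper names lessThan and neverShort are assumptions made for illustration.

```go
package main

import "strings"

// maxFill is a hypothetical stand-in for the YYMAXFILL value that re2c
// generates with /*!max:re2c*/; 3 is an arbitrary choice for illustration.
const maxFill = 3

// pad appends maxFill null characters to the input.
func pad(input string) string {
	return input + strings.Repeat("\x00", maxFill)
}

// lessThan models the check "(YYLIMIT - YYCURSOR) < n": it reports whether
// fewer than n characters remain between cursor and limit.
func lessThan(cursor, limit, n int) bool {
	return limit-cursor < n
}

// neverShort reports whether a check for n characters passes at every
// position from the start of the input up to and including the first
// padding character.
func neverShort(input string, n int) bool {
	limit := len(pad(input))
	for cursor := 0; cursor <= len(input); cursor++ {
		if lessThan(cursor, limit, n) {
			return false
		}
	}
	return true
}
```

As long as n does not exceed maxFill, the check passes everywhere on real input and on the first padding character, so YYFILL is reached only after the lexer has advanced into the padding.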
1612
1613
1614 Custom methods with generic API
1615 In this example the lexer uses a custom end of input handling method
1616 based on generic API. The program counts single-quoted strings sepa‐
1617 rated with spaces. It is the same as the sentinel character with
1618 bounds checks example, except that the input is not null-terminated (so
1619 this method can be used if it's not possible to have any padding at
1620 all, not even a single sentinel character). To compensate for the ab‐
1621 sence of a sentinel character at the end of input, YYPEEK is redefined to
1622 perform a bounds check before it reads the next input character. This
1623 is inefficient, because checks are done very often. If the check suc‐
1624 ceeds, YYPEEK returns the real character, otherwise it returns a fake
1625 sentinel character.
1626
1627 //go:generate re2go $INPUT -o $OUTPUT
1628 package main
1629
1630 import "testing"
1631
1632 // Returns "fake" terminating null if cursor has reached limit.
1633 func peek(str string, cursor int, limit int) byte {
1634 if cursor >= limit {
1635 return 0 // fake null
1636 } else {
1637 return str[cursor]
1638 }
1639 }
1640
1641 // Expects a string without terminating null.
1642 func lex(str string) int {
1643 var cursor, marker int
1644 limit := len(str)
1645 count := 0
1646 loop:
1647 /*!re2c
1648 re2c:yyfill:enable = 0;
1649 re2c:eof = 0;
1650 re2c:define:YYCTYPE = byte;
1651 re2c:define:YYLESSTHAN = "cursor >= limit";
1652 re2c:define:YYPEEK = "peek(str, cursor, limit)";
1653 re2c:define:YYSKIP = "cursor += 1";
1654 re2c:define:YYBACKUP = "marker = cursor";
1655 re2c:define:YYRESTORE = "cursor = marker";
1656
1657 * { return -1 }
1658 $ { return count }
1659 ['] ([^'\\] | [\\][^])* ['] { count += 1; goto loop }
1660 [ ]+ { goto loop }
1661 */
1662 }
1663
1664 func TestLex(t *testing.T) {
1665 var tests = []struct {
1666 res int
1667 str string
1668 }{
1669 {0, ""},
1670 {3, "'qu\000tes' 'are' 'fine: \\'' "},
1671 {-1, "'unterminated\\'"},
1672 }
1673
1674 for _, x := range tests {
1675 t.Run(x.str, func(t *testing.T) {
1676 res := lex(x.str)
1677 if res != x.res {
1678 t.Errorf("got %d, want %d", res, x.res)
1679 }
1680 })
1681 }
1682 }
1683
1684
1686 The need for buffering arises when the input cannot be mapped in memory
1687 all at once: either it is too large, or it comes in a streaming fashion
1688 (like reading from a socket). The usual technique in such cases is to
1689 allocate a fixed-sized memory buffer and process input in chunks that
1690 fit into the buffer. When the current chunk is processed, it is moved
1691 out and new data is moved in. In practice it is somewhat more complex,
1692 because lexer state consists not of a single input position, but a set
1693 of interrelated positions:
1694
1695 • cursor: the next input character to be read (YYCURSOR in default API
1696 or YYSKIP/YYPEEK in generic API)
1697
1698 • limit: the position after the last available input character (YYLIMIT
1699 in default API, implicitly handled by YYLESSTHAN in generic API)
1700
1701 • marker: the position of the most recent match, if any (YYMARKER in
1702 default API or YYBACKUP/YYRESTORE in generic API)
1703
1704 • token: the start of the current lexeme (implicit in re2c API, as it
1705 is not needed for the normal lexer operation and can be defined and
1706 updated by the user)
1707
1708 • context marker: the position of the trailing context (YYCTXMARKER in
1709 default API or YYBACKUPCTX/YYRESTORECTX in generic API)
1710
1711 • tag variables: submatch positions (defined with /*!stags:re2c*/ and
1712 /*!mtags:re2c*/ directives and YYSTAGP/YYSTAGN/YYMTAGP/YYMTAGN in
1713 generic API)
1714
1715 Not all these are used in every case, but if used, they must be updated
1716 by YYFILL. All active positions are contained in the segment between
1717 token and cursor, therefore everything between buffer start and token
1718 can be discarded, the segment from token up to limit should be
1719 moved to the beginning of buffer, and the free space at the end of buf‐
1720 fer should be filled with new data. In order to avoid frequent YYFILL
1721 calls it is best to fill in as many input characters as possible (even
1722 though fewer characters might suffice to resume the lexer). The details
1723 of YYFILL implementation are slightly different depending on which EOF
1724 handling method is used: the case of EOF rule is somewhat simpler than
1725 the case of bounds-checking with padding. Also note that if -f
1726 --storable-state option is used, YYFILL has slightly different seman‐
1727 tics (described in the section about storable state).
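
The shifting step described above can be sketched in plain Go, without
any re2c directives. This is only an illustration under stated
assumptions: the Buffer type, its field names, the refill method and
the demo function are hypothetical and not part of the re2c API, and
sentinel or padding handling is left out.

```go
package main

import (
	"io"
	"strings"
)

// Buffer is a hypothetical lexer state. The fields mirror the positions
// listed above, stored as offsets into data.
type Buffer struct {
	src                          io.Reader
	data                         []byte
	token, marker, cursor, limit int
}

// refill discards everything before token, moves the segment from token
// up to limit to the start of the buffer, adjusts all positions, and then
// reads as much new data as fits. It returns the number of bytes added.
func (b *Buffer) refill() int {
	shift := b.token
	copy(b.data, b.data[b.token:b.limit])
	b.token -= shift
	b.marker -= shift
	b.cursor -= shift
	b.limit -= shift

	added := 0
	for b.limit < len(b.data) {
		n, err := b.src.Read(b.data[b.limit:])
		b.limit += n
		added += n
		if err != nil || n == 0 {
			break // end of input, read error, or no progress
		}
	}
	return added
}

// demo refills a 6-byte buffer whose current lexeme starts at offset 4.
func demo() (string, int, int) {
	b := &Buffer{
		src:    strings.NewReader("ghijkl"),
		data:   []byte("abcdef"),
		token:  4,
		marker: 5,
		cursor: 6,
		limit:  6,
	}
	b.refill()
	return string(b.data[:b.limit]), b.token, b.cursor
}
```

In demo the two bytes of the current lexeme ("ef") are moved to the
start of the buffer and the four freed bytes are filled with new input,
so the buffer holds "efghij" with token at 0 and cursor at 2.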
1728
1729 YYFILL with sentinel character
1730 If EOF rule is used, YYFILL is a function-like primitive that accepts
1731 no arguments and returns a value which is checked against zero. YYFILL
1732 invocation is triggered by condition YYLIMIT <= YYCURSOR in default API
1733 and YYLESSTHAN() in generic API. A non-zero return value means that YY‐
1734 FILL has failed. A successful YYFILL call must supply at least one
1735 character and adjust input positions accordingly. Limit must always be
1736 set to one after the last input position in buffer, and the character
1737 at the limit position must be the sentinel symbol specified by re2c:eof
1738 configuration. The pictures below show the relative locations of input
1739 positions in buffer before and after YYFILL call (sentinel symbol is
1740 marked with #, and the second picture shows the case when there is not
1741 enough input to fill the whole buffer).
1742
1743   <-- shift -->
1744 >-A------------B---------C-------------D#-----------E->
1745 buffer       token    marker         limit,
1746                                      cursor
1747 >-A------------B---------C-------------D------------E#->
1748              buffer,  marker        cursor        limit
1749              token
1750
1751   <-- shift -->
1752 >-A------------B---------C-------------D#--E (EOF)
1753 buffer       token    marker         limit,
1754                                      cursor
1755 >-A------------B---------C-------------D---E#........
1756              buffer,  marker       cursor limit
1757              token
1758
1759 Here is an example of a program that reads input file input.txt in
1760 chunks of SIZE bytes (16 in this example) and uses EOF rule.
1761
1762 //go:generate re2go $INPUT -o $OUTPUT
1763 package main
1764
1765 import (
1766 "os"
1767 "testing"
1768 )
1769
1770 // Intentionally small to trigger buffer refill.
1771 const SIZE int = 16
1772
1773 type Input struct {
1774 file *os.File
1775 data []byte
1776 cursor int
1777 marker int
1778 token int
1779 limit int
1780 eof bool
1781 }
1782
1783 func fill(in *Input) int {
1784 // If nothing can be read, fail.
1785 if in.eof {
1786 return 1
1787 }
1788
1789 // Check if at least some space can be freed.
1790 if in.token == 0 {
1791 // In real life can reallocate a larger buffer.
1792 panic("fill error: lexeme too long")
1793 }
1794
1795 // Discard everything up to the start of the current lexeme,
1796 // shift buffer contents and adjust offsets.
1797 copy(in.data[0:], in.data[in.token:in.limit])
1798 in.cursor -= in.token
1799 in.marker -= in.token
1800 in.limit -= in.token
1801 in.token = 0
1802
1803 // Read new data (as much as possible to fill the buffer).
1804 n, _ := in.file.Read(in.data[in.limit:SIZE])
1805 in.limit += n
1806 in.data[in.limit] = 0
1807
1808 // If read less than expected, this is the end of input.
1809 in.eof = in.limit < SIZE
1810
1811 // If nothing has been read, fail.
1812 if n == 0 {
1813 return 1
1814 }
1815
1816 return 0
1817 }
1818
1819 func lex(in *Input) int {
1820 count := 0
1821 loop:
1822 in.token = in.cursor
1823 /*!re2c
1824 re2c:eof = 0;
1825 re2c:define:YYCTYPE = byte;
1826 re2c:define:YYPEEK = "in.data[in.cursor]";
1827 re2c:define:YYSKIP = "in.cursor += 1";
1828 re2c:define:YYBACKUP = "in.marker = in.cursor";
1829 re2c:define:YYRESTORE = "in.cursor = in.marker";
1830 re2c:define:YYLESSTHAN = "in.limit <= in.cursor";
1831 re2c:define:YYFILL = "fill(in) == 0";
1832
1833 * { return -1 }
1834 $ { return count }
1835 ['] ([^'\\] | [\\][^])* ['] { count += 1; goto loop }
1836 [ ]+ { goto loop }
1837 */
1838 }
1839
1840 // Prepare a file with the input text and run the lexer.
1841 func test(data string) (result int) {
1842 tmpfile := "input.txt"
1843
1844 f, _ := os.Create(tmpfile)
1845 f.WriteString(data)
1846 f.Seek(0, 0)
1847
1848 defer func() {
1849 if r := recover(); r != nil {
1850 result = -2
1851 }
1852 f.Close()
1853 os.Remove(tmpfile)
1854 }()
1855
1856 in := &Input{
1857 file: f,
1858 data: make([]byte, SIZE+1),
1859 cursor: SIZE,
1860 marker: SIZE,
1861 token: SIZE,
1862 limit: SIZE,
1863 eof: false,
1864 }
1865
1866 return lex(in)
1867 }
1868
1869 func TestLex(t *testing.T) {
1870 var tests = []struct {
1871 res int
1872 str string
1873 }{
1874 {0, ""},
1875 {2, "'one' 'two'"},
1876 {3, "'qu\000tes' 'are' 'fine: \\'' "},
1877 {-1, "'unterminated\\'"},
1878 {-2, "'loooooooooooong'"},
1879 }
1880
1881 for _, x := range tests {
1882 t.Run(x.str, func(t *testing.T) {
1883 res := test(x.str)
1884 if res != x.res {
1885 t.Errorf("got %d, want %d", res, x.res)
1886 }
1887 })
1888 }
1889 }
1890
1891
1892 YYFILL with padding
1893 In the default case (when EOF rule is not used) YYFILL is a func‐
1894 tion-like primitive that accepts a single argument and does not return
1895 any value. YYFILL invocation is triggered by condition (YYLIMIT - YY‐
1896 CURSOR) < n in default API and YYLESSTHAN(n) in generic API. The argu‐
1897 ment passed to YYFILL is the minimal number of characters that must be
1898 supplied. If it fails to do so, YYFILL must not return to the lexer
1899 (for that reason it is best implemented as a macro that returns from
1900 the calling function on failure). In case of a successful YYFILL invo‐
1901 cation the limit position must be set either to one after the last in‐
1902 put position in buffer, or to the end of YYMAXFILL padding (in case YY‐
1903 FILL has successfully read at least n characters, but not enough to
1904 fill the entire buffer). The pictures below show the relative locations
1905 of input positions in buffer before and after YYFILL invocation (YYMAX‐
1906 FILL padding on the second picture is marked with # symbols).
1907
1908   <-- shift -->                 <-- need -->
1909 >-A------------B---------C-----D-------E---F--------G->
1910 buffer       token    marker cursor  limit
1911
1912 >-A------------B---------C-----D-------E---F--------G->
1913              buffer,  marker cursor               limit
1914              token
1915
1916   <-- shift -->                 <-- need -->
1917 >-A------------B---------C-----D-------E-F (EOF)
1918 buffer       token    marker cursor  limit
1919
1920 >-A------------B---------C-----D-------E-F###############
1921              buffer,  marker cursor                   limit
1922              token                        <- YYMAXFILL ->
1923
1924 Here is an example of a program that reads input file input.txt in
1925 chunks of SIZE bytes (16 in this example) and uses bounds-checking with padding.
1926
1927 //go:generate re2go $INPUT -o $OUTPUT
1928 package main
1929
1930 import (
1931 "fmt"
1932 "os"
1933 "testing"
1934 )
1935
1936 /*!max:re2c*/
1937
1938 // Intentionally small to trigger buffer refill.
1939 const SIZE int = 16
1940
1941 type Input struct {
1942 file *os.File
1943 data []byte
1944 cursor int
1945 marker int
1946 token int
1947 limit int
1948 eof bool
1949 }
1950
1951 func fill(in *Input, need int) int {
1952 // End of input has already been reached, nothing to do.
1953 if in.eof {
1954 return -1 // Error: unexpected EOF
1955 }
1956
1957 // Check if after moving the current lexeme to the beginning
1958 // of buffer there will be enough free space.
1959 if SIZE-(in.cursor-in.token) < need {
1960 return -2 // Error: lexeme too long
1961 }
1962
1963 // Discard everything up to the start of the current lexeme,
1964 // shift buffer contents and adjust offsets.
1965 copy(in.data[0:], in.data[in.token:in.limit])
1966 in.cursor -= in.token
1967 in.marker -= in.token
1968 in.limit -= in.token
1969 in.token = 0
1970
1971 // Read new data (as much as possible to fill the buffer).
1972 n, _ := in.file.Read(in.data[in.limit:SIZE])
1973 in.limit += n
1974
1975 // If read less than expected, this is the end of input.
1976 in.eof = in.limit < SIZE
1977
1978 // If end of input, add padding so that the lexer can read
1979 // the remaining characters at the end of buffer.
1980 if in.eof {
1981 for i := 0; i < YYMAXFILL; i += 1 {
1982 in.data[in.limit+i] = 0
1983 }
1984 in.limit += YYMAXFILL
1985 }
1986
1987 return 0
1988 }
1989
1990 func lex(in *Input) int {
1991 count := 0
1992 loop:
1993 in.token = in.cursor
1994 /*!re2c
1995 re2c:define:YYCTYPE = byte;
1996 re2c:define:YYPEEK = "in.data[in.cursor]";
1997 re2c:define:YYSKIP = "in.cursor += 1";
1998 re2c:define:YYBACKUP = "in.marker = in.cursor";
1999 re2c:define:YYRESTORE = "in.cursor = in.marker";
2000 re2c:define:YYLESSTHAN = "in.limit-in.cursor < @@{len}";
2001 re2c:define:YYFILL = "if r := fill(in, @@{len}); r != 0 { return r }";
2002
2003 * {
2004 return -1
2005 }
2006 [\x00] {
2007 if in.limit - in.cursor == YYMAXFILL - 1 {
2008 return count
2009 } else {
2010 return -1
2011 }
2012 }
2013 ['] ([^'\\] | [\\][^])* ['] {
2014 count += 1;
2015 goto loop
2016 }
2017 [ ]+ {
2018 goto loop
2019 }
2020 */
2021 }
2022
2023 // Prepare a file with the input text and run the lexer.
2024 func test(data string) (result int) {
2025 tmpfile := "input.txt"
2026
2027 f, _ := os.Create(tmpfile)
2028 f.WriteString(data)
2029 f.Seek(0, 0)
2030
2031 defer func() {
2032 if r := recover(); r != nil {
2033 fmt.Println(r)
2034 result = -2
2035 }
2036 f.Close()
2037 os.Remove(tmpfile)
2038 }()
2039
2040 in := &Input{
2041 file: f,
2042 data: make([]byte, SIZE+YYMAXFILL),
2043 cursor: SIZE,
2044 marker: SIZE,
2045 token: SIZE,
2046 limit: SIZE,
2047 eof: false,
2048 }
2049
2050 return lex(in)
2051 }
2052
2053 func TestLex(t *testing.T) {
2054 var tests = []struct {
2055 res int
2056 str string
2057 }{
2058 {0, ""},
2059 {2, "'one' 'two'"},
2060 {3, "'qu\000tes' 'are' 'fine: \\'' "},
2061 {-1, "'unterminated\\'"},
2062 {-2, "'loooooooooooong'"},
2063 }
2064
2065 for _, x := range tests {
2066 t.Run(x.str, func(t *testing.T) {
2067 res := test(x.str)
2068 if res != x.res {
2069 t.Errorf("got %d, want %d", res, x.res)
2070 }
2071 })
2072 }
2073 }
2074
2075
2077 re2c allows one to include other files using directive /*!include:re2c
2078 FILE */, where FILE is the name of file to be included. re2c looks for
2079 included files in the directory of the including file and in include
2080 locations, which can be specified with -I option. Include directives
2081 in re2c work in the same way as C/C++ #include: the contents of FILE
2082 are copy-pasted verbatim in place of the directive. Include files may
2083 have further includes of their own. Use --depfile option to track build
2084 dependencies of the output file on include files. re2c provides some
2085 predefined include files that can be found in the include/ subdirectory
2086 of the project. These files contain definitions that can be useful to
2087 other projects (such as Unicode categories) and form something like a
2088 standard library for re2c. Below is an example of using include direc‐
2089 tive.
2090
2091 Include file (definitions.go)
2092 const (
2093 ResultOk = iota
2094 ResultFail
2095 )
2096
2097 /*!re2c
2098 number = [1-9][0-9]*;
2099 */
2100
2101
2102 Input file
2103 //go:generate re2go -c $INPUT -o $OUTPUT -i
2104 package main
2105
2106 import "testing"
2107 /*!include:re2c "definitions.go" */
2108
2109 func lex(str string) int {
2110 var cursor int
2111 /*!re2c
2112 re2c:yyfill:enable = 0;
2113 re2c:define:YYCTYPE = byte;
2114 re2c:define:YYPEEK = "str[cursor]";
2115 re2c:define:YYSKIP = "cursor += 1";
2116
2117 number { return ResultOk }
2118 * { return ResultFail }
2119 */
2120 }
2121
2122 func TestLex(t *testing.T) {
2123 if lex("123\000") != ResultOk {
2124 t.Errorf("error")
2125 }
2126 }
2127
2128
2130 Re2c allows one to generate a header file from the input .re file using
2131 option -t, --type-header or configuration re2c:flags:type-header and
2132 directives /*!header:re2c:on*/ and /*!header:re2c:off*/. The first di‐
2133 rective marks the beginning of header file, and the second directive
2134 marks the end of it. Everything between these directives is processed
2135 by re2c, and the generated code is written to the file specified by the
2136 -t --type-header option (or stdout if this option was not used). Auto‐
2137 generated header file may be needed in cases when re2c is used to gen‐
2138 erate definitions of constants, variables and structs that must be vis‐
2139 ible from other translation units.
2140
2141 Here is an example of generating a header file that contains the definition
2142 of the lexer state with tag variables (the number of tag variables depends on
2143 the regular grammar and is unknown to the programmer).
2144
2145 Input file
2146 //go:generate re2go $INPUT -o $OUTPUT -i --type-header src/lexer/lexer.go
2147 package main
2148
2149 import (
2150 "lexer" // generated by re2c
2151 "testing"
2152 )
2153
2154 /*!header:re2c:on*/
2155 package lexer
2156
2157 type State struct {
2158 Data string
2159 Cur, Mar, /*!stags:re2c format="@@{tag}"; separator=", "; */ int
2160 }
2161 /*!header:re2c:off*/
2162
2163 func lex(st *lexer.State) int {
2164 /*!re2c
2165 re2c:flags:type-header = "src/lexer/lexer.go";
2166 re2c:yyfill:enable = 0;
2167 re2c:flags:tags = 1;
2168 re2c:define:YYCTYPE = byte;
2169 re2c:define:YYPEEK = "st.Data[st.Cur]";
2170 re2c:define:YYSKIP = "st.Cur++";
2171 re2c:define:YYBACKUP = "st.Mar = st.Cur";
2172 re2c:define:YYRESTORE = "st.Cur = st.Mar";
2173 re2c:define:YYRESTORETAG = "st.Cur = @@{tag}";
2174 re2c:define:YYSTAGP = "@@{tag} = st.Cur";
2175 re2c:tags:expression = "st.@@{tag}";
2176 re2c:tags:prefix = "Tag";
2177
2178 [x]{1,4} / [x]{3,5} { return 0 } // ambiguous trailing context
2179 * { return 1 }
2180 */
2181 }
2182
2183 func TestLex(t *testing.T) {
2184 st := &lexer.State{
2185 Data: "xxxxxxxx\x00",
2186 }
2187 if !(lex(st) == 0 && st.Cur == 4) {
2188 t.Error("failed")
2189 }
2190 }
2191
2192
2193 Header file
2194 // Code generated by re2c, DO NOT EDIT.
2195
2196 package lexer
2197
2198 type State struct {
2199 Data string
2200 Cur, Mar, Tag1, Tag2, Tag3 int
2201 }
2202
2203
2205 Re2c has two options for submatch extraction.
2206
2207 The first option is -T --tags. With this option one can use standalone
2208 tags of the form @stag and #mtag, where stag and mtag are arbitrary
2209 user-defined names. Tags can be used anywhere inside of a regular ex‐
2210 pression; semantically they are just position markers. Tags of the form
2211 @stag are called s-tags: they denote a single submatch value (the last
2212 input position where this tag matched). Tags of the form #mtag are
2213 called m-tags: they denote multiple submatch values (the whole history
2214 of repetitions of this tag). All tags should be defined by the user as
2215 variables with the corresponding names. With standalone tags re2c uses
2216 leftmost greedy disambiguation: submatch positions correspond to the
2217 leftmost matching path through the regular expression.
2218
2219 The second option is -P --posix-captures: it enables POSIX-compliant
2220 capturing groups. In this mode parentheses in regular expressions de‐
2221 note the beginning and the end of capturing groups; the whole regular
2222 expression is group number zero. The number of groups for the matching
2223 rule is stored in a variable yynmatch, and submatch results are stored
2224 in yypmatch array. Both yynmatch and yypmatch should be defined by the
2225 user, and yypmatch size must be at least [yynmatch * 2]. Re2c provides
2226 a directive /*!maxnmatch:re2c*/ that defines YYMAXNMATCH: a constant
2227 equal to the maximal value of yynmatch among all rules. Note that re2c
2228 implements POSIX-compliant disambiguation: each subexpression matches
2229 as long as possible, and subexpressions that start earlier in regular
2230 expression have priority over those starting later. Capturing groups
2231 are translated into s-tags under the hood, therefore we use the word
2232 "tag" to describe them as well.
2233
2234 With both -P --posix-captures and -T --tags options re2c uses an efficient
2235 submatch extraction algorithm described in the Tagged Deterministic Fi‐
2236 nite Automata with Lookahead paper. The overhead on submatch extraction
2237 in the generated lexer grows with the number of tags; if this number
2238 is moderate, the overhead is barely noticeable. In the lexer tags are
2239 implemented using a number of tag variables generated by re2c. There is
2240 no one-to-one correspondence between tag variables and tags: a single
2241 variable may be reused for different tags, and one tag may require mul‐
2242 tiple variables to hold all its ambiguous values. Eventually ambiguity
2243 is resolved, and only one final variable per tag survives. When a rule
2244 matches, all its tags are set to the values of the corresponding tag
2245 variables. The exact number of tag variables is unknown to the user;
2246 this number is determined by re2c. However, tag variables should be de‐
2247 fined by the user as a part of the lexer state and updated by YYFILL,
2248 therefore re2c provides directives /*!stags:re2c*/ and /*!mtags:re2c*/
2249 that can be used to declare, initialize and manipulate tag variables.
2250 These directives have two optional configurations: format = "@@";
2251 (specifies the template where @@ is substituted with the name of each
2252 tag variable), and separator = ""; (specifies the piece of code used to
2253 join the generated pieces for different tag variables).
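
The effect of the format and separator configurations can be mimicked
with ordinary string operations. The sketch below is not how re2c is
implemented; the function expand is hypothetical, and the tag variable
names passed to it (such as yyt1) are placeholders, since the real
names and their number are chosen by re2c.

```go
package main

import "strings"

// expand mimics the substitution described above: every "@@" in format is
// replaced with a tag variable name, and the resulting pieces are joined
// with separator.
func expand(format, separator string, names []string) string {
	pieces := make([]string, len(names))
	for i, name := range names {
		pieces[i] = strings.ReplaceAll(format, "@@", name)
	}
	return strings.Join(pieces, separator)
}
```

With format "var @@ int", a newline as separator and the placeholder
names yyt1 and yyt2, expand produces two declarations, "var yyt1 int"
and "var yyt2 int", one per line.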
2254
2255 S-tags support the following operations:
2256
2257 • save input position to an s-tag: t = YYCURSOR with default API or a
2258 user-defined operation YYSTAGP(t) with generic API
2259
2260 • save default value to an s-tag: t = NULL with default API or a
2261 user-defined operation YYSTAGN(t) with generic API
2262
2263 • copy one s-tag to another: t1 = t2
2264
2265 M-tags support the following operations:
2266
2267 • append input position to an m-tag: a user-defined operation YYM‐
2268 TAGP(t) with both default and generic API
2269
2270 • append default value to an m-tag: a user-defined operation YYMTAGN(t)
2271 with both default and generic API
2272
2273 • copy one m-tag to another: t1 = t2
2274
2275 S-tags can be implemented as scalar values (pointers or offsets).
2276 M-tags need a more complex representation, as they need to store a se‐
2277 quence of tag values. The most naive and inefficient representation of
2278 an m-tag is a list (array, vector) of tag values; a more efficient rep‐
2279 resentation is to store all m-tags in a prefix-tree represented as ar‐
2280 ray of nodes (v, p), where v is tag value and p is a pointer to parent
2281 node.
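
The prefix-tree representation can be sketched as follows. This is a
minimal illustration under stated assumptions: the names node, root,
push and history are hypothetical, the tree is kept in a Go slice, and
the parent "pointer" is an index into that slice. Histories that share
a common prefix share the corresponding nodes.

```go
package main

// node is one entry of the prefix tree: a tag value v and the index p of
// the parent node.
type node struct {
	v int
	p int
}

// root marks the empty history (a tag with no values yet).
const root = -1

// push appends val to the history that ends at index tag and returns the
// grown tree together with the index of the new node, which becomes the
// new value of the m-tag.
func push(tree []node, tag int, val int) ([]node, int) {
	tree = append(tree, node{val, tag})
	return tree, len(tree) - 1
}

// history walks from tag back to the root and returns the tag values in
// the order in which they were appended.
func history(tree []node, tag int) []int {
	var vals []int
	for i := tag; i != root; i = tree[i].p {
		vals = append(vals, tree[i].v)
	}
	for l, r := 0, len(vals)-1; l < r; l, r = l+1, r-1 {
		vals[l], vals[r] = vals[r], vals[l]
	}
	return vals
}
```

Copying one m-tag to another is then a plain integer assignment, and
appending a value costs one slice append regardless of how long the
history already is.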
2282
2283 Here is a simple example of using s-tags to parse an IPv4 address (see
2284 below for a more complex example that uses YYFILL).
2285
2286 //go:generate re2go $INPUT -o $OUTPUT
2287 package main
2288
2289 import (
2290 "errors"
2291 "testing"
2292 )
2293
2294 var eBadIP error = errors.New("bad IP")
2295
2296 func lex(str string) (int, error) {
2297 var cursor, marker, o1, o2, o3, o4 int
2298 /*!stags:re2c format = 'var @@ int'; separator = "\n\t"; */
2299
2300 num := func(pos int, end int) int {
2301 n := 0
2302 for ; pos < end; pos++ {
2303 n = n*10 + int(str[pos]-'0')
2304 }
2305 return n
2306 }
2307
2308 /*!re2c
2309 re2c:flags:tags = 1;
2310 re2c:yyfill:enable = 0;
2311 re2c:define:YYCTYPE = byte;
2312 re2c:define:YYPEEK = "str[cursor]";
2313 re2c:define:YYSKIP = "cursor += 1";
2314 re2c:define:YYBACKUP = "marker = cursor";
2315 re2c:define:YYRESTORE = "cursor = marker";
2316 re2c:define:YYSTAGP = "@@{tag} = cursor";
2317 re2c:define:YYSTAGN = "@@{tag} = -1";
2318
2319 octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
2320 dot = [.];
2321 end = [\x00];
2322
2323 @o1 octet dot @o2 octet dot @o3 octet dot @o4 octet end {
2324 return num(o4, cursor-1)+
2325 (num(o3, o4-1) << 8)+
2326 (num(o2, o3-1) << 16)+
2327 (num(o1, o2-1) << 24), nil
2328 }
2329 * { return 0, eBadIP }
2330 */
2331 }
2332
2333 func TestLex(t *testing.T) {
2334 var tests = []struct {
2335 str string
2336 res int
2337 err error
2338 }{
2339 {"1.2.3.4\000", 0x01020304, nil},
2340 {"127.0.0.1\000", 0x7f000001, nil},
2341 {"255.255.255.255\000", 0xffffffff, nil},
2342 {"1.2.3.\000", 0, eBadIP},
2343 {"1.2.3.256\000", 0, eBadIP},
2344 }
2345
2346 for _, x := range tests {
2347 t.Run(x.str, func(t *testing.T) {
2348 res, err := lex(x.str)
2349 if !(res == x.res && err == x.err) {
2350 t.Errorf("got %d, want %d", res, x.res)
2351 }
2352 })
2353 }
2354 }
2355
2356
2357 Here is a more complex example of using s-tags with YYFILL to parse a
2358 file with IPv4 addresses. Tag variables are part of the lexer state,
2359 and they are adjusted in YYFILL like other input positions. Note that
2360 this adjustment is necessary for s-tags, because their values become
2361 invalid after the buffer is shifted. It may not be needed in a custom
2362 implementation where tag variables store offsets relative to the start
2363 of the input string rather than the buffer, as may be the case with m-tags.
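
The adjustment itself amounts to subtracting the shift from every tag
variable that is set. The sketch below shows this on a slice of
offsets. It is an illustration only: the function shiftTags is
hypothetical, real tag variables are separate fields generated by the
/*!stags:re2c*/ directive rather than a slice, and -1 is assumed to be
the "unset" value (as defined by YYSTAGN in the examples).

```go
package main

// shiftTags subtracts shift from every tag offset that is set. Tags that
// hold -1 are treated as unset and are left untouched.
func shiftTags(tags []int, shift int) {
	for i, t := range tags {
		if t != -1 {
			tags[i] = t - shift
		}
	}
}
```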
2364
2365 //go:generate re2go $INPUT -o $OUTPUT --tags
2366 package main
2367
2368 import (
2369 "fmt"
2370 "os"
2371 "reflect"
2372 "testing"
2373 )
2374
2375 const SIZE int = 4096
2376
2377 type Input struct {
2378 file *os.File
2379 data []byte
2380 cursor int
2381 marker int
2382 token int
2383 limit int
2384 // Tag variables must be part of the lexer state passed to YYFILL.
2385 // They don't correspond to tags and should be autogenerated by re2c.
2386 /*!stags:re2c format = "@@ int"; separator= "\n\t"; */
2387 eof bool
2388 }
2389
2390 func fill(in *Input) int {
2391 // If nothing can be read, fail.
2392 if in.eof {
2393 return 1
2394 }
2395
2396 // Check if at least some space can be freed.
2397 if in.token == 0 {
2398 // In real life can reallocate a larger buffer.
2399 panic("fill error: lexeme too long")
2400 }
2401
2402 // Discard everything up to the start of the current lexeme,
2403 // shift buffer contents and adjust offsets.
2404 copy(in.data[0:], in.data[in.token:in.limit])
2405 in.cursor -= in.token
2406 in.marker -= in.token
2407 in.limit -= in.token
2408 // Tag variables need to be shifted like other input positions. The
2409 // check for -1 is only needed if some tags are nested inside of
2410 // alternative or repetition, so that they can have -1 value.
2411 /*!stags:re2c
2412 format = "if in.@@ != -1 { in.@@ -= in.token }";
2413 separator= "\n\t";
2414 */
2415 in.token = 0
2416
2417 // Read new data (as much as possible to fill the buffer).
2418 n, _ := in.file.Read(in.data[in.limit:SIZE])
2419 in.limit += n
2420 in.data[in.limit] = 0
2421
2422 // If read less than expected, this is the end of input.
2423 in.eof = in.limit < SIZE
2424
2425 // If nothing has been read, fail.
2426 if n == 0 {
2427 return 1
2428 }
2429
2430 return 0
2431 }
2432
2433 func lex(in *Input) []int {
2434 // User-defined local variables that store final tag values. They are
2435 // different from tag variables autogenerated with /*!stags:re2c*/, as
2436 // they are set at the end of match and used only in semantic actions.
2437 var o1, o2, o3, o4 int
2438 var ips []int
2439
2440 num := func(pos int, end int) int {
2441 n := 0
2442 for ; pos < end; pos++ {
2443 n = n*10 + int(in.data[pos]-'0')
2444 }
2445 return n
2446 }
2447
2448 loop:
2449 in.token = in.cursor
2450 /*!re2c
2451 re2c:eof = 0;
2452 re2c:define:YYCTYPE = byte;
2453 re2c:define:YYPEEK = "in.data[in.cursor]";
2454 re2c:define:YYSKIP = "in.cursor += 1";
2455 re2c:define:YYBACKUP = "in.marker = in.cursor";
2456 re2c:define:YYRESTORE = "in.cursor = in.marker";
2457 re2c:define:YYLESSTHAN = "in.limit <= in.cursor";
2458 re2c:define:YYFILL = "fill(in) == 0";
2459 re2c:define:YYSTAGP = "@@{tag} = in.cursor";
2460 re2c:define:YYSTAGN = "@@{tag} = -1";
2461
2462 // The way tag variables are accessed from the lexer (not needed if tag
2463 // variables are defined as local variables).
2464 re2c:tags:expression = "in.@@";
2465
2466 octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
2467 dot = [.];
2468 eol = [\n];
2469
2470 @o1 octet dot @o2 octet dot @o3 octet dot @o4 octet eol {
2471 ips = append(ips, num(o4, in.cursor-1)+
2472 (num(o3, o4-1) << 8)+
2473 (num(o2, o3-1) << 16)+
2474 (num(o1, o2-1) << 24))
2475 goto loop
2476 }
2477 $ { return ips }
2478 * { return nil }
2479 */
2480 }
2481
2482 func TestLex(t *testing.T) {
2483 tmpfile := "input.txt"
2484 var want, have []int
2485
2486 // Write a few IPv4 addresses to the input file and save them to compare
2487 // against parse results.
2488 f, _ := os.Create(tmpfile)
2489 for i := 0; i < 256; i++ {
2490 fmt.Fprintf(f, "%d.%d.%d.%d\n", i, i, i, i)
2491 want = append(want, i + (i<<8) + (i<<16) + (i<<24));
2492 }
2493 f.Seek(0, 0)
2494
2495 defer func() {
2496 if r := recover(); r != nil {
2497 have = nil
2498 }
2499 f.Close()
2500 os.Remove(tmpfile)
2501 }()
2502
2503 in := &Input{
2504 file: f,
2505 data: make([]byte, SIZE+1),
2506 cursor: SIZE,
2507 marker: SIZE,
2508 token: SIZE,
2509 limit: SIZE,
2510 eof: false,
2511 }
2512
2513 have = lex(in)
2514
2515 if !reflect.DeepEqual(have, want) {
2516 t.Errorf("have %d, want %d", have, want)
2517 }
2518 }
2519
2520
2521 Here is an example of using POSIX capturing groups to parse an IPv4 ad‐
2522 dress.
2523
2524 //go:generate re2go $INPUT -o $OUTPUT
2525 package main
2526
2527 import (
2528 "errors"
2529 "testing"
2530 )
2531
2532 /*!maxnmatch:re2c*/
2533
2534 var eBadIP error = errors.New("bad IP")
2535
2536 func lex(str string) (int, error) {
2537 var cursor, marker, yynmatch int
2538 yypmatch := make([]int, YYMAXNMATCH*2)
2539 /*!stags:re2c format = 'var @@ int'; separator = "\n\t"; */
2540
2541 num := func(pos int, end int) int {
2542 n := 0
2543 for ; pos < end; pos++ {
2544 n = n*10 + int(str[pos]-'0')
2545 }
2546 return n
2547 }
2548
2549 /*!re2c
2550 re2c:flags:posix-captures = 1;
2551 re2c:yyfill:enable = 0;
2552 re2c:define:YYCTYPE = byte;
2553 re2c:define:YYPEEK = "str[cursor]";
2554 re2c:define:YYSKIP = "cursor += 1";
2555 re2c:define:YYBACKUP = "marker = cursor";
2556 re2c:define:YYRESTORE = "cursor = marker";
2557 re2c:define:YYSTAGP = "@@{tag} = cursor";
2558 re2c:define:YYSTAGN = "@@{tag} = -1";
2559 re2c:define:YYSHIFTSTAG = "@@{tag} += @@{shift}";
2560
2561 octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
2562 dot = [.];
2563 end = [\x00];
2564
2565 (octet) dot (octet) dot (octet) dot (octet) end {
2566 if yynmatch != 5 {
2567 panic("expected 5 submatch groups")
2568 }
2569 return num(yypmatch[8], yypmatch[9])+
2570 (num(yypmatch[6], yypmatch[7]) << 8)+
2571 (num(yypmatch[4], yypmatch[5]) << 16)+
2572 (num(yypmatch[2], yypmatch[3]) << 24), nil
2573 }
2574 * { return 0, eBadIP }
2575 */
2576 }
2577
2578 func TestLex(t *testing.T) {
2579 var tests = []struct {
2580 str string
2581 res int
2582 err error
2583 }{
2584 {"1.2.3.4\000", 0x01020304, nil},
2585 {"127.0.0.1\000", 0x7f000001, nil},
2586 {"255.255.255.255\000", 0xffffffff, nil},
2587 {"1.2.3.\000", 0, eBadIP},
2588 {"1.2.3.256\000", 0, eBadIP},
2589 }
2590
2591 for _, x := range tests {
2592 t.Run(x.str, func(t *testing.T) {
2593 res, err := lex(x.str)
2594 if !(res == x.res && err == x.err) {
2595 t.Errorf("got %d, want %d", res, x.res)
2596 }
2597 })
2598 }
2599 }
2600
2601
2602 Here is an example of using m-tags to parse a semicolon-separated se‐
2603 quence of words. Tag variables are stored in a tree that is packed in
2604 a slice.
2605
2606 //go:generate re2go $INPUT -o $OUTPUT
2607 package main
2608
2609 import (
2610 "reflect"
2611 "testing"
2612 )
2613
2614 const (
2615 mtagRoot int = -1
2616 mtagNil int = -2
2617 )
2618
2619 type mtagElem struct {
2620 val int
2621 pred int
2622 }
2623
2624 type mtagTrie = []mtagElem
2625
2626 func createTrie(capacity int) mtagTrie {
2627 return make([]mtagElem, 0, capacity)
2628 }
2629
2630 func mtag(trie *mtagTrie, tag int, val int) int {
2631 *trie = append(*trie, mtagElem{val, tag})
2632 return len(*trie) - 1
2633 }
2634
2635 // Recursively unwind both tag histories and construct submatches.
2636 func unwind(trie mtagTrie, x int, y int, str string) []string {
2637 if x == mtagRoot && y == mtagRoot {
2638 return []string{}
2639 } else if x == mtagRoot || y == mtagRoot {
2640 panic("tag histories have different length")
2641 } else {
2642 xval := trie[x].val
2643 yval := trie[y].val
2644 ss := unwind(trie, trie[x].pred, trie[y].pred, str)
2645
2646 // Either both tags should be nil, or none of them.
2647 if xval == mtagNil && yval == mtagNil {
2648 return ss
2649 } else if xval == mtagNil || yval == mtagNil {
2650 panic("tag histories positive/negative tag mismatch")
2651 } else {
2652 s := str[xval:yval]
2653 return append(ss, s)
2654 }
2655 }
2656 }
2657
2658 func lex(str string) []string {
2659 var cursor, marker int
2660 trie := createTrie(256)
2661 x := mtagRoot
2662 y := mtagRoot
2663 /*!mtags:re2c format = "@@ := mtagRoot"; separator = "\n\t"; */
2664
2665 /*!re2c
2666 re2c:flags:tags = 1;
2667 re2c:yyfill:enable = 0;
2668 re2c:define:YYCTYPE = byte;
2669 re2c:define:YYPEEK = "str[cursor]";
2670 re2c:define:YYSKIP = "cursor += 1";
2671 re2c:define:YYBACKUP = "marker = cursor";
2672 re2c:define:YYRESTORE = "cursor = marker";
2673 re2c:define:YYMTAGP = "@@{tag} = mtag(&trie, @@{tag}, cursor)";
2674 re2c:define:YYMTAGN = "@@{tag} = mtag(&trie, @@{tag}, mtagNil)";
2675
2676 end = [\x00];
2677
2678 (#x [a-z]+ #y [;])* end { return unwind(trie, x, y, str) }
2679 * { return nil }
2680 */
2681 }
2682
2683 func TestLex(t *testing.T) {
2684 var tests = []struct {
2685 str string
2686 res []string
2687 }{
2688 {"\000", []string{}},
2689 {"one;two;three;\000", []string{"one", "two", "three"}},
2690 {"one;two\000", nil},
2691 }
2692
2693 for _, x := range tests {
2694 t.Run(x.str, func(t *testing.T) {
2695 res := lex(x.str)
2696 if !reflect.DeepEqual(res, x.res) {
2697 t.Errorf("got %v, want %v", res, x.res)
2698 }
2699 })
2700 }
2701 }
2702
2703
With the -f --storable-state option, re2c generates a lexer that can store
2706 its current state, return to the caller, and later resume operations
2707 exactly where it left off. The default mode of operation in re2c is a
2708 "pull" model, in which the lexer "pulls" more input whenever it needs
2709 it. This may be unacceptable in cases when the input becomes available
2710 piece by piece (for example, if the lexer is invoked by the parser, or
2711 if the lexer program communicates via a socket protocol with some other
2712 program that must wait for a reply from the lexer before it transmits
2713 the next message). Storable state feature is intended exactly for such
2714 cases: it allows one to generate lexers that work in a "push" model.
2715 When the lexer needs more input, it stores its state and returns to the
2716 caller. Later, when more input becomes available, the caller resumes
2717 the lexer exactly where it stopped. There are a few changes necessary
2718 compared to the "pull" model:
2719
• Define YYGETSTATE() and YYSETSTATE(state) primitives.
2721
2722 • Define yych, yyaccept and state variables as a part of persistent
2723 lexer state. The state variable should be initialized to -1.
2724
2725 • YYFILL should return to the outer program instead of trying to supply
2726 more input. Return code should indicate that lexer needs more input.
2727
2728 • The outer program should recognize situations when lexer needs more
2729 input and respond appropriately.
2730
2731 • Use /*!getstate:re2c*/ directive if it is necessary to execute any
2732 code before entering the lexer.
2733
2734 • Use configurations state:abort and state:nextlabel to further tweak
2735 the generated code.
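
The control flow behind this checklist can be illustrated without re2c.
The following hand-written Go sketch is not generated code and does not
use any re2c primitives; it only mimics the "push" model: the lexer keeps
its state in a struct, returns needMore when it runs out of input, and
resumes where it stopped once the caller feeds another chunk. All names
(pushLexer, feed, run, needMore, syntaxError) are hypothetical.

```go
package main

import "fmt"

type status int

const (
	needMore    status = iota // lexer ran out of input and saved its state
	syntaxError               // input contains an unexpected byte
)

// pushLexer holds everything that must survive between calls: the pending
// input, the position in it, and the automaton state (here a single flag).
type pushLexer struct {
	buf    []byte
	pos    int
	inWord bool // true if the lexer stopped in the middle of a packet
	count  int  // number of complete packets seen so far
}

// feed discards consumed input and appends a new chunk; the caller decides
// when data arrives.
func (l *pushLexer) feed(chunk string) {
	l.buf = append(l.buf[l.pos:], chunk...)
	l.pos = 0
}

// run consumes the available input and returns instead of blocking when it
// needs more; calling run again continues from the saved state.
func (l *pushLexer) run() status {
	for l.pos < len(l.buf) {
		c := l.buf[l.pos]
		l.pos++
		switch {
		case c >= 'a' && c <= 'z':
			l.inWord = true
		case c == ';' && l.inWord:
			l.inWord = false
			l.count++
		default:
			return syntaxError
		}
	}
	return needMore
}

func main() {
	l := &pushLexer{}
	// Packets are split across chunks on purpose.
	for _, chunk := range []string{"one;tw", "o;th", "ree;"} {
		l.feed(chunk)
		if l.run() != needMore {
			fmt.Println("syntax error")
			return
		}
	}
	fmt.Println(l.count, l.inWord) // prints "3 false"
}
```

In a re2c-generated lexer the role of the inWord flag is played by the
saved state number, yych and yyaccept, and the return from run
corresponds to YYFILL returning to the outer program.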
2736
Here is an example of a "push"-model lexer that reads input packets from
a file (which stands in for a pipe) and expects each packet to be a
lowercase word terminated by a semicolon. Whenever the lexer runs out of
input, YYFILL saves nothing itself but simply returns lexWaitingForInput
to the caller (the state has already been stored with YYSETSTATE). The
test driver then writes the next packet, if there is one, calls fill to
shift the buffer and read new data, and calls lex again; the lexer
resumes exactly where it stopped. The lexer terminates successfully when
the EOF rule matches, which happens only if it fails to read any input
while being in the initial state. In that case the driver additionally
checks that the number of received packets equals the number of sent
packets (lexCountMismatch otherwise). Abnormal termination happens in
case of a syntax error (lexPacketBroken) or in case the buffer is too
small to hold a lexeme (lexPacketTooBig, for example if one of the
packets exceeds the buffer size).
2756
2757 Example
2758 //go:generate re2go -f $INPUT -o $OUTPUT
2759 package main
2760
2761 import (
2762 "fmt"
2763 "os"
2764 "testing"
2765 )
2766
2767 // Intentionally small to trigger buffer refill.
2768 const SIZE int = 16
2769
2770 type Input struct {
2771 file *os.File
2772 data []byte
2773 cursor int
2774 marker int
2775 token int
2776 limit int
2777 state int
2778 yyaccept int
2779 }
2780
2781 const (
2782 lexEnd = iota
2783 lexReady
2784 lexWaitingForInput
2785 lexPacketBroken
2786 lexPacketTooBig
2787 lexCountMismatch
2788 )
2789
2790 func fill(in *Input) int {
2791 if in.token == 0 {
2792 // Error: no space can be freed.
2793 // In real life can reallocate a larger buffer.
2794 return lexPacketTooBig
2795 }
2796
2797 // Discard everything up to the start of the current lexeme,
2798 // shift buffer contents and adjust offsets.
2799 copy(in.data[0:], in.data[in.token:in.limit])
2800 in.cursor -= in.token
2801 in.marker -= in.token
2802 in.limit -= in.token
2803 in.token = 0
2804
2805 // Read new data (as much as possible to fill the buffer).
2806 n, _ := in.file.Read(in.data[in.limit:SIZE])
2807 in.limit += n
2808 in.data[in.limit] = 0 // append sentinel symbol
2809
2810 return lexReady
2811 }
2812
2813 func lex(in *Input, recv *int) int {
2814 var yych byte
2815 /*!getstate:re2c*/
2816 loop:
2817 in.token = in.cursor
2818 /*!re2c
2819 re2c:eof = 0;
2820 re2c:define:YYPEEK = "in.data[in.cursor]";
2821 re2c:define:YYSKIP = "in.cursor += 1";
2822 re2c:define:YYBACKUP = "in.marker = in.cursor";
2823 re2c:define:YYRESTORE = "in.cursor = in.marker";
2824 re2c:define:YYLESSTHAN = "in.limit <= in.cursor";
2825 re2c:define:YYFILL = "return lexWaitingForInput";
2826 re2c:define:YYGETSTATE = "in.state";
2827 re2c:define:YYSETSTATE = "in.state = @@{state}";
2828
2829 packet = [a-z]+[;];
2830
2831 * { return lexPacketBroken }
2832 $ { return lexEnd }
2833 packet { *recv = *recv + 1; goto loop }
2834 */
2835 }
2836
2837 func test(packets []string) int {
2838 fname := "pipe"
fw, _ := os.Create(fname)
fr, _ := os.Open(fname)
2841
2842 in := &Input{
2843 file: fr,
2844 data: make([]byte, SIZE+1),
2845 cursor: SIZE,
2846 marker: SIZE,
2847 token: SIZE,
2848 limit: SIZE,
2849 state: -1,
2850 }
2851 // data is zero-initialized, no need to write sentinel
2852
2853 var status int
2854 send := 0
2855 recv := 0
2856 loop:
2857 for {
2858 status = lex(in, &recv)
2859 if status == lexEnd {
2860 if send != recv {
2861 status = lexCountMismatch
2862 }
2863 break loop
2864 } else if status == lexWaitingForInput {
2865 if send < len(packets) {
2866 fw.WriteString(packets[send])
2867 send += 1
2868 }
2869 status = fill(in)
2870 if status != lexReady {
2871 break loop
2872 }
2873 } else if status == lexPacketBroken {
2874 break loop
2875 } else {
2876 panic("unexpected status")
2877 }
2878 }
2879
2880 fr.Close()
2881 fw.Close()
2882 os.Remove(fname)
2883
2884 return status
2885 }
2886
2887 func TestLex(t *testing.T) {
2888 var tests = []struct {
2889 status int
2890 packets []string
2891 }{
2892 {lexEnd, []string{}},
2893 {lexEnd, []string{"zero;", "one;", "two;", "three;", "four;"}},
2894 {lexPacketBroken, []string{"??;"}},
2895 {lexPacketTooBig, []string{"looooooooooooong;"}},
2896 }
2897
2898 for i, x := range tests {
2899 t.Run(fmt.Sprintf("%d", i), func(t *testing.T) {
2900 status := test(x.packets)
2901 if status != x.status {
2902 t.Errorf("got %d, want %d", status, x.status)
2903 }
2904 })
2905 }
2906 }
2907
2908
2910 Reuse mode is enabled with the -r --reusable option. In this mode re2c
2911 allows one to reuse definitions, configurations and rules specified by
2912 a /*!rules:re2c*/ block in subsequent /*!use:re2c*/ blocks. As of
2913 re2c-1.2 it is possible to mix such blocks with normal /*!re2c*/
2914 blocks; prior to that re2c expects a single rules-block followed by
2915 use-blocks (normal blocks are disallowed). Use-blocks can have addi‐
2916 tional definitions, configurations and rules: they are merged to those
2917 specified by the rules-block. A very common use case for -r --reusable
2918 option is a lexer that supports multiple input encodings: lexer rules
2919 are defined once and reused multiple times with encoding-specific con‐
2920 figurations, such as re2c:flags:utf-8.
2921
Below is an example of a multi-encoding lexer: it reads a phrase with
Unicode math symbols and accepts input either in UTF8 or in UTF32. Note
that the --input-encoding utf8 option allows us to write UTF8-encoded
symbols in the regular expressions; without this option re2c would
parse them as a plain ASCII byte sequence (and we would have to use
hexadecimal escape sequences).
2928
2929 Example
2930 //go:generate re2go $INPUT -o $OUTPUT -r --input-encoding utf8
2931 package main
2932
2933 import "testing"
2934
2935 /*!rules:re2c
2936 re2c:yyfill:enable = 0;
2937 re2c:define:YYPEEK = "str[cursor]";
2938 re2c:define:YYSKIP = "cursor += 1";
2939 re2c:define:YYBACKUP = "marker = cursor";
2940 re2c:define:YYRESTORE = "cursor = marker";
2941
2942 "∀x ∃y: p(x, y)" { return 0; }
2943 * { return 1; }
2944 */
2945
2946 func lexUTF8(str []uint8) int {
2947 var cursor, marker int
2948 /*!use:re2c
2949 re2c:flags:8 = 1;
2950 re2c:define:YYCTYPE = uint8;
2951 */
2952 }
2953
2954 func lexUTF32(str []uint32) int {
2955 var cursor, marker int
2956 /*!use:re2c
2957 re2c:flags:u = 1;
2958 re2c:define:YYCTYPE = uint32;
2959 */
2960 }
2961
2962 func TestLexUTF8(t *testing.T) {
2963 s_utf8 := []uint8{
2964 0xe2, 0x88, 0x80, 0x78, 0x20, 0xe2, 0x88, 0x83, 0x79,
2965 0x3a, 0x20, 0x70, 0x28, 0x78, 0x2c, 0x20, 0x79, 0x29};
2966
2967 if lexUTF8(s_utf8) != 0 {
2968 t.Errorf("utf8 failed")
2969 }
2970 }
2971
2972 func TestLexUTF32(t *testing.T) {
2973 s_utf32 := []uint32{
2974 0x00002200, 0x00000078, 0x00000020, 0x00002203, 0x00000079,
2975 0x0000003a, 0x00000020, 0x00000070, 0x00000028, 0x00000078,
2976 0x0000002c, 0x00000020, 0x00000079, 0x00000029};
2977
2978 if lexUTF32(s_utf32) != 0 {
2979 t.Errorf("utf32 failed")
2980 }
2981 }
2982
2983
2985 Speaking of encodings, it is necessary to understand the difference be‐
2986 tween code points and code units. Code point is an abstract symbol.
2987 Code unit is the smallest atomic unit of storage in the encoded text.
2988 A single code point may be represented with one or more code units. In
2989 a fixed-length encoding all code points are represented with the same
2990 number of code units. In a variable-length encoding code points may be
2991 represented with a different number of code units. Note that the "any"
2992 rule [^] matches any code point, but not necessarily any code unit.
The only way to match any code unit regardless of the encoding is the
default rule *. YYCTYPE size should be equal to the size of code unit.
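
The difference between code points and code units can be checked with the
Go standard library. The sketch below is independent of re2c; the helper
name codeUnits is hypothetical. It counts how many code units a single
code point occupies in UTF8, UTF16 and UTF32.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// codeUnits reports how many code units one code point occupies in UTF-8
// (1-byte units), UTF-16 (2-byte units) and UTF-32 (4-byte units).
func codeUnits(r rune) (u8, u16, u32 int) {
	u8 = utf8.RuneLen(r)
	u16 = len(utf16.Encode([]rune{r}))
	u32 = 1 // fixed-length encoding: always exactly one code unit
	return
}

func main() {
	// LATIN SMALL LETTER X, FOR ALL, GRINNING FACE
	for _, r := range []rune{'x', '\u2200', '\U0001F600'} {
		u8, u16, u32 := codeUnits(r)
		fmt.Printf("U+%04X: UTF-8 %d, UTF-16 %d, UTF-32 %d\n", r, u8, u16, u32)
	}
}
```

The three code points take 1, 3 and 4 code units in UTF8 and 1, 1 and 2
code units in UTF16, which is why a rule that consumes one code point may
advance the cursor by more than one YYCTYPE element.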
2995
2996 Re2c supports the following encodings: ASCII, EBCDIC, UCS2, UTF8, UTF16
2997 and UTF32.
2998
2999 • ASCII is enabled by default. It is a fixed-length encoding with code
3000 space [0-255] and 1-byte code points and code units.
3001
• EBCDIC is enabled with -e, --ecb option. It is a fixed-length encoding
with code space [0-255] and 1-byte code points and code units.
3004
3005 • UCS2 is enabled with -w, --wide-chars option. It is a fixed-length
3006 encoding with code space [0-0xFFFF] and 2-byte code points and code
3007 units.
3008
3009 • UTF8 is enabled with -8, --utf-8 option. It is a variable-length
3010 Unicode encoding with code space [0-0x10FFFF]. Code points are rep‐
3011 resented with one, two, three or four 1-byte code units.
3012
3013 • UTF16 is enabled with -x, --utf-16 option. It is a variable-length
3014 Unicode encoding with code space [0-0x10FFFF]. Code points are rep‐
3015 resented with one or two 2-byte code units.
3016
3017 • UTF32 is enabled with -u, --unicode option. It is a fixed-length
3018 Unicode encoding with code space [0-0x10FFFF] and 4-byte code points
3019 and code units.
3020
3021 Encodings can also be set or unset using re2c:flags configuration, for
3022 example re2c:flags:8 = 1; enables UTF8.
3023
3024 Include file include/unicode_categories.re provides re2c definitions
3025 for the standard Unicode categories.
3026
3027 Option --input-encoding utf8 enables Unicode literals in regular ex‐
3028 pressions.
3029
3030 Option --encoding-policy <fail | substitute | ignore> specifies the way
3031 re2c handles Unicode surrogates: code points in the range
3032 [0xD800-0xDFFF].
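
The text above does not spell out what each policy does. The sketch below
rests on an assumption about their meaning: fail reports an error,
substitute replaces the surrogate with the replacement code point 0xFFFD,
and ignore keeps the surrogate as an ordinary code point. It is a
standalone illustration applied to a single code point, not re2c's
implementation (re2c applies the policy when it compiles regular
expressions), and all names in it are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

type policy int

const (
	policyFail policy = iota
	policySubstitute
	policyIgnore
)

var errSurrogate = errors.New("surrogate code point")

// isSurrogate reports whether cp lies in the range [0xD800-0xDFFF].
func isSurrogate(cp uint32) bool {
	return cp >= 0xD800 && cp <= 0xDFFF
}

// applyPolicy shows the assumed effect of each policy on one code point.
func applyPolicy(cp uint32, p policy) (uint32, error) {
	if !isSurrogate(cp) {
		return cp, nil // policies only affect surrogates
	}
	switch p {
	case policyFail:
		return 0, errSurrogate
	case policySubstitute:
		return 0xFFFD, nil
	default: // policyIgnore
		return cp, nil
	}
}

func main() {
	for _, p := range []policy{policyFail, policySubstitute, policyIgnore} {
		cp, err := applyPolicy(0xD800, p)
		fmt.Printf("%#x %v\n", cp, err)
	}
}
```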
3033
3034 Example
3035 //go:generate re2go $INPUT -o $OUTPUT -8 -s -i
3036 //
3037 // Simplified "Unicode Identifier and Pattern Syntax"
3038 // (see https://unicode.org/reports/tr31)
3039
3040 package main
3041
3042 import "testing"
3043
3044 /*!include:re2c "unicode_categories.re" */
3045
3046 func lex(str string) int {
3047 var cursor, marker int
3048 /*!re2c
3049 re2c:yyfill:enable = 0;
3050 re2c:define:YYCTYPE = byte;
3051 re2c:define:YYPEEK = "str[cursor]";
3052 re2c:define:YYSKIP = "cursor += 1";
3053 re2c:define:YYBACKUP = "marker = cursor";
3054 re2c:define:YYRESTORE = "cursor = marker";
3055
3056 id_start = L | Nl | [$_];
3057 id_continue = id_start | Mn | Mc | Nd | Pc | [\u200D\u05F3];
3058 identifier = id_start id_continue*;
3059
3060 identifier { return 0 }
3061 * { return 1 }
3062 */
3063 }
3064
3065 func TestLex(t *testing.T) {
3066 if lex("_Ыдентификатор\000") != 0 {
3067 t.Errorf("failed")
3068 }
3069 }
3070
3071
3073 Conditions are enabled with -c --conditions. This option allows one to
3074 encode multiple interrelated lexers within the same re2c block.
3075
3076 Each lexer corresponds to a single condition. It starts with a label
3077 of the form yyc_name, where name is condition name and yyc prefix can
3078 be adjusted with configuration re2c:condprefix. Different lexers are
3079 separated with a comment /* *********************************** */
3080 which can be adjusted with configuration re2c:cond:divider.
3081
Furthermore, each condition has a unique identifier of the form
yycname, where name is condition name and yyc prefix can be adjusted with
configuration re2c:condenumprefix. Identifiers have the type YYCONDTYPE
and should be generated with /*!types:re2c*/ directive or -t
--type-header option. Users shouldn't define these identifiers
manually, as the order of conditions is not specified.
3088
3089 Before all conditions re2c generates entry code that checks the current
3090 condition identifier and transfers control flow to the start label of
3091 the active condition. After matching some rule of this condition,
3092 lexer may either transfer control flow back to the entry code (after
3093 executing the associated action and optionally setting another condi‐
3094 tion with =>), or use :=> shortcut and transition directly to the start
3095 label of another condition (skipping the action and the entry code).
3096 Configuration re2c:cond:goto allows one to change the default behavior.
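
The dispatch described above can be pictured with a hand-written Go
analogue. This is not re2c output and its names (cond, condInit, condBin,
condDec, parse) are hypothetical; it only shows the shape of the control
flow: entry code switches on the current condition, and each condition
either consumes input and stays active or hands over to another
condition. Overflow is not handled in this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

type cond int

const (
	condInit cond = iota
	condBin
	condDec
)

// parse accepts "0b" followed by binary digits, or a decimal number that
// starts with a nonzero digit.
func parse(s string) (uint64, bool) {
	c, i, n := condInit, 0, uint64(0)
	for {
		// Entry code: dispatch on the current condition.
		switch c {
		case condInit:
			if strings.HasPrefix(s, "0b") {
				i, c = 2, condBin // consume the prefix, switch condition
			} else if s != "" && s[0] >= '1' && s[0] <= '9' {
				c = condDec // switch condition without consuming input
			} else {
				return 0, false
			}
		case condBin:
			if i == len(s) {
				return n, i > 2 // at least one digit after "0b"
			}
			if s[i] != '0' && s[i] != '1' {
				return 0, false
			}
			n = n*2 + uint64(s[i]-'0')
			i++
		case condDec:
			if i == len(s) {
				return n, true
			}
			if s[i] < '0' || s[i] > '9' {
				return 0, false
			}
			n = n*10 + uint64(s[i]-'0')
			i++
		}
	}
}

func main() {
	fmt.Println(parse("0b1101")) // prints "13 true"
	fmt.Println(parse("42"))     // prints "42 true"
	fmt.Println(parse("12x"))    // prints "0 false"
}
```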
3097
3098 Syntactically each rule must be preceded with a list of comma-separated
3099 condition names or a wildcard * enclosed in angle brackets < and >.
3100 Wildcard means "any condition" and is semantically equivalent to list‐
3101 ing all condition names. Here regexp is a regular expression, default
3102 refers to the default rule *, and action is a block of code.
3103
3104 • <conditions-or-wildcard> regexp-or-default action
3105
3106 • <conditions-or-wildcard> regexp-or-default => condition action
3107
3108 • <conditions-or-wildcard> regexp-or-default :=> condition
3109
Rules with an exclamation mark ! in front of condition list have a
special meaning: they have no regular expression, and the associated
action is merged as an entry code to actions of normal rules. This might
be a convenient place to perform a routine task that is common to all
rules.
3115
3116 • <!conditions-or-wildcard> action
3117
3118 Another special form of rules with an empty condition list <> and no
3119 regular expression allows one to specify an "entry condition" that can
3120 be used to execute code before entering the lexer. It is semantically
3121 equivalent to a condition with number zero, name 0 and an empty regular
3122 expression.
3123
3124 • <> action
3125
3126 • <> => condition action
3127
3128 • <> :=> condition
3129
3130 Example
3131 //go:generate re2go -c $INPUT -o $OUTPUT -i
3132 package main
3133
3134 import (
3135 "errors"
3136 "testing"
3137 )
3138
3139 var (
3140 eSyntax = errors.New("syntax error")
3141 eOverflow = errors.New("overflow error")
3142 )
3143
3144 /*!types:re2c*/
3145
3146 const u32Limit uint64 = 1<<32
3147
3148 func parse_u32(str string) (uint32, error) {
3149 var cursor, marker int
3150 result := uint64(0)
3151 cond := yycinit
3152
3153 add_digit := func(base uint64, offset byte) {
3154 result = result * base + uint64(str[cursor-1] - offset)
3155 if result >= u32Limit {
3156 result = u32Limit
3157 }
3158 }
3159
3160 /*!re2c
3161 re2c:yyfill:enable = 0;
3162 re2c:define:YYCTYPE = byte;
3163 re2c:define:YYPEEK = "str[cursor]";
3164 re2c:define:YYSKIP = "cursor += 1";
3165 re2c:define:YYSHIFT = "cursor += @@{shift}";
3166 re2c:define:YYBACKUP = "marker = cursor";
3167 re2c:define:YYRESTORE = "cursor = marker";
3168 re2c:define:YYGETCONDITION = "cond";
3169 re2c:define:YYSETCONDITION = "cond = @@";
3170
3171 <*> * { return 0, eSyntax }
3172
3173 <init> '0b' / [01] :=> bin
3174 <init> "0" :=> oct
3175 <init> "" / [1-9] :=> dec
3176 <init> '0x' / [0-9a-fA-F] :=> hex
3177
3178 <bin, oct, dec, hex> "\x00" {
3179 if result < u32Limit {
3180 return uint32(result), nil
3181 } else {
3182 return 0, eOverflow
3183 }
3184 }
3185
3186 <bin> [01] { add_digit(2, '0'); goto yyc_bin }
3187 <oct> [0-7] { add_digit(8, '0'); goto yyc_oct }
3188 <dec> [0-9] { add_digit(10, '0'); goto yyc_dec }
3189 <hex> [0-9] { add_digit(16, '0'); goto yyc_hex }
3190 <hex> [a-f] { add_digit(16, 'a'-10); goto yyc_hex }
3191 <hex> [A-F] { add_digit(16, 'A'-10); goto yyc_hex }
3192 */
3193 }
3194
3195 func TestLex(t *testing.T) {
3196 var tests = []struct {
3197 num uint32
3198 str string
3199 err error
3200 }{
3201 {1234567890, "1234567890\000", nil},
3202 {13, "0b1101\000", nil},
3203 {0x7fe, "0x007Fe\000", nil},
3204 {0644, "0644\000", nil},
3205 {0, "9999999999\000", eOverflow},
3206 {0, "123??\000", eSyntax},
3207 }
3208
3209 for _, x := range tests {
3210 t.Run(x.str, func(t *testing.T) {
3211 num, err := parse_u32(x.str)
3212 if !(num == x.num && err == x.err) {
3213 t.Errorf("got %d, want %d", num, x.num)
3214 }
3215 })
3216 }
3217 }
3218
3219
3221 With the -S, --skeleton option, re2c ignores all non-re2c code and gen‐
3222 erates a self-contained C program that can be further compiled and exe‐
3223 cuted. The program consists of lexer code and input data. For each con‐
3224 structed DFA (block or condition) re2c generates a standalone lexer and
3225 two files: an .input file with strings derived from the DFA and a .keys
3226 file with expected match results. The program runs each lexer on the
3227 corresponding .input file and compares results with the expectations.
3228 Skeleton programs are very useful for a number of reasons:
3229
3230 • They can check correctness of various re2c optimizations (the data is
3231 generated early in the process, before any DFA transformations have
3232 taken place).
3233
3234 • Generating a set of input data with good coverage may be useful for
3235 both testing and benchmarking.
3236
3237 • Generating self-contained executable programs allows one to get mini‐
3238 mized test cases (the original code may be large or have a lot of de‐
3239 pendencies).
3240
3241 The difficulty with generating input data is that for all but the most
3242 trivial cases the number of possible input strings is too large (even
3243 if the string length is limited). Re2c solves this difficulty by gener‐
3244 ating sufficiently many strings to cover almost all DFA transitions. It
3245 uses the following algorithm. First, it constructs a skeleton of the
3246 DFA. For encodings with 1-byte code unit size (such as ASCII, UTF-8 and
3247 EBCDIC) skeleton is just an exact copy of the original DFA. For encod‐
3248 ings with multibyte code units skeleton is a copy of DFA with certain
3249 transitions omitted: namely, re2c takes at most 256 code units for each
3250 disjoint continuous range that corresponds to a DFA transition. The
3251 chosen values are evenly distributed and include range bounds. Instead
3252 of trying to cover all possible paths in the skeleton (which is infea‐
3253 sible) re2c generates sufficiently many paths to cover all skeleton
3254 transitions, and thus trigger the corresponding conditional jumps in
the lexer. The algorithm implementation is limited by ~1Gb of
transitions and consumes a constant amount of memory (re2c writes data
to file as soon as it is generated).
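
The sampling step for multibyte code units (at most 256 evenly
distributed values per range, including the range bounds) can be sketched
as follows. This is an illustration of the idea, not re2c's actual
implementation; the function name sampleRange and its signature are
hypothetical, and it assumes lo <= hi and max >= 2.

```go
package main

import "fmt"

// sampleRange returns at most max values from the closed range [lo, hi].
// Small ranges are returned in full; larger ones are sampled at evenly
// spaced positions that always include both bounds.
// Assumes lo <= hi and max >= 2.
func sampleRange(lo, hi uint32, max int) []uint32 {
	n := uint64(hi-lo) + 1 // number of values in the range
	if n <= uint64(max) {
		out := make([]uint32, 0, n)
		for v := uint64(lo); v <= uint64(hi); v++ {
			out = append(out, uint32(v))
		}
		return out
	}
	out := make([]uint32, max)
	for i := range out {
		// Linear interpolation: i == 0 gives lo, i == max-1 gives hi.
		out[i] = lo + uint32(uint64(i)*(n-1)/uint64(max-1))
	}
	return out
}

func main() {
	fmt.Println(sampleRange(10, 12, 256)) // prints "[10 11 12]"
	fmt.Println(sampleRange(0, 9, 4))     // prints "[0 3 6 9]"

	s := sampleRange(0, 0xFFFF, 256)
	fmt.Println(len(s), s[0], s[len(s)-1]) // prints "256 0 65535"
}
```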
3258
3260 With the -D, --emit-dot option, re2c does not generate code. Instead,
3261 it dumps the generated DFA in DOT format. One can convert this dump to
3262 an image of the DFA using Graphviz or another library. Note that this
3263 option shows the final DFA after it has gone through a number of opti‐
3264 mizations and transformations. Earlier stages can be dumped with vari‐
3265 ous debug options, such as --dump-nfa, --dump-dfa-raw etc. (see the
3266 full list of options).
3267
You can find more information about re2c at the official website:
http://re2c.org. Similar programs are flex(1), lex(1), quex
(http://quex.sourceforge.net).
3272
Re2c was originally written by Peter Bumbulis in 1993. Since then it
has been developed and maintained by multiple volunteers; most notably,
Brian Young, Marcus Boerger, Dan Nuffer and Ulya Trofimovich.
3277
3278
3279
3280
3281 RE2C(1)