1RE2C(1) RE2C(1)
2
3
4
6 re2c - compile regular expressions to code
7
9 re2c [OPTIONS] INPUT [-o OUTPUT]
10
11 re2go [OPTIONS] INPUT [-o OUTPUT]
12
14 re2c is a tool for generating fast lexical analyzers for C, C++ and Go.
15
16 Note: This manual includes examples for Go, but it refers to re2c
17 (rather than re2go) as the name of the program in general.
18
20 A re2c program consists of normal code intermixed with re2c blocks and
21 directives. Each re2c block may contain definitions, configurations
22 and rules. Definitions are of the form name = regexp; where name is
23 an identifier that consists of letters, digits and underscores, and
24 regexp is a regular expression. Regular expressions may contain other
25 definitions, but recursion is not allowed and each name should be
26 defined before used. Configurations are of the form re2c:config =
27 value; where config is the configuration descriptor and value can be a
28 number, a string or a special word. Rules consist of a regular expres‐
29 sion followed by a semantic action (a block of code enclosed in curly
30 braces { and }, or a raw one line of code preceded with := and ended
31 with a newline that is not followed by a whitespace). If the input
32 matches the regular expression, the associated semantic action is exe‐
33 cuted. If multiple rules match, the longest match takes precedence.
34 If multiple rules match the same string, the earlier rule takes prece‐
35 dence. There are two special rules: default rule * and EOF rule $.
36 The default rule should always be defined; it has the lowest priority
37 regardless of its place and matches any code unit (not necessarily a
38 valid character, see encoding support). The EOF rule matches the end of
39 input; it should be defined if the corresponding EOF handling method is
40 used. If start conditions are used, rules have more complex syntax.
41 All rules of a single block are compiled into a deterministic
42 finite-state automaton (DFA) and encoded in the form of a program in
43 the target language. The generated code interfaces with the outer pro‐
44 gram by the means of a few user-defined primitives (see the program
45 interface section). Reusable blocks allow sharing rules, definitions
46 and configurations between different blocks.
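
The precedence rules above (longest match first, earlier rule on a tie, default rule last) can be modelled with a small Go sketch. This is only an illustration built on Go's regexp package, not how re2c works internally (re2c compiles all rules into a single DFA). The names rule, newRule and pick are hypothetical, and the default rule is modelled as a fallback that consumes one code unit.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule pairs a rule name with a regular expression anchored at the start.
type rule struct {
	name string
	re   *regexp.Regexp
}

// newRule anchors the pattern and switches the matcher to
// leftmost-longest semantics.
func newRule(name, pattern string) rule {
	re := regexp.MustCompile("^(?:" + pattern + ")")
	re.Longest()
	return rule{name, re}
}

// pick returns the winning rule and the match length. A longer match wins;
// on equal length the earlier rule is kept because the comparison is strict.
// If no rule matches and input remains, the default rule takes one code unit.
func pick(rules []rule, input string) (string, int) {
	best, bestLen := "", -1
	for _, r := range rules {
		loc := r.re.FindStringIndex(input)
		if loc != nil && loc[1] > bestLen {
			best, bestLen = r.name, loc[1]
		}
	}
	if bestLen < 0 && len(input) > 0 {
		return "default", 1
	}
	return best, bestLen
}

func main() {
	rules := []rule{
		newRule("keyword", "if"),
		newRule("identifier", "[a-z]+"),
	}
	for _, s := range []string{"if", "iffy", "42"} {
		name, n := pick(rules, s)
		fmt.Printf("%q -> rule %s, length %d\n", s, name, n)
	}
}
```

Here "if" goes to the earlier keyword rule, "iffy" goes to the identifier rule because its match is longer, and "42" falls through to the default rule.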
47
49 Input file
//go:generate re2go $INPUT -o $OUTPUT -i
package main                                //
                                            //
func lex(str string) int {                  // Go code
    var cursor int                          //

/*!re2c                                     // start of re2c block
    re2c:define:YYCTYPE = byte;             // configuration
    re2c:define:YYPEEK = "str[cursor]";     // configuration
    re2c:define:YYSKIP = "cursor += 1";     // configuration
    re2c:yyfill:enable = 0;                 // configuration
    re2c:flags:nested-ifs = 1;              // configuration
                                            //
    number = [1-9][0-9]*;                   // named definition
                                            //
    number { return 0; }                    // normal rule
    *      { return 1; }                    // default rule
*/
}                                           //
                                            //
func main() {                               //
    if lex("1234\x00") != 0 {               // Go code
        panic("failed!")                    //
    }                                       //
}                                           //
75
76
77 Output file
// Code generated by re2c, DO NOT EDIT.
//go:generate re2go $INPUT -o $OUTPUT -i
package main                                //
                                            //
func lex(str string) int {                  // Go code
    var cursor int                          //


{
    var yych byte
    yych = str[cursor]
    if (yych <= '0') {
        goto yy2
    }
    if (yych <= '9') {
        goto yy4
    }
yy2:
    cursor += 1
    { return 1; }
yy4:
    cursor += 1
    yych = str[cursor]
    if (yych <= '/') {
        goto yy6
    }
    if (yych <= '9') {
        goto yy4
    }
yy6:
    { return 0; }
}

}                                           //
                                            //
func main() {                               //
    if lex("1234\x00") != 0 {               // Go code
        panic("failed!")                    //
    }                                       //
}                                           //
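
To make the control flow of the generated automaton easier to follow, here is a hand-written sketch of the same logic using a loop instead of labels and goto. It is not re2c output; the function name lexNumber is hypothetical. Like the example above, it assumes the input ends with a byte that is not a digit (the "\x00" sentinel), because there is no bounds check.

```go
package main

import "fmt"

// lexNumber returns 0 if str starts with a number matching [1-9][0-9]*
// and 1 otherwise (the default rule). The caller must terminate str with
// a non-digit byte, otherwise str[cursor] reads past the end and panics.
func lexNumber(str string) int {
	cursor := 0
	yych := str[cursor]
	if yych < '1' || yych > '9' {
		// Corresponds to label yy2: consume one code unit, default rule.
		cursor++
		return 1
	}
	for {
		// Corresponds to label yy4: consume a digit and look at the next byte.
		cursor++
		yych = str[cursor]
		if yych < '0' || yych > '9' {
			// Corresponds to label yy6: the number rule has matched.
			return 0
		}
	}
}

func main() {
	for _, s := range []string{"1234\x00", "0\x00", "x\x00"} {
		fmt.Printf("%q -> %d\n", s, lexNumber(s))
	}
}
```

The input "0\x00" takes the default rule because the definition of number does not allow a leading zero.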
118
119
121 -? -h --help
122 Show help message.
123
124 -1 --single-pass
125 Deprecated. Does nothing (single pass is the default now).
126
127 -8 --utf-8
128 Generate a lexer that reads input in UTF-8 encoding. re2c
129 assumes that character range is 0 -- 0x10FFFF and character size
130 is 1 byte.
131
132 -b --bit-vectors
133 Optimize conditional jumps using bit masks. Implies -s.
134
135 -c --conditions --start-conditions
136 Enable support of Flex-like "conditions": multiple interrelated
137 lexers within one block. Option --start-conditions is a legacy
138 alias; use --conditions instead.
139
140 --case-insensitive
141 Treat single-quoted and double-quoted strings as case-insensi‐
142 tive.
143
144 --case-inverted
145 Invert the meaning of single-quoted and double-quoted strings:
146 treat single-quoted strings as case-sensitive and double-quoted
147 strings as case-insensitive.
148
149 --case-ranges
150 Collapse consecutive cases in a switch statement into a range
151 of the form case low ... high:. This syntax is an extension of
152 the C/C++ language, supported by compilers like GCC, Clang and
153 Tcc. The main advantage over using single cases is smaller gen‐
154 erated C code and faster generation time, although for some com‐
155 pilers like Tcc it also results in smaller binary size. This
156 option doesn't work for the Go backend.
157
158 -e --ecb
159 Generate a lexer that reads input in EBCDIC encoding. re2c
160 assumes that character range is 0 -- 0xFF and character size is 1
161 byte.
162
163 --empty-class <match-empty | match-none | error>
164 Define the way re2c treats empty character classes. With
165 match-empty (the default) empty class matches empty input (which
166 is illogical, but backwards-compatible). With match-none
167 empty class always fails to match. With error empty class
168 raises a compilation error.
169
170 --encoding-policy <fail | substitute | ignore>
171 Define the way re2c treats Unicode surrogates. With fail re2c
172 aborts with an error when a surrogate is encountered. With sub‐
173 stitute re2c silently replaces surrogates with the error code
174 point 0xFFFD. With ignore (the default) re2c treats surrogates
175 as normal code points. The Unicode standard says that standalone
176 surrogates are invalid, but real-world libraries and programs
177 behave in different ways.
178
179 -f --storable-state
180 Generate a lexer which can store its inner state. This is use‐
181 ful in push-model lexers which are stopped by an outer program
182 when there is not enough input, and then resumed when more input
183 becomes available. In this mode users should additionally define
184 YYGETSTATE() and YYSETSTATE(state) macros and variables yych,
185 yyaccept and state as part of the lexer state.
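
The push model described above can be pictured with a hand-written Go sketch. It is not what re2c generates with -f; it only shows the idea of a lexer that stores its state, stops when input runs out and resumes on the next chunk. All names (Lexer, Feed, NeedMore, Matched, Failed) are hypothetical. The stored state starts at -1, matching the convention for the initial lexer state.

```go
package main

import "fmt"

// Result codes returned by Feed (hypothetical names).
const (
	NeedMore = iota // input ran out in the middle of a lexeme
	Matched         // a complete number [1-9][0-9]* was recognized
	Failed          // the input does not start with a number
)

// Lexer holds everything that must survive between calls to Feed.
type Lexer struct {
	state  int // -1 before the first symbol, 1 while inside a number
	digits int // number of digits consumed so far
}

// NewLexer returns a lexer in the initial state.
func NewLexer() *Lexer { return &Lexer{state: -1} }

// Feed consumes one chunk of input and resumes from the stored state.
func (l *Lexer) Feed(chunk string) int {
	for i := 0; i < len(chunk); i++ {
		c := chunk[i]
		switch l.state {
		case -1:
			if c < '1' || c > '9' {
				return Failed
			}
			l.state, l.digits = 1, 1
		case 1:
			if c < '0' || c > '9' {
				return Matched
			}
			l.digits++
		}
	}
	return NeedMore
}

func main() {
	l := NewLexer()
	fmt.Println(l.Feed("12") == NeedMore)           // true
	fmt.Println(l.Feed("34;") == Matched, l.digits) // true 4
}
```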
186
187 -F --flex-syntax
188 Partial support for Flex syntax: in this mode named definitions
189 don't need the equal sign and the terminating semicolon, and
190 when used they must be surrounded by curly braces. Names without
191 curly braces are treated as double-quoted strings.
192
193 -g --computed-gotos
194 Optimize conditional jumps using non-standard "computed goto"
195 extension (which must be supported by the compiler). re2c gener‐
196 ates jump tables only in complex cases with a lot of conditional
197 branches. Complexity threshold can be configured with
198 cgoto:threshold configuration. This option implies -b. This
199 option doesn't work for the Go backend.
200
201 -I PATH
202 Add PATH to the list of locations which are used when searching
203 for include files. This option is useful in combination with
204 /*!include:re2c ... */ directive. Re2c looks for FILE in the
205 directory of including file and in the list of include paths
206 specified by -I option.
207
208 -i --no-debug-info
209 Do not output #line information. This is useful when the gener‐
210 ated code is tracked by some version control system or IDE.
211
212 --input <default | custom>
213 Specify the API used by the generated code to interface with
214 user-defined code. Option default is the C API based on pointer
215 arithmetic (it is the default for the C backend). Option custom
216 is the generic API (it is the default for the Go backend).
217
218 --input-encoding <ascii | utf8>
219 Specify the way re2c parses regular expressions. With ascii
220 (the default) re2c handles input as ASCII-encoded: any sequence
221 of code units is a sequence of standalone 1-byte characters.
222 With utf8 re2c handles input as UTF8-encoded and recognizes
223 multibyte characters.
224
225 --lang <c | go>
226 Specify the output language. Supported languages are C and Go
227 (the default is C).
228
229 --location-format <gnu | msvc>
230 Specify location format in messages. With gnu locations are
231 printed as 'filename:line:column: ...'. With msvc locations are
232 printed as 'filename(line,column) ...'. Default is gnu.
233
234 --no-generation-date
235 Suppress date output in the generated file.
236
237 --no-version
238 Suppress version output in the generated file.
239
240 -o OUTPUT --output=OUTPUT
241 Specify the OUTPUT file.
242
243 -P --posix-captures
244 Enable submatch extraction with POSIX-style capturing groups.
245
246 -r --reusable
247 Allows reuse of re2c rules with /*!rules:re2c */ and /*!use:re2c
248 */ blocks. Exactly one rules-block must be present. The rules
249 are saved and used by every use-block that follows, which may
250 add its own rules and configurations.
251
252 -S --skeleton
253 Ignore user-defined interface code and generate a self-contained
254 "skeleton" program. Additionally, generate input files with
255 strings derived from the regular grammar and compressed match
256 results that are used to verify "skeleton" behavior on all
257 inputs. This option is useful for finding bugs in optimizations
258 and code generation. This option doesn't work for the Go back‐
259 end.
260
261 -s --nested-ifs
262 Use nested if statements instead of switch statements in condi‐
263 tional jumps. This usually results in more efficient code with
264 non-optimizing compilers.
265
266 -T --tags
267 Enable submatch extraction with tags.
268
269 -t HEADER --type-header=HEADER
270 Generate a HEADER file that contains enum with condition names.
271 Requires -c option.
272
273 -u --unicode
274 Generate a lexer that reads UTF32-encoded input. Re2c assumes
275 that character range is 0 -- 0x10FFFF and character size is 4
276 bytes. This option implies -s.
277
278 -V --vernum
279 Show version information in MMmmpp format (major, minor, patch).
280
281 --verbose
282 Output a short message in case of success.
283
284 -v --version
285 Show version information.
286
287 -w --wide-chars
288 Generate a lexer that reads UCS2-encoded input. Re2c assumes
289 that character range is 0 -- 0xFFFF and character size is 2
290 bytes. This option implies -s.
291
292 -x --utf-16
293 Generate a lexer that reads UTF16-encoded input. Re2c assumes
294 that character range is 0 -- 0x10FFFF and character size is 2
295 bytes. This option implies -s.
296
297 Debug options
298 -D --emit-dot
299 Instead of normal output generate lexer graph in .dot format.
300 The output can be converted to an image with the help of
301 Graphviz (e.g. something like dot -Tpng -odfa.png dfa.dot).
302
303 -d --debug-output
304 Emit YYDEBUG in the generated code. YYDEBUG should be defined
305 by the user in the form of a void function with two parameters:
306 state (lexer state or -1) and symbol (current input symbol of
307 type YYCTYPE).
308
309 --dump-adfa
310 Debug option: output DFA after tunneling (in .dot format).
311
312 --dump-cfg
313 Debug option: output control flow graph of tag variables (in
314 .dot format).
315
316 --dump-closure-stats
317 Debug option: output statistics on the number of states in clo‐
318 sure.
319
320 --dump-dfa-det
321 Debug option: output DFA immediately after determinization (in
322 .dot format).
323
324 --dump-dfa-min
325 Debug option: output DFA after minimization (in .dot format).
326
327 --dump-dfa-tagopt
328 Debug option: output DFA after tag optimizations (in .dot for‐
329 mat).
330
331 --dump-dfa-tree
332 Debug option: output DFA under construction with states repre‐
333 sented as tag history trees (in .dot format).
334
335 --dump-dfa-raw
336 Debug option: output DFA under construction with expanded
337 state-sets (in .dot format).
338
339 --dump-interf
340 Debug option: output interference table produced by liveness
341 analysis of tag variables.
342
343 --dump-nfa
344 Debug option: output NFA (in .dot format).
345
346 Internal options
347 --dfa-minimization <moore | table>
348 Internal option: DFA minimization algorithm used by re2c. The
349 moore option is the Moore algorithm (it is the default). The ta‐
350 ble option is the "table filling" algorithm. Both algorithms
351 should produce the same DFA up to states relabeling; table fill‐
352 ing is simpler and much slower and serves as a reference imple‐
353 mentation.
354
355 --eager-skip
356 Internal option: make the generated lexer advance the input
357 position eagerly -- immediately after reading the input symbol.
358 This changes the default behavior when the input position is
359 advanced lazily -- after transition to the next state. This
360 option is implied by --no-lookahead.
361
362 --no-lookahead
363 Internal option: use TDFA(0) instead of TDFA(1). This option
364 has effect only with --tags or --posix-captures options.
365
366 --no-optimize-tags
367 Internal option: suppress optimization of tag variables (useful
368 for debugging).
369
370 --posix-closure <gor1 | gtop>
371 Internal option: specify shortest-path algorithm used for the
372 construction of epsilon-closure with POSIX disambiguation seman‐
373 tics: gor1 (the default) stands for Goldberg-Radzik algorithm,
374 and gtop stands for "global topological order" algorithm.
375
376 --posix-prectable <complex | naive>
377 Internal option: specify the algorithm used to compute POSIX
378 precedence table. The complex algorithm computes precedence ta‐
379 ble in one traversal of tag history tree and has quadratic com‐
380 plexity in the number of TNFA states; it is the default. The
381 naive algorithm has worst-case cubic complexity in the number of
382 TNFA states, but it is much simpler than complex and may be
383 slightly faster in non-pathological cases.
384
385 --stadfa
386 Internal option: use staDFA algorithm for submatch extraction.
387 The main difference with TDFA is that tag operations in staDFA
388 are placed in states, not on transitions.
389
390 Warnings
391 -W Turn on all warnings.
392
393 -Werror
394 Turn warnings into errors. Note that this option alone doesn't
395 turn on any warnings; it only affects those warnings that have
396 been turned on so far or will be turned on later.
397
398 -W<warning>
399 Turn on warning.
400
401 -Wno-<warning>
402 Turn off warning.
403
404 -Werror-<warning>
405 Turn on warning and treat it as an error (this implies -W<warn‐
406 ing>).
407
408 -Wno-error-<warning>
409 Don't treat this particular warning as an error. This doesn't
410 turn off the warning itself.
411
412 -Wcondition-order
413 Warn if the generated program makes implicit assumptions about
414 condition numbering. One should use either the -t, --type-header
415 option or the /*!types:re2c*/ directive to generate a mapping of
416 condition names to numbers and then use the autogenerated condi‐
417 tion names.
418
419 -Wempty-character-class
420 Warn if a regular expression contains an empty character class.
421 Trying to match an empty character class makes no sense: it
422 should always fail. However, for backwards compatibility rea‐
423 sons re2c allows empty character classes and treats them as
424 empty strings. Use the --empty-class option to change the
425 default behavior.
426
427 -Wmatch-empty-string
428 Warn if a rule is nullable (matches an empty string). If the
429 lexer runs in a loop and the empty match is unintentional, the
430 lexer may unexpectedly hang in an infinite loop.
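
The hazard behind this warning can be sketched in Go. The code below is a hand-written model, not re2c output: matchDigits stands in for a nullable rule such as [0-9]*, and tokenize is a driver loop. Both names are hypothetical. Without the guard on a zero-length match the loop would stay at the same offset forever.

```go
package main

import "fmt"

// matchDigits models a nullable rule such as [0-9]*: it returns the length
// of the run of digits at the start of s, which may be zero.
func matchDigits(s string) int {
	n := 0
	for n < len(s) && s[n] >= '0' && s[n] <= '9' {
		n++
	}
	return n
}

// tokenize calls the matcher in a loop. An empty match does not advance
// pos, so it is reported as an error instead of looping forever.
func tokenize(s string) ([]string, error) {
	var tokens []string
	for pos := 0; pos < len(s); {
		n := matchDigits(s[pos:])
		if n == 0 {
			return tokens, fmt.Errorf("empty match at offset %d", pos)
		}
		tokens = append(tokens, s[pos:pos+n])
		pos += n
	}
	return tokens, nil
}

func main() {
	fmt.Println(tokenize("123"))
	fmt.Println(tokenize("12a3"))
}
```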
431
432 -Wswapped-range
433 Warn if the lower bound of a range is greater than its upper
434 bound. The default behavior is to silently swap the range
435 bounds.
436
437 -Wundefined-control-flow
438 Warn if some input strings cause undefined control flow in the
439 lexer (the faulty patterns are reported). This is the most dan‐
440 gerous and most common mistake. It can be easily fixed by adding
441 the default rule * which has the lowest priority, matches any
442 code unit, and consumes exactly one code unit.
443
444 -Wunreachable-rules
445 Warn about rules that are shadowed by other rules and will never
446 match.
447
448 -Wuseless-escape
449 Warn if a symbol is escaped when it shouldn't be. By default,
450 re2c silently ignores such escapes, but this may as well indi‐
451 cate a typo or an error in the escape sequence.
452
453 -Wnondeterministic-tags
454 Warn if a tag has n-th degree of nondeterminism, where n is
455 greater than 1.
456
457 -Wsentinel-in-midrule
458 Warn if the sentinel symbol occurs in the middle of a rule ---
459 this may cause reads past the end of buffer, crashes or memory
460 corruption in the generated lexer. This warning is only applica‐
461 ble if the sentinel method of checking for the end of input is
462 used. It is set to an error if re2c:sentinel configuration is
463 used.
464
466 Re2c has a flexible interface that gives the user both the freedom and
467 the responsibility to define how the generated code interacts with the
468 outer program. There are two major options:
469
470 · Pointer API. It is also called "default API", since it was histori‐
471 cally the first, and for a long time the only one. This is a more
472 restricted API based on C pointer arithmetics. It consists of
473 pointer-like primitives YYCURSOR, YYMARKER, YYCTXMARKER and YYLIMIT,
474 which are normally defined as pointers of type YYCTYPE*. Pointer API
475 is enabled by default for the C backend, and it cannot be used with
476 other backends that do not have pointer arithmetics.
477
478
479
480 · Generic API. This is a less restricted API that does not assume
481 pointer semantics. It consists of primitives YYPEEK, YYSKIP,
482 YYBACKUP, YYBACKUPCTX, YYSTAGP, YYSTAGN, YYMTAGP, YYMTAGN, YYRESTORE,
483 YYRESTORECTX, YYRESTORETAG, YYSHIFT, YYSHIFTSTAG, YYSHIFTMTAG and
484 YYLESSTHAN. For the C backend generic API is enabled with --input
485 custom option or re2c:flags:input = custom; configuration; for the Go
486 backend it is enabled by default. Generic API was added in version
487 0.14. It is intentionally designed to give the user as much freedom
488 as possible in redefining the input model and the semantics of dif‐
489 ferent actions performed by the generated code. As an example, one
490 can override YYPEEK to check for the end of input before reading the
491 input character, or do some logging, etc.
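
As a rough illustration of the freedom the generic API gives, the following Go sketch defines input primitives as methods and makes the YYPEEK counterpart check for the end of input before reading. The type input and its methods are hypothetical. In a re2go program they could be hooked up with configurations such as re2c:define:YYPEEK = "in.peek()"; that wiring is an assumption and is not shown here. The loop in main is hand-written and only exercises the primitives.

```go
package main

import "fmt"

// input is a hypothetical holder for the data and the current position.
type input struct {
	data   string
	cursor int
}

// peek plays the role of YYPEEK. It returns 0 at the end of input instead
// of indexing out of range. Note that this cannot be told apart from a
// real NUL byte in the data.
func (in *input) peek() byte {
	if in.cursor >= len(in.data) {
		return 0
	}
	return in.data[in.cursor]
}

// skip plays the role of YYSKIP.
func (in *input) skip() { in.cursor++ }

// lessThan plays the role of YYLESSTHAN: true if fewer than n code units
// remain.
func (in *input) lessThan(n int) bool { return len(in.data)-in.cursor < n }

func main() {
	in := &input{data: "2024"}
	n := 0
	for c := in.peek(); c >= '0' && c <= '9'; c = in.peek() {
		in.skip()
		n++
	}
	fmt.Println(n, in.lessThan(1)) // 4 true
}
```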
492
493 Generic API has two styles:
494
495 · Function-like. This style is enabled with re2c:api:style = func‐
496 tions; configuration, and it is the default for C backend. In this
497 style API primitives should be defined as functions or macros with
498 parentheses, accepting the necessary arguments. For example, in C the
499 default pointer API can be defined in function-like style generic API
500 as follows:
501
    #define YYPEEK()                 *YYCURSOR
    #define YYSKIP()                 ++YYCURSOR
    #define YYBACKUP()               YYMARKER = YYCURSOR
    #define YYBACKUPCTX()            YYCTXMARKER = YYCURSOR
    #define YYRESTORE()              YYCURSOR = YYMARKER
    #define YYRESTORECTX()           YYCURSOR = YYCTXMARKER
    #define YYRESTORETAG(tag)        YYCURSOR = tag
    #define YYLESSTHAN(len)          YYLIMIT - YYCURSOR < len
    #define YYSTAGP(tag)             tag = YYCURSOR
    #define YYSTAGN(tag)             tag = NULL
    #define YYSHIFT(shift)           YYCURSOR += shift
    #define YYSHIFTSTAG(tag, shift)  tag += shift
514
515
516
517 · Free-form. This style is enabled with re2c:api:style = free-form;
518 configuration, and it is the default for Go backend. In this style
519 API primitives can be defined as free-form pieces of code, and
520 instead of arguments they have interpolated variables of the form
521 @@{name}, or optionally just @@ if there is only one argument. The @@
522 text is called "sigil". It can be redefined to any other text with
523 re2c:api:sigil configuration. For example, the default pointer API
524 can be defined in free-form style generic API as follows:
525
    re2c:define:YYPEEK       = "*YYCURSOR";
    re2c:define:YYSKIP       = "++YYCURSOR";
    re2c:define:YYBACKUP     = "YYMARKER = YYCURSOR";
    re2c:define:YYBACKUPCTX  = "YYCTXMARKER = YYCURSOR";
    re2c:define:YYRESTORE    = "YYCURSOR = YYMARKER";
    re2c:define:YYRESTORECTX = "YYCURSOR = YYCTXMARKER";
    re2c:define:YYRESTORETAG = "YYCURSOR = @@{tag}";
    re2c:define:YYLESSTHAN   = "YYLIMIT - YYCURSOR < @@{len}";
    re2c:define:YYSTAGP      = "@@{tag} = YYCURSOR";
    re2c:define:YYSTAGN      = "@@{tag} = NULL";
    re2c:define:YYSHIFT      = "YYCURSOR += @@{shift}";
    re2c:define:YYSHIFTSTAG  = "@@{tag} += @@{shift}";
538
539 API primitives
540 Here is a list of API primitives that may be used by the generated code
541 in order to interface with the outer program. Which primitives are
542 needed depends on multiple factors, including the complexity of regular
543 expressions, input representation, buffering, the use of various fea‐
544 tures and so on. All the necessary primitives should be defined by the
545 user in the form of macros, functions, variables, free-form pieces of
546 code or any other suitable form. Re2c does not (and cannot) check the
547 definitions, so if anything is missing or defined incorrectly the gen‐
548 erated code will not compile.
549
550 YYCTYPE
551 The type of the input characters (code units). For ASCII,
552 EBCDIC and UTF-8 encodings it should be 1-byte unsigned integer.
553 For UTF-16 or UCS-2 it should be 2-byte unsigned integer. For
554 UTF-32 it should be 4-byte unsigned integer.
555
556 YYCURSOR
557 A pointer-like l-value that stores the current input position
558 (usually a pointer of type YYCTYPE*). Initially YYCURSOR should
559 point to the first input character. It is advanced by the gener‐
560 ated code. When a rule matches, YYCURSOR points to the one
561 after the last matched character. It is used only in the default
562 C API.
563
564 YYLIMIT
565 A pointer-like r-value that stores the end of input position
566 (usually a pointer of type YYCTYPE*). Initially YYLIMIT should
567 point to the one after the last available input character. It is
568 not changed by the generated code. The lexer compares YYCURSOR to
569 YYLIMIT in order to determine if there are enough input characters
570 left. YYLIMIT is used only in the default C API.
571
572 YYMARKER
573 A pointer-like l-value (usually a pointer of type YYCTYPE*) that
574 stores the position of the latest matched rule. It is used to
575 restore the YYCURSOR position if the longer match fails and the lexer
576 needs to roll back. Initialization is not needed. YYMARKER is
577 used only in the default C API.
578
579 YYCTXMARKER
580 A pointer-like l-value that stores the position of the trailing
581 context (usually a pointer of type YYCTYPE*). No initialization
582 is needed. It is used only in the default C API, and only with
583 the lookahead operator /.
584
585 YYFILL API primitive with one argument len. The meaning of YYFILL is
586 to provide at least len more input characters or fail. If EOF
587 rule is used, YYFILL should always return to the calling func‐
588 tion; the return value should be zero on success and non-zero on
589 failure. If EOF rule is not used, YYFILL return value is ignored
590 and it should not return on failure. Maximal value of len is
591 YYMAXFILL, which can be generated with /*!max:re2c*/ directive.
592 The definition of YYFILL can be either function-like or
593 free-form depending on the API style (see re2c:api:style and
594 re2c:define:YYFILL:naked).
595
596 YYMAXFILL
597 An integral constant equal to the maximal value of YYFILL argu‐
598 ment. It can be generated with /*!max:re2c*/ directive.
599
600 YYLESSTHAN
601 A generic API primitive with one argument len. It should be
602 defined as an r-value of boolean type that equals true if and
603 only if there are fewer than len input characters left. The definition
604 can be either function-like or free-form depending on the
605 API style (see re2c:api:style).
606
607 YYPEEK A generic API primitive with no arguments. It should be defined
608 as an r-value of type YYCTYPE that is equal to the character at
609 the current input position. The definition can be either func‐
610 tion-like or free-form depending on the API style (see
611 re2c:api:style).
612
613 YYSKIP A generic API primitive with no arguments. The meaning of
614 YYSKIP is to advance the current input position by one charac‐
615 ter. The definition can be either function-like or free-form
616 depending on the API style (see re2c:api:style).
617
618 YYBACKUP
619 A generic API primitive with no arguments. The meaning of
620 YYBACKUP is to save the current input position, which is later
621 restored with YYRESTORE. The definition should be either func‐
622 tion-like or free-form depending on the API style (see
623 re2c:api:style).
624
625 YYRESTORE
626 A generic API primitive with no arguments. The meaning of YYRE‐
627 STORE is to restore the current input position to the value
628 saved by YYBACKUP. The definition should be either func‐
629 tion-like or free-form depending on the API style (see
630 re2c:api:style).
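
The interplay of YYBACKUP and YYRESTORE can be shown with a hand-written Go sketch for two rules, "ab" and "abcd". This is not generated code, and the input type with its methods is hypothetical. After "ab" has matched, the position is saved; if the attempt to extend the match to "abcd" fails, the saved position is restored and the shorter rule wins.

```go
package main

import "fmt"

// input is a hypothetical input model with a cursor and a backup marker.
type input struct {
	data           string
	cursor, marker int
}

// peek returns the current code unit, or 0 at the end of input.
func (in *input) peek() byte {
	if in.cursor >= len(in.data) {
		return 0
	}
	return in.data[in.cursor]
}

func (in *input) skip()    { in.cursor++ }            // role of YYSKIP
func (in *input) backup()  { in.marker = in.cursor }  // role of YYBACKUP
func (in *input) restore() { in.cursor = in.marker }  // role of YYRESTORE

// lex returns 2 for "abcd", 1 for "ab" and 0 if neither rule matches.
func lex(in *input) int {
	if in.peek() != 'a' {
		return 0
	}
	in.skip()
	if in.peek() != 'b' {
		return 0
	}
	in.skip()
	// Rule "ab" has matched. Save the position before trying "abcd".
	in.backup()
	if in.peek() == 'c' {
		in.skip()
		if in.peek() == 'd' {
			in.skip()
			return 2
		}
	}
	// The longer match failed: roll back to the end of "ab".
	in.restore()
	return 1
}

func main() {
	for _, s := range []string{"abcd", "abcx", "ab", "xy"} {
		in := &input{data: s}
		rule := lex(in)
		fmt.Printf("%q -> rule %d, cursor %d\n", s, rule, in.cursor)
	}
}
```

For "abcx" the cursor ends at 2, not 3: the code unit 'c' was consumed speculatively and then given back.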
631
632 YYBACKUPCTX
633 A generic API primitive with zero arguments. The meaning of
634 YYBACKUPCTX is to save the current input position as the posi‐
635 tion of the trailing context, which is later restored by YYRE‐
636 STORECTX. The definition should be either function-like or
637 free-form depending on the API style (see re2c:api:style).
638
639 YYRESTORECTX
640 A generic API primitive with no arguments. The meaning of YYRE‐
641 STORECTX is to restore the trailing context position saved with
642 YYBACKUPCTX. The definition should be either function-like or
643 free-form depending on the API style (see re2c:api:style).
644
645 YYRESTORETAG
646 A generic API primitive with one argument tag. The meaning of
647 YYRESTORETAG is to restore the trailing context position to the
648 value of tag. The definition should be either function-like or
649 free-form depending on the API style (see re2c:api:style).
650
651 YYSTAGP
652 A generic API primitive with one argument tag. The meaning of
653 YYSTAGP is to set tag value to the current input position. The
654 definition should be either function-like or free-form depending
655 on the API style (see re2c:api:style).
656
657 YYSTAGN
658 A generic API primitive with one argument tag. The meaning of
659 YYSTAGN is to set tag value to null (or some default value). The
660 definition should be either function-like or free-form depending
661 on the API style (see re2c:api:style).
662
663 YYMTAGP
664 A generic API primitive with one argument tag. The meaning of
665 YYMTAGP is to append the current position to the history of tag.
666 The definition should be either function-like or free-form
667 depending on the API style (see re2c:api:style).
668
669 YYMTAGN
670 A generic API primitive with one argument tag. The meaning of
671 YYMTAGN is to append null (or some other default) value to the
672 history of tag. The definition can be either function-like or
673 free-form depending on the API style (see re2c:api:style).
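
The difference between s-tags (one value) and m-tags (a history of values) can be illustrated with a simplified Go model. It is an assumption for illustration only: the scan function is hand-written, the history is a plain slice, and -1 stands for the null value. Real programs are free to choose another representation. The input is a comma-separated list; each field start is recorded, and an empty field records the null value.

```go
package main

import "fmt"

// none is the value used for an unset tag in this sketch.
const none = -1

// scan walks a comma-separated list. It returns an s-tag holding the start
// of the last field and an m-tag holding the start of every field.
func scan(s string) (int, []int) {
	last := none     // s-tag: only the latest value is kept
	var starts []int // m-tag: every value is appended
	pos := 0
	for {
		end := pos
		for end < len(s) && s[end] != ',' {
			end++
		}
		if end > pos {
			last = pos                   // role of YYSTAGP
			starts = append(starts, pos) // role of YYMTAGP
		} else {
			last = none                   // role of YYSTAGN
			starts = append(starts, none) // role of YYMTAGN
		}
		if end == len(s) {
			break
		}
		pos = end + 1
	}
	return last, starts
}

func main() {
	fmt.Println(scan("1,22,333")) // 5 [0 2 5]
	fmt.Println(scan("1,,333"))   // 3 [0 -1 3]
}
```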
674
675 YYSHIFT
676 A generic API primitive with one argument shift. The meaning of
677 YYSHIFT is to shift the current input position by shift charac‐
678 ters (the shift value may be negative). The definition can be
679 either function-like or free-form depending on the API style
680 (see re2c:api:style).
681
682 YYSHIFTSTAG
683 A generic API primitive with two arguments, tag and shift. The
684 meaning of YYSHIFTSTAG is to shift tag by shift characters (the
685 shift value may be negative). The definition can be either
686 function-like or free-form depending on the API style (see
687 re2c:api:style).
688
689 YYSHIFTMTAG
690 A generic API primitive with two arguments, tag and shift. The
691 meaning of YYSHIFTMTAG is to shift the latest value in the his‐
692 tory of tag by shift characters (the shift value may be nega‐
693 tive). The definition should be either function-like or
694 free-form depending on the API style (see re2c:api:style).
695
696 YYMAXNMATCH
697 An integral constant equal to the maximal number of POSIX cap‐
698 turing groups in a rule. It is generated with /*!maxn‐
699 match:re2c*/ directive.
700
701 YYCONDTYPE
702 The type of the condition enum. It should be generated either
703 with /*!types:re2c*/ directive or -t --type-header option.
704
705 YYGETCONDITION
706 An API primitive with zero arguments. It should be defined as
707 an r-value of type YYCONDTYPE that is equal to the current con‐
708 dition identifier. The definition can be either function-like or
709 free-form depending on the API style (see re2c:api:style and
710 re2c:define:YYGETCONDITION:naked).
711
712 YYSETCONDITION
713 An API primitive with one argument cond. The meaning of YYSET‐
714 CONDITION is to set the current condition identifier to cond.
715 The definition should be either function-like or free-form
716 depending on the API style (see re2c:api:style and
717 re2c:define:YYSETCONDITION@cond).
718
719 YYGETSTATE
720 An API primitive with zero arguments. It should be defined as
721 an r-value of integer type that is equal to the current lexer
722 state. Should be initialized to -1. The definition can be either
723 function-like or free-form depending on the API style (see
724 re2c:api:style and re2c:define:YYGETSTATE:naked).
725
726 YYSETSTATE
727 An API primitive with one argument state. The meaning of YYSET‐
728 STATE is to set the current lexer state to state. The defini‐
729 tion should be either function-like or free-form depending on
730 the API style (see re2c:api:style and re2c:define:YYSET‐
731 STATE@state).
732
733 YYDEBUG
734 A debug API primitive with two arguments. It can be used to
735 debug the generated code (with -d --debug-output option). YYDE‐
736 BUG should return no value and accept two arguments: state
737 (either a DFA state index or -1) and symbol (the current input
738 symbol).
739
740 yych An l-value of type YYCTYPE that stores the current input charac‐
741 ter. User definition is necessary only with -f --storable-state
742 option.
743
744 yyaccept
745 An l-value of unsigned integral type that stores the number of
746 the latest matched rule. User definition is necessary only with
747 -f --storable-state option.
748
749 yynmatch
750 An l-value of unsigned integral type that stores the number of
751 POSIX capturing groups in the matched rule. Used only with -P
752 --posix-captures option.
753
754 yypmatch
755 An array of l-values that are used to hold the tag values corre‐
756 sponding to the capturing parentheses in the matching rule.
757 Array length must be at least yynmatch * 2 (usually YYMAXNMATCH
758 * 2 is a good choice). Used only with -P --posix-captures
759 option.
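
A short Go sketch shows how yynmatch and yypmatch might be consumed after a match. It assumes that group i occupies the pair yypmatch[2*i] and yypmatch[2*i+1] (start and end), with group 0 being the whole match, and that offsets are integer indexes into the input. The helper groups and the use of a negative offset for a group that did not take part in the match are hypothetical.

```go
package main

import "fmt"

// groups cuts the input into submatches. pmatch must hold at least
// 2*nmatch offsets; a negative start marks an unmatched group, which is
// returned as an empty string.
func groups(input string, nmatch int, pmatch []int) []string {
	out := make([]string, nmatch)
	for i := 0; i < nmatch; i++ {
		lo, hi := pmatch[2*i], pmatch[2*i+1]
		if lo >= 0 && hi >= lo {
			out[i] = input[lo:hi]
		}
	}
	return out
}

func main() {
	// Offsets as they could look for a rule like ([0-9]+)-([0-9]+)
	// matched against "2024-05".
	yynmatch := 3
	yypmatch := []int{0, 7, 0, 4, 5, 7}
	fmt.Printf("%q\n", groups("2024-05", yynmatch, yypmatch))
}
```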
760
761 Directives
762 Below is the list of all directives provided by re2c (in no particular
763 order). More information on each directive can be found in the related
764 sections.
765
766 /*!re2c ... */
767 A standard re2c block.
768
769 %{ ... %}
770 A standard re2c block in -F --flex-syntax mode.
771
772 /*!rules:re2c ... */
773 A reusable re2c block (requires -r --reusable option).
774
775 /*!use:re2c ... */
776 A block that reuses previous rules-block specified with
777 /*!rules:re2c ... */ (requires -r --reusable option).
778
779 /*!ignore:re2c ... */
780 A block whose contents are ignored and cut off from the output
781 file.
782
783 /*!max:re2c*/
784 This directive is substituted with the macro-definition of
785 YYMAXFILL.
786
787 /*!maxnmatch:re2c*/
788 This directive is substituted with the macro-definition of
789 YYMAXNMATCH (requires -P --posix-captures option).
790
791 /*!getstate:re2c*/
792 This directive is substituted with conditional dispatch on lexer
793 state (requires -f --storable-state option).
794
795 /*!types:re2c ... */
796 This directive is substituted with the definition of condition
797 enum (requires -c --conditions option).
798
799 /*!stags:re2c ... */, /*!mtags:re2c ... */
800 These directives allow one to specify a template piece of code
801 that is expanded for each s-tag/m-tag variable generated by
802 re2c. This block has two optional configurations: format = "@@";
803 (specifies the template where @@ is substituted with the name of
804 each tag variable), and separator = ""; (specifies the piece of
805 code used to join the generated pieces for different tag vari‐
806 ables).
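
The effect of format and separator can be pictured with a small Go sketch. It only mimics the documented behaviour (one copy of the template per tag variable, joined by the separator); the function name expandTags and the tag names yyt1 and yyt2 are assumptions for illustration, and this is not how re2c itself is implemented.

```go
package main

import (
	"fmt"
	"strings"
)

// expandTags instantiates the format template once per tag variable,
// replacing every occurrence of the sigil with the tag name, and joins
// the resulting pieces with the separator.
func expandTags(format, separator, sigil string, tags []string) string {
	pieces := make([]string, len(tags))
	for i, tag := range tags {
		pieces[i] = strings.ReplaceAll(format, sigil, tag)
	}
	return strings.Join(pieces, separator)
}

func main() {
	// Roughly what format = "var @@ int"; separator = "; "; would yield
	// for two tag variables.
	fmt.Println(expandTags("var @@ int", "; ", "@@", []string{"yyt1", "yyt2"}))
}
```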
807
808 /*!include:re2c FILE */
809 This directive allows one to include FILE (in the same sense as
810 #include directive in C/C++).
811
812 /*!header:re2c:on*/
813 This directive marks the start of the header file. Everything after
814 it and up to the following /*!header:re2c:off*/ directive is
815 processed by re2c and written to the header file specified with
816 -t --type-header option.
817
818 /*!header:re2c:off*/
819 This directive marks the end of the header file started with
820 /*!header:re2c:on*/.
821
822 Configurations
823 re2c:flags:t, re2c:flags:type-header
824 Specify the name of the generated header file relative to the
825 directory of the output file. (Same as -t, --type-header com‐
826 mand-line option except that the filepath is relative.)
827
828 re2c:flags:input
829 Same as --input command-line option.
830
831 re2c:api:style
832 Allows one to specify the style of generic API. Possible values
833 are functions and free-form. With functions style (the default
834 for the C backend) API primitives behave like functions, and
835 re2c generates parentheses with an argument list after the name
836 of each primitive. With free-form style (the default for the Go
837 backend) re2c treats API definitions as interpolated strings and
838 substitutes argument placeholders with the actual argument val‐
839 ues. This option can be overridden by options for individual
840 API primitives, e.g. re2c:define:YYFILL:naked for YYFILL.
841
842 re2c:api:sigil
843 Allows one to specify the "sigil" symbol (or string) that is
844 used to recognize argument placeholders in the definitions of
845 generic API primitives. The default value is @@. Placeholders
846 start with sigil, followed by the argument name in curly braces.
847 For example, if sigil is set to $, then placeholders will have
848 the form ${name}. Single-argument APIs may use shorthand nota‐
849 tion without the name in braces. This option can be overridden
850 by options for individual API primitives, e.g.
851 re2c:define:YYFILL@len for YYFILL.
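
To make the placeholder rules concrete, here is a minimal Go sketch of the substitution as described above: named placeholders are the sigil followed by the argument name in braces, and a definition with a single argument may use the bare sigil. The function substitute is hypothetical and only models the description; it is not taken from re2c.

```go
package main

import (
	"fmt"
	"strings"
)

// substitute replaces named placeholders (sigil + "{name}") with the
// corresponding argument values. When there is exactly one argument it
// also accepts the bare sigil as shorthand for that argument.
func substitute(def, sigil string, args map[string]string) string {
	for name, value := range args {
		def = strings.ReplaceAll(def, sigil+"{"+name+"}", value)
	}
	if len(args) == 1 {
		for _, value := range args {
			def = strings.ReplaceAll(def, sigil, value)
		}
	}
	return def
}

func main() {
	// Default sigil, named placeholder.
	fmt.Println(substitute("limit - cursor < @@{len}", "@@", map[string]string{"len": "3"}))
	// Default sigil, shorthand form.
	fmt.Println(substitute("state = @@", "@@", map[string]string{"state": "5"}))
	// Sigil redefined to "$".
	fmt.Println(substitute("fill(${len})", "$", map[string]string{"len": "3"}))
}
```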
852
853 re2c:define:YYCTYPE
854 Defines YYCTYPE (see the user interface section).
855
856 re2c:define:YYCURSOR
857 Defines C API primitive YYCURSOR (see the user interface sec‐
858 tion).
859
860 re2c:define:YYLIMIT
861 Defines C API primitive YYLIMIT (see the user interface sec‐
862 tion).
863
864 re2c:define:YYMARKER
865 Defines C API primitive YYMARKER (see the user interface sec‐
866 tion).
867
868 re2c:define:YYCTXMARKER
869 Defines C API primitive YYCTXMARKER (see the user interface sec‐
870 tion).
871
872 re2c:define:YYFILL
873 Defines API primitive YYFILL (see the user interface section).
874
875 re2c:define:YYFILL@len
876 Specifies the sigil used for argument substitution in YYFILL
877 definition. Defaults to @@. Overrides the more generic
878 re2c:api:sigil configuration.
879
880 re2c:define:YYFILL:naked
881 Allows one to override re2c:api:style for YYFILL. Value 0 cor‐
882 responds to free-form API style.
883
884 re2c:yyfill:enable
885 Defaults to 1 (YYFILL is enabled). Set this to zero to suppress
886 the generation of YYFILL. Use warnings (-W option) and re2c:sen‐
887 tinel configuration to verify that the generated lexer cannot
888 read past the end of input, as this might introduce severe secu‐
889 rity issues to your programs.
890
891 re2c:yyfill:parameter
892 Controls the argument in the parentheses that follow YYFILL.
893 Defaults to 1, which means that the argument is generated. If
894 zero, the argument is omitted. Can be overridden with
895 re2c:define:YYFILL:naked or re2c:api:style.
896
897 re2c:eof
898 Specifies the sentinel symbol used with EOF rule $ to check for
899 the end of input in the generated lexer. The default value is -1
900 (EOF rule is not used). Other possible values include all valid
901 code units. Only decimal numbers are recognized.
902
903 re2c:sentinel
904 Specifies the sentinel symbol used with the sentinel method of
905 checking for the end of input in the generated lexer (the case
906 when bounds checking is disabled with re2c:yyfill:enable = 0;
907 and EOF rule $ is not used). This configuration does not affect
908 code generation. It is used by re2c to verify that the sentinel
909 symbol is not allowed in the middle of the rule, and prevent
910 possible reads past the end of buffer in the generated lexer.
911 The default value is -1 (re2c assumes that the sentinel symbol
912 is 0, which is the most common case). Other possible values
913 include all valid code units. Only decimal numbers are recog‐
914 nized.
915
916 re2c:define:YYLESSTHAN
917 Defines generic API primitive YYLESSTHAN (see the user interface
918 section).
919
920 re2c:yyfill:check
921 Setting this to zero suppresses the generation of the YYFILL check
922 (YYLESSTHAN in generic API or YYLIMIT-based comparison in default C
923 API). This configuration is useful when the necessary input is always
924 available. It defaults to 1 (the check is generated).
926
927 re2c:label:yyFillLabel
928 Allows one to change the prefix of YYFILL labels (used with EOF
929 rule or with storable states).
930
931 re2c:define:YYPEEK
932 Defines generic API primitive YYPEEK (see the user interface
933 section).
934
935 re2c:define:YYSKIP
936 Defines generic API primitive YYSKIP (see the user interface
937 section).
938
939 re2c:define:YYBACKUP
940 Defines generic API primitive YYBACKUP (see the user interface
941 section).
942
943 re2c:define:YYBACKUPCTX
944 Defines generic API primitive YYBACKUPCTX (see the user inter‐
945 face section).
946
947 re2c:define:YYRESTORE
948 Defines generic API primitive YYRESTORE (see the user interface
949 section).
950
951 re2c:define:YYRESTORECTX
952 Defines generic API primitive YYRESTORECTX (see the user inter‐
953 face section).
954
955 re2c:define:YYRESTORETAG
956 Defines generic API primitive YYRESTORETAG (see the user inter‐
957 face section).
958
959 re2c:define:YYSHIFT
960 Defines generic API primitive YYSHIFT (see the user interface
961 section).
962
963 re2c:define:YYSHIFTMTAG
964 Defines generic API primitive YYSHIFTMTAG (see the user inter‐
965 face section).
966
967 re2c:define:YYSHIFTSTAG
968 Defines generic API primitive YYSHIFTSTAG (see the user inter‐
969 face section).
970
971 re2c:define:YYSTAGN
972 Defines generic API primitive YYSTAGN (see the user interface
973 section).
974
975 re2c:define:YYSTAGP
976 Defines generic API primitive YYSTAGP (see the user interface
977 section).
978
979 re2c:define:YYMTAGN
980 Defines generic API primitive YYMTAGN (see the user interface
981 section).
982
983 re2c:define:YYMTAGP
984 Defines generic API primitive YYMTAGP (see the user interface
985 section).
986
987 re2c:flags:T, re2c:flags:tags
988 Same as -T --tags command-line option.
989
990 re2c:flags:P, re2c:flags:posix-captures
991 Same as -P --posix-captures command-line option.
992
993 re2c:tags:expression
994 Allows one to customize the way re2c addresses tag variables.
995 By default re2c generates expressions of the form yyt<N>. This
996 might be inconvenient, for example if tag variables are defined
997 as fields in a struct. Re2c recognizes placeholders of the form
998 @@{tag} or @@ and replaces them with the actual tag name. Sigil
999 @@ can be redefined with re2c:api:sigil configuration. For
1000 example, setting re2c:tags:expression = "p->@@"; results in
1001 expressions of the form p->yyt<N> in the generated code.
1002
1003 re2c:tags:prefix
1004 Allows one to override the prefix of tag variables (defaults to
1005 yyt).
1006
1007 re2c:flags:lookahead
1008 Same as inverted --no-lookahead command-line option.
1009
1010 re2c:flags:optimize-tags
1011 Same as inverted --no-optimize-tags command-line option.
1012
1013 re2c:define:YYCONDTYPE
1014 Defines YYCONDTYPE (see the user interface section).
1015
1016 re2c:define:YYGETCONDITION
1017 Defines API primitive YYGETCONDITION (see the user interface
1018 section).
1019
1020 re2c:define:YYGETCONDITION:naked
1021 Allows one to override re2c:api:style for YYGETCONDITION. Value
1022 0 corresponds to free-form API style.
1023
1024 re2c:define:YYSETCONDITION
1025 Defines API primitive YYSETCONDITION (see the user interface
1026 section).
1027
1028 re2c:define:YYSETCONDITION@cond
1029 Specifies the sigil used for argument substitution in YYSETCON‐
1030 DITION definition. The default value is @@. Overrides the more
1031 generic re2c:api:sigil configuration.
1032
1033 re2c:define:YYSETCONDITION:naked
1034 Allows one to override re2c:api:style for YYSETCONDITION. Value
1035 0 corresponds to free-form API style.
1036
1037 re2c:cond:goto
1038 Allows one to customize the goto statements used with the short‐
1039 cut :=> rules in conditions. The default value is goto @@;.
1040 Placeholders are substituted with condition name (see
1041 re2c:api:sigil and re2c:cond:goto@cond).
1042
1043 re2c:cond:goto@cond
1044 Specifies the sigil used for argument substitution in
1045 re2c:cond:goto definition. The default value is @@. Overrides
1046 the more generic re2c:api:sigil configuration.
1047
1048 re2c:cond:divider
1049 Defines the divider for condition blocks. The default value is
1050 /* *********************************** */. Placeholders are
1051 substituted with condition name (see re2c:api:sigil and
1052 re2c:cond:divider@cond).
1053
1054 re2c:cond:divider@cond
1055 Specifies the sigil used for argument substitution in
1056 re2c:cond:divider definition. The default value is @@. Over‐
1057 rides the more generic re2c:api:sigil configuration.
1058
1059 re2c:condprefix
1060 Specifies the prefix used for condition labels. The default
1061 value is yyc_.
1062
1063 re2c:condenumprefix
1064 Specifies the prefix used for condition identifiers. The
1065 default value is yyc.
1066
1067 re2c:define:YYGETSTATE
1068 Defines API primitive YYGETSTATE (see the user interface sec‐
1069 tion).
1070
1071 re2c:define:YYGETSTATE:naked
1072 Allows one to override re2c:api:style for YYGETSTATE. Value 0
1073 corresponds to free-form API style.
1074
1075 re2c:define:YYSETSTATE
1076 Defines API primitive YYSETSTATE (see the user interface sec‐
1077 tion).
1078
1079 re2c:define:YYSETSTATE@state
1080 Specifies the sigil used for argument substitution in YYSETSTATE
1081 definition. The default value is @@. Overrides the more generic
1082 re2c:api:sigil configuration.
1083
1084 re2c:define:YYSETSTATE:naked
1085 Allows one to override re2c:api:style for YYSETSTATE. Value 0
1086 corresponds to free-form API style.
1087
1088 re2c:state:abort
1089 If set to a positive integer value, changes the form of the
1090 YYGETSTATE switch: instead of using default case to jump to the
1091 beginning of the lexer block, a -1 case is used, and the default
1092 case aborts the program.
1093
1094 re2c:state:nextlabel
1095 With storable states, controls whether the YYGETSTATE block
1096 is followed by a yyNext label (the default value is zero, which
1097 corresponds to no label). Instead of using yyNext it is possible
1098 to use re2c:startlabel to force the generation of a specific
1099 start label. Instead of using labels it is often more conve‐
1100 nient to generate YYGETSTATE code using /*!getstate:re2c*/.
1101
1102 re2c:label:yyNext
1103 Allows one to change the name of the yyNext label.
1104
1105 re2c:startlabel
1106 Controls the generation of start label for the next lexer block.
1107 The default value is zero, which means that the start label is
1108 generated only if it is used. An integer value greater than zero
1109 forces the generation of start label even if it is unused by the
1110 lexer. A string value also forces start label generation and
1111 sets the label name to the specified string. This configuration
1112 applies only to the current block (it is reset to default for
1113 the next block).
1114
1115 re2c:flags:s, re2c:flags:nested-ifs
1116 Same as -s --nested-ifs command-line option.
1117
1118 re2c:flags:b, re2c:flags:bit-vectors
1119 Same as -b --bit-vectors command-line option.
1120
1121 re2c:variable:yybm
1122 Overrides the name of the yybm variable.
1123
1124 re2c:yybm:hex
1125 Defaults to zero (a decimal bitmap table is generated). If set
1126 to nonzero, a hexadecimal table is generated.
1127
1128 re2c:flags:g, re2c:flags:computed-gotos
1129 Same as -g --computed-gotos command-line option.
1130
1131 re2c:cgoto:threshold
1132 With -g --computed-gotos option this value specifies the com‐
1133 plexity threshold that triggers the generation of jump tables
1134 instead of nested if statements and bitmaps. The default value
1135 is 9.
1136
1137 re2c:flags:case-ranges
1138 Same as --case-ranges command-line option.
1139
1140 re2c:flags:e, re2c:flags:ecb
1141 Same as -e --ecb command-line option.
1142
1143 re2c:flags:8, re2c:flags:utf-8
1144 Same as -8 --utf-8 command-line option.
1145
1146 re2c:flags:w, re2c:flags:wide-chars
1147 Same as -w --wide-chars command-line option.
1148
1149 re2c:flags:x, re2c:flags:utf-16
1150 Same as -x --utf-16 command-line option.
1151
1152 re2c:flags:u, re2c:flags:unicode
1153 Same as -u --unicode command-line option.
1154
1155 re2c:flags:encoding-policy
1156 Same as --encoding-policy command-line option.
1157
1158 re2c:flags:empty-class
1159 Same as --empty-class command-line option.
1160
1161 re2c:flags:case-insensitive
1162 Same as --case-insensitive command-line option.
1163
1164 re2c:flags:case-inverted
1165 Same as --case-inverted command-line option.
1166
1167 re2c:flags:i, re2c:flags:no-debug-info
1168 Same as -i --no-debug-info command-line option.
1169
1170 re2c:indent:string
1171 Specifies the string to use for indentation. The default value
1172 is "\t". Indent string should contain only whitespace charac‐
1173 ters. To disable indentation entirely, set this configuration
1174 to empty string "".
1175
1176 re2c:indent:top
1177 Specifies the minimum amount of indentation to use. The default
1178 value is zero. The value should be a non-negative integer num‐
1179 ber.
1180
1181 re2c:labelprefix
1182 Allows one to change the prefix of DFA state labels. The
1183 default value is yy.
1184
1185 re2c:yych:emit
1186 Set this to zero to suppress the generation of yych definition.
1187 Defaults to 1 (the definition is generated).
1188
1189 re2c:variable:yych
1190 Overrides the name of the yych variable.
1191
1192 re2c:yych:conversion
1193 If set to nonzero, re2c automatically generates a cast to YYC‐
1194 TYPE every time yych is read. Defaults to zero (no cast).
1195
1196 re2c:variable:yyaccept
1197 Overrides the name of the yyaccept variable.
1198
1199 re2c:variable:yytarget
1200 Overrides the name of the yytarget variable.
1201
1202 re2c:variable:yystable
1203 Deprecated.
1204
1205 re2c:variable:yyctable
1206 When both -c --conditions and -g --computed-gotos are active,
1207 re2c will use this variable to generate a static jump table for
1208 YYGETCONDITION.
1209
1210 re2c:define:YYDEBUG
1211 Defines YYDEBUG (see the user interface section).
1212
1213 re2c:flags:d, re2c:flags:debug-output
1214 Same as -d --debug-output command-line option.
1215
1216 re2c:flags:dfa-minimization
1217 Same as --dfa-minimization command-line option.
1218
1219 re2c:flags:eager-skip
1220 Same as --eager-skip command-line option.
1221
1223 re2c uses the following syntax for regular expressions:
1224
1225 · "foo" case-sensitive string literal
1226
1227 · 'foo' case-insensitive string literal
1228
1229 · [a-xyz], [^a-xyz] character class (possibly negated)
1230
1231 · . any character except newline
1232
1233 · R \ S difference of character classes R and S
1234
1235 · R* zero or more occurrences of R
1236
1237 · R+ one or more occurrences of R
1238
1239 · R? optional R
1240
1241 · R{n} repetition of R exactly n times
1242
1243 · R{n,} repetition of R at least n times
1244
1245 · R{n,m} repetition of R from n to m times
1246
1247 · (R) just R; parentheses are used to override precedence or for
1248 POSIX-style submatch
1249
1250 · R S concatenation: R followed by S
1251
1252 · R | S alternative: R or S
1253
1254 · R / S lookahead: R followed by S, but S is not consumed
1255
1256 · name the regular expression defined as name (or literal string "name"
1257 in Flex compatibility mode)
1258
1259 · {name} the regular expression defined as name in Flex compatibility
1260 mode
1261
1262 · @stag an s-tag: saves the last input position at which @stag matches
1263 in a variable named stag
1264
1265 · #mtag an m-tag: saves all input positions at which #mtag matches in a
1266 variable named mtag
1267
1268 Character classes and string literals may contain the following escape
1269 sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and hexa‐
1270 decimal escapes \xhh, \uhhhh and \Uhhhhhhhh.
1271
1273 Re2c provides a number of ways to handle the end of input. Which
1274 way to use depends on the complexity of regular expressions, perfor‐
1275 mance considerations, the need for input buffering and various other
1276 factors. EOF handling is probably the most complex part of re2c user
1277 interface --- it definitely requires a bit of understanding of how the
1278 generated lexer works. But in return it allows the user to customize
1279 the lexer for a particular environment and avoid the unnecessary overhead
1280 of generic methods when a simpler method is sufficient. Roughly speak‐
1281 ing, there are four main methods:
1282
1283 · using sentinel symbol (simple and efficient, but limited)
1284
1285 · bounds checking with padding (generic, but complex)
1286
1287 · EOF rule: a combination of sentinel symbol and bounds checking
1288 (generic and simple, can be more or less efficient than bounds check‐
1289 ing with padding depending on the grammar)
1290
1291 · using generic API (user-defined, so may be incorrect ;])
1292
1293 Using sentinel symbol
1294 This is the simplest and the most efficient method. It is applicable in
1295 cases when the input is small enough to fit into a contiguous memory
1296 buffer and there is a natural "sentinel" symbol --- a code unit that is
1297 not allowed by any of the regular expressions in grammar (except possi‐
1298 bly as a terminating character). Sentinel symbol never appears in
1299 well-formed input, therefore it can be appended at the end of input and
1300 used as a stop signal by the lexer. A good example of such input is a
1301 null-terminated C-string, provided that the grammar does not allow NULL
1302 in the middle of lexemes. Sentinel method is very efficient, because
1303 the lexer does not need to perform any additional checks for the end of
1304 input --- it comes naturally as a part of processing the next charac‐
1305 ter. It is very important that the sentinel symbol is not allowed in
1306 the middle of the rule --- otherwise on some inputs the lexer may read
1307 past the end of buffer and crash or cause memory corruption. Re2c veri‐
1308 fies this automatically. Use re2c:sentinel configuration to specify
1309 which sentinel symbol is used.
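
The idea can be seen in a hand-written Go sketch that does by hand what a sentinel-based lexer does conceptually. It is not re2c output; the function name countWords and the toy grammar (lowercase words separated by spaces) are assumptions for illustration. The point is that no loop compares the cursor with the input length: the terminating zero byte fails every character test and stops the scan on its own.

```go
package main

import "fmt"

// countWords scans a null-terminated string and counts lowercase words.
// The caller must guarantee that str ends with a zero byte; otherwise the
// scan would run past the end of the string.
func countWords(str string) int {
	cursor, count := 0, 0
	for {
		switch c := str[cursor]; {
		case c == 0:
			return count // sentinel reached: end of input
		case c == ' ':
			for str[cursor] == ' ' {
				cursor++
			}
		case 'a' <= c && c <= 'z':
			for 'a' <= str[cursor] && str[cursor] <= 'z' {
				cursor++
			}
			count++
		default:
			return -1 // any other byte is an error
		}
	}
}

func main() {
	fmt.Println(countWords("one two three\x00"))
	fmt.Println(countWords("f0ur\x00"))
}
```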
1310
1311 Below is an example of using sentinel method. Configuration
1312 re2c:yyfill:enable = 0; suppresses generation of end-of-input checks
1313 and YYFILL calls.
1314
    //go:generate re2go $INPUT -o $OUTPUT
    package main

    import "testing"

    // expect a null-terminated string
    func lex(str string) int {
        var cursor int
        count := 0
    loop:
        /*!re2c
        re2c:yyfill:enable = 0;
        re2c:define:YYCTYPE = byte;
        re2c:define:YYPEEK = "str[cursor]";
        re2c:define:YYSKIP = "cursor += 1";

        *      { return -1 }
        [\x00] { return count }
        [a-z]+ { count += 1; goto loop }
        [ ]+   { goto loop }
        */
    }

    func TestLex(t *testing.T) {
        var tests = []struct {
            res int
            str string
        }{
            {0, "\000"},
            {3, "one two three\000"},
            {-1, "f0ur\000"},
        }

        for _, x := range tests {
            t.Run(x.str, func(t *testing.T) {
                res := lex(x.str)
                if res != x.res {
                    t.Errorf("got %d, want %d", res, x.res)
                }
            })
        }
    }
1357
1358
1359 Bounds checking with padding
1360 Bounds checking is a generic method: it can be used with any input
1361 grammar. The basic idea is simple: we need to check for the end of
1362 input before reading the next input character. However, if implemented
1363 in a straightforward way, this would be quite inefficient: checking on
1364 each input character would cause a major slowdown. Re2c avoids slowdown
1365 by generating checks only in certain key states of the lexer, and let‐
1366 ting it run without checks in-between the key states. More precisely,
1367 re2c computes strongly connected components (SCCs) of the underlying
1368 DFA (which roughly correspond to loops), and generates only a few
1369 checks per SCC (usually just one, but in general enough to make
1370 the SCC acyclic). The check is of the form (YYLIMIT - YYCURSOR) < n,
1371 where n is the maximal length of a simple path in the corresponding
1372 SCC. If this condition is true, the lexer calls YYFILL(n), which must
1373 either supply at least n input characters, or not return. When the
1374 lexer continues after the check, it is certain that the next n charac‐
1375 ters can be read safely without checks.
1376
1377 This approach reduces the number of checks significantly (and makes the
1378 lexer much faster as a result), but it has a downside. Since the lexer
1379 checks for multiple characters at once, it may end up in a situation
1380 when there are a few remaining input characters (fewer than n)
1381 corresponding to a short path in the SCC, but the lexer cannot proceed
1382 because of the check, and YYFILL cannot supply more characters because
1383 it is the end of input. To solve this problem, re2c requires that
1384 additional padding consisting of fake characters is appended at the end of
1385 input. The length of padding should be YYMAXFILL, which equals the
1386 maximum n parameter to YYFILL and must be generated by re2c using
1387 /*!max:re2c*/ directive. The fake characters should not form a valid
1388 lexeme suffix, otherwise the lexer may be fooled into matching a fake
1389 lexeme. Usually it's a good idea to use NULL characters for padding.
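
The interaction between the multi-character check and the padding can be tried out in a hand-written Go sketch. This is not generated code: the toy grammar (two-digit numbers separated by spaces), the constant maxFill standing in for YYMAXFILL, and the helper names are all assumptions for illustration. The lexer checks limit - cursor < maxFill once per loop iteration and then reads up to two bytes without further checks, so unpadded input is rejected at the very end even when it is well-formed.

```go
package main

import (
	"fmt"
	"strings"
)

// maxFill stands in for YYMAXFILL: the longest run of bytes that the
// lexer reads after a single check (two digits in this toy grammar).
const maxFill = 2

func isDigit(c byte) bool { return '0' <= c && c <= '9' }

// lex counts two-digit numbers separated by spaces. Each loop iteration
// performs one bounds check and then reads up to maxFill bytes unchecked.
func lex(str string) int {
	cursor, limit, count := 0, len(str), 0
	for {
		if limit-cursor < maxFill {
			return -1 // YYFILL would be called here; this sketch cannot refill
		}
		switch c := str[cursor]; {
		case c == 0:
			return count // first padding byte: clean end of input
		case c == ' ':
			cursor++
		case isDigit(c) && isDigit(str[cursor+1]):
			cursor += 2
			count++
		default:
			return -1
		}
	}
}

// pad appends maxFill zero bytes, as the padding requirement describes.
func pad(str string) string { return str + strings.Repeat("\x00", maxFill) }

func main() {
	fmt.Println(lex(pad("12 34"))) // well-formed and padded
	fmt.Println(lex("12 34"))      // well-formed but unpadded: final check fails
	fmt.Println(lex(pad("12 3")))  // ill-formed: the lone digit runs into padding
}
```

In the third call the lexer reads one padding byte while looking for the second digit, which is exactly the read that padding makes safe.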
1390
1391 Below is an example of using bounds checking with padding. Note that
1392 the grammar rule for single-quoted strings allows arbitrary symbols in
1393 the middle of lexeme, so there is no natural sentinel in the grammar.
1394 Strings like "aha\0ha" are perfectly valid, but ill-formed strings like
1395 "aha\0 are also possible and shouldn’t crash the lexer. In this example
1396 we do not use buffer refilling, therefore YYFILL definition simply
1397 returns an error. Note that YYFILL will only be called after the lexer
1398 reaches padding, because only then will the check condition be satis‐
1399 fied.
1400
    //go:generate re2go $INPUT -o $OUTPUT
    package main

    import (
        "strings"
        "testing"
    )

    /*!max:re2c*/

    // Expects YYMAXFILL-padded string.
    func lex(str string) int {
        var cursor int
        limit := len(str)
        count := 0
    loop:
        /*!re2c
        re2c:define:YYCTYPE = byte;
        re2c:define:YYPEEK = "str[cursor]";
        re2c:define:YYSKIP = "cursor += 1";
        re2c:define:YYLESSTHAN = "limit - cursor < @@{len}";
        re2c:define:YYFILL = "return -1";

        *                           { return -1 }
        [\x00]                      { return count }
        ['] ([^'\\] | [\\][^])* ['] { count += 1; goto loop }
        [ ]+                        { goto loop }
        */
    }

    // Pad string with YYMAXFILL zeroes at the end.
    func pad(str string) string {
        return str + strings.Repeat("\000", YYMAXFILL)
    }

    func TestLex(t *testing.T) {
        var tests = []struct {
            res int
            str string
        }{
            {0, ""},
            {3, "'qu\000tes' 'are' 'fine: \\'' "},
            {-1, "'unterminated\\'"},
        }

        for _, x := range tests {
            t.Run(x.str, func(t *testing.T) {
                res := lex(pad(x.str))
                if res != x.res {
                    t.Errorf("got %d, want %d", res, x.res)
                }
            })
        }
    }
1455
1456
1457 EOF rule
1458 EOF rule $ was introduced in version 1.2. It is a hybrid approach that
1459 tries to take the best of both worlds: simplicity and efficiency of the
1460 sentinel method combined with the generality of bounds-checking method.
1461 The idea is to appoint an arbitrary symbol to be the sentinel, and only
1462 perform further bounds checking if the sentinel symbol matches (more
1463 precisely, if the symbol class that contains it matches). The check is
1464 of the form YYLIMIT <= YYCURSOR. If this condition is not satisfied,
1465 then the sentinel is just an ordinary input character and the lexer
1466 continues. Otherwise this is a real sentinel, and the lexer calls
1467 YYFILL(). If YYFILL returns zero, the lexer assumes that it has more
1468 input and tries to re-match. Otherwise YYFILL returns non-zero and the
1469 lexer knows that it has reached the end of input. At this point there
1470 are three possibilities. First, it might have already matched a shorter
1471 lexeme --- in this case it just rolls back to the last accepting state.
1472 Second, it might have consumed some characters, but failed to match ---
1473 in this case it falls back to default rule *. Finally, it might be in
1474 the initial state --- in this (and only this!) case it matches EOF rule
1475 $.
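
The decision made on each sentinel byte can be sketched by hand in Go. The code below is not re2c output and leaves out YYFILL and the rollback to an earlier accepting state; it only shows how the bounds check is attached to the sentinel, so that a zero byte before the limit is ordinary data. The names sentinel, atEOF and lex are assumptions for illustration, and the input must be non-empty and end with the sentinel byte.

```go
package main

import "fmt"

const sentinel = 0 // plays the role of re2c:eof = 0

// atEOF is the extra check made only after a sentinel byte has been seen:
// the byte is the real end of input only if cursor has reached limit.
func atEOF(c byte, cursor, limit int) bool {
	return c == sentinel && limit <= cursor
}

// lex counts single-quoted strings. Sentinel bytes before limit are
// treated as ordinary characters.
func lex(str string) int {
	cursor, limit, count := 0, len(str)-1, 0
	for {
		c := str[cursor]
		switch {
		case atEOF(c, cursor, limit):
			return count // end of input in the initial state (EOF rule)
		case c == ' ':
			cursor++
		case c == '\'':
			cursor++
			for closed := false; !closed; {
				c = str[cursor]
				if atEOF(c, cursor, limit) {
					return -1 // end of input inside a lexeme (default rule)
				}
				cursor++
				switch c {
				case '\'':
					closed = true
				case '\\':
					if atEOF(str[cursor], cursor, limit) {
						return -1
					}
					cursor++ // skip the escaped character
				}
			}
			count++
		default:
			return -1
		}
	}
}

func main() {
	fmt.Println(lex("'qu\x00tes' 'are' 'fine: \\'' \x00"))
	fmt.Println(lex("'unterminated\\'\x00"))
}
```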
1476
1477 Below is an example of using EOF rule. Configuration re2c:yyfill:enable
1478 = 0; suppresses generation of YYFILL calls (but not the bounds checks).
1479
    //go:generate re2go $INPUT -o $OUTPUT
    package main

    import "testing"

    // Expects a null-terminated string.
    func lex(str string) int {
        var cursor, marker int
        limit := len(str) - 1 // limit points at the terminating null
        count := 0
    loop:
        /*!re2c
        re2c:yyfill:enable = 0;
        re2c:eof = 0;
        re2c:define:YYCTYPE = byte;
        re2c:define:YYPEEK = "str[cursor]";
        re2c:define:YYSKIP = "cursor += 1";
        re2c:define:YYBACKUP = "marker = cursor";
        re2c:define:YYRESTORE = "cursor = marker";
        re2c:define:YYLESSTHAN = "limit <= cursor";

        *                           { return -1 }
        $                           { return count }
        ['] ([^'\\] | [\\][^])* ['] { count += 1; goto loop }
        [ ]+                        { goto loop }
        */
    }

    func TestLex(t *testing.T) {
        var tests = []struct {
            res int
            str string
        }{
            {0, "\000"},
            {3, "'qu\000tes' 'are' 'fine: \\'' \000"},
            {-1, "'unterminated\\'\000"},
        }

        for _, x := range tests {
            t.Run(x.str, func(t *testing.T) {
                res := lex(x.str)
                if res != x.res {
                    t.Errorf("got %d, want %d", res, x.res)
                }
            })
        }
    }
1527
1528
1529 Using generic API
1530 Generic API can be used with any of the above methods. It also allows
1531 one to use a user-defined method by placing EOF checks in one of the
1532 basic primitives. Usually this is either YYSKIP (the check is per‐
1533 formed when advancing to the next input character), or YYPEEK (the
1534 check is performed when reading the next input character). The result‐
1535 ing methods are inefficient, as they check on each input character.
1536 However, they can be useful in cases when the input cannot be buffered
1537 or padded and does not contain a sentinel character at the end. One
1538 should be cautious when using such ad-hoc methods, as it is easy to
1539 overlook some corner cases and come up with a method that only par‐
1540 tially works. Also it should be noted that not everything can be
1541 expressed via generic API: for example, it is impossible to reimplement
1542 the way EOF rule works (in particular, it is impossible to re-match the
1543 character after successful YYFILL).
1544
1545 Below is an example of using YYPEEK to perform bounds checking without
1546 padding: peek returns a fake terminating null once the cursor reaches the
1547 limit. YYFILL generation is suppressed using re2c:yyfill:enable = 0;
1548 configuration. Note that with a more complex grammar this method might
1549 not work when two rules overlap and the EOF check fails after a shorter
1550 lexeme has already been matched (this example has no overlapping rules).
1551
    //go:generate re2go $INPUT -o $OUTPUT
    package main

    import "testing"

    // Returns "fake" terminating null if cursor has reached limit.
    func peek(str string, cursor int, limit int) byte {
        if cursor >= limit {
            return 0 // fake null
        } else {
            return str[cursor]
        }
    }

    // Expects a string without terminating null.
    func lex(str string) int {
        var cursor, marker int
        limit := len(str)
        count := 0
    loop:
        /*!re2c
        re2c:yyfill:enable = 0;
        re2c:eof = 0;
        re2c:define:YYCTYPE = byte;
        re2c:define:YYLESSTHAN = "cursor >= limit";
        re2c:define:YYPEEK = "peek(str, cursor, limit)";
        re2c:define:YYSKIP = "cursor += 1";
        re2c:define:YYBACKUP = "marker = cursor";
        re2c:define:YYRESTORE = "cursor = marker";

        *                           { return -1 }
        $                           { return count }
        ['] ([^'\\] | [\\][^])* ['] { count += 1; goto loop }
        [ ]+                        { goto loop }
        */
    }

    func TestLex(t *testing.T) {
        var tests = []struct {
            res int
            str string
        }{
            {0, ""},
            {3, "'qu\000tes' 'are' 'fine: \\'' "},
            {-1, "'unterminated\\'"},
        }

        for _, x := range tests {
            t.Run(x.str, func(t *testing.T) {
                res := lex(x.str)
                if res != x.res {
                    t.Errorf("got %d, want %d", res, x.res)
                }
            })
        }
    }
1608
1609
1611 The need for buffering arises when the input cannot be mapped in memory
1612 all at once: either it is too large, or it comes in a streaming fashion
1613 (like reading from a socket). The usual technique in such cases is to
1614 allocate a fixed-sized memory buffer and process input in chunks that
1615 fit into the buffer. When the current chunk is processed, it is moved
1616 out and new data is moved in. In practice it is somewhat more complex,
1617 because lexer state consists not of a single input position, but a set
1618 of interrelated positions:
1619
1620 · cursor: the next input character to be read (YYCURSOR in default API
1621 or YYSKIP/YYPEEK in generic API)
1622
1623 · limit: the position after the last available input character (YYLIMIT
1624 in default API, implicitly handled by YYLESSTHAN in generic API)
1625
1626 · marker: the position of the most recent match, if any (YYMARKER in
1627 default API or YYBACKUP/YYRESTORE in generic API)
1628
1629 · token: the start of the current lexeme (implicit in re2c API, as it
1630 is not needed for the normal lexer operation and can be defined and
1631 updated by the user)
1632
1633 · context marker: the position of the trailing context (YYCTXMARKER in
1634 default API or YYBACKUPCTX/YYRESTORECTX in generic API)
1635
1636 · tag variables: submatch positions (defined with /*!stags:re2c*/ and
1637 /*!mtags:re2c*/ directives and YYSTAGP/YYSTAGN/YYMTAGP/YYMTAGN in
1638 generic API)
1639
1640 Not all these are used in every case, but if used, they must be updated
1641 by YYFILL. All active positions are contained in the segment between
1642 token and cursor, therefore everything between buffer start and token
1643 can be discarded, the segment from token and up to limit should be
1644 moved to the beginning of buffer, and the free space at the end of buf‐
1645 fer should be filled with new data. In order to avoid frequent YYFILL
1646 calls it is best to fill in as many input characters as possible (even
1647 though fewer characters might suffice to resume the lexer). The details
1648 of YYFILL implementation are slightly different depending on which EOF
1649 handling method is used: the case of EOF rule is somewhat simpler than
the case of bounds-checking with padding. Also note that if the -f
--storable-state option is used, YYFILL has slightly different semantics
(described in the section about storable state).
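
The shift step described above can be sketched as follows. This is a
hand-written illustration, not code generated by re2c: the positions
struct and the shift helper are hypothetical names, and positions are
kept as integer offsets into the buffer.

```go
package main

import "fmt"

// positions holds the lexer offsets that must stay consistent across a
// refill. (Hypothetical type for illustration; re2c does not define it.)
type positions struct {
	token, marker, cursor, limit int
}

// shift discards everything before token, moves buf[token:limit] to the
// start of the buffer and adjusts all offsets by the same amount.
// It returns the number of bytes freed at the end of the buffer.
func shift(buf []byte, p *positions) int {
	free := p.token
	copy(buf, buf[p.token:p.limit])
	p.marker -= free
	p.cursor -= free
	p.limit -= free
	p.token = 0
	return free
}

func main() {
	buf := []byte("xxxx'abc")
	p := positions{token: 4, marker: 5, cursor: 8, limit: 8}
	free := shift(buf, &p)
	fmt.Println(free, p, string(buf[:p.limit]))
}
```

After the shift, the freed space at buf[p.limit:] is where new input
would be read, as the fill functions in the examples below do.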
1653
1654 YYFILL with EOF rule
1655 If EOF rule is used, YYFILL is a function-like primitive that accepts
1656 no arguments and returns a value which is checked against zero. YYFILL
1657 invocation is triggered by condition YYLIMIT <= YYCURSOR in default API
1658 and YYLESSTHAN() in generic API. A non-zero return value means that
1659 YYFILL has failed. A successful YYFILL call must supply at least one
1660 character and adjust input positions accordingly. Limit must always be
1661 set to one after the last input position in buffer, and the character
1662 at the limit position must be the sentinel symbol specified by re2c:eof
1663 configuration. The pictures below show the relative locations of input
1664 positions in buffer before and after YYFILL call (sentinel symbol is
1665 marked with #, and the second picture shows the case when there is not
1666 enough input to fill the whole buffer).
1667
  <-- shift -->
>-A------------B---------C-------------D#-----------E->
buffer       token    marker        limit,
                                    cursor
>-A------------B---------C-------------D------------E#->
            buffer,   marker        cursor        limit
             token

  <-- shift -->
>-A------------B---------C-------------D#--E (EOF)
buffer       token    marker        limit,
                                    cursor
>-A------------B---------C-------------D---E#........
            buffer,   marker       cursor limit
             token
1683
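The trigger condition for YYFILL with EOF rule can be sketched as
follows. This is a hand-written illustration rather than generated code;
the needFill helper is a hypothetical name, and the sketch assumes that
the limit check is made only when the character under the cursor is the
sentinel symbol.

```go
package main

import "fmt"

const sentinel byte = 0 // plays the role of re2c:eof = 0

// needFill reports whether the lexer has run out of buffered input: the
// character under the cursor is the sentinel AND the cursor has reached
// the limit. A sentinel byte before the limit is ordinary input.
func needFill(data []byte, cursor, limit int) bool {
	return data[cursor] == sentinel && limit <= cursor
}

func main() {
	// Input "a\x00b" followed by the sentinel stored at the limit position.
	data := []byte{'a', 0, 'b', sentinel}
	limit := 3
	for cursor := 0; cursor <= limit; cursor++ {
		fmt.Println(cursor, needFill(data, cursor, limit))
	}
}
```

This is why the sentinel symbol may also occur in the middle of the
input: only the occurrence at the limit position asks for more data.
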
Here is an example of a program that reads input file input.txt in
chunks of at most 16 bytes (the buffer is intentionally small in order
to trigger refills) and uses EOF rule.
1686
1687 //go:generate re2go $INPUT -o $OUTPUT
1688 package main
1689
1690 import (
1691 "os"
1692 "testing"
1693 )
1694
1695 // Intentionally small to trigger buffer refill.
1696 const SIZE int = 16
1697
1698 type Input struct {
1699 file *os.File
1700 data []byte
1701 cursor int
1702 marker int
1703 token int
1704 limit int
1705 eof bool
1706 }
1707
1708 func fill(in *Input) int {
1709 // If nothing can be read, fail.
1710 if in.eof {
1711 return 1
1712 }
1713
1714 // Check if at least some space can be freed.
1715 if in.token == 0 {
1716 // In real life can reallocate a larger buffer.
1717 panic("fill error: lexeme too long")
1718 }
1719
1720 // Discard everything up to the start of the current lexeme,
1721 // shift buffer contents and adjust offsets.
1722 copy(in.data[0:], in.data[in.token:in.limit])
1723 in.cursor -= in.token
1724 in.marker -= in.token
1725 in.limit -= in.token
1726 in.token = 0
1727
1728 // Read new data (as much as possible to fill the buffer).
1729 n, _ := in.file.Read(in.data[in.limit:SIZE])
1730 in.limit += n
1731 in.data[in.limit] = 0
1732
1733 // If read less than expected, this is the end of input.
1734 in.eof = in.limit < SIZE
1735
1736 // If nothing has been read, fail.
1737 if n == 0 {
1738 return 1
1739 }
1740
1741 return 0
1742 }
1743
1744 func lex(in *Input) int {
1745 count := 0
1746 loop:
1747 in.token = in.cursor
1748 /*!re2c
1749 re2c:eof = 0;
1750 re2c:define:YYCTYPE = byte;
1751 re2c:define:YYPEEK = "in.data[in.cursor]";
1752 re2c:define:YYSKIP = "in.cursor += 1";
1753 re2c:define:YYBACKUP = "in.marker = in.cursor";
1754 re2c:define:YYRESTORE = "in.cursor = in.marker";
1755 re2c:define:YYLESSTHAN = "in.limit <= in.cursor";
1756 re2c:define:YYFILL = "fill(in) == 0";
1757
1758 * { return -1 }
1759 $ { return count }
1760 ['] ([^'\\] | [\\][^])* ['] { count += 1; goto loop }
1761 [ ]+ { goto loop }
1762 */
1763 }
1764
1765 // Prepare a file with the input text and run the lexer.
1766 func test(data string) (result int) {
1767 tmpfile := "input.txt"
1768
1769 f, _ := os.Create(tmpfile)
1770 f.WriteString(data)
1771 f.Seek(0, 0)
1772
1773 defer func() {
1774 if r := recover(); r != nil {
1775 result = -2
1776 }
1777 f.Close()
1778 os.Remove(tmpfile)
1779 }()
1780
1781 in := &Input{
1782 file: f,
1783 data: make([]byte, SIZE+1),
1784 cursor: SIZE,
1785 marker: SIZE,
1786 token: SIZE,
1787 limit: SIZE,
1788 eof: false,
1789 }
1790
1791 return lex(in)
1792 }
1793
1794 func TestLex(t *testing.T) {
1795 var tests = []struct {
1796 res int
1797 str string
1798 }{
1799 {0, ""},
1800 {2, "'one' 'two'"},
1801 {3, "'qu\000tes' 'are' 'fine: \\'' "},
1802 {-1, "'unterminated\\'"},
1803 {-2, "'loooooooooooong'"},
1804 }
1805
1806 for _, x := range tests {
1807 t.Run(x.str, func(t *testing.T) {
1808 res := test(x.str)
1809 if res != x.res {
1810 t.Errorf("got %d, want %d", res, x.res)
1811 }
1812 })
1813 }
1814 }
1815
1816
1817 YYFILL with padding
1818 In the default case (when EOF rule is not used) YYFILL is a func‐
1819 tion-like primitive that accepts a single argument and does not return
1820 any value. YYFILL invocation is triggered by condition (YYLIMIT -
1821 YYCURSOR) < n in default API and YYLESSTHAN(n) in generic API. The
1822 argument passed to YYFILL is the minimal number of characters that must
1823 be supplied. If it fails to do so, YYFILL must not return to the lexer
1824 (for that reason it is best implemented as a macro that returns from
1825 the calling function on failure). In case of a successful YYFILL invo‐
1826 cation the limit position must be set either to one after the last
1827 input position in buffer, or to the end of YYMAXFILL padding (in case
1828 YYFILL has successfully read at least n characters, but not enough to
1829 fill the entire buffer). The pictures below show the relative locations
1830 of input positions in buffer before and after YYFILL invocation (YYMAX‐
1831 FILL padding on the second picture is marked with # symbols).
1832
  <-- shift -->                <-- need -->
>-A------------B---------C-----D-------E---F--------G->
buffer       token    marker cursor  limit

>-A------------B---------C-----D-------E---F--------G->
            buffer,   marker cursor               limit
             token

  <-- shift -->                <-- need -->
>-A------------B---------C-----D-------E-F (EOF)
buffer       token    marker cursor  limit

>-A------------B---------C-----D-------E-F###############
            buffer,   marker cursor                 limit
             token                        <- YYMAXFILL ->
1848
Here is an example of a program that reads input file input.txt in
chunks of at most 16 bytes (the buffer is intentionally small in order
to trigger refills) and uses bounds-checking with padding.
1851
1852 //go:generate re2go $INPUT -o $OUTPUT
1853 package main
1854
1855 import (
1856 "fmt"
1857 "os"
1858 "testing"
1859 )
1860
1861 /*!max:re2c*/
1862
1863 // Intentionally small to trigger buffer refill.
1864 const SIZE int = 16
1865
1866 type Input struct {
1867 file *os.File
1868 data []byte
1869 cursor int
1870 marker int
1871 token int
1872 limit int
1873 eof bool
1874 }
1875
1876 func fill(in *Input, need int) int {
1877 // End of input has already been reached, nothing to do.
1878 if in.eof {
1879 return -1 // Error: unexpected EOF
1880 }
1881
1882 // Check if after moving the current lexeme to the beginning
1883 // of buffer there will be enough free space.
1884 if SIZE-(in.cursor-in.token) < need {
1885 return -2 // Error: lexeme too long
1886 }
1887
1888 // Discard everything up to the start of the current lexeme,
1889 // shift buffer contents and adjust offsets.
1890 copy(in.data[0:], in.data[in.token:in.limit])
1891 in.cursor -= in.token
1892 in.marker -= in.token
1893 in.limit -= in.token
1894 in.token = 0
1895
1896 // Read new data (as much as possible to fill the buffer).
1897 n, _ := in.file.Read(in.data[in.limit:SIZE])
1898 in.limit += n
1899
1900 // If read less than expected, this is the end of input.
1901 in.eof = in.limit < SIZE
1902
1903 // If end of input, add padding so that the lexer can read
1904 // the remaining characters at the end of buffer.
1905 if in.eof {
1906 for i := 0; i < YYMAXFILL; i += 1 {
1907 in.data[in.limit+i] = 0
1908 }
1909 in.limit += YYMAXFILL
1910 }
1911
1912 return 0
1913 }
1914
1915 func lex(in *Input) int {
1916 count := 0
1917 loop:
1918 in.token = in.cursor
1919 /*!re2c
1920 re2c:define:YYCTYPE = byte;
1921 re2c:define:YYPEEK = "in.data[in.cursor]";
1922 re2c:define:YYSKIP = "in.cursor += 1";
1923 re2c:define:YYBACKUP = "in.marker = in.cursor";
1924 re2c:define:YYRESTORE = "in.cursor = in.marker";
1925 re2c:define:YYLESSTHAN = "in.limit-in.cursor < @@{len}";
1926 re2c:define:YYFILL = "if r := fill(in, @@{len}); r != 0 { return r }";
1927
1928 * { return -1 }
1929 [\x00] { return count }
1930 ['] ([^'\\] | [\\][^])* ['] { count += 1; goto loop }
1931 [ ]+ { goto loop }
1932 */
1933 }
1934
1935 // Prepare a file with the input text and run the lexer.
1936 func test(data string) (result int) {
1937 tmpfile := "input.txt"
1938
1939 f, _ := os.Create(tmpfile)
1940 f.WriteString(data)
1941 f.Seek(0, 0)
1942
1943 defer func() {
1944 if r := recover(); r != nil {
1945 fmt.Println(r)
1946 result = -2
1947 }
1948 f.Close()
1949 os.Remove(tmpfile)
1950 }()
1951
1952 in := &Input{
1953 file: f,
1954 data: make([]byte, SIZE+YYMAXFILL),
1955 cursor: SIZE,
1956 marker: SIZE,
1957 token: SIZE,
1958 limit: SIZE,
1959 eof: false,
1960 }
1961
1962 return lex(in)
1963 }
1964
1965 func TestLex(t *testing.T) {
1966 var tests = []struct {
1967 res int
1968 str string
1969 }{
1970 {0, ""},
1971 {2, "'one' 'two'"},
1972 {3, "'qu\000tes' 'are' 'fine: \\'' "},
1973 {-1, "'unterminated\\'"},
1974 {-2, "'loooooooooooong'"},
1975 }
1976
1977 for _, x := range tests {
1978 t.Run(x.str, func(t *testing.T) {
1979 res := test(x.str)
1980 if res != x.res {
1981 t.Errorf("got %d, want %d", res, x.res)
1982 }
1983 })
1984 }
1985 }
1986
1987
1989 Re2c allows one to include other files using directive /*!include:re2c
1990 FILE */, where FILE is the name of file to be included. Re2c looks for
1991 included files in the directory of the including file and in include
1992 locations, which can be specified with -I option. Re2c include direc‐
1993 tive works in the same way as C/C++ #include: the contents of FILE are
1994 copy-pasted verbatim in place of the directive. Include files may have
1995 further includes of their own. Re2c provides some predefined include
1996 files that can be found in the include/ subdirectory of the project.
1997 These files contain definitions that can be useful to other projects
1998 (such as Unicode categories) and form something like a standard library
1999 for re2c. Here is an example:
2000
2001 Include file (definitions.go)
2002 const (
2003 ResultOk = iota
2004 ResultFail
2005 )
2006
2007 /*!re2c
2008 number = [1-9][0-9]*;
2009 */
2010
2011
2012 Input file
2013 //go:generate re2go -c $INPUT -o $OUTPUT -i
2014 package main
2015
2016 import "testing"
2017 /*!include:re2c "definitions.go" */
2018
2019 func lex(str string) int {
2020 var cursor int
2021 /*!re2c
2022 re2c:yyfill:enable = 0;
2023 re2c:define:YYCTYPE = byte;
2024 re2c:define:YYPEEK = "str[cursor]";
2025 re2c:define:YYSKIP = "cursor += 1";
2026
2027 number { return ResultOk }
2028 * { return ResultFail }
2029 */
2030 }
2031
2032 func TestLex(t *testing.T) {
2033 if lex("123\000") != ResultOk {
2034 t.Errorf("error")
2035 }
2036 }
2037
2038
Re2c allows one to generate a header file from the input .re file using
2041 option -t, --type-header or configuration re2c:flags:type-header and
2042 directives /*!header:re2c:on*/ and /*!header:re2c:off*/. The first
2043 directive marks the beginning of header file, and the second directive
2044 marks the end of it. Everything between these directives is processed
2045 by re2c, and the generated code is written to the file specified by the
2046 -t --type-header option (or stdout if this option was not used). Auto‐
2047 generated header file may be needed in cases when re2c is used to gen‐
2048 erate definitions of constants, variables and structs that must be vis‐
2049 ible from other translation units.
2050
Here is an example of generating a header file that contains a definition
of the lexer state with tag variables (the number of variables depends on
the regular grammar and is unknown to the programmer).
2054
2055 Input file
2056 //go:generate re2go $INPUT -o $OUTPUT -i --type-header src/lexer/lexer.go
2057 package main
2058
2059 import (
2060 "lexer" // generated by re2c
2061 "testing"
2062 )
2063
2064 /*!header:re2c:on*/
2065 package lexer
2066
2067 type State struct {
2068 Data string
2069 Cur, Mar, /*!stags:re2c format="@@{tag}"; separator=", "; */ int
2070 }
2071 /*!header:re2c:off*/
2072
2073 func lex(st *lexer.State) int {
2074 /*!re2c
2075 re2c:flags:type-header = "src/lexer/lexer.go";
2076 re2c:yyfill:enable = 0;
2077 re2c:flags:tags = 1;
2078 re2c:define:YYCTYPE = byte;
2079 re2c:define:YYPEEK = "st.Data[st.Cur]";
2080 re2c:define:YYSKIP = "st.Cur++";
2081 re2c:define:YYBACKUP = "st.Mar = st.Cur";
2082 re2c:define:YYRESTORE = "st.Cur = st.Mar";
2083 re2c:define:YYRESTORETAG = "st.Cur = @@{tag}";
2084 re2c:define:YYSTAGP = "@@{tag} = st.Cur";
2085 re2c:tags:expression = "st.@@{tag}";
2086 re2c:tags:prefix = "Tag";
2087
2088 [x]{1,4} / [x]{3,5} { return 0 } // ambiguous trailing context
2089 * { return 1 }
2090 */
2091 }
2092
2093 func TestLex(t *testing.T) {
2094 st := &lexer.State{
2095 Data: "xxxxxxxx\x00",
2096 }
2097 if !(lex(st) == 0 && st.Cur == 4) {
2098 t.Error("failed")
2099 }
2100 }
2101
2102
2103 Header file
2104 // Code generated by re2c, DO NOT EDIT.
2105
2106 package lexer
2107
2108 type State struct {
2109 Data string
2110 Cur, Mar, Tag1, Tag2, Tag3 int
2111 }
2112
2113
2115 Re2c has two options for submatch extraction.
2116
2117 The first option is -T --tags. With this option one can use standalone
2118 tags of the form @stag and #mtag, where stag and mtag are arbitrary
user-defined names. Tags can be used anywhere inside of a regular
2120 expression; semantically they are just position markers. Tags of the
2121 form @stag are called s-tags: they denote a single submatch value (the
2122 last input position where this tag matched). Tags of the form #mtag are
2123 called m-tags: they denote multiple submatch values (the whole history
2124 of repetitions of this tag). All tags should be defined by the user as
2125 variables with the corresponding names. With standalone tags re2c uses
2126 leftmost greedy disambiguation: submatch positions correspond to the
2127 leftmost matching path through the regular expression.
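
The difference between an s-tag and an m-tag can be illustrated with a
hand-written scanner. This is only a sketch of the semantics, not code
generated by re2c: the wordStarts function is a hypothetical helper that
mimics a tag placed before [a-z]+ in a pattern of the form
([a-z]+ [;])*.

```go
package main

import "fmt"

// wordStarts scans input of the form ([a-z]+ ";")* and records where each
// word starts, the way a tag placed before [a-z]+ would:
//   - sTag keeps only the last recorded position (-1 if there was none),
//   - mTag keeps the whole history of positions.
// ok is false if the input does not have the expected form.
func wordStarts(s string) (sTag int, mTag []int, ok bool) {
	sTag = -1
	i := 0
	for i < len(s) {
		start := i
		for i < len(s) && s[i] >= 'a' && s[i] <= 'z' {
			i++
		}
		if i == start || i == len(s) || s[i] != ';' {
			return -1, nil, false
		}
		sTag = start
		mTag = append(mTag, start)
		i++ // skip ';'
	}
	return sTag, mTag, true
}

func main() {
	sTag, mTag, ok := wordStarts("one;two;three;")
	fmt.Println(sTag, mTag, ok)
}
```

For the input one;two;three; the s-tag ends up holding only the start of
the last word, while the m-tag holds the starts of all three words.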
2128
2129 The second option is -P --posix-captures: it enables POSIX-compliant
2130 capturing groups. In this mode parentheses in regular expressions
2131 denote the beginning and the end of capturing groups; the whole regular
2132 expression is group number zero. The number of groups for the matching
2133 rule is stored in a variable yynmatch, and submatch results are stored
2134 in yypmatch array. Both yynmatch and yypmatch should be defined by the
2135 user, and yypmatch size must be at least [yynmatch * 2]. Re2c provides
2136 a directive /*!maxnmatch:re2c*/ that defines YYMAXNMATCH: a constant
2137 equal to the maximal value of yynmatch among all rules. Note that re2c
2138 implements POSIX-compliant disambiguation: each subexpression matches
2139 as long as possible, and subexpressions that start earlier in regular
2140 expression have priority over those starting later. Capturing groups
2141 are translated into s-tags under the hood, therefore we use the word
2142 "tag" to describe them as well.
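
The layout of yypmatch can be sketched as follows: group i occupies
elements 2*i (start) and 2*i+1 (end). The submatches helper is a
hypothetical name, and the offsets in main are written by hand for
illustration rather than produced by a generated lexer.

```go
package main

import "fmt"

// submatches slices str according to a yypmatch-style array: group i
// occupies yypmatch[2*i] (start offset) and yypmatch[2*i+1] (end offset).
// A negative offset is treated as "group did not match" (an assumption
// that mirrors YYSTAGN being defined as -1 in the examples).
func submatches(str string, yypmatch []int, yynmatch int) []string {
	groups := make([]string, yynmatch)
	for i := 0; i < yynmatch; i++ {
		start, end := yypmatch[2*i], yypmatch[2*i+1]
		if start >= 0 && end >= 0 {
			groups[i] = str[start:end]
		}
	}
	return groups
}

func main() {
	str := "1.22.3.44"
	// Group 0 is the whole match, groups 1-4 are the four octets.
	yypmatch := []int{0, 9, 0, 1, 2, 4, 5, 6, 7, 9}
	fmt.Printf("%q\n", submatches(str, yypmatch, 5))
}
```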
2143
With both -P --posix-captures and -T --tags options re2c uses an efficient
2145 submatch extraction algorithm described in the Tagged Deterministic
2146 Finite Automata with Lookahead paper. The overhead on submatch extrac‐
2147 tion in the generated lexer grows with the number of tags --- if this
2148 number is moderate, the overhead is barely noticeable. In the lexer
2149 tags are implemented using a number of tag variables generated by re2c.
2150 There is no one-to-one correspondence between tag variables and tags: a
2151 single variable may be reused for different tags, and one tag may
2152 require multiple variables to hold all its ambiguous values. Eventually
2153 ambiguity is resolved, and only one final variable per tag survives.
2154 When a rule matches, all its tags are set to the values of the corre‐
2155 sponding tag variables. The exact number of tag variables is unknown
2156 to the user; this number is determined by re2c. However, tag variables
2157 should be defined by the user as a part of the lexer state and updated
2158 by YYFILL, therefore re2c provides directives /*!stags:re2c*/ and
2159 /*!mtags:re2c*/ that can be used to declare, initialize and manipulate
2160 tag variables. These directives have two optional configurations: for‐
2161 mat = "@@"; (specifies the template where @@ is substituted with the
2162 name of each tag variable), and separator = ""; (specifies the piece of
2163 code used to join the generated pieces for different tag variables).
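
The effect of the format and separator configurations can be sketched
with plain string substitution. This is an illustration of the
description above, not re2c's implementation: the expand function is a
hypothetical helper, and the variable names are the ones that appear in
the generated header file example.

```go
package main

import (
	"fmt"
	"strings"
)

// expand instantiates the format template once per tag variable and joins
// the pieces with the separator. The placeholder is passed explicitly
// ("@@" or "@@{tag}", both of which appear in the examples).
func expand(format, placeholder, separator string, vars []string) string {
	pieces := make([]string, len(vars))
	for i, v := range vars {
		pieces[i] = strings.ReplaceAll(format, placeholder, v)
	}
	return strings.Join(pieces, separator)
}

func main() {
	vars := []string{"Tag1", "Tag2", "Tag3"}
	fmt.Println(expand("@@{tag}", "@@{tag}", ", ", vars))
	fmt.Println(expand("var @@ int", "@@", "; ", vars))
}
```

The first call reproduces the field list seen in the generated header
file; the second shows a declaration-style template.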
2164
2165 S-tags support the following operations:
2166
2167 · save input position to an s-tag: t = YYCURSOR with default API or a
2168 user-defined operation YYSTAGP(t) with generic API
2169
2170 · save default value to an s-tag: t = NULL with default API or a
2171 user-defined operation YYSTAGN(t) with generic API
2172
2173 · copy one s-tag to another: t1 = t2
2174
2175 M-tags support the following operations:
2176
2177 · append input position to an m-tag: a user-defined operation YYM‐
2178 TAGP(t) with both default and generic API
2179
2180 · append default value to an m-tag: a user-defined operation YYMTAGN(t)
2181 with both default and generic API
2182
2183 · copy one m-tag to another: t1 = t2
2184
2185 S-tags can be implemented as scalar values (pointers or offsets).
2186 M-tags need a more complex representation, as they need to store a
2187 sequence of tag values. The most naive and inefficient representation
2188 of an m-tag is a list (array, vector) of tag values; a more efficient
2189 representation is to store all m-tags in a prefix-tree represented as
2190 array of nodes (v, p), where v is tag value and p is a pointer to par‐
2191 ent node.
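
A prefix tree of (v, p) nodes can be sketched as follows. The names
node, push and history are hypothetical; the point of the sketch is that
two histories with a common prefix share the nodes of that prefix, so
appending a value never copies the existing history.

```go
package main

import "fmt"

const root = -1 // index that denotes the empty history

// node is one (v, p) pair: a tag value and the index of the parent node.
type node struct{ v, p int }

// push appends value v to the history that ends at node h and returns the
// index of the new node. The old history h stays valid, so several
// histories can share a common prefix.
func push(trie *[]node, h, v int) int {
	*trie = append(*trie, node{v, h})
	return len(*trie) - 1
}

// history walks from node h back to the root and returns the values,
// oldest first.
func history(trie []node, h int) []int {
	var vals []int
	for ; h != root; h = trie[h].p {
		vals = append(vals, trie[h].v)
	}
	for i, j := 0, len(vals)-1; i < j; i, j = i+1, j-1 {
		vals[i], vals[j] = vals[j], vals[i]
	}
	return vals
}

func main() {
	var trie []node
	a := push(&trie, root, 0)
	b := push(&trie, a, 4)
	c := push(&trie, b, 8) // history 0, 4, 8
	d := push(&trie, b, 9) // history 0, 4, 9 shares the prefix 0, 4
	fmt.Println(history(trie, c), history(trie, d), len(trie))
}
```

Two histories of three values each are stored in four nodes here,
whereas a list per m-tag would need six elements.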
2192
2193 Here is an example of using s-tags to parse an IPv4 address.
2194
2195 //go:generate re2go $INPUT -o $OUTPUT
2196 package main
2197
2198 import (
2199 "errors"
2200 "testing"
2201 )
2202
2203 var eBadIP error = errors.New("bad IP")
2204
2205 func lex(str string) (int, error) {
2206 var cursor, marker, o1, o2, o3, o4 int
2207 /*!stags:re2c format = 'var @@ int'; separator = "\n\t"; */
2208
2209 num := func(pos int, end int) int {
2210 n := 0
2211 for ; pos < end; pos++ {
2212 n = n*10 + int(str[pos]-'0')
2213 }
2214 return n
2215 }
2216
2217 /*!re2c
2218 re2c:flags:tags = 1;
2219 re2c:yyfill:enable = 0;
2220 re2c:define:YYCTYPE = byte;
2221 re2c:define:YYPEEK = "str[cursor]";
2222 re2c:define:YYSKIP = "cursor += 1";
2223 re2c:define:YYBACKUP = "marker = cursor";
2224 re2c:define:YYRESTORE = "cursor = marker";
2225 re2c:define:YYSTAGP = "@@{tag} = cursor";
2226 re2c:define:YYSTAGN = "@@{tag} = -1";
2227
2228 octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
2229 dot = [.];
2230 end = [\x00];
2231
2232 @o1 octet dot @o2 octet dot @o3 octet dot @o4 octet end {
2233 return num(o4, cursor-1)+
2234 (num(o3, o4-1) << 8)+
2235 (num(o2, o3-1) << 16)+
2236 (num(o1, o2-1) << 24), nil
2237 }
2238 * { return 0, eBadIP }
2239 */
2240 }
2241
2242 func TestLex(t *testing.T) {
2243 var tests = []struct {
2244 str string
2245 res int
2246 err error
2247 }{
2248 {"1.2.3.4\000", 0x01020304, nil},
2249 {"127.0.0.1\000", 0x7f000001, nil},
2250 {"255.255.255.255\000", 0xffffffff, nil},
2251 {"1.2.3.\000", 0, eBadIP},
2252 {"1.2.3.256\000", 0, eBadIP},
2253 }
2254
2255 for _, x := range tests {
2256 t.Run(x.str, func(t *testing.T) {
2257 res, err := lex(x.str)
2258 if !(res == x.res && err == x.err) {
2259 t.Errorf("got %d, want %d", res, x.res)
2260 }
2261 })
2262 }
2263 }
2264
2265
2266 Here is an example of using POSIX capturing groups to parse an IPv4
2267 address.
2268
2269 //go:generate re2go $INPUT -o $OUTPUT
2270 package main
2271
2272 import (
2273 "errors"
2274 "testing"
2275 )
2276
2277 /*!maxnmatch:re2c*/
2278
2279 var eBadIP error = errors.New("bad IP")
2280
2281 func lex(str string) (int, error) {
2282 var cursor, marker, yynmatch int
2283 yypmatch := make([]int, YYMAXNMATCH*2)
2284 /*!stags:re2c format = 'var @@ int'; separator = "\n\t"; */
2285
2286 num := func(pos int, end int) int {
2287 n := 0
2288 for ; pos < end; pos++ {
2289 n = n*10 + int(str[pos]-'0')
2290 }
2291 return n
2292 }
2293
2294 /*!re2c
2295 re2c:flags:posix-captures = 1;
2296 re2c:yyfill:enable = 0;
2297 re2c:define:YYCTYPE = byte;
2298 re2c:define:YYPEEK = "str[cursor]";
2299 re2c:define:YYSKIP = "cursor += 1";
2300 re2c:define:YYBACKUP = "marker = cursor";
2301 re2c:define:YYRESTORE = "cursor = marker";
2302 re2c:define:YYSTAGP = "@@{tag} = cursor";
2303 re2c:define:YYSTAGN = "@@{tag} = -1";
2304 re2c:define:YYSHIFTSTAG = "@@{tag} += @@{shift}";
2305
2306 octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
2307 dot = [.];
2308 end = [\x00];
2309
2310 (octet) dot (octet) dot (octet) dot (octet) end {
2311 if yynmatch != 5 {
2312 panic("expected 5 submatch groups")
2313 }
2314 return num(yypmatch[8], yypmatch[9])+
2315 (num(yypmatch[6], yypmatch[7]) << 8)+
2316 (num(yypmatch[4], yypmatch[5]) << 16)+
2317 (num(yypmatch[2], yypmatch[3]) << 24), nil
2318 }
2319 * { return 0, eBadIP }
2320 */
2321 }
2322
2323 func TestLex(t *testing.T) {
2324 var tests = []struct {
2325 str string
2326 res int
2327 err error
2328 }{
2329 {"1.2.3.4\000", 0x01020304, nil},
2330 {"127.0.0.1\000", 0x7f000001, nil},
2331 {"255.255.255.255\000", 0xffffffff, nil},
2332 {"1.2.3.\000", 0, eBadIP},
2333 {"1.2.3.256\000", 0, eBadIP},
2334 }
2335
2336 for _, x := range tests {
2337 t.Run(x.str, func(t *testing.T) {
2338 res, err := lex(x.str)
2339 if !(res == x.res && err == x.err) {
2340 t.Errorf("got %d, want %d", res, x.res)
2341 }
2342 })
2343 }
2344 }
2345
2346
Here is an example of using m-tags to parse a semicolon-separated
sequence of words. Tag variables are stored in a tree that is packed
in a slice.
2350
2351 //go:generate re2go $INPUT -o $OUTPUT
2352 package main
2353
2354 import (
2355 "reflect"
2356 "testing"
2357 )
2358
2359 const (
2360 mtagRoot int = -1
2361 mtagNil int = -2
2362 )
2363
2364 type mtagElem struct {
2365 val int
2366 pred int
2367 }
2368
2369 type mtagTrie = []mtagElem
2370
2371 func createTrie(capacity int) mtagTrie {
2372 return make([]mtagElem, 0, capacity)
2373 }
2374
2375 func mtag(trie *mtagTrie, tag int, val int) int {
2376 *trie = append(*trie, mtagElem{val, tag})
2377 return len(*trie) - 1
2378 }
2379
// Recursively unwind both tag histories and construct submatches.
2381 func unwind(trie mtagTrie, x int, y int, str string) []string {
2382 if x == mtagRoot && y == mtagRoot {
2383 return []string{}
2384 } else if x == mtagRoot || y == mtagRoot {
2385 panic("tag histories have different length")
2386 } else {
2387 xval := trie[x].val
2388 yval := trie[y].val
2389 ss := unwind(trie, trie[x].pred, trie[y].pred, str)
2390
2391 // Either both tags should be nil, or none of them.
2392 if xval == mtagNil && yval == mtagNil {
2393 return ss
2394 } else if xval == mtagNil || yval == mtagNil {
2395 panic("tag histories positive/negative tag mismatch")
2396 } else {
2397 s := str[xval:yval]
2398 return append(ss, s)
2399 }
2400 }
2401 }
2402
2403 func lex(str string) []string {
2404 var cursor, marker int
2405 trie := createTrie(256)
2406 x := mtagRoot
2407 y := mtagRoot
2408 /*!mtags:re2c format = "@@ := mtagRoot"; separator = "\n\t"; */
2409
2410 /*!re2c
2411 re2c:flags:tags = 1;
2412 re2c:yyfill:enable = 0;
2413 re2c:define:YYCTYPE = byte;
2414 re2c:define:YYPEEK = "str[cursor]";
2415 re2c:define:YYSKIP = "cursor += 1";
2416 re2c:define:YYBACKUP = "marker = cursor";
2417 re2c:define:YYRESTORE = "cursor = marker";
2418 re2c:define:YYMTAGP = "@@{tag} = mtag(&trie, @@{tag}, cursor)";
2419 re2c:define:YYMTAGN = "@@{tag} = mtag(&trie, @@{tag}, mtagNil)";
2420
2421 end = [\x00];
2422
2423 (#x [a-z]+ #y [;])* end { return unwind(trie, x, y, str) }
2424 * { return nil }
2425 */
2426 }
2427
2428 func TestLex(t *testing.T) {
2429 var tests = []struct {
2430 str string
2431 res []string
2432 }{
2433 {"\000", []string{}},
2434 {"one;two;three;\000", []string{"one", "two", "three"}},
2435 {"one;two\000", nil},
2436 }
2437
2438 for _, x := range tests {
2439 t.Run(x.str, func(t *testing.T) {
2440 res := lex(x.str)
2441 if !reflect.DeepEqual(res, x.res) {
2442 t.Errorf("got %v, want %v", res, x.res)
2443 }
2444 })
2445 }
2446 }
2447
2448
2450 With -f --storable-state option re2c generates a lexer that can store
2451 its current state, return to the caller, and later resume operations
2452 exactly where it left off. The default mode of operation in re2c is a
2453 "pull" model, in which the lexer "pulls" more input whenever it needs
2454 it. This may be unacceptable in cases when the input becomes available
2455 piece by piece (for example, if the lexer is invoked by the parser, or
2456 if the lexer program communicates via a socket protocol with some other
2457 program that must wait for a reply from the lexer before it transmits
2458 the next message). Storable state feature is intended exactly for such
2459 cases: it allows one to generate lexers that work in a "push" model.
2460 When the lexer needs more input, it stores its state and returns to the
2461 caller. Later, when more input becomes available, the caller resumes
2462 the lexer exactly where it stopped. There are a few changes necessary
2463 compared to the "pull" model:
2464
· Define YYGETSTATE() and YYSETSTATE(state) primitives.
2466
2467 · Define yych, yyaccept and state variables as a part of persistent
2468 lexer state. The state variable should be initialized to -1.
2469
2470 · YYFILL should return to the outer program instead of trying to supply
2471 more input. Return code should indicate that lexer needs more input.
2472
2473 · The outer program should recognize situations when lexer needs more
2474 input and respond appropriately.
2475
2476 · Use /*!getstate:re2c*/ directive if it is necessary to execute any
2477 code before entering the lexer.
2478
2479 · Use configurations state:abort and state:nextlabel to further tweak
2480 the generated code.
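
The idea behind the "push" model can be sketched with a hand-written
lexer that keeps its whole state in a struct and returns to the caller
whenever it runs out of input. This is not what re2c generates: the
pushLexer type and its Feed method are hypothetical, and the state field
merely plays the role that YYGETSTATE and YYSETSTATE play in a generated
lexer.

```go
package main

import "fmt"

const (
	needMoreInput = iota // all input consumed, call Feed again with more data
	packetBroken         // syntax error
)

// pushLexer recognizes packets of the form [a-z]+[;]. Its whole state
// lives in the struct, so it can return in the middle of a packet.
type pushLexer struct {
	state int // -1: at the start of a packet, 0: inside a packet
	count int // number of complete packets seen so far
}

func newPushLexer() *pushLexer { return &pushLexer{state: -1} }

// Feed consumes one chunk of input and returns to the caller. The stored
// state tells the next call where to resume.
func (l *pushLexer) Feed(chunk string) int {
	for i := 0; i < len(chunk); i++ {
		c := chunk[i]
		switch {
		case c >= 'a' && c <= 'z':
			l.state = 0
		case c == ';' && l.state == 0:
			l.count++
			l.state = -1
		default:
			return packetBroken
		}
	}
	return needMoreInput
}

func main() {
	l := newPushLexer()
	for _, chunk := range []string{"on", "e;tw", "o;"} {
		if l.Feed(chunk) == packetBroken {
			fmt.Println("broken packet")
			return
		}
	}
	fmt.Println(l.count, l.state == -1)
}
```

The chunk boundaries fall in the middle of packets on purpose: the
caller decides when input arrives, and the lexer picks up from the
stored state.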
2481
Here is an example of a "push"-model lexer that expects a sequence of
packets, where each packet is a word of lowercase letters terminated by
a semicolon. The outer program (the test function) plays the role of
the sender: it writes packets one at a time into a file, and the lexer
reads them from the same file through a small buffer. Whenever the
lexer runs out of input, YYFILL returns lexWaitingForInput to the outer
program, which sends the next packet (if there is one), calls fill to
shift the buffer and read new data, and then re-enters the lexer; the
lexer resumes exactly where it stopped. The lexer terminates
successfully when it fails to get any more input while being in the
initial state (this is the only case when EOF rule matches), and the
outer program then checks that the number of received packets equals
the number of sent packets. Abnormal termination happens in case of a
syntax error (lexPacketBroken) or in case the buffer is too small to
hold a lexeme (lexPacketTooBig), for example if one of the packets
exceeds buffer size.
2502
2503 Example
2504 //go:generate re2go -f $INPUT -o $OUTPUT
2505 package main
2506
2507 import (
2508 "fmt"
2509 "os"
2510 "testing"
2511 )
2512
2513 // Intentionally small to trigger buffer refill.
2514 const SIZE int = 16
2515
2516 type Input struct {
2517 file *os.File
2518 data []byte
2519 cursor int
2520 marker int
2521 token int
2522 limit int
2523 state int
2524 yyaccept int
2525 }
2526
2527 const (
2528 lexEnd = iota
2529 lexReady
2530 lexWaitingForInput
2531 lexPacketBroken
2532 lexPacketTooBig
2533 lexCountMismatch
2534 )
2535
2536 func fill(in *Input) int {
2537 if in.token == 0 {
2538 // Error: no space can be freed.
2539 // In real life can reallocate a larger buffer.
2540 return lexPacketTooBig
2541 }
2542
2543 // Discard everything up to the start of the current lexeme,
2544 // shift buffer contents and adjust offsets.
2545 copy(in.data[0:], in.data[in.token:in.limit])
2546 in.cursor -= in.token
2547 in.marker -= in.token
2548 in.limit -= in.token
2549 in.token = 0
2550
2551 // Read new data (as much as possible to fill the buffer).
2552 n, _ := in.file.Read(in.data[in.limit:SIZE])
2553 in.limit += n
2554 in.data[in.limit] = 0 // append sentinel symbol
2555
2556 return lexReady
2557 }
2558
2559 func lex(in *Input, recv *int) int {
2560 var yych byte
2561 /*!getstate:re2c*/
2562 loop:
2563 in.token = in.cursor
2564 /*!re2c
2565 re2c:eof = 0;
2566 re2c:define:YYPEEK = "in.data[in.cursor]";
2567 re2c:define:YYSKIP = "in.cursor += 1";
2568 re2c:define:YYBACKUP = "in.marker = in.cursor";
2569 re2c:define:YYRESTORE = "in.cursor = in.marker";
2570 re2c:define:YYLESSTHAN = "in.limit <= in.cursor";
2571 re2c:define:YYFILL = "return lexWaitingForInput";
2572 re2c:define:YYGETSTATE = "in.state";
2573 re2c:define:YYSETSTATE = "in.state = @@{state}";
2574
2575 packet = [a-z]+[;];
2576
2577 * { return lexPacketBroken }
2578 $ { return lexEnd }
2579 packet { *recv = *recv + 1; goto loop }
2580 */
2581 }
2582
2583 func test(packets []string) int {
2584 fname := "pipe"
	fw, _ := os.Create(fname)
	fr, _ := os.Open(fname)
2587
2588 in := &Input{
2589 file: fr,
2590 data: make([]byte, SIZE+1),
2591 cursor: SIZE,
2592 marker: SIZE,
2593 token: SIZE,
2594 limit: SIZE,
2595 state: -1,
2596 }
2597 // data is zero-initialized, no need to write sentinel
2598
2599 var status int
2600 send := 0
2601 recv := 0
2602 loop:
2603 for {
2604 status = lex(in, &recv)
2605 if status == lexEnd {
2606 if send != recv {
2607 status = lexCountMismatch
2608 }
2609 break loop
2610 } else if status == lexWaitingForInput {
2611 if send < len(packets) {
2612 fw.WriteString(packets[send])
2613 send += 1
2614 }
2615 status = fill(in)
2616 if status != lexReady {
2617 break loop
2618 }
2619 } else if status == lexPacketBroken {
2620 break loop
2621 } else {
2622 panic("unexpected status")
2623 }
2624 }
2625
2626 fr.Close()
2627 fw.Close()
2628 os.Remove(fname)
2629
2630 return status
2631 }
2632
2633 func TestLex(t *testing.T) {
2634 var tests = []struct {
2635 status int
2636 packets []string
2637 }{
2638 {lexEnd, []string{}},
2639 {lexEnd, []string{"zero;", "one;", "two;", "three;", "four;"}},
2640 {lexPacketBroken, []string{"??;"}},
2641 {lexPacketTooBig, []string{"looooooooooooong;"}},
2642 }
2643
2644 for i, x := range tests {
2645 t.Run(fmt.Sprintf("%d", i), func(t *testing.T) {
2646 status := test(x.packets)
2647 if status != x.status {
2648 t.Errorf("got %d, want %d", status, x.status)
2649 }
2650 })
2651 }
2652 }
2653
2654
2656 Reuse mode is enabled with the -r --reusable option. In this mode re2c
2657 allows one to reuse definitions, configurations and rules specified by
2658 a /*!rules:re2c*/ block in subsequent /*!use:re2c*/ blocks. As of
2659 re2c-1.2 it is possible to mix such blocks with normal /*!re2c*/
blocks; prior to that re2c expected a single rules-block followed by
use-blocks (normal blocks were disallowed). Use-blocks can have
additional definitions, configurations and rules: they are merged with those
2663 specified by the rules-block. A very common use case for -r --reusable
2664 option is a lexer that supports multiple input encodings: lexer rules
2665 are defined once and reused multiple times with encoding-specific con‐
2666 figurations, such as re2c:flags:utf-8.
2667
Below is an example of a multi-encoding lexer: it reads a phrase with
Unicode math symbols and accepts input either in UTF-8 or in UTF-32. Note
that the --input-encoding utf8 option allows us to write UTF-8 encoded
symbols in the regular expressions; without this option re2c would
parse them as a plain ASCII byte sequence (and we would have to use
hexadecimal escape sequences).
2674
2675 Example
//go:generate re2go $INPUT -o $OUTPUT -r --input-encoding utf8
package main

import "testing"

/*!rules:re2c
    re2c:yyfill:enable = 0;
    re2c:define:YYPEEK    = "str[cursor]";
    re2c:define:YYSKIP    = "cursor += 1";
    re2c:define:YYBACKUP  = "marker = cursor";
    re2c:define:YYRESTORE = "cursor = marker";

    "∀x ∃y: p(x, y)" { return 0; }
    *                { return 1; }
*/

func lexUTF8(str []uint8) int {
    var cursor, marker int
    /*!use:re2c
    re2c:flags:8 = 1;
    re2c:define:YYCTYPE = uint8;
    */
}

func lexUTF32(str []uint32) int {
    var cursor, marker int
    /*!use:re2c
    re2c:flags:u = 1;
    re2c:define:YYCTYPE = uint32;
    */
}

func TestLexUTF8(t *testing.T) {
    s_utf8 := []uint8{
        0xe2, 0x88, 0x80, 0x78, 0x20, 0xe2, 0x88, 0x83, 0x79,
        0x3a, 0x20, 0x70, 0x28, 0x78, 0x2c, 0x20, 0x79, 0x29};

    if lexUTF8(s_utf8) != 0 {
        t.Errorf("utf8 failed")
    }
}

func TestLexUTF32(t *testing.T) {
    s_utf32 := []uint32{
        0x00002200, 0x00000078, 0x00000020, 0x00002203, 0x00000079,
        0x0000003a, 0x00000020, 0x00000070, 0x00000028, 0x00000078,
        0x0000002c, 0x00000020, 0x00000079, 0x00000029};

    if lexUTF32(s_utf32) != 0 {
        t.Errorf("utf32 failed")
    }
}
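
The code unit arrays in the two tests above can be derived from the
phrase itself. The following is a minimal sketch using only the Go
standard conversions; the names phrase, utf8Units and utf32Units are
illustrative and are not part of re2c or of the example above.

```go
package main

// phrase is the string literal matched by the rules block above.
const phrase = "∀x ∃y: p(x, y)"

// utf8Units returns the UTF-8 code units of s. Go source files are
// UTF-8 and Go strings are byte sequences, so a conversion suffices.
func utf8Units(s string) []uint8 {
	return []uint8(s)
}

// utf32Units returns one 32-bit code unit per code point of s.
func utf32Units(s string) []uint32 {
	units := make([]uint32, 0, len(s))
	for _, r := range s { // ranging over a string yields code points
		units = append(units, uint32(r))
	}
	return units
}
```

For this phrase the sketch yields 18 UTF-8 code units and 14 UTF-32
code units, matching the lengths of s_utf8 and s_utf32.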

re2c supports the following encodings: ASCII (default), EBCDIC (-e),
UCS-2 (-w), UTF-16 (-x), UTF-32 (-u) and UTF-8 (-8). See also the
inplace configuration re2c:flags.

The following concepts should be clarified when talking about
encodings. A code point is an abstract number that represents a single
symbol. A code unit is the smallest unit of memory that is used in the
encoded text (it corresponds to one character in the input stream). One
or more code units may be needed to represent a single code point,
depending on the encoding. In a fixed-length encoding, each code point
is represented with an equal number of code units. In a variable-length
encoding, different code points can be represented with different
numbers of code units.

· ASCII is a fixed-length encoding. Its code space includes 0x100 code
  points, from 0 to 0xFF. One code point is represented with exactly
  one 1-byte code unit, which has the same value as the code point. The
  size of YYCTYPE must be 1 byte.

· EBCDIC is a fixed-length encoding. Its code space includes 0x100 code
  points, from 0 to 0xFF. One code point is represented with exactly
  one 1-byte code unit, which has the same value as the code point. The
  size of YYCTYPE must be 1 byte.

· UCS-2 is a fixed-length encoding. Its code space includes 0x10000
  code points, from 0 to 0xFFFF. One code point is represented with
  exactly one 2-byte code unit, which has the same value as the code
  point. The size of YYCTYPE must be 2 bytes.

· UTF-16 is a variable-length encoding. Its code space includes all
  Unicode code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF.
  One code point is represented with one or two 2-byte code units. The
  size of YYCTYPE must be 2 bytes.

· UTF-32 is a fixed-length encoding. Its code space includes all
  Unicode code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF.
  One code point is represented with exactly one 4-byte code unit. The
  size of YYCTYPE must be 4 bytes.

· UTF-8 is a variable-length encoding. Its code space includes all
  Unicode code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF.
  One code point is represented with a sequence of one, two, three, or
  four 1-byte code units. The size of YYCTYPE must be 1 byte.
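
The difference between fixed-length and variable-length encodings can
be observed with the Go standard library. The sketch below counts the
code units needed for one code point in the three Unicode encodings;
the function name unitCounts is illustrative, and the input is assumed
to be a valid code point.

```go
package main

import (
	"unicode/utf16"
	"unicode/utf8"
)

// unitCounts reports how many code units encode the code point r in
// UTF-8 (1-byte units), UTF-16 (2-byte units) and UTF-32 (4-byte
// units). r is assumed to be a valid Unicode code point.
func unitCounts(r rune) (n8, n16, n32 int) {
	n8 = utf8.RuneLen(r)               // one to four code units
	n16 = len(utf16.Encode([]rune{r})) // one unit, or a pair of two
	n32 = 1                            // fixed-length: always one unit
	return
}
```

For example, U+2200 takes three code units in UTF-8 but one in UTF-16
and UTF-32, while U+1F600 takes four, two and one respectively.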

In Unicode, values in the range 0xD800 to 0xDFFF (surrogates) are not
valid Unicode code points. Any encoded sequence of code units that
would map to a code point in the range 0xD800-0xDFFF is ill-formed. The
user can control how re2c treats such ill-formed sequences with the
--encoding-policy <policy> switch.
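
As an illustration of the surrogate range, the sketch below checks a
code point against it and shows that the Go standard library also
treats an encoded surrogate as ill-formed UTF-8. The helper names are
illustrative; the sketch does not model the --encoding-policy switch.

```go
package main

import "unicode/utf8"

// isSurrogate reports whether cp lies in the surrogate range
// 0xD800-0xDFFF, which contains no valid Unicode code points.
func isSurrogate(cp uint32) bool {
	return cp >= 0xD800 && cp <= 0xDFFF
}

// wellFormedUTF8 reports whether b is well-formed UTF-8 according to
// the Go standard library, which rejects encoded surrogates.
func wellFormedUTF8(b []byte) bool {
	return utf8.Valid(b)
}
```

The bytes ED A0 80 would decode to 0xD800 and are rejected, whereas
ED 9F BF (0xD7FF, just below the range) are accepted.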

For some encodings, there are code units that never occur in a valid
encoded stream (e.g., 0xFF byte in UTF-8). If the generated scanner
must check for invalid input, the only correct way to do so is to use
the default rule (*). Note that the full range rule ([^]) won't catch
invalid code units when a variable-length encoding is used ([^] means
"any valid code point", whereas the default rule (*) means "any
possible code unit").
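
To see what kind of input only the default rule can catch, consider the
hand-written check below. It is not re2c output, just a sketch built on
the Go standard library that locates the first byte at which no valid
code point starts; the name firstInvalid is illustrative.

```go
package main

import "unicode/utf8"

// firstInvalid returns the index of the first byte in b that does not
// start a well-formed UTF-8 sequence, or -1 if b is well-formed.
func firstInvalid(b []byte) int {
	for i := 0; i < len(b); {
		r, size := utf8.DecodeRune(b[i:])
		if r == utf8.RuneError && size == 1 {
			return i // no valid code point starts at this code unit
		}
		i += size
	}
	return -1
}
```

A rule that matches only valid code points has nothing to match at such
a position, which is why the default rule is needed there.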

Conditions are enabled with -c --conditions. This option allows one to
encode multiple interrelated lexers within the same re2c block.

Each lexer corresponds to a single condition. It starts with a label
of the form yyc_name, where name is the condition name and the yyc
prefix can be adjusted with the configuration re2c:condprefix.
Different lexers are separated with a comment
/* *********************************** */ which can be adjusted with
the configuration re2c:cond:divider.

Furthermore, each condition has a unique identifier of the form
yycname, where name is the condition name and the yyc prefix can be
adjusted with the configuration re2c:condenumprefix. Identifiers have
the type YYCONDTYPE and should be generated with the /*!types:re2c*/
directive or the -t --type-header option. Users shouldn't define these
identifiers manually, as the order of conditions is not specified.

Before all conditions re2c generates entry code that checks the current
condition identifier and transfers control flow to the start label of
the active condition. After matching some rule of this condition, the
lexer may either transfer control flow back to the entry code (after
executing the associated action and optionally setting another
condition with =>), or use the :=> shortcut and transition directly to
the start label of another condition (skipping the action and the entry
code). The configuration re2c:cond:goto allows one to change the
default behavior.
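
The dispatch described above can be pictured with a hand-written
analogue. The sketch below is not what re2c generates: the condition
constants, the error value and the function parse are all hypothetical,
overflow is ignored, and the end of the string stands in for an
end-of-input rule. The switch plays the role of the entry code.

```go
package main

import "errors"

// Hypothetical condition identifiers; a real lexer would obtain them
// from /*!types:re2c*/ instead of declaring them by hand.
const (
	condInit = iota
	condBin
	condDec
)

var errSyntax = errors.New("syntax error")

// parse reads a binary ("0b...") or decimal number. On every iteration
// the switch checks the current condition and runs its branch.
func parse(s string) (uint64, error) {
	cond, pos, result := condInit, 0, uint64(0)
	for {
		switch cond {
		case condInit:
			if len(s) >= 3 && s[0] == '0' && (s[1] == 'b' || s[1] == 'B') {
				pos, cond = 2, condBin // switch condition, skip prefix
			} else if len(s) >= 1 && s[0] >= '1' && s[0] <= '9' {
				cond = condDec // switch condition, consume nothing
			} else {
				return 0, errSyntax
			}
		case condBin:
			if pos == len(s) {
				return result, nil
			}
			if s[pos] != '0' && s[pos] != '1' {
				return 0, errSyntax
			}
			result = result*2 + uint64(s[pos]-'0')
			pos++
		case condDec:
			if pos == len(s) {
				return result, nil
			}
			if s[pos] < '0' || s[pos] > '9' {
				return 0, errSyntax
			}
			result = result*10 + uint64(s[pos]-'0')
			pos++
		}
	}
}
```

Each pass through the loop corresponds to returning to the entry code
after an action; a direct jump between conditions would bypass it.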

Syntactically each rule must be preceded with a list of comma-separated
condition names or a wildcard * enclosed in angle brackets < and >.
Wildcard means "any condition" and is semantically equivalent to
listing all condition names. In the forms below, regexp is a regular
expression, default refers to the default rule *, and action is a block
of code.

· <conditions-or-wildcard> regexp-or-default action

· <conditions-or-wildcard> regexp-or-default => condition action

· <conditions-or-wildcard> regexp-or-default :=> condition

Rules with an exclamation mark ! in front of the condition list have a
special meaning: they have no regular expression, and the associated
action is merged as entry code into the actions of normal rules. This
might be a convenient place to perform a routine task that is common to
all rules.

· <!conditions-or-wildcard> action

Another special form of rules with an empty condition list <> and no
regular expression allows one to specify an "entry condition" that can
be used to execute code before entering the lexer. It is semantically
equivalent to a condition with number zero, name 0 and an empty regular
expression.

· <> action

· <> => condition action

· <> :=> condition

Example
//go:generate re2go -c $INPUT -o $OUTPUT -i
package main

import (
    "errors"
    "testing"
)

var (
    eSyntax   = errors.New("syntax error")
    eOverflow = errors.New("overflow error")
)

/*!types:re2c*/

const u32Limit uint64 = 1<<32

func parse_u32(str string) (uint32, error) {
    var cursor, marker int
    result := uint64(0)
    cond := yycinit

    add_digit := func(base uint64, offset byte) {
        result = result * base + uint64(str[cursor-1] - offset)
        if result >= u32Limit {
            result = u32Limit
        }
    }

    /*!re2c
    re2c:yyfill:enable = 0;
    re2c:define:YYCTYPE = byte;
    re2c:define:YYPEEK = "str[cursor]";
    re2c:define:YYSKIP = "cursor += 1";
    re2c:define:YYSHIFT = "cursor += @@{shift}";
    re2c:define:YYBACKUP = "marker = cursor";
    re2c:define:YYRESTORE = "cursor = marker";
    re2c:define:YYGETCONDITION = "cond";
    re2c:define:YYSETCONDITION = "cond = @@";

    <*> * { return 0, eSyntax }

    <init> '0b' / [01]        :=> bin
    <init> "0"                :=> oct
    <init> "" / [1-9]         :=> dec
    <init> '0x' / [0-9a-fA-F] :=> hex

    <bin, oct, dec, hex> "\x00" {
        if result < u32Limit {
            return uint32(result), nil
        } else {
            return 0, eOverflow
        }
    }

    <bin> [01]  { add_digit(2, '0');     goto yyc_bin }
    <oct> [0-7] { add_digit(8, '0');     goto yyc_oct }
    <dec> [0-9] { add_digit(10, '0');    goto yyc_dec }
    <hex> [0-9] { add_digit(16, '0');    goto yyc_hex }
    <hex> [a-f] { add_digit(16, 'a'-10); goto yyc_hex }
    <hex> [A-F] { add_digit(16, 'A'-10); goto yyc_hex }
    */
}

func TestLex(t *testing.T) {
    var tests = []struct {
        num uint32
        str string
        err error
    }{
        {1234567890, "1234567890\000", nil},
        {13, "0b1101\000", nil},
        {0x7fe, "0x007Fe\000", nil},
        {0644, "0644\000", nil},
        {0, "9999999999\000", eOverflow},
        {0, "123??\000", eSyntax},
    }

    for _, x := range tests {
        t.Run(x.str, func(t *testing.T) {
            num, err := parse_u32(x.str)
            if !(num == x.num && err == x.err) {
                t.Errorf("got %d, want %d", num, x.num)
            }
        })
    }
}

With the -S, --skeleton option, re2c ignores all non-re2c code and
generates a self-contained C program that can be further compiled and
executed. The program consists of lexer code and input data. For each
constructed DFA (block or condition) re2c generates a standalone lexer
and two files: an .input file with strings derived from the DFA and a
.keys file with expected match results. The program runs each lexer on
the corresponding .input file and compares results with the
expectations. Skeleton programs are very useful for a number of
reasons:

· They can check correctness of various re2c optimizations (the data is
  generated early in the process, before any DFA transformations have
  taken place).

· Generating a set of input data with good coverage may be useful for
  both testing and benchmarking.

· Generating self-contained executable programs allows one to get
  minimized test cases (the original code may be large or have a lot of
  dependencies).

The difficulty with generating input data is that for all but the most
trivial cases the number of possible input strings is too large (even
if the string length is limited). Re2c solves this difficulty by
generating sufficiently many strings to cover almost all DFA
transitions. It uses the following algorithm. First, it constructs a
skeleton of the DFA. For encodings with a 1-byte code unit size (such
as ASCII, UTF-8 and EBCDIC) the skeleton is just an exact copy of the
original DFA. For encodings with multibyte code units the skeleton is a
copy of the DFA with certain transitions omitted: namely, re2c takes at
most 256 code units for each disjoint continuous range that corresponds
to a DFA transition. The chosen values are evenly distributed and
include the range bounds. Instead of trying to cover all possible paths
in the skeleton (which is infeasible), re2c generates sufficiently many
paths to cover all skeleton transitions, and thus trigger the
corresponding conditional jumps in the lexer. The algorithm
implementation is limited to ~1Gb of transitions and consumes a
constant amount of memory (re2c writes data to file as soon as it is
generated).
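
The selection of at most 256 evenly distributed code units per range
can be sketched as follows. This is only an illustration of the idea
under simple assumptions, not re2c's actual implementation; the
function name sampleRange and its limit parameter are hypothetical.

```go
package main

// sampleRange returns at most limit values from the inclusive range
// [lo, hi]. When the range holds more than limit values (and limit is
// at least 2), the result is evenly spaced and includes both bounds.
func sampleRange(lo, hi uint32, limit int) []uint32 {
	if lo > hi || limit <= 0 {
		return nil
	}
	size := uint64(hi) - uint64(lo) + 1
	if size <= uint64(limit) {
		// The range is small enough: take every value.
		out := make([]uint32, 0, size)
		for v := uint64(lo); v <= uint64(hi); v++ {
			out = append(out, uint32(v))
		}
		return out
	}
	if limit == 1 {
		return []uint32{lo} // degenerate case: only the lower bound
	}
	out := make([]uint32, 0, limit)
	span := uint64(hi) - uint64(lo)
	for i := 0; i < limit; i++ {
		// i == 0 gives lo and i == limit-1 gives hi exactly.
		v := uint64(lo) + span*uint64(i)/uint64(limit-1)
		out = append(out, uint32(v))
	}
	return out
}
```

With limit set to 256, a transition on the full UCS-2 range 0 to 0xFFFF
would be represented by 256 code units, the first being 0 and the last
0xFFFF.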

With the -D, --emit-dot option, re2c does not generate code. Instead,
it dumps the generated DFA in DOT format. One can convert this dump to
an image of the DFA using Graphviz or a similar tool. Note that this
option shows the final DFA after it has gone through a number of
optimizations and transformations. Earlier stages can be dumped with
various debug options, such as --dump-nfa, --dump-dfa-raw etc. (see the
full list of options).
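
For readers unfamiliar with DOT, the sketch below renders a toy list of
transitions as a Graphviz digraph. It only illustrates the general
shape of the DOT format; the exact output of -D may differ, and the
edge type and toDOT function are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// edge is a hypothetical DFA transition from one state to another.
type edge struct {
	from, to int
	label    string
}

// toDOT renders the transitions as a Graphviz digraph. Labels are
// quoted with %q, which is adequate for simple ASCII labels.
func toDOT(edges []edge) string {
	var b strings.Builder
	b.WriteString("digraph dfa {\n")
	for _, e := range edges {
		fmt.Fprintf(&b, "  %d -> %d [label=%q];\n", e.from, e.to, e.label)
	}
	b.WriteString("}\n")
	return b.String()
}
```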

You can find more information about re2c at the official website:
http://re2c.org. Similar programs are flex(1), lex(1) and quex
(http://quex.sourceforge.net).

Re2c was originally written by Peter Bumbulis in 1993. Since then it
has been developed and maintained by multiple volunteers; most notably,
Brian Young, Marcus Boerger, Dan Nuffer and Ulya Trofimovich.