1RE2C(1) RE2C(1)
2
3
4
6 re2c - convert regular expressions to C/C++ code
7
9 re2c [OPTIONS] FILE
10
12 re2c is a lexer generator for C/C++. It finds regular expression speci‐
13 fications inside of C/C++ comments and replaces them with a hard-coded
14 DFA. The user must supply some interface code in order to control and
15 customize the generated DFA.
16
18 -? -h --help
19 Show help message.
20
21 -b --bit-vectors
22 Optimize conditional jumps using bit masks. Implies -s.
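As a rough illustration of the bit-mask idea (a hand-written sketch, not actual re2c output), a single 256-entry table can answer several character-class questions with one load and one AND. The table name yybm follows the re2c:variable:yybm configuration; the bit assignments BM_DIGIT and BM_ALPHA, the helper names, and the ASCII-compatible character set are assumptions made for this example.

```c
#include <assert.h>

enum { BM_DIGIT = 128, BM_ALPHA = 64 };   /* hypothetical bit assignments */

static unsigned char yybm[256];

/* Fills the table once; assumes an ASCII-compatible character set. */
static void build_yybm(void)
{
    int c;
    for (c = 0; c < 256; ++c) {
        unsigned char bits = 0;
        if (c >= '0' && c <= '9') bits |= BM_DIGIT;
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) bits |= BM_ALPHA;
        yybm[c] = bits;
    }
}

/* One table load and one AND replace a chain of range comparisons. */
static int in_class(unsigned int c, unsigned int mask)
{
    assert(c < 256);
    return (yybm[c] & mask) != 0;
}
```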
23
24 -c --conditions --start-conditions
25 Enable support of Flex-like "conditions": multiple interrelated
26 lexers within one block. Option --start-conditions is a legacy
27 alias; use --conditions instead.
28
29 -d --debug-output
30 Emit YYDEBUG in the generated code. YYDEBUG should be defined
31 by the user in the form of a void function with two parameters:
32 state (lexer state or -1) and symbol (current input symbol of
33 type YYCTYPE).
34
35 -D --emit-dot
36 Instead of normal output generate lexer graph in DOT format.
37 The output can be converted to PNG with the help of Graphviz
38 (something like dot -Tpng -odfa.png dfa.dot). Note that large
39 graphs may crash Graphviz.
40
41 -e --ecb
42 Generate a lexer that reads input in EBCDIC encoding. re2c
43 assumes that character range is 0 -- 0xFF and character size is 1
44 byte.
45
46 -f --storable-state
47 Generate a lexer which can store its inner state. This is use‐
48 ful in push-model lexers which are stopped by an outer program
49 when there is not enough input, and then resumed when more input
50 becomes available. In this mode users should additionally
51 define YYGETSTATE () and YYSETSTATE (state) macros and variables
52 yych, yyaccept and the state as part of the lexer state.
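A minimal sketch of the state that a push-model lexer might keep between invocations. The struct lexer_state, its field names, and the pointer name st are hypothetical; the initial state value of -1 follows the description of YYGETSTATE in this manual.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char YYCTYPE;

/* Hypothetical container for everything that must survive between calls. */
struct lexer_state {
    int state;               /* lexer state, initially -1 */
    YYCTYPE yych;            /* current input character */
    unsigned int yyaccept;   /* number of the last matched rule */
    const YYCTYPE *cursor, *marker, *limit;
};

static void lexer_state_init(struct lexer_state *st)
{
    assert(st != NULL);
    st->state = -1;
    st->yych = 0;
    st->yyaccept = 0;
    st->cursor = st->marker = st->limit = NULL;
}

/* Both macros assume a variable named st is in scope inside the lexer. */
#define YYGETSTATE()   (st->state)
#define YYSETSTATE(s)  (st->state = (s))
```

Keeping yych and yyaccept in the same struct means that a lexer resumed after a refill sees the same values it had when it was suspended.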
53
54 -F --flex-syntax
55 Partial support for Flex syntax: in this mode named definitions
56 don't need the equal sign and the terminating semicolon, and
57 when used they must be surrounded by curly braces. Names with‐
58 out curly braces are treated as double-quoted strings.
59
60 -g --computed-gotos
61 Optimize conditional jumps using non-standard "computed goto"
62 extension (must be supported by C/C++ compiler). re2c generates
63 jump tables only in complex cases with a lot of conditional
64 branches. Complexity threshold can be configured with
65 cgoto:threshold configuration. This option implies -b.
66
67 -i --no-debug-info
68 Do not output #line information. This is useful when the gener‐
69 ated code is tracked by some version control system.
70
71 -o OUTPUT --output=OUTPUT
72 Specify the OUTPUT file.
73
74 -r --reusable
75 Allows reuse of re2c rules with /*!rules:re2c */ and /*!use:re2c
76 */ blocks. In this mode simple /*!re2c */ blocks are not
77 allowed and exactly one /*!rules:re2c */ block must be present.
78 The rules are saved and used by every /*!use:re2c */ block that
79 follows (which may add rules of their own). This option makes it
80 possible to reuse the same set of rules with different configurations.
81
82 -s --nested-ifs
83 Use nested if statements instead of switch statements in condi‐
84 tional jumps. This usually results in more efficient code with
85 non-optimizing C/C++ compilers.
86
87 -t HEADER --type-header=HEADER
88 Generate a HEADER file that contains enum with condition names.
89 Requires -c option.
90
91 -T --tags
92 Enable submatch extraction with tags.
93
94 -P --posix-captures
95 Enable submatch extraction with POSIX-style capturing groups.
96
97 -u --unicode
98 Generate a lexer that reads input in UTF-32 encoding. re2c
99 assumes that character range is 0 -- 0x10FFFF and character size
100 is 4 bytes. Implies -s.
101
102 -v --version
103 Show version information.
104
105 -V --vernum
106 Show version information in MMmmpp format (major, minor, patch).
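A small sketch of how a build script helper might split such a number, assuming each of the three fields occupies exactly two decimal digits. The struct and function names are hypothetical.

```c
#include <assert.h>

struct version { int major, minor, patch; };

/* Splits a number in MMmmpp form into its three two-digit fields.
 * Parse the text in base 10 first: a leading zero must not be read as octal. */
static struct version decode_vernum(long vernum)
{
    struct version v;
    assert(vernum >= 0 && vernum <= 999999);
    v.major = (int)(vernum / 10000);
    v.minor = (int)(vernum / 100 % 100);
    v.patch = (int)(vernum % 100);
    return v;
}
```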
107
108 -w --wide-chars
109 Generate a lexer that reads input in UCS-2 encoding. re2c
110 assumes that character range is 0 -- 0xFFFF and character size
111 is 2 bytes. Implies -s.
112
113 -x --utf-16
114 Generate a lexer that reads input in UTF-16 encoding. re2c
115 assumes that character range is 0 -- 0x10FFFF and character size
116 is 2 bytes. Implies -s.
117
118 -8 --utf-8
119 Generate a lexer that reads input in UTF-8 encoding. re2c
120 assumes that character range is 0 -- 0x10FFFF and character size
121 is 1 byte.
122
123 --case-insensitive
124 Treat single-quoted and double-quoted strings as case-insensi‐
125 tive.
126
127 --case-inverted
128 Invert the meaning of single-quoted and double-quoted strings:
129 treat single-quoted strings as case-sensitive and double-quoted
130 strings as case-insensitive.
131
132 --no-generation-date
133 Suppress date output in the generated file.
134
135 --no-lookahead
136 Use TDFA(0) instead of TDFA(1). This option only has effect
137 with --tags or --posix-captures options.
138
139 --no-optimize-tags
140 Suppress optimization of tag variables (useful for debugging or
141 benchmarking).
142
143 --no-version
144 Suppress version output in the generated file.
145
146 --encoding-policy POLICY
147 Define the way re2c treats Unicode surrogates. POLICY can be
148 one of the following: fail (abort with an error when a surrogate
149 is encountered), substitute (silently replace surrogates with
150 the error code point 0xFFFD), ignore (default, treat surrogates
151 as normal code points). The Unicode standard says that stand‐
152 alone surrogates are invalid, but real-world libraries and pro‐
153 grams behave in different ways.
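The three policies can be modelled as a function over code points. This is only a sketch of the described behavior, not re2c's implementation; the enum and function names are hypothetical, and the surrogate range 0xD800 -- 0xDFFF comes from the Unicode standard.

```c
#include <assert.h>

enum policy { POLICY_IGNORE, POLICY_SUBSTITUTE, POLICY_FAIL };

/* Returns the code point to use, or -1 to signal an error under POLICY_FAIL. */
static long apply_policy(enum policy p, long cp)
{
    int is_surrogate = (cp >= 0xD800 && cp <= 0xDFFF);
    assert(cp >= 0 && cp <= 0x10FFFF);
    if (!is_surrogate) return cp;          /* ordinary code points pass through */
    switch (p) {
    case POLICY_FAIL:       return -1;     /* abort with an error */
    case POLICY_SUBSTITUTE: return 0xFFFD; /* replace with the error code point */
    case POLICY_IGNORE:     return cp;     /* treat as a normal code point */
    }
    return cp;
}
```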
154
155 --input INPUT
156 Specify re2c input API. INPUT can be either default or custom
157 (enables the use of generic API).
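With --input custom the lexer touches the input only through primitives such as YYPEEK, YYSKIP, YYBACKUP, YYRESTORE and YYLESSTHAN. One possible set of definitions over an index-based buffer is sketched below; the struct input, its fields, and the pointer name in are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char YYCTYPE;

/* Hypothetical input abstraction: positions are indices, not pointers. */
struct input {
    const YYCTYPE *data;
    size_t len;
    size_t pos;    /* current position */
    size_t mark;   /* position saved by YYBACKUP */
};

static void input_init(struct input *in, const YYCTYPE *data, size_t len)
{
    assert(in != NULL && data != NULL);
    in->data = data;
    in->len = len;
    in->pos = 0;
    in->mark = 0;
}

/* All macros assume a variable named in is in scope and that pos <= len. */
#define YYPEEK()       (in->pos < in->len ? in->data[in->pos] : 0)
#define YYSKIP()       (++in->pos)
#define YYBACKUP()     (in->mark = in->pos)
#define YYRESTORE()    (in->pos = in->mark)
#define YYLESSTHAN(n)  (in->len - in->pos < (size_t)(n))
```

Because the lexer never dereferences a pointer directly, the same generated code could in principle read from a file, a rope, or a network buffer by swapping these definitions.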
158
159 -S --skeleton
160 Ignore user-defined interface code and generate a self-contained
161 "skeleton" program. Additionally, generate input files with
162 strings derived from the regular grammar and compressed match
163 results that are used to verify "skeleton" behavior on all
164 inputs. This option is useful for finding bugs in optimizations
165 and code generation.
166
167 --empty-class POLICY
168 Define the way re2c treats empty character classes. POLICY can
169 be one of the following: match-empty (match empty input: illogi‐
170 cal, but default behavior for backwards compatibility reasons),
171 match-none (fail to match on any input), error (compilation
172 error).
173
174 --dfa-minimization ALGORITHM
175 The internal algorithm used by re2c to minimize the DFA. ALGO‐
176 RITHM can be either moore (Moore algorithm, the default) or ta‐
177 ble (table filling algorithm). Both algorithms should produce
178 the same DFA up to states relabeling; table filling is much
179 slower and serves as a reference implementation.
180
181 --eager-skip
182 Make the generated lexer advance the input position "eagerly":
183 immediately after reading input symbol. By default this happens
184 after transition to the next state. Implied by --no-lookahead.
185
186 --dump-nfa
187 Generate representation of NFA in DOT format and dump it on
188 stderr.
189
190 --dump-dfa-raw
191 Generate representation of DFA in DOT format under construction
192 and dump it on stderr.
193
194 --dump-dfa-det
195 Generate representation of DFA in DOT format immediately after
196 determinization and dump it on stderr.
197
198 --dump-dfa-tagopt
199 Generate representation of DFA in DOT format after tag optimiza‐
200 tions and dump it on stderr.
201
202 --dump-dfa-min
203 Generate representation of DFA in DOT format after minimization
204 and dump it on stderr.
205
206 --dump-adfa
207 Generate representation of DFA in DOT format after tunneling and
208 dump it on stderr.
209
210 -1 --single-pass
211 Deprecated. Does nothing (single pass is the default now).
212
213 -W Turn on all warnings.
214
215 -Werror
216 Turn warnings into errors. Note that this option alone doesn't
217 turn on any warnings; it only affects those warnings that have
218 been turned on so far or will be turned on later.
219
220 -W<warning>
221 Turn on warning.
222
223 -Wno-<warning>
224 Turn off warning.
225
226 -Werror-<warning>
227 Turn on warning and treat it as an error (this implies -W<warn‐
228 ing>).
229
230 -Wno-error-<warning>
231 Don't treat this particular warning as an error. This doesn't
232 turn off the warning itself.
233
234 -Wcondition-order
235 Warn if the generated program makes implicit assumptions about
236 condition numbering. One should use either the -t, --type-header
237 option or the /*!types:re2c*/ directive to generate a mapping of
238 condition names to numbers and then use the autogenerated condi‐
239 tion names.
240
241 -Wempty-character-class
242 Warn if a regular expression contains an empty character class.
243 Trying to match an empty character class makes no sense: it
244 should always fail. However, for backwards compatibility rea‐
245 sons re2c allows empty character classes and treats them as
246 empty strings. Use the --empty-class option to change the
247 default behavior.
248
249 -Wmatch-empty-string
250 Warn if a rule is nullable (matches an empty string). If the
251 lexer runs in a loop and the empty match is unintentional, the
252 lexer may unexpectedly hang in an infinite loop.
253
254 -Wswapped-range
255 Warn if the lower bound of a range is greater than its upper
256 bound. The default behavior is to silently swap the range
257 bounds.
258
259 -Wundefined-control-flow
260 Warn if some input strings cause undefined control flow in the
261 lexer (the faulty patterns are reported). This is the most dan‐
262 gerous and most common mistake. It can be easily fixed by adding
263 the default rule * which has the lowest priority, matches any
264 code unit, and consumes exactly one code unit.
265
266 -Wunreachable-rules
267 Warn about rules that are shadowed by other rules and will never
268 match.
269
270 -Wuseless-escape
271 Warn if a symbol is escaped when it shouldn't be. By default,
272 re2c silently ignores such escapes, but this may as well indi‐
273 cate a typo or an error in the escape sequence.
274
275 -Wnondeterministic-tags
276 Warn if a tag has n-th degree of nondeterminism, where n is
277 greater than 1.
278
280 Below is the list of all symbols which may be used by the lexer in
281 order to interact with the outer world. These symbols should be
282 defined by the user, either in the form of inplace configurations, or
283 as C/C++ variables, functions, macros and other language constructs.
284 Which primitives are necessary depends on the particular use case.
285
286 yyaccept
287 L-value of unsigned integral type that is used to hold the num‐
288 ber of the last matched rule. Explicit definition by the user
289 is necessary only with -f --storable-state option.
290
291 YYBACKUP ()
292 Backup current input position (used only with --input custom
293 option).
294
295 YYBACKUPCTX ()
296 Backup current input position for trailing context (used only
297 with --input custom option).
298
299 yych L-value of type YYCTYPE that is used to hold current input char‐
300 acter. Explicit definition by the user is necessary only with
301 -f --storable-state option.
302
303 YYCONDTYPE
304 The type of condition identifiers (used only with -c --condi‐
305 tions option). Should be generated either with /*!types:re2c*/
306 directive, or with -t --type-header option.
307
308 YYCTXMARKER
309 L-value of type YYCTYPE * that is used to backup input position
310 of trailing context. It is needed only if regular expressions
311 use the lookahead operator /.
312
313 YYCTYPE
314 The type of the input characters (code units). Usually it
315 should be unsigned char for ASCII, EBCDIC and UTF-8 encodings,
316 unsigned short for UTF-16 or UCS-2 encodings, and unsigned int
317 for UTF-32 encoding.
318
319 YYCURSOR
320 L-value of type YYCTYPE * that is used as a pointer to the cur‐
321 rent input symbol. Initially YYCURSOR points to the first char‐
322 acter and is advanced by the lexer during matching. When a rule
323 matches, YYCURSOR points past the last character of the matched
324 string.
325
326 YYDEBUG (state, symbol)
327 A function-like primitive that is used to dump debug information
328 (only used with -d --debug-output option). YYDEBUG should
329 return no value and accept two arguments: state (either lexer
330 state or -1) and symbol (current input symbol).
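One way to satisfy this contract is a small function wrapped in a macro, sketched below. The helper names format_debug and lexer_debug and the trace format are hypothetical, and YYCTYPE is assumed to be unsigned char.

```c
#include <assert.h>
#include <stdio.h>

typedef unsigned char YYCTYPE;

/* Hypothetical helper: formats one trace record into out and
 * returns the number of characters written (excluding the terminator). */
static int format_debug(char *out, size_t size, int state, YYCTYPE symbol)
{
    assert(out != NULL && size > 0);
    return snprintf(out, size, "state=%d symbol=0x%02X", state, (unsigned int)symbol);
}

/* Returns no value and takes two arguments, as YYDEBUG requires. */
static void lexer_debug(int state, YYCTYPE symbol)
{
    char line[64];
    format_debug(line, sizeof line, state, symbol);
    fprintf(stderr, "%s\n", line);
}

#define YYDEBUG(state, symbol) lexer_debug((state), (symbol))
```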
331
332 YYFILL (n)
333 A function-like primitive that is called by the lexer when there
334 is not enough input. YYFILL should return no value and supply
335 at least n additional characters. Maximal value of n equals
336 YYMAXFILL, which can be obtained with the /*!max:re2c*/ direc‐
337 tive.
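A common way to implement YYFILL is to shift the unconsumed tail of the buffer to the front and append fresh input, padding with YYMAXFILL extra characters once the source is exhausted. The sketch below reads from an in-memory string instead of a file so that it stays self-contained; struct input, its fields, the tiny BUFSIZE, the fixed YYMAXFILL value, and the early return in the YYFILL macro are all assumptions made for this example.

```c
#include <assert.h>
#include <string.h>

#define BUFSIZE   8   /* hypothetical, tiny on purpose */
#define YYMAXFILL 2   /* normally generated by the max:re2c directive */

typedef unsigned char YYCTYPE;

struct input {
    YYCTYPE buf[BUFSIZE + YYMAXFILL];
    YYCTYPE *cur;      /* plays the role of YYCURSOR */
    YYCTYPE *lim;      /* plays the role of YYLIMIT */
    YYCTYPE *tok;      /* start of the current lexeme; bytes before it are free */
    const char *src;   /* in-memory source standing in for a file */
    size_t src_left;
    int eof;
};

static void input_init(struct input *in, const char *src, size_t len)
{
    in->cur = in->lim = in->tok = in->buf + BUFSIZE;  /* empty: first fill loads data */
    in->src = src;
    in->src_left = len;
    in->eof = 0;
}

/* Returns 0 on success, 1 if need more characters cannot be supplied. */
static int fill(struct input *in, size_t need)
{
    size_t shift, keep, n;
    assert(need <= YYMAXFILL);
    if (in->eof) return 1;
    shift = (size_t)(in->tok - in->buf);
    if (shift < need) return 1;               /* lexeme too long for the buffer */
    keep = (size_t)(in->lim - in->tok);
    memmove(in->buf, in->tok, keep);          /* move unconsumed tail to the front */
    in->lim -= shift;
    in->cur -= shift;
    in->tok -= shift;
    n = in->src_left < shift ? in->src_left : shift;
    memcpy(in->lim, in->src, n);              /* append fresh input */
    in->src += n;
    in->src_left -= n;
    in->lim += n;
    if (n < shift) {                          /* source exhausted: add padding */
        in->eof = 1;
        memset(in->lim, 0, YYMAXFILL);
        in->lim += YYMAXFILL;
    }
    return 0;
}

/* Assumes a variable named in is in scope and that the lexer returns int. */
#define YYFILL(n) do { if (fill(in, (n)) != 0) return -1; } while (0)
```

The padding guarantees that the final check of YYCURSOR against YYLIMIT still succeeds for the last real characters, at the cost of YYMAXFILL spare bytes in the buffer.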
338
339 YYGETCONDITION ()
340 R-value of type YYCONDTYPE that represents current condition
341 identifier (used only with -c --conditions option).
342
343 YYGETSTATE ()
344 R-value of signed integral type that represents current lexer
345 state (used only with -f --storable-state option). Initial
346 value of lexer state should be -1.
347
348 YYLESSTHAN (n)
349 R-value of boolean type that is true if and only if there are
350 fewer than n input characters left (used only with --input custom
351 option).
352
353 YYLIMIT
354 R-value of type YYCTYPE * that marks the end of input
355 (YYLIMIT[-1] should be the last input character). The lexer compares
356 YYCURSOR and YYLIMIT in order to determine whether there are
357 enough input characters left.
358
359 YYMARKER
360 L-value of type YYCTYPE * used to backup input position of suc‐
361 cessful match. This might be necessary if there is an overlap‐
362 ping longer rule that might also match.
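The role of YYMARKER can be seen in a hand-written matcher for two overlapping rules, "a" and "abc". This is an illustrative sketch rather than generated code; the function name lex, the rule numbering, and the NUL-terminated input are assumptions.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char YYCTYPE;

/* Returns 1 for rule "a", 2 for rule "abc", 0 for no match.
 * On return *cursor points past the matched string.
 * The input is assumed to be NUL-terminated. */
static int lex(const YYCTYPE **cursor)
{
    const YYCTYPE *YYCURSOR, *YYMARKER;
    int result = 0;
    assert(cursor != NULL && *cursor != NULL);
    YYCURSOR = *cursor;
    if (*YYCURSOR != 'a') goto done;
    ++YYCURSOR;
    YYMARKER = YYCURSOR;        /* "a" has matched: remember this position */
    result = 1;
    if (*YYCURSOR != 'b') goto done;
    ++YYCURSOR;
    if (*YYCURSOR != 'c') {     /* the longer rule failed: roll back */
        YYCURSOR = YYMARKER;
        goto done;
    }
    ++YYCURSOR;
    result = 2;
done:
    *cursor = YYCURSOR;
    return result;
}
```

On the input "abx" the matcher reads past "ab" before discovering that "abc" cannot match, so it restores the position saved after "a" and reports the shorter rule.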
363
364 YYMTAGP (t)
365 Append current input position to the history of m-tag t (used
366 only with -T --tags option).
367
368 YYMTAGN (t)
369 Append default value to the history of m-tag t (used only with
370 -T --tags option).
371
372 YYMAXFILL
373 Integral constant that denotes maximal value of YYFILL argument
374 and is autogenerated by /*!max:re2c*/ directive.
375
376 YYMAXNMATCH
377 Integral constant that denotes maximal number of capturing
378 groups in a rule and is autogenerated by /*!maxnmatch:re2c*/
379 directive (used only with --posix-captures option).
380
381 yynmatch
382 L-value of unsigned integral type that is used to hold the num‐
383 ber of capturing groups in the matching rule. Used only with -P
384 --posix-captures option.
385
386 YYPEEK ()
387 R-value of type YYCTYPE that denotes current input character
388 (used only with --input custom option).
389
390 yypmatch
391 An array of l-values that are used to hold the values of s-tags
392 corresponding to the capturing parentheses in the matching rule.
393 The length of array must be at least yynmatch * 2 (ideally
394 YYMAXNMATCH * 2). Used only with -P --posix-captures option.
395
396 YYRESTORE ()
397 Restore input position (used only with --input custom option).
398
399 YYRESTORECTX ()
400 Restore input position from the value of trailing context (used
401 only with --input custom option).
402
403 YYRESTORETAG (t)
404 Restore input position from the value of s-tag t (used only with
405 --input custom option).
406
407 YYSETCONDITION (condition)
408 Set current condition identifier to condition (used only with -c
409 --conditions option).
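A minimal sketch of the condition primitives, with the enumeration written by hand for illustration (normally it would be generated). The condition names init and comment, the struct lexer, and the pointer name lx are hypothetical; the yyc prefix follows the default of re2c:condenumprefix.

```c
#include <assert.h>
#include <stddef.h>

/* Normally generated by the types:re2c directive or by -t; hand-written here. */
enum YYCONDTYPE { yycinit, yyccomment };

struct lexer {
    enum YYCONDTYPE cond;   /* current condition */
};

static void lexer_init(struct lexer *lx)
{
    assert(lx != NULL);
    lx->cond = yycinit;
}

/* Both macros assume a variable named lx is in scope inside the lexer. */
#define YYGETCONDITION()   (lx->cond)
#define YYSETCONDITION(c)  (lx->cond = (c))
```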
410
411 YYSETSTATE (state)
412 Set current lexer state to state (used only with -f
413 --storable-state option). Parameter state is of signed integral
414 type.
415
416 YYSKIP ()
417 Advance input position to the next character (used only with
418 generic API).
419
420 YYSTAGP (t)
421 Save current input position to s-tag t (used only with -T --tags
422 and --input custom options).
423
424 YYSTAGN (t)
425 Save default value to s-tag t (used only with -T --tags and
426 --input custom options).
427
429 A program can contain any number of re2c blocks. Each block consists
430 of a sequence of RULES, NAMED DEFINITIONS and INPLACE CONFIGURATIONS.
431
432 RULES
433 Rules consist of a regular expression followed by a user-defined
434 action: a block of C/C++ code that is executed in case of a successful
435 match. An action can be either an arbitrary block of code enclosed in
436 curly braces { and } or a block of code without curly braces preceded
437 by := and ended with a newline that is not followed by whitespace.
438
439 If multiple rules match, re2c prefers the longest match. If rules
440 match the same string, the earlier rule has priority.
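The two tie-breaking rules can be modelled as a selection over per-rule match lengths. This is only a model of the stated policy: re2c resolves priorities when it builds the DFA, not at run time. The function name pick_rule and the NO_MATCH sentinel are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NO_MATCH ((size_t)-1)

/* lens[i] is the length matched by rule i (rules in source order),
 * or NO_MATCH if rule i does not match at all.
 * Returns the index of the winning rule, or NO_MATCH if none matches. */
static size_t pick_rule(const size_t *lens, size_t nrules)
{
    size_t best = NO_MATCH, i;
    assert(lens != NULL);
    for (i = 0; i < nrules; ++i) {
        if (lens[i] == NO_MATCH) continue;
        /* strict '>' keeps the earlier rule when lengths are equal */
        if (best == NO_MATCH || lens[i] > lens[best]) best = i;
    }
    return best;
}
```

For example, with a keyword rule "if" listed before an identifier rule [a-z]+, the input "if" gives lengths {2, 2} and the keyword wins, while "iffy" gives {2, 4} and the identifier wins.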
441
442 There is one special kind of rule: the default rule with * instead of
443 the regular expression. It always has the lowest priority, matches any
444 code unit (either valid or invalid) and consumes exactly one code unit.
445 Note that the default rule is not the same as [^], which matches any valid
446 code point and can consume multiple code units. With variable-length
447 encodings, * is the only way to match an invalid input
448 character.
449
450 If -c --conditions option is used, then rules have more complex form
451 described in the section about conditions.
452
453 NAMED DEFINITIONS
454 Named definitions are of the form name = regexp ; where name is an
455 identifier that consists of letters, digits and underscores, and regexp
456 is a regular expression. With -F --flex-syntax option named defini‐
457 tions are also of the form name regexp. Each name should be defined
458 before it is used.
459
460 INPLACE CONFIGURATIONS
461 re2c:cgoto:threshold = 9;
462 With -g --computed-gotos option this value specifies the com‐
463 plexity threshold that triggers the generation of jump tables
464 rather than nested if statements and bit masks.
465
466 re2c:cond:divider = '/* *********************************** */';
467 Customizes the divider for condition blocks. One can
468 use @@ to insert condition name.
469
470 re2c:cond:divider@cond = @@;
471 Specifies the placeholder that will be replaced with condition
472 name in re2c:cond:divider.
473
474 re2c:condenumprefix = yyc;
475 Specifies the prefix used for condition identifiers.
476
477 re2c:cond:goto@cond = @@;
478 Specifies the placeholder that will be replaced with condition
479 label in re2c:cond:goto.
480
481 re2c:cond:goto = 'goto @@;';
482 Customizes the goto statements used with :=> style rules.
483 One can use @@ to insert the condition name.
484
485 re2c:condprefix = yyc;
486 Specifies the prefix used for condition labels.
487
488 re2c:define:YYBACKUPCTX = 'YYBACKUPCTX';
489 Replaces YYBACKUPCTX identifier with the specified string.
490
491 re2c:define:YYBACKUP = 'YYBACKUP';
492 Replaces YYBACKUP identifier with the specified string.
493
494 re2c:define:YYCONDTYPE = 'YYCONDTYPE';
495 Enumeration type used for condition identifiers.
496
497 re2c:define:YYCTXMARKER = 'YYCTXMARKER';
498 Replaces the YYCTXMARKER placeholder with the specified identi‐
499 fier.
500
501 re2c:define:YYCTYPE = 'YYCTYPE';
502 Replaces the YYCTYPE placeholder with the specified type.
503
504 re2c:define:YYCURSOR = 'YYCURSOR';
505 Replaces the YYCURSOR placeholder with the specified identifier.
506
507 re2c:define:YYDEBUG = 'YYDEBUG';
508 Replaces the YYDEBUG placeholder with the specified identifier.
509
510 re2c:define:YYFILL@len = '@@';
511 Any occurrence of this text inside of a YYFILL will be replaced
512 with the actual argument.
513
514 re2c:define:YYFILL:naked = 0;
515 Controls the argument in the parentheses after YYFILL and the
516 following semicolon. If zero (the default), the argument is generated
517 unless re2c:yyfill:parameter is set to zero, and the semicolon is
518 generated unconditionally. If non-zero, both the argument and the
519 semicolon are omitted.
520
521 re2c:define:YYFILL = 'YYFILL';
522 Define a substitution for YYFILL. By default re2c generates an
523 argument in parentheses and a semicolon after YYFILL. If you
524 need to make YYFILL an arbitrary statement rather than a call,
525 set re2c:define:YYFILL:naked to a non-zero value.
526
527 re2c:define:YYGETCONDITION:naked = 0;
528 Controls the parentheses after YYGETCONDITION. If zero (the default),
529 the parentheses are generated. If non-zero, the parentheses are
530 omitted.
531
532 re2c:define:YYGETCONDITION = 'YYGETCONDITION';
533 Substitution for YYGETCONDITION. By default re2c generates
534 parentheses after YYGETCONDITION. Set re2c:define:YYGETCONDI‐
535 TION:naked to non-zero in order to omit the parentheses.
536
537 re2c:define:YYGETSTATE:naked = 0;
538 Controls the parentheses that follow YYGETSTATE. If zero (the default),
539 the parentheses are generated. If non-zero, they are omitted.
540
541 re2c:define:YYGETSTATE = 'YYGETSTATE';
542 Substitution for YYGETSTATE. By default re2c generates paren‐
543 theses after YYGETSTATE. Set re2c:define:YYGETSTATE:naked to
544 non-zero to omit the parentheses.
545
546 re2c:define:YYLESSTHAN = 'YYLESSTHAN';
547 Replaces YYLESSTHAN identifier with the specified string.
548
549 re2c:define:YYLIMIT = 'YYLIMIT';
550 Replaces the YYLIMIT placeholder with the specified identifier.
551
552 re2c:define:YYMARKER = 'YYMARKER';
553 Replaces the YYMARKER placeholder with the specified identifier.
554
555 re2c:define:YYMTAGN = 'YYMTAGN';
556 Replaces YYMTAGN identifier with the specified string.
557
558 re2c:define:YYMTAGP = 'YYMTAGP';
559 Replaces YYMTAGP identifier with the specified string.
560
561 re2c:define:YYPEEK = 'YYPEEK';
562 Replaces YYPEEK identifier with the specified string.
563
564 re2c:define:YYRESTORECTX = 'YYRESTORECTX';
565 Replaces YYRESTORECTX identifier with the specified string.
566
567 re2c:define:YYRESTORE = 'YYRESTORE';
568 Replaces YYRESTORE identifier with the specified string.
569
570 re2c:define:YYRESTORETAG = 'YYRESTORETAG';
571 Replaces YYRESTORETAG identifier with the specified string.
572
573 re2c:define:YYSETCONDITION@cond = '@@';
574 Any occurrence of this text inside of YYSETCONDITION will be
575 replaced with the actual argument.
576
577 re2c:define:YYSETCONDITION:naked = 0;
578 Controls the argument in parentheses and the semicolon after
579 YYSETCONDITION. If zero (the default), both the argument and the
580 semicolon are generated. If non-zero, both the argument and the
581 semicolon are omitted.
582
583 re2c:define:YYSETCONDITION = 'YYSETCONDITION';
584 Substitution for YYSETCONDITION. By default re2c generates an
585 argument in parentheses followed by semicolon after YYSETCONDI‐
586 TION. If you need to make YYSETCONDITION an arbitrary statement
587 rather than a call, set re2c:define:YYSETCONDITION:naked to
588 non-zero.
589
590 re2c:define:YYSETSTATE:naked = 0;
591 Controls the argument in parentheses and the semicolon after
592 YYSETSTATE. If zero (the default), both the argument and the semicolon
593 are generated. If non-zero, both the argument and the semicolon are
594 omitted.
595
596 re2c:define:YYSETSTATE@state = '@@';
597 Any occurrence of this text inside of YYSETSTATE will be
598 replaced with the actual argument.
599
600 re2c:define:YYSETSTATE = 'YYSETSTATE';
601 Substitution for YYSETSTATE. By default re2c generates an argu‐
602 ment in parentheses followed by a semicolon after YYSETSTATE. If
603 you need to make YYSETSTATE an arbitrary statement rather than a
604 call, set re2c:define:YYSETSTATE:naked to non-zero.
605
606 re2c:define:YYSKIP = 'YYSKIP';
607 Replaces YYSKIP identifier with the specified string.
608
609 re2c:define:YYSTAGN = 'YYSTAGN';
610 Replaces YYSTAGN identifier with the specified string.
611
612 re2c:define:YYSTAGP = 'YYSTAGP';
613 Replaces YYSTAGP identifier with the specified string.
614
615 re2c:flags:8 or re2c:flags:utf-8
616 Same as -8 --utf-8 command-line option.
617
618 re2c:flags:b or re2c:flags:bit-vectors
619 Same as -b --bit-vectors command-line option.
620
621 re2c:flags:case-insensitive = 0;
622 Same as --case-insensitive command-line option.
623
624 re2c:flags:case-inverted = 0;
625 Same as --case-inverted command-line option.
626
627 re2c:flags:d or re2c:flags:debug-output
628 Same as -d --debug-output command-line option.
629
630 re2c:flags:dfa-minimization = 'moore';
631 Same as --dfa-minimization command-line option.
632
633 re2c:flags:eager-skip = 0;
634 Same as --eager-skip command-line option.
635
636 re2c:flags:e or re2c:flags:ecb
637 Same as -e --ecb command-line option.
638
639 re2c:flags:empty-class = 'match-empty';
640 Same as --empty-class command-line option.
641
642 re2c:flags:encoding-policy = 'ignore';
643 Same as --encoding-policy command-line option.
644
645 re2c:flags:g or re2c:flags:computed-gotos
646 Same as -g --computed-gotos command-line option.
647
648 re2c:flags:i or re2c:flags:no-debug-info
649 Same as -i --no-debug-info command-line option.
650
651 re2c:flags:input = 'default';
652 Same as --input command-line option.
653
654 re2c:flags:lookahead = 1;
655 Same as inverted --no-lookahead command-line option.
656
657 re2c:flags:optimize-tags = 1;
658 Same as inverted --no-optimize-tags command-line option.
659
660 re2c:flags:P or re2c:flags:posix-captures
661 Same as -P --posix-captures command-line option.
662
663 re2c:flags:s or re2c:flags:nested-ifs
664 Same as -s --nested-ifs command-line option.
665
666 re2c:flags:T or re2c:flags:tags
667 Same as -T --tags command-line option.
668
669 re2c:flags:u or re2c:flags:unicode
670 Same as -u --unicode command-line option.
671
672 re2c:flags:w or re2c:flags:wide-chars
673 Same as -w --wide-chars command-line option.
674
675 re2c:flags:x or re2c:flags:utf-16
676 Same as -x --utf-16 command-line option.
677
678 re2c:indent:string = '\t';
679 Specifies the string to use for indentation. Requires a string
680 that contains only whitespace (unless you need something else
681 for external tools). The easiest way to specify spaces is to
682 enclose them in single or double quotes. If you do not want
683 any indentation at all, you can set this to ''.
684
685 re2c:indent:top = 0;
686 Specifies the minimum amount of indentation to use. Requires a
687 numeric value greater than or equal to zero.
688
689 re2c:labelprefix = 'yy';
690 Changes the prefix of numbered labels. The default is
691 yy. Can be set to any string that is valid in a label name.
692
693 re2c:label:yyFillLabel = 'yyFillLabel';
694 Overrides the name of the yyFillLabel label.
695
696 re2c:label:yyNext = 'yyNext';
697 Overrides the name of the yyNext label.
698
699 re2c:startlabel = 0;
700 If set to a non-zero integer, then the start label of the next
701 scanner block will be generated even if it isn't used by the
702 scanner itself. Otherwise, the normal yy0-like start label is
703 only generated if needed. If set to a text value, then a label
704 with that text will be generated regardless of whether the nor‐
705 mal start label is used or not. This setting is reset to 0 after
706 a start label has been generated.
707
708 re2c:state:abort = 0;
709 When not zero and the -f --storable-state switch is active, then
710 the YYGETSTATE block will contain a default case that aborts and
711 a -1 case will be used for initialization.
712
713 re2c:state:nextlabel = 0;
714 Used when -f --storable-state is active to control whether the
715 YYGETSTATE block is followed by a yyNext: label line. Instead
716 of using yyNext, you can usually also use configuration startla‐
717 bel to force a specific start label or default to yy0 as a start
718 label. Instead of using a dedicated label, it is often better to
719 separate the YYGETSTATE code from the actual scanner code by
720 placing a /*!getstate:re2c*/ comment.
721
722 re2c:tags:expression = '@@';
723 Customizes the way re2c addresses tag variables: by
724 default it emits expressions of the form yyt<N>, but this might
725 be inconvenient if tag variables are defined as fields in a
726 struct, or for any other reason require special accessors. For
727 example, setting re2c:tags:expression = p->@@ will result in
728 p->yyt<N>.
729
730 re2c:tags:prefix = 'yyt';
731 Overrides the prefix of tag variables.
732
733 re2c:variable:yyaccept = yyaccept;
734 Overrides the name of the yyaccept variable.
735
736 re2c:variable:yybm = 'yybm';
737 Overrides the name of the yybm variable.
738
739 re2c:variable:yych = 'yych';
740 Overrides the name of the yych variable.
741
742 re2c:variable:yyctable = 'yyctable';
743 When both -c --conditions and -g --computed-gotos are active,
744 re2c will use this variable to generate a static jump table for
745 YYGETCONDITION.
746
747 re2c:variable:yystable = 'yystable';
748 Deprecated.
749
750 re2c:variable:yytarget = 'yytarget';
751 Overrides the name of the yytarget variable.
752
753 re2c:yybm:hex = 0;
754 If set to zero, a decimal table will be used. Otherwise, a hexa‐
755 decimal table will be generated.
756
757 re2c:yych:conversion = 0;
758 When this setting is non-zero, re2c automatically generates conversion
759 code whenever yych gets read. In this case, the type
760 must be defined using re2c:define:YYCTYPE.
761
762 re2c:yych:emit = 1;
763 Set this to zero to suppress the generation of yych.
764
765 re2c:yyfill:check = 1;
766 This can be set to 0 to suppress the generation of YYCURSOR and
767 YYLIMIT based precondition checks. This option is useful when
768 YYLIMIT + YYMAXFILL is always accessible.
769
770 re2c:yyfill:enable = 1;
771 Set this to zero to suppress the generation of YYFILL (n). When
772 using this, be sure to verify that the generated scanner does
773 not read beyond the available input, as allowing such behavior
774 might introduce severe security issues to your programs.
775
776 re2c:yyfill:parameter = 1;
777 Controls the argument in the parentheses that follow YYFILL. If
778 zero, the argument is omitted. If non-zero, the argument is
779 generated unless re2c:define:YYFILL:naked is set to non-zero.
780
781 REGULAR EXPRESSIONS
782 re2c uses the following syntax for regular expressions:
783
784 · "foo" case-sensitive string literal
785
786 · 'foo' case-insensitive string literal
787
788 · [a-xyz], [^a-xyz] character class (possibly negated)
789
790 · . any character except newline
791
792 · R \ S difference of character classes R and S
793
794 · R* zero or more occurrences of R
795
796 · R+ one or more occurrences of R
797
798 · R? optional R
799
800 · R{n} repetition of R exactly n times
801
802 · R{n,} repetition of R at least n times
803
804 · R{n,m} repetition of R from n to m times
805
806 · (R) just R; parentheses are used to override precedence or for
807 POSIX-style submatch
808
809 · R S concatenation: R followed by S
810
811 · R | S alternative: R or S
812
813 · R / S lookahead: R followed by S, but S is not consumed
814
815 · name the regular expression defined as name (or literal string "name"
816 in Flex compatibility mode)
817
818 · {name} the regular expression defined as name in Flex compatibility
819 mode
820
821 · @stag an s-tag: saves the last input position at which @stag matches
822 in a variable named stag
823
824 · #mtag an m-tag: saves all input positions at which #mtag matches in a
825 variable named mtag
826
827 Character classes and string literals may contain the following escape
828 sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and hexa‐
829 decimal escapes \xhh, \uhhhh and \Uhhhhhhhh.
830
832 re2c supports two kinds of submatch extraction.
833
834 The first option is -P --posix-captures: it enables POSIX-compliant
835 capturing groups. In this mode parentheses in regular expressions
836 denote the beginning and the end of capturing groups; the whole regular
837 expression is group number zero. The number of groups for the matching
838 rule is stored in a variable yynmatch, and submatch results are stored
839 in yypmatch array. Both yynmatch and yypmatch should be defined by the
840 user; note that yypmatch size must be at least [yynmatch * 2]. re2c
841 provides a directive /*!maxnmatch:re2c*/ that defines a constant
842 YYMAXNMATCH: the maximal value of yynmatch among all rules. Note that
843 re2c implements POSIX-compliant disambiguation: each subexpression
844 matches as long as possible, and subexpressions that start earlier in
845 regular expression have priority over those starting later.
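A sketch of how the results might be read back after a match. It assumes that yypmatch stores groups as consecutive pairs, with yypmatch[2 * i] the start and yypmatch[2 * i + 1] the end of group i, and that a group which did not take part in the match is marked with NULL; neither detail is stated above. The helper name copy_group and the fixed YYMAXNMATCH value are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define YYMAXNMATCH 3   /* normally generated by the maxnmatch:re2c directive */

/* Copies capture group i into out as a NUL-terminated string (truncating
 * if necessary). Returns the copied length, or -1 if the group is unset. */
static int copy_group(const char **yypmatch, size_t yynmatch,
                      size_t i, char *out, size_t size)
{
    const char *b, *e;
    size_t n;
    assert(yypmatch != NULL && out != NULL);
    assert(i < yynmatch && size > 0);
    b = yypmatch[2 * i];        /* assumed: start of group i */
    e = yypmatch[2 * i + 1];    /* assumed: end of group i */
    if (b == NULL || e == NULL) return -1;
    n = (size_t)(e - b);
    if (n >= size) n = size - 1;
    memcpy(out, b, n);
    out[n] = '\0';
    return (int)n;
}
```

Declaring the array with YYMAXNMATCH * 2 elements, as suggested for yypmatch, lets one array serve every rule in the block.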
846
847 The second option is -T --tags. With this option one can use standalone
848 tags of the form @stag and #mtag instead of capturing parentheses,
849 where stag and mtag are arbitrary user-defined names. Tags can be used
850 anywhere inside of a regular expression; semantically they are just
851 position markers. Tags of the form @stag are called s-tags: they
852 denote a single submatch value (the last input position where this tag
853 matched). Tags of the form #mtag are called m-tags: they denote multi‐
854 ple submatch values (the whole history of repetitions of this tag).
855 All tags should be defined by the user as variables with the corre‐
856 sponding names. With standalone tags re2c uses leftmost greedy disam‐
857 biguation: submatch positions correspond to the leftmost matching path
858 through the regular expression.
859
860 With both --posix-captures and --tags options re2c generates a number
861 of tag variables that are used by the lexer to track multiple possible
862 versions of each tag (multiple versions are caused by possible ambigu‐
863 ity of submatch). When a rule matches, ambiguity is resolved and all
864 tags of this rule (or capturing parentheses, which are also implemented
865 as tags) are initialized with the values of appropriate tag variables.
866 Note that there is no one-to-one correspondence between tag variables
867 and tags: the same tag variable may be reused for different tags, and
868 one tag may require multiple tag variables to hold all its ambiguous
869 versions. The exact number of tag variables is unknown to the user;
870 this number is determined by re2c. However, tag variables should be
871 defined by the user, because it might be necessary to update them in
872 YYFILL and store them between invocations of lexer with
873 --storable-state option. Therefore re2c provides directives
874 /*!stags:re2c ... */ and /*!mtags:re2c ... */ that can be used to
875 declare, initialize and manipulate tag variables.
876
877 S-tags must support the following operations:
878
879 · save input position to s-tag: t = YYCURSOR with default API, or
880 user-defined operation YYSTAGP (t) with generic API
881
882 · save default value to s-tag: t = NULL with default API, or
883 user-defined operation YYSTAGN (t) with generic API
884
885 · copy one s-tag to another: t1 = t2
886
887 M-tags must support the following operations:
888
889 · append input position to m-tag: user-defined operation YYMTAGP (t)
890 with both default and generic API
891
892 · append default value to m-tag: user-defined operation YYMTAGN (t)
893 with both default and generic API
894
895 · copy one m-tag to another: t1 = t2
896
897 S-tags can be implemented as scalar values (pointers or offsets).
898 M-tags need a more complex representation, as they need to store a
899 sequence of tag values. The most naive and inefficient representation
900 of m-tag is a list (array, vector) of tag values; a more efficient rep‐
901 resentation is to store all m-tags in a prefix-tree represented as
902 array of nodes (v, p), where v is tag value and p is a pointer to par‐
903 ent node.
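
The prefix-tree representation described above can be sketched in C roughly as follows. This is only one possible layout, not code taken from re2c: the names mtag_tree_t, mtag_append and mtag_history are hypothetical, the parent "pointer" is stored as an array index, and the fixed capacity is chosen purely to keep the sketch short. An m-tag is then just an int naming the last node of its history, so copying one m-tag to another stays a plain assignment, and YYMTAGP (t) could plausibly be defined as a call like mtag_append(&tree, &t, YYCURSOR).

```c
#include <assert.h>
#include <stddef.h>

#define MTAG_ROOT (-1)      /* empty history; also the initial m-tag value */
#define MTAG_CAPACITY 64    /* fixed size keeps the sketch short */

/* One tree node (v, p): v is a tag value, p refers to the parent node. */
typedef struct {
    const char *value;   /* input position, or NULL for the default value */
    int parent;          /* index of the previous entry, or MTAG_ROOT */
} mtag_node_t;

typedef struct {
    mtag_node_t nodes[MTAG_CAPACITY];
    int count;
} mtag_tree_t;

/* Append value to the history ending at *tag; *tag then names the new node. */
static void mtag_append(mtag_tree_t *tree, int *tag, const char *value)
{
    assert(tree->count < MTAG_CAPACITY);
    tree->nodes[tree->count].value = value;
    tree->nodes[tree->count].parent = *tag;
    *tag = tree->count++;
}

/* Copy the history of tag into out, oldest first; return its length. */
static int mtag_history(const mtag_tree_t *tree, int tag,
                        const char **out, int max)
{
    int len = 0, i;

    for (i = tag; i != MTAG_ROOT; i = tree->nodes[i].parent) ++len;
    assert(len <= max);
    /* Walk from the newest node to the root, filling out back to front. */
    for (i = len; tag != MTAG_ROOT; tag = tree->nodes[tag].parent) {
        out[--i] = tree->nodes[tag].value;
    }
    return len;
}
```

Because two m-tags that diverge after a common prefix share the nodes of that prefix, the tree grows by one node per append no matter how many tag versions are alive.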
904
905 For further details see http://re2c.org/examples/examples.html page on
906 the website or re2c/examples/ subdirectory of re2c distribution.
907
909 With -f --storable-state option re2c generates a lexer that can store
910 its current state, return to the caller, and later resume operations
911 exactly where it left off. The default mode of operation in re2c is a
912 "pull" model, where the lexer "pulls" more input whenever it needs it.
913 However, this mode of operation assumes that the lexer is the owner of
914 the parsing loop, and that may not always be convenient.
915
916 Storable state is useful in exactly such situations: it makes it
917 possible to construct lexers that work in a "push" model, where data is fed to the
918 lexer chunk by chunk. When the lexer needs more input, it stores its
919 state and returns to the caller. Later, when more input becomes avail‐
920 able, it resumes operations exactly where it stopped.
921
922 Changes needed compared to the "pull" model:
923
924 · Define YYGETSTATE () and YYSETSTATE (state).
925
926 · Define yych, yyaccept and state variables as a part of persistent
927 lexer state. state should be initialized to -1.
928
929 · YYFILL should return to the outer program instead of trying to supply
930 more input. Return code should indicate that lexer needs more input.
931
932 · The outer program should recognize situations when lexer needs more
933 input and respond appropriately.
934
935 · Use /*!getstate:re2c*/ directive if it is necessary to execute any
936 code before entering the lexer.
937
938 · Use configurations state:abort and state:nextlabel to tweak the gen‐
939 erated code.
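
To make the control flow of the "push" model concrete, here is a small hand-written C sketch. It is a stand-in for generated code, not re2c output: the automaton (counting runs of lowercase letters, with a NUL byte marking end of input) is written by hand, and the names lexer_t, lex, feed, LEX_NEED_MORE and LEX_END are hypothetical. It also assumes that every chunk is consumed completely before the lexer asks for more, so no buffer shifting is needed.

```c
#include <assert.h>
#include <stddef.h>

enum { LEX_NEED_MORE = 0, LEX_END = 1 };

/* Persistent lexer state that survives between calls. */
typedef struct {
    int state;                    /* -1 before the first call, then resume point */
    unsigned char yych;           /* current symbol; stored as the manual asks,
                                     although this toy automaton never rereads it */
    const unsigned char *cursor;  /* next unread byte of the current chunk */
    const unsigned char *limit;   /* one past the end of the current chunk */
    int words;                    /* number of words seen so far */
} lexer_t;

#define YYGETSTATE()  (lx->state)
#define YYSETSTATE(s) (lx->state = (s))

static int is_letter(unsigned char c) { return c >= 'a' && c <= 'z'; }

/* Run until the chunk is exhausted or the NUL terminator is seen. */
static int lex(lexer_t *lx)
{
    /* Resume where the previous call stopped. */
    switch (YYGETSTATE()) {
    case 1:  goto in_word;
    default: break;               /* -1 (fresh start) or 0 (between words) */
    }

between:
    YYSETSTATE(0);
    if (lx->cursor == lx->limit) return LEX_NEED_MORE;   /* "YYFILL" returns */
    lx->yych = *lx->cursor++;
    if (lx->yych == '\0') return LEX_END;
    if (is_letter(lx->yych)) { ++lx->words; goto in_word; }
    goto between;

in_word:
    YYSETSTATE(1);
    if (lx->cursor == lx->limit) return LEX_NEED_MORE;   /* "YYFILL" returns */
    lx->yych = *lx->cursor++;
    if (lx->yych == '\0') return LEX_END;
    if (is_letter(lx->yych)) goto in_word;
    goto between;
}

/* Outer program: hand the lexer a new chunk and let it continue. */
static int feed(lexer_t *lx, const char *chunk, size_t len)
{
    lx->cursor = (const unsigned char *)chunk;
    lx->limit = lx->cursor + len;
    return lex(lx);
}
```

The caller starts with state set to -1 and calls feed once per chunk while the result is LEX_NEED_MORE. A word that is split across two chunks is counted only once, because the stored state tells the lexer that it was in the middle of a word.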
940
942 Conditions are enabled with -c --conditions. This option makes it possible to
943 encode multiple interrelated lexers within the same re2c block.
944
945 Each lexer corresponds to a single condition. It starts with a label
946 of the form yyc_name, where name is condition name and yyc prefix can
947 be adjusted with configuration re2c:condprefix. Different lexers are
948 separated with a comment /* *********************************** */
949 which can be adjusted with configuration re2c:cond:divider.
950
951 Furthermore, each condition has a unique identifier of the form
952 yycname, where name is condition name and yyc prefix can be adjusted with
953 configuration re2c:condenumprefix. Identifiers have the type
954 YYCONDTYPE and should be generated with /*!types:re2c*/ directive or -t
955 --type-header option. Users shouldn't define these identifiers manu‐
956 ally, as the order of conditions is not specified.
957
958 Before all conditions re2c generates entry code that checks the current
959 condition identifier and transfers control flow to the start label of
960 the active condition. After matching some rule of this condition,
961 lexer may either transfer control flow back to the entry code (after
962 executing the associated action and optionally setting another condi‐
963 tion with =>), or use :=> shortcut and transition directly to the start
964 label of another condition (skipping the action and the entry code).
965 Configuration re2c:cond:goto allows changing the default behavior.
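
The control flow described above can be imitated by hand in plain C. The sketch below is not re2c output; it is a hypothetical two-condition lexer (conditions init and comment) that counts characters outside C-style comments. The enum is written manually here only because no re2c run is involved; in real use it would come from the types directive. The function name count_code_chars is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Condition identifiers; with re2c these would be generated, not hand-written. */
enum YYCONDTYPE { yycinit, yyccomment };

/* Count the characters of s that lie outside C-style comments. */
static size_t count_code_chars(const char *s)
{
    enum YYCONDTYPE cond = yycinit;
    size_t n = 0;

    for (;;) {
        /* Entry code: jump to the start label of the active condition. */
        switch (cond) {
        case yycinit:    goto yyc_init;
        case yyccomment: goto yyc_comment;
        }

yyc_init:
        if (s[0] == '\0') return n;
        if (s[0] == '/' && s[1] == '*') {
            /* In the spirit of "=> comment": switch condition,
               then go back through the entry code. */
            s += 2;
            cond = yyccomment;
            continue;
        }
        ++s;
        ++n;
        continue;

yyc_comment:
        if (s[0] == '\0') return n;
        if (s[0] == '*' && s[1] == '/') {
            /* In the spirit of ":=> init": jump straight to the
               start label of the other condition. */
            s += 2;
            cond = yycinit;
            goto yyc_init;
        }
        ++s;
        continue;
    }
}
```

The two transitions differ only in how control reaches the next condition: one passes through the dispatching switch again, the other bypasses it with a direct goto.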
966
967 Syntactically each rule must be preceded with a list of comma-separated
968 condition names or a wildcard * enclosed in angle brackets < and >.
969 Wildcard means "any condition" and is semantically equivalent to list‐
970 ing all condition names. Here regexp is a regular expression, default
971 refers to the default rule *, and action is a block of C/C++ code.
972
973 · <conditions-or-wildcard> regexp-or-default action
974
975 · <conditions-or-wildcard> regexp-or-default => condition action
976
977 · <conditions-or-wildcard> regexp-or-default :=> condition
978
979 Rules with an exclamation mark ! in front of condition list have a spe‐
980 cial meaning: they have no regular expression, and the associated
981 action is merged as an entry code to actions of normal rules. This
982 might be a convenient place to perform a routine task that is common to
983 all rules.
984
985 · <!conditions-or-wildcard> action
986
987 Another special form of rules with an empty condition list <> and no
988 regular expression allows specifying an "entry condition" that can be
989 used to execute code before entering the lexer. It is semantically
990 equivalent to a condition with number zero, name 0 and an empty regular
991 expression.
992
993 · <> action
994
995 · <> => condition action
996
997 · <> :=> condition
998
1000 re2c supports the following encodings: ASCII (default), EBCDIC (-e),
1001 UCS-2 (-w), UTF-16 (-x), UTF-32 (-u) and UTF-8 (-8). See also in-place
1002 configuration re2c:flags.
1003
1004 The following concepts should be clarified when talking about
1005 encodings. A code point is an abstract number that represents a single
1006 symbol. A code unit is the smallest unit of memory used in the
1007 encoded text (it corresponds to one character in the input stream). One
1008 or more code units may be needed to represent a single code point,
1009 depending on the encoding. In a fixed-length encoding, each code point
1010 is represented with an equal number of code units. In variable-length
1011 encodings, different code points can be represented with different
1012 numbers of code units.
1013
1014 · ASCII is a fixed-length encoding. Its code space includes 0x100 code
1015 points, from 0 to 0xFF. A code point is represented with exactly one
1016 1-byte code unit, which has the same value as the code point. The
1017 size of YYCTYPE must be 1 byte.
1018
1019 · EBCDIC is a fixed-length encoding. Its code space includes 0x100 code
1020 points, from 0 to 0xFF. A code point is represented with exactly one
1021 1-byte code unit, which has the same value as the code point. The
1022 size of YYCTYPE must be 1 byte.
1023
1024 · UCS-2 is a fixed-length encoding. Its code space includes 0x10000
1025 code points, from 0 to 0xFFFF. One code point is represented with
1026 exactly one 2-byte code unit, which has the same value as the code
1027 point. The size of YYCTYPE must be 2 bytes.
1028
1029 · UTF-16 is a variable-length encoding. Its code space includes all
1030 Unicode code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF.
1031 One code point is represented with one or two 2-byte code units. The
1032 size of YYCTYPE must be 2 bytes.
1033
1034 · UTF-32 is a fixed-length encoding. Its code space includes all Uni‐
1035 code code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One
1036 code point is represented with exactly one 4-byte code unit. The size
1037 of YYCTYPE must be 4 bytes.
1038
1039 · UTF-8 is a variable-length encoding. Its code space includes all Uni‐
1040 code code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One
1041 code point is represented with a sequence of one, two, three, or four
1042 1-byte code units. The size of YYCTYPE must be 1 byte.
1043
1044 In Unicode, values from range 0xD800 to 0xDFFF (surrogates) are not
1045 valid Unicode code points. Any encoded sequence of code units that
1046 would map to Unicode code points in the range 0xD800-0xDFFF is
1047 ill-formed. The user can control how re2c treats such ill-formed
1048 sequences with the --encoding-policy <policy> switch.
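
The relation between code points and code units in a variable-length encoding can be illustrated with a small UTF-8 encoder. This is a generic C sketch, not part of re2c; the function name utf8_encode is hypothetical. It produces between one and four 1-byte code units per code point and refuses surrogates and values above 0x10FFFF, which are not valid Unicode code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode code point cp as UTF-8 into out (room for 4 code units).
 * Returns the number of code units written, or 0 if cp is not a valid
 * Unicode code point (a surrogate or a value above 0x10FFFF). */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate range */
    if (cp > 0x10FFFF) return 0;                  /* outside the code space */

    if (cp < 0x80) {                              /* 1 code unit */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                             /* 2 code units */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                           /* 3 code units */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 4 code units */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

For example, U+0041 needs one code unit (0x41), while U+20AC needs three (0xE2 0x82 0xAC).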
1049
1050 For some encodings, there are code units that never occur in a valid
1051 encoded stream (e.g., 0xFF byte in UTF-8). If the generated scanner
1052 must check for invalid input, the only correct way to do so is to use
1053 the default rule (*). Note that the full range rule ([^]) won't catch
1054 invalid code units when a variable-length encoding is used ([^] means
1055 "any valid code point", whereas the default rule (*) means "any possi‐
1056 ble code unit").
1057
1059 By default re2c operates on input using pointer-like primitives YYCUR‐
1060 SOR, YYMARKER, YYCTXMARKER, and YYLIMIT. Normally pointer-like primi‐
1061 tives are defined as variables of type YYCTYPE*, but it is possible to
1062 use STL iterators or any other abstraction as long as it syntactically
1063 fits into the following use cases:
1064
1065 · ++YYCURSOR;
1066
1067 · yych = *YYCURSOR;
1068
1069 · yych = *++YYCURSOR;
1070
1071 · yych = *(YYMARKER = YYCURSOR);
1072
1073 · yych = *(YYMARKER = ++YYCURSOR);
1074
1075 · YYMARKER = YYCURSOR;
1076
1077 · YYMARKER = ++YYCURSOR;
1078
1079 · YYCURSOR = YYMARKER;
1080
1081 · YYCTXMARKER = YYCURSOR + 1;
1082
1083 · YYCURSOR = YYCTXMARKER;
1084
1085 · if (YYLIMIT <= YYCURSOR) ...
1086
1087 · if ((YYLIMIT - YYCURSOR) < n) ...
1088
1089 · YYDEBUG (label, *YYCURSOR);
1090
1091 If this input model is too restrictive, then it is possible to use the
1092 generic input API, enabled with the --input custom option. In this mode all
1093 input operations are expressed in terms of the primitives below. These
1094 primitives can be defined in any suitable way; one doesn't have to
1095 stick to the pointer semantics. For example, it is possible to read
1096 input directly from file without any buffering, or to disable YYFILL
1097 mechanism and perform end-of-input checking on each input character
1098 from inside of YYPEEK or YYSKIP.
1099
1100 · YYPEEK ()
1101
1102 · YYSKIP ()
1103
1104 · YYBACKUP ()
1105
1106 · YYBACKUPCTX ()
1107
1108 · YYSTAGP (t)
1109
1110 · YYSTAGN (t)
1111
1112 · YYMTAGP (t)
1113
1114 · YYMTAGN (t)
1115
1116 · YYRESTORE ()
1117
1118 · YYRESTORECTX ()
1119
1120 · YYRESTORETAG (t)
1121
1122 · YYLESSTHAN (n)
1123
1124 The default input model can be expressed in terms of generic API as follows
1125 (except for YYMTAGP and YYMTAGN, which have no default implementation):
1126
1127 · #define YYPEEK () *YYCURSOR
1128
1129 · #define YYSKIP () ++YYCURSOR
1130
1131 · #define YYBACKUP () YYMARKER = YYCURSOR
1132
1133 · #define YYBACKUPCTX () YYCTXMARKER = YYCURSOR
1134
1135 · #define YYRESTORE () YYCURSOR = YYMARKER
1136
1137 · #define YYRESTORECTX () YYCURSOR = YYCTXMARKER
1138
1139 · #define YYRESTORETAG (t) YYCURSOR = t
1140
1141 · #define YYLESSTHAN (n) YYLIMIT - YYCURSOR < n
1142
1143 · #define YYSTAGP (t) t = YYCURSOR
1144
1145 · #define YYSTAGN (t) t = NULL
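
As an illustration that the generic API need not follow pointer semantics, the C sketch below defines four of the primitives over an offset into a NUL-terminated string and drives them from a hand-written matcher. The matcher stands in for generated code and is not re2c output; the names input_t, in, is_digit and match_number are hypothetical. Note that a C function-like macro needs its opening parenthesis directly after the macro name, so the sketch writes YYPEEK() where the listing above shows YYPEEK ().

```c
#include <assert.h>
#include <stddef.h>

/* Offset-based input: no pointer-like cursor is exposed. */
typedef struct {
    const char *buf;   /* NUL-terminated input */
    size_t pos;        /* plays the role of YYCURSOR */
    size_t marker;     /* plays the role of YYMARKER */
} input_t;

/* Generic API primitives, expressed through a local variable named in. */
#define YYPEEK()    (in->buf[in->pos])
#define YYSKIP()    (++in->pos)
#define YYBACKUP()  (in->marker = in->pos)
#define YYRESTORE() (in->pos = in->marker)

static int is_digit(char c) { return c >= '0' && c <= '9'; }

/* Match [0-9]+ ("." [0-9]+)? at the current position.
 * Returns the length of the match (0 if there is none) and leaves
 * in->pos just past the matched text. */
static size_t match_number(input_t *in)
{
    size_t start = in->pos;
    char yych = YYPEEK();

    if (!is_digit(yych)) return 0;
    do { YYSKIP(); yych = YYPEEK(); } while (is_digit(yych));

    if (yych != '.') return in->pos - start;

    YYBACKUP();                  /* remember the end of the integer part */
    YYSKIP();
    yych = YYPEEK();
    if (!is_digit(yych)) {       /* "." without digits: fall back */
        YYRESTORE();
        return in->pos - start;
    }
    do { YYSKIP(); yych = YYPEEK(); } while (is_digit(yych));
    return in->pos - start;
}
```

The terminating NUL acts as a sentinel here, so no YYLESSTHAN check is needed; a lexer that reads from a file could instead perform end-of-input checks inside YYPEEK or YYSKIP.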
1146
1148 You can find more information about re2c at: http://re2c.org. See
1149 also: flex(1), lex(1), quex (http://quex.sourceforge.net).
1150
1152 Originally written by Peter Bumbulis in 1993; developed and maintained
1153 by Brian Young, Marcus Boerger, Dan Nuffer and Ulya Trofimovich. Below
1154 is a (more or less) full list of contributors retrieved from the Git
1155 history and mailing lists:
1156
1157 Abs62, asmwarrior, Ben Smith, Brian Young, CRCinAU, Dan Nuffer, Derick
1158 Rethans, Dimitri John Ledkov, Durimar, Eldar Zakirov, Emmanuel Mogenet,
1159 Hartmut Kaiser, jcfp, Jean-Claude Wippler, Jeff Trull, Jérôme Dumesnil,
1160 Jesse Buesking, joscherl, Julian Andres Klode, Marcus Boerger, Mike
1161 Gilbert, nuno-lopes, Oleksii Taran, paulmcq, Paulo Custodio, Perry E.
1162 Metzger, philippschaefer, Ross Burton, Rui Maciel, Ryan Mast,
1163 Samuel006, Sergei Trofimovich, sirzooro, Tim Kelly, Ulya Trofimovich
1164
1166 This manpage describes re2c version 1.1.1, package date 30 Aug 2018.
1167
1168
1169
1170
1171 RE2C(1)