RE2C(1)                                                                RE2C(1)

NAME

6       re2c - convert regular expressions to C/C++ code
7

SYNOPSIS

9       re2c [OPTIONS] FILE
10

DESCRIPTION

12       re2c is a lexer generator for C/C++. It finds regular expression speci‐
13       fications inside of C/C++ comments and replaces them with a  hard-coded
14       DFA.  The  user must supply some interface code in order to control and
15       customize the generated DFA.
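
       As a rough illustration of what a "hard-coded DFA" looks like, the
       following hand-written C sketch matches the regular expression
       [0-9]+ "." [0-9]+ with one label per looping state and one comparison
       per transition.  It is not actual re2c output: the function name
       match_decimal and the use of a NUL-terminated string as input are
       assumptions made for this example.

```c
#include <stddef.h>

/* Hand-written DFA for [0-9]+ "." [0-9]+ (illustrative, not re2c output).
 * Returns the length of the match at the start of s, or 0 if none. */
size_t match_decimal(const char *s)
{
    const char *cursor = s;   /* plays the role of YYCURSOR */
    char yych;

    yych = *cursor;
    if (yych < '0' || yych > '9') return 0;   /* need at least one digit */
    ++cursor;
int_part:
    yych = *cursor;
    if (yych >= '0' && yych <= '9') { ++cursor; goto int_part; }
    if (yych != '.') return 0;                /* the dot is mandatory */
    ++cursor;
    yych = *cursor;
    if (yych < '0' || yych > '9') return 0;   /* need a digit after the dot */
    ++cursor;
frac_part:
    yych = *cursor;
    if (yych >= '0' && yych <= '9') { ++cursor; goto frac_part; }
    return (size_t)(cursor - s);              /* accepting state */
}
```

       Generated lexers follow the same idea, but read and advance the input
       through the primitives described under INTERFACE CODE.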
16

OPTIONS

18       -? -h --help
19              Show help message.
20
21       -b --bit-vectors
22              Optimize conditional jumps using bit masks. Implies -s.
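
              A minimal sketch of the bit-mask idea, assuming a 256-entry
              table in which bit 128 marks decimal digits.  The table
              layout and the helper count_digits are illustrative and not
              copied from generated code.

```c
#include <stddef.h>

/* One byte per input symbol; bit 128 is set for '0'..'9'.
 * C99 designated initializers leave every other entry zero. */
static const unsigned char yybm[256] = {
    ['0'] = 128, ['1'] = 128, ['2'] = 128, ['3'] = 128, ['4'] = 128,
    ['5'] = 128, ['6'] = 128, ['7'] = 128, ['8'] = 128, ['9'] = 128
};

/* Count leading decimal digits using one mask test per symbol
 * instead of a two-sided range comparison. */
size_t count_digits(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    while (yybm[*p] & 128) ++p;   /* the terminating NUL has no bits set */
    return (size_t)(p - (const unsigned char *)s);
}
```

              Several character classes can share one table by giving each
              class its own bit.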
23
24       -c --conditions --start-conditions
25              Enable support of Flex-like "conditions": multiple  interrelated
26              lexers  within one block.  Option --start-conditions is a legacy
27              alias; use --conditions instead.
28
29       -d --debug-output
30              Emit YYDEBUG in the generated code.  YYDEBUG should  be  defined
31              by  the user in the form of a void function with two parameters:
32              state (lexer state or -1) and symbol (current  input  symbol  of
33              type YYCTYPE).
34
35       -D --emit-dot
36              Instead  of  normal  output  generate lexer graph in DOT format.
37              The output can be converted to PNG with  the  help  of  Graphviz
38              (something  like  dot -Tpng -odfa.png dfa.dot).  Note that large
39              graphs may crash Graphviz.
40
41       -e --ecb
              Generate a lexer that reads input in EBCDIC encoding.  re2c
              assumes that character range is 0 -- 0xFF and character size is
              1 byte.
45
46       -f --storable-state
              Generate a lexer that can store its inner state.  This is useful
              in push-model lexers, which are stopped by an outer program when
              there is not enough input and resumed when more input becomes
              available.  In this mode users must additionally define the
              YYGETSTATE () and YYSETSTATE (state) macros and keep the
              variables yych and yyaccept, together with the state itself, as
              part of the lexer state.
53
54       -F --flex-syntax
55              Partial  support for Flex syntax: in this mode named definitions
56              don't need the equal sign and  the  terminating  semicolon,  and
57              when  used they must be surrounded by curly braces.  Names with‐
58              out curly braces are treated as double-quoted strings.
59
60       -g --computed-gotos
61              Optimize conditional jumps using  non-standard  "computed  goto"
62              extension (must be supported by C/C++ compiler).  re2c generates
63              jump tables only in complex cases  with  a  lot  of  conditional
64              branches.    Complexity   threshold   can   be  configured  with
65              cgoto:threshold configuration.  This option implies -b.
66
67       -i --no-debug-info
68              Do not output #line information.  This is useful when the gener‐
69              ated code is tracked by some version control system.
70
71       -o OUTPUT --output=OUTPUT
72              Specify the OUTPUT file.
73
74       -r --reusable
              Allows reuse of re2c rules with /*!rules:re2c */ and
              /*!use:re2c */ blocks.  In this mode simple /*!re2c */ blocks
              are not allowed and exactly one /*!rules:re2c */ block must be
              present.  The rules are saved and used by every /*!use:re2c */
              block that follows (each of which may add rules of its own).
              This makes it possible to reuse the same set of rules with
              different configurations.
81
82       -s --nested-ifs
83              Use nested if statements instead of switch statements in  condi‐
84              tional  jumps.  This usually results in more efficient code with
85              non-optimizing C/C++ compilers.
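
              The two code shapes can be compared with a hand-written
              sketch.  Both functions below classify a symbol as a decimal
              digit (1), a letter a..f (2) or anything else (0); the
              function names are hypothetical and ASCII is assumed.

```c
/* Dispatch with a switch statement (the default code shape). */
int classify_switch(unsigned char c)
{
    switch (c) {
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
        return 1;
    case 'a': case 'b': case 'c': case 'd': case 'e': case 'f':
        return 2;
    default:
        return 0;
    }
}

/* The same dispatch as nested ifs that bisect the symbol range. */
int classify_ifs(unsigned char c)
{
    if (c <= '9') {
        if (c >= '0') return 1;
        return 0;
    } else {
        if (c < 'a') return 0;
        if (c <= 'f') return 2;
        return 0;
    }
}

/* Returns 1 if both functions agree on every possible symbol. */
int classifiers_agree(void)
{
    unsigned int c;
    for (c = 0; c < 256; ++c) {
        if (classify_switch((unsigned char)c) != classify_ifs((unsigned char)c))
            return 0;
    }
    return 1;
}
```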
86
87       -t HEADER --type-header=HEADER
88              Generate a HEADER file that contains enum with condition  names.
89              Requires -c option.
90
91       -T --tags
92              Enable submatch extraction with tags.
93
94       -P --posix-captures
95              Enable submatch extraction with POSIX-style capturing groups.
96
97       -u --unicode
98              Generate  a  lexer  that  reads  input in UTF-32 encoding.  re2c
99              assumes that character range is 0 -- 0x10FFFF and character size
100              is 4 bytes.  Implies -s.
101
102       -v --version
103              Show version information.
104
105       -V --vernum
106              Show version information in MMmmpp format (major, minor, patch).
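
              A build script might split the MMmmpp string into its fields
              as sketched below.  The helper names and the sample strings
              in the usage note are hypothetical.

```c
#include <stdlib.h>

/* Each helper parses the six-digit MMmmpp string as a base-10 number
 * and extracts one two-digit field. */
int vernum_major(const char *s) { return (int)(strtol(s, NULL, 10) / 10000); }
int vernum_minor(const char *s) { return (int)(strtol(s, NULL, 10) / 100 % 100); }
int vernum_patch(const char *s) { return (int)(strtol(s, NULL, 10) % 100); }
```

              For the sample string "021205" these helpers yield major 2,
              minor 12 and patch 5.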
107
108       -w --wide-chars
109              Generate  a  lexer  that  reads  input  in UCS-2 encoding.  re2c
110              assumes that character range is 0 -- 0xFFFF and  character  size
111              is 2 bytes.  Implies -s.
112
113       -x --utf-16
114              Generate  a  lexer  that  reads  input in UTF-16 encoding.  re2c
115              assumes that character range is 0 -- 0x10FFFF and character size
116              is 2 bytes.  Implies -s.
117
118       -8 --utf-8
119              Generate  a  lexer  that  reads  input  in UTF-8 encoding.  re2c
120              assumes that character range is 0 -- 0x10FFFF and character size
121              is 1 byte.
122
123       --case-insensitive
124              Treat  single-quoted  and double-quoted strings as case-insensi‐
125              tive.
126
127       --case-inverted
128              Invert the meaning of single-quoted and  double-quoted  strings:
129              treat  single-quoted strings as case-sensitive and double-quoted
130              strings as case-insensitive.
131
132       --no-generation-date
133              Suppress date output in the generated file.
134
135       --no-lookahead
136              Use TDFA(0) instead of TDFA(1).  This  option  only  has  effect
137              with --tags or --posix-captures options.
138
139       --no-optimize-tags
140              Suppress  optimization of tag variables (useful for debugging or
141              benchmarking).
142
143       --no-version
144              Suppress version output in the generated file.
145
146       --encoding-policy POLICY
147              Define the way re2c treats Unicode surrogates.   POLICY  can  be
148              one of the following: fail (abort with an error when a surrogate
149              is encountered), substitute (silently  replace  surrogates  with
150              the  error code point 0xFFFD), ignore (default, treat surrogates
151              as normal code points).  The Unicode standard says  that  stand‐
152              alone  surrogates are invalid, but real-world libraries and pro‐
153              grams behave in different ways.
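
              The three policies can be summarized as a function over code
              points.  This is only a sketch of the semantics described
              above, not of how re2c implements them; the enum and the
              function name are assumptions.

```c
enum surrogate_policy { POLICY_FAIL, POLICY_SUBSTITUTE, POLICY_IGNORE };

/* Returns the code point to use, or -1 if the policy rejects it.
 * Surrogates occupy the range 0xD800 .. 0xDFFF. */
long apply_surrogate_policy(enum surrogate_policy policy, long cp)
{
    int is_surrogate = (cp >= 0xD800 && cp <= 0xDFFF);

    if (!is_surrogate) return cp;          /* ordinary code point */
    switch (policy) {
    case POLICY_FAIL:       return -1;     /* abort with an error */
    case POLICY_SUBSTITUTE: return 0xFFFD; /* replace with the error code point */
    case POLICY_IGNORE:     return cp;     /* treat as a normal code point */
    }
    return cp;
}
```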
154
155       --input INPUT
156              Specify re2c input API. INPUT can be either  default  or  custom
157              (enables the use of generic API).
158
159       -S --skeleton
160              Ignore user-defined interface code and generate a self-contained
161              "skeleton" program.  Additionally,  generate  input  files  with
162              strings  derived  from  the regular grammar and compressed match
163              results that are used  to  verify  "skeleton"  behavior  on  all
164              inputs.  This option is useful for finding bugs in optimizations
165              and code generation.
166
167       --empty-class POLICY
168              Define the way re2c treats empty character classes.  POLICY  can
169              be one of the following: match-empty (match empty input: illogi‐
170              cal, but default behavior for backwards compatibility  reasons),
171              match-none  (fail  to  match  on  any input), error (compilation
172              error).
173
174       --dfa-minimization ALGORITHM
              The internal algorithm used by re2c to minimize the DFA.
              ALGORITHM can be either moore (Moore algorithm, the default) or
              table (table filling algorithm).  Both algorithms should produce
              the same DFA up to relabeling of states; table filling is much
              slower and serves as a reference implementation.
180
181       --eager-skip
182              Make the generated lexer advance the input  position  "eagerly":
183              immediately after reading input symbol.  By default this happens
184              after transition to the next state.  Implied by --no-lookahead.
185
186       --dump-nfa
187              Generate representation of NFA in DOT  format  and  dump  it  on
188              stderr.
189
190       --dump-dfa-raw
191              Generate  representation of DFA in DOT format under construction
192              and dump it on stderr.
193
194       --dump-dfa-det
195              Generate representation of DFA in DOT format  immediately  after
196              determinization and dump it on stderr.
197
198       --dump-dfa-tagopt
199              Generate representation of DFA in DOT format after tag optimiza‐
200              tions and dump it on stderr.
201
202       --dump-dfa-min
203              Generate representation of DFA in DOT format after  minimization
204              and dump it on stderr.
205
206       --dump-adfa
207              Generate representation of DFA in DOT format after tunneling and
208              dump it on stderr.
209
210       -1 --single-pass
211              Deprecated. Does nothing (single pass is the default now).
212
213       -W     Turn on all warnings.
214
215       -Werror
216              Turn warnings into errors. Note that this option  alone  doesn't
217              turn  on  any warnings; it only affects those warnings that have
218              been turned on so far or will be turned on later.
219
220       -W<warning>
221              Turn on warning.
222
223       -Wno-<warning>
224              Turn off warning.
225
226       -Werror-<warning>
              Turn on warning and treat it as an error (this implies
              -W<warning>).
229
230       -Wno-error-<warning>
231              Don't  treat  this  particular warning as an error. This doesn't
232              turn off the warning itself.
233
234       -Wcondition-order
235              Warn if the generated program makes implicit  assumptions  about
236              condition numbering. One should use either the -t, --type-header
237              option or the /*!types:re2c*/ directive to generate a mapping of
238              condition names to numbers and then use the autogenerated condi‐
239              tion names.
240
241       -Wempty-character-class
242              Warn if a regular expression contains an empty character  class.
243              Trying  to  match  an  empty  character class makes no sense: it
244              should always fail.  However, for backwards  compatibility  rea‐
245              sons  re2c  allows  empty  character  classes and treats them as
246              empty strings.  Use  the  --empty-class  option  to  change  the
247              default behavior.
248
249       -Wmatch-empty-string
250              Warn  if  a  rule is nullable (matches an empty string).  If the
251              lexer runs in a loop and the empty match is  unintentional,  the
252              lexer may unexpectedly hang in an infinite loop.
253
254       -Wswapped-range
255              Warn  if  the  lower  bound of a range is greater than its upper
256              bound. The default  behavior  is  to  silently  swap  the  range
257              bounds.
258
259       -Wundefined-control-flow
260              Warn  if  some input strings cause undefined control flow in the
261              lexer (the faulty patterns are reported). This is the most  dan‐
262              gerous and most common mistake. It can be easily fixed by adding
263              the default rule * which has the lowest  priority,  matches  any
264              code unit, and consumes exactly one code unit.
265
266       -Wunreachable-rules
267              Warn about rules that are shadowed by other rules and will never
268              match.
269
270       -Wuseless-escape
              Warn if a symbol is escaped when it shouldn't be.  By default,
              re2c silently ignores such escapes, but they may well indicate
              a typo or an error in the escape sequence.
274
275       -Wnondeterministic-tags
276              Warn if a tag has n-th degree  of  nondeterminism,  where  n  is
277              greater than 1.
278

INTERFACE CODE

280       Below  is  the  list  of  all symbols which may be used by the lexer in
281       order to interact with  the  outer  world.   These  symbols  should  be
282       defined  by  the user, either in the form of inplace configurations, or
283       as C/C++ variables, functions, macros and  other  language  constructs.
284       Which primitives are necessary depends on the particular use case.
285
286       yyaccept
287              L-value  of unsigned integral type that is used to hold the num‐
288              ber of the last matched rule.  Explicit definition by  the  user
289              is necessary only with -f --storable-state option.
290
291       YYBACKUP ()
292              Backup  current  input  position  (used only with --input custom
293              option).
294
295       YYBACKUPCTX ()
296              Backup current input position for trailing  context  (used  only
297              with  --input custom option).
298
299       yych   L-value of type YYCTYPE that is used to hold current input char‐
300              acter.  Explicit definition by the user is necessary  only  with
301              -f --storable-state option.
302
303       YYCONDTYPE
              The type of condition identifiers (used only with -c
              --conditions option).  Should be generated either with the
              /*!types:re2c*/ directive or with the -t --type-header option.
307
308       YYCTXMARKER
309              L-value  of type YYCTYPE * that is used to backup input position
310              of trailing context.  It is needed only if  regular  expressions
311              use the lookahead operator /.
312
313       YYCTYPE
314              The  type  of  the  input  characters  (code units).  Usually it
315              should be unsigned char for ASCII, EBCDIC and  UTF-8  encodings,
316              unsigned  short  for UTF-16 or UCS-2 encodings, and unsigned int
317              for UTF-32 encoding.
318
319       YYCURSOR
320              L-value of type YYCTYPE * that is used as a pointer to the  cur‐
321              rent input symbol.  Initially YYCURSOR points to the first char‐
322              acter and is advanced by the lexer during matching.  When a rule
323              matches,  YYCURSOR points past the last character of the matched
324              string.
325
326       YYDEBUG (state, symbol)
327              A function-like primitive that is used to dump debug information
328              (only  used  with  -d  --debug-output  option).   YYDEBUG should
329              return no value and accept two arguments:  state  (either  lexer
330              state or -1) and symbol (current input symbol).
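
              One possible definition, assuming YYCTYPE is unsigned char
              and that debug output should go to stderr.  The function name
              yydebug and the call counter are hypothetical.

```c
#include <stdio.h>

typedef unsigned char YYCTYPE;

int debug_calls = 0;   /* number of times the lexer reported a step */

/* Void function with two parameters: the lexer state (or -1) and the
 * current input symbol. */
void yydebug(int state, YYCTYPE symbol)
{
    fprintf(stderr, "state %d, symbol 0x%02X\n", state, (unsigned int)symbol);
    ++debug_calls;
}

#define YYDEBUG(state, symbol) yydebug((state), (symbol))
```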
331
332       YYFILL (n)
333              A function-like primitive that is called by the lexer when there
334              is not enough input.  YYFILL should return no value  and  supply
335              at  least  n  additional  characters.  Maximal value of n equals
336              YYMAXFILL, which can be obtained with the  /*!max:re2c*/  direc‐
337              tive.
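
              A common way to implement YYFILL is to shift the unread part
              of the buffer to the front and append fresh input.  The
              sketch below assumes a small fixed buffer, an in-memory data
              source and a lexer function that returns -1 on failure; all
              names other than YYFILL and YYCTYPE are hypothetical.

```c
#include <stddef.h>
#include <string.h>

typedef unsigned char YYCTYPE;
enum { BUFSIZE = 16 };

struct input {
    YYCTYPE buf[BUFSIZE];
    YYCTYPE *cursor;     /* plays the role of YYCURSOR */
    YYCTYPE *limit;      /* plays the role of YYLIMIT */
    YYCTYPE *token;      /* start of the lexeme being matched */
    const char *src;     /* hypothetical in-memory data source */
    size_t src_left;     /* characters not yet copied from src */
};

void input_init(struct input *in, const char *src)
{
    in->cursor = in->limit = in->token = in->buf;
    in->src = src;
    in->src_left = strlen(src);
}

/* Discard everything before token, then top up the buffer from src.
 * Returns 0 if at least `need` characters are available, -1 otherwise. */
int fill(struct input *in, size_t need)
{
    size_t shift = (size_t)(in->token - in->buf);
    size_t kept = (size_t)(in->limit - in->token);
    size_t room, take;

    memmove(in->buf, in->token, kept);
    in->token -= shift;
    in->cursor -= shift;
    in->limit -= shift;

    room = BUFSIZE - kept;
    take = in->src_left < room ? in->src_left : room;
    memcpy(in->limit, in->src, take);
    in->limit += take;
    in->src += take;
    in->src_left -= take;

    return (size_t)(in->limit - in->cursor) >= need ? 0 : -1;
}

/* Statement-like YYFILL that leaves the enclosing function on failure. */
#define YYFILL(n) do { if (fill(in, (n)) != 0) return -1; } while (0)

/* Two refills with 10 characters consumed in between; returns the
 * number of unread characters left in the buffer, or -1 on failure. */
int demo_fill(void)
{
    struct input st;
    struct input *in = &st;

    input_init(in, "hello, world! more input");   /* 24 characters */
    YYFILL(5);
    in->cursor += 10;            /* pretend the lexer consumed 10 characters */
    in->token = in->cursor;
    YYFILL(10);
    return (int)(in->limit - in->cursor);
}
```

              A real lexer would also have to shift YYMARKER, YYCTXMARKER
              and any tag variables by the same amount as YYCURSOR.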
338
339       YYGETCONDITION ()
340              R-value  of  type  YYCONDTYPE  that represents current condition
341              identifier (used only with -c --conditions option).
342
343       YYGETSTATE ()
344              R-value of signed integral type that  represents  current  lexer
345              state  (used  only  with  -f  --storable-state option).  Initial
346              value of lexer state should be -1.
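
              A sketch of the state that a storable lexer keeps between
              calls, assuming it lives in a struct reached through a
              pointer named lex.  The struct, its field layout and the demo
              function are hypothetical.

```c
typedef unsigned char YYCTYPE;

struct lexer_state {
    int state;               /* read and written via YYGETSTATE/YYSETSTATE */
    YYCTYPE yych;            /* current input character */
    unsigned int yyaccept;   /* number of the last matched rule */
};

#define YYGETSTATE()   (lex->state)
#define YYSETSTATE(s)  (lex->state = (s))

void lexer_state_init(struct lexer_state *lex)
{
    lex->state = -1;         /* initial value of the lexer state */
    lex->yych = 0;
    lex->yyaccept = 0;
}

/* Simulates a suspension at (hypothetical) state 3 and returns the
 * state that the next call would resume from. */
int demo_suspend_resume(void)
{
    struct lexer_state st;
    struct lexer_state *lex = &st;

    lexer_state_init(lex);
    if (YYGETSTATE() != -1) return -2;
    YYSETSTATE(3);
    return YYGETSTATE();
}
```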
347
348       YYLESSTHAN (n)
              R-value of boolean type that is true if and only if fewer than
              n input characters are left (used only with --input custom
              option).
352
353       YYLIMIT
              R-value of type YYCTYPE * that marks the end of input
              (YYLIMIT[-1] should be the last input character).  The lexer
              compares YYCURSOR and YYLIMIT in order to determine whether
              enough input characters are left.
358
359       YYMARKER
360              L-value  of type YYCTYPE * used to backup input position of suc‐
361              cessful match.  This might be necessary if there is an  overlap‐
362              ping longer rule that might also match.
363
364       YYMTAGP (t)
365              Append  current  input  position to the history of m-tag t (used
366              only with -T --tags option).
367
368       YYMTAGN (t)
369              Append default value to the history of m-tag t (used  only  with
370              -T --tags option).
371
372       YYMAXFILL
373              Integral  constant that denotes maximal value of YYFILL argument
374              and is autogenerated by /*!max:re2c*/ directive.
375
376       YYMAXNMATCH
377              Integral constant  that  denotes  maximal  number  of  capturing
378              groups  in  a  rule  and is autogenerated by /*!maxnmatch:re2c*/
379              directive (used only with --posix-captures option).
380
381       yynmatch
382              L-value of unsigned integral type that is used to hold the  num‐
383              ber of capturing groups in the matching rule.  Used only with -P
384              --posix-captures option.
385
386       YYPEEK ()
387              R-value of type YYCTYPE that  denotes  current  input  character
388              (used only with --input custom option).
389
390       yypmatch
391              An  array of l-values that are used to hold the values of s-tags
392              corresponding to the capturing parentheses in the matching rule.
393              The  length  of  array  must  be  at least yynmatch * 2 (ideally
394              YYMAXNMATCH * 2).  Used only with -P --posix-captures option.
395
396       YYRESTORE ()
397              Restore input position (used only with --input custom option).
398
399       YYRESTORECTX ()
400              Restore input position from the value of trailing context  (used
401              only with --input custom option).
402
403       YYRESTORETAG (t)
404              Restore input position from the value of s-tag t (used only with
405              --input custom option).
406
407       YYSETCONDITION (condition)
408              Set current condition identifier to condition (used only with -c
409              --conditions option).
410
411       YYSETSTATE (state)
412              Set   current   lexer   state   to  state  (used  only  with  -f
413              --storable-state option).  Parameter state is of signed integral
414              type.
415
416       YYSKIP ()
417              Advance  input  position  to  the next character (used only with
418              generic API).
419
420       YYSTAGP (t)
421              Save current input position to s-tag t (used only with -T --tags
422              and --input custom option).
423
424       YYSTAGN (t)
425              Save  default  value  to  s-tag  t (used only with -T --tags and
426              --input custom options).
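
       With --input custom the primitives need not be pointer-based.  The
       sketch below defines a few of them over an index into a byte array and
       uses them in a hand-written matcher for "ab" | "abcd" that backs up
       after "ab" and restores if the longer alternative fails.  The struct,
       the matcher and the convention that YYPEEK yields 0 at end of input
       are assumptions of this example.

```c
#include <stddef.h>

typedef unsigned char YYCTYPE;

struct stream {
    const YYCTYPE *data;
    size_t len;      /* total number of characters */
    size_t pos;      /* current input position */
    size_t marker;   /* position saved by YYBACKUP */
};

#define YYPEEK()       (in->pos < in->len ? in->data[in->pos] : 0)
#define YYSKIP()       (++in->pos)
#define YYBACKUP()     (in->marker = in->pos)
#define YYRESTORE()    (in->pos = in->marker)
#define YYLESSTHAN(n)  (in->len - in->pos < (size_t)(n))

/* Returns the length of the longest match of "ab" | "abcd", or 0. */
size_t match_ab_abcd(const char *s, size_t len)
{
    struct stream st = { (const YYCTYPE *)s, len, 0, 0 };
    struct stream *in = &st;

    if (YYLESSTHAN(2)) return 0;     /* the shortest match needs 2 characters */
    if (YYPEEK() != 'a') return 0;
    YYSKIP();
    if (YYPEEK() != 'b') return 0;
    YYSKIP();
    YYBACKUP();                      /* "ab" matched; remember this position */
    if (YYPEEK() == 'c') {
        YYSKIP();
        if (YYPEEK() == 'd') {
            YYSKIP();
            return in->pos;          /* "abcd" matched */
        }
    }
    YYRESTORE();                     /* fall back to the shorter match */
    return in->pos;
}
```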
427

SYNTAX

429       A program can contain any number of re2c blocks.  Each  block  consists
430       of a sequence of RULES, NAMED DEFINITIONS and INPLACE CONFIGURATIONS.
431
432   RULES
       Rules consist of a regular expression followed by a user-defined
       action: a block of C/C++ code that is executed in case of a successful
       match.  An action can be either an arbitrary block of code enclosed in
       curly braces { and } or a block of code without curly braces that is
       preceded by := and ended by a newline not followed by whitespace.
438
439       If  multiple  rules  match,  re2c  prefers the longest match.  If rules
440       match the same string, the earlier rule has priority.
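
       The disambiguation policy can be emulated at run time with a small
       sketch: try every rule, keep the longest match, and let the earliest
       rule win ties.  re2c resolves this when it builds the DFA rather than
       by trying rules one by one; the two matchers and the function
       winning_rule are hypothetical, and ASCII is assumed.

```c
#include <stddef.h>
#include <string.h>

typedef size_t (*matcher)(const char *);

/* Rule 0: the keyword "if". */
static size_t match_kw_if(const char *s)
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

/* Rule 1: an identifier [a-z]+. */
static size_t match_ident(const char *s)
{
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z') ++n;
    return n;
}

/* Returns the index of the winning rule, or -1 if no rule matches. */
int winning_rule(const char *s)
{
    static const matcher rules[] = { match_kw_if, match_ident };
    size_t best_len = 0, i;
    int best = -1;

    for (i = 0; i < sizeof rules / sizeof rules[0]; ++i) {
        size_t n = rules[i](s);
        /* Strict comparison: a later rule wins only with a longer match. */
        if (n > best_len) {
            best_len = n;
            best = (int)i;
        }
    }
    return best;
}
```

       On the input "if" both rules match two characters and rule 0 wins by
       order; on "iffy" rule 1 wins because its match is longer.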
441
442       There is one special kind of rule: the default rule with *  instead  of
443       the regular expression.  It always has the lowest priority, matches any
444       code unit (either valid or invalid) and consumes exactly one code unit.
       Note that the default rule is not the same as [^], which matches any
       valid code point and can consume multiple code units.  With
       variable-length encodings, * is the only way to match an invalid input
       character.
449
450       If -c --conditions option is used, then rules have  more  complex  form
451       described in the section about conditions.
452
453   NAMED DEFINITIONS
454       Named  definitions  are  of  the  form name = regexp ; where name is an
455       identifier that consists of letters, digits and underscores, and regexp
456       is  a  regular  expression.  With -F --flex-syntax option named defini‐
457       tions are also of the form name regexp.  Each name  should  be  defined
458       before it is used.
459
460   INPLACE CONFIGURATIONS
461       re2c:cgoto:threshold = 9;
462              With  -g  --computed-gotos  option this value specifies the com‐
463              plexity threshold that triggers the generation  of  jump  tables
464              rather than nested if statements and bit masks.
465
466       re2c:cond:divider = '/* *********************************** */';
              Customizes the divider for condition blocks.  Use @@ to insert
              the condition name.
469
470       re2c:cond:divider@cond = @@;
471              Specifies the placeholder that will be replaced  with  condition
472              name in re2c:cond:divider.
473
474       re2c:condenumprefix = yyc;
475              Specifies the prefix used for condition identifiers.
476
477       re2c:cond:goto@cond = @@;
478              Specifies  the  placeholder that will be replaced with condition
479              label in re2c:cond:goto.
480
481       re2c:cond:goto = 'goto @@;';
              Customizes the goto statements used with :=> style rules.  Use
              @@ to insert the condition name.
484
485       re2c:condprefix = yyc;
486              Specifies the prefix used for condition labels.
487
488       re2c:define:YYBACKUPCTX = 'YYBACKUPCTX';
489              Replaces YYBACKUPCTX identifier with the specified string.
490
491       re2c:define:YYBACKUP = 'YYBACKUP';
492              Replaces YYBACKUP identifier with the specified string.
493
494       re2c:define:YYCONDTYPE = 'YYCONDTYPE';
495              Enumeration type used for condition identifiers.
496
497       re2c:define:YYCTXMARKER = 'YYCTXMARKER';
498              Replaces  the YYCTXMARKER placeholder with the specified identi‐
499              fier.
500
501       re2c:define:YYCTYPE = 'YYCTYPE';
502              Replaces the YYCTYPE placeholder with the specified type.
503
504       re2c:define:YYCURSOR = 'YYCURSOR';
505              Replaces the YYCURSOR placeholder with the specified identifier.
506
507       re2c:define:YYDEBUG = 'YYDEBUG';
508              Replaces the YYDEBUG placeholder with the specified identifier.
509
510       re2c:define:YYFILL@len = '@@';
511              Any occurrence of this text inside of a YYFILL will be  replaced
512              with the actual argument.
513
514       re2c:define:YYFILL:naked = 0;
              Controls the argument in parentheses after YYFILL and the
              following semicolon.  If zero (the default), the argument is
              generated unless re2c:yyfill:parameter is set to zero, and the
              semicolon is always generated.  If non-zero, both the argument
              and the semicolon are omitted.
520
521       re2c:define:YYFILL = 'YYFILL';
522              Define  a substitution for YYFILL.  By default re2c generates an
523              argument in parentheses and a semicolon after  YYFILL.   If  you
524              need  to  make YYFILL an arbitrary statement rather than a call,
525              set re2c:define:YYFILL:naked to a non-zero value.
526
527       re2c:define:YYGETCONDITION:naked = 0;
              Controls the parentheses after YYGETCONDITION.  If zero (the
              default), the parentheses are generated.  If non-zero, the
              parentheses are omitted.
531
532       re2c:define:YYGETCONDITION = 'YYGETCONDITION';
              Substitution for YYGETCONDITION.  By default re2c generates
              parentheses after YYGETCONDITION.  Set
              re2c:define:YYGETCONDITION:naked to non-zero in order to omit
              the parentheses.
536
537       re2c:define:YYGETSTATE:naked = 0;
              Controls the parentheses that follow YYGETSTATE.  If zero (the
              default), the parentheses are generated.  If non-zero, they are
              omitted.
540
541       re2c:define:YYGETSTATE = 'YYGETSTATE';
542              Substitution  for  YYGETSTATE.  By default re2c generates paren‐
543              theses after YYGETSTATE.   Set  re2c:define:YYGETSTATE:naked  to
544              non-zero to omit the parentheses.
545
546       re2c:define:YYLESSTHAN = 'YYLESSTHAN';
547              Replaces YYLESSTHAN identifier with the specified string.
548
549       re2c:define:YYLIMIT = 'YYLIMIT';
550              Replaces the YYLIMIT placeholder with the specified identifier.
551
552       re2c:define:YYMARKER = 'YYMARKER';
553              Replaces the YYMARKER placeholder with the specified identifier.
554
555       re2c:define:YYMTAGN = 'YYMTAGN';
556              Replaces YYMTAGN identifier with the specified string.
557
558       re2c:define:YYMTAGP = 'YYMTAGP';
559              Replaces YYMTAGP identifier with the specified string.
560
561       re2c:define:YYPEEK = 'YYPEEK';
562              Replaces YYPEEK identifier with the specified string.
563
564       re2c:define:YYRESTORECTX = 'YYRESTORECTX';
565              Replaces YYRESTORECTX identifier with the specified string.
566
567       re2c:define:YYRESTORE = 'YYRESTORE';
568              Replaces YYRESTORE identifier with the specified string.
569
570       re2c:define:YYRESTORETAG = 'YYRESTORETAG';
571              Replaces YYRESTORETAG identifier with the specified string.
572
573       re2c:define:YYSETCONDITION@cond = '@@';
574              Any  occurrence  of  this  text inside of YYSETCONDITION will be
575              replaced with the actual argument.
576
577       re2c:define:YYSETCONDITION:naked = 0;
              Controls the argument in parentheses and the semicolon after
              YYSETCONDITION.  If zero (the default), both the argument and
              the semicolon are generated.  If non-zero, both are omitted.
582
583       re2c:define:YYSETCONDITION = 'YYSETCONDITION';
              Substitution for YYSETCONDITION.  By default re2c generates an
              argument in parentheses followed by a semicolon after
              YYSETCONDITION.  If you need to make YYSETCONDITION an arbitrary
              statement rather than a call, set
              re2c:define:YYSETCONDITION:naked to non-zero.
589
590       re2c:define:YYSETSTATE:naked = 0;
              Controls the argument in parentheses and the semicolon after
              YYSETSTATE.  If zero (the default), both the argument and the
              semicolon are generated.  If non-zero, both are omitted.
595
596       re2c:define:YYSETSTATE@state = '@@';
597              Any occurrence  of  this  text  inside  of  YYSETSTATE  will  be
598              replaced with the actual argument.
599
600       re2c:define:YYSETSTATE = 'YYSETSTATE';
601              Substitution  for YYSETSTATE. By default re2c generates an argu‐
602              ment in parentheses followed by a semicolon after YYSETSTATE. If
603              you need to make YYSETSTATE an arbitrary statement rather than a
604              call, set re2c:define:YYSETSTATE:naked to non-zero.
605
606       re2c:define:YYSKIP = 'YYSKIP';
607              Replaces YYSKIP identifier with the specified string.
608
609       re2c:define:YYSTAGN = 'YYSTAGN';
610              Replaces YYSTAGN identifier with the specified string.
611
612       re2c:define:YYSTAGP = 'YYSTAGP';
613              Replaces YYSTAGP identifier with the specified string.
614
615       re2c:flags:8 or re2c:flags:utf-8
616              Same as -8 --utf-8 command-line option.
617
618       re2c:flags:b or re2c:flags:bit-vectors
619              Same as -b --bit-vectors command-line option.
620
621       re2c:flags:case-insensitive = 0;
622              Same as --case-insensitive command-line option.
623
624       re2c:flags:case-inverted = 0;
625              Same as --case-inverted command-line option.
626
627       re2c:flags:d or re2c:flags:debug-output
628              Same as -d --debug-output command-line option.
629
630       re2c:flags:dfa-minimization = 'moore';
631              Same as --dfa-minimization command-line option.
632
633       re2c:flags:eager-skip = 0;
634              Same as --eager-skip command-line option.
635
636       re2c:flags:e or re2c:flags:ecb
637              Same as -e --ecb command-line option.
638
639       re2c:flags:empty-class = 'match-empty';
640              Same as --empty-class command-line option.
641
642       re2c:flags:encoding-policy = 'ignore';
643              Same as --encoding-policy command-line option.
644
645       re2c:flags:g or re2c:flags:computed-gotos
646              Same as -g --computed-gotos command-line option.
647
648       re2c:flags:i or re2c:flags:no-debug-info
649              Same as -i --no-debug-info command-line option.
650
651       re2c:flags:input = 'default';
652              Same as --input command-line option.
653
654       re2c:flags:lookahead = 1;
655              Same as inverted --no-lookahead command-line option.
656
657       re2c:flags:optimize-tags = 1;
658              Same as inverted --no-optimize-tags command-line option.
659
660       re2c:flags:P or re2c:flags:posix-captures
661              Same as -P --posix-captures command-line option.
662
663       re2c:flags:s or re2c:flags:nested-ifs
664              Same as -s --nested-ifs command-line option.
665
666       re2c:flags:T or re2c:flags:tags
667              Same as -T --tags command-line option.
668
669       re2c:flags:u or re2c:flags:unicode
670              Same as -u --unicode command-line option.
671
672       re2c:flags:w or re2c:flags:wide-chars
673              Same as -w --wide-chars command-line option.
674
675       re2c:flags:x or re2c:flags:utf-16
676              Same as -x --utf-16 command-line option.
677
678       re2c:indent:string = '\t';
679              Specifies the string to use for indentation. Requires  a  string
680              that  contains  only  whitespace (unless you need something else
681              for external tools). The easiest way to  specify  spaces  is  to
682              enclose  them  in  single or double quotes.  If you do  not want
683              any indentation at all, you can set this to ''.
684
685       re2c:indent:top = 0;
686              Specifies the minimum amount of indentation to use.  Requires  a
687              numeric value greater than or equal to zero.
688
689       re2c:labelprefix = 'yy';
              Changes the prefix of numbered labels.  The default is yy.  Can
              be set to any string that is valid in a label name.
692
693       re2c:label:yyFillLabel = 'yyFillLabel';
694              Overrides the name of the yyFillLabel label.
695
696       re2c:label:yyNext = 'yyNext';
697              Overrides the name of the yyNext label.
698
699       re2c:startlabel = 0;
700              If set to a non zero integer, then the start label of  the  next
701              scanner  block  will  be  generated even if it isn't used by the
702              scanner itself. Otherwise, the normal yy0-like  start  label  is
703              only  generated  if needed. If set to a text value, then a label
704              with that text will be generated regardless of whether the  nor‐
705              mal start label is used or not. This setting is reset to 0 after
706              a start label has been generated.
707
708       re2c:state:abort = 0;
709              When not zero and the -f --storable-state switch is active, then
710              the YYGETSTATE block will contain a default case that aborts and
711              a -1 case will be used for initialization.
712
713       re2c:state:nextlabel = 0;
714              Used when -f --storable-state is active to control  whether  the
715              YYGETSTATE  block  is followed by a yyNext: label line.  Instead
716              of using yyNext, you can usually also use configuration startla‐
717              bel to force a specific start label or default to yy0 as a start
718              label. Instead of using a dedicated label, it is often better to
719              separate  the  YYGETSTATE  code  from the actual scanner code by
720              placing a /*!getstate:re2c*/ comment.
721
722       re2c:tags:expression = '@@';
723              Customizes  the  way  re2c  addresses  tag  variables:   by
724              default  it emits expressions of the form yyt<N>, but this might
725              be inconvenient if tag variables are  defined  as  fields  in  a
726              struct,  or for any other reason require special accessors.  For
727              example, setting re2c:tags:expression =  p->@@  will  result  in
728              p->yyt<N>.
729
730       re2c:tags:prefix = 'yyt';
731              Overrides the prefix of tag variables.
732
733       re2c:variable:yyaccept = 'yyaccept';
734              Overrides the name of the yyaccept variable.
735
736       re2c:variable:yybm = 'yybm';
737              Overrides the name of the yybm variable.
738
739       re2c:variable:yych = 'yych';
740              Overrides the name of the yych variable.
741
742       re2c:variable:yyctable = 'yyctable';
743              When  both  -c  --conditions and -g --computed-gotos are active,
744              re2c will use this variable to generate a static jump table  for
745              YYGETCONDITION.
746
747       re2c:variable:yystable = 'yystable';
748              Deprecated.
749
750       re2c:variable:yytarget = 'yytarget';
751              Overrides the name of the yytarget variable.
752
753       re2c:yybm:hex = 0;
754              If set to zero, a decimal table will be used. Otherwise, a hexa‐
755              decimal table will be generated.
756
757       re2c:yych:conversion = 0;
758              When this setting is non zero, re2c automatically generates con‐
759              version  code  whenever  yych  gets read. In this case, the type
760              must be defined using re2c:define:YYCTYPE.
761
762       re2c:yych:emit = 1;
763              Set this to zero to suppress the generation of yych.
764
765       re2c:yyfill:check = 1;
766              This can be set to 0 to suppress the generation of YYCURSOR  and
767              YYLIMIT  based  precondition  checks. This option is useful when
768              YYLIMIT + YYMAXFILL is always accessible.
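
       One way to make YYLIMIT + YYMAXFILL always accessible is to pad  the
       input  buffer  with YYMAXFILL extra bytes.  The sketch below is only
       an illustration: the helper name make_padded_buffer is hypothetical,
       and  the  value of YYMAXFILL is made up (normally it is generated by
       re2c, for example with the /*!max:re2c*/ directive).  It also assumes
       that the lexer has a rule that stops on the zero padding.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical value; in a real lexer this constant is generated by re2c. */
#define YYMAXFILL 8

/* Copy `len` bytes of input into a fresh buffer followed by YYMAXFILL zero
 * bytes, so that reads up to (buffer + len + YYMAXFILL) stay in bounds.
 * Returns NULL if allocation fails; the caller frees the result. */
static unsigned char *make_padded_buffer(const unsigned char *input, size_t len)
{
    assert(input != NULL || len == 0);

    unsigned char *buf = malloc(len + YYMAXFILL);
    if (buf == NULL) {
        return NULL;
    }
    if (len > 0) {
        memcpy(buf, input, len);
    }
    memset(buf + len, 0, YYMAXFILL);   /* the padding itself */
    return buf;
}
```

       With such a buffer, YYCURSOR would start at the returned pointer  and
       YYLIMIT would be set to that pointer plus len.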
769
770       re2c:yyfill:enable = 1;
771              Set this to zero to suppress the generation of YYFILL (n).  When
772              using  this,  be  sure to verify that the generated scanner does
773              not read beyond the available input, as allowing  such  behavior
774              might introduce severe security issues to your programs.
775
776       re2c:yyfill:parameter = 1;
777              Controls  the argument in the parentheses that follow YYFILL. If
778              zero, the argument is omitted.  If  non-zero,  the  argument  is
779              generated unless re2c:define:YYFILL:naked is set to non-zero.
780
781   REGULAR EXPRESSIONS
782       re2c uses the following syntax for regular expressions:
783
784       · "foo" case-sensitive string literal
785
786       · 'foo' case-insensitive string literal
787
788       · [a-xyz], [^a-xyz] character class (possibly negated)
789
790       · . any character except newline
791
792       · R \ S difference of character classes R and S
793
794       · R* zero or more occurrences of R
795
796       · R+ one or more occurrences of R
797
798       · R? optional R
799
800       · R{n} repetition of R exactly n times
801
802       · R{n,} repetition of R at least n times
803
804       · R{n,m} repetition of R from n to m times
805
806       · (R)  just  R;  parentheses  are  used  to  override precedence or for
807         POSIX-style submatch
808
809       · R S concatenation: R followed by S
810
811       · R | S alternative: R or S
812
813       · R / S lookahead: R followed by S, but S is not consumed
814
815       · name the regular expression defined as name (or literal string "name"
816         in Flex compatibility mode)
817
818       · {name}  the  regular expression defined as name in Flex compatibility
819         mode
820
821       · @stag an s-tag: saves the last input position at which @stag  matches
822         in a variable named stag
823
824       · #mtag an m-tag: saves all input positions at which #mtag matches in a
825         variable named mtag
826
827       Character classes and string literals may contain the following  escape
828       sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and hexa‐
829       decimal escapes \xhh, \uhhhh and \Uhhhhhhhh.
830

SUBMATCH EXTRACTION

832       re2c supports two kinds of submatch extraction.
833
834       The first option is -P  --posix-captures:  it  enables  POSIX-compliant
835       capturing  groups.   In  this  mode  parentheses in regular expressions
836       denote the beginning and the end of capturing groups; the whole regular
837       expression is group number zero.  The number of groups for the matching
838       rule is stored in a variable yynmatch, and submatch results are  stored
839       in yypmatch array.  Both yynmatch and yypmatch should be defined by the
840       user; note that yypmatch size must be at least [yynmatch  *  2].   re2c
841       provides  a  directive  /*!maxnmatch:re2c*/  that  defines  a  constant
842       YYMAXNMATCH: the maximal value of yynmatch among all rules.  Note  that
843       re2c  implements  POSIX-compliant  disambiguation:  each  subexpression
844       matches as long as possible, and subexpressions that start  earlier  in
845       regular expression have priority over those starting later.
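
       The following sketch shows how submatch results might be  read  once
       a  rule  has  matched.  It is not generated code: it assumes that the
       boundaries of group i are stored in yypmatch[2 * i] (start)  and
       yypmatch[2 * i + 1]  (end), that they are pointers into the input,
       and the helper name copy_group and the value  of  YYMAXNMATCH  are
       hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bound; normally produced by the maxnmatch directive. */
#define YYMAXNMATCH 4

/* Copy capturing group `i` into `out` as a NUL-terminated string.
 * Returns the group length, or -1 if the index is out of range, the group
 * has no value, or the result does not fit into `out`.
 * Assumed layout: yypmatch[2*i] is the start, yypmatch[2*i + 1] the end. */
static long copy_group(const char *const *yypmatch, size_t yynmatch,
                       size_t i, char *out, size_t out_size)
{
    assert(yypmatch != NULL && out != NULL);

    if (i >= yynmatch) {
        return -1;
    }
    const char *start = yypmatch[2 * i];
    const char *end = yypmatch[2 * i + 1];
    if (start == NULL || end == NULL || end < start) {
        return -1;
    }
    size_t len = (size_t)(end - start);
    if (len + 1 > out_size) {
        return -1;
    }
    memcpy(out, start, len);
    out[len] = '\0';
    return (long)len;
}
```

       Note that the array passed to the helper is  declared  with  at  least
       YYMAXNMATCH * 2 elements, as required above.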
846
847       The second option is -T --tags.  With this option one can use standalone
848       tags of the form @stag and  #mtag  instead  of  capturing  parentheses,
849       where stag and mtag are arbitrary user-defined names.  Tags can be used
850       anywhere inside of a regular expression;  semantically  they  are  just
851       position  markers.   Tags  of  the  form  @stag are called s-tags: they
852       denote a single submatch value (the last input position where this  tag
853       matched).  Tags of the form #mtag are called m-tags: they denote multi‐
854       ple submatch values (the whole history of  repetitions  of  this  tag).
855       All  tags  should  be  defined by the user as variables with the corre‐
856       sponding names.  With standalone tags re2c uses leftmost greedy  disam‐
857       biguation:  submatch positions correspond to the leftmost matching path
858       through the regular expression.
859
860       With both --posix-captures and --tags options re2c generates  a  number
861       of  tag variables that are used by the lexer to track multiple possible
862       versions of each tag (multiple versions are caused by possible  ambigu‐
863       ity  of  submatch).  When a rule matches, ambiguity is resolved and all
864       tags of this rule (or capturing parentheses, which are also implemented
865       as  tags) are initialized with the values of appropriate tag variables.
866       Note that there is no one-to-one correspondence between  tag  variables
867       and  tags:  the same tag variable may be reused for different tags, and
868       one tag may require multiple tag variables to hold  all  its  ambiguous
869       versions.   The  exact  number of tag variables is unknown to the user;
870       this number is determined by re2c.  However, tag  variables  should  be
871       defined  by  the  user, because it might be necessary to update them in
872       YYFILL   and   store   them   between   invocations   of   lexer   with
873       --storable-state    option.    Therefore   re2c   provides   directives
874       /*!stags:re2c ... */ and /*!mtags:re2c ...  */  that  can  be  used  to
875       declare, initialize and manipulate tag variables.
876
877       S-tags must support the following operations:
878
879       · save  input  position  to  s-tag:  t  = YYCURSOR with default API, or
880         user-defined operation YYSTAGP (t) with generic API
881
882       · save  default  value  to  s-tag:  t  =  NULL  with  default  API,  or
883         user-defined operation YYSTAGN (t) with generic API
884
885       · copy one s-tag to another: t1 = t2
886
887       M-tags must support the following operations:
888
889       · append  input  position  to m-tag: user-defined operation YYMTAGP (t)
890         with both default and generic API
891
892       · append default value to m-tag:  user-defined  operation  YYMTAGN  (t)
893         with both default and generic API
894
895       · copy one m-tag to another: t1 = t2
896
897       S-tags  can  be  implemented  as  scalar  values (pointers or offsets).
898       M-tags need a more complex representation, as  they  need  to  store  a
899       sequence  of tag values.  The most naive and inefficient representation
900       of m-tag is a list (array, vector) of tag values; a more efficient rep‐
901       resentation  is  to  store  all  m-tags in a prefix-tree represented as
902       array of nodes (v, p), where v is tag value and p is a pointer to  par‐
903       ent node.
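
       The prefix-tree representation of  m-tags  can  be  sketched  roughly
       as  follows.   This  is  a minimal illustration, not code shipped with
       re2c: all names (mtag_node, mtag_tree, mtag_append, mtag_history) and
       the  fixed  pool  size are hypothetical, and tag values are assumed to
       be input offsets.  An m-tag is just the index of the last node of its
       history, so copying one m-tag to another is a plain assignment.

```c
#include <assert.h>
#include <stddef.h>

#define MTAG_ROOT (-1)     /* marks the empty history */
#define MTAG_CAPACITY 64   /* hypothetical fixed pool size */

typedef struct {
    long value;   /* v: tag value (here an input offset) */
    int parent;   /* p: index of the parent node, or MTAG_ROOT */
} mtag_node;

typedef struct {
    mtag_node nodes[MTAG_CAPACITY];
    int count;    /* number of nodes in use */
} mtag_tree;

/* Append `value` to the history ending at `*tag`; afterwards `*tag` refers
 * to the new node. Returns 0 on success, -1 if the pool is full. */
static int mtag_append(mtag_tree *tree, int *tag, long value)
{
    assert(tree != NULL && tag != NULL);
    if (tree->count >= MTAG_CAPACITY) {
        return -1;
    }
    tree->nodes[tree->count].value = value;
    tree->nodes[tree->count].parent = *tag;
    *tag = tree->count++;
    return 0;
}

/* Write the history of `tag` into `out`, oldest value first. Returns the
 * number of values, or 0 if the history is empty or longer than `max`. */
static size_t mtag_history(const mtag_tree *tree, int tag, long *out, size_t max)
{
    size_t n = 0;
    for (int i = tag; i != MTAG_ROOT; i = tree->nodes[i].parent) {
        n++;
    }
    if (n > max) {
        return 0;
    }
    size_t k = n;
    for (int i = tag; i != MTAG_ROOT; i = tree->nodes[i].parent) {
        out[--k] = tree->nodes[i].value;
    }
    return n;
}
```

       Two m-tags that share a common prefix  of  their  histories  share  the
       corresponding nodes, which is what makes this representation cheaper
       than keeping a separate list per tag.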
904
905       For  further details see http://re2c.org/examples/examples.html page on
906       the website or re2c/examples/ subdirectory of re2c distribution.
907

STORABLE STATE

909       With -f --storable-state option re2c generates a lexer that  can  store
910       its  current  state,  return to the caller, and later resume operations
911       exactly where it left off.  The default mode of operation in re2c is  a
912       "pull"  model, where the lexer "pulls" more input whenever it needs it.
913       However, this mode of operation assumes that the lexer is the owner  of
914       the parsing loop, and that may not always be convenient.
915
916       Storable state is useful in exactly such situations: it lets you  con‐
917       struct  lexers that work in a "push" model, where data is fed to the
918       lexer  chunk  by chunk.  When the lexer needs more input, it stores its
919       state and returns to the caller.  Later, when more input becomes avail‐
920       able, it resumes operations exactly where it stopped.
921
922       Changes needed compared to the "pull" model:
923
924       · Define YYGETSTATE () and YYSETSTATE (state).
925
926       · Define  yych,  yyaccept  and  state variables as a part of persistent
927         lexer state.  state should be initialized to -1.
928
929       · YYFILL should return to the outer program instead of trying to supply
930         more input.  Return code should indicate that lexer needs more input.
931
932       · The  outer  program should recognize situations when lexer needs more
933         input and respond appropriately.
934
935       · Use /*!getstate:re2c*/ directive if it is necessary  to  execute  any
936         code before entering the lexer.
937
938       · Use  configurations state:abort and state:nextlabel to tweak the gen‐
939         erated code.
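
       The control flow of a "push" lexer can be illustrated with  a  small
       hand-written  analogue.   The code below is not re2c output and does
       not use YYGETSTATE, YYSETSTATE or YYFILL; it only mimics  their  roles
       under  simplifying  assumptions.   All  names (push_lexer, lexer_init,
       lexer_feed, lexer_next) are hypothetical.  The lexer  recognizes  runs
       of  decimal digits; when a chunk ends in the middle of a number, it
       saves its state, returns to the caller and resumes on the next call.

```c
#include <assert.h>
#include <stddef.h>

enum lex_status { LEX_NEED_MORE, LEX_NUMBER, LEX_END };

typedef struct {
    int state;            /* -1: fresh start; 1: suspended inside a number */
    unsigned long value;  /* number accumulated so far; survives suspension */
    const char *buf;      /* current chunk */
    size_t pos, len;      /* read position and length of the chunk */
    int last;             /* non-zero once the final chunk has been fed */
} push_lexer;

static void lexer_init(push_lexer *lx)
{
    lx->state = -1;       /* persistent state starts at -1 */
    lx->value = 0;
    lx->buf = "";
    lx->pos = lx->len = 0;
    lx->last = 0;
}

/* Hand the next chunk to the lexer; `last` marks the end of the stream. */
static void lexer_feed(push_lexer *lx, const char *chunk, size_t len, int last)
{
    assert(chunk != NULL);
    lx->buf = chunk;
    lx->pos = 0;
    lx->len = len;
    lx->last = last;
}

static int is_digit(char c) { return c >= '0' && c <= '9'; }

static enum lex_status lexer_next(push_lexer *lx, unsigned long *out)
{
    if (lx->state == 1) goto in_number;   /* resume where the lexer stopped */

    while (lx->pos < lx->len && !is_digit(lx->buf[lx->pos])) lx->pos++;
    if (lx->pos == lx->len) return lx->last ? LEX_END : LEX_NEED_MORE;
    lx->value = 0;

in_number:
    while (lx->pos < lx->len && is_digit(lx->buf[lx->pos])) {
        lx->value = lx->value * 10 + (unsigned long)(lx->buf[lx->pos] - '0');
        lx->pos++;
    }
    if (lx->pos == lx->len && !lx->last) {
        lx->state = 1;                    /* save state, ask for more input */
        return LEX_NEED_MORE;
    }
    lx->state = -1;
    *out = lx->value;
    return LEX_NUMBER;
}
```

       The outer program owns the loop: it calls lexer_next until it  returns
       LEX_NEED_MORE,  then  supplies  another  chunk  with  lexer_feed and
       calls lexer_next again.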
940

CONDITIONS

942       Conditions are enabled with -c --conditions.   This  option  makes  it
943       possible to encode multiple interrelated lexers within one re2c block.
944
945       Each  lexer  corresponds to a single condition.  It starts with a label
946       of the form yyc_name, where name is condition name and yyc  prefix  can
947       be  adjusted  with configuration re2c:condprefix.  Different lexers are
948       separated with  a  comment  /*  ***********************************  */
949       which can be adjusted with configuration re2c:cond:divider.
950
951       Furthermore,  each  condition  has a unique identifier of the form yyc‐
952       name, where name is condition name and yyc prefix can be adjusted  with
953       configuration  re2c:condenumprefix.   Identifiers have the type YYCOND‐
954       TYPE and should be  generated  with  /*!types:re2c*/  directive  or  -t
955       --type-header  option.   Users shouldn't define these identifiers manu‐
956       ally, as the order of conditions is not specified.
957
958       Before all conditions re2c generates entry code that checks the current
959       condition  identifier  and transfers control flow to the start label of
960       the active condition.  After matching  some  rule  of  this  condition,
961       lexer  may  either  transfer control flow back to the entry code (after
962       executing the associated action and optionally setting  another  condi‐
963       tion with =>), or use :=> shortcut and transition directly to the start
964       label of another condition (skipping the action and  the  entry  code).
965       Configuration re2c:cond:goto can be used to change the default behavior.
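
       The dispatch described above can be  pictured  with  a  hand-written
       analogue.   The  code  below is not re2c output; it is a sketch under
       the assumption of two conditions, with hypothetical identifiers yycinit
       and  yyccomment  and a hypothetical function name count_code_chars.
       The switch plays the role of the entry code, and each assignment to
       cond corresponds to setting another condition from a rule.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical condition identifiers; with re2c these would be generated. */
enum YYCONDTYPE { yycinit, yyccomment };

/* Count the characters of a NUL-terminated string that lie outside of
 * C-style block comments. */
static size_t count_code_chars(const char *s)
{
    assert(s != NULL);

    enum YYCONDTYPE cond = yycinit;
    size_t n = 0;

    for (;;) {
        switch (cond) {          /* entry code: jump to the active condition */
        case yycinit:
            if (s[0] == '\0') {
                return n;
            } else if (s[0] == '/' && s[1] == '*') {
                s += 2;
                cond = yyccomment;   /* comment opener: switch condition */
            } else {
                s++;
                n++;                 /* ordinary character */
            }
            break;
        case yyccomment:
            if (s[0] == '\0') {
                return n;            /* unterminated comment */
            } else if (s[0] == '*' && s[1] == '/') {
                s += 2;
                cond = yycinit;      /* comment closer: switch back */
            } else {
                s++;                 /* skip comment body */
            }
            break;
        }
    }
}
```

       In this analogue every rule returns to the entry  code.   A  direct
       transition  to  the  start of another condition would correspond to
       jumping to the other case without passing through the switch.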
966
967       Syntactically each rule must be preceded with a list of comma-separated
968       condition names or a wildcard * enclosed in angle  brackets  <  and  >.
969       Wildcard  means "any condition" and is semantically equivalent to list‐
970       ing all condition names.  Here regexp is a regular expression,  default
971       refers to the default rule *, and action is a block of C/C++ code.
972
973       · <conditions-or-wildcard>  regexp-or-default                 action
974
975       · <conditions-or-wildcard>  regexp-or-default  =>  condition  action
976
977       · <conditions-or-wildcard>  regexp-or-default  :=> condition
978
979       Rules with an exclamation mark ! in front of condition list have a spe‐
980       cial meaning: they have  no  regular  expression,  and  the  associated
981       action  is  merged  as  an entry code to actions of normal rules.  This
982       might be a convenient place to perform a routine task that is common to
983       all rules.
984
985       · <!conditions-or-wildcard>  action
986
987       Another  special  form  of rules with an empty condition list <> and no
988       regular expression  specifies  an  "entry condition"  that  can  be
989       used  to  execute  code  before entering the lexer.  It is semantically
990       equivalent to a condition with number zero, name 0 and an empty regular
991       expression.
992
993       · <>                 action
994
995       · <>  =>  condition  action
996
997       · <>  :=> condition
998

ENCODINGS

1000       re2c  supports  the  following encodings: ASCII (default), EBCDIC (-e),
1001       UCS-2 (-w), UTF-16 (-x), UTF-32 (-u) and UTF-8 (-8).  See also  inplace
1002       configuration re2c:flags.
1003
1004       The  following  concepts  should be clarified when talking about encod‐
1005       ings.  A code point is an abstract number that represents a single sym‐
1006       bol.  A code unit is the smallest unit of memory that is  used  in  the
1007       encoded text (it corresponds to one character in the input stream). One
1008       or  more  code  units  may  be needed to represent a single code point,
1009       depending on the encoding. In a fixed-length encoding, each code  point
1010       is  represented  with an equal number of code units. In variable-length
1011       encodings, different code points can be represented with different num‐
1012       ber of code units.
1013
1014       · ASCII  is a fixed-length encoding. Its code space includes 0x100 code
1015         points, from 0 to 0xFF. A code point is represented with exactly  one
1016         1-byte  code  unit,  which  has the same value as the code point. The
1017         size of YYCTYPE must be 1 byte.
1018
1019       · EBCDIC is a fixed-length encoding. Its code space includes 0x100 code
1020         points,  from 0 to 0xFF. A code point is represented with exactly one
1021         1-byte code unit, which has the same value as  the  code  point.  The
1022         size of YYCTYPE must be 1 byte.
1023
1024       · UCS-2  is  a  fixed-length  encoding. Its code space includes 0x10000
1025         code points, from 0 to 0xFFFF. One code  point  is  represented  with
1026         exactly  one  2-byte  code unit, which has the same value as the code
1027         point. The size of YYCTYPE must be 2 bytes.
1028
1029       · UTF-16 is a variable-length encoding. Its  code  space  includes  all
1030         Unicode  code  points,  from 0 to 0xD7FF and from 0xE000 to 0x10FFFF.
1031         One code point is represented with one or two 2-byte code units.  The
1032         size of YYCTYPE must be 2 bytes.
1033
1034       · UTF-32  is  a fixed-length encoding. Its code space includes all Uni‐
1035         code code points, from 0 to 0xD7FF and from 0xE000 to  0x10FFFF.  One
1036         code point is represented with exactly one 4-byte code unit. The size
1037         of YYCTYPE must be 4 bytes.
1038
1039       · UTF-8 is a variable-length encoding. Its code space includes all Uni‐
1040         code  code  points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One
1041         code point is represented with a sequence of one, two, three, or four
1042         1-byte code units. The size of YYCTYPE must be 1 byte.
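
       The relation between code points and code units can be made concrete
       with  a  few  helper functions.  They are an illustration only and are
       not part of re2c; the names  is_valid_code_point,  utf8_units,
       utf16_units and utf8_encode are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Unicode code points: 0 to 0xD7FF and 0xE000 to 0x10FFFF. */
static int is_valid_code_point(unsigned long cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Number of 1-byte code units needed in UTF-8, or 0 for an invalid value. */
static int utf8_units(unsigned long cp)
{
    if (!is_valid_code_point(cp)) return 0;
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Number of 2-byte code units needed in UTF-16, or 0 for an invalid value. */
static int utf16_units(unsigned long cp)
{
    if (!is_valid_code_point(cp)) return 0;
    return cp < 0x10000 ? 1 : 2;
}

/* Encode `cp` as UTF-8 into `out` (room for 4 units). Returns the number
 * of code units written, or 0 for an invalid value. */
static int utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    int n = utf8_units(cp);
    switch (n) {
    case 1:
        out[0] = (unsigned char)cp;
        break;
    case 2:
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    case 3:
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    case 4:
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    default:
        break;   /* invalid code point: nothing written */
    }
    return n;
}
```

       In UTF-32 and UCS-2 the corresponding function would always return 1
       for a value inside the code space of the encoding.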
1043
1044       In  Unicode,  values  from  range 0xD800 to 0xDFFF (surrogates) are not
1045       valid Unicode code points. Any encoded  sequence  of  code  units  that
1046       would  map  to  Unicode  code  points  in  the  range  0xD800-0xDFFF  is
1047       ill-formed. The user  can  control  how  re2c  treats  such  ill-formed
1048       sequences with the --encoding-policy <policy> switch.
1049
1050       For  some  encodings,  there are code units that never occur in a valid
1051       encoded stream (e.g., 0xFF byte in UTF-8).  If  the  generated  scanner
1052       must  check  for invalid input, the only correct way to do so is to use
1053       the default rule (*). Note that the full range rule ([^])  won't  catch
1054       invalid  code  units when a variable-length encoding is used ([^] means
1055       "any valid code point", whereas the default rule (*) means "any  possi‐
1056       ble code unit").
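
       For UTF-8, the code units that can never  occur  in  a  valid  stream
       are  0xC0,  0xC1  and  0xF5 to 0xFF.  The sketch below, which uses the
       hypothetical names utf8_unit_never_valid  and  count_impossible_units,
       checks  individual  code  units  only;  it  does  not verify that the
       remaining units form well-formed sequences.

```c
#include <assert.h>
#include <stddef.h>

/* Non-zero if `unit` cannot appear anywhere in valid UTF-8. */
static int utf8_unit_never_valid(unsigned char unit)
{
    return unit == 0xC0 || unit == 0xC1 || unit >= 0xF5;
}

/* Count the code units in `buf` that can never occur in valid UTF-8. */
static size_t count_impossible_units(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t n = 0;
    for (size_t i = 0; i < len; ++i) {
        if (utf8_unit_never_valid(buf[i])) {
            n++;
        }
    }
    return n;
}
```

       Such code units belong to no code point, which is why  they  fall
       under the default rule (*) and not under the full range rule ([^]).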
1057

GENERIC API

1059       By  default re2c operates on input using pointer-like primitives YYCUR‐
1060       SOR, YYMARKER, YYCTXMARKER, and YYLIMIT.  Normally pointer-like  primi‐
1061       tives  are defined as variables of type YYCTYPE*, but it is possible to
1062       use STL iterators or any other abstraction as long as it  syntactically
1063       fits into the following use cases:
1064
1065       · ++YYCURSOR;
1066
1067       · yych = *YYCURSOR;
1068
1069       · yych = *++YYCURSOR;
1070
1071       · yych = *(YYMARKER = YYCURSOR);
1072
1073       · yych = *(YYMARKER = ++YYCURSOR);
1074
1075       · YYMARKER = YYCURSOR;
1076
1077       · YYMARKER = ++YYCURSOR;
1078
1079       · YYCURSOR = YYMARKER;
1080
1081       · YYCTXMARKER = YYCURSOR + 1;
1082
1083       · YYCURSOR = YYCTXMARKER;
1084
1085       · if (YYLIMIT <= YYCURSOR) ...
1086
1087       · if ((YYLIMIT - YYCURSOR) < n) ...
1088
1089       · YYDEBUG (label, *YYCURSOR);
1090
1091       If  this  input  model  is  too restrictive, then it is possible to use
1092       the generic input API enabled with the --input custom option.  In this mode, all
1093       input operations are expressed in terms of the primitives below.  These
1094       primitives can be defined in any suitable  way;  one  doesn't  have  to
1095       stick  to  the  pointer semantics.  For example, it is possible to read
1096       input directly from file without any buffering, or  to  disable  YYFILL
1097       mechanism  and  perform  end-of-input  checking on each input character
1098       from inside of YYPEEK or YYSKIP.
1099
1100       · YYPEEK ()
1101
1102       · YYSKIP ()
1103
1104       · YYBACKUP ()
1105
1106       · YYBACKUPCTX ()
1107
1108       · YYSTAGP (t)
1109
1110       · YYSTAGN (t)
1111
1112       · YYMTAGP (t)
1113
1114       · YYMTAGN (t)
1115
1116       · YYRESTORE ()
1117
1118       · YYRESTORECTX ()
1119
1120       · YYRESTORETAG (t)
1121
1122       · YYLESSTHAN (n)
1123
1124       The default input model can be expressed in terms of the  generic  API
1125       as follows (except for YYMTAGP and YYMTAGN, which have no default implementation):
1126
1127       · #define  YYPEEK()          *YYCURSOR
1128
1129       · #define  YYSKIP()          ++YYCURSOR
1130
1131       · #define  YYBACKUP()        YYMARKER = YYCURSOR
1132
1133       · #define  YYBACKUPCTX()     YYCTXMARKER = YYCURSOR
1134
1135       · #define  YYRESTORE()       YYCURSOR = YYMARKER
1136
1137       · #define  YYRESTORECTX()    YYCURSOR = YYCTXMARKER
1138
1139       · #define  YYRESTORETAG(t)   YYCURSOR = t
1140
1141       · #define  YYLESSTHAN(n)     YYLIMIT - YYCURSOR < n
1142
1143       · #define  YYSTAGP(t)        t = YYCURSOR
1144
1145       · #define  YYSTAGN(t)        t = NULL
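
       The primitives do not have to be pointer-based.  The sketch  below
       defines  a  few  of  them  over an offset into a buffer and folds the
       end-of-input check into YYPEEK.  It is an illustration only: the input
       structure,  the  variable  name in and the function match_number are
       hypothetical, and match_number is a  hand-written  stand-in  for  the
       code  that re2c would generate for an integer with an optional frac‐
       tional part.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *buf;   /* input data */
    size_t len;        /* number of valid bytes in buf */
    size_t cursor;     /* current position (offset, not pointer) */
    size_t marker;     /* backup position */
} input;

/* Generic API primitives over a variable `in` of type input*.
 * YYPEEK yields 0 past the end of input, so no YYFILL is needed here. */
#define YYPEEK()    (in->cursor < in->len ? in->buf[in->cursor] : 0)
#define YYSKIP()    (++in->cursor)
#define YYBACKUP()  (in->marker = in->cursor)
#define YYRESTORE() (in->cursor = in->marker)

static int is_digit(char c) { return c >= '0' && c <= '9'; }

/* Match digits, optionally followed by '.' and more digits. If '.' is not
 * followed by a digit, fall back to the end of the integer part.
 * Returns the length of the match (0 if there is none). */
static size_t match_number(input *in)
{
    assert(in != NULL && in->buf != NULL);

    size_t start = in->cursor;
    char yych = YYPEEK();
    if (!is_digit(yych)) {
        return 0;
    }
    do {
        YYSKIP();
        yych = YYPEEK();
    } while (is_digit(yych));

    if (yych != '.') {
        return in->cursor - start;
    }
    YYBACKUP();                 /* remember the end of the integer part */
    YYSKIP();
    yych = YYPEEK();
    if (!is_digit(yych)) {
        YYRESTORE();            /* no fraction: roll back to the backup */
        return in->cursor - start;
    }
    do {
        YYSKIP();
        yych = YYPEEK();
    } while (is_digit(yych));
    return in->cursor - start;
}
```

       Because the primitives only touch offsets, the  same  approach  would
       work  for  input  that  is not stored in one contiguous buffer, pro‐
       vided that YYPEEK is adapted accordingly.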
1146

SEE ALSO

1148       You  can  find  more  information  about re2c at: http://re2c.org.  See
1149       also: flex(1), lex(1), quex (http://quex.sourceforge.net).
1150

AUTHORS

1152       Originally written by Peter Bumbulis in 1993; developed and  maintained
1153       by Brian Young, Marcus Boerger, Dan Nuffer and Ulya Trofimovich.  Below
1154       is a (more or less) full list of contributors retrieved  from  the  Git
1155       history and mailing lists:
1156
1157       Abs62,  asmwarrior, Ben Smith, Brian Young, CRCinAU, Dan Nuffer, Derick
1158       Rethans, Dimitri John Ledkov, Durimar, Eldar Zakirov, Emmanuel Mogenet,
1159       Hartmut Kaiser, jcfp, Jean-Claude Wippler, Jeff Trull, Jérôme Dumesnil,
1160       Jesse Buesking, joscherl, Julian Andres  Klode,  Marcus  Boerger,  Mike
1161       Gilbert,  nuno-lopes,  Oleksii Taran, paulmcq, Paulo Custodio, Perry E.
1162       Metzger,  philippschaefer,  Ross  Burton,  Rui   Maciel,   Ryan   Mast,
1163       Samuel006, Sergei Trofimovich, sirzooro, Tim Kelly, Ulya Trofimovich
1164

VERSION INFORMATION

1166       This manpage describes re2c version 1.1.1, package date 30 Aug 2018.
1167
1168
1169
1170
1171                                                                       RE2C(1)