RE2C(1)                                                                RE2C(1)

NAME

6       re2c - compile regular expressions to code
7

SYNOPSIS

9       re2c  [OPTIONS] INPUT [-o OUTPUT]
10
11       re2go [OPTIONS] INPUT [-o OUTPUT]
12

DESCRIPTION

14       re2c is a tool for generating fast lexical analyzers for C, C++ and Go.
15

SYNTAX

       A re2c program consists of normal code intermixed with re2c blocks and
       directives.  Each re2c block may contain definitions, configurations
       and rules.

       Definitions are of the form name = regexp; where name is an identifier
       that consists of letters, digits and underscores, and regexp is a
       regular expression.  Regular expressions may contain other
       definitions, but recursion is not allowed and each name should be
       defined before it is used.  Configurations are of the form
       re2c:config = value; where config is the configuration descriptor and
       value can be a number, a string or a special word.

       Rules consist of a regular expression followed by a semantic action
       (a block of code enclosed in curly braces { and }, or a raw one line
       of code preceded by := and ended with a newline that is not followed
       by whitespace).  If the input matches the regular expression, the
       associated semantic action is executed.  If multiple rules match, the
       longest match takes precedence.  If multiple rules match the same
       string, the earlier rule takes precedence.

       There are two special rules: the default rule * and the EOF rule $.
       The default rule should always be defined; it has the lowest priority
       regardless of its place and matches any code unit (not necessarily a
       valid character, see encoding support).  The EOF rule matches the end
       of input; it should be defined if the corresponding EOF handling
       method is used.  If start conditions are used, rules have more
       complex syntax.

       All rules of a single block are compiled into a deterministic
       finite-state automaton (DFA) and encoded in the form of a program in
       the target language.  The generated code interfaces with the outer
       program by means of a few user-defined primitives (see the program
       interface section).  Reusable blocks allow sharing rules, definitions
       and configurations between different blocks.
44

EXAMPLE

   Input file
          // re2c $INPUT -o $OUTPUT -i
          #include <assert.h>                 //
                                              // C/C++ code
          int lex(const char *YYCURSOR)       //
          {
              /*!re2c                         // start of re2c block
              re2c:define:YYCTYPE = char;     // configuration
              re2c:yyfill:enable = 0;         // configuration
              re2c:flags:case-ranges = 1;     // configuration
                                              //
              ident = [a-zA-Z_][a-zA-Z_0-9]*; // named definition
                                              //
              ident { return 0; }             // normal rule
              *     { return 1; }             // default rule
              */
          }                                   //
                                              //
          int main()                          //
          {                                   // C/C++ code
              assert(lex("_Zer0") == 0);      //
              return 0;                       //
          }                                   //

   Output file
          /* Generated by re2c */
          // re2c $INPUT -o $OUTPUT -i
          #include <assert.h>                 //
                                              // C/C++ code
          int lex(const char *YYCURSOR)       //
          {

          {
              char yych;
              yych = *YYCURSOR;
              switch (yych) {
              case 'A' ... 'Z':
              case '_':
              case 'a' ... 'z': goto yy4;
              default: goto yy2;
              }
          yy2:
              ++YYCURSOR;
              { return 1; }
          yy4:
              yych = *++YYCURSOR;
              switch (yych) {
              case '0' ... '9':
              case 'A' ... 'Z':
              case '_':
              case 'a' ... 'z': goto yy4;
              default: goto yy6;
              }
          yy6:
              { return 0; }
          }

          }                                   //
                                              //
          int main()                          //
          {                                   // C/C++ code
              assert(lex("_Zer0") == 0);      //
              return 0;                       //
          }                                   //

OPTIONS

114       -? -h --help
115              Show help message.
116
117       -1 --single-pass
118              Deprecated. Does nothing (single pass is the default now).
119
120       -8 --utf-8
121              Generate  a  lexer  that  reads  input  in UTF-8 encoding.  re2c
122              assumes that character range is 0 -- 0x10FFFF and character size
123              is 1 byte.
124
125       -b --bit-vectors
126              Optimize conditional jumps using bit masks. Implies -s.
127
128       -c --conditions --start-conditions
129              Enable  support of Flex-like "conditions": multiple interrelated
130              lexers within one block. Option --start-conditions is  a  legacy
131              alias; use --conditions instead.
132
133       --case-insensitive
134              Treat  single-quoted  and double-quoted strings as case-insensi‐
135              tive.
136
137       --case-inverted
138              Invert the meaning of single-quoted and  double-quoted  strings:
139              treat  single-quoted strings as case-sensitive and double-quoted
140              strings as case-insensitive.
141
       --case-ranges
              Collapse consecutive cases in a switch statement into a range
              of the form case low ... high:. This syntax is an extension of
              the C/C++ language, supported by compilers like GCC, Clang and
              Tcc.  The main advantage over using single cases is smaller
              generated C code and faster generation time, although for some
              compilers like Tcc it also results in smaller binary size.  This
              option doesn't work for the Go backend.
150
       -e --ecb
              Generate a lexer that reads input in EBCDIC encoding.  re2c
              assumes that character range is 0 -- 0xFF and character size is
              1 byte.
155
       --empty-class <match-empty | match-none | error>
              Define the way re2c treats empty character classes.  With
              match-empty (the default) empty class matches empty input (which
              is illogical, but backwards-compatible).  With match-none
              empty class always fails to match.  With error empty class
              raises a compilation error.
162
163       --encoding-policy <fail | substitute | ignore>
164              Define the way re2c treats Unicode surrogates.  With  fail  re2c
165              aborts with an error when a surrogate is encountered.  With sub‐
166              stitute re2c silently replaces surrogates with  the  error  code
167              point  0xFFFD.  With ignore (the default) re2c treats surrogates
168              as normal code points. The Unicode standard says that standalone
169              surrogates  are  invalid,  but real-world libraries and programs
170              behave in different ways.
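
              The three policies can be pictured with a small C helper. This
              is only a sketch of the semantics described above: the names
              surrogate_policy, is_surrogate and apply_policy are
              hypothetical and are not part of re2c, which applies the
              policy internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; re2c exposes no such API. */
enum surrogate_policy { POLICY_FAIL, POLICY_SUBSTITUTE, POLICY_IGNORE };

/* Surrogates occupy the code point range 0xD800 .. 0xDFFF. */
int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800u && cp <= 0xDFFFu;
}

/* Stores the resulting code point in *out and returns 0,
 * or returns -1 if the policy rejects the code point. */
int apply_policy(enum surrogate_policy p, uint32_t cp, uint32_t *out)
{
    assert(out != NULL);
    if (is_surrogate(cp)) {
        if (p == POLICY_FAIL) return -1;              /* fail: report an error */
        if (p == POLICY_SUBSTITUTE) cp = 0xFFFDu;     /* substitute: error code point */
        /* ignore: keep the surrogate as a normal code point */
    }
    *out = cp;
    return 0;
}
```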
171
172       -f --storable-state
173              Generate a lexer which can store its inner state.  This is  use‐
174              ful  in  push-model lexers which are stopped by an outer program
175              when there is not enough input, and then resumed when more input
176              becomes available. In this mode users should additionally define
177              YYGETSTATE() and YYSETSTATE(state) macros  and  variables  yych,
178              yyaccept and state as part of the lexer state.
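
              One possible way to keep this state together is a small
              structure owned by the outer program. The sketch below is an
              assumption, not generated code: struct lexer_state,
              lexer_state_init and remember are hypothetical names, and the
              initial value -1 for "not started yet" is a choice made for
              this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for everything that must survive between calls. */
struct lexer_state {
    int           state;     /* value saved by YYSETSTATE; -1 before the first call */
    unsigned int  yyaccept;  /* saved alongside the state */
    unsigned char yych;      /* last character read */
};

/* Both macros expect a variable `st` of type struct lexer_state* in scope. */
#define YYGETSTATE()   (st->state)
#define YYSETSTATE(s)  (st->state = (s))

void lexer_state_init(struct lexer_state *st)
{
    assert(st != NULL);
    st->state = -1;
    st->yyaccept = 0;
    st->yych = 0;
}

/* Stand-in for generated code: save a state number and read it back. */
int remember(struct lexer_state *st, int s)
{
    YYSETSTATE(s);
    return YYGETSTATE();
}
```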
179
180       -F --flex-syntax
181              Partial  support for Flex syntax: in this mode named definitions
182              don't need the equal sign and  the  terminating  semicolon,  and
183              when used they must be surrounded by curly braces. Names without
184              curly braces are treated as double-quoted strings.
185
186       -g --computed-gotos
187              Optimize conditional jumps using  non-standard  "computed  goto"
188              extension (which must be supported by the compiler). re2c gener‐
189              ates jump tables only in complex cases with a lot of conditional
190              branches.   Complexity   threshold   can   be   configured  with
191              cgoto:threshold configuration.  This  option  implies  -b.  This
192              option doesn't work for the Go backend.
193
194       -I PATH
195              Add  PATH to the list of locations which are used when searching
196              for include files. This option is  useful  in  combination  with
197              /*!include:re2c  ...  */  directive.  Re2c looks for FILE in the
198              directory of including file and in the  list  of  include  paths
199              specified by -I option.
200
201       -i --no-debug-info
202              Do  not output #line information. This is useful when the gener‐
203              ated code is tracked by some version control system or IDE.
204
       --input <default | custom>
              Specify the API used by the generated code to interface with
              user-defined code. Option default is the C API based on pointer
              arithmetic (it is the default for the C backend). Option custom
              is the generic API (it is the default for the Go backend).
210
211       --input-encoding <ascii | utf8>
212              Specify  the  way  re2c  parses regular expressions.  With ascii
213              (the default) re2c handles input as ASCII-encoded: any  sequence
214              of  code  units  is  a sequence of standalone 1-byte characters.
215              With utf8 re2c handles  input  as  UTF8-encoded  and  recognizes
216              multibyte characters.
217
218       --lang <c | go>
219              Specify  the  output  language. Supported languages are C and Go
220              (the default is C).
221
222       --location-format <gnu | msvc>
223              Specify location format in messages.   With  gnu  locations  are
224              printed as 'filename:line:column: ...'.  With msvc locations are
225              printed as 'filename(line,column) ...'.  Default is gnu.
226
227       --no-generation-date
228              Suppress date output in the generated file.
229
230       --no-version
231              Suppress version output in the generated file.
232
233       -o OUTPUT --output=OUTPUT
234              Specify the OUTPUT file.
235
236       -P --posix-captures
237              Enable submatch extraction with POSIX-style capturing groups.
238
239       -r --reusable
240              Allows reuse of re2c rules with /*!rules:re2c */ and /*!use:re2c
241              */  blocks.  Exactly  one rules-block must be present. The rules
242              are saved and used by every use-block that  follows,  which  may
243              add its own rules and configurations.
244
245       -S --skeleton
246              Ignore user-defined interface code and generate a self-contained
247              "skeleton" program.  Additionally,  generate  input  files  with
248              strings  derived  from  the regular grammar and compressed match
249              results that are used  to  verify  "skeleton"  behavior  on  all
250              inputs.  This option is useful for finding bugs in optimizations
251              and code generation. This option doesn't work for the  Go  back‐
252              end.
253
254       -s --nested-ifs
255              Use  nested if statements instead of switch statements in condi‐
256              tional jumps. This usually results in more efficient  code  with
257              non-optimizing compilers.
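
              The difference between the two styles can be seen on a
              hand-written dispatch over the digits 0 to 9. This is an
              illustration only, not actual re2c output; the function names
              are hypothetical.

```c
#include <assert.h>

/* Dispatch written as a switch statement (the default style). */
int is_digit_switch(unsigned char c)
{
    switch (c) {
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
        return 1;
    default:
        return 0;
    }
}

/* The same dispatch written as nested if statements (the -s style). */
int is_digit_ifs(unsigned char c)
{
    if (c <= '9') {
        if (c >= '0') return 1;
        return 0;
    }
    return 0;
}
```

              Both functions accept exactly the same characters; only the
              shape of the conditional jumps differs.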
258
259       -T --tags
260              Enable submatch extraction with tags.
261
262       -t HEADER --type-header=HEADER
263              Generate  a HEADER file that contains enum with condition names.
264              Requires -c option.
265
266       -u --unicode
267              Generate a lexer that reads UTF32-encoded  input.  Re2c  assumes
268              that  character  range  is 0 -- 0x10FFFF and character size is 4
269              bytes. This option implies -s.
270
271       -V --vernum
272              Show version information in MMmmpp format (major, minor, patch).
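
              A script or build step could split such a string into its
              components as sketched below. The helper decode_vernum is
              hypothetical, and the sketch assumes that each of the three
              fields is exactly two decimal digits, as the MMmmpp pattern
              suggests.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: split "MMmmpp" into major, minor and patch.
 * Returns 0 on success, -1 if the string is not six decimal digits. */
int decode_vernum(const char *s, int *major, int *minor, int *patch)
{
    assert(s && major && minor && patch);
    if (strlen(s) != 6) return -1;
    for (int i = 0; i < 6; ++i) {
        if (!isdigit((unsigned char)s[i])) return -1;
    }
    *major = (s[0] - '0') * 10 + (s[1] - '0');
    *minor = (s[2] - '0') * 10 + (s[3] - '0');
    *patch = (s[4] - '0') * 10 + (s[5] - '0');
    return 0;
}
```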
273
274       --verbose
275              Output a short message in case of success.
276
277       -v --version
278              Show version information.
279
280       -w --wide-chars
281              Generate a lexer that reads  UCS2-encoded  input.  Re2c  assumes
282              that  character  range  is  0  -- 0xFFFF and character size is 2
283              bytes. This option implies -s.
284
285       -x --utf-16
286              Generate a lexer that reads UTF16-encoded  input.  Re2c  assumes
287              that  character  range  is 0 -- 0x10FFFF and character size is 2
288              bytes. This option implies -s.
289
290   Debug options
291       -D --emit-dot
292              Instead of normal output generate lexer graph  in  .dot  format.
293              The  output  can  be  converted  to  an  image  with the help of
294              Graphviz (e.g. something like dot -Tpng -odfa.png dfa.dot).
295
296       -d --debug-output
297              Emit YYDEBUG in the generated code.  YYDEBUG should  be  defined
298              by  the user in the form of a void function with two parameters:
299              state (lexer state or -1) and symbol (current  input  symbol  of
300              type YYCTYPE).
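
              A minimal definition might look as follows. This is a sketch
              under assumptions: the function yydebug, the counter
              yydebug_calls and the output format are hypothetical choices,
              and YYCTYPE is taken to be unsigned char.

```c
#include <assert.h>
#include <stdio.h>

typedef unsigned char YYCTYPE;   /* assumed code unit type */

int yydebug_calls = 0;           /* hypothetical counter, handy in tests */

/* Void function with two parameters, as described above. */
void yydebug(int state, YYCTYPE symbol)
{
    assert(state >= -1);         /* lexer state or -1 */
    fprintf(stderr, "state %d, symbol 0x%02X\n", state, (unsigned)symbol);
    ++yydebug_calls;
}

#define YYDEBUG(state, symbol) yydebug((state), (symbol))
```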
301
302       --dump-adfa
303              Debug option: output DFA after tunneling (in .dot format).
304
305       --dump-cfg
306              Debug  option:  output  control  flow graph of tag variables (in
307              .dot format).
308
309       --dump-closure-stats
310              Debug option: output statistics on the number of states in  clo‐
311              sure.
312
313       --dump-dfa-det
314              Debug  option:  output DFA immediately after determinization (in
315              .dot format).
316
317       --dump-dfa-min
318              Debug option: output DFA after minimization (in .dot format).
319
320       --dump-dfa-tagopt
321              Debug option: output DFA after tag optimizations (in  .dot  for‐
322              mat).
323
324       --dump-dfa-tree
325              Debug  option:  output DFA under construction with states repre‐
326              sented as tag history trees (in .dot format).
327
328       --dump-dfa-raw
329              Debug  option:  output  DFA  under  construction  with  expanded
330              state-sets (in .dot format).
331
332       --dump-interf
333              Debug  option:  output  interference  table produced by liveness
334              analysis of tag variables.
335
336       --dump-nfa
337              Debug option: output NFA (in .dot format).
338
339   Internal options
340       --dfa-minimization <moore | table>
341              Internal option: DFA minimization algorithm used  by  re2c.  The
342              moore option is the Moore algorithm (it is the default). The ta‐
343              ble option is the "table  filling"  algorithm.  Both  algorithms
344              should produce the same DFA up to states relabeling; table fill‐
345              ing is simpler and much slower and serves as a reference  imple‐
346              mentation.
347
348       --eager-skip
349              Internal  option:  make  the  generated  lexer advance the input
350              position eagerly -- immediately after reading the input  symbol.
351              This  changes  the  default  behavior when the input position is
352              advanced lazily -- after transition  to  the  next  state.  This
353              option is implied by --no-lookahead.
354
355       --no-lookahead
356              Internal  option:  use  TDFA(0) instead of TDFA(1).  This option
357              has effect only with --tags or --posix-captures options.
358
       --no-optimize-tags
              Internal option: suppress optimization of tag variables (useful
              for debugging).
362
363       --posix-closure <gor1 | gtop>
364              Internal  option:  specify  shortest-path algorithm used for the
365              construction of epsilon-closure with POSIX disambiguation seman‐
366              tics:  gor1  (the default) stands for Goldberg-Radzik algorithm,
367              and gtop stands for "global topological order" algorithm.
368
369       --posix-prectable <complex | naive>
370              Internal option: specify the algorithm  used  to  compute  POSIX
371              precedence  table. The complex algorithm computes precedence ta‐
372              ble in one traversal of tag history tree and has quadratic  com‐
373              plexity  in  the  number  of TNFA states; it is the default. The
374              naive algorithm has worst-case cubic complexity in the number of
375              TNFA  states,  but  it  is  much simpler than complex and may be
376              slightly faster in non-pathological cases.
377
378       --stadfa
379              Internal option: use staDFA algorithm for  submatch  extraction.
380              The  main  difference with TDFA is that tag operations in staDFA
381              are placed in states, not on transitions.
382
383   Warnings
384       -W     Turn on all warnings.
385
386       -Werror
387              Turn warnings into errors. Note that this option  alone  doesn't
388              turn  on  any warnings; it only affects those warnings that have
389              been turned on so far or will be turned on later.
390
391       -W<warning>
392              Turn on warning.
393
394       -Wno-<warning>
395              Turn off warning.
396
397       -Werror-<warning>
398              Turn on warning and treat it as an error (this implies  -W<warn‐
399              ing>).
400
401       -Wno-error-<warning>
402              Don't  treat  this  particular warning as an error. This doesn't
403              turn off the warning itself.
404
405       -Wcondition-order
406              Warn if the generated program makes implicit  assumptions  about
407              condition numbering. One should use either the -t, --type-header
408              option or the /*!types:re2c*/ directive to generate a mapping of
409              condition names to numbers and then use the autogenerated condi‐
410              tion names.
411
412       -Wempty-character-class
413              Warn if a regular expression contains an empty character  class.
414              Trying  to  match  an  empty  character class makes no sense: it
415              should always fail.  However, for backwards  compatibility  rea‐
416              sons  re2c  allows  empty  character  classes and treats them as
417              empty strings.  Use  the  --empty-class  option  to  change  the
418              default behavior.
419
420       -Wmatch-empty-string
421              Warn  if  a  rule is nullable (matches an empty string).  If the
422              lexer runs in a loop and the empty match is  unintentional,  the
423              lexer may unexpectedly hang in an infinite loop.
424
425       -Wswapped-range
426              Warn  if  the  lower  bound of a range is greater than its upper
427              bound. The default  behavior  is  to  silently  swap  the  range
428              bounds.
429
430       -Wundefined-control-flow
431              Warn  if  some input strings cause undefined control flow in the
432              lexer (the faulty patterns are reported). This is the most  dan‐
433              gerous and most common mistake. It can be easily fixed by adding
434              the default rule * which has the lowest  priority,  matches  any
435              code unit, and consumes exactly one code unit.
436
437       -Wunreachable-rules
438              Warn about rules that are shadowed by other rules and will never
439              match.
440
441       -Wuseless-escape
442              Warn if a symbol is escaped when it shouldn't be.   By  default,
443              re2c  silently  ignores such escapes, but this may as well indi‐
444              cate a typo or an error in the escape sequence.
445
446       -Wnondeterministic-tags
447              Warn if a tag has n-th degree  of  nondeterminism,  where  n  is
448              greater than 1.
449
450       -Wsentinel-in-midrule
451              Warn  if  the sentinel symbol occurs in the middle of a rule ---
452              this may cause reads past the end of buffer, crashes  or  memory
453              corruption in the generated lexer. This warning is only applica‐
454              ble if the sentinel method of checking for the end of  input  is
455              used.   It  is set to an error if re2c:sentinel configuration is
456              used.
457

PROGRAM INTERFACE

459       Re2c has a flexible interface that gives the user both the freedom  and
460       the  responsibility to define how the generated code interacts with the
461       outer program.  There are two major options:
462
       · Pointer API.  It is also called "default API", since it was
         historically the first, and for a long time the only one.  This is a
         more restricted API based on C pointer arithmetic.  It consists of
         pointer-like primitives YYCURSOR, YYMARKER, YYCTXMARKER and YYLIMIT,
         which are normally defined as pointers of type YYCTYPE*.  Pointer API
         is enabled by default for the C backend, and it cannot be used with
         other backends that do not have pointer arithmetic.
470
471
472
473       · Generic API.  This is a less restricted  API  that  does  not  assume
474         pointer   semantics.   It  consists  of  primitives  YYPEEK,  YYSKIP,
475         YYBACKUP, YYBACKUPCTX, YYSTAGP, YYSTAGN, YYMTAGP, YYMTAGN, YYRESTORE,
476         YYRESTORECTX,  YYRESTORETAG,  YYSHIFT,  YYSHIFTSTAG,  YYSHIFTMTAG and
477         YYLESSTHAN.  For the C backend generic API is  enabled  with  --input
478         custom option or re2c:flags:input = custom; configuration; for the Go
479         backend it is enabled by default.  Generic API was added  in  version
480         0.14.   It is intentionally designed to give the user as much freedom
481         as possible in redefining the input model and the semantics  of  dif‐
482         ferent  actions  performed  by the generated code. As an example, one
483         can override YYPEEK to check for the end of input before reading  the
484         input character, or do some logging, etc.
485
486       Generic API has two styles:
487
488       · Function-like.   This  style  is  enabled with re2c:api:style = func‐
489         tions; configuration, and it is the default for C  backend.  In  this
490         style  API  primitives  should be defined as functions or macros with
491         parentheses, accepting the necessary arguments. For example, in C the
492         default pointer API can be defined in function-like style generic API
493         as follows:
494
            #define  YYPEEK()                 *YYCURSOR
            #define  YYSKIP()                 ++YYCURSOR
            #define  YYBACKUP()               YYMARKER = YYCURSOR
            #define  YYBACKUPCTX()            YYCTXMARKER = YYCURSOR
            #define  YYRESTORE()              YYCURSOR = YYMARKER
            #define  YYRESTORECTX()           YYCURSOR = YYCTXMARKER
            #define  YYRESTORETAG(tag)        YYCURSOR = tag
            #define  YYLESSTHAN(len)          YYLIMIT - YYCURSOR < len
            #define  YYSTAGP(tag)             tag = YYCURSOR
            #define  YYSTAGN(tag)             tag = NULL
            #define  YYSHIFT(shift)           YYCURSOR += shift
            #define  YYSHIFTSTAG(tag, shift)  tag += shift
507
508
509
510       · Free-form.  This style is enabled with  re2c:api:style  =  free-form;
511         configuration,  and  it  is the default for Go backend. In this style
512         API primitives can be  defined  as  free-form  pieces  of  code,  and
513         instead  of  arguments  they  have interpolated variables of the form
514         @@{name}, or optionally just @@ if there is only one argument. The @@
515         text  is  called  "sigil". It can be redefined to any other text with
516         re2c:api:sigil configuration. For example, the  default  pointer  API
517         can be defined in free-form style generic API as follows:
518
            re2c:define:YYPEEK       = "*YYCURSOR";
            re2c:define:YYSKIP       = "++YYCURSOR";
            re2c:define:YYBACKUP     = "YYMARKER = YYCURSOR";
            re2c:define:YYBACKUPCTX  = "YYCTXMARKER = YYCURSOR";
            re2c:define:YYRESTORE    = "YYCURSOR = YYMARKER";
            re2c:define:YYRESTORECTX = "YYCURSOR = YYCTXMARKER";
            re2c:define:YYRESTORETAG = "YYCURSOR = @@{tag}";
            re2c:define:YYLESSTHAN   = "YYLIMIT - YYCURSOR < @@{len}";
            re2c:define:YYSTAGP      = "@@{tag} = YYCURSOR";
            re2c:define:YYSTAGN      = "@@{tag} = NULL";
            re2c:define:YYSHIFT      = "YYCURSOR += @@{shift}";
            re2c:define:YYSHIFTSTAG  = "@@{tag} += @@{shift}";
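
       To show how function-like primitives are meant to be used, here is a
       hand-written stand-in for generated code with two rules, "ab" and
       "abcd", plus a default rule. It is a sketch, not re2c output: struct
       input and lex are hypothetical names, the primitives operate on a
       structure instead of bare pointers, and the input is assumed to be a
       non-empty NUL-terminated string so no end-of-input check is needed.

```c
#include <assert.h>
#include <stddef.h>

typedef char YYCTYPE;

/* Hypothetical input model: current position and backup position. */
struct input {
    const YYCTYPE *cur;
    const YYCTYPE *mar;
};

/* Function-like generic API; each macro expects `in` in scope. */
#define YYPEEK()     (*in->cur)
#define YYSKIP()     (++in->cur)
#define YYBACKUP()   (in->mar = in->cur)
#define YYRESTORE()  (in->cur = in->mar)

/* Returns 2 for "abcd", 1 for "ab", 0 for the default rule. */
int lex(struct input *in)
{
    assert(in && in->cur);
    if (YYPEEK() != 'a') { YYSKIP(); return 0; }  /* default: one code unit */
    YYSKIP();
    if (YYPEEK() != 'b') return 0;                /* default: only 'a' consumed */
    YYSKIP();
    YYBACKUP();                                   /* "ab" matched, remember position */
    if (YYPEEK() != 'c') return 1;
    YYSKIP();
    if (YYPEEK() != 'd') { YYRESTORE(); return 1; } /* longer match failed, roll back */
    YYSKIP();
    return 2;
}
```

       On the input "abcx" the longer rule fails after reading 'c', so
       YYRESTORE moves the position back to the point saved by YYBACKUP and
       the shorter rule "ab" wins.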
531
532   API primitives
533       Here is a list of API primitives that may be used by the generated code
534       in order to interface with the outer  program.   Which  primitives  are
535       needed depends on multiple factors, including the complexity of regular
536       expressions, input representation, buffering, the use of  various  fea‐
537       tures and so on.  All the necessary primitives should be defined by the
538       user in the form of macros, functions, variables, free-form  pieces  of
539       code  or any other suitable form.  Re2c does not (and cannot) check the
540       definitions, so if anything is missing or defined incorrectly the  gen‐
541       erated code will not compile.
542
543       YYCTYPE
544              The  type  of  the  input  characters  (code units).  For ASCII,
545              EBCDIC and UTF-8 encodings it should be 1-byte unsigned integer.
546              For  UTF-16  or  UCS-2 it should be 2-byte unsigned integer. For
547              UTF-32 it should be 4-byte unsigned integer.
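
              With the fixed-width types from <stdint.h> the three cases
              could be written as below. The typedef names and the helper
              is_non_ascii are hypothetical; the sketch picks the 1-byte
              type as YYCTYPE, as for a UTF-8 lexer.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t  yyctype_1byte;  /* ASCII, EBCDIC, UTF-8 */
typedef uint16_t yyctype_2byte;  /* UTF-16, UCS-2 */
typedef uint32_t yyctype_4byte;  /* UTF-32 */

typedef yyctype_1byte YYCTYPE;   /* choice for this sketch */

/* Sanity check of the sizes the encodings require. */
void check_sizes(void)
{
    assert(sizeof(yyctype_1byte) == 1);
    assert(sizeof(yyctype_2byte) == 2);
    assert(sizeof(yyctype_4byte) == 4);
}

/* With an unsigned type, code units above 0x7F compare as expected;
 * a signed char holding 0xC3 would compare as negative instead. */
int is_non_ascii(YYCTYPE c)
{
    return c >= 0x80;
}
```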
548
549       YYCURSOR
550              A pointer-like l-value that stores the  current  input  position
551              (usually  a pointer of type YYCTYPE*). Initially YYCURSOR should
552              point to the first input character. It is advanced by the gener‐
553              ated  code.   When  a  rule  matches, YYCURSOR points to the one
554              after the last matched character. It is used only in the default
555              C API.
556
       YYLIMIT
              A pointer-like r-value that stores the end of input position
              (usually a pointer of type YYCTYPE*). Initially YYLIMIT should
              point to the one after the last available input character. It is
              not changed by the generated code. The lexer compares YYCURSOR to
              YYLIMIT in order to determine if there are enough input
              characters left.  YYLIMIT is used only in the default C API.
564
       YYMARKER
              A pointer-like l-value (usually a pointer of type YYCTYPE*) that
              stores the position of the latest matched rule. It is used to
              restore the YYCURSOR position if the longer match fails and the
              lexer needs to roll back.  Initialization is not needed. YYMARKER
              is used only in the default C API.
571
572       YYCTXMARKER
573              A pointer-like l-value that stores the position of the  trailing
574              context  (usually a pointer of type YYCTYPE*). No initialization
575              is needed.  It is used only in the default C API, and only  with
576              the lookahead operator /.
577
578       YYFILL API  primitive  with one argument len.  The meaning of YYFILL is
579              to provide at least len more input characters or  fail.  If  EOF
580              rule  is  used, YYFILL should always return to the calling func‐
581              tion; the return value should be zero on success and non-zero on
582              failure. If EOF rule is not used, YYFILL return value is ignored
583              and it should not return on failure. Maximal  value  of  len  is
584              YYMAXFILL,  which can be generated with /*!max:re2c*/ directive.
585              The  definition  of  YYFILL  can  be  either  function-like   or
586              free-form  depending  on  the  API style (see re2c:api:style and
587              re2c:define:YYFILL:naked).
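
              The sketch below shows one way a refill routine with the
              "zero on success, non-zero on failure" convention could be
              written. All names (struct input, input_init, fill, BUFSIZE)
              are hypothetical, an in-memory string stands in for a file,
              and for brevity only the unread characters after the cursor
              are preserved; a real lexer would also have to keep the
              current token start and adjust YYMARKER and any tags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUFSIZE 16   /* hypothetical; a real buffer must hold at least YYMAXFILL */

struct input {
    char        buf[BUFSIZE];
    char       *cur;      /* plays the role of YYCURSOR */
    char       *lim;      /* plays the role of YYLIMIT */
    const char *src;      /* unread source data (stand-in for a file) */
    size_t      src_len;  /* number of unread source bytes */
};

void input_init(struct input *in, const char *src, size_t len)
{
    in->cur = in->lim = in->buf;
    in->src = src;
    in->src_len = len;
}

/* Returns 0 if at least `need` characters are available after the call,
 * and non-zero otherwise. */
int fill(struct input *in, size_t need)
{
    assert(need <= BUFSIZE);
    size_t have = (size_t)(in->lim - in->cur);
    if (have >= need) return 0;

    /* Shift the unread characters to the start of the buffer. */
    memmove(in->buf, in->cur, have);
    in->cur = in->buf;
    in->lim = in->buf + have;

    /* Append as much source data as fits. */
    size_t room = BUFSIZE - have;
    size_t n = in->src_len < room ? in->src_len : room;
    memcpy(in->lim, in->src, n);
    in->lim += n;
    in->src += n;
    in->src_len -= n;

    return (size_t)(in->lim - in->cur) >= need ? 0 : 1;
}

/* Function-like definition; expects `in` in scope. */
#define YYFILL(len) fill(in, (len))
```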
588
589       YYMAXFILL
590              An integral constant equal to the  maximal value of YYFILL argu‐
591              ment.  It can be generated with /*!max:re2c*/ directive.
592
       YYLESSTHAN
              A generic API primitive with one argument len.  It should be
              defined as an r-value of boolean type that equals true if and
              only if there are fewer than len input characters left.  The
              definition can be either function-like or free-form depending on
              the API style (see re2c:api:style).
599
600       YYPEEK A generic API primitive with no arguments.  It should be defined
601              as an r-value of type YYCTYPE that is equal to the character  at
602              the  current  input position. The definition can be either func‐
603              tion-like  or  free-form  depending  on  the  API   style   (see
604              re2c:api:style).
605
606       YYSKIP A  generic  API  primitive  with  no  arguments.  The meaning of
607              YYSKIP is to advance the current input position by  one  charac‐
608              ter.  The  definition  can  be either function-like or free-form
609              depending on the API style (see re2c:api:style).
610
611       YYBACKUP
612              A generic API primitive  with  no  arguments.   The  meaning  of
613              YYBACKUP  is  to save the current input position, which is later
614              restored with YYRESTORE.  The definition should be either  func‐
615              tion-like   or   free-form  depending  on  the  API  style  (see
616              re2c:api:style).
617
618       YYRESTORE
619              A generic API primitive with no arguments.  The meaning of YYRE‐
620              STORE  is  to  restore  the  current input position to the value
621              saved by  YYBACKUP.   The  definition  should  be  either  func‐
622              tion-like   or   free-form  depending  on  the  API  style  (see
623              re2c:api:style).
624
       YYBACKUPCTX
              A generic API primitive with no arguments.  The meaning of
              YYBACKUPCTX is to save the current input position as the
              position of the trailing context, which is later restored by
              YYRESTORECTX.  The definition should be either function-like or
              free-form depending on the API style (see re2c:api:style).
631
632       YYRESTORECTX
633              A generic API primitive with no arguments.  The meaning of YYRE‐
634              STORECTX  is to restore the trailing context position saved with
635              YYBACKUPCTX.  The definition should be either  function-like  or
636              free-form depending on the API style (see re2c:api:style).
637
638       YYRESTORETAG
639              A  generic  API primitive with one argument tag.  The meaning of
640              YYRESTORETAG is to restore the trailing context position to  the
641              value  of tag.  The definition should be either function-like or
642              free-form depending on the API style (see re2c:api:style).
643
644       YYSTAGP
645              A generic API primitive with one argument tag.  The  meaning  of
646              YYSTAGP  is to set tag value to the current input position.  The
647              definition should be either function-like or free-form depending
648              on the API style (see re2c:api:style).
649
650       YYSTAGN
              A generic API primitive with one argument tag.  The meaning of
              YYSTAGN is to set tag value to null (or some default value).  The
              definition should be either function-like or free-form depending
              on the API style (see re2c:api:style).
655
656       YYMTAGP
657              A generic API primitive with one argument tag.  The  meaning  of
658              YYMTAGP is to append the current position to the history of tag.
659              The definition  should  be  either  function-like  or  free-form
660              depending on the API style (see re2c:api:style).
661
662       YYMTAGN
663              A  generic  API primitive with one argument tag.  The meaning of
664              YYMTAGN is to append null (or some other default) value  to  the
665              history  of  tag.  The definition can be either function-like or
666              free-form depending on the API style (see re2c:api:style).
667
668       YYSHIFT
669              A generic API primitive with one argument shift.  The meaning of
670              YYSHIFT  is to shift the current input position by shift charac‐
671              ters (the shift value may be negative). The  definition  can  be
672              either  function-like  or  free-form  depending on the API style
673              (see re2c:api:style).
674
675       YYSHIFTSTAG
676              A generic  API primitive with two arguments, tag and shift.  The
677              meaning  of YYSHIFTSTAG is to shift tag by shift characters (the
678              shift value may be negative).   The  definition  can  be  either
679              function-like  or  free-form  depending  on  the  API style (see
680              re2c:api:style).
681
682       YYSHIFTMTAG
683              A generic API primitive with two arguments, tag and shift.   The
684              meaning  of YYSHIFTMTAG is to shift the latest value in the his‐
685              tory of tag by shift characters (the shift value  may  be  nega‐
686              tive).    The  definition  should  be  either  function-like  or
687              free-form depending on the API style (see re2c:api:style).
688
689       YYMAXNMATCH
690              An integral constant equal to the maximal number of  POSIX  cap‐
691              turing   groups  in  a  rule.  It  is  generated  with  /*!maxn‐
692              match:re2c*/ directive.
693
694       YYCONDTYPE
695              The type of the condition enum.  It should be  generated  either
696              with /*!types:re2c*/ directive or -t --type-header option.
697
698       YYGETCONDITION
699              An  API  primitive with zero arguments.  It should be defined as
700              an r-value of type YYCONDTYPE that is equal to the current  con‐
701              dition identifier. The definition can be either function-like or
702              free-form depending on the API  style  (see  re2c:api:style  and
703              re2c:define:YYGETCONDITION:naked).
704
705       YYSETCONDITION
706              An  API primitive with one argument cond.  The meaning of YYSET‐
707              CONDITION is to set the current condition  identifier  to  cond.
708              The  definition  should  be  either  function-like  or free-form
709              depending   on   the   API   style   (see   re2c:api:style   and
710              re2c:define:YYSETCONDITION@cond).
711
712       YYGETSTATE
713              An  API  primitive with zero arguments.  It should be defined as
714              an r-value of integer type that is equal to  the  current  lexer
715              state. Should be initialized to -1. The definition can be either
716              function-like or free-form  depending  on  the  API  style  (see
717              re2c:api:style and re2c:define:YYGETSTATE:naked).
718
719       YYSETSTATE
720              An API primitive with one argument state.  The meaning of YYSET‐
721              STATE is to set the current lexer state to state.   The  defini‐
722              tion  should  be  either function-like or free-form depending on
723              the  API  style  (see  re2c:api:style   and   re2c:define:YYSET‐
724              STATE@state).
725
726       YYDEBUG
727              A  debug  API  primitive  with  two arguments. It can be used to
728              debug the generated code (with -d --debug-output option).  YYDE‐
729              BUG  should  return  no  value  and  accept two arguments: state
730              (either a DFA state index or -1) and symbol (the  current  input
731              symbol).
732
733       yych   An l-value of type YYCTYPE that stores the current input charac‐
734              ter.  User definition is necessary only with -f --storable-state
735              option.
736
737       yyaccept
738              An  l-value  of unsigned integral type that stores the number of
739              the latest matched rule.  User definition is necessary only with
740              -f --storable-state option.
741
742       yynmatch
743              An  l-value  of unsigned integral type that stores the number of
744              POSIX capturing groups in the matched rule.  Used only  with  -P
745              --posix-captures option.
746
747       yypmatch
748              An array of l-values that are used to hold the tag values corre‐
749              sponding to the capturing  parentheses  in  the  matching  rule.
750              Array  length must be at least yynmatch * 2 (usually YYMAXNMATCH
751              * 2 is a good  choice).   Used  only  with  -P  --posix-captures
752              option.
753
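       The cursor-related primitives above (YYPEEK, YYSKIP, YYBACKUP,
       YYRESTORE and YYLESSTHAN) can be illustrated with a minimal sketch in
       plain C.  This is not re2c output: match_number is a hand-written
       stand-in for generated code, and the Input struct with its fields
       cur, mar and lim is a hypothetical layout chosen for this example.
       The primitives are defined in function-like style.

```c
#include <stddef.h>

/* Hypothetical lexer state; the names cur, mar and lim are illustrative. */
typedef struct {
    const char *cur;  /* current input position */
    const char *mar;  /* position saved by YYBACKUP */
    const char *lim;  /* one past the last input character */
} Input;

/* Function-like definitions of the generic API primitives. */
#define YYCTYPE          char
#define YYPEEK()         (*in->cur)
#define YYSKIP()         (++in->cur)
#define YYBACKUP()       (in->mar = in->cur)
#define YYRESTORE()      (in->cur = in->mar)
#define YYLESSTHAN(len)  ((size_t)(in->lim - in->cur) < (size_t)(len))

/* Hand-written stand-in for generated code: matches [0-9]+ ("." [0-9]+)?
 * and returns the length of the match, or 0 if there is none. */
static size_t match_number(Input *in)
{
    const char *start = in->cur;
    YYCTYPE yych;

    /* integer part: at least one digit */
    if (YYLESSTHAN(1)) return 0;
    yych = YYPEEK();
    if (yych < '0' || yych > '9') return 0;
    while (!YYLESSTHAN(1) && (yych = YYPEEK()) >= '0' && yych <= '9') YYSKIP();

    /* optional fraction: save the accepted position before trying it */
    YYBACKUP();
    if (!YYLESSTHAN(2) && YYPEEK() == '.') {
        YYSKIP();
        yych = YYPEEK();
        if (yych >= '0' && yych <= '9') {
            while (!YYLESSTHAN(1) && (yych = YYPEEK()) >= '0' && yych <= '9') YYSKIP();
        } else {
            YYRESTORE(); /* "." is not followed by a digit: fall back */
        }
    }
    return (size_t)(in->cur - start);
}
```

       Under these assumptions match_number consumes 12.5 from the input
       12.5x, but only 12 from 12.x, because YYRESTORE rewinds the position
       to the one saved by YYBACKUP.
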
754   Directives
755       Below  is the list of all directives provided by re2c (in no particular
756       order).  More information on each directive can be found in the related
757       sections.
758
759       /*!re2c ... */
760              A standard re2c block.
761
762       %{ ... %}
763              A standard re2c block in -F --flex-support mode.
764
765       /*!rules:re2c ... */
766              A reusable re2c block (requires -r --reuse option).
767
768       /*!use:re2c ... */
769              A   block   that  reuses  previous  rules-block  specified  with
770              /*!rules:re2c ... */ (requires -r --reuse option).
771
772       /*!ignore:re2c ... */
              A block whose contents are ignored and cut off from the output
              file.
775
776       /*!max:re2c*/
777              This  directive  is  substituted  with  the  macro-definition of
778              YYMAXFILL.
779
780       /*!maxnmatch:re2c*/
781              This directive  is  substituted  with  the  macro-definition  of
782              YYMAXNMATCH (requires -P --posix-captures option).
783
784       /*!getstate:re2c*/
785              This directive is substituted with conditional dispatch on lexer
786              state (requires -f --storable-state option).
787
788       /*!types:re2c ... */
789              This directive is substituted with the definition  of  condition
790              enum (requires -c --conditions option).
791
792       /*!stags:re2c ... */, /*!mtags:re2c ... */
793              These  directives  allow one to specify a template piece of code
794              that is expanded for  each  s-tag/m-tag  variable  generated  by
795              re2c. This block has two optional configurations: format = "@@";
796              (specifies the template where @@ is substituted with the name of
797              each  tag variable), and separator = ""; (specifies the piece of
798              code used to join the generated pieces for different  tag  vari‐
799              ables).
800
801       /*!include:re2c FILE */
802              This  directive allows one to include FILE (in the same sense as
803              #include directive in C/C++).
804
805       /*!header:re2c:on*/
806              This directive marks the start of header file. Everything  after
807              it  and  up  to  the following /*!header:re2c:off*/ directive is
808              processed by re2c and written to the header file specified  with
809              -t --type-header option.
810
811       /*!header:re2c:off*/
812              This  directive  marks  the  end  of  header  file  started with
813              /*!header:re2c:on*/.
814
815   Configurations
816       re2c:flags:t, re2c:flags:type-header
817              Specify the name of the generated header file  relative  to  the
818              directory  of  the  output file. (Same as -t, --type-header com‐
819              mand-line option except that the filepath is relative.)
820
821       re2c:flags:input
822              Same as --input command-line option.
823
824       re2c:api:style
825              Allows one to specify the style of generic API. Possible  values
826              are  functions  and free-form. With functions style (the default
827              for the C backend) API primitives  behave  like  functions,  and
828              re2c  generates parentheses with an argument list after the name
829              of each primitive.  With free-form style (the default for the Go
830              backend) re2c treats API definitions as interpolated strings and
831              substitutes argument placeholders with the actual argument  val‐
832              ues.   This  option  can be overridden by options for individual
833              API primitives, e.g. re2c:define:YYFILL:naked for YYFILL.
834
835       re2c:api:sigil
836              Allows one to specify the "sigil" symbol  (or  string)  that  is
837              used  to  recognize  argument placeholders in the definitions of
838              generic API primitives.  The default value is @@.   Placeholders
839              start with sigil, followed by the argument name in curly braces.
840              For example, if sigil is set to $, then placeholders  will  have
841              the  form  ${name}. Single-argument APIs may use shorthand nota‐
842              tion without the name in braces. This option can  be  overridden
843              by    options    for    individual    API    primitives,    e.g.
844              re2c:define:YYFILL@len for YYFILL.
845
846       re2c:define:YYCTYPE
847              Defines YYCTYPE (see the user interface section).
848
849       re2c:define:YYCURSOR
850              Defines C API primitive YYCURSOR (see the  user  interface  sec‐
851              tion).
852
853       re2c:define:YYLIMIT
854              Defines  C  API  primitive  YYLIMIT (see the user interface sec‐
855              tion).
856
857       re2c:define:YYMARKER
858              Defines C API primitive YYMARKER (see the  user  interface  sec‐
859              tion).
860
861       re2c:define:YYCTXMARKER
862              Defines C API primitive YYCTXMARKER (see the user interface sec‐
863              tion).
864
865       re2c:define:YYFILL
866              Defines API primitive YYFILL (see the user interface section).
867
868       re2c:define:YYFILL@len
869              Specifies the sigil used for  argument  substitution  in  YYFILL
870              definition.   Defaults   to  @@.   Overrides  the  more  generic
871              re2c:api:sigil configuration.
872
873       re2c:define:YYFILL:naked
874              Allows one to override re2c:api:style for YYFILL.  Value 0  cor‐
875              responds to free-form API style.
876
877       re2c:yyfill:enable
878              Defaults  to 1 (YYFILL is enabled). Set this to zero to suppress
879              the generation of YYFILL. Use warnings (-W option) and re2c:sen‐
880              tinel  configuration  to  verify that the generated lexer cannot
881              read past the end of input, as this might introduce severe secu‐
882              rity issues to your programs.
883
884       re2c:yyfill:parameter
885              Controls  the  argument  in  the parentheses that follow YYFILL.
886              Defaults to 1, which means that the argument  is  generated.  If
887              zero,   the   argument   is  omitted.  Can  be  overridden  with
888              re2c:define:YYFILL:naked or re2c:api:style.
889
890       re2c:eof
891              Specifies the sentinel symbol used with EOF rule $ to check  for
892              the end of input in the generated lexer. The default value is -1
893              (EOF rule is not used). Other possible values include all  valid
894              code units. Only decimal numbers are recognized.
895
896       re2c:sentinel
897              Specifies  the  sentinel symbol used with the sentinel method of
898              checking for the end of input in the generated lexer  (the  case
899              when  bounds  checking  is disabled with re2c:yyfill:enable = 0;
900              and EOF rule $ is not used). This configuration does not  affect
901              code  generation. It is used by re2c to verify that the sentinel
902              symbol is not allowed in the middle of  the  rule,  and  prevent
903              possible  reads  past  the end of buffer in the generated lexer.
904              The default value is -1 (re2c assumes that the  sentinel  symbol
905              is  0,  which  is  the  most common case). Other possible values
906              include all valid code units. Only decimal  numbers  are  recog‐
907              nized.
908
909       re2c:define:YYLESSTHAN
910              Defines generic API primitive YYLESSTHAN (see the user interface
911              section).
912
913       re2c:yyfill:check
              Setting this to zero allows one to suppress the generation of
              the YYFILL check (YYLESSTHAN in generic API or YYLIMIT-based
              comparison in default C API).  This configuration is useful
              when the necessary input is always available.  It defaults to 1
              (the check is generated).
919
920       re2c:label:yyFillLabel
921              Allows one to change the prefix of YYFILL labels (used with  EOF
922              rule or with storable states).
923
924       re2c:define:YYPEEK
925              Defines  generic  API  primitive  YYPEEK (see the user interface
926              section).
927
928       re2c:define:YYSKIP
929              Defines generic API primitive YYSKIP  (see  the  user  interface
930              section).
931
932       re2c:define:YYBACKUP
933              Defines  generic  API primitive YYBACKUP (see the user interface
934              section).
935
936       re2c:define:YYBACKUPCTX
937              Defines generic API primitive YYBACKUPCTX (see the  user  inter‐
938              face section).
939
940       re2c:define:YYRESTORE
941              Defines  generic API primitive YYRESTORE (see the user interface
942              section).
943
944       re2c:define:YYRESTORECTX
945              Defines generic API primitive YYRESTORECTX (see the user  inter‐
946              face section).
947
948       re2c:define:YYRESTORETAG
949              Defines  generic API primitive YYRESTORETAG (see the user inter‐
950              face section).
951
952       re2c:define:YYSHIFT
953              Defines generic API primitive YYSHIFT (see  the  user  interface
954              section).
955
956       re2c:define:YYSHIFTMTAG
957              Defines  generic  API primitive YYSHIFTMTAG (see the user inter‐
958              face section).
959
960       re2c:define:YYSHIFTSTAG
961              Defines generic API primitive YYSHIFTSTAG (see the  user  inter‐
962              face section).
963
964       re2c:define:YYSTAGN
965              Defines  generic  API  primitive YYSTAGN (see the user interface
966              section).
967
968       re2c:define:YYSTAGP
969              Defines generic API primitive YYSTAGP (see  the  user  interface
970              section).
971
972       re2c:define:YYMTAGN
973              Defines  generic  API  primitive YYMTAGN (see the user interface
974              section).
975
976       re2c:define:YYMTAGP
977              Defines generic API primitive YYMTAGP (see  the  user  interface
978              section).
979
980       re2c:flags:T, re2c:flags:tags
981              Same as -T --tags command-line option.
982
983       re2c:flags:P, re2c:flags:posix-captures
984              Same as -P --posix-captures command-line option.
985
986       re2c:tags:expression
987              Allows  one  to  customize the way re2c addresses tag variables.
988              By default re2c generates expressions of the form  yyt<N>.  This
989              might  be inconvenient, for example if tag variables are defined
990              as fields in a struct. Re2c recognizes placeholder of  the  form
991              @@{tag}  or  @@ and replaces it with the actual tag name.  Sigil
992              @@ can be  redefined  with  re2c:api:sigil  configuration.   For
993              example,  setting  re2c:tags:expression  =  "p->@@";  results in
994              expressions of the form p->yyt<N> in the generated code.
995
996       re2c:tags:prefix
997              Allows one to override the prefix of tag variables (defaults  to
998              yyt).
999
1000       re2c:flags:lookahead
1001              Same as inverted --no-lookahead command-line option.
1002
1003       re2c:flags:optimize-tags
1004              Same as inverted --no-optimize-tags command-line option.
1005
1006       re2c:define:YYCONDTYPE
1007              Defines YYCONDTYPE (see the user interface section).
1008
1009       re2c:define:YYGETCONDITION
1010              Defines  API  primitive  YYGETCONDITION  (see the user interface
1011              section).
1012
1013       re2c:define:YYGETCONDITION:naked
1014              Allows one to override re2c:api:style for YYGETCONDITION.  Value
1015              0 corresponds to free-form API style.
1016
1017       re2c:define:YYSETCONDITION
1018              Defines  API  primitive  YYSETCONDITION  (see the user interface
1019              section).
1020
1021       re2c:define:YYSETCONDITION@cond
1022              Specifies the sigil used for argument substitution in  YYSETCON‐
1023              DITION  definition. The default value is @@.  Overrides the more
1024              generic re2c:api:sigil configuration.
1025
1026       re2c:define:YYSETCONDITION:naked
1027              Allows one to override re2c:api:style for YYSETCONDITION.  Value
1028              0 corresponds to free-form API style.
1029
1030       re2c:cond:goto
              Allows one to customize the goto statements used with the
              shortcut :=> rules in conditions.  The default value is goto @@;.
              Placeholders are substituted with condition name (see
              re2c:api:sigil and re2c:cond:goto@cond).
1035
1036       re2c:cond:goto@cond
1037              Specifies  the  sigil  used   for   argument   substitution   in
1038              re2c:cond:goto  definition.  The default value is @@.  Overrides
1039              the more generic re2c:api:sigil configuration.
1040
1041       re2c:cond:divider
              Defines the divider for condition blocks.  The default value is
              /* *********************************** */.  Placeholders are
              substituted with condition name (see re2c:api:sigil and
              re2c:cond:divider@cond).
1046
1047       re2c:cond:divider@cond
1048              Specifies   the   sigil   used   for  argument  substitution  in
1049              re2c:cond:divider definition. The default value  is  @@.   Over‐
1050              rides the more generic re2c:api:sigil configuration.
1051
1052       re2c:condprefix
1053              Specifies  the  prefix  used  for condition labels.  The default
1054              value is yyc_.
1055
1056       re2c:condenumprefix
1057              Specifies  the  prefix  used  for  condition  identifiers.   The
1058              default value is yyc.
1059
1060       re2c:define:YYGETSTATE
1061              Defines  API  primitive  YYGETSTATE (see the user interface sec‐
1062              tion).
1063
1064       re2c:define:YYGETSTATE:naked
1065              Allows one to override re2c:api:style for YYGETSTATE.   Value  0
1066              corresponds to free-form API style.
1067
1068       re2c:define:YYSETSTATE
1069              Defines  API  primitive  YYSETSTATE (see the user interface sec‐
1070              tion).
1071
1072       re2c:define:YYSETSTATE@state
1073              Specifies the sigil used for argument substitution in YYSETSTATE
1074              definition. The default value is @@.  Overrides the more generic
1075              re2c:api:sigil configuration.
1076
1077       re2c:define:YYSETSTATE:naked
1078              Allows one to override re2c:api:style for YYSETSTATE.   Value  0
1079              corresponds to free-form API style.
1080
1081       re2c:state:abort
1082              If  set  to  a  positive  integer value, changes the form of the
1083              YYGETSTATE switch: instead of using default case to jump to  the
1084              beginning of the lexer block, a -1 case is used, and the default
1085              case aborts the program.
1086
1087       re2c:state:nextlabel
              With storable states, allows one to control whether the
              YYGETSTATE block is followed by a yyNext label (the default
              value is zero, which corresponds to no label).  Instead of using
              yyNext it is possible to use re2c:startlabel to force the
              generation of a specific start label.  Instead of using labels
              it is often more convenient to generate YYGETSTATE code using
              /*!getstate:re2c*/.
1094
1095       re2c:label:yyNext
1096              Allows one to change the name of the yyNext label.
1097
1098       re2c:startlabel
1099              Controls the generation of start label for the next lexer block.
1100              The default value is zero, which means that the start  label  is
1101              generated only if it is used. An integer value greater than zero
1102              forces the generation of start label even if it is unused by the
1103              lexer.  A  string  value  also forces start label generation and
1104              sets the label name to the specified string.  This configuration
1105              applies  only  to  the current block (it is reset to default for
1106              the next block).
1107
1108       re2c:flags:s, re2c:flags:nested-ifs
1109              Same as -s --nested-ifs command-line option.
1110
1111       re2c:flags:b, re2c:flags:bit-vectors
1112              Same as -b --bit-vectors command-line option.
1113
1114       re2c:variable:yybm
1115              Overrides the name of the yybm variable.
1116
1117       re2c:yybm:hex
1118              Defaults to zero (a decimal bitmap table is generated).  If  set
1119              to nonzero, a hexadecimal table is generated.
1120
1121       re2c:flags:g, re2c:flags:computed-gotos
1122              Same as -g --computed-gotos command-line option.
1123
1124       re2c:cgoto:threshold
1125              With  -g  --computed-gotos  option this value specifies the com‐
1126              plexity threshold that triggers the generation  of  jump  tables
1127              instead  of  nested if statements and bitmaps. The default value
1128              is 9.
1129
1130       re2c:flags:case-ranges
1131              Same as --case-ranges command-line option.
1132
1133       re2c:flags:e, re2c:flags:ecb
1134              Same as -e --ecb command-line option.
1135
1136       re2c:flags:8, re2c:flags:utf-8
1137              Same as -8 --utf-8 command-line option.
1138
1139       re2c:flags:w, re2c:flags:wide-chars
1140              Same as -w --wide-chars command-line option.
1141
1142       re2c:flags:x, re2c:flags:utf-16
1143              Same as -x --utf-16 command-line option.
1144
1145       re2c:flags:u, re2c:flags:unicode
1146              Same as -u --unicode command-line option.
1147
1148       re2c:flags:encoding-policy
1149              Same as --encoding-policy command-line option.
1150
1151       re2c:flags:empty-class
1152              Same as --empty-class command-line option.
1153
1154       re2c:flags:case-insensitive
1155              Same as --case-insensitive command-line option.
1156
1157       re2c:flags:case-inverted
1158              Same as --case-inverted command-line option.
1159
1160       re2c:flags:i, re2c:flags:no-debug-info
1161              Same as -i --no-debug-info command-line option.
1162
1163       re2c:indent:string
1164              Specifies the string to use for indentation.  The default  value
1165              is  "\t".   Indent string should contain only whitespace charac‐
1166              ters.  To disable indentation entirely, set  this  configuration
1167              to empty string "".
1168
1169       re2c:indent:top
1170              Specifies the minimum amount of indentation to use.  The default
1171              value is zero.  The value should be a non-negative integer  num‐
1172              ber.
1173
1174       re2c:labelprefix
1175              Allows  one  to  change  the  prefix  of  DFA state labels.  The
1176              default value is yy.
1177
1178       re2c:yych:emit
1179              Set this to zero to suppress the generation of yych  definition.
1180              Defaults to 1 (the definition is generated).
1181
1182       re2c:variable:yych
1183              Overrides the name of the yych variable.
1184
1185       re2c:yych:conversion
1186              If  set  to nonzero, re2c automatically generates a cast to YYC‐
1187              TYPE every time yych is read. Defaults to zero (no cast).
1188
1189       re2c:variable:yyaccept
1190              Overrides the name of the yyaccept variable.
1191
1192       re2c:variable:yytarget
1193              Overrides the name of the yytarget variable.
1194
1195       re2c:variable:yystable
1196              Deprecated.
1197
1198       re2c:variable:yyctable
1199              When both -c --conditions and -g  --computed-gotos  are  active,
1200              re2c  will use this variable to generate a static jump table for
1201              YYGETCONDITION.
1202
1203       re2c:define:YYDEBUG
1204              Defines YYDEBUG (see the user interface section).
1205
1206       re2c:flags:d, re2c:flags:debug-output
1207              Same as -d --debug-output command-line option.
1208
1209       re2c:flags:dfa-minimization
1210              Same as --dfa-minimization command-line option.
1211
1212       re2c:flags:eager-skip
1213              Same as --eager-skip command-line option.
1214
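       As an illustration of re2c:api:style and re2c:api:sigil, the fragment
       below sketches free-form definitions of a few generic API primitives.
       It is a hedged example rather than a recommended setup: the variables
       cur, mar and lim are hypothetical names that the surrounding program
       would have to declare, and re2c:flags:input = custom; is assumed to
       be the setting that selects the generic API.

```c
/*!re2c
    re2c:flags:input = custom;
    re2c:api:style = free-form;

    re2c:define:YYCTYPE     = char;
    re2c:define:YYPEEK      = "*cur";
    re2c:define:YYSKIP      = "++cur;";
    re2c:define:YYBACKUP    = "mar = cur;";
    re2c:define:YYRESTORE   = "cur = mar;";
    re2c:define:YYLESSTHAN  = "lim - cur < @@{len}";
    re2c:define:YYSHIFTSTAG = "@@{tag} += @@{shift};";
*/
```

       With the default sigil @@, the placeholders @@{len}, @@{tag} and
       @@{shift} are replaced by the actual argument values.  Primitives
       without arguments, such as YYPEEK, need no placeholder.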

REGULAR EXPRESSIONS

1216       re2c uses the following syntax for regular expressions:
1217
1218       · "foo" case-sensitive string literal
1219
1220       · 'foo' case-insensitive string literal
1221
1222       · [a-xyz], [^a-xyz] character class (possibly negated)
1223
1224       · . any character except newline
1225
1226       · R \ S difference of character classes R and S
1227
1228       · R* zero or more occurrences of R
1229
1230       · R+ one or more occurrences of R
1231
1232       · R? optional R
1233
1234       · R{n} repetition of R exactly n times
1235
1236       · R{n,} repetition of R at least n times
1237
1238       · R{n,m} repetition of R from n to m times
1239
1240       · (R) just R; parentheses  are  used  to  override  precedence  or  for
1241         POSIX-style submatch
1242
1243       · R S concatenation: R followed by S
1244
1245       · R | S alternative: R or S
1246
1247       · R / S lookahead: R followed by S, but S is not consumed
1248
1249       · name the regular expression defined as name (or literal string "name"
1250         in Flex compatibility mode)
1251
1252       · {name} the regular expression defined as name in  Flex  compatibility
1253         mode
1254
1255       · @stag  an s-tag: saves the last input position at which @stag matches
1256         in a variable named stag
1257
1258       · #mtag an m-tag: saves all input positions at which #mtag matches in a
1259         variable named mtag
1260
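       The difference between s-tags and m-tags can be sketched in plain C.
       This is an assumption-laden illustration, not re2c output: scan_fields
       is a hand-written stand-in for a lexer that places a tag at the start
       of each comma-separated field, and MTag with its fixed capacity is a
       hypothetical representation of the tag history.

```c
#include <stddef.h>

#define MAX_HISTORY 16  /* illustrative fixed capacity */

/* Hypothetical m-tag representation: a bounded list of positions. */
typedef struct {
    const char *pos[MAX_HISTORY];
    size_t count;
} MTag;

static void mtag_append(MTag *t, const char *p)
{
    if (t->count < MAX_HISTORY) t->pos[t->count++] = p;
}

/* Possible definitions of the tag primitives over a cursor named cur. */
#define YYSTAGP(tag) ((tag) = cur)
#define YYSTAGN(tag) ((tag) = NULL)
#define YYMTAGP(tag) mtag_append(&(tag), cur)
#define YYMTAGN(tag) mtag_append(&(tag), NULL)

/* Hand-written stand-in for a lexer: records the start of every
 * comma-separated field both in an s-tag and in an m-tag. */
static void scan_fields(const char *cur, const char **stag, MTag *mtag)
{
    YYSTAGN(*stag);
    mtag->count = 0;
    while (*cur != '\0') {
        YYSTAGP(*stag);  /* overwrite: only the last start survives */
        YYMTAGP(*mtag);  /* append: every start is kept */
        while (*cur != '\0' && *cur != ',') ++cur;
        if (*cur == ',') ++cur;
    }
}
```

       For the input a,bb,ccc the s-tag ends up pointing at ccc only, while
       the m-tag history holds the starts of all three fields.
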
1261       Character  classes and string literals may contain the following escape
1262       sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and hexa‐
1263       decimal escapes \xhh, \uhhhh and \Uhhhhhhhh.
1264
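       The syntax listed above can be combined as in the following sketch of
       a re2c block.  The definition names and the token constants CALL,
       IDENT, NUMBER and ERROR are hypothetical and would have to be declared
       by the surrounding program.

```c
/*!re2c
    digit     = [0-9];
    letter    = [a-zA-Z_];
    consonant = [a-z] \ [aeiou];            // difference of character classes
    ident     = letter (letter | digit)*;   // concatenation, alternative, repetition
    hex       = '0x' [0-9a-fA-F]{1,8};      // case-insensitive prefix, 1 to 8 digits

    ident / "("  { return CALL; }           // lookahead: "(" is not consumed
    ident        { return IDENT; }
    hex          { return NUMBER; }
    *            { return ERROR; }
*/
```

       Each definition appears before its first use, and the definitions do
       not refer to themselves, as required by the syntax rules.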

EOF HANDLING

       Re2c provides a number of ways to handle the end-of-input situation.
       Which way to use depends on the complexity of regular expressions,
       performance considerations, the need for input buffering and various
       other factors.  EOF handling is probably the most complex part of the
       re2c user interface --- it definitely requires a bit of understanding
       of how the generated lexer works.  But in return it allows the user to
       customize the lexer for a particular environment and avoid the
       unnecessary overhead of generic methods when a simpler method is
       sufficient.  Roughly speaking, there are four main methods:
1275
1276       · using sentinel symbol (simple and efficient, but limited)
1277
1278       · bounds checking with padding (generic, but complex)
1279
1280       · EOF  rule:  a  combination  of  sentinel  symbol  and bounds checking
1281         (generic and simple, can be more or less efficient than bounds check‐
1282         ing with padding depending on the grammar)
1283
1284       · using generic API (user-defined, so may be incorrect ;])
1285
1286   Using sentinel symbol
       This is the simplest and the most efficient method.  It is applicable
       in cases when the input is small enough to fit into a contiguous
       memory buffer and there is a natural "sentinel" symbol --- a code unit
       that is not allowed by any of the regular expressions in the grammar
       (except possibly as a terminating character).  The sentinel symbol
       never appears in well-formed input, therefore it can be appended at
       the end of input and used as a stop signal by the lexer.  A good
       example of such input is a null-terminated C-string, provided that the
       grammar does not allow NUL in the middle of lexemes.  The sentinel
       method is very efficient, because the lexer does not need to perform
       any additional checks for the end of input --- it comes naturally as a
       part of processing the next character.  It is very important that the
       sentinel symbol is not allowed in the middle of the rule --- otherwise
       on some inputs the lexer may read past the end of buffer and crash or
       cause memory corruption.  Re2c verifies this automatically.  Use
       re2c:sentinel configuration to specify which sentinel symbol is used.
1303
1304       Below  is  an  example  of   using   sentinel   method.   Configuration
1305       re2c:yyfill:enable  =  0;  suppresses generation of end-of-input checks
1306       and YYFILL calls.
1307
1308          // re2c $INPUT -o $OUTPUT
1309          #include <assert.h>
1310
1311          // expect a null-terminated string
1312          static int lex(const char *YYCURSOR)
1313          {
1314              int count = 0;
1315          loop:
1316              /*!re2c
1317              re2c:define:YYCTYPE = char;
1318              re2c:yyfill:enable = 0;
1319
1320              *      { return -1; }
1321              [\x00] { return count; }
1322              [a-z]+ { ++count; goto loop; }
1323              [ ]+   { goto loop; }
1324
1325              */
1326          }
1327
1328          int main()
1329          {
1330              assert(lex("") == 0);
1331              assert(lex("one two three") == 3);
1332              assert(lex("f0ur") == -1);
1333              return 0;
1334          }
1335
1336
1337   Bounds checking with padding
1338       Bounds checking is a generic method: it can  be  used  with  any  input
1339       grammar.   The  basic  idea  is simple: we need to check for the end of
1340       input before reading the next input character. However, if  implemented
1341       in  a straightforward way, this would be quite inefficient: checking on
1342       each input character would cause a major slowdown. Re2c avoids slowdown
1343       by  generating checks only in certain key states of the lexer, and let‐
1344       ting it run without checks in-between the key states.  More  precisely,
1345       re2c  computes  strongly  connected components (SCCs) of the underlying
1346       DFA (which roughly correspond to  loops),  and  generates  only  a  few
1347       checks  per  each  SCC (usually just one, but in general enough to make
1348       the SCC acyclic). The check is of the form (YYLIMIT -  YYCURSOR)  <  n,
1349       where  n  is  the  maximal length of a simple path in the corresponding
1350       SCC. If this condition is true, the lexer calls YYFILL(n),  which  must
1351       either  supply  at least n input characters or not return at all. When the
1352       lexer continues after the check, it is certain that the next n  charac‐
1353       ters can be read safely without checks.
1354
1355       This approach reduces the number of checks significantly (and makes the
1356       lexer much faster as a result), but it has a downside. Since the  lexer
1357       checks  for  multiple  characters at once, it may end up in a situation
1358       when there are a few remaining input characters (fewer than n) corresponding
1359       to a short path in the SCC, but the lexer cannot proceed
1360       because of the check, and YYFILL cannot supply more characters because
1361       it is the end of input. To solve this problem, re2c requires that additional
1362       padding consisting of fake characters is appended at the end  of
1363       input.  The  length of padding should be YYMAXFILL, which is equal to the
1364       maximum n parameter to YYFILL and  must  be  generated  by  re2c  using
1365       /*!max:re2c*/  directive.  The  fake characters should not form a valid
1366       lexeme suffix, otherwise the lexer may be fooled into matching  a  fake
1367       lexeme. Usually it's a good idea to use NUL characters for padding.
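
       The effect of the check and of the padding can be seen in  a  small
       hand-written  sketch (plain C, not generated by re2c). It assumes a
       toy grammar with two lexemes, "abc" and "x", and a hypothetical con-
       stant MAXFILL that stands in for YYMAXFILL. For simplicity there is
       one check per lexeme rather than per SCC, and no refilling,  so  a
       failed check is reported as an error.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for YYMAXFILL: the longest lexeme, "abc",
 * needs 3 characters. */
#define MAXFILL 3

/* Hand-written lexer for the lexemes "abc" and "x" over [buf, buf + len).
 * Returns the number of lexemes, or -1 on error. */
static int lex(const char *buf, size_t len)
{
    const char *cur = buf, *lim = buf + len;
    int count = 0;

    assert(buf != NULL);
    for (;;) {
        /* Single check per lexeme; a generated lexer would call
         * YYFILL(MAXFILL) here. This sketch has no refilling, so a
         * failed check is an error. */
        if (lim - cur < MAXFILL) return -1;

        /* Up to MAXFILL characters can now be read without checks. */
        if (cur[0] == 'x') {
            cur += 1;
            ++count;
        } else if (cur[0] == 'a' && cur[1] == 'b' && cur[2] == 'c') {
            cur += 3;
            ++count;
        } else if (cur[0] == '\0' && lim - cur == MAXFILL) {
            return count; /* first padding character: real end of input */
        } else {
            return -1;
        }
    }
}
```

       Without padding, lex("x", 1) returns -1 although "x" is  a  valid
       lexeme:  only one character remains and the check demands three. If
       the input "xabc" is followed by MAXFILL zero bytes and  the  length
       passed to lex includes the padding, the result is 2.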
1368
1369       Below  is  an  example of using bounds checking with padding. Note that
1370       the grammar rule for single-quoted strings allows arbitrary symbols  in
1371       the  middle  of lexeme, so there is no natural sentinel in the grammar.
1372       Strings like "aha\0ha" are perfectly valid, but ill-formed strings like
1373       "aha\0 are also possible and shouldn’t crash the lexer. In this example
1374       we do not use buffer  refilling,  therefore  YYFILL  definition  simply
1375       returns  an error. Note that YYFILL will only be called after the lexer
1376       reaches padding, because only then will the check condition  be  satis‐
1377       fied.
1378
1379          // re2c $INPUT -o $OUTPUT
1380          #include <assert.h>
1381          #include <stdlib.h>
1382          #include <string.h>
1383
1384          /*!max:re2c*/
1385
1386          // expect YYMAXFILL-padded string
1387          static int lex(const char *str, unsigned int len)
1388          {
1389              const char *YYCURSOR = str, *YYLIMIT = str + len + YYMAXFILL;
1390              int count = 0;
1391
1392          loop:
1393              /*!re2c
1394              re2c:api:style = free-form;
1395              re2c:define:YYCTYPE = char;
1396              re2c:define:YYFILL = "return -1;";
1397
1398              *                           { return -1; }
1399              [\x00]                      { return YYCURSOR == YYLIMIT ? count : -1; }
1400              ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
1401              [ ]+                        { goto loop; }
1402
1403              */
1404          }
1405
1406          // make a copy of the string with YYMAXFILL zeroes at the end
1407          static void test(const char *str, unsigned int len, int res)
1408          {
1409              char *s = (char*) malloc(len + YYMAXFILL);
1410              memcpy(s, str, len);
1411              memset(s + len, 0, YYMAXFILL);
1412              int r = lex(s, len);
1413              free(s);
1414              assert(r == res);
1415          }
1416
1417          #define TEST(s, r) test(s, sizeof(s) - 1, r)
1418          int main()
1419          {
1420              TEST("", 0);
1421              TEST("'qu\0tes' 'are' 'fine: \\'' ", 3);
1422              TEST("'unterminated\\'", -1);
1423              return 0;
1424          }
1425
1426
1427   EOF rule
1428       EOF  rule $ was introduced in version 1.2. It is a hybrid approach that
1429       tries to take the best of both worlds: simplicity and efficiency of the
1430       sentinel method combined with the generality of bounds-checking method.
1431       The idea is to appoint an arbitrary symbol to be the sentinel, and only
1432       perform  further  bounds  checking if the sentinel symbol matches (more
1433       precisely, if the symbol class that contains it matches). The check  is
1434       of  the  form YYLIMIT <= YYCURSOR.  If this condition is not satisfied,
1435       then the sentinel is just an ordinary input  character  and  the  lexer
1436       continues.  Otherwise  this  is  a  real  sentinel, and the lexer calls
1437       YYFILL(). If YYFILL returns zero, the lexer assumes that  it  has  more
1438       input  and tries to re-match. Otherwise YYFILL returns non-zero and the
1439       lexer knows that it has reached the end of input. At this  point  there
1440       are three possibilities. First, it might have already matched a shorter
1441       lexeme --- in this case it just rolls back to the last accepting state.
1442       Second, it might have consumed some characters, but failed to match ---
1443       in this case it falls back to default rule *. Finally, it might  be  in
1444       the initial state --- in this (and only this!) case it matches EOF rule
1445       $.
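
       The core of the method, checking bounds only when the  sentinel  is
       read,  can be sketched by hand in plain C. The sketch below is not
       re2c output. It assumes a toy grammar (words made of any characters
       except  space,  separated  by  spaces),  uses  0  as  the sentinel,
       expects the sentinel to be stored at the limit position, and omits
       YYFILL and the fallback cases described above.

```c
#include <assert.h>
#include <stddef.h>

#define SENTINEL '\0' /* assumed sentinel, as with re2c:eof = 0 */

/* Counts words (runs of non-space characters, zero bytes included)
 * in [str, str + len). The caller must guarantee str[len] == SENTINEL. */
static int count_words(const char *str, size_t len)
{
    const char *cur = str, *lim = str + len;
    int count = 0;

    assert(*lim == SENTINEL);
    for (;;) {
        /* Initial state: a sentinel at the limit is the real end of input. */
        if (*cur == SENTINEL && lim <= cur) return count;

        if (*cur == ' ') {
            /* Space is not the sentinel, so no bounds check is needed:
             * the sentinel at the limit stops the loop. */
            while (*cur == ' ') ++cur;
        } else {
            /* The bounds check runs only when the sentinel is read;
             * a sentinel before the limit is an ordinary character. */
            do {
                ++cur;
            } while (*cur != ' ' && !(*cur == SENTINEL && lim <= cur));
            ++count;
        }
    }
}
```

       For the string literal "a\0b c" with length 5  the  function  returns
       2:  the  zero byte inside the first word is read as an ordinary
       character, and only the terminating zero at the limit ends the scan.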
1446
1447       Below is an example of using EOF rule. Configuration re2c:yyfill:enable
1448       = 0; suppresses generation of YYFILL calls (but not the bounds checks).
1449
1450          // re2c $INPUT -o $OUTPUT
1451          #include <assert.h>
1452
1453          // expect a null-terminated string
1454          static int lex(const char *str, unsigned int len)
1455          {
1456              const char *YYCURSOR = str, *YYLIMIT = str + len, *YYMARKER;
1457              int count = 0;
1458
1459          loop:
1460              /*!re2c
1461              re2c:define:YYCTYPE = char;
1462              re2c:yyfill:enable = 0;
1463              re2c:eof = 0;
1464
1465              *                           { return -1; }
1466              $                           { return count; }
1467              ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
1468              [ ]+                        { goto loop; }
1469
1470              */
1471          }
1472
1473          #define TEST(s, r) assert(lex(s, sizeof(s) - 1) == r)
1474          int main()
1475          {
1476              TEST("", 0);
1477              TEST("'qu\0tes' 'are' 'fine: \\'' ", 3);
1478              TEST("'unterminated\\'", -1);
1479              return 0;
1480          }
1481
1482
1483   Using generic API
1484       Generic  API  can be used with any of the above methods. It also allows
1485       one to use a user-defined method by placing EOF checks in  one  of  the
1486       basic  primitives.   Usually  this  is either YYSKIP (the check is per‐
1487       formed when advancing to the next  input  character),  or  YYPEEK  (the
1488       check  is performed when reading the next input character). The result‐
1489       ing methods are inefficient, as they check  on  each  input  character.
1490       However,  they can be useful in cases when the input cannot be buffered
1491       or padded and does not contain a sentinel character  at  the  end.  One
1492       should  be  cautious  when  using such ad-hoc methods, as it is easy to
1493       overlook some corner cases and come up with a  method  that  only  par‐
1494       tially  works.  Also  it  should  be  noted  that not everything can be
1495       expressed via generic API: for example, it is impossible to reimplement
1496       the way EOF rule works (in particular, it is impossible to re-match the
1497       character after successful YYFILL).
1498
1499       Below is an example of using YYSKIP to perform bounds checking  without
1500       padding.  YYFILL generation is suppressed using re2c:yyfill:enable = 0;
1501       configuration. Note that if the grammar was more complex,  this  method
1502       might not work in case when two rules overlap and EOF check fails after
1503       a shorter lexeme has already been matched (as it happens in  our  exam‐
1504       ple, there are no overlapping rules).
1505
1506          // re2c $INPUT -o $OUTPUT
1507          #include <assert.h>
1508          #include <stdlib.h>
1509          #include <string.h>
1510
1511          // expect a string without terminating null
1512          static int lex(const char *str, unsigned int len)
1513          {
1514              const char *cur = str, *lim = str + len, *mar;
1515              int count = 0;
1516
1517          loop:
1518              /*!re2c
1519              re2c:yyfill:enable = 0;
1520              re2c:eof = 0;
1521              re2c:flags:input = custom;
1522              re2c:api:style = free-form;
1523              re2c:define:YYCTYPE    = char;
1524              re2c:define:YYLESSTHAN = "cur >= lim";
1525              re2c:define:YYPEEK     = "cur < lim ? *cur : 0";  // fake null
1526              re2c:define:YYSKIP     = "++cur;";
1527              re2c:define:YYBACKUP   = "mar = cur;";
1528              re2c:define:YYRESTORE  = "cur = mar;";
1529
1530              *                           { return -1; }
1531              $                           { return count; }
1532              ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
1533              [ ]+                        { goto loop; }
1534
1535              */
1536          }
1537
1538          // make a copy of the string without terminating null
1539          static void test(const char *str, unsigned int len, int res)
1540          {
1541              char *s = (char*) malloc(len);
1542              memcpy(s, str, len);
1543              int r = lex(s, len);
1544              free(s);
1545              assert(r == res);
1546          }
1547
1548          #define TEST(s, r) test(s, sizeof(s) - 1, r)
1549          int main()
1550          {
1551              TEST("", 0);
1552              TEST("'qu\0tes' 'are' 'fine: \\'' ", 3);
1553              TEST("'unterminated\\'", -1);
1554              return 0;
1555          }
1556
1557

BUFFER REFILLING

1559       The need for buffering arises when the input cannot be mapped in memory
1560       all at once: either it is too large, or it comes in a streaming fashion
1561       (like  reading  from a socket). The usual technique in such cases is to
1562       allocate a fixed-sized memory buffer and process input in  chunks  that
1563       fit  into  the buffer. When the current chunk is processed, it is moved
1564       out and new data is moved in. In practice it is somewhat more  complex,
1565       because  lexer state consists not of a single input position, but a set
1566       of interrelated positions:
1567
1568       · cursor: the next input character to be read (YYCURSOR in default  API
1569         or YYSKIP/YYPEEK in generic API)
1570
1571       · limit: the position after the last available input character (YYLIMIT
1572         in default API, implicitly handled by YYLESSTHAN in generic API)
1573
1574       · marker: the position of the most recent match, if  any  (YYMARKER  in
1575         default API or YYBACKUP/YYRESTORE in generic API)
1576
1577       · token:  the  start of the current lexeme (implicit in re2c API, as it
1578         is not needed for the normal lexer operation and can be  defined  and
1579         updated by the user)
1580
1581       · context  marker: the position of the trailing context (YYCTXMARKER in
1582         default API or YYBACKUPCTX/YYRESTORECTX in generic API)
1583
1584       · tag variables: submatch positions (defined with  /*!stags:re2c*/  and
1585         /*!mtags:re2c*/  directives  and  YYSTAGP/YYSTAGN/YYMTAGP/YYMTAGN  in
1586         generic API)
1587
1588       Not all these are used in every case, but if used, they must be updated
1589       by  YYFILL.  All  active positions are contained in the segment between
1590       token and cursor, therefore everything between buffer start  and  token
1591       can  be  discarded,  the  segment  from token and up to limit should be
1592       moved to the beginning of buffer, and the free space at the end of buf‐
1593       fer  should be filled with new data.  In order to avoid frequent YYFILL
1594       calls it is best to fill in as many input characters as possible  (even
1595       though fewer characters might suffice to resume the lexer). The details
1596       of YYFILL implementation are slightly different depending on which  EOF
1597       handling  method is used: the case of EOF rule is somewhat simpler than
1598       the case  of  bounds-checking  with  padding.  Also  note  that  if  -f
1599       --storable-state  option  is used, YYFILL has slightly different semantics
1600       (described in the section about storable state).
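
       The shift step can be sketched in isolation, without any file I/O.
       The  struct  below is an assumption made for illustration: it has
       two hypothetical tag variables, yyt1 and yyt2, to show that tag
       variables  are  shifted together with the other positions, and that
       a tag holding the default value (NULL) is left alone. In  a  real
       lexer  the  per-tag  lines  could  be  generated  with  the
       /*!stags:re2c*/ directive and a suitable format configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIZE 16 /* hypothetical buffer size, small for illustration */

typedef struct {
    char buf[SIZE + 1];
    char *lim, *cur, *mar, *tok;
    char *yyt1, *yyt2; /* hypothetical tag variables; NULL means "not set" */
} Input;

/* Discards [buf, tok), moves [tok, lim) to the start of the buffer and
 * adjusts every position by the same offset. Returns the number of bytes
 * freed at the end of the buffer (the amount of new data that fits). */
static size_t shift(Input *in)
{
    assert(in->buf <= in->tok && in->tok <= in->lim);
    const size_t n = (size_t)(in->tok - in->buf);

    memmove(in->buf, in->tok, (size_t)(in->lim - in->tok));
    in->lim -= n;
    in->cur -= n;
    in->mar -= n;
    in->tok -= n;
    /* Unset tags hold the default value and must not be shifted. */
    if (in->yyt1) in->yyt1 -= n;
    if (in->yyt2) in->yyt2 -= n;
    return n;
}
```

       For a buffer that holds abcdefgh with token at offset 3 and limit
       at offset 8, shift returns 3, leaves defgh at the start  of  the
       buffer  and  moves  every  non-NULL  position three bytes to the
       left. After that, up to 3 bytes of new data can be read at limit.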
1601
1602   YYFILL with EOF rule
1603       If EOF rule is used, YYFILL is a function-like primitive  that  accepts
1604       no  arguments and returns a value which is checked against zero. YYFILL
1605       invocation is triggered by condition YYLIMIT <= YYCURSOR in default API
1606       and  YYLESSTHAN()  in  generic  API. A non-zero return value means that
1607       YYFILL has failed. A successful YYFILL call must supply  at  least  one
1608       character  and adjust input positions accordingly. Limit must always be
1609       set to one after the last input position in buffer, and  the  character
1610       at the limit position must be the sentinel symbol specified by re2c:eof
1611       configuration. The pictures below show the relative locations of  input
1612       positions  in  buffer  before and after YYFILL call (sentinel symbol is
1613       marked with #, and the second picture shows the case when there is  not
1614       enough input to fill the whole buffer).
1615
1616                         <-- shift -->
1617                       >-A------------B---------C-------------D#-----------E->
1618                       buffer       token    marker         limit,
1619                                                            cursor
1620          >-A------------B---------C-------------D------------E#->
1621                       buffer,  marker        cursor        limit
1622                       token
1623
1624                         <-- shift -->
1625                       >-A------------B---------C-------------D#--E (EOF)
1626                       buffer       token    marker         limit,
1627                                                            cursor
1628          >-A------------B---------C-------------D---E#........
1629                       buffer,  marker       cursor limit
1630                       token
1631
1632       Here  is  an  example  of  a program that reads an input file (named input
1633       in the code) in chunks of up to 4096 bytes and uses EOF rule.
1634
1635          // re2c $INPUT -o $OUTPUT
1636          #include <assert.h>
1637          #include <stdio.h>
1638          #include <string.h>
1639
1640          #define SIZE 4096
1641
1642          typedef struct {
1643              FILE *file;
1644              char buf[SIZE + 1], *lim, *cur, *mar, *tok;
1645              int eof;
1646          } Input;
1647
1648          static int fill(Input *in)
1649          {
1650              if (in->eof) {
1651                  return 1;
1652              }
1653              const size_t free = in->tok - in->buf;
1654              if (free < 1) {
1655                  return 2;
1656              }
1657              memmove(in->buf, in->tok, in->lim - in->tok);
1658              in->lim -= free;
1659              in->cur -= free;
1660              in->mar -= free;
1661              in->tok -= free;
1662              in->lim += fread(in->lim, 1, free, in->file);
1663              in->lim[0] = 0;
1664              in->eof |= in->lim < in->buf + SIZE;
1665              return 0;
1666          }
1667
1668          static void init(Input *in, FILE *file)
1669          {
1670              in->file = file;
1671              in->cur = in->mar = in->tok = in->lim = in->buf + SIZE;
1672              in->eof = 0;
1673              fill(in);
1674          }
1675
1676          static int lex(Input *in)
1677          {
1678              int count = 0;
1679          loop:
1680              in->tok = in->cur;
1681              /*!re2c
1682              re2c:eof = 0;
1683              re2c:api:style = free-form;
1684              re2c:define:YYCTYPE  = char;
1685              re2c:define:YYCURSOR = in->cur;
1686              re2c:define:YYMARKER = in->mar;
1687              re2c:define:YYLIMIT  = in->lim;
1688              re2c:define:YYFILL   = "fill(in) == 0";
1689
1690              *                           { return -1; }
1691              $                           { return count; }
1692              ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
1693              [ ]+                        { goto loop; }
1694
1695              */
1696          }
1697
1698          int main()
1699          {
1700              const char *fname = "input";
1701              const char str[] = "'qu\0tes' 'are' 'fine: \\'' ";
1702              FILE *f;
1703              Input in;
1704
1705              // prepare input file: a few times the size of the buffer,
1706              // containing strings with zeroes and escaped quotes
1707              f = fopen(fname, "w");
1708              for (int i = 0; i < SIZE; ++i) {
1709                  fwrite(str, 1, sizeof(str) - 1, f);
1710              }
1711              fclose(f);
1712
1713              f = fopen(fname, "r");
1714              init(&in, f);
1715              assert(lex(&in) == SIZE * 3);
1716              fclose(f);
1717
1718              remove(fname);
1719              return 0;
1720          }
1721
1722
1723   YYFILL with padding
1724       In the default case (when EOF rule is  not  used)  YYFILL  is  a  func‐
1725       tion-like  primitive that accepts a single argument and does not return
1726       any value.  YYFILL invocation is  triggered  by  condition  (YYLIMIT  -
1727       YYCURSOR)  <  n  in  default  API and YYLESSTHAN(n) in generic API. The
1728       argument passed to YYFILL is the minimal number of characters that must
1729       be  supplied. If it fails to do so, YYFILL must not return to the lexer
1730       (for that reason it is best implemented as a macro  that  returns  from
1731       the calling function on failure).  In case of a successful YYFILL invo‐
1732       cation the limit position must be set either  to  one  after  the  last
1733       input  position  in buffer, or to the end of YYMAXFILL padding (in case
1734       YYFILL has successfully read at least n characters, but not  enough  to
1735       fill the entire buffer). The pictures below show the relative locations
1736       of input positions in buffer before and after YYFILL invocation (YYMAX‐
1737       FILL padding on the second picture is marked with # symbols).
1738
1739                         <-- shift -->                 <-- need -->
1740                       >-A------------B---------C-----D-------E---F--------G->
1741                       buffer       token    marker cursor  limit
1742
1743          >-A------------B---------C-----D-------E---F--------G->
1744                       buffer,  marker cursor               limit
1745                       token
1746
1747                         <-- shift -->                 <-- need -->
1748                       >-A------------B---------C-----D-------E-F        (EOF)
1749                       buffer       token    marker cursor  limit
1750
1751          >-A------------B---------C-----D-------E-F###############
1752                       buffer,  marker cursor                   limit
1753                       token                        <- YYMAXFILL ->
1754
1755       Here  is  an  example  of  a program that reads an input file (named input
1756       in the code) in chunks of up to 4096 bytes and uses bounds-checking with padding.
1757
1758          // re2c $INPUT -o $OUTPUT
1759          #include <assert.h>
1760          #include <stdio.h>
1761          #include <string.h>
1762
1763          /*!max:re2c*/
1764          #define SIZE 4096
1765
1766          typedef struct {
1767              FILE *file;
1768              char buf[SIZE + YYMAXFILL], *lim, *cur, *mar, *tok;
1769              int eof;
1770          } Input;
1771
1772          static int fill(Input *in, size_t need)
1773          {
1774              if (in->eof) {
1775                  return 1;
1776              }
1777              const size_t free = in->tok - in->buf;
1778              if (free < need) {
1779                  return 2;
1780              }
1781              memmove(in->buf, in->tok, in->lim - in->tok);
1782              in->lim -= free;
1783              in->cur -= free;
1784              in->mar -= free;
1785              in->tok -= free;
1786              in->lim += fread(in->lim, 1, free, in->file);
1787              if (in->lim < in->buf + SIZE) {
1788                  in->eof = 1;
1789                  memset(in->lim, 0, YYMAXFILL);
1790                  in->lim += YYMAXFILL;
1791              }
1792              return 0;
1793          }
1794
1795          static void init(Input *in, FILE *file)
1796          {
1797              in->file = file;
1798              in->cur = in->mar = in->tok = in->lim = in->buf + SIZE;
1799              in->eof = 0;
1800              fill(in, 1);
1801          }
1802
1803          static int lex(Input *in)
1804          {
1805              int count = 0;
1806          loop:
1807              in->tok = in->cur;
1808              /*!re2c
1809              re2c:api:style = free-form;
1810              re2c:define:YYCTYPE  = char;
1811              re2c:define:YYCURSOR = in->cur;
1812              re2c:define:YYMARKER = in->mar;
1813              re2c:define:YYLIMIT  = in->lim;
1814              re2c:define:YYFILL   = "if (fill(in, @@) != 0) return -1;";
1815
1816              *                           { return -1; }
1817              [\x00]                      { return (YYMAXFILL == in->lim - in->tok) ? count : -1; }
1818              ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
1819              [ ]+                        { goto loop; }
1820
1821              */
1822          }
1823
1824          int main()
1825          {
1826              const char *fname = "input";
1827              const char str[] = "'qu\0tes' 'are' 'fine: \\'' ";
1828              FILE *f;
1829              Input in;
1830
1831              // prepare input file: a few times the size of the buffer,
1832              // containing strings with zeroes and escaped quotes
1833              f = fopen(fname, "w");
1834              for (int i = 0; i < SIZE; ++i) {
1835                  fwrite(str, 1, sizeof(str) - 1, f);
1836              }
1837              fclose(f);
1838
1839              f = fopen(fname, "r");
1840              init(&in, f);
1841              assert(lex(&in) == SIZE * 3);
1842              fclose(f);
1843
1844              remove(fname);
1845              return 0;
1846          }
1847
1848

INCLUDE FILES

1850       Re2c allows one to include other files using directive  /*!include:re2c
1851       FILE  */, where FILE is the name of file to be included. Re2c looks for
1852       included files in the directory of the including file  and  in  include
1853       locations,  which can be specified with -I option.  Re2c include direc‐
1854       tive works in the same way as C/C++ #include: the contents of FILE  are
1855       copy-pasted  verbatim in place of the directive. Include files may have
1856       further includes of their own.  Re2c provides some  predefined  include
1857       files  that  can  be found in the include/ subdirectory of the project.
1858       These files contain definitions that can be useful  to  other  projects
1859       (such as Unicode categories) and form something like a standard library
1860       for re2c.  Here is an example:
1861
1862   Include file (definitions.h)
1863          typedef enum { OK, FAIL } Result;
1864
1865          /*!re2c
1866              number = [1-9][0-9]*;
1867          */
1868
1869
1870   Input file
1871          // re2c $INPUT -o $OUTPUT -i
1872          #include <assert.h>
1873          /*!include:re2c "definitions.h" */
1874
1875          Result lex(const char *YYCURSOR)
1876          {
1877              /*!re2c
1878              re2c:define:YYCTYPE = char;
1879              re2c:yyfill:enable = 0;
1880
1881              number { return OK; }
1882              *      { return FAIL; }
1883              */
1884          }
1885
1886          int main()
1887          {
1888              assert(lex("123") == OK);
1889              return 0;
1890          }
1891
1892

HEADER FILES

1894       Re2c allows one to generate a header file from the input .re file using
1895       option  -t,  --type-header  or configuration re2c:flags:type-header and
1896       directives  /*!header:re2c:on*/  and  /*!header:re2c:off*/.  The  first
1897       directive  marks the beginning of header file, and the second directive
1898       marks the end of it. Everything between these directives  is  processed
1899       by re2c, and the generated code is written to the file specified by the
1900       -t --type-header option (or stdout if this option was not used).  Auto‐
1901       generated  header file may be needed in cases when re2c is used to gen‐
1902       erate definitions of constants, variables and structs that must be vis‐
1903       ible from other translation units.
1904
1905       Here is an example of generating a header file that contains definition
1906       of the lexer state with tag variables (the number of variables depends on
1907       the regular grammar and is unknown to the programmer).
1908
1909   Input file
1910          // re2c $INPUT -o $OUTPUT -i --type-header src/lexer/lexer.h
1911          #include <assert.h>
1912          #include "src/lexer/lexer.h" // generated by re2c
1913
1914          /*!header:re2c:on*/
1915
1916          typedef struct {
1917              const char *str, *cur, *mar;
1918              /*!stags:re2c format = "const char *@@{tag}; "; */
1919          } LexerState;
1920
1921          /*!header:re2c:off*/
1922
1923          int lex(LexerState *st)
1924          {
1925              /*!re2c
1926              re2c:flags:type-header = "src/lexer/lexer.h";
1927              re2c:yyfill:enable = 0;
1928              re2c:flags:tags = 1;
1929              re2c:define:YYCTYPE  = char;
1930              re2c:define:YYCURSOR = "st->cur";
1931              re2c:define:YYMARKER = "st->mar";
1932              re2c:tags:expression = "st->@@{tag}";
1933
1934              [x]{1,4} / [x]{3,5} { return 0; } // ambiguous trailing context
1935              *                   { return 1; }
1936              */
1937          }
1938
1939          int main()
1940          {
1941              LexerState st;
1942              st.str = st.cur = "xxxxxxxx";
1943              assert(lex(&st) == 0 && st.cur - st.str == 4);
1944              return 0;
1945          }
1946
1947
1948   Header file
1949          /* Generated by re2c */
1950
1951
1952          typedef struct {
1953              const char *str, *cur, *mar;
1954              const char *yyt1; const char *yyt2; const char *yyt3;
1955          } LexerState;
1956
1957
1958

SUBMATCH EXTRACTION

1960       Re2c has two options for submatch extraction.
1961
1962       The  first option is -T --tags. With this option one can use standalone
1963       tags of the form @stag and #mtag, where stag  and  mtag  are  arbitrary
1964       user-defined  names.  Tags  can  be  used  anywhere inside of a regular
1965       expression; semantically they are just position markers.  Tags  of  the
1966       form  @stag are called s-tags: they denote a single submatch value (the
1967       last input position where this tag matched). Tags of the form #mtag are
1968       called  m-tags: they denote multiple submatch values (the whole history
1969       of repetitions of this tag).  All tags should be defined by the user as
1970       variables  with the corresponding names. With standalone tags re2c uses
1971       leftmost greedy disambiguation: submatch positions  correspond  to  the
1972       leftmost matching path through the regular expression.
1973
1974       The  second  option  is -P --posix-captures: it enables POSIX-compliant
1975       capturing groups. In  this  mode  parentheses  in  regular  expressions
1976       denote the beginning and the end of capturing groups; the whole regular
1977       expression is group number zero. The number of groups for the  matching
1978       rule  is stored in a variable yynmatch, and submatch results are stored
1979       in yypmatch array. Both yynmatch and yypmatch should be defined by  the
1980       user,  and yypmatch size must be at least [yynmatch * 2]. Re2c provides
1981       a directive /*!maxnmatch:re2c*/ that defines  YYMAXNMATCH:  a  constant
1982       equal  to the maximal value of yynmatch among all rules. Note that re2c
1983       implements POSIX-compliant disambiguation: each  subexpression  matches
1984       as  long  as possible, and subexpressions that start earlier in regular
1985       expression have priority over those starting  later.  Capturing  groups
1986       are  translated  into  s-tags under the hood, therefore we use the word
1987       "tag" to describe them as well.
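
       A minimal sketch of reading the results in  plain  C  follows.  It
       assumes  that  the  lexer has already filled yynmatch and yypmatch,
       that group i occupies the pair yypmatch[2 * i] (start) and  yyp-
       match[2 * i + 1] (end), and that a group which did not take part in
       the match holds NULL.  The  helper  copy_group  is  hypothetical
       and not part of re2c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies capturing group number `group` into `out` as a NUL-terminated
 * string. Returns the length of the group, or -1 if the group does not
 * exist, did not take part in the match, or does not fit into `out`. */
static int copy_group(const char **yypmatch, size_t yynmatch,
                      size_t group, char *out, size_t outsize)
{
    assert(yypmatch != NULL && out != NULL);
    if (group >= yynmatch) return -1;

    const char *start = yypmatch[2 * group];
    const char *end = yypmatch[2 * group + 1];
    if (start == NULL || end == NULL) return -1;

    const size_t n = (size_t)(end - start);
    if (n + 1 > outsize) return -1;

    memcpy(out, start, n);
    out[n] = '\0';
    return (int)n;
}
```

       For example, if the  rule  ([0-9]+)-([0-9]+)  matched  the  input
       2024-05,  then  yynmatch  is 3, group 0 spans the whole match, and
       copy_group yields 2024 for group 1 and 05 for group 2.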
1988
1989       With both -P --posix-captures and -T --tags options re2c uses an efficient
1990       submatch  extraction  algorithm  described  in the Tagged Deterministic
1991       Finite Automata with Lookahead paper. The overhead on submatch  extrac‐
1992       tion  in  the generated lexer grows with the number of tags --- if this
1993       number is moderate, the overhead is barely  noticeable.  In  the  lexer
1994       tags are implemented using a number of tag variables generated by re2c.
1995       There is no one-to-one correspondence between tag variables and tags: a
1996       single  variable  may  be  reused  for  different tags, and one tag may
1997       require multiple variables to hold all its ambiguous values. Eventually
1998       ambiguity  is  resolved,  and only one final variable per tag survives.
1999       When a rule matches, all its tags are set to the values of  the  corre‐
2000       sponding  tag  variables.  The exact number of tag variables is unknown
2001       to the user; this number is determined by re2c. However, tag  variables
2002       should  be defined by the user as a part of the lexer state and updated
2003       by YYFILL,  therefore  re2c  provides  directives  /*!stags:re2c*/  and
2004       /*!mtags:re2c*/  that can be used to declare, initialize and manipulate
2005       tag variables. These directives have two optional configurations:  for‐
2006       mat  =  "@@";  (specifies the template where @@ is substituted with the
2007       name of each tag variable), and separator = ""; (specifies the piece of
2008       code used to join the generated pieces for different tag variables).
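
       The requirement that tag variables be updated by YYFILL can be illus‐
       trated with a hand-written sketch. It assumes a buffer-shifting YYFILL
       similar to the one in the storable state example, and uses the place‐
       holder names yyt1 and yyt2 for tag variables (the real names and their
       number are chosen by re2c). The function shift_buffer is hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define BUFSIZE 16

/* Lexer state. yyt1 and yyt2 are placeholder names standing in for
 * whatever tag variables re2c decides to generate. */
typedef struct {
    char buf[BUFSIZE + 1];
    char *tok, *cur, *lim;
    char *yyt1, *yyt2;
} Input;

/* Discard everything before `tok`, as a YYFILL implementation would do
 * before reading more data, and move all positions by the same amount.
 * Returns the number of bytes freed at the end of the buffer. */
static size_t shift_buffer(Input *in)
{
    const size_t shift = (size_t)(in->tok - in->buf);

    memmove(in->buf, in->tok, (size_t)(in->lim - in->tok));
    in->tok -= shift;
    in->cur -= shift;
    in->lim -= shift;

    /* Tag variables point into the same buffer, so they move too.
     * NULL means "no value" and must stay NULL. */
    if (in->yyt1) in->yyt1 -= shift;
    if (in->yyt2) in->yyt2 -= shift;

    in->lim[0] = '\0'; /* restore the sentinel */
    return shift;
}
```

       Instead of listing the tag variables by hand, the two conditional
       updates could presumably be produced by a directive along the lines of
       /*!stags:re2c format = "if (in->@@) in->@@ -= shift;"; */, so that the
       code stays correct when the set of tag variables changes.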
2009
2010       S-tags support the following operations:
2011
2012       · save  input  position to an s-tag: t = YYCURSOR with default API or a
2013         user-defined operation YYSTAGP(t) with generic API
2014
2015       · save default value to an s-tag: t  =  NULL  with  default  API  or  a
2016         user-defined operation YYSTAGN(t) with generic API
2017
2018       · copy one s-tag to another: t1 = t2
2019
2020       M-tags support the following operations:
2021
2022       · append input position to an m-tag: a user-defined operation
2023         YYMTAGP(t) with both default and generic API
2024
2025       · append default value to an m-tag: a user-defined operation YYMTAGN(t)
2026         with both default and generic API
2027
2028       · copy one m-tag to another: t1 = t2
2029
2030       S-tags  can  be  implemented  as  scalar  values (pointers or offsets).
2031       M-tags need a more complex representation, as  they  need  to  store  a
2032       sequence  of  tag values. The most naive and inefficient representation
2033       of an m-tag is a list (array, vector) of tag values; a  more  efficient
2034       representation  is  to store all m-tags in a prefix tree represented as
2035       an array of nodes (v, p), where v is the tag value and p is a pointer to
2036       the parent node.
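
       A minimal plain-C sketch of the prefix-tree representation is given
       below. The type and function names (MtagNode, MtagTree, mtag_append,
       mtag_history) and the fixed capacity MTAG_MAX are assumptions made for
       illustration. An m-tag is an index of a node; copying one m-tag to
       another copies only the index, so both tags share their common history.

```c
#include <stddef.h>

#define MTAG_ROOT (-1)
#define MTAG_MAX  64

/* One node (v, p): a tag value plus the index of the parent node. */
typedef struct { const char *v; int p; } MtagNode;

typedef struct { MtagNode nodes[MTAG_MAX]; int size; } MtagTree;

/* Append value `v` to the history of m-tag `*t`.
 * Returns 0 on success, -1 if the tree is full. */
static int mtag_append(MtagTree *tree, int *t, const char *v)
{
    if (tree->size >= MTAG_MAX) return -1;
    tree->nodes[tree->size].v = v;
    tree->nodes[tree->size].p = *t;
    *t = tree->size++;
    return 0;
}

/* Store the history of m-tag `t` in `out`, oldest value first.
 * Returns the history length; at most `cap` values are stored. */
static int mtag_history(const MtagTree *tree, int t, const char **out, int cap)
{
    int len = 0, k, i;

    for (i = t; i != MTAG_ROOT; i = tree->nodes[i].p) ++len;

    k = len;
    for (i = t; i != MTAG_ROOT; i = tree->nodes[i].p) {
        --k;
        if (k < cap) out[k] = tree->nodes[i].v;
    }
    return len;
}
```

       With this layout YYMTAGP(t) could be defined as a call to mtag_append
       with the current input position, and YYMTAGN(t) as a call with NULL.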
2037
2038       Here is an example of using s-tags to parse an IPv4 address.
2039
2040          // re2c $INPUT -o $OUTPUT
2041          #include <assert.h>
2042          #include <stdint.h>
2043
2044          static uint32_t num(const char *s, const char *e)
2045          {
2046              uint32_t n = 0;
2047              for (; s < e; ++s) n = n * 10 + (*s - '0');
2048              return n;
2049          }
2050
2051          static const uint64_t ERROR = ~0lu;
2052
2053          static uint64_t lex(const char *YYCURSOR)
2054          {
2055              const char *YYMARKER, *o1, *o2, *o3, *o4;
2056              /*!stags:re2c format = 'const char *@@;'; */
2057
2058              /*!re2c
2059              re2c:yyfill:enable = 0;
2060              re2c:flags:tags = 1;
2061              re2c:define:YYCTYPE = char;
2062
2063              octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
2064              dot = [.];
2065              end = [\x00];
2066
2067              @o1 octet dot @o2 octet dot @o3 octet dot @o4 octet end {
2068                  return num(o4, YYCURSOR - 1)
2069                      + (num(o3, o4 - 1) << 8)
2070                      + (num(o2, o3 - 1) << 16)
2071                      + (num(o1, o2 - 1) << 24);
2072              }
2073              * { return ERROR; }
2074              */
2075          }
2076
2077          int main()
2078          {
2079              assert(lex("1.2.3.4") == 0x01020304);
2080              assert(lex("127.0.0.1") == 0x7f000001);
2081              assert(lex("255.255.255.255") == 0xffffffff);
2082              assert(lex("1.2.3.") == ERROR);
2083              assert(lex("1.2.3.256") == ERROR);
2084              return 0;
2085          }
2086
2087
2088       Here  is  an  example  of using POSIX capturing groups to parse an IPv4
2089       address.
2090
2091          // re2c $INPUT -o $OUTPUT
2092          #include <assert.h>
2093          #include <stdint.h>
2094
2095          static uint32_t num(const char *s, const char *e)
2096          {
2097              uint32_t n = 0;
2098              for (; s < e; ++s) n = n * 10 + (*s - '0');
2099              return n;
2100          }
2101
2102          /*!maxnmatch:re2c*/
2103          static const uint64_t ERROR = ~0lu;
2104
2105          static uint64_t lex(const char *YYCURSOR)
2106          {
2107              const char *YYMARKER;
2108              const char *yypmatch[YYMAXNMATCH * 2];
2109              uint32_t yynmatch;
2110              /*!stags:re2c format = 'const char *@@;'; */
2111
2112              /*!re2c
2113              re2c:yyfill:enable = 0;
2114              re2c:flags:posix-captures = 1;
2115              re2c:define:YYCTYPE = char;
2116
2117              octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
2118              dot = [.];
2119              end = [\x00];
2120
2121              (octet) dot (octet) dot (octet) dot (octet) end {
2122                  assert(yynmatch == 5);
2123                  return num(yypmatch[8], yypmatch[9])
2124                      + (num(yypmatch[6], yypmatch[7]) << 8)
2125                      + (num(yypmatch[4], yypmatch[5]) << 16)
2126                      + (num(yypmatch[2], yypmatch[3]) << 24);
2127              }
2128              * { return ERROR; }
2129              */
2130          }
2131
2132          int main()
2133          {
2134              assert(lex("1.2.3.4") == 0x01020304);
2135              assert(lex("127.0.0.1") == 0x7f000001);
2136              assert(lex("255.255.255.255") == 0xffffffff);
2137              assert(lex("1.2.3.") == ERROR);
2138              assert(lex("1.2.3.256") == ERROR);
2139              return 0;
2140          }
2141
2142
2143       Here is an example of  using  m-tags  to  parse  a  semicolon-separated
2144       sequence  of  words  (C++).  Tag variables are stored in a tree that is
2145       packed in a vector.
2146
2147          // re2c $INPUT -o $OUTPUT
2148          #include <assert.h>
2149          #include <vector>
2150          #include <string>
2151
2152          static const int ROOT = -1;
2153
2154          struct Mtag {
2155              int pred;
2156              const char *tag;
2157          };
2158
2159          typedef std::vector<Mtag> MtagTree;
2160          typedef std::vector<std::string> Words;
2161
2162          static void mtag(int *pt, const char *t, MtagTree *tree)
2163          {
2164              Mtag m = {*pt, t};
2165              *pt = (int)tree->size();
2166              tree->push_back(m);
2167          }
2168
2169          static void unfold(const MtagTree &tree, int x, int y, Words &words)
2170          {
2171              if (x == ROOT) return;
2172              unfold(tree, tree[x].pred, tree[y].pred, words);
2173              const char *px = tree[x].tag, *py = tree[y].tag;
2174              words.push_back(std::string(px, py - px));
2175          }
2176
2177          #define YYMTAGP(t) mtag(&t, YYCURSOR, &tree)
2178          #define YYMTAGN(t) mtag(&t, NULL,     &tree)
2179          static bool lex(const char *YYCURSOR, Words &words)
2180          {
2181              const char *YYMARKER;
2182              /*!mtags:re2c format = "int @@ = ROOT;"; */
2183              MtagTree tree;
2184              int x, y;
2185
2186              /*!re2c
2187              re2c:yyfill:enable = 0;
2188              re2c:flags:tags = 1;
2189              re2c:define:YYCTYPE = char;
2190
2191              (#x [a-z]+ #y [;])+ {
2192                  words.clear();
2193                  unfold(tree, x, y, words);
2194                  return true;
2195              }
2196              * { return false; }
2197              */
2198          }
2199
2200          int main()
2201          {
2202              Words w;
2203              assert(lex("one;two;three;", w) && w == Words({"one", "two", "three"}));
2204              return 0;
2205          }
2206
2207

STORABLE STATE

2209       With -f --storable-state option re2c generates a lexer that  can  store
2210       its  current  state,  return to the caller, and later resume operations
2211       exactly where it left off. The default mode of operation in re2c  is  a
2212       "pull"  model,  in which the lexer "pulls" more input whenever it needs
2213       it. This may be unacceptable in cases when the input becomes  available
2214       piece  by piece (for example, if the lexer is invoked by the parser, or
2215       if the lexer program communicates via a socket protocol with some other
2216       program  that  must wait for a reply from the lexer before it transmits
2217       the next message). Storable state feature is intended exactly for  such
2218       cases:  it  allows  one to generate lexers that work in a "push" model.
2219       When the lexer needs more input, it stores its state and returns to the
2220       caller.  Later,  when  more input becomes available, the caller resumes
2221       the lexer exactly where it stopped. There are a few  changes  necessary
2222       compared to the "pull" model:
2223
2224       · Define YYGETSTATE() and YYSETSTATE(state) primitives.
2225
2226       · Define  yych,  yyaccept  and  state variables as a part of persistent
2227         lexer state. The state variable should be initialized to -1.
2228
2229       · YYFILL should return to the outer program instead of trying to supply
2230         more input. Return code should indicate that lexer needs more input.
2231
2232       · The  outer  program should recognize situations when lexer needs more
2233         input and respond appropriately.
2234
2235       · Use /*!getstate:re2c*/ directive if it is necessary  to  execute  any
2236         code before entering the lexer.
2237
2238       · Use  configurations  state:abort and state:nextlabel to further tweak
2239         the generated code.
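
       The idea behind these changes can be sketched with a small hand-writ‐
       ten lexer. This is not code generated by re2c: the names Lexer, Status
       and lex, and the state numbering, are assumptions made for illustra‐
       tion. The switch at the top plays the role of the /*!getstate:re2c*/
       directive, and each "store state and return" pair plays the role of
       YYSETSTATE followed by YYFILL.

```c
#include <stddef.h>

typedef enum { NEED_MORE, BAD_PACKET } Status;

typedef struct {
    const char *cur, *lim; /* unread part of the current chunk */
    int state;             /* -1 on first entry, otherwise the resume point */
    unsigned packets;      /* number of complete packets seen so far */
} Lexer;

/* Recognize packets of the form [a-z]+[;] in input that arrives in chunks.
 * When the chunk is exhausted, save the resume point and return NEED_MORE;
 * the caller points cur/lim at the next chunk and calls lex() again. */
static Status lex(Lexer *lx)
{
    /* Dispatch on the saved state. */
    switch (lx->state) {
    case 0:  goto in_start;
    case 1:  goto in_packet;
    default: break; /* -1: fresh start */
    }

in_start:
    if (lx->cur == lx->lim) { lx->state = 0; return NEED_MORE; }
    if (*lx->cur < 'a' || *lx->cur > 'z') return BAD_PACKET;
    ++lx->cur;

in_packet:
    if (lx->cur == lx->lim) { lx->state = 1; return NEED_MORE; }
    if (*lx->cur >= 'a' && *lx->cur <= 'z') { ++lx->cur; goto in_packet; }
    if (*lx->cur != ';') return BAD_PACKET;
    ++lx->cur;
    ++lx->packets;
    goto in_start;
}
```

       Because yych, yyaccept and the state must survive the return, they
       live in the persistent structure rather than in local variables; the
       sketch keeps only the state and a counter, as it needs nothing else.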
2240
2241       Here is an example of a "push"-model lexer that reads packets  from  a
2242       file  used  as a pipe. Each packet is a sequence of lowercase letters
2243       terminated by a semicolon. When the lexer needs more input, it stores
2244       its state and YYFILL returns WAITING to the caller. The caller writes
2245       the next packet to the file (if there is one), refills the buffer and
2246       calls  the  lexer  again, which resumes exactly where it stopped. The
2247       lexer terminates successfully with END when the EOF rule matches: this
2248       happens  only  if the lexer fails to read any input while being in the
2249       initial state. Abnormal termination happens in case of a syntax  error
2250       (BAD_PACKET)  or  in  case  the buffer is too small to hold a lexeme
2251       (BIG_PACKET), for example if one of the packets exceeds the buffer size.
2261
2262   Example
2263          // re2c $INPUT -o $OUTPUT -f
2264          #include <assert.h>
2265          #include <stdio.h>
2266          #include <string.h>
2267
2268          #define DEBUG    0
2269          #define LOG(...) if (DEBUG) fprintf(stderr, __VA_ARGS__);
2270          #define BUFSIZE  10
2271
2272          typedef struct {
2273              FILE *file;
2274              char buf[BUFSIZE + 1], *lim, *cur, *mar, *tok;
2275              unsigned yyaccept;
2276              int state;
2277          } Input;
2278
2279          static void init(Input *in, FILE *f)
2280          {
2281              in->file = f;
2282              in->cur = in->mar = in->tok = in->lim = in->buf + BUFSIZE;
2283              in->lim[0] = 0; // append sentinel symbol
2284              in->yyaccept = 0;
2285              in->state = -1;
2286          }
2287
2288          typedef enum {END, READY, WAITING, BAD_PACKET, BIG_PACKET} Status;
2289
2290          static Status fill(Input *in)
2291          {
2292              const size_t shift = in->tok - in->buf;
2293              const size_t free = BUFSIZE - (in->lim - in->tok);
2294
2295              if (free < 1) return BIG_PACKET;
2296
2297              memmove(in->buf, in->tok, BUFSIZE - shift);
2298              in->lim -= shift;
2299              in->cur -= shift;
2300              in->mar -= shift;
2301              in->tok -= shift;
2302
2303              const size_t read = fread(in->lim, 1, free, in->file);
2304              in->lim += read;
2305              in->lim[0] = 0; // append sentinel symbol
2306
2307              return READY;
2308          }
2309
2310          static Status lex(Input *in, unsigned int *recv)
2311          {
2312              char yych;
2313              /*!getstate:re2c*/
2314          loop:
2315              in->tok = in->cur;
2316              /*!re2c
2317                  re2c:eof = 0;
2318                  re2c:api:style = free-form;
2319                  re2c:define:YYCTYPE    = "char";
2320                  re2c:define:YYCURSOR   = "in->cur";
2321                  re2c:define:YYMARKER   = "in->mar";
2322                  re2c:define:YYLIMIT    = "in->lim";
2323                  re2c:define:YYGETSTATE = "in->state";
2324                  re2c:define:YYSETSTATE = "in->state = @@;";
2325                  re2c:define:YYFILL     = "return WAITING;";
2326
2327                  packet = [a-z]+[;];
2328
2329                  *      { return BAD_PACKET; }
2330                  $      { return END; }
2331                  packet { *recv = *recv + 1; goto loop; }
2332              */
2333          }
2334
2335          void test(const char **packets, Status status)
2336          {
2337              const char *fname = "pipe";
2338              FILE *fw = fopen(fname, "w");
2339              FILE *fr = fopen(fname, "r");
2340              setvbuf(fw, NULL, _IONBF, 0);
2341              setvbuf(fr, NULL, _IONBF, 0);
2342
2343              Input in;
2344              init(&in, fr);
2345              Status st;
2346              unsigned int send = 0, recv = 0;
2347
2348              for (;;) {
2349                  st = lex(&in, &recv);
2350                  if (st == END) {
2351                      LOG("done: got %u packets\n", recv);
2352                      break;
2353                  } else if (st == WAITING) {
2354                      LOG("waiting...\n");
2355                      if (*packets) {
2356                          LOG("sent packet %u\n", send);
2357                          fprintf(fw, "%s", *packets++);
2358                          ++send;
2359                      }
2360                      st = fill(&in);
2361                      LOG("queue: '%s'\n", in.buf);
2362                      if (st == BIG_PACKET) {
2363                          LOG("error: packet too big\n");
2364                          break;
2365                      }
2366                      assert(st == READY);
2367                  } else {
2368                      assert(st == BAD_PACKET);
2369                      LOG("error: ill-formed packet\n");
2370                      break;
2371                  }
2372              }
2373
2374              LOG("\n");
2375              assert(st == status);
2376              if (st == END) assert(recv == send);
2377
2378              fclose(fw);
2379              fclose(fr);
2380              remove(fname);
2381          }
2382
2383          int main()
2384          {
2385              const char *packets1[] = {0};
2386              const char *packets2[] = {"zero;", "one;", "two;", "three;", "four;", 0};
2387              const char *packets3[] = {"zer0;", 0};
2388              const char *packets4[] = {"goooooooooogle;", 0};
2389
2390              test(packets1, END);
2391              test(packets2, END);
2392              test(packets3, BAD_PACKET);
2393              test(packets4, BIG_PACKET);
2394
2395              return 0;
2396          }
2397
2398

REUSABLE BLOCKS

2400       Reuse  mode is enabled with the -r --reusable option. In this mode re2c
2401       allows one to reuse definitions, configurations and rules specified  by
2402       a  /*!rules:re2c*/  block  in  subsequent  /*!use:re2c*/  blocks. As of
2403       re2c-1.2 it is possible  to  mix  such  blocks  with  normal  /*!re2c*/
2404       blocks;  prior  to  that  re2c expects a single rules-block followed by
2405       use-blocks (normal blocks are disallowed). Use-blocks  can  have  addi‐
2406       tional  definitions, configurations and rules: they are merged to those
2407       specified by the rules-block.  A very common use case for -r --reusable
2408       option  is  a lexer that supports multiple input encodings: lexer rules
2409       are defined once and reused multiple times with encoding-specific  con‐
2410       figurations, such as re2c:flags:utf-8.
2411
2412       Below  is  an example of a multi-encoding lexer: it reads a phrase with
2413       Unicode math symbols and accepts input either in UTF-8 or in UTF-32. Note
2414       that the --input-encoding utf8 option allows us to write UTF-8-encoded
2415       symbols in the regular expressions;  without  this  option  re2c  would
2416       parse  them  as  a  plain  ASCII byte sequence (and we would have to use
2417       hexadecimal escape sequences).
2418
2419   Example
2420          // re2c $INPUT -o $OUTPUT -r --input-encoding utf8
2421          #include <assert.h>
2422          #include <stdint.h>
2423
2424          /*!rules:re2c
2425              re2c:yyfill:enable = 0;
2426
2427              "∀x ∃y: p(x, y)" { return 0; }
2428              *                { return 1; }
2429          */
2430
2431          static int lex_utf8(const uint8_t *YYCURSOR)
2432          {
2433              const uint8_t *YYMARKER;
2434              /*!use:re2c
2435              re2c:define:YYCTYPE = uint8_t;
2436              re2c:flags:8 = 1;
2437              */
2438          }
2439
2440          static int lex_utf32(const uint32_t *YYCURSOR)
2441          {
2442              const uint32_t *YYMARKER;
2443              /*!use:re2c
2444              re2c:define:YYCTYPE = uint32_t;
2445              re2c:flags:8 = 0;
2446              re2c:flags:u = 1;
2447              */
2448          }
2449
2450          int main()
2451          {
2452              static const uint8_t s8[] = // UTF-8
2453                  { 0xe2, 0x88, 0x80, 0x78, 0x20, 0xe2, 0x88, 0x83, 0x79
2454                  , 0x3a, 0x20, 0x70, 0x28, 0x78, 0x2c, 0x20, 0x79, 0x29 };
2455
2456              static const uint32_t s32[] = // UTF32
2457                  { 0x00002200, 0x00000078, 0x00000020, 0x00002203
2458                  , 0x00000079, 0x0000003a, 0x00000020, 0x00000070
2459                  , 0x00000028, 0x00000078, 0x0000002c, 0x00000020
2460                  , 0x00000079, 0x00000029 };
2461
2462              assert(lex_utf8(s8) == 0);
2463              assert(lex_utf32(s32) == 0);
2464              return 0;
2465          }
2466
2467
2468

ENCODING SUPPORT

2470       re2c supports the following encodings: ASCII  (default),  EBCDIC  (-e),
2471       UCS-2  (-w), UTF-16 (-x), UTF-32 (-u) and UTF-8 (-8).  See also inplace
2472       configuration re2c:flags.
2473
2474       The following concepts should be clarified when  talking  about  encod‐
2475       ings.  A code point is an abstract number that represents a single sym‐
2476       bol.  A code unit is the smallest unit of memory, which is used in  the
2477       encoded text (it corresponds to one character in the input stream). One
2478       or more code units may be needed to  represent  a  single  code  point,
2479       depending  on the encoding. In a fixed-length encoding, each code point
2480       is represented with an equal number of code units.  In  variable-length
2481       encodings, different code points can be represented with different
2482       numbers of code units.
2483
2484       · ASCII is a fixed-length encoding. Its code space includes 0x100  code
2485         points,  from 0 to 0xFF. A code point is represented with exactly one
2486         1-byte code unit, which has the same value as  the  code  point.  The
2487         size of YYCTYPE must be 1 byte.
2488
2489       · EBCDIC is a fixed-length encoding. Its code space includes 0x100 code
2490         points, from 0 to 0xFF. A code point is represented with exactly  one
2491         1-byte  code  unit,  which  has the same value as the code point. The
2492         size of YYCTYPE must be 1 byte.
2493
2494       · UCS-2 is a fixed-length encoding. Its  code  space  includes  0x10000
2495         code  points,  from  0  to 0xFFFF. One code point is represented with
2496         exactly one 2-byte code unit, which has the same value  as  the  code
2497         point. The size of YYCTYPE must be 2 bytes.
2498
2499       · UTF-16  is  a  variable-length  encoding. Its code space includes all
2500         Unicode code points, from 0 to 0xD7FF and from  0xE000  to  0x10FFFF.
2501         One  code point is represented with one or two 2-byte code units. The
2502         size of YYCTYPE must be 2 bytes.
2503
2504       · UTF-32 is a fixed-length encoding. Its code space includes  all  Uni‐
2505         code  code  points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One
2506         code point is represented with exactly one 4-byte code unit. The size
2507         of YYCTYPE must be 4 bytes.
2508
2509       · UTF-8 is a variable-length encoding. Its code space includes all Uni‐
2510         code code points, from 0 to 0xD7FF and from 0xE000 to  0x10FFFF.  One
2511         code point is represented with a sequence of one, two, three, or four
2512         1-byte code units. The size of YYCTYPE must be 1 byte.
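
       The difference between code points and code units can be made con‐
       crete with a short sketch that encodes one code point in UTF-8 and in
       UTF-16. The function names utf8_encode and utf16_encode are hypotheti‐
       cal helpers written for this illustration; they are not part of re2c.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if `cp` is a Unicode code point that can be encoded
 * (at most 0x10FFFF and not a surrogate). */
static int is_scalar(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Encode `cp` as UTF-8. Returns the number of 1-byte code units written
 * (1 to 4), or 0 if `cp` cannot be encoded. */
static size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (!is_scalar(cp)) return 0;
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}

/* Encode `cp` as UTF-16. Returns the number of 2-byte code units written
 * (1 or 2), or 0 if `cp` cannot be encoded. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (!is_scalar(cp)) return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

       For example, the code point 0x2200 takes three code units in UTF-8
       (0xe2, 0x88, 0x80) but a single code unit in UTF-16 and in UTF-32.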
2513
2514       In Unicode, values from range 0xD800 to  0xDFFF  (surrogates)  are  not
2515       valid  Unicode  code  points.  Any  encoded sequence of code units that
2516       would map to  Unicode  code  points  in  the  range  0xD800-0xDFFF,  is
2517       ill-formed.  The  user  can  control  how  re2c  treats such ill-formed
2518       sequences with the --encoding-policy <policy> switch.
2519
2520       For some encodings, there are code units that never occur  in  a  valid
2521       encoded  stream  (e.g.,  0xFF  byte in UTF-8). If the generated scanner
2522       must check for invalid input, the only correct way to do so is  to  use
2523       the  default  rule (*). Note that the full range rule ([^]) won't catch
2524       invalid code units when a variable-length encoding is used  ([^]  means
2525       "any  valid code point", whereas the default rule (*) means "any possi‐
2526       ble code unit").
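
       For UTF-8, the code units that never occur in a valid stream can be
       listed explicitly. The sketch below is a hand-written check, not some‐
       thing re2c generates, and the function name utf8_byte_possible is
       hypothetical. A lexer that must reject such bytes still needs the
       default rule, because [^] does not cover them.

```c
#include <stdint.h>

/* Returns 1 if byte `b` can appear somewhere in well-formed UTF-8,
 * and 0 if it can never appear. */
static int utf8_byte_possible(uint8_t b)
{
    /* 0xC0 and 0xC1 could only start an overlong two-byte sequence. */
    if (b == 0xC0 || b == 0xC1) return 0;
    /* 0xF5 to 0xFF could only start sequences beyond 0x10FFFF, or are
     * not lead bytes at all. */
    if (b >= 0xF5) return 0;
    return 1;
}
```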
2527

START CONDITIONS

2529       Conditions are enabled with -c --conditions.  This option allows one to
2530       encode multiple interrelated lexers within the same re2c block.
2531
2532       Each  lexer  corresponds to a single condition.  It starts with a label
2533       of the form yyc_name, where name is condition name and yyc  prefix  can
2534       be  adjusted  with configuration re2c:condprefix.  Different lexers are
2535       separated with  a  comment  /*  ***********************************  */
2536       which can be adjusted with configuration re2c:cond:divider.
2537
2538       Furthermore, each condition has a unique identifier of the form
2539       yycname, where name is condition name and yyc prefix can be adjusted with
2540       configuration re2c:condenumprefix. Identifiers have the type YYCONDTYPE
2541       and should be generated with /*!types:re2c*/ directive or -t
2542       --type-header  option.   Users shouldn't define these identifiers manu‐
2543       ally, as the order of conditions is not specified.
2544
2545       Before all conditions re2c generates entry code that checks the current
2546       condition  identifier  and transfers control flow to the start label of
2547       the active condition.  After matching  some  rule  of  this  condition,
2548       lexer  may  either  transfer control flow back to the entry code (after
2549       executing the associated action and optionally setting  another  condi‐
2550       tion with =>), or use :=> shortcut and transition directly to the start
2551       label of another condition (skipping the action and  the  entry  code).
2552       Configuration re2c:cond:goto allows one to change the default behavior.
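
       The control flow described above can be sketched by hand for a lexer
       with two conditions. This is an illustration, not re2c output: the
       condition names text and comment and the function count_code_chars are
       made up, and the enumeration is written manually only because the
       sketch does not go through re2c.

```c
/* Written by hand for this sketch; in a real lexer the identifiers would
 * come from the types directive or the type header. */
enum YYCONDTYPE { yyctext, yyccomment };

/* Count characters that are outside of comments, where a comment starts
 * with '#' and ends with a newline (neither delimiter is counted). */
static int count_code_chars(const char *s)
{
    enum YYCONDTYPE c = yyctext;
    int n = 0;

    for (;;) {
        /* Entry code: jump to the start label of the active condition. */
        switch (c) {
        case yyctext:    goto yyc_text;
        case yyccomment: goto yyc_comment;
        }
        return -1; /* not reached for valid condition values */

    yyc_text:
        if (*s == '\0') return n;
        if (*s == '#') {
            /* Set another condition and go back to the entry code. */
            ++s;
            c = yyccomment;
            continue;
        }
        ++s;
        ++n;
        continue;

    yyc_comment:
        if (*s == '\0') return n;
        if (*s == '\n') {
            /* Shortcut: jump directly to the start label of the other
             * condition, bypassing the entry code. */
            ++s;
            c = yyctext;
            goto yyc_text;
        }
        ++s;
        continue;
    }
}
```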
2553
2554       Syntactically each rule must be preceded with a list of comma-separated
2555       condition names or a wildcard * enclosed in angle  brackets  <  and  >.
2556       Wildcard  means "any condition" and is semantically equivalent to list‐
2557       ing all condition names.  Here regexp is a regular expression,  default
2558       refers to the default rule *, and action is a block of code.
2559
2560       · <conditions-or-wildcard>  regexp-or-default                 action
2561
2562       · <conditions-or-wildcard>  regexp-or-default  =>  condition  action
2563
2564       · <conditions-or-wildcard>  regexp-or-default  :=> condition
2565
2566       Rules with an exclamation mark ! in front of condition list have a spe‐
2567       cial meaning: they have  no  regular  expression,  and  the  associated
2568       action  is  merged  as  an entry code to actions of normal rules.  This
2569       might be a convenient place to perform a routine task that is common to
2570       all rules.
2571
2572       · <!conditions-or-wildcard>  action
2573
2574       Another  special  form  of rules with an empty condition list <> and no
2575       regular expression allows one to specify an "entry condition" that  can
2576       be  used to execute code before entering the lexer.  It is semantically
2577       equivalent to a condition with number zero, name 0 and an empty regular
2578       expression.
2579
2580       · <>                 action
2581
2582       · <>  =>  condition  action
2583
2584       · <>  :=> condition
2585
2586   Example
2587          // re2c $INPUT -o $OUTPUT -ci
2588          #include <stdint.h>
2589          #include <limits.h>
2590          #include <assert.h>
2591
2592          static const uint64_t ERROR = ~0lu;
2593          /*!types:re2c*/
2594
2595          template<int BASE> static void adddgt(uint64_t &u, unsigned int d)
2596          {
2597              u = u * BASE + d;
2598              if (u > UINT32_MAX) u = ERROR;
2599          }
2600
2601          static uint64_t parse_u32(const char *s)
2602          {
2603              const char *YYMARKER;
2604              int c = yycinit;
2605              uint64_t u = 0;
2606
2607              /*!re2c
2608              re2c:yyfill:enable = 0;
2609              re2c:api:style = free-form;
2610              re2c:define:YYCTYPE = char;
2611              re2c:define:YYCURSOR = s;
2612              re2c:define:YYGETCONDITION = "c";
2613              re2c:define:YYSETCONDITION = "c = @@;";
2614
2615              <*> * { return ERROR; }
2616
2617              <init> '0b' / [01]        :=> bin
2618              <init> "0"                :=> oct
2619              <init> "" / [1-9]         :=> dec
2620              <init> '0x' / [0-9a-fA-F] :=> hex
2621
2622              <bin, oct, dec, hex> "\x00" { return u; }
2623
2624              <bin> [01]  { adddgt<2> (u, s[-1] - '0');      goto yyc_bin; }
2625              <oct> [0-7] { adddgt<8> (u, s[-1] - '0');      goto yyc_oct; }
2626              <dec> [0-9] { adddgt<10>(u, s[-1] - '0');      goto yyc_dec; }
2627              <hex> [0-9] { adddgt<16>(u, s[-1] - '0');      goto yyc_hex; }
2628              <hex> [a-f] { adddgt<16>(u, s[-1] - 'a' + 10); goto yyc_hex; }
2629              <hex> [A-F] { adddgt<16>(u, s[-1] - 'A' + 10); goto yyc_hex; }
2630              */
2631          }
2632
2633          int main()
2634          {
2635              assert(parse_u32("1234567890") == 1234567890);
2636              assert(parse_u32("0b1101") == 13);
2637              assert(parse_u32("0x7Fe") == 2046);
2638              assert(parse_u32("0644") == 420);
2639              assert(parse_u32("9999999999") == ERROR);
2640              assert(parse_u32("") == ERROR);
2641              return 0;
2642          }
2643
2644

SKELETON PROGRAMS

2646       With the -S, --skeleton option, re2c ignores all non-re2c code and gen‐
2647       erates a self-contained C program that can be further compiled and exe‐
2648       cuted. The program consists of lexer code and input data. For each con‐
2649       structed DFA (block or condition) re2c generates a standalone lexer and
2650       two files: an .input file with strings derived from the DFA and a .keys
2651       file with expected match results. The program runs each  lexer  on  the
2652       corresponding  .input  file and compares results with the expectations.
2653       Skeleton programs are very useful for a number of reasons:
2654
2655       · They can check correctness of various re2c optimizations (the data is
2656         generated  early  in the process, before any DFA transformations have
2657         taken place).
2658
2659       · Generating a set of input data with good coverage may be  useful  for
2660         both testing and benchmarking.
2661
2662       · Generating self-contained executable programs allows one to get mini‐
2663         mized test cases (the original code may be large or  have  a  lot  of
2664         dependencies).
2665
2666       The  difficulty with generating input data is that for all but the most
2667       trivial cases the number of possible input strings is too  large  (even
2668       if the string length is limited). Re2c solves this difficulty by gener‐
2669       ating sufficiently many strings to cover almost all DFA transitions. It
2670       uses  the  following  algorithm. First, it constructs a skeleton of the
2671       DFA. For encodings with 1-byte code unit size (such as ASCII, UTF-8 and
2672       EBCDIC)  skeleton is just an exact copy of the original DFA. For encod‐
2673       ings with multibyte code units skeleton is a copy of DFA  with  certain
2674       transitions omitted: namely, re2c takes at most 256 code units for each
2675       disjoint continuous range that corresponds to a  DFA  transition.   The
2676       chosen  values are evenly distributed and include range bounds. Instead
2677       of trying to cover all possible paths in the skeleton (which is  infea‐
2678       sible)  re2c  generates  sufficiently  many paths to cover all skeleton
2679       transitions, and thus trigger the corresponding  conditional  jumps  in
2680       the  lexer.  The algorithm implementation is limited by ~1Gb of
2681       transitions and consumes a constant amount of memory (re2c writes data
2682       to file as soon as it is generated).
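
       The selection of code units for one range can be sketched as follows.
       This is only one possible reading of "evenly distributed and include
       range bounds"; the actual selection made by re2c may differ, and the
       name sample_range is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SAMPLES 256

/* Pick at most MAX_SAMPLES values from the range [lo, hi], evenly spaced
 * and including both bounds. Returns the number of values written. */
static size_t sample_range(uint32_t lo, uint32_t hi, uint32_t out[MAX_SAMPLES])
{
    uint64_t width;
    size_t i, n;

    if (hi < lo) return 0;
    width = (uint64_t)hi - lo; /* distance between the bounds */

    /* Small range: take every value. */
    if (width < MAX_SAMPLES) {
        n = (size_t)width + 1;
        for (i = 0; i < n; ++i) out[i] = lo + (uint32_t)i;
        return n;
    }

    /* Large range: i = 0 gives lo, i = MAX_SAMPLES - 1 gives hi. */
    for (i = 0; i < MAX_SAMPLES; ++i) {
        out[i] = lo + (uint32_t)(width * i / (MAX_SAMPLES - 1));
    }
    return MAX_SAMPLES;
}
```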
2683

VISUALIZATION AND DEBUG

2685       With  the  -D, --emit-dot option, re2c does not generate code. Instead,
2686       it dumps the generated DFA in DOT format.  One can convert this dump to
2687       an  image of the DFA using Graphviz or another library.  Note that this
2688       option shows the final DFA after it has gone through a number of  opti‐
2689       mizations  and transformations. Earlier stages can be dumped with vari‐
2690       ous debug options, such as --dump-nfa,  --dump-dfa-raw  etc.  (see  the
2691       full list of options).
2692

SEE ALSO

2694       You  can  find  more  information  about  re2c at the official website:
2695       http://re2c.org.   Similar  programs  are   flex(1),   lex(1),   quex
2696       (http://quex.sourceforge.net).
2697

AUTHORS

2699       Re2c  was  originally written by Peter Bumbulis in 1993.  Since then it
2700       has been developed and maintained by multiple volunteers; most notably,
2701       Brian Young, Marcus Boerger, Dan Nuffer and Ulya Trofimovich.
2702
2703
2704
2705
2706                                                                       RE2C(1)