re2c(1) - f37

1RE2C(1)                                                                RE2C(1)
2
3
4

NAME

6       re2c - compile regular expressions to code
7

SYNOPSIS

9       re2c  [OPTIONS] INPUT [-o OUTPUT]
10
11       re2go [OPTIONS] INPUT [-o OUTPUT]
12

DESCRIPTION

14       re2c is a tool for generating fast lexical analyzers for C, C++ and Go.
15

SYNTAX

       A re2c program consists of normal code intermixed with re2c blocks and
       directives.  Each re2c block may contain definitions, configurations
       and rules.

       Definitions are of the form name = regexp; where name is an identifier
       that consists of letters, digits and underscores, and regexp is a
       regular expression.  Regular expressions may refer to other
       definitions, but recursion is not allowed and each name should be
       defined before it is used.

       Configurations are of the form re2c:config = value; where config is
       the configuration descriptor and value can be a number, a string or a
       special word.

       Rules consist of a regular expression followed by a semantic action
       (a block of code enclosed in curly braces { and }, or a raw one-line
       piece of code preceded by := and ended with a newline that is not
       followed by whitespace).  If the input matches the regular expression,
       the associated semantic action is executed.  If multiple rules match,
       the longest match takes precedence.  If multiple rules match the same
       string, the earlier rule takes precedence.

       There are two special rules: the default rule * and the EOF rule $.
       The default rule should always be defined; it has the lowest priority
       regardless of its place in the block and matches any code unit (not
       necessarily a valid character, see encoding support).  The EOF rule
       matches the end of input; it should be defined if the corresponding
       method for handling the end of input is used.  If start conditions
       are used, rules have a more complex syntax.

       All rules of a single block are compiled into a deterministic
       finite-state automaton (DFA) and encoded in the form of a program in
       the target language.  The generated code interfaces with the outer
       program by means of a few user-defined primitives (see the program
       interface section).  Reusable blocks allow sharing rules, definitions
       and configurations between different blocks.
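
       The precedence rules above can be illustrated with a minimal sketch
       in plain C.  It is not how re2c works internally (re2c compiles all
       rules into one DFA); it only mimics the outcome.  The helper names
       match_keyword_if, match_ident and pick_rule are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule matchers: each returns the match length at s, or 0. */
static size_t match_keyword_if(const char *s)
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_ident(const char *s)
{
    size_t n = 0;
    if (isalpha((unsigned char)s[0]) || s[0] == '_') {
        n = 1;
        while (isalnum((unsigned char)s[n]) || s[n] == '_') ++n;
    }
    return n;
}

/* Rules in source order: 0 is "if", 1 is ident, 2 is the default rule *. */
int pick_rule(const char *s, size_t *len)
{
    size_t (*rules[])(const char *) = { match_keyword_if, match_ident };
    int best = 2;   /* the default rule has the lowest priority */
    size_t i, n;

    *len = 1;       /* the default rule consumes exactly one code unit */
    for (i = 0; i < 2; ++i) {
        n = rules[i](s);
        /* A longer match wins; on equal length the earlier rule is kept. */
        if (n > 0 && (best == 2 || n > *len)) {
            best = (int)i;
            *len = n;
        }
    }
    return best;
}
```

       On the input "if" both rules match two characters, so the earlier
       rule wins; on "iffy" the identifier rule wins because its match is
       longer; on "+" only the default rule applies.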
44

EXAMPLE

   Input file
          // re2c $INPUT -o $OUTPUT -i
          #include <assert.h>                 //
                                              // C/C++ code
          int lex(const char *YYCURSOR)       //
          {
              /*!re2c                         // start of re2c block
              re2c:define:YYCTYPE = char;     // configuration
              re2c:yyfill:enable = 0;         // configuration
              re2c:flags:case-ranges = 1;     // configuration
                                              //
              ident = [a-zA-Z_][a-zA-Z_0-9]*; // named definition
                                              //
              ident { return 0; }             // normal rule
              *     { return 1; }             // default rule
              */
          }                                   //
                                              //
          int main()                          //
          {                                   // C/C++ code
              assert(lex("_Zer0") == 0);      //
              return 0;                       //
          }                                   //

   Output file
          /* Generated by re2c */
          // re2c $INPUT -o $OUTPUT -i
          #include <assert.h>                 //
                                              // C/C++ code
          int lex(const char *YYCURSOR)       //
          {

          {
              char yych;
              yych = *YYCURSOR;
              switch (yych) {
              case 'A' ... 'Z':
              case '_':
              case 'a' ... 'z': goto yy4;
              default: goto yy2;
              }
          yy2:
              ++YYCURSOR;
              { return 1; }
          yy4:
              yych = *++YYCURSOR;
              switch (yych) {
              case '0' ... '9':
              case 'A' ... 'Z':
              case '_':
              case 'a' ... 'z': goto yy4;
              default: goto yy6;
              }
          yy6:
              { return 0; }
          }

          }                                   //
                                              //
          int main()                          //
          {                                   // C/C++ code
              assert(lex("_Zer0") == 0);      //
              return 0;                       //
          }                                   //

OPTIONS

114       -? -h --help
115              Show help message.
116
117       -1 --single-pass
118              Deprecated. Does nothing (single pass is the default now).
119
120       -8 --utf-8
121              Generate  a  lexer that reads input in UTF-8 encoding.  re2c as‐
122              sumes that character range is 0 -- 0x10FFFF and  character  size
123              is 1 byte.
124
125       -b --bit-vectors
126              Optimize conditional jumps using bit masks. Implies -s.
127
128       -c --conditions --start-conditions
129              Enable  support of Flex-like "conditions": multiple interrelated
130              lexers within one block. Option --start-conditions is  a  legacy
131              alias; use --conditions instead.
132
133       --case-insensitive
134              Treat  single-quoted  and double-quoted strings as case-insensi‐
135              tive.
136
137       --case-inverted
138              Invert the meaning of single-quoted and  double-quoted  strings:
139              treat  single-quoted strings as case-sensitive and double-quoted
140              strings as case-insensitive.
141
142       --case-ranges
              Collapse consecutive cases in a switch statement into a range
144              of  the  form case low ... high:. This syntax is an extension of
145              the C/C++ language, supported by compilers like GCC,  Clang  and
146              Tcc.  The main advantage over using single cases is smaller gen‐
147              erated C code and faster generation time, although for some com‐
148              pilers  like  Tcc  it also results in smaller binary size.  This
149              option doesn't work for the Go backend.
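
              As a small sketch of the difference, the two functions below
              test for a decimal digit, once with the case-range extension
              and once with single cases.  The function names are
              hypothetical, and the first function assumes a compiler that
              accepts the extension (for example GCC or Clang).

```c
/* GNU C extension: one label covers the whole range of digits. */
int is_digit_ranges(char c)
{
    switch (c) {
    case '0' ... '9': return 1;
    default:          return 0;
    }
}

/* Portable C: one label per character. */
int is_digit_cases(char c)
{
    switch (c) {
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9': return 1;
    default:                                          return 0;
    }
}
```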
150
151       --depfile FILE
              Write dependency information to FILE in the form of a Makefile
              rule <output-file> : <input-file> [include-file ...].  This
              makes it possible to track build dependencies in the presence
              of /*!include:re2c*/ directives, so that updating include
              files triggers regeneration of the output file.  This option
              requires that the -o --output option is specified.
158
159       -e --ecb
              Generate a lexer that reads input in EBCDIC encoding.  re2c
              assumes that character range is 0 -- 0xFF and character size
              is 1 byte.
163
164       --empty-class <match-empty | match-none | error>
              Define the way re2c treats empty character classes.  With
              match-empty (the default) empty class matches empty input (which
              is illogical, but backwards-compatible).  With match-none
              empty class always fails to match.  With error empty class
              raises a compilation error.
170
171       --encoding-policy <fail | substitute | ignore>
172              Define  the  way re2c treats Unicode surrogates.  With fail re2c
173              aborts with an error when a surrogate is encountered.  With sub‐
174              stitute  re2c  silently  replaces surrogates with the error code
175              point 0xFFFD. With ignore (the default) re2c  treats  surrogates
176              as normal code points. The Unicode standard says that standalone
177              surrogates are invalid, but real-world  libraries  and  programs
178              behave in different ways.
179
180       -f --storable-state
181              Generate  a lexer which can store its inner state.  This is use‐
182              ful in push-model lexers which are stopped by an  outer  program
183              when there is not enough input, and then resumed when more input
184              becomes available. In this mode users should additionally define
185              YYGETSTATE()  and  YYSETSTATE(state)  macros and variables yych,
186              yyaccept and state as part of the lexer state.
187
188       -F --flex-syntax
189              Partial support for Flex syntax: in this mode named  definitions
190              don't  need  the  equal  sign and the terminating semicolon, and
191              when used they must be surrounded by curly braces. Names without
192              curly braces are treated as double-quoted strings.
193
194       -g --computed-gotos
195              Optimize  conditional  jumps  using non-standard "computed goto"
196              extension (which must be supported by the compiler). re2c gener‐
197              ates jump tables only in complex cases with a lot of conditional
198              branches.  Complexity   threshold   can   be   configured   with
199              cgoto:threshold  configuration. This option implies -b. This op‐
200              tion doesn't work for the Go backend.
201
202       -I PATH
203              Add PATH to the list of locations which are used when  searching
204              for  include  files.  This  option is useful in combination with
205              /*!include:re2c ... */ directive. Re2c looks for FILE in the di‐
206              rectory of including file and in the list of include paths spec‐
207              ified by -I option.
208
209       -i --no-debug-info
210              Do not output #line information. This is useful when the  gener‐
211              ated code is tracked by some version control system or IDE.
212
213       --input <default | custom>
              Specify the API used by the generated code to interface with
              user-defined code.  Option default is the C API based on pointer
              arithmetic (it is the default for the C backend).  Option custom
              is the generic API (it is the default for the Go backend).
218
219       --input-encoding <ascii | utf8>
220              Specify the way re2c parses  regular  expressions.   With  ascii
221              (the  default) re2c handles input as ASCII-encoded: any sequence
222              of code units is a sequence  of  standalone  1-byte  characters.
223              With  utf8  re2c  handles  input  as UTF8-encoded and recognizes
224              multibyte characters.
225
226       --lang <c | go>
227              Specify the output language. Supported languages are  C  and  Go
228              (the default is C).
229
230       --location-format <gnu | msvc>
231              Specify  location  format  in  messages.  With gnu locations are
232              printed as 'filename:line:column: ...'.  With msvc locations are
233              printed as 'filename(line,column) ...'.  Default is gnu.
234
235       --no-generation-date
236              Suppress date output in the generated file.
237
238       --no-version
239              Suppress version output in the generated file.
240
241       -o OUTPUT --output=OUTPUT
242              Specify the OUTPUT file.
243
244       -P --posix-captures
245              Enable submatch extraction with POSIX-style capturing groups.
246
247       -r --reusable
248              Allows reuse of re2c rules with /*!rules:re2c */ and /*!use:re2c
249              */ blocks. Exactly one rules-block must be  present.  The  rules
250              are  saved  and  used by every use-block that follows, which may
251              add its own rules and configurations.
252
253       -S --skeleton
254              Ignore user-defined interface code and generate a self-contained
255              "skeleton"  program.  Additionally,  generate  input  files with
256              strings derived from the regular grammar  and  compressed  match
257              results  that  are used to verify "skeleton" behavior on all in‐
258              puts. This option is useful for finding  bugs  in  optimizations
259              and  code  generation. This option doesn't work for the Go back‐
260              end.
261
262       -s --nested-ifs
263              Use nested if statements instead of switch statements in  condi‐
264              tional  jumps.  This usually results in more efficient code with
265              non-optimizing compilers.
266
267       -T --tags
268              Enable submatch extraction with tags.
269
270       -t HEADER --type-header=HEADER
              Generate a HEADER file that contains an enum with condition
              names.  Requires the -c option.
273
274       -u --unicode
275              Generate  a  lexer  that reads UTF32-encoded input. Re2c assumes
276              that character range is 0 -- 0x10FFFF and character  size  is  4
277              bytes. This option implies -s.
278
279       -V --vernum
280              Show version information in MMmmpp format (major, minor, patch).
281
282       --verbose
283              Output a short message in case of success.
284
285       -v --version
286              Show version information.
287
288       -w --wide-chars
289              Generate  a  lexer  that  reads UCS2-encoded input. Re2c assumes
290              that character range is 0 -- 0xFFFF  and  character  size  is  2
291              bytes. This option implies -s.
292
293       -x --utf-16
294              Generate  a  lexer  that reads UTF16-encoded input. Re2c assumes
295              that character range is 0 -- 0x10FFFF and character  size  is  2
296              bytes. This option implies -s.
297
298   Debug options
299       -D --emit-dot
300              Instead  of  normal  output generate lexer graph in .dot format.
301              The output can be  converted  to  an  image  with  the  help  of
302              Graphviz (e.g. something like dot -Tpng -odfa.png dfa.dot).
303
304       -d --debug-output
305              Emit  YYDEBUG  in the generated code.  YYDEBUG should be defined
306              by the user in the form of a void function with two  parameters:
307              state  (lexer  state  or -1) and symbol (current input symbol of
308              type YYCTYPE).
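
              A minimal sketch of such a definition is shown below.  The
              function name yydebug_trace and the call counter are
              hypothetical, and YYCTYPE is assumed to be unsigned char.

```c
#include <stdio.h>

typedef unsigned char YYCTYPE;

static int yydebug_calls = 0;   /* hypothetical counter, illustration only */

/* Print the lexer state (or -1) and the current input symbol. */
static void yydebug_trace(int state, YYCTYPE symbol)
{
    ++yydebug_calls;
    fprintf(stderr, "state %d, symbol 0x%02x\n", state, (unsigned)symbol);
}

#define YYDEBUG(state, symbol)  yydebug_trace((state), (symbol))
```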
309
310       --dump-adfa
311              Debug option: output DFA after tunneling (in .dot format).
312
313       --dump-cfg
314              Debug option: output control flow graph  of  tag  variables  (in
315              .dot format).
316
317       --dump-closure-stats
318              Debug  option: output statistics on the number of states in clo‐
319              sure.
320
321       --dump-dfa-det
322              Debug option: output DFA immediately after  determinization  (in
323              .dot format).
324
325       --dump-dfa-min
326              Debug option: output DFA after minimization (in .dot format).
327
328       --dump-dfa-tagopt
329              Debug  option:  output DFA after tag optimizations (in .dot for‐
330              mat).
331
332       --dump-dfa-tree
333              Debug option: output DFA under construction with  states  repre‐
334              sented as tag history trees (in .dot format).
335
336       --dump-dfa-raw
337              Debug  option:  output  DFA  under  construction  with  expanded
338              state-sets (in .dot format).
339
340       --dump-interf
341              Debug option: output interference  table  produced  by  liveness
342              analysis of tag variables.
343
344       --dump-nfa
345              Debug option: output NFA (in .dot format).
346
347   Internal options
348       --dfa-minimization <moore | table>
349              Internal  option:  DFA  minimization algorithm used by re2c. The
350              moore option is the Moore algorithm (it is the default). The ta‐
351              ble  option  is  the  "table filling" algorithm. Both algorithms
352              should produce the same DFA up to states relabeling; table fill‐
353              ing  is simpler and much slower and serves as a reference imple‐
354              mentation.
355
356       --eager-skip
357              Internal option: make the generated lexer advance the input  po‐
358              sition  eagerly  --  immediately after reading the input symbol.
359              This changes the default behavior when the input position is ad‐
360              vanced lazily -- after transition to the next state. This option
361              is implied by --no-lookahead.
362
363       --no-lookahead
364              Internal option: use TDFA(0) instead of  TDFA(1).   This  option
365              has effect only with --tags or --posix-captures options.
366
367       --no-optimize-tags
              Internal option: suppress optimization of tag variables (useful
              for debugging).
370
371       --posix-closure <gor1 | gtop>
372              Internal option: specify shortest-path algorithm  used  for  the
373              construction of epsilon-closure with POSIX disambiguation seman‐
374              tics: gor1 (the default) stands for  Goldberg-Radzik  algorithm,
375              and gtop stands for "global topological order" algorithm.
376
377       --posix-prectable <complex | naive>
378              Internal  option:  specify  the  algorithm used to compute POSIX
379              precedence table. The complex algorithm computes precedence  ta‐
380              ble  in one traversal of tag history tree and has quadratic com‐
381              plexity in the number of TNFA states; it  is  the  default.  The
382              naive algorithm has worst-case cubic complexity in the number of
383              TNFA states, but it is much simpler  than  complex  and  may  be
384              slightly faster in non-pathological cases.
385
386       --stadfa
387              Internal  option:  use staDFA algorithm for submatch extraction.
388              The main difference with TDFA is that tag operations  in  staDFA
389              are placed in states, not on transitions.
390
391       --fixed-tags <none | toplevel | all>
392              Internal  option:  specify  whether  the  fixed-tag optimization
393              should be applied to all tags (all), none  of  them  (none),  or
394              only  those in toplevel concatenation (toplevel). The default is
395              all.  "Fixed" tags are those that are  located  within  a  fixed
396              distance  to  some other tag (called "base"). In such cases only
              the base tag needs to be tracked, and the value of the fixed tag
              can be computed as the value of the base tag plus a static
              offset. For tags that are under alternative or repetition it is
400              also necessary to check if the base tag has a no-match value (in
401              that case fixed tag should also be set to no-match, disregarding
402              the  offset).  For  tags in top-level concatenation the check is
403              not needed, because they always match.
404
405   Warnings
406       -W     Turn on all warnings.
407
408       -Werror
409              Turn warnings into errors. Note that this option  alone  doesn't
410              turn  on  any warnings; it only affects those warnings that have
411              been turned on so far or will be turned on later.
412
413       -W<warning>
414              Turn on warning.
415
416       -Wno-<warning>
417              Turn off warning.
418
419       -Werror-<warning>
420              Turn on warning and treat it as an error (this implies  -W<warn‐
421              ing>).
422
423       -Wno-error-<warning>
424              Don't  treat  this  particular warning as an error. This doesn't
425              turn off the warning itself.
426
427       -Wcondition-order
428              Warn if the generated program makes implicit  assumptions  about
429              condition numbering. One should use either the -t, --type-header
430              option or the /*!types:re2c*/ directive to generate a mapping of
431              condition names to numbers and then use the autogenerated condi‐
432              tion names.
433
434       -Wempty-character-class
435              Warn if a regular expression contains an empty character  class.
436              Trying  to  match  an  empty  character class makes no sense: it
437              should always fail.  However, for backwards  compatibility  rea‐
438              sons  re2c  allows  empty  character  classes and treats them as
439              empty strings. Use the --empty-class option to  change  the  de‐
440              fault behavior.
441
442       -Wmatch-empty-string
443              Warn  if  a  rule is nullable (matches an empty string).  If the
444              lexer runs in a loop and the empty match is  unintentional,  the
445              lexer may unexpectedly hang in an infinite loop.
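
              The hazard can be sketched in plain C.  The matcher below
              stands in for a nullable rule such as [a-z]*; the function
              names are hypothetical.  The loop checks for an empty match
              explicitly, because without that check it would never
              advance.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical nullable rule [a-z]*: the match length may be 0. */
static size_t match_lower_star(const char *s)
{
    size_t n = 0;
    while (islower((unsigned char)s[n])) ++n;
    return n;
}

/* Count matches up to the end of the string; return -1 on an empty match. */
long count_tokens(const char *s)
{
    long count = 0;
    while (*s != '\0') {
        size_t n = match_lower_star(s);
        if (n == 0) return -1;  /* no progress: the loop would spin here */
        s += n;
        ++count;
    }
    return count;
}
```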
446
447       -Wswapped-range
448              Warn  if  the  lower  bound of a range is greater than its upper
449              bound. The default  behavior  is  to  silently  swap  the  range
450              bounds.
451
452       -Wundefined-control-flow
453              Warn  if  some input strings cause undefined control flow in the
454              lexer (the faulty patterns are reported). This is the most  dan‐
455              gerous and most common mistake. It can be easily fixed by adding
456              the default rule * which has the lowest  priority,  matches  any
457              code unit, and consumes exactly one code unit.
458
459       -Wunreachable-rules
460              Warn about rules that are shadowed by other rules and will never
461              match.
462
463       -Wuseless-escape
464              Warn if a symbol is escaped when it shouldn't be.   By  default,
465              re2c  silently  ignores such escapes, but this may as well indi‐
466              cate a typo or an error in the escape sequence.
467
468       -Wnondeterministic-tags
469              Warn if a tag has n-th degree  of  nondeterminism,  where  n  is
470              greater than 1.
471
472       -Wsentinel-in-midrule
473              Warn  if  the sentinel symbol occurs in the middle of a rule ---
474              this may cause reads past the end of buffer, crashes  or  memory
475              corruption in the generated lexer. This warning is only applica‐
476              ble if the sentinel method of checking for the end of  input  is
477              used.   It  is set to an error if re2c:sentinel configuration is
478              used.
479

PROGRAM INTERFACE

481       Re2c has a flexible interface that gives the user both the freedom  and
482       the  responsibility to define how the generated code interacts with the
483       outer program.  There are two major options:
484
       • Pointer API.  It is also called the "default API", since it was
         historically the first, and for a long time the only one.  This is a
         more restricted API based on C pointer arithmetic.  It consists of
         pointer-like primitives YYCURSOR, YYMARKER, YYCTXMARKER and YYLIMIT,
         which are normally defined as pointers of type YYCTYPE*.  The pointer
         API is enabled by default for the C backend, and it cannot be used
         with other backends that do not have pointer arithmetic.
492
493
494
495       • Generic API.  This is a less restricted  API  that  does  not  assume
496         pointer  semantics.   It  consists  of primitives YYPEEK, YYSKIP, YY‐
497         BACKUP, YYBACKUPCTX, YYSTAGP, YYSTAGN, YYMTAGP,  YYMTAGN,  YYRESTORE,
498         YYRESTORECTX, YYRESTORETAG, YYSHIFT, YYSHIFTSTAG, YYSHIFTMTAG and YY‐
499         LESSTHAN.  For the C backend generic API is enabled with --input cus‐
500         tom  option  or  re2c:flags:input = custom; configuration; for the Go
501         backend it is enabled by default.  Generic API was added  in  version
502         0.14.   It is intentionally designed to give the user as much freedom
503         as possible in redefining the input model and the semantics  of  dif‐
504         ferent  actions  performed  by the generated code. As an example, one
505         can override YYPEEK to check for the end of input before reading  the
506         input character, or do some logging, etc.
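
       As a sketch of the kind of freedom the generic API gives, the
       example below defines YYPEEK and YYSKIP over an index into a buffer
       rather than a pointer, and makes YYPEEK check for the end of input
       before reading.  The Input type, the helper functions and lex_ident
       are hypothetical; lex_ident is a hand-written stand-in for generated
       code that matches [a-zA-Z_][a-zA-Z_0-9]*.

```c
#include <stddef.h>

/* Hypothetical input model: a buffer, its length and a position. */
typedef struct {
    const char *buf;
    size_t      len;
    size_t      pos;
} Input;

/* Generic API primitives; they assume a variable named in is in scope. */
#define YYPEEK()  (in->pos < in->len ? in->buf[in->pos] : '\0')
#define YYSKIP()  (++in->pos)

static int is_start(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

static int is_cont(char c)
{
    return is_start(c) || (c >= '0' && c <= '9');
}

/* Return 0 if an identifier was matched, 1 for the default rule. */
int lex_ident(Input *in)
{
    char yych = YYPEEK();
    if (!is_start(yych)) {
        YYSKIP();               /* default rule: consume one code unit */
        return 1;
    }
    do {
        YYSKIP();
        yych = YYPEEK();        /* yields '\0' at the end of input */
    } while (is_cont(yych));
    return 0;
}
```

       Because YYPEEK checks the position against the length, the matcher
       stops at the end of the declared input even when the underlying
       buffer holds more characters.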
507
508       Generic API has two styles:
509
510       • Function-like.   This  style  is  enabled with re2c:api:style = func‐
511         tions; configuration, and it is the default for C  backend.  In  this
512         style  API  primitives  should be defined as functions or macros with
513         parentheses, accepting the necessary arguments. For example, in C the
514         default pointer API can be defined in function-like style generic API
515         as follows:
516
            #define  YYPEEK()                 *YYCURSOR
            #define  YYSKIP()                 ++YYCURSOR
            #define  YYBACKUP()               YYMARKER = YYCURSOR
            #define  YYBACKUPCTX()            YYCTXMARKER = YYCURSOR
            #define  YYRESTORE()              YYCURSOR = YYMARKER
            #define  YYRESTORECTX()           YYCURSOR = YYCTXMARKER
            #define  YYRESTORETAG(tag)        YYCURSOR = tag
            #define  YYLESSTHAN(len)          YYLIMIT - YYCURSOR < len
            #define  YYSTAGP(tag)             tag = YYCURSOR
            #define  YYSTAGN(tag)             tag = NULL
            #define  YYSHIFT(shift)           YYCURSOR += shift
            #define  YYSHIFTSTAG(tag, shift)  tag += shift
529
530
531
532       • Free-form.  This style is enabled with  re2c:api:style  =  free-form;
533         configuration,  and  it  is the default for Go backend. In this style
534         API primitives can be defined as free-form pieces of  code,  and  in‐
535         stead  of  arguments  they  have  interpolated  variables of the form
536         @@{name}, or optionally just @@ if there is only one argument. The @@
537         text  is  called  "sigil". It can be redefined to any other text with
538         re2c:api:sigil configuration. For example, the  default  pointer  API
539         can be defined in free-form style generic API as follows:
540
            re2c:define:YYPEEK       = "*YYCURSOR";
            re2c:define:YYSKIP       = "++YYCURSOR";
            re2c:define:YYBACKUP     = "YYMARKER = YYCURSOR";
            re2c:define:YYBACKUPCTX  = "YYCTXMARKER = YYCURSOR";
            re2c:define:YYRESTORE    = "YYCURSOR = YYMARKER";
            re2c:define:YYRESTORECTX = "YYCURSOR = YYCTXMARKER";
            re2c:define:YYRESTORETAG = "YYCURSOR = @@{tag}";
            re2c:define:YYLESSTHAN   = "YYLIMIT - YYCURSOR < @@{len}";
            re2c:define:YYSTAGP      = "@@{tag} = YYCURSOR";
            re2c:define:YYSTAGN      = "@@{tag} = NULL";
            re2c:define:YYSHIFT      = "YYCURSOR += @@{shift}";
            re2c:define:YYSHIFTSTAG  = "@@{tag} += @@{shift}";
553
554   API primitives
555       Here is a list of API primitives that may be used by the generated code
556       in order to interface with the outer  program.   Which  primitives  are
557       needed depends on multiple factors, including the complexity of regular
558       expressions, input representation, buffering, the use of  various  fea‐
559       tures and so on.  All the necessary primitives should be defined by the
560       user in the form of macros, functions, variables, free-form  pieces  of
561       code  or any other suitable form.  Re2c does not (and cannot) check the
562       definitions, so if anything is missing or defined incorrectly the  gen‐
563       erated code will not compile.
564
565       YYCTYPE
566              The  type  of  the  input  characters  (code units).  For ASCII,
567              EBCDIC and UTF-8 encodings it should be 1-byte unsigned integer.
568              For  UTF-16  or  UCS-2 it should be 2-byte unsigned integer. For
569              UTF-32 it should be 4-byte unsigned integer.
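
              One possible way to express these sizes in C is with the
              fixed-width types from <stdint.h>, as sketched below.  The
              typedef names and the is_high_byte helper are hypothetical.
              The helper shows why an unsigned type matters: with a signed
              char, bytes of 0x80 and above would compare as negative.

```c
#include <stdint.h>

typedef uint8_t  yyctype_1byte;   /* ASCII, EBCDIC, UTF-8 */
typedef uint16_t yyctype_2byte;   /* UTF-16, UCS-2 */
typedef uint32_t yyctype_4byte;   /* UTF-32 */

/* True for code units in the upper half of the 1-byte range. */
int is_high_byte(yyctype_1byte c)
{
    return c >= 0x80;
}
```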
570
571       YYCURSOR
              A pointer-like l-value that stores the current input position
              (usually a pointer of type YYCTYPE*).  Initially YYCURSOR should
              point to the first input character.  It is advanced by the
              generated code.  When a rule matches, YYCURSOR points one past
              the last matched character.  It is used only in the default C
              API.
578
579       YYLIMIT
              A pointer-like r-value that stores the end of input position
              (usually a pointer of type YYCTYPE*).  Initially YYLIMIT should
              point one past the last available input character.  It is not
              changed by the generated code.  The lexer compares YYCURSOR to
              YYLIMIT in order to determine if there are enough input
              characters left.  YYLIMIT is used only in the default C API.
586
587       YYMARKER
              A pointer-like l-value (usually a pointer of type YYCTYPE*) that
              stores the position of the latest matched rule.  It is used to
              restore the YYCURSOR position if the longer match fails and the
              lexer needs to roll back.  Initialization is not needed.
              YYMARKER is used only in the default C API.
593
594       YYCTXMARKER
595              A pointer-like l-value that stores the position of the  trailing
596              context  (usually a pointer of type YYCTYPE*). No initialization
597              is needed.  It is used only in the default C API, and only  with
598              the lookahead operator /.
599
600       YYFILL API  primitive  with one argument len.  The meaning of YYFILL is
601              to provide at least len more input characters or  fail.  If  EOF
602              rule  is  used, YYFILL should always return to the calling func‐
603              tion; the return value should be zero on success and non-zero on
604              failure. If EOF rule is not used, YYFILL return value is ignored
605              and it should not return on failure. Maximal value of len is YY‐
606              MAXFILL,  which  can  be generated with /*!max:re2c*/ directive.
607              The  definition  of  YYFILL  can  be  either  function-like   or
608              free-form  depending  on  the  API style (see re2c:api:style and
609              re2c:define:YYFILL:naked).
610
611       YYMAXFILL
612              An integral constant equal to the  maximal value of YYFILL argu‐
613              ment.  It can be generated with /*!max:re2c*/ directive.
614
615       YYLESSTHAN
              A generic API primitive with one argument len.  It should be
              defined as an r-value of boolean type that equals true if and
              only if there are fewer than len input characters left.  The
              definition can be either function-like or free-form depending
              on the API style (see re2c:api:style).
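
              A function-like definition over pointers might look like the
              sketch below.  The less_than_at helper is hypothetical and
              exists only to exercise the macro.

```c
#include <stddef.h>

static const char *YYCURSOR, *YYLIMIT;

/* True if and only if fewer than len characters are left. */
#define YYLESSTHAN(len)  ((size_t)(YYLIMIT - YYCURSOR) < (size_t)(len))

/* Place the cursor at offset in a buffer of size characters and report
 * whether fewer than need characters remain. */
int less_than_at(const char *buf, size_t size, size_t offset, size_t need)
{
    YYCURSOR = buf + offset;
    YYLIMIT  = buf + size;
    return YYLESSTHAN(need);
}
```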
621
622       YYPEEK A generic API primitive with no arguments.  It should be defined
623              as an r-value of type YYCTYPE that is equal to the character  at
624              the  current  input position. The definition can be either func‐
625              tion-like  or  free-form  depending  on  the  API   style   (see
626              re2c:api:style).
627
628       YYSKIP A  generic  API  primitive  with  no  arguments.  The meaning of
629              YYSKIP is to advance the current input position by  one  charac‐
630              ter. The definition can be either function-like or free-form de‐
631              pending on the API style (see re2c:api:style).
632
633       YYBACKUP
634              A generic API primitive with no arguments.  The meaning  of  YY‐
635              BACKUP is to save the current input position, which is later re‐
636              stored with YYRESTORE.  The definition should  be  either  func‐
637              tion-like   or   free-form  depending  on  the  API  style  (see
638              re2c:api:style).
639
640       YYRESTORE
641              A generic API primitive with no arguments.  The meaning of YYRE‐
642              STORE  is  to  restore  the  current input position to the value
643              saved by  YYBACKUP.   The  definition  should  be  either  func‐
644              tion-like   or   free-form  depending  on  the  API  style  (see
645              re2c:api:style).
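
              To illustrate how YYPEEK, YYSKIP, YYBACKUP and YYRESTORE work
              together, the sketch below defines them in function-like style
              over two hypothetical pointers (cursor and marker) and uses them
              in a hand-written matcher for the two rules "a" and "abc".  The
              matcher is not re2c output; it only mimics the order in which
              generated code would typically call the primitives, and it
              assumes NUL-terminated input so that peeking is always safe.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lexer state: current position and backup position. */
static const char *cursor;
static const char *marker;

#define YYPEEK()    (*cursor)
#define YYSKIP()    (++cursor)
#define YYBACKUP()  (marker = cursor)
#define YYRESTORE() (cursor = marker)

/* Returns the length of the longest match of "a" | "abc" at the start
 * of s, or 0 if neither rule matches. */
static size_t match_a_or_abc(const char *s)
{
    assert(s != NULL);
    cursor = s;
    if (YYPEEK() != 'a') {
        return 0;              /* neither rule matches */
    }
    YYSKIP();
    YYBACKUP();                /* "a" has matched: remember this position */
    if (YYPEEK() == 'b') {
        YYSKIP();
        if (YYPEEK() == 'c') {
            YYSKIP();
            return (size_t)(cursor - s);   /* longer match "abc" */
        }
        YYRESTORE();           /* "ab" is a dead end: roll back to "a" */
    }
    return (size_t)(cursor - s);
}
```

              On the input "abd" the matcher advances past "ab", fails to find
              "c", and restores the position saved after "a", so the shorter
              rule wins.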
646
647       YYBACKUPCTX
648              A generic API primitive with zero arguments.  The meaning of YY‐
649              BACKUPCTX  is to save the current input position as the position
650              of the trailing  context,  which  is  later  restored  by  YYRE‐
651              STORECTX.   The  definition  should  be  either function-like or
652              free-form depending on the API style (see re2c:api:style).
653
654       YYRESTORECTX
655              A generic API primitive with no arguments.  The meaning of YYRE‐
656              STORECTX  is to restore the trailing context position saved with
657              YYBACKUPCTX.  The definition should be either  function-like  or
658              free-form depending on the API style (see re2c:api:style).
659
660       YYRESTORETAG
661              A  generic  API primitive with one argument tag.  The meaning of
662              YYRESTORETAG is to restore the trailing context position to  the
663              value  of tag.  The definition should be either function-like or
664              free-form depending on the API style (see re2c:api:style).
665
666       YYSTAGP
667              A generic API primitive with one argument tag.  The  meaning  of
668              YYSTAGP  is to set tag value to the current input position.  The
669              definition should be either function-like or free-form depending
670              on the API style (see re2c:api:style).
671
672       YYSTAGN
673              A  generic  API primitive with one argument tag.  The meaning of
674              YYSTAGN is to set tag value to null (or some default value). The
675              definition should be either function-like or free-form depending
676              on the API style (see re2c:api:style).
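
              A possible function-like definition of YYSTAGP and YYSTAGN is
              sketched below, assuming that tags are plain pointers into a
              NUL-terminated buffer and that NULL is the chosen default value.
              The cursor variable and the value_start function are
              hypothetical; the function is a hand-written stand-in for a rule
              of roughly the form [a-z]* ("=" @v)? and is not generated by
              re2c.

```c
#include <assert.h>
#include <stddef.h>

static const char *cursor; /* hypothetical current input position */

#define YYPEEK()   (*cursor)
#define YYSKIP()   (++cursor)
#define YYSTAGP(t) ((t) = cursor) /* tag := current position */
#define YYSTAGN(t) ((t) = NULL)   /* tag := "no value" */

/* Returns the position right after '=' if the optional part is present,
 * or NULL if it is absent. */
static const char *value_start(const char *s)
{
    const char *v;
    assert(s != NULL);
    cursor = s;
    while (YYPEEK() >= 'a' && YYPEEK() <= 'z') {
        YYSKIP();
    }
    if (YYPEEK() == '=') {
        YYSKIP();
        YYSTAGP(v);  /* the tagged position was reached */
    } else {
        YYSTAGN(v);  /* the alternative with the tag was not taken */
    }
    return v;
}
```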
677
678       YYMTAGP
679              A generic API primitive with one argument tag.  The  meaning  of
680              YYMTAGP is to append the current position to the history of tag.
681              The definition should be either function-like or  free-form  de‐
682              pending on the API style (see re2c:api:style).
683
684       YYMTAGN
685              A  generic  API primitive with one argument tag.  The meaning of
686              YYMTAGN is to append null (or some other default) value  to  the
687              history  of  tag.  The definition can be either function-like or
688              free-form depending on the API style (see re2c:api:style).
689
690       YYSHIFT
691              A generic API primitive with one argument shift.  The meaning of
692              YYSHIFT  is to shift the current input position by shift charac‐
693              ters (the shift value may be negative). The  definition  can  be
694              either  function-like  or  free-form  depending on the API style
695              (see re2c:api:style).
696
697       YYSHIFTSTAG
698              A generic  API primitive with two arguments, tag and shift.  The
699              meaning  of YYSHIFTSTAG is to shift tag by shift characters (the
700              shift value may be negative).   The  definition  can  be  either
701              function-like  or  free-form  depending  on  the  API style (see
702              re2c:api:style).
703
704       YYSHIFTMTAG
705              A generic API primitive with two arguments, tag and shift.   The
706              meaning  of YYSHIFTMTAG is to shift the latest value in the his‐
707              tory of tag by shift characters (the shift value  may  be  nega‐
708              tive).    The  definition  should  be  either  function-like  or
709              free-form depending on the API style (see re2c:api:style).
710
711       YYMAXNMATCH
712              An integral constant equal to the maximal number of  POSIX  cap‐
713              turing   groups  in  a  rule.  It  is  generated  with  /*!maxn‐
714              match:re2c*/ directive.
715
716       YYCONDTYPE
717              The type of the condition enum.  It should be  generated  either
718              with /*!types:re2c*/ directive or -t --type-header option.
719
720       YYGETCONDITION
721              An  API  primitive with zero arguments.  It should be defined as
722              an r-value of type YYCONDTYPE that is equal to the current  con‐
723              dition identifier. The definition can be either function-like or
724              free-form depending on the API  style  (see  re2c:api:style  and
725              re2c:define:YYGETCONDITION:naked).
726
727       YYSETCONDITION
728              An  API primitive with one argument cond.  The meaning of YYSET‐
729              CONDITION is to set the current condition  identifier  to  cond.
730              The  definition  should be either function-like or free-form de‐
731              pending on the API style (see re2c:api:style and re2c:define:YY‐
732              SETCONDITION@cond).
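
              The following sketch shows one way YYGETCONDITION and
              YYSETCONDITION could be defined in function-like style.  The
              enum is written by hand here (it would normally be generated),
              the cond variable and the count_code_chars function are
              hypothetical, and the switch is a hand-written imitation of a
              two-condition lexer that skips C-style comments, not re2c
              output.

```c
#include <assert.h>
#include <stddef.h>

/* Written by hand for illustration; normally generated by re2c. */
enum YYCONDTYPE { yycinit, yyccomment };

static enum YYCONDTYPE cond; /* hypothetical storage for the condition */

#define YYGETCONDITION()  cond
#define YYSETCONDITION(c) (cond = (c))

/* Counts the characters of s that are outside of comments. */
static size_t count_code_chars(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    YYSETCONDITION(yycinit);
    while (*s != '\0') {
        switch (YYGETCONDITION()) {
        case yycinit:
            if (s[0] == '/' && s[1] == '*') {
                YYSETCONDITION(yyccomment); /* comment opens */
                s += 2;
            } else {
                ++n;
                ++s;
            }
            break;
        case yyccomment:
            if (s[0] == '*' && s[1] == '/') {
                YYSETCONDITION(yycinit);    /* comment closes */
                s += 2;
            } else {
                ++s;
            }
            break;
        }
    }
    return n;
}
```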
733
734       YYGETSTATE
735              An  API  primitive with zero arguments.  It should be defined as
736              an r-value of integer type that is equal to  the  current  lexer
737              state. Should be initialized to -1. The definition can be either
738              function-like or free-form  depending  on  the  API  style  (see
739              re2c:api:style and re2c:define:YYGETSTATE:naked).
740
741       YYSETSTATE
742              An API primitive with one argument state.  The meaning of YYSET‐
743              STATE is to set the current lexer state to state.   The  defini‐
744              tion  should  be  either function-like or free-form depending on
745              the  API  style  (see  re2c:api:style   and   re2c:define:YYSET‐
746              STATE@state).
747
748       YYDEBUG
749              A  debug API primitive with two arguments. It can be used to de‐
750              bug the generated code (with -d --debug-output option).  YYDEBUG
751              should return no value and accept two arguments: state (either a
752              DFA state index or -1) and symbol (the current input symbol).
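
              A minimal sketch of a YYDEBUG definition that matches the
              description above (two arguments, no return value) is given
              below.  The yydebug function and the debug_calls counter are
              hypothetical helpers, and printing to stderr is just one
              possible choice.

```c
#include <assert.h>
#include <stdio.h>

static unsigned debug_calls; /* hypothetical counter, for illustration */

/* Prints the DFA state (or -1) and the current input symbol. */
static void yydebug(int state, unsigned symbol)
{
    assert(state >= -1);
    ++debug_calls;
    fprintf(stderr, "state %d, symbol 0x%02x\n", state, symbol);
}

#define YYDEBUG(state, symbol) \
    yydebug((state), (unsigned)(unsigned char)(symbol))
```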
753
754       yych   An l-value of type YYCTYPE that stores the current input charac‐
755              ter.  User definition is necessary only with -f --storable-state
756              option.
757
758       yyaccept
759              An l-value of unsigned integral type that stores the  number  of
760              the latest matched rule.  User definition is necessary only with
761              -f --storable-state option.
762
763       yynmatch
764              An l-value of unsigned integral type that stores the  number  of
765              POSIX  capturing  groups in the matched rule.  Used only with -P
766              --posix-captures option.
767
768       yypmatch
769              An array of l-values that are used to hold the tag values corre‐
770              sponding  to the capturing parentheses in the matching rule. Ar‐
771              ray length must be at least yynmatch * 2 (usually YYMAXNMATCH  *
772              2 is a good choice).  Used only with -P --posix-captures option.
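
              As a sketch of how yypmatch might be sized and read, the code
              below assumes that the values for group i are stored as a pair
              at indices 2 * i (start) and 2 * i + 1 (end), with group 0
              covering the whole match, and that a NULL value marks a group
              that did not take part in the match.  The YYMAXNMATCH value is
              hard-coded here for illustration only, and group_length is a
              hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

#define YYMAXNMATCH 3 /* hard-coded for illustration; normally generated */

/* Returns the length of capturing group i, or 0 if the group is unset. */
static size_t group_length(const char **pmatch, size_t nmatch, size_t i)
{
    assert(pmatch != NULL && i < nmatch);
    const char *start = pmatch[2 * i];
    const char *end = pmatch[2 * i + 1];
    if (start == NULL || end == NULL) {
        return 0; /* the group did not participate in the match */
    }
    return (size_t)(end - start);
}
```

              With this layout an array of YYMAXNMATCH * 2 elements is large
              enough for any rule in the block.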
773
774   Directives
775       Below  is the list of all directives provided by re2c (in no particular
776       order).  More information on each directive can be found in the related
777       sections.
778
779       /*!re2c ... */
780              A standard re2c block.
781
782       %{ ... %}
783              A standard re2c block in -F --flex-support mode.
784
785       /*!rules:re2c ... */
786              A reusable re2c block (requires -r --reuse option).
787
788       /*!use:re2c ... */
789              A   block   that  reuses  previous  rules-block  specified  with
790              /*!rules:re2c ... */ (requires -r --reuse option).
791
792       /*!ignore:re2c ... */
793              A block whose contents are ignored and cut off from  the  output
794              file.
795
796       /*!max:re2c*/
797              This  directive  is substituted with the macro-definition of YY‐
798              MAXFILL.
799
800       /*!maxnmatch:re2c*/
801              This directive is substituted with the macro-definition  of  YY‐
802              MAXNMATCH (requires -P --posix-captures option).
803
804       /*!getstate:re2c*/
805              This directive is substituted with conditional dispatch on lexer
806              state (requires -f --storable-state option).
807
808       /*!types:re2c ... */
809              This directive is substituted with the definition  of  condition
810              enum (requires -c --conditions option).
811
812       /*!stags:re2c ... */, /*!mtags:re2c ... */
813              These  directives  allow one to specify a template piece of code
814              that is expanded for  each  s-tag/m-tag  variable  generated  by
815              re2c. This block has two optional configurations: format = "@@";
816              (specifies the template where @@ is substituted with the name of
817              each  tag variable), and separator = ""; (specifies the piece of
818              code used to join the generated pieces for different  tag  vari‐
819              ables).
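
              For example, a directive of the following form could be used to
              declare one pointer per s-tag variable.  This is a fragment
              only: the element type const char * is an assumption about how
              tags are represented in the surrounding program, and the names
              and number of the variables depend on the rules.

```c
/* Expands to one declaration per s-tag variable, joined by a space,
 * for instance "const char *yyt1; const char *yyt2;". */
/*!stags:re2c format = "const char *@@;"; separator = " "; */
```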
820
821       /*!include:re2c FILE */
822              This  directive allows one to include FILE (in the same sense as
823              #include directive in C/C++).
824
825       /*!header:re2c:on*/
826              This directive marks the start of header file. Everything  after
827              it  and  up  to  the following /*!header:re2c:off*/ directive is
828              processed by re2c and written to the header file specified  with
829              -t --type-header option.
830
831       /*!header:re2c:off*/
832              This  directive  marks  the  end  of  header  file  started with
833              /*!header:re2c:on*/.
834
835   Configurations
836       re2c:flags:t, re2c:flags:type-header
837              Specify the name of the generated header file  relative  to  the
838              directory  of  the  output file. (Same as -t, --type-header com‐
839              mand-line option except that the filepath is relative.)
840
841       re2c:flags:input
842              Same as --input command-line option.
843
844       re2c:api:style
845              Allows one to specify the style of generic API. Possible  values
846              are  functions  and free-form. With functions style (the default
847              for the C backend) API primitives  behave  like  functions,  and
848              re2c  generates parentheses with an argument list after the name
849              of each primitive.  With free-form style (the default for the Go
850              backend) re2c treats API definitions as interpolated strings and
851              substitutes argument placeholders with the actual argument  val‐
852              ues.   This  option  can be overridden by options for individual
853              API primitives, e.g. re2c:define:YYFILL:naked for YYFILL.
854
855       re2c:api:sigil
856              Allows one to specify the "sigil" symbol  (or  string)  that  is
857              used  to  recognize  argument placeholders in the definitions of
858              generic API primitives.  The default value is @@.   Placeholders
859              start with sigil, followed by the argument name in curly braces.
860              For example, if sigil is set to $, then placeholders  will  have
861              the  form  ${name}. Single-argument APIs may use shorthand nota‐
862              tion without the name in braces. This option can  be  overridden
863              by  options  for individual API primitives, e.g. re2c:define:YY‐
864              FILL@len for YYFILL.
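
              The fragment below sketches how the sigil might be used with
              free-form definitions, assuming the generic API is in use.  The
              variable cur is hypothetical, rules are omitted, and a complete
              block would need to define the remaining primitives as well.

```c
/* With the sigil set to "$", named placeholders take the form ${name}.
 * YYSHIFTSTAG has two arguments, so both are referred to by name. */
/*!re2c
    re2c:api:style = free-form;
    re2c:api:sigil = "$";

    re2c:define:YYSKIP      = "++cur;";
    re2c:define:YYSHIFTSTAG = "${tag} += ${shift};";
*/
```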
865
866       re2c:define:YYCTYPE
867              Defines YYCTYPE (see the user interface section).
868
869       re2c:define:YYCURSOR
870              Defines C API primitive YYCURSOR (see the  user  interface  sec‐
871              tion).
872
873       re2c:define:YYLIMIT
874              Defines  C  API  primitive  YYLIMIT (see the user interface sec‐
875              tion).
876
877       re2c:define:YYMARKER
878              Defines C API primitive YYMARKER (see the  user  interface  sec‐
879              tion).
880
881       re2c:define:YYCTXMARKER
882              Defines C API primitive YYCTXMARKER (see the user interface sec‐
883              tion).
884
885       re2c:define:YYFILL
886              Defines API primitive YYFILL (see the user interface section).
887
888       re2c:define:YYFILL@len
889              Specifies the sigil used for  argument  substitution  in  YYFILL
890              definition.   Defaults   to  @@.   Overrides  the  more  generic
891              re2c:api:sigil configuration.
892
893       re2c:define:YYFILL:naked
894              Allows one to override re2c:api:style for YYFILL.  Value 0  cor‐
895              responds to free-form API style.
896
897       re2c:yyfill:enable
898              Defaults  to 1 (YYFILL is enabled). Set this to zero to suppress
899              the generation of YYFILL. Use warnings (-W option) and re2c:sen‐
900              tinel  configuration  to  verify that the generated lexer cannot
901              read past the end of input, as this might introduce severe secu‐
902              rity issues to your programs.
903
904       re2c:yyfill:parameter
905              Controls the argument in the parentheses that follow YYFILL. De‐
906              faults to 1, which means that  the  argument  is  generated.  If
907              zero,  the  argument is omitted. Can be overridden with re2c:de‐
908              fine:YYFILL:naked or re2c:api:style.
909
910       re2c:eof
911              Specifies the sentinel symbol used with EOF rule $ to check  for
912              the end of input in the generated lexer. The default value is -1
913              (EOF rule is not used). Other possible values include all  valid
914              code units. Only decimal numbers are recognized.
915
916       re2c:sentinel
917              Specifies  the  sentinel symbol used with the sentinel method of
918              checking for the end of input in the generated lexer  (the  case
919              when  bounds  checking  is disabled with re2c:yyfill:enable = 0;
920              and EOF rule $ is not used). This configuration does not  affect
921              code  generation. It is used by re2c to verify that the sentinel
922              symbol is not allowed in the middle of  the  rule,  and  prevent
923              possible  reads  past  the end of buffer in the generated lexer.
924              The default value is -1 (re2c assumes that the  sentinel  symbol
925              is  0, which is the most common case). Other possible values in‐
926              clude all valid code units. Only decimal numbers are recognized.
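
              The two fragments below contrast the configurations described
              in the re2c:eof and re2c:sentinel entries.  Both assume that
              the input is terminated by a NUL character; rules are omitted.

```c
/* Sentinel method: bounds checks are disabled, and re2c is told that
 * the sentinel is NUL so that it can verify the rules against it. */
/*!re2c
    re2c:yyfill:enable = 0;
    re2c:sentinel = 0;
*/

/* EOF rule method: NUL is the sentinel symbol checked by the EOF rule $. */
/*!re2c
    re2c:eof = 0;
*/
```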
927
928       re2c:define:YYLESSTHAN
929              Defines generic API primitive YYLESSTHAN (see the user interface
930              section).
931
932       re2c:yyfill:check
933              Setting this to zero suppresses the generation of the YYFILL
934              check (YYLESSTHAN in generic API or YYLIMIT-based comparison in
935              default C API). This configuration is useful when the necessary
936              input is always available. It defaults to 1 (the check is gener‐
937              ated).
938
939       re2c:label:yyFillLabel
940              Allows  one to change the prefix of YYFILL labels (used with EOF
941              rule or with storable states).
942
943       re2c:define:YYPEEK
944              Defines generic API primitive YYPEEK  (see  the  user  interface
945              section).
946
947       re2c:define:YYSKIP
948              Defines  generic  API  primitive  YYSKIP (see the user interface
949              section).
950
951       re2c:define:YYBACKUP
952              Defines generic API primitive YYBACKUP (see the  user  interface
953              section).
954
955       re2c:define:YYBACKUPCTX
956              Defines  generic  API primitive YYBACKUPCTX (see the user inter‐
957              face section).
958
959       re2c:define:YYRESTORE
960              Defines generic API primitive YYRESTORE (see the user  interface
961              section).
962
963       re2c:define:YYRESTORECTX
964              Defines  generic API primitive YYRESTORECTX (see the user inter‐
965              face section).
966
967       re2c:define:YYRESTORETAG
968              Defines generic API primitive YYRESTORETAG (see the user  inter‐
969              face section).
970
971       re2c:define:YYSHIFT
972              Defines  generic  API  primitive YYSHIFT (see the user interface
973              section).
974
975       re2c:define:YYSHIFTMTAG
976              Defines generic API primitive YYSHIFTMTAG (see the  user  inter‐
977              face section).
978
979       re2c:define:YYSHIFTSTAG
980              Defines  generic  API primitive YYSHIFTSTAG (see the user inter‐
981              face section).
982
983       re2c:define:YYSTAGN
984              Defines generic API primitive YYSTAGN (see  the  user  interface
985              section).
986
987       re2c:define:YYSTAGP
988              Defines  generic  API  primitive YYSTAGP (see the user interface
989              section).
990
991       re2c:define:YYMTAGN
992              Defines generic API primitive YYMTAGN (see  the  user  interface
993              section).
994
995       re2c:define:YYMTAGP
996              Defines  generic  API  primitive YYMTAGP (see the user interface
997              section).
998
999       re2c:flags:T, re2c:flags:tags
1000              Same as -T --tags command-line option.
1001
1002       re2c:flags:P, re2c:flags:posix-captures
1003              Same as -P --posix-captures command-line option.
1004
1005       re2c:tags:expression
1006              Allows one to customize the way re2c  addresses  tag  variables.
1007              By  default  re2c generates expressions of the form yyt<N>. This
1008              might be inconvenient, for example if tag variables are  defined
1009              as fields in a struct. re2c recognizes placeholders of the form
1010              @@{tag} or @@ and replaces them with the actual tag name.  Sigil
1011              @@  can be redefined with re2c:api:sigil configuration.  For ex‐
1012              ample, setting re2c:tags:expression = "p->@@";  results  in  ex‐
1013              pressions of the form p->yyt<N> in the generated code.
1014
1015       re2c:tags:prefix
1016              Allows  one to override the prefix of tag variables (defaults to
1017              yyt).
1018
1019       re2c:flags:lookahead
1020              Same as inverted --no-lookahead command-line option.
1021
1022       re2c:flags:optimize-tags
1023              Same as inverted --no-optimize-tags command-line option.
1024
1025       re2c:define:YYCONDTYPE
1026              Defines YYCONDTYPE (see the user interface section).
1027
1028       re2c:define:YYGETCONDITION
1029              Defines API primitive YYGETCONDITION  (see  the  user  interface
1030              section).
1031
1032       re2c:define:YYGETCONDITION:naked
1033              Allows one to override re2c:api:style for YYGETCONDITION.  Value
1034              0 corresponds to free-form API style.
1035
1036       re2c:define:YYSETCONDITION
1037              Defines API primitive YYSETCONDITION  (see  the  user  interface
1038              section).
1039
1040       re2c:define:YYSETCONDITION@cond
1041              Specifies  the sigil used for argument substitution in YYSETCON‐
1042              DITION definition. The default value is @@.  Overrides the  more
1043              generic re2c:api:sigil configuration.
1044
1045       re2c:define:YYSETCONDITION:naked
1046              Allows one to override re2c:api:style for YYSETCONDITION.  Value
1047              0 corresponds to free-form API style.
1048
1049       re2c:cond:goto
1050              Allows one to customize the goto statements used with the short‐
1051              cut  :=>  rules  in  conditions.  The default value is goto @@;.
1052              Placeholders  are   substituted   with   condition   name   (see
1053              re2c:api:sigil and re2c:cond:goto@cond).
1054
1055       re2c:cond:goto@cond
1056              Specifies   the   sigil   used   for  argument  substitution  in
1057              re2c:cond:goto definition. The default value is  @@.   Overrides
1058              the more generic re2c:api:sigil configuration.
1059
1060       re2c:cond:divider
1061              Defines  the divider for condition blocks.  The default value is
1062              /*  ***********************************  */.   Placeholders  are
1063              substituted with condition name (see re2c:api:sigil and
1064              re2c:cond:divider@cond).
1065
1066       re2c:cond:divider@cond
1067              Specifies  the  sigil  used   for   argument   substitution   in
1068              re2c:cond:divider  definition.  The  default value is @@.  Over‐
1069              rides the more generic re2c:api:sigil configuration.
1070
1071       re2c:condprefix
1072              Specifies the prefix used for  condition  labels.   The  default
1073              value is yyc_.
1074
1075       re2c:condenumprefix
1076              Specifies  the  prefix  used for condition identifiers.  The de‐
1077              fault value is yyc.
1078
1079       re2c:define:YYGETSTATE
1080              Defines API primitive YYGETSTATE (see the  user  interface  sec‐
1081              tion).
1082
1083       re2c:define:YYGETSTATE:naked
1084              Allows  one  to override re2c:api:style for YYGETSTATE.  Value 0
1085              corresponds to free-form API style.
1086
1087       re2c:define:YYSETSTATE
1088              Defines API primitive YYSETSTATE (see the  user  interface  sec‐
1089              tion).
1090
1091       re2c:define:YYSETSTATE@state
1092              Specifies the sigil used for argument substitution in YYSETSTATE
1093              definition. The default value is @@.  Overrides the more generic
1094              re2c:api:sigil configuration.
1095
1096       re2c:define:YYSETSTATE:naked
1097              Allows  one  to override re2c:api:style for YYSETSTATE.  Value 0
1098              corresponds to free-form API style.
1099
1100       re2c:state:abort
1101              If set to a positive integer value,  changes  the  form  of  the
1102              YYGETSTATE  switch: instead of using default case to jump to the
1103              beginning of the lexer block, a -1 case is used, and the default
1104              case aborts the program.
1105
1106       re2c:state:nextlabel
1107              With storable states, controls whether the YYGETSTATE block
1108              is followed by a yyNext label (the default value is zero,  which
1109              corresponds to no label). Instead of using yyNext it is possible
1110              to use re2c:startlabel to force the  generation  of  a  specific
1111              start  label.   Instead  of using labels it is often more conve‐
1112              nient to generate YYGETSTATE code using /*!getstate:re2c*/.
1113
1114       re2c:label:yyNext
1115              Allows one to change the name of the yyNext label.
1116
1117       re2c:startlabel
1118              Controls the generation of start label for the next lexer block.
1119              The  default  value is zero, which means that the start label is
1120              generated only if it is used. An integer value greater than zero
1121              forces the generation of start label even if it is unused by the
1122              lexer. A string value also forces  start  label  generation  and
1123              sets the label name to the specified string.  This configuration
1124              applies only to the current block (it is reset  to  default  for
1125              the next block).
1126
1127       re2c:flags:s, re2c:flags:nested-ifs
1128              Same as -s --nested-ifs command-line option.
1129
1130       re2c:flags:b, re2c:flags:bit-vectors
1131              Same as -b --bit-vectors command-line option.
1132
1133       re2c:variable:yybm
1134              Overrides the name of the yybm variable.
1135
1136       re2c:yybm:hex
1137              Defaults  to  zero (a decimal bitmap table is generated). If set
1138              to nonzero, a hexadecimal table is generated.
1139
1140       re2c:flags:g, re2c:flags:computed-gotos
1141              Same as -g --computed-gotos command-line option.
1142
1143       re2c:cgoto:threshold
1144              With -g --computed-gotos option this value  specifies  the  com‐
1145              plexity  threshold  that  triggers the generation of jump tables
1146              instead of nested if statements and bitmaps. The  default  value
1147              is 9.
1148
1149       re2c:flags:case-ranges
1150              Same as --case-ranges command-line option.
1151
1152       re2c:flags:e, re2c:flags:ecb
1153              Same as -e --ecb command-line option.
1154
1155       re2c:flags:8, re2c:flags:utf-8
1156              Same as -8 --utf-8 command-line option.
1157
1158       re2c:flags:w, re2c:flags:wide-chars
1159              Same as -w --wide-chars command-line option.
1160
1161       re2c:flags:x, re2c:flags:utf-16
1162              Same as -x --utf-16 command-line option.
1163
1164       re2c:flags:u, re2c:flags:unicode
1165              Same as -u --unicode command-line option.
1166
1167       re2c:flags:encoding-policy
1168              Same as --encoding-policy command-line option.
1169
1170       re2c:flags:empty-class
1171              Same as --empty-class command-line option.
1172
1173       re2c:flags:case-insensitive
1174              Same as --case-insensitive command-line option.
1175
1176       re2c:flags:case-inverted
1177              Same as --case-inverted command-line option.
1178
1179       re2c:flags:i, re2c:flags:no-debug-info
1180              Same as -i --no-debug-info command-line option.
1181
1182       re2c:indent:string
1183              Specifies  the string to use for indentation.  The default value
1184              is "\t".  Indent string should contain only  whitespace  charac‐
1185              ters.   To  disable indentation entirely, set this configuration
1186              to empty string "".
1187
1188       re2c:indent:top
1189              Specifies the minimum amount of indentation to use.  The default
1190              value  is zero.  The value should be a non-negative integer num‐
1191              ber.
1192
1193       re2c:labelprefix
1194              Allows one to change the prefix of DFA state  labels.   The  de‐
1195              fault value is yy.
1196
1197       re2c:yych:emit
1198              Set  this to zero to suppress the generation of yych definition.
1199              Defaults to 1 (the definition is generated).
1200
1201       re2c:variable:yych
1202              Overrides the name of the yych variable.
1203
1204       re2c:yych:conversion
1205              If set to nonzero, re2c automatically generates a cast  to  YYC‐
1206              TYPE every time yych is read. Defaults to zero (no cast).
1207
1208       re2c:variable:yyaccept
1209              Overrides the name of the yyaccept variable.
1210
1211       re2c:variable:yytarget
1212              Overrides the name of the yytarget variable.
1213
1214       re2c:variable:yystable
1215              Deprecated.
1216
1217       re2c:variable:yyctable
1218              When  both  -c  --conditions and -g --computed-gotos are active,
1219              re2c will use this variable to generate a static jump table  for
1220              YYGETCONDITION.
1221
1222       re2c:define:YYDEBUG
1223              Defines YYDEBUG (see the user interface section).
1224
1225       re2c:flags:d, re2c:flags:debug-output
1226              Same as -d --debug-output command-line option.
1227
1228       re2c:flags:dfa-minimization
1229              Same as --dfa-minimization command-line option.
1230
1231       re2c:flags:eager-skip
1232              Same as --eager-skip command-line option.
1233

REGULAR EXPRESSIONS

1235       re2c uses the following syntax for regular expressions:
1236
1237       • "foo" case-sensitive string literal
1238
1239       • 'foo' case-insensitive string literal
1240
1241       • [a-xyz], [^a-xyz] character class (possibly negated)
1242
1243       • . any character except newline
1244
1245       • R \ S difference of character classes R and S
1246
1247       • R* zero or more occurrences of R
1248
1249       • R+ one or more occurrences of R
1250
1251       • R? optional R
1252
1253       • R{n} repetition of R exactly n times
1254
1255       • R{n,} repetition of R at least n times
1256
1257       • R{n,m} repetition of R from n to m times
1258
1259       • (R)  just  R;  parentheses  are  used  to  override precedence or for
1260         POSIX-style submatch
1261
1262       • R S concatenation: R followed by S
1263
1264       • R | S alternative: R or S
1265
1266       • R / S lookahead: R followed by S, but S is not consumed
1267
1268       • name the regular expression defined as name (or literal string "name"
1269         in Flex compatibility mode)
1270
1271       • {name}  the  regular expression defined as name in Flex compatibility
1272         mode
1273
1274       • @stag an s-tag: saves the last input position at which @stag  matches
1275         in a variable named stag
1276
1277       • #mtag an m-tag: saves all input positions at which #mtag matches in a
1278         variable named mtag
1279
1280       Character classes and string literals may contain the following  escape
1281       sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and hexa‐
1282       decimal escapes \xhh, \uhhhh and \Uhhhhhhhh.
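
       The syntax above can be illustrated with a few definitions.  All names
       in this fragment are made up for the example, and rules are omitted.

```c
/*!re2c
    digit     = [0-9];
    letter    = [a-zA-Z_];
    ident     = letter (letter | digit)*;
    real      = digit+ ("." digit+)?;
    hex       = "0x" [0-9a-fA-F]{1,8};
    keyword   = 'select';
    consonant = [a-z] \ [aeiou];
    line      = [^\n]* "\n";
*/
```

       Here keyword matches select, SELECT and any other mix of cases, while
       the "0x" prefix in hex is matched case-sensitively.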
1283

HANDLING THE END OF INPUT

1285       One of the main problems for the lexer is to know when to stop.   There
1286       are a few terminating conditions:
1287
1288       • the  lexer may match some rule (including default rule *) and come to
1289         a final state
1290
1291       • the lexer may fail to match any rule and come to a default state
1292
1293       • the lexer may reach the end of input
1294
1295       The first two conditions terminate the lexer in  a  "natural"  way:  it
1296       comes  to  a state with no outgoing transitions, and the matching auto‐
1297       matically stops.  The third condition, end of input, is  different:  it
1298       may  happen  in  any  state, and the lexer should be able to handle it.
1299       Checking for the end of input interrupts the normal lexer workflow  and
1300       adds  conditional  branches  to  the generated program, therefore it is
1301       necessary to minimize the number of such checks.  re2c supports  a  few
1302       different  methods for end of input handling.  Which one to use depends
1303       on the complexity of regular expressions, the need for buffering,  per‐
1304       formance considerations and other factors.  Here is a list of all meth‐
1305       ods:
1306
1307       • Sentinel character.  This method eliminates the need for the  end  of
1308         input  checks altogether.  It is simple and efficient, but limited to
1309         the case when there is a natural "sentinel" character that can  never
1310         occur  in valid input.  This character may still occur in invalid in‐
1311         put, but it is not allowed by the regular expressions, except perhaps
1312         as  the last character of a rule.  The sentinel character is appended
1313         at the end of input and serves as a stop signal: when the lexer reads
1314         it,  it  must be either the end of input, or a syntax error.  In both
1315         cases the lexer stops.  This method is used  if  YYFILL  is  disabled
1316         with re2c:yyfill:enable = 0; and re2c:eof has the default value -1.
1317
1318
1319
1320       • Sentinel  character  with  bounds checks.  This method is generic: it
1321         can handle any input without restrictions on the regular expressions.
1322         The  idea is to reduce the number of end of input checks
1323         by performing them only on certain characters.  Similar to the  "sen‐
1324         tinel  character"  method, one of the characters is chosen as a "sen‐
1325         tinel" and appended at the end of input.  However, there  is  no  re‐
1326         striction  on  where  the  sentinel character may occur (in fact, any
1327         character can be chosen for a sentinel).  When the lexer  reads  this
1328         character,  it  additionally performs a bounds check.  If the current
1329         position is within bounds, the lexer will resume matching and  handle
1330         the  sentinel  character  as a regular one.  Otherwise it will try to
1331         get more input with YYFILL (unless YYFILL is disabled).  If more  in‐
1332         put  is available, the lexer will rematch the last character and con‐
1333         tinue as if the sentinel never occurred.  Otherwise it  is  the  real
1334         end  of  input,  and  the  lexer  will  stop.  This method is used if
1335         re2c:eof has non-negative value (it should be set to the  ordinal  of
1336         the  sentinel  character).  YYFILL must be either defined or disabled
1337         with re2c:yyfill:enable = 0;.
1338
1339
1340
1341       • Bounds checks with padding.  This method is the default one.   It  is
1342         generic,  and  it is usually faster than the "sentinel character with
1343         bounds checks" method, but also more complex to use.  The idea is  to
1344         partition  the  underlying  finite-state automaton into strongly con‐
1345         nected components (SCCs), and generate only one bounds check per SCC,
1346         but  make  it  check for multiple characters at once (enough to cover
1347         the longest non-looping path in the SCC).  This way  the  checks  are
1348         less  frequent,  which  makes  the lexer run much faster.  If a check
1349         shows that there is not enough input, the lexer will  invoke  YYFILL,
1350         which may either supply enough input or else it should not return (in
1351         the latter case the lexer will stop).  This approach  has  a  problem
1352         with  matching  short  lexemes  at  the  end  of  input,  because the
1353         multi-character check requires enough characters to cover the longest
1354         possible  lexeme.   To  fix this problem, it is necessary to append a
1355         few fake characters at the end of input.  The padding should not form
1356         a  valid lexeme suffix to avoid fooling the lexer into matching it as
1357         part of the input.  The minimum sufficient length of padding  is  YY‐
1358         MAXFILL  and  it  is  autogenerated by re2c with /*!max:re2c*/.  This
1359         method is used if re2c:yyfill:enable has the default  nonzero  value,
1360         and re2c:eof has the default value -1.  YYFILL must be defined.
1361
1362
1363
1364       • Custom  methods with generic API.  Generic API allows overriding basic
1365         operations like reading a character, which makes it  possible  to
1366         include  the  end  of input checks as part of them.  Such methods are
1367         error-prone and should be used with caution, only  if  other  methods
1368         cannot  be  used.   These  methods are used if generic API is enabled
1369         with --input custom or re2c:flags:input = custom; and default  bounds
1370         checks  are disabled with re2c:yyfill:enable = 0;.  Note that the use
1371         of generic API does not imply the use of custom  methods,  it  merely
1372         allows it.
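
           How the first two methods differ from a naive check on every
           character can be sketched in hand-written C.  This is only an
           illustration under simplifying assumptions, not code generated by
           re2c; the function names skip_naive, skip_sentinel and skip_checked
           are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Naive approach: a bounds check on every character read.
 * Returns the length of the leading run of [a-z] in s[0..len). */
static size_t skip_naive(const char *s, size_t len)
{
    size_t i = 0;
    assert(s != NULL);
    while (i < len && s[i] >= 'a' && s[i] <= 'z') ++i;
    return i;
}

/* Sentinel character: s must be null-terminated. Null is not in [a-z],
 * so the loop stops at the sentinel without any bounds check. */
static size_t skip_sentinel(const char *s)
{
    size_t i = 0;
    assert(s != NULL);
    while (s[i] >= 'a' && s[i] <= 'z') ++i;
    return i;
}

/* Sentinel character with bounds checks: skips everything up to the first
 * space, and null is allowed inside the input. s[len] must be null. The
 * bounds check runs only when a null is read. */
static size_t skip_checked(const char *s, size_t len)
{
    size_t i = 0;
    assert(s != NULL && s[len] == 0);
    for (;;) {
        const char c = s[i];
        if (c == ' ') break;            /* end of the lexeme */
        if (c == 0 && i >= len) break;  /* real end of input */
        ++i;                            /* ordinary character or inner null */
    }
    return i;
}
```

           In skip_checked the comparison against len is reached only for null
           characters, so inputs that rarely contain nulls pay almost nothing
           for the check.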
1373
1374       The following subsections contain an example of each method.
1375
1376   Sentinel character
1377       In  this  example the lexer uses a sentinel character to handle the end
1378       of input.  The program counts space-separated words  in  a  null-termi‐
1379       nated  string.   Configuration  re2c:yyfill:enable  = 0; suppresses the
1380       generation of bounds checks and YYFILL invocations.  The sentinel char‐
1381       acter  is  null.  It is the last character of each input string, and it
1382       is not allowed in the middle of a lexeme by any of the rules  (in  par‐
1383       ticular,  it  is not included in the character ranges, where it is easy
1384       to overlook).  If a null occurs in the middle of a string, it is a syn‐
1385       tax  error  and  the lexer will match default rule *, but it won't read
1386       past the end of input or crash.  -Wsentinel-in-midrule warning verifies
1387       that  the  rules do not allow sentinel in the middle (it is possible to
1388       tell re2c which character is used as a sentinel with re2c:sentinel con‐
1389       figuration  ---  the default assumption is null, since this is the most
1390       common case).
1391
1392          // re2c $INPUT -o $OUTPUT
1393          #include <assert.h>
1394
1395          // expect a null-terminated string
1396          static int lex(const char *YYCURSOR)
1397          {
1398              int count = 0;
1399          loop:
1400              /*!re2c
1401              re2c:define:YYCTYPE = char;
1402              re2c:yyfill:enable = 0;
1403
1404              *      { return -1; }
1405              [\x00] { return count; }
1406              [a-z]+ { ++count; goto loop; }
1407              [ ]+   { goto loop; }
1408
1409              */
1410          }
1411
1412          int main()
1413          {
1414              assert(lex("") == 0);
1415              assert(lex("one two three") == 3);
1416              assert(lex("f0ur") == -1);
1417              return 0;
1418          }
1419
1420
1421   Sentinel character with bounds checks
1422       In this example the lexer uses sentinel character with bounds checks to
1423       handle  the  end  of input (this method was added in version 1.2).  The
1424       program counts single-quoted strings separated with spaces.   The  sen‐
1425       tinel character is null, which is specified with re2c:eof = 0; configu‐
1426       ration.  Null is the last character of each input string  ---  this  is
1427       essential to detect the end of input.  Null, as well as any other char‐
1428       acter, is allowed in the middle of a rule (for example, 'aaa\0aa'\0  is
1429       valid  input,  but 'aaa\0 is a syntax error).  Bounds checks are gener‐
1430       ated in each state that has a switch on an input character, in the con‐
1431       ditional  branch  that  corresponds to null (that branch may also cover
1432       other characters --- re2c does not split out a separate branch for sen‐
1433       tinel,  because  increasing the number of branches degrades performance
1434       more than bounds checks do).  Bounds checks are of the form YYLIMIT  <=
1435       YYCURSOR  or  YYLESSTHAN(1)  with  generic API.  If a bounds check suc‐
1436       ceeds, the lexer will continue matching.  If a bounds check fails,  the
1437       lexer  has reached the end of input, and it should stop.  In this exam‐
1438       ple YYFILL is disabled with re2c:yyfill:enable = 0; and the lexer  does
1439       not  attempt to get more input (see another example that uses YYFILL in
1440       the YYFILL with sentinel character section).  When the end of input has
1441       been  reached,  there  are  three possibilities: if the lexer is in the
1442       initial state, it will match the end of input rule $, otherwise it will
1443       either fall back to a previously matched rule (including default rule *)
1444       or go to a default state, causing -Wundefined-control-flow.
1445
1446          // re2c $INPUT -o $OUTPUT
1447          #include <assert.h>
1448
1449          // expect a null-terminated string
1450          static int lex(const char *str, unsigned int len)
1451          {
1452              const char *YYCURSOR = str, *YYLIMIT = str + len, *YYMARKER;
1453              int count = 0;
1454
1455          loop:
1456              /*!re2c
1457              re2c:define:YYCTYPE = char;
1458              re2c:yyfill:enable = 0;
1459              re2c:eof = 0;
1460
1461              *                           { return -1; }
1462              $                           { return count; }
1463              ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
1464              [ ]+                        { goto loop; }
1465
1466              */
1467          }
1468
1469          #define TEST(s, r) assert(lex(s, sizeof(s) - 1) == r)
1470          int main()
1471          {
1472              TEST("", 0);
1473              TEST("'qu\0tes' 'are' 'fine: \\'' ", 3);
1474              TEST("'unterminated\\'", -1);
1475              return 0;
1476          }
1477
1478
1479   Bounds checks with padding
1480       In this example the lexer uses bounds checking with padding  to  handle
1481       the  end  of input (it is the default method).  The program counts sin‐
1482       gle-quoted strings separated with spaces.  There is a padding of YYMAX‐
1483       FILL  null  characters  appended  at  the end of input, where YYMAXFILL
1484       value is autogenerated with /*!max:re2c*/ directive.  It is not  neces‐
1485       sary to use null for padding --- any characters can be used, as long as
1486       they do not form a valid lexeme suffix (in this example padding  should
1487       not  contain  single  quotes, as they may be mistaken for a suffix of a
1488       single-quoted string).  There is a "stop" rule that matches  the  first
1489       padding  character  (null) and terminates the lexer (it returns success
1490       only if it has matched at the beginning of padding, otherwise  a  stray
1491       null is a syntax error).  Bounds checks are generated only in some states
1492       that depend on the strongly connected components of the underlying  au‐
1493       tomaton.   They  are  of  the  form  (YYLIMIT  -  YYCURSOR)  < n or YY‐
1494       LESSTHAN(n) with generic API, where n is the minimum number of  charac‐
1495       ters  that  are needed for the lexer to proceed (it also means that the
1496       next bounds check will occur in at most n  characters).   If  a  bounds
1497       check  succeeds,  the  lexer will continue matching.  If a bounds check
1498       fails, the lexer has reached the end  of  input  and  will  invoke  YY‐
1499       FILL(n),  which should either supply at least n input characters, or it
1500       should not return.  In this example YYFILL always fails and  terminates
1501       the  lexer with an error.  This is fine, because in this example YYFILL
1502       can only be called when the lexer has advanced into the padding,  which
1503       means  that it has encountered an unterminated string and should return
1504       a syntax error.  See the YYFILL with padding  section  for  an  example
1505       that refills the input buffer with YYFILL.
1506
1507          // re2c $INPUT -o $OUTPUT
1508          #include <assert.h>
1509          #include <stdlib.h>
1510          #include <string.h>
1511
1512          /*!max:re2c*/
1513
1514          // expect YYMAXFILL-padded string
1515          static int lex(const char *str, unsigned int len)
1516          {
1517              const char *YYCURSOR = str, *YYLIMIT = str + len + YYMAXFILL;
1518              int count = 0;
1519
1520          loop:
1521              /*!re2c
1522              re2c:api:style = free-form;
1523              re2c:define:YYCTYPE = char;
1524              re2c:define:YYFILL = "return -1;";
1525
1526              *                           { return -1; }
1527              [\x00]                      { return YYCURSOR + YYMAXFILL - 1 == YYLIMIT ? count : -1; }
1528              ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
1529              [ ]+                        { goto loop; }
1530
1531              */
1532          }
1533
1534          // make a copy of the string with YYMAXFILL zeroes at the end
1535          static void test(const char *str, unsigned int len, int res)
1536          {
1537              char *s = (char*) malloc(len + YYMAXFILL);
1538              memcpy(s, str, len);
1539              memset(s + len, 0, YYMAXFILL);
1540              int r = lex(s, len);
1541              free(s);
1542              assert(r == res);
1543          }
1544
1545          #define TEST(s, r) test(s, sizeof(s) - 1, r)
1546          int main()
1547          {
1548              TEST("", 0);
1549              TEST("'qu\0tes' 'are' 'fine: \\'' ", 3);
1550              TEST("'unterminated\\'", -1);
1551              return 0;
1552          }
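
           Why the padding is needed can be seen in a small hand-written
           sketch.  It is an illustration only, not re2c output; match_abc and
           the padding length of 3 are hypothetical.  One bounds check covers
           the next three reads, so a short lexeme at the end of input can be
           examined only if the buffer is padded.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the input at cur starts with "abc", 0 if it does not, and -1
 * if fewer than 3 characters are available (a real lexer would invoke YYFILL
 * at this point). */
static int match_abc(const char *cur, const char *lim)
{
    assert(cur != NULL && cur <= lim);
    if (lim - cur < 3) return -1;  /* single check for three characters */
    return cur[0] == 'a' && cur[1] == 'b' && cur[2] == 'c';
}
```

           With the unpadded input ab the check fails before any character is
           read.  With three padding nulls appended the check passes, and the
           mismatch is found by ordinary matching.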
1553
1554
1555   Custom methods with generic API
1556       In  this  example  the lexer uses a custom end of input handling method
1557       based on generic API.  The program counts single-quoted  strings  sepa‐
1558       rated  with  spaces.   It  is  the  same as the sentinel character with
1559       bounds checks example, except that the input is not null-terminated (so
1560       this  method  can  be  used if it's not possible to have any padding at
1561       all, not even a single sentinel character).  To cover up  for  the  ab‐
1562       sence of sentinel character at the end of input, YYPEEK is redefined to
1563       perform a bounds check before it reads the next input character.   This
1564       is  inefficient, because checks are done very often.  If the check suc‐
1565       ceeds, YYPEEK returns the real character, otherwise it returns  a  fake
1566       sentinel character.
1567
1568          // re2c $INPUT -o $OUTPUT
1569          #include <assert.h>
1570          #include <stdlib.h>
1571          #include <string.h>
1572
1573          // expect a string without terminating null
1574          static int lex(const char *str, unsigned int len)
1575          {
1576              const char *cur = str, *lim = str + len, *mar;
1577              int count = 0;
1578
1579          loop:
1580              /*!re2c
1581              re2c:yyfill:enable = 0;
1582              re2c:eof = 0;
1583              re2c:flags:input = custom;
1584              re2c:api:style = free-form;
1585              re2c:define:YYCTYPE    = char;
1586              re2c:define:YYLESSTHAN = "cur >= lim";
1587              re2c:define:YYPEEK     = "cur < lim ? *cur : 0";  // fake null
1588              re2c:define:YYSKIP     = "++cur;";
1589              re2c:define:YYBACKUP   = "mar = cur;";
1590              re2c:define:YYRESTORE  = "cur = mar;";
1591
1592              *                           { return -1; }
1593              $                           { return count; }
1594              ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
1595              [ ]+                        { goto loop; }
1596
1597              */
1598          }
1599
1600          // make a copy of the string without terminating null
1601          static void test(const char *str, unsigned int len, int res)
1602          {
1603              char *s = (char*) malloc(len);
1604              memcpy(s, str, len);
1605              int r = lex(s, len);
1606              free(s);
1607              assert(r == res);
1608          }
1609
1610          #define TEST(s, r) test(s, sizeof(s) - 1, r)
1611          int main()
1612          {
1613              TEST("", 0);
1614              TEST("'qu\0tes' 'are' 'fine: \\'' ", 3);
1615              TEST("'unterminated\\'", -1);
1616              return 0;
1617          }
1618
1619

BUFFER REFILLING

1621       The need for buffering arises when the input cannot be mapped in memory
1622       all at once: either it is too large, or it comes in a streaming fashion
1623       (like  reading  from a socket). The usual technique in such cases is to
1624       allocate a fixed-sized memory buffer and process input in  chunks  that
1625       fit  into  the buffer. When the current chunk is processed, it is moved
1626       out and new data is moved in. In practice it is somewhat more  complex,
1627       because  lexer state consists not of a single input position, but a set
1628       of interrelated positions:
1629
1630       • cursor: the next input character to be read (YYCURSOR in default  API
1631         or YYSKIP/YYPEEK in generic API)
1632
1633       • limit: the position after the last available input character (YYLIMIT
1634         in default API, implicitly handled by YYLESSTHAN in generic API)
1635
1636       • marker: the position of the most recent match, if  any  (YYMARKER  in
1637         default API or YYBACKUP/YYRESTORE in generic API)
1638
1639       • token:  the  start of the current lexeme (implicit in re2c API, as it
1640         is not needed for the normal lexer operation and can be  defined  and
1641         updated by the user)
1642
1643       • context  marker: the position of the trailing context (YYCTXMARKER in
1644         default API or YYBACKUPCTX/YYRESTORECTX in generic API)
1645
1646       • tag variables: submatch positions (defined with  /*!stags:re2c*/  and
1647         /*!mtags:re2c*/  directives  and  YYSTAGP/YYSTAGN/YYMTAGP/YYMTAGN  in
1648         generic API)
1649
1650       Not all these are used in every case, but if used, they must be updated
1651       by  YYFILL.  All  active positions are contained in the segment between
1652       token and cursor, therefore everything between buffer start  and  token
1653       can  be  discarded,  the  segment  from token and up to limit should be
1654       moved to the beginning of buffer, and the free space at the end of buf‐
1655       fer  should be filled with new data.  In order to avoid frequent YYFILL
1656       calls it is best to fill in as many input characters as possible  (even
1657       though fewer characters might suffice to resume the lexer). The details
1658       of YYFILL implementation are slightly different depending on which  EOF
1659       handling  method is used: the case of EOF rule is somewhat simpler than
1660       the case  of  bounds-checking  with  padding.  Also  note  that  if  -f
1661       --storable-state  option  is used, YYFILL has slightly different
1662       semantics (described in the section about storable state).
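
           The shifting step described above can be sketched in isolation as
           follows.  This is a minimal sketch that leaves out reading new
           data; the Positions struct and shift_buffer are hypothetical names,
           and it assumes that all four tracked positions are in use and point
           into the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    char *buf;                    /* start of the buffer */
    char *tok, *mar, *cur, *lim;  /* token, marker, cursor, limit */
} Positions;

/* Discards [buf, tok), moves [tok, lim) to the start of the buffer and
 * shifts every position by the same amount. Returns the number of bytes
 * freed at the end of the buffer, which the caller fills with new data. */
static size_t shift_buffer(Positions *p)
{
    assert(p->buf <= p->tok && p->tok <= p->lim);
    const size_t n = (size_t)(p->tok - p->buf);
    memmove(p->buf, p->tok, (size_t)(p->lim - p->tok));
    p->tok -= n;
    p->mar -= n;
    p->cur -= n;
    p->lim -= n;
    return n;
}
```

           A real YYFILL implementation would also shift the context marker and
           tag variables if they are used, and would then read up to the
           returned number of bytes at the new limit position.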
1663
1664   YYFILL with sentinel character
1665       If EOF rule is used, YYFILL is a function-like primitive  that  accepts
1666       no  arguments and returns a value which is checked against zero. YYFILL
1667       invocation is triggered by condition YYLIMIT <= YYCURSOR in default API
1668       and YYLESSTHAN() in generic API. A non-zero return value means that YY‐
1669       FILL has failed. A successful YYFILL call  must  supply  at  least  one
1670       character  and adjust input positions accordingly. Limit must always be
1671       set to one after the last input position in buffer, and  the  character
1672       at the limit position must be the sentinel symbol specified by re2c:eof
1673       configuration. The pictures below show the relative locations of  input
1674       positions  in  buffer  before and after YYFILL call (sentinel symbol is
1675       marked with #, and the second picture shows the case when there is  not
1676       enough input to fill the whole buffer).
1677
1678                         <-- shift -->
1679                       >-A------------B---------C-------------D#-----------E->
1680                       buffer       token    marker         limit,
1681                                                            cursor
1682          >-A------------B---------C-------------D------------E#->
1683                       buffer,  marker        cursor        limit
1684                       token
1685
1686                         <-- shift -->
1687                       >-A------------B---------C-------------D#--E (EOF)
1688                       buffer       token    marker         limit,
1689                                                            cursor
1690          >-A------------B---------C-------------D---E#........
1691                       buffer,  marker       cursor limit
1692                       token
1693
1694       Here is an example of a program that reads the file input in chunks  of
1695       4096 bytes and uses the EOF rule.
1696
1697          // re2c $INPUT -o $OUTPUT
1698          #include <assert.h>
1699          #include <stdio.h>
1700          #include <string.h>
1701
1702          #define SIZE 4096
1703
1704          typedef struct {
1705              FILE *file;
1706              char buf[SIZE + 1], *lim, *cur, *mar, *tok;
1707              int eof;
1708          } Input;
1709
1710          static int fill(Input *in)
1711          {
1712              if (in->eof) {
1713                  return 1;
1714              }
1715              const size_t free = in->tok - in->buf;
1716              if (free < 1) {
1717                  return 2;
1718              }
1719              memmove(in->buf, in->tok, in->lim - in->tok);
1720              in->lim -= free;
1721              in->cur -= free;
1722              in->mar -= free;
1723              in->tok -= free;
1724              in->lim += fread(in->lim, 1, free, in->file);
1725              in->lim[0] = 0;
1726              in->eof |= in->lim < in->buf + SIZE;
1727              return 0;
1728          }
1729
1730          static void init(Input *in, FILE *file)
1731          {
1732              in->file = file;
1733              in->cur = in->mar = in->tok = in->lim = in->buf + SIZE;
1734              in->eof = 0;
1735              fill(in);
1736          }
1737
1738          static int lex(Input *in)
1739          {
1740              int count = 0;
1741          loop:
1742              in->tok = in->cur;
1743              /*!re2c
1744              re2c:eof = 0;
1745              re2c:api:style = free-form;
1746              re2c:define:YYCTYPE  = char;
1747              re2c:define:YYCURSOR = in->cur;
1748              re2c:define:YYMARKER = in->mar;
1749              re2c:define:YYLIMIT  = in->lim;
1750              re2c:define:YYFILL   = "fill(in) == 0";
1751
1752              *                           { return -1; }
1753              $                           { return count; }
1754              ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
1755              [ ]+                        { goto loop; }
1756
1757              */
1758          }
1759
1760          int main()
1761          {
1762              const char *fname = "input";
1763              const char str[] = "'qu\0tes' 'are' 'fine: \\'' ";
1764              FILE *f;
1765              Input in;
1766
1767              // prepare input file: a few times the size of the buffer,
1768              // containing strings with zeroes and escaped quotes
1769              f = fopen(fname, "w");
1770              for (int i = 0; i < SIZE; ++i) {
1771                  fwrite(str, 1, sizeof(str) - 1, f);
1772              }
1773              fclose(f);
1774
1775              f = fopen(fname, "r");
1776              init(&in, f);
1777              assert(lex(&in) == SIZE * 3);
1778              fclose(f);
1779
1780              remove(fname);
1781              return 0;
1782          }
1783
1784
1785   YYFILL with padding
1786       In the default case (when EOF rule is  not  used)  YYFILL  is  a  func‐
1787       tion-like  primitive that accepts a single argument and does not return
1788       any value.  YYFILL invocation is triggered by condition (YYLIMIT -  YY‐
1789       CURSOR)  < n in default API and YYLESSTHAN(n) in generic API. The argu‐
1790       ment passed to YYFILL is the minimal number of characters that must  be
1791       supplied.  If  it  fails  to do so, YYFILL must not return to the lexer
1792       (for that reason it is best implemented as a macro  that  returns  from
1793       the calling function on failure).  In case of a successful YYFILL invo‐
1794       cation the limit position must be set either to one after the last  in‐
1795       put position in buffer, or to the end of YYMAXFILL padding (in case YY‐
1796       FILL has successfully read at least n characters,  but  not  enough  to
1797       fill the entire buffer). The pictures below show the relative locations
1798       of input positions in buffer before and after YYFILL invocation (YYMAX‐
1799       FILL padding on the second picture is marked with # symbols).
1800
1801                         <-- shift -->                 <-- need -->
1802                       >-A------------B---------C-----D-------E---F--------G->
1803                       buffer       token    marker cursor  limit
1804
1805          >-A------------B---------C-----D-------E---F--------G->
1806                       buffer,  marker cursor               limit
1807                       token
1808
1809                         <-- shift -->                 <-- need -->
1810                       >-A------------B---------C-----D-------E-F        (EOF)
1811                       buffer       token    marker cursor  limit
1812
1813          >-A------------B---------C-----D-------E-F###############
1814                       buffer,  marker cursor                   limit
1815                       token                        <- YYMAXFILL ->
1816
1817       Here is an example of a program that reads the file input in chunks  of
1818       4096 bytes and uses bounds-checking with padding.
1819
1820          // re2c $INPUT -o $OUTPUT
1821          #include <assert.h>
1822          #include <stdio.h>
1823          #include <string.h>
1824
1825          /*!max:re2c*/
1826          #define SIZE 4096
1827
1828          typedef struct {
1829              FILE *file;
1830              char buf[SIZE + YYMAXFILL], *lim, *cur, *mar, *tok;
1831              int eof;
1832          } Input;
1833
1834          static int fill(Input *in, size_t need)
1835          {
1836              if (in->eof) {
1837                  return 1;
1838              }
1839              const size_t free = in->tok - in->buf;
1840              if (free < need) {
1841                  return 2;
1842              }
1843              memmove(in->buf, in->tok, in->lim - in->tok);
1844              in->lim -= free;
1845              in->cur -= free;
1846              in->mar -= free;
1847              in->tok -= free;
1848              in->lim += fread(in->lim, 1, free, in->file);
1849              if (in->lim < in->buf + SIZE) {
1850                  in->eof = 1;
1851                  memset(in->lim, 0, YYMAXFILL);
1852                  in->lim += YYMAXFILL;
1853              }
1854              return 0;
1855          }
1856
1857          static void init(Input *in, FILE *file)
1858          {
1859              in->file = file;
1860              in->cur = in->mar = in->tok = in->lim = in->buf + SIZE;
1861              in->eof = 0;
1862              fill(in, 1);
1863          }
1864
1865          static int lex(Input *in)
1866          {
1867              int count = 0;
1868          loop:
1869              in->tok = in->cur;
1870              /*!re2c
1871              re2c:api:style = free-form;
1872              re2c:define:YYCTYPE  = char;
1873              re2c:define:YYCURSOR = in->cur;
1874              re2c:define:YYMARKER = in->mar;
1875              re2c:define:YYLIMIT  = in->lim;
1876              re2c:define:YYFILL   = "if (fill(in, @@) != 0) return -1;";
1877
1878              *                           { return -1; }
1879              [\x00]                      { return (in->lim - in->cur == YYMAXFILL - 1) ? count : -1; }
1880              ['] ([^'\\] | [\\][^])* ['] { ++count; goto loop; }
1881              [ ]+                        { goto loop; }
1882
1883              */
1884          }
1885
1886          int main()
1887          {
1888              const char *fname = "input";
1889              const char str[] = "'qu\0tes' 'are' 'fine: \\'' ";
1890              FILE *f;
1891              Input in;
1892
1893              // prepare input file: a few times the size of the buffer,
1894              // containing strings with zeroes and escaped quotes
1895              f = fopen(fname, "w");
1896              for (int i = 0; i < SIZE; ++i) {
1897                  fwrite(str, 1, sizeof(str) - 1, f);
1898              }
1899              fclose(f);
1900
1901              f = fopen(fname, "r");
1902              init(&in, f);
1903              assert(lex(&in) == SIZE * 3);
1904              fclose(f);
1905
1906              remove(fname);
1907              return 0;
1908          }
1909
1910

INCLUDE FILES

1912       re2c allows one to include other files using directive  /*!include:re2c
1913       FILE  */, where FILE is the name of file to be included. re2c looks for
1914       included files in the directory of the including file  and  in  include
1915       locations,  which  can be specified with -I option.  Include directives
1916       in re2c work in the same way as C/C++ #include: the  contents  of  FILE
1917       are  copy-pasted  verbatim in place of the directive. Include files may
1918       have further includes of their own. Use --depfile option to track build
1919       dependencies  of  the output file on include files.  re2c provides some
1920       predefined include files that can be found in the include/ subdirectory
1921       of  the  project. These files contain definitions that can be useful to
1922       other projects (such as Unicode categories) and form something  like  a
1923       standard library for re2c.  Below is an example of using include direc‐
1924       tive.
1925
1926   Include file (definitions.h)
1927          typedef enum { OK, FAIL } Result;
1928
1929          /*!re2c
1930              number = [1-9][0-9]*;
1931          */
1932
1933
1934   Input file
1935          // re2c $INPUT -o $OUTPUT -i
1936          #include <assert.h>
1937          /*!include:re2c "definitions.h" */
1938
1939          Result lex(const char *YYCURSOR)
1940          {
1941              /*!re2c
1942              re2c:define:YYCTYPE = char;
1943              re2c:yyfill:enable = 0;
1944
1945              number { return OK; }
1946              *      { return FAIL; }
1947              */
1948          }
1949
1950          int main()
1951          {
1952              assert(lex("123") == OK);
1953              return 0;
1954          }
1955
1956

HEADER FILES

1958       re2c allows one to generate a header file from the input .re file using
1959       option  -t,  --type-header  or configuration re2c:flags:type-header and
1960       directives /*!header:re2c:on*/ and /*!header:re2c:off*/. The first  di‐
1961       rective  marks  the  beginning of header file, and the second directive
1962       marks the end of it. Everything between these directives  is  processed
1963       by re2c, and the generated code is written to the file specified by the
1964       -t --type-header option (or stdout if this option was not used).  Auto‐
1965       generated  header file may be needed in cases when re2c is used to gen‐
1966       erate definitions of constants, variables and structs that must be vis‐
1967       ible from other translation units.
1968
1969       Here is an example of generating a header file that contains definition
1970       of the lexer state with tag variables (the number of variables  depends
1971       on the regular grammar and is unknown to the programmer).
1972
1973   Input file
1974          // re2c $INPUT -o $OUTPUT -i --type-header src/lexer/lexer.h
1975          #include <assert.h>
1976          #include "src/lexer/lexer.h" // generated by re2c
1977
1978          /*!header:re2c:on*/
1979
1980          typedef struct {
1981              const char *str, *cur, *mar;
1982              /*!stags:re2c format = "const char *@@{tag}; "; */
1983          } LexerState;
1984
1985          /*!header:re2c:off*/
1986
1987          int lex(LexerState *st)
1988          {
1989              /*!re2c
1990              re2c:flags:type-header = "src/lexer/lexer.h";
1991              re2c:yyfill:enable = 0;
1992              re2c:flags:tags = 1;
1993              re2c:define:YYCTYPE  = char;
1994              re2c:define:YYCURSOR = "st->cur";
1995              re2c:define:YYMARKER = "st->mar";
1996              re2c:tags:expression = "st->@@{tag}";
1997
1998              [x]{1,4} / [x]{3,5} { return 0; } // ambiguous trailing context
1999              *                   { return 1; }
2000              */
2001          }
2002
2003          int main()
2004          {
2005              LexerState st;
2006              st.str = st.cur = "xxxxxxxx";
2007              assert(lex(&st) == 0 && st.cur - st.str == 4);
2008              return 0;
2009          }
2010
2011
2012   Header file
2013          /* Generated by re2c */
2014
2015
2016          typedef struct {
2017              const char *str, *cur, *mar;
2018              const char *yyt1; const char *yyt2; const char *yyt3;
2019          } LexerState;
2020
2021
2022

SUBMATCH EXTRACTION

2024       Re2c has two options for submatch extraction.
2025
2026       The  first option is -T --tags. With this option one can use standalone
2027       tags of the form @stag and #mtag, where stag  and  mtag  are  arbitrary
2028       user-defined  names.  Tags can be used anywhere inside of a regular ex‐
2029       pression; semantically they are just position markers. Tags of the form
2030       @stag  are called s-tags: they denote a single submatch value (the last
2031       input position where this tag matched). Tags  of  the  form  #mtag  are
2032       called  m-tags: they denote multiple submatch values (the whole history
2033       of repetitions of this tag).  All tags should be defined by the user as
2034       variables  with the corresponding names. With standalone tags re2c uses
2035       leftmost greedy disambiguation: submatch positions  correspond  to  the
2036       leftmost matching path through the regular expression.
2037
2038       The  second  option  is -P --posix-captures: it enables POSIX-compliant
2039       capturing groups. In this mode parentheses in regular  expressions  de‐
2040       note  the  beginning and the end of capturing groups; the whole regular
2041       expression is group number zero. The number of groups for the  matching
2042       rule  is stored in a variable yynmatch, and submatch results are stored
2043       in yypmatch array. Both yynmatch and yypmatch should be defined by  the
2044       user,  and yypmatch size must be at least [yynmatch * 2]. Re2c provides
2045       a directive /*!maxnmatch:re2c*/ that defines  YYMAXNMATCH:  a  constant
2046       equal  to the maximal value of yynmatch among all rules. Note that re2c
2047       implements POSIX-compliant disambiguation: each  subexpression  matches
2048       as  long  as possible, and subexpressions that start earlier in regular
2049       expression have priority over those starting  later.  Capturing  groups
2050       are  translated  into  s-tags under the hood, therefore we use the word
2051       "tag" to describe them as well.
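
       The layout of yypmatch described above can be sketched in plain C with‐
       out running re2c. The helpers group_len and fake_match are hypothetical
       and only illustrate the indexing used in this section: group  i  starts
       at  yypmatch[2 * i] and ends at yypmatch[2 * i + 1], and group zero is
       the whole match. The array is filled by hand here; in a real lexer it
       is filled by the code that re2c generates.

```c
#include <stddef.h>

/* Hypothetical helper (not part of re2c): length of capturing group i,
 * assuming group i occupies yypmatch[2 * i] (start) and
 * yypmatch[2 * i + 1] (end). */
size_t group_len(const char **yypmatch, size_t i)
{
    return (size_t)(yypmatch[2 * i + 1] - yypmatch[2 * i]);
}

/* Fills yypmatch by hand for the string "12.3", as if a rule of the form
 * ([0-9]+) [.] ([0-9]+) had matched it. Returns the number of groups
 * (the value that yynmatch would have). */
size_t fake_match(const char *s, const char **yypmatch)
{
    yypmatch[0] = s;     yypmatch[1] = s + 4; /* group 0: "12.3" */
    yypmatch[2] = s;     yypmatch[3] = s + 2; /* group 1: "12"   */
    yypmatch[4] = s + 3; yypmatch[5] = s + 4; /* group 2: "3"    */
    return 3;
}
```

       With three groups the array needs at least 3 * 2 = 6 elements,  which
       is the [yynmatch * 2] bound mentioned above.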
2052
2053       With both -P --posix-captures and -T --tags options re2c uses efficient
2054       submatch extraction algorithm described in the Tagged Deterministic Fi‐
2055       nite Automata with Lookahead paper. The overhead on submatch extraction
2056       in the generated lexer grows with the number of tags --- if this number
2057       is moderate, the overhead is barely noticeable. In the lexer  tags  are
2058       implemented using a number of tag variables generated by re2c. There is
2059       no one-to-one correspondence between tag variables and tags:  a  single
2060       variable may be reused for different tags, and one tag may require mul‐
2061       tiple variables to hold all its ambiguous values. Eventually  ambiguity
2062       is  resolved, and only one final variable per tag survives. When a rule
2063       matches, all its tags are set to the values of  the  corresponding  tag
2064       variables.   The  exact number of tag variables is unknown to the user;
2065       this number is determined by re2c. However, tag variables should be de‐
2066       fined  by  the user as a part of the lexer state and updated by YYFILL,
2067       therefore re2c provides directives /*!stags:re2c*/ and  /*!mtags:re2c*/
2068       that  can  be used to declare, initialize and manipulate tag variables.
2069       These directives have  two  optional  configurations:  format  =  "@@";
2070       (specifies  the  template where @@ is substituted with the name of each
2071       tag variable), and separator = ""; (specifies the piece of code used to
2072       join the generated pieces for different tag variables).
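
       To make the effect of format and separator concrete, here is  a  plain
       C  sketch  that  mimics  the substitution. The function expand and the
       variable names yyt1 and yyt2 passed to it are  illustrative  assump‐
       tions; the real expansion is performed by re2c at generation time, and
       the actual names and number of tag variables are chosen by re2c.

```c
#include <string.h>

/* Hypothetical model of the expansion: for every tag variable name, copy
 * `format` with each "@@" replaced by the name, and join the pieces with
 * `separator`. Returns 0 on success, -1 if `out` is too small. */
int expand(const char *format, const char *separator,
           const char **names, size_t count, char *out, size_t size)
{
    size_t used = 0;
    if (size == 0) return -1;
    out[0] = 0;
    for (size_t i = 0; i < count; ++i) {
        const char *p = format;
        if (i > 0) {
            /* The separator goes between pieces, not after the last one. */
            size_t n = strlen(separator);
            if (used + n >= size) return -1;
            memcpy(out + used, separator, n);
            used += n;
        }
        while (*p) {
            if (p[0] == '@' && p[1] == '@') {
                size_t n = strlen(names[i]);
                if (used + n >= size) return -1;
                memcpy(out + used, names[i], n);
                used += n;
                p += 2;
            } else {
                if (used + 1 >= size) return -1;
                out[used++] = *p++;
            }
        }
        out[used] = 0; /* keep the result terminated after every piece */
    }
    return 0;
}
```

       With format "const char *@@;", separator " " and the  names  yyt1  and
       yyt2 this sketch produces: const char *yyt1; const char *yyt2;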
2073
2074       S-tags support the following operations:
2075
2076       • save  input  position to an s-tag: t = YYCURSOR with default API or a
2077         user-defined operation YYSTAGP(t) with generic API
2078
2079       • save default value to an s-tag: t  =  NULL  with  default  API  or  a
2080         user-defined operation YYSTAGN(t) with generic API
2081
2082       • copy one s-tag to another: t1 = t2
2083
2084       M-tags support the following operations:
2085
2086       • append  input  position  to  an  m-tag: a user-defined operation YYM‐
2087         TAGP(t) with both default and generic API
2088
2089       • append default value to an m-tag: a user-defined operation YYMTAGN(t)
2090         with both default and generic API
2091
2092       • copy one m-tag to another: t1 = t2
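
       The s-tag operations listed above can be sketched as follows  for  the
       generic  API, assuming that tags are stored as offsets rather than
       pointers. The Lexer struct, the NO_TAG marker and the function
       value_offset are hypothetical; only the names YYSTAGP and YYSTAGN come
       from re2c, and their bodies are always supplied by the user. The loop
       is written by hand in place of generated code.

```c
#include <stddef.h>

/* Hypothetical lexer state: input is addressed by offset, not by pointer. */
typedef struct {
    const char *str; /* start of input (null-terminated) */
    size_t cur;      /* current input position (offset into str) */
} Lexer;

/* Assumed "no value" marker for a tag stored as an offset. */
#define NO_TAG ((size_t)-1)

/* User-defined s-tag operations; `lx` is assumed to be in scope. */
#define YYSTAGP(t) ((t) = lx->cur) /* save input position to an s-tag */
#define YYSTAGN(t) ((t) = NO_TAG)  /* save default value to an s-tag  */

/* Walks over "key=value" by hand and records where the value starts, the
 * way a lexer with a tag placed right after '=' would. */
size_t value_offset(Lexer *lx)
{
    size_t t1, t2;
    YYSTAGN(t1);                  /* the tag has no value yet */
    while (lx->str[lx->cur] != 0) {
        if (lx->str[lx->cur++] == '=') {
            YYSTAGP(t1);          /* position right after '=' */
            break;
        }
    }
    t2 = t1;                      /* copy one s-tag to another */
    return t2;
}
```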
2093
2094       S-tags  can  be  implemented  as  scalar  values (pointers or offsets).
2095       M-tags need a more complex representation, as they need to store a  se‐
2096       quence  of tag values. The most naive and inefficient representation of
2097       an m-tag is a list (array, vector) of tag values; a more efficient rep‐
2098       resentation  is to store all m-tags in a prefix-tree represented as ar‐
2099       ray of nodes (v, p), where v is tag value and p is a pointer to  parent
2100       node.
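
       The prefix-tree representation can be sketched in plain C as  follows.
       This  is  a  minimal illustration with a fixed-size array; the names
       Node, Tree, mtag_append and mtag_history are hypothetical. An m-tag is
       just  the  index of the latest node of its history, so copying one
       m-tag to another is a plain integer assignment, and histories that
       share a prefix share the corresponding nodes.

```c
#include <stddef.h>

#define MAX_NODES 64
#define ROOT (-1)

/* One node (v, p) of the tree: v is the tag value, p is the index of the
 * parent node (ROOT if there is none). */
typedef struct { const char *v; int p; } Node;

typedef struct { Node nodes[MAX_NODES]; int count; } Tree;

/* Appends value v to the history of m-tag *t and makes *t point at the
 * new node. Returns 0 on success, -1 if the tree is full. */
int mtag_append(Tree *tree, int *t, const char *v)
{
    if (tree->count >= MAX_NODES) return -1;
    tree->nodes[tree->count].v = v;
    tree->nodes[tree->count].p = *t;
    *t = tree->count++;
    return 0;
}

/* Writes the history of m-tag t into out[] in oldest-first order.
 * Returns its length, or -1 if it does not fit into max elements. */
int mtag_history(const Tree *tree, int t, const char **out, int max)
{
    int len = 0, i, k;
    for (i = t; i != ROOT; i = tree->nodes[i].p) ++len;
    if (len > max) return -1;
    /* Walk from the newest node to the root, filling out[] backwards. */
    for (i = t, k = len; i != ROOT; i = tree->nodes[i].p) {
        out[--k] = tree->nodes[i].v;
    }
    return len;
}
```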
2101
2102       Here  is a simple example of using s-tags to parse an IPv4 address (see
2103       below for a more complex example that uses YYFILL).
2104
2105          // re2c $INPUT -o $OUTPUT
2106          #include <assert.h>
2107          #include <stdint.h>
2108
2109          static uint32_t num(const char *s, const char *e)
2110          {
2111              uint32_t n = 0;
2112              for (; s < e; ++s) n = n * 10 + (*s - '0');
2113              return n;
2114          }
2115
2116          static const uint64_t ERROR = ~(uint64_t)0; // never a valid 32-bit address
2117
2118          static uint64_t lex(const char *YYCURSOR)
2119          {
2120              const char *YYMARKER, *o1, *o2, *o3, *o4;
2121              /*!stags:re2c format = 'const char *@@;'; */
2122
2123              /*!re2c
2124              re2c:yyfill:enable = 0;
2125              re2c:flags:tags = 1;
2126              re2c:define:YYCTYPE = char;
2127
2128              octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
2129              dot = [.];
2130              end = [\x00];
2131
2132              @o1 octet dot @o2 octet dot @o3 octet dot @o4 octet end {
2133                  return num(o4, YYCURSOR - 1)
2134                      + (num(o3, o4 - 1) << 8)
2135                      + (num(o2, o3 - 1) << 16)
2136                      + (num(o1, o2 - 1) << 24);
2137              }
2138              * { return ERROR; }
2139              */
2140          }
2141
2142          int main()
2143          {
2144              assert(lex("1.2.3.4") == 0x01020304);
2145              assert(lex("127.0.0.1") == 0x7f000001);
2146              assert(lex("255.255.255.255") == 0xffffffff);
2147              assert(lex("1.2.3.") == ERROR);
2148              assert(lex("1.2.3.256") == ERROR);
2149              return 0;
2150          }
2151
2152
2153       Here is a more complex example of using s-tags with YYFILL to  parse  a
2154       file  with  IPv4  addresses. Tag variables are part of the lexer state,
2155       and they are adjusted in YYFILL like other input positions.  Note  that
2156       this  is  necessary for s-tags because their values are invalidated after
2157       shifting buffer contents. It may not be necessary in a custom implemen‐
2158       tation  where  tag variables store offsets relative to the start of the
2159       input string rather than buffer, which may be the case with m-tags.
2160
2161          // re2c $INPUT -o $OUTPUT --tags
2162          #include <assert.h>
2163          #include <stdint.h>
2164          #include <stdio.h>
2165          #include <string.h>
2166          #include <vector>
2167
2168          #define SIZE 4096
2169
2170          typedef struct {
2171              FILE *file;
2172              char buf[SIZE + 1], *lim, *cur, *mar, *tok;
2173              // Tag variables must be part of the lexer state passed to YYFILL.
2174              // They don't correspond to tags and should be autogenerated by re2c.
2175              /*!stags:re2c format = 'const char *@@;'; */
2176              int eof;
2177          } Input;
2178
2179          static int fill(Input *in)
2180          {
2181              if (in->eof) return 1;
2182
2183              const size_t free = in->tok - in->buf;
2184              if (free < 1) return 2;
2185
2186              memmove(in->buf, in->tok, in->lim - in->tok);
2187
2188              in->lim -= free;
2189              in->cur -= free;
2190              in->mar -= free;
2191              in->tok -= free;
2192              // Tag variables need to be shifted like other input positions. The check
2193              // for non-NULL is only needed if some tags are nested inside of alternative
2194              // or repetition, so that they can have NULL value.
2195              /*!stags:re2c format = "if (in->@@) in->@@ -= free;"; */
2196
2197              in->lim += fread(in->lim, 1, free, in->file);
2198              in->lim[0] = 0;
2199              in->eof |= in->lim < in->buf + SIZE;
2200
2201              return 0;
2202          }
2203
2204          static void init(Input *in, FILE *file)
2205          {
2206              in->file = file;
2207              in->cur = in->mar = in->tok = in->lim = in->buf + SIZE;
2208              // Initialization is only needed to avoid "use of uninitialized" warnings
2209              // when shifting tags in YYFILL. In the lexer tags are guaranteed to be
2210              // set before they are used (either to a valid input position, or NULL).
2211              /*!stags:re2c format = "in->@@ = in->lim;"; */
2212              in->eof = 0;
2213              fill(in);
2214          }
2215
2216          static uint32_t num(const char *s, const char *e)
2217          {
2218              uint32_t n = 0;
2219              for (; s < e; ++s) n = n * 10 + (*s - '0');
2220              return n;
2221          }
2222
2223          static bool lex(Input *in, std::vector<uint32_t> &ips)
2224          {
2225              // User-defined local variables that store final tag values.
2226              // They are different from tag variables autogenerated with /*!stags:re2c*/,
2227              // as they are set at the end of match and used only in semantic actions.
2228              const char *o1, *o2, *o3, *o4;
2229          loop:
2230              in->tok = in->cur;
2231              /*!re2c
2232              re2c:eof = 0;
2233              re2c:api:style = free-form;
2234              re2c:define:YYCTYPE  = char;
2235              re2c:define:YYCURSOR = in->cur;
2236              re2c:define:YYMARKER = in->mar;
2237              re2c:define:YYLIMIT  = in->lim;
2238              re2c:define:YYFILL   = "fill(in) == 0";
2239
2240              // The way tag variables are accessed from the lexer (not needed if tag
2241              // variables are defined as local variables).
2242              re2c:tags:expression = "in->@@";
2243
2244              octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
2245              dot = [.];
2246              eol = [\n];
2247
2248              @o1 octet dot @o2 octet dot @o3 octet dot @o4 octet eol {
2249                  ips.push_back(num(o4, in->cur - 1)
2250                      + (num(o3, o4 - 1) << 8)
2251                      + (num(o2, o3 - 1) << 16)
2252                      + (num(o1, o2 - 1) << 24));
2253                  goto loop;
2254              }
2255              $ { return true; }
2256              * { return false; }
2257              */
2258          }
2259
2260          int main()
2261          {
2262              const char *fname = "input";
2263              FILE *f;
2264              Input in;
2265              std::vector<uint32_t> have, want;
2266
2267              // Write a few IPv4 addresses to the input file and save them to compare
2268              // against parse results.
2269              f = fopen(fname, "w");
2270              for (int i = 0; i < 256; ++i) {
2271                  fprintf(f, "%d.%d.%d.%d\n", i, i, i, i);
2272                  want.push_back(i + (i << 8) + (i << 16) + (i << 24));
2273              }
2274              fclose(f);
2275
2276              f = fopen(fname, "r");
2277              init(&in, f);
2278
2279              assert(lex(&in, have) && have == want);
2280
2281              fclose(f);
2282              remove(fname);
2283              return 0;
2284          }
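
       The difference between pointer tags and offset tags mentioned  before
       this  example can be sketched without re2c. Both functions below are
       hypothetical. The first one shifts the buffer the way fill does and
       adjusts a pointer tag together with the other positions. The second
       one shows why a tag stored as an offset from the start of the whole
       input needs no adjustment: it is converted to a buffer pointer only
       when it is used.

```c
#include <stddef.h>
#include <string.h>

/* Moves the unprocessed part of the buffer (from *tok up to *lim) to the
 * start of buf and adjusts a pointer tag in the same way. A NULL tag (no
 * value) is left untouched. Returns the shift in bytes. */
size_t shift_buffer(char *buf, char **tok, char **lim, const char **tag)
{
    const size_t shift = (size_t)(*tok - buf);
    memmove(buf, *tok, (size_t)(*lim - *tok));
    *tok -= shift;
    *lim -= shift;
    if (*tag) *tag -= shift;
    return shift;
}

/* Alternative: a tag stored as an offset from the start of the input stays
 * valid across shifts. `consumed` is the total number of input bytes that
 * have already been shifted out of buf. */
const char *tag_to_ptr(const char *buf, size_t consumed, size_t tag_offset)
{
    return buf + (tag_offset - consumed);
}
```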
2285
2286
2287       Here is an example of using POSIX capturing groups to parse an IPv4 ad‐
2288       dress.
2289
2290          // re2c $INPUT -o $OUTPUT
2291          #include <assert.h>
2292          #include <stdint.h>
2293
2294          static uint32_t num(const char *s, const char *e)
2295          {
2296              uint32_t n = 0;
2297              for (; s < e; ++s) n = n * 10 + (*s - '0');
2298              return n;
2299          }
2300
2301          /*!maxnmatch:re2c*/
2302          static const uint64_t ERROR = ~(uint64_t)0; // never a valid 32-bit address
2303
2304          static uint64_t lex(const char *YYCURSOR)
2305          {
2306              const char *YYMARKER;
2307              const char *yypmatch[YYMAXNMATCH * 2];
2308              uint32_t yynmatch;
2309              /*!stags:re2c format = 'const char *@@;'; */
2310
2311              /*!re2c
2312              re2c:yyfill:enable = 0;
2313              re2c:flags:posix-captures = 1;
2314              re2c:define:YYCTYPE = char;
2315
2316              octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
2317              dot = [.];
2318              end = [\x00];
2319
2320              (octet) dot (octet) dot (octet) dot (octet) end {
2321                  assert(yynmatch == 5);
2322                  return num(yypmatch[8], yypmatch[9])
2323                      + (num(yypmatch[6], yypmatch[7]) << 8)
2324                      + (num(yypmatch[4], yypmatch[5]) << 16)
2325                      + (num(yypmatch[2], yypmatch[3]) << 24);
2326              }
2327              * { return ERROR; }
2328              */
2329          }
2330
2331          int main()
2332          {
2333              assert(lex("1.2.3.4") == 0x01020304);
2334              assert(lex("127.0.0.1") == 0x7f000001);
2335              assert(lex("255.255.255.255") == 0xffffffff);
2336              assert(lex("1.2.3.") == ERROR);
2337              assert(lex("1.2.3.256") == ERROR);
2338              return 0;
2339          }
2340
2341
2342       Here  is  an example of using m-tags to parse a semicolon-separated se‐
2343       quence of words (C++). Tag variables are  stored  in  a  tree  that  is
2344       packed in a vector.
2345
2346          // re2c $INPUT -o $OUTPUT
2347          #include <assert.h>
2348          #include <vector>
2349          #include <string>
2350
2351          static const int ROOT = -1;
2352
2353          struct Mtag {
2354              int pred;
2355              const char *tag;
2356          };
2357
2358          typedef std::vector<Mtag> MtagTree;
2359          typedef std::vector<std::string> Words;
2360
2361          static void mtag(int *pt, const char *t, MtagTree *tree)
2362          {
2363              Mtag m = {*pt, t};
2364              *pt = (int)tree->size();
2365              tree->push_back(m);
2366          }
2367
2368          static void unfold(const MtagTree &tree, int x, int y, Words &words)
2369          {
2370              if (x == ROOT) return;
2371              unfold(tree, tree[x].pred, tree[y].pred, words);
2372              const char *px = tree[x].tag, *py = tree[y].tag;
2373              words.push_back(std::string(px, py - px));
2374          }
2375
2376          #define YYMTAGP(t) mtag(&t, YYCURSOR, &tree)
2377          #define YYMTAGN(t) mtag(&t, NULL,     &tree)
2378          static bool lex(const char *YYCURSOR, Words &words)
2379          {
2380              const char *YYMARKER;
2381              /*!mtags:re2c format = "int @@ = ROOT;"; */
2382              MtagTree tree;
2383              int x, y;
2384
2385              /*!re2c
2386              re2c:yyfill:enable = 0;
2387              re2c:flags:tags = 1;
2388              re2c:define:YYCTYPE = char;
2389
2390              (#x [a-z]+ #y [;])+ {
2391                  words.clear();
2392                  unfold(tree, x, y, words);
2393                  return true;
2394              }
2395              * { return false; }
2396              */
2397          }
2398
2399          int main()
2400          {
2401              Words w;
2402              assert(lex("one;two;three;", w) && w == Words({"one", "two", "three"}));
2403              return 0;
2404          }
2405
2406

STORABLE STATE

2408       With  -f  --storable-state option re2c generates a lexer that can store
2409       its current state, return to the caller, and  later  resume  operations
2410       exactly  where  it left off. The default mode of operation in re2c is a
2411       "pull" model, in which the lexer "pulls" more input whenever  it  needs
2412       it.  This may be unacceptable in cases when the input becomes available
2413       piece by piece (for example, if the lexer is invoked by the parser,  or
2414       if the lexer program communicates via a socket protocol with some other
2415       program that must wait for a reply from the lexer before  it  transmits
2416       the  next message). Storable state feature is intended exactly for such
2417       cases: it allows one to generate lexers that work in  a  "push"  model.
2418       When the lexer needs more input, it stores its state and returns to the
2419       caller. Later, when more input becomes available,  the  caller  resumes
2420       the  lexer  exactly where it stopped. There are a few changes necessary
2421       compared to the "pull" model:
2422
2423       • Define YYGETSTATE() and YYSETSTATE(state) primitives.
2424
2425       • Define yych, yyaccept and state variables as  a  part  of  persistent
2426         lexer state. The state variable should be initialized to -1.
2427
2428       • YYFILL should return to the outer program instead of trying to supply
2429         more input. Return code should indicate that lexer needs more input.
2430
2431       • The outer program should recognize situations when lexer  needs  more
2432         input and respond appropriately.
2433
2434       • Use  /*!getstate:re2c*/  directive  if it is necessary to execute any
2435         code before entering the lexer.
2436
2437       • Use configurations state:abort and state:nextlabel to  further  tweak
2438         the generated code.
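
       The mechanics behind these changes can be sketched with a  hand-written
       resumable lexer in plain C. This is not code generated by re2c: the
       Lexer struct, the Status values and the labels are hypothetical, and
       for brevity each chunk of input is a separate buffer, so no buffer
       shifting is needed. The sketch shows the three essential points: the
       state variable starts at -1, the lexer stores its state before re‐
       turning to the caller, and on re-entry it dispatches on the stored
       state.

```c
#include <stddef.h>

typedef enum { NEED_MORE, BAD_INPUT } Status;

/* Hypothetical persistent lexer state. It survives between calls, so the
 * lexer can resume in the middle of a lexeme. */
typedef struct {
    const char *cur, *lim; /* current position and end of available input */
    int state;             /* -1 means "start from the beginning" */
    unsigned count;        /* number of complete packets seen so far */
} Lexer;

void lexer_init(Lexer *lx)
{
    lx->cur = lx->lim = NULL;
    lx->state = -1;
    lx->count = 0;
}

/* The caller "pushes" a new chunk of input and resumes the lexer. Packets
 * have the form [a-z]+[;]. When the chunk is exhausted, the lexer records
 * where it stopped in lx->state and returns NEED_MORE. */
Status lex(Lexer *lx, const char *chunk, size_t len)
{
    lx->cur = chunk;
    lx->lim = chunk + len;

    switch (lx->state) { /* plays the role of the state dispatch */
    default: break;      /* -1: initial entry */
    case 0: goto packet_start;
    case 1: goto packet_body;
    }

packet_start:
    if (lx->cur == lx->lim) { lx->state = 0; return NEED_MORE; }
    if (*lx->cur < 'a' || *lx->cur > 'z') return BAD_INPUT;
    ++lx->cur;

packet_body:
    for (;;) {
        if (lx->cur == lx->lim) { lx->state = 1; return NEED_MORE; }
        if (*lx->cur == ';') { ++lx->cur; ++lx->count; goto packet_start; }
        if (*lx->cur < 'a' || *lx->cur > 'z') return BAD_INPUT;
        ++lx->cur;
    }
}
```

       A packet split across two chunks, such as "tw" followed by "o;",  is
       counted once: the first call stops in state 1 and the second call
       resumes inside the packet body.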
2439
2440       Here is an example of a "push"-model lexer that reads packets  from  a
2441       file that simulates a pipe. A packet is a sequence of lowercase letters
2442       followed by a semicolon. The test function plays the role of the outer
2443       program: it calls the lexer in a loop, and every time the lexer returns
2444       WAITING it writes the next packet (if there are any left) to the file
2445       and calls fill to read more input into the buffer. The lexer terminates
2446       successfully with END when it matches the EOF rule, in which case the
2447       number of received packets must be equal to the number of sent packets.
2448       The EOF rule matches only if the lexer is in the initial state and fill
2449       has not provided any new input. Abnormal termination happens in case of
2450       a syntax error (BAD_PACKET) or in case the buffer is too small to hold
2451       a lexeme (BIG_PACKET), for example if one of the packets exceeds the
2452       buffer size. Note that YYFILL does not read any input itself: it only
2453       returns WAITING to the outer program after the lexer has saved its
2454       state with YYSETSTATE.
2459
2460   Example
2461          // re2c $INPUT -o $OUTPUT -f
2462          #include <assert.h>
2463          #include <stdio.h>
2464          #include <string.h>
2465
2466          #define DEBUG    0
2467          #define LOG(...) if (DEBUG) fprintf(stderr, __VA_ARGS__);
2468          #define BUFSIZE  10
2469
2470          typedef struct {
2471              FILE *file;
2472              char buf[BUFSIZE + 1], *lim, *cur, *mar, *tok;
2473              unsigned yyaccept;
2474              int state;
2475          } Input;
2476
2477          static void init(Input *in, FILE *f)
2478          {
2479              in->file = f;
2480              in->cur = in->mar = in->tok = in->lim = in->buf + BUFSIZE;
2481              in->lim[0] = 0; // append sentinel symbol
2482              in->yyaccept = 0;
2483              in->state = -1;
2484          }
2485
2486          typedef enum {END, READY, WAITING, BAD_PACKET, BIG_PACKET} Status;
2487
2488          static Status fill(Input *in)
2489          {
2490              const size_t shift = in->tok - in->buf;
2491              const size_t free = BUFSIZE - (in->lim - in->tok);
2492
2493              if (free < 1) return BIG_PACKET;
2494
2495              memmove(in->buf, in->tok, BUFSIZE - shift);
2496              in->lim -= shift;
2497              in->cur -= shift;
2498              in->mar -= shift;
2499              in->tok -= shift;
2500
2501              const size_t read = fread(in->lim, 1, free, in->file);
2502              in->lim += read;
2503              in->lim[0] = 0; // append sentinel symbol
2504
2505              return READY;
2506          }
2507
2508          static Status lex(Input *in, unsigned int *recv)
2509          {
2510              char yych;
2511              /*!getstate:re2c*/
2512          loop:
2513              in->tok = in->cur;
2514              /*!re2c
2515                  re2c:eof = 0;
2516                  re2c:api:style = free-form;
2517                  re2c:define:YYCTYPE    = "char";
2518                  re2c:define:YYCURSOR   = "in->cur";
2519                  re2c:define:YYMARKER   = "in->mar";
2520                  re2c:define:YYLIMIT    = "in->lim";
2521                  re2c:define:YYGETSTATE = "in->state";
2522                  re2c:define:YYSETSTATE = "in->state = @@;";
2523                  re2c:define:YYFILL     = "return WAITING;";
2524
2525                  packet = [a-z]+[;];
2526
2527                  *      { return BAD_PACKET; }
2528                  $      { return END; }
2529                  packet { *recv = *recv + 1; goto loop; }
2530              */
2531          }
2532
2533          void test(const char **packets, Status status)
2534          {
2535              const char *fname = "pipe";
2536              FILE *fw = fopen(fname, "w");
2537              FILE *fr = fopen(fname, "r");
2538              setvbuf(fw, NULL, _IONBF, 0);
2539              setvbuf(fr, NULL, _IONBF, 0);
2540
2541              Input in;
2542              init(&in, fr);
2543              Status st;
2544              unsigned int send = 0, recv = 0;
2545
2546              for (;;) {
2547                  st = lex(&in, &recv);
2548                  if (st == END) {
2549                      LOG("done: got %u packets\n", recv);
2550                      break;
2551                  } else if (st == WAITING) {
2552                      LOG("waiting...\n");
2553                      if (*packets) {
2554                          LOG("sent packet %u\n", send);
2555                          fprintf(fw, "%s", *packets++);
2556                          ++send;
2557                      }
2558                      st = fill(&in);
2559                      LOG("queue: '%s'\n", in.buf);
2560                      if (st == BIG_PACKET) {
2561                          LOG("error: packet too big\n");
2562                          break;
2563                      }
2564                      assert(st == READY);
2565                  } else {
2566                      assert(st == BAD_PACKET);
2567                      LOG("error: ill-formed packet\n");
2568                      break;
2569                  }
2570              }
2571
2572              LOG("\n");
2573              assert(st == status);
2574              if (st == END) assert(recv == send);
2575
2576              fclose(fw);
2577              fclose(fr);
2578              remove(fname);
2579          }
2580
2581          int main()
2582          {
2583              const char *packets1[] = {0};
2584              const char *packets2[] = {"zero;", "one;", "two;", "three;", "four;", 0};
2585              const char *packets3[] = {"zer0;", 0};
2586              const char *packets4[] = {"goooooooooogle;", 0};
2587
2588              test(packets1, END);
2589              test(packets2, END);
2590              test(packets3, BAD_PACKET);
2591              test(packets4, BIG_PACKET);
2592
2593              return 0;
2594          }
2595
2596

REUSABLE BLOCKS

2598       Reuse  mode is enabled with the -r --reusable option. In this mode re2c
2599       allows one to reuse definitions, configurations and rules specified  by
2600       a  /*!rules:re2c*/  block  in  subsequent  /*!use:re2c*/  blocks. As of
2601       re2c-1.2 it is possible  to  mix  such  blocks  with  normal  /*!re2c*/
2602       blocks; prior to that re2c expected a single rules-block  followed  by
2603       use-blocks (normal blocks were disallowed). Use-blocks can have  addi‐
2604       tional definitions, configurations and rules: they are merged with those
2605       specified by the rules-block.  A very common use case for -r --reusable
2606       option  is  a lexer that supports multiple input encodings: lexer rules
2607       are defined once and reused multiple times with encoding-specific  con‐
2608       figurations, such as re2c:flags:utf-8.
2609
2610       Below  is  an example of a multi-encoding lexer: it reads a phrase with
2611       Unicode math symbols and accepts input either in UTF8 or in UTF32. Note
2612       that  the  --input-encoding utf8 option allows us to write UTF8-encoded
2613       symbols in the regular expressions;  without  this  option  re2c  would
2614       parse  them  as  a  plain  ASCII byte sequence (and we would have to use
2615       hexadecimal escape sequences).
2616
2617   Example
2618          // re2c $INPUT -o $OUTPUT -r --input-encoding utf8
2619          #include <assert.h>
2620          #include <stdint.h>
2621
2622          /*!rules:re2c
2623              re2c:yyfill:enable = 0;
2624
2625              "∀x ∃y: p(x, y)" { return 0; }
2626              *                { return 1; }
2627          */
2628
2629          static int lex_utf8(const uint8_t *YYCURSOR)
2630          {
2631              const uint8_t *YYMARKER;
2632              /*!use:re2c
2633              re2c:define:YYCTYPE = uint8_t;
2634              re2c:flags:8 = 1;
2635              */
2636          }
2637
2638          static int lex_utf32(const uint32_t *YYCURSOR)
2639          {
2640              const uint32_t *YYMARKER;
2641              /*!use:re2c
2642              re2c:define:YYCTYPE = uint32_t;
2643              re2c:flags:8 = 0;
2644              re2c:flags:u = 1;
2645              */
2646          }
2647
2648          int main()
2649          {
2650              static const uint8_t s8[] = // UTF-8
2651                  { 0xe2, 0x88, 0x80, 0x78, 0x20, 0xe2, 0x88, 0x83, 0x79
2652                  , 0x3a, 0x20, 0x70, 0x28, 0x78, 0x2c, 0x20, 0x79, 0x29 };
2653
2654              static const uint32_t s32[] = // UTF32
2655                  { 0x00002200, 0x00000078, 0x00000020, 0x00002203
2656                  , 0x00000079, 0x0000003a, 0x00000020, 0x00000070
2657                  , 0x00000028, 0x00000078, 0x0000002c, 0x00000020
2658                  , 0x00000079, 0x00000029 };
2659
2660              assert(lex_utf8(s8) == 0);
2661              assert(lex_utf32(s32) == 0);
2662              return 0;
2663          }
2664
2665
2666

ENCODING SUPPORT

2668       Speaking of encodings, it is necessary to understand the difference be‐
2669       tween code points and code units.  A code point is an abstract  symbol.
2670       A code unit is the smallest atomic unit of storage in the encoded text.
2671       A single code point may be represented with one or more code units.  In
2672       a fixed-length encoding all code points are represented with  the  same
2673       number of code units.  In a variable-length encoding code points may be
2674       represented with a different number of code units.  Note that the "any"
2675       rule  [^]  matches  any  code point, but not necessarily any code unit.
2676       The only way to match any code unit regardless of the encoding  is  the
2677       default rule *.  YYCTYPE size should be equal to the size of code unit.
2678
2679       Re2c supports the following encodings: ASCII, EBCDIC, UCS2, UTF8, UTF16
2680       and UTF32.
2681
2682       • ASCII is enabled by default.  It is a fixed-length encoding with code
2683         space [0-255] and 1-byte code points and code units.
2684
2685       • EBCDIC is enabled with -e, --ecb option.  It is a fixed-length encoding
2686         with code space [0-255] and 1-byte code points and code units.
2687
2688       • UCS2 is enabled with -w, --wide-chars option.  It is  a  fixed-length
2689         encoding  with  code space [0-0xFFFF] and 2-byte code points and code
2690         units.
2691
2692       • UTF8 is enabled with -8, --utf-8 option.   It  is  a  variable-length
2693         Unicode  encoding with code space [0-0x10FFFF].  Code points are rep‐
2694         resented with one, two, three or four 1-byte code units.
2695
2696       • UTF16 is enabled with -x, --utf-16 option.  It is  a  variable-length
2697         Unicode  encoding with code space [0-0x10FFFF].  Code points are rep‐
2698         resented with one or two 2-byte code units.
2699
2700       • UTF32 is enabled with -u, --unicode option.   It  is  a  fixed-length
2701         Unicode  encoding with code space [0-0x10FFFF] and 4-byte code points
2702         and code units.
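
       The relationship between code points and code units in the two
       variable-length encodings can be illustrated with a short C sketch.
       It is not part of re2c: the helper names utf8_encode and utf16_encode
       are hypothetical, and the functions only show how one code point maps
       to one to four 1-byte units in UTF8 and to one or two 2-byte units in
       UTF16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF8.
 * Returns the number of 1-byte code units written (1 to 4),
 * or 0 if cp lies outside the code space [0-0x10FFFF].
 * Surrogates are not treated specially in this sketch. */
static int utf8_encode(uint32_t cp, uint8_t out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    } else if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: encode one code point as UTF16.
 * Returns the number of 2-byte code units written (1 or 2),
 * or 0 if cp lies outside the code space [0-0x10FFFF]. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    } else if (cp <= 0x10FFFF) {
        cp -= 0x10000;
        out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
        return 2;
    }
    return 0;
}
```

       With these helpers, U+042B takes two code units in UTF8 (0xD0 0xAB)
       and one in UTF16, while U+1F600 takes four code units in UTF8 and two
       in UTF16.  In UTF32 every code point takes exactly one 4-byte unit.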

       Encodings can also be set or unset using the re2c:flags configurations;
       for example, re2c:flags:8 = 1; enables UTF8.

       The include file include/unicode_categories.re provides re2c
       definitions for the standard Unicode categories.

       The option --input-encoding utf8 enables Unicode literals in regular
       expressions.

       The option --encoding-policy <fail | substitute | ignore> specifies how
       re2c handles Unicode surrogates: code points in the range
       [0xD800-0xDFFF].
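
       The text above names the three policies without describing them.  The
       C sketch below shows one plausible reading, under the assumption that
       fail reports an error, substitute replaces a surrogate with the
       replacement code point 0xFFFD, and ignore treats surrogates as normal
       code points.  The type and function names are hypothetical; this is an
       illustration of the idea, not re2c's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, used only for this illustration. */
enum surrogate_policy { POLICY_FAIL, POLICY_SUBSTITUTE, POLICY_IGNORE };

/* Surrogates are the code points in the range [0xD800-0xDFFF]. */
static int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Apply a policy to one code point.
 * Returns 0 and stores the resulting code point in *out,
 * or returns -1 if the code point is rejected.
 * The meaning of each policy is an assumption (see the text above). */
static int apply_policy(enum surrogate_policy p, uint32_t cp, uint32_t *out)
{
    assert(out != NULL);
    if (is_surrogate(cp)) {
        switch (p) {
        case POLICY_FAIL:       return -1;          /* report an error */
        case POLICY_SUBSTITUTE: cp = 0xFFFD; break; /* replace surrogate */
        case POLICY_IGNORE:     break;              /* keep it unchanged */
        }
    }
    *out = cp;
    return 0;
}
```

       Code points outside the surrogate range pass through unchanged under
       every policy.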

   Example
          // re2c $INPUT -o $OUTPUT -8 --case-ranges -i
          //
          // Simplified "Unicode Identifier and Pattern Syntax"
          // (see https://unicode.org/reports/tr31)

          #include <assert.h>
          #include <stdint.h>

          /*!include:re2c "unicode_categories.re" */

          static int lex(const char *YYCURSOR)
          {
              const char *YYMARKER;
              /*!re2c
              re2c:define:YYCTYPE = 'unsigned char';
              re2c:yyfill:enable  = 0;

              id_start    = L | Nl | [$_];
              id_continue = id_start | Mn | Mc | Nd | Pc | [\u200D\u05F3];
              identifier  = id_start id_continue*;

              identifier { return 0; }
              *          { return 1; }
              */
          }

          int main()
          {
              assert(lex("_Ыдентификатор") == 0);
              return 0;
          }

START CONDITIONS

       Conditions are enabled with the -c, --conditions option.  This option
       allows one to encode multiple interrelated lexers within the same re2c
       block.

       Each lexer corresponds to a single condition.  It starts with a label
       of the form yyc_name, where name is the condition name and the yyc
       prefix can be adjusted with configuration re2c:condprefix.  Different
       lexers are separated with a comment
       /* *********************************** */ which can be adjusted with
       configuration re2c:cond:divider.

       Furthermore, each condition has a unique identifier of the form
       yycname, where name is the condition name and the yyc prefix can be
       adjusted with configuration re2c:condenumprefix.  Identifiers have the
       type YYCONDTYPE and should be generated with the /*!types:re2c*/
       directive or the -t, --type-header option.  Users shouldn't define
       these identifiers manually, as the order of conditions is not
       specified.

       Before all conditions re2c generates entry code that checks the current
       condition identifier and transfers control flow to the start label of
       the active condition.  After matching some rule of this condition, the
       lexer may either transfer control flow back to the entry code (after
       executing the associated action and optionally setting another
       condition with =>), or use the :=> shortcut and transition directly to
       the start label of another condition (skipping the action and the entry
       code).  Configuration re2c:cond:goto allows one to change the default
       behavior.
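
       The control flow described above can be pictured with a hand-written C
       sketch.  This is not re2c output: the enum, the labels and the function
       parse_sketch are hypothetical stand-ins for the generated condition
       identifiers, start labels and lexer, and the matching is done with
       plain comparisons instead of a generated DFA.  It recognizes only
       binary numbers with a 0b prefix and decimal numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PARSE_ERROR UINT64_MAX

/* Hypothetical stand-in for the generated condition identifiers. */
enum cond { C_INIT, C_BIN, C_DEC };

static uint64_t parse_sketch(const char *s)
{
    enum cond c = C_INIT;
    uint64_t u = 0;

    assert(s != NULL);

entry: /* entry code: dispatch on the current condition identifier */
    switch (c) {
    case C_INIT: goto cond_init;
    case C_BIN:  goto cond_bin;
    case C_DEC:  goto cond_dec;
    }
    return PARSE_ERROR;

cond_init:
    if (s[0] == '0' && (s[1] == 'b' || s[1] == 'B')
            && (s[2] == '0' || s[2] == '1')) {
        s += 2;
        goto cond_bin; /* like ":=> bin": jump straight to the start label */
    }
    if (s[0] >= '1' && s[0] <= '9') {
        c = C_DEC;     /* like "=> dec": set the new condition ... */
        goto entry;    /* ... and return through the entry code */
    }
    return PARSE_ERROR;

cond_bin:
    if (*s == '\0') return u;
    if (*s == '0' || *s == '1') {
        u = u * 2 + (uint64_t)(*s++ - '0');
        if (u > UINT32_MAX) return PARSE_ERROR;
        goto cond_bin;
    }
    return PARSE_ERROR;

cond_dec:
    if (*s == '\0') return u;
    if (*s >= '0' && *s <= '9') {
        u = u * 10 + (uint64_t)(*s++ - '0');
        if (u > UINT32_MAX) return PARSE_ERROR;
        goto cond_dec;
    }
    return PARSE_ERROR;
}
```

       Both transitions end up at the same label; the difference is only
       whether control passes through the dispatching switch on the way.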

       Syntactically each rule must be preceded with a list of comma-separated
       condition names or a wildcard * enclosed in angle brackets < and >.
       Wildcard means "any condition" and is semantically equivalent to
       listing all condition names.  In the forms below, regexp is a regular
       expression, default refers to the default rule *, and action is a block
       of code.

       • <conditions-or-wildcard>  regexp-or-default                 action

       • <conditions-or-wildcard>  regexp-or-default  =>  condition  action

       • <conditions-or-wildcard>  regexp-or-default  :=> condition

       Rules with an exclamation mark ! in front of the condition list have a
       special meaning: they have no regular expression, and the associated
       action is merged as entry code into the actions of normal rules.  This
       might be a convenient place to perform a routine task that is common to
       all rules.

       • <!conditions-or-wildcard>  action

       Another special form of rules with an empty condition list <> and no
       regular expression allows one to specify an "entry condition" that can
       be used to execute code before entering the lexer.  It is semantically
       equivalent to a condition with number zero, name 0 and an empty regular
       expression.

       • <>                 action

       • <>  =>  condition  action

       • <>  :=> condition

   Example
          // re2c $INPUT -o $OUTPUT -ci
          #include <stdint.h>
          #include <limits.h>
          #include <assert.h>

          // All bits set; larger than any valid 32-bit result.
          static const uint64_t ERROR = UINT64_MAX;
          /*!types:re2c*/

          template<int BASE> static void adddgt(uint64_t &u, unsigned int d)
          {
              u = u * BASE + d;
              if (u > UINT32_MAX) u = ERROR;
          }

          static uint64_t parse_u32(const char *s)
          {
              const char *YYMARKER;
              int c = yycinit;
              uint64_t u = 0;

              /*!re2c
              re2c:yyfill:enable = 0;
              re2c:api:style = free-form;
              re2c:define:YYCTYPE = char;
              re2c:define:YYCURSOR = s;
              re2c:define:YYGETCONDITION = "c";
              re2c:define:YYSETCONDITION = "c = @@;";

              <*> * { return ERROR; }

              <init> '0b' / [01]        :=> bin
              <init> "0"                :=> oct
              <init> "" / [1-9]         :=> dec
              <init> '0x' / [0-9a-fA-F] :=> hex

              <bin, oct, dec, hex> "\x00" { return u; }

              <bin> [01]  { adddgt<2> (u, s[-1] - '0');      goto yyc_bin; }
              <oct> [0-7] { adddgt<8> (u, s[-1] - '0');      goto yyc_oct; }
              <dec> [0-9] { adddgt<10>(u, s[-1] - '0');      goto yyc_dec; }
              <hex> [0-9] { adddgt<16>(u, s[-1] - '0');      goto yyc_hex; }
              <hex> [a-f] { adddgt<16>(u, s[-1] - 'a' + 10); goto yyc_hex; }
              <hex> [A-F] { adddgt<16>(u, s[-1] - 'A' + 10); goto yyc_hex; }
              */
          }

          int main()
          {
              assert(parse_u32("1234567890") == 1234567890);
              assert(parse_u32("0b1101") == 13);
              assert(parse_u32("0x7Fe") == 2046);
              assert(parse_u32("0644") == 420);
              assert(parse_u32("9999999999") == ERROR);
              assert(parse_u32("") == ERROR);
              return 0;
          }

SKELETON PROGRAMS

       With the -S, --skeleton option, re2c ignores all non-re2c code and
       generates a self-contained C program that can be further compiled and
       executed.  The program consists of lexer code and input data.  For each
       constructed DFA (block or condition) re2c generates a standalone lexer
       and two files: an .input file with strings derived from the DFA and a
       .keys file with expected match results.  The program runs each lexer on
       the corresponding .input file and compares the results with the
       expectations.  Skeleton programs are useful for several reasons:

       • They can check correctness of various re2c optimizations (the data is
         generated early in the process, before any DFA transformations have
         taken place).

       • Generating a set of input data with good coverage may be useful for
         both testing and benchmarking.

       • Generating self-contained executable programs allows one to get
         minimized test cases (the original code may be large or have a lot of
         dependencies).

       The difficulty with generating input data is that for all but the most
       trivial cases the number of possible input strings is too large (even
       if the string length is limited).  Re2c solves this difficulty by
       generating sufficiently many strings to cover almost all DFA
       transitions.  It uses the following algorithm.  First, it constructs a
       skeleton of the DFA.  For encodings with 1-byte code unit size (such as
       ASCII, UTF-8 and EBCDIC) the skeleton is an exact copy of the original
       DFA.  For encodings with multibyte code units the skeleton is a copy of
       the DFA with certain transitions omitted: namely, re2c takes at most
       256 code units for each disjoint continuous range that corresponds to a
       DFA transition.  The chosen values are evenly distributed and include
       the range bounds.  Instead of trying to cover all possible paths in the
       skeleton (which is infeasible), re2c generates sufficiently many paths
       to cover all skeleton transitions, and thus trigger the corresponding
       conditional jumps in the lexer.  The algorithm implementation is
       limited by ~1Gb of transitions and consumes a constant amount of memory
       (re2c writes data to file as soon as it is generated).
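
       The selection of at most 256 evenly distributed code units from a
       range, bounds included, could look roughly like the C sketch below.
       The text does not give the exact formula re2c uses, so the linear
       interpolation here is an assumption, and the names sample_range and
       MAX_SAMPLES are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_SAMPLES = 256 };

/* Pick at most MAX_SAMPLES code units from the inclusive range [lo, hi].
 * The picked values are evenly spread and always include both bounds.
 * Returns the number of values written to out. */
static size_t sample_range(uint32_t lo, uint32_t hi, uint32_t out[MAX_SAMPLES])
{
    uint64_t width, i;

    assert(out != NULL && lo <= hi);
    width = (uint64_t)hi - lo + 1; /* number of code units in the range */

    if (width <= MAX_SAMPLES) {
        /* Small range: keep every code unit. */
        for (i = 0; i < width; ++i) {
            out[i] = lo + (uint32_t)i;
        }
        return (size_t)width;
    }

    /* Large range: interpolate so that i = 0 gives lo
     * and i = MAX_SAMPLES - 1 gives hi. */
    for (i = 0; i < MAX_SAMPLES; ++i) {
        out[i] = lo + (uint32_t)((width - 1) * i / (MAX_SAMPLES - 1));
    }
    return MAX_SAMPLES;
}
```

       For a 26-unit range such as [0x61-0x7A] every code unit is kept, while
       the full UCS2 range [0-0xFFFF] is reduced to 256 values that start at
       0 and end at 0xFFFF.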

VISUALIZATION AND DEBUG

       With the -D, --emit-dot option, re2c does not generate code.  Instead,
       it dumps the generated DFA in DOT format.  One can convert this dump to
       an image of the DFA using Graphviz or another tool that reads DOT.
       Note that this option shows the final DFA after it has gone through a
       number of optimizations and transformations.  Earlier stages can be
       dumped with various debug options, such as --dump-nfa, --dump-dfa-raw
       etc. (see the full list of options).

AUTHORS

       Re2c was originally written by Peter Bumbulis in 1993.  Since then it
       has been developed and maintained by multiple volunteers; most notably,
       Brian Young, Marcus Boerger, Dan Nuffer and Ulya Trofimovich.