RE2C(1)                                                                RE2C(1)

NAME

       re2c - compile regular expressions to code

SYNOPSIS

       re2c  [OPTIONS] INPUT [-o OUTPUT]

       re2go [OPTIONS] INPUT [-o OUTPUT]

DESCRIPTION

       re2c is a tool for generating fast lexical analyzers for C, C++ and Go.

       Note: This manual includes examples for Go, but it refers to re2c
       (rather than re2go) as the name of the program in general.

SYNTAX

       A re2c program consists of normal code intermixed with re2c blocks and
       directives. Each re2c block may contain definitions, configurations
       and rules.

       Definitions are of the form name = regexp; where name is an identifier
       that consists of letters, digits and underscores, and regexp is a
       regular expression. Regular expressions may contain other definitions,
       but recursion is not allowed and each name should be defined before it
       is used.

       Configurations are of the form re2c:config = value; where config is
       the configuration descriptor and value can be a number, a string or a
       special word.

       Rules consist of a regular expression followed by a semantic action (a
       block of code enclosed in curly braces { and }, or a single raw line of
       code preceded with := and ended with a newline that is not followed by
       whitespace). If the input matches the regular expression, the
       associated semantic action is executed. If multiple rules match, the
       longest match takes precedence. If multiple rules match the same
       string, the earlier rule takes precedence.

       There are two special rules: the default rule * and the EOF rule $.
       The default rule should always be defined; it has the lowest priority
       regardless of its place and matches any code unit (not necessarily a
       valid character, see encoding support). The EOF rule matches the end
       of input; it should be defined if the corresponding method for
       handling the end of input is used. If start conditions are used, rules
       have a more complex syntax.

       All rules of a single block are compiled into a deterministic
       finite-state automaton (DFA) and encoded in the form of a program in
       the target language. The generated code interfaces with the outer
       program by means of a few user-defined primitives (see the program
       interface section). Reusable blocks allow sharing rules, definitions
       and configurations between different blocks.
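
       The rule precedence described in this section (longest match first,
       then the earlier rule on a tie) can be illustrated with a small
       hand-written Go sketch. It is not what re2c generates, since re2c
       compiles all rules into a single DFA. The names rule, pick, countWhile
       and the two sample rules are hypothetical and exist only for this
       illustration.

```go
package main

import "fmt"

// rule is a hypothetical stand-in for one lexer rule: match reports how many
// leading bytes of the input the rule accepts, or -1 if it does not match.
type rule struct {
	name  string
	match func(s string) int
}

// countWhile returns the length of the longest prefix of s whose bytes all
// satisfy pred.
func countWhile(s string, pred func(byte) bool) int {
	n := 0
	for n < len(s) && pred(s[n]) {
		n++
	}
	return n
}

func isDigit(c byte) bool { return '0' <= c && c <= '9' }
func isAlnum(c byte) bool { return isDigit(c) || ('a' <= c && c <= 'z') }

// nonEmpty turns a zero-length prefix into "no match".
func nonEmpty(n int) int {
	if n > 0 {
		return n
	}
	return -1
}

// rules lists "number" before "word", so "number" wins ties.
var rules = []rule{
	{"number", func(s string) int { return nonEmpty(countWhile(s, isDigit)) }},
	{"word", func(s string) int { return nonEmpty(countWhile(s, isAlnum)) }},
}

// pick applies the precedence: the longest match wins, and among matches of
// equal length the rule listed first wins.
func pick(rs []rule, s string) (string, int) {
	best, bestLen := "", -1
	for _, r := range rs {
		// Strict '>' keeps the earlier rule when lengths are equal.
		if n := r.match(s); n > bestLen {
			best, bestLen = r.name, n
		}
	}
	return best, bestLen
}

func main() {
	for _, in := range []string{"123", "123abc", "abc"} {
		name, n := pick(rules, in)
		fmt.Printf("%q -> %s (%d bytes)\n", in, name, n)
	}
}
```

       For the input 123 both rules match three bytes, so the earlier rule
       (number) is chosen; for 123abc the word rule matches six bytes and
       wins because its match is longer.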

EXAMPLE

   Input file
          //go:generate re2go $INPUT -o $OUTPUT -i
          package main                             //
                                                   //
          func lex(str string) int {               // Go code
              var cursor int                       //

              /*!re2c                              // start of re2c block
              re2c:define:YYCTYPE = byte;          // configuration
              re2c:define:YYPEEK = "str[cursor]";  // configuration
              re2c:define:YYSKIP = "cursor += 1";  // configuration
              re2c:yyfill:enable = 0;              // configuration
              re2c:flags:nested-ifs = 1;           // configuration
                                                   //
              number = [1-9][0-9]*;                // named definition
                                                   //
              number { return 0; }                 // normal rule
              *      { return 1; }                 // default rule
              */
          }                                        //
                                                   //
          func main() {                            //
              if lex("1234\x00") != 0 {            // Go code
                  panic("failed!")                 //
              }                                    //
          }                                        //

   Output file
          // Code generated by re2c, DO NOT EDIT.
          //go:generate re2go $INPUT -o $OUTPUT -i
          package main                             //
                                                   //
          func lex(str string) int {               // Go code
              var cursor int                       //


          {
              var yych byte
              yych = str[cursor]
              if (yych <= '0') {
                  goto yy2
              }
              if (yych <= '9') {
                  goto yy4
              }
          yy2:
              cursor += 1
              { return 1; }
          yy4:
              cursor += 1
              yych = str[cursor]
              if (yych <= '/') {
                  goto yy6
              }
              if (yych <= '9') {
                  goto yy4
              }
          yy6:
              { return 0; }
          }

          }                                        //
                                                   //
          func main() {                            //
              if lex("1234\x00") != 0 {            // Go code
                  panic("failed!")                 //
              }                                    //
          }                                        //

OPTIONS

       -? -h --help
              Show help message.

       -1 --single-pass
              Deprecated. Does nothing (single pass is the default now).

       -8 --utf-8
              Generate a lexer that reads input in UTF-8 encoding. re2c
              assumes that character range is 0 -- 0x10FFFF and character size
              is 1 byte.

       -b --bit-vectors
              Optimize conditional jumps using bit masks. Implies -s.

       -c --conditions --start-conditions
              Enable support of Flex-like "conditions": multiple interrelated
              lexers within one block. Option --start-conditions is a legacy
              alias; use --conditions instead.

       --case-insensitive
              Treat single-quoted and double-quoted strings as
              case-insensitive.

       --case-inverted
              Invert the meaning of single-quoted and double-quoted strings:
              treat single-quoted strings as case-sensitive and double-quoted
              strings as case-insensitive.

       --case-ranges
              Collapse consecutive cases in a switch statement into a range
              of the form case low ... high:. This syntax is an extension of
              the C/C++ language, supported by compilers like GCC, Clang and
              Tcc. The main advantage over using single cases is smaller
              generated C code and faster generation time, although for some
              compilers like Tcc it also results in smaller binary size. This
              option doesn't work for the Go backend.

       --depfile FILE
              Write dependency information to FILE in the form of a Makefile
              rule <output-file> : <input-file> [include-file ...]. This
              allows tracking build dependencies in the presence of
              /*!include:re2c*/ directives, so that updating include files
              triggers regeneration of the output file. This option requires
              that the -o --output option is specified.

       -e --ecb
              Generate a lexer that reads input in EBCDIC encoding. re2c
              assumes that character range is 0 -- 0xFF and character size is
              1 byte.

       --empty-class <match-empty | match-none | error>
              Define the way re2c treats empty character classes. With
              match-empty (the default) empty class matches empty input (which
              is illogical, but backwards-compatible). With match-none empty
              class always fails to match. With error empty class raises a
              compilation error.

       --encoding-policy <fail | substitute | ignore>
              Define the way re2c treats Unicode surrogates. With fail re2c
              aborts with an error when a surrogate is encountered. With
              substitute re2c silently replaces surrogates with the error code
              point 0xFFFD. With ignore (the default) re2c treats surrogates
              as normal code points. The Unicode standard says that standalone
              surrogates are invalid, but real-world libraries and programs
              behave in different ways.

       -f --storable-state
              Generate a lexer which can store its inner state. This is
              useful in push-model lexers which are stopped by an outer
              program when there is not enough input, and then resumed when
              more input becomes available. In this mode users should
              additionally define YYGETSTATE() and YYSETSTATE(state) macros
              and variables yych, yyaccept and state as part of the lexer
              state.

       -F --flex-syntax
              Partial support for Flex syntax: in this mode named definitions
              don't need the equal sign and the terminating semicolon, and
              when used they must be surrounded by curly braces. Names without
              curly braces are treated as double-quoted strings.

       -g --computed-gotos
              Optimize conditional jumps using non-standard "computed goto"
              extension (which must be supported by the compiler). re2c
              generates jump tables only in complex cases with a lot of
              conditional branches. Complexity threshold can be configured
              with cgoto:threshold configuration. This option implies -b.
              This option doesn't work for the Go backend.

       -I PATH
              Add PATH to the list of locations which are used when searching
              for include files. This option is useful in combination with
              /*!include:re2c ... */ directive. Re2c looks for FILE in the
              directory of including file and in the list of include paths
              specified by -I option.

       -i --no-debug-info
              Do not output #line information. This is useful when the
              generated code is tracked by some version control system or IDE.

       --input <default | custom>
              Specify the API used by the generated code to interface with
              user-defined code. Option default is the C API based on pointer
              arithmetic (it is the default for the C backend). Option custom
              is the generic API (it is the default for the Go backend).

       --input-encoding <ascii | utf8>
              Specify the way re2c parses regular expressions. With ascii
              (the default) re2c handles input as ASCII-encoded: any sequence
              of code units is a sequence of standalone 1-byte characters.
              With utf8 re2c handles input as UTF8-encoded and recognizes
              multibyte characters.

       --lang <c | go>
              Specify the output language. Supported languages are C and Go
              (the default is C).

       --location-format <gnu | msvc>
              Specify location format in messages. With gnu locations are
              printed as 'filename:line:column: ...'. With msvc locations are
              printed as 'filename(line,column) ...'. Default is gnu.

       --no-generation-date
              Suppress date output in the generated file.

       --no-version
              Suppress version output in the generated file.

       -o OUTPUT --output=OUTPUT
              Specify the OUTPUT file.

       -P --posix-captures
              Enable submatch extraction with POSIX-style capturing groups.

       -r --reusable
              Allows reuse of re2c rules with /*!rules:re2c */ and
              /*!use:re2c */ blocks. Exactly one rules-block must be present.
              The rules are saved and used by every use-block that follows,
              which may add its own rules and configurations.

       -S --skeleton
              Ignore user-defined interface code and generate a self-contained
              "skeleton" program. Additionally, generate input files with
              strings derived from the regular grammar and compressed match
              results that are used to verify "skeleton" behavior on all
              inputs. This option is useful for finding bugs in optimizations
              and code generation. This option doesn't work for the Go
              backend.

       -s --nested-ifs
              Use nested if statements instead of switch statements in
              conditional jumps. This usually results in more efficient code
              with non-optimizing compilers.

       -T --tags
              Enable submatch extraction with tags.

       -t HEADER --type-header=HEADER
              Generate a HEADER file that contains an enum with condition
              names. Requires -c option.

       -u --unicode
              Generate a lexer that reads UTF32-encoded input. Re2c assumes
              that character range is 0 -- 0x10FFFF and character size is 4
              bytes. This option implies -s.

       -V --vernum
              Show version information in MMmmpp format (major, minor, patch).

       --verbose
              Output a short message in case of success.

       -v --version
              Show version information.

       -w --wide-chars
              Generate a lexer that reads UCS2-encoded input. Re2c assumes
              that character range is 0 -- 0xFFFF and character size is 2
              bytes. This option implies -s.

       -x --utf-16
              Generate a lexer that reads UTF16-encoded input. Re2c assumes
              that character range is 0 -- 0x10FFFF and character size is 2
              bytes. This option implies -s.

   Debug options
       -D --emit-dot
              Instead of normal output generate lexer graph in .dot format.
              The output can be converted to an image with the help of
              Graphviz (e.g. something like dot -Tpng -odfa.png dfa.dot).

       -d --debug-output
              Emit YYDEBUG in the generated code. YYDEBUG should be defined
              by the user in the form of a void function with two parameters:
              state (lexer state or -1) and symbol (current input symbol of
              type YYCTYPE).

       --dump-adfa
              Debug option: output DFA after tunneling (in .dot format).

       --dump-cfg
              Debug option: output control flow graph of tag variables (in
              .dot format).

       --dump-closure-stats
              Debug option: output statistics on the number of states in
              closure.

       --dump-dfa-det
              Debug option: output DFA immediately after determinization (in
              .dot format).

       --dump-dfa-min
              Debug option: output DFA after minimization (in .dot format).

       --dump-dfa-tagopt
              Debug option: output DFA after tag optimizations (in .dot
              format).

       --dump-dfa-tree
              Debug option: output DFA under construction with states
              represented as tag history trees (in .dot format).

       --dump-dfa-raw
              Debug option: output DFA under construction with expanded
              state-sets (in .dot format).

       --dump-interf
              Debug option: output interference table produced by liveness
              analysis of tag variables.

       --dump-nfa
              Debug option: output NFA (in .dot format).

   Internal options
       --dfa-minimization <moore | table>
              Internal option: DFA minimization algorithm used by re2c. The
              moore option is the Moore algorithm (it is the default). The
              table option is the "table filling" algorithm. Both algorithms
              should produce the same DFA up to states relabeling; table
              filling is simpler and much slower and serves as a reference
              implementation.

       --eager-skip
              Internal option: make the generated lexer advance the input
              position eagerly -- immediately after reading the input symbol.
              This changes the default behavior when the input position is
              advanced lazily -- after transition to the next state. This
              option is implied by --no-lookahead.

       --no-lookahead
              Internal option: use TDFA(0) instead of TDFA(1). This option
              has effect only with --tags or --posix-captures options.

       --no-optimize-tags
              Internal option: suppress optimization of tag variables (useful
              for debugging).

       --posix-closure <gor1 | gtop>
              Internal option: specify shortest-path algorithm used for the
              construction of epsilon-closure with POSIX disambiguation
              semantics: gor1 (the default) stands for Goldberg-Radzik
              algorithm, and gtop stands for "global topological order"
              algorithm.

       --posix-prectable <complex | naive>
              Internal option: specify the algorithm used to compute POSIX
              precedence table. The complex algorithm computes precedence
              table in one traversal of tag history tree and has quadratic
              complexity in the number of TNFA states; it is the default. The
              naive algorithm has worst-case cubic complexity in the number of
              TNFA states, but it is much simpler than complex and may be
              slightly faster in non-pathological cases.

       --stadfa
              Internal option: use staDFA algorithm for submatch extraction.
              The main difference with TDFA is that tag operations in staDFA
              are placed in states, not on transitions.

       --fixed-tags <none | toplevel | all>
              Internal option: specify whether the fixed-tag optimization
              should be applied to all tags (all), none of them (none), or
              only those in toplevel concatenation (toplevel). The default is
              all. "Fixed" tags are those that are located within a fixed
              distance to some other tag (called "base"). In such cases only
              the base tag needs to be tracked, and the value of the fixed tag
              can be computed as the value of the base tag plus a static
              offset. For tags that are under alternative or repetition it is
              also necessary to check if the base tag has a no-match value (in
              that case fixed tag should also be set to no-match, disregarding
              the offset). For tags in top-level concatenation the check is
              not needed, because they always match.
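
       The fixed-tag computation described for --fixed-tags can be sketched
       in Go as follows. This is an illustration of the arithmetic only, not
       code produced by re2c; the constant noMatch and the function fixedTag
       are hypothetical names, and representing tags as integer offsets is an
       assumption.

```go
package main

import "fmt"

// noMatch is a hypothetical sentinel for a tag that did not match.
const noMatch = -1

// fixedTag derives the value of a fixed tag from its base tag and a static
// offset. If the base tag has the no-match value, the offset is disregarded
// and the fixed tag is set to no-match as well.
func fixedTag(base, offset int) int {
	if base == noMatch {
		return noMatch
	}
	return base + offset
}

func main() {
	fmt.Println(fixedTag(10, -3))      // base matched at 10, fixed tag is 7
	fmt.Println(fixedTag(noMatch, -3)) // base did not match, so neither does the fixed tag
}
```

       For tags in top-level concatenation the noMatch check could be
       dropped, since such tags always match.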

   Warnings
       -W     Turn on all warnings.

       -Werror
              Turn warnings into errors. Note that this option alone doesn't
              turn on any warnings; it only affects those warnings that have
              been turned on so far or will be turned on later.

       -W<warning>
              Turn on warning.

       -Wno-<warning>
              Turn off warning.

       -Werror-<warning>
              Turn on warning and treat it as an error (this implies
              -W<warning>).

       -Wno-error-<warning>
              Don't treat this particular warning as an error. This doesn't
              turn off the warning itself.

       -Wcondition-order
              Warn if the generated program makes implicit assumptions about
              condition numbering. One should use either the -t, --type-header
              option or the /*!types:re2c*/ directive to generate a mapping of
              condition names to numbers and then use the autogenerated
              condition names.

       -Wempty-character-class
              Warn if a regular expression contains an empty character class.
              Trying to match an empty character class makes no sense: it
              should always fail. However, for backwards compatibility
              reasons re2c allows empty character classes and treats them as
              empty strings. Use the --empty-class option to change the
              default behavior.

       -Wmatch-empty-string
              Warn if a rule is nullable (matches an empty string). If the
              lexer runs in a loop and the empty match is unintentional, the
              lexer may unexpectedly hang in an infinite loop.
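
       The hazard behind -Wmatch-empty-string can be shown with a small
       hand-written Go sketch. It does not use re2c; matchDigits is a
       hypothetical stand-in for a nullable rule such as [0-9]*, and tokenize
       is a hypothetical driver loop that detects the empty match instead of
       spinning forever.

```go
package main

import "fmt"

// matchDigits is a hypothetical nullable rule, equivalent to [0-9]*: it
// returns the number of leading digits, which may be zero.
func matchDigits(s string) int {
	n := 0
	for n < len(s) && '0' <= s[n] && s[n] <= '9' {
		n++
	}
	return n
}

// tokenize calls the rule in a loop. Without the check for a zero-length
// match the cursor would never advance past a non-digit and the loop would
// never terminate.
func tokenize(s string) (tokens []string, stuck bool) {
	cursor := 0
	for cursor < len(s) {
		n := matchDigits(s[cursor:])
		if n == 0 {
			return tokens, true // empty match: stop instead of looping forever
		}
		tokens = append(tokens, s[cursor:cursor+n])
		cursor += n
	}
	return tokens, false
}

func main() {
	fmt.Println(tokenize("1234"))
	fmt.Println(tokenize("12x34"))
}
```

       On the input 12x34 the rule matches 12 and then matches the empty
       string at x, which is the situation the warning is meant to flag.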

       -Wswapped-range
              Warn if the lower bound of a range is greater than its upper
              bound. The default behavior is to silently swap the range
              bounds.

       -Wundefined-control-flow
              Warn if some input strings cause undefined control flow in the
              lexer (the faulty patterns are reported). This is the most
              dangerous and most common mistake. It can be easily fixed by
              adding the default rule * which has the lowest priority, matches
              any code unit, and consumes exactly one code unit.

       -Wunreachable-rules
              Warn about rules that are shadowed by other rules and will never
              match.

       -Wuseless-escape
              Warn if a symbol is escaped when it shouldn't be. By default,
              re2c silently ignores such escapes, but this may as well
              indicate a typo or an error in the escape sequence.

       -Wnondeterministic-tags
              Warn if a tag has n-th degree of nondeterminism, where n is
              greater than 1.

       -Wsentinel-in-midrule
              Warn if the sentinel symbol occurs in the middle of a rule,
              which may cause reads past the end of buffer, crashes or memory
              corruption in the generated lexer. This warning is only
              applicable if the sentinel method of checking for the end of
              input is used. It is set to an error if re2c:sentinel
              configuration is used.

PROGRAM INTERFACE

       Re2c has a flexible interface that gives the user both the freedom and
       the responsibility to define how the generated code interacts with the
       outer program. There are two major options:

       • Pointer API. It is also called "default API", since it was
         historically the first, and for a long time the only one. This is a
         more restricted API based on C pointer arithmetic. It consists of
         pointer-like primitives YYCURSOR, YYMARKER, YYCTXMARKER and YYLIMIT,
         which are normally defined as pointers of type YYCTYPE*. Pointer API
         is enabled by default for the C backend, and it cannot be used with
         other backends that do not have pointer arithmetic.

       • Generic API. This is a less restricted API that does not assume
         pointer semantics. It consists of primitives YYPEEK, YYSKIP,
         YYBACKUP, YYBACKUPCTX, YYSTAGP, YYSTAGN, YYMTAGP, YYMTAGN, YYRESTORE,
         YYRESTORECTX, YYRESTORETAG, YYSHIFT, YYSHIFTSTAG, YYSHIFTMTAG and
         YYLESSTHAN. For the C backend generic API is enabled with --input
         custom option or re2c:flags:input = custom; configuration; for the Go
         backend it is enabled by default. Generic API was added in version
         0.14. It is intentionally designed to give the user as much freedom
         as possible in redefining the input model and the semantics of
         different actions performed by the generated code. As an example, one
         can override YYPEEK to check for the end of input before reading the
         input character, or do some logging, etc.
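
       As a sketch of the YYPEEK override mentioned above, the following Go
       code checks for the end of input before reading a character. The
       function peek is a hypothetical helper (it could be wired in with a
       configuration along the lines of re2c:define:YYPEEK =
       "peek(str, cursor)";, which is an assumption, not something the manual
       prescribes), and lexNumber is a hand-written approximation of a lexer
       for the rule set in the EXAMPLE section, not actual re2c output.

```go
package main

import "fmt"

// peek plays the role of YYPEEK. It returns the code unit at cursor, or 0 if
// cursor is at or past the end of str, so the input needs no terminating
// sentinel byte.
func peek(str string, cursor int) byte {
	if cursor >= len(str) {
		return 0
	}
	return str[cursor]
}

// lexNumber approximates a lexer for the rules
//
//	[1-9][0-9]* { return 0; }
//	*           { return 1; }
//
// using peek instead of direct indexing.
func lexNumber(str string) int {
	cursor := 0
	yych := peek(str, cursor)
	if yych < '1' || yych > '9' {
		return 1 // default rule
	}
	cursor++
	for {
		yych = peek(str, cursor)
		if yych < '0' || yych > '9' {
			return 0 // end of the number
		}
		cursor++
	}
}

func main() {
	fmt.Println(lexNumber("1234")) // no "\x00" terminator is needed here
	fmt.Println(lexNumber(""))
}
```

       With direct indexing (str[cursor]) the input 1234 without a trailing
       sentinel would cause an out-of-range read; the bounds check in peek
       avoids that at the cost of one comparison per character.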

       Generic API has two styles:

       • Function-like. This style is enabled with re2c:api:style =
         functions; configuration, and it is the default for C backend. In
         this style API primitives should be defined as functions or macros
         with parentheses, accepting the necessary arguments. For example, in
         C the default pointer API can be defined in function-like style
         generic API as follows:

            #define  YYPEEK()                 *YYCURSOR
            #define  YYSKIP()                 ++YYCURSOR
            #define  YYBACKUP()               YYMARKER = YYCURSOR
            #define  YYBACKUPCTX()            YYCTXMARKER = YYCURSOR
            #define  YYRESTORE()              YYCURSOR = YYMARKER
            #define  YYRESTORECTX()           YYCURSOR = YYCTXMARKER
            #define  YYRESTORETAG(tag)        YYCURSOR = tag
            #define  YYLESSTHAN(len)          YYLIMIT - YYCURSOR < len
            #define  YYSTAGP(tag)             tag = YYCURSOR
            #define  YYSTAGN(tag)             tag = NULL
            #define  YYSHIFT(shift)           YYCURSOR += shift
            #define  YYSHIFTSTAG(tag, shift)  tag += shift

       • Free-form. This style is enabled with re2c:api:style = free-form;
         configuration, and it is the default for Go backend. In this style
         API primitives can be defined as free-form pieces of code, and
         instead of arguments they have interpolated variables of the form
         @@{name}, or optionally just @@ if there is only one argument. The @@
         text is called "sigil". It can be redefined to any other text with
         re2c:api:sigil configuration. For example, the default pointer API
         can be defined in free-form style generic API as follows:

            re2c:define:YYPEEK       = "*YYCURSOR";
            re2c:define:YYSKIP       = "++YYCURSOR";
            re2c:define:YYBACKUP     = "YYMARKER = YYCURSOR";
            re2c:define:YYBACKUPCTX  = "YYCTXMARKER = YYCURSOR";
            re2c:define:YYRESTORE    = "YYCURSOR = YYMARKER";
            re2c:define:YYRESTORECTX = "YYCURSOR = YYCTXMARKER";
            re2c:define:YYRESTORETAG = "YYCURSOR = @@{tag}";
            re2c:define:YYLESSTHAN   = "YYLIMIT - YYCURSOR < @@{len}";
            re2c:define:YYSTAGP      = "@@{tag} = YYCURSOR";
            re2c:define:YYSTAGN      = "@@{tag} = NULL";
            re2c:define:YYSHIFT      = "YYCURSOR += @@{shift}";
            re2c:define:YYSHIFTSTAG  = "@@{tag} += @@{shift}";
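
       To make the free-form interpolation concrete, here is a simplified Go
       model of how a definition with @@{name} placeholders could be expanded.
       It is only an illustration of the substitution rule described above
       and makes no claim about how re2c implements it; the function expand
       and the sample argument values (yyt1, 2, 3) are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// expand substitutes named arguments into a free-form definition. Each
// argument replaces sigil+"{name}"; when there is exactly one argument, the
// bare sigil is replaced as well.
func expand(def, sigil string, args map[string]string) string {
	out := def
	for name, val := range args {
		out = strings.ReplaceAll(out, sigil+"{"+name+"}", val)
	}
	if len(args) == 1 {
		for _, val := range args {
			out = strings.ReplaceAll(out, sigil, val)
		}
	}
	return out
}

func main() {
	// Two arguments: both must be referenced by name.
	fmt.Println(expand("@@{tag} += @@{shift}", "@@",
		map[string]string{"tag": "yyt1", "shift": "2"}))

	// One argument: the bare sigil is enough.
	fmt.Println(expand("YYLIMIT - YYCURSOR < @@", "@@",
		map[string]string{"len": "3"}))

	// A redefined sigil, as with the re2c:api:sigil configuration.
	fmt.Println(expand("#{tag} = YYCURSOR", "#",
		map[string]string{"tag": "yyt1"}))
}
```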

   API primitives
       Here is a list of API primitives that may be used by the generated code
       in order to interface with the outer program. Which primitives are
       needed depends on multiple factors, including the complexity of regular
       expressions, input representation, buffering, the use of various
       features and so on. All the necessary primitives should be defined by
       the user in the form of macros, functions, variables, free-form pieces
       of code or any other suitable form. Re2c does not (and cannot) check
       the definitions, so if anything is missing or defined incorrectly the
       generated code will not compile.

       YYCTYPE
              The type of the input characters (code units). For ASCII,
              EBCDIC and UTF-8 encodings it should be a 1-byte unsigned
              integer. For UTF-16 or UCS-2 it should be a 2-byte unsigned
              integer. For UTF-32 it should be a 4-byte unsigned integer.

       YYCURSOR
              A pointer-like l-value that stores the current input position
              (usually a pointer of type YYCTYPE*). Initially YYCURSOR should
              point to the first input character. It is advanced by the
              generated code. When a rule matches, YYCURSOR points one
              position past the last matched character. It is used only in
              the default C API.

       YYLIMIT
              A pointer-like r-value that stores the end of input position
              (usually a pointer of type YYCTYPE*). Initially YYLIMIT should
              point one position past the last available input character. It
              is not changed by the generated code. The lexer compares
              YYCURSOR to YYLIMIT in order to determine if there are enough
              input characters left. YYLIMIT is used only in the default C
              API.

       YYMARKER
              A pointer-like l-value (usually a pointer of type YYCTYPE*) that
              stores the position of the latest matched rule. It is used to
              restore the YYCURSOR position if the longer match fails and the
              lexer needs to roll back. Initialization is not needed. YYMARKER
              is used only in the default C API.

       YYCTXMARKER
              A pointer-like l-value that stores the position of the trailing
              context (usually a pointer of type YYCTYPE*). No initialization
              is needed. It is used only in the default C API, and only with
              the lookahead operator /.

       YYFILL An API primitive with one argument len. The meaning of YYFILL is
              to provide at least len more input characters or fail. If the
              EOF rule is used, YYFILL should always return to the calling
              function; the return value should be zero on success and
              non-zero on failure. If the EOF rule is not used, the YYFILL
              return value is ignored and it should not return on failure.
              The maximal value of len is YYMAXFILL, which can be generated
              with the /*!max:re2c*/ directive. The definition of YYFILL can
              be either function-like or free-form depending on the API style
              (see re2c:api:style and re2c:define:YYFILL:naked).
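
       A buffer refill routine in the spirit of YYFILL might look like the
       following Go sketch. It follows the EOF-rule convention (always return,
       zero on success, non-zero on failure). The lexer type, its field names,
       newLexer and fill are hypothetical; a real lexer would also need to
       shift YYMARKER-like positions and tags together with the cursor, which
       this sketch omits.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// lexer holds a hypothetical buffered input.
type lexer struct {
	src    io.Reader
	buf    []byte
	cursor int // current position, the analogue of YYCURSOR
	token  int // start of the current lexeme; bytes before it can be dropped
	limit  int // end of valid data in buf, the analogue of YYLIMIT
}

// newLexer builds a lexer that reads text through a buffer of the given size.
func newLexer(text string, size int) *lexer {
	return &lexer{src: strings.NewReader(text), buf: make([]byte, size)}
}

// fill tries to make at least need bytes available starting at cursor.
// It returns 0 on success and a non-zero value on failure.
func (l *lexer) fill(need int) int {
	// Drop the consumed prefix to free space at the end of the buffer.
	shift := l.token
	copy(l.buf, l.buf[shift:l.limit])
	l.cursor -= shift
	l.limit -= shift
	l.token = 0

	for l.limit-l.cursor < need {
		if l.limit == len(l.buf) {
			return 1 // the requested data does not fit in the buffer
		}
		n, err := l.src.Read(l.buf[l.limit:])
		l.limit += n
		if err != nil && l.limit-l.cursor < need {
			return 1 // end of input or read error with too little data
		}
	}
	return 0
}

func main() {
	l := newLexer("abcdef", 4)
	fmt.Println(l.fill(3), string(l.buf[:l.limit]))

	// Pretend the lexer consumed everything that was buffered.
	l.cursor, l.token = l.limit, l.limit
	fmt.Println(l.fill(2), string(l.buf[:l.limit]))

	l.cursor, l.token = l.limit, l.limit
	fmt.Println(l.fill(1)) // input is exhausted, so this reports failure
}
```

       The buffer size bounds the longest lexeme plus lookahead that can be
       handled; in a generated lexer the largest value passed to YYFILL is
       YYMAXFILL.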
617
618       YYMAXFILL
619              An integral constant equal to the  maximal value of YYFILL argu‐
620              ment.  It can be generated with /*!max:re2c*/ directive.
621
622       YYLESSTHAN
623              A generic API primitive with one argument len.  It should be de‐
624              fined as an r-value of boolean type that equals true if and only
625              if there are fewer than len input characters left. The definition
626              can  be  either  function-like or free-form depending on the API
627              style (see re2c:api:style).
628
629       YYPEEK A generic API primitive with no arguments.  It should be defined
630              as  an r-value of type YYCTYPE that is equal to the character at
631              the current input position. The definition can be  either  func‐
632              tion-like   or   free-form  depending  on  the  API  style  (see
633              re2c:api:style).
634
635       YYSKIP A generic API primitive  with  no  arguments.   The  meaning  of
636              YYSKIP  is  to advance the current input position by one charac‐
637              ter. The definition can be either function-like or free-form de‐
638              pending on the API style (see re2c:api:style).
639
640       YYBACKUP
641              A  generic  API primitive with no arguments.  The meaning of YY‐
642              BACKUP is to save the current input position, which is later re‐
643              stored  with  YYRESTORE.   The definition should be either func‐
644              tion-like  or  free-form  depending  on  the  API   style   (see
645              re2c:api:style).
646
647       YYRESTORE
648              A generic API primitive with no arguments.  The meaning of YYRE‐
649              STORE is to restore the current  input  position  to  the  value
650              saved  by  YYBACKUP.   The  definition  should  be  either func‐
651              tion-like  or  free-form  depending  on  the  API   style   (see
652              re2c:api:style).
653
654       YYBACKUPCTX
655              A generic API primitive with zero arguments.  The meaning of YY‐
656              BACKUPCTX is to save the current input position as the  position
657              of  the  trailing  context,  which  is  later  restored by YYRE‐
658              STORECTX.  The definition  should  be  either  function-like  or
659              free-form depending on the API style (see re2c:api:style).
660
661       YYRESTORECTX
662              A generic API primitive with no arguments.  The meaning of YYRE‐
663              STORECTX is to restore the trailing context position saved  with
664              YYBACKUPCTX.   The  definition should be either function-like or
665              free-form depending on the API style (see re2c:api:style).
666
667       YYRESTORETAG
668              A generic API primitive with one argument tag.  The  meaning  of
669              YYRESTORETAG  is to restore the trailing context position to the
670              value of tag.  The definition should be either function-like  or
671              free-form depending on the API style (see re2c:api:style).
672
673       YYSTAGP
674              A  generic  API primitive with one argument tag.  The meaning of
675              YYSTAGP is to set tag value to the current input position.   The
676              definition should be either function-like or free-form depending
677              on the API style (see re2c:api:style).
678
679       YYSTAGN
680              A generic API primitive with one argument tag.  The  meaning  of
681              YYSTAGN is to set tag value to null (or some default value). The
682              definition should be either function-like or free-form depending
683              on the API style (see re2c:api:style).
684
685       YYMTAGP
686              A  generic  API primitive with one argument tag.  The meaning of
687              YYMTAGP is to append the current position to the history of tag.
688              The  definition  should be either function-like or free-form de‐
689              pending on the API style (see re2c:api:style).
690
691       YYMTAGN
692              A generic API primitive with one argument tag.  The  meaning  of
693              YYMTAGN  is  to append null (or some other default) value to the
694              history of tag.  The definition can be either  function-like  or
695              free-form depending on the API style (see re2c:api:style).
696
697       YYSHIFT
698              A generic API primitive with one argument shift.  The meaning of
699              YYSHIFT is to shift the current input position by shift  charac‐
700              ters  (the  shift  value may be negative). The definition can be
701              either function-like or free-form depending  on  the  API  style
702              (see re2c:api:style).
703
704       YYSHIFTSTAG
705              A generic  API primitive with two arguments, tag and shift.  The
706              meaning of YYSHIFTSTAG is to shift tag by shift characters  (the
707              shift  value  may  be  negative).   The definition can be either
708              function-like or free-form  depending  on  the  API  style  (see
709              re2c:api:style).
710
711       YYSHIFTMTAG
712              A  generic API primitive with two arguments, tag and shift.  The
713              meaning of YYSHIFTMTAG is to shift the latest value in the  his‐
714              tory  of  tag  by shift characters (the shift value may be nega‐
715              tive).   The  definition  should  be  either  function-like   or
716              free-form depending on the API style (see re2c:api:style).
717
718       YYMAXNMATCH
719              An  integral  constant equal to the maximal number of POSIX cap‐
720              turing  groups  in  a  rule.  It  is  generated  with   /*!maxn‐
721              match:re2c*/ directive.
722
723       YYCONDTYPE
724              The  type  of the condition enum.  It should be generated either
725              with /*!types:re2c*/ directive or -t --type-header option.
726
727       YYGETCONDITION
728              An API primitive with zero arguments.  It should be  defined  as
729              an  r-value of type YYCONDTYPE that is equal to the current con‐
730              dition identifier. The definition can be either function-like or
731              free-form  depending  on  the  API style (see re2c:api:style and
732              re2c:define:YYGETCONDITION:naked).
733
734       YYSETCONDITION
735              An API primitive with one argument cond.  The meaning of  YYSET‐
736              CONDITION  is  to  set the current condition identifier to cond.
737              The definition should be either function-like or  free-form  de‐
738              pending on the API style (see re2c:api:style and re2c:define:YY‐
739              SETCONDITION@cond).
740
741       YYGETSTATE
742              An API primitive with zero arguments.  It should be  defined  as
743              an  r-value  of  integer type that is equal to the current lexer
744              state. Should be initialized to -1. The definition can be either
745              function-like  or  free-form  depending  on  the  API style (see
746              re2c:api:style and re2c:define:YYGETSTATE:naked).
747
748       YYSETSTATE
749              An API primitive with one argument state.  The meaning of YYSET‐
750              STATE  is  to set the current lexer state to state.  The defini‐
751              tion should be either function-like or  free-form  depending  on
752              the   API   style  (see  re2c:api:style  and  re2c:define:YYSET‐
753              STATE@state).
754
755       YYDEBUG
756              A debug API primitive with two arguments. It can be used to  de‐
757              bug  the generated code (with -d --debug-output option). YYDEBUG
758              should return no value and accept two arguments: state (either a
759              DFA state index or -1) and symbol (the current input symbol).
760
761       yych   An l-value of type YYCTYPE that stores the current input charac‐
762              ter.  User definition is necessary only with -f --storable-state
763              option.
764
765       yyaccept
766              An  l-value  of unsigned integral type that stores the number of
767              the latest matched rule.  User definition is necessary only with
768              -f --storable-state option.
769
770       yynmatch
771              An  l-value  of unsigned integral type that stores the number of
772              POSIX capturing groups in the matched rule.  Used only  with  -P
773              --posix-captures option.
774
775       yypmatch
776              An array of l-values that are used to hold the tag values corre‐
777              sponding to the capturing parentheses in the matching rule.  Ar‐
778              ray  length must be at least yynmatch * 2 (usually YYMAXNMATCH *
779              2 is a good choice).  Used only with -P --posix-captures option.
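
       To make the generic API primitives from the list above more concrete, the
       sketch below defines YYPEEK, YYSKIP, YYBACKUP, YYRESTORE and  YYLESSTHAN
       as function-like macros over a small struct and uses them in a matcher.
       The matcher is written by hand to imitate the way a lexer falls back to
       the last accepted position; it is not re2c output.  The Input struct and
       the functions input_from(), at_digit() and match_number() are  hypothet‐
       ical names chosen for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical input state: positions are indices rather than pointers. */
typedef struct {
    const char *str;  /* input text, NUL-terminated */
    size_t      len;  /* number of characters before the NUL */
    size_t      cur;  /* current input position */
    size_t      mar;  /* position saved by YYBACKUP */
} Input;

/* Function-like definitions; each macro expects a variable `in` in scope. */
#define YYCTYPE        char
#define YYPEEK()       (in->str[in->cur])
#define YYSKIP()       (++in->cur)
#define YYBACKUP()     (in->mar = in->cur)
#define YYRESTORE()    (in->cur = in->mar)
#define YYLESSTHAN(n)  (in->len - in->cur < (size_t)(n))

static Input input_from(const char *s)
{
    Input in = { s, strlen(s), 0, 0 };
    return in;
}

/* True if at least one character is left and it is a decimal digit. */
static int at_digit(const Input *in)
{
    YYCTYPE c;
    if (YYLESSTHAN(1)) return 0;
    c = YYPEEK();
    return c >= '0' && c <= '9';
}

/* Match [0-9]+ ("." [0-9]+)? at the current position.
 * Returns the length of the match, or 0 if nothing matched. */
static size_t match_number(Input *in)
{
    const size_t start = in->cur;

    assert(in->cur <= in->len);
    if (!at_digit(in)) return 0;
    while (at_digit(in)) YYSKIP();

    /* The integer part is already a valid match: remember where it ends. */
    YYBACKUP();
    if (!YYLESSTHAN(1) && YYPEEK() == '.') {
        YYSKIP();
        if (at_digit(in)) {
            while (at_digit(in)) YYSKIP();
        } else {
            /* The longer match failed: roll back to the saved position. */
            YYRESTORE();
        }
    }
    return in->cur - start;
}
```

       On the input 12.x the matcher consumes the dot, finds no digit after it
       and restores the position saved by YYBACKUP, so only 12 is matched.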
780
781   Directives
782       Below is the list of all directives provided by re2c (in no  particular
783       order).  More information on each directive can be found in the related
784       sections.
785
786       /*!re2c ... */
787              A standard re2c block.
788
789       %{ ... %}
790              A standard re2c block in -F --flex-support mode.
791
792       /*!rules:re2c ... */
793              A reusable re2c block (requires -r --reuse option).
794
795       /*!use:re2c ... */
796              A  block  that  reuses  previous  rules-block   specified   with
797              /*!rules:re2c ... */ (requires -r --reuse option).
798
799       /*!ignore:re2c ... */
800              A  block  whose contents are ignored and cut from the output
801              file.
802
803       /*!max:re2c*/
804              This directive is substituted with the macro-definition  of  YY‐
805              MAXFILL.
806
807       /*!maxnmatch:re2c*/
808              This  directive  is substituted with the macro-definition of YY‐
809              MAXNMATCH (requires -P --posix-captures option).
810
811       /*!getstate:re2c*/
812              This directive is substituted with conditional dispatch on lexer
813              state (requires -f --storable-state option).
814
815       /*!types:re2c ... */
816              This  directive  is substituted with the definition of condition
817              enum (requires -c --conditions option).
818
819       /*!stags:re2c ... */, /*!mtags:re2c ... */
820              These directives allow one to specify a template piece  of  code
821              that  is  expanded  for  each  s-tag/m-tag variable generated by
822              re2c. This block has two optional configurations: format = "@@";
823              (specifies the template where @@ is substituted with the name of
824              each tag variable), and separator = ""; (specifies the piece  of
825              code  used  to join the generated pieces for different tag vari‐
826              ables).
827
828       /*!include:re2c FILE */
829              This directive allows one to include FILE (in the same sense  as
830              #include directive in C/C++).
831
832       /*!header:re2c:on*/
833              This  directive marks the start of header file. Everything after
834              it and up to the  following  /*!header:re2c:off*/  directive  is
835              processed  by re2c and written to the header file specified with
836              -t --type-header option.
837
838       /*!header:re2c:off*/
839              This directive  marks  the  end  of  header  file  started  with
840              /*!header:re2c:on*/.
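
       As an illustration of how some of the directives above fit together, here
       is a minimal sketch of an input file.  It needs to be processed by re2c
       (for example re2c INPUT -o OUTPUT) before it can be compiled.  The func‐
       tion name is_number and the rules are assumptions made for this example.

```c
/*!ignore:re2c
    Notes for maintainers; this text is cut from the output file.
*/

/*!max:re2c*/

static int is_number(const char *YYCURSOR)
{
    const char *YYMARKER;
    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable  = 0;

        [0-9]+ [\x00] { return 1; }
        *             { return 0; }
    */
}
```

       In the generated file /*!max:re2c*/ is replaced with the definition  of
       YYMAXFILL and the /*!re2c ... */ block with the lexer code.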
841
842   Configurations
843       re2c:flags:t, re2c:flags:type-header
844              Specify  the  name  of the generated header file relative to the
845              directory of the output file. (Same as  -t,  --type-header  com‐
846              mand-line option except that the filepath is relative.)
847
848       re2c:flags:input
849              Same as --input command-line option.
850
851       re2c:api:style
852              Allows  one to specify the style of generic API. Possible values
853              are functions and free-form. With functions style  (the  default
854              for  the  C  backend)  API primitives behave like functions, and
855              re2c generates parentheses with an argument list after the  name
856              of each primitive.  With free-form style (the default for the Go
857              backend) re2c treats API definitions as interpolated strings and
858              substitutes  argument placeholders with the actual argument val‐
859              ues.  This option can be overridden by  options  for  individual
860              API primitives, e.g. re2c:define:YYFILL:naked for YYFILL.
861
862       re2c:api:sigil
863              Allows  one  to  specify  the "sigil" symbol (or string) that is
864              used to recognize argument placeholders in  the  definitions  of
865              generic  API primitives.  The default value is @@.  Placeholders
866              start with sigil, followed by the argument name in curly braces.
867              For  example,  if sigil is set to $, then placeholders will have
868              the form ${name}. Single-argument APIs may use  shorthand  nota‐
869              tion  without  the name in braces. This option can be overridden
870              by options for individual API primitives,  e.g.  re2c:define:YY‐
871              FILL@len for YYFILL.
872
873       re2c:define:YYCTYPE
874              Defines YYCTYPE (see the user interface section).
875
876       re2c:define:YYCURSOR
877              Defines  C  API  primitive YYCURSOR (see the user interface sec‐
878              tion).
879
880       re2c:define:YYLIMIT
881              Defines C API primitive YYLIMIT (see  the  user  interface  sec‐
882              tion).
883
884       re2c:define:YYMARKER
885              Defines  C  API  primitive YYMARKER (see the user interface sec‐
886              tion).
887
888       re2c:define:YYCTXMARKER
889              Defines C API primitive YYCTXMARKER (see the user interface sec‐
890              tion).
891
892       re2c:define:YYFILL
893              Defines API primitive YYFILL (see the user interface section).
894
895       re2c:define:YYFILL@len
896              Specifies  the  sigil  used  for argument substitution in YYFILL
897              definition.  Defaults  to  @@.   Overrides  the   more   generic
898              re2c:api:sigil configuration.
899
900       re2c:define:YYFILL:naked
901              Allows  one to override re2c:api:style for YYFILL.  Value 0 cor‐
902              responds to free-form API style.
903
904       re2c:yyfill:enable
905              Defaults to 1 (YYFILL is enabled). Set this to zero to  suppress
906              the generation of YYFILL. Use warnings (-W option) and re2c:sen‐
907              tinel configuration to verify that the  generated  lexer  cannot
908              read past the end of input, as this might introduce severe secu‐
909              rity issues to your programs.
910
911       re2c:yyfill:parameter
912              Controls the argument in the parentheses that follow YYFILL. De‐
913              faults  to  1,  which  means  that the argument is generated. If
914              zero, the argument is omitted. Can be overridden  with  re2c:de‐
915              fine:YYFILL:naked or re2c:api:style.
916
917       re2c:eof
918              Specifies  the sentinel symbol used with EOF rule $ to check for
919              the end of input in the generated lexer. The default value is -1
920              (EOF  rule is not used). Other possible values include all valid
921              code units. Only decimal numbers are recognized.
922
923       re2c:sentinel
924              Specifies the sentinel symbol used with the sentinel  method  of
925              checking  for  the end of input in the generated lexer (the case
926              when bounds checking is disabled with  re2c:yyfill:enable  =  0;
927              and  EOF rule $ is not used). This configuration does not affect
928              code generation. It is used by re2c to verify that the  sentinel
929              symbol  is  not  allowed  in the middle of the rule, and prevent
930              possible reads past the end of buffer in  the  generated  lexer.
931              The  default  value is -1 (re2c assumes that the sentinel symbol
932              is 0, which is the most common case). Other possible values  in‐
933              clude all valid code units. Only decimal numbers are recognized.
934
935       re2c:define:YYLESSTHAN
936              Defines generic API primitive YYLESSTHAN (see the user interface
937              section).
938
939       re2c:yyfill:check
940              Setting this to zero suppresses the generation  of  the  YYFILL
941              check  (YYLESSTHAN in generic API or YYLIMIT-based comparison in
942              default C API). This configuration is useful when the  necessary
943              input is always available. It defaults to 1 (the check is gener‐
944              ated).
945
946       re2c:label:yyFillLabel
947              Allows one to change the prefix of YYFILL labels (used with  EOF
948              rule or with storable states).
949
950       re2c:define:YYPEEK
951              Defines  generic  API  primitive  YYPEEK (see the user interface
952              section).
953
954       re2c:define:YYSKIP
955              Defines generic API primitive YYSKIP  (see  the  user  interface
956              section).
957
958       re2c:define:YYBACKUP
959              Defines  generic  API primitive YYBACKUP (see the user interface
960              section).
961
962       re2c:define:YYBACKUPCTX
963              Defines generic API primitive YYBACKUPCTX (see the  user  inter‐
964              face section).
965
966       re2c:define:YYRESTORE
967              Defines  generic API primitive YYRESTORE (see the user interface
968              section).
969
970       re2c:define:YYRESTORECTX
971              Defines generic API primitive YYRESTORECTX (see the user  inter‐
972              face section).
973
974       re2c:define:YYRESTORETAG
975              Defines  generic API primitive YYRESTORETAG (see the user inter‐
976              face section).
977
978       re2c:define:YYSHIFT
979              Defines generic API primitive YYSHIFT (see  the  user  interface
980              section).
981
982       re2c:define:YYSHIFTMTAG
983              Defines  generic  API primitive YYSHIFTMTAG (see the user inter‐
984              face section).
985
986       re2c:define:YYSHIFTSTAG
987              Defines generic API primitive YYSHIFTSTAG (see the  user  inter‐
988              face section).
989
990       re2c:define:YYSTAGN
991              Defines  generic  API  primitive YYSTAGN (see the user interface
992              section).
993
994       re2c:define:YYSTAGP
995              Defines generic API primitive YYSTAGP (see  the  user  interface
996              section).
997
998       re2c:define:YYMTAGN
999              Defines  generic  API  primitive YYMTAGN (see the user interface
1000              section).
1001
1002       re2c:define:YYMTAGP
1003              Defines generic API primitive YYMTAGP (see  the  user  interface
1004              section).
1005
1006       re2c:flags:T, re2c:flags:tags
1007              Same as -T --tags command-line option.
1008
1009       re2c:flags:P, re2c:flags:posix-captures
1010              Same as -P --posix-captures command-line option.
1011
1012       re2c:tags:expression
1013              Allows  one  to  customize the way re2c addresses tag variables.
1014              By default re2c generates expressions of the form  yyt<N>.  This
1015              might  be inconvenient, for example if tag variables are defined
1016              as fields in a struct. re2c recognizes placeholders of the  form
1017              @@{tag}  or @@ and replaces them with the actual tag name. Sigil
1018              @@ can be redefined with re2c:api:sigil configuration.  For  ex‐
1019              ample,  setting  re2c:tags:expression  = "p->@@"; results in ex‐
1020              pressions of the form p->yyt<N> in the generated code.
1021
1022       re2c:tags:prefix
1023              Allows one to override the prefix of tag variables (defaults  to
1024              yyt).
1025
1026       re2c:flags:lookahead
1027              Same as inverted --no-lookahead command-line option.
1028
1029       re2c:flags:optimize-tags
1030              Same as inverted --no-optimize-tags command-line option.
1031
1032       re2c:define:YYCONDTYPE
1033              Defines YYCONDTYPE (see the user interface section).
1034
1035       re2c:define:YYGETCONDITION
1036              Defines  API  primitive  YYGETCONDITION  (see the user interface
1037              section).
1038
1039       re2c:define:YYGETCONDITION:naked
1040              Allows one to override re2c:api:style for YYGETCONDITION.  Value
1041              0 corresponds to free-form API style.
1042
1043       re2c:define:YYSETCONDITION
1044              Defines  API  primitive  YYSETCONDITION  (see the user interface
1045              section).
1046
1047       re2c:define:YYSETCONDITION@cond
1048              Specifies the sigil used for argument substitution in  YYSETCON‐
1049              DITION  definition. The default value is @@.  Overrides the more
1050              generic re2c:api:sigil configuration.
1051
1052       re2c:define:YYSETCONDITION:naked
1053              Allows one to override re2c:api:style for YYSETCONDITION.  Value
1054              0 corresponds to free-form API style.
1055
1056       re2c:cond:goto
1057              Allows one to customize the goto statements used with the short‐
1058              cut :=> rules in conditions. The  default  value  is  goto  @@;.
1059              Placeholders   are   substituted   with   condition   name  (see
1060              re2c:api:sigil and re2c:cond:goto@cond).
1061
1062       re2c:cond:goto@cond
1063              Specifies  the  sigil  used   for   argument   substitution   in
1064              re2c:cond:goto  definition.  The default value is @@.  Overrides
1065              the more generic re2c:api:sigil configuration.
1066
1067       re2c:cond:divider
1068              Defines the divider for condition blocks.  The default value  is
1069              /*  ***********************************  */.   Placeholders  are
1070              substituted  with  condition  name   (see   re2c:api:sigil   and
1071              re2c:cond:divider@cond).
1072
1073       re2c:cond:divider@cond
1074              Specifies   the   sigil   used   for  argument  substitution  in
1075              re2c:cond:divider definition. The default value  is  @@.   Over‐
1076              rides the more generic re2c:api:sigil configuration.
1077
1078       re2c:condprefix
1079              Specifies  the  prefix  used  for condition labels.  The default
1080              value is yyc_.
1081
1082       re2c:condenumprefix
1083              Specifies the prefix used for condition  identifiers.   The  de‐
1084              fault value is yyc.
1085
1086       re2c:define:YYGETSTATE
1087              Defines  API  primitive  YYGETSTATE (see the user interface sec‐
1088              tion).
1089
1090       re2c:define:YYGETSTATE:naked
1091              Allows one to override re2c:api:style for YYGETSTATE.   Value  0
1092              corresponds to free-form API style.
1093
1094       re2c:define:YYSETSTATE
1095              Defines  API  primitive  YYSETSTATE (see the user interface sec‐
1096              tion).
1097
1098       re2c:define:YYSETSTATE@state
1099              Specifies the sigil used for argument substitution in YYSETSTATE
1100              definition. The default value is @@.  Overrides the more generic
1101              re2c:api:sigil configuration.
1102
1103       re2c:define:YYSETSTATE:naked
1104              Allows one to override re2c:api:style for YYSETSTATE.   Value  0
1105              corresponds to free-form API style.
1106
1107       re2c:state:abort
1108              If  set  to  a  positive  integer value, changes the form of the
1109              YYGETSTATE switch: instead of using default case to jump to  the
1110              beginning of the lexer block, a -1 case is used, and the default
1111              case aborts the program.
1112
1113       re2c:state:nextlabel
1114              With storable states, controls whether the  YYGETSTATE  block
1115              is  followed by a yyNext label (the default value is zero, which
1116              corresponds to no label). Instead of using yyNext it is possible
1117              to  use  re2c:startlabel  to  force the generation of a specific
1118              start label.  Instead of using labels it is  often  more  conve‐
1119              nient to generate YYGETSTATE code using /*!getstate:re2c*/.
1120
1121       re2c:label:yyNext
1122              Allows one to change the name of the yyNext label.
1123
1124       re2c:startlabel
1125              Controls the generation of start label for the next lexer block.
1126              The default value is zero, which means that the start  label  is
1127              generated only if it is used. An integer value greater than zero
1128              forces the generation of start label even if it is unused by the
1129              lexer.  A  string  value  also forces start label generation and
1130              sets the label name to the specified string.  This configuration
1131              applies  only  to  the current block (it is reset to default for
1132              the next block).
1133
1134       re2c:flags:s, re2c:flags:nested-ifs
1135              Same as -s --nested-ifs command-line option.
1136
1137       re2c:flags:b, re2c:flags:bit-vectors
1138              Same as -b --bit-vectors command-line option.
1139
1140       re2c:variable:yybm
1141              Overrides the name of the yybm variable.
1142
1143       re2c:yybm:hex
1144              Defaults to zero (a decimal bitmap table is generated).  If  set
1145              to nonzero, a hexadecimal table is generated.
1146
1147       re2c:flags:g, re2c:flags:computed-gotos
1148              Same as -g --computed-gotos command-line option.
1149
1150       re2c:cgoto:threshold
1151              With  -g  --computed-gotos  option this value specifies the com‐
1152              plexity threshold that triggers the generation  of  jump  tables
1153              instead  of  nested if statements and bitmaps. The default value
1154              is 9.
1155
1156       re2c:flags:case-ranges
1157              Same as --case-ranges command-line option.
1158
1159       re2c:flags:e, re2c:flags:ecb
1160              Same as -e --ecb command-line option.
1161
1162       re2c:flags:8, re2c:flags:utf-8
1163              Same as -8 --utf-8 command-line option.
1164
1165       re2c:flags:w, re2c:flags:wide-chars
1166              Same as -w --wide-chars command-line option.
1167
1168       re2c:flags:x, re2c:flags:utf-16
1169              Same as -x --utf-16 command-line option.
1170
1171       re2c:flags:u, re2c:flags:unicode
1172              Same as -u --unicode command-line option.
1173
1174       re2c:flags:encoding-policy
1175              Same as --encoding-policy command-line option.
1176
1177       re2c:flags:empty-class
1178              Same as --empty-class command-line option.
1179
1180       re2c:flags:case-insensitive
1181              Same as --case-insensitive command-line option.
1182
1183       re2c:flags:case-inverted
1184              Same as --case-inverted command-line option.
1185
1186       re2c:flags:i, re2c:flags:no-debug-info
1187              Same as -i --no-debug-info command-line option.
1188
1189       re2c:indent:string
1190              Specifies the string to use for indentation.  The default  value
1191              is  "\t".   Indent string should contain only whitespace charac‐
1192              ters.  To disable indentation entirely, set  this  configuration
1193              to empty string "".
1194
1195       re2c:indent:top
1196              Specifies the minimum amount of indentation to use.  The default
1197              value is zero.  The value should be a non-negative integer  num‐
1198              ber.
1199
1200       re2c:labelprefix
1201              Allows  one  to  change the prefix of DFA state labels.  The de‐
1202              fault value is yy.
1203
1204       re2c:yych:emit
1205              Set this to zero to suppress the generation of yych  definition.
1206              Defaults to 1 (the definition is generated).
1207
1208       re2c:variable:yych
1209              Overrides the name of the yych variable.
1210
1211       re2c:yych:conversion
1212              If  set  to nonzero, re2c automatically generates a cast to YYC‐
1213              TYPE every time yych is read. Defaults to zero (no cast).
1214
1215       re2c:variable:yyaccept
1216              Overrides the name of the yyaccept variable.
1217
1218       re2c:variable:yytarget
1219              Overrides the name of the yytarget variable.
1220
1221       re2c:variable:yystable
1222              Deprecated.
1223
1224       re2c:variable:yyctable
1225              When both -c --conditions and -g  --computed-gotos  are  active,
1226              re2c  will use this variable to generate a static jump table for
1227              YYGETCONDITION.
1228
1229       re2c:define:YYDEBUG
1230              Defines YYDEBUG (see the user interface section).
1231
1232       re2c:flags:d, re2c:flags:debug-output
1233              Same as -d --debug-output command-line option.
1234
1235       re2c:flags:dfa-minimization
1236              Same as --dfa-minimization command-line option.
1237
1238       re2c:flags:eager-skip
1239              Same as --eager-skip command-line option.
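
       The configurations above are set inside a re2c block.  The fragment below
       is a sketch of a block that selects the free-form style of  the  generic
       API and defines a few primitives as interpolated strings.  The variables
       cur and mar are hypothetical and must be declared by the surrounding
       code; the value custom for re2c:flags:input is an assumption about the
       --input option, and the two rules are only there to make the block com‐
       plete.

```c
/*!re2c
    re2c:flags:input      = custom;
    re2c:api:style        = free-form;
    re2c:yyfill:enable    = 0;

    re2c:define:YYCTYPE   = char;
    re2c:define:YYPEEK    = "*cur";
    re2c:define:YYSKIP    = "++cur;";
    re2c:define:YYBACKUP  = "mar = cur;";
    re2c:define:YYRESTORE = "cur = mar;";
    re2c:define:YYSHIFT   = "cur += @@{shift};";

    [a-z]+ { return 1; }
    *      { return 0; }
*/
```

       With the default sigil, @@{shift} in the YYSHIFT definition is replaced
       with the actual shift value wherever re2c emits that primitive.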
1240

REGULAR EXPRESSIONS

1242       re2c uses the following syntax for regular expressions:
1243
1244       • "foo" case-sensitive string literal
1245
1246       • 'foo' case-insensitive string literal
1247
1248       • [a-xyz], [^a-xyz] character class (possibly negated)
1249
1250       • . any character except newline
1251
1252       • R \ S difference of character classes R and S
1253
1254       • R* zero or more occurrences of R
1255
1256       • R+ one or more occurrences of R
1257
1258       • R? optional R
1259
1260       • R{n} repetition of R exactly n times
1261
1262       • R{n,} repetition of R at least n times
1263
1264       • R{n,m} repetition of R from n to m times
1265
1266       • (R) just R; parentheses  are  used  to  override  precedence  or  for
1267         POSIX-style submatch
1268
1269       • R S concatenation: R followed by S
1270
1271       • R | S alternative: R or S
1272
1273       • R / S lookahead: R followed by S, but S is not consumed
1274
1275       • name the regular expression defined as name (or literal string "name"
1276         in Flex compatibility mode)
1277
1278       • {name} the regular expression defined as name in  Flex  compatibility
1279         mode
1280
1281       • @stag  an s-tag: saves the last input position at which @stag matches
1282         in a variable named stag
1283
1284       • #mtag an m-tag: saves all input positions at which #mtag matches in a
1285         variable named mtag
1286
1287       Character  classes and string literals may contain the following escape
1288       sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and hexa‐
1289       decimal escapes \xhh, \uhhhh and \Uhhhhhhhh.
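
       The sketch below combines several of the operators listed above in named
       definitions and rules.  It is lexer input that has to be processed  by
       re2c, not plain C.  The function name classify, the definition names and
       the return codes are assumptions made for this example; YYCTXMARKER  is
       declared for the lookahead rule and may turn out to be unused.

```c
static int classify(const char *YYCURSOR)
{
    const char *YYMARKER, *YYCTXMARKER;
    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable  = 0;

        digit     = [0-9];
        letter    = [a-zA-Z_];
        consonant = [a-z] \ [aeiou];            // difference of classes
        ident     = letter (letter | digit)*;   // concatenation, alternative
        number    = digit+ ("." digit{1,3})?;   // repetition, optional part

        'if'           { return 1; }   // case-insensitive literal
        ident / "("    { return 2; }   // lookahead: "(" is not consumed
        ident          { return 3; }
        number         { return 4; }
        "#" consonant+ { return 5; }
        *              { return 0; }
    */
}
```

       Both 'if' and ident match the string if; the rule that comes first takes
       precedence, so the function returns 1 for that input.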
1290

HANDLING THE END OF INPUT

       One of the main problems for the lexer is to know when to stop.  There
       are a few terminating conditions:

       • the lexer may match some rule (including default rule *) and come to
         a final state

       • the lexer may fail to match any rule and come to a default state

       • the lexer may reach the end of input

       The first two conditions terminate the lexer in a "natural" way: it
       comes to a state with no outgoing transitions, and the matching
       automatically stops.  The third condition, end of input, is different:
       it may happen in any state, and the lexer should be able to handle it.
       Checking for the end of input interrupts the normal lexer workflow and
       adds conditional branches to the generated program, therefore it is
       necessary to minimize the number of such checks.  re2c supports a few
       different methods for end of input handling.  Which one to use depends
       on the complexity of regular expressions, the need for buffering,
       performance considerations and other factors.  Here is a list of all
       methods:

       • Sentinel character.  This method eliminates the need for the end of
         input checks altogether.  It is simple and efficient, but limited to
         the case when there is a natural "sentinel" character that can never
         occur in valid input.  This character may still occur in invalid
         input, but it is not allowed by the regular expressions, except
         perhaps as the last character of a rule.  The sentinel character is
         appended at the end of input and serves as a stop signal: when the
         lexer reads it, this is either the end of input or a syntax error.
         In both cases the lexer stops.  This method is used if YYFILL is
         disabled with re2c:yyfill:enable = 0; and re2c:eof has the default
         value -1.

       • Sentinel character with bounds checks.  This method is generic: it
         can handle any input without restrictions on the regular
         expressions.  The idea is to reduce the number of end of input checks
         by performing them only on certain characters.  Similar to the
         "sentinel character" method, one of the characters is chosen as a
         "sentinel" and appended at the end of input.  However, there is no
         restriction on where the sentinel character may occur (in fact, any
         character can be chosen for a sentinel).  When the lexer reads this
         character, it additionally performs a bounds check.  If the current
         position is within bounds, the lexer will resume matching and handle
         the sentinel character as a regular one.  Otherwise it will try to
         get more input with YYFILL (unless YYFILL is disabled).  If more
         input is available, the lexer will rematch the last character and
         continue as if the sentinel never occurred.  Otherwise it is the real
         end of input, and the lexer will stop.  This method is used if
         re2c:eof has a non-negative value (it should be set to the ordinal of
         the sentinel character).  YYFILL must be either defined or disabled
         with re2c:yyfill:enable = 0;.

       • Bounds checks with padding.  This method is the default one.  It is
         generic, and it is usually faster than the "sentinel character with
         bounds checks" method, but also more complex to use.  The idea is to
         partition the underlying finite-state automaton into strongly
         connected components (SCCs), and generate only one bounds check per
         SCC, but make it check for multiple characters at once (enough to
         cover the longest non-looping path in the SCC).  This way the checks
         are less frequent, which makes the lexer run faster.  If a check
         shows that there is not enough input, the lexer will invoke YYFILL,
         which must either supply enough input or not return at all (in the
         latter case the lexer will stop).  This approach has a problem with
         matching short lexemes at the end of input, because the
         multi-character check requires enough characters to cover the longest
         possible lexeme.  To fix this problem, it is necessary to append a
         few fake characters at the end of input.  The padding should not form
         a valid lexeme suffix to avoid fooling the lexer into matching it as
         part of the input.  The minimum sufficient length of padding is
         YYMAXFILL and it is autogenerated by re2c with /*!max:re2c*/.  This
         method is used if re2c:yyfill:enable has the default nonzero value,
         and re2c:eof has the default value -1.  YYFILL must be defined.

       • Custom methods with generic API.  Generic API allows overriding
         basic operations like reading a character, which makes it possible to
         include the end of input checks as part of them.  Such methods are
         error-prone and should be used with caution, only if other methods
         cannot be used.  These methods are used if generic API is enabled
         with --input custom or re2c:flags:input = custom; and default bounds
         checks are disabled with re2c:yyfill:enable = 0;.  Note that the use
         of generic API does not imply the use of custom methods, it merely
         allows it.

       The following subsections contain an example of each method.

   Sentinel character
       In this example the lexer uses a sentinel character to handle the end
       of input.  The program counts space-separated words in a
       null-terminated string.  Configuration re2c:yyfill:enable = 0;
       suppresses the generation of bounds checks and YYFILL invocations.  The
       sentinel character is null.  It is the last character of each input
       string, and it is not allowed in the middle of a lexeme by any of the
       rules (in particular, it is not included in the character ranges, where
       it is easy to overlook).  If a null occurs in the middle of a string,
       it is a syntax error and the lexer will match default rule *, but it
       won't read past the end of input or crash.  The -Wsentinel-in-midrule
       warning verifies that the rules do not allow the sentinel in the middle
       (it is possible to tell re2c which character is used as a sentinel with
       the re2c:sentinel configuration; the default assumption is null, since
       this is the most common case).

          //go:generate re2go $INPUT -o $OUTPUT
          package main

          import "testing"

          // Expects a null-terminated string.
          func lex(str string) int {
              var cursor int
              count := 0
          loop:
              /*!re2c
              re2c:yyfill:enable = 0;
              re2c:define:YYCTYPE = byte;
              re2c:define:YYPEEK = "str[cursor]";
              re2c:define:YYSKIP = "cursor += 1";

              *      { return -1 }
              [\x00] { return count }
              [a-z]+ { count += 1; goto loop }
              [ ]+   { goto loop }
              */
          }

          func TestLex(t *testing.T) {
              var tests = []struct {
                  res int
                  str string
              }{
                  {0, "\000"},
                  {3, "one two three\000"},
                  {-1, "f0ur\000"},
              }

              for _, x := range tests {
                  t.Run(x.str, func(t *testing.T) {
                      res := lex(x.str)
                      if res != x.res {
                          t.Errorf("got %d, want %d", res, x.res)
                      }
                  })
              }
          }

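       To make the role of the sentinel concrete, here is a hand-written Go
       sketch of roughly the same word counter.  It is not re2c output, and
       the function name countWords is hypothetical.  It assumes a
       null-terminated input.  Note that the cursor is never compared against
       the input length: every loop stops on the terminating null at the
       latest.  (Go itself still bounds-checks string indexing at run time;
       the sketch only shows that the lexer logic needs no explicit check.)

```go
package main

// countWords is a hypothetical, hand-written stand-in for a generated lexer.
// It counts runs of lowercase letters in a null-terminated string and
// returns -1 on any unexpected byte.
func countWords(str string) int {
    cursor, count := 0, 0
    for {
        switch c := str[cursor]; {
        case c == 0:
            // Sentinel at the start of a lexeme: stop.
            return count
        case c == ' ':
            // Skip spaces; the loop ends on any other byte, including null.
            for str[cursor] == ' ' {
                cursor++
            }
        case 'a' <= c && c <= 'z':
            // Consume one word; null is outside [a-z], so it ends the word.
            for 'a' <= str[cursor] && str[cursor] <= 'z' {
                cursor++
            }
            count++
        default:
            // Any other byte is a syntax error.
            return -1
        }
    }
}
```

       Calling countWords("one two three\000") yields 3, and
       countWords("f0ur\000") yields -1, matching the test table above.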
   Sentinel character with bounds checks
       In this example the lexer uses a sentinel character with bounds checks
       to handle the end of input (this method was added in version 1.2).  The
       program counts single-quoted strings separated with spaces.  The
       sentinel character is null, which is specified with the re2c:eof = 0;
       configuration.  Null is the last character of each input string: this
       is essential to detect the end of input.  Null, as well as any other
       character, is allowed in the middle of a rule (for example,
       'aaa\0aa'\0 is valid input, but 'aaa\0 is a syntax error).  Bounds
       checks are generated in each state that has a switch on an input
       character, in the conditional branch that corresponds to null (that
       branch may also cover other characters: re2c does not split out a
       separate branch for the sentinel, because increasing the number of
       branches degrades performance more than bounds checks do).  Bounds
       checks are of the form YYLIMIT <= YYCURSOR, or YYLESSTHAN(1) with
       generic API.  If a bounds check succeeds (the position is within
       bounds), the lexer will continue matching.  If a bounds check fails,
       the lexer has reached the end of input, and it should stop.  In this
       example YYFILL is disabled with re2c:yyfill:enable = 0; and the lexer
       does not attempt to get more input (see another example that uses
       YYFILL in the YYFILL with sentinel character section).  When the end of
       input has been reached, there are three possibilities: if the lexer is
       in the initial state, it will match the end of input rule $, otherwise
       it will either fall back to a previously matched rule (including
       default rule *) or go to a default state, causing
       -Wundefined-control-flow.

          //go:generate re2go $INPUT -o $OUTPUT
          package main

          import "testing"

          // Expects a null-terminated string.
          func lex(str string) int {
              var cursor, marker int
              limit := len(str) - 1 // limit points at the terminating null
              count := 0
          loop:
              /*!re2c
              re2c:yyfill:enable = 0;
              re2c:eof = 0;
              re2c:define:YYCTYPE    = byte;
              re2c:define:YYPEEK     = "str[cursor]";
              re2c:define:YYSKIP     = "cursor += 1";
              re2c:define:YYBACKUP   = "marker = cursor";
              re2c:define:YYRESTORE  = "cursor = marker";
              re2c:define:YYLESSTHAN = "limit <= cursor";

              *                           { return -1 }
              $                           { return count }
              ['] ([^'\\] | [\\][^])* ['] { count += 1; goto loop }
              [ ]+                        { goto loop }
              */
          }

          func TestLex(t *testing.T) {
              var tests = []struct {
                  res int
                  str string
              }{
                  {0, "\000"},
                  {3, "'qu\000tes' 'are' 'fine: \\'' \000"},
                  {-1, "'unterminated\\'\000"},
              }

              for _, x := range tests {
                  t.Run(x.str, func(t *testing.T) {
                      res := lex(x.str)
                      if res != x.res {
                          t.Errorf("got %d, want %d", res, x.res)
                      }
                  })
              }
          }

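       The claim that bounds checks happen only on the sentinel character can
       be illustrated with a small hand-written Go sketch.  It is not re2c
       output; the function name scan and its return values are hypothetical.
       It assumes a non-empty, null-terminated input in which null may also
       occur as ordinary data, and it counts how many bounds checks are made.

```go
package main

// scan walks a non-empty null-terminated input and returns the number of
// bytes before the terminating null, together with the number of bounds
// checks it had to perform. The bounds check runs only when the current
// byte equals the sentinel (null).
func scan(str string) (length, checks int) {
    limit := len(str) - 1 // position of the terminating null
    cursor := 0
    for {
        if str[cursor] == 0 {
            checks++
            if limit <= cursor {
                // Out of bounds: this null is the real end of input.
                return cursor, checks
            }
            // Within bounds: treat this null as a regular character.
        }
        cursor++
    }
}
```

       For the input "ab\000cd\000" the sketch returns a length of 5 after
       only 2 bounds checks: one for the null in the middle and one for the
       terminating null.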
   Bounds checks with padding
       In this example the lexer uses bounds checking with padding to handle
       the end of input (it is the default method).  The program counts
       single-quoted strings separated with spaces.  There is a padding of
       YYMAXFILL null characters appended at the end of input, where the
       YYMAXFILL value is autogenerated with the /*!max:re2c*/ directive.  It
       is not necessary to use null for padding: any characters can be used,
       as long as they do not form a valid lexeme suffix (in this example
       padding should not contain single quotes, as they may be mistaken for a
       suffix of a single-quoted string).  There is a "stop" rule that matches
       the first padding character (null) and terminates the lexer (it returns
       success only if it has matched at the beginning of padding, otherwise a
       stray null is a syntax error).  Bounds checks are generated only in
       some states that depend on the strongly connected components of the
       underlying automaton.  They are of the form (YYLIMIT - YYCURSOR) < n,
       or YYLESSTHAN(n) with generic API, where n is the minimum number of
       characters that are needed for the lexer to proceed (it also means that
       the next bounds check will occur in at most n characters).  If a bounds
       check succeeds (at least n characters are available), the lexer will
       continue matching.  If a bounds check fails, the lexer has reached the
       end of input and will invoke YYFILL(n), which should either supply at
       least n input characters, or it should not return.  In this example
       YYFILL always fails and terminates the lexer with an error.  This is
       fine, because in this example YYFILL can only be called when the lexer
       has advanced into the padding, which means that it has encountered an
       unterminated string and should return a syntax error.  See the YYFILL
       with padding section for an example that refills the input buffer with
       YYFILL.

          //go:generate re2go $INPUT -o $OUTPUT
          package main

          import (
              "strings"
              "testing"
          )

          /*!max:re2c*/

          // Expects YYMAXFILL-padded string.
          func lex(str string) int {
              var cursor int
              limit := len(str)
              count := 0
          loop:
              /*!re2c
              re2c:define:YYCTYPE    = byte;
              re2c:define:YYPEEK     = "str[cursor]";
              re2c:define:YYSKIP     = "cursor += 1";
              re2c:define:YYLESSTHAN = "limit - cursor < @@{len}";
              re2c:define:YYFILL     = "return -1";

              * {
                  return -1
              }
              [\x00] {
                  if limit - cursor == YYMAXFILL - 1 {
                      return count
                  } else {
                      return -1
                  }
              }
              ['] ([^'\\] | [\\][^])* ['] {
                  count += 1
                  goto loop
              }
              [ ]+ {
                  goto loop
              }
              */
          }

          // Pad string with YYMAXFILL zeroes at the end.
          func pad(str string) string {
              return str + strings.Repeat("\000", YYMAXFILL)
          }

          func TestLex(t *testing.T) {
              var tests = []struct {
                  res int
                  str string
              }{
                  {0, ""},
                  {3, "'qu\000tes' 'are' 'fine: \\'' "},
                  {-1, "'unterminated\\'"},
              }

              for _, x := range tests {
                  t.Run(x.str, func(t *testing.T) {
                      res := lex(pad(x.str))
                      if res != x.res {
                          t.Errorf("got %d, want %d", res, x.res)
                      }
                  })
              }
          }

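       The condition limit - cursor == YYMAXFILL - 1 in the "stop" rule may
       look arbitrary, so here is a small Go sketch of the arithmetic.  It is
       plain Go, not re2c input.  The constant maxFill stands in for
       YYMAXFILL, and its value 3 is an assumption chosen for illustration
       (re2c computes the real value).  The helper names isStop and
       afterFirstNull are hypothetical.  When the rule fires, the cursor has
       already moved past the null it matched, so if that null was the first
       padding byte, exactly maxFill - 1 padding bytes remain before limit.

```go
package main

import "strings"

// maxFill stands in for YYMAXFILL. The value 3 is an assumption made for
// this sketch; re2c generates the real value with /*!max:re2c*/.
const maxFill = 3

// pad appends maxFill null bytes to the input.
func pad(str string) string {
    return str + strings.Repeat("\000", maxFill)
}

// isStop reports whether the null byte that was just consumed (cursor is
// already one past it) was the first byte of the padding.
func isStop(cursor, limit int) bool {
    return limit-cursor == maxFill-1
}

// afterFirstNull returns the cursor position right after the first null
// byte in a padded string (a padded string always contains one).
func afterFirstNull(padded string) int {
    return strings.IndexByte(padded, 0) + 1
}
```

       For pad("'a'") the first null is the start of the padding and isStop
       reports true.  For pad("a\000b") the first null is a stray one inside
       the input, more than maxFill - 1 bytes remain after it, and isStop
       reports false.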
   Custom methods with generic API
       In this example the lexer uses a custom end of input handling method
       based on generic API.  The program counts single-quoted strings
       separated with spaces.  It is the same as the sentinel character with
       bounds checks example, except that the input is not null-terminated (so
       this method can be used if it's not possible to have any padding at
       all, not even a single sentinel character).  To compensate for the
       absence of a sentinel character at the end of input, YYPEEK is
       redefined to perform a bounds check before it reads the next input
       character.  This is inefficient, because a check is done on every
       character read.  If the check succeeds, YYPEEK returns the real
       character, otherwise it returns a fake sentinel character.

          //go:generate re2go $INPUT -o $OUTPUT
          package main

          import "testing"

          // Returns "fake" terminating null if cursor has reached limit.
          func peek(str string, cursor int, limit int) byte {
              if cursor >= limit {
                  return 0 // fake null
              } else {
                  return str[cursor]
              }
          }

          // Expects a string without terminating null.
          func lex(str string) int {
              var cursor, marker int
              limit := len(str)
              count := 0
          loop:
              /*!re2c
              re2c:yyfill:enable = 0;
              re2c:eof = 0;
              re2c:define:YYCTYPE    = byte;
              re2c:define:YYLESSTHAN = "cursor >= limit";
              re2c:define:YYPEEK     = "peek(str, cursor, limit)";
              re2c:define:YYSKIP     = "cursor += 1";
              re2c:define:YYBACKUP   = "marker = cursor";
              re2c:define:YYRESTORE  = "cursor = marker";

              *                           { return -1 }
              $                           { return count }
              ['] ([^'\\] | [\\][^])* ['] { count += 1; goto loop }
              [ ]+                        { goto loop }
              */
          }

          func TestLex(t *testing.T) {
              var tests = []struct {
                  res int
                  str string
              }{
                  {0, ""},
                  {3, "'qu\000tes' 'are' 'fine: \\'' "},
                  {-1, "'unterminated\\'"},
              }

              for _, x := range tests {
                  t.Run(x.str, func(t *testing.T) {
                      res := lex(x.str)
                      if res != x.res {
                          t.Errorf("got %d, want %d", res, x.res)
                      }
                  })
              }
          }

BUFFER REFILLING

       The need for buffering arises when the input cannot be mapped in memory
       all at once: either it is too large, or it comes in a streaming fashion
       (like reading from a socket). The usual technique in such cases is to
       allocate a fixed-sized memory buffer and process input in chunks that
       fit into the buffer. When the current chunk is processed, it is moved
       out and new data is moved in. In practice it is somewhat more complex,
       because lexer state consists not of a single input position, but a set
       of interrelated positions:

       • cursor: the next input character to be read (YYCURSOR in default API
         or YYSKIP/YYPEEK in generic API)

       • limit: the position after the last available input character (YYLIMIT
         in default API, implicitly handled by YYLESSTHAN in generic API)

       • marker: the position of the most recent match, if any (YYMARKER in
         default API or YYBACKUP/YYRESTORE in generic API)

       • token: the start of the current lexeme (implicit in re2c API, as it
         is not needed for the normal lexer operation and can be defined and
         updated by the user)

       • context marker: the position of the trailing context (YYCTXMARKER in
         default API or YYBACKUPCTX/YYRESTORECTX in generic API)

       • tag variables: submatch positions (defined with /*!stags:re2c*/ and
         /*!mtags:re2c*/ directives and YYSTAGP/YYSTAGN/YYMTAGP/YYMTAGN in
         generic API)

       Not all these are used in every case, but if used, they must be updated
       by YYFILL. All active positions are contained in the segment between
       token and cursor, therefore everything between buffer start and token
       can be discarded, the segment from token up to limit should be moved to
       the beginning of buffer, and the free space at the end of buffer should
       be filled with new data. In order to avoid frequent YYFILL calls it is
       best to fill in as many input characters as possible (even though fewer
       characters might suffice to resume the lexer). The details of YYFILL
       implementation are slightly different depending on which EOF handling
       method is used: the case of EOF rule is somewhat simpler than the case
       of bounds-checking with padding. Also note that if the -f
       --storable-state option is used, YYFILL has slightly different
       semantics (described in the section about storable state).

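       The shifting step described above can be sketched in Go as follows.
       This is a minimal illustration and not part of the re2c API: the type
       Buffer and the function shift are hypothetical names, and only the
       token, marker, cursor and limit positions are handled (a context
       marker or tag variables, if used, would have to be rebased the same
       way).

```go
package main

// Buffer is a hypothetical container for the input buffer and the input
// positions that are stored as offsets into it.
type Buffer struct {
    data                         []byte
    token, marker, cursor, limit int
}

// shift discards data[0:token], moves data[token:limit] to the start of
// the buffer and rebases all positions by the same amount. It returns the
// number of free bytes at the end of the buffer that can be refilled.
func shift(b *Buffer) int {
    // The built-in copy handles overlapping source and destination.
    copy(b.data, b.data[b.token:b.limit])
    b.marker -= b.token
    b.cursor -= b.token
    b.limit -= b.token
    b.token = 0
    return len(b.data) - b.limit
}
```

       After the call, new input can be read into b.data[b.limit:], and limit
       should then be advanced by the number of bytes actually read.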
   YYFILL with sentinel character
       If EOF rule is used, YYFILL is a function-like primitive that accepts
       no arguments and returns a value which is checked against zero. YYFILL
       invocation is triggered by condition YYLIMIT <= YYCURSOR in default API
       and YYLESSTHAN() in generic API. A non-zero return value means that
       YYFILL has failed. A successful YYFILL call must supply at least one
       character and adjust input positions accordingly. Limit must always be
       set to one after the last input position in buffer, and the character
       at the limit position must be the sentinel symbol specified by the
       re2c:eof configuration. The pictures below show the relative locations
       of input positions in buffer before and after YYFILL call (sentinel
       symbol is marked with #, and the second picture shows the case when
       there is not enough input to fill the whole buffer).

                         <-- shift -->
                       >-A------------B---------C-------------D#-----------E->
                       buffer       token    marker         limit,
                                                            cursor
          >-A------------B---------C-------------D------------E#->
                       buffer,  marker        cursor        limit
                       token

                         <-- shift -->
                       >-A------------B---------C-------------D#--E (EOF)
                       buffer       token    marker         limit,
                                                            cursor
          >-A------------B---------C-------------D---E#........
                       buffer,  marker       cursor limit
                       token

       Here is an example of a program that reads input file input.txt in
       chunks of SIZE bytes (16 in this example, intentionally small to
       trigger buffer refills) and uses EOF rule.

          //go:generate re2go $INPUT -o $OUTPUT
          package main

          import (
              "os"
              "testing"
          )

          // Intentionally small to trigger buffer refill.
          const SIZE int = 16

          type Input struct {
              file   *os.File
              data   []byte
              cursor int
              marker int
              token  int
              limit  int
              eof    bool
          }

          func fill(in *Input) int {
              // If nothing can be read, fail.
              if in.eof {
                  return 1
              }

              // Check if at least some space can be freed.
              if in.token == 0 {
                  // In real life can reallocate a larger buffer.
                  panic("fill error: lexeme too long")
              }

              // Discard everything up to the start of the current lexeme,
              // shift buffer contents and adjust offsets.
              copy(in.data[0:], in.data[in.token:in.limit])
              in.cursor -= in.token
              in.marker -= in.token
              in.limit -= in.token
              in.token = 0

              // Read new data (as much as possible to fill the buffer).
              n, _ := in.file.Read(in.data[in.limit:SIZE])
              in.limit += n
              in.data[in.limit] = 0

              // If read less than expected, this is the end of input.
              in.eof = in.limit < SIZE

              // If nothing has been read, fail.
              if n == 0 {
                  return 1
              }

              return 0
          }

          func lex(in *Input) int {
              count := 0
          loop:
              in.token = in.cursor
              /*!re2c
              re2c:eof = 0;
              re2c:define:YYCTYPE    = byte;
              re2c:define:YYPEEK     = "in.data[in.cursor]";
              re2c:define:YYSKIP     = "in.cursor += 1";
              re2c:define:YYBACKUP   = "in.marker = in.cursor";
              re2c:define:YYRESTORE  = "in.cursor = in.marker";
              re2c:define:YYLESSTHAN = "in.limit <= in.cursor";
              re2c:define:YYFILL     = "fill(in) == 0";

              *                           { return -1 }
              $                           { return count }
              ['] ([^'\\] | [\\][^])* ['] { count += 1; goto loop }
              [ ]+                        { goto loop }
              */
          }

          // Prepare a file with the input text and run the lexer.
          func test(data string) (result int) {
              tmpfile := "input.txt"

              f, _ := os.Create(tmpfile)
              f.WriteString(data)
              f.Seek(0, 0)

              defer func() {
                  if r := recover(); r != nil {
                      result = -2
                  }
                  f.Close()
                  os.Remove(tmpfile)
              }()

              in := &Input{
                  file:   f,
                  data:   make([]byte, SIZE+1),
                  cursor: SIZE,
                  marker: SIZE,
                  token:  SIZE,
                  limit:  SIZE,
                  eof:    false,
              }

              return lex(in)
          }

          func TestLex(t *testing.T) {
              var tests = []struct {
                  res int
                  str string
              }{
                  {0, ""},
                  {2, "'one' 'two'"},
                  {3, "'qu\000tes' 'are' 'fine: \\'' "},
                  {-1, "'unterminated\\'"},
                  {-2, "'loooooooooooong'"},
              }

              for _, x := range tests {
                  t.Run(x.str, func(t *testing.T) {
                      res := test(x.str)
                      if res != x.res {
                          t.Errorf("got %d, want %d", res, x.res)
                      }
                  })
              }
          }

   YYFILL with padding
       In the default case (when EOF rule is not used) YYFILL is a
       function-like primitive that accepts a single argument and does not
       return any value. YYFILL invocation is triggered by condition (YYLIMIT
       - YYCURSOR) < n in default API and YYLESSTHAN(n) in generic API. The
       argument passed to YYFILL is the minimal number of characters that must
       be supplied. If it fails to do so, YYFILL must not return to the lexer
       (for that reason it is best implemented as a macro that returns from
       the calling function on failure). In case of a successful YYFILL
       invocation the limit position must be set either to one after the last
       input position in buffer, or to the end of YYMAXFILL padding (in case
       YYFILL has successfully read at least n characters, but not enough to
       fill the entire buffer). The pictures below show the relative locations
       of input positions in buffer before and after YYFILL invocation
       (YYMAXFILL padding on the second picture is marked with # symbols).

                         <-- shift -->                 <-- need -->
                       >-A------------B---------C-----D-------E---F--------G->
                       buffer       token    marker cursor  limit

          >-A------------B---------C-----D-------E---F--------G->
                       buffer,  marker cursor               limit
                       token

                         <-- shift -->                 <-- need -->
                       >-A------------B---------C-----D-------E-F        (EOF)
                       buffer       token    marker cursor  limit

          >-A------------B---------C-----D-------E-F###############
                       buffer,  marker cursor                   limit
                       token                        <- YYMAXFILL ->

       Here is an example of a program that reads input file input.txt in
       chunks of SIZE bytes (16 in this example, intentionally small to
       trigger buffer refills) and uses bounds-checking with padding.

          //go:generate re2go $INPUT -o $OUTPUT
          package main

          import (
              "fmt"
              "os"
              "testing"
          )

          /*!max:re2c*/

          // Intentionally small to trigger buffer refill.
          const SIZE int = 16

          type Input struct {
              file   *os.File
              data   []byte
              cursor int
              marker int
              token  int
              limit  int
              eof    bool
          }

          func fill(in *Input, need int) int {
              // End of input has already been reached, nothing to do.
              if in.eof {
                  return -1 // Error: unexpected EOF
              }

              // Check if after moving the current lexeme to the beginning
              // of buffer there will be enough free space.
              if SIZE-(in.cursor-in.token) < need {
                  return -2 // Error: lexeme too long
              }

              // Discard everything up to the start of the current lexeme,
              // shift buffer contents and adjust offsets.
              copy(in.data[0:], in.data[in.token:in.limit])
              in.cursor -= in.token
              in.marker -= in.token
              in.limit -= in.token
              in.token = 0

              // Read new data (as much as possible to fill the buffer).
              n, _ := in.file.Read(in.data[in.limit:SIZE])
              in.limit += n

              // If read less than expected, this is the end of input.
              in.eof = in.limit < SIZE

              // If end of input, add padding so that the lexer can read
              // the remaining characters at the end of buffer.
              if in.eof {
                  for i := 0; i < YYMAXFILL; i += 1 {
                      in.data[in.limit+i] = 0
                  }
                  in.limit += YYMAXFILL
              }

              return 0
          }

          func lex(in *Input) int {
              count := 0
          loop:
              in.token = in.cursor
              /*!re2c
              re2c:define:YYCTYPE    = byte;
              re2c:define:YYPEEK     = "in.data[in.cursor]";
              re2c:define:YYSKIP     = "in.cursor += 1";
              re2c:define:YYBACKUP   = "in.marker = in.cursor";
              re2c:define:YYRESTORE  = "in.cursor = in.marker";
              re2c:define:YYLESSTHAN = "in.limit-in.cursor < @@{len}";
              re2c:define:YYFILL     = "if r := fill(in, @@{len}); r != 0 { return r }";

              * {
                  return -1
              }
              [\x00] {
                  if in.limit - in.cursor == YYMAXFILL - 1 {
                      return count
                  } else {
                      return -1
                  }
              }
              ['] ([^'\\] | [\\][^])* ['] {
                  count += 1
                  goto loop
              }
              [ ]+ {
                  goto loop
              }
              */
          }

          // Prepare a file with the input text and run the lexer.
          func test(data string) (result int) {
              tmpfile := "input.txt"
2026
2027              f, _ := os.Create(tmpfile)
2028              f.WriteString(data)
2029              f.Seek(0, 0)
2030
2031              defer func() {
2032                  if r := recover(); r != nil {
2033                      fmt.Println(r)
2034                      result = -2
2035                  }
2036                  f.Close()
2037                  os.Remove(tmpfile)
2038              }()
2039
2040              in := &Input{
2041                  file:   f,
2042                  data:   make([]byte, SIZE+YYMAXFILL),
2043                  cursor: SIZE,
2044                  marker: SIZE,
2045                  token:  SIZE,
2046                  limit:  SIZE,
2047                  eof:    false,
2048              }
2049
2050              return lex(in)
2051          }
2052
2053          func TestLex(t *testing.T) {
2054              var tests = []struct {
2055                  res int
2056                  str string
2057              }{
2058                  {0, ""},
2059                  {2, "'one' 'two'"},
2060                  {3, "'qu\000tes' 'are' 'fine: \\'' "},
2061                  {-1, "'unterminated\\'"},
2062                  {-2, "'loooooooooooong'"},
2063              }
2064
2065              for _, x := range tests {
2066                  t.Run(x.str, func(t *testing.T) {
2067                      res := test(x.str)
2068                      if res != x.res {
2069                          t.Errorf("got %d, want %d", res, x.res)
2070                      }
2071                  })
2072              }
2073          }
2074
2075

INCLUDE FILES

       re2c allows one to include other files using the directive
       /*!include:re2c FILE */, where FILE is the name of the file to be
       included. re2c looks for included files in the directory of the
       including file and in include locations, which can be specified with
       the -I option. Include directives in re2c work in the same way as C/C++
       #include: the contents of FILE are copy-pasted verbatim in place of the
       directive. Include files may have further includes of their own. Use
       the --depfile option to track build dependencies of the output file on
       include files. re2c provides some predefined include files that can be
       found in the include/ subdirectory of the project. These files contain
       definitions that can be useful to other projects (such as Unicode
       categories) and form something like a standard library for re2c. Below
       is an example of using the include directive.

   Include file (definitions.go)
          const (
              ResultOk = iota
              ResultFail
          )

          /*!re2c
              number = [1-9][0-9]*;
          */

   Input file
          //go:generate re2go -c $INPUT -o $OUTPUT -i
          package main

          import "testing"
          /*!include:re2c "definitions.go" */

          func lex(str string) int {
              var cursor int
              /*!re2c
              re2c:yyfill:enable  = 0;
              re2c:define:YYCTYPE = byte;
              re2c:define:YYPEEK  = "str[cursor]";
              re2c:define:YYSKIP  = "cursor += 1";

              number { return ResultOk }
              *      { return ResultFail }
              */
          }

          func TestLex(t *testing.T) {
              if lex("123\000") != ResultOk {
                  t.Errorf("error")
              }
          }

HEADER FILES

       re2c allows one to generate a header file from the input .re file using
       the option -t, --type-header or the configuration re2c:flags:type-header
       together with the directives /*!header:re2c:on*/ and
       /*!header:re2c:off*/. The first directive marks the beginning of the
       header file, and the second directive marks the end of it. Everything
       between these directives is processed by re2c, and the generated code
       is written to the file specified by the -t --type-header option (or to
       stdout if this option was not used). An autogenerated header file may
       be needed in cases when re2c is used to generate definitions of
       constants, variables and structs that must be visible from other
       translation units.

       Here is an example of generating a header file that contains the
       definition of the lexer state with tag variables (the number of
       variables depends on the regular grammar and is unknown to the
       programmer).

   Input file
          //go:generate re2go $INPUT -o $OUTPUT -i --type-header src/lexer/lexer.go
          package main

          import (
              "lexer" // generated by re2c
              "testing"
          )

          /*!header:re2c:on*/
          package lexer

          type State struct {
              Data string
              Cur, Mar, /*!stags:re2c format="@@{tag}"; separator=", "; */ int
          }
          /*!header:re2c:off*/

          func lex(st *lexer.State) int {
              /*!re2c
              re2c:flags:type-header = "src/lexer/lexer.go";
              re2c:yyfill:enable = 0;
              re2c:flags:tags = 1;
              re2c:define:YYCTYPE      = byte;
              re2c:define:YYPEEK       = "st.Data[st.Cur]";
              re2c:define:YYSKIP       = "st.Cur++";
              re2c:define:YYBACKUP     = "st.Mar = st.Cur";
              re2c:define:YYRESTORE    = "st.Cur = st.Mar";
              re2c:define:YYRESTORETAG = "st.Cur = @@{tag}";
              re2c:define:YYSTAGP      = "@@{tag} = st.Cur";
              re2c:tags:expression     = "st.@@{tag}";
              re2c:tags:prefix         = "Tag";

              [x]{1,4} / [x]{3,5} { return 0 } // ambiguous trailing context
              *                   { return 1 }
              */
          }

          func TestLex(t *testing.T) {
              st := &lexer.State{
                  Data: "xxxxxxxx\x00",
              }
              if !(lex(st) == 0 && st.Cur == 4) {
                  t.Error("failed")
              }
          }

   Header file
          // Code generated by re2c, DO NOT EDIT.

          package lexer

          type State struct {
              Data string
              Cur, Mar, Tag1, Tag2, Tag3 int
          }

SUBMATCH EXTRACTION

       re2c has two options for submatch extraction.

       The first option is -T --tags. With this option one can use standalone
       tags of the form @stag and #mtag, where stag and mtag are arbitrary
       user-defined names. Tags can be used anywhere inside of a regular
       expression; semantically they are just position markers. Tags of the
       form @stag are called s-tags: they denote a single submatch value (the
       last input position where this tag matched). Tags of the form #mtag are
       called m-tags: they denote multiple submatch values (the whole history
       of repetitions of this tag). All tags should be defined by the user as
       variables with the corresponding names. With standalone tags re2c uses
       leftmost greedy disambiguation: submatch positions correspond to the
       leftmost matching path through the regular expression.

       The second option is -P --posix-captures: it enables POSIX-compliant
       capturing groups. In this mode parentheses in regular expressions
       denote the beginning and the end of capturing groups; the whole regular
       expression is group number zero. The number of groups for the matching
       rule is stored in the variable yynmatch, and submatch results are stored
       in the yypmatch array. Both yynmatch and yypmatch should be defined by
       the user, and the size of yypmatch must be at least [yynmatch * 2]. re2c
       provides a directive /*!maxnmatch:re2c*/ that defines YYMAXNMATCH: a
       constant equal to the maximal value of yynmatch among all rules. Note
       that re2c implements POSIX-compliant disambiguation: each subexpression
       matches as long as possible, and subexpressions that start earlier in
       the regular expression have priority over those starting later.
       Capturing groups are translated into s-tags under the hood, therefore
       we use the word "tag" to describe them as well.
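
       The layout of the yypmatch array can be illustrated with a short
       sketch. The helper below is hypothetical (it is not generated by re2c)
       and assumes that group i occupies the pair yypmatch[2*i],
       yypmatch[2*i+1], which is how the POSIX example later in this section
       reads it, and that a group which did not take part in the match holds
       -1 in both slots.

```go
package main

import "fmt"

// submatches is a hypothetical helper: it turns the flat yypmatch array
// into substrings of str. Group i spans yypmatch[2*i] to yypmatch[2*i+1].
// Assumption: unmatched groups hold -1 and are reported as empty strings.
func submatches(str string, yypmatch []int, yynmatch int) []string {
	groups := make([]string, 0, yynmatch)
	for i := 0; i < yynmatch; i++ {
		start, end := yypmatch[2*i], yypmatch[2*i+1]
		if start < 0 || end < 0 {
			groups = append(groups, "")
			continue
		}
		groups = append(groups, str[start:end])
	}
	return groups
}

func main() {
	// Offsets as a lexer for (octet) dot (octet) dot (octet) dot (octet) end
	// could leave them; they are written by hand for this illustration.
	str := "1.2.3.4\x00"
	yypmatch := []int{0, 8, 0, 1, 2, 3, 4, 5, 6, 7}
	fmt.Printf("%q\n", submatches(str, yypmatch, 5))
}
```

       Group zero covers the whole match, so an array for a rule with four
       parenthesized groups needs ten slots.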

       With both the -P --posix-captures and -T --tags options re2c uses the
       efficient submatch extraction algorithm described in the Tagged
       Deterministic Finite Automata with Lookahead paper. The overhead of
       submatch extraction in the generated lexer grows with the number of
       tags; if this number is moderate, the overhead is barely noticeable. In
       the lexer, tags are implemented using a number of tag variables
       generated by re2c. There is no one-to-one correspondence between tag
       variables and tags: a single variable may be reused for different tags,
       and one tag may require multiple variables to hold all its ambiguous
       values. Eventually ambiguity is resolved, and only one final variable
       per tag survives. When a rule matches, all its tags are set to the
       values of the corresponding tag variables. The exact number of tag
       variables is unknown to the user; this number is determined by re2c.
       However, tag variables should be defined by the user as a part of the
       lexer state and updated by YYFILL, therefore re2c provides the
       directives /*!stags:re2c*/ and /*!mtags:re2c*/ that can be used to
       declare, initialize and manipulate tag variables. These directives have
       two optional configurations: format = "@@"; (specifies the template
       where @@ is substituted with the name of each tag variable), and
       separator = ""; (specifies the piece of code used to join the generated
       pieces for different tag variables).
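
       The effect of format and separator can be modelled in a few lines. The
       function below is a hypothetical model of the expansion, not the code
       re2c uses, and the tag variable names in it are made up (re2c chooses
       the real ones).

```go
package main

import (
	"fmt"
	"strings"
)

// expandTags is a hypothetical model of how the output of a
// /*!stags:re2c ... */ directive is assembled: every "@@" in format is
// replaced with a tag variable name, and the pieces are joined with
// separator.
func expandTags(format, separator string, vars []string) string {
	pieces := make([]string, len(vars))
	for i, v := range vars {
		pieces[i] = strings.ReplaceAll(format, "@@", v)
	}
	return strings.Join(pieces, separator)
}

func main() {
	// Made-up variable names; the real ones are generated by re2c.
	vars := []string{"yyt1", "yyt2", "yyt3"}

	// Declaring tag variables.
	fmt.Println(expandTags("var @@ int", "\n", vars))

	// Shifting tag variables when the buffer is refilled.
	fmt.Println(expandTags("if in.@@ != -1 { in.@@ -= in.token }", "\n", vars))
}
```

       The same directive can therefore be used several times with different
       templates: once to declare the variables and again to update them.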

       S-tags support the following operations:

       • save input position to an s-tag: t = YYCURSOR with default API or a
         user-defined operation YYSTAGP(t) with generic API

       • save default value to an s-tag: t = NULL with default API or a
         user-defined operation YYSTAGN(t) with generic API

       • copy one s-tag to another: t1 = t2

       M-tags support the following operations:

       • append input position to an m-tag: a user-defined operation
         YYMTAGP(t) with both default and generic API

       • append default value to an m-tag: a user-defined operation YYMTAGN(t)
         with both default and generic API

       • copy one m-tag to another: t1 = t2

       S-tags can be implemented as scalar values (pointers or offsets).
       M-tags need a more complex representation, as they need to store a
       sequence of tag values. The most naive and inefficient representation
       of an m-tag is a list (array, vector) of tag values; a more efficient
       representation is to store all m-tags in a prefix tree represented as
       an array of nodes (v, p), where v is the tag value and p is a pointer
       to the parent node.
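
       A minimal sketch of the prefix-tree representation follows. The type
       and function names are chosen for illustration only, and the parent
       pointer p is stored as an index into a slice. The point is that copying
       an m-tag copies a single index, and histories with a common prefix
       share nodes.

```go
package main

import "fmt"

// node is one entry of the prefix tree: v is a tag value and p is the
// index of the parent node (root marks the empty history).
type node struct{ v, p int }

const root = -1

// appendValue adds value v to the history that ends at node index tag and
// returns the index of the new node, which identifies the longer history.
func appendValue(trie *[]node, tag, v int) int {
	*trie = append(*trie, node{v, tag})
	return len(*trie) - 1
}

// history follows parent links from tag to the root and returns the
// values in the order in which they were appended.
func history(trie []node, tag int) []int {
	var vals []int
	for ; tag != root; tag = trie[tag].p {
		vals = append(vals, trie[tag].v)
	}
	for i, j := 0, len(vals)-1; i < j; i, j = i+1, j-1 {
		vals[i], vals[j] = vals[j], vals[i]
	}
	return vals
}

func main() {
	var trie []node
	a := appendValue(&trie, root, 0)
	a = appendValue(&trie, a, 4)

	// Copying an m-tag copies only an index; both histories share nodes.
	b := a
	a = appendValue(&trie, a, 8)
	b = appendValue(&trie, b, 9)

	fmt.Println(history(trie, a), history(trie, b), len(trie))
}
```

       The two histories hold three values each, yet the tree has only four
       nodes, because the first two values are stored once.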

       Here is a simple example of using s-tags to parse an IPv4 address (see
       below for a more complex example that uses YYFILL).

          //go:generate re2go $INPUT -o $OUTPUT
          package main

          import (
              "errors"
              "testing"
          )

          var eBadIP error = errors.New("bad IP")

          func lex(str string) (int, error) {
              var cursor, marker, o1, o2, o3, o4 int
              /*!stags:re2c format = 'var @@ int'; separator = "\n\t"; */

              num := func(pos int, end int) int {
                  n := 0
                  for ; pos < end; pos++ {
                      n = n*10 + int(str[pos]-'0')
                  }
                  return n
              }

              /*!re2c
              re2c:flags:tags = 1;
              re2c:yyfill:enable = 0;
              re2c:define:YYCTYPE   = byte;
              re2c:define:YYPEEK    = "str[cursor]";
              re2c:define:YYSKIP    = "cursor += 1";
              re2c:define:YYBACKUP  = "marker = cursor";
              re2c:define:YYRESTORE = "cursor = marker";
              re2c:define:YYSTAGP   = "@@{tag} = cursor";
              re2c:define:YYSTAGN   = "@@{tag} = -1";

              octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
              dot = [.];
              end = [\x00];

              @o1 octet dot @o2 octet dot @o3 octet dot @o4 octet end {
                  return num(o4, cursor-1)+
                      (num(o3, o4-1) << 8)+
                      (num(o2, o3-1) << 16)+
                      (num(o1, o2-1) << 24), nil
              }
              * { return 0, eBadIP }
              */
          }

          func TestLex(t *testing.T) {
              var tests = []struct {
                  str string
                  res int
                  err error
              }{
                  {"1.2.3.4\000", 0x01020304, nil},
                  {"127.0.0.1\000", 0x7f000001, nil},
                  {"255.255.255.255\000", 0xffffffff, nil},
                  {"1.2.3.\000", 0, eBadIP},
                  {"1.2.3.256\000", 0, eBadIP},
              }

              for _, x := range tests {
                  t.Run(x.str, func(t *testing.T) {
                      res, err := lex(x.str)
                      if !(res == x.res && err == x.err) {
                          t.Errorf("got %d, want %d", res, x.res)
                      }
                  })
              }
          }

       Here is a more complex example of using s-tags with YYFILL to parse a
       file with IPv4 addresses. Tag variables are part of the lexer state,
       and they are adjusted in YYFILL like other input positions. This is
       necessary for s-tags because their values are invalidated after
       shifting buffer contents. It may not be necessary in a custom
       implementation where tag variables store offsets relative to the start
       of the input string rather than the buffer, which may be the case with
       m-tags.
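
       The alternative mentioned above, with offsets counted from the start of
       the input, can be sketched as follows. This is an illustration under
       stated assumptions rather than code produced by re2c: the window type
       and its methods are hypothetical, and the sketch only shows why such
       offsets survive a buffer shift.

```go
package main

import "fmt"

// window is a hypothetical buffer that remembers how many bytes of the
// input were discarded before data[0].
type window struct {
	data []byte
	base int // absolute input offset of data[0]
}

// discard drops the first n bytes of the buffer, as a refill would do
// when it shifts the buffer contents.
func (w *window) discard(n int) {
	w.data = w.data[n:]
	w.base += n
}

// at returns the byte at an absolute input offset. The offset must not
// point into the part of the input that was already discarded.
func (w *window) at(abs int) byte {
	return w.data[abs-w.base]
}

func main() {
	w := &window{data: []byte("10.0.0.1\n10.0.0.2\n")}
	tag := 9 // absolute offset of the second address

	w.discard(9) // drop the first line, as if it had been processed

	// The tag was not adjusted, yet it still refers to the same byte.
	fmt.Println(w.base, string(w.at(tag)))
}
```

       The price is one subtraction on every access, and offsets that point
       before the current start of the buffer can no longer be dereferenced.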

          //go:generate re2go $INPUT -o $OUTPUT --tags
          package main

          import (
              "fmt"
              "os"
              "reflect"
              "testing"
          )

          const SIZE int = 4096

          type Input struct {
              file   *os.File
              data   []byte
              cursor int
              marker int
              token  int
              limit  int
              // Tag variables must be part of the lexer state passed to YYFILL.
              // They don't correspond to tags and should be autogenerated by re2c.
              /*!stags:re2c format = "@@ int"; separator= "\n\t"; */
              eof    bool
          }

          func fill(in *Input) int {
              // If nothing can be read, fail.
              if in.eof {
                  return 1
              }

              // Check if at least some space can be freed.
              if in.token == 0 {
                  // In real life can reallocate a larger buffer.
                  panic("fill error: lexeme too long")
              }

              // Discard everything up to the start of the current lexeme,
              // shift buffer contents and adjust offsets.
              copy(in.data[0:], in.data[in.token:in.limit])
              in.cursor -= in.token
              in.marker -= in.token
              in.limit -= in.token
              // Tag variables need to be shifted like other input positions. The
              // check for -1 is only needed if some tags are nested inside of
              // alternative or repetition, so that they can have -1 value.
              /*!stags:re2c
                  format = "if in.@@ != -1 { in.@@ -= in.token }";
                  separator= "\n\t";
              */
              in.token = 0

              // Read new data (as much as possible to fill the buffer).
              n, _ := in.file.Read(in.data[in.limit:SIZE])
              in.limit += n
              in.data[in.limit] = 0

              // If read less than expected, this is the end of input.
              in.eof = in.limit < SIZE

              // If nothing has been read, fail.
              if n == 0 {
                  return 1
              }

              return 0
          }

          func lex(in *Input) []int {
              // User-defined local variables that store final tag values. They are
              // different from tag variables autogenerated with /*!stags:re2c*/, as
              // they are set at the end of match and used only in semantic actions.
              var o1, o2, o3, o4 int
              var ips []int

              num := func(pos int, end int) int {
                  n := 0
                  for ; pos < end; pos++ {
                      n = n*10 + int(in.data[pos]-'0')
                  }
                  return n
              }

          loop:
              in.token = in.cursor
              /*!re2c
              re2c:eof = 0;
              re2c:define:YYCTYPE    = byte;
              re2c:define:YYPEEK     = "in.data[in.cursor]";
              re2c:define:YYSKIP     = "in.cursor += 1";
              re2c:define:YYBACKUP   = "in.marker = in.cursor";
              re2c:define:YYRESTORE  = "in.cursor = in.marker";
              re2c:define:YYLESSTHAN = "in.limit <= in.cursor";
              re2c:define:YYFILL     = "fill(in) == 0";
              re2c:define:YYSTAGP    = "@@{tag} = in.cursor";
              re2c:define:YYSTAGN    = "@@{tag} = -1";

              // The way tag variables are accessed from the lexer (not needed if tag
              // variables are defined as local variables).
              re2c:tags:expression = "in.@@";

              octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
              dot = [.];
              eol = [\n];

              @o1 octet dot @o2 octet dot @o3 octet dot @o4 octet eol {
                  ips = append(ips, num(o4, in.cursor-1)+
                      (num(o3, o4-1) << 8)+
                      (num(o2, o3-1) << 16)+
                      (num(o1, o2-1) << 24))
                  goto loop
              }
              $ { return ips }
              * { return nil }
              */
          }

          func TestLex(t *testing.T) {
              tmpfile := "input.txt"
              var want, have []int

              // Write a few IPv4 addresses to the input file and save them to compare
              // against parse results.
              f, _ := os.Create(tmpfile)
              for i := 0; i < 256; i++ {
                  fmt.Fprintf(f, "%d.%d.%d.%d\n", i, i, i, i)
                  want = append(want, i + (i<<8) + (i<<16) + (i<<24))
              }
              f.Seek(0, 0)

              defer func() {
                  if r := recover(); r != nil {
                      have = nil
                  }
                  f.Close()
                  os.Remove(tmpfile)
              }()

              in := &Input{
                  file:   f,
                  data:   make([]byte, SIZE+1),
                  cursor: SIZE,
                  marker: SIZE,
                  token:  SIZE,
                  limit:  SIZE,
                  eof:    false,
              }

              have = lex(in)

              if !reflect.DeepEqual(have, want) {
                  t.Errorf("have %d, want %d", have, want)
              }
          }

       Here is an example of using POSIX capturing groups to parse an IPv4
       address.

          //go:generate re2go $INPUT -o $OUTPUT
          package main

          import (
              "errors"
              "testing"
          )

          /*!maxnmatch:re2c*/

          var eBadIP error = errors.New("bad IP")

          func lex(str string) (int, error) {
              var cursor, marker, yynmatch int
              yypmatch := make([]int, YYMAXNMATCH*2)
              /*!stags:re2c format = 'var @@ int'; separator = "\n\t"; */

              num := func(pos int, end int) int {
                  n := 0
                  for ; pos < end; pos++ {
                      n = n*10 + int(str[pos]-'0')
                  }
                  return n
              }

              /*!re2c
              re2c:flags:posix-captures = 1;
              re2c:yyfill:enable = 0;
              re2c:define:YYCTYPE     = byte;
              re2c:define:YYPEEK      = "str[cursor]";
              re2c:define:YYSKIP      = "cursor += 1";
              re2c:define:YYBACKUP    = "marker = cursor";
              re2c:define:YYRESTORE   = "cursor = marker";
              re2c:define:YYSTAGP     = "@@{tag} = cursor";
              re2c:define:YYSTAGN     = "@@{tag} = -1";
              re2c:define:YYSHIFTSTAG = "@@{tag} += @@{shift}";

              octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
              dot = [.];
              end = [\x00];

              (octet) dot (octet) dot (octet) dot (octet) end {
                  if yynmatch != 5 {
                      panic("expected 5 submatch groups")
                  }
                  return num(yypmatch[8], yypmatch[9])+
                      (num(yypmatch[6], yypmatch[7]) << 8)+
                      (num(yypmatch[4], yypmatch[5]) << 16)+
                      (num(yypmatch[2], yypmatch[3]) << 24), nil
              }
              * { return 0, eBadIP }
              */
          }

          func TestLex(t *testing.T) {
              var tests = []struct {
                  str string
                  res int
                  err error
              }{
                  {"1.2.3.4\000", 0x01020304, nil},
                  {"127.0.0.1\000", 0x7f000001, nil},
                  {"255.255.255.255\000", 0xffffffff, nil},
                  {"1.2.3.\000", 0, eBadIP},
                  {"1.2.3.256\000", 0, eBadIP},
              }

              for _, x := range tests {
                  t.Run(x.str, func(t *testing.T) {
                      res, err := lex(x.str)
                      if !(res == x.res && err == x.err) {
                          t.Errorf("got %d, want %d", res, x.res)
                      }
                  })
              }
          }

       Here is an example of using m-tags to parse a semicolon-separated
       sequence of words. Tag variables are stored in a tree that is packed
       in a slice.

          //go:generate re2go $INPUT -o $OUTPUT
          package main

          import (
              "reflect"
              "testing"
          )

          const (
              mtagRoot int = -1
              mtagNil int = -2
          )

          type mtagElem struct {
              val  int
              pred int
          }

          type mtagTrie = []mtagElem

          func createTrie(capacity int) mtagTrie {
              return make([]mtagElem, 0, capacity)
          }

          func mtag(trie *mtagTrie, tag int, val int) int {
              *trie = append(*trie, mtagElem{val, tag})
              return len(*trie) - 1
          }

          // Recursively unwind both tag histories and construct submatches.
          func unwind(trie mtagTrie, x int, y int, str string) []string {
              if x == mtagRoot && y == mtagRoot {
                  return []string{}
              } else if x == mtagRoot || y == mtagRoot {
                  panic("tag histories have different length")
              } else {
                  xval := trie[x].val
                  yval := trie[y].val
                  ss := unwind(trie, trie[x].pred, trie[y].pred, str)

                  // Either both tags should be nil, or none of them.
                  if xval == mtagNil && yval == mtagNil {
                      return ss
                  } else if xval == mtagNil || yval == mtagNil {
                      panic("tag histories positive/negative tag mismatch")
                  } else {
                      s := str[xval:yval]
                      return append(ss, s)
                  }
              }
          }

          func lex(str string) []string {
              var cursor, marker int
              trie := createTrie(256)
              x := mtagRoot
              y := mtagRoot
              /*!mtags:re2c format = "@@ := mtagRoot"; separator = "\n\t"; */

              /*!re2c
              re2c:flags:tags = 1;
              re2c:yyfill:enable = 0;
              re2c:define:YYCTYPE   = byte;
              re2c:define:YYPEEK    = "str[cursor]";
              re2c:define:YYSKIP    = "cursor += 1";
              re2c:define:YYBACKUP  = "marker = cursor";
              re2c:define:YYRESTORE = "cursor = marker";
              re2c:define:YYMTAGP   = "@@{tag} = mtag(&trie, @@{tag}, cursor)";
              re2c:define:YYMTAGN   = "@@{tag} = mtag(&trie, @@{tag}, mtagNil)";

              end = [\x00];

              (#x [a-z]+ #y [;])* end { return unwind(trie, x, y, str) }
              *                       { return nil }
              */
          }

          func TestLex(t *testing.T) {
              var tests = []struct {
                  str string
                  res []string
              }{
                  {"\000", []string{}},
                  {"one;two;three;\000", []string{"one", "two", "three"}},
                  {"one;two\000", nil},
              }

              for _, x := range tests {
                  t.Run(x.str, func(t *testing.T) {
                      res := lex(x.str)
                      if !reflect.DeepEqual(res, x.res) {
                          t.Errorf("got %v, want %v", res, x.res)
                      }
                  })
              }
          }

STORABLE STATE

       With the -f --storable-state option re2c generates a lexer that can
       store its current state, return to the caller, and later resume
       operations exactly where it left off. The default mode of operation in
       re2c is a "pull" model, in which the lexer "pulls" more input whenever
       it needs it. This may be unacceptable in cases when the input becomes
       available piece by piece (for example, if the lexer is invoked by the
       parser, or if the lexer program communicates via a socket protocol with
       some other program that must wait for a reply from the lexer before it
       transmits the next message). The storable state feature is intended
       exactly for such cases: it allows one to generate lexers that work in a
       "push" model. When the lexer needs more input, it stores its state and
       returns to the caller. Later, when more input becomes available, the
       caller resumes the lexer exactly where it stopped. There are a few
       changes necessary compared to the "pull" model:

       • Define the YYGETSTATE() and YYSETSTATE(state) primitives.

       • Define yych, yyaccept and state variables as a part of persistent
         lexer state. The state variable should be initialized to -1.

       • YYFILL should return to the outer program instead of trying to supply
         more input. Return code should indicate that lexer needs more input.

       • The outer program should recognize situations when lexer needs more
         input and respond appropriately.

       • Use /*!getstate:re2c*/ directive if it is necessary to execute any
         code before entering the lexer.

       • Use configurations state:abort and state:nextlabel to further tweak
         the generated code.
2736
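       The changes listed above can be pictured without re2c. The following is
       a minimal hand-written sketch, not generated code: the pushLexer type,
       its feed method and the status constants are hypothetical names chosen
       for this illustration. It keeps the state in a struct, initializes it
       to -1, and returns to the caller whenever a chunk of input is exhausted.

```go
package main

import "fmt"

// Status codes returned by the hypothetical lexer.
const (
	needMoreInput = iota // the lexer ran out of input and saved its state
	syntaxError          // the input does not match [a-z]+;
)

// pushLexer is a hand-written stand-in for a generated lexer. Everything
// that must survive a return to the caller lives in the struct.
type pushLexer struct {
	state   int // -1: not started; 0: between packets; 1: inside a packet
	packets int // number of complete packets seen so far
}

// feed consumes one chunk of input. It returns needMoreInput when the chunk
// is exhausted and syntaxError when a byte cannot be matched.
func (l *pushLexer) feed(chunk []byte) int {
	if l.state == -1 {
		l.state = 0 // first call: enter the initial state
	}
	for _, c := range chunk {
		switch {
		case 'a' <= c && c <= 'z':
			l.state = 1
		case c == ';' && l.state == 1:
			l.packets++
			l.state = 0
		default:
			return syntaxError
		}
	}
	return needMoreInput
}

func main() {
	lexer := &pushLexer{state: -1}
	// The packets "zero;" and "one;" arrive split at arbitrary positions.
	for _, chunk := range []string{"ze", "ro;on", "e;"} {
		if lexer.feed([]byte(chunk)) == syntaxError {
			fmt.Println("syntax error")
			return
		}
	}
	fmt.Println(lexer.packets) // prints 2
}
```

       The caller drives the loop and decides when more input is available;
       the lexer only reports that it needs more. A generated lexer differs in
       that it also has to restore its position inside the DFA.
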
       Here is an example of a "push"-model lexer that reads input from a file
       and expects a sequence of packets, each consisting of lowercase letters
       terminated by a semicolon. The test program plays the role of the
       sender: whenever the lexer returns lexWaitingForInput, it writes the
       next packet to the file, calls fill to move the new data into the
       buffer, and resumes the lexer. The lexer terminates successfully when
       the EOF rule matches, which happens only if fill fails to read any
       input while the lexer is in the initial state; the test then checks
       that the number of received packets equals the number of sent ones.
       Abnormal termination happens in case of a syntax error
       (lexPacketBroken) or in case the buffer is too small to hold a lexeme
       (lexPacketTooBig, for example if one of the packets exceeds the buffer
       size). Note that YYFILL itself does not supply any input: it only
       returns lexWaitingForInput, and refilling the buffer is left to the
       caller.

   Example
          //go:generate re2go -f $INPUT -o $OUTPUT
          package main

          import (
              "fmt"
              "os"
              "testing"
          )

          // Intentionally small to trigger buffer refill.
          const SIZE int = 16

          type Input struct {
              file     *os.File
              data     []byte
              cursor   int
              marker   int
              token    int
              limit    int
              state    int
              yyaccept int
          }

          const (
              lexEnd = iota
              lexReady
              lexWaitingForInput
              lexPacketBroken
              lexPacketTooBig
              lexCountMismatch
          )

          func fill(in *Input) int {
              if in.token == 0 {
                  // Error: no space can be freed.
                  // In real life can reallocate a larger buffer.
                  return lexPacketTooBig
              }

              // Discard everything up to the start of the current lexeme,
              // shift buffer contents and adjust offsets.
              copy(in.data[0:], in.data[in.token:in.limit])
              in.cursor -= in.token
              in.marker -= in.token
              in.limit -= in.token
              in.token = 0

              // Read new data (as much as possible to fill the buffer).
              n, _ := in.file.Read(in.data[in.limit:SIZE])
              in.limit += n
              in.data[in.limit] = 0 // append sentinel symbol

              return lexReady
          }

          func lex(in *Input, recv *int) int {
              var yych byte
              /*!getstate:re2c*/
          loop:
              in.token = in.cursor
              /*!re2c
              re2c:eof = 0;
              re2c:define:YYPEEK     = "in.data[in.cursor]";
              re2c:define:YYSKIP     = "in.cursor += 1";
              re2c:define:YYBACKUP   = "in.marker = in.cursor";
              re2c:define:YYRESTORE  = "in.cursor = in.marker";
              re2c:define:YYLESSTHAN = "in.limit <= in.cursor";
              re2c:define:YYFILL     = "return lexWaitingForInput";
              re2c:define:YYGETSTATE = "in.state";
              re2c:define:YYSETSTATE = "in.state = @@{state}";

              packet = [a-z]+[;];

              *      { return lexPacketBroken }
              $      { return lexEnd }
              packet { *recv = *recv + 1; goto loop }
              */
          }

          func test(packets []string) int {
              fname := "pipe"
              fw, _ := os.Create(fname)
              fr, _ := os.Open(fname)

              in := &Input{
                  file:   fr,
                  data:   make([]byte, SIZE+1),
                  cursor: SIZE,
                  marker: SIZE,
                  token:  SIZE,
                  limit:  SIZE,
                  state:  -1,
              }
              // data is zero-initialized, no need to write sentinel

              var status int
              send := 0
              recv := 0
          loop:
              for {
                  status = lex(in, &recv)
                  if status == lexEnd {
                      if send != recv {
                          status = lexCountMismatch
                      }
                      break loop
                  } else if status == lexWaitingForInput {
                      if send < len(packets) {
                          fw.WriteString(packets[send])
                          send += 1
                      }
                      status = fill(in)
                      if status != lexReady {
                          break loop
                      }
                  } else if status == lexPacketBroken {
                      break loop
                  } else {
                      panic("unexpected status")
                  }
              }

              fr.Close()
              fw.Close()
              os.Remove(fname)

              return status
          }

          func TestLex(t *testing.T) {
              var tests = []struct {
                  status  int
                  packets []string
              }{
                  {lexEnd, []string{}},
                  {lexEnd, []string{"zero;", "one;", "two;", "three;", "four;"}},
                  {lexPacketBroken, []string{"??;"}},
                  {lexPacketTooBig, []string{"looooooooooooong;"}},
              }

              for i, x := range tests {
                  t.Run(fmt.Sprintf("%d", i), func(t *testing.T) {
                      status := test(x.packets)
                      if status != x.status {
                          t.Errorf("got %d, want %d", status, x.status)
                      }
                  })
              }
          }



REUSABLE BLOCKS

       Reuse mode is enabled with the -r --reusable option. In this mode re2c
       allows one to reuse definitions, configurations and rules specified by
       a /*!rules:re2c*/ block in subsequent /*!use:re2c*/ blocks. As of
       re2c-1.2 it is possible to mix such blocks with normal /*!re2c*/
       blocks; prior to that re2c expects a single rules-block followed by
       use-blocks (normal blocks are disallowed). Use-blocks can have
       additional definitions, configurations and rules: they are merged with
       those specified by the rules-block. A very common use case for the
       -r --reusable option is a lexer that supports multiple input encodings:
       lexer rules are defined once and reused multiple times with
       encoding-specific configurations, such as re2c:flags:utf-8.

       Below is an example of a multi-encoding lexer: it reads a phrase with
       Unicode math symbols and accepts input either in UTF8 or in UTF32. Note
       that the --input-encoding utf8 option allows us to write UTF8-encoded
       symbols in the regular expressions; without this option re2c would
       parse them as a plain ASCII byte sequence (and we would have to use
       hexadecimal escape sequences).

   Example
          //go:generate re2go $INPUT -o $OUTPUT -r --input-encoding utf8
          package main

          import "testing"

          /*!rules:re2c
              re2c:yyfill:enable = 0;
              re2c:define:YYPEEK    = "str[cursor]";
              re2c:define:YYSKIP    = "cursor += 1";
              re2c:define:YYBACKUP  = "marker = cursor";
              re2c:define:YYRESTORE = "cursor = marker";

              "∀x ∃y: p(x, y)" { return 0; }
              *                { return 1; }
          */

          func lexUTF8(str []uint8) int {
              var cursor, marker int
              /*!use:re2c
              re2c:flags:8 = 1;
              re2c:define:YYCTYPE = uint8;
              */
          }

          func lexUTF32(str []uint32) int {
              var cursor, marker int
              /*!use:re2c
              re2c:flags:u = 1;
              re2c:define:YYCTYPE = uint32;
              */
          }

          func TestLexUTF8(t *testing.T) {
              s_utf8 := []uint8{
                  0xe2, 0x88, 0x80, 0x78, 0x20, 0xe2, 0x88, 0x83, 0x79,
                  0x3a, 0x20, 0x70, 0x28, 0x78, 0x2c, 0x20, 0x79, 0x29}

              if lexUTF8(s_utf8) != 0 {
                  t.Errorf("utf8 failed")
              }
          }

          func TestLexUTF32(t *testing.T) {
              s_utf32 := []uint32{
                  0x00002200, 0x00000078, 0x00000020, 0x00002203, 0x00000079,
                  0x0000003a, 0x00000020, 0x00000070, 0x00000028, 0x00000078,
                  0x0000002c, 0x00000020, 0x00000079, 0x00000029}

              if lexUTF32(s_utf32) != 0 {
                  t.Errorf("utf32 failed")
              }
          }


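       The two test inputs in the example are the same phrase stored with
       different code units. The following sketch uses only the Go standard
       library (no re2c involved) to derive both arrays from the string
       literal; utf8Units and utf32Units are helper names made up for this
       illustration.

```go
package main

import "fmt"

// The phrase matched by the rules block in the example.
const phrase = "∀x ∃y: p(x, y)"

// utf8Units returns the UTF-8 code units (bytes) of s.
// Go strings are byte sequences, so this is a plain conversion.
func utf8Units(s string) []uint8 {
	return []byte(s)
}

// utf32Units returns the UTF-32 code units of s, one per code point.
func utf32Units(s string) []uint32 {
	var out []uint32
	for _, r := range s { // ranging over a string decodes UTF-8 into runes
		out = append(out, uint32(r))
	}
	return out
}

func main() {
	fmt.Printf("% x\n", utf8Units(phrase))
	fmt.Printf("%#x\n", utf32Units(phrase))
	// 18 one-byte code units versus 14 four-byte code units.
	fmt.Println(len(utf8Units(phrase)), len(utf32Units(phrase))) // prints 18 14
}
```
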

ENCODING SUPPORT

       Speaking of encodings, it is necessary to understand the difference
       between code points and code units. A code point is an abstract symbol.
       A code unit is the smallest atomic unit of storage in the encoded text.
       A single code point may be represented with one or more code units. In
       a fixed-length encoding all code points are represented with the same
       number of code units. In a variable-length encoding code points may be
       represented with a different number of code units. Note that the "any"
       rule [^] matches any code point, but not necessarily any code unit.
       The only way to match any code unit regardless of the encoding is the
       default rule *. YYCTYPE size should be equal to the size of a code
       unit.

       Re2c supports the following encodings: ASCII, EBCDIC, UCS2, UTF8, UTF16
       and UTF32.

       • ASCII is enabled by default. It is a fixed-length encoding with code
         space [0-255] and 1-byte code points and code units.

       • EBCDIC is enabled with -e, --ecb option. It is a fixed-length
         encoding with code space [0-255] and 1-byte code points and code
         units.

       • UCS2 is enabled with -w, --wide-chars option. It is a fixed-length
         encoding with code space [0-0xFFFF] and 2-byte code points and code
         units.

       • UTF8 is enabled with -8, --utf-8 option. It is a variable-length
         Unicode encoding with code space [0-0x10FFFF]. Code points are
         represented with one, two, three or four 1-byte code units.

       • UTF16 is enabled with -x, --utf-16 option. It is a variable-length
         Unicode encoding with code space [0-0x10FFFF]. Code points are
         represented with one or two 2-byte code units.

       • UTF32 is enabled with -u, --unicode option. It is a fixed-length
         Unicode encoding with code space [0-0x10FFFF] and 4-byte code points
         and code units.

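       The difference between fixed-length and variable-length encodings can
       be checked with the Go standard library. This is a small sketch that is
       independent of re2c; codeUnits is a hypothetical helper that counts how
       many code units one code point occupies in UTF8, UTF16 and UTF32.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// codeUnits reports how many code units are needed to store the code point r
// in UTF-8 (1-byte units), UTF-16 (2-byte units) and UTF-32 (4-byte units).
func codeUnits(r rune) (u8, u16, u32 int) {
	u8 = utf8.RuneLen(r)               // 1 to 4 for valid code points
	u16 = len(utf16.Encode([]rune{r})) // 1, or 2 for a surrogate pair
	u32 = 1                            // fixed-length: always one unit
	return
}

func main() {
	for _, r := range []rune{'x', 'Ы', '∀', '\U0001F600'} {
		u8, u16, u32 := codeUnits(r)
		fmt.Printf("U+%04X: UTF8=%d UTF16=%d UTF32=%d\n", r, u8, u16, u32)
	}
	// Output:
	// U+0078: UTF8=1 UTF16=1 UTF32=1
	// U+042B: UTF8=2 UTF16=1 UTF32=1
	// U+2200: UTF8=3 UTF16=1 UTF32=1
	// U+1F600: UTF8=4 UTF16=2 UTF32=1
}
```
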
       Encodings can also be set or unset using re2c:flags configuration, for
       example re2c:flags:8 = 1; enables UTF8.

       Include file include/unicode_categories.re provides re2c definitions
       for the standard Unicode categories.

       Option --input-encoding utf8 enables Unicode literals in regular
       expressions.

       Option --encoding-policy <fail | substitute | ignore> specifies the way
       re2c handles Unicode surrogates: code points in the range
       [0xD800-0xDFFF].

   Example
          //go:generate re2go $INPUT -o $OUTPUT -8 -s -i
          //
          // Simplified "Unicode Identifier and Pattern Syntax"
          // (see https://unicode.org/reports/tr31)

          package main

          import "testing"

          /*!include:re2c "unicode_categories.re" */

          func lex(str string) int {
              var cursor, marker int
              /*!re2c
              re2c:yyfill:enable    = 0;
              re2c:define:YYCTYPE   = byte;
              re2c:define:YYPEEK    = "str[cursor]";
              re2c:define:YYSKIP    = "cursor += 1";
              re2c:define:YYBACKUP  = "marker = cursor";
              re2c:define:YYRESTORE = "cursor = marker";

              id_start    = L | Nl | [$_];
              id_continue = id_start | Mn | Mc | Nd | Pc | [\u200D\u05F3];
              identifier  = id_start id_continue*;

              identifier { return 0 }
              *          { return 1 }
              */
          }

          func TestLex(t *testing.T) {
              if lex("_Ыдентификатор\000") != 0 {
                  t.Errorf("failed")
              }
          }



START CONDITIONS

       Conditions are enabled with -c --conditions. This option allows one to
       encode multiple interrelated lexers within the same re2c block.

       Each lexer corresponds to a single condition. It starts with a label
       of the form yyc_name, where name is the condition name and the yyc
       prefix can be adjusted with configuration re2c:condprefix. Different
       lexers are separated with a comment
       /* *********************************** */ which can be adjusted with
       configuration re2c:cond:divider.

       Furthermore, each condition has a unique identifier of the form
       yycname, where name is the condition name and the yyc prefix can be
       adjusted with configuration re2c:condenumprefix. Identifiers have the
       type YYCONDTYPE and should be generated with the /*!types:re2c*/
       directive or the -t --type-header option. Users shouldn't define these
       identifiers manually, as the order of conditions is not specified.

       Before all conditions re2c generates entry code that checks the current
       condition identifier and transfers control flow to the start label of
       the active condition. After matching some rule of this condition, the
       lexer may either transfer control flow back to the entry code (after
       executing the associated action and optionally setting another
       condition with =>), or use the :=> shortcut and transition directly to
       the start label of another condition (skipping the action and the entry
       code). Configuration re2c:cond:goto allows one to change the default
       behavior.

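       The control flow described above can be pictured with a hand-written
       sketch. This is not what re2c generates: the condition names condCode
       and condComment and the function countCode are invented for this
       illustration, and the switch statement stands in for the entry code
       that dispatches on the current condition.

```go
package main

import "fmt"

// Hypothetical condition identifiers. In a real lexer they would be
// generated (for example with /*!types:re2c*/), not written by hand.
const (
	condCode = iota
	condComment
)

// countCode returns the number of bytes of s that lie outside of
// /* ... */ comments.
func countCode(s string) int {
	cond, n := condCode, 0
	for i := 0; i < len(s); {
		// Entry code: dispatch on the current condition.
		switch cond {
		case condCode:
			if s[i] == '/' && i+1 < len(s) && s[i+1] == '*' {
				i += 2
				cond = condComment // similar in spirit to "=> comment"
			} else {
				i++
				n++
			}
		case condComment:
			if s[i] == '*' && i+1 < len(s) && s[i+1] == '/' {
				i += 2
				cond = condCode
			} else {
				i++
			}
		}
	}
	return n
}

func main() {
	fmt.Println(countCode("a/*bc*/d")) // prints 2
}
```

       Each case of the switch plays the role of one condition: the same input
       byte is handled differently depending on which condition is active.
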
       Syntactically each rule must be preceded with a list of comma-separated
       condition names or a wildcard * enclosed in angle brackets < and >.
       Wildcard means "any condition" and is semantically equivalent to
       listing all condition names. Here regexp is a regular expression,
       default refers to the default rule *, and action is a block of code.

          <conditions-or-wildcard>  regexp-or-default                 action

          <conditions-or-wildcard>  regexp-or-default  =>  condition  action

          <conditions-or-wildcard>  regexp-or-default  :=> condition

       Rules with an exclamation mark ! in front of the condition list have a
       special meaning: they have no regular expression, and the associated
       action is merged as an entry code to actions of normal rules. This
       might be a convenient place to perform a routine task that is common to
       all rules.

          <!conditions-or-wildcard>  action

       Another special form of rules with an empty condition list <> and no
       regular expression allows one to specify an "entry condition" that can
       be used to execute code before entering the lexer. It is semantically
       equivalent to a condition with number zero, name 0 and an empty regular
       expression.

          <>                 action

          <>  =>  condition  action

          <>  :=> condition

   Example
          //go:generate re2go -c $INPUT -o $OUTPUT -i
          package main

          import (
              "errors"
              "testing"
          )

          var (
              eSyntax   = errors.New("syntax error")
              eOverflow = errors.New("overflow error")
          )

          /*!types:re2c*/

          const u32Limit uint64 = 1<<32

          func parse_u32(str string) (uint32, error) {
              var cursor, marker int
              result := uint64(0)
              cond := yycinit

              add_digit := func(base uint64, offset byte) {
                  result = result * base + uint64(str[cursor-1] - offset)
                  if result >= u32Limit {
                      result = u32Limit
                  }
              }

              /*!re2c
              re2c:yyfill:enable = 0;
              re2c:define:YYCTYPE        = byte;
              re2c:define:YYPEEK         = "str[cursor]";
              re2c:define:YYSKIP         = "cursor += 1";
              re2c:define:YYSHIFT        = "cursor += @@{shift}";
              re2c:define:YYBACKUP       = "marker = cursor";
              re2c:define:YYRESTORE      = "cursor = marker";
              re2c:define:YYGETCONDITION = "cond";
              re2c:define:YYSETCONDITION = "cond = @@";

              <*> * { return 0, eSyntax }

              <init> '0b' / [01]        :=> bin
              <init> "0"                :=> oct
              <init> ""   / [1-9]       :=> dec
              <init> '0x' / [0-9a-fA-F] :=> hex

              <bin, oct, dec, hex> "\x00" {
                  if result < u32Limit {
                      return uint32(result), nil
                  } else {
                      return 0, eOverflow
                  }
              }

              <bin> [01]  { add_digit(2, '0');     goto yyc_bin }
              <oct> [0-7] { add_digit(8, '0');     goto yyc_oct }
              <dec> [0-9] { add_digit(10, '0');    goto yyc_dec }
              <hex> [0-9] { add_digit(16, '0');    goto yyc_hex }
              <hex> [a-f] { add_digit(16, 'a'-10); goto yyc_hex }
              <hex> [A-F] { add_digit(16, 'A'-10); goto yyc_hex }
              */
          }

          func TestLex(t *testing.T) {
              var tests = []struct {
                  num uint32
                  str string
                  err error
              }{
                  {1234567890, "1234567890\000", nil},
                  {13, "0b1101\000", nil},
                  {0x7fe, "0x007Fe\000", nil},
                  {0644, "0644\000", nil},
                  {0, "9999999999\000", eOverflow},
                  {0, "123??\000", eSyntax},
              }

              for _, x := range tests {
                  t.Run(x.str, func(t *testing.T) {
                      num, err := parse_u32(x.str)
                      if !(num == x.num && err == x.err) {
                          t.Errorf("got %d, want %d", num, x.num)
                      }
                  })
              }
          }



SKELETON PROGRAMS

       With the -S, --skeleton option, re2c ignores all non-re2c code and
       generates a self-contained C program that can be further compiled and
       executed. The program consists of lexer code and input data. For each
       constructed DFA (block or condition) re2c generates a standalone lexer
       and two files: an .input file with strings derived from the DFA and a
       .keys file with expected match results. The program runs each lexer on
       the corresponding .input file and compares results with the
       expectations. Skeleton programs are very useful for a number of
       reasons:

       • They can check correctness of various re2c optimizations (the data is
         generated early in the process, before any DFA transformations have
         taken place).

       • Generating a set of input data with good coverage may be useful for
         both testing and benchmarking.

       • Generating self-contained executable programs allows one to get
         minimized test cases (the original code may be large or have a lot of
         dependencies).

       The difficulty with generating input data is that for all but the most
       trivial cases the number of possible input strings is too large (even
       if the string length is limited). Re2c solves this difficulty by
       generating sufficiently many strings to cover almost all DFA
       transitions. It uses the following algorithm. First, it constructs a
       skeleton of the DFA. For encodings with 1-byte code unit size (such as
       ASCII, UTF-8 and EBCDIC) the skeleton is just an exact copy of the
       original DFA. For encodings with multibyte code units the skeleton is a
       copy of the DFA with certain transitions omitted: namely, re2c takes at
       most 256 code units for each disjoint continuous range that corresponds
       to a DFA transition. The chosen values are evenly distributed and
       include range bounds. Instead of trying to cover all possible paths in
       the skeleton (which is infeasible) re2c generates sufficiently many
       paths to cover all skeleton transitions, and thus trigger the
       corresponding conditional jumps in the lexer. The algorithm
       implementation is limited by ~1Gb of transitions and consumes a
       constant amount of memory (re2c writes data to file as soon as it is
       generated).

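       The selection of code units for one range can be sketched as follows.
       This is only an illustration of "at most 256 evenly distributed values
       that include the range bounds": the function sampleRange and its exact
       rounding are assumptions, and re2c may pick different intermediate
       values.

```go
package main

import "fmt"

// sampleRange picks at most limit values from the inclusive range [lo, hi].
// The values are spread evenly and always include both bounds.
// It assumes lo <= hi.
func sampleRange(lo, hi uint32, limit int) []uint32 {
	if limit < 2 {
		limit = 2 // both bounds are always included
	}
	size := uint64(hi) - uint64(lo) + 1
	if size <= uint64(limit) {
		// Small range: take every code unit.
		out := make([]uint32, 0, size)
		for v := uint64(lo); v <= uint64(hi); v++ {
			out = append(out, uint32(v))
		}
		return out
	}
	out := make([]uint32, 0, limit)
	span := uint64(hi) - uint64(lo)
	for i := 0; i < limit; i++ {
		// i == 0 yields lo and i == limit-1 yields hi.
		out = append(out, lo+uint32(span*uint64(i)/uint64(limit-1)))
	}
	return out
}

func main() {
	units := sampleRange(0x0000, 0xFFFF, 256)
	fmt.Println(len(units), units[0], units[len(units)-1]) // prints 256 0 65535
	fmt.Println(len(sampleRange('a', 'z', 256)))           // prints 26
}
```

       A range that fits into the limit is copied whole, which matches the
       statement that for 1-byte code units the skeleton is an exact copy of
       the DFA.
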

VISUALIZATION AND DEBUG

       With the -D, --emit-dot option, re2c does not generate code. Instead,
       it dumps the generated DFA in DOT format. One can convert this dump to
       an image of the DFA using Graphviz or another library. Note that this
       option shows the final DFA after it has gone through a number of
       optimizations and transformations. Earlier stages can be dumped with
       various debug options, such as --dump-nfa, --dump-dfa-raw etc. (see the
       full list of options).


SEE ALSO

       You can find more information about re2c at the official website:
       http://re2c.org. Similar programs are flex(1), lex(1),
       quex (http://quex.sourceforge.net).


AUTHORS

       Re2c was originally written by Peter Bumbulis in 1993. Since then it
       has been developed and maintained by multiple volunteers; most notably,
       Brian Young, Marcus Boerger, Dan Nuffer and Ulya Trofimovich.




                                                                       RE2C(1)