re2go(1) - f34

RE2C(1)                                                                RE2C(1)

NAME

       re2c - compile regular expressions to code

SYNOPSIS

       re2c  [OPTIONS] INPUT [-o OUTPUT]

       re2go [OPTIONS] INPUT [-o OUTPUT]

DESCRIPTION

       re2c is a tool for generating fast lexical analyzers for C, C++ and Go.

       Note: This manual includes examples for Go, but it refers to re2c
       (rather than re2go) as the name of the program in general.

SYNTAX

       A re2c program consists of normal code intermixed with re2c blocks and
       directives. Each re2c block may contain definitions, configurations
       and rules.

       Definitions are of the form name = regexp; where name is an identifier
       that consists of letters, digits and underscores, and regexp is a
       regular expression. Regular expressions may contain other definitions,
       but recursion is not allowed and each name must be defined before it
       is used.

       Configurations are of the form re2c:config = value; where config is
       the configuration descriptor and value is a number, a string or a
       special word.

       Rules consist of a regular expression followed by a semantic action:
       either a block of code enclosed in curly braces { and }, or a raw
       single line of code preceded by := and ended by a newline that is not
       followed by whitespace. If the input matches the regular expression,
       the associated semantic action is executed. If multiple rules match,
       the longest match takes precedence. If multiple rules match the same
       string, the earlier rule takes precedence.

       There are two special rules: the default rule * and the EOF rule $.
       The default rule should always be defined; it has the lowest priority
       regardless of its place and matches any code unit (not necessarily a
       valid character, see encoding support). The EOF rule matches the end
       of input; it should be defined if the corresponding EOF handling
       method is used. If start conditions are used, rules have a more
       complex syntax.

       All rules of a single block are compiled into a deterministic
       finite-state automaton (DFA) and encoded in the form of a program in
       the target language. The generated code interfaces with the outer
       program by means of a few user-defined primitives (see the program
       interface section). Reusable blocks allow sharing rules, definitions
       and configurations between different blocks.

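       The two precedence rules (longest match first, then earlier rule on a
       tie) can be illustrated with a small Go sketch. It is not how re2c
       works internally (re2c compiles all rules into one DFA); it only
       mimics the disambiguation policy with the standard regexp package.
       The names rule, newRule and match, and the two sample rules, are
       assumptions made for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule pairs a hypothetical rule name with an anchored regular expression.
type rule struct {
	name string
	re   *regexp.Regexp
}

// newRule anchors the pattern at the start of input and asks the regexp
// engine for the longest match instead of the leftmost-first one.
func newRule(name, pattern string) rule {
	re := regexp.MustCompile(`^(?:` + pattern + `)`)
	re.Longest()
	return rule{name, re}
}

// match returns the name of the winning rule and the length of its match,
// or ("", -1) if no rule matches at the start of input.
func match(rules []rule, input string) (string, int) {
	best, bestLen := "", -1
	for _, r := range rules {
		loc := r.re.FindStringIndex(input)
		// The strict ">" keeps the earlier rule when lengths are equal.
		if loc != nil && loc[1] > bestLen {
			best, bestLen = r.name, loc[1]
		}
	}
	return best, bestLen
}

func main() {
	rules := []rule{
		newRule("keyword", `if`),
		newRule("identifier", `[a-z]+`),
	}
	for _, s := range []string{"if", "iffy"} {
		name, n := match(rules, s)
		fmt.Println(s, name, n)
	}
}
```

       For the input "if" both rules match two characters, so the earlier
       rule (keyword) wins; for "iffy" the identifier rule matches four
       characters and wins as the longest match.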

EXAMPLE

   Input file
          //go:generate re2go $INPUT -o $OUTPUT -i
          package main                             //
                                                   //
          func lex(str string) int {               // Go code
              var cursor int                       //

              /*!re2c                              // start of re2c block
              re2c:define:YYCTYPE = byte;          // configuration
              re2c:define:YYPEEK = "str[cursor]";  // configuration
              re2c:define:YYSKIP = "cursor += 1";  // configuration
              re2c:yyfill:enable = 0;              // configuration
              re2c:flags:nested-ifs = 1;           // configuration
                                                   //
              number = [1-9][0-9]*;                // named definition
                                                   //
              number { return 0; }                 // normal rule
              *      { return 1; }                 // default rule
              */
          }                                        //
                                                   //
          func main() {                            //
              if lex("1234\x00") != 0 {            // Go code
                  panic("failed!")                 //
              }                                    //
          }                                        //

   Output file
          // Code generated by re2c, DO NOT EDIT.
          //go:generate re2go $INPUT -o $OUTPUT -i
          package main                             //
                                                   //
          func lex(str string) int {               // Go code
              var cursor int                       //


          {
              var yych byte
              yych = str[cursor]
              if (yych <= '0') {
                  goto yy2
              }
              if (yych <= '9') {
                  goto yy4
              }
          yy2:
              cursor += 1
              { return 1; }
          yy4:
              cursor += 1
              yych = str[cursor]
              if (yych <= '/') {
                  goto yy6
              }
              if (yych <= '9') {
                  goto yy4
              }
          yy6:
              { return 0; }
          }

          }                                        //
                                                   //
          func main() {                            //
              if lex("1234\x00") != 0 {            // Go code
                  panic("failed!")                 //
              }                                    //
          }                                        //

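       To make the generated automaton easier to read, here is a
       hand-written Go sketch that should behave like the generated lex
       function for the rule number = [1-9][0-9]*. The name lexNumber is an
       assumption made for this illustration. Like the generated code, it
       does no bounds checking, so the input must end with a byte that is
       not a digit (the example uses a NUL terminator).

```go
package main

import "fmt"

// lexNumber returns 0 if the input starts with a number ([1-9][0-9]*)
// and 1 otherwise (the default rule). The input must end with a
// non-digit sentinel byte such as "\x00", because nothing here checks
// for the end of the string.
func lexNumber(str string) int {
	cursor := 0
	yych := str[cursor]
	if yych < '1' || yych > '9' {
		// Default rule *: the generated code also advances the cursor
		// here, which is omitted because the position is not returned.
		return 1
	}
	for {
		cursor++ // YYSKIP
		yych = str[cursor] // YYPEEK
		if yych < '0' || yych > '9' {
			return 0 // the longest run of digits has ended
		}
	}
}

func main() {
	for _, s := range []string{"1234\x00", "0\x00", "x\x00"} {
		fmt.Printf("%q -> %d\n", s, lexNumber(s))
	}
}
```

       Only the first input is accepted as a number: "0" and "x" do not
       start with a digit in the range 1-9 and fall through to the default
       rule.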

OPTIONS

121       -? -h --help
122              Show help message.
123
124       -1 --single-pass
125              Deprecated. Does nothing (single pass is the default now).
126
127       -8 --utf-8
              Generate a lexer that reads input in UTF-8 encoding. re2c assumes
              that the character range is 0 -- 0x10FFFF and the code unit size
              is 1 byte.
131
132       -b --bit-vectors
133              Optimize conditional jumps using bit masks. Implies -s.
134
135       -c --conditions --start-conditions
136              Enable support of Flex-like "conditions": multiple  interrelated
137              lexers  within  one block. Option --start-conditions is a legacy
138              alias; use --conditions instead.
139
140       --case-insensitive
141              Treat single-quoted and double-quoted strings  as  case-insensi‐
142              tive.
143
144       --case-inverted
145              Invert  the  meaning of single-quoted and double-quoted strings:
146              treat single-quoted strings as case-sensitive and  double-quoted
147              strings as case-insensitive.
148
149       --case-ranges
              Collapse consecutive cases in a switch statement into a range of
              the form case low ... high:. This syntax is an extension of the
              C/C++ language, supported by compilers like GCC, Clang and Tcc.
              The main advantage over using single cases is smaller generated
              C code and faster generation time, although for some compilers
              like Tcc it also results in smaller binary size. This option
              doesn't work for the Go backend.
157
158       -e --ecb
              Generate a lexer that reads input in EBCDIC encoding. re2c
              assumes that the character range is 0 -- 0xFF and the character
              size is 1 byte.
162
163       --empty-class <match-empty | match-none | error>
              Define the way re2c treats empty character classes. With
              match-empty (the default) an empty class matches empty input
              (which is illogical, but backwards-compatible). With match-none
              an empty class always fails to match. With error an empty class
              raises a compilation error.
169
170       --encoding-policy <fail | substitute | ignore>
171              Define  the  way re2c treats Unicode surrogates.  With fail re2c
172              aborts with an error when a surrogate is encountered.  With sub‐
173              stitute  re2c  silently  replaces surrogates with the error code
174              point 0xFFFD. With ignore (the default) re2c  treats  surrogates
175              as normal code points. The Unicode standard says that standalone
176              surrogates are invalid, but real-world  libraries  and  programs
177              behave in different ways.
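              As a rough illustration of what the three policies mean for a
              single code point, consider the Go sketch below. It is not
              re2c's implementation; the helper name applyPolicy and its
              string argument are assumptions made for this example.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf16"
)

// applyPolicy is a hypothetical helper that shows the effect of the
// fail, substitute and ignore policies on one code point.
func applyPolicy(policy string, cp rune) (rune, error) {
	if !utf16.IsSurrogate(cp) {
		return cp, nil // ordinary code points are never affected
	}
	switch policy {
	case "fail":
		return 0, errors.New("surrogate code point")
	case "substitute":
		return 0xFFFD, nil // replacement character
	default: // "ignore": keep the surrogate as a normal code point
		return cp, nil
	}
}

func main() {
	for _, p := range []string{"fail", "substitute", "ignore"} {
		cp, err := applyPolicy(p, 0xD800)
		fmt.Printf("%-10s %U %v\n", p, cp, err)
	}
}
```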
178
179       -f --storable-state
180              Generate  a lexer which can store its inner state.  This is use‐
181              ful in push-model lexers which are stopped by an  outer  program
182              when there is not enough input, and then resumed when more input
183              becomes available. In this mode users should additionally define
184              YYGETSTATE()  and  YYSETSTATE(state)  macros and variables yych,
185              yyaccept and state as part of the lexer state.
186
187       -F --flex-syntax
188              Partial support for Flex syntax: in this mode named  definitions
189              don't  need  the  equal  sign and the terminating semicolon, and
190              when used they must be surrounded by curly braces. Names without
191              curly braces are treated as double-quoted strings.
192
193       -g --computed-gotos
194              Optimize  conditional  jumps  using non-standard "computed goto"
195              extension (which must be supported by the compiler). re2c gener‐
196              ates jump tables only in complex cases with a lot of conditional
197              branches.  Complexity   threshold   can   be   configured   with
198              cgoto:threshold  configuration.  This  option  implies  -b. This
199              option doesn't work for the Go backend.
200
201       -I PATH
202              Add PATH to the list of locations which are used when  searching
203              for  include  files.  This  option is useful in combination with
204              /*!include:re2c ... */ directive. Re2c looks  for  FILE  in  the
205              directory  of  including  file  and in the list of include paths
206              specified by -I option.
207
208       -i --no-debug-info
209              Do not output #line information. This is useful when the  gener‐
210              ated code is tracked by some version control system or IDE.
211
212       --input <default | custom>
              Specify the API used by the generated code to interface with
              user-defined code. Option default is the C API based on pointer
              arithmetic (it is the default for the C backend). Option custom
              is the generic API (it is the default for the Go backend).
217
218       --input-encoding <ascii | utf8>
219              Specify the way re2c parses  regular  expressions.   With  ascii
220              (the  default) re2c handles input as ASCII-encoded: any sequence
221              of code units is a sequence  of  standalone  1-byte  characters.
222              With  utf8  re2c  handles  input  as UTF8-encoded and recognizes
223              multibyte characters.
224
225       --lang <c | go>
226              Specify the output language. Supported languages are  C  and  Go
227              (the default is C).
228
229       --location-format <gnu | msvc>
230              Specify  location  format  in  messages.  With gnu locations are
231              printed as 'filename:line:column: ...'.  With msvc locations are
232              printed as 'filename(line,column) ...'.  Default is gnu.
233
234       --no-generation-date
235              Suppress date output in the generated file.
236
237       --no-version
238              Suppress version output in the generated file.
239
240       -o OUTPUT --output=OUTPUT
241              Specify the OUTPUT file.
242
243       -P --posix-captures
244              Enable submatch extraction with POSIX-style capturing groups.
245
246       -r --reusable
247              Allows reuse of re2c rules with /*!rules:re2c */ and /*!use:re2c
248              */ blocks. Exactly one rules-block must be  present.  The  rules
249              are  saved  and  used by every use-block that follows, which may
250              add its own rules and configurations.
251
252       -S --skeleton
253              Ignore user-defined interface code and generate a self-contained
254              "skeleton"  program.  Additionally,  generate  input  files with
255              strings derived from the regular grammar  and  compressed  match
256              results  that  are  used  to  verify  "skeleton" behavior on all
257              inputs. This option is useful for finding bugs in  optimizations
258              and  code  generation. This option doesn't work for the Go back‐
259              end.
260
261       -s --nested-ifs
262              Use nested if statements instead of switch statements in  condi‐
263              tional  jumps.  This usually results in more efficient code with
264              non-optimizing compilers.
265
266       -T --tags
267              Enable submatch extraction with tags.
268
269       -t HEADER --type-header=HEADER
270              Generate a HEADER file that contains enum with condition  names.
271              Requires -c option.
272
273       -u --unicode
274              Generate  a  lexer  that reads UTF32-encoded input. Re2c assumes
275              that character range is 0 -- 0x10FFFF and character  size  is  4
276              bytes. This option implies -s.
277
278       -V --vernum
279              Show version information in MMmmpp format (major, minor, patch).
280
281       --verbose
282              Output a short message in case of success.
283
284       -v --version
285              Show version information.
286
287       -w --wide-chars
288              Generate  a  lexer  that  reads UCS2-encoded input. Re2c assumes
289              that character range is 0 -- 0xFFFF  and  character  size  is  2
290              bytes. This option implies -s.
291
292       -x --utf-16
              Generate a lexer that reads UTF16-encoded input. Re2c assumes
              that the character range is 0 -- 0x10FFFF and the code unit size
              is 2 bytes. This option implies -s.
296
297   Debug options
298       -D --emit-dot
299              Instead  of  normal  output generate lexer graph in .dot format.
300              The output can be  converted  to  an  image  with  the  help  of
301              Graphviz (e.g. something like dot -Tpng -odfa.png dfa.dot).
302
303       -d --debug-output
304              Emit  YYDEBUG  in the generated code.  YYDEBUG should be defined
305              by the user in the form of a void function with two  parameters:
306              state  (lexer  state  or -1) and symbol (current input symbol of
307              type YYCTYPE).
308
309       --dump-adfa
310              Debug option: output DFA after tunneling (in .dot format).
311
312       --dump-cfg
313              Debug option: output control flow graph  of  tag  variables  (in
314              .dot format).
315
316       --dump-closure-stats
317              Debug  option: output statistics on the number of states in clo‐
318              sure.
319
320       --dump-dfa-det
321              Debug option: output DFA immediately after  determinization  (in
322              .dot format).
323
324       --dump-dfa-min
325              Debug option: output DFA after minimization (in .dot format).
326
327       --dump-dfa-tagopt
328              Debug  option:  output DFA after tag optimizations (in .dot for‐
329              mat).
330
331       --dump-dfa-tree
332              Debug option: output DFA under construction with  states  repre‐
333              sented as tag history trees (in .dot format).
334
335       --dump-dfa-raw
336              Debug  option:  output  DFA  under  construction  with  expanded
337              state-sets (in .dot format).
338
339       --dump-interf
340              Debug option: output interference  table  produced  by  liveness
341              analysis of tag variables.
342
343       --dump-nfa
344              Debug option: output NFA (in .dot format).
345
346   Internal options
347       --dfa-minimization <moore | table>
348              Internal  option:  DFA  minimization algorithm used by re2c. The
349              moore option is the Moore algorithm (it is the default). The ta‐
350              ble  option  is  the  "table filling" algorithm. Both algorithms
351              should produce the same DFA up to states relabeling; table fill‐
352              ing  is simpler and much slower and serves as a reference imple‐
353              mentation.
354
355       --eager-skip
356              Internal option: make the  generated  lexer  advance  the  input
357              position  eagerly -- immediately after reading the input symbol.
358              This changes the default behavior when  the  input  position  is
359              advanced  lazily  --  after  transition  to the next state. This
360              option is implied by --no-lookahead.
361
362       --no-lookahead
363              Internal option: use TDFA(0) instead of  TDFA(1).   This  option
364              has effect only with --tags or --posix-captures options.
365
366       --no-optimize-tags
              Internal option: suppress optimization of tag variables (useful
              for debugging).
369
370       --posix-closure <gor1 | gtop>
371              Internal option: specify shortest-path algorithm  used  for  the
372              construction of epsilon-closure with POSIX disambiguation seman‐
373              tics: gor1 (the default) stands for  Goldberg-Radzik  algorithm,
374              and gtop stands for "global topological order" algorithm.
375
376       --posix-prectable <complex | naive>
377              Internal  option:  specify  the  algorithm used to compute POSIX
378              precedence table. The complex algorithm computes precedence  ta‐
379              ble  in one traversal of tag history tree and has quadratic com‐
380              plexity in the number of TNFA states; it  is  the  default.  The
381              naive algorithm has worst-case cubic complexity in the number of
382              TNFA states, but it is much simpler  than  complex  and  may  be
383              slightly faster in non-pathological cases.
384
385       --stadfa
386              Internal  option:  use staDFA algorithm for submatch extraction.
387              The main difference with TDFA is that tag operations  in  staDFA
388              are placed in states, not on transitions.
389
390   Warnings
391       -W     Turn on all warnings.
392
393       -Werror
394              Turn  warnings  into errors. Note that this option alone doesn't
395              turn on any warnings; it only affects those warnings  that  have
396              been turned on so far or will be turned on later.
397
398       -W<warning>
399              Turn on warning.
400
401       -Wno-<warning>
402              Turn off warning.
403
404       -Werror-<warning>
405              Turn  on warning and treat it as an error (this implies -W<warn‐
406              ing>).
407
408       -Wno-error-<warning>
409              Don't treat this particular warning as an  error.  This  doesn't
410              turn off the warning itself.
411
412       -Wcondition-order
413              Warn  if  the generated program makes implicit assumptions about
414              condition numbering. One should use either the -t, --type-header
415              option or the /*!types:re2c*/ directive to generate a mapping of
416              condition names to numbers and then use the autogenerated condi‐
417              tion names.
418
419       -Wempty-character-class
420              Warn  if a regular expression contains an empty character class.
421              Trying to match an empty character  class  makes  no  sense:  it
422              should  always  fail.  However, for backwards compatibility rea‐
423              sons re2c allows empty character  classes  and  treats  them  as
424              empty  strings.  Use  the  --empty-class  option  to  change the
425              default behavior.
426
427       -Wmatch-empty-string
428              Warn if a rule is nullable (matches an empty  string).   If  the
429              lexer  runs  in a loop and the empty match is unintentional, the
430              lexer may unexpectedly hang in an infinite loop.
431
432       -Wswapped-range
433              Warn if the lower bound of a range is  greater  than  its  upper
434              bound.  The  default  behavior  is  to  silently  swap the range
435              bounds.
436
437       -Wundefined-control-flow
438              Warn if some input strings cause undefined control flow  in  the
439              lexer  (the faulty patterns are reported). This is the most dan‐
440              gerous and most common mistake. It can be easily fixed by adding
441              the  default  rule  * which has the lowest priority, matches any
442              code unit, and consumes exactly one code unit.
443
444       -Wunreachable-rules
445              Warn about rules that are shadowed by other rules and will never
446              match.
447
448       -Wuseless-escape
449              Warn  if  a symbol is escaped when it shouldn't be.  By default,
450              re2c silently ignores such escapes, but this may as  well  indi‐
451              cate a typo or an error in the escape sequence.
452
453       -Wnondeterministic-tags
454              Warn  if  a  tag  has  n-th degree of nondeterminism, where n is
455              greater than 1.
456
457       -Wsentinel-in-midrule
              Warn if the sentinel symbol occurs in the middle of a rule; this
              may cause reads past the end of buffer, crashes or memory
              corruption in the generated lexer. This warning is only
              applicable if the sentinel method of checking for the end of
              input is used. It is turned into an error if the re2c:sentinel
              configuration is used.
464

PROGRAM INTERFACE

466       Re2c  has a flexible interface that gives the user both the freedom and
467       the responsibility to define how the generated code interacts with  the
468       outer program.  There are two major options:
469
       · Pointer API. It is also called the "default API", since it was
         historically the first, and for a long time the only one. This is a
         more restricted API based on C pointer arithmetic. It consists of
         pointer-like primitives YYCURSOR, YYMARKER, YYCTXMARKER and YYLIMIT,
         which are normally defined as pointers of type YYCTYPE*. The pointer
         API is enabled by default for the C backend, and it cannot be used
         with other backends that do not have pointer arithmetic.
477
478
479
480       · Generic  API.   This  is  a  less restricted API that does not assume
481         pointer  semantics.   It  consists  of  primitives  YYPEEK,   YYSKIP,
482         YYBACKUP, YYBACKUPCTX, YYSTAGP, YYSTAGN, YYMTAGP, YYMTAGN, YYRESTORE,
483         YYRESTORECTX, YYRESTORETAG,  YYSHIFT,  YYSHIFTSTAG,  YYSHIFTMTAG  and
484         YYLESSTHAN.   For  the  C backend generic API is enabled with --input
485         custom option or re2c:flags:input = custom; configuration; for the Go
486         backend  it  is enabled by default.  Generic API was added in version
487         0.14.  It is intentionally designed to give the user as much  freedom
488         as  possible  in redefining the input model and the semantics of dif‐
489         ferent actions performed by the generated code. As  an  example,  one
490         can  override YYPEEK to check for the end of input before reading the
491         input character, or do some logging, etc.
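       As a sketch of the YYPEEK override mentioned above, the Go helper
       below checks for the end of input before reading and returns a NUL
       byte instead of indexing out of range. The helper name peek and the
       configuration shown in its comment are assumptions made for this
       example; whether a NUL byte is a suitable end marker depends on the
       rules of the lexer.

```go
package main

import "fmt"

// peek is a hypothetical replacement for a plain str[cursor] read.
// It could be wired into a re2c block with a configuration such as
//
//	re2c:define:YYPEEK = "peek(str, cursor)";
//
// At or past the end of input it returns 0, so the lexer sees a NUL
// byte instead of panicking with an index out of range.
func peek(str string, cursor int) byte {
	if cursor >= len(str) {
		return 0
	}
	return str[cursor]
}

func main() {
	s := "ab"
	for i := 0; i <= len(s); i++ {
		fmt.Printf("%d: %q\n", i, peek(s, i))
	}
}
```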
492
493       Generic API has two styles:
494
495       · Function-like.  This style is enabled  with  re2c:api:style  =  func‐
496         tions;  configuration,  and  it is the default for C backend. In this
497         style API primitives should be defined as functions  or  macros  with
498         parentheses, accepting the necessary arguments. For example, in C the
499         default pointer API can be defined in function-like style generic API
500         as follows:
501
            #define  YYPEEK()                 *YYCURSOR
            #define  YYSKIP()                 ++YYCURSOR
            #define  YYBACKUP()               YYMARKER = YYCURSOR
            #define  YYBACKUPCTX()            YYCTXMARKER = YYCURSOR
            #define  YYRESTORE()              YYCURSOR = YYMARKER
            #define  YYRESTORECTX()           YYCURSOR = YYCTXMARKER
            #define  YYRESTORETAG(tag)        YYCURSOR = tag
            #define  YYLESSTHAN(len)          YYLIMIT - YYCURSOR < len
            #define  YYSTAGP(tag)             tag = YYCURSOR
            #define  YYSTAGN(tag)             tag = NULL
            #define  YYSHIFT(shift)           YYCURSOR += shift
            #define  YYSHIFTSTAG(tag, shift)  tag += shift
514
515
516
517       · Free-form.   This  style  is enabled with re2c:api:style = free-form;
518         configuration, and it is the default for Go backend.  In  this  style
519         API  primitives  can  be  defined  as  free-form  pieces of code, and
520         instead of arguments they have interpolated  variables  of  the  form
521         @@{name}, or optionally just @@ if there is only one argument. The @@
522         text is called "sigil". It can be redefined to any  other  text  with
523         re2c:api:sigil  configuration.  For  example, the default pointer API
524         can be defined in free-form style generic API as follows:
525
            re2c:define:YYPEEK       = "*YYCURSOR";
            re2c:define:YYSKIP       = "++YYCURSOR";
            re2c:define:YYBACKUP     = "YYMARKER = YYCURSOR";
            re2c:define:YYBACKUPCTX  = "YYCTXMARKER = YYCURSOR";
            re2c:define:YYRESTORE    = "YYCURSOR = YYMARKER";
            re2c:define:YYRESTORECTX = "YYCURSOR = YYCTXMARKER";
            re2c:define:YYRESTORETAG = "YYCURSOR = @@{tag}";
            re2c:define:YYLESSTHAN   = "YYLIMIT - YYCURSOR < @@{len}";
            re2c:define:YYSTAGP      = "@@{tag} = YYCURSOR";
            re2c:define:YYSTAGN      = "@@{tag} = NULL";
            re2c:define:YYSHIFT      = "YYCURSOR += @@{shift}";
            re2c:define:YYSHIFTSTAG  = "@@{tag} += @@{shift}";
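       For the Go backend, which has no pointer arithmetic, the same
       primitives are usually expressed with integer offsets into the
       input. The sketch below shows one possible arrangement: a small
       struct with methods that free-form definitions could call. The type
       input, its methods and the configurations in the comment are
       assumptions made for this example, not names required by re2c.

```go
package main

import "fmt"

// input is a hypothetical lexer state that uses integer offsets instead
// of pointers. Free-form definitions could refer to it like this:
//
//	re2c:define:YYPEEK     = "in.peek()";
//	re2c:define:YYSKIP     = "in.skip()";
//	re2c:define:YYBACKUP   = "in.backup()";
//	re2c:define:YYRESTORE  = "in.restore()";
//	re2c:define:YYLESSTHAN = "in.lessThan(@@{len})";
type input struct {
	str    string
	cursor int // current input position
	marker int // position saved by backup
}

func (in *input) peek() byte          { return in.str[in.cursor] }
func (in *input) skip()               { in.cursor++ }
func (in *input) backup()             { in.marker = in.cursor }
func (in *input) restore()            { in.cursor = in.marker }
func (in *input) lessThan(n int) bool { return len(in.str)-in.cursor < n }

func main() {
	in := &input{str: "abc"}
	in.skip()   // cursor is now 1
	in.backup() // remember position 1
	in.skip()
	in.skip() // cursor is now 3, the end of input
	fmt.Println(in.lessThan(1))
	in.restore() // roll back to position 1
	fmt.Println(in.cursor, string(in.peek()), in.lessThan(2))
}
```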
538
539   API primitives
540       Here is a list of API primitives that may be used by the generated code
541       in  order  to  interface  with the outer program.  Which primitives are
542       needed depends on multiple factors, including the complexity of regular
543       expressions,  input  representation, buffering, the use of various fea‐
544       tures and so on.  All the necessary primitives should be defined by the
545       user  in  the form of macros, functions, variables, free-form pieces of
546       code or any other suitable form.  Re2c does not (and cannot) check  the
547       definitions,  so if anything is missing or defined incorrectly the gen‐
548       erated code will not compile.
549
550       YYCTYPE
551              The type of the  input  characters  (code  units).   For  ASCII,
552              EBCDIC and UTF-8 encodings it should be 1-byte unsigned integer.
553              For UTF-16 or UCS-2 it should be 2-byte  unsigned  integer.  For
554              UTF-32 it should be 4-byte unsigned integer.
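              In Go terms, the code unit types for these encodings would
              typically be byte, uint16 and uint32 (or rune). The sketch
              below encodes the same three characters in UTF-8, UTF-16 and
              UTF-32 to show how the number of code units differs; the
              function name codeUnits is an assumption made for this
              example.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// codeUnits returns the same text as UTF-8, UTF-16 and UTF-32 code units.
func codeUnits(s string) ([]byte, []uint16, []uint32) {
	runes := []rune(s)
	u32 := make([]uint32, len(runes))
	for i, r := range runes {
		u32[i] = uint32(r)
	}
	return []byte(s), utf16.Encode(runes), u32
}

func main() {
	// U+0061 (1 byte in UTF-8), U+20AC (3 bytes), U+1F600 (4 bytes,
	// and a surrogate pair in UTF-16).
	u8, u16, u32 := codeUnits("a\u20ac\U0001F600")
	fmt.Println(len(u8), len(u16), len(u32)) // prints "8 4 3"
}
```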
555
556       YYCURSOR
              A pointer-like l-value that stores the current input position
              (usually a pointer of type YYCTYPE*). Initially YYCURSOR should
              point to the first input character. It is advanced by the
              generated code. When a rule matches, YYCURSOR points one past
              the last matched character. It is used only in the default C
              API.
563
564       YYLIMIT
              A pointer-like r-value that stores the end of input position
              (usually a pointer of type YYCTYPE*). Initially YYLIMIT should
              point one past the last available input character. It is not
              changed by the generated code. The lexer compares YYCURSOR to
              YYLIMIT in order to determine if there are enough input
              characters left. YYLIMIT is used only in the default C API.
571
572       YYMARKER
              A pointer-like l-value (usually a pointer of type YYCTYPE*) that
              stores the position of the latest matched rule. It is used to
              restore the YYCURSOR position if a longer match fails and the
              lexer needs to roll back. Initialization is not needed. YYMARKER
              is used only in the default C API.
578
579       YYCTXMARKER
580              A  pointer-like l-value that stores the position of the trailing
581              context (usually a pointer of type YYCTYPE*). No  initialization
582              is  needed.  It is used only in the default C API, and only with
583              the lookahead operator /.
584
585       YYFILL API primitive with one argument len.  The meaning of  YYFILL  is
586              to  provide  at  least len more input characters or fail. If EOF
587              rule is used, YYFILL should always return to the  calling  func‐
588              tion; the return value should be zero on success and non-zero on
589              failure. If EOF rule is not used, YYFILL return value is ignored
590              and  it  should  not  return on failure. Maximal value of len is
591              YYMAXFILL, which can be generated with /*!max:re2c*/  directive.
592              The   definition  of  YYFILL  can  be  either  function-like  or
593              free-form depending on the API  style  (see  re2c:api:style  and
594              re2c:define:YYFILL:naked).
595
596       YYMAXFILL
597              An integral constant equal to the  maximal value of YYFILL argu‐
598              ment.  It can be generated with /*!max:re2c*/ directive.
599
600       YYLESSTHAN
              A generic API primitive with one argument len. It should be
              defined as an r-value of boolean type that equals true if and
              only if there are fewer than len input characters left. The
              definition can be either function-like or free-form depending on
              the API style (see re2c:api:style).
606
607       YYPEEK A generic API primitive with no arguments.  It should be defined
608              as  an r-value of type YYCTYPE that is equal to the character at
609              the current input position. The definition can be  either  func‐
610              tion-like   or   free-form  depending  on  the  API  style  (see
611              re2c:api:style).
612
613       YYSKIP A generic API primitive  with  no  arguments.   The  meaning  of
614              YYSKIP  is  to advance the current input position by one charac‐
615              ter. The definition can be  either  function-like  or  free-form
616              depending on the API style (see re2c:api:style).
617
618       YYBACKUP
619              A  generic  API  primitive  with  no  arguments.  The meaning of
620              YYBACKUP is to save the current input position, which  is  later
621              restored  with YYRESTORE.  The definition should be either func‐
622              tion-like  or  free-form  depending  on  the  API   style   (see
623              re2c:api:style).
624
625       YYRESTORE
626              A generic API primitive with no arguments.  The meaning of YYRE‐
627              STORE is to restore the current  input  position  to  the  value
628              saved  by  YYBACKUP.   The  definition  should  be  either func‐
629              tion-like  or  free-form  depending  on  the  API   style   (see
630              re2c:api:style).
631
632       YYBACKUPCTX
633              A  generic  API  primitive  with zero arguments.  The meaning of
634              YYBACKUPCTX is to save the current input position as  the  posi‐
635              tion  of  the trailing context, which is later restored by YYRE‐
636              STORECTX.  The definition  should  be  either  function-like  or
637              free-form depending on the API style (see re2c:api:style).
638
639       YYRESTORECTX
640              A generic API primitive with no arguments.  The meaning of YYRE‐
641              STORECTX is to restore the trailing context position saved  with
642              YYBACKUPCTX.   The  definition should be either function-like or
643              free-form depending on the API style (see re2c:api:style).
644
645       YYRESTORETAG
646              A generic API primitive with one argument tag.  The  meaning  of
647              YYRESTORETAG  is to restore the trailing context position to the
648              value of tag.  The definition should be either function-like  or
649              free-form depending on the API style (see re2c:api:style).
650
651       YYSTAGP
652              A  generic  API primitive with one argument tag.  The meaning of
653              YYSTAGP is to set tag value to the current input position.   The
654              definition should be either function-like or free-form depending
655              on the API style (see re2c:api:style).
656
657       YYSTAGN
658              A generic API primitive with one argument tag.  The  meaning  of
659              YYSTAGN is to set tag value to null (or some default value). The
660              definition should be either function-like or free-form depending
661              on the API style (see re2c:api:style).
662
663       YYMTAGP
664              A  generic  API primitive with one argument tag.  The meaning of
665              YYMTAGP is to append the current position to the history of tag.
666              The  definition  should  be  either  function-like  or free-form
667              depending on the API style (see re2c:api:style).
668
669       YYMTAGN
670              A generic API primitive with one argument tag.  The  meaning  of
671              YYMTAGN  is  to append null (or some other default) value to the
672              history of tag.  The definition can be either  function-like  or
673              free-form depending on the API style (see re2c:api:style).
674
675       YYSHIFT
676              A generic API primitive with one argument shift.  The meaning of
677              YYSHIFT is to shift the current input position by shift  charac‐
678              ters  (the  shift  value may be negative). The definition can be
679              either function-like or free-form depending  on  the  API  style
680              (see re2c:api:style).
681
682       YYSHIFTSTAG
683              A generic  API primitive with two arguments, tag and shift.  The
684              meaning of YYSHIFTSTAG is to shift tag by shift characters  (the
685              shift  value  may  be  negative).   The definition can be either
686              function-like or free-form  depending  on  the  API  style  (see
687              re2c:api:style).
688
689       YYSHIFTMTAG
690              A  generic API primitive with two arguments, tag and shift.  The
691              meaning of YYSHIFTMTAG is to shift the latest value in the  his‐
692              tory  of  tag  by shift characters (the shift value may be nega‐
693              tive).   The  definition  should  be  either  function-like   or
694              free-form depending on the API style (see re2c:api:style).
695
696       YYMAXNMATCH
697              An  integral  constant equal to the maximal number of POSIX cap‐
698              turing  groups  in  a  rule.  It  is  generated  with   /*!maxn‐
699              match:re2c*/ directive.
700
701       YYCONDTYPE
702              The  type  of the condition enum.  It should be generated either
703              with /*!types:re2c*/ directive or -t --type-header option.
704
705       YYGETCONDITION
706              An API primitive with zero arguments.  It should be  defined  as
707              an  r-value of type YYCONDTYPE that is equal to the current con‐
708              dition identifier. The definition can be either function-like or
709              free-form  depending  on  the  API style (see re2c:api:style and
710              re2c:define:YYGETCONDITION:naked).
711
712       YYSETCONDITION
713              An API primitive with one argument cond.  The meaning of  YYSET‐
714              CONDITION  is  to  set the current condition identifier to cond.
715              The definition  should  be  either  function-like  or  free-form
716              depending   on   the   API   style   (see   re2c:api:style   and
717              re2c:define:YYSETCONDITION@cond).
718
719       YYGETSTATE
720              An API primitive with zero arguments.  It should be  defined  as
721              an  r-value  of  integer type that is equal to the current lexer
722              state. Should be initialized to -1. The definition can be either
723              function-like  or  free-form  depending  on  the  API style (see
724              re2c:api:style and re2c:define:YYGETSTATE:naked).
725
726       YYSETSTATE
727              An API primitive with one argument state.  The meaning of YYSET‐
728              STATE  is  to set the current lexer state to state.  The defini‐
729              tion should be either function-like or  free-form  depending  on
730              the   API   style  (see  re2c:api:style  and  re2c:define:YYSET‐
731              STATE@state).
732
733       YYDEBUG
734              A debug API primitive with two arguments.  It  can  be  used  to
735              debug  the generated code (with -d --debug-output option). YYDE‐
736              BUG should return no  value  and  accept  two  arguments:  state
737              (either  a  DFA state index or -1) and symbol (the current input
738              symbol).
739
740       yych   An l-value of type YYCTYPE that stores the current input charac‐
741              ter.  User definition is necessary only with -f --storable-state
742              option.
743
744       yyaccept
745              An l-value of unsigned integral type that stores the  number  of
746              the latest matched rule.  User definition is necessary only with
747              -f --storable-state option.
748
749       yynmatch
750              An l-value of unsigned integral type that stores the  number  of
751              POSIX  capturing  groups in the matched rule.  Used only with -P
752              --posix-captures option.
753
754       yypmatch
755              An array of l-values that are used to hold the tag values corre‐
756              sponding  to  the  capturing  parentheses  in the matching rule.
757              Array length must be at least yynmatch * 2 (usually  YYMAXNMATCH
758              *  2  is  a  good  choice).   Used only with -P --posix-captures
759              option.
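
       As an illustration of how yynmatch and yypmatch fit  together,  the
       following Go sketch extracts the text of each capturing group from a
       filled-in array.  It is not taken from re2c: the helper  submatches,
       the  constant maxNMatch (standing in for YYMAXNMATCH) and the use of
       integer offsets are assumptions made for this example, as is the
       layout where elements 2*i and 2*i+1 hold the start and end of group i
       and a negative offset marks a group that did not match.

```go
package main

import "fmt"

// maxNMatch stands in for YYMAXNMATCH. The value 4 is arbitrary here; in a
// real lexer the constant would be generated by re2c.
const maxNMatch = 4

// submatches returns the text of the first yynmatch capturing groups.
// Assumption: yypmatch[2*i] and yypmatch[2*i+1] are the start and end offsets
// of group i, and a negative offset marks a group that did not match.
func submatches(str string, yypmatch []int, yynmatch int) []string {
	groups := make([]string, 0, yynmatch)
	for i := 0; i < yynmatch; i++ {
		start, end := yypmatch[2*i], yypmatch[2*i+1]
		if start < 0 || end < 0 {
			groups = append(groups, "") // unmatched group
			continue
		}
		groups = append(groups, str[start:end])
	}
	return groups
}

func main() {
	// Offsets written by hand for this example, as a lexer might leave them
	// after matching ([0-9]+) "-" ([0-9]+) against "2024-05".
	var yypmatch [maxNMatch * 2]int
	yynmatch := copy(yypmatch[:], []int{0, 7, 0, 4, 5, 7}) / 2
	fmt.Println(submatches("2024-05", yypmatch[:], yynmatch))
}
```

       Sizing the array as maxNMatch * 2 follows the advice above: it is
       large enough for any rule in the block.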
760
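       The generic API primitives listed in this section (YYPEEK,  YYSKIP,
       YYBACKUP, YYRESTORE, YYLESSTHAN, YYSHIFT and the tag primitives) can
       be pictured as small operations on a lexer state.  The Go sketch
       below is one possible way to back them with a string and an integer
       cursor.  The type input, its fields and its method names are
       hypothetical and not part of re2c; representing a null tag as -1 and
       an m-tag history as a slice are simplifying assumptions.

```go
package main

import "fmt"

// noTag is the "null" tag value used in this sketch (an assumption).
const noTag = -1

// input is a hypothetical lexer state, not a type provided by re2c.
type input struct {
	str    string // the whole input
	cursor int    // current input position
	marker int    // position saved by backup
}

// peek plays the role of YYPEEK: the character at the current position.
func (in *input) peek() byte { return in.str[in.cursor] }

// skip plays the role of YYSKIP: advance by one character.
func (in *input) skip() { in.cursor++ }

// backup and restore play the roles of YYBACKUP and YYRESTORE.
func (in *input) backup()  { in.marker = in.cursor }
func (in *input) restore() { in.cursor = in.marker }

// lessThan plays the role of YYLESSTHAN: true if and only if fewer than n
// characters are left.
func (in *input) lessThan(n int) bool { return len(in.str)-in.cursor < n }

// shift plays the role of YYSHIFT: move the position by n (may be negative).
func (in *input) shift(n int) { in.cursor += n }

// stagp and stagn play the roles of YYSTAGP and YYSTAGN.
func (in *input) stagp(tag *int) { *tag = in.cursor }
func (in *input) stagn(tag *int) { *tag = noTag }

// mtagp and mtagn play the roles of YYMTAGP and YYMTAGN: they append to the
// history of the tag instead of overwriting a single value.
func (in *input) mtagp(tag *[]int) { *tag = append(*tag, in.cursor) }
func (in *input) mtagn(tag *[]int) { *tag = append(*tag, noTag) }

func main() {
	in := &input{str: "abc"}
	in.backup() // remember position 0
	in.skip()
	in.skip()
	fmt.Println(string(in.peek()), in.lessThan(2))
	in.restore() // back to position 0

	var stag int
	var mtag []int
	in.stagp(&stag)
	in.skip()
	in.mtagp(&mtag)
	in.mtagn(&mtag)
	fmt.Println(in.cursor, stag, mtag)
}
```

       With the free-form API style, a definition such as
       re2c:define:YYPEEK = "in.peek()"; could then refer to these helpers.
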
761   Directives
762       Below is the list of all directives provided by re2c (in no  particular
763       order).  More information on each directive can be found in the related
764       sections.
765
766       /*!re2c ... */
767              A standard re2c block.
768
769       %{ ... %}
770              A standard re2c block in -F --flex-support mode.
771
772       /*!rules:re2c ... */
773              A reusable re2c block (requires -r --reuse option).
774
775       /*!use:re2c ... */
776              A  block  that  reuses  previous  rules-block   specified   with
777              /*!rules:re2c ... */ (requires -r --reuse option).
778
779       /*!ignore:re2c ... */
780              A block whose contents are ignored and removed from the  output
781              file.
782
783       /*!max:re2c*/
784              This directive  is  substituted  with  the  macro-definition  of
785              YYMAXFILL.
786
787       /*!maxnmatch:re2c*/
788              This  directive  is  substituted  with  the  macro-definition of
789              YYMAXNMATCH (requires -P --posix-captures option).
790
791       /*!getstate:re2c*/
792              This directive is substituted with conditional dispatch on lexer
793              state (requires -f --storable-state option).
794
795       /*!types:re2c ... */
796              This  directive  is substituted with the definition of condition
797              enum (requires -c --conditions option).
798
799       /*!stags:re2c ... */, /*!mtags:re2c ... */
800              These directives allow one to specify a template piece  of  code
801              that  is  expanded  for  each  s-tag/m-tag variable generated by
802              re2c. This block has two optional configurations: format = "@@";
803              (specifies the template where @@ is substituted with the name of
804              each tag variable), and separator = ""; (specifies the piece  of
805              code  used  to join the generated pieces for different tag vari‐
806              ables).
807
808       /*!include:re2c FILE */
809              This directive allows one to include FILE (in the same sense  as
810              #include directive in C/C++).
811
812       /*!header:re2c:on*/
813              This  directive marks the start of header file. Everything after
814              it and up to the  following  /*!header:re2c:off*/  directive  is
815              processed  by re2c and written to the header file specified with
816              -t --type-header option.
817
818       /*!header:re2c:off*/
819              This directive  marks  the  end  of  header  file  started  with
820              /*!header:re2c:on*/.
821
822   Configurations
823       re2c:flags:t, re2c:flags:type-header
824              Specify  the  name  of the generated header file relative to the
825              directory of the output file. (Same as  -t,  --type-header  com‐
826              mand-line option except that the filepath is relative.)
827
828       re2c:flags:input
829              Same as --input command-line option.
830
831       re2c:api:style
832              Allows  one to specify the style of generic API. Possible values
833              are functions and free-form. With functions style  (the  default
834              for  the  C  backend)  API primitives behave like functions, and
835              re2c generates parentheses with an argument list after the  name
836              of each primitive.  With free-form style (the default for the Go
837              backend) re2c treats API definitions as interpolated strings and
838              substitutes  argument placeholders with the actual argument val‐
839              ues.  This option can be overridden by  options  for  individual
840              API primitives, e.g. re2c:define:YYFILL:naked for YYFILL.
841
842       re2c:api:sigil
843              Allows  one  to  specify  the "sigil" symbol (or string) that is
844              used to recognize argument placeholders in  the  definitions  of
845              generic  API primitives.  The default value is @@.  Placeholders
846              start with sigil, followed by the argument name in curly braces.
847              For  example,  if sigil is set to $, then placeholders will have
848              the form ${name}. Single-argument APIs may use  shorthand  nota‐
849              tion  without  the name in braces. This option can be overridden
850              by    options    for    individual    API    primitives,    e.g.
851              re2c:define:YYFILL@len for YYFILL.
852
853       re2c:define:YYCTYPE
854              Defines YYCTYPE (see the user interface section).
855
856       re2c:define:YYCURSOR
857              Defines  C  API  primitive YYCURSOR (see the user interface sec‐
858              tion).
859
860       re2c:define:YYLIMIT
861              Defines C API primitive YYLIMIT (see  the  user  interface  sec‐
862              tion).
863
864       re2c:define:YYMARKER
865              Defines  C  API  primitive YYMARKER (see the user interface sec‐
866              tion).
867
868       re2c:define:YYCTXMARKER
869              Defines C API primitive YYCTXMARKER (see the user interface sec‐
870              tion).
871
872       re2c:define:YYFILL
873              Defines API primitive YYFILL (see the user interface section).
874
875       re2c:define:YYFILL@len
876              Specifies  the  sigil  used  for argument substitution in YYFILL
877              definition.  Defaults  to  @@.   Overrides  the   more   generic
878              re2c:api:sigil configuration.
879
880       re2c:define:YYFILL:naked
881              Allows  one to override re2c:api:style for YYFILL.  Value 0 cor‐
882              responds to free-form API style.
883
884       re2c:yyfill:enable
885              Defaults to 1 (YYFILL is enabled). Set this to zero to  suppress
886              the generation of YYFILL. Use warnings (-W option) and re2c:sen‐
887              tinel configuration to verify that the  generated  lexer  cannot
888              read past the end of input, as this might introduce severe secu‐
889              rity issues to your programs.
890
891       re2c:yyfill:parameter
892              Controls the argument in the  parentheses  that  follow  YYFILL.
893              Defaults  to  1,  which means that the argument is generated. If
894              zero,  the  argument  is  omitted.  Can   be   overridden   with
895              re2c:define:YYFILL:naked or re2c:api:style.
896
897       re2c:eof
898              Specifies  the sentinel symbol used with EOF rule $ to check for
899              the end of input in the generated lexer. The default value is -1
900              (EOF  rule is not used). Other possible values include all valid
901              code units. Only decimal numbers are recognized.
902
903       re2c:sentinel
904              Specifies the sentinel symbol used with the sentinel  method  of
905              checking  for  the end of input in the generated lexer (the case
906              when bounds checking is disabled with  re2c:yyfill:enable  =  0;
907              and  EOF rule $ is not used). This configuration does not affect
908              code generation. It is used by re2c to verify that the  sentinel
909              symbol  is  not  allowed  in the middle of the rule, and prevent
910              possible reads past the end of buffer in  the  generated  lexer.
911              The  default  value is -1 (re2c assumes that the sentinel symbol
912              is 0, which is the most  common  case).  Other  possible  values
913              include  all  valid  code units. Only decimal numbers are recog‐
914              nized.
915
916       re2c:define:YYLESSTHAN
917              Defines generic API primitive YYLESSTHAN (see the user interface
918              section).
919
920       re2c:yyfill:check
921              Setting this to zero suppresses the generation of  the  YYFILL
922              check (YYLESSTHAN in generic API or YYLIMIT-based comparison  in
923              default  C API). This configuration is useful when the necessary
924              input is always available. It defaults to 1 (the check is
925              generated).
926
927       re2c:label:yyFillLabel
928              Allows  one to change the prefix of YYFILL labels (used with EOF
929              rule or with storable states).
930
931       re2c:define:YYPEEK
932              Defines generic API primitive YYPEEK  (see  the  user  interface
933              section).
934
935       re2c:define:YYSKIP
936              Defines  generic  API  primitive  YYSKIP (see the user interface
937              section).
938
939       re2c:define:YYBACKUP
940              Defines generic API primitive YYBACKUP (see the  user  interface
941              section).
942
943       re2c:define:YYBACKUPCTX
944              Defines  generic  API primitive YYBACKUPCTX (see the user inter‐
945              face section).
946
947       re2c:define:YYRESTORE
948              Defines generic API primitive YYRESTORE (see the user  interface
949              section).
950
951       re2c:define:YYRESTORECTX
952              Defines  generic API primitive YYRESTORECTX (see the user inter‐
953              face section).
954
955       re2c:define:YYRESTORETAG
956              Defines generic API primitive YYRESTORETAG (see the user  inter‐
957              face section).
958
959       re2c:define:YYSHIFT
960              Defines  generic  API  primitive YYSHIFT (see the user interface
961              section).
962
963       re2c:define:YYSHIFTMTAG
964              Defines generic API primitive YYSHIFTMTAG (see the  user  inter‐
965              face section).
966
967       re2c:define:YYSHIFTSTAG
968              Defines  generic  API primitive YYSHIFTSTAG (see the user inter‐
969              face section).
970
971       re2c:define:YYSTAGN
972              Defines generic API primitive YYSTAGN (see  the  user  interface
973              section).
974
975       re2c:define:YYSTAGP
976              Defines  generic  API  primitive YYSTAGP (see the user interface
977              section).
978
979       re2c:define:YYMTAGN
980              Defines generic API primitive YYMTAGN (see  the  user  interface
981              section).
982
983       re2c:define:YYMTAGP
984              Defines  generic  API  primitive YYMTAGP (see the user interface
985              section).
986
987       re2c:flags:T, re2c:flags:tags
988              Same as -T --tags command-line option.
989
990       re2c:flags:P, re2c:flags:posix-captures
991              Same as -P --posix-captures command-line option.
992
993       re2c:tags:expression
994              Allows one to customize the way re2c  addresses  tag  variables.
995              By  default  re2c generates expressions of the form yyt<N>. This
996              might be inconvenient, for example if tag variables are  defined
997              as  fields in a struct. Re2c recognizes placeholders of the form
998              @@{tag} or @@ and replaces them with the actual tag name.  Sigil
999              @@  can  be  redefined  with  re2c:api:sigil configuration.  For
1000              example, setting  re2c:tags:expression  =  "p->@@";  results  in
1001              expressions of the form p->yyt<N> in the generated code.
1002
1003       re2c:tags:prefix
1004              Allows  one to override the prefix of tag variables (defaults to
1005              yyt).
1006
1007       re2c:flags:lookahead
1008              Same as inverted --no-lookahead command-line option.
1009
1010       re2c:flags:optimize-tags
1011              Same as inverted --no-optimize-tags command-line option.
1012
1013       re2c:define:YYCONDTYPE
1014              Defines YYCONDTYPE (see the user interface section).
1015
1016       re2c:define:YYGETCONDITION
1017              Defines API primitive YYGETCONDITION  (see  the  user  interface
1018              section).
1019
1020       re2c:define:YYGETCONDITION:naked
1021              Allows one to override re2c:api:style for YYGETCONDITION.  Value
1022              0 corresponds to free-form API style.
1023
1024       re2c:define:YYSETCONDITION
1025              Defines API primitive YYSETCONDITION  (see  the  user  interface
1026              section).
1027
1028       re2c:define:YYSETCONDITION@cond
1029              Specifies  the sigil used for argument substitution in YYSETCON‐
1030              DITION definition. The default value is @@.  Overrides the  more
1031              generic re2c:api:sigil configuration.
1032
1033       re2c:define:YYSETCONDITION:naked
1034              Allows one to override re2c:api:style for YYSETCONDITION.  Value
1035              0 corresponds to free-form API style.
1036
1037       re2c:cond:goto
1038              Allows one to customize the goto statements used with the short‐
1039              cut  :=>  rules  in  conditions.  The default value is goto @@;.
1040              Placeholders  are   substituted   with   condition   name   (see
1041              re2c:api:sigil and re2c:cond:goto@cond).
1042
1043       re2c:cond:goto@cond
1044              Specifies   the   sigil   used   for  argument  substitution  in
1045              re2c:cond:goto definition. The default value is  @@.   Overrides
1046              the more generic re2c:api:sigil configuration.
1047
1048       re2c:cond:divider
1049              Defines  the divider for condition blocks.  The default value is
1050              /*  ***********************************  */.   Placeholders  are
1051              substituted   with   condition   name  (see  re2c:api:sigil  and
1052              re2c:cond:divider@cond).
1053
1054       re2c:cond:divider@cond
1055              Specifies  the  sigil  used   for   argument   substitution   in
1056              re2c:cond:divider  definition.  The  default value is @@.  Over‐
1057              rides the more generic re2c:api:sigil configuration.
1058
1059       re2c:condprefix
1060              Specifies the prefix used for  condition  labels.   The  default
1061              value is yyc_.
1062
1063       re2c:condenumprefix
1064              Specifies  the  prefix  used  for  condition  identifiers.   The
1065              default value is yyc.
1066
1067       re2c:define:YYGETSTATE
1068              Defines API primitive YYGETSTATE (see the  user  interface  sec‐
1069              tion).
1070
1071       re2c:define:YYGETSTATE:naked
1072              Allows  one  to override re2c:api:style for YYGETSTATE.  Value 0
1073              corresponds to free-form API style.
1074
1075       re2c:define:YYSETSTATE
1076              Defines API primitive YYSETSTATE (see the  user  interface  sec‐
1077              tion).
1078
1079       re2c:define:YYSETSTATE@state
1080              Specifies the sigil used for argument substitution in YYSETSTATE
1081              definition. The default value is @@.  Overrides the more generic
1082              re2c:api:sigil configuration.
1083
1084       re2c:define:YYSETSTATE:naked
1085              Allows  one  to override re2c:api:style for YYSETSTATE.  Value 0
1086              corresponds to free-form API style.
1087
1088       re2c:state:abort
1089              If set to a positive integer value,  changes  the  form  of  the
1090              YYGETSTATE  switch: instead of using default case to jump to the
1091              beginning of the lexer block, a -1 case is used, and the default
1092              case aborts the program.
1093
1094       re2c:state:nextlabel
1095              With storable states, controls whether the  YYGETSTATE  block
1096              is followed by a yyNext label (the default value is zero,  which
1097              corresponds to no label). Instead of using yyNext it is possible
1098              to use re2c:startlabel to force the  generation  of  a  specific
1099              start  label.   Instead  of using labels it is often more conve‐
1100              nient to generate YYGETSTATE code using /*!getstate:re2c*/.
1101
1102       re2c:label:yyNext
1103              Allows one to change the name of the yyNext label.
1104
1105       re2c:startlabel
1106              Controls the generation of start label for the next lexer block.
1107              The  default  value is zero, which means that the start label is
1108              generated only if it is used. An integer value greater than zero
1109              forces the generation of start label even if it is unused by the
1110              lexer. A string value also forces  start  label  generation  and
1111              sets the label name to the specified string.  This configuration
1112              applies only to the current block (it is reset  to  default  for
1113              the next block).
1114
1115       re2c:flags:s, re2c:flags:nested-ifs
1116              Same as -s --nested-ifs command-line option.
1117
1118       re2c:flags:b, re2c:flags:bit-vectors
1119              Same as -b --bit-vectors command-line option.
1120
1121       re2c:variable:yybm
1122              Overrides the name of the yybm variable.
1123
1124       re2c:yybm:hex
1125              Defaults  to  zero (a decimal bitmap table is generated). If set
1126              to nonzero, a hexadecimal table is generated.
1127
1128       re2c:flags:g, re2c:flags:computed-gotos
1129              Same as -g --computed-gotos command-line option.
1130
1131       re2c:cgoto:threshold
1132              With -g --computed-gotos option this value  specifies  the  com‐
1133              plexity  threshold  that  triggers the generation of jump tables
1134              instead of nested if statements and bitmaps. The  default  value
1135              is 9.
1136
1137       re2c:flags:case-ranges
1138              Same as --case-ranges command-line option.
1139
1140       re2c:flags:e, re2c:flags:ecb
1141              Same as -e --ecb command-line option.
1142
1143       re2c:flags:8, re2c:flags:utf-8
1144              Same as -8 --utf-8 command-line option.
1145
1146       re2c:flags:w, re2c:flags:wide-chars
1147              Same as -w --wide-chars command-line option.
1148
1149       re2c:flags:x, re2c:flags:utf-16
1150              Same as -x --utf-16 command-line option.
1151
1152       re2c:flags:u, re2c:flags:unicode
1153              Same as -u --unicode command-line option.
1154
1155       re2c:flags:encoding-policy
1156              Same as --encoding-policy command-line option.
1157
1158       re2c:flags:empty-class
1159              Same as --empty-class command-line option.
1160
1161       re2c:flags:case-insensitive
1162              Same as --case-insensitive command-line option.
1163
1164       re2c:flags:case-inverted
1165              Same as --case-inverted command-line option.
1166
1167       re2c:flags:i, re2c:flags:no-debug-info
1168              Same as -i --no-debug-info command-line option.
1169
1170       re2c:indent:string
1171              Specifies  the string to use for indentation.  The default value
1172              is "\t".  Indent string should contain only  whitespace  charac‐
1173              ters.   To  disable indentation entirely, set this configuration
1174              to empty string "".
1175
1176       re2c:indent:top
1177              Specifies the minimum amount of indentation to use.  The default
1178              value  is zero.  The value should be a non-negative integer num‐
1179              ber.
1180
1181       re2c:labelprefix
1182              Allows one to change  the  prefix  of  DFA  state  labels.   The
1183              default value is yy.
1184
1185       re2c:yych:emit
1186              Set  this to zero to suppress the generation of yych definition.
1187              Defaults to 1 (the definition is generated).
1188
1189       re2c:variable:yych
1190              Overrides the name of the yych variable.
1191
1192       re2c:yych:conversion
1193              If set to nonzero, re2c automatically generates a cast  to  YYC‐
1194              TYPE every time yych is read. Defaults to zero (no cast).
1195
1196       re2c:variable:yyaccept
1197              Overrides the name of the yyaccept variable.
1198
1199       re2c:variable:yytarget
1200              Overrides the name of the yytarget variable.
1201
1202       re2c:variable:yystable
1203              Deprecated.
1204
1205       re2c:variable:yyctable
1206              When  both  -c  --conditions and -g --computed-gotos are active,
1207              re2c will use this variable to generate a static jump table  for
1208              YYGETCONDITION.
1209
1210       re2c:define:YYDEBUG
1211              Defines YYDEBUG (see the user interface section).
1212
1213       re2c:flags:d, re2c:flags:debug-output
1214              Same as -d --debug-output command-line option.
1215
1216       re2c:flags:dfa-minimization
1217              Same as --dfa-minimization command-line option.
1218
1219       re2c:flags:eager-skip
1220              Same as --eager-skip command-line option.
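
       The placeholder rules described above for re2c:api:sigil (a sigil
       followed by the argument name in curly braces, with a bare sigil
       allowed for single-argument primitives) can be pictured with a small
       Go sketch.  The function expand is hypothetical: it only imitates
       the substitution on plain strings and is not how re2c implements it.

```go
package main

import (
	"fmt"
	"strings"
)

// expand substitutes argument placeholders in an API definition.
// Named placeholders have the form sigil{name}. If there is exactly one
// argument, a bare sigil is accepted as shorthand for it.
func expand(def, sigil string, args map[string]string) string {
	// Replace the named form first, so that the shorthand pass below
	// cannot damage a placeholder such as @@{tag}.
	for name, value := range args {
		def = strings.ReplaceAll(def, sigil+"{"+name+"}", value)
	}
	if len(args) == 1 {
		for _, value := range args {
			def = strings.ReplaceAll(def, sigil, value)
		}
	}
	return def
}

func main() {
	// Shorthand form with the default sigil.
	fmt.Println(expand("p->@@", "@@", map[string]string{"tag": "yyt1"}))
	// Named form with the sigil redefined to "$".
	fmt.Println(expand("fill(${len})", "$", map[string]string{"len": "3"}))
	// Two arguments: only the named form applies.
	fmt.Println(expand("shift(@@{tag}, @@{shift})", "@@",
		map[string]string{"tag": "yyt1", "shift": "-2"}))
}
```

       The first call mirrors the re2c:tags:expression = "p->@@"; example
       above, where the placeholder is replaced by a tag name.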
1221
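       The effect of re2c:state:abort on the YYGETSTATE dispatch,  described
       above,  can be sketched in plain Go.  This is a hand-written model,
       not generated code: the function dispatch, the two resumable  states
       0 and 1 and the label names are assumptions, and the sketch returns
       an error where the generated code would abort the program.

```go
package main

import (
	"errors"
	"fmt"
)

// dispatch returns the name of the label a lexer with two resumable states
// would jump to for a saved state value.
// Without abort, every unknown state (including -1) goes to the start.
// With abort, only -1 goes to the start and other unknown states are errors.
func dispatch(state int, abort bool) (string, error) {
	switch {
	case state == 0:
		return "resume0", nil
	case state == 1:
		return "resume1", nil
	case state == -1 || !abort:
		return "start", nil
	default:
		return "", errors.New("invalid lexer state")
	}
}

func main() {
	for _, abort := range []bool{false, true} {
		for _, state := range []int{-1, 1, 7} {
			label, err := dispatch(state, abort)
			fmt.Println(abort, state, label, err)
		}
	}
}
```

       The stricter form is useful when a corrupted state value should be
       detected instead of silently restarting the lexer.
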

REGULAR EXPRESSIONS

1223       re2c uses the following syntax for regular expressions:
1224
1225       · "foo" case-sensitive string literal
1226
1227       · 'foo' case-insensitive string literal
1228
1229       · [a-xyz], [^a-xyz] character class (possibly negated)
1230
1231       · . any character except newline
1232
1233       · R \ S difference of character classes R and S
1234
1235       · R* zero or more occurrences of R
1236
1237       · R+ one or more occurrences of R
1238
1239       · R? optional R
1240
1241       · R{n} repetition of R exactly n times
1242
1243       · R{n,} repetition of R at least n times
1244
1245       · R{n,m} repetition of R from n to m times
1246
1247       · (R)  just  R;  parentheses  are  used  to  override precedence or for
1248         POSIX-style submatch
1249
1250       · R S concatenation: R followed by S
1251
1252       · R | S alternative: R or S
1253
1254       · R / S lookahead: R followed by S, but S is not consumed
1255
1256       · name the regular expression defined as name (or literal string "name"
1257         in Flex compatibility mode)
1258
1259       · {name}  the  regular expression defined as name in Flex compatibility
1260         mode
1261
1262       · @stag an s-tag: saves the last input position at which @stag  matches
1263         in a variable named stag
1264
1265       · #mtag an m-tag: saves all input positions at which #mtag matches in a
1266         variable named mtag
1267
1268       Character classes and string literals may contain the following  escape
1269       sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and hexa‐
1270       decimal escapes \xhh, \uhhhh and \Uhhhhhhhh.
1271

EOF HANDLING

1273       Re2c provides several ways to handle the end of input.  Which one  to
1274       use  depends  on the complexity of the regular expressions, performance
1275       considerations, the need for  input  buffering  and  various  other
1276       factors.  EOF  handling  is probably the most complex part of the re2c
1277       user interface: it requires some understanding of how the  generated
1278       lexer  works.   In  return, it allows the user to customize the lexer
1279       for a particular environment and to avoid the  unnecessary  overhead
1280       of  generic  methods  when a simpler method is sufficient.  Roughly
1281       speaking, there are four main methods:
1282
1283       · using sentinel symbol (simple and efficient, but limited)
1284
1285       · bounds checking with padding (generic, but complex)
1286
1287       · EOF rule: a  combination  of  sentinel  symbol  and  bounds  checking
1288         (generic and simple, can be more or less efficient than bounds check‐
1289         ing with padding depending on the grammar)
1290
1291       · using generic API (user-defined, so may be incorrect ;])
1292
1293   Using sentinel symbol
1294       This is the simplest and the most efficient method. It is applicable in
1295       cases  when  the  input is small enough to fit into a contiguous memory
1296       buffer and there is a natural "sentinel" symbol --- a code unit that is
1297       not allowed by any of the regular expressions in grammar (except possi‐
1298       bly as a terminating character).   Sentinel  symbol  never  appears  in
1299       well-formed input, therefore it can be appended at the end of input and
1300       used as a stop signal by the lexer. A good example of such input  is  a
1301       null-terminated C-string, provided that the grammar does not allow NULL
1302       in the middle of lexemes. Sentinel method is  very  efficient,  because
1303       the lexer does not need to perform any additional checks for the end of
1304       input --- it comes naturally as a part of processing the  next  charac‐
1305       ter.   It  is very important that the sentinel symbol is not allowed in
1306       the middle of the rule --- otherwise on some inputs the lexer may  read
1307       past the end of buffer and crash or cause memory corruption. Re2c veri‐
1308       fies this automatically.  Use re2c:sentinel  configuration  to  specify
1309       which sentinel symbol is used.
1310
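       The stop-signal idea can be sketched by hand in plain Go. The function
       below is not re2c output: countWords is a hypothetical helper written
       for illustration, and it assumes that the input ends with a zero byte
       and that lowercase words and spaces are the only valid lexemes. Neither
       loop compares the index against the string length; the sentinel alone
       stops the scan.

```go
package main

import "fmt"

// countWords is a hand-written illustration (not generated by re2c).
// It assumes s ends with the sentinel byte 0 and returns the number of
// lowercase words, or -1 on any other character.
func countWords(s string) int {
	i, count := 0, 0
	for {
		c := s[i]
		switch {
		case c == 0: // sentinel: end of input
			return count
		case c == ' ':
			i++
		case 'a' <= c && c <= 'z':
			// No length check: the sentinel is not in [a-z],
			// so this loop always stops before the end of s.
			for 'a' <= s[i] && s[i] <= 'z' {
				i++
			}
			count++
		default:
			return -1
		}
	}
}

func main() {
	fmt.Println(countWords("one two three\x00")) // 3
	fmt.Println(countWords("f0ur\x00"))          // -1
}
```

       If the grammar allowed the zero byte inside a word, the inner loop
       would run past the end of the string, which is the failure mode the
       text warns about.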
       Below is an example of using the sentinel method. Configuration
       re2c:yyfill:enable = 0; suppresses generation of end-of-input checks
       and YYFILL calls.

          //go:generate re2go $INPUT -o $OUTPUT
          package main

          import "testing"

          // expect a null-terminated string
          func lex(str string) int {
              var cursor int
              count := 0
          loop:
              /*!re2c
              re2c:yyfill:enable = 0;
              re2c:define:YYCTYPE = byte;
              re2c:define:YYPEEK = "str[cursor]";
              re2c:define:YYSKIP = "cursor += 1";

              *      { return -1 }
              [\x00] { return count }
              [a-z]+ { count += 1; goto loop }
              [ ]+   { goto loop }
              */
          }

          func TestLex(t *testing.T) {
              var tests = []struct {
                  res int
                  str string
              }{
                  {0, "\000"},
                  {3, "one two three\000"},
                  {-1, "f0ur\000"},
              }

              for _, x := range tests {
                  t.Run(x.str, func(t *testing.T) {
                      res := lex(x.str)
                      if res != x.res {
                          t.Errorf("got %d, want %d", res, x.res)
                      }
                  })
              }
          }


   Bounds checking with padding
       Bounds checking is a generic method: it can be used with any input
       grammar. The basic idea is simple: we need to check for the end of
       input before reading the next input character. However, if implemented
       in a straightforward way, this would be quite inefficient: checking on
       each input character would cause a major slowdown. Re2c avoids the
       slowdown by generating checks only in certain key states of the lexer,
       and letting it run without checks in between the key states. More
       precisely, re2c computes strongly connected components (SCCs) of the
       underlying DFA (which roughly correspond to loops), and generates only
       a few checks per SCC (usually just one, but in general enough to make
       the SCC acyclic). The check is of the form (YYLIMIT - YYCURSOR) < n,
       where n is the maximal length of a simple path in the corresponding
       SCC. If this condition is true, the lexer calls YYFILL(n), which must
       either supply at least n input characters or not return. When the
       lexer continues after the check, it is certain that the next n
       characters can be read safely without checks.

       This approach reduces the number of checks significantly (and makes the
       lexer much faster as a result), but it has a downside. Since the lexer
       checks for multiple characters at once, it may end up in a situation
       when there are a few remaining input characters (fewer than n)
       corresponding to a short path in the SCC, but the lexer cannot proceed
       because of the check, and YYFILL cannot supply more characters because
       it is the end of input. To solve this problem, re2c requires that
       additional padding consisting of fake characters is appended at the end
       of input. The length of the padding should be YYMAXFILL, which equals
       the maximum n parameter to YYFILL and must be generated by re2c using
       the /*!max:re2c*/ directive. The fake characters should not form a
       valid lexeme suffix, otherwise the lexer may be fooled into matching a
       fake lexeme. Usually it's a good idea to use NULL characters for
       padding.

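       The arithmetic behind the padding requirement can be shown with a
       small sketch in plain Go. The names needFill, pad and maxFill are
       hypothetical: maxFill stands in for YYMAXFILL (a real value comes from
       the /*!max:re2c*/ directive) and needFill only mirrors the shape of the
       generated check.

```go
package main

import (
	"fmt"
	"strings"
)

// maxFill is a hypothetical stand-in for YYMAXFILL; in a real lexer the
// value is generated by the /*!max:re2c*/ directive.
const maxFill = 3

// needFill mirrors the shape of the check (YYLIMIT - YYCURSOR) < n.
func needFill(limit, cursor, n int) bool {
	return limit-cursor < n
}

// pad appends maxFill zero bytes to the input.
func pad(s string) string {
	return s + strings.Repeat("\x00", maxFill)
}

func main() {
	input := "ab" // two characters left, but the check asks for three

	// Without padding the check fires and YYFILL would be called,
	// although the two remaining characters might form a valid lexeme.
	fmt.Println(needFill(len(input), 0, maxFill)) // true

	// With padding the lexer can proceed and read the real characters.
	padded := pad(input)
	fmt.Println(needFill(len(padded), 0, maxFill)) // false
}
```

       Once the cursor moves into the padding itself, the check fires again,
       which is the point where a YYFILL that cannot supply more input reports
       the end of input.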
       Below is an example of using bounds checking with padding. Note that
       the grammar rule for single-quoted strings allows arbitrary symbols in
       the middle of a lexeme, so there is no natural sentinel in the grammar.
       Strings like 'aha\0ha' are perfectly valid, but ill-formed strings like
       'aha\0 are also possible and should not crash the lexer. In this
       example we do not use buffer refilling, therefore the YYFILL definition
       simply returns an error. Note that YYFILL will only be called after the
       lexer reaches the padding, because only then will the check condition
       be satisfied.

          //go:generate re2go $INPUT -o $OUTPUT
          package main

          import (
              "strings"
              "testing"
          )

          /*!max:re2c*/

          // Expects YYMAXFILL-padded string.
          func lex(str string) int {
              var cursor int
              limit := len(str)
              count := 0
          loop:
              /*!re2c
              re2c:define:YYCTYPE    = byte;
              re2c:define:YYPEEK     = "str[cursor]";
              re2c:define:YYSKIP     = "cursor += 1";
              re2c:define:YYLESSTHAN = "limit - cursor < @@{len}";
              re2c:define:YYFILL     = "return -1";

              *                           { return -1 }
              [\x00]                      { return count }
              ['] ([^'\\] | [\\][^])* ['] { count += 1; goto loop }
              [ ]+                        { goto loop }
              */
          }

          // Pad string with YYMAXFILL zeroes at the end.
          func pad(str string) string {
              return str + strings.Repeat("\000", YYMAXFILL)
          }

          func TestLex(t *testing.T) {
              var tests = []struct {
                  res int
                  str string
              }{
                  {0, ""},
                  {3, "'qu\000tes' 'are' 'fine: \\'' "},
                  {-1, "'unterminated\\'"},
              }

              for _, x := range tests {
                  t.Run(x.str, func(t *testing.T) {
                      res := lex(pad(x.str))
                      if res != x.res {
                          t.Errorf("got %d, want %d", res, x.res)
                      }
                  })
              }
          }


   EOF rule
       EOF rule $ was introduced in version 1.2. It is a hybrid approach that
       tries to take the best of both worlds: the simplicity and efficiency of
       the sentinel method combined with the generality of the bounds-checking
       method. The idea is to appoint an arbitrary symbol to be the sentinel,
       and only perform further bounds checking if the sentinel symbol matches
       (more precisely, if the symbol class that contains it matches). The
       check is of the form YYLIMIT <= YYCURSOR. If this condition is not
       satisfied, then the sentinel is just an ordinary input character and
       the lexer continues. Otherwise this is a real sentinel, and the lexer
       calls YYFILL(). If YYFILL returns zero, the lexer assumes that it has
       more input and tries to re-match. Otherwise YYFILL returns non-zero and
       the lexer knows that it has reached the end of input. At this point
       there are three possibilities. First, it might have already matched a
       shorter lexeme: in this case it just rolls back to the last accepting
       state. Second, it might have consumed some characters, but failed to
       match: in this case it falls back to default rule *. Finally, it might
       be in the initial state: in this (and only this!) case it matches EOF
       rule $.

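       The decision sequence described above can be written down as a small
       model in plain Go. This is only a sketch of the logic, not code that
       re2c generates: the Outcome labels, the onSentinel function and its
       accepted and initial flags are hypothetical names introduced here, and
       fill stands in for YYFILL.

```go
package main

import "fmt"

// Outcome labels are hypothetical names for the cases in the text.
type Outcome int

const (
	Continue    Outcome = iota // sentinel is ordinary input, keep matching
	Rematch                    // YYFILL supplied more input, re-match
	Rollback                   // end of input, a shorter lexeme already matched
	DefaultRule                // end of input, characters consumed, no match
	EOFRule                    // end of input in the initial state
)

// onSentinel models what happens when the lexer reads the sentinel symbol.
// fill stands in for YYFILL and returns zero on success.
// accepted: a shorter lexeme has already been matched.
// initial: the lexer is in its initial state.
func onSentinel(cursor, limit int, fill func() int, accepted, initial bool) Outcome {
	if !(limit <= cursor) { // the check YYLIMIT <= YYCURSOR is not satisfied
		return Continue
	}
	if fill() == 0 {
		return Rematch
	}
	switch {
	case accepted:
		return Rollback
	case !initial:
		return DefaultRule
	default:
		return EOFRule
	}
}

func main() {
	noMore := func() int { return 1 } // YYFILL fails: end of input
	more := func() int { return 0 }   // YYFILL succeeds

	fmt.Println(onSentinel(3, 10, noMore, false, false) == Continue)
	fmt.Println(onSentinel(10, 10, more, false, false) == Rematch)
	fmt.Println(onSentinel(10, 10, noMore, true, false) == Rollback)
	fmt.Println(onSentinel(10, 10, noMore, false, false) == DefaultRule)
	fmt.Println(onSentinel(10, 10, noMore, false, true) == EOFRule)
}
```

       Each line printed by main is true. The model makes one point from the
       text explicit: the EOF rule is selected only when the end of input is
       reached in the initial state.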
       Below is an example of using EOF rule. Configuration re2c:yyfill:enable
       = 0; suppresses generation of YYFILL calls (but not the bounds checks).

          //go:generate re2go $INPUT -o $OUTPUT
          package main

          import "testing"

          // Expects a null-terminated string.
          func lex(str string) int {
              var cursor, marker int
              limit := len(str) - 1 // limit points at the terminating null
              count := 0
          loop:
              /*!re2c
              re2c:yyfill:enable = 0;
              re2c:eof = 0;
              re2c:define:YYCTYPE    = byte;
              re2c:define:YYPEEK     = "str[cursor]";
              re2c:define:YYSKIP     = "cursor += 1";
              re2c:define:YYBACKUP   = "marker = cursor";
              re2c:define:YYRESTORE  = "cursor = marker";
              re2c:define:YYLESSTHAN = "limit <= cursor";

              *                           { return -1 }
              $                           { return count }
              ['] ([^'\\] | [\\][^])* ['] { count += 1; goto loop }
              [ ]+                        { goto loop }
              */
          }

          func TestLex(t *testing.T) {
              var tests = []struct {
                  res int
                  str string
              }{
                  {0, "\000"},
                  {3, "'qu\000tes' 'are' 'fine: \\'' \000"},
                  {-1, "'unterminated\\'\000"},
              }

              for _, x := range tests {
                  t.Run(x.str, func(t *testing.T) {
                      res := lex(x.str)
                      if res != x.res {
                          t.Errorf("got %d, want %d", res, x.res)
                      }
                  })
              }
          }


   Using generic API
       Generic API can be used with any of the above methods. It also allows
       one to use a user-defined method by placing EOF checks in one of the
       basic primitives. Usually this is either YYSKIP (the check is
       performed when advancing to the next input character), or YYPEEK (the
       check is performed when reading the next input character). The
       resulting methods are inefficient, as they check on each input
       character. However, they can be useful in cases when the input cannot
       be buffered or padded and does not contain a sentinel character at the
       end. One should be cautious when using such ad-hoc methods, as it is
       easy to overlook some corner cases and come up with a method that only
       partially works. It should also be noted that not everything can be
       expressed via generic API: for example, it is impossible to reimplement
       the way EOF rule works (in particular, it is impossible to re-match the
       character after a successful YYFILL).

       Below is an example of using YYPEEK to perform bounds checking without
       padding. YYFILL generation is suppressed using the re2c:yyfill:enable
       = 0; configuration. Note that if the grammar were more complex, this
       method might not work when two rules overlap and the EOF check fails
       after a shorter lexeme has already been matched (as it happens, in our
       example there are no overlapping rules).

          //go:generate re2go $INPUT -o $OUTPUT
          package main

          import "testing"

          // Returns "fake" terminating null if cursor has reached limit.
          func peek(str string, cursor int, limit int) byte {
              if cursor >= limit {
                  return 0 // fake null
              } else {
                  return str[cursor]
              }
          }

          // Expects a string without terminating null.
          func lex(str string) int {
              var cursor, marker int
              limit := len(str)
              count := 0
          loop:
              /*!re2c
              re2c:yyfill:enable = 0;
              re2c:eof = 0;
              re2c:define:YYCTYPE    = byte;
              re2c:define:YYLESSTHAN = "cursor >= limit";
              re2c:define:YYPEEK     = "peek(str, cursor, limit)";
              re2c:define:YYSKIP     = "cursor += 1";
              re2c:define:YYBACKUP   = "marker = cursor";
              re2c:define:YYRESTORE  = "cursor = marker";

              *                           { return -1 }
              $                           { return count }
              ['] ([^'\\] | [\\][^])* ['] { count += 1; goto loop }
              [ ]+                        { goto loop }
              */
          }

          func TestLex(t *testing.T) {
              var tests = []struct {
                  res int
                  str string
              }{
                  {0, ""},
                  {3, "'qu\000tes' 'are' 'fine: \\'' "},
                  {-1, "'unterminated\\'"},
              }

              for _, x := range tests {
                  t.Run(x.str, func(t *testing.T) {
                      res := lex(x.str)
                      if res != x.res {
                          t.Errorf("got %d, want %d", res, x.res)
                      }
                  })
              }
          }



BUFFER REFILLING

       The need for buffering arises when the input cannot be mapped in memory
       all at once: either it is too large, or it comes in a streaming fashion
       (like reading from a socket). The usual technique in such cases is to
       allocate a fixed-size memory buffer and process input in chunks that
       fit into the buffer. When the current chunk is processed, it is moved
       out and new data is moved in. In practice it is somewhat more complex,
       because lexer state consists not of a single input position, but of a
       set of interrelated positions:

       · cursor: the next input character to be read (YYCURSOR in default API
         or YYSKIP/YYPEEK in generic API)

       · limit: the position after the last available input character (YYLIMIT
         in default API, implicitly handled by YYLESSTHAN in generic API)

       · marker: the position of the most recent match, if any (YYMARKER in
         default API or YYBACKUP/YYRESTORE in generic API)

       · token: the start of the current lexeme (implicit in re2c API, as it
         is not needed for the normal lexer operation and can be defined and
         updated by the user)

       · context marker: the position of the trailing context (YYCTXMARKER in
         default API or YYBACKUPCTX/YYRESTORECTX in generic API)

       · tag variables: submatch positions (defined with /*!stags:re2c*/ and
         /*!mtags:re2c*/ directives and YYSTAGP/YYSTAGN/YYMTAGP/YYMTAGN in
         generic API)

       Not all these are used in every case, but if used, they must be updated
       by YYFILL. All active positions are contained in the segment between
       token and cursor, therefore everything between buffer start and token
       can be discarded, the segment from token up to limit should be moved to
       the beginning of the buffer, and the free space at the end of the
       buffer should be filled with new data. In order to avoid frequent
       YYFILL calls it is best to fill in as many input characters as possible
       (even though fewer characters might suffice to resume the lexer). The
       details of YYFILL implementation are slightly different depending on
       which EOF handling method is used: the case of EOF rule is somewhat
       simpler than the case of bounds-checking with padding. Also note that
       if the -f --storable-state option is used, YYFILL has slightly
       different semantics (described in the section about storable state).

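       The shift step described above can be isolated in a few lines of plain
       Go. This is a minimal sketch under assumptions: the Buffer type and its
       shift method are hypothetical names, only four of the positions are
       modelled, and reading new data into the freed space is left out.

```go
package main

import "fmt"

// Buffer is a hypothetical container for some of the positions above.
type Buffer struct {
	data                         []byte
	token, marker, cursor, limit int
}

// shift discards everything before token, moves data[token:limit] to the
// start of the buffer and moves every position back by the same offset.
// It returns the number of bytes freed at the end of the buffer.
func (b *Buffer) shift() int {
	free := b.token
	copy(b.data, b.data[b.token:b.limit]) // copy handles overlapping slices
	b.marker -= free
	b.cursor -= free
	b.limit -= free
	b.token = 0
	return free
}

func main() {
	// "xxxx" is already processed; the current lexeme starts at offset 4.
	b := &Buffer{data: []byte("xxxx'abc"), token: 4, marker: 5, cursor: 8, limit: 8}
	free := b.shift()
	fmt.Println(free, string(b.data[:b.limit]), b.token, b.marker, b.cursor, b.limit)
	// prints: 4 'abc 0 1 4 4
}
```

       All positions keep their distance from token, so the lexer can resume
       as if nothing had moved. If token is zero, shift frees nothing, which
       corresponds to a lexeme that is too long for the buffer.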
   YYFILL with EOF rule
       If EOF rule is used, YYFILL is a function-like primitive that accepts
       no arguments and returns a value which is checked against zero. YYFILL
       invocation is triggered by condition YYLIMIT <= YYCURSOR in default API
       and YYLESSTHAN() in generic API. A non-zero return value means that
       YYFILL has failed. A successful YYFILL call must supply at least one
       character and adjust input positions accordingly. Limit must always be
       set to one after the last input position in the buffer, and the
       character at the limit position must be the sentinel symbol specified
       by the re2c:eof configuration. The pictures below show the relative
       locations of input positions in the buffer before and after a YYFILL
       call (the sentinel symbol is marked with #, and the second picture
       shows the case when there is not enough input to fill the whole
       buffer).

                         <-- shift -->
                       >-A------------B---------C-------------D#-----------E->
                       buffer       token    marker         limit,
                                                            cursor
          >-A------------B---------C-------------D------------E#->
                       buffer,  marker        cursor        limit
                       token

                         <-- shift -->
                       >-A------------B---------C-------------D#--E (EOF)
                       buffer       token    marker         limit,
                                                            cursor
          >-A------------B---------C-------------D---E#........
                       buffer,  marker       cursor limit
                       token

       Here is an example of a program that reads input file input.txt in
       chunks and uses EOF rule. The buffer size SIZE is intentionally small
       (16 bytes) to trigger buffer refills.

          //go:generate re2go $INPUT -o $OUTPUT
          package main

          import (
              "os"
              "testing"
          )

          // Intentionally small to trigger buffer refill.
          const SIZE int = 16

          type Input struct {
              file   *os.File
              data   []byte
              cursor int
              marker int
              token  int
              limit  int
              eof    bool
          }

          func fill(in *Input) int {
              // If nothing can be read, fail.
              if in.eof {
                  return 1
              }

              // Check if at least some space can be freed.
              if in.token == 0 {
                  // In real life can reallocate a larger buffer.
                  panic("fill error: lexeme too long")
              }

              // Discard everything up to the start of the current lexeme,
              // shift buffer contents and adjust offsets.
              copy(in.data[0:], in.data[in.token:in.limit])
              in.cursor -= in.token
              in.marker -= in.token
              in.limit -= in.token
              in.token = 0

              // Read new data (as much as possible to fill the buffer).
              n, _ := in.file.Read(in.data[in.limit:SIZE])
              in.limit += n
              in.data[in.limit] = 0

              // If read less than expected, this is the end of input.
              in.eof = in.limit < SIZE

              // If nothing has been read, fail.
              if n == 0 {
                  return 1
              }

              return 0
          }

          func lex(in *Input) int {
              count := 0
          loop:
              in.token = in.cursor
              /*!re2c
              re2c:eof = 0;
              re2c:define:YYCTYPE    = byte;
              re2c:define:YYPEEK     = "in.data[in.cursor]";
              re2c:define:YYSKIP     = "in.cursor += 1";
              re2c:define:YYBACKUP   = "in.marker = in.cursor";
              re2c:define:YYRESTORE  = "in.cursor = in.marker";
              re2c:define:YYLESSTHAN = "in.limit <= in.cursor";
              re2c:define:YYFILL     = "fill(in) == 0";

              *                           { return -1 }
              $                           { return count }
              ['] ([^'\\] | [\\][^])* ['] { count += 1; goto loop }
              [ ]+                        { goto loop }
              */
          }

          // Prepare a file with the input text and run the lexer.
          func test(data string) (result int) {
              tmpfile := "input.txt"

              f, _ := os.Create(tmpfile)
              f.WriteString(data)
              f.Seek(0, 0)

              defer func() {
                  if r := recover(); r != nil {
                      result = -2
                  }
                  f.Close()
                  os.Remove(tmpfile)
              }()

              in := &Input{
                  file:   f,
                  data:   make([]byte, SIZE+1),
                  cursor: SIZE,
                  marker: SIZE,
                  token:  SIZE,
                  limit:  SIZE,
                  eof:    false,
              }

              return lex(in)
          }

          func TestLex(t *testing.T) {
              var tests = []struct {
                  res int
                  str string
              }{
                  {0, ""},
                  {2, "'one' 'two'"},
                  {3, "'qu\000tes' 'are' 'fine: \\'' "},
                  {-1, "'unterminated\\'"},
                  {-2, "'loooooooooooong'"},
              }

              for _, x := range tests {
                  t.Run(x.str, func(t *testing.T) {
                      res := test(x.str)
                      if res != x.res {
                          t.Errorf("got %d, want %d", res, x.res)
                      }
                  })
              }
          }


   YYFILL with padding
       In the default case (when EOF rule is not used) YYFILL is a
       function-like primitive that accepts a single argument and does not
       return any value. YYFILL invocation is triggered by condition
       (YYLIMIT - YYCURSOR) < n in default API and YYLESSTHAN(n) in generic
       API. The argument passed to YYFILL is the minimal number of characters
       that must be supplied. If it fails to do so, YYFILL must not return to
       the lexer (for that reason it is best implemented as a macro that
       returns from the calling function on failure). In case of a successful
       YYFILL invocation the limit position must be set either to one after
       the last input position in the buffer, or to the end of YYMAXFILL
       padding (in case YYFILL has successfully read at least n characters,
       but not enough to fill the entire buffer). The pictures below show the
       relative locations of input positions in the buffer before and after
       YYFILL invocation (YYMAXFILL padding in the second picture is marked
       with # symbols).

                         <-- shift -->                 <-- need -->
                       >-A------------B---------C-----D-------E---F--------G->
                       buffer       token    marker cursor  limit

          >-A------------B---------C-----D-------E---F--------G->
                       buffer,  marker cursor               limit
                       token

                         <-- shift -->                 <-- need -->
                       >-A------------B---------C-----D-------E-F        (EOF)
                       buffer       token    marker cursor  limit

          >-A------------B---------C-----D-------E-F###############
                       buffer,  marker cursor                   limit
                       token                        <- YYMAXFILL ->

       Here is an example of a program that reads input file input.txt in
       chunks and uses bounds-checking with padding. The buffer size SIZE is
       intentionally small (16 bytes) to trigger buffer refills.

          //go:generate re2go $INPUT -o $OUTPUT
          package main

          import (
              "fmt"
              "os"
              "testing"
          )

          /*!max:re2c*/

          // Intentionally small to trigger buffer refill.
          const SIZE int = 16

          type Input struct {
              file   *os.File
              data   []byte
              cursor int
              marker int
              token  int
              limit  int
              eof    bool
          }

          func fill(in *Input, need int) int {
              // End of input has already been reached, nothing to do.
              if in.eof {
                  return -1 // Error: unexpected EOF
              }

              // Check if after moving the current lexeme to the beginning
              // of buffer there will be enough free space.
              if SIZE-(in.cursor-in.token) < need {
                  return -2 // Error: lexeme too long
              }

              // Discard everything up to the start of the current lexeme,
              // shift buffer contents and adjust offsets.
              copy(in.data[0:], in.data[in.token:in.limit])
              in.cursor -= in.token
              in.marker -= in.token
              in.limit -= in.token
              in.token = 0

              // Read new data (as much as possible to fill the buffer).
              n, _ := in.file.Read(in.data[in.limit:SIZE])
              in.limit += n

              // If read less than expected, this is the end of input.
              in.eof = in.limit < SIZE

              // If end of input, add padding so that the lexer can read
              // the remaining characters at the end of buffer.
              if in.eof {
                  for i := 0; i < YYMAXFILL; i += 1 {
                      in.data[in.limit+i] = 0
                  }
                  in.limit += YYMAXFILL
              }

              return 0
          }

          func lex(in *Input) int {
              count := 0
          loop:
              in.token = in.cursor
              /*!re2c
              re2c:define:YYCTYPE    = byte;
              re2c:define:YYPEEK     = "in.data[in.cursor]";
              re2c:define:YYSKIP     = "in.cursor += 1";
              re2c:define:YYBACKUP   = "in.marker = in.cursor";
              re2c:define:YYRESTORE  = "in.cursor = in.marker";
              re2c:define:YYLESSTHAN = "in.limit-in.cursor < @@{len}";
              re2c:define:YYFILL     = "if r := fill(in, @@{len}); r != 0 { return r }";

              *                           { return -1 }
              [\x00]                      { return count }
              ['] ([^'\\] | [\\][^])* ['] { count += 1; goto loop }
              [ ]+                        { goto loop }
              */
          }

          // Prepare a file with the input text and run the lexer.
          func test(data string) (result int) {
              tmpfile := "input.txt"

              f, _ := os.Create(tmpfile)
              f.WriteString(data)
              f.Seek(0, 0)

              defer func() {
                  if r := recover(); r != nil {
                      fmt.Println(r)
                      result = -2
                  }
                  f.Close()
                  os.Remove(tmpfile)
              }()

              in := &Input{
                  file:   f,
                  data:   make([]byte, SIZE+YYMAXFILL),
                  cursor: SIZE,
                  marker: SIZE,
                  token:  SIZE,
                  limit:  SIZE,
                  eof:    false,
              }

              return lex(in)
          }

          func TestLex(t *testing.T) {
              var tests = []struct {
                  res int
                  str string
              }{
                  {0, ""},
                  {2, "'one' 'two'"},
                  {3, "'qu\000tes' 'are' 'fine: \\'' "},
                  {-1, "'unterminated\\'"},
                  {-2, "'loooooooooooong'"},
              }

              for _, x := range tests {
                  t.Run(x.str, func(t *testing.T) {
                      res := test(x.str)
                      if res != x.res {
                          t.Errorf("got %d, want %d", res, x.res)
                      }
                  })
              }
          }



INCLUDE FILES

       Re2c allows one to include other files using the directive
       /*!include:re2c FILE */, where FILE is the name of the file to be
       included. Re2c looks for included files in the directory of the
       including file and in include locations, which can be specified with
       the -I option. The re2c include directive works in the same way as
       C/C++ #include: the contents of FILE are copy-pasted verbatim in place
       of the directive. Include files may have further includes of their own.
       Re2c provides some predefined include files that can be found in the
       include/ subdirectory of the project. These files contain definitions
       that can be useful to other projects (such as Unicode categories) and
       form something like a standard library for re2c. Here is an example:

   Include file (definitions.go)
          const (
              ResultOk = iota
              ResultFail
          )

          /*!re2c
              number = [1-9][0-9]*;
          */


2012   Input file
2013          //go:generate re2go -c $INPUT -o $OUTPUT -i
2014          package main
2015
2016          import "testing"
2017          /*!include:re2c "definitions.go" */
2018
2019          func lex(str string) int {
2020              var cursor int
2021              /*!re2c
2022              re2c:yyfill:enable  = 0;
2023              re2c:define:YYCTYPE = byte;
2024              re2c:define:YYPEEK  = "str[cursor]";
2025              re2c:define:YYSKIP  = "cursor += 1";
2026
2027              number { return ResultOk }
2028              *      { return ResultFail }
2029              */
2030          }
2031
2032          func TestLex(t *testing.T) {
2033              if lex("123\000") != ResultOk {
2034                  t.Errorf("error")
2035              }
2036          }
2037
2038

HEADER FILES

       Re2c allows one to generate a header file from the input .re file using
       the option -t, --type-header or the configuration
       re2c:flags:type-header, together with the directives
       /*!header:re2c:on*/ and /*!header:re2c:off*/. The first directive marks
       the beginning of the header file, and the second directive marks the
       end of it. Everything between these directives is processed by re2c,
       and the generated code is written to the file specified by the -t
       --type-header option (or to stdout if this option was not used). An
       autogenerated header file may be needed when re2c is used to generate
       definitions of constants, variables and structs that must be visible
       from other translation units.

       Here is an example of generating a header file that contains the
       definition of the lexer state with tag variables (the number of
       variables depends on the regular grammar and is unknown to the
       programmer).
2054
2055   Input file
2056          //go:generate re2go $INPUT -o $OUTPUT -i --type-header src/lexer/lexer.go
2057          package main
2058
2059          import (
2060              "lexer" // generated by re2c
2061              "testing"
2062          )
2063
2064          /*!header:re2c:on*/
2065          package lexer
2066
2067          type State struct {
2068              Data string
2069              Cur, Mar, /*!stags:re2c format="@@{tag}"; separator=", "; */ int
2070          }
2071          /*!header:re2c:off*/
2072
2073          func lex(st *lexer.State) int {
2074              /*!re2c
2075              re2c:flags:type-header = "src/lexer/lexer.go";
2076              re2c:yyfill:enable = 0;
2077              re2c:flags:tags = 1;
2078              re2c:define:YYCTYPE      = byte;
2079              re2c:define:YYPEEK       = "st.Data[st.Cur]";
2080              re2c:define:YYSKIP       = "st.Cur++";
2081              re2c:define:YYBACKUP     = "st.Mar = st.Cur";
2082              re2c:define:YYRESTORE    = "st.Cur = st.Mar";
2083              re2c:define:YYRESTORETAG = "st.Cur = @@{tag}";
2084              re2c:define:YYSTAGP      = "@@{tag} = st.Cur";
2085              re2c:tags:expression     = "st.@@{tag}";
2086              re2c:tags:prefix         = "Tag";
2087
2088              [x]{1,4} / [x]{3,5} { return 0 } // ambiguous trailing context
2089              *                   { return 1 }
2090              */
2091          }
2092
2093          func TestLex(t *testing.T) {
2094              st := &lexer.State{
2095                  Data: "xxxxxxxx\x00",
2096              }
2097              if !(lex(st) == 0 && st.Cur == 4) {
2098                  t.Error("failed")
2099              }
2100          }
2101
2102
2103   Header file
2104          // Code generated by re2c, DO NOT EDIT.
2105
2106          package lexer
2107
2108          type State struct {
2109              Data string
2110              Cur, Mar, Tag1, Tag2, Tag3 int
2111          }
2112
2113

SUBMATCH EXTRACTION

2115       Re2c has two options for submatch extraction.
2116
       The first option is -T --tags. With this option one can use standalone
       tags of the form @stag and #mtag, where stag and mtag are arbitrary
       user-defined names. Tags can be used anywhere inside a regular
       expression; semantically they are just position markers. Tags of the
       form @stag are called s-tags: they denote a single submatch value (the
       last input position where this tag matched). Tags of the form #mtag are
       called m-tags: they denote multiple submatch values (the whole history
       of repetitions of this tag). All tags should be defined by the user as
       variables with the corresponding names. With standalone tags re2c uses
       leftmost greedy disambiguation: submatch positions correspond to the
       leftmost matching path through the regular expression.
2128
       The second option is -P --posix-captures: it enables POSIX-compliant
       capturing groups. In this mode parentheses in regular expressions
       denote the beginning and the end of capturing groups; the whole regular
       expression is group number zero. The number of groups for the matching
       rule is stored in the variable yynmatch, and submatch results are
       stored in the yypmatch array. Both yynmatch and yypmatch should be
       defined by the user, and the size of yypmatch must be at least
       [yynmatch * 2]. Re2c provides a directive /*!maxnmatch:re2c*/ that
       defines YYMAXNMATCH: a constant equal to the maximal value of yynmatch
       among all rules. Note that re2c implements POSIX-compliant
       disambiguation: each subexpression matches as long as possible, and
       subexpressions that start earlier in the regular expression have
       priority over those starting later. Capturing groups are translated
       into s-tags under the hood, therefore we use the word "tag" to describe
       them as well.
2143
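       The yypmatch layout can be sketched in plain Go. The helper groupText
       below is hypothetical (it is not part of re2c). It assumes that group i
       occupies yypmatch[2*i] and yypmatch[2*i+1], as in the POSIX example in
       this section, and that -1 marks an unmatched group, which holds only if
       YYSTAGN is configured to store -1 as in that example.

```go
package main

import "fmt"

// groupText returns the text of every capturing group, assuming group i
// occupies yypmatch[2*i] (start offset) and yypmatch[2*i+1] (end offset),
// and that -1 marks a group that did not take part in the match.
func groupText(str string, yypmatch []int, yynmatch int) []string {
    out := make([]string, yynmatch)
    for i := 0; i < yynmatch; i++ {
        start, end := yypmatch[2*i], yypmatch[2*i+1]
        if start < 0 || end < 0 {
            continue // unmatched group: leave the empty string
        }
        out[i] = str[start:end]
    }
    return out
}

func main() {
    // Offsets written out by hand for this input; in a real lexer they are
    // filled in by the code that re2c generates.
    str := "1.2.3.4\x00"
    yypmatch := []int{0, 8, 0, 1, 2, 3, 4, 5, 6, 7}
    fmt.Printf("%q\n", groupText(str, yypmatch, 5))
}
```

       Group zero covers the whole match, including the terminating NUL that
       the rule consumes; groups one to four are the octets.
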
       With both the -P --posix-captures and -T --tags options re2c uses the
       efficient submatch extraction algorithm described in the Tagged
       Deterministic Finite Automata with Lookahead paper. The overhead of
       submatch extraction in the generated lexer grows with the number of
       tags; if this number is moderate, the overhead is barely noticeable.

       In the lexer, tags are implemented using a number of tag variables
       generated by re2c. There is no one-to-one correspondence between tag
       variables and tags: a single variable may be reused for different tags,
       and one tag may require multiple variables to hold all its ambiguous
       values. Eventually the ambiguity is resolved, and only one final
       variable per tag survives. When a rule matches, all its tags are set to
       the values of the corresponding tag variables. The exact number of tag
       variables is unknown to the user; this number is determined by re2c.
       However, tag variables should be defined by the user as a part of the
       lexer state and updated by YYFILL, therefore re2c provides the
       directives /*!stags:re2c*/ and /*!mtags:re2c*/ that can be used to
       declare, initialize and manipulate tag variables. These directives have
       two optional configurations: format = "@@"; (specifies the template
       where @@ is substituted with the name of each tag variable), and
       separator = ""; (specifies the piece of code used to join the generated
       pieces for different tag variables).
2164
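       The effect of the format and separator configurations can be imitated
       in plain Go. The function expandTags below is a hypothetical helper
       written for illustration; it is not how re2c is implemented, and the
       tag variable names passed to it are assumptions (the real names and
       their number are chosen by re2c).

```go
package main

import (
    "fmt"
    "strings"
)

// expandTags instantiates the format template once per tag variable and
// joins the pieces with the separator, following the description of the
// stags/mtags directives. Illustration only, not re2c's implementation.
func expandTags(format, separator string, names []string) string {
    pieces := make([]string, 0, len(names))
    for _, name := range names {
        // Accept both placeholder spellings used in this manual.
        piece := strings.ReplaceAll(format, "@@{tag}", name)
        piece = strings.ReplaceAll(piece, "@@", name)
        pieces = append(pieces, piece)
    }
    return strings.Join(pieces, separator)
}

func main() {
    // Hypothetical tag variable names.
    names := []string{"Tag1", "Tag2", "Tag3"}
    fmt.Println("Cur, Mar, " + expandTags("@@{tag}", ", ", names) + " int")
    fmt.Println(expandTags("var @@ int", "\n", names))
}
```

       The first line printed has the same shape as the struct field list in
       the generated header file shown in the HEADER FILES section.
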
2165       S-tags support the following operations:
2166
2167       · save input position to an s-tag: t = YYCURSOR with default API  or  a
2168         user-defined operation YYSTAGP(t) with generic API
2169
2170       · save  default  value  to  an  s-tag:  t  = NULL with default API or a
2171         user-defined operation YYSTAGN(t) with generic API
2172
2173       · copy one s-tag to another: t1 = t2
2174
2175       M-tags support the following operations:
2176
       · append input position to an m-tag: a user-defined operation
         YYMTAGP(t) with both default and generic API
2179
2180       · append default value to an m-tag: a user-defined operation YYMTAGN(t)
2181         with both default and generic API
2182
2183       · copy one m-tag to another: t1 = t2
2184
       S-tags can be implemented as scalar values (pointers or offsets).
       M-tags need a more complex representation, as they need to store a
       sequence of tag values. The most naive and inefficient representation
       of an m-tag is a list (array, vector) of tag values; a more efficient
       representation is to store all m-tags in a prefix tree represented as
       an array of nodes (v, p), where v is the tag value and p is a pointer
       to the parent node.
2192
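       A minimal sketch of the prefix-tree representation follows. The type
       and function names (node, trie, add, history) are made up for this
       illustration. The point it tries to show is that copying an m-tag
       copies a single index, and that histories with a common prefix share
       nodes instead of duplicating them.

```go
package main

import "fmt"

const root = -1 // index meaning "empty history"

// node is one element of the shared prefix tree: a tag value plus the index
// of the node holding the previous value in the same history.
type node struct {
    val, pred int
}

type trie []node

// add appends val to the history ending at index hist and returns the index
// of the new last node. Existing nodes are never modified.
func (t *trie) add(hist, val int) int {
    *t = append(*t, node{val, hist})
    return len(*t) - 1
}

// history follows parent links back to the root and returns the values
// oldest first.
func (t trie) history(hist int) []int {
    var vals []int
    for i := hist; i != root; i = t[i].pred {
        vals = append(vals, t[i].val)
    }
    for l, r := 0, len(vals)-1; l < r; l, r = l+1, r-1 {
        vals[l], vals[r] = vals[r], vals[l]
    }
    return vals
}

func main() {
    var t trie
    t1 := t.add(root, 1) // t1 history: [1]
    t2 := t1             // copying an m-tag copies one index, not a list
    t1 = t.add(t1, 5)    // t1 history: [1 5]
    t2 = t.add(t2, 9)    // t2 history: [1 9]; the node for 1 is shared
    fmt.Println(t.history(t1), t.history(t2), len(t))
}
```

       Two histories of length two are stored in three nodes here; with a
       list per m-tag the copy would duplicate the whole history.
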
2193       Here is an example of using s-tags to parse an IPv4 address.
2194
2195          //go:generate re2go $INPUT -o $OUTPUT
2196          package main
2197
2198          import (
2199              "errors"
2200              "testing"
2201          )
2202
2203          var eBadIP error = errors.New("bad IP")
2204
2205          func lex(str string) (int, error) {
2206              var cursor, marker, o1, o2, o3, o4 int
2207              /*!stags:re2c format = 'var @@ int'; separator = "\n\t"; */
2208
2209              num := func(pos int, end int) int {
2210                  n := 0
2211                  for ; pos < end; pos++ {
2212                      n = n*10 + int(str[pos]-'0')
2213                  }
2214                  return n
2215              }
2216
2217              /*!re2c
2218              re2c:flags:tags = 1;
2219              re2c:yyfill:enable = 0;
2220              re2c:define:YYCTYPE   = byte;
2221              re2c:define:YYPEEK    = "str[cursor]";
2222              re2c:define:YYSKIP    = "cursor += 1";
2223              re2c:define:YYBACKUP  = "marker = cursor";
2224              re2c:define:YYRESTORE = "cursor = marker";
2225              re2c:define:YYSTAGP   = "@@{tag} = cursor";
2226              re2c:define:YYSTAGN   = "@@{tag} = -1";
2227
2228              octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
2229              dot = [.];
2230              end = [\x00];
2231
2232              @o1 octet dot @o2 octet dot @o3 octet dot @o4 octet end {
2233                  return num(o4, cursor-1)+
2234                      (num(o3, o4-1) << 8)+
2235                      (num(o2, o3-1) << 16)+
2236                      (num(o1, o2-1) << 24), nil
2237              }
2238              * { return 0, eBadIP }
2239              */
2240          }
2241
2242          func TestLex(t *testing.T) {
2243              var tests = []struct {
2244                  str string
2245                  res int
2246                  err error
2247              }{
2248                  {"1.2.3.4\000", 0x01020304, nil},
2249                  {"127.0.0.1\000", 0x7f000001, nil},
2250                  {"255.255.255.255\000", 0xffffffff, nil},
2251                  {"1.2.3.\000", 0, eBadIP},
2252                  {"1.2.3.256\000", 0, eBadIP},
2253              }
2254
2255              for _, x := range tests {
2256                  t.Run(x.str, func(t *testing.T) {
2257                      res, err := lex(x.str)
2258                      if !(res == x.res && err == x.err) {
2259                          t.Errorf("got %d, want %d", res, x.res)
2260                      }
2261                  })
2262              }
2263          }
2264
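       The arithmetic in the semantic action of the s-tags example above can
       be checked without running re2c. In the sketch below the tag values are
       written out by hand for one input; pack is a hypothetical wrapper around
       the same expression, and num takes the string as an explicit parameter
       instead of capturing it in a closure.

```go
package main

import "fmt"

// num converts the decimal digits in str[pos:end] to an integer.
func num(str string, pos, end int) int {
    n := 0
    for ; pos < end; pos++ {
        n = n*10 + int(str[pos]-'0')
    }
    return n
}

// pack reproduces the arithmetic of the semantic action: each tag marks the
// start of an octet, and the octet ends one byte before the next tag (the
// dot) or one byte before the final cursor (the terminating NUL).
func pack(str string, o1, o2, o3, o4, cursor int) int {
    return num(str, o4, cursor-1) +
        (num(str, o3, o4-1) << 8) +
        (num(str, o2, o3-1) << 16) +
        (num(str, o1, o2-1) << 24)
}

func main() {
    // Tag values for "127.0.0.1\x00": octets start at offsets 0, 4, 6, 8
    // and the cursor stops at 10, just past the NUL.
    fmt.Printf("%#x\n", pack("127.0.0.1\x00", 0, 4, 6, 8, 10))
}
```

       The program prints 0x7f000001, the value expected by the test table of
       the example.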
2265
2266       Here is an example of using POSIX capturing groups  to  parse  an  IPv4
2267       address.
2268
2269          //go:generate re2go $INPUT -o $OUTPUT
2270          package main
2271
2272          import (
2273              "errors"
2274              "testing"
2275          )
2276
2277          /*!maxnmatch:re2c*/
2278
2279          var eBadIP error = errors.New("bad IP")
2280
2281          func lex(str string) (int, error) {
2282              var cursor, marker, yynmatch int
2283              yypmatch := make([]int, YYMAXNMATCH*2)
2284              /*!stags:re2c format = 'var @@ int'; separator = "\n\t"; */
2285
2286              num := func(pos int, end int) int {
2287                  n := 0
2288                  for ; pos < end; pos++ {
2289                      n = n*10 + int(str[pos]-'0')
2290                  }
2291                  return n
2292              }
2293
2294              /*!re2c
2295              re2c:flags:posix-captures = 1;
2296              re2c:yyfill:enable = 0;
2297              re2c:define:YYCTYPE     = byte;
2298              re2c:define:YYPEEK      = "str[cursor]";
2299              re2c:define:YYSKIP      = "cursor += 1";
2300              re2c:define:YYBACKUP    = "marker = cursor";
2301              re2c:define:YYRESTORE   = "cursor = marker";
2302              re2c:define:YYSTAGP     = "@@{tag} = cursor";
2303              re2c:define:YYSTAGN     = "@@{tag} = -1";
2304              re2c:define:YYSHIFTSTAG = "@@{tag} += @@{shift}";
2305
2306              octet = [0-9] | [1-9][0-9] | [1][0-9][0-9] | [2][0-4][0-9] | [2][5][0-5];
2307              dot = [.];
2308              end = [\x00];
2309
2310              (octet) dot (octet) dot (octet) dot (octet) end {
2311                  if yynmatch != 5 {
2312                      panic("expected 5 submatch groups")
2313                  }
2314                  return num(yypmatch[8], yypmatch[9])+
2315                      (num(yypmatch[6], yypmatch[7]) << 8)+
2316                      (num(yypmatch[4], yypmatch[5]) << 16)+
2317                      (num(yypmatch[2], yypmatch[3]) << 24), nil
2318              }
2319              * { return 0, eBadIP }
2320              */
2321          }
2322
2323          func TestLex(t *testing.T) {
2324              var tests = []struct {
2325                  str string
2326                  res int
2327                  err error
2328              }{
2329                  {"1.2.3.4\000", 0x01020304, nil},
2330                  {"127.0.0.1\000", 0x7f000001, nil},
2331                  {"255.255.255.255\000", 0xffffffff, nil},
2332                  {"1.2.3.\000", 0, eBadIP},
2333                  {"1.2.3.256\000", 0, eBadIP},
2334              }
2335
2336              for _, x := range tests {
2337                  t.Run(x.str, func(t *testing.T) {
2338                      res, err := lex(x.str)
2339                      if !(res == x.res && err == x.err) {
2340                          t.Errorf("got %d, want %d", res, x.res)
2341                      }
2342                  })
2343              }
2344          }
2345
2346
       Here is an example of using m-tags to parse a semicolon-separated
       sequence of words. Tag variables are stored in a tree that is packed
       in a slice.
2350
2351          //go:generate re2go $INPUT -o $OUTPUT
2352          package main
2353
2354          import (
2355              "reflect"
2356              "testing"
2357          )
2358
2359          const (
2360              mtagRoot int = -1
2361              mtagNil int = -2
2362          )
2363
2364          type mtagElem struct {
2365              val  int
2366              pred int
2367          }
2368
2369          type mtagTrie = []mtagElem
2370
2371          func createTrie(capacity int) mtagTrie {
2372              return make([]mtagElem, 0, capacity)
2373          }
2374
2375          func mtag(trie *mtagTrie, tag int, val int) int {
2376              *trie = append(*trie, mtagElem{val, tag})
2377              return len(*trie) - 1
2378          }
2379
          // Recursively unwind both tag histories and construct submatches.
2381          func unwind(trie mtagTrie, x int, y int, str string) []string {
2382              if x == mtagRoot && y == mtagRoot {
2383                  return []string{}
2384              } else if x == mtagRoot || y == mtagRoot {
2385                  panic("tag histories have different length")
2386              } else {
2387                  xval := trie[x].val
2388                  yval := trie[y].val
2389                  ss := unwind(trie, trie[x].pred, trie[y].pred, str)
2390
2391                  // Either both tags should be nil, or none of them.
2392                  if xval == mtagNil && yval == mtagNil {
2393                      return ss
2394                  } else if xval == mtagNil || yval == mtagNil {
2395                      panic("tag histories positive/negative tag mismatch")
2396                  } else {
2397                      s := str[xval:yval]
2398                      return append(ss, s)
2399                  }
2400              }
2401          }
2402
2403          func lex(str string) []string {
2404              var cursor, marker int
2405              trie := createTrie(256)
2406              x := mtagRoot
2407              y := mtagRoot
2408              /*!mtags:re2c format = "@@ := mtagRoot"; separator = "\n\t"; */
2409
2410              /*!re2c
2411              re2c:flags:tags = 1;
2412              re2c:yyfill:enable = 0;
2413              re2c:define:YYCTYPE   = byte;
2414              re2c:define:YYPEEK    = "str[cursor]";
2415              re2c:define:YYSKIP    = "cursor += 1";
2416              re2c:define:YYBACKUP  = "marker = cursor";
2417              re2c:define:YYRESTORE = "cursor = marker";
2418              re2c:define:YYMTAGP   = "@@{tag} = mtag(&trie, @@{tag}, cursor)";
2419              re2c:define:YYMTAGN   = "@@{tag} = mtag(&trie, @@{tag}, mtagNil)";
2420
2421              end = [\x00];
2422
2423              (#x [a-z]+ #y [;])* end { return unwind(trie, x, y, str) }
2424              *                       { return nil }
2425              */
2426          }
2427
2428          func TestLex(t *testing.T) {
2429              var tests = []struct {
2430                  str string
2431                  res []string
2432              }{
2433                  {"\000", []string{}},
2434                  {"one;two;three;\000", []string{"one", "two", "three"}},
2435                  {"one;two\000", nil},
2436              }
2437
2438              for _, x := range tests {
2439                  t.Run(x.str, func(t *testing.T) {
2440                      res := lex(x.str)
2441                      if !reflect.DeepEqual(res, x.res) {
2442                          t.Errorf("got %v, want %v", res, x.res)
2443                      }
2444                  })
2445              }
2446          }
2447
2448

STORABLE STATE

       With the -f --storable-state option re2c generates a lexer that can
       store its current state, return to the caller, and later resume
       operations exactly where it left off. The default mode of operation in
       re2c is a "pull" model, in which the lexer "pulls" more input whenever
       it needs it. This may be unacceptable in cases when the input becomes
       available piece by piece (for example, if the lexer is invoked by the
       parser, or if the lexer program communicates via a socket protocol with
       some other program that must wait for a reply from the lexer before it
       transmits the next message). The storable state feature is intended
       exactly for such cases: it allows one to generate lexers that work in a
       "push" model. When the lexer needs more input, it stores its state and
       returns to the caller. Later, when more input becomes available, the
       caller resumes the lexer exactly where it stopped. There are a few
       changes necessary compared to the "pull" model:
2464
       · Define YYGETSTATE() and YYSETSTATE(state) primitives.
2466
2467       · Define yych, yyaccept and state variables as  a  part  of  persistent
2468         lexer state. The state variable should be initialized to -1.
2469
       · YYFILL should return to the outer program instead of trying to supply
         more input. The return code should indicate that the lexer needs more
         input.

       · The outer program should recognize situations when the lexer needs
         more input and respond appropriately.
2475
2476       · Use  /*!getstate:re2c*/  directive  if it is necessary to execute any
2477         code before entering the lexer.
2478
2479       · Use configurations state:abort and state:nextlabel to  further  tweak
2480         the generated code.
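
       The idea behind the "push" model can be sketched with a hand-written
       state machine. This is not what re2c generates: the type pushLexer, its
       fields and the status codes are invented for this illustration. The
       generated lexer stores and restores its state through YYSETSTATE and
       YYGETSTATE instead.

```go
package main

import "fmt"

const (
    needInput = iota // hypothetical status codes for this sketch
    broken
)

// pushLexer keeps everything that must survive between calls: here only the
// automaton state and the packet counter.
type pushLexer struct {
    state int // 0: at the start of a packet, 1: inside a packet
    count int // complete packets seen so far
}

// feed consumes one chunk of input and then returns to the caller instead of
// blocking for more. The next call resumes in the saved state.
func (l *pushLexer) feed(chunk string) int {
    for i := 0; i < len(chunk); i++ {
        c := chunk[i]
        switch {
        case 'a' <= c && c <= 'z':
            l.state = 1
        case c == ';' && l.state == 1:
            l.state = 0
            l.count++
        default:
            return broken
        }
    }
    return needInput
}

func main() {
    l := &pushLexer{}
    // A packet may be split across chunks at any point.
    for _, chunk := range []string{"ze", "ro;on", "e;", "two;"} {
        if l.feed(chunk) == broken {
            fmt.Println("broken packet")
            return
        }
    }
    fmt.Println(l.count, l.state)
}
```

       The caller owns the loop and decides when to supply input; the lexer
       only remembers where it was.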
2481
       Here is an example of a "push"-model lexer that expects a sequence of
       packets, where each packet is one or more lowercase letters followed by
       a semicolon. The test driver writes the packets to a file one at a time,
       and the lexer reads them back through a small buffer. When the lexer
       needs more input, YYFILL returns lexWaitingForInput to the driver; the
       driver sends the next packet (if there is one left), calls fill to
       refill the buffer, and resumes the lexer exactly where it stopped. The
       lexer terminates successfully when it fails to read any input while
       being in the initial state: this is the only case when the EOF rule
       matches. The driver then checks that the number of received packets
       equals the number of packets sent. Abnormal termination happens in case
       of a syntax error (lexPacketBroken) or in case the buffer is too small
       to hold a lexeme (lexPacketTooBig), for example if one of the packets
       exceeds the buffer size.
2502
2503   Example
2504          //go:generate re2go -f $INPUT -o $OUTPUT
2505          package main
2506
2507          import (
2508              "fmt"
2509              "os"
2510              "testing"
2511          )
2512
2513          // Intentionally small to trigger buffer refill.
2514          const SIZE int = 16
2515
2516          type Input struct {
2517              file     *os.File
2518              data     []byte
2519              cursor   int
2520              marker   int
2521              token    int
2522              limit    int
2523              state    int
2524              yyaccept int
2525          }
2526
2527          const (
2528              lexEnd = iota
2529              lexReady
2530              lexWaitingForInput
2531              lexPacketBroken
2532              lexPacketTooBig
2533              lexCountMismatch
2534          )
2535
2536          func fill(in *Input) int {
2537              if in.token == 0 {
2538                  // Error: no space can be freed.
2539                  // In real life can reallocate a larger buffer.
2540                  return lexPacketTooBig
2541              }
2542
2543              // Discard everything up to the start of the current lexeme,
2544              // shift buffer contents and adjust offsets.
2545              copy(in.data[0:], in.data[in.token:in.limit])
2546              in.cursor -= in.token
2547              in.marker -= in.token
2548              in.limit -= in.token
2549              in.token = 0
2550
2551              // Read new data (as much as possible to fill the buffer).
2552              n, _ := in.file.Read(in.data[in.limit:SIZE])
2553              in.limit += n
2554              in.data[in.limit] = 0 // append sentinel symbol
2555
2556              return lexReady
2557          }
2558
2559          func lex(in *Input, recv *int) int {
2560              var yych byte
2561              /*!getstate:re2c*/
2562          loop:
2563              in.token = in.cursor
2564              /*!re2c
2565              re2c:eof = 0;
2566              re2c:define:YYPEEK     = "in.data[in.cursor]";
2567              re2c:define:YYSKIP     = "in.cursor += 1";
2568              re2c:define:YYBACKUP   = "in.marker = in.cursor";
2569              re2c:define:YYRESTORE  = "in.cursor = in.marker";
2570              re2c:define:YYLESSTHAN = "in.limit <= in.cursor";
2571              re2c:define:YYFILL     = "return lexWaitingForInput";
2572              re2c:define:YYGETSTATE = "in.state";
2573              re2c:define:YYSETSTATE = "in.state = @@{state}";
2574
2575              packet = [a-z]+[;];
2576
2577              *      { return lexPacketBroken }
2578              $      { return lexEnd }
2579              packet { *recv = *recv + 1; goto loop }
2580              */
2581          }
2582
2583          func test(packets []string) int {
2584              fname := "pipe"
2585              fw, _ := os.Create(fname);
2586              fr, _ := os.Open(fname);
2587
2588              in := &Input{
2589                  file:   fr,
2590                  data:   make([]byte, SIZE+1),
2591                  cursor: SIZE,
2592                  marker: SIZE,
2593                  token:  SIZE,
2594                  limit:  SIZE,
2595                  state:  -1,
2596              }
2597              // data is zero-initialized, no need to write sentinel
2598
2599              var status int
2600              send := 0
2601              recv := 0
2602          loop:
2603              for {
2604                  status = lex(in, &recv)
2605                  if status == lexEnd {
2606                      if send != recv {
2607                          status = lexCountMismatch
2608                      }
2609                      break loop
2610                  } else if status == lexWaitingForInput {
2611                      if send < len(packets) {
2612                          fw.WriteString(packets[send])
2613                          send += 1
2614                      }
2615                      status = fill(in)
2616                      if status != lexReady {
2617                          break loop
2618                      }
2619                  } else if status == lexPacketBroken {
2620                      break loop
2621                  } else {
2622                      panic("unexpected status")
2623                  }
2624              }
2625
2626              fr.Close()
2627              fw.Close()
2628              os.Remove(fname)
2629
2630              return status
2631          }
2632
2633          func TestLex(t *testing.T) {
2634              var tests = []struct {
2635                  status  int
2636                  packets []string
2637              }{
2638                  {lexEnd, []string{}},
2639                  {lexEnd, []string{"zero;", "one;", "two;", "three;", "four;"}},
2640                  {lexPacketBroken, []string{"??;"}},
2641                  {lexPacketTooBig, []string{"looooooooooooong;"}},
2642              }
2643
2644              for i, x := range tests {
2645                  t.Run(fmt.Sprintf("%d", i), func(t *testing.T) {
2646                      status := test(x.packets)
2647                      if status != x.status {
2648                          t.Errorf("got %d, want %d", status, x.status)
2649                      }
2650                  })
2651              }
2652          }
2653
2654

REUSABLE BLOCKS

       Reuse mode is enabled with the -r --reusable option. In this mode re2c
       allows one to reuse definitions, configurations and rules specified by
       a /*!rules:re2c*/ block in subsequent /*!use:re2c*/ blocks. As of
       re2c-1.2 it is possible to mix such blocks with normal /*!re2c*/
       blocks; prior to that re2c expected a single rules-block followed by
       use-blocks (normal blocks were disallowed). Use-blocks can have
       additional definitions, configurations and rules: they are merged with
       those specified by the rules-block. A very common use case for the -r
       --reusable option is a lexer that supports multiple input encodings:
       lexer rules are defined once and reused multiple times with
       encoding-specific configurations, such as re2c:flags:utf-8.
2667
       Below is an example of a multi-encoding lexer: it reads a phrase with
       Unicode math symbols and accepts input either in UTF-8 or in UTF-32.
       Note that the --input-encoding utf8 option allows us to write
       UTF-8-encoded symbols in the regular expressions; without this option
       re2c would parse them as a plain ASCII byte sequence (and we would have
       to use hexadecimal escape sequences).
2674
2675   Example
2676          //go:generate re2go $INPUT -o $OUTPUT -r --input-encoding utf8
2677          package main
2678
2679          import "testing"
2680
2681          /*!rules:re2c
2682              re2c:yyfill:enable = 0;
2683              re2c:define:YYPEEK    = "str[cursor]";
2684              re2c:define:YYSKIP    = "cursor += 1";
2685              re2c:define:YYBACKUP  = "marker = cursor";
2686              re2c:define:YYRESTORE = "cursor = marker";
2687
2688              "∀x ∃y: p(x, y)" { return 0; }
2689              *                { return 1; }
2690          */
2691
2692          func lexUTF8(str []uint8) int {
2693              var cursor, marker int
2694              /*!use:re2c
2695              re2c:flags:8 = 1;
2696              re2c:define:YYCTYPE = uint8;
2697              */
2698          }
2699
2700          func lexUTF32(str []uint32) int {
2701              var cursor, marker int
2702              /*!use:re2c
2703              re2c:flags:u = 1;
2704              re2c:define:YYCTYPE = uint32;
2705              */
2706          }
2707
2708          func TestLexUTF8(t *testing.T) {
2709              s_utf8 := []uint8{
2710                  0xe2, 0x88, 0x80, 0x78, 0x20, 0xe2, 0x88, 0x83, 0x79,
2711                  0x3a, 0x20, 0x70, 0x28, 0x78, 0x2c, 0x20, 0x79, 0x29};
2712
2713              if lexUTF8(s_utf8) != 0 {
2714                  t.Errorf("utf8 failed")
2715              }
2716          }
2717
2718          func TestLexUTF32(t *testing.T) {
2719              s_utf32 := []uint32{
2720                  0x00002200, 0x00000078, 0x00000020, 0x00002203, 0x00000079,
2721                  0x0000003a, 0x00000020, 0x00000070, 0x00000028, 0x00000078,
2722                  0x0000002c, 0x00000020, 0x00000079, 0x00000029};
2723
2724              if lexUTF32(s_utf32) != 0 {
2725                  t.Errorf("utf32 failed")
2726              }
2727          }
2728
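       The two input arrays in the tests above are the same phrase in two
       encodings. The sketch below derives both from one Go string literal;
       codeUnits is a hypothetical helper, and it relies on the fact that Go
       string literals in a UTF-8 source file are UTF-8 encoded.

```go
package main

import "fmt"

// codeUnits returns the UTF-8 code units (bytes) and the UTF-32 code units
// (one per code point) of s.
func codeUnits(s string) ([]uint8, []uint32) {
    utf8Units := []uint8(s)
    var utf32Units []uint32
    for _, r := range s { // ranging over a string decodes code points
        utf32Units = append(utf32Units, uint32(r))
    }
    return utf8Units, utf32Units
}

func main() {
    u8, u32 := codeUnits("∀x ∃y: p(x, y)")
    fmt.Printf("% x\n", u8)
    fmt.Printf("%#x\n", u32)
    fmt.Println(len(u8), len(u32))
}
```

       The phrase takes 18 code units in UTF-8 and 14 in UTF-32, because each
       of the two math symbols needs three bytes in UTF-8.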
2729

ENCODING SUPPORT

       re2c supports the following encodings: ASCII (default), EBCDIC (-e),
       UCS-2 (-w), UTF-16 (-x), UTF-32 (-u) and UTF-8 (-8). See also the
       in-place configuration re2c:flags.
2734
       The following concepts should be clarified when talking about
       encodings. A code point is an abstract number that represents a single
       symbol. A code unit is the smallest unit of memory that is used in the
       encoded text (it corresponds to one character in the input stream). One
       or more code units may be needed to represent a single code point,
       depending on the encoding. In a fixed-length encoding, each code point
       is represented with an equal number of code units. In a variable-length
       encoding, different code points can be represented with a different
       number of code units.
2744
2745       · ASCII  is a fixed-length encoding. Its code space includes 0x100 code
2746         points, from 0 to 0xFF. A code point is represented with exactly  one
2747         1-byte  code  unit,  which  has the same value as the code point. The
2748         size of YYCTYPE must be 1 byte.
2749
2750       · EBCDIC is a fixed-length encoding. Its code space includes 0x100 code
2751         points,  from 0 to 0xFF. A code point is represented with exactly one
2752         1-byte code unit, which has the same value as  the  code  point.  The
2753         size of YYCTYPE must be 1 byte.
2754
2755       · UCS-2  is  a  fixed-length  encoding. Its code space includes 0x10000
2756         code points, from 0 to 0xFFFF. One code  point  is  represented  with
2757         exactly  one  2-byte  code unit, which has the same value as the code
2758         point. The size of YYCTYPE must be 2 bytes.
2759
2760       · UTF-16 is a variable-length encoding. Its  code  space  includes  all
2761         Unicode  code  points,  from 0 to 0xD7FF and from 0xE000 to 0x10FFFF.
2762         One code point is represented with one or two 2-byte code units.  The
2763         size of YYCTYPE must be 2 bytes.
2764
2765       · UTF-32  is  a fixed-length encoding. Its code space includes all Uni‐
2766         code code points, from 0 to 0xD7FF and from 0xE000 to  0x10FFFF.  One
2767         code point is represented with exactly one 4-byte code unit. The size
2768         of YYCTYPE must be 4 bytes.
2769
2770       · UTF-8 is a variable-length encoding. Its code space includes all Uni‐
2771         code  code  points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One
2772         code point is represented with a sequence of one, two, three, or four
2773         1-byte code units. The size of YYCTYPE must be 1 byte.
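       The difference between code points and code units can be checked with
       a few lines of plain Go. The sketch below is not part of re2c; the
       helper name codeUnits is hypothetical, and it relies only on the
       standard unicode/utf8 and unicode/utf16 packages to count how many
       code units a single code point needs in each Unicode encoding.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// codeUnits reports how many code units one code point occupies in
// UTF-8 (1-byte units), UTF-16 (2-byte units) and UTF-32 (4-byte units).
// The name is illustrative; it is not an re2c API.
func codeUnits(r rune) (u8, u16, u32 int) {
	u8 = utf8.RuneLen(r)               // 1 to 4 units (variable-length)
	u16 = len(utf16.Encode([]rune{r})) // 1 or 2 units (variable-length)
	u32 = 1                            // always 1 unit (fixed-length)
	return
}

func main() {
	for _, r := range []rune{'x', 0x2200, 0x1F600} {
		u8, u16, u32 := codeUnits(r)
		fmt.Printf("U+%04X: UTF-8=%d UTF-16=%d UTF-32=%d\n", r, u8, u16, u32)
	}
	// Prints:
	// U+0078: UTF-8=1 UTF-16=1 UTF-32=1
	// U+2200: UTF-8=3 UTF-16=1 UTF-32=1
	// U+1F600: UTF-8=4 UTF-16=2 UTF-32=1
}
```

       The width of one code unit (1, 2 or 4 bytes) is what YYCTYPE must
       match; the number of units per code point is what makes an encoding
       fixed-length or variable-length.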
2774
       In Unicode, values in the range 0xD800 to 0xDFFF (surrogates) are not
       valid Unicode code points. Any encoded sequence of code units that
       would map to a code point in the range 0xD800-0xDFFF is ill-formed.
       The user can control how re2c treats such ill-formed sequences with
       the --encoding-policy <policy> switch.
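       To make "ill-formed" concrete, the following Go sketch applies the
       3-byte UTF-8 bit layout to a surrogate value and asks the standard
       library whether the result is valid. The helpers encodeRaw3 and
       isWellFormed are hypothetical names introduced for this illustration,
       and the behavior shown is that of Go's unicode/utf8 package, not of
       any particular re2c encoding policy.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeRaw3 applies the 3-byte UTF-8 bit layout to v without checking
// that v is a valid code point. It assumes 0x800 <= v <= 0xFFFF.
// Illustration only: a real encoder must reject surrogates.
func encodeRaw3(v rune) []byte {
	return []byte{
		0xE0 | byte(v>>12),
		0x80 | byte(v>>6)&0x3F,
		0x80 | byte(v)&0x3F,
	}
}

// isWellFormed reports whether the raw 3-byte encoding of v is valid UTF-8.
func isWellFormed(v rune) bool {
	return utf8.Valid(encodeRaw3(v))
}

func main() {
	fmt.Printf("% X\n", encodeRaw3(0x2200)) // E2 88 80
	fmt.Printf("% X\n", encodeRaw3(0xD800)) // ED A0 80
	fmt.Println(isWellFormed(0x2200))       // true
	fmt.Println(isWellFormed(0xD800))       // false: maps to a surrogate
}
```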
2780
       For some encodings, there are code units that never occur in a valid
       encoded stream (e.g., the 0xFF byte in UTF-8). If the generated
       scanner must check for invalid input, the only correct way to do so is
       to use the default rule (*). Note that the full range rule ([^]) won't
       catch invalid code units when a variable-length encoding is used ([^]
       means "any valid code point", whereas the default rule (*) means "any
       possible code unit").
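       As a rough illustration of why a rule over valid code points cannot
       see every input, the Go sketch below lists the bytes that can never
       appear in well-formed UTF-8 and finds the first one in a buffer. It
       is hand-written code under that assumption, not code generated by
       re2c, and the names neverInUTF8 and firstInvalidUnit are hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// neverInUTF8 reports whether byte b cannot occur anywhere in well-formed
// UTF-8: 0xC0 and 0xC1 could only start overlong forms, and 0xF5-0xFF
// could only start sequences for values above 0x10FFFF.
func neverInUTF8(b byte) bool {
	return b == 0xC0 || b == 0xC1 || b >= 0xF5
}

// firstInvalidUnit returns the index of the first such byte, or -1 if
// there is none. It does not detect other kinds of ill-formed input.
func firstInvalidUnit(input []byte) int {
	for i, b := range input {
		if neverInUTF8(b) {
			return i
		}
	}
	return -1
}

func main() {
	input := []byte{'o', 'k', 0xFF, '!'}
	fmt.Println(firstInvalidUnit(input)) // 2
	fmt.Println(utf8.Valid(input))       // false
}
```

       A scanner that reads raw code units has to do something sensible at
       index 2 here, which is the situation the default rule (*) exists for.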

START CONDITIONS

       Conditions are enabled with -c --conditions. This option allows one to
       encode multiple interrelated lexers within the same re2c block.

       Each lexer corresponds to a single condition. It starts with a label
       of the form yyc_name, where name is the condition name and the yyc
       prefix can be adjusted with configuration re2c:condprefix. Different
       lexers are separated with a comment
       /* *********************************** */ which can be adjusted with
       configuration re2c:cond:divider.

       Furthermore, each condition has a unique identifier of the form
       yycname, where name is the condition name and the yyc prefix can be
       adjusted with configuration re2c:condenumprefix. Identifiers have the
       type YYCONDTYPE and should be generated with the /*!types:re2c*/
       directive or the -t --type-header option. Users shouldn't define these
       identifiers manually, as the order of conditions is not specified.

       Before all conditions re2c generates entry code that checks the
       current condition identifier and transfers control flow to the start
       label of the active condition. After matching some rule of this
       condition, the lexer may either transfer control flow back to the
       entry code (after executing the associated action and optionally
       setting another condition with =>), or use the :=> shortcut and
       transition directly to the start label of another condition (skipping
       the action and the entry code). Configuration re2c:cond:goto allows
       one to change the default behavior.
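       The control flow described above can be imitated by hand. The Go
       sketch below is not what re2c generates: the condition identifiers
       (condInit, condBin, condDec) and the function parse are hypothetical,
       written manually only to show the idea. The switch plays the role of
       the entry code, each case plays the role of one condition, and
       assigning to cond corresponds to setting another condition with =>.
       Overflow checks are omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical condition identifiers. With re2c these would be generated
// by the /*!types:re2c*/ directive rather than written by hand.
type condType int

const (
	condInit condType = iota
	condBin
	condDec
)

var errSyntax = errors.New("syntax error")

// parse emulates condition dispatch: every loop iteration re-enters the
// "entry code" (the switch) and runs the lexer of the active condition.
func parse(s string) (uint64, error) {
	cond, pos, result := condInit, 0, uint64(0)
	for {
		switch cond { // entry code: jump to the active condition
		case condInit:
			if len(s) > 2 && s[0] == '0' && (s[1] == 'b' || s[1] == 'B') {
				pos, cond = 2, condBin // like "=> bin"
			} else if len(s) > 0 && s[0] >= '1' && s[0] <= '9' {
				cond = condDec // like "=> dec"
			} else {
				return 0, errSyntax
			}
		case condBin:
			if pos == len(s) {
				return result, nil
			}
			if s[pos] != '0' && s[pos] != '1' {
				return 0, errSyntax
			}
			result = result*2 + uint64(s[pos]-'0')
			pos++
		case condDec:
			if pos == len(s) {
				return result, nil
			}
			if s[pos] < '0' || s[pos] > '9' {
				return 0, errSyntax
			}
			result = result*10 + uint64(s[pos]-'0')
			pos++
		}
	}
}

func main() {
	fmt.Println(parse("0b1101")) // 13 <nil>
	fmt.Println(parse("42"))     // 42 <nil>
	fmt.Println(parse("0b"))     // 0 syntax error
}
```

       Going through the switch on every step corresponds to returning to
       the entry code; the :=> shortcut avoids that round trip by jumping
       straight to the next condition's start label.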
2814
       Syntactically each rule must be preceded with a list of
       comma-separated condition names or a wildcard * enclosed in angle
       brackets < and >. Wildcard means "any condition" and is semantically
       equivalent to listing all condition names. Here regexp is a regular
       expression, default refers to the default rule *, and action is a
       block of code.

       · <conditions-or-wildcard>  regexp-or-default                 action

       · <conditions-or-wildcard>  regexp-or-default  =>  condition  action

       · <conditions-or-wildcard>  regexp-or-default  :=> condition

       Rules with an exclamation mark ! in front of the condition list have a
       special meaning: they have no regular expression, and the associated
       action is merged as entry code into the actions of normal rules. This
       might be a convenient place to perform a routine task that is common
       to all rules.

       · <!conditions-or-wildcard>  action

       Another special form of rules with an empty condition list <> and no
       regular expression allows one to specify an "entry condition" that can
       be used to execute code before entering the lexer. It is semantically
       equivalent to a condition with number zero, name 0 and an empty
       regular expression.

       · <>                 action

       · <>  =>  condition  action

       · <>  :=> condition
2846
   Example
          //go:generate re2go -c $INPUT -o $OUTPUT -i
          package main

          import (
              "errors"
              "testing"
          )

          var (
              eSyntax   = errors.New("syntax error")
              eOverflow = errors.New("overflow error")
          )

          /*!types:re2c*/

          // Smallest value that no longer fits in uint32.
          const u32Limit uint64 = 1<<32

          func parse_u32(str string) (uint32, error) {
              var cursor, marker int
              result := uint64(0)
              cond := yycinit

              // Append one digit to result. The value saturates at u32Limit,
              // so the uint64 accumulator cannot overflow on long inputs.
              add_digit := func(base uint64, offset byte) {
                  result = result * base + uint64(str[cursor-1] - offset)
                  if result >= u32Limit {
                      result = u32Limit
                  }
              }

              /*!re2c
              re2c:yyfill:enable = 0;
              re2c:define:YYCTYPE        = byte;
              re2c:define:YYPEEK         = "str[cursor]";
              re2c:define:YYSKIP         = "cursor += 1";
              re2c:define:YYSHIFT        = "cursor += @@{shift}";
              re2c:define:YYBACKUP       = "marker = cursor";
              re2c:define:YYRESTORE      = "cursor = marker";
              re2c:define:YYGETCONDITION = "cond";
              re2c:define:YYSETCONDITION = "cond = @@";

              <*> * { return 0, eSyntax }

              <init> '0b' / [01]        :=> bin
              <init> "0"                :=> oct
              <init> ""   / [1-9]       :=> dec
              <init> '0x' / [0-9a-fA-F] :=> hex

              <bin, oct, dec, hex> "\x00" {
                  if result < u32Limit {
                      return uint32(result), nil
                  } else {
                      return 0, eOverflow
                  }
              }

              <bin> [01]  { add_digit(2, '0');     goto yyc_bin }
              <oct> [0-7] { add_digit(8, '0');     goto yyc_oct }
              <dec> [0-9] { add_digit(10, '0');    goto yyc_dec }
              <hex> [0-9] { add_digit(16, '0');    goto yyc_hex }
              <hex> [a-f] { add_digit(16, 'a'-10); goto yyc_hex }
              <hex> [A-F] { add_digit(16, 'A'-10); goto yyc_hex }
              */
          }

          func TestLex(t *testing.T) {
              // Each input is NUL-terminated: "\x00" is the end-of-input rule.
              var tests = []struct {
                  num uint32
                  str string
                  err error
              }{
                  {1234567890, "1234567890\000", nil},
                  {13, "0b1101\000", nil},
                  {0x7fe, "0x007Fe\000", nil},
                  {0644, "0644\000", nil},
                  {0, "9999999999\000", eOverflow},
                  {0, "123??\000", eSyntax},
              }

              for _, x := range tests {
                  t.Run(x.str, func(t *testing.T) {
                      num, err := parse_u32(x.str)
                      if !(num == x.num && err == x.err) {
                          t.Errorf("got %d, want %d", num, x.num)
                      }
                  })
              }
          }

SKELETON PROGRAMS

       With the -S, --skeleton option, re2c ignores all non-re2c code and
       generates a self-contained C program that can be further compiled and
       executed. The program consists of lexer code and input data. For each
       constructed DFA (block or condition) re2c generates a standalone lexer
       and two files: an .input file with strings derived from the DFA and a
       .keys file with expected match results. The program runs each lexer on
       the corresponding .input file and compares the results with the
       expectations. Skeleton programs are useful for a number of reasons:

       · They can check the correctness of various re2c optimizations (the
         data is generated early in the process, before any DFA
         transformations have taken place).

       · Generating a set of input data with good coverage may be useful for
         both testing and benchmarking.

       · Generating self-contained executable programs allows one to get
         minimized test cases (the original code may be large or have a lot
         of dependencies).

       The difficulty with generating input data is that for all but the most
       trivial cases the number of possible input strings is too large (even
       if the string length is limited). Re2c solves this difficulty by
       generating sufficiently many strings to cover almost all DFA
       transitions. It uses the following algorithm. First, it constructs a
       skeleton of the DFA. For encodings with 1-byte code unit size (such as
       ASCII, UTF-8 and EBCDIC) the skeleton is just an exact copy of the
       original DFA. For encodings with multibyte code units the skeleton is
       a copy of the DFA with certain transitions omitted: namely, re2c takes
       at most 256 code units for each disjoint continuous range that
       corresponds to a DFA transition. The chosen values are evenly
       distributed and include range bounds. Instead of trying to cover all
       possible paths in the skeleton (which is infeasible) re2c generates
       sufficiently many paths to cover all skeleton transitions, and thus
       trigger the corresponding conditional jumps in the lexer. The
       algorithm implementation is limited by ~1Gb of transitions and
       consumes a constant amount of memory (re2c writes data to a file as
       soon as it is generated).
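       The range sampling step can be sketched in Go as follows. This is one
       possible way to pick at most 256 evenly distributed values that
       include both bounds; the function name sampleRange and the use of
       linear interpolation are assumptions made for this illustration, and
       the selection re2c actually performs may differ in detail.

```go
package main

import "fmt"

// maxUnits is the per-range limit mentioned in the text above.
const maxUnits = 256

// sampleRange picks at most maxUnits code units from the closed range
// [lo, hi], evenly spaced and always including both bounds.
// It assumes lo <= hi.
func sampleRange(lo, hi uint32) []uint32 {
	size := uint64(hi) - uint64(lo) + 1
	if size <= maxUnits {
		// Small range: keep every code unit.
		out := make([]uint32, 0, size)
		for i := uint64(0); i < size; i++ {
			out = append(out, lo+uint32(i))
		}
		return out
	}
	out := make([]uint32, 0, maxUnits)
	span := uint64(hi - lo)
	for i := uint64(0); i < maxUnits; i++ {
		// Linear interpolation: i = 0 gives lo, i = maxUnits-1 gives hi.
		out = append(out, lo+uint32(span*i/(maxUnits-1)))
	}
	return out
}

func main() {
	small := sampleRange(0x41, 0x5A) // 'A'..'Z': all 26 values kept
	big := sampleRange(0, 0xFFFF)    // 65536 values reduced to 256
	fmt.Println(len(small), len(big), big[0], big[1], big[255])
	// Prints: 26 256 0 257 65535
}
```

       For 1-byte code units every range has at most 256 values, so nothing
       is dropped, which is consistent with the skeleton being an exact copy
       of the DFA for such encodings.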

VISUALIZATION AND DEBUG

       With the -D, --emit-dot option, re2c does not generate code. Instead,
       it dumps the generated DFA in DOT format. One can convert this dump to
       an image of the DFA using Graphviz or another tool. Note that this
       option shows the final DFA after it has gone through a number of
       optimizations and transformations. Earlier stages can be dumped with
       various debug options, such as --dump-nfa, --dump-dfa-raw etc. (see
       the full list of options).

AUTHORS

       Re2c was originally written by Peter Bumbulis in 1993. Since then it
       has been developed and maintained by multiple volunteers; most
       notably, Brian Young, Marcus Boerger, Dan Nuffer and Ulya Trofimovich.