RE2C(1)                     General Commands Manual                    RE2C(1)

NAME

       re2c - convert regular expressions to C/C++

SYNOPSIS

       re2c [-bdefghisuvVw1] [-o output] file

DESCRIPTION

       re2c is a preprocessor that generates C-based recognizers from regular
       expressions. The input to re2c consists of C/C++ source interleaved
       with comments of the form /*!re2c ... */ which contain scanner
       specifications. In the output these comments are replaced with code
       that, when executed, will find the next input token and then execute
       some user-supplied token-specific code.

       For example, given the following code

          char *scan(char *p)
          {
          /*!re2c
                  re2c:define:YYCTYPE  = "unsigned char";
                  re2c:define:YYCURSOR = p;
                  re2c:yyfill:enable   = 0;
                  re2c:yych:conversion = 1;
                  re2c:indent:top      = 1;
                  [0-9]+          {return p;}
                  [\000-\377]     {return (char*)0;}
          */
          }

       re2c -is will generate

          /* Generated by re2c on Sat Apr 16 11:40:58 1994 */
          char *scan(char *p)
          {
              {
                  unsigned char yych;

                  yych = (unsigned char)*p;
                  if(yych <= '/') goto yy4;
                  if(yych >= ':') goto yy4;
                  ++p;
                  yych = (unsigned char)*p;
                  goto yy7;
          yy3:
                  {return p;}
          yy4:
                  ++p;
                  yych = (unsigned char)*p;
                  {return (char*)0;}
          yy6:
                  ++p;
                  yych = (unsigned char)*p;
          yy7:
                  if(yych <= '/') goto yy3;
                  if(yych <= '9') goto yy6;
                  goto yy3;
              }

          }

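       To make the control flow of the generated code easier to follow, here
       is a hand-written function that should behave the same way for this
       particular specification. It is only a sketch: the name scan_digits is
       invented for this illustration, and re2c does not emit code in this
       form.

```c
#include <stddef.h>

/*
 * Hand-written stand-in for the generated scanner (illustration only).
 * Returns a pointer just past a leading run of digits, or NULL if the
 * first character is not a digit.
 */
char *scan_digits(char *p)
{
    unsigned char yych = (unsigned char)*p;

    if (yych < '0' || yych > '9')
        return NULL;                /* second rule: any other character */

    do {                            /* first rule: [0-9]+ */
        ++p;
        yych = (unsigned char)*p;
    } while (yych >= '0' && yych <= '9');

    return p;
}
```

       Called on the string "123abc" the function returns a pointer to "abc".
       Called on "x1" or on an empty string it returns a null pointer.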
       You can place one /*!max:re2c */ comment that will output a "#define
       YYMAXFILL <n>" line that holds the maximum number of characters
       required to parse the input, that is, the maximum value YYFILL(n) will
       receive. If -1 is in effect then YYMAXFILL can only be triggered once,
       after the last /*!re2c */ block.

       You can also use /*!ignore:re2c */ blocks, which allow you to document
       the scanner code and are not part of the output.

OPTIONS

       re2c provides the following options:

       -?, -h Print a short help message.

       -b     Implies -s. Use bit vectors as well in the attempt to coax
              better code out of the compiler. Most useful for specifications
              with more than a few keywords (e.g. for most programming
              languages).

       -d     Creates a parser that dumps information about the current
              position and the state the parser is in while parsing the
              input. This is useful for debugging parser issues and states.
              If you use this switch you need to define a macro YYDEBUG that
              is called like a function with two parameters: void
              YYDEBUG(int state, char current). The first parameter receives
              the state or -1 and the second parameter receives the input at
              the current cursor.

       -e     Cross-compile from an ASCII platform to an EBCDIC one.

       -f     Generate a scanner with support for storable state. For details
              see SCANNER WITH STORABLE STATES below.

       -g     Generate a scanner that utilizes GCC's computed goto feature.
              That is, re2c generates jump tables whenever a decision is of a
              certain complexity (e.g. when a lot of if conditions would
              otherwise be necessary). This is only usable with GCC and
              produces output that cannot be compiled with any other
              compiler. Note that this implies -b and that the complexity
              threshold can be configured using the inplace configuration
              "cgoto:threshold".

       -i     Do not output #line information. This is useful when you want
              to use a CMS tool with the re2c output, which you might want if
              you do not require your users to have re2c themselves when
              building from your source.

       -o output
              Specify the output file.

       -s     Generate nested ifs for some switches. Many compilers need this
              assist to generate better code.

       -u     Generate a parser that supports Unicode chars (UTF-32). This
              means the generated code can deal with any valid Unicode
              character up to 0x10FFFF. When UTF-8 or UTF-16 needs to be
              supported you need to convert the incoming stream to UTF-32
              upon input yourself.

       -v     Show version information.

       -V     Show the version as a number XXYYZZ.

       -w     Create a parser that supports wide chars (UCS-2). This implies
              -s and cannot be used together with the -e switch.

       -1     Force single pass generation. This cannot be combined with -f
              and disables YYMAXFILL generation prior to the last re2c block.

       --no-generation-date
              Suppress date output in the generated output so that it only
              shows the re2c version.
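       The -u option leaves the conversion of UTF-8 input to UTF-32 to the
       user. A minimal sketch of one way to decode a single UTF-8 sequence
       is shown below. The helper utf8_decode is hypothetical and not part
       of re2c; a real program would call it in a loop to fill a buffer of
       32-bit code points before handing that buffer to the scanner.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence from s (at most len bytes) into *out.
 * Returns the number of bytes consumed, or 0 on malformed or
 * truncated input. Overlong forms, surrogates and values above
 * 0x10FFFF are rejected.
 */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    uint32_t cp, min;
    size_t n, i;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {                       /* single byte (ASCII) */
        *out = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0)      { cp = s[0] & 0x1F; n = 2; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; n = 3; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; n = 4; min = 0x10000; }
    else
        return 0;                            /* invalid lead byte */

    if (len < n)
        return 0;                            /* truncated sequence */
    for (i = 1; i < n; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                        /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    *out = cp;
    return n;
}
```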

INTERFACE CODE

       Unlike other scanner generators, re2c does not generate complete
       scanners: the user must supply some interface code. In particular, the
       user must define the following macros or use the corresponding inplace
       configurations:

       YYCTYPE
              Type used to hold an input symbol. Usually char or unsigned
              char.

       YYCURSOR
              l-value of type YYCTYPE * that points to the current input
              symbol. The generated code advances YYCURSOR as symbols are
              matched. On entry, YYCURSOR is assumed to point to the first
              character of the current token. On exit, YYCURSOR will point to
              the first character of the following token.

       YYLIMIT
              Expression of type YYCTYPE * that marks the end of the buffer
              (YYLIMIT[-1] is the last character in the buffer). The
              generated code repeatedly compares YYCURSOR to YYLIMIT to
              determine when the buffer needs (re)filling.

       YYMARKER
              l-value of type YYCTYPE *. The generated code saves
              backtracking information in YYMARKER. Some simple scanners
              might not use this.

       YYCTXMARKER
              l-value of type YYCTYPE *. The generated code saves trailing
              context backtracking information in YYCTXMARKER. The user only
              needs to define this macro if a scanner specification uses
              trailing context in one or more of its regular expressions.

       YYFILL(n)
              The generated code "calls" YYFILL(n) when the buffer needs
              (re)filling: at least n additional characters should be
              provided. YYFILL(n) should adjust YYCURSOR, YYLIMIT, YYMARKER
              and YYCTXMARKER as needed. Note that for typical programming
              languages n will be the length of the longest keyword plus one.
              The user can place a comment of the form /*!max:re2c */ once to
              insert a YYMAXFILL definition that is set to the maximum length
              value. If the -1 switch is used then YYMAXFILL can be triggered
              only once, after the last /*!re2c */ block.

       YYGETSTATE()
              The user only needs to define this macro if the -f flag was
              specified. In that case, the generated code "calls"
              YYGETSTATE() at the very beginning of the scanner in order to
              obtain the saved state. YYGETSTATE() must return a signed
              integer. The value must be either -1, indicating that the
              scanner is entered for the first time, or a value previously
              saved by YYSETSTATE(s). In the second case, the scanner will
              resume operations right after where the last YYFILL(n) was
              called.

       YYSETSTATE(s)
              The user only needs to define this macro if the -f flag was
              specified. In that case, the generated code "calls"
              YYSETSTATE(s) just before calling YYFILL(n). The parameter to
              YYSETSTATE is a signed integer that uniquely identifies the
              specific instance of YYFILL(n) that is about to be called.
              Should the user wish to save the state of the scanner and have
              YYFILL(n) return to the caller, all they have to do is store
              that unique identifier in a variable. Later, when the scanner
              is called again, it will call YYGETSTATE() and resume execution
              right where it left off. The generated code will contain both
              YYSETSTATE(s) and YYGETSTATE() even if YYFILL(n) is disabled.

       YYDEBUG(state,current)
              This is only needed if the -d flag was specified. It allows you
              to easily debug the generated parser by calling a user-defined
              function for every state. The function should have the
              following signature: void YYDEBUG(int state, char current). The
              first parameter receives the state or -1 and the second
              parameter receives the input at the current cursor.

       YYMAXFILL
              This will be automatically defined by /*!max:re2c */ blocks as
              explained above.
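       The description of YYFILL(n) above leaves the refill strategy to the
       user. One possible approach, sketched here under stated assumptions,
       keeps a fixed buffer, slides the current token to the front and reads
       more data behind it. All names (struct scanner, scanner_fill, the
       read_fn callback, mem_source) are hypothetical and chosen for this
       illustration only. YYCTXMARKER is left out, but would be adjusted the
       same way as the marker.

```c
#include <stddef.h>
#include <string.h>

enum { BUF_SIZE = 16 };              /* deliberately tiny, for illustration */

/* Callback that copies up to max bytes into dst; returns 0 at end of input. */
typedef size_t (*read_fn)(void *ctx, unsigned char *dst, size_t max);

struct scanner {
    unsigned char buf[BUF_SIZE];
    unsigned char *tok;              /* start of the current token */
    unsigned char *cursor;           /* plays the role of YYCURSOR */
    unsigned char *marker;           /* plays the role of YYMARKER */
    unsigned char *limit;            /* plays the role of YYLIMIT  */
    read_fn reader;
    void *ctx;
};

void scanner_init(struct scanner *s, read_fn reader, void *ctx)
{
    s->tok = s->cursor = s->marker = s->limit = s->buf;
    s->reader = reader;
    s->ctx = ctx;
}

/*
 * Try to make at least n characters available at s->cursor.
 * Returns 1 on success, 0 if the input ran out or the buffer is full.
 */
int scanner_fill(struct scanner *s, size_t n)
{
    size_t shift = (size_t)(s->tok - s->buf);

    if (shift > 0) {
        /* Drop everything before the current token, keep the rest. */
        memmove(s->buf, s->tok, (size_t)(s->limit - s->tok));
        s->marker = (s->marker >= s->tok) ? s->marker - shift : s->buf;
        s->cursor -= shift;
        s->limit  -= shift;
        s->tok     = s->buf;
    }
    while ((size_t)(s->limit - s->cursor) < n) {
        size_t room = (size_t)(s->buf + BUF_SIZE - s->limit);
        size_t got;

        if (room == 0)
            return 0;                /* token does not fit in the buffer */
        got = s->reader(s->ctx, s->limit, room);
        if (got == 0)
            return 0;                /* end of input */
        s->limit += got;
    }
    return 1;
}

/* Example data source for the callback: a block of memory. */
struct mem_source { const char *data; size_t left; };

size_t mem_read(void *ctx, unsigned char *dst, size_t max)
{
    struct mem_source *m = (struct mem_source *)ctx;
    size_t n = m->left < max ? m->left : max;

    memcpy(dst, m->data, n);
    m->data += n;
    m->left -= n;
    return n;
}
```

       With a helper like this, YYFILL(n) could be defined as a macro that
       calls scanner_fill(s, n) and leaves the scanning function when it
       returns 0.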

SCANNER WITH STORABLE STATES

       When the -f flag is specified, re2c generates a scanner that can store
       its current state, return to the caller, and later resume operations
       exactly where it left off.

       The default operation of re2c is a "pull" model, where the scanner asks
       for extra input whenever it needs it. However, this mode of operation
       assumes that the scanner is the "owner" of the parsing loop, and that
       may not always be convenient.

       Typically, if there is a preprocessor ahead of the scanner in the
       stream, or for that matter any other procedural source of data, the
       scanner cannot "ask" for more data unless both scanner and source live
       in separate threads.

       The -f flag is useful for just this situation: it lets users design
       scanners that work in a "push" model, i.e. where data is fed to the
       scanner chunk by chunk. When the scanner runs out of data to consume,
       it just stores its state and returns to the caller. When more input
       data is fed to the scanner, it resumes operations exactly where it
       left off.

       When using the -f option re2c does not accept stdin, because it has to
       do the full generation process twice, which means it has to read the
       input twice. re2c would therefore fail if it cannot open the input
       twice or if reading the input for the first time influences the second
       read attempt.

       Changes needed compared to the "pull" model:

       1. The user has to supply the macros YYSETSTATE(state) and
       YYGETSTATE().

       2. The -f option inhibits declaration of yych and yyaccept, so the
       user has to declare them. The user also has to save and restore them.
       In the example examples/push.re these are declared as fields of the
       (C++) class of which the scanner is a method, so they do not need to
       be saved/restored explicitly. For C they could e.g. be made macros
       that select fields from a structure passed in as a parameter.
       Alternatively, they could be declared as local variables, saved by
       YYFILL(n) when it decides to return and restored at entry to the
       function. Also, it could be more efficient to save the state from
       YYFILL(n) because YYSETSTATE(state) is called unconditionally.
       YYFILL(n) however does not get state as a parameter, so
       YYSETSTATE(state) would have to store state in a local variable.

       3. Modify YYFILL(n) to return (from the function calling it) if more
       input is needed.

       4. Modify the caller to recognise "more input is needed" and respond
       appropriately.

       5. The generated code will contain a switch block that is used to
       restore the last state by jumping to just after the corresponding
       YYFILL(n) call. This code is automatically generated in the epilog of
       the first "/*!re2c */" block. It is possible to trigger generation of
       the YYGETSTATE() block earlier by placing a "/*!getstate:re2c */"
       comment. This is especially useful when the scanner code should be
       wrapped inside a loop.

       Please see examples/push.re for a push-model scanner. The generated
       code can be tweaked using the inplace configurations "state:abort" and
       "state:nextlabel".
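       The save-and-resume mechanism described in this section can be
       imitated by hand. The following sketch is not re2c output and all of
       its names are hypothetical. It recognises decimal numbers that may be
       split across chunks, stores a state number before returning "need
       more input", and dispatches on that state when it is re-entered,
       which is roughly what the YYSETSTATE/YYGETSTATE pair is described as
       doing above.

```c
#include <stddef.h>

enum { NEED_MORE = 0, TOK_NUMBER = 1, TOK_OTHER = 2 };

struct push_scanner {
    int state;              /* saved state; -1 means "start a new token" */
    const char *cursor;     /* next unread character of the current chunk */
    const char *limit;      /* end of the current chunk */
    unsigned long value;    /* number accumulated so far */
};

void push_init(struct push_scanner *s)
{
    s->state = -1;
    s->cursor = s->limit = "";
    s->value = 0;
}

/* Hand the scanner its next chunk of input. */
void push_feed(struct push_scanner *s, const char *data, size_t len)
{
    s->cursor = data;
    s->limit = data + len;
}

int push_scan(struct push_scanner *s)
{
    switch (s->state) {     /* corresponds to the YYGETSTATE() dispatch */
    case -1: break;
    case 0:  goto resume0;
    case 1:  goto resume1;
    }
    s->value = 0;

resume0:
    if (s->cursor == s->limit) {
        s->state = 0;       /* remember where to continue */
        return NEED_MORE;
    }
    if (*s->cursor < '0' || *s->cursor > '9') {
        ++s->cursor;
        s->state = -1;
        return TOK_OTHER;
    }
    do {
        s->value = s->value * 10 + (unsigned long)(*s->cursor - '0');
        ++s->cursor;
resume1:
        if (s->cursor == s->limit) {
            s->state = 1;   /* in the middle of a number */
            return NEED_MORE;
        }
    } while (*s->cursor >= '0' && *s->cursor <= '9');

    s->state = -1;
    return TOK_NUMBER;
}
```

       Feeding "12" and then "34 " yields NEED_MORE after the first chunk
       and TOK_NUMBER with value 1234 once the second chunk has been
       supplied. The caller must read value before calling push_scan again,
       because a fresh start resets it.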

SCANNER SPECIFICATIONS

       Each scanner specification consists of a set of rules, named
       definitions and configurations.

       Rules consist of a regular expression along with a block of C/C++ code
       that is to be executed when the associated regular expression is
       matched.

              regular expression { C/C++ code }

       Named definitions are of the form:

              name = regular expression;

       Configurations look like named definitions whose names start with
       "re2c:":

              re2c:name = value;
              re2c:name = "value";
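       A short fragment combining all three kinds of entries might look as
       follows. This is an illustrative sketch, not a tested scanner: the
       token names TOK_NUMBER, TOK_IDENT and TOK_ERROR and the pointer p are
       assumptions that the surrounding C code would have to provide, and
       the block must be processed by re2c before it can be compiled.

```c
/*!re2c
        re2c:define:YYCTYPE  = "unsigned char";
        re2c:define:YYCURSOR = p;
        re2c:yyfill:enable   = 0;

        digit  = [0-9];
        letter = [a-zA-Z_];

        digit+                  { return TOK_NUMBER; }
        letter (letter|digit)*  { return TOK_IDENT; }
        [\000-\377]             { return TOK_ERROR; }
*/
```

       The first three lines are configurations, the next two are named
       definitions, and the last three are rules.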

SUMMARY OF RE2C REGULAR EXPRESSIONS

       "foo"  the literal string foo. ANSI-C escape sequences can be used.

       'foo'  the literal string foo (characters [a-zA-Z] are treated
              case-insensitively). ANSI-C escape sequences can be used.

       [xyz]  a "character class"; in this case, the regular expression
              matches either an 'x', a 'y', or a 'z'.

       [abj-oZ]
              a "character class" with a range in it; matches an 'a', a 'b',
              any letter from 'j' through 'o', or a 'Z'.

       [^class]
              an inverted "character class".

       r\s    match any r which isn't an s. r and s must be regular
              expressions which can be expressed as character classes.

       r*     zero or more r's, where r is any regular expression

       r+     one or more r's

       r?     zero or one r's (that is, "an optional r")

       name   the expansion of the "named definition" (see above)

       (r)    an r; parentheses are used to override precedence (see below)

       rs     an r followed by an s ("concatenation")

       r|s    either an r or an s

       r/s    an r but only if it is followed by an s. The s is not part of
              the matched text. This type of regular expression is called
              "trailing context". A trailing context can only be the end of a
              rule and not part of a named definition.

       r{n}   matches r exactly n times.

       r{n,}  matches r at least n times.

       r{n,m} matches r at least n but not more than m times.

       .      match any character except newline (\n).

       def    matches named definition as specified by def.

       Character classes and string literals may contain octal or hexadecimal
       character definitions and the following set of escape sequences (\n,
       \t, \v, \b, \r, \f, \a, \\). An octal character is defined by a
       backslash followed by its three octal digits, and a hexadecimal
       character is defined by a backslash, a lowercase x and its two
       hexadecimal digits, or a backslash, an uppercase X and its four
       hexadecimal digits.

       re2c furthermore supports the C/C++ Unicode notation, that is, a
       backslash followed by either a lowercase u and its four hexadecimal
       digits or an uppercase U and its eight hexadecimal digits. However,
       only in -u mode can the generated code deal with any valid Unicode
       character up to 0x10FFFF.

       Since characters greater than \X00FF are not allowed in non-Unicode
       mode, the only portable "any" rules are (.|"\n") and [^].

       The regular expressions listed above are grouped according to
       precedence, from highest precedence at the top to lowest at the
       bottom. Those grouped together have equal precedence.

INPLACE CONFIGURATION

       It is possible to configure code generation inside re2c blocks. The
       following lists the available configurations:

       re2c:indent:top = 0 ;
              Specifies the minimum amount of indentation to use. Requires a
              numeric value greater than or equal to zero.

       re2c:indent:string = "\t" ;
              Specifies the string to use for indentation. Requires a string
              that should contain only whitespace unless you need this for
              external tools. The easiest way to specify spaces is to enclose
              them in single or double quotes. If you do not want any
              indentation at all you can simply set this to "".

       re2c:yybm:hex = 0 ;
              If set to zero then a decimal table is used, otherwise a
              hexadecimal table will be generated.

       re2c:yyfill:enable = 1 ;
              Set this to zero to suppress generation of YYFILL(n). When
              using this be sure to verify that the generated scanner does
              not read past the end of the input. Allowing this behavior
              might introduce severe security issues into your programs.

       re2c:yyfill:parameter = 1 ;
              Allows you to suppress parameter passing to YYFILL calls. If
              set to zero then no parameter is passed to YYFILL. If set to a
              non-zero value then YYFILL usage will be followed by the number
              of requested characters in parentheses.

       re2c:startlabel = 0 ;
              If set to a non-zero integer then the start label of the next
              scanner block will be generated even if not used by the scanner
              itself. Otherwise the normal yy0-like start label is only
              generated if needed. If set to a text value then a label with
              that text will be generated regardless of whether the normal
              start label is used or not. This setting is reset to 0 after a
              start label has been generated.

       re2c:labelprefix = yy ;
              Allows you to change the prefix of numbered labels. The default
              is yy and it can be set to any string that is a valid label.

       re2c:state:abort = 0 ;
              When not zero and switch -f is active then the YYGETSTATE block
              will contain a default case that aborts and a -1 case that is
              used for initialization.

       re2c:state:nextlabel = 0 ;
              Used when -f is active to control whether the YYGETSTATE block
              is followed by a yyNext: label line. Instead of using yyNext
              you can usually also use the configuration startlabel to force
              a specific start label or default to yy0 as the start label.
              Instead of using a dedicated label it is often better to
              separate the YYGETSTATE code from the actual scanner code by
              placing a "/*!getstate:re2c */" comment.

       re2c:cgoto:threshold = 9 ;
              When -g is active this value specifies the complexity threshold
              that triggers generation of jump tables rather than nested ifs
              and decision bitfields. The threshold is compared against a
              calculated estimate of the ifs needed, where every used bitmap
              divides the threshold by 2.

       re2c:yych:conversion = 0 ;
              When the input uses signed characters and the -s or -b switches
              are in effect, re2c allows you to automatically convert to the
              unsigned character type that is then necessary for its internal
              single character. When this setting is zero or an empty string
              the conversion is disabled. With a non-zero number the
              conversion is taken from YYCTYPE. If that is given by an
              inplace configuration, that value is used. Otherwise it will be
              (YYCTYPE) and changes to that configuration are no longer
              possible. When this setting is a string the parentheses must be
              specified. Now assuming your input is a char* buffer and you
              are using the above mentioned switches, you can set YYCTYPE to
              unsigned char and this setting to either 1 or "(unsigned
              char)".

       re2c:define:YYCTXMARKER = YYCTXMARKER ;
              Allows you to override the YYCTXMARKER define, and thus avoid
              it, by setting the value to the actual code needed.

       re2c:define:YYCTYPE = YYCTYPE ;
              Allows you to override the YYCTYPE define, and thus avoid it,
              by setting the value to the actual code needed.

       re2c:define:YYCURSOR = YYCURSOR ;
              Allows you to override the YYCURSOR define, and thus avoid it,
              by setting the value to the actual code needed.

       re2c:define:YYDEBUG = YYDEBUG ;
              Allows you to override the YYDEBUG define, and thus avoid it,
              by setting the value to the actual code needed.

       re2c:define:YYFILL = YYFILL ;
              Allows you to override the YYFILL define, and thus avoid it, by
              setting the value to the actual code needed.

       re2c:define:YYGETSTATE = YYGETSTATE ;
              Allows you to override the YYGETSTATE define, and thus avoid
              it, by setting the value to the actual code needed.

       re2c:define:YYLIMIT = YYLIMIT ;
              Allows you to override the YYLIMIT define, and thus avoid it,
              by setting the value to the actual code needed.

       re2c:define:YYMARKER = YYMARKER ;
              Allows you to override the YYMARKER define, and thus avoid it,
              by setting the value to the actual code needed.

       re2c:define:YYSETSTATE = YYSETSTATE ;
              Allows you to override the YYSETSTATE define, and thus avoid
              it, by setting the value to the actual code needed.

       re2c:label:yyFillLabel = yyFillLabel ;
              Allows you to override the name of the label yyFillLabel.

       re2c:label:yyNext = yyNext ;
              Allows you to override the name of the label yyNext.

       re2c:variable:yyaccept = yyaccept ;
              Allows you to override the name of the variable yyaccept.

       re2c:variable:yybm = yybm ;
              Allows you to override the name of the variable yybm.

       re2c:variable:yych = yych ;
              Allows you to override the name of the variable yych.

       re2c:variable:yytarget = yytarget ;
              Allows you to override the name of the variable yytarget.
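       The reason behind re2c:yych:conversion can be seen in plain C. The
       two small functions below are hypothetical and only illustrate the
       effect: a comparison such as "yych <= '9'", as it appears in
       generated code, gives different answers for a byte with the high bit
       set depending on whether the character is first converted to
       unsigned char.

```c
/* Comparison on a signed character, with no conversion applied. */
int is_at_most_9_raw(signed char c)
{
    return c <= '9';
}

/* The same comparison after converting to unsigned char first. */
int is_at_most_9_converted(signed char c)
{
    unsigned char yych = (unsigned char)c;

    return yych <= '9';
}
```

       For a signed char holding -1, the first function reports that the
       character is at or below '9', so it could be mistaken for part of a
       low range. The second function sees the value 255 and reports that it
       is not.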

UNDERSTANDING RE2C

       The subdirectory lessons of the re2c distribution contains a few
       step-by-step lessons to get you started with re2c. All examples in the
       lessons subdirectory can be compiled and actually work.

FEATURES

       re2c does not provide a default action: the generated code assumes that
       the input will consist of a sequence of tokens. Typically this can be
       dealt with by adding a rule such as the one for unexpected characters
       in the example above.

       The user must arrange for a sentinel token to appear at the end of
       input (and provide a rule for matching it): re2c does not provide an
       <<EOF>> expression. If the source is a null-byte terminated string, a
       rule matching a null character will suffice. If the source is a file
       then you could pad the input with a newline (or some other character
       that cannot appear within another token); upon recognizing such a
       character, check to see if it is the sentinel and act accordingly. You
       can also use YYFILL(n) to end the scanner in case not enough
       characters are available, which is nothing other than the detection of
       end of data/file.

       re2c does not provide start conditions: use a separate scanner
       specification for each start condition (as illustrated in the above
       example).
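       The sentinel approach for null-terminated strings can be illustrated
       without re2c. The hand-written loop below is a sketch with a made-up
       name, count_numbers. It treats the terminating null byte as the token
       that ends scanning, in the same way a rule matching a null character
       would in a scanner specification.

```c
#include <stddef.h>

/* Count runs of decimal digits in a null-terminated string. */
size_t count_numbers(const char *p)
{
    size_t count = 0;

    for (;;) {
        unsigned char c = (unsigned char)*p;

        if (c == '\0')
            return count;            /* sentinel: end of input reached */
        if (c >= '0' && c <= '9') {
            do {                     /* consume the whole number */
                ++p;
            } while (*p >= '0' && *p <= '9');
            ++count;
        } else {
            ++p;                     /* any other character */
        }
    }
}
```

       Because the null byte can never be part of a number, the inner loop
       needs no separate end-of-buffer check.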

BUGS

       Difference only works for character sets.

       The re2c internal algorithms need documentation.

SEE ALSO

       flex(1), lex(1).

       More information on re2c can be found here:
       http://re2c.org/

AUTHORS

       Peter Bumbulis <peter@csg.uwaterloo.ca>
       Brian Young <bayoung@acm.org>
       Dan Nuffer <nuffer@users.sourceforge.net>
       Marcus Boerger <helly@users.sourceforge.net>
       Hartmut Kaiser <hkaiser@users.sourceforge.net>
       Emmanuel Mogenet <mgix@mgix.com> added storable state

VERSION INFORMATION

       This manpage describes re2c, version 0.12.1.

Version 0.12.1                   22 April 2005                         RE2C(1)