LEX(1P)                    POSIX Programmer's Manual                   LEX(1P)

PROLOG

       This manual page is part of the POSIX Programmer's Manual. The Linux
       implementation of this interface may differ (consult the corresponding
       Linux manual page for details of Linux behavior), or the interface may
       not be implemented on Linux.

NAME

       lex - generate programs for lexical tasks (DEVELOPMENT)

SYNOPSIS

       lex [-t] [-n|-v] [file ...]

DESCRIPTION

       The lex utility shall generate C programs to be used in lexical
       processing of character input, and that can be used as an interface to
       yacc. The C programs shall be generated from lex source code and
       conform to the ISO C standard. Usually, the lex utility shall write the
       program it generates to the file lex.yy.c; the state of this file is
       unspecified if lex exits with a non-zero exit status. See the EXTENDED
       DESCRIPTION section for a complete description of the lex input
       language.

OPTIONS

28       The  lex  utility  shall  conform  to  the  Base  Definitions volume of
29       IEEE Std 1003.1-2001, Section 12.2, Utility Syntax Guidelines.
30
31       The following options shall be supported:
32
33       -n     Suppress the summary of statistics usually written with  the  -v
34              option.  If  no table sizes are specified in the lex source code
35              and the -v option is not specified, then -n is implied.
36
37       -t     Write the  resulting  program  to  standard  output  instead  of
38              lex.yy.c.
39
       -v     Write a summary of lex statistics to the standard output. (See
              the discussion of lex table sizes in Definitions in lex.) If
              the -t option is specified and -n is not specified, this report
              shall be written to standard error. If table sizes are specified
              in the lex source code, and if the -n option is not specified,
              the -v option may be enabled.

OPERANDS

49       The following operand shall be supported:
50
51       file   A pathname of an input file. If more than one such file is spec‐
52              ified,  all  files shall be concatenated to produce a single lex
53              program. If no file operands are specified, or if a file operand
54              is '-', the standard input shall be used.
55
56

STDIN

       The standard input shall be used if no file operands are specified, or
       if a file operand is '-'. See INPUT FILES.

INPUT FILES

62       The input files shall be text files  containing  lex  source  code,  as
63       described in the EXTENDED DESCRIPTION section.
64

ENVIRONMENT VARIABLES

66       The following environment variables shall affect the execution of lex:
67
68       LANG   Provide  a  default value for the internationalization variables
69              that are unset or null. (See  the  Base  Definitions  volume  of
70              IEEE Std 1003.1-2001,  Section  8.2,  Internationalization Vari‐
71              ables for the precedence of internationalization variables  used
72              to determine the values of locale categories.)
73
74       LC_ALL If  set  to a non-empty string value, override the values of all
75              the other internationalization variables.
76
       LC_COLLATE
              Determine the locale for the behavior of ranges, equivalence
              classes, and multi-character collating elements within regular
              expressions. If this variable is not set to the POSIX locale,
              the results are unspecified.
83
84       LC_CTYPE
85              Determine  the  locale  for  the  interpretation of sequences of
86              bytes of text data as characters (for  example,  single-byte  as
87              opposed  to multi-byte characters in arguments and input files),
88              and the behavior of character  classes  within  regular  expres‐
89              sions.   If  this  variable  is not set to the POSIX locale, the
90              results are unspecified.
91
92       LC_MESSAGES
93              Determine the locale that should be used to  affect  the  format
94              and contents of diagnostic messages written to standard error.
95
       NLSPATH
              Determine the location of message catalogs for the processing of
              LC_MESSAGES.

ASYNCHRONOUS EVENTS

102       Default.
103

STDOUT

105       If the -t option is specified, the text file of C source code output of
106       lex shall be written to standard output.
107
108       If the -t option is not specified:
109
110        * Implementation-defined  informational,  error,  and warning messages
111          concerning the contents of lex source code input shall be written to
112          either the standard output or standard error.
113
114        * If  the  -v  option is specified and the -n option is not specified,
115          lex statistics shall also be written to either the  standard  output
116          or  standard  error, in an implementation-defined format. These sta‐
117          tistics may also be generated if table sizes are  specified  with  a
118          '%' operator in the Definitions section, as long as the -n option is
119          not specified.
120

STDERR

122       If the -t option is  specified,  implementation-defined  informational,
123       error,  and warning messages concerning the contents of lex source code
124       input shall be written to the standard error.
125
126       If the -t option is not specified:
127
128        1. Implementation-defined informational, error, and  warning  messages
129           concerning  the  contents of lex source code input shall be written
130           to either the standard output or standard error.
131
132        2. If the -v option is specified and the -n option is  not  specified,
133           lex  statistics shall also be written to either the standard output
134           or standard error, in an implementation-defined format. These  sta‐
135           tistics  may  also be generated if table sizes are specified with a
136           '%' operator in the Definitions section, as long as the  -n  option
137           is not specified.
138

OUTPUT FILES

140       A  text  file containing C source code shall be written to lex.yy.c, or
141       to the standard output if the -t option is present.
142

EXTENDED DESCRIPTION

144       Each input file shall contain lex source code, which is a table of reg‐
145       ular  expressions  with  corresponding actions in the form of C program
146       fragments.
147
148       When lex.yy.c is compiled and linked with the lex  library  (using  the
149       -l l  operand  with  c99),  the  resulting program shall read character
150       input from the standard input and shall partition it into strings  that
151       match the given expressions.
152
153       When an expression is matched, these actions shall occur:
154
155        * The input string that was matched shall be left in yytext as a null-
156          terminated string; yytext shall  either  be  an  external  character
157          array  or  a  pointer to a character string. As explained in Defini‐
158          tions in lex, the type can be explicitly selected using  the  %array
159          or %pointer declarations, but the default is implementation-defined.
160
161        * The  external  int yyleng shall be set to the length of the matching
162          string.
163
164        * The expression's corresponding program fragment, or action, shall be
165          executed.
166
167       During  pattern  matching, lex shall search the set of patterns for the
168       single longest possible match. Among rules that match the  same  number
169       of characters, the rule given first shall be chosen.
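
       The selection rule above can be sketched in C. This is only an
       illustrative model, not code that lex generates: choose_rule() and its
       lengths array are hypothetical names, and the sketch assumes that the
       match length of every rule at the current input position has already
       been computed, with 0 meaning that the rule does not match.

```c
#include <stddef.h>

/*
 * lengths[i] is the number of characters rule i matches at the current
 * input position (0 means "no match"; this is an assumption of the sketch).
 * Rules are numbered in the order they appear in the lex source.
 * Returns the index of the rule to run, or -1 if no rule matches.
 */
int choose_rule(const size_t *lengths, size_t nrules)
{
    int best = -1;

    for (size_t i = 0; i < nrules; i++) {
        if (lengths[i] == 0)
            continue;
        /* Strictly greater: on a tie the earlier rule keeps its place. */
        if (best < 0 || lengths[i] > lengths[(size_t)best])
            best = (int)i;
    }
    return best;
}
```

       For instance, with a keyword rule listed before an identifier rule,
       the input "if" is matched with length 2 by both, so the keyword rule
       is chosen; for "ifx" the identifier rule matches three characters and
       is chosen instead.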
170
171       The general format of lex source shall be:
172
173
174              Definitions
175              %%
176              Rules
177              %%
178              UserSubroutines
179
180       The  first "%%" is required to mark the beginning of the rules (regular
181       expressions and actions); the second "%%" is required only if user sub‐
182       routines follow.
183
184       Any  line  in the Definitions section beginning with a <blank> shall be
185       assumed to be a C program fragment and shall be copied to the  external
186       definition area of the lex.yy.c file.  Similarly, anything in the Defi‐
187       nitions section included between delimiter lines containing  only  "%{"
188       and "%}" shall also be copied unchanged to the external definition area
189       of the lex.yy.c file.
190
191       Any such input (beginning with a <blank> or within "%{" and "%}" delim‐
192       iter  lines) appearing at the beginning of the Rules section before any
193       rules are specified shall be written to lex.yy.c after the declarations
194       of variables for the yylex() function and before the first line of code
195       in yylex(). Thus, user variables local to yylex() can be declared here,
196       as well as application code to execute upon entry to yylex().
197
198       The  action  taken  by lex when encountering any input beginning with a
199       <blank> or within "%{" and "%}" delimiter lines appearing in the  Rules
200       section  but  coming after one or more rules is undefined. The presence
201       of such input may result in an  erroneous  definition  of  the  yylex()
202       function.
203
   Definitions in lex
205       Definitions  appear  before  the first "%%" delimiter. Any line in this
206       section not contained between "%{" and "%}"  lines  and  not  beginning
207       with  a  <blank>  shall be assumed to define a lex substitution string.
208       The format of these lines shall be:
209
210
211              name substitute
212
       If a name does not meet the requirements for identifiers in the ISO C
       standard, the result is undefined. The string substitute shall replace
       the string {name} when it is used in a rule. The name string shall be
       recognized in this context only when the braces are provided and when
       it does not appear within a bracket expression or within double-quotes.
218
219       In the Definitions section, any line  beginning  with  a  '%'  (percent
220       sign)  character  and  followed  by an alphanumeric word beginning with
221       either 's' or 'S' shall define a set  of  start  conditions.  Any  line
222       beginning  with  a  '%' followed by a word beginning with either 'x' or
223       'X' shall define a set of exclusive start conditions. When  the  gener‐
224       ated  scanner  is in a %s state, patterns with no state specified shall
225       be also active; in a %x state, such patterns shall not be  active.  The
226       rest  of  the line, after the first word, shall be considered to be one
227       or more <blank>-separated names of start  conditions.  Start  condition
228       names  shall  be constructed in the same way as definition names. Start
       conditions can be used to restrict the matching of regular expressions
       to one or more states as described in Regular Expressions in lex.
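
       The difference between %s and %x states described above can be
       modeled with a small C predicate. The names rule_is_active(), struct
       rule, and the exclusive array are hypothetical and exist only for this
       sketch; it also assumes that start conditions are numbered from 0,
       with 0 standing for the initial state.

```c
#include <stddef.h>

/*
 * A rule together with the start conditions named in its <...> prefix.
 * nstates == 0 models a rule written with no <state> prefix.
 */
struct rule {
    const int *states;
    size_t nstates;
};

/*
 * exclusive[s] is nonzero if state s was declared with %x, and zero if it
 * was declared with %s or is the initial state (numbered 0 in this sketch).
 * Returns 1 if the rule takes part in matching while in state "current".
 */
int rule_is_active(const struct rule *r, int current, const int *exclusive)
{
    if (r->nstates == 0)
        return !exclusive[current];   /* unprefixed rules: off in %x states */

    for (size_t i = 0; i < r->nstates; i++)
        if (r->states[i] == current)
            return 1;
    return 0;
}
```

       Under this model, a rule with no prefix is active in the initial state
       and in every %s state, but never in a %x state, while a prefixed rule
       is active only in the states it names.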
231
232       Implementations  shall  accept  either  of  the following two mutually-
233       exclusive declarations in the Definitions section:
234
235       %array Declare the type of yytext to  be  a  null-terminated  character
236              array.
237
238       %pointer
239              Declare  the type of yytext to be a pointer to a null-terminated
240              character string.
241
242
243       The default type of yytext is implementation-defined. If an application
244       refers  to  yytext  outside of the scanner source file (that is, via an
245       extern), the  application  shall  include  the  appropriate  %array  or
246       %pointer declaration in the scanner source file.
247
248       Implementations  shall  accept  declarations in the Definitions section
249       for setting certain internal table sizes. The declarations are shown in
250       the following table.
251
252                        Table: Table Size Declarations in lex
253
254           Declaration  Description                         Minimum Value
255           %p n         Number of positions                 2500
256           %n n         Number of states                    500
257           %a n         Number of transitions               2000
258           %e n         Number of parse tree nodes          1000
259           %k n         Number of packed character classes  1000
260           %o n         Size of the output array            3000
261
262       In  the table, n represents a positive decimal integer, preceded by one
263       or more <blank>s. The exact meaning of  these  table  size  numbers  is
264       implementation-defined.  The  implementation  shall  document how these
265       numbers affect the lex utility and how they are related to  any  output
266       that  may  be  generated  by  the  implementation should limitations be
267       encountered during the execution of lex. It shall be possible to deter‐
268       mine  from this output which of the table size values needs to be modi‐
269       fied to permit lex to successfully generate tables for the  input  lan‐
270       guage.   The  values  in  the column Minimum Value represent the lowest
271       values conforming implementations shall provide.
272
   Rules in lex
274       The rules in lex source files are a table in which the left column con‐
275       tains regular expressions and the right column contains actions (C pro‐
276       gram fragments) to be executed when the expressions are recognized.
277
278
279              ERE action
280              ERE action...
281
282       The extended regular expression (ERE) portion of a row shall  be  sepa‐
283       rated  from  action  by one or more <blank>s. A regular expression con‐
284       taining <blank>s shall be recognized under one of the following  condi‐
285       tions:
286
287        * The entire expression appears within double-quotes.
288
289        * The <blank>s appear within double-quotes or square brackets.
290
291        * Each <blank> is preceded by a backslash character.
292
   User Subroutines in lex
294       Anything  in  the  user subroutines section shall be copied to lex.yy.c
295       following yylex().
296
   Regular Expressions in lex
298       The lex utility shall support the set of extended  regular  expressions
299       (see  the Base Definitions volume of IEEE Std 1003.1-2001, Section 9.4,
300       Extended Regular Expressions), with the following additions and  excep‐
301       tions to the syntax:
302
       "..."  Any string enclosed in double-quotes shall represent the
              characters within the double-quotes as themselves, except that
              backslash escapes (which appear in the following table) shall be
              recognized. Any backslash-escape sequence shall be terminated
              by the closing quote. For example, "\01""1" represents a single
              string: the octal value 1 followed by the character '1'.
309
310       <state>r, <state1,state2,...>r
311
312              The regular expression r shall be matched only when the  program
313              is  in  one  of the start conditions indicated by state, state1,
314              and so on; see Actions in lex . (As an exception  to  the  typo‐
315              graphical   conventions   of   the   rest   of  this  volume  of
316              IEEE Std 1003.1-2001, in this case <state> does not represent  a
317              metavariable, but the literal angle-bracket characters surround‐
318              ing a symbol.) The start condition shall be recognized  as  such
319              only at the beginning of a regular expression.
320
321       r/x    The regular expression r shall be matched only if it is followed
322              by an occurrence of regular expression x ( x is the instance  of
323              trailing context, further defined below).  The token returned in
324              yytext shall only match r. If the trailing portion of r  matches
325              the  beginning of x, the result is unspecified. The r expression
326              cannot include further trailing context or the  '$'  (match-end-
327              of-line) operator; x cannot include the '^' (match-beginning-of-
328              line) operator, nor trailing context, nor the '$' operator. That
329              is,  only one occurrence of trailing context is allowed in a lex
330              regular expression, and the '^' operator only can be used at the
331              beginning of such an expression.
332
       {name} When name is one of the substitution symbols from the
              Definitions section, the string, including the enclosing braces,
              shall be replaced by the substitute value. The substitute value
              shall be treated in the extended regular expression as if it were
              enclosed in parentheses. No substitution shall occur if {name}
              occurs within a bracket expression or within double-quotes.
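
       The effect of the implied parentheses can be seen by expanding
       definitions by hand. The following C sketch is an assumption-laden
       illustration, not the algorithm of any lex implementation:
       expand_definitions() and struct definition are hypothetical names, and
       bracket expressions are tracked in a simplified way (a ']' placed
       first inside a bracket expression is not handled).

```c
#include <stddef.h>
#include <string.h>

/* One line of the Definitions section: "name substitute". */
struct definition {
    const char *name;
    const char *substitute;
};

/* Appends len bytes of s to out; returns -1 if they would not fit. */
static int append(char *out, size_t outsz, size_t *used,
                  const char *s, size_t len)
{
    if (*used + len + 1 > outsz)
        return -1;
    memcpy(out + *used, s, len);
    *used += len;
    out[*used] = '\0';
    return 0;
}

static const char *lookup(const struct definition *defs, size_t ndefs,
                          const char *name, size_t len)
{
    for (size_t i = 0; i < ndefs; i++)
        if (strlen(defs[i].name) == len &&
            memcmp(defs[i].name, name, len) == 0)
            return defs[i].substitute;
    return NULL;
}

/*
 * Copies pattern to out, replacing each {name} that is outside
 * double-quotes and bracket expressions with "(" substitute ")".
 * Returns 0 on success, or -1 if out is too small.
 */
int expand_definitions(const char *pattern,
                       const struct definition *defs, size_t ndefs,
                       char *out, size_t outsz)
{
    size_t used = 0;
    int in_quotes = 0, in_bracket = 0;

    if (outsz == 0)
        return -1;
    out[0] = '\0';

    for (const char *p = pattern; *p != '\0'; ) {
        size_t take = 1;

        if (*p == '\\' && p[1] != '\0') {
            take = 2;                     /* escaped character, copy as is */
        } else if (*p == '"' && !in_bracket) {
            in_quotes = !in_quotes;
        } else if (*p == '[' && !in_quotes && !in_bracket) {
            in_bracket = 1;
        } else if (*p == ']' && in_bracket) {
            in_bracket = 0;
        } else if (*p == '{' && !in_quotes && !in_bracket) {
            const char *end = strchr(p, '}');
            const char *sub = end != NULL
                ? lookup(defs, ndefs, p + 1, (size_t)(end - p - 1))
                : NULL;
            if (sub != NULL) {
                if (append(out, outsz, &used, "(", 1) != 0 ||
                    append(out, outsz, &used, sub, strlen(sub)) != 0 ||
                    append(out, outsz, &used, ")", 1) != 0)
                    return -1;
                p = end + 1;
                continue;
            }
            /* Not a known name (for example an interval): copy unchanged. */
        }
        if (append(out, outsz, &used, p, take) != 0)
            return -1;
        p += take;
    }
    return 0;
}
```

       With a definition "A a|b", the rule pattern {A}c expands to (a|b)c in
       this sketch, so it matches "ac" or "bc". Without the parentheses the
       text would read a|bc, which matches "a" or "bc" instead.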
339
340
       Within an ERE, a backslash character shall be considered to begin an
       escape sequence as specified in the table in the Base Definitions
       volume of IEEE Std 1003.1-2001, Chapter 5, File Format Notation ('\\',
       '\a', '\b', '\f', '\n', '\r', '\t', '\v'). In addition, the escape
       sequences in the following table shall be recognized.
346
347       A literal <newline> cannot occur within an  ERE;  the  escape  sequence
348       '\n'  can  be  used  to represent a <newline>. A <newline> shall not be
349       matched by a period operator.
350
                           Table: Escape Sequences in lex

       Escape
       Sequence Description                    Meaning
       \digits  A backslash character followed The character whose encoding
                by the longest sequence of     is represented by the one,
                one, two, or three octal-digit two, or three-digit octal
                characters (01234567). If all  integer. If the size of a byte
                of the digits are 0 (that is,  on the system is greater than
                representation of the NUL      nine bits, the valid escape
                character), the behavior is    sequence used to represent a
                undefined.                     byte is implementation-
                                               defined. Multi-byte characters
                                               require multiple, concatenated
                                               escape sequences of this type,
                                               including the leading '\' for
                                               each byte.
       \xdigits A backslash character followed The character whose encoding
                by the longest sequence of     is represented by the
                hexadecimal-digit characters   hexadecimal integer.
                (0123456789abcdefABCDEF). If
                all of the digits are 0 (that
                is, representation of the NUL
                character), the behavior is
                undefined.
       \c       A backslash character followed The character 'c', unchanged.
                by any character not described
                in this table or in the table
                in the Base Definitions volume
                of IEEE Std 1003.1-2001,
                Chapter 5, File Format
                Notation ('\\', '\a', '\b',
                '\f', '\n', '\r', '\t', '\v').
384
       Note:  If a '\x' sequence needs to be immediately followed by a
              hexadecimal digit character, a sequence such as "\x1""1" can be
              used, which represents a character containing the value 1,
              followed by the character '1'.
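
       The "longest sequence" wording for the two numeric escapes can be
       illustrated with a small decoder. This is a sketch only:
       decode_numeric_escape() is a hypothetical helper, not part of lex or
       its library, and it assumes a character set in which the letters 'a'
       to 'f' and 'A' to 'F' are contiguous. An all-zero value is returned
       as 0 even though the behavior of such an escape is undefined.

```c
#include <stddef.h>

/* Returns the value of a hexadecimal digit, or -1 if c is not one. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Decodes a \digits or \xdigits escape starting at s[0] == '\\'.
 * On success stores the value in *value and returns the number of
 * characters consumed; returns 0 if s does not start such an escape.
 */
size_t decode_numeric_escape(const char *s, unsigned long *value)
{
    unsigned long v = 0;
    size_t i;

    if (s[0] != '\\')
        return 0;

    if (s[1] >= '0' && s[1] <= '7') {
        /* \digits: at most three octal digits are taken. */
        for (i = 1; i <= 3 && s[i] >= '0' && s[i] <= '7'; i++)
            v = v * 8 + (unsigned long)(s[i] - '0');
    } else if (s[1] == 'x') {
        /* \xdigits: every following hexadecimal digit is taken. */
        for (i = 2; hex_value((unsigned char)s[i]) >= 0; i++)
            v = v * 16 + (unsigned long)hex_value((unsigned char)s[i]);
        if (i == 2)
            return 0;                 /* "\x" with no digits */
    } else {
        return 0;
    }

    *value = v;
    return i;
}
```

       In this sketch "\0011" consumes four characters and leaves the final
       '1' as an ordinary character, whereas "\x11" consumes all four
       characters and yields the value 17. That is why the quoted form in
       the Note is needed to obtain the value 1 followed by '1'.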
389
390
391       The order of precedence given to extended regular expressions  for  lex
392       differs   from  that  specified  in  the  Base  Definitions  volume  of
393       IEEE Std 1003.1-2001, Section 9.4, Extended  Regular  Expressions.  The
394       order  of  precedence for lex shall be as shown in the following table,
395       from high to low.
396
397       Note:  The escaped characters entry is not meant to  imply  that  these
398              are  operators, but they are included in the table to show their
399              relationships to the true operators. The start condition, trail‐
400              ing  context, and anchoring notations have been omitted from the
401              table because of the placement restrictions  described  in  this
402              section;  they  can only appear at the beginning or ending of an
403              ERE.
404
405
406
407                                Table: ERE Precedence in lex
408
409                  Extended Regular Expression        Precedence
410                  collation-related bracket symbols  [= =] [: :] [. .]
411                  escaped characters                 \<special character>
412                  bracket expression                 [ ]
413                  quoting                            "..."
414                  grouping                           ( )
415                  definition                         {name}
416                  single-character RE duplication    * + ?
417                  concatenation
418                  interval expression                {m,n}
419                  alternation                        |
420
421       The ERE anchoring operators '^' and '$' do not  appear  in  the  table.
422       With  lex  regular expressions, these operators are restricted in their
423       use: the '^' operator can only be used at the beginning  of  an  entire
424       regular expression, and the '$' operator only at the end. The operators
425       apply to the entire regular expression. Thus, for example, the  pattern
426       "(^abc)|(def$)" is undefined; it can instead be written as two separate
427       rules, one with the regular expression  "^abc"  and  one  with  "def$",
428       which  share a common action via the special '|' action (see below). If
429       the pattern were written "^abc|def$", it would match  either  "abc"  or
430       "def" on a line by itself.
431
432       Unlike the general ERE rules, embedded anchoring is not allowed by most
433       historical lex implementations. An example of embedded anchoring  would
434       be  for  patterns such as "(^| )foo( |$)" to match "foo" when it exists
435       as a complete word. This functionality can be obtained  using  existing
436       lex features:
437
438
439              ^foo/[ \n]      |
440              " foo"/[ \n]    /* Found foo as a separate word. */
441
       Note also that '$' is a form of trailing context (it is equivalent to
       "/\n") and as such cannot be used with regular expressions containing
       another instance of the operator (see the preceding discussion of
       trailing context).

       The additional regular expressions trailing-context operator '/' can be
       used as an ordinary character if presented within double-quotes, "/";
       preceded by a backslash, "\/"; or within a bracket expression, "[/]".
       The start-condition '<' and '>' operators shall be special only in a
       start condition at the beginning of a regular expression; elsewhere in
       the regular expression they shall be treated as ordinary characters.
453
   Actions in lex
455       The  action to be taken when an ERE is matched can be a C program frag‐
456       ment or the special actions described below; the program  fragment  can
457       contain one or more C statements, and can also include special actions.
458       The empty C statement ';' shall be a valid action; any  string  in  the
459       lex.yy.c  input  that  matches  the  pattern  portion of such a rule is
460       effectively ignored or skipped. However, the absence of an action shall
461       not  be  valid,  and  the action lex takes in such a condition is unde‐
462       fined.
463
464       The specification for an action, including  C  statements  and  special
465       actions, can extend across several lines if enclosed in braces:
466
467
468              ERE <one or more blanks> { program statement
469                                         program statement }
470
471       The  default action when a string in the input to a lex.yy.c program is
472       not matched by any expression shall be to copy the string to  the  out‐
473       put.  Because  the default behavior of a program generated by lex is to
474       read the input and copy it to the output, a minimal lex source  program
475       that  has  just  "%%" shall generate a C program that simply copies the
476       input to the output unchanged.
477
478       Four special actions shall be available:
479
480
481              |   ECHO;   REJECT;   BEGIN
482
483       |      The action '|' means that the action for the next  rule  is  the
484              action for this rule. Unlike the other three actions, '|' cannot
485              be enclosed in braces or be semicolon-terminated;  the  applica‐
486              tion  shall  ensure  that  it  is specified alone, with no other
487              actions.
488
489       ECHO;  Write the contents of the string yytext on the output.
490
491       REJECT;
492              Usually only a single expression is matched by a given string in
493              the  input.  REJECT  means "continue to the next expression that
494              matches the current input", and shall cause  whatever  rule  was
495              the  second choice after the current rule to be executed for the
496              same input. Thus, multiple rules can be matched and executed for
497              one  input  string  or  overlapping input strings.  For example,
498              given the regular expressions  "xyz"  and  "xy"  and  the  input
499              "xyz",  usually  only  the regular expression "xyz" would match.
500              The next attempted match would start after z. If the last action
501              in  the  "xyz"  rule is REJECT, both this rule and the "xy" rule
502              would be executed. The REJECT action may be implemented in  such
503              a fashion that flow of control does not continue after it, as if
504              it were equivalent to a goto to another part of yylex(). The use
505              of REJECT may result in somewhat larger and slower scanners.
506
       BEGIN  The action:

                     BEGIN newstate;

              switches the state (start condition) to newstate. If the string
              newstate has not been declared previously as a start condition
              in the Definitions section, the results are unspecified. The
              initial state is indicated by the digit '0' or the token
              INITIAL.
516
517
518       The functions or macros described below are  accessible  to  user  code
519       included in the lex input. It is unspecified whether they appear in the
520       C code output of lex, or are accessible only through the  -l l  operand
521       to c99 (the lex library).
522
523       int  yylex(void)
524
525              Performs  lexical  analysis  on  the  input; this is the primary
526              function generated by the lex utility. The function shall return
527              zero  when  the  end  of  input  is reached; otherwise, it shall
528              return non-zero values (tokens) determined by the  actions  that
529              are selected.
530
531       int  yymore(void)
532
533              When called, indicates that when the next input string is recog‐
534              nized, it is to be appended  to  the  current  value  of  yytext
535              rather  than replacing it; the value in yyleng shall be adjusted
536              accordingly.
537
538       int  yyless(int  n)
539
540              Retains n initial  characters  in  yytext,  NUL-terminated,  and
541              treats  the  remaining  characters as if they had not been read;
542              the value in yyleng shall be adjusted accordingly.
543
544       int  input(void)
545
546              Returns the next character from the input, or  zero  on  end-of-
547              file.   It  shall  obtain  input  from  the stream pointer yyin,
548              although possibly via an intermediate buffer. Thus,  once  scan‐
549              ning  has  begun,  the  effect  of altering the value of yyin is
550              undefined. The character read shall be removed  from  the  input
551              stream of the scanner without any processing by the scanner.
552
553       int  unput(int  c)
554
555              Returns  the  character  'c' to the input; yytext and yyleng are
556              undefined until the next expression is matched.  The  result  of
557              using  unput()  for  more  characters  than  have  been input is
558              unspecified.
559
560
561       The following functions shall appear only in the lex library accessible
562       through the -l l operand; they can therefore be redefined by a conform‐
563       ing application:
564
565       int  yywrap(void)
566
567              Called by yylex() at end-of-file;  the  default  yywrap()  shall
568              always return 1. If the application requires yylex() to continue
569              processing with another source of input,  then  the  application
570              can  include  a function yywrap(), which associates another file
571              with the external variable FILE * yyin and shall return a  value
572              of zero.
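
       A replacement yywrap() along the lines described above might look like
       the following sketch. The pending_files list and set_pending_files()
       are hypothetical names introduced for illustration. In a real scanner
       yyin is defined by the generated code and would only be declared here
       with extern; it is defined in the sketch so that the fragment compiles
       on its own.

```c
#include <stdio.h>

/* Assumption: stands in for the yyin defined in lex.yy.c. */
FILE *yyin;

/* Hypothetical NULL-terminated list of inputs still to be scanned. */
static const char **pending_files;

void set_pending_files(const char **files)
{
    pending_files = files;
}

/*
 * Called by yylex() at end-of-file. Returns 0 after pointing yyin at the
 * next readable file, or 1 when there is nothing left to scan.
 */
int yywrap(void)
{
    while (pending_files != NULL && *pending_files != NULL) {
        FILE *next = fopen(*pending_files++, "r");

        if (next == NULL)
            continue;                 /* unreadable file: try the next one */
        if (yyin != NULL && yyin != stdin)
            fclose(yyin);             /* finished with the previous input */
        yyin = next;
        return 0;                     /* more input: yylex() continues */
    }
    return 1;                         /* no more input: yylex() returns 0 */
}
```

       A main() supplied by the application could pass the remaining
       command-line operands to set_pending_files() before calling yylex().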
573
574       int  main(int  argc, char *argv[])
575
576              Calls  yylex() to perform lexical analysis, then exits. The user
577              code can contain main() to perform  application-specific  opera‐
578              tions, calling yylex() as applicable.
579
580
581       Except  for input(), unput(), and main(), all external and static names
582       generated by lex shall begin with the prefix yy or YY.
583

EXIT STATUS

585       The following exit values shall be returned:
586
587        0     Successful completion.
588
589       >0     An error occurred.
590
591

CONSEQUENCES OF ERRORS

593       Default.
594
595       The following sections are informative.
596

APPLICATION USAGE

598       Conforming applications are warned that in the Rules  section,  an  ERE
599       without  an action is not acceptable, but need not be detected as erro‐
600       neous by lex. This may result in compilation or runtime errors.
601
602       The purpose of input() is to take characters off the input  stream  and
603       discard  them as far as the lexical analysis is concerned. A common use
604       is to discard the body of a comment once the beginning of a comment  is
605       recognized.
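
       The comment-skipping use of input() can be sketched as follows. So
       that the fragment is self-contained, the reader is passed in as a
       function pointer; in a lex action it would be input() itself. The
       names skip_comment(), demo_set_text(), and demo_input() are
       hypothetical, and the sketch assumes C-style comments whose opening
       delimiter has already been matched by a rule.

```c
#include <stddef.h>

/*
 * Discards characters up to and including the closing star-slash.
 * next_char must return 0 at end of input, as input() does.
 * Returns 1 if the comment was closed, or 0 if input ended first.
 */
int skip_comment(int (*next_char)(void))
{
    int c, prev = 0;

    while ((c = next_char()) != 0) {
        if (prev == '*' && c == '/')
            return 1;
        prev = c;
    }
    return 0;
}

/* Stand-in for input() that reads from a string (demonstration only). */
static const char *demo_text = "";

void demo_set_text(const char *s)
{
    demo_text = s;
}

int demo_input(void)
{
    return *demo_text != '\0' ? (unsigned char)*demo_text++ : 0;
}
```

       In a lex source file the corresponding rule could call
       skip_comment(input) from the action attached to the pattern for the
       comment opener, so that the comment body never reaches the other
       rules.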
606
       The lex utility is not fully internationalized in its treatment of reg‐
       ular expressions in the lex source code or generated lexical  analyzer.
       It would seem desirable to have the lexical analyzer interpret the reg‐
       ular expressions given in the lex source according to  the  environment
       specified when the lexical analyzer is executed, but this is not possi‐
       ble with the current lex technology. Furthermore, the  very  nature  of
       the lexical analyzers produced by lex must be closely tied to the lexi‐
       cal requirements of the input language being described, which  is  fre‐
       quently locale-specific anyway. (For example, an analyzer  written  for
       French text is not automatically useful for processing other languages.)


EXAMPLES

       The following is an example of a lex program that implements a rudimen‐
       tary scanner for a Pascal-like syntax:

              %{
              /* Need this for the calls to atoi() and atof() below. */
              #include <stdlib.h>
              /* Need this for printf(), fopen(), perror(), and stdin below. */
              #include <stdio.h>
              %}

              DIGIT    [0-9]
              ID       [a-z][a-z0-9]*

              %%

              {DIGIT}+ {
                  printf("An integer: %s (%d)\n", yytext,
                      atoi(yytext));
                  }

              {DIGIT}+"."{DIGIT}*        {
                  printf("A float: %s (%g)\n", yytext,
                      atof(yytext));
                  }

              if|then|begin|end|procedure|function        {
                  printf("A keyword: %s\n", yytext);
                  }

              {ID}    printf("An identifier: %s\n", yytext);

              "+"|"-"|"*"|"/"        printf("An operator: %s\n", yytext);

              "{"[^}\n]*"}"    /* Eat up one-line comments. */

              [ \t\n]+        /* Eat up white space. */

              .  printf("Unrecognized character: %s\n", yytext);

              %%

              int main(int argc, char *argv[])
              {
                  ++argv, --argc;  /* Skip over program name. */
                  if (argc > 0) {
                      yyin = fopen(argv[0], "r");
                      if (yyin == NULL) {
                          /* Report the failure instead of scanning NULL. */
                          perror(argv[0]);
                          return 1;
                      }
                  } else {
                      yyin = stdin;
                  }

                  yylex();
                  return 0;
              }


RATIONALE

       Even though the -c option and references to the C language are retained
       in  this description, lex may be generalized to other languages, as was
       done at one time for EFL, the Extended FORTRAN Language. Since the  lex
       input  specification  is  essentially language-independent, versions of
       this utility could be written to produce Ada, Modula-2, or Pascal code,
       and there are known historical implementations that do so.

       The  current  description  of  lex  bypasses  the issue of dealing with
       internationalized EREs in the lex source code or generated lexical ana‐
       lyzer.  If it follows the model used by awk (the source code is assumed
       to be presented in the POSIX locale, but input and output  are  in  the
       locale  specified by the environment variables), then the tables in the
       lexical analyzer produced by lex would interpret EREs specified in  the
       lex source in terms of the environment variables specified when lex was
       executed. The desired effect would be  to  have  the  lexical  analyzer
       interpret the EREs given in the lex source according to the environment
       specified when the lexical analyzer is executed, but this is not possi‐
       ble with the current lex technology.

       The  description of octal and hexadecimal-digit escape sequences agrees
       with the ISO C standard use of escape sequences. See the RATIONALE  for
       ed  for  a  discussion of bytes larger than 9 bits being represented by
       octal values.  Hexadecimal values can represent larger bytes and multi-
       byte characters directly, using as many digits as required.

       There is no detailed output format specification. The observed behavior
       of lex under four different historical implementations was that none of
       these  implementations consistently reported the line numbers for error
       and warning messages.  Furthermore, there was  a  desire  that  lex  be
       allowed  to output additional diagnostic messages. Leaving message for‐
       mats unspecified avoids these formatting questions  and  problems  with
       internationalization.

       Although the %x specifier for exclusive start conditions is not histor‐
       ical practice, it is believed to be a minor change to historical imple‐
       mentations  and greatly enhances the usability of lex programs since it
       permits an application to obtain the expected functionality with  fewer
       statements.

       The %array and %pointer declarations were added as a compromise between
       historical systems. The System V-based lex copies the matched text to a
       yytext  array. The flex program, supported in BSD and GNU systems, uses
       a pointer. In the latter case, significant performance improvements are
       available for some scanners. Most historical programs should require no
       change in porting from one system to another because the  string  being
       referenced  is  null-terminated in both cases. (The method used by flex
       in its case is to null-terminate the token in place by remembering  the
       character  that  used  to  come  right after the token and replacing it
       before continuing on to the next scan.) Multi-file programs with exter‐
       nal  references  to  yytext outside the scanner source file should con‐
       tinue to operate on their historical systems, but would require one  of
       the new declarations to be considered strictly portable.

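       The  in-place  null-termination  technique attributed to flex above can
       be sketched as follows. This is an illustration of the  idea  only,  not
       code  taken  from  flex;  mark_token(),  restore_held(),  token_end, and
       held_char are hypothetical names, and the buffer is assumed to have  at
       least one byte after the token.

```c
#include <stddef.h>

static char *token_end = NULL;  /* Where the current token stops. */
static char  held_char;         /* Character overwritten by the terminator. */

/* Undo the previous termination, if any, before the next scan. */
static void restore_held(void)
{
    if (token_end != NULL) {
        *token_end = held_char;
        token_end = NULL;
    }
}

/* Make buf[start .. start+len) readable as a C string without copying. */
static char *mark_token(char *buf, size_t start, size_t len)
{
    restore_held();
    token_end = buf + start + len;
    held_char = *token_end;     /* Remember what follows the token. */
    *token_end = '\0';          /* Terminate the token in place. */
    return buf + start;
}
```

       Because  no  bytes are copied, the cost per token does not depend on the
       token length, which is one plausible source of the performance  differ‐
       ence mentioned above.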
       The  description  of EREs avoids unnecessary duplication of ERE details
       because their meanings within a lex ERE are the same as those  for  the
       ERE in this volume of IEEE Std 1003.1-2001.

       The  reason  for the undefined condition associated with text beginning
       with a <blank> or within "%{" and "%}" delimiter lines appearing in the
       Rules  section  is  historical  practice. Both the BSD and System V lex
       copy the indented (or enclosed) input in the Rules section  (except  at
       the  beginning)  to unreachable areas of the yylex() function (the code
       is written directly after a break statement). In some cases, the System
       V  lex  generates  an error message or a syntax error, depending on the
       form of indented input.

       The intention in breaking the list of functions  into  those  that  may
       appear in lex.yy.c versus those that only appear in libl.a is that only
       those functions in libl.a can be reliably  redefined  by  a  conforming
       application.

       The  descriptions  of  standard  output and standard error are somewhat
       complicated because historical lex implementations chose to issue diag‐
       nostic   messages   to   standard   output   (unless   -t  was  given).
       IEEE Std 1003.1-2001 allows this behavior, but leaves  an  opening  for
       the  more  expected  behavior  of using standard error for diagnostics.
       Also, the System V behavior of writing the statistics  when  any  table
       sizes are given is allowed, while BSD-derived systems can avoid it. The
       programmer can always precisely obtain the  desired  results  by  using
       either the -t or -n options.

       The  OPERANDS  section  does  not mention the use of - as a synonym for
       standard input; not all historical implementations support  such  usage
       for any of the file operands.

       A description of the translation table was deleted from early proposals
       because of its relatively low usage in historical applications.

       The change to the  definition  of  the  input()  function  that  allows
       buffering of input presents the opportunity for major performance gains
       in some applications.

       The following examples clarify  the  differences  between  lex  regular
       expressions  and regular expressions appearing elsewhere in this volume
       of IEEE Std 1003.1-2001. For regular expressions of the form "r/x", the
       string  matching  r  is  always  returned; confusion may arise when the
       beginning of x matches the trailing portion of r.  For  example,  given
       the  regular  expression  "a*b/cc" and the input "aaabcc", yytext would
       contain the string "aaab" on this match. But given the regular  expres‐
       sion  "x*/xy"  and the input "xxxy", the token xxx, not xx, is returned
       by some implementations because xxx matches "x*".

       In the rule "ab*/bc", the "b*" at the end of r extends r's  match  into
       the beginning of the trailing context, so the result is unspecified. If
       this rule were "ab/bc", however, the rule matches the text "ab" when it
       is  followed  by the text "bc". In this latter case, the matching of r
       cannot extend into the beginning of x, so the result is specified.

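       The two possible outcomes for the rule "x*/xy" discussed above  can  be
       modeled  by hand as follows. This is a deliberately simplified sketch for
       that one rule, not a description of how any lex implementation  locates
       the   end   of   r;   count_run(),   token_len_exact(),   and
       token_len_greedy() are hypothetical names.

```c
/* Number of leading occurrences of ch in s. */
static int count_run(const char *s, char ch)
{
    int n = 0;

    while (s[n] == ch)
        n++;
    return n;
}

/* Token length when the matched text is split so that the trailing context
 * "xy" is honored exactly: the last x belongs to the context, not to r.
 * Returns -1 if the rule does not match at the start of s. */
static int token_len_exact(const char *s)
{
    int n = count_run(s, 'x');

    return (n >= 1 && s[n] == 'y') ? n - 1 : -1;
}

/* Token length when r alone is allowed to claim as much of the matched text
 * as it can: the pattern for r accepts every leading x. */
static int token_len_greedy(const char *s)
{
    int n = count_run(s, 'x');

    return (n >= 1 && s[n] == 'y') ? n : -1;
}
```

       For the input "xxxy", the first function yields 2 (the  token  xx)  and
       the  second  yields  3  (the token xxx), which is the discrepancy the text
       warns about.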

FUTURE DIRECTIONS

       None.


SEE ALSO

       c99, ed, yacc


COPYRIGHT

       Portions of this text are reprinted and reproduced in  electronic  form
       from IEEE Std 1003.1, 2003 Edition, Standard for Information Technology
       -- Portable Operating System Interface (POSIX),  The  Open  Group  Base
       Specifications  Issue  6,  Copyright  (C) 2001-2003 by the Institute of
       Electrical and Electronics Engineers, Inc and The Open  Group.  In  the
       event of any discrepancy between this version and the original IEEE and
       The Open Group Standard, the original IEEE and The Open Group  Standard
       is  the  referee document. The original Standard can be obtained online
       at http://www.opengroup.org/unix/online.html .



IEEE/The Open Group                  2003                              LEX(1P)