lex(1)                           User Commands                          lex(1)
2
3
4
6 lex - generate programs for lexical tasks
7
9 lex [-cntv] [-e | -w] [-V -Q [y | n]] [file]...
10
11
13 The lex utility generates C programs to be used in lexical processing
14 of character input, and that can be used as an interface to yacc. The C
15 programs are generated from lex source code and conform to the ISO C
16 standard. Usually, the lex utility writes the program it generates to
17 the file lex.yy.c. The state of this file is unspecified if lex exits
18 with a non-zero exit status. See EXTENDED DESCRIPTION for a complete
19 description of the lex input language.
20
22 The following options are supported:
23
24 -c Indicates C-language action (default option).
25
26
27 -e Generates a program that can handle EUC characters (cannot
28 be used with the -w option). yytext[] is of type unsigned
29 char[].
30
31
32 -n Suppresses the summary of statistics usually written with
33 the -v option. If no table sizes are specified in the lex
34 source code and the -v option is not specified, then -n is
35 implied.
36
37
38 -t Writes the resulting program to standard output instead of
39 lex.yy.c.
40
41
42 -v Writes a summary of lex statistics to the standard error.
43 (See the discussion of lex table sizes under the heading
44 Definitions in lex.) If table sizes are specified in the
45 lex source code, and if the -n option is not specified, the
46 -v option may be enabled.
47
48
49 -w Generates a program that can handle EUC characters (cannot
50 be used with the -e option). Unlike the -e option, yytext[]
51 is of type wchar_t[].
52
53
54 -V Prints out version information on standard error.
55
56
57 -Q[y|n] Prints out version information to output file lex.yy.c by
58 using -Qy. The -Qn option does not print out version infor‐
59 mation and is the default.
60
61
63 The following operand is supported:
64
file    A pathname of an input file. If more than one such file is
        specified, all files will be concatenated to produce a single
        lex program. If no file operands are specified, or if a file
        operand is -, the standard input will be used.
69
70
72 The lex output files are described below.
73
74 Stdout
75 If the -t option is specified, the text file of C source code output of
76 lex will be written to standard output.
77
78 Stderr
If the -t option is specified, informational, error, and warning messages
concerning the contents of lex source code input will be written to the
standard error.
82
83
84 If the -t option is not specified:
85
1.  Informational, error, and warning messages concerning the
    contents of lex source code input will be written to either the
    standard output or standard error.
89
90 2. If the -v option is specified and the -n option is not spec‐
91 ified, lex statistics will also be written to standard
92 error. These statistics may also be generated if table sizes
93 are specified with a % operator in the Definitions in lex
94 section (see EXTENDED DESCRIPTION), as long as the -n option
95 is not specified.
96
97 Output Files
98 A text file containing C source code will be written to lex.yy.c, or to
99 the standard output if the -t option is present.
100
102 Each input file contains lex source code, which is a table of regular
103 expressions with corresponding actions in the form of C program frag‐
104 ments.
105
106
107 When lex.yy.c is compiled and linked with the lex library (using the -l
108 l operand with c89 or cc), the resulting program reads character input
109 from the standard input and partitions it into strings that match the
110 given expressions.
111
112
113 When an expression is matched, these actions will occur:
114
115 o The input string that was matched is left in yytext as a
116 null-terminated string; yytext is either an external charac‐
117 ter array or a pointer to a character string. As explained
118 in Definitions in lex, the type can be explicitly selected
119 using the %array or %pointer declarations, but the default
120 is %array.
121
122 o The external int yyleng is set to the length of the matching
123 string.
124
125 o The expression's corresponding program fragment, or action,
126 is executed.
127
128
129 During pattern matching, lex searches the set of patterns for the sin‐
130 gle longest possible match. Among rules that match the same number of
131 characters, the rule given first will be chosen.
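
The selection policy described above can be sketched in plain C. This is only an illustration of the rule (longest match wins, earliest rule wins a tie), not the code that lex generates. The function name pick_rule and the array of per-rule match lengths are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: match_len[i] is the number of characters that
 * rule i matched at the current input position (0 means no match).
 * Returns the index of the winning rule, or -1 if no rule matched. */
int pick_rule(const size_t *match_len, size_t nrules)
{
    int best = -1;
    size_t best_len = 0;

    assert(match_len != NULL || nrules == 0);
    for (size_t i = 0; i < nrules; i++) {
        /* Strictly greater: on a tie the earlier rule keeps the win. */
        if (match_len[i] > best_len) {
            best_len = match_len[i];
            best = (int)i;
        }
    }
    return best;
}
```

For match lengths {2, 5, 5} the function selects rule 1: rules 1 and 2 both match five characters, and the rule given first is chosen.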
132
133
134 The general format of lex source is:
135
136 Definitions
137 %%
138 Rules
139 %%
140 User Subroutines
141
142
143
144 The first %% is required to mark the beginning of the rules (regular
145 expressions and actions); the second %% is required only if user sub‐
146 routines follow.
147
148
149 Any line in the Definitions in lex section beginning with a blank char‐
150 acter will be assumed to be a C program fragment and will be copied to
151 the external definition area of the lex.yy.c file. Similarly, anything
152 in the Definitions in lex section included between delimiter lines con‐
153 taining only %{ and %} will also be copied unchanged to the external
154 definition area of the lex.yy.c file.
155
156
157 Any such input (beginning with a blank character or within %{ and %}
158 delimiter lines) appearing at the beginning of the Rules section before
159 any rules are specified will be written to lex.yy.c after the declara‐
160 tions of variables for the yylex function and before the first line of
161 code in yylex. Thus, user variables local to yylex can be declared
162 here, as well as application code to execute upon entry to yylex.
163
164
165 The action taken by lex when encountering any input beginning with a
166 blank character or within %{ and %} delimiter lines appearing in the
167 Rules section but coming after one or more rules is undefined. The
168 presence of such input may result in an erroneous definition of the
169 yylex function.
170
171 Definitions in lex
172 Definitions in lex appear before the first %% delimiter. Any line in
173 this section not contained between %{ and %} lines and not beginning
174 with a blank character is assumed to define a lex substitution string.
175 The format of these lines is:
176
177 name substitute
178
179
180
181
If a name does not meet the requirements for identifiers in the ISO C
standard, the result is undefined. The string substitute will replace
the string {name} when it is used in a rule. The name string is
recognized in this context only when the braces are provided and when it
does not appear within a bracket expression or within double-quotes.
187
188
189 In the Definitions in lex section, any line beginning with a % (percent
190 sign) character and followed by an alphanumeric word beginning with
191 either s or S defines a set of start conditions. Any line beginning
192 with a % followed by a word beginning with either x or X defines a set
193 of exclusive start conditions. When the generated scanner is in a %s
194 state, patterns with no state specified will be also active; in a %x
195 state, such patterns will not be active. The rest of the line, after
196 the first word, is considered to be one or more blank-character-sepa‐
197 rated names of start conditions. Start condition names are constructed
198 in the same way as definition names. Start conditions can be used to
199 restrict the matching of regular expressions to one or more states as
200 described in Regular expressions in lex.
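
The difference between %s (inclusive) and %x (exclusive) states can be modeled with a small predicate. This is a sketch under stated assumptions: struct rule_states and rule_is_active are hypothetical names, and a real scanner encodes this information in its generated tables rather than calling a function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one rule's start-condition prefix. */
struct rule_states {
    bool has_prefix;     /* rule was written as <state>r */
    bool lists_current;  /* the prefix names the current state */
};

/* current_is_exclusive is true when the current state was declared
 * with %x, false for %s states and for the initial state. */
bool rule_is_active(const struct rule_states *r, bool current_is_exclusive)
{
    assert(r != NULL);
    if (r->has_prefix)
        return r->lists_current;
    /* Rules with no state specified stay active except in %x states. */
    return !current_is_exclusive;
}
```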
201
202
203 Implementations accept either of the following two mutually exclusive
204 declarations in the Definitions in lex section:
205
206 %array Declare the type of yytext to be a null-terminated charac‐
207 ter array.
208
209
210 %pointer Declare the type of yytext to be a pointer to a null-ter‐
211 minated character string.
212
213
214
215 Note: When using the %pointer option, you may not also use the yyless
216 function to alter yytext.
217
218
%array is the default. If %array is specified (or neither %array nor
%pointer is specified), then the correct way to make an external
reference to yytext is with a declaration of the form:

extern char yytext[];
225
226
227 If %pointer is specified, then the correct external reference is of the
228 form:
229
230
231 extern char *yytext;
232
233
234 lex will accept declarations in the Definitions in lex section for set‐
235 ting certain internal table sizes. The declarations are shown in the
236 following table.
237
238
239 Table Size Declaration in lex
240
241
242
243
┌───────────────────────────────────────────────────────────────────┐
│ Declaration   Description                            Default      │
├───────────────────────────────────────────────────────────────────┤
│ %p n          Number of positions                    2500         │
│ %n n          Number of states                       500          │
│ %a n          Number of transitions                  2000         │
│ %e n          Number of parse tree nodes             1000         │
│ %k n          Number of packed character classes     10000        │
│ %o n          Size of the output array               3000         │
└───────────────────────────────────────────────────────────────────┘
254
255
256 Programs generated by lex need either the -e or -w option to handle
257 input that contains EUC characters from supplementary codesets. If nei‐
258 ther of these options is specified, yytext is of the type char[], and
259 the generated program can handle only ASCII characters.
260
261
262 When the -e option is used, yytext is of the type unsigned char[] and
263 yyleng gives the total number of bytes in the matched string. With this
264 option, the macros input(), unput(c), and output(c) should do a byte-
265 based I/O in the same way as with the regular ASCII lex. Two more vari‐
266 ables are available with the -e option, yywtext and yywleng, which
267 behave the same as yytext and yyleng would under the -w option.
268
269
270 When the -w option is used, yytext is of the type wchar_t[] and yyleng
271 gives the total number of characters in the matched string. If you
272 supply your own input(), unput(c), or output(c) macros with this
273 option, they must return or accept EUC characters in the form of wide
274 character (wchar_t). This allows a different interface between your
275 program and the lex internals, to expedite some programs.
276
277 Rules in lex
278 The Rules in lex source files are a table in which the left column con‐
279 tains regular expressions and the right column contains actions (C pro‐
280 gram fragments) to be executed when the expressions are recognized.
281
282 ERE action
283 ERE action
284 ...
285
286
287
288 The extended regular expression (ERE) portion of a row will be sepa‐
289 rated from action by one or more blank characters. A regular expression
290 containing blank characters is recognized under one of the following
291 conditions:
292
293 o The entire expression appears within double-quotes.
294
295 o The blank characters appear within double-quotes or square
296 brackets.
297
298 o Each blank character is preceded by a backslash character.
299
300 User Subroutines in lex
301 Anything in the user subroutines section will be copied to lex.yy.c
302 following yylex.
303
304 Regular Expressions in lex
305 The lex utility supports the set of Extended Regular Expressions (EREs)
306 described on regex(5) with the following additions and exceptions to
307 the syntax:
308
"..."   Any string enclosed in double-quotes will represent the
        characters within the double-quotes as themselves, except
        that backslash escapes (which appear in the following table)
        are recognized. Any backslash-escape sequence is
        terminated by the closing quote. For example, "\01""1"
        represents a single string: the octal value 1 followed by
        the character 1.
316
317
318
319 <state>r
320
321 <state1, state2, ...>r
322
323 The regular expression r will be matched only when the program is
324 in one of the start conditions indicated by state, state1, and so
325 forth. For more information, see Actions in lex. As an exception to
326 the typographical conventions of the rest of this document, in this
327 case <state> does not represent a metavariable, but the literal
328 angle-bracket characters surrounding a symbol. The start condition
329 is recognized as such only at the beginning of a regular expres‐
330 sion.
331
332
333 r/x
334
335 The regular expression r will be matched only if it is followed by
336 an occurrence of regular expression x. The token returned in yytext
337 will only match r. If the trailing portion of r matches the begin‐
338 ning of x, the result is unspecified. The r expression cannot
339 include further trailing context or the $ (match-end-of-line) oper‐
340 ator; x cannot include the ^ (match-beginning-of-line) operator,
341 nor trailing context, nor the $ operator. That is, only one occur‐
342 rence of trailing context is allowed in a lex regular expression,
343 and the ^ operator only can be used at the beginning of such an
344 expression. A further restriction is that the trailing-context
345 operator / (slash) cannot be grouped within parentheses.
346
347
348 {name}
349
350 When name is one of the substitution symbols from the Definitions
351 section, the string, including the enclosing braces, will be
352 replaced by the substitute value. The substitute value will be
353 treated in the extended regular expression as if it were enclosed
354 in parentheses. No substitution will occur if {name} occurs within
355 a bracket expression or within double-quotes.
356
357
358
359 Within an ERE, a backslash character (\\, \a, \b, \f, \n, \r, \t, \v)
360 is considered to begin an escape sequence. In addition, the escape
361 sequences in the following table will be recognized.
362
363
364 A literal newline character cannot occur within an ERE; the escape
365 sequence \n can be used to represent a newline character. A newline
366 character cannot be matched by a period operator.
367
368
369 Escape Sequences in lex
370
371
372
373
┌─────────────────────────────────────────────────────────────────────────────────────┐
│Escape Sequences in lex                                                              │
├─────────────────────────────────────────────────────────────────────────────────────┤
│ Escape Sequence  Description                      Meaning                           │
├─────────────────────────────────────────────────────────────────────────────────────┤
│ \digits          A backslash character followed   The character whose encoding is   │
│                  by the longest sequence of one,  represented by the one-, two-, or │
│                  two, or three octal-digit        three-digit octal integer.        │
│                  characters (01234567). If all    Multi-byte characters require     │
│                  of the digits are 0 (that is,    multiple, concatenated escape     │
│                  representation of the NUL        sequences of this type, including │
│                  character), the behavior is      the leading \ for each byte.      │
│                  undefined.                                                         │
├─────────────────────────────────────────────────────────────────────────────────────┤
│ \xdigits         A backslash character followed   The character whose encoding is   │
│                  by the longest sequence of       represented by the hexadecimal    │
│                  hexadecimal-digit characters     integer.                          │
│                  (01234567abcdefABCDEF). If all                                     │
│                  of the digits are 0 (that is,                                      │
│                  representation of the NUL                                          │
│                  character), the behavior is                                        │
│                  undefined.                                                         │
├─────────────────────────────────────────────────────────────────────────────────────┤
│ \c               A backslash character followed   The character c, unchanged.       │
│                  by any character not described                                     │
│                  in this table and not one of                                       │
│                  \\, \a, \b, \f, \n, \r, \t, \v.                                    │
└─────────────────────────────────────────────────────────────────────────────────────┘
404
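
The \digits rule (a backslash followed by the longest sequence of one, two, or three octal digits) can be sketched as a small decoder. This is an illustration of the rule as stated in the table, not the parser used by lex; the name decode_octal_escape and its interface are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decoder for the \digits form; s points at the backslash.
 * Consumes at most three octal digits, stores the number of bytes used
 * (including the backslash) in *used, and returns the encoded value.
 * Returns -1 if s does not start with a backslash and an octal digit. */
int decode_octal_escape(const char *s, size_t *used)
{
    int value = 0;
    size_t i = 1;

    assert(s != NULL && used != NULL);
    if (s[0] != '\\' || s[1] < '0' || s[1] > '7')
        return -1;
    /* Take the longest run of octal digits, but never more than three. */
    while (i <= 3 && s[i] >= '0' && s[i] <= '7') {
        value = value * 8 + (s[i] - '0');
        i++;
    }
    *used = i;
    return value;
}
```

Applied to the four characters \011 followed by 1, the decoder consumes the backslash and three digits and leaves the final 1 as an ordinary character. This is why the earlier "\01""1" example needs two quoted strings to express the octal value 1 followed by the character 1.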
405
406 The order of precedence given to extended regular expressions for lex
407 is as shown in the following table, from high to low.
408
409 Note: The escaped characters entry is not meant to imply that these
410 are operators, but they are included in the table to show
411 their relationships to the true operators. The start condi‐
412 tion, trailing context and anchoring notations have been
413 omitted from the table because of the placement restrictions
414 described in this section; they can only appear at the begin‐
415 ning or ending of an ERE.
416
417
418
419
420
┌────────────────────────────────────────────────────────────────┐
│                     ERE Precedence in lex                      │
├────────────────────────────────────────────────────────────────┤
│collation-related bracket symbols       [= =] [: :] [. .]       │
│escaped characters                      \<special character>    │
│bracket expression                      [ ]                     │
│quoting                                 "..."                   │
│grouping                                ()                      │
│definition                              {name}                  │
│single-character RE duplication         * + ?                   │
│concatenation                                                   │
│interval expression                     {m,n}                   │
│alternation                             |                       │
└────────────────────────────────────────────────────────────────┘
435
436
437 The ERE anchoring operators (^ and $) do not appear in the table. With
438 lex regular expressions, these operators are restricted in their use:
439 the ^ operator can only be used at the beginning of an entire regular
440 expression, and the $ operator only at the end. The operators apply to
441 the entire regular expression. Thus, for example, the pattern
442 (^abc)|(def$) is undefined; it can instead be written as two separate
443 rules, one with the regular expression ^abc and one with def$, which
444 share a common action via the special | action (see below). If the pat‐
445 tern were written ^abc|def$, it would match either of abc or def on a
446 line by itself.
447
448
449 Unlike the general ERE rules, embedded anchoring is not allowed by most
450 historical lex implementations. An example of embedded anchoring would
451 be for patterns such as (^)foo($) to match foo when it exists as a com‐
452 plete word. This functionality can be obtained using existing lex fea‐
453 tures:
454
^foo/[ \n]      |
" foo"/[ \n]    /* found foo as a separate word */
457
458
459
Notice also that $ is a form of trailing context (it is equivalent to
/\n) and as such cannot be used with regular expressions containing
another instance of the operator (see the preceding discussion of
trailing context).
464
465
The additional regular expression trailing-context operator / (slash)
can be used as an ordinary character if presented within double-quotes,
"/"; preceded by a backslash, \/; or within a bracket expression, [/].
The start-condition < and > operators are special only in a start
condition at the beginning of a regular expression; elsewhere in the
regular expression they are treated as ordinary characters.
472
473
474 The following examples clarify the differences between lex regular
475 expressions and regular expressions appearing elsewhere in this docu‐
476 ment. For regular expressions of the form r/x, the string matching r is
477 always returned; confusion may arise when the beginning of x matches
478 the trailing portion of r. For example, given the regular expression
479 a*b/cc and the input aaabcc, yytext would contain the string aaab on
480 this match. But given the regular expression x*/xy and the input xxxy,
481 the token xxx, not xx, is returned by some implementations because xxx
482 matches x*.
483
484
485 In the rule ab*/bc, the b* at the end of r will extend r's match into
486 the beginning of the trailing context, so the result is unspecified. If
487 this rule were ab/bc, however, the rule matches the text ab when it is
488 followed by the text bc. In this latter case, the matching of r cannot
489 extend into the beginning of x, so the result is specified.
490
491 Actions in lex
492 The action to be taken when an ERE is matched can be a C program frag‐
493 ment or the special actions described below; the program fragment can
494 contain one or more C statements, and can also include special actions.
495 The empty C statement ; is a valid action; any string in the lex.yy.c
496 input that matches the pattern portion of such a rule is effectively
497 ignored or skipped. However, the absence of an action is not valid, and
498 the action lex takes in such a condition is undefined.
499
500
501 The specification for an action, including C statements and special
502 actions, can extend across several lines if enclosed in braces:
503
504 ERE <one or more blanks> { program statement
505 program statement }
506
507
508
509
510 The default action when a string in the input to a lex.yy.c program is
511 not matched by any expression is to copy the string to the output.
512 Because the default behavior of a program generated by lex is to read
513 the input and copy it to the output, a minimal lex source program that
514 has just %% generates a C program that simply copies the input to the
515 output unchanged.
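
The observable behavior of the minimal %% program can be approximated by a plain copy loop. This is a sketch of the equivalent effect only, not the scanner that lex generates; the function name copy_stream is hypothetical, and a real scanner reads from yyin rather than from an argument.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical equivalent of the default action applied to all input:
 * every character that no rule matches is copied to the output.
 * Returns the number of characters copied. */
long copy_stream(FILE *in, FILE *out)
{
    int c;
    long copied = 0;

    assert(in != NULL && out != NULL);
    while ((c = getc(in)) != EOF) {
        putc(c, out);
        copied++;
    }
    return copied;
}
```

Calling copy_stream(stdin, stdout) from main gives a filter that passes its input through unchanged.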
516
517
518 Four special actions are available:
519
520 | ECHO; REJECT; BEGIN
521
522
523
524 | The action | means that the action for the next rule is the
525 action for this rule. Unlike the other three actions, |
526 cannot be enclosed in braces or be semicolon-terminated. It
527 must be specified alone, with no other actions.
528
529
530 ECHO; Writes the contents of the string yytext on the output.
531
532
REJECT;   Usually only a single expression is matched by a given
          string in the input. REJECT means "continue to the next
          expression that matches the current input," and causes
          whatever rule was the second choice after the current rule
          to be executed for the same input. Thus, multiple rules can
          be matched and executed for one input string or overlapping
          input strings. For example, given the regular expressions
          xyz and xy and the input xyz, usually only the regular
          expression xyz would match. The next attempted match would
          start after z. If the last action in the xyz rule is REJECT,
          both this rule and the xy rule would be executed. The
          REJECT action may be implemented in such a fashion that
          flow of control does not continue after it, as if it were
          equivalent to a goto to another part of yylex. The use of
          REJECT may result in somewhat larger and slower scanners.
548
549
550 BEGIN The action:
551
552 BEGIN newstate;
553
554 switches the state (start condition) to newstate. If the
555 string newstate has not been declared previously as a start
556 condition in the Definitions in lex section, the results
557 are unspecified. The initial state is indicated by the
558 digit 0 or the token INITIAL.
559
560
561
562 The functions or macros described below are accessible to user code
563 included in the lex input. It is unspecified whether they appear in the
564 C code output of lex, or are accessible only through the -l l operand
565 to c89 or cc (the lex library).
566
567 int yylex(void) Performs lexical analysis on the input; this is
568 the primary function generated by the lex utility.
569 The function returns zero when the end of input is
570 reached; otherwise it returns non-zero values
571 (tokens) determined by the actions that are
572 selected.
573
574
575 int yymore(void) When called, indicates that when the next input
576 string is recognized, it is to be appended to the
577 current value of yytext rather than replacing it;
578 the value in yyleng is adjusted accordingly.
579
580
int yyless(int n)   Retains n initial characters in yytext,
                    NUL-terminated, and treats the remaining characters
                    as if they had not been read; the value in yyleng is
                    adjusted accordingly.
585
586
587 int input(void) Returns the next character from the input, or zero
588 on end-of-file. It obtains input from the stream
589 pointer yyin, although possibly via an intermedi‐
590 ate buffer. Thus, once scanning has begun, the
591 effect of altering the value of yyin is undefined.
592 The character read is removed from the input
593 stream of the scanner without any processing by
594 the scanner.
595
596
597 int unput(int c) Returns the character c to the input; yytext and
598 yyleng are undefined until the next expression is
599 matched. The result of using unput for more char‐
600 acters than have been input is unspecified.
601
602
603
604 The following functions appear only in the lex library accessible
605 through the -l l operand; they can therefore be redefined by a portable
606 application:
607
608 int yywrap(void)
609
610 Called by yylex at end-of-file; the default yywrap always will
611 return 1. If the application requires yylex to continue processing
612 with another source of input, then the application can include a
613 function yywrap, which associates another file with the external
614 variable FILE *yyin and will return a value of zero.
615
616
617 int main(int argc, char *argv[])
618
619 Calls yylex to perform lexical analysis, then exits. The user code
620 can contain main to perform application-specific operations, call‐
621 ing yylex as applicable.
622
623
624
625 The reason for breaking these functions into two lists is that only
626 those functions in libl.a can be reliably redefined by a portable
627 application.
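
A replacement yywrap that moves on to the next input file might look like the following sketch. The file list, the helper set_input_files, and the local definition of yyin are assumptions made so that the fragment compiles on its own; in a real scanner yyin is supplied by the code that lex generates and must not be defined a second time.

```c
#include <assert.h>
#include <stdio.h>

/* Defined here only to keep the sketch self-contained; drop this
 * definition when linking with a real lex.yy.c. */
FILE *yyin;

/* Hypothetical application state: NULL-terminated list of the input
 * file names that have not been opened yet. */
static const char **next_file;

void set_input_files(const char **names)
{
    assert(names != NULL);
    next_file = names;
}

/* Called by yylex at end-of-file. Returns 0 after attaching another
 * file to yyin, or 1 when there is nothing left to scan. */
int yywrap(void)
{
    if (yyin != NULL && yyin != stdin) {
        fclose(yyin);
        yyin = NULL;
    }
    while (next_file != NULL && *next_file != NULL) {
        yyin = fopen(*next_file++, "r");
        if (yyin != NULL)
            return 0;   /* more input: yylex continues scanning */
    }
    return 1;           /* no more input: yylex returns zero */
}
```

Names that cannot be opened are skipped silently in this sketch; an application would normally report them.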
628
629
630 Except for input, unput and main, all external and static names gener‐
631 ated by lex begin with the prefix yy or YY.
632
634 Portable applications are warned that in the Rules in lex section, an
635 ERE without an action is not acceptable, but need not be detected as
636 erroneous by lex. This may result in compilation or run-time errors.
637
638
639 The purpose of input is to take characters off the input stream and
640 discard them as far as the lexical analysis is concerned. A common use
641 is to discard the body of a comment once the beginning of a comment is
642 recognized.
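
The comment-skipping use of input can be sketched as follows. To keep the fragment self-contained, a string-backed reader named next_char stands in for input(); it follows the same convention of returning zero at end of input. All names in the sketch (set_source, remaining, next_char, skip_comment) are hypothetical, and the comment syntax is assumed to be the C-style slash-star form.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the lex input() routine: returns the next character of
 * a string, or 0 at the end. */
static const char *src = "";

void set_source(const char *s) { assert(s != NULL); src = s; }
const char *remaining(void) { return src; }
int next_char(void) { return *src ? (unsigned char)*src++ : 0; }

/* Intended to be called from the action of a rule that has already
 * matched the comment opener. Consumes characters up to and including
 * the closing star-slash. Returns 1 if the comment was closed, or 0 if
 * the input ended first. */
int skip_comment(int (*read_char)(void))
{
    int c, prev = 0;

    assert(read_char != NULL);
    while ((c = read_char()) != 0) {
        if (prev == '*' && c == '/')
            return 1;
        prev = c;
    }
    return 0;
}
```

In a lex action the call would pass input (or a thin wrapper around it) instead of next_char, and the discarded characters never reach pattern matching.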
643
644
645 The lex utility is not fully internationalized in its treatment of reg‐
646 ular expressions in the lex source code or generated lexical analyzer.
647 It would seem desirable to have the lexical analyzer interpret the reg‐
648 ular expressions given in the lex source according to the environment
649 specified when the lexical analyzer is executed, but this is not possi‐
650 ble with the current lex technology. Furthermore, the very nature of
651 the lexical analyzers produced by lex must be closely tied to the lexi‐
652 cal requirements of the input language being described, which will fre‐
653 quently be locale-specific anyway. (For example, writing an analyzer
654 that is used for French text will not automatically be useful for pro‐
655 cessing other languages.)
656
658 Example 1 Using lex
659
660
661 The following is an example of a lex program that implements a rudimen‐
662 tary scanner for a Pascal-like syntax:
663
664
%{
/* need this for the calls to atoi() and atof() below */
#include <stdlib.h>
/* need this for printf(), fopen() and stdin below */
#include <stdio.h>
%}

DIGIT    [0-9]
ID       [a-z][a-z0-9]*
%%

{DIGIT}+                {
                        printf("An integer: %s (%d)\n", yytext,
                               atoi(yytext));
                        }

{DIGIT}+"."{DIGIT}*     {
                        printf("A float: %s (%g)\n", yytext,
                               atof(yytext));
                        }

if|then|begin|end|procedure|function        {
                        printf("A keyword: %s\n", yytext);
                        }

{ID}                    printf("An identifier: %s\n", yytext);

"+"|"-"|"*"|"/"         printf("An operator: %s\n", yytext);

"{"[^}\n]*"}"           ;    /* eat up one-line comments */

[ \t\n]+                ;    /* eat up white space */

.                       printf("Unrecognized character: %s\n", yytext);

%%

int main(int argc, char *argv[])
{
    ++argv, --argc;    /* skip over program name */
    if (argc > 0)
        yyin = fopen(argv[0], "r");
    else
        yyin = stdin;

    if (yyin == NULL) {    /* the named file could not be opened */
        perror(argv[0]);
        return 1;
    }

    yylex();
    return 0;
}
712
713
714
716 See environ(5) for descriptions of the following environment variables
717 that affect the execution of lex: LANG, LC_ALL, LC_COLLATE, LC_CTYPE,
718 LC_MESSAGES, and NLSPATH.
719
721 The following exit values are returned:
722
723 0 Successful completion.
724
725
726 >0 An error occurred.
727
728
730 See attributes(5) for descriptions of the following attributes:
731
732
733
734
735 ┌─────────────────────────────┬─────────────────────────────┐
736 │ ATTRIBUTE TYPE │ ATTRIBUTE VALUE │
737 ├─────────────────────────────┼─────────────────────────────┤
738 │Availability │SUNWbtool │
739 ├─────────────────────────────┼─────────────────────────────┤
740 │Interface Stability │Standard │
741 └─────────────────────────────┴─────────────────────────────┘
742
744 yacc(1), attributes(5), environ(5), regex(5), standards(5)
745
747 If routines such as yyback(), yywrap(), and yylock() in .l (ell) files
748 are to be external C functions, the command line to compile a C++ pro‐
749 gram must define the __EXTERN_C__ macro. For example:
750
751 example% CC -D__EXTERN_C__ ... file
752
753
754
755
756
SunOS 5.11                        22 Aug 1997                            lex(1)