1LEX(P) POSIX Programmer's Manual LEX(P)
2
3
4
NAME
6 lex - generate programs for lexical tasks (DEVELOPMENT)
7
SYNOPSIS
9 lex [-t][-n|-v][file ...]
10
DESCRIPTION
12 The lex utility shall generate C programs to be used in lexical pro‐
13 cessing of character input, and that can be used as an interface to
14 yacc. The C programs shall be generated from lex source code and con‐
15 form to the ISO C standard. Usually, the lex utility shall write the
16 program it generates to the file lex.yy.c; the state of this file is
17 unspecified if lex exits with a non-zero exit status. See the EXTENDED
18 DESCRIPTION section for a complete description of the lex input lan‐
19 guage.
20
OPTIONS
22 The lex utility shall conform to the Base Definitions volume of
23 IEEE Std 1003.1-2001, Section 12.2, Utility Syntax Guidelines.
24
25 The following options shall be supported:
26
27 -n Suppress the summary of statistics usually written with the -v
28 option. If no table sizes are specified in the lex source code
29 and the -v option is not specified, then -n is implied.
30
31 -t Write the resulting program to standard output instead of
32 lex.yy.c.
33
-v Write a summary of lex statistics to the standard output. (See
the discussion of lex table sizes in Definitions in lex.) If
the -t option is specified and -n is not specified, this report
shall be written to standard error. If table sizes are specified
in the lex source code, and if the -n option is not specified,
the -v option may be enabled.
40
41
OPERANDS
43 The following operand shall be supported:
44
file A pathname of an input file. If more than one such file is
specified, all files shall be concatenated to produce a single lex
program. If no file operands are specified, or if a file operand
is '-', the standard input shall be used.
49
50
STDIN
The standard input shall be used if no file operands are specified, or
if a file operand is '-'. See INPUT FILES.
54
INPUT FILES
56 The input files shall be text files containing lex source code, as
57 described in the EXTENDED DESCRIPTION section.
58
ENVIRONMENT VARIABLES
60 The following environment variables shall affect the execution of lex:
61
62 LANG Provide a default value for the internationalization variables
63 that are unset or null. (See the Base Definitions volume of
64 IEEE Std 1003.1-2001, Section 8.2, Internationalization Vari‐
65 ables for the precedence of internationalization variables used
66 to determine the values of locale categories.)
67
68 LC_ALL If set to a non-empty string value, override the values of all
69 the other internationalization variables.
70
71 LC_COLLATE
72
73 Determine the locale for the behavior of ranges, equivalence
74 classes, and multi-character collating elements within regular
75 expressions. If this variable is not set to the POSIX locale,
76 the results are unspecified.
77
78 LC_CTYPE
79 Determine the locale for the interpretation of sequences of
80 bytes of text data as characters (for example, single-byte as
81 opposed to multi-byte characters in arguments and input files),
82 and the behavior of character classes within regular expres‐
83 sions. If this variable is not set to the POSIX locale, the
84 results are unspecified.
85
86 LC_MESSAGES
87 Determine the locale that should be used to affect the format
88 and contents of diagnostic messages written to standard error.
89
90 NLSPATH
91 Determine the location of message catalogs for the processing of
92 LC_MESSAGES .
93
94
ASYNCHRONOUS EVENTS
96 Default.
97
STDOUT
99 If the -t option is specified, the text file of C source code output of
100 lex shall be written to standard output.
101
102 If the -t option is not specified:
103
104 * Implementation-defined informational, error, and warning messages
105 concerning the contents of lex source code input shall be written to
106 either the standard output or standard error.
107
108 * If the -v option is specified and the -n option is not specified,
109 lex statistics shall also be written to either the standard output
110 or standard error, in an implementation-defined format. These sta‐
111 tistics may also be generated if table sizes are specified with a
112 '%' operator in the Definitions section, as long as the -n option is
113 not specified.
114
STDERR
116 If the -t option is specified, implementation-defined informational,
117 error, and warning messages concerning the contents of lex source code
118 input shall be written to the standard error.
119
120 If the -t option is not specified:
121
122 1. Implementation-defined informational, error, and warning messages
123 concerning the contents of lex source code input shall be written
124 to either the standard output or standard error.
125
126 2. If the -v option is specified and the -n option is not specified,
127 lex statistics shall also be written to either the standard output
128 or standard error, in an implementation-defined format. These sta‐
129 tistics may also be generated if table sizes are specified with a
130 '%' operator in the Definitions section, as long as the -n option
131 is not specified.
132
OUTPUT FILES
134 A text file containing C source code shall be written to lex.yy.c, or
135 to the standard output if the -t option is present.
136
EXTENDED DESCRIPTION
138 Each input file shall contain lex source code, which is a table of reg‐
139 ular expressions with corresponding actions in the form of C program
140 fragments.
141
142 When lex.yy.c is compiled and linked with the lex library (using the
143 -l l operand with c99), the resulting program shall read character
144 input from the standard input and shall partition it into strings that
145 match the given expressions.
146
147 When an expression is matched, these actions shall occur:
148
149 * The input string that was matched shall be left in yytext as a null-
150 terminated string; yytext shall either be an external character
151 array or a pointer to a character string. As explained in Defini‐
152 tions in lex , the type can be explicitly selected using the %array
153 or %pointer declarations, but the default is implementation-defined.
154
155 * The external int yyleng shall be set to the length of the matching
156 string.
157
158 * The expression's corresponding program fragment, or action, shall be
159 executed.
160
161 During pattern matching, lex shall search the set of patterns for the
162 single longest possible match. Among rules that match the same number
163 of characters, the rule given first shall be chosen.
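
The two matching rules above (longest match first, then earliest rule
on a tie) can be sketched with a small scanner. This is an illustrative
example only; it assumes the generated lex.yy.c is linked with the lex
library so that the default main() and yywrap() are available.

```lex
%{
#include <stdio.h>
%}
%%
if          printf("keyword: %s\n", yytext);
[a-z]+      printf("identifier: %s\n", yytext);
[ \t\n]+    ;
```

For the input "if", both rules match two characters, so the rule given
first (the keyword rule) is chosen. For the input "iffy", the second
rule matches four characters and is chosen because its match is longer.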
164
165 The general format of lex source shall be:
166
167
168 Definitions
169 %%
170 Rules
171 %%
172 UserSubroutines
173
174 The first "%%" is required to mark the beginning of the rules (regular
175 expressions and actions); the second "%%" is required only if user sub‐
176 routines follow.
177
178 Any line in the Definitions section beginning with a <blank> shall be
179 assumed to be a C program fragment and shall be copied to the external
180 definition area of the lex.yy.c file. Similarly, anything in the Defi‐
181 nitions section included between delimiter lines containing only "%{"
182 and "%}" shall also be copied unchanged to the external definition area
183 of the lex.yy.c file.
184
185 Any such input (beginning with a <blank> or within "%{" and "%}" delim‐
186 iter lines) appearing at the beginning of the Rules section before any
187 rules are specified shall be written to lex.yy.c after the declarations
188 of variables for the yylex() function and before the first line of code
189 in yylex(). Thus, user variables local to yylex() can be declared here,
190 as well as application code to execute upon entry to yylex().
191
192 The action taken by lex when encountering any input beginning with a
193 <blank> or within "%{" and "%}" delimiter lines appearing in the Rules
194 section but coming after one or more rules is undefined. The presence
195 of such input may result in an erroneous definition of the yylex()
196 function.
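
As a sketch of where copied code ends up, the following source places
one "%{" ... "%}" block in the Definitions section and another at the
start of the Rules section. The variable name words is hypothetical,
and the example assumes linking with the lex library for yywrap().

```lex
%{
/* Definitions section: copied to the external definition area. */
#include <stdio.h>
%}
%%
%{
    /* Start of Rules section: copied into yylex() ahead of its first
       statement, so this local is re-initialized on every call. */
    int words = 0;
%}
[^ \t\n]+   words++;
[ \t]+      ;
\n          { printf("%d\n", words); return 1; }
%%
int main(void)
{
    /* Each call to yylex() scans one line and prints its word count. */
    while (yylex() != 0)
        ;
    return 0;
}
```

Because words is local to yylex(), the count starts from zero again
each time main() calls the scanner.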
197
198 Definitions in lex
199 Definitions appear before the first "%%" delimiter. Any line in this
200 section not contained between "%{" and "%}" lines and not beginning
201 with a <blank> shall be assumed to define a lex substitution string.
202 The format of these lines shall be:
203
204
205 name substitute
206
If a name does not meet the requirements for identifiers in the ISO C
standard, the result is undefined. The string substitute shall replace
the string {name} when it is used in a rule. The name string shall be
recognized in this context only when the braces are provided and when
it does not appear within a bracket expression or within double-quotes.
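
A minimal sketch of substitution strings follows. The names DIGIT and
HEX are illustrative, and the third rule shows that a brace-enclosed
name inside double-quotes is taken literally rather than substituted.

```lex
%{
#include <stdio.h>
%}
DIGIT   [0-9]
HEX     [0-9a-fA-F]
%%
0x{HEX}+        printf("hex: %s\n", yytext);
{DIGIT}+        printf("decimal: %s\n", yytext);
"{DIGIT}"       printf("the literal text {DIGIT}\n");
[ \t\n]+        ;
```

Given the input "0x1F", the first rule wins because it matches four
characters, whereas {DIGIT}+ could match only the leading "0".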
212
213 In the Definitions section, any line beginning with a '%' (percent
214 sign) character and followed by an alphanumeric word beginning with
215 either 's' or 'S' shall define a set of start conditions. Any line
216 beginning with a '%' followed by a word beginning with either 'x' or
217 'X' shall define a set of exclusive start conditions. When the gener‐
218 ated scanner is in a %s state, patterns with no state specified shall
219 be also active; in a %x state, such patterns shall not be active. The
220 rest of the line, after the first word, shall be considered to be one
221 or more <blank>-separated names of start conditions. Start condition
222 names shall be constructed in the same way as definition names. Start
223 conditions can be used to restrict the matching of regular expressions
224 to one or more states as described in Regular Expressions in lex .
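
The following sketch uses an exclusive start condition to drop C-style
comments from the input. The condition name COMMENT is hypothetical;
the BEGIN action used to switch states is described in Actions in lex.
Linking with the lex library is assumed.

```lex
%x COMMENT
%%
"/*"            BEGIN COMMENT;
<COMMENT>"*/"   BEGIN INITIAL;
<COMMENT>.      ;
<COMMENT>\n     ;
```

Because COMMENT is declared with %x, the unqualified "/*" rule is not
active inside a comment. Text outside comments matches no rule and is
copied to the output by the default action.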
225
226 Implementations shall accept either of the following two mutually-
227 exclusive declarations in the Definitions section:
228
229 %array Declare the type of yytext to be a null-terminated character
230 array.
231
232 %pointer
233 Declare the type of yytext to be a pointer to a null-terminated
234 character string.
235
236
237 The default type of yytext is implementation-defined. If an application
238 refers to yytext outside of the scanner source file (that is, via an
239 extern), the application shall include the appropriate %array or
240 %pointer declaration in the scanner source file.
241
242 Implementations shall accept declarations in the Definitions section
243 for setting certain internal table sizes. The declarations are shown in
244 the following table.
245
246 Table: Table Size Declarations in lex
247
Declaration  Description                         Minimum Value
%p n         Number of positions                 2500
%n n         Number of states                    500
%a n         Number of transitions               2000
%e n         Number of parse tree nodes          1000
%k n         Number of packed character classes  1000
%o n         Size of the output array            3000
255
256 In the table, n represents a positive decimal integer, preceded by one
257 or more <blank>s. The exact meaning of these table size numbers is
258 implementation-defined. The implementation shall document how these
259 numbers affect the lex utility and how they are related to any output
260 that may be generated by the implementation should limitations be
261 encountered during the execution of lex. It shall be possible to deter‐
262 mine from this output which of the table size values needs to be modi‐
263 fied to permit lex to successfully generate tables for the input lan‐
264 guage. The values in the column Minimum Value represent the lowest
265 values conforming implementations shall provide.
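
As an illustration of the syntax only, table size declarations might
be written as follows. The numbers are arbitrary values chosen for the
example; their effect is implementation-defined.

```lex
%p 5000
%n 1000
%a 4000
%%
[a-z]+  ;
```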
266
267 Rules in lex
268 The rules in lex source files are a table in which the left column con‐
269 tains regular expressions and the right column contains actions (C pro‐
270 gram fragments) to be executed when the expressions are recognized.
271
272
273 ERE action
274 ERE action...
275
276 The extended regular expression (ERE) portion of a row shall be sepa‐
277 rated from action by one or more <blank>s. A regular expression con‐
278 taining <blank>s shall be recognized under one of the following condi‐
279 tions:
280
281 * The entire expression appears within double-quotes.
282
283 * The <blank>s appear within double-quotes or square brackets.
284
285 * Each <blank> is preceded by a backslash character.
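
The three ways of writing a <blank> inside an expression can be
sketched as follows, one rule per form. The strings being matched are
arbitrary examples.

```lex
%{
#include <stdio.h>
%}
%%
"good morning"  printf("entire expression in double-quotes\n");
good[ ]night    printf("blank inside square brackets\n");
good\ day       printf("blank preceded by a backslash\n");
.|\n            ;
```

In each rule the first unprotected <blank> ends the expression and the
action begins after it.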
286
287 User Subroutines in lex
288 Anything in the user subroutines section shall be copied to lex.yy.c
289 following yylex().
290
291 Regular Expressions in lex
292 The lex utility shall support the set of extended regular expressions
293 (see the Base Definitions volume of IEEE Std 1003.1-2001, Section 9.4,
294 Extended Regular Expressions), with the following additions and excep‐
295 tions to the syntax:
296
"..." Any string enclosed in double-quotes shall represent the
characters within the double-quotes as themselves, except that
backslash escapes (which appear in the following table) shall be
recognized. Any backslash-escape sequence shall be terminated
by the closing quote. For example, "\01""1" represents a single
string: the octal value 1 followed by the character '1'.
303
304 <state>r, <state1,state2,...>r
305
306 The regular expression r shall be matched only when the program
307 is in one of the start conditions indicated by state, state1,
308 and so on; see Actions in lex . (As an exception to the typo‐
309 graphical conventions of the rest of this volume of
310 IEEE Std 1003.1-2001, in this case <state> does not represent a
311 metavariable, but the literal angle-bracket characters surround‐
312 ing a symbol.) The start condition shall be recognized as such
313 only at the beginning of a regular expression.
314
315 r/x The regular expression r shall be matched only if it is followed
316 by an occurrence of regular expression x ( x is the instance of
317 trailing context, further defined below). The token returned in
318 yytext shall only match r. If the trailing portion of r matches
319 the beginning of x, the result is unspecified. The r expression
320 cannot include further trailing context or the '$' (match-end-
321 of-line) operator; x cannot include the '^' (match-beginning-of-
322 line) operator, nor trailing context, nor the '$' operator. That
323 is, only one occurrence of trailing context is allowed in a lex
324 regular expression, and the '^' operator only can be used at the
325 beginning of such an expression.
326
{name} When name is one of the substitution symbols from the
Definitions section, the string, including the enclosing braces, shall
be replaced by the substitute value. The substitute value shall
be treated in the extended regular expression as if it were
enclosed in parentheses. No substitution shall occur if {name}
occurs within a bracket expression or within double-quotes.
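
The trailing-context form r/x described above can be sketched with a
rule that recognizes a number only when a percent sign follows it. The
example is illustrative and assumes linking with the lex library.

```lex
%{
#include <stdio.h>
%}
%%
[0-9]+/"%"  printf("percentage: %s\n", yytext);
[0-9]+      printf("plain number: %s\n", yytext);
.|\n        ;
```

For the input "50%", the first rule is selected and yytext holds only
"50"; the '%' is left in the input and is matched by a later rule. The
end of r (a digit) cannot match the beginning of x (a percent sign), so
this rule avoids the unspecified case mentioned above.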
333
334
335 Within an ERE, a backslash character shall be considered to begin an
336 escape sequence as specified in the table in the Base Definitions vol‐
337 ume of IEEE Std 1003.1-2001, Chapter 5, File Format Notation ( '\\' ,
338 '\a' , '\b' , '\f' , '\n' , '\r' , '\t' , '\v' ). In addition, the
339 escape sequences in the following table shall be recognized.
340
341 A literal <newline> cannot occur within an ERE; the escape sequence
342 '\n' can be used to represent a <newline>. A <newline> shall not be
343 matched by a period operator.
344
345 Table: Escape Sequences in lex
346
Escape
Sequence   Description                       Meaning
\digits    A backslash character followed    The character whose encoding
           by the longest sequence of        is represented by the one,
           one, two, or three octal-digit    two, or three-digit octal
           characters (01234567). If all     integer. If the size of a byte
           of the digits are 0 (that is,     on the system is greater than
           representation of the NUL         nine bits, the valid escape
           character), the behavior is       sequence used to represent a
           undefined.                        byte is implementation-
                                             defined. Multi-byte characters
                                             require multiple, concatenated
                                             escape sequences of this type,
                                             including the leading '\' for
                                             each byte.
\xdigits   A backslash character followed    The character whose encoding
           by the longest sequence of        is represented by the
           hexadecimal-digit characters      hexadecimal integer.
           (0123456789abcdefABCDEF). If all
           of the digits are 0 (that is,
           representation of the NUL
           character), the behavior is
           undefined.
\c         A backslash character followed    The character 'c', unchanged.
           by any character not described
           in this table or in the table
           in the Base Definitions volume
           of IEEE Std 1003.1-2001,
           Chapter 5, File Format Notation
           ('\\', '\a', '\b', '\f',
           '\n', '\r', '\t', '\v').
378
Note: If a '\x' sequence needs to be immediately followed by a
hexadecimal digit character, a sequence such as "\x1""1" can be
used, which represents a character containing the value 1,
followed by the character '1'.
383
384
385 The order of precedence given to extended regular expressions for lex
386 differs from that specified in the Base Definitions volume of
387 IEEE Std 1003.1-2001, Section 9.4, Extended Regular Expressions. The
388 order of precedence for lex shall be as shown in the following table,
389 from high to low.
390
391 Note: The escaped characters entry is not meant to imply that these
392 are operators, but they are included in the table to show their
393 relationships to the true operators. The start condition, trail‐
394 ing context, and anchoring notations have been omitted from the
395 table because of the placement restrictions described in this
396 section; they can only appear at the beginning or ending of an
397 ERE.
398
399
400
401 Table: ERE Precedence in lex
402
Extended Regular Expression         Precedence
collation-related bracket symbols   [= =] [: :] [. .]
escaped characters                  \<special character>
bracket expression                  [ ]
quoting                             "..."
grouping                            ( )
definition                          {name}
single-character RE duplication     * + ?
concatenation
interval expression                 {m,n}
alternation                         |
415
The ERE anchoring operators '^' and '$' do not appear in the table.
With lex regular expressions, these operators are restricted in their
use: the '^' operator can only be used at the beginning of an entire
regular expression, and the '$' operator only at the end. The operators
apply to the entire regular expression. Thus, for example, the pattern
"(^abc)|(def$)" is undefined; it can instead be written as two separate
rules, one with the regular expression "^abc" and one with "def$",
which share a common action via the special '|' action (see below). If
the pattern were written "^abc|def$", it would match either "abc" or
"def" on a line by itself.
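
The two-rule rewrite suggested above might look like the following
sketch, in which the '|' action makes the first rule share the action
of the second. Linking with the lex library is assumed.

```lex
%{
#include <stdio.h>
%}
%%
^abc    |
def$    printf("anchored match: %s\n", yytext);
.|\n    ;
```

Here "abc" is reported only at the start of a line and "def" only when
a <newline> immediately follows it.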
426
427 Unlike the general ERE rules, embedded anchoring is not allowed by most
428 historical lex implementations. An example of embedded anchoring would
429 be for patterns such as "(^| )foo( |$)" to match "foo" when it exists
430 as a complete word. This functionality can be obtained using existing
431 lex features:
432
433
434 ^foo/[ \n] |
435 " foo"/[ \n] /* Found foo as a separate word. */
436
Note also that '$' is a form of trailing context (it is equivalent to
"/\n") and as such cannot be used with regular expressions containing
another instance of the operator (see the preceding discussion of
trailing context).

The additional regular expressions trailing-context operator '/' can be
used as an ordinary character if presented within double-quotes, "/";
preceded by a backslash, "\/"; or within a bracket expression, "[/]".
The start-condition '<' and '>' operators shall be special only in a
start condition at the beginning of a regular expression; elsewhere in
the regular expression they shall be treated as ordinary characters.
448
449 Actions in lex
450 The action to be taken when an ERE is matched can be a C program frag‐
451 ment or the special actions described below; the program fragment can
452 contain one or more C statements, and can also include special actions.
453 The empty C statement ';' shall be a valid action; any string in the
454 lex.yy.c input that matches the pattern portion of such a rule is
455 effectively ignored or skipped. However, the absence of an action shall
456 not be valid, and the action lex takes in such a condition is unde‐
457 fined.
458
459 The specification for an action, including C statements and special
460 actions, can extend across several lines if enclosed in braces:
461
462
463 ERE <one or more blanks> { program statement
464 program statement }
465
466 The default action when a string in the input to a lex.yy.c program is
467 not matched by any expression shall be to copy the string to the out‐
468 put. Because the default behavior of a program generated by lex is to
469 read the input and copy it to the output, a minimal lex source program
470 that has just "%%" shall generate a C program that simply copies the
471 input to the output unchanged.
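
As a sketch of how the default action combines with explicit rules,
the following source defines a single rule that converts lowercase
letters to uppercase. Input that matches no rule is copied to the
output unchanged. The example assumes the scanner output is the
standard output and that the lex library supplies main() and yywrap().

```lex
%{
#include <ctype.h>
#include <stdio.h>
%}
%%
[a-z]   putchar(toupper((unsigned char)yytext[0]));
```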
472
473 Four special actions shall be available:
474
475
476 | ECHO; REJECT; BEGIN
477
478 | The action '|' means that the action for the next rule is the
479 action for this rule. Unlike the other three actions, '|' cannot
480 be enclosed in braces or be semicolon-terminated; the applica‐
481 tion shall ensure that it is specified alone, with no other
482 actions.
483
484 ECHO; Write the contents of the string yytext on the output.
485
REJECT;
Usually only a single expression is matched by a given string in
the input. REJECT means "continue to the next expression that
matches the current input", and shall cause whatever rule was
the second choice after the current rule to be executed for the
same input. Thus, multiple rules can be matched and executed for
one input string or overlapping input strings. For example,
given the regular expressions "xyz" and "xy" and the input "xyz",
usually only the regular expression "xyz" would match. The
next attempted match would start after z. If the last action in
the "xyz" rule is REJECT, both this rule and the "xy" rule would
be executed. The REJECT action may be implemented in such a
fashion that flow of control does not continue after it, as if
it were equivalent to a goto to another part of yylex(). The use
of REJECT may result in somewhat larger and slower scanners.
501
502 BEGIN The action:
503
504
505 BEGIN newstate;
506
507 switches the state (start condition) to newstate. If the string new‐
508 state has not been declared previously as a start condition in the Def‐
509 initions section, the results are unspecified. The initial state is
510 indicated by the digit '0' or the token INITIAL.
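
The "xyz" and "xy" case used to describe REJECT above can be written
out as the following sketch. The message text is illustrative, and
linking with the lex library is assumed.

```lex
%{
#include <stdio.h>
%}
%%
xyz     { printf("rule xyz saw \"%s\"\n", yytext); REJECT; }
xy      { printf("rule xy saw \"%s\"\n", yytext); }
.|\n    ;
```

For the input "xyz", the first action runs and then REJECT hands the
same input to the second-choice rule, so the "xy" action would be
expected to run as well. REJECT is placed last in its action because
control might not continue past it.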
511
512
513 The functions or macros described below are accessible to user code
514 included in the lex input. It is unspecified whether they appear in the
515 C code output of lex, or are accessible only through the -l l operand
516 to c99 (the lex library).
517
518 int yylex(void)
519
520 Performs lexical analysis on the input; this is the primary
521 function generated by the lex utility. The function shall return
522 zero when the end of input is reached; otherwise, it shall
523 return non-zero values (tokens) determined by the actions that
524 are selected.
525
526 int yymore(void)
527
528 When called, indicates that when the next input string is recog‐
529 nized, it is to be appended to the current value of yytext
530 rather than replacing it; the value in yyleng shall be adjusted
531 accordingly.
532
533 int yyless(int n)
534
535 Retains n initial characters in yytext, NUL-terminated, and
536 treats the remaining characters as if they had not been read;
537 the value in yyleng shall be adjusted accordingly.
538
539 int input(void)
540
541 Returns the next character from the input, or zero on end-of-
542 file. It shall obtain input from the stream pointer yyin,
543 although possibly via an intermediate buffer. Thus, once scan‐
544 ning has begun, the effect of altering the value of yyin is
545 undefined. The character read shall be removed from the input
546 stream of the scanner without any processing by the scanner.
547
548 int unput(int c)
549
550 Returns the character 'c' to the input; yytext and yyleng are
551 undefined until the next expression is matched. The result of
552 using unput() for more characters than have been input is
553 unspecified.
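
As one illustration of these functions, the sketch below uses yyless()
to split long runs of letters into pieces of at most three characters.
The limit of three is an arbitrary choice for the example, and linking
with the lex library is assumed.

```lex
%{
#include <stdio.h>
%}
%%
[a-z]+  {
            if (yyleng > 3)
                yyless(3);  /* keep 3 characters, rescan the rest */
            printf("%s\n", yytext);
        }
.|\n    ;
```

For the input "abcdefg", the action would be expected to print "abc",
"def", and "g" on separate lines, because the characters returned by
yyless() are scanned again as new input.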
554
555
556 The following functions shall appear only in the lex library accessible
557 through the -l l operand; they can therefore be redefined by a conform‐
558 ing application:
559
560 int yywrap(void)
561
562 Called by yylex() at end-of-file; the default yywrap() shall
563 always return 1. If the application requires yylex() to continue
564 processing with another source of input, then the application
565 can include a function yywrap(), which associates another file
566 with the external variable FILE * yyin and shall return a value
567 of zero.
568
569 int main(int argc, char *argv[])
570
571 Calls yylex() to perform lexical analysis, then exits. The user
572 code can contain main() to perform application-specific opera‐
573 tions, calling yylex() as applicable.
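
A minimal sketch of an application-supplied yywrap() and main()
follows. It copies each file named on the command line to the output
in turn, using the default copy action of an empty Rules section. The
variable files is a hypothetical helper, and files are left open for
brevity.

```lex
%{
#include <stdio.h>
static char **files;    /* hypothetical cursor over remaining names */
%}
%%
%%
/* Open the next readable file; return 0 to continue, 1 to stop. */
int yywrap(void)
{
    while (*files != NULL) {
        FILE *next = fopen(*files++, "r");
        if (next != NULL) {
            yyin = next;
            return 0;
        }
    }
    return 1;
}

int main(int argc, char *argv[])
{
    (void)argc;
    files = argv + 1;       /* argv is terminated by a null pointer */
    if (yywrap() == 0)      /* open the first file, if there is one */
        yylex();
    return 0;
}
```

Since both functions are defined by the application, this sketch does
not rely on the versions in the lex library.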
574
575
576 Except for input(), unput(), and main(), all external and static names
577 generated by lex shall begin with the prefix yy or YY.
578
580 The following exit values shall be returned:
581
582 0 Successful completion.
583
584 >0 An error occurred.
585
586
CONSEQUENCES OF ERRORS
588 Default.
589
590 The following sections are informative.
591
APPLICATION USAGE
593 Conforming applications are warned that in the Rules section, an ERE
594 without an action is not acceptable, but need not be detected as erro‐
595 neous by lex. This may result in compilation or runtime errors.
596
597 The purpose of input() is to take characters off the input stream and
598 discard them as far as the lexical analysis is concerned. A common use
599 is to discard the body of a comment once the beginning of a comment is
600 recognized.
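
The comment-discarding use of input() mentioned above can be sketched
as follows. The extra comparison against EOF is a defensive assumption
for implementations that return EOF rather than zero at end-of-file.

```lex
%{
#include <stdio.h>
%}
%%
"/*"    {
            int c, prev = 0;
            /* Consume characters up to and including the closing
               delimiter, or stop at end-of-file. */
            while ((c = input()) != 0 && c != EOF) {
                if (prev == '*' && c == '/')
                    break;
                prev = c;
            }
        }
```

Text outside comments matches no rule here and is copied to the output
by the default action.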
601
602 The lex utility is not fully internationalized in its treatment of reg‐
603 ular expressions in the lex source code or generated lexical analyzer.
604 It would seem desirable to have the lexical analyzer interpret the reg‐
605 ular expressions given in the lex source according to the environment
606 specified when the lexical analyzer is executed, but this is not possi‐
607 ble with the current lex technology. Furthermore, the very nature of
608 the lexical analyzers produced by lex must be closely tied to the lexi‐
609 cal requirements of the input language being described, which is fre‐
610 quently locale-specific anyway. (For example, writing an analyzer that
611 is used for French text is not automatically useful for processing
612 other languages.)
613
EXAMPLES
615 The following is an example of a lex program that implements a rudimen‐
616 tary scanner for a Pascal-like syntax:
617
618
%{
/* Need this for the calls to atoi() and atof() below. */
#include <stdlib.h>
/* Need this for printf(), fopen(), perror(), and stdin below. */
#include <stdio.h>
%}

DIGIT    [0-9]
ID       [a-z][a-z0-9]*

%%

{DIGIT}+    {
    printf("An integer: %s (%d)\n", yytext,
        atoi(yytext));
    }

{DIGIT}+"."{DIGIT}*        {
    printf("A float: %s (%g)\n", yytext,
        atof(yytext));
    }

if|then|begin|end|procedure|function        {
    printf("A keyword: %s\n", yytext);
    }

{ID}    printf("An identifier: %s\n", yytext);

"+"|"-"|"*"|"/"        printf("An operator: %s\n", yytext);

"{"[^}\n]*"}"    /* Eat up one-line comments. */

[ \t\n]+        /* Eat up white space. */

.  printf("Unrecognized character: %s\n", yytext);

%%

int main(int argc, char *argv[])
{
    ++argv, --argc;  /* Skip over program name. */
    if (argc > 0) {
        yyin = fopen(argv[0], "r");
        if (yyin == NULL) {  /* Report an unreadable input file. */
            perror(argv[0]);
            return 1;
        }
    } else {
        yyin = stdin;
    }

    yylex();
    return 0;
}
680
RATIONALE
682 Even though the -c option and references to the C language are retained
683 in this description, lex may be generalized to other languages, as was
684 done at one time for EFL, the Extended FORTRAN Language. Since the lex
685 input specification is essentially language-independent, versions of
686 this utility could be written to produce Ada, Modula-2, or Pascal code,
687 and there are known historical implementations that do so.
688
689 The current description of lex bypasses the issue of dealing with
690 internationalized EREs in the lex source code or generated lexical ana‐
691 lyzer. If it follows the model used by awk (the source code is assumed
692 to be presented in the POSIX locale, but input and output are in the
693 locale specified by the environment variables), then the tables in the
694 lexical analyzer produced by lex would interpret EREs specified in the
695 lex source in terms of the environment variables specified when lex was
696 executed. The desired effect would be to have the lexical analyzer
697 interpret the EREs given in the lex source according to the environment
698 specified when the lexical analyzer is executed, but this is not possi‐
699 ble with the current lex technology.
700
701 The description of octal and hexadecimal-digit escape sequences agrees
702 with the ISO C standard use of escape sequences. See the RATIONALE for
703 ed for a discussion of bytes larger than 9 bits being represented by
704 octal values. Hexadecimal values can represent larger bytes and multi-
705 byte characters directly, using as many digits as required.
706
707 There is no detailed output format specification. The observed behavior
708 of lex under four different historical implementations was that none of
709 these implementations consistently reported the line numbers for error
710 and warning messages. Furthermore, there was a desire that lex be
711 allowed to output additional diagnostic messages. Leaving message for‐
712 mats unspecified avoids these formatting questions and problems with
713 internationalization.
714
715 Although the %x specifier for exclusive start conditions is not histor‐
716 ical practice, it is believed to be a minor change to historical imple‐
717 mentations and greatly enhances the usability of lex programs since it
718 permits an application to obtain the expected functionality with fewer
719 statements.
720
721 The %array and %pointer declarations were added as a compromise between
722 historical systems. The System V-based lex copies the matched text to a
723 yytext array. The flex program, supported in BSD and GNU systems, uses
724 a pointer. In the latter case, significant performance improvements are
725 available for some scanners. Most historical programs should require no
726 change in porting from one system to another because the string being
727 referenced is null-terminated in both cases. (The method used by flex
728 in its case is to null-terminate the token in place by remembering the
729 character that used to come right after the token and replacing it
730 before continuing on to the next scan.) Multi-file programs with exter‐
731 nal references to yytext outside the scanner source file should con‐
732 tinue to operate on their historical systems, but would require one of
733 the new declarations to be considered strictly portable.
734
735 The description of EREs avoids unnecessary duplication of ERE details
736 because their meanings within a lex ERE are the same as that for the
737 ERE in this volume of IEEE Std 1003.1-2001.
738
739 The reason for the undefined condition associated with text beginning
740 with a <blank> or within "%{" and "%}" delimiter lines appearing in the
741 Rules section is historical practice. Both the BSD and System V lex
742 copy the indented (or enclosed) input in the Rules section (except at
743 the beginning) to unreachable areas of the yylex() function (the code
744 is written directly after a break statement). In some cases, the System
745 V lex generates an error message or a syntax error, depending on the
746 form of indented input.
747
748 The intention in breaking the list of functions into those that may
749 appear in lex.yy.c versus those that only appear in libl.a is that only
750 those functions in libl.a can be reliably redefined by a conforming
751 application.
752
753 The descriptions of standard output and standard error are somewhat
754 complicated because historical lex implementations chose to issue diag‐
755 nostic messages to standard output (unless -t was given).
756 IEEE Std 1003.1-2001 allows this behavior, but leaves an opening for
757 the more expected behavior of using standard error for diagnostics.
758 Also, the System V behavior of writing the statistics when any table
759 sizes are given is allowed, while BSD-derived systems can avoid it. The
760 programmer can always precisely obtain the desired results by using
761 either the -t or -n options.
762
763 The OPERANDS section does not mention the use of - as a synonym for
764 standard input; not all historical implementations support such usage
765 for any of the file operands.
766
767 A description of the translation table was deleted from early proposals
768 because of its relatively low usage in historical applications.
769
770 The change to the definition of the input() function that allows
771 buffering of input presents the opportunity for major performance gains
772 in some applications.
773
The following examples clarify the differences between lex regular
expressions and regular expressions appearing elsewhere in this volume
of IEEE Std 1003.1-2001. For regular expressions of the form "r/x",
the string matching r is always returned; confusion may arise when the
beginning of x matches the trailing portion of r. For example, given
the regular expression "a*b/cc" and the input "aaabcc", yytext would
contain the string "aaab" on this match. But given the regular
expression "x*/xy" and the input "xxxy", the token xxx, not xx, is
returned by some implementations because xxx matches "x*".

In the rule "ab*/bc", the "b*" at the end of r extends r's match into
the beginning of the trailing context, so the result is unspecified. If
this rule were "ab/bc", however, the rule matches the text "ab" when
it is followed by the text "bc". In this latter case, the matching of
r cannot extend into the beginning of x, so the result is specified.
789
FUTURE DIRECTIONS
791 None.
792
SEE ALSO
794 c99 , ed , yacc
795
COPYRIGHT
797 Portions of this text are reprinted and reproduced in electronic form
798 from IEEE Std 1003.1, 2003 Edition, Standard for Information Technology
799 -- Portable Operating System Interface (POSIX), The Open Group Base
800 Specifications Issue 6, Copyright (C) 2001-2003 by the Institute of
801 Electrical and Electronics Engineers, Inc and The Open Group. In the
802 event of any discrepancy between this version and the original IEEE and
803 The Open Group Standard, the original IEEE and The Open Group Standard
804 is the referee document. The original Standard can be obtained online
805 at http://www.opengroup.org/unix/online.html .
806
807
808
809IEEE/The Open Group 2003 LEX(P)