1YACC(1P) POSIX Programmer's Manual YACC(1P)
2
3
4
6 This manual page is part of the POSIX Programmer's Manual. The Linux
7 implementation of this interface may differ (consult the corresponding
8 Linux manual page for details of Linux behavior), or the interface may
9 not be implemented on Linux.
10
12 yacc - yet another compiler compiler (DEVELOPMENT)
13
15 yacc [-dltv][-b file_prefix][-p sym_prefix] grammar
16
18 The yacc utility shall read a description of a context-free grammar in
19 grammar and write C source code, conforming to the ISO C standard, to a
20 code file, and optionally header information into a header file, in the
21 current directory. The C code shall define a function and related rou‐
22 tines and macros for an automaton that executes a parsing algorithm
23 meeting the requirements in Algorithms .
24
25 The form and meaning of the grammar are described in the EXTENDED
26 DESCRIPTION section.
27
28 The C source code and header file shall be produced in a form suitable
29 as input for the C compiler (see c99 ).
30
32 The yacc utility shall conform to the Base Definitions volume of
33 IEEE Std 1003.1-2001, Section 12.2, Utility Syntax Guidelines.
34
35 The following options shall be supported:
36
37 -b file_prefix
38 Use file_prefix instead of y as the prefix for all output file‐
39 names. The code file y.tab.c, the header file y.tab.h (created
40 when -d is specified), and the description file y.output (cre‐
41 ated when -v is specified), shall be changed to file_prefix
42 .tab.c, file_prefix .tab.h, and file_prefix .output, respec‐
43 tively.
44
45 -d Write the header file; by default only the code file is written.
46 The #define statements associate the token codes assigned by
47 yacc with the user-declared token names. This allows source
48 files other than y.tab.c to access the token codes.
49
50 -l Produce a code file that does not contain any #line constructs.
51 If this option is not present, it is unspecified whether the
52 code file or header file contains #line directives. This should
53 only be used after the grammar and the associated actions are
54 fully debugged.
55
56 -p sym_prefix
57
58 Use sym_prefix instead of yy as the prefix for all external
59 names produced by yacc. The names affected shall include the
60 functions yyparse(), yylex(), and yyerror(), and the variables
61 yylval, yychar, and yydebug. (In the remainder of this section,
62 the six symbols cited are referenced using their default names
63 only as a notational convenience.) Local names may also be
64 affected by the -p option; however, the -p option shall not
65 affect #define symbols generated by yacc.
66
67 -t Modify conditional compilation directives to permit compilation
68 of debugging code in the code file. Runtime debugging statements
69 shall always be contained in the code file, but by default con‐
70 ditional compilation directives prevent their compilation.
71
72 -v Write a file containing a description of the parser and a report
73 of conflicts generated by ambiguities in the grammar.
74
75
77 The following operand is required:
78
79 grammar
80 A pathname of a file containing instructions, hereafter called
81 grammar, for which a parser is to be created. The format for the
82 grammar is described in the EXTENDED DESCRIPTION section.
83
84
86 Not used.
87
89 The file grammar shall be a text file formatted as specified in the
90 EXTENDED DESCRIPTION section.
91
93 The following environment variables shall affect the execution of yacc:
94
95 LANG Provide a default value for the internationalization variables
96 that are unset or null. (See the Base Definitions volume of
97 IEEE Std 1003.1-2001, Section 8.2, Internationalization Vari‐
98 ables for the precedence of internationalization variables used
99 to determine the values of locale categories.)
100
101 LC_ALL If set to a non-empty string value, override the values of all
102 the other internationalization variables.
103
104 LC_CTYPE
105 Determine the locale for the interpretation of sequences of
106 bytes of text data as characters (for example, single-byte as
107 opposed to multi-byte characters in arguments and input files).
108
109 LC_MESSAGES
110 Determine the locale that should be used to affect the format
111 and contents of diagnostic messages written to standard error.
112
113 NLSPATH
114 Determine the location of message catalogs for the processing of
115 LC_MESSAGES .
116
117
118 The LANG and LC_* variables affect the execution of the yacc utility as
119 stated. The main() function defined in Yacc Library shall call:
120
121
122 setlocale(LC_ALL, "")
123
124 and thus the program generated by yacc shall also be affected by the
125 contents of these variables at runtime.
126
128 Default.
129
131 Not used.
132
134 If shift/reduce or reduce/reduce conflicts are detected in grammar,
135 yacc shall write a report of those conflicts to the standard error in
136 an unspecified format.
137
138 Standard error shall also be used for diagnostic messages.
139
141 The code file, the header file, and the description file shall be text
142 files. All are described in the following sections.
143
144 Code File
145 This file shall contain the C source code for the yyparse() function.
146 It shall contain code for the various semantic actions with macro sub‐
147 stitution performed on them as described in the EXTENDED DESCRIPTION
148 section. It also shall contain a copy of the #define statements in the
149 header file. If a %union declaration is used, the declaration for
150 YYSTYPE shall also be included in this file.
151
152 Header File
153 The header file shall contain #define statements that associate the
154 token numbers with the token names. This allows source files other than
155 the code file to access the token codes. If a %union declaration is
156 used, the declaration for YYSTYPE and an extern YYSTYPE yylval declara‐
157 tion shall also be included in this file.
158
159 Description File
160 The description file shall be a text file containing a description of
161 the state machine corresponding to the parser, using an unspecified
162 format. Limits for internal tables (see Limits ) shall also be
163 reported, in an implementation-defined manner. (Some implementations
164 may use dynamic allocation techniques and have no specific limit values
165 to report.)
166
168 The yacc command accepts a language that is used to define a grammar
169 for a target language to be parsed by the tables and code generated by
170 yacc. The language accepted by yacc as a grammar for the target lan‐
171 guage is described below using the yacc input language itself.
172
173 The input grammar includes rules describing the input structure of the
174 target language and code to be invoked when these rules are recognized
175 to provide the associated semantic action. The code to be executed
176 shall appear as bodies of text that are intended to be C-language code.
177 The C-language inclusions are presumed to form a correct function when
178 processed by yacc into its output files. The code included in this way
179 shall be executed during the recognition of the target language.
180
181 Given a grammar, the yacc utility generates the files described in the
182 OUTPUT FILES section. The code file can be compiled and linked using
183 c99. If the declaration and programs sections of the grammar file did
184 not include definitions of main(), yylex(), and yyerror(), the compiled
185 output requires linking with externally supplied versions of those
186 functions. Default versions of main() and yyerror() are supplied in the
187 yacc library and can be linked in by using the -l y operand to c99.
188 The yacc library interfaces need not support interfaces with other than
189 the default yy symbol prefix. The application provides the lexical ana‐
190 lyzer function, yylex(); the lex utility is specifically designed to
191 generate such a routine.
192
193 Input Language
194 The application shall ensure that every specification file consists of
195 three sections in order: declarations, grammar rules, and programs,
196 separated by double percent signs ( "%%" ). The declarations and pro‐
197 grams sections can be empty. If the latter is empty, the preceding "%%"
198 mark separating it from the rules section can be omitted.
199
200 The input is free form text following the structure of the grammar
201 defined below.
202
203 Lexical Structure of the Grammar
204 The <blank>s, <newline>s, and <form-feed>s shall be ignored, except
205 that the application shall ensure that they do not appear in names or
206 multi-character reserved symbols. Comments shall be enclosed in
207 "/* ... */", and can appear wherever a name is valid.
208
209 Names are of arbitrary length, made up of letters, periods ( '.' ),
210 underscores ( '_' ), and non-initial digits. Uppercase and lowercase
211 letters are distinct. Conforming applications shall not use names
212 beginning in yy or YY since the yacc parser uses such names. Many of
213 the names appear in the final output of yacc, and thus they should be
214 chosen to conform with any additional rules created by the C compiler
215 to be used. In particular they appear in #define statements.
216
217 A literal shall consist of a single character enclosed in single-quotes
218 ( '" ). All of the escape sequences supported for character constants
219 by the ISO C standard shall be supported by yacc.
220
221 The relationship with the lexical analyzer is discussed in detail
222 below.
223
224 The application shall ensure that the NUL character is not used in
225 grammar rules or literals.
226
227 Declarations Section
228 The declarations section is used to define the symbols used to define
229 the target language and their relationship with each other. In particu‐
230 lar, much of the additional information required to resolve ambiguities
231 in the context-free grammar for the target language is provided here.
232
233 Usually yacc assigns the relationship between the symbolic names it
234 generates and their underlying numeric value. The declarations section
235 makes it possible to control the assignment of these values.
236
237 It is also possible to keep semantic information associated with the
238 tokens currently on the parse stack in a user-defined C-language union,
239 if the members of the union are associated with the various names in
240 the grammar. The declarations section provides for this as well.
241
242 The first group of declarators below all take a list of names as argu‐
243 ments. That list can optionally be preceded by the name of a C union
244 member (called a tag below) appearing within '<' and '>' . (As an
245 exception to the typographical conventions of the rest of this volume
246 of IEEE Std 1003.1-2001, in this case <tag> does not represent a
247 metavariable, but the literal angle bracket characters surrounding a
248 symbol.) The use of tag specifies that the tokens named on this line
249 shall be of the same C type as the union member referenced by tag. This
250 is discussed in more detail below.
251
252 For lists used to define tokens, the first appearance of a given token
253 can be followed by a positive integer (as a string of decimal digits).
254 If this is done, the underlying value assigned to it for lexical pur‐
255 poses shall be taken to be that number.
256
257 The following declares name to be a token:
258
259
260 %token [<tag>] name [number][name [number]]...
261
262 If tag is present, the C type for all tokens on this line shall be
263 declared to be the type referenced by tag. If a positive integer, num‐
264 ber, follows a name, that value shall be assigned to the token.
265
266 The following declares name to be a token, and assigns precedence to
267 it:
268
269
270 %left [<tag>] name [number][name [number]]...
271 %right [<tag>] name [number][name [number]]...
272
273 One or more lines, each beginning with one of these symbols, can appear
274 in this section. All tokens on the same line have the same precedence
275 level and associativity; the lines are in order of increasing prece‐
276 dence or binding strength. %left denotes that the operators on that
277 line are left associative, and %right similarly denotes right associa‐
278 tive operators. If tag is present, it shall declare a C type for names
279 as described for %token.
280
281 The following declares name to be a token, and indicates that this can‐
282 not be used associatively:
283
284
285 %nonassoc [<tag>] name [number][name [number]]...
286
287 If the parser encounters associative use of this token it reports an
288 error. If tag is present, it shall declare a C type for names as
289 described for %token.
290
291 The following declares that union member names are non-terminals, and
292 thus it is required to have a tag field at its beginning:
293
294
295 %type <tag> name...
296
297 Because it deals with non-terminals only, assigning a token number or
298 using a literal is also prohibited. If this construct is present, yacc
299 shall perform type checking; if this construct is not present, the
300 parse stack shall hold only the int type.
301
302 Every name used in grammar not defined by a %token, %left, %right, or
303 %nonassoc declaration is assumed to represent a non-terminal symbol.
304 The yacc utility shall report an error for any non-terminal symbol that
305 does not appear on the left side of at least one grammar rule.
306
307 Once the type, precedence, or token number of a name is specified, it
308 shall not be changed. If the first declaration of a token does not
309 assign a token number, yacc shall assign a token number. Once this
310 assignment is made, the token number shall not be changed by explicit
311 assignment.
312
313 The following declarators do not follow the previous pattern.
314
315 The following declares the non-terminal name to be the start symbol,
316 which represents the largest, most general structure described by the
317 grammar rules:
318
319
320 %start name
321
322 By default, it is the left-hand side of the first grammar rule; this
323 default can be overridden with this declaration.
324
325 The following declares the yacc value stack to be a union of the vari‐
326 ous types of values desired:
327
328
329 %union { body of union (in C) }
330
331 By default, the values returned by actions (see below) and the lexical
332 analyzer shall be of type int. The yacc utility keeps track of types,
333 and it shall insert corresponding union member names in order to per‐
334 form strict type checking of the resulting parser.
335
336 Alternatively, given that at least one <tag> construct is used, the
337 union can be declared in a header file (which shall be included in the
338 declarations section by using a #include construct within %{ and %}),
339 and a typedef used to define the symbol YYSTYPE to represent this
340 union. The effect of %union is to provide the declaration of YYSTYPE
341 directly from the yacc input.
342
343 C-language declarations and definitions can appear in the declarations
344 section, enclosed by the following marks:
345
346
347 %{ ... %}
348
349 These statements shall be copied into the code file, and have global
350 scope within it so that they can be used in the rules and program sec‐
351 tions.
352
353 The application shall ensure that the declarations section is termi‐
354 nated by the token %%.
355
356 Grammar Rules in yacc
357 The rules section defines the context-free grammar to be accepted by
358 the function yacc generates, and associates with those rules C-language
359 actions and additional precedence information. The grammar is
360 described below, and a formal definition follows.
361
362 The rules section is comprised of one or more grammar rules. A grammar
363 rule has the form:
364
365
366 A : BODY ;
367
368 The symbol A represents a non-terminal name, and BODY represents a
369 sequence of zero or more names, literals, and semantic actions that can
370 then be followed by optional precedence rules. Only the names and lit‐
371 erals participate in the formation of the grammar; the semantic actions
372 and precedence rules are used in other ways. The colon and the semi‐
373 colon are yacc punctuation. If there are several successive grammar
374 rules with the same left-hand side, the vertical bar '|' can be used to
375 avoid rewriting the left-hand side; in this case the semicolon appears
376 only after the last rule. The BODY part can be empty (or empty of names
377 and literals) to indicate that the non-terminal symbol matches the
378 empty string.
379
380 The yacc utility assigns a unique number to each rule. Rules using the
381 vertical bar notation are distinct rules. The number assigned to the
382 rule appears in the description file.
383
384 The elements comprising a BODY are:
385
386 name, literal
387 These form the rules of the grammar: name is either a token or a
388 non-terminal; literal stands for itself (less the lexically
389 required quotation marks).
390
391 semantic action
392
393 With each grammar rule, the user can associate actions to be
394 performed each time the rule is recognized in the input process.
395 (Note that the word "action" can also refer to the actions of
396 the parser-shift, reduce, and so on.)
397
398 These actions can return values and can obtain the values returned by
399 previous actions. These values are kept in objects of type YYSTYPE (see
400 %union). The result value of the action shall be kept on the parse
401 stack with the left-hand side of the rule, to be accessed by other
402 reductions as part of their right-hand side. By using the <tag> infor‐
403 mation provided in the declarations section, the code generated by yacc
404 can be strictly type checked and contain arbitrary information. In
405 addition, the lexical analyzer can provide the same kinds of values for
406 tokens, if desired.
407
408 An action is an arbitrary C statement and as such can do input or out‐
409 put, call subprograms, and alter external variables. An action is one
410 or more C statements enclosed in curly braces '{' and '}' .
411
412 Certain pseudo-variables can be used in the action. These are macros
413 for access to data structures known internally to yacc.
414
415 $$
416 The value of the action can be set by assigning it to $$. If
417 type checking is enabled and the type of the value to be
418 assigned cannot be determined, a diagnostic message may be gen‐
419 erated.
420
421 $number
422 This refers to the value returned by the component specified by
423 the token number in the right side of a rule, reading from left
424 to right; number can be zero or negative. If number is zero or
425 negative, it refers to the data associated with the name on the
426 parser's stack preceding the leftmost symbol of the current
427 rule. (That is, "$0" refers to the name immediately preceding
428 the leftmost name in the current rule to be found on the
429 parser's stack and "$-1" refers to the symbol to its left.) If
430 number refers to an element past the current point in the rule,
431 or beyond the bottom of the stack, the result is undefined. If
432 type checking is enabled and the type of the value to be
433 assigned cannot be determined, a diagnostic message may be gen‐
434 erated.
435
436 $<tag>number
437
438 These correspond exactly to the corresponding symbols without
439 the tag inclusion, but allow for strict type checking (and pre‐
440 clude unwanted type conversions). The effect is that the macro
441 is expanded to use tag to select an element from the YYSTYPE
442 union (using dataname.tag). This is particularly useful if num‐
443 ber is not positive.
444
445 $<tag>$
446 This imposes on the reference the type of the union member ref‐
447 erenced by tag. This construction is applicable when a reference
448 to a left context value occurs in the grammar, and provides yacc
449 with a means for selecting a type.
450
451
452 Actions can occur anywhere in a rule (not just at the end); an action
453 can access values returned by actions to its left, and in turn the
454 value it returns can be accessed by actions to its right. An action
455 appearing in the middle of a rule shall be equivalent to replacing the
456 action with a new non-terminal symbol and adding an empty rule with
457 that non-terminal symbol on the left-hand side. The semantic action
458 associated with the new rule shall be equivalent to the original
459 action. The use of actions within rules might introduce conflicts that
460 would not otherwise exist.
461
462 By default, the value of a rule shall be the value of the first element
463 in it. If the first element does not have a type (particularly in the
464 case of a literal) and type checking is turned on by %type, an error
465 message shall result.
466
467 precedence
468 The keyword %prec can be used to change the precedence level
469 associated with a particular grammar rule. Examples of this are
470 in cases where a unary and binary operator have the same sym‐
471 bolic representation, but need to be given different prece‐
472 dences, or where the handling of an ambiguous if-else construc‐
473 tion is necessary. The reserved symbol %prec can appear immedi‐
474 ately after the body of the grammar rule and can be followed by
475 a token name or a literal. It shall cause the precedence of the
476 grammar rule to become that of the following token name or lit‐
477 eral. The action for the rule as a whole can follow %prec.
478
479
480 If a program section follows, the application shall ensure that the
481 grammar rules are terminated by %%.
482
483 Programs Section
484 The programs section can include the definition of the lexical analyzer
485 yylex(), and any other functions; for example, those used in the
486 actions specified in the grammar rules. It is unspecified whether the
487 programs section precedes or follows the semantic actions in the output
488 file; therefore, if the application contains any macro definitions and
489 declarations intended to apply to the code in the semantic actions, it
490 shall place them within "%{ ... %}" in the declarations section.
491
492 Input Grammar
493 The following input to yacc yields a parser for the input to yacc. This
494 formal syntax takes precedence over the preceding text syntax descrip‐
495 tion.
496
497 The lexical structure is defined less precisely; Lexical Structure of
498 the Grammar defines most terms. The correspondence between the previous
499 terms and the tokens below is as follows.
500
501 IDENTIFIER
502 This corresponds to the concept of name, given previously. It
503 also includes literals as defined previously.
504
505 C_IDENTIFIER
506 This is a name, and additionally it is known to be followed by a
507 colon. A literal cannot yield this token.
508
509 NUMBER A string of digits (a non-negative decimal integer).
510
511 TYPE, LEFT, MARK, LCURL, RCURL
512
513 These correspond directly to %type, %left, %%, %{, and %}.
514
515 { ... }
516 This indicates C-language source code, with the possible inclu‐
517 sion of '$' macros as discussed previously.
518
519
520
521 /* Grammar for the input to yacc. */
522 /* Basic entries. */
523 /* The following are recognized by the lexical analyzer. */
524
525
526 %token IDENTIFIER /* Includes identifiers and literals */
527 %token C_IDENTIFIER /* identifier (but not literal)
528 followed by a :. */
529 %token NUMBER /* [0-9][0-9]* */
530
531
532 /* Reserved words : %type=>TYPE %left=>LEFT, and so on */
533
534
535 %token LEFT RIGHT NONASSOC TOKEN PREC TYPE START UNION
536
537
538 %token MARK /* The %% mark. */
539 %token LCURL /* The %{ mark. */
540 %token RCURL /* The %} mark. */
541
542
543 /* 8-bit character literals stand for themselves; */
544 /* tokens have to be defined for multi-byte characters. */
545
546
547 %start spec
548
549
550 %%
551
552
553 spec : defs MARK rules tail
554 ;
555 tail : MARK
556 {
557 /* In this action, set up the rest of the file. */
558 }
559 | /* Empty; the second MARK is optional. */
560 ;
561 defs : /* Empty. */
562 | defs def
563 ;
564 def : START IDENTIFIER
565 | UNION
566 {
567 /* Copy union definition to output. */
568 }
569 | LCURL
570 {
571 /* Copy C code to output file. */
572 }
573 RCURL
574 | rword tag nlist
575 ;
576 rword : TOKEN
577 | LEFT
578 | RIGHT
579 | NONASSOC
580 | TYPE
581 ;
582 tag : /* Empty: union tag ID optional. */
583 | '<' IDENTIFIER '>'
584 ;
585 nlist : nmno
586 | nlist nmno
587 ;
588 nmno : IDENTIFIER /* Note: literal invalid with % type. */
589 | IDENTIFIER NUMBER /* Note: invalid with % type. */
590 ;
591
592
593 /* Rule section */
594
595
596 rules : C_IDENTIFIER rbody prec
597 | rules rule
598 ;
599 rule : C_IDENTIFIER rbody prec
600 | '|' rbody prec
601 ;
602 rbody : /* empty */
603 | rbody IDENTIFIER
604 | rbody act
605 ;
606 act : '{'
607 {
608 /* Copy action, translate $$, and so on. */
609 }
610 '}'
611 ;
612 prec : /* Empty */
613 | PREC IDENTIFIER
614 | PREC IDENTIFIER act
615 | prec ';'
616 ;
617
618 Conflicts
619 The parser produced for an input grammar may contain states in which
620 conflicts occur. The conflicts occur because the grammar is not
621 LALR(1). An ambiguous grammar always contains at least one LALR(1) con‐
622 flict. The yacc utility shall resolve all conflicts, using either
623 default rules or user-specified precedence rules.
624
625 Conflicts are either shift/reduce conflicts or reduce/reduce conflicts.
626 A shift/reduce conflict is where, for a given state and lookahead sym‐
627 bol, both a shift action and a reduce action are possible. A
628 reduce/reduce conflict is where, for a given state and lookahead sym‐
629 bol, reductions by two different rules are possible.
630
631 The rules below describe how to specify what actions to take when a
632 conflict occurs. Not all shift/reduce conflicts can be successfully
633 resolved this way because the conflict may be due to something other
634 than ambiguity, so incautious use of these facilities can cause the
635 language accepted by the parser to be much different from that which
636 was intended. The description file shall contain sufficient information
637 to understand the cause of the conflict. Where ambiguity is the reason
638 either the default or explicit rules should be adequate to produce a
639 working parser.
640
641 The declared precedences and associativities (see Declarations Section
642 ) are used to resolve parsing conflicts as follows:
643
644 1. A precedence and associativity is associated with each grammar
645 rule; it is the precedence and associativity of the last token or
646 literal in the body of the rule. If the %prec keyword is used, it
647 overrides this default. Some grammar rules might not have both
648 precedence and associativity.
649
650 2. If there is a shift/reduce conflict, and both the grammar rule and
651 the input symbol have precedence and associativity associated with
652 them, then the conflict is resolved in favor of the action (shift
653 or reduce) associated with the higher precedence. If the prece‐
654 dences are the same, then the associativity is used; left associa‐
655 tive implies reduce, right associative implies shift, and non-asso‐
656 ciative implies an error in the string being parsed.
657
658 3. When there is a shift/reduce conflict that cannot be resolved by
659 rule 2, the shift is done. Conflicts resolved this way are counted
660 in the diagnostic output described in Error Handling .
661
662 4. When there is a reduce/reduce conflict, a reduction is done by the
663 grammar rule that occurs earlier in the input sequence. Conflicts
664 resolved this way are counted in the diagnostic output described in
665 Error Handling .
666
667 Conflicts resolved by precedence or associativity shall not be counted
668 in the shift/reduce and reduce/reduce conflicts reported by yacc on
669 either standard error or in the description file.
670
671 Error Handling
672 The token error shall be reserved for error handling. The name error
673 can be used in grammar rules. It indicates places where the parser can
674 recover from a syntax error. The default value of error shall be 256.
675 Its value can be changed using a %token declaration. The lexical ana‐
676 lyzer should not return the value of error.
677
678 The parser shall detect a syntax error when it is in a state where the
679 action associated with the lookahead symbol is error. A semantic action
680 can cause the parser to initiate error handling by executing the macro
681 YYERROR. When YYERROR is executed, the semantic action passes control
682 back to the parser. YYERROR cannot be used outside of semantic actions.
683
684 When the parser detects a syntax error, it normally calls yyerror()
685 with the character string "syntax error" as its argument. The call
686 shall not be made if the parser is still recovering from a previous
687 error when the error is detected. The parser is considered to be recov‐
688 ering from a previous error until the parser has shifted over at least
689 three normal input symbols since the last error was detected or a
690 semantic action has executed the macro yyerrok. The parser shall not
691 call yyerror() when YYERROR is executed.
692
693 The macro function YYRECOVERING shall return 1 if a syntax error has
694 been detected and the parser has not yet fully recovered from it. Oth‐
695 erwise, zero shall be returned.
696
697 When a syntax error is detected by the parser, the parser shall check
698 if a previous syntax error has been detected. If a previous error was
699 detected, and if no normal input symbols have been shifted since the
700 preceding error was detected, the parser checks if the lookahead symbol
701 is an endmarker (see Interface to the Lexical Analyzer ). If it is, the
702 parser shall return with a non-zero value. Otherwise, the lookahead
703 symbol shall be discarded and normal parsing shall resume.
704
705 When YYERROR is executed or when the parser detects a syntax error and
706 no previous error has been detected, or at least one normal input sym‐
707 bol has been shifted since the previous error was detected, the parser
708 shall pop back one state at a time until the parse stack is empty or
709 the current state allows a shift over error. If the parser empties the
710 parse stack, it shall return with a non-zero value. Otherwise, it shall
711 shift over error and then resume normal parsing. If the parser reads a
712 lookahead symbol before the error was detected, that symbol shall still
713 be the lookahead symbol when parsing is resumed.
714
715 The macro yyerrok in a semantic action shall cause the parser to act as
716 if it has fully recovered from any previous errors. The macro yyclearin
717 shall cause the parser to discard the current lookahead token. If the
718 current lookahead token has not yet been read, yyclearin shall have no
719 effect.
720
721 The macro YYACCEPT shall cause the parser to return with the value
722 zero. The macro YYABORT shall cause the parser to return with a non-
723 zero value.
724
725 Interface to the Lexical Analyzer
726 The yylex() function is an integer-valued function that returns a token
727 number representing the kind of token read. If there is a value associ‐
728 ated with the token returned by yylex() (see the discussion of tag
729 above), it shall be assigned to the external variable yylval.
730
731 If the parser and yylex() do not agree on these token numbers, reliable
732 communication between them cannot occur. For (single-byte character)
733 literals, the token is simply the numeric value of the character in the
734 current character set. The numbers for other tokens can either be cho‐
735 sen by yacc, or chosen by the user. In either case, the #define con‐
736 struct of C is used to allow yylex() to return these numbers symboli‐
737 cally. The #define statements are put into the code file, and the
738 header file if that file is requested. The set of characters permitted
739 by yacc in an identifier is larger than that permitted by C. Token
740 names found to contain such characters shall not be included in the
741 #define declarations.
742
743 If the token numbers are chosen by yacc, the tokens other than literals
744 shall be assigned numbers greater than 256, although no order is
745 implied. A token can be explicitly assigned a number by following its
746 first appearance in the declarations section with a number. Names and
747 literals not defined this way retain their default definition. All
748 token numbers assigned by yacc shall be unique and distinct from the
749 token numbers used for literals and user-assigned tokens. If duplicate
750 token numbers cause conflicts in parser generation, yacc shall report
751 an error; otherwise, it is unspecified whether the token assignment is
752 accepted or an error is reported.
753
754 The end of the input is marked by a special token called the endmarker,
755 which has a token number that is zero or negative. (These values are
756 invalid for any other token.) All lexical analyzers shall return zero
757 or negative as a token number upon reaching the end of their input. If
758 the tokens up to, but excluding, the endmarker form a structure that
759 matches the start symbol, the parser shall accept the input. If the
760 endmarker is seen in any other context, it shall be considered an
761 error.
762
763 Completing the Program
764 In addition to yyparse() and yylex(), the functions yyerror() and
765 main() are required to make a complete program. The application can
766 supply main() and yyerror(), or those routines can be obtained from the
767 yacc library.
768
769 Yacc Library
770 The following functions shall appear only in the yacc library accessi‐
771 ble through the -l y operand to c99; they can therefore be redefined by
772 a conforming application:
773
774 int main(void)
775
776 This function shall call yyparse() and exit with an unspecified
777 value. Other actions within this function are unspecified.
778
779 int yyerror(const char *s)
780
781 This function shall write the NUL-terminated argument to stan‐
782 dard error, followed by a <newline>.
783
784
785 The order of the -l y and -l l operands given to c99 is significant;
786 the application shall either provide its own main() function or ensure
787 that -l y precedes -l l.
788
789 Debugging the Parser
790 The parser generated by yacc shall have diagnostic facilities in it
791 that can be optionally enabled at either compile time or at runtime (if
792 enabled at compile time). The compilation of the runtime debugging code
793 is under the control of YYDEBUG, a preprocessor symbol. If YYDEBUG has
794 a non-zero value, the debugging code shall be included. If its value is
795 zero, the code shall not be included.
796
797 In parsers where the debugging code has been included, the external int
798 yydebug can be used to turn debugging on (with a non-zero value) and
799 off (zero value) at runtime. The initial value of yydebug shall be
800 zero.
801
802 When -t is specified, the code file shall be built such that, if YYDE‐
803 BUG is not already defined at compilation time (using the c99 -D YYDE‐
804 BUG option, for example), YYDEBUG shall be set explicitly to 1. When -t
805 is not specified, the code file shall be built such that, if YYDEBUG is
806 not already defined, it shall be set explicitly to zero.
807
808 The format of the debugging output is unspecified but includes at least
809 enough information to determine the shift and reduce actions, and the
810 input symbols. It also provides information about error recovery.
811
812 Algorithms
813 The parser constructed by yacc implements an LALR(1) parsing algorithm
814 as documented in the literature. It is unspecified whether the parser
815 is table-driven or direct-coded.
816
817 A parser generated by yacc shall never request an input symbol from
818 yylex() while in a state where the only actions other than the error
819 action are reductions by a single rule.
820
821 The literature of parsing theory defines these concepts.
822
823 Limits
824 The yacc utility may have several internal tables. The minimum maximums
825 for these tables are shown in the following table. The exact meaning of
826 these values is implementation-defined. The implementation shall
827 define the relationship between these values and between them and any
828 error messages that the implementation may generate should it run out
829 of space for any internal structure. An implementation may combine
830 groups of these resources into a single pool as long as the total
831 available to the user does not fall below the sum of the sizes speci‐
832 fied by this section.
833
834 Table: Internal Limits in yacc
835
836 Minimum
837 Limit Maximum Description
838 {NTERMS} 126 Number of tokens.
839 {NNONTERM} 200 Number of non-terminals.
840 {NPROD} 300 Number of rules.
841 {NSTATES} 600 Number of states.
842 {MEMSIZE} 5200 Length of rules. The total length, in
843 names (tokens and non-terminals), of all
844 the rules of the grammar. The left-hand
845 side is counted for each rule, even if
846 it is not explicitly repeated, as speci‐
847 fied in Grammar Rules in yacc .
848 {ACTSIZE} 4000 Number of actions. "Actions" here (and
849 in the description file) refer to parser
850 actions (shift, reduce, and so on) not
851 to semantic actions defined in Grammar
852 Rules in yacc .
853
855 The following exit values shall be returned:
856
857 0 Successful completion.
858
859 >0 An error occurred.
860
861
863 If any errors are encountered, the run is aborted and yacc exits with a
864 non-zero status. Partial code files and header files may be produced.
865 The summary information in the description file shall always be pro‐
866 duced if the -v flag is present.
867
868 The following sections are informative.
869
871 Historical implementations experience name conflicts on the names
872 yacc.tmp, yacc.acts, yacc.debug, y.tab.c, y.tab.h, and y.output if more
873 than one copy of yacc is running in a single directory at one time. The
874 -b option was added to overcome this problem. The related problem of
875 allowing multiple yacc parsers to be placed in the same file was
876 addressed by adding a -p option to override the previously hard-coded
877 yy variable prefix.
878
879 The description of the -p option specifies the minimal set of function
880 and variable names that cause conflict when multiple parsers are linked
881 together. YYSTYPE does not need to be changed. Instead, the programmer
882 can use -b to give the header files for different parsers different
883 names, and then the file with the yylex() for a given parser can
884 include the header for that parser. Names such as yyclearerr do not
885 need to be changed because they are used only in the actions; they do
886 not have linkage. It is possible that an implementation has other
887 names, either internal ones for implementing things such as yyclearerr,
888 or providing non-standard features that it wants to change with -p.
889
890 Unary operators that are the same token as a binary operator in general
891 need their precedence adjusted. This is handled by the %prec advisory
892 symbol associated with the particular grammar rule defining that unary
893 operator. (See Grammar Rules in yacc .) Applications are not required
894 to use this operator for unary operators, but the grammars that do not
895 require it are rare.
896
898 Access to the yacc library is obtained with library search operands to
899 c99. To use the yacc library main():
900
901
902 c99 y.tab.c -l y
903
904 Both the lex library and the yacc library contain main(). To access
905 the yacc main():
906
907
908 c99 y.tab.c lex.yy.c -l y -l l
909
910 This ensures that the yacc library is searched first, so that its
911 main() is used.
912
913 The historical yacc libraries have contained two simple functions that
914 are normally coded by the application programmer. These functions are
915 similar to the following code:
916
917
918 #include <locale.h>
919 int main(void)
920 {
921 extern int yyparse();
922
923
924 setlocale(LC_ALL, "");
925
926
927 /* If the following parser is one created by lex, the
928 application must be careful to ensure that LC_CTYPE
929 and LC_COLLATE are set to the POSIX locale. */
930 (void) yyparse();
931 return (0);
932 }
933
934
935 #include <stdio.h>
936
937
938 int yyerror(const char *msg)
939 {
940 (void) fprintf(stderr, "%s\n", msg);
941 return (0);
942 }
943
945 The references in may be helpful in constructing the parser generator.
946 The referenced DeRemer and Pennello article (along with the works it
947 references) describes a technique to generate parsers that conform to
948 this volume of IEEE Std 1003.1-2001. Work in this area continues to be
949 done, so implementors should consult current literature before doing
950 any new implementations. The original Knuth article is the theoretical
951 basis for this kind of parser, but the tables it generates are imprac‐
952 tically large for reasonable grammars and should not be used. The
953 "equivalent to" wording is intentional to assure that the best tables
954 that are LALR(1) can be generated.
955
956 There has been confusion between the class of grammars, the algorithms
957 needed to generate parsers, and the algorithms needed to parse the lan‐
958 guages. They are all reasonably orthogonal. In particular, a parser
959 generator that accepts the full range of LR(1) grammars need not gener‐
960 ate a table any more complex than one that accepts SLR(1) (a relatively
961 weak class of LR grammars) for a grammar that happens to be SLR(1).
962 Such an implementation need not recognize the case, either; table com‐
963 pression can yield the SLR(1) table (or one even smaller than that)
964 without recognizing that the grammar is SLR(1). The speed of an LR(1)
965 parser for any class is dependent more upon the table representation
966 and compression (or the code generation if a direct parser is gener‐
967 ated) than upon the class of grammar that the table generator handles.
968
969 The speed of the parser generator is somewhat dependent upon the class
970 of grammar it handles. However, the original Knuth article algorithms
971 for constructing LR parsers were judged by its author to be impracti‐
972 cally slow at that time. Although full LR is more complex than LALR(1),
973 as computer speeds and algorithms improve, the difference (in terms of
974 acceptable wall-clock execution time) is becoming less significant.
975
976 Potential authors are cautioned that the referenced DeRemer and Pen‐
977 nello article previously cited identifies a bug (an over-simplification
978 of the computation of LALR(1) lookahead sets) in some of the LALR(1)
979 algorithm statements that preceded it to publication. They should take
980 the time to seek out that paper, as well as current relevant work, par‐
981 ticularly Aho's.
982
983 The -b option was added to provide a portable method for permitting
984 yacc to work on multiple separate parsers in the same directory. If a
985 directory contains more than one yacc grammar, and both grammars are
986 constructed at the same time (by, for example, a parallel make pro‐
987 gram), conflict results. While the solution is not historical prac‐
988 tice, it corrects a known deficiency in historical implementations.
989 Corresponding changes were made to all sections that referenced the
990 filenames y.tab.c (now "the code file"), y.tab.h (now "the header
991 file"), and y.output (now "the description file").
992
993 The grammar for yacc input is based on System V documentation. The
994 textual description shows there that the ';' is required at the end of
995 the rule. The grammar and the implementation do not require this. (The
996 use of C_IDENTIFIER causes a reduce to occur in the right place.)
997
998 Also, in that implementation, the constructs such as %token can be ter‐
999 minated by a semicolon, but this is not permitted by the grammar. The
1000 keywords such as %token can also appear in uppercase, which is again
1001 not discussed. In most places where '%' is used, '\' can be substi‐
1002 tuted, and there are alternate spellings for some of the symbols (for
1003 example, %LEFT can be "%<" or even "\<" ).
1004
1005 Historically, <tag> can contain any characters except '>', including
1006 white space, in the implementation. However, since the tag must refer‐
1007 ence an ISO C standard union member, in practice conforming implementa‐
1008 tions need to support only the set of characters for ISO C standard
1009 identifiers in this context.
1010
1011 Some historical implementations are known to accept actions that are
1012 terminated by a period. Historical implementations often allow '$' in
1013 names. A conforming implementation does not need to support either of
1014 these behaviors.
1015
1016 Deciding when to use %prec illustrates the difficulty in specifying the
1017 behavior of yacc. There may be situations in which the grammar is not,
1018 strictly speaking, in error, and yet yacc cannot interpret it unambigu‐
1019 ously. The resolution of ambiguities in the grammar can in many
1020 instances be resolved by providing additional information, such as
1021 using %type or %union declarations. It is often easier and it usually
1022 yields a smaller parser to take this alternative when it is appropri‐
1023 ate.
1024
1025 The size and execution time of a program produced without the runtime
1026 debugging code is usually smaller and slightly faster in historical
1027 implementations.
1028
1029 Statistics messages from several historical implementations include the
1030 following types of information:
1031
1032
1033 n/512 terminals, n/300 non-terminals
1034 n/600 grammar rules, n/1500 states
1035 n shift/reduce, n reduce/reduce conflicts reported
1036 n/350 working sets used
1037 Memory: states, etc. n/15000, parser n/15000
1038 n/600 distinct lookahead sets
1039 n extra closures
1040 n shift entries, n exceptions
1041 n goto entries
1042 n entries saved by goto default
1043 Optimizer space used: input n/15000, output n/15000
1044 n table entries, n zero
1045 Maximum spread: n, Maximum offset: n
1046
1047 The report of internal tables in the description file is left implemen‐
1048 tation-defined because all aspects of these limits are also implementa‐
1049 tion-defined. Some implementations may use dynamic allocation tech‐
1050 niques and have no specific limit values to report.
1051
1052 The format of the y.output file is not given because specification of
1053 the format was not seen to enhance applications portability. The list‐
1054 ing is primarily intended to help human users understand and debug the
1055 parser; use of y.output by a conforming application script would be
1056 unusual. Furthermore, implementations have not produced consistent out‐
1057 put and no popular format was apparent. The format selected by the
1058 implementation should be human-readable, in addition to the requirement
1059 that it be a text file.
1060
1061 Standard error reports are not specifically described because they are
1062 seldom of use to conforming applications and there was no reason to
1063 restrict implementations.
1064
1065 Some implementations recognize "={" as equivalent to '{' because it
1066 appears in historical documentation. This construction was recognized
1067 and documented as obsolete as long ago as 1978, in the referenced Yacc:
1068 Yet Another Compiler-Compiler. This volume of IEEE Std 1003.1-2001
1069 chose to leave it as obsolete and omit it.
1070
1071 Multi-byte characters should be recognized by the lexical analyzer and
1072 returned as tokens. They should not be returned as multi-byte character
1073 literals. The token error that is used for error recovery is normally
1074 assigned the value 256 in the historical implementation. Thus, the
1075 token value 256, which is used in many multi-byte character sets, is
1076 not available for use as the value of a user-defined token.
1077
1079 None.
1080
1082 c99, lex
1083
1085 Portions of this text are reprinted and reproduced in electronic form
1086 from IEEE Std 1003.1, 2003 Edition, Standard for Information Technology
1087 -- Portable Operating System Interface (POSIX), The Open Group Base
1088 Specifications Issue 6, Copyright (C) 2001-2003 by the Institute of
1089 Electrical and Electronics Engineers, Inc and The Open Group. In the
1090 event of any discrepancy between this version and the original IEEE and
1091 The Open Group Standard, the original IEEE and The Open Group Standard
1092 is the referee document. The original Standard can be obtained online
1093 at http://www.opengroup.org/unix/online.html .
1094
1095
1096
1097IEEE/The Open Group 2003 YACC(1P)