1yecc(3) Erlang Module Definition yecc(3)
2
3
4
6 yecc - LALR-1 Parser Generator
7
9 An LALR-1 parser generator for Erlang, similar to yacc. Takes a BNF
10 grammar definition as input, and produces Erlang code for a parser.
11
12 To understand this text, you also have to look at the yacc documenta‐
13 tion in the UNIX(TM) manual. This is most probably necessary in order
14 to understand the idea of a parser generator, and the principle and
15 problems of LALR parsing with finite look-ahead.
16
18 file(Grammarfile [, Options]) -> YeccRet
19
20 Types:
21
22 Grammarfile = filename()
23 Options = Option | [Option]
24 Option = - see below -
25 YeccRet = {ok, Parserfile} | {ok, Parserfile, Warnings} |
26 error | {error, Errors, Warnings}
27 Parserfile = filename()
28 Warnings = Errors = [{filename(), [ErrorInfo]}]
29 ErrorInfo = {ErrorLine, module(), Reason}
30 ErrorLine = integer()
31 Reason = - formatable by format_error/1 -
32
33 Grammarfile is the file of declarations and grammar rules.
34 Returns ok upon success, or error if there are errors. An Erlang
35 file containing the parser is created if there are no errors.
36 The options are:
37
38 {parserfile, Parserfile}.:
39 Parserfile is the name of the file that will contain the
40 Erlang parser code that is generated. The default ("") is to
41 add the extension .erl to Grammarfile stripped of the .yrl
42 extension.
43
44 {includefile, Includefile}.:
45 Indicates a customized prologue file which the user may want
46 to use instead of the default file lib/parse‐
47 tools/include/yeccpre.hrl which is otherwise included at the
48 beginning of the resulting parser file. N.B. The Includefile
49 is included 'as is' in the parser file, so it must not have
50 a module declaration of its own, and it should not be com‐
51 piled. It must, however, contain the necessary export decla‐
52 rations. The default is indicated by "".
53
54 {report_errors, bool()}.:
55 Causes errors to be printed as they occur. Default is true.
56
57 {report_warnings, bool()}.:
58 Causes warnings to be printed as they occur. Default is
59 true.
60
61 {report, bool()}.:
62 This is a short form for both report_errors and report_warn‐
63 ings.
64
65 warnings_as_errors:
66 Causes warnings to be treated as errors.
67
68 {return_errors, bool()}.:
69 If this flag is set, {error, Errors, Warnings} is returned
70 when there are errors. Default is false.
71
72 {return_warnings, bool()}.:
73 If this flag is set, an extra field containing Warnings is
74 added to the tuple returned upon success. Default is false.
75
76 {return, bool()}.:
77 This is a short form for both return_errors and return_warn‐
78 ings.
79
80 {verbose, bool()}. :
81 Determines whether the parser generator should give full
82 information about resolved and unresolved parse action con‐
83 flicts (true), or only about those conflicts that prevent a
84 parser from being generated from the input grammar (false,
85 the default).
86
87 Any of the Boolean options can be set to true by stating the
88 name of the option. For example, verbose is equivalent to {ver‐
89 bose, true}.
90
91 The value of the Parserfile option stripped of the .erl exten‐
92 sion is used by Yecc as the module name of the generated parser
93 file.
94
95 Yecc will add the extension .yrl to the Grammarfile name, the
96 extension .hrl to the Includefile name, and the extension .erl
97 to the Parserfile name, unless the extension is already there.
98
99 format_error(Reason) -> Chars
100
101 Types:
102
103 Reason = - as returned by yecc:file/1,2 -
104 Chars = [char() | Chars]
105
106 Returns a descriptive string in English of an error tuple
107 returned by yecc:file/1,2. This function is mainly used by the
108 compiler invoking Yecc.
109
111 A scanner to pre-process the text (program, etc.) to be parsed is not
112 provided in the yecc module. The scanner serves as a kind of lexicon
113 look-up routine. It is possible to write a grammar that uses only char‐
114 acter tokens as terminal symbols, thereby eliminating the need for a
115 scanner, but this would make the parser larger and slower.
116
117 The user should implement a scanner that segments the input text, and
118 turns it into one or more lists of tokens. Each token should be a tuple
119 containing information about syntactic category, position in the text
120 (e.g. line number), and the actual terminal symbol found in the text:
121 {Category, LineNumber, Symbol}.
122
123 If a terminal symbol is the only member of a category, and the symbol
124 name is identical to the category name, the token format may be {Sym‐
125 bol, LineNumber}.
126
127 A list of tokens produced by the scanner should end with a special
128 end_of_input tuple which the parser is looking for. The format of this
129 tuple should be {Endsymbol, LastLineNumber}, where Endsymbol is an
130 identifier that is distinguished from all the terminal and non-terminal
131 categories of the syntax rules. The Endsymbol may be declared in the
132 grammar file (see below).
133
134 The simplest case is to segment the input string into a list of identi‐
135 fiers (atoms) and use those atoms both as categories and values of the
136 tokens. For example, the input string aaa bbb 777, X may be scanned
137 (tokenized) as:
138
139 [{aaa, 1}, {bbb, 1}, {777, 1}, {',' , 1}, {'X', 1},
140 {'$end', 1}].
141
142 This assumes that this is the first line of the input text, and that
143 '$end' is the distinguished end_of_input symbol.
144
145 The Erlang scanner in the io module can be used as a starting point
146 when writing a new scanner. Study yeccscan.erl in order to see how a
147 filter can be added on top of io:scan_erl_form/3 to provide a scanner
148 for Yecc that tokenizes grammar files before parsing them with the Yecc
149 parser. A more general approach to scanner implementation is to use a
150 scanner generator. A scanner generator in Erlang called leex is under
151 development.
152
154 Erlang style comments, starting with a '%', are allowed in grammar
155 files.
156
157 Each declaration or rule ends with a dot (the character '.').
158
159 The grammar starts with an optional header section. The header is put
160 first in the generated file, before the module declaration. The purpose
161 of the header is to provide a means to make the documentation generated
162 by EDoc look nicer. Each header line should be enclosed in double
163 quotes, and newlines will be inserted between the lines. For example:
164
165 Header "%% Copyright (C)"
166 "%% @private"
167 "%% @Author John".
168
169 Next comes a declaration of the nonterminal categories to be used in
170 the rules. For example:
171
172 Nonterminals sentence nounphrase verbphrase.
173
174 A non-terminal category can be used at the left hand side (= lhs, or
175 head) of a grammar rule. It can also appear at the right hand side of
176 rules.
177
178 Next comes a declaration of the terminal categories, which are the cat‐
179 egories of tokens produced by the scanner. For example:
180
181 Terminals article adjective noun verb.
182
183 Terminal categories may only appear in the right hand sides (= rhs) of
184 grammar rules.
185
186 Next comes a declaration of the rootsymbol, or start category of the
187 grammar. For example:
188
189 Rootsymbol sentence.
190
191 This symbol should appear in the lhs of at least one grammar rule. This
192 is the most general syntactic category which the parser ultimately will
193 parse every input string into.
194
195 After the rootsymbol declaration comes an optional declaration of the
196 end_of_input symbol that your scanner is expected to use. For example:
197
198 Endsymbol '$end'.
199
200 Next comes one or more declarations of operator precedences, if needed.
201 These are used to resolve shift/reduce conflicts (see yacc documenta‐
202 tion).
203
204 Examples of operator declarations:
205
206 Right 100 '='.
207 Nonassoc 200 '==' '=/='.
208 Left 300 '+'.
209 Left 400 '*'.
210 Unary 500 '-'.
211
212 These declarations mean that '=' is defined as a right associative
213 binary operator with precedence 100, '==' and '=/=' are operators with
214 no associativity, '+' and '*' are left associative binary operators,
215 where '*' takes precedence over '+' (the normal case), and '-' is a
216 unary operator of higher precedence than '*'. The fact that '==' has no
217 associativity means that an expression like a == b == c is considered a
218 syntax error.
219
220 Certain rules are assigned precedence: each rule gets its precedence
221 from the last terminal symbol mentioned in the right hand side of the
222 rule. It is also possible to declare precedence for non-terminals, "one
223 level up". This is practical when an operator is overloaded (see also
224 example 3 below).
225
226 Next come the grammar rules. Each rule has the general form
227
228 Left_hand_side -> Right_hand_side : Associated_code.
229
230 The left hand side is a non-terminal category. The right hand side is a
231 sequence of one or more non-terminal or terminal symbols with spaces
232 between. The associated code is a sequence of zero or more Erlang
233 expressions (with commas ',' as separators). If the associated code is
234 empty, the separating colon ':' is also omitted. A final dot marks the
235 end of the rule.
236
237 Symbols such as '{', '.', etc., have to be enclosed in single quotes
238 when used as terminal or non-terminal symbols in grammar rules. The use
239 of the symbols '$empty', '$end', and '$undefined' should be avoided.
240
241 The last part of the grammar file is an optional section with Erlang
242 code (= function definitions) which is included 'as is' in the result‐
243 ing parser file. This section must start with the pseudo declaration,
244 or key words
245
246 Erlang code.
247
248 No syntax rule definitions or other declarations may follow this sec‐
249 tion. To avoid conflicts with internal variables, do not use variable
250 names beginning with two underscore characters ('__') in the Erlang
251 code in this section, or in the code associated with the individual
252 syntax rules.
253
254 The optional expect declaration can be placed anywhere before the last
255 optional section with Erlang code. It is used for suppressing the warn‐
256 ing about conflicts that is ordinarily given if the grammar is ambigu‐
257 ous. An example:
258
259 Expect 2.
260
261 The warning is given if the number of shift/reduce conflicts differs
262 from 2, or if there are reduce/reduce conflicts.
263
265 A grammar to parse list expressions (with empty associated code):
266
267 Nonterminals list elements element.
268 Terminals atom '(' ')'.
269 Rootsymbol list.
270 list -> '(' ')'.
271 list -> '(' elements ')'.
272 elements -> element.
273 elements -> element elements.
274 element -> atom.
275 element -> list.
276
277 This grammar can be used to generate a parser which parses list expres‐
278 sions, such as (), (a), (peter charles), (a (b c) d (())), ... provided
279 that your scanner tokenizes, for example, the input (peter charles) as
280 follows:
281
282 [{'(', 1} , {atom, 1, peter}, {atom, 1, charles}, {')', 1},
283 {'$end', 1}]
284
285 When a grammar rule is used by the parser to parse (part of) the input
286 string as a grammatical phrase, the associated code is evaluated, and
287 the value of the last expression becomes the value of the parsed
288 phrase. This value may be used by the parser later to build structures
289 that are values of higher phrases of which the current phrase is a
290 part. The values initially associated with terminal category phrases,
291 i.e. input tokens, are the token tuples themselves.
292
293 Below is an example of the grammar above with structure building code
294 added:
295
296 list -> '(' ')' : nil.
297 list -> '(' elements ')' : '$2'.
298 elements -> element : {cons, '$1', nil}.
299 elements -> element elements : {cons, '$1', '$2'}.
300 element -> atom : '$1'.
301 element -> list : '$1'.
302
303 With this code added to the grammar rules, the parser produces the fol‐
304 lowing value (structure) when parsing the input string (a b c).. This
305 still assumes that this was the first input line that the scanner tok‐
306 enized:
307
308 {cons, {atom, 1, a,} {cons, {atom, 1, b},
309 {cons, {atom, 1, c}, nil}}}
310
311 The associated code contains pseudo variables '$1', '$2', '$3', etc.
312 which refer to (are bound to) the values associated previously by the
313 parser with the symbols of the right hand side of the rule. When these
314 symbols are terminal categories, the values are token tuples of the
315 input string (see above).
316
317 The associated code may not only be used to build structures associated
318 with phrases, but may also be used for syntactic and semantic tests,
319 printout actions (for example for tracing), etc. during the parsing
320 process. Since tokens contain positional (line number) information, it
321 is possible to produce error messages which contain line numbers. If
322 there is no associated code after the right hand side of the rule, the
323 value '$undefined' is associated with the phrase.
324
325 The right hand side of a grammar rule may be empty. This is indicated
326 by using the special symbol '$empty' as rhs. Then the list grammar
327 above may be simplified to:
328
329 list -> '(' elements ')' : '$2'.
330 elements -> element elements : {cons, '$1', '$2'}.
331 elements -> '$empty' : nil.
332 element -> atom : '$1'.
333 element -> list : '$1'.
334
336 To call the parser generator, use the following command:
337
338 yecc:file(Grammarfile).
339
340 An error message from Yecc will be shown if the grammar is not of the
341 LALR type (for example too ambiguous). Shift/reduce conflicts are
342 resolved in favor of shifting if there are no operator precedence dec‐
343 larations. Refer to the yacc documentation on the use of operator
344 precedence.
345
346 The output file contains Erlang source code for a parser module with
347 module name equal to the Parserfile parameter. After compilation, the
348 parser can be called as follows (the module name is assumed to be
349 myparser):
350
351 myparser:parse(myscanner:scan(Inport))
352
353 The call format may be different if a customized prologue file has been
354 included when generating the parser instead of the default file
355 lib/parsetools/include/yeccpre.hrl.
356
357 With the standard prologue, this call will return either {ok, Result},
358 where Result is a structure that the Erlang code of the grammar file
359 has built, or {error, {Line_number, Module, Message}} if there was a
360 syntax error in the input.
361
362 Message is something which may be converted into a string by calling
363 Module:format_error(Message) and printed with io:format/3.
364
365 Note:
366 By default, the parser that was generated will not print out error mes‐
367 sages to the screen. The user will have to do this either by printing
368 the returned error messages, or by inserting tests and print instruc‐
369 tions in the Erlang code associated with the syntax rules of the gram‐
370 mar file.
371
372
373 It is also possible to make the parser ask for more input tokens when
374 needed if the following call format is used:
375
376 myparser:parse_and_scan({Function, Args})
377 myparser:parse_and_scan({Mod, Tokenizer, Args})
378
379 The tokenizer Function is either a fun or a tuple {Mod, Tokenizer}. The
380 call apply(Function, Args) or apply({Mod, Tokenizer}, Args) is executed
381 whenever a new token is needed. This, for example, makes it possible to
382 parse from a file, token by token.
383
384 The tokenizer used above has to be implemented so as to return one of
385 the following:
386
387 {ok, Tokens, Endline}
388 {eof, Endline}
389 {error, Error_description, Endline}
390
391 This conforms to the format used by the scanner in the Erlang io
392 library module.
393
394 If {eof, Endline} is returned immediately, the call to parse_and_scan/1
395 returns {ok, eof}. If {eof, Endline} is returned before the parser
396 expects end of input, parse_and_scan/1 will, of course, return an error
397 message (see above). Otherwise {ok, Result} is returned.
398
400 1. A grammar for parsing infix arithmetic expressions into prefix nota‐
401 tion, without operator precedence:
402
403 Nonterminals E T F.
404 Terminals '+' '*' '(' ')' number.
405 Rootsymbol E.
406 E -> E '+' T: {'$2', '$1', '$3'}.
407 E -> T : '$1'.
408 T -> T '*' F: {'$2', '$1', '$3'}.
409 T -> F : '$1'.
410 F -> '(' E ')' : '$2'.
411 F -> number : '$1'.
412
413 2. The same with operator precedence becomes simpler:
414
415 Nonterminals E.
416 Terminals '+' '*' '(' ')' number.
417 Rootsymbol E.
418 Left 100 '+'.
419 Left 200 '*'.
420 E -> E '+' E : {'$2', '$1', '$3'}.
421 E -> E '*' E : {'$2', '$1', '$3'}.
422 E -> '(' E ')' : '$2'.
423 E -> number : '$1'.
424
425 3. An overloaded minus operator:
426
427 Nonterminals E uminus.
428 Terminals '*' '-' number.
429 Rootsymbol E.
430
431 Left 100 '-'.
432 Left 200 '*'.
433 Unary 300 uminus.
434
435 E -> E '-' E.
436 E -> E '*' E.
437 E -> uminus.
438 E -> number.
439
440 uminus -> '-' E.
441
442 4. The Yecc grammar that is used for parsing grammar files, including
443 itself:
444
445 Nonterminals
446 grammar declaration rule head symbol symbols attached_code
447 token tokens.
448 Terminals
449 atom float integer reserved_symbol reserved_word string char var
450 Rootsymbol grammar.
451 Endsymbol '$end'.
452 grammar -> declaration : '$1'.
453 grammar -> rule : '$1'.
454 declaration -> symbol symbols dot: {'$1', '$2'}.
455 rule -> head '->' symbols attached_code dot: {rule, ['$1' | '$3'],
456 '$4'}.
457 head -> symbol : '$1'.
458 symbols -> symbol : ['$1'].
459 symbols -> symbol symbols : ['$1' | '$2'].
460 attached_code -> ':' tokens : {erlang_code, '$2'}.
461 attached_code -> '$empty' : {erlang_code,
462 [{atom, 0, '$undefined'}]}.
463 tokens -> token : ['$1'].
464 tokens -> token tokens : ['$1' | '$2'].
465 symbol -> var : value_of('$1').
466 symbol -> atom : value_of('$1').
467 symbol -> integer : value_of('$1').
468 symbol -> reserved_word : value_of('$1').
469 token -> var : '$1'.
470 token -> atom : '$1'.
471 token -> float : '$1'.
472 token -> integer : '$1'.
473 token -> string : '$1'.
474 token -> char : '$1'.
475 token -> reserved_symbol : {value_of('$1'), line_of('$1')}.
476 token -> reserved_word : {value_of('$1'), line_of('$1')}.
477 token -> '->' : {'->', line_of('$1')}.
478 token -> ':' : {':', line_of('$1')}.
479 Erlang code.
480 value_of(Token) ->
481 element(3, Token).
482 line_of(Token) ->
483 element(2, Token).
484
485 Note:
486 The symbols '->', and ':' have to be treated in a special way, as they
487 are meta symbols of the grammar notation, as well as terminal symbols
488 of the Yecc grammar.
489
490
491 5. The file erl_parse.yrl in the lib/stdlib/src directory contains the
492 grammar for Erlang.
493
494 Note:
495 Syntactic tests are used in the code associated with some rules, and an
496 error is thrown (and caught by the generated parser to produce an error
497 message) when a test fails. The same effect can be achieved with a call
498 to return_error(Error_line, Message_string), which is defined in the
499 yeccpre.hrl default header file.
500
501
503 lib/parsetools/include/yeccpre.hrl
504
506 Aho & Johnson: 'LR Parsing', ACM Computing Surveys, vol. 6:2, 1974.
507
508
509
510Ericsson AB parsetools 2.2 yecc(3)