yecc(3erl)

1yecc(3)                    Erlang Module Definition                    yecc(3)
2
3
4

NAME

6       yecc - LALR-1 Parser Generator
7

DESCRIPTION

9       An  LALR-1  parser  generator  for Erlang, similar to yacc. Takes a BNF
10       grammar definition as input, and produces Erlang code for a parser.
11
12       To understand this text, you also have to look at the  yacc  documenta‐
13       tion  in  the UNIX(TM) manual. This is most probably necessary in order
14       to understand the idea of a parser generator,  and  the  principle  and
15       problems of LALR parsing with finite look-ahead.
16

EXPORTS

18       file(Grammarfile [, Options]) -> YeccRet
19
20              Types:
21
22                 Grammarfile = filename()
23                 Options = Option | [Option]
24                 Option = - see below -
25                 YeccRet  =  {ok,  Parserfile}  | {ok, Parserfile, Warnings} |
26                 error | {error, Errors, Warnings}
27                 Parserfile = filename()
28                 Warnings = Errors = [{filename(), [ErrorInfo]}]
29                 ErrorInfo = {ErrorLine, module(), Reason}
30                 ErrorLine = integer()
31                 Reason = - formatable by format_error/1 -
32
33              Grammarfile is the  file  of  declarations  and  grammar  rules.
34              Returns ok upon success, or error if there are errors. An Erlang
35              file containing the parser is created if there  are  no  errors.
36              The options are:
37
38                {parserfile, Parserfile}.:
39                  Parserfile  is  the  name  of the file that will contain the
40                  Erlang parser code that is generated. The default ("") is to
41                  add  the  extension .erl to Grammarfile stripped of the .yrl
42                  extension.
43
44                {includefile, Includefile}.:
45                  Indicates a customized prologue file which the user may want
46                  to    use   instead   of   the   default   file   lib/parse‐
47                  tools/include/yeccpre.hrl which is otherwise included at the
48                  beginning of the resulting parser file. N.B. The Includefile
49                  is included 'as is' in the parser file, so it must not  have
50                  a  module  declaration of its own, and it should not be com‐
51                  piled. It must, however, contain the necessary export decla‐
52                  rations. The default is indicated by "".
53
54                {report_errors, bool()}.:
55                  Causes errors to be printed as they occur. Default is true.
56
57                {report_warnings, bool()}.:
58                  Causes  warnings  to  be  printed  as they occur. Default is
59                  true.
60
61                {report, bool()}.:
62                  This is a short form for both report_errors and report_warn‐
63                  ings.
64
65                warnings_as_errors:
66                  Causes warnings to be treated as errors.
67
68                {return_errors, bool()}.:
69                  If  this  flag is set, {error, Errors, Warnings} is returned
70                  when there are errors. Default is false.
71
72                {return_warnings, bool()}.:
73                  If this flag is set, an extra field containing  Warnings  is
74                  added to the tuple returned upon success. Default is false.
75
76                {return, bool()}.:
77                  This is a short form for both return_errors and return_warn‐
78                  ings.
79
80                {verbose, bool()}. :
81                  Determines whether the parser  generator  should  give  full
82                  information  about resolved and unresolved parse action con‐
83                  flicts (true), or only about those conflicts that prevent  a
84                  parser  from  being generated from the input grammar (false,
85                  the default).
86
87              Any of the Boolean options can be set to  true  by  stating  the
88              name  of the option. For example, verbose is equivalent to {ver‐
89              bose, true}.
90
91              The value of the Parserfile option stripped of the  .erl  exten‐
92              sion  is used by Yecc as the module name of the generated parser
93              file.
94
95              Yecc will add the extension .yrl to the  Grammarfile  name,  the
96              extension  .hrl  to the Includefile name, and the extension .erl
97              to the Parserfile name, unless the extension is already there.
98
99       format_error(Reason) -> Chars
100
101              Types:
102
103                 Reason = - as returned by yecc:file/1,2 -
104                 Chars = [char() | Chars]
105
106              Returns a descriptive  string  in  English  of  an  error  tuple
107              returned  by  yecc:file/1,2. This function is mainly used by the
108              compiler invoking Yecc.
109

PRE-PROCESSING

111       A scanner to pre-process the text (program, etc.) to be parsed  is  not
112       provided  in  the  yecc module. The scanner serves as a kind of lexicon
113       look-up routine. It is possible to write a grammar that uses only char‐
114       acter  tokens  as  terminal symbols, thereby eliminating the need for a
115       scanner, but this would make the parser larger and slower.
116
117       The user should implement a scanner that segments the input  text,  and
118       turns it into one or more lists of tokens. Each token should be a tuple
119       containing information about syntactic category, position in  the  text
120       (e.g.  line  number), and the actual terminal symbol found in the text:
121       {Category, LineNumber, Symbol}.
122
123       If a terminal symbol is the only member of a category, and  the  symbol
124       name  is  identical to the category name, the token format may be {Sym‐
125       bol, LineNumber}.
126
127       A list of tokens produced by the scanner  should  end  with  a  special
128       end_of_input  tuple which the parser is looking for. The format of this
129       tuple should be {Endsymbol,  LastLineNumber},  where  Endsymbol  is  an
130       identifier that is distinguished from all the terminal and non-terminal
131       categories of the syntax rules. The Endsymbol may be  declared  in  the
132       grammar file (see below).
133
134       The simplest case is to segment the input string into a list of identi‐
135       fiers (atoms) and use those atoms both as categories and values of  the
136       tokens.  For  example,  the  input string aaa bbb 777, X may be scanned
137       (tokenized) as:
138
139       [{aaa, 1}, {bbb, 1}, {777, 1}, {',' , 1}, {'X', 1},
140        {'$end', 1}].
141
142       This assumes that this is the first line of the input  text,  and  that
143       '$end' is the distinguished end_of_input symbol.
144
145       The  Erlang  scanner  in  the io module can be used as a starting point
146       when writing a new scanner. Study yeccscan.erl in order to  see  how  a
147       filter  can  be added on top of io:scan_erl_form/3 to provide a scanner
148       for Yecc that tokenizes grammar files before parsing them with the Yecc
149       parser.  A  more general approach to scanner implementation is to use a
150       scanner generator. A scanner generator in Erlang called leex  is  under
151       development.
152

GRAMMAR DEFINITION FORMAT

154       Erlang  style  comments,  starting  with  a '%', are allowed in grammar
155       files.
156
157       Each declaration or rule ends with a dot (the character '.').
158
159       The grammar starts with an optional header section. The header  is  put
160       first in the generated file, before the module declaration. The purpose
161       of the header is to provide a means to make the documentation generated
162       by  EDoc  look  nicer.  Each  header  line should be enclosed in double
163       quotes, and newlines will be inserted between the lines. For example:
164
165       Header "%% Copyright (C)"
166       "%% @private"
167       "%% @Author John".
168
169       Next comes a declaration of the nonterminal categories to  be  used  in
170       the rules. For example:
171
172       Nonterminals sentence nounphrase verbphrase.
173
174       A  non-terminal  category  can be used at the left hand side (= lhs, or
175       head) of a grammar rule. It can also appear at the right hand  side  of
176       rules.
177
178       Next comes a declaration of the terminal categories, which are the cat‐
179       egories of tokens produced by the scanner. For example:
180
181       Terminals article adjective noun verb.
182
183       Terminal categories may only appear in the right hand sides (= rhs)  of
184       grammar rules.
185
186       Next  comes  a  declaration of the rootsymbol, or start category of the
187       grammar. For example:
188
189       Rootsymbol sentence.
190
191       This symbol should appear in the lhs of at least one grammar rule. This
192       is the most general syntactic category which the parser ultimately will
193       parse every input string into.
194
195       After the rootsymbol declaration comes an optional declaration  of  the
196       end_of_input symbol that your scanner is expected to use. For example:
197
198       Endsymbol '$end'.
199
200       Next comes one or more declarations of operator precedences, if needed.
201       These are used to resolve shift/reduce conflicts (see  yacc  documenta‐
202       tion).
203
204       Examples of operator declarations:
205
206       Right 100 '='.
207       Nonassoc 200 '==' '=/='.
208       Left 300 '+'.
209       Left 400 '*'.
210       Unary 500 '-'.
211
212       These  declarations  mean  that  '='  is defined as a right associative
213       binary operator with precedence 100, '==' and '=/=' are operators  with
214       no  associativity,  '+'  and '*' are left associative binary operators,
215       where '*' takes precedence over '+' (the normal case),  and  '-'  is  a
216       unary operator of higher precedence than '*'. The fact that '==' has no
217       associativity means that an expression like a == b == c is considered a
218       syntax error.
219
220       Certain  rules  are  assigned precedence: each rule gets its precedence
221       from the last terminal symbol mentioned in the right hand side  of  the
222       rule. It is also possible to declare precedence for non-terminals, "one
223       level up". This is practical when an operator is overloaded  (see  also
224       example 3 below).
225
226       Next come the grammar rules. Each rule has the general form
227
228       Left_hand_side -> Right_hand_side : Associated_code.
229
230       The left hand side is a non-terminal category. The right hand side is a
231       sequence of one or more non-terminal or terminal  symbols  with  spaces
232       between.  The  associated  code  is  a  sequence of zero or more Erlang
233       expressions (with commas ',' as separators). If the associated code  is
234       empty,  the separating colon ':' is also omitted. A final dot marks the
235       end of the rule.
236
237       Symbols such as '{', '.', etc., have to be enclosed  in  single  quotes
238       when used as terminal or non-terminal symbols in grammar rules. The use
239       of the symbols '$empty', '$end', and '$undefined' should be avoided.
240
241       The last part of the grammar file is an optional  section  with  Erlang
242       code  (= function definitions) which is included 'as is' in the result‐
243       ing parser file. This section must start with the  pseudo  declaration,
244       or key words
245
246       Erlang code.
247
248       No  syntax  rule definitions or other declarations may follow this sec‐
249       tion. To avoid conflicts with internal variables, do not  use  variable
250       names  beginning  with  two  underscore characters ('__') in the Erlang
251       code in this section, or in the code  associated  with  the  individual
252       syntax rules.
253
254       The  optional expect declaration can be placed anywhere before the last
255       optional section with Erlang code. It is used for suppressing the warn‐
256       ing  about conflicts that is ordinarily given if the grammar is ambigu‐
257       ous. An example:
258
259       Expect 2.
260
261       The warning is given if the number of  shift/reduce  conflicts  differs
262       from 2, or if there are reduce/reduce conflicts.
263

EXAMPLES

265       A grammar to parse list expressions (with empty associated code):
266
267       Nonterminals list elements element.
268       Terminals atom '(' ')'.
269       Rootsymbol list.
270       list -> '(' ')'.
271       list -> '(' elements ')'.
272       elements -> element.
273       elements -> element elements.
274       element -> atom.
275       element -> list.
276
277       This grammar can be used to generate a parser which parses list expres‐
278       sions, such as (), (a), (peter charles), (a (b c) d (())), ... provided
279       that  your scanner tokenizes, for example, the input (peter charles) as
280       follows:
281
282       [{'(', 1} , {atom, 1, peter}, {atom, 1, charles}, {')', 1},
283        {'$end', 1}]
284
285       When a grammar rule is used by the parser to parse (part of) the  input
286       string  as  a grammatical phrase, the associated code is evaluated, and
287       the value of the last  expression  becomes  the  value  of  the  parsed
288       phrase.  This value may be used by the parser later to build structures
289       that are values of higher phrases of which  the  current  phrase  is  a
290       part.  The  values initially associated with terminal category phrases,
291       i.e. input tokens, are the token tuples themselves.
292
293       Below is an example of the grammar above with structure  building  code
294       added:
295
296       list -> '(' ')' : nil.
297       list -> '(' elements ')' : '$2'.
298       elements -> element : {cons, '$1', nil}.
299       elements -> element elements : {cons, '$1', '$2'}.
300       element -> atom : '$1'.
301       element -> list : '$1'.
302
303       With this code added to the grammar rules, the parser produces the fol‐
304       lowing value (structure) when parsing the input string (a b  c)..  This
305       still  assumes that this was the first input line that the scanner tok‐
306       enized:
307
308       {cons, {atom, 1, a,} {cons, {atom, 1, b},
309                                   {cons, {atom, 1, c}, nil}}}
310
311       The associated code contains pseudo variables '$1',  '$2',  '$3',  etc.
312       which  refer  to (are bound to) the values associated previously by the
313       parser with the symbols of the right hand side of the rule. When  these
314       symbols  are  terminal  categories,  the values are token tuples of the
315       input string (see above).
316
317       The associated code may not only be used to build structures associated
318       with  phrases,  but  may also be used for syntactic and semantic tests,
319       printout actions (for example for tracing),  etc.  during  the  parsing
320       process.  Since tokens contain positional (line number) information, it
321       is possible to produce error messages which contain  line  numbers.  If
322       there  is no associated code after the right hand side of the rule, the
323       value '$undefined' is associated with the phrase.
324
325       The right hand side of a grammar rule may be empty. This  is  indicated
326       by  using  the  special  symbol  '$empty' as rhs. Then the list grammar
327       above may be simplified to:
328
329       list -> '(' elements ')' : '$2'.
330       elements -> element elements : {cons, '$1', '$2'}.
331       elements -> '$empty' : nil.
332       element -> atom : '$1'.
333       element -> list : '$1'.
334

GENERATING A PARSER

336       To call the parser generator, use the following command:
337
338       yecc:file(Grammarfile).
339
340       An error message from Yecc will be shown if the grammar is not  of  the
341       LALR  type  (for  example  too  ambiguous).  Shift/reduce conflicts are
342       resolved in favor of shifting if there are no operator precedence  dec‐
343       larations.  Refer  to  the  yacc  documentation  on the use of operator
344       precedence.
345
346       The output file contains Erlang source code for a  parser  module  with
347       module  name  equal to the Parserfile parameter. After compilation, the
348       parser can be called as follows (the  module  name  is  assumed  to  be
349       myparser):
350
351       myparser:parse(myscanner:scan(Inport))
352
353       The call format may be different if a customized prologue file has been
354       included when  generating  the  parser  instead  of  the  default  file
355       lib/parsetools/include/yeccpre.hrl.
356
357       With  the standard prologue, this call will return either {ok, Result},
358       where Result is a structure that the Erlang code of  the  grammar  file
359       has  built,  or  {error, {Line_number, Module, Message}} if there was a
360       syntax error in the input.
361
362       Message is something which may be converted into a  string  by  calling
363       Module:format_error(Message) and printed with io:format/3.
364
365   Note:
366       By default, the parser that was generated will not print out error mes‐
367       sages to the screen. The user will have to do this either  by  printing
368       the  returned  error messages, or by inserting tests and print instruc‐
369       tions in the Erlang code associated with the syntax rules of the  gram‐
370       mar file.
371
372
373       It  is  also possible to make the parser ask for more input tokens when
374       needed if the following call format is used:
375
376       myparser:parse_and_scan({Function, Args})
377       myparser:parse_and_scan({Mod, Tokenizer, Args})
378
379       The tokenizer Function is either a fun or a tuple {Mod, Tokenizer}. The
380       call apply(Function, Args) or apply({Mod, Tokenizer}, Args) is executed
381       whenever a new token is needed. This, for example, makes it possible to
382       parse from a file, token by token.
383
384       The  tokenizer  used above has to be implemented so as to return one of
385       the following:
386
387       {ok, Tokens, Endline}
388       {eof, Endline}
389       {error, Error_description, Endline}
390
391       This conforms to the format used  by  the  scanner  in  the  Erlang  io
392       library module.
393
394       If {eof, Endline} is returned immediately, the call to parse_and_scan/1
395       returns {ok, eof}. If {eof, Endline}  is  returned  before  the  parser
396       expects end of input, parse_and_scan/1 will, of course, return an error
397       message (see above). Otherwise {ok, Result} is returned.
398

MORE EXAMPLES

400       1. A grammar for parsing infix arithmetic expressions into prefix nota‐
401       tion, without operator precedence:
402
403       Nonterminals E T F.
404       Terminals '+' '*' '(' ')' number.
405       Rootsymbol E.
406       E -> E '+' T: {'$2', '$1', '$3'}.
407       E -> T : '$1'.
408       T -> T '*' F: {'$2', '$1', '$3'}.
409       T -> F : '$1'.
410       F -> '(' E ')' : '$2'.
411       F -> number : '$1'.
412
413       2. The same with operator precedence becomes simpler:
414
415       Nonterminals E.
416       Terminals '+' '*' '(' ')' number.
417       Rootsymbol E.
418       Left 100 '+'.
419       Left 200 '*'.
420       E -> E '+' E : {'$2', '$1', '$3'}.
421       E -> E '*' E : {'$2', '$1', '$3'}.
422       E -> '(' E ')' : '$2'.
423       E -> number : '$1'.
424
425       3. An overloaded minus operator:
426
427       Nonterminals E uminus.
428       Terminals '*' '-' number.
429       Rootsymbol E.
430
431       Left 100 '-'.
432       Left 200 '*'.
433       Unary 300 uminus.
434
435       E -> E '-' E.
436       E -> E '*' E.
437       E -> uminus.
438       E -> number.
439
440       uminus -> '-' E.
441
442       4.  The  Yecc grammar that is used for parsing grammar files, including
443       itself:
444
445       Nonterminals
446       grammar declaration rule head symbol symbols attached_code
447       token tokens.
448       Terminals
449       atom float integer reserved_symbol reserved_word string char var
450       Rootsymbol grammar.
451       Endsymbol '$end'.
452       grammar -> declaration : '$1'.
453       grammar -> rule : '$1'.
454       declaration -> symbol symbols dot: {'$1', '$2'}.
455       rule -> head '->' symbols attached_code dot: {rule, ['$1' | '$3'],
456               '$4'}.
457       head -> symbol : '$1'.
458       symbols -> symbol : ['$1'].
459       symbols -> symbol symbols : ['$1' | '$2'].
460       attached_code -> ':' tokens : {erlang_code, '$2'}.
461       attached_code -> '$empty' : {erlang_code,
462                        [{atom, 0, '$undefined'}]}.
463       tokens -> token : ['$1'].
464       tokens -> token tokens : ['$1' | '$2'].
465       symbol -> var : value_of('$1').
466       symbol -> atom : value_of('$1').
467       symbol -> integer : value_of('$1').
468       symbol -> reserved_word : value_of('$1').
469       token -> var : '$1'.
470       token -> atom : '$1'.
471       token -> float : '$1'.
472       token -> integer : '$1'.
473       token -> string : '$1'.
474       token -> char : '$1'.
475       token -> reserved_symbol : {value_of('$1'), line_of('$1')}.
476       token -> reserved_word : {value_of('$1'), line_of('$1')}.
477       token -> '->' : {'->', line_of('$1')}.
478       token -> ':' : {':', line_of('$1')}.
479       Erlang code.
480       value_of(Token) ->
481           element(3, Token).
482       line_of(Token) ->
483           element(2, Token).
484
485   Note:
486       The symbols '->', and ':' have to be treated in a special way, as  they
487       are  meta  symbols of the grammar notation, as well as terminal symbols
488       of the Yecc grammar.
489
490
491       5. The file erl_parse.yrl in the lib/stdlib/src directory contains  the
492       grammar for Erlang.
493
494   Note:
495       Syntactic tests are used in the code associated with some rules, and an
496       error is thrown (and caught by the generated parser to produce an error
497       message) when a test fails. The same effect can be achieved with a call
498       to return_error(Error_line, Message_string), which is  defined  in  the
499       yeccpre.hrl default header file.
500
501

FILES

503       lib/parsetools/include/yeccpre.hrl
504