pt_peg_language(n)

1pt::peg_language(n)              Parser Tools              pt::peg_language(n)
2
3
4
5______________________________________________________________________________
6

NAME

8       pt::peg_language - PEG Language Tutorial
9

SYNOPSIS

11       package require Tcl  8.5
12
13______________________________________________________________________________
14

DESCRIPTION

16       Are  you  lost ?  Do you have trouble understanding this document ?  In
17       that case please read the overview  provided  by  the  Introduction  to
18       Parser  Tools.  This document is the entrypoint to the whole system the
19       current package is a part of.
20
21       Welcome to the tutorial / introduction for the PEG  Specification  Lan‐
22       guage.   If  you are already familiar with the language we are about to
23       discuss, and only wish to refresh your memory you can, of course,  skip
24       ahead to the aforementioned section and just read the full formal spec‐
25       ification.
26

WHAT IS IT?

28       peg, a language for the specification of parsing expression grammars is
29       meant  to be human readable, and writable as well, yet strict enough to
30       allow its processing by machine. Like any  computer  language.  It  was
31       defined  to make writing the specification of a grammar easy, something
32       the other formats found in the Parser Tools do not lend themselves too.
33

THE ELEMENTS OF THE LANGUAGE

35   BASIC STRUCTURE
36       The general outline of a textual PEG is
37
38
39              PEG <<name>> (<<start-expression>>)
40                 <<rules>>
41              END;
42
43       Note: We are using text in double angle-brackets as  place-holders  for
44       things not yet explained.
45
46   NAMES
47       Names  are mostly used to identify the nonterminal symbols of the gram‐
48       mar, i.e. that which occurs on the left-hand side  of  a  <rule>.   The
49       exception to that is the name given after the keyword PEG (see previous
50       section), which is the name of the whole grammar itself.
51
52       The structure of a name is simple:
53
54       [1]    It begins with a letter, underscore, or colon, followed by
55
56       [2]    zero or more letters, digits, underscores, or colons.
57
58       Or, in formal textual notation:
59
60
61                  ([_:] / <alpha>) ([_:] / <alnum>)*
62
63       Examples of names:
64
65
66                  Hello
67                  ::world
68                  _:submarine55_
69
70       Examples of text which are not names:
71
72
73                  12
74                  .bogus
75                  0wrong
76                  @location
77
78
79   RULES
80       The main body of the text of a grammar specification is taken up by the
81       rules. Each rule defines the sentence structure of one nonterminal sym‐
82       bol. Their basic structure is
83
84
85                   <<name>>  <-  <<expression>> ;
86
87       The <name> specifies the nonterminal symbol to be defined, the <expres‐
88       sion> after the arrow (<-) then declares its structure.
89
90       Note  that  each  rule ends in a single semicolon, even the last.  I.e.
91       the semicolon is a rule terminator, not a separator.
92
93       We can have as many rules as we like, as long as we define each nonter‐
94       minal  symbol at most once, and have at least one rule for each nonter‐
95       minal symbol which occured in an expression, i.e. in either  the  start
96       expression of the grammar, or the right-hande side of a rule.
97
98   EXPRESSIONS
99       The parsing expressions are the meat of any specification. They declare
100       the structure of the whole document (<<start-expression>>), and of  all
101       nonterminal symbols.
102
103       All  expressions  are  made  up out of atomic expressions and operators
104       combining them. We have operators for  choosing  between  alternatives,
105       repetition  of  parts,  and  for  look-ahead  constraints.  There is no
106       explicit operator for the sequencing (also known as  concatenation)  of
107       parts  however.  This is specified by simply placing the parts adjacent
108       to each other.
109
110       Here are the operators, from highest to lowest priority (i.e.  strength
111       of binding):
112
113
114                  # Binary operators.
115
116                  <<expression-1>>     <<expression-2>>  # sequence. parse 1, then 2.
117                  <<expression-1>>  /  <<expression-2>>  # alternative. try to parse 1, and parse 2 if 1 failed to parse.
118
119                  # Prefix operators. Lookahead constraints. Same priority.
120
121                  & <<expression>>  # Parse expression, ok on successful parse.
122                  ! <<expression>>  # Ditto, except ok on failure to parse.
123
124                  # Suffix operators. Repetition. Same priority.
125
126                  <<expression>> ?  # Parse expression none, or once (repeat 0 or 1).
127                  <<expression>> *  # Parse expression zero or more times.
128                  <<expression>> +  # Parse expression one or more times.
129
130                  # Expression nesting
131
132                  ( <<expression>> ) # Put an expression in parens to change its priority.
133
134       With  this we can now deconstruct the formal expression for names given
135       in section Names:
136
137
138                  ([_:] / <alpha>) ([_:] / <alnum>)*
139
140       It is a sequence of two parts,
141
142                  [_:] / <alpha>
143       and
144
145                  ([_:] / <alnum>)*
146       The parentheses around the parts kept their  inner  alternatives  bound
147       together  against the normally higher priority of the sequence. Each of
148       the two parts is an alternative,  with  the  second  part  additionally
149       repeated  zero  or more times, leaving us with the three atomic expres‐
150       sions
151
152
153                  [_:]
154                  <alpha>
155                  <alnum>
156
157       And atomic expressions  are  our  next  topic.  They  fall  into  three
158       classes:
159
160       [1]    names, i.e. nonterminal symbols,
161
162       [2]    string literals, and
163
164       [3]    character classes.
165
166       Names we know about already, or see section Names for a refresher.
167
168       String  literals  are simple. They are delimited by (i.e. start and end
169       with) either a single or double-apostroph, and in  between  the  delim‐
170       iters  we  can have any character but the delimiter itself. They can be
171       empty as well. Examples of strings are
172
173
174                  ''
175                  ""
176                  'hello'
177                  "umbra"
178                  "'"
179                  '"'
180
181       The last two examples show how to place any of the  delimiters  into  a
182       string.
183
184       For  the  last,  but  not  least  of  our atomic expressions, character
185       classes, we have a number of predefined classes, shown below,  and  the
186       ability to construct or own. The predefined classes are:
187
188
189                  <alnum>    # Any unicode alphabet or digit character (string is alnum).
190                  <alpha>    # Any unicode alphabet character (string is alpha).
191                  <ascii>    # Any unicode character below codepoint 0x80 (string is ascii).
192                  <control>  # Any unicode control character (string is control).
193                  <ddigit>   # The digit characters [0-9].
194                  <digit>    # Any unicode digit character (string is digit).
195                  <graph>    # Any unicode printing character, except space (string is graph).
196                  <lower>    # Any unicode lower-case alphabet character (string is lower).
197                  <print>    # Any unicode printing character, incl. space (string is print).
198                  <punct>    # Any unicode punctuation character (string is punct).
199                  <space>    # Any unicode space character (string is space).
200                  <upper>    # Any unicode upper-case alphabet character (string is upper).
201                  <wordchar> # Any unicode word character (string is wordchar).
202                  <xdigit>   # The hexadecimal digit characters [0-9a-fA-F].
203                  .          # Any character, except end of input.
204
205       And the syntax of custom-defined character classes is
206
207
208                  [ <<range>>* ]
209
210       where each range is either a single character, or of the form
211
212
213                 <<character>> - <character>>
214
215       Examples  for  character  classes we have seen already in the course of
216       this introduction are
217
218
219                  [_:]
220                  [0-9]
221                  [0-9a-fA-F]
222
223       We are nearly done with expressions. The only piece left is to tell how
224       the characters in character classes and string literals are specified.
225
226       Basically characters in the input stand for themselves, and in addition
227       to that we several types of escape syntax to to repesent control  char‐
228       acters, or characters outside of the encoding the text is in.
229
230       All the escaped forms are started with a backslash character ('\', uni‐
231       code codepoint 0x5C). This is then followed by a series of  octal  dig‐
232       its, or 'u' and hexedecimal digits, or a regular character from a fixed
233       set for various control characters. Some examples:
234
235
236                  \n \r \t \' \" \[ \] \\ #
237                  \000 up to \277         # octal escape, all ascii character, leading 0's can be removed.
238                  \u2CA7                  # hexadecimal escape, all unicode characters.
239                  #                       # Here 2ca7 <=> Koptic Small Letter Tau
240
241
242   WHITESPACE AND COMMENTS
243       One issue not touched upon so far is whitespace and comments.
244
245       Whitespace is any unicode space character, i.e. anything in the charac‐
246       ter class <space>, and comments. The latter are sequences of characters
247       starting with a '#' (hash, unicode codepoint 0x23) and  ending  at  the
248       next end-of-line.
249
250       Whitespace  can  be  freely  used between all syntactical elements of a
251       grammar specification. It cannot be used  inside  of  syntactical  ele‐
252       ments, like names, string literals, predefined character classes, etc.
253
254   NONTERMINAL ATTRIBUTES
255       Lastly,  a more advanced topic. In the section Rules we gave the struc‐
256       ture of a rule as
257
258
259                   <<name>>  <-  <<expression>> ;
260
261       This is not quite true. It is possible to  associate  a  semantic  mode
262       with  the nonterminal in the rule, by writing it before the name, sepa‐
263       rated from it by a colon, i.e. writing
264
265
266                  <<mode>> : <<name>>  <-  <<expression>> ;
267
268       is also allowed. This mode is optional. The known modes and their mean‐
269       ings are:
270
271       value  The semantic value of the nonterminal symbol is an abstract syn‐
272              tax tree consisting of a single node node  for  the  nonterminal
273              itself,  which  has  the ASTs of the symbol's right hand side as
274              its children.
275
276       leaf   The semantic value of the nonterminal symbol is an abstract syn‐
277              tax  tree  consisting of a single node node for the nonterminal,
278              without any children. Any ASTs generated by the  symbol's  right
279              hand side are discarded.
280
281       void   The nonterminal has no semantic value. Any ASTs generated by the
282              symbol's right hand side are discarded (as well).
283
284       Of these three modes only leaf and  void  can  be  specified  directly.
285       value  is implicitly specified by the absence of a mode before the non‐
286       terminal.
287
288       Now, with all the above under our belt it should  be  possible  to  not
289       only  read,  but understand the formal specification of the text repre‐
290       sentation shown in the next section, written in itself.
291

PEG SPECIFICATION LANGUAGE

293       peg, a language for the specification of parsing expression grammars is
294       meant  to be human readable, and writable as well, yet strict enough to
295       allow its processing by machine. Like any  computer  language.  It  was
296       defined  to make writing the specification of a grammar easy, something
297       the other formats found in the Parser Tools do not lend themselves too.
298
299       It is formally specified by the grammar shown below, written in itself.
300       For  a  tutorial  / introduction to the language please go and read the
301       PEG Language Tutorial.
302
303              PEG pe-grammar-for-peg (Grammar)
304
305                # --------------------------------------------------------------------
306                      # Syntactical constructs
307
308                      Grammar         <- WHITESPACE Header Definition* Final EOF ;
309
310                      Header          <- PEG Identifier StartExpr ;
311                      Definition      <- Attribute? Identifier IS Expression SEMICOLON ;
312                      Attribute       <- (VOID / LEAF) COLON ;
313                      Expression      <- Sequence (SLASH Sequence)* ;
314                      Sequence        <- Prefix+ ;
315                      Prefix          <- (AND / NOT)? Suffix ;
316                      Suffix          <- Primary (QUESTION / STAR / PLUS)? ;
317                      Primary         <- ALNUM / ALPHA / ASCII / CONTROL / DDIGIT / DIGIT
318                                      /  GRAPH / LOWER / PRINTABLE / PUNCT / SPACE / UPPER
319                                      /  WORDCHAR / XDIGIT
320                                      / Identifier
321                                      /  OPEN Expression CLOSE
322                                      /  Literal
323                                      /  Class
324                                      /  DOT
325                                      ;
326                      Literal         <- APOSTROPH  (!APOSTROPH  Char)* APOSTROPH  WHITESPACE
327                                      /  DAPOSTROPH (!DAPOSTROPH Char)* DAPOSTROPH WHITESPACE ;
328                      Class           <- OPENB (!CLOSEB Range)* CLOSEB WHITESPACE ;
329                      Range           <- Char TO Char / Char ;
330
331                      StartExpr       <- OPEN Expression CLOSE ;
332              void:   Final           <- "END" WHITESPACE SEMICOLON WHITESPACE ;
333
334                      # --------------------------------------------------------------------
335                      # Lexing constructs
336
337                      Identifier      <- Ident WHITESPACE ;
338              leaf:   Ident           <- ([_:] / <alpha>) ([_:] / <alnum>)* ;
339                      Char            <- CharSpecial / CharOctalFull / CharOctalPart
340                                      /  CharUnicode / CharUnescaped
341                                      ;
342
343              leaf:   CharSpecial     <- "\\" [nrt'"\[\]\\] ;
344              leaf:   CharOctalFull   <- "\\" [0-2][0-7][0-7] ;
345              leaf:   CharOctalPart   <- "\\" [0-7][0-7]? ;
346              leaf:   CharUnicode     <- "\\" 'u' HexDigit (HexDigit (HexDigit HexDigit?)?)? ;
347              leaf:   CharUnescaped   <- !"\\" . ;
348
349              void:   HexDigit        <- [0-9a-fA-F] ;
350
351              void:   TO              <- '-'           ;
352              void:   OPENB           <- "["           ;
353              void:   CLOSEB          <- "]"           ;
354              void:   APOSTROPH       <- "'"           ;
355              void:   DAPOSTROPH      <- '"'           ;
356              void:   PEG             <- "PEG" !([_:] / <alnum>) WHITESPACE ;
357              void:   IS              <- "<-"    WHITESPACE ;
358              leaf:   VOID            <- "void"  WHITESPACE ; # Implies that definition has no semantic value.
359              leaf:   LEAF            <- "leaf"  WHITESPACE ; # Implies that definition has no terminals.
360              void:   SEMICOLON       <- ";"     WHITESPACE ;
361              void:   COLON           <- ":"     WHITESPACE ;
362              void:   SLASH           <- "/"     WHITESPACE ;
363              leaf:   AND             <- "&"     WHITESPACE ;
364              leaf:   NOT             <- "!"     WHITESPACE ;
365              leaf:   QUESTION        <- "?"     WHITESPACE ;
366              leaf:   STAR            <- "*"     WHITESPACE ;
367              leaf:   PLUS            <- "+"     WHITESPACE ;
368              void:   OPEN            <- "("     WHITESPACE ;
369              void:   CLOSE           <- ")"     WHITESPACE ;
370              leaf:   DOT             <- "."     WHITESPACE ;
371
372              leaf:   ALNUM           <- "<alnum>"    WHITESPACE ;
373              leaf:   ALPHA           <- "<alpha>"    WHITESPACE ;
374              leaf:   ASCII           <- "<ascii>"    WHITESPACE ;
375              leaf:   CONTROL         <- "<control>"  WHITESPACE ;
376              leaf:   DDIGIT          <- "<ddigit>"   WHITESPACE ;
377              leaf:   DIGIT           <- "<digit>"    WHITESPACE ;
378              leaf:   GRAPH           <- "<graph>"    WHITESPACE ;
379              leaf:   LOWER           <- "<lower>"    WHITESPACE ;
380              leaf:   PRINTABLE       <- "<print>"    WHITESPACE ;
381              leaf:   PUNCT           <- "<punct>"    WHITESPACE ;
382              leaf:   SPACE           <- "<space>"    WHITESPACE ;
383              leaf:   UPPER           <- "<upper>"    WHITESPACE ;
384              leaf:   WORDCHAR        <- "<wordchar>" WHITESPACE ;
385              leaf:   XDIGIT          <- "<xdigit>"   WHITESPACE ;
386
387              void:   WHITESPACE      <- (" " / "\t" / EOL / COMMENT)* ;
388              void:   COMMENT         <- '#' (!EOL .)* EOL ;
389              void:   EOL             <- "\n\r" / "\n" / "\r" ;
390              void:   EOF             <- !. ;
391
392                      # --------------------------------------------------------------------
393              END;
394
395
396   EXAMPLE
397       Our example specifies the grammar for a basic 4-operation calculator.
398
399              PEG calculator (Expression)
400                  Digit      <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'       ;
401                  Sign       <- '-' / '+'                                     ;
402                  Number     <- Sign? Digit+                                  ;
403                  Expression <- Term (AddOp Term)*                            ;
404                  MulOp      <- '*' / '/'                                     ;
405                  Term       <- Factor (MulOp Factor)*                        ;
406                  AddOp      <- '+'/'-'                                       ;
407                  Factor     <- '(' Expression ')' / Number                   ;
408              END;
409
410
411       Using higher-level features of the notation, i.e. the character classes
412       (predefined and custom), this example can be rewritten as
413
414              PEG calculator (Expression)
415                  Sign       <- [-+] ;
416                  Number     <- Sign? <ddigit>+;
417                  Expression <- '(' Expression ')' / (Factor (MulOp Factor)*);
418                  MulOp      <- [*/];
419                  Factor     <- Term (AddOp Term)*;
420                  AddOp      <- [-+];
421                  Term       <- Number;
422              END;
423
424

BUGS, IDEAS, FEEDBACK

426       This  document,  and the package it describes, will undoubtedly contain
427       bugs and other problems.  Please report such in the category pt of  the
428       Tcllib  Trackers  [http://core.tcl.tk/tcllib/reportlist].   Please also
429       report any ideas for enhancements  you  may  have  for  either  package
430       and/or documentation.
431
432       When proposing code changes, please provide unified diffs, i.e the out‐
433       put of diff -u.
434
435       Note further that  attachments  are  strongly  preferred  over  inlined
436       patches.  Attachments  can  be  made  by  going to the Edit form of the
437       ticket immediately after its creation, and  then  using  the  left-most
438       button in the secondary navigation bar.
439

KEYWORDS

441       EBNF,  LL(k),  PEG,  TDPL, context-free languages, expression, grammar,
442       matching, parser, parsing expression, parsing expression grammar,  push
443       down  automaton,  recursive descent, state, top-down parsing languages,
444       transducer
445

COPYRIGHT

450       Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
451
452
453
454
455tcllib                                 1                   pt::peg_language(n)