1pt::peg_language(n) Parser Tools pt::peg_language(n)
2
3
4
5______________________________________________________________________________
6
8 pt::peg_language - PEG Language Tutorial
9
11 package require Tcl 8.5
12
13______________________________________________________________________________
14
16 Are you lost ? Do you have trouble understanding this document ? In
17 that case please read the overview provided by the Introduction to
18 Parser Tools. This document is the entrypoint to the whole system the
19 current package is a part of.
20
21 Welcome to the tutorial / introduction for the PEG Specification Lan‐
22 guage. If you are already familiar with the language we are about to
23 discuss, and only wish to refresh your memory you can, of course, skip
24 ahead to the aforementioned section and just read the full formal spec‐
25 ification.
26
28 peg, a language for the specification of parsing expression grammars is
29 meant to be human readable, and writable as well, yet strict enough to
30 allow its processing by machine. Like any computer language. It was de‐
31 fined to make writing the specification of a grammar easy, something
32 the other formats found in the Parser Tools do not lend themselves too.
33
35 BASIC STRUCTURE
36 The general outline of a textual PEG is
37
38
39 PEG <<name>> (<<start-expression>>)
40 <<rules>>
41 END;
42
43 Note: We are using text in double angle-brackets as place-holders for
44 things not yet explained.
45
46 NAMES
47 Names are mostly used to identify the nonterminal symbols of the gram‐
48 mar, i.e. that which occurs on the left-hand side of a <rule>. The ex‐
49 ception to that is the name given after the keyword PEG (see previous
50 section), which is the name of the whole grammar itself.
51
52 The structure of a name is simple:
53
54 [1] It begins with a letter, underscore, or colon, followed by
55
56 [2] zero or more letters, digits, underscores, or colons.
57
58 Or, in formal textual notation:
59
60
61 ([_:] / <alpha>) ([_:] / <alnum>)*
62
63 Examples of names:
64
65
66 Hello
67 ::world
68 _:submarine55_
69
70 Examples of text which are not names:
71
72
73 12
74 .bogus
75 0wrong
76 @location
77
78
79 RULES
80 The main body of the text of a grammar specification is taken up by the
81 rules. Each rule defines the sentence structure of one nonterminal sym‐
82 bol. Their basic structure is
83
84
85 <<name>> <- <<expression>> ;
86
87 The <name> specifies the nonterminal symbol to be defined, the <expres‐
88 sion> after the arrow (<-) then declares its structure.
89
90 Note that each rule ends in a single semicolon, even the last. I.e.
91 the semicolon is a rule terminator, not a separator.
92
93 We can have as many rules as we like, as long as we define each nonter‐
94 minal symbol at most once, and have at least one rule for each nonter‐
95 minal symbol which occured in an expression, i.e. in either the start
96 expression of the grammar, or the right-hande side of a rule.
97
98 EXPRESSIONS
99 The parsing expressions are the meat of any specification. They declare
100 the structure of the whole document (<<start-expression>>), and of all
101 nonterminal symbols.
102
103 All expressions are made up out of atomic expressions and operators
104 combining them. We have operators for choosing between alternatives,
105 repetition of parts, and for look-ahead constraints. There is no ex‐
106 plicit operator for the sequencing (also known as concatenation) of
107 parts however. This is specified by simply placing the parts adjacent
108 to each other.
109
110 Here are the operators, from highest to lowest priority (i.e. strength
111 of binding):
112
113
114 # Binary operators.
115
116 <<expression-1>> <<expression-2>> # sequence. parse 1, then 2.
117 <<expression-1>> / <<expression-2>> # alternative. try to parse 1, and parse 2 if 1 failed to parse.
118
119 # Prefix operators. Lookahead constraints. Same priority.
120
121 & <<expression>> # Parse expression, ok on successful parse.
122 ! <<expression>> # Ditto, except ok on failure to parse.
123
124 # Suffix operators. Repetition. Same priority.
125
126 <<expression>> ? # Parse expression none, or once (repeat 0 or 1).
127 <<expression>> * # Parse expression zero or more times.
128 <<expression>> + # Parse expression one or more times.
129
130 # Expression nesting
131
132 ( <<expression>> ) # Put an expression in parens to change its priority.
133
134 With this we can now deconstruct the formal expression for names given
135 in section Names:
136
137
138 ([_:] / <alpha>) ([_:] / <alnum>)*
139
140 It is a sequence of two parts,
141
142 [_:] / <alpha>
143 and
144
145 ([_:] / <alnum>)*
146 The parentheses around the parts kept their inner alternatives bound
147 together against the normally higher priority of the sequence. Each of
148 the two parts is an alternative, with the second part additionally re‐
149 peated zero or more times, leaving us with the three atomic expressions
150
151
152 [_:]
153 <alpha>
154 <alnum>
155
156 And atomic expressions are our next topic. They fall into three
157 classes:
158
159 [1] names, i.e. nonterminal symbols,
160
161 [2] string literals, and
162
163 [3] character classes.
164
165 Names we know about already, or see section Names for a refresher.
166
167 String literals are simple. They are delimited by (i.e. start and end
168 with) either a single or double-apostroph, and in between the delim‐
169 iters we can have any character but the delimiter itself. They can be
170 empty as well. Examples of strings are
171
172
173 ''
174 ""
175 'hello'
176 "umbra"
177 "'"
178 '"'
179
180 The last two examples show how to place any of the delimiters into a
181 string.
182
183 For the last, but not least of our atomic expressions, character
184 classes, we have a number of predefined classes, shown below, and the
185 ability to construct or own. The predefined classes are:
186
187
188 <alnum> # Any unicode alphabet or digit character (string is alnum).
189 <alpha> # Any unicode alphabet character (string is alpha).
190 <ascii> # Any unicode character below codepoint 0x80 (string is ascii).
191 <control> # Any unicode control character (string is control).
192 <ddigit> # The digit characters [0-9].
193 <digit> # Any unicode digit character (string is digit).
194 <graph> # Any unicode printing character, except space (string is graph).
195 <lower> # Any unicode lower-case alphabet character (string is lower).
196 <print> # Any unicode printing character, incl. space (string is print).
197 <punct> # Any unicode punctuation character (string is punct).
198 <space> # Any unicode space character (string is space).
199 <upper> # Any unicode upper-case alphabet character (string is upper).
200 <wordchar> # Any unicode word character (string is wordchar).
201 <xdigit> # The hexadecimal digit characters [0-9a-fA-F].
202 . # Any character, except end of input.
203
204 And the syntax of custom-defined character classes is
205
206
207 [ <<range>>* ]
208
209 where each range is either a single character, or of the form
210
211
212 <<character>> - <character>>
213
214 Examples for character classes we have seen already in the course of
215 this introduction are
216
217
218 [_:]
219 [0-9]
220 [0-9a-fA-F]
221
222 We are nearly done with expressions. The only piece left is to tell how
223 the characters in character classes and string literals are specified.
224
225 Basically characters in the input stand for themselves, and in addition
226 to that we several types of escape syntax to to repesent control char‐
227 acters, or characters outside of the encoding the text is in.
228
229 All the escaped forms are started with a backslash character ('\', uni‐
230 code codepoint 0x5C). This is then followed by a series of octal dig‐
231 its, or 'u' and hexedecimal digits, or a regular character from a fixed
232 set for various control characters. Some examples:
233
234
235 \n \r \t \' \" \[ \] \\ #
236 \000 up to \277 # octal escape, all ascii character, leading 0's can be removed.
237 \u2CA7 # hexadecimal escape, all unicode characters.
238 # # Here 2ca7 <=> Koptic Small Letter Tau
239
240
241 WHITESPACE AND COMMENTS
242 One issue not touched upon so far is whitespace and comments.
243
244 Whitespace is any unicode space character, i.e. anything in the charac‐
245 ter class <space>, and comments. The latter are sequences of characters
246 starting with a '#' (hash, unicode codepoint 0x23) and ending at the
247 next end-of-line.
248
249 Whitespace can be freely used between all syntactical elements of a
250 grammar specification. It cannot be used inside of syntactical ele‐
251 ments, like names, string literals, predefined character classes, etc.
252
253 NONTERMINAL ATTRIBUTES
254 Lastly, a more advanced topic. In the section Rules we gave the struc‐
255 ture of a rule as
256
257
258 <<name>> <- <<expression>> ;
259
260 This is not quite true. It is possible to associate a semantic mode
261 with the nonterminal in the rule, by writing it before the name, sepa‐
262 rated from it by a colon, i.e. writing
263
264
265 <<mode>> : <<name>> <- <<expression>> ;
266
267 is also allowed. This mode is optional. The known modes and their mean‐
268 ings are:
269
270 value The semantic value of the nonterminal symbol is an abstract syn‐
271 tax tree consisting of a single node node for the nonterminal
272 itself, which has the ASTs of the symbol's right hand side as
273 its children.
274
275 leaf The semantic value of the nonterminal symbol is an abstract syn‐
276 tax tree consisting of a single node node for the nonterminal,
277 without any children. Any ASTs generated by the symbol's right
278 hand side are discarded.
279
280 void The nonterminal has no semantic value. Any ASTs generated by the
281 symbol's right hand side are discarded (as well).
282
283 Of these three modes only leaf and void can be specified directly.
284 value is implicitly specified by the absence of a mode before the non‐
285 terminal.
286
287 Now, with all the above under our belt it should be possible to not
288 only read, but understand the formal specification of the text repre‐
289 sentation shown in the next section, written in itself.
290
292 peg, a language for the specification of parsing expression grammars is
293 meant to be human readable, and writable as well, yet strict enough to
294 allow its processing by machine. Like any computer language. It was de‐
295 fined to make writing the specification of a grammar easy, something
296 the other formats found in the Parser Tools do not lend themselves too.
297
298 It is formally specified by the grammar shown below, written in itself.
299 For a tutorial / introduction to the language please go and read the
300 PEG Language Tutorial.
301
302 PEG pe-grammar-for-peg (Grammar)
303
304 # --------------------------------------------------------------------
305 # Syntactical constructs
306
307 Grammar <- WHITESPACE Header Definition* Final EOF ;
308
309 Header <- PEG Identifier StartExpr ;
310 Definition <- Attribute? Identifier IS Expression SEMICOLON ;
311 Attribute <- (VOID / LEAF) COLON ;
312 Expression <- Sequence (SLASH Sequence)* ;
313 Sequence <- Prefix+ ;
314 Prefix <- (AND / NOT)? Suffix ;
315 Suffix <- Primary (QUESTION / STAR / PLUS)? ;
316 Primary <- ALNUM / ALPHA / ASCII / CONTROL / DDIGIT / DIGIT
317 / GRAPH / LOWER / PRINTABLE / PUNCT / SPACE / UPPER
318 / WORDCHAR / XDIGIT
319 / Identifier
320 / OPEN Expression CLOSE
321 / Literal
322 / Class
323 / DOT
324 ;
325 Literal <- APOSTROPH (!APOSTROPH Char)* APOSTROPH WHITESPACE
326 / DAPOSTROPH (!DAPOSTROPH Char)* DAPOSTROPH WHITESPACE ;
327 Class <- OPENB (!CLOSEB Range)* CLOSEB WHITESPACE ;
328 Range <- Char TO Char / Char ;
329
330 StartExpr <- OPEN Expression CLOSE ;
331 void: Final <- "END" WHITESPACE SEMICOLON WHITESPACE ;
332
333 # --------------------------------------------------------------------
334 # Lexing constructs
335
336 Identifier <- Ident WHITESPACE ;
337 leaf: Ident <- ([_:] / <alpha>) ([_:] / <alnum>)* ;
338 Char <- CharSpecial / CharOctalFull / CharOctalPart
339 / CharUnicode / CharUnescaped
340 ;
341
342 leaf: CharSpecial <- "\\" [nrt'"\[\]\\] ;
343 leaf: CharOctalFull <- "\\" [0-2][0-7][0-7] ;
344 leaf: CharOctalPart <- "\\" [0-7][0-7]? ;
345 leaf: CharUnicode <- "\\" 'u' HexDigit (HexDigit (HexDigit HexDigit?)?)? ;
346 leaf: CharUnescaped <- !"\\" . ;
347
348 void: HexDigit <- [0-9a-fA-F] ;
349
350 void: TO <- '-' ;
351 void: OPENB <- "[" ;
352 void: CLOSEB <- "]" ;
353 void: APOSTROPH <- "'" ;
354 void: DAPOSTROPH <- '"' ;
355 void: PEG <- "PEG" !([_:] / <alnum>) WHITESPACE ;
356 void: IS <- "<-" WHITESPACE ;
357 leaf: VOID <- "void" WHITESPACE ; # Implies that definition has no semantic value.
358 leaf: LEAF <- "leaf" WHITESPACE ; # Implies that definition has no terminals.
359 void: SEMICOLON <- ";" WHITESPACE ;
360 void: COLON <- ":" WHITESPACE ;
361 void: SLASH <- "/" WHITESPACE ;
362 leaf: AND <- "&" WHITESPACE ;
363 leaf: NOT <- "!" WHITESPACE ;
364 leaf: QUESTION <- "?" WHITESPACE ;
365 leaf: STAR <- "*" WHITESPACE ;
366 leaf: PLUS <- "+" WHITESPACE ;
367 void: OPEN <- "(" WHITESPACE ;
368 void: CLOSE <- ")" WHITESPACE ;
369 leaf: DOT <- "." WHITESPACE ;
370
371 leaf: ALNUM <- "<alnum>" WHITESPACE ;
372 leaf: ALPHA <- "<alpha>" WHITESPACE ;
373 leaf: ASCII <- "<ascii>" WHITESPACE ;
374 leaf: CONTROL <- "<control>" WHITESPACE ;
375 leaf: DDIGIT <- "<ddigit>" WHITESPACE ;
376 leaf: DIGIT <- "<digit>" WHITESPACE ;
377 leaf: GRAPH <- "<graph>" WHITESPACE ;
378 leaf: LOWER <- "<lower>" WHITESPACE ;
379 leaf: PRINTABLE <- "<print>" WHITESPACE ;
380 leaf: PUNCT <- "<punct>" WHITESPACE ;
381 leaf: SPACE <- "<space>" WHITESPACE ;
382 leaf: UPPER <- "<upper>" WHITESPACE ;
383 leaf: WORDCHAR <- "<wordchar>" WHITESPACE ;
384 leaf: XDIGIT <- "<xdigit>" WHITESPACE ;
385
386 void: WHITESPACE <- (" " / "\t" / EOL / COMMENT)* ;
387 void: COMMENT <- '#' (!EOL .)* EOL ;
388 void: EOL <- "\n\r" / "\n" / "\r" ;
389 void: EOF <- !. ;
390
391 # --------------------------------------------------------------------
392 END;
393
394
395 EXAMPLE
396 Our example specifies the grammar for a basic 4-operation calculator.
397
398 PEG calculator (Expression)
399 Digit <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9' ;
400 Sign <- '-' / '+' ;
401 Number <- Sign? Digit+ ;
402 Expression <- Term (AddOp Term)* ;
403 MulOp <- '*' / '/' ;
404 Term <- Factor (MulOp Factor)* ;
405 AddOp <- '+'/'-' ;
406 Factor <- '(' Expression ')' / Number ;
407 END;
408
409
410 Using higher-level features of the notation, i.e. the character classes
411 (predefined and custom), this example can be rewritten as
412
413 PEG calculator (Expression)
414 Sign <- [-+] ;
415 Number <- Sign? <ddigit>+;
416 Expression <- '(' Expression ')' / (Factor (MulOp Factor)*);
417 MulOp <- [*/];
418 Factor <- Term (AddOp Term)*;
419 AddOp <- [-+];
420 Term <- Number;
421 END;
422
423
425 This document, and the package it describes, will undoubtedly contain
426 bugs and other problems. Please report such in the category pt of the
427 Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist]. Please also
428 report any ideas for enhancements you may have for either package
429 and/or documentation.
430
431 When proposing code changes, please provide unified diffs, i.e the out‐
432 put of diff -u.
433
434 Note further that attachments are strongly preferred over inlined
435 patches. Attachments can be made by going to the Edit form of the
436 ticket immediately after its creation, and then using the left-most
437 button in the secondary navigation bar.
438
440 EBNF, LL(k), PEG, TDPL, context-free languages, expression, grammar,
441 matching, parser, parsing expression, parsing expression grammar, push
442 down automaton, recursive descent, state, top-down parsing languages,
443 transducer
444
446 Parsing and Grammars
447
449 Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
450
451
452
453
454tcllib 1 pt::peg_language(n)