1pt::peg_language(n) Parser Tools pt::peg_language(n)
2
3
4
5______________________________________________________________________________
6
8 pt::peg_language - PEG Language Tutorial
9
11 package require Tcl 8.5
12
13______________________________________________________________________________
14
16 Are you lost ? Do you have trouble understanding this document ? In
17 that case please read the overview provided by the Introduction to
18 Parser Tools. This document is the entrypoint to the whole system the
19 current package is a part of.
20
21 Welcome to the tutorial / introduction for the PEG Specification Lan‐
22 guage. If you are already familiar with the language we are about to
23 discuss, and only wish to refresh your memory you can, of course, skip
24 ahead to the aforementioned section and just read the full formal spec‐
25 ification.
26
28 peg, a language for the specification of parsing expression grammars is
29 meant to be human readable, and writable as well, yet strict enough to
30 allow its processing by machine. Like any computer language. It was
31 defined to make writing the specification of a grammar easy, something
32 the other formats found in the Parser Tools do not lend themselves too.
33
35 BASIC STRUCTURE
36 The general outline of a textual PEG is
37
38
39 PEG <<name>> (<<start-expression>>)
40 <<rules>>
41 END;
42
43 Note: We are using text in double angle-brackets as place-holders for
44 things not yet explained.
45
46 NAMES
47 Names are mostly used to identify the nonterminal symbols of the gram‐
48 mar, i.e. that which occurs on the left-hand side of a <rule>. The
49 exception to that is the name given after the keyword PEG (see previous
50 section), which is the name of the whole grammar itself.
51
52 The structure of a name is simple:
53
54 [1] It begins with a letter, underscore, or colon, followed by
55
56 [2] zero or more letters, digits, underscores, or colons.
57
58 Or, in formal textual notation:
59
60
61 ([_:] / <alpha>) ([_:] / <alnum>)*
62
63 Examples of names:
64
65
66 Hello
67 ::world
68 _:submarine55_
69
70 Examples of text which are not names:
71
72
73 12
74 .bogus
75 0wrong
76 @location
77
78
79 RULES
80 The main body of the text of a grammar specification is taken up by the
81 rules. Each rule defines the sentence structure of one nonterminal sym‐
82 bol. Their basic structure is
83
84
85 <<name>> <- <<expression>> ;
86
87 The <name> specifies the nonterminal symbol to be defined, the <expres‐
88 sion> after the arrow (<-) then declares its structure.
89
90 Note that each rule ends in a single semicolon, even the last. I.e.
91 the semicolon is a rule terminator, not a separator.
92
93 We can have as many rules as we like, as long as we define each nonter‐
94 minal symbol at most once, and have at least one rule for each nonter‐
95 minal symbol which occured in an expression, i.e. in either the start
96 expression of the grammar, or the right-hande side of a rule.
97
98 EXPRESSIONS
99 The parsing expressions are the meat of any specification. They declare
100 the structure of the whole document (<<start-expression>>), and of all
101 nonterminal symbols.
102
103 All expressions are made up out of atomic expressions and operators
104 combining them. We have operators for choosing between alternatives,
105 repetition of parts, and for look-ahead constraints. There is no
106 explicit operator for the sequencing (also known as concatenation) of
107 parts however. This is specified by simply placing the parts adjacent
108 to each other.
109
110 Here are the operators, from highest to lowest priority (i.e. strength
111 of binding):
112
113
114 # Binary operators.
115
116 <<expression-1>> <<expression-2>> # sequence. parse 1, then 2.
117 <<expression-1>> / <<expression-2>> # alternative. try to parse 1, and parse 2 if 1 failed to parse.
118
119 # Prefix operators. Lookahead constraints. Same priority.
120
121 & <<expression>> # Parse expression, ok on successful parse.
122 ! <<expression>> # Ditto, except ok on failure to parse.
123
124 # Suffix operators. Repetition. Same priority.
125
126 <<expression>> ? # Parse expression none, or once (repeat 0 or 1).
127 <<expression>> * # Parse expression zero or more times.
128 <<expression>> + # Parse expression one or more times.
129
130 # Expression nesting
131
132 ( <<expression>> ) # Put an expression in parens to change its priority.
133
134 With this we can now deconstruct the formal expression for names given
135 in section Names:
136
137
138 ([_:] / <alpha>) ([_:] / <alnum>)*
139
140 It is a sequence of two parts,
141
142 [_:] / <alpha>
143 and
144
145 ([_:] / <alnum>)*
146 The parentheses around the parts kept their inner alternatives bound
147 together against the normally higher priority of the sequence. Each of
148 the two parts is an alternative, with the second part additionally
149 repeated zero or more times, leaving us with the three atomic expres‐
150 sions
151
152
153 [_:]
154 <alpha>
155 <alnum>
156
157 And atomic expressions are our next topic. They fall into three
158 classes:
159
160 [1] names, i.e. nonterminal symbols,
161
162 [2] string literals, and
163
164 [3] character classes.
165
166 Names we know about already, or see section Names for a refresher.
167
168 String literals are simple. They are delimited by (i.e. start and end
169 with) either a single or double-apostroph, and in between the delim‐
170 iters we can have any character but the delimiter itself. They can be
171 empty as well. Examples of strings are
172
173
174 ''
175 ""
176 'hello'
177 "umbra"
178 "'"
179 '"'
180
181 The last two examples show how to place any of the delimiters into a
182 string.
183
184 For the last, but not least of our atomic expressions, character
185 classes, we have a number of predefined classes, shown below, and the
186 ability to construct or own. The predefined classes are:
187
188
189 <alnum> # Any unicode alphabet or digit character (string is alnum).
190 <alpha> # Any unicode alphabet character (string is alpha).
191 <ascii> # Any unicode character below codepoint 0x80 (string is ascii).
192 <control> # Any unicode control character (string is control).
193 <ddigit> # The digit characters [0-9].
194 <digit> # Any unicode digit character (string is digit).
195 <graph> # Any unicode printing character, except space (string is graph).
196 <lower> # Any unicode lower-case alphabet character (string is lower).
197 <print> # Any unicode printing character, incl. space (string is print).
198 <punct> # Any unicode punctuation character (string is punct).
199 <space> # Any unicode space character (string is space).
200 <upper> # Any unicode upper-case alphabet character (string is upper).
201 <wordchar> # Any unicode word character (string is wordchar).
202 <xdigit> # The hexadecimal digit characters [0-9a-fA-F].
203 . # Any character, except end of input.
204
205 And the syntax of custom-defined character classes is
206
207
208 [ <<range>>* ]
209
210 where each range is either a single character, or of the form
211
212
213 <<character>> - <character>>
214
215 Examples for character classes we have seen already in the course of
216 this introduction are
217
218
219 [_:]
220 [0-9]
221 [0-9a-fA-F]
222
223 We are nearly done with expressions. The only piece left is to tell how
224 the characters in character classes and string literals are specified.
225
226 Basically characters in the input stand for themselves, and in addition
227 to that we several types of escape syntax to to repesent control char‐
228 acters, or characters outside of the encoding the text is in.
229
230 All the escaped forms are started with a backslash character ('\', uni‐
231 code codepoint 0x5C). This is then followed by a series of octal dig‐
232 its, or 'u' and hexedecimal digits, or a regular character from a fixed
233 set for various control characters. Some examples:
234
235
236 \n \r \t \' \" \[ \] \\ #
237 \000 up to \277 # octal escape, all ascii character, leading 0's can be removed.
238 \u2CA7 # hexadecimal escape, all unicode characters.
239 # # Here 2ca7 <=> Koptic Small Letter Tau
240
241
242 WHITESPACE AND COMMENTS
243 One issue not touched upon so far is whitespace and comments.
244
245 Whitespace is any unicode space character, i.e. anything in the charac‐
246 ter class <space>, and comments. The latter are sequences of characters
247 starting with a '#' (hash, unicode codepoint 0x23) and ending at the
248 next end-of-line.
249
250 Whitespace can be freely used between all syntactical elements of a
251 grammar specification. It cannot be used inside of syntactical ele‐
252 ments, like names, string literals, predefined character classes, etc.
253
254 NONTERMINAL ATTRIBUTES
255 Lastly, a more advanced topic. In the section Rules we gave the struc‐
256 ture of a rule as
257
258
259 <<name>> <- <<expression>> ;
260
261 This is not quite true. It is possible to associate a semantic mode
262 with the nonterminal in the rule, by writing it before the name, sepa‐
263 rated from it by a colon, i.e. writing
264
265
266 <<mode>> : <<name>> <- <<expression>> ;
267
268 is also allowed. This mode is optional. The known modes and their mean‐
269 ings are:
270
271 value The semantic value of the nonterminal symbol is an abstract syn‐
272 tax tree consisting of a single node node for the nonterminal
273 itself, which has the ASTs of the symbol's right hand side as
274 its children.
275
276 leaf The semantic value of the nonterminal symbol is an abstract syn‐
277 tax tree consisting of a single node node for the nonterminal,
278 without any children. Any ASTs generated by the symbol's right
279 hand side are discarded.
280
281 void The nonterminal has no semantic value. Any ASTs generated by the
282 symbol's right hand side are discarded (as well).
283
284 Of these three modes only leaf and void can be specified directly.
285 value is implicitly specified by the absence of a mode before the non‐
286 terminal.
287
288 Now, with all the above under our belt it should be possible to not
289 only read, but understand the formal specification of the text repre‐
290 sentation shown in the next section, written in itself.
291
293 peg, a language for the specification of parsing expression grammars is
294 meant to be human readable, and writable as well, yet strict enough to
295 allow its processing by machine. Like any computer language. It was
296 defined to make writing the specification of a grammar easy, something
297 the other formats found in the Parser Tools do not lend themselves too.
298
299 It is formally specified by the grammar shown below, written in itself.
300 For a tutorial / introduction to the language please go and read the
301 PEG Language Tutorial.
302
303 PEG pe-grammar-for-peg (Grammar)
304
305 # --------------------------------------------------------------------
306 # Syntactical constructs
307
308 Grammar <- WHITESPACE Header Definition* Final EOF ;
309
310 Header <- PEG Identifier StartExpr ;
311 Definition <- Attribute? Identifier IS Expression SEMICOLON ;
312 Attribute <- (VOID / LEAF) COLON ;
313 Expression <- Sequence (SLASH Sequence)* ;
314 Sequence <- Prefix+ ;
315 Prefix <- (AND / NOT)? Suffix ;
316 Suffix <- Primary (QUESTION / STAR / PLUS)? ;
317 Primary <- ALNUM / ALPHA / ASCII / CONTROL / DDIGIT / DIGIT
318 / GRAPH / LOWER / PRINTABLE / PUNCT / SPACE / UPPER
319 / WORDCHAR / XDIGIT
320 / Identifier
321 / OPEN Expression CLOSE
322 / Literal
323 / Class
324 / DOT
325 ;
326 Literal <- APOSTROPH (!APOSTROPH Char)* APOSTROPH WHITESPACE
327 / DAPOSTROPH (!DAPOSTROPH Char)* DAPOSTROPH WHITESPACE ;
328 Class <- OPENB (!CLOSEB Range)* CLOSEB WHITESPACE ;
329 Range <- Char TO Char / Char ;
330
331 StartExpr <- OPEN Expression CLOSE ;
332 void: Final <- "END" WHITESPACE SEMICOLON WHITESPACE ;
333
334 # --------------------------------------------------------------------
335 # Lexing constructs
336
337 Identifier <- Ident WHITESPACE ;
338 leaf: Ident <- ([_:] / <alpha>) ([_:] / <alnum>)* ;
339 Char <- CharSpecial / CharOctalFull / CharOctalPart
340 / CharUnicode / CharUnescaped
341 ;
342
343 leaf: CharSpecial <- "\\" [nrt'"\[\]\\] ;
344 leaf: CharOctalFull <- "\\" [0-2][0-7][0-7] ;
345 leaf: CharOctalPart <- "\\" [0-7][0-7]? ;
346 leaf: CharUnicode <- "\\" 'u' HexDigit (HexDigit (HexDigit HexDigit?)?)? ;
347 leaf: CharUnescaped <- !"\\" . ;
348
349 void: HexDigit <- [0-9a-fA-F] ;
350
351 void: TO <- '-' ;
352 void: OPENB <- "[" ;
353 void: CLOSEB <- "]" ;
354 void: APOSTROPH <- "'" ;
355 void: DAPOSTROPH <- '"' ;
356 void: PEG <- "PEG" !([_:] / <alnum>) WHITESPACE ;
357 void: IS <- "<-" WHITESPACE ;
358 leaf: VOID <- "void" WHITESPACE ; # Implies that definition has no semantic value.
359 leaf: LEAF <- "leaf" WHITESPACE ; # Implies that definition has no terminals.
360 void: SEMICOLON <- ";" WHITESPACE ;
361 void: COLON <- ":" WHITESPACE ;
362 void: SLASH <- "/" WHITESPACE ;
363 leaf: AND <- "&" WHITESPACE ;
364 leaf: NOT <- "!" WHITESPACE ;
365 leaf: QUESTION <- "?" WHITESPACE ;
366 leaf: STAR <- "*" WHITESPACE ;
367 leaf: PLUS <- "+" WHITESPACE ;
368 void: OPEN <- "(" WHITESPACE ;
369 void: CLOSE <- ")" WHITESPACE ;
370 leaf: DOT <- "." WHITESPACE ;
371
372 leaf: ALNUM <- "<alnum>" WHITESPACE ;
373 leaf: ALPHA <- "<alpha>" WHITESPACE ;
374 leaf: ASCII <- "<ascii>" WHITESPACE ;
375 leaf: CONTROL <- "<control>" WHITESPACE ;
376 leaf: DDIGIT <- "<ddigit>" WHITESPACE ;
377 leaf: DIGIT <- "<digit>" WHITESPACE ;
378 leaf: GRAPH <- "<graph>" WHITESPACE ;
379 leaf: LOWER <- "<lower>" WHITESPACE ;
380 leaf: PRINTABLE <- "<print>" WHITESPACE ;
381 leaf: PUNCT <- "<punct>" WHITESPACE ;
382 leaf: SPACE <- "<space>" WHITESPACE ;
383 leaf: UPPER <- "<upper>" WHITESPACE ;
384 leaf: WORDCHAR <- "<wordchar>" WHITESPACE ;
385 leaf: XDIGIT <- "<xdigit>" WHITESPACE ;
386
387 void: WHITESPACE <- (" " / "\t" / EOL / COMMENT)* ;
388 void: COMMENT <- '#' (!EOL .)* EOL ;
389 void: EOL <- "\n\r" / "\n" / "\r" ;
390 void: EOF <- !. ;
391
392 # --------------------------------------------------------------------
393 END;
394
395
396 EXAMPLE
397 Our example specifies the grammar for a basic 4-operation calculator.
398
399 PEG calculator (Expression)
400 Digit <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9' ;
401 Sign <- '-' / '+' ;
402 Number <- Sign? Digit+ ;
403 Expression <- Term (AddOp Term)* ;
404 MulOp <- '*' / '/' ;
405 Term <- Factor (MulOp Factor)* ;
406 AddOp <- '+'/'-' ;
407 Factor <- '(' Expression ')' / Number ;
408 END;
409
410
411 Using higher-level features of the notation, i.e. the character classes
412 (predefined and custom), this example can be rewritten as
413
414 PEG calculator (Expression)
415 Sign <- [-+] ;
416 Number <- Sign? <ddigit>+;
417 Expression <- '(' Expression ')' / (Factor (MulOp Factor)*);
418 MulOp <- [*/];
419 Factor <- Term (AddOp Term)*;
420 AddOp <- [-+];
421 Term <- Number;
422 END;
423
424
426 This document, and the package it describes, will undoubtedly contain
427 bugs and other problems. Please report such in the category pt of the
428 Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist]. Please also
429 report any ideas for enhancements you may have for either package
430 and/or documentation.
431
432 When proposing code changes, please provide unified diffs, i.e the out‐
433 put of diff -u.
434
435 Note further that attachments are strongly preferred over inlined
436 patches. Attachments can be made by going to the Edit form of the
437 ticket immediately after its creation, and then using the left-most
438 button in the secondary navigation bar.
439
441 EBNF, LL(k), PEG, TDPL, context-free languages, expression, grammar,
442 matching, parser, parsing expression, parsing expression grammar, push
443 down automaton, recursive descent, state, top-down parsing languages,
444 transducer
445
447 Parsing and Grammars
448
450 Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
451
452
453
454
455tcllib 1 pt::peg_language(n)