1pt::peg::import::peg(n) Parser Tools pt::peg::import::peg(n)
2
3
4
5______________________________________________________________________________
6
8 pt::peg::import::peg - PEG Import Plugin. Read PEG format
9
11 package require Tcl 8.5
12
13 package require pt::peg::import::peg ?1?
14
15 package require pt::peg::to::peg
16
17 import text
18
19______________________________________________________________________________
20
22 Are you lost ? Do you have trouble understanding this document ? In
23 that case please read the overview provided by the Introduction to
24 Parser Tools. This document is the entrypoint to the whole system the
25 current package is a part of.
26
27 This package implements the parsing expression grammar import plugin
28 processing PEG markup.
29
30 It resides in the Import section of the Core Layer of Parser Tools and
31 is intended to be used by pt::peg::import, the import manager, sitting
32 between it and the corresponding core conversion functionality provided
33 by pt::peg::from::peg.
34
35 IMAGE: arch_core_iplugins
36
37 While the direct use of this package with a regular interpreter is pos‐
38 sible, this is strongly disrecommended and requires a number of contor‐
39 tions to provide the expected environment. The proper way to use this
40 functionality depends on the situation:
41
42 [1] In an untrusted environment the proper access is through the
43 package pt::peg::import and the import manager objects it pro‐
44 vides.
45
46 [2] In a trusted environment however simply use the package
47 pt::peg::from::peg and access the core conversion functionality
48 directly.
49
51 The API provided by this package satisfies the specification of the
52 Plugin API found in the Parser Tools Import API specification.
53
54 import text
55 This command takes the PEG markup encoding a parsing expression
56 grammar and contained in text, and generates the canonical seri‐
57 alization of said grammar, as specified in section PEG serial‐
58 ization format. The created value is then returned as the
59 result of the command.
60
62 peg, a language for the specification of parsing expression grammars is
63 meant to be human readable, and writable as well, yet strict enough to
64 allow its processing by machine. Like any computer language. It was
65 defined to make writing the specification of a grammar easy, something
66 the other formats found in the Parser Tools do not lend themselves too.
67
68 It is formally specified by the grammar shown below, written in itself.
69 For a tutorial / introduction to the language please go and read the
70 PEG Language Tutorial.
71
72 PEG pe-grammar-for-peg (Grammar)
73
74 # --------------------------------------------------------------------
75 # Syntactical constructs
76
77 Grammar <- WHITESPACE Header Definition* Final EOF ;
78
79 Header <- PEG Identifier StartExpr ;
80 Definition <- Attribute? Identifier IS Expression SEMICOLON ;
81 Attribute <- (VOID / LEAF) COLON ;
82 Expression <- Sequence (SLASH Sequence)* ;
83 Sequence <- Prefix+ ;
84 Prefix <- (AND / NOT)? Suffix ;
85 Suffix <- Primary (QUESTION / STAR / PLUS)? ;
86 Primary <- ALNUM / ALPHA / ASCII / CONTROL / DDIGIT / DIGIT
87 / GRAPH / LOWER / PRINTABLE / PUNCT / SPACE / UPPER
88 / WORDCHAR / XDIGIT
89 / Identifier
90 / OPEN Expression CLOSE
91 / Literal
92 / Class
93 / DOT
94 ;
95 Literal <- APOSTROPH (!APOSTROPH Char)* APOSTROPH WHITESPACE
96 / DAPOSTROPH (!DAPOSTROPH Char)* DAPOSTROPH WHITESPACE ;
97 Class <- OPENB (!CLOSEB Range)* CLOSEB WHITESPACE ;
98 Range <- Char TO Char / Char ;
99
100 StartExpr <- OPEN Expression CLOSE ;
101 void: Final <- "END" WHITESPACE SEMICOLON WHITESPACE ;
102
103 # --------------------------------------------------------------------
104 # Lexing constructs
105
106 Identifier <- Ident WHITESPACE ;
107 leaf: Ident <- ([_:] / <alpha>) ([_:] / <alnum>)* ;
108 Char <- CharSpecial / CharOctalFull / CharOctalPart
109 / CharUnicode / CharUnescaped
110 ;
111
112 leaf: CharSpecial <- "\\" [nrt'"\[\]\\] ;
113 leaf: CharOctalFull <- "\\" [0-2][0-7][0-7] ;
114 leaf: CharOctalPart <- "\\" [0-7][0-7]? ;
115 leaf: CharUnicode <- "\\" 'u' HexDigit (HexDigit (HexDigit HexDigit?)?)? ;
116 leaf: CharUnescaped <- !"\\" . ;
117
118 void: HexDigit <- [0-9a-fA-F] ;
119
120 void: TO <- '-' ;
121 void: OPENB <- "[" ;
122 void: CLOSEB <- "]" ;
123 void: APOSTROPH <- "'" ;
124 void: DAPOSTROPH <- '"' ;
125 void: PEG <- "PEG" !([_:] / <alnum>) WHITESPACE ;
126 void: IS <- "<-" WHITESPACE ;
127 leaf: VOID <- "void" WHITESPACE ; # Implies that definition has no semantic value.
128 leaf: LEAF <- "leaf" WHITESPACE ; # Implies that definition has no terminals.
129 void: SEMICOLON <- ";" WHITESPACE ;
130 void: COLON <- ":" WHITESPACE ;
131 void: SLASH <- "/" WHITESPACE ;
132 leaf: AND <- "&" WHITESPACE ;
133 leaf: NOT <- "!" WHITESPACE ;
134 leaf: QUESTION <- "?" WHITESPACE ;
135 leaf: STAR <- "*" WHITESPACE ;
136 leaf: PLUS <- "+" WHITESPACE ;
137 void: OPEN <- "(" WHITESPACE ;
138 void: CLOSE <- ")" WHITESPACE ;
139 leaf: DOT <- "." WHITESPACE ;
140
141 leaf: ALNUM <- "<alnum>" WHITESPACE ;
142 leaf: ALPHA <- "<alpha>" WHITESPACE ;
143 leaf: ASCII <- "<ascii>" WHITESPACE ;
144 leaf: CONTROL <- "<control>" WHITESPACE ;
145 leaf: DDIGIT <- "<ddigit>" WHITESPACE ;
146 leaf: DIGIT <- "<digit>" WHITESPACE ;
147 leaf: GRAPH <- "<graph>" WHITESPACE ;
148 leaf: LOWER <- "<lower>" WHITESPACE ;
149 leaf: PRINTABLE <- "<print>" WHITESPACE ;
150 leaf: PUNCT <- "<punct>" WHITESPACE ;
151 leaf: SPACE <- "<space>" WHITESPACE ;
152 leaf: UPPER <- "<upper>" WHITESPACE ;
153 leaf: WORDCHAR <- "<wordchar>" WHITESPACE ;
154 leaf: XDIGIT <- "<xdigit>" WHITESPACE ;
155
156 void: WHITESPACE <- (" " / "\t" / EOL / COMMENT)* ;
157 void: COMMENT <- '#' (!EOL .)* EOL ;
158 void: EOL <- "\n\r" / "\n" / "\r" ;
159 void: EOF <- !. ;
160
161 # --------------------------------------------------------------------
162 END;
163
164
165 EXAMPLE
166 Our example specifies the grammar for a basic 4-operation calculator.
167
168 PEG calculator (Expression)
169 Digit <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9' ;
170 Sign <- '-' / '+' ;
171 Number <- Sign? Digit+ ;
172 Expression <- Term (AddOp Term)* ;
173 MulOp <- '*' / '/' ;
174 Term <- Factor (MulOp Factor)* ;
175 AddOp <- '+'/'-' ;
176 Factor <- '(' Expression ')' / Number ;
177 END;
178
179
180 Using higher-level features of the notation, i.e. the character classes
181 (predefined and custom), this example can be rewritten as
182
183 PEG calculator (Expression)
184 Sign <- [-+] ;
185 Number <- Sign? <ddigit>+;
186 Expression <- '(' Expression ')' / (Factor (MulOp Factor)*);
187 MulOp <- [*/];
188 Factor <- Term (AddOp Term)*;
189 AddOp <- [-+];
190 Term <- Number;
191 END;
192
193
195 Here we specify the format used by the Parser Tools to serialize Pars‐
196 ing Expression Grammars as immutable values for transport, comparison,
197 etc.
198
199 We distinguish between regular and canonical serializations. While a
200 PEG may have more than one regular serialization only exactly one of
201 them will be canonical.
202
203 regular serialization
204
205 [1] The serialization of any PEG is a nested Tcl dictionary.
206
207 [2] This dictionary holds a single key, pt::grammar::peg, and
208 its value. This value holds the contents of the grammar.
209
210 [3] The contents of the grammar are a Tcl dictionary holding
211 the set of nonterminal symbols and the starting expres‐
212 sion. The relevant keys and their values are
213
214 rules The value is a Tcl dictionary whose keys are the
215 names of the nonterminal symbols known to the
216 grammar.
217
218 [1] Each nonterminal symbol may occur only
219 once.
220
221 [2] The empty string is not a legal nonterminal
222 symbol.
223
224 [3] The value for each symbol is a Tcl dictio‐
225 nary itself. The relevant keys and their
226 values in this dictionary are
227
228 is The value is the serialization of
229 the parsing expression describing
230 the symbols sentennial structure, as
231 specified in the section PE serial‐
232 ization format.
233
234 mode The value can be one of three values
235 specifying how a parser should han‐
236 dle the semantic value produced by
237 the symbol.
238
239 value The semantic value of the
240 nonterminal symbol is an
241 abstract syntax tree consist‐
242 ing of a single node node for
243 the nonterminal itself, which
244 has the ASTs of the symbol's
245 right hand side as its chil‐
246 dren.
247
248 leaf The semantic value of the
249 nonterminal symbol is an
250 abstract syntax tree consist‐
251 ing of a single node node for
252 the nonterminal, without any
253 children. Any ASTs generated
254 by the symbol's right hand
255 side are discarded.
256
257 void The nonterminal has no seman‐
258 tic value. Any ASTs generated
259 by the symbol's right hand
260 side are discarded (as well).
261
262 start The value is the serialization of the start pars‐
263 ing expression of the grammar, as specified in the
264 section PE serialization format.
265
266 [4] The terminal symbols of the grammar are specified implic‐
267 itly as the set of all terminal symbols used in the start
268 expression and on the RHS of the grammar rules.
269
270 canonical serialization
271 The canonical serialization of a grammar has the format as spec‐
272 ified in the previous item, and then additionally satisfies the
273 constraints below, which make it unique among all the possible
274 serializations of this grammar.
275
276 [1] The keys found in all the nested Tcl dictionaries are
277 sorted in ascending dictionary order, as generated by
278 Tcl's builtin command lsort -increasing -dict.
279
280 [2] The string representation of the value is the canonical
281 representation of a Tcl dictionary. I.e. it does not con‐
282 tain superfluous whitespace.
283
284 EXAMPLE
285 Assuming the following PEG for simple mathematical expressions
286
287 PEG calculator (Expression)
288 Digit <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9' ;
289 Sign <- '-' / '+' ;
290 Number <- Sign? Digit+ ;
291 Expression <- Term (AddOp Term)* ;
292 MulOp <- '*' / '/' ;
293 Term <- Factor (MulOp Factor)* ;
294 AddOp <- '+'/'-' ;
295 Factor <- '(' Expression ')' / Number ;
296 END;
297
298
299 then its canonical serialization (except for whitespace) is
300
301 pt::grammar::peg {
302 rules {
303 AddOp {is {/ {t -} {t +}} mode value}
304 Digit {is {/ {t 0} {t 1} {t 2} {t 3} {t 4} {t 5} {t 6} {t 7} {t 8} {t 9}} mode value}
305 Expression {is {x {n Term} {* {x {n AddOp} {n Term}}}} mode value}
306 Factor {is {/ {x {t (} {n Expression} {t )}} {n Number}} mode value}
307 MulOp {is {/ {t *} {t /}} mode value}
308 Number {is {x {? {n Sign}} {+ {n Digit}}} mode value}
309 Sign {is {/ {t -} {t +}} mode value}
310 Term {is {x {n Factor} {* {x {n MulOp} {n Factor}}}} mode value}
311 }
312 start {n Expression}
313 }
314
315
317 Here we specify the format used by the Parser Tools to serialize Pars‐
318 ing Expressions as immutable values for transport, comparison, etc.
319
320 We distinguish between regular and canonical serializations. While a
321 parsing expression may have more than one regular serialization only
322 exactly one of them will be canonical.
323
324 Regular serialization
325
326 Atomic Parsing Expressions
327
328 [1] The string epsilon is an atomic parsing expres‐
329 sion. It matches the empty string.
330
331 [2] The string dot is an atomic parsing expression. It
332 matches any character.
333
334 [3] The string alnum is an atomic parsing expression.
335 It matches any Unicode alphabet or digit charac‐
336 ter. This is a custom extension of PEs based on
337 Tcl's builtin command string is.
338
339 [4] The string alpha is an atomic parsing expression.
340 It matches any Unicode alphabet character. This is
341 a custom extension of PEs based on Tcl's builtin
342 command string is.
343
344 [5] The string ascii is an atomic parsing expression.
345 It matches any Unicode character below U0080. This
346 is a custom extension of PEs based on Tcl's
347 builtin command string is.
348
349 [6] The string control is an atomic parsing expres‐
350 sion. It matches any Unicode control character.
351 This is a custom extension of PEs based on Tcl's
352 builtin command string is.
353
354 [7] The string digit is an atomic parsing expression.
355 It matches any Unicode digit character. Note that
356 this includes characters outside of the [0..9]
357 range. This is a custom extension of PEs based on
358 Tcl's builtin command string is.
359
360 [8] The string graph is an atomic parsing expression.
361 It matches any Unicode printing character, except
362 for space. This is a custom extension of PEs based
363 on Tcl's builtin command string is.
364
365 [9] The string lower is an atomic parsing expression.
366 It matches any Unicode lower-case alphabet charac‐
367 ter. This is a custom extension of PEs based on
368 Tcl's builtin command string is.
369
370 [10] The string print is an atomic parsing expression.
371 It matches any Unicode printing character, includ‐
372 ing space. This is a custom extension of PEs based
373 on Tcl's builtin command string is.
374
375 [11] The string punct is an atomic parsing expression.
376 It matches any Unicode punctuation character. This
377 is a custom extension of PEs based on Tcl's
378 builtin command string is.
379
380 [12] The string space is an atomic parsing expression.
381 It matches any Unicode space character. This is a
382 custom extension of PEs based on Tcl's builtin
383 command string is.
384
385 [13] The string upper is an atomic parsing expression.
386 It matches any Unicode upper-case alphabet charac‐
387 ter. This is a custom extension of PEs based on
388 Tcl's builtin command string is.
389
390 [14] The string wordchar is an atomic parsing expres‐
391 sion. It matches any Unicode word character. This
392 is any alphanumeric character (see alnum), and any
393 connector punctuation characters (e.g. under‐
394 score). This is a custom extension of PEs based on
395 Tcl's builtin command string is.
396
397 [15] The string xdigit is an atomic parsing expression.
398 It matches any hexadecimal digit character. This
399 is a custom extension of PEs based on Tcl's
400 builtin command string is.
401
402 [16] The string ddigit is an atomic parsing expression.
403 It matches any decimal digit character. This is a
404 custom extension of PEs based on Tcl's builtin
405 command regexp.
406
407 [17] The expression [list t x] is an atomic parsing
408 expression. It matches the terminal string x.
409
410 [18] The expression [list n A] is an atomic parsing
411 expression. It matches the nonterminal A.
412
413 Combined Parsing Expressions
414
415 [1] For parsing expressions e1, e2, ... the result of
416 [list / e1 e2 ... ] is a parsing expression as
417 well. This is the ordered choice, aka prioritized
418 choice.
419
420 [2] For parsing expressions e1, e2, ... the result of
421 [list x e1 e2 ... ] is a parsing expression as
422 well. This is the sequence.
423
424 [3] For a parsing expression e the result of [list *
425 e] is a parsing expression as well. This is the
426 kleene closure, describing zero or more repeti‐
427 tions.
428
429 [4] For a parsing expression e the result of [list +
430 e] is a parsing expression as well. This is the
431 positive kleene closure, describing one or more
432 repetitions.
433
434 [5] For a parsing expression e the result of [list &
435 e] is a parsing expression as well. This is the
436 and lookahead predicate.
437
438 [6] For a parsing expression e the result of [list !
439 e] is a parsing expression as well. This is the
440 not lookahead predicate.
441
442 [7] For a parsing expression e the result of [list ?
443 e] is a parsing expression as well. This is the
444 optional input.
445
446 Canonical serialization
447 The canonical serialization of a parsing expression has the for‐
448 mat as specified in the previous item, and then additionally
449 satisfies the constraints below, which make it unique among all
450 the possible serializations of this parsing expression.
451
452 [1] The string representation of the value is the canonical
453 representation of a pure Tcl list. I.e. it does not con‐
454 tain superfluous whitespace.
455
456 [2] Terminals are not encoded as ranges (where start and end
457 of the range are identical).
458
459 EXAMPLE
460 Assuming the parsing expression shown on the right-hand side of the
461 rule
462
463 Expression <- Term (AddOp Term)*
464
465
466 then its canonical serialization (except for whitespace) is
467
468 {x {n Term} {* {x {n AddOp} {n Term}}}}
469
470
472 This document, and the package it describes, will undoubtedly contain
473 bugs and other problems. Please report such in the category pt of the
474 Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist]. Please also
475 report any ideas for enhancements you may have for either package
476 and/or documentation.
477
478 When proposing code changes, please provide unified diffs, i.e the out‐
479 put of diff -u.
480
481 Note further that attachments are strongly preferred over inlined
482 patches. Attachments can be made by going to the Edit form of the
483 ticket immediately after its creation, and then using the left-most
484 button in the secondary navigation bar.
485
487 EBNF, LL(k), PEG, TDPL, context-free languages, expression, grammar,
488 import, matching, parser, parsing expression, parsing expression gram‐
489 mar, plugin, push down automaton, recursive descent, serialization,
490 state, top-down parsing languages, transducer
491
493 Parsing and Grammars
494
496 Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
497
498
499
500
501tcllib 1 pt::peg::import::peg(n)