1pt::peg::from::peg(n) Parser Tools pt::peg::from::peg(n)
2
3
4
5______________________________________________________________________________
6
8 pt::peg::from::peg - PEG Conversion. Read PEG format
9
11 package require Tcl 8.5
12
13 package require pt::peg::from::peg ?1.0.3?
14
15 pt::peg::from::peg convert text
16
17______________________________________________________________________________
18
20 Are you lost ? Do you have trouble understanding this document ? In
21 that case please read the overview provided by the Introduction to
22 Parser Tools. This document is the entrypoint to the whole system the
23 current package is a part of.
24
25 This package implements the converter from PEG markup to parsing
26 expression grammars.
27
28 It resides in the Import section of the Core Layer of Parser Tools, and
29 can be used either directly with the other packages of this layer, or
30 indirectly through the import manager provided by pt::peg::import. The
31 latter is intented for use in untrusted environments and done through
32 the corresponding import plugin pt::peg::import::peg sitting between
33 converter and import manager.
34
35 IMAGE: arch_core_iplugins
36
38 The API provided by this package satisfies the specification of the
39 Converter API found in the Parser Tools Import API specification.
40
41 pt::peg::from::peg convert text
42 This command takes the PEG markup encoding a parsing expression
43 grammar and contained in text, and generates the canonical seri‐
44 alization of said grammar, as specified in section PEG serial‐
45 ization format. The created value is then returned as the
46 result of the command.
47
49 peg, a language for the specification of parsing expression grammars is
50 meant to be human readable, and writable as well, yet strict enough to
51 allow its processing by machine. Like any computer language. It was
52 defined to make writing the specification of a grammar easy, something
53 the other formats found in the Parser Tools do not lend themselves too.
54
55 It is formally specified by the grammar shown below, written in itself.
56 For a tutorial / introduction to the language please go and read the
57 PEG Language Tutorial.
58
59 PEG pe-grammar-for-peg (Grammar)
60
61 # --------------------------------------------------------------------
62 # Syntactical constructs
63
64 Grammar <- WHITESPACE Header Definition* Final EOF ;
65
66 Header <- PEG Identifier StartExpr ;
67 Definition <- Attribute? Identifier IS Expression SEMICOLON ;
68 Attribute <- (VOID / LEAF) COLON ;
69 Expression <- Sequence (SLASH Sequence)* ;
70 Sequence <- Prefix+ ;
71 Prefix <- (AND / NOT)? Suffix ;
72 Suffix <- Primary (QUESTION / STAR / PLUS)? ;
73 Primary <- ALNUM / ALPHA / ASCII / CONTROL / DDIGIT / DIGIT
74 / GRAPH / LOWER / PRINTABLE / PUNCT / SPACE / UPPER
75 / WORDCHAR / XDIGIT
76 / Identifier
77 / OPEN Expression CLOSE
78 / Literal
79 / Class
80 / DOT
81 ;
82 Literal <- APOSTROPH (!APOSTROPH Char)* APOSTROPH WHITESPACE
83 / DAPOSTROPH (!DAPOSTROPH Char)* DAPOSTROPH WHITESPACE ;
84 Class <- OPENB (!CLOSEB Range)* CLOSEB WHITESPACE ;
85 Range <- Char TO Char / Char ;
86
87 StartExpr <- OPEN Expression CLOSE ;
88 void: Final <- "END" WHITESPACE SEMICOLON WHITESPACE ;
89
90 # --------------------------------------------------------------------
91 # Lexing constructs
92
93 Identifier <- Ident WHITESPACE ;
94 leaf: Ident <- ([_:] / <alpha>) ([_:] / <alnum>)* ;
95 Char <- CharSpecial / CharOctalFull / CharOctalPart
96 / CharUnicode / CharUnescaped
97 ;
98
99 leaf: CharSpecial <- "\\" [nrt'"\[\]\\] ;
100 leaf: CharOctalFull <- "\\" [0-2][0-7][0-7] ;
101 leaf: CharOctalPart <- "\\" [0-7][0-7]? ;
102 leaf: CharUnicode <- "\\" 'u' HexDigit (HexDigit (HexDigit HexDigit?)?)? ;
103 leaf: CharUnescaped <- !"\\" . ;
104
105 void: HexDigit <- [0-9a-fA-F] ;
106
107 void: TO <- '-' ;
108 void: OPENB <- "[" ;
109 void: CLOSEB <- "]" ;
110 void: APOSTROPH <- "'" ;
111 void: DAPOSTROPH <- '"' ;
112 void: PEG <- "PEG" !([_:] / <alnum>) WHITESPACE ;
113 void: IS <- "<-" WHITESPACE ;
114 leaf: VOID <- "void" WHITESPACE ; # Implies that definition has no semantic value.
115 leaf: LEAF <- "leaf" WHITESPACE ; # Implies that definition has no terminals.
116 void: SEMICOLON <- ";" WHITESPACE ;
117 void: COLON <- ":" WHITESPACE ;
118 void: SLASH <- "/" WHITESPACE ;
119 leaf: AND <- "&" WHITESPACE ;
120 leaf: NOT <- "!" WHITESPACE ;
121 leaf: QUESTION <- "?" WHITESPACE ;
122 leaf: STAR <- "*" WHITESPACE ;
123 leaf: PLUS <- "+" WHITESPACE ;
124 void: OPEN <- "(" WHITESPACE ;
125 void: CLOSE <- ")" WHITESPACE ;
126 leaf: DOT <- "." WHITESPACE ;
127
128 leaf: ALNUM <- "<alnum>" WHITESPACE ;
129 leaf: ALPHA <- "<alpha>" WHITESPACE ;
130 leaf: ASCII <- "<ascii>" WHITESPACE ;
131 leaf: CONTROL <- "<control>" WHITESPACE ;
132 leaf: DDIGIT <- "<ddigit>" WHITESPACE ;
133 leaf: DIGIT <- "<digit>" WHITESPACE ;
134 leaf: GRAPH <- "<graph>" WHITESPACE ;
135 leaf: LOWER <- "<lower>" WHITESPACE ;
136 leaf: PRINTABLE <- "<print>" WHITESPACE ;
137 leaf: PUNCT <- "<punct>" WHITESPACE ;
138 leaf: SPACE <- "<space>" WHITESPACE ;
139 leaf: UPPER <- "<upper>" WHITESPACE ;
140 leaf: WORDCHAR <- "<wordchar>" WHITESPACE ;
141 leaf: XDIGIT <- "<xdigit>" WHITESPACE ;
142
143 void: WHITESPACE <- (" " / "\t" / EOL / COMMENT)* ;
144 void: COMMENT <- '#' (!EOL .)* EOL ;
145 void: EOL <- "\n\r" / "\n" / "\r" ;
146 void: EOF <- !. ;
147
148 # --------------------------------------------------------------------
149 END;
150
151
152 EXAMPLE
153 Our example specifies the grammar for a basic 4-operation calculator.
154
155 PEG calculator (Expression)
156 Digit <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9' ;
157 Sign <- '-' / '+' ;
158 Number <- Sign? Digit+ ;
159 Expression <- Term (AddOp Term)* ;
160 MulOp <- '*' / '/' ;
161 Term <- Factor (MulOp Factor)* ;
162 AddOp <- '+'/'-' ;
163 Factor <- '(' Expression ')' / Number ;
164 END;
165
166
167 Using higher-level features of the notation, i.e. the character classes
168 (predefined and custom), this example can be rewritten as
169
170 PEG calculator (Expression)
171 Sign <- [-+] ;
172 Number <- Sign? <ddigit>+;
173 Expression <- '(' Expression ')' / (Factor (MulOp Factor)*);
174 MulOp <- [*/];
175 Factor <- Term (AddOp Term)*;
176 AddOp <- [-+];
177 Term <- Number;
178 END;
179
180
182 Here we specify the format used by the Parser Tools to serialize Pars‐
183 ing Expression Grammars as immutable values for transport, comparison,
184 etc.
185
186 We distinguish between regular and canonical serializations. While a
187 PEG may have more than one regular serialization only exactly one of
188 them will be canonical.
189
190 regular serialization
191
192 [1] The serialization of any PEG is a nested Tcl dictionary.
193
194 [2] This dictionary holds a single key, pt::grammar::peg, and
195 its value. This value holds the contents of the grammar.
196
197 [3] The contents of the grammar are a Tcl dictionary holding
198 the set of nonterminal symbols and the starting expres‐
199 sion. The relevant keys and their values are
200
201 rules The value is a Tcl dictionary whose keys are the
202 names of the nonterminal symbols known to the
203 grammar.
204
205 [1] Each nonterminal symbol may occur only
206 once.
207
208 [2] The empty string is not a legal nonterminal
209 symbol.
210
211 [3] The value for each symbol is a Tcl dictio‐
212 nary itself. The relevant keys and their
213 values in this dictionary are
214
215 is The value is the serialization of
216 the parsing expression describing
217 the symbols sentennial structure, as
218 specified in the section PE serial‐
219 ization format.
220
221 mode The value can be one of three values
222 specifying how a parser should han‐
223 dle the semantic value produced by
224 the symbol.
225
226 value The semantic value of the
227 nonterminal symbol is an
228 abstract syntax tree consist‐
229 ing of a single node node for
230 the nonterminal itself, which
231 has the ASTs of the symbol's
232 right hand side as its chil‐
233 dren.
234
235 leaf The semantic value of the
236 nonterminal symbol is an
237 abstract syntax tree consist‐
238 ing of a single node node for
239 the nonterminal, without any
240 children. Any ASTs generated
241 by the symbol's right hand
242 side are discarded.
243
244 void The nonterminal has no seman‐
245 tic value. Any ASTs generated
246 by the symbol's right hand
247 side are discarded (as well).
248
249 start The value is the serialization of the start pars‐
250 ing expression of the grammar, as specified in the
251 section PE serialization format.
252
253 [4] The terminal symbols of the grammar are specified implic‐
254 itly as the set of all terminal symbols used in the start
255 expression and on the RHS of the grammar rules.
256
257 canonical serialization
258 The canonical serialization of a grammar has the format as spec‐
259 ified in the previous item, and then additionally satisfies the
260 constraints below, which make it unique among all the possible
261 serializations of this grammar.
262
263 [1] The keys found in all the nested Tcl dictionaries are
264 sorted in ascending dictionary order, as generated by
265 Tcl's builtin command lsort -increasing -dict.
266
267 [2] The string representation of the value is the canonical
268 representation of a Tcl dictionary. I.e. it does not con‐
269 tain superfluous whitespace.
270
271 EXAMPLE
272 Assuming the following PEG for simple mathematical expressions
273
274 PEG calculator (Expression)
275 Digit <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9' ;
276 Sign <- '-' / '+' ;
277 Number <- Sign? Digit+ ;
278 Expression <- Term (AddOp Term)* ;
279 MulOp <- '*' / '/' ;
280 Term <- Factor (MulOp Factor)* ;
281 AddOp <- '+'/'-' ;
282 Factor <- '(' Expression ')' / Number ;
283 END;
284
285
286 then its canonical serialization (except for whitespace) is
287
288 pt::grammar::peg {
289 rules {
290 AddOp {is {/ {t -} {t +}} mode value}
291 Digit {is {/ {t 0} {t 1} {t 2} {t 3} {t 4} {t 5} {t 6} {t 7} {t 8} {t 9}} mode value}
292 Expression {is {x {n Term} {* {x {n AddOp} {n Term}}}} mode value}
293 Factor {is {/ {x {t (} {n Expression} {t )}} {n Number}} mode value}
294 MulOp {is {/ {t *} {t /}} mode value}
295 Number {is {x {? {n Sign}} {+ {n Digit}}} mode value}
296 Sign {is {/ {t -} {t +}} mode value}
297 Term {is {x {n Factor} {* {x {n MulOp} {n Factor}}}} mode value}
298 }
299 start {n Expression}
300 }
301
302
304 Here we specify the format used by the Parser Tools to serialize Pars‐
305 ing Expressions as immutable values for transport, comparison, etc.
306
307 We distinguish between regular and canonical serializations. While a
308 parsing expression may have more than one regular serialization only
309 exactly one of them will be canonical.
310
311 Regular serialization
312
313 Atomic Parsing Expressions
314
315 [1] The string epsilon is an atomic parsing expres‐
316 sion. It matches the empty string.
317
318 [2] The string dot is an atomic parsing expression. It
319 matches any character.
320
321 [3] The string alnum is an atomic parsing expression.
322 It matches any Unicode alphabet or digit charac‐
323 ter. This is a custom extension of PEs based on
324 Tcl's builtin command string is.
325
326 [4] The string alpha is an atomic parsing expression.
327 It matches any Unicode alphabet character. This is
328 a custom extension of PEs based on Tcl's builtin
329 command string is.
330
331 [5] The string ascii is an atomic parsing expression.
332 It matches any Unicode character below U0080. This
333 is a custom extension of PEs based on Tcl's
334 builtin command string is.
335
336 [6] The string control is an atomic parsing expres‐
337 sion. It matches any Unicode control character.
338 This is a custom extension of PEs based on Tcl's
339 builtin command string is.
340
341 [7] The string digit is an atomic parsing expression.
342 It matches any Unicode digit character. Note that
343 this includes characters outside of the [0..9]
344 range. This is a custom extension of PEs based on
345 Tcl's builtin command string is.
346
347 [8] The string graph is an atomic parsing expression.
348 It matches any Unicode printing character, except
349 for space. This is a custom extension of PEs based
350 on Tcl's builtin command string is.
351
352 [9] The string lower is an atomic parsing expression.
353 It matches any Unicode lower-case alphabet charac‐
354 ter. This is a custom extension of PEs based on
355 Tcl's builtin command string is.
356
357 [10] The string print is an atomic parsing expression.
358 It matches any Unicode printing character, includ‐
359 ing space. This is a custom extension of PEs based
360 on Tcl's builtin command string is.
361
362 [11] The string punct is an atomic parsing expression.
363 It matches any Unicode punctuation character. This
364 is a custom extension of PEs based on Tcl's
365 builtin command string is.
366
367 [12] The string space is an atomic parsing expression.
368 It matches any Unicode space character. This is a
369 custom extension of PEs based on Tcl's builtin
370 command string is.
371
372 [13] The string upper is an atomic parsing expression.
373 It matches any Unicode upper-case alphabet charac‐
374 ter. This is a custom extension of PEs based on
375 Tcl's builtin command string is.
376
377 [14] The string wordchar is an atomic parsing expres‐
378 sion. It matches any Unicode word character. This
379 is any alphanumeric character (see alnum), and any
380 connector punctuation characters (e.g. under‐
381 score). This is a custom extension of PEs based on
382 Tcl's builtin command string is.
383
384 [15] The string xdigit is an atomic parsing expression.
385 It matches any hexadecimal digit character. This
386 is a custom extension of PEs based on Tcl's
387 builtin command string is.
388
389 [16] The string ddigit is an atomic parsing expression.
390 It matches any decimal digit character. This is a
391 custom extension of PEs based on Tcl's builtin
392 command regexp.
393
394 [17] The expression [list t x] is an atomic parsing
395 expression. It matches the terminal string x.
396
397 [18] The expression [list n A] is an atomic parsing
398 expression. It matches the nonterminal A.
399
400 Combined Parsing Expressions
401
402 [1] For parsing expressions e1, e2, ... the result of
403 [list / e1 e2 ... ] is a parsing expression as
404 well. This is the ordered choice, aka prioritized
405 choice.
406
407 [2] For parsing expressions e1, e2, ... the result of
408 [list x e1 e2 ... ] is a parsing expression as
409 well. This is the sequence.
410
411 [3] For a parsing expression e the result of [list *
412 e] is a parsing expression as well. This is the
413 kleene closure, describing zero or more repeti‐
414 tions.
415
416 [4] For a parsing expression e the result of [list +
417 e] is a parsing expression as well. This is the
418 positive kleene closure, describing one or more
419 repetitions.
420
421 [5] For a parsing expression e the result of [list &
422 e] is a parsing expression as well. This is the
423 and lookahead predicate.
424
425 [6] For a parsing expression e the result of [list !
426 e] is a parsing expression as well. This is the
427 not lookahead predicate.
428
429 [7] For a parsing expression e the result of [list ?
430 e] is a parsing expression as well. This is the
431 optional input.
432
433 Canonical serialization
434 The canonical serialization of a parsing expression has the for‐
435 mat as specified in the previous item, and then additionally
436 satisfies the constraints below, which make it unique among all
437 the possible serializations of this parsing expression.
438
439 [1] The string representation of the value is the canonical
440 representation of a pure Tcl list. I.e. it does not con‐
441 tain superfluous whitespace.
442
443 [2] Terminals are not encoded as ranges (where start and end
444 of the range are identical).
445
446 EXAMPLE
447 Assuming the parsing expression shown on the right-hand side of the
448 rule
449
450 Expression <- Term (AddOp Term)*
451
452
453 then its canonical serialization (except for whitespace) is
454
455 {x {n Term} {* {x {n AddOp} {n Term}}}}
456
457
459 This document, and the package it describes, will undoubtedly contain
460 bugs and other problems. Please report such in the category pt of the
461 Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist]. Please also
462 report any ideas for enhancements you may have for either package
463 and/or documentation.
464
465 When proposing code changes, please provide unified diffs, i.e the out‐
466 put of diff -u.
467
468 Note further that attachments are strongly preferred over inlined
469 patches. Attachments can be made by going to the Edit form of the
470 ticket immediately after its creation, and then using the left-most
471 button in the secondary navigation bar.
472
474 EBNF, LL(k), PEG, TDPL, context-free languages, conversion, expression,
475 format conversion, grammar, matching, parser, parsing expression, pars‐
476 ing expression grammar, push down automaton, recursive descent, serial‐
477 ization, state, top-down parsing languages, transducer
478
480 Parsing and Grammars
481
483 Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
484
485
486
487
488tcllib 1.0.3 pt::peg::from::peg(n)