1pt_parser_api(i)                 Parser Tools                 pt_parser_api(i)
2
3
4
5______________________________________________________________________________
6

NAME

8       pt_parser_api - Parser API
9

SYNOPSIS

11       package require Tcl  8.5
12
13       className ?objectName?
14
15       objectName destroy
16
17       objectName parse chan
18
19       objectName parset text
20
21______________________________________________________________________________
22

DESCRIPTION

24       Are  you  lost ?  Do you have trouble understanding this document ?  In
25       that case please read the overview  provided  by  the  Introduction  to
26       Parser  Tools.  This document is the entrypoint to the whole system the
27       current package is a part of.
28
29       This document describes the API shared by the grammar interpreter  pro‐
30       vided  by  the package pt::peg::interp and the parsers generated by the
31       pt application for the result formats critcl, snit,  and  oo  regarding
32       access to the actual parsing functionality.
33
34       Its  intended  audience are people who wish to create a parser for some
35       language of theirs and then use that parser within a Tcl-based  package
36       or application.
37
38       It resides in the User Layer of Parser Tools.
39
40       IMAGE: arch_user_pkg
41

CLASS API

43       className ?objectName?
44              The class command constructs parser instances, i.e. objects. The
45              result of the command is the fully-qualified  name  of  the  in‐
46              stance command.
47
48              If no objectName is specified the class will generate and use an
49              automatic name. If the objectName  was  specified,  but  is  not
50              fully qualified the command will be created in the current name‐
51              space.
52

INSTANCE API

54       All parser instances provide at least the methods shown below:
55
56       objectName destroy
57              This method destroys the parser instance, releasing all  claimed
58              memory and other resources, and deleting the instance command.
59
60              The result of the command is the empty string.
61
62       objectName parse chan
63              This  method runs the parser using the contents of chan as input
64              (starting at the current location in the channel), until parsing
65              is  not  possible anymore, either because parsing has completed,
66              or run into a syntax error.
67
68              Note here that the Parser Tools are based on Tcl 8.5+. In  other
69              words, the channel argument is not restricted to files, sockets,
70              etc. We have the full power of reflected channels available.
71
72              It should also be noted that the  parser  pulls  the  characters
73              from  the  input stream as it needs them. If a parser created by
74              this package has to be operated in a push aka event-driven  man‐
75              ner  it  will  be necessary to go to Tcl 8.6+ and use the corou‐
76              tine::auto to wrap it into a coroutine where  read  is  properly
77              changed for push-operation.
78
79              Upon  successful completion the command returns an abstract syn‐
80              tax tree as its result.  This AST is in the  form  specified  in
81              section AST serialization format.  As a plain nested Tcl-list it
82              can then be processed with any Tcl commands the user likes,  do‐
83              ing  transformations, semantic checks, etc.  To help in this the
84              package pt::ast provides a set of convenience commands for vali‐
85              dation of the tree's basic structure, printing it for debugging,
86              and walking it either from the bottom up, or top down.
87
88              When encountering a syntax error the command will throw an error
89              instead.   This  error will be a 4-element Tcl-list, containing,
90              in the order listed below:
91
92              [1]    The string pt::rde identifying it as parser  runtime  er‐
93                     ror.
94
95              [2]    The location of the parse error, as character offset from
96                     the beginning of the parsed input.
97
98              [3]    The location of parse error, now as a 2-element list con‐
99                     taining line-number and column in the line.
100
101              [4]    A  set  of atomic parsing expressions indicating encoding
102                     the characters and/or nonterminal symbols the parser  ex‐
103                     pected to see at the location of the parse error, but did
104                     not get.  For the specification of atomic parsing expres‐
105                     sions please see the section PE serialization format.
106
107       objectName parset text
108              This  method  runs the parser using the string in text as input.
109              In all other ways it behaves like the method parse, shown above.
110

USAGE

112       A generated parser is used like this
113
114
115                  package require the-parser-package ;# Generated by result-formats 'critcl', 'snit' or 'oo' of 'pt'.
116                  set parser [the-parser-class]
117
118                  set ast [$parser parse $channel]
119                  ... process the abstract syntax tree ...
120
121       When using a grammar interpreter for parsing some differences creep in
122
123
124                  package require the-grammar-package ;# Generated by result-format 'container' of 'pt'.
125                  set grammar [the-grammar-class]
126
127                  package require pt::peg::interp
128                  set parser [pt::peg::interp]
129
130                  $parser use $grammar
131
132                  set ast [$parser parse $channel]
133                  $parser destroy
134
135                  ... process the abstract syntax tree ...
136
137

AST SERIALIZATION FORMAT

139       Here we specify the format used by the Parser Tools  to  serialize  Ab‐
140       stract  Syntax Trees (ASTs) as immutable values for transport, compari‐
141       son, etc.
142
143       Each node in an AST represents a nonterminal symbol of a  grammar,  and
144       the  range of tokens/characters in the input covered by it. ASTs do not
145       contain terminal symbols, i.e. tokens/characters. These can  be  recov‐
146       ered from the input given a symbol's location.
147
148       We  distinguish  between regular and canonical serializations.  While a
149       tree may have more than one regular serialization only exactly  one  of
150       them will be canonical.
151
152       Regular serialization
153
154              [1]    The  serialization of any AST is the serialization of its
155                     root node.
156
157              [2]    The serialization of any node is a Tcl list containing at
158                     least three elements.
159
160                     [1]    The  first  element is the name of the nonterminal
161                            symbol stored in the node.
162
163                     [2]    The second and third element are the locations  of
164                            the  first  and last token in the token stream the
165                            node represents (covers).
166
167                            [1]    Locations are provided as non-negative  in‐
168                                   teger offsets from the beginning of the to‐
169                                   ken stream, with the first token  found  in
170                                   the stream located at offset 0 (zero).
171
172                            [2]    The  end  location  has  to  be equal to or
173                                   larger than the start location.
174
175                     [3]    All elements after the first three  represent  the
176                            children  of the node, which are themselves nodes.
177                            This means that the serializations of nodes  with‐
178                            out  children, i.e. leaf nodes, have exactly three
179                            elements.  The children are  stored  in  the  list
180                            with  the  leftmost child first, and the rightmost
181                            child last.
182
183       Canonical serialization
184              The canonical serialization of an abstract syntax tree  has  the
185              format  as specified in the previous item, and then additionally
186              satisfies the constraints below, which make it unique among  all
187              the possible serializations of this tree.
188
189              [1]    The  string  representation of the value is the canonical
190                     representation of a pure Tcl list. I.e. it does not  con‐
191                     tain superfluous whitespace.
192
193   EXAMPLE
194       Assuming the parsing expression grammar below
195
196              PEG calculator (Expression)
197                  Digit      <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'       ;
198                  Sign       <- '-' / '+'                                     ;
199                  Number     <- Sign? Digit+                                  ;
200                  Expression <- Term (AddOp Term)*                            ;
201                  MulOp      <- '*' / '/'                                     ;
202                  Term       <- Factor (MulOp Factor)*                        ;
203                  AddOp      <- '+'/'-'                                       ;
204                  Factor     <- '(' Expression ')' / Number                   ;
205              END;
206
207
208       and the input string
209
210               120+5
211       then a parser should deliver the abstract syntax tree below (except for
212       whitespace)
213
214              set ast {Expression 0 4
215                  {Factor 0 4
216                      {Term 0 2
217                          {Number 0 2
218                              {Digit 0 0}
219                              {Digit 1 1}
220                              {Digit 2 2}
221                          }
222                      }
223                      {AddOp 3 3}
224                      {Term 4 4
225                          {Number 4 4
226                              {Digit 4 4}
227                          }
228                      }
229                  }
230              }
231
232
233       Or, more graphical
234
235       .nf +- Digit 0 0 | 1 |            | +- Term 0 2  ---  Number  0  2  -+-
236       Digit   1   1   |   2   |                            |             |  |
237       +- Digit 2 2 | 0 |                                        |  Expression
238       0  4  ---  Factor  0  4 -+----------------------------- AddOp 3 3 | + |
239       | +- Term 4 4 --- Number 4 4 --- Digit 4 4 | 5 .fi
240

PE SERIALIZATION FORMAT

242       Here we specify the format used by the Parser Tools to serialize  Pars‐
243       ing Expressions as immutable values for transport, comparison, etc.
244
245       We  distinguish  between regular and canonical serializations.  While a
246       parsing expression may have more than one  regular  serialization  only
247       exactly one of them will be canonical.
248
249       Regular serialization
250
251              Atomic Parsing Expressions
252
253                     [1]    The  string  epsilon  is an atomic parsing expres‐
254                            sion. It matches the empty string.
255
256                     [2]    The string dot is an atomic parsing expression. It
257                            matches any character.
258
259                     [3]    The  string alnum is an atomic parsing expression.
260                            It matches any Unicode alphabet or  digit  charac‐
261                            ter.  This  is  a custom extension of PEs based on
262                            Tcl's builtin command string is.
263
264                     [4]    The string alpha is an atomic parsing  expression.
265                            It matches any Unicode alphabet character. This is
266                            a custom extension of PEs based on  Tcl's  builtin
267                            command string is.
268
269                     [5]    The  string ascii is an atomic parsing expression.
270                            It matches any Unicode character below U0080. This
271                            is  a  custom  extension  of  PEs  based  on Tcl's
272                            builtin command string is.
273
274                     [6]    The string control is an  atomic  parsing  expres‐
275                            sion.  It  matches  any Unicode control character.
276                            This is a custom extension of PEs based  on  Tcl's
277                            builtin command string is.
278
279                     [7]    The  string digit is an atomic parsing expression.
280                            It matches any Unicode digit character. Note  that
281                            this  includes  characters  outside  of the [0..9]
282                            range. This is a custom extension of PEs based  on
283                            Tcl's builtin command string is.
284
285                     [8]    The  string graph is an atomic parsing expression.
286                            It matches any Unicode printing character,  except
287                            for space. This is a custom extension of PEs based
288                            on Tcl's builtin command string is.
289
290                     [9]    The string lower is an atomic parsing  expression.
291                            It matches any Unicode lower-case alphabet charac‐
292                            ter. This is a custom extension of  PEs  based  on
293                            Tcl's builtin command string is.
294
295                     [10]   The  string print is an atomic parsing expression.
296                            It matches any Unicode printing character, includ‐
297                            ing space. This is a custom extension of PEs based
298                            on Tcl's builtin command string is.
299
300                     [11]   The string punct is an atomic parsing  expression.
301                            It matches any Unicode punctuation character. This
302                            is a  custom  extension  of  PEs  based  on  Tcl's
303                            builtin command string is.
304
305                     [12]   The  string space is an atomic parsing expression.
306                            It matches any Unicode space character. This is  a
307                            custom  extension  of  PEs  based on Tcl's builtin
308                            command string is.
309
310                     [13]   The string upper is an atomic parsing  expression.
311                            It matches any Unicode upper-case alphabet charac‐
312                            ter. This is a custom extension of  PEs  based  on
313                            Tcl's builtin command string is.
314
315                     [14]   The  string  wordchar is an atomic parsing expres‐
316                            sion. It matches any Unicode word character.  This
317                            is any alphanumeric character (see alnum), and any
318                            connector  punctuation  characters  (e.g.   under‐
319                            score). This is a custom extension of PEs based on
320                            Tcl's builtin command string is.
321
322                     [15]   The string xdigit is an atomic parsing expression.
323                            It  matches  any hexadecimal digit character. This
324                            is a  custom  extension  of  PEs  based  on  Tcl's
325                            builtin command string is.
326
327                     [16]   The string ddigit is an atomic parsing expression.
328                            It matches any decimal digit character. This is  a
329                            custom  extension  of  PEs  based on Tcl's builtin
330                            command regexp.
331
332                     [17]   The expression [list t x] is an atomic parsing ex‐
333                            pression. It matches the terminal string x.
334
335                     [18]   The expression [list n A] is an atomic parsing ex‐
336                            pression. It matches the nonterminal A.
337
338              Combined Parsing Expressions
339
340                     [1]    For parsing expressions e1, e2, ... the result  of
341                            [list  /  e1  e2  ... ] is a parsing expression as
342                            well.  This is the ordered choice, aka prioritized
343                            choice.
344
345                     [2]    For  parsing expressions e1, e2, ... the result of
346                            [list x e1 e2 ... ] is  a  parsing  expression  as
347                            well.  This is the sequence.
348
349                     [3]    For  a  parsing expression e the result of [list *
350                            e] is a parsing expression as well.  This  is  the
351                            kleene  closure,  describing  zero or more repeti‐
352                            tions.
353
354                     [4]    For a parsing expression e the result of  [list  +
355                            e]  is  a parsing expression as well.  This is the
356                            positive kleene closure, describing  one  or  more
357                            repetitions.
358
359                     [5]    For  a  parsing expression e the result of [list &
360                            e] is a parsing expression as well.  This  is  the
361                            and lookahead predicate.
362
363                     [6]    For  a  parsing expression e the result of [list !
364                            e] is a parsing expression as well.  This  is  the
365                            not lookahead predicate.
366
367                     [7]    For  a  parsing expression e the result of [list ?
368                            e] is a parsing expression as well.  This  is  the
369                            optional input.
370
371       Canonical serialization
372              The canonical serialization of a parsing expression has the for‐
373              mat as specified in the previous  item,  and  then  additionally
374              satisfies  the constraints below, which make it unique among all
375              the possible serializations of this parsing expression.
376
377              [1]    The string representation of the value is  the  canonical
378                     representation  of a pure Tcl list. I.e. it does not con‐
379                     tain superfluous whitespace.
380
381              [2]    Terminals are not encoded as ranges (where start and  end
382                     of the range are identical).
383
384   EXAMPLE
385       Assuming  the  parsing  expression  shown on the right-hand side of the
386       rule
387
388                  Expression <- Term (AddOp Term)*
389
390
391       then its canonical serialization (except for whitespace) is
392
393                  {x {n Term} {* {x {n AddOp} {n Term}}}}
394
395

BUGS, IDEAS, FEEDBACK

397       This document, and the package it describes, will  undoubtedly  contain
398       bugs  and other problems.  Please report such in the category pt of the
399       Tcllib Trackers  [http://core.tcl.tk/tcllib/reportlist].   Please  also
400       report  any  ideas  for  enhancements  you  may have for either package
401       and/or documentation.
402
403       When proposing code changes, please provide unified diffs, i.e the out‐
404       put of diff -u.
405
406       Note  further  that  attachments  are  strongly  preferred over inlined
407       patches. Attachments can be made by going  to  the  Edit  form  of  the
408       ticket  immediately  after  its  creation, and then using the left-most
409       button in the secondary navigation bar.
410

KEYWORDS

412       EBNF, LL(k), PEG, TDPL, context-free  languages,  expression,  grammar,
413       matching,  parser, parsing expression, parsing expression grammar, push
414       down automaton, recursive descent, state, top-down  parsing  languages,
415       transducer
416

CATEGORY

418       Parsing and Grammars
419
421       Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
422
423
424
425
426tcllib                                 1                      pt_parser_api(i)
Impressum