1pt_parser_api(i)                 Parser Tools                 pt_parser_api(i)
2
3
4
5______________________________________________________________________________
6

NAME

8       pt_parser_api - Parser API
9

SYNOPSIS

11       package require Tcl  8.5
12
13       className ?objectName?
14
15       objectName destroy
16
17       objectName parse chan
18
19       objectName parset text
20
21______________________________________________________________________________
22

DESCRIPTION

24       Are  you  lost ?  Do you have trouble understanding this document ?  In
25       that case please read the overview  provided  by  the  Introduction  to
26       Parser  Tools.  This document is the entrypoint to the whole system the
27       current package is a part of.
28
29       This document describes the API shared by the grammar interpreter  pro‐
30       vided  by  the package pt::peg::interp and the parsers generated by the
31       pt application for the result formats critcl, snit,  and  oo  regarding
32       access to the actual parsing functionality.
33
34       Its  intended  audience are people who wish to create a parser for some
35       language of theirs and then use that parser within a Tcl-based  package
36       or application.
37
38       It resides in the User Layer of Parser Tools.
39
40       IMAGE: arch_user_pkg
41

CLASS API

43       className ?objectName?
44              The class command constructs parser instances, i.e. objects. The
45              result of  the  command  is  the  fully-qualified  name  of  the
46              instance command.
47
48              If no objectName is specified the class will generate and use an
49              automatic name. If the objectName  was  specified,  but  is  not
50              fully  qualified  the  command  will  be  created in the current
51              namespace.
52

INSTANCE API

54       All parser instances provide at least the methods shown below:
55
56       objectName destroy
57              This method destroys the parser instance, releasing all  claimed
58              memory and other resources, and deleting the instance command.
59
60              The result of the command is the empty string.
61
62       objectName parse chan
63              This  method runs the parser using the contents of chan as input
64              (starting at the current location in the channel), until parsing
65              is  not  possible anymore, either because parsing has completed,
66              or run into a syntax error.
67
68              Note here that the Parser Tools are based on Tcl 8.5+. In  other
69              words, the channel argument is not restricted to files, sockets,
70              etc. We have the full power of reflected channels available.
71
72              It should also be noted that the  parser  pulls  the  characters
73              from  the  input stream as it needs them. If a parser created by
74              this package has to be operated in a push aka event-driven  man‐
75              ner  it  will  be necessary to go to Tcl 8.6+ and use the corou‐
76              tine::auto to wrap it into a coroutine where  read  is  properly
77              changed for push-operation.
78
79              Upon  successful completion the command returns an abstract syn‐
80              tax tree as its result.  This AST is in the  form  specified  in
81              section AST serialization format.  As a plain nested Tcl-list it
82              can then be processed with any  Tcl  commands  the  user  likes,
83              doing  transformations,  semantic  checks, etc.  To help in this
84              the package pt::ast provides a set of convenience  commands  for
85              validation of the tree's basic structure, printing it for debug‐
86              ging, and walking it either from the bottom up, or top down.
87
88              When encountering a syntax error the command will throw an error
89              instead.   This  error will be a 4-element Tcl-list, containing,
90              in the order listed below:
91
92              [1]    The string  pt::rde  identifying  it  as  parser  runtime
93                     error.
94
95              [2]    The location of the parse error, as character offset from
96                     the beginning of the parsed input.
97
98              [3]    The location of parse error, now as a 2-element list con‐
99                     taining line-number and column in the line.
100
101              [4]    A  set  of atomic parsing expressions indicating encoding
102                     the characters  and/or  nonterminal  symbols  the  parser
103                     expected  to  see at the location of the parse error, but
104                     did not get.  For the  specification  of  atomic  parsing
105                     expressions  please see the section PE serialization for‐
106                     mat.
107
108       objectName parset text
109              This method runs the parser using the string in text  as  input.
110              In all other ways it behaves like the method parse, shown above.
111

USAGE

113       A generated parser is used like this
114
115
116                  package require the-parser-package ;# Generated by result-formats 'critcl', 'snit' or 'oo' of 'pt'.
117                  set parser [the-parser-class]
118
119                  set ast [$parser parse $channel]
120                  ... process the abstract syntax tree ...
121
122       When using a grammar interpreter for parsing some differences creep in
123
124
125                  package require the-grammar-package ;# Generated by result-format 'container' of 'pt'.
126                  set grammar [the-grammar-class]
127
128                  package require pt::peg::interp
129                  set parser [pt::peg::interp]
130
131                  $parser use $grammar
132
133                  set ast [$parser parse $channel]
134                  $parser destroy
135
136                  ... process the abstract syntax tree ...
137
138

AST SERIALIZATION FORMAT

140       Here  we  specify  the  format  used  by  the Parser Tools to serialize
141       Abstract Syntax Trees (ASTs) as immutable values for transport, compar‐
142       ison, etc.
143
144       Each  node  in an AST represents a nonterminal symbol of a grammar, and
145       the range of tokens/characters in the input covered by it. ASTs do  not
146       contain  terminal  symbols, i.e. tokens/characters. These can be recov‐
147       ered from the input given a symbol's location.
148
149       We distinguish between regular and canonical serializations.   While  a
150       tree  may  have more than one regular serialization only exactly one of
151       them will be canonical.
152
153       Regular serialization
154
155              [1]    The serialization of any AST is the serialization of  its
156                     root node.
157
158              [2]    The serialization of any node is a Tcl list containing at
159                     least three elements.
160
161                     [1]    The first element is the name of  the  nonterminal
162                            symbol stored in the node.
163
164                     [2]    The  second and third element are the locations of
165                            the first and last token in the token  stream  the
166                            node represents (covers).
167
168                            [1]    Locations   are  provided  as  non-negative
169                                   integer offsets from the beginning  of  the
170                                   token stream, with the first token found in
171                                   the stream located at offset 0 (zero).
172
173                            [2]    The end location has  to  be  equal  to  or
174                                   larger than the start location.
175
176                     [3]    All  elements  after the first three represent the
177                            children of the node, which are themselves  nodes.
178                            This  means that the serializations of nodes with‐
179                            out children, i.e. leaf nodes, have exactly  three
180                            elements.   The  children  are  stored in the list
181                            with the leftmost child first, and  the  rightmost
182                            child last.
183
184       Canonical serialization
185              The  canonical  serialization of an abstract syntax tree has the
186              format as specified in the previous item, and then  additionally
187              satisfies  the constraints below, which make it unique among all
188              the possible serializations of this tree.
189
190              [1]    The string representation of the value is  the  canonical
191                     representation  of a pure Tcl list. I.e. it does not con‐
192                     tain superfluous whitespace.
193
194   EXAMPLE
195       Assuming the parsing expression grammar below
196
197              PEG calculator (Expression)
198                  Digit      <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'       ;
199                  Sign       <- '-' / '+'                                     ;
200                  Number     <- Sign? Digit+                                  ;
201                  Expression <- Term (AddOp Term)*                            ;
202                  MulOp      <- '*' / '/'                                     ;
203                  Term       <- Factor (MulOp Factor)*                        ;
204                  AddOp      <- '+'/'-'                                       ;
205                  Factor     <- '(' Expression ')' / Number                   ;
206              END;
207
208
209       and the input string
210
211               120+5
212       then a parser should deliver the abstract syntax tree below (except for
213       whitespace)
214
215              set ast {Expression 0 4
216                  {Factor 0 4
217                      {Term 0 2
218                          {Number 0 2
219                              {Digit 0 0}
220                              {Digit 1 1}
221                              {Digit 2 2}
222                          }
223                      }
224                      {AddOp 3 3}
225                      {Term 4 4
226                          {Number 4 4
227                              {Digit 4 4}
228                          }
229                      }
230                  }
231              }
232
233
234       Or, more graphical
235
236       .nf  +-  Digit  0  0  | 1 |            | +- Term 0 2 --- Number 0 2 -+-
237       Digit  1  1  |   2   |                             |              |   |
238       +-  Digit 2 2 | 0 |                                        | Expression
239       0 4 --- Factor 0 4 -+----------------------------- AddOp  3  3  |  +  |
240       | +- Term 4 4 --- Number 4 4 --- Digit 4 4 | 5 .fi
241

PE SERIALIZATION FORMAT

243       Here  we specify the format used by the Parser Tools to serialize Pars‐
244       ing Expressions as immutable values for transport, comparison, etc.
245
246       We distinguish between regular and canonical serializations.   While  a
247       parsing  expression  may  have more than one regular serialization only
248       exactly one of them will be canonical.
249
250       Regular serialization
251
252              Atomic Parsing Expressions
253
254                     [1]    The string epsilon is an  atomic  parsing  expres‐
255                            sion. It matches the empty string.
256
257                     [2]    The string dot is an atomic parsing expression. It
258                            matches any character.
259
260                     [3]    The string alnum is an atomic parsing  expression.
261                            It  matches  any Unicode alphabet or digit charac‐
262                            ter. This is a custom extension of  PEs  based  on
263                            Tcl's builtin command string is.
264
265                     [4]    The  string alpha is an atomic parsing expression.
266                            It matches any Unicode alphabet character. This is
267                            a  custom  extension of PEs based on Tcl's builtin
268                            command string is.
269
270                     [5]    The string ascii is an atomic parsing  expression.
271                            It matches any Unicode character below U0080. This
272                            is a  custom  extension  of  PEs  based  on  Tcl's
273                            builtin command string is.
274
275                     [6]    The  string  control  is an atomic parsing expres‐
276                            sion. It matches any  Unicode  control  character.
277                            This  is  a custom extension of PEs based on Tcl's
278                            builtin command string is.
279
280                     [7]    The string digit is an atomic parsing  expression.
281                            It  matches any Unicode digit character. Note that
282                            this includes characters  outside  of  the  [0..9]
283                            range.  This is a custom extension of PEs based on
284                            Tcl's builtin command string is.
285
286                     [8]    The string graph is an atomic parsing  expression.
287                            It  matches any Unicode printing character, except
288                            for space. This is a custom extension of PEs based
289                            on Tcl's builtin command string is.
290
291                     [9]    The  string lower is an atomic parsing expression.
292                            It matches any Unicode lower-case alphabet charac‐
293                            ter.  This  is  a custom extension of PEs based on
294                            Tcl's builtin command string is.
295
296                     [10]   The string print is an atomic parsing  expression.
297                            It matches any Unicode printing character, includ‐
298                            ing space. This is a custom extension of PEs based
299                            on Tcl's builtin command string is.
300
301                     [11]   The  string punct is an atomic parsing expression.
302                            It matches any Unicode punctuation character. This
303                            is  a  custom  extension  of  PEs  based  on Tcl's
304                            builtin command string is.
305
306                     [12]   The string space is an atomic parsing  expression.
307                            It  matches any Unicode space character. This is a
308                            custom extension of PEs  based  on  Tcl's  builtin
309                            command string is.
310
311                     [13]   The  string upper is an atomic parsing expression.
312                            It matches any Unicode upper-case alphabet charac‐
313                            ter.  This  is  a custom extension of PEs based on
314                            Tcl's builtin command string is.
315
316                     [14]   The string wordchar is an atomic  parsing  expres‐
317                            sion.  It matches any Unicode word character. This
318                            is any alphanumeric character (see alnum), and any
319                            connector  punctuation  characters  (e.g.   under‐
320                            score). This is a custom extension of PEs based on
321                            Tcl's builtin command string is.
322
323                     [15]   The string xdigit is an atomic parsing expression.
324                            It matches any hexadecimal digit  character.  This
325                            is  a  custom  extension  of  PEs  based  on Tcl's
326                            builtin command string is.
327
328                     [16]   The string ddigit is an atomic parsing expression.
329                            It  matches any decimal digit character. This is a
330                            custom extension of PEs  based  on  Tcl's  builtin
331                            command regexp.
332
333                     [17]   The  expression  [list  t  x] is an atomic parsing
334                            expression. It matches the terminal string x.
335
336                     [18]   The expression [list n A]  is  an  atomic  parsing
337                            expression. It matches the nonterminal A.
338
339              Combined Parsing Expressions
340
341                     [1]    For  parsing expressions e1, e2, ... the result of
342                            [list / e1 e2 ... ] is  a  parsing  expression  as
343                            well.  This is the ordered choice, aka prioritized
344                            choice.
345
346                     [2]    For parsing expressions e1, e2, ... the result  of
347                            [list  x  e1  e2  ... ] is a parsing expression as
348                            well.  This is the sequence.
349
350                     [3]    For a parsing expression e the result of  [list  *
351                            e]  is  a parsing expression as well.  This is the
352                            kleene closure, describing zero  or  more  repeti‐
353                            tions.
354
355                     [4]    For  a  parsing expression e the result of [list +
356                            e] is a parsing expression as well.  This  is  the
357                            positive  kleene  closure,  describing one or more
358                            repetitions.
359
360                     [5]    For a parsing expression e the result of  [list  &
361                            e]  is  a parsing expression as well.  This is the
362                            and lookahead predicate.
363
364                     [6]    For a parsing expression e the result of  [list  !
365                            e]  is  a parsing expression as well.  This is the
366                            not lookahead predicate.
367
368                     [7]    For a parsing expression e the result of  [list  ?
369                            e]  is  a parsing expression as well.  This is the
370                            optional input.
371
372       Canonical serialization
373              The canonical serialization of a parsing expression has the for‐
374              mat  as  specified  in  the previous item, and then additionally
375              satisfies the constraints below, which make it unique among  all
376              the possible serializations of this parsing expression.
377
378              [1]    The  string  representation of the value is the canonical
379                     representation of a pure Tcl list. I.e. it does not  con‐
380                     tain superfluous whitespace.
381
382              [2]    Terminals  are not encoded as ranges (where start and end
383                     of the range are identical).
384
385   EXAMPLE
386       Assuming the parsing expression shown on the  right-hand  side  of  the
387       rule
388
389                  Expression <- Term (AddOp Term)*
390
391
392       then its canonical serialization (except for whitespace) is
393
394                  {x {n Term} {* {x {n AddOp} {n Term}}}}
395
396

BUGS, IDEAS, FEEDBACK

398       This  document,  and the package it describes, will undoubtedly contain
399       bugs and other problems.  Please report such in the category pt of  the
400       Tcllib  Trackers  [http://core.tcl.tk/tcllib/reportlist].   Please also
401       report any ideas for enhancements  you  may  have  for  either  package
402       and/or documentation.
403
404       When proposing code changes, please provide unified diffs, i.e the out‐
405       put of diff -u.
406
407       Note further that  attachments  are  strongly  preferred  over  inlined
408       patches.  Attachments  can  be  made  by  going to the Edit form of the
409       ticket immediately after its creation, and  then  using  the  left-most
410       button in the secondary navigation bar.
411

KEYWORDS

413       EBNF,  LL(k),  PEG,  TDPL, context-free languages, expression, grammar,
414       matching, parser, parsing expression, parsing expression grammar,  push
415       down  automaton,  recursive descent, state, top-down parsing languages,
416       transducer
417

CATEGORY

419       Parsing and Grammars
420
422       Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>
423
424
425
426
427tcllib                                 1                      pt_parser_api(i)
Impressum