htmlparse(n)                      HTML Parser                     htmlparse(n)

______________________________________________________________________________

NAME

       htmlparse - Procedures to parse HTML strings

SYNOPSIS

       package require Tcl  8.2

       package require struct::stack  1.3

       package require cmdline  1.1

       package require htmlparse  ?1.2.2?

       ::htmlparse::parse ?-cmd cmd? ?-vroot tag? ?-split n? ?-incvar var?
       ?-queue q? html

       ::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag

       ::htmlparse::mapEscapes html

       ::htmlparse::2tree html tree

       ::htmlparse::removeVisualFluff tree

       ::htmlparse::removeFormDefs tree

______________________________________________________________________________

DESCRIPTION

       The htmlparse package provides commands that allow libraries and
       applications to parse HTML in a string into a representation of their
       choice.

       The following commands are available:

       ::htmlparse::parse ?-cmd cmd? ?-vroot tag? ?-split n? ?-incvar var?
       ?-queue q? html
              This command is the basic parser for HTML. It takes an HTML
              string, parses it, and invokes a command prefix for every tag
              encountered. The HTML does not have to be valid for this parser
              to function; checking validity is the responsibility of the
              command invoked for every tag. The invoked command is also
              responsible for handling tag attributes and character entities
              (escaped characters). The parser provides the uninterpreted tag
              attributes to the invoked command to aid in the former, and the
              package provides a helper command, ::htmlparse::mapEscapes, to
              aid in the latter. The parser ignores leading DOCTYPE
              declarations and all valid HTML comments it encounters.

              All information beyond the HTML string itself is specified via
              options, which are explained below.

              Some background information about the parser helps in
              understanding the options.

              The parser is capable of detecting incomplete tags in the HTML
              string given to it. Under normal circumstances this causes the
              parser to throw an error, but if the option -incvar is used to
              specify a global (or namespace) variable, the parser stores the
              incomplete part of the input in this variable instead. This
              helps with incrementally arriving HTML, as the parser handles
              whatever it can and defers the incomplete part until more data
              has arrived.
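
              A minimal sketch of how -incvar might be used to feed HTML that
              arrives in pieces. The callback name onTag, the variable names,
              and the sample chunks are hypothetical. The sketch assumes that
              -incvar is accepted in normal mode and that the unparsed tail
              is left in the named global variable.

```tcl
package require htmlparse

# Hypothetical callback: record every tag except the virtual root.
proc ::onTag {tag slash param text} {
    if {$tag eq "hmstart"} { return }
    lappend ::tags $slash$tag
}

set ::tags    {}
set ::pending ""

# The first chunk ends in the middle of a tag ("<b").
foreach chunk [list {<p>Hello <b} {>world</b></p>}] {
    # Prepend whatever the previous call could not parse, then clear
    # the variable so the leftover is fed in exactly once.
    set input $::pending$chunk
    set ::pending ""
    ::htmlparse::parse -cmd ::onTag -incvar ::pending $input
}

puts $::tags
```

              Without -incvar the first call in this sketch would be expected
              to raise an error, because the chunk ends in an incomplete tag.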

              Another feature of the parser is its two possible modes of
              operation. The normal mode is active if the option -queue is
              not present on the command line invoking the parser. If it is
              present, the parser goes into incremental mode instead.

              The main difference is that a parser in normal mode immediately
              invokes the command prefix for each tag it encounters. In
              incremental mode, however, the parser generates a number of
              scripts which invoke the command prefix for groups of tags in
              the HTML string and then stores these scripts in the specified
              queue. It is then the responsibility of the caller of the
              parser to ensure the execution of the scripts in the queue.

              Note: The queue object given to the parser has to provide the
              same interface as the queue defined in tcllib -> struct. This
              means, for example, that all queues created via that tcllib
              module can be used here immediately. Still, the queue does not
              have to come from tcllib -> struct as long as the same
              interface is provided.

              In both modes the parser returns an empty string to the caller.

              The -split option may be given to a parser in incremental mode
              to specify the size of the groups it creates. In other words,
              -split 5 means that each of the generated scripts invokes the
              command prefix for 5 consecutive tags in the HTML string. A
              parser in normal mode ignores this option and its value.

              The option -vroot specifies a virtual root tag. A parser in
              normal mode invokes the command prefix for it immediately
              before and after it processes the tags in the HTML, thus
              simulating that the HTML string is enclosed in a <vroot>
              </vroot> combination. In incremental mode, however, the parser
              is unable to provide the closing virtual root as it never knows
              when the input is complete. In this case the first script
              generated by each invocation of the parser contains an
              invocation of the command prefix for the virtual root as its
              first command.

              The following options are available:

              -cmd cmd
                     The command prefix to invoke for every tag in the HTML
                     string. Defaults to ::htmlparse::debugCallback.

              -vroot tag
                     The virtual root tag to add around the HTML in normal
                     mode. In incremental mode it is the first tag in each
                     chunk processed by the parser, but there will be no
                     closing tags. Defaults to hmstart.

              -split n
                     The size of the groups produced by an incremental mode
                     parser. Ignored in normal mode. Defaults to 10. Values
                     <= 0 are not allowed.

              -incvar var
                     The name of the variable in which to store any
                     incomplete HTML. This makes most sense for the
                     incremental mode. The parser throws an error if it sees
                     incomplete HTML and has no place to store it, which
                     makes sense for the normal mode. Only incomplete tags
                     are detected, not missing tags. Optional, defaults to
                     'no variable'.

              Interface to the command prefix
                     In normal mode the parser invokes the command prefix
                     with four arguments appended. See
                     ::htmlparse::debugCallback for a description.

                     In incremental mode, however, the generated scripts
                     invoke the command prefix with five arguments appended.
                     The last four of these are the same as those mentioned
                     above. The first is a placeholder string (@win@) for a
                     clientdata value to be supplied later during the actual
                     execution of the generated scripts. This could be a Tk
                     window path, for example. This allows the user of this
                     package to preprocess HTML strings without committing
                     them to a specific window or object during parsing; the
                     connection can be made later. It also means that
                     preprocessed HTML can be cached. Of course, nothing
                     prevents the user of the parser from replacing the
                     placeholder with an empty string.
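
              A minimal sketch of incremental mode, assuming a queue from
              struct::queue and that the generated scripts carry the literal
              placeholder @win@ as described above. The callback onTagInc,
              the queue name scriptQueue, and the clientdata value widget1
              are hypothetical.

```tcl
package require struct::queue
package require htmlparse

# Hypothetical five-argument callback for incremental mode.
proc ::onTagInc {clientdata tag slash param text} {
    lappend ::events [list $clientdata $tag $slash]
}

set ::events {}
::struct::queue scriptQueue

# Parse now: the generated scripts are stored in the queue.
::htmlparse::parse -cmd ::onTagInc -queue ::scriptQueue -split 2 \
    {<ul><li>one</li><li>two</li></ul>}

# Run later: substitute the placeholder, then execute each script.
while {[::scriptQueue size] > 0} {
    set script [string map [list @win@ widget1] [::scriptQueue get]]
    uplevel #0 $script
}

::scriptQueue destroy
puts $::events
```

              Substituting the placeholder with string map is only one
              possible choice; it works here because the clientdata value
              contains no characters that are special to Tcl.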

       ::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag
              This command is the standard callback used by the parser in
              ::htmlparse::parse if none was specified by the user. It simply
              dumps its arguments to stdout. This callback can be used for
              both the normal and the incremental mode of the calling parser.
              In other words, it accepts four or five arguments. The last
              four arguments are described below. The optional extra
              argument, which precedes them, contains the clientdata value
              passed to the callback by a parser in incremental mode. All
              callbacks have to follow the signature of this command in the
              last four arguments, and callbacks used in incremental parsing
              have to follow this signature in the last five arguments.

              The first argument, clientdata, is optional and present only if
              this command is invoked by a parser in incremental mode. It
              contains whatever the user of this package wishes.

              The second argument, tag, contains the name of the tag which is
              currently processed by the parser.

              The third argument, slash, is either empty or contains a slash
              character. It allows the callback to distinguish between
              opening tags (slash is empty) and closing tags (slash contains
              a slash character).

              The fourth argument, param, contains the uninterpreted list of
              parameters to the tag.

              The fifth and last argument, textBehindTheTag, contains the
              text found by the parser behind the tag named in tag.
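
              The four-argument signature can be sketched with a custom
              callback for normal mode. The procedure name showTag, the
              virtual root name doc, and the sample HTML are hypothetical;
              the sketch only relies on the argument order described above.

```tcl
package require htmlparse

# Hypothetical four-argument callback for normal mode.
proc ::showTag {tag slash param text} {
    # An empty slash marks an opening tag, "/" a closing tag.
    set kind [expr {$slash eq "/" ? "close" : "open"}]
    lappend ::seen [list $kind $tag $param [string trim $text]]
}

set ::seen {}
::htmlparse::parse -cmd ::showTag -vroot doc \
    {<a href="index.html">Home</a> and <em>more</em>}

foreach entry $::seen {
    puts $entry
}
```

              The attribute string arrives unparsed in param, so a callback
              that needs individual attribute names and values has to split
              it itself.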

       ::htmlparse::mapEscapes html
              This command takes an HTML string, substitutes all escape
              sequences with their actual characters, and then returns the
              resulting string. HTML strings which do not contain escape
              sequences are returned unchanged.
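
              A short illustration of the substitution; the variable names
              and the sample strings are hypothetical.

```tcl
package require htmlparse

set raw   {Fish &amp; chips &lt;fresh&gt;}
set plain [::htmlparse::mapEscapes $raw]
puts $plain   ;# Fish & chips <fresh>

# A string without escape sequences comes back unchanged.
set same [::htmlparse::mapEscapes {no entities here}]
puts $same
```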

       ::htmlparse::2tree html tree
              This command is a wrapper around ::htmlparse::parse which takes
              an HTML string (in html) and converts it into a tree containing
              the logical structure of the parsed document. The name of the
              tree is given to the command as its second argument (tree). The
              command does not create the tree itself but expects the caller
              to provide an existing, empty tree. It also expects that the
              specified tree object follows the same interface as the tree
              object in tcllib -> struct. It does not have to come from
              tcllib -> struct, but it must provide the same interface.

              The internal callback does some basic checking of HTML validity
              and tries to recover from the most basic errors. The command
              returns the contents of its second argument. Side effects are
              the creation and manipulation of a tree object.

              Each node in the generated tree represents one tag in the
              input. The name of the tag is stored in the attribute type of
              the node. Any HTML attributes coming with the tag are stored
              unmodified in the attribute data of the node. In other words,
              the command does not parse HTML attributes into their names and
              values.

              If a tag contains text, its node will have children of type
              PCDATA containing this text. The text is stored in the
              attribute data of these children.
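
              A minimal sketch of building and inspecting such a tree. It
              assumes struct::tree version 2 (where attributes are read with
              "tree get node key"), a root node named root, and Tcl 8.5 or
              later for the {*} syntax. The helper collectTypes and the tree
              name doc are hypothetical.

```tcl
package require struct::tree
package require htmlparse

# Hypothetical helper: collect {depth type} pairs, depth first.
proc ::collectTypes {tree node depth} {
    set out [list [list $depth [$tree get $node type]]]
    foreach child [$tree children $node] {
        lappend out {*}[::collectTypes $tree $child [expr {$depth + 1}]]
    }
    return $out
}

# The caller creates the empty tree; 2tree only fills it.
::struct::tree doc
::htmlparse::2tree \
    {<html><body><h1>Title</h1><p>Some text</p></body></html>} ::doc

set ::types [::collectTypes ::doc root 0]
foreach pair $::types {
    lassign $pair depth type
    puts "[string repeat {  } $depth]$type"
}

::doc destroy
```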

       ::htmlparse::removeVisualFluff tree
              This command walks a tree as generated by ::htmlparse::2tree
              and removes all nodes which represent visual tags rather than
              structural ones. The purpose of the command is to make the tree
              easier to navigate without getting bogged down in visual
              information not relevant to the search. Its only argument is
              the name of the tree to cut down.

       ::htmlparse::removeFormDefs tree
              Like ::htmlparse::removeVisualFluff, this command is here to
              cut down on the size of the tree as generated by
              ::htmlparse::2tree. It removes all nodes representing forms and
              form elements. Its only argument is the name of the tree to cut
              down.
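
              A minimal sketch of applying both pruning commands to a tree
              built by ::htmlparse::2tree. It assumes struct::tree version 2
              and a root node named root. The helper countNodes, the tree
              name page, and the sample HTML are hypothetical, and which tags
              count as visual is decided by the package, not by this sketch.

```tcl
package require struct::tree
package require htmlparse

# Hypothetical helper: count a node and everything below it.
proc ::countNodes {tree node} {
    set n 1
    foreach child [$tree children $node] {
        incr n [::countNodes $tree $child]
    }
    return $n
}

::struct::tree page
::htmlparse::2tree \
    {<body><p><b>Bold</b> text</p><form><input name="q"></form></body>} \
    ::page

set ::before [::countNodes ::page root]
::htmlparse::removeVisualFluff ::page
::htmlparse::removeFormDefs    ::page
set ::after  [::countNodes ::page root]

puts "nodes before: $::before, after: $::after"
::page destroy
```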

BUGS, IDEAS, FEEDBACK

       This document, and the package it describes, will undoubtedly contain
       bugs and other problems. Please report such in the category htmlparse
       of the Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist]. Please
       also report any ideas for enhancements you may have for either package
       and/or documentation.

       When proposing code changes, please provide unified diffs, i.e. the
       output of diff -u.

       Note further that attachments are strongly preferred over inlined
       patches. Attachments can be made by going to the Edit form of the
       ticket immediately after its creation, and then using the left-most
       button in the secondary navigation bar.

SEE ALSO

       struct::tree

KEYWORDS

       html, parsing, queue, tree

CATEGORY

       Text processing

tcllib                               1.2.2                        htmlparse(n)