htmlparse(n)                      HTML Parser                     htmlparse(n)

______________________________________________________________________________

NAME

       htmlparse - Procedures to parse HTML strings

SYNOPSIS

       package require Tcl 8.2

       package require struct::stack 1.3

       package require cmdline 1.1

       package require htmlparse ?1.1.3?

       ::htmlparse::parse ?-cmd cmd? ?-vroot tag? ?-split n? ?-incvar var?
       ?-queue q? html

       ::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag

       ::htmlparse::mapEscapes html

       ::htmlparse::2tree html tree

       ::htmlparse::removeVisualFluff tree

       ::htmlparse::removeFormDefs tree

_________________________________________________________________

DESCRIPTION

       The htmlparse package provides commands that allow libraries and
       applications to parse HTML in a string into a representation of their
       choice.

       The following commands are available:

       ::htmlparse::parse ?-cmd cmd? ?-vroot tag? ?-split n? ?-incvar var?
       ?-queue q? html
              This command is the basic parser for HTML. It takes an HTML
              string, parses it, and invokes a command prefix for every tag
              encountered. The HTML does not have to be valid for the parser
              to function; checking validity is the responsibility of the
              command invoked for every tag. That command is also responsible
              for handling tag attributes and character entities (escaped
              characters). The parser passes the uninterpreted tag attributes
              to the invoked command to aid in the former, and the package
              provides a helper command, ::htmlparse::mapEscapes, to aid in
              the latter. The parser ignores leading DOCTYPE declarations and
              all valid HTML comments it encounters.

              All information beyond the HTML string itself is specified via
              options, which are explained below.

              Some background on the parser helps in understanding the
              options.

              The parser is capable of detecting incomplete tags in the HTML
              string given to it. Under normal circumstances this causes the
              parser to throw an error, but if the option -incvar is used to
              specify a global (or namespace) variable, the parser stores the
              incomplete part of the input in this variable instead. This
              greatly aids the handling of incrementally arriving HTML, as the
              parser handles whatever it can and defers the handling of the
              incomplete part until more data has arrived.
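
              The following is a minimal sketch of feeding HTML in two chunks
              with -incvar, in normal mode. The callback recordTag and the
              variables ::events and ::pending are hypothetical names chosen
              for this illustration. The variable is initialised before the
              first call, and the sketch assumes that the parser prepends the
              stored fragment to the input of the next call, as the
              description above suggests.

```tcl
package require htmlparse

# Hypothetical callback: appends one entry per tag event to ::events.
proc recordTag {tag slash param text} {
    lappend ::events [list $slash$tag $param $text]
}

set ::events  {}
set ::pending {}   ;# initialised up front so the parser can read and write it

# The first chunk stops in the middle of a tag.
::htmlparse::parse -cmd recordTag -incvar ::pending {<p>Hello <b cla}
set afterFirst $::pending
puts "held back after chunk 1: '$afterFirst'"

# The second chunk supplies the rest of the tag.
::htmlparse::parse -cmd recordTag -incvar ::pending {ss="x">world</b></p>}
puts "held back after chunk 2: '$::pending'"
```

              Without -incvar the first call would be expected to raise an
              error instead of holding back the unfinished tag.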

              The parser has two modes of operation. The normal mode is
              active if the option -queue is not present on the command line
              invoking the parser. If it is present, the parser goes into
              incremental mode instead.

              The main difference is that a parser in normal mode immediately
              invokes the command prefix for each tag it encounters. In
              incremental mode, however, the parser generates a number of
              scripts which invoke the command prefix for groups of tags in
              the HTML string and then stores these scripts in the specified
              queue. It is then the responsibility of the caller of the parser
              to ensure the execution of the scripts in the queue.

              Note: The queue object given to the parser has to provide the
              same interface as the queue defined in the struct module of
              tcllib. This means, for example, that all queues created via
              that module can be used here immediately. The queue does not
              have to come from that module, as long as it provides the same
              interface.

              In both modes the parser returns an empty string to the caller.

              The -split option may be given to a parser in incremental mode
              to specify the size of the groups it creates. In other words,
              -split 5 means that each of the generated scripts invokes the
              command prefix for 5 consecutive tags in the HTML string. A
              parser in normal mode ignores this option and its value.

              The option -vroot specifies a virtual root tag. A parser in
              normal mode invokes the command prefix for it immediately before
              and after it processes the tags in the HTML, thus simulating
              that the HTML string is enclosed in a <vroot> </vroot>
              combination. In incremental mode, however, the parser is unable
              to provide the closing virtual root, as it never knows when the
              input is complete. In this case the first script generated by
              each invocation of the parser contains an invocation of the
              command prefix for the virtual root as its first command.

              The following options are available:

              -cmd cmd
                     The command prefix to invoke for every tag in the HTML
                     string. Defaults to ::htmlparse::debugCallback.

              -vroot tag
                     The virtual root tag to add around the HTML in normal
                     mode. In incremental mode it is the first tag in each
                     chunk processed by the parser, but there will be no
                     closing tags. Defaults to hmstart.

              -split n
                     The size of the groups produced by an incremental mode
                     parser. Ignored in normal mode. Defaults to 10. Values
                     <= 0 are not allowed.

              -incvar var
                     The name of the variable in which to store any
                     incomplete HTML. This is most useful in incremental
                     mode. The parser throws an error if it sees incomplete
                     HTML and has no place to store it, which is the
                     sensible behaviour for normal mode. Only incomplete
                     tags are detected, not missing tags. Optional, defaults
                     to 'no variable'.


              Interface to the command prefix
                     In normal mode the parser invokes the command prefix
                     with four arguments appended. See
                     ::htmlparse::debugCallback for a description.

                     In incremental mode, however, the generated scripts
                     invoke the command prefix with five arguments appended.
                     The last four of these are the same as those mentioned
                     above. The first is a placeholder string (\\win\\) for a
                     clientdata value to be supplied later, during the actual
                     execution of the generated scripts. This could be a Tk
                     window path, for example. This allows the user of this
                     package to preprocess HTML strings without committing
                     them to a specific window or object during parsing; the
                     connection can be made later. It also means that
                     preprocessed HTML can be cached. Of course, nothing
                     prevents the user of the parser from replacing the
                     placeholder with an empty string.

       ::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag
              This command is the standard callback used by
              ::htmlparse::parse if none was specified by the user. It simply
              dumps its arguments to stdout. This callback can be used for
              both the normal and the incremental mode of the calling parser;
              in other words, it accepts four or five arguments. The last four
              arguments are described below. The optional extra argument,
              which comes first, contains the clientdata value passed to the
              callback by a parser in incremental mode. All callbacks have to
              follow the signature of this command in the last four
              arguments, and callbacks used in incremental parsing have to
              follow it in all five arguments.

              The first argument, clientdata, is optional and present only if
              this command is invoked by a parser in incremental mode. It
              contains whatever the user of this package wishes.

              The second argument, tag, contains the name of the tag
              currently being processed by the parser.

              The third argument, slash, is either empty or contains a slash
              character. It allows the callback to distinguish between opening
              tags (slash is empty) and closing tags (slash contains a slash
              character).

              The fourth argument, param, contains the uninterpreted list of
              parameters to the tag.

              The fifth and last argument, textBehindTheTag, contains the text
              found by the parser behind the tag named in tag.
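
              As an illustration of the four-argument signature, the sketch
              below replaces the default callback with a hypothetical
              procedure, recordTag, that collects every tag event in a
              hypothetical global list ::events. It runs the parser in normal
              mode with the default virtual root, so the recorded events are
              expected to be bracketed by hmstart and /hmstart.

```tcl
package require htmlparse

# Hypothetical four-argument callback for normal mode. It appends one
# entry per tag event to the global list ::events. The slash is glued to
# the tag name so that opening and closing tags are easy to tell apart.
proc recordTag {tag slash param text} {
    lappend ::events [list $slash$tag $param $text]
}

set ::events {}
::htmlparse::parse -cmd recordTag {<p class="x">Hello <b>world</b></p>}

# Print one line per event: tag, raw attribute string, text behind the tag.
foreach event $::events {
    puts $event
}
```

              The callback is a global procedure and stores its results in a
              global variable so that it does not depend on the scope in
              which the parser evaluates the command prefix.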

       ::htmlparse::mapEscapes html
              This command takes an HTML string, substitutes all escape
              sequences with their actual characters, and returns the
              resulting string. HTML strings which do not contain escape
              sequences are returned unchanged.
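
              A short sketch of the substitution, using a few common named
              entities. The variable names raw and plain are chosen for this
              illustration only.

```tcl
package require htmlparse

# Text as it might arrive in the textBehindTheTag argument of a callback.
set raw   {Fish &amp; chips &lt;b&gt;}
set plain [::htmlparse::mapEscapes $raw]

puts $plain
```

              A callback would typically apply this command to the text
              behind a tag before storing or displaying it.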

       ::htmlparse::2tree html tree
              This command is a wrapper around ::htmlparse::parse which takes
              an HTML string (in html) and converts it into a tree containing
              the logical structure of the parsed document. The name of the
              tree is given to the command as its second argument (tree). The
              command does not create the tree itself but expects the caller
              to provide an existing, empty tree. It also expects the
              specified tree object to follow the same interface as the tree
              object in the struct module of tcllib. The tree does not have to
              come from that module, but it must provide the same interface.

              The internal callback does some basic checking of HTML validity
              and tries to recover from the most basic errors. The command
              returns the value of its second argument (the name of the
              tree). Its side effect is the manipulation of the tree object.

              Each node in the generated tree represents one tag in the input.
              The name of the tag is stored in the attribute type of the node.
              Any HTML attributes coming with the tag are stored unmodified in
              the attribute data of the node. In other words, the command does
              not parse HTML attributes into their names and values.

              If a tag contains text, its node will have children of type
              PCDATA containing this text. The text is stored in the
              attribute data of these children.
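
              The sketch below builds a tree from a small document and lists
              the type of every node. It assumes the struct::tree 2.x
              interface found in current tcllib releases (get node key, walk
              node loopvar script); older releases of the struct module use a
              different calling convention for these methods. The tree name t
              and the variable types are hypothetical.

```tcl
package require struct::tree
package require htmlparse

# Assumption: struct::tree 2.x interface ("get node key",
# "walk node loopvar script").
::struct::tree t

::htmlparse::2tree {<html><body><h1>Title</h1><p>Some text</p></body></html>} t

# Collect the type of every node. PCDATA nodes carry their text in the
# attribute "data".
set types {}
t walk root n {
    if {![t keyexists $n type]} continue
    set type [t get $n type]
    lappend types $type
    if {$type eq "PCDATA"} {
        puts "PCDATA: [t get $n data]"
    } else {
        puts $type
    }
}
```

              Because tag attributes are stored as an uninterpreted string,
              any splitting into names and values is left to the caller.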

       ::htmlparse::removeVisualFluff tree
              This command walks a tree as generated by ::htmlparse::2tree and
              removes all nodes which represent visual tags rather than
              structural ones. Its purpose is to make the tree easier to
              navigate, without getting bogged down in visual information
              that is not relevant to the search. Its only argument is the
              name of the tree to cut down.

       ::htmlparse::removeFormDefs tree
              Like ::htmlparse::removeVisualFluff, this command cuts down the
              size of a tree as generated by ::htmlparse::2tree. It removes
              all nodes representing forms and form elements. Its only
              argument is the name of the tree to cut down.
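
              The two pruning commands can be tried out as sketched below.
              The helper nodeTypes and the tree name doc are hypothetical,
              and the sketch again assumes the struct::tree 2.x interface of
              current tcllib releases. It also assumes that b counts as a
              visual tag; the description above does not list the tags that
              are removed.

```tcl
package require struct::tree
package require htmlparse

# Hypothetical helper: returns the types of all nodes in a tree.
proc nodeTypes {tree} {
    set types {}
    $tree walk root n {
        if {[$tree keyexists $n type]} {
            lappend types [$tree get $n type]
        }
    }
    return $types
}

::struct::tree doc
::htmlparse::2tree \
    {<body><h1>A <b>bold</b> title</h1><form action="/search">Query</form></body>} \
    doc
set before [nodeTypes doc]

# Prune visual markup first, then forms and their contents.
::htmlparse::removeVisualFluff doc
::htmlparse::removeFormDefs    doc
set after [nodeTypes doc]

puts "before: $before"
puts "after:  $after"
```

              Comparing the two lists shows which node types each command
              dropped for a given input.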

BUGS, IDEAS, FEEDBACK

       This document, and the package it describes, will undoubtedly contain
       bugs and other problems. Please report them in the category htmlparse
       of the Tcllib SF Trackers
       [http://sourceforge.net/tracker/?group_id=12883]. Please also report
       any ideas for enhancements you may have for either the package or the
       documentation.

SEE ALSO

       struct::tree

KEYWORDS

       html, parsing, queue, tree


htmlparse                            1.1.3                        htmlparse(n)