htmlparse(n) HTML Parser htmlparse(n)

______________________________________________________________________________

NAME
htmlparse - Procedures to parse HTML strings

SYNOPSIS
package require Tcl 8.2

package require struct::stack 1.3

package require cmdline 1.1

package require htmlparse ?1.2.2?

::htmlparse::parse ?-cmd cmd? ?-vroot tag? ?-split n? ?-incvar var? ?-queue q? html

::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag

::htmlparse::mapEscapes html

::htmlparse::2tree html tree

::htmlparse::removeVisualFluff tree

::htmlparse::removeFormDefs tree

______________________________________________________________________________

DESCRIPTION
The htmlparse package provides commands that allow libraries and
applications to parse HTML in a string into a representation of their
choice.

The following commands are available:

::htmlparse::parse ?-cmd cmd? ?-vroot tag? ?-split n? ?-incvar var? ?-queue q? html
This command is the basic parser for HTML. It takes an HTML string,
parses it, and invokes a command prefix for every tag encountered. The
HTML does not have to be valid for the parser to function; checking
validity is the responsibility of the command invoked for every tag.
The invoked command is also responsible for handling tag attributes
and character entities (escaped characters). To help with the former,
the parser passes the uninterpreted tag attributes to the invoked
command. To help with the latter, the package provides the helper
command ::htmlparse::mapEscapes. The parser ignores leading DOCTYPE
declarations and all valid HTML comments it encounters.

All information beyond the HTML string itself is specified via
options, which are explained below.

Some background information about the parser helps in understanding
the options.

The parser is capable of detecting incomplete tags in the HTML string
given to it. Under normal circumstances an incomplete tag causes the
parser to throw an error. If the option -incvar is used to specify a
global (or namespace) variable, however, the parser stores the
incomplete part of the input in this variable instead. This greatly
simplifies the handling of incrementally arriving HTML, as the parser
handles whatever it can and defers the incomplete part until more data
has arrived.

Another feature of the parser is its two modes of operation. The
normal mode is active if the option -queue is not present on the
command line invoking the parser. If it is present, the parser
operates in incremental mode instead.

The main difference is that a parser in normal mode immediately
invokes the command prefix for each tag it encounters. A parser in
incremental mode instead generates a number of scripts, each of which
invokes the command prefix for a group of tags in the HTML string,
and stores these scripts in the specified queue. It is then the
responsibility of the caller of the parser to ensure the execution of
the scripts in the queue.

Note: The queue object given to the parser has to provide the same
interface as the queue defined in the struct module of tcllib. This
means, for example, that all queues created via that module can be
used here directly. The queue does not have to come from the struct
module, however, as long as it provides the same interface.

In both modes the parser returns an empty string to the caller.

The -split option may be given to a parser in incremental mode to
specify the size of the groups it creates. In other words, -split 5
means that each of the generated scripts invokes the command prefix
for 5 consecutive tags in the HTML string. A parser in normal mode
ignores this option and its value.

The option -vroot specifies a virtual root tag. A parser in normal
mode invokes the command prefix for it immediately before and after
it processes the tags in the HTML, thus simulating that the HTML
string is enclosed in a <vroot> </vroot> combination. A parser in
incremental mode, however, is unable to provide the closing virtual
root, as it never knows when the input is complete. In this case the
first script generated by each invocation of the parser contains an
invocation of the command prefix for the virtual root as its first
command.

The following options are available:

-cmd cmd
The command prefix to invoke for every tag in the HTML string.
Defaults to ::htmlparse::debugCallback.

-vroot tag
The virtual root tag to add around the HTML in normal mode. In
incremental mode it is the first tag in each chunk processed by the
parser, and there is no closing tag. Defaults to hmstart.

-split n
The size of the groups produced by a parser in incremental mode.
Ignored in normal mode. Defaults to 10. Values <= 0 are not allowed.

-incvar var
The name of the variable in which to store any incomplete HTML. This
makes most sense for the incremental mode. Without it the parser
throws an error when it sees incomplete HTML, which makes sense for
the normal mode. Only incomplete tags are detected, not missing tags.
Optional; by default no variable is used.

-queue q
The queue object in which to store the generated scripts. Its
presence switches the parser into incremental mode. Optional; by
default the parser runs in normal mode.

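To illustrate the -vroot option in normal mode, here is a minimal
sketch that records the order in which tags are reported. The callback
::noteTag, the variable ::tags and the root name doc are assumptions
chosen for the example, not part of the package.

```tcl
package require htmlparse

# Hypothetical callback: record each reported tag, with a leading
# slash for closing tags.
set ::tags {}
proc ::noteTag {tag slash param text} {
    lappend ::tags $slash$tag
}

# Use "doc" instead of the default virtual root tag hmstart.
::htmlparse::parse -cmd ::noteTag -vroot doc {<i>x</i>}

# The recorded sequence is bracketed by the virtual root.
set first [lindex $::tags 0]
set last  [lindex $::tags end]
```

In normal mode the first recorded entry should be the opening virtual
root and the last one its closing counterpart.
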
Interface to the command prefix
In normal mode the parser invokes the command prefix with four
arguments appended. See ::htmlparse::debugCallback for a description.

In incremental mode, however, the generated scripts invoke the
command prefix with five arguments appended. The last four of these
are the same as those mentioned above. The first is a placeholder
string (@win@) for a clientdata value to be supplied later, during
the actual execution of the generated scripts. This could be a Tk
window path, for example. The placeholder allows the user of this
package to preprocess HTML strings without committing them to a
specific window, object, or similar during parsing; this connection
can be made later. It also means that preprocessed HTML can be cached.
Of course, nothing prevents the user of the parser from replacing the
placeholder with an empty string.

::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag
This command is the standard callback used by ::htmlparse::parse if
none was specified by the user. It simply dumps its arguments to
stdout. The callback can be used in both normal and incremental mode
of the calling parser. In other words, it accepts four or five
arguments. The last four arguments are always present. The optional
leading argument contains the clientdata value passed to the callback
by a parser in incremental mode. All callbacks have to follow the
signature of this command in the last four arguments, and callbacks
used in incremental parsing have to follow it in all five arguments.

The first argument, clientdata, is optional and present only if this
command is invoked by a parser in incremental mode. It contains
whatever the user of this package wishes.

The second argument, tag, contains the name of the tag currently
processed by the parser.

The third argument, slash, is either empty or contains a slash
character. It allows the callback to distinguish between opening tags
(slash is empty) and closing tags (slash contains a slash character).

The fourth argument, param, contains the uninterpreted parameters
(attributes) of the tag.

The fifth and last argument, textBehindTheTag, contains the text found
by the parser behind the tag named in tag.

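A user-supplied callback for normal mode might look like the following
sketch. It follows the four-argument signature described above and
classifies each tag as opening or closing by looking at slash. The
names ::onTag and ::events are hypothetical and exist only for this
example.

```tcl
package require htmlparse

# Hypothetical four-argument callback for normal mode.
set ::events {}
proc ::onTag {tag slash param text} {
    # slash is empty for opening tags and "/" for closing tags.
    set kind [expr {$slash eq "/" ? "close" : "open"}]
    # param is passed through uninterpreted; text is whatever follows
    # the tag up to the next tag.
    lappend ::events [list $kind $tag $param [string trim $text]]
}

::htmlparse::parse -cmd ::onTag {<a href="x.html">link</a> tail}

# Show what the parser reported, one event per line.
foreach e $::events {
    puts $e
}
```

The attribute string of the a tag arrives unparsed in param, so a
callback that needs individual attribute names and values has to split
it itself.
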
::htmlparse::mapEscapes html
This command takes an HTML string, replaces all escape sequences with
their actual characters, and returns the resulting string. HTML
strings which do not contain escape sequences are returned unchanged.

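A short sketch of the command in use, with input strings invented for
the example:

```tcl
package require htmlparse

# A string containing the named entities for <, > and &.
set plain [::htmlparse::mapEscapes {1 &lt; 2 &amp;&amp; 3 &gt; 2}]

# A string without escape sequences comes back as it went in.
set same [::htmlparse::mapEscapes {no escapes here}]

puts $plain
puts $same
```

A callback would typically apply this to the text it receives, since
the parser itself leaves character entities untouched.
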
::htmlparse::2tree html tree
This command is a wrapper around ::htmlparse::parse which takes an
HTML string (in html) and converts it into a tree containing the
logical structure of the parsed document. The name of the tree is
given to the command as its second argument (tree). The command does
not create the tree itself but expects the caller to provide an
existing, empty tree. It also expects the specified tree object to
follow the same interface as the tree object in the struct module of
tcllib. The tree does not have to come from that module, but it must
provide the same interface.

The internal callback does some basic checking of HTML validity and
tries to recover from the most basic errors. The command returns the
contents of its second argument. As a side effect it populates and
manipulates the given tree object.

Each node in the generated tree represents one tag in the input. The
name of the tag is stored in the attribute type of the node. Any HTML
attributes coming with the tag are stored unmodified in the attribute
data of the node. In other words, the command does not parse HTML
attributes into their names and values.

If a tag contains text, its node has children of type PCDATA
containing this text. The text is stored in the attribute data of
these children.

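The following sketch builds a tree from a small document and then
inspects the type and data attributes described above. It assumes the
tree comes from struct::tree (version 2 interface, with walk taking a
loop variable and a script); the tree name doc and the variables types
and texts are arbitrary choices for the example.

```tcl
package require struct::tree
package require htmlparse

# The caller creates the empty tree; "doc" is an arbitrary name.
::struct::tree doc

set result [::htmlparse::2tree {<html><body><b>Hi</b></body></html>} doc]

# Collect the type of every node and the text of all PCDATA nodes.
set types {}
set texts {}
doc walk root n {
    set type [doc get $n type]
    lappend types $type
    if {$type eq "PCDATA"} {
        lappend texts [doc get $n data]
    }
}

puts $types
puts $texts
```

Walking from the root node visits the virtual root and every tag node,
so the text of the b element shows up as a PCDATA child below it.
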
::htmlparse::removeVisualFluff tree
This command walks a tree as generated by ::htmlparse::2tree and
removes all nodes which represent visual tags rather than structural
ones. Its purpose is to make the tree easier to navigate without
getting bogged down in visual information not relevant to the search.
Its only argument is the name of the tree to cut down.

::htmlparse::removeFormDefs tree
Like ::htmlparse::removeVisualFluff, this command cuts down the size
of the tree as generated by ::htmlparse::2tree. It removes all nodes
representing forms and form elements. Its only argument is the name of
the tree to cut down.

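A sketch of both pruning commands applied to one tree follows. It
assumes a struct::tree object (version 2 interface), and that the b tag
counts as visual, which this page does not state explicitly. The helper
nodeTypes and the tree name page are hypothetical.

```tcl
package require struct::tree
package require htmlparse

# Hypothetical helper: list the type attribute of every node in a tree.
proc nodeTypes {tree} {
    set types {}
    $tree walk root n {
        lappend types [$tree get $n type]
    }
    return $types
}

::struct::tree page
::htmlparse::2tree \
    {<html><body><form><input name="q"></form><b>Hi</b></body></html>} page

set before [nodeTypes page]

# Drop visual markup first, then forms and their elements.
::htmlparse::removeVisualFluff page
::htmlparse::removeFormDefs page

set after [nodeTypes page]

puts "before: $before"
puts "after:  $after"
```

Both commands modify the tree in place, so a caller that still needs
the full tree has to keep a copy before pruning.
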
BUGS, IDEAS, FEEDBACK
This document, and the package it describes, will undoubtedly contain
bugs and other problems. Please report such in the category htmlparse
of the Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist]. Please
also report any ideas for enhancements you may have for either package
and/or documentation.

When proposing code changes, please provide unified diffs, i.e. the
output of diff -u.

Note further that attachments are strongly preferred over inlined
patches. Attachments can be made by going to the Edit form of the
ticket immediately after its creation, and then using the left-most
button in the secondary navigation bar.

SEE ALSO
struct::tree

KEYWORDS
html, parsing, queue, tree

CATEGORY
Text processing

tcllib 1.2.2 htmlparse(n)