HTML::TreeBuilder(3)  User Contributed Perl Documentation  HTML::TreeBuilder(3)

NAME
HTML::TreeBuilder - Parser that builds a HTML syntax tree

SYNOPSIS
    use HTML::TreeBuilder;

    foreach my $file_name (@ARGV) {
        my $tree = HTML::TreeBuilder->new;    # empty tree
        $tree->parse_file($file_name);
        print "Hey, here's a dump of the parse tree of $file_name:\n";
        $tree->dump;    # a method we inherit from HTML::Element
        print "And here it is, bizarrely rerendered as HTML:\n",
            $tree->as_HTML, "\n";

        # Now that we're done with it, we must destroy it.
        $tree = $tree->delete;
    }

DESCRIPTION
(This class is part of the HTML::Tree dist.)

This class is for HTML syntax trees that get built out of HTML source.
The way to use it is to:

1. start a new (empty) HTML::TreeBuilder object,

2. then use one of the methods from HTML::Parser (presumably with
$tree->parse_file($filename) for files, or with
$tree->parse($document_content) and $tree->eof if you've got the
content in a string) to parse the HTML document into the tree $tree.

(You can combine steps 1 and 2 with the "new_from_file" or
"new_from_content" methods.)

2b. call $tree->elementify() if you want,

3. do whatever you need to do with the syntax tree, presumably
involving traversing it looking for some bit of information in it,

4. and finally, when you're done with the tree, call $tree->delete() to
erase the contents of the tree from memory. This kind of thing usually
isn't necessary with most Perl objects, but it's necessary for
TreeBuilder objects. See HTML::Element for a more verbose explanation
of why this is the case.
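
The steps above can be sketched for the case where the content is already in a string. This is only an illustration: the HTML in $html is made up, and look_down and as_text are methods inherited from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input; any HTML string would do.
my $html = '<html><head><title>Pie shop</title></head>'
         . '<body><p>I like <b>pie</b>!</p></body></html>';

my $tree = HTML::TreeBuilder->new;    # step 1: an empty tree
$tree->parse($html);                  # step 2: feed in the content...
$tree->eof;                           # ...and say that there is no more

# Step 3: look for some bit of information in the tree.
my $title_elem = $tree->look_down(_tag => 'title');
my $title      = $title_elem ? $title_elem->as_text : '';
print "Title: $title\n";

$tree = $tree->delete;                # step 4: free the tree
```

Anything you want to keep (here, $title) has to be copied out of the tree before the call to delete.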

METHODS AND ATTRIBUTES
Objects of this class inherit the methods of both HTML::Parser and
HTML::Element. The methods inherited from HTML::Parser are used for
building the HTML tree, and the methods inherited from HTML::Element
are what you use to scrutinize the tree. Besides this
(HTML::TreeBuilder) documentation, you must also carefully read the
HTML::Element documentation, and also skim the HTML::Parser
documentation -- probably only its parse and parse_file methods are of
interest.

Most of the following methods native to HTML::TreeBuilder control how
parsing takes place; they should be set before you try parsing into the
given object. You can set the attributes by passing a TRUE or FALSE
value as argument. E.g., $root->implicit_tags returns the current
setting for the implicit_tags option, $root->implicit_tags(1) turns
that option on, and $root->implicit_tags(0) turns it off.
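
A minimal sketch of that getter/setter convention, using implicit_tags as the example. The variable names are arbitrary; only the no-argument form is used to read a setting, since what the one-argument form returns is not described here.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new;

my $default = $root->implicit_tags;      # no argument: read the setting
$root->implicit_tags(0);                 # false value: turn the option off
my $after_off = $root->implicit_tags;
$root->implicit_tags(1);                 # true value: turn it back on
my $after_on = $root->implicit_tags;

printf "default=%s, after off=%s, after on=%s\n",
    map { $_ ? 'on' : 'off' } $default, $after_off, $after_on;

$root->delete;
```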

$root = HTML::TreeBuilder->new_from_file(...)
    This "shortcut" constructor merely combines constructing a new
    object (with the "new" method, below), and calling
    $new->parse_file(...) on it. Returns the new object. Note that
    this provides no way of setting any parse options like
    store_comments (for that, call new, and then set options, before
    calling parse_file). See the notes (below) on parameters to
    parse_file.

$root = HTML::TreeBuilder->new_from_content(...)
    This "shortcut" constructor merely combines constructing a new
    object (with the "new" method, below), and calling
    for(...){$new->parse($_)} and $new->eof on it. Returns the new
    object. Note that this provides no way of setting any parse
    options like store_comments (for that, call new, and then set
    options, before calling parse). Example usages:
    HTML::TreeBuilder->new_from_content(@lines), or
    HTML::TreeBuilder->new_from_content($content)
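
Both example usages of new_from_content can be sketched as follows. The list markup in @lines is invented for the illustration, and look_down comes from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical content, held as a list of lines.
my @lines = ("<ul>\n", "<li>first</li>\n", "<li>second</li>\n", "</ul>\n");

# Form 1: pass the pieces as a list.
my $from_lines = HTML::TreeBuilder->new_from_content(@lines);

# Form 2: pass the whole document as one string.
my $from_string = HTML::TreeBuilder->new_from_content(join '', @lines);

# Count the "li" elements found in each tree.
my $count_lines  = () = $from_lines->look_down(_tag => 'li');
my $count_string = () = $from_string->look_down(_tag => 'li');
print "li elements: $count_lines (list), $count_string (string)\n";

$_->delete for $from_lines, $from_string;
```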

$root = HTML::TreeBuilder->new()
    This creates a new HTML::TreeBuilder object. This method takes no
    arguments.

$root->parse_file(...)
    [An important method inherited from HTML::Parser, which see.
    Current versions of HTML::Parser can take a filespec, or a
    filehandle object (like *FOO, or some object from class IO::Handle,
    IO::File, IO::Socket, or the like). I think you should check that a
    given file exists before calling $root->parse_file($filespec).]
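
A hedged sketch of that advice. The temporary file exists only to make the example self-contained, and the check on parse_file's return value relies on HTML::Parser's documented behavior of returning an undefined value when a named file cannot be opened.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::TreeBuilder;

# Write a small throwaway file (hypothetical content).
my ($fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} '<html><head><title>Demo</title></head>'
          . '<body><p>Hi</p></body></html>';
close $fh or die "Cannot close $path: $!";

# Check that the file is there before handing it to the parser.
-e $path or die "No such file: $path";

my $root = HTML::TreeBuilder->new;
defined $root->parse_file($path) or die "Cannot parse $path: $!";

my $title = $root->look_down(_tag => 'title')->as_text;
print "Title: $title\n";

$root->delete;
```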

$root->parse(...)
    [An important method inherited from HTML::Parser, which see. See
    the note below for $root->eof().]

$root->eof()
    This signals that you're finished parsing content into this tree;
    this runs various kinds of crucial cleanup on the tree. This is
    called for you when you call $root->parse_file(...), but not when
    you call $root->parse(...). So if you call $root->parse(...), then
    you must call $root->eof() once you've finished feeding all the
    chunks to parse(...), and before you actually start doing anything
    else with the tree in $root.

$root->parse_content(...)
    Basically a handy alias for "$root->parse(...); $root->eof".
    Takes the exact same arguments as "$root->parse()".
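
The difference between parse plus eof and parse_content can be sketched like this. The chunks are made up and are deliberately split at tag boundaries.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical document, arriving in several chunks.
my @chunks = ('<ul><li>one</li>', '<li>two</li>', '<li>three</li></ul>');

# Feed the chunks one at a time, then signal the end of input.
my $chunked = HTML::TreeBuilder->new;
$chunked->parse($_) for @chunks;
$chunked->eof;    # required before the tree is used

# The same content in one call, with eof implied.
my $one_shot = HTML::TreeBuilder->new;
$one_shot->parse_content(join '', @chunks);

my @chunked_items  = map { $_->as_text } $chunked->look_down(_tag => 'li');
my @one_shot_items = map { $_->as_text } $one_shot->look_down(_tag => 'li');
print "chunked:  @chunked_items\n";
print "one shot: @one_shot_items\n";

$_->delete for $chunked, $one_shot;
```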

$root->delete()
    [An important method inherited from HTML::Element, which see.]

$root->elementify()
    This changes the class of the object in $root from
    HTML::TreeBuilder to the class used for all the rest of the
    elements in that tree (generally HTML::Element). Returns $root.

    For most purposes, this is unnecessary, but if you call this after
    (after!!) you've finished building a tree, then it keeps you from
    accidentally trying to call anything but HTML::Element methods on
    it. (I.e., if you accidentally call "$root->parse_file(...)" on
    the already-complete and elementified tree, then instead of
    charging ahead and wreaking havoc, it'll throw a fatal error --
    since $root is now an object just of class HTML::Element, which has
    no "parse_file" method.)

    Note that elementify currently deletes all the private attributes
    of $root except for "_tag", "_parent", "_content", "_pos", and
    "_implicit". If anyone requests that I change this to leave in yet
    more private attributes, I might do so, in future versions.
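
A small sketch of what elementify changes, assuming the tree uses the usual HTML::Element class for its elements. The file name passed to parse_file is hypothetical and is never opened, because the call fails before that point.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_content('<p>Done parsing</p>');

my $class_before = ref $root;
$root->elementify;               # only after the tree is complete
my $class_after = ref $root;

# Parsing into the elementified tree is now a fatal error; trap it.
my $error = '';
eval { $root->parse_file('hypothetical.html'); 1 } or $error = $@;

print "class: $class_before -> $class_after\n";
print "parse_file failed as expected\n" if $error;

$root->delete;
```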

@nodes = $root->guts()
$parent_for_nodes = $root->guts()
    In list context (as in the first case), this method returns the
    topmost non-implicit nodes in a tree. This is useful when you're
    parsing HTML code that you expect to be not a whole HTML document,
    but instead just a fragment of an HTML document. For example, if
    you wanted the parse tree for a file consisting of just this:

        <li>I like pie!

    Then you would get that with "@nodes = $root->guts();". It so
    happens that in this case, @nodes will contain just one element
    object, representing the "li" node (with "I like pie!" being its
    text child node). However, consider if you were parsing this:

        <hr>Hooboy!<hr>

    In that case, "$root->guts()" would return three items: an element
    object for the first "hr", a text string "Hooboy!", and another
    "hr" element object.

    For cases where you want definitely one element (so you can treat
    it as a "document fragment", roughly speaking), call "guts()" in
    scalar context, as in "$parent_for_nodes = $root->guts()". That
    works like "guts()" in list context; in fact, if "guts()" in list
    context would have returned exactly one value, and if it would have
    been an object (as opposed to a text string), then that's what
    "guts" in scalar context will return. Otherwise, if "guts()" in
    list context would have returned no values at all, then "guts()" in
    scalar context returns undef. In all other cases, "guts()" in
    scalar context returns an implicit 'div' element node, with
    children consisting of whatever nodes "guts()" in list context
    would have returned. Note that that may detach those nodes from
    $root's tree.
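
The "Hooboy!" case can be sketched in both contexts. The tag, implicit and content_list methods come from HTML::Element; the labels in @described are just for display.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_content('<hr>Hooboy!<hr>');

# List context: the topmost non-implicit nodes, text strings included.
my @nodes     = $root->guts;
my @described = map { ref $_ ? '<' . $_->tag . '>' : "text:$_" } @nodes;
print "list context:   @described\n";

# Scalar context: several nodes come back wrapped in an implicit div.
my $fragment          = $root->guts;
my $fragment_tag      = $fragment->tag;
my $fragment_implicit = $fragment->implicit ? 1 : 0;
my $fragment_children = scalar $fragment->content_list;
print "scalar context: <$fragment_tag> with $fragment_children children\n";

$fragment->delete;
$root->delete;
```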

@nodes = $root->disembowel()
$parent_for_nodes = $root->disembowel()
    The "disembowel()" method works just like the "guts()" method,
    except that disembowel definitively destroys the tree above the
    nodes that are returned. Usually when you want the guts from a
    tree, you're just going to toss out the rest of the tree anyway, so
    this saves you the bother. (Remember, "disembowel" means "remove
    the guts from".)
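
A short sketch of disembowel on the "I like pie!" fragment from the guts() entry. The parent method is inherited from HTML::Element; the check on it is only there to show that the returned node no longer hangs off the old tree.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_content('<li>I like pie!');

# Take the guts and throw away the tree that was above them.
my ($li) = $root->disembowel;

my $li_tag     = $li->tag;
my $li_text    = $li->as_text;
my $has_parent = defined $li->parent ? 1 : 0;
print "<$li_tag> says '$li_text'; still attached: $has_parent\n";

# The returned node is now ours to delete when we're done with it.
$li->delete;
```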

$root->implicit_tags(value)
    Setting this attribute to true will instruct the parser to try to
    deduce implicit elements and implicit end tags. If it is false you
    get a parse tree that just reflects the text as it stands, which is
    unlikely to be useful for anything but quick and dirty parsing.
    (In fact, I'd be curious to hear from anyone who finds it useful to
    have implicit_tags set to false.) Default is true.

    Implicit elements have the implicit() attribute set.

$root->implicit_body_p_tag(value)
    This controls an aspect of implicit element behavior, if
    implicit_tags is on: If a text element (PCDATA) or a phrasal
    element (such as "<em>") is to be inserted under "<body>", two
    things can happen: if implicit_body_p_tag is true, it's placed
    under a new, implicit "<p>" tag. (Past DTDs suggested this was the
    only correct behavior, and this is how past versions of this module
    behaved.) But if implicit_body_p_tag is false, nothing is
    implicated -- the PCDATA or phrasal element is simply placed under
    "<body>". Default is false.
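
As an illustration of the two outcomes, this sketch parses the same made-up snippet with implicit_body_p_tag off and on, and records whether a "p" element turned up in the tree.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input: bare text and a phrasal element, no <p> anywhere.
my $html = 'Hello <em>world</em>';

my %p_found;
for my $setting (0, 1) {
    my $root = HTML::TreeBuilder->new;
    $root->implicit_body_p_tag($setting);    # set before parsing
    $root->parse($html);
    $root->eof;

    $p_found{$setting} = $root->look_down(_tag => 'p') ? 1 : 0;
    $root->delete;
}

print "implicit_body_p_tag=$_: p element present: $p_found{$_}\n" for 0, 1;
```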

$root->ignore_unknown(value)
    This attribute controls whether unknown tags should be represented
    as elements in the parse tree, or whether they should be ignored.
    Default is true (that is, unknown tags are ignored).
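
A sketch of the effect, using an invented tag name ("blorp") that no HTML version defines:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# "blorp" is a made-up tag, used only to trigger the unknown-tag path.
my $html = '<p>Some <blorp>odd</blorp> markup</p>';

my %blorp_count;
for my $ignore (1, 0) {
    my $root = HTML::TreeBuilder->new;
    $root->ignore_unknown($ignore);    # 1 is the default
    $root->parse($html);
    $root->eof;

    $blorp_count{$ignore} = () = $root->look_down(_tag => 'blorp');
    $root->delete;
}

print "ignore_unknown=$_: blorp elements: $blorp_count{$_}\n" for 1, 0;
```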

$root->ignore_text(value)
    If set to true, the text content of elements is not represented in
    the tree. This saves space if all you want is to examine the
    structure of the document. Default is false.
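
For instance, a structure-only parse might look like the following sketch. The list markup is made up.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new;
$root->ignore_text(1);    # keep the elements, drop their text
$root->parse('<ul><li>one</li><li>two</li></ul>');
$root->eof;

my $li_count = () = $root->look_down(_tag => 'li');
my $text     = $root->as_text;
print "li elements: $li_count, text length: ", length($text), "\n";

$root->delete;
```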

$root->ignore_ignorable_whitespace(value)
    If set to true, TreeBuilder will try to avoid creating ignorable
    whitespace text nodes in the tree. Default is true. (In fact, I'd
    be interested in hearing if there's ever a case where you need this
    off, or where leaving it on leads to incorrect behavior.)

$root->no_space_compacting(value)
    This determines whether TreeBuilder compacts all whitespace strings
    in the document (well, outside of PRE or TEXTAREA elements), or
    leaves them alone. Normally (default, value of 0), each string of
    contiguous whitespace in the document is turned into a single
    space. But that's not done if no_space_compacting is set to 1.

    Setting no_space_compacting to 1 might be useful if you want to
    read in a tree just to make some minor changes to it before writing
    it back out.

    This method is experimental. If you use it, be sure to report any
    problems you might have with it.
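
The effect of no_space_compacting can be sketched by parsing one snippet both ways. The snippet, with three spaces between the letters, is invented.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = '<p>a   b</p>';    # three spaces between "a" and "b"

my %text;
for my $setting (0, 1) {
    my $root = HTML::TreeBuilder->new;
    $root->no_space_compacting($setting);    # 0 is the default
    $root->parse($html);
    $root->eof;

    $text{$setting} = $root->look_down(_tag => 'p')->as_text;
    $root->delete;
}

print "no_space_compacting=$_: [$text{$_}]\n" for 0, 1;
```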

$root->p_strict(value)
    If set to true (and it defaults to false), TreeBuilder will take a
    narrower than normal view of what can be under a "p" element; if it
    sees a non-phrasal element about to be inserted under a "p", it
    will close that "p". Otherwise it will close p elements only for
    other "p"'s, headings, and "form" (altho the latter may be removed
    in future versions).

    For example, when going thru this snippet of code,

        <p>stuff
        <ul>

    TreeBuilder will normally (with "p_strict" false) put the "ul"
    element under the "p" element. However, with "p_strict" set to
    true, it will close the "p" first.

    In theory, there should be strictness options like this for
    other/all elements besides just "p"; but I treat this as a special
    case simply because of the fact that "p" occurs so frequently and
    its end-tag is omitted so often; and also because application of
    strictness rules at parse-time across all elements often makes tiny
    errors in HTML coding produce drastically bad parse-trees, in my
    experience.

    If you find that you wish you had an option like this to enforce
    content-models on all elements, then I suggest that what you want
    is content-model checking as a stage after TreeBuilder has finished
    parsing.
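
The snippet from this entry can be tried with p_strict off and on; the sketch records which element ends up as the parent of the "ul". The added "li" is an assumption made only to give the list some content.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = "<p>stuff\n<ul><li>item</ul>";

my %ul_parent;
for my $strict (0, 1) {
    my $root = HTML::TreeBuilder->new;
    $root->p_strict($strict);    # 0 is the default
    $root->parse($html);
    $root->eof;

    my $ul = $root->look_down(_tag => 'ul');
    $ul_parent{$strict} = $ul->parent->tag;
    $root->delete;
}

print "p_strict=$_: ul sits under <$ul_parent{$_}>\n" for 0, 1;
```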

$root->store_comments(value)
    This determines whether TreeBuilder will normally store comments
    found while parsing content into $root. Currently, this is off by
    default.
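
A sketch of store_comments in use. It relies on the HTML::Element convention that a stored comment is a pseudo-element whose tag is "~comment" and whose content is in its "text" attribute; the markup is made up.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = '<p>Visible<!-- a note --></p>';

my (%comment_count, $comment_text);
for my $store (0, 1) {
    my $root = HTML::TreeBuilder->new;
    $root->store_comments($store);    # must be set before parsing
    $root->parse($html);
    $root->eof;

    my @comments = $root->look_down(_tag => '~comment');
    $comment_count{$store} = scalar @comments;
    $comment_text = $comments[0]->attr('text') if @comments;
    $root->delete;
}

print "store_comments=$_: comments kept: $comment_count{$_}\n" for 0, 1;
print "comment text: [$comment_text]\n" if defined $comment_text;
```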

$root->store_declarations(value)
    This determines whether TreeBuilder will normally store markup
    declarations found while parsing content into $root. This is on by
    default.

$root->store_pis(value)
    This determines whether TreeBuilder will normally store processing
    instructions found while parsing content into $root -- assuming a
    recent version of HTML::Parser (old versions won't parse PIs
    correctly). Currently, this is off (false) by default.

    It is somewhat of a known bug (to be fixed one of these days, if
    anyone needs it?) that PIs in the preamble (before the "html"
    start-tag) end up actually under the "html" element.

$root->warn(value)
    This determines whether syntax errors during parsing should
    generate warnings, emitted via Perl's "warn" function.

    This is off (false) by default.

HTML AND ITS DISCONTENTS
HTML is rather harder to parse than people who write it generally
suspect.

Here's the problem: HTML is a kind of SGML that permits "minimization"
and "implication". In short, this means that you don't have to close
every tag you open (because the opening of a subsequent tag may
implicitly close it), and if you use a tag that can't occur in the
context you seem to be using it in, under certain conditions the parser
will be able to realize you mean to leave the current context and enter
the new one, that being the only one that your code could correctly be
interpreted in.

Now, this would all work flawlessly and unproblematically if: 1) all
the rules that both prescribe and describe HTML were (and had been)
clearly set out, and 2) everyone was aware of these rules and wrote
their code in compliance with them.

However, it didn't happen that way, and so most HTML pages are
difficult if not impossible to correctly parse with nearly any set of
straightforward SGML rules. That's why the internals of
HTML::TreeBuilder consist of lots and lots of special cases -- instead
of being just a generic SGML parser with HTML DTD rules plugged in.

TRANSLATIONS?
The techniques that HTML::TreeBuilder uses to perform what I consider
very robust parses on everyday code are not things that can work only
in Perl. To date, the algorithms at the center of HTML::TreeBuilder
have been implemented only in Perl, as far as I know; and I don't
foresee getting around to implementing them in any other language any
time soon.

If, however, anyone is looking for a semester project for an applied
programming class (or if they merely enjoy extra-curricular masochism),
they might do well to see about choosing as a topic the
implementation/adaptation of these routines to any other interesting
programming language that they feel currently suffers from a lack of
robust HTML-parsing. I welcome correspondence on this subject, and
point out that one can learn a great deal about languages by trying to
translate between them, and then comparing the result.

The HTML::TreeBuilder source may seem long and complex, but it is
rather well commented, and symbol names are generally self-explanatory.
(You are encouraged to read the Mozilla HTML parser source for
comparison.) Some of the complexity comes from little-used features,
and some of it comes from having the HTML tokenizer (HTML::Parser)
being a separate module, requiring somewhat of a different interface
than you'd find in a combined tokenizer and tree-builder. But most of
the length of the source comes from the fact that it's essentially a
long list of special cases, with lots and lots of sanity-checking, and
sanity-recovery -- because, as Roseanne Roseannadanna once said, "it's
always something".

Users looking to compare several HTML parsers should look at the source
for Raggett's Tidy (<http://www.w3.org/People/Raggett/tidy/>),
Mozilla (<http://www.mozilla.org/>), and possibly root around the
browsers section of Yahoo to find the various open-source ones
(<http://dir.yahoo.com/Computers_and_Internet/Software/Internet/World_Wide_Web/Browsers/>).

BUGS
* Framesets seem to work correctly now. Email me if you get a strange
  parse from a document with framesets.

* Really bad HTML code will, often as not, make for a somewhat
  objectionable parse tree. Regrettable, but unavoidably true.

* If you're running with implicit_tags off (God help you!), consider
  that $tree->content_list probably contains the tree or grove from the
  parse, and not $tree itself (which will, oddly enough, be an implicit
  'html' element). This seems counter-intuitive and problematic; but
  seeing as how almost no HTML ever parses correctly with implicit_tags
  off, this interface oddity seems the least of your problems.

BUG REPORTS
When a document parses in a way different from how you think it should,
I ask that you report this to me as a bug. The first thing you should
do is copy the document, trim out as much of it as you can while still
producing the bug in question, and then email that mini-document and
the code you're using to parse it to the HTML::Tree bug queue at
"bug-html-tree at rt.cpan.org".

Include a note as to how it parses (presumably including its
$tree->dump output), and then a careful and clear explanation of where
you think the parser is going astray, and how you would prefer that it
work instead.
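
A bug report along those lines could be built around a script like this sketch. The contents of $mini_document are a placeholder for your own trimmed-down document, and printing the module version is a suggestion of this example rather than a stated requirement.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Placeholder: replace with the smallest document that still shows the bug.
my $mini_document = '<p>stuff<ul><li>item</ul>';

my $tree = HTML::TreeBuilder->new;
# Set any non-default options here, exactly as your real code does.
$tree->parse($mini_document);
$tree->eof;

print "HTML::TreeBuilder version: ", HTML::TreeBuilder->VERSION, "\n";
print "Parse tree:\n";
$tree->dump;    # the output to include in the report

my $rendered = $tree->as_HTML;
print "Rerendered as HTML:\n$rendered\n";

$tree = $tree->delete;
```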

SEE ALSO
HTML::Tree; HTML::Parser, HTML::Element, HTML::Tagset

HTML::DOMbo

COPYRIGHT
Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke, 2005 Andy
Lester, 2006 Pete Krawczyk.

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

This program is distributed in the hope that it will be useful, but
without any warranty; without even the implied warranty of
merchantability or fitness for a particular purpose.

AUTHOR
Currently maintained by Pete Krawczyk <petek@cpan.org>

Original authors: Gisle Aas, Sean Burke and Andy Lester.

perl v5.10.1                     2006-11-12               HTML::TreeBuilder(3)