HTML::TreeBuilder(3)  User Contributed Perl Documentation  HTML::TreeBuilder(3)

NAME

HTML::TreeBuilder - Parser that builds an HTML syntax tree

SYNOPSIS

    use HTML::TreeBuilder;

    foreach my $file_name (@ARGV) {
        my $tree = HTML::TreeBuilder->new;    # empty tree
        $tree->parse_file($file_name);
        print "Hey, here's a dump of the parse tree of $file_name:\n";
        $tree->dump;    # a method we inherit from HTML::Element
        print "And here it is, bizarrely rerendered as HTML:\n",
            $tree->as_HTML, "\n";

        # Now that we're done with it, we must destroy it.
        $tree = $tree->delete;
    }

DESCRIPTION

(This class is part of the HTML::Tree dist.)

This class is for HTML syntax trees that get built out of HTML source.
The way to use it is to:

1. start a new (empty) HTML::TreeBuilder object,

2. then use one of the methods from HTML::Parser (presumably with
$tree->parse_file($filename) for files, or with
$tree->parse($document_content) and $tree->eof if you've got the
content in a string) to parse the HTML document into the tree $tree.

(You can combine steps 1 and 2 with the "new_from_file" or
"new_from_content" methods.)

2b. call $tree->elementify() if you want.

3. do whatever you need to do with the syntax tree, presumably
involving traversing it looking for some bit of information in it,

4. and finally, when you're done with the tree, call $tree->delete() to
erase the contents of the tree from memory. This kind of thing usually
isn't necessary with most Perl objects, but it's necessary for
TreeBuilder objects. See HTML::Element for a more verbose explanation
of why this is the case.

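The four steps above can be sketched as follows. This is only a minimal
illustration, assuming the HTML arrives as a string; the sample markup
and the variable names are made up for the example, and look_down and
attr are methods inherited from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample document; any HTML string would do.
my $html = '<html><head><title>Demo</title></head>'
         . '<body><p>See <a href="http://example.com/">this</a>.</p></body></html>';

my $tree = HTML::TreeBuilder->new;    # step 1: start an empty tree
$tree->parse($html);                  # step 2: feed it the content...
$tree->eof;                           # ...and signal the end of input

# Step 3: traverse the tree, here collecting the href of every "a" element.
my @hrefs = map { $_->attr('href') } $tree->look_down( _tag => 'a' );
print "$_\n" for @hrefs;

$tree = $tree->delete;                # step 4: erase the tree from memory
```
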
METHODS AND ATTRIBUTES

Objects of this class inherit the methods of both HTML::Parser and
HTML::Element. The methods inherited from HTML::Parser are used for
building the HTML tree, and the methods inherited from HTML::Element
are what you use to scrutinize the tree. Besides this
(HTML::TreeBuilder) documentation, you must also carefully read the
HTML::Element documentation, and also skim the HTML::Parser
documentation -- probably only its parse and parse_file methods are of
interest.

Most of the following methods native to HTML::TreeBuilder control how
parsing takes place; they should be set before you try parsing into the
given object. You can set the attributes by passing a TRUE or FALSE
value as argument. E.g., $root->implicit_tags returns the current
setting for the implicit_tags option, $root->implicit_tags(1) turns
that option on, and $root->implicit_tags(0) turns it off.

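As a small sketch of that get/set convention, the following reads an
option, changes it, and only then parses. The sample markup is invented
for the example; the "~comment" pseudo-element tag used to find stored
comments is described in the HTML::Element documentation.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new;

my $was_storing = $root->store_comments;    # no argument: read the setting
$root->store_comments(1);                   # true argument: turn it on

# Options are set, so now it is safe to parse.
$root->parse_content('<p>text<!-- note --></p>');

# Stored comments show up as pseudo-elements whose tag is "~comment".
my @comments = $root->look_down( _tag => '~comment' );
printf "store_comments was %s; found %d comment(s)\n",
    ( $was_storing ? 'on' : 'off' ), scalar @comments;

$root = $root->delete;
```
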
$root = HTML::TreeBuilder->new_from_file(...)
    This "shortcut" constructor merely combines constructing a new
    object (with the "new" method, below), and calling
    $new->parse_file(...) on it. Returns the new object. Note that
    this provides no way of setting any parse options like
    store_comments (for that, call new, and then set options, before
    calling parse_file). See the notes (below) on parameters to
    parse_file.

$root = HTML::TreeBuilder->new_from_content(...)
    This "shortcut" constructor merely combines constructing a new
    object (with the "new" method, below), and calling
    for(...){$new->parse($_)} and $new->eof on it. Returns the new
    object. Note that this provides no way of setting any parse
    options like store_comments (for that, call new, and then set
    options, before calling parse). Example usages:
    HTML::TreeBuilder->new_from_content(@lines), or
    HTML::TreeBuilder->new_from_content($content)

$root = HTML::TreeBuilder->new()
    This creates a new HTML::TreeBuilder object. This method takes no
    attributes.

$root->parse_file(...)
    [An important method inherited from HTML::Parser, which see.
    Current versions of HTML::Parser can take a filespec, or a
    filehandle object like *FOO, or an object of a class such as
    IO::Handle, IO::File, or IO::Socket. I think you should check that
    a given file exists before calling $root->parse_file($filespec).]

$root->parse(...)
    [An important method inherited from HTML::Parser, which see. See
    the note below for $root->eof().]

$root->eof()
    This signals that you're finished parsing content into this tree;
    this runs various kinds of crucial cleanup on the tree. This is
    called for you when you call $root->parse_file(...), but not when
    you call $root->parse(...). So if you call $root->parse(...), then
    you must call $root->eof() once you've finished feeding all the
    chunks to parse(...), and before you actually start doing anything
    else with the tree in $root.

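To illustrate the parse/eof rule, here is a minimal sketch that feeds a
document to parse() in several pieces and calls eof() before touching
the tree. The chunk boundaries are arbitrary and chosen only for the
example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical chunks, e.g. as they might arrive from a socket.
my @chunks = ( '<html><body><p>Hel', 'lo, wor', 'ld</p></body></html>' );

my $root = HTML::TreeBuilder->new;
$root->parse($_) for @chunks;    # feed every chunk...
$root->eof;                      # ...then finish, before using the tree

my $text = $root->as_text;       # as_text is inherited from HTML::Element
print "$text\n";

$root = $root->delete;
```
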
$root->parse_content(...)
    Basically a handy alias for "$root->parse(...); $root->eof".
    Takes the exact same arguments as "$root->parse()".

$root->delete()
    [An important method inherited from HTML::Element, which see.]

$root->elementify()
    This changes the class of the object in $root from
    HTML::TreeBuilder to the class used for all the rest of the
    elements in that tree (generally HTML::Element). Returns $root.

    For most purposes, this is unnecessary, but if you call this after
    (after!!) you've finished building a tree, then it keeps you from
    accidentally trying to call anything but HTML::Element methods on
    it. (I.e., if you accidentally call "$root->parse_file(...)" on
    the already-complete and elementified tree, then instead of
    charging ahead and wreaking havoc, it'll throw a fatal error --
    since $root is now an object just of class HTML::Element, which has
    no "parse_file" method.)

    Note that elementify currently deletes all the private attributes
    of $root except for "_tag", "_parent", "_content", "_pos", and
    "_implicit". If anyone requests that I change this to leave in yet
    more private attributes, I might do so, in future versions.

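A short sketch of what elementify() buys you, assuming the default
element class (HTML::Element). The file name passed to parse_file is a
placeholder; the call is expected to die before the name is ever used.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_content('<p>Done parsing</p>');

my $class_before = ref $root;
$root->elementify;               # only after the tree is complete
my $class_after = ref $root;

# The object no longer has a parse_file method, so this call should die
# and leave $survived undefined.
my $survived = eval { $root->parse_file('no-such-file.html'); 1 };

print "class: $class_before -> $class_after\n";
print "parse_file was refused after elementify\n" unless $survived;

$root = $root->delete;
```
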
@nodes = $root->guts()
$parent_for_nodes = $root->guts()
    In list context (as in the first case), this method returns the
    topmost non-implicit nodes in a tree. This is useful when you're
    parsing HTML code that you know isn't a complete HTML document,
    but instead just a fragment of one. For example, if you wanted the
    parse tree for a file consisting of just this:

        <li>I like pie!

    Then you would get that with "@nodes = $root->guts();". It so
    happens that in this case, @nodes will contain just one element
    object, representing the "li" node (with "I like pie!" being its
    text child node). However, consider if you were parsing this:

        <hr>Hooboy!<hr>

    In that case, "$root->guts()" would return three items: an element
    object for the first "hr", a text string "Hooboy!", and another
    "hr" element object.

    For cases where you want definitely one element (so you can treat
    it as a "document fragment", roughly speaking), call "guts()" in
    scalar context, as in "$parent_for_nodes = $root->guts()". That
    works like "guts()" in list context, except for what is returned:
    if "guts()" in list context would have returned exactly one value,
    and if it would have been an object (as opposed to a text string),
    then that's what "guts" in scalar context will return. Otherwise,
    if "guts()" in list context would have returned no values at all,
    then "guts()" in scalar context returns undef. In all other cases,
    "guts()" in scalar context returns an implicit 'div' element node,
    with children consisting of whatever nodes "guts()" in list context
    would have returned. Note that that may detach those nodes from
    $root's tree.

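The two calling contexts can be compared with a sketch like this one,
which parses the "Hooboy!" fragment from above twice. Two separate
trees are used because the scalar-context call may detach nodes; the
variable names are invented for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $fragment_source = '<hr>Hooboy!<hr>';

# List context: the topmost non-implicit nodes, text strings included.
my $list_tree = HTML::TreeBuilder->new;
$list_tree->parse_content($fragment_source);
my @nodes = $list_tree->guts;
my @kinds = map { ref($_) ? 'element:' . $_->tag : "text:$_" } @nodes;
print "$_\n" for @kinds;

# Scalar context: more than one node comes back wrapped in one element.
my $scalar_tree = HTML::TreeBuilder->new;
$scalar_tree->parse_content($fragment_source);
my $fragment     = $scalar_tree->guts;
my $fragment_tag = $fragment->tag;
print "wrapper element: $fragment_tag\n";

# Clean up everything we built.
$fragment->delete;
$scalar_tree->delete;
$list_tree->delete;
```
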
@nodes = $root->disembowel()
$parent_for_nodes = $root->disembowel()
    The "disembowel()" method works just like the "guts()" method,
    except that disembowel definitively destroys the tree above the
    nodes that are returned. Usually when you want the guts from a
    tree, you're just going to toss out the rest of the tree anyway, so
    this saves you the bother. (Remember, "disembowel" means "remove
    the guts from".)

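A minimal sketch of disembowel() on the "I like pie!" fragment used in
the guts() discussion. After the call, only the returned node is kept
and cleaned up; $root itself is treated as spent.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new;
$root->parse_content('<li>I like pie!');

# Take the guts and let the tree above them be destroyed.
my ($li) = $root->disembowel;

my $li_tag  = $li->tag;
my $li_text = $li->as_text;
print "<$li_tag> contains: $li_text\n";

$li->delete;    # the extracted node still needs explicit cleanup
```
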
$root->implicit_tags(value)
    Setting this attribute to true will instruct the parser to try to
    deduce implicit elements and implicit end tags. If it is false you
    get a parse tree that just reflects the text as it stands, which is
    unlikely to be useful for anything but quick and dirty parsing.
    (In fact, I'd be curious to hear from anyone who finds it useful to
    have implicit_tags set to false.) Default is true.

    Implicit elements have the implicit() attribute set.

$root->implicit_body_p_tag(value)
    This controls an aspect of implicit element behavior, if
    implicit_tags is on: If a text element (PCDATA) or a phrasal
    element (such as "<em>") is to be inserted under "<body>", two
    things can happen: if implicit_body_p_tag is true, it's placed
    under a new, implicit "<p>" tag. (Past DTDs suggested this was the
    only correct behavior, and this is how past versions of this module
    behaved.) But if implicit_body_p_tag is false, nothing is
    implicated -- the PCDATA or phrasal element is simply placed under
    "<body>". Default is false.

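The effect of implicit_body_p_tag can be checked with a sketch like the
following, which counts "p" elements after parsing bare text under
"<body>" with the option off and on. The helper name count_p is
hypothetical, introduced only for this example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse loose body text and count the "p" elements.
sub count_p {
    my ($wrap_in_p) = @_;
    my $root = HTML::TreeBuilder->new;
    $root->implicit_body_p_tag($wrap_in_p);    # set before parsing
    $root->parse_content('<body>loose text</body>');
    my @paragraphs = $root->look_down( _tag => 'p' );
    $root->delete;
    return scalar @paragraphs;
}

my $without_option = count_p(0);
my $with_option    = count_p(1);
print "p elements with option off: $without_option, on: $with_option\n";
```
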
$root->no_expand_entities(value)
    This attribute controls whether entities are decoded during the
    initial parse of the source. Enable this if you don't want entities
    decoded to their character value; e.g., '&amp;' is decoded to '&'
    by default, but will be left unchanged if this is enabled. Default
    is false (entities will be decoded).

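Here is a small sketch of the difference, comparing the text of the same
fragment parsed with and without no_expand_entities. The helper name
text_of is hypothetical.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse a fixed fragment and return its text.
sub text_of {
    my ($keep_entities) = @_;
    my $root = HTML::TreeBuilder->new;
    $root->no_expand_entities($keep_entities);    # set before parsing
    $root->parse_content('<p>fish &amp; chips</p>');
    my $text = $root->as_text;
    $root->delete;
    return $text;
}

my $decoded = text_of(0);    # default behavior: entities are decoded
my $raw     = text_of(1);    # entities left as written in the source
print "decoded: $decoded\n";
print "raw:     $raw\n";
```
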
$root->ignore_unknown(value)
    This attribute controls whether unknown tags should be represented
    as elements in the parse tree, or whether they should be ignored.
    Default is true (to ignore unknown tags).

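As a sketch, assuming a made-up tag named "widget" that HTML::Tagset
does not know about, the following counts how many such elements end up
in the tree under each setting. The helper name count_widgets is
hypothetical.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: count elements built from the unknown "widget" tag.
sub count_widgets {
    my ($ignore) = @_;
    my $root = HTML::TreeBuilder->new;
    $root->ignore_unknown($ignore);    # set before parsing
    $root->parse_content('<p>a <widget>custom</widget> tag</p>');
    my @found = $root->look_down( _tag => 'widget' );
    $root->delete;
    return scalar @found;
}

my $when_ignored = count_widgets(1);    # the default setting
my $when_kept    = count_widgets(0);
print "widget elements when ignored: $when_ignored, when kept: $when_kept\n";
```
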
$root->ignore_text(value)
    Do not represent the text content of elements. This saves space if
    all you want is to examine the structure of the document. Default
    is false.

$root->ignore_ignorable_whitespace(value)
    If set to true, TreeBuilder will try to avoid creating ignorable
    whitespace text nodes in the tree. Default is true. (In fact, I'd
    be interested in hearing if there's ever a case where you need this
    off, or where leaving it on leads to incorrect behavior.)

$root->no_space_compacting(value)
    This determines whether TreeBuilder compacts all whitespace strings
    in the document (well, outside of PRE or TEXTAREA elements), or
    leaves them alone. Normally (default, value of 0), each string of
    contiguous whitespace in the document is turned into a single
    space. But that's not done if no_space_compacting is set to 1.

    Setting no_space_compacting to 1 might be useful if you want to
    read in a tree just to make some minor changes to it before writing
    it back out.

    This method is experimental. If you use it, be sure to report any
    problems you might have with it.

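A quick sketch of no_space_compacting, using a run of three spaces
inside a paragraph. The helper name spaced_text is hypothetical.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse a fragment with a run of spaces in its text.
sub spaced_text {
    my ($no_compacting) = @_;
    my $root = HTML::TreeBuilder->new;
    $root->no_space_compacting($no_compacting);    # set before parsing
    $root->parse_content('<p>two   spaces</p>');
    my $text = $root->as_text;
    $root->delete;
    return $text;
}

my $compacted = spaced_text(0);    # default: whitespace runs collapse
my $preserved = spaced_text(1);    # whitespace left as written
print "[$compacted]\n[$preserved]\n";
```
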
$root->p_strict(value)
    If set to true (and it defaults to false), TreeBuilder will take a
    narrower than normal view of what can be under a "p" element; if it
    sees a non-phrasal element about to be inserted under a "p", it
    will close that "p". Otherwise it will close p elements only for
    other "p"'s, headings, and "form" (although the latter may be
    removed in future versions).

    For example, when going thru this snippet of code,

        <p>stuff
        <ul>

    TreeBuilder will normally (with "p_strict" false) put the "ul"
    element under the "p" element. However, with "p_strict" set to
    true, it will close the "p" first.

    In theory, there should be strictness options like this for
    other/all elements besides just "p"; but I treat this as a special
    case simply because of the fact that "p" occurs so frequently and
    its end-tag is omitted so often; and also because application of
    strictness rules at parse-time across all elements often makes tiny
    errors in HTML coding produce drastically bad parse-trees, in my
    experience.

    If you find that you wish you had an option like this to enforce
    content-models on all elements, then I suggest that what you want
    is content-model checking as a stage after TreeBuilder has finished
    parsing.

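The snippet above can be tried out with a sketch like this, which
reports the tag of the element that ends up as the parent of the "ul".
The helper name ul_parent_tag and the extra list item are additions made
for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: where does the "ul" land for a given p_strict value?
sub ul_parent_tag {
    my ($strict) = @_;
    my $root = HTML::TreeBuilder->new;
    $root->p_strict($strict);    # set before parsing
    $root->parse_content("<p>stuff\n<ul><li>item</li></ul>");
    my $ul         = $root->look_down( _tag => 'ul' );
    my $parent_tag = $ul->parent->tag;
    $root->delete;
    return $parent_tag;
}

my $lenient_parent = ul_parent_tag(0);    # default behavior
my $strict_parent  = ul_parent_tag(1);
print "ul parent, p_strict off: $lenient_parent\n";
print "ul parent, p_strict on:  $strict_parent\n";
```
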
$root->store_comments(value)
    This determines whether TreeBuilder will normally store comments
    found while parsing content into $root. Currently, this is off by
    default.

$root->store_declarations(value)
    This determines whether TreeBuilder will normally store markup
    declarations found while parsing content into $root. This is on by
    default.

$root->store_pis(value)
    This determines whether TreeBuilder will normally store processing
    instructions found while parsing content into $root -- assuming a
    recent version of HTML::Parser (old versions won't parse PIs
    correctly). Currently, this is off (false) by default.

    It is somewhat of a known bug (to be fixed one of these days, if
    anyone needs it?) that PIs in the preamble (before the "html"
    start-tag) end up actually under the "html" element.

$root->warn(value)
    This determines whether syntax errors during parsing should
    generate warnings, emitted via Perl's "warn" function.

    This is off (false) by default.

$h->element_class
    This method returns the class which will be used for new elements.
    It defaults to HTML::Element, but can be overridden by subclassing
    or esoteric means best left to those who will read the source and
    then not complain when those esoteric means change. (Just
    subclass.)

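One way the "just subclass" advice might look in practice is sketched
below. The package names My::Element and My::TreeBuilder and the shout
method are invented for the example, and the sketch assumes that the
parser consults element_class whenever it creates an element.

```perl
use strict;
use warnings;

{
    package My::Element;    # hypothetical element subclass
    use parent 'HTML::Element';

    # An extra method, just to show that new elements pick it up.
    sub shout { return uc $_[0]->as_text }
}

{
    package My::TreeBuilder;    # hypothetical builder subclass
    use parent 'HTML::TreeBuilder';

    # Override the class used for elements created during parsing.
    sub element_class { return 'My::Element' }
}

my $root = My::TreeBuilder->new_from_content('<p>quiet words</p>');
my $p    = $root->look_down( _tag => 'p' );

my $p_class = ref $p;
my $shouted = $p->can('shout') ? $p->shout : undef;
print "p element class: $p_class\n";
print "shouted: $shouted\n" if defined $shouted;

$root = $root->delete;
```
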
DEBUG
    Are we in Debug mode?

comment
    Accept a "here's a comment" signal from HTML::Parser.

declaration
    Accept a "here's a markup declaration" signal from HTML::Parser.

done
    TODO: document

end
    Either: Accept an end-tag signal from HTML::Parser. Or: Method for
    closing currently open elements in some fairly complex way, as used
    by other methods in this class.

    TODO: Why is this hidden?

process
    Accept a "here's a PI" signal from HTML::Parser.

start
    Accept a signal from HTML::Parser for start-tags.

    TODO: Why is this hidden?

stunt
    TODO: document

stunted
    TODO: document

text
    Accept a "here's a text token" signal from HTML::Parser.

    TODO: Why is this hidden?

tighten_up
    Legacy

    Redirects to HTML::Element::delete_ignorable_whitespace

warning
    Wrapper for CORE::warn

    TODO: why not just use carp?

HTML AND ITS DISCONTENTS

HTML is rather harder to parse than people who write it generally
suspect.

Here's the problem: HTML is a kind of SGML that permits "minimization"
and "implication". In short, this means that you don't have to close
every tag you open (because the opening of a subsequent tag may
implicitly close it), and if you use a tag that can't occur in the
context you seem to be using it in, under certain conditions the parser
will be able to realize you mean to leave the current context and enter
the new one, that being the only one that your code could correctly be
interpreted in.

Now, this would all work flawlessly and unproblematically if: 1) all
the rules that both prescribe and describe HTML were (and had been)
clearly set out, and 2) everyone was aware of these rules and wrote
their code in compliance with them.

However, it didn't happen that way, and so most HTML pages are
difficult if not impossible to correctly parse with nearly any set of
straightforward SGML rules. That's why the internals of
HTML::TreeBuilder consist of lots and lots of special cases -- instead
of being just a generic SGML parser with HTML DTD rules plugged in.

TRANSLATIONS?

The techniques that HTML::TreeBuilder uses to perform what I consider
very robust parses on everyday code are not things that can work only
in Perl. To date, the algorithms at the center of HTML::TreeBuilder
have been implemented only in Perl, as far as I know; and I don't
foresee getting around to implementing them in any other language any
time soon.

If, however, anyone is looking for a semester project for an applied
programming class (or if they merely enjoy extra-curricular masochism),
they might do well to see about choosing as a topic the
implementation/adaptation of these routines to any other interesting
programming language that they feel currently suffers from a lack of
robust HTML-parsing. I welcome correspondence on this subject, and
point out that one can learn a great deal about languages by trying to
translate between them, and then comparing the result.

The HTML::TreeBuilder source may seem long and complex, but it is
rather well commented, and symbol names are generally self-explanatory.
(You are encouraged to read the Mozilla HTML parser source for
comparison.) Some of the complexity comes from little-used features,
and some of it comes from having the HTML tokenizer (HTML::Parser)
being a separate module, requiring somewhat of a different interface
than you'd find in a combined tokenizer and tree-builder. But most of
the length of the source comes from the fact that it's essentially a
long list of special cases, with lots and lots of sanity-checking, and
sanity-recovery -- because, as Roseanne Roseannadanna once said, "it's
always something".

Users looking to compare several HTML parsers should look at the source
for Raggett's Tidy ("<http://www.w3.org/People/Raggett/tidy/>"),
Mozilla ("<http://www.mozilla.org/>"), and possibly root around the
browsers section of Yahoo to find the various open-source ones
("<http://dir.yahoo.com/Computers_and_Internet/Software/Internet/World_Wide_Web/Browsers/>").

BUGS

* Framesets seem to work correctly now. Email me if you get a strange
  parse from a document with framesets.

* Really bad HTML code will, often as not, make for a somewhat
  objectionable parse tree. Regrettable, but unavoidably true.

* If you're running with implicit_tags off (God help you!), consider
  that $tree->content_list probably contains the tree or grove from the
  parse, and not $tree itself (which will, oddly enough, be an implicit
  'html' element). This seems counter-intuitive and problematic; but
  seeing as how almost no HTML ever parses correctly with implicit_tags
  off, this interface oddity seems the least of your problems.

BUG REPORTS

When a document parses in a way different from how you think it should,
I ask that you report this to me as a bug. The first thing you should
do is copy the document, trim out as much of it as you can while still
producing the bug in question, and then email me that mini-document and
the code you're using to parse it, to the HTML::Tree bug queue at
"bug-html-tree at rt.cpan.org".

Include a note as to how it parses (presumably including its
$tree->dump output), and then a careful and clear explanation of where
you think the parser is going astray, and how you would prefer that it
work instead.

SEE ALSO

HTML::Tree; HTML::Parser, HTML::Element, HTML::Tagset

HTML::DOMbo

COPYRIGHT

Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke, 2005 Andy
Lester, 2006 Pete Krawczyk, 2010 Jeff Fearn.

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

This program is distributed in the hope that it will be useful, but
without any warranty; without even the implied warranty of
merchantability or fitness for a particular purpose.

AUTHOR

Currently maintained by Pete Krawczyk "<petek@cpan.org>"

Original authors: Gisle Aas, Sean Burke and Andy Lester.

perl v5.12.2                      2010-12-20              HTML::TreeBuilder(3)