HTML::TreeBuilder(3pm)

HTML::TreeBuilder(3)  User Contributed Perl Documentation HTML::TreeBuilder(3)

NAME

       HTML::TreeBuilder - Parser that builds a HTML syntax tree

SYNOPSIS

         use HTML::TreeBuilder;

         foreach my $file_name (@ARGV) {
           my $tree = HTML::TreeBuilder->new; # empty tree
           $tree->parse_file($file_name);
           print "Hey, here's a dump of the parse tree of $file_name:\n";
           $tree->dump; # a method we inherit from HTML::Element
           print "And here it is, bizarrely rerendered as HTML:\n",
             $tree->as_HTML, "\n";

           # Now that we're done with it, we must destroy it.
           $tree = $tree->delete;
         }

DESCRIPTION

       (This class is part of the HTML::Tree dist.)

       This class is for HTML syntax trees that get built out of HTML source.
       The way to use it is to:

       1. start a new (empty) HTML::TreeBuilder object,

       2. then use one of the methods from HTML::Parser (presumably with
       $tree->parse_file($filename) for files, or with
       $tree->parse($document_content) and $tree->eof if you've got the
       content in a string) to parse the HTML document into the tree $tree.

       (You can combine steps 1 and 2 with the "new_from_file" or
       "new_from_content" methods.)

       2b. call $tree->elementify() if you want.

       3. do whatever you need to do with the syntax tree, presumably
       involving traversing it looking for some bit of information in it,

       4. and finally, when you're done with the tree, call $tree->delete() to
       erase the contents of the tree from memory.  This kind of thing usually
       isn't necessary with most Perl objects, but it's necessary for
       TreeBuilder objects.  See HTML::Element for a more verbose explanation
       of why this is the case.

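       The steps above can be sketched as follows.  This is a minimal
       illustration, not the only way to do it: the sample document in $html
       is made up for the example, and look_down and as_text are methods
       inherited from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input, invented for this sketch.
my $html = '<html><head><title>Pies</title></head>'
         . '<body><ul><li>apple<li>cherry</ul></body></html>';

# Step 1: start a new, empty tree.
my $tree = HTML::TreeBuilder->new;

# Step 2: parse a string; parse() needs an explicit eof() afterwards.
$tree->parse($html);
$tree->eof;

# Step 2b (optional): turn the root into a plain HTML::Element.
$tree->elementify;

# Step 3: traverse the tree, here collecting the text of each list item.
my @items = map { $_->as_text } $tree->look_down(_tag => 'li');
print "Found: @items\n";

# Step 4: free the tree explicitly.
$tree->delete;
```

       Note that the second "li" start-tag implicitly closes the first one,
       so the sketch finds two separate list items.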

METHODS AND ATTRIBUTES

       Objects of this class inherit the methods of both HTML::Parser and
       HTML::Element.  The methods inherited from HTML::Parser are used for
       building the HTML tree, and the methods inherited from HTML::Element
       are what you use to scrutinize the tree.  Besides this
       (HTML::TreeBuilder) documentation, you must also carefully read the
       HTML::Element documentation, and also skim the HTML::Parser
       documentation -- probably only its parse and parse_file methods are of
       interest.

       Most of the following methods native to HTML::TreeBuilder control how
       parsing takes place; they should be set before you try parsing into the
       given object.  You can set the attributes by passing a TRUE or FALSE
       value as argument.  E.g., $root->implicit_tags returns the current
       setting for the implicit_tags option, $root->implicit_tags(1) turns
       that option on, and $root->implicit_tags(0) turns it off.

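       As a small sketch of that get/set convention (the variable names are
       only illustrative), an option can be read and changed on a fresh
       object before any parsing happens:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new;

# Called without an argument, the method reports the current setting.
my $default = $root->implicit_tags;

# Called with an argument, it changes the setting.
$root->implicit_tags(0);
my $after_off = $root->implicit_tags;

$root->implicit_tags(1);
my $after_on = $root->implicit_tags;

printf "default=%s, after (0)=%s, after (1)=%s\n",
    map { $_ ? 'on' : 'off' } $default, $after_off, $after_on;

$root->delete;
```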
       $root = HTML::TreeBuilder->new_from_file(...)
           This "shortcut" constructor merely combines constructing a new
           object (with the "new" method, below), and calling
           $new->parse_file(...) on it.  Returns the new object.  Note that
           this provides no way of setting any parse options like
           store_comments (for that, call new, and then set options, before
           calling parse_file).  See the notes (below) on parameters to
           parse_file.

       $root = HTML::TreeBuilder->new_from_content(...)
           This "shortcut" constructor merely combines constructing a new
           object (with the "new" method, below), and calling
           for(...){$new->parse($_)} and $new->eof on it.  Returns the new
           object.  Note that this provides no way of setting any parse
           options like store_comments (for that, call new, and then set
           options, before calling parse).  Example usages:
           HTML::TreeBuilder->new_from_content(@lines), or
           HTML::TreeBuilder->new_from_content($content)

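           A brief sketch of the shortcut constructor, assuming the content
           is already in memory as a list of lines (the sample lines are
           invented for the example):

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input, as it might come from reading a file line by line.
my @lines = ("<title>Pie list</title>\n", "<p>I like <b>pie</b>!\n");

# Construct, parse every line, and call eof in one step.
my $root = HTML::TreeBuilder->new_from_content(@lines);

my $title = $root->look_down(_tag => 'title')->as_text;
my $bold  = $root->look_down(_tag => 'b')->as_text;
print "title: $title, bold text: $bold\n";

$root->delete;
```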
       $root = HTML::TreeBuilder->new()
           This creates a new HTML::TreeBuilder object.  This method takes no
           attributes.

       $root->parse_file(...)
           [An important method inherited from HTML::Parser, which see.
           Current versions of HTML::Parser can take a filespec, or a
           filehandle object (like *FOO, or some object from class IO::Handle,
           IO::File, IO::Socket, or the like).  I think you should check that
           a given file exists before calling $root->parse_file($filespec).]

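           One way to follow that advice is sketched below.  The temporary
           file and its contents exist only to keep the example
           self-contained; in real use $filespec would name your own file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::TreeBuilder;

# Write a small sample document (invented for this sketch).
my ($fh, $filespec) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} "<html><head><title>Demo</title></head>",
            "<body><p>Hi</p></body></html>\n";
close $fh or die "Cannot close $filespec: $!";

# Check the file before handing it to parse_file.
die "No readable file at $filespec\n" unless -f $filespec && -r _;

my $root = HTML::TreeBuilder->new;
$root->parse_file($filespec);    # eof() is called for us here

my $title = $root->look_down(_tag => 'title')->as_text;
print "Title: $title\n";

$root->delete;
```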
       $root->parse(...)
           [An important method inherited from HTML::Parser, which see.  See
           the note below for $root->eof().]

       $root->eof()
           This signals that you're finished parsing content into this tree;
           this runs various kinds of crucial cleanup on the tree.  This is
           called for you when you call $root->parse_file(...), but not when
           you call $root->parse(...).  So if you call $root->parse(...), then
           you must call $root->eof() once you've finished feeding all the
           chunks to parse(...), and before you actually start doing anything
           else with the tree in $root.

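           A minimal sketch of feeding a document in chunks; the chunk
           boundaries here are arbitrary and chosen only for illustration:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical chunks, as they might arrive from a socket or a pipe.
my @chunks = ('<ul><li>one', '<li>two</ul>');

my $root = HTML::TreeBuilder->new;
$root->parse($_) for @chunks;

# Nothing should be done with the tree until eof() has run.
$root->eof;

my @items = map { $_->as_text } $root->look_down(_tag => 'li');
print "Items: @items\n";

$root->delete;
```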
       $root->parse_content(...)
           Basically a handy alias for "$root->parse(...); $root->eof".
           Takes the exact same arguments as "$root->parse()".

       $root->delete()
           [An important method inherited from HTML::Element, which see.]

       $root->elementify()
           This changes the class of the object in $root from
           HTML::TreeBuilder to the class used for all the rest of the
           elements in that tree (generally HTML::Element).  Returns $root.

           For most purposes, this is unnecessary, but if you call this after
           (after!!)  you've finished building a tree, then it keeps you from
           accidentally trying to call anything but HTML::Element methods on
           it.  (I.e., if you accidentally call "$root->parse_file(...)" on
           the already-complete and elementified tree, then instead of
           charging ahead and wreaking havoc, it'll throw a fatal error --
           since $root is now an object just of class HTML::Element, which has
           no "parse_file" method.)

           Note that elementify currently deletes all the private attributes
           of $root except for "_tag", "_parent", "_content", "_pos", and
           "_implicit".  If anyone requests that I change this to leave in yet
           more private attributes, I might do so, in future versions.

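           The effect of elementify can be seen in a short sketch like this
           one.  It assumes the default element class (HTML::Element), and
           the file name passed to parse_file is a placeholder that is never
           opened, because the call fails first:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_content('<p>All done</p>');

my $class_before = ref $root;
$root->elementify;
my $class_after = ref $root;

# A parser method is no longer available on the elementified root.
my $ok = eval { $root->parse_file('placeholder.html'); 1 };

print "before: $class_before, after: $class_after\n";
print "parse_file after elementify failed, as intended\n" unless $ok;

$root->delete;
```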
       @nodes = $root->guts()
       $parent_for_nodes = $root->guts()
           In list context (as in the first case), this method returns the
           topmost non-implicit nodes in a tree.  This is useful when you're
           parsing HTML code that you know isn't a complete HTML document,
           but instead just a fragment of an HTML document.  For example, if
           you wanted the parse tree for a file consisting of just this:

             <li>I like pie!

           Then you would get that with "@nodes = $root->guts();".  It so
           happens that in this case, @nodes will contain just one element
           object, representing the "li" node (with "I like pie!" being its
           text child node).  However, consider if you were parsing this:

             <hr>Hooboy!<hr>

           In that case, "$root->guts()" would return three items: an element
           object for the first "hr", a text string "Hooboy!", and another
           "hr" element object.

           For cases where you want definitely one element (so you can treat
           it as a "document fragment", roughly speaking), call "guts()" in
           scalar context, as in "$parent_for_nodes = $root->guts()".  That
           works like "guts()" in list context, with these differences: if
           "guts()" in list context would have returned exactly one value,
           and that value is an object (as opposed to a text string), then
           that's what "guts()" in scalar context will return.  Otherwise, if
           "guts()" in list context would have returned no values at all,
           then "guts()" in scalar context returns undef.  In all other
           cases, "guts()" in scalar context returns an implicit 'div'
           element node, with children consisting of whatever nodes "guts()"
           in list context would have returned.  Note that that may detach
           those nodes from $root's tree.

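           Both calling contexts can be compared with a sketch like the
           following, which reuses the "<hr>Hooboy!<hr>" fragment from
           above.  The variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_content('<hr>Hooboy!<hr>');

# List context: the topmost non-implicit nodes, text strings included.
my @nodes = $root->guts;
my @kinds = map { ref $_ ? '<' . $_->tag . '>' : qq{"$_"} } @nodes;
print "list context: @kinds\n";

# Scalar context: several nodes are wrapped in one implicit div.
my $fragment    = $root->guts;
my $wrapper_tag = $fragment->tag;
my $is_implicit = $fragment->implicit ? 1 : 0;
print "scalar context: <$wrapper_tag>, implicit=$is_implicit\n";

# The nodes now live under $fragment, so both trees are freed.
$fragment->delete;
$root->delete;
```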
       @nodes = $root->disembowel()
       $parent_for_nodes = $root->disembowel()
           The "disembowel()" method works just like the "guts()" method,
           except that disembowel definitively destroys the tree above the
           nodes that are returned.  Usually when you want the guts from a
           tree, you're just going to toss out the rest of the tree anyway, so
           this saves you the bother.  (Remember, "disembowel" means "remove
           the guts from".)

       $root->implicit_tags(value)
           Setting this attribute to true will instruct the parser to try to
           deduce implicit elements and implicit end tags.  If it is false you
           get a parse tree that just reflects the text as it stands, which is
           unlikely to be useful for anything but quick and dirty parsing.
           (In fact, I'd be curious to hear from anyone who finds it useful to
           have implicit_tags set to false.)  Default is true.

           Implicit elements have the implicit() attribute set.

       $root->implicit_body_p_tag(value)
           This controls an aspect of implicit element behavior, if
           implicit_tags is on:  If a text element (PCDATA) or a phrasal
           element (such as "<em>") is to be inserted under "<body>", two
           things can happen: if implicit_body_p_tag is true, it's placed
           under a new, implicit "<p>" tag.  (Past DTDs suggested this was the
           only correct behavior, and this is how past versions of this module
           behaved.)  But if implicit_body_p_tag is false, nothing is
           implied -- the PCDATA or phrasal element is simply placed under
           "<body>".  Default is false.

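           The difference can be observed with a sketch along these lines.
           The helper has_implicit_p is hypothetical, written only for this
           example; it parses bare text and reports whether an implicit "p"
           element ended up in the tree.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse bare text with the option set to $flag.
sub has_implicit_p {
    my ($flag) = @_;
    my $root = HTML::TreeBuilder->new;
    $root->implicit_body_p_tag($flag);    # set before parsing
    $root->parse('Just some text');
    $root->eof;
    my $p     = $root->look_down(_tag => 'p');
    my $found = ($p && $p->implicit) ? 1 : 0;
    $root->delete;
    return $found;
}

my $with_option    = has_implicit_p(1);
my $without_option = has_implicit_p(0);
print "implicit p with option on:  $with_option\n";
print "implicit p with option off: $without_option\n";
```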
       $root->ignore_unknown(value)
           This attribute controls whether unknown tags should be represented
           as elements in the parse tree, or whether they should be ignored.
           Default is true (to ignore unknown tags).

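           A small sketch of the effect, using a made-up "blorp" tag as the
           unknown element.  The helper count_blorp is hypothetical and
           exists only for this illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: count "blorp" elements after parsing.
sub count_blorp {
    my ($ignore) = @_;
    my $root = HTML::TreeBuilder->new;
    $root->ignore_unknown($ignore);    # set before parsing
    $root->parse('<p>before <blorp>inside</blorp> after</p>');
    $root->eof;
    my @found = $root->look_down(_tag => 'blorp');
    $root->delete;
    return scalar @found;
}

my $when_ignored = count_blorp(1);
my $when_kept    = count_blorp(0);
print "ignore_unknown on:  $when_ignored blorp element(s)\n";
print "ignore_unknown off: $when_kept blorp element(s)\n";
```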
       $root->ignore_text(value)
           Do not represent the text content of elements.  This saves space if
           all you want is to examine the structure of the document.  Default
           is false.

       $root->ignore_ignorable_whitespace(value)
           If set to true, TreeBuilder will try to avoid creating ignorable
           whitespace text nodes in the tree.  Default is true.  (In fact, I'd
           be interested in hearing if there's ever a case where you need this
           off, or where leaving it on leads to incorrect behavior.)

       $root->no_space_compacting(value)
           This determines whether TreeBuilder compacts all whitespace strings
           in the document (well, outside of PRE or TEXTAREA elements), or
           leaves them alone.  Normally (default, value of 0), each string of
           contiguous whitespace in the document is turned into a single
           space.  But that's not done if no_space_compacting is set to 1.

           Setting no_space_compacting to 1 might be useful if you want to
           read in a tree just to make some minor changes to it before writing
           it back out.

           This method is experimental.  If you use it, be sure to report any
           problems you might have with it.

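           As a rough sketch of the difference (paragraph_text is a
           hypothetical helper written for this example), the same
           paragraph can be parsed with and without space compacting:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the text of the first paragraph.
sub paragraph_text {
    my ($no_compacting) = @_;
    my $root = HTML::TreeBuilder->new;
    $root->no_space_compacting($no_compacting);    # set before parsing
    $root->parse('<p>a   b</p>');
    $root->eof;
    my $text = $root->look_down(_tag => 'p')->as_text;
    $root->delete;
    return $text;
}

my $compacted = paragraph_text(0);
my $verbatim  = paragraph_text(1);
print "compacted: [$compacted]\n";
print "verbatim:  [$verbatim]\n";
```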
       $root->p_strict(value)
           If set to true (and it defaults to false), TreeBuilder will take a
           narrower than normal view of what can be under a "p" element; if it
           sees a non-phrasal element about to be inserted under a "p", it
           will close that "p".  Otherwise it will close p elements only for
           other "p"'s, headings, and "form" (although the latter may be
           removed in future versions).

           For example, when going through this snippet of code,

             <p>stuff
             <ul>

           TreeBuilder will normally (with "p_strict" false) put the "ul"
           element under the "p" element.  However, with "p_strict" set to
           true, it will close the "p" first.

           In theory, there should be strictness options like this for
           other/all elements besides just "p"; but I treat this as a special
           case simply because of the fact that "p" occurs so frequently and
           its end-tag is omitted so often; and also because application of
           strictness rules at parse-time across all elements often makes tiny
           errors in HTML coding produce drastically bad parse-trees, in my
           experience.

           If you find that you wish you had an option like this to enforce
           content-models on all elements, then I suggest that what you want
           is content-model checking as a stage after TreeBuilder has finished
           parsing.

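           The snippet above can be checked with a sketch like this one.
           The helper ul_parent_tag is hypothetical; it reports which
           element the "ul" ended up under.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the tag name of the ul element's parent.
sub ul_parent_tag {
    my ($strict) = @_;
    my $root = HTML::TreeBuilder->new;
    $root->p_strict($strict);    # set before parsing
    $root->parse("<p>stuff\n<ul><li>item</ul>");
    $root->eof;
    my $tag = $root->look_down(_tag => 'ul')->parent->tag;
    $root->delete;
    return $tag;
}

my $lenient_parent = ul_parent_tag(0);
my $strict_parent  = ul_parent_tag(1);
print "p_strict off: ul is under <$lenient_parent>\n";
print "p_strict on:  ul is under <$strict_parent>\n";
```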
       $root->store_comments(value)
           This determines whether TreeBuilder will normally store comments
           found while parsing content into $root.  Currently, this is off by
           default.

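           A minimal sketch of turning comment storage on.  It relies on
           the HTML::Element convention that comments are stored as
           pseudo-elements with the tag "~comment" and a "text" attribute;
           comment_texts is a hypothetical helper for this example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the text of every stored comment.
sub comment_texts {
    my ($store) = @_;
    my $root = HTML::TreeBuilder->new;
    $root->store_comments($store);    # set before parsing
    $root->parse('<p>Visible<!-- hidden note --></p>');
    $root->eof;
    my @texts = map { $_->attr('text') }
        $root->look_down(_tag => '~comment');
    $root->delete;
    return @texts;
}

my @not_stored = comment_texts(0);
my @stored     = comment_texts(1);
print "stored with option off: ", scalar(@not_stored), "\n";
print "stored with option on:  ", scalar(@stored), "\n";
```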
       $root->store_declarations(value)
           This determines whether TreeBuilder will normally store markup
           declarations found while parsing content into $root.  This is on by
           default.

       $root->store_pis(value)
           This determines whether TreeBuilder will normally store processing
           instructions found while parsing content into $root -- assuming a
           recent version of HTML::Parser (old versions won't parse PIs
           correctly).  Currently, this is off (false) by default.

           It is somewhat of a known bug (to be fixed one of these days, if
           anyone needs it?) that PIs in the preamble (before the "html"
           start-tag) end up actually under the "html" element.

       $root->warn(value)
           This determines whether syntax errors during parsing should
           generate warnings, emitted via Perl's "warn" function.

           This is off (false) by default.

HTML AND ITS DISCONTENTS

       HTML is rather harder to parse than people who write it generally
       suspect.

       Here's the problem: HTML is a kind of SGML that permits "minimization"
       and "implication".  In short, this means that you don't have to close
       every tag you open (because the opening of a subsequent tag may
       implicitly close it), and if you use a tag that can't occur in the
       context you seem to be using it in, under certain conditions the parser
       will be able to realize you mean to leave the current context and enter
       the new one, that being the only one that your code could correctly be
       interpreted in.

       Now, this would all work flawlessly and unproblematically if: 1) all
       the rules that both prescribe and describe HTML were (and had been)
       clearly set out, and 2) everyone was aware of these rules and wrote
       their code in compliance with them.

       However, it didn't happen that way, and so most HTML pages are
       difficult if not impossible to correctly parse with nearly any set of
       straightforward SGML rules.  That's why the internals of
       HTML::TreeBuilder consist of lots and lots of special cases -- instead
       of being just a generic SGML parser with HTML DTD rules plugged in.

TRANSLATIONS?

       The techniques that HTML::TreeBuilder uses to perform what I consider
       very robust parses on everyday code are not things that can work only
       in Perl.  To date, the algorithms at the center of HTML::TreeBuilder
       have been implemented only in Perl, as far as I know; and I don't
       foresee getting around to implementing them in any other language any
       time soon.

       If, however, anyone is looking for a semester project for an applied
       programming class (or if they merely enjoy extra-curricular masochism),
       they might do well to see about choosing as a topic the
       implementation/adaptation of these routines to any other interesting
       programming language that they feel currently suffers from a lack of
       robust HTML-parsing.  I welcome correspondence on this subject, and
       point out that one can learn a great deal about languages by trying to
       translate between them, and then comparing the result.

       The HTML::TreeBuilder source may seem long and complex, but it is
       rather well commented, and symbol names are generally self-explanatory.
       (You are encouraged to read the Mozilla HTML parser source for
       comparison.)  Some of the complexity comes from little-used features,
       and some of it comes from having the HTML tokenizer (HTML::Parser)
       being a separate module, requiring somewhat of a different interface
       than you'd find in a combined tokenizer and tree-builder.  But most of
       the length of the source comes from the fact that it's essentially a
       long list of special cases, with lots and lots of sanity-checking, and
       sanity-recovery -- because, as Roseanne Roseannadanna once said, "it's
       always something".

       Users looking to compare several HTML parsers should look at the source
       for Raggett's Tidy ("<http://www.w3.org/People/Raggett/tidy/>"),
       Mozilla ("<http://www.mozilla.org/>"), and possibly root around the
       browsers section of Yahoo to find the various open-source ones
       ("<http://dir.yahoo.com/Computers_and_Internet/Software/Internet/World_Wide_Web/Browsers/>").

BUGS

       * Framesets seem to work correctly now.  Email me if you get a strange
       parse from a document with framesets.

       * Really bad HTML code will, often as not, make for a somewhat
       objectionable parse tree.  Regrettable, but unavoidably true.

       * If you're running with implicit_tags off (God help you!), consider
       that $tree->content_list probably contains the tree or grove from the
       parse, and not $tree itself (which will, oddly enough, be an implicit
       'html' element).  This seems counter-intuitive and problematic; but
       seeing as how almost no HTML ever parses correctly with implicit_tags
       off, this interface oddity seems the least of your problems.

BUG REPORTS

       When a document parses in a way different from how you think it should,
       I ask that you report this to me as a bug.  The first thing you should
       do is copy the document, trim out as much of it as you can while still
       producing the bug in question, and then email me that mini-document and
       the code you're using to parse it, to the HTML::Tree bug queue at
       "bug-html-tree at rt.cpan.org".

       Include a note as to how it parses (presumably including its
       $tree->dump output), and then a careful and clear explanation of where
       you think the parser is going astray, and how you would prefer that it
       work instead.

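       A report script might look something like the sketch below.  The
       contents of $mini_document are a placeholder; substitute the
       trimmed-down document that shows the parse you find surprising.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Placeholder: the smallest document that still shows the problem.
my $mini_document = "<p>stuff\n<ul>\n<li>item\n</ul>\n";

my $tree = HTML::TreeBuilder->new;
$tree->parse($mini_document);
$tree->eof;

# Include the module version, the input, and the resulting parse tree.
print "HTML::TreeBuilder version: $HTML::TreeBuilder::VERSION\n";
print "Input:\n$mini_document\nParse tree:\n";
$tree->dump;

my $saw_ul = defined $tree->look_down(_tag => 'ul') ? 1 : 0;
$tree->delete;
```

       The output of such a script, together with a description of the parse
       you expected, gives the maintainer something concrete to reproduce.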

COPYRIGHT

       Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke, 2005 Andy
       Lester, 2006 Pete Krawczyk.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

       This program is distributed in the hope that it will be useful, but
       without any warranty; without even the implied warranty of
       merchantability or fitness for a particular purpose.

AUTHOR

       Currently maintained by Pete Krawczyk "<petek@cpan.org>"

       Original authors: Gisle Aas, Sean Burke and Andy Lester.

perl v5.10.1                      2006-11-12              HTML::TreeBuilder(3)