HTML::TreeBuilder(3pm)

HTML::TreeBuilder(3)  User Contributed Perl Documentation  HTML::TreeBuilder(3)

NAME

       HTML::TreeBuilder - Parser that builds an HTML syntax tree

SYNOPSIS

         use HTML::TreeBuilder;

         foreach my $file_name (@ARGV) {
           my $tree = HTML::TreeBuilder->new; # empty tree
           $tree->parse_file($file_name);
           print "Hey, here's a dump of the parse tree of $file_name:\n";
           $tree->dump; # a method we inherit from HTML::Element
           print "And here it is, bizarrely rerendered as HTML:\n",
             $tree->as_HTML, "\n";

           # Now that we're done with it, we must destroy it.
           $tree = $tree->delete;
         }

DESCRIPTION

       (This class is part of the HTML::Tree dist.)

       This class is for HTML syntax trees that get built out of HTML source.
       The way to use it is to:

       1. start a new (empty) HTML::TreeBuilder object,

       2. then use one of the methods from HTML::Parser (presumably
       $tree->parse_file($filename) for files, or
       $tree->parse($document_content) followed by $tree->eof if you've got
       the content in a string) to parse the HTML document into the tree
       $tree.

       (You can combine steps 1 and 2 with the "new_from_file" or
       "new_from_content" methods.)

       2b. call $tree->elementify() if you want,

       3. do whatever you need to do with the syntax tree, presumably
       involving traversing it looking for some bit of information in it,

       4. and finally, when you're done with the tree, call $tree->delete() to
       erase the contents of the tree from memory.  This kind of thing usually
       isn't necessary with most Perl objects, but it's necessary for
       TreeBuilder objects.  See HTML::Element for a more verbose explanation
       of why this is the case.
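
       The four steps above can be sketched for content held in a string.
       This is only a minimal sketch: the sample markup and the variable
       names ($html, $title_text) are made up for illustration, and
       find_by_tag_name and as_text are methods inherited from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative input, not taken from any real page.
my $html = '<html><head><title>Pie</title></head>'
         . '<body><p>I like pie!</p></body></html>';

my $tree = HTML::TreeBuilder->new;    # step 1: empty tree
$tree->parse($html);                  # step 2: feed the content...
$tree->eof;                           #         ...and say we're finished

# Step 3: look for some bit of information in the tree.
my $title      = $tree->find_by_tag_name('title');
my $title_text = $title ? $title->as_text : '';
print "Title: $title_text\n";

$tree = $tree->delete;                # step 4: erase the tree from memory
```

       Copying the text out of the tree before calling delete matters here,
       since the element objects are emptied once the tree is destroyed.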

METHODS AND ATTRIBUTES

       Objects of this class inherit the methods of both HTML::Parser and
       HTML::Element.  The methods inherited from HTML::Parser are used for
       building the HTML tree, and the methods inherited from HTML::Element
       are what you use to scrutinize the tree.  Besides this
       (HTML::TreeBuilder) documentation, you must also carefully read the
       HTML::Element documentation, and also skim the HTML::Parser
       documentation -- probably only its parse and parse_file methods are of
       interest.

       Most of the following methods native to HTML::TreeBuilder control how
       parsing takes place; they should be set before you try parsing into the
       given object.  You can set the attributes by passing a true or false
       value as argument.  E.g., $root->implicit_tags returns the current
       setting for the implicit_tags option, $root->implicit_tags(1) turns
       that option on, and $root->implicit_tags(0) turns it off.

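       As a rough illustration of setting an option before parsing, the
       sketch below reads store_comments, turns it on, and then parses a
       made-up snippet.  It assumes that stored comments appear as
       "~comment" pseudo-elements, as described in the HTML::Element
       documentation; look_down is inherited from HTML::Element, and the
       variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new;

# Called with no argument, the accessor reports the current setting.
my $before = $root->store_comments ? 'on' : 'off';

# Set the option before parsing anything into $root.
$root->store_comments(1);
my $after = $root->store_comments ? 'on' : 'off';

# Illustrative input containing one comment.
$root->parse_content('<p>Hello<!-- a note --></p>');

# Assumption: stored comments are "~comment" pseudo-elements.
my @comments = $root->look_down( _tag => '~comment' );

print "store_comments was $before, is now $after; ",
    scalar(@comments), " comment(s) stored\n";

$root = $root->delete;
```
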
       $root = HTML::TreeBuilder->new_from_file(...)
           This "shortcut" constructor merely combines constructing a new
           object (with the "new" method, below), and calling
           $new->parse_file(...) on it.  Returns the new object.  Note that
           this provides no way of setting any parse options like
           store_comments (for that, call new, and then set options, before
           calling parse_file).  See the notes (below) on parameters to
           parse_file.

       $root = HTML::TreeBuilder->new_from_content(...)
           This "shortcut" constructor merely combines constructing a new
           object (with the "new" method, below), and calling
           for(...){$new->parse($_)} and $new->eof on it.  Returns the new
           object.  Note that this provides no way of setting any parse
           options like store_comments (for that, call new, and then set
           options, before calling parse and eof).  Example usages:
           HTML::TreeBuilder->new_from_content(@lines), or
           HTML::TreeBuilder->new_from_content($content).

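           A small sketch of the list form of new_from_content follows.
           The @lines data is invented for illustration, and
           find_by_tag_name and as_text come from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative input: a document held as a list of lines.
my @lines = (
    "<html><head><title>Menu</title></head>\n",
    "<body><h1>Desserts</h1>\n",
    "<p>Pie</p></body></html>\n",
);

# Each list item is passed to parse() in turn, and then eof() is called.
my $root = HTML::TreeBuilder->new_from_content(@lines);

my $h1      = $root->find_by_tag_name('h1');
my $heading = $h1 ? $h1->as_text : '';
print "Heading: $heading\n";

$root = $root->delete;
```
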
       $root = HTML::TreeBuilder->new()
           This creates a new HTML::TreeBuilder object.  This method takes no
           arguments.

       $root->parse_file(...)
           [An important method inherited from HTML::Parser, which see.
           Current versions of HTML::Parser can take a filespec, or a
           filehandle object, like *FOO, or some object from class IO::Handle,
           IO::File, IO::Socket, or the like.  I think you should check that a
           given file exists before calling $root->parse_file($filespec).]

       $root->parse(...)
           [An important method inherited from HTML::Parser, which see.  See
           the note below for $root->eof().]

       $root->eof()
           This signals that you're finished parsing content into this tree;
           this runs various kinds of crucial cleanup on the tree.  This is
           called for you when you call $root->parse_file(...), but not when
           you call $root->parse(...).  So if you call $root->parse(...), then
           you must call $root->eof() once you've finished feeding all the
           chunks to parse(...), and before you actually start doing anything
           else with the tree in $root.

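           The parse/eof pairing can be sketched as below.  The chunk
           boundaries are invented for illustration (the second "li"
           start-tag is deliberately split across two chunks), and the
           variable names are not part of the module.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative chunks, as they might arrive from a network read.
my @chunks = ( '<ul><li>one<l', 'i>two</ul>' );

my $root = HTML::TreeBuilder->new;
$root->parse($_) for @chunks;    # feed the content piece by piece
$root->eof;                      # required after parse(); parse_file()
                                 # calls it for you

# Only now is it safe to examine the tree.
my @items = map { $_->as_text } $root->find_by_tag_name('li');
print join( ', ', @items ), "\n";

$root = $root->delete;
```
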
       $root->parse_content(...)
           Basically a handy alias for "$root->parse(...); $root->eof".
           Takes the exact same arguments as "$root->parse()".

       $root->delete()
           [An important method inherited from HTML::Element, which see.]

       $root->elementify()
           This changes the class of the object in $root from
           HTML::TreeBuilder to the class used for all the rest of the
           elements in that tree (generally HTML::Element).  Returns $root.

           For most purposes, this is unnecessary, but if you call this after
           (after!!) you've finished building a tree, then it keeps you from
           accidentally trying to call anything but HTML::Element methods on
           it.  (I.e., if you accidentally call "$root->parse_file(...)" on
           the already-complete and elementified tree, then instead of
           charging ahead and wreaking havoc, it'll throw a fatal error --
           since $root is now an object just of class HTML::Element, which has
           no "parse_file" method.)

           Note that elementify currently deletes all the private attributes
           of $root except for "_tag", "_parent", "_content", "_pos", and
           "_implicit".  If anyone requests that I change this to leave in yet
           more private attributes, I might do so, in future versions.

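           The effect of elementify can be observed with a sketch like the
           following.  It assumes the default element class
           (HTML::Element); the input string and variable names are
           illustrative only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_content('<p>Done parsing</p>');

my $class_before = ref $root;
$root->elementify;               # call only after the tree is complete
my $class_after = ref $root;

# The parsing methods are no longer reachable through $root.
my $can_parse = $root->can('parse_file') ? 'yes' : 'no';

print "$class_before -> $class_after; can parse_file: $can_parse\n";

$root = $root->delete;
```
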
       @nodes = $root->guts()
       $parent_for_nodes = $root->guts()
           In list context (as in the first case), this method returns the
           topmost non-implicit nodes in a tree.  This is useful when you're
           parsing HTML code that you know isn't a complete HTML document,
           but instead just a fragment of one.  For example, if you wanted
           the parse tree for a file consisting of just this:

             <li>I like pie!

           Then you would get that with "@nodes = $root->guts();".  It so
           happens that in this case, @nodes will contain just one element
           object, representing the "li" node (with "I like pie!" being its
           text child node).  However, consider if you were parsing this:

             <hr>Hooboy!<hr>

           In that case, "$root->guts()" would return three items: an element
           object for the first "hr", a text string "Hooboy!", and another
           "hr" element object.

           For cases where you definitely want one element (so you can treat
           it as a "document fragment", roughly speaking), call "guts()" in
           scalar context, as in "$parent_for_nodes = $root->guts()".  That
           works like "guts()" in list context, except for what it returns:
           if "guts()" in list context would have returned exactly one value,
           and that value is an object (as opposed to a text string), then
           that's what "guts()" in scalar context will return.  Otherwise, if
           "guts()" in list context would have returned no values at all,
           then "guts()" in scalar context returns undef.  In all other
           cases, "guts()" in scalar context returns an implicit "div"
           element node, with children consisting of whatever nodes "guts()"
           in list context would have returned.  Note that that may detach
           those nodes from $root's tree.

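           The two calling contexts can be compared with a sketch such as
           this one, which reuses the "hr" fragment from the text above.
           The variable names are illustrative; tag and content_list are
           HTML::Element methods.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_content('<hr>Hooboy!<hr>');

# List context: the topmost non-implicit nodes, text strings included.
my @nodes = $root->guts;
my @kinds = map { ref $_ ? $_->tag : 'text' } @nodes;
print "list context: @kinds\n";

# Scalar context: several nodes come back wrapped in one implicit element.
my $fragment    = $root->guts;
my $wrapper_tag = $fragment->tag;
my @children    = $fragment->content_list;
print "scalar context: <$wrapper_tag> with ",
    scalar(@children), " children\n";

# The nodes now live under $fragment, so both trees need deleting.
$fragment->delete;
$root->delete;
```
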
       @nodes = $root->disembowel()
       $parent_for_nodes = $root->disembowel()
           The "disembowel()" method works just like the "guts()" method,
           except that disembowel definitively destroys the tree above the
           nodes that are returned.  Usually when you want the guts from a
           tree, you're just going to toss out the rest of the tree anyway, so
           this saves you the bother.  (Remember, "disembowel" means "remove
           the guts from".)

       $root->implicit_tags(value)
           Setting this attribute to true will instruct the parser to try to
           deduce implicit elements and implicit end tags.  If it is false you
           get a parse tree that just reflects the text as it stands, which is
           unlikely to be useful for anything but quick and dirty parsing.
           (In fact, I'd be curious to hear from anyone who finds it useful to
           have implicit_tags set to false.)  Default is true.

           Implicit elements have the implicit() attribute set.

       $root->implicit_body_p_tag(value)
           This controls an aspect of implicit element behavior, if
           implicit_tags is on:  If a text element (PCDATA) or a phrasal
           element (such as "<em>") is to be inserted under "<body>", two
           things can happen: if implicit_body_p_tag is true, it's placed
           under a new, implicit "<p>" tag.  (Past DTDs suggested this was the
           only correct behavior, and this is how past versions of this module
           behaved.)  But if implicit_body_p_tag is false, nothing is
           implied -- the PCDATA or phrasal element is simply placed under
           "<body>".  Default is false.

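           To see the difference between the two settings, one might
           compare the first child of "body" under each.  This is a sketch
           only: first_body_child is a hypothetical helper written for this
           example, and the input markup is made up.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse loose text under <body> with the option set
# as given, and report what the first child of <body> turns out to be.
sub first_body_child {
    my ($use_p) = @_;
    my $root = HTML::TreeBuilder->new;
    $root->implicit_body_p_tag($use_p);    # set before parsing
    $root->parse_content('<html><body>loose text</body></html>');

    my $body    = $root->find_by_tag_name('body');
    my ($first) = $body->content_list;
    my $kind    = ref $first ? $first->tag : 'text';

    $root->delete;
    return $kind;
}

my $with_p    = first_body_child(1);
my $without_p = first_body_child(0);
print "option on: $with_p; option off: $without_p\n";
```
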
       $root->no_expand_entities(value)
           This attribute controls whether entities are decoded during the
           initial parse of the source.  Enable this if you don't want
           entities decoded to their character value.  E.g., '&amp;' is
           decoded to '&' by default, but will be unchanged if this is
           enabled.  Default is false (entities will be decoded).

       $root->ignore_unknown(value)
           This attribute controls whether unknown tags should be represented
           as elements in the parse tree, or whether they should be ignored.
           Default is true (to ignore unknown tags).

       $root->ignore_text(value)
           Do not represent the text content of elements.  This saves space if
           all you want is to examine the structure of the document.  Default
           is false.

       $root->ignore_ignorable_whitespace(value)
           If set to true, TreeBuilder will try to avoid creating ignorable
           whitespace text nodes in the tree.  Default is true.  (In fact, I'd
           be interested in hearing if there's ever a case where you need this
           off, or where leaving it on leads to incorrect behavior.)

       $root->no_space_compacting(value)
           This determines whether TreeBuilder compacts all whitespace strings
           in the document (well, outside of PRE or TEXTAREA elements), or
           leaves them alone.  Normally (default, value of 0), each string of
           contiguous whitespace in the document is turned into a single
           space.  But that's not done if no_space_compacting is set to 1.

           Setting no_space_compacting to 1 might be useful if you want to
           read in a tree just to make some minor changes to it before writing
           it back out.

           This method is experimental.  If you use it, be sure to report any
           problems you might have with it.

       $root->p_strict(value)
           If set to true (and it defaults to false), TreeBuilder will take a
           narrower than normal view of what can be under a "p" element; if it
           sees a non-phrasal element about to be inserted under a "p", it
           will close that "p".  Otherwise it will close "p" elements only for
           other "p"'s, headings, and "form" (although the latter may be
           removed in future versions).

           For example, when going through this snippet of code,

             <p>stuff
             <ul>

           TreeBuilder will normally (with "p_strict" false) put the "ul"
           element under the "p" element.  However, with "p_strict" set to
           true, it will close the "p" first.

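           The snippet above can be checked with a sketch along these
           lines.  ul_parent_tag is a hypothetical helper written for this
           example, and the "li" content is added only so that the list is
           not empty; parent and tag are HTML::Element methods.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse the snippet with p_strict set as given and
# report which element the "ul" ended up under.
sub ul_parent_tag {
    my ($strict) = @_;
    my $root = HTML::TreeBuilder->new;
    $root->p_strict($strict);              # set before parsing
    $root->parse_content("<p>stuff\n<ul><li>item</ul>");

    my $ul         = $root->find_by_tag_name('ul');
    my $parent_tag = $ul->parent->tag;

    $root->delete;
    return $parent_tag;
}

my $lax    = ul_parent_tag(0);
my $strict = ul_parent_tag(1);
print "p_strict off: ul under <$lax>; p_strict on: ul under <$strict>\n";
```
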
           In theory, there should be strictness options like this for
           other/all elements besides just "p"; but I treat this as a special
           case simply because "p" occurs so frequently and its end-tag is
           omitted so often; and also because application of strictness rules
           at parse-time across all elements often makes tiny errors in HTML
           coding produce drastically bad parse-trees, in my experience.

           If you find that you wish you had an option like this to enforce
           content-models on all elements, then I suggest that what you want
           is content-model checking as a stage after TreeBuilder has finished
           parsing.

       $root->store_comments(value)
           This determines whether TreeBuilder will normally store comments
           found while parsing content into $root.  Currently, this is off by
           default.

       $root->store_declarations(value)
           This determines whether TreeBuilder will normally store markup
           declarations found while parsing content into $root.  This is on by
           default.

       $root->store_pis(value)
           This determines whether TreeBuilder will normally store processing
           instructions found while parsing content into $root -- assuming a
           recent version of HTML::Parser (old versions won't parse PIs
           correctly).  Currently, this is off (false) by default.

           It is somewhat of a known bug (to be fixed one of these days, if
           anyone needs it?) that PIs in the preamble (before the "html"
           start-tag) end up actually under the "html" element.

       $root->warn(value)
           This determines whether syntax errors during parsing should
           generate warnings, emitted via Perl's "warn" function.

           This is off (false) by default.

       $root->element_class
           This method returns the class which will be used for new elements.
           It defaults to HTML::Element, but can be overridden by subclassing
           or esoteric means best left to those who will read the source and
           then not complain when those esoteric means change.  (Just
           subclass.)

       DEBUG
           Are we in Debug mode?

       comment
           Accept a "here's a comment" signal from HTML::Parser.

       declaration
           Accept a "here's a markup declaration" signal from HTML::Parser.

       done
           TODO: document

       end
           Either: Accept an end-tag signal from HTML::Parser.  Or: Method
           for closing currently open elements in some fairly complex way, as
           used by other methods in this class.

           TODO: Why is this hidden?

       process
           Accept a "here's a PI" signal from HTML::Parser.

       start
           Accept a signal from HTML::Parser for start-tags.

           TODO: Why is this hidden?

       stunt
           TODO: document

       stunted
           TODO: document

       text
           Accept a "here's a text token" signal from HTML::Parser.

           TODO: Why is this hidden?

       tighten_up
           Legacy

           Redirects to HTML::Element::delete_ignorable_whitespace

       warning
           Wrapper for CORE::warn

           TODO: why not just use carp?

HTML AND ITS DISCONTENTS

       HTML is rather harder to parse than people who write it generally
       suspect.

       Here's the problem: HTML is a kind of SGML that permits "minimization"
       and "implication".  In short, this means that you don't have to close
       every tag you open (because the opening of a subsequent tag may
       implicitly close it), and if you use a tag that can't occur in the
       context you seem to be using it in, under certain conditions the parser
       will be able to realize you mean to leave the current context and enter
       the new one, that being the only one that your code could correctly be
       interpreted in.

       Now, this would all work flawlessly and unproblematically if: 1) all
       the rules that both prescribe and describe HTML were (and had been)
       clearly set out, and 2) everyone was aware of these rules and wrote
       their code in compliance with them.

       However, it didn't happen that way, and so most HTML pages are
       difficult if not impossible to correctly parse with nearly any set of
       straightforward SGML rules.  That's why the internals of
       HTML::TreeBuilder consist of lots and lots of special cases -- instead
       of being just a generic SGML parser with HTML DTD rules plugged in.

TRANSLATIONS?

       The techniques that HTML::TreeBuilder uses to perform what I consider
       very robust parses on everyday code are not things that can work only
       in Perl.  To date, the algorithms at the center of HTML::TreeBuilder
       have been implemented only in Perl, as far as I know; and I don't
       foresee getting around to implementing them in any other language any
       time soon.

       If, however, anyone is looking for a semester project for an applied
       programming class (or merely enjoys extra-curricular masochism), they
       might do well to choose as a topic the implementation/adaptation of
       these routines in any other interesting programming language that
       they feel currently suffers from a lack of robust HTML-parsing.  I
       welcome correspondence on this subject, and point out that one can
       learn a great deal about languages by trying to translate between
       them, and then comparing the result.

       The HTML::TreeBuilder source may seem long and complex, but it is
       rather well commented, and symbol names are generally self-explanatory.
       (You are encouraged to read the Mozilla HTML parser source for
       comparison.)  Some of the complexity comes from little-used features,
       and some of it comes from having the HTML tokenizer (HTML::Parser) as
       a separate module, requiring somewhat of a different interface than
       you'd find in a combined tokenizer and tree-builder.  But most of the
       length of the source comes from the fact that it's essentially a long
       list of special cases, with lots and lots of sanity-checking, and
       sanity-recovery -- because, as Roseanne Roseannadanna once said, "it's
       always something".

       Users looking to compare several HTML parsers should look at the source
       for Raggett's Tidy ("<http://www.w3.org/People/Raggett/tidy/>"),
       Mozilla ("<http://www.mozilla.org/>"), and possibly root around the
       browsers section of Yahoo to find the various open-source ones
       ("<http://dir.yahoo.com/Computers_and_Internet/Software/Internet/World_Wide_Web/Browsers/>").

BUGS

       * Framesets seem to work correctly now.  Email me if you get a strange
       parse from a document with framesets.

       * Really bad HTML code will, often as not, make for a somewhat
       objectionable parse tree.  Regrettable, but unavoidably true.

       * If you're running with implicit_tags off (God help you!), consider
       that $tree->content_list probably contains the tree or grove from the
       parse, and not $tree itself (which will, oddly enough, be an implicit
       'html' element).  This seems counter-intuitive and problematic; but
       seeing as how almost no HTML ever parses correctly with implicit_tags
       off, this interface oddity seems the least of your problems.

BUG REPORTS

       When a document parses in a way different from how you think it should,
       I ask that you report this to me as a bug.  The first thing you should
       do is copy the document, trim out as much of it as you can while still
       producing the bug in question, and then email me that mini-document and
       the code you're using to parse it, to the HTML::Tree bug queue at
       "bug-html-tree at rt.cpan.org".

       Include a note as to how it parses (presumably including its
       $tree->dump output), and then a careful and clear explanation of where
       you think the parser is going astray, and how you would prefer that it
       work instead.

COPYRIGHT

       Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke, 2005 Andy
       Lester, 2006 Pete Krawczyk, 2010 Jeff Fearn.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

       This program is distributed in the hope that it will be useful, but
       without any warranty; without even the implied warranty of
       merchantability or fitness for a particular purpose.

AUTHOR

       Currently maintained by Pete Krawczyk "<petek@cpan.org>".

       Original authors: Gisle Aas, Sean Burke and Andy Lester.

perl v5.12.2                      2010-12-20              HTML::TreeBuilder(3)