HTML::Tree::Scanning(3)  User Contributed Perl Documentation  HTML::Tree::Scanning(3)

NAME

       HTML::Tree::Scanning -- article: "Scanning HTML"

SYNOPSIS

         # This is an article, not a module.

DESCRIPTION

       The following article by Sean M. Burke first appeared in The Perl
       Journal #19 and is copyright 2000 The Perl Journal. It appears courtesy
       of Jon Orwant and The Perl Journal.  This document may be distributed
       under the same terms as Perl itself.

Scanning HTML

       -- Sean M. Burke

       In The Perl Journal issue 17, Ken MacFarlane's article "Parsing HTML
       with HTML::Parser" describes how the HTML::Parser module scans HTML
       source as a stream of start-tags, end-tags, text, comments, etc.  In
       TPJ #18, my "Trees" article kicked around the idea of tree-shaped data
       structures.  Now I'll try to tie it together, in a discussion of HTML
       trees.

       The CPAN module HTML::TreeBuilder takes the tags that HTML::Parser
       picks out, and builds a parse tree -- a tree-shaped network of
       objects...

           Footnote: And if you need a quick explanation of objects, see my
           TPJ17 article "A User's View of Object-Oriented Modules"; or go
           whole hog and get Damian Conway's excellent book Object-Oriented
           Perl, from Manning Publications.

       ...representing the structured content of the HTML document.  And once
       the document is parsed as a tree, you'll find the common tasks of
       extracting data from that HTML document/tree to be quite
       straightforward.

       HTML::Parser, HTML::TreeBuilder, and HTML::Element

       You use HTML::TreeBuilder to make a parse tree out of an HTML source
       file, by simply saying:

         use HTML::TreeBuilder;
         my $tree = HTML::TreeBuilder->new();
         $tree->parse_file('foo.html');

       and then $tree contains a parse tree built from the HTML source from
       the file "foo.html".  The way this parse tree is represented is with a
       network of objects -- $tree is the root, an element with tag-name
       "html", and its children typically include a "head" and "body" element,
       and so on.  Elements in the tree are objects of the class
       HTML::Element.

       So, if you take this source:

         <html><head><title>Doc 1</title></head>
         <body>
         Stuff <hr> 2000-08-17
         </body></html>

       and feed it to HTML::TreeBuilder, it'll return a tree of objects that
       looks like this:

                      html
                    /      \
                head        body
               /          /   |  \
            title    "Stuff"  hr  "2000-08-17"
              |
           "Doc 1"

       This is a pretty simple document, but if it were any more complex, it'd
       be a bit hard to draw in that style, since it'd sprawl left and right.
       The same tree can be represented a bit more easily sideways, with
       indenting:

         . html
            . head
               . title
                  . "Doc 1"
            . body
               . "Stuff"
               . hr
               . "2000-08-17"

       Either way expresses the same structure.  In that structure, the root
       node is an object of the class HTML::Element, with the tag name "html",
       and with two children: HTML::Element objects whose tag names are "head"
       and "body".

           Footnote: Well actually, the root is of the class
           HTML::TreeBuilder, but that's just a subclass of HTML::Element,
           plus the few extra methods like "parse_file" that elaborate the
           tree.

       And each of those elements has children, and so on down.  Not all
       elements (as we'll call the objects of class HTML::Element) have
       children -- the "hr" element doesn't.  And not all nodes in the tree
       are elements -- the text nodes ("Doc 1", "Stuff", and "2000-08-17") are
       just strings.
102
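       The difference between element nodes and text nodes can be checked
       directly.  The following is a minimal sketch, not part of the original
       article: it parses the sample document from a string (using "parse"
       and "eof" instead of "parse_file") and labels each child of "body" as
       either an element or a plain string.  The variable names are only
       illustrative.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The sample document from the text, held in a string instead of a file.
my $source = <<'END_HTML';
<html><head><title>Doc 1</title></head>
<body>
Stuff <hr> 2000-08-17
</body></html>
END_HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($source);
$tree->eof;    # tell the parser that the document is complete

my $body = $tree->look_down('_tag', 'body');

# Elements are objects (references); text nodes are plain strings.
my @kinds = map { ref($_) ? 'element:' . $_->tag : 'text' }
            $body->content_list;

print join(', ', @kinds), "\n";
$tree->delete;
```

       Calling "$tree->dump" before the "delete" prints the whole tree in an
       indented form much like the sideways diagram above.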
       Objects of the class HTML::Element each have three noteworthy
       attributes:

       "_tag" -- (best accessed as "$e->tag") this element's tag-name,
       lowercased (e.g., "em" for an "em" element).
               Footnote: Yes, this is misnamed.  In proper SGML terminology,
               this is instead called a "GI", short for "generic identifier";
               and the term "tag" is used for a token of SGML source that
               represents either the start of an element (a start-tag like
               "<em lang='fr'>") or the end of an element (an end-tag like
               "</em>").  However, since more people claim to have been
               abducted by aliens than to have ever seen the SGML standard,
               and since both encounters typically involve a feeling of
               "missing time", it's not surprising that the terminology of the
               SGML standard is not closely followed.

       "_parent" -- (best accessed as "$e->parent") the element that is $e's
       parent, or undef if this element is the root of its tree.

       "_content" -- (best accessed as "$e->content_list") the list of nodes
       (i.e., elements or text segments) that are $e's children.

       Moreover, if an element object has any attributes in the SGML sense of
       the word, then those are readable as "$e->attr('name')" -- for example,
       with the object built from having parsed "<a id='foo'>bar</a>",
       "$e->attr('id')" will return the string "foo".  Likewise, "$e->tag" on
       that object returns the string "a", "$e->content_list" returns a list
       consisting of just the single scalar "bar", and "$e->parent" returns
       the object that's this node's parent -- which may be, for example, a
       "p" element.
132
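       As a rough illustration of those accessors (a sketch, not code from
       the original article), the following parses the one-link example,
       wrapped in a "p" element so that the parent is predictable, and reads
       each attribute back.  The variable names are assumptions made for the
       example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(q{<p><a id='foo'>bar</a></p>});
$tree->eof;

# Find the "a" element, then read its noteworthy attributes.
my $e = $tree->look_down('_tag', 'a');

my $tag        = $e->tag;            # the tag-name, lowercased
my $id         = $e->attr('id');     # an attribute in the SGML sense
my @content    = $e->content_list;   # child nodes: here, one text string
my $parent_tag = $e->parent->tag;    # the tag-name of the enclosing element

print "tag=$tag id=$id content=@content parent=$parent_tag\n";
$tree->delete;
```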
       And that's all that there is to it -- you throw HTML source at
       TreeBuilder, and it returns a tree built of HTML::Element objects and
       some text strings.

       However, what do you do with a tree of objects?  People code
       information into HTML trees not for the fun of arranging elements, but
       to represent the structure of specific text and images -- some text is
       in this "li" element, some other text is in that heading, some images
       are in that other table cell that has those attributes, and so on.

       Now, it may happen that you're rendering that whole HTML tree into some
       layout format.  Or you could be trying to make some systematic change
       to the HTML tree before dumping it out as HTML source again.  But, in
       my experience, by far the most common programming task that Perl
       programmers face with HTML is in trying to extract some piece of
       information from a larger document.  Since that's so common (and also
       since it involves concepts that are basic to more complex tasks), that
       is what the rest of this article will be about.

       Scanning HTML trees

       Suppose you have a thousand HTML documents, each of them a press
       release.  They all start out:

         [...lots of leading images and junk...]
         <h1>ConGlomCo to Open New Corporate Office in Ougadougou</h1>
         BAKERSFIELD, CA, 2000-04-24 -- ConGlomCo's vice president in charge
         of world conquest, Rock Feldspar, announced today the opening of a
         new office in Ougadougou, the capital city of Burkino Faso, gateway
         to the bustling "Silicon Sahara" of Africa...
         [...etc...]

       ...and what you've got to do is, for each document, copy whatever text
       is in the "h1" element, so that you can, for example, make a table of
       contents of it.  Now, there are three ways to do this:

       * You can just use a regexp to scan the file for a text pattern.
           For many very simple tasks, this will do fine.  Many HTML documents
           are, in practice, very consistently formatted as far as placement
           of linebreaks and whitespace, so you could just get away with
           scanning the file like so:

             sub get_heading {
               my $filename = $_[0];
               local *HTML;
               open(HTML, $filename)
                 or die "Couldn't open $filename: $!";
               my $heading;
              Line:
               while(<HTML>) {
                 if( m{<h1>(.*?)</h1>}i ) {  # match it!
                   $heading = $1;
                   last Line;
                 }
               }
               close(HTML);
               warn "No heading in $filename?"
                unless defined $heading;
               return $heading;
             }

           This is quick, but awfully fragile -- if there's a newline in the
           middle of a heading's text, it won't match the above regexp, and
           you'll get the "No heading" warning instead of a heading.  The
           regexp will also fail if the "h1" element's start-tag has any
           attributes.  If you have to adapt your code to fit more kinds of
           start-tags, you'll end up basically reinventing part of
           HTML::Parser, at which point you should probably just stop, and use
           HTML::Parser itself:

       * You can use HTML::Parser to scan the file for an "h1" start-tag
       token, then capture all the text tokens until the "h1" close-tag.  This
       approach is extensively covered in Ken MacFarlane's TPJ17 article
       "Parsing HTML with HTML::Parser".  (A variant of this approach is to
       use HTML::TokeParser, which presents a different and rather handier
       interface to the tokens that HTML::Parser picks out.)
           Using HTML::Parser is less fragile than our first approach, since
           it's not sensitive to the exact internal formatting of the
           start-tag (much less whether it's split across two lines).
           However, when you need more information about the context of the
           "h1" element, or if you're having to deal with any of the tricky
           bits of HTML, such as parsing of tables, you'll find that the flat
           list of tokens that HTML::Parser returns isn't immediately useful.
           To get something useful out of those tokens, you'll need to write
           code that knows some things about what elements take no content (as
           with "hr" elements), and that "</p>" end-tags are omissible, so a
           "<p>" will end any currently open paragraph -- and you're well on
           your way to pointlessly reinventing much of the code in
           HTML::TreeBuilder, at which point you should probably just stop,
           and use HTML::TreeBuilder itself:

               Footnote: And, as the person who last rewrote that module, I
               can attest that it wasn't terribly easy to get right!  Never
               underestimate the perversity of people coding HTML.

       * You can use HTML::TreeBuilder, and scan the tree of element objects
       that you get back.
230
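       The second approach is not shown in code above, so here is a minimal
       sketch of its HTML::TokeParser variant.  It is not from the original
       article; the sub name "get_heading_by_tokens" and the sample heading
       are made up for illustration.  It relies on HTML::TokeParser's
       "get_tag" and "get_trimmed_text" methods, and accepts either a
       filename or a reference to a string of HTML.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the text of the first "h1" element,
# or nothing if the document has no "h1" start-tag.
sub get_heading_by_tokens {
  my ($source) = @_;    # a filename, or a reference to a string of HTML
  my $parser = HTML::TokeParser->new($source)
    or die "Couldn't read HTML source: $!";

  # Skip tokens until the first h1 start-tag, whatever its attributes.
  return unless $parser->get_tag('h1');

  # Collect the text up to the h1 end-tag, with whitespace collapsed.
  return $parser->get_trimmed_text('/h1');
}

# A heading with an attribute and an embedded newline, which would
# defeat the single-line regexp used in the first approach.
my $html    = qq{<h1 class="main">ConGlomCo to Open\nNew Office</h1>};
my $heading = get_heading_by_tokens(\$html);
print "$heading\n";
```

       This copes with attributes and line breaks, but it still sees only a
       flat stream of tokens, which is the limitation described in the second
       item above.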
       The last approach, using HTML::TreeBuilder, is the diametric opposite
       of the first approach:  The first approach involves just elementary
       Perl and one regexp, whereas the TreeBuilder approach involves being at
       home with the concept of tree-shaped data structures and modules with
       object-oriented interfaces, as well as with the particular interfaces
       that HTML::TreeBuilder and HTML::Element provide.

       However, what the TreeBuilder approach has going for it is that it's
       the most robust, because it involves dealing with HTML in its "native"
       format -- it deals with the tree structure that HTML code represents,
       without any consideration of how the source is coded and with what tags
       omitted.

       So, to extract the text from the "h1" elements of an HTML document:

         sub get_heading {
           my $tree = HTML::TreeBuilder->new;
           $tree->parse_file($_[0]);   # !
           my $heading;
           my $h1 = $tree->look_down('_tag', 'h1');  # !
           if($h1) {
             $heading = $h1->as_text;   # !
           } else {
             warn "No heading in $_[0]?";
           }
           $tree->delete; # clear memory!
           return $heading;
         }

       This uses some unfamiliar methods that need explaining.  The
       "parse_file" method, which we've seen before, builds a tree based on
       source from the file given.  The "delete" method is for marking a
       tree's contents as available for garbage collection, when you're done
       with the tree.  The "as_text" method returns a string that contains all
       the text bits that are children (or otherwise descendants) of the given
       node -- to get the text content of the $h1 object, we could just say:

         $heading = join '', $h1->content_list;

       but that will work only if we're sure that the "h1" element's children
       will be only text bits -- if the document contained:

         <h1>Local Man Sees <cite>Blade</cite> Again</h1>

       then the sub-tree would be:

         . h1
           . "Local Man Sees "
           . cite
             . "Blade"
           . " Again"

       so "join '', $h1->content_list" will be something like:

         Local Man Sees HTML::Element=HASH(0x15424040) Again

       whereas "$h1->as_text" would yield:

         Local Man Sees Blade Again

       and depending on what you're doing with the heading text, you might
       want the "as_HTML" method instead.  It returns the (sub)tree
       represented as HTML source.  "$h1->as_HTML" would yield:

         <h1>Local Man Sees <cite>Blade</cite> Again</h1>

       However, if you wanted the contents of $h1 as HTML, but not the $h1
       itself, you could say:

         join '',
           map(
             ref($_) ? $_->as_HTML : $_,
             $h1->content_list
           )

       This "map" iterates over the nodes in $h1's list of children; and for
       each node that's just a text bit (as "Local Man Sees " is), it just
       passes through that string value, and for each node that's an actual
       object (causing "ref" to be true), "as_HTML" will be used instead of
       the string value of the object itself (which would be something quite
       useless, as most object values are).  So that "as_HTML" for the "cite"
       element will be the string "<cite>Blade</cite>".  And then, finally,
       "join" just puts into one string all the strings that the "map"
       returns.
315
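       The "join"/"map" idiom can be wrapped in a small helper and compared
       with "as_text".  This is a sketch under stated assumptions rather than
       part of the original article: "inner_html" is a hypothetical name, not
       an HTML::Element method, and the exact whitespace that "as_HTML" emits
       may vary between versions of the module.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: the HTML source of a node's children,
# without the start-tag and end-tag of the node itself.
sub inner_html {
  my ($element) = @_;
  return join '',
    map { ref($_) ? $_->as_HTML : $_ }   # objects as HTML, strings as-is
    $element->content_list;
}

my $tree = HTML::TreeBuilder->new;
$tree->parse('<h1>Local Man Sees <cite>Blade</cite> Again</h1>');
$tree->eof;

my $h1    = $tree->look_down('_tag', 'h1');
my $text  = $h1->as_text;       # text only, tags dropped
my $inner = inner_html($h1);    # children as HTML source

print "as_text:    $text\n";
print "inner_html: $inner\n";
$tree->delete;
```

       One caveat of this idiom: text children are passed through as plain
       strings, so a character such as "&" in the text is not re-escaped the
       way "as_HTML" would escape it.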
       Last but not least, the most important method in our "get_heading" sub
       is the "look_down" method.  This method looks down at the subtree
       starting at the given object (here, $tree), looking for elements that
       meet criteria you provide.

       The criteria are specified in the method's argument list.  Each
       criterion can consist of two scalars, a key and a value, which express
       that you want elements that have that attribute (like "_tag", or "src")
       with the given value ("h1"); or the criterion can be a reference to a
       subroutine that, when called on the given element, returns true if that
       is a node you're looking for.  If you specify several criteria, then
       that's taken to mean that you want all the elements that each satisfy
       all the criteria.  (In other words, there's an "implicit AND".)

       And finally, there's a bit of an optimization -- if you call the
       "look_down" method in a scalar context, you get just the first node (or
       undef if none) -- and, in fact, once "look_down" finds that first
       matching element, it doesn't bother looking any further.

       So the example:

         $h1 = $tree->look_down('_tag', 'h1');

       returns the first element at-or-under $tree whose "_tag" attribute has
       the value "h1".
341
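       The three points above (key/value criteria, subroutine criteria, and
       scalar versus list context) can be seen side by side in a short
       sketch.  It is not from the original article; the sample markup, the
       "class" values, and the variable names are invented for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(<<'END_HTML');
<h1 class="ad">Sponsor</h1>
<h1 class="news">First story</h1>
<h1 class="news">Second story</h1>
END_HTML
$tree->eof;

# Scalar context, two key/value criteria (implicitly ANDed):
# the first "h1" whose class attribute is exactly "news".
my $first_news = $tree->look_down('_tag', 'h1', 'class', 'news');
my $first_text = $first_news->as_text;

# List context: every "h1" at or under $tree.
my @all_h1    = $tree->look_down('_tag', 'h1');
my $h1_count  = @all_h1;

# A subroutine criterion: only headings with more than ten characters.
my @long      = $tree->look_down(
  '_tag', 'h1',
  sub { length( $_[0]->as_text ) > 10 }
);
my $long_count = @long;

print "first news heading: $first_text\n";
print "h1 elements: $h1_count, long ones: $long_count\n";
$tree->delete;
```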
       Complex Criteria in Tree Scanning

       Now, the above "look_down" code looks like a lot of bother, with barely
       more benefit than just grepping the file!  But consider if your
       criteria were more complicated -- suppose you found that some of the
       press releases that you were scanning had several "h1" elements,
       possibly before or after the one you actually want.  For example:

         <h1><center>Visit Our Corporate Partner
          <br><a href="/dyna/clickthru"
            ><img src="/dyna/vend_ad"></a>
         </center></h1>
         <h1><center>ConGlomCo President Schreck to Visit Regional HQ
          <br><a href="/photos/Schreck_visit_large.jpg"
            ><img src="/photos/Schreck_visit.jpg"></a>
         </center></h1>

       Here, you want to ignore the first "h1" element because it contains an
       ad, and you want the text from the second "h1".  The problem is in
       formalizing the way you know that it's an ad.  Since ad banners are
       always entreating you to "visit" the sponsoring site, you could exclude
       "h1" elements that contain the word "visit" under them:

         my $real_h1 = $tree->look_down(
           '_tag', 'h1',
           sub {
             $_[0]->as_text !~ m/\bvisit/i
           }
         );

       The first criterion looks for "h1" elements, and the second criterion
       limits those to only the ones whose text content doesn't match
       "m/\bvisit/i".  But unfortunately, that won't work for our example,
       since the second "h1" mentions "ConGlomCo President Schreck to Visit
       Regional HQ".

       Instead you could try looking for the first "h1" element that doesn't
       contain an image:

         my $real_h1 = $tree->look_down(
           '_tag', 'h1',
           sub {
             not $_[0]->look_down('_tag', 'img')
           }
         );

       This criterion sub might seem a bit odd, since it calls "look_down" as
       part of a larger "look_down" operation, but that's fine.  Note that
       when considered as a boolean value, a "look_down" in a scalar context
       returns false (specifically, undef) if there's no matching element at
       or under the given element; and it returns the first matching element
       (which, being a reference and object, is always a true value), if any
       matches.  So, here,

         sub {
           not $_[0]->look_down('_tag', 'img')
         }

       means "return true only if this element has no 'img' element as
       descendants (and isn't an 'img' element itself)."

       This correctly filters out the first "h1" that contains the ad, but it
       also incorrectly filters out the second "h1" that contains a
       non-advertisement photo besides the headline text you want.

       There clearly are detectable differences between the first and second
       "h1" elements -- only the second one contains the string "Schreck", and
       we could just test for that:

         my $real_h1 = $tree->look_down(
           '_tag', 'h1',
           sub {
             $_[0]->as_text =~ m{Schreck}
           }
         );

       And that works fine for this one example, but unless all thousand of
       your press releases have "Schreck" in the headline, that's just not a
       general solution.  However, if all the ads-in-"h1"s that you want to
       exclude involve a link whose URL involves "/dyna/", then you can use
       that:

         my $real_h1 = $tree->look_down(
           '_tag', 'h1',
           sub {
             my $link = $_[0]->look_down('_tag','a');
             return 1 unless $link;
               # no link means it's fine
             return 0 if $link->attr('href') =~ m{/dyna/};
               # a link to there is bad
             return 1; # otherwise okay
           }
         );

       Or you can look at it another way and say that you want the first "h1"
       element that either contains no images, or else whose image has a "src"
       attribute whose value contains "/photos/":

         my $real_h1 = $tree->look_down(
           '_tag', 'h1',
           sub {
             my $img = $_[0]->look_down('_tag','img');
             return 1 unless $img;
               # no image means it's fine
             return 1 if $img->attr('src') =~ m{/photos/};
               # good if a photo
             return 0; # otherwise bad
           }
         );

       Recall that this use of "look_down" in a scalar context means to return
       the first element at or under $tree that matches all the criteria.  But
       if you notice that you can formulate criteria that'll match several
       possible "h1" elements, some of which may be bogus but the last one of
       which is always the one you want, then you can use "look_down" in a
       list context, and just use the last element of that list:

         my @h1s = $tree->look_down(
           '_tag', 'h1',
           # ...maybe more criteria...
         );
         die "What, no h1s here?" unless @h1s;
         my $real_h1 = $h1s[-1]; # last or only
465
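       To see one of these criteria at work, the "/dyna/" test can be run
       against the two-heading sample.  This is only a sketch: the markup is a
       trimmed-down version of the sample above (the "center" elements are
       left out), and the "defined" check on the "href" value is an addition
       that guards against links with no "href" attribute.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(<<'END_HTML');
<h1>Visit Our Corporate Partner
 <br><a href="/dyna/clickthru"><img src="/dyna/vend_ad"></a>
</h1>
<h1>ConGlomCo President Schreck to Visit Regional HQ
 <br><a href="/photos/Schreck_visit_large.jpg"
   ><img src="/photos/Schreck_visit.jpg"></a>
</h1>
END_HTML
$tree->eof;

# First "h1" that has no link, or whose link doesn't point into /dyna/.
my $real_h1 = $tree->look_down(
  '_tag', 'h1',
  sub {
    my $link = $_[0]->look_down('_tag', 'a');
    return 1 unless $link;                # no link means it's fine
    my $href = $link->attr('href');
    return 0 if defined $href and $href =~ m{/dyna/};   # an ad link
    return 1;                             # otherwise okay
  }
);

my $heading = $real_h1 ? $real_h1->as_text : undef;
print defined $heading ? "$heading\n" : "No real heading found\n";
$tree->delete;
```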
       A Case Study: Scanning Yahoo News's HTML

       The above (somewhat contrived) case involves extracting data from a
       bunch of pre-existing HTML files.  In that sort of situation, if your
       code works for all the files, then you know that the code works --
       since the data it's meant to handle won't go changing or growing; and,
       typically, once you've used the program, you'll never need to use it
       again.

       The other kind of situation faced in many data extraction tasks is
       where the program is used recurringly to handle new data -- such as
       from ever-changing Web pages.  As a real-world example of this,
       consider a program that you could use (suppose it's crontabbed) to
       extract headline-links from subsections of Yahoo News
       ("http://dailynews.yahoo.com/").

       Yahoo News has several subsections:

       http://dailynews.yahoo.com/h/tc/ for technology news
       http://dailynews.yahoo.com/h/sc/ for science news
       http://dailynews.yahoo.com/h/hl/ for health news
       http://dailynews.yahoo.com/h/wl/ for world news
       http://dailynews.yahoo.com/h/en/ for entertainment news

       and others.  All of them are built on the same basic HTML template --
       and a scarily complicated template it is, especially when you look at
       it with an eye toward making up rules that will select where the real
       headline-links are, while screening out all the links to other parts of
       Yahoo, other news services, etc.  You will need to puzzle over the HTML
       source, and scrutinize the output of "$tree->dump" on the parse tree of
       that HTML.

       Sometimes the only way to pin down what you're after is by position in
       the tree. For example, headlines of interest may be in the third column
       of the second row of the second table element in a page:

         my $table = ( $tree->look_down('_tag','table') )[1];
         my $row2  = ( $table->look_down('_tag', 'tr' ) )[1];
         my $col3  = ( $row2->look_down('_tag', 'td')   )[2];
         # ...then do things with $col3...

       Or they may be all the links in a "p" element that has at least three
       "br" elements as children:

         my $p = $tree->look_down(
           '_tag', 'p',
           sub {
             2 < grep { ref($_) and $_->tag eq 'br' }
                      $_[0]->content_list
           }
         );
         my @links = $p->look_down('_tag', 'a');

       But almost always, you can get away with looking for properties of the
       thing itself, rather than just looking for contexts.  Now, if you're
       lucky, the document you're looking through has clear semantic tagging,
       such as is useful in CSS -- note the class="headlinelink" bit here:

         <a href="...long_news_url..." class="headlinelink">Elvis
         seen in tortilla</a>

       If you find anything like that, you could leap right in and select
       links with:

         my @links = $tree->look_down('class','headlinelink');
532
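       A key/value criterion compares the whole attribute value, so
       "look_down('class','headlinelink')" would not match an element whose
       "class" attribute lists several class names.  The following sketch,
       which is not from the original article, shows one way to handle that
       with a subroutine criterion.  The helper name "has_class", the URLs,
       and the class names other than "headlinelink" are all hypothetical.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: true if the element's class attribute
# lists the wanted class name among its space-separated values.
sub has_class {
  my ($element, $wanted) = @_;
  my $classes = $element->attr('class');
  return 0 unless defined $classes;
  return scalar grep { $_ eq $wanted } split ' ', $classes;
}

my $tree = HTML::TreeBuilder->new;
$tree->parse(<<'END_HTML');
<p><a href="/story1" class="headlinelink">Elvis seen in tortilla</a>
<a href="/story2" class="headlinelink breaking">Tortilla denies it</a>
<a href="/about" class="navlink">About us</a></p>
END_HTML
$tree->eof;

# Exact comparison against the whole attribute value.
my @exact    = $tree->look_down('class', 'headlinelink');

# Comparison against each class name in the attribute.
my @by_token = $tree->look_down(
  '_tag', 'a',
  sub { has_class($_[0], 'headlinelink') }
);

my ($exact_count, $token_count) = (scalar @exact, scalar @by_token);
print "exact: $exact_count, by class name: $token_count\n";
$tree->delete;
```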
       Regrettably, your chances of seeing any sort of semantic markup
       principles really being followed with actual HTML are pretty thin.

           Footnote: In fact, your chances of finding a page that is simply
           free of HTML errors are even thinner.  And surprisingly, sites like
           Amazon or Yahoo are typically worse as far as quality of code than
           personal sites whose entire production cycle involves simply being
           saved and uploaded from Netscape Composer.

       The code may be sort of "accidentally semantic", however -- for
       example, in a set of pages I was scanning recently, I found that
       looking for "td" elements with a "width" attribute value of "375" got
       me exactly what I wanted.  No one designing that page ever conceived of
       "width=375" as meaning "this is a headline", but if you impute it to
       mean that, it works.

       An approach like this happens to work for the Yahoo News code, because
       the headline-links are distinguished by the fact that they (and they
       alone) contain a "b" element:

         <a href="...long_news_url..."><b>Elvis seen in tortilla</b></a>

       or, diagrammed as a part of the parse tree:

         . a  [href="...long_news_url..."]
           . b
             . "Elvis seen in tortilla"

       A rule that matches these can be formalized as "look for any 'a'
       element that has only one daughter node, which must be a 'b' element".
       And this is what it looks like when cooked up as a "look_down"
       expression and prefaced with a bit of code that retrieves the text of
       the given Yahoo News page and feeds it to TreeBuilder:

         use strict;
         use HTML::TreeBuilder 2.97;
         use LWP::UserAgent;
         sub get_headlines {
           my $url = $_[0] || die "What URL?";

           my $response = LWP::UserAgent->new->request(
             HTTP::Request->new( GET => $url )
           );
           unless($response->is_success) {
             warn "Couldn't get $url: ", $response->status_line, "\n";
             return;
           }

           my $tree = HTML::TreeBuilder->new();
           $tree->parse($response->content);
           $tree->eof;

           my @out;
           foreach my $link (
             $tree->look_down(   # !
               '_tag', 'a',
               sub {
                 return unless $_[0]->attr('href');
                 my @c = $_[0]->content_list;
                 @c == 1 and ref $c[0] and $c[0]->tag eq 'b';
               }
             )
           ) {
             push @out, [ $link->attr('href'), $link->as_text ];
           }

           warn "Odd, fewer than 6 stories in $url!" if @out < 6;
           $tree->delete;
           return @out;
         }

       ...and add a bit of code to actually call that routine and display the
       results...

         foreach my $section (qw[tc sc hl wl en]) {
           my @links = get_headlines(
             "http://dailynews.yahoo.com/h/$section/"
           );
           print
             $section, ": ", scalar(@links), " stories\n",
             map(("  ", $_->[0], " : ", $_->[1], "\n"), @links),
             "\n";
         }

       And we've got our own headline-extractor service!  This in and of
       itself isn't amazingly useful (since if you want to see the headlines,
       you can just look at the Yahoo News pages), but it could easily be the
       basis for quite useful features like filtering the headlines for
       matching certain keywords of interest to you.
622
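       A keyword filter of that kind might look like the following sketch.
       It is not part of the original article: the sub name
       "filter_headlines", the keywords, and the sample links are made up,
       and the links simply use the same [URL, text] pair layout that
       "get_headlines" returns.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the [url, text] pairs whose text
# contains at least one of the given keywords as a whole word.
sub filter_headlines {
  my ($keywords, @links) = @_;
  return unless @$keywords;    # no keywords, no matches

  # Build one case-insensitive pattern out of all the keywords.
  my $alternation = join '|', map { quotemeta } @$keywords;
  my $pattern     = qr/\b(?:$alternation)\b/i;

  return grep { $_->[1] =~ $pattern } @links;
}

# Sample data standing in for the output of get_headlines().
my @links = (
  [ 'http://example.com/story1', 'Elvis seen in tortilla' ],
  [ 'http://example.com/story2', 'New Perl release announced' ],
  [ 'http://example.com/story3', 'Local man sees Blade again' ],
);

my @wanted = filter_headlines( [ 'perl', 'elvis' ], @links );
print "  $_->[0] : $_->[1]\n" for @wanted;
```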
       Now, one of these days, Yahoo News will decide to change its HTML
       template.  When this happens, this will appear to the above program as
       there being no links that meet the given criteria; or, less likely,
       dozens of erroneous links will meet the criteria.  In either case, the
       criteria will have to be changed for the new template; they may just
       need adjustment, or you may need to scrap them and start over.

       Regardez, duvet!

       It's often quite a challenge to write criteria to match the desired
       parts of an HTML parse tree.  Very often you can pull it off with a
       simple "$tree->look_down('_tag', 'h1')", but sometimes you do have to
       keep adding and refining criteria, until you might end up with complex
       filters like what I've shown in this article.  The benefit to learning
       how to deal with HTML parse trees is that one main search tool, the
       "look_down" method, can do most of the work, making simple things easy,
       while still making hard things possible.

       [end body of article]

       [Author Credit]

       Sean M. Burke ("sburke@cpan.org") is the current maintainer of
       "HTML::TreeBuilder" and "HTML::Element", both originally by Gisle Aas.

       Sean adds: "I'd like to thank the folks who listened to me ramble
       incessantly about HTML::TreeBuilder and HTML::Element at this year's
       Yet Another Perl Conference and O'Reilly Open Source Software
       Convention."

BACK

       Return to the HTML::Tree docs.

perl v5.8.8                       2006-08-04           HTML::Tree::Scanning(3)