HTML::Tree::Scanning(3)  User Contributed Perl Documentation  HTML::Tree::Scanning(3)

NAME

       HTML::Tree::Scanning -- article: "Scanning HTML"

SYNOPSIS

         # This is an article, not a module.

DESCRIPTION

       The following article by Sean M. Burke first appeared in The Perl
       Journal #19 and is copyright 2000 The Perl Journal. It appears courtesy
       of Jon Orwant and The Perl Journal.  This document may be distributed
       under the same terms as Perl itself.

       (Note that this is discussed in chapters 6 through 10 of the book Perl
       and LWP <http://lwp.interglacial.com/> which was written after the
       following documentation, and which is available free online.)

Scanning HTML

       -- Sean M. Burke

       In The Perl Journal issue 17, Ken MacFarlane's article "Parsing HTML
       with HTML::Parser" describes how the HTML::Parser module scans HTML
       source as a stream of start-tags, end-tags, text, comments, etc.  In
       TPJ #18, my "Trees" article kicked around the idea of tree-shaped data
       structures.  Now I'll try to tie it together, in a discussion of HTML
       trees.

       The CPAN module HTML::TreeBuilder takes the tags that HTML::Parser
       picks out, and builds a parse tree -- a tree-shaped network of
       objects...

           Footnote: And if you need a quick explanation of objects, see my
           TPJ17 article "A User's View of Object-Oriented Modules"; or go
           whole hog and get Damian Conway's excellent book Object-Oriented
           Perl, from Manning Publications.

       ...representing the structured content of the HTML document.  And once
       the document is parsed as a tree, you'll find the common tasks of
       extracting data from that HTML document/tree to be quite
       straightforward.

   HTML::Parser, HTML::TreeBuilder, and HTML::Element
       You use HTML::TreeBuilder to make a parse tree out of an HTML source
       file, by simply saying:

         use HTML::TreeBuilder;
         my $tree = HTML::TreeBuilder->new();
         $tree->parse_file('foo.html');

       and then $tree contains a parse tree built from the HTML source from
       the file "foo.html".  The way this parse tree is represented is with a
       network of objects -- $tree is the root, an element with tag-name
       "html", and its children typically include a "head" and "body" element,
       and so on.  Elements in the tree are objects of the class
       HTML::Element.

       So, if you take this source:

         <html><head><title>Doc 1</title></head>
         <body>
         Stuff <hr> 2000-08-17
         </body></html>

       and feed it to HTML::TreeBuilder, it'll return a tree of objects that
       looks like this:

                      html
                    /      \
                head        body
               /          /   |  \
            title    "Stuff"  hr  "2000-08-17"
              |
           "Doc 1"

       This is a pretty simple document, but if it were any more complex, it'd
       be a bit hard to draw in that style, since it'd sprawl left and right.
       The same tree can be represented a bit more easily sideways, with
       indenting:

         . html
            . head
               . title
                  . "Doc 1"
            . body
               . "Stuff"
               . hr
               . "2000-08-17"

       Either way expresses the same structure.  In that structure, the root
       node is an object of the class HTML::Element

           Footnote: Well actually, the root is of the class
           HTML::TreeBuilder, but that's just a subclass of HTML::Element,
           plus the few extra methods like "parse_file" that elaborate the
           tree

       , with the tag name "html", and with two children: HTML::Element
       objects whose tag names are "head" and "body".  And each of those
       elements has children, and so on down.  Not all elements (as we'll
       call the objects of class HTML::Element) have children -- the "hr"
       element doesn't.  And not all nodes in the tree are elements -- the
       text nodes ("Doc 1", "Stuff", and "2000-08-17") are just strings.
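
       As a rough sketch of that distinction, the following parses the sample
       document from a string and uses "ref" to tell element objects from
       plain text nodes.  The $source string and the @kinds array are
       illustrative names, not part of the original article, and "look_down"
       is explained later on.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The sample document from above, held in a string for convenience.
my $source = '<html><head><title>Doc 1</title></head>'
           . '<body>Stuff <hr> 2000-08-17</body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($source);   # feed the source to the parser
$tree->eof;              # signal that there is no more input

# Elements are objects (ref() is true); text nodes are plain strings.
my $body  = $tree->look_down('_tag', 'body');
my @kinds = map { ref($_) ? 'element:' . $_->tag : 'text' }
            $body->content_list;
print join(', ', @kinds), "\n";

$tree->delete;   # release the tree when done
```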

       Objects of the class HTML::Element each have three noteworthy
       attributes:

       "_tag" -- (best accessed as "$e->tag") this element's tag-name,
       lowercased (e.g., "em" for an "em" element).
               Footnote: Yes, this is misnamed.  In proper SGML terminology,
               this is instead called a "GI", short for "generic identifier";
               and the term "tag" is used for a token of SGML source that
               represents either the start of an element (a start-tag like
               "<em lang='fr'>") or the end of an element (an end-tag like
               "</em>").  However, since more people claim to have been
               abducted by aliens than to have ever seen the SGML standard,
               and since both encounters typically involve a feeling of
               "missing time", it's not surprising that the terminology of the
               SGML standard is not closely followed.

       "_parent" -- (best accessed as "$e->parent") the element that is $e's
       parent, or undef if this element is the root of its tree.

       "_content" -- (best accessed as "$e->content_list") the list of nodes
       (i.e., elements or text segments) that are $e's children.

       Moreover, if an element object has any attributes in the SGML sense of
       the word, then those are readable as "$e->attr('name')" -- for example,
       with the object built from having parsed "<a id='foo'>bar</a>",
       "$e->attr('id')" will return the string "foo".  Moreover, "$e->tag" on
       that object returns the string "a", "$e->content_list" returns a list
       consisting of just the single scalar "bar", and "$e->parent" returns
       the object that's this node's parent -- which may be, for example, a
       "p" element.

       And that's all that there is to it -- you throw HTML source at
       TreeBuilder, and it returns a tree built of HTML::Element objects and
       some text strings.
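
       To make those accessors concrete, here is a minimal sketch that parses
       the "<a id='foo'>bar</a>" fragment inside a "p" element and reads each
       attribute back.  The enclosing "p" and the variable names are
       assumptions made for illustration; "look_down" is covered later on.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(q{<p><a id='foo'>bar</a></p>});
$tree->eof;

my $e = $tree->look_down('_tag', 'a');   # the "a" element

my $tag     = $e->tag;            # "a"
my $id      = $e->attr('id');     # "foo"
my @content = $e->content_list;   # one text node: "bar"
my $parent  = $e->parent->tag;    # "p"

print "$tag (id=$id) inside $parent contains: @content\n";

$tree->delete;
```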

       However, what do you do with a tree of objects?  People code
       information into HTML trees not for the fun of arranging elements, but
       to represent the structure of specific text and images -- some text is
       in this "li" element, some other text is in that heading, some images
       are in that other table cell that has those attributes, and so on.

       Now, it may happen that you're rendering that whole HTML tree into some
       layout format.  Or you could be trying to make some systematic change
       to the HTML tree before dumping it out as HTML source again.  But, in
       my experience, by far the most common programming task that Perl
       programmers face with HTML is in trying to extract some piece of
       information from a larger document.  Since that's so common (and also
       since it involves concepts that are basic to more complex tasks), that
       is what the rest of this article will be about.

   Scanning HTML trees
       Suppose you have a thousand HTML documents, each of them a press
       release.  They all start out:

         [...lots of leading images and junk...]
         <h1>ConGlomCo to Open New Corporate Office in Ouagadougou</h1>
         BAKERSFIELD, CA, 2000-04-24 -- ConGlomCo's vice president in charge
         of world conquest, Rock Feldspar, announced today the opening of a
         new office in Ouagadougou, the capital city of Burkina Faso, gateway
         to the bustling "Silicon Sahara" of Africa...
         [...etc...]

       ...and what you've got to do is, for each document, copy whatever text
       is in the "h1" element, so that you can, for example, make a table of
       contents of it.  Now, there are three ways to do this:

       ·   You can just use a regexp to scan the file for a text pattern.

           For many very simple tasks, this will do fine.  Many HTML documents
           are, in practice, very consistently formatted as far as placement
           of linebreaks and whitespace, so you could just get away with
           scanning the file like so:

             sub get_heading {
               my $filename = $_[0];
               # Lexical filehandle and three-argument open
               open(my $fh, '<', $filename)
                 or die "Couldn't open $filename: $!";
               my $heading;
              Line:
               while(<$fh>) {
                 if( m{<h1>(.*?)</h1>}i ) {  # match it!
                   $heading = $1;
                   last Line;
                 }
               }
               close($fh);
               warn "No heading in $filename?"
                unless defined $heading;
               return $heading;
             }

           This is quick, but awfully fragile -- if there's a newline in the
           middle of a heading's text, it won't match the above regexp, and
           the sub will warn that it found no heading.  The regexp will also
           fail if the "h1" element's start-tag has any attributes.  If you
           have to adapt your code to fit more kinds of start-tags, you'll end
           up basically reinventing part of HTML::Parser, at which point you
           should probably just stop, and use HTML::Parser itself:

       ·   You can use HTML::Parser to scan the file for an "h1" start-tag
           token, then capture all the text tokens until the "h1" end-tag.
           This approach is extensively covered in Ken MacFarlane's TPJ17
           article "Parsing HTML with HTML::Parser".  (A variant of this
           approach is to use HTML::TokeParser, which presents a different and
           rather handier interface to the tokens that HTML::Parser picks
           out.)

           Using HTML::Parser is less fragile than our first approach, since
           it's not sensitive to the exact internal formatting of the
           start-tag (much less whether it's split across two lines).
           However, when you need more information about the context of the
           "h1" element, or if you're having to deal with any of the tricky
           bits of HTML, such as parsing of tables, you'll find out the flat
           list of tokens that HTML::Parser returns isn't immediately useful.
           To get something useful out of those tokens, you'll need to write
           code that knows some things about what elements take no content (as
           with "hr" elements), and that "</p>" end-tags are omissible, so a
           "<p>" will end any currently open paragraph -- and you're well on
           your way to pointlessly reinventing much of the code in
           HTML::TreeBuilder

               Footnote: And, as the person who last rewrote that module, I
               can attest that it wasn't terribly easy to get right!  Never
               underestimate the perversity of people coding HTML.

           , at which point you should probably just stop, and use
           HTML::TreeBuilder itself:

       ·   You can use HTML::TreeBuilder, and scan the tree of element objects
           that you get back.

       The last approach, using HTML::TreeBuilder, is the diametric opposite
       of the first approach:  The first approach involves just elementary
       Perl and one regexp, whereas the TreeBuilder approach involves being at
       home with the concept of tree-shaped data structures and modules with
       object-oriented interfaces, as well as with the particular interfaces
       that HTML::TreeBuilder and HTML::Element provide.

       However, what the TreeBuilder approach has going for it is that it's
       the most robust, because it involves dealing with HTML in its "native"
       format -- it deals with the tree structure that HTML code represents,
       without any consideration of how the source is coded and with what tags
       omitted.

       So, to extract the text from the "h1" elements of an HTML document:

         sub get_heading {
           my $tree = HTML::TreeBuilder->new;
           $tree->parse_file($_[0]);   # !
           my $heading;
           my $h1 = $tree->look_down('_tag', 'h1');  # !
           if($h1) {
             $heading = $h1->as_text;   # !
           } else {
             warn "No heading in $_[0]?";
           }
           $tree->delete; # clear memory!
           return $heading;
         }

       This uses some unfamiliar methods that need explaining.  The
       "parse_file" method, which we've seen before, builds a tree based on
       source from the file given.  The "delete" method is for marking a
       tree's contents as available for garbage collection, when you're done
       with the tree.  The "as_text" method returns a string that contains all
       the text bits that are children (or otherwise descendants) of the given
       node -- to get the text content of the $h1 object, we could just say:

         $heading = join '', $h1->content_list;

       but that will work only if we're sure that the "h1" element's children
       will be only text bits -- if the document contained:

         <h1>Local Man Sees <cite>Blade</cite> Again</h1>

       then the sub-tree would be:

         . h1
           . "Local Man Sees "
           . cite
             . "Blade"
           . " Again"

       so "join '', $h1->content_list" will be something like:

         Local Man Sees HTML::Element=HASH(0x15424040) Again

       whereas "$h1->as_text" would yield:

         Local Man Sees Blade Again

       and depending on what you're doing with the heading text, you might
       want the "as_HTML" method instead.  It returns the (sub)tree
       represented as HTML source.  "$h1->as_HTML" would yield:

         <h1>Local Man Sees <cite>Blade</cite> Again</h1>

       However, if you wanted the contents of $h1 as HTML, but not the $h1
       itself, you could say:

         join '',
           map(
             ref($_) ? $_->as_HTML : $_,
             $h1->content_list
           )

       This "map" iterates over the nodes in $h1's list of children; and for
       each node that's just a text bit (as "Local Man Sees " is), it just
       passes through that string value, and for each node that's an actual
       object (causing "ref" to be true), "as_HTML" will be used instead of
       the string value of the object itself (which would be something quite
       useless, as most object values are).  So that "as_HTML" for the "cite"
       element will be the string "<cite>Blade</cite>".  And then, finally,
       "join" just puts into one string all the strings that the "map"
       returns.
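
       The three ways of getting at the heading can be compared side by side.
       This is only a sketch: the variable names $text, $outer and $inner are
       invented here, and the exact whitespace that "as_HTML" emits can differ
       between versions of HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<h1>Local Man Sees <cite>Blade</cite> Again</h1>');
$tree->eof;

my $h1 = $tree->look_down('_tag', 'h1');

# All descendant text, with the markup dropped.
my $text = $h1->as_text;

# The element itself, serialized back to HTML source.
my $outer = $h1->as_HTML;

# Only the children: strings pass through, elements are serialized.
my $inner = join '',
  map( ref($_) ? $_->as_HTML : $_, $h1->content_list );

print "text:  $text\n";
print "outer: $outer\n";
print "inner: $inner\n";

$tree->delete;
```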

       Last but not least, the most important method in our "get_heading" sub
       is the "look_down" method.  This method looks down at the subtree
       starting at the given object ($tree), looking for elements that meet
       criteria you provide.

       The criteria are specified in the method's argument list.  Each
       criterion can consist of two scalars, a key and a value, which express
       that you want elements that have that attribute (like "_tag", or "src")
       with the given value ("h1"); or the criterion can be a reference to a
       subroutine that, when called on the given element, returns true if that
       is a node you're looking for.  If you specify several criteria, then
       that's taken to mean that you want all the elements that each satisfy
       all the criteria.  (In other words, there's an "implicit AND".)

       And finally, there's a bit of an optimization -- if you call the
       "look_down" method in a scalar context, you get just the first node (or
       undef if none) -- and, in fact, once "look_down" finds that first
       matching element, it doesn't bother looking any further.

       So the example:

         $h1 = $tree->look_down('_tag', 'h1');

       returns the first element at-or-under $tree whose "_tag" attribute has
       the value "h1".
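
       The different kinds of criteria, and the difference between list and
       scalar context, can be tried out on a small fragment.  The markup and
       the variable names below are made up for this sketch.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(
    '<h1 class="ad">Buy now</h1>'
  . '<h1>Real news</h1>'
  . '<p class="ad">More ads</p>'
);
$tree->eof;

# List context: every matching element.
my @all_h1 = $tree->look_down('_tag', 'h1');

# Scalar context: only the first match (or undef).
my $first_h1 = $tree->look_down('_tag', 'h1');

# Two key/value criteria, implicitly ANDed: an h1 AND class="ad".
my @ad_h1 = $tree->look_down('_tag', 'h1', 'class', 'ad');

# A key/value criterion plus a subroutine criterion.
my $plain_h1 = $tree->look_down(
  '_tag', 'h1',
  sub { not defined $_[0]->attr('class') },
);

# Copy the text out before the tree is deleted.
my $first_text = $first_h1->as_text;
my $plain_text = $plain_h1->as_text;

printf "%d h1s; first is '%s'; %d ad h1; plain one is '%s'\n",
  scalar(@all_h1), $first_text, scalar(@ad_h1), $plain_text;

$tree->delete;
```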

   Complex Criteria in Tree Scanning
       Now, the above "look_down" code looks like a lot of bother, with barely
       more benefit than just grepping the file!  But consider if your
       criteria were more complicated -- suppose you found that some of the
       press releases that you were scanning had several "h1" elements,
       possibly before or after the one you actually want.  For example:

         <h1><center>Visit Our Corporate Partner
          <br><a href="/dyna/clickthru"
            ><img src="/dyna/vend_ad"></a>
         </center></h1>
         <h1><center>ConGlomCo President Schreck to Visit Regional HQ
          <br><a href="/photos/Schreck_visit_large.jpg"
            ><img src="/photos/Schreck_visit.jpg"></a>
         </center></h1>

       Here, you want to ignore the first "h1" element because it contains an
       ad, and you want the text from the second "h1".  The problem is in
       formalizing the way you know that it's an ad.  Since ad banners are
       always entreating you to "visit" the sponsoring site, you could exclude
       "h1" elements that contain the word "visit" under them:

         my $real_h1 = $tree->look_down(
           '_tag', 'h1',
           sub {
             $_[0]->as_text !~ m/\bvisit/i
           }
         );

       The first criterion looks for "h1" elements, and the second criterion
       limits those to only the ones whose text content doesn't match
       "m/\bvisit/".  But unfortunately, that won't work for our example,
       since the second "h1" mentions "ConGlomCo President Schreck to Visit
       Regional HQ".

       Instead you could try looking for the first "h1" element that doesn't
       contain an image:

         my $real_h1 = $tree->look_down(
           '_tag', 'h1',
           sub {
             not $_[0]->look_down('_tag', 'img')
           }
         );

       This criterion sub might seem a bit odd, since it calls "look_down" as
       part of a larger "look_down" operation, but that's fine.  Note that
       when considered as a boolean value, a "look_down" in a scalar context
       returns false (specifically, undef) if there's no matching element at
       or under the given element; and it returns the first matching element
       (which, being a reference and object, is always a true value), if any
       matches.  So, here,

         sub {
           not $_[0]->look_down('_tag', 'img')
         }

       means "return true only if this element has no 'img' elements as
       descendants (and isn't an 'img' element itself)."

       This correctly filters out the first "h1" that contains the ad, but it
       also incorrectly filters out the second "h1", which contains a
       non-advertisement photo alongside the headline text you want.

       There clearly are detectable differences between the first and second
       "h1" elements -- only the second one contains the string "Schreck", and
       we could just test for that:

         my $real_h1 = $tree->look_down(
           '_tag', 'h1',
           sub {
             $_[0]->as_text =~ m{Schreck}
           }
         );

       And that works fine for this one example, but unless all thousand of
       your press releases have "Schreck" in the headline, that's just not a
       general solution.  However, if all the ads-in-"h1"s that you want to
       exclude involve a link whose URL involves "/dyna/", then you can use
       that:

         my $real_h1 = $tree->look_down(
           '_tag', 'h1',
           sub {
             my $link = $_[0]->look_down('_tag','a');
             return 1 unless $link;
               # no link means it's fine
             return 0 if $link->attr('href') =~ m{/dyna/};
               # a link to there is bad
             return 1; # otherwise okay
           }
         );

       Or you can look at it another way and say that you want the first "h1"
       element that either contains no images, or else whose image has a "src"
       attribute whose value contains "/photos/":

         my $real_h1 = $tree->look_down(
           '_tag', 'h1',
           sub {
             my $img = $_[0]->look_down('_tag','img');
             return 1 unless $img;
               # no image means it's fine
             return 1 if $img->attr('src') =~ m{/photos/};
               # good if a photo
             return 0; # otherwise bad
           }
         );
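
       To check that such a criterion really picks the intended heading, it
       can be run against the two-"h1" sample from the start of this section.
       The sketch below does that with the "/photos/" rule.  The guard for an
       image without a "src" attribute and the $html and $heading names are
       additions for illustration, not part of the original code.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = <<'END_HTML';
<h1><center>Visit Our Corporate Partner
 <br><a href="/dyna/clickthru"
   ><img src="/dyna/vend_ad"></a>
</center></h1>
<h1><center>ConGlomCo President Schreck to Visit Regional HQ
 <br><a href="/photos/Schreck_visit_large.jpg"
   ><img src="/photos/Schreck_visit.jpg"></a>
</center></h1>
END_HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

my $real_h1 = $tree->look_down(
  '_tag', 'h1',
  sub {
    my $img = $_[0]->look_down('_tag', 'img');
    return 1 unless $img;                # no image means it's fine
    my $src = $img->attr('src');
    return 0 unless defined $src;        # image with no src: treat as bad
    return $src =~ m{/photos/} ? 1 : 0;  # good only if it is a photo
  }
);

my $heading = $real_h1 ? $real_h1->as_text : undef;
print defined $heading ? "Heading: $heading\n" : "No heading found\n";

$tree->delete;
```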

       Recall that this use of "look_down" in a scalar context means to return
       the first element at or under $tree that matches all the criteria.  But
       if you notice that you can formulate criteria that'll match several
       possible "h1" elements, some of which may be bogus but the last one of
       which is always the one you want, then you can use "look_down" in a
       list context, and just use the last element of that list:

         my @h1s = $tree->look_down(
           '_tag', 'h1',
           # ...maybe more criteria...
         );
         die "What, no h1s here?" unless @h1s;
         my $real_h1 = $h1s[-1]; # last or only

   A Case Study: Scanning Yahoo News's HTML
       The above (somewhat contrived) case involves extracting data from a
       bunch of pre-existing HTML files.  In that sort of situation, if your
       code works for all the files, then you know that the code works --
       since the data it's meant to handle won't go changing or growing; and,
       typically, once you've used the program, you'll never need to use it
       again.

       The other kind of situation faced in many data extraction tasks is
       where the program is used recurringly to handle new data -- such as
       from ever-changing Web pages.  As a real-world example of this,
       consider a program that you could use (suppose it's crontabbed) to
       extract headline-links from subsections of Yahoo News
       ("http://dailynews.yahoo.com/").

       Yahoo News has several subsections:

       http://dailynews.yahoo.com/h/tc/ for technology news
       http://dailynews.yahoo.com/h/sc/ for science news
       http://dailynews.yahoo.com/h/hl/ for health news
       http://dailynews.yahoo.com/h/wl/ for world news
       http://dailynews.yahoo.com/h/en/ for entertainment news

       and others.  All of them are built on the same basic HTML template --
       and a scarily complicated template it is, especially when you look at
       it with an eye toward making up rules that will select where the real
       headline-links are, while screening out all the links to other parts of
       Yahoo, other news services, etc.  You will need to puzzle over the HTML
       source, and scrutinize the output of "$tree->dump" on the parse tree of
       that HTML.

       Sometimes the only way to pin down what you're after is by position in
       the tree. For example, headlines of interest may be in the third column
       of the second row of the second table element in a page:

         my $table = ( $tree->look_down('_tag','table') )[1];
         my $row2  = ( $table->look_down('_tag', 'tr' ) )[1];
         my $col3  = ( $row2->look_down('_tag', 'td')   )[2];
         # ...then do things with $col3...
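
       Positional lookups like these fail hard if a table or row is missing,
       because a method gets called on undef.  One possible way to soften
       that is a small helper that returns undef instead.  The "nth_match"
       sub and the sample tables below are hypothetical, written only for
       this sketch.  Note that "look_down" also finds rows and cells of
       nested tables, so the indexes only mean what you expect when the
       tables are not nested.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(<<'END_HTML');
<table><tr><td>Navigation</td></tr></table>
<table>
  <tr><td>a1</td><td>a2</td><td>a3</td></tr>
  <tr><td>b1</td><td>b2</td><td>Headlines here</td></tr>
</table>
END_HTML
$tree->eof;

# Return the Nth (zero-based) match at or under $from, or undef if absent.
sub nth_match {
  my ($from, $tag, $n) = @_;
  return unless $from;
  return ( $from->look_down('_tag', $tag) )[$n];
}

my $table = nth_match($tree,  'table', 1);   # second table
my $row2  = nth_match($table, 'tr',    1);   # its second row
my $col3  = nth_match($row2,  'td',    2);   # that row's third cell

my $cell_text = $col3 ? $col3->as_text : undef;
print defined $cell_text ? "Cell: $cell_text\n" : "Layout changed?\n";

$tree->delete;
```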

       Or they may be all the links in a "p" element that has at least three
       "br" elements as children:

         my $p = $tree->look_down(
           '_tag', 'p',
           sub {
             2 < grep { ref($_) and $_->tag eq 'br' }
                      $_[0]->content_list
           }
         );
         @links = $p->look_down('_tag', 'a');
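
       Here is the same rule applied to a small invented fragment, with a
       guard in case no paragraph qualifies.  The markup and the @hrefs array
       are assumptions for the sake of the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(
    '<p>See also <a href="/about">About</a><br>our site.</p>'
  . '<p><a href="/1">One</a><br><a href="/2">Two</a><br>'
  . '<a href="/3">Three</a><br></p>'
);
$tree->eof;

# First "p" with at least three "br" elements among its children.
my $p = $tree->look_down(
  '_tag', 'p',
  sub {
    2 < grep { ref($_) and $_->tag eq 'br' }
             $_[0]->content_list
  }
);

# Guard against a missing paragraph before asking it for links.
my @hrefs = $p
  ? map( $_->attr('href'), $p->look_down('_tag', 'a') )
  : ();
print "$_\n" for @hrefs;

$tree->delete;
```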

       But almost always, you can get away with looking for properties of the
       thing itself, rather than just looking for contexts.  Now, if you're
       lucky, the document you're looking through has clear semantic tagging,
       such as is useful in CSS -- note the class="headlinelink" bit here:

         <a href="...long_news_url..." class="headlinelink">Elvis
         seen in tortilla</a>

       If you find anything like that, you could leap right in and select
       links with:

         @links = $tree->look_down('class','headlinelink');

       Regrettably, your chances of seeing any sort of semantic markup
       principles really being followed with actual HTML are pretty thin.

           Footnote: In fact, your chances of finding a page that is simply
           free of HTML errors are even thinner.  And surprisingly, sites like
           Amazon or Yahoo are typically worse as far as quality of code than
           personal sites whose entire production cycle involves simply being
           saved and uploaded from Netscape Composer.

       The code may be sort of "accidentally semantic", however -- for
       example, in a set of pages I was scanning recently, I found that
       looking for "td" elements with a "width" attribute value of "375" got
       me exactly what I wanted.  No-one designing that page ever conceived of
       "width=375" as meaning "this is a headline", but if you impute it to
       mean that, it works.

       An approach like this happens to work for the Yahoo News code, because
       the headline-links are distinguished by the fact that they (and they
       alone) contain a "b" element:

         <a href="...long_news_url..."><b>Elvis seen in tortilla</b></a>

       or, diagrammed as a part of the parse tree:

         . a  [href="...long_news_url..."]
           . b
             . "Elvis seen in tortilla"

       A rule that matches these can be formalized as "look for any 'a'
       element that has only one daughter node, which must be a 'b' element".
       And this is what it looks like when cooked up as a "look_down"
       expression and prefaced with a bit of code that retrieves the text of
       the given Yahoo News page and feeds it to TreeBuilder:

         use strict;
         use HTML::TreeBuilder 2.97;
         use LWP::UserAgent;
         sub get_headlines {
           my $url = $_[0] || die "What URL?";

           my $response = LWP::UserAgent->new->request(
             HTTP::Request->new( GET => $url )
           );
           unless($response->is_success) {
             warn "Couldn't get $url: ", $response->status_line, "\n";
             return;
           }

           my $tree = HTML::TreeBuilder->new();
           $tree->parse($response->content);
           $tree->eof;

           my @out;
           foreach my $link (
             $tree->look_down(   # !
               '_tag', 'a',
               sub {
                 return unless $_[0]->attr('href');
                 my @c = $_[0]->content_list;
                 @c == 1 and ref $c[0] and $c[0]->tag eq 'b';
               }
             )
           ) {
             push @out, [ $link->attr('href'), $link->as_text ];
           }

           warn "Odd, fewer than 6 stories in $url!" if @out < 6;
           $tree->delete;
           return @out;
         }

       ...and add a bit of code to actually call that routine and display the
       results...

         foreach my $section (qw[tc sc hl wl en]) {
           my @links = get_headlines(
             "http://dailynews.yahoo.com/h/$section/"
           );
           print
             $section, ": ", scalar(@links), " stories\n",
             map(("  ", $_->[0], " : ", $_->[1], "\n"), @links),
             "\n";
         }

       And we've got our own headline-extractor service!  This in and of
       itself isn't amazingly useful (since if you want to see the
       headlines, you can just look at the Yahoo News pages), but it could
       easily be the basis for quite useful features like filtering the
       headlines for matching certain keywords of interest to you.

       Now, one of these days, Yahoo News will decide to change its HTML
       template.  When this happens, this will appear to the above program as
       there being no links that meet the given criteria; or, less likely,
       dozens of erroneous links will meet the criteria.  In either case, the
       criteria will have to be changed for the new template; they may just
       need adjustment, or you may need to scrap them and start over.
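
       Since the matching rule is the part most likely to need rework, it can
       help to keep it runnable without the network.  The sketch below applies
       the same rule to HTML that is already in memory.  The
       "headlines_from_html" sub and the sample links are hypothetical and are
       not taken from any real Yahoo page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: apply the headline rule to HTML held in a string.
sub headlines_from_html {
  my $html = $_[0];
  my $tree = HTML::TreeBuilder->new;
  $tree->parse($html);
  $tree->eof;

  my @out;
  foreach my $link (
    $tree->look_down(
      '_tag', 'a',
      sub {
        return unless $_[0]->attr('href');
        my @c = $_[0]->content_list;
        # exactly one child, and that child is a "b" element
        @c == 1 and ref $c[0] and $c[0]->tag eq 'b';
      }
    )
  ) {
    push @out, [ $link->attr('href'), $link->as_text ];
  }
  $tree->delete;
  return @out;
}

# A made-up fragment in the style described above.
my @found = headlines_from_html(
    '<a href="/help">Help</a>'
  . '<a href="/news/1"><b>Elvis seen in tortilla</b></a>'
  . '<a href="/news/2"><b>Local Man Sees Blade Again</b></a>'
  . '<a href="/ads"><b>Buy</b> now</a>'
);
print "$_->[0] : $_->[1]\n" for @found;
```

       When the template changes, a saved copy of the new page can be fed to
       the same sub while the criteria are being adjusted.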

   Regardez, duvet!
       It's often quite a challenge to write criteria to match the desired
       parts of an HTML parse tree.  Very often you can pull it off with a
       simple "$tree->look_down('_tag', 'h1')", but sometimes you do have to
       keep adding and refining criteria, until you might end up with complex
       filters like what I've shown in this article.  The benefit to learning
       how to deal with HTML parse trees is that one main search tool, the
       "look_down" method, can do most of the work, making simple things easy,
       while still making hard things possible.

       [end body of article]

   [Author Credit]
       Sean M. Burke ("sburke@cpan.org") is the current maintainer of
       "HTML::TreeBuilder" and "HTML::Element", both originally by Gisle Aas.

       Sean adds: "I'd like to thank the folks who listened to me ramble
       incessantly about HTML::TreeBuilder and HTML::Element at this year's
       Yet Another Perl Conference and O'Reilly Open Source Software
       Convention."

BACK

       Return to the HTML::Tree docs.

perl v5.16.3                      2014-06-10           HTML::Tree::Scanning(3)