HTML::Tree::Scanning(3)  User Contributed Perl Documentation  HTML::Tree::Scanning(3)
2
3
4
6 HTML::Tree::Scanning -- article: "Scanning HTML"
7
    # This is an article, not a module.
10
12 The following article by Sean M. Burke first appeared in The Perl Jour‐
13 nal #19 and is copyright 2000 The Perl Journal. It appears courtesy of
14 Jon Orwant and The Perl Journal. This document may be distributed
15 under the same terms as Perl itself.
16
18 -- Sean M. Burke
19
20 In The Perl Journal issue 17, Ken MacFarlane's article "Parsing HTML
21 with HTML::Parser" describes how the HTML::Parser module scans HTML
22 source as a stream of start-tags, end-tags, text, comments, etc. In
23 TPJ #18, my "Trees" article kicked around the idea of tree-shaped data
24 structures. Now I'll try to tie it together, in a discussion of HTML
25 trees.
26
27 The CPAN module HTML::TreeBuilder takes the tags that HTML::Parser
28 picks out, and builds a parse tree -- a tree-shaped network of
29 objects...
30
31 Footnote: And if you need a quick explanation of objects, see my
32 TPJ17 article "A User's View of Object-Oriented Modules"; or go
33 whole hog and get Damian Conway's excellent book Object-Oriented
34 Perl, from Manning Publications.
35
36 ...representing the structured content of the HTML document. And once
37 the document is parsed as a tree, you'll find the common tasks of
38 extracting data from that HTML document/tree to be quite straightfor‐
39 ward.
40
41 HTML::Parser, HTML::TreeBuilder, and HTML::Element
42
43 You use HTML::TreeBuilder to make a parse tree out of an HTML source
44 file, by simply saying:
45
46 use HTML::TreeBuilder;
47 my $tree = HTML::TreeBuilder->new();
48 $tree->parse_file('foo.html');
49
50 and then $tree contains a parse tree built from the HTML source from
51 the file "foo.html". The way this parse tree is represented is with a
52 network of objects -- $tree is the root, an element with tag-name
53 "html", and its children typically include a "head" and "body" element,
54 and so on. Elements in the tree are objects of the class HTML::Ele‐
55 ment.
56
57 So, if you take this source:
58
59 <html><head><title>Doc 1</title></head>
60 <body>
61 Stuff <hr> 2000-08-17
62 </body></html>
63
64 and feed it to HTML::TreeBuilder, it'll return a tree of objects that
65 looks like this:
66
                     html
                   /      \
               head        body
              /          /   |  \
           title    "Stuff"  hr  "2000-08-17"
             |
          "Doc 1"
74
This is a pretty simple document, but if it were any more complex, it'd
be a bit hard to draw in that style, since it'd sprawl left and right.
The same tree can be represented a bit more easily sideways, with
indenting:
79
    . html
       . head
          . title
             . "Doc 1"
       . body
          . "Stuff"
          . hr
          . "2000-08-17"
88
Either way expresses the same structure.  In that structure, the root
node is an object of the class HTML::Element, with the tag name "html",
and with two children: HTML::Element objects whose tag names are "head"
and "body".

    Footnote: Well actually, the root is of the class
    HTML::TreeBuilder, but that's just a subclass of HTML::Element,
    plus the few extra methods like "parse_file" that elaborate the
    tree.

And each of those elements has children, and so on down.  Not all
elements (as we'll call the objects of class HTML::Element) have
children -- the "hr" element doesn't.  And not all nodes in the tree
are elements -- the text nodes ("Doc 1", "Stuff", and "2000-08-17") are
just strings.
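To check this structure against a tree you've actually built, the
"dump" method of HTML::Element prints an indented outline much like the
sideways diagram.  The following is a minimal sketch, assuming the
sample document is held in a string (the variable names are only for
illustration); it also lists the children of "body" to show that text
nodes are plain strings while elements are objects.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The sample document from the text, held in a string for convenience.
my $html = <<'END_HTML';
<html><head><title>Doc 1</title></head>
<body>
Stuff <hr> 2000-08-17
</body></html>
END_HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;      # tell the parser that there is no more input

$tree->dump;     # print an indented outline of the tree to STDOUT

# Classify each child of "body": a reference is an element, else text.
my $body  = $tree->look_down('_tag', 'body');
my @kinds = map { ref($_) ? 'element:' . $_->tag : 'text' }
            $body->content_list;
print join(', ', @kinds), "\n";

$tree->delete;   # release the tree when done
```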
102
103 Objects of the class HTML::Element each have three noteworthy
104 attributes:
105
"_tag" -- (best accessed as "$e->tag") this element's tag-name,
    lowercased (e.g., "em" for an "em" element).

    Footnote: Yes, this is misnamed.  In proper SGML terminology, this
    is instead called a "GI", short for "generic identifier"; and the
    term "tag" is used for a token of SGML source that represents
    either the start of an element (a start-tag like "<em lang='fr'>")
    or the end of an element (an end-tag like "</em>").  However, since
    more people claim to have been abducted by aliens than to have ever
    seen the SGML standard, and since both encounters typically involve
    a feeling of "missing time", it's not surprising that the
    terminology of the SGML standard is not closely followed.

"_parent" -- (best accessed as "$e->parent") the element that is $e's
    parent, or undef if this element is the root of its tree.

"_content" -- (best accessed as "$e->content_list") the list of nodes
    (i.e., elements or text segments) that are $e's children.
123
124 Moreover, if an element object has any attributes in the SGML sense of
125 the word, then those are readable as "$e->attr('name')" -- for example,
126 with the object built from having parsed "<a id='foo'>bar</a>",
127 "$e->attr('id')" will return the string "foo". Moreover, "$e->tag" on
128 that object returns the string "a", "$e->content_list" returns a list
129 consisting of just the single scalar "bar", and "$e->parent" returns
130 the object that's this node's parent -- which may be, for example, a
131 "p" element.
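Here is a small sketch of those accessors in action.  It assumes the
"a" element sits inside a "p" (so that there is a parent worth looking
at), and the variable names are just for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(q{<p><a id='foo'>bar</a></p>});
$tree->eof;

my $e = $tree->look_down('_tag', 'a');   # the "a" element object

my $tag      = $e->tag;            # "a"
my $id       = $e->attr('id');     # "foo"
my @children = $e->content_list;   # ("bar"), a single text node
my $parent   = $e->parent->tag;    # "p"

print "tag=$tag id=$id parent=$parent children=@children\n";

$tree->delete;
```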
132
133 And that's all that there is to it -- you throw HTML source at Tree‐
134 Builder, and it returns a tree built of HTML::Element objects and some
135 text strings.
136
137 However, what do you do with a tree of objects? People code informa‐
138 tion into HTML trees not for the fun of arranging elements, but to rep‐
139 resent the structure of specific text and images -- some text is in
140 this "li" element, some other text is in that heading, some images are
141 in that other table cell that has those attributes, and so on.
142
143 Now, it may happen that you're rendering that whole HTML tree into some
144 layout format. Or you could be trying to make some systematic change
145 to the HTML tree before dumping it out as HTML source again. But, in
146 my experience, by far the most common programming task that Perl pro‐
147 grammers face with HTML is in trying to extract some piece of informa‐
148 tion from a larger document. Since that's so common (and also since it
149 involves concepts that are basic to more complex tasks), that is what
150 the rest of this article will be about.
151
152 Scanning HTML trees
153
154 Suppose you have a thousand HTML documents, each of them a press
155 release. They all start out:
156
    [...lots of leading images and junk...]
    <h1>ConGlomCo to Open New Corporate Office in Ouagadougou</h1>
    BAKERSFIELD, CA, 2000-04-24 -- ConGlomCo's vice president in charge
    of world conquest, Rock Feldspar, announced today the opening of a
    new office in Ouagadougou, the capital city of Burkina Faso, gateway
    to the bustling "Silicon Sahara" of Africa...
    [...etc...]
164
165 ...and what you've got to do is, for each document, copy whatever text
166 is in the "h1" element, so that you can, for example, make a table of
167 contents of it. Now, there are three ways to do this:
168
169 * You can just use a regexp to scan the file for a text pattern.
170 For many very simple tasks, this will do fine. Many HTML documents
171 are, in practice, very consistently formatted as far as placement
172 of linebreaks and whitespace, so you could just get away with scan‐
173 ning the file like so:
174
    sub get_heading {
      my $filename = $_[0];
      local *HTML;
      open(HTML, $filename)
        or die "Couldn't open $filename: $!";
      my $heading;
     Line:
      while(<HTML>) {
        if( m{<h1>(.*?)</h1>}i ) {  # match it!
          $heading = $1;
          last Line;
        }
      }
      close(HTML);
      warn "No heading in $filename?"
       unless defined $heading;
      return $heading;
    }
193
  This is quick and simple, but awfully fragile -- if there's a newline
  in the middle of a heading's text, it won't match the above regexp,
  and you'll get a "No heading" warning and an undefined result.  The
  regexp will also fail if the "h1" element's start-tag has any
  attributes.  If you have to adapt your code to fit more kinds of
  start-tags, you'll end up basically reinventing part of HTML::Parser,
  at which point you should probably just stop, and use HTML::Parser
  itself:
201
* You can use HTML::Parser to scan the file for an "h1" start-tag
  token, then capture all the text tokens until the "h1" end-tag.  This
  approach is extensively covered in Ken MacFarlane's TPJ17 article
  "Parsing HTML with HTML::Parser".  (A variant of this approach is to
  use HTML::TokeParser, which presents a different and rather handier
  interface to the tokens that HTML::Parser picks out.)

  Using HTML::Parser is less fragile than our first approach, since
  it's not sensitive to the exact internal formatting of the start-tag
  (much less whether it's split across two lines).  However, when you
  need more information about the context of the "h1" element, or if
  you're having to deal with any of the tricky bits of HTML, such as
  parsing of tables, you'll find that the flat list of tokens that
  HTML::Parser returns isn't immediately useful.  To get something
  useful out of those tokens, you'll need to write code that knows
  which elements take no content (as with "hr" elements), and that
  "</p>" end-tags are omissible, so a "<p>" will end any currently open
  paragraph -- and you're well on your way to pointlessly reinventing
  much of the code in HTML::TreeBuilder, at which point you should
  probably just stop, and use HTML::TreeBuilder itself:

      Footnote: And, as the person who last rewrote that module, I can
      attest that it wasn't terribly easy to get right!  Never
      underestimate the perversity of people coding HTML.

* You can use HTML::TreeBuilder, and scan the tree of element objects
  that you get back.
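For comparison, here is roughly what the token-based approach can look
like with HTML::TokeParser.  This is only a sketch: the sub name
"heading_via_tokens" and the sample markup are made up for
illustration, and it reads from a string rather than a file.  Note that
it copes with an attribute on the start-tag and a newline in the
heading, both of which defeat the regexp above.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Return the text of the first "h1" in an HTML string, or undef.
sub heading_via_tokens {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Couldn't set up the parser";
    $p->get_tag('h1') or return;          # skip ahead to the first <h1 ...>
    return $p->get_trimmed_text('/h1');   # text up to the closing </h1>
}

my $heading = heading_via_tokens(
    qq{<h1 class="big">Local Man Sees\n<cite>Blade</cite> Again</h1>}
);
print "$heading\n";
```

This still hands you a flat stream of tokens, so anything that depends
on where the "h1" sits in the document is left for you to work out.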
230
The last approach, using HTML::TreeBuilder, is the diametric opposite
of the first approach: The first approach involves just elementary Perl
and one regexp, whereas the TreeBuilder approach involves being at home
with the concept of tree-shaped data structures and modules with
object-oriented interfaces, as well as with the particular interfaces
that HTML::TreeBuilder and HTML::Element provide.
237
238 However, what the TreeBuilder approach has going for it is that it's
239 the most robust, because it involves dealing with HTML in its "native"
240 format -- it deals with the tree structure that HTML code represents,
241 without any consideration of how the source is coded and with what tags
242 omitted.
243
244 So, to extract the text from the "h1" elements of an HTML document:
245
246 sub get_heading {
247 my $tree = HTML::TreeBuilder->new;
248 $tree->parse_file($_[0]); # !
249 my $heading;
250 my $h1 = $tree->look_down('_tag', 'h1'); # !
251 if($h1) {
252 $heading = $h1->as_text; # !
253 } else {
254 warn "No heading in $_[0]?";
255 }
256 $tree->delete; # clear memory!
257 return $heading;
258 }
259
This uses some unfamiliar methods that need explaining.  The
"parse_file" method, which we've seen before, builds a tree based on
source from the file given.  The "delete" method is for marking a
tree's contents as available for garbage collection, when you're done
with the tree.  The "as_text" method returns a string that contains all
the text bits that are children (or otherwise descendants) of the given
node.  To get the text content of the $h1 object, we could instead just
say:
267
268 $heading = join '', $h1->content_list;
269
270 but that will work only if we're sure that the "h1" element's children
271 will be only text bits -- if the document contained:
272
273 <h1>Local Man Sees <cite>Blade</cite> Again</h1>
274
275 then the sub-tree would be:
276
    . h1
       . "Local Man Sees "
       . cite
          . "Blade"
       . " Again"
282
283 so "join '', $h1->content_list" will be something like:
284
285 Local Man Sees HTML::Element=HASH(0x15424040) Again
286
287 whereas "$h1->as_text" would yield:
288
289 Local Man Sees Blade Again
290
291 and depending on what you're doing with the heading text, you might
292 want the "as_HTML" method instead. It returns the (sub)tree repre‐
293 sented as HTML source. "$h1->as_HTML" would yield:
294
295 <h1>Local Man Sees <cite>Blade</cite> Again</h1>
296
297 However, if you wanted the contents of $h1 as HTML, but not the $h1
298 itself, you could say:
299
300 join '',
301 map(
302 ref($_) ? $_->as_HTML : $_,
303 $h1->content_list
304 )
305
This "map" iterates over the nodes in $h1's list of children; and for
each node that's just a text bit (as "Local Man Sees " is), it just
passes through that string value, and for each node that's an actual
object (causing "ref" to be true), "as_HTML" will be used instead of
the string value of the object itself (which would be something quite
useless, as most object values are).  So that "as_HTML" for the "cite"
element will be the string "<cite>Blade</cite>".  And then, finally,
"join" just puts into one string all the strings that the "map"
returns.
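To see the four variants side by side, here is a small runnable sketch
that parses the heading from a string and prints each result.  The
variable names are only for illustration, and the exact whitespace and
trailing newlines in the "as_HTML" output may differ between versions
of HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<h1>Local Man Sees <cite>Blade</cite> Again</h1>');
$tree->eof;
my $h1 = $tree->look_down('_tag', 'h1');

my $naive = join '', $h1->content_list;   # stringifies the "cite" object
my $text  = $h1->as_text;                 # text of all descendants
my $outer = $h1->as_HTML;                 # markup, including the h1 tags
my $inner = join '',                      # markup of the children only
    map { ref($_) ? $_->as_HTML : $_ } $h1->content_list;

print "naive: $naive\n";
print "text:  $text\n";
print "outer: $outer\n";
print "inner: $inner\n";

$tree->delete;
```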
315
Last but not least, the most important method in our "get_heading" sub
is the "look_down" method.  This method looks down at the subtree
starting at the given object (here, $tree), looking for elements that
meet criteria you provide.
320
321 The criteria are specified in the method's argument list. Each crite‐
322 rion can consist of two scalars, a key and a value, which express that
323 you want elements that have that attribute (like "_tag", or "src") with
324 the given value ("h1"); or the criterion can be a reference to a sub‐
325 routine that, when called on the given element, returns true if that is
326 a node you're looking for. If you specify several criteria, then
327 that's taken to mean that you want all the elements that each satisfy
328 all the criteria. (In other words, there's an "implicit AND".)
329
330 And finally, there's a bit of an optimization -- if you call the
331 "look_down" method in a scalar context, you get just the first node (or
332 undef if none) -- and, in fact, once "look_down" finds that first
333 matching element, it doesn't bother looking any further.
334
335 So the example:
336
337 $h1 = $tree->look_down('_tag', 'h1');
338
339 returns the first element at-or-under $tree whose "_tag" attribute has
340 the value "h1".
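The following sketch puts those rules together: scalar context versus
list context, and several criteria combined with the implicit AND.  The
three headings and their "class" values are invented for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(<<'END_HTML');
<h1 class="ad">Buy Now</h1>
<h1 class="story">First Story</h1>
<h1 class="story">Second Story</h1>
END_HTML
$tree->eof;

# Scalar context: only the first match, in document order.
my $first      = $tree->look_down('_tag', 'h1');
my $first_text = $first->as_text;

# List context: every match.
my @all   = $tree->look_down('_tag', 'h1');
my $count = @all;

# Several criteria are ANDed: tag name, attribute value, and a sub.
my @stories = $tree->look_down(
    '_tag',  'h1',
    'class', 'story',
    sub { $_[0]->as_text =~ m/Second/ },
);
my $story_count = @stories;
my $story_text  = $stories[0]->as_text;

print "first: $first_text\n";
print "h1 count: $count\n";
print "filtered: $story_text\n";

$tree->delete;
```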
341
342 Complex Criteria in Tree Scanning
343
344 Now, the above "look_down" code looks like a lot of bother, with barely
345 more benefit than just grepping the file! But consider if your crite‐
346 ria were more complicated -- suppose you found that some of the press
347 releases that you were scanning had several "h1" elements, possibly
348 before or after the one you actually want. For example:
349
350 <h1><center>Visit Our Corporate Partner
351 <br><a href="/dyna/clickthru"
352 ><img src="/dyna/vend_ad"></a>
353 </center></h1>
354 <h1><center>ConGlomCo President Schreck to Visit Regional HQ
355 <br><a href="/photos/Schreck_visit_large.jpg"
356 ><img src="/photos/Schreck_visit.jpg"></a>
357 </center></h1>
358
359 Here, you want to ignore the first "h1" element because it contains an
360 ad, and you want the text from the second "h1". The problem is in for‐
361 malizing the way you know that it's an ad. Since ad banners are always
362 entreating you to "visit" the sponsoring site, you could exclude "h1"
363 elements that contain the word "visit" under them:
364
365 my $real_h1 = $tree->look_down(
366 '_tag', 'h1',
367 sub {
368 $_[0]->as_text !~ m/\bvisit/i
369 }
370 );
371
372 The first criterion looks for "h1" elements, and the second criterion
373 limits those to only the ones whose text content doesn't match
374 "m/\bvisit/". But unfortunately, that won't work for our example,
375 since the second "h1" mentions "ConGlomCo President Schreck to Visit
376 Regional HQ".
377
378 Instead you could try looking for the first "h1" element that doesn't
379 contain an image:
380
381 my $real_h1 = $tree->look_down(
382 '_tag', 'h1',
383 sub {
384 not $_[0]->look_down('_tag', 'img')
385 }
386 );
387
This criterion sub might seem a bit odd, since it calls "look_down" as
part of a larger "look_down" operation, but that's fine.  Note that
when considered as a boolean value, a "look_down" in a scalar context
returns false (specifically, undef) if there's no matching element at
or under the given element; and it returns the first matching element
(which, being a reference and object, is always a true value), if any
matches.  So, here,
395
396 sub {
397 not $_[0]->look_down('_tag', 'img')
398 }
399
400 means "return true only if this element has no 'img' element as descen‐
401 dants (and isn't an 'img' element itself)."
402
403 This correctly filters out the first "h1" that contains the ad, but it
404 also incorrectly filters out the second "h1" that contains a non-adver‐
405 tisement photo besides the headline text you want.
406
There clearly are detectable differences between the first and second
"h1" elements -- only the second one contains the string "Schreck", and
we could just test for that:
410
411 my $real_h1 = $tree->look_down(
412 '_tag', 'h1',
413 sub {
414 $_[0]->as_text =~ m{Schreck}
415 }
416 );
417
418 And that works fine for this one example, but unless all thousand of
419 your press releases have "Schreck" in the headline, that's just not a
420 general solution. However, if all the ads-in-"h1"s that you want to
421 exclude involve a link whose URL involves "/dyna/", then you can use
422 that:
423
424 my $real_h1 = $tree->look_down(
425 '_tag', 'h1',
426 sub {
427 my $link = $_[0]->look_down('_tag','a');
428 return 1 unless $link;
429 # no link means it's fine
430 return 0 if $link->attr('href') =~ m{/dyna/};
431 # a link to there is bad
432 return 1; # otherwise okay
433 }
434 );
435
436 Or you can look at it another way and say that you want the first "h1"
437 element that either contains no images, or else whose image has a "src"
438 attribute whose value contains "/photos/":
439
440 my $real_h1 = $tree->look_down(
441 '_tag', 'h1',
442 sub {
443 my $img = $_[0]->look_down('_tag','img');
444 return 1 unless $img;
445 # no image means it's fine
446 return 1 if $img->attr('src') =~ m{/photos/};
447 # good if a photo
448 return 0; # otherwise bad
449 }
450 );
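As a quick check, both of these criteria can be run against a trimmed
copy of the sample markup.  In this sketch the "center" elements are
left out for brevity, and a missing "href" or "src" is treated as an
empty string to avoid warnings; both are choices made for this example
rather than anything the approach requires.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(<<'END_HTML');
<h1>Visit Our Corporate Partner
 <br><a href="/dyna/clickthru"><img src="/dyna/vend_ad"></a></h1>
<h1>ConGlomCo President Schreck to Visit Regional HQ
 <br><a href="/photos/Schreck_visit_large.jpg"
 ><img src="/photos/Schreck_visit.jpg"></a></h1>
END_HTML
$tree->eof;

# First h1 that has no link, or whose first link avoids /dyna/.
my $by_link = $tree->look_down(
    '_tag', 'h1',
    sub {
        my $link = $_[0]->look_down('_tag', 'a');
        return 1 unless $link;
        my $href = $link->attr('href') || '';
        return $href !~ m{/dyna/};
    },
);

# First h1 that has no image, or whose first image is under /photos/.
my $by_img = $tree->look_down(
    '_tag', 'h1',
    sub {
        my $img = $_[0]->look_down('_tag', 'img');
        return 1 unless $img;
        my $src = $img->attr('src') || '';
        return $src =~ m{/photos/};
    },
);

my $link_text = $by_link ? $by_link->as_text : '';
my $img_text  = $by_img  ? $by_img->as_text  : '';
print "by link:  $link_text\n";
print "by image: $img_text\n";

$tree->delete;
```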
451
452 Recall that this use of "look_down" in a scalar context means to return
453 the first element at or under $tree that matches all the criteria. But
454 if you notice that you can formulate criteria that'll match several
455 possible "h1" elements, some of which may be bogus but the last one of
456 which is always the one you want, then you can use "look_down" in a
457 list context, and just use the last element of that list:
458
    my @h1s = $tree->look_down(
      '_tag', 'h1',
      # ...maybe more criteria...
    );
    die "What, no h1s here?" unless @h1s;
    my $real_h1 = $h1s[-1]; # last or only
465
466 A Case Study: Scanning Yahoo News's HTML
467
468 The above (somewhat contrived) case involves extracting data from a
469 bunch of pre-existing HTML files. In that sort of situation, if your
470 code works for all the files, then you know that the code works --
471 since the data it's meant to handle won't go changing or growing; and,
472 typically, once you've used the program, you'll never need to use it
473 again.
474
475 The other kind of situation faced in many data extraction tasks is
476 where the program is used recurringly to handle new data -- such as
477 from ever-changing Web pages. As a real-world example of this, con‐
478 sider a program that you could use (suppose it's crontabbed) to extract
479 headline-links from subsections of Yahoo News ("http://dai‐
480 lynews.yahoo.com/").
481
482 Yahoo News has several subsections:
483
484 http://dailynews.yahoo.com/h/tc/ for technology news
485 http://dailynews.yahoo.com/h/sc/ for science news
486 http://dailynews.yahoo.com/h/hl/ for health news
487 http://dailynews.yahoo.com/h/wl/ for world news
488 http://dailynews.yahoo.com/h/en/ for entertainment news
489
490 and others. All of them are built on the same basic HTML template --
491 and a scarily complicated template it is, especially when you look at
492 it with an eye toward making up rules that will select where the real
493 headline-links are, while screening out all the links to other parts of
494 Yahoo, other news services, etc. You will need to puzzle over the HTML
495 source, and scrutinize the output of "$tree->dump" on the parse tree of
496 that HTML.
497
498 Sometimes the only way to pin down what you're after is by position in
499 the tree. For example, headlines of interest may be in the third column
500 of the second row of the second table element in a page:
501
    my $table = ( $tree->look_down('_tag','table') )[1];
    my $row2  = ( $table->look_down('_tag', 'tr' ) )[1];
    my $col3  = ( $row2->look_down('_tag', 'td')   )[2];
    # ...then do things with $col3...
506
507 Or they may be all the links in a "p" element that has at least three
508 "br" elements as children:
509
510 my $p = $tree->look_down(
511 '_tag', 'p',
512 sub {
513 2 < grep { ref($_) and $_->tag eq 'br' }
514 $_[0]->content_list
515 }
516 );
517 @links = $p->look_down('_tag', 'a');
518
But almost always, you can get away with looking for properties of the
thing itself, rather than just looking for contexts.  Now, if you're
lucky, the document you're looking through has clear semantic tagging,
such as is useful in CSS -- note the class="headlinelink" bit here:
524
525 <a href="...long_news_url..." class="headlinelink">Elvis
526 seen in tortilla</a>
527
528 If you find anything like that, you could leap right in and select
529 links with:
530
531 @links = $tree->look_down('class','headlinelink');
532
533 Regrettably, your chances of seeing any sort of semantic markup princi‐
534 ples really being followed with actual HTML are pretty thin.
535
536 Footnote: In fact, your chances of finding a page that is simply
537 free of HTML errors are even thinner. And surprisingly, sites like
538 Amazon or Yahoo are typically worse as far as quality of code than
539 personal sites whose entire production cycle involves simply being
540 saved and uploaded from Netscape Composer.
541
542 The code may be sort of "accidentally semantic", however -- for exam‐
543 ple, in a set of pages I was scanning recently, I found that looking
544 for "td" elements with a "width" attribute value of "375" got me
545 exactly what I wanted. No-one designing that page ever conceived of
546 "width=375" as meaning "this is a headline", but if you impute it to
547 mean that, it works.
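A lookup of that "accidentally semantic" kind might look like the
following sketch.  The table markup is invented for illustration; only
the idea of matching "td" elements on a "width" value of "375" comes
from the pages described above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(<<'END_HTML');
<table>
 <tr><td width="120">Navigation</td>
     <td width="375">Elvis seen in tortilla</td></tr>
 <tr><td width="120">More links</td>
     <td width="375">Local Man Sees Blade Again</td></tr>
</table>
END_HTML
$tree->eof;

# Two criteria, implicitly ANDed: the tag name and an attribute value.
my @headlines = map { $_->as_text }
    $tree->look_down('_tag', 'td', 'width', '375');

print "$_\n" for @headlines;

$tree->delete;
```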
548
549 An approach like this happens to work for the Yahoo News code, because
550 the headline-links are distinguished by the fact that they (and they
551 alone) contain a "b" element:
552
553 <a href="...long_news_url..."><b>Elvis seen in tortilla</b></a>
554
555 or, diagrammed as a part of the parse tree:
556
557 . a [href="...long_news_url..."]
558 . b
559 . "Elvis seen in tortilla"
560
A rule that matches these can be formalized as "look for any 'a'
element that has only one daughter node, which must be a 'b' element".
And this is what it looks like when cooked up as a "look_down"
expression and prefaced with a bit of code that retrieves the text of
the given Yahoo News page and feeds it to TreeBuilder:
566
    use strict;
    use HTML::TreeBuilder 2.97;
    use LWP::UserAgent;
    use HTTP::Request;
    sub get_headlines {
      my $url = $_[0] || die "What URL?";

      my $response = LWP::UserAgent->new->request(
        HTTP::Request->new( GET => $url )
      );
      unless($response->is_success) {
        warn "Couldn't get $url: ", $response->status_line, "\n";
        return;
      }

      my $tree = HTML::TreeBuilder->new();
      $tree->parse($response->content);
      $tree->eof;

      my @out;
      foreach my $link (
        $tree->look_down(   # !
          '_tag', 'a',
          sub {
            return unless $_[0]->attr('href');
            my @c = $_[0]->content_list;
            @c == 1 and ref $c[0] and $c[0]->tag eq 'b';
          }
        )
      ) {
        push @out, [ $link->attr('href'), $link->as_text ];
      }

      warn "Odd, fewer than 6 stories in $url!" if @out < 6;
      $tree->delete;
      return @out;
    }
603
604 ...and add a bit of code to actually call that routine and display the
605 results...
606
607 foreach my $section (qw[tc sc hl wl en]) {
608 my @links = get_headlines(
609 "http://dailynews.yahoo.com/h/$section/"
610 );
611 print
612 $section, ": ", scalar(@links), " stories\n",
613 map((" ", $_->[0], " : ", $_->[1], "\n"), @links),
614 "\n";
615 }
616
And we've got our own headline-extractor service!  This in and of
itself isn't amazingly useful (since if you want to see the headlines,
you can just look at the Yahoo News pages), but it could easily be the
basis for quite useful features like filtering the headlines for
certain keywords of interest to you.
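Such a keyword filter could be as small as the sketch below.  The sub
name "filter_headlines", the keywords, and the stand-in data are all
hypothetical; the only thing taken from the code above is the shape of
the data, a list of [URL, title] pairs as "get_headlines" returns.

```perl
use strict;
use warnings;

# Keep only the [url, title] pairs whose title contains a keyword.
sub filter_headlines {
    my ($keywords, @links) = @_;
    my $pattern = join '|', map { quotemeta } @$keywords;
    return unless length $pattern;        # no keywords, so no matches
    return grep { $_->[1] =~ m/\b(?:$pattern)\b/i } @links;
}

# Stand-in data, shaped like the return value of get_headlines().
my @links = (
    [ 'http://example.com/1', 'Elvis seen in tortilla' ],
    [ 'http://example.com/2', 'New Perl release announced' ],
    [ 'http://example.com/3', 'Local man sees Blade again' ],
);

my @wanted = filter_headlines( [qw(perl elvis)], @links );
print "$_->[0] : $_->[1]\n" for @wanted;
```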
622
623 Now, one of these days, Yahoo News will decide to change its HTML tem‐
624 plate. When this happens, this will appear to the above program as
625 there being no links that meet the given criteria; or, less likely,
626 dozens of erroneous links will meet the criteria. In either case, the
627 criteria will have to be changed for the new template; they may just
628 need adjustment, or you may need to scrap them and start over.
629
630 Regardez, duvet!
631
632 It's often quite a challenge to write criteria to match the desired
633 parts of an HTML parse tree. Very often you can pull it off with a
634 simple "$tree->look_down('_tag', 'h1')", but sometimes you do have to
635 keep adding and refining criteria, until you might end up with complex
636 filters like what I've shown in this article. The benefit to learning
637 how to deal with HTML parse trees is that one main search tool, the
638 "look_down" method, can do most of the work, making simple things easy,
639 while still making hard things possible.
640
641 [end body of article]
642
643 [Author Credit]
644
645 Sean M. Burke ("sburke@cpan.org") is the current maintainer of
646 "HTML::TreeBuilder" and "HTML::Element", both originally by Gisle Aas.
647
648 Sean adds: "I'd like to thank the folks who listened to me ramble
649 incessantly about HTML::TreeBuilder and HTML::Element at this year's
650 Yet Another Perl Conference and O'Reilly Open Source Software Conven‐
651 tion."
652
654 Return to the HTML::Tree docs.
655
656
657
perl v5.8.8                       2006-08-04           HTML::Tree::Scanning(3)