HTML::Tree::Scanning(3)  User Contributed Perl Documentation  HTML::Tree::Scanning(3)

6 HTML::Tree::Scanning -- article: "Scanning HTML"
7
# This is an article, not a module.
10
12 The following article by Sean M. Burke first appeared in The Perl
13 Journal #19 and is copyright 2000 The Perl Journal. It appears courtesy
14 of Jon Orwant and The Perl Journal. This document may be distributed
15 under the same terms as Perl itself.
16
18 -- Sean M. Burke
19
20 In The Perl Journal issue 17, Ken MacFarlane's article "Parsing HTML
21 with HTML::Parser" describes how the HTML::Parser module scans HTML
22 source as a stream of start-tags, end-tags, text, comments, etc. In
23 TPJ #18, my "Trees" article kicked around the idea of tree-shaped data
24 structures. Now I'll try to tie it together, in a discussion of HTML
25 trees.
26
27 The CPAN module HTML::TreeBuilder takes the tags that HTML::Parser
28 picks out, and builds a parse tree -- a tree-shaped network of
29 objects...
30
31 Footnote: And if you need a quick explanation of objects, see my
32 TPJ17 article "A User's View of Object-Oriented Modules"; or go
33 whole hog and get Damian Conway's excellent book Object-Oriented
34 Perl, from Manning Publications.
35
36 ...representing the structured content of the HTML document. And once
37 the document is parsed as a tree, you'll find the common tasks of
38 extracting data from that HTML document/tree to be quite
39 straightforward.
40
41 HTML::Parser, HTML::TreeBuilder, and HTML::Element
42 You use HTML::TreeBuilder to make a parse tree out of an HTML source
43 file, by simply saying:
44
45 use HTML::TreeBuilder;
46 my $tree = HTML::TreeBuilder->new();
47 $tree->parse_file('foo.html');
48
49 and then $tree contains a parse tree built from the HTML source from
50 the file "foo.html". The way this parse tree is represented is with a
51 network of objects -- $tree is the root, an element with tag-name
52 "html", and its children typically include a "head" and "body" element,
53 and so on. Elements in the tree are objects of the class
54 HTML::Element.
55
56 So, if you take this source:
57
58 <html><head><title>Doc 1</title></head>
59 <body>
60 Stuff <hr> 2000-08-17
61 </body></html>
62
63 and feed it to HTML::TreeBuilder, it'll return a tree of objects that
64 looks like this:
65
             html
           /      \
       head        body
      /          /   |   \
   title    "Stuff"  hr  "2000-08-17"
     |
  "Doc 1"
73
This is a pretty simple document, but if it were any more complex, it'd
be a bit hard to draw in that style, since it'd sprawl left and right.
The same tree can be represented a bit more easily sideways, with
indenting:
78
. html
   . head
      . title
         . "Doc 1"
   . body
      . "Stuff"
      . hr
      . "2000-08-17"
87
88 Either way expresses the same structure. In that structure, the root
89 node is an object of the class HTML::Element
90
91 Footnote: Well actually, the root is of the class
92 HTML::TreeBuilder, but that's just a subclass of HTML::Element,
93 plus the few extra methods like "parse_file" that elaborate the
94 tree
95
, with the tag name "html", and with two children: HTML::Element
objects whose tag names are "head" and "body". Each of those
elements has children, and so on down. Not all elements (as we'll
call the objects of class HTML::Element) have children -- the "hr"
element doesn't. And note that not all nodes in the tree are elements
-- the text nodes ("Doc 1", "Stuff", and "2000-08-17") are just strings.
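
The distinction between element nodes and text nodes can be seen by
walking the tree by hand. The following is only a minimal sketch: the
"walk" subroutine and the @lines array are names made up for this
illustration, and the source is fed in with "parse" and "eof" from a
string rather than from a file. It relies on "ref" being true for
elements and false for plain strings.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(
    q{<html><head><title>Doc 1</title></head>}
  . q{<body>Stuff <hr> 2000-08-17</body></html>}
);
$tree->eof;

my @lines;   # one entry per node, indented by depth

# Hypothetical helper: records elements by tag name, text nodes in quotes.
sub walk {
    my ($node, $depth) = @_;
    if (ref $node) {                       # an HTML::Element object
        push @lines, ('  ' x $depth) . $node->tag;
        walk($_, $depth + 1) for $node->content_list;
    } else {                               # a plain string (text node)
        push @lines, ('  ' x $depth) . qq{"$node"};
    }
}

walk($tree, 0);
print "$_\n" for @lines;
$tree->delete;
```

The printed outline should resemble the sideways diagram above, though
the exact whitespace inside the text nodes depends on how TreeBuilder
compacts it.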
102
103 Objects of the class HTML::Element each have three noteworthy
104 attributes:
105
106 "_tag" -- (best accessed as "$e->tag") this element's tag-name,
107 lowercased (e.g., "em" for an "em" element).
Footnote: Yes, this is misnamed. In proper SGML terminology,
this is instead called a "GI", short for "generic identifier";
and the term "tag" is used for a token of SGML source that
represents either the start of an element (a start-tag like
"<em lang='fr'>") or the end of an element (an end-tag like
"</em>"). However, since more people claim to have been
abducted by aliens than to have ever seen the SGML standard,
and since both encounters typically involve a feeling of
"missing time", it's not surprising that the terminology of the
SGML standard is not closely followed.
118
"_parent" -- (best accessed as "$e->parent") the element that is $e's
parent, or undef if this element is the root of its tree.
"_content" -- (best accessed as "$e->content_list") the list of nodes
(i.e., elements or text segments) that are $e's children.
123
124 Moreover, if an element object has any attributes in the SGML sense of
125 the word, then those are readable as "$e->attr('name')" -- for example,
126 with the object built from having parsed "<a id='foo'>bar</a>",
127 "$e->attr('id')" will return the string "foo". Moreover, "$e->tag" on
128 that object returns the string "a", "$e->content_list" returns a list
129 consisting of just the single scalar "bar", and "$e->parent" returns
130 the object that's this node's parent -- which may be, for example, a
131 "p" element.
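
As a small illustration of those accessors, here is a sketch that
parses the fragment mentioned above, wrapped in a "p" element so that
the parent is predictable. The variable names are arbitrary, and the
"a" element is located with "look_down", a method explained later in
this article.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(q{<p><a id='foo'>bar</a></p>});
$tree->eof;

my $e = $tree->look_down('_tag', 'a');   # the "a" element

my $tag        = $e->tag;            # the element's tag name
my $id         = $e->attr('id');     # an attribute in the SGML sense
my @content    = $e->content_list;   # child nodes: here one text string
my $parent_tag = $e->parent->tag;    # tag name of the enclosing element

print "tag=$tag id=$id content=@content parent=$parent_tag\n";
$tree->delete;
```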
132
133 And that's all that there is to it -- you throw HTML source at
134 TreeBuilder, and it returns a tree built of HTML::Element objects and
135 some text strings.
136
137 However, what do you do with a tree of objects? People code
138 information into HTML trees not for the fun of arranging elements, but
139 to represent the structure of specific text and images -- some text is
140 in this "li" element, some other text is in that heading, some images
141 are in that other table cell that has those attributes, and so on.
142
143 Now, it may happen that you're rendering that whole HTML tree into some
144 layout format. Or you could be trying to make some systematic change
145 to the HTML tree before dumping it out as HTML source again. But, in
146 my experience, by far the most common programming task that Perl
147 programmers face with HTML is in trying to extract some piece of
148 information from a larger document. Since that's so common (and also
149 since it involves concepts that are basic to more complex tasks), that
150 is what the rest of this article will be about.
151
152 Scanning HTML trees
153 Suppose you have a thousand HTML documents, each of them a press
154 release. They all start out:
155
156 [...lots of leading images and junk...]
157 <h1>ConGlomCo to Open New Corporate Office in Ougadougou</h1>
158 BAKERSFIELD, CA, 2000-04-24 -- ConGlomCo's vice president in charge
159 of world conquest, Rock Feldspar, announced today the opening of a
160 new office in Ougadougou, the capital city of Burkino Faso, gateway
161 to the bustling "Silicon Sahara" of Africa...
162 [...etc...]
163
164 ...and what you've got to do is, for each document, copy whatever text
165 is in the "h1" element, so that you can, for example, make a table of
166 contents of it. Now, there are three ways to do this:
167
168 · You can just use a regexp to scan the file for a text pattern.
169
170 For many very simple tasks, this will do fine. Many HTML documents
171 are, in practice, very consistently formatted as far as placement
172 of linebreaks and whitespace, so you could just get away with
173 scanning the file like so:
174
  sub get_heading {
      my $filename = $_[0];
      local *HTML;
      open(HTML, $filename)
        or die "Couldn't open $filename: $!";
      my $heading;
     Line:
      while(<HTML>) {
          if( m{<h1>(.*?)</h1>}i ) {  # match it!
              $heading = $1;
              last Line;
          }
      }
      close(HTML);
      warn "No heading in $filename?"
        unless defined $heading;
      return $heading;
  }
193
This is quick, but awfully fragile -- if there's a newline in the
middle of a heading's text, it won't match the above regexp, and
you'll get a "No heading" warning and an undefined result. The regexp
will also fail if the "h1" element's start-tag has any attributes. If
you have to adapt your code to fit more kinds of start-tags, you'll
end up basically reinventing part of HTML::Parser, at which point you
should probably just stop, and use HTML::Parser itself:
201
· You can use HTML::Parser to scan the file for an "h1" start-tag
token, then capture all the text tokens until the "h1" close-tag.
This approach is extensively covered in Ken MacFarlane's TPJ17
article "Parsing HTML with HTML::Parser". (A variant of this
approach is to use HTML::TokeParser, which presents a different and
rather handier interface to the tokens that HTML::Parser picks
out.)
209
Using HTML::Parser is less fragile than our first approach, since
it's not sensitive to the exact internal formatting of the start-tag
(much less whether it's split across two lines). However, when
you need more information about the context of the "h1" element, or
if you're having to deal with any of the tricky bits of HTML, such
as parsing of tables, you'll find that the flat list of tokens that
HTML::Parser returns isn't immediately useful. To get something
useful out of those tokens, you'll need to write code that knows
some things about what elements take no content (as with "hr"
elements), and that "</p>" end-tags are omissible, so a "<p>"
will end any currently open paragraph -- and you're well on your
way to pointlessly reinventing much of the code in
HTML::TreeBuilder
223
224 Footnote: And, as the person who last rewrote that module, I
225 can attest that it wasn't terribly easy to get right! Never
226 underestimate the perversity of people coding HTML.
227
228 , at which point you should probably just stop, and use
229 HTML::TreeBuilder itself:
230
· You can use HTML::TreeBuilder, and scan the tree of element objects
that you get back.
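
To make the fragility of the first approach concrete, here is a small
sketch that applies the same regexp, line by line, to three variants
of a heading. The "heading_by_regexp" helper and the sample strings
are made up for this illustration; the helper works on a string
instead of a file so that it can be run as-is.

```perl
use strict;
use warnings;

# Hypothetical helper: mimics the while(<HTML>) loop on an in-memory string.
sub heading_by_regexp {
    my ($source) = @_;
    for my $line (split /^/, $source) {      # one chunk per line
        return $1 if $line =~ m{<h1>(.*?)</h1>}i;
    }
    return;                                  # nothing matched
}

my $one_line   = heading_by_regexp("<h1>Quarterly Results</h1>\n");
my $split_line = heading_by_regexp("<h1>Quarterly\nResults</h1>\n");
my $with_attr  = heading_by_regexp(
    qq{<h1 align="center">Quarterly Results</h1>\n}
);

for my $case ( [ 'one line',       $one_line   ],
               [ 'split heading',  $split_line ],
               [ 'with attribute', $with_attr  ] ) {
    my ($label, $heading) = @$case;
    print "$label: ", ( defined $heading ? $heading : '(no match)' ), "\n";
}
```

Only the first variant is matched; the other two are perfectly
ordinary HTML that the regexp simply doesn't see.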
233
The last approach, using HTML::TreeBuilder, is the diametric opposite
of the first approach: The first approach involves just elementary
Perl and one regexp, whereas the TreeBuilder approach involves being
at home with the concept of tree-shaped data structures and modules
with object-oriented interfaces, as well as with the particular
interfaces that HTML::TreeBuilder and HTML::Element provide.
240
241 However, what the TreeBuilder approach has going for it is that it's
242 the most robust, because it involves dealing with HTML in its "native"
243 format -- it deals with the tree structure that HTML code represents,
244 without any consideration of how the source is coded and with what tags
245 omitted.
246
247 So, to extract the text from the "h1" elements of an HTML document:
248
  sub get_heading {
      my $tree = HTML::TreeBuilder->new;
      $tree->parse_file($_[0]);   # !
      my $heading;
      my $h1 = $tree->look_down('_tag', 'h1');  # !
      if($h1) {
          $heading = $h1->as_text;   # !
      } else {
          warn "No heading in $_[0]?";
      }
      $tree->delete; # clear memory!
      return $heading;
  }
262
This uses some unfamiliar methods that need explaining. The
"parse_file" method, which we've seen before, builds a tree based on
source from the file given. The "delete" method is for marking a
tree's contents as available for garbage collection, when you're done
with the tree. The "as_text" method returns a string that contains all
the text bits that are children (or otherwise descendants) of the given
node. To get the text content of the $h1 object, we could just say:
270
271 $heading = join '', $h1->content_list;
272
273 but that will work only if we're sure that the "h1" element's children
274 will be only text bits -- if the document contained:
275
276 <h1>Local Man Sees <cite>Blade</cite> Again</h1>
277
278 then the sub-tree would be:
279
. h1
   . "Local Man Sees "
   . cite
      . "Blade"
   . " Again"
285
286 so "join '', $h1->content_list" will be something like:
287
288 Local Man Sees HTML::Element=HASH(0x15424040) Again
289
290 whereas "$h1->as_text" would yield:
291
292 Local Man Sees Blade Again
293
294 and depending on what you're doing with the heading text, you might
295 want the "as_HTML" method instead. It returns the (sub)tree
296 represented as HTML source. "$h1->as_HTML" would yield:
297
298 <h1>Local Man Sees <cite>Blade</cite> Again</h1>
299
300 However, if you wanted the contents of $h1 as HTML, but not the $h1
301 itself, you could say:
302
  join '',
    map(
      ref($_) ? $_->as_HTML : $_,
      $h1->content_list
    )
308
This "map" iterates over the nodes in $h1's list of children; and for
each node that's just a text bit (as "Local Man Sees " is), it just
passes through that string value, and for each node that's an actual
object (causing "ref" to be true), "as_HTML" will be used instead of
the string value of the object itself (which would be something quite
useless, as most object values are). So that "as_HTML" for the "cite"
element will be the string "<cite>Blade</cite>". And then, finally,
"join" just puts into one string all the strings that the "map"
returns.
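
The four ways of getting at the heading's content can be compared side
by side. This is only a sketch, with the source fed in from a string
and with variable names chosen for the illustration; the exact
whitespace that "as_HTML" emits may vary between versions of the
module, so no exact output is claimed for it.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(q{<h1>Local Man Sees <cite>Blade</cite> Again</h1>});
$tree->eof;

my $h1 = $tree->look_down('_tag', 'h1');

# Naive join: the "cite" object stringifies uselessly.
my $naive = join '', $h1->content_list;

# All text at or under $h1, markup stripped.
my $text = $h1->as_text;

# The element itself, as HTML source.
my $html = $h1->as_HTML;

# Only the contents of $h1, as HTML source.
my $inner = join '', map( ref($_) ? $_->as_HTML : $_, $h1->content_list );

print "naive: $naive\n", "text:  $text\n", "html:  $html\n", "inner: $inner\n";
$tree->delete;
```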
318
Last but not least, the most important method in our "get_heading" sub
is the "look_down" method. This method looks down at the subtree
starting at the given object ($tree), looking for elements that meet
criteria you provide.
323
324 The criteria are specified in the method's argument list. Each
325 criterion can consist of two scalars, a key and a value, which express
326 that you want elements that have that attribute (like "_tag", or "src")
327 with the given value ("h1"); or the criterion can be a reference to a
328 subroutine that, when called on the given element, returns true if that
329 is a node you're looking for. If you specify several criteria, then
330 that's taken to mean that you want all the elements that each satisfy
331 all the criteria. (In other words, there's an "implicit AND".)
332
333 And finally, there's a bit of an optimization -- if you call the
334 "look_down" method in a scalar context, you get just the first node (or
335 undef if none) -- and, in fact, once "look_down" finds that first
336 matching element, it doesn't bother looking any further.
337
338 So the example:
339
340 $h1 = $tree->look_down('_tag', 'h1');
341
342 returns the first element at-or-under $tree whose "_tag" attribute has
343 the value "h1".
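
The kinds of criteria, the implicit AND, and the difference between
scalar and list context can all be tried on a tiny document. The
markup and the "class" values below are invented for this sketch.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(
    q{<body><h1 class="ad">Buy now</h1>}
  . q{<h1 class="news">First story</h1>}
  . q{<h1 class="news">Second story</h1></body>}
);
$tree->eof;

# List context: every matching element.
my @all  = $tree->look_down('_tag', 'h1');

# Two key/value criteria, both of which must hold (implicit AND).
my @news = $tree->look_down('_tag', 'h1', 'class', 'news');

# Scalar context: just the first match.
my $first = $tree->look_down('_tag', 'h1', 'class', 'news');

# A key/value criterion combined with a subroutine criterion.
my $second = $tree->look_down(
    '_tag', 'h1',
    sub { $_[0]->as_text =~ m/^Second/ }
);

# No match in scalar context gives undef.
my $none = $tree->look_down('_tag', 'h2');

my ($all_count, $news_count) = (scalar @all, scalar @news);
my ($first_text, $second_text) = ($first->as_text, $second->as_text);

print "$all_count h1s, $news_count news h1s\n";
print "first news: $first_text\nby sub: $second_text\n";
print "h2: ", ( defined $none ? 'found' : 'none' ), "\n";
$tree->delete;
```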
344
345 Complex Criteria in Tree Scanning
346 Now, the above "look_down" code looks like a lot of bother, with barely
347 more benefit than just grepping the file! But consider if your
348 criteria were more complicated -- suppose you found that some of the
349 press releases that you were scanning had several "h1" elements,
350 possibly before or after the one you actually want. For example:
351
352 <h1><center>Visit Our Corporate Partner
353 <br><a href="/dyna/clickthru"
354 ><img src="/dyna/vend_ad"></a>
355 </center></h1>
356 <h1><center>ConGlomCo President Schreck to Visit Regional HQ
357 <br><a href="/photos/Schreck_visit_large.jpg"
358 ><img src="/photos/Schreck_visit.jpg"></a>
359 </center></h1>
360
361 Here, you want to ignore the first "h1" element because it contains an
362 ad, and you want the text from the second "h1". The problem is in
363 formalizing the way you know that it's an ad. Since ad banners are
364 always entreating you to "visit" the sponsoring site, you could exclude
365 "h1" elements that contain the word "visit" under them:
366
  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      $_[0]->as_text !~ m/\bvisit/i
    }
  );
373
374 The first criterion looks for "h1" elements, and the second criterion
375 limits those to only the ones whose text content doesn't match
376 "m/\bvisit/". But unfortunately, that won't work for our example,
377 since the second "h1" mentions "ConGlomCo President Schreck to Visit
378 Regional HQ".
379
380 Instead you could try looking for the first "h1" element that doesn't
381 contain an image:
382
  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      not $_[0]->look_down('_tag', 'img')
    }
  );
389
This criterion sub might seem a bit odd, since it calls "look_down" as
part of a larger "look_down" operation, but that's fine. Note that
when considered as a boolean value, a "look_down" in a scalar context
returns false (specifically, undef) if there's no matching
element at or under the given element; and it returns the first
matching element (which, being a reference and object, is always a true
value), if any matches. So, here,
397
  sub {
    not $_[0]->look_down('_tag', 'img')
  }
401
402 means "return true only if this element has no 'img' element as
403 descendants (and isn't an 'img' element itself)."
404
405 This correctly filters out the first "h1" that contains the ad, but it
406 also incorrectly filters out the second "h1" that contains a non-
407 advertisement photo besides the headline text you want.
408
There clearly are detectable differences between the first and second
"h1" elements -- only the second one contains the string "Schreck", and
we could just test for that:
412
  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      $_[0]->as_text =~ m{Schreck}
    }
  );
419
420 And that works fine for this one example, but unless all thousand of
421 your press releases have "Schreck" in the headline, that's just not a
422 general solution. However, if all the ads-in-"h1"s that you want to
423 exclude involve a link whose URL involves "/dyna/", then you can use
424 that:
425
  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      my $link = $_[0]->look_down('_tag','a');
      return 1 unless $link;
        # no link means it's fine
      return 0 if $link->attr('href') =~ m{/dyna/};
        # a link to there is bad
      return 1; # otherwise okay
    }
  );
437
438 Or you can look at it another way and say that you want the first "h1"
439 element that either contains no images, or else whose image has a "src"
440 attribute whose value contains "/photos/":
441
  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      my $img = $_[0]->look_down('_tag','img');
      return 1 unless $img;
        # no image means it's fine
      return 1 if $img->attr('src') =~ m{/photos/};
        # good if a photo
      return 0; # otherwise bad
    }
  );
453
454 Recall that this use of "look_down" in a scalar context means to return
455 the first element at or under $tree that matches all the criteria. But
456 if you notice that you can formulate criteria that'll match several
457 possible "h1" elements, some of which may be bogus but the last one of
458 which is always the one you want, then you can use "look_down" in a
459 list context, and just use the last element of that list:
460
  my @h1s = $tree->look_down(
    '_tag', 'h1',
    ...maybe more criteria...
  );
  die "What, no h1s here?" unless @h1s;
  my $real_h1 = $h1s[-1]; # last or only
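
Two of these strategies can be tried against a cut-down version of the
sample headings. This is a sketch under stated assumptions: the markup
below drops the "center" elements and the line breaks of the original
sample, the $not_an_ad name is made up, and the "//" operator (Perl
5.10 or later) is used to guard against a link with no "href".

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(
    q{<h1>Visit Our Corporate Partner<br>}
  . q{<a href="/dyna/clickthru"><img src="/dyna/vend_ad"></a></h1>}
  . q{<h1>ConGlomCo President Schreck to Visit Regional HQ<br>}
  . q{<a href="/photos/Schreck_visit_large.jpg">}
  . q{<img src="/photos/Schreck_visit.jpg"></a></h1>}
);
$tree->eof;

# Criterion: reject any h1 containing a link into /dyna/.
my $not_an_ad = sub {
    my $link = $_[0]->look_down('_tag', 'a');
    return 1 unless $link;                               # no link is fine
    return 0 if ( $link->attr('href') // '' ) =~ m{/dyna/};
    return 1;
};

# Strategy 1: first h1 that passes the criterion (scalar context).
my $real_h1   = $tree->look_down('_tag', 'h1', $not_an_ad);
my $real_text = $real_h1 ? $real_h1->as_text : undef;

# Strategy 2: take the last of all h1 elements (list context).
my @h1s = $tree->look_down('_tag', 'h1');
die "What, no h1s here?" unless @h1s;
my $h1_count  = scalar @h1s;
my $last_text = $h1s[-1]->as_text;

print "by criterion: $real_text\n", "last of $h1_count: $last_text\n";
$tree->delete;
```

On this sample both strategies pick the same heading; on real data
they may not, which is why the choice depends on what you know about
the documents.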
467
468 A Case Study: Scanning Yahoo News's HTML
469 The above (somewhat contrived) case involves extracting data from a
470 bunch of pre-existing HTML files. In that sort of situation, if your
471 code works for all the files, then you know that the code works --
472 since the data it's meant to handle won't go changing or growing; and,
473 typically, once you've used the program, you'll never need to use it
474 again.
475
476 The other kind of situation faced in many data extraction tasks is
477 where the program is used recurringly to handle new data -- such as
478 from ever-changing Web pages. As a real-world example of this,
479 consider a program that you could use (suppose it's crontabbed) to
480 extract headline-links from subsections of Yahoo News
481 ("http://dailynews.yahoo.com/").
482
483 Yahoo News has several subsections:
484
485 http://dailynews.yahoo.com/h/tc/ for technology news
486 http://dailynews.yahoo.com/h/sc/ for science news
487 http://dailynews.yahoo.com/h/hl/ for health news
488 http://dailynews.yahoo.com/h/wl/ for world news
489 http://dailynews.yahoo.com/h/en/ for entertainment news
490
491 and others. All of them are built on the same basic HTML template --
492 and a scarily complicated template it is, especially when you look at
493 it with an eye toward making up rules that will select where the real
494 headline-links are, while screening out all the links to other parts of
495 Yahoo, other news services, etc. You will need to puzzle over the HTML
496 source, and scrutinize the output of "$tree->dump" on the parse tree of
497 that HTML.
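
For reference, here is a sketch of capturing the output of "dump" in a
string rather than letting it go to standard output, which can be
handy when you want to search or save it. It assumes that "dump"
accepts a filehandle argument, as the HTML::Element documentation
describes; the in-memory filehandle and the sample markup are choices
made for this illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(
    q{<html><head><title>Doc 1</title></head>}
  . q{<body><h1>Hello</h1></body></html>}
);
$tree->eof;

# Write the dump into a scalar through an in-memory filehandle.
open my $fh, '>', \my $dump
  or die "Can't open in-memory filehandle: $!";
$tree->dump($fh);
close $fh;

print $dump;
$tree->delete;
```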
498
499 Sometimes the only way to pin down what you're after is by position in
500 the tree. For example, headlines of interest may be in the third column
501 of the second row of the second table element in a page:
502
  my $table = ( $tree->look_down('_tag','table') )[1];
  my $row2  = ( $table->look_down('_tag', 'tr' ) )[1];
  my $col3  = ( $row2->look_down('_tag', 'td')   )[2];
  ...then do things with $col3...
507
508 Or they may be all the links in a "p" element that has at least three
509 "br" elements as children:
510
  my $p = $tree->look_down(
    '_tag', 'p',
    sub {
      2 < grep { ref($_) and $_->tag eq 'br' }
               $_[0]->content_list
    }
  );
  @links = $p->look_down('_tag', 'a');
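
Both positional techniques can be tried on a small invented page. The
markup and cell contents below are made up for this sketch; note that
the list slices count from zero, so "[1]" is the second item and "[2]"
the third.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(
    q{<table><tr><td>nav</td></tr></table>}
  . q{<table>}
  . q{<tr><td>r1c1</td><td>r1c2</td><td>r1c3</td></tr>}
  . q{<tr><td>r2c1</td><td>r2c2</td><td>target</td></tr>}
  . q{</table>}
  . q{<p><a href="/a">A</a><br><a href="/b">B</a><br>}
  . q{<a href="/c">C</a><br></p>}
);
$tree->eof;

# Third column of the second row of the second table.
my $table = ( $tree->look_down('_tag', 'table') )[1];
my $row2  = ( $table->look_down('_tag', 'tr')   )[1];
my $col3  = ( $row2->look_down('_tag', 'td')    )[2];
my $cell_text = $col3->as_text;

# A "p" element with at least three "br" children, then its links.
my $p = $tree->look_down(
    '_tag', 'p',
    sub {
        2 < grep { ref($_) and $_->tag eq 'br' } $_[0]->content_list
    }
);
my @links = $p ? $p->look_down('_tag', 'a') : ();
my @hrefs = map { $_->attr('href') } @links;

print "cell: $cell_text\n", "links: @hrefs\n";
$tree->delete;
```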
519
But almost always, you can get away with looking for properties of
the thing itself, rather than just looking for contexts. Now, if
you're lucky, the document you're looking through has clear semantic
tagging, such as is useful in CSS -- note the class="headlinelink" bit
here:
525
526 <a href="...long_news_url..." class="headlinelink">Elvis
527 seen in tortilla</a>
528
529 If you find anything like that, you could leap right in and select
530 links with:
531
532 @links = $tree->look_down('class','headlinelink');
533
534 Regrettably, your chances of seeing any sort of semantic markup
535 principles really being followed with actual HTML are pretty thin.
536
537 Footnote: In fact, your chances of finding a page that is simply
538 free of HTML errors are even thinner. And surprisingly, sites like
539 Amazon or Yahoo are typically worse as far as quality of code than
540 personal sites whose entire production cycle involves simply being
541 saved and uploaded from Netscape Composer.
542
543 The code may be sort of "accidentally semantic", however -- for
544 example, in a set of pages I was scanning recently, I found that
545 looking for "td" elements with a "width" attribute value of "375" got
546 me exactly what I wanted. No-one designing that page ever conceived of
547 "width=375" as meaning "this is a headline", but if you impute it to
548 mean that, it works.
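
Selecting by an attribute value works the same way whether the
attribute was meant semantically or not. In this sketch the markup,
the URLs, and the widths are invented; only the "headlinelink" class
and the "375" width come from the examples above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(
    q{<table><tr>}
  . q{<td width="120"><a href="/about">About</a></td>}
  . q{<td width="375"><a href="/story1" class="headlinelink">}
  . q{Elvis seen in tortilla</a></td>}
  . q{</tr></table>}
);
$tree->eof;

# Deliberately semantic markup: select by class.
my @links = $tree->look_down('class', 'headlinelink');
my @hrefs = map { $_->attr('href') } @links;

# "Accidentally semantic" markup: select cells by their width.
my @cells = $tree->look_down('_tag', 'td', 'width', '375');
my @cell_texts = map { $_->as_text } @cells;

print "by class: @hrefs\n";
print "by width: $_\n" for @cell_texts;
$tree->delete;
```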
549
550 An approach like this happens to work for the Yahoo News code, because
551 the headline-links are distinguished by the fact that they (and they
552 alone) contain a "b" element:
553
554 <a href="...long_news_url..."><b>Elvis seen in tortilla</b></a>
555
556 or, diagrammed as a part of the parse tree:
557
. a  [href="...long_news_url..."]
   . b
      . "Elvis seen in tortilla"
561
A rule that matches these can be formalized as "look for any 'a'
element that has only one daughter node, which must be a 'b' element".
And this is what it looks like when cooked up as a "look_down"
expression and prefaced with a bit of code that retrieves the text of
the given Yahoo News page and feeds it to TreeBuilder:
567
  use strict;
  use HTML::TreeBuilder 2.97;
  use LWP::UserAgent;
  sub get_headlines {
    my $url = $_[0] || die "What URL?";

    my $response = LWP::UserAgent->new->request(
      HTTP::Request->new( GET => $url )
    );
    unless($response->is_success) {
      warn "Couldn't get $url: ", $response->status_line, "\n";
      return;
    }

    my $tree = HTML::TreeBuilder->new();
    $tree->parse($response->content);
    $tree->eof;

    my @out;
    foreach my $link (
      $tree->look_down(   # !
        '_tag', 'a',
        sub {
          return unless $_[0]->attr('href');
          my @c = $_[0]->content_list;
          @c == 1 and ref $c[0] and $c[0]->tag eq 'b';
        }
      )
    ) {
      push @out, [ $link->attr('href'), $link->as_text ];
    }

    warn "Odd, fewer than 6 stories in $url!" if @out < 6;
    $tree->delete;
    return @out;
  }
604
605 ...and add a bit of code to actually call that routine and display the
606 results...
607
  foreach my $section (qw[tc sc hl wl en]) {
    my @links = get_headlines(
      "http://dailynews.yahoo.com/h/$section/"
    );
    print
      $section, ": ", scalar(@links), " stories\n",
      map(("  ", $_->[0], " : ", $_->[1], "\n"), @links),
      "\n";
  }
617
And we've got our own headline-extractor service! This in and of
itself isn't so amazingly useful (since if you want to see the
headlines, you can just look at the Yahoo News pages), but it could
easily be the basis for quite useful features like filtering the
headlines for matching certain keywords of interest to you.
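
Such a keyword filter might look something like the following sketch.
It works on a list of [URL, headline] pairs of the shape that
"get_headlines" returns; the sample pairs, the example.com URLs, and
the keyword list are all invented for the illustration.

```perl
use strict;
use warnings;

# Sample data in the shape returned by get_headlines: [ URL, headline text ].
my @links = (
    [ 'http://example.com/1', 'Elvis seen in tortilla'     ],
    [ 'http://example.com/2', 'New Perl release announced' ],
    [ 'http://example.com/3', 'Tortilla prices rise'       ],
);

# Keywords of interest, matched as whole words, case-insensitively.
my @keywords = qw(perl elvis);
my $pattern  = join '|', map { quotemeta } @keywords;

my @matches = grep { $_->[1] =~ m/\b(?:$pattern)\b/i } @links;

print "  ", $_->[0], " : ", $_->[1], "\n" for @matches;
```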
623
624 Now, one of these days, Yahoo News will decide to change its HTML
625 template. When this happens, this will appear to the above program as
626 there being no links that meet the given criteria; or, less likely,
627 dozens of erroneous links will meet the criteria. In either case, the
628 criteria will have to be changed for the new template; they may just
629 need adjustment, or you may need to scrap them and start over.
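
When criteria do need re-tuning, it can help to try them against a
saved snippet of the page rather than the live site. The following is
a sketch only: the markup and URLs are invented, and the criterion is
the same one used in "get_headlines" above, applied to a string
instead of a fetched page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# A saved (here: invented) fragment in the style of the old template.
my $snippet =
    q{<p><a href="/h/tc/story1"><b>Elvis seen in tortilla</b></a> }
  . q{<a href="/help">Help</a> }
  . q{<a href="/h/tc/story2"><b>Second headline</b></a> }
  . q{<a name="top"><b>Not a link</b></a></p>};

my $tree = HTML::TreeBuilder->new;
$tree->parse($snippet);
$tree->eof;

my @out;
foreach my $link (
    $tree->look_down(
        '_tag', 'a',
        sub {
            return unless $_[0]->attr('href');    # must be a real link
            my @c = $_[0]->content_list;
            @c == 1 and ref $c[0] and $c[0]->tag eq 'b';
        }
    )
) {
    push @out, [ $link->attr('href'), $link->as_text ];
}
$tree->delete;

print scalar(@out), " headline links found\n";
print "  ", $_->[0], " : ", $_->[1], "\n" for @out;
```

If the count printed here drops to zero after a template change, the
criteria are what need attention, not the fetching code.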
630
631 Regardez, duvet!
632 It's often quite a challenge to write criteria to match the desired
633 parts of an HTML parse tree. Very often you can pull it off with a
634 simple "$tree->look_down('_tag', 'h1')", but sometimes you do have to
635 keep adding and refining criteria, until you might end up with complex
636 filters like what I've shown in this article. The benefit to learning
637 how to deal with HTML parse trees is that one main search tool, the
638 "look_down" method, can do most of the work, making simple things easy,
639 while still making hard things possible.
640
641 [end body of article]
642
643 [Author Credit]
644 Sean M. Burke ("sburke@cpan.org") is the current maintainer of
645 "HTML::TreeBuilder" and "HTML::Element", both originally by Gisle Aas.
646
647 Sean adds: "I'd like to thank the folks who listened to me ramble
648 incessantly about HTML::TreeBuilder and HTML::Element at this year's
649 Yet Another Perl Conference and O'Reilly Open Source Software
650 Convention."
651
653 Return to the HTML::Tree docs.
654
655
656
perl v5.12.2                      2010-12-20           HTML::Tree::Scanning(3)