HTML::Tree::Scanning(3)   User Contributed Perl Documentation   HTML::Tree::Scanning(3)
2
3
4
6 HTML::Tree::Scanning -- article: "Scanning HTML"
7
    # This is an article, not a module.
10
12 The following article by Sean M. Burke first appeared in The Perl
13 Journal #19 and is copyright 2000 The Perl Journal. It appears courtesy
14 of Jon Orwant and The Perl Journal. This document may be distributed
15 under the same terms as Perl itself.
16
17 (Note that this is discussed in chapters 6 through 10 of the book Perl
18 and LWP <http://lwp.interglacial.com/> which was written after the
19 following documentation, and which is available free online.)
20
22 -- Sean M. Burke
23
24 In The Perl Journal issue 17, Ken MacFarlane's article "Parsing HTML
25 with HTML::Parser" describes how the HTML::Parser module scans HTML
26 source as a stream of start-tags, end-tags, text, comments, etc. In
27 TPJ #18, my "Trees" article kicked around the idea of tree-shaped data
28 structures. Now I'll try to tie it together, in a discussion of HTML
29 trees.
30
31 The CPAN module HTML::TreeBuilder takes the tags that HTML::Parser
32 picks out, and builds a parse tree -- a tree-shaped network of
33 objects...
34
35 Footnote: And if you need a quick explanation of objects, see my
36 TPJ17 article "A User's View of Object-Oriented Modules"; or go
37 whole hog and get Damian Conway's excellent book Object-Oriented
38 Perl, from Manning Publications.
39
40 ...representing the structured content of the HTML document. And once
41 the document is parsed as a tree, you'll find the common tasks of
42 extracting data from that HTML document/tree to be quite
43 straightforward.
44
45 HTML::Parser, HTML::TreeBuilder, and HTML::Element
46 You use HTML::TreeBuilder to make a parse tree out of an HTML source
47 file, by simply saying:
48
    use HTML::TreeBuilder;
    my $tree = HTML::TreeBuilder->new();
    $tree->parse_file('foo.html');
52
53 and then $tree contains a parse tree built from the HTML source from
54 the file "foo.html". The way this parse tree is represented is with a
55 network of objects -- $tree is the root, an element with tag-name
56 "html", and its children typically include a "head" and "body" element,
57 and so on. Elements in the tree are objects of the class
58 HTML::Element.
59
60 So, if you take this source:
61
62 <html><head><title>Doc 1</title></head>
63 <body>
64 Stuff <hr> 2000-08-17
65 </body></html>
66
67 and feed it to HTML::TreeBuilder, it'll return a tree of objects that
68 looks like this:
69
                 html
               /      \
           head        body
          /          /   |   \
       title    "Stuff"  hr   "2000-08-17"
         |
      "Doc 1"
77
This is a pretty simple document, but if it were any more complex, it'd
be a bit hard to draw in that style, since it'd sprawl left and right.
The same tree can be represented a bit more easily sideways, with
indenting:
82
    . html
       . head
          . title
             . "Doc 1"
       . body
          . "Stuff"
          . hr
          . "2000-08-17"
91
92 Either way expresses the same structure. In that structure, the root
93 node is an object of the class HTML::Element
94
95 Footnote: Well actually, the root is of the class
96 HTML::TreeBuilder, but that's just a subclass of HTML::Element,
97 plus the few extra methods like "parse_file" that elaborate the
98 tree
99
, with the tag name "html", and with two children: HTML::Element
objects whose tag names are "head" and "body". And each of those
elements has children, and so on down. Not all elements (as we'll
call the objects of class HTML::Element) have children -- the "hr"
element doesn't. And note that not all nodes in the tree are elements --
the text nodes ("Doc 1", "Stuff", and "2000-08-17") are just strings.
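
To see the element-versus-string distinction for yourself, here is a minimal sketch that walks the tree for the sample document above and prints the same sideways outline. The "outline" routine and the @lines array are names made up for this illustration; "tag" and "content_list" are the HTML::Element accessors covered in the next few paragraphs.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<html><head><title>Doc 1</title></head>'
           . '<body>Stuff <hr> 2000-08-17</body></html>');
$tree->eof;

my @lines;   # collected outline, one entry per node

# Hypothetical helper: record elements by tag name, text nodes in quotes.
sub outline {
  my ($node, $depth) = @_;
  my $pad = '   ' x $depth;
  if (ref $node) {                      # an HTML::Element object
    push @lines, $pad . '. ' . $node->tag;
    outline($_, $depth + 1) for $node->content_list;
  } else {                              # a plain string, i.e. a text node
    push @lines, $pad . '. "' . $node . '"';
  }
}

outline($tree, 0);
print "$_\n" for @lines;
$tree->delete;
```

The only test that tells the two kinds of node apart is "ref": it is true for elements and false for text strings.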
106
107 Objects of the class HTML::Element each have three noteworthy
108 attributes:
109
110 "_tag" -- (best accessed as "$e->tag") this element's tag-name,
111 lowercased (e.g., "em" for an "em" element).
112 Footnote: Yes, this is misnamed. In proper SGML terminology,
113 this is instead called a "GI", short for "generic identifier";
114 and the term "tag" is used for a token of SGML source that
represents either the start of an element (a start-tag like
"<em lang='fr'>") or the end of an element (an end-tag like
"</em>"). However, since more people claim to have been
118 abducted by aliens than to have ever seen the SGML standard,
119 and since both encounters typically involve a feeling of
120 "missing time", it's not surprising that the terminology of the
121 SGML standard is not closely followed.

"_parent" -- (best accessed as "$e->parent") the element that is $e's
parent, or undef if this element is the root of its tree.
125 "_content" -- (best accessed as "$e->content_list") the list of nodes
126 (i.e., elements or text segments) that are $e's children.
127
128 Moreover, if an element object has any attributes in the SGML sense of
129 the word, then those are readable as "$e->attr('name')" -- for example,
130 with the object built from having parsed "<a id='foo'>bar</a>",
131 "$e->attr('id')" will return the string "foo". Moreover, "$e->tag" on
132 that object returns the string "a", "$e->content_list" returns a list
133 consisting of just the single scalar "bar", and "$e->parent" returns
134 the object that's this node's parent -- which may be, for example, a
135 "p" element.
136
137 And that's all that there is to it -- you throw HTML source at
138 TreeBuilder, and it returns a tree built of HTML::Element objects and
139 some text strings.
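
As a quick check of those accessors, here is a small sketch built around the "<a id='foo'>bar</a>" example. Wrapping the link in a "p" element is my own choice for the illustration, and "look_down" (explained later in this article) is used only as a convenient way to get hold of the "a" element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse("<p><a id='foo'>bar</a></p>");
$tree->eof;

# Find the "a" element; look_down is covered further on.
my $e = $tree->look_down('_tag', 'a');

my $tag      = $e->tag;            # the element's tag name
my $id       = $e->attr('id');     # value of the id attribute
my @children = $e->content_list;   # child nodes (here, a single string)
my $ptag     = $e->parent->tag;    # tag name of the parent element

print "tag=$tag id=$id children=@children parent=$ptag\n";
$tree->delete;
```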
140
141 However, what do you do with a tree of objects? People code
142 information into HTML trees not for the fun of arranging elements, but
143 to represent the structure of specific text and images -- some text is
144 in this "li" element, some other text is in that heading, some images
145 are in that other table cell that has those attributes, and so on.
146
147 Now, it may happen that you're rendering that whole HTML tree into some
148 layout format. Or you could be trying to make some systematic change
149 to the HTML tree before dumping it out as HTML source again. But, in
150 my experience, by far the most common programming task that Perl
151 programmers face with HTML is in trying to extract some piece of
152 information from a larger document. Since that's so common (and also
153 since it involves concepts that are basic to more complex tasks), that
154 is what the rest of this article will be about.
155
156 Scanning HTML trees
157 Suppose you have a thousand HTML documents, each of them a press
158 release. They all start out:
159
    [...lots of leading images and junk...]
    <h1>ConGlomCo to Open New Corporate Office in Ouagadougou</h1>
    BAKERSFIELD, CA, 2000-04-24 -- ConGlomCo's vice president in charge
    of world conquest, Rock Feldspar, announced today the opening of a
    new office in Ouagadougou, the capital city of Burkina Faso, gateway
    to the bustling "Silicon Sahara" of Africa...
    [...etc...]
167
168 ...and what you've got to do is, for each document, copy whatever text
169 is in the "h1" element, so that you can, for example, make a table of
170 contents of it. Now, there are three ways to do this:
171
172 • You can just use a regexp to scan the file for a text pattern.
173
174 For many very simple tasks, this will do fine. Many HTML documents
175 are, in practice, very consistently formatted as far as placement
176 of linebreaks and whitespace, so you could just get away with
177 scanning the file like so:
178
    sub get_heading {
      my $filename = $_[0];
      open(my $fh, '<', $filename)
        or die "Couldn't open $filename: $!";
      my $heading;
     Line:
      while(<$fh>) {
        if( m{<h1>(.*?)</h1>}i ) {  # match it!
          $heading = $1;
          last Line;
        }
      }
      close($fh);
      warn "No heading in $filename?"
       unless defined $heading;
      return $heading;
    }
197
This is quick to write and fast to run, but awfully fragile -- if
there's a newline in the middle of a heading's text, it won't match the
above regexp, and you'll get a "No heading" warning instead of the
heading. The regexp will also fail if the "h1"
201 element's start-tag has any attributes. If you have to adapt your
202 code to fit more kinds of start-tags, you'll end up basically
203 reinventing part of HTML::Parser, at which point you should
204 probably just stop, and use HTML::Parser itself:
205
206 • You can use HTML::Parser to scan the file for an "h1" start-tag
207 token, then capture all the text tokens until the "h1" close-tag.
This approach is extensively covered in Ken MacFarlane's TPJ17
209 article "Parsing HTML with HTML::Parser". (A variant of this
210 approach is to use HTML::TokeParser, which presents a different and
211 rather handier interface to the tokens that HTML::Parser picks
212 out.)
213
214 Using HTML::Parser is less fragile than our first approach, since
215 it's not sensitive to the exact internal formatting of the start-
216 tag (much less whether it's split across two lines). However, when
217 you need more information about the context of the "h1" element, or
218 if you're having to deal with any of the tricky bits of HTML, such
219 as parsing of tables, you'll find out the flat list of tokens that
220 HTML::Parser returns isn't immediately useful. To get something
221 useful out of those tokens, you'll need to write code that knows
222 some things about what elements take no content (as with "hr"
elements), and that "</p>" end-tags are omissible, so a "<p>"
224 will end any currently open paragraph -- and you're well on your
225 way to pointlessly reinventing much of the code in
226 HTML::TreeBuilder
227
228 Footnote: And, as the person who last rewrote that module, I
229 can attest that it wasn't terribly easy to get right! Never
230 underestimate the perversity of people coding HTML.
231
232 , at which point you should probably just stop, and use
233 HTML::TreeBuilder itself:
234
• You can use HTML::TreeBuilder, and scan the tree of element objects
236 that you get back.
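
For comparison, the token-scanning approach from the second item might be sketched with HTML::TokeParser roughly like this. The sample markup (with an attribute on the start-tag and a newline inside the heading, the two cases that defeat the regexp) is invented for the illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Invented sample: attribute on the start-tag, newline inside the heading.
my $html = qq{<h1 class="headline">ConGlomCo to Open\n}
         . qq{New Corporate Office</h1>\n<p>BAKERSFIELD, CA ...</p>\n};

my $parser = HTML::TokeParser->new(\$html)
  or die "Couldn't set up the parser";

my $heading;
if ( $parser->get_tag('h1') ) {
  # Collect text up to the h1 end-tag, with whitespace collapsed and trimmed.
  $heading = $parser->get_trimmed_text('/h1');
}
print defined $heading ? "$heading\n" : "No heading found\n";
```

This copes with the start-tag's attribute and the line break, but it still hands you a flat stream of tokens, with no notion of where the "h1" sits in the document.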
237
The last approach, using HTML::TreeBuilder, is the diametric opposite
of the first approach: The first approach involves just elementary Perl
240 and one regexp, whereas the TreeBuilder approach involves being at home
241 with the concept of tree-shaped data structures and modules with
242 object-oriented interfaces, as well as with the particular interfaces
243 that HTML::TreeBuilder and HTML::Element provide.
244
245 However, what the TreeBuilder approach has going for it is that it's
246 the most robust, because it involves dealing with HTML in its "native"
247 format -- it deals with the tree structure that HTML code represents,
248 without any consideration of how the source is coded and with what tags
249 omitted.
250
251 So, to extract the text from the "h1" elements of an HTML document:
252
    sub get_heading {
      my $tree = HTML::TreeBuilder->new;
      $tree->parse_file($_[0]);   # !
      my $heading;
      my $h1 = $tree->look_down('_tag', 'h1');  # !
      if($h1) {
        $heading = $h1->as_text;   # !
      } else {
        warn "No heading in $_[0]?";
      }
      $tree->delete; # clear memory!
      return $heading;
    }
266
This uses some unfamiliar methods that need explaining. The
"parse_file" method, which we've seen before, builds a tree based on
source from the file given. The "delete" method is for marking a
270 tree's contents as available for garbage collection, when you're done
271 with the tree. The "as_text" method returns a string that contains all
272 the text bits that are children (or otherwise descendants) of the given
273 node -- to get the text content of the $h1 object, we could just say:
274
275 $heading = join '', $h1->content_list;
276
277 but that will work only if we're sure that the "h1" element's children
278 will be only text bits -- if the document contained:
279
280 <h1>Local Man Sees <cite>Blade</cite> Again</h1>
281
282 then the sub-tree would be:
283
    . h1
       . "Local Man Sees "
       . cite
          . "Blade"
       . " Again"
289
290 so "join '', $h1->content_list" will be something like:
291
292 Local Man Sees HTML::Element=HASH(0x15424040) Again
293
294 whereas "$h1->as_text" would yield:
295
296 Local Man Sees Blade Again
297
298 and depending on what you're doing with the heading text, you might
299 want the "as_HTML" method instead. It returns the (sub)tree
300 represented as HTML source. "$h1->as_HTML" would yield:
301
302 <h1>Local Man Sees <cite>Blade</cite> Again</h1>
303
304 However, if you wanted the contents of $h1 as HTML, but not the $h1
305 itself, you could say:
306
    join '',
      map(
        ref($_) ? $_->as_HTML : $_,
        $h1->content_list
      )
312
313 This "map" iterates over the nodes in $h1's list of children; and for
314 each node that's just a text bit (as "Local Man Sees " is), it just
315 passes through that string value, and for each node that's an actual
object (causing "ref" to be true), "as_HTML" will be used instead of the
317 string value of the object itself (which would be something quite
318 useless, as most object values are). So that "as_HTML" for the "cite"
319 element will be the string "<cite>Blade</cite>". And then, finally,
320 "join" just puts into one string all the strings that the "map"
321 returns.
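
The three ways of getting at the heading's content can be compared side by side. This is only a sketch: the variable names are mine, and the exact form of the stringified object in $naive varies from run to run (and possibly between versions of the module).

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<h1>Local Man Sees <cite>Blade</cite> Again</h1>');
$tree->eof;

my $h1 = $tree->look_down('_tag', 'h1');

# Naive join: the cite element gets stringified as an object reference.
my $naive = join '', $h1->content_list;

# All the text at or under $h1, with markup dropped.
my $text = $h1->as_text;

# The contents of $h1 as HTML source, without the h1 tags themselves.
my $inner = join '',
  map { ref($_) ? $_->as_HTML : $_ } $h1->content_list;

print "naive: $naive\n";
print "text:  $text\n";
print "inner: $inner\n";
$tree->delete;
```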
322
323 Last but not least, the most important method in our "get_heading" sub
is the "look_down" method. This method looks down at the subtree
starting at the given object (here, $tree), looking for elements that meet
326 criteria you provide.
327
328 The criteria are specified in the method's argument list. Each
329 criterion can consist of two scalars, a key and a value, which express
330 that you want elements that have that attribute (like "_tag", or "src")
331 with the given value ("h1"); or the criterion can be a reference to a
332 subroutine that, when called on the given element, returns true if that
333 is a node you're looking for. If you specify several criteria, then
334 that's taken to mean that you want all the elements that each satisfy
335 all the criteria. (In other words, there's an "implicit AND".)
336
337 And finally, there's a bit of an optimization -- if you call the
338 "look_down" method in a scalar context, you get just the first node (or
339 undef if none) -- and, in fact, once "look_down" finds that first
340 matching element, it doesn't bother looking any further.
341
342 So the example:
343
344 $h1 = $tree->look_down('_tag', 'h1');
345
346 returns the first element at-or-under $tree whose "_tag" attribute has
347 the value "h1".
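
Here is a small sketch of those calling conventions: scalar context versus list context, and a key/value criterion combined with a subroutine criterion. The sample markup and the variable names are invented for the illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(<<'HTML');
<h1>First</h1>
<p><img src="a.png" alt="A"><img src="b.png"></p>
<h1>Second</h1>
HTML
$tree->eof;

# Scalar context: just the first matching element (or undef).
my $first_h1 = $tree->look_down('_tag', 'h1');

# List context: every matching element, in document order.
my @all_h1 = $tree->look_down('_tag', 'h1');

# Two criteria, implicitly ANDed: an img element that has an alt attribute.
my @with_alt = $tree->look_down(
  '_tag', 'img',
  sub { defined $_[0]->attr('alt') },
);

my $first_text = $first_h1->as_text;
my $h1_count   = scalar @all_h1;
my @alt_srcs   = map { $_->attr('src') } @with_alt;

print "first h1: $first_text\n";
print "h1 count: $h1_count\n";
print "img with alt: @alt_srcs\n";
$tree->delete;
```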
348
349 Complex Criteria in Tree Scanning
350 Now, the above "look_down" code looks like a lot of bother, with barely
351 more benefit than just grepping the file! But consider if your
352 criteria were more complicated -- suppose you found that some of the
353 press releases that you were scanning had several "h1" elements,
354 possibly before or after the one you actually want. For example:
355
356 <h1><center>Visit Our Corporate Partner
357 <br><a href="/dyna/clickthru"
358 ><img src="/dyna/vend_ad"></a>
359 </center></h1>
360 <h1><center>ConGlomCo President Schreck to Visit Regional HQ
361 <br><a href="/photos/Schreck_visit_large.jpg"
362 ><img src="/photos/Schreck_visit.jpg"></a>
363 </center></h1>
364
365 Here, you want to ignore the first "h1" element because it contains an
366 ad, and you want the text from the second "h1". The problem is in
367 formalizing the way you know that it's an ad. Since ad banners are
368 always entreating you to "visit" the sponsoring site, you could exclude
369 "h1" elements that contain the word "visit" under them:
370
    my $real_h1 = $tree->look_down(
      '_tag', 'h1',
      sub {
        $_[0]->as_text !~ m/\bvisit/i
      }
    );
377
378 The first criterion looks for "h1" elements, and the second criterion
379 limits those to only the ones whose text content doesn't match
380 "m/\bvisit/". But unfortunately, that won't work for our example,
381 since the second "h1" mentions "ConGlomCo President Schreck to Visit
382 Regional HQ".
383
384 Instead you could try looking for the first "h1" element that doesn't
385 contain an image:
386
    my $real_h1 = $tree->look_down(
      '_tag', 'h1',
      sub {
        not $_[0]->look_down('_tag', 'img')
      }
    );
393
394 This criterion sub might seem a bit odd, since it calls "look_down" as
part of a larger "look_down" operation, but that's fine. Note that
when considered as a boolean value, a "look_down" in a scalar context
returns false (specifically, undef) if there's no matching
398 element at or under the given element; and it returns the first
399 matching element (which, being a reference and object, is always a true
400 value), if any matches. So, here,
401
    sub {
      not $_[0]->look_down('_tag', 'img')
    }
405
means "return true only if this element has no 'img' elements as
descendants (and isn't an 'img' element itself)."
408
409 This correctly filters out the first "h1" that contains the ad, but it
410 also incorrectly filters out the second "h1" that contains a non-
411 advertisement photo besides the headline text you want.
412
There clearly are detectable differences between the first and second
"h1" elements -- only the second one contains the string "Schreck", and
we could just test for that:
416
    my $real_h1 = $tree->look_down(
      '_tag', 'h1',
      sub {
        $_[0]->as_text =~ m{Schreck}
      }
    );
423
424 And that works fine for this one example, but unless all thousand of
425 your press releases have "Schreck" in the headline, that's just not a
426 general solution. However, if all the ads-in-"h1"s that you want to
427 exclude involve a link whose URL involves "/dyna/", then you can use
428 that:
429
    my $real_h1 = $tree->look_down(
      '_tag', 'h1',
      sub {
        my $link = $_[0]->look_down('_tag','a');
        return 1 unless $link;
          # no link means it's fine
        return 0 if $link->attr('href') =~ m{/dyna/};
          # a link to there is bad
        return 1; # otherwise okay
      }
    );
441
442 Or you can look at it another way and say that you want the first "h1"
443 element that either contains no images, or else whose image has a "src"
444 attribute whose value contains "/photos/":
445
    my $real_h1 = $tree->look_down(
      '_tag', 'h1',
      sub {
        my $img = $_[0]->look_down('_tag','img');
        return 1 unless $img;
          # no image means it's fine
        return 1 if $img->attr('src') =~ m{/photos/};
          # good if a photo
        return 0; # otherwise bad
      }
    );
457
458 Recall that this use of "look_down" in a scalar context means to return
459 the first element at or under $tree that matches all the criteria. But
460 if you notice that you can formulate criteria that'll match several
461 possible "h1" elements, some of which may be bogus but the last one of
462 which is always the one you want, then you can use "look_down" in a
463 list context, and just use the last element of that list:
464
    my @h1s = $tree->look_down(
      '_tag', 'h1',
      ...maybe more criteria...
    );
    die "What, no h1s here?" unless @h1s;
    my $real_h1 = $h1s[-1]; # last or only
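
To try these out without a thousand press releases on hand, here is a sketch that applies the "/dyna/" rule and the last-of-the-list trick to a trimmed-down version of the two headings above (I have left out the "center" and "br" elements to keep it short). The guard against a missing "href" is my own addition, not part of the rule as stated.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(<<'HTML');
<h1>Visit Our Corporate Partner
 <a href="/dyna/clickthru"><img src="/dyna/vend_ad"></a></h1>
<h1>ConGlomCo President Schreck to Visit Regional HQ
 <a href="/photos/Schreck_visit_large.jpg"
  ><img src="/photos/Schreck_visit.jpg"></a></h1>
HTML
$tree->eof;

# Rule 1: the first h1 whose link (if any) doesn't point under /dyna/.
my $by_rule = $tree->look_down(
  '_tag', 'h1',
  sub {
    my $link = $_[0]->look_down('_tag', 'a');
    return 1 unless $link;                    # no link means it's fine
    my $href = $link->attr('href');
    return 0 if defined $href and $href =~ m{/dyna/};
    return 1;                                 # otherwise okay
  }
);

# Rule 2: take every h1 and keep the last one.
my @h1s = $tree->look_down('_tag', 'h1');
die "What, no h1s here?" unless @h1s;

my $rule_text = $by_rule ? $by_rule->as_text : undef;
my $last_text = $h1s[-1]->as_text;

print "by rule: $rule_text\n" if defined $rule_text;
print "last h1: $last_text\n";
$tree->delete;
```

On this sample both rules pick the Schreck headline; which one holds up across a whole set of pages depends on the pages.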
471
472 A Case Study: Scanning Yahoo News's HTML
473 The above (somewhat contrived) case involves extracting data from a
474 bunch of pre-existing HTML files. In that sort of situation, if your
475 code works for all the files, then you know that the code works --
476 since the data it's meant to handle won't go changing or growing; and,
477 typically, once you've used the program, you'll never need to use it
478 again.
479
480 The other kind of situation faced in many data extraction tasks is
481 where the program is used recurringly to handle new data -- such as
482 from ever-changing Web pages. As a real-world example of this,
483 consider a program that you could use (suppose it's crontabbed) to
484 extract headline-links from subsections of Yahoo News
485 ("http://dailynews.yahoo.com/").
486
487 Yahoo News has several subsections:
488
489 http://dailynews.yahoo.com/h/tc/ for technology news
490 http://dailynews.yahoo.com/h/sc/ for science news
491 http://dailynews.yahoo.com/h/hl/ for health news
492 http://dailynews.yahoo.com/h/wl/ for world news
493 http://dailynews.yahoo.com/h/en/ for entertainment news
494
495 and others. All of them are built on the same basic HTML template --
496 and a scarily complicated template it is, especially when you look at
497 it with an eye toward making up rules that will select where the real
498 headline-links are, while screening out all the links to other parts of
499 Yahoo, other news services, etc. You will need to puzzle over the HTML
500 source, and scrutinize the output of "$tree->dump" on the parse tree of
501 that HTML.
502
503 Sometimes the only way to pin down what you're after is by position in
504 the tree. For example, headlines of interest may be in the third column
505 of the second row of the second table element in a page:
506
    my $table = ( $tree->look_down('_tag','table') )[1];
    my $row2  = ( $table->look_down('_tag', 'tr' ) )[1];
    my $col3  = ( $row2->look_down('_tag', 'td')   )[2];
    ...then do things with $col3...
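
A slightly more defensive sketch of the same positional lookup, run against a made-up two-table page, might look like the following. The explicit count checks are my addition: with the one-liners above, a page with too few tables or rows leaves you calling a method on undef.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(<<'HTML');
<table><tr><td>nav</td></tr></table>
<table>
 <tr><td>r1c1</td><td>r1c2</td><td>r1c3</td></tr>
 <tr><td>r2c1</td><td>r2c2</td><td>Headline here</td></tr>
</table>
HTML
$tree->eof;

my @tables = $tree->look_down('_tag', 'table');
die "Fewer than two tables" unless @tables >= 2;

my @rows = $tables[1]->look_down('_tag', 'tr');
die "Fewer than two rows" unless @rows >= 2;

my @cells = $rows[1]->look_down('_tag', 'td');
die "Fewer than three cells" unless @cells >= 3;

# Third cell of the second row of the second table.
my $cell_text = $cells[2]->as_text;
print "$cell_text\n";
$tree->delete;
```

Bear in mind that "look_down" finds descendants at any depth, so with nested tables the counts include the inner tables' rows and cells too.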
511
512 Or they may be all the links in a "p" element that has at least three
513 "br" elements as children:
514
    my $p = $tree->look_down(
      '_tag', 'p',
      sub {
        2 < grep { ref($_) and $_->tag eq 'br' }
                 $_[0]->content_list
      }
    );
    my @links = $p->look_down('_tag', 'a');
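
Run against some invented markup, that rule behaves as follows. The sample page is hypothetical, and I have added a check for the case where no such "p" element is found.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(<<'HTML');
<p><a href="/about">About</a><br>Contact</p>
<p><a href="/n/1">Story one</a><br>
<a href="/n/2">Story two</a><br>
<a href="/n/3">Story three</a><br></p>
HTML
$tree->eof;

# The first p element with at least three br elements as direct children.
my $p = $tree->look_down(
  '_tag', 'p',
  sub {
    2 < grep { ref($_) and $_->tag eq 'br' }
             $_[0]->content_list
  }
);

# Collect the link targets, or nothing if no such paragraph exists.
my @hrefs = $p ? map { $_->attr('href') } $p->look_down('_tag', 'a') : ();
print "$_\n" for @hrefs;
$tree->delete;
```

The "ref($_)" test matters here: "content_list" returns text strings as well as elements, and calling "tag" on a plain string would be an error.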
523
But almost always, you can get away with looking for properties of the
thing itself, rather than just looking for contexts. Now, if
you're lucky, the document you're looking through has clear semantic
tagging, such as is useful in CSS -- note the class="headlinelink" bit
here:
529
530 <a href="...long_news_url..." class="headlinelink">Elvis
531 seen in tortilla</a>
532
533 If you find anything like that, you could leap right in and select
534 links with:
535
536 @links = $tree->look_down('class','headlinelink');
537
538 Regrettably, your chances of seeing any sort of semantic markup
539 principles really being followed with actual HTML are pretty thin.
540
541 Footnote: In fact, your chances of finding a page that is simply
542 free of HTML errors are even thinner. And surprisingly, sites like
543 Amazon or Yahoo are typically worse as far as quality of code than
544 personal sites whose entire production cycle involves simply being
545 saved and uploaded from Netscape Composer.
546
547 The code may be sort of "accidentally semantic", however -- for
548 example, in a set of pages I was scanning recently, I found that
549 looking for "td" elements with a "width" attribute value of "375" got
550 me exactly what I wanted. No-one designing that page ever conceived of
551 "width=375" as meaning "this is a headline", but if you impute it to
552 mean that, it works.
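
Both kinds of attribute test, the deliberately semantic "class" and the accidentally semantic "width", are plain key/value criteria. A sketch on an invented fragment follows; note that the value is compared as an exact string, so "375" matches but "375px" would not.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(<<'HTML');
<table><tr>
 <td width="120"><a href="/nav">Site map</a></td>
 <td width="375"><a href="/n/1" class="headlinelink">Elvis
   seen in tortilla</a></td>
</tr></table>
HTML
$tree->eof;

# Any element whose class attribute is exactly "headlinelink".
my @by_class = $tree->look_down('class', 'headlinelink');

# Any td element whose width attribute is exactly "375".
my @by_width = $tree->look_down('_tag', 'td', 'width', '375');

my @class_tags  = map { $_->tag } @by_class;
my @width_texts = map { $_->as_text } @by_width;

print "class match: @class_tags\n";
print "width match: @width_texts\n";
$tree->delete;
```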
553
554 An approach like this happens to work for the Yahoo News code, because
555 the headline-links are distinguished by the fact that they (and they
556 alone) contain a "b" element:
557
558 <a href="...long_news_url..."><b>Elvis seen in tortilla</b></a>
559
560 or, diagrammed as a part of the parse tree:
561
562 . a [href="...long_news_url..."]
563 . b
564 . "Elvis seen in tortilla"
565
566 A rule that matches these can be formalized as "look for any 'a'
567 element that has only one daughter node, which must be a 'b' element".
568 And this is what it looks like when cooked up as a "look_down"
569 expression and prefaced with a bit of code that retrieves the text of
570 the given Yahoo News page and feeds it to TreeBuilder:
571
    use strict;
    use HTML::TreeBuilder 2.97;
    use LWP::UserAgent;
    sub get_headlines {
      my $url = $_[0] || die "What URL?";

      my $response = LWP::UserAgent->new->request(
        HTTP::Request->new( GET => $url )
      );
      unless($response->is_success) {
        warn "Couldn't get $url: ", $response->status_line, "\n";
        return;
      }

      my $tree = HTML::TreeBuilder->new();
      $tree->parse($response->content);
      $tree->eof;

      my @out;
      foreach my $link (
        $tree->look_down(   # !
          '_tag', 'a',
          sub {
            return unless $_[0]->attr('href');
            my @c = $_[0]->content_list;
            @c == 1 and ref $c[0] and $c[0]->tag eq 'b';
          }
        )
      ) {
        push @out, [ $link->attr('href'), $link->as_text ];
      }

      warn "Odd, fewer than 6 stories in $url!" if @out < 6;
      $tree->delete;
      return @out;
    }
608
609 ...and add a bit of code to actually call that routine and display the
610 results...
611
    foreach my $section (qw[tc sc hl wl en]) {
      my @links = get_headlines(
        "http://dailynews.yahoo.com/h/$section/"
      );
      print
        $section, ": ", scalar(@links), " stories\n",
        map(("  ", $_->[0], " : ", $_->[1], "\n"), @links),
        "\n";
    }
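
Since the rule is the part most likely to need adjusting, it can help to exercise it on a saved or hand-written snippet of HTML instead of fetching a page every time. Here is a sketch of that idea; "headlines_from_html" is a hypothetical helper of mine that takes HTML source as a string, and the sample markup is invented.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical offline variant of the extraction rule: takes HTML source,
# returns a list of [href, text] pairs for links whose only child is a "b".
sub headlines_from_html {
  my ($html) = @_;
  my $tree = HTML::TreeBuilder->new;
  $tree->parse($html);
  $tree->eof;

  my @out;
  foreach my $link (
    $tree->look_down(
      '_tag', 'a',
      sub {
        return unless $_[0]->attr('href');
        my @c = $_[0]->content_list;
        @c == 1 and ref $c[0] and $c[0]->tag eq 'b';
      }
    )
  ) {
    push @out, [ $link->attr('href'), $link->as_text ];
  }
  $tree->delete;
  return @out;
}

my @found = headlines_from_html(<<'HTML');
<p><a href="/h/tc/1"><b>Elvis seen in tortilla</b></a></p>
<p><a href="/help">Help</a> <a href="/h/tc/2"><b>Second story</b></a></p>
<p><a name="top"><b>Not a link</b></a></p>
HTML

print "$_->[0] : $_->[1]\n" for @found;
```

The plain "Help" link is skipped because its only child is a text string, and the named anchor is skipped because it has no "href".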
621
And we've got our own headline-extractor service! This in and of
itself isn't so amazingly useful (since if you want to see the
624 headlines, you can just look at the Yahoo News pages), but it could
625 easily be the basis for quite useful features like filtering the
626 headlines for matching certain keywords of interest to you.
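
As a sketch of that kind of filtering: given the list of [URL, text] pairs that "get_headlines" returns, a "grep" over the headline text is all it takes. The sample links and the keyword list below are invented stand-ins.

```perl
use strict;
use warnings;

# Stand-in for what get_headlines returns: [URL, headline text] pairs.
# These URLs and headlines are made up for the example.
my @links = (
  [ 'http://example.com/h/tc/1', 'Elvis seen in tortilla' ],
  [ 'http://example.com/h/tc/2', 'New Perl release announced' ],
  [ 'http://example.com/h/tc/3', 'Tortilla prices rise' ],
);

# Keywords of interest, matched as whole words, case-insensitively.
my @keywords = ('perl', 'elvis');
my $pattern  = join '|', map { quotemeta($_) } @keywords;

my @hits = grep { $_->[1] =~ /\b(?:$pattern)\b/i } @links;

print "$_->[0] : $_->[1]\n" for @hits;
```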
627
628 Now, one of these days, Yahoo News will decide to change its HTML
629 template. When this happens, this will appear to the above program as
630 there being no links that meet the given criteria; or, less likely,
631 dozens of erroneous links will meet the criteria. In either case, the
632 criteria will have to be changed for the new template; they may just
633 need adjustment, or you may need to scrap them and start over.
634
635 Regardez, duvet!
636 It's often quite a challenge to write criteria to match the desired
637 parts of an HTML parse tree. Very often you can pull it off with a
638 simple "$tree->look_down('_tag', 'h1')", but sometimes you do have to
639 keep adding and refining criteria, until you might end up with complex
640 filters like what I've shown in this article. The benefit to learning
641 how to deal with HTML parse trees is that one main search tool, the
642 "look_down" method, can do most of the work, making simple things easy,
643 while still making hard things possible.
644
645 [end body of article]
646
647 [Author Credit]
648 Sean M. Burke ("sburke@cpan.org") is the current maintainer of
649 "HTML::TreeBuilder" and "HTML::Element", both originally by Gisle Aas.
650
651 Sean adds: "I'd like to thank the folks who listened to me ramble
652 incessantly about HTML::TreeBuilder and HTML::Element at this year's
653 Yet Another Perl Conference and O'Reilly Open Source Software
654 Convention."
655
658
659
660
661perl v5.34.0 2021-07-22 HTML::Tree::Scanning(3)