Web::Scraper(3pm)     User Contributed Perl Documentation    Web::Scraper(3pm)



NAME
    Web::Scraper - Web Scraping Toolkit using HTML and CSS Selectors or
    XPath expressions

SYNOPSIS
        use URI;
        use Web::Scraper;
        use Encode;

        # First, create your scraper block
        my $authors = scraper {
            # Parse all TDs inside 'table[width="100%"]', store them into
            # an array 'authors'. We embed other scrapers for each TD.
            process 'table[width="100%"] td', "authors[]" => scraper {
                # And, in each TD,
                # get the URI of the "a" element
                process "a", uri => '@href';
                # get the text inside the "small" element
                process "small", fullname => 'TEXT';
            };
        };

        my $res = $authors->scrape( URI->new("http://search.cpan.org/author/?A") );

        # iterate the array 'authors'
        for my $author (@{$res->{authors}}) {
            # output is like:
            # Andy Adler            http://search.cpan.org/~aadler/
            # Aaron K Dancygier     http://search.cpan.org/~aakd/
            # Aamer Akhter          http://search.cpan.org/~aakhter/
            print Encode::encode("utf8", "$author->{fullname}\t$author->{uri}\n");
        }

        The structure would resemble this (visually):

        {
            authors => [
                { fullname => $fullname, uri => $uri },
                { fullname => $fullname, uri => $uri },
            ]
        }

DESCRIPTION
    Web::Scraper is a web scraper toolkit, inspired by Ruby's Scrapi. It
    provides a DSL-ish interface for traversing HTML documents and
    returning a neatly arranged Perl data structure.

    The scraper and process blocks provide a way to define which segments
    of a document to extract. Both CSS selectors and XPath expressions are
    understood.

METHODS
  scraper
        $scraper = scraper { ... };

    Creates a new Web::Scraper object by wrapping the DSL code that will
    be fired when the scrape method is called.

  scrape
        $res = $scraper->scrape(URI->new($uri));
        $res = $scraper->scrape($html_content);
        $res = $scraper->scrape(\$html_content);
        $res = $scraper->scrape($http_response);
        $res = $scraper->scrape($html_element);

    Retrieves the HTML from a URI, HTTP::Response, HTML::Tree element, or
    text string, creates a DOM object, and then fires the callback scraper
    code to retrieve the data structure.

    If you pass a URI or an HTTP::Response object, Web::Scraper will
    automatically guess the encoding of the content by looking at the
    Content-Type header and META tags. Otherwise you need to decode the
    HTML to Unicode before passing it to the scrape method.

    You can optionally pass the base URL when you pass the HTML content as
    a string instead of a URI or HTTP::Response object.

        $res = $scraper->scrape($html_content, "http://example.com/foo");

    This way Web::Scraper can resolve the relative links found in the
    document.

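    For instance, here is a minimal sketch of both routes (the URL and the
    UTF-8 assumption are illustrative only): passing the HTTP::Response so
    Web::Scraper guesses the encoding itself, versus decoding the body by
    hand and supplying the base URL so relative links still resolve.

        use Encode;
        use LWP::UserAgent;
        use Web::Scraper;

        my $links = scraper { process "a", 'links[]' => '@href' };

        my $ua       = LWP::UserAgent->new;
        my $response = $ua->get("http://example.com/foo");   # hypothetical URL

        # 1) Let Web::Scraper inspect the Content-Type header / META tags
        my $auto = $links->scrape($response);

        # 2) Decode to Unicode yourself (assuming UTF-8 here) and pass the
        #    base URL explicitly
        my $html   = Encode::decode("utf-8", $response->content);
        my $manual = $links->scrape($html, "http://example.com/foo");
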
  process
        scraper {
            process "tag.class", key => 'TEXT';
            process '//tag[contains(@foo, "bar")]', key2 => '@attr';
            process '//comment()', 'comments[]' => 'TEXT';
        };

    process finds the elements that match a CSS selector or XPath
    expression, then extracts text or attributes from them into the result
    stash.

    If the first argument begins with "//" or "id(" it is treated as an
    XPath expression; otherwise it is treated as a CSS selector.

        # <span class="date">2008/12/21</span>
        # date => "2008/12/21"
        process ".date", date => 'TEXT';

        # <div class="body"><a href="http://example.com/">foo</a></div>
        # link => URI->new("http://example.com/")
        process ".body > a", link => '@href';

        # <div class="body"><!-- HTML Comment here --><a href="http://example.com/">foo</a></div>
        # comment => " HTML Comment here "
        #
        # NOTE: comment nodes are only accessible when
        # HTML::TreeBuilder::XPath (version >= 0.14) and/or
        # HTML::TreeBuilder::LibXML (version >= 0.13) is installed
        process "//div[contains(@class, 'body')]/comment()", comment => 'TEXT';

        # <div class="body"><a href="http://example.com/">foo</a></div>
        # link => URI->new("http://example.com/"), text => "foo"
        process ".body > a", link => '@href', text => 'TEXT';

        # <ul><li>foo</li><li>bar</li></ul>
        # list => [ "foo", "bar" ]
        process "li", "list[]" => "TEXT";

        # <ul><li id="1">foo</li><li id="2">bar</li></ul>
        # list => [ { id => "1", text => "foo" }, { id => "2", text => "bar" } ];
        process "li", "list[]" => { id => '@id', text => "TEXT" };

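    To see how these calls fit together end to end, here is a minimal,
    self-contained sketch that runs the last example against an inline
    HTML string (the markup and variable names are illustrative only):

        use Web::Scraper;

        my $html = '<ul><li id="1">foo</li><li id="2">bar</li></ul>';

        my $list = scraper {
            process "li", "list[]" => { id => '@id', text => "TEXT" };
        };

        my $res = $list->scrape(\$html);
        # $res->{list} is [ { id => "1", text => "foo" },
        #                   { id => "2", text => "bar" } ]
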
  process_first
    "process_first" is the same as "process" but stops when the first
    matching result is found.

        # <span class="date">2008/12/21</span>
        # <span class="date">2008/12/22</span>
        # date => "2008/12/21"
        process_first ".date", date => 'TEXT';

  result
    "result" lets you return, instead of the default result hash, a single
    value looked up by its key, or a hash reference built from several
    keys.

        process 'a', 'want[]' => 'TEXT';
        result 'want';
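
    For example, a small sketch of the difference (the HTML and variable
    names are illustrative only): without "result" the scrape call returns
    a hash reference; with it you get the named value directly.

        use Web::Scraper;

        my $html = '<p><a href="/a">foo</a><a href="/b">bar</a></p>';

        my $links = scraper {
            process 'a', 'want[]' => 'TEXT';
            result 'want';
        };

        # Returns the array reference stored under 'want', i.e.
        # [ "foo", "bar" ], rather than { want => [ "foo", "bar" ] }.
        my $want = $links->scrape(\$html);
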
EXAMPLES
    There are many examples in the "eg/" directory packaged with this
    distribution. It is recommended to look through them.

NESTED SCRAPERS
    Scrapers can be nested, allowing you to scrape inside elements that
    have already been captured.

        # <ul>
        # <li class="foo"><a href="foo1">bar1</a></li>
        # <li class="bar"><a href="foo2">bar2</a></li>
        # <li class="foo"><a href="foo3">bar3</a></li>
        # </ul>
        # friends => [ {href => 'foo1'}, {href => 'foo2'}, {href => 'foo3'} ];
        process 'li', 'friends[]' => scraper {
            process 'a', href => '@href';
        };

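    As a complete run of the snippet above (a minimal sketch; the inline
    markup is illustrative only):

        use Web::Scraper;

        my $html = '<ul>'
                 . '<li class="foo"><a href="foo1">bar1</a></li>'
                 . '<li class="bar"><a href="foo2">bar2</a></li>'
                 . '<li class="foo"><a href="foo3">bar3</a></li>'
                 . '</ul>';

        my $friends = scraper {
            process 'li', 'friends[]' => scraper {
                process 'a', href => '@href';
            };
        };

        my $res = $friends->scrape(\$html);
        # $res->{friends} has one entry per <li>, each with its 'href'
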
FILTERS
    Filters are applied to the result after processing. They can be
    declared as anonymous subroutines or as class names.

        process $exp, $key => [ 'TEXT', sub { s/foo/bar/ } ];
        process $exp, $key => [ 'TEXT', 'Something' ];
        process $exp, $key => [ 'TEXT', '+MyApp::Filter::Foo' ];

    Filters can be stacked:

        process $exp, $key => [ '@href', 'Foo', '+MyApp::Filter::Bar', \&baz ];

    You can find more about filters in the Web::Scraper::Filter
    documentation.

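    For instance, a minimal sketch of a class-based filter, following the
    subclass-with-a-"filter"-method convention described in
    Web::Scraper::Filter (the Trim class itself is hypothetical):

        package Web::Scraper::Filter::Trim;
        use base qw( Web::Scraper::Filter );

        sub filter {
            my($self, $value) = @_;
            $value =~ s/^\s+|\s+$//g;   # strip leading/trailing whitespace
            return $value;
        }

        1;

    It could then be referenced by its short name:

        process $exp, $key => [ 'TEXT', 'Trim' ];
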
XML backends
    By default HTML::TreeBuilder::XPath is used as the backend. It can be
    replaced with an XML::LibXML based backend by using the
    Web::Scraper::LibXML module.

        use Web::Scraper::LibXML;

        # same as Web::Scraper
        my $scraper = scraper { ... };

AUTHOR
    Tatsuhiko Miyagawa <miyagawa@bulknews.net>

LICENSE
    This library is free software; you can redistribute it and/or modify
    it under the same terms as Perl itself.

SEE ALSO
    <http://blog.labnotes.org/category/scrapi/>

    HTML::TreeBuilder::XPath



perl v5.32.0                      2020-07-28                 Web::Scraper(3pm)