Web::Scraper(3pm)     User Contributed Perl Documentation    Web::Scraper(3pm)



NAME
    Web::Scraper - Web Scraping Toolkit using HTML and CSS Selectors or
    XPath expressions

SYNOPSIS
        use URI;
        use Web::Scraper;
        use Encode;

        # First, create your scraper block
        my $authors = scraper {
            # Parse all TDs inside 'table[width="100%"]', store them into
            # an array 'authors'. We embed other scrapers for each TD.
            process 'table[width="100%"] td', "authors[]" => scraper {
                # And, in each TD,
                # get the URI of the "a" element
                process "a", uri => '@href';
                # get the text inside the "small" element
                process "small", fullname => 'TEXT';
            };
        };

        my $res = $authors->scrape( URI->new("http://search.cpan.org/author/?A") );

        # iterate the array 'authors'
        for my $author (@{$res->{authors}}) {
            # output is like:
            # Andy Adler            http://search.cpan.org/~aadler/
            # Aaron K Dancygier     http://search.cpan.org/~aakd/
            # Aamer Akhter          http://search.cpan.org/~aakhter/
            print Encode::encode("utf8", "$author->{fullname}\t$author->{uri}\n");
        }

        The structure would resemble this (visually):

        {
            authors => [
                { fullname => $fullname, uri => $uri },
                { fullname => $fullname, uri => $uri },
            ]
        }

DESCRIPTION
    Web::Scraper is a web scraper toolkit, inspired by Ruby's Scrapi. It
    provides a DSL-ish interface for traversing HTML documents and
    returning a neatly arranged Perl data structure.

    The scraper and process blocks provide a way to define which segments
    of a document to extract. Both CSS selectors and XPath expressions are
    understood.

METHODS
  scraper
        $scraper = scraper { ... };

    Creates a new Web::Scraper object by wrapping the DSL code that will
    be fired when the scrape method is called.

  scrape
        $res = $scraper->scrape(URI->new($uri));
        $res = $scraper->scrape($html_content);
        $res = $scraper->scrape(\$html_content);
        $res = $scraper->scrape($http_response);
        $res = $scraper->scrape($html_element);

    Retrieves the HTML from a URI, HTTP::Response, HTML::Tree element, or
    text string, creates a DOM object, and then fires the callback scraper
    code to retrieve the data structure.

    If you pass a URI or an HTTP::Response object, Web::Scraper will
    automatically guess the encoding of the content by looking at the
    Content-Type header and META tags. Otherwise you need to decode the
    HTML to Unicode before passing it to the scrape method.

    You can optionally pass the base URL when you pass the HTML content as
    a string instead of a URI or HTTP::Response object.

        $res = $scraper->scrape($html_content, "http://example.com/foo");

    This way Web::Scraper can resolve the relative links found in the
    document.

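    For instance, here is a minimal sketch of both routes (the URL and the
    UTF-8 assumption are illustrative only): passing the HTTP::Response so
    Web::Scraper guesses the encoding itself, versus decoding the body by
    hand and supplying the base URL so relative links still resolve.

        use Encode;
        use LWP::UserAgent;
        use Web::Scraper;

        my $links = scraper { process "a", 'links[]' => '@href' };

        my $ua       = LWP::UserAgent->new;
        my $response = $ua->get("http://example.com/foo");   # hypothetical URL

        # 1) Let Web::Scraper inspect the Content-Type header / META tags
        my $auto = $links->scrape($response);

        # 2) Decode to Unicode yourself (assuming UTF-8 here) and pass the
        #    base URL explicitly
        my $html   = Encode::decode("utf-8", $response->content);
        my $manual = $links->scrape($html, "http://example.com/foo");
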
  process
        scraper {
            process "tag.class", key => 'TEXT';
            process '//tag[contains(@foo, "bar")]', key2 => '@attr';
            process '//comment()', 'comments[]' => 'TEXT';
        };

    process finds the elements that match a CSS selector or XPath
    expression, then extracts text or attributes from them into the result
    stash.

    If the first argument begins with "//" or "id(" it is treated as an
    XPath expression; otherwise it is treated as a CSS selector.

        # <span class="date">2008/12/21</span>
        # date => "2008/12/21"
        process ".date", date => 'TEXT';

        # <div class="body"><a href="http://example.com/">foo</a></div>
        # link => URI->new("http://example.com/")
        process ".body > a", link => '@href';

        # <div class="body"><!-- HTML Comment here --><a href="http://example.com/">foo</a></div>
        # comment => " HTML Comment here "
        #
        # NOTE: comment nodes are only accessible when
        # HTML::TreeBuilder::XPath (version >= 0.14) and/or
        # HTML::TreeBuilder::LibXML (version >= 0.13) is installed
        process "//div[contains(@class, 'body')]/comment()", comment => 'TEXT';

        # <div class="body"><a href="http://example.com/">foo</a></div>
        # link => URI->new("http://example.com/"), text => "foo"
        process ".body > a", link => '@href', text => 'TEXT';

        # <ul><li>foo</li><li>bar</li></ul>
        # list => [ "foo", "bar" ]
        process "li", "list[]" => "TEXT";

        # <ul><li id="1">foo</li><li id="2">bar</li></ul>
        # list => [ { id => "1", text => "foo" }, { id => "2", text => "bar" } ];
        process "li", "list[]" => { id => '@id', text => "TEXT" };

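    To see how these calls fit together end to end, here is a minimal,
    self-contained sketch that runs the last example against an inline
    HTML string (the markup and variable names are illustrative only):

        use Web::Scraper;

        my $html = '<ul><li id="1">foo</li><li id="2">bar</li></ul>';

        my $list = scraper {
            process "li", "list[]" => { id => '@id', text => "TEXT" };
        };

        my $res = $list->scrape(\$html);
        # $res->{list} is [ { id => "1", text => "foo" },
        #                   { id => "2", text => "bar" } ]
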
  process_first
    "process_first" is the same as "process" but stops when the first
    matching result is found.

        # <span class="date">2008/12/21</span>
        # <span class="date">2008/12/22</span>
        # date => "2008/12/21"
        process_first ".date", date => 'TEXT';

  result
    "result" lets you return, instead of the default result hash, a single
    value looked up by its key, or a hash reference built from several
    keys.

        process 'a', 'want[]' => 'TEXT';
        result 'want';
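
    For example, a small sketch of the difference (the HTML and variable
    names are illustrative only): without "result" the scrape call returns
    a hash reference; with it you get the named value directly.

        use Web::Scraper;

        my $html = '<p><a href="/a">foo</a><a href="/b">bar</a></p>';

        my $links = scraper {
            process 'a', 'want[]' => 'TEXT';
            result 'want';
        };

        # Returns the array reference stored under 'want', i.e.
        # [ "foo", "bar" ], rather than { want => [ "foo", "bar" ] }.
        my $want = $links->scrape(\$html);
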
EXAMPLES
    There are many examples in the "eg/" directory packaged with this
    distribution. It is recommended to look through them.

NESTED SCRAPERS
    Scrapers can be nested, allowing you to scrape inside elements that
    have already been captured.

        # <ul>
        # <li class="foo"><a href="foo1">bar1</a></li>
        # <li class="bar"><a href="foo2">bar2</a></li>
        # <li class="foo"><a href="foo3">bar3</a></li>
        # </ul>
        # friends => [ {href => 'foo1'}, {href => 'foo2'}, {href => 'foo3'} ];
        process 'li', 'friends[]' => scraper {
            process 'a', href => '@href';
        };

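    As a complete run of the snippet above (a minimal sketch; the inline
    markup is illustrative only):

        use Web::Scraper;

        my $html = '<ul>'
                 . '<li class="foo"><a href="foo1">bar1</a></li>'
                 . '<li class="bar"><a href="foo2">bar2</a></li>'
                 . '<li class="foo"><a href="foo3">bar3</a></li>'
                 . '</ul>';

        my $friends = scraper {
            process 'li', 'friends[]' => scraper {
                process 'a', href => '@href';
            };
        };

        my $res = $friends->scrape(\$html);
        # $res->{friends} has one entry per <li>, each with its 'href'
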
FILTERS
    Filters are applied to the result after processing. They can be
    declared as anonymous subroutines or as class names.

        process $exp, $key => [ 'TEXT', sub { s/foo/bar/ } ];
        process $exp, $key => [ 'TEXT', 'Something' ];
        process $exp, $key => [ 'TEXT', '+MyApp::Filter::Foo' ];

    Filters can be stacked:

        process $exp, $key => [ '@href', 'Foo', '+MyApp::Filter::Bar', \&baz ];

    You can find more about filters in the Web::Scraper::Filter
    documentation.

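    For instance, a minimal sketch of a class-based filter, following the
    subclass-with-a-"filter"-method convention described in
    Web::Scraper::Filter (the Trim class itself is hypothetical):

        package Web::Scraper::Filter::Trim;
        use base qw( Web::Scraper::Filter );

        sub filter {
            my($self, $value) = @_;
            $value =~ s/^\s+|\s+$//g;   # strip leading/trailing whitespace
            return $value;
        }

        1;

    It could then be referenced by its short name:

        process $exp, $key => [ 'TEXT', 'Trim' ];
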
XML backends
    By default HTML::TreeBuilder::XPath is used as the backend. It can be
    replaced with an XML::LibXML based backend by using the
    Web::Scraper::LibXML module.

        use Web::Scraper::LibXML;

        # same as Web::Scraper
        my $scraper = scraper { ... };

AUTHOR
    Tatsuhiko Miyagawa <miyagawa@bulknews.net>

LICENSE
    This library is free software; you can redistribute it and/or modify
    it under the same terms as Perl itself.

SEE ALSO
    <http://blog.labnotes.org/category/scrapi/>

    HTML::TreeBuilder::XPath



perl v5.32.0                      2020-07-28                 Web::Scraper(3pm)