HTTP::OAI::Harvester(3pm)   User Contributed Perl Documentation   HTTP::OAI::Harvester(3pm)

NAME

       HTTP::OAI::Harvester - Agent for harvesting from Open Archives version
       1.0, 1.1, 2.0 and static ('2.0s') compatible repositories

DESCRIPTION

       "HTTP::OAI::Harvester" is the harvesting front-end in the OAI-PERL
       library.

       To harvest from an OAI-PMH compliant repository, create an
       "HTTP::OAI::Harvester" object using the baseURL option and then call
       OAI-PMH methods to request data from the repository. To handle version
       1.0/1.1 repositories automatically you must request "Identify()" first.

       It is recommended that you request an Identify from the repository and
       use the "repository()" method to update the Identify object used by the
       harvester.

       When making OAI requests the underlying HTTP::OAI::UserAgent module
       takes care of automatic redirection (HTTP code 302) and retry-after
       (HTTP code 503). OAI-PMH flow control (i.e. resumption tokens) is
       handled transparently by "HTTP::OAI::Response".

   Static Repository Support
       Static repositories are automatically and transparently supported
       within the existing API. To harvest a static repository, specify the
       repository XML file using the baseURL argument to HTTP::OAI::Harvester.
       An initial request is made to determine whether the base URL
       specifies a static repository or a normal OAI 1.x/2.0 CGI repository.
       To prevent this initial request, state the OAI version using an
       HTTP::OAI::Identify object, e.g.

               $h = HTTP::OAI::Harvester->new(
                       repository=>HTTP::OAI::Identify->new(
                               baseURL => 'http://arXiv.org/oai2',
                               version => '2.0',
               ));

       If a static repository is found the response is cached, and further
       requests are served from that cache. Static repositories do not support
       sets, and any attempt to use sets results in a noSetHierarchy error.
       You can determine whether the repository is static by checking the
       version ($h->repository->version), which will be "2.0s" for static
       repositories.

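       The version check described above can be wrapped in a small helper.
       This is only a sketch: "is_static_repository" is a hypothetical name,
       not part of HTTP::OAI, and the URL in the usage comment is a
       placeholder. It relies solely on the "repository()" and "version()"
       accessors mentioned in this section.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTTP::OAI): returns 1 when the
# harvester's Identify object reports the static-repository version "2.0s",
# otherwise 0. Assumes $h provides repository(), which returns an object
# with a version() accessor.
sub is_static_repository {
    my ($h) = @_;
    my $version = $h->repository->version;
    return ( defined $version && $version eq '2.0s' ) ? 1 : 0;
}

# Usage sketch (needs HTTP::OAI and network access, so it is left commented;
# the URL is a placeholder):
#   my $h = HTTP::OAI::Harvester->new( baseURL => 'http://example.org/static.xml' );
#   warn "Static repository: sets are not supported\n"
#       if is_static_repository($h);
```

       Checking this before passing a "set" argument avoids triggering the
       noSetHierarchy error on static repositories.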

FURTHER READING

       You should refer to the Open Archives Protocol version 2.0 and other
       OAI documentation, available from http://www.openarchives.org/.

       Note that OAI-PMH 1.0 and 1.1 are deprecated.

BEFORE USING EXAMPLES

       In the examples I use the OAI interfaces of arXiv.org and CogPrints.
       To avoid causing annoyance to their server administrators, please
       contact them before performing testing or large downloads (or use
       other, less loaded, servers for testing).

SYNOPSIS

               use HTTP::OAI;

               my $h = HTTP::OAI::Harvester->new(baseURL=>'http://arXiv.org/oai2');
               my $response = $h->repository($h->Identify);
               if( $response->is_error ) {
                       print "Error requesting Identify:\n",
                               $response->code . " " . $response->message, "\n";
                       exit;
               }

               # Note: repositoryVersion will always be 2.0, $response->version
               # returns the actual version the repository is running
               print "Repository supports protocol version ", $response->version, "\n";

               # Version 1.x repositories don't support metadataPrefix,
               # but OAI-PERL will drop the prefix automatically
               # if an Identify was requested first (as above)
               $response = $h->ListIdentifiers(
                       metadataPrefix=>'oai_dc',
                       from=>'2001-02-03',
                       until=>'2001-04-10'
               );

               if( $response->is_error ) {
                       die("Error harvesting: " . $response->message . "\n");
               }

               print "responseDate => ", $response->responseDate, "\n",
                       "requestURL => ", $response->requestURL, "\n";

               while( my $id = $response->next ) {
                       print "identifier => ", $id->identifier;
                       # Only available from OAI 2.0 repositories
                       print " (", $id->datestamp, ")" if $id->datestamp;
                       print " (", $id->status, ")" if $id->status;
                       print "\n";
                       # Only available from OAI 2.0 repositories
                       for( $id->setSpec ) {
                               print "\t", $_, "\n";
                       }
               }

               # Using a handler
               $response = $h->ListRecords(
                       metadataPrefix=>'oai_dc',
                       handlers=>{metadata=>'HTTP::OAI::Metadata::OAI_DC'},
               );
               while( my $rec = $response->next ) {
                       print $rec->identifier, "\t",
                               $rec->datestamp, "\n",
                               $rec->metadata, "\n";
                       print join(',', @{$rec->metadata->dc->{'title'}}), "\n";
               }
               # Check the response (not the record) once iteration has stopped
               if( $response->is_error ) {
                       die $response->message;
               }

               # Offline parsing ($content and $fh are supplied by the caller)
               my $I = HTTP::OAI::Identify->new();
               $I->parse_string($content);
               $I->parse_file($fh);

METHODS

       HTTP::OAI::Harvester->new( %params )
           This constructor method returns a new instance of
           "HTTP::OAI::Harvester". Requires either an HTTP::OAI::Identify
           object, which in turn must contain a baseURL, or a baseURL from
           which to construct an Identify object.

           Any other parameters are passed to the HTTP::OAI::UserAgent module,
           and from there to the LWP::UserAgent module.

                   $h = HTTP::OAI::Harvester->new(
                           baseURL =>      'http://arXiv.org/oai2',
                           resume=>0, # Suppress automatic resumption
                   );
                   $id = $h->repository();
                   $h->repository($h->Identify);

                   $h = HTTP::OAI::Harvester->new(
                           repository => HTTP::OAI::Identify->new(
                                   baseURL => 'http://arXiv.org/oai2',
                   ));

       $h->repository()
           Returns and optionally sets the HTTP::OAI::Identify object used by
           the Harvester agent.

       $h->resume( [1] )
           If set to true (default), resumption tokens are handled
           automatically by requesting the next partial list during "next()"
           calls.

OAI-PMH Verbs

       The six OAI-PMH verbs are the requests supported by an OAI-PMH
       interface.

   Error Messages
       Use "is_success()" or "is_error()" on the returned object to determine
       whether an error occurred (see HTTP::OAI::Response).

       "code()" and "message()" return the error code (200 is success) and a
       human-readable message respectively. Errors returned by the repository
       can be retrieved using the "errors()" method:

               foreach my $error ($r->errors) {
                       print $error->code, "\t", $error->message, "\n";
               }

       Note: "is_success()" is true for the OAI error code "noRecordsMatch"
       (i.e. empty set), although "errors()" will still contain the OAI error.

   Flow Control
       If the response contained a resumption token, it can be retrieved
       using the $r->resumptionToken method.

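       When automatic resumption is switched off (resume=>0), the token can
       be fed back into the next request by hand. The following is a sketch
       under stated assumptions, not the library's own implementation:
       "token_string" and "harvest_all_identifiers" are hypothetical names,
       and because this page does not say whether "resumptionToken()"
       returns a plain string or an object, the helper accepts either form.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical helper: reduce whatever resumptionToken() returned to a
# non-empty string, or undef when there is nothing left to resume.
# Assumption: the value is either a string or an object that itself has
# a resumptionToken() accessor.
sub token_string {
    my ($token) = @_;
    return undef unless defined $token;
    $token = $token->resumptionToken
        if blessed($token) && $token->can('resumptionToken');
    return ( defined $token && length $token ) ? $token : undef;
}

# Hypothetical helper: collect identifiers page by page. Assumes $h was
# created with resume => 0, so next() stops at the end of each partial list.
sub harvest_all_identifiers {
    my ( $h, %params ) = @_;
    my @identifiers;
    my $r = $h->ListIdentifiers(%params);
    while (1) {
        die $r->message if $r->is_error;
        while ( my $id = $r->next ) {
            push @identifiers, $id->identifier;
        }
        my $token = token_string( $r->resumptionToken );
        last unless defined $token;
        # In OAI-PMH the resumptionToken is an exclusive argument.
        $r = $h->ListIdentifiers( resumptionToken => $token );
    }
    return @identifiers;
}
```

       With the default resume=>1 none of this is needed, since "next()"
       requests the following partial list itself.
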
   Methods
       These methods return an object subclassed from HTTP::Response (where
       the class corresponds to the verb requested, e.g. "GetRecord" requests
       return an "HTTP::OAI::GetRecord" object).

       $r = $h->GetRecord( %params )
           Get a single record from the repository identified by identifier,
           in format metadataPrefix.

                   $gr = $h->GetRecord(
                           identifier      =>      'oai:arXiv.org:hep-th/0001001', # Required
                           metadataPrefix  =>      'oai_dc' # Required
                   );
                   die $gr->message if $gr->is_error;
                   $rec = $gr->next;
                   printf("%s (%s)\n", $rec->identifier, $rec->datestamp);
                   $dom = $rec->metadata->dom;

       $r = $h->Identify()
           Get information about the repository.

                   $id = $h->Identify();
                   print join ',', $id->adminEmail;

       $r = $h->ListIdentifiers( %params )
           Retrieve the identifiers, datestamps, sets and deleted status for
           all records within the specified date range (from/until) and set
           spec (set). 1.x repositories will only return the identifier.
           Alternatively, resume an existing harvest by specifying
           resumptionToken.

                   $lr = $h->ListIdentifiers(
                           metadataPrefix  =>      'oai_dc', # Required
                           from            =>      '2001-10-01',
                           until           =>      '2001-10-31',
                           set             =>      'physics:hep-th',
                   );
                   while($rec = $lr->next)
                   {
                           # ... do something with $rec ...
                   }
                   die $lr->message if $lr->is_error;

       $r = $h->ListMetadataFormats( %params )
           List available metadata formats. Given an identifier, the
           repository should return only those metadata formats in which
           that item can be disseminated.

                   $lmdf = $h->ListMetadataFormats(
                           identifier => 'oai:arXiv.org:hep-th/0001001'
                   );
                   for($lmdf->metadataFormat) {
                           print $_->metadataPrefix, "\n";
                   }
                   die $lmdf->message if $lmdf->is_error;

       $r = $h->ListRecords( %params )
           Return full records within the specified date range (from/until),
           set and metadata format. Alternatively, specify a resumption token
           to resume a previous partial harvest.

                   $lr = $h->ListRecords(
                           metadataPrefix  =>      'oai_dc', # Required
                           from            =>      '2001-10-01',
                           until           =>      '2001-10-01',
                           set             =>      'physics:hep-th',
                   );
                   while($rec = $lr->next)
                   {
                           # ... do something with $rec ...
                   }
                   die $lr->message if $lr->is_error;

       $r = $h->ListSets( %params )
           Return a list of sets provided by the repository. The scope of sets
           is not defined by OAI-PMH, so a set may represent any subset of a
           collection. Optionally provide a resumption token to resume a
           previous partial request.

                   $ls = $h->ListSets();
                   while($set = $ls->next)
                   {
                           print $set->setSpec, "\n";
                   }
                   die $ls->message if $ls->is_error;

AUTHOR

       These modules have been written by Tim Brody <tdb01r@ecs.soton.ac.uk>.

perl v5.36.0                      2022-07-22         HTTP::OAI::Harvester(3pm)