HTTP::OAI::Harvester(3pm)  User Contributed Perl Documentation  HTTP::OAI::Harvester(3pm)

NAME

       HTTP::OAI::Harvester - Agent for harvesting from Open Archives version
       1.0, 1.1, 2.0 and static ('2.0s') compatible repositories

DESCRIPTION

       "HTTP::OAI::Harvester" is the harvesting front-end in the OAI-PERL
       library.

       To harvest from an OAI-PMH compliant repository, create an
       "HTTP::OAI::Harvester" object using the baseURL option and then call
       OAI-PMH methods to request data from the repository. To handle version
       1.0/1.1 repositories automatically you must request Identify() first.

       It is recommended that you request an Identify from the repository and
       use the repository() method to update the Identify object used by the
       harvester.
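
       The recommendation above can be sketched as a small helper. This is
       only a minimal sketch: refresh_identify() is a hypothetical name, not
       part of the library, and it assumes $h is an HTTP::OAI::Harvester (or
       any object offering the same Identify() and repository() methods).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTTP::OAI): request Identify and, if it
# succeeded, store the result as the harvester's repository description.
sub refresh_identify {
    my ($h) = @_;

    my $identify = $h->Identify;
    die "Error requesting Identify: " . $identify->message . "\n"
        if $identify->is_error;

    # Update the Identify object used by the harvester.
    $h->repository($identify);
    return $identify;
}
```

       A caller would typically run refresh_identify($h) once, before issuing
       any ListIdentifiers or ListRecords request.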

       When making OAI requests the underlying HTTP::OAI::UserAgent module
       will take care of automatic redirection (HTTP code 302) and retry-after
       (HTTP code 503). OAI-PMH flow control (i.e. resumption tokens) is
       handled transparently by "HTTP::OAI::Response".

   Static Repository Support
       Static repositories are automatically and transparently supported
       within the existing API. To harvest a static repository, specify the
       repository XML file using the baseURL argument to HTTP::OAI::Harvester.
       An initial request is made that determines whether the base URL
       specifies a static repository or a normal OAI 1.x/2.0 CGI repository.
       To prevent this initial request, state the OAI version using an
       HTTP::OAI::Identify object, e.g.

               $h = HTTP::OAI::Harvester->new(
                       repository=>HTTP::OAI::Identify->new(
                               baseURL => 'http://arXiv.org/oai2',
                               version => '2.0',
               ));

       If a static repository is found the response is cached, and further
       requests are served from that cache. Static repositories do not support
       sets; any attempt to use sets results in a noSetHierarchy error.
       You can determine whether the repository is static by checking the
       version ($h->repository->version), which will be "2.0s" for static
       repositories.
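
       The version check described above could be wrapped as follows. This is
       a minimal sketch under stated assumptions: is_static_repository() is a
       hypothetical helper name, and $h is assumed to be a harvester whose
       repository() already holds an Identify object.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTTP::OAI): true when the Identify object
# held by the harvester reports the static-repository version "2.0s".
sub is_static_repository {
    my ($h) = @_;
    my $version = $h->repository->version;
    return (defined $version && $version eq '2.0s') ? 1 : 0;
}
```

       A caller might, for instance, leave out any set argument when
       is_static_repository($h) is true, since static repositories answer set
       requests with noSetHierarchy.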

FURTHER READING

       You should refer to the Open Archives Protocol version 2.0 and other
       OAI documentation, available from http://www.openarchives.org/.

       Note that OAI-PMH 1.0 and 1.1 are deprecated.

BEFORE USING EXAMPLES

       In the examples I use the OAI interfaces of arXiv.org and Cogprints. To
       avoid causing annoyance to their server administrators, please contact
       them before performing testing or large downloads (or use other, less
       loaded, servers for testing).

SYNOPSIS

               use HTTP::OAI;

               my $h = HTTP::OAI::Harvester->new(baseURL=>'http://arXiv.org/oai2');
               my $response = $h->Identify;

               if( $response->is_error ) {
                       print "Error requesting Identify:\n",
                               $response->code . " " . $response->message, "\n";
                       exit;
               }

               # Note: repositoryVersion will always be 2.0, $response->version
               # returns the actual version the repository is running
               print "Repository supports protocol version ", $response->version, "\n";

               # Version 1.x repositories don't support metadataPrefix,
               # but OAI-PERL will drop the prefix automatically
               # if an Identify was requested first (as above)
               $response = $h->ListIdentifiers(
                       metadataPrefix=>'oai_dc',
                       from=>'2001-02-03',
                       until=>'2001-04-10'
               );

               if( $response->is_error ) {
                       die("Error harvesting: " . $response->message . "\n");
               }

               print "responseDate => ", $response->responseDate, "\n",
                       "requestURL => ", $response->requestURL, "\n";

               while( my $id = $response->next ) {
                       print "identifier => ", $id->identifier;
                       # Only available from OAI 2.0 repositories
                       print " (", $id->datestamp, ")" if $id->datestamp;
                       print " (", $id->status, ")" if $id->status;
                       print "\n";
                       # Only available from OAI 2.0 repositories
                       for( $id->setSpec ) {
                               print "\t", $_, "\n";
                       }
               }

               # Using a handler
               $response = $h->ListRecords(
                       metadataPrefix=>'oai_dc',
                       handlers=>{metadata=>'HTTP::OAI::Metadata::OAI_DC'},
                       onRecord=>sub {
                               my $rec = shift;

                               printf "%s\t%s\t%s\n",
                                       $rec->identifier,
                                       $rec->datestamp,
                                       join(',', @{$rec->metadata->dc->{'title'}});
                       }
               );

               # End program
               #################

               #################
               # If you have some local OAI-PMH response data you want to
               # parse you can use the OAI-PMH verb as in:

               use HTTP::OAI;
               my $I = HTTP::OAI::Identify->new();

               # If you have a $content string with some cached OAI-PMH
               # verb=Identify response it can be parsed like this:
               $I->parse_string($content);

               # Or if you have an opened file handle $fh to a file with a cached
               # OAI-PMH verb=Identify response
               $I->parse_file($fh);

               # Using either method now you can do something like

               printf "RepositoryName: %s\n", $I->repositoryName;
               for ($I->adminEmail) {
                       print $_, "\n";
               }

METHODS

       HTTP::OAI::Harvester->new( %params )
           This constructor method returns a new instance of
           "HTTP::OAI::Harvester". Requires either an HTTP::OAI::Identify
           object, which in turn must contain a baseURL, or a baseURL from
           which to construct an Identify object.

           Any other parameters are passed to the HTTP::OAI::UserAgent module,
           and from there to the LWP::UserAgent module.

                   $h = HTTP::OAI::Harvester->new(
                           baseURL => 'http://arXiv.org/oai2',
                           resume => 0, # Suppress automatic resumption
                   );
                   $id = $h->repository();
                   $h->repository($h->Identify);

                   $h = HTTP::OAI::Harvester->new(
                           repository => HTTP::OAI::Identify->new(
                                   baseURL => 'http://arXiv.org/oai2',
                   ));

       $h->repository()
           Returns and optionally sets the HTTP::OAI::Identify object used by
           the Harvester agent.

       $h->resume( [1] )
           If set to true (default) resumption tokens will automatically be
           handled by requesting the next partial list during next() calls.

OAI-PMH Verbs

       The six OAI-PMH verbs are the requests supported by an OAI-PMH
       interface.

   Error Messages
       Use is_success() or is_error() on the returned object to determine
       whether an error occurred (see HTTP::OAI::Response).

       code() and message() return the error code (200 is success) and a
       human-readable message respectively. Errors returned by the repository
       can be retrieved using the errors() method:

               foreach my $error ($r->errors) {
                       print $error->code, "\t", $error->message, "\n";
               }

       Note: is_success() is true for the OAI error code "noRecordsMatch"
       (i.e. empty set), although errors() will still contain the OAI error.
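
       Because an empty result still counts as success, a harvester may want
       to tell "nothing matched" apart from a populated response. The sketch
       below shows one way to do that using only the is_success(), errors()
       and code() calls described above; oai_error_codes() and
       is_empty_result() are hypothetical helper names, not library methods.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTTP::OAI): list the OAI error codes
# attached to a response, using the errors() and code() calls shown above.
sub oai_error_codes {
    my ($r) = @_;
    return map { $_->code } $r->errors;
}

# True when the response succeeded but the repository reported that no
# records matched the request, i.e. the harvest was simply empty.
sub is_empty_result {
    my ($r) = @_;
    return 0 unless $r->is_success;
    my @hits = grep { $_ eq 'noRecordsMatch' } oai_error_codes($r);
    return scalar @hits;
}
```

       With these helpers a caller could log an empty date range and move on,
       while still treating responses where is_error() is true as failures.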

   Flow Control
       If the response contained a resumption token this can be retrieved
       using the $r->resumptionToken method.
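
       When automatic resumption is switched off (resume => 0), the token can
       be fed back into the next request by hand. The following is a sketch
       only, not the library's own mechanism. It assumes that
       $r->resumptionToken returns an object whose resumptionToken() method
       yields the token string, that an undefined or empty token marks the
       last page, and that ListIdentifiers accepts resumptionToken as its
       only argument when resuming. harvest_identifiers() is a hypothetical
       helper name.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTTP::OAI): collect identifiers page by
# page. $h is expected to behave like an HTTP::OAI::Harvester created with
# resume => 0, so that next() stops at the end of each partial list.
sub harvest_identifiers {
    my ($h, %params) = @_;
    my @identifiers;

    while (1) {
        my $r = $h->ListIdentifiers(%params);
        die "Error harvesting: " . $r->message . "\n" if $r->is_error;

        while (my $header = $r->next) {
            push @identifiers, $header->identifier;
        }

        # Assumption: the token object exposes the token string through
        # resumptionToken(); no token (or an empty one) ends the harvest.
        my $rt    = $r->resumptionToken;
        my $token = $rt ? $rt->resumptionToken : undef;
        last unless defined $token && length $token;

        # A resumed request carries the token only.
        %params = (resumptionToken => $token);
    }

    return @identifiers;
}
```

       With the default resume setting none of this is needed, since next()
       requests the following partial list by itself.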

   Methods
       These methods return an object subclassed from HTTP::Response (where
       the class corresponds to the verb requested, e.g. "GetRecord" requests
       return an "HTTP::OAI::GetRecord" object).

       $r = $h->GetRecord( %params )
           Get a single record from the repository identified by identifier,
           in format metadataPrefix.

                   $gr = $h->GetRecord(
                           identifier     => 'oai:arXiv:hep-th/0001001', # Required
                           metadataPrefix => 'oai_dc' # Required
                   );
                   die $gr->message if $gr->is_error;
                   $rec = $gr->next;
                   printf("%s (%s)\n", $rec->identifier, $rec->datestamp);
                   $dom = $rec->metadata->dom;

       $r = $h->Identify()
           Get information about the repository.

                   $id = $h->Identify();
                   print join ',', $id->adminEmail;

       $r = $h->ListIdentifiers( %params )
           Retrieve the identifiers, datestamps, sets and deleted status for
           all records within the specified date range (from/until) and set
           spec (set). 1.x repositories will only return the identifier.
           Alternatively, resume an existing harvest by specifying
           resumptionToken.

                   $lr = $h->ListIdentifiers(
                           metadataPrefix => 'oai_dc', # Required
                           from           => '2001-10-01',
                           until          => '2001-10-31',
                           set            => 'physics:hep-th',
                   );
                   while($rec = $lr->next)
                   {
                           # ... do something with $rec ...
                   }
                   die $lr->message if $lr->is_error;

       $r = $h->ListMetadataFormats( %params )
           List available metadata formats. Given an identifier the repository
           should only return those metadata formats for which that item can
           be disseminated.

                   $lmdf = $h->ListMetadataFormats(
                           identifier => 'oai:arXiv.org:hep-th/0001001'
                   );
                   for($lmdf->metadataFormat) {
                           print $_->metadataPrefix, "\n";
                   }
                   die $lmdf->message if $lmdf->is_error;

       $r = $h->ListRecords( %params )
           Return full records within the specified date range (from/until),
           set and metadata format. Alternatively, specify a resumption token
           to resume a previous partial harvest.

                   $lr = $h->ListRecords(
                           metadataPrefix => 'oai_dc', # Required
                           from           => '2001-10-01',
                           until          => '2001-10-01',
                           set            => 'physics:hep-th',
                   );
                   while($rec = $lr->next)
                   {
                           # ... do something with $rec ...
                   }
                   die $lr->message if $lr->is_error;

       $r = $h->ListSets( %params )
           Return a list of sets provided by the repository. The scope of sets
           is undefined by OAI-PMH, so a set may represent any subset of a
           collection. Optionally provide a resumption token to resume a
           previous partial request.

                   $ls = $h->ListSets();
                   while($set = $ls->next)
                   {
                           print $set->setSpec, "\n";
                   }
                   die $ls->message if $ls->is_error;

ENVIRONMENT

       The HTTP user agent string defaults to OAI-PERL/<Version>, where
       <Version> is the HTTP::OAI version. It can be overridden via the
       HTTP_OAI_AGENT environment variable.

AUTHOR

       These modules have been written by Tim Brody <tdb01r@ecs.soton.ac.uk>.

perl v5.36.1                      2023-06-06         HTTP::OAI::Harvester(3pm)