HTTP::OAI::Harvester(3pm)  User Contributed Perl Documentation  HTTP::OAI::Harvester(3pm)

NAME
       HTTP::OAI::Harvester - Agent for harvesting from Open Archives version
       1.0, 1.1, 2.0 and static ('2.0s') compatible repositories

DESCRIPTION
       "HTTP::OAI::Harvester" is the harvesting front-end in the OAI-PERL
       library.

       To harvest from an OAI-PMH compliant repository, create an
       "HTTP::OAI::Harvester" object using the baseURL option and then call
       OAI-PMH methods to request data from the repository. To handle version
       1.0/1.1 repositories automatically you must request Identify() first.

       It is recommended that you request an Identify from the repository and
       use the repository() method to update the Identify object used by the
       harvester.

       When making OAI requests the underlying HTTP::OAI::UserAgent module
       takes care of automatic redirection (HTTP status 302) and retry-after
       (HTTP status 503). OAI-PMH flow control (i.e. resumption tokens) is
       handled transparently by "HTTP::OAI::Response".

   Static Repository Support
       Static repositories are automatically and transparently supported
       within the existing API. To harvest a static repository, specify the
       repository XML file using the baseURL argument to HTTP::OAI::Harvester.
       An initial request is made to determine whether the base URL
       specifies a static repository or a normal OAI 1.x/2.0 CGI repository.
       To prevent this initial request, state the OAI version using an
       HTTP::OAI::Identify object, e.g.

           $h = HTTP::OAI::Harvester->new(
               repository => HTTP::OAI::Identify->new(
                   baseURL => 'http://arXiv.org/oai2',
                   version => '2.0',
           ));

       If a static repository is found, the response is cached and further
       requests are served from that cache. Static repositories do not support
       sets, so any request that uses sets returns a noSetHierarchy error.
       You can determine whether the repository is static by checking the
       version ($h->repository->version), which will be "2.0s" for static
       repositories.

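       The version check described above can be wrapped in a small helper.
       This is a minimal sketch: is_static_repository is a hypothetical name
       and not part of HTTP::OAI. It relies only on the repository() and
       version() calls mentioned in this section.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTTP::OAI): returns 1 if the harvester's
# Identify object reports the static-repository version string "2.0s".
sub is_static_repository {
    my ($h) = @_;
    my $version = $h->repository->version;
    return (defined($version) && $version eq '2.0s') ? 1 : 0;
}

# Possible use, assuming $h is an HTTP::OAI::Harvester:
#   my %args = (metadataPrefix => 'oai_dc');
#   $args{set} = 'physics:hep-th' unless is_static_repository($h);
#   my $response = $h->ListRecords(%args);
```

       Leaving out the set argument for static repositories avoids the
       noSetHierarchy error rather than handling it after the fact.
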
       You should refer to the Open Archives Protocol version 2.0 and other
       OAI documentation, available from http://www.openarchives.org/.

       Note: OAI-PMH 1.0 and 1.1 are deprecated.

       The examples use the OAI interfaces of arXiv.org and CogPrints. To
       avoid causing annoyance to their server administrators, please contact
       them before performing testing or large downloads (or use other, less
       loaded, servers for testing).

SYNOPSIS
           use HTTP::OAI;

           my $h = HTTP::OAI::Harvester->new(baseURL=>'http://arXiv.org/oai2');
           my $response = $h->Identify;

           if( $response->is_error ) {
               print "Error requesting Identify:\n",
                   $response->code . " " . $response->message, "\n";
               exit;
           }

           # Note: repositoryVersion will always be 2.0, $response->version
           # returns the actual version the repository is running
           print "Repository supports protocol version ", $response->version, "\n";

           # Version 1.x repositories don't support metadataPrefix,
           # but OAI-PERL will drop the prefix automatically
           # if an Identify was requested first (as above)
           $response = $h->ListIdentifiers(
               metadataPrefix=>'oai_dc',
               from=>'2001-02-03',
               until=>'2001-04-10'
           );

           if( $response->is_error ) {
               die("Error harvesting: " . $response->message . "\n");
           }

           print "responseDate => ", $response->responseDate, "\n",
               "requestURL => ", $response->requestURL, "\n";

           while( my $id = $response->next ) {
               print "identifier => ", $id->identifier;
               # Only available from OAI 2.0 repositories
               print " (", $id->datestamp, ")" if $id->datestamp;
               print " (", $id->status, ")" if $id->status;
               print "\n";
               # Only available from OAI 2.0 repositories
               for( $id->setSpec ) {
                   print "\t", $_, "\n";
               }
           }

           # Using a handler
           $response = $h->ListRecords(
               metadataPrefix=>'oai_dc',
               handlers=>{metadata=>'HTTP::OAI::Metadata::OAI_DC'},
               onRecord=>sub {
                   my $rec = shift;

                   printf "%s\t%s\t%s\n",
                       $rec->identifier,
                       $rec->datestamp,
                       join(',', @{$rec->metadata->dc->{'title'}});
               }
           );

           # End program
           #################

           #################
           # If you have some local OAI-PMH response data you want to
           # parse, you can use the class for the OAI-PMH verb, as in:

           use HTTP::OAI;
           my $I = HTTP::OAI::Identify->new();

           # If you have a $content string with some cached OAI-PMH verb=Identify response
           # it can be parsed like this:
           $I->parse_string($content);

           # Or if you have an opened file handle $fh to a file with a cached
           # OAI-PMH verb=Identify response
           $I->parse_file($fh);

           # Using either method you can now do something like

           printf "RepositoryName: %s\n", $I->repositoryName;
           for ($I->adminEmail) {
               print $_, "\n";
           }

METHODS
       HTTP::OAI::Harvester->new( %params )
           This constructor method returns a new instance of
           "HTTP::OAI::Harvester". Requires either an HTTP::OAI::Identify
           object, which in turn must contain a baseURL, or a baseURL from
           which to construct an Identify object.

           Any other parameters are passed to the HTTP::OAI::UserAgent module,
           and from there to the LWP::UserAgent module.

               $h = HTTP::OAI::Harvester->new(
                   baseURL => 'http://arXiv.org/oai2',
                   resume => 0, # Suppress automatic resumption
               );
               $id = $h->repository();
               $h->repository($h->Identify);

               $h = HTTP::OAI::Harvester->new(
                   repository => HTTP::OAI::Identify->new(
                       baseURL => 'http://arXiv.org/oai2',
               ));

       $h->repository()
           Returns and optionally sets the HTTP::OAI::Identify object used by
           the Harvester agent.

       $h->resume( [1] )
           If set to true (default) resumption tokens will automatically be
           handled by requesting the next partial list during next() calls.

       The six OAI-PMH verbs are the requests supported by an OAI-PMH interface.

   Error Messages
       Use is_success() or is_error() on the returned object to determine
       whether an error occurred (see HTTP::OAI::Response).

       code() and message() return the error code (200 is success) and a
       human-readable message respectively. Errors returned by the repository
       can be retrieved using the errors() method:

           foreach my $error ($r->errors) {
               print $error->code, "\t", $error->message, "\n";
           }

       Note: is_success() is true for the OAI Error Code "noRecordsMatch"
       (i.e. empty set), although errors() will still contain the OAI error.

   Flow Control
       If the response contained a resumption token this can be retrieved
       using the $r->resumptionToken method.

   Methods
       These methods return an object subclassed from HTTP::Response (where
       the class corresponds to the verb requested, e.g. "GetRecord" requests
       return an "HTTP::OAI::GetRecord" object).

       $r = $h->GetRecord( %params )
           Get a single record from the repository identified by identifier,
           in format metadataPrefix.

               $gr = $h->GetRecord(
                   identifier => 'oai:arXiv.org:hep-th/0001001', # Required
                   metadataPrefix => 'oai_dc' # Required
               );
               die $gr->message if $gr->is_error;
               $rec = $gr->next;
               printf("%s (%s)\n", $rec->identifier, $rec->datestamp);
               $dom = $rec->metadata->dom;

       $r = $h->Identify()
           Get information about the repository.

               $id = $h->Identify();
               print join ',', $id->adminEmail;

       $r = $h->ListIdentifiers( %params )
           Retrieve the identifiers, datestamps, sets and deleted status for
           all records within the specified date range (from/until) and set
           spec (set). 1.x repositories will only return the identifier.
           Alternatively, resume an existing harvest by specifying
           resumptionToken.

               $lr = $h->ListIdentifiers(
                   metadataPrefix => 'oai_dc', # Required
                   from => '2001-10-01',
                   until => '2001-10-31',
                   set => 'physics:hep-th',
               );
               while($rec = $lr->next)
               {
                   # ... do something with $rec ...
               }
               die $lr->message if $lr->is_error;

       $r = $h->ListMetadataFormats( %params )
           List available metadata formats. Given an identifier the repository
           should only return those metadata formats for which that item can
           be disseminated.

               $lmdf = $h->ListMetadataFormats(
                   identifier => 'oai:arXiv.org:hep-th/0001001'
               );
               for($lmdf->metadataFormat) {
                   print $_->metadataPrefix, "\n";
               }
               die $lmdf->message if $lmdf->is_error;

       $r = $h->ListRecords( %params )
           Return full records within the specified date range (from/until),
           set and metadata format. Alternatively, specify a resumption token
           to resume a previous partial harvest.

               $lr = $h->ListRecords(
                   metadataPrefix => 'oai_dc', # Required
                   from => '2001-10-01',
                   until => '2001-10-01',
                   set => 'physics:hep-th',
               );
               while($rec = $lr->next)
               {
                   # ... do something with $rec ...
               }
               die $lr->message if $lr->is_error;

       $r = $h->ListSets( %params )
           Return a list of sets provided by the repository. The scope of sets
           is undefined by OAI-PMH, so a set may represent any subset of a
           collection. Optionally provide a resumption token to resume a
           previous partial request.

               $ls = $h->ListSets();
               while($set = $ls->next)
               {
                   print $set->setSpec, "\n";
               }
               die $ls->message if $ls->is_error;

       The default HTTP user agent string is OAI-PERL/<Version>, where
       <Version> is the HTTP::OAI version. It can be overridden via the
       environment variable HTTP_OAI_AGENT.

       These modules have been written by Tim Brody <tdb01r@ecs.soton.ac.uk>.

perl v5.36.1                      2023-06-06         HTTP::OAI::Harvester(3pm)