HTTP::OAI::Harvester(3pm)  User Contributed Perl Documentation  HTTP::OAI::Harvester(3pm)

NAME
       HTTP::OAI::Harvester - Agent for harvesting from Open Archives version
       1.0, 1.1, 2.0 and static ('2.0s') compatible repositories

DESCRIPTION
       "HTTP::OAI::Harvester" is the harvesting front-end in the OAI-PERL
       library.

       To harvest from an OAI-PMH compliant repository, create an
       "HTTP::OAI::Harvester" object using the baseURL option and then call
       OAI-PMH methods to request data from the repository. To handle version
       1.0/1.1 repositories automatically you must request "Identify()" first.

       It is recommended that you request an Identify from the repository and
       use the "repository()" method to update the Identify object used by the
       harvester.

       When making OAI requests the underlying HTTP::OAI::UserAgent module
       takes care of automatic redirection (HTTP code 302) and retry-after
       (HTTP code 503). OAI-PMH flow control (i.e. resumption tokens) is
       handled transparently by "HTTP::OAI::Response".

   Static Repository Support
       Static repositories are automatically and transparently supported
       within the existing API. To harvest a static repository, specify the
       repository XML file using the baseURL argument to HTTP::OAI::Harvester.
       An initial request is made to determine whether the base URL
       specifies a static repository or a normal OAI 1.x/2.0 CGI repository.
       To prevent this initial request, state the OAI version using an
       HTTP::OAI::Identify object, e.g.

           $h = HTTP::OAI::Harvester->new(
               repository => HTTP::OAI::Identify->new(
                   baseURL => 'http://arXiv.org/oai2',
                   version => '2.0',
           ));

       If a static repository is found, the response is cached and further
       requests are served from that cache. Static repositories do not support
       sets; any request that uses a set results in a noSetHierarchy error.
       You can determine whether the repository is static by checking the
       version ($h->repository->version), which will be "2.0s" for static
       repositories.
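
       The version check above could be used to avoid sending a set to a
       static repository. The following is only a sketch: "harvest_args" is
       a hypothetical helper, not part of HTTP::OAI, and it assumes the
       version string has already been read from $h->repository->version.
       It runs without a network connection because it only manipulates the
       argument hash.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTTP::OAI): given the version string
# reported by $h->repository->version and a hash of list arguments,
# return the arguments with 'set' removed for static ('2.0s') repositories.
sub harvest_args {
    my ($version, %args) = @_;
    if (defined $version && $version eq '2.0s' && exists $args{set}) {
        warn "Static repository: ignoring set '$args{set}'\n";
        delete $args{set};
    }
    return %args;
}

# With a real harvester this might be used as:
#   my %args = harvest_args($h->repository->version,
#                           metadataPrefix => 'oai_dc', set => 'physics:hep-th');
#   my $lr   = $h->ListRecords(%args);
my %static  = harvest_args('2.0s', metadataPrefix => 'oai_dc', set => 'physics:hep-th');
my %dynamic = harvest_args('2.0',  metadataPrefix => 'oai_dc', set => 'physics:hep-th');
print join(',', sort keys %static),  "\n";
print join(',', sort keys %dynamic), "\n";
```

       Dropping the set silently would hide a configuration mistake, so the
       sketch warns before removing it; a stricter caller might prefer to
       die instead.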

FURTHER READING
       You should refer to the Open Archives Protocol version 2.0 and other
       OAI documentation, available from http://www.openarchives.org/.

       Note that OAI-PMH 1.0 and 1.1 are deprecated.

BEFORE USING EXAMPLES
       The examples use the arXiv.org and CogPrints OAI interfaces. To
       avoid causing annoyance to their server administrators, please contact
       them before performing testing or large downloads (or use other, less
       loaded, servers for testing).

EXAMPLES
           use HTTP::OAI;

           my $h = HTTP::OAI::Harvester->new(baseURL => 'http://arXiv.org/oai2');
           my $response = $h->repository($h->Identify);
           if( $response->is_error ) {
               print "Error requesting Identify:\n",
                   $response->code . " " . $response->message, "\n";
               exit;
           }

           # Note: repositoryVersion will always be 2.0, $response->version
           # returns the actual version the repository is running
           print "Repository supports protocol version ", $response->version, "\n";

           # Version 1.x repositories don't support metadataPrefix,
           # but OAI-PERL will drop the prefix automatically
           # if an Identify was requested first (as above)
           $response = $h->ListIdentifiers(
               metadataPrefix => 'oai_dc',
               from           => '2001-02-03',
               until          => '2001-04-10'
           );

           if( $response->is_error ) {
               die("Error harvesting: " . $response->message . "\n");
           }

           print "responseDate => ", $response->responseDate, "\n",
               "requestURL => ", $response->requestURL, "\n";

           while( my $id = $response->next ) {
               print "identifier => ", $id->identifier;
               # Only available from OAI 2.0 repositories
               print " (", $id->datestamp, ")" if $id->datestamp;
               print " (", $id->status, ")" if $id->status;
               print "\n";
               # Only available from OAI 2.0 repositories
               for( $id->setSpec ) {
                   print "\t", $_, "\n";
               }
           }

           # Using a handler
           $response = $h->ListRecords(
               metadataPrefix => 'oai_dc',
               handlers       => {metadata => 'HTTP::OAI::Metadata::OAI_DC'},
           );
           while( my $rec = $response->next ) {
               print $rec->identifier, "\t",
                   $rec->datestamp, "\n",
                   $rec->metadata, "\n";
               print join(',', @{$rec->metadata->dc->{'title'}}), "\n";
           }
           # Check the response, not the record: $rec is out of scope here
           if( $response->is_error ) {
               die $response->message;
           }

           # Offline parsing ($content holds OAI-PMH XML as a string,
           # $fh is an open filehandle on an OAI-PMH XML file)
           my $I = HTTP::OAI::Identify->new();
           $I->parse_string($content);
           $I->parse_file($fh);

METHODS
       HTTP::OAI::Harvester->new( %params )
           This constructor method returns a new instance of
           "HTTP::OAI::Harvester". It requires either an HTTP::OAI::Identify
           object, which in turn must contain a baseURL, or a baseURL from
           which to construct an Identify object.

           Any other parameters are passed to the HTTP::OAI::UserAgent module,
           and from there to the LWP::UserAgent module.

               $h = HTTP::OAI::Harvester->new(
                   baseURL => 'http://arXiv.org/oai2',
                   resume  => 0, # Suppress automatic resumption
               );
               $id = $h->repository();
               $h->repository($h->Identify);

               $h = HTTP::OAI::Harvester->new(
                   repository => HTTP::OAI::Identify->new(
                       baseURL => 'http://arXiv.org/oai2',
               ));

       $h->repository()
           Returns and optionally sets the HTTP::OAI::Identify object used by
           the Harvester agent.

       $h->resume( [1] )
           If set to true (default) resumption tokens will automatically be
           handled by requesting the next partial list during "next()" calls.

OAI-PMH VERBS
       The six OAI-PMH verbs are the requests supported by an OAI-PMH
       interface.

   Error Messages
       Use "is_success()" or "is_error()" on the returned object to determine
       whether an error occurred (see HTTP::OAI::Response).

       "code()" and "message()" return the error code (200 is success) and a
       human-readable message respectively. Errors returned by the repository
       can be retrieved using the "errors()" method:

           foreach my $error ($r->errors) {
               print $error->code, "\t", $error->message, "\n";
           }

       Note: "is_success()" is true for the OAI Error Code "noRecordsMatch"
       (i.e. empty set), although "errors()" will still contain the OAI error.
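
       Because an empty result still counts as success, a caller may want to
       tell "no records" apart from both a real failure and a normal result.
       The following is a minimal sketch under stated assumptions:
       "harvest_status" is a hypothetical helper, not part of HTTP::OAI, and
       the MockError and MockResponse classes are illustrative stand-ins so
       that the example runs without a network connection. The helper relies
       only on the "is_success()" and "errors()" methods described above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTTP::OAI): classify a response as
# 'error', 'empty' (noRecordsMatch) or 'ok', using only is_success()
# and errors().
sub harvest_status {
    my ($r) = @_;
    return 'error' unless $r->is_success;
    return 'empty' if grep { $_->code eq 'noRecordsMatch' } $r->errors;
    return 'ok';
}

# Minimal stand-in classes, for illustration only. With the real library
# the argument would be the object returned by e.g. $h->ListRecords(...).
package MockError {
    sub new     { my ($class, %f) = @_; return bless {%f}, $class }
    sub code    { $_[0]{code} }
    sub message { $_[0]{message} }
}
package MockResponse {
    sub new        { my ($class, %f) = @_; return bless {%f}, $class }
    sub is_success { $_[0]{ok} }
    sub errors     { @{ $_[0]{errors} || [] } }
}

package main;

my $empty = MockResponse->new(
    ok     => 1,
    errors => [ MockError->new(code => 'noRecordsMatch', message => 'No records') ],
);
print harvest_status($empty), "\n";
```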

   Flow Control
       If the response contained a resumption token, it can be retrieved
       using the $r->resumptionToken method.

   Methods
       These methods return an object subclassed from HTTP::Response (where
       the class corresponds to the verb requested, e.g. "GetRecord" requests
       return an "HTTP::OAI::GetRecord" object).

       $r = $h->GetRecord( %params )
           Get a single record from the repository identified by identifier,
           in format metadataPrefix.

               $gr = $h->GetRecord(
                   identifier     => 'oai:arXiv.org:hep-th/0001001', # Required
                   metadataPrefix => 'oai_dc'                        # Required
               );
               # Check the response before asking it for a record
               die $gr->message if $gr->is_error;
               $rec = $gr->next;
               printf("%s (%s)\n", $rec->identifier, $rec->datestamp);
               $dom = $rec->metadata->dom;

       $r = $h->Identify()
           Get information about the repository.

               $id = $h->Identify();
               print join ',', $id->adminEmail;

       $r = $h->ListIdentifiers( %params )
           Retrieve the identifiers, datestamps, sets and deleted status for
           all records within the specified date range (from/until) and set
           spec (set). 1.x repositories will only return the identifier.
           Alternatively, resume an existing harvest by specifying
           resumptionToken.

               $lr = $h->ListIdentifiers(
                   metadataPrefix => 'oai_dc', # Required
                   from           => '2001-10-01',
                   until          => '2001-10-31',
                   set            => 'physics:hep-th',
               );
               while($rec = $lr->next)
               {
                   # ... do something with $rec ...
               }
               die $lr->message if $lr->is_error;

       $r = $h->ListMetadataFormats( %params )
           List available metadata formats. Given an identifier, the
           repository should return only those metadata formats in which that
           item can be disseminated.

               $lmdf = $h->ListMetadataFormats(
                   identifier => 'oai:arXiv.org:hep-th/0001001'
               );
               for($lmdf->metadataFormat) {
                   print $_->metadataPrefix, "\n";
               }
               die $lmdf->message if $lmdf->is_error;

       $r = $h->ListRecords( %params )
           Return full records within the specified date range (from/until),
           set and metadata format. Alternatively, specify a resumption token
           to resume a previous partial harvest.

               $lr = $h->ListRecords(
                   metadataPrefix => 'oai_dc', # Required
                   from           => '2001-10-01',
                   until          => '2001-10-01',
                   set            => 'physics:hep-th',
               );
               while($rec = $lr->next)
               {
                   # ... do something with $rec ...
               }
               die $lr->message if $lr->is_error;

       $r = $h->ListSets( %params )
           Return a list of sets provided by the repository. OAI-PMH does not
           define the scope of a set, so a set may represent any subset of a
           collection. Optionally provide a resumption token to resume a
           previous partial request.

               $ls = $h->ListSets();
               while($set = $ls->next)
               {
                   print $set->setSpec, "\n";
               }
               die $ls->message if $ls->is_error;

AUTHOR
       These modules have been written by Tim Brody <tdb01r@ecs.soton.ac.uk>.

perl v5.36.0                      2022-07-22         HTTP::OAI::Harvester(3pm)