lwptut(3)          User Contributed Perl Documentation          lwptut(3)

NAME
lwptut -- An LWP Tutorial

DESCRIPTION
LWP (short for "Library for WWW in Perl") is a very popular group of
Perl modules for accessing data on the Web. Like most Perl
module-distributions, each of LWP's component modules comes with
documentation that is a complete reference to its interface. However,
there are so many modules in LWP that it's hard to know where to start
looking for information on how to do even the simplest, most common
things.

Really introducing you to using LWP would require a whole book -- a
book that just happens to exist, called Perl & LWP. But this article
should give you a taste of how you can go about some common tasks with
LWP.

Getting documents with LWP::Simple

If you just want to get what's at a particular URL, the simplest way to
do it is with LWP::Simple's functions.

In a Perl program, you can call its "get($url)" function. It will try
getting that URL's content. If it works, then it'll return the content;
but if there's some error, it'll return undef.

    my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';
      # Just an example: the URL for the most recent /Fresh Air/ show

    use LWP::Simple;
    my $content = get $url;
    die "Couldn't get $url" unless defined $content;

    # Then go do things with $content, like this:

    if($content =~ m/jazz/i) {
      print "They're talking about jazz today on Fresh Air!\n";
    }
    else {
      print "Fresh Air is apparently jazzless today.\n";
    }

The handiest variant on "get" is "getprint", which is useful in Perl
one-liners. If it can get the page whose URL you provide, it sends it
to STDOUT; otherwise it complains to STDERR.

    % perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'"

That is the URL of a plaintext file that lists new files in CPAN in the
past two weeks. You can easily make it part of a tidy little shell
command, like this one that mails you the list of new "Acme::" modules:

    % perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'"  \
       | grep "/by-module/Acme" | mail -s "New Acme modules! Joy!" $USER

There are other useful functions in LWP::Simple, including one function
for running a HEAD request on a URL (useful for checking links, or
getting the last-revised time of a URL), and two functions for
saving/mirroring a URL to a local file. See the LWP::Simple
documentation for the full details, or chapter 2 of Perl & LWP for more
examples.
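
As a rough sketch of those other functions, here is "getstore" and
"head" at work. So that it runs without a network connection, it points
LWP::Simple at a "file:" URL for a scratch file; the temporary directory
and file names are made up for the example, and with a real "http:" URL
the calls would look the same.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use URI::file;
use LWP::Simple qw(getstore head is_success);

# A small local file stands in for a remote document (assumption:
# a file: URL is used only so the sketch needs no network).
my $dir = tempdir( CLEANUP => 1 );
my $src = "$dir/source.txt";
open my $fh, '>', $src or die "Can't write $src: $!";
print $fh "New Acme modules! Joy!\n";
close $fh;

my $url = URI::file->new($src)->as_string;

# getstore() saves the document to a file and returns the status code.
my $copy   = "$dir/copy.txt";
my $status = getstore( $url, $copy );
die "Couldn't store $url (status $status)" unless is_success($status);

# In list context, head() returns the content type, the document
# length, and a few other values taken from the response headers.
my ($type, $length) = head($url);
print "Stored $length bytes of $type in $copy\n";
```

The "mirror" function takes the same two arguments as "getstore", but
skips the download when the local copy is already up to date.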

The Basics of the LWP Class Model

LWP::Simple's functions are handy for simple cases, but they don't
support cookies or authorization, don't support setting header lines in
the HTTP request, and generally don't support reading header lines in
the HTTP response (notably the full HTTP error message, in case of an
error). To get at all those features, you'll have to use the full LWP
class model.

While LWP consists of dozens of classes, the main two that you have to
understand are LWP::UserAgent and HTTP::Response. LWP::UserAgent is a
class for "virtual browsers" which you use for performing requests, and
HTTP::Response is a class for the responses (or error messages) that
you get back from those requests.

The basic idiom is "$response = $browser->get($url)", or more fully
illustrated:

    # Early in your program:

    use LWP 5.64; # Loads all important LWP classes, and makes
                  #  sure your version is reasonably recent.

    my $browser = LWP::UserAgent->new;

    ...

    # Then later, whenever you need to make a get request:
    my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';

    my $response = $browser->get( $url );
    die "Can't get $url -- ", $response->status_line
     unless $response->is_success;

    die "Hey, I was expecting HTML, not ", $response->content_type
     unless $response->content_type eq 'text/html';
       # or whatever content-type you're equipped to deal with

    # Otherwise, process the content somehow:

    if($response->decoded_content =~ m/jazz/i) {
      print "They're talking about jazz today on Fresh Air!\n";
    }
    else {
      print "Fresh Air is apparently jazzless today.\n";
    }

There are two objects involved: $browser, which holds an object of
class LWP::UserAgent, and then the $response object, which is of class
HTTP::Response. You really need only one browser object per program;
but every time you make a request, you get back a new HTTP::Response
object, which will have some interesting attributes:

· A status code indicating success or failure (which you can test
  with "$response->is_success").

· An HTTP status line that is hopefully informative if there's
  failure (which you can see with "$response->status_line", returning
  something like "404 Not Found").

· A MIME content-type like "text/html", "image/gif",
  "application/xml", etc., which you can see with
  "$response->content_type".

· The actual content of the response, in
  "$response->decoded_content". If the response is HTML, that's where
  the HTML source will be; if it's a GIF, then
  "$response->decoded_content" will be the binary GIF data.

· And dozens of other convenient and more specific methods that are
  documented in the docs for HTTP::Response, and for HTTP::Message
  and HTTP::Headers, which supply many of its methods.
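
To poke at those attributes without making a request, you can build an
HTTP::Response by hand. This is only a sketch: the 404 status and the
HTML body are invented here, whereas in a real program the object would
come back from "$browser->get".

```perl
use strict;
use warnings;
use HTTP::Response;

# Build a response by hand (normally $browser->get returns one).
my $response = HTTP::Response->new( 404, 'Not Found' );
$response->header( 'Content-Type' => 'text/html; charset=iso-8859-1' );
$response->content('<html><body>No such page</body></html>');

my $ok     = $response->is_success ? 'yes' : 'no';
my $status = $response->status_line;
my $type   = $response->content_type;   # scalar context: just the MIME type
my $body   = $response->decoded_content;

print "Success?     $ok\n";
print "Status line: $status\n";
print "Type:        $type\n";
print "Length:      ", length($body), " characters\n";
```

Note that "content_type" is called in scalar context here; in list
context it also returns any parameters, such as the charset.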

Adding Other HTTP Request Headers

The most commonly used syntax for requests is "$response =
$browser->get($url)", but in truth, you can add extra HTTP header lines
to the request by adding a list of key-value pairs after the URL, like
so:

    $response = $browser->get( $url, $key1, $value1, $key2, $value2, ... );

For example, here's how to send some more Netscape-like headers, in
case you're dealing with a site that would otherwise reject your
request:

    my @ns_headers = (
     'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
     'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
     'Accept-Charset' => 'iso-8859-1,*,utf-8',
     'Accept-Language' => 'en-US',
    );

    ...

    $response = $browser->get($url, @ns_headers);

If you weren't reusing that array, you could just go ahead and do this:

    $response = $browser->get($url,
     'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
     'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
     'Accept-Charset' => 'iso-8859-1,*,utf-8',
     'Accept-Language' => 'en-US',
    );

If you were only ever changing the 'User-Agent' line, you could just
change the $browser object's default line from "libwww-perl/5.65" (or
the like) to whatever you like, using the LWP::UserAgent "agent"
method:

    $browser->agent('Mozilla/4.76 [en] (Win98; U)');
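
If you'd like to see what such a request looks like before anything
goes over the wire, one option is the "GET" function from
HTTP::Request::Common, which accepts the same "URL, then header pairs"
list and just hands back an HTTP::Request object. The URL below is a
hypothetical placeholder, and this is a sketch rather than the only way
to inspect a request.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(GET);

my $browser = LWP::UserAgent->new;
$browser->agent('Mozilla/4.76 [en] (Win98; U)');   # default User-Agent line

# GET() builds the request object but does not send it.
my $request = GET( 'http://www.example.int/',      # hypothetical URL
  'Accept-Language' => 'en-US',
  'Accept-Charset'  => 'iso-8859-1,*,utf-8',
);

print $request->as_string;
print "Default agent: ", $browser->agent, "\n";

# To actually send it:  my $response = $browser->request($request);
```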

Enabling Cookies

A default LWP::UserAgent object acts like a browser with its cookie
support turned off. There are various ways of turning it on, by setting
its "cookie_jar" attribute. A "cookie jar" is an object representing a
little database of all the HTTP cookies that a browser can know about.
It can correspond to a file on disk (the way Netscape uses its
cookies.txt file), or it can be just an in-memory object that starts
out empty, and whose collection of cookies will disappear once the
program is finished running.

To give a browser an in-memory empty cookie jar, you set its
"cookie_jar" attribute like so:

    $browser->cookie_jar({});

To give it a copy that will be read from a file on disk, and will be
saved to it when the program is finished running, set the "cookie_jar"
attribute like this:

    use HTTP::Cookies;
    $browser->cookie_jar( HTTP::Cookies->new(
      'file' => '/some/where/cookies.lwp',
          # where to read/write cookies
      'autosave' => 1,
          # save it to disk when done
    ));

That file will be in an LWP-specific format. If you want to access the
cookies in your Netscape cookies file, you can use the
HTTP::Cookies::Netscape class:

    use HTTP::Cookies;
      # yes, loads HTTP::Cookies::Netscape too

    $browser->cookie_jar( HTTP::Cookies::Netscape->new(
      'file' => 'c:/Program Files/Netscape/Users/DIR-NAME-HERE/cookies.txt',
          # where to read cookies
    ));

You could add an "'autosave' => 1" line as further above, but at time
of writing, it's uncertain whether Netscape might discard some of the
cookies you could be writing back to disk.
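
To get a feel for what a cookie jar does on the browser's behalf, here
is a small sketch that uses an HTTP::Cookies object directly: it stores
one cookie and then asks the jar to add the matching "Cookie" header to
a request. The host name, cookie name, and value are invented for the
example; normally the jar is filled from the server's responses.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;

my $jar = HTTP::Cookies->new;    # in-memory jar, nothing saved to disk

# Arguments: version, key, value, path, domain, port, path_spec,
# secure, maxage (in seconds), discard.
$jar->set_cookie( 0, 'session', 'abc123', '/', 'www.example.int',
                  undef, 0, 0, 3600, 0 );

# A request to that (hypothetical) host picks the cookie up.
my $request = HTTP::Request->new( GET => 'http://www.example.int/stuff/' );
$jar->add_cookie_header($request);

print "Cookie header: ", $request->header('Cookie'), "\n";
print $jar->as_string;
```

With "$browser->cookie_jar(...)" set, LWP::UserAgent makes the
equivalent calls for you on every request and response.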

Posting Form Data

Many HTML forms send data to their server using an HTTP POST request,
which you can send with this syntax:

    $response = $browser->post( $url,
      [
        formkey1 => value1,
        formkey2 => value2,
        ...
      ],
    );

Or if you need to send HTTP headers:

    $response = $browser->post( $url,
      [
        formkey1 => value1,
        formkey2 => value2,
        ...
      ],
      headerkey1 => value1,
      headerkey2 => value2,
    );

For example, the following program makes a search request to AltaVista
(by sending some form data via an HTTP POST request), and extracts from
the HTML the report of the number of matches:

    use strict;
    use warnings;
    use LWP 5.64;
    my $browser = LWP::UserAgent->new;

    my $word = 'tarragon';

    my $url = 'http://www.altavista.com/sites/search/web';
    my $response = $browser->post( $url,
      [ 'q' => $word,  # the Altavista query string
        'pg' => 'q', 'avkw' => 'tgz', 'kl' => 'XX',
      ]
    );
    die "$url error: ", $response->status_line
     unless $response->is_success;
    die "Weird content type at $url -- ", $response->content_type
     unless $response->content_type eq 'text/html';

    if( $response->decoded_content =~ m{AltaVista found ([0-9,]+) results} ) {
      # The substring will be like "AltaVista found 2,345 results"
      print "$word: $1\n";
    }
    else {
      print "Couldn't find the match-string in the response\n";
    }
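
If you're curious what actually gets sent for a form like that, the
"POST" function from HTTP::Request::Common builds the same kind of
request without sending it, so you can look at the encoded body. This
is a sketch for inspection only; it reuses the form fields from the
program above.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Build (but don't send) a form POST with the same key-value pairs.
my $request = POST( 'http://www.altavista.com/sites/search/web',
  [ 'q' => 'tarragon', 'pg' => 'q', 'avkw' => 'tgz', 'kl' => 'XX' ],
);

my $type = $request->content_type;   # scalar context: just the MIME type
my $body = $request->content;

print $request->method, " ", $request->uri, "\n";
print "Content-Type: $type\n";
print "Body: $body\n";
```

The form data travels in the request body, URL-encoded, with the pairs
kept in the order you listed them.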

Sending GET Form Data

Some HTML forms convey their form data not by sending the data in an
HTTP POST request, but by making a normal GET request with the data
stuck on the end of the URL. For example, if you went to "imdb.com"
and ran a search on "Blade Runner", the URL you'd see in your browser
window would be:

    http://us.imdb.com/Tsearch?title=Blade%20Runner&restrict=Movies+and+TV

To run the same search with LWP, you'd use this idiom, which involves
the URI class:

    use URI;
    my $url = URI->new( 'http://us.imdb.com/Tsearch' );
      # makes an object representing the URL

    $url->query_form(  # And here the form data pairs:
      'title'    => 'Blade Runner',
      'restrict' => 'Movies and TV',
    );

    my $response = $browser->get($url);

See chapter 5 of Perl & LWP for a longer discussion of HTML forms and
of form data, and chapters 6 through 9 for a longer discussion of
extracting data from HTML.
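
To check what URL "query_form" produces before handing it to the
browser object, you can simply print the URI object. The sketch below
also reads the pairs back out; nothing here touches the network.

```perl
use strict;
use warnings;
use URI;

my $url = URI->new('http://us.imdb.com/Tsearch');
$url->query_form(
  'title'    => 'Blade Runner',
  'restrict' => 'Movies and TV',
);

# A URI object stringifies to the full URL, query string included.
print "$url\n";

# Called with no arguments, query_form returns the decoded pairs.
my %form = $url->query_form;
print "$_ = $form{$_}\n" for sort keys %form;
```

Spaces come out as "+" rather than "%20"; in a query string the two
forms decode to the same thing.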

Absolutizing URLs

The URI class that we just mentioned above provides all sorts of
methods for accessing and modifying parts of URLs (such as asking what
sort of URL it is with "$url->scheme", and asking what host it refers
to with "$url->host", and so on), as described in the docs for the URI
class. However, the methods of most immediate interest are the
"query_form" method seen above, and now the "new_abs" method for taking
a probably-relative URL string (like "../foo.html") and getting back an
absolute URL (like "http://www.perl.com/stuff/foo.html"), as shown
here:

    use URI;
    $abs = URI->new_abs($maybe_relative, $base);
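
Here is a small sketch of "new_abs" at work on a few kinds of input.
The base URL and the relative strings are made up for illustration.

```perl
use strict;
use warnings;
use URI;

my $base = 'http://www.perl.com/stuff/docs/index.html';   # hypothetical base

my %abs_for;
for my $maybe_relative ( '../foo.html', 'bar.html', '/top.html',
                         'http://www.cpan.org/RECENT' ) {
  my $abs = URI->new_abs( $maybe_relative, $base );
  $abs_for{$maybe_relative} = "$abs";
  printf "%-28s => %s  (scheme %s, host %s)\n",
    $maybe_relative, $abs, $abs->scheme, $abs->host;
}
```

A string that is already an absolute URL, like the last one, comes
back unchanged.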

For example, consider this program that matches URLs in the HTML list
of new modules in CPAN:

    use strict;
    use warnings;
    use LWP;
    my $browser = LWP::UserAgent->new;

    my $url = 'http://www.cpan.org/RECENT.html';
    my $response = $browser->get($url);
    die "Can't get $url -- ", $response->status_line
     unless $response->is_success;

    my $html = $response->decoded_content;
    while( $html =~ m/<A HREF=\"(.*?)\"/g ) {
      print "$1\n";
    }

When run, it emits output that starts out something like this:

    MIRRORING.FROM
    RECENT
    RECENT.html
    authors/00whois.html
    authors/01mailrc.txt.gz
    authors/id/A/AA/AASSAD/CHECKSUMS
    ...

However, if you actually want to have those be absolute URLs, you can
use the URI module's "new_abs" method, by changing the "while" loop to
this:

    while( $html =~ m/<A HREF=\"(.*?)\"/g ) {
      print URI->new_abs( $1, $response->base ) ,"\n";
    }

(The "$response->base" method from HTTP::Message is for returning what
URL should be used for resolving relative URLs -- it's usually just the
same as the URL that you requested.)

That program then emits nicely absolute URLs:

    http://www.cpan.org/MIRRORING.FROM
    http://www.cpan.org/RECENT
    http://www.cpan.org/RECENT.html
    http://www.cpan.org/authors/00whois.html
    http://www.cpan.org/authors/01mailrc.txt.gz
    http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS
    ...

See chapter 4 of Perl & LWP for a longer discussion of URI objects.

Of course, using a regexp to match hrefs is a bit simplistic, and for
more robust programs, you'll probably want to use an HTML-parsing
module like HTML::LinkExtor or HTML::TokeParser or even maybe
HTML::TreeBuilder.
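
As a sketch of the more robust route, here is roughly how
HTML::LinkExtor could replace the regexp. The HTML snippet is invented
for the example (in the program above it would be
"$response->decoded_content", with "$response->base" as the base URL),
and it assumes the HTML-Parser distribution is installed.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

my $base = 'http://www.cpan.org/RECENT.html';
my $html = <<'END_HTML';
<html><body>
<A HREF="MIRRORING.FROM">MIRRORING.FROM</A>
<a href="authors/00whois.html">whois</a>
<img src="misc/logo.gif">
</body></html>
END_HTML

# No callback (undef), so links are collected; giving a base URL
# makes the parser absolutize every link it finds.
my $parser = HTML::LinkExtor->new( undef, $base );
$parser->parse($html);
$parser->eof;

my @hrefs;
for my $link ( $parser->links ) {
  my ($tag, %attr) = @$link;      # e.g. ('a', href => URI object)
  next unless $tag eq 'a' and defined $attr{href};
  push @hrefs, "$attr{href}";
}
print "$_\n" for @hrefs;
```

Unlike the regexp, this doesn't care about the case of the tag or how
the attribute is quoted, and it skips links that aren't in "a" tags.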

Other Browser Attributes

LWP::UserAgent objects have many attributes for controlling how they
work. Here are a few notable ones:

· "$browser->timeout(15);"

  This sets this browser object to give up on requests that don't
  answer within 15 seconds.

· "$browser->protocols_allowed( [ 'http', 'gopher'] );"

  This sets this browser object to not speak any protocols other than
  HTTP and gopher. If it tries accessing any other kind of URL (like
  an "ftp:" or "mailto:" or "news:" URL), then it won't actually try
  connecting, but instead will immediately return an error code 500,
  with a message like "Access to 'ftp' URIs has been disabled".

· "use LWP::ConnCache; $browser->conn_cache(LWP::ConnCache->new());"

  This tells the browser object to try using the HTTP/1.1
  "Keep-Alive" feature, which speeds up requests by reusing the same
  socket connection for multiple requests to the same server.

· "$browser->agent( 'SomeName/1.23 (more info here maybe)' )"

  This changes how the browser object will identify itself in the
  default "User-Agent" line in its HTTP requests. By default, it'll
  send "libwww-perl/versionnumber", like "libwww-perl/5.65". You can
  change that to something more descriptive like this:

      $browser->agent( 'SomeName/3.14 (contact@robotplexus.int)' );

  Or if need be, you can go in disguise, like this:

      $browser->agent( 'Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)' );

· "push @{ $browser->requests_redirectable }, 'POST';"

  This tells this browser to obey redirection responses to POST
  requests (like most modern interactive browsers), even though the
  HTTP RFC says that should not normally be done.

For more options and information, see the full documentation for
LWP::UserAgent.
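
Several of these attributes can also be passed to the constructor. The
sketch below sets a few of them and then tries an "ftp:" URL to show
the local refusal described above; the FTP host is a hypothetical
placeholder, and since the request is turned down before any connection
is attempted, the example needs no network.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new(
  timeout           => 15,
  agent             => 'SomeName/1.23 (more info here maybe)',
  protocols_allowed => [ 'http', 'gopher' ],
);
push @{ $browser->requests_redirectable }, 'POST';

print "Timeout: ", $browser->timeout, " seconds\n";
print "Redirectable: @{ $browser->requests_redirectable }\n";

# ftp is not in protocols_allowed, so this is refused locally.
my $response = $browser->get('ftp://ftp.example.int/pub/');   # hypothetical URL
print "ftp request: ", $response->status_line, "\n";
```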

Writing Polite Robots

If you want to make sure that your LWP-based program respects
robots.txt files and doesn't make too many requests too fast, you can
use the LWP::RobotUA class instead of the LWP::UserAgent class.

The LWP::RobotUA class is just like LWP::UserAgent, and you can use it
like so:

    use LWP::RobotUA;
    my $browser = LWP::RobotUA->new('YourSuperBot/1.34', 'you@yoursite.com');
      # Your bot's name and your email address

    my $response = $browser->get($url);

But LWP::RobotUA adds these features:

· If the robots.txt on $url's server forbids you from accessing $url,
  then the $browser object (assuming it's of class LWP::RobotUA)
  won't actually request it, but instead will give you back (in
  $response) a 403 error with a message "Forbidden by robots.txt".
  That is, if you have this line:

      die "$url -- ", $response->status_line, "\nAborted"
       unless $response->is_success;

  then the program would die with an error message like this:

      http://whatever.site.int/pith/x.html -- 403 Forbidden by robots.txt
      Aborted at whateverprogram.pl line 1234

· If this $browser object sees that the last time it talked to $url's
  server was too recently, then it will pause (via "sleep") to avoid
  making too many requests too often. By default it pauses for one
  minute, but you can control that with the
  "$browser->delay( minutes )" attribute.

  For example, this code:

      $browser->delay( 7/60 );

  ...means that this browser will pause when it needs to avoid
  talking to any given server more than once every 7 seconds.

For more options and information, see the full documentation for
LWP::RobotUA.
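
To see the robots.txt logic in isolation, you can feed a robots.txt
file to WWW::RobotRules, the module that LWP::RobotUA uses for this
job. The sketch below is only an illustration: the robots.txt content
is invented, and it is parsed from a string so that nothing is fetched.

```perl
use strict;
use warnings;
use LWP::RobotUA;
use WWW::RobotRules;

my $browser = LWP::RobotUA->new( 'YourSuperBot/1.34', 'you@yoursite.com' );
$browser->delay( 7/60 );    # the delay is in minutes, so this is 7 seconds
printf "Pause between requests: %.0f seconds\n", $browser->delay * 60;

# Parse a made-up robots.txt the way a robot would.
my $rules = WWW::RobotRules->new('YourSuperBot/1.34');
$rules->parse( 'http://whatever.site.int/robots.txt', <<'END_ROBOTS' );
User-agent: *
Disallow: /pith/
END_ROBOTS

for my $url ( 'http://whatever.site.int/pith/x.html',
              'http://whatever.site.int/index.html' ) {
  printf "%s: %s\n", $url,
    $rules->allowed($url) ? 'allowed' : 'forbidden by robots.txt';
}
```

When you use LWP::RobotUA itself, fetching and parsing robots.txt
happens behind the scenes the first time you request a URL on a given
server.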

Using Proxies

In some cases, you will want to (or will have to) use proxies for
accessing certain sites and/or using certain protocols. This is most
commonly the case when your LWP program is running (or could be
running) on a machine that is behind a firewall.

To make a browser object use proxies that are defined in the usual
environment variables ("HTTP_PROXY", etc.), just call the "env_proxy"
method on a user-agent object before you go making any requests on it.
Specifically:

    use LWP::UserAgent;
    my $browser = LWP::UserAgent->new;

    # And before you go making any requests:
    $browser->env_proxy;

For more information on proxy parameters, see the LWP::UserAgent
documentation, specifically the "proxy", "env_proxy", and "no_proxy"
methods.
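
If the environment variables aren't set, the "proxy" and "no_proxy"
methods let you say the same thing explicitly. Here is a minimal
sketch; the proxy address and the host names are hypothetical
placeholders, and no request is made.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Send http: and ftp: requests through a (hypothetical) proxy...
$browser->proxy( [ 'http', 'ftp' ], 'http://proxy.example.int:8080/' );

# ...except for requests to these hosts and domains.
$browser->no_proxy( 'localhost', 'intranet.example.int' );

# With just a scheme, proxy() reports the current setting.
for my $scheme (qw(http ftp gopher)) {
  my $proxy = $browser->proxy($scheme);
  print "$scheme: ", ( defined $proxy ? $proxy : 'direct connection' ), "\n";
}
```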

HTTP Authentication

Many web sites restrict access to documents by using "HTTP
Authentication". This isn't just any form of "enter your password"
restriction, but is a specific mechanism where the HTTP server sends
the browser an HTTP code that says "That document is part of a
protected 'realm', and you can access it only if you re-request it and
add some special authorization headers to your request".

For example, the Unicode.org admins stop email-harvesting bots from
harvesting the contents of their mailing list archives, by protecting
them with HTTP Authentication, and then publicly stating the username
and password (at "http://www.unicode.org/mail-arch/") -- namely
username "unicode-ml" and password "unicode".

For example, consider this URL, which is part of the protected area of
the web site:

    http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html

If you access that with a browser, you'll get a prompt like "Enter
username and password for 'Unicode-MailList-Archives' at server
'www.unicode.org'".

In LWP, if you just request that URL, like this:

    use LWP;
    my $browser = LWP::UserAgent->new;

    my $url =
     'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';
    my $response = $browser->get($url);

    die "Error: ", $response->header('WWW-Authenticate') || 'Error accessing',
      #  ('WWW-Authenticate' is the realm-name)
      "\n ", $response->status_line, "\n at $url\n Aborting"
     unless $response->is_success;

Then you'll get this error:

    Error: Basic realm="Unicode-MailList-Archives"
     401 Authorization Required
     at http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html
     Aborting at auth1.pl line 9.  [or wherever]

...because the $browser doesn't know the username and password for
that realm ("Unicode-MailList-Archives") at that host
("www.unicode.org"). The simplest way to let the browser know about
this is to use the "credentials" method to let it know about a username
and password that it can try using for that realm at that host. The
syntax is:

    $browser->credentials(
      'servername:portnumber',
      'realm-name',
      'username' => 'password'
    );

In most cases, the port number is 80, the default TCP/IP port for HTTP;
and you usually call the "credentials" method before you make any
requests. For example:

    $browser->credentials(
      'reports.mybazouki.com:80',
      'web_server_usage_reports',
      'plinky' => 'banjo123'
    );

So if we add the following to the program above, right after the
"$browser = LWP::UserAgent->new;" line...

    $browser->credentials(  # add this to our $browser's "key ring"
      'www.unicode.org:80',
      'Unicode-MailList-Archives',
      'unicode-ml' => 'unicode'
    );

...then when we run it, the request succeeds, instead of causing the
"die" to be called.
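
For the curious, here is a sketch of what those "special authorization
headers" look like for the Basic scheme. It reads the stored username
and password back out of the browser object (this assumes a reasonably
recent LWP, where "credentials" called with just a host and a realm
returns what was stored) and puts them on a request by hand, which is
roughly what the browser does for you when it re-requests the URL.
Nothing is sent.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $browser = LWP::UserAgent->new;
$browser->credentials(
  'www.unicode.org:80',
  'Unicode-MailList-Archives',
  'unicode-ml' => 'unicode'
);

# Ask the "key ring" what it holds for that realm at that host.
my ($user, $pass) = $browser->credentials(
  'www.unicode.org:80', 'Unicode-MailList-Archives' );

# Build the header that Basic authentication adds to the request.
my $request = HTTP::Request->new(
  GET => 'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html' );
$request->authorization_basic( $user, $pass );

print "Authorization: ", $request->header('Authorization'), "\n";
```

The value after "Basic" is just "username:password" in base64, which
is encoding and not encryption, so anyone who can see the request can
read it.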

Accessing HTTPS URLs

When you access an HTTPS URL, it'll work for you just like an HTTP URL
would -- if your LWP installation has HTTPS support (via an appropriate
Secure Sockets Layer library). For example:

    use LWP;
    my $url = 'https://www.paypal.com/';   # Yes, HTTPS!
    my $browser = LWP::UserAgent->new;
    my $response = $browser->get($url);
    die "Error at $url\n ", $response->status_line, "\n Aborting"
     unless $response->is_success;
    print "Whee, it worked!  I got that ",
     $response->content_type, " document!\n";

If your LWP installation doesn't have HTTPS support set up, then the
response will be unsuccessful, and you'll get this error message:

    Error at https://www.paypal.com/
     501 Protocol scheme 'https' is not supported
     Aborting at paypal.pl line 7.   [or whatever program and line]

If your LWP installation does have HTTPS support installed, then the
response should be successful, and you should be able to consult
$response just like with any normal HTTP response.

For information about installing HTTPS support for your LWP
installation, see the helpful README.SSL file that comes in the
libwww-perl distribution.
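
If you'd rather find out up front than wait for the 501, one option is
the "is_protocol_supported" method of LWP::UserAgent, sketched here. It
only checks whether a handler for the scheme can be loaded, and makes
no request.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Report which schemes this LWP installation can handle.
my %supported;
for my $scheme (qw(http https)) {
  $supported{$scheme} = $browser->is_protocol_supported($scheme) ? 1 : 0;
  printf "%-5s %s\n", $scheme,
    $supported{$scheme} ? 'supported' : 'NOT supported';
}
```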

Getting Large Documents

When you're requesting a large (or at least potentially large)
document, a problem with the normal way of using the request methods
(like "$response = $browser->get($url)") is that the response object in
memory will have to hold the whole document -- in memory. If the
response is a thirty megabyte file, this is likely to be quite an
imposition on this process's memory usage.

A notable alternative is to have LWP save the content to a file on
disk, instead of saving it up in memory. This is the syntax to use:

    $response = $browser->get($url,
      ':content_file' => $filespec,
    );

For example,

    $response = $browser->get('http://search.cpan.org/',
      ':content_file' => '/tmp/sco.html'
    );

When you use this ":content_file" option, the $response will have all
the normal header lines, but "$response->content" will be empty.

Note that this ":content_file" option isn't supported under older
versions of LWP, so you should consider adding "use LWP 5.66;" to check
the LWP version, if you think your program might run on systems with
older versions.

If you need to be compatible with older LWP versions, then use this
syntax, which does the same thing:

    use HTTP::Request::Common;
    $response = $browser->request( GET($url), $filespec );
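
Here is a small sketch you can run to watch ":content_file" at work.
To stay off the network, it fetches a "file:" URL for a scratch file
that it creates first; the file names and the 100,000-byte size are
arbitrary choices for the example, and with an "http:" URL the call
would be the same.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use URI::file;
use LWP::UserAgent;

# Create a stand-in "large" document on disk.
my $dir    = tempdir( CLEANUP => 1 );
my $source = "$dir/big.txt";
open my $fh, '>', $source or die "Can't write $source: $!";
print $fh "x" x 100_000;
close $fh;

my $url      = URI::file->new($source)->as_string;
my $filespec = "$dir/saved.txt";

my $browser  = LWP::UserAgent->new;
my $response = $browser->get( $url, ':content_file' => $filespec );
die "Can't get $url -- ", $response->status_line
 unless $response->is_success;

# The body went to the file; the response object holds none of it.
printf "Saved %d bytes to disk; %d bytes held in memory\n",
  ( -s $filespec ), length( $response->content );
```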

SEE ALSO
Remember, this article is just the most rudimentary introduction to LWP
-- to learn more about LWP and LWP-related tasks, you really must read
from the following:

· LWP::Simple -- simple functions for getting/heading/mirroring URLs

· LWP -- overview of the libwww-perl modules

· LWP::UserAgent -- the class for objects that represent "virtual
  browsers"

· HTTP::Response -- the class for objects that represent the response
  to an LWP request, as in "$response = $browser->get(...)"

· HTTP::Message and HTTP::Headers -- classes that provide more
  methods to HTTP::Response.

· URI -- class for objects that represent absolute or relative URLs

· URI::Escape -- functions for URL-escaping and URL-unescaping
  strings (like turning "this & that" to and from
  "this%20%26%20that").

· HTML::Entities -- functions for HTML-escaping and HTML-unescaping
  strings (like turning "C. & E. Brontë" to and from
  "C. &amp; E. Bront&euml;")

· HTML::TokeParser and HTML::TreeBuilder -- classes for parsing HTML

· HTML::LinkExtor -- class for finding links in HTML documents

· The book Perl & LWP by Sean M. Burke. O'Reilly & Associates, 2002.
  ISBN: 0-596-00178-9. "http://www.oreilly.com/catalog/perllwp/"

COPYRIGHT
Copyright 2002, Sean M. Burke. You can redistribute this document
and/or modify it, but only under the same terms as Perl itself.

AUTHOR
Sean M. Burke "sburke@cpan.org"

perl v5.8.8                       2004-04-06                       lwptut(3)