lwptut(3) User Contributed Perl Documentation lwptut(3)
2
3
4
6 lwptut -- An LWP Tutorial
7
LWP (short for "Library for WWW in Perl") is a very popular group of
Perl modules for accessing data on the Web. Like most Perl module
distributions, each of LWP's component modules comes with documentation
that is a complete reference to its interface. However, there are so
many modules in LWP that it's hard to know where to start looking for
information on how to do even the simplest, most common things.
15
16 Really introducing you to using LWP would require a whole book -- a
17 book that just happens to exist, called Perl & LWP. But this article
18 should give you a taste of how you can go about some common tasks with
19 LWP.
20
21 Getting documents with LWP::Simple
22 If you just want to get what's at a particular URL, the simplest way to
23 do it is LWP::Simple's functions.
24
25 In a Perl program, you can call its "get($url)" function. It will try
26 getting that URL's content. If it works, then it'll return the
27 content; but if there's some error, it'll return undef.
28
    my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';
      # Just an example: the URL for the most recent /Fresh Air/ show

    use LWP::Simple;
    my $content = get $url;
    die "Couldn't get $url" unless defined $content;

    # Then go do things with $content, like this:

    if($content =~ m/jazz/i) {
      print "They're talking about jazz today on Fresh Air!\n";
    }
    else {
      print "Fresh Air is apparently jazzless today.\n";
    }
44
45 The handiest variant on "get" is "getprint", which is useful in Perl
46 one-liners. If it can get the page whose URL you provide, it sends it
47 to STDOUT; otherwise it complains to STDERR.
48
49 % perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'"
50
51 That is the URL of a plain text file that lists new files in CPAN in
52 the past two weeks. You can easily make it part of a tidy little shell
53 command, like this one that mails you the list of new "Acme::" modules:
54
55 % perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'" \
56 | grep "/by-module/Acme" | mail -s "New Acme modules! Joy!" $USER
57
58 There are other useful functions in LWP::Simple, including one function
59 for running a HEAD request on a URL (useful for checking links, or
60 getting the last-revised time of a URL), and two functions for
61 saving/mirroring a URL to a local file. See the LWP::Simple
62 documentation for the full details, or chapter 2 of Perl & LWP for more
63 examples.
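
As a rough sketch of the saving function mentioned above, the following
uses "getstore" together with the "is_success" status helper that
LWP::Simple exports. So that it can run without network access, it
fetches a "file:" URL pointing at a temporary file that the script
creates itself; the file names and contents are made up for
illustration.

```perl
use strict;
use warnings;
use LWP::Simple qw(get getstore is_success);
use File::Temp qw(tempdir);
use URI::file;

# Create a small local document to stand in for a remote page.
my $dir    = tempdir( CLEANUP => 1 );
my $source = "$dir/source.txt";
open my $out, '>', $source or die "Can't write $source: $!";
print {$out} "Tonight: an hour of jazz\n";
close $out;

# Turn the local path into an absolute file: URL.
my $url = URI::file->new_abs($source)->as_string;

# getstore() saves the content to a file and returns the status code.
my $copy   = "$dir/copy.txt";
my $status = getstore( $url, $copy );
die "Couldn't save $url (status $status)" unless is_success($status);

# get() returns the content itself, or undef on failure.
my $content = get($url);
die "Couldn't get $url" unless defined $content;
print "Saved a copy, and it mentions jazz\n" if $content =~ m/jazz/i;
```

With a real "http:" URL in place of the "file:" one, the calls are the
same.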
64
65 The Basics of the LWP Class Model
LWP::Simple's functions are handy for simple cases, but they don't
support cookies or authorization, don't support setting header lines
in the HTTP request, and generally don't support reading header lines
in the HTTP response (notably the full HTTP error message, in case of
an error). To get at all those features, you'll have to use the full
LWP class model.
72
73 While LWP consists of dozens of classes, the main two that you have to
74 understand are LWP::UserAgent and HTTP::Response. LWP::UserAgent is a
75 class for "virtual browsers" which you use for performing requests, and
76 HTTP::Response is a class for the responses (or error messages) that
77 you get back from those requests.
78
79 The basic idiom is "$response = $browser->get($url)", or more fully
80 illustrated:
81
    # Early in your program:

    use LWP 5.64; # Loads all important LWP classes, and makes
                  #  sure your version is reasonably recent.

    my $browser = LWP::UserAgent->new;

    ...

    # Then later, whenever you need to make a get request:
    my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';

    my $response = $browser->get( $url );
    die "Can't get $url -- ", $response->status_line
     unless $response->is_success;

    die "Hey, I was expecting HTML, not ", $response->content_type
     unless $response->content_type eq 'text/html';
       # or whatever content-type you're equipped to deal with

    # Otherwise, process the content somehow:

    if($response->decoded_content =~ m/jazz/i) {
      print "They're talking about jazz today on Fresh Air!\n";
    }
    else {
      print "Fresh Air is apparently jazzless today.\n";
    }
110
111 There are two objects involved: $browser, which holds an object of
112 class LWP::UserAgent, and then the $response object, which is of class
113 HTTP::Response. You really need only one browser object per program;
114 but every time you make a request, you get back a new HTTP::Response
115 object, which will have some interesting attributes:
116
117 · A status code indicating success or failure (which you can test
118 with "$response->is_success").
119
120 · An HTTP status line that is hopefully informative if there's
121 failure (which you can see with "$response->status_line", returning
122 something like "404 Not Found").
123
124 · A MIME content-type like "text/html", "image/gif",
125 "application/xml", etc., which you can see with
126 "$response->content_type"
127
128 · The actual content of the response, in
129 "$response->decoded_content". If the response is HTML, that's
130 where the HTML source will be; if it's a GIF, then
131 "$response->decoded_content" will be the binary GIF data.
132
133 · And dozens of other convenient and more specific methods that are
134 documented in the docs for HTTP::Response, and its superclasses
135 HTTP::Message and HTTP::Headers.
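
To get a feel for the attributes listed above without making any
request, you can build HTTP::Response objects by hand and call the same
methods on them. This is only a sketch: the status codes, header, and
content below are invented stand-ins for what "$browser->get" would
return.

```perl
use strict;
use warnings;
use HTTP::Response;

# A hand-built failure response.
my $missing = HTTP::Response->new( 404, 'Not Found' );
print "Success? ", ( $missing->is_success ? "yes" : "no" ), "\n";
print "Status line: ", $missing->status_line, "\n";

# A hand-built success response with one header and some content.
my $page = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    "<html><body>Jazz tonight</body></html>",
);

# In list context content_type also returns the parameters,
# so ask for the scalar value when you only want the MIME type.
print "Content type: ", scalar( $page->content_type ), "\n";
print "Mentions jazz\n" if $page->decoded_content =~ m/jazz/i;
```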
136
137 Adding Other HTTP Request Headers
138 The most commonly used syntax for requests is "$response =
139 $browser->get($url)", but in truth, you can add extra HTTP header lines
140 to the request by adding a list of key-value pairs after the URL, like
141 so:
142
143 $response = $browser->get( $url, $key1, $value1, $key2, $value2, ... );
144
145 For example, here's how to send some more Netscape-like headers, in
146 case you're dealing with a site that would otherwise reject your
147 request:
148
149 my @ns_headers = (
150 'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
151 'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
152 'Accept-Charset' => 'iso-8859-1,*,utf-8',
153 'Accept-Language' => 'en-US',
154 );
155
156 ...
157
158 $response = $browser->get($url, @ns_headers);
159
160 If you weren't reusing that array, you could just go ahead and do this:
161
162 $response = $browser->get($url,
163 'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
164 'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
165 'Accept-Charset' => 'iso-8859-1,*,utf-8',
166 'Accept-Language' => 'en-US',
167 );
168
169 If you were only ever changing the 'User-Agent' line, you could just
170 change the $browser object's default line from "libwww-perl/5.65" (or
171 the like) to whatever you like, using the LWP::UserAgent "agent"
172 method:
173
174 $browser->agent('Mozilla/4.76 [en] (Win98; U)');
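
If you are curious what those key-value pairs turn into, one way to
look (a sketch, not something the request methods require) is to build
the request object yourself with the "GET" function from
HTTP::Request::Common, which accepts the same URL-plus-headers list,
and then inspect it. Nothing is sent; "www.example.com" is a
placeholder host.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(GET);

my @ns_headers = (
    'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept-Language' => 'en-US',
);

# GET() builds an HTTP::Request object without sending anything.
my $request = GET( 'http://www.example.com/', @ns_headers );

print $request->method, " ", $request->uri, "\n";
print "User-Agent: ",      scalar $request->header('User-Agent'),      "\n";
print "Accept-Language: ", scalar $request->header('Accept-Language'), "\n";
```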
175
176 Enabling Cookies
177 A default LWP::UserAgent object acts like a browser with its cookies
178 support turned off. There are various ways of turning it on, by setting
179 its "cookie_jar" attribute. A "cookie jar" is an object representing a
180 little database of all the HTTP cookies that a browser can know about.
181 It can correspond to a file on disk (the way Netscape uses its
182 cookies.txt file), or it can be just an in-memory object that starts
183 out empty, and whose collection of cookies will disappear once the
184 program is finished running.
185
186 To give a browser an in-memory empty cookie jar, you set its
187 "cookie_jar" attribute like so:
188
189 $browser->cookie_jar({});
190
191 To give it a copy that will be read from a file on disk, and will be
192 saved to it when the program is finished running, set the "cookie_jar"
193 attribute like this:
194
195 use HTTP::Cookies;
196 $browser->cookie_jar( HTTP::Cookies->new(
197 'file' => '/some/where/cookies.lwp',
198 # where to read/write cookies
199 'autosave' => 1,
200 # save it to disk when done
201 ));
202
That file will be in an LWP-specific format. If you want to access the
cookies in your Netscape cookies file, you can use the
HTTP::Cookies::Netscape class:
206
207 use HTTP::Cookies;
208 # yes, loads HTTP::Cookies::Netscape too
209
210 $browser->cookie_jar( HTTP::Cookies::Netscape->new(
211 'file' => 'c:/Program Files/Netscape/Users/DIR-NAME-HERE/cookies.txt',
212 # where to read cookies
213 ));
214
215 You could add an "'autosave' => 1" line as further above, but at time
216 of writing, it's uncertain whether Netscape might discard some of the
217 cookies you could be writing back to disk.
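
To see roughly what a cookie jar does on the browser's behalf, here is
a sketch that drives an in-memory HTTP::Cookies object by hand, with a
fabricated response instead of a real server. The host name, paths, and
cookie value are assumptions made up for the example; in normal use the
browser object calls these two methods for you.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;
use HTTP::Response;

my $jar = HTTP::Cookies->new;    # in-memory jar, starts out empty

# Pretend the server answered our first request with a Set-Cookie header.
my $first    = HTTP::Request->new( GET => 'http://www.example.com/login' );
my $response = HTTP::Response->new( 200, 'OK',
    [ 'Set-Cookie' => 'session=abc123; path=/' ] );
$response->request($first);
$jar->extract_cookies($response);    # the jar remembers the cookie

# A later request to the same host gets the cookie added back.
my $second = HTTP::Request->new( GET => 'http://www.example.com/account' );
$jar->add_cookie_header($second);

my $cookie = $second->header('Cookie');
print "Cookie header: ", ( defined $cookie ? $cookie : '(none)' ), "\n";
```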
218
219 Posting Form Data
220 Many HTML forms send data to their server using an HTTP POST request,
221 which you can send with this syntax:
222
223 $response = $browser->post( $url,
224 [
225 formkey1 => value1,
226 formkey2 => value2,
227 ...
228 ],
229 );
230
231 Or if you need to send HTTP headers:
232
233 $response = $browser->post( $url,
234 [
235 formkey1 => value1,
236 formkey2 => value2,
237 ...
238 ],
239 headerkey1 => value1,
240 headerkey2 => value2,
241 );
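
As an illustration of what such a call puts on the wire, the sketch
below builds (but does not send) an equivalent request with the "POST"
function from HTTP::Request::Common and prints the encoded form data.
The URL and the form keys are invented for the example.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Same argument layout as $browser->post: URL, form pairs, then headers.
my $request = POST( 'http://www.example.com/search',
    [ q => 'tarragon', note => 'fresh herbs' ],
    'Accept-Language' => 'en-US',
);

print "Method: ",       $request->method, "\n";
print "Content-Type: ", scalar $request->header('Content-Type'), "\n";
print "Body: ",         $request->content, "\n";
```

The form pairs end up URL-encoded in the request body, which is why the
space in the second value is not sent literally.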
242
243 For example, the following program makes a search request to AltaVista
244 (by sending some form data via an HTTP POST request), and extracts from
245 the HTML the report of the number of matches:
246
    use strict;
    use warnings;
    use LWP 5.64;
    my $browser = LWP::UserAgent->new;

    my $word = 'tarragon';

    my $url = 'http://www.altavista.com/sites/search/web';
    my $response = $browser->post( $url,
      [ 'q' => $word,  # the Altavista query string
        'pg' => 'q', 'avkw' => 'tgz', 'kl' => 'XX',
      ]
    );
    die "$url error: ", $response->status_line
     unless $response->is_success;
    die "Weird content type at $url -- ", $response->content_type
     unless $response->content_is_html;

    if( $response->decoded_content =~ m{AltaVista found ([0-9,]+) results} ) {
      # The substring will be like "AltaVista found 2,345 results"
      print "$word: $1\n";
    }
    else {
      print "Couldn't find the match-string in the response\n";
    }
272
273 Sending GET Form Data
274 Some HTML forms convey their form data not by sending the data in an
275 HTTP POST request, but by making a normal GET request with the data
276 stuck on the end of the URL. For example, if you went to "imdb.com"
277 and ran a search on "Blade Runner", the URL you'd see in your browser
278 window would be:
279
280 http://us.imdb.com/Tsearch?title=Blade%20Runner&restrict=Movies+and+TV
281
282 To run the same search with LWP, you'd use this idiom, which involves
283 the URI class:
284
285 use URI;
286 my $url = URI->new( 'http://us.imdb.com/Tsearch' );
287 # makes an object representing the URL
288
289 $url->query_form( # And here the form data pairs:
290 'title' => 'Blade Runner',
291 'restrict' => 'Movies and TV',
292 );
293
294 my $response = $browser->get($url);
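
Here is a small standalone sketch of what "query_form" does to the URL
object, with no request made. Note that the URI module writes spaces in
form data as "+" rather than "%20"; servers decode both the same way.

```perl
use strict;
use warnings;
use URI;

my $url = URI->new('http://us.imdb.com/Tsearch');
$url->query_form(
    'title'    => 'Blade Runner',
    'restrict' => 'Movies and TV',
);
print "$url\n";

# Called without arguments, query_form returns the decoded pairs again.
my %form = $url->query_form;
print "$_ = $form{$_}\n" for sort keys %form;
```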
295
296 See chapter 5 of Perl & LWP for a longer discussion of HTML forms and
297 of form data, and chapters 6 through 9 for a longer discussion of
298 extracting data from HTML.
299
300 Absolutizing URLs
The URI class that we just mentioned above provides all sorts of
methods for accessing and modifying parts of URLs (such as asking what
sort of URL it is with "$url->scheme", and asking what host it refers
to with "$url->host", and so on, as described in the docs for the URI
class). However, the methods of most immediate interest are the
"query_form" method seen above, and now the "new_abs" method for taking
a probably-relative URL string (like "../foo.html") and getting back an
absolute URL (like "http://www.perl.com/stuff/foo.html"), as shown
here:
310
311 use URI;
312 $abs = URI->new_abs($maybe_relative, $base);
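
A quick sketch of "new_abs" on a few inputs follows. The base URL and
the relative strings are invented for illustration; the last one is
already absolute, so it comes back unchanged.

```perl
use strict;
use warnings;
use URI;

my $base = 'http://www.perl.com/stuff/bar/index.html';

for my $maybe_relative ( '../foo.html', 'pics/camel.gif',
                         'http://www.cpan.org/' ) {
    my $abs = URI->new_abs( $maybe_relative, $base );
    print "$maybe_relative => $abs (host: ", $abs->host, ")\n";
}
```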
313
314 For example, consider this program that matches URLs in the HTML list
315 of new modules in CPAN:
316
    use strict;
    use warnings;
    use LWP;
    my $browser = LWP::UserAgent->new;

    my $url = 'http://www.cpan.org/RECENT.html';
    my $response = $browser->get($url);
    die "Can't get $url -- ", $response->status_line
     unless $response->is_success;

    my $html = $response->decoded_content;
    while( $html =~ m/<A HREF=\"(.*?)\"/g ) {
      print "$1\n";
    }
331
332 When run, it emits output that starts out something like this:
333
334 MIRRORING.FROM
335 RECENT
336 RECENT.html
337 authors/00whois.html
338 authors/01mailrc.txt.gz
339 authors/id/A/AA/AASSAD/CHECKSUMS
340 ...
341
342 However, if you actually want to have those be absolute URLs, you can
343 use the URI module's "new_abs" method, by changing the "while" loop to
344 this:
345
346 while( $html =~ m/<A HREF=\"(.*?)\"/g ) {
347 print URI->new_abs( $1, $response->base ) ,"\n";
348 }
349
350 (The "$response->base" method from HTTP::Message is for returning what
351 URL should be used for resolving relative URLs -- it's usually just the
352 same as the URL that you requested.)
353
354 That program then emits nicely absolute URLs:
355
356 http://www.cpan.org/MIRRORING.FROM
357 http://www.cpan.org/RECENT
358 http://www.cpan.org/RECENT.html
359 http://www.cpan.org/authors/00whois.html
360 http://www.cpan.org/authors/01mailrc.txt.gz
361 http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS
362 ...
363
364 See chapter 4 of Perl & LWP for a longer discussion of URI objects.
365
366 Of course, using a regexp to match hrefs is a bit simplistic, and for
367 more robust programs, you'll probably want to use an HTML-parsing
368 module like HTML::LinkExtor or HTML::TokeParser or even maybe
369 HTML::TreeBuilder.
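
As one possible shape for that, here is a sketch using HTML::LinkExtor
on a literal snippet of HTML (invented for the example) rather than a
fetched page. Passing a base URL to the constructor makes the extracted
URLs absolute, so no separate "new_abs" call is needed.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

my $base = 'http://www.cpan.org/RECENT.html';
my $html = <<'END_HTML';
<html><body>
<a href="authors/00whois.html">Who's who</a>
<A HREF="RECENT">Recent uploads</A>
<img src="misc/camel.png">
</body></html>
END_HTML

# With no callback (undef), links are collected for the links() method.
my $parser = HTML::LinkExtor->new( undef, $base );
$parser->parse($html);
$parser->eof;

# Each link is an array ref: the tag name, then attribute/URL pairs.
my @hrefs;
for my $link ( $parser->links ) {
    my ( $tag, %attr ) = @$link;
    push @hrefs, "$attr{href}" if $tag eq 'a' && defined $attr{href};
}
print "$_\n" for @hrefs;
```

Unlike the regexp, this copes with either letter case in the tag and
attribute names, and it skips the image link because only "a" tags are
kept.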
370
371 Other Browser Attributes
372 LWP::UserAgent objects have many attributes for controlling how they
373 work. Here are a few notable ones:
374
375 · "$browser->timeout(15);"
376
377 This sets this browser object to give up on requests that don't
378 answer within 15 seconds.
379
380 · "$browser->protocols_allowed( [ 'http', 'gopher'] );"
381
382 This sets this browser object to not speak any protocols other than
383 HTTP and gopher. If it tries accessing any other kind of URL (like
384 an "ftp:" or "mailto:" or "news:" URL), then it won't actually try
385 connecting, but instead will immediately return an error code 500,
386 with a message like "Access to 'ftp' URIs has been disabled".
387
388 · "use LWP::ConnCache; $browser->conn_cache(LWP::ConnCache->new());"
389
390 This tells the browser object to try using the HTTP/1.1 "Keep-
391 Alive" feature, which speeds up requests by reusing the same socket
392 connection for multiple requests to the same server.
393
394 · "$browser->agent( 'SomeName/1.23 (more info here maybe)' )"
395
  This changes how the browser object will identify itself in the
  default "User-Agent" line in its HTTP requests. By default, it'll
  send "libwww-perl/versionnumber", like "libwww-perl/5.65". You can
  change that to something more descriptive like this:
400
401 $browser->agent( 'SomeName/3.14 (contact@robotplexus.int)' );
402
403 Or if need be, you can go in disguise, like this:
404
405 $browser->agent( 'Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)' );
406
· "push @{ $browser->requests_redirectable }, 'POST';"
408
409 This tells this browser to obey redirection responses to POST
410 requests (like most modern interactive browsers), even though the
411 HTTP RFC says that should not normally be done.
412
413 For more options and information, see the full documentation for
414 LWP::UserAgent.
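
The sketch below sets several of these attributes on one browser object
and reads them back. It also tries a URL whose scheme is not in the
allowed list, to show that the refusal happens locally, without any
connection attempt; "ftp.example.com" is a placeholder host.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;
$browser->timeout(15);
$browser->agent('SomeName/3.14 (contact@robotplexus.int)');
$browser->protocols_allowed( [ 'http', 'gopher' ] );
push @{ $browser->requests_redirectable }, 'POST';

print "Timeout: ", $browser->timeout, " seconds\n";
print "Agent: ",   $browser->agent,   "\n";
print "Redirectable: @{ $browser->requests_redirectable }\n";

# A disallowed scheme is refused by the browser object itself.
my $response = $browser->get('ftp://ftp.example.com/pub/');
print "ftp request: ", $response->status_line, "\n";
```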
415
416 Writing Polite Robots
417 If you want to make sure that your LWP-based program respects
418 robots.txt files and doesn't make too many requests too fast, you can
419 use the LWP::RobotUA class instead of the LWP::UserAgent class.
420
The LWP::RobotUA class is just like LWP::UserAgent, and you can use it
like so:
423
424 use LWP::RobotUA;
425 my $browser = LWP::RobotUA->new('YourSuperBot/1.34', 'you@yoursite.com');
426 # Your bot's name and your email address
427
428 my $response = $browser->get($url);
429
But LWP::RobotUA adds these features:
431
432 · If the robots.txt on $url's server forbids you from accessing $url,
433 then the $browser object (assuming it's of class LWP::RobotUA)
434 won't actually request it, but instead will give you back (in
435 $response) a 403 error with a message "Forbidden by robots.txt".
436 That is, if you have this line:
437
438 die "$url -- ", $response->status_line, "\nAborted"
439 unless $response->is_success;
440
441 then the program would die with an error message like this:
442
443 http://whatever.site.int/pith/x.html -- 403 Forbidden by robots.txt
444 Aborted at whateverprogram.pl line 1234
445
· If this $browser object sees that it last talked to $url's server
  too recently, then it will pause (via "sleep") to avoid making too
  many requests too often. By default it pauses for one minute, but
  you can control that with the "$browser->delay( minutes )"
  attribute.
451
452 For example, this code:
453
454 $browser->delay( 7/60 );
455
456 ...means that this browser will pause when it needs to avoid
457 talking to any given server more than once every 7 seconds.
458
459 For more options and information, see the full documentation for
460 LWP::RobotUA.
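
Because "delay" is measured in minutes, it is easy to get the unit
wrong. A minimal sketch that sets the delay and converts it back to
seconds for display (the bot name and address are the placeholders used
above):

```perl
use strict;
use warnings;
use LWP::RobotUA;

my $browser = LWP::RobotUA->new( 'YourSuperBot/1.34', 'you@yoursite.com' );

$browser->delay( 7 / 60 );    # minutes, so this means 7 seconds

printf "Minimum gap between requests to one server: %.0f seconds\n",
    $browser->delay * 60;
print "Identifies itself as: ", $browser->agent, "\n";
print "Contact address sent: ", $browser->from,  "\n";
```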
461
462 Using Proxies
463 In some cases, you will want to (or will have to) use proxies for
464 accessing certain sites and/or using certain protocols. This is most
465 commonly the case when your LWP program is running (or could be
466 running) on a machine that is behind a firewall.
467
To make a browser object use proxies that are defined in the usual
environment variables ("HTTP_PROXY", etc.), just call the "env_proxy"
method on a user-agent object before you go making any requests on it.
Specifically:
472
473 use LWP::UserAgent;
474 my $browser = LWP::UserAgent->new;
475
476 # And before you go making any requests:
477 $browser->env_proxy;
478
479 For more information on proxy parameters, see the LWP::UserAgent
480 documentation, specifically the "proxy", "env_proxy", and "no_proxy"
481 methods.
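
If the environment variables aren't set, the "proxy" and "no_proxy"
methods can be called directly instead. The following is a sketch only;
the proxy address and the domain names are made up, and nothing is
requested.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Route http and ftp requests through one (hypothetical) proxy...
$browser->proxy( [ 'http', 'ftp' ], 'http://proxy.example.int:8080/' );

# ...except for requests to these domains.
$browser->no_proxy( 'localhost', 'intranet.example.int' );

# With only a scheme, proxy() reports the current setting.
print "http proxy: ", $browser->proxy('http'), "\n";
print "ftp proxy:  ", $browser->proxy('ftp'),  "\n";
```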
482
483 HTTP Authentication
484 Many web sites restrict access to documents by using "HTTP
485 Authentication". This isn't just any form of "enter your password"
486 restriction, but is a specific mechanism where the HTTP server sends
487 the browser an HTTP code that says "That document is part of a
488 protected 'realm', and you can access it only if you re-request it and
489 add some special authorization headers to your request".
490
491 For example, the Unicode.org admins stop email-harvesting bots from
492 harvesting the contents of their mailing list archives, by protecting
493 them with HTTP Authentication, and then publicly stating the username
494 and password (at "http://www.unicode.org/mail-arch/") -- namely
495 username "unicode-ml" and password "unicode".
496
Consider this URL, which is part of the protected area of the web
site:
499
500 http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html
501
502 If you access that with a browser, you'll get a prompt like "Enter
503 username and password for 'Unicode-MailList-Archives' at server
504 'www.unicode.org'".
505
506 In LWP, if you just request that URL, like this:
507
    use LWP;
    my $browser = LWP::UserAgent->new;

    my $url =
     'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';
    my $response = $browser->get($url);

    die "Error: ", $response->header('WWW-Authenticate') || 'Error accessing',
      #  ('WWW-Authenticate' is the realm-name)
      "\n ", $response->status_line, "\n at $url\n Aborting"
     unless $response->is_success;
519
520 Then you'll get this error:
521
522 Error: Basic realm="Unicode-MailList-Archives"
523 401 Authorization Required
524 at http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html
525 Aborting at auth1.pl line 9. [or wherever]
526
...because the $browser doesn't know the username and password for
that realm ("Unicode-MailList-Archives") at that host
("www.unicode.org"). The simplest way to let the browser know about
this is to use the "credentials" method to let it know about a username
and password that it can try using for that realm at that host. The
syntax is:
533
534 $browser->credentials(
535 'servername:portnumber',
536 'realm-name',
537 'username' => 'password'
538 );
539
540 In most cases, the port number is 80, the default TCP/IP port for HTTP;
541 and you usually call the "credentials" method before you make any
542 requests. For example:
543
544 $browser->credentials(
545 'reports.mybazouki.com:80',
546 'web_server_usage_reports',
547 'plinky' => 'banjo123'
548 );
549
550 So if we add the following to the program above, right after the
551 "$browser = LWP::UserAgent->new;" line...
552
553 $browser->credentials( # add this to our $browser 's "key ring"
554 'www.unicode.org:80',
555 'Unicode-MailList-Archives',
556 'unicode-ml' => 'unicode'
557 );
558
559 ...then when we run it, the request succeeds, instead of causing the
560 "die" to be called.
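
To check what is on the browser's "key ring", you can call
"credentials" with just the host:port and realm, which returns the
stored pair. The sketch below also shows, using the
"authorization_basic" method on a hand-built request, the kind of
header that Basic authentication adds; it makes no network request.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $browser = LWP::UserAgent->new;
$browser->credentials(
    'www.unicode.org:80',
    'Unicode-MailList-Archives',
    'unicode-ml' => 'unicode'
);

# With only host:port and realm, credentials() returns what is stored.
my ( $user, $pass ) = $browser->credentials(
    'www.unicode.org:80', 'Unicode-MailList-Archives' );
print "Stored login: $user / $pass\n";

# Basic authentication sends "Basic " plus Base64 of "username:password".
my $request = HTTP::Request->new( GET => 'http://www.unicode.org/mail-arch/' );
$request->authorization_basic( $user, $pass );
print "Authorization: ", scalar $request->header('Authorization'), "\n";
```

Since Base64 is an encoding and not encryption, anyone who can see the
request can recover the password.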
561
562 Accessing HTTPS URLs
563 When you access an HTTPS URL, it'll work for you just like an HTTP URL
564 would -- if your LWP installation has HTTPS support (via an appropriate
565 Secure Sockets Layer library). For example:
566
    use LWP;
    my $url = 'https://www.paypal.com/';   # Yes, HTTPS!
    my $browser = LWP::UserAgent->new;
    my $response = $browser->get($url);
    die "Error at $url\n ", $response->status_line, "\n Aborting"
     unless $response->is_success;
    print "Whee, it worked!  I got that ",
     $response->content_type, " document!\n";
575
576 If your LWP installation doesn't have HTTPS support set up, then the
577 response will be unsuccessful, and you'll get this error message:
578
579 Error at https://www.paypal.com/
580 501 Protocol scheme 'https' is not supported
581 Aborting at paypal.pl line 7. [or whatever program and line]
582
583 If your LWP installation does have HTTPS support installed, then the
584 response should be successful, and you should be able to consult
585 $response just like with any normal HTTP response.
586
587 For information about installing HTTPS support for your LWP
588 installation, see the helpful README.SSL file that comes in the libwww-
589 perl distribution.
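
If you would rather find out up front than wait for a failed request,
LWP::UserAgent has an "is_protocol_supported" method. A minimal sketch:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Ask the browser object which URL schemes it can currently handle.
for my $scheme (qw(http https)) {
    printf "%-5s %s\n", $scheme,
        $browser->is_protocol_supported($scheme)
        ? "supported"
        : "not supported";
}
```

What it prints for "https" depends on whether the SSL support modules
are installed on the machine where it runs.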
590
591 Getting Large Documents
592 When you're requesting a large (or at least potentially large)
593 document, a problem with the normal way of using the request methods
594 (like "$response = $browser->get($url)") is that the response object in
595 memory will have to hold the whole document -- in memory. If the
596 response is a thirty megabyte file, this is likely to be quite an
597 imposition on this process's memory usage.
598
599 A notable alternative is to have LWP save the content to a file on
600 disk, instead of saving it up in memory. This is the syntax to use:
601
    $response = $browser->get($url,
      ':content_file' => $filespec,
    );

For example,

    $response = $browser->get('http://search.cpan.org/',
      ':content_file' => '/tmp/sco.html'
    );
611
612 When you use this ":content_file" option, the $response will have all
613 the normal header lines, but "$response->content" will be empty.
614
615 Note that this ":content_file" option isn't supported under older
616 versions of LWP, so you should consider adding "use LWP 5.66;" to check
617 the LWP version, if you think your program might run on systems with
618 older versions.
619
620 If you need to be compatible with older LWP versions, then use this
621 syntax, which does the same thing:
622
    use HTTP::Request::Common;
    $response = $browser->request( GET($url), $filespec );
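
To try ":content_file" without downloading anything, the sketch below
points the browser at a "file:" URL for a temporary file it has just
written, then compares what landed on disk with what the response
object holds. The file names are arbitrary.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use File::Temp qw(tempdir);
use URI::file;

# Write a stand-in "large" document to a temporary directory.
my $dir    = tempdir( CLEANUP => 1 );
my $source = "$dir/big.txt";
open my $out, '>', $source or die "Can't write $source: $!";
print {$out} "line $_\n" for 1 .. 1000;
close $out;

my $browser  = LWP::UserAgent->new;
my $url      = URI::file->new_abs($source);
my $filespec = "$dir/saved.txt";

# The content goes straight to $filespec instead of into $response.
my $response = $browser->get( $url, ':content_file' => $filespec );
die "Can't get $url -- ", $response->status_line
    unless $response->is_success;

my $on_disk   = -s $filespec;
my $in_memory = length $response->content;
print "Bytes on disk: $on_disk\n";
print "Bytes held in the response object: $in_memory\n";
```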
625
627 Remember, this article is just the most rudimentary introduction to LWP
628 -- to learn more about LWP and LWP-related tasks, you really must read
629 from the following:
630
631 · LWP::Simple -- simple functions for getting/heading/mirroring URLs
632
633 · LWP -- overview of the libwww-perl modules
634
635 · LWP::UserAgent -- the class for objects that represent "virtual
636 browsers"
637
· HTTP::Response -- the class for objects that represent the response
  to an LWP request, as in "$response = $browser->get(...)"
640
641 · HTTP::Message and HTTP::Headers -- classes that provide more
642 methods to HTTP::Response.
643
644 · URI -- class for objects that represent absolute or relative URLs
645
646 · URI::Escape -- functions for URL-escaping and URL-unescaping
647 strings (like turning "this & that" to and from
648 "this%20%26%20that").
649
· HTML::Entities -- functions for HTML-escaping and HTML-unescaping
  strings (like turning "C. & E. Brontë" to and from "C. &amp; E.
  Bront&euml;")
653
654 · HTML::TokeParser and HTML::TreeBuilder -- classes for parsing HTML
655
656 · HTML::LinkExtor -- class for finding links in HTML documents
657
658 · The book Perl & LWP by Sean M. Burke. O'Reilly & Associates, 2002.
659 ISBN: 0-596-00178-9, <http://www.oreilly.com/catalog/perllwp/>.
660 The whole book is also available free online:
661 <http://lwp.interglacial.com>.
662
664 Copyright 2002, Sean M. Burke. You can redistribute this document
665 and/or modify it, but only under the same terms as Perl itself.
666
668 Sean M. Burke "sburke@cpan.org"
669
670
671
perl v5.12.4 2010-03-14 lwptut(3)