lwptut(3)          User Contributed Perl Documentation          lwptut(3)

lwptut -- An LWP Tutorial
7
LWP (short for "Library for WWW in Perl") is a very popular group of
Perl modules for accessing data on the Web. Like most Perl
module-distributions, each of LWP's component modules comes with
documentation that is a complete reference to its interface. However,
there are so many modules in LWP that it's hard to know where to start
looking for information on how to do even the simplest, most common
things.
15
16 Really introducing you to using LWP would require a whole book -- a
17 book that just happens to exist, called Perl & LWP. But this article
18 should give you a taste of how you can go about some common tasks with
19 LWP.
20
21 Getting documents with LWP::Simple
22 If you just want to get what's at a particular URL, the simplest way to
23 do it is LWP::Simple's functions.
24
25 In a Perl program, you can call its "get($url)" function. It will try
26 getting that URL's content. If it works, then it'll return the
27 content; but if there's some error, it'll return undef.
28
    my $url = 'http://www.npr.org/programs/fa/?todayDate=current';
      # Just an example: the URL for the most recent /Fresh Air/ show

    use LWP::Simple;
    my $content = get $url;
    die "Couldn't get $url" unless defined $content;

    # Then go do things with $content, like this:

    if ($content =~ m/jazz/i) {
        print "They're talking about jazz today on Fresh Air!\n";
    }
    else {
        print "Fresh Air is apparently jazzless today.\n";
    }
44
45 The handiest variant on "get" is "getprint", which is useful in Perl
46 one-liners. If it can get the page whose URL you provide, it sends it
47 to STDOUT; otherwise it complains to STDERR.
48
49 % perl -MLWP::Simple -e "getprint 'http://www.cpan.org/RECENT'"
50
51 That is the URL of a plain text file that lists new files in CPAN in
52 the past two weeks. You can easily make it part of a tidy little shell
53 command, like this one that mails you the list of new "Acme::" modules:
54
55 % perl -MLWP::Simple -e "getprint 'http://www.cpan.org/RECENT'" \
56 | grep "/by-module/Acme" | mail -s "New Acme modules! Joy!" $USER
57
58 There are other useful functions in LWP::Simple, including one function
59 for running a HEAD request on a URL (useful for checking links, or
60 getting the last-revised time of a URL), and two functions for
61 saving/mirroring a URL to a local file. See the LWP::Simple
62 documentation for the full details, or chapter 2 of Perl & LWP for more
63 examples.
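
As a rough sketch of those other functions: LWP::Simple also exports "head", "getstore", and "mirror", along with "is_success" for testing the status codes they return. So that the sketch runs without a network connection, it fetches a local temporary file through a "file:" URL. The file names are made up for illustration, and in a real program you would pass an "http:" URL instead.

```perl
use strict;
use warnings;
use LWP::Simple qw(head getstore mirror is_success);
use File::Temp qw(tempdir);
use URI::file;

# Set up a small local "remote" document (illustration only).
my $dir    = tempdir( CLEANUP => 1 );
my $source = "$dir/source.txt";
open my $out, '>', $source or die "Can't write $source: $!";
print {$out} "New Acme modules! Joy!\n";
close $out;
my $url = URI::file->new($source)->as_string;

# In list context, head() returns a list of header values,
# or an empty list if the request failed.
my ($type, $length, $mtime) = head($url);
print "Length: $length bytes\n" if defined $length;

# getstore() saves the content to a file and returns the status code.
my $copy   = "$dir/copy.txt";
my $status = getstore( $url, $copy );
die "Couldn't save $url (status $status)" unless is_success($status);

# mirror() does the same, but skips the download when the local
# copy is already up to date.
my $mirror_status = mirror( $url, "$dir/mirror.txt" );
print "mirror returned $mirror_status\n";
```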
64
65 The Basics of the LWP Class Model
LWP::Simple's functions are handy for simple cases, but they don't
support cookies or authorization, don't support setting header lines
in the HTTP request, and generally don't support reading header lines
in the HTTP response (notably the full HTTP error message, in case of
an error). To get at all those features, you'll have to use the full
LWP class model.
72
73 While LWP consists of dozens of classes, the main two that you have to
74 understand are LWP::UserAgent and HTTP::Response. LWP::UserAgent is a
75 class for "virtual browsers" which you use for performing requests, and
76 HTTP::Response is a class for the responses (or error messages) that
77 you get back from those requests.
78
79 The basic idiom is "$response = $browser->get($url)", or more fully
80 illustrated:
81
    # Early in your program:

    use LWP 5.64; # Loads all important LWP classes, and makes
                  #  sure your version is reasonably recent.

    my $browser = LWP::UserAgent->new;

    ...

    # Then later, whenever you need to make a get request:
    my $url = 'http://www.npr.org/programs/fa/?todayDate=current';

    my $response = $browser->get( $url );
    die "Can't get $url -- ", $response->status_line
        unless $response->is_success;

    die "Hey, I was expecting HTML, not ", $response->content_type
        unless $response->content_type eq 'text/html';
        # or whatever content-type you're equipped to deal with

    # Otherwise, process the content somehow:

    if ($response->decoded_content =~ m/jazz/i) {
        print "They're talking about jazz today on Fresh Air!\n";
    }
    else {
        print "Fresh Air is apparently jazzless today.\n";
    }
110
111 There are two objects involved: $browser, which holds an object of
112 class LWP::UserAgent, and then the $response object, which is of class
113 HTTP::Response. You really need only one browser object per program;
114 but every time you make a request, you get back a new HTTP::Response
115 object, which will have some interesting attributes:
116
117 • A status code indicating success or failure (which you can test
118 with "$response->is_success").
119
120 • An HTTP status line that is hopefully informative if there's
121 failure (which you can see with "$response->status_line", returning
122 something like "404 Not Found").
123
124 • A MIME content-type like "text/html", "image/gif",
125 "application/xml", etc., which you can see with
126 "$response->content_type"
127
128 • The actual content of the response, in
129 "$response->decoded_content". If the response is HTML, that's
130 where the HTML source will be; if it's a GIF, then
131 "$response->decoded_content" will be the binary GIF data.
132
133 • And dozens of other convenient and more specific methods that are
134 documented in the docs for HTTP::Response, and its superclasses
135 HTTP::Message and HTTP::Headers.
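
To get a feel for these attributes without touching the network, you can build HTTP::Response objects by hand. The following is only a sketch: the status codes, headers, and content are invented for illustration, whereas in a real program the object comes back from "$browser->get($url)". Note that "content_type" is called with "scalar" here, since in list context it also returns any parameters such as the charset.

```perl
use strict;
use warnings;
use HTTP::Response;

# Build responses by hand (no network needed) to see what the
# accessor methods return.
my $ok = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    '<html><body>Jazz tonight</body></html>',
);
my $missing = HTTP::Response->new( 404, 'Not Found' );

for my $response ( $ok, $missing ) {
    if ( $response->is_success ) {
        printf "%s (%s): %d characters\n",
            $response->status_line,
            scalar $response->content_type,
            length $response->decoded_content;
    }
    else {
        print "Failed: ", $response->status_line, "\n";
    }
}
```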
136
137 Adding Other HTTP Request Headers
138 The most commonly used syntax for requests is "$response =
139 $browser->get($url)", but in truth, you can add extra HTTP header lines
140 to the request by adding a list of key-value pairs after the URL, like
141 so:
142
143 $response = $browser->get( $url, $key1, $value1, $key2, $value2, ... );
144
145 For example, here's how to send some commonly used headers, in case
146 you're dealing with a site that would otherwise reject your request:
147
148 my @ns_headers = (
149 'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
150 'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
151 'Accept-Charset' => 'iso-8859-1,*,utf-8',
152 'Accept-Language' => 'en-US',
153 );
154
155 ...
156
157 $response = $browser->get($url, @ns_headers);
158
159 If you weren't reusing that array, you could just go ahead and do this:
160
161 $response = $browser->get($url,
162 'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
163 'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
164 'Accept-Charset' => 'iso-8859-1,*,utf-8',
165 'Accept-Language' => 'en-US',
166 );
167
168 If you were only ever changing the 'User-Agent' line, you could just
169 change the $browser object's default line from "libwww-perl/5.65" (or
170 the like) to whatever you like, using the LWP::UserAgent "agent"
171 method:
172
173 $browser->agent('Mozilla/4.76 [en] (Win98; U)');
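
If you want to see which header lines a browser object would send, one approach is to build a request yourself and ask the browser to fill in its defaults without sending anything. This is a minimal sketch, assuming a reasonably recent LWP that provides the "default_header" and "prepare_request" methods; the URL and header values are placeholders.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(GET);

my $browser = LWP::UserAgent->new;
$browser->agent('SomeName/3.14 (contact@robotplexus.int)');

# Headers set here are added to every request this browser makes.
$browser->default_header( 'Accept-Language' => 'en-US' );

# Build a request with one extra per-request header, then let the
# browser fill in its defaults without actually sending anything.
my $request = GET( 'http://www.example.com/', 'Accept' => 'text/html' );
$browser->prepare_request($request);

print $request->as_string;
```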
174
175 Enabling Cookies
A default LWP::UserAgent object acts like a browser with its cookie
support turned off. There are various ways of turning it on, by setting
its "cookie_jar" attribute. A "cookie jar" is an object representing a
little database of all the HTTP cookies that a browser knows about. It
can correspond to a file on disk, or it can be an in-memory object that
starts out empty and whose collection of cookies will disappear once
the program is finished running.
183
184 To give a browser an in-memory empty cookie jar, you set its
185 "cookie_jar" attribute like so:
186
187 use HTTP::CookieJar::LWP;
188 $browser->cookie_jar( HTTP::CookieJar::LWP->new );
189
190 To save a cookie jar to disk, see "dump_cookies" in HTTP::CookieJar.
191 To load cookies from disk into a jar, see "load_cookies" in
192 HTTP::CookieJar.
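
Here is one way the save-and-reload round trip might look. It is a sketch under some assumptions: the HTTP::CookieJar distribution is installed, the cookie is added by hand (normally the browser fills the jar from response headers), and the one-cookie-per-line file layout is just a convention chosen for this example.

```perl
use strict;
use warnings;
use HTTP::CookieJar::LWP;
use File::Temp qw(tempfile);

my $jar = HTTP::CookieJar::LWP->new;

# Normally the browser fills the jar from Set-Cookie response headers;
# here one cookie is added by hand so the sketch runs offline.
$jar->add( 'http://www.example.com/', 'session=abc123; Path=/' );

# dump_cookies returns one string per cookie; write them one per line.
my ( $fh, $cookie_file ) = tempfile( UNLINK => 1 );
print {$fh} "$_\n" for $jar->dump_cookies;
close $fh;

# Later (say, in another run), read the lines back into a fresh jar.
open my $in, '<', $cookie_file or die "Can't read $cookie_file: $!";
chomp( my @saved = <$in> );
close $in;

my $restored = HTTP::CookieJar::LWP->new;
$restored->load_cookies(@saved);

print "Cookie header: ",
    $restored->cookie_header('http://www.example.com/'), "\n";
```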
193
194 Posting Form Data
195 Many HTML forms send data to their server using an HTTP POST request,
196 which you can send with this syntax:
197
198 $response = $browser->post( $url,
199 [
200 formkey1 => value1,
201 formkey2 => value2,
202 ...
203 ],
204 );
205
206 Or if you need to send HTTP headers:
207
208 $response = $browser->post( $url,
209 [
210 formkey1 => value1,
211 formkey2 => value2,
212 ...
213 ],
214 headerkey1 => value1,
215 headerkey2 => value2,
216 );
217
218 For example, the following program makes a search request to AltaVista
219 (by sending some form data via an HTTP POST request), and extracts from
220 the HTML the report of the number of matches:
221
    use strict;
    use warnings;
    use LWP 5.64;
    my $browser = LWP::UserAgent->new;

    my $word = 'tarragon';

    my $url = 'http://search.yahoo.com/yhs/search';
    my $response = $browser->post( $url,
        [ 'q' => $word,  # the Altavista query string
          'fr' => 'altavista', 'pg' => 'q', 'avkw' => 'tgz', 'kl' => 'XX',
        ]
    );
    die "$url error: ", $response->status_line
        unless $response->is_success;
    die "Weird content type at $url -- ", $response->content_type
        unless $response->content_is_html;

    if ( $response->decoded_content =~ m{([0-9,]+)(?:<.*?>)? results for} ) {
        # The substring will be like "996,000</strong> results for"
        print "$word: $1\n";
    }
    else {
        print "Couldn't find the match-string in the response\n";
    }
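
To see what such a POST request actually contains, you can build one with the "POST" function from HTTP::Request::Common (which the "post" method uses to construct its request) and inspect it without sending it. This is only an illustrative sketch, with a shortened list of form fields.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Build (but don't send) a form-data POST request.
my $request = POST( 'http://search.yahoo.com/yhs/search',
    [ 'q' => 'tarragon', 'fr' => 'altavista' ],
);

print $request->method, " ", $request->uri, "\n";
print "Content-Type: ", scalar $request->content_type, "\n";
print "Body: ", $request->content, "\n";
```

The form pairs end up URL-encoded in the body of the request, not in the URL.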
247
248 Sending GET Form Data
249 Some HTML forms convey their form data not by sending the data in an
250 HTTP POST request, but by making a normal GET request with the data
251 stuck on the end of the URL. For example, if you went to
252 "www.imdb.com" and ran a search on "Blade Runner", the URL you'd see in
253 your browser window would be:
254
255 http://www.imdb.com/find?s=all&q=Blade+Runner
256
257 To run the same search with LWP, you'd use this idiom, which involves
258 the URI class:
259
260 use URI;
261 my $url = URI->new( 'http://www.imdb.com/find' );
262 # makes an object representing the URL
263
264 $url->query_form( # And here the form data pairs:
265 'q' => 'Blade Runner',
266 's' => 'all',
267 );
268
269 my $response = $browser->get($url);
270
271 See chapter 5 of Perl & LWP for a longer discussion of HTML forms and
272 of form data, and chapters 6 through 9 for a longer discussion of
273 extracting data from HTML.
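
As a quick check of what "query_form" does, this small sketch builds the same IMDb URL and prints it, with no request made. Calling "query_form" with no arguments reads the pairs back out.

```perl
use strict;
use warnings;
use URI;

my $url = URI->new('http://www.imdb.com/find');
$url->query_form(
    'q' => 'Blade Runner',
    's' => 'all',
);

print "$url\n";    # http://www.imdb.com/find?q=Blade+Runner&s=all

# Called with no arguments, query_form returns the pairs, unescaped.
my %form = $url->query_form;
print "Searching for: $form{q}\n";
```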
274
275 Absolutizing URLs
The URI class that we just mentioned above provides all sorts of
methods for accessing and modifying parts of URLs (such as asking what
sort of URL it is with "$url->scheme", asking what host it refers to
with "$url->host", and so on, as described in the docs for the URI
class). However, the methods of most immediate interest are the
"query_form" method seen above, and now the "new_abs" method for taking
a probably-relative URL string (like "../foo.html") and getting back an
absolute URL (like "http://www.perl.com/stuff/foo.html"), as shown
here:
285
286 use URI;
287 $abs = URI->new_abs($maybe_relative, $base);
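
A small self-contained sketch of these methods follows. The base URL is made up for illustration; it is chosen so that "../foo.html" resolves to the absolute URL mentioned above.

```perl
use strict;
use warnings;
use URI;

# A made-up base URL for illustration.
my $base = URI->new('http://www.perl.com/stuff/docs/index.html');
print "Scheme: ", $base->scheme, "\n";
print "Host:   ", $base->host,   "\n";
print "Path:   ", $base->path,   "\n";

# Relative strings are resolved against the base; a string that is
# already absolute comes back unchanged.
for my $maybe_relative ( '../foo.html', 'bar.html', 'http://www.cpan.org/' ) {
    my $abs = URI->new_abs( $maybe_relative, $base );
    print "$maybe_relative => $abs\n";
}
```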
288
289 For example, consider this program that matches URLs in the HTML list
290 of new modules in CPAN:
291
    use strict;
    use warnings;
    use LWP;
    my $browser = LWP::UserAgent->new;

    my $url = 'http://www.cpan.org/RECENT.html';
    my $response = $browser->get($url);
    die "Can't get $url -- ", $response->status_line
        unless $response->is_success;

    my $html = $response->decoded_content;
    while ( $html =~ m/<A HREF=\"(.*?)\"/g ) {
        print "$1\n";
    }
306
307 When run, it emits output that starts out something like this:
308
309 MIRRORING.FROM
310 RECENT
311 RECENT.html
312 authors/00whois.html
313 authors/01mailrc.txt.gz
314 authors/id/A/AA/AASSAD/CHECKSUMS
315 ...
316
317 However, if you actually want to have those be absolute URLs, you can
318 use the URI module's "new_abs" method, by changing the "while" loop to
319 this:
320
321 while( $html =~ m/<A HREF=\"(.*?)\"/g ) {
322 print URI->new_abs( $1, $response->base ) ,"\n";
323 }
324
325 (The "$response->base" method from HTTP::Message is for returning what
326 URL should be used for resolving relative URLs -- it's usually just the
327 same as the URL that you requested.)
328
329 That program then emits nicely absolute URLs:
330
331 http://www.cpan.org/MIRRORING.FROM
332 http://www.cpan.org/RECENT
333 http://www.cpan.org/RECENT.html
334 http://www.cpan.org/authors/00whois.html
335 http://www.cpan.org/authors/01mailrc.txt.gz
336 http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS
337 ...
338
339 See chapter 4 of Perl & LWP for a longer discussion of URI objects.
340
341 Of course, using a regexp to match hrefs is a bit simplistic, and for
342 more robust programs, you'll probably want to use an HTML-parsing
343 module like HTML::LinkExtor or HTML::TokeParser or even maybe
344 HTML::TreeBuilder.
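
For instance, a parser-based version of the link-extraction loop might look roughly like the following, using HTML::TokeParser. The HTML snippet and base URL are stand-ins for "$response->decoded_content" and "$response->base", so that the sketch runs on its own. Unlike the regexp, it copes with lowercase tags and single-quoted attributes.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# A stand-in for $response->decoded_content and $response->base.
my $html = <<'END_HTML';
<ul>
  <li><a href="RECENT">Recent files</a></li>
  <li><A HREF='authors/00whois.html'>Authors</A></li>
  <li><a name="top">Not a link</a></li>
</ul>
END_HTML
my $base = 'http://www.cpan.org/RECENT.html';

my $parser = HTML::TokeParser->new( \$html )
    or die "Can't parse HTML";

my @links;
while ( my $tag = $parser->get_tag('a') ) {
    my $href = $tag->[1]{href};    # attribute names are lowercased
    next unless defined $href;     # skip anchors with no href
    push @links, URI->new_abs( $href, $base )->as_string;
}
print "$_\n" for @links;
```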
345
346 Other Browser Attributes
347 LWP::UserAgent objects have many attributes for controlling how they
348 work. Here are a few notable ones:
349
350 • "$browser->timeout(15);"
351
352 This sets this browser object to give up on requests that don't
353 answer within 15 seconds.
354
355 • "$browser->protocols_allowed( [ 'http', 'gopher'] );"
356
357 This sets this browser object to not speak any protocols other than
358 HTTP and gopher. If it tries accessing any other kind of URL (like
359 an "ftp:" or "mailto:" or "news:" URL), then it won't actually try
360 connecting, but instead will immediately return an error code 500,
361 with a message like "Access to 'ftp' URIs has been disabled".
362
363 • "use LWP::ConnCache; $browser->conn_cache(LWP::ConnCache->new());"
364
This tells the browser object to try using the HTTP/1.1
"Keep-Alive" feature, which speeds up requests by reusing the same
socket connection for multiple requests to the same server.
368
369 • "$browser->agent( 'SomeName/1.23 (more info here maybe)' )"
370
This changes how the browser object will identify itself in the
default "User-Agent" line in its HTTP requests. By default, it'll
send "libwww-perl/versionnumber", like "libwww-perl/5.65". You can
change that to something more descriptive like this:
375
376 $browser->agent( 'SomeName/3.14 (contact@robotplexus.int)' );
377
378 Or if need be, you can go in disguise, like this:
379
380 $browser->agent( 'Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)' );
381
• "push @{ $browser->requests_redirectable }, 'POST';"

This tells this browser to obey redirection responses to POST
requests (like most modern interactive browsers), even though the
HTTP RFC says that should not normally be done.
387
388 For more options and information, see the full documentation for
389 LWP::UserAgent.
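
Several of these attributes can also be passed as options to the constructor. The sketch below sets a few of them and then tries a URL whose scheme is not on the allowed list, to show the immediate error response described above. The FTP host name is a placeholder, and no connection to it is attempted.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new(
    timeout           => 15,
    agent             => 'SomeName/1.23 (more info here maybe)',
    protocols_allowed => [ 'http', 'https' ],
);

print "Timeout: ", $browser->timeout, " seconds\n";
print "Agent:   ", $browser->agent, "\n";

# A disallowed scheme is refused without any connection being made.
my $response = $browser->get('ftp://ftp.example.com/pub/README');
print "FTP request: ", $response->status_line, "\n";
```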
390
391 Writing Polite Robots
392 If you want to make sure that your LWP-based program respects
393 robots.txt files and doesn't make too many requests too fast, you can
394 use the LWP::RobotUA class instead of the LWP::UserAgent class.
395
The LWP::RobotUA class is just like LWP::UserAgent, and you can use it
like so:
398
399 use LWP::RobotUA;
400 my $browser = LWP::RobotUA->new('YourSuperBot/1.34', 'you@yoursite.com');
401 # Your bot's name and your email address
402
403 my $response = $browser->get($url);
404
But LWP::RobotUA adds these features:
406
407 • If the robots.txt on $url's server forbids you from accessing $url,
408 then the $browser object (assuming it's of class LWP::RobotUA)
409 won't actually request it, but instead will give you back (in
410 $response) a 403 error with a message "Forbidden by robots.txt".
411 That is, if you have this line:
412
413 die "$url -- ", $response->status_line, "\nAborted"
414 unless $response->is_success;
415
416 then the program would die with an error message like this:
417
418 http://whatever.site.int/pith/x.html -- 403 Forbidden by robots.txt
419 Aborted at whateverprogram.pl line 1234
420
• If this $browser object sees that the last time it talked to $url's
server was too recent, then it will pause (via "sleep") to avoid
making too many requests too often. By default it pauses for one
minute, but you can control that with the
"$browser->delay( minutes )" attribute.
426
427 For example, this code:
428
429 $browser->delay( 7/60 );
430
431 ...means that this browser will pause when it needs to avoid
432 talking to any given server more than once every 7 seconds.
433
434 For more options and information, see the full documentation for
435 LWP::RobotUA.
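
The robots.txt checking itself is handled by the WWW::RobotRules module, which LWP::RobotUA uses behind the scenes. If you want to see how a given robots.txt would be interpreted, you can use that module directly, as in this sketch. The robots.txt content and host name are invented for illustration.

```perl
use strict;
use warnings;
use WWW::RobotRules;

my $rules = WWW::RobotRules->new('YourSuperBot/1.34');

# A made-up robots.txt, as if fetched from the URL given to parse().
my $robots_txt = <<'END_ROBOTS';
User-agent: *
Disallow: /pith/
END_ROBOTS
$rules->parse( 'http://whatever.site.int/robots.txt', $robots_txt );

for my $url ( 'http://whatever.site.int/pith/x.html',
              'http://whatever.site.int/index.html' ) {
    my $verdict = $rules->allowed($url) ? "allowed" : "forbidden";
    print "$url: $verdict\n";
}
```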
436
437 Using Proxies
438 In some cases, you will want to (or will have to) use proxies for
439 accessing certain sites and/or using certain protocols. This is most
440 commonly the case when your LWP program is running (or could be
441 running) on a machine that is behind a firewall.
442
To make a browser object use proxies that are defined in the usual
environment variables ("HTTP_PROXY", etc.), just call the "env_proxy"
method on a user-agent object before you go making any requests on it.
Specifically:
447
448 use LWP::UserAgent;
449 my $browser = LWP::UserAgent->new;
450
451 # And before you go making any requests:
452 $browser->env_proxy;
453
454 For more information on proxy parameters, see the LWP::UserAgent
455 documentation, specifically the "proxy", "env_proxy", and "no_proxy"
456 methods.
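
If you would rather set proxies explicitly than rely on environment variables, the "proxy" and "no_proxy" methods can be used along these lines. This is a sketch only: the proxy host, port, and domain names are hypothetical.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Hypothetical proxy host and port; substitute your own.
$browser->proxy( [ 'http', 'ftp' ], 'http://proxy.example.int:8001/' );

# Hosts in these domains are contacted directly, bypassing the proxy.
$browser->no_proxy( 'localhost', 'intranet.example.int' );

# Called with just a scheme, proxy() returns the current setting.
print "HTTP proxy: ", $browser->proxy('http'), "\n";
```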
457
458 HTTP Authentication
459 Many web sites restrict access to documents by using "HTTP
460 Authentication". This isn't just any form of "enter your password"
461 restriction, but is a specific mechanism where the HTTP server sends
462 the browser an HTTP code that says "That document is part of a
463 protected 'realm', and you can access it only if you re-request it and
464 add some special authorization headers to your request".
465
466 For example, the Unicode.org admins stop email-harvesting bots from
467 harvesting the contents of their mailing list archives, by protecting
468 them with HTTP Authentication, and then publicly stating the username
469 and password (at "http://www.unicode.org/mail-arch/") -- namely
470 username "unicode-ml" and password "unicode".
471
472 For example, consider this URL, which is part of the protected area of
473 the web site:
474
475 http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html
476
477 If you access that with a browser, you'll get a prompt like "Enter
478 username and password for 'Unicode-MailList-Archives' at server
479 'www.unicode.org'".
480
481 In LWP, if you just request that URL, like this:
482
    use LWP;
    my $browser = LWP::UserAgent->new;

    my $url =
     'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';
    my $response = $browser->get($url);

    die "Error: ", $response->header('WWW-Authenticate') || 'Error accessing',
      #  ('WWW-Authenticate' is the realm-name)
      "\n ", $response->status_line, "\n at $url\n Aborting"
     unless $response->is_success;
494
495 Then you'll get this error:
496
497 Error: Basic realm="Unicode-MailList-Archives"
498 401 Authorization Required
499 at http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html
500 Aborting at auth1.pl line 9. [or wherever]
501
...because the $browser doesn't know the username and password for
that realm ("Unicode-MailList-Archives") at that host
("www.unicode.org"). The simplest way to let the browser know about
this is to use the "credentials" method to let it know about a username
and password that it can try using for that realm at that host. The
syntax is:
508
509 $browser->credentials(
510 'servername:portnumber',
511 'realm-name',
512 'username' => 'password'
513 );
514
515 In most cases, the port number is 80, the default TCP/IP port for HTTP;
516 and you usually call the "credentials" method before you make any
517 requests. For example:
518
519 $browser->credentials(
520 'reports.mybazouki.com:80',
521 'web_server_usage_reports',
522 'plinky' => 'banjo123'
523 );
524
525 So if we add the following to the program above, right after the
526 "$browser = LWP::UserAgent->new;" line...
527
528 $browser->credentials( # add this to our $browser 's "key ring"
529 'www.unicode.org:80',
530 'Unicode-MailList-Archives',
531 'unicode-ml' => 'unicode'
532 );
533
534 ...then when we run it, the request succeeds, instead of causing the
535 "die" to be called.
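
To make the "special authorization headers" a little less mysterious, this sketch builds a request by hand and sets the header for the Basic scheme with the "authorization_basic" method, without sending anything. It only illustrates what the header looks like; in normal use you would let "credentials" take care of it.

```perl
use strict;
use warnings;
use HTTP::Request;
use MIME::Base64 qw(decode_base64);

my $url = 'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';
my $request = HTTP::Request->new( GET => $url );

# Sets the "Authorization" header for HTTP Basic authentication.
$request->authorization_basic( 'unicode-ml', 'unicode' );

my $header = $request->header('Authorization');
print "Authorization: $header\n";

# The part after "Basic " is just base64 of "username:password".
my ($encoded) = $header =~ /^Basic (\S+)$/;
print "Decodes to: ", decode_base64($encoded), "\n";
```

Since base64 is an encoding and not encryption, Basic authentication over plain HTTP offers no secrecy for the password.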
536
537 Accessing HTTPS URLs
538 When you access an HTTPS URL, it'll work for you just like an HTTP URL
539 would -- if your LWP installation has HTTPS support (via an appropriate
540 Secure Sockets Layer library). For example:
541
    use LWP;
    my $url = 'https://www.paypal.com/';   # Yes, HTTPS!
    my $browser = LWP::UserAgent->new;
    my $response = $browser->get($url);
    die "Error at $url\n ", $response->status_line, "\n Aborting"
        unless $response->is_success;
    print "Whee, it worked!  I got that ",
        $response->content_type, " document!\n";
550
551 If your LWP installation doesn't have HTTPS support set up, then the
552 response will be unsuccessful, and you'll get this error message:
553
554 Error at https://www.paypal.com/
555 501 Protocol scheme 'https' is not supported
556 Aborting at paypal.pl line 7. [or whatever program and line]
557
558 If your LWP installation does have HTTPS support installed, then the
559 response should be successful, and you should be able to consult
560 $response just like with any normal HTTP response.
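
If you would like to find out ahead of time whether HTTPS support is present, without making a request, the "is_protocol_supported" method of LWP::UserAgent can be asked directly. A minimal sketch:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# True if LWP can find a handler for the given URL scheme.
my $message = $browser->is_protocol_supported('https')
    ? "HTTPS support is installed"
    : "HTTPS support is missing; https: requests will fail";
print "$message\n";
```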
561
For information about installing HTTPS support for your LWP
installation, see the helpful README.SSL file that comes in the
libwww-perl distribution.
565
566 Getting Large Documents
567 When you're requesting a large (or at least potentially large)
568 document, a problem with the normal way of using the request methods
569 (like "$response = $browser->get($url)") is that the response object in
570 memory will have to hold the whole document -- in memory. If the
571 response is a thirty megabyte file, this is likely to be quite an
572 imposition on this process's memory usage.
573
574 A notable alternative is to have LWP save the content to a file on
575 disk, instead of saving it up in memory. This is the syntax to use:
576
    $response = $browser->get($url,
        ':content_file' => $filespec,
    );

For example,

    $response = $browser->get('http://search.cpan.org/',
        ':content_file' => '/tmp/sco.html'
    );
586
When you use this ":content_file" option, the $response will have all
the normal header lines, but "$response->content" will be empty.
Errors writing to the content file (for example due to permission
denied or the filesystem being full) will be reported via the
"Client-Aborted" or "X-Died" response headers, and not the "is_success"
method:

    if ( ($response->header('Client-Aborted') || '') eq 'die' ) {
        # handle error ...
    }
596
597 Note that this ":content_file" option isn't supported under older
598 versions of LWP, so you should consider adding "use LWP 5.66;" to check
599 the LWP version, if you think your program might run on systems with
600 older versions.
601
602 If you need to be compatible with older LWP versions, then use this
603 syntax, which does the same thing:
604
    use HTTP::Request::Common;
    $response = $browser->request( GET($url), $filespec );
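
Putting the pieces together, a download-to-disk routine with both kinds of error check might look roughly like this. So that the sketch runs offline, a local file fetched through a "file:" URL stands in for the large remote document; the file names are made up, and in a real program the URL would point at a server.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use File::Temp qw(tempdir);
use URI::file;

# Stand-in for a large remote document: a local file fetched via a
# "file:" URL, so the sketch runs offline.
my $dir    = tempdir( CLEANUP => 1 );
my $source = "$dir/big.txt";
open my $out, '>', $source or die "Can't write $source: $!";
print {$out} "x" x 100_000;
close $out;

my $browser  = LWP::UserAgent->new;
my $filespec = "$dir/saved.txt";
my $response = $browser->get(
    URI::file->new($source),
    ':content_file' => $filespec,
);

# First check the request itself, then check for write errors.
die "Request failed: ", $response->status_line
    unless $response->is_success;
die "Couldn't save to $filespec: ",
    $response->header('X-Died') || 'unknown error'
    if ( $response->header('Client-Aborted') || '' ) eq 'die';

print "Saved ", ( -s $filespec ), " bytes to $filespec\n";
```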
607
609 Remember, this article is just the most rudimentary introduction to LWP
610 -- to learn more about LWP and LWP-related tasks, you really must read
611 from the following:
612
613 • LWP::Simple -- simple functions for getting/heading/mirroring URLs
614
615 • LWP -- overview of the libwww-perl modules
616
617 • LWP::UserAgent -- the class for objects that represent "virtual
618 browsers"
619
• HTTP::Response -- the class for objects that represent the response
to an LWP request, as in "$response = $browser->get(...)"
622
623 • HTTP::Message and HTTP::Headers -- classes that provide more
624 methods to HTTP::Response.
625
626 • URI -- class for objects that represent absolute or relative URLs
627
628 • URI::Escape -- functions for URL-escaping and URL-unescaping
629 strings (like turning "this & that" to and from
630 "this%20%26%20that").
631
• HTML::Entities -- functions for HTML-escaping and HTML-unescaping
strings (like turning "C. & E. Brontë" to and from "C. &amp; E.
Bront&euml;")
635
636 • HTML::TokeParser and HTML::TreeBuilder -- classes for parsing HTML
637
638 • HTML::LinkExtor -- class for finding links in HTML documents
639
640 • The book Perl & LWP by Sean M. Burke. O'Reilly & Associates, 2002.
641 ISBN: 0-596-00178-9, <http://oreilly.com/catalog/perllwp/>. The
642 whole book is also available free online:
643 <http://lwp.interglacial.com>.
644
646 Copyright 2002, Sean M. Burke. You can redistribute this document
647 and/or modify it, but only under the same terms as Perl itself.
648
650 Sean M. Burke "sburke@cpan.org"
651
652
653
perl v5.34.0                      2021-09-21                      lwptut(3)