lwptut(3pm)

lwptut(3)             User Contributed Perl Documentation            lwptut(3)

NAME

       lwptut -- An LWP Tutorial

DESCRIPTION

       LWP (short for "Library for WWW in Perl") is a very popular group of
       Perl modules for accessing data on the Web. Like most Perl module
       distributions, each of LWP's component modules comes with documentation
       that is a complete reference to its interface. However, there are so
       many modules in LWP that it's hard to know where to start looking for
       information on how to do even the simplest, most common things.

       Really introducing you to using LWP would require a whole book -- a
       book that just happens to exist, called Perl & LWP. But this article
       should give you a taste of how you can go about some common tasks with
       LWP.

   Getting documents with LWP::Simple
       If you just want to get what's at a particular URL, the simplest way to
       do it is to use LWP::Simple's functions.

       In a Perl program, you can call its "get($url)" function.  It will try
       getting that URL's content.  If it works, then it'll return the
       content; but if there's some error, it'll return undef.

         my $url = 'http://www.npr.org/programs/fa/?todayDate=current';
           # Just an example: the URL for the most recent /Fresh Air/ show

         use LWP::Simple;
         my $content = get $url;
         die "Couldn't get $url" unless defined $content;

         # Then go do things with $content, like this:

         if($content =~ m/jazz/i) {
           print "They're talking about jazz today on Fresh Air!\n";
         }
         else {
           print "Fresh Air is apparently jazzless today.\n";
         }

       The handiest variant on "get" is "getprint", which is useful in Perl
       one-liners.  If it can get the page whose URL you provide, it sends it
       to STDOUT; otherwise it complains to STDERR.

         % perl -MLWP::Simple -e "getprint 'http://www.cpan.org/RECENT'"

       That is the URL of a plain text file that lists new files in CPAN in
       the past two weeks.  You can easily make it part of a tidy little shell
       command, like this one that mails you the list of new "Acme::" modules:

         % perl -MLWP::Simple -e "getprint 'http://www.cpan.org/RECENT'"  \
            | grep "/by-module/Acme" | mail -s "New Acme modules! Joy!" $USER

       There are other useful functions in LWP::Simple, including one function
       for running a HEAD request on a URL (useful for checking links, or
       getting the last-revised time of a URL), and two functions for
       saving/mirroring a URL to a local file. See the LWP::Simple
       documentation for the full details, or chapter 2 of Perl & LWP for more
       examples.

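       As a rough sketch of the saving function mentioned above, the
       following uses "getstore" together with "get" and "is_success". So
       that it can run without network access, it assumes a throwaway local
       file reached through a "file:" URL in place of a real web address;
       the file names are made up for illustration.

```perl
use strict;
use warnings;
use LWP::Simple;      # exports get, getstore, is_success, ...
use URI::file;
use File::Temp qw(tempdir);

# Assumption: a temporary local file stands in for a remote document.
my $dir    = tempdir( CLEANUP => 1 );
my $source = "$dir/source.txt";
my $copy   = "$dir/copy.txt";

open my $out, '>', $source or die "Can't write $source: $!";
print {$out} "Hello from LWP::Simple\n";
close $out;

my $url = URI::file->new($source)->as_string;   # a file: URL for $source

# getstore() saves the document to disk and returns the status code.
my $status = getstore( $url, $copy );
die "Couldn't store $url (status $status)" unless is_success($status);

# get() returns the content itself, or undef on failure.
my $content = get($url);
die "Couldn't get $url" unless defined $content;
print "Fetched: $content";
```
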
   The Basics of the LWP Class Model
       LWP::Simple's functions are handy for simple cases, but they don't
       support cookies or authorization, don't support setting header lines
       in the HTTP request, and generally don't support reading header lines
       in the HTTP response (notably the full HTTP error message, in case of
       an error). To get at all those features, you'll have to use the full
       LWP class model.

       While LWP consists of dozens of classes, the main two that you have to
       understand are LWP::UserAgent and HTTP::Response. LWP::UserAgent is a
       class for "virtual browsers" which you use for performing requests, and
       HTTP::Response is a class for the responses (or error messages) that
       you get back from those requests.

       The basic idiom is "$response = $browser->get($url)", or more fully
       illustrated:

         # Early in your program:

         use LWP 5.64; # Loads all important LWP classes, and makes
                       #  sure your version is reasonably recent.

         my $browser = LWP::UserAgent->new;

         ...

         # Then later, whenever you need to make a get request:
         my $url = 'http://www.npr.org/programs/fa/?todayDate=current';

         my $response = $browser->get( $url );
         die "Can't get $url -- ", $response->status_line
          unless $response->is_success;

         die "Hey, I was expecting HTML, not ", $response->content_type
          unless $response->content_type eq 'text/html';
            # or whatever content-type you're equipped to deal with

         # Otherwise, process the content somehow:

         if($response->decoded_content =~ m/jazz/i) {
           print "They're talking about jazz today on Fresh Air!\n";
         }
         else {
           print "Fresh Air is apparently jazzless today.\n";
         }

       There are two objects involved: $browser, which holds an object of
       class LWP::UserAgent, and then the $response object, which is of class
       HTTP::Response. You really need only one browser object per program;
       but every time you make a request, you get back a new HTTP::Response
       object, which will have some interesting attributes:

       ·   A status code indicating success or failure (which you can test
           with "$response->is_success").

       ·   An HTTP status line that is hopefully informative if there's
           failure (which you can see with "$response->status_line", returning
           something like "404 Not Found").

       ·   A MIME content-type like "text/html", "image/gif",
           "application/xml", etc., which you can see with
           "$response->content_type".

       ·   The actual content of the response, in
           "$response->decoded_content".  If the response is HTML, that's
           where the HTML source will be; if it's a GIF, then
           "$response->decoded_content" will be the binary GIF data.

       ·   And dozens of other convenient and more specific methods that are
           documented in the docs for HTTP::Response, and its superclasses
           HTTP::Message and HTTP::Headers.

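       To get a feel for the attributes listed above without making a real
       request, you can build an HTTP::Response by hand and poke at it. This
       is only a sketch: the status, header, and body below are invented, and
       a response that comes back from "$browser->get($url)" would be
       inspected in the same way.

```perl
use strict;
use warnings;
use HTTP::Response;

# Assumption: a hand-built response stands in for one returned by
# $browser->get($url), so no network access is needed.
my $response = HTTP::Response->new(
  404, 'Not Found',
  [ 'Content-Type' => 'text/html; charset=UTF-8' ],
  '<html><body>No such page</body></html>',
);

print "Success?     ", ( $response->is_success ? 'yes' : 'no' ), "\n";
print "Status line: ", $response->status_line, "\n";

# content_type returns extra parameters in list context, so force scalar
print "Type:        ", scalar( $response->content_type ), "\n";

# header() returns undef for a header that was not sent
my $server = $response->header('Server');
print "Server:      ", ( defined $server ? $server : '(not sent)' ), "\n";

print "Body length: ", length( $response->decoded_content ), "\n";
```
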
   Adding Other HTTP Request Headers
       The most commonly used syntax for requests is "$response =
       $browser->get($url)", but in truth, you can add extra HTTP header lines
       to the request by adding a list of key-value pairs after the URL, like
       so:

         $response = $browser->get( $url, $key1, $value1, $key2, $value2, ... );

       For example, here's how to send some more Netscape-like headers, in
       case you're dealing with a site that would otherwise reject your
       request:

         my @ns_headers = (
          'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
          'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
          'Accept-Charset' => 'iso-8859-1,*,utf-8',
          'Accept-Language' => 'en-US',
         );

         ...

         $response = $browser->get($url, @ns_headers);

       If you weren't reusing that array, you could just go ahead and do this:

         $response = $browser->get($url,
          'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
          'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
          'Accept-Charset' => 'iso-8859-1,*,utf-8',
          'Accept-Language' => 'en-US',
         );

       If you were only ever changing the 'User-Agent' line, you could just
       change the $browser object's default line from "libwww-perl/5.65" (or
       the like) to whatever you like, using the LWP::UserAgent "agent"
       method:

          $browser->agent('Mozilla/4.76 [en] (Win98; U)');

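       If several headers should go out with every request, one possible
       approach is to set them once on the browser object. The sketch below
       combines the "agent" method with LWP::UserAgent's "default_header"
       method, which is not otherwise covered in this tutorial; it makes no
       request and just prints what has been configured.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Replace the default "libwww-perl/x.yz" identification
$browser->agent('Mozilla/4.76 [en] (Win98; U)');

# Headers set this way are added to every request this object makes
$browser->default_header( 'Accept-Language' => 'en-US' );

print "User-Agent: ", $browser->agent, "\n";
print $browser->default_headers->as_string;
```
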
   Enabling Cookies
       A default LWP::UserAgent object acts like a browser with its cookie
       support turned off. There are various ways of turning it on, by setting
       its "cookie_jar" attribute. A "cookie jar" is an object representing a
       little database of all the HTTP cookies that a browser can know about.
       It can correspond to a file on disk (the way Netscape uses its
       cookies.txt file), or it can be just an in-memory object that starts
       out empty, and whose collection of cookies will disappear once the
       program is finished running.

       To give a browser an in-memory empty cookie jar, you set its
       "cookie_jar" attribute like so:

         $browser->cookie_jar({});

       To give it a cookie jar that will be read from a file on disk, and
       will be saved back to it when the program is finished running, set the
       "cookie_jar" attribute like this:

         use HTTP::Cookies;
         $browser->cookie_jar( HTTP::Cookies->new(
           'file' => '/some/where/cookies.lwp',
               # where to read/write cookies
           'autosave' => 1,
               # save it to disk when done
         ));

       That file will be in an LWP-specific format. If you want to access the
       cookies in your Netscape cookies file, you can use the
       HTTP::Cookies::Netscape class:

         use HTTP::Cookies;
           # yes, loads HTTP::Cookies::Netscape too

         $browser->cookie_jar( HTTP::Cookies::Netscape->new(
           'file' => 'c:/Program Files/Netscape/Users/DIR-NAME-HERE/cookies.txt',
               # where to read cookies
         ));

       You could add an "'autosave' => 1" line as further above, but at time
       of writing, it's uncertain whether Netscape might discard some of the
       cookies you could be writing back to disk.

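       To see the "little database" idea in action, here is a small sketch
       that creates an in-memory jar and puts one cookie into it by hand
       with the HTTP::Cookies "set_cookie" method. Normally the jar fills
       itself from the responses the browser receives; the cookie name,
       value, and host here are hypothetical.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;
$browser->cookie_jar( {} );          # empty in-memory jar
my $jar = $browser->cookie_jar;      # the jar object itself

# Hypothetical cookie, added by hand for illustration
$jar->set_cookie(
  0,                    # cookie version
  'session', 'abc123',  # name and value
  '/',                  # path
  'www.example.com',    # domain
  undef,                # port (any)
  1,                    # the path was given explicitly
  0,                    # not restricted to secure connections
  3600,                 # lifetime in seconds
  0,                    # don't discard when the program ends
);

# Dump the jar's contents in LWP's own text format
print $jar->as_string;
```
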
   Posting Form Data
       Many HTML forms send data to their server using an HTTP POST request,
       which you can send with this syntax:

        $response = $browser->post( $url,
          [
            formkey1 => value1,
            formkey2 => value2,
            ...
          ],
        );

       Or if you need to send HTTP headers:

        $response = $browser->post( $url,
          [
            formkey1 => value1,
            formkey2 => value2,
            ...
          ],
          headerkey1 => value1,
          headerkey2 => value2,
        );

       For example, the following program makes a search request to AltaVista
       (by sending some form data via an HTTP POST request), and extracts from
       the HTML the report of the number of matches:

         use strict;
         use warnings;
         use LWP 5.64;
         my $browser = LWP::UserAgent->new;

         my $word = 'tarragon';

         my $url = 'http://search.yahoo.com/yhs/search';
         my $response = $browser->post( $url,
           [ 'q' => $word,  # the Altavista query string
             'fr' => 'altavista', 'pg' => 'q', 'avkw' => 'tgz', 'kl' => 'XX',
           ]
         );
         die "$url error: ", $response->status_line
          unless $response->is_success;
         die "Weird content type at $url -- ", $response->content_type
          unless $response->content_is_html;

         if( $response->decoded_content =~ m{([0-9,]+)(?:<.*?>)? results for} ) {
           # The substring will be like "996,000</strong> results for"
           print "$word: $1\n";
         }
         else {
           print "Couldn't find the match-string in the response\n";
         }

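       If you're curious what such a POST request actually contains, one way
       to look is to build the request object yourself with the "POST"
       function from HTTP::Request::Common and print its parts before
       sending anything. This is a sketch only; the two form fields are a
       shortened version of the ones above.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Build (but don't send) a form-encoded POST request
my $request = POST(
  'http://search.yahoo.com/yhs/search',
  [ 'q' => 'tarragon', 'kl' => 'XX' ],
);

print $request->method, ' ', $request->uri, "\n";
print 'Content-Type: ', scalar( $request->content_type ), "\n";
print 'Body: ', $request->content, "\n";

# To actually send it with a browser object:
#   my $response = $browser->request($request);
```
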
   Sending GET Form Data
       Some HTML forms convey their form data not by sending the data in an
       HTTP POST request, but by making a normal GET request with the data
       stuck on the end of the URL.  For example, if you went to
       "www.imdb.com" and ran a search on "Blade Runner", the URL you'd see in
       your browser window would be:

         http://www.imdb.com/find?s=all&q=Blade+Runner

       To run the same search with LWP, you'd use this idiom, which involves
       the URI class:

         use URI;
         my $url = URI->new( 'http://www.imdb.com/find' );
           # makes an object representing the URL

         $url->query_form(  # And here the form data pairs:
           'q' => 'Blade Runner',
           's' => 'all',
         );

         my $response = $browser->get($url);

       See chapter 5 of Perl & LWP for a longer discussion of HTML forms and
       of form data, and chapters 6 through 9 for a longer discussion of
       extracting data from HTML.

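       To check what "query_form" produces, you can print the URI object
       before handing it to the browser. This small sketch makes no request;
       it only shows the encoded URL and reads the form data back out.

```perl
use strict;
use warnings;
use URI;

my $url = URI->new('http://www.imdb.com/find');
$url->query_form( 'q' => 'Blade Runner', 's' => 'all' );

# A URI object turns into the full URL when used as a string
print "$url\n";    # http://www.imdb.com/find?q=Blade+Runner&s=all

# Called without arguments, query_form returns the decoded pairs
my %form = $url->query_form;
print "q is '$form{q}', s is '$form{s}'\n";
```
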
   Absolutizing URLs
       The URI class that we just mentioned above provides all sorts of
       methods for accessing and modifying parts of URLs (such as asking what
       sort of URL it is with "$url->scheme", and asking what host it refers
       to with "$url->host", and so on, as described in the docs for the URI
       class).  However, the methods of most immediate interest are the
       "query_form" method seen above, and now the "new_abs" method for taking
       a probably-relative URL string (like "../foo.html") and getting back an
       absolute URL (like "http://www.perl.com/stuff/foo.html"), as shown
       here:

         use URI;
         $abs = URI->new_abs($maybe_relative, $base);

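       Here is a short sketch of "new_abs" applied to a few kinds of link,
       along with the "scheme" and "host" methods. The base URL is a
       hypothetical page address chosen for illustration.

```perl
use strict;
use warnings;
use URI;

# Hypothetical URL of the page in which the links were found
my $base = 'http://www.perl.com/stuff/bar/baz.html';

my @links = ( '../foo.html', 'pics/camel.gif', '/index.html',
              'http://www.cpan.org/RECENT' );

for my $maybe_relative (@links) {
  my $abs = URI->new_abs( $maybe_relative, $base );
  printf "%-28s => %s (scheme %s, host %s)\n",
    $maybe_relative, $abs, $abs->scheme, $abs->host;
}
```

       Note that a link which is already absolute, like the last one, comes
       back unchanged.
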
       For example, consider this program that matches URLs in the HTML list
       of new modules in CPAN:

         use strict;
         use warnings;
         use LWP;
         my $browser = LWP::UserAgent->new;

         my $url = 'http://www.cpan.org/RECENT.html';
         my $response = $browser->get($url);
         die "Can't get $url -- ", $response->status_line
          unless $response->is_success;

         my $html = $response->decoded_content;
         while( $html =~ m/<A HREF=\"(.*?)\"/g ) {
           print "$1\n";
         }

       When run, it emits output that starts out something like this:

         MIRRORING.FROM
         RECENT
         RECENT.html
         authors/00whois.html
         authors/01mailrc.txt.gz
         authors/id/A/AA/AASSAD/CHECKSUMS
         ...

       However, if you actually want to have those be absolute URLs, you can
       use the URI module's "new_abs" method, by changing the "while" loop to
       this:

         while( $html =~ m/<A HREF=\"(.*?)\"/g ) {
           print URI->new_abs( $1, $response->base ) ,"\n";
         }

       (The "$response->base" method from HTTP::Response is for returning what
       URL should be used for resolving relative URLs -- it's usually just the
       same as the URL that you requested.)

       That program then emits nicely absolute URLs:

         http://www.cpan.org/MIRRORING.FROM
         http://www.cpan.org/RECENT
         http://www.cpan.org/RECENT.html
         http://www.cpan.org/authors/00whois.html
         http://www.cpan.org/authors/01mailrc.txt.gz
         http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS
         ...

       See chapter 4 of Perl & LWP for a longer discussion of URI objects.

       Of course, using a regexp to match hrefs is a bit simplistic, and for
       more robust programs, you'll probably want to use an HTML-parsing
       module like HTML::LinkExtor or HTML::TokeParser or even maybe
       HTML::TreeBuilder.

   Other Browser Attributes
       LWP::UserAgent objects have many attributes for controlling how they
       work.  Here are a few notable ones:

       ·   "$browser->timeout(15);"

           This sets this browser object to give up on requests that don't
           answer within 15 seconds.

       ·   "$browser->protocols_allowed( [ 'http', 'gopher'] );"

           This sets this browser object to not speak any protocols other than
           HTTP and gopher. If it tries accessing any other kind of URL (like
           an "ftp:" or "mailto:" or "news:" URL), then it won't actually try
           connecting, but instead will immediately return an error code 500,
           with a message like "Access to 'ftp' URIs has been disabled".

       ·   "use LWP::ConnCache; $browser->conn_cache(LWP::ConnCache->new());"

           This tells the browser object to try using the HTTP/1.1
           "Keep-Alive" feature, which speeds up requests by reusing the same
           socket connection for multiple requests to the same server.

       ·   "$browser->agent( 'SomeName/1.23 (more info here maybe)' )"

           This changes how the browser object will identify itself in the
           default "User-Agent" line in its HTTP requests.  By default, it'll
           send "libwww-perl/versionnumber", like "libwww-perl/5.65".  You can
           change that to something more descriptive like this:

             $browser->agent( 'SomeName/3.14 (contact@robotplexus.int)' );

           Or if need be, you can go in disguise, like this:

             $browser->agent( 'Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)' );

       ·   "push @{ $browser->requests_redirectable }, 'POST';"

           This tells this browser to obey redirection responses to POST
           requests (like most modern interactive browsers), even though the
           HTTP RFC says that should not normally be done.

       For more options and information, see the full documentation for
       LWP::UserAgent.

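       The effect of "protocols_allowed" is easy to try out, since a
       forbidden scheme is refused before any connection is attempted. A
       minimal sketch, assuming a made-up FTP address:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;
$browser->timeout(15);
$browser->protocols_allowed( [ 'http', 'https' ] );

# ftp: is not in the allowed list, so no connection is attempted
# (the address is hypothetical)
my $response = $browser->get('ftp://ftp.example.com/pub/README');

print "Timeout: ", $browser->timeout, " seconds\n";
print "Result:  ", $response->status_line, "\n";
```
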
   Writing Polite Robots
       If you want to make sure that your LWP-based program respects
       robots.txt files and doesn't make too many requests too fast, you can
       use the LWP::RobotUA class instead of the LWP::UserAgent class.

       The LWP::RobotUA class is just like LWP::UserAgent, and you can use it
       like so:

         use LWP::RobotUA;
         my $browser = LWP::RobotUA->new('YourSuperBot/1.34', 'you@yoursite.com');
           # Your bot's name and your email address

         my $response = $browser->get($url);

       But LWP::RobotUA adds these features:

       ·   If the robots.txt on $url's server forbids you from accessing $url,
           then the $browser object (assuming it's of class LWP::RobotUA)
           won't actually request it, but instead will give you back (in
           $response) a 403 error with a message "Forbidden by robots.txt".
           That is, if you have this line:

             die "$url -- ", $response->status_line, "\nAborted"
              unless $response->is_success;

           then the program would die with an error message like this:

             http://whatever.site.int/pith/x.html -- 403 Forbidden by robots.txt
             Aborted at whateverprogram.pl line 1234

       ·   If this $browser object sees that the last time it talked to $url's
           server was too recent, then it will pause (via "sleep") to avoid
           making too many requests too often. The pause is one minute by
           default, but you can control it with the
           "$browser->delay( minutes )" attribute.

           For example, this code:

             $browser->delay( 7/60 );

           ...means that this browser will pause when it needs to avoid
           talking to any given server more than once every 7 seconds.

       For more options and information, see the full documentation for
       LWP::RobotUA.

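       The sketch below sets the delay on a robot browser and then shows
       how a robots.txt rule is applied, using the WWW::RobotRules class
       (the rule parser that LWP::RobotUA relies on) directly. The
       robots.txt content is invented and is parsed from a string rather
       than fetched, so nothing here touches the network.

```perl
use strict;
use warnings;
use LWP::RobotUA;
use WWW::RobotRules;

my $browser = LWP::RobotUA->new( 'YourSuperBot/1.34', 'you@yoursite.com' );
$browser->delay( 7 / 60 );    # the unit is minutes, so this is 7 seconds
printf "Pause between requests: %.0f seconds\n", $browser->delay * 60;

# Hypothetical robots.txt content, parsed directly instead of fetched
my $rules = WWW::RobotRules->new('YourSuperBot/1.34');
$rules->parse(
  'http://whatever.site.int/robots.txt',
  "User-agent: *\nDisallow: /pith/\n",
);

for my $url ( 'http://whatever.site.int/pith/x.html',
              'http://whatever.site.int/index.html' ) {
  my $verdict = $rules->allowed($url) ? 'allowed' : 'forbidden';
  print "$url: $verdict\n";
}
```
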
   Using Proxies
       In some cases, you will want to (or will have to) use proxies for
       accessing certain sites and/or using certain protocols. This is most
       commonly the case when your LWP program is running (or could be
       running) on a machine that is behind a firewall.

       To make a browser object use proxies that are defined in the usual
       environment variables ("HTTP_PROXY", etc.), just call the "env_proxy"
       method on a user-agent object before you go making any requests on it.
       Specifically:

         use LWP::UserAgent;
         my $browser = LWP::UserAgent->new;

         # And before you go making any requests:
         $browser->env_proxy;

       For more information on proxy parameters, see the LWP::UserAgent
       documentation, specifically the "proxy", "env_proxy", and "no_proxy"
       methods.

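       If you would rather name the proxy in the program than rely on
       environment variables, the "proxy" and "no_proxy" methods mentioned
       above can be used along these lines. The proxy address and the
       excluded hosts are hypothetical; no request is made.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Hypothetical proxy: send http: and ftp: requests through it
$browser->proxy( [ 'http', 'ftp' ], 'http://proxy.example.com:8080/' );

# Hosts (again hypothetical) that should be contacted directly
$browser->no_proxy( 'localhost', 'intranet.example.com' );

# Called with just a scheme, proxy() reports the current setting
for my $scheme ( 'http', 'ftp', 'https' ) {
  my $proxy = $browser->proxy($scheme);
  print "$scheme: ", ( defined $proxy ? $proxy : 'direct connection' ), "\n";
}
```
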
   HTTP Authentication
       Many web sites restrict access to documents by using "HTTP
       Authentication". This isn't just any form of "enter your password"
       restriction, but is a specific mechanism where the HTTP server sends
       the browser an HTTP code that says "That document is part of a
       protected 'realm', and you can access it only if you re-request it and
       add some special authorization headers to your request".

       For example, the Unicode.org admins stop email-harvesting bots from
       harvesting the contents of their mailing list archives, by protecting
       them with HTTP Authentication, and then publicly stating the username
       and password (at "http://www.unicode.org/mail-arch/") -- namely
       username "unicode-ml" and password "unicode".

       Consider this URL, which is part of the protected area of the web
       site:

         http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html

       If you access that with a browser, you'll get a prompt like "Enter
       username and password for 'Unicode-MailList-Archives' at server
       'www.unicode.org'".

       In LWP, if you just request that URL, like this:

         use LWP;
         my $browser = LWP::UserAgent->new;

         my $url =
          'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';
         my $response = $browser->get($url);

         die "Error: ", $response->header('WWW-Authenticate') || 'Error accessing',
           #  (the 'WWW-Authenticate' header carries the realm name)
           "\n ", $response->status_line, "\n at $url\n Aborting"
          unless $response->is_success;

       Then you'll get this error:

         Error: Basic realm="Unicode-MailList-Archives"
          401 Authorization Required
          at http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html
          Aborting at auth1.pl line 9.  [or wherever]

       ...because the $browser doesn't know the username and password for
       that realm ("Unicode-MailList-Archives") at that host
       ("www.unicode.org").  The simplest way to let the browser know about
       this is to use the "credentials" method to let it know about a username
       and password that it can try using for that realm at that host.  The
       syntax is:

         $browser->credentials(
           'servername:portnumber',
           'realm-name',
           'username' => 'password'
         );

       In most cases, the port number is 80, the default TCP/IP port for HTTP;
       and you usually call the "credentials" method before you make any
       requests.  For example:

         $browser->credentials(
           'reports.mybazouki.com:80',
           'web_server_usage_reports',
           'plinky' => 'banjo123'
         );

       So if we add the following to the program above, right after the
       "$browser = LWP::UserAgent->new;" line...

         $browser->credentials(  # add this to our $browser 's "key ring"
           'www.unicode.org:80',
           'Unicode-MailList-Archives',
           'unicode-ml' => 'unicode'
         );

       ...then when we run it, the request succeeds, instead of causing the
       "die" to be called.

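       As for what those "special authorization headers" look like: for the
       "Basic" scheme shown in the error above, it is a single header
       containing the base64-encoded username and password. The sketch below
       builds such a request by hand with the "authorization_basic" method
       and decodes the header again. It sends nothing, and in normal use the
       "credentials" method takes care of this for you.

```perl
use strict;
use warnings;
use HTTP::Request;
use MIME::Base64 qw(decode_base64);

my $request = HTTP::Request->new( GET =>
  'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html' );

# Adds an "Authorization: Basic ..." header to the request
$request->authorization_basic( 'unicode-ml', 'unicode' );

my $header = $request->header('Authorization');
print "Authorization: $header\n";

# The part after "Basic" is just base64, not encryption
my ($encoded) = $header =~ /^Basic (\S+)$/;
my $decoded = decode_base64($encoded);
print "Decodes to: $decoded\n";    # unicode-ml:unicode
```
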
   Accessing HTTPS URLs
       When you access an HTTPS URL, it'll work for you just like an HTTP URL
       would -- if your LWP installation has HTTPS support (via an appropriate
       Secure Sockets Layer library).  For example:

         use LWP;
         my $url = 'https://www.paypal.com/';   # Yes, HTTPS!
         my $browser = LWP::UserAgent->new;
         my $response = $browser->get($url);
         die "Error at $url\n ", $response->status_line, "\n Aborting"
          unless $response->is_success;
         print "Whee, it worked!  I got that ",
          $response->content_type, " document!\n";

       If your LWP installation doesn't have HTTPS support set up, then the
       response will be unsuccessful, and you'll get this error message:

         Error at https://www.paypal.com/
          501 Protocol scheme 'https' is not supported
          Aborting at paypal.pl line 7.   [or whatever program and line]

       If your LWP installation does have HTTPS support installed, then the
       response should be successful, and you should be able to consult
       $response just like with any normal HTTP response.

       For information about installing HTTPS support for your LWP
       installation, see the helpful README.SSL file that comes in the
       libwww-perl distribution.

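       If you'd like to find out whether HTTPS support is present without
       making a request at all, one option is the LWP::UserAgent
       "is_protocol_supported" method, as in this small sketch:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# True if a protocol handler for the scheme can be loaded
my $message = $browser->is_protocol_supported('https')
  ? "HTTPS support is available\n"
  : "HTTPS support is not installed\n";

print $message;
```
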
   Getting Large Documents
       When you're requesting a large (or at least potentially large)
       document, a problem with the normal way of using the request methods
       (like "$response = $browser->get($url)") is that the response object
       will have to hold the whole document in memory. If the response is a
       thirty megabyte file, this is likely to be quite an imposition on this
       process's memory usage.

       A notable alternative is to have LWP save the content to a file on
       disk, instead of saving it up in memory.  This is the syntax to use:

         $response = $browser->get($url,
                                ':content_file' => $filespec,
                             );

       For example,

         $response = $browser->get('http://search.cpan.org/',
                                ':content_file' => '/tmp/sco.html'
                             );

       When you use this ":content_file" option, the $response will have all
       the normal header lines, but "$response->content" will be empty.

       Note that this ":content_file" option isn't supported under older
       versions of LWP, so you should consider adding "use LWP 5.66;" to check
       the LWP version, if you think your program might run on systems with
       older versions.

       If you need to be compatible with older LWP versions, then use this
       syntax, which does the same thing:

         use HTTP::Request::Common;
         $response = $browser->request( GET($url), $filespec );

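       Here is a sketch of the ":content_file" option at work. So that it
       can be run anywhere, it assumes a generated local file reached
       through a "file:" URL in place of a large remote document; the file
       names are made up.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use URI::file;
use File::Temp qw(tempdir);

# Assumption: a local file fetched through a file: URL stands in for
# a large remote document.
my $dir    = tempdir( CLEANUP => 1 );
my $source = "$dir/big.txt";
my $saved  = "$dir/saved.txt";

open my $out, '>', $source or die "Can't write $source: $!";
print {$out} "some line of data\n" x 10_000;
close $out;

my $browser  = LWP::UserAgent->new;
my $url      = URI::file->new($source)->as_string;
my $response = $browser->get( $url, ':content_file' => $saved );
die "Download failed: ", $response->status_line
  unless $response->is_success;

# The content went to disk, so the response object carries none of it
my $in_memory = length( $response->content );
my $on_disk   = -s $saved;
print "Bytes in memory: $in_memory\n";
print "Bytes on disk:   $on_disk\n";
```
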

COPYRIGHT

       Copyright 2002, Sean M. Burke.  You can redistribute this document
       and/or modify it, but only under the same terms as Perl itself.

AUTHOR

       Sean M. Burke "sburke@cpan.org"

perl v5.16.3                      2012-02-11                         lwptut(3)
