libwww::lwptut(3pm)

lwptut(3)             User Contributed Perl Documentation            lwptut(3)

NAME

       lwptut -- An LWP Tutorial

DESCRIPTION

       LWP (short for "Library for WWW in Perl") is a very popular group of
       Perl modules for accessing data on the Web. Like most Perl module
       distributions, each of LWP's component modules comes with documentation
       that is a complete reference to its interface. However, there are so
       many modules in LWP that it's hard to know where to start looking for
       information on how to do even the simplest, most common things.

       Really introducing you to using LWP would require a whole book -- a
       book that just happens to exist, called Perl & LWP. But this article
       should give you a taste of how you can go about some common tasks with
       LWP.

   Getting documents with LWP::Simple
       If you just want to get what's at a particular URL, the simplest way to
       do it is with LWP::Simple's functions.

       In a Perl program, you can call its "get($url)" function.  It will try
       getting that URL's content.  If it works, then it'll return the
       content; but if there's some error, it'll return undef.

         my $url = 'http://www.npr.org/programs/fa/?todayDate=current';
           # Just an example: the URL for the most recent /Fresh Air/ show

         use LWP::Simple;
         my $content = get $url;
         die "Couldn't get $url" unless defined $content;

         # Then go do things with $content, like this:

         if($content =~ m/jazz/i) {
           print "They're talking about jazz today on Fresh Air!\n";
         }
         else {
           print "Fresh Air is apparently jazzless today.\n";
         }

       The handiest variant on "get" is "getprint", which is useful in Perl
       one-liners.  If it can get the page whose URL you provide, it sends it
       to STDOUT; otherwise it complains to STDERR.

         % perl -MLWP::Simple -e "getprint 'http://www.cpan.org/RECENT'"

       That is the URL of a plain text file that lists new files in CPAN in
       the past two weeks.  You can easily make it part of a tidy little shell
       command, like this one that mails you the list of new "Acme::" modules:

         % perl -MLWP::Simple -e "getprint 'http://www.cpan.org/RECENT'"  \
            | grep "/by-module/Acme" | mail -s "New Acme modules! Joy!" $USER

       There are other useful functions in LWP::Simple, including one function
       ("head") for running a HEAD request on a URL (useful for checking
       links, or getting the last-revised time of a URL), and two functions
       ("getstore" and "mirror") for saving/mirroring a URL to a local file.
       See the LWP::Simple documentation for the full details, or chapter 2 of
       Perl & LWP for more examples.
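
       As a rough sketch of the saving and mirroring functions, the following
       uses LWP::Simple's "getstore" and "mirror".  To keep it runnable
       without a network connection, it assumes a temporary local file
       (addressed through a "file:" URL) in place of a real web document; the
       file names are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use LWP::Simple qw(getstore mirror is_success);
use URI::file;

# Assumption: a local file stands in for a remote document, so the
# sketch runs without network access.
my $dir    = tempdir( CLEANUP => 1 );
my $source = "$dir/source.txt";
open my $out, '>', $source or die "Can't write $source: $!";
print {$out} "new Acme modules go here\n";
close $out;

my $url = URI::file->new($source)->as_string;    # a file: URL

# getstore() saves the document and returns the HTTP status code.
my $store_status = getstore( $url, "$dir/copy.txt" );
print "getstore: $store_status\n";

# mirror() also saves the document, but is meant to skip the download
# when the local copy is already up to date.
my $mirror_status = mirror( $url, "$dir/mirror.txt" );
print "mirror: $mirror_status\n";

die "getstore failed" unless is_success($store_status);
```

       With a real "http:" URL the calls look the same; only the first
       argument changes.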

   The Basics of the LWP Class Model
       LWP::Simple's functions are handy for simple cases, but they don't
       support cookies or authorization, don't support setting header lines
       in the HTTP request, and generally don't support reading header lines
       in the HTTP response (notably the full HTTP error message, in case of
       an error). To get at all those features, you'll have to use the full
       LWP class model.

       While LWP consists of dozens of classes, the main two that you have to
       understand are LWP::UserAgent and HTTP::Response. LWP::UserAgent is a
       class for "virtual browsers" which you use for performing requests, and
       HTTP::Response is a class for the responses (or error messages) that
       you get back from those requests.

       The basic idiom is "$response = $browser->get($url)", or more fully
       illustrated:

         # Early in your program:

         use LWP 5.64; # Loads all important LWP classes, and makes
                       #  sure your version is reasonably recent.

         my $browser = LWP::UserAgent->new;

         ...

         # Then later, whenever you need to make a get request:
         my $url = 'http://www.npr.org/programs/fa/?todayDate=current';

         my $response = $browser->get( $url );
         die "Can't get $url -- ", $response->status_line
          unless $response->is_success;

         die "Hey, I was expecting HTML, not ", $response->content_type
          unless $response->content_type eq 'text/html';
            # or whatever content-type you're equipped to deal with

         # Otherwise, process the content somehow:

         if($response->decoded_content =~ m/jazz/i) {
           print "They're talking about jazz today on Fresh Air!\n";
         }
         else {
           print "Fresh Air is apparently jazzless today.\n";
         }

       There are two objects involved: $browser, which holds an object of
       class LWP::UserAgent, and then the $response object, which is of class
       HTTP::Response. You really need only one browser object per program;
       but every time you make a request, you get back a new HTTP::Response
       object, which will have some interesting attributes:

       ·   A status code indicating success or failure (which you can test
           with "$response->is_success").

       ·   An HTTP status line that is hopefully informative if there's
           failure (which you can see with "$response->status_line", returning
           something like "404 Not Found").

       ·   A MIME content-type like "text/html", "image/gif",
           "application/xml", etc., which you can see with
           "$response->content_type".

       ·   The actual content of the response, in
           "$response->decoded_content".  If the response is HTML, that's
           where the HTML source will be; if it's a GIF, then
           "$response->decoded_content" will be the binary GIF data.

       ·   And dozens of other convenient and more specific methods that are
           documented in the docs for HTTP::Response, and its superclasses
           HTTP::Message and HTTP::Headers.
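
       To get a feel for those attributes without making a request, you can
       build an HTTP::Response object by hand.  This is only a sketch: the
       status, header, and body below are invented stand-ins for what
       "$browser->get($url)" might hand back after a failed request.

```perl
use strict;
use warnings;
use HTTP::Response;

# A hand-built response (hypothetical content), standing in for what
# $browser->get($url) would return after a failed request.
my $response = HTTP::Response->new(
    404, 'Not Found',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    '<html><body>No such page</body></html>',
);

# In scalar context, content_type returns just the MIME type,
# without parameters such as the charset.
my $type = $response->content_type;

print "Success?      ", ( $response->is_success ? 'yes' : 'no' ), "\n";
print "Status line:  ", $response->status_line, "\n";
print "Content type: $type\n";
print "Body length:  ", length( $response->decoded_content ), "\n";
```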

   Adding Other HTTP Request Headers
       The most commonly used syntax for requests is "$response =
       $browser->get($url)", but in truth, you can add extra HTTP header lines
       to the request by adding a list of key-value pairs after the URL, like
       so:

         $response = $browser->get( $url, $key1, $value1, $key2, $value2, ... );

       For example, here's how to send some commonly used headers, in case
       you're dealing with a site that would otherwise reject your request:

         my @ns_headers = (
          'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
          'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
          'Accept-Charset' => 'iso-8859-1,*,utf-8',
          'Accept-Language' => 'en-US',
         );

         ...

         $response = $browser->get($url, @ns_headers);

       If you weren't reusing that array, you could just go ahead and do this:

         $response = $browser->get($url,
          'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
          'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
          'Accept-Charset' => 'iso-8859-1,*,utf-8',
          'Accept-Language' => 'en-US',
         );

       If you were only ever changing the 'User-Agent' line, you could just
       change the $browser object's default line from "libwww-perl/5.65" (or
       the like) to whatever you like, using the LWP::UserAgent "agent"
       method:

          $browser->agent('Mozilla/4.76 [en] (Win98; U)');
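
       If you want to see what headers a request would carry before sending
       it, one approach is to build the request object yourself with
       HTTP::Request::Common and print it.  The sketch below also uses
       LWP::UserAgent's "default_header" method, which sets a header for
       every request made through that browser object.  The URL and header
       values are placeholders.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(GET);

my $browser = LWP::UserAgent->new;

# A header that should accompany every request from this browser object.
$browser->default_header( 'Accept-Language' => 'en-US' );

# Build (but don't send) a request with extra per-request headers.
# The URL is a placeholder.
my $request = GET( 'http://www.example.com/',
    'Accept'         => 'text/html, */*',
    'Accept-Charset' => 'utf-8',
);

print $request->as_string;    # the request as it stands so far
print "Default Accept-Language: ",
    scalar $browser->default_header('Accept-Language'), "\n";

# my $response = $browser->request($request);    # this would send it
```

       Note that the default headers are added by the browser object when the
       request is actually sent, so they don't show up in the printed request.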

   Enabling Cookies
       A default LWP::UserAgent object acts like a browser with its cookie
       support turned off. You turn it on by setting its "cookie_jar"
       attribute. A "cookie jar" is an object representing a little database
       of all the HTTP cookies that a browser knows about. It can correspond
       to a file on disk or an in-memory object that starts out empty, and
       whose collection of cookies will disappear once the program is
       finished running.

       To give a browser an in-memory empty cookie jar, you set its
       "cookie_jar" attribute like so:

         use HTTP::CookieJar::LWP;
         $browser->cookie_jar( HTTP::CookieJar::LWP->new );

       To save a cookie jar to disk, see "dump_cookies" in HTTP::CookieJar.
       To load cookies from disk into a jar, see "load_cookies" in
       HTTP::CookieJar.
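
       Here is a minimal sketch of that save-and-restore round trip, assuming
       the HTTP::CookieJar distribution is installed.  The cookie and the
       example.com URLs are made up; in a real program the jar would be
       filled by the browser object as responses arrive, and you would write
       the dumped strings to a file of your choosing.

```perl
use strict;
use warnings;
use HTTP::CookieJar::LWP;

my $jar = HTTP::CookieJar::LWP->new;

# Pretend a server (placeholder URL) sent us a Set-Cookie header.
$jar->add( 'http://www.example.com/', 'SID=abc123; Path=/' );

# dump_cookies returns one string per cookie; you could write these
# lines to a file and read them back in a later run.
my @saved = $jar->dump_cookies;

# A fresh jar, refilled from the saved strings.
my $restored = HTTP::CookieJar::LWP->new;
$restored->load_cookies(@saved);

print "Cookie header: ",
    $restored->cookie_header('http://www.example.com/page'), "\n";
```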

   Posting Form Data
       Many HTML forms send data to their server using an HTTP POST request,
       which you can send with this syntax:

         $response = $browser->post( $url,
           [
             formkey1 => value1,
             formkey2 => value2,
             ...
           ],
         );

       Or if you need to send HTTP headers:

         $response = $browser->post( $url,
           [
             formkey1 => value1,
             formkey2 => value2,
             ...
           ],
           headerkey1 => value1,
           headerkey2 => value2,
         );

       For example, the following program makes a search request to AltaVista
       (by sending some form data via an HTTP POST request), and extracts from
       the HTML the report of the number of matches:

         use strict;
         use warnings;
         use LWP 5.64;
         my $browser = LWP::UserAgent->new;

         my $word = 'tarragon';

         my $url = 'http://search.yahoo.com/yhs/search';
         my $response = $browser->post( $url,
           [ 'q' => $word,  # the Altavista query string
             'fr' => 'altavista', 'pg' => 'q', 'avkw' => 'tgz', 'kl' => 'XX',
           ]
         );
         die "$url error: ", $response->status_line
          unless $response->is_success;
         die "Weird content type at $url -- ", $response->content_type
          unless $response->content_is_html;

         if( $response->decoded_content =~ m{([0-9,]+)(?:<.*?>)? results for} ) {
           # The substring will be like "996,000</strong> results for"
           print "$word: $1\n";
         }
         else {
           print "Couldn't find the match-string in the response\n";
         }
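
       To see how the form data travels in the request, you can build a POST
       request with HTTP::Request::Common (which, as far as I know, is what
       the "post" method relies on) and look at it without sending anything.
       The URL and form fields in this sketch are placeholders.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Build the same kind of request that $browser->post(...) would send,
# without sending it.  The URL and form fields are placeholders.
my $request = POST( 'http://www.example.com/search',
    [ 'q' => 'tarragon', 'lang' => 'en' ],
);

print "Method:       ", $request->method, "\n";
print "Content-Type: ", scalar $request->header('Content-Type'), "\n";
print "Body:         ", $request->content, "\n";
```

       The key-value pairs end up URL-encoded in the request body, rather than
       on the end of the URL.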

   Sending GET Form Data
       Some HTML forms convey their form data not by sending the data in an
       HTTP POST request, but by making a normal GET request with the data
       stuck on the end of the URL.  For example, if you went to
       "www.imdb.com" and ran a search on "Blade Runner", the URL you'd see in
       your browser window would be:

         http://www.imdb.com/find?s=all&q=Blade+Runner

       To run the same search with LWP, you'd use this idiom, which involves
       the URI class:

         use URI;
         my $url = URI->new( 'http://www.imdb.com/find' );
           # makes an object representing the URL

         $url->query_form(  # And here the form data pairs:
           'q' => 'Blade Runner',
           's' => 'all',
         );

         my $response = $browser->get($url);

       See chapter 5 of Perl & LWP for a longer discussion of HTML forms and
       of form data, and chapters 6 through 9 for a longer discussion of
       extracting data from HTML.
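
       If you're curious what URL the "query_form" call above actually
       produces, you can print the URI object; it needs no network access.
       This small sketch also reads the form pairs back out again.

```perl
use strict;
use warnings;
use URI;

my $url = URI->new('http://www.imdb.com/find');
$url->query_form( 'q' => 'Blade Runner', 's' => 'all' );

print "$url\n";    # the URL object stringifies to the full URL

# Called with no arguments, query_form returns the pairs again.
my %form = $url->query_form;
print "q is '$form{q}'\n";
```

       The pairs appear in the order you gave them, and the space in "Blade
       Runner" is encoded for you.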

   Absolutizing URLs
       The URI class that we just mentioned above provides all sorts of
       methods for accessing and modifying parts of URLs (such as asking what
       sort of URL it is with "$url->scheme", and asking what host it refers
       to with "$url->host", and so on), as described in the docs for the URI
       class.  However, the methods of most immediate interest are the
       "query_form" method seen above, and now the "new_abs" method for taking
       a probably-relative URL string (like "../foo.html") and getting back an
       absolute URL (like "http://www.perl.com/stuff/foo.html"), as shown
       here:

         use URI;
         $abs = URI->new_abs($maybe_relative, $base);

       For example, consider this program that matches URLs in the HTML list
       of new modules in CPAN:

         use strict;
         use warnings;
         use LWP;
         my $browser = LWP::UserAgent->new;

         my $url = 'http://www.cpan.org/RECENT.html';
         my $response = $browser->get($url);
         die "Can't get $url -- ", $response->status_line
          unless $response->is_success;

         my $html = $response->decoded_content;
         while( $html =~ m/<A HREF=\"(.*?)\"/g ) {
           print "$1\n";
         }

       When run, it emits output that starts out something like this:

         MIRRORING.FROM
         RECENT
         RECENT.html
         authors/00whois.html
         authors/01mailrc.txt.gz
         authors/id/A/AA/AASSAD/CHECKSUMS
         ...

       However, if you actually want to have those be absolute URLs, you can
       use the URI module's "new_abs" method, by changing the "while" loop to
       this:

         while( $html =~ m/<A HREF=\"(.*?)\"/g ) {
           print URI->new_abs( $1, $response->base ) ,"\n";
         }

       (The "$response->base" method from HTTP::Message is for returning what
       URL should be used for resolving relative URLs -- it's usually just the
       same as the URL that you requested.)

       That program then emits nicely absolute URLs:

         http://www.cpan.org/MIRRORING.FROM
         http://www.cpan.org/RECENT
         http://www.cpan.org/RECENT.html
         http://www.cpan.org/authors/00whois.html
         http://www.cpan.org/authors/01mailrc.txt.gz
         http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS
         ...

       See chapter 4 of Perl & LWP for a longer discussion of URI objects.

       Of course, using a regexp to match hrefs is a bit simplistic, and for
       more robust programs, you'll probably want to use an HTML-parsing
       module like HTML::LinkExtor or HTML::TokeParser or even maybe
       HTML::TreeBuilder.
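
       As one possible starting point, here is a sketch of the same link
       listing done with HTML::LinkExtor instead of a regexp.  The HTML
       fragment and base URL are invented stand-ins for
       "$response->decoded_content" and "$response->base", so the sketch runs
       without fetching anything.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# A made-up HTML fragment and base URL, standing in for
# $response->decoded_content and $response->base.
my $html = <<'END_HTML';
<a href="RECENT">Recent files</a>
<A HREF="authors/00whois.html">Authors</A>
<img src="misc/logo.png">
END_HTML
my $base = 'http://www.cpan.org/RECENT.html';

my @links;
my $parser = HTML::LinkExtor->new(
    sub {
        my ( $tag, %attr ) = @_;
        # Keep only <a href="...">; tag and attribute names arrive lowercased.
        push @links, "$attr{href}" if $tag eq 'a' && defined $attr{href};
    },
    $base,    # with a base URL, links are reported as absolute URLs
);
$parser->parse($html);
$parser->eof;

print "$_\n" for @links;
```

       Unlike the regexp, this doesn't care whether the tag is written as
       "<A HREF=...>" or "<a href=...>", and it skips the image link because
       the callback asks only for "a" tags.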

   Other Browser Attributes
       LWP::UserAgent objects have many attributes for controlling how they
       work.  Here are a few notable ones:

       ·   "$browser->timeout(15);"

           This sets this browser object to give up on requests that don't
           answer within 15 seconds.

       ·   "$browser->protocols_allowed( [ 'http', 'gopher'] );"

           This sets this browser object to not speak any protocols other than
           HTTP and gopher. If it tries accessing any other kind of URL (like
           an "ftp:" or "mailto:" or "news:" URL), then it won't actually try
           connecting, but instead will immediately return an error code 500,
           with a message like "Access to 'ftp' URIs has been disabled".

       ·   "use LWP::ConnCache; $browser->conn_cache(LWP::ConnCache->new());"

           This tells the browser object to try using the HTTP/1.1
           "Keep-Alive" feature, which speeds up requests by reusing the same
           socket connection for multiple requests to the same server.

       ·   "$browser->agent( 'SomeName/1.23 (more info here maybe)' )"

           This changes how the browser object will identify itself in the
           default "User-Agent" line in its HTTP requests.  By default, it'll
           send "libwww-perl/versionnumber", like "libwww-perl/5.65".  You can
           change that to something more descriptive like this:

             $browser->agent( 'SomeName/3.14 (contact@robotplexus.int)' );

           Or if need be, you can go in disguise, like this:

             $browser->agent( 'Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)' );

       ·   "push @{ $browser->requests_redirectable }, 'POST';"

           This tells this browser to obey redirection responses to POST
           requests (like most modern interactive browsers), even though the
           HTTP RFC says that should not normally be done.

       For more options and information, see the full documentation for
       LWP::UserAgent.
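
       Several of these attributes can be tried out without touching the
       network.  The sketch below sets a few of them and then requests an
       "ftp:" URL that is deliberately not among the allowed protocols; the
       FTP host name is a placeholder and is never contacted.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;
$browser->timeout(15);
$browser->agent('SomeName/3.14 (contact@robotplexus.int)');
$browser->protocols_allowed( [ 'http', 'https' ] );

# The host is a placeholder; because "ftp" is not allowed, LWP answers
# with an error response instead of opening a connection.
my $response = $browser->get('ftp://ftp.example.com/pub/README');

print "Timeout: ", $browser->timeout, "\n";
print "Agent:   ", $browser->agent, "\n";
print "Status:  ", $response->status_line, "\n";
```

       Calling an attribute method with no argument, as the "print" lines do,
       returns its current value.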

   Writing Polite Robots
       If you want to make sure that your LWP-based program respects
       robots.txt files and doesn't make too many requests too fast, you can
       use the LWP::RobotUA class instead of the LWP::UserAgent class.

       The LWP::RobotUA class is just like LWP::UserAgent, and you can use it
       like so:

         use LWP::RobotUA;
         my $browser = LWP::RobotUA->new('YourSuperBot/1.34', 'you@yoursite.com');
           # Your bot's name and your email address

         my $response = $browser->get($url);

       But LWP::RobotUA adds these features:

       ·   If the robots.txt on $url's server forbids you from accessing $url,
           then the $browser object (assuming it's of class LWP::RobotUA)
           won't actually request it, but instead will give you back (in
           $response) a 403 error with a message "Forbidden by robots.txt".
           That is, if you have this line:

             die "$url -- ", $response->status_line, "\nAborted"
              unless $response->is_success;

           then the program would die with an error message like this:

             http://whatever.site.int/pith/x.html -- 403 Forbidden by robots.txt
             Aborted at whateverprogram.pl line 1234

       ·   If this $browser object sees that the last time it talked to $url's
           server was too recently, then it will pause (via "sleep") to avoid
           making too many requests too often. By default it pauses for one
           minute -- but you can control that with the
           "$browser->delay( minutes )" attribute.

           For example, this code:

             $browser->delay( 7/60 );

           ...means that this browser will pause when it needs to avoid
           talking to any given server more than once every 7 seconds.

       For more options and information, see the full documentation for
       LWP::RobotUA.
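
       To illustrate both features without bothering a real server, this
       sketch sets the delay on a robot browser object and then checks two
       URLs against a made-up robots.txt using WWW::RobotRules, the
       rule-parsing module that LWP::RobotUA relies on.  The robots.txt text
       and the site name are invented for the example.

```perl
use strict;
use warnings;
use LWP::RobotUA;
use WWW::RobotRules;

my $browser = LWP::RobotUA->new( 'YourSuperBot/1.34', 'you@yoursite.com' );
$browser->delay( 7 / 60 );    # delay is given in minutes, so this is 7 seconds
printf "Delay between requests: %.0f seconds\n", $browser->delay * 60;

# A made-up robots.txt, parsed the way a polite robot would read it.
my $rules = WWW::RobotRules->new('YourSuperBot/1.34');
$rules->parse( 'http://whatever.site.int/robots.txt',
    "User-agent: *\nDisallow: /pith/\n" );

for my $url ( 'http://whatever.site.int/pith/x.html',
              'http://whatever.site.int/index.html' ) {
    printf "%-40s %s\n", $url,
        $rules->allowed($url) ? 'allowed' : 'forbidden';
}
```

       In normal use you don't call WWW::RobotRules yourself; the robot
       browser object fetches and consults robots.txt for you.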

   Using Proxies
       In some cases, you will want to (or will have to) use proxies for
       accessing certain sites and/or using certain protocols. This is most
       commonly the case when your LWP program is running (or could be
       running) on a machine that is behind a firewall.

       To make a browser object use proxies that are defined in the usual
       environment variables ("HTTP_PROXY", etc.), just call the "env_proxy"
       method on a user-agent object before you go making any requests on it.
       Specifically:

         use LWP::UserAgent;
         my $browser = LWP::UserAgent->new;

         # And before you go making any requests:
         $browser->env_proxy;

       For more information on proxy parameters, see the LWP::UserAgent
       documentation, specifically the "proxy", "env_proxy", and "no_proxy"
       methods.
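
       If you would rather name the proxy in the program than rely on
       environment variables, the "proxy" and "no_proxy" methods can be used
       roughly as follows.  The proxy host, port, and domain names here are
       hypothetical; nothing is contacted until you make a request.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Hypothetical proxy host and port; substitute your site's own.
$browser->proxy( [ 'http', 'ftp' ], 'http://proxy.example.int:8080/' );

# Hosts in these domains are contacted directly, bypassing the proxy.
$browser->no_proxy( 'localhost', 'example.int' );

print "HTTP requests go via: ", $browser->proxy('http'), "\n";
```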

   HTTP Authentication
       Many web sites restrict access to documents by using "HTTP
       Authentication". This isn't just any form of "enter your password"
       restriction, but is a specific mechanism where the HTTP server sends
       the browser an HTTP code that says "That document is part of a
       protected 'realm', and you can access it only if you re-request it and
       add some special authorization headers to your request".

       For example, the Unicode.org admins stop email-harvesting bots from
       harvesting the contents of their mailing list archives, by protecting
       them with HTTP Authentication, and then publicly stating the username
       and password (at "http://www.unicode.org/mail-arch/") -- namely
       username "unicode-ml" and password "unicode".

       Consider this URL, which is part of the protected area of the web
       site:

         http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html

       If you access that with a browser, you'll get a prompt like "Enter
       username and password for 'Unicode-MailList-Archives' at server
       'www.unicode.org'".

       In LWP, if you just request that URL, like this:

         use LWP;
         my $browser = LWP::UserAgent->new;

         my $url =
          'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';
         my $response = $browser->get($url);

         die "Error: ", $response->header('WWW-Authenticate') || 'Error accessing',
           #  ('WWW-Authenticate' is the realm-name)
           "\n ", $response->status_line, "\n at $url\n Aborting"
          unless $response->is_success;

       Then you'll get this error:

         Error: Basic realm="Unicode-MailList-Archives"
          401 Authorization Required
          at http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html
          Aborting at auth1.pl line 9.  [or wherever]

       ...because the $browser doesn't know the username and password for
       that realm ("Unicode-MailList-Archives") at that host
       ("www.unicode.org").  The simplest way to let the browser know about
       this is to use the "credentials" method to let it know about a username
       and password that it can try using for that realm at that host.  The
       syntax is:

         $browser->credentials(
           'servername:portnumber',
           'realm-name',
           'username' => 'password'
         );

       In most cases, the port number is 80, the default TCP/IP port for HTTP;
       and you usually call the "credentials" method before you make any
       requests.  For example:

         $browser->credentials(
           'reports.mybazouki.com:80',
           'web_server_usage_reports',
           'plinky' => 'banjo123'
         );

       So if we add the following to the program above, right after the
       "$browser = LWP::UserAgent->new;" line...

         $browser->credentials(  # add this to our $browser 's "key ring"
           'www.unicode.org:80',
           'Unicode-MailList-Archives',
           'unicode-ml' => 'unicode'
         );

       ...then when we run it, the request succeeds, instead of causing the
       "die" to be called.
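
       Two related things you can do, sketched here without making any
       request: ask the browser object what is on its "key ring" for a given
       host and realm, and (as an alternative not covered above) attach Basic
       credentials to one specific request with the "authorization_basic"
       method that request objects get from HTTP::Headers.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $browser = LWP::UserAgent->new;
$browser->credentials(
    'www.unicode.org:80',
    'Unicode-MailList-Archives',
    'unicode-ml' => 'unicode',
);

# Called without a username and password, credentials() reports what
# is on the key ring for that host and realm.
my ( $user, $pass ) = $browser->credentials(
    'www.unicode.org:80', 'Unicode-MailList-Archives' );
print "Stored login: $user\n";

# Alternative: attach Basic credentials to a single request up front,
# without waiting for the server's 401 challenge.
my $request = HTTP::Request->new( GET =>
    'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html' );
$request->authorization_basic( 'unicode-ml', 'unicode' );
print "Authorization: ", scalar $request->header('Authorization'), "\n";
```

       Keep in mind that Basic credentials are only encoded, not encrypted, so
       the per-request approach is best kept to servers you trust.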

   Accessing HTTPS URLs
       When you access an HTTPS URL, it'll work for you just like an HTTP URL
       would -- if your LWP installation has HTTPS support (via an appropriate
       Secure Sockets Layer library).  For example:

         use LWP;
         my $url = 'https://www.paypal.com/';   # Yes, HTTPS!
         my $browser = LWP::UserAgent->new;
         my $response = $browser->get($url);
         die "Error at $url\n ", $response->status_line, "\n Aborting"
          unless $response->is_success;
         print "Whee, it worked!  I got that ",
          $response->content_type, " document!\n";

       If your LWP installation doesn't have HTTPS support set up, then the
       response will be unsuccessful, and you'll get this error message:

         Error at https://www.paypal.com/
          501 Protocol scheme 'https' is not supported
          Aborting at paypal.pl line 7.   [or whatever program and line]

       If your LWP installation does have HTTPS support installed, then the
       response should be successful, and you should be able to consult
       $response just like with any normal HTTP response.

       For information about installing HTTPS support for your LWP
       installation, see the helpful README.SSL file that comes in the
       libwww-perl distribution.
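
       If you'd like to find out ahead of time whether HTTPS will work, one
       option is LWP::UserAgent's "is_protocol_supported" method, sketched
       below.  (On recent LWP releases, as far as I know, HTTPS support is
       provided by the separately distributed LWP::Protocol::https module.)

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Ask the browser object which URL schemes it can handle; no request
# is made.
for my $scheme ( 'http', 'https' ) {
    printf "%-5s %s\n", $scheme,
        $browser->is_protocol_supported($scheme)
        ? 'supported'
        : 'NOT supported';
}
```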

   Getting Large Documents
       When you're requesting a large (or at least potentially large)
       document, a problem with the normal way of using the request methods
       (like "$response = $browser->get($url)") is that the response object in
       memory will have to hold the whole document -- in memory. If the
       response is a thirty megabyte file, this is likely to be quite an
       imposition on this process's memory usage.

       A notable alternative is to have LWP save the content to a file on
       disk, instead of saving it up in memory.  This is the syntax to use:

         $response = $browser->get($url,
                                ':content_file' => $filespec,
                             );

       For example,

         $response = $browser->get('http://search.cpan.org/',
                                ':content_file' => '/tmp/sco.html'
                             );

       When you use this ":content_file" option, the $response will have all
       the normal header lines, but "$response->content" will be empty.
       Errors writing to the content file (for example due to permission
       denied or the filesystem being full) will be reported via the
       "Client-Aborted" or "X-Died" response headers, and not the "is_success"
       method:

         if ( ($response->header('Client-Aborted') // '') eq 'die' ) {
           # handle error ...
         }

       Note that this ":content_file" option isn't supported under older
       versions of LWP, so you should consider adding "use LWP 5.66;" to check
       the LWP version, if you think your program might run on systems with
       older versions.

       If you need to be compatible with older LWP versions, then use this
       syntax, which does the same thing:

         use HTTP::Request::Common;
         $response = $browser->request( GET($url), $filespec );
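
       Here is a small end-to-end sketch of the ":content_file" option.  So
       that it can run without a network connection, it assumes a temporary
       local file (fetched through a "file:" URL) in place of a large remote
       document; the file names are made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use LWP::UserAgent;
use URI::file;

# Assumption: a local file stands in for a large remote document, so
# the sketch runs without network access.
my $dir    = tempdir( CLEANUP => 1 );
my $source = "$dir/big.txt";
open my $out, '>', $source or die "Can't write $source: $!";
print {$out} "x" x 100_000;
close $out;

my $browser  = LWP::UserAgent->new;
my $filespec = "$dir/saved.txt";
my $response = $browser->get( URI::file->new($source),
    ':content_file' => $filespec );

die "Request failed: ", $response->status_line
    unless $response->is_success;
die "Saving failed: ", scalar $response->header('X-Died')
    if defined $response->header('X-Died');

my $on_disk   = -s $filespec;
my $in_memory = length $response->content;
print "Bytes on disk:   $on_disk\n";
print "Bytes in memory: $in_memory\n";
```

       The content lands in the file named by $filespec, while the response
       object itself stays small.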

COPYRIGHT

       Copyright 2002, Sean M. Burke.  You can redistribute this document
       and/or modify it, but only under the same terms as Perl itself.

AUTHOR

       Sean M. Burke "sburke@cpan.org"

perl v5.32.1                      2021-03-09                         lwptut(3)
