WWW::Mechanize::FAQ(3)  User Contributed Perl Documentation  WWW::Mechanize::FAQ(3)

NAME

       WWW::Mechanize::FAQ - Frequently Asked Questions about WWW::Mechanize

VERSION

       version 2.03

How to get help with WWW::Mechanize

       If your question isn't answered here in the FAQ, please turn to the
       communities at:

       •   StackOverflow
           <https://stackoverflow.com/questions/tagged/www-mechanize>

       •   #lwp on irc.perl.org

       •   <http://perlmonks.org>

       •   The libwww-perl mailing list at <http://lists.perl.org>

JavaScript

   I have this web page that has JavaScript on it, and my Mech program doesn't
       work.
       That's because WWW::Mechanize doesn't operate on the JavaScript.  It
       only understands the HTML parts of the page.

   I thought Mech was supposed to work like a web browser.
       It does pretty much, but it doesn't support JavaScript.

       I added some basic attempts at picking up URLs in "window.open()" calls
       and returning them in "$mech->links".  They work sometimes.

       Since JavaScript is completely visible to the client, it cannot be used
       to prevent a scraper from following links, but it can make life
       difficult.  If you want to scrape specific pages, then a solution is
       always possible.

       One typical use of JavaScript is to perform argument checking before
       posting to the server.  The URL you want is probably just buried in the
       JavaScript function.  Do a regular expression match on
       "$mech->content()" to find the link that you want and "$mech->get" it
       directly (this assumes that you know what you are looking for in
       advance).

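       The regular expression approach can be sketched as follows.  The page
       content, the function name "checkAndGo" and the URL are all invented
       for illustration; in a real program the content would come from
       "$mech->content()" and the pattern would have to match whatever your
       page actually contains.

```perl
use strict;
use warnings;

# Hypothetical page content; in a real program this would come from
# $mech->content() after a successful $mech->get().
my $content = <<'HTML';
<a href="#" onclick="return checkAndGo();">Download</a>
<script>
function checkAndGo() {
    if ( !validate() ) { return false; }
    window.location = '/files/report.pdf?id=42';
    return false;
}
</script>
HTML

# Capture the quoted string assigned to window.location.
my ($url) = $content =~ m{window\.location\s*=\s*['"]([^'"]+)['"]};

if ( defined $url ) {
    print "Found URL: $url\n";
    # With a live Mech object you would now call: $mech->get( $url );
}
else {
    die "Could not find the URL in the page content\n";
}
```
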
       In more difficult cases, the JavaScript is used for URL mangling to
       satisfy the needs of some middleware.  In this case you need to figure
       out what the JavaScript is doing (why are these URLs always really
       long?).  There is probably some function with one or more arguments
       which calculates the new URL.  Step 1: using your favorite browser,
       get the before and after URLs and save them to files.  Edit each file,
       converting the argument separators ('?', '&' or ';') into newlines.  Now
       it is easy to use diff or comm to find out what the JavaScript did to
       the URL.  Step 2: find the function call which created the URL; you
       will need to parse and interpret its argument list.  The JavaScript
       debugger in the Firebug extension for Firefox helps with the analysis.
       At this point, it is fairly trivial to write your own function which
       emulates the JavaScript for the pages you want to process.

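       Step 1 can also be done in Perl instead of with an editor and diff.
       This is only a sketch: both URLs and the helper "url_parts" are made
       up for the example, and it reports just the pieces that were added.

```perl
use strict;
use warnings;

# Hypothetical "before" and "after" URLs captured from the browser.
my $before = 'http://www.example.com/app?page=2&sort=name';
my $after  = 'http://www.example.com/app?page=2&sort=name&sid=abc123&ts=1700000000';

# Split a URL into its pieces at the argument separators '?', '&' and ';'.
sub url_parts {
    my ($url) = @_;
    return split /[?&;]/, $url;
}

my %in_before = map { $_ => 1 } url_parts($before);

# Pieces that appear only in the "after" URL are what the JavaScript added.
my @added = grep { !$in_before{$_} } url_parts($after);

print "$_\n" for @added;
```
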
       Here's another approach that answers the question, "It works in
       Firefox, but why not Mech?"  Everything the web server knows about the
       client is present in the HTTP request.  If two requests are identical,
       the results should be identical.  So the real question is "What is
       different between the Mech request and the Firefox request?"

       The Firefox extension "Tamper Data" is an effective tool for examining
       the headers of the requests to the server.  Compare that with what LWP
       is sending.  Once the two are identical, the action of the server should
       be the same as well.

       I say "should", because this is an oversimplification: some values are
       naturally unique, e.g. a SessionID, but if a SessionID is present, that
       is probably sufficient, even though the value will be different between
       the LWP request and the Firefox request.  The server could use the
       session to store information which is troublesome, but that's not the
       first place to look (and highly unlikely to be relevant when you are
       requesting the login page of your site).

       Generally the problem is to be found in missing or incorrect POSTDATA
       arguments, Cookies, User-Agents, Accepts, etc.  If you are using Mech,
       then redirects and cookies should not be a problem, but are listed here
       for completeness.  If you are missing headers, "$mech->add_header" can
       be used to add the headers that you need.

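       A minimal sketch of bringing Mech's headers closer to the browser's.
       The User-Agent string and header values below are placeholders, not
       recommendations; copy the real ones from the request your browser
       sends.

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

# Hypothetical values; copy the real ones from the browser's request.
$mech->agent( 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0' );
$mech->add_header(
    'Accept'          => 'text/html,application/xhtml+xml',
    'Accept-Language' => 'en-US,en;q=0.5',
);

# The headers are sent with every subsequent request, for example:
# $mech->get( 'http://www.example.com/' );
```
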
   Which modules work like Mechanize and have JavaScript support?
       In no particular order: Gtk2::WebKit::Mechanize, Win32::IE::Mechanize,
       WWW::Mechanize::Firefox, WWW::Scripter, WWW::Selenium

How do I do X?

   Can I do [such-and-such] with WWW::Mechanize?
       If it's possible with LWP::UserAgent, then yes.  WWW::Mechanize is a
       subclass of LWP::UserAgent, so all the wondrous magic of that class is
       inherited.

   How do I use WWW::Mechanize through a proxy server?
       See the docs in LWP::UserAgent on how to use the proxy.  Short version:

           $mech->proxy(['http', 'ftp'], 'http://proxy.example.com:8000/');

       or get the specs from the environment:

           $mech->env_proxy();

           # Environment set like so:
           gopher_proxy=http://proxy.my.place/
           wais_proxy=http://proxy.my.place/
           no_proxy="localhost,my.domain"
           export gopher_proxy wais_proxy no_proxy

   How can I see what fields are on the forms?
       Use the mech-dump utility, optionally installed with Mechanize.

           $ mech-dump --forms http://search.cpan.org
           Dumping forms
           GET http://search.cpan.org/search
             query=
             mode=all                        (option)  [*all|module|dist|author]
             <NONAME>=CPAN Search            (submit)

   How do I get Mech to handle authentication?
           use MIME::Base64 qw( encode_base64 );
           use WWW::Mechanize;

           # USER, PASS, ADDRESS, REALM and URL are placeholders for your
           # own values.
           my $mech = WWW::Mechanize->new();
           my @args = (
               # The empty second argument stops encode_base64() from
               # appending a newline to the header value.
               Authorization => 'Basic ' . encode_base64( USER . ':' . PASS, '' )
           );

           $mech->credentials( ADDRESS, REALM, USER, PASS );
           $mech->get( URL, @args );

       If you want to use the credentials for all future requests, you can
       also use the LWP::UserAgent "default_header()" method instead of the
       extra arguments to "get()":

           $mech->default_header(
               Authorization => 'Basic ' . encode_base64( USER . ':' . PASS, '' ) );

   How can I get WWW::Mechanize to execute this JavaScript?
       You can't.  JavaScript is entirely client-based, and WWW::Mechanize is
       a client that doesn't understand JavaScript.  See the top part of this
       FAQ.

   How do I check a checkbox that doesn't have a value defined?
       Set it to the value of "on".

           $mech->field( my_checkbox => 'on' );

   How do I handle frames?
       You don't deal with them as frames, per se, but as links.  Extract them
       with

           my @frame_links = $mech->find_all_links( tag => "frame" );

   How do I get a list of HTTP headers and their values?
       All HTTP::Headers methods work on an HTTP::Response object, which is
       returned by the get(), reload(), response()/res(), click(),
       submit_form(), and request() methods.

           my $mech = WWW::Mechanize->new( autocheck => 1 );
           $mech->get( 'http://my.site.com' );
           my $response = $mech->response();
           for my $key ( $response->header_field_names() ) {
               print $key, " : ", $response->header( $key ), "\n";
           }

   How do I enable keep-alive?
       Since WWW::Mechanize is a subclass of LWP::UserAgent, you can use the
       same mechanism to enable keep-alive:

           use LWP::ConnCache;
           ...
           $mech->conn_cache(LWP::ConnCache->new);

   How can I change/specify the action parameter of an HTML form?
       You can access the action of the form through the HTML::Form object
       returned by one of the form-selecting methods.

       Using "$mech->form_number($number)":

           my $mech = WWW::Mechanize->new;
           $mech->get('http://someurlhere.com');
           # Select the form by its one-based index in document order
           $mech->form_number(1)->action('http://newAction'); # absolute URL

       Using "$mech->form_name($name)":

           my $mech = WWW::Mechanize->new;
           $mech->get('http://someurlhere.com');
           # Select the form by the value of its name attribute
           $mech->form_name('trgForm')->action('http://newAction'); # absolute URL

   How do I save an image?  How do I save a large tarball?
       An image is just content.  You get the image and save it.

           $mech->get( 'photo.jpg' );
           $mech->save_content( '/path/to/my/directory/photo.jpg' );

       You can also save any content directly to disk using the
       ":content_file" flag to "get()", which is part of LWP::UserAgent.

           $mech->get( 'http://www.cpan.org/src/stable.tar.gz',
                       ':content_file' => 'stable.tar.gz' );

   How do I pick a specific value from a "<select>" list?
       Find the "HTML::Form::ListInput" in the page.

           my ($listbox) = $mech->find_all_inputs( name => 'listbox' );

       Then create a hash for the lookup:

           my %name_lookup;
           @name_lookup{ $listbox->value_names } = $listbox->possible_values;
           my $value = $name_lookup{ 'Name I want' };

       If you have duplicate names, this method won't work, and you'll have to
       loop over "$listbox->value_names" and "$listbox->possible_values" in
       parallel until you find a matching name.

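       The parallel loop for duplicate names might look like the sketch
       below.  To keep it runnable without a network connection it parses an
       invented form with HTML::Form directly; in a Mech program "$listbox"
       would come from "$mech->find_all_inputs" as shown above.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical page with two options that share the name "Red".
my $html = <<'HTML';
<form action="http://www.example.com/pick">
  <select name="listbox">
    <option value="r1">Red</option>
    <option value="g1">Green</option>
    <option value="r2">Red</option>
  </select>
</form>
HTML

my $form      = HTML::Form->parse( $html, 'http://www.example.com/' );
my ($listbox) = $form->find_input( 'listbox' );

my @names  = $listbox->value_names;
my @values = $listbox->possible_values;

# Walk both lists in parallel and collect every value with a matching name.
my @matches;
for my $i ( 0 .. $#names ) {
    push @matches, $values[$i] if $names[$i] eq 'Red';
}

print "Values named 'Red': @matches\n";
```

       This collects every match; stop at the first one with "last" if that
       is all you need.
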
   How do I get Mech to not follow redirects?
       You use functionality in LWP::UserAgent, not Mech itself.

           $mech->requests_redirectable( [] );

       Or you can set "max_redirect":

           $mech->max_redirect( 0 );

       Both of these options can also be set in the constructor.  Mech doesn't
       understand them, so it passes them through to the LWP::UserAgent
       constructor.

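       Setting the option in the constructor can be sketched like this;
       "requests_redirectable => []" can be passed the same way.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# max_redirect is an LWP::UserAgent option; Mech hands it to the parent class.
my $mech = WWW::Mechanize->new( max_redirect => 0 );

print "max_redirect is ", $mech->max_redirect, "\n";
```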

Why doesn't this work: Debugging your Mechanize program

   My Mech program doesn't work, but it works in the browser.
       Mechanize acts like a browser, but apparently something you're doing is
       not matching the browser's behavior.  Maybe it's expecting a certain
       web client, or maybe you're not handling a field properly.  For some
       reason, your Mech program isn't doing exactly what the browser is
       doing, and when you find that, you'll have the answer.

   My Mech program gets these 500 errors.
       A 500 error from the web server says that the program on the server
       side died.  Probably the web server program was expecting certain
       inputs that you didn't supply, and instead of handling it nicely, the
       program died.

       Whatever the cause of the 500 error, if it works in the browser, but
       not in your Mech program, you're not acting like the browser.  See the
       previous question.

   Why doesn't my program handle this form correctly?
       Run mech-dump on your page and see what it says.

       mech-dump is a marvelous diagnostic tool for figuring out what forms
       and fields are on the page.  If you were scraping CNN.com, you'd get
       this:

           $ mech-dump http://www.cnn.com/
           GET http://search.cnn.com/cnn/search
             source=cnn                     (hidden readonly)
             invocationType=search/top      (hidden readonly)
             sites=web                      (radio)    [*web/The Web ??|cnn/CNN.com ??]
             query=                         (text)
             <NONAME>=Search                (submit)

           POST http://cgi.money.cnn.com/servlets/quote_redirect
             query=                         (text)
             <NONAME>=GET                   (submit)

           POST http://polls.cnn.com/poll
             poll_id=2112                   (hidden readonly)
             question_1=<UNDEF>             (radio)    [1/Simplistic option|2/VIEW RESULTS]
             <NONAME>=VOTE                  (submit)

           GET http://search.cnn.com/cnn/search
             source=cnn                     (hidden readonly)
             invocationType=search/bottom   (hidden readonly)
             sites=web                      (radio)    [*web/??CNN.com|cnn/??]
             query=                         (text)
             <NONAME>=Search                (submit)

       Four forms, including the first one duplicated at the end.  All the
       fields, all their defaults, lovingly generated by HTML::Form's "dump"
       method.

       If you want to run mech-dump on something that doesn't lend itself to a
       quick URL fetch, then use the "save_content()" method to write the HTML
       to a file, and run mech-dump on the file.

   Why don't https:// URLs work?
       You need either IO::Socket::SSL or Crypt::SSLeay installed.  With
       LWP 6 or later you also need LWP::Protocol::https.

   Why do I get "Input 'fieldname' is readonly"?
       You're trying to change the value of a hidden field and you have
       warnings on.

       First, make sure that you actually mean to change the field that you're
       changing, and that you don't have a typo.  Usually, hidden variables
       are set by the site you're working on for a reason.  If you change the
       value, you might be breaking some functionality by faking it out.

       If you really do want to change a hidden value, make the changes in a
       scope that has warnings turned off:

           {
               local $^W = 0;
               $mech->field( name => $value );
           }

   I tried to [such-and-such] and I got this weird error.
       Are you checking your errors?

       Are you sure?

       Are you checking that your action succeeded after every action?

       Are you sure?

       For example, if you try this:

           $mech->get( "http://my.site.com" );
           $mech->follow_link( text => "foo" );

       and the "get" call fails for some reason, then the Mech internals will
       be unusable for the "follow_link" and you'll get a weird error.  You
       must, after every action that GETs or POSTs a page, check that Mech
       succeeded, or all bets are off.

           $mech->get( "http://my.site.com" );
           die "Can't even get the home page: ", $mech->response->status_line
               unless $mech->success;

           $mech->follow_link( text => "foo" );
           die "Foo link failed: ", $mech->response->status_line
               unless $mech->success;

       If the "autocheck" option is on, Mech makes this check for you and dies
       when a request fails.

   How do I figure out why "$mech->get($url)" doesn't work?
       There are many reasons why a "get()" can fail.  The server can take you
       to someplace you didn't expect.  It can generate redirects which are not
       properly handled.  You can get time-outs.  Servers are down more often
       than you think!  A couple of places to start:

       1.  Check "$mech->status()" after each call.

       2.  Check the URL with "$mech->uri()" to see where you ended up.

       3.  Try debugging with "LWP::ConsoleLogger".

       If things are really strange, turn on debugging with "use
       LWP::ConsoleLogger::Everywhere;".  Just put this in the main program.
       This causes LWP to print out a trace of the HTTP traffic between client
       and server and can be used to figure out what is happening at the
       protocol level.

       It is also useful to set many traps to verify that processing is
       proceeding as expected.  A Mech program should always have an "I didn't
       expect to get here" or "I don't recognize the page that I am
       processing" case and bail out.

       Since errors can be transient, by the time you notice that the error
       has occurred, it might not be possible to reproduce it manually.  So for
       automated processing it is useful to email yourself the following
       information:

       •   where processing is taking place

       •   an error message

       •   $mech->uri

       •   $mech->content

       You can also save the content of the page with "$mech->save_content(
       'filename.html' );"

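       One way to arrange such a trap is sketched below.  The helper names
       "classify_page" and "trap_report", the patterns and the sample page
       are all hypothetical; in a real program the URI and content would be
       "$mech->uri" and "$mech->content".

```perl
use strict;
use warnings;

# Decide which page we are on from its content. The patterns and page
# names are hypothetical; use markers that identify your own pages.
sub classify_page {
    my ($content) = @_;
    return 'login'   if $content =~ /<form[^>]+name="login"/i;
    return 'results' if $content =~ /<h1>\s*Search results/i;
    return;    # unrecognized page
}

# Build the diagnostic text to email or log when a trap fires.
sub trap_report {
    my ( $where, $message, $uri, $content ) = @_;
    return join "\n",
        "Where: $where",
        "Error: $message",
        "URI: $uri",
        "Content:",
        $content;
}

# Stand-ins for $mech->uri and $mech->content.
my $uri     = 'http://www.example.com/search';
my $content = '<html><body><h1>Down for maintenance</h1></body></html>';

my $page = classify_page($content);
if ( !defined $page ) {
    my $report = trap_report( 'after search', "I don't recognize this page",
        $uri, $content );
    print $report, "\n";    # or mail it to yourself, then bail out
}
```
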
   I submitted a form, but the server ignored everything!  I got an empty form
       back!
       The post is handled by application software.  It is common for PHP
       programmers to use the same file both to display a form and to process
       the arguments returned.  So the first task of the application programmer
       is to decide whether there are arguments to process.  The program can
       check whether a particular parameter has been set, whether a hidden
       parameter has been set, or whether the submit button has been clicked.
       (There are probably other ways that I haven't thought of.)

       In any case, if your form is not setting the parameter (e.g. the submit
       button) which the web application is keying on (and as an outsider
       there is no way to know what it is keying on), it will not notice that
       the form has been submitted.  Try using "$mech->click()" instead of
       "$mech->submit()" or vice-versa.

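       The difference can be seen without a server by building the requests
       with HTML::Form.  This is a sketch that assumes "$mech->submit()" and
       "$mech->click()" behave like HTML::Form's "make_request()" and
       "click()" respectively; the form and its field names are invented.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical form whose server-side code keys on the "go" submit button.
my $html = <<'HTML';
<form action="http://www.example.com/search" method="get">
  <input type="text"   name="q"  value="perl">
  <input type="submit" name="go" value="Search">
</form>
HTML

my $form = HTML::Form->parse( $html, 'http://www.example.com/' );

# Submitting without clicking: the button's name/value pair is not sent.
my $submitted = $form->make_request;

# Clicking the button: its name/value pair is included in the request.
my $clicked = $form->click( 'go' );

print "submit: ", $submitted->uri, "\n";
print "click:  ", $clicked->uri,   "\n";
```
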
   I've logged in to the server, but I get 500 errors when I try to get to
       protected content.
       Some web sites use distributed databases for their processing.  It can
       take a few seconds for the login/session information to percolate
       through to all the servers.  For human users with their slow reaction
       times, this is not a problem, but a Perl script can outrun the server.
       So try adding a "sleep(5)" between logging in and actually doing
       anything (the optimal delay must be determined experimentally).

   Mech is a big memory pig!  I'm running out of RAM!
       Mech keeps a history of every page, and the state it was in.  It
       actually keeps a clone of the full Mech object at every step along the
       way.

       You can limit this stack size with the "stack_depth" param in the
       "new()" constructor.  If you set "stack_depth" to 0, Mech will not keep
       any history.

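       A minimal sketch of turning the history off in the constructor:

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Keep no page history, so earlier pages are not retained.
my $mech = WWW::Mechanize->new( stack_depth => 0 );

print "stack_depth is ", $mech->stack_depth, "\n";
```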

AUTHOR

       Andy Lester <andy at petdance.com>

       This software is copyright (c) 2004 by Andy Lester.

       This is free software; you can redistribute it and/or modify it under
       the same terms as the Perl 5 programming language system itself.

perl v5.32.1                      2021-01-27            WWW::Mechanize::FAQ(3)