WWW::Mechanize::FAQ(3)  User Contributed Perl Documentation  WWW::Mechanize::FAQ(3)

NAME

       WWW::Mechanize::FAQ - Frequently Asked Questions about WWW::Mechanize

VERSION

       version 2.15

How to get help with WWW::Mechanize

       If your question isn't answered here in the FAQ, please turn to the
       communities at:

       •   Stack Overflow
           <https://stackoverflow.com/questions/tagged/www-mechanize>

       •   #lwp on irc.perl.org

       •   <http://perlmonks.org>

       •   The libwww-perl mailing list at <http://lists.perl.org>

JavaScript

   I have this web page that has JavaScript on it, and my Mech program doesn't
       work.
       That's because WWW::Mechanize doesn't operate on the JavaScript.  It
       only understands the HTML parts of the page.

   I thought Mech was supposed to work like a web browser.
       It does pretty much, but it doesn't support JavaScript.

       I added some basic attempts at picking up URLs in "window.open()" calls
       and returning them in "$mech->links".  They work sometimes.

       Since JavaScript is completely visible to the client, it cannot be used
       to prevent a scraper from following links. But it can make life
       difficult. If you want to scrape specific pages, then a solution is
       always possible.

       One typical use of JavaScript is to perform argument checking before
       posting to the server. The URL you want is probably just buried in the
       JavaScript function. Do a regular expression match on
       "$mech->content()" to find the link that you want and "$mech->get" it
       directly (this assumes that you know what you are looking for in
       advance).
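
       A minimal sketch of that regular expression approach follows. The
       page content, the "checkAndGo" function name and the URL are all made
       up for illustration; in a real program the content would come from
       "$mech->content()".

```perl
use strict;
use warnings;

# Hypothetical page content, standing in for $mech->content().
my $content = <<'HTML';
<a href="#" onclick="return checkAndGo('/reports/view.cgi?id=42');">Report</a>
HTML

# Pull the first argument out of the (made-up) checkAndGo() call.
sub extract_js_url {
    my ($html) = @_;
    return $html =~ /checkAndGo\(\s*'([^']+)'/ ? $1 : undef;
}

my $url = extract_js_url($content);
print "Found: $url\n" if defined $url;
# With a live Mech object you would now call: $mech->get($url);
```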

       In more difficult cases, the JavaScript is used for URL mangling to
       satisfy the needs of some middleware. In this case you need to figure
       out what the JavaScript is doing (why are these URLs always really
       long?). There is probably some function with one or more arguments
       which calculates the new URL. Step one: using your favorite browser,
       get the before and after URLs and save them to files. Edit each file,
       converting the argument separators ('?', '&' or ';') into newlines. Now
       it is easy to use diff or comm to find out what JavaScript did to the
       URL.  Step two: find the function call which created the URL. You will
       need to parse and interpret its argument list. The JavaScript Debugger
       in the Firebug extension for Firefox helps with the analysis. At this
       point, it is fairly trivial to write your own function which emulates
       the JavaScript for the pages you want to process.
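
       Step one can also be done in Perl instead of with diff or comm. This
       is only a sketch: both URLs and the helper names "query_args" and
       "added_args" are invented for the example.

```perl
use strict;
use warnings;

# Split a URL's query string on the usual separators ('&' or ';').
sub query_args {
    my ($url) = @_;
    my ( undef, $query ) = split /\?/, $url, 2;
    return () unless defined $query;
    return split /[&;]/, $query;
}

# Return the arguments present in the "after" URL but not in the "before" URL.
sub added_args {
    my ( $before, $after ) = @_;
    my %seen = map { $_ => 1 } query_args($before);
    return grep { !$seen{$_} } query_args($after);
}

# Both URLs are made up for illustration.
my $before = 'http://www.example.com/app?page=2&lang=en';
my $after  = 'http://www.example.com/app?page=2&lang=en;sid=abc123&chk=9f';

print "$_\n" for added_args( $before, $after );
```

       Whatever this prints is what the JavaScript added to the URL, and so
       what your own emulating function has to produce.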

       Here's another approach that answers the question, "It works in
       Firefox, but why not Mech?"  Everything the web server knows about the
       client is present in the HTTP request. If two requests are identical,
       the results should be identical. So the real question is "What is
       different between the Mech request and the Firefox request?"

       The Firefox extension "Tamper Data" is an effective tool for examining
       the headers of the requests to the server. Compare that with what LWP
       is sending. Once the two are identical, the action of the server should
       be the same as well.

       I say "should", because this is an oversimplification - some values are
       naturally unique, e.g. a SessionID, but if a SessionID is present, that
       is probably sufficient, even though the value will be different between
       the LWP request and the Firefox request. The server could use the
       session to store information which is troublesome, but that's not the
       first place to look (and highly unlikely to be relevant when you are
       requesting the login page of your site).

       Generally the problem is to be found in missing or incorrect POSTDATA
       arguments, Cookies, User-Agents, Accepts, etc. If you are using Mech,
       then redirects and cookies should not be a problem, but are listed here
       for completeness. If you are missing headers, "$mech->add_header" can
       be used to add the headers that you need.
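
       As a rough sketch, once you have both sets of headers written down
       you can let Perl find the missing ones and add them. The header
       values below are invented, "missing_headers" is a hypothetical helper,
       and the comparison is case-sensitive for simplicity even though HTTP
       header names are not.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Header sets captured by hand; every name and value here is illustrative.
my %browser = (
    'Accept'          => 'text/html,application/xhtml+xml',
    'Accept-Language' => 'en-US,en;q=0.5',
    'User-Agent'      => 'Mozilla/5.0',
);
my %lwp = (
    'User-Agent' => 'WWW-Mechanize/2.15',
);

# Names the browser sent that the LWP request lacked.
sub missing_headers {
    my ( $want, $have ) = @_;
    my @missing = grep { !exists $have->{$_} } keys %$want;
    return sort @missing;
}

my $mech = WWW::Mechanize->new;
for my $name ( missing_headers( \%browser, \%lwp ) ) {
    # add_header() applies the header to every later request.
    $mech->add_header( $name => $browser{$name} );
}
```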

   Which modules work like Mechanize and have JavaScript support?
       In no particular order: Gtk2::WebKit::Mechanize,
       WWW::Mechanize::Firefox, WWW::Mechanize::Chrome, WWW::Scripter,
       WWW::Selenium

How do I do X?

   Can I do [such-and-such] with WWW::Mechanize?
       If it's possible with LWP::UserAgent, then yes.  WWW::Mechanize is a
       subclass of LWP::UserAgent, so all the wondrous magic of that class is
       inherited.

   How do I use WWW::Mechanize through a proxy server?
       See the docs in LWP::UserAgent on how to use the proxy.  Short version:

           $mech->proxy(['http', 'ftp'], 'http://proxy.example.com:8000/');

       or get the specs from the environment:

           $mech->env_proxy();

           # Environment set like so (in the shell):
           gopher_proxy=http://proxy.my.place/
           wais_proxy=http://proxy.my.place/
           no_proxy="localhost,my.domain"
           export gopher_proxy wais_proxy no_proxy

   How can I see what fields are on the forms?
       Use the mech-dump utility, optionally installed with Mechanize.

           $ mech-dump --forms http://search.cpan.org
           Dumping forms
           GET http://search.cpan.org/search
             query=
             mode=all                        (option)  [*all|module|dist|author]
             <NONAME>=CPAN Search            (submit)

   How do I get Mech to handle authentication?
           use MIME::Base64;

           # USER, PASS, ADDRESS, REALM and URL are placeholders.
           my $agent = WWW::Mechanize->new();
           my @args = (
               Authorization => "Basic " .
                   # The empty second argument suppresses the trailing newline.
                   encode_base64( USER . ':' . PASS, '' )
           );

           $agent->credentials( ADDRESS, REALM, USER, PASS );
           $agent->get( URL, @args );

       If you want to use the credentials for all future requests, you can
       also use the LWP::UserAgent default_header() method instead of the
       extra arguments to get():

           $agent->default_header(
               Authorization => 'Basic ' . encode_base64( USER . ':' . PASS, '' ) );

   How can I get WWW::Mechanize to execute this JavaScript?
       You can't.  JavaScript is entirely client-based, and WWW::Mechanize is
       a client that doesn't understand JavaScript.  See the top part of this
       FAQ.

   How do I check a checkbox that doesn't have a value defined?
       Set it to the value of "on".

           $mech->field( my_checkbox => 'on' );

   How do I handle frames?
       You don't deal with them as frames, per se, but as links.  Extract them
       with

           my @frame_links = $mech->find_all_links( tag => "frame" );

   How do I get a list of HTTP headers and their values?
       All HTTP::Headers methods work on a HTTP::Response object which is
       returned by the get(), reload(), "response()/res()", click(),
       submit_form(), and request() methods.

           my $mech = WWW::Mechanize->new( autocheck => 1 );
           $mech->get( 'http://my.site.com' );
           my $response = $mech->response();
           for my $key ( $response->header_field_names() ) {
               print $key, " : ", $response->header( $key ), "\n";
           }

   How do I enable keep-alive?
       Since WWW::Mechanize is a subclass of LWP::UserAgent, you can use the
       same mechanism to enable keep-alive:

           use LWP::ConnCache;
           ...
           $mech->conn_cache(LWP::ConnCache->new);

   How can I change/specify the action parameter of an HTML form?
       You can access the action of the form by utilizing the HTML::Form
       object returned from one of the form selection methods.

       Using "$mech->form_number($number)":

           my $mech = WWW::Mechanize->new;
           $mech->get('http://someurlhere.com');
           # Access the form by its position on the page; numbering starts at 1
           $mech->form_number(1)->action('http://newAction'); # absolute URL

       Using "$mech->form_name($name)":

           my $mech = WWW::Mechanize->new;
           $mech->get('http://someurlhere.com');
           # Access the form by its name attribute
           $mech->form_name('trgForm')->action('http://newAction'); # absolute URL

   How do I save an image?  How do I save a large tarball?
       An image is just content.  You get the image and save it.

           $mech->get( 'photo.jpg' );
           $mech->save_content( '/path/to/my/directory/photo.jpg' );

       You can also save any content directly to disk using the
       ":content_file" flag to get(), which is part of LWP::UserAgent.

           $mech->get( 'http://www.cpan.org/src/stable.tar.gz',
                       ':content_file' => 'stable.tar.gz' );

   How do I pick a specific value from a "<select>" list?
       Find the "HTML::Form::ListInput" in the page.

           my ($listbox) = $mech->find_all_inputs( name => 'listbox' );

       Then create a hash for the lookup:

           my %name_lookup;
           @name_lookup{ $listbox->value_names } = $listbox->possible_values;
           my $value = $name_lookup{ 'Name I want' };

       If you have duplicate names, this method won't work, and you'll have to
       loop over "$listbox->value_names" and "$listbox->possible_values" in
       parallel until you find a matching name.
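
       A sketch of that parallel loop, run against a made-up form parsed
       directly with HTML::Form so that it needs no network access. With
       Mech, $listbox would come from find_all_inputs() as above; the form,
       its values and the example.com base URL are assumptions.

```perl
use strict;
use warnings;
use HTML::Form;

# Made-up form with a duplicated option name.
my $html = <<'HTML';
<form action="/pick" method="post">
  <select name="listbox">
    <option value="10">Other</option>
    <option value="20">Name I want</option>
    <option value="30">Name I want</option>
  </select>
</form>
HTML

my ($form)  = HTML::Form->parse( $html, 'http://www.example.com/' );
my $listbox = $form->find_input('listbox');

# Walk names and values in parallel and stop at the first matching name.
my @names  = $listbox->value_names;
my @values = $listbox->possible_values;
my $value;
for my $i ( 0 .. $#names ) {
    next unless $names[$i] eq 'Name I want';
    $value = $values[$i];
    last;
}

print "Picked value: $value\n" if defined $value;
```

       Taking the first match is an arbitrary choice; collect all matching
       values instead if you need to tell the duplicates apart.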

   How do I get Mech to not follow redirects?
       You use functionality in LWP::UserAgent, not Mech itself.

           $mech->requests_redirectable( [] );

       Or you can set "max_redirect":

           $mech->max_redirect( 0 );

       Both these options can also be set in the constructor.  Mech doesn't
       understand them, so will pass them through to the LWP::UserAgent
       constructor.
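
       For instance, a minimal sketch of the constructor form, shown here
       for "max_redirect" only:

```perl
use strict;
use warnings;
use WWW::Mechanize;

# max_redirect is an LWP::UserAgent option; Mech hands it to the parent class.
my $mech = WWW::Mechanize->new( max_redirect => 0 );

print "max_redirect is ", $mech->max_redirect, "\n";
```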

Why doesn't this work: Debugging your Mechanize program

   My Mech program doesn't work, but it works in the browser.
       Mechanize acts like a browser, but apparently something you're doing is
       not matching the browser's behavior.  Maybe it's expecting a certain
       web client, or maybe you're not handling a field properly.  For some
       reason, your Mech program isn't doing exactly what the browser is
       doing, and when you find that, you'll have the answer.

   My Mech program gets these 500 errors.
       A 500 error from the web server says that the program on the server
       side died.  Probably the web server program was expecting certain
       inputs that you didn't supply, and instead of handling it nicely, the
       program died.

       Whatever the cause of the 500 error, if it works in the browser, but
       not in your Mech program, you're not acting like the browser.  See the
       previous question.

   Why doesn't my program handle this form correctly?
       Run mech-dump on your page and see what it says.

       mech-dump is a marvelous diagnostic tool for figuring out what forms
       and fields are on the page.  Say you're scraping CNN.com; you'd get
       this:

           $ mech-dump http://www.cnn.com/
           GET http://search.cnn.com/cnn/search
             source=cnn                     (hidden readonly)
             invocationType=search/top      (hidden readonly)
             sites=web                      (radio)    [*web/The Web ??|cnn/CNN.com ??]
             query=                         (text)
             <NONAME>=Search                (submit)

           POST http://cgi.money.cnn.com/servlets/quote_redirect
             query=                         (text)
             <NONAME>=GET                   (submit)

           POST http://polls.cnn.com/poll
             poll_id=2112                   (hidden readonly)
             question_1=<UNDEF>             (radio)    [1/Simplistic option|2/VIEW RESULTS]
             <NONAME>=VOTE                  (submit)

           GET http://search.cnn.com/cnn/search
             source=cnn                     (hidden readonly)
             invocationType=search/bottom   (hidden readonly)
             sites=web                      (radio)    [*web/??CNN.com|cnn/??]
             query=                         (text)
             <NONAME>=Search                (submit)

       Four forms, including the first one duplicated at the end.  All the
       fields, all their defaults, lovingly generated by HTML::Form's "dump"
       method.

       If you want to run mech-dump on something that doesn't lend itself to a
       quick URL fetch, then use the save_content() method to write the HTML
       to a file, and run mech-dump on the file.

   Why don't https:// URLs work?
       You need either IO::Socket::SSL or Crypt::SSLeay installed.

   Why don't file:// URLs to files with a question mark in the name work?
       If you have a local file named "how-are-you?", the URL for that file is
       "file:how-are-you%3f". That's because a file URL has to be URL-encoded,
       just like any URL pointing to somewhere on the internet has to be if it
       contains reserved characters such as "?", "/" or "@". This is
       specified in RFC 3986. See URI::Escape for a full list of reserved
       characters.
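
       One way to build such a URL, sketched with URI::Escape. This escapes
       a bare file name only; "uri_escape" would also escape any "/" in a
       longer path, so escape each path segment separately in that case. It
       emits uppercase hex digits ("%3F"), which is equivalent to "%3f".

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape);

# Escape the reserved characters in the file name, then build the URL.
my $filename = 'how-are-you?';
my $url      = 'file:' . uri_escape($filename);

print "$url\n";
# With a live Mech object: $mech->get($url);
```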

   Why do I get "Input 'fieldname' is readonly"?
       You're trying to change the value of a hidden field and you have
       warnings on.

       First, make sure that you actually mean to change the field that you're
       changing, and that you don't have a typo.  Usually, hidden variables
       are set by the site you're working on for a reason.  If you change the
       value, you might be breaking some functionality by faking it out.

       If you really do want to change a hidden value, make the changes in a
       scope that has warnings turned off:

           {
               local $^W = 0;
               $agent->field( name => $value );
           }

   I tried to [such-and-such] and I got this weird error.
       Are you checking your errors?

       Are you sure?

       Are you checking that your action succeeded after every action?

       Are you sure?

       For example, if you try this:

           $mech->get( "http://my.site.com" );
           $mech->follow_link( text => "foo" );

       and the "get" call fails for some reason, then the Mech internals will
       be unusable for the "follow_link" and you'll get a weird error.  You
       must, after every action that GETs or POSTs a page, check that Mech
       succeeded, or all bets are off.

           $mech->get( "http://my.site.com" );
           die "Can't even get the home page: ", $mech->response->status_line
               unless $mech->success;

           $mech->follow_link( text => "foo" );
           die "Foo link failed: ", $mech->response->status_line
               unless $mech->success;

   How do I figure out why "$mech->get($url)" doesn't work?
       There are many reasons why a get() can fail. The server can take you to
       someplace you didn't expect. It can generate redirects which are not
       properly handled. You can get time-outs. Servers are down more often
       than you think! And so on. A few places to start:

       1.  Check "$mech->status()" after each call
       2.  Check the URL with "$mech->uri()" to see where you ended up
       3.  Try debugging with "LWP::ConsoleLogger"

       If things are really strange, turn on debugging with "use
       LWP::ConsoleLogger::Everywhere;". Just put this in the main program.
       This causes LWP to print out a trace of the HTTP traffic between client
       and server and can be used to figure out what is happening at the
       protocol level.

       It is also useful to set many traps to verify that processing is
       proceeding as expected. A Mech program should always have an "I didn't
       expect to get here" or "I don't recognize the page that I am
       processing" case and bail out.
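
       Such a trap might look like the sketch below. "classify_page" is a
       hypothetical helper, and the patterns are placeholders for whatever
       marks the pages your program expects to see.

```perl
use strict;
use warnings;

# Decide which page we are on from its content, or refuse to continue.
sub classify_page {
    my ($content) = @_;
    return 'login'   if $content =~ /<form[^>]+name="login"/i;
    return 'results' if $content =~ /<h1>\s*Search results/i;
    die "I don't recognize the page that I am processing\n";
}

# Stand-in for $mech->content() after a request.
my $content = '<html><h1>Search results</h1></html>';
my $page    = classify_page($content);
print "On the $page page\n";
```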

       Since errors can be transient, by the time you notice that the error
       has occurred, it might not be possible to reproduce it manually. So for
       automated processing it is useful to email yourself the following
       information:

       •   where processing is taking place

       •   an error message

       •   $mech->uri

       •   $mech->content

       You can also save the content of the page with "$mech->save_content(
       'filename.html' );"

   I submitted a form, but the server ignored everything!  I got an empty form
       back!
       The post is handled by application software. It is common for PHP
       programmers to use the same file both to display a form and to process
       the arguments returned. So the first task of the application programmer
       is to decide whether there are arguments to process. The program can
       check whether a particular parameter has been set, whether a hidden
       parameter has been set, or whether the submit button has been clicked.
       (There are probably other ways that I haven't thought of).

       In any case, if your form is not setting the parameter (e.g. the submit
       button) which the web application is keying on (and as an outsider
       there is no way to know what it is keying on), it will not notice that
       the form has been submitted. Try using "$mech->click()" instead of
       "$mech->submit()" or vice-versa.

   I've logged in to the server, but I get 500 errors when I try to get to
       protected content.
       Some web sites use distributed databases for their processing. It can
       take a few seconds for the login/session information to percolate
       through to all the servers. For human users with their slow reaction
       times, this is not a problem, but a Perl script can outrun the server.
       So try adding a sleep(5) between logging in and actually doing anything
       (the optimal delay must be determined experimentally).

   Mech is a big memory pig!  I'm running out of RAM!
       Mech keeps a history of every page, and the state it was in.  It
       actually keeps a clone of the full Mech object at every step along the
       way.

       You can limit this stack size with the "stack_depth" param in the new()
       constructor.  If you set stack_depth to 0, Mech will not keep any
       history.
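
       A minimal sketch of turning the history off:

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Keep no page history at all.
my $mech = WWW::Mechanize->new( stack_depth => 0 );

print "stack_depth is ", $mech->stack_depth, "\n";
```

       The trade-off is that "$mech->back()" has nothing to return to.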

AUTHOR

       Andy Lester <andy at petdance.com>

COPYRIGHT AND LICENSE

       This software is copyright (c) 2004 by Andy Lester.

       This is free software; you can redistribute it and/or modify it under
       the same terms as the Perl 5 programming language system itself.

perl v5.36.0                      2023-01-20            WWW::Mechanize::FAQ(3)