WGET(1)                            GNU Wget                           WGET(1)

NAME
       Wget - The non-interactive network downloader.

SYNOPSIS
       wget [option]... [URL]...

DESCRIPTION
       GNU Wget is a free utility for non-interactive download of files
       from the Web.  It supports HTTP, HTTPS, and FTP protocols, as well
       as retrieval through HTTP proxies.

       Wget is non-interactive, meaning that it can work in the background,
       while the user is not logged on.  This allows you to start a
       retrieval and disconnect from the system, letting Wget finish the
       work.  By contrast, most Web browsers require the user's constant
       presence, which can be a great hindrance when transferring a lot of
       data.

       Wget can follow links in HTML and XHTML pages and create local
       versions of remote web sites, fully recreating the directory
       structure of the original site.  This is sometimes referred to as
       ``recursive downloading.''  While doing that, Wget respects the
       Robot Exclusion Standard (/robots.txt).  Wget can be instructed to
       convert the links in downloaded HTML files to the local files for
       offline viewing.

       Wget has been designed for robustness over slow or unstable network
       connections; if a download fails due to a network problem, it will
       keep retrying until the whole file has been retrieved.  If the
       server supports regetting, it will instruct the server to continue
       the download from where it left off.

OPTIONS
   Option Syntax
       Since Wget uses GNU getopt to process command-line arguments, every
       option has a long form along with the short one.  Long options are
       more convenient to remember, but take time to type.  You may freely
       mix different option styles, or specify options after the command-
       line arguments.  Thus you may write:

               wget -r --tries=10 http://fly.srk.fer.hr/ -o log

       The space between the option accepting an argument and the argument
       may be omitted.  Instead of -o log you can write -olog.

       You may put several options that do not require arguments together,
       like:

               wget -drc <URL>

       This is a complete equivalent of:

               wget -d -r -c <URL>

       Since the options can be specified after the arguments, you may
       terminate them with --.  So the following will try to download URL
       -x, reporting failure to log:

               wget -o log -- -x

       The options that accept comma-separated lists all respect the
       convention that specifying an empty list clears its value.  This can
       be useful to clear the .wgetrc settings.  For instance, if your
       .wgetrc sets "exclude_directories" to /cgi-bin, the following
       example will first reset it, and then set it to exclude /~nobody and
       /~somebody.  You can also clear the lists in .wgetrc.

               wget -X '' -X /~nobody,/~somebody

       Most options that do not accept arguments are boolean options, so
       named because their state can be captured with a yes-or-no
       (``boolean'') variable.  For example, --follow-ftp tells Wget to
       follow FTP links from HTML files and, on the other hand, --no-glob
       tells it not to perform file globbing on FTP URLs.  A boolean option
       is either affirmative or negative (beginning with --no).  All such
       options share several properties.

       Unless stated otherwise, it is assumed that the default behavior is
       the opposite of what the option accomplishes.  For example, the
       documented existence of --follow-ftp assumes that the default is to
       not follow FTP links from HTML pages.

       Affirmative options can be negated by prepending --no- to the option
       name; negative options can be negated by omitting the --no- prefix.
       This might seem superfluous---if the default for an affirmative
       option is to not do something, then why provide a way to explicitly
       turn it off?  But the startup file may in fact change the default.
       For instance, using "follow_ftp = on" in .wgetrc makes Wget follow
       FTP links by default, and using --no-follow-ftp is the only way to
       restore the factory default from the command line.
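
       For example, with such a .wgetrc in place (the scenario and URL are
       illustrative), the following restores the factory behavior of not
       following FTP links for a single run:

               wget --no-follow-ftp -r http://fly.srk.fer.hr/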

   Basic Startup Options
       -V
       --version
           Display the version of Wget.

       -h
       --help
           Print a help message describing all of Wget's command-line
           options.

       -b
       --background
           Go to background immediately after startup.  If no output file
           is specified via the -o option, output is redirected to
           wget-log.

       -e command
       --execute command
           Execute command as if it were a part of .wgetrc.  A command thus
           invoked will be executed after the commands in .wgetrc, thus
           taking precedence over them.  If you need to specify more than
           one wgetrc command, use multiple instances of -e.
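
           For instance, to apply a wgetrc command for a single run (here
           the standard "robots" command, which controls robots.txt
           processing; the URL is illustrative):

               wget -e robots=off -r http://fly.srk.fer.hr/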

   Logging and Input File Options
       -o logfile
       --output-file=logfile
           Log all messages to logfile.  The messages are normally reported
           to standard error.

       -a logfile
       --append-output=logfile
           Append to logfile.  This is the same as -o, only it appends to
           logfile instead of overwriting the old log file.  If logfile
           does not exist, a new file is created.

       -d
       --debug
           Turn on debug output, meaning various information important to
           the developers of Wget if it does not work properly.  Your
           system administrator may have chosen to compile Wget without
           debug support, in which case -d will not work.  Please note that
           compiling with debug support is always safe---Wget compiled with
           the debug support will not print any debug info unless requested
           with -d.

       -q
       --quiet
           Turn off Wget's output.

       -v
       --verbose
           Turn on verbose output, with all the available data.  The
           default output is verbose.

       -nv
       --no-verbose
           Turn off verbose without being completely quiet (use -q for
           that), which means that error messages and basic information
           still get printed.

       -i file
       --input-file=file
           Read URLs from file.  If - is specified as file, URLs are read
           from the standard input.  (Use ./- to read from a file literally
           named -.)

           If this function is used, no URLs need be present on the command
           line.  If there are URLs both on the command line and in an
           input file, those on the command line will be the first ones to
           be retrieved.  The file need not be an HTML document (but no
           harm if it is)---it is enough if the URLs are just listed
           sequentially.

           However, if you specify --force-html, the document will be
           regarded as html.  In that case you may have problems with
           relative links, which you can solve either by adding "<base
           href="url">" to the documents or by specifying --base=url on the
           command line.
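
           For example, assuming a local file links.html (an illustrative
           name) whose relative links should be resolved against a remote
           server:

               wget --force-html --base=http://fly.srk.fer.hr/ -i links.html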

       -F
       --force-html
           When input is read from a file, force it to be treated as an
           HTML file.  This enables you to retrieve relative links from
           existing HTML files on your local disk, by adding "<base
           href="url">" to HTML, or using the --base command-line option.

       -B URL
       --base=URL
           Prepends URL to relative links read from the file specified with
           the -i option.

   Download Options
       --bind-address=ADDRESS
           When making client TCP/IP connections, bind to ADDRESS on the
           local machine.  ADDRESS may be specified as a hostname or IP
           address.  This option can be useful if your machine is bound to
           multiple IPs.

       -t number
       --tries=number
           Set number of retries to number.  Specify 0 or inf for infinite
           retrying.  The default is to retry 20 times, with the exception
           of fatal errors like ``connection refused'' or ``not found''
           (404), which are not retried.

       -O file
       --output-document=file
           The documents will not be written to the appropriate files, but
           all will be concatenated together and written to file.  If - is
           used as file, documents will be printed to standard output,
           disabling link conversion.  (Use ./- to print to a file
           literally named -.)

           Note that a combination with -k is only well-defined for
           downloading a single document.
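
           For example (the URLs are illustrative), to concatenate two
           documents to standard output and save the result:

               wget -O - http://fly.srk.fer.hr/a.html \
                    http://fly.srk.fer.hr/b.html > both.html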

       -nc
       --no-clobber
           If a file is downloaded more than once in the same directory,
           Wget's behavior depends on a few options, including -nc.  In
           certain cases, the local file will be clobbered, or overwritten,
           upon repeated download.  In other cases it will be preserved.

           When running Wget without -N, -nc, or -r, downloading the same
           file in the same directory will result in the original copy of
           file being preserved and the second copy being named file.1.  If
           that file is downloaded yet again, the third copy will be named
           file.2, and so on.  When -nc is specified, this behavior is
           suppressed, and Wget will refuse to download newer copies of
           file.  Therefore, ``no-clobber'' is actually a misnomer in this
           mode---it's not clobbering that's prevented (as the numeric
           suffixes were already preventing clobbering), but rather the
           multiple version saving that's prevented.

           When running Wget with -r, but without -N or -nc, re-downloading
           a file will result in the new copy simply overwriting the old.
           Adding -nc will prevent this behavior, instead causing the
           original version to be preserved and any newer copies on the
           server to be ignored.

           When running Wget with -N, with or without -r, the decision as
           to whether or not to download a newer copy of a file depends on
           the local and remote timestamp and size of the file.  -nc may
           not be specified at the same time as -N.

           Note that when -nc is specified, files with the suffixes .html
           or .htm will be loaded from the local disk and parsed as if they
           had been retrieved from the Web.

       -c
       --continue
           Continue getting a partially-downloaded file.  This is useful
           when you want to finish up a download started by a previous
           instance of Wget, or by another program.  For instance:

               wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z

           If there is a file named ls-lR.Z in the current directory, Wget
           will assume that it is the first portion of the remote file, and
           will ask the server to continue the retrieval from an offset
           equal to the length of the local file.

           Note that you don't need to specify this option if you just want
           the current invocation of Wget to retry downloading a file
           should the connection be lost midway through.  This is the
           default behavior.  -c only affects resumption of downloads
           started prior to this invocation of Wget, and whose local files
           are still sitting around.

           Without -c, the previous example would just download the remote
           file to ls-lR.Z.1, leaving the truncated ls-lR.Z file alone.

           Beginning with Wget 1.7, if you use -c on a non-empty file, and
           it turns out that the server does not support continued
           downloading, Wget will refuse to start the download from
           scratch, which would effectively ruin existing contents.  If you
           really want the download to start from scratch, remove the file.

           Also beginning with Wget 1.7, if you use -c on a file which is
           of equal size as the one on the server, Wget will refuse to
           download the file and print an explanatory message.  The same
           happens when the file is smaller on the server than locally
           (presumably because it was changed on the server since your last
           download attempt)---because ``continuing'' is not meaningful, no
           download occurs.

           On the other side of the coin, while using -c, any file that's
           bigger on the server than locally will be considered an
           incomplete download and only "(length(remote) - length(local))"
           bytes will be downloaded and tacked onto the end of the local
           file.  This behavior can be desirable in certain cases---for
           instance, you can use wget -c to download just the new portion
           that's been appended to a data collection or log file.

           However, if the file is bigger on the server because it's been
           changed, as opposed to just appended to, you'll end up with a
           garbled file.  Wget has no way of verifying that the local file
           is really a valid prefix of the remote file.  You need to be
           especially careful of this when using -c in conjunction with
           -r, since every file will be considered as an "incomplete
           download" candidate.

           Another instance where you'll get a garbled file if you try to
           use -c is if you have a lame HTTP proxy that inserts a
           ``transfer interrupted'' string into the local file.  In the
           future a ``rollback'' option may be added to deal with this
           case.

           Note that -c only works with FTP servers and with HTTP servers
           that support the "Range" header.

       --progress=type
           Select the type of the progress indicator you wish to use.
           Legal indicators are ``dot'' and ``bar''.

           The ``bar'' indicator is used by default.  It draws an ASCII
           progress bar graphics (a.k.a ``thermometer'' display) indicating
           the status of retrieval.  If the output is not a TTY, the
           ``dot'' bar will be used by default.

           Use --progress=dot to switch to the ``dot'' display.  It traces
           the retrieval by printing dots on the screen, each dot
           representing a fixed amount of downloaded data.

           When using the dotted retrieval, you may also set the style by
           specifying the type as dot:style.  Different styles assign
           different meaning to one dot.  With the "default" style each dot
           represents 1K, there are ten dots in a cluster and 50 dots in a
           line.  The "binary" style has a more ``computer''-like
           orientation---8K dots, 16-dots clusters and 48 dots per line
           (which makes for 384K lines).  The "mega" style is suitable for
           downloading very large files---each dot represents 64K
           retrieved, there are eight dots in a cluster, and 48 dots on
           each line (so each line contains 3M).

           Note that you can set the default style using the "progress"
           command in .wgetrc.  That setting may be overridden from the
           command line.  The exception is that, when the output is not a
           TTY, the ``dot'' progress will be favored over ``bar''.  To
           force the bar output, use --progress=bar:force.
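
           For example (the URL is illustrative), to use the ``mega'' dot
           style for a large download, or to force the bar display even
           when logging to a file:

               wget --progress=dot:mega http://fly.srk.fer.hr/big.iso
               wget --progress=bar:force -o log http://fly.srk.fer.hr/big.iso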

       -N
       --timestamping
           Turn on time-stamping.

       -S
       --server-response
           Print the headers sent by HTTP servers and responses sent by FTP
           servers.

       --spider
           When invoked with this option, Wget will behave as a Web spider,
           which means that it will not download the pages, just check that
           they are there.  For example, you can use Wget to check your
           bookmarks:

               wget --spider --force-html -i bookmarks.html

           This feature needs much more work for Wget to get close to the
           functionality of real web spiders.

       -T seconds
       --timeout=seconds
           Set the network timeout to seconds seconds.  This is equivalent
           to specifying --dns-timeout, --connect-timeout, and
           --read-timeout, all at the same time.

           When interacting with the network, Wget can check for timeout
           and abort the operation if it takes too long.  This prevents
           anomalies like hanging reads and infinite connects.  The only
           timeout enabled by default is a 900-second read timeout.
           Setting a timeout to 0 disables it altogether.  Unless you know
           what you are doing, it is best not to change the default timeout
           settings.

           All timeout-related options accept decimal values, as well as
           subsecond values.  For example, 0.1 seconds is a legal (though
           unwise) choice of timeout.  Subsecond timeouts are useful for
           checking server response times or for testing network latency.

       --dns-timeout=seconds
           Set the DNS lookup timeout to seconds seconds.  DNS lookups that
           don't complete within the specified time will fail.  By default,
           there is no timeout on DNS lookups, other than that implemented
           by system libraries.

       --connect-timeout=seconds
           Set the connect timeout to seconds seconds.  TCP connections
           that take longer to establish will be aborted.  By default,
           there is no connect timeout, other than that implemented by
           system libraries.

       --read-timeout=seconds
           Set the read (and write) timeout to seconds seconds.  The
           ``time'' of this timeout refers to idle time: if, at any point
           in the download, no data is received for more than the specified
           number of seconds, reading fails and the download is restarted.
           This option does not directly affect the duration of the entire
           download.

           Of course, the remote server may choose to terminate the
           connection sooner than this option requires.  The default read
           timeout is 900 seconds.
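
           For example (the URL is illustrative), to set each timeout
           individually rather than via -T:

               wget --dns-timeout=5 --connect-timeout=10 \
                    --read-timeout=60 http://fly.srk.fer.hr/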

       --limit-rate=amount
           Limit the download speed to amount bytes per second.  Amount may
           be expressed in bytes, kilobytes with the k suffix, or megabytes
           with the m suffix.  For example, --limit-rate=20k will limit the
           retrieval rate to 20KB/s.  This is useful when, for whatever
           reason, you don't want Wget to consume the entire available
           bandwidth.

           This option allows the use of decimal numbers, usually in
           conjunction with power suffixes; for example, --limit-rate=2.5k
           is a legal value.

           Note that Wget implements the limiting by sleeping the
           appropriate amount of time after a network read that took less
           time than specified by the rate.  Eventually this strategy
           causes the TCP transfer to slow down to approximately the
           specified rate.  However, it may take some time for this balance
           to be achieved, so don't be surprised if limiting the rate
           doesn't work well with very small files.
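
           For instance (the URL is illustrative), to cap a large download
           at 20KB/s:

               wget --limit-rate=20k http://fly.srk.fer.hr/big.iso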

       -w seconds
       --wait=seconds
           Wait the specified number of seconds between the retrievals.
           Use of this option is recommended, as it lightens the server
           load by making the requests less frequent.  Instead of in
           seconds, the time can be specified in minutes using the "m"
           suffix, in hours using "h" suffix, or in days using "d" suffix.

           Specifying a large value for this option is useful if the
           network or the destination host is down, so that Wget can wait
           long enough to reasonably expect the network error to be fixed
           before the retry.

       --waitretry=seconds
           If you don't want Wget to wait between every retrieval, but only
           between retries of failed downloads, you can use this option.
           Wget will use linear backoff, waiting 1 second after the first
           failure on a given file, then waiting 2 seconds after the second
           failure on that file, up to the maximum number of seconds you
           specify.  Therefore, a value of 10 will actually make Wget wait
           up to (1 + 2 + ... + 10) = 55 seconds per file.

           Note that this option is turned on by default in the global
           wgetrc file.

       --random-wait
           Some web sites may perform log analysis to identify retrieval
           programs such as Wget by looking for statistically significant
           similarities in the time between requests.  This option causes
           the time between requests to vary between 0 and 2 * wait
           seconds, where wait was specified using the --wait option, in
           order to mask Wget's presence from such analysis.

           A recent article in a publication devoted to development on a
           popular consumer platform provided code to perform this analysis
           on the fly.  Its author suggested blocking at the class C
           address level to ensure automated retrieval programs were
           blocked despite changing DHCP-supplied addresses.

           The --random-wait option was inspired by this ill-advised
           recommendation to block many unrelated users from a web site due
           to the actions of one.

       --no-proxy
           Don't use proxies, even if the appropriate *_proxy environment
           variable is defined.

       -Q quota
       --quota=quota
           Specify download quota for automatic retrievals.  The value can
           be specified in bytes (default), kilobytes (with k suffix), or
           megabytes (with m suffix).

           Note that quota will never affect downloading a single file.  So
           if you specify wget -Q10k ftp://wuarchive.wustl.edu/ls-lR.gz,
           all of the ls-lR.gz will be downloaded.  The same goes even when
           several URLs are specified on the command-line.  However, quota
           is respected when retrieving either recursively, or from an
           input file.  Thus you may safely type wget -Q2m -i
           sites---download will be aborted when the quota is exceeded.

           Setting quota to 0 or to inf unlimits the download quota.

       --no-dns-cache
           Turn off caching of DNS lookups.  Normally, Wget remembers the
           IP addresses it looked up from DNS so it doesn't have to
           repeatedly contact the DNS server for the same (typically small)
           set of hosts it retrieves from.  This cache exists in memory
           only; a new Wget run will contact DNS again.

           However, it has been reported that in some situations it is not
           desirable to cache host names, even for the duration of a
           short-running application like Wget.  With this option Wget
           issues a new DNS lookup (more precisely, a new call to
           "gethostbyname" or "getaddrinfo") each time it makes a new
           connection.  Please note that this option will not affect
           caching that might be performed by the resolving library or by
           an external caching layer, such as NSCD.

           If you don't understand exactly what this option does, you
           probably won't need it.

       --restrict-file-names=mode
           Change which characters found in remote URLs may show up in
           local file names generated from those URLs.  Characters that are
           restricted by this option are escaped, i.e. replaced with %HH,
           where HH is the hexadecimal number that corresponds to the
           restricted character.

           By default, Wget escapes the characters that are not valid as
           part of file names on your operating system, as well as control
           characters that are typically unprintable.  This option is
           useful for changing these defaults, either because you are
           downloading to a non-native partition, or because you want to
           disable escaping of the control characters.

           When mode is set to ``unix'', Wget escapes the character / and
           the control characters in the ranges 0--31 and 128--159.  This
           is the default on Unix-like OS'es.

           When mode is set to ``windows'', Wget escapes the characters \,
           |, /, :, ?, ", *, <, >, and the control characters in the ranges
           0--31 and 128--159.  In addition to this, Wget in Windows mode
           uses + instead of : to separate host and port in local file
           names, and uses @ instead of ? to separate the query portion of
           the file name from the rest.  Therefore, a URL that would be
           saved as www.xemacs.org:4300/search.pl?input=blah in Unix mode
           would be saved as www.xemacs.org+4300/search.pl@input=blah in
           Windows mode.  This mode is the default on Windows.

           If you append ,nocontrol to the mode, as in unix,nocontrol,
           escaping of the control characters is also switched off.  You
           can use --restrict-file-names=nocontrol to turn off escaping of
           control characters without affecting the choice of the OS to use
           as file name restriction mode.
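
           For example, to apply the Windows restrictions and also disable
           control-character escaping (reusing the illustrative URL from
           above; the quotes protect the ? from the shell):

               wget --restrict-file-names=windows,nocontrol \
                    'http://www.xemacs.org:4300/search.pl?input=blah'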

       -4
       --inet4-only
       -6
       --inet6-only
           Force connecting to IPv4 or IPv6 addresses.  With --inet4-only
           or -4, Wget will only connect to IPv4 hosts, ignoring AAAA
           records in DNS, and refusing to connect to IPv6 addresses
           specified in URLs.  Conversely, with --inet6-only or -6, Wget
           will only connect to IPv6 hosts and ignore A records and IPv4
           addresses.

           Neither option should be needed normally.  By default, an
           IPv6-aware Wget will use the address family specified by the
           host's DNS record.  If the DNS responds with both IPv4 and IPv6
           addresses, Wget will try them in sequence until it finds one it
           can connect to.  (Also see "--prefer-family" option described
           below.)

           These options can be used to deliberately force the use of IPv4
           or IPv6 address families on dual family systems, usually to aid
           debugging or to deal with broken network configuration.  Only
           one of --inet6-only and --inet4-only may be specified at the
           same time.  Neither option is available in Wget compiled without
           IPv6 support.

       --prefer-family=IPv4/IPv6/none
           When given a choice of several addresses, connect to the
           addresses with specified address family first.  IPv4 addresses
           are preferred by default.

           This avoids spurious errors and connect attempts when accessing
           hosts that resolve to both IPv6 and IPv4 addresses from IPv4
           networks.  For example, www.kame.net resolves to
           2001:200:0:8002:203:47ff:fea5:3085 and to 203.178.141.194.  When
           the preferred family is "IPv4", the IPv4 address is used first;
           when the preferred family is "IPv6", the IPv6 address is used
           first; if the specified value is "none", the address order
           returned by DNS is used without change.

           Unlike -4 and -6, this option doesn't inhibit access to any
           address family, it only changes the order in which the addresses
           are accessed.  Also note that the reordering performed by this
           option is stable---it doesn't affect order of addresses of the
           same family.  That is, the relative order of all IPv4 addresses
           and of all IPv6 addresses remains intact in all cases.

       --retry-connrefused
           Consider ``connection refused'' a transient error and try again.
           Normally Wget gives up on a URL when it is unable to connect to
           the site because failure to connect is taken as a sign that the
           server is not running at all and that retries would not help.
           This option is for mirroring unreliable sites whose servers tend
           to disappear for short periods of time.

       --user=user
       --password=password
           Specify the username user and password password for both FTP
           and HTTP file retrieval.  These parameters can be overridden
           using the --ftp-user and --ftp-password options for FTP
           connections and the --http-user and --http-password options for
           HTTP connections.

   Directory Options
       -nd
       --no-directories
           Do not create a hierarchy of directories when retrieving
           recursively.  With this option turned on, all files will get
           saved to the current directory, without clobbering (if a name
           shows up more than once, the filenames will get extensions .n).

       -x
       --force-directories
           The opposite of -nd---create a hierarchy of directories, even if
           one would not have been created otherwise.  E.g. wget -x
           http://fly.srk.fer.hr/robots.txt will save the downloaded file
           to fly.srk.fer.hr/robots.txt.

       -nH
       --no-host-directories
           Disable generation of host-prefixed directories.  By default,
           invoking Wget with -r http://fly.srk.fer.hr/ will create a
           structure of directories beginning with fly.srk.fer.hr/.  This
           option disables such behavior.

       --protocol-directories
           Use the protocol name as a directory component of local file
           names.  For example, with this option, wget -r http://host will
           save to http/host/... rather than just to host/....

       --cut-dirs=number
           Ignore number directory components.  This is useful for getting
           a fine-grained control over the directory where recursive
           retrieval will be saved.

           Take, for example, the directory at
           ftp://ftp.xemacs.org/pub/xemacs/.  If you retrieve it with -r,
           it will be saved locally under ftp.xemacs.org/pub/xemacs/.
           While the -nH option can remove the ftp.xemacs.org/ part, you
           are still stuck with pub/xemacs.  This is where --cut-dirs comes
           in handy; it makes Wget not ``see'' number remote directory
           components.  Here are several examples of how --cut-dirs option
           works.

               No options        -> ftp.xemacs.org/pub/xemacs/
               -nH               -> pub/xemacs/
               -nH --cut-dirs=1  -> xemacs/
               -nH --cut-dirs=2  -> .

               --cut-dirs=1      -> ftp.xemacs.org/xemacs/
               ...

           If you just want to get rid of the directory structure, this
           option is similar to a combination of -nd and -P.  However,
           unlike -nd, --cut-dirs does not lose with subdirectories---for
           instance, with -nH --cut-dirs=1, a beta/ subdirectory will be
           placed to xemacs/beta, as one would expect.

       -P prefix
       --directory-prefix=prefix
           Set directory prefix to prefix.  The directory prefix is the
           directory where all other files and subdirectories will be saved
           to, i.e. the top of the retrieval tree.  The default is . (the
           current directory).
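
           For example, combining the directory options above, the
           following saves the xemacs tree under downloads/xemacs/ (the
           prefix name is illustrative):

               wget -r -nH --cut-dirs=1 -P downloads/ \
                    ftp://ftp.xemacs.org/pub/xemacs/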

   HTTP Options
       -E
       --html-extension
           If a file of type application/xhtml+xml or text/html is
           downloaded and the URL does not end with the regexp
           \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to
           be appended to the local filename.  This is useful, for
           instance, when you're mirroring a remote site that uses .asp
           pages, but you want the mirrored pages to be viewable on your
           stock Apache server.  Another good use for this is when you're
           downloading CGI-generated materials.  A URL like
           http://site.com/article.cgi?25 will be saved as
           article.cgi?25.html.

           Note that filenames changed in this way will be re-downloaded
           every time you re-mirror a site, because Wget can't tell that
           the local X.html file corresponds to remote URL X (since it
           doesn't yet know that the URL produces output of type text/html
           or application/xhtml+xml).  To prevent this re-downloading, you
           must use -k and -K so that the original version of the file will
           be saved as X.orig.

       --http-user=user
       --http-password=password
           Specify the username user and password password on an HTTP
           server.  According to the type of the challenge, Wget will
           encode them using either the "basic" (insecure) or the "digest"
           authentication scheme.

           Another way to specify username and password is in the URL
           itself.  Either method reveals your password to anyone who
           bothers to run "ps".  To prevent the passwords from being seen,
           store them in .wgetrc or .netrc, and make sure to protect those
           files from other users with "chmod".  If the passwords are
           really important, do not leave them lying in those files
           either---edit the files and delete them after Wget has started
           the download.

       --no-cache
           Disable server-side cache.  In this case, Wget will send the
           remote server an appropriate directive (Pragma: no-cache) to get
           the file from the remote service, rather than returning the
           cached version.  This is especially useful for retrieving and
           flushing out-of-date documents on proxy servers.

           Caching is allowed by default.

       --no-cookies
           Disable the use of cookies.  Cookies are a mechanism for
           maintaining server-side state.  The server sends the client a
           cookie using the "Set-Cookie" header, and the client responds
           with the same cookie upon further requests.  Since cookies allow
           the server owners to keep track of visitors and for sites to
           exchange this information, some consider them a breach of
           privacy.  The default is to use cookies; however, storing
           cookies is not on by default.

       --load-cookies file
           Load cookies from file before the first HTTP retrieval.  file is
           a textual file in the format originally used by Netscape's
           cookies.txt file.

           You will typically use this option when mirroring sites that
           require that you be logged in to access some or all of their
           content.  The login process typically works by the web server
           issuing an HTTP cookie upon receiving and verifying your
           credentials.  The cookie is then resent by the browser when
           accessing that part of the site, and so proves your identity.

           Mirroring such a site requires Wget to send the same cookies
           your browser sends when communicating with the site.  This is
           achieved by --load-cookies---simply point Wget to the location
           of the cookies.txt file, and it will send the same cookies your
           browser would send in the same situation.  Different browsers
           keep textual cookie files in different locations:

           Netscape 4.x.
               The cookies are in ~/.netscape/cookies.txt.

           Mozilla and Netscape 6.x.
               Mozilla's cookie file is also named cookies.txt, located
               somewhere under ~/.mozilla, in the directory of your
               profile.  The full path usually ends up looking somewhat
               like ~/.mozilla/default/some-weird-string/cookies.txt.

           Internet Explorer.
               You can produce a cookie file Wget can use by using the File
               menu, Import and Export, Export Cookies.  This has been
               tested with Internet Explorer 5; it is not guaranteed to
               work with earlier versions.

           Other browsers.
               If you are using a different browser to create your cookies,
               --load-cookies will only work if you can locate or produce a
               cookie file in the Netscape format that Wget expects.

           If you cannot use --load-cookies, there might still be an
           alternative.  If your browser supports a ``cookie manager'', you
           can use it to view the cookies used when accessing the site
           you're mirroring.  Write down the name and value of the cookie,
           and manually instruct Wget to send those cookies, bypassing the
           ``official'' cookie support:

               wget --no-cookies --header "Cookie: <name>=<value>"

       --save-cookies file
           Save cookies to file before exiting.  This will not save cookies
           that have expired or that have no expiry time (so-called
           ``session cookies''), but also see --keep-session-cookies.

       --keep-session-cookies
           When specified, causes --save-cookies to also save session
           cookies.  Session cookies are normally not saved because they
           are meant to be kept in memory and forgotten when you exit the
           browser.  Saving them is useful on sites that require you to log
           in or to visit the home page before you can access some pages.
           With this option, multiple Wget runs are considered a single
           browser session as far as the site is concerned.

           Since the cookie file format does not normally carry session
           cookies, Wget marks them with an expiry timestamp of 0.  Wget's
           --load-cookies recognizes those as session cookies, but it might
           confuse other browsers.  Also note that cookies so loaded will
           be treated as other session cookies, which means that if you
           want --save-cookies to preserve them again, you must use
           --keep-session-cookies again.

       --ignore-length
           Unfortunately, some HTTP servers (CGI programs, to be more
           precise) send out bogus "Content-Length" headers, which makes
           Wget go wild, as it thinks not all the document was retrieved.
           You can spot this syndrome if Wget retries getting the same
           document again and again, each time claiming that the (otherwise
           normal) connection has closed on the very same byte.

           With this option, Wget will ignore the "Content-Length"
           header---as if it never existed.

       --header=header-line
           Send header-line along with the rest of the headers in each HTTP
           request.  The supplied header is sent as-is, which means it must
           contain name and value separated by colon, and must not contain
           newlines.

           You may define more than one additional header by specifying
           --header more than once.

               wget --header='Accept-Charset: iso-8859-2' \
                    --header='Accept-Language: hr' \
                    http://fly.srk.fer.hr/

           Specification of an empty string as the header value will clear
           all previous user-defined headers.

           As of Wget 1.10, this option can be used to override headers
           otherwise generated automatically.  This example instructs Wget
           to connect to localhost, but to specify foo.bar in the "Host"
           header:

               wget --header="Host: foo.bar" http://localhost/

           In versions of Wget prior to 1.10 such use of --header caused
           sending of duplicate headers.

       --proxy-user=user
       --proxy-password=password
           Specify the username user and password password for
           authentication on a proxy server.  Wget will encode them using
           the "basic" authentication scheme.

           Security considerations similar to those with --http-password
           pertain here as well.

       --referer=url
           Include `Referer: url' header in HTTP request.  Useful for
           retrieving documents with server-side processing that assume
           they are always being retrieved by interactive web browsers and
           only come out properly when Referer is set to one of the pages
           that point to them.

       --save-headers
           Save the headers sent by the HTTP server to the file, preceding
           the actual contents, with an empty line as the separator.

       -U agent-string
       --user-agent=agent-string
           Identify as agent-string to the HTTP server.

           The HTTP protocol allows the clients to identify themselves
           using a "User-Agent" header field.  This enables distinguishing
           the WWW software, usually for statistical purposes or for
           tracing of protocol violations.  Wget normally identifies as
           Wget/version, version being the current version number of Wget.

           However, some sites have been known to impose the policy of
           tailoring the output according to the "User-Agent"-supplied
           information.  While this is not such a bad idea in theory, it
           has been abused by servers denying information to clients other
           than (historically) Netscape or, more frequently, Microsoft
           Internet Explorer.  This option allows you to change the
           "User-Agent" line issued by Wget.  Use of this option is
           discouraged, unless you really know what you are doing.

           Specifying empty user agent with --user-agent="" instructs Wget
           not to send the "User-Agent" header in HTTP requests.
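
           For example (the agent string and URL are illustrative), to
           masquerade as a browser:

               wget --user-agent="Mozilla/4.0 (compatible)" \
                    http://fly.srk.fer.hr/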

       --post-data=string
       --post-file=file
           Use POST as the method for all HTTP requests and send the
           specified data in the request body.  "--post-data" sends string
           as data, whereas "--post-file" sends the contents of file.
           Other than that, they work in exactly the same way.

           Please be aware that Wget needs to know the size of the POST
           data in advance.  Therefore the argument to "--post-file" must
           be a regular file; specifying a FIFO or something like
           /dev/stdin won't work.  It's not quite clear how to work around
           this limitation inherent in HTTP/1.0.  Although HTTP/1.1
           introduces chunked transfer that doesn't require knowing the
           request length in advance, a client can't use chunked unless it
           knows it's talking to an HTTP/1.1 server.  And it can't know
           that until it receives a response, which in turn requires the
           request to have been completed -- a chicken-and-egg problem.

           Note: if Wget is redirected after the POST request is completed,
           it will not send the POST data to the redirected URL.  This is
           because URLs that process POST often respond with a redirection
           to a regular page, which does not desire or accept POST.  It is
           not completely clear that this behavior is optimal; if it
           doesn't work out, it might be changed in the future.

           This example shows how to log to a server using POST and then
           proceed to download the desired pages, presumably only
           accessible to authorized users:

               # Log in to the server.  This can be done only once.
               wget --save-cookies cookies.txt \
                    --post-data 'user=foo&password=bar' \
                    http://server.com/auth.php

               # Now grab the page or pages we care about.
               wget --load-cookies cookies.txt \
                    -p http://server.com/interesting/article.php

           If the server is using session cookies to track user
           authentication, the above will not work because --save-cookies
           will not save them (and neither will browsers) and the
           cookies.txt file will be empty.  In that case use
           --keep-session-cookies along with --save-cookies to force saving
           of session cookies.

   HTTPS (SSL/TLS) Options
       To support encrypted HTTP (HTTPS) downloads, Wget must be compiled
       with an external SSL library, currently OpenSSL.  If Wget is
       compiled without SSL support, none of these options are available.

       --secure-protocol=protocol
           Choose the secure protocol to be used.  Legal values are auto,
           SSLv2, SSLv3, and TLSv1.  If auto is used, the SSL library is
           given the liberty of choosing the appropriate protocol
           automatically, which is achieved by sending an SSLv2 greeting
           and announcing support for SSLv3 and TLSv1.  This is the
           default.

           Specifying SSLv2, SSLv3, or TLSv1 forces the use of the
           corresponding protocol.  This is useful when talking to old and
           buggy SSL server implementations that make it hard for OpenSSL
           to choose the correct protocol version.  Fortunately, such
           servers are quite rare.
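
           For example (the URL is illustrative), to insist on TLSv1:

               wget --secure-protocol=TLSv1 https://fly.srk.fer.hr/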

       --no-check-certificate
           Don't check the server certificate against the available
           certificate authorities.  Also don't require the URL host name
           to match the common name presented by the certificate.

           As of Wget 1.10, the default is to verify the server's
           certificate against the recognized certificate authorities,
           breaking the SSL handshake and aborting the download if the
           verification fails.  Although this provides more secure
           downloads, it does break interoperability with some sites that
           worked with previous Wget versions, particularly those using
           self-signed, expired, or otherwise invalid certificates.  This
           option forces an ``insecure'' mode of operation that turns the
           certificate verification errors into warnings and allows you to
           proceed.

           If you encounter ``certificate verification'' errors or ones
           saying that ``common name doesn't match requested host name'',
           you can use this option to bypass the verification and proceed
           with the download.  Only use this option if you are otherwise
           convinced of the site's authenticity, or if you really don't
           care about the validity of its certificate.  It is almost always
           a bad idea not to check the certificates when transmitting
           confidential or important data.
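
           For instance (the URL is illustrative), to fetch a file from a
           server with a self-signed certificate whose authenticity you
           have verified by other means:

               wget --no-check-certificate https://fly.srk.fer.hr/file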

       --certificate=file
           Use the client certificate stored in file.  This is needed for
           servers that are configured to require certificates from the
           clients that connect to them.  Normally a certificate is not
           required and this switch is optional.

       --certificate-type=type
           Specify the type of the client certificate.  Legal values are
           PEM (assumed by default) and DER, also known as ASN1.

       --private-key=file
           Read the private key from file.  This allows you to provide the
           private key in a file separate from the certificate.

       --private-key-type=type
           Specify the type of the private key.  Accepted values are PEM
           (the default) and DER.

       --ca-certificate=file
           Use file as the file with the bundle of certificate authorities
           (``CA'') to verify the peers.  The certificates must be in PEM
           format.

           Without this option Wget looks for CA certificates at the
           system-specified locations, chosen at OpenSSL installation time.

       --ca-directory=directory
           Specifies directory containing CA certificates in PEM format.
           Each file contains one CA certificate, and the file name is
           based on a hash value derived from the certificate.  This is
           achieved by processing a certificate directory with the
           "c_rehash" utility supplied with OpenSSL.  Using --ca-directory
           is more efficient than --ca-certificate when many certificates
           are installed because it allows Wget to fetch certificates on
           demand.

           Without this option Wget looks for CA certificates at the
           system-specified locations, chosen at OpenSSL installation time.

       --random-file=file
           Use file as the source of random data for seeding the
           pseudo-random number generator on systems without /dev/random.

           On such systems the SSL library needs an external source of
           randomness to initialize.  Randomness may be provided by EGD
           (see --egd-file below) or read from an external source specified
           by the user.  If this option is not specified, Wget looks for
           random data in $RANDFILE or, if that is unset, in $HOME/.rnd.
           If none of those are available, it is likely that SSL encryption
           will not be usable.

           If you're getting the ``Could not seed OpenSSL PRNG; disabling
           SSL.'' error, you should provide random data using some of the
           methods described above.

       --egd-file=file
           Use file as the EGD socket.  EGD stands for Entropy Gathering
           Daemon, a user-space program that collects data from various
           unpredictable system sources and makes it available to other
           programs that might need it.  Encryption software, such as the
           SSL library, needs sources of non-repeating randomness to seed
           the random number generator used to produce cryptographically
           strong keys.

           OpenSSL allows the user to specify his own source of entropy
           using the "RAND_FILE" environment variable.  If this variable is
           unset, or if the specified file does not produce enough
           randomness, OpenSSL will read random data from EGD socket
           specified using this option.

           If this option is not specified (and the equivalent startup
           command is not used), EGD is never contacted.  EGD is not needed
           on modern Unix systems that support /dev/random.

   FTP Options
       --ftp-user=user
       --ftp-password=password
           Specify the username user and password password on an FTP
           server.  Without this, or the corresponding startup option, the
           password defaults to -wget@, normally used for anonymous FTP.

           Another way to specify username and password is in the URL
           itself.  Either method reveals your password to anyone who
           bothers to run "ps".  To prevent the passwords from being seen,
           store them in .wgetrc or .netrc, and make sure to protect those
           files from other users with "chmod".  If the passwords are
           really important, do not leave them lying in those files
           either---edit the files and delete them after Wget has started
           the download.
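
           For example (the host, name, and password are illustrative, and
           the caveats above about command-line passwords apply):

               wget --ftp-user=name --ftp-password=secret \
                    ftp://fly.srk.fer.hr/path/file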

       --no-remove-listing
           Don't remove the temporary .listing files generated by FTP
           retrievals.  Normally, these files contain the raw directory
           listings received from FTP servers.  Not removing them can be
           useful for debugging purposes, or when you want to be able to
           easily check on the contents of remote server directories (e.g.
           to verify that a mirror you're running is complete).

           Note that even though Wget writes to a known filename for this
           file, this is not a security hole in the scenario of a user
           making .listing a symbolic link to /etc/passwd or something and
           asking "root" to run Wget in his or her directory.  Depending on
           the options used, either Wget will refuse to write to .listing,
           making the globbing/recursion/time-stamping operation fail, or
           the symbolic link will be deleted and replaced with the actual
           .listing file, or the listing will be written to a
           .listing.number file.

           Even though this situation isn't a problem, though, "root"
           should never run Wget in a non-trusted user's directory.  A user
           could do something as simple as linking index.html to
           /etc/passwd and asking "root" to run Wget with -N or -r so the
           file will be overwritten.

       --no-glob
           Turn off FTP globbing.  Globbing refers to the use of shell-like
           special characters (wildcards), like *, ?, [ and ] to retrieve
           more than one file from the same directory at once, like:

               wget ftp://gnjilux.srk.fer.hr/*.msg

           By default, globbing will be turned on if the URL contains a
           globbing character.  This option may be used to turn globbing on
           or off permanently.

           You may have to quote the URL to protect it from being expanded
           by your shell.  Globbing makes Wget look for a directory
           listing, which is system-specific.  This is why it currently
           works only with Unix FTP servers (and the ones emulating Unix
           "ls" output).

       --no-passive-ftp
           Disable the use of the passive FTP transfer mode.  Passive FTP
           mandates that the client connect to the server to establish the
           data connection rather than the other way around.

           If the machine is connected to the Internet directly, both
           passive and active FTP should work equally well.  Behind most
           firewall and NAT configurations passive FTP has a better chance
           of working.  However, in some rare firewall configurations,
           active FTP actually works when passive FTP doesn't.  If you
           suspect this to be the case, use this option, or set
           "passive_ftp=off" in your init file.

       --retr-symlinks
           Usually, when retrieving FTP directories recursively and a
           symbolic link is encountered, the linked-to file is not
           downloaded.  Instead, a matching symbolic link is created on the
           local filesystem.  The pointed-to file will not be downloaded
           unless this recursive retrieval would have encountered it
           separately and downloaded it anyway.

           When --retr-symlinks is specified, however, symbolic links are
           traversed and the pointed-to files are retrieved.  At this time,
           this option does not cause Wget to traverse symlinks to
           directories and recurse through them, but in the future it
           should be enhanced to do this.

           Note that when retrieving a file (not a directory) because it
           was specified on the command-line, rather than because it was
           recursed to, this option has no effect.  Symbolic links are
           always traversed in this case.

       --no-http-keep-alive
           Turn off the ``keep-alive'' feature for HTTP downloads.
           Normally, Wget asks the server to keep the connection open so
           that, when you download more than one document from the same
           server, they get transferred over the same TCP connection.  This
           saves time and at the same time reduces the load on the server.

           This option is useful when, for some reason, persistent
           (keep-alive) connections don't work for you, for example due to
           a server bug or due to the inability of server-side scripts to
           cope with the connections.

   Recursive Retrieval Options
       -r
       --recursive
           Turn on recursive retrieving.

       -l depth
       --level=depth
           Specify recursion maximum depth level depth.  The default
           maximum depth is 5.

       --delete-after
           This option tells Wget to delete every single file it downloads,
           after having done so.  It is useful for pre-fetching popular
           pages through a proxy, e.g.:

               wget -r -nd --delete-after http://whatever.com/~popular/page/

           The -r option is to retrieve recursively, and -nd to not create
           directories.

           Note that --delete-after deletes files on the local machine.  It
           does not issue the DELE command to remote FTP sites, for
           instance.  Also note that when --delete-after is specified,
           --convert-links is ignored, so .orig files are simply not
           created in the first place.

       -k
       --convert-links
           After the download is complete, convert the links in the
           document to make them suitable for local viewing.  This affects
           not only the visible hyperlinks, but any part of the document
           that links to external content, such as embedded images, links
           to style sheets, hyperlinks to non-HTML content, etc.

           Each link will be changed in one of the two ways:

           *   The links to files that have been downloaded by Wget will be
               changed to refer to the file they point to as a relative
               link.

               Example: if the downloaded file /foo/doc.html links to
               /bar/img.gif, also downloaded, then the link in doc.html
               will be modified to point to ../bar/img.gif.  This kind of
               transformation works reliably for arbitrary combinations of
               directories.

           *   The links to files that have not been downloaded by Wget
               will be changed to include host name and absolute path of
               the location they point to.

               Example: if the downloaded file /foo/doc.html links to
               /bar/img.gif (or to ../bar/img.gif), then the link in
               doc.html will be modified to point to
               http://hostname/bar/img.gif.

           Because of this, local browsing works reliably: if a linked file
           was downloaded, the link will refer to its local name; if it was
           not downloaded, the link will refer to its full Internet address
           rather than presenting a broken link.  The fact that the former
           links are converted to relative links ensures that you can move
           the downloaded hierarchy to another directory.

           Note that only at the end of the download can Wget know which
           links have been downloaded.  Because of that, the work done by
           -k will be performed at the end of all the downloads.

       -K
       --backup-converted
           When converting a file, back up the original version with a
           .orig suffix.  Affects the behavior of -N.

       -m
       --mirror
           Turn on options suitable for mirroring.  This option turns on
           recursion and time-stamping, sets infinite recursion depth and
           keeps FTP directory listings.  It is currently equivalent to
           -r -N -l inf --no-remove-listing.
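
           For example (the URL is illustrative):

               wget -m http://fly.srk.fer.hr/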

       -p
       --page-requisites
           This option causes Wget to download all the files that are
           necessary to properly display a given HTML page.  This includes
           such things as inlined images, sounds, and referenced
           stylesheets.

           Ordinarily, when downloading a single HTML page, any requisite
           documents that may be needed to display it properly are not
           downloaded.  Using -r together with -l can help, but since Wget
           does not ordinarily distinguish between external and inlined
           documents, one is generally left with ``leaf documents'' that
           are missing their requisites.

           For instance, say document 1.html contains an "<IMG>" tag
           referencing 1.gif and an "<A>" tag pointing to external document
           2.html.  Say that 2.html is similar but that its image is 2.gif
           and it links to 3.html.  Say this continues up to some
           arbitrarily high number.

           If one executes the command:

               wget -r -l 2 http://<site>/1.html

           then 1.html, 1.gif, 2.html, 2.gif, and 3.html will be
           downloaded.  As you can see, 3.html is without its requisite
           3.gif because Wget is simply counting the number of hops (up to
           2) away from 1.html in order to determine where to stop the
           recursion.  However, with this command:

               wget -r -l 2 -p http://<site>/1.html

           all the above files and 3.html's requisite 3.gif will be
           downloaded.  Similarly,

               wget -r -l 1 -p http://<site>/1.html

           will cause 1.html, 1.gif, 2.html, and 2.gif to be downloaded.
           One might think that:

               wget -r -l 0 -p http://<site>/1.html

           would download just 1.html and 1.gif, but unfortunately this is
           not the case, because -l 0 is equivalent to -l inf---that is,
           infinite recursion.  To download a single HTML page (or a
           handful of them, all specified on the command-line or in a -i
           URL input file) and its (or their) requisites, simply leave off
           -r and -l:

               wget -p http://<site>/1.html

           Note that Wget will behave as if -r had been specified, but only
           that single page and its requisites will be downloaded.  Links
           from that page to external documents will not be followed.
           Actually, to download a single page and all its requisites (even
           if they exist on separate websites), and make sure the lot
           displays properly locally, this author likes to use a few
           options in addition to -p:

               wget -E -H -k -K -p http://<site>/<document>

           To finish off this topic, it's worth knowing that Wget's idea of
           an external document link is any URL specified in an "<A>" tag,
           an "<AREA>" tag, or a "<LINK>" tag other than "<LINK
           REL="stylesheet">".
1239
1240 --strict-comments
1241 Turn on strict parsing of HTML comments. The default is to termi‐
1242 nate comments at the first occurrence of -->.
1243
1244 According to specifications, HTML comments are expressed as SGML
1245 declarations. Declaration is special markup that begins with <!
1246 and ends with >, such as <!DOCTYPE ...>, that may contain comments
1247 between a pair of -- delimiters. HTML comments are ``empty decla‐
1248 rations'', SGML declarations without any non-comment text. There‐
1249 fore, <!--foo--> is a valid comment, and so is <!--one-- --two-->,
1250 but <!--1--2--> is not.
1251
1252 On the other hand, most HTML writers don't perceive comments as
1253 anything other than text delimited with <!-- and -->, which is not
1254 quite the same. For example, something like <!------------> works
1255 as a valid comment as long as the number of dashes is a multiple of
1256 four (!). If not, the comment technically lasts until the next --,
1257 which may be at the other end of the document. Because of this,
1258 many popular browsers completely ignore the specification and
1259 implement what users have come to expect: comments delimited with
1260 <!-- and -->.
1261
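For instance, with a page containing the (hypothetical) line

<!--1--2--> <A HREF="next.html">next</A>

naive parsing terminates the comment at the first --> and finds the
link to next.html, while strict parsing treats the stray -- as opening
another comment, scans past the > and may never see the link.
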
1262 Until version 1.9, Wget interpreted comments strictly, which
1263 resulted in missing links in many web pages that displayed fine in
1264 browsers, but had the misfortune of containing non-compliant com‐
1265 ments. Beginning with version 1.9, Wget has joined the ranks of
1266 clients that implement ``naive'' comments, terminating each com‐
1267 ment at the first occurrence of -->.
1268
1269 If, for whatever reason, you want strict comment parsing, use this
1270 option to turn it on.
1271
1272 Recursive Accept/Reject Options
1273
1274 -A acclist --accept acclist
1275 -R rejlist --reject rejlist
1276 Specify comma-separated lists of file name suffixes or patterns to
1277 accept or reject.
1278
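For instance, to recursively fetch only JPEG and PNG images from a
hypothetical directory:

wget -r -A '*.jpg,*.png' http://www.example.com/pics/

-R takes the same kind of list, but rejects the files that match it.
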
1279 -D domain-list
1280 --domains=domain-list
1281 Set domains to be followed. domain-list is a comma-separated list
1282 of domains. Note that it does not turn on -H.
1283
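Since -D does not enable host spanning by itself, combine it with -H.
For instance, to let a recursive retrieval touch only two hypothetical
domains:

wget -r -H -D example.com,example.org http://www.example.com/
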
1284 --exclude-domains domain-list
1285 Specify the domains that are not to be followed.
1286
1287 --follow-ftp
1288 Follow FTP links from HTML documents. Without this option, Wget
1289 will ignore all the FTP links.
1290
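For instance (URL hypothetical):

wget -r --follow-ftp http://www.example.com/mirrors.html

will descend into the ftp:// links found on that page as well as the
usual HTTP ones.
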
1291 --follow-tags=list
1292 Wget has an internal table of HTML tag / attribute pairs that it
1293 considers when looking for linked documents during a recursive
1294 retrieval. If you want only a subset of those tags to be
1295 considered, specify them in a comma-separated list with this
1296 option.
1297
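For instance, to consider only plain links and image maps during a
recursive retrieval (a minimal sketch):

wget -r --follow-tags=a,area http://www.example.com/
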
1298 --ignore-tags=list
1299 This is the opposite of the --follow-tags option. To skip certain
1300 HTML tags when recursively looking for documents to download, spec‐
1301 ify them in a comma-separated list.
1302
1303 In the past, this option was the best bet for downloading a single
1304 page and its requisites, using a command-line like:
1305
1306 wget --ignore-tags=a,area -H -k -K -r http://<site>/<document>
1307
1308 However, the author of this option came across a page with tags
1309 like "<LINK REL="home" HREF="/">" and came to the realization that
1310 specifying tags to ignore was not enough. One can't just tell Wget
1311 to ignore "<LINK>", because then stylesheets will not be down‐
1312 loaded. Now the best bet for downloading a single page and its
1313 requisites is the dedicated --page-requisites option.
1314
1315 -H
1316 --span-hosts
1317 Enable spanning across hosts when doing recursive retrieving.
1318
1319 -L
1320 --relative
1321 Follow relative links only. Useful for retrieving a specific home
1322 page without any distractions, not even those from the same host.
1323
1324 -I list
1325 --include-directories=list
1326 Specify a comma-separated list of directories you wish to follow
1327 when downloading. Elements of list may contain wildcards.
1328
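For instance, to confine a recursive FTP retrieval to two hypothetical
top-level directories:

wget -r -I /pub,/mirror ftp://ftp.example.com/
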
1329 -X list
1330 --exclude-directories=list
1331 Specify a comma-separated list of directories you wish to exclude
1332 from download. Elements of list may contain wildcards.
1333
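For instance, to skip hypothetical CGI and scratch directories:

wget -r -X '/cgi-bin,/tmp*' http://www.example.com/

The quotes keep the shell from expanding the wildcard before Wget sees
it.
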
1334 -np
1335 --no-parent
1336 Do not ever ascend to the parent directory when retrieving recur‐
1337 sively. This is a useful option, since it guarantees that only the
1338 files below a certain hierarchy will be downloaded.
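
For instance (URL hypothetical):

wget -r --no-parent http://www.example.com/docs/

will never fetch anything above the /docs/ directory.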
1339
1340 EXAMPLES
1341 The examples are divided into three sections loosely based on their
1342 complexity.
1343
1344 Simple Usage
1345
1346 · Say you want to download a URL. Just type:
1347
1348 wget http://fly.srk.fer.hr/
1349
1350 · But what will happen if the connection is slow, and the file is
1351 lengthy? The connection will probably fail, perhaps more than
1352 once, before the whole file is retrieved. In this case, Wget will
1353 keep trying until it either gets the whole file or exceeds the
1354 default number of retries (this being 20). It is easy to change
1355 the number of tries to 45, to ensure that the whole file will
1356 arrive safely:
1357
1358 wget --tries=45 http://fly.srk.fer.hr/jpg/flyweb.jpg
1359
1360 · Now let's leave Wget to work in the background, and write its
1361 progress to log file log. It is tiring to type --tries, so we
1362 shall use -t.
1363
1364 wget -t 45 -o log http://fly.srk.fer.hr/jpg/flyweb.jpg &
1365
1366 The ampersand at the end of the line makes sure that Wget works in
1367 the background. To remove the limit on retries, use -t inf.
1368
1369 Using FTP is just as simple. Wget will take care of the login
1370 and password.
1371
1372 wget ftp://gnjilux.srk.fer.hr/welcome.msg
1373
1374 · If you specify a directory, Wget will retrieve the directory list‐
1375 ing, parse it and convert it to HTML. Try:
1376
1377 wget ftp://ftp.gnu.org/pub/gnu/
1378 links index.html
1379
1380 Advanced Usage
1381
1382 · You have a file that contains the URLs you want to download? Use
1383 the -i switch:
1384
1385 wget -i <file>
1386
1387 If you specify - as file name, the URLs will be read from standard
1388 input.
1389
1390 · Create a mirror image of the GNU web site, five levels deep, with
1391 the same directory structure the original has, with only one try
1392 per document, saving the log of the activities to gnulog:
1393
1394 wget -r -t1 http://www.gnu.org/ -o gnulog
1395
1396 · The same as the above, but convert the links in the HTML files to
1397 point to local files, so you can view the documents off-line:
1398
1399 wget --convert-links -r http://www.gnu.org/ -o gnulog
1400
1401 · Retrieve only one HTML page, but make sure that all the elements
1402 needed for the page to be displayed, such as inline images and
1403 external style sheets, are also downloaded. Also make sure the
1404 downloaded page references the local copies.
1405
1406 wget -p --convert-links http://www.server.com/dir/page.html
1407
1408 The HTML page will be saved to www.server.com/dir/page.html, and
1409 the images, stylesheets, etc., somewhere under www.server.com/,
1410 depending on where they were on the remote server.
1411
1412 · The same as the above, but without the www.server.com/ directory.
1413 In fact, I don't want to have all those random server directories
1414 anyway---just save all those files under a download/ subdirectory
1415 of the current directory.
1416
1417 wget -p --convert-links -nH -nd -Pdownload \
1418 http://www.server.com/dir/page.html
1419
1420 · Retrieve the index.html of www.lycos.com, showing the original
1421 server headers:
1422
1423 wget -S http://www.lycos.com/
1424
1425 · Save the server headers with the file, perhaps for post-processing.
1426
1427 wget --save-headers http://www.lycos.com/
1428 more index.html
1429
1430 · Retrieve the first two levels of wuarchive.wustl.edu, saving them
1431 to /tmp.
1432
1433 wget -r -l2 -P/tmp ftp://wuarchive.wustl.edu/
1434
1435 · You want to download all the GIFs from a directory on an HTTP
1436 server. You tried wget http://www.server.com/dir/*.gif, but that
1437 didn't work because HTTP retrieval does not support globbing. In
1438 that case, use:
1439
1440 wget -r -l1 --no-parent -A.gif http://www.server.com/dir/
1441
1442 More verbose, but the effect is the same. -r -l1 means to retrieve
1443 recursively, with maximum depth of 1. --no-parent means that ref‐
1444 erences to the parent directory are ignored, and -A.gif means to
1445 download only the GIF files. -A "*.gif" would have worked too.
1446
1447 · Suppose you were in the middle of a download when Wget was inter‐
1448 rupted. Now you do not want to clobber the files already present.
1449 In that case, use:
1450
1451 wget -nc -r http://www.gnu.org/
1452
1453 · If you want to embed your username and password in an HTTP or
1454 FTP URL, use the appropriate URL syntax.
1455
1456 wget ftp://hniksic:mypassword@unix.server.com/.emacs
1457
1458 Note, however, that this usage is not advisable on multi-user sys‐
1459 tems because it reveals your password to anyone who looks at the
1460 output of "ps".
1461
1462 · You would like the output documents to go to standard output
1463 instead of to files?
1464
1465 wget -O - http://jagor.srce.hr/ http://www.srce.hr/
1466
1467 You can also combine the two options and make pipelines to retrieve
1468 the documents from remote hotlists:
1469
1470 wget -O - http://cool.list.com/ | wget --force-html -i -
1471
1472 Very Advanced Usage
1473
1474 · If you wish Wget to keep a mirror of a page (or FTP subdirecto‐
1475 ries), use --mirror (-m), which is shorthand for -r -N -l inf
1476 --no-remove-listing. You can put Wget in the crontab file, asking
1477 it to recheck a site each Sunday:
1478
1479 crontab
1480 0 0 * * 0 wget --mirror http://www.gnu.org/ -o /home/me/weeklog
1481
1482 · In addition to the above, you want the links to be converted for
1483 local viewing. But, after having read this manual, you know that
1484 link conversion doesn't play well with timestamping, so you also
1485 want Wget to back up the original HTML files before the conversion.
1486 The Wget invocation would look like this:
1487
1488 wget --mirror --convert-links --backup-converted \
1489 http://www.gnu.org/ -o /home/me/weeklog
1490
1491 · But you've also noticed that local viewing doesn't work all that
1492 well when HTML files are saved under extensions other than .html,
1493 perhaps because they were served as index.cgi. So you'd like Wget
1494 to rename all the files served with content-type text/html or
1495 application/xhtml+xml to name.html.
1496
1497 wget --mirror --convert-links --backup-converted \
1498 --html-extension -o /home/me/weeklog \
1499 http://www.gnu.org/
1500
1501 Or, with less typing:
1502
1503 wget -m -k -K -E http://www.gnu.org/ -o /home/me/weeklog
1504
1505 FILES
1506 /etc/wgetrc
1507 Default location of the global startup file.
1508
1509 .wgetrc
1510 User startup file.
1511
1512 BUGS
1513 You are welcome to send bug reports about GNU Wget to
1514 <bug-wget@gnu.org>.
1515
1516 Before actually submitting a bug report, please try to follow a few
1517 simple guidelines.
1518
1519 1. Please try to ascertain that the behavior you see really is a bug.
1520 If Wget crashes, it's a bug. If Wget does not behave as docu‐
1521 mented, it's a bug. If things behave strangely, but you are not
1522 sure how they are supposed to work, it might well be a bug.
1523
1524 2. Try to repeat the bug in circumstances as simple as possible.
1525 E.g. if Wget crashes while downloading wget -rl0 -kKE -t5 -Y0
1526 http://yoyodyne.com -o /tmp/log, see whether the crash is
1527 repeatable, and if it occurs with a simpler set of options.
1528 You might even try to start the download at the page where the
1529 crash occurred to see if that page somehow triggered the crash.
1530
1531 Also, while I will probably be interested to know the contents of
1532 your .wgetrc file, just dumping it into the debug message is proba‐
1533 bly a bad idea. Instead, you should first try to see if the bug
1534 repeats with .wgetrc moved out of the way. Only if it turns out
1535 that .wgetrc settings affect the bug, mail me the relevant parts of
1536 the file.
1537
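One way to test this from the shell (file names arbitrary):

mv ~/.wgetrc ~/.wgetrc.away    # move the startup file out of the way
wget ...                       # repeat the command that showed the bug
mv ~/.wgetrc.away ~/.wgetrc    # restore it
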
1538 3. Please start Wget with the -d option and send us the resulting
1539 output (or relevant parts thereof). If Wget was compiled without
1540 debug support, recompile it---it is much easier to trace bugs
1541 with debug support on.
1542
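For instance, to capture the debug output in a file suitable for
attaching to a report (URL is a placeholder):

wget -d -o wget-debug.log http://www.example.com/
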
1543 Note: please make sure to remove any potentially sensitive informa‐
1544 tion from the debug log before sending it to the bug address. The
1545 "-d" won't go out of its way to collect sensitive information, but
1546 the log will contain a fairly complete transcript of Wget's commu‐
1547 nication with the server, which may include passwords and pieces of
1548 downloaded data. Since the bug address is publicly archived, you
1549 may assume that all bug reports are visible to the public.
1550
1551 4. If Wget has crashed, try to run it in a debugger, e.g. "gdb `which
1552 wget` core" and type "where" to get the backtrace. This may not
1553 work if the system administrator has disabled core files, but it is
1554 safe to try.
1555
1557 GNU Info entry for wget.
1558
1559 AUTHOR
1560 Originally written by Hrvoje Niksic <hniksic@xemacs.org>.
1561
1562 COPYRIGHT
1563 Copyright (c) 1996--2005 Free Software Foundation, Inc.
1564
1565 Permission is granted to make and distribute verbatim copies of this
1566 manual provided the copyright notice and this permission notice are
1567 preserved on all copies.
1568
1569 Permission is granted to copy, distribute and/or modify this document
1570 under the terms of the GNU Free Documentation License, Version 1.2 or
1571 any later version published by the Free Software Foundation; with the
1572 Invariant Sections being ``GNU General Public License'' and ``GNU Free
1573 Documentation License'', with no Front-Cover Texts, and with no Back-
1574 Cover Texts. A copy of the license is included in the section entitled
1575 ``GNU Free Documentation License''.
1576
1577
1578
1579 GNU Wget 1.10.2 (Red Hat modified)          2007-02-12          WGET(1)