LINKCHECKER(1)            LinkChecker commandline usage           LINKCHECKER(1)


NAME
       linkchecker - command line client to check HTML documents and websites
       for broken links

SYNOPSIS
       linkchecker [options] [file-or-url]...

DESCRIPTION
       LinkChecker features

       ·  recursive and multithreaded checking,

       ·  output in colored or normal text, HTML, SQL, CSV, XML or a sitemap
          graph in different formats,

       ·  support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet
          and local file links,

       ·  restriction of link checking with URL filters,

       ·  proxy support,

       ·  username/password authorization for HTTP, FTP and Telnet,

       ·  support for the robots.txt exclusion protocol,

       ·  support for cookies,

       ·  support for HTML5,

       ·  HTML and CSS syntax checking,

       ·  antivirus checking,

       ·  a command line and web interface.

EXAMPLES
       The most common use checks the given domain recursively:
              linkchecker http://www.example.com/
       Beware that this checks the whole site, which can have thousands of
       URLs.  Use the -r option to restrict the recursion depth.
       Don't check URLs with /secret in their name.  All other links are
       checked as usual:
              linkchecker --ignore-url=/secret mysite.example.com
       Checking a local HTML file on Unix:
              linkchecker ../bla.html
       Checking a local HTML file on Windows:
              linkchecker c:\temp\test.html
       You can skip the http:// URL part if the domain starts with www.:
              linkchecker www.example.com
       You can skip the ftp:// URL part if the domain starts with ftp.:
              linkchecker -r0 ftp.example.com
       Generate a sitemap graph and convert it with the graphviz dot utility:
              linkchecker -odot -v www.example.com | dot -Tps > sitemap.ps

OPTIONS
   General options
       -fFILENAME, --config=FILENAME
              Use FILENAME as the configuration file.  By default
              LinkChecker uses ~/.linkchecker/linkcheckerrc.

       -h, --help
              Print usage information for this program.

       --stdin
              Read a list of white-space-separated URLs to check from stdin.

       -tNUMBER, --threads=NUMBER
              Generate no more than the given number of threads.  The
              default number of threads is 10.  To disable threading,
              specify a non-positive number.

       -V, --version
              Print version and exit.

       --list-plugins
              Print available check plugins and exit.

   Output options
       -DSTRING, --debug=STRING
              Print debugging output for the given logger.  Available
              loggers are cmdline, checking, cache, dns, plugins and all.
              Specifying all is an alias for specifying all available
              loggers.  This option can be given multiple times to debug
              with more than one logger.  For accurate results, threading is
              disabled during debug runs.

       -FTYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]
              Output to a file linkchecker-out.TYPE,
              $HOME/.linkchecker/blacklist for blacklist output, or FILENAME
              if specified.  The ENCODING specifies the output encoding; the
              default is that of your locale.  Valid encodings are listed at
              http://docs.python.org/library/codecs.html#standard-encodings.
              The FILENAME and ENCODING parts of the none output type are
              ignored; otherwise, if the file already exists, it will be
              overwritten.  You can specify this option more than once.
              Valid file output types are text, html, sql, csv, gml, dot,
              xml, sitemap, none or blacklist.  The default is no file
              output.  The various output types are documented below.  Note
              that you can suppress all console output with the option
              -o none.

       --no-status
              Do not print check status messages.

       --no-warnings
              Don't log warnings.  The default is to log warnings.

       -oTYPE[/ENCODING], --output=TYPE[/ENCODING]
              Specify the output type as text, html, sql, csv, gml, dot,
              xml, sitemap, none or blacklist.  The default type is text.
              The various output types are documented below.
              The ENCODING specifies the output encoding; the default is
              that of your locale.  Valid encodings are listed at
              http://docs.python.org/library/codecs.html#standard-encodings.

       -q, --quiet
              Quiet operation, an alias for -o none.  This is only useful
              with -F.

       -v, --verbose
              Log all checked URLs.  The default is to log only errors and
              warnings.

       -WREGEX, --warning-regex=REGEX
              Define a regular expression which prints a warning if it
              matches any content of the checked link.  This applies only to
              valid pages, so their content can be retrieved.
              Use this to check for pages that contain some form of error,
              for example "This page has moved" or "Oracle Application
              error".
              Note that multiple values can be combined in the regular
              expression, for example "(This page has moved|Oracle
              Application error)".
              See section REGULAR EXPRESSIONS for more info.
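
       The content scan that --warning-regex performs can be reproduced in a
       few lines of Python.  This is an illustrative sketch, not
       LinkChecker's implementation; the pattern and sample page are made up.

       ```python
       import re

       # A combined pattern, as it might be passed via --warning-regex.
       warning_regex = re.compile("(This page has moved|Oracle Application error)")

       def content_warning(page_content):
           """Return the matched error phrase, or None if the page looks fine."""
           match = warning_regex.search(page_content)
           return match.group(0) if match else None

       print(content_warning("<html><body>This page has moved.</body></html>"))
       # → This page has moved
       ```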

   Checking options
       --cookiefile=FILENAME
              Read a file with initial cookie data.  The cookie data format
              is explained below.

       --check-extern
              Check external URLs.

       --ignore-url=REGEX
              URLs matching the given regular expression will be ignored and
              not checked.
              This option can be given multiple times.
              See section REGULAR EXPRESSIONS for more info.

       -NSTRING, --nntp-server=STRING
              Specify an NNTP server for news: links.  The default is the
              environment variable NNTP_SERVER.  If no host is given, only
              the syntax of the link is checked.

       --no-follow-url=REGEX
              Check but do not recurse into URLs matching the given regular
              expression.
              This option can be given multiple times.
              See section REGULAR EXPRESSIONS for more info.

       -p, --password
              Read a password from the console and use it for HTTP and FTP
              authorization.  For FTP the default password is anonymous@.
              For HTTP there is no default password.  See also -u.

       -rNUMBER, --recursion-level=NUMBER
              Check recursively all links up to the given depth.  A negative
              depth enables infinite recursion.  The default depth is
              infinite.

       --timeout=NUMBER
              Set the timeout for connection attempts in seconds.  The
              default timeout is 60 seconds.

       -uSTRING, --user=STRING
              Try the given username for HTTP and FTP authorization.  For
              FTP the default username is anonymous.  For HTTP there is no
              default username.  See also -p.

       --user-agent=STRING
              Specify the User-Agent string to send to the HTTP server, for
              example "Mozilla/4.0".  The default is "LinkChecker/X.Y" where
              X.Y is the current version of LinkChecker.

CONFIGURATION FILES
       Configuration files can specify all options above.  They can also
       specify some options that cannot be set on the command line.  See
       linkcheckerrc(5) for more info.

OUTPUT TYPES
       Note that by default only errors and warnings are logged.  You should
       use the --verbose option to get the complete URL list, especially
       when outputting a sitemap graph format.

       text   Standard text logger, logging URLs in keyword: argument
              fashion.

       html   Log URLs in keyword: argument fashion, formatted as HTML.
              Additionally has links to the referenced pages.  Invalid URLs
              have HTML and CSS syntax check links appended.

       csv    Log the check result in CSV format with one URL per line.

       gml    Log parent-child relations between linked URLs as a GML
              sitemap graph.

       dot    Log parent-child relations between linked URLs as a DOT
              sitemap graph.

       gxml   Log the check result as a GraphXML sitemap graph.

       xml    Log the check result as machine-readable XML.

       sitemap
              Log the check result as an XML sitemap whose protocol is
              documented at http://www.sitemaps.org/protocol.html.

       sql    Log the check result as an SQL script with INSERT commands.
              An example script to create the initial SQL table is included
              as create.sql.

       blacklist
              Suitable for cron jobs.  Logs the check result into a file
              ~/.linkchecker/blacklist which only contains entries with
              invalid URLs and the number of times they have failed.

       none   Logs nothing.  Suitable for debugging or checking the exit
              code.

REGULAR EXPRESSIONS
       LinkChecker accepts Python regular expressions.  See
       http://docs.python.org/howto/regex.html for an introduction.

       An addition is that a leading exclamation mark negates the regular
       expression.
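
       The negation rule can be sketched as follows.  The helper
       matches_filter below is hypothetical and only restates the documented
       behavior; it is not LinkChecker's actual filter code.

       ```python
       import re

       def matches_filter(pattern, url):
           """Apply a LinkChecker-style filter: a leading '!' negates the regex."""
           negate = pattern.startswith("!")
           if negate:
               pattern = pattern[1:]
           found = re.search(pattern, url) is not None
           return found != negate  # negation flips the result

       print(matches_filter("/secret", "http://example.com/secret/page"))   # True
       print(matches_filter("!/secret", "http://example.com/secret/page"))  # False
       ```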

COOKIE FILES
       A cookie file contains standard HTTP header (RFC 2616) data with the
       following possible names:

       Host (required)
              Sets the domain the cookies are valid for.

       Path (optional)
              Gives the path the cookies are valid for; the default path
              is /.

       Set-cookie (required)
              Sets the cookie name/value.  Can be given more than once.

       Multiple entries are separated by a blank line.  The example below
       will send two cookies to all URLs starting with
       http://example.com/hello/ and one to all URLs starting with
       https://example.org/:

        Host: example.com
        Path: /hello
        Set-cookie: ID="smee"
        Set-cookie: spam="egg"

        Host: example.org
        Set-cookie: baggage="elitist"; comment="hologram"

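       A minimal reader for this blank-line-separated format might look like
       the sketch below.  It is an illustration of the format only, not the
       parser LinkChecker ships.

       ```python
       def parse_cookiefile(text):
           """Parse blank-line-separated entries of Host/Path/Set-cookie headers."""
           entries = []
           for block in text.strip().split("\n\n"):
               entry = {"Host": None, "Path": "/", "Set-cookie": []}
               for line in block.splitlines():
                   name, _, value = line.partition(":")
                   name, value = name.strip(), value.strip()
                   if name.lower() == "set-cookie":
                       entry["Set-cookie"].append(value)
                   elif name.lower() in ("host", "path"):
                       entry[name.capitalize()] = value
               entries.append(entry)
           return entries

       example = """Host: example.com
       Path: /hello
       Set-cookie: ID="smee"
       Set-cookie: spam="egg"

       Host: example.org
       Set-cookie: baggage="elitist"; comment="hologram"
       """

       for entry in parse_cookiefile(example):
           print(entry["Host"], entry["Path"], len(entry["Set-cookie"]))
       # → example.com /hello 2
       #   example.org / 1
       ```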

PROXY SUPPORT
       To use a proxy on Unix or Windows, set the $http_proxy, $https_proxy
       or $ftp_proxy environment variable to the proxy URL.  The URL should
       be of the form http://[user:pass@]host[:port].  LinkChecker also
       detects manual proxy settings of Internet Explorer on Windows
       systems, and gconf or KDE on Linux systems.  On a Mac, use Internet
       Config to select a proxy.  You can also set a comma-separated domain
       list in the $no_proxy environment variable to ignore any proxy
       settings for these domains.  Setting an HTTP proxy on Unix, for
       example, looks like this:

              export http_proxy="http://proxy.example.com:8080"

       Proxy authentication is also supported:

              export http_proxy="http://user1:mypass@proxy.example.org:8081"

       Setting a proxy on the Windows command prompt:

              set http_proxy=http://proxy.example.com:8080

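       The $no_proxy handling described above amounts to a domain-suffix
       match against the host name.  The sketch below approximates the
       common convention; it is not LinkChecker's exact matching logic.

       ```python
       import os

       def proxy_bypassed(host, no_proxy=None):
           """Return True if host matches an entry in the no_proxy domain list."""
           if no_proxy is None:
               no_proxy = os.environ.get("no_proxy", "")
           for domain in filter(None, (d.strip() for d in no_proxy.split(","))):
               # Match the domain itself or any subdomain of it.
               if host == domain or host.endswith("." + domain.lstrip(".")):
                   return True
           return False

       print(proxy_bypassed("intranet.example.com", "localhost,example.com"))  # True
       print(proxy_bypassed("www.example.org", "localhost,example.com"))       # False
       ```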

PERFORMED CHECKS
       All URLs have to pass a preliminary syntax test.  Minor quoting
       mistakes will issue a warning; all other invalid syntax issues are
       errors.  After the syntax check passes, the URL is queued for
       connection checking.  All connection check types are described below.

       HTTP links (http:, https:)
              After connecting to the given HTTP server, the given path or
              query is requested.  All redirections are followed, and if a
              user/password is given, it will be used as authorization when
              necessary.  All final HTTP status codes other than 2xx are
              errors.  HTML page contents are checked for recursion.

       Local files (file:)
              A regular, readable file that can be opened is valid.  A
              readable directory is also valid.  All other files, for
              example device files, unreadable or non-existing files, are
              errors.  HTML or other parseable file contents are checked
              for recursion.

       Mail links (mailto:)
              A mailto: link eventually resolves to a list of email
              addresses.  If one address fails, the whole list fails.  For
              each mail address the following things are checked:
              1) Check the address syntax, both of the part before and
                 after the @ sign.
              2) Look up the MX DNS records.  If no MX record is found,
                 print an error.
              3) Check if one of the mail hosts accepts an SMTP connection.
                 Hosts with higher priority are checked first.  If no host
                 accepts SMTP, print a warning.
              4) Try to verify the address with the VRFY command.  If an
                 answer is received, print the verified address as an info.
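
       Step 1 above, the syntax check, can be illustrated with a simplified
       address splitter.  The patterns below are deliberately incomplete;
       full RFC 5322 address validation is far more permissive, and this is
       not LinkChecker's validator.

       ```python
       import re

       # Simplified character classes; a dot is required in the domain part.
       LOCAL_PART = re.compile(r"^[A-Za-z0-9._%+-]+$")
       DOMAIN_PART = re.compile(r"^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")

       def address_syntax_ok(address):
           """Check the part before and after the @ sign (step 1 above)."""
           local, sep, domain = address.rpartition("@")
           if not sep or not local:
               return False
           return bool(LOCAL_PART.match(local) and DOMAIN_PART.match(domain))

       print(address_syntax_ok("user@example.com"))  # True
       print(address_syntax_ok("no-at-sign"))        # False
       ```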

       FTP links (ftp:)
              For FTP links we:
              1) connect to the specified host,
              2) try to log in with the given user and password.  The
                 default user is anonymous, the default password is
                 anonymous@,
              3) try to change to the given directory,
              4) list the file with the NLST command.

       Telnet links (telnet:)
              We try to connect and, if user/password are given, to log in
              to the given telnet server.

       NNTP links (news:, snews:, nntp:)
              We try to connect to the given NNTP server.  If a news group
              or article is specified, we try to request it from the
              server.

       Unsupported links (javascript:, etc.)
              An unsupported link will only print a warning.  No further
              checking will be made.
              The complete list of recognized but unsupported links can be
              found in the linkcheck/checker/unknownurl.py source file.
              The most prominent of them are JavaScript links.

PLUGINS
       There are two plugin types: connection and content plugins.
       Connection plugins are run after a successful connection to the URL
       host.  Content plugins are run if the URL type has content (mailto:
       URLs have no content, for example) and if the check is not forbidden
       (i.e. by HTTP robots.txt).  See linkchecker --list-plugins for a
       list of plugins and their documentation.  All plugins are enabled
       via the linkcheckerrc(5) configuration file.

RECURSION
       Before descending recursively into a URL, it has to fulfill several
       conditions.  They are checked in this order:

       1. A URL must be valid.

       2. A URL must be parseable.  This currently includes HTML files,
          Opera bookmark files, and directories.  If a file type cannot be
          determined (for example, it does not have a common HTML file
          extension and the content does not look like HTML), it is
          assumed to be non-parseable.

       3. The URL content must be retrievable.  This is usually the case
          except for mailto: or unknown URL types, for example.

       4. The maximum recursion level must not be exceeded.  It is
          configured with the --recursion-level option and is unlimited by
          default.

       5. It must not match the ignored URL list.  This is controlled with
          the --ignore-url option.

       6. The Robots Exclusion Protocol must allow links in the URL to be
          followed recursively.  This is checked by searching for a
          "nofollow" directive in the HTML header data.

       Note that the directory recursion reads all files in that directory,
       not just a subset like index.htm*.
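
       The six conditions above can be read as one short-circuiting
       predicate.  The function below is a schematic restatement of the
       documented order, not LinkChecker's code; the parameter names are
       invented for illustration.

       ```python
       def should_recurse(url_valid, parseable, retrievable,
                          depth, max_depth, ignored, robots_allow):
           """Evaluate the six recursion conditions in their documented order."""
           if not url_valid:      # 1. URL must be valid
               return False
           if not parseable:      # 2. content must be a parseable type
               return False
           if not retrievable:    # 3. content must be retrievable
               return False
           if max_depth >= 0 and depth > max_depth:  # 4. depth limit (< 0 = unlimited)
               return False
           if ignored:            # 5. --ignore-url filter
               return False
           return robots_allow    # 6. robots "nofollow" directive

       print(should_recurse(True, True, True, depth=2, max_depth=-1,
                            ignored=False, robots_allow=True))  # True
       ```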

NOTES
       URLs on the command line starting with ftp. are treated like
       ftp://ftp., and URLs starting with www. are treated like http://www..
       You can also give local files as arguments.

       If your system is configured to automatically establish a connection
       to the internet (e.g. with diald), it will connect when checking
       links not pointing to your local host.  Use the --ignore-url option
       to prevent this.

       Javascript links are not supported.

       If your platform does not support threading, LinkChecker disables it
       automatically.

       You can supply multiple user/password pairs in a configuration file.

       When checking news: links, the given NNTP host doesn't need to be
       the same as the host of the user browsing your pages.

ENVIRONMENT
       NNTP_SERVER - specifies the default NNTP server
       http_proxy - specifies the default HTTP proxy server
       ftp_proxy - specifies the default FTP proxy server
       no_proxy - comma-separated list of domains not to contact over a
       proxy server
       LC_MESSAGES, LANG, LANGUAGE - specify the output language

RETURN VALUE
       The return value is 2 when

       ·  a program error occurred.

       The return value is 1 when

       ·  invalid links were found or

       ·  link warnings were found and warnings are enabled.

       Otherwise the return value is zero.
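
       The exit codes above can be summarized in a small helper, for example
       when wrapping linkchecker in a monitoring script.  Only the numeric
       codes come from this page; the wording of the messages is invented.

       ```python
       def describe_exit_code(code):
           """Map linkchecker's documented exit codes to a short description."""
           meanings = {
               0: "all links valid",
               1: "invalid links found, or warnings found with warnings enabled",
               2: "a program error occurred",
           }
           return meanings.get(code, "unknown exit code")

       print(describe_exit_code(0))  # all links valid
       print(describe_exit_code(2))  # a program error occurred
       ```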

LIMITATIONS
       LinkChecker consumes memory for each queued URL to check.  With
       thousands of queued URLs the amount of consumed memory can become
       quite large.  This might slow down the program or even the whole
       system.

FILES
       ~/.linkchecker/linkcheckerrc - default configuration file
       ~/.linkchecker/blacklist - default blacklist logger output filename
       linkchecker-out.TYPE - default logger file output name
       http://docs.python.org/library/codecs.html#standard-encodings -
       valid output encodings
       http://docs.python.org/howto/regex.html - regular expression
       documentation

SEE ALSO
       linkcheckerrc(5)

AUTHOR
       Bastian Kleineidam <bastian.kleineidam@web.de>

COPYRIGHT
       Copyright © 2000-2014 Bastian Kleineidam



LinkChecker                       2010-07-01                      LINKCHECKER(1)