LINKCHECKER(1)           LinkChecker commandline usage           LINKCHECKER(1)


NAME
       linkchecker - check HTML documents and websites for broken links

SYNOPSIS
       linkchecker [options] [file-or-url]...

DESCRIPTION
       LinkChecker features recursive checking, multithreading, output in
       colored or normal text, HTML, SQL, CSV or a sitemap graph in GML or
       XML, support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet
       and local file links, restriction of link checking with regular
       expression filters for URLs, proxy support, username/password
       authorization for HTTP and FTP, robots.txt exclusion protocol
       support, i18n support, a command line interface and a (Fast)CGI web
       interface (requires an HTTP server).

EXAMPLES
       The most common use checks the given domain recursively, plus any URL
       pointing outside of the domain:
         linkchecker http://www.example.net/
       Beware that this checks the whole site, which can have thousands of
       URLs. Use the -r option to restrict the recursion depth.
       Don't connect to mailto: hosts, only check their URL syntax. All
       other links are checked as usual:
         linkchecker --ignore-url=^mailto: mysite.example.org
       Checking a local HTML file on Unix:
         linkchecker ../bla.html
       Checking a local HTML file on Windows:
         linkchecker c:\temp\test.html
       You can skip the http:// URL part if the domain starts with www.:
         linkchecker www.example.com
       You can skip the ftp:// URL part if the domain starts with ftp.:
         linkchecker -r0 ftp.example.org
       Generate a sitemap graph and convert it with the graphviz dot
       utility:
         linkchecker -odot -v www.example.com | dot -Tps > sitemap.ps

OPTIONS
   General options
       -h, --help
              Help me! Print usage information for this program.

       -fFILENAME, --config=FILENAME
              Use FILENAME as configuration file. By default LinkChecker
              first searches /etc/linkchecker/linkcheckerrc and then
              ~/.linkchecker/linkcheckerrc.

       -tNUMBER, --threads=NUMBER
              Generate no more than the given number of threads. The default
              number of threads is 10. To disable threading, specify a
              non-positive number.

       -V, --version
              Print version and exit.

       --stdin
              Read a list of white-space separated URLs to check from stdin.

   Output options
       -v, --verbose
              Log all checked URLs once. Default is to log only errors and
              warnings.

       --complete
              Log all URLs, including duplicates. Default is to log
              duplicate URLs only once.

       --no-warnings
              Don't log warnings. Default is to log warnings.

       -WREGEX, --warning-regex=REGEX
              Define a regular expression which prints a warning if it
              matches any content of the checked link. This applies only to
              valid pages, so we can get their content.
              Use this to check for pages that contain some form of error,
              for example "This page has moved" or "Oracle Application
              Server error".

       --warning-size-bytes=NUMBER
              Print a warning if content size info is available and exceeds
              the given number of bytes.

       --check-html
              Check syntax of HTML URLs with a local library (HTML tidy).

       --check-html-w3
              Check syntax of HTML URLs with the W3C online validator.

       --check-css
              Check syntax of CSS URLs with a local library (cssutils).

       --check-css-w3
              Check syntax of CSS URLs with the W3C online validator.

       --scan-virus
              Scan content of URLs for viruses with ClamAV.

       -q, --quiet
              Quiet operation, an alias for -o none. This is only useful
              with -F.

       -oTYPE[/ENCODING], --output=TYPE[/ENCODING]
              Specify the output type as text, html, sql, csv, gml, dot,
              xml, none or blacklist. Default type is text. The various
              output types are documented below.
              The ENCODING specifies the output encoding; the default is
              that of your locale. Valid encodings are listed at
              http://docs.python.org/library/codecs.html#standard-encodings.

       -FTYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]
              Output to a file linkchecker-out.TYPE,
              $HOME/.linkchecker/blacklist for blacklist output, or FILENAME
              if specified. The ENCODING specifies the output encoding; the
              default is that of your locale. Valid encodings are listed at
              http://docs.python.org/library/codecs.html#standard-encodings.
              For the none output type, the FILENAME and ENCODING parts are
              ignored. Otherwise, if the file already exists, it will be
              overwritten. You can specify this option more than once.
              Valid file output types are text, html, sql, csv, gml, dot,
              xml, none or blacklist. Default is no file output. The
              various output types are documented below. Note that you can
              suppress all console output with the option -o none.

       --no-status
              Do not print check status messages.

       -DSTRING, --debug=STRING
              Print debugging output for the given logger. Available
              loggers are cmdline, checking, cache, gui, dns and all.
              Specifying all is an alias for specifying all available
              loggers. The option can be given multiple times to debug with
              more than one logger. For accurate results, threading will be
              disabled during debug runs.

       --trace
              Print tracing information.

       --profile
              Write profiling data into a file named linkchecker.prof in
              the current working directory. See also --viewprof.

       --viewprof
              Print out previously generated profiling data. See also
              --profile.

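       The ENCODING argument of -o and -F is a Python codec name, so any
       alias Python recognizes works. A quick way to test whether a name is
       valid (a sketch; LinkChecker itself only needs the name on the
       command line):

       ```python
       import codecs

       # codecs.lookup() raises LookupError for unknown names and
       # normalizes aliases, e.g. "latin1" and "iso-8859-1" are the
       # same codec.
       print(codecs.lookup("latin1").name)   # iso8859-1
       print(codecs.lookup("utf-8").name)    # utf-8
       ```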
   Checking options
       -rNUMBER, --recursion-level=NUMBER
              Check recursively all links up to the given depth. A negative
              depth enables infinite recursion. Default depth is infinite.

       --no-follow-url=REGEX
              Check but do not recurse into URLs matching the given regular
              expression.
              This option can be given multiple times.

       --ignore-url=REGEX
              Only check the syntax of URLs matching the given regular
              expression.
              This option can be given multiple times.

       -C, --cookies
              Accept and send HTTP cookies according to RFC 2109. Only
              cookies which are sent back to the originating server are
              accepted. Sent and accepted cookies are provided as
              additional logging information.

       --cookiefile=FILENAME
              Read a file with initial cookie data. The cookie data format
              is explained below.

       -a, --anchors
              Check HTTP anchor references. Default is not to check
              anchors. This option enables logging of the warning
              url-anchor-not-found.

       -uSTRING, --user=STRING
              Try the given username for HTTP and FTP authorization. For
              FTP the default username is anonymous. For HTTP there is no
              default username. See also -p.

       -p, --password
              Read a password from the console and use it for HTTP and FTP
              authorization. For FTP the default password is anonymous@.
              For HTTP there is no default password. See also -u.

       --timeout=NUMBER
              Set the timeout for connection attempts in seconds. The
              default timeout is 60 seconds.

       -PNUMBER, --pause=NUMBER
              Pause the given number of seconds between two subsequent
              connection requests to the same host. Default is no pause
              between requests.

       -NSTRING, --nntp-server=STRING
              Specify an NNTP server for news: links. Default is the
              environment variable NNTP_SERVER. If no host is given, only
              the syntax of the link is checked.


CONFIGURATION FILES
       Configuration files can specify all options above. They can also
       specify some options that cannot be set on the command line. See
       linkcheckerrc(5) for more info.

OUTPUT TYPES
       Note that by default only errors and warnings are logged. You should
       use the --verbose option to get the complete URL list, especially
       when outputting a sitemap graph format.

       text   Standard text logger, logging URLs in keyword: argument
              fashion.

       html   Log URLs in keyword: argument fashion, formatted as HTML.
              Additionally has links to the referenced pages. Invalid URLs
              have HTML and CSS syntax check links appended.

       csv    Log check result in CSV format with one URL per line.

       gml    Log parent-child relations between linked URLs as a GML
              sitemap graph.

       dot    Log parent-child relations between linked URLs as a DOT
              sitemap graph.

       gxml   Log check result as a GraphXML sitemap graph.

       xml    Log check result as machine-readable XML.

       sql    Log check result as an SQL script with INSERT commands. An
              example script to create the initial SQL table is included as
              create.sql.

       blacklist
              Suitable for cron jobs. Logs the check result into a file
              ~/.linkchecker/blacklist which only contains entries with
              invalid URLs and the number of times they have failed.

       none   Logs nothing. Suitable for debugging or checking the exit
              code.

REGULAR EXPRESSIONS
       LinkChecker accepts Python regular expressions. See
       http://docs.python.org/howto/regex.html for an introduction.

       An addition is that a leading exclamation mark negates the regular
       expression.
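       The leading-negation rule can be illustrated with a small Python
       sketch. The helper matches_filter is hypothetical, not part of
       LinkChecker's API; it only demonstrates the semantics described
       above:

       ```python
       import re

       def matches_filter(pattern: str, url: str) -> bool:
           """Apply a filter pattern; a leading '!' negates the match.

           A sketch of the rule described above, not LinkChecker's code.
           """
           negate = pattern.startswith("!")
           if negate:
               pattern = pattern[1:]
           found = re.search(pattern, url) is not None
           return not found if negate else found

       # '^mailto:' matches mailto links; '!^mailto:' matches all others.
       print(matches_filter("^mailto:", "mailto:me@example.com"))  # True
       print(matches_filter("!^mailto:", "http://example.com/"))   # True
       ```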

COOKIE FILES
       A cookie file contains standard RFC 822 header data with the
       following possible names:

       Scheme (optional)
              Sets the scheme the cookies are valid for; default scheme is
              http.

       Host (required)
              Sets the domain the cookies are valid for.

       Path (optional)
              Gives the path the cookies are valid for; default path is /.

       Set-cookie (optional)
              Set cookie name/value. Can be given more than once.

       Multiple entries are separated by a blank line. The example below
       will send two cookies to all URLs starting with
       http://example.com/hello/ and one to all URLs starting with
       https://example.org/:

        Host: example.com
        Path: /hello
        Set-cookie: ID="smee"
        Set-cookie: spam="egg"

        Scheme: https
        Host: example.org
        Set-cookie: baggage="elitist"; comment="hologram"
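       Because each entry is plain RFC 822-style header data separated by
       blank lines, the format can be read with Python's standard
       email.parser. This is only a sketch of the format, not LinkChecker's
       own parser:

       ```python
       from email.parser import Parser

       COOKIE_FILE = """Host: example.com
       Path: /hello
       Set-cookie: ID="smee"
       Set-cookie: spam="egg"

       Scheme: https
       Host: example.org
       Set-cookie: baggage="elitist"; comment="hologram"
       """

       def parse_cookie_file(text):
           """Split blank-line separated entries and read each one as
           RFC 822 headers, applying the defaults documented above."""
           # Strip the per-line indentation of the embedded sample text.
           text = "\n".join(line.strip() for line in text.splitlines())
           entries = []
           for block in text.split("\n\n"):
               if not block.strip():
                   continue
               msg = Parser().parsestr(block.strip())
               entries.append({
                   "scheme": msg.get("Scheme", "http"),  # default: http
                   "host": msg["Host"],                  # required
                   "path": msg.get("Path", "/"),         # default: /
                   "cookies": msg.get_all("Set-cookie", []),
               })
           return entries

       for entry in parse_cookie_file(COOKIE_FILE):
           print(entry["scheme"], entry["host"], entry["cookies"])
       ```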

PROXY SUPPORT
       To use a proxy on Unix or Windows set the $http_proxy, $https_proxy
       or $ftp_proxy environment variables to the proxy URL. The URL should
       be of the form http://[user:pass@]host[:port]. LinkChecker also
       detects manual proxy settings of Internet Explorer under Windows
       systems. On a Mac use the Internet Config to select a proxy. You can
       also set a comma-separated domain list in the $no_proxy environment
       variable to ignore any proxy settings for these domains. Setting an
       HTTP proxy on Unix for example looks like this:

        export http_proxy="http://proxy.example.com:8080"

       Proxy authentication is also supported:

        export http_proxy="http://user1:mypass@proxy.example.org:8081"

       Setting a proxy on the Windows command prompt:

        set http_proxy=http://proxy.example.com:8080


PERFORMED CHECKS
       All URLs have to pass a preliminary syntax test. Minor quoting
       mistakes will issue a warning; all other invalid syntax issues are
       errors. After the syntax check passes, the URL is queued for
       connection checking. All connection check types are described below.

       HTTP links (http:, https:)
              After connecting to the given HTTP server the given path or
              query is requested. All redirections are followed, and if
              user/password is given it will be used as authorization when
              necessary. Permanently moved pages issue a warning. All final
              HTTP status codes other than 2xx are errors. HTML page
              contents are checked for recursion.

       Local files (file:)
              A regular, readable file that can be opened is valid. A
              readable directory is also valid. All other files, for
              example device files, unreadable or non-existing files, are
              errors. HTML or other parseable file contents are checked
              for recursion.

       Mail links (mailto:)
              A mailto: link eventually resolves to a list of email
              addresses. If one address fails, the whole list will fail.
              For each mail address we check the following things:
              1) Check the address syntax, both the part before and the
                 part after the @ sign.
              2) Look up the MX DNS records. If no MX record is found,
                 print an error.
              3) Check if one of the mail hosts accepts an SMTP
                 connection. Check hosts with higher priority first. If no
                 host accepts SMTP, we print a warning.
              4) Try to verify the address with the VRFY command. If we
                 get an answer, print the verified address as an info.
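       Step 1 above, the syntax check, can be sketched in Python. The
       regular expression here is a deliberately simplified stand-in, far
       looser than LinkChecker's real address checks, and steps 2 to 4 would
       additionally need a DNS resolver and an SMTP client:

       ```python
       import re

       # Simplified address syntax: one '@', a non-empty local part, and a
       # dotted domain. The real address grammar (RFC 2822) is stricter.
       ADDRESS_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

       def check_address_syntax(address: str) -> bool:
           """Step 1 of the mailto: check: validate the parts before and
           after the '@' sign (simplified sketch)."""
           return ADDRESS_RE.match(address) is not None

       print(check_address_syntax("calvin@users.sourceforge.net"))  # True
       print(check_address_syntax("no-at-sign.example.org"))        # False
       ```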

       FTP links (ftp:)
              For FTP links we do:
              1) connect to the specified host
              2) try to log in with the given user and password. The
                 default user is anonymous, the default password is
                 anonymous@.
              3) try to change to the given directory
              4) list the file with the NLST command
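       The four FTP steps map directly onto Python's standard ftplib; the
       sketch below is an illustration of the sequence, not LinkChecker's
       implementation, and the host and path in the usage comment are
       hypothetical:

       ```python
       from ftplib import FTP, error_perm

       def check_ftp_link(host, path, user="anonymous",
                          password="anonymous@"):
           """Perform the four steps above: connect, log in, change
           directory, and list the target with NLST."""
           ftp = FTP(host, timeout=60)          # 1) connect to the host
           try:
               ftp.login(user, password)        # 2) log in (anonymous)
               directory, _, name = path.rpartition("/")
               if directory:
                   ftp.cwd(directory)           # 3) change directory
               return name in ftp.nlst()        # 4) list with NLST
           except error_perm:
               return False
           finally:
               ftp.close()

       # Example (would open a real connection):
       # check_ftp_link("ftp.example.org", "pub/README")
       ```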

       Telnet links (telnet:)
              We try to connect and, if user/password are given, log in to
              the given telnet server.

       NNTP links (news:, snews:, nntp:)
              We try to connect to the given NNTP server. If a news group
              or article is specified, try to request it from the server.

       Ignored links (javascript:, etc.)
              An ignored link will only print a warning. No further
              checking will be made.

              Here is a complete list of recognized, but ignored links. The
              most prominent of them are JavaScript links.

              - acap: (application configuration access protocol)
              - afs: (Andrew File System global file names)
              - chrome: (Mozilla specific)
              - cid: (content identifier)
              - clsid: (Microsoft specific)
              - data: (data)
              - dav: (dav)
              - fax: (fax)
              - find: (Mozilla specific)
              - gopher: (Gopher)
              - imap: (internet message access protocol)
              - isbn: (international book numbers)
              - javascript: (JavaScript)
              - ldap: (Lightweight Directory Access Protocol)
              - mailserver: (access to data available from mail servers)
              - mid: (message identifier)
              - mms: (multimedia stream)
              - modem: (modem)
              - nfs: (network file system protocol)
              - opaquelocktoken: (opaquelocktoken)
              - pop: (Post Office Protocol v3)
              - prospero: (Prospero Directory Service)
              - rsync: (rsync protocol)
              - rtsp: (real time streaming protocol)
              - service: (service location)
              - shttp: (secure HTTP)
              - sip: (session initiation protocol)
              - tel: (telephone)
              - tip: (Transaction Internet Protocol)
              - tn3270: (interactive 3270 emulation sessions)
              - vemmi: (versatile multimedia interface)
              - wais: (Wide Area Information Servers)
              - z39.50r: (Z39.50 Retrieval)
              - z39.50s: (Z39.50 Session)


RECURSION
       Before a URL is recursively descended into, it has to fulfill
       several conditions. They are checked in this order:

       1. A URL must be valid.

       2. A URL must be parseable. This currently includes HTML files,
          Opera bookmarks files, and directories. If a file type cannot be
          determined (for example it does not have a common HTML file
          extension, and the content does not look like HTML), it is
          assumed to be non-parseable.

       3. The URL content must be retrievable. This is usually the case
          except for example mailto: or unknown URL types.

       4. The maximum recursion level must not be exceeded. It is
          configured with the --recursion-level option and is unlimited by
          default.

       5. It must not match the ignored URL list. This is controlled with
          the --ignore-url option.

       6. The Robots Exclusion Protocol must allow links in the URL to be
          followed recursively. This is checked by searching for a
          "nofollow" directive in the HTML header data.

       Note that the directory recursion reads all files in that directory,
       not just a subset like index.htm*.
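       The ordered conditions can be sketched as a short Python predicate.
       Every helper name here (is_valid, is_parseable, and so on) is
       hypothetical; the sketch only illustrates the order of the tests:

       ```python
       import re

       def should_recurse(url, depth, max_depth, ignore_patterns, checks):
           """Apply the six recursion conditions in the order listed above.
           `checks` maps condition names to hypothetical predicates."""
           if not checks["is_valid"](url):                # 1. valid URL
               return False
           if not checks["is_parseable"](url):            # 2. parseable
               return False
           if not checks["is_retrievable"](url):          # 3. retrievable
               return False
           if max_depth >= 0 and depth > max_depth:       # 4. depth limit
               return False                               #    (<0: infinite)
           if any(re.search(p, url) for p in ignore_patterns):
               return False                               # 5. ignore list
           if not checks["robots_allow"](url):            # 6. robots/nofollow
               return False
           return True

       # Stub predicates that accept everything, to exercise the order:
       allow_all = {name: (lambda u: True)
                    for name in ("is_valid", "is_parseable",
                                 "is_retrievable", "robots_allow")}
       print(should_recurse("http://example.com/", 1, -1,
                            [r"^mailto:"], allow_all))  # True
       ```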

NOTES
       URLs on the command line starting with ftp. are treated like
       ftp://ftp., URLs starting with www. are treated like http://www..
       You can also give local files as arguments.

       If you have your system configured to automatically establish a
       connection to the internet (e.g. with diald), it will connect when
       checking links not pointing to your local host. Use the -s and -i
       options to prevent this.

       Javascript links are currently ignored.

       If your platform does not support threading, LinkChecker disables
       it automatically.

       You can supply multiple user/password pairs in a configuration
       file.

       When checking news: links, the given NNTP host doesn't need to be
       the same as the host of the user browsing your pages.

ENVIRONMENT
       NNTP_SERVER - specifies default NNTP server
       http_proxy - specifies default HTTP proxy server
       ftp_proxy - specifies default FTP proxy server
       no_proxy - comma-separated list of domains to not contact over a
       proxy server
       LC_MESSAGES, LANG, LANGUAGE - specify output language

RETURN VALUE
       The return value is non-zero when

       · invalid links were found, or

       · link warnings were found and warnings are enabled, or

       · a program error occurred.

LIMITATIONS
       LinkChecker consumes memory for each queued URL to check. With
       thousands of queued URLs the amount of consumed memory can become
       quite large. This might slow down the program or even the whole
       system.

FILES
       /etc/linkchecker/linkcheckerrc, ~/.linkchecker/linkcheckerrc -
       default configuration files
       ~/.linkchecker/blacklist - default blacklist logger output filename
       linkchecker-out.TYPE - default logger file output name
       http://docs.python.org/library/codecs.html#standard-encodings -
       valid output encodings
       http://docs.python.org/howto/regex.html - regular expression
       documentation


SEE ALSO
       linkcheckerrc(5)

AUTHOR
       Bastian Kleineidam <calvin@users.sourceforge.net>

COPYRIGHT
       Copyright © 2000-2011 Bastian Kleineidam



LinkChecker                      2010-07-01                     LINKCHECKER(1)