LINKCHECKER(1)                    LinkChecker                   LINKCHECKER(1)

NAME
       linkchecker - command line client to check HTML documents and websites
       for broken links

SYNOPSIS
       linkchecker [options] [file-or-url]...

DESCRIPTION
       LinkChecker features

       • recursive and multithreaded checking

       • output in colored or normal text, HTML, SQL, CSV, XML or a sitemap
         graph in different formats

       • support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and
         local file links

       • restriction of link checking with URL filters

       • proxy support

       • username/password authorization for HTTP, FTP and Telnet

       • support for robots.txt exclusion protocol

       • support for Cookies

       • support for HTML5

       • Antivirus check

       • a command line and web interface

EXAMPLE
       The most common use checks the given domain recursively:

          $ linkchecker http://www.example.com/

       Beware that this checks the whole site, which can have thousands of
       URLs. Use the -r option to restrict the recursion depth.

       Don't check URLs whose name contains /secret; all other links are
       checked as usual:

          $ linkchecker --ignore-url=/secret mysite.example.com

       Checking a local HTML file on Unix:

          $ linkchecker ../bla.html

       Checking a local HTML file on Windows:

          C:\> linkchecker c:\temp\test.html

       You can skip the http:// URL part if the domain starts with www.:

          $ linkchecker www.example.com

       You can skip the ftp:// URL part if the domain starts with ftp.:

          $ linkchecker -r0 ftp.example.com

       Generate a sitemap graph and convert it with the graphviz dot
       utility:

          $ linkchecker -odot -v www.example.com | dot -Tps > sitemap.ps

OPTIONS
   General options
       -f FILENAME, --config=FILENAME
              Use FILENAME as configuration file. By default LinkChecker
              uses $XDG_CONFIG_HOME/linkchecker/linkcheckerrc.

       -h, --help
              Help me! Print usage information for this program.

       -t NUMBER, --threads=NUMBER
              Generate no more than the given number of threads. Default
              number of threads is 10. To disable threading specify a
              non-positive number.

       -V, --version
              Print version and exit.

       --list-plugins
              Print available check plugins and exit.

   Output options
       URL checking results
       -F TYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]
              Output to a file linkchecker-out.TYPE,
              $XDG_DATA_HOME/linkchecker/failures for the failures output
              type, or FILENAME if specified. The ENCODING specifies the
              output encoding; the default is that of your locale. Valid
              encodings are listed at
              https://docs.python.org/library/codecs.html#standard-encodings.
              The FILENAME and ENCODING parts of the none output type will
              be ignored; otherwise, if the file already exists, it will be
              overwritten. You can specify this option more than once.
              Valid file output TYPEs are text, html, sql, csv, gml, dot,
              xml, sitemap, none or failures. Default is no file output.
              The various output types are documented below. Note that you
              can suppress all console output with the option -o none.

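              For example, following the documented TYPE[/ENCODING][/FILENAME]
              syntax, this hypothetical invocation writes UTF-8 encoded CSV
              results to a file results.csv in addition to the normal
              console output:

                 $ linkchecker -Fcsv/utf_8/results.csv http://www.example.com/
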
       --no-warnings
              Don't log warnings. Default is to log warnings.

       -o TYPE[/ENCODING], --output=TYPE[/ENCODING]
              Specify the console output type as text, html, sql, csv, gml,
              dot, xml, sitemap, none or failures. Default type is text.
              The various output types are documented below. The ENCODING
              specifies the output encoding; the default is that of your
              locale. Valid encodings are listed at
              https://docs.python.org/library/codecs.html#standard-encodings.

       -v, --verbose
              Log all checked URLs. Default is to log only errors and
              warnings.

       Progress updates
       --no-status
              Do not print URL check status messages.

       Application
       -D STRING, --debug=STRING
              Print debugging output for the given logger. Available debug
              loggers are cmdline, checking, cache, plugin and all. all is
              an alias for all available loggers. This option can be given
              multiple times to debug with more than one logger.

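              Since the option can be given multiple times, a hypothetical
              debugging run with two loggers could look like this:

                 $ linkchecker -Dcmdline -Dchecking http://www.example.com/
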
       Quiet
       -q, --quiet
              Quiet operation, an alias for -o none that also hides
              application information messages. This is only useful with
              -F, else no results will be output.

   Checking options
       --cookiefile=FILENAME
              Use initial cookie data read from a file. The cookie data
              format is explained below.

       --check-extern
              Check external URLs.

       --ignore-url=REGEX
              URLs matching the given regular expression will only be
              syntax checked. This option can be given multiple times. See
              section REGULAR EXPRESSIONS for more info.

       -N STRING, --nntp-server=STRING
              Specify an NNTP server for news: links. Default is the
              environment variable NNTP_SERVER. If no host is given, only
              the syntax of the link is checked.

       --no-follow-url=REGEX
              Check but do not recurse into URLs matching the given regular
              expression. This option can be given multiple times. See
              section REGULAR EXPRESSIONS for more info.

       --no-robots
              Check URLs regardless of any robots.txt files.

       -p, --password
              Read a password from console and use it for HTTP and FTP
              authorization. For FTP the default password is anonymous@.
              For HTTP there is no default password. See also -u.

       -r NUMBER, --recursion-level=NUMBER
              Check recursively all links up to given depth. A negative
              depth will enable infinite recursion. Default depth is
              infinite.

       --timeout=NUMBER
              Set the timeout for connection attempts in seconds. The
              default timeout is 60 seconds.

       -u STRING, --user=STRING
              Try the given username for HTTP and FTP authorization. For
              FTP the default username is anonymous. For HTTP there is no
              default username. See also -p.

       --user-agent=STRING
              Specify the User-Agent string to send to the HTTP server, for
              example "Mozilla/4.0". The default is "LinkChecker/X.Y" where
              X.Y is the current version of LinkChecker.

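       As a sketch of how these options combine, the following hypothetical
       invocation checks two levels deep, only syntax-checks URLs
       containing /private/, and prompts for a password for the user
       webmaster:

          $ linkchecker -r2 --ignore-url=/private/ -u webmaster -p http://www.example.com/
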
   Input options
       --stdin
              Read from stdin a list of white-space separated URLs to
              check.

       FILE-OR-URL
              The location to start checking with. A file can be a simple
              list of URLs, one per line, if the first line is
              "# LinkChecker URL list".

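       A minimal URL list file, here a hypothetical urls.txt, could look
       like this:

          # LinkChecker URL list
          http://www.example.com/
          ftp://ftp.example.com/

       It can then be passed as an argument or piped in via --stdin:

          $ linkchecker urls.txt
          $ cat urls.txt | linkchecker --stdin
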
CONFIGURATION FILES
       Configuration files can specify all options above. They can also
       specify some options that cannot be set on the command line. See
       linkcheckerrc(5) for more info.

OUTPUT TYPES
       Note that by default only errors and warnings are logged. You should
       use the option --verbose to get the complete URL list, especially
       when outputting a sitemap graph format.

       text   Standard text logger, logging URLs in keyword: argument
              fashion.

       html   Log URLs in keyword: argument fashion, formatted as HTML.
              Additionally has links to the referenced pages. Invalid URLs
              have HTML and CSS syntax check links appended.

       csv    Log check result in CSV format with one URL per line.

       gml    Log parent-child relations between linked URLs as a GML
              sitemap graph.

       dot    Log parent-child relations between linked URLs as a DOT
              sitemap graph.

       gxml   Log check result as a GraphXML sitemap graph.

       xml    Log check result as machine-readable XML.

       sitemap
              Log check result as an XML sitemap whose protocol is
              documented at https://www.sitemaps.org/protocol.html.

       sql    Log check result as SQL script with INSERT commands. An
              example script to create the initial SQL table is included as
              create.sql.

       failures
              Suitable for cron jobs. Logs the check result into a file
              $XDG_DATA_HOME/linkchecker/failures which only contains
              entries with invalid URLs and the number of times they have
              failed.

       none   Logs nothing. Suitable for debugging or checking the exit
              code.

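       For example, a cron job could combine the failures file output with
       quiet console operation; the daily schedule shown is hypothetical:

          0 4 * * * linkchecker -Ffailures -q http://www.example.com/
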
REGULAR EXPRESSIONS
       LinkChecker accepts Python regular expressions. See
       https://docs.python.org/howto/regex.html for an introduction. An
       addition is that a leading exclamation mark negates the regular
       expression.

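       For example, giving --ignore-url a hypothetical negated expression
       restricts full checks to a single host; every URL not matching the
       expression is only syntax checked:

          $ linkchecker --ignore-url='!^https?://www\.example\.com/' http://www.example.com/
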
COOKIE FILES
       A cookie file contains standard HTTP header (RFC 2616) data with the
       following possible names:

       Host (required)
              Sets the domain the cookies are valid for.

       Path (optional)
              Gives the path the cookies are valid for; default path is /.

       Set-cookie (required)
              Set cookie name/value. Can be given more than once.

       Multiple entries are separated by a blank line. The example below
       will send two cookies to all URLs starting with
       http://example.com/hello/ and one to all URLs starting with
       https://example.org/:

          Host: example.com
          Path: /hello
          Set-cookie: ID="smee"
          Set-cookie: spam="egg"

          Host: example.org
          Set-cookie: baggage="elitist"; comment="hologram"

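       Assuming the entries above are stored in a hypothetical file
       cookies.txt, they are sent during checking with:

          $ linkchecker --cookiefile=cookies.txt http://example.com/hello/
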
PROXY SUPPORT
       To use a proxy on Unix or Windows set the http_proxy or https_proxy
       environment variables to the proxy URL. The URL should be of the
       form http://[user:pass@]host[:port]. LinkChecker also detects manual
       proxy settings of Internet Explorer under Windows systems. On a Mac
       use the Internet Config to select a proxy. You can also set a
       comma-separated domain list in the no_proxy environment variable to
       ignore any proxy settings for these domains. The curl_ca_bundle
       environment variable can be used to identify an alternative
       certificate bundle to be used with an HTTPS proxy.

       Setting an HTTP proxy on Unix for example looks like this:

          $ export http_proxy="http://proxy.example.com:8080"

       Proxy authentication is also supported:

          $ export http_proxy="http://user1:mypass@proxy.example.org:8081"

       Setting a proxy on the Windows command prompt:

          C:\> set http_proxy=http://proxy.example.com:8080

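       Bypassing the proxy for selected domains works the same way; the
       domain list shown is hypothetical:

          $ export no_proxy="localhost,intranet.example.com"
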
PERFORMED CHECKS
       All URLs have to pass a preliminary syntax test. Minor quoting
       mistakes will issue a warning; all other invalid syntax issues are
       errors. After the syntax check passes, the URL is queued for
       connection checking. All connection check types are described below.

       HTTP links (http:, https:)
              After connecting to the given HTTP server the given path or
              query is requested. All redirections are followed, and if
              user/password is given it will be used as authorization when
              necessary. All final HTTP status codes other than 2xx are
              errors.

              HTML page contents are checked for recursion.

       Local files (file:)
              A regular, readable file that can be opened is valid. A
              readable directory is also valid. All other files, for
              example device files, unreadable or non-existing files are
              errors.

              HTML or other parseable file contents are checked for
              recursion.

       Mail links (mailto:)
              A mailto: link eventually resolves to a list of email
              addresses. If one address fails, the whole list will fail.
              For each mail address we check the following things:

              1. Check the address syntax, both the parts before and after
                 the @ sign.

              2. Look up the MX DNS records. If no MX record is found,
                 print an error.

              3. Check if one of the mail hosts accepts an SMTP connection.
                 Check hosts with higher priority first. If no host accepts
                 SMTP, we print a warning.

              4. Try to verify the address with the VRFY command. If we get
                 an answer, print the verified address as an info.

       FTP links (ftp:)
              For FTP links we do:

              1. connect to the specified host

              2. try to log in with the given user and password. The
                 default user is anonymous, the default password is
                 anonymous@.

              3. try to change to the given directory

              4. list the file with the NLST command

       Telnet links (telnet:)
              We try to connect and, if user/password are given, log in to
              the given telnet server.

       NNTP links (news:, snews:, nntp:)
              We try to connect to the given NNTP server. If a news group
              or article is specified, try to request it from the server.

       Unsupported links (javascript:, etc.)
              An unsupported link will only print a warning. No further
              checking will be made.

              The complete list of recognized, but unsupported links can be
              found in the linkcheck/checker/unknownurl.py source file. The
              most prominent of them should be JavaScript links.

SITEMAPS
       Sitemaps are parsed for links to check and can be detected either
       from a sitemap entry in a robots.txt, or when passed as a
       FILE-OR-URL argument, in which case detection requires the
       urlset/sitemapindex tag to be within the first 70 characters of the
       sitemap. Compressed sitemap files are not supported.

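       For example, a sitemap at a hypothetical location can be checked
       directly by passing its URL as an argument:

          $ linkchecker http://www.example.com/sitemap.xml
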
PLUGINS
       There are two plugin types: connection and content plugins.
       Connection plugins are run after a successful connection to the URL
       host. Content plugins are run if the URL type has content (mailto:
       URLs have no content for example) and if the check is not forbidden
       (i.e. by HTTP robots.txt). Use the option --list-plugins for a list
       of plugins and their documentation. All plugins are enabled via the
       linkcheckerrc(5) configuration file.

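       To see which plugins are available in your installation:

          $ linkchecker --list-plugins
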
RECURSION
       Before descending recursively into a URL, it has to fulfill several
       conditions. They are checked in this order:

       1. A URL must be valid.

       2. A URL must be parseable. This currently includes HTML files,
          Opera bookmarks files, and directories. If a file type cannot be
          determined (for example it does not have a common HTML file
          extension, and the content does not look like HTML), it is
          assumed to be non-parseable.

       3. The URL content must be retrievable. This is usually the case
          except for example mailto: or unknown URL types.

       4. The maximum recursion level must not be exceeded. It is
          configured with the --recursion-level option and is unlimited by
          default.

       5. It must not match the ignored URL list. This is controlled with
          the --ignore-url option.

       6. The Robots Exclusion Protocol must allow links in the URL to be
          followed recursively. This is checked by searching for a
          "nofollow" directive in the HTML header data.

       Note that the directory recursion reads all files in that directory,
       not just a subset like index.htm.

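       The recursion depth and URL filters map onto command line options;
       for example, this hypothetical invocation limits the depth and keeps
       the checker from recursing into an archive area while still checking
       the archive URLs themselves:

          $ linkchecker -r3 --no-follow-url=/archive/ http://www.example.com/
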
NOTES
       URLs on the command line starting with ftp. are treated like
       ftp://ftp., and URLs starting with www. are treated like
       http://www.. You can also give local files as arguments. If you have
       your system configured to automatically establish a connection to
       the internet (e.g. with diald), it will connect when checking links
       not pointing to your local host. Use the --ignore-url option to
       prevent this.

       Javascript links are not supported.

       If your platform does not support threading, LinkChecker disables it
       automatically.

       You can supply multiple user/password pairs in a configuration file.

       When checking news: links the given NNTP host doesn't need to be the
       same as the host of the user browsing your pages.

ENVIRONMENT
       NNTP_SERVER
              specifies default NNTP server

       http_proxy
              specifies default HTTP proxy server

       https_proxy
              specifies default HTTPS proxy server

       curl_ca_bundle
              an alternative certificate bundle to be used with an HTTPS
              proxy

       no_proxy
              comma-separated list of domains to not contact over a proxy
              server

       LC_MESSAGES, LANG, LANGUAGE
              specify output language

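       For example, a default NNTP server for news: links can be set in the
       environment instead of passing -N; the host shown is hypothetical:

          $ export NNTP_SERVER=news.example.com
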
RETURN VALUE
       The return value is 2 when

       • a program error occurred.

       The return value is 1 when

       • invalid links were found or

       • link warnings were found and warnings are enabled.

       Else the return value is zero.

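       A hypothetical POSIX shell snippet that distinguishes the two
       failure codes:

          linkchecker -q -Ffailures http://www.example.com/
          case $? in
              1) echo "broken links or warnings found" ;;
              2) echo "program error" ;;
          esac
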
LIMITATIONS
       LinkChecker consumes memory for each queued URL to check. With
       thousands of queued URLs the amount of consumed memory can become
       quite large. This might slow down the program or even the whole
       system.

FILES
       $XDG_CONFIG_HOME/linkchecker/linkcheckerrc - default configuration
       file

       $XDG_DATA_HOME/linkchecker/failures - default failures logger output
       filename

       linkchecker-out.TYPE - default logger file output name

SEE ALSO
       linkcheckerrc(5)

       https://docs.python.org/library/codecs.html#standard-encodings -
       valid output encodings

       https://docs.python.org/howto/regex.html - regular expression
       documentation

AUTHOR
       Bastian Kleineidam <bastian.kleineidam@web.de>

COPYRIGHT
       2000-2016 Bastian Kleineidam, 2010-2022 LinkChecker Authors



10.1.0.post162+g614e84b5         October 31, 2022               LINKCHECKER(1)