LINKCHECKER(1)                   LinkChecker                   LINKCHECKER(1)


NAME
   linkchecker - command line client to check HTML documents and websites
   for broken links

SYNOPSIS
   linkchecker [options] [file-or-url]...

DESCRIPTION
   LinkChecker features

   • recursive and multithreaded checking

   • output in colored or normal text, HTML, SQL, CSV, XML or a sitemap
     graph in different formats

   • support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and
     local file links

   • restriction of link checking with URL filters

   • proxy support

   • username/password authorization for HTTP, FTP and Telnet

   • support for robots.txt exclusion protocol

   • support for Cookies

   • support for HTML5

   • Antivirus check

   • a command line and web interface

EXAMPLES
   The most common use checks the given domain recursively:

      $ linkchecker http://www.example.com/

   Beware that this checks the whole site, which can have thousands of
   URLs. Use the -r option to restrict the recursion depth.

   Don't check URLs with /secret in their name; all other links are
   checked as usual:

      $ linkchecker --ignore-url=/secret mysite.example.com

   Checking a local HTML file on Unix:

      $ linkchecker ../bla.html

   Checking a local HTML file on Windows:

      C:\> linkchecker c:\temp\test.html

   You can skip the http:// URL part if the domain starts with www.:

      $ linkchecker www.example.com

   You can skip the ftp:// URL part if the domain starts with ftp.:

      $ linkchecker -r0 ftp.example.com

   Generate a sitemap graph and convert it with the graphviz dot utility:

      $ linkchecker -odot -v www.example.com | dot -Tps > sitemap.ps

OPTIONS
   General options
   -f FILENAME, --config=FILENAME
          Use FILENAME as configuration file. By default LinkChecker
          uses ~/.linkchecker/linkcheckerrc.

   -h, --help
          Help me! Print usage information for this program.

   --stdin
          Read a list of white-space separated URLs to check from stdin.

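          For instance, the URL list for --stdin can come from another
          command; url-list.txt is a hypothetical file of white-space
          separated URLs:

```shell
# Feed a list of URLs to linkchecker on stdin
# (url-list.txt is a made-up file name for this sketch).
cat url-list.txt | linkchecker --stdin
```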
   -t NUMBER, --threads=NUMBER
          Generate no more than the given number of threads. The default
          number of threads is 10. To disable threading, specify a
          non-positive number.

   -V, --version
          Print version and exit.

   --list-plugins
          Print available check plugins and exit.

   Output options
   URL checking results
   -F TYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]
          Output to a file linkchecker-out.TYPE, $HOME/.linkchecker/failures
          for the failures output type, or FILENAME if specified. The
          ENCODING specifies the output encoding; the default is that of
          your locale. Valid encodings are listed at
          https://docs.python.org/library/codecs.html#standard-encodings.
          The FILENAME and ENCODING parts of the none output type are
          ignored; otherwise, if the file already exists, it will be
          overwritten. You can specify this option more than once. Valid
          file output TYPEs are text, html, sql, csv, gml, dot, xml,
          sitemap, none or failures. Default is no file output. The
          various output types are documented below. Note that you can
          suppress all console output with the option -o none.

   --no-warnings
          Don't log warnings. Default is to log warnings.

   -o TYPE[/ENCODING], --output=TYPE[/ENCODING]
          Specify the console output type as text, html, sql, csv, gml,
          dot, xml, sitemap, none or failures. Default type is text. The
          various output types are documented below. The ENCODING
          specifies the output encoding; the default is that of your
          locale. Valid encodings are listed at
          https://docs.python.org/library/codecs.html#standard-encodings.

   -v, --verbose
          Log all checked URLs. Default is to log only errors and
          warnings.

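          For example, the file and console output options above can be
          combined to save CSV results while silencing the console;
          results.csv is a hypothetical file name:

```shell
# Verbose check, CSV results written to results.csv in UTF-8,
# with all console output suppressed.
linkchecker -v -Fcsv/utf-8/results.csv -o none http://www.example.com/
```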
   Progress updates
   --no-status
          Do not print URL check status messages.

   Application
   -D STRING, --debug=STRING
          Print debugging output for the given logger. Available loggers
          are cmdline, checking, cache, dns, plugin and all. Specifying
          all is an alias for specifying all available loggers. The
          option can be given multiple times to debug with more than one
          logger. For accurate results, threading will be disabled
          during debug runs.

   Quiet
   -q, --quiet
          Quiet operation, an alias for -o none that also hides
          application information messages. This is only useful with -F,
          else no results will be output.

   Checking options
   --cookiefile=FILENAME
          Read a file with initial cookie data. The cookie data format
          is explained below.

   --check-extern
          Check external URLs.

   --ignore-url=REGEX
          URLs matching the given regular expression will only be syntax
          checked. This option can be given multiple times. See section
          REGULAR EXPRESSIONS for more info.

   -N STRING, --nntp-server=STRING
          Specify an NNTP server for news: links. Default is the
          environment variable NNTP_SERVER. If no host is given, only
          the syntax of the link is checked.

   --no-follow-url=REGEX
          Check but do not recurse into URLs matching the given regular
          expression. This option can be given multiple times. See
          section REGULAR EXPRESSIONS for more info.

   --no-robots
          Check URLs regardless of any robots.txt files.

   -p, --password
          Read a password from the console and use it for HTTP and FTP
          authorization. For FTP the default password is anonymous@. For
          HTTP there is no default password. See also -u.

   -r NUMBER, --recursion-level=NUMBER
          Check recursively all links up to the given depth. A negative
          depth enables infinite recursion. Default depth is infinite.

   --timeout=NUMBER
          Set the timeout for connection attempts in seconds. The
          default timeout is 60 seconds.

   -u STRING, --user=STRING
          Try the given username for HTTP and FTP authorization. For FTP
          the default username is anonymous. For HTTP there is no
          default username. See also -p.

   --user-agent=STRING
          Specify the User-Agent string to send to the HTTP server, for
          example "Mozilla/4.0". The default is "LinkChecker/X.Y" where
          X.Y is the current version of LinkChecker.

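   Several of the checking options above are commonly combined; the
   intranet host below is a made-up example:

```shell
# Limit recursion depth to 2, use a 30-second connection timeout,
# and authenticate as myuser (-p prompts for the password).
linkchecker -r2 --timeout=30 -u myuser -p http://intranet.example.com/
```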
CONFIGURATION FILES
   Configuration files can specify all options above. They can also
   specify some options that cannot be set on the command line. See
   linkcheckerrc(5) for more info.

OUTPUT TYPES
   Note that by default only errors and warnings are logged. You should
   use the option --verbose to get the complete URL list, especially
   when outputting a sitemap graph format.

   text   Standard text logger, logging URLs in keyword: argument
          fashion.

   html   Log URLs in keyword: argument fashion, formatted as HTML.
          Additionally has links to the referenced pages. Invalid URLs
          have HTML and CSS syntax check links appended.

   csv    Log check result in CSV format with one URL per line.

   gml    Log parent-child relations between linked URLs as a GML
          sitemap graph.

   dot    Log parent-child relations between linked URLs as a DOT
          sitemap graph.

   gxml   Log check result as a GraphXML sitemap graph.

   xml    Log check result as machine-readable XML.

   sitemap
          Log check result as an XML sitemap whose protocol is
          documented at https://www.sitemaps.org/protocol.html.

   sql    Log check result as an SQL script with INSERT commands. An
          example script to create the initial SQL table is included as
          create.sql.

   failures
          Suitable for cron jobs. Logs the check result into a file
          ~/.linkchecker/failures which only contains entries with
          invalid URLs and the number of times they have failed.

   none   Logs nothing. Suitable for debugging or checking the exit
          code.

REGULAR EXPRESSIONS
   LinkChecker accepts Python regular expressions. See
   https://docs.python.org/howto/regex.html for an introduction. An
   addition is that a leading exclamation mark negates the regular
   expression.

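   As a sketch of the negation syntax, the pattern below (illustrative
   only) ignores, i.e. merely syntax-checks, every URL that does not
   start with the site's own prefix:

```shell
# The leading "!" negates the pattern: URLs NOT matching the site's
# own prefix are only syntax checked, not fetched.
linkchecker --ignore-url='!^http://www\.example\.com/' http://www.example.com/
```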
COOKIE FILES
   A cookie file contains standard HTTP header (RFC 2616) data with the
   following possible names:

   Host (required)
          Sets the domain the cookies are valid for.

   Path (optional)
          Gives the path the cookies are valid for; default path is /.

   Set-cookie (required)
          Set cookie name/value. Can be given more than once.

   Multiple entries are separated by a blank line. The example below
   will send two cookies to all URLs starting with
   http://example.com/hello/ and one to all URLs starting with
   https://example.org/:

      Host: example.com
      Path: /hello
      Set-cookie: ID="smee"
      Set-cookie: spam="egg"

      Host: example.org
      Set-cookie: baggage="elitist"; comment="hologram"

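   A sketch of putting this together: save an entry to a file and pass
   it with --cookiefile (cookies.txt is a hypothetical file name):

```shell
# Write a one-entry cookie file, then use it for a check.
cat > cookies.txt <<'EOF'
Host: example.com
Path: /hello
Set-cookie: ID="smee"
EOF
linkchecker --cookiefile=cookies.txt http://example.com/hello/
```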
PROXY SUPPORT
   To use a proxy on Unix or Windows set the http_proxy or https_proxy
   environment variables to the proxy URL. The URL should be of the form
   http://[user:pass@]host[:port]. LinkChecker also detects manual proxy
   settings of Internet Explorer under Windows systems. On a Mac use the
   Internet Config to select a proxy. You can also set a comma-separated
   domain list in the no_proxy environment variable to ignore any proxy
   settings for these domains. The curl_ca_bundle environment variable
   can be used to identify an alternative certificate bundle to be used
   with an HTTPS proxy.

   Setting an HTTP proxy on Unix for example looks like this:

      $ export http_proxy="http://proxy.example.com:8080"

   Proxy authentication is also supported:

      $ export http_proxy="http://user1:mypass@proxy.example.org:8081"

   Setting a proxy on the Windows command prompt:

      C:\> set http_proxy=http://proxy.example.com:8080

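   The no_proxy exclusion mentioned above works the same way; the domain
   names here are illustrative:

```shell
# Bypass any configured proxy for these domains.
export no_proxy="localhost,intranet.example.com"
```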
PERFORMED CHECKS
   All URLs have to pass a preliminary syntax test. Minor quoting
   mistakes will issue a warning; all other invalid syntax issues are
   errors. After the syntax check passes, the URL is queued for
   connection checking. All connection check types are described below.

   HTTP links (http:, https:)
          After connecting to the given HTTP server the given path or
          query is requested. All redirections are followed, and if
          user/password is given it will be used as authorization when
          necessary. All final HTTP status codes other than 2xx are
          errors.

          HTML page contents are checked for recursion.

   Local files (file:)
          A regular, readable file that can be opened is valid. A
          readable directory is also valid. All other files, for example
          device files, unreadable or non-existing files, are errors.

          HTML or other parseable file contents are checked for
          recursion.

   Mail links (mailto:)
          A mailto: link eventually resolves to a list of email
          addresses. If one address fails, the whole list will fail. For
          each mail address we check the following things:

          1. Check the address syntax, both the parts before and after
             the @ sign.

          2. Look up the MX DNS records. If no MX record is found, print
             an error.

          3. Check if one of the mail hosts accepts an SMTP connection.
             Check hosts with higher priority first. If no host accepts
             SMTP, print a warning.

          4. Try to verify the address with the VRFY command. If an
             answer is received, print the verified address as an info.

   FTP links (ftp:)
          For FTP links we do:

          1. connect to the specified host

          2. try to login with the given user and password. The default
             user is anonymous, the default password is anonymous@.

          3. try to change to the given directory

          4. list the file with the NLST command

   Telnet links (telnet:)
          We try to connect and, if user/password are given, login to
          the given telnet server.

   NNTP links (news:, snews:, nntp:)
          We try to connect to the given NNTP server. If a news group or
          article is specified, try to request it from the server.

   Unsupported links (javascript:, etc.)
          An unsupported link will only print a warning. No further
          checking will be made.

          The complete list of recognized, but unsupported links can be
          found in the linkcheck/checker/unknownurl.py source file. The
          most prominent of them should be JavaScript links.

PLUGINS
   There are two plugin types: connection and content plugins.
   Connection plugins are run after a successful connection to the URL
   host. Content plugins are run if the URL type has content (mailto:
   URLs have no content, for example) and if the check is not forbidden
   (e.g. by HTTP robots.txt). Use the option --list-plugins for a list
   of plugins and their documentation. All plugins are enabled via the
   linkcheckerrc(5) configuration file.

RECURSION
   Before descending recursively into a URL, it has to fulfill several
   conditions. They are checked in this order:

   1. A URL must be valid.

   2. A URL must be parseable. This currently includes HTML files, Opera
      bookmarks files, and directories. If a file type cannot be
      determined (for example it does not have a common HTML file
      extension, and the content does not look like HTML), it is assumed
      to be non-parseable.

   3. The URL content must be retrievable. This is usually the case
      except for example mailto: or unknown URL types.

   4. The maximum recursion level must not be exceeded. It is configured
      with the --recursion-level option and is unlimited by default.

   5. It must not match the ignored URL list. This is controlled with
      the --ignore-url option.

   6. The Robots Exclusion Protocol must allow links in the URL to be
      followed recursively. This is checked by searching for a
      "nofollow" directive in the HTML header data.

   Note that the directory recursion reads all files in that directory,
   not just a subset like index.htm.

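   The two recursion filters differ in effect: --no-follow-url still
   checks matching URLs but does not descend into them, while
   --ignore-url matches are only syntax checked. A sketch with made-up
   patterns:

```shell
# Check blog pages without recursing into them; only syntax-check
# links to PDF files.
linkchecker --no-follow-url='^http://blog\.example\.com/' \
    --ignore-url='\.pdf$' http://www.example.com/
```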
NOTES
   URLs on the command line starting with ftp. are treated like
   ftp://ftp., and URLs starting with www. are treated like http://www..
   You can also give local files as arguments. If you have your system
   configured to automatically establish a connection to the internet
   (e.g. with diald), it will connect when checking links not pointing
   to your local host. Use the --ignore-url option to prevent this.

   JavaScript links are not supported.

   If your platform does not support threading, LinkChecker disables it
   automatically.

   You can supply multiple user/password pairs in a configuration file.

   When checking news: links the given NNTP host doesn't need to be the
   same as the host of the user browsing your pages.

ENVIRONMENT
   NNTP_SERVER
          specifies default NNTP server

   http_proxy
          specifies default HTTP proxy server

   https_proxy
          specifies default HTTPS proxy server

   curl_ca_bundle
          an alternative certificate bundle to be used with an HTTPS
          proxy

   no_proxy
          comma-separated list of domains to not contact over a proxy
          server

   LC_MESSAGES, LANG, LANGUAGE
          specify output language

RETURN VALUE
   The return value is 2 when

   • a program error occurred.

   The return value is 1 when

   • invalid links were found or

   • link warnings were found and warnings are enabled

   Otherwise the return value is zero.

LIMITATIONS
   LinkChecker consumes memory for each queued URL to check. With
   thousands of queued URLs the amount of consumed memory can become
   quite large. This might slow down the program or even the whole
   system.

FILES
   ~/.linkchecker/linkcheckerrc - default configuration file

   ~/.linkchecker/failures - default failures logger output filename

   linkchecker-out.TYPE - default logger file output name

SEE ALSO
   linkcheckerrc(5)

   https://docs.python.org/library/codecs.html#standard-encodings -
   valid output encodings

   https://docs.python.org/howto/regex.html - regular expression
   documentation

AUTHOR
   Bastian Kleineidam <bastian.kleineidam@web.de>

COPYRIGHT
   2000-2016 Bastian Kleineidam, 2010-2021 LinkChecker Authors



10.0.1.post124+ga12fcf04         December 21, 2021             LINKCHECKER(1)