LINKCHECKER(1)                    LinkChecker                   LINKCHECKER(1)

NAME
       linkchecker - command line client to check HTML documents and websites
       for broken links

SYNOPSIS
       linkchecker [options] [file-or-url]...

DESCRIPTION
       LinkChecker features

       • recursive and multithreaded checking

       • output in colored or normal text, HTML, SQL, CSV, XML or a sitemap
         graph in different formats

       • support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and
         local file links

       • restriction of link checking with URL filters

       • proxy support

       • username/password authorization for HTTP, FTP and Telnet

       • support for robots.txt exclusion protocol

       • support for Cookies

       • support for HTML5

       • Antivirus check

       • a command line and web interface

EXAMPLE
       The most common use checks the given domain recursively:

          $ linkchecker http://www.example.com/

       Beware that this checks the whole site, which can have thousands of
       URLs. Use the -r option to restrict the recursion depth.

       Don't check URLs whose name contains /secret; all other links are
       checked as usual:

          $ linkchecker --ignore-url=/secret mysite.example.com

       Checking a local HTML file on Unix:

          $ linkchecker ../bla.html

       Checking a local HTML file on Windows:

          C:\> linkchecker c:\temp\test.html

       You can skip the http:// URL part if the domain starts with www.:

          $ linkchecker www.example.com

       You can skip the ftp:// URL part if the domain starts with ftp.:

          $ linkchecker -r0 ftp.example.com

       Generate a sitemap graph and convert it with the graphviz dot
       utility:

          $ linkchecker -odot -v www.example.com | dot -Tps > sitemap.ps

OPTIONS
   General options
       -f FILENAME, --config=FILENAME
              Use FILENAME as configuration file. By default LinkChecker
              uses $XDG_CONFIG_HOME/linkchecker/linkcheckerrc.

       -h, --help
              Help me! Print usage information for this program.

       -t NUMBER, --threads=NUMBER
              Generate no more than the given number of threads. Default
              number of threads is 10. To disable threading specify a
              non-positive number.

       -V, --version
              Print version and exit.

       --list-plugins
              Print available check plugins and exit.

   Output options
       URL checking results
       -F TYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]
              Output to a file linkchecker-out.TYPE,
              $XDG_DATA_HOME/linkchecker/failures for the failures output
              type, or FILENAME if specified. The ENCODING specifies the
              output encoding; the default is that of your locale. Valid
              encodings are listed at
              https://docs.python.org/library/codecs.html#standard-encodings.
              The FILENAME and ENCODING parts of the none output type will
              be ignored; otherwise, if the file already exists, it will be
              overwritten. You can specify this option more than once.
              Valid file output TYPEs are text, html, sql, csv, gml, dot,
              xml, sitemap, none or failures. Default is no file output.
              The various output types are documented below. Note that you
              can suppress all console output with the option -o none.

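              For example, following the documented TYPE[/ENCODING][/FILENAME]
              syntax, this hypothetical invocation writes UTF-8 encoded CSV
              results to a file results.csv in addition to the normal
              console output:

                 $ linkchecker -Fcsv/utf_8/results.csv http://www.example.com/
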
       --no-warnings
              Don't log warnings. Default is to log warnings.

       -o TYPE[/ENCODING], --output=TYPE[/ENCODING]
              Specify the console output type as text, html, sql, csv, gml,
              dot, xml, sitemap, none or failures. Default type is text.
              The various output types are documented below. The ENCODING
              specifies the output encoding; the default is that of your
              locale. Valid encodings are listed at
              https://docs.python.org/library/codecs.html#standard-encodings.

       -v, --verbose
              Log all checked URLs. Default is to log only errors and
              warnings.

       Progress updates
       --no-status
              Do not print URL check status messages.

       Application
       -D STRING, --debug=STRING
              Print debugging output for the given logger. Available debug
              loggers are cmdline, checking, cache, plugin and all. all is
              an alias for all available loggers. This option can be given
              multiple times to debug with more than one logger.

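              Since the option can be given multiple times, a hypothetical
              debugging run with two loggers could look like this:

                 $ linkchecker -Dcmdline -Dchecking http://www.example.com/
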
       Quiet
       -q, --quiet
              Quiet operation, an alias for -o none that also hides
              application information messages. This is only useful with
              -F, else no results will be output.

   Checking options
       --cookiefile=FILENAME
              Use initial cookie data read from a file. The cookie data
              format is explained below.

       --check-extern
              Check external URLs.

       --ignore-url=REGEX
              URLs matching the given regular expression will only be
              syntax checked. This option can be given multiple times. See
              section REGULAR EXPRESSIONS for more info.

       -N STRING, --nntp-server=STRING
              Specify an NNTP server for news: links. Default is the
              environment variable NNTP_SERVER. If no host is given, only
              the syntax of the link is checked.

       --no-follow-url=REGEX
              Check but do not recurse into URLs matching the given regular
              expression. This option can be given multiple times. See
              section REGULAR EXPRESSIONS for more info.

       --no-robots
              Check URLs regardless of any robots.txt files.

       -p, --password
              Read a password from console and use it for HTTP and FTP
              authorization. For FTP the default password is anonymous@.
              For HTTP there is no default password. See also -u.

       -r NUMBER, --recursion-level=NUMBER
              Check recursively all links up to given depth. A negative
              depth will enable infinite recursion. Default depth is
              infinite.

       --timeout=NUMBER
              Set the timeout for connection attempts in seconds. The
              default timeout is 60 seconds.

       -u STRING, --user=STRING
              Try the given username for HTTP and FTP authorization. For
              FTP the default username is anonymous. For HTTP there is no
              default username. See also -p.

       --user-agent=STRING
              Specify the User-Agent string to send to the HTTP server, for
              example "Mozilla/4.0". The default is "LinkChecker/X.Y" where
              X.Y is the current version of LinkChecker.

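       As a sketch of how these options combine, the following hypothetical
       invocation checks two levels deep, only syntax-checks URLs
       containing /private/, and prompts for a password for the user
       webmaster:

          $ linkchecker -r2 --ignore-url=/private/ -u webmaster -p http://www.example.com/
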
   Input options
       --stdin
              Read from stdin a list of white-space separated URLs to
              check.

       FILE-OR-URL
              The location to start checking with. A file can be a simple
              list of URLs, one per line, if the first line is
              "# LinkChecker URL list".

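       A minimal URL list file, here a hypothetical urls.txt, could look
       like this:

          # LinkChecker URL list
          http://www.example.com/
          ftp://ftp.example.com/

       It can then be passed as an argument or piped in via --stdin:

          $ linkchecker urls.txt
          $ cat urls.txt | linkchecker --stdin
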
CONFIGURATION FILES
       Configuration files can specify all options above. They can also
       specify some options that cannot be set on the command line. See
       linkcheckerrc(5) for more info.

OUTPUT TYPES
       Note that by default only errors and warnings are logged. You should
       use the option --verbose to get the complete URL list, especially
       when outputting a sitemap graph format.

       text   Standard text logger, logging URLs in keyword: argument
              fashion.

       html   Log URLs in keyword: argument fashion, formatted as HTML.
              Additionally has links to the referenced pages. Invalid URLs
              have HTML and CSS syntax check links appended.

       csv    Log check result in CSV format with one URL per line.

       gml    Log parent-child relations between linked URLs as a GML
              sitemap graph.

       dot    Log parent-child relations between linked URLs as a DOT
              sitemap graph.

       gxml   Log check result as a GraphXML sitemap graph.

       xml    Log check result as machine-readable XML.

       sitemap
              Log check result as an XML sitemap whose protocol is
              documented at https://www.sitemaps.org/protocol.html.

       sql    Log check result as SQL script with INSERT commands. An
              example script to create the initial SQL table is included as
              create.sql.

       failures
              Suitable for cron jobs. Logs the check result into a file
              $XDG_DATA_HOME/linkchecker/failures which only contains
              entries with invalid URLs and the number of times they have
              failed.

       none   Logs nothing. Suitable for debugging or checking the exit
              code.

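       For example, a cron job could combine the failures file output with
       quiet console operation; the daily schedule shown is hypothetical:

          0 4 * * * linkchecker -Ffailures -q http://www.example.com/
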
REGULAR EXPRESSIONS
       LinkChecker accepts Python regular expressions. See
       https://docs.python.org/howto/regex.html for an introduction. An
       addition is that a leading exclamation mark negates the regular
       expression.

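       For example, giving --ignore-url a hypothetical negated expression
       restricts full checks to a single host; every URL not matching the
       expression is only syntax checked:

          $ linkchecker --ignore-url='!^https?://www\.example\.com/' http://www.example.com/
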
COOKIE FILES
       A cookie file contains standard HTTP header (RFC 2616) data with the
       following possible names:

       Host (required)
              Sets the domain the cookies are valid for.

       Path (optional)
              Gives the path the cookies are valid for; default path is /.

       Set-cookie (required)
              Set cookie name/value. Can be given more than once.

       Multiple entries are separated by a blank line. The example below
       will send two cookies to all URLs starting with
       http://example.com/hello/ and one to all URLs starting with
       https://example.org/:

          Host: example.com
          Path: /hello
          Set-cookie: ID="smee"
          Set-cookie: spam="egg"

          Host: example.org
          Set-cookie: baggage="elitist"; comment="hologram"

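       Assuming the entries above are stored in a hypothetical file
       cookies.txt, they are sent during checking with:

          $ linkchecker --cookiefile=cookies.txt http://example.com/hello/
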
PROXY SUPPORT
       To use a proxy on Unix or Windows set the http_proxy or https_proxy
       environment variables to the proxy URL. The URL should be of the
       form http://[user:pass@]host[:port]. LinkChecker also detects manual
       proxy settings of Internet Explorer under Windows systems. On a Mac
       use the Internet Config to select a proxy. You can also set a
       comma-separated domain list in the no_proxy environment variable to
       ignore any proxy settings for these domains. The curl_ca_bundle
       environment variable can be used to identify an alternative
       certificate bundle to be used with an HTTPS proxy.

       Setting an HTTP proxy on Unix for example looks like this:

          $ export http_proxy="http://proxy.example.com:8080"

       Proxy authentication is also supported:

          $ export http_proxy="http://user1:mypass@proxy.example.org:8081"

       Setting a proxy on the Windows command prompt:

          C:\> set http_proxy=http://proxy.example.com:8080

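       Bypassing the proxy for selected domains works the same way; the
       domain list shown is hypothetical:

          $ export no_proxy="localhost,intranet.example.com"
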
PERFORMED CHECKS
       All URLs have to pass a preliminary syntax test. Minor quoting
       mistakes will issue a warning; all other invalid syntax issues are
       errors. After the syntax check passes, the URL is queued for
       connection checking. All connection check types are described below.

       HTTP links (http:, https:)
              After connecting to the given HTTP server the given path or
              query is requested. All redirections are followed, and if
              user/password is given it will be used as authorization when
              necessary. All final HTTP status codes other than 2xx are
              errors.

              HTML page contents are checked for recursion.

       Local files (file:)
              A regular, readable file that can be opened is valid. A
              readable directory is also valid. All other files, for
              example device files, unreadable or non-existing files are
              errors.

              HTML or other parseable file contents are checked for
              recursion.

       Mail links (mailto:)
              A mailto: link eventually resolves to a list of email
              addresses. If one address fails, the whole list will fail.
              For each mail address we check the following things:

              1. Check the address syntax, both the parts before and after
                 the @ sign.

              2. Look up the MX DNS records. If no MX record is found,
                 print an error.

              3. Check if one of the mail hosts accepts an SMTP connection.
                 Check hosts with higher priority first. If no host accepts
                 SMTP, we print a warning.

              4. Try to verify the address with the VRFY command. If we get
                 an answer, print the verified address as an info.

       FTP links (ftp:)
              For FTP links we do:

              1. connect to the specified host

              2. try to log in with the given user and password. The
                 default user is anonymous, the default password is
                 anonymous@.

              3. try to change to the given directory

              4. list the file with the NLST command

       Telnet links (telnet:)
              We try to connect and, if user/password are given, log in to
              the given telnet server.

       NNTP links (news:, snews:, nntp:)
              We try to connect to the given NNTP server. If a news group
              or article is specified, try to request it from the server.

       Unsupported links (javascript:, etc.)
              An unsupported link will only print a warning. No further
              checking will be made.

              The complete list of recognized, but unsupported links can be
              found in the linkcheck/checker/unknownurl.py source file. The
              most prominent of them should be JavaScript links.

SITEMAPS
       Sitemaps are parsed for links to check and can be detected either
       from a sitemap entry in a robots.txt, or when passed as a
       FILE-OR-URL argument, in which case detection requires the
       urlset/sitemapindex tag to be within the first 70 characters of the
       sitemap. Compressed sitemap files are not supported.

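       For example, a sitemap at a hypothetical location can be checked
       directly by passing its URL as an argument:

          $ linkchecker http://www.example.com/sitemap.xml
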
PLUGINS
       There are two plugin types: connection and content plugins.
       Connection plugins are run after a successful connection to the URL
       host. Content plugins are run if the URL type has content (mailto:
       URLs have no content for example) and if the check is not forbidden
       (i.e. by HTTP robots.txt). Use the option --list-plugins for a list
       of plugins and their documentation. All plugins are enabled via the
       linkcheckerrc(5) configuration file.

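       To see which plugins are available in your installation:

          $ linkchecker --list-plugins
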
RECURSION
       Before descending recursively into a URL, it has to fulfill several
       conditions. They are checked in this order:

       1. A URL must be valid.

       2. A URL must be parseable. This currently includes HTML files,
          Opera bookmarks files, and directories. If a file type cannot be
          determined (for example it does not have a common HTML file
          extension, and the content does not look like HTML), it is
          assumed to be non-parseable.

       3. The URL content must be retrievable. This is usually the case
          except for example mailto: or unknown URL types.

       4. The maximum recursion level must not be exceeded. It is
          configured with the --recursion-level option and is unlimited by
          default.

       5. It must not match the ignored URL list. This is controlled with
          the --ignore-url option.

       6. The Robots Exclusion Protocol must allow links in the URL to be
          followed recursively. This is checked by searching for a
          "nofollow" directive in the HTML header data.

       Note that the directory recursion reads all files in that directory,
       not just a subset like index.htm.

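       The recursion depth and URL filters map onto command line options;
       for example, this hypothetical invocation limits the depth and keeps
       the checker from recursing into an archive area while still checking
       the archive URLs themselves:

          $ linkchecker -r3 --no-follow-url=/archive/ http://www.example.com/
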
NOTES
       URLs on the command line starting with ftp. are treated like
       ftp://ftp., and URLs starting with www. are treated like
       http://www.. You can also give local files as arguments. If you have
       your system configured to automatically establish a connection to
       the internet (e.g. with diald), it will connect when checking links
       not pointing to your local host. Use the --ignore-url option to
       prevent this.

       Javascript links are not supported.

       If your platform does not support threading, LinkChecker disables it
       automatically.

       You can supply multiple user/password pairs in a configuration file.

       When checking news: links the given NNTP host doesn't need to be the
       same as the host of the user browsing your pages.

ENVIRONMENT
       NNTP_SERVER
              specifies default NNTP server

       http_proxy
              specifies default HTTP proxy server

       https_proxy
              specifies default HTTPS proxy server

       curl_ca_bundle
              an alternative certificate bundle to be used with an HTTPS
              proxy

       no_proxy
              comma-separated list of domains to not contact over a proxy
              server

       LC_MESSAGES, LANG, LANGUAGE
              specify output language

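       For example, a default NNTP server for news: links can be set in the
       environment instead of passing -N; the host shown is hypothetical:

          $ export NNTP_SERVER=news.example.com
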
RETURN VALUE
       The return value is 2 when

       • a program error occurred.

       The return value is 1 when

       • invalid links were found or

       • link warnings were found and warnings are enabled.

       Else the return value is zero.

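       A hypothetical POSIX shell snippet that distinguishes the two
       failure codes:

          linkchecker -q -Ffailures http://www.example.com/
          case $? in
              1) echo "broken links or warnings found" ;;
              2) echo "program error" ;;
          esac
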
LIMITATIONS
       LinkChecker consumes memory for each queued URL to check. With
       thousands of queued URLs the amount of consumed memory can become
       quite large. This might slow down the program or even the whole
       system.

FILES
       $XDG_CONFIG_HOME/linkchecker/linkcheckerrc - default configuration
       file

       $XDG_DATA_HOME/linkchecker/failures - default failures logger output
       filename

       linkchecker-out.TYPE - default logger file output name

SEE ALSO
       linkcheckerrc(5)

       https://docs.python.org/library/codecs.html#standard-encodings -
       valid output encodings

       https://docs.python.org/howto/regex.html - regular expression
       documentation

AUTHOR
       Bastian Kleineidam <bastian.kleineidam@web.de>

COPYRIGHT
       2000-2016 Bastian Kleineidam, 2010-2022 LinkChecker Authors



10.1.0.post162+g614e84b5         October 31, 2022               LINKCHECKER(1)