LINKCHECKER(1)                   LinkChecker                   LINKCHECKER(1)


NAME
   linkchecker - command line client to check HTML documents and websites
   for broken links

SYNOPSIS
   linkchecker [options] [file-or-url]...

DESCRIPTION
   LinkChecker features

   • recursive and multithreaded checking

   • output in colored or normal text, HTML, SQL, CSV, XML or a sitemap
     graph in different formats

   • support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and
     local file links

   • restriction of link checking with URL filters

   • proxy support

   • username/password authorization for HTTP, FTP and Telnet

   • support for robots.txt exclusion protocol

   • support for Cookies

   • support for HTML5

   • Antivirus check

   • a command line and web interface

EXAMPLES
   The most common use checks the given domain recursively:

      $ linkchecker http://www.example.com/

   Beware that this checks the whole site, which can have thousands of
   URLs. Use the -r option to restrict the recursion depth.

   Don't check URLs with /secret in their name; all other links are
   checked as usual:

      $ linkchecker --ignore-url=/secret mysite.example.com

   Checking a local HTML file on Unix:

      $ linkchecker ../bla.html

   Checking a local HTML file on Windows:

      C:\> linkchecker c:\temp\test.html

   You can skip the http:// URL part if the domain starts with www.:

      $ linkchecker www.example.com

   You can skip the ftp:// URL part if the domain starts with ftp.:

      $ linkchecker -r0 ftp.example.com

   Generate a sitemap graph and convert it with the graphviz dot utility:

      $ linkchecker -odot -v www.example.com | dot -Tps > sitemap.ps

OPTIONS
   General options
   -f FILENAME, --config=FILENAME
          Use FILENAME as configuration file. By default LinkChecker
          uses ~/.linkchecker/linkcheckerrc.

   -h, --help
          Help me! Print usage information for this program.

   --stdin
          Read a list of white-space separated URLs to check from stdin.

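          For instance, the URL list for --stdin can come from another
          command; url-list.txt is a hypothetical file of white-space
          separated URLs:

```shell
# Feed a list of URLs to linkchecker on stdin
# (url-list.txt is a made-up file name for this sketch).
cat url-list.txt | linkchecker --stdin
```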
   -t NUMBER, --threads=NUMBER
          Generate no more than the given number of threads. The default
          number of threads is 10. To disable threading, specify a
          non-positive number.

   -V, --version
          Print version and exit.

   --list-plugins
          Print available check plugins and exit.

   Output options
   URL checking results
   -F TYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]
          Output to a file linkchecker-out.TYPE, $HOME/.linkchecker/failures
          for the failures output type, or FILENAME if specified. The
          ENCODING specifies the output encoding; the default is that of
          your locale. Valid encodings are listed at
          https://docs.python.org/library/codecs.html#standard-encodings.
          The FILENAME and ENCODING parts of the none output type are
          ignored; otherwise, if the file already exists, it will be
          overwritten. You can specify this option more than once. Valid
          file output TYPEs are text, html, sql, csv, gml, dot, xml,
          sitemap, none or failures. Default is no file output. The
          various output types are documented below. Note that you can
          suppress all console output with the option -o none.

   --no-warnings
          Don't log warnings. Default is to log warnings.

   -o TYPE[/ENCODING], --output=TYPE[/ENCODING]
          Specify the console output type as text, html, sql, csv, gml,
          dot, xml, sitemap, none or failures. Default type is text. The
          various output types are documented below. The ENCODING
          specifies the output encoding; the default is that of your
          locale. Valid encodings are listed at
          https://docs.python.org/library/codecs.html#standard-encodings.

   -v, --verbose
          Log all checked URLs. Default is to log only errors and
          warnings.

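          For example, the file and console output options above can be
          combined to save CSV results while silencing the console;
          results.csv is a hypothetical file name:

```shell
# Verbose check, CSV results written to results.csv in UTF-8,
# with all console output suppressed.
linkchecker -v -Fcsv/utf-8/results.csv -o none http://www.example.com/
```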
   Progress updates
   --no-status
          Do not print URL check status messages.

   Application
   -D STRING, --debug=STRING
          Print debugging output for the given logger. Available loggers
          are cmdline, checking, cache, dns, plugin and all. Specifying
          all is an alias for specifying all available loggers. The
          option can be given multiple times to debug with more than one
          logger. For accurate results, threading will be disabled
          during debug runs.

   Quiet
   -q, --quiet
          Quiet operation, an alias for -o none that also hides
          application information messages. This is only useful with -F,
          else no results will be output.

   Checking options
   --cookiefile=FILENAME
          Read a file with initial cookie data. The cookie data format
          is explained below.

   --check-extern
          Check external URLs.

   --ignore-url=REGEX
          URLs matching the given regular expression will only be syntax
          checked. This option can be given multiple times. See section
          REGULAR EXPRESSIONS for more info.

   -N STRING, --nntp-server=STRING
          Specify an NNTP server for news: links. Default is the
          environment variable NNTP_SERVER. If no host is given, only
          the syntax of the link is checked.

   --no-follow-url=REGEX
          Check but do not recurse into URLs matching the given regular
          expression. This option can be given multiple times. See
          section REGULAR EXPRESSIONS for more info.

   --no-robots
          Check URLs regardless of any robots.txt files.

   -p, --password
          Read a password from the console and use it for HTTP and FTP
          authorization. For FTP the default password is anonymous@. For
          HTTP there is no default password. See also -u.

   -r NUMBER, --recursion-level=NUMBER
          Check recursively all links up to the given depth. A negative
          depth enables infinite recursion. Default depth is infinite.

   --timeout=NUMBER
          Set the timeout for connection attempts in seconds. The
          default timeout is 60 seconds.

   -u STRING, --user=STRING
          Try the given username for HTTP and FTP authorization. For FTP
          the default username is anonymous. For HTTP there is no
          default username. See also -p.

   --user-agent=STRING
          Specify the User-Agent string to send to the HTTP server, for
          example "Mozilla/4.0". The default is "LinkChecker/X.Y" where
          X.Y is the current version of LinkChecker.

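   Several of the checking options above are commonly combined; the
   intranet host below is a made-up example:

```shell
# Limit recursion depth to 2, use a 30-second connection timeout,
# and authenticate as myuser (-p prompts for the password).
linkchecker -r2 --timeout=30 -u myuser -p http://intranet.example.com/
```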
CONFIGURATION FILES
   Configuration files can specify all options above. They can also
   specify some options that cannot be set on the command line. See
   linkcheckerrc(5) for more info.

OUTPUT TYPES
   Note that by default only errors and warnings are logged. You should
   use the option --verbose to get the complete URL list, especially
   when outputting a sitemap graph format.

   text   Standard text logger, logging URLs in keyword: argument
          fashion.

   html   Log URLs in keyword: argument fashion, formatted as HTML.
          Additionally has links to the referenced pages. Invalid URLs
          have HTML and CSS syntax check links appended.

   csv    Log check result in CSV format with one URL per line.

   gml    Log parent-child relations between linked URLs as a GML
          sitemap graph.

   dot    Log parent-child relations between linked URLs as a DOT
          sitemap graph.

   gxml   Log check result as a GraphXML sitemap graph.

   xml    Log check result as machine-readable XML.

   sitemap
          Log check result as an XML sitemap whose protocol is
          documented at https://www.sitemaps.org/protocol.html.

   sql    Log check result as an SQL script with INSERT commands. An
          example script to create the initial SQL table is included as
          create.sql.

   failures
          Suitable for cron jobs. Logs the check result into a file
          ~/.linkchecker/failures which only contains entries with
          invalid URLs and the number of times they have failed.

   none   Logs nothing. Suitable for debugging or checking the exit
          code.

REGULAR EXPRESSIONS
   LinkChecker accepts Python regular expressions. See
   https://docs.python.org/howto/regex.html for an introduction. An
   addition is that a leading exclamation mark negates the regular
   expression.

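   As a sketch of the negation syntax, the pattern below (illustrative
   only) ignores, i.e. merely syntax-checks, every URL that does not
   start with the site's own prefix:

```shell
# The leading "!" negates the pattern: URLs NOT matching the site's
# own prefix are only syntax checked, not fetched.
linkchecker --ignore-url='!^http://www\.example\.com/' http://www.example.com/
```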
COOKIE FILES
   A cookie file contains standard HTTP header (RFC 2616) data with the
   following possible names:

   Host (required)
          Sets the domain the cookies are valid for.

   Path (optional)
          Gives the path the cookies are valid for; default path is /.

   Set-cookie (required)
          Set cookie name/value. Can be given more than once.

   Multiple entries are separated by a blank line. The example below
   will send two cookies to all URLs starting with
   http://example.com/hello/ and one to all URLs starting with
   https://example.org/:

      Host: example.com
      Path: /hello
      Set-cookie: ID="smee"
      Set-cookie: spam="egg"

      Host: example.org
      Set-cookie: baggage="elitist"; comment="hologram"

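   A sketch of putting this together: save an entry to a file and pass
   it with --cookiefile (cookies.txt is a hypothetical file name):

```shell
# Write a one-entry cookie file, then use it for a check.
cat > cookies.txt <<'EOF'
Host: example.com
Path: /hello
Set-cookie: ID="smee"
EOF
linkchecker --cookiefile=cookies.txt http://example.com/hello/
```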
PROXY SUPPORT
   To use a proxy on Unix or Windows set the http_proxy or https_proxy
   environment variables to the proxy URL. The URL should be of the form
   http://[user:pass@]host[:port]. LinkChecker also detects manual proxy
   settings of Internet Explorer under Windows systems. On a Mac use the
   Internet Config to select a proxy. You can also set a comma-separated
   domain list in the no_proxy environment variable to ignore any proxy
   settings for these domains. The curl_ca_bundle environment variable
   can be used to identify an alternative certificate bundle to be used
   with an HTTPS proxy.

   Setting an HTTP proxy on Unix for example looks like this:

      $ export http_proxy="http://proxy.example.com:8080"

   Proxy authentication is also supported:

      $ export http_proxy="http://user1:mypass@proxy.example.org:8081"

   Setting a proxy on the Windows command prompt:

      C:\> set http_proxy=http://proxy.example.com:8080

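   The no_proxy exclusion mentioned above works the same way; the domain
   names here are illustrative:

```shell
# Bypass any configured proxy for these domains.
export no_proxy="localhost,intranet.example.com"
```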
PERFORMED CHECKS
   All URLs have to pass a preliminary syntax test. Minor quoting
   mistakes will issue a warning; all other invalid syntax issues are
   errors. After the syntax check passes, the URL is queued for
   connection checking. All connection check types are described below.

   HTTP links (http:, https:)
          After connecting to the given HTTP server the given path or
          query is requested. All redirections are followed, and if
          user/password is given it will be used as authorization when
          necessary. All final HTTP status codes other than 2xx are
          errors.

          HTML page contents are checked for recursion.

   Local files (file:)
          A regular, readable file that can be opened is valid. A
          readable directory is also valid. All other files, for example
          device files, unreadable or non-existing files, are errors.

          HTML or other parseable file contents are checked for
          recursion.

   Mail links (mailto:)
          A mailto: link eventually resolves to a list of email
          addresses. If one address fails, the whole list will fail. For
          each mail address we check the following things:

          1. Check the address syntax, both the parts before and after
             the @ sign.

          2. Look up the MX DNS records. If no MX record is found, print
             an error.

          3. Check if one of the mail hosts accepts an SMTP connection.
             Check hosts with higher priority first. If no host accepts
             SMTP, print a warning.

          4. Try to verify the address with the VRFY command. If an
             answer is received, print the verified address as an info.

   FTP links (ftp:)
          For FTP links we do:

          1. connect to the specified host

          2. try to login with the given user and password. The default
             user is anonymous, the default password is anonymous@.

          3. try to change to the given directory

          4. list the file with the NLST command

   Telnet links (telnet:)
          We try to connect and, if user/password are given, login to
          the given telnet server.

   NNTP links (news:, snews:, nntp:)
          We try to connect to the given NNTP server. If a news group or
          article is specified, try to request it from the server.

   Unsupported links (javascript:, etc.)
          An unsupported link will only print a warning. No further
          checking will be made.

          The complete list of recognized, but unsupported links can be
          found in the linkcheck/checker/unknownurl.py source file. The
          most prominent of them should be JavaScript links.

PLUGINS
   There are two plugin types: connection and content plugins.
   Connection plugins are run after a successful connection to the URL
   host. Content plugins are run if the URL type has content (mailto:
   URLs have no content, for example) and if the check is not forbidden
   (e.g. by HTTP robots.txt). Use the option --list-plugins for a list
   of plugins and their documentation. All plugins are enabled via the
   linkcheckerrc(5) configuration file.

RECURSION
   Before descending recursively into a URL, it has to fulfill several
   conditions. They are checked in this order:

   1. A URL must be valid.

   2. A URL must be parseable. This currently includes HTML files, Opera
      bookmarks files, and directories. If a file type cannot be
      determined (for example it does not have a common HTML file
      extension, and the content does not look like HTML), it is assumed
      to be non-parseable.

   3. The URL content must be retrievable. This is usually the case
      except for example mailto: or unknown URL types.

   4. The maximum recursion level must not be exceeded. It is configured
      with the --recursion-level option and is unlimited by default.

   5. It must not match the ignored URL list. This is controlled with
      the --ignore-url option.

   6. The Robots Exclusion Protocol must allow links in the URL to be
      followed recursively. This is checked by searching for a
      "nofollow" directive in the HTML header data.

   Note that the directory recursion reads all files in that directory,
   not just a subset like index.htm.

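   The two recursion filters differ in effect: --no-follow-url still
   checks matching URLs but does not descend into them, while
   --ignore-url matches are only syntax checked. A sketch with made-up
   patterns:

```shell
# Check blog pages without recursing into them; only syntax-check
# links to PDF files.
linkchecker --no-follow-url='^http://blog\.example\.com/' \
    --ignore-url='\.pdf$' http://www.example.com/
```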
NOTES
   URLs on the command line starting with ftp. are treated like
   ftp://ftp., and URLs starting with www. are treated like http://www..
   You can also give local files as arguments. If you have your system
   configured to automatically establish a connection to the internet
   (e.g. with diald), it will connect when checking links not pointing
   to your local host. Use the --ignore-url option to prevent this.

   JavaScript links are not supported.

   If your platform does not support threading, LinkChecker disables it
   automatically.

   You can supply multiple user/password pairs in a configuration file.

   When checking news: links the given NNTP host doesn't need to be the
   same as the host of the user browsing your pages.

ENVIRONMENT
   NNTP_SERVER
          specifies default NNTP server

   http_proxy
          specifies default HTTP proxy server

   https_proxy
          specifies default HTTPS proxy server

   curl_ca_bundle
          an alternative certificate bundle to be used with an HTTPS
          proxy

   no_proxy
          comma-separated list of domains to not contact over a proxy
          server

   LC_MESSAGES, LANG, LANGUAGE
          specify output language

RETURN VALUE
   The return value is 2 when

   • a program error occurred.

   The return value is 1 when

   • invalid links were found or

   • link warnings were found and warnings are enabled

   Otherwise the return value is zero.

LIMITATIONS
   LinkChecker consumes memory for each queued URL to check. With
   thousands of queued URLs the amount of consumed memory can become
   quite large. This might slow down the program or even the whole
   system.

FILES
   ~/.linkchecker/linkcheckerrc - default configuration file

   ~/.linkchecker/failures - default failures logger output filename

   linkchecker-out.TYPE - default logger file output name

SEE ALSO
   linkcheckerrc(5)

   https://docs.python.org/library/codecs.html#standard-encodings -
   valid output encodings

   https://docs.python.org/howto/regex.html - regular expression
   documentation

AUTHOR
   Bastian Kleineidam <bastian.kleineidam@web.de>

COPYRIGHT
   2000-2016 Bastian Kleineidam, 2010-2021 LinkChecker Authors



10.0.1.post124+ga12fcf04         December 21, 2021             LINKCHECKER(1)