LINKCHECKER(1)           LinkChecker commandline usage          LINKCHECKER(1)


NAME

       linkchecker - command line client to check HTML documents and websites
       for broken links

SYNOPSIS

       linkchecker [options] [file-or-url]...

DESCRIPTION

       LinkChecker features

       ·      recursive and multithreaded checking,

       ·      output in colored or normal text, HTML, SQL, CSV, XML or a
              sitemap graph in different formats,

       ·      support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet
              and local file links,

       ·      restriction of link checking with URL filters,

       ·      proxy support,

       ·      username/password authorization for HTTP, FTP and Telnet,

       ·      support for the robots.txt exclusion protocol,

       ·      support for cookies,

       ·      support for HTML5,

       ·      HTML and CSS syntax checking,

       ·      antivirus checking,

       ·      a command line and web interface.

EXAMPLES

       The most common use checks the given domain recursively:
         linkchecker http://www.example.com/
       Beware that this checks the whole site, which can have thousands of
       URLs. Use the -r option to restrict the recursion depth.
       Don't check URLs whose names contain /secret; all other links are
       checked as usual:
         linkchecker --ignore-url=/secret mysite.example.com
       Checking a local HTML file on Unix:
         linkchecker ../bla.html
       Checking a local HTML file on Windows:
         linkchecker c:\temp\test.html
       You can skip the http:// URL part if the domain starts with www.:
         linkchecker www.example.com
       You can skip the ftp:// URL part if the domain starts with ftp.:
         linkchecker -r0 ftp.example.com
       Generate a sitemap graph and convert it with the graphviz dot utility:
         linkchecker -odot -v www.example.com | dot -Tps > sitemap.ps

OPTIONS

   General options
       -fFILENAME, --config=FILENAME
              Use FILENAME as configuration file. By default LinkChecker uses
              ~/.linkchecker/linkcheckerrc.

       -h, --help
              Help me! Print usage information for this program.

       --stdin
              Read a list of white-space separated URLs to check from stdin.

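       A sketch of --stdin usage (the URL list and its file name are made up
       for the example; the linkchecker call itself is commented out because
       it requires LinkChecker to be installed and network access):

```shell
# Build a whitespace-separated URL list; the file path is illustrative.
printf 'http://www.example.com/\nhttp://www.example.org/\n' > /tmp/urls.txt

# Feed it to LinkChecker on stdin:
# linkchecker --stdin < /tmp/urls.txt
```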
       -tNUMBER, --threads=NUMBER
              Generate no more than the given number of threads. The default
              number of threads is 10. To disable threading, specify a
              non-positive number.

       -V, --version
              Print version and exit.

       --list-plugins
              Print available check plugins and exit.

   Output options
       -DSTRING, --debug=STRING
              Print debugging output for the given logger. Available loggers
              are cmdline, checking, cache, dns, plugins and all. Specifying
              all is an alias for specifying all available loggers. The
              option can be given multiple times to debug with more than one
              logger. For accurate results, threading will be disabled during
              debug runs.

       -FTYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]
              Output to a file linkchecker-out.TYPE,
              $HOME/.linkchecker/blacklist for blacklist output, or FILENAME
              if specified. The ENCODING specifies the output encoding; the
              default is that of your locale. Valid encodings are listed at
              http://docs.python.org/library/codecs.html#standard-encodings.
              The FILENAME and ENCODING parts of the none output type will be
              ignored; otherwise, if the file already exists, it will be
              overwritten. You can specify this option more than once. Valid
              file output types are text, html, sql, csv, gml, dot, xml,
              sitemap, none or blacklist. Default is no file output. The
              various output types are documented below. Note that you can
              suppress all console output with the option -o none.

       --no-status
              Do not print check status messages.

       --no-warnings
              Don't log warnings. Default is to log warnings.

       -oTYPE[/ENCODING], --output=TYPE[/ENCODING]
              Specify the output type as text, html, sql, csv, gml, dot, xml,
              sitemap, none or blacklist. Default type is text. The various
              output types are documented below.
              The ENCODING specifies the output encoding; the default is that
              of your locale. Valid encodings are listed at
              http://docs.python.org/library/codecs.html#standard-encodings.

       -q, --quiet
              Quiet operation, an alias for -o none. This is only useful with
              -F.

       -v, --verbose
              Log all checked URLs. Default is to log only errors and
              warnings.

       -WREGEX, --warning-regex=REGEX
              Define a regular expression which prints a warning if it
              matches any content of the checked link. This applies only to
              valid pages, so we can get their content.
              Use this to check for pages that contain some form of error,
              for example "This page has moved" or "Oracle Application
              error".
              Note that multiple values can be combined in the regular
              expression, for example "(This page has moved|Oracle
              Application error)".
              See section REGULAR EXPRESSIONS for more info.

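       The alternation shown above is an ordinary extended regular
       expression, so it can be sanity-checked with grep before handing it
       to -W (a sketch; the sample text is made up):

```shell
# Both error phrases are caught by the single alternation pattern;
# this prints the matching line.
echo 'Oracle Application error: ORA-00942' \
  | grep -E '(This page has moved|Oracle Application error)'
```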
   Checking options
       --cookiefile=FILENAME
              Read a file with initial cookie data. The cookie data format is
              explained below.

       --check-extern
              Check external URLs.

       --ignore-url=REGEX
              URLs matching the given regular expression will be ignored and
              not checked.
              This option can be given multiple times.
              See section REGULAR EXPRESSIONS for more info.

       -NSTRING, --nntp-server=STRING
              Specify an NNTP server for news: links. Default is the
              environment variable NNTP_SERVER. If no host is given, only the
              syntax of the link is checked.

       --no-follow-url=REGEX
              Check but do not recurse into URLs matching the given regular
              expression.
              This option can be given multiple times.
              See section REGULAR EXPRESSIONS for more info.

       -p, --password
              Read a password from the console and use it for HTTP and FTP
              authorization. For FTP the default password is anonymous@. For
              HTTP there is no default password. See also -u.

       -rNUMBER, --recursion-level=NUMBER
              Check recursively all links up to the given depth. A negative
              depth enables infinite recursion. Default depth is infinite.

       --timeout=NUMBER
              Set the timeout for connection attempts in seconds. The default
              timeout is 60 seconds.

       -uSTRING, --user=STRING
              Try the given username for HTTP and FTP authorization. For FTP
              the default username is anonymous. For HTTP there is no default
              username. See also -p.

       --user-agent=STRING
              Specify the User-Agent string to send to the HTTP server, for
              example "Mozilla/4.0". The default is "LinkChecker/X.Y" where
              X.Y is the current version of LinkChecker.


CONFIGURATION FILES

       Configuration files can specify all options above. They can also
       specify some options that cannot be set on the command line. See
       linkcheckerrc(5) for more info.


OUTPUT TYPES

       Note that by default only errors and warnings are logged. You should
       use the --verbose option to get the complete URL list, especially when
       outputting a sitemap graph format.

       text   Standard text logger, logging URLs in keyword: argument
              fashion.

       html   Log URLs in keyword: argument fashion, formatted as HTML.
              Additionally has links to the referenced pages. Invalid URLs
              have HTML and CSS syntax check links appended.

       csv    Log check result in CSV format with one URL per line.

       gml    Log parent-child relations between linked URLs as a GML sitemap
              graph.

       dot    Log parent-child relations between linked URLs as a DOT sitemap
              graph.

       gxml   Log check result as a GraphXML sitemap graph.

       xml    Log check result as machine-readable XML.

       sitemap
              Log check result as an XML sitemap whose protocol is documented
              at http://www.sitemaps.org/protocol.html.

       sql    Log check result as an SQL script with INSERT commands. An
              example script to create the initial SQL table is included as
              create.sql.

       blacklist
              Suitable for cron jobs. Logs the check result into a file
              ~/.linkchecker/blacklist which only contains entries with
              invalid URLs and the number of times they have failed.

       none   Logs nothing. Suitable for debugging or checking the exit code.
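
       For the cron-job use mentioned under blacklist, a crontab entry could
       look like this (a sketch; the schedule and URL are illustrative, and
       linkchecker must be on cron's PATH):

```
# Check the site daily at 06:00 and update ~/.linkchecker/blacklist.
0 6 * * *  linkchecker --no-status -Fblacklist http://www.example.com/
```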
232

REGULAR EXPRESSIONS

       LinkChecker accepts Python regular expressions. See
       http://docs.python.org/howto/regex.html for an introduction.

       As an addition, a leading exclamation mark negates the regular
       expression.
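
       A sketch of the negation, with grep standing in for the matching step
       (the "!" itself is interpreted by LinkChecker, not by the regex
       engine; the URLs are illustrative):

```shell
# With --ignore-url='!^http://www\.example\.com/' LinkChecker would ignore
# every URL that does NOT match the pattern after the "!". The bare pattern
# behaves like any regular expression; this prints 1 (one matching line).
echo 'http://www.example.com/page' | grep -cE '^http://www\.example\.com/'
```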

COOKIE FILES

       A cookie file contains standard HTTP header (RFC 2616) data with the
       following possible names:

       Host (required)
              Sets the domain the cookies are valid for.

       Path (optional)
              Gives the path the cookies are valid for; default path is /.

       Set-cookie (required)
              Set cookie name/value. Can be given more than once.

       Multiple entries are separated by a blank line. The example below will
       send two cookies to all URLs starting with http://example.com/hello/
       and one to all URLs starting with https://example.org/:

        Host: example.com
        Path: /hello
        Set-cookie: ID="smee"
        Set-cookie: spam="egg"

        Host: example.org
        Set-cookie: baggage="elitist"; comment="hologram"

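       A sketch of putting that example to use with --cookiefile (the file
       path is made up; the linkchecker call is commented out because it
       needs LinkChecker installed and network access):

```shell
# Write the cookie file from the example above.
cat > /tmp/cookies.txt <<'EOF'
Host: example.com
Path: /hello
Set-cookie: ID="smee"
Set-cookie: spam="egg"

Host: example.org
Set-cookie: baggage="elitist"; comment="hologram"
EOF

# Then pass it to a check run:
# linkchecker --cookiefile=/tmp/cookies.txt http://example.com/hello/
```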

PROXY SUPPORT

       To use a proxy on Unix or Windows, set the $http_proxy, $https_proxy
       or $ftp_proxy environment variable to the proxy URL. The URL should be
       of the form http://[user:pass@]host[:port]. LinkChecker also detects
       manual proxy settings of Internet Explorer under Windows systems, and
       gconf or KDE on Linux systems. On a Mac, use the Internet Config to
       select a proxy. You can also set a comma-separated domain list in the
       $no_proxy environment variable to ignore any proxy settings for these
       domains. Setting an HTTP proxy on Unix, for example, looks like this:

         export http_proxy="http://proxy.example.com:8080"

       Proxy authentication is also supported:

         export http_proxy="http://user1:mypass@proxy.example.org:8081"

       Setting a proxy on the Windows command prompt:

         set http_proxy=http://proxy.example.com:8080
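
       A sketch of combining these settings in a POSIX shell, with $no_proxy
       excluding hosts that should be contacted directly (host names are
       illustrative):

```shell
# Route HTTP checks through an authenticated proxy...
export http_proxy="http://user1:mypass@proxy.example.org:8081"
# ...but contact these domains directly, bypassing the proxy.
export no_proxy="localhost,intranet.example.com"

# A subsequent run would now use the proxy (requires LinkChecker installed):
# linkchecker http://www.example.com/
```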

PERFORMED CHECKS

       All URLs have to pass a preliminary syntax test. Minor quoting
       mistakes will issue a warning; all other invalid syntax issues are
       errors. After the syntax check passes, the URL is queued for
       connection checking. All connection check types are described below.

       HTTP links (http:, https:)
              After connecting to the given HTTP server, the given path or
              query is requested. All redirections are followed, and if a
              user/password is given it will be used as authorization when
              necessary. All final HTTP status codes other than 2xx are
              errors. HTML page contents are checked for recursion.

       Local files (file:)
              A regular, readable file that can be opened is valid. A
              readable directory is also valid. All other files, for example
              device files, unreadable or non-existing files, are errors.
              HTML or other parseable file contents are checked for
              recursion.

       Mail links (mailto:)
              A mailto: link eventually resolves to a list of email
              addresses. If one address fails, the whole list will fail. For
              each mail address we check the following things:
                1) Check the address syntax, both the part before and the
                   part after the @ sign.
                2) Look up the MX DNS records. If no MX record is found,
                   print an error.
                3) Check if one of the mail hosts accepts an SMTP connection.
                   Check hosts with higher priority first.
                   If no host accepts SMTP, print a warning.
                4) Try to verify the address with the VRFY command. If an
                   answer is received, print the verified address as an info.

       FTP links (ftp:)
              For FTP links we do:
                1) connect to the specified host,
                2) try to log in with the given user and password (the
                   default user is anonymous, the default password is
                   anonymous@),
                3) try to change to the given directory,
                4) list the file with the NLST command.

       Telnet links (telnet:)
              We try to connect and, if a user/password is given, log in to
              the given telnet server.

       NNTP links (news:, snews:, nntp:)
              We try to connect to the given NNTP server. If a news group or
              article is specified, try to request it from the server.

       Unsupported links (javascript:, etc.)
              An unsupported link will only print a warning. No further
              checking will be made.

              The complete list of recognized, but unsupported links can be
              found in the linkcheck/checker/unknownurl.py source file. The
              most prominent of them should be JavaScript links.


PLUGINS

       There are two plugin types: connection and content plugins. Connection
       plugins are run after a successful connection to the URL host. Content
       plugins are run if the URL type has content (mailto: URLs have no
       content, for example) and if the check is not forbidden (i.e. by HTTP
       robots.txt). See linkchecker --list-plugins for a list of plugins and
       their documentation. All plugins are enabled via the linkcheckerrc(5)
       configuration file.


RECURSION

       Before descending recursively into a URL, it has to fulfill several
       conditions. They are checked in this order:

       1. A URL must be valid.

       2. A URL must be parseable. This currently includes HTML files,
          Opera bookmarks files, and directories. If a file type cannot
          be determined (for example it does not have a common HTML file
          extension, and the content does not look like HTML), it is assumed
          to be non-parseable.

       3. The URL content must be retrievable. This is usually the case
          except for example mailto: or unknown URL types.

       4. The maximum recursion level must not be exceeded. It is configured
          with the --recursion-level option and is unlimited per default.

       5. It must not match the ignored URL list. This is controlled with
          the --ignore-url option.

       6. The Robots Exclusion Protocol must allow links in the URL to be
          followed recursively. This is checked by searching for a
          "nofollow" directive in the HTML header data.

       Note that the directory recursion reads all files in that directory,
       not just a subset like index.htm*.


NOTES

       URLs on the commandline starting with ftp. are treated like
       ftp://ftp., URLs starting with www. are treated like http://www..
       You can also give local files as arguments.

       If you have your system configured to automatically establish a
       connection to the internet (e.g. with diald), it will connect when
       checking links not pointing to your local host. Use the --ignore-url
       option to prevent this.

       Javascript links are not supported.

       If your platform does not support threading, LinkChecker disables it
       automatically.

       You can supply multiple user/password pairs in a configuration file.

       When checking news: links, the given NNTP host doesn't need to be the
       same as the host of the user browsing your pages.


ENVIRONMENT

       NNTP_SERVER - specifies the default NNTP server
       http_proxy - specifies the default HTTP proxy server
       ftp_proxy - specifies the default FTP proxy server
       no_proxy - comma-separated list of domains to not contact over a
       proxy server
       LC_MESSAGES, LANG, LANGUAGE - specify output language


RETURN VALUE

       The return value is 2 when

       ·      a program error occurred.

       The return value is 1 when

       ·      invalid links were found or

       ·      link warnings were found and warnings are enabled.

       Otherwise the return value is zero.

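       The exit codes above can be dispatched in a wrapper script; a sketch
       (the describe_exit helper name is made up, and the real linkchecker
       invocation is shown commented out since it needs LinkChecker
       installed):

```shell
# Map LinkChecker's exit status to a message. A real run would be:
#   linkchecker -q http://www.example.com/; describe_exit $?
describe_exit() {
  case $1 in
    0) echo "all links ok" ;;
    1) echo "broken links or warnings found" ;;
    2) echo "linkchecker program error" ;;
  esac
}

describe_exit 1   # prints "broken links or warnings found"
```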

LIMITATIONS

       LinkChecker consumes memory for each queued URL to check. With
       thousands of queued URLs the amount of consumed memory can become
       quite large. This might slow down the program or even the whole
       system.


FILES

       ~/.linkchecker/linkcheckerrc - default configuration file
       ~/.linkchecker/blacklist - default blacklist logger output filename
       linkchecker-out.TYPE - default logger file output name
       http://docs.python.org/library/codecs.html#standard-encodings - valid
       output encodings
       http://docs.python.org/howto/regex.html - regular expression
       documentation


SEE ALSO

       linkcheckerrc(5)

AUTHOR

       Bastian Kleineidam <bastian.kleineidam@web.de>

       Copyright © 2000-2014 Bastian Kleineidam



LinkChecker                       2010-07-01                    LINKCHECKER(1)