LINKCHECKER(1)           LinkChecker commandline usage          LINKCHECKER(1)

NAME

       linkchecker - check HTML documents and websites for broken links

SYNOPSIS

       linkchecker [options] [file-or-url]...

DESCRIPTION

       LinkChecker features recursive checking, multithreading, output in
       colored or normal text, HTML, SQL, CSV or a sitemap graph in GML or
       XML, support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet
       and local file links, restriction of link checking with regular
       expression filters for URLs, proxy support, username/password
       authorization for HTTP and FTP, robots.txt exclusion protocol
       support, i18n support, a command line interface and a (Fast)CGI web
       interface (requires HTTP server).

EXAMPLES

       The most common use checks the given domain recursively, plus any
       URL pointing outside of the domain:
         linkchecker http://www.example.net/
       Beware that this checks the whole site, which can have thousands of
       URLs. Use the -r option to restrict the recursion depth.
       Don't connect to mailto: hosts, only check their URL syntax. All
       other links are checked as usual:
         linkchecker --ignore-url=^mailto: mysite.example.org
       Checking a local HTML file on Unix:
         linkchecker ../bla.html
       Checking a local HTML file on Windows:
         linkchecker c:\temp\test.html
       You can skip the http:// URL part if the domain starts with www.:
         linkchecker www.example.com
       You can skip the ftp:// URL part if the domain starts with ftp.:
         linkchecker -r0 ftp.example.org
       Generate a sitemap graph and convert it with the graphviz dot
       utility:
         linkchecker -odot -v www.example.com | dot -Tps > sitemap.ps

OPTIONS

   General options
       -h, --help
              Help me! Print usage information for this program.

       -fFILENAME, --config=FILENAME
              Use FILENAME as configuration file. By default LinkChecker
              first searches /etc/linkchecker/linkcheckerrc and then
              ~/.linkchecker/linkcheckerrc.

       -tNUMBER, --threads=NUMBER
              Generate no more than the given number of threads. The
              default number of threads is 10. To disable threading,
              specify a non-positive number.

       -V, --version
              Print version and exit.

       --stdin
              Read a list of white-space separated URLs to check from
              stdin.
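
       A minimal sketch of feeding a URL list via stdin (the file name
       urls.txt and the listed hosts are illustrative; the linkchecker call
       itself is shown as a comment so the snippet stays self-contained):

```shell
# Build a small URL list, one URL per line.
printf '%s\n' http://www.example.com/ http://www.example.org/ > urls.txt
cat urls.txt
# Then check it (assumes linkchecker is on PATH):
#   linkchecker --stdin < urls.txt
```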
   Output options
       -v, --verbose
              Log all checked URLs once. Default is to log only errors and
              warnings.

       --complete
              Log all URLs, including duplicates. Default is to log
              duplicate URLs only once.

       --no-warnings
              Don't log warnings. Default is to log warnings.

       -WREGEX, --warning-regex=REGEX
              Define a regular expression which prints a warning if it
              matches any content of the checked link. This applies only to
              valid pages, so we can get their content.
              Use this to check for pages that contain some form of error,
              for example "This page has moved" or "Oracle Application
              Server error".

       --warning-size-bytes=NUMBER
              Print a warning if content size info is available and exceeds
              the given number of bytes.

       --check-html
              Check syntax of HTML URLs with the local library (HTML tidy).

       --check-html-w3
              Check syntax of HTML URLs with the W3C online validator.

       --check-css
              Check syntax of CSS URLs with the local library (cssutils).

       --check-css-w3
              Check syntax of CSS URLs with the W3C online validator.

       --scan-virus
              Scan content of URLs for viruses with ClamAV.

       -q, --quiet
              Quiet operation, an alias for -o none. This is only useful
              with -F.

       -oTYPE[/ENCODING], --output=TYPE[/ENCODING]
              Specify the output type as text, html, sql, csv, gml, dot,
              xml, none or blacklist. The default type is text. The various
              output types are documented below.
              The ENCODING specifies the output encoding; the default is
              that of your locale. Valid encodings are listed at
              http://docs.python.org/library/codecs.html#standard-encodings.

       -FTYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]
              Output to a file linkchecker-out.TYPE,
              $HOME/.linkchecker/blacklist for blacklist output, or
              FILENAME if specified. The ENCODING specifies the output
              encoding; the default is that of your locale. Valid encodings
              are listed at
              http://docs.python.org/library/codecs.html#standard-encodings.
              The FILENAME and ENCODING parts of the none output type are
              ignored; otherwise, if the file already exists, it will be
              overwritten. You can specify this option more than once.
              Valid file output types are text, html, sql, csv, gml, dot,
              xml, none or blacklist. Default is no file output. The
              various output types are documented below. Note that you can
              suppress all console output with the option -o none.

       --no-status
              Do not print check status messages.

       -DSTRING, --debug=STRING
              Print debugging output for the given logger. Available
              loggers are cmdline, checking, cache, gui, dns and all.
              Specifying all is an alias for specifying all available
              loggers. The option can be given multiple times to debug with
              more than one logger. For accurate results, threading will be
              disabled during debug runs.

       --trace
              Print tracing information.

       --profile
              Write profiling data into a file named linkchecker.prof in
              the current working directory. See also --viewprof.

       --viewprof
              Print out previously generated profiling data. See also
              --profile.
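
       As a sketch of combining the output options above (assumes
       linkchecker is on PATH): write CSV results in UTF-8 to results.csv,
       write an HTML report to the default linkchecker-out.html, and
       suppress console output:

```shell
linkchecker -q -Fcsv/utf-8/results.csv -Fhtml http://www.example.com/
```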
   Checking options
       -rNUMBER, --recursion-level=NUMBER
              Check recursively all links up to the given depth. A negative
              depth enables infinite recursion. Default depth is infinite.

       --no-follow-url=REGEX
              Check but do not recurse into URLs matching the given regular
              expression.
              This option can be given multiple times.

       --ignore-url=REGEX
              Only check the syntax of URLs matching the given regular
              expression.
              This option can be given multiple times.

       -C, --cookies
              Accept and send HTTP cookies according to RFC 2109. Only
              cookies which are sent back to the originating server are
              accepted. Sent and accepted cookies are provided as
              additional logging information.

       --cookiefile=FILENAME
              Read a file with initial cookie data. The cookie data format
              is explained below.

       -a, --anchors
              Check HTTP anchor references. Default is not to check
              anchors. This option enables logging of the warning
              url-anchor-not-found.

       -uSTRING, --user=STRING
              Try the given username for HTTP and FTP authorization. For
              FTP the default username is anonymous. For HTTP there is no
              default username. See also -p.

       -p, --password
              Read a password from the console and use it for HTTP and FTP
              authorization. For FTP the default password is anonymous@.
              For HTTP there is no default password. See also -u.

       --timeout=NUMBER
              Set the timeout for connection attempts in seconds. The
              default timeout is 60 seconds.

       -PNUMBER, --pause=NUMBER
              Pause the given number of seconds between two subsequent
              connection requests to the same host. Default is no pause
              between requests.

       -NSTRING, --nntp-server=STRING
              Specify an NNTP server for news: links. Default is the
              environment variable NNTP_SERVER. If no host is given, only
              the syntax of the link is checked.

CONFIGURATION FILES

       Configuration files can specify all options above. They can also
       specify some options that cannot be set on the command line. See
       linkcheckerrc(5) for more info.

OUTPUT TYPES

       Note that by default only errors and warnings are logged. You
       should use the --verbose option to get the complete URL list,
       especially when outputting a sitemap graph format.

       text   Standard text logger, logging URLs in keyword: argument
              fashion.

       html   Log URLs in keyword: argument fashion, formatted as HTML.
              Additionally has links to the referenced pages. Invalid URLs
              have HTML and CSS syntax check links appended.

       csv    Log check result in CSV format with one URL per line.

       gml    Log parent-child relations between linked URLs as a GML
              sitemap graph.

       dot    Log parent-child relations between linked URLs as a DOT
              sitemap graph.

       gxml   Log check result as a GraphXML sitemap graph.

       xml    Log check result as machine-readable XML.

       sql    Log check result as an SQL script with INSERT commands. An
              example script to create the initial SQL table is included as
              create.sql.

       blacklist
              Suitable for cron jobs. Logs the check result into a file
              ~/.linkchecker/blacklist which only contains entries with
              invalid URLs and the number of times they have failed.

       none   Logs nothing. Suitable for debugging or checking the exit
              code.
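
       The blacklist type is meant for unattended runs. A sketch of a
       crontab entry (assumes linkchecker is on PATH and a cron that
       understands the @weekly shortcut; cron mails any console output to
       the crontab owner):

```
@weekly linkchecker -Fblacklist --no-status http://www.example.com/
```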
242

REGULAR EXPRESSIONS

       LinkChecker accepts Python regular expressions. See
       http://docs.python.org/howto/regex.html for an introduction.

       An addition is that a leading exclamation mark negates the regular
       expression.
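
       For example, a negated pattern can restrict checking to a single
       site, so that every URL not matching the pattern is only
       syntax-checked (a sketch; the single quotes keep the shell from
       interpreting the ! and the backslashes):

```shell
linkchecker --ignore-url='!^http://www\.example\.com/' http://www.example.com/
```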

COOKIE FILES

       A cookie file contains standard RFC 822 header data with the
       following possible names:
       Scheme (optional)
              Sets the scheme the cookies are valid for; default scheme is
              http.

       Host (required)
              Sets the domain the cookies are valid for.

       Path (optional)
              Gives the path the cookies are valid for; default path is /.

       Set-cookie (optional)
              Set cookie name/value. Can be given more than once.

       Multiple entries are separated by a blank line. The example below
       will send two cookies to all URLs starting with
       http://example.com/hello/ and one to all URLs starting with
       https://example.org/:

        Host: example.com
        Path: /hello
        Set-cookie: ID="smee"
        Set-cookie: spam="egg"

        Scheme: https
        Host: example.org
        Set-cookie: baggage="elitist"; comment="hologram"

PROXY SUPPORT

       To use a proxy on Unix or Windows, set the $http_proxy,
       $https_proxy or $ftp_proxy environment variables to the proxy URL.
       The URL should be of the form http://[user:pass@]host[:port].
       LinkChecker also detects manual proxy settings of Internet Explorer
       under Windows systems. On a Mac, use the Internet Config to select
       a proxy. You can also set a comma-separated domain list in the
       $no_proxy environment variable to ignore any proxy settings for
       these domains. Setting an HTTP proxy on Unix for example looks like
       this:

         export http_proxy="http://proxy.example.com:8080"

       Proxy authentication is also supported:

         export http_proxy="http://user1:mypass@proxy.example.org:8081"

       Setting a proxy on the Windows command prompt:

         set http_proxy=http://proxy.example.com:8080

PERFORMED CHECKS

       All URLs have to pass a preliminary syntax test. Minor quoting
       mistakes will issue a warning; all other invalid syntax issues are
       errors. After the syntax check passes, the URL is queued for
       connection checking. All connection check types are described below.
       HTTP links (http:, https:)
              After connecting to the given HTTP server, the given path or
              query is requested. All redirections are followed, and if a
              user/password is given it will be used as authorization when
              necessary. Permanently moved pages issue a warning. All final
              HTTP status codes other than 2xx are errors. HTML page
              contents are checked for recursion.

       Local files (file:)
              A regular, readable file that can be opened is valid. A
              readable directory is also valid. All other files, for
              example device files, unreadable or non-existing files, are
              errors. HTML or other parseable file contents are checked for
              recursion.

       Mail links (mailto:)
              A mailto: link eventually resolves to a list of email
              addresses. If one address fails, the whole list will fail.
              For each mail address we check the following things:
                1) Check the address syntax, both of the part before and
                   after the @ sign.
                2) Look up the MX DNS records. If no MX record is found,
                   print an error.
                3) Check if one of the mail hosts accepts an SMTP
                   connection. Check hosts with higher priority first.
                   If no host accepts SMTP, print a warning.
                4) Try to verify the address with the VRFY command. If an
                   answer is received, print the verified address as an
                   info.
       FTP links (ftp:)
              For FTP links we do:
                1) connect to the specified host
                2) try to login with the given user and password. The
                   default user is anonymous, the default password is
                   anonymous@.
                3) try to change to the given directory
                4) list the file with the NLST command

       Telnet links (telnet:)
              We try to connect and, if user/password are given, login to
              the given telnet server.

       NNTP links (news:, snews:, nntp:)
              We try to connect to the given NNTP server. If a news group
              or article is specified, try to request it from the server.

       Ignored links (javascript:, etc.)
              An ignored link will only print a warning. No further
              checking will be made.
              Here is a complete list of recognized, but ignored links.
              The most prominent of them should be JavaScript links.

                - acap:      (application configuration access protocol)
                - afs:       (Andrew File System global file names)
                - chrome:    (Mozilla specific)
                - cid:       (content identifier)
                - clsid:     (Microsoft specific)
                - data:      (data)
                - dav:       (dav)
                - fax:       (fax)
                - find:      (Mozilla specific)
                - gopher:    (Gopher)
                - imap:      (internet message access protocol)
                - isbn:      (ISBN (international book numbers))
                - javascript: (JavaScript)
                - ldap:      (Lightweight Directory Access Protocol)
                - mailserver: (access to data available from mail servers)
                - mid:       (message identifier)
                - mms:       (multimedia stream)
                - modem:     (modem)
                - nfs:       (network file system protocol)
                - opaquelocktoken: (opaquelocktoken)
                - pop:       (Post Office Protocol v3)
                - prospero:  (Prospero Directory Service)
                - rsync:     (rsync protocol)
                - rtsp:      (real time streaming protocol)
                - service:   (service location)
                - shttp:     (secure HTTP)
                - sip:       (session initiation protocol)
                - tel:       (telephone)
                - tip:       (Transaction Internet Protocol)
                - tn3270:    (Interactive 3270 emulation sessions)
                - vemmi:     (versatile multimedia interface)
                - wais:      (Wide Area Information Servers)
                - z39.50r:   (Z39.50 Retrieval)
                - z39.50s:   (Z39.50 Session)

RECURSION

       Before descending recursively into a URL, it has to fulfill several
       conditions. They are checked in this order:

       1. A URL must be valid.

       2. A URL must be parseable. This currently includes HTML files,
          Opera bookmarks files, and directories. If a file type cannot
          be determined (for example it does not have a common HTML file
          extension, and the content does not look like HTML), it is
          assumed to be non-parseable.

       3. The URL content must be retrievable. This is usually the case
          except for example mailto: or unknown URL types.

       4. The maximum recursion level must not be exceeded. It is
          configured with the --recursion-level option and is unlimited by
          default.

       5. It must not match the ignored URL list. This is controlled with
          the --ignore-url option.

       6. The Robots Exclusion Protocol must allow links in the URL to be
          followed recursively. This is checked by searching for a
          "nofollow" directive in the HTML header data.

       Note that the directory recursion reads all files in that
       directory, not just a subset like index.htm*.
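
       The "nofollow" directive in condition 6 is the robots meta tag; a
       minimal page fragment for illustration:

```html
<head>
  <!-- Pages carrying this tag are checked but not recursed into. -->
  <meta name="robots" content="nofollow">
</head>
```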
431

NOTES

       URLs on the commandline starting with ftp. are treated like
       ftp://ftp., URLs starting with www. are treated like http://www..
       You can also give local files as arguments.

       If you have your system configured to automatically establish a
       connection to the internet (e.g. with diald), it will connect when
       checking links not pointing to your local host. Use the -s and -i
       options to prevent this.

       JavaScript links are currently ignored.

       If your platform does not support threading, LinkChecker disables
       it automatically.

       You can supply multiple user/password pairs in a configuration
       file.

       When checking news: links, the given NNTP host doesn't need to be
       the same as the host of the user browsing your pages.

ENVIRONMENT

       NNTP_SERVER - specifies default NNTP server
       http_proxy - specifies default HTTP proxy server
       ftp_proxy - specifies default FTP proxy server
       no_proxy - comma-separated list of domains to not contact over a
       proxy server
       LC_MESSAGES, LANG, LANGUAGE - specify output language

RETURN VALUE

       The return value is non-zero when

       ·      invalid links were found,

       ·      link warnings were found and warnings are enabled, or

       ·      a program error occurred.
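
       This makes the exit status usable in scripts. A minimal sketch
       (assumes linkchecker is on PATH; the placeholder function run_check
       stands in for the real invocation so the control flow is clear):

```shell
# Stand-in for: linkchecker -o none http://www.example.net/
run_check() { true; }

if run_check; then
    echo "all links OK"
else
    # non-zero: broken links, enabled warnings, or a program error
    echo "problems found"
fi
```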

LIMITATIONS

       LinkChecker consumes memory for each queued URL to check. With
       thousands of queued URLs the amount of consumed memory can become
       quite large. This might slow down the program or even the whole
       system.

FILES

       /etc/linkchecker/linkcheckerrc, ~/.linkchecker/linkcheckerrc -
       default configuration files
       ~/.linkchecker/blacklist - default blacklist logger output filename
       linkchecker-out.TYPE - default logger file output name
       http://docs.python.org/library/codecs.html#standard-encodings -
       valid output encodings
       http://docs.python.org/howto/regex.html - regular expression
       documentation

SEE ALSO

       linkcheckerrc(5)

AUTHOR

       Bastian Kleineidam <calvin@users.sourceforge.net>


COPYRIGHT

       Copyright © 2000-2011 Bastian Kleineidam


LinkChecker                       2010-07-01                    LINKCHECKER(1)