htdig(1)                    General Commands Manual                   htdig(1)

NAME

       htdig - retrieve HTML documents for the ht://Dig search engine

SYNOPSIS

       htdig [options]

DESCRIPTION

       htdig retrieves HTML documents using the HTTP protocol and gathers
       information from them that can later be used to search those documents.
       This program is also referred to as the search robot.

OPTIONS

       -      Read the list of URLs to start indexing from standard input.
              This overrides both the start_url parameter specified in the
              config file and the file supplied to the -m option.

       -a     Use alternate work files. Tells htdig to append .work to the
              database file names, so that a second copy of the database is
              built. This allows htsearch to keep using the original files
              during the indexing run.

       -c configfile
              Use the specified configfile instead of the default.

       -h maxhops
              Restrict the dig to documents that are at most maxhops links
              away from the starting document. This only works if option -i
              is also given.

       -i     Initial. Do not use any old databases. Old databases are erased
              before the program runs.

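              The dependency of -h on -i can be checked before htdig is
              started. The following Python sketch is illustrative only:
              build_htdig_argv and its parameters are hypothetical names, not
              part of ht://Dig. It assembles an argument list from the options
              documented above and refuses a hop limit without -i.

```python
from typing import List, Optional


def build_htdig_argv(
    config: Optional[str] = None,
    initial: bool = False,
    max_hops: Optional[int] = None,
    alternate: bool = False,
) -> List[str]:
    """Assemble an htdig argument list (hypothetical helper)."""
    if max_hops is not None and not initial:
        # The man page states that -h only works together with -i.
        raise ValueError("-h requires -i")
    argv = ["htdig"]
    if initial:
        argv.append("-i")
    if alternate:
        argv.append("-a")
    if config is not None:
        argv += ["-c", config]
    if max_hops is not None:
        argv += ["-h", str(max_hops)]
    return argv


print(build_htdig_argv(config="/etc/htdig/htdig.conf", initial=True, max_hops=3))
```

              The returned list could then be handed to subprocess.run() on a
              system where htdig is installed.
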
       -m filename
              Minimal run. Only index the URLs given in the file filename,
              ignoring all others. The file should contain one URL per line.

       -s     Print statistics about the dig after completion.

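              As a small illustration of the file layout expected by -m, the
              Python sketch below turns a list of URLs into text with one URL
              per line. The function name format_url_list and the example
              URLs are made up for this illustration.

```python
from typing import Iterable


def format_url_list(urls: Iterable[str]) -> str:
    """Return the URLs as text, one per line, skipping blank entries."""
    cleaned = [u.strip() for u in urls if u.strip()]
    return "".join(u + "\n" for u in cleaned)


text = format_url_list([
    "http://www.example.com/",
    "  http://www.example.com/docs/index.html  ",
    "",
])
print(text, end="")
```

              The resulting text can be saved to a file and that file name
              passed to -m.
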
       -t     Create an ASCII version of the document database. This database
              is easy to parse with other programs, so information can be
              extracted from it for purposes other than searching, such as
              gathering statistics.

              Fieldname   Value
                  u       URL
                  t       Title
                  a       State
                          (0 normal, 1 not found, 2 not indexed, 3 obsolete)
                  m       Time of last modification reported by the server
                  s       Document size in bytes
                  H       Excerpt of the document
                  h       Meta description
                  l       Time of last retrieval
                  L       Count of links in the document (outgoing links)
                  b       Number of links to the document, also called
                          incoming links or backlinks
                  c       Hop count of this document
                  g       Signature of this document
                          (used to detect duplicates)
                  e       E-mail address to use for a notification from htnotify
                  n       Date on which such notification is sent
                  S       Subject of the notification message
                  d       The text of incoming links pointing to this document
                          (e.g. <a href="docURL">description</a>)
                  A       Anchors in the document (i.e. <A NAME=...)

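              The field letters above lend themselves to simple parsing. This
              page does not describe the exact layout of the ASCII file, so
              the Python sketch below ASSUMES that each record is a single
              line consisting of a document ID followed by tab-separated
              name:value fields. The names parse_record, FIELD_NAMES and
              STATE_NAMES, and the sample record, are hypothetical.

```python
from typing import Dict

# Field letters and state codes as listed in the table above.
FIELD_NAMES = {
    "u": "url", "t": "title", "a": "state", "m": "modified", "s": "size",
    "H": "excerpt", "h": "meta_description", "l": "last_retrieved",
    "L": "outgoing_links", "b": "backlinks", "c": "hop_count",
    "g": "signature", "e": "notify_email", "n": "notify_date",
    "S": "notify_subject", "d": "link_descriptions", "A": "anchors",
}
STATE_NAMES = {"0": "normal", "1": "not found", "2": "not indexed", "3": "obsolete"}


def parse_record(line: str) -> Dict[str, str]:
    """Parse one record. ASSUMED layout: 'docid<TAB>u:...<TAB>t:...'."""
    doc_id, *fields = line.rstrip("\n").split("\t")
    record = {"doc_id": doc_id}
    for field in fields:
        # Split on the first colon only, so URLs keep their own colons.
        letter, sep, value = field.partition(":")
        if not sep or letter not in FIELD_NAMES:
            continue  # skip anything that does not look like name:value
        record[FIELD_NAMES[letter]] = value
    if "state" in record:
        # Translate the numeric state code into its description.
        record["state"] = STATE_NAMES.get(record["state"], record["state"])
    return record


sample = "7\tu:http://www.example.com/\tt:Example\ta:0\ts:1234\tc:1"
print(parse_record(sample))
```

              If the real file uses a different separator, only the split()
              call needs to change.
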
       -u username:password
              Tells htdig to send the supplied username and password with each
              HTTP request. The credentials are encoded using the 'Basic'
              authentication method. A colon (:) must separate the username
              from the password.

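              To show what 'Basic' encoding amounts to, the Python sketch
              below builds the value of an HTTP Authorization header from a
              username:password string. It illustrates the general HTTP
              scheme, not htdig's own code; basic_auth_header is a
              hypothetical name and user:pass is a placeholder.

```python
import base64


def basic_auth_header(credentials: str) -> str:
    """Return the Authorization header value for 'username:password'."""
    if ":" not in credentials:
        # Mirrors the requirement that a colon separates the two parts.
        raise ValueError("expected username:password")
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return "Basic " + token


print(basic_auth_header("user:pass"))  # prints: Basic dXNlcjpwYXNz
```

              Base64 is an encoding, not encryption, so the credentials are
              readable by anyone who can observe the request.
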
       -v     Verbose mode. Each -v increases the verbosity of the program.
              Using more than two is probably only useful for debugging. The
              default verbose mode (a single -v) gives a progress report
              while digging. The exact format of the progress report is
              described in the section below.

       FORMAT OF THE PROGRESS REPORT GIVEN IN VERBOSE MODE
              A line is shown for each URL, with three numbers before the URL
              and some symbols after it. The first number is the number of
              documents parsed so far, the second is the DocID of this
              document, and the third is the hop count of the document (the
              number of hops from one of the start_url documents). The
              symbols printed after the URL have the following meaning:

              "*" is printed for a link already visited

              "+" is printed for a new link just queued

              "-" is printed for a link rejected for any of a number of
              reasons. To find out what those reasons are, run htdig with at
              least three -v options, i.e. -vvv.

       If there are no "*", "+" or "-" symbols after the URL, it does not mean
       that the document was not parsed or was empty, only that no links to
       other documents were found within it. With more verbose output, these
       symbols are interspersed among several lines of debugging output.

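       A progress line of this kind can be summarized mechanically. The exact
       separators are not specified on this page, so the Python sketch below
       ASSUMES that the three numbers and the URL are joined by colons and
       that the symbols follow a colon and a space. The names LINE_RE and
       parse_progress_line, and the sample line, are hypothetical.

```python
import re
from collections import Counter

# ASSUMED line shape: parsed:docid:hops:URL: symbols
LINE_RE = re.compile(r"^(\d+):(\d+):(\d+):(\S+):\s*([*+-]*)")


def parse_progress_line(line: str):
    """Return a dict for one progress line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    parsed, doc_id, hops, url, symbols = match.groups()
    counts = Counter(symbols)  # missing symbols count as zero
    return {
        "parsed": int(parsed),
        "doc_id": int(doc_id),
        "hops": int(hops),
        "url": url,
        "visited": counts["*"],
        "queued": counts["+"],
        "rejected": counts["-"],
    }


print(parse_progress_line("3:7:1:http://www.example.com/a.html: *++-"))
```

       A line without symbols yields zero for all three counters, which
       matches the case of a document that contains no links.
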
       FILES

       /etc/htdig/htdig.conf
              The default configuration file.

       /var/lib/htdig/db.docdb
              Stores data about each document (title, URL, etc.).

       /var/lib/htdig/db.words.db, /var/lib/htdig/db.words.db_weakcmpr
              Record which documents each word occurs in.

       /var/lib/htdig/db.excerpts
              Stores the start of each document to show the context of
              matches.

AUTHOR

       This manual page was written by Christian Schwarz, modified by Stijn de
       Bekker. It is updated and maintained by Robert Ribnitz and based on the
       HTML documentation of ht://Dig.

                                 21 July 1997                         htdig(1)
