htdig(1)                    General Commands Manual                   htdig(1)

NAME
       htdig - retrieve HTML documents for ht://Dig search engine

SYNOPSIS
       htdig [options]

DESCRIPTION
       Htdig retrieves HTML documents using the HTTP protocol and gathers
       information from these documents which can later be used to search
       them. This program is also referred to as the search robot.

OPTIONS
       -      Get the list of URLs to start indexing from standard input. This
              overrides the start_url parameter specified in the config file
              and the file supplied to the -m option.

       -a     Use alternate work files. Tells htdig to append .work to database
              files, causing a second copy of the database to be built. This
              allows the original files to be used by htsearch during the
              indexing run.

       -c configfile
              Use the specified configfile instead of the default.

       -h maxhops
              Restrict the dig to documents that are at most maxhops links away
              from the starting document. This only works if option -i is also
              given.

       -i     Initial. Do not use any old databases. Old databases will be
              erased before running the program.

       -m filename
              Minimal run. Only index the URLs given in the file filename,
              ignoring all others. URLs in the file should be formatted one URL
              per line.

       -s     Print statistics about the dig after completion.

       -t     Create an ASCII version of the document database. This database
              is easy to parse with other programs, so information can be
              extracted from it for purposes other than searching, such as
              gathering statistics. It contains the following fields:

              Fieldname  Value
              u          URL
              t          Title
              a          State
                         (0 normal, 1 not found, 2 not indexed, 3 obsolete)
              m          Time of last modification reported by the server
              s          Document size in bytes
              H          Excerpt of the document
              h          Meta description
              l          Time of last retrieval
              L          Count of links in the document (outgoing links)
              b          Number of links to the document, also called
                         incoming links or backlinks
              c          Hop count of this document
              g          Signature of this document
                         (used to detect duplicates)
              e          E-mail address to use for a notification from htnotify
              n          Date on which such a notification is sent
              S          Subject of the notification message
              d          The text of incoming links pointing to this document
                         (e.g. <a href="docURL">description</a>)
              A          Anchors in the document (i.e. <A NAME=...)

       -u username:password
              Tells htdig to send the supplied username and password with each
              HTTP request. The credentials are encoded using the 'Basic'
              authentication method. There must be a colon (:) between the
              username and password.

       -v     Verbose mode. Each -v increases the verbosity of the program.
              Using more than two is probably only useful for debugging. The
              default verbose mode (a single -v) gives a progress report while
              digging. See the section below for the exact format of the
              progress report.

FORMAT OF THE PROGRESS REPORT GIVEN IN VERBOSE MODE
       A line is shown for each URL, with three numbers before the URL and
       some symbols after it. The first number is the number of documents
       parsed so far, the second is the DocID for this document, and the third
       is the hop count of the document (the number of hops from one of the
       start_url documents). Meaning of the symbols printed after the URL:

       "*"    is printed for a link already visited

       "+"    is printed for a new link just queued

       "-"    is printed for a link rejected for any of a number of reasons.
              To find out what those reasons are, run htdig with at least
              three -v options, i.e. -vvv.

       If there are no "*", "+" or "-" symbols after the URL, it does not mean
       the document was not parsed or was empty, only that no links to other
       documents were found within it. With more verbose output, these symbols
       are interspersed among several lines of debugging output.

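       To illustrate the report format described above, the following Python
       sketch splits one progress line into its parts and counts the symbols.
       This page fixes only the order of the items, not the separators, so
       the colon-separated layout, the sample line and the helper name
       parse_progress_line are all assumptions made for this example and are
       not part of htdig.

```python
import re
from collections import Counter

# Assumed layout: "<parsed>:<docid>:<hops>:<url>: <symbols>"
# The separators are an assumption; only the field order is documented.
LINE_RE = re.compile(r"^(\d+):(\d+):(\d+):(\S+):\s*([*+-]*)")


def parse_progress_line(line):
    """Return the parts of one progress line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    parsed, doc_id, hops, url, symbols = match.groups()
    counts = Counter(symbols)
    return {
        "parsed": int(parsed),      # documents parsed so far
        "doc_id": int(doc_id),      # DocID of this document
        "hops": int(hops),          # hop count from a start_url document
        "url": url,
        "visited": counts["*"],     # links already visited
        "queued": counts["+"],      # new links just queued
        "rejected": counts["-"],    # links rejected
    }


# Hypothetical sample line, written for this example only.
sample = "12:15:2:http://www.example.com/docs/: ++*-+"
print(parse_progress_line(sample))
```

       A line with no symbols still parses under this assumed layout; all
       three counts are then zero, matching the case where no links were
       found in the document.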
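       For the -u option, the 'Basic' method amounts to base64-encoding the
       username:password string and sending it in an Authorization header.
       The Python sketch below shows that encoding. The helper name
       basic_auth_header and the use of UTF-8 are assumptions of this
       example; how htdig builds the header internally is not described
       here.

```python
import base64


def basic_auth_header(credentials):
    """Build an HTTP Basic Authorization value from 'username:password'."""
    # The colon is mandatory, as for the -u option.
    if ":" not in credentials:
        raise ValueError("expected username:password with a colon separator")
    # UTF-8 is an assumption; the Basic scheme itself only requires base64.
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return "Basic " + token


# Well-known example credentials, not real ones.
print(basic_auth_header("Aladdin:open sesame"))
# prints "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
```

       Base64 is an encoding, not encryption, so the credentials can be
       recovered by anyone who can read the HTTP request.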
FILES
       /etc/htdig/htdig.conf
              The default configuration file.

       /var/lib/htdig/db.docdb
              Stores data about each document (title, URL, etc.).

       /var/lib/htdig/db.words.db, /var/lib/htdig/db.words.db_weakcmpr
              Record which documents each word occurs in.

       /var/lib/htdig/db.excerpts
              Stores the start of each document to show context of matches.

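       The ASCII version of the document database produced by -t can be read
       with a few lines of code. The Python sketch below maps the one-letter
       field names from the table under -t to readable keys. The record
       layout is an assumption of this example: one document per line,
       starting with the document ID and followed by tab-separated
       fieldname:value pairs. Check it against an actual dump before relying
       on it. The helper name parse_dump_line, the readable key names and
       the sample record are hypothetical.

```python
# One-letter field names as listed for the -t option; the readable
# names on the right are chosen for this example only.
FIELD_NAMES = {
    "u": "url", "t": "title", "a": "state", "m": "modified", "s": "size",
    "H": "excerpt", "h": "meta_description", "l": "last_retrieved",
    "L": "outgoing_links", "b": "backlinks", "c": "hop_count",
    "g": "signature", "e": "notify_email", "n": "notify_date",
    "S": "notify_subject", "d": "link_descriptions", "A": "anchors",
}


def parse_dump_line(line):
    """Parse one record, assuming '<docid>\\t<field>:<value>\\t...'."""
    parts = line.rstrip("\n").split("\t")
    record = {"doc_id": parts[0]}
    for part in parts[1:]:
        # Split on the first colon only, so URLs keep their own colons.
        key, sep, value = part.partition(":")
        if sep:
            record[FIELD_NAMES.get(key, key)] = value
    return record


# Hypothetical sample record, written for this example only.
sample = "7\tu:http://www.example.com/\tt:Example page\ta:0\ts:1234\tc:1"
rec = parse_dump_line(sample)
print(rec["url"], rec["title"], rec["size"])
```

       Values are kept as strings; convert fields such as the size or hop
       count to integers only once the layout has been confirmed.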
SEE ALSO
       Please refer to the HTML pages (in the htdig-doc package) at
       /usr/share/doc/htdig-doc/html/index.html and the manual pages
       htdigconfig(8), htmerge(1), htnotify(1), htsearch(1) and rundig(1) for
       a detailed description of ht://Dig and its commands.

AUTHOR
       This manual page was written by Christian Schwarz and modified by Stijn
       de Bekker. It is updated and maintained by Robert Ribnitz and is based
       on the HTML documentation of ht://Dig.

21 July 1997                                                          htdig(1)