BOGOFILTER(1) BOGOFILTER(1)

NAME
bogofilter - fast Bayesian spam filter

SYNOPSIS
bogofilter [help options | classification options |
registration options | parameter options | info options]
[general options] [config file options]

where

help options are:

    [-h] [--help] [-V] [-Q]

classification options are:

    [-p] [-e] [-t] [-T] [-u] [-H] [-M] [-b] [-B object ...] [-R]
    [general options] [parameter options] [config file options]

registration options are:

    [-s | -n] [-S | -N] [general options]

general options are:

    [-c filename] [-C] [-d dir] [-k cachesize] [-l] [-L tag] [-I filename]
    [-O filename]

parameter options are:

    [-E value[,value]] [-m value[,value][,value]] [-o value[,value]]

info options are:

    [-v] [-y date] [-D] [-x flags]

config file options are:

    [--option=value]

Note: Use bogofilter --help to display the complete list of options.

DESCRIPTION
Bogofilter is a Bayesian spam filter. In its normal mode of operation,
it takes an email message or other text on standard input, does a
statistical check against lists of "good" and "bad" words, and returns
a status code indicating whether or not the message is spam.
Bogofilter is designed around a fast algorithm, uses the Berkeley DB
for fast startup and lookups, is coded directly in C, and is tuned for
speed, so it can be used in production by sites that process a lot of
mail.

THEORY OF OPERATION
Bogofilter treats its input as a bag of tokens. Each token is checked
against a wordlist, which maintains counts of the number of times it
has occurred in non-spam and spam mails. These counts are used to
compute an estimate of the probability that a message in which the
token occurs is spam. Those estimates are combined to indicate whether
the message is spam or ham.

While this method sounds crude compared to the more usual
pattern-matching approach, it turns out to be extremely effective. Paul
Graham's paper A Plan For Spam [1] is recommended reading.

This program substantially improves on Paul's proposal by doing smarter
lexical analysis. Bogofilter does proper MIME decoding and reasonable
HTML parsing. Special kinds of tokens like hostnames and IP addresses
are retained as recognition features rather than broken up. Various
kinds of MTA cruft such as dates and message-IDs are ignored so as not
to bloat the wordlist. Tokens found in various header fields are marked
appropriately.

Another improvement is that this program offers Gary Robinson's
suggested modifications to the calculations (see the parameters robx
and robs below). These modifications are described in Robinson's paper
Spam Detection [2].

Since then, Robinson (see his Linux Journal article A Statistical
Approach to the Spam Problem [3]) and others have realized that the
calculation can be further optimized using Fisher's method. Another
improvement [4] compensates for token redundancy by applying separate
effective size factors (ESF) to spam and nonspam probability
calculations.

In short, this is how it works: The estimates for the spam
probabilities of the individual tokens are combined using the "inverse
chi-square function". Its value indicates how badly the null hypothesis
(that the message is just a random collection of independent words with
probabilities given by our previous estimates) fails. This function is
very sensitive to small probabilities (hammish words), but not to high
probabilities (spammish words), so the value only indicates strong
hammish signs in a message. Next, using inverse probabilities for the
tokens, the same computation is done again, giving an indicator that a
message looks strongly spammish. Finally, those two indicators are
subtracted (and scaled into the interval from 0 to 1). This combined
indicator (bogosity) is close to 0 if the signs for a hammish message
are stronger than for a spammish message and close to 1 if the
situation is the other way round. If signs for both are equally strong,
the value will be near 0.5. Since such messages don't give a clear
indication, there is a tristate mode in bogofilter to mark them as
unsure, while the clear messages are marked as spam or ham,
respectively. In two-state mode, every message is marked as either spam
or ham.
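
The combination step described above can be sketched in a few lines of
Python. This is only an illustration, not bogofilter's C code: the
names chi2_tail and bogosity are hypothetical, the per-token estimates
are taken as given, the effective size factors are left out, and the
"inverse chi-square function" is assumed to be the upper-tail
probability of a chi-square distribution with 2n degrees of freedom.

```python
import math


def chi2_tail(x2, df):
    """Upper-tail probability of a chi-square variable; df must be even."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def bogosity(token_probs):
    """Combine per-token spam estimates, each strictly between 0 and 1."""
    n = len(token_probs)
    if n == 0:
        return 0.5  # assumption: no evidence means no opinion
    # Close to 0 when the message contains strongly hammish tokens.
    h_tail = chi2_tail(-2.0 * sum(math.log(p) for p in token_probs), 2 * n)
    # Same computation on inverted probabilities: close to 0 when the
    # message contains strongly spammish tokens.
    s_tail = chi2_tail(-2.0 * sum(math.log(1.0 - p) for p in token_probs), 2 * n)
    # Subtract the two indicators and scale into the interval 0..1.
    return (1.0 + h_tail - s_tail) / 2.0
```

With this sketch, a message made only of strongly hammish tokens scores
near 0, one made only of strongly spammish tokens scores near 1, and a
message with equally strong signs for both lands near 0.5.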

Various parameters influence these calculations. The most important
are:

robx: the score given to a token which has not been seen before. robx
is the probability that the token is spammish.

robs: a weight on robx which moves the probability of a rarely seen
token towards robx.

min-dev: a minimum distance from 0.5 for tokens to use in the
calculation. Only tokens farther away from 0.5 than this value are
used.

spam-cutoff: messages with scores greater than or equal to this value
will be marked as spam.

ham-cutoff: If zero or equal to spam-cutoff, all messages with values
strictly below spam-cutoff are marked as ham, all others as spam
(two-state). Otherwise, messages with values less than or equal to
ham-cutoff are marked as ham, messages with values strictly between
ham-cutoff and spam-cutoff are marked as unsure, and the rest as spam
(tristate).

sp-esf: the effective size factor (ESF) for spam.

ns-esf: the ESF for nonspam. These ESF values default to 1.0, which is
the same as not using ESF in the calculation. Values suitable to a
user's email population can be determined with the aid of the bogotune
program.
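
As a rough illustration of how these parameters interact, the following
Python sketch applies robs and robx in the form of Robinson's f(w),
filters by min-dev, and maps a score to a class using the two cutoffs.
The function names are hypothetical, ESF is ignored, and the numbers
used in the usage note are illustrative values only, not a statement of
bogofilter's defaults.

```python
def token_score(spam_count, good_count, spam_msgs, good_msgs, robs, robx):
    """Robinson's f(w): pull a token's raw estimate towards robx."""
    n = spam_count + good_count
    spam_rate = spam_count / spam_msgs if spam_msgs else 0.0
    good_rate = good_count / good_msgs if good_msgs else 0.0
    if n == 0 or spam_rate + good_rate == 0.0:
        return robx  # never-seen token: only robx is available
    p = spam_rate / (spam_rate + good_rate)
    return (robs * robx + n * p) / (robs + n)


def significant(scores, min_dev):
    """Keep only the scores farther away from 0.5 than min_dev."""
    return [f for f in scores if abs(f - 0.5) > min_dev]


def classify(score, spam_cutoff, ham_cutoff):
    """Map a message score to spam, ham, or unsure."""
    if score >= spam_cutoff:
        return "spam"
    if ham_cutoff == 0 or ham_cutoff == spam_cutoff:
        return "ham"  # two-state mode
    return "ham" if score <= ham_cutoff else "unsure"  # tristate mode
```

For example, classify(0.5, 0.99, 0) yields "ham" because a zero
ham-cutoff selects two-state mode, while classify(0.5, 0.99, 0.45)
yields "unsure".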

OPTIONS
HELP OPTIONS

The -h option prints the help message and exits.

The -V option prints the version number and exits.

The -Q (query) option prints bogofilter's configuration, i.e.
registration parameters, parsing options, bogofilter directory, etc.

CLASSIFICATION OPTIONS

The -p (passthrough) option outputs the message with an X-Bogosity line
at the end of the message header. This requires keeping the entire
message in memory when it's read from stdin (or from a pipe or socket).
If the message is read from a file that can be rewound, bogofilter will
read it a second time.

The -e (embed) option tells bogofilter to exit with code 0 if the
message can be classified, i.e. if there is no error. Normally
bogofilter uses different codes for spam, ham, and unsure
classifications, but this simplifies using bogofilter with procmail or
maildrop.

The -t (terse) option tells bogofilter to print an abbreviated
spamicity message containing 1 letter and the score. Spam is indicated
by "Y", ham by "N", and unsure by "U". Note: the formatting can be
customized using the config file.

The -T option provides an invariant terse mode for scripts to use.
Bogofilter will print an abbreviated spamicity message containing 1
letter and the score. Spam is indicated by "S", ham by "H", and unsure
by "U".

The -TT option provides an invariant terse mode for scripts to use.
Bogofilter prints only the score and displays it to 16 significant
digits.
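
A script consuming -T output might parse it as in the following Python
sketch. The exact layout of the line is an assumption here (one letter,
whitespace, then the score); parse_terse is a hypothetical helper, not
part of bogofilter.

```python
import re

# Assumed layout of a -T line: one of S, H, U, whitespace, then the score.
_TERSE = re.compile(r"^([SHU])\s+(\S+)\s*$")
_LABELS = {"S": "spam", "H": "ham", "U": "unsure"}


def parse_terse(line):
    """Return (classification, score) for one line of -T style output."""
    match = _TERSE.match(line)
    if match is None:
        raise ValueError(f"not a -T style line: {line!r}")
    return _LABELS[match.group(1)], float(match.group(2))
```

Lines that do not fit the assumed layout raise ValueError rather than
being guessed at.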

The -u option tells bogofilter to register the message's text after
classifying it as spam or non-spam. A spam message will be registered
on the spamlist and a non-spam message on the goodlist. If the
classification is "unsure", the message will not be registered.
Effectively this option runs bogofilter with the -s or -n flag, as
appropriate. Caution is urged in the use of this capability, as any
classification errors bogofilter may make will be preserved and will
accumulate until manually corrected with the -Sn and -Ns option
combinations. Note that this option causes the database to be opened
for write access, which can entail massive slowdowns through lock
contention and synchronous I/O operations.

The -H option tells bogofilter not to tag tokens from the header. This
option is for testing; you should not use it in normal operation.

The -M option tells bogofilter to process its input as an
mbox-formatted file. If the -v or -t option is also given, a spamicity
line will be printed for each message.

The -b (streaming bulk mode) option tells bogofilter to classify
multiple objects whose names are read from stdin. If the -v or -t
option is also given, bogofilter will print a line giving file name and
classification information for each file. This is an alternative to -B,
which lists objects on the command line.

An object in this context is a maildir (autodetected) or, if it's not a
maildir, a single mail, unless -M is given, in which case it's
processed as mbox. (The Content-Length: header is not taken into
account currently.)

When reading mbox format, bogofilter relies on the empty line after a
mail. If needed, formail -es will ensure this is the case.

The -B object ... (bulk mode) option tells bogofilter to classify
multiple objects named on the command line. The objects may be
filenames (for single messages), mailboxes (files with multiple
messages), or directories (of maildir and MH format). If the -v or -t
option is also given, bogofilter will print a line giving file name and
classification information for each file. This is an alternative to -b,
which lists objects on stdin.

The -R option tells bogofilter to output an R data frame in text form
on the standard output. See the section on integration with R, below,
for further detail.

REGISTRATION OPTIONS

The -s option tells bogofilter to register the text presented as spam.
The database is created if absent.

The -n option tells bogofilter to register the text presented as
non-spam.

Bogofilter doesn't detect if a message is registered twice. If you do
this by accident, the token counts will be off by 1 from what you
really want and the corresponding spam scores will be slightly off.
Given a large number of tokens and messages in the wordlist, this
doesn't matter. The problem can be corrected by using the -S option or
the -N option.

The -S option tells bogofilter to undo a prior registration of the same
message as spam. If a message was incorrectly entered as spam by -s or
-u and you want to remove it and enter it as non-spam, use -Sn. If -S
is used for a message that wasn't registered as spam, the counts will
still be decremented.

The -N option tells bogofilter to undo a prior registration of the same
message as non-spam. If a message was incorrectly entered as non-spam
by -n or -u and you want to remove it and enter it as spam, use -Ns.
If -N is used for a message that wasn't registered as non-spam, the
counts will still be decremented.
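
The bookkeeping behind these four options can be pictured with a toy
in-memory wordlist in Python. This is a sketch under assumptions, not
bogofilter's storage code: the Wordlist class is hypothetical, each
token is counted once per message, and counts are clamped at zero.

```python
from collections import Counter


class Wordlist:
    """Toy stand-in for the wordlist: per-token and per-message counts."""

    def __init__(self):
        self.spam = Counter()
        self.good = Counter()
        self.spam_msgs = 0
        self.good_msgs = 0

    def register(self, tokens, as_spam):
        """Like -s (as_spam=True) or -n (as_spam=False)."""
        counts = self.spam if as_spam else self.good
        for token in set(tokens):  # assumption: once per message
            counts[token] += 1
        if as_spam:
            self.spam_msgs += 1
        else:
            self.good_msgs += 1

    def unregister(self, tokens, as_spam):
        """Like -S (as_spam=True) or -N (as_spam=False)."""
        counts = self.spam if as_spam else self.good
        for token in set(tokens):
            # Decrement even if the message was never registered;
            # clamping at zero is an assumption of this sketch.
            counts[token] = max(0, counts[token] - 1)
        if as_spam:
            self.spam_msgs = max(0, self.spam_msgs - 1)
        else:
            self.good_msgs = max(0, self.good_msgs - 1)
```

In these terms, -Sn corresponds to unregister(tokens, as_spam=True)
followed by register(tokens, as_spam=False), and registering the same
message twice simply leaves every one of its counts one too high.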

GENERAL OPTIONS

The -c filename option tells bogofilter to read the config file named.

The -C option prevents bogofilter from reading configuration files.

The -d dir option allows you to set the directory for the database. See
the ENVIRONMENT section for other directory setting options.

The -k cachesize option sets the cache size for the Berkeley DB
subsystem, in units of 1 MiB (1,048,576 bytes). Properly sizing the
cache improves bogofilter's performance. The recommended size is one
third of the size of the database file. You can run the bogotune script
(in the tuning directory) to determine the recommended size.
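
The one-third rule can be turned into a -k value with a short Python
calculation. This is a sketch, not what bogotune does:
recommended_cache_mib is a hypothetical helper, and rounding up with a
minimum of 1 MiB is an assumption.

```python
import math

MIB = 1024 * 1024  # 1,048,576 bytes, the unit used by -k


def recommended_cache_mib(db_size_bytes):
    """One third of the database size, in whole MiB (rounded up)."""
    return max(1, math.ceil(db_size_bytes / 3 / MIB))
```

The database size could be obtained with os.path.getsize() on the
wordlist file; a 300 MiB database gives a cache size of 100.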

The -l option writes an informational line to the system log each time
bogofilter is run. The information logged depends on how bogofilter is
run.

The -L tag option configures a tag which can be included in the
information being logged by the -l option, but for now it requires a
custom format that includes the %l string. This option implies -l.

The -I filename option tells bogofilter to read its input from the
specified file, rather than from stdin.

The -O filename option tells bogofilter where to write its output in
passthrough mode. Note that this only works when -p is explicitly
given.

PARAMETER OPTIONS

The -E value[,value] option allows setting the sp-esf value and the
ns-esf value. With two values, both sp-esf and ns-esf are set. If only
one value is given, parameters are set as described in the note below.

The -m value[,value][,value] option allows setting the min-dev value
and, optionally, the robs and robx values. With three values, min-dev,
robs, and robx are all set. If fewer values are given, parameters are
set as described in the note below.

The -o value[,value] option allows setting the spam-cutoff and
ham-cutoff values. With two values, both spam-cutoff and ham-cutoff are
set. If only one value is given, parameters are set as described in the
note below.

Note: All of these options allow fewer values to be provided. Values
can be skipped by using just the comma delimiter, in which case the
corresponding parameter(s) won't be changed. If only the first value is
provided, then only the first parameter is set. Trailing values can be
skipped, in which case the corresponding parameters won't be changed.
Within the parameter list, spaces are not allowed after commas.
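
The skipping rules in the note can be made concrete with a small Python
sketch. It mimics the described behaviour only; apply_values is a
hypothetical helper and does not reproduce bogofilter's own option
parser.

```python
def apply_values(spec, current):
    """Apply a 'value[,value][,value]' list to the current parameters.

    Empty fields and missing trailing fields leave the corresponding
    parameter unchanged.
    """
    if " " in spec:
        raise ValueError("spaces are not allowed in a parameter list")
    fields = spec.split(",")
    if len(fields) > len(current):
        raise ValueError("too many values")
    updated = list(current)
    for i, field in enumerate(fields):
        if field:  # an empty field means: keep the old value
            updated[i] = float(field)
    return updated
```

Taking [min-dev, robs, robx] as the current values, the argument "0.4"
changes only min-dev, while ",,0.6" changes only robx.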

INFO OPTIONS

The -v option produces a report to standard output on bogofilter's
analysis of the input. Each additional v will increase the verbosity of
the output, up to a maximum of 4. With -vv, the report lists the tokens
whose association with spam deviates most from the mean of 0.5.

The -y date option can be used to override the current date when
timestamping tokens. A value of zero (0) turns off timestamping.

The -D option redirects debug output to stdout.

The -x flags option allows setting of debug flags for printing debug
information. See header file debug.h for the list of usable flags.

CONFIG FILE OPTIONS

Using GNU longopt -- syntax, a config file's name=value statement
becomes a command line's --option=value. Use the command bogofilter
--help for a list of options and see bogofilter.cf.example for more
info on them. For example, to change the X-Bogosity header to
"X-Spam-Header", use:

    --spam-header-name=X-Spam-Header

ENVIRONMENT
Bogofilter uses a database directory, which can be set in the config
file. If not set there, bogofilter will use the value of
BOGOFILTER_DIR. Both can be overridden by the -d dir option. If none of
these is available, bogofilter will use the directory $HOME/.bogofilter.
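
The lookup order just described can be written down as a short Python
function. It is a sketch of the stated precedence only; database_dir
and its parameters are hypothetical names, not bogofilter internals.

```python
import os


def database_dir(cli_dir=None, config_dir=None, environ=None):
    """Pick the database directory in the order given in the text."""
    env = os.environ if environ is None else environ
    if cli_dir:  # -d dir overrides everything else
        return cli_dir
    if config_dir:  # value set in the config file
        return config_dir
    if env.get("BOGOFILTER_DIR"):  # environment variable
        return env["BOGOFILTER_DIR"]
    return os.path.join(env.get("HOME", ""), ".bogofilter")
```

Passing a plain dictionary as environ makes the function easy to try
out without touching the real environment.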

CONFIGURATION
The bogofilter command line allows setting of many options that
determine how bogofilter operates. File /etc/bogofilter.cf can be used
to set additional parameters that affect its operation. File
/etc/bogofilter.cf.example has samples of all of the parameters. Status
and logging messages can be customized for each site.

RETURN VALUES
0 for spam; 1 for non-spam; 2 for unsure; 3 for I/O or other errors.

If both -p and -e are used, the return values are: 0 for spam or
non-spam; 3 for I/O or other errors.

Error 3 usually means that the wordlist file bogofilter wants to read
at startup is missing or that the hard disk has filled up in -p mode.
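
A calling script has to keep in mind that 0 means spam here, not
"success". The following Python sketch maps the documented codes to
labels; interpret and classify_file are hypothetical helpers, treating
any undocumented code as an error is an assumption, and classify_file
needs bogofilter on the PATH.

```python
import subprocess

STATUS = {0: "spam", 1: "ham", 2: "unsure", 3: "error"}


def interpret(returncode, embed=False):
    """Translate a bogofilter exit status into a label.

    With embed=True (the -e option), 0 only says that the message
    could be classified.
    """
    if embed:
        return "classified" if returncode == 0 else "error"
    return STATUS.get(returncode, "error")


def classify_file(path):
    """Run bogofilter on one message file and interpret the result."""
    result = subprocess.run(["bogofilter", "-I", path])
    return interpret(result.returncode)
```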

INTEGRATION WITH OTHER TOOLS
Use with procmail

The following recipe (a) spam-bins anything that bogofilter rates as
spam, (b) registers the words in messages rated as spam as such, and
(c) registers the words in messages rated as non-spam as such. With
this in place, it will normally only be necessary for the user to
intervene (with -Ns or -Sn) when bogofilter miscategorizes something.

    # filter mail through bogofilter, tagging it as Ham, Spam, or Unsure,
    # and updating the wordlist

    :0fw
    | bogofilter -u -e -p

    # if bogofilter failed, return the mail to the queue;
    # the MTA will retry to deliver it later
    # 75 is the value for EX_TEMPFAIL in /usr/include/sysexits.h

    :0e
    { EXITCODE=75 HOST }

    # file the mail to spam-bogofilter if it's spam.

    :0:
    * ^X-Bogosity: Spam, tests=bogofilter
    spam-bogofilter

    # file the mail to unsure-bogofilter
    # if it's neither ham nor spam.

    :0:
    * ^X-Bogosity: Unsure, tests=bogofilter
    unsure-bogofilter

    # With this recipe, you can train bogofilter starting with an empty
    # wordlist. Be sure to check your unsure-folder regularly, take the
    # messages out of it, classify them as ham (or spam), and use them to
    # train bogofilter.

The following procmail rule will take mail on stdin and save it to file
spam if bogofilter thinks it's spam:

    :0HB:
    * ? bogofilter
    spam

and this similar rule will also register the tokens in the mail
according to the bogofilter classification:

    :0HB:
    * ? bogofilter -u
    spam

If bogofilter fails (returning 3), the message will be treated as
non-spam.

This one is for maildrop. It automatically defers the mail and retries
later when the xfilter command fails. Use this in your ~/.mailfilter:

    xfilter "bogofilter -u -e -p"
    if (/^X-Bogosity: Spam, tests=bogofilter/)
    {
        to "spam-bogofilter"
    }

The following .muttrc lines will create mutt macros for dispatching
mail to bogofilter.

    macro index d "<enter-command>unset wait_key\n\
    <pipe-entry>bogofilter -n\n\
    <enter-command>set wait_key\n\
    <delete-message>" "delete message as non-spam"
    macro index \ed "<enter-command>unset wait_key\n\
    <pipe-entry>bogofilter -s\n\
    <enter-command>set wait_key\n\
    <delete-message>" "delete message as spam"

Integration with Mail Transport Agent (MTA)

bogofilter can also be integrated into an MTA to filter all incoming
mail. While the specific implementation is MTA dependent, the general
steps are as follows:

1. Install bogofilter on the mail server.

2. Prime the bogofilter databases with a spam and non-spam corpus.
   Since bogofilter will be serving a larger community, it is important
   to prime it with a representative set of messages.

3. Set up the MTA to invoke bogofilter on each message. While this is
   an MTA specific step, you'll probably need to use the -p, -u, and -e
   options.

4. Set up a mechanism for users to register spam/non-spam messages, as
   well as to correct mis-classifications. The most generic solution is
   to set up alias email addresses to which users bounce messages.

See the doc and contrib directories for more information.

Use of R to verify bogofilter's calculations

The -R option tells bogofilter to generate an R data frame. The data
frame contains one row per token analyzed. Each such row contains the
token, the sum of its database "good" and "spam" counts, the "good"
count divided by the number of non-spam messages used to create the
training database, the "spam" count divided by the spam message count,
Robinson's f(w) for the token, the natural logs of (1 - f(w)) and f(w),
and an indicator character (+ if the token's f(w) value exceeded the
minimum deviation from 0.5, - if it didn't). There is one additional
row at the end of the table that contains a label in the token field,
followed by the number of words actually used (the ones with +
indicators), Robinson's P, Q, S, s and x values, and the minimum
deviation.

The R data frame can be saved to a file and later read into an R
session (see the R project website [5] for information about the
mathematics package R). Provided with the bogofilter distribution is a
simple R script (file bogo.R) that can be used to verify bogofilter's
calculations. Instructions for its use are included in the script in
the form of comments.

LOG MESSAGES
Bogofilter writes messages to the system log when the -l option is
used. What is written depends on which other flags are used.

A classification run will generate (we are not showing the date and
host part here):

    bogofilter[1412]: X-Bogosity: Ham, spamicity=0.000227
    bogofilter[1415]: X-Bogosity: Spam, spamicity=0.998918

Using -u to classify a message and update a wordlist will produce (on
a single line):

    bogofilter[1426]: X-Bogosity: Spam, spamicity=0.998918,
      register -s, 329 words, 1 messages

Registering words (-l and -s, -n, -S, or -N) will produce:

    bogofilter[1440]: register-n, 255 words, 1 messages

A registration run (using -s, -n, -N, or -S) will generate messages
like:

    bogofilter[17330]: register-n, 574 words, 3 messages
    bogofilter[6244]: register-s, 1273 words, 4 messages
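
Log lines of the kinds shown above can be picked apart with two regular
expressions, as in this Python sketch. The patterns are derived only
from the sample lines in this section, so they are an assumption about
the format rather than a specification; parse_log_line is a
hypothetical helper.

```python
import re

# Patterns modelled on the sample lines above (an assumption).
_CLASSIFY = re.compile(r"X-Bogosity: (\w+), spamicity=([0-9.]+)")
_REGISTER = re.compile(r"register\s*-([snSN]), (\d+) words, (\d+) messages")


def parse_log_line(line):
    """Extract classification and/or registration data from a log line."""
    info = {}
    match = _CLASSIFY.search(line)
    if match:
        info["class"] = match.group(1)
        info["spamicity"] = float(match.group(2))
    match = _REGISTER.search(line)
    if match:
        info["register"] = match.group(1)
        info["words"] = int(match.group(2))
        info["messages"] = int(match.group(3))
    return info
```

A line produced with -u carries both parts, so the returned dictionary
then holds the classification as well as the registration counts.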

FILES
/etc/bogofilter.cf
    System configuration file.

~/.bogofilter.cf
    User configuration file.

~/.bogofilter/wordlist.db
    Combined list of good and spam tokens.

AUTHOR
Eric S. Raymond <esr@thyrsus.com>.
David Relson <relson@osagesoftware.com>.
Matthias Andree <matthias.andree@gmx.de>.
Greg Louis <glouis@dynamicro.on.ca>.

For updates, see the bogofilter project page [6].

SEE ALSO
bogolexer(1), bogotune(1), bogoupgrade(1), bogoutil(1)

NOTES
1. A Plan For Spam
   http://www.paulgraham.com/spam.html

2. Spam Detection
   http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html

3. A Statistical Approach to the Spam Problem
   http://www.linuxjournal.com/article/6467

4. Another improvement
   http://www.garyrobinson.net/2004/04/improved_chi.html

5. the R project website
   http://cran.r-project.org/

6. bogofilter project page
   http://bogofilter.sourceforge.net/

07/23/2007 BOGOFILTER(1)