bogofilter(1)

BOGOFILTER(1)             Bogofilter Reference Manual            BOGOFILTER(1)

NAME

       bogofilter - fast Bayesian spam filter

SYNOPSIS

       bogofilter [help options | classification options |
                  registration options | parameter options | info options]
                  [general options] [config file options]

       where

       help options are:

       [-h] [--help] [-V] [-Q]

       classification options are:

       [-p] [-e] [-t] [-T] [-u] [-H] [-M] [-b] [-B object ...] [-R]
        [general options] [parameter options] [config file options]

       registration options are:

       [-s | -n] [-S | -N] [general options]

       general options are:

       [-c filename] [-C] [-d dir] [-k cachesize] [-l] [-L tag] [-I filename]
        [-O filename]

       parameter options are:

       [-E value[,value]] [-m value[,value][,value]] [-o value[,value]]

       info options are:

       [-v] [-y date] [-D] [-x flags]

       config file options are:

       [--option=value]

       Note: Use bogofilter --help to display the complete list of options.

DESCRIPTION

       Bogofilter is a Bayesian spam filter. In its normal mode of operation,
       it takes an email message or other text on standard input, does a
       statistical check against lists of "good" and "bad" words, and returns
       a status code indicating whether or not the message is spam.
       Bogofilter is designed around a fast algorithm, uses Berkeley DB for
       fast startup and lookups, is coded directly in C, and is tuned for
       speed, so it can be used in production by sites that process a lot of
       mail.

THEORY OF OPERATION

       Bogofilter treats its input as a bag of tokens. Each token is checked
       against a wordlist, which maintains counts of the numbers of times it
       has occurred in non-spam and spam mails. These numbers are used to
       compute an estimate of the probability that a message in which the
       token occurs is spam. These estimates are combined to indicate whether
       the message is spam or ham.

       While this method sounds crude compared to the more usual
       pattern-matching approach, it turns out to be extremely effective. Paul
       Graham's paper A Plan For Spam[1] is recommended reading.

       This program substantially improves on Paul's proposal by doing smarter
       lexical analysis. Bogofilter does proper MIME decoding and reasonable
       HTML parsing. Special kinds of tokens like hostnames and IP addresses
       are retained as recognition features rather than broken up. Various
       kinds of MTA cruft such as dates and message-IDs are ignored so as not
       to bloat the wordlist. Tokens found in various header fields are marked
       appropriately.

       Another improvement is that this program offers Gary Robinson's
       suggested modifications to the calculations (see the parameters robx
       and robs below). These modifications are described in Robinson's paper
       Spam Detection[2].

       Since then, Robinson (see his Linux Journal article A Statistical
       Approach to the Spam Problem[3]) and others have realized that the
       calculation can be further optimized using Fisher's method. Another
       improvement[4] compensates for token redundancy by applying separate
       effective size factors (ESF) to spam and nonspam probability
       calculations.

       In short, this is how it works: The estimates for the spam
       probabilities of the individual tokens are combined using the "inverse
       chi-square function". Its value indicates how badly the null hypothesis
       that the message is just a random collection of independent words with
       probabilities given by our previous estimates fails. This function is
       very sensitive to small probabilities (hammish words), but not to high
       probabilities (spammish words); so the value only indicates strong
       hammish signs in a message. Now using inverse probabilities for the
       tokens, the same computation is done again, giving an indicator that a
       message looks strongly spammish. Finally, those two indicators are
       subtracted (and scaled into a 0-1-interval). This combined indicator
       (bogosity) is close to 0 if the signs for a hammish message are
       stronger than for a spammish message and close to 1 if the situation is
       the other way round. If signs for both are equally strong, the value
       will be near 0.5. Since such messages don't give a clear indication,
       there is a tristate mode in bogofilter to mark them as unsure, while
       the clear messages are marked as spam or ham, respectively. In
       two-state mode, every message is marked as either spam or ham.

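       The combination step described above can be sketched in Python as
       follows. This is a minimal illustration, not bogofilter's C
       implementation: the names chi2q and bogosity are hypothetical, the
       effective size factors and the min-dev filter are ignored, and the
       token probabilities are clamped away from 0 and 1 so that the
       logarithms stay finite.

```python
import math


def chi2q(x2, dof):
    """Upper-tail probability of a chi-square statistic (even dof only)."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    m = x2 / 2.0
    term = math.exp(-m)          # underflows to 0.0 for very large statistics
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)       # guard against rounding slightly above 1


def bogosity(probs, eps=1e-12):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5               # assumption: no evidence means "undecided"
    clamped = [min(max(p, eps), 1.0 - eps) for p in probs]
    # Close to 0 when the message shows strong hammish signs.
    h = chi2q(-2.0 * sum(math.log(p) for p in clamped), 2 * n)
    # Same computation on the inverse probabilities: close to 0 when the
    # message shows strong spammish signs.
    s = chi2q(-2.0 * sum(math.log(1.0 - p) for p in clamped), 2 * n)
    # Subtract the two indicators and scale the result into [0, 1].
    return (1.0 + h - s) / 2.0
```

       With ten tokens that all score 0.01 the sketch yields a value close to
       0, with ten tokens at 0.99 a value close to 1, and with ten tokens at
       0.5 exactly 0.5.
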
       Various parameters influence these calculations. The most important
       are:

       robx: the score given to a token which has not been seen before. robx
       is the probability that such a token is spammish.

       robs: a weight on robx which moves the probability of a rarely seen
       token towards robx.

       min-dev: a minimum distance from 0.5 for tokens to use in the
       calculation. Only tokens farther away from 0.5 than this value are
       used.

       spam-cutoff: messages with scores greater than or equal to this value
       will be marked as spam.

       ham-cutoff: If zero or equal to spam-cutoff, all messages with values
       strictly below spam-cutoff are marked as ham, all others as spam
       (two-state). Otherwise, messages with values less than or equal to
       ham-cutoff are marked as ham, messages with values strictly between
       ham-cutoff and spam-cutoff are marked as unsure, and the rest as spam
       (tristate).

       sp-esf: the effective size factor (ESF) for spam.

       ns-esf: the ESF for nonspam. These ESF values default to 1.0, which is
       the same as not using ESF in the calculation. Values suitable to a
       user's email population can be determined with the aid of the bogotune
       program.
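
       As a rough Python sketch of how robx, robs and the two cutoffs act:
       robinson_fw follows the f(w) formula from Robinson's papers, and
       classify follows the cutoff rules listed above. Both function names
       are hypothetical, the raw estimate p(w) is assumed to be built from
       per-message token rates, and the numbers in the usage note are
       illustrative rather than bogofilter's defaults.

```python
def robinson_fw(spam_count, good_count, spam_msgs, good_msgs, robx, robs):
    """Robinson's f(w): pull a token's raw estimate towards robx."""
    n = spam_count + good_count
    if n == 0:
        return robx                      # unseen token: score is robx
    spam_rate = spam_count / spam_msgs if spam_msgs else 0.0
    good_rate = good_count / good_msgs if good_msgs else 0.0
    if spam_rate + good_rate == 0.0:
        return robx                      # no usable counts (assumption)
    p = spam_rate / (spam_rate + good_rate)
    # robs acts like "robs extra sightings" that all scored robx.
    return (robs * robx + n * p) / (robs + n)


def classify(score, spam_cutoff, ham_cutoff):
    """Map a bogosity score to 'spam', 'ham' or 'unsure'."""
    if ham_cutoff == 0 or ham_cutoff == spam_cutoff:
        # two-state mode
        return "ham" if score < spam_cutoff else "spam"
    # tristate mode
    if score <= ham_cutoff:
        return "ham"
    if score < spam_cutoff:
        return "unsure"
    return "spam"
```

       For example, classify(0.5, 0.99, 0.45) gives "unsure", while
       classify(0.5, 0.99, 0) gives "ham" because a ham-cutoff of zero
       selects two-state mode.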

OPTIONS

       HELP OPTIONS

       The -h option prints the help message and exits.

       The -V option prints the version number and exits.

       The -Q (query) option prints bogofilter's configuration, i.e.
       registration parameters, parsing options, bogofilter directory, etc.

       CLASSIFICATION OPTIONS

       The -p (passthrough) option outputs the message with an X-Bogosity line
       at the end of the message header. This requires keeping the entire
       message in memory when it's read from stdin (or from a pipe or socket).
       If the message is read from a file that can be rewound, bogofilter will
       read it a second time.

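       A downstream script can read the classification back out of a message
       that went through passthrough mode. The Python sketch below assumes
       the default header name X-Bogosity and a value that starts with the
       classification followed by a comma, as in the procmail recipes later
       in this page; the function name bogosity_label is hypothetical.

```python
from email import message_from_string


def bogosity_label(raw_message):
    """Return the first field of the X-Bogosity header, or None if absent."""
    msg = message_from_string(raw_message)
    header = msg.get("X-Bogosity")
    if header is None:
        return None
    # Assumed layout: "<classification>, key=value, ..."
    return header.split(",", 1)[0].strip()
```

       If the header name has been changed in the configuration, the lookup
       key has to be changed accordingly.
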
       The -e (embed) option tells bogofilter to exit with code 0 if the
       message can be classified, i.e. if there is not an error. Normally
       bogofilter uses different codes for spam, ham, and unsure
       classifications, but this simplifies using bogofilter with procmail or
       maildrop.

       The -t (terse) option tells bogofilter to print an abbreviated
       spamicity message containing 1 letter and the score. Spam is indicated
       with "Y", ham by "N", and unsure by "U". Note: the formatting can be
       customized using the config file.

       The -T option provides an invariant terse mode for scripts to use.
       Bogofilter will print an abbreviated spamicity message containing 1
       letter and the score. Spam is indicated with "S", ham by "H", and
       unsure by "U".

       The -TT option provides an invariant terse mode for scripts to use.
       Bogofilter prints only the score and displays it to 16 significant
       digits.

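       A script consuming -T output might parse it along these lines. The
       sketch assumes that the letter and the score are separated by
       whitespace on one line; that layout, the sample line in the usage
       note, and the function name parse_terse are assumptions, so check
       them against the output of your installed version.

```python
TERSE_LABELS = {"S": "spam", "H": "ham", "U": "unsure"}


def parse_terse(line):
    """Split an assumed '<letter> <score>' line into (label, score)."""
    letter, score = line.split()
    return TERSE_LABELS[letter], float(score)
```

       Under that assumption, parse_terse("S 0.998918") returns the pair
       ("spam", 0.998918).
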
       The -u option tells bogofilter to register the message's text after
       classifying it as spam or non-spam. A spam message will be registered
       on the spamlist and a non-spam message on the goodlist. If the
       classification is "unsure", the message will not be registered.
       Effectively this option runs bogofilter with the -s or -n flag, as
       appropriate. Caution is urged in the use of this capability, as any
       classification errors bogofilter may make will be preserved and will
       accumulate until manually corrected with the -Sn and -Ns option
       combinations. Note that this option causes the database to be opened
       for write access, which can entail massive slowdowns through lock
       contention and synchronous I/O operations.

       The -H option tells bogofilter to not tag tokens from the header. This
       option is for testing; you should not use it in normal operation.

       The -M option tells bogofilter to process its input as an mbox-formatted
       file. If the -v or -t option is also given, a spamicity line will be
       printed for each message.

       The -b (streaming bulk mode) option tells bogofilter to classify
       multiple objects whose names are read from stdin. If the -v or -t
       option is also given, bogofilter will print a line giving file name and
       classification information for each file. This is an alternative to -B
       which lists objects on the command line.

       An object in this context is a maildir (autodetected) or, if it is not
       a maildir, a single mail. If -M is given, the object is processed as an
       mbox instead. (The Content-Length: header is not taken into account
       currently.)

       When reading mbox format, bogofilter relies on the empty line after a
       mail. If needed, formail -es will ensure this is the case.

       The -B object ... (bulk mode) option tells bogofilter to classify
       multiple objects named on the command line. The objects may be
       filenames (for single messages), mailboxes (files with multiple
       messages), or directories (of maildir and MH format). If the -v or -t
       option is also given, bogofilter will print a line giving file name and
       classification information for each file. This is an alternative to -b
       which lists objects on stdin.

       The -R option tells bogofilter to output an R data frame in text form
       on the standard output. See the section on integration with R, below,
       for further detail.

       REGISTRATION OPTIONS

       The -s option tells bogofilter to register the text presented as spam.
       The database is created if absent.

       The -n option tells bogofilter to register the text presented as
       non-spam.

       Bogofilter doesn't detect if a message is registered twice. If you do
       this by accident, the token counts will be off by 1 from what you
       really want and the corresponding spam scores will be slightly off.
       Given a large number of tokens and messages in the wordlist, this
       doesn't matter. The problem can be corrected by using the -S option or
       the -N option.

       The -S option tells bogofilter to undo a prior registration of the same
       message as spam. If a message was incorrectly entered as spam by -s or
       -u and you want to remove it and enter it as non-spam, use -Sn. If -S
       is used for a message that wasn't registered as spam, the counts will
       still be decremented.

       The -N option tells bogofilter to undo a prior registration of the same
       message as non-spam. If a message was incorrectly entered as non-spam
       by -n or -u and you want to remove it and enter it as spam, then use
       -Ns. If -N is used for a message that wasn't registered as non-spam,
       the counts will still be decremented.

       GENERAL OPTIONS

       The -c filename option tells bogofilter to read the config file named.

       The -C option prevents bogofilter from reading configuration files.

       The -d dir option allows you to set the directory for the database. See
       the ENVIRONMENT section for other directory setting options.

       The -k cachesize option sets the cache size for the BerkeleyDB
       subsystem, in units of 1 MiB (1,048,576 bytes). Properly sizing the
       cache improves bogofilter's performance. The recommended size is one
       third of the size of the database file. You can run the bogotune script
       (in the tuning directory) to determine the recommended size.

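       The one-third rule can be turned into a -k value with a few lines of
       Python. This is only a sketch of the arithmetic: the function name
       recommended_cache_mib is hypothetical, and rounding up to a whole MiB
       with a floor of 1 is an assumption, not something bogofilter
       prescribes.

```python
import math

MIB = 1024 * 1024  # 1,048,576 bytes, the unit used by -k


def recommended_cache_mib(db_bytes):
    """One third of the database size, rounded up to whole MiB (min 1)."""
    return max(1, math.ceil(db_bytes / 3 / MIB))
```

       The database size can be obtained with os.path.getsize() on the
       wordlist file, for instance ~/.bogofilter/wordlist.db if the default
       directory is in use.
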
       The -l option writes an informational line to the system log each time
       bogofilter is run. The information logged depends on how bogofilter is
       run.

       The -L tag option configures a tag which can be included in the
       information being logged by the -l option, but it requires a custom
       format that includes the %l string for now. This option implies -l.

       The -I filename option tells bogofilter to read its input from the
       specified file, rather than from stdin.

       The -O filename option tells bogofilter where to write its output in
       passthrough mode. Note that this only works when -p is explicitly
       given.

       PARAMETER OPTIONS

       The -E value[,value] option allows setting the sp-esf value and the
       ns-esf value. With two values, both sp-esf and ns-esf are set. If only
       one value is given, parameters are set as described in the note below.

       The -m value[,value][,value] option allows setting the min-dev value
       and, optionally, the robs and robx values. With three values, min-dev,
       robs, and robx are all set. If fewer values are given, parameters are
       set as described in the note below.

       The -o value[,value] option allows setting the spam-cutoff and
       ham-cutoff values. With two values, both spam-cutoff and ham-cutoff are
       set. If only one value is given, parameters are set as described in the
       note below.

       Note: All of these options allow fewer values to be provided. Values
       can be skipped by using just the comma delimiter, in which case the
       corresponding parameter(s) won't be changed. If only the first value is
       provided, then only the first parameter is set. Trailing values can be
       skipped, in which case the corresponding parameters won't be changed.
       Within the parameter list, spaces are not allowed after commas.

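       The rules in the note can be illustrated with a small Python sketch
       that applies a value list to a set of current settings. It models the
       documented behaviour only and is not bogofilter's own parser; the
       function name apply_values and the starting values in the usage note
       are made up for the example.

```python
def apply_values(arg, current):
    """Apply 'v1,v2,...' to current; empty or missing fields stay unchanged."""
    parts = arg.split(",")
    if len(parts) > len(current):
        raise ValueError("too many values")
    updated = list(current)
    for i, part in enumerate(parts):
        if part == "":
            continue                     # skipped value: keep the old setting
        if part != part.strip():
            raise ValueError("spaces are not allowed in the value list")
        updated[i] = float(part)
    return updated
```

       Taking the -m order (min-dev, robs, robx) and starting from
       [0.375, 0.0178, 0.52], the argument "0.1" changes only min-dev,
       ",,0.6" changes only robx, and "0.1,0.02" leaves robx untouched.
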
       INFO OPTIONS

       The -v option produces a report to standard output on bogofilter's
       analysis of the input. Each additional v will increase the verbosity of
       the output, up to a maximum of 4. With -vv, the report lists the tokens
       with highest deviation from a mean of 0.5 association with spam.

       Option -y date can be used to override the current date when
       timestamping tokens. A value of zero (0) turns off timestamping.

       The -D option redirects debug output to stdout.

       The -x flags option allows setting of debug flags for printing debug
       information. See header file debug.h for the list of usable flags.

       CONFIG FILE OPTIONS

       Using GNU longopt -- syntax, a config file's name=value statement
       becomes a command line's --option=value. Use command bogofilter --help
       for a list of options and see bogofilter.cf.example for more info on
       them. For example, to change the X-Bogosity header to "X-Spam-Header",
       use:

       --spam-header-name=X-Spam-Header

ENVIRONMENT

       Bogofilter uses a database directory, which can be set in the config
       file. If not set there, bogofilter will use the value of
       BOGOFILTER_DIR. Both can be overridden by the -d dir option. If none of
       that is available, bogofilter will use directory $HOME/.bogofilter.
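
       The lookup order reads, in a Python sketch, as follows. The function
       name database_dir and its parameters are hypothetical; the sketch
       mirrors the paragraph above and is not taken from bogofilter's
       source.

```python
import os


def database_dir(cli_dir=None, config_dir=None, environ=None):
    """Pick the database directory: -d, config file, BOGOFILTER_DIR, $HOME."""
    env = os.environ if environ is None else environ
    if cli_dir:                          # -d dir overrides everything
        return cli_dir
    if config_dir:                       # directory set in the config file
        return config_dir
    if env.get("BOGOFILTER_DIR"):        # environment variable
        return env["BOGOFILTER_DIR"]
    return env.get("HOME", "") + "/.bogofilter"
```

       For example, with only HOME=/home/u set, the sketch returns
       /home/u/.bogofilter.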

CONFIGURATION

       The bogofilter command line allows setting of many options that
       determine how bogofilter operates. File /etc/bogofilter.cf can be used
       to set additional parameters that affect its operation. File
       /etc/bogofilter.cf.example has samples of all of the parameters. Status
       and logging messages can be customized for each site.

RETURN VALUES

       0 for spam; 1 for non-spam; 2 for unsure; 3 for I/O or other errors.

       If both -p and -e are used, the return values are: 0 for spam or
       non-spam; 3 for I/O or other errors.

       Error 3 usually means that the wordlist file bogofilter wants to read
       at startup is missing, or that the hard disk has filled up in -p mode.
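
       A calling program can map these codes to labels as in the Python
       sketch below. It assumes that bogofilter is on the PATH and is run
       without -e, so that the exit codes listed first apply; the names
       label_for_exit and classify_message are hypothetical.

```python
import subprocess

EXIT_LABELS = {0: "spam", 1: "ham", 2: "unsure"}


def label_for_exit(code):
    """Translate a bogofilter exit code; anything else is an error."""
    if code not in EXIT_LABELS:
        raise RuntimeError("bogofilter failed with exit code %d" % code)
    return EXIT_LABELS[code]


def classify_message(raw_bytes, program="bogofilter"):
    """Feed one message on stdin and return 'spam', 'ham' or 'unsure'."""
    proc = subprocess.run([program], input=raw_bytes)
    return label_for_exit(proc.returncode)
```

       Treating every code outside 0 to 2 as a failure keeps an I/O error
       from being mistaken for a classification.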

INTEGRATION WITH OTHER TOOLS

       Use with procmail

       The following recipe (a) spam-bins anything that bogofilter rates as
       spam, (b) registers the words in messages rated as spam as such, and
       (c) registers the words in messages rated as non-spam as such. With
       this in place, it will normally only be necessary for the user to
       intervene (with -Ns or -Sn) when bogofilter miscategorizes something.


           # filter mail through bogofilter, tagging it as Ham, Spam, or Unsure,
           # and updating the wordlist

           :0fw
           | bogofilter -u -e -p


           # if bogofilter failed, return the mail to the queue;
           # the MTA will retry to deliver it later
           # 75 is the value for EX_TEMPFAIL in /usr/include/sysexits.h

           :0e
           { EXITCODE=75 HOST }


           # file the mail to spam-bogofilter if it's spam.

           :0:
           * ^X-Bogosity: Spam, tests=bogofilter
           spam-bogofilter

           # file the mail to unsure-bogofilter
           # if it's neither ham nor spam.

           :0:
           * ^X-Bogosity: Unsure, tests=bogofilter
           unsure-bogofilter

           # With this recipe, you can train bogofilter starting with an empty
           # wordlist.  Be sure to check your unsure-folder regularly, take the
           # messages out of it, classify them as ham (or spam), and use them to
           # train bogofilter.


       The following procmail rule will take mail on stdin and save it to file
       spam if bogofilter thinks it's spam:

           :0HB:
           * ? bogofilter
           spam

       and this similar rule will also register the tokens in the mail
       according to the bogofilter classification:

           :0HB:
           * ? bogofilter -u
           spam

       If bogofilter fails (returning 3) the message will be treated as
       non-spam.

       This one is for maildrop. It automatically defers the mail and retries
       later when the xfilter command fails. Use this in your ~/.mailfilter:

           xfilter "bogofilter -u -e -p"
           if (/^X-Bogosity: Spam, tests=bogofilter/)
           {
             to "spam-bogofilter"
           }

       The following .muttrc lines will create mutt macros for dispatching
       mail to bogofilter.

           macro index d "<enter-command>unset wait_key\n\
           <pipe-entry>bogofilter -n\n\
           <enter-command>set wait_key\n\
           <delete-message>" "delete message as non-spam"
           macro index \ed "<enter-command>unset wait_key\n\
           <pipe-entry>bogofilter -s\n\
           <enter-command>set wait_key\n\
           <delete-message>" "delete message as spam"

       Integration with Mail Transport Agent (MTA)

       bogofilter can also be integrated into an MTA to filter all incoming
       mail. While the specific implementation is MTA dependent, the general
       steps are as follows:

        1. Install bogofilter on the mail server.

        2. Prime the bogofilter databases with a spam and non-spam corpus.
           Since bogofilter will be serving a larger community, it is
           important to prime it with a representative set of messages.

        3. Set up the MTA to invoke bogofilter on each message. While this is
           an MTA specific step, you'll probably need to use the -p, -u, and
           -e options.

        4. Set up a mechanism for users to register spam/non-spam messages, as
           well as to correct mis-classifications. The most generic solution
           is to set up alias email addresses to which users bounce messages.

        5. See the doc and contrib directories for more information.

       Use of R to verify bogofilter's calculations

       The -R option tells bogofilter to generate an R data frame. The data
       frame contains one row per token analyzed. Each such row contains the
       token, the sum of its database "good" and "spam" counts, the "good"
       count divided by the number of non-spam messages used to create the
       training database, the "spam" count divided by the spam message count,
       Robinson's f(w) for the token, the natural logs of (1 - f(w)) and f(w),
       and an indicator character (+ if the token's f(w) value exceeded the
       minimum deviation from 0.5, - if it didn't). There is one additional
       row at the end of the table that contains a label in the token field,
       followed by the number of words actually used (the ones with +
       indicators), Robinson's P, Q, S, s and x values and the minimum
       deviation.

       The R data frame can be saved to a file and later read into an R
       session (see the R project website[5] for information about the
       mathematics package R). Provided with the bogofilter distribution is a
       simple R script (file bogo.R) that can be used to verify bogofilter's
       calculations. Instructions for its use are included in the script in
       the form of comments.

LOG MESSAGES

       Bogofilter writes messages to the system log when the -l option is
       used. What is written depends on which other flags are used.

       A classification run will generate (we are not showing the date and
       host part here):

           bogofilter[1412]: X-Bogosity: Ham, spamicity=0.000227
           bogofilter[1415]: X-Bogosity: Spam, spamicity=0.998918

       Using -u to classify a message and update a wordlist will produce (on
       a single line):

           bogofilter[1426]: X-Bogosity: Spam, spamicity=0.998918,
             register -s, 329 words, 1 messages


       Registering words (-l and -s, -n, -S, or -N) will produce:

           bogofilter[1440]: register-n, 255 words, 1 messages

       A registration run (using -s, -n, -N, or -S) will generate messages
       like:

           bogofilter[17330]: register-n, 574 words, 3 messages
           bogofilter[6244]: register-s, 1273 words, 4 messages
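
       Classification lines in the default format shown above can be picked
       apart with a regular expression, as in this Python sketch. The
       pattern and the function name parse_classification are assumptions
       based only on the sample lines; sites that customize their logging
       format will need a different pattern.

```python
import re

# Matches e.g. "bogofilter[1412]: X-Bogosity: Ham, spamicity=0.000227"
LOG_RE = re.compile(
    r"bogofilter\[(?P<pid>\d+)\]: "
    r"X-Bogosity: (?P<label>\w+), spamicity=(?P<score>[0-9.]+)"
)


def parse_classification(line):
    """Return (pid, label, score) for a classification line, else None."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    return int(match["pid"]), match["label"], float(match["score"])
```

       Registration lines such as the register-n examples do not match the
       pattern and yield None.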

FILES

       /etc/bogofilter.cf
           System configuration file.

       ~/.bogofilter.cf
           User configuration file.

       ~/.bogofilter/wordlist.db
           Combined list of good and spam tokens.

AUTHOR

           Eric S. Raymond esr@thyrsus.com.
           David Relson relson@osagesoftware.com.
           Matthias Andree matthias.andree@gmx.de.
           Greg Louis glouis@dynamicro.on.ca.

       For updates, see the bogofilter project page[6].

NOTES

        1. A Plan For Spam
           http://www.paulgraham.com/spam.html

        2. Spam Detection
           http://radio-weblogs.com/0101454/stories/2002/09/16/spamDetection.html

        3. A Statistical Approach to the Spam Problem
           http://www.linuxjournal.com/article/6467

        4. Another improvement
           http://www.garyrobinson.net/2004/04/improved%5fchi.html

        5. the R project website
           http://cran.r-project.org/

        6. bogofilter project page
           http://bogofilter.sourceforge.net/

Bogofilter                        10/22/2012                     BOGOFILTER(1)