spamprobe(1)                       SpamProbe                      spamprobe(1)

NAME

       spamprobe - a Bayesian spam filter

SYNOPSIS

9       spamprobe [options] <command> [filename...]
10
11

INTRODUCTION

       SpamProbe can be used in conjunction with procmail or a similar program
       to filter email.  SpamProbe uses a statistical algorithm to identify
       the key words and phrases in email and determine which emails are
       legitimate and which are spam.  The algorithm used by SpamProbe is
       based on an excellent article by Paul Graham, in which he describes the
       basic idea and his results.  You can read his article here:

         http://www.paulgraham.com/spam.html

COMMAND LINE USAGE

25       SpamProbe accepts a small set of commands and a growing set of  options
26       on  the  command line in addition to zero or more file names of mboxes.
27       The general usage is:
28
29         spamprobe [options] <command> [filename...]
30
31       The recognized options are:
32
33        -a char
34
           By default SpamProbe converts non-ASCII characters (characters
           with the most significant bit set to 1) into the letter 'z'.  This
           is useful for lumping all Asian characters into a single word for
           easy recognition.  The -a option allows you to change the
           character to something else if you don't like the letter 'z' for
           some reason.
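
           As a rough illustration of the mapping described above, the
           following Python sketch replaces every byte whose most significant
           bit is set with a single replacement character.  The function name
           map_high_bit is hypothetical; this is not SpamProbe's own tokenizer
           code, only a model of the documented behavior.

```python
def map_high_bit(data: bytes, replacement: str = "z") -> str:
    """Replace each byte that has its most significant bit set.

    Illustrative only: `replacement` plays the role of the -a character.
    """
    return "".join(replacement if byte & 0x80 else chr(byte) for byte in data)


# One three-byte UTF-8 character followed by plain ASCII text.
print(map_high_bit(b"\xe6\x97\xa5 news"))
```

           Because every high-bit byte collapses to the same letter, runs of
           such bytes end up as runs of one character, which is the "lumping"
           effect the option description refers to.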
41
42        -c
43
44           Tells spamprobe to create the database directory if it does not
45           already exist.  Normally spamprobe exits with a usage error if
46           the database directory does not already exist.
47
48        -C number
49
50           Tells SpamProbe to assign a default, somewhat neutral, probability
51           to any term that does not have a weighted (good count doubled)
52           count of at least number in the database.  This prevents terms
53           which have been seen only a few times from having an unreasonable
54           influence on the score of an email containing them.
55
56           The default value is 5.  For example if number is 5 then in order
57           for a term to use its calculated probability it must have been
58           seen 3 times in good mails, or 2 times in good mails and once in
59           spam, or 5 times in spam, or some other combination adding up to
60           at least 5.
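
           The arithmetic in that example can be checked with a short Python
           sketch.  The function names are hypothetical and the code only
           models the rule as stated above (good count doubled, plus spam
           count, compared against number).

```python
def weighted_count(good: int, spam: int) -> int:
    """Weighted count as described for -C: good count doubled plus spam count."""
    return 2 * good + spam


def uses_calculated_probability(good: int, spam: int, min_count: int = 5) -> bool:
    """True if the term has been seen often enough to use its own probability."""
    return weighted_count(good, spam) >= min_count


# The three combinations named in the text all reach the default of 5.
for good, spam in [(3, 0), (2, 1), (0, 5)]:
    print(good, spam, uses_calculated_probability(good, spam))
```

           A term seen twice in good mail and never in spam has a weighted
           count of 4, so under this model it would still receive the default
           probability.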
61
62        -d directory
63
64           By default SpamProbe stores its database in a directory named
65           .spamprobe under your home directory.  The -d option allows you to
66           specify a different directory to use.  This is necessary if your
67           home directory is NFS mounted for example.
68
69           The directory name can be prefixed with a special code to force
70           SpamProbe to use a particular type of data file format.  The type
71           codes depend on how your copy of SpamProbe was compiled.  Defined
72           types include:
73
74             Example                   Description
75             -d pbl:path               Forces the use of PBL data file.
76             -d hash:path              Forces the use of an mmapped hash file.
77             -d split:path             Forces the use of a hash file and ISAM
78                                       file (may provide better precision than
79                                       plain hash in some cases).
80
81           The hash: option can also specify a desired file size in megabytes
82           before the path.  For example -d hash:19:path would cause
83           SpamProbe to use a 19 MB hash file.  The size must be in the range
84           of 1-100.  The default hash file size is 16 MB.  Because hash
85           files have a fixed size and capacity they should be cleaned
86           relatively often using the cleanup command (see below) to prevent
87           them from becoming full or being slowed by too many hash key
88           collisions.
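
           To make the prefix syntax concrete, here is a minimal Python sketch
           that splits a -d argument into type code, hash size, and path.  The
           names DbSpec and parse_db_spec are hypothetical, and the parsing
           rules are inferred from the description above rather than taken
           from SpamProbe's source.

```python
from typing import NamedTuple, Optional

KNOWN_TYPES = {"pbl", "hash", "split"}   # type codes listed in the table above
DEFAULT_HASH_MB = 16                     # documented default hash file size


class DbSpec(NamedTuple):
    db_type: Optional[str]   # None means "no type code given"
    size_mb: Optional[int]   # only meaningful for hash files
    path: str


def parse_db_spec(spec: str) -> DbSpec:
    """Split a -d argument such as 'hash:19:path' into its parts."""
    head, sep, rest = spec.partition(":")
    if not sep or head not in KNOWN_TYPES:
        return DbSpec(None, None, spec)          # plain directory name
    if head != "hash":
        return DbSpec(head, None, rest)
    size_text, sep2, path = rest.partition(":")
    if sep2 and size_text.isdigit():
        size = int(size_text)
        if not 1 <= size <= 100:
            raise ValueError("hash file size must be in the range 1-100 MB")
        return DbSpec("hash", size, path)
    return DbSpec("hash", DEFAULT_HASH_MB, rest)


print(parse_db_spec("hash:19:path"))
```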
89
90           Hash files provide better performance than either of the ISAM
91           options (PBL or Berkeley DB).  However hash files do not store the
92           original terms.  Only a 32 bit hash key is stored with each term.
93           This prevents a user from exploring the terms in the database
94           using the dump command to see what words are particularly spammy
95           or hammy.
96
97        -D directory
98
99           Tells SpamProbe to use the database in the specified directory
100           (must be different than the one specified with the -d option) as a
101           shared database from which to draw terms that are not defined in
102           the user's own database.  This can be used to provide a baseline
103           database shared by all users on a system (in the -D directory) and
104           a private database unique to each user of the system
105           ($HOME/.spamprobe or -d directory).
106
107        -g field_name
108
109           Tells SpamProbe what header to look for previous score and message
110           digest in.  Default is X-SpamProbe.  Field name is not case
111           sensitive.  Used by all commands except receive.
112
113        -h
114
115           By default SpamProbe removes HTML markup from the text in emails
116           to help avoid false positives.  The -h option allows you to
117           override this behavior and force SpamProbe to include words from
118           within HTML tags in its word counts.  Note that SpamProbe always
119           counts any URLs in hrefs within tags whether -h is used or not.
120           Use of this option is discouraged.  It can increase the rate of
121           spam detection slightly but unless the user receives a significant
122           amount of HTML emails it also tends to increase the number of
123           false positives.
124
125        -H option
126
127           By default SpamProbe only scans a meaningful subset of headers
128           from the email message when searching for words to score.  The -H
129           option allows the user to specify additional headers to scan.
130           Legal values are "all", "nox", "none", or "normal".  "all" scans
131           all headers, "nox" scans all headers except those starting with
132           X-, "none" does not scan headers, and "normal" scans the normal
133           set of headers.
134
135           In addition to those values you can also explicitly add a header
136           to the list of headers to process by adding the header name in
137           lower case preceded by a plus sign.  Multiple headers can be
138           specified by using multiple -H options.  For example, to include
139           only the From and Received headers in your train command you could
140           run spamprobe as follows:
141
142             spamprobe -Hnone -H+from -H+received train
143
144           You can also selectively ignore headers that would otherwise be
145           processed by using -H-headername.  For example to process all
146           headers except for Subject you could run spamprobe as follows:
147
148             spamprobe -Hall -H-subject train
149
           To process the normal set of headers but also add the SpamAssassin
           header X-Spam-Status you could run spamprobe as follows:

             spamprobe -H+x-spam-status train
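
           How the -H values combine can be modeled with a short Python
           sketch.  Everything here is an assumption made for illustration:
           the function name should_scan is hypothetical, the contents of
           ASSUMED_NORMAL are a placeholder (the real "normal" header set is
           not listed in this page), and explicit +name / -name entries are
           assumed to override the selected mode.

```python
from typing import Iterable

# Placeholder only: the real "normal" header set is not documented here.
ASSUMED_NORMAL = frozenset({"from", "to", "cc", "subject", "received"})
MODES = {"all", "nox", "none", "normal"}


def should_scan(header: str, options: Iterable[str]) -> bool:
    """Decide whether a header is scanned, given -H values in command-line order."""
    name = header.lower()
    mode = "normal"
    added, removed = set(), set()
    for opt in options:
        if opt.startswith("+"):
            added.add(opt[1:].lower())
            removed.discard(opt[1:].lower())
        elif opt.startswith("-"):
            removed.add(opt[1:].lower())
            added.discard(opt[1:].lower())
        elif opt in MODES:
            mode = opt
        else:
            raise ValueError("unknown -H value: " + opt)
    if name in removed:
        return False
    if name in added:
        return True
    if mode == "all":
        return True
    if mode == "nox":
        return not name.startswith("x-")
    if mode == "none":
        return False
    return name in ASSUMED_NORMAL


# Mirrors: spamprobe -Hnone -H+from -H+received train
print(should_scan("From", ["none", "+from", "+received"]))
```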
154
155        -l number
156
           Changes the spam probability threshold for emails from the default
           (0.7) to number.  The number must be between 0 and 1.  Generally
           the value should be above 0.5 to avoid a high false positive rate.
           Lower numbers tend to produce more false positives while higher
           numbers tend to let more spam through undetected.
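
           The effect of the threshold can be pictured with a tiny Python
           sketch.  The function name classify is hypothetical, and treating
           a score exactly equal to the threshold as spam is an assumption;
           this page does not say which way the comparison goes.

```python
def classify(score: float, threshold: float = 0.7) -> str:
    """Label a message score using a -l style threshold (default 0.7)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    # Assumption: a score equal to the threshold counts as spam.
    return "SPAM" if score >= threshold else "GOOD"


print(classify(0.9999999), classify(0.02), classify(0.6, threshold=0.5))
```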
162
163        -m
164
165           Forces SpamProbe to use mbox format for reading emails in receive
166           mode.  Normally SpamProbe assumes that the input to receive mode
167           contains a single message so it doesn't look for message breaks.
168
169        -M
170
171           Forces SpamProbe to treat the entire input as a single message.
172           This ignores From lines and Content-Length headers in the input.
173
174        -o option_name
175
176           Enables special options by name.  Currently the only special
177           options are:
178
179             -o graham
180
181               Causes SpamProbe to emulate the filtering algorithm originally
182               outlined in A Plan For Spam.
183
184             -o honor-status-header
185
186               Causes SpamProbe to ignore messages if they have a Status:
187               header containing a capital D.  Some mail servers use this
188               status to indicate a message that has been flagged for
189               deletion but has not yet been purged from the file.
190
191               DO NOT use this option with the receive or train command in
192               your procmailrc file!  Doing so could allow spammers to bypass
193               the filter.  This option is meant to be used with the
194               train-spam and train-good commands in scripts that
195               periodically update the database.
196
197             -o honor-xstatus-header
198
               Causes SpamProbe to ignore messages if they have an X-Status:
               header containing a capital D.  Some mail servers use this
               status to indicate a message that has been flagged for
               deletion but has not yet been purged from the file.
203
204               DO NOT use this option with the receive or train command in
205               your procmailrc file!  Doing so could allow spammers to bypass
206               the filter.  This option is meant to be used with the
207               train-spam and train-good commands in scripts that
208               periodically update the database.
209
210             -o ignore-body
211
212               Causes SpamProbe to ignore terms from the message body when
213               computing a score.  This is not normally recommended but might
214               be useful in conjunction with some other filter.  For example,
215               the whitelist option (see below) implicitly ignores the
216               message body.
217
218             -o orig-score
219
220               Causes SpamProbe to use its original scoring algorithm that
221               produces excellent results but tends to generate scores of
222               either 0 or 1 for all messages.
223
224             -o suspicious-tags
225
226               Causes SpamProbe to scan the contents of "suspicious" tags for
227               tokens rather than simply throwing them out.  Currently only
228               font tags are scanned but other tags may be added to this list
229               in later versions.
230
231             -o tokenized
232
233               Causes SpamProbe to read tokens one per line rather than
234               processing the input as mbox format.  This allows users to
235               completely replace the standard spamprobe tokenizer if they
236               wish and instead use some external program as a tokenizer.
237               For example in your procmailrc file you could use:
238
239                SCORE=| tokenize.pl | /bin/spamprobe -o tokenized train
240
241               In this mode SpamProbe considers a blank line to indicate the
242               end of one message's tokens and the start of a new message's
243               tokens.  SpamProbe computes a message digest based on the
244               lines of text containing the tokens.
245
246             -o whitelist
247
               Causes SpamProbe to use information from the email's headers
               to identify whether or not the email is from a legitimate
               correspondent.  The message body is ignored, as are any
               never-before-seen terms and phrases in the headers.  This
               option can be used with the score command in a procmailrc file
               to use a Bayesian whitelist in conjunction with some other
               filter or rule external to SpamProbe.
255
256           The -o option can be used multiple times and all requested options
257           will be applied.  Note that some options might conflict with each
258           other in which case the last option would take precedence.
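
           The input framing expected by -o tokenized (one token per line, a
           blank line between messages) can be sketched in Python as follows.
           The function name split_token_stream is hypothetical, and treating
           several consecutive blank lines as a single separator is an
           assumption.

```python
from typing import Iterable, List


def split_token_stream(lines: Iterable[str]) -> List[List[str]]:
    """Group one-token-per-line input into messages separated by blank lines."""
    messages: List[List[str]] = []
    current: List[str] = []
    for raw in lines:
        token = raw.rstrip("\n")
        if token == "":
            # A blank line closes the current message, if it has any tokens.
            if current:
                messages.append(current)
                current = []
        else:
            current.append(token)
    if current:
        messages.append(current)
    return messages


print(split_token_stream(["free\n", "money\n", "\n", "meeting\n", "agenda\n"]))
```

           An external tokenizer such as the tokenize.pl script mentioned
           above would need to emit its output in this shape.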
259
260        -p number
261
262           Changes the maximum number of words per phrase.  Default value is
263           two.  Increasing the limit improves accuracy somewhat but
264           increases database size.  Experiments indicate that increasing
265           beyond two is not worth the extra cost in space.
266
267        -P number
268
           Causes spamprobe to perform a purge of all terms with junk count
           less than or equal to 2 after every number messages are processed.
           Using this option when classifying a large collection of spam can
           prevent the database from growing overly large at the cost of more
           processing time and possible loss of precision.
274
275        -r number
276
277           Changes the number of times that a single word/phrase can occur
278           in the top words array used to calculate the score for each
279           message.  Allowing repeats reduces the number of words overall
280           (since a single word occupies more than one slot) but allows words
281           which occur frequently in the message to have a higher weight.
282           Generally this is changed only for optimization purposes.
283
284        -R
285
286           Causes spamprobe to treat the input as a single message and to
287           base its exit code on whether or not that message was spam.  The
288           exit code will be 0 if the message was spam or 1 if the message
289           was good.
290
291        -s number
292
           SpamProbe maintains an in-memory cache of the words it has seen in
           previous messages to reduce disk I/O and improve performance.  By
           default the cache will contain the most recently accessed 2,500
           terms.  This number can be changed using the -s option.  Using a
           larger cache size will cause SpamProbe to use more memory and,
           potentially, to perform less database I/O.
299
300           A value of zero causes SpamProbe to use 100,000 as the limit which
301           effectively means that the cache will only be flushed at program
302           exit (unless you have really enormous mailbox files).  The cache
303           doesn't affect receive, dump, or export but has a significant
304           impact on the others.
305
        -T

           Causes SpamProbe to write out the top terms associated with each
           message in addition to its normal output.  Works with find-good,
           find-spam, and score.
310
311        -v
312
313           Tells SpamProbe to write debugging information to stderr.  This
314           can be useful for debugging or for seeing which terms SpamProbe
315           used to score each email.
316
317        -V
318
319           Prints version and copyright information and then exits.
320
321        -w number
322
323           Changes the number of most significant words/phrases used by
324           SpamProbe to calculate the score for each message.  Generally this
325           is changed only for optimization purposes.
326
327        -x
328
329           Normally SpamProbe uses only a fixed number of top terms (as set
330           by the -w command line option) when scoring emails.  The -x option
331           can be used to allow the array to be extended past the max size if
332           more terms are available with probabilities <= 0.1 or >= 0.9.
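
           The interplay of -w, -r, and -x can be pictured with the following
           Python sketch.  It is a model built on assumptions, not SpamProbe's
           scoring code: the function name select_top_terms is hypothetical,
           and ranking terms by their distance from 0.5 is an assumed measure
           of "most significant".

```python
from typing import Dict, List, Tuple


def select_top_terms(
    terms: Dict[str, Tuple[float, int]],
    max_terms: int,
    max_repeats: int,
    extend: bool = False,
) -> List[str]:
    """Fill a top-terms array.

    terms maps each term to (probability, occurrences in the message).
    max_terms models -w, max_repeats models -r, extend models -x.
    """
    # Assumption: significance is the distance of the probability from 0.5.
    ranked = sorted(terms.items(), key=lambda item: -abs(item[1][0] - 0.5))
    slots: List[str] = []
    for term, (prob, occurrences) in ranked:
        extreme = prob <= 0.1 or prob >= 0.9
        for _ in range(min(occurrences, max_repeats)):
            # With extend set, extreme terms may grow the array past max_terms.
            if len(slots) < max_terms or (extend and extreme):
                slots.append(term)
    return slots


sample = {"a": (0.99, 3), "b": (0.05, 1), "c": (0.6, 1)}
print(select_top_terms(sample, max_terms=2, max_repeats=2))
print(select_top_terms(sample, max_terms=2, max_repeats=2, extend=True))
```

           In the sample, the repeated term fills both slots of the fixed
           array, which shows why allowing repeats reduces the number of
           distinct words used.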
333
334        -X
335
           An interesting variation on the scoring settings.  Equivalent to
           using "-w5 -r5 -x" so that generally only words with probabilities
           <= 0.1 or >= 0.9 are used and word frequencies in the email count
           heavily towards the score.  Tests have shown that this setting
           tends to be safer (fewer false positives) and have higher recall
           (proper classification of spams previously scored as spam)
           although its predictive power isn't quite as good as the default
           settings.  WARNING: This setting might work best with a fairly
           large corpus.  It has not been tested with a small corpus, so it
           might be very inaccurate with fewer than 1000 total messages.
346
347        -Y
348
349           Assume traditional Berkeley mailbox format, ignoring any
350           Content-Length: fields.
351
352        -7
353
354           Tells SpamProbe to ignore any characters with the most significant
355           bit set to 1 instead of mapping them to the letter 'z'.
356
357        -8
358
359           Tells SpamProbe to store all characters even if their most
360           significant bit is set to 1.
361
362
363       SpamProbe recognizes the following commands:
364
365        spamprobe help [command]
366
367          With no arguments spamprobe lists all of the valid commands.
368          If one or more commands are specified after the word help,
369          spamprobe will print a more verbose description of each command.
370
371        spamprobe create-db
372
373          If no database currently exists spamprobe will attempt to create
374          one and then exit.  This can be used to bootstrap a new
375          installation.  Strictly speaking this command is not necessary
376          since the train-spam, train-good, and auto-train commands will also
377          create a database if none already exists but some users like to
378          create a database as a separate installation step.
379
380        spamprobe create-config
381
382          Writes a new configuration file named spamprobe.hdl into the
383          database directory (normally $HOME/.spamprobe).  Any existing
384          configuration file will be overwritten so be sure to make a copy
385          before invoking this command.
386
387        spamprobe receive [filename...]
388
389          Tells SpamProbe to read its standard input (or a file specified
390          after the receive command) and score it using the current
391          databases.  Once the message has been scored the message is
392          classified as either spam or non-spam and its word counts are
393          written to the appropriate database.  The message's score is
394          written to stdout along with a single word.  For example:
395
396            SPAM 0.9999999 595f0150587edd7b395691964069d7af
397
398          or
399
400            GOOD 0.0200000 595f0150587edd7b395691964069d7af
401
402          The string of numbers and letters after the score is the message's
403          "digest", a 32 character number which uniquely identifies the
404          message.  The digest is used by SpamProbe to recognize messages
405          that it has processed previously so that it can keep its word
406          counts consistent if the message is reclassified.
407
408          Using the -T option additionally lists the terms used to produce
409          the score along with their counts (number of times they were found
410          in the message).
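
          A script that consumes this output might parse the result line
          along the lines of the Python sketch below.  The names Result and
          parse_result are hypothetical; the expected layout (word, score,
          32 character hexadecimal digest) is taken from the two sample lines
          above.

```python
import re
from typing import NamedTuple

# Layout taken from the sample lines: label, score, 32 hex digit digest.
RESULT_RE = re.compile(r"^(SPAM|GOOD) ([0-9]*\.?[0-9]+) ([0-9a-f]{32})$")


class Result(NamedTuple):
    label: str
    score: float
    digest: str


def parse_result(line: str) -> Result:
    """Parse a result line such as 'SPAM 0.9999999 <digest>'."""
    match = RESULT_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized result line: " + repr(line))
    return Result(match.group(1), float(match.group(2)), match.group(3))


print(parse_result("SPAM 0.9999999 595f0150587edd7b395691964069d7af"))
```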
411
412        spamprobe train [filename...]
413
414          Functionally identical to receive except that the database is only
415          modified if the message was "difficult" to classify.  In practice
416          this can reduce the number of database updates to as little as 10%
417          of messages received.
418
419        spamprobe score [filename...]
420
421          Similar to receive except that the database is not modified in
422          any way.
423
424        spamprobe summarize [filename...]
425
426          Similar to score except that it prints a short summary and score
427          for each message.  This can be useful when testing.  Using the -T
428          option additionally lists the terms used to produce the score along
429          with their counts (number of times they were found in the message).
430
431        spamprobe find-spam [filename...]
432
433          Similar to score except that it prints a short summary and score
434          for each message that is determined to be spam.  This can be useful
435          when testing.  Using the -T option additionally lists the terms
436          used to produce the score along with their counts (number of times
437          they were found in the message).
438
439        spamprobe find-good [filename...]
440
441          Similar to score except that it prints a short summary and score
442          for each message that is determined to be good.  This can be useful
443          when testing.  Using the -T option additionally lists the terms
444          used to produce the score along with their counts (number of times
445          they were found in the message).
446
447        spamprobe auto-train {SPAM|GOOD filename...}...
448
          Attempts to efficiently build a database from all of the named
          files.  You may specify one or more files of each type.  Prior to
          each set of file names you must include the word SPAM or GOOD to
          indicate what type of mail is contained in the files which follow
          on the command line.
454
455          The case of the SPAM and GOOD keywords is important.  Any number of
456          file names can be specified between the keywords.  The command line
457          format is very flexible.  You can even use a find command in
458          backticks to process whole directory trees of files. For example:
459
460            spamprobe auto-train SPAM spams/* GOOD `find hams -type f`
461
462          SpamProbe pre-scans the files to determine how many emails of each
463          type exist and then trains on hams and spams in a random sequence
464          that balances the inflow of each type so that the train command can
465          work most effectively.  For example if you had 400 hams and 400
466          spams, auto-train will generally process one spam, then one ham,
467          etc.  If you had 4000 spams and 400 hams then auto-train will
468          generally process 10 spams, then one ham, etc.
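
          The balancing idea can be sketched in Python.  Note that this
          sketch is deterministic, whereas the text above describes a random
          sequence; it only illustrates the proportions involved.  The
          function name interleave is hypothetical.

```python
from typing import List


def interleave(spam_count: int, good_count: int) -> List[str]:
    """Return a processing order that consumes both types at a balanced rate."""
    order: List[str] = []
    spams = goods = 0
    while spams < spam_count or goods < good_count:
        # Emit a good message once the spam side has progressed far enough
        # that the next good message keeps the two fractions in step.
        good_due = (goods + 1) * spam_count <= spams * good_count
        if goods < good_count and (spams >= spam_count or good_due):
            order.append("GOOD")
            goods += 1
        else:
            order.append("SPAM")
            spams += 1
    return order


# 40 spams and 4 hams: ten spams are processed before each ham.
print(interleave(40, 4)[:11])
```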
469
          Since this command will likely take a long time to run it is often
          desirable to use it with the -v option to see progress information
          as the messages are processed.
473
474            spamprobe -v auto-train SPAM spams/* GOOD hams/*
475
476        spamprobe good [filename...]
477
478          Scans each file (or stdin if no file is specified) and reclassifies
479          every email in the file as non-spam.  The databases are updated
480          appropriately.  Messages previously classified as good (recognized
481          using their MD5 digest or message ids) are ignored.  Messages
482          previously classified as spam are reclassified as good.
483
484        spamprobe train-good [filename...]
485
          Functionally identical to the "good" command except that it only
          updates the database for messages that are either incorrectly
          classified (i.e. classified as spam) or are "difficult" to
          classify.  In practice this can reduce the number of database
          updates to as little as 10% of messages.
491
492        spamprobe spam [filename...]
493
494          Scans each file (or stdin if no file is specified) and reclassifies
495          every email in the file as spam.  The databases are updated
496          appropriately.  Messages previously classified as spam (recognized
497          using their MD5 digest of message ids) are ignored.  Messages
498          previously classified as good are reclassified as spam.
499
500        spamprobe train-spam [filename...]
501
          Functionally identical to the "spam" command except that it only
          updates the database for messages that are either incorrectly
          classified (i.e. classified as good) or are "difficult" to
          classify.  In practice this can reduce the number of database
          updates to as little as 10% of messages.
507
508        spamprobe remove [filename...]
509
510          Scans each file (or stdin if no file is specified) and removes its
511          term counts from the database.  Messages which are not in the
512          database (recognized using their MD5 digest of message ids) are
513          ignored.
514
515        spamprobe cleanup [ junk_count [ max_age ] ]...
516
517          Scans the database and removes all terms with junk_count or less
518          (default 2) which have not had their counts modified in at least
519          max_age days (default 7).  You can specify multiple count/age pairs
520          on a single command line but must specify both a count and an age
521          for all but the last count.  This should be run periodically to
522          keep the database from growing endlessly.
523
524          For my own email I use cron to run the cleanup command every day
525          and delete all terms with count of 2 or less that have not been
526          modified in the last two weeks.  Here is the excerpt from my
527          crontab:
528
529              3 0 * * * /home/brian/bin/spamprobe cleanup 2 14
530
531          Alternatively you might want to use a much higher count (1000 in
532          this example) for terms that have not been seen in roughly six
533          months:
534
535              3 0 * * * /home/brian/bin/spamprobe cleanup 1000 180 2 14
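
          The removal rule for count/age pairs can be modeled in a few lines
          of Python.  The function name should_remove is hypothetical, and
          the sketch assumes a term is removed as soon as any one pair
          matches it.

```python
from typing import Iterable, Tuple


def should_remove(
    count: int,
    age_days: int,
    rules: Iterable[Tuple[int, int]] = ((2, 7),),
) -> bool:
    """True if a term matches any (junk_count, max_age) pair.

    A pair matches when the term's count is junk_count or less and its
    counts have not been modified in at least max_age days.
    """
    return any(count <= junk and age_days >= max_age for junk, max_age in rules)


# Mirrors: spamprobe cleanup 1000 180 2 14
rules = ((1000, 180), (2, 14))
print(should_remove(500, 200, rules), should_remove(500, 100, rules))
```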
536
537          Because of the way that PBL and BerkeleyDB work the database file
538          will not actually shrink, but newly added terms will be able to use
539          the space previously occupied by any removed terms so that the
540          file's growth should be significantly slower if this command is
541          used.
542
543          To actually shrink the database you can build a new one using the
544          BerkeleyDB utility programs db_dump and db_load (Berkeley DB only)
545          or the spamprobe import and export commands (either database
546          library).  For example:
547
548              cd ~
549              mkdir new.spamprobe
550              spamprobe export | spamprobe -d new.spamprobe import
551              mv .spamprobe old.spamprobe
552              mv new.spamprobe .spamprobe
553
          The -P option can also be used to limit the rate of growth of the
          database when importing a large number of emails.  For example if
          you want to classify 1000 emails and want SpamProbe to purge rare
          terms every 100 messages use a command such as:
558
559            spamprobe -P 100 good goodmailboxname
560
561          Using -P slows down the classification but can avoid the need to
562          use the db_dump trick.  Using -P only makes sense when classifying
563          a large number of messages.
564
565        spamprobe purge [ junk_count ]
566
567          Similar to cleanup but forces the immediate deletion of all terms
568          with total count less than junk_count (default is 2) no matter how
569          long it has been since they were modified (i.e. even if they were
570          just added today). This could be handy immediately after
571          classifying a large mailbox of historical spam or good email to
572          make room for the next batch.
573
574        spamprobe purge-terms regex
575
576          Similar to purge except that it removes from the database all terms
577          which match the specified regular expression.  Be careful with this
578          command because it could remove many more terms than you expect.
579          Use dump with the same regex before running this command to see
580          exactly what will be deleted.
581
582        spamprobe edit-term term good_count spam_count
583
          Can be used to specifically set the good and spam counts of a term.
          Whether this is truly useful is doubtful but it is provided for
          completeness' sake.  For example it could be used to force a
          particular word to be very spammy or very good:
588
589              spamprobe edit-term nigeria 0 1000000
590              spamprobe edit-term burton  10000000 0
591
592        spamprobe dump [ regex ]
593
594          Prints the contents of the word counts database one word per line
595          in human readable format with spam probability, good count, spam
596          count, flags, and word in columns separated by whitespace.  PBL and
597          Berkeley DB sort terms alphabetically.  The standard unix sort
598          command can be used to sort the terms as desired.  For example to
599          list all words from "most good" to "least good" use this command:
600
601              spamprobe dump | sort -k 1n -k 2nr
602
603          To list all words from "most spammy" to "least spammy" use this
604          command:
605
606              spamprobe dump | sort -k 1nr -k 3nr
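
          The same ordering can be produced in a script.  The Python sketch
          below parses dump lines using the column order given above
          (probability, good count, spam count, flags, word) and sorts them
          from most to least spammy.  The function names are hypothetical and
          the sample lines are invented for illustration, including the way
          the flags column is written.

```python
from typing import Iterable, List, Tuple

Row = Tuple[float, int, int, str, str]


def parse_dump_line(line: str) -> Row:
    """Split one dump line into probability, good count, spam count, flags, word."""
    prob, good, spam, flags, word = line.rstrip("\n").split(None, 4)
    return float(prob), int(good), int(spam), flags, word


def most_spammy_first(lines: Iterable[str]) -> List[Row]:
    """Order rows like: sort -k 1nr -k 3nr."""
    rows = [parse_dump_line(line) for line in lines if line.strip()]
    return sorted(rows, key=lambda row: (-row[0], -row[2]))


# Invented sample data, not real spamprobe output.
sample = [
    "0.0100000 250 1 0 meeting\n",
    "0.9900000 0 120 0 lottery\n",
    "0.9900000 1 300 0 click here\n",
]
for row in most_spammy_first(sample):
    print(row)
```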
607
          Optionally you can specify a regular expression.  If specified
          SpamProbe will only dump terms matching the regular expression.
          For example:

              spamprobe dump 'finance'
              spamprobe dump '\bfinance\b'
              spamprobe dump 'HSubject_.*finance'

616        spamprobe tokenize [ filename ]
617
618          Prints the tokens found in the file one word per line in human
619          readable format with spam probability, good count, spam count,
620          message count, and word in columns separated by whitespace.  Terms
621          are listed in the order in which they were encountered in the
622          message.  The standard unix sort command can be used to sort the
623          terms as desired.  For example to list all words from "most good"
624          to "least good" use this command:
625
626              spamprobe tokenize filename | sort -k 1n -k 2nr
627
628          To list all words from "most spammy" to "least spammy" use this
629          command:
630
631              spamprobe tokenize filename | sort -k 1nr -k 3nr
632
633        spamprobe export
634
635          Similar to the dump command but prints the counts and words in a
636          comma separated format with the words surrounded by double quotes.
637          This can be more useful for importing into some databases.
638
639        spamprobe import
640
641          Reads the specified files which must contain export data written by
642          the export command.  The terms and counts from this file are added
643          to the database.  This can be used to convert a database from a
644          prior version.
645
646        spamprobe exec command
647
648          Obtains an exclusive lock on the database and then executes the
649          command using system(3).  If multiple arguments are given after
650          "exec" they are combined to form the command to be executed.  This
651          command can be used when you want to perform some operation on the
652          database without interference from incoming mail.  For example, to
653          back up your .spamprobe directory using tar you could do something
654          like this:
655
              cd
              spamprobe exec tar czf spamprobe-data.tar.gz .spamprobe
658
          If you simply want to hold the lock while interactively running
          commands in a different xterm you could use "spamprobe exec read".
          The shell's read command simply reads a line of text from your
          terminal so the lock would effectively be held until you pressed
          the enter key.  Another option would be to use a shell as the
          command and type the commands into that shell:

              spamprobe exec /bin/bash
              ls
              date
              exit

          Be careful not to run spamprobe in the shell though since the
          spamprobe in the shell will wind up deadlocked waiting for the
          spamprobe running the exec command to release its lock.
674
675        spamprobe exec-shared command
676
677          Same as exec except that a shared lock is used.  This may be more
678          appropriate if you are backing up your database since operations
679          like score (but not train or receive) could still be performed on
680          the database while the backup was running.
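       
          As a rough sketch, a backup that runs under a shared lock could be
          wrapped in a small shell function.  The function name
          backup_spamprobe, the SPAMPROBE override variable, and the archive
          path are assumptions made for this example and are not part of
          SpamProbe:

```shell
#!/bin/sh
# Sketch only: archive ~/.spamprobe while spamprobe holds a shared lock.
# SPAMPROBE and backup_spamprobe are hypothetical names for this example.
SPAMPROBE=${SPAMPROBE:-spamprobe}

backup_spamprobe() {
    archive=$1
    # Run in a subshell so the caller's working directory is unchanged.
    # exec-shared combines the remaining arguments into one command.
    ( cd "$HOME" && "$SPAMPROBE" exec-shared tar czf "$archive" .spamprobe )
}

# Example (the archive path is an assumption):
#   backup_spamprobe "$HOME/spamprobe-data.tar.gz"
```

          Because the combined command is handed to system(3), an archive
          path that contains spaces would likely be split into several words.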

SETUP OF SPAMPROBE FOR USERS

       Once you have a spamprobe executable, copy it to someplace in your PATH
       so that procmail can find it.  Then create a directory for SpamProbe to
       store its databases in.  By default SpamProbe wants to use the
       directory ~/.spamprobe.  You must create this directory manually in
       order to run SpamProbe (or pass the -c option to have spamprobe create
       it), or else specify some other directory using the -d option.
       Something like this should suffice:

         mkdir ~/.spamprobe

       SpamProbe can use either the PBL or Berkeley DB library for its
       databases.  Both are fast on local file systems but very slow over NFS.
       Please make sure that your spamprobe directory is on a local file
       system to ensure good performance.

NOTES USING HASH DATABASE

       SpamProbe can use a simple, fixed size hash data file as an alternative
       to PBL or BDB.  There are two advantages to the hash format.  The first
       is speed.  In my experiments the hash file format is around 2x the
       speed of PBL (ranged from 1.8x to 3.5x).  The second advantage is that
       the hash data file size is fixed.  You choose a size when you create
       the file and it never changes.  File size can be anywhere from 1-100
       MB.  You need to choose a size large enough to hold your terms with
       room to spare.  More on that below.

       The hash file format also has significant disadvantages.  Because the
       file size is fixed you must monitor the file to ensure that it does not
       become overly full.  When the file becomes more than half full,
       performance will suffer.  Also, the hash format does not store original
       terms so you cannot use the dump command to learn what terms are spammy
       or hammy in your database.  Finally, the hash format is imprecise.
       Hash collisions can cause the counts from different terms to be mixed
       together, which can reduce accuracy.
       
       To create a hash data file you add a prefix to the directory name in
       the -d command line option.  You can specify just the directory like
       this:

         spamprobe -d hash:$HOME/.spamprobe

       or you can add a size in megabytes for the file like this:

         spamprobe -d hash:42:$HOME/.spamprobe

       The size is only used when a file is first created.  SpamProbe auto
       detects the size of an existing hash file.  You need to allow enough
       space for twice as many terms as you are likely to have in your file.
       In my database I have 2.2 million terms.  That required a database of
       around 53 MB.  SpamProbe uses 12 bytes per term in the hash file so you
       can estimate the file size you'll need by multiplying the number of
       terms by 24 (12 bytes per term, doubled for headroom).

       The hash format does not store the original terms.  Instead it stores
       the 32 bit hash code for each term.  You can do just about anything
       with a hash file that you could with a PBL file including
       import/export, edit-term, cleanup, purge, etc.  You can export your
       PBL database and import it to build a hash file (note that you cannot
       go the other direction) and you can export one hash file and import
       into a new one to enlarge your file.
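       
       The sizing rule and the export/import conversion described above can
       be sketched as a pair of shell functions.  This is only an
       illustration: estimate_hash_mb, convert_to_hash, the SPAMPROBE
       override variable, and the paths in the usage comment are hypothetical
       names, and a megabyte is assumed to mean 1,000,000 bytes, which matches
       the 2.2 million terms and 53 MB figures above.

```shell
#!/bin/sh
# Sketch only: size a hash database from a term count, then rebuild it
# from an export.  All function, variable, and path names are hypothetical.
SPAMPROBE=${SPAMPROBE:-spamprobe}

# 12 bytes per term with room for twice as many terms = 24 bytes per term.
# Prints whole megabytes (assumed to be 1,000,000 bytes), rounded up.
estimate_hash_mb() {
    terms=$1
    echo $(( ($terms * 24 + 999999) / 1000000 ))
}

# Export the current database to a file, then import that file into a
# new hash database of the requested size.  -c asks spamprobe to create
# the target directory if it does not exist yet.
convert_to_hash() {
    size_mb=$1
    new_dir=$2
    dump_file=$3
    "$SPAMPROBE" export > "$dump_file" || return 1
    "$SPAMPROBE" -c -d "hash:${size_mb}:${new_dir}" import "$dump_file"
}

# Example (paths are assumptions):
#   size=$(estimate_hash_mb 2200000)
#   convert_to_hash "$size" "$HOME/.spamprobe-hash" /tmp/spamprobe-terms.csv
```

       For the 2.2 million terms mentioned above, estimate_hash_mb prints 53.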

MAILDIR FORMAT

       SpamProbe will accept a Maildir directory name anywhere that an Mbox or
       MBX file name can be specified.  When SpamProbe encounters a Maildir
       mailbox (directory) name it will automatically process all of the
       non-hidden files in the cur and new subdirectories of the mailbox.
       There is no need to individually specify these subdirectories.
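       
       Since the cur and new subdirectories are what SpamProbe reads, a
       script can check for that layout before handing the top-level
       directory to spamprobe.  The check below is a plain directory test
       written for this example (is_maildir is a hypothetical helper and the
       path is an assumption); it does not describe how SpamProbe itself
       recognizes a Maildir:

```shell
#!/bin/sh
# Sketch only: succeed when the directory has the cur and new
# subdirectories that SpamProbe processes.  is_maildir is a hypothetical
# helper, not a SpamProbe feature.
is_maildir() {
    [ -d "$1/cur" ] && [ -d "$1/new" ]
}

# Example (the path is an assumption):
#   is_maildir "$HOME/Maildir" && spamprobe good "$HOME/Maildir"
```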

GETTING STARTED

       SpamProbe is not a stand-alone mail filter.  It doesn't sort your mail
       or split it into different mailboxes.  Instead it relies on some other
       program such as procmail to actually file your mail for you.  What
       SpamProbe does do is track the word counts in good and spam emails and
       generate a score for each email that indicates whether or not it is
       likely to be spam.  Scores range from 0 to 1, with any score of 0.9 or
       higher indicating a probable spam.
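       
       To make the cutoff concrete, the helper below applies the 0.9
       threshold to a single score line.  It assumes the line has the shape
       "<verdict> <score> ..." with the score in the second field (for
       example "SPAM 0.9987 ...").  That shape and the name is_probable_spam
       are assumptions for this sketch, so check them against the output of
       your own spamprobe:

```shell
#!/bin/sh
# Sketch only: apply the 0.9 cutoff to one score line.  The assumed line
# shape is "<verdict> <score> ..."; is_probable_spam is a hypothetical
# helper name.
is_probable_spam() {
    # $1 = one score line; succeeds when the second field is >= 0.9.
    # An empty or malformed line yields a score of 0 and so fails.
    printf '%s\n' "$1" | awk '{ s = $2 + 0 } END { exit !(s + 0 >= 0.9) }'
}

# Example (message.txt is an assumption):
#   line=$(spamprobe score < message.txt)
#   if is_probable_spam "$line"; then echo "probable spam"; fi
```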
762
763       Personally  I  use  SpamProbe with procmail to filter my incoming email
764       into mail boxes.  I have procmail score each inbound email using  Spam‐
765       Probe and insert a special header into each email containing its score.
766       Then I have procmail move spams into a special mailbox.
767
768       No spam filter is perfect and SpamProbe sometimes makes  mistakes.   To
769       correct  those  mistakes I have a special mailbox that I put undetected
770       spams into.  I run SpamProbe periodically and have  it  reclassify  any
771       emails  in that mailbox as spam so that it will make a better guess the
772       next time around.
773
774       This is not a procmail primer.  You will need to ensure that  you  have
775       procmail and formail installed before you can use this technique.  Also
776       I recommend that you read the procmail documentation so  that  you  can
777       fully  understand  this  example  and adapt it to your own needs.  That
778       having been said, my .procmailrc file looks like this:
779
780           MAILDIR=$HOME/IMAP
781
782           :0 c
783           saved
784
785           :0
786           SCORE=| /home/brian/bin/spamprobe train
787           :0 wf
788           | formail -I "X-SpamProbe: $SCORE"
789           :0 a:
790           *^X-SpamProbe: SPAM
791           spamprobe
792
793       I use IMAP to fetch my email so my mailboxes all live  in  a  directory
794       named IMAP on my mail server.
       
       NOTE: The first stanza copies all incoming emails into a special mbox
       called saved.  SpamProbe IS BETA SOFTWARE and though it works well for
       me it is possible that it could somehow lose emails.  Caution is always
       a good idea.  That having been said, with the procmailrc file as shown
       above the worst that could happen if SpamProbe crashes is that the
       email would not be scored properly and procmail would deliver it to
       your inbox.  Of course if procmail crashes all bets are off.

       The second stanza runs spamprobe in "train" mode to score the email,
       classify it as either spam or good, and possibly update the database.
       The train command tries to minimize the number of database updates by
       only updating the database with terms from an incoming message if there
       was insufficient confidence in the message's score.  The train command
       always updates the database on the first 1500 messages of each type
       received.  This ensures that sufficient email is classified to allow
       the filter to operate reliably.

       The next stanza runs formail to add a custom header to the email
       containing the SpamProbe score.  The final stanza uses the contents of
       the custom header to file detected spams into a special mbox named
       spamprobe.
       
       As an alternative to using the train command, you can run spamprobe in
       "receive" mode.  In that mode SpamProbe scores the email and then
       classifies it as either spam or good based on the score.  It always
       automatically adds the word counts for the email to the appropriate
       database.  This is essentially like running in score mode followed
       immediately by either spam or good mode.  It produces more database I/O
       and a bigger database but ensures that every message has its terms
       reflected in the database.  Personally I use train mode.  A sample
       procmailrc file using the receive command looks like this:

           MAILDIR=$HOME/IMAP

           :0 c
           saved

           :0
           SCORE=| /home/brian/bin/spamprobe receive
           :0 wf
           | formail -I "X-SpamProbe: $SCORE"
           :0 a:
           *^X-SpamProbe: SPAM
           spamprobe

MAKING CORRECTIONS

       SpamProbe is not perfect.  It is able to detect over 99% of the spams
       that I receive but some still slip through.  To correct these missed
       emails I run SpamProbe periodically and have it scan a special mbox.
       Since I use IMAP to retrieve my emails I can simply drop undetected
       spams into this mbox from my mail client.  If you use POP or some other
       system then you will need to find a way to get the undetected spams
       into an mbox that spamprobe can see.

       Periodically I run a script that scans three special mboxes to correct
       errors in judgment:

           #!/bin/sh

           IMAPDIR=$HOME/IMAP

           spamprobe remove $IMAPDIR/remove
           spamprobe good $IMAPDIR/nonspam
           spamprobe spam $IMAPDIR/spam
           spamprobe train-spam $IMAPDIR/spamprobe
       
       From this example you can see that I use three special mboxes to make
       corrections.  I copy emails that I don't want spamprobe to store into
       the remove mbox.  This is useful if you receive email from a friend or
       colleague that looks like spam and you don't want it to dilute the
       effectiveness of the terms it contains.

       Undetected spams go into the spam mbox.  SpamProbe will reclassify
       those emails as spam and correct its database accordingly.  Note that
       doing this does not guarantee that the spam will always be scored as
       spam in the future.  Some spams are too bland to detect perfectly.
       Fortunately those are very rare.

       The nonspam mbox is for any false positives.  These are always possible
       and it is important to have a way to reclassify them when they do
       occur.

       The final line of the script runs train-spam on the spamprobe mbox,
       which is where the procmailrc shown earlier files detected spams.

       If you are using receive mode rather than train mode then the above
       script can be modified to remove the train-spam line.  For example:

           #!/bin/sh

           IMAPDIR=$HOME/IMAP

           spamprobe remove $IMAPDIR/remove
           spamprobe good $IMAPDIR/nonspam
           spamprobe spam $IMAPDIR/spam
       
       Finally you'll need to build a starting database.  Since SpamProbe
       relies on word counts from past emails it requires a decent sized
       database to be accurate.  To build the database select some of your
       mboxes containing past emails.  Ideally you should have one mbox of
       spams and one or more of non-spams.  If you don't have any spams handy
       then don't worry, SpamProbe will gradually become more accurate as you
       receive more spams.  Expect a fairly high false negative (i.e. missed
       spams) rate as you first start using SpamProbe.

       To import your starting messages use commands such as these.  The
       example assumes that you have non-spams stored in a file named mbox in
       your home directory and some spams stored in a file named nasty-spams.
       Replace these names with real ones.

         spamprobe good ~/mbox
         spamprobe spam ~/nasty-spams
907
908
909