spamprobe(1) SpamProbe spamprobe(1)
2
3
4
6 spamprobe - a bayesian spam filter
7
9 spamprobe [options] <command> [filename...]
10
11
SpamProbe can be used in conjunction with procmail or a similar program
14 to filter email. SpamProbe uses a statistical algorithm to identify
15 the key words and phrases in email and determine which emails are
16 legitimate and which are spam. The algorithm used by SpamProbe is
17 based on an excellent article by Paul Graham. He describes the basic
18 idea and his results. You can read his article here:
19
20 http://www.paulgraham.com/spam.html
21
22
23
25 SpamProbe accepts a small set of commands and a growing set of options
26 on the command line in addition to zero or more file names of mboxes.
27 The general usage is:
28
29 spamprobe [options] <command> [filename...]
30
31 The recognized options are:
32
33 -a char
34
By default SpamProbe converts non-ASCII characters (characters
36 with the most significant bit set to 1) into the letter 'z'. This
37 is useful for lumping all Asian characters into a single word for
38 easy recognition. The -a option allows you to change the
39 character to something else if you don't like the letter 'z' for
40 some reason.
41
42 -c
43
44 Tells spamprobe to create the database directory if it does not
45 already exist. Normally spamprobe exits with a usage error if
46 the database directory does not already exist.
47
48 -C number
49
50 Tells SpamProbe to assign a default, somewhat neutral, probability
51 to any term that does not have a weighted (good count doubled)
52 count of at least number in the database. This prevents terms
53 which have been seen only a few times from having an unreasonable
54 influence on the score of an email containing them.
55
56 The default value is 5. For example if number is 5 then in order
57 for a term to use its calculated probability it must have been
58 seen 3 times in good mails, or 2 times in good mails and once in
59 spam, or 5 times in spam, or some other combination adding up to
60 at least 5.
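
The arithmetic behind the -C minimum can be checked by hand. The following is a minimal sketch, not SpamProbe's own code: the function name meets_min_count is hypothetical, and it assumes the weighted count is simply twice the good count plus the spam count, as the combinations listed above suggest.

```shell
# Hypothetical helper: does a term's weighted count reach the -C minimum?
# Assumption: weighted count = 2 * good_count + spam_count.
# Usage: meets_min_count GOOD_COUNT SPAM_COUNT [MINIMUM]   (MINIMUM defaults to 5)
meets_min_count() {
    good=$1
    spam=$2
    min=${3:-5}
    weighted=$((2 * good + spam))
    # Succeeds (exit status 0) when the term would use its calculated probability
    [ "$weighted" -ge "$min" ]
}
```

Under that assumption, a term seen once in good mail and twice in spam has a weighted count of 4, so it would fall back to the default probability unless -C is lowered to 4 or less.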
61
62 -d directory
63
64 By default SpamProbe stores its database in a directory named
65 .spamprobe under your home directory. The -d option allows you to
66 specify a different directory to use. This is necessary if your
67 home directory is NFS mounted for example.
68
69 The directory name can be prefixed with a special code to force
70 SpamProbe to use a particular type of data file format. The type
71 codes depend on how your copy of SpamProbe was compiled. Defined
72 types include:
73
Example          Description
-d pbl:path      Forces the use of PBL data file.
-d hash:path     Forces the use of an mmapped hash file.
-d split:path    Forces the use of a hash file and ISAM file
                 (may provide better precision than plain
                 hash in some cases).
80
81 The hash: option can also specify a desired file size in megabytes
82 before the path. For example -d hash:19:path would cause
83 SpamProbe to use a 19 MB hash file. The size must be in the range
84 of 1-100. The default hash file size is 16 MB. Because hash
85 files have a fixed size and capacity they should be cleaned
86 relatively often using the cleanup command (see below) to prevent
87 them from becoming full or being slowed by too many hash key
88 collisions.
89
90 Hash files provide better performance than either of the ISAM
91 options (PBL or Berkeley DB). However hash files do not store the
92 original terms. Only a 32 bit hash key is stored with each term.
93 This prevents a user from exploring the terms in the database
94 using the dump command to see what words are particularly spammy
95 or hammy.
96
97 -D directory
98
99 Tells SpamProbe to use the database in the specified directory
100 (must be different than the one specified with the -d option) as a
101 shared database from which to draw terms that are not defined in
102 the user's own database. This can be used to provide a baseline
103 database shared by all users on a system (in the -D directory) and
104 a private database unique to each user of the system
105 ($HOME/.spamprobe or -d directory).
106
107 -g field_name
108
109 Tells SpamProbe what header to look for previous score and message
110 digest in. Default is X-SpamProbe. Field name is not case
111 sensitive. Used by all commands except receive.
112
113 -h
114
115 By default SpamProbe removes HTML markup from the text in emails
116 to help avoid false positives. The -h option allows you to
117 override this behavior and force SpamProbe to include words from
118 within HTML tags in its word counts. Note that SpamProbe always
119 counts any URLs in hrefs within tags whether -h is used or not.
120 Use of this option is discouraged. It can increase the rate of
121 spam detection slightly but unless the user receives a significant
122 amount of HTML emails it also tends to increase the number of
123 false positives.
124
125 -H option
126
127 By default SpamProbe only scans a meaningful subset of headers
128 from the email message when searching for words to score. The -H
129 option allows the user to specify additional headers to scan.
130 Legal values are "all", "nox", "none", or "normal". "all" scans
131 all headers, "nox" scans all headers except those starting with
132 X-, "none" does not scan headers, and "normal" scans the normal
133 set of headers.
134
135 In addition to those values you can also explicitly add a header
136 to the list of headers to process by adding the header name in
137 lower case preceded by a plus sign. Multiple headers can be
138 specified by using multiple -H options. For example, to include
139 only the From and Received headers in your train command you could
140 run spamprobe as follows:
141
142 spamprobe -Hnone -H+from -H+received train
143
144 You can also selectively ignore headers that would otherwise be
145 processed by using -H-headername. For example to process all
146 headers except for Subject you could run spamprobe as follows:
147
148 spamprobe -Hall -H-subject train
149
To process the normal set of headers but also add the SpamAssassin
header X-Spam-Status you could run spamprobe as follows:
152
153 spamprobe -H+x-spam-status train
154
155 -l number
156
157 Changes the spam probability threshold for emails from the default
(0.7) to number. The number must be between 0 and 1. Generally
159 the value should be above 0.5 to avoid a high false positive rate.
160 Lower numbers tend to produce more false positives while higher
161 numbers tend to reduce accuracy.
162
163 -m
164
165 Forces SpamProbe to use mbox format for reading emails in receive
166 mode. Normally SpamProbe assumes that the input to receive mode
167 contains a single message so it doesn't look for message breaks.
168
169 -M
170
171 Forces SpamProbe to treat the entire input as a single message.
172 This ignores From lines and Content-Length headers in the input.
173
174 -o option_name
175
176 Enables special options by name. Currently the only special
177 options are:
178
179 -o graham
180
181 Causes SpamProbe to emulate the filtering algorithm originally
182 outlined in A Plan For Spam.
183
184 -o honor-status-header
185
186 Causes SpamProbe to ignore messages if they have a Status:
187 header containing a capital D. Some mail servers use this
188 status to indicate a message that has been flagged for
189 deletion but has not yet been purged from the file.
190
191 DO NOT use this option with the receive or train command in
192 your procmailrc file! Doing so could allow spammers to bypass
193 the filter. This option is meant to be used with the
194 train-spam and train-good commands in scripts that
195 periodically update the database.
196
197 -o honor-xstatus-header
198
Causes SpamProbe to ignore messages if they have an X-Status:
200 header containing a capital D. Some mail servers use this
201 status to indicate a message that has been flagged for
202 deletion but has not yet been purged from the file.
203
204 DO NOT use this option with the receive or train command in
205 your procmailrc file! Doing so could allow spammers to bypass
206 the filter. This option is meant to be used with the
207 train-spam and train-good commands in scripts that
208 periodically update the database.
209
210 -o ignore-body
211
212 Causes SpamProbe to ignore terms from the message body when
213 computing a score. This is not normally recommended but might
214 be useful in conjunction with some other filter. For example,
215 the whitelist option (see below) implicitly ignores the
216 message body.
217
218 -o orig-score
219
220 Causes SpamProbe to use its original scoring algorithm that
221 produces excellent results but tends to generate scores of
222 either 0 or 1 for all messages.
223
224 -o suspicious-tags
225
226 Causes SpamProbe to scan the contents of "suspicious" tags for
227 tokens rather than simply throwing them out. Currently only
228 font tags are scanned but other tags may be added to this list
229 in later versions.
230
231 -o tokenized
232
233 Causes SpamProbe to read tokens one per line rather than
234 processing the input as mbox format. This allows users to
235 completely replace the standard spamprobe tokenizer if they
236 wish and instead use some external program as a tokenizer.
237 For example in your procmailrc file you could use:
238
239 SCORE=| tokenize.pl | /bin/spamprobe -o tokenized train
240
241 In this mode SpamProbe considers a blank line to indicate the
242 end of one message's tokens and the start of a new message's
243 tokens. SpamProbe computes a message digest based on the
244 lines of text containing the tokens.
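
As an illustration of what an external tokenizer for -o tokenized might look like, here is a minimal sketch in shell. It is a stand-in for the tokenize.pl placeholder above, not a recommended tokenizer: the function name tokenize_simple is hypothetical, it handles a single message, and it assumes that lowercase ASCII letter and digit runs are acceptable tokens. It never emits a blank line, since a blank line would mark the start of a new message's tokens.

```shell
# Hypothetical stand-in for an external tokenizer (one message on stdin).
# Emits one lowercase alphanumeric token per line and no blank lines.
tokenize_simple() {
    # Turn every run of non-alphanumeric bytes into a single newline,
    # fold to lowercase, then drop any empty line left at the start.
    LC_ALL=C tr -cs 'A-Za-z0-9' '\n' |
        LC_ALL=C tr 'A-Z' 'a-z' |
        sed '/^$/d'
}
```

A script wrapping this function could take the place of tokenize.pl in the procmailrc line shown above.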
245
246 -o whitelist
247
248 Causes SpamProbe to use information from the email's headers
249 to identify whether or not the email is from a legitimate
250 correspondent. The message body is ignored as are any never
251 before seen terms and phrases in the headers. This option can
252 be used with the score command in a procmailrc file to use a
253 bayesian white list in conjunction with some other filter or
254 rule external to SpamProbe.
255
256 The -o option can be used multiple times and all requested options
257 will be applied. Note that some options might conflict with each
258 other in which case the last option would take precedence.
259
260 -p number
261
262 Changes the maximum number of words per phrase. Default value is
263 two. Increasing the limit improves accuracy somewhat but
264 increases database size. Experiments indicate that increasing
265 beyond two is not worth the extra cost in space.
266
267 -P number
268
269 Causes spamprobe to perform a purge of all terms with junk count
less than or equal to 2 after every number messages are processed.
271 Using this option when classifying a large collection of spam can
272 prevent the database from growing overly large at the cost of more
273 processing time and possible loss of precision.
274
275 -r number
276
277 Changes the number of times that a single word/phrase can occur
278 in the top words array used to calculate the score for each
279 message. Allowing repeats reduces the number of words overall
280 (since a single word occupies more than one slot) but allows words
281 which occur frequently in the message to have a higher weight.
282 Generally this is changed only for optimization purposes.
283
284 -R
285
286 Causes spamprobe to treat the input as a single message and to
287 base its exit code on whether or not that message was spam. The
288 exit code will be 0 if the message was spam or 1 if the message
289 was good.
290
291 -s number
292
293 SpamProbe maintains an in memory cache of the words it has seen in
294 previous messages to reduce disk I/O and improve performance. By
295 default the cache will contain the most recently accessed 2,500
terms. This number can be changed using the -s option. Using a
larger cache size will cause SpamProbe to use more memory and,
298 potentially, to perform less database I/O.
299
300 A value of zero causes SpamProbe to use 100,000 as the limit which
301 effectively means that the cache will only be flushed at program
302 exit (unless you have really enormous mailbox files). The cache
303 doesn't affect receive, dump, or export but has a significant
304 impact on the others.
305
306 -T
307 Causes SpamProbe to write out the top terms associated with each
308 message in addition to its normal output. Works with find-good,
309 find-spam, and score.
310
311 -v
312
313 Tells SpamProbe to write debugging information to stderr. This
314 can be useful for debugging or for seeing which terms SpamProbe
315 used to score each email.
316
317 -V
318
319 Prints version and copyright information and then exits.
320
321 -w number
322
323 Changes the number of most significant words/phrases used by
324 SpamProbe to calculate the score for each message. Generally this
325 is changed only for optimization purposes.
326
327 -x
328
329 Normally SpamProbe uses only a fixed number of top terms (as set
330 by the -w command line option) when scoring emails. The -x option
331 can be used to allow the array to be extended past the max size if
332 more terms are available with probabilities <= 0.1 or >= 0.9.
333
334 -X
335
336 An interesting variation on the scoring settings. Equivalent to
using "-w5 -r5 -x" so that generally only words with probabilities
338 <= 0.1 or >= 0.9 are used and word frequencies in the email count
339 heavily towards the score. Tests have shown that this setting
340 tends to be safer (fewer false positives) and have higher recall
341 (proper classification of spams previously scored as spam)
342 although its predictive power isn't quite as good as the default
343 settings. WARNING: This setting might work best with a fairly
344 large corpus, it has not been tested with a small corpus so it
345 might be very inaccurate with fewer than 1000 total messages.
346
347 -Y
348
349 Assume traditional Berkeley mailbox format, ignoring any
350 Content-Length: fields.
351
352 -7
353
354 Tells SpamProbe to ignore any characters with the most significant
355 bit set to 1 instead of mapping them to the letter 'z'.
356
357 -8
358
359 Tells SpamProbe to store all characters even if their most
360 significant bit is set to 1.
361
362
363 SpamProbe recognizes the following commands:
364
365 spamprobe help [command]
366
367 With no arguments spamprobe lists all of the valid commands.
368 If one or more commands are specified after the word help,
369 spamprobe will print a more verbose description of each command.
370
371 spamprobe create-db
372
373 If no database currently exists spamprobe will attempt to create
374 one and then exit. This can be used to bootstrap a new
375 installation. Strictly speaking this command is not necessary
376 since the train-spam, train-good, and auto-train commands will also
377 create a database if none already exists but some users like to
378 create a database as a separate installation step.
379
380 spamprobe create-config
381
382 Writes a new configuration file named spamprobe.hdl into the
383 database directory (normally $HOME/.spamprobe). Any existing
384 configuration file will be overwritten so be sure to make a copy
385 before invoking this command.
386
387 spamprobe receive [filename...]
388
389 Tells SpamProbe to read its standard input (or a file specified
390 after the receive command) and score it using the current
391 databases. Once the message has been scored the message is
392 classified as either spam or non-spam and its word counts are
393 written to the appropriate database. The message's score is
394 written to stdout along with a single word. For example:
395
396 SPAM 0.9999999 595f0150587edd7b395691964069d7af
397
398 or
399
400 GOOD 0.0200000 595f0150587edd7b395691964069d7af
401
402 The string of numbers and letters after the score is the message's
403 "digest", a 32 character number which uniquely identifies the
404 message. The digest is used by SpamProbe to recognize messages
405 that it has processed previously so that it can keep its word
406 counts consistent if the message is reclassified.
407
408 Using the -T option additionally lists the terms used to produce
409 the score along with their counts (number of times they were found
410 in the message).
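
Scripts that consume this output usually need the three fields separately. The following is a minimal sketch, assuming the three-field layout shown in the examples above (verdict, score, digest) on the first line of output; the function name parse_score_line is hypothetical.

```shell
# Hypothetical helper: split the first line of spamprobe output into fields.
# Assumption: the line looks like "SPAM 0.9999999 <digest>" as shown above.
parse_score_line() {
    # read splits on whitespace; anything after the digest lands in "rest"
    read -r verdict score digest rest
    printf 'verdict=%s score=%s digest=%s\n' "$verdict" "$score" "$digest"
}
```

It could be fed from a pipeline such as "spamprobe score < message | parse_score_line". Only the first line is read, so any term listing added by -T is ignored.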
411
412 spamprobe train [filename...]
413
414 Functionally identical to receive except that the database is only
415 modified if the message was "difficult" to classify. In practice
416 this can reduce the number of database updates to as little as 10%
417 of messages received.
418
419 spamprobe score [filename...]
420
421 Similar to receive except that the database is not modified in
422 any way.
423
424 spamprobe summarize [filename...]
425
426 Similar to score except that it prints a short summary and score
427 for each message. This can be useful when testing. Using the -T
428 option additionally lists the terms used to produce the score along
429 with their counts (number of times they were found in the message).
430
431 spamprobe find-spam [filename...]
432
433 Similar to score except that it prints a short summary and score
434 for each message that is determined to be spam. This can be useful
435 when testing. Using the -T option additionally lists the terms
436 used to produce the score along with their counts (number of times
437 they were found in the message).
438
439 spamprobe find-good [filename...]
440
441 Similar to score except that it prints a short summary and score
442 for each message that is determined to be good. This can be useful
443 when testing. Using the -T option additionally lists the terms
444 used to produce the score along with their counts (number of times
445 they were found in the message).
446
447 spamprobe auto-train {SPAM|GOOD filename...}...
448
449 Attempts to efficiently build a database from all of the named
files. You may specify one or more files of each type. Prior to
451 each set of file names you must include the word SPAM or GOOD to
452 indicate what type of mail is contained in the files which follow
453 on the command line.
454
455 The case of the SPAM and GOOD keywords is important. Any number of
456 file names can be specified between the keywords. The command line
457 format is very flexible. You can even use a find command in
458 backticks to process whole directory trees of files. For example:
459
460 spamprobe auto-train SPAM spams/* GOOD `find hams -type f`
461
462 SpamProbe pre-scans the files to determine how many emails of each
463 type exist and then trains on hams and spams in a random sequence
464 that balances the inflow of each type so that the train command can
465 work most effectively. For example if you had 400 hams and 400
466 spams, auto-train will generally process one spam, then one ham,
467 etc. If you had 4000 spams and 400 hams then auto-train will
468 generally process 10 spams, then one ham, etc.
469
470 Since this command will likely take a long time to run it is often
desirable to use it with the -v option to see progress information
472 as the messages are processed.
473
474 spamprobe -v auto-train SPAM spams/* GOOD hams/*
475
476 spamprobe good [filename...]
477
478 Scans each file (or stdin if no file is specified) and reclassifies
479 every email in the file as non-spam. The databases are updated
480 appropriately. Messages previously classified as good (recognized
481 using their MD5 digest or message ids) are ignored. Messages
482 previously classified as spam are reclassified as good.
483
484 spamprobe train-good [filename...]
485
486 Functionally identical to "good" command except that it only
487 updates the database for messages that are either incorrectly
488 classified (i.e. classified as spam) or are "difficult" to
classify. In practice this can reduce the number of database updates
490 to as little as 10% of messages.
491
492 spamprobe spam [filename...]
493
494 Scans each file (or stdin if no file is specified) and reclassifies
495 every email in the file as spam. The databases are updated
496 appropriately. Messages previously classified as spam (recognized
497 using their MD5 digest of message ids) are ignored. Messages
498 previously classified as good are reclassified as spam.
499
500 spamprobe train-spam [filename...]
501
502 Functionally identical to "spam" command except that it only
503 updates the database for messages that are either incorrectly
504 classified (i.e. classified as good) or are "difficult" to
classify. In practice this can reduce the number of database updates
506 to as little as 10% of messages.
507
508 spamprobe remove [filename...]
509
510 Scans each file (or stdin if no file is specified) and removes its
511 term counts from the database. Messages which are not in the
512 database (recognized using their MD5 digest of message ids) are
513 ignored.
514
515 spamprobe cleanup [ junk_count [ max_age ] ]...
516
Scans the database and removes all terms with a count of junk_count or less
518 (default 2) which have not had their counts modified in at least
519 max_age days (default 7). You can specify multiple count/age pairs
520 on a single command line but must specify both a count and an age
521 for all but the last count. This should be run periodically to
522 keep the database from growing endlessly.
523
524 For my own email I use cron to run the cleanup command every day
525 and delete all terms with count of 2 or less that have not been
526 modified in the last two weeks. Here is the excerpt from my
527 crontab:
528
529 3 0 * * * /home/brian/bin/spamprobe cleanup 2 14
530
531 Alternatively you might want to use a much higher count (1000 in
532 this example) for terms that have not been seen in roughly six
533 months:
534
535 3 0 * * * /home/brian/bin/spamprobe cleanup 1000 180 2 14
536
537 Because of the way that PBL and BerkeleyDB work the database file
538 will not actually shrink, but newly added terms will be able to use
539 the space previously occupied by any removed terms so that the
540 file's growth should be significantly slower if this command is
541 used.
542
543 To actually shrink the database you can build a new one using the
544 BerkeleyDB utility programs db_dump and db_load (Berkeley DB only)
545 or the spamprobe import and export commands (either database
546 library). For example:
547
548 cd ~
549 mkdir new.spamprobe
550 spamprobe export | spamprobe -d new.spamprobe import
551 mv .spamprobe old.spamprobe
552 mv new.spamprobe .spamprobe
553
554 The -P option can also be used to limit the rate of growth of the
555 database when importing a large number of emails. For example if
556 you want to classify 1000 emails and want SP to purge rare terms
557 every 100 messages use a command such as:
558
559 spamprobe -P 100 good goodmailboxname
560
561 Using -P slows down the classification but can avoid the need to
562 use the db_dump trick. Using -P only makes sense when classifying
563 a large number of messages.
564
565 spamprobe purge [ junk_count ]
566
567 Similar to cleanup but forces the immediate deletion of all terms
568 with total count less than junk_count (default is 2) no matter how
569 long it has been since they were modified (i.e. even if they were
570 just added today). This could be handy immediately after
571 classifying a large mailbox of historical spam or good email to
572 make room for the next batch.
573
574 spamprobe purge-terms regex
575
576 Similar to purge except that it removes from the database all terms
577 which match the specified regular expression. Be careful with this
578 command because it could remove many more terms than you expect.
579 Use dump with the same regex before running this command to see
580 exactly what will be deleted.
581
582 spamprobe edit-term term good_count spam_count
583
584 Can be used to specifically set the good and spam counts of a term.
585 Whether this is truly useful is doubtful but it is provided for
completeness' sake. For example it could be used to force a
587 particular word to be very spammy or very good:
588
589 spamprobe edit-term nigeria 0 1000000
590 spamprobe edit-term burton 10000000 0
591
592 spamprobe dump [ regex ]
593
594 Prints the contents of the word counts database one word per line
595 in human readable format with spam probability, good count, spam
596 count, flags, and word in columns separated by whitespace. PBL and
597 Berkeley DB sort terms alphabetically. The standard unix sort
598 command can be used to sort the terms as desired. For example to
599 list all words from "most good" to "least good" use this command:
600
601 spamprobe dump | sort -k 1n -k 2nr
602
603 To list all words from "most spammy" to "least spammy" use this
604 command:
605
606 spamprobe dump | sort -k 1nr -k 3nr
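
The same sort can be wrapped in a small filter that keeps only the top few terms. This is a sketch under stated assumptions: the function name top_spammy is hypothetical, and it assumes the column order described above (probability, good count, spam count, flags, term), treating everything from the fifth column onward as the term so that a multi-word phrase, if present, is kept whole.

```shell
# Hypothetical helper: print the N most spammy terms from dump-format
# lines read on stdin (N defaults to 10).
# Assumed columns: probability, good count, spam count, flags, term.
top_spammy() {
    n=${1:-10}
    # Highest probability first, ties broken by highest spam count
    LC_ALL=C sort -k 1,1nr -k 3,3nr |
        head -n "$n" |
        awk '{ term = $5; for (i = 6; i <= NF; i++) term = term " " $i; print $1, term }'
}
```

It would typically be used as "spamprobe dump | top_spammy 20".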
607
608 Optionally you can specify a regular expression. If specified
609 SpamProbe will only dump terms matching the regular expression.
For example:

spamprobe dump 'finance'
spamprobe dump 'HSubject_.*finance'

616 spamprobe tokenize [ filename ]
617
618 Prints the tokens found in the file one word per line in human
619 readable format with spam probability, good count, spam count,
620 message count, and word in columns separated by whitespace. Terms
621 are listed in the order in which they were encountered in the
622 message. The standard unix sort command can be used to sort the
623 terms as desired. For example to list all words from "most good"
624 to "least good" use this command:
625
626 spamprobe tokenize filename | sort -k 1n -k 2nr
627
628 To list all words from "most spammy" to "least spammy" use this
629 command:
630
631 spamprobe tokenize filename | sort -k 1nr -k 3nr
632
633 spamprobe export
634
635 Similar to the dump command but prints the counts and words in a
636 comma separated format with the words surrounded by double quotes.
637 This can be more useful for importing into some databases.
638
639 spamprobe import
640
641 Reads the specified files which must contain export data written by
642 the export command. The terms and counts from this file are added
643 to the database. This can be used to convert a database from a
644 prior version.
645
646 spamprobe exec command
647
648 Obtains an exclusive lock on the database and then executes the
649 command using system(3). If multiple arguments are given after
650 "exec" they are combined to form the command to be executed. This
651 command can be used when you want to perform some operation on the
652 database without interference from incoming mail. For example, to
653 back up your .spamprobe directory using tar you could do something
654 like this:
655
656 cd
spamprobe exec tar czf spamprobe-data.tar.gz .spamprobe
658
659 If you simply want to hold the lock while interactively running
660 commands in a different xterm you could use "spamprobe exec read".
The shell's read command simply reads a line of text from your
662 terminal so the lock would effectively be held until you pressed
663 the enter key. Another option would be to use a shell as the
664 command and type the commands into that shell:
665
spamprobe exec /bin/bash
667 ls
668 date
669 exit
670
671 Be careful not to run spamprobe in the shell though since the
672 spamprobe in the shell will wind up deadlocked waiting for the
673 spamprobe running the exec command to release its lock.
674
675 spamprobe exec-shared command
676
677 Same as exec except that a shared lock is used. This may be more
678 appropriate if you are backing up your database since operations
679 like score (but not train or receive) could still be performed on
680 the database while the backup was running.
681
682
683
685 Once you have a spamprobe executable copy it to someplace in your PATH
686 so that procmail can find it. Then create a directory for SpamProbe to
store its databases in. By default SpamProbe wants to use the
directory ~/.spamprobe. You must create this directory manually in order to
689 run SpamProbe or else specify some other directory using the -d option.
690 Something like this should suffice:
691
692 mkdir ~/.spamprobe
693
SpamProbe can use either the PBL or Berkeley DB library for its
databases. Both are fast on local file systems but very slow over NFS.
696 Please ensure that your spamprobe directory is on a local file system
697 to ensure good performance.
698
699
701 SpamProbe can use a simple, fixed size hash data file as an alternative
702 to PBL or BDB. There are two advantages to the hash format. The first
703 is speed. In my experiments the hash file format is around 2x the
704 speed of PBL (ranged from 1.8x to 3.5x). The second advantage is that
705 the hash data file size is fixed. You choose a size when you create
706 the file and it never changes. File size can be anywhere from 1-100
707 MB. You need to choose a size large enough to hold your terms with room
708 to spare. More on that later.
709
The hash file format also has significant disadvantages. Because the
file size is fixed you must monitor the file to ensure that it does not
become overly full. When the file becomes more than half full
performance will suffer. Also the hash format does not store original terms
714 so you cannot use the dump command to learn what terms are spammy or
715 hammy in your database. Finally, the hash format is imprecise. Hash
716 collisions can cause the counts from different terms to be mixed
717 together which can reduce accuracy.
718
719 To create a hash data file you add a prefix to the directory name in
720 the -d command line option. You can specify just the directory like
721 this:
722
723 spamprobe -d hash:$HOME/.spamprobe
724
725 or you can add a size in megabytes for the file like this:
726
727 spamprobe -d hash:42:$HOME/.spamprobe
728
729 The size is only used when a file is first created. SP auto detects
730 the size of an existing hash file. You need to allow enough space for
731 twice as many terms as you are likely to have in your file. In my
database I have 2.2 million terms. That required a database of around 53
733 MB. SP uses 12 bytes per term in the hash file so you can estimate the
734 file size you'll need by multiplying the number of terms by 24.
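
The sizing rule above is easy to script. The following is a minimal sketch: the function name hash_size_mb is hypothetical, and it assumes 1 MB means 1,000,000 bytes, which is the reading that reproduces the 2.2 million terms and 53 MB figures quoted above.

```shell
# Hypothetical helper: estimate the hash file size in MB for a term count.
# Assumptions: 24 bytes per term (12 bytes each, sized for twice as many
# terms as expected) and 1 MB = 1,000,000 bytes.
hash_size_mb() {
    terms=$1
    # Integer ceiling of terms * 24 / 1000000
    mb=$(( (terms * 24 + 999999) / 1000000 ))
    echo "$mb"
    # Exit status reports whether the result is inside the 1-100 MB range
    [ "$mb" -ge 1 ] && [ "$mb" -le 100 ]
}
```

For 2200000 terms this prints 53. A non-zero exit status means the estimate falls outside the 1-100 MB range that the hash format accepts.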
735
736 The hash format does not store the original terms. Instead it stores
737 the 32 bit hash code for each term. You can do just about anything
738 with a hash file that you could with a PBL file including
import/export, edit-term, cleanup, purge, etc. You can export your
740 PBL database and import it to build a hash file (note that you cannot
741 go the other direction) and you can export one hash file and import
742 into a new one to enlarge your file.
743
744
746 SpamProbe will accept a maildir directory name anywhere that an Mbox or
747 MBX file name can be specified. When SpamProbe encounters a Maildir
mailbox (directory) name it will automatically process all of the
non-hidden files in the cur and new subdirectories of the mailbox. There
750 is no need to individually specify these subdirectories.
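
To preview which files a Maildir scan would cover, something like the following can help. It is only a sketch of the rule described above, not SpamProbe's own logic: the function name maildir_message_files is hypothetical, and it assumes that only regular files directly inside cur and new count. The -maxdepth option is not in POSIX but is supported by GNU and BSD find.

```shell
# Hypothetical helper: list non-hidden regular files directly inside the
# cur and new subdirectories of a Maildir.
maildir_message_files() {
    maildir=$1
    for sub in cur new; do
        # Skip a subdirectory that does not exist rather than failing
        [ -d "$maildir/$sub" ] || continue
        find "$maildir/$sub" -maxdepth 1 -type f ! -name '.*'
    done
}
```

For example, "maildir_message_files ~/Maildir | wc -l" gives a rough count of the messages that a command such as "spamprobe good ~/Maildir" would be expected to read.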
751
752
753
755 SpamProbe is not a stand alone mail filter. It doesn't sort your mail
756 or split it into different mailboxes. Instead it relies on some other
757 program such as procmail to actually file your mail for you. What
758 SpamProbe does do is track the word counts in good and spam emails and
759 generate a score for each email that indicates whether or not it is
760 likely to be spam. Scores range from 0 to 1 with any score of 0.9 or
761 higher indicating a probable spam.
762
763 Personally I use SpamProbe with procmail to filter my incoming email
into mail boxes. I have procmail score each inbound email using
SpamProbe and insert a special header into each email containing its score.
766 Then I have procmail move spams into a special mailbox.
767
768 No spam filter is perfect and SpamProbe sometimes makes mistakes. To
769 correct those mistakes I have a special mailbox that I put undetected
770 spams into. I run SpamProbe periodically and have it reclassify any
771 emails in that mailbox as spam so that it will make a better guess the
772 next time around.
773
774 This is not a procmail primer. You will need to ensure that you have
775 procmail and formail installed before you can use this technique. Also
776 I recommend that you read the procmail documentation so that you can
777 fully understand this example and adapt it to your own needs. That
778 having been said, my .procmailrc file looks like this:
779
780 MAILDIR=$HOME/IMAP
781
782 :0 c
783 saved
784
785 :0
786 SCORE=| /home/brian/bin/spamprobe train
787 :0 wf
788 | formail -I "X-SpamProbe: $SCORE"
789 :0 a:
790 *^X-SpamProbe: SPAM
791 spamprobe
792
I use IMAP to fetch my email, so my mailboxes all live in a directory
named IMAP on my mail server.

NOTE: The first stanza copies all incoming emails into a special mbox
called saved. SpamProbe IS BETA SOFTWARE and though it works well for
me it is possible that it could somehow lose emails. Caution is always
a good idea. That having been said, with the procmailrc file as shown
above the worst that could happen if SpamProbe crashes is that the
email would not be scored properly and procmail would deliver it to
your inbox. Of course if procmail crashes all bets are off.

The second stanza runs spamprobe in "train" mode to score the email,
classify it as either spam or good, and possibly update the database.
The train command tries to minimize the number of database updates by
only updating the database with terms from an incoming message if there
was insufficient confidence in the message's score. The train command
always updates the database for the first 1500 messages of each type
(spam and good) received. This ensures that sufficient email is
classified to allow the filter to operate reliably.

The next stanza runs formail to add a custom header to the email
containing the SpamProbe score. The final stanza uses the contents of
the custom header to file detected spams into a special mbox named
spamprobe.
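
The update rule that train follows, as described above, can be written
out as a small decision function. This is a sketch of the described
behaviour only, not SpamProbe's implementation. How SpamProbe decides
that a score is confident enough is not covered here, so that decision
is passed in as an argument, and the function name is hypothetical.

```shell
# Hypothetical sketch of the train update rule.
#   $1 = messages of this type (spam or good) already received
#   $2 = "yes" if the score was confident enough, otherwise "no"
train_should_update() {
    seen=$1
    confident=$2
    if [ "$seen" -lt 1500 ] || [ "$confident" = "no" ]; then
        # Still within the first 1500 of this type, or the
        # score could not be trusted: add the terms.
        echo update
    else
        echo skip
    fi
}
```

For instance, train_should_update 1499 yes prints update, while
train_should_update 1500 yes prints skip.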
817
As an alternative to using the train command, you can run spamprobe in
"receive" mode. In that mode SpamProbe scores the email and then
classifies it as either spam or good based on the score. It always
adds the word counts for the email to the appropriate database. This
is essentially like running in score mode followed immediately by
either spam or good mode. It produces more database I/O and a bigger
database but ensures that every message has its terms reflected in the
database. Personally I use train mode. A sample procmailrc file using
the receive command looks like this:

    MAILDIR=$HOME/IMAP

    :0 c
    saved

    :0
    SCORE=| /home/brian/bin/spamprobe receive
    :0 wf
    | formail -I "X-SpamProbe: $SCORE"
    :0 a:
    *^X-SpamProbe: SPAM
    spamprobe
840
841
842
SpamProbe is not perfect. It is able to detect over 99% of the spams
that I receive but some still slip through. To correct these missed
emails I run SpamProbe periodically and have it scan a special mbox.
Since I use IMAP to retrieve my emails I can simply drop undetected
spams into this mbox from my mail client. If you use POP or some other
system then you will need to find a way to get the undetected spams
into an mbox that spamprobe can see.

Periodically I run a script that scans three special mboxes to correct
errors in judgment:

    #!/bin/sh

    IMAPDIR=$HOME/IMAP

    spamprobe remove "$IMAPDIR/remove"
    spamprobe good "$IMAPDIR/nonspam"
    spamprobe spam "$IMAPDIR/spam"
    spamprobe train-spam "$IMAPDIR/spamprobe"

From this example you can see that I use three special mboxes to make
corrections. The fourth command runs train-spam over the spamprobe
mbox, which is where procmail files detected spams. I copy emails that
I don't want spamprobe to store into the remove mbox. This is useful
if you receive email from a friend or colleague that looks like spam
and you don't want it to dilute the effectiveness of the terms it
contains.

Undetected spams go into the spam mbox. SpamProbe will reclassify
those emails as spam and correct its database accordingly. Note that
doing this does not guarantee that the spam will always be scored as
spam in the future. Some spams are too bland to detect perfectly.
Fortunately those are very rare.

The nonspam mbox is for any false positives. These are always possible
and it is important to have a way to reclassify them when they do
occur.

If you are using receive mode rather than train mode then the above
script can be modified to remove the train-spam line. For example:

    #!/bin/sh

    IMAPDIR=$HOME/IMAP

    spamprobe remove "$IMAPDIR/remove"
    spamprobe good "$IMAPDIR/nonspam"
    spamprobe spam "$IMAPDIR/spam"
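
The correction scripts above can also be written as a loop that skips
any mbox that does not exist yet. This is a variation offered as a
sketch, not the author's script. The SPAMPROBE variable and the
run_corrections function are names invented for this example. Setting
SPAMPROBE=echo gives a dry run that prints the commands instead of
running them.

```shell
#!/bin/sh

# Command to run; override with SPAMPROBE=echo for a dry run.
SPAMPROBE=${SPAMPROBE:-spamprobe}

# Hypothetical helper: apply each correction command to its mbox
# under the directory given as $1, skipping mboxes that are missing.
run_corrections() {
    dir=$1
    for pair in remove:remove good:nonspam spam:spam train-spam:spamprobe; do
        cmd=${pair%%:*}    # spamprobe command (text before the colon)
        box=${pair#*:}     # mbox name (text after the colon)
        if [ -e "$dir/$box" ]; then
            $SPAMPROBE "$cmd" "$dir/$box"
        fi
    done
}
```

It would be called as run_corrections "$HOME/IMAP". For receive mode,
drop the train-spam:spamprobe entry from the list.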
890
Finally you'll need to build a starting database. Since SpamProbe
relies on word counts from past emails it requires a decent-sized
database to be accurate. To build the database select some of your
mboxes containing past emails. Ideally you should have one mbox of
spams and one or more of non-spams. If you don't have any spams handy
then don't worry; SpamProbe will gradually become more accurate as you
receive more spams. Expect a fairly high false negative (i.e. missed
spam) rate when you first start using SpamProbe.

To import your starting messages use commands such as these. The
example assumes that you have non-spams stored in a file named mbox in
your home directory and some spams stored in a file named nasty-spams.
Replace these names with real ones.

    spamprobe good ~/mbox
    spamprobe spam ~/nasty-spams
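
If you have several non-spam mboxes, the import can be wrapped in a
small function. This is a sketch under a few assumptions: the function
name and the SPAMPROBE variable are invented for the example, all
non-spam mboxes are passed in one invocation (the general usage allows
multiple file names), and the -c option described under OPTIONS is used
so that the database directory is created on the first run.

```shell
# Command to run; override with SPAMPROBE=echo for a dry run.
SPAMPROBE=${SPAMPROBE:-spamprobe}

# Hypothetical helper.
#   $1      = mbox of spams (may be missing if you have none yet)
#   $2 ...  = one or more mboxes of non-spams
seed_database() {
    spam_mbox=$1
    shift
    # -c creates the database directory if it does not already exist.
    $SPAMPROBE -c good "$@"
    if [ -e "$spam_mbox" ]; then
        $SPAMPROBE spam "$spam_mbox"
    fi
}
```

For example: seed_database ~/nasty-spams ~/mbox ~/old-mail, where
old-mail is a placeholder name.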
907
908
909
procmail(1)
912
913
914
Version 1.4                     December 2005                     spamprobe(1)