sa-learn(1)

SA-LEARN(1)           User Contributed Perl Documentation          SA-LEARN(1)

NAME

       sa-learn - train SpamAssassin's Bayesian classifier

SYNOPSIS

       sa-learn [options] [file]...

       sa-learn [options] --dump [ all | data | magic ]

       Options:

        --ham                 Learn messages as ham (non-spam)
        --spam                Learn messages as spam
        --forget              Forget a message
        --use-ignores         Use bayes_ignore_from and bayes_ignore_to
        --sync                Synchronize the database and the journal if needed
        --force-expire        Force a database sync and expiry run
        --dbpath <path>       Allows commandline override (in bayes_path form)
                              for where to read the Bayes DB from
        --dump [all|data|magic]  Display the contents of the Bayes database
                              Takes optional argument for what to display
         --regexp <re>        For dump only, specifies which tokens to
                              dump based on a regular expression.
        -f file, --folders=file  Read list of files/directories from file
        --dir                 Ignored; historical compatibility
        --file                Ignored; historical compatibility
        --mbox                Input sources are in mbox format
        --mbx                 Input sources are in mbx format
        --showdots            Show progress using dots
        --progress            Show progress using progress bar
        --no-sync             Skip synchronizing the database and journal
                              after learning
        -L, --local           Operate locally, no network accesses
        --import              Migrate data from older version/non DB_File
                              based databases
        --clear               Wipe out existing database
        --backup              Backup, to STDOUT, existing database
        --restore <filename>  Restore a database from filename
        -u username, --username=username
                              Override username taken from the runtime
                              environment, used with SQL
        -C path, --configpath=path, --config-file=path
                              Path to standard configuration dir
        -p prefs, --prefspath=file, --prefs-file=file
                              Set user preferences file
        --siteconfigpath=path Path for site configs
                              (default: /etc/mail/spamassassin)
        --cf='config line'    Additional line of configuration
        -D, --debug [area=n,...]  Print debugging messages
        -V, --version         Print version
        -h, --help            Print usage message

DESCRIPTION

       Given a typical selection of your incoming mail classified as spam or
       ham (non-spam), this tool will feed each mail to SpamAssassin, allowing
       it to 'learn' what signs are likely to mean spam, and which are likely
       to mean ham.

       Simply run this command once for each of your mail folders, and it will
       'learn' from the mail therein.

       Note that csh-style globbing in the mail folder names is supported; in
       other words, listing a folder name as "*" will scan every folder that
       matches.  See "Mail::SpamAssassin::ArchiveIterator" for more details.

       SpamAssassin remembers which mail messages it has already learnt, and
       will not re-learn them unless you use the --forget option.  Messages
       learnt as spam will have SpamAssassin markup removed on the fly.

       If you make a mistake and scan a mail as ham when it is spam, or vice
       versa, simply rerun this command with the correct classification, and
       the mistake will be corrected.  SpamAssassin will automatically
       'forget' the previous indications.

       Users of "spamd" who wish to perform training remotely, over a network,
       should investigate the "spamc -L" switch.

OPTIONS

       --ham
           Learn the input message(s) as ham.  If you have previously learnt
           any of the messages as spam, SpamAssassin will forget them first,
           then re-learn them as ham.  Alternatively, if you have previously
           learnt them as ham, it'll skip them this time around.  If the
           messages have already been filtered through SpamAssassin, the
           learner will ignore any modifications SpamAssassin may have made.

       --spam
           Learn the input message(s) as spam.  If you have previously learnt
           any of the messages as ham, SpamAssassin will forget them first,
           then re-learn them as spam.  Alternatively, if you have previously
           learnt them as spam, it'll skip them this time around.  If the
           messages have already been filtered through SpamAssassin, the
           learner will ignore any modifications SpamAssassin may have made.

       --folders=filename, -f filename
           sa-learn will read in the list of folders from the specified file,
           one folder per line in the file.  If the folder is prefixed with
           "ham:type:" or "spam:type:", sa-learn will learn that folder
           appropriately, otherwise the folders will be assumed to be of the
           type specified by --ham or --spam.

           "type" above is optional, but is the same as the standard for
           ArchiveIterator: mbox, mbx, dir, file, or detect (the default if
           not specified).

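       The --folders line format described above can be illustrated with a
       small Python sketch.  This is not sa-learn's own parser; the function
       name parse_folder_line and the default_class parameter are
       hypothetical, and the handling of an omitted type is an assumption
       based only on the wording above.

```python
# Sketch only: mirrors the "ham:type:" / "spam:type:" prefix rule from the
# --folders description. Not taken from the sa-learn source.
KNOWN_TYPES = {"mbox", "mbx", "dir", "file", "detect"}


def parse_folder_line(line, default_class):
    """Split one --folders line into (class, type, path).

    default_class is "ham" or "spam" and stands in for --ham / --spam.
    """
    entry = line.strip()
    for prefix in ("ham", "spam"):
        if entry.startswith(prefix + ":"):
            rest = entry[len(prefix) + 1:]
            head, sep, tail = rest.partition(":")
            if sep and head in KNOWN_TYPES:
                return prefix, head, tail
            # Assumption: with the type omitted, fall back to "detect".
            return prefix, "detect", rest
    # No class prefix: use the class given on the command line.
    return default_class, "detect", entry
```

       For instance, parse_folder_line("spam:mbox:/var/mail/junk", "ham")
       yields the class "spam", the type "mbox" and the path
       "/var/mail/junk", while an unprefixed line keeps the default class.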
       --mbox
           sa-learn will read in the file(s) containing the emails to be
           learned, and will process them in mbox format (one or more emails
           per file).

       --mbx
           sa-learn will read in the file(s) containing the emails to be
           learned, and will process them in mbx format (one or more emails
           per file).

       --use-ignores
           Don't learn the message if a from address matches configuration
           file item "bayes_ignore_from" or a to address matches
           "bayes_ignore_to".  The option might be used when learning from a
           large file of messages from which the hammy spam messages or spammy
           ham messages have not been removed.

       --sync
           Synchronize the journal and databases.  Upon successfully syncing
           the database with the entries in the journal, the journal file is
           removed.

       --force-expire
           Forces an expiry attempt, regardless of whether it may be necessary
           or not.  Note: This doesn't mean any tokens will actually expire.
           Please see the EXPIRATION section below.

           Note: "--force-expire" also causes the journal data to be
           synchronized into the Bayes databases.

       --forget
           Forget a given message previously learnt.

       --dbpath
           Allows a commandline override of the bayes_path configuration
           option.

       --dump option
           Display the contents of the Bayes database.  Without an option or
           with the all option, all magic tokens and data tokens will be
           displayed.  magic will only display magic tokens, and data will
           only display the data tokens.

           You can also use the --regexp RE option to specify which tokens to
           display based on a regular expression.

       --clear
           Clear an existing Bayes database by removing all traces of the
           database.

           WARNING: This is destructive and should be used with care.

       --backup
           Performs a dump of the Bayes database in machine/human readable
           format.

           The dump will include token and seen data.  It is suitable for
           input back into the --restore command.

       --restore=filename
           Performs a restore of the Bayes database defined by filename.

           WARNING: This is a destructive operation, previous Bayes data will
           be wiped out.

       -h, --help
           Print help message and exit.

       -u username, --username=username
           If specified this username will override the username taken from
           the runtime environment.  You can use this option to specify users
           in a virtual user configuration when using SQL as the Bayes
           backend.

           NOTE: This option will not change to the given username; it will
           only attempt to act on behalf of that user.  Because of this you
           will need to have proper permissions to be able to change files
           owned by username.  In the case of SQL this generally is not a
           problem.

       -C path, --configpath=path, --config-file=path
           Use the specified path for locating the distributed configuration
           files.  Ignore the default directories (usually
           "/usr/share/spamassassin" or similar).

       --siteconfigpath=path
           Use the specified path for locating site-specific configuration
           files.  Ignore the default directories (usually
           "/etc/mail/spamassassin" or similar).

       --cf='config line'
           Add additional lines of configuration directly from the
           command-line, parsed after the configuration files are read.
           Multiple --cf arguments can be used, and each will be considered a
           separate line of configuration.

       -p prefs, --prefspath=prefs, --prefs-file=prefs
           Read user score preferences from prefs (usually
           "$HOME/.spamassassin/user_prefs").

       --progress
           Prints a progress bar (to STDERR) showing the current progress.  In
           the case where no valid terminal is found this option will behave
           very much like the --showdots option.

       -D [area,...], --debug [area,...]
           Produce debugging output. If no areas are listed, all debugging
           information is printed. Diagnostic output can also be enabled for
           each area individually; area is the area of the code to instrument.
           For example, to produce diagnostic output on bayes, learn, and dns,
           use:

                   spamassassin -D bayes,learn,dns

           For more information about which areas (also known as channels) are
           available, please see the documentation at:

                   http://wiki.apache.org/spamassassin/DebugChannels

           Higher priority informational messages that are suitable for
           logging in normal circumstances are available with an area of
           "info".

       --no-sync
           Skip the slow synchronization step which normally takes place after
           changing database entries.  If you plan to learn from many folders
           in a batch, or to learn many individual messages one-by-one, it is
           faster to use this switch and run "sa-learn --sync" once all the
           folders have been scanned.

           Clarification: The state of --no-sync overrides the
           bayes_learn_to_journal configuration option.  If not specified,
           sa-learn will learn to the database directly.  If specified,
           sa-learn will learn to the journal file.

           Note: --sync and --no-sync can be specified on the same
           commandline, which is slightly confusing.  In this case, the
           --no-sync option is ignored since there is no learn operation.

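       The batch pattern described under --no-sync (learn every folder with
       --no-sync, then sync once) can be sketched in Python as follows.  The
       helper name batch_learn_commands and the example paths are
       hypothetical; only the sa-learn switches come from this page, and the
       sketch builds the command lines without running them.

```python
def batch_learn_commands(folders):
    """Return argv lists for a batch run.

    folders is a list of (class, path) pairs, where class is "ham" or
    "spam". Every folder is learnt with --no-sync, and a single
    "sa-learn --sync" is appended at the end.
    """
    commands = []
    for kind, path in folders:
        if kind not in ("ham", "spam"):
            raise ValueError(f"unknown class: {kind!r}")
        commands.append(["sa-learn", f"--{kind}", "--no-sync", path])
    # One synchronization after all folders have been scanned.
    commands.append(["sa-learn", "--sync"])
    return commands
```

       Each returned list could then be passed to subprocess.run(argv,
       check=True) on a machine where sa-learn is installed.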
       -L, --local
           Do not perform any network accesses while learning details about
           the mail messages.  This will speed up the learning process, but
           may result in a slightly lower accuracy.

           Note that this is currently ignored, as current versions of
           SpamAssassin will not perform network access while learning; but
           future versions may.

       --import
           If you previously used SpamAssassin's Bayesian learner without the
           "DB_File" module installed, it will have created files in other
           formats, such as "GDBM_File", "NDBM_File", or "SDBM_File".  This
           switch allows you to migrate that old data into the "DB_File"
           format.  It will overwrite any data currently in the "DB_File".

           Can also be used with the --dbpath path option to specify the
           location of the Bayes files to use.

MIGRATION

       There are now multiple backend storage modules available for storing
       users' Bayesian data, so you might want to migrate from one backend to
       another.  Here is a simple procedure for doing so.

       Note that if you have individual user databases you will have to
       perform a similar procedure for each one of them.

       sa-learn --sync
           This will sync any outstanding journal entries.

       sa-learn --backup > backup.txt
           This will save all your Bayes data to a plain text file.

       sa-learn --clear
           This is optional, but good to do to clear out the old database.

       Repeat!
           At this point, if you have multiple databases, you should perform
           the procedure above for each of them. (i.e. each user's database
           needs to be backed up before continuing.)

       Switch backends
           Once you have backed up all databases you can update your
           configuration for the new database backend. This will involve at
           least the bayes_store_module config option and may involve some
           additional config options depending on what is required by the
           module. (For example, you may need to configure an SQL database.)

       sa-learn --restore backup.txt
           Again, you need to do this for every database.

       If you are migrating to SQL you can make use of the -u <username>
       option in sa-learn to populate each user's database. Otherwise, you
       must run sa-learn as the user whose database you are restoring.

INTRODUCTION TO BAYESIAN FILTERING

       (Thanks to Michael Bell for this section!)

       For a more lengthy description of how this works, go to
       http://www.paulgraham.com/ and see "A Plan for Spam". It's reasonably
       readable, even if statistics make me break out in hives.

       The short semi-inaccurate version: Given training, a spam heuristics
       engine can take the most "spammy" and "hammy" words and apply
       probabilistic analysis. Furthermore, once given a basis for the
       analysis, the engine can continue to learn iteratively by applying both
       the non-Bayesian and Bayesian rulesets together to create evolving
       "intelligence".

       SpamAssassin 2.50 and later supports Bayesian spam analysis, in the
       form of the BAYES rules. This is a new feature, quite powerful, and is
       disabled until enough messages have been learnt.

       The pros of Bayesian spam analysis:

       Can greatly reduce false positives and false negatives.
           It learns from your mail, so it is tailored to your unique e-mail
           flow.

       Once it starts learning, it can continue to learn from SpamAssassin and
       improve over time.

       And the cons:

       A decent number of messages are required before results are useful for
       ham/spam determination.

       It's hard to explain why a message is or isn't marked as spam.
           For example, a straightforward rule that matches, say, "VIAGRA" is
           easy to understand. If it generates a false positive or false
           negative, it is fairly easy to understand why.

           With Bayesian analysis, it's all probabilities - "because the past
           says it is likely as this falls into a probabilistic distribution
           common to past spam in your systems". Tell that to your users!
           Tell that to the client when he asks "what can I do to change
           this". (By the way, the answer in this case is "use whitelisting".)

       It will take disk space and memory.
           The databases it maintains take quite a lot of resources to store
           and use.

GETTING STARTED

       Still interested? Ok, here are the guidelines for getting this working.

       First a high-level overview:

       Build a significant sample of both ham and spam.
           I suggest several thousand of each, placed in SPAM and HAM
           directories or mailboxes.  Yes, you MUST hand-sort this - otherwise
           the results won't be much better than SpamAssassin on its own.
           Verify the spamminess/hamminess of EVERY message.  You're urged to
           avoid using a publicly available corpus (sample) - this must be
           taken from YOUR mail server, if it is to be statistically useful.
           Otherwise, the results may be pretty skewed.

       Use this tool to teach SpamAssassin about these samples, like so:
                   sa-learn --spam /path/to/spam/folder
                   sa-learn --ham /path/to/ham/folder
                   ...

           Let SpamAssassin proceed, learning stuff. When it finds ham and
           spam it will add the "interesting tokens" to the database.

       If you need SpamAssassin to forget about specific messages, use the
       --forget option.
           This can be applied to either ham or spam that has run through the
           sa-learn processes. It's a bit of a hammer, really, lowering the
           weighting of the specific tokens in that message (only if that
           message has been processed before).

       Learning from single messages uses a command like this:
                   sa-learn --ham --no-sync mailmessage

           This is handy for binding to a key in your mail user agent.  It's
           very fast, as all the time-consuming stuff is deferred until you
           run with the "--sync" option.

       Autolearning is enabled by default
           If you don't have a corpus of mail saved to learn, you can let
           SpamAssassin automatically learn the mail that you receive.  If you
           are autolearning from scratch, the amount of mail you receive will
           determine how long until the BAYES_* rules are activated.

EFFECTIVE TRAINING

       Learning filters require training to be effective.  If you don't train
       them, they won't work.  In addition, you need to train them with new
       messages regularly to keep them up-to-date, or their data will become
       stale and accuracy will suffer.

       You need to train with both spam and ham mails.  One type of mail alone
       will not have any effect.

       Note that if your mail folders contain things like forwarded spam,
       discussions of spam-catching rules, etc., this will cause trouble.  You
       should avoid scanning those messages if possible.  (An easy way to do
       this is to move them aside, into a folder which is not scanned.)

       If the messages you are learning from have already been filtered
       through SpamAssassin, the learner will compensate for this.  In effect,
       it learns what each message would look like if you had run
       "spamassassin -d" over it in advance.

       Another thing to be aware of is that you should typically aim to train
       with at least 1000 messages of spam, and 1000 ham messages, if
       possible.  More is better, but anything over about 5000 messages does
       not improve accuracy significantly in our tests.

       Be careful that you train from the same source: for example, if you
       train on old spam, but new ham mail, then the classifier will think
       that a mail with an old date stamp is likely to be spam.

       It's also worth noting that training with a very small quantity of ham
       will produce atrocious results.  You should aim to train with at least
       as much ham data as spam (more if possible!).

       On an on-going basis, it is best to keep training the filter to make
       sure it has fresh data to work from.  There are various ways to do
       this:

       1. Supervised learning
           This means keeping a copy of all or most of your mail, separated
           into spam and ham piles, and periodically re-training using those.
           It produces the best results, but requires more work from you, the
           user.

           (An easy way to do this, by the way, is to create a new folder for
           'deleted' messages, and instead of deleting them from other
           folders, simply move them in there instead.  Then keep all spam in
           a separate folder and never delete it.  As long as you remember to
           move misclassified mails into the correct folder set, it is easy
           enough to keep up to date.)

       2. Unsupervised learning from Bayesian classification
           Another way to train is to chain the results of the Bayesian
           classifier back into the training, so it reinforces its own
           decisions.  This is only safe if you then retrain it based on any
           errors you discover.

           SpamAssassin does not support this method, due to experimental
           results which strongly indicate that it does not work well, and
           since Bayes is only one part of the resulting score presented to
           the user (while Bayes may have made the wrong decision about a
           mail, it may have been overridden by another system).

       3. Unsupervised learning from SpamAssassin rules
           Also called 'auto-learning' in SpamAssassin.  Based on statistical
           analysis of the SpamAssassin success rates, we can automatically
           train the Bayesian database with a certain degree of confidence
           that our training data is accurate.

           It should be supplemented with some supervised training in
           addition, if possible.

           This is the default, but can be turned off by setting the
           SpamAssassin configuration parameter "bayes_auto_learn" to 0.

       4. Mistake-based training
           This means training on a small number of mails, then only training
           on messages that SpamAssassin classifies incorrectly.  This works,
           but it takes longer to get it right than a full training session
           would.

FILES

       sa-learn and the other parts of SpamAssassin's Bayesian learner use a
       set of persistent database files to store the learnt tokens, as
       follows.

       bayes_toks
           The database of tokens, containing the tokens learnt, their count
           of occurrences in ham and spam, and the timestamp when the token
           was last seen in a message.

           This database also contains some 'magic' tokens, as follows: the
           version number of the database, the number of ham and spam messages
           learnt, the number of tokens in the database, and timestamps of:
           the last journal sync, the last expiry run, the last expiry token
           reduction count, the last expiry timestamp delta, the oldest token
           timestamp in the database, and the newest token timestamp in the
           database.

           This is a database file, using "DB_File".  The database 'version
           number' is 0 for databases from 2.5x, 1 for databases from certain
           2.6x development releases, and 2 for all more recent databases.

       bayes_seen
           A map of Message-Id and some data from headers and body to what
           that message was learnt as. This is used so that SpamAssassin can
           avoid re-learning a message it has already seen, and so it can
           reverse the training if you later decide that message was learnt
           incorrectly.

           This is a database file, using "DB_File".

       bayes_journal
           While SpamAssassin is scanning mails, it needs to track which
           tokens it uses in its calculations.  To avoid the contention of
           having each SpamAssassin process attempting to gain write access to
           the Bayes DB, the token timestamps are written to a 'journal' file
           which will later (either automatically or via "sa-learn --sync") be
           used to synchronize the Bayes DB.

           Also, through the use of "bayes_learn_to_journal", or when using
           the "--no-sync" option with sa-learn, the actual learning data will
           be placed into the journal for later synchronization.  This is
           typically useful for high-traffic sites to avoid the same
           contention as stated above.

EXPIRATION

       Since SpamAssassin can auto-learn messages, the Bayes database files
       could increase perpetually until they fill your disk.  To control this,
       SpamAssassin performs journal synchronization and bayes expiration
       periodically when certain criteria (listed below) are met.

       SpamAssassin can sync the journal and expire the DB tokens either
       manually or opportunistically.  A journal sync is due if --sync is
       passed to sa-learn (manual), or if the following is true
       (opportunistic):

       - bayes_journal_max_size does not equal 0 (0 means don't sync)
       - the journal file exists

       and either:

       - the journal file has a size greater than bayes_journal_max_size

       or

       - a journal sync has previously occurred, and at least 1 day has passed
         since that sync

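       The opportunistic journal sync conditions above can be written as a
       single predicate.  The following Python sketch is an illustration
       under stated assumptions, not SpamAssassin's code: the function and
       parameter names are hypothetical, sizes are taken to be in bytes, and
       timestamps are taken to be in seconds.

```python
DAY = 24 * 60 * 60  # seconds


def journal_sync_due(max_size, journal_size, last_sync, now):
    """Return True if an opportunistic journal sync is due.

    max_size     -- value of bayes_journal_max_size (0 disables syncing)
    journal_size -- size of the journal file, or None if it does not exist
    last_sync    -- time of the previous sync, or None/0 if never synced
    now          -- current time
    """
    if max_size == 0:
        return False  # 0 means don't sync
    if journal_size is None:
        return False  # no journal file, nothing to sync
    if journal_size > max_size:
        return True
    # Otherwise a sync must have happened before, at least a day ago.
    return bool(last_sync) and now - last_sync >= DAY
```

       A manual "sa-learn --sync" bypasses these checks entirely.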
       Expiry is due if --force-expire is passed to sa-learn (manual), or if
       all of the following are true (opportunistic):

       - the last expire was attempted at least 12 hrs ago
       - bayes_auto_expire does not equal 0
       - the number of tokens in the DB is > 100,000
       - the number of tokens in the DB is > bayes_expiry_max_db_size
       - there is at least a 12 hr difference between the oldest and newest
         token atimes

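       The opportunistic expiry conditions above must all hold at once.  As a
       rough Python sketch (hypothetical names, timestamps assumed to be in
       seconds, and not taken from the SpamAssassin source):

```python
HOUR = 60 * 60  # seconds


def expiry_due(now, last_expire, auto_expire, ntokens, max_db_size,
               oldest_atime, newest_atime):
    """Return True if an opportunistic expiry run is due.

    auto_expire and max_db_size stand for the bayes_auto_expire and
    bayes_expiry_max_db_size settings.
    """
    return (
        now - last_expire >= 12 * HOUR
        and auto_expire != 0
        and ntokens > 100_000
        and ntokens > max_db_size
        and newest_atime - oldest_atime >= 12 * HOUR
    )
```

       With the default bayes_expiry_max_db_size of 150,000, the 100,000
       token floor only matters if that setting has been lowered.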
       EXPIRE LOGIC

       If either the manual or opportunistic method causes an expire run to
       start, here is the logic that is used:

       - figure out how many tokens to keep.  Take the larger of either
         bayes_expiry_max_db_size * 75% or 100,000 tokens.  The goal
         reduction is therefore number of tokens - number of tokens to keep.
       - if the reduction number is < 1000 tokens, abort (not worth the
         effort).
       - if an expire has been done before, guesstimate the new atime delta
         based on the old atime delta.  (new_atime_delta = old_atime_delta *
         old_reduction_count / goal)
       - if no expire has been done before, or the last expire looks "weird",
         do an estimation pass.  The definition of "weird" is:
           - last expire over 30 days ago
           - last atime delta was < 12 hrs
           - last reduction count was < 1000 tokens
           - estimated new atime delta is < 12 hrs
           - the difference between the last reduction count and the goal
             reduction count is > 50%

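       The arithmetic in the steps above can be sketched in Python as
       follows.  All function names are hypothetical and timestamps are
       assumed to be in seconds.  In particular, the list does not say what
       the "> 50%" difference is measured against; this sketch assumes it is
       relative to the goal, which may not match the real implementation.

```python
HOUR = 60 * 60
DAY = 24 * HOUR


def goal_reduction(ntokens, max_db_size):
    """Tokens to remove: current count minus the number to keep."""
    keep = max(int(max_db_size * 0.75), 100_000)
    return ntokens - keep


def guess_new_delta(old_delta, old_reduction, goal):
    """new_atime_delta = old_atime_delta * old_reduction_count / goal."""
    return old_delta * old_reduction / goal


def needs_estimation_pass(now, last_expire, old_delta, old_reduction, goal):
    """True if there was no earlier expire or the last one looks weird.

    Assumes goal >= 1000, since smaller goals abort before this point.
    """
    if not last_expire:
        return True  # no expire has been done before
    return (
        now - last_expire > 30 * DAY
        or old_delta < 12 * HOUR
        or old_reduction < 1000
        or guess_new_delta(old_delta, old_reduction, goal) < 12 * HOUR
        # Assumption: the 50% difference is taken relative to the goal.
        or abs(old_reduction - goal) / goal > 0.5
    )
```

       For example, with 200,000 tokens and the default
       bayes_expiry_max_db_size of 150,000, the number to keep is 112,500 and
       the goal reduction is 87,500 tokens.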
       ESTIMATION PASS LOGIC

       Go through each of the DB's tokens.  Starting at 12 hrs, calculate
       whether or not the token would be expired (based on the difference
       between the token's atime and the DB's newest token atime) and keep the
       count.  Work out from 12 hrs exponentially by powers of 2, i.e.
       12 hrs * 1, 12 hrs * 2, 12 hrs * 4, 12 hrs * 8, and so on, up to
       12 hrs * 512 (6144 hrs, or 256 days).

       The larger the delta, the smaller the number of tokens that will be
       expired.  Conversely, the number of tokens goes up as the delta gets
       smaller.  So starting at the largest atime delta, figure out which
       delta will expire the most tokens without going above the goal
       expiration count.  Use this to choose the atime delta to use, unless
       one of the following occurs:

       - the largest atime delta (smallest reduction count) would expire too
         many tokens.  This means the learned tokens are mostly old and there
         needs to be new tokens learned before an expire can occur.
       - all of the atime choices result in 0 tokens being removed.  This
         means the tokens are all newer than 12 hours and there needs to be
         new tokens learned before an expire can occur.
       - the number of tokens that would be removed is < 1000.  The benefit
         isn't worth the effort.  More tokens need to be learned.

       If the expire run gets past this point, it will continue to the end.  A
       new DB is created since the majority of DB libraries don't shrink the
       DB file when tokens are removed.  So we do the "create new, migrate old
       to new, remove old, rename new" shuffle.

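       The estimation pass can be illustrated with a short Python sketch.  It
       follows the description above, not the SpamAssassin source: the names
       are hypothetical, atimes are assumed to be in seconds, and whether a
       token whose age exactly equals the delta is expired is an assumption
       (here it is kept).

```python
HOUR = 60 * 60
# Candidate deltas: 12 hrs * 1, * 2, * 4, ... up to 12 hrs * 512.
DELTAS = [12 * HOUR * 2 ** k for k in range(10)]


def estimate_atime_delta(token_atimes, newest_atime, goal, min_reduction=1000):
    """Return the chosen atime delta in seconds, or None to abort."""
    # Count how many tokens each candidate delta would expire.
    counts = {
        d: sum(1 for atime in token_atimes if newest_atime - atime > d)
        for d in DELTAS
    }
    if counts[DELTAS[-1]] > goal:
        return None  # even the largest delta expires too many tokens
    if counts[DELTAS[0]] == 0:
        return None  # every token is newer than 12 hours
    # Walk down from the largest delta; the count only grows as the delta
    # shrinks, so stop at the first delta that would exceed the goal.
    chosen = DELTAS[-1]
    for d in reversed(DELTAS):
        if counts[d] > goal:
            break
        chosen = d
    if counts[chosen] < min_reduction:
        return None  # not worth the effort
    return chosen
```

       This sketch rescans the token list once per candidate delta for
       clarity; a single pass that buckets each token by age would produce
       the same counts.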
       EXPIRY RELATED CONFIGURATION SETTINGS

       "bayes_auto_expire" is used to specify whether or not SpamAssassin
       ought to opportunistically attempt to expire the Bayes database. The
       default is 1 (yes).

       "bayes_expiry_max_db_size" specifies both the auto-expire token count
       point, as well as the resulting number of tokens after expiry as
       described above.  The default value is 150,000, which is roughly
       equivalent to a 6 MB database file if you're using DB_File.

       "bayes_journal_max_size" specifies how large the Bayes journal will
       grow before it is opportunistically synced.  The default value is
       102400.

INSTALLATION

       The sa-learn command is part of the Mail::SpamAssassin Perl module.
       Install this as a normal Perl module, using "perl -MCPAN -e shell", or
       by hand.

PREREQUISITES

       "Mail::SpamAssassin"

AUTHORS

       The SpamAssassin(tm) Project <http://spamassassin.apache.org/>

perl v5.8.8                       2008-01-29                       SA-LEARN(1)