QSF(1)                           User Manuals                           QSF(1)

NAME

       qsf - quick spam filter

SYNOPSIS

       Filtering:       qsf [-snrAtav] [-d DB] [-g DB]
                            [-L LVL] [-S SUBJ] [-H MARK] [-Q NUM]
                            [-X NUM]
       Training:        qsf -T SPAM NONSPAM [MAXROUNDS] [-d DB]
       Retraining:      qsf -[m|M] [-d DB] [-w WEIGHT] [-ayN]
       Database:        qsf -[p|D|R|O] [-d DB]
       Database merge:  qsf -E OTHERDB [-d DB]
       Allowlist query: qsf -e EMAIL [-m|-M|-t] [-d DB] [-g DB]
       Denylist query:  qsf -y -e EMAIL [-m -m|-M -M|-t] [-d DB] [-g DB]
       Help:            qsf -[h|V]
19
20

DESCRIPTION

       qsf reads a single email on standard input, and by default outputs it
       on standard output.  If the email is determined to be spam, an
       additional header ("X-Spam: YES") will be added, and optionally the
       subject line can have "[SPAM]" prepended to it.

       qsf is intended to be used in a procmail(1) recipe, in a ruleset such
       as this:

               :0 wf
               | qsf -ra

               :0 H:
               * ^X-Spam: YES
               $HOME/mail/spam

       For more examples, including sample procmail(1) recipes, see the
       EXAMPLES section below.
39
40

TRAINING

       Before qsf can be used properly, it needs to be trained.  A good way to
       train qsf is to collect a copy of all your email into two folders, one
       for spam and one for non-spam.  Once you have done this, you can use
       the training function, like this:

               qsf -aT spam-folder non-spam-folder

       This will generate a database that can be used by qsf to guess whether
       email received in the future is spam or not.  Note that this initial
       training run may take a long time, but you should only need to do it
       once.

       To mark a single message as spam, pipe it to qsf with the --mark-spam
       or -m ("mark as spam") option.  This will update the database
       accordingly and discard the email.

       To mark a single message as non-spam, pipe it to qsf with the
       --mark-nonspam or -M ("mark as non-spam") option.  Again, this will
       discard the email.

       If a message has been mis-tagged, simply send it to qsf as the opposite
       type, i.e. if it has been mistakenly tagged as spam, pipe it into qsf
       --mark-nonspam --weight=2 to add it to the non-spam side of the
       database with double the usual weighting.
66
67

OPTIONS

69       The qsf options are listed below.
70
71       -d, --database [TYPE:]FILE
72              Use  FILE  as the spam/non-spam database.  The default is to use
73              /var/lib/qsfdb and, if that is not available  or  is  read-only,
74              $HOME/.qsfdb.  This option can also be useful if there is a sys‐
75              tem-wide database but you do not want to  use  it  -  specifying
76              your own here will override the default.
77
78              If   you   prefix   the  filename  with  a  TYPE,  of  the  form
79              btree:$HOME/.qsfdb, then this will specify what kind of database
80              FILE is, such as list, btree, gdbm, sqlite and so on.  Check the
81              output of qsf -V to see which database backends  are  available.
82              The default is to auto-detect the type, or, if the file does not
83              already exist, use list.  Note that TYPE is not case-sensitive.
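       The [TYPE:]FILE syntax can be illustrated with a short Python sketch.
       This is not how qsf itself parses the argument; the helper name
       split_db_spec and the KNOWN_TYPES set are assumptions made for
       illustration (the real list of backends is whatever qsf -V reports).
       It only shows the documented behaviour: an optional, case-insensitive
       type prefix, with auto-detection when no prefix is given.

```python
# Assumed set of backend names, taken from the examples in this manual.
# The authoritative list is the one printed by `qsf -V`.
KNOWN_TYPES = {"list", "btree", "gdbm", "sqlite", "mysql"}


def split_db_spec(spec: str):
    """Split '[TYPE:]FILE' into (type or None, file).

    A return type of None means "no recognised prefix", in which case
    the type would be auto-detected.
    """
    prefix, sep, rest = spec.partition(":")
    if sep and prefix.lower() in KNOWN_TYPES:
        # TYPE is not case-sensitive, so normalise it.
        return prefix.lower(), rest
    return None, spec


print(split_db_spec("btree:/home/user/.qsfdb"))
print(split_db_spec("/var/lib/qsfdb"))
```

       Only the first colon is treated as a separator, so a specification
       such as MySQL:database=db;host=localhost keeps its remainder intact.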
84
85       -g, --global [TYPE:]FILE
86              Use  FILE  as  the   default   global   database,   instead   of
87              /var/lib/qsfdb.   If  you  also specify a database with -d, then
88              this "global" database will be used in read-only  mode  in  con‐
89              junction with the read-write database specified with -d.  The -g
90              option can be used a second time to specify  a  third  database,
91              which  will also be used in read-only mode.  Again, the filename
92              can optionally be prefixed with a TYPE which specifies the data‐
93              base type.
94
95       -P, --plain-map FILE
96              Maintain  a  mapping  of all database tokens to their non-hashed
97              counterparts in FILE, one token per line.  This can be useful if
98              you  want  to be able to list the contents of your database at a
99              later date, for instance to get a list  of  email  addresses  in
100              your allow-list.  Note that using this option may slow qsf down,
101              and only entries written to the database while  this  option  is
102              active will be stored in FILE.
103
104       -s, --subject
105              Rewrite the Subject line of any email that turns out to be spam,
106              adding "[SPAM]" to the start of the line.
107
108       -S, --subject-marker SUBJECT
109              Instead of adding "[SPAM]", add SUBJECT to the Subject  line  of
110              any email that turns out to be spam.  Implies -s.
111
112       -H, --header-marker MARK
113              Instead of setting the X-Spam header to "YES", set it to MARK if
114              email turns out to be spam.  This can be useful  if  your  email
115              client can only search all headers for a string, rather than one
116              particular header (so searching for "YES" might match more  than
117              just the output of qsf).
118
119       -n, --no-header
120              Do not add an X-Spam header to messages.
121
122       -r, --add-rating
123              Insert  an  additional header X-Spam-Rating which is a rating of
124              the "spamminess" of a message from 0 to 100; 90  and  above  are
125              counted  as  spam, anything under 90 is not considered spam.  If
126              combined with -t, then the rating (0-100) will be output, on its
127              own, on standard output.
128
129       -A, --asterisk
130              Insert  an  additional  header  X-Spam-Level  which will contain
131              between 0 and 20 asterisks (*), depending on the spam rating.
132
133       -t, --test
134              Instead of passing the message out on  standard  output,  output
135              nothing, and exit 0 if the message is not spam, or exit 1 if the
136              message is spam.  If combined with -r, then the spam rating will
137              be output on standard output.
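       A minimal Python sketch of calling qsf -t from a script, relying only
       on the exit codes described above (0 for non-spam, 1 for spam).  The
       helper name is_spam and its injectable run parameter are illustrative
       and not part of qsf.  Other exit codes are not described in this
       manual, so the sketch treats them as errors.

```python
import subprocess


def is_spam(message: bytes, args=("qsf", "-t"), run=subprocess.run) -> bool:
    """Return True if `qsf -t` exits 1 (spam), False if it exits 0."""
    result = run(list(args), input=message)
    if result.returncode == 0:
        return False
    if result.returncode == 1:
        return True
    # Any other status is outside what this manual documents.
    raise RuntimeError(f"unexpected exit status {result.returncode}")
```

       The run parameter exists only so that the function can be exercised
       without qsf installed; in normal use it is left at its default.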
138
139       -a, --allowlist
140              Enable the allow-list.  This causes the email addresses given in
141              the message's "From:" and "Return-Path:" headers to  be  checked
142              against  a  list;  if  either  one  matches, then the message is
143              always treated as non-spam, regardless of what the  token  data‐
144              base says. When specified with a retraining flag, -a -m (mark as
145              spam) will remove that address from the allow-list  as  well  as
146              marking  the  message as spam, and -a -M (mark as non-spam) will
147              add that address to the allow-list as well as marking  the  mes‐
148              sage  as non-spam.  The idea is that you add all of your friends
149              to the allow-list, and then none  of  their  messages  ever  get
150              marked as spam.
151
       -y, --denylist
              Enable the deny-list.  This causes the email addresses given in
              the message's "From:" and "Return-Path:" headers to be checked
              against a second list; if either one matches, then the message
              is always treated as spam.  Training works in the same way as
              with -a, except that you must specify -m or -M twice to modify
              the deny-list instead of the allow-list, and with the reverse
              syntax: -y -m -m (mark as spam) will add that address to the
              deny-list, whereas -y -M -M (mark as non-spam) will remove that
              address from the deny-list.  This double specification is so
              that the usual retraining process never touches the deny-list;
              the deny-list should be carefully maintained rather than
              automatically generated.
165
166              Normally you would not need to use the deny-list.
167
168       -L, --level, --threshold LEVEL
169              Change the spam scoring threshold level which  must  be  reached
170              before an email is classified as spam.  The default is 90.
171
172       -Q, --min-tokens NUM
173              Only  give a score if more than NUM tokens are found in the mes‐
174              sage - otherwise the message is assumed to be non-spam,  and  it
175              is  not  modified  in  any  way.  The default is 0.  This option
176              might be useful if you find that very short messages  are  being
177              frequently miscategorised.
178
179       -e, --email, --email-only EMAIL
180              Query  or  update  the  allow-list  entry  for the email address
181              EMAIL.  With no other options, this will simply output "YES"  if
182              EMAIL  is  in  the allow-list, or "NO" if it is not. With -t, it
183              will not output anything, but will exit 0 (success) if EMAIL  is
184              in  the  allow-list,  or  1  (failure) if it is not. With the -m
185              (mark-spam) option, any previous allow-list entry for EMAIL will
186              be  removed.  Finally,  with the -M (mark-nonspam) option, EMAIL
187              will be added to the allow-list if it is not already on it.
188
189              If EMAIL is just the word MSG on its own, then an email will  be
190              read  from  standard input, and the email addresses given in the
191              "From:" and "Return-Path:" headers will be used.
192
193              Using -e automatically switches on -a.
194
195              If you also specify -y, then the deny-list will be operated  on.
196              Remember that -m and -M are reversed with the deny-list.
197
              If you specify an email address of the form @domain (nothing
              before the @), then the whole domain will be allow-listed or
              deny-listed.
201
202       -v, --verbose
203              Add  extra  X-QSF-Info headers to any filtered email, containing
204              error messages and so on if applicable.  Specify  -v  more  than
205              once to increase verbosity.
206
207       -T, --train SPAM NONSPAM [MAXROUNDS]
208              Train  the database using the two mbox folders SPAM and NONSPAM,
209              by testing each message in each folder and updating the database
210              each  time  a  message  is miscategorised.  This is done several
211              times, and may take a while to run.  Specify the -a (allow-list)
212              flag  to  add  every sender in the NONSPAM folder to your allow-
213              list as a side-effect of the training process.  If MAXROUNDS  is
214              specified,  training will end after this number of rounds if the
215              results are still not good enough. The default is a  maximum  of
216              200 rounds.
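       The train-on-error loop described for -T can be sketched in Python as
       follows.  This is a toy model, not the qsf implementation: the
       classify function is a deliberately simple stand-in, messages are
       given as lists of tokens rather than mbox folders, and the stopping
       rule (a round with no mistakes) is an assumption, since this manual
       does not define when results are "good enough".

```python
from collections import defaultdict


def classify(db, tokens):
    """Toy stand-in: spam if the tokens were seen more often in spam."""
    spam = sum(db[t][0] for t in tokens)
    nonspam = sum(db[t][1] for t in tokens)
    return spam > nonspam


def train(spam_msgs, nonspam_msgs, max_rounds=200):
    """Test every message each round; update only on a wrong verdict."""
    db = defaultdict(lambda: [0, 0])  # token -> [spam count, non-spam count]
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        mistakes = 0
        for msgs, is_spam in ((spam_msgs, True), (nonspam_msgs, False)):
            for tokens in msgs:
                if classify(db, tokens) != is_spam:
                    mistakes += 1
                    for t in tokens:
                        db[t][0 if is_spam else 1] += 1
        if mistakes == 0:  # assumed stopping rule
            break
    return db, rounds


spam = [["cheap", "pills"], ["cheap", "offer"]]
nonspam = [["lunch", "tomorrow"], ["offer", "lunch"]]
db, rounds = train(spam, nonspam)
print(rounds, classify(db, ["cheap", "pills"]))
```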
217
218       -m, --mark-spam
219              Instead  of passing the message out on standard output, mark its
220              contents as spam and update the database  accordingly.   If  the
221              allow-list  (-a)  is enabled, the message's "From:" and "Return-
222              Path:" addresses are removed from the allow-list.  If the  deny-
223              list  (-y)  is  enabled  and you specify -m twice, the message's
224              addresses are added to the deny-list instead.
225
226       -M, --mark-nonspam
227              Instead of passing the message out on standard output, mark  its
228              contents  as  non-spam  and update the database accordingly.  If
229              the allow-list  (-a)  is  enabled,  the  message's  "From:"  and
230              "Return-Path:" addresses are added to the allow-list (see the -a
231              option above).  If the deny-list (-y) is enabled and you specify
232              -M twice, the message's addresses are removed from the deny-list
233              instead.
234
       -w, --weight WEIGHT
              When marking as spam or non-spam, update the database with a
              weighting of WEIGHT per token instead of the default of 1.
              Useful when correcting mistakes, e.g. a message that has been
              mistakenly detected as spam should be marked as non-spam using
              a weighting of 2, i.e. double the usual weighting, to
              counteract the error.
242
243       -D, --dump [FILE]
244              Dump the contents of the database as a platform-independent text
245              file, suitable for archival, transfer to another machine, and so
246              on.  The data is output on stdout or into the given FILE.
247
248       -R, --restore [FILE]
249              Rebuild  the  database from scratch from the text file on stdin.
250              If a FILE is given, data is read  from  there  instead  of  from
251              stdin.
252
253       -O, --tokens
254              Instead  of  filtering, output a list of the tokens found in the
255              message read from standard input, along with the number of times
256              each  token  was  found.  This is only useful if you want to use
257              qsf as a general tokeniser for use with another filtering  pack‐
258              age.
259
       -E, --merge OTHERDB
              Merge the OTHERDB database into the current database.  This can
              be useful if you want to take one user's database and merge it
              into the system-wide one.  For instance, as root, run
              qsf -d /var/lib/qsfdb -E /home/user/.qsfdb and then remove
              /home/user/.qsfdb.
266
       -B, --benchmark SPAM NONSPAM [MAXROUNDS]
              Benchmark the training process using the two mbox folders SPAM
              and NONSPAM.  A temporary database is created and trained using
              the first 75% of the messages in each folder, and then the
              entire contents of each folder are tested to see how many false
              positives and false negatives occur.  Some timing information
              is also displayed.

              This can be used to decide which backend is best on your system.
              Use -d to select a backend, e.g. qsf -B spam nonspam -d GDBM;
              this will create a temporary database which is removed
              afterwards.

              The exception to this is the MySQL backend, where a full
              database specification must be given
              (-d MySQL:database=db;host=localhost;...) and the database table
              given will not be wiped beforehand or dropped afterwards.
284
285              As with -T, if MAXROUNDS is specified, training  will  never  be
286              done for more than this number of rounds; the default is 200.
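       The benchmark procedure (train on the first 75% of each folder, then
       test everything and count the errors) can be sketched in Python.  The
       function name benchmark and the train and classify callables are
       hypothetical placeholders supplied by the caller; rounding the 75%
       split down is also an assumption.

```python
def first_three_quarters(msgs):
    """Return the first 75% of a list (rounded down, an assumption)."""
    return msgs[: (len(msgs) * 3) // 4]


def benchmark(spam_msgs, nonspam_msgs, train, classify):
    """Return (false positives, false negatives) over both full folders."""
    db = train(first_three_quarters(spam_msgs),
               first_three_quarters(nonspam_msgs))
    # A false negative is spam that was not flagged.
    false_neg = sum(1 for m in spam_msgs if not classify(db, m))
    # A false positive is non-spam that was flagged.
    false_pos = sum(1 for m in nonspam_msgs if classify(db, m))
    return false_pos, false_neg


# Toy callables: remember every spam token, flag any message containing one.
toy_train = lambda spam, nonspam: {t for m in spam for t in m}
toy_classify = lambda db, msg: any(t in db for t in msg)

print(benchmark([["a"], ["b"], ["c"], ["d"]],
                [["x"], ["y"], ["z"], ["a"]],
                toy_train, toy_classify))
```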
287
288
289       -h, --help
290              Print a usage message on standard output and exit successfully.
291
292       -V, --version
293              Print  version  information, including a list of available data‐
294              base backends, on standard output and exit successfully.
295
296

DEPRECATED OPTIONS

298       The following options are only for use with the old binary  tree  data‐
299       base  backend  or  old  databases that haven't been upgraded to the new
300       format that came in with version 1.1.0.
301
302
303       -N, --no-autoprune
304              When marking as spam or nonspam, never automatically  prune  the
305              database.  Usually the database is pruned after every 500 marks;
306              if you would rather --prune manually, use -N  to  disable  auto‐
307              matic pruning.
308
309       -p, --prune
310              Remove  redundant  entries  from  the database and clean it up a
311              little.  This is  automatically  done  after  several  calls  to
312              --mark-spam  or --mark-nonspam, and during training with --train
313              if the training takes a large number of  rounds,  so  it  should
314              rarely be necessary to use --prune manually unless you are using
315              -N / --no-autoprune.
316
317       -X, --prune-max NUM
318              When the database is being pruned, no more than NUM entries will
319              be  considered  for  removal.  This is to prevent CPU and memory
320              resources being taken over.  The default is 100,000 but in  some
321              circumstances  (if  you  find  that pruning takes too long) this
322              option may be used to reduce it to a more manageable number.
323
324

FILES

       /var/lib/qsfdb
              The default (system-wide) spam database.  If you wish to install
              qsf system-wide, this should be read-only to everyone; there
              should be one user with write access who can update the spam
              database with qsf --mark-spam and qsf --mark-nonspam when
              necessary.
332
333       /var/lib/qsfdb2
334              A second, read-only, system-wide database. This  can  be  useful
335              when installing qsf system-wide and using third-party spam data‐
336              bases; the first global database can be updated with system-spe‐
337              cific  changes,  and  this  second  database can be periodically
338              updated when the third-party spam database is updated.
339
340       $HOME/.qsfdb
341              The default spam database  for  per-user  data.   Users  without
342              write  access  to  the system-wide database will have their data
343              written here, and the two databases will be read together.   The
344              per-user  database  will  be  given a weighting equivalent to 10
345              times the weighting of the global database.
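       One possible reading of the 10:1 weighting is sketched below in
       Python.  This manual does not say how the weighting is applied
       internally, so scaling the per-user counters by 10 before adding the
       global counters is purely an assumption, as are the names USER_WEIGHT
       and combined_counts.

```python
USER_WEIGHT = 10  # per-user data counts 10 times as much as global data


def combined_counts(user_entry, global_entry):
    """Combine (spam, non-spam) counters for one token from both databases."""
    user_spam, user_nonspam = user_entry
    global_spam, global_nonspam = global_entry
    return (user_spam * USER_WEIGHT + global_spam,
            user_nonspam * USER_WEIGHT + global_nonspam)


# One local spam sighting outweighs five global non-spam sightings.
print(combined_counts((1, 0), (3, 5)))
```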
346
347

NOTES

349       Currently, you cannot use qsf to check for spam while the  database  is
350       being  updated.   This  means  that while an update is in progress, all
351       email is passed through as non-spam.
352
       There is an upper size limit of 512KB on incoming email; anything
       larger than this is just passed through as non-spam, to avoid tying up
       machine resources.
356
357       The plaintext  token  mapping  maintained  by  --plain-map  will  never
358       shrink,  only  grow.   It  is intended for use by housekeeping and user
359       interface scripts that, for instance, the user  can  use  to  list  all
360       email addresses on their allow-list.  These scripts should take care of
361       weeding out entries for tokens that are no longer in the database.   If
362       you  have no such scripts, there is probably no point in using --plain-
363       map anyway.
364
       Avoid using the deny-list (-y) in any automated retraining, as it can
       cause the filter to reject mail unnecessarily.  In general the
       deny-list is probably best left unused unless explicitly required by
       your particular setup.
369
370       If  both  the  allow-list  and  the  deny-list  are enabled, then email
371       addresses will first be checked against the deny-list, then the  allow-
372       list, then the domain of the email address will be checked for matching
373       "@domain" entries in the deny-list and then in the allow-list.
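       The lookup order described above can be sketched in Python.  For
       simplicity the lists here are plain sets of strings, whereas qsf
       stores hashed entries; the function name list_verdict is
       illustrative only.

```python
def list_verdict(address, allow, deny):
    """Return 'spam', 'non-spam' or None, checking in the documented order."""
    domain = "@" + address.rpartition("@")[2]
    if address in deny:        # 1. full address, deny-list
        return "spam"
    if address in allow:       # 2. full address, allow-list
        return "non-spam"
    if domain in deny:         # 3. "@domain" entry, deny-list
        return "spam"
    if domain in allow:        # 4. "@domain" entry, allow-list
        return "non-spam"
    return None                # no list entry: fall back to token scoring


allow = {"friend@example.com", "@example.org"}
deny = {"@example.com"}
print(list_verdict("friend@example.com", allow, deny))
print(list_verdict("other@example.com", allow, deny))
```

       Because full addresses are checked before domains, an individual
       allow-list entry can exempt one sender from a deny-listed domain.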
374
375

EXAMPLES

       To filter all of your mail through qsf, with the allow-list enabled and
       the "spam rating" header being added, add this to your .procmailrc
       file:

               :0 wf
               | qsf -ra

       If you want qsf to add "[SPAM]" to the subject line of any messages it
       thinks are spam, do this instead:

               :0 wf
               | qsf -sra

       To automatically mark any email sent to spambox@yourdomain.com as spam
       (this is the "naive" version):

               :0 H
               * ^To:.*spambox@yourdomain.com
               | qsf -am

       To do the same, but cleverly, so that only email to
       spambox@yourdomain.com which qsf does NOT already classify as spam gets
       marked as spam in the database (this stops the database getting too
       heavily weighted):

               # If sent to spambox@yourdomain.com:
               :0
               * ^To:.*spambox@yourdomain.com
               {
                  :0 wf
                  | qsf -a

                  # The above two lines can be skipped if you've
                  # already piped the message through qsf.

                  # If the qsf database says it's not spam,
                  # mark it as spam!
                  :0 H
                  * ^X-Spam: NO
                  | qsf -am
               }

       Remove the -a option in the above examples if you don't want to use the
       allow-list.

       A more complicated filtering example: this will only run qsf on
       messages which don't have a subject line saying "your <something> is on
       fire" and which don't have a sender address ending in "@foobar.com",
       meaning that messages with that subject line OR that sender address
       will NEVER be marked as spam, no matter what:

               :0 wf
               * ! ^Subject: Your .* is on fire
               * ! ^From: .*@foobar.com
               | qsf -ra

       For more on procmail(1) recipes, see the procmailrc(5) and
       procmailex(5) manual pages.

       A couple of macros to add to your .muttrc file, if you use mutt(1) as a
       mail user agent:

               # Press F5 to mark a message as spam and delete it
               macro index <f5> "<pipe-message>qsf -am\n<delete-message>"
               macro pager <f5> "<pipe-message>qsf -am\n<delete-message>"

               # Press F9 to mark a message as non-spam
               macro index <f9> "<pipe-message>qsf -aM\n"
               macro pager <f9> "<pipe-message>qsf -aM\n"

       Again, remove the -a option in the above examples if you don't want to
       use the allow-list.

       Note, however, that the above macros won't work when operating on
       multiple tagged messages.  For that, you'd need something like this
       (all on one line):

               macro index <f5> ":set pipe_split\n<tag-prefix><pipe-message>qsf -am\n<tag-prefix><delete-message>\n:unset pipe_split\n"

       If you use qmail(7), then to get procmail working with it you will need
       to put a line containing just DEFAULT=./Maildir/ at the top of your
       ~/.procmailrc file, so that procmail delivers to your Maildir folder
       instead of trying to deliver to /var/spool/mail/$USER, and you will
       need to put this in your ~/.qmail file:

               | preline procmail

       This will cause all your mail to be delivered via procmail instead of
       being delivered directly into your mail directory.

       See the qmail(7) documentation for more about mail delivery with qmail.

       If you use postfix(1), you can set up a system-wide mail filter by
       creating a user account for the purpose of filtering mail, populating
       that account's .qsfdb, and then creating a shell script, to run as that
       user, which runs qsf on stdin and passes stdout to sendmail(8).

       Doing this requires some knowledge of postfix configuration, and care
       needs to be taken to avoid mail loops.  One qsf user's full HOWTO is
       included in the doc/ directory with this package.
477
478

THE ALLOW-LIST

       A feature called the "allow-list" can be switched on by specifying the
       --allowlist or -a option.  This causes messages' "From:" and
       "Return-Path:" addresses to be checked against a list of senders whose
       messages you have chosen always to accept, and if a message's "From:"
       or "Return-Path:" address is in the list, it is never marked as spam.
       This means you can add all your friends to an "allow-list" and qsf will
       then never mis-file their messages.  A quick way to do this is to use
       -a with -T (train); everyone in your non-spam folder who has sent you
       an email will be added to the allow-list automatically during training.

       You can manually add and remove addresses to and from the allow-list
       using the -e (email) option.  For instance, to add foo@bar.com to the
       allow-list, do this:

               qsf -e foo@bar.com -M

       To remove bad@nasty.com from the allow-list, do this:

               qsf -e bad@nasty.com -m

       And to see whether someone@somewhere.com is in the allow-list or not,
       just do this:

               qsf -e someone@somewhere.com

       In general, you probably always want to enable the allow-list, so
       always specify the -a option when using qsf.  This will automatically
       maintain the allow-list based on what you classify as spam or non-spam.

       The only times you might want to turn it off are when people on your
       allow-list are prone to getting viruses or if a virus is causing email
       to be sent to you that is pretending to be from someone on your
       allow-list.
513
514

BACKUP AND RESTORE

       Because the database format is platform-specific, it is a good idea to
       periodically dump the database to a text file using qsf -D so that, if
       necessary, it can be transferred to another machine and restored with
       qsf -R later on.

       Also note that since the actual contents of email messages are never
       stored in the database (see TECHNICAL DETAILS), you can safely share
       your qsf database with friends.  Simply dump your database to a file,
       like this:

               qsf -D > your-database-dump.txt

       Once you have sent your-database-dump.txt to another person, they can
       do this:

               qsf -R < your-database-dump.txt

       They will then have an identical database to yours.
534
535

TECHNICAL DETAILS

537       When a message is passed to qsf, any attachments are decoded, all  HTML
538       elements  are  removed,  and  the  message  text is then broken up into
539       "tokens", where a "token" is a single  word  or  URL.   Each  token  is
540       hashed  using  the  MD5 algorithm (see below for why), and that hash is
541       then used to look up each token in the qsf database.
542
543       For full details of which parts of an  email  (headers,  body,  attach‐
544       ments, etc) are used to calculate the spam rating, see the TOKENISATION
545       section below.
546
547       Within the database, each token has two numbers associated with it: the
548       number  of  times  that  token has been seen in spam, and the number of
549       times it has been seen in non-spam.  These two numbers, along with  the
550       total  number of spam and non-spam messages seen, are then used to give
551       a "spamminess" value for  that  particular  token.   This  "spamminess"
552       value  ranges  from  "definitely  not  spammy" at one end of the scale,
553       through "neutral" in the middle, up to "definitely spammy" at the other
554       end.
555
556       Once  a "spamminess" value has been calculated for all of the tokens in
557       the message, a summary calculation is made to give an overall "is  this
558       spam?"  probability rating for the message.  If the overall probability
559       is 0.9 or above, the message is flagged as spam.
560
       In addition to the probability test, there is the "allow-list".  If
       enabled (with the -a option), the whole probability check is skipped
       if the sender of the message is listed in the allow-list, and the
       message is not marked as spam.
565
566       When  training  the  database,  a  message  is  split up into tokens as
567       described above, and then the numbers in the database  for  each  token
568       are  simply  added  to: if you tell qsf that a message is spam, it adds
569       one to the "number of times seen in spam" counter for each  token,  and
570       if  you  tell  it  a message is not spam, it adds one to the "number of
571       times seen in non-spam" counter for  each  token.   If  you  specify  a
572       weight, with -w, then the number you specify is added instead of one.
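       The counter update described above can be sketched in Python.  This is
       a simplified model, not the qsf database format: the class name
       CounterDB, the in-memory dictionary, and the use of the MD5 hex digest
       as the key are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def token_key(token: str) -> str:
    """Hash a token with MD5 so the plain text is never stored."""
    return hashlib.md5(token.encode("utf-8")).hexdigest()


class CounterDB:
    """Toy database: hashed token -> [seen in spam, seen in non-spam]."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])

    def mark(self, tokens, spam: bool, weight: int = 1):
        """Add `weight` (default 1, as with -w) to one counter per token."""
        side = 0 if spam else 1
        for token in tokens:
            self.counts[token_key(token)][side] += weight


db = CounterDB()
db.mark(["free", "pills"], spam=True)
# Correcting a mistake with double weighting, as suggested for -w 2.
db.mark(["free", "lunch"], spam=False, weight=2)
print(db.counts[token_key("free")])
```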
573
574       To  stop  the database growing uncontrollably, the database keeps track
575       of when a token was last  used.   Underused  tokens  are  automatically
576       removed  from  the  database.  (The old method was to "prune" every 500
577       updates).
578
       Finally, the reason MD5 hashes are used is privacy.  If the actual
       tokens from the messages, and the actual email addresses in the
       allow-list, were stored, you could not share a single qsf database
       between multiple users because bits of everyone's messages would be in
       the database: things like emailed passwords, keywords relating to
       personal gossip, and so on.  So a hash is stored instead.  A hash is a
       "one-way" function; it is easy to turn a token into a hash but very
       hard (some might say impossible) to turn a hash back into the token
       that created it.  This means that you end up with a database with no
       personal information in it.
589
590

TOKENISATION

592       When  a  message is broken up into tokens, various parts of the message
593       are treated in different ways.
594
595       First, all header fields are discarded, except for the important  ones:
596       From, Return-Path, Sender, To, Reply-To, and Subject.
597
598       Next,  any MIME-encoded attachments are decoded.  Any attachments whose
599       MIME type starts with "text/" (i.e. HTML and text) are tokenised, after
600       having  any  HTML  tags  stripped.   Any  non-textual  attachments  are
601       replaced with their MD5 hash (such that two identical attachments  will
602       have the same hash), and that hash is then used as a token.
603
604       In  addition to single-word tokens from textual message parts, qsf adds
605       doubled-up tokens so that word pairs get added to the  database.   This
606       makes  the  database a bit bigger (although the automatic pruning tends
607       to take care of that) but makes matching more exact.
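       A rough Python sketch of the header filtering and word-pair steps
       follows.  It is a simplification under stated assumptions: words are
       split on whitespace only, a pair is represented as two words joined by
       a space, and MIME decoding, HTML stripping and attachment hashing are
       left out.  The names tokenise and with_pairs are illustrative.

```python
from email import message_from_string

# The header fields this manual lists as being kept.
KEPT_HEADERS = ("From", "Return-Path", "Sender", "To", "Reply-To", "Subject")


def with_pairs(words):
    """Return the single words followed by each adjacent word pair."""
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return list(words) + pairs


def tokenise(raw: str):
    """Tokenise kept headers and, for non-multipart mail, the body."""
    msg = message_from_string(raw)
    tokens = []
    for name in KEPT_HEADERS:
        for value in msg.get_all(name, []):
            tokens += with_pairs(value.split())
    if not msg.is_multipart():
        tokens += with_pairs(msg.get_payload().split())
    return tokens


raw = ("From: a@example.com\n"
       "X-Mailer: foo\n"
       "Subject: cheap pills\n"
       "\n"
       "buy now today\n")
print(tokenise(raw))
```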
608
609

SPECIAL FILTERS

611       As well as using the textual content of email to detect spam, qsf  also
612       uses  special  filters  which  create  "pseudo-tokens" based on various
613       rules.  This means that specific patterns, not just  individual  words,
614       can be used to determine whether a message is spam or not.
615
616       For example, if a message contains many words with multiple consonants
617       in a row, like "ashjkbnxcsdjh", then each time such a word is seen the
618       special token ".GIBBERISH-CONSONANTS." is added to the list of tokens
619       found in the message.  If it turns out that most messages with words
620       that trigger this filter rule are spam, then other messages with
621       gibberish consonant strings will be more likely to be flagged as spam.
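
          A filter of this kind might look like the sketch below.  The
          threshold of six consecutive consonants and the treatment of "y" as
          a vowel are assumptions chosen for the example; this page does not
          state the rule qsf actually applies.

```python
import re

# Six or more consonants in a row; both the threshold and the exclusion of
# "y" from the consonant set are assumptions for this sketch.
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxz]{6,}", re.IGNORECASE)

def gibberish_consonant_tokens(words):
    """Return one pseudo-token for every word containing a consonant run."""
    return [".GIBBERISH-CONSONANTS." for word in words
            if CONSONANT_RUN.search(word)]

found = gibberish_consonant_tokens(["hello", "ashjkbnxcsdjh", "strengths"])
```

          The pseudo-token is then counted like any other token, so its
          weight depends on how often it has appeared in spam and non-spam.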
622
623       Currently the special filters are:
624
625
626       GTUBE  Flags any message containing the string
627              XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X
628              as spam - useful for testing that your qsf installation is
629              working.
630
631       ATTACH-SCR
632
633       ATTACH-PIF
634
635       ATTACH-EXE
636
637       ATTACH-VBS
638
639       ATTACH-VBA
640
641       ATTACH-LNK
642
643       ATTACH-COM
644
645       ATTACH-BAT
646              Adds a token for every attachment whose filename ends in ".scr",
647              ".pif", ".exe",  ".vbs",  ".vba",  ".lnk",  ".com",  and  ".bat"
648              respectively (these are often viruses).
649
650
651       ATTACH-GIF
652
653       ATTACH-JPG
654
655       ATTACH-PNG
656              Adds a token for every attachment whose filename ends in ".gif",
657              ".jpg" or ".jpeg", and ".png" respectively.
658
659
660       ATTACH-DOC
661
662       ATTACH-XLS
663
664       ATTACH-PDF
665              Adds a token for every attachment whose filename ends in ".doc",
666              ".xls",  or  ".pdf"  respectively (these tend to indicate a non-
667              spam email).
668
669
670       SINGLE-IMAGE
671              Adds a token if the message contains exactly one attached image.
672
673
674       MULTIPLE-IMAGES
675              Adds a token if the message  contains  more  than  one  attached
676              image.
677
678
679       GIBBERISH-CONSONANTS
680              Adds  a  token for every word found that has multiple consonants
681              in a row, as described above.  Spam often  contains  strings  of
682              gibberish.
683
684       GIBBERISH-VOWELS
685              Adds  a token for every word found that has multiple vowels in a
686              row, e.g. "aeaiaiaeeio".
687
688       GIBBERISH-FROMCONS
689              Like GIBBERISH-CONSONANTS, but only for the "From:" and "Return-
690              Path:" addresses on their own.
691
692       GIBBERISH-FROMVOWL
693              Like  GIBBERISH-VOWELS,  but  only  for the "From:" and "Return-
694              Path:" addresses on their own.
695
696       GIBBERISH-BADSTART
697              Adds a token for every word that starts  with  a  bad  character
698              such as %.
699
700       GIBBERISH-HYPHENS
701              Adds  a  token  for  every  word with more than three hyphens or
702              underscores in it.
703
704       GIBBERISH-LONGWORDS
705              Adds a token for every word with more than 30 characters in it
706              (but fewer than 60).
707
708       HTML-COMMENTS-IN-WORDS
709              Adds  a  token  for  every HTML comment found in the middle of a
710              word.   Spam  often  contains  HTML  inside  words,  like  this:
711              w<!--dsgfhsdgjgh-->ord
712
713       HTML-EXTERNAL-IMG
714              Adds  a  token  for every HTML <img> (image) tag found that con‐
715              tains :// (i.e.  it refers to an external image).
716
717       HTML-FONT
718              Adds a token for every HTML <font> tag found.
719
720       HTML-IP-IN-URLS
721              Adds a token for every URL found containing an IP address.
722
723       HTML-INT-IN-URL
724              Adds a token for every URL found containing an  integer  in  its
725              hostname.
726
727       HTML-URLENCODED-URL
728              Adds  a  token  for  every  URL found containing a % sign in its
729              hostname.
730
731
732       Normally, filters will just cause a token to be added, and these tokens
733       are processed by the normal weighting algorithm.  However, the GTUBE
734       filter will immediately flag any matching message as spam, bypassing
735       the token matching.
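
          The difference between ordinary filters and GTUBE can be sketched
          as below.  The special_tokens function is hypothetical, and the
          token names for the hyphen and long-word filters are assumed to
          follow the same ".NAME." pattern as ".GIBBERISH-CONSONANTS.".

```python
GTUBE = "XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X"

def special_tokens(body):
    """Return (forced_spam, pseudo_tokens) for a message body."""
    if GTUBE in body:
        # GTUBE bypasses token matching: the message is flagged outright.
        return True, []
    tokens = []
    for word in body.split():
        # More than three hyphens or underscores in one word.
        if word.count("-") + word.count("_") > 3:
            tokens.append(".GIBBERISH-HYPHENS.")
        # More than 30 but fewer than 60 characters.
        if 30 < len(word) < 60:
            tokens.append(".GIBBERISH-LONGWORDS.")
    return False, tokens
```

          Only the first element of the result short-circuits the decision;
          the pseudo-tokens in the second element would still go through the
          normal weighting.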
736
737

DATABASE BACKENDS

739       The  inbuilt  "list"  database backend will not necessarily provide the
740       best performance, but is provided because using it requires no external
741       libraries.
742
743       If,  when  qsf was compiled, the correct libraries were available, then
744       it will be possible to use qsf with alternative database backends.   To
745       find  out which backends you have available, run qsf -V (capital V) and
746       read the second line of output.  To see how well  a  backend  performs,
747       collect  some  spam and non-spam and use qsf -d BACKEND -B SPAM NONSPAM
748       (see the entry for -B above).
749
750       Some people find that they get the best performance out of the gdbm
751       backend; gdbm is a library that is available on many systems.
752
753       To  efficiently  share a qsf database across multiple machines, you may
754       find the MySQL backend useful.  However, using it is a little more com‐
755       plicated.
756
757       To use the MySQL backend you will need to create a table with the
758       fields key1, key2, token, value1, value2 and value3.  The token field
759       must be VARCHAR(64), and the value1, value2 and value3 fields must
760       each be BIGINT or INT; indexing on the token field is a good idea.
761       The key1 and key2 fields can be of any type, but they must be
762       present.
763
764       For example:
765
                USE mydatabase;
                CREATE TABLE qsfdb (
                  key1      BIGINT UNSIGNED NOT NULL,
                  key2      BIGINT UNSIGNED NOT NULL,
                  token     VARCHAR(64) DEFAULT '' NOT NULL,
                  value1    INT UNSIGNED NOT NULL,
                  value2    INT UNSIGNED NOT NULL,
                  value3    INT UNSIGNED NOT NULL,
                  PRIMARY KEY (key1,key2,token),
                  KEY (key1),
                  KEY (key2),
                  KEY (token)
                );
779
780       The key1 and key2 fields allow you to have multiple  qsf  databases  in
781       one table, by specifying different key1 and key2 values on invocation.
782
783       Instead  of specifying a database file with the --database / -d option,
784       you must specify either a specification string as described  below,  or
785       the name of a file containing such a string on its first line.
786
787       The specification string is as follows:
788
                database=DATABASE;host=HOST;port=PORT;
                user=USER;pass=PASS;table=TABLE;
                key1=KEY1;key2=KEY2
792
793       This string must be all on one line, with no spaces.
794
795
796       DATABASE
797              is the name of the MySQL database.
798
799       HOST   is the hostname of the database server (e.g. "localhost").
800
801       PORT   is the TCP port to connect on (e.g. 3306).
802
803       USER   is the username to connect with.
804
805       PASS   is the password to connect with.
806
807       TABLE  is  the  database  table to use.  If a table with this name does
808              not exist when qsf is called in update or training mode, then it
809              will be created if permissions allow this to be done.
810
811       KEY1   is the value to use for the key1 field.
812
813       KEY2   is the value to use for the key2 field.
814
815
816       Since command lines can be seen in the process list, it is probably
817       best to specify a filename (e.g. qsf -d mysql:qsfdb.spec) and put the
818       specification string inside that file.
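
          To make the format of the specification string concrete, here is a
          small sketch that splits one into its fields and reads one from the
          first line of a file.  The parse_spec and spec_from_file names and
          the sample values are hypothetical; this is not how qsf itself
          parses the string.

```python
def parse_spec(spec):
    """Split a 'name=value;name=value' specification into a dictionary."""
    fields = {}
    for item in spec.strip().split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        # Split on the first "=" only, so values may contain "=".
        name, _, value = item.partition("=")
        fields[name] = value
    return fields

def spec_from_file(path):
    """Read the specification string from the first line of a file."""
    with open(path, encoding="utf-8") as handle:
        return parse_spec(handle.readline())

# Hypothetical values, written on one logical line with no spaces.
example = ("database=mail;host=localhost;port=3306;"
           "user=qsf;pass=secret;table=qsfdb;key1=1;key2=0")
fields = parse_spec(example)
```

          Keeping the string in a file readable only by its owner also keeps
          the password out of the process list.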
819
820

TROUBLESHOOTING

822       If  you  have  problems  with qsf, please check the list below; if this
823       does not help, go to the qsf home  page  and  investigate  the  mailing
824       lists, or email the author.
825
826
827       Nothing is being marked as spam.
828              First,  use the -r option to switch on the X-Spam-Rating header,
829              and check that this header appears in email passed through  qsf.
830              If  it  does not, then it is likely that qsf is not being run at
831              all - check your configuration of procmail(1) or its equivalent.
832
833
834              If you are seeing X-Spam-Rating headers,  and  different  emails
835              have  different scores, then you may simply need to retrain your
836              database a little more.  Take more spam email and pass it to qsf
837              -m.
838
839
840              If  you  are  seeing X-Spam-Rating headers but they all give the
841              same spam rating, then the most likely reason is that qsf is not
842              reading any database.  Make sure that whatever is processing the
843              email has read permissions on /var/lib/qsfdb and/or ~/.qsfdb.
844              If you are using ~/.qsfdb, also make sure that the home
845              directory (~ or $HOME) in effect when the database was created
846              is the same as the one seen by whatever is processing the email.
847
848
849       Retraining sometimes takes a very long time.
850              With  the  obtree  backend  or  2-column MySQL or SQLite tables,
851              every 500th retrain (-m or -M), the database is pruned.  On some
852              systems  this may take some time, and during this time the data‐
853              base is locked (except when using the MySQL or SQLite backends).
854              If you constantly do a lot of retraining and want to avoid this,
855              then use the -N option to suppress auto-pruning, and have a
856              cron(8) job or similar run a manual prune (qsf -p) every now
857              and again.
858
859
860       Running qsf from procmail fails with an error.
861              If you can run qsf from the command line, but in  your  procmail
862              log file you get errors about "qsf: cannot execute binary file",
863              then contact your system administrator for help. It may be  that
864              incoming  email  is handled by a different server to the one you
865              normally shell into, and either they are of a  different  archi‐
866              tecture or operating system, or the mail server is not permitted
867              to execute user-owned binaries.
868
869

ACKNOWLEDGEMENTS

871       The following people have contributed suggestions,  comments,  patches,
872       and testing:
873
874              Tom Parker <http://www.bits.bris.ac.uk/palfrey/>
875              Dr Kelly A. Parker
876              Vesselin Mladenov <http://www.antipodes.bg/>
877              Glyn Faulkner
878              Mark Reynolds
879              Sam Roberts
880              Scott Allen
881              Karsten Kankowski
882              M. Kolbl
883              Micha Holzmann
884              Jef Poskanzer <http://www.acme.com/jef/>
885              Clemens Fischer <http://ino-waiting.gmxhome.de/>
886              Nelson A. de Oliveira
887              Michal Vitecek
888              Tommy Pettersson <http://www.lysator.liu.se/~ptp/>
889
890

AUTHOR

892       The author:
893
894              Andrew Wood <andrew.wood@ivarch.com>
895              http://www.ivarch.com/
896
897       Project home page:
898
899              http://www.ivarch.com/programs/qsf/
900
901

BUGS

903       If  you find any bugs, please contact the author, either by email or by
904       using the contact form on the web site.
905
906

SEE ALSO

908       procmail(1), procmailrc(5), procmailex(5)
909
910       Someone has written a guide to using qsf with KMail that can  be  found
911       at:
912       http://www.softwaredesign.co.uk/Information.SpamFilters.html
913
914

LICENSE

916       This is free software, distributed under the ARTISTIC 2.0 license.
917
918
919
920Linux                             August 2007                           QSF(1)