QSF(1)                           User Manuals                           QSF(1)

NAME

       qsf - quick spam filter

SYNOPSIS

       Filtering:       qsf [-snrAtav] [-d DB] [-g DB]
                            [-L LVL] [-S SUBJ] [-H MARK] [-Q NUM]
                            [-X NUM]
       Training:        qsf -T SPAM NONSPAM [MAXROUNDS] [-d DB]
       Retraining:      qsf -[m|M] [-d DB] [-w WEIGHT] [-ayN]
       Database:        qsf -[p|D|R|O] [-d DB]
       Database merge:  qsf -E OTHERDB [-d DB]
       Allowlist query: qsf -e EMAIL [-m|-M|-t] [-d DB] [-g DB]
       Denylist query:  qsf -y -e EMAIL [-m -m|-M -M|-t] [-d DB] [-g DB]
       Help:            qsf -[h|V]

DESCRIPTION

22       qsf  reads  a single email on standard input, and by default outputs it
23       on standard output.  If the email is determined to be  spam,  an  addi‐
24       tional header ("X-Spam: YES") will be added, and optionally the subject
25       line can have "[SPAM]" prepended to it.
26
27       qsf is intended to be used in a procmail(1) recipe, in a  ruleset  such
28       as this:
29
               :0 wf
               | qsf -ra

               :0 H:
               * X-Spam: YES
               $HOME/mail/spam
36
       For more examples, including sample procmail(1) recipes, see the
       EXAMPLES section below.
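
       As a rough illustration of what a downstream script sees, the Python
       sketch below inspects a message that has already been piped through
       qsf and reports whether it carries the X-Spam header.  The function
       name is_marked_spam is hypothetical and not part of qsf; the header
       name and the "YES" value are the defaults described above.

```python
from email import message_from_string


def is_marked_spam(raw_message: str, marker: str = "YES") -> bool:
    """Return True if the message has an X-Spam header equal to marker."""
    msg = message_from_string(raw_message)
    # get_all() returns None when the header is absent.
    values = msg.get_all("X-Spam") or []
    return any(value.strip() == marker for value in values)
```

       If a different marker has been set with -H, pass it as the second
       argument.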
39
40

TRAINING

42       Before qsf can be used properly, it needs to be trained.  A good way to
43       train qsf is to collect a copy of all your email into two folders - one
44       for spam, and one for non-spam.  Once you have done this, you  can  use
45       the training function, like this:
46
               qsf -aT spam-folder non-spam-folder
48
49       This  will generate a database that can be used by qsf to guess whether
50       email received in the future is spam or not.  Note  that  this  initial
51       training  run  may  take a long time, but you should only need to do it
52       once.
53
54       To mark a single message as spam, pipe it to qsf with  the  --mark-spam
55       or  -m  ("mark as spam") option.  This will update the database accord‐
56       ingly and discard the email.
57
58       To mark a single message as non-spam, pipe it to qsf with  the  --mark-
59       nonspam  or  -M  ("mark as non-spam") option.  Again, this will discard
60       the email.
61
62       If a message has been mis-tagged, simply send it to qsf as the opposite
63       type,  i.e.  if it has been mistakenly tagged as spam, pipe it into qsf
64       --mark-nonspam --weight=2 to add it to the non-spam side of  the  data‐
65       base with double the usual weighting.
66
67

OPTIONS

69       The qsf options are listed below.
70
71       -d, --database [TYPE:]FILE
72              Use  FILE  as the spam/non-spam database.  The default is to use
73              /var/lib/qsfdb and, if that is not available  or  is  read-only,
74              $HOME/.qsfdb.  This option can also be useful if there is a sys‐
75              tem-wide database but you do not want to  use  it  -  specifying
76              your own here will override the default.
77
78              If   you   prefix   the  filename  with  a  TYPE,  of  the  form
79              btree:$HOME/.qsfdb, then this will specify what kind of database
80              FILE is, such as list, btree, gdbm, sqlite and so on.  Check the
81              output of qsf -V to see which database backends  are  available.
82              The default is to auto-detect the type, or, if the file does not
83              already exist, use list.  Note that TYPE is not case-sensitive.
84
85       -g, --global [TYPE:]FILE
86              Use  FILE  as  the   default   global   database,   instead   of
87              /var/lib/qsfdb.   If  you  also specify a database with -d, then
88              this "global" database will be used in read-only  mode  in  con‐
89              junction with the read-write database specified with -d.  The -g
90              option can be used a second time to specify  a  third  database,
91              which  will also be used in read-only mode.  Again, the filename
92              can optionally be prefixed with a TYPE which specifies the data‐
93              base type.
94
95       -P, --plain-map FILE
96              Maintain  a  mapping  of all database tokens to their non-hashed
97              counterparts in FILE, one token per line.  This can be useful if
98              you  want  to be able to list the contents of your database at a
99              later date, for instance to get a list  of  email  addresses  in
100              your allow-list.  Note that using this option may slow qsf down,
101              and only entries written to the database while  this  option  is
102              active will be stored in FILE.
103
104       -s, --subject
105              Rewrite the Subject line of any email that turns out to be spam,
106              adding "[SPAM]" to the start of the line.
107
108       -S, --subject-marker SUBJECT
109              Instead of adding "[SPAM]", add SUBJECT to the Subject  line  of
110              any email that turns out to be spam.  Implies -s.
111
112       -H, --header-marker MARK
113              Instead of setting the X-Spam header to "YES", set it to MARK if
114              email turns out to be spam.  This can be useful  if  your  email
115              client can only search all headers for a string, rather than one
116              particular header (so searching for "YES" might match more  than
117              just the output of qsf).
118
119       -n, --no-header
120              Do not add an X-Spam header to messages.
121
122       -r, --add-rating
123              Insert  an  additional header X-Spam-Rating which is a rating of
124              the "spamminess" of a message from 0 to 100; 90  and  above  are
125              counted  as  spam, anything under 90 is not considered spam.  If
126              combined with -t, then the rating (0-100) will be output, on its
127              own, on standard output.
128
129       -A, --asterisk
130              Insert  an additional header X-Spam-Level which will contain be‐
131              tween 0 and 20 asterisks (*), depending on the spam rating.
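
              The exact mapping from rating to asterisks is not stated here.
              The Python sketch below assumes a simple linear scale (one
              asterisk per 5 rating points); the function name spam_level is
              hypothetical and the real qsf mapping may differ.

```python
def spam_level(rating: int, max_stars: int = 20) -> str:
    """Map a 0-100 rating to 0..max_stars asterisks (assumed linear scale)."""
    rating = max(0, min(100, rating))  # clamp out-of-range input
    return "*" * (rating * max_stars // 100)
```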
132
133       -t, --test
134              Instead of passing the message out on  standard  output,  output
135              nothing, and exit 0 if the message is not spam, or exit 1 if the
136              message is spam.  If combined with -r, then the spam rating will
137              be output on standard output.
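
              A calling program can act on the exit status alone.  The Python
              sketch below is one way to wrap this; the function name
              is_spam is hypothetical, and treating any status other than 0
              or 1 as an error is an assumption, not documented behaviour.

```python
import subprocess


def is_spam(message: bytes, command=("qsf", "-t")) -> bool:
    """Pipe message to command; exit 0 means not spam, exit 1 means spam."""
    result = subprocess.run(
        list(command), input=message, stdout=subprocess.DEVNULL
    )
    if result.returncode not in (0, 1):
        # Assumption: anything else indicates a failure to run the filter.
        raise RuntimeError(f"unexpected exit status {result.returncode}")
    return result.returncode == 1
```

              The command parameter exists so that extra options such as -a
              or -d can be supplied by the caller.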
138
139       -a, --allowlist
140              Enable the allow-list.  This causes the email addresses given in
141              the message's "From:" and "Return-Path:" headers to  be  checked
142              against  a  list; if either one matches, then the message is al‐
143              ways treated as non-spam, regardless of what the token  database
144              says.  When  specified  with  a  retraining flag, -a -m (mark as
145              spam) will remove that address from the allow-list  as  well  as
146              marking  the  message as spam, and -a -M (mark as non-spam) will
147              add that address to the allow-list as well as marking  the  mes‐
148              sage  as non-spam.  The idea is that you add all of your friends
149              to the allow-list, and then none  of  their  messages  ever  get
150              marked as spam.
151
       -y, --denylist
              Enable the deny-list.  This causes the email addresses given in
              the message's "From:" and "Return-Path:" headers to be checked
              against a second list; if either one matches, then the message
              is always treated as spam.  Training works in the same way as
              with -a, except that you must specify -m or -M twice to modify
              the deny-list instead of the allow-list, and the sense is
              reversed: -y -m -m (mark as spam) will add that address to the
              deny-list, whereas -y -M -M (mark as non-spam) will remove that
              address from the deny-list.  This double specification is so
              that the usual retraining process never touches the deny-list;
              the deny-list should be carefully maintained rather than
              automatically generated.

              Normally you would not need to use the deny-list.
167
168       -L, --level, --threshold LEVEL
169              Change the spam scoring threshold level which  must  be  reached
170              before an email is classified as spam.  The default is 90.
171
172       -Q, --min-tokens NUM
173              Only  give a score if more than NUM tokens are found in the mes‐
174              sage - otherwise the message is assumed to be non-spam,  and  it
175              is  not  modified  in  any  way.  The default is 0.  This option
176              might be useful if you find that very short messages  are  being
177              frequently miscategorised.
178
179       -e, --email, --email-only EMAIL
180              Query  or  update  the  allow-list  entry  for the email address
181              EMAIL.  With no other options, this will simply output "YES"  if
182              EMAIL  is  in  the allow-list, or "NO" if it is not. With -t, it
183              will not output anything, but will exit 0 (success) if EMAIL  is
184              in  the  allow-list,  or  1  (failure) if it is not. With the -m
185              (mark-spam) option, any previous allow-list entry for EMAIL will
186              be  removed.  Finally,  with the -M (mark-nonspam) option, EMAIL
187              will be added to the allow-list if it is not already on it.
188
189              If EMAIL is just the word MSG on its own, then an email will  be
190              read  from  standard input, and the email addresses given in the
191              "From:" and "Return-Path:" headers will be used.
192
193              Using -e automatically switches on -a.
194
              If you also specify -y, then the deny-list will be operated on
              instead.  Remember that the deny-list needs -m or -M to be given
              twice and that their effect is reversed: -m -m adds EMAIL to the
              deny-list and -M -M removes it.
197
198              If you specify an email address of the form @domain (nothing be‐
199              fore the @), then the whole domain will be allow or deny listed.
200
201       -v, --verbose
202              Add extra X-QSF-Info headers to any filtered  email,  containing
203              error  messages  and  so on if applicable.  Specify -v more than
204              once to increase verbosity.
205
206       -T, --train SPAM NONSPAM [MAXROUNDS]
207              Train the database using the two mbox folders SPAM and  NONSPAM,
208              by testing each message in each folder and updating the database
209              each time a message is miscategorised.   This  is  done  several
210              times, and may take a while to run.  Specify the -a (allow-list)
211              flag to add every sender in the NONSPAM folder  to  your  allow-
212              list  as a side-effect of the training process.  If MAXROUNDS is
213              specified, training will end after this number of rounds if  the
214              results  are  still not good enough. The default is a maximum of
215              200 rounds.
216
217       -m, --mark-spam
218              Instead of passing the message out on standard output, mark  its
219              contents  as  spam  and update the database accordingly.  If the
220              allow-list (-a) is enabled, the message's "From:"  and  "Return-
221              Path:"  addresses are removed from the allow-list.  If the deny-
222              list (-y) is enabled and you specify -m twice, the message's ad‐
223              dresses are added to the deny-list instead.
224
225       -M, --mark-nonspam
226              Instead  of passing the message out on standard output, mark its
227              contents as non-spam and update the  database  accordingly.   If
228              the  allow-list  (-a) is enabled, the message's "From:" and "Re‐
229              turn-Path:" addresses are added to the allow-list  (see  the  -a
230              option above).  If the deny-list (-y) is enabled and you specify
231              -M twice, the message's addresses are removed from the deny-list
232              instead.
233
       -w, --weight WEIGHT
              When marking as spam or non-spam, update the database with a
              weighting of WEIGHT per token instead of the default of 1.
              Useful when correcting mistakes, e.g. a message that has been
              mistakenly detected as spam should be marked as non-spam using a
              weighting of 2, i.e. double the usual weighting, to counteract
              the error.
241
242       -D, --dump [FILE]
243              Dump the contents of the database as a platform-independent text
244              file, suitable for archival, transfer to another machine, and so
245              on.  The data is output on stdout or into the given FILE.
246
247       -R, --restore [FILE]
248              Rebuild the database from scratch from the text file  on  stdin.
249              If  a  FILE  is  given,  data is read from there instead of from
250              stdin.
251
252       -O, --tokens
253              Instead of filtering, output a list of the tokens found  in  the
254              message read from standard input, along with the number of times
255              each token was found.  This is only useful if you  want  to  use
256              qsf  as a general tokeniser for use with another filtering pack‐
257              age.
258
259       -E, --merge OTHERDB
260              Merge the OTHERDB database into the current database.  This  can
261              be  useful  if  you want to take one user's mailbox and merge it
262              into the system-wide one, for instance (this would be  done  by,
263              as  root,  doing  qsf -d /var/lib/qsfdb -E /home/user/.qsfdb and
264              then removing /home/user/.qsfdb).
265
266       -B, --benchmark SPAM NONSPAM [MAXROUNDS]
267              Benchmark the training process using the two mbox  folders  SPAM
268              and  NONSPAM.  A temporary database is created and trained using
269              the first 75% of the messages in each folder, and then  the  en‐
270              tire  contents  of  each  folder is tested to see how many false
271              positives and false negatives occur. Some timing information  is
272              also displayed.
273
              This can be used to decide which backend is best on your system.
              Use -d to select a backend, e.g. qsf -B spam nonspam -d GDBM;
              this will create a temporary database which is removed
              afterwards.

              The exception to this is the MySQL backend, where a full
              database specification must be given
              (-d MySQL:database=db;host=localhost;...) and the database table
              given will not be wiped beforehand or dropped afterwards.
283
284              As  with  -T,  if MAXROUNDS is specified, training will never be
285              done for more than this number of rounds; the default is 200.
286
287
288       -h, --help
289              Print a usage message on standard output and exit successfully.
290
291       -V, --version
292              Print version information, including a list of  available  data‐
293              base backends, on standard output and exit successfully.
294
295

DEPRECATED OPTIONS

297       The  following  options are only for use with the old binary tree data‐
298       base backend or old databases that haven't been  upgraded  to  the  new
299       format that came in with version 1.1.0.
300
301
302       -N, --no-autoprune
303              When  marking  as spam or nonspam, never automatically prune the
304              database.  Usually the database is pruned after every 500 marks;
305              if  you  would  rather --prune manually, use -N to disable auto‐
306              matic pruning.
307
308       -p, --prune
309              Remove redundant entries from the database and  clean  it  up  a
310              little.   This  is  automatically  done  after  several calls to
311              --mark-spam or --mark-nonspam, and during training with  --train
312              if  the  training  takes  a large number of rounds, so it should
313              rarely be necessary to use --prune manually unless you are using
314              -N / --no-autoprune.
315
316       -X, --prune-max NUM
317              When the database is being pruned, no more than NUM entries will
318              be considered for removal.  This is to prevent  CPU  and  memory
319              resources  being taken over.  The default is 100,000 but in some
320              circumstances (if you find that pruning takes too long) this op‐
321              tion may be used to reduce it to a more manageable number.
322
323

FILES

       /var/lib/qsfdb
              The default (system-wide) spam database.  If you wish to install
              qsf system-wide, this should be read-only to everyone; there
              should be one user with write access who can update the spam
              database with qsf --mark-spam and qsf --mark-nonspam when
              necessary.
331
332       /var/lib/qsfdb2
333              A  second,  read-only,  system-wide database. This can be useful
334              when installing qsf system-wide and using third-party spam data‐
335              bases; the first global database can be updated with system-spe‐
336              cific changes, and this second database can be periodically  up‐
337              dated when the third-party spam database is updated.
338
339       $HOME/.qsfdb
340              The  default  spam  database  for  per-user data.  Users without
341              write access to the system-wide database will  have  their  data
342              written  here, and the two databases will be read together.  The
343              per-user database will be given a  weighting  equivalent  to  10
344              times the weighting of the global database.
345
346

NOTES

348       Currently,  you  cannot use qsf to check for spam while the database is
349       being updated.  This means that while an update  is  in  progress,  all
350       email is passed through as non-spam.
351
352       There  is  an  upper  size  limit  of 512Kb on incoming email; anything
353       larger than this is just passed through as non-spam, to avoid tying  up
354       machine resources.
355
356       The  plaintext  token  mapping  maintained  by  --plain-map  will never
357       shrink, only grow.  It is intended for use by housekeeping and user in‐
358       terface  scripts that, for instance, the user can use to list all email
359       addresses on their allow-list.  These scripts should take care of weed‐
360       ing  out entries for tokens that are no longer in the database.  If you
361       have no such scripts, there is probably no point in  using  --plain-map
362       anyway.
363
       Avoid using the deny-list (-y) in any automated retraining, as it can
       cause the filter to reject mail unnecessarily.  In general the
       deny-list is probably best left unused unless explicitly required by
       your particular setup.
368
369       If both the allow-list and the deny-list are enabled,  then  email  ad‐
370       dresses  will  first  be checked against the deny-list, then the allow-
371       list, then the domain of the email address will be checked for matching
372       "@domain" entries in the deny-list and then in the allow-list.
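
       The lookup order described above can be sketched in Python as follows.
       This is only an illustration: the function name list_verdict is
       hypothetical, plain sets stand in for the hashed lists, a single
       address is checked rather than both headers, and lower-casing the
       address is an assumption.

```python
from typing import Optional, Set


def list_verdict(address: str, allow: Set[str], deny: Set[str]) -> Optional[str]:
    """Return "spam", "non-spam", or None to fall through to the token test."""
    address = address.lower()
    domain = "@" + address.rsplit("@", 1)[-1]
    # Order from the text: deny-list, allow-list, then "@domain" entries
    # in the deny-list, then "@domain" entries in the allow-list.
    checks = [
        (address, deny, "spam"),
        (address, allow, "non-spam"),
        (domain, deny, "spam"),
        (domain, allow, "non-spam"),
    ]
    for key, entries, verdict in checks:
        if key in entries:
            return verdict
    return None
```

       Note that an allow-listed address wins over a deny-listed "@domain"
       entry, because full addresses are checked before domains.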
373
374

EXAMPLES

376       To filter all of your mail through qsf, with the allow-list enabled and
377       the "spam rating" header being added,  add  this  to  your  .procmailrc
378       file:
379
               :0 wf
               | qsf -ra
382
383       If  you want qsf to add "[SPAM]" to the subject line of any messages it
384       thinks are spam, do this instead:
385
               :0 wf
               | qsf -sra
388
389       To automatically mark any email sent to spambox@yourdomain.com as  spam
390       (this is the "naive" version):
391
               :0 H
               * ^To:.*spambox@yourdomain.com
               | qsf -am
395
       To do the same, but cleverly, so that only email to
       spambox@yourdomain.com which qsf does NOT already classify as spam gets
       marked as spam in the database (this stops the database getting too
       heavily weighted):
400
               # If sent to spambox@yourdomain.com:
               :0
               * ^To:.*spambox@yourdomain.com
               {
                  :0 wf
                  | qsf -a

                  # The above two lines can be skipped if you've
                  # already piped the message through qsf.

                  # If the qsf database says it's not spam,
                  # mark it as spam!
                  :0 H
                  * ^X-Spam: NO
                  | qsf -am
               }
417
418       Remove the -a option in the above examples if you don't want to use the
419       allow-list.
420
421       A  more  complicated filtering example - this will only run qsf on mes‐
422       sages which don't have a subject line saying "your  <something>  is  on
423       fire"  and  which  don't have a sender address ending in "@foobar.com",
424       meaning that messages with that subject line  OR  that  sender  address
425       will NEVER be marked as spam, no matter what:
426
               :0 wf
               * ! ^Subject: Your .* is on fire
               * ! ^From: .*@foobar.com
               | qsf -ra
431
432       For  more  on  procmail(1)  recipes,  see  the  procmailrc(5) and proc‐
433       mailex(5) manual pages.
434
435       A couple of macros to add to your .muttrc file, if you use mutt(1) as a
436       mail user agent:
437
               # Press F5 to mark a message as spam and delete it
               macro index <f5> "<pipe-message>qsf -am\n<delete-message>"
               macro pager <f5> "<pipe-message>qsf -am\n<delete-message>"

               # Press F9 to mark a message as non-spam
               macro index <f9> "<pipe-message>qsf -aM\n"
               macro pager <f9> "<pipe-message>qsf -aM\n"
445
446       Again,  remove the -a option in the above examples if you don't want to
447       use the allow-list.
448
449       Note, however, that the above macros won't work when operating on  mul‐
450       tiple tagged messages. For that, you'd need something like this:
451
               macro index <f5> ":set pipe_split\n<tag-prefix><pipe-message>qsf -am\n<tag-prefix><delete-message>\n:unset pipe_split\n"
454
455       If you use qmail(7), then to get procmail working with it you will need
456       to  put  a  line  containing just DEFAULT=./Maildir/ at the top of your
457       ~/.procmailrc file, so that procmail delivers to  your  Maildir  folder
458       instead  of  trying  to  deliver to /var/spool/mail/$USER, and you will
459       need to put this in your ~/.qmail file:
460
               | preline procmail
462
463       This will cause all your mail to be delivered via procmail  instead  of
464       being delivered directly into your mail directory.
465
466       See the qmail(7) documentation for more about mail delivery with qmail.
467
468       If you use postfix(1), you can set up a system-wide mail filter by cre‐
469       ating a user account for the purpose of filtering mail, populating that
470       account's  .qsfdb,  and  then  creating  a shell script, to run as that
471       user, which runs qsf on stdin and passes stdout to sendmail(8).
472
473       Doing this requires some knowledge of postfix  configuration  and  care
474       needs  to  be  taken to avoid mail loops.  One qsf user's full HOWTO is
475       included in the doc/ directory with this package.
476
477

THE ALLOW-LIST

479       A feature called the "allow-list" can be switched on by specifying  the
480       --allowlist  or  -a option.  This causes messages' "From:" and "Return-
481       Path:" addresses to be checked against a list of people you  have  said
482       to  allow  all  messages  from,  and if a message's "From:" or "Return-
483       Path:" address is in the list, it is never marked as spam.  This  means
484       you can add all your friends to an "allow-list" and qsf will then never
485       mis-file their messages - a quick way to do this is to use -a  with  -T
486       (train);  everyone  in  your  non-spam folder who has sent you an email
487       will be added to the allow-list automatically during training.
488
489       You can manually add and remove addresses to and  from  the  allow-list
490       using  the  -e  (email) option. For instance, to add foo@bar.com to the
491       allow-list, do this:
492
               qsf -e foo@bar.com -M
494
495       To remove bad@nasty.com from the allow-list, do this:
496
               qsf -e bad@nasty.com -m
498
499       And to see whether someone@somewhere.com is in the allow-list  or  not,
500       just do this:
501
               qsf -e someone@somewhere.com
503
504       In  general,  you probably always want to enable the allow-list, so al‐
505       ways specify the -a option when using  qsf.   This  will  automatically
506       maintain the allow-list based on what you classify as spam or non-spam.
507
508       The  only  times  you might want to turn it off are when people on your
509       allow-list are prone to getting viruses or if a virus is causing  email
510       to  be sent to you that is pretending to be from someone on your allow-
511       list.
512
513

BACKUP AND RESTORE

515       Because the database format is platform-specific, it is a good idea  to
516       periodically  dump the database to a text file using qsf -D so that, if
517       necessary, it can be transferred to another machine and  restored  with
518       qsf -R later on.
519
520       Also  note  that  since the actual contents of email messages are never
521       stored in the database (see TECHNICAL DETAILS), you  can  safely  share
522       your  qsf  database with friends - simply dump your database to a file,
523       like this:
524
               qsf -D > your-database-dump.txt
526
527       Once you have sent your-database-dump.txt to another person,  they  can
528       do this:
529
               qsf -R < your-database-dump.txt
531
532       They will then have an identical database to yours.
533
534

TECHNICAL DETAILS

536       When  a message is passed to qsf, any attachments are decoded, all HTML
537       elements are removed, and the message text is then broken up into  "to‐
538       kens",  where  a "token" is a single word or URL.  Each token is hashed
539       using the MD5 algorithm (see below for why), and that hash is then used
540       to look up each token in the qsf database.
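
       The hashing step can be pictured with the short Python sketch below.
       The function name token_key is hypothetical, and encoding the token as
       UTF-8 and using the hexadecimal digest as the key are assumptions; the
       text only states that MD5 is the algorithm.

```python
import hashlib


def token_key(token: str) -> str:
    """Return the hex MD5 digest of a token (UTF-8 encoding is assumed)."""
    return hashlib.md5(token.encode("utf-8")).hexdigest()
```

       Two identical tokens always produce the same key, which is what makes
       the lookup possible without storing the token itself.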
541
542       For  full  details  of  which parts of an email (headers, body, attach‐
543       ments, etc) are used to calculate the spam rating, see the TOKENISATION
544       section below.
545
546       Within the database, each token has two numbers associated with it: the
547       number of times that token has been seen in spam,  and  the  number  of
548       times  it has been seen in non-spam.  These two numbers, along with the
549       total number of spam and non-spam messages seen, are then used to  give
550       a  "spamminess"  value  for  that  particular token.  This "spamminess"
551       value ranges from "definitely not spammy" at  one  end  of  the  scale,
552       through "neutral" in the middle, up to "definitely spammy" at the other
553       end.
554
555       Once a "spamminess" value has been calculated for all of the tokens  in
556       the  message, a summary calculation is made to give an overall "is this
557       spam?"  probability rating for the message.  If the overall probability
558       is 0.9 or above, the message is flagged as spam.
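
       The formulas qsf uses are not given in this manual.  As an
       illustration of the general idea only, the Python sketch below uses a
       well-known Graham-style naive Bayes calculation in place of qsf's own:
       a per-token value from the two counters and the message totals, then a
       product-rule summary compared against the 0.9 threshold.  All function
       names and the 0.01 to 0.99 clamping are assumptions.

```python
from math import prod


def token_spamminess(spam_count, nonspam_count, total_spam, total_nonspam):
    """Graham-style estimate in [0.01, 0.99]; 0.5 means neutral or unseen."""
    if spam_count == 0 and nonspam_count == 0:
        return 0.5
    spam_freq = spam_count / max(total_spam, 1)
    nonspam_freq = nonspam_count / max(total_nonspam, 1)
    p = spam_freq / (spam_freq + nonspam_freq)
    return min(0.99, max(0.01, p))


def overall_probability(values):
    """Combine per-token values with the naive Bayes product rule."""
    spammy = prod(values)
    innocent = prod(1.0 - v for v in values)
    return spammy / (spammy + innocent)


def flagged(values, threshold=0.9):
    """Apply the spam threshold (0.9 corresponds to the default -L 90)."""
    return overall_probability(values) >= threshold
```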
559
560       In  addition  to  the probability test is the "allow-list".  If enabled
561       (with the -a option), the whole probability check  is  skipped  if  the
562       sender  of  the message is listed in the allow-list, and the message is
563       not marked as spam.
564
565       When training the database, a message is split up into  tokens  as  de‐
566       scribed  above, and then the numbers in the database for each token are
567       simply added to: if you tell qsf that a message is spam, it adds one to
568       the  "number  of times seen in spam" counter for each token, and if you
569       tell it a message is not spam, it adds one to the "number of times seen
570       in non-spam" counter for each token.  If you specify a weight, with -w,
571       then the number you specify is added instead of one.
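
       The counter update described above can be sketched in Python as
       follows.  The names new_database and mark are hypothetical, and an
       in-memory dictionary stands in for the real database backend.  Whether
       qsf counts a repeated token once per message or once per occurrence is
       not stated; this sketch adds the weight once per list entry.

```python
from collections import defaultdict


def new_database():
    # Each token maps to [times seen in spam, times seen in non-spam].
    return defaultdict(lambda: [0, 0])


def mark(database, tokens, spam, weight=1):
    """Add weight to the spam or non-spam counter of every token."""
    column = 0 if spam else 1
    for token in tokens:
        database[token][column] += weight
```

       Correcting a mis-tagged message then corresponds to calling mark with
       the opposite type and weight=2, as with --weight=2.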
572
573       To stop the database growing uncontrollably, the database  keeps  track
574       of  when a token was last used.  Underused tokens are automatically re‐
575       moved from the database.  (The old method was to "prune" every 500  up‐
576       dates).
577
578       Finally, the reason MD5 hashes were used is privacy.  If the actual to‐
579       kens from the messages, and the actual email addresses  in  the  allow-
580       list,  were  stored,  you could not share a single qsf database between
581       multiple users because bits of everyone's  messages  would  be  in  the
582       database - things like emailed passwords, keywords relating to personal
583       gossip, and so on.  So a hash is stored instead.  A hash is a "one-way"
584       function;  it  is  easy to turn a token into a hash but very hard (some
585       might say impossible) to turn a hash back into the token  that  created
586       it.  This means that you end up with a database with no personal infor‐
587       mation in it.
588
589

TOKENISATION

591       When a message is broken up into tokens, various parts of  the  message
592       are treated in different ways.
593
594       First,  all header fields are discarded, except for the important ones:
595       From, Return-Path, Sender, To, Reply-To, and Subject.
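
       A minimal Python sketch of this header selection, using the standard
       email module, might look like this.  The names KEPT_HEADERS and
       kept_header_values are hypothetical, and matching header names without
       regard to case is an assumption.

```python
from email import message_from_string

KEPT_HEADERS = ("From", "Return-Path", "Sender", "To", "Reply-To", "Subject")


def kept_header_values(raw_message: str):
    """Return (name, value) pairs for the headers kept for tokenising."""
    msg = message_from_string(raw_message)
    wanted = {name.lower() for name in KEPT_HEADERS}
    return [(name, value) for name, value in msg.items()
            if name.lower() in wanted]
```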
596
597       Next, any MIME-encoded attachments are decoded.  Any attachments  whose
598       MIME type starts with "text/" (i.e. HTML and text) are tokenised, after
599       having any HTML tags stripped.  Any  non-textual  attachments  are  re‐
600       placed  with  their  MD5 hash (such that two identical attachments will
601       have the same hash), and that hash is then used as a token.
602
603       In addition to single-word tokens from textual message parts, qsf  adds
604       doubled-up  tokens  so that word pairs get added to the database.  This
605       makes the database a bit bigger (although the automatic  pruning  tends
606       to take care of that) but makes matching more exact.
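
       The word-pair idea can be sketched in Python as below.  The function
       name with_word_pairs is hypothetical, and joining the two words with a
       single space is an assumption; how qsf represents a pair internally is
       not described here.

```python
def with_word_pairs(words):
    """Return the single-word tokens followed by each adjacent word pair."""
    pairs = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return list(words) + pairs
```

       A message of n words therefore yields n single tokens plus n - 1 pair
       tokens, which is why the database grows somewhat.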
607
608

SPECIAL FILTERS

610       As  well as using the textual content of email to detect spam, qsf also
611       uses special filters which  create  "pseudo-tokens"  based  on  various
612       rules.   This  means that specific patterns, not just individual words,
613       can be used to determine whether a message is spam or not.
614
615       For example, if a message contains lots of words with  multiple  conso‐
616       nants,  like  "ashjkbnxcsdjh",  then each time a word like that is seen
617       the special token ".GIBBERISH-CONSONANTS." is added to the list of  to‐
618       kens  found  in  the  message.  If it turns out that most messages with
619       words that trigger this filter rule are spam, then other messages  with
620       gibberish consonant strings will be more likely to be flagged as spam.
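
       The consonant example above could be approximated in Python like this.
       The threshold of five consonants in a row, the exclusion of "y" from
       the consonant set, and the function name are assumptions chosen for
       the sketch; the actual rule used by qsf is not given in this section.

```python
import re

# Assumed rule: five or more consecutive consonants counts as gibberish.
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxz]{5,}", re.IGNORECASE)


def gibberish_consonant_tokens(words: list[str]) -> list[str]:
    """Emit one pseudo-token for each word that trips the filter."""
    return [".GIBBERISH-CONSONANTS." for word in words
            if CONSONANT_RUN.search(word)]


words = ["hello", "ashjkbnxcsdjh", "meeting", "xkcdqzw"]
print(gibberish_consonant_tokens(words))
```

       The pseudo-tokens are then counted like any other token, so the
       filter only influences the result if training shows that it
       correlates with spam.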
621
622       Currently the special filters are:
623
624
       GTUBE  Flags any message containing the string
              XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X
              as spam - useful for testing that your qsf installation is working.
628
629       ATTACH-SCR
630
631       ATTACH-PIF
632
633       ATTACH-EXE
634
635       ATTACH-VBS
636
637       ATTACH-VBA
638
639       ATTACH-LNK
640
641       ATTACH-COM
642
643       ATTACH-BAT
644              Adds a token for every attachment whose filename ends in ".scr",
645              ".pif", ".exe", ".vbs", ".vba", ".lnk", ".com", and  ".bat"  re‐
646              spectively (these are often viruses).
647
648
649       ATTACH-GIF
650
651       ATTACH-JPG
652
653       ATTACH-PNG
654              Adds a token for every attachment whose filename ends in ".gif",
655              ".jpg" or ".jpeg", and ".png" respectively.
656
657
658       ATTACH-DOC
659
660       ATTACH-XLS
661
662       ATTACH-PDF
663              Adds a token for every attachment whose filename ends in ".doc",
664              ".xls",  or  ".pdf"  respectively (these tend to indicate a non-
665              spam email).
666
667
668       SINGLE-IMAGE
669              Adds a token if the message contains exactly one attached image.
670
671
672       MULTIPLE-IMAGES
673              Adds a token if the message contains more than one attached  im‐
674              age.
675
676
677       GIBBERISH-CONSONANTS
678              Adds  a  token for every word found that has multiple consonants
679              in a row, as described above.  Spam often  contains  strings  of
680              gibberish.
681
682       GIBBERISH-VOWELS
683              Adds  a token for every word found that has multiple vowels in a
684              row, eg "aeaiaiaeeio".
685
686       GIBBERISH-FROMCONS
687              Like GIBBERISH-CONSONANTS, but only for the "From:" and "Return-
688              Path:" addresses on their own.
689
690       GIBBERISH-FROMVOWL
691              Like  GIBBERISH-VOWELS,  but  only  for the "From:" and "Return-
692              Path:" addresses on their own.
693
694       GIBBERISH-BADSTART
695              Adds a token for every word that starts  with  a  bad  character
696              such as %.
697
698       GIBBERISH-HYPHENS
699              Adds  a token for every word with more than three hyphens or un‐
700              derscores in it.
701
702       GIBBERISH-LONGWORDS
              Adds a token for every word with over 30 characters in  it  (but
              fewer than 60).
705
706       HTML-COMMENTS-IN-WORDS
707              Adds  a  token  for  every HTML comment found in the middle of a
708              word.   Spam  often  contains  HTML  inside  words,  like  this:
709              w<!--dsgfhsdgjgh-->ord
710
711       HTML-EXTERNAL-IMG
712              Adds  a  token  for every HTML <img> (image) tag found that con‐
713              tains :// (i.e.  it refers to an external image).
714
715       HTML-FONT
716              Adds a token for every HTML <font> tag found.
717
718       HTML-IP-IN-URLS
719              Adds a token for every URL found containing an IP address.
720
721       HTML-INT-IN-URL
722              Adds a token for every URL found containing an  integer  in  its
723              hostname.
724
725       HTML-URLENCODED-URL
726              Adds  a  token  for  every  URL found containing a % sign in its
727              hostname.
728
729
730       Normally, filters will just cause a token to be added, and these tokens
731       are  processed  by  the  normal weighting algorithm.  However the GTUBE
732       filter will immediately flag any matching message  as  spam,  bypassing
733       the token matching.
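
       The difference between ordinary filters and GTUBE can be sketched in
       Python as below.  Only the ATTACH-* filters and GTUBE are modelled.
       The dotted token spelling (following ".GIBBERISH-CONSONANTS." above),
       the case-insensitive suffix match, and the score_tokens callback that
       stands in for the normal weighting algorithm are all assumptions.

```python
from typing import Callable

GTUBE = "XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X"

# Filename suffix -> filter name, taken from the list of filters above.
ATTACH_SUFFIXES = {
    ".scr": "ATTACH-SCR", ".pif": "ATTACH-PIF", ".exe": "ATTACH-EXE",
    ".vbs": "ATTACH-VBS", ".vba": "ATTACH-VBA", ".lnk": "ATTACH-LNK",
    ".com": "ATTACH-COM", ".bat": "ATTACH-BAT", ".gif": "ATTACH-GIF",
    ".jpg": "ATTACH-JPG", ".jpeg": "ATTACH-JPG", ".png": "ATTACH-PNG",
    ".doc": "ATTACH-DOC", ".xls": "ATTACH-XLS", ".pdf": "ATTACH-PDF",
}


def attachment_pseudo_tokens(filenames: list[str]) -> list[str]:
    """Return one pseudo-token per attachment with a recognised suffix."""
    tokens = []
    for name in filenames:
        lowered = name.lower()  # assumption: matching ignores case
        for suffix, filter_name in ATTACH_SUFFIXES.items():
            if lowered.endswith(suffix):
                tokens.append(f".{filter_name}.")
                break
    return tokens


def is_spam(body: str, filenames: list[str],
            score_tokens: Callable[[list[str]], bool]) -> bool:
    """GTUBE short-circuits; every other filter only contributes tokens."""
    if GTUBE in body:
        return True  # flagged at once, token weighting is bypassed
    return score_tokens(attachment_pseudo_tokens(filenames))


print(attachment_pseudo_tokens(["invoice.PDF", "run.exe", "notes.txt"]))
print(is_spam("test " + GTUBE, [], lambda tokens: False))
```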
734
735

DATABASE BACKENDS

737       The  inbuilt  "list"  database backend will not necessarily provide the
738       best performance, but is provided because using it requires no external
739       libraries.
740
741       If,  when  qsf was compiled, the correct libraries were available, then
742       it will be possible to use qsf with alternative database backends.   To
743       find  out which backends you have available, run qsf -V (capital V) and
744       read the second line of output.  To see how well  a  backend  performs,
745       collect  some  spam and non-spam and use qsf -d BACKEND -B SPAM NONSPAM
746       (see the entry for -B above).
747
748       Some people find that they get the best performance  out  of  the  gdbm
749       backend; this is a library that is widely available on many systems.
750
751       To  efficiently  share a qsf database across multiple machines, you may
752       find the MySQL backend useful.  However, using it is a little more com‐
753       plicated.
754
       To use the MySQL backend you will need to create a table with the
       fields key1, key2, token, value1, value2 and value3.  The token field
       must be VARCHAR(64), and the value1, value2 and value3 fields must
       each be BIGINT or INT.  Indexing on the token field is a good idea.
       The key1 and key2 fields can be of any type, but they must be present.
761
762       For example:
763
               USE mydatabase;
               CREATE TABLE qsfdb (
                 key1      BIGINT UNSIGNED NOT NULL,
                 key2      BIGINT UNSIGNED NOT NULL,
                 token     VARCHAR(64) DEFAULT '' NOT NULL,
                 value1    INT UNSIGNED NOT NULL,
                 value2    INT UNSIGNED NOT NULL,
                 value3    INT UNSIGNED NOT NULL,
                 PRIMARY KEY (key1,key2,token),
                 KEY (key1),
                 KEY (key2),
                 KEY (token)
               );
777
778       The key1 and key2 fields allow you to have multiple  qsf  databases  in
779       one table, by specifying different key1 and key2 values on invocation.
780
781       Instead  of specifying a database file with the --database / -d option,
782       you must specify either a specification string as described  below,  or
783       the name of a file containing such a string on its first line.
784
785       The specification string is as follows:
786
               database=DATABASE;host=HOST;port=PORT;
               user=USER;pass=PASS;table=TABLE;
               key1=KEY1;key2=KEY2
790
791       This string must be all on one line, with no spaces.
792
793
794       DATABASE
795              is the name of the MySQL database.
796
797       HOST   is the hostname of the database server (eg "localhost").
798
799       PORT   is the TCP port to connect on (eg 3306).
800
801       USER   is the username to connect with.
802
803       PASS   is the password to connect with.
804
805       TABLE  is  the  database  table to use.  If a table with this name does
806              not exist when qsf is called in update or training mode, then it
807              will be created if permissions allow this to be done.
808
809       KEY1   is the value to use for the key1 field.
810
811       KEY2   is the value to use for the key2 field.
812
813
814       Since  command  lines  can  be seen in the process list, it is probably
815       best to specify a filename (eg qsf -d  mysql:qsfdb.spec)  and  put  the
816       specification string inside that file.
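
       One way to produce such a file is sketched below in Python.  The
       helper names and the sample values are hypothetical.  The sketch
       joins the fields in the order given above, refuses any whitespace,
       and writes the result as the first line of a file that only its
       owner can read.  It makes no attempt to handle values containing ";"
       or "=", since this section does not say how qsf treats them.

```python
import os


def build_spec(database, host, port, user, password, table, key1, key2) -> str:
    """Join the fields into a single-line specification string."""
    fields = [("database", database), ("host", host), ("port", port),
              ("user", user), ("pass", password), ("table", table),
              ("key1", key1), ("key2", key2)]
    spec = ";".join(f"{name}={value}" for name, value in fields)
    if any(ch.isspace() for ch in spec):
        raise ValueError("specification string must not contain spaces")
    return spec


def write_spec_file(path: str, spec: str) -> None:
    """Write the string as the file's first line, readable by owner only."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(spec + "\n")
    os.chmod(path, 0o600)  # also tighten a file that already existed


# Sample values only; substitute your own connection details.
print(build_spec("mail", "localhost", 3306, "qsf", "secret", "qsfdb", 1, 2))
```

       A file written with write_spec_file("qsfdb.spec", ...) could then be
       passed to qsf as -d mysql:qsfdb.spec, as described above.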
817
818

TROUBLESHOOTING

820       If  you  have  problems  with qsf, please check the list below; if this
821       does not help, go to the qsf home  page  and  investigate  the  mailing
822       lists, or email the author.
823
824
825       Nothing is being marked as spam.
826              First,  use the -r option to switch on the X-Spam-Rating header,
827              and check that this header appears in email passed through  qsf.
828              If  it  does not, then it is likely that qsf is not being run at
829              all - check your configuration of procmail(1) or its equivalent.
830
831
832              If you are seeing X-Spam-Rating headers,  and  different  emails
833              have  different scores, then you may simply need to retrain your
834              database a little more.  Take more spam email and pass it to qsf
835              -m.
836
837
838              If  you  are  seeing X-Spam-Rating headers but they all give the
839              same spam rating, then the most likely reason is that qsf is not
840              reading any database.  Make sure that whatever is processing the
841              email has read permissions on /var/lib/qsfdb and/or  ~/.qsfdb  -
842              and  make  sure that, if you are using ~/.qsfdb, what your data‐
843              base creator thought was ~ ($HOME) is the  same  as  it  is  for
844              whatever is processing the email.
845
846
847       Retraining sometimes takes a very long time.
              With the obtree backend, or with 2-column MySQL or SQLite
              tables, the database is pruned on every 500th retrain (-m or
              -M).  On some systems this may take some time, and during this
              time the database is locked (except when using the MySQL or
              SQLite backends).  If you constantly do a lot of retraining and
              want to avoid this, use the -N option to suppress auto-pruning,
              and schedule a manual prune (qsf -p) to run periodically, for
              example from a cron(8) job.
856
857
858       Running qsf from procmail fails with an error.
859              If you can run qsf from the command line, but in  your  procmail
860              log file you get errors about "qsf: cannot execute binary file",
861              then contact your system administrator for help. It may be  that
862              incoming  email  is handled by a different server to the one you
863              normally shell into, and either they are of a  different  archi‐
864              tecture or operating system, or the mail server is not permitted
865              to execute user-owned binaries.
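
       For the "Nothing is being marked as spam" case above, the three
       checks can be automated over a handful of delivered messages.  The
       Python sketch below is one possible approach, not a tool shipped
       with qsf: the diagnose function and its result strings are
       hypothetical, and the header value is treated as opaque text because
       its exact format is not described in this section.

```python
from email import message_from_bytes

NOT_RUN = "no X-Spam-Rating header: qsf is probably not being run"
NO_DATABASE = "identical ratings: qsf may not be reading any database"
RETRAIN = "ratings differ: more retraining with qsf -m may be needed"


def diagnose(raw_messages: list[bytes]) -> str:
    """Apply the three checks to messages filtered with qsf -r."""
    if len(raw_messages) < 2:
        raise ValueError("need at least two messages to compare ratings")
    ratings = []
    for raw in raw_messages:
        value = message_from_bytes(raw)["X-Spam-Rating"]
        if value is None:
            return NOT_RUN
        ratings.append(str(value).strip())
    # One distinct value across all messages suggests no database is read.
    return NO_DATABASE if len(set(ratings)) == 1 else RETRAIN


sample = [b"Subject: a\r\nX-Spam-Rating: 10\r\n\r\nbody\r\n",
          b"Subject: b\r\nX-Spam-Rating: 90\r\n\r\nbody\r\n"]
print(diagnose(sample))
```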
866
867

AUTHOR

869       Written by Andrew Wood, with patches submitted by various other people.
870       Please see the package README for a complete list of contributors.
871
872

BUGS

874       Report  bugs  in  QSF  using  the contact form linked from the QSF home
875       page: <http://www.ivarch.com/programs/qsf/>
876
877

LICENSE

887       This is free software, distributed under the ARTISTIC 2.0 license.

Linux                             March 2021                            QSF(1)