QSF(1) User Manuals QSF(1)

qsf - quick spam filter
7
9 Filtering: qsf [-snrAtav] [-d DB] [-g DB]
10 [-L LVL] [-S SUBJ] [-H MARK] [-Q NUM]
11 [-X NUM]
12 Training: qsf -T SPAM NONSPAM [MAXROUNDS] [-d DB]
13 Retraining: qsf -[m|M] [-d DB] [-w WEIGHT] [-ayN]
14 Database: qsf -[p|D|R|O] [-d DB]
15 Database merge: qsf -E OTHERDB [-d DB]
16 Allowlist query: qsf -e EMAIL [-m|-M|-t] [-d DB] [-g DB]
17 Denylist query: qsf -y -e EMAIL [-m -m|-M -M|-t] [-d DB] [-g DB]
18 Help: qsf -[h|V]
19
20
22 qsf reads a single email on standard input, and by default outputs it
23 on standard output. If the email is determined to be spam, an addi‐
24 tional header ("X-Spam: YES") will be added, and optionally the subject
25 line can have "[SPAM]" prepended to it.
26
27 qsf is intended to be used in a procmail(1) recipe, in a ruleset such
28 as this:
29
30 :0 wf
31 | qsf -ra
32
33 :0 H:
34 * X-Spam: YES
35 $HOME/mail/spam
36
37 For more examples, including sample procmail(1) recipes, see the EXAM‐
38 PLES section below.
39
40
42 Before qsf can be used properly, it needs to be trained. A good way to
43 train qsf is to collect a copy of all your email into two folders - one
44 for spam, and one for non-spam. Once you have done this, you can use
45 the training function, like this:
46
47 qsf -aT spam-folder non-spam-folder
48
49 This will generate a database that can be used by qsf to guess whether
50 email received in the future is spam or not. Note that this initial
51 training run may take a long time, but you should only need to do it
52 once.
53
54 To mark a single message as spam, pipe it to qsf with the --mark-spam
55 or -m ("mark as spam") option. This will update the database accord‐
56 ingly and discard the email.
57
58 To mark a single message as non-spam, pipe it to qsf with the --mark-
59 nonspam or -M ("mark as non-spam") option. Again, this will discard
60 the email.
61
62 If a message has been mis-tagged, simply send it to qsf as the opposite
63 type, i.e. if it has been mistakenly tagged as spam, pipe it into qsf
64 --mark-nonspam --weight=2 to add it to the non-spam side of the data‐
65 base with double the usual weighting.
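The correction step described above can be wrapped in two small shell functions. This is a minimal sketch: the function names are hypothetical, qsf is assumed to be on the PATH, and the mis-tagged message is assumed to arrive on standard input.

```shell
# Hypothetical helper (not part of qsf): re-mark a message that was wrongly
# tagged as spam. Reads the message on stdin and uses only the documented
# --mark-nonspam and --weight options.
qsf_fix_false_positive() {
    qsf --mark-nonspam --weight=2
}

# Hypothetical helper for the opposite mistake: spam that was let through.
qsf_fix_false_negative() {
    qsf --mark-spam --weight=2
}
```

For example, `qsf_fix_false_positive < message.eml` would feed a saved message (the file name is a placeholder) back into the non-spam side of the database.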
66
67
69 The qsf options are listed below.
70
71 -d, --database [TYPE:]FILE
72 Use FILE as the spam/non-spam database. The default is to use
73 /var/lib/qsfdb and, if that is not available or is read-only,
74 $HOME/.qsfdb. This option can also be useful if there is a sys‐
75 tem-wide database but you do not want to use it - specifying
76 your own here will override the default.
77
78 If you prefix the filename with a TYPE, of the form
79 btree:$HOME/.qsfdb, then this will specify what kind of database
80 FILE is, such as list, btree, gdbm, sqlite and so on. Check the
81 output of qsf -V to see which database backends are available.
82 The default is to auto-detect the type, or, if the file does not
83 already exist, use list. Note that TYPE is not case-sensitive.
84
85 -g, --global [TYPE:]FILE
86 Use FILE as the default global database, instead of
87 /var/lib/qsfdb. If you also specify a database with -d, then
88 this "global" database will be used in read-only mode in con‐
89 junction with the read-write database specified with -d. The -g
90 option can be used a second time to specify a third database,
91 which will also be used in read-only mode. Again, the filename
92 can optionally be prefixed with a TYPE which specifies the data‐
93 base type.
94
95 -P, --plain-map FILE
96 Maintain a mapping of all database tokens to their non-hashed
97 counterparts in FILE, one token per line. This can be useful if
98 you want to be able to list the contents of your database at a
99 later date, for instance to get a list of email addresses in
100 your allow-list. Note that using this option may slow qsf down,
101 and only entries written to the database while this option is
102 active will be stored in FILE.
103
104 -s, --subject
105 Rewrite the Subject line of any email that turns out to be spam,
106 adding "[SPAM]" to the start of the line.
107
108 -S, --subject-marker SUBJECT
109 Instead of adding "[SPAM]", add SUBJECT to the Subject line of
110 any email that turns out to be spam. Implies -s.
111
112 -H, --header-marker MARK
113 Instead of setting the X-Spam header to "YES", set it to MARK if
114 email turns out to be spam. This can be useful if your email
115 client can only search all headers for a string, rather than one
116 particular header (so searching for "YES" might match more than
117 just the output of qsf).
118
119 -n, --no-header
120 Do not add an X-Spam header to messages.
121
122 -r, --add-rating
123 Insert an additional header X-Spam-Rating which is a rating of
124 the "spamminess" of a message from 0 to 100; 90 and above are
125 counted as spam, anything under 90 is not considered spam. If
126 combined with -t, then the rating (0-100) will be output, on its
127 own, on standard output.
128
129 -A, --asterisk
130 Insert an additional header X-Spam-Level which will contain
131 between 0 and 20 asterisks (*), depending on the spam rating.
132
133 -t, --test
134 Instead of passing the message out on standard output, output
135 nothing, and exit 0 if the message is not spam, or exit 1 if the
136 message is spam. If combined with -r, then the spam rating will
137 be output on standard output.
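Because -t reports its verdict through the exit status, it is convenient to use from scripts. The sketch below is one possible wrapper, not anything shipped with qsf: the function name `qsf_classify` is hypothetical, and it relies only on the behaviour described for -t and -r (rating on standard output, exit 1 for spam, exit 0 for non-spam).

```shell
# Hypothetical wrapper (not part of qsf): classify the message on stdin.
# Extra arguments (for example -d DB) are passed through to qsf unchanged.
qsf_classify() {
    # Capture the rating and the exit status without tripping "set -e".
    rating=$(qsf -t -r "$@") && status=0 || status=$?
    if [ "$status" -eq 1 ]; then
        printf 'spam %s\n' "$rating"
    elif [ "$status" -eq 0 ]; then
        printf 'ok %s\n' "$rating"
    else
        # Other exit statuses are not described above; report them as-is.
        printf 'error %s\n' "$status"
        return "$status"
    fi
}
```

A call such as `qsf_classify < message.eml` would then print one line starting with `spam` or `ok`, followed by the rating.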
138
139 -a, --allowlist
140 Enable the allow-list. This causes the email addresses given in
141 the message's "From:" and "Return-Path:" headers to be checked
142 against a list; if either one matches, then the message is
143 always treated as non-spam, regardless of what the token data‐
144 base says. When specified with a retraining flag, -a -m (mark as
145 spam) will remove that address from the allow-list as well as
146 marking the message as spam, and -a -M (mark as non-spam) will
147 add that address to the allow-list as well as marking the mes‐
148 sage as non-spam. The idea is that you add all of your friends
149 to the allow-list, and then none of their messages ever get
150 marked as spam.
151
152 -y, --denylist
153 Enable the deny-list. This causes the email addresses given in
154 the message's "From:" and "Return-Path:" headers to be checked
against a second list; if either one matches, then the message
156 is always treated as spam. Training works in the same way as
157 with -a, except that you must specify -m or -M twice to modify
158 the deny-list instead of the allow-list, and with the reverse
159 syntax: -y -m -m (mark as spam) will add that address to the
160 deny-list, whereas -y -M -M (mark as non-spam) will remove that
161 address from the deny-list. This double specification is so
162 that the usual retraining process never touches the deny-list;
163 the deny-list should be carefully maintained rather than auto‐
164 matically generated.
165
166 Normally you would not need to use the deny-list.
167
168 -L, --level, --threshold LEVEL
169 Change the spam scoring threshold level which must be reached
170 before an email is classified as spam. The default is 90.
171
172 -Q, --min-tokens NUM
173 Only give a score if more than NUM tokens are found in the mes‐
174 sage - otherwise the message is assumed to be non-spam, and it
175 is not modified in any way. The default is 0. This option
176 might be useful if you find that very short messages are being
177 frequently miscategorised.
178
179 -e, --email, --email-only EMAIL
180 Query or update the allow-list entry for the email address
181 EMAIL. With no other options, this will simply output "YES" if
182 EMAIL is in the allow-list, or "NO" if it is not. With -t, it
183 will not output anything, but will exit 0 (success) if EMAIL is
184 in the allow-list, or 1 (failure) if it is not. With the -m
185 (mark-spam) option, any previous allow-list entry for EMAIL will
186 be removed. Finally, with the -M (mark-nonspam) option, EMAIL
187 will be added to the allow-list if it is not already on it.
188
189 If EMAIL is just the word MSG on its own, then an email will be
190 read from standard input, and the email addresses given in the
191 "From:" and "Return-Path:" headers will be used.
192
193 Using -e automatically switches on -a.
194
195 If you also specify -y, then the deny-list will be operated on.
196 Remember that -m and -M are reversed with the deny-list.
197
If you specify an email address of the form @domain (nothing
before the @), then the whole domain will be allow-listed or
deny-listed.
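The exit status returned by -e with -t makes allow-list checks easy to script. The following is a small sketch under that assumption; `qsf_is_allowed` is a hypothetical name, and the function simply passes its argument to the documented `-e EMAIL -t` form.

```shell
# Hypothetical helper (not part of qsf): report whether an address is on the
# allow-list, using the documented exit status of "qsf -e EMAIL -t"
# (0 = on the allow-list, 1 = not on it).
qsf_is_allowed() {
    if qsf -e "$1" -t; then
        echo "$1: allow-listed"
    else
        echo "$1: not allow-listed"
    fi
}
```

For instance, `qsf_is_allowed foo@bar.com` prints one line describing the result. Adding -y to the qsf call would query the deny-list instead, as described above.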
201
202 -v, --verbose
203 Add extra X-QSF-Info headers to any filtered email, containing
204 error messages and so on if applicable. Specify -v more than
205 once to increase verbosity.
206
207 -T, --train SPAM NONSPAM [MAXROUNDS]
208 Train the database using the two mbox folders SPAM and NONSPAM,
209 by testing each message in each folder and updating the database
210 each time a message is miscategorised. This is done several
211 times, and may take a while to run. Specify the -a (allow-list)
212 flag to add every sender in the NONSPAM folder to your allow-
213 list as a side-effect of the training process. If MAXROUNDS is
214 specified, training will end after this number of rounds if the
215 results are still not good enough. The default is a maximum of
216 200 rounds.
217
218 -m, --mark-spam
219 Instead of passing the message out on standard output, mark its
220 contents as spam and update the database accordingly. If the
221 allow-list (-a) is enabled, the message's "From:" and "Return-
222 Path:" addresses are removed from the allow-list. If the deny-
223 list (-y) is enabled and you specify -m twice, the message's
224 addresses are added to the deny-list instead.
225
226 -M, --mark-nonspam
227 Instead of passing the message out on standard output, mark its
228 contents as non-spam and update the database accordingly. If
229 the allow-list (-a) is enabled, the message's "From:" and
230 "Return-Path:" addresses are added to the allow-list (see the -a
231 option above). If the deny-list (-y) is enabled and you specify
232 -M twice, the message's addresses are removed from the deny-list
233 instead.
234
235 -w, --weight WEIGHT
236 When marking as spam or non-spam, update the database with a
237 weighting of WEIGHT per token instead of the default of 1. Use‐
238 ful when correcting mistakes, eg a message that has been mistak‐
239 enly detected as spam should be marked as non-spam using a
240 weighting of 2, i.e. double the usual weighting, to counteract
241 the error.
242
243 -D, --dump [FILE]
244 Dump the contents of the database as a platform-independent text
245 file, suitable for archival, transfer to another machine, and so
246 on. The data is output on stdout or into the given FILE.
247
248 -R, --restore [FILE]
249 Rebuild the database from scratch from the text file on stdin.
250 If a FILE is given, data is read from there instead of from
251 stdin.
252
253 -O, --tokens
254 Instead of filtering, output a list of the tokens found in the
255 message read from standard input, along with the number of times
256 each token was found. This is only useful if you want to use
257 qsf as a general tokeniser for use with another filtering pack‐
258 age.
259
260 -E, --merge OTHERDB
261 Merge the OTHERDB database into the current database. This can
262 be useful if you want to take one user's mailbox and merge it
263 into the system-wide one, for instance (this would be done by,
264 as root, doing qsf -d /var/lib/qsfdb -E /home/user/.qsfdb and
265 then removing /home/user/.qsfdb).
266
267 -B, --benchmark SPAM NONSPAM [MAXROUNDS]
268 Benchmark the training process using the two mbox folders SPAM
269 and NONSPAM. A temporary database is created and trained using
270 the first 75% of the messages in each folder, and then the
271 entire contents of each folder is tested to see how many false
272 positives and false negatives occur. Some timing information is
273 also displayed.
274
275 This can be used to decide which backend is best on your system.
276 Use -d to select a backend, eg qsf -B spam nonspam -d GDBM -
277 this will create a temporary database which is removed after‐
278 wards.
279
The exception to this is the MySQL backend, where a full
database specification must be given
(-d MySQL:database=db;host=localhost;...) and the database table
given will not be wiped beforehand or dropped afterwards.
284
285 As with -T, if MAXROUNDS is specified, training will never be
286 done for more than this number of rounds; the default is 200.
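To compare several backends in one go, the benchmark can be run in a loop. This is only a sketch: `qsf_benchmark_backends` is a hypothetical name, and the backend names must be ones that `qsf -V` reports as available on your system.

```shell
# Hypothetical helper (not part of qsf): run the documented -B benchmark once
# per backend name given on the command line.
# Usage: qsf_benchmark_backends SPAM NONSPAM BACKEND...
qsf_benchmark_backends() {
    spam=$1
    nonspam=$2
    shift 2
    for backend in "$@"; do
        echo "== $backend =="
        qsf -B "$spam" "$nonspam" -d "$backend" ||
            echo "(benchmark failed for $backend)"
    done
}
```

For example, `qsf_benchmark_backends spam nonspam list gdbm` would benchmark the list and gdbm backends in turn. The MySQL backend is not a good fit for this loop, since it needs a full specification rather than a bare backend name.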
287
288
289 -h, --help
290 Print a usage message on standard output and exit successfully.
291
292 -V, --version
293 Print version information, including a list of available data‐
294 base backends, on standard output and exit successfully.
295
296
298 The following options are only for use with the old binary tree data‐
299 base backend or old databases that haven't been upgraded to the new
300 format that came in with version 1.1.0.
301
302
303 -N, --no-autoprune
304 When marking as spam or nonspam, never automatically prune the
305 database. Usually the database is pruned after every 500 marks;
306 if you would rather --prune manually, use -N to disable auto‐
307 matic pruning.
308
309 -p, --prune
310 Remove redundant entries from the database and clean it up a
311 little. This is automatically done after several calls to
312 --mark-spam or --mark-nonspam, and during training with --train
313 if the training takes a large number of rounds, so it should
314 rarely be necessary to use --prune manually unless you are using
315 -N / --no-autoprune.
316
317 -X, --prune-max NUM
318 When the database is being pruned, no more than NUM entries will
319 be considered for removal. This is to prevent CPU and memory
320 resources being taken over. The default is 100,000 but in some
321 circumstances (if you find that pruning takes too long) this
322 option may be used to reduce it to a more manageable number.
323
324
326 /var/lib/qsfdb
327 The default (system-wide) spam database. If you wish to install
328 qsf system-wide, this should be read-only to everyone; there
329 should be one user with write access who can update the spam
database with qsf --mark-spam and qsf --mark-nonspam when
necessary.
332
333 /var/lib/qsfdb2
334 A second, read-only, system-wide database. This can be useful
335 when installing qsf system-wide and using third-party spam data‐
336 bases; the first global database can be updated with system-spe‐
337 cific changes, and this second database can be periodically
338 updated when the third-party spam database is updated.
339
340 $HOME/.qsfdb
341 The default spam database for per-user data. Users without
342 write access to the system-wide database will have their data
343 written here, and the two databases will be read together. The
344 per-user database will be given a weighting equivalent to 10
345 times the weighting of the global database.
346
347
349 Currently, you cannot use qsf to check for spam while the database is
350 being updated. This means that while an update is in progress, all
351 email is passed through as non-spam.
352
353 There is an upper size limit of 512Kb on incoming email; anything
354 larger than this is just passed through as non-spam, to avoid tying up
355 machine resources.
356
357 The plaintext token mapping maintained by --plain-map will never
358 shrink, only grow. It is intended for use by housekeeping and user
359 interface scripts that, for instance, the user can use to list all
360 email addresses on their allow-list. These scripts should take care of
361 weeding out entries for tokens that are no longer in the database. If
362 you have no such scripts, there is probably no point in using --plain-
363 map anyway.
364
Avoid using the deny-list (-y) in any automated retraining, as it can
cause the filter to reject mail unnecessarily. In general the
deny-list is probably best left unused unless explicitly required by
your particular setup.
369
370 If both the allow-list and the deny-list are enabled, then email
371 addresses will first be checked against the deny-list, then the allow-
372 list, then the domain of the email address will be checked for matching
373 "@domain" entries in the deny-list and then in the allow-list.
374
375
377 To filter all of your mail through qsf, with the allow-list enabled and
378 the "spam rating" header being added, add this to your .procmailrc
379 file:
380
381 :0 wf
382 | qsf -ra
383
384 If you want qsf to add "[SPAM]" to the subject line of any messages it
385 thinks are spam, do this instead:
386
387 :0 wf
388 | qsf -sra
389
390 To automatically mark any email sent to spambox@yourdomain.com as spam
391 (this is the "naive" version):
392
393 :0 H
394 * ^To:.*spambox@yourdomain.com
395 | qsf -am
396
397 To do the same, but cleverly, so that only email to spambox@yourdo‐
398 main.com which qsf does NOT already classify as spam gets marked as
399 spam in the database (this stops the database getting too heavily
400 weighted):
401
402 # If sent to spambox@yourdomain.com:
403 :0
404 * ^To:.*spambox@yourdomain.com
405 {
406 :0 wf
407 | qsf -a
408
409 # The above two lines can be skipped if you've
410 # already piped the message through qsf.
411
412 # If the qsf database says it's not spam,
413 # mark it as spam!
414 :0 H
415 * ^X-Spam: NO
416 | qsf -am
417 }
418
419 Remove the -a option in the above examples if you don't want to use the
420 allow-list.
421
422 A more complicated filtering example - this will only run qsf on mes‐
423 sages which don't have a subject line saying "your <something> is on
424 fire" and which don't have a sender address ending in "@foobar.com",
425 meaning that messages with that subject line OR that sender address
426 will NEVER be marked as spam, no matter what:
427
428 :0 wf
429 * ! ^Subject: Your .* is on fire
430 * ! ^From: .*@foobar.com
431 | qsf -ra
432
433 For more on procmail(1) recipes, see the procmailrc(5) and proc‐
434 mailex(5) manual pages.
435
436 A couple of macros to add to your .muttrc file, if you use mutt(1) as a
437 mail user agent:
438
439 # Press F5 to mark a message as spam and delete it
440 macro index <f5> "<pipe-message>qsf -am\n<delete-message>"
441 macro pager <f5> "<pipe-message>qsf -am\n<delete-message>"
442
443 # Press F9 to mark a message as non-spam
444 macro index <f9> "<pipe-message>qsf -aM\n"
445 macro pager <f9> "<pipe-message>qsf -aM\n"
446
447 Again, remove the -a option in the above examples if you don't want to
448 use the allow-list.
449
450 Note, however, that the above macros won't work when operating on mul‐
451 tiple tagged messages. For that, you'd need something like this:
452
macro index <f5> ":set pipe_split\n<tag-prefix><pipe-message>qsf -am\n<tag-prefix><delete-message>\n:unset pipe_split\n"
455
456 If you use qmail(7), then to get procmail working with it you will need
457 to put a line containing just DEFAULT=./Maildir/ at the top of your
458 ~/.procmailrc file, so that procmail delivers to your Maildir folder
459 instead of trying to deliver to /var/spool/mail/$USER, and you will
460 need to put this in your ~/.qmail file:
461
462 | preline procmail
463
464 This will cause all your mail to be delivered via procmail instead of
465 being delivered directly into your mail directory.
466
467 See the qmail(7) documentation for more about mail delivery with qmail.
468
469 If you use postfix(1), you can set up a system-wide mail filter by cre‐
470 ating a user account for the purpose of filtering mail, populating that
471 account's .qsfdb, and then creating a shell script, to run as that
472 user, which runs qsf on stdin and passes stdout to sendmail(8).
473
Doing this requires some knowledge of postfix configuration, and care
must be taken to avoid mail loops. One qsf user's full HOWTO is
included in the doc/ directory with this package.
477
478
480 A feature called the "allow-list" can be switched on by specifying the
481 --allowlist or -a option. This causes messages' "From:" and "Return-
482 Path:" addresses to be checked against a list of people you have said
483 to allow all messages from, and if a message's "From:" or "Return-
484 Path:" address is in the list, it is never marked as spam. This means
485 you can add all your friends to an "allow-list" and qsf will then never
486 mis-file their messages - a quick way to do this is to use -a with -T
487 (train); everyone in your non-spam folder who has sent you an email
488 will be added to the allow-list automatically during training.
489
490 You can manually add and remove addresses to and from the allow-list
491 using the -e (email) option. For instance, to add foo@bar.com to the
492 allow-list, do this:
493
494 qsf -e foo@bar.com -M
495
496 To remove bad@nasty.com from the allow-list, do this:
497
498 qsf -e bad@nasty.com -m
499
500 And to see whether someone@somewhere.com is in the allow-list or not,
501 just do this:
502
503 qsf -e someone@somewhere.com
504
505 In general, you probably always want to enable the allow-list, so
506 always specify the -a option when using qsf. This will automatically
507 maintain the allow-list based on what you classify as spam or non-spam.
508
509 The only times you might want to turn it off are when people on your
510 allow-list are prone to getting viruses or if a virus is causing email
511 to be sent to you that is pretending to be from someone on your allow-
512 list.
513
514
516 Because the database format is platform-specific, it is a good idea to
517 periodically dump the database to a text file using qsf -D so that, if
518 necessary, it can be transferred to another machine and restored with
519 qsf -R later on.
520
521 Also note that since the actual contents of email messages are never
522 stored in the database (see TECHNICAL DETAILS), you can safely share
523 your qsf database with friends - simply dump your database to a file,
524 like this:
525
526 qsf -D > your-database-dump.txt
527
528 Once you have sent your-database-dump.txt to another person, they can
529 do this:
530
531 qsf -R < your-database-dump.txt
532
533 They will then have an identical database to yours.
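For periodic backups it can help to write the dump to a temporary file first, so that a failed dump never replaces an older good copy. The functions below are a sketch of that idea; their names are hypothetical, and they use only the -D and -R options described in this manual.

```shell
# Hypothetical helper (not part of qsf): dump the database to FILE via a
# temporary file, replacing FILE only if the dump succeeded.
qsf_backup() {
    target=$1
    tmp="$target.tmp.$$"
    if qsf -D > "$tmp"; then
        mv "$tmp" "$target"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Hypothetical counterpart: rebuild the database from a dump file.
qsf_restore() {
    qsf -R < "$1"
}
```

A cron job could call `qsf_backup "$HOME/qsfdb-dump.txt"` (the path is a placeholder) after sourcing these definitions.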
534
535
537 When a message is passed to qsf, any attachments are decoded, all HTML
538 elements are removed, and the message text is then broken up into
539 "tokens", where a "token" is a single word or URL. Each token is
540 hashed using the MD5 algorithm (see below for why), and that hash is
541 then used to look up each token in the qsf database.
542
543 For full details of which parts of an email (headers, body, attach‐
544 ments, etc) are used to calculate the spam rating, see the TOKENISATION
545 section below.
546
547 Within the database, each token has two numbers associated with it: the
548 number of times that token has been seen in spam, and the number of
549 times it has been seen in non-spam. These two numbers, along with the
550 total number of spam and non-spam messages seen, are then used to give
551 a "spamminess" value for that particular token. This "spamminess"
552 value ranges from "definitely not spammy" at one end of the scale,
553 through "neutral" in the middle, up to "definitely spammy" at the other
554 end.
555
556 Once a "spamminess" value has been calculated for all of the tokens in
557 the message, a summary calculation is made to give an overall "is this
558 spam?" probability rating for the message. If the overall probability
559 is 0.9 or above, the message is flagged as spam.
560
561 In addition to the probability test is the "allow-list". If enabled
562 (with the -a option), the whole probability check is skipped if the
563 sender of the message is listed in the allow-list, and the message is
564 not marked as spam.
565
566 When training the database, a message is split up into tokens as
567 described above, and then the numbers in the database for each token
568 are simply added to: if you tell qsf that a message is spam, it adds
569 one to the "number of times seen in spam" counter for each token, and
570 if you tell it a message is not spam, it adds one to the "number of
571 times seen in non-spam" counter for each token. If you specify a
572 weight, with -w, then the number you specify is added instead of one.
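The counter update described above can be pictured with a toy model. The sketch below is not qsf's code or file format: it assumes a plain text file with one `TOKEN SPAMCOUNT NONSPAMCOUNT` line per token, unhashed, and the function name `toy_mark` is hypothetical.

```shell
# Toy model only: a text file stands in for the database, and tokens are
# stored in the clear rather than hashed.
# Usage: toy_mark FILE spam|nonspam WEIGHT TOKEN...
toy_mark() {
    file=$1 kind=$2 weight=$3
    shift 3
    [ -f "$file" ] || : > "$file"
    for token in "$@"; do
        awk -v t="$token" -v k="$kind" -v w="$weight" '
            # Known token: add the weight to the matching counter.
            $1 == t { if (k == "spam") $2 += w; else $3 += w; seen = 1 }
            { print }
            # New token: start both counters, one of them at the weight.
            END { if (!seen) { if (k == "spam") print t, w, 0; else print t, 0, w } }
        ' "$file" > "$file.new" && mv "$file.new" "$file"
    done
}
```

Marking with a weight of 2, as suggested for corrections, simply adds 2 instead of 1 to the relevant counter for each token.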
573
574 To stop the database growing uncontrollably, the database keeps track
575 of when a token was last used. Underused tokens are automatically
576 removed from the database. (The old method was to "prune" every 500
577 updates).
578
579 Finally, the reason MD5 hashes were used is privacy. If the actual
580 tokens from the messages, and the actual email addresses in the allow-
581 list, were stored, you could not share a single qsf database between
582 multiple users because bits of everyone's messages would be in the
583 database - things like emailed passwords, keywords relating to personal
584 gossip, and so on. So a hash is stored instead. A hash is a "one-way"
585 function; it is easy to turn a token into a hash but very hard (some
586 might say impossible) to turn a hash back into the token that created
587 it. This means that you end up with a database with no personal infor‐
588 mation in it.
589
590
592 When a message is broken up into tokens, various parts of the message
593 are treated in different ways.
594
595 First, all header fields are discarded, except for the important ones:
596 From, Return-Path, Sender, To, Reply-To, and Subject.
597
598 Next, any MIME-encoded attachments are decoded. Any attachments whose
599 MIME type starts with "text/" (i.e. HTML and text) are tokenised, after
600 having any HTML tags stripped. Any non-textual attachments are
601 replaced with their MD5 hash (such that two identical attachments will
602 have the same hash), and that hash is then used as a token.
603
604 In addition to single-word tokens from textual message parts, qsf adds
605 doubled-up tokens so that word pairs get added to the database. This
606 makes the database a bit bigger (although the automatic pruning tends
607 to take care of that) but makes matching more exact.
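The idea of single-word tokens plus word pairs can be illustrated with a toy tokeniser. This is only a sketch: the splitting rule (runs of letters and digits) and the `first second` pair format are assumptions made for display, and `toy_tokens` is a hypothetical name. To see what qsf itself extracts from a message, use the -O option.

```shell
# Toy illustration only: split stdin into words and print each word, plus
# each adjacent word pair.
toy_tokens() {
    tr -cs 'A-Za-z0-9' '\n' | awk '
        NF == 0 { next }                       # skip empty lines
        {
            print $0                           # single-word token
            if (prev != "") print prev " " $0  # doubled-up (pair) token
            prev = $0
        }'
}
```

Feeding it the text `buy cheap pills` yields the three words interleaved with the pairs `buy cheap` and `cheap pills`.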
608
609
611 As well as using the textual content of email to detect spam, qsf also
612 uses special filters which create "pseudo-tokens" based on various
613 rules. This means that specific patterns, not just individual words,
614 can be used to determine whether a message is spam or not.
615
616 For example, if a message contains lots of words with multiple conso‐
617 nants, like "ashjkbnxcsdjh", then each time a word like that is seen
618 the special token ".GIBBERISH-CONSONANTS." is added to the list of
619 tokens found in the message. If it turns out that most messages with
620 words that trigger this filter rule are spam, then other messages with
621 gibberish consonant strings will be more likely to be flagged as spam.
622
623 Currently the special filters are:
624
625
GTUBE Flags any message containing the string
XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X
as spam - useful for testing that your qsf installation is
working.
630
631 ATTACH-SCR
632
633 ATTACH-PIF
634
635 ATTACH-EXE
636
637 ATTACH-VBS
638
639 ATTACH-VBA
640
641 ATTACH-LNK
642
643 ATTACH-COM
644
645 ATTACH-BAT
646 Adds a token for every attachment whose filename ends in ".scr",
647 ".pif", ".exe", ".vbs", ".vba", ".lnk", ".com", and ".bat"
648 respectively (these are often viruses).
649
650
651 ATTACH-GIF
652
653 ATTACH-JPG
654
655 ATTACH-PNG
656 Adds a token for every attachment whose filename ends in ".gif",
657 ".jpg" or ".jpeg", and ".png" respectively.
658
659
660 ATTACH-DOC
661
662 ATTACH-XLS
663
664 ATTACH-PDF
665 Adds a token for every attachment whose filename ends in ".doc",
666 ".xls", or ".pdf" respectively (these tend to indicate a non-
667 spam email).
668
669
670 SINGLE-IMAGE
671 Adds a token if the message contains exactly one attached image.
672
673
674 MULTIPLE-IMAGES
675 Adds a token if the message contains more than one attached
676 image.
677
678
679 GIBBERISH-CONSONANTS
680 Adds a token for every word found that has multiple consonants
681 in a row, as described above. Spam often contains strings of
682 gibberish.
683
684 GIBBERISH-VOWELS
685 Adds a token for every word found that has multiple vowels in a
686 row, eg "aeaiaiaeeio".
687
688 GIBBERISH-FROMCONS
689 Like GIBBERISH-CONSONANTS, but only for the "From:" and "Return-
690 Path:" addresses on their own.
691
692 GIBBERISH-FROMVOWL
693 Like GIBBERISH-VOWELS, but only for the "From:" and "Return-
694 Path:" addresses on their own.
695
696 GIBBERISH-BADSTART
697 Adds a token for every word that starts with a bad character
698 such as %.
699
700 GIBBERISH-HYPHENS
701 Adds a token for every word with more than three hyphens or
702 underscores in it.
703
704 GIBBERISH-LONGWORDS
705 Adds a token for every word with over 30 characters in it (but
706 less than 60).
707
708 HTML-COMMENTS-IN-WORDS
709 Adds a token for every HTML comment found in the middle of a
710 word. Spam often contains HTML inside words, like this:
711 w<!--dsgfhsdgjgh-->ord
712
713 HTML-EXTERNAL-IMG
714 Adds a token for every HTML <img> (image) tag found that con‐
715 tains :// (i.e. it refers to an external image).
716
717 HTML-FONT
718 Adds a token for every HTML <font> tag found.
719
720 HTML-IP-IN-URLS
721 Adds a token for every URL found containing an IP address.
722
723 HTML-INT-IN-URL
724 Adds a token for every URL found containing an integer in its
725 hostname.
726
727 HTML-URLENCODED-URL
728 Adds a token for every URL found containing a % sign in its
729 hostname.
730
731
732 Normally, filters will just cause a token to be added, and these tokens
733 are processed by the normal weighting algorithm. However the GTUBE
734 filter will immediately flag any matching message as spam, bypassing
735 the token matching.
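Since the GTUBE filter bypasses token matching, it gives a quick self-test that does not depend on how well the database is trained. The sketch below assumes qsf is on the PATH; the function names and the header values in the test message are placeholders, not anything defined by qsf.

```shell
# Hypothetical helper (not part of qsf): print a minimal test message whose
# body is the GTUBE string. The header values are placeholders.
gtube_message() {
    printf '%s\n' \
        'From: sender@example.com' \
        'To: recipient@example.com' \
        'Subject: GTUBE test' \
        '' \
        'XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X'
}

# With -t, qsf exits 1 for spam, so exit status 1 is the expected result.
gtube_selftest() {
    gtube_message | qsf -t && status=0 || status=$?
    if [ "$status" -eq 1 ]; then
        echo "GTUBE message flagged as spam, as expected"
    else
        echo "GTUBE message was not flagged (exit status $status)"
        return 1
    fi
}
```

Running `gtube_selftest` after installation should print the "as expected" line if qsf is working.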
736
737
739 The inbuilt "list" database backend will not necessarily provide the
740 best performance, but is provided because using it requires no external
741 libraries.
742
743 If, when qsf was compiled, the correct libraries were available, then
744 it will be possible to use qsf with alternative database backends. To
745 find out which backends you have available, run qsf -V (capital V) and
746 read the second line of output. To see how well a backend performs,
747 collect some spam and non-spam and use qsf -d BACKEND -B SPAM NONSPAM
748 (see the entry for -B above).
749
750 Some people find that they get the best performance out of the gdbm
751 backend; this is a library that is widely available on many systems.
752
753 To efficiently share a qsf database across multiple machines, you may
754 find the MySQL backend useful. However, using it is a little more com‐
755 plicated.
756
757 To use the MySQL backend you will need to create a table with the
758 fields key1, key2, token, value1, value2 and value3. The token,
759 value1, value2, and value3 fields must be VARCHAR(64), BIGINT or INT,
760 and BIGINT or INT respectively, and indexing on the token field is a
761 good idea. The key1 and key2 fields can be anything, but they must be
762 present.
763
764 For example:
765
766 USE mydatabase;
767 CREATE TABLE qsfdb (
768 key1 BIGINT UNSIGNED NOT NULL,
769 key2 BIGINT UNSIGNED NOT NULL,
770 token VARCHAR(64) DEFAULT '' NOT NULL,
771 value1 INT UNSIGNED NOT NULL,
772 value2 INT UNSIGNED NOT NULL,
773 value3 INT UNSIGNED NOT NULL,
774 PRIMARY KEY (key1,key2,token),
775 KEY (key1),
776 KEY (key2),
777 KEY (token)
778 );
779
780 The key1 and key2 fields allow you to have multiple qsf databases in
781 one table, by specifying different key1 and key2 values on invocation.
782
783 Instead of specifying a database file with the --database / -d option,
784 you must specify either a specification string as described below, or
785 the name of a file containing such a string on its first line.
786
787 The specification string is as follows:
788
789 database=DATABASE;host=HOST;port=PORT;
790 user=USER;pass=PASS;table=TABLE;
791 key1=KEY1;key2=KEY2
792
793 This string must be all on one line, with no spaces.
794
795
DATABASE
    is the name of the MySQL database.

HOST
    is the hostname of the database server (e.g. "localhost").

PORT
    is the TCP port to connect on (e.g. 3306).

USER
    is the username to connect with.

PASS
    is the password to connect with.

TABLE
    is the database table to use. If a table with this name does not
    exist when qsf is called in update or training mode, then it will
    be created if permissions allow this to be done.

KEY1
    is the value to use for the key1 field.

KEY2
    is the value to use for the key2 field.
814
815
Since command lines can be seen in the process list, it is probably
best to specify a filename (e.g. qsf -d mysql:qsfdb.spec) and put the
specification string inside that file.
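
A minimal sketch of creating such a file from a POSIX shell so that only
its owner can read it. The function name write_qsf_spec, the file name
and all connection values are placeholders chosen for illustration; none
of them are part of qsf.

```shell
# Hypothetical helper, not part of qsf: write a MySQL specification string
# to a file readable only by its owner. All arguments are placeholders.
# Usage: write_qsf_spec FILE DATABASE HOST PORT USER PASS TABLE KEY1 KEY2
write_qsf_spec() {
    spec_file=$1
    (
        # Files created in this subshell get no group/other permissions.
        umask 077
        # The whole string goes on one line, with no spaces.
        printf 'database=%s;host=%s;port=%s;user=%s;pass=%s;table=%s;key1=%s;key2=%s\n' \
            "$2" "$3" "$4" "$5" "$6" "$7" "$8" "$9" > "$spec_file"
    )
    # Tighten permissions in case the file already existed.
    chmod 600 "$spec_file"
}

# Example with made-up values:
# write_qsf_spec qsfdb.spec mydatabase localhost 3306 myuser mypass qsfdb 1 1
```

The resulting file can then be named on the command line in place of
the string itself, as in qsf -d mysql:qsfdb.spec.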
819
820
If you have problems with qsf, please check the list below; if this
does not help, go to the qsf home page and investigate the mailing
lists, or email the author.
825
826
Nothing is being marked as spam.
    First, use the -r option to switch on the X-Spam-Rating header,
    and check that this header appears in email passed through qsf.
    If it does not, then it is likely that qsf is not being run at
    all; check your configuration of procmail(1) or its equivalent.

    If you are seeing X-Spam-Rating headers, and different emails
    have different scores, then you may simply need to retrain your
    database a little more. Take more spam email and pass it to
    qsf -m.

    If you are seeing X-Spam-Rating headers but they all give the
    same spam rating, then the most likely reason is that qsf is not
    reading any database. Make sure that whatever is processing the
    email has read permissions on /var/lib/qsfdb and/or ~/.qsfdb.
    If you are using ~/.qsfdb, also make sure that ~ ($HOME) resolves
    to the same directory for whatever is processing the email as it
    did when the database was created.
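
The last two checks can be scripted roughly as follows. This is a sketch
only: the function names are made up for illustration, it assumes one
message per file, and it matches any line starting with X-Spam-Rating:
rather than parsing headers properly.

```shell
# Hypothetical helpers, not part of qsf.

# Print each distinct X-Spam-Rating line found in the given message files,
# prefixed by how often it occurs. A single output line means every
# message received the same rating.
rating_summary() {
    cat "$@" | grep -i '^X-Spam-Rating:' | sort | uniq -c
}

# Report whether each given database file is readable. Run this as the
# user that processes the email, so that permissions and $HOME match.
check_qsf_db() {
    for db in "$@"; do
        if [ -r "$db" ]; then
            echo "readable: $db"
        else
            echo "NOT readable: $db"
        fi
    done
}

# Example calls (the message paths are placeholders):
# rating_summary mail/msg1 mail/msg2 mail/msg3
# check_qsf_db /var/lib/qsfdb "$HOME/.qsfdb"
```

Running check_qsf_db from inside the procmail recipe itself, with its
output sent to a log file, shows what the mail-processing user actually
sees.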
847
848
Retraining sometimes takes a very long time.
    With the obtree backend or 2-column MySQL or SQLite tables, the
    database is pruned on every 500th retrain (-m or -M). On some
    systems this may take some time, and during this time the
    database is locked (except when using the MySQL or SQLite
    backends). If you constantly do a lot of retraining and want to
    avoid this, use the -N option to suppress auto-pruning, and have
    a cron(8) job or similar run a manual prune (qsf -p) every now
    and again.
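
One way to set up such a cron job, sketched as a small shell helper. The
function name add_prune_job and the schedule (03:15 every Sunday) are
arbitrary choices for illustration. cron often runs with a minimal PATH,
so the full path to qsf may be needed, and a -d DB option should be
added if you do not use the default database.

```shell
# Hypothetical helper, not part of qsf: read an existing crontab on
# standard input and print it with a weekly "qsf -p" entry appended,
# unless an identical entry is already present.
add_prune_job() {
    awk -v job='15 3 * * 0 qsf -p' '
        { print }
        $0 == job { found = 1 }
        END { if (!found) print job }
    '
}

# Review the result first, then install it:
# crontab -l 2>/dev/null | add_prune_job
# crontab -l 2>/dev/null | add_prune_job | crontab -
#
# Retraining with auto-pruning suppressed then looks like:
# qsf -m -N < message
```

Because the helper leaves an existing entry alone, running it more than
once does not add duplicate jobs.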
858
859
Running qsf from procmail fails with an error.
    If you can run qsf from the command line, but your procmail log
    file shows errors such as "qsf: cannot execute binary file", then
    contact your system administrator for help. It may be that
    incoming email is handled by a different server to the one you
    normally shell into, and either the two differ in architecture
    or operating system, or the mail server is not permitted to
    execute user-owned binaries.
868
869
The following people have contributed suggestions, comments, patches,
and testing:

    Tom Parker <http://www.bits.bris.ac.uk/palfrey/>
    Dr Kelly A. Parker
    Vesselin Mladenov <http://www.antipodes.bg/>
    Glyn Faulkner
    Mark Reynolds
    Sam Roberts
    Scott Allen
    Karsten Kankowski
    M. Kolbl
    Micha Holzmann
    Jef Poskanzer <http://www.acme.com/jef/>
    Clemens Fischer <http://ino-waiting.gmxhome.de/>
    Nelson A. de Oliveira
    Michal Vitecek
    Tommy Pettersson <http://www.lysator.liu.se/~ptp/>
889
890
The author:

    Andrew Wood <andrew.wood@ivarch.com>
    http://www.ivarch.com/

Project home page:

    http://www.ivarch.com/programs/qsf/


If you find any bugs, please contact the author, either by email or by
using the contact form on the web site.


procmail(1), procmailrc(5), procmailex(5)

Someone has written a guide to using qsf with KMail that can be found
at:

    http://www.softwaredesign.co.uk/Information.SpamFilters.html


This is free software, distributed under the ARTISTIC 2.0 license.
917
918
919
Linux August 2007 QSF(1)