QSF(1)                          User Manuals                          QSF(1)

qsf - quick spam filter
7
Filtering:        qsf [-snrAtav] [-d DB] [-g DB]
                      [-L LVL] [-S SUBJ] [-H MARK] [-Q NUM]
                      [-X NUM]
Training:         qsf -T SPAM NONSPAM [MAXROUNDS] [-d DB]
Retraining:       qsf -[m|M] [-d DB] [-w WEIGHT] [-ayN]
Database:         qsf -[p|D|R|O] [-d DB]
Database merge:   qsf -E OTHERDB [-d DB]
Allowlist query:  qsf -e EMAIL [-m|-M|-t] [-d DB] [-g DB]
Denylist query:   qsf -y -e EMAIL [-m -m|-M -M|-t] [-d DB] [-g DB]
Help:             qsf -[h|V]
19
20
22 qsf reads a single email on standard input, and by default outputs it
23 on standard output. If the email is determined to be spam, an addi‐
24 tional header ("X-Spam: YES") will be added, and optionally the subject
25 line can have "[SPAM]" prepended to it.
26
27 qsf is intended to be used in a procmail(1) recipe, in a ruleset such
28 as this:
29
    :0 wf
    | qsf -ra

    :0 H:
    * ^X-Spam: YES
    $HOME/mail/spam
36
37 For more examples, including sample procmail(1) recipes, see the EXAM‐
38 PLES section below.
39
40
42 Before qsf can be used properly, it needs to be trained. A good way to
43 train qsf is to collect a copy of all your email into two folders - one
44 for spam, and one for non-spam. Once you have done this, you can use
45 the training function, like this:
46
    qsf -aT spam-folder non-spam-folder
48
49 This will generate a database that can be used by qsf to guess whether
50 email received in the future is spam or not. Note that this initial
51 training run may take a long time, but you should only need to do it
52 once.
53
54 To mark a single message as spam, pipe it to qsf with the --mark-spam
55 or -m ("mark as spam") option. This will update the database accord‐
56 ingly and discard the email.
57
58 To mark a single message as non-spam, pipe it to qsf with the --mark-
59 nonspam or -M ("mark as non-spam") option. Again, this will discard
60 the email.
61
62 If a message has been mis-tagged, simply send it to qsf as the opposite
63 type, i.e. if it has been mistakenly tagged as spam, pipe it into qsf
64 --mark-nonspam --weight=2 to add it to the non-spam side of the data‐
65 base with double the usual weighting.
66
67
69 The qsf options are listed below.
70
71 -d, --database [TYPE:]FILE
72 Use FILE as the spam/non-spam database. The default is to use
73 /var/lib/qsfdb and, if that is not available or is read-only,
74 $HOME/.qsfdb. This option can also be useful if there is a sys‐
75 tem-wide database but you do not want to use it - specifying
76 your own here will override the default.
77
78 If you prefix the filename with a TYPE, of the form
79 btree:$HOME/.qsfdb, then this will specify what kind of database
80 FILE is, such as list, btree, gdbm, sqlite and so on. Check the
81 output of qsf -V to see which database backends are available.
82 The default is to auto-detect the type, or, if the file does not
83 already exist, use list. Note that TYPE is not case-sensitive.
84
85 -g, --global [TYPE:]FILE
86 Use FILE as the default global database, instead of
87 /var/lib/qsfdb. If you also specify a database with -d, then
88 this "global" database will be used in read-only mode in con‐
89 junction with the read-write database specified with -d. The -g
90 option can be used a second time to specify a third database,
91 which will also be used in read-only mode. Again, the filename
92 can optionally be prefixed with a TYPE which specifies the data‐
93 base type.
94
95 -P, --plain-map FILE
96 Maintain a mapping of all database tokens to their non-hashed
97 counterparts in FILE, one token per line. This can be useful if
98 you want to be able to list the contents of your database at a
99 later date, for instance to get a list of email addresses in
100 your allow-list. Note that using this option may slow qsf down,
101 and only entries written to the database while this option is
102 active will be stored in FILE.
103
104 -s, --subject
105 Rewrite the Subject line of any email that turns out to be spam,
106 adding "[SPAM]" to the start of the line.
107
108 -S, --subject-marker SUBJECT
109 Instead of adding "[SPAM]", add SUBJECT to the Subject line of
110 any email that turns out to be spam. Implies -s.
111
112 -H, --header-marker MARK
113 Instead of setting the X-Spam header to "YES", set it to MARK if
114 email turns out to be spam. This can be useful if your email
115 client can only search all headers for a string, rather than one
116 particular header (so searching for "YES" might match more than
117 just the output of qsf).
118
119 -n, --no-header
120 Do not add an X-Spam header to messages.
121
122 -r, --add-rating
123 Insert an additional header X-Spam-Rating which is a rating of
124 the "spamminess" of a message from 0 to 100; 90 and above are
125 counted as spam, anything under 90 is not considered spam. If
126 combined with -t, then the rating (0-100) will be output, on its
127 own, on standard output.
128
129 -A, --asterisk
130 Insert an additional header X-Spam-Level which will contain be‐
131 tween 0 and 20 asterisks (*), depending on the spam rating.
132
133 -t, --test
134 Instead of passing the message out on standard output, output
135 nothing, and exit 0 if the message is not spam, or exit 1 if the
136 message is spam. If combined with -r, then the spam rating will
137 be output on standard output.
138
139 -a, --allowlist
140 Enable the allow-list. This causes the email addresses given in
141 the message's "From:" and "Return-Path:" headers to be checked
142 against a list; if either one matches, then the message is al‐
143 ways treated as non-spam, regardless of what the token database
144 says. When specified with a retraining flag, -a -m (mark as
145 spam) will remove that address from the allow-list as well as
146 marking the message as spam, and -a -M (mark as non-spam) will
147 add that address to the allow-list as well as marking the mes‐
148 sage as non-spam. The idea is that you add all of your friends
149 to the allow-list, and then none of their messages ever get
150 marked as spam.
151
-y, --denylist
    Enable the deny-list. This causes the email addresses given in
    the message's "From:" and "Return-Path:" headers to be checked
    against a second list; if either one matches, then the message
    is always treated as spam. Training works in the same way as
    with -a, except that you must specify -m or -M twice to modify
    the deny-list instead of the allow-list, and with the reverse
    syntax: -y -m -m (mark as spam) will add that address to the
    deny-list, whereas -y -M -M (mark as non-spam) will remove that
    address from the deny-list. This double specification is so
    that the usual retraining process never touches the deny-list;
    the deny-list should be carefully maintained rather than
    automatically generated.

    Normally you would not need to use the deny-list.
167
168 -L, --level, --threshold LEVEL
169 Change the spam scoring threshold level which must be reached
170 before an email is classified as spam. The default is 90.
171
172 -Q, --min-tokens NUM
173 Only give a score if more than NUM tokens are found in the mes‐
174 sage - otherwise the message is assumed to be non-spam, and it
175 is not modified in any way. The default is 0. This option
176 might be useful if you find that very short messages are being
177 frequently miscategorised.
178
179 -e, --email, --email-only EMAIL
180 Query or update the allow-list entry for the email address
181 EMAIL. With no other options, this will simply output "YES" if
182 EMAIL is in the allow-list, or "NO" if it is not. With -t, it
183 will not output anything, but will exit 0 (success) if EMAIL is
184 in the allow-list, or 1 (failure) if it is not. With the -m
185 (mark-spam) option, any previous allow-list entry for EMAIL will
186 be removed. Finally, with the -M (mark-nonspam) option, EMAIL
187 will be added to the allow-list if it is not already on it.
188
189 If EMAIL is just the word MSG on its own, then an email will be
190 read from standard input, and the email addresses given in the
191 "From:" and "Return-Path:" headers will be used.
192
193 Using -e automatically switches on -a.
194
195 If you also specify -y, then the deny-list will be operated on.
196 Remember that -m and -M are reversed with the deny-list.
197
198 If you specify an email address of the form @domain (nothing be‐
199 fore the @), then the whole domain will be allow or deny listed.
200
201 -v, --verbose
202 Add extra X-QSF-Info headers to any filtered email, containing
203 error messages and so on if applicable. Specify -v more than
204 once to increase verbosity.
205
206 -T, --train SPAM NONSPAM [MAXROUNDS]
207 Train the database using the two mbox folders SPAM and NONSPAM,
208 by testing each message in each folder and updating the database
209 each time a message is miscategorised. This is done several
210 times, and may take a while to run. Specify the -a (allow-list)
211 flag to add every sender in the NONSPAM folder to your allow-
212 list as a side-effect of the training process. If MAXROUNDS is
213 specified, training will end after this number of rounds if the
214 results are still not good enough. The default is a maximum of
215 200 rounds.
216
217 -m, --mark-spam
218 Instead of passing the message out on standard output, mark its
219 contents as spam and update the database accordingly. If the
220 allow-list (-a) is enabled, the message's "From:" and "Return-
221 Path:" addresses are removed from the allow-list. If the deny-
222 list (-y) is enabled and you specify -m twice, the message's ad‐
223 dresses are added to the deny-list instead.
224
225 -M, --mark-nonspam
226 Instead of passing the message out on standard output, mark its
227 contents as non-spam and update the database accordingly. If
228 the allow-list (-a) is enabled, the message's "From:" and "Re‐
229 turn-Path:" addresses are added to the allow-list (see the -a
230 option above). If the deny-list (-y) is enabled and you specify
231 -M twice, the message's addresses are removed from the deny-list
232 instead.
233
234 -w, --weight WEIGHT
235 When marking as spam or non-spam, update the database with a
236 weighting of WEIGHT per token instead of the default of 1. Use‐
237 ful when correcting mistakes, eg a message that has been mistak‐
238 enly detected as spam should be marked as non-spam using a
239 weighting of 2, i.e. double the usual weighting, to counteract
240 the error.
241
242 -D, --dump [FILE]
243 Dump the contents of the database as a platform-independent text
244 file, suitable for archival, transfer to another machine, and so
245 on. The data is output on stdout or into the given FILE.
246
247 -R, --restore [FILE]
248 Rebuild the database from scratch from the text file on stdin.
249 If a FILE is given, data is read from there instead of from
250 stdin.
251
252 -O, --tokens
253 Instead of filtering, output a list of the tokens found in the
254 message read from standard input, along with the number of times
255 each token was found. This is only useful if you want to use
256 qsf as a general tokeniser for use with another filtering pack‐
257 age.
258
-E, --merge OTHERDB
    Merge the OTHERDB database into the current database. This can
    be useful if you want to take one user's database and merge it
    into the system-wide one, for instance (this would be done by,
    as root, doing qsf -d /var/lib/qsfdb -E /home/user/.qsfdb and
    then removing /home/user/.qsfdb).
265
266 -B, --benchmark SPAM NONSPAM [MAXROUNDS]
267 Benchmark the training process using the two mbox folders SPAM
268 and NONSPAM. A temporary database is created and trained using
269 the first 75% of the messages in each folder, and then the en‐
270 tire contents of each folder is tested to see how many false
271 positives and false negatives occur. Some timing information is
272 also displayed.
273
274 This can be used to decide which backend is best on your system.
275 Use -d to select a backend, eg qsf -B spam nonspam -d GDBM -
276 this will create a temporary database which is removed after‐
277 wards.
278
    The exception to this is the MySQL backend, where a full
    database specification must be given
    (-d MySQL:database=db;host=localhost;...) and the database table
    given will not be wiped beforehand or dropped afterwards.
283
284 As with -T, if MAXROUNDS is specified, training will never be
285 done for more than this number of rounds; the default is 200.
286
287
288 -h, --help
289 Print a usage message on standard output and exit successfully.
290
291 -V, --version
292 Print version information, including a list of available data‐
293 base backends, on standard output and exit successfully.
294
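The exit status convention of the -t option described above can be used from a script. The following is a minimal sketch, assuming qsf is installed and on the PATH; the helper name is_spam and its cmd parameter are illustrative and not part of qsf. Treating any exit status other than 0 or 1 as an error is also an assumption.

```python
import subprocess


def is_spam(message: bytes, cmd=("qsf", "-t")) -> bool:
    """Pipe one email to `qsf -t` and interpret the exit status.

    Per the manual: exit 0 means not spam, exit 1 means spam.
    Any other status is treated here as an error (an assumption).
    """
    result = subprocess.run(cmd, input=message, stdout=subprocess.DEVNULL)
    if result.returncode == 0:
        return False
    if result.returncode == 1:
        return True
    raise RuntimeError(f"unexpected exit status {result.returncode}")
```

The cmd parameter exists only so that other options from this manual (for instance -a or -d) can be added by the caller, as in is_spam(raw, cmd=("qsf", "-t", "-a")).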
295
297 The following options are only for use with the old binary tree data‐
298 base backend or old databases that haven't been upgraded to the new
299 format that came in with version 1.1.0.
300
301
302 -N, --no-autoprune
303 When marking as spam or nonspam, never automatically prune the
304 database. Usually the database is pruned after every 500 marks;
305 if you would rather --prune manually, use -N to disable auto‐
306 matic pruning.
307
308 -p, --prune
309 Remove redundant entries from the database and clean it up a
310 little. This is automatically done after several calls to
311 --mark-spam or --mark-nonspam, and during training with --train
312 if the training takes a large number of rounds, so it should
313 rarely be necessary to use --prune manually unless you are using
314 -N / --no-autoprune.
315
316 -X, --prune-max NUM
317 When the database is being pruned, no more than NUM entries will
318 be considered for removal. This is to prevent CPU and memory
319 resources being taken over. The default is 100,000 but in some
320 circumstances (if you find that pruning takes too long) this op‐
321 tion may be used to reduce it to a more manageable number.
322
323
/var/lib/qsfdb
    The default (system-wide) spam database. If you wish to install
    qsf system-wide, this should be read-only to everyone; there
    should be one user with write access who can update the spam
    database with qsf --mark-spam and qsf --mark-nonspam when
    necessary.
331
332 /var/lib/qsfdb2
333 A second, read-only, system-wide database. This can be useful
334 when installing qsf system-wide and using third-party spam data‐
335 bases; the first global database can be updated with system-spe‐
336 cific changes, and this second database can be periodically up‐
337 dated when the third-party spam database is updated.
338
339 $HOME/.qsfdb
340 The default spam database for per-user data. Users without
341 write access to the system-wide database will have their data
342 written here, and the two databases will be read together. The
343 per-user database will be given a weighting equivalent to 10
344 times the weighting of the global database.
345
346
348 Currently, you cannot use qsf to check for spam while the database is
349 being updated. This means that while an update is in progress, all
350 email is passed through as non-spam.
351
352 There is an upper size limit of 512Kb on incoming email; anything
353 larger than this is just passed through as non-spam, to avoid tying up
354 machine resources.
355
356 The plaintext token mapping maintained by --plain-map will never
357 shrink, only grow. It is intended for use by housekeeping and user in‐
358 terface scripts that, for instance, the user can use to list all email
359 addresses on their allow-list. These scripts should take care of weed‐
360 ing out entries for tokens that are no longer in the database. If you
361 have no such scripts, there is probably no point in using --plain-map
362 anyway.
363
Avoid using the deny-list (-y) in any automated retraining, as it can
cause the filter to reject mail unnecessarily. In general the
deny-list is probably best left unused unless explicitly required by
your particular setup.
368
369 If both the allow-list and the deny-list are enabled, then email ad‐
370 dresses will first be checked against the deny-list, then the allow-
371 list, then the domain of the email address will be checked for matching
372 "@domain" entries in the deny-list and then in the allow-list.
373
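The lookup order described above can be sketched as follows. This is an illustration only: the function name list_verdict is hypothetical, plain strings are used for readability (qsf itself stores hashes), and lower-casing the address is an assumption.

```python
from typing import Optional, Set


def list_verdict(address: str, allow: Set[str], deny: Set[str]) -> Optional[str]:
    """Return "spam", "non-spam", or None if no list entry matches."""
    address = address.lower()  # assumption: case-insensitive matching
    domain = "@" + address.rpartition("@")[2]
    # Order taken from the manual: deny-list address, allow-list address,
    # then "@domain" entries in the deny-list, then in the allow-list.
    checks = [
        (address, deny, "spam"),
        (address, allow, "non-spam"),
        (domain, deny, "spam"),
        (domain, allow, "non-spam"),
    ]
    for key, entries, verdict in checks:
        if key in entries:
            return verdict
    return None
```

With this order, an allow-list entry for a full address wins over a deny-list entry for its domain, but a deny-list entry for the same full address wins over both.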
EXAMPLES
376 To filter all of your mail through qsf, with the allow-list enabled and
377 the "spam rating" header being added, add this to your .procmailrc
378 file:
379
    :0 wf
    | qsf -ra
382
383 If you want qsf to add "[SPAM]" to the subject line of any messages it
384 thinks are spam, do this instead:
385
    :0 wf
    | qsf -sra
388
389 To automatically mark any email sent to spambox@yourdomain.com as spam
390 (this is the "naive" version):
391
    :0 H
    * ^To:.*spambox@yourdomain.com
    | qsf -am
395
To do the same, but cleverly, so that only email to
spambox@yourdomain.com which qsf does NOT already classify as spam gets
marked as spam in the database (this stops the database getting too
heavily weighted):
400
    # If sent to spambox@yourdomain.com:
    :0
    * ^To:.*spambox@yourdomain.com
    {
        :0 wf
        | qsf -a

        # The above two lines can be skipped if you've
        # already piped the message through qsf.

        # If the qsf database says it's not spam,
        # mark it as spam!
        :0 H
        * ^X-Spam: NO
        | qsf -am
    }
417
418 Remove the -a option in the above examples if you don't want to use the
419 allow-list.
420
421 A more complicated filtering example - this will only run qsf on mes‐
422 sages which don't have a subject line saying "your <something> is on
423 fire" and which don't have a sender address ending in "@foobar.com",
424 meaning that messages with that subject line OR that sender address
425 will NEVER be marked as spam, no matter what:
426
    :0 wf
    * ! ^Subject: Your .* is on fire
    * ! ^From: .*@foobar.com
    | qsf -ra
431
For more on procmail(1) recipes, see the procmailrc(5) and
procmailex(5) manual pages.
434
435 A couple of macros to add to your .muttrc file, if you use mutt(1) as a
436 mail user agent:
437
    # Press F5 to mark a message as spam and delete it
    macro index <f5> "<pipe-message>qsf -am\n<delete-message>"
    macro pager <f5> "<pipe-message>qsf -am\n<delete-message>"

    # Press F9 to mark a message as non-spam
    macro index <f9> "<pipe-message>qsf -aM\n"
    macro pager <f9> "<pipe-message>qsf -aM\n"
445
446 Again, remove the -a option in the above examples if you don't want to
447 use the allow-list.
448
449 Note, however, that the above macros won't work when operating on mul‐
450 tiple tagged messages. For that, you'd need something like this:
451
    macro index <f5> ":set pipe_split\n<tag-prefix><pipe-message>qsf -am\n<tag-prefix><delete-message>\n:unset pipe_split\n"
454
455 If you use qmail(7), then to get procmail working with it you will need
456 to put a line containing just DEFAULT=./Maildir/ at the top of your
457 ~/.procmailrc file, so that procmail delivers to your Maildir folder
458 instead of trying to deliver to /var/spool/mail/$USER, and you will
459 need to put this in your ~/.qmail file:
460
    | preline procmail
462
463 This will cause all your mail to be delivered via procmail instead of
464 being delivered directly into your mail directory.
465
466 See the qmail(7) documentation for more about mail delivery with qmail.
467
468 If you use postfix(1), you can set up a system-wide mail filter by cre‐
469 ating a user account for the purpose of filtering mail, populating that
470 account's .qsfdb, and then creating a shell script, to run as that
471 user, which runs qsf on stdin and passes stdout to sendmail(8).
472
473 Doing this requires some knowledge of postfix configuration and care
474 needs to be taken to avoid mail loops. One qsf user's full HOWTO is
475 included in the doc/ directory with this package.
476
477
479 A feature called the "allow-list" can be switched on by specifying the
480 --allowlist or -a option. This causes messages' "From:" and "Return-
481 Path:" addresses to be checked against a list of people you have said
482 to allow all messages from, and if a message's "From:" or "Return-
483 Path:" address is in the list, it is never marked as spam. This means
484 you can add all your friends to an "allow-list" and qsf will then never
485 mis-file their messages - a quick way to do this is to use -a with -T
486 (train); everyone in your non-spam folder who has sent you an email
487 will be added to the allow-list automatically during training.
488
489 You can manually add and remove addresses to and from the allow-list
490 using the -e (email) option. For instance, to add foo@bar.com to the
491 allow-list, do this:
492
    qsf -e foo@bar.com -M
494
495 To remove bad@nasty.com from the allow-list, do this:
496
    qsf -e bad@nasty.com -m
498
499 And to see whether someone@somewhere.com is in the allow-list or not,
500 just do this:
501
    qsf -e someone@somewhere.com
503
504 In general, you probably always want to enable the allow-list, so al‐
505 ways specify the -a option when using qsf. This will automatically
506 maintain the allow-list based on what you classify as spam or non-spam.
507
508 The only times you might want to turn it off are when people on your
509 allow-list are prone to getting viruses or if a virus is causing email
510 to be sent to you that is pretending to be from someone on your allow-
511 list.
512
513
515 Because the database format is platform-specific, it is a good idea to
516 periodically dump the database to a text file using qsf -D so that, if
517 necessary, it can be transferred to another machine and restored with
518 qsf -R later on.
519
520 Also note that since the actual contents of email messages are never
521 stored in the database (see TECHNICAL DETAILS), you can safely share
522 your qsf database with friends - simply dump your database to a file,
523 like this:
524
    qsf -D > your-database-dump.txt
526
527 Once you have sent your-database-dump.txt to another person, they can
528 do this:
529
    qsf -R < your-database-dump.txt
531
532 They will then have an identical database to yours.
533
TECHNICAL DETAILS
536 When a message is passed to qsf, any attachments are decoded, all HTML
537 elements are removed, and the message text is then broken up into "to‐
538 kens", where a "token" is a single word or URL. Each token is hashed
539 using the MD5 algorithm (see below for why), and that hash is then used
540 to look up each token in the qsf database.
541
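The tokenise-then-hash step described above might look roughly like this. It is a sketch, not qsf's own code: the function name hashed_tokens is hypothetical, and both the whitespace-based splitting and the lower-casing are assumptions, since this manual does not spell out the exact tokeniser rules.

```python
import hashlib
import re


def hashed_tokens(text: str) -> list:
    """Split text into word/URL tokens and return their MD5 hex digests."""
    # Assumption: a token is any run of non-whitespace characters; qsf's
    # real tokeniser has its own rules for word boundaries and URLs.
    tokens = re.findall(r"\S+", text.lower())
    return [hashlib.md5(t.encode("utf-8")).hexdigest() for t in tokens]
```

Each digest would then serve as the lookup key in the database, so the database never needs to hold the original word.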
542 For full details of which parts of an email (headers, body, attach‐
543 ments, etc) are used to calculate the spam rating, see the TOKENISATION
544 section below.
545
546 Within the database, each token has two numbers associated with it: the
547 number of times that token has been seen in spam, and the number of
548 times it has been seen in non-spam. These two numbers, along with the
549 total number of spam and non-spam messages seen, are then used to give
550 a "spamminess" value for that particular token. This "spamminess"
551 value ranges from "definitely not spammy" at one end of the scale,
552 through "neutral" in the middle, up to "definitely spammy" at the other
553 end.
554
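This manual does not give the formula used for the per-token "spamminess" value. As an illustration only, one simple way to turn the four numbers mentioned above into a value between 0 (not spammy) and 1 (spammy) is a ratio of relative frequencies. The function below is an assumed stand-in and should not be read as qsf's actual calculation.

```python
def token_spamminess(spam_count: int, nonspam_count: int,
                     total_spam: int, total_nonspam: int) -> float:
    """Return a value in [0, 1]: 0 = not spammy, 0.5 = neutral, 1 = spammy."""
    # Frequency of the token in each class, guarding against empty classes.
    spam_freq = spam_count / total_spam if total_spam else 0.0
    nonspam_freq = nonspam_count / total_nonspam if total_nonspam else 0.0
    if spam_freq + nonspam_freq == 0:
        return 0.5  # never seen: treat as neutral
    return spam_freq / (spam_freq + nonspam_freq)
```

Dividing by the message totals keeps a token from looking spammy merely because far more spam than non-spam has been trained.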
555 Once a "spamminess" value has been calculated for all of the tokens in
556 the message, a summary calculation is made to give an overall "is this
557 spam?" probability rating for the message. If the overall probability
558 is 0.9 or above, the message is flagged as spam.
559
560 In addition to the probability test is the "allow-list". If enabled
561 (with the -a option), the whole probability check is skipped if the
562 sender of the message is listed in the allow-list, and the message is
563 not marked as spam.
564
565 When training the database, a message is split up into tokens as de‐
566 scribed above, and then the numbers in the database for each token are
567 simply added to: if you tell qsf that a message is spam, it adds one to
568 the "number of times seen in spam" counter for each token, and if you
569 tell it a message is not spam, it adds one to the "number of times seen
570 in non-spam" counter for each token. If you specify a weight, with -w,
571 then the number you specify is added instead of one.
572
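The counter update described above can be sketched with an in-memory dictionary standing in for the database. The names mark and db are hypothetical, and whether a token repeated within one message is counted once or several times is not stated here, so the sketch simply counts every token it is given.

```python
from collections import defaultdict


def mark(db, tokens, is_spam, weight=1):
    """Add `weight` to the spam or non-spam counter of each token."""
    index = 0 if is_spam else 1
    for token in tokens:
        db[token][index] += weight


db = defaultdict(lambda: [0, 0])  # token -> [seen in spam, seen in non-spam]
mark(db, ["cheap", "pills"], is_spam=True)               # like qsf -m
mark(db, ["cheap", "flights"], is_spam=False, weight=2)  # like qsf -M -w 2
```

After these two calls the token "cheap" has been seen once in spam and, because of the weight, counts twice on the non-spam side.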
573 To stop the database growing uncontrollably, the database keeps track
574 of when a token was last used. Underused tokens are automatically re‐
575 moved from the database. (The old method was to "prune" every 500 up‐
576 dates).
577
578 Finally, the reason MD5 hashes were used is privacy. If the actual to‐
579 kens from the messages, and the actual email addresses in the allow-
580 list, were stored, you could not share a single qsf database between
581 multiple users because bits of everyone's messages would be in the
582 database - things like emailed passwords, keywords relating to personal
583 gossip, and so on. So a hash is stored instead. A hash is a "one-way"
584 function; it is easy to turn a token into a hash but very hard (some
585 might say impossible) to turn a hash back into the token that created
586 it. This means that you end up with a database with no personal infor‐
587 mation in it.
588
TOKENISATION
591 When a message is broken up into tokens, various parts of the message
592 are treated in different ways.
593
594 First, all header fields are discarded, except for the important ones:
595 From, Return-Path, Sender, To, Reply-To, and Subject.
596
597 Next, any MIME-encoded attachments are decoded. Any attachments whose
598 MIME type starts with "text/" (i.e. HTML and text) are tokenised, after
599 having any HTML tags stripped. Any non-textual attachments are re‐
600 placed with their MD5 hash (such that two identical attachments will
601 have the same hash), and that hash is then used as a token.
602
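The header selection and attachment handling described above can be approximated with Python's standard email package. This is a sketch under stated assumptions rather than qsf's implementation: the function name message_parts is hypothetical, HTML tag stripping is left out, and undeclared or unknown character sets fall back to UTF-8.

```python
import hashlib
from email import message_from_bytes

KEPT_HEADERS = ("From", "Return-Path", "Sender", "To", "Reply-To", "Subject")


def message_parts(raw: bytes):
    """Return (kept header values, text bodies, MD5 tokens of other parts)."""
    msg = message_from_bytes(raw)
    headers = [v for name in KEPT_HEADERS for v in msg.get_all(name, [])]
    texts, attachment_tokens = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no payload of their own
        payload = part.get_payload(decode=True) or b""
        if part.get_content_maintype() == "text":
            charset = part.get_content_charset() or "utf-8"
            try:
                texts.append(payload.decode(charset, errors="replace"))
            except LookupError:  # unknown charset name
                texts.append(payload.decode("utf-8", errors="replace"))
        else:
            # Non-textual parts become a single token: their MD5 hash.
            attachment_tokens.append(hashlib.md5(payload).hexdigest())
    return headers, texts, attachment_tokens
```

Two messages carrying the same binary attachment produce the same attachment token, which is what allows a repeated attachment to be recognised.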
603 In addition to single-word tokens from textual message parts, qsf adds
604 doubled-up tokens so that word pairs get added to the database. This
605 makes the database a bit bigger (although the automatic pruning tends
606 to take care of that) but makes matching more exact.
607
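The doubled-up tokens mentioned above can be illustrated as follows. The function name with_pairs is hypothetical, and joining the two words with a space is an assumption; this manual does not say how qsf represents a word pair internally.

```python
def with_pairs(words):
    """Return single-word tokens followed by adjacent word-pair tokens."""
    # Assumption: a pair token is the two words joined by a space.
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return list(words) + pairs
```

A message of n words therefore yields n single tokens plus n - 1 pair tokens, which is why the database grows somewhat.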
608
610 As well as using the textual content of email to detect spam, qsf also
611 uses special filters which create "pseudo-tokens" based on various
612 rules. This means that specific patterns, not just individual words,
613 can be used to determine whether a message is spam or not.
614
615 For example, if a message contains lots of words with multiple conso‐
616 nants, like "ashjkbnxcsdjh", then each time a word like that is seen
617 the special token ".GIBBERISH-CONSONANTS." is added to the list of to‐
618 kens found in the message. If it turns out that most messages with
619 words that trigger this filter rule are spam, then other messages with
620 gibberish consonant strings will be more likely to be flagged as spam.
621
622 Currently the special filters are:
623
624
GTUBE
    Flags any message containing the string
    XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X
    as spam - useful for testing that your qsf installation is working.
628
629 ATTACH-SCR
630
631 ATTACH-PIF
632
633 ATTACH-EXE
634
635 ATTACH-VBS
636
637 ATTACH-VBA
638
639 ATTACH-LNK
640
641 ATTACH-COM
642
643 ATTACH-BAT
644 Adds a token for every attachment whose filename ends in ".scr",
645 ".pif", ".exe", ".vbs", ".vba", ".lnk", ".com", and ".bat" re‐
646 spectively (these are often viruses).
647
648
649 ATTACH-GIF
650
651 ATTACH-JPG
652
653 ATTACH-PNG
654 Adds a token for every attachment whose filename ends in ".gif",
655 ".jpg" or ".jpeg", and ".png" respectively.
656
657
658 ATTACH-DOC
659
660 ATTACH-XLS
661
662 ATTACH-PDF
663 Adds a token for every attachment whose filename ends in ".doc",
664 ".xls", or ".pdf" respectively (these tend to indicate a non-
665 spam email).
666
667
668 SINGLE-IMAGE
669 Adds a token if the message contains exactly one attached image.
670
671
672 MULTIPLE-IMAGES
673 Adds a token if the message contains more than one attached im‐
674 age.
675
676
677 GIBBERISH-CONSONANTS
678 Adds a token for every word found that has multiple consonants
679 in a row, as described above. Spam often contains strings of
680 gibberish.
681
682 GIBBERISH-VOWELS
683 Adds a token for every word found that has multiple vowels in a
684 row, eg "aeaiaiaeeio".
685
686 GIBBERISH-FROMCONS
687 Like GIBBERISH-CONSONANTS, but only for the "From:" and "Return-
688 Path:" addresses on their own.
689
690 GIBBERISH-FROMVOWL
691 Like GIBBERISH-VOWELS, but only for the "From:" and "Return-
692 Path:" addresses on their own.
693
694 GIBBERISH-BADSTART
695 Adds a token for every word that starts with a bad character
696 such as %.
697
698 GIBBERISH-HYPHENS
699 Adds a token for every word with more than three hyphens or un‐
700 derscores in it.
701
702 GIBBERISH-LONGWORDS
703 Adds a token for every word with over 30 characters in it (but
704 less than 60).
705
706 HTML-COMMENTS-IN-WORDS
707 Adds a token for every HTML comment found in the middle of a
708 word. Spam often contains HTML inside words, like this:
709 w<!--dsgfhsdgjgh-->ord
710
711 HTML-EXTERNAL-IMG
712 Adds a token for every HTML <img> (image) tag found that con‐
713 tains :// (i.e. it refers to an external image).
714
715 HTML-FONT
716 Adds a token for every HTML <font> tag found.
717
718 HTML-IP-IN-URLS
719 Adds a token for every URL found containing an IP address.
720
721 HTML-INT-IN-URL
722 Adds a token for every URL found containing an integer in its
723 hostname.
724
725 HTML-URLENCODED-URL
726 Adds a token for every URL found containing a % sign in its
727 hostname.
728
729
730 Normally, filters will just cause a token to be added, and these tokens
731 are processed by the normal weighting algorithm. However the GTUBE
732 filter will immediately flag any matching message as spam, bypassing
733 the token matching.
734
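A few of the gibberish filters listed above can be sketched as simple per-word checks. The hyphen and long-word limits come from the descriptions above; the minimum run length for consonants and vowels (min_run), the treatment of "y" as neither, and the dotted token names other than ".GIBBERISH-CONSONANTS." are assumptions. The function name pseudo_tokens is hypothetical.

```python
import re


def pseudo_tokens(word: str, min_run: int = 5):
    """Return the pseudo-tokens a single word would trigger."""
    found = []
    # Assumption: "multiple in a row" means at least `min_run` letters.
    if re.search(r"[bcdfghjklmnpqrstvwxz]{%d,}" % min_run, word, re.I):
        found.append(".GIBBERISH-CONSONANTS.")
    if re.search(r"[aeiou]{%d,}" % min_run, word, re.I):
        found.append(".GIBBERISH-VOWELS.")
    # From the manual: more than three hyphens or underscores.
    if sum(word.count(c) for c in "-_") > 3:
        found.append(".GIBBERISH-HYPHENS.")
    # From the manual: over 30 characters but less than 60.
    if 30 < len(word) < 60:
        found.append(".GIBBERISH-LONGWORDS.")
    return found
```

The returned pseudo-tokens would simply be appended to the message's token list and weighted like any other token.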
735
737 The inbuilt "list" database backend will not necessarily provide the
738 best performance, but is provided because using it requires no external
739 libraries.
740
741 If, when qsf was compiled, the correct libraries were available, then
742 it will be possible to use qsf with alternative database backends. To
743 find out which backends you have available, run qsf -V (capital V) and
744 read the second line of output. To see how well a backend performs,
745 collect some spam and non-spam and use qsf -d BACKEND -B SPAM NONSPAM
746 (see the entry for -B above).
747
748 Some people find that they get the best performance out of the gdbm
749 backend; this is a library that is widely available on many systems.
750
751 To efficiently share a qsf database across multiple machines, you may
752 find the MySQL backend useful. However, using it is a little more com‐
753 plicated.
754
755 To use the MySQL backend you will need to create a table with the
756 fields key1, key2, token, value1, value2 and value3. The token,
757 value1, value2, and value3 fields must be VARCHAR(64), BIGINT or INT,
758 and BIGINT or INT respectively, and indexing on the token field is a
759 good idea. The key1 and key2 fields can be anything, but they must be
760 present.
761
762 For example:
763
    USE mydatabase;
    CREATE TABLE qsfdb (
      key1 BIGINT UNSIGNED NOT NULL,
      key2 BIGINT UNSIGNED NOT NULL,
      token VARCHAR(64) DEFAULT '' NOT NULL,
      value1 INT UNSIGNED NOT NULL,
      value2 INT UNSIGNED NOT NULL,
      value3 INT UNSIGNED NOT NULL,
      PRIMARY KEY (key1,key2,token),
      KEY (key1),
      KEY (key2),
      KEY (token)
    );
777
778 The key1 and key2 fields allow you to have multiple qsf databases in
779 one table, by specifying different key1 and key2 values on invocation.
780
781 Instead of specifying a database file with the --database / -d option,
782 you must specify either a specification string as described below, or
783 the name of a file containing such a string on its first line.
784
785 The specification string is as follows:
786
    database=DATABASE;host=HOST;port=PORT;
    user=USER;pass=PASS;table=TABLE;
    key1=KEY1;key2=KEY2
790
791 This string must be all on one line, with no spaces.
792
793
794 DATABASE
795 is the name of the MySQL database.
796
797 HOST is the hostname of the database server (eg "localhost").
798
799 PORT is the TCP port to connect on (eg 3306).
800
801 USER is the username to connect with.
802
803 PASS is the password to connect with.
804
805 TABLE is the database table to use. If a table with this name does
806 not exist when qsf is called in update or training mode, then it
807 will be created if permissions allow this to be done.
808
809 KEY1 is the value to use for the key1 field.
810
811 KEY2 is the value to use for the key2 field.
812
813
814 Since command lines can be seen in the process list, it is probably
815 best to specify a filename (eg qsf -d mysql:qsfdb.spec) and put the
816 specification string inside that file.
817
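When generating a specification file from a script, it can help to check the string first. The following sketch splits a specification string into its fields and rejects embedded spaces, as required above. It is not qsf's own parser; the function name parse_spec is hypothetical.

```python
def parse_spec(spec: str) -> dict:
    """Split "key=value;key=value" into a dict, rejecting embedded spaces."""
    if any(ch.isspace() for ch in spec):
        raise ValueError("specification string must not contain spaces")
    fields = {}
    for item in spec.split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in {item!r}")
        fields[key] = value
    return fields
```

All values are kept as strings, so a caller that needs the port as a number would convert it explicitly.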
818
820 If you have problems with qsf, please check the list below; if this
821 does not help, go to the qsf home page and investigate the mailing
822 lists, or email the author.
823
824
825 Nothing is being marked as spam.
826 First, use the -r option to switch on the X-Spam-Rating header,
827 and check that this header appears in email passed through qsf.
828 If it does not, then it is likely that qsf is not being run at
829 all - check your configuration of procmail(1) or its equivalent.
830
831
832 If you are seeing X-Spam-Rating headers, and different emails
833 have different scores, then you may simply need to retrain your
834 database a little more. Take more spam email and pass it to qsf
835 -m.
836
837
838 If you are seeing X-Spam-Rating headers but they all give the
839 same spam rating, then the most likely reason is that qsf is not
840 reading any database. Make sure that whatever is processing the
841 email has read permissions on /var/lib/qsfdb and/or ~/.qsfdb -
842 and make sure that, if you are using ~/.qsfdb, what your data‐
843 base creator thought was ~ ($HOME) is the same as it is for
844 whatever is processing the email.
845
846
847 Retraining sometimes takes a very long time.
848 With the obtree backend or 2-column MySQL or SQLite tables, ev‐
849 ery 500th retrain (-m or -M), the database is pruned. On some
850 systems this may take some time, and during this time the data‐
851 base is locked (except when using the MySQL or SQLite backends).
852 If you constantly do a lot of retraining and want to avoid this,
853 then use the -N option to suppress auto-pruning, and then have a
854 cron(8) job or something run a manual prune (qsf -p) every now
855 and again.
856
857
858 Running qsf from procmail fails with an error.
859 If you can run qsf from the command line, but in your procmail
860 log file you get errors about "qsf: cannot execute binary file",
861 then contact your system administrator for help. It may be that
862 incoming email is handled by a different server to the one you
863 normally shell into, and either they are of a different archi‐
864 tecture or operating system, or the mail server is not permitted
865 to execute user-owned binaries.
866
867
869 Written by Andrew Wood, with patches submitted by various other people.
870 Please see the package README for a complete list of contributors.
871
872
874 Report bugs in QSF using the contact form linked from the QSF home
875 page: <http://www.ivarch.com/programs/qsf/>
876
877
879 procmail(1), procmailrc(5), procmailex(5)
880
881 Someone has written a guide to using qsf with KMail that can be found
882 at:
883 http://www.softwaredesign.co.uk/Information.SpamFilters.html
884
885
887 This is free software, distributed under the ARTISTIC 2.0 license.
888
889
890
Linux                            March 2021                            QSF(1)