cleanfeed(8)          Cleanfeed - Because spam sucks          cleanfeed(8)

Cleanfeed - spam filter for Usenet news servers

INN: Installed as filter_innd.pl; the location is configured into INN at
compile time.

Highwind servers: <command line> -program cleanfeed -body

NNTPRelay: ExternalFilter=c:/perl/bin/perl.exe c:/news/cleanfeed.pl

A spam filter for Usenet servers. Cleanfeed blocks spam on the way
into your server, before it is written to disk or propagated to
outbound feeds. It can also block binaries in non-binary newsgroups
and includes several other features to keep your newsfeed clean.

Cleanfeed currently works with INN, Cyclone, Typhoon, Breeze, and
NNTPRelay servers. See my webpage (listed at the end of this document)
for pointers to information about using Cleanfeed with CNews, Diablo,
Collabra, or INN versions earlier than 1.5.1.

For all versions, place the cleanfeed.conf configuration file
somewhere, then edit the Cleanfeed source file and change the
$config_dir option at the top to point to the directory where the
config file lives.

INN
    Install the filter file (called cleanfeed) as filter_innd.pl, and
    cleanfeed.conf, in the location you specified in config.data (INN
    1.7.2 and earlier) or when configuring INN 2.x (usually the
    bin/filter directory under the installation root). Make sure both
    files are readable by the news user. Once in place, the filter is
    loaded with the command ctlinnd reload filter.perl meow. Filtering
    can be turned on with ctlinnd perl y and turned off with ctlinnd
    perl n.
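
The INN steps above can be sketched as a small shell function. This is
only an illustration under stated assumptions: the function name
install_cleanfeed_inn, the 644 file mode, and the example paths are not
part of the Cleanfeed distribution, so substitute the filter directory
your INN was actually configured with.

```shell
# Hypothetical helper: copy the filter and its config into INN's filter
# directory and make both world-readable so the news user can read them.
install_cleanfeed_inn() {
    src_dir=$1      # directory holding cleanfeed and cleanfeed.conf
    filter_dir=$2   # directory INN loads filter_innd.pl from

    cp "$src_dir/cleanfeed" "$filter_dir/filter_innd.pl" || return 1
    cp "$src_dir/cleanfeed.conf" "$filter_dir/cleanfeed.conf" || return 1
    chmod 644 "$filter_dir/filter_innd.pl" "$filter_dir/cleanfeed.conf"
}

# Example (both paths are assumptions):
#   install_cleanfeed_inn /usr/local/src/cleanfeed /usr/local/news/bin/filter
#   ctlinnd reload filter.perl meow
#   ctlinnd perl y
```

The two ctlinnd commands are left as comments because they only make
sense on a host where innd is running.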

Cyclone/Typhoon/Breeze
    Add the -program <file> and -body options to the bin/start script,
    where <file> is the location and name of the Cleanfeed program.
    Restart the server. Cleanfeed will run as an external process
    (standalone mode). IMPORTANT: make sure both cleanfeed and
    cleanfeed.conf are readable by the news user! Double-check the
    permissions, as this is a fairly common mistake!

NNTPRelay
    Find the ExternalFilter directive in config.txt and make it look
    like:

        ExternalFilter=c:/perl/bin/perl.exe c:/news/cleanfeed.pl

    Cleanfeed will run as an external process (standalone mode).

More detailed installation instructions are provided later in this
document.

Configuration is accomplished by setting the various options in the
cleanfeed.conf configuration file. This file is evaluated as Perl
code, so comments can be included in the usual Perl # syntax. A sample
default file is included with the distribution.

If you would rather not use cleanfeed.conf, you can set its location to
"undef" in the source and edit the configuration variables directly in
the source file.

cleanfeed.conf has two sections (which define Perl hashes):
%config_local and %config_append. Entries in %config_local will
override the default settings of the same name in the Cleanfeed source.
Entries in %config_append can be used to add to most of the default
regular expressions, for items such as badguys, bin_allowed,
poison_groups, etc. Settings in %config_append for these items will be
appended to the default regexps, separated by "|" (or).

If you want to completely override the default regexps for these
options, rather than just add to the defaults, you can add an entry for
them into the %config_local section of cleanfeed.conf.

All of this is done quite blindly, so if you do anything odd, be
careful. (Cleanfeed will remove the common mistake of including two
"|" (or) signs in a row.) All config options are exposed to
%config_local, including any that may not be present in the sample
file. Only the defined list of options is exposed to %config_append.
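
The append-and-clean-up behaviour described above can be pictured with
a short shell sketch. It is not Cleanfeed's own code (Cleanfeed does
this in Perl); the function name append_regexp and the sample patterns
are made up for illustration.

```shell
# Join a default pattern and a local addition with "|", then collapse any
# run of "|" into a single one so no empty alternative slips in.
append_regexp() {
    printf '%s|%s\n' "$1" "$2" | sed 's/||*/|/g'
}

append_regexp 'spamdomain|junkmail' 'example-spammer'
```

Run as written, the last line prints
spamdomain|junkmail|example-spammer. A stray "|" at the end of the
default or the start of the addition is absorbed the same way.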

Options that are on/off or yes/no should be set to 1 for on/yes, or 0
for off/no.
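
As a rough picture of what such a file might contain, the shell
function below writes a small sample cleanfeed.conf. The option names
come from this document, but every value, the example group pattern,
and the function name write_sample_conf are assumptions for
illustration; compare against the sample file shipped with the
distribution before using anything like it.

```shell
# Write a hypothetical sample cleanfeed.conf; every value is illustrative.
write_sample_conf() {
    cat > "$1" <<'EOF'
# cleanfeed.conf is evaluated as Perl, so "#" starts a comment.

%config_local = (
    block_binaries => 1,    # on/off options take 1 or 0
    block_html     => 0,
    maxgroups      => 10,   # illustrative value, not a recommended default
);

%config_append = (
    # appended to the built-in regexp, separated by "|"
    poison_groups => '^alt\.example\.spam',
);
EOF
}

# Example: write_sample_conf ./cleanfeed.conf && perl -cw ./cleanfeed.conf
```

The heredoc delimiter is quoted so the shell leaves the Perl sigils and
backslashes untouched.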

First, you need to tell Cleanfeed which news server software you are
using. At the top of the file, set the appropriate variable to 1. For
INN, set $inn; for Cyclone, Typhoon, or Breeze, set $highwind; and for
NNTPRelay, set $nntprelay. Ensure the other two (the ones you're not
using) are set to 0.

General Settings

aggressive
    Set this to 0 to disable all content-based filters. Helpful to
    please paranoid lawyers, or paranoid customers.

active_file
    Set this to the full path to an active file, to allow Cleanfeed
    to know what groups are moderated. This is normally your
    server's active file, but it doesn't have to be; it is
    possible, for example, to run Cyclone with no active file, but
    give one to Cleanfeed anyway.

MD5 Body Filter Settings

do_md5
    When turned on, the MD5 EMP checks will be done. This should
    be left on unless you have a really good reason to turn it off.
    If you're running Hippo along with Cleanfeed, you might feel
    Cleanfeed's MD5 checks are redundant and want to turn them off,
    for example. It would probably be better to leave it on with
    the history turned down, instead.

md5maxmultiposts
    Start rejecting articles after we have seen this many copies,
    according to the MD5 checksum filter.

MD5History
    How many articles to remember for MD5-based EMP comparison.
    Since the MD5 filter is not prone to false positives, setting
    this higher is a good idea to catch more spam, if you have the
    RAM to spare.

MD5maxlife
    When a spam is identified by the MD5 EMP filter, it is saved
    for continual rejection. MD5maxlife specifies how long, in
    hours, to keep a saved MD5 id which is no longer getting any
    hits. (A spam id which is still getting matches will be saved
    regardless of age.) 24 hours works well.

fuzzy_md5
    When turned on, the message bodies will be munged up a bit
    before MD5 checksums are generated. Whitespace and other
    non-alphanumeric characters are stripped and letters are forced to
    lowercase, as well as a couple other bits of treachery to try
    to defeat the "hashbuster" spam-bots. This adds a bit of
    "fuzziness" to the MD5 filter, and results in a performance hit
    as well.

    Since the smarter spammers have discovered hashbusting, I
    recommend that this be turned on.
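
The core of the fuzzy_md5 idea (strip everything that is not a letter
or digit, force lowercase, then checksum) can be sketched in shell as
below. This is only an approximation: it assumes the md5sum utility is
available, the function name fuzzy_md5 is borrowed from the option for
convenience, and the "other bits of treachery" in Cleanfeed's real
filter are not reproduced.

```shell
# Sketch of "fuzzy" hashing: keep only letters and digits, lowercase them,
# then checksum the result. Reads the article body on standard input.
fuzzy_md5() {
    LC_ALL=C tr -cd '[:alnum:]' | LC_ALL=C tr '[:upper:]' '[:lower:]' |
        md5sum | cut -d ' ' -f 1
}

# Two bodies that differ only in case, spacing, and punctuation
# produce the same checksum:
#   printf 'Buy NOW!!!  Cheap   stuff.\n' | fuzzy_md5
#   printf 'buy now cheap stuff' | fuzzy_md5
```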

fuzzy_max_length
    Sets the maximum number of lines for an article body to be
    subject to the fuzzy_md5 munging (above). This keeps extremely
    large articles out of those nasty regular expressions.

md5_skips_followups
    Determines whether the MD5 filter checks articles with
    References headers. The default is to skip them. Setting this
    option to 0 will result in all articles passing through the MD5
    filter, which can result in a major performance hit, but does
    close another hole in the filter. If you turn this off, you
    should increase MD5History as well to avoid shortening your
    "window".

MD5HistSize
    The maximum allowed size of the EMP memory for the MD5-checksum
    EMP filter. Use this as a "sanity check" to prevent a sudden
    burst of spam from eating up all of your memory. It should be
    set high enough so that you normally never hit this number; use
    MD5maxlife to expire the hash instead.

Header-Based EMP Filter Settings

do_phl
    Turns on the NNTP-Posting-Host/Lines EMP filter. This filter
    identifies spam by identical posting-host headers and article
    sizes in a short period of time. You really don't want to turn
    this off.

do_fsl
    Turns on the From/Subject/Lines EMP filter. This filter
    identifies spam by identical From and Subject headers and
    article sizes in a short period of time. This is the one that
    gets the fewest hits these days, so you won't lose much by
    shutting it off.

maxmultiposts
    Start rejecting articles after we have seen this many copies,
    according to the header-based EMP filter. Since false
    positives are somewhat more likely with this filter than with
    MD5, this should be set appropriately higher to reduce the
    odds.

ArticleHistory
    How many ids to remember for header-based EMP comparison.
    Setting this higher will catch more spam because there will be
    a larger "window" to look at. Larger settings will also
    consume more memory and have a (small) impact on performance,
    as well as slightly increase the chance of a false positive
    (since the sample size will be larger). Most articles will
    actually take up two entries in this history because there are
    two different header-based filters.

EMPmaxlife
    Same as MD5maxlife but for the header-based EMP filter.

EMPHistSize
    Same as MD5HistSize but for the header-based EMP filter. If
    you are running the header-based filter but not the MD5 filter
    for whatever reason, set this high.

Excessive Crosspost Settings

maxgroups
    Reject articles crossposted so that followups will be to more
    than this many newsgroups.

low_xpost_maxgroups
    Specify a special, lower crosspost limit for certain groups,
    specified by regular expression in low_xpost_groups (below).
    Useful for being more strict in groups plagued by crossposting,
    such as sex, binaries, and jobs groups. (Replaces the old
    tfjmaxgroups option.)

Misplaced Binaries Filter

block_binaries
    Enables blocking of binary posts in non-binary newsgroups.
    Which newsgroups allow binaries is configured with bin_allowed
    (below).

max_encoded_lines
    Sets the number of uuencoded or base64-encoded lines to allow
    before considering a post to be a binary. This should be set
    high enough to pass regular PGP signatures. (Those satanic
    Netscape crypto-sigs can die along with the other binaries.)
    Default is 15 lines, which may be a little low if you are
    lenient, which you're not.

binaries_in_mod_groups
    If set, binaries are allowed in spite of block_binaries if they
    are posted only to moderated groups (requires active_file).

HTML

block_mime_html
    Enables blocking of MIME-encapsulated HTML posts. This does
    NOT affect straight text/html or multipart/alternative posts of
    the type created by misconfigured Netscape and Microsoft
    "newsreaders", but ONLY posts which are MIME-encapsulated HTML,
    a favorite format of sex spammers which often sneaks in under
    the EMP radar.

block_html
    Enables blocking of HTML and multipart/alternative posts. You
    can specify group patterns where HTML is allowed by setting
    html_allowed (below).

Cancel Message Filtering

block_late_cancels
    If turned on, cancels for recently rejected articles will be
    rejected. Set the window with MIDmaxlife (below). This will
    result in a huge number of rejections if you have multiple full
    feeds and you aren't backlogging. If you are concerned about
    your downstream sites receiving the cancels, leave this off. If
    you need a performance boost, turn it on.

MIDmaxlife
    How long to remember rejected message-ids so cancels for these
    posts can later be rejected. Specified in hours. This only
    has an effect if block_late_cancels is enabled (above).

Disabling Other Filters

do_scoring_filter
    Enables the (new) "scoring" filter. You probably want to leave
    this on, even if you need to turn off aggressive mode (turning
    off aggressive mode will disable the content-based parts of the
    scoring filter).

do_mid_filter (INN only)
    Enables the message-id filter. This requires an additional
    patch to INN 1.7.2, which is included with Cleanfeed (but
    optional). The patch adds a new Perl hook to check message-ids
    during the NNTP CHECK transaction, and decide whether to
    refuse the article. There is a patch for this for INN 2.0
    which may get incorporated into the INN distribution at some
    point. The default is off.

do_bot_checks
    Enables the filters that check for spam bot signatures. The
    only reason you would ever want to turn this off is if you've
    written your own version, or something. Otherwise, leave it
    on.

do_supersedes_filter
    Enables the Excessive Supersedes filter, to catch rogue
    Supersedes attacks. This filter begins dropping articles with
    Supersedes headers if too many appear from the same
    posting-host in a short time. Moderated groups are given a higher
    limit (if active_file is set), as is news.answers. Default is
    on.

check_supersedes_path
    If set, bad_cancel_paths will also be applied to Supersedes
    articles. Articles with Supersedes headers, where a path
    element matches the regexp in bad_cancel_paths, will be
    dropped. Default is on.

drop_useless_controls
    If set, all control messages of types sendsys, senduuname, and
    version will be dropped. These are no longer useful and are a
    hole for denial-of-service attacks due to the way INN and some
    other servers handle them. On by default.

drop_ihave_sendme
    If set, control messages of types ihave and sendme will be
    dropped. See drop_useless_controls. If you use these types of
    control messages, turn this off. If you're not sure, then
    you're not using them.

drop_control_with_supersedes
    Drops any and all control messages which contain a Supersedes
    header. Since control messages are not passed through the same
    filters as regular messages, a rogue Supersedes attack can use
    control messages to avoid filtering; this option closes this
    hole. Legitimate control messages don't have Supersedes
    headers. On by default.

Hash-Trimming

trimcycles
    The EMP memories are trimmed every trimcycles times through the
    filter.

EMPstarttrimming
    Tells the filter not to waste time trimming the EMP memories
    until they have this many entries. Just a minor performance
    enhancement during the first hours the filter is running or
    when you first start innd.

Logging

verbose
    When turned on, verbose logging to news.notice will happen;
    spam domains will be listed, etc. When off, only general
    messages will be logged, making the news.daily summaries less
    interesting but much shorter and more to the point. (There is,
    alas, no way to shut off news.notice logging entirely.)
    (news.notice only applies to INN.) Note that this will not
    reduce the number of log entries, but only their verbosity.

logfile (Standalone Mode)
    If set to the path to a file, this will enable logging of
    message-ids of all articles processed by the filter.
    Rejections will be logged with the reason for rejection. Note
    that this will create a very large logfile which you will need
    to rotate or delete (see max_log_size, below).

reportfile (Standalone Mode)
    If set to the path to a file, this will enable generation of a
    simple report of articles accepted and rejected. The report
    file will contain one entry per line with the start time, end
    time, number of articles accepted, and number of articles
    rejected, tab-separated.
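
Because the report file is plain tab-separated text, totals are easy to
pull out with standard tools. The sketch below assumes only the column
order given above (accepted in the third field, rejected in the
fourth); the function name summarize_report and the output format are
made up for illustration.

```shell
# Sum the accepted (field 3) and rejected (field 4) counts of a report file.
summarize_report() {
    awk -F '\t' '
        { accepted += $3; rejected += $4 }
        END { printf "accepted=%d rejected=%d\n", accepted, rejected }
    ' "$1"
}

# Example (path is an assumption): summarize_report /var/log/news/cleanfeed.report
```

The time columns are never parsed, so the sketch does not depend on how
Cleanfeed formats them.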

log_accepts (Standalone Mode)
    When using the above logfiles, this setting determines whether
    articles accepted should be logged. When disabled, only
    rejections will be logged.

max_log_size (Standalone Mode)
    The size at which to rotate the logfile. This will be replaced
    by time-based rotation at some point.

statfile
    If this is set to the full path of a file, a crude stats file
    will be written each time the filter is reloaded with ctlinnd
    reload filter.perl meow (for INN) or whenever the Cleanfeed
    process receives a SIGUSR1 (for standalone mode). The file
    shows how many entries are present in each of the EMP
    histories, MID history and excessive supersedes history; timer
    information if enabled (see timer_info); and the contents of
    all configuration settings. Posting-hosts for each
    supersedes entry will be listed, along with their counts; these
    are not being rejected unless they are over the threshold. The
    default for this is undef, which disables creation of the stat
    file.

    More comprehensive stats are planned for the future.

Timing Info

timer_info
    When enabled, Cleanfeed will generate timing statistics telling
    you how many articles per second are being examined by the
    filter and being accepted by the filter. This information will
    appear in the statfile if this is enabled, and in the output of
    INN's ctlinnd mode if the mode.patch is applied to INN. Note
    that the accepted/second rate is not necessarily the rate at
    which your server is accepting articles; articles can be
    rejected by the server after Cleanfeed passes them, for example
    if they are posted to groups not in your active file.

timer_interval
    The period over which to average timing information, in
    seconds. The default is 600 seconds, or 10 minutes.

Debugging

debug_batch_directory
    Specifies a directory where debugging "batchfiles" can be
    written. See the Hacker's Guide in this document for more
    information.

debug_batch_size
    The maximum size of a debugging batchfile before it gets
    rotated. Rotation is done by renaming the file to file.1,
    file.2, etc., using the lowest number that doesn't already
    exist.
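
The "lowest number that doesn't already exist" rule can be sketched as
follows. This is an illustration of the naming scheme only, not
Cleanfeed's rotation code, and the function name next_rotation_name is
hypothetical.

```shell
# Print the rotation target for a file: file.1, file.2, ... using the
# lowest numeric suffix that is not already taken.
next_rotation_name() {
    n=1
    while [ -e "$1.$n" ]; do
        n=$((n + 1))
    done
    printf '%s\n' "$1.$n"
}

# Example: mv batchfile "$(next_rotation_name batchfile)"
```

Note that gaps are filled first: if file.1 and file.3 exist, the next
rotation goes to file.2.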

Regular Expressions

You can add to most of these regular expressions in the %config_append
section of cleanfeed.conf; settings you add there will be added to the
defaults, rather than overriding them. If you want to completely
override the default settings you can add entries for these to the
%config_local section instead.

bin_allowed
    This is a regular expression telling the anti-binary filter in
    which newsgroups binaries are allowed. If all groups in the
    Newsgroups header match this pattern, binaries are allowed
    through the filter. (This obviously has no effect when the
    binary filter is disabled.) If the binary filter is enabled
    and this is set to a null string (by overriding the default in
    the local config) the result will be blocking all binaries
    regardless of where they are posted.

poison_groups
    If any groups in the Newsgroups header match this regexp, the
    article will be rejected. Thus you can reject crossposts to
    certain groups even if they are also posted to groups you
    carry.
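
The difference between the two rules (bin_allowed needs every group to
match, poison_groups needs only one) can be sketched in shell. This
uses grep -E as a stand-in for Perl regular expressions, which is an
assumption: the two dialects differ in places. The function names and
the sample groups are hypothetical.

```shell
# $1 = extended regular expression, $2 = comma-separated Newsgroups value.

# Succeeds only if every listed group matches (the bin_allowed rule).
all_groups_match() {
    ! printf '%s\n' "$2" | tr ',' '\n' | grep -Evq -e "$1"
}

# Succeeds if at least one listed group matches (the poison_groups rule).
any_group_matches() {
    printf '%s\n' "$2" | tr ',' '\n' | grep -Eq -e "$1"
}

# Example:
#   all_groups_match '^alt\.binaries\.' 'alt.binaries.pictures,alt.test'
#   fails, because alt.test does not match.
```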

html_allowed
    This is a regular expression telling the anti-HTML filter in
    which newsgroups HTML and multipart/alternative posts are
    allowed. This only has an effect if block_html is turned on
    (above). The default (to allow HTML in microsoft.* groups) can
    be added to in cleanfeed.conf.

    If you don't want to allow HTML anywhere, not even the
    microsoft.* groups, override this setting in the local
    configuration and set it to a null string or undef.

md5exclude
    If an article is posted only to groups matching this regexp,
    the MD5 EMP filter will not be applied. Useful for "test"
    groups where it's okay for lots of the posts to be the same.

allexclude
    If an article is posted only to groups matching this regexp, NO
    checks are applied at all.

low_xpost_groups
    If a group matches this regular expression, it gets a special
    crosspost limit, set in low_xpost_maxgroups, rather than the
    general crosspost limit set in maxgroups. This is useful for
    groups plagued by excessive crossposting, such as sex,
    binaries, and jobs groups. The default is to limit crossposts
    to 6 groups in test, forsale, and jobs groups. Setting this to
    a null string, or undef, will disable this feature.

badguys
    This is a monster regular expression containing domains of
    known spammers. Only the "middle" part of each domain is
    listed; these are checked as email addresses in From headers by
    appending a list of top-level domains to the end, and as URLs
    by prepending http:// and an optional "www.". If you modify
    this list, be very careful not to end up with "||" in there
    (two "or" signs in a row); this will match every single post
    that comes through, which is Bad.

baddomainpat
    If a post contains a URL for a site whose domain name matches
    this pattern (in .com, .net, and .nu TLDs only) the post will
    be rejected. For example, there are hundreds of spamming porn
    sites whose domain names begin or end with "xxx". This
    prevents us from having to keep up with their nonsense. Yes,
    it's a little aggressive, but it works.

exempt
    Regular expression of NNTP-Posting-Hosts that are exempt from
    the posting-host-based EMP filter. This is for high-output
    systems where all posts contain the same NNTP-Posting-Host
    header, such as AOL, which if not exempted would end up hitting
    the posting-host EMP filter with all of their posts. There
    aren't many of these out there; a "regular" multi-user system
    does not present a problem because the filter doesn't kick in
    until it sees a large number of posts from the same
    posting-host and also of the same length, in a short period of time.

supersedes_exempt
    Regular expression of NNTP-Posting-Hosts that are exempt from
    the excessive supersedes filter. Generally this will be
    systems which post a lot of FAQs.

bad_cancel_paths
    Cancel messages will be rejected if the Path header contains
    elements matching this regular expression. Also applied to the
    NNTP-Posting-Host. If check_supersedes_path is set, this will
    also be checked against the Path header of articles with
    Supersedes headers. This list contains sites which are or have
    recently been the source of rogue cancel attacks.

refuse_messageids (INN only)
    If you have do_mid_filter (above) enabled, and you have the
    optional message-id patch applied to INN (or otherwise have
    obtained the hook for filter_messageid in INN 2.0), this
    regular expression will be applied to message-ids as they are
    offered to your server, and they will be refused if it matches.

net_abuse_groups

spam_report_groups
    These regular expressions are used to exempt certain groups
    from certain filters; for example, groups expected to contain
    spam reports, example spams, NoCeM notices, etc. These are not
    in cleanfeed.conf; if you need to add to them please let me
    know.

After modifying the filter file, always check for mistakes by typing:

    perl -cw filter_innd.pl (or cleanfeed or whatever you called it)

There should be no errors and no warnings.

You can check cleanfeed.conf with:

    perl -cw cleanfeed.conf

You will get several warnings about variables being used only once;
these can be ignored.

If you are running INN, you can modify the file and reload it with
ctlinnd reload filter.perl meow while the server is running. The
configuration in cleanfeed.conf will be reloaded at this time as
well.

With the Highwind servers, modifying the program will require a server
restart (use the bin/restart script). Note that this will result in
all connections (including newsreader clients) being dropped. This is
not my fault. :)

When in standalone mode, configuration from cleanfeed.conf can be
reloaded by sending Cleanfeed a SIGHUP.
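
Sending the signals mentioned in this document (SIGHUP to reload the
configuration, SIGUSR1 to write the statfile) might be wrapped like
this. It is a sketch only: how you find the process id of the
standalone Cleanfeed process depends on your system, and the function
name cleanfeed_signal is hypothetical.

```shell
# Send a signal to a running Cleanfeed process, refusing to act if the
# process id does not refer to a live process.
cleanfeed_signal() {
    sig=$1
    pid=$2
    if ! kill -0 "$pid" 2>/dev/null; then
        printf 'no such process: %s\n' "$pid" >&2
        return 1
    fi
    kill -s "$sig" "$pid"
}

# Examples, with $pid holding the Cleanfeed process id:
#   cleanfeed_signal HUP "$pid"    # reload cleanfeed.conf
#   cleanfeed_signal USR1 "$pid"   # write the statfile
```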

I have no idea what NNTPRelay does, but I'm guessing it needs a restart
as well.

IMPORTANT NOTE: A common mistake is not setting file permissions on
cleanfeed/filter_innd.pl, cleanfeed.conf, and cleanfeed.local so that
they are readable by the news user. Please double-check your
permissions! If Cleanfeed is running, and fails to successfully load
cleanfeed.conf, it will use the default settings instead of those you
specified in the config file.
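
One way to double-check is a small helper like the one below, run as
the news user (for example through su or sudo, depending on your
system). It is a sketch under assumptions: the function name
check_readable is hypothetical, and it tests readability for whichever
user runs it.

```shell
# Print every path the current user cannot read; fail if there is any.
check_readable() {
    rc=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            printf 'not readable: %s\n' "$f"
            rc=1
        fi
    done
    return "$rc"
}

# Example (paths are assumptions):
#   check_readable /usr/local/news/bin/filter/filter_innd.pl \
#                  /usr/local/news/bin/filter/cleanfeed.conf
```

A missing file is reported the same way as an unreadable one, which
also catches a config file installed in the wrong directory.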

These instructions assume you have the Perl hooks compiled into INN.
If you don't, you will need to add them and rebuild the INN
distribution before proceeding.

With INN, Perl is embedded into the innd program. The filter file
defines subroutines that are called by innd at the appropriate times.

SYSTEM REQUIREMENTS

In order to run Cleanfeed with INN, you will need:

· INN 1.5.1 or later (1.7.2+insync1.1d or 2.1 recommended)

· Perl 5.004 or later

· Perl hooks compiled into INN

· The MD5 Perl module

INN is available from:
    http://www.isc.org/inn.html

The Insync distribution of INN (highly recommended if you aren't
running INN 2.1) is available from:
    http://www.insync.net/~aos/inn.html

The MD5 Perl module is available from:
    http://www.perl.com/CPAN-local/modules/by-module/MD5/

Perl itself is available from the Perl home page:
    http://www.perl.com/

PATCHES AND STUFF

INN 2.0 includes everything you need to run Cleanfeed, except the MD5
Perl module.

With earlier versions, Cleanfeed requires some patches to INN in order
to function properly.

If you are running INN 1.7.2+insync1.1d, you already have the original
filter.patch and the dynamic-load.patch; you need only apply the
upgrade.patch.

None of these patches are against INN 2.1; the "extra feature" ones
like mode.patch may not apply to 2.1. Ports are always welcome.

filter.patch
    This patch provides the basic functionality for Cleanfeed by making
    some extra headers available to the Perl filter, as well as message
    bodies. This patch was changed in version 0.95.3. It is against
    INN 1.7.2 and should be applied in the innd directory. This patch
    is included in the insync "megapatch" for INN as of version 1.1c,
    so if you are running this version of INN you need not apply this
    patch. Not necessary for INN 2.x.

dynamic-load.patch
    This patch enables INN's Perl interpreter to load dynamic modules.
    It is necessary for MD5 support. The patch is against INN
    1.7+insync and should be applied in the lib directory (NOT the innd
    directory). It applies cleanly to other versions of INN including
    1.5.1 and 1.7.2. This patch is included in the insync "megapatch"
    for INN as of version 1.1d, so if you are running this version of
    INN you need not apply this patch. Not necessary for INN 2.x.

    If you are still using INN 1.5.1, you can use dynamic-1.5.1.patch
    instead.

    In order to compile INN with the new patch, you need to edit the
    PERL_LIB entry in config.data. Type this command at the shell, and
    paste its output into config.data as PERL_LIB:

        perl -MExtUtils::Embed -e ldopts

    Most systems also allow you to simply enter that line in backquotes
    as PERL_LIB.

    This patch requires Perl 5.004 or later! INN will not compile
    linked with Perl 5.003 after following these instructions!

    AIX: There is a problem with Perl dynamic loading from INN under
    the AIX operating system. In simple terms, it doesn't work. This
    seems to be a problem with the gcc compiler. Success has been
    reported by rebuilding both Perl and INN with IBM's commercial
    compiler CSet (a.k.a. xlC).

    Solaris: There have been multiple reports of Cleanfeed not working
    under Solaris if any part of the system (INN, Perl, or the MD5
    module) is compiled using egcs. Success has been reported by
    recompiling everything with gcc, and by upgrading to the very
    newest egcs.

upgrade.patch
    For current users of Cleanfeed, this is a patch for an
    already-patched INN, or for 1.7.2+insync1.1d, to bring you up to
    the new version of the Cleanfeed patch. Not applying this patch
    right now will only lose you a couple of filters, and nothing will
    break if you don't apply it (no changes to the filter source or
    configuration will be required).

messageid.patch
    This is a patch which adds a new Perl hook to innd,
    filter_messageid. This allows you to run a Perl subroutine against
    each message-id as it is offered to your server, and decide whether
    to refuse the article before it is even sent to your server.
    Cleanfeed includes a small filter_messageid. This patch is
    entirely optional.

mode.patch
    This patch adds a line to INN's ctlinnd mode output for Perl filter
    status. The output line is generated by the filter_stats
    subroutine. The default output contains the number of articles
    accepted, rejected and refused since the filter started, and the
    sizes of the EMP, Message-ID, and Excessive Supersedes hashes. If
    timer_info is enabled, this will also include the rate in articles
    per second (rounded to the nearest tenth) at which articles were
    examined (total sent through the filter) and accepted by the
    filter, averaged over the timer_interval number of seconds.

After applying the patches, rebuild all of INN and do a "make update".
The first patch (filter.patch) only requires innd to be rebuilt, but
the dynamic-load.patch requires you to rebuild the whole distribution.
Current users upgrading with upgrade.patch need only rebuild innd and
reinstall that executable.

Thus:

    cd inn                              # the top-level source directory
    make clean
    cd innd
    cp wherever/filter.patch .          # from the Cleanfeed distribution
    patch <filter.patch
    cd ../lib
    cp wherever/dynamic-load.patch .    # from the Cleanfeed distribution
    patch <dynamic-load.patch
    cd ../config
    emacs config.data                   # edit the PERL_LIB entry as above
    cd ..                               # back to the top-level source directory
    make all
    make update
703
704 Finally, you need to install the MD5 Perl module, no matter what
705 version of INN you are running.

INSTALLING CLEANFEED - INN

In INN 1.7.2 and earlier, the location where INN looks for the Perl
filter is set in config.data, as _PATH_PERL_FILTER_INND. By default,
the filename is filter_innd.pl. The Cleanfeed filter program file
should be installed in this location. INN comes with an example
filter_innd.pl file; move this file (or whatever other filter is in
place) out of the way first.

Before putting the filter in place, edit the file, changing $config_dir
to the location of your cleanfeed.conf file.

After editing the file, always check for errors with the command:

    perl -cw filter_innd.pl

Once the file is in place, tell innd to reload it:

    ctlinnd reload filter.perl meow

And, if Perl filtering is currently disabled, enable it:

    ctlinnd perl y

Now, you can watch it working by looking at your news.notice log:

    tail -f /var/log/news/news.notice

If your server is running a full feed, you should start seeing a
constant stream of rejections almost immediately.

The various Highwind server packages (Cyclone, Typhoon, and Breeze) all
have the same external filter interface. The filter runs as its own
process, reading from standard input and writing to standard output.

SYSTEM REQUIREMENTS

In order to run Cleanfeed with a Highwind server, you will need:

· Cyclone, Typhoon or Breeze

· Perl 5.003 or later

· The MD5 Perl module

The Highwind servers are commercial products. For more information:
http://www.highwind.com/

The MD5 Perl module is available from:
http://www.perl.com/CPAN-local/modules/by-module/MD5/

Perl itself is available from the Perl home page:
http://www.perl.com/

INSTALLING CLEANFEED - HIGHWIND

The Cleanfeed program file should be installed as "cleanfeed" in your
news server's bin directory (cyclone/bin, etc). Make it owned by
news:news and make it executable.

Before putting the filter in place, edit the file, changing $config_dir
to the location of your cleanfeed.conf file. Also ensure that the
shebang line (the first line of the file, starting with #!) points to
the correct location of your perl executable.

After editing the file, always check for errors with the command:

    perl -cw cleanfeed

There should be no warnings.

Now, edit your bin/start script. You need to add two options to the
command line that starts up the server process: the -program option to
tell it what program to use as a filter, and the -body option to tell
it to send the bodies as well as the headers.

    typhoond -program /typhoon/bin/cleanfeed -body

...along with whatever else you have cluttering up the command line.

(Highwind has indicated that this may/will be a config file option in a
future release.)

Now you can restart the server with the bin/restart script. Check to
make sure Cleanfeed is running, with "ps -ef" or "top". If
Cyclone/Typhoon is unable to start the filter for some reason, it will
log an error via syslog. The error will not be terribly helpful.

You can make Cleanfeed reload its configuration from cleanfeed.conf and
local code from cleanfeed.local by sending it a SIGHUP.

Please note that I do not have an NNTPRelay server, nor access to one,
nor much interest in mucking around with Windows NT, and thus I have
not tested the NNTPRelay filtering support myself. The necessary
changes and notes were contributed by someone else. Additions and
improvements to this documentation would be most welcome.

The filter interface in NNTPRelay is pretty much the same as in the
Highwind servers.

SYSTEM REQUIREMENTS

In order to run Cleanfeed with NNTPRelay, you will need:

· NNTPRelay version 1.1b4 or later

· Perl 5.003 or later

· The MD5 Perl module

NNTPRelay is available from:
http://nntprelay.maxwell.syr.edu/

An NT binary release of Perl 5.004, which apparently includes the MD5
module, can be found at:
http://www.perl.com/CPAN/ports/win32/Standard/x86

The MD5 module (in source code) can be found at:
http://www.perl.com/CPAN-local/modules/by-module/MD5/

INSTALLING CLEANFEED - NNTPRELAY

Before putting the filter in place, edit the file, changing $config_dir
to the location of your cleanfeed.conf file.

Install the Cleanfeed program file wherever is appropriate on your
system, as "cleanfeed.pl". Edit NNTPRelay's config.txt file, adding an
entry like this:

    ExternalFilter=c:/perl/bin/perl.exe c:/news/cleanfeed.pl

Of course, use the correct path to your Perl executable and to the
Cleanfeed program file. Now restart NNTPRelay. If you defined a
logfile in Cleanfeed, it should appear.

Cleanfeed will look for a file called cleanfeed.local, in the same
directory as cleanfeed.conf. If this file exists, it will be loaded
and evaluated as Perl code right after the config file. This enables
you to provide your own local filter code which will survive an upgrade
of the main Cleanfeed source.

It will be reloaded when the filter is reloaded with ctlinnd reload
filter.perl meow (for INN), or when configuration is reloaded with a
SIGHUP (in standalone mode). This means that you can modify the
running code without restarting Cleanfeed.

cleanfeed.local can define a number of different subroutines, which, if
defined, will be called at various points in the filter process. Other
subroutines can, of course, be defined as required by your code.

The file is simply re-evaluated each time. So, if you remove a
subroutine from the file completely, that subroutine will remain
defined after the reload, because nothing replaced it. You will need
instead to define it as an empty subroutine, or explicitly undef it,
to make it go away.
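The reload behaviour described above can be reproduced outside Cleanfeed. The following sketch uses a hypothetical load_local helper that evaluates a string of Perl code as a stand-in for re-reading cleanfeed.local; it is not how Cleanfeed itself loads the file. It shows a subroutine surviving a reload that no longer mentions it, then the two ways of getting rid of it.

```perl
use strict;
use warnings;

# Hypothetical stand-in for re-reading cleanfeed.local: evaluate Perl code
# held in a string.
sub load_local {
    my ($code) = @_;
    no warnings 'redefine';    # redefinition is expected on a reload
    eval $code;
    die "local code failed to load: $@" if $@;
}

# First load: the file defines a hook.
load_local(q{ sub local_filter_last { return "rejected by old rule" } });

# Reload after the subroutine was deleted from the file: nothing replaces
# the old definition, so it is still there.
load_local(q{ 1; });
my $still_defined = defined &local_filter_last ? 1 : 0;

# Option 1: keep a stub in the file that accepts every article.
load_local(q{ sub local_filter_last { return "" } });
my $result_after_stub = local_filter_last();

# Option 2: explicitly undefine the subroutine.
undef &local_filter_last;
my $defined_after_undef = defined &local_filter_last ? 1 : 0;

print "still defined after removal from file: $still_defined\n";
print "defined after undef: $defined_after_undef\n";
```

In a real cleanfeed.local, the stub from option 1 or the "undef &name;" line from option 2 would stay in the file for at least one reload.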

STUFF YOU CAN DEFINE

Cleanfeed will call the following subroutines, if they are defined.
See the section on return values for instructions on what your code
should return.

local_config
This is called after configuration is loaded, each time. It will
be called when the filter is reloaded (with INN) or when
configuration is reloaded with SIGHUP (running standalone), as well
as when the filter is first run. No return value is expected.

local_filter_before_emp
Called for each (non-control) article, before any other filters.
General-purpose spam filters shouldn't go here, because you really
want to populate the EMP hashes first.

local_filter_after_emp
Called for each (non-control) article, after the EMP filters but
before any other filters.

local_filter_middle
Called for each (non-control) article, after the "simple" filters
but before the "expensive" body checks.

local_filter_scoring
Called during the scoring filter. Return the value, positive or
negative, by which to adjust the article's score.

Warning: Here there be dragons! If you're going to play with this
please examine the existing source, and use the debugging routines
to watch what you're doing.

local_filter_last
Called for each (non-control) article, after all other filters are
done.

local_filter_cancel
Called for all cancel control messages.

local_filter_newrmgroup
Called for all newgroup and rmgroup control messages.

RETURN VALUES

The general filtering subroutines you can define
(local_filter_before_emp, local_filter_after_emp, local_filter_middle,
local_filter_last, local_filter_cancel, and local_filter_newrmgroup)
are expected to return a value indicating whether you want to accept
the article being examined. If the article is okay, you should return
"" (empty string), in which case filtering will proceed as usual. If
you want to reject the article, you return any other string, which will
be used as the reason.

The rejection code actually expects two return values -- the first
string is the "verbose" rejection message, and the second is the
"non-verbose" message (see the verbose configuration option). If only
one is supplied, it will be used for both purposes.
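As an illustration of these return values, here is a minimal sketch of a local_filter_middle. The %hdr contents are made-up sample data declared only so the example runs on its own; inside cleanfeed.local, Cleanfeed supplies that hash. The subject pattern and the reason strings are invented for this example.

```perl
use strict;
use warnings;

# Stand-in for the data Cleanfeed supplies (sample values only).
our %hdr = (
    'Subject'    => 'MAKE MONEY FAST',
    'Newsgroups' => 'misc.test',
    '__BODY__'   => "Send one dollar to each name on the list.\n",
);

sub local_filter_middle {
    if ($hdr{'Subject'} =~ /\bmake money fast\b/i) {
        # Reject: verbose reason first, non-verbose reason second.
        return ("Local MMF filter (subject: $hdr{'Subject'})",
                "Local MMF filter");
    }
    return "";    # accept; the remaining filters run as usual
}

my ($verbose_reason, $short_reason) = local_filter_middle();
print "verbose: $verbose_reason\n";
print "short:   $short_reason\n";
```

Returning a single string instead of the two-element list would use that string for both the verbose and the non-verbose message.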

The scoring filter calls local_filter_scoring, which is expected to
return the value, positive or negative, by which the article's score
should be adjusted.
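A scoring hook might look something like the sketch below. The weights and patterns are invented for illustration and say nothing about the thresholds Cleanfeed's scoring filter actually uses, so read the source before relying on any particular numbers. The variables are declared with sample data only so the example is self-contained.

```perl
use strict;
use warnings;

# Stand-ins for data Cleanfeed supplies (sample values only).
our (%lch, @groups);
%lch    = (subject => 'free stuff inside', organization => 'example university');
@groups = ('misc.test', 'alt.test');

sub local_filter_scoring {
    my $adjust = 0;
    # Hypothetical weights: raise the score for spammy signs, lower it
    # for a trusted organization.
    $adjust += 2 if ($lch{'subject'} || '') =~ /\bfree\b/;
    $adjust += 1 if @groups > 5;
    $adjust -= 3 if ($lch{'organization'} || '') =~ /example university/;
    return $adjust;
}

print "score adjustment: ", local_filter_scoring(), "\n";
```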

WHAT YOU GET

Your subroutines get information about the article in several
variables.

%hdr
A hash containing the article headers. The key is the header name,
in "canonical" case as INN likes them; the value is the content of
the header. When running under INN, only headers known to INN will
be included in the hash (which includes any header used anywhere in
Cleanfeed). In standalone mode, all headers will be present, but
only the known headers will be sent in canonical case; others will
have the header name (and thus hash key) in whatever case they are
in the article itself, making them difficult to find and use
consistently.

The message body is in this hash under the key __BODY__. If
running INN 2.x with storageapi, it will be provided in wireformat,
with lines terminated in \r\n rather than just \n. With the
traditional spool format (and in all cases with INN prior to 2.x)
lines will be terminated only with \n.

Examples:

To get the Subject header as a scalar: $hdr{'Subject'}

To get the entire message body as a scalar: $hdr{'__BODY__'}

%lch
A hash containing lowercased versions of some of the article
headers. The hash keys are the header names in all lowercase; the
values are the contents of the headers, with all letters forced to
lowercase.

Currently, the only headers added to this hash are From,
Organization, Subject, Content-Type, X-Newsreader, X-Newsposter,
Message-ID, and Sender.

This hash is not available to local_filter_before_emp.

@groups
An array containing the newsgroups the article is posted to (from
the Newsgroups header). You can find out how many groups the
article is crossposted to with "scalar @groups".

@followups
An array containing the newsgroups to which followups are set (from
the Followup-To header). If the article has no Followup-To header,
this array will be identical to @groups. You can find out how many
groups followups are set to with "scalar @followups". This is the
preferred way to limit crossposting, because limiting only by the
Newsgroups header will catch FAQs and such.

$lines
The number of lines in the message body. This is not taken from
the Lines header as that can be client-supplied to fool filtering;
this is determined by counting the lines in the message body.

%gr
A hash containing information about the groups the article is
posted to. This isn't very straightforward and may not be useful
to you, but I'm including it in this documentation for
completeness. The following entries may be present in this hash:

$gr{'net'} - the number of net.* (Usenet II) newsgroups the article
is posted to, if any.

$gr{'other'} - the number of non-net.* groups the article is posted
to.

$gr{'md5skip'} - true if the article should be exempted from the
MD5 body checks (if all newsgroups match the regexp in md5exclude).

$gr{'binary'} - true if the article is posted only to groups where
binaries are allowed (if all newsgroups match bin_allowed).

$gr{'html'} - true if the article is posted only to groups where
HTML is allowed (if all newsgroups match html_allowed).

$gr{'poison'} - number of 'poison' newsgroups this article is
posted to (matching poison_groups). If this is present, you'll
only see this entry in local_filter_before_emp and
local_filter_after_emp because it will be rejected after that.

$gr{'abuse'} - number of 'net abuse' newsgroups this article is
posted to (matching net_abuse_groups).

$gr{'reports'} - number of 'spam reports' newsgroups this article
is posted to (matching spam_report_groups).

$gr{'low_xpost'} - number of 'low crosspost limit' groups this
article is posted to (matching low_xpost_groups).

$gr{'mod'} - number of moderated groups this article is posted to
(requires that Cleanfeed have an active file).

$gr{'allmod'} - true if this article is posted only to moderated
groups.

$gr{'faq'} - true if this article is crossposted to news.answers.

%config
A hash containing all configuration options.
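The variables above can be combined in a local filter along the following lines. This is only a sketch: the sample data, the $max_followups limit, and the "bare URL" rule are all invented for illustration, and the variables are declared here solely so the code runs outside Cleanfeed. It limits crossposting by @followups, exempts FAQ crossposts via $gr{'faq'}, and normalizes a wireformat body before matching against it.

```perl
use strict;
use warnings;

# Stand-ins for the variables Cleanfeed supplies (sample values only).
our (%hdr, %lch, @groups, @followups, $lines, %gr);
%hdr = (
    'From'     => 'Someone <someone@example.invalid>',
    'Subject'  => 'Look At This',
    '__BODY__' => "http://spam.example.invalid/\r\n",   # wireformat body
);
%lch       = (from => lc $hdr{'From'}, subject => lc $hdr{'Subject'});
@groups    = ('misc.test', 'alt.test');
@followups = @groups;          # article has no Followup-To header
$lines     = 1;
%gr        = (other => 2);

my $max_followups = 6;         # hypothetical local limit

sub local_filter_after_emp {
    # Limit crossposting by followups, leaving FAQ crossposts alone.
    if (@followups > $max_followups && !$gr{'faq'}) {
        my $n = scalar @followups;
        return ("Local: followups set to $n groups",
                "Local: excessive followups");
    }

    # Work on a copy of the body with \r\n reduced to \n, so the same
    # pattern works with both storage formats.
    (my $body = $hdr{'__BODY__'}) =~ s/\r\n/\n/g;

    if ($lines <= 2 && $body =~ m{\Ahttp://\S+\n\z}) {
        return ("Local: body is a bare URL ($lch{'from'})",
                "Local: bare URL");
    }
    return "";
}

my ($reason) = local_filter_after_emp();
print "result: ", ($reason eq "" ? "accepted" : $reason), "\n";
```

Matching against a copy leaves $hdr{'__BODY__'} untouched for the filters that run afterwards.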

DEBUGGING

When you make filtering changes, you should always check the results
for false positives. I've provided two subroutines to help you do
this: writeheaders() and writefull().

First, make sure debug_batch_directory is set in your configuration.
Set this to a directory that is writable by the news user.

Call either of these subroutines with one argument, the basename of the
batch file you want to write the current article to. writeheaders will
dump the article's headers out to the file (with INN this will only
give you the known headers). writefull will dump the full article,
headers (again, known headers with INN) and body. The file will be
rotated if it becomes larger than debug_batch_size, set in your
configuration. The rotation is simple: a number is appended to the end
of the filename, and incremented until the resulting filename does not
exist. You'll have to delete the old files yourself.

When testing a new filter, simply call writeheaders ("batchfile") or
writefull ("batchfile") when you're going to reject an article. Then
you can look at the file to make sure you're doing what you think
you're doing.
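Put together, a filter under test might call writeheaders just before rejecting, as in this sketch. The writeheaders defined here is only a stub so the example runs outside Cleanfeed; it records the call instead of writing a batch file, and it is skipped when the real routine already exists. The subject pattern, the batch name "local-watches", and the sample %hdr are invented for illustration.

```perl
use strict;
use warnings;

# Stand-in for the data Cleanfeed supplies (sample values only).
our %hdr = ('Subject' => 'Cheap watches', '__BODY__' => "Buy now\n");
our @debug_calls;    # what the stub below was asked to write

# STUB for running outside Cleanfeed only: the real writeheaders dumps
# the article's headers to a batch file in debug_batch_directory.
unless (defined &writeheaders) {
    *writeheaders = sub { push @debug_calls, "headers -> $_[0]" };
}

sub local_filter_last {
    if ($hdr{'Subject'} =~ /\bcheap watches\b/i) {
        writeheaders("local-watches");    # keep a copy for later review
        return "Local watches filter";
    }
    return "";
}

my $verdict = local_filter_last();
print "verdict: ", ($verdict eq "" ? "accepted" : $verdict), "\n";
print "debug:   $_\n" for @debug_calls;
```

Giving each new filter its own batch basename keeps its rejections in a separate file, which makes false positives easier to spot.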

When running under Cyclone, Typhoon, Breeze, or NNTPRelay (standalone
mode), Cleanfeed will catch SIGHUP, and reload its configuration from
cleanfeed.conf. It will also reload and reevaluate cleanfeed.local if
you're using it. Note that, unlike INN, there is no way to reload the
filter code itself without restarting the server.

Cleanfeed in standalone mode will also catch SIGUSR1 and write its
crude current-status file (see statfile in the config section) on the
next cycle through the filter.

(I honestly don't know if SIGUSR1 and SIGHUP are things which exist on
NT for NNTPRelay.)
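For readers unfamiliar with this style of signal handling, the sketch below shows one common way a standalone Perl program defers work to its next processing cycle: the handlers only set flags, and the main loop acts on them. It is not taken from the Cleanfeed source; the names process_cycle, $reloads, and $status_writes are hypothetical, and it needs a platform that supports SIGHUP and SIGUSR1.

```perl
use strict;
use warnings;

my $reload_requested = 0;
my $status_requested = 0;
my ($reloads, $status_writes) = (0, 0);

# Handlers do as little as possible: they only note the request.
$SIG{HUP}  = sub { $reload_requested = 1 };
$SIG{USR1} = sub { $status_requested = 1 };

# Hypothetical per-article cycle: pending requests are handled here.
sub process_cycle {
    if ($reload_requested) {
        $reload_requested = 0;
        $reloads++;          # a real filter would re-read its config here
    }
    if ($status_requested) {
        $status_requested = 0;
        $status_writes++;    # a real filter would write its status file here
    }
}

# Simulate an operator running "kill -HUP" and "kill -USR1" on this process.
kill 'HUP',  $$;
kill 'USR1', $$;
process_cycle();

print "reloads: $reloads, status writes: $status_writes\n";
```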

Written by Jeremy Nixon <jeremy@exit109.com>.

Originally based on Jeff Garzik's EMP filter.

I can't possibly mention everyone who has submitted ideas or fixes for
the filter, but I'd like to acknowledge the substantial contributions
of several people: Danhiel Baker, Frank Copeland, Brian Moore, John
Payne, Russ Allbery, David Riley, and SeokChan LEE. Thanks, guys.

dynamic-load.patch is from Piers Cawley. The body-filtering portion of
the INN filter.patch is from Jeff Garzik. messageid.patch is from Ed
Mooring. mode.patch is from John Payne.

Copyright 1997-1998 by Jeremy Nixon, All Rights Reserved.

This software may be distributed freely, provided it is intact
(including all the files from the original archive). You may modify
it, and you may distribute your modified version, provided the original
work is credited to the appropriate authors, and your work is credited
to you (don't make changes and pass them off as my work), and that you
aren't charging for it.

This filter is available at:

http://www.exit109.com/~jeremy/news/antispam.html
ftp://ftp.exit109.com/users/jeremy/

3rd Berkeley Distribution        Version 0.95.7b        cleanfeed(8)