"Required SpamAssassin Score" and Bayes

Mon Jan 5 17:24:32 GMT 2004

At 17:17 05/01/2004, you wrote:
>Executive summary:  Might a high value of MS "Required SpamAssassin Score"
>interact adversely with SA Bayes?
>
>Detail:
>We started site-wide use of MailScanner some time ago (mid-2001), and of
>SpamAssassin back in 2002.  Because of our worries about false positives,
>we adjusted the MailScanner.conf "Required SpamAssassin Score" from its
>default of 5 up to 7.
>
>Things have moved on, and we are now happily using SA 2.61 including its
>Bayes aspects.  But we find more emails than we would expect still escape
>being spam-tagged: their spamscores seem strangely low.  Might it be that
>our artificially high "Required SpamAssassin Score = 7" is causing the
>Bayes mechanism to auto-learn some "Score = 5" and "6" spams incorrectly
>as hams, and perhaps then to cause future occurrences of these spams to be
>marked down as hams (and thus escape being spam-tagged)?

No. Auto-learning is triggered by two thresholds which are set inside
SpamAssassin. The "Required SpamAssassin Score" is a completely separate
MailScanner setting, and SpamAssassin is never even told what number it is.
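
To make the separation concrete, here is a rough sketch. It is not
SpamAssassin's real code: the function names and threshold values are
illustrative assumptions (I believe recent SA versions call the settings
bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam, but
check the Mail::SpamAssassin::Conf man page for your version). The real
auto-learner also applies extra conditions that are left out here.

```python
# Sketch only: not SpamAssassin's real code. The two auto-learn thresholds
# live in SpamAssassin's own configuration; the values here are assumed.
AUTOLEARN_HAM_BELOW = 0.1     # assumed value, check your SA config
AUTOLEARN_SPAM_FROM = 12.0    # assumed value, check your SA config


def autolearn_decision(score, ham_below=AUTOLEARN_HAM_BELOW,
                       spam_from=AUTOLEARN_SPAM_FROM):
    """Return 'ham', 'spam' or 'none' using only SpamAssassin's thresholds."""
    if score < ham_below:
        return "ham"
    if score >= spam_from:
        return "spam"
    return "none"


def mailscanner_tags_spam(score, required_score):
    """MailScanner's separate check against "Required SpamAssassin Score"."""
    return score >= required_score


if __name__ == "__main__":
    # The auto-learn result is the same whatever the required score is.
    for score in (5.5, 6.5):
        for required in (5, 7):
            print(score, required,
                  mailscanner_tags_spam(score, required),
                  autolearn_decision(score))
```

With thresholds like these, a message scoring 5 or 6 is not auto-learned
as anything, whether the required score is 5 or 7.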

>I think we could reasonably confidently reduce "Required SA Score" from 7
>down to 6 or 5, which would both catch a few more spams, and the resultant
>Bayes autolearn might then catch more (positive feedback).

We run at 6 and see no false positives, just a few false negatives. 5 was
too low and we started seeing false positives at that setting.

>Is the above reasoning basically sound?  Or is it fundamentally flawed?

No, and yes :-)

>A supplementary question: Our SA/Bayes is currently only self-learning.
>Are there any nicely packaged schemes to allow us to supplement this from
>emails from validated individuals?  A few of us could then redirect
>(bounce) emails to, say, "sa-learn-ham at ..." and "sa-learn-spam at ..." (but
>in such a way that it would verify the redirector/bouncer (or some
>equivalent) against a list of trusted folk).

You can control who may send to those addresses using the check_compat
ruleset and sendmail's access DB (the sendmail Bat Book, 3rd Edition, will
tell you how). You can then just run an hourly learn using the --mbox switch
to sa-learn. I have a cron job which does this, and I have posted it here
several times before. It might be called learn.spam or something like that.
Look for my postings with attachments (there aren't too many of those).
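
For anyone who wants the general shape of such an hourly learn, here is a
minimal sketch. It is not the cron job I posted: the mbox paths and function
names are made up for illustration, and it assumes the two addresses each
deliver into their own mbox file. It does not empty or rotate the mailboxes
afterwards, which a real job would need to handle.

```python
import os
import subprocess

# Hypothetical mbox locations, one per trusted-submission address.
MBOXES = {
    "spam": "/var/spool/mail/sa-learn-spam",
    "ham": "/var/spool/mail/sa-learn-ham",
}


def build_learn_command(kind, mbox_path):
    """Build the sa-learn command line for one mbox file."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind, "--mbox", mbox_path]


def learn_all(mboxes=MBOXES, runner=subprocess.run):
    """Run sa-learn on every mbox that exists and is not empty."""
    ran = []
    for kind, path in sorted(mboxes.items()):
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            cmd = build_learn_command(kind, path)
            runner(cmd, check=True)
            ran.append(cmd)
    return ran


if __name__ == "__main__":
    learn_all()
```

Run it from cron once an hour as a user that can read those mailboxes and
write to the Bayes database.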

--
Julian Field
www.MailScanner.info
Professional Support Services at www.MailScanner.biz
MailScanner thanks transtec Computers for their support
PGP footprint: EE81 D763 3DB0 0BFD E1DC 7222 11F6 5947 1415 B654