Bayes Poisoning? Spam with negative BAYES Scores - ahhhh

Nathan Johanson nathan at TCPNETWORKS.NET
Wed Dec 24 18:36:19 GMT 2003


> Most of the spam where I see Bayes erroneously giving low
> probabilities consists of a single html image (with the real
> message) and then a bunch of random dictionary words, probably
> intended to trigger a negative Bayes score.  I don't think training
> Bayes on these spam messages is going to help any.  Won't it just
> tokenize the random dictionary words and begin to associate them
> with spam?

This is a perfect description of the kinds of SPAM I've seen recently.
And I'm with you, I'm a little concerned that "learning" these messages
would tip the scales in the other direction and perhaps lead to false
positives. However, I do think I will start catching these messages and
learning them manually (on at least one of my production boxes).

I do like the idea behind reducing the Bayes probabilities. Please let
me know how this works for you. I'm curious if it's enough to fix the
problem, or if it impacts your filtering in some other unforeseen way.

Nathan

-----Original Message-----
From: Mark Nienberg [mailto:mark at TIPPINGMAR.COM] 
Sent: Wednesday, December 24, 2003 9:56 AM
To: MAILSCANNER at JISCMAIL.AC.UK
Subject: Re: Bayes Poisoning? Spam with negative BAYES Scores - ahhhh


I finally decided to modify the scores for Bayes probabilities less than
50%, so Bayes will not reduce overall spam scores anymore.  If Bayes
thinks the message is spam, it will increase the score as always.  If it
thinks it is not spam, there will be no significant reduction in score.
Here is what I put in spam.assassin.prefs.conf:

score BAYES_00 0 0 -0.05 -0.05
score BAYES_01 0 0 -0.04 -0.04
score BAYES_10 0 0 -0.03 -0.03
score BAYES_20 0 0 -0.02 -0.02
score BAYES_30 0 0 -0.01 -0.01
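
As a rough illustration of what these lines do: a SpamAssassin "score"
line can carry four values, one per score set, and the last two apply
when Bayes is enabled, so only those matter here. The Python sketch below
is not SpamAssassin code. The rule names RULE_A and RULE_B, their scores,
the "stock" BAYES_00 value of -4.9, and the threshold of 5.0 are all
assumptions made up for the example; only the -0.05 comes from the
config above.

```python
# Toy model of how a total spam score is compared against a threshold.
# Everything except the -0.05 BAYES_00 value is a made-up assumption.

REQUIRED_HITS = 5.0  # assumed threshold for this example


def total_score(hits, scores):
    """Sum the scores of every rule that fired on the message."""
    return sum(scores[name] for name in hits)


def is_spam(hits, scores, required=REQUIRED_HITS):
    """A message is tagged as spam when its total reaches the threshold."""
    return total_score(hits, scores) >= required


# Hypothetical non-Bayes rules that fired on an image-plus-random-words spam.
other_rules = {"RULE_A": 3.0, "RULE_B": 2.5}

# Hypothetical large negative Bayes score versus the near-zero one above.
stock = dict(other_rules, BAYES_00=-4.9)
neutralized = dict(other_rules, BAYES_00=-0.05)

hits = ["RULE_A", "RULE_B", "BAYES_00"]

print("stock:      ", round(total_score(hits, stock), 2), is_spam(hits, stock))
print("neutralized:", round(total_score(hits, neutralized), 2),
      is_spam(hits, neutralized))
```

With the assumed numbers, the poisoned message slips under the threshold
when BAYES_00 carries a large negative score and is caught once that
score is close to zero. The trade-off is that legitimate mail no longer
gets any help from a low Bayes probability either.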

Most of the spam where I see Bayes erroneously giving low probabilities
consists of a single html image (with the real message) and then a bunch
of random dictionary words, probably intended to trigger a negative Bayes
score.  I don't think training Bayes on these spam messages is going to
help any.  Won't it just tokenize the random dictionary words and begin
to associate them with spam?
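
To make the tokenizing question concrete, here is a toy Python sketch of
a per-token spam probability based on simple frequency ratios. It is a
simplification and an assumption, not how SpamAssassin's Bayes engine
actually computes or combines probabilities; the words, counts, and the
helper name token_spam_prob are hypothetical.

```python
from collections import Counter


def token_spam_prob(token, spam_counts, ham_counts, n_spam, n_ham):
    """Naive estimate: share of the token's frequency that comes from spam."""
    s = spam_counts[token] / n_spam if n_spam else 0.0
    h = ham_counts[token] / n_ham if n_ham else 0.0
    if s + h == 0:
        return 0.5  # token never seen: treat as neutral
    return s / (s + h)


# Hypothetical training state: "garden" is an ordinary word seen only in ham.
ham_counts = Counter({"garden": 8, "meeting": 10})
spam_counts = Counter({"pills": 9})
n_ham, n_spam = 10, 10

before = token_spam_prob("garden", spam_counts, ham_counts, n_spam, n_ham)

# Learn five poisoned spam messages that each contain the word "garden".
spam_counts["garden"] += 5
n_spam += 5

after = token_spam_prob("garden", spam_counts, ham_counts, n_spam, n_ham)

print("before training:", before)
print("after training: ", round(after, 3))
```

In this toy model the dictionary word does drift toward spam after
training, which is the worry raised above, but it stays below 0.5 as long
as it keeps showing up more often in legitimate mail. Whether the same
holds for a real Bayes database depends on its actual token counts.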

Mark



