Which messages to feed to Bayes?

Thu Feb 26 23:47:45 GMT 2004

Matt Kettler <mailto:mkettler at EVI-INC.COM> wrote:
> At 04:55 PM 2/26/2004, Michael St. Laurent wrote:
>> Should we be feeding the Bayes engine in SpamAssassin messages that
>> it has recognized as spam?  The reason I am considering doing this
>> is the thought that it would eventually increase the spam score on
>> similar emails until they break the high spam score level and get
>> dropped instead of flagged.
>
> My answer is a very emphatic YES!
>
> There's absolutely NO valid reason to skip messages that SA caught
> when doing training. And there are good, valid reasons to train them.
>
> Those who naysay training on tagged messages, or on messages that are
> already BAYES_99, do so only because they don't understand how
> Bayes works, and are coming to the incorrect conclusion that it won't
> help SA with other spam.
>
> The key factor is that Bayes doesn't learn to recognize an email... it
> learns about spam in general from each email. SA applies lessons
> learned from one spam to other spam which isn't entirely the same,
> but may contain some small similarities.
>
> Feeding messages which SA already tags, and even ones that are already
> BAYES_99, can help prevent false negatives in messages that SA wouldn't
> otherwise catch because none of their tokens had been learned yet.
>
> Even messages that are already BAYES_99 can contain tokens that SA
> hasn't learned yet. BAYES_99 means that the tokens SA recognizes are
> collectively likely to be spam, but it doesn't mean that there aren't
> any new tokens to learn about in the message, and it doesn't mean
> that all the tokens even have high spam probabilities.
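
To make the point above concrete, here is a minimal Python sketch of a token-based filter. It is a toy under stated assumptions, not SpamAssassin's implementation: the class name ToyBayes, the whitespace tokenizer, the 0.01/0.99 clamping and the naive log-odds combination are all hypothetical choices made for illustration (SA uses its own tokenizer and combining method). It only shows that a spam which already scores high can still teach the filter tokens that help with a different, later spam.

```python
import math
from collections import Counter


class ToyBayes:
    """Toy token-based spam filter (illustrative only, not SpamAssassin's code)."""

    def __init__(self):
        self.spam_counts = Counter()  # number of spam messages containing each token
        self.ham_counts = Counter()   # number of ham messages containing each token
        self.n_spam = 0
        self.n_ham = 0

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase whitespace splitting, one count per message.
        return set(text.lower().split())

    def learn(self, text, is_spam):
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1

    def token_prob(self, token):
        """Spam probability for one token, or None if it was never learned."""
        s = self.spam_counts[token]
        h = self.ham_counts[token]
        if s + h == 0:
            return None
        spam_rate = s / max(self.n_spam, 1)
        ham_rate = h / max(self.n_ham, 1)
        p = spam_rate / (spam_rate + ham_rate)
        # Clamp so a single token can never be absolutely certain.
        return min(max(p, 0.01), 0.99)

    def score(self, text):
        """Combine known tokens naively; unknown tokens contribute nothing."""
        probs = [p for p in map(self.token_prob, self.tokenize(text)) if p is not None]
        if not probs:
            return 0.5  # nothing recognized: no opinion either way
        log_odds = sum(math.log(p) - math.log(1 - p) for p in probs)
        return 1 / (1 + math.exp(-log_odds))


bayes = ToyBayes()
bayes.learn("cheap pills online", is_spam=True)
bayes.learn("meeting agenda for monday", is_spam=False)

caught = "cheap pills online refinance mortgage"  # already scores as spam
later = "refinance your mortgage today"           # shares no learned tokens yet

caught_score = bayes.score(caught)
before = bayes.score(later)
bayes.learn(caught, is_spam=True)  # train on the message that was already caught
after = bayes.score(later)

print(f"caught: {caught_score:.4f}  later before: {before:.4f}  later after: {after:.4f}")
```

In this toy, the caught message scores above 0.99 before it is ever trained, yet the later message sits at a neutral 0.5 because none of its tokens are known. Once the caught message is learned, "refinance" and "mortgage" enter the database and the later message scores above 0.99 as well.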

Excellent.  Okay, what about spam messages that have lost their headers
because the user forwarded them to me (Outlook strips the headers when you do
that)?  Will Bayes still benefit from looking at just the body of the message?

--
Michael St. Laurent
Hartwell Corporation