Bayesian training policy (crossposted from SpamAssassin ML)

Julian Field mailscanner at ecs.soton.ac.uk
Mon May 5 17:41:56 IST 2003


At 17:30 05/05/2003, you wrote:
>Julian,
>
>Thank you for your reply. Let me be more specific:
>
>I've created two mailboxes (spam and notspam) where I copy (not forward)
>Spam & Notspam messages; I run a script to launch sa-learn on them every
>hour. Right so far?

Okay.
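
For illustration, the hourly job described above could look roughly like the sketch below. The mailbox paths and function names are hypothetical, and it assumes both folders are stored in mbox format and that your sa-learn supports the --spam, --ham and --mbox options.

```python
import subprocess
from typing import List

# Hypothetical mailbox locations; point these at your own two folders.
SPAM_MBOX = "/var/mail/spam"
HAM_MBOX = "/var/mail/notspam"


def build_learn_commands(spam_mbox: str = SPAM_MBOX,
                         ham_mbox: str = HAM_MBOX) -> List[List[str]]:
    """Return the sa-learn invocations for one training pass."""
    return [
        ["sa-learn", "--spam", "--mbox", spam_mbox],
        ["sa-learn", "--ham", "--mbox", ham_mbox],
    ]


def run_training() -> None:
    """Run each sa-learn command in turn, stopping on the first failure."""
    for cmd in build_learn_commands():
        subprocess.run(cmd, check=True)
```

Calling run_training() from an hourly cron entry would reproduce the setup in the question. It should run as whichever user owns the Bayes database that your mail scanning actually uses.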

>How many messages should I use to train the filter?

SpamAssassin won't start using the bayes results for filtering mail until
200 spam and 200 ham (non-spam) messages have been learned.
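
If your version of sa-learn supports "--dump magic", its output includes the number of spam and ham messages learned so far, so you can check whether you have reached that threshold. The sketch below assumes each relevant line ends in "non-token data: nspam" or "non-token data: nham" with the count in the third column; the function names are made up for this example, so check the layout against your own output.

```python
from typing import Dict

MIN_SPAM = 200  # thresholds quoted in the reply above
MIN_HAM = 200


def parse_counts(dump_text: str) -> Dict[str, int]:
    """Extract the nspam/nham counters from 'sa-learn --dump magic' text."""
    counts: Dict[str, int] = {}
    for line in dump_text.splitlines():
        fields = line.split()
        # Assumed layout: prob, spam, VALUE, atime, "non-token", "data:", name
        if len(fields) >= 5 and fields[-1] in ("nspam", "nham"):
            counts[fields[-1]] = int(fields[2])
    return counts


def bayes_ready(counts: Dict[str, int]) -> bool:
    """True once both counters have reached the 200-message minimum."""
    return (counts.get("nspam", 0) >= MIN_SPAM
            and counts.get("nham", 0) >= MIN_HAM)
```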

>Should I include only false positives and false negatives in my manual
>training or should I also use correctly tagged messages?

SpamAssassin will auto-learn a message on its own if the other rules give
it a very high or very low score. The false positives and false negatives
are the most important ones to teach it, but adding correctly tagged
messages certainly won't do any harm.

>Is there a good ratio between spam and not spam messages to use?

Ideally 50% of each, I believe.

>Should I use only "new" messages (maybe one month old at max) or should
>I use also old messages?

Due to the changing nature of spam in general, I would think you would get
the best results with "new" messages.
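
One way to restrict training to recent mail is to filter on the Date header before handing messages to sa-learn. This is only a sketch: the 30-day cutoff mirrors the "one month" in the question, the function name is made up, and the Date header on spam is often forged, so treat the result as approximate.

```python
from datetime import datetime, timedelta, timezone
from email.message import Message
from email.utils import parsedate_to_datetime
from typing import Iterable, List


def recent_messages(messages: Iterable[Message],
                    now: datetime,
                    max_age: timedelta = timedelta(days=30)) -> List[Message]:
    """Keep messages whose Date header is no more than max_age before now."""
    recent = []
    for msg in messages:
        try:
            sent = parsedate_to_datetime(msg["Date"])
        except (TypeError, ValueError):
            continue  # missing or unparseable Date header: skip the message
        if sent.tzinfo is None:
            # No usable timezone in the header; assume UTC for the comparison.
            sent = sent.replace(tzinfo=timezone.utc)
        if now - sent <= max_age:
            recent.append(msg)
    return recent
```

The messages argument can be anything that yields email messages, for example a mailbox.mbox object from the standard library opened on one of the training folders, with now set to datetime.now(timezone.utc).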

>Should I keep the messages I used to train the filter or can I discard
>them?

You can discard them. Just make sure you don't lose or corrupt your Bayes
database files. Personally, I keep the manually-learned messages to be on
the safe side.

>Should I start from scratch every now and then or constantly train the
>filter with new messages without deleting the old database?

Don't delete it, just keep training it. It does some housekeeping every
now and then to expire words/tokens which virtually never appear and don't
help the results. You can also trigger the housekeeping by hand with
sa-learn's --force-expire switch; read the sa-learn man page to check the
exact usage for your version.

>How can I check if the learning procedure is doing any good at all?

I do it by keeping an eye on the "BAYES_" result in some spam and ham
messages. Other than that, I'm not sure.
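
Eyeballing the "BAYES_" results can be partly automated by tallying which BAYES_ rules show up in the headers of a sample of messages. The sketch below scans every header rather than one named header, since the header carrying the SpamAssassin report differs between setups; the function names are made up for this example.

```python
import re
from collections import Counter
from email import message_from_string
from typing import Iterable, List

BAYES_RULE = re.compile(r"\bBAYES_\d+\b")


def bayes_hits(raw_message: str) -> List[str]:
    """Return the distinct BAYES_nn rule names found in a message's headers."""
    msg = message_from_string(raw_message)
    found = set()
    for _name, value in msg.items():
        found.update(BAYES_RULE.findall(str(value)))
    return sorted(found)


def tally(raw_messages: Iterable[str]) -> Counter:
    """Count how many messages hit each BAYES_nn rule."""
    counts: Counter = Counter()
    for raw in raw_messages:
        counts.update(bayes_hits(raw))
    return counts
```

Run over known spam, the tally should be dominated by the high-numbered rules, and over known ham by the low-numbered ones. If the two distributions look alike, the training is probably not helping yet.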

Hope that helps a bit,
Jules.
--
Julian Field
www.MailScanner.info
Professional Support Services at www.MailScanner.biz
MailScanner thanks transtec Computers for their support


