Bayesian training policy (crossposted from SpamAssassin ML)

Mon May 5 17:30:11 IST 2003

Julian,

Thank you for your reply. Let me be more specific:

I've created two mailboxes (spam and notspam) into which I copy (not
forward) spam and non-spam messages; a script runs sa-learn on them every
hour. Is that right so far?
How many messages should I use to train the filter?
Should I include only false positives and false negatives in my manual
training or should I also use correctly tagged messages?
Is there a good ratio of spam to non-spam messages to use?
Should I use only "new" messages (say, at most one month old) or should
I also use old messages?
Should I keep the messages I used to train the filter or can I discard
them?
Should I start from scratch every now and then or constantly train the
filter with new messages without deleting the old database?
How can I check if the learning procedure is doing any good at all?
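
As a rough illustration of the hourly training run described above, here is
a minimal sketch. It assumes the two mailboxes are mbox files, that the
installed sa-learn accepts the --spam, --ham and --mbox options, and that the
paths below (which are placeholders, not taken from the actual setup) are
adjusted to the real locations. The function names are hypothetical.

```python
import subprocess

# Placeholder mailbox paths (assumptions); point these at the real
# spam and notspam mbox files.
SPAM_MBOX = "/var/spool/mail/spam"
HAM_MBOX = "/var/spool/mail/notspam"


def build_learn_commands(spam_mbox, ham_mbox):
    """Return the sa-learn command lines for one training pass."""
    return [
        # Learn every message in the spam mailbox as spam.
        ["sa-learn", "--spam", "--mbox", str(spam_mbox)],
        # Learn every message in the notspam mailbox as ham (non-spam).
        ["sa-learn", "--ham", "--mbox", str(ham_mbox)],
    ]


def train(spam_mbox=SPAM_MBOX, ham_mbox=HAM_MBOX):
    """Run sa-learn on both mailboxes, raising on the first failure."""
    for cmd in build_learn_commands(spam_mbox, ham_mbox):
        subprocess.run(cmd, check=True)
```

Calling train() from an hourly cron job would correspond to the setup
described above; build_learn_commands() is kept separate so the command lines
can be inspected without running sa-learn.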

Thank you in advance for any hints,

Andrea

-----Original Message-----
From: Julian Field [mailto:mailscanner at ECS.SOTON.AC.UK] 
Sent: Monday, May 05, 2003 5:22 PM
To: MAILSCANNER at JISCMAIL.AC.UK
Subject: Re: Bayesian training policy (crossposted from SpamAssassin ML)

[...]
What do you mean by a "learning policy that makes sense"? 
[...]