sa learn bayes starter DB

shuttlebox shuttlebox at gmail.com
Tue Feb 3 14:33:46 GMT 2009


On Tue, Feb 3, 2009 at 3:14 PM, Glenn Steen <glenn.steen at gmail.com> wrote:
> This is more philosophical than technical...:-).
> The "best" thing to do is to have 200-1000 spam messages and 200-1000
> ham (non-spam) messages, harvested from your normal mail flow, and
> manually train Bayes on these.
> Another option is to set things up with an empty Bayes and either rely
> on automatic training, or a combination of manual/automatic training,
> so that you reach the prerequisite of 200 spam/ham before Bayes starts
> scoring.
> The third option is to "borrow" someone else's Bayes database and
> start scoring directly. Obviously this is what you were about to do
> here.

That depends on the volume, of course, but for a domain with a few
thousand mailboxes it should be sufficiently trained to start scoring
after only a few hours of automatic training.

When I do my daily expire I see that 25% of the Bayes db is purged. To
me that means my whole db is refreshed every four days, so why bother
with a starter db from someone else's mail flow? Maybe I have the wrong
idea about how this purging works and someone can explain it better,
but I've never seen it explained on this list. Everyone seems to think
that "good" tokens will stay in the db forever, but I'm skeptical. I
would like an answer to that and may have to go to the SA list to get
it. Matt Kettler used to answer tricky questions about SA's internals,
but he doesn't seem to hang around here anymore?
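
To make the question concrete, here is a toy Python sketch of the
"25% a day means a full refresh in four days" arithmetic. It is not
SA's code. It assumes the expire drops the least recently seen tokens
first, which is only my assumption about how the purge picks its
victims, and every name and number in it is made up for illustration.

```python
# Toy model only, not SpamAssassin's implementation.
# Assumption: the daily expire removes the least recently seen tokens first.

def simulate(days=30, core=100, one_offs_per_day=100, purge_fraction=0.25):
    clock = 0          # increases with every token sighting
    last_seen = {}     # token name -> clock value when last seen

    # Start with a set of recurring ("good") tokens already in the db.
    for i in range(core):
        clock += 1
        last_seen[f"core{i}"] = clock

    for day in range(1, days + 1):
        # Tokens that show up today and never again.
        for i in range(one_offs_per_day):
            clock += 1
            last_seen[f"day{day}-{i}"] = clock

        # Recurring tokens keep appearing in the mail flow, so their
        # timestamp is refreshed, but only if they are still in the db.
        for i in range(core):
            name = f"core{i}"
            if name in last_seen:
                clock += 1
                last_seen[name] = clock

        # Daily expire: drop the oldest 25% by last-seen time.
        n_purge = int(len(last_seen) * purge_fraction)
        for tok in sorted(last_seen, key=last_seen.get)[:n_purge]:
            del last_seen[tok]

    survivors = sum(1 for t in last_seen if t.startswith("core"))
    return survivors, len(last_seen)
```

Calling simulate() returns the number of recurring tokens still present
and the total db size after 30 days. In this model a quarter of the db
is purged every day, yet the recurring tokens are never among the
oldest, so they all survive. If the real expire works on some other
basis, the result would be different, which is exactly what I would
like someone to confirm.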

-- 
/peter


More information about the MailScanner mailing list