Some Messages Double Scanned

Thu Jan 12 23:07:13 GMT 2006

Julian Field wrote:
> A couple of (okay, 3) points.
> 
> If you decide to scrap the Bayes db, then you can get a good "starter"
> database from www.fsl.com/support <http://www.fsl.com/support>.

I know it's been talked about before, but you talk about starter databases as
being a good thing in all respects. I feel it necessary to point out the drawbacks.

Starter databases, while necessary for some, are in general highly suboptimal
at best. You really should discourage their use except when necessary due to
lack of training data from the actual site to work with.

Drawback 1: SA extensively tokenizes mail headers, including sender and
recipient addresses. None of these tokens will be at all useful on any site
other than the one that generated them, so they are a complete waste of space.
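
To make the point concrete, here is a toy sketch of how little of a foreign
database a local message can actually hit. The token strings and the hit_rate
helper are made up for illustration; SA's real token format is different.

```python
def hit_rate(message_tokens, db_tokens):
    """Fraction of a message's distinct tokens that the database already knows."""
    tokens = set(message_tokens)
    if not tokens:
        return 0.0
    return len(tokens & set(db_tokens)) / len(tokens)


# Hypothetical tokens: a starter database trained at site A ...
starter_db = {
    "to:alice@site-a.example",
    "from:bob@site-a.example",
    "body:mortgage",
}

# ... and a message received at site B. Only the body token can match;
# the header tokens in the database are dead weight here.
local_msg = [
    "to:carol@site-b.example",
    "from:dave@site-b.example",
    "body:mortgage",
]

print(hit_rate(local_msg, starter_db))
```

In this toy case only one token in three is known, and none of the header
tokens is.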

Drawback 1.1: You're forcing SA to activate bayes by inflating the mail counts
without ANY relevant header tokens. This creates a pretty distorted view until
some local training kicks in.
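
A rough sketch of the gate I mean, assuming the stock thresholds of 200 spam
and 200 ham (the bayes_min_spam_num and bayes_min_ham_num settings). The
function name and the message counts are invented for the example; this is
not SA's actual code.

```python
# Assumed defaults for bayes_min_spam_num / bayes_min_ham_num.
MIN_SPAM = 200
MIN_HAM = 200


def bayes_active(nspam, nham, min_spam=MIN_SPAM, min_ham=MIN_HAM):
    """Return True once the database claims enough trained messages to be used."""
    return nspam >= min_spam and nham >= min_ham


# Hypothetical counts, for illustration only.
fresh_site = {"nspam": 0, "nham": 0}          # nothing learned locally yet
with_starter = {"nspam": 5000, "nham": 5000}  # counts inherited from someone else's mail

print(bayes_active(**fresh_site))
print(bayes_active(**with_starter))
```

The check only sees the counts, so a starter database passes it even though
not one of those messages came from your site.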

Drawback 2: The strength of bayes lies in its adaptation to YOUR mail patterns.
While most sites have common spam patterns, most sites have very different
nonspam patterns. A starter database lacks this knowledge of what YOUR nonspam
looks like. Unless your email precisely fits the profile of the starter database,
your results will be less than optimal. If your profile differs greatly, your
results will be fairly poor.

Conclusion: If you don't have 200 spam and 200 ham emails, a starter database
may be useful to you. However, it should be supplemented with at least a few
local messages, and preferably as many as possible to get the header tokens up
to par. If you have plenty of samples to work with, you'll be much better off
using your own mail and passing on the starter.
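
If you are not sure which side of that line you are on, something like the
following would tell you. It is only a sketch: it assumes your samples are
sorted into two mbox files, and the function names and the paths in the usage
comment are placeholders of mine.

```python
import mailbox

MIN_PER_CLASS = 200  # the 200 spam / 200 ham figure from above


def count_messages(mbox_path):
    """Count the messages in an existing mbox file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def starter_needed(spam_mbox, ham_mbox, minimum=MIN_PER_CLASS):
    """True when either local corpus is too small to train from on its own."""
    return (count_messages(spam_mbox) < minimum
            or count_messages(ham_mbox) < minimum)


# Usage (placeholder paths):
#   if starter_needed("/path/to/spam.mbox", "/path/to/ham.mbox"):
#       print("Too few local samples; a starter database may help.")
#   else:
#       print("Train from your own mail and skip the starter.")
```

Either way, feed whatever local mail you do have to sa-learn so the header
tokens start to reflect your own site.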