sa learn bayes starter DB

Tue Feb 3 14:14:09 GMT 2009

2009/2/3 lorenzo <lorenzo at argroup.it>:
> Glenn Steen ha scritto:
>>
>> 2009/2/2 Kai Schaetzl <maillists at conactive.com>:
>>
>>>
>>> Lorenzo wrote on Mon, 02 Feb 2009 18:21:36 +0100:
>>>
>>>
>>>>
>>>> How can I use this DB with SpamAssassin?
>>>>
>>>
>>> isn't that information in the download?
>>>
>>> Kai
>>>
>>>
>>
>> Nah, it's just a tarball of a bayes directory with the usual three
>> files... Unpack it somewhere (/etc/MailScanner/), set bayes_path
>> accordingly (/etc/MailScanner/bayes/bayes), fix ownership and
>> permissions ... and you are done.
>>
>> Whether one should use a DB other than from one's own mail flow... is
>> another question:)
>>
>> Cheers
>>
>
> If you have a fresh mail server, is it a good choice to use this
> starter DB, or is it preferable to start with a new Bayes DB?
>
>
This is more philosophical than technical...:-).
The "best" thing to do is to have 200-1000 spam messages and 200-1000
ham (non-spam) messages, harvested from your normal mail flow, and
manually train Bayes on these.
Another option is to set things up with an empty Bayes and either rely
on automatic training, or a combination of manual/automatic training,
until you reach the prerequisite of 200 spam and 200 ham messages, at
which point Bayes starts scoring.
The third option is to "borrow" someone else's Bayes database and
start scoring directly. Obviously this is what you were about to do
here.
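
For the first two options it helps to know how far you are from that
200/200 prerequisite. A minimal sketch, assuming you feed it the text
printed by "sa-learn --dump magic" and that the nspam/nham values sit in
the third column of the "non-token data" lines (that is the layout I
have seen, so check it against your own output). The function names are
made up for illustration, and the thresholds assume you have not changed
bayes_min_spam_num / bayes_min_ham_num:

```python
import re

# Assumed defaults for bayes_min_spam_num / bayes_min_ham_num.
MIN_SPAM = 200
MIN_HAM = 200

# Matches the two "magic" lines that carry the learned message counts.
_MAGIC = re.compile(r"non-token data:\s*(nspam|nham)\s*$")


def bayes_counts(dump_text):
    """Parse nspam/nham out of `sa-learn --dump magic` text (hypothetical helper)."""
    counts = {"nspam": 0, "nham": 0}
    for line in dump_text.splitlines():
        match = _MAGIC.search(line)
        if match:
            # Assumption: the value is the third whitespace-separated field.
            counts[match.group(1)] = int(line.split()[2])
    return counts


def bayes_is_scoring(counts, min_spam=MIN_SPAM, min_ham=MIN_HAM):
    """True once both thresholds are met, i.e. Bayes should be scoring."""
    return counts["nspam"] >= min_spam and counts["nham"] >= min_ham
```

Paste the dump output into bayes_counts() and pass the result to
bayes_is_scoring(); as long as it returns False you are still on the
non-Bayes score set.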

The problem(s) would be that:
- You have no real knowledge of what is in the Bayes db.
- It might have been trained on things that would FP/FN a lot for your
organisation/mail flow.
- You would have to watch the hit rate for Bayes pretty closely at the
start. Now... that is something you'd need to do anyway:-).
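
One rough way to watch that hit rate is to tally which BAYES_nn rules
fire. This is only a sketch: it assumes you can get at lines of text
(spam report headers or log lines, wherever your setup puts them) that
contain the rule names, and it just counts them. The helper name is
hypothetical:

```python
import re
from collections import Counter

# SpamAssassin's Bayes rules are named BAYES_00, BAYES_50, BAYES_99, etc.
BAYES_RULE = re.compile(r"\bBAYES_(\d\d)\b")


def bayes_hits(lines):
    """Count occurrences of each BAYES_nn rule name in the given lines."""
    tally = Counter()
    for line in lines:
        for bucket in BAYES_RULE.findall(line):
            tally["BAYES_" + bucket] += 1
    return tally
```

Run it separately over known ham and known spam. If a borrowed db puts
a fair share of your ham in BAYES_99, or your spam in BAYES_00, that is
the FP/FN problem showing itself.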

So a starter DB will give you Bayes scoring, but might be seriously
out of date wrt the current spam trends and would potentially be all
wrong for your mail flow, hence leading to FP/FN rates that might
"hurt" you.
A lot of suppositions there:-).
Once your system is up and running (after a while all the tokens will
tend to be from your mail flow) the "impact" lessens.

So why use one, if there are risks? Well, for one thing... score set
3 (Bayes plus network tests) might be a lot better than set 1 (network
tests only, which you'll have until Bayes kicks in). You'll have no
"sudden change" in scoring, as you would otherwise (when Bayes kicks
in), and that predictability is possibly something to strive for.
Bottom line on that train of thought is that it might help you
detect more spam, thus running a better "laundry service".

Which way to go? It all depends on your needs. If we did a poll, you'd
find some who use a starter DB and some who, emphatically:-),
wouldn't.
I, for one, did the manual/automatic training thing on an empty Bayes
... some 5 years back, or so. I've migrated that db around ever
since, as well as used the "pure manual method" on some testbeds...
Others have the starter db thing in their notes for "how to set up a
new MS server", and are very happy with that.

Cheers
-- 
-- Glenn
email: glenn < dot > steen < at > gmail < dot > com
work: glenn < dot > steen < at > ap1 < dot > se