sa learn bayes starter DB

Glenn Steen glenn.steen at gmail.com
Tue Feb 3 15:28:04 GMT 2009


2009/2/3 lorenzo <lorenzo at argroup.it>:
> Glenn Steen wrote:
>>
>> 2009/2/3 lorenzo <lorenzo at argroup.it>:
>>
>>>
>>> Glenn Steen wrote:
>>>
>>>>
>>>> 2009/2/2 Kai Schaetzl <maillists at conactive.com>:
>>>>
>>>>
>>>>>
>>>>> Lorenzo wrote on Mon, 02 Feb 2009 18:21:36 +0100:
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> How can I use this db with SpamAssassin?
>>>>>>
>>>>>>
>>>>>
>>>>> isn't that information in the download?
>>>>>
>>>>> Kai
>>>>>
>>>>>
>>>>>
>>>>
>>>> Nah, it's just a tarball of a bayes directory with the usual three
>>>> files... Unpack it somewhere (e.g. /etc/MailScanner/), set bayes_path
>>>> accordingly (/etc/MailScanner/bayes/bayes), fix ownership and
>>>> permissions ... and you are done.
>>>>
>>>> Whether one should use a DB other than one from one's own mail flow...
>>>> is another question:)
>>>>
>>>> Cheers
>>>>
>>>>
>>>
>>> If you have a fresh mail server, is it a good choice to use this starter
>>> db, or is it preferable to start with a new bayes db?
>>>
>>>
>>>
>>
>> This is more philosophical than technical...:-).
>> The "best" thing to do is to have 200-1000 spam messages and 200-1000
>> ham (non-spam) messages, harvested from your normal mail flow, and
>> manually train Bayes on these.
>> Another option is to set things up with an empty Bayes and either rely
>> on automatic training, or a combination of manual/automatic training,
>> so that you reach the prerequisite of 200 spam/ham before Bayes starts
>> scoring.
>> The third option is to "borrow" someone else's Bayes database and
>> start scoring directly. Obviously this is what you were about to do
>> here.
>>
>> The problem(s) would be that:
>> - You have no real knowledge of what is in the Bayes db.
>> - It might have been trained on things that would FP/FN a lot for your
>> organisation/mail flow.
>> - You would have to watch the hit rate for Bayes pretty closely at the
>> start. Now... That would be something you'd need to do anyway:-).
>>
>> So a starter DB will give you Bayes scoring, but might be seriously
>> out of date wrt the current spam trends and would potentially be all
>> wrong for your mail flow, hence leading to FP/FN rates that might
>> "hurt" you.
>> A lot of suppositions there:-).
>> Once your system is up and running (after a while all the tokens will
>> tend to be from your mail flow) the "impact" lessens.
>>
>> So why use one, if there are risks? Well, for one thing... Score set
>> 3 might be a lot better than set 1 (which you'll have until Bayes
>> kicks in). You'll have no "sudden change" in scoring, as you would
>> otherwise (when Bayes kicks in), and that predictability is possibly
>> something to strive for.
>> Bottom line on that string of thoughts is that it might help you
>> detect more spam, thus running a better "laundry service".
>>
>> Which way to go? It all depends on your needs. If we did a poll, you'd
>> find some that would be using a starter DB and some that
>> "emphatically":-) wouldn't.
>> I, for one, did the manual/automatic training thing on an empty Bayes
>> ... some 5 years back, or so. I've migrated that db around ever
>> since, as well as used the "pure manual method" on some testbeds...
>> Others have the starter db thing in their notes for "how to set up a
>> new MS server", and are very happy with that.
>>
>> Cheers
>>
>
> I think I'm doing the manual/automatic method too, with a small test
> mail server. I don't know if it is exactly what you mean: I simply look at
> every single mail in MailWatch and check if the mail is tagged correctly. If
> not, I change the SA learning flag and submit. :-)
Exactly what I meant;-).

> The start was hard but now it seems ok. After 1 month, MailScanner is
> working not perfectly but very acceptably: 1 wrong tag every 600-700 mails,
> and it is learning quickly.....
With a large throughput, the work will be "more automatic" after a
while, as implied by Peter.
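
For keeping an eye on when Bayes has enough training to start scoring
(the 200 spam / 200 ham prerequisite discussed earlier in the thread),
something like the following could help. It is only a sketch: it assumes
you feed it text shaped like the output of `sa-learn --dump magic`, with
the counter value in the third column, and that the thresholds are the
default 200 each. The function names are made up for illustration; check
the actual output on your own system before relying on it.

```python
# Assumed thresholds: 200 spam and 200 ham, as discussed in the thread.
MIN_SPAM = 200
MIN_HAM = 200


def bayes_counts(dump_text):
    """Return (nspam, nham) parsed from `sa-learn --dump magic`-style text.

    Assumes each magic line has five whitespace-separated fields, with the
    value in the third field and a label like 'non-token data: nspam' last.
    """
    counts = {}
    for line in dump_text.splitlines():
        parts = line.split(None, 4)
        if len(parts) == 5 and parts[4].startswith("non-token data:"):
            name = parts[4].split(":", 1)[1].strip()
            if name in ("nspam", "nham") and parts[2].isdigit():
                counts[name] = int(parts[2])
    return counts.get("nspam", 0), counts.get("nham", 0)


def bayes_is_scoring(dump_text, min_spam=MIN_SPAM, min_ham=MIN_HAM):
    """True once both counters have reached their thresholds."""
    nspam, nham = bayes_counts(dump_text)
    return nspam >= min_spam and nham >= min_ham
```

Typical use would be to capture the dump into a string and pass it to
bayes_is_scoring(); until it returns True, Bayes will probably not be
contributing to the score.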

> I was just dreaming that with a starter db SA could learn in 1 second, with
> no more manual control. But I forgot that I can now take my small bayes test
> db and put it on my official server that is about to start....
> wish me good luck... :-)
>
We all long for "silver bullets" from time to time:-).
Best of luck to you and your new server!
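
When moving a Bayes db to a new server, a quick sanity check that the
files ended up where bayes_path points can save some head-scratching.
A minimal sketch, assuming the default DBM storage, where bayes_path is a
filename prefix and the files are named by appending _toks, _seen and
_journal (the journal may legitimately be absent right after a sync, so
it is treated as optional here). The helper names are hypothetical, and
this does not check ownership or permissions.

```python
import os

# Assumption: default DBM storage; bayes_path is a prefix, not a directory.
REQUIRED_SUFFIXES = ("_toks", "_seen")
OPTIONAL_SUFFIXES = ("_journal",)  # may be absent right after a sync


def expected_bayes_files(bayes_path):
    """List every file name we would expect for the given bayes_path prefix."""
    return [bayes_path + s for s in REQUIRED_SUFFIXES + OPTIONAL_SUFFIXES]


def missing_bayes_files(bayes_path):
    """Return the required files that do not exist on disk."""
    return [
        bayes_path + s
        for s in REQUIRED_SUFFIXES
        if not os.path.exists(bayes_path + s)
    ]
```

With bayes_path set to /etc/MailScanner/bayes/bayes, an empty list from
missing_bayes_files() would suggest the copy landed in the right place.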

Cheers
-- 
-- Glenn
email: glenn < dot > steen < at > gmail < dot > com
work: glenn < dot > steen < at > ap1 < dot > se

