sa learn bayes starter DB

Tue Feb 3 14:50:51 GMT 2009

Glenn Steen wrote:
> 2009/2/3 lorenzo <lorenzo at argroup.it>:
>   
>> Glenn Steen wrote:
>>     
>>> 2009/2/2 Kai Schaetzl <maillists at conactive.com>:
>>>
>>>       
>>>> Lorenzo wrote on Mon, 02 Feb 2009 18:21:36 +0100:
>>>>
>>>>
>>>>         
>>>>> How can I use this DB with SpamAssassin?
>>>>>
>>>>>           
>>>> isn't that information in the download?
>>>>
>>>> Kai
>>>>
>>>>
>>>>         
>>> Nah, it's just a tarball of a bayes directory with the usual three
>>> files... Unpack it somewhere (/etc/MailScanner/), set bayes_path
>>> accordingly (/etc/MailScanner/bayes/bayes), fix ownership and
>>> permissions ... and you are done.
>>>
>>> Whether one should use a DB other than one from one's own mail flow... is
>>> another question:)
>>>
>>> Cheers
>>>
>>>       
>> If you have a fresh mail server, is it a good choice to use this starter
>> DB, or is it preferable to start with a new Bayes DB?
>>
>>
>>     
> This is more philosophical than technical...:-).
> The "best" thing to do is to have 200-1000 spam messages and 200-1000
> ham (non-spam) messages, harvested from your normal mail flow, and
> manually train Bayes on these.
> Another option is to set things up with an empty Bayes and either rely
> on automatic training, or a combination of manual/automatic training,
> so that you reach the prerequisite of 200 spam and 200 ham messages
> before Bayes starts scoring.
> The third option is to "borrow" someone else's Bayes database and
> start scoring directly. Obviously this is what you were about to do
> here.
>
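The 200 spam / 200 ham prerequisite mentioned above can be checked from the output of `sa-learn --dump magic`. What follows is only a minimal Python sketch: the sample text and its numbers are invented, the function names are hypothetical, and the column layout is assumed to match what sa-learn prints on a typical install. The threshold of 200 is the usual default of the SpamAssassin settings bayes_min_spam_num and bayes_min_ham_num.

```python
import re

# Illustrative sample of `sa-learn --dump magic` output (values invented).
SAMPLE_DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0        150          0  non-token data: nspam
0.000          0        420          0  non-token data: nham
0.000          0      51234          0  non-token data: ntokens
"""

MIN_MESSAGES = 200  # assumed default for bayes_min_spam_num / bayes_min_ham_num


def learned_counts(dump_text):
    """Return the learned spam and ham message counts found in the dump."""
    counts = {}
    for line in dump_text.splitlines():
        # The count is the number two columns before the "non-token data" label.
        match = re.search(r"(\d+)\s+\d+\s+non-token data: (nspam|nham)\s*$", line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def bayes_is_scoring(dump_text, minimum=MIN_MESSAGES):
    """True once both the spam and the ham count reach the minimum."""
    counts = learned_counts(dump_text)
    return counts.get("nspam", 0) >= minimum and counts.get("nham", 0) >= minimum


print(learned_counts(SAMPLE_DUMP))
print(bayes_is_scoring(SAMPLE_DUMP))
```

With the sample above, Bayes would not be scoring yet, because only 150 spam messages have been learned even though the ham count is well past 200.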
> The problem(s) would be that:
> - You have no real knowledge of what is in the Bayes db.
> - It might have been trained on things that would FP/FN a lot for your
> organisation/mail flow.
> - You would have to watch the hit rate for Bayes pretty closely at the
> start. Now... that would be something you'd need to do anyway:-).
>
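One rough way to watch the Bayes hit rate is to tally which BAYES_nn rules fire across a batch of scanned messages. This is only a Python sketch: the sample lines are invented and merely resemble the tests= part of a SpamAssassin status header, and the function name is hypothetical. Your own logs or MailWatch reports will look different, so the only thing the code relies on is the rule names themselves (BAYES_00 through BAYES_99).

```python
import re
from collections import Counter

# Invented examples; real header or log lines will be formatted differently.
SAMPLE_STATUS_LINES = [
    "Yes, score=12.4 required=5.0 tests=BAYES_99,HTML_MESSAGE,RCVD_IN_XBL",
    "No, score=-1.9 required=5.0 tests=BAYES_00,SPF_PASS",
    "No, score=0.8 required=5.0 tests=BAYES_50,HTML_MESSAGE",
    "No, score=-1.9 required=5.0 tests=BAYES_00",
    "No, score=1.1 required=5.0 tests=HTML_MESSAGE",
]

BAYES_RULE = re.compile(r"\bBAYES_\d{2,3}\b")


def bayes_hit_counts(status_lines):
    """Count how often each BAYES_nn rule appears, plus lines with none."""
    counts = Counter()
    for line in status_lines:
        hits = BAYES_RULE.findall(line)
        counts.update(hits if hits else ["(no Bayes rule)"])
    return counts


for rule, count in sorted(bayes_hit_counts(SAMPLE_STATUS_LINES).items()):
    print(rule, count)
```

A borrowed DB that puts a lot of your known-good mail in the high BAYES buckets, or known spam in BAYES_00, is the FP/FN problem described above showing up in the numbers.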
> So a starter DB will give you Bayes scoring, but might be seriously
> out of date wrt the current spam trends and would potentially be all
> wrong for your mail flow, hence leading to FP/FN rates that might
> "hurt" you.
> A lot of suppositions there:-).
> Once your system is up and running (after a while all the tokens will
> tend to be from your mail flow) the "impact" lessens.
>
> So why use one, if there are risks? Well, for one thing... score set
> 3 might be a lot better than set 1 (which you'll have until Bayes
> kicks in). You'll have no "sudden change" in scoring, as you would
> otherwise (when Bayes kicks in), and that predictability is possibly
> something to strive for.
> Bottom line on that train of thought is that it might help you
> detect more spam, thus running a better "laundry service".
>
> Which way to go? It all depends on your needs. If we did a poll, you'd
> find some who would be using a starter DB and some who
> emphatically:-) wouldn't.
> I, for one, did the manual/automatic training thing on an empty Bayes
> ... some 5 years back, or so. I've migrated that DB around ever
> since, as well as used the "pure manual method" on some testbeds...
> Others have the starter DB thing in their notes for "how to set up a
> new MS server", and are very happy with that.
>
> Cheers
>   
I think I'm doing the manual/automatic method too, with a small test
mail server. I don't know if it is exactly what you mean: I simply look at
every single mail in MailWatch and check whether it is tagged correctly.
If not, I change the SA learning flag and submit. :-)
The start was hard, but now it seems OK. After 1 month MailScanner
is working, not perfectly but very acceptably: 1 wrong tag every 600-700
mails, and it is learning quickly.....
I was just dreaming that with a starter DB SA could learn in 1 second, with
no more manual control. But I forgot that I can now take my small Bayes
test DB and put it on my official server, which is about to go live....
Wish me good luck... :-)


-- 
Lorenzo Santi
aura srl

