Ideas for improved bayes learning

Wed Sep 19 13:22:23 IST 2007

Hi Gareth,

Gareth wrote:
> Bayes normally autolearns a mail as spam if the score is over 20.
> This is configurable.
> Many of us use RBLs on the MTA to reject known spam.
> 
> I was thinking that it might be useful, instead of rejecting the RBL
> mail, to accept it, train bayes using it and then discard it.

I had this idea too a while back.  I discarded it for the following reasons:

1)  Accepting mail that would be rejected at the MTA level is not 
practical for anything but low-volume sites, as the ratio of mail 
rejected due to blacklists to good mail is usually at least 5:1 (and 
that is the low end).

2)  With bayes, it is desirable to balance what is learnt, with a 
roughly even number of spam and ham tokens.  Based on point 1 above, if 
you learn *all* mail from clients listed on an RBL then you'll end up 
with way more spam tokens than ham tokens.

3)  Training bayes is CPU intensive, which goes back to point 1.  I 
don't have the numbers, but I think it would be more efficient to learn 
in batches instead of one message at a time.  Doing this in MailScanner 
would hold up the children doing the training instead of processing 
mail.

> However I believe that the RBL checks that spamassassin performs are on
> all the received lines and not just the IP address our mail servers
> received the mail from?
> If that is correct then I cannot simply assign a high score to the RBL
> checks and have mailscanner delete very high scoring mail.

Yes, this is correct; SA works out which Received headers are trusted 
and which are untrusted, then tests them accordingly.  I don't see any 
reason why you couldn't just score them high if you wanted to, though.

> 
> Ideally what I was thinking of would be a couple of enhancements to
> MailScanner:
> 
> 1) Add a new action of sa-learn-spam so the mail can be learnt. You can
> use a custom rule to fire this if a RBL matches so the mail is learnt
> and then deleted.
> 
> 2) Incorporate MailScanner's RBL feature (I assume this one only checks
> one received header) into the rules which can be used when writing a
> custom action.
> 
> It's only an idea and not a request for the new feature. Personally
> MailScanner is working very well for us so at this time it is not worth
> allowing all the extra mail in just to improve the bayes effectiveness.

The only way I could come up with to do this effectively was to check 
the bayes statistics (which show the ham and spam token counts) each 
day, see whether the spam token count is less than the ham count, and 
then train bayes on n messages to make up the difference.
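
As a rough illustration of that daily check, here is a minimal Python 
sketch.  It assumes the statistics come from `sa-learn --dump magic`, 
whose "non-token data" lines report nspam and nham (the numbers of 
messages learnt as spam and ham), and it uses those as the counts to 
balance.  The function names and the assumed column layout are mine, 
not anything MailScanner or SpamAssassin provides.

```python
import subprocess


def read_dump():
    """Return the text printed by `sa-learn --dump magic` (needs SpamAssassin)."""
    result = subprocess.run(
        ["sa-learn", "--dump", "magic"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_bayes_counts(dump_text):
    """Pull nspam and nham out of the dump.

    Assumes each line has four whitespace-separated columns followed by a
    label such as "non-token data: nspam", with the value in column three.
    """
    counts = {}
    for line in dump_text.splitlines():
        parts = line.split(None, 4)
        if len(parts) != 5:
            continue
        label = parts[4].strip()
        if label in ("non-token data: nspam", "non-token data: nham"):
            counts[label.rsplit(" ", 1)[1]] = int(parts[2])
    return counts


def spam_shortfall(counts):
    """Number of extra spam messages needed to catch up with ham (0 if none)."""
    return max(0, counts["nham"] - counts["nspam"])
```

A daily cron job could call `spam_shortfall(parse_bayes_counts(read_dump()))` 
and train on that many saved spam messages, doing nothing when the 
result is 0.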

It would be wildly inefficient to let everything in from the MTA just 
to do this.  Ideally you would tell the MTA to send you a certain 
number of RBL messages per hour and redirect them to a special mailbox 
(bypassing MailScanner) for training, but I wouldn't know how to 
attempt that in Sendmail or Postfix (I expect Exim could do it with a 
few tricks).

Based on all of the above, I think the most efficient way to handle 
bayes is via mistake-based training.  Train it on any messages that it 
classifies incorrectly and it will do the right thing in time.
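
A minimal sketch of what mistake-based training might look like in 
practice, assuming misclassified messages have been saved one per file 
and are fed to `sa-learn --spam` or `sa-learn --ham` according to what 
they really were.  The helper names and the file paths are hypothetical.

```python
import subprocess


def sa_learn_command(path, actual_class):
    """Build the sa-learn command for one misclassified message file."""
    if actual_class not in ("spam", "ham"):
        raise ValueError("actual_class must be 'spam' or 'ham'")
    return ["sa-learn", "--" + actual_class, path]


def retrain(mistakes, run=subprocess.run):
    """Train on (path, actual_class) pairs; `run` can be swapped for testing."""
    for path, actual_class in mistakes:
        run(sa_learn_command(path, actual_class), check=True)
```

For example, `retrain([("/var/spool/missed/1.eml", "spam")])` would 
relearn a spam message that slipped through as ham.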

Cheers,
Steve.