Training the Bayesian engine and sa-learn

Wed Sep 3 21:18:08 IST 2003

On Wed, 2003-09-03 at 20:33, Chris Lyon wrote:

>So, I have been reading the FAQ and also the past posts but have a little
>confusion that I need to resolve. Just to give a little background, I have
>a lot of users who all have issues with e-mail that is being marked as spam
>or not being marked as spam. So, I think the answer to this is to have them
>forward the messages to an unattended mailbox that will autowhitelist or
>autoblacklist the sender.  Is that what sa-learn is all about?

Not quite.  sa-learn is for tuning the Bayes classifier.  It doesn't
whitelist or blacklist anything: it tokenises the mail content and
stores, for each token, a probability of it appearing in a spam or ham
mail.  These probabilities are then used to determine the probability
of a future message being spam or ham.

>So, if I create a spam and non-spam account on server and use the sa-learn
>to check the messages that my users forward to these accounts, if something
>was marked as spam and is not, further messages will not be marked again?

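No.  Learning a message as ham only reduces the spam probability
associated with the tokens which appear in that mail, so similar
messages become less likely to be marked, not guaranteed to pass.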
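
To make the token idea concrete, here is a minimal sketch in Python of
a per-token spam/ham counter.  It is a toy under stated assumptions,
not SpamAssassin's actual Bayes implementation: the class name
ToyBayes, the tokeniser regex, the smoothing and the log-odds
combination are all illustrative choices of mine.

```python
import math
import re
from collections import Counter


class ToyBayes:
    """Toy per-token spam/ham counter (NOT SpamAssassin's real algorithm)."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spam mails containing it
        self.ham_tokens = Counter()   # token -> number of ham mails containing it
        self.n_spam = 0
        self.n_ham = 0

    @staticmethod
    def tokenise(text):
        # Hypothetical tokeniser: lower-cased word tokens, one count per mail.
        return set(re.findall(r"[a-z0-9$']+", text.lower()))

    def learn(self, text, is_spam):
        tokens = self.tokenise(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.n_spam += 1
        else:
            self.ham_tokens.update(tokens)
            self.n_ham += 1

    def token_spam_prob(self, token):
        # Smoothed share of spam vs ham mails that contain this token.
        s = (self.spam_tokens[token] + 1) / (self.n_spam + 2)
        h = (self.ham_tokens[token] + 1) / (self.n_ham + 2)
        return s / (s + h)

    def message_spam_prob(self, text):
        # Combine the per-token probabilities naively in log-odds space.
        log_odds = 0.0
        for token in self.tokenise(text):
            p = self.token_spam_prob(token)
            log_odds += math.log(p) - math.log(1 - p)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() in range
        return 1 / (1 + math.exp(-log_odds))


clf = ToyBayes()
clf.learn("cheap pills buy now", is_spam=True)
clf.learn("meeting agenda for monday", is_spam=False)

msg = "buy our new survey product now"
before = clf.message_spam_prob(msg)
clf.learn(msg, is_spam=False)  # a user reports this mail as a false positive
after = clf.message_spam_prob(msg)
print(f"spam probability before: {before:.2f}, after: {after:.2f}")
```

Learning the false positive as ham lowers the score for that mail and
for mails sharing its tokens, but nothing about the sender is
whitelisted: a later message with different content is scored on its
own tokens.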

Auto white/blacklists are a bad idea - search for autowhitelist in the
archives for a discussion.

You can best improve the accuracy by adding DCC, razor2 and pyzor, and
by letting SA do RBL checks rather than MailScanner.

I found that the majority of my false positives came from a very few
sources, mainly clients of one particular department which has several
customers in Asia & Africa using dodgy ISPs.  I added some SA rules
assigning a negative score to the names of that department's products,
which helped.

>How does it work, based on content I would assume or does it work by the
>domain? Also, what happens with stuff being forwarded from different mail
>clients like outlook?

Outlook is very bad at forwarding messages unaltered.  I got round
this by using the attachment option in MailScanner (which also allowed
me to add some info for users), then using a script I found online to
strip the original message from the attachment.
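
I don't have that script to hand, but the idea can be sketched with
Python's standard email package.  This is a minimal illustration, not
the script mentioned above, and it assumes the original mail arrives as
a message/rfc822 part inside the wrapper; the function name
extract_attached_messages and the example addresses are made up.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_attached_messages(raw_bytes):
    """Return the raw bytes of every message/rfc822 part in a wrapper mail."""
    wrapper = message_from_bytes(raw_bytes, policy=policy.default)
    originals = []
    for part in wrapper.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-item list
            # holding the attached message, original headers included.
            inner = part.get_payload(0)
            originals.append(inner.as_bytes())
    return originals


# Build a small wrapper mail to try the function on.
inner = EmailMessage()
inner["From"] = "sender@example.com"
inner["Subject"] = "Original subject"
inner.set_content("original body\n")

wrapper = EmailMessage()
wrapper["From"] = "user@example.com"
wrapper["Subject"] = "Fwd: wrongly marked as spam"
wrapper.set_content("Please learn the attached mail as ham.\n")
wrapper.add_attachment(inner)  # stored as a message/rfc822 part

originals = extract_attached_messages(wrapper.as_bytes())
recovered = message_from_bytes(originals[0], policy=policy.default)
print(recovered["Subject"])
```

Each item returned is a complete mail with its own headers, which is
what you want to hand to sa-learn (with --spam or --ham as
appropriate) rather than the wrapper the user sent.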

I strongly recommend getting a good handle on how SA works (by reading
the docs, particularly the Mail::SpamAssassin::Conf docs and the Bayes
documentation) before trying to tune it.

BMRB International
http://www.bmrb.co.uk
+44 (0)20 8566 5000