More spam after spamassassin upgrade

Sat Jul 26 04:56:47 IST 2003

> >
> > Why is this not an option???....
>
> What process would you suggest I use for getting message feedback from
> 20,000 users when we don't have individual config files/directories for
> users on our mail servers (and even if we did, how would that interact
> with messages to multiple users or mailing lists, neither of which are
> expanded before they get to mailscanner), we don't have anyone on staff
> who can review user submissions of false positives/false negatives (we
> are NOT going to blindly accept it when a user says 'this should have
> been spam', partially because users will not always agree upon the
> issue), and the things that auto-learning handles aren't the things I'm
> worried about (auto-learning basically strengthens the system's resolve
> about high scoring spam ... what I'm concerned about is changing the
> scores of low-scoring spam; when I call sa-learn on my home machine, I
> never call it on high scoring spam messages, for example -- I call it
> upon messages that were _lower_ than my threshold)?
>

WOW...ok...one at a time.  I don't see why you would need feedback from 20K
employees.  If the challenge you are concerned with is stopping a LARGE
percentage of SPAM, then using a public corpus of spam to train a
bayes-enabled SA implementation would do just that, WITHOUT requiring any
individual user interaction.  That said, a bayes database doesn't HAVE to be
user specific...as a matter of fact, you are asking for a major headache to
even think about per-user databases.  Create a single one that is
"relatively" accurate and slow the river to a stream as a first
step  :)
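
A minimal sketch of that first step, in Python.  It only builds the
sa-learn command lines for one site-wide training run.  The corpus paths are
hypothetical, and it assumes a ham (non-spam) corpus is available alongside
the spam one, since the bayes engine needs examples of both.  Check the
sa-learn man page for your SA version for the exact flag names.

```python
import shlex


def build_sa_learn_commands(spam_dir, ham_dir, sa_learn="sa-learn"):
    """Return the sa-learn invocations for one site-wide training run.

    spam_dir and ham_dir are hypothetical corpus directories; nothing here
    is per-user, so a single shared bayes database gets trained.
    """
    return [
        [sa_learn, "--spam", spam_dir],  # learn the public spam corpus
        [sa_learn, "--ham", ham_dir],    # learn known-good mail
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed
    # (or dropped into a cron job) first.
    for cmd in build_sa_learn_commands("/var/corpus/spam", "/var/corpus/ham"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Run as the account that owns the shared bayes database, the same two
commands could be repeated on whatever schedule the corpus is refreshed.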

Dealing with your multiple-user issue is easy: create a dedicated email
"hop" with a standalone server (or a cluster, perhaps, in your case) that
handles all email inbound and/or outbound.  Your mailstore will send email
to and from the Internet (no matter how many users are included in the
distribution) as separate messages.  Besides, if your organization is that
large, I'm guessing that you are less concerned about SPAM going from user
to user on your mail infrastructure and more concerned about inbound email
from the Internet (or another untrusted network entity).

If there is nobody on staff to handle submissions of
false-positives/false-negatives, I'm curious how you plan to handle
this WITHOUT something like 90% accuracy.  If you are concerned with users
polluting the bayes database, then simply don't allow them to learn messages
into it.  Read "A Plan for Spam" by Paul Graham (
http://www.paulgraham.com/spam.html ).  He outlines the concept very simply
and the math makes sense.  Users would have to do some SERIOUS work
to pollute a bayes database to the point where it causes a single false
positive, but again...if you're worried, then don't allow it..."teach" the
engine on a new public corpus once per month...and your accuracy will be
maintained pretty well.
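
To make the math a little more concrete, here is a small Python sketch of
the combining formula from Graham's essay, where the per-token spam
probabilities p1..pn are combined as
(p1*...*pn) / (p1*...*pn + (1-p1)*...*(1-pn)).  The token values below are
made up for illustration, and SpamAssassin's own bayes code may combine
probabilities differently, so treat this as an illustration of the idea
only.

```python
from math import prod


def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities as in "A Plan for Spam"."""
    spam = prod(token_probs)
    ham = prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)


if __name__ == "__main__":
    # Hypothetical message: ten strongly spammy tokens, plus five tokens
    # that a user has managed to "pollute" into looking innocent.
    tokens = [0.99] * 10 + [0.01] * 5
    print(combined_spam_probability(tokens))
```

Each polluted token only cancels out one equally strong spammy token, which
is why it takes a lot of deliberate mis-training to flip a verdict.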

If the auto-learning things aren't what you're worried about, then I'm
puzzled about why you mention them.  I'm not sure I understand your concern
about low scoring SPAM...the bayes scores in SA are only a PART of
the overall score (which is exactly the right way to do it, by the way)...so
if you are concerned about the bayes score affecting low (or high) scoring
spam the wrong way, then throttle the score for bayes and leverage the rest
of SA.
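
As a rough illustration of what "throttling" means, the Python sketch below
treats a message's score as the sum of the scores of the rules it hit and
caps the contribution of any bayes rule.  The rule names, scores, and the
cap are hypothetical; in SpamAssassin itself the equivalent knob would be
the score lines for the BAYES_* rules in the configuration, not code like
this.

```python
def total_score(rule_hits, bayes_prefix="BAYES_", bayes_cap=1.0):
    """Sum rule scores, limiting how much any bayes rule can contribute.

    rule_hits maps a rule name to the score it would normally add.
    """
    total = 0.0
    for name, score in rule_hits.items():
        if name.startswith(bayes_prefix):
            # Clamp the bayes contribution to the range [-cap, +cap].
            score = max(-bayes_cap, min(bayes_cap, score))
        total += score
    return total


if __name__ == "__main__":
    # Hypothetical hits for one message.
    hits = {"BAYES_99": 5.0, "HYPOTHETICAL_HEADER_RULE": 2.5}
    print(total_score(hits, bayes_cap=1.0))
```

With the cap in place, bayes can nudge a borderline message but cannot
decide the outcome on its own.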

> I just don't see how bayes would fit into our situation.
>
> Down the road, I'm looking in to how to apply something I use at home
> (where I have a "learn" and an "unlearn" folder, and my home mail
> server automatically runs through those folders every night at 5am,
> learning about things it did wrong) to our mail servers ... but at home
> I've got _2_ users and at work I've got 10,000 times that many users.
> At home, spamassassin is called out of my .forward, so there's never
> confusion about whose bayes database to use, and there aren't any
> non-user recipients like mailing lists.
>
Your concern is valid.  I simply think that SA would solve a big part of
your problems if it were done in a "gateway" configuration, meaning that it
routes all mail, internal and external, applies policy, and checks for SPAM
with a single static bayes database.  You are doing things exactly right at
home, with granular tuning and learning user-by-user...but a large
percentage of the benefit can be realized by the greater group by using a
single bayes database that gets updated from a public corpus once per month
(or per week, as it is likely easy to automate).  Since you KNOW that what
is being learned is definitely SPAM, it simply reinforces the concepts in
Paul's paper.

> I'm not sure that the mechanism will translate well to my production
> servers at work.  There's the issue of server load as it tries to
> update 20,000 bayes databases (hopefully the low-usage window will be
> long enough to let all of those updates happen before usage picks  back
> up), there's adding a front end that expands all messages to 1 end user
> recipient per message before it gets submitted to mailscanner (which
> means more work for mailscanner, as mailscanner will now see 10
> messages instead of 1, if the message has 10 recipients), and there's
> the issue of where to put the user data files.  If it does, then using
> bayes will make sense.  Otherwise, I just don't see how it will fit my
> environment.

Load becomes less of an issue (it doesn't totally go away) because you don't
have 20,000 bayes databases.  A gateway implementation for a 20K-user mail
infrastructure may have to be relatively big...but instead of handling 20K
bayes databases, you are only really worrying about 1.  In addition, a
gateway only scans each message once, as it sees it...if you think
of the big picture, consider this:

A SPAM message comes into the gateway...if it's addressed to 20 different
people, SA only needs to scan it once with its OWN bayes db...it is either
SPAM or not...it then sends the message on to your mail server
infrastructure for final delivery...
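
The same flow as a tiny Python sketch.  The scan callable stands in for the
SA check against the single shared bayes db, and the function and variable
names are hypothetical; the point is only that the scan happens once per
message, not once per recipient.

```python
def gateway_process(message, recipients, scan):
    """Scan a message once, then hand it on for every recipient."""
    verdict = scan(message)  # one scan, one shared bayes database
    # The verdict travels with the message to the mail servers, which do
    # the per-recipient delivery.
    return [(rcpt, message, verdict) for rcpt in recipients]


if __name__ == "__main__":
    calls = []

    def fake_scan(msg):
        # Stand-in scanner that records how often it is called.
        calls.append(msg)
        return "spam"

    rcpts = ["user%d@example.com" % i for i in range(20)]
    deliveries = gateway_process("raw message text", rcpts, fake_scan)
    print(len(deliveries), "deliveries,", len(calls), "scan")
```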

just an idea...seems an easy problem to solve.

:)

CT