More spam after SpamAssassin upgrade

Sat Jul 26 07:33:36 IST 2003

On Friday, Jul 25, 2003, at 20:56 US/Pacific, Chris Trudeau wrote:

>>> Why is this not an option???....
>>
>> What process would you suggest I use for getting message feedback from
>> 20,000 users when we don't have individual config files/directories for
>> users on our mail servers (and even if we did, how would that interact
>> with messages to multiple users or mailing lists, neither of which are
>> expanded before they get to mailscanner), we don't have anyone on staff
>> who can review user submissions of false positives/false negatives (we
>> are NOT going to blindly accept it when a user says 'this should have
>> been spam', partially because users will not always agree upon the
>> issue), and the things that auto-learning handles aren't the things I'm
>> worried about (auto-learning basically strengthens the system's resolve
>> about high scoring spam ... what I'm concerned about is changing the
>> scores of low-scoring spam; when I call sa-learn on my home machine, I
>> never call it on high scoring spam messages, for example -- I call it
>> upon messages that were _lower_ than my threshold)?
>>
>
> WOW...ok...one at a time...I don't see why you would need feedback from
> 20K employees.  If the challenge you are concerned with is stopping a
> LARGE percentage of SPAM, then using a public corpus of spam to train a
> bayes enabled SA implementation would do just that.  WITHOUT having to
> have individual user interaction with anything..  With that said...
> creating a bayes database doesn't HAVE to be user specific...as a matter
> of fact, you are asking for a major headache to even think about it..
> Create a single one that is "relatively" accurate and slow the river to
> a stream as a first step  :)

Right, but what am I feeding that single large corpus with?  Just my
own mail feed?  That won't fit anyone but me.  Who will decide who puts
things into the "this is spam" and "this is ham" parts of the corpus?
And, this is a university, which means there are tons of political
issues that come up in just discussing those questions, much less
answering them and putting them into production.  AND we're in the
middle of the State of California budget crisis -- we're already
overworked, so there will be NO human resource for managing that
corpus.  And, as I've said, we will not allow a central corpus/database
to be unmanaged (meaning "blind user contributions"), either.

And even if I accept people's spam submissions for the corpus, where do
I get the ham part of the corpus?  It is _illegal_ for me to use
messages they haven't explicitly given to me, so I'll only have
messages that they think to submit ... which won't include all of those
messages that might contain private or confidential correspondences and
other things that would be really ideal for training.  So, it really
comes back to "just my own mail feed?"  which is a poor choice, IMO.
In that case, it's better to not do bayes at all, IMO.

But, even if I did think it was worth doing on that basis, it would
again be a political, policy, and perhaps legal, issue.  I would be
indirectly deciding how to mark other people's email based upon my own
personal biases (because _I_ am the one deciding what gets into the
spam and ham databases).  To a certain extent, the same can be said
about the people who build the SpamAssassin corpus, but there are
differences there (starting with "I work here, I work in central IT,
and therefore I'm 'the man'", which means I'm always viewed with
suspicion when it comes to things that might be considered censorship
and the like).

The technical question is "do I have enough system power to allow users
to have their own bayes db's", and the political/social/practical
question is how do we build and manage a central bayes db.  And, of the
two issues, the technical question is by far the easier one to solve.
I'll even go so far as to say that the political/social/practical one
is not solvable.

But, until we have those technical resources in place (if we ever have
them in place), it doesn't make sense to move forward with bayes.

> To deal with your multiple user issue is easy

We've already got 2 server hops (SMTP servers, which also do the
mailing list expansion, and a POP server) that cause us headaches (mail
backing up in one place that may be caused by some other link in the
chain, managing when accounts show up and disappear in each place,
etc.).  Adding a third would be absolutely the wrong solution.

The way I was looking at doing it: We're moving to a CommuniGate Pro
dynamic cluster, so the part where the CommuniGate Pro processing rules
hand messages to MailScanner will be in the "domain" level rules (which
happens after mailing list expansion and such, but before user owned
rules).

How to solve the technical problem is actually easy.  The question is
how many system resources I'll need in order to do it without making
my machines grind to a halt.  Even if I run it at 5am, the
system still has to be doing transactions during that time frame
without much of a delay in service (for professors and students who are
abroad, and thus checking at off hours ... or for people like me who
are night owls, or for early risers, etc.)
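
A rough way to frame that resource question is a back-of-envelope
check of whether one update per user fits into the quiet window.  The
sketch below is only an illustration: the 20,000 users figure comes
from this thread, but the seconds-per-update, window length, and worker
count are made-up placeholders rather than measurements, and the
function name is hypothetical.

```python
def nightly_update_fits(users, seconds_per_update, window_hours, workers=1):
    """Estimate whether per-user Bayes updates fit in a low-usage window.

    Returns (hours_needed, fits).  Assumes updates are independent and
    spread evenly across `workers` parallel processes.
    """
    total_seconds = users * seconds_per_update
    hours_needed = total_seconds / workers / 3600
    return hours_needed, hours_needed <= window_hours


# 20,000 users is from the thread; the other numbers are placeholders
# that would have to be replaced with measured values.
hours, fits = nightly_update_fits(20000, seconds_per_update=2.0,
                                  window_hours=3.0, workers=4)
print(f"{hours:.2f} hours needed, fits in window: {fits}")
```

The estimate ignores the load from normal mail transactions during the
same window, which is exactly the part that would need real
measurement.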

>> I just don't see how bayes would fit into our situation.
>>
>> Down the road, I'm looking into how to apply something I use at home
>> (where I have a "learn" and an "unlearn" folder, and my home mail
>> server automatically runs through those folders every night at 5am,
>> learning about things it did wrong) to our mail servers ... but at home
>> I've got _2_ users and at work I've got 10,000 times that many users.
>> At home, spamassassin is called out of my .forward, so there's never
>> confusion about whose bayes database to use, and there aren't any
>> non-user recipients like mailing lists.
>
> Your concern is valid.  I simply think that SA would solve a big part of
> your problems if it were done in a "gateway" configuration.

Well, yes, but using SA isn't the question.  We already use SA (at home
and at work).  The question is using the bayes part of SA at work.

> You are doing things exactly right at
> home making granular tuning and learning user-by-user

Yes, I know I'm doing the right thing there :-)

My home email address is user at domain.org.  I get over 200 spam messages
per day.  On most days, I see maybe 1 or 2 of them, and I've had 1
false positive with SA in over a year*.  I'm very confident that I've
got the right solution for my home environment.  I never said that I
didn't think SA or bayes were useful, I just don't think bayes fits my
current work environment ... and even in the new environment, I'm not
sure we'll have the processing power to make it fit there, either.

(* it's actually more complex than I described.  I have 4 spam folders:
Spam/Learn, Spam/Unlearn, Spam/Blacklist and Spam/Whitelist.  Messages
that I manually drop into Learn go to bayes and razor2 and then get put
into the blacklist folder (Unlearn messages only get sent to bayes).
Messages that are in the blacklist and whitelist folders get
incorporated into the auto-whitelist facility.  Messages that are
getting marked as spam automatically by SA go into the blacklist folder
(making it unlikely that I'll ever see messages from that sender ever
again, which matters more for corporate product registration mailing
lists and unconfirmed mailing lists).  After the blacklist stage, those
messages go into a mail folder outside of my IMAP area, and I almost
never actually look in that folder.  Instead, I have a cron job
that extracts information from my procmail log about who got
automatically blacklisted, and sends me a report every night.  If I see
a false positive, I can go fish it out of that external mail folder and
get them whitelisted.  I've only had to do that once.   Once every few
months, I'll look in my auto-whitelist to see who has both a high
message count and a very high score ... and I'll add them to my
sendmail access db's reject rules, so that I can lighten my processing
load.)
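
For anyone wanting to try the Learn/Unlearn part of that at home, a
minimal sketch of the nightly job might look like the following.  It
is not the actual script described above.  It assumes the two folders
are mbox files under a hypothetical Spam directory, and it assumes
"Unlearn" means "relearn this as ham"; the razor2 reporting and the
blacklist/whitelist steps are left out.  The sa-learn options used
(--spam, --ham, --mbox) are standard ones.

```python
import subprocess
from pathlib import Path


def build_sa_learn_commands(spam_root):
    """Return the sa-learn command lines for the Learn and Unlearn folders.

    Assumption: both folders are mbox files, and messages dropped in
    Unlearn should be learned as ham.
    """
    root = Path(spam_root)
    return [
        ["sa-learn", "--spam", "--mbox", str(root / "Learn")],
        ["sa-learn", "--ham", "--mbox", str(root / "Unlearn")],
    ]


def run_nightly(spam_root, dry_run=True):
    """Print the commands (dry run) or execute them in order."""
    for cmd in build_sa_learn_commands(spam_root):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the job if sa-learn exits with an error
            subprocess.run(cmd, check=True)


# Hypothetical path; a cron entry at 5am would call this with dry_run=False.
run_nightly("/home/user/mail/Spam", dry_run=True)
```

With one user and one database this is trivial, which is the point:
the same loop run once per user for 20,000 users is where the load
comes from.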

>> I'm not sure that the mechanism will translate well to my production
>> servers at work.  There's the issue of server load as it tries to
>> update 20,000 bayes databases (hopefully the low-usage window will be
>> long enough to let all of those updates happen before usage picks back
>> up), there's adding a front end that expands all messages to 1 end user
>> recipient per message before it gets submitted to mailscanner (which
>> means more work for mailscanner, as mailscanner will now see 10
>> messages instead of 1, if the message has 10 recipients), and there's
>> the issue of where to put the user data files.  If it does, then using
>> bayes will make sense.  Otherwise, I just don't see how it will fit my
>> environment.
>
> just an idea...seems an easy problem to solve.

Technically, it is easy to solve.  The question is whether I have
enough CPU, memory, and disk power to give each of my 20,000 users
their own bayes database.  Without individual user db's, bayes
doesn't make any sense for my environment due to entirely non-technical
issues with building the corpus that feeds the proposed central bayes
db.
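
To put rough numbers on the disk and scanning side of that question, a
sketch like the one below could be used.  Everything except the 20,000
users is an assumed placeholder (per-user database size, daily message
volume, average recipients per message), and the function name is
hypothetical; it only shows the shape of the arithmetic, including the
per-recipient expansion mentioned in the quoted text above.

```python
def per_user_bayes_footprint(users, db_megabytes, daily_messages,
                             avg_recipients):
    """Estimate disk use and daily scan count for per-user Bayes.

    disk_gb:       total size of all per-user databases, in GB
    scans_per_day: messages the scanner sees once each message is
                   expanded to one copy per recipient
    """
    disk_gb = users * db_megabytes / 1024
    scans_per_day = daily_messages * avg_recipients
    return disk_gb, scans_per_day


# Only the user count is from the thread; the rest are placeholders.
disk_gb, scans = per_user_bayes_footprint(users=20000, db_megabytes=5,
                                          daily_messages=100000,
                                          avg_recipients=2)
print(f"about {disk_gb:.1f} GB of databases, {scans} scans per day")
```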