Thoughts on new Bayes idea

Fri Jun 4 17:09:11 IST 2004

On Jun 4, 2004, at 8:08 AM, Max Kipness wrote:

> I've been pondering this idea for a while, but wanted some opinions on 
> how feasible it would be...and the load it would cause.
>  
> I currently have all users that receive spam that bypassed 
> MailScanner, simply forward the email to spam at ourdomain.com. The email 
> then got blacklisted and there was an option to put 'domain' in the 
> subject header to black list the entire domain. This worked well, but 
> the black list got up to around 1600 emails/domains and I started to 
> get many SA time outs. This was before implementing Bayes which is 
> working great, if not too good with false positives, but that's 
> another story.
>  
> My idea is to basically archive every email that enters the system 
> (through MS) for a period of a day or so. I've got a script that 
> deletes all emails older than a time specified from an mbox file. Then 
> using my script from above, have users forward the email to 
> spam at ourdomain.com, have a new script fetch that email out of the 
> archive and feed it to Bayes.
>  
> Any thoughts on this? Is it ridiculous?
>  
> Most of my users are on various Exchange servers, and there really is 
> no easy way to get the email fed into bayes. I know you can do a 
> public folder, but then you have to train each user how to get it 
> there, and they have to open the public folder tree, etc. Using IMAP 
> is even more administration. I've found that simply forwarding the 
> email somewhere is very easy for them.
>
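
For concreteness, the "fetch that email out of the archive" step might
look something like the Python sketch below.  It is only an illustration
under assumptions that are mine, not Max's: the archive is a single mbox
file, the forwarded copy is matched to the original by Subject alone
(a forward from Exchange generally does not preserve the original
headers), and the names normalize_subject and find_archived are
hypothetical.

```python
import mailbox
import re

# Assumption: forwards and replies only add prefixes such as "FW:", "Fwd:" or "Re:".
_PREFIX = re.compile(r"^\s*((fw|fwd|re)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Strip forward/reply prefixes, collapse whitespace, and lowercase."""
    stripped = _PREFIX.sub("", str(subject or ""))
    return " ".join(stripped.split()).lower()


def find_archived(mbox_path, forwarded_subject):
    """Return the archived messages whose Subject matches the forwarded one."""
    wanted = normalize_subject(forwarded_subject)
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [msg for msg in box
                if normalize_subject(msg["Subject"]) == wanted]
    finally:
        box.close()
```

Matching on Subject alone can return several candidates (or the wrong
message), so whatever is found would still need some care before being
handed to something like sa-learn.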

My main concern about these sorts of schemes is that one man's trash 
is another man's treasure.  As the size of your user base increases, it 
is inevitable that you will have users with different opinions 
about where to draw the line between spam and ham (including users who 
insist on treating even organization-wide announcements as 
spam, and users who are so opposed to censorship that they do not 
want ANY message to be marked as spam).

As a result, I tend to avoid any mechanism in which the user directly 
contributes to a site-wide configuration (site-wide blacklists, a 
site-wide Bayes DB, etc.).  Indirect contributions, by submitting 
messages for human review, are fine (though that gets into the problem 
of spending all of some sysadmin's time reviewing spam), but the user 
should never directly say "learn this as spam/ham" for the site-wide 
database.

What I do at home (and I haven't yet gotten around to making something 
that works on a larger scale) is this:

1) If you're splitting messages out to individual recipients before 
MailScanner sees them, then you can set things up so that each recipient 
has their own Bayes database, and each message is checked against a 
user-specific Bayes DB.  (This isn't what I actually do: I use 
SpamAssassin via procmail instead of via MailScanner.  I plan to go back 
to using MailScanner at some point, but haven't had time to do it yet.  
I just need to make sure that I set up my MTA to do expansion 
before MS instead of after MS, and the main reason I'm putting this off 
is that I'm planning to switch MTAs at home soon.)

2) I have a series of folders:  Spam, Spam/Blacklist, Spam/Learn, 
Spam/Learned, Spam/Unlearn, Spam/Whitelist, Spam/Unlearned

3) messages that are marked as spam are delivered (via procmail) into 
Spam/Blacklist.  Any time I receive a false-negative, I put it in 
Spam/Learn.  If I find a false-positive, I put it in Spam/Unlearn.  If 
I get something whose wording is spammy, but from a sender that I 
always want to get through, I can put it into the Spam/Whitelist folder.  
I haven't actually directly used the Blacklist folder yet, though.

4) at midnight, my procmail log is grepped for entries that went to 
Spam/Blacklist, telling me their score, sender, and subject.  If I 
can't tell from that that it was a valid sender, I'm willing to lose 
the message (so far, only Mailer-Daemon messages have been 
false-positives, and that's ok).  This means that I don't have to 
actually check the Spam/Blacklist folder (which, remember, is where my 
delivered spam goes), I just check the report to see if I need to fish 
any messages out of the folder before it gets processed in step 5b.  On 
bad days, it takes me a few minutes to page through the report, but 
then I'm done.  On good days, it takes a few seconds and I delete the 
report.  (Since I started using an SMTP greet delay of 35 
seconds, and SBL/XBL at the MTA level, I have had fewer than 3 spam 
messages per week, so most days my report is empty.  Before I started 
using this whole system, I was at 150-250 per day.)

5) at 5am, the following things happen in this order:

    a) all messages in Spam/Learn are submitted to razor and bayes as spam
       and then deposited in Spam/Blacklist
    b) all messages in Spam/Blacklist are added to my AWL for "blacklisting",
       and then deposited in Spam/Learned (which actually exists outside of
       my IMAP space, but I included it here for completeness)
    c) all messages in Spam/Unlearn are submitted to bayes as ham, and
       then deposited in Spam/Unlearned
    d) all messages in Spam/Whitelist are added to my AWL for "whitelisting",
       and then deposited in Spam/Unlearned

(You'll notice that Learn feeds into Blacklist, but Unlearn doesn't 
feed into Whitelist.  That's because I can envision times where I do 
not want whitelisting and ham to be linked, but I generally do want a 
spam sender to be blacklisted.  I get more spam from repeated addresses 
than from one-shot addresses, because most of my spam has historically 
come from unconfirmed lists for commercial things rather than from 
forged senders.)

(You should also notice that submission to razor only happens via the 
Spam/Learn folder, because only human-reviewed messages get put into 
Spam/Learn.  That fits the razor model of not submitting messages that 
haven't been reviewed and confirmed to be spam.)
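
To make the ordering in step 5 concrete, here is a minimal Python sketch.
It is an illustration, not the scripts I actually run: folders are
modelled as plain in-memory lists, and the action names ("razor-report",
"bayes-spam", and so on) and the apply_action callback are hypothetical
placeholders for whatever commands really do the reporting and learning.

```python
# Each step: (source folder, actions to apply, destination folder).
# Order matters: step (a) refills Spam/Blacklist before step (b) drains it.
NIGHTLY_STEPS = [
    ("Spam/Learn",     ["razor-report", "bayes-spam"], "Spam/Blacklist"),
    ("Spam/Blacklist", ["awl-blacklist"],              "Spam/Learned"),
    ("Spam/Unlearn",   ["bayes-ham"],                  "Spam/Unlearned"),
    ("Spam/Whitelist", ["awl-whitelist"],              "Spam/Unlearned"),
]


def run_nightly(folders, apply_action):
    """Process the folders in order, moving every message to its destination.

    folders maps a folder name to a list of messages; apply_action(action,
    message) is called once per action per message.
    """
    for source, actions, destination in NIGHTLY_STEPS:
        messages = folders.get(source, [])
        for message in messages:
            for action in actions:
                apply_action(action, message)
        folders.setdefault(destination, []).extend(messages)
        folders[source] = []
    return folders


if __name__ == "__main__":
    folders = {"Spam/Learn": ["m1"], "Spam/Blacklist": ["m2"]}
    run_nightly(folders, lambda action, msg: print(action, msg))
```

Because Spam/Learn is handled first, a message dropped there is reported
and learned as spam and then also picked up by the blacklisting step in
the same run, while Spam/Unlearn and Spam/Whitelist stay independent.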

So, I have my own personal bayes database, which is automatically fed 
from my own folders instead of being hand-fed.  Plus, anyone else who 
uses my home mail server can do the same thing, without our spam/ham 
tastes affecting each other.
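
As for the midnight report in step 4, a rough Python sketch of the log
scan is below.  It assumes the stock procmail log abstract layout (a
"From " line, a "Subject:" line, then a "Folder:" line per delivery); the
function name blacklist_report is hypothetical, and the score is left
out because the stock abstract does not record it.

```python
def blacklist_report(log_text, folder="Spam/Blacklist"):
    """Return (sender, subject) for each delivery logged to `folder`."""
    entries = []
    sender = subject = None
    for line in log_text.splitlines():
        if line.startswith("From "):
            # Start of a new log abstract: "From <sender> <date>".
            parts = line.split()
            sender = parts[1] if len(parts) > 1 else "(unknown)"
            subject = "(no subject)"
        elif line.lstrip().startswith("Subject:"):
            subject = line.split(":", 1)[1].strip()
        elif line.lstrip().startswith("Folder:"):
            # "Folder: <name> <size>"; folder names with spaces are not handled.
            fields = line.split(":", 1)[1].split()
            if fields and fields[0] == folder and sender is not None:
                entries.append((sender, subject))
            sender = subject = None
    return entries


if __name__ == "__main__":
    sample = (
        "From a@example.com  Fri Jun  4 17:09:11 2004\n"
        " Subject: Cheap pills\n"
        "  Folder: Spam/Blacklist\t\t\t  2345\n"
    )
    for sender, subject in blacklist_report(sample):
        print(sender, "|", subject)
```

An empty result corresponds to the "good day" case where there is
nothing to review.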
