Thoughts on new Bayes idea
John Rudd
jrudd at UCSC.EDU
Fri Jun 4 17:09:11 IST 2004
On Jun 4, 2004, at 8:08 AM, Max Kipness wrote:
> I've been pondering this idea for a while, but wanted some opinions on
> how feasible it would be...and the load it would cause.
>
> I currently have all users that receive spam that bypassed
> MailScanner, simply forward the email to spam at ourdomain.com. The email
> then got blacklisted and there was an option to put 'domain' in the
> subject header to black list the entire domain. This worked well, but
> the black list got up to around 1600 emails/domains and I started to
> get many SA time outs. This was before implementing Bayes which is
> working great, if not too good with false positives, but that's
> another story.
>
> My idea is to basically archive every email that enters the system
> (through MS) for a period of a day or so. I've got a script that
> deletes all emails older than a time specified from an mbox file. Then
> using my script from above, have users forward the email to
> spam at ourdomain.com, have a new script fetch that email out of the
> archive and feed it to Bayes.
>
> Any thoughts on this? Is it ridiculous?
>
> Most of my users are on various Exchange servers, and there really is
> no easy way to get the email fed into bayes. I know you can do a
> public folder, but then you have to train each user how to get it
> there, and they have to open the public folder tree, etc. Using IMAP
> is even more administration. I've found that simply forwarding the
> email somewhere is very easy for them.
>
My main concern about these sorts of schemes is that one man's trash
is another man's treasure. As your user base grows, it is inevitable
that you will have users with different opinions about where to draw
the line between spam and ham (including users who are fanatical about
marking even organization-wide announcements as spam, and users who are
fanatical about preventing censorship and so do not want ANY message to
be marked as spam).

As a result, I tend to avoid any mechanism in which the user directly
contributes to a site-wide configuration (site-wide blacklists, a
site-wide Bayes DB, etc.). Indirect contributions, by submitting
messages for human review, are fine (though that gets into the problem
of spending all of some sysadmin's time reviewing spam), but the user
should never directly say "learn this as spam/ham" for the site-wide
database.

What I do at home (and I haven't yet gotten around to making something
that works on a larger scale) is this:
1) If you split messages out to individual recipients before
MailScanner sees them, you can set things up so that each recipient
has their own Bayes database, and each message is checked against a
user-specific Bayes DB. (This isn't what I actually do: I use
SpamAssassin via procmail instead of via MailScanner. I plan to go back
to using MailScanner at some point, but haven't had time to do it yet.
I just need to make sure that I set up my MTA to do expansion
before MS instead of after MS, and the main reason I'm putting this off
is that I'm planning to switch MTAs at home soon.)
2) I have a series of folders: Spam, Spam/Blacklist, Spam/Learn,
Spam/Learned, Spam/Unlearn, Spam/Whitelist, Spam/Unlearned
3) Messages that are marked as spam are delivered (via procmail) into
Spam/Blacklist. Any time I receive a false negative, I put it in
Spam/Learn. If I find a false positive, I put it in Spam/Unlearn. If
I get something whose wording is spammy, but from a sender that I
always want to get through, I can put it into the Spam/Whitelist folder.
I haven't actually put anything into the Blacklist folder by hand yet,
though.
4) At midnight, my procmail log is grepped for entries that went to
Spam/Blacklist, giving me a report of their score, sender, and subject.
If I can't tell from the report that a message came from a valid
sender, I'm willing to lose it (so far, only Mailer-Daemon messages
have been false positives, and that's OK). This means that I don't have
to check the Spam/Blacklist folder itself (which, remember, is where my
delivered spam goes); I just check the report to see whether I need to
fish any messages out of the folder before it gets processed in step
5b. On bad days, it takes me a few minutes to page through the report,
and then I'm done. On good days, it takes a few seconds and I delete
the report. (Since I started using the SMTP Greet Delay, at 35 seconds,
and SBL/XBL at the MTA level, I have had fewer than 3 spam messages per
week, so most days my report is empty. Before I started using this
whole system, I was at 150-250 per day.)
5) At 5am, the following things happen in this order:
   a) all messages in Spam/Learn are submitted to razor and bayes as
      spam, and then deposited in Spam/Blacklist
   b) all messages in Spam/Blacklist are added to my AWL for
      "blacklisting", and then deposited in Spam/Learned (which
      actually exists outside of my IMAP space, but I included it here
      for completeness)
   c) all messages in Spam/Unlearn are submitted to bayes as ham, and
      then deposited in Spam/Unlearned
   d) all messages in Spam/Whitelist are added to my AWL for
      "whitelisting", and then deposited in Spam/Unlearned
(You'll notice that Learn feeds into Blacklist, but Unlearn doesn't
feed into Whitelist. I can envision times when I do not want
whitelisting and ham to be linked, but I generally do want a spam
sender to be blacklisted. I get more spam from repeated addresses than
from one-shot addresses, because most of my spam has historically come
from unconfirmed lists for commercial things rather than from forged
senders.)
(You should also notice that razor reporting is only done via the
Spam/Learn folder, because only human-reviewed messages get put into
Spam/Learn, which fits the razor model of not submitting messages that
haven't been reviewed and confirmed to be spam.)
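
To make the ordering in step 5 concrete, here is a rough Python sketch of
the 5am run. It is not the script I actually use. The names Rule, RULES,
nightly_run and run are made up for illustration, and the command lines
(razor-report, sa-learn --spam/--ham, spamassassin --add-to-blacklist and
--add-to-whitelist) are my assumption about which tools would do each job.
Folders are modelled as plain lists so the logic can be tried without
touching a real mail store.

```python
from collections import namedtuple

# One rule per sub-step of step 5: the source folder, the commands run on
# each message, and the folder the message is moved to afterwards.
Rule = namedtuple("Rule", "source commands destination")

# Order matters: Spam/Learn drains into Spam/Blacklist before Spam/Blacklist
# itself is processed, so hand-confirmed spam also gets blacklisted.
# The command lines below are assumptions, not the original setup.
RULES = [
    Rule("Spam/Learn", [["razor-report"], ["sa-learn", "--spam"]], "Spam/Blacklist"),
    Rule("Spam/Blacklist", [["spamassassin", "--add-to-blacklist"]], "Spam/Learned"),
    Rule("Spam/Unlearn", [["sa-learn", "--ham"]], "Spam/Unlearned"),
    Rule("Spam/Whitelist", [["spamassassin", "--add-to-whitelist"]], "Spam/Unlearned"),
]


def nightly_run(folders, run):
    """Apply RULES in order to `folders` (dict of folder name -> list of messages).

    `run(command, message)` is called once per command per message. Pass a
    function that pipes the message to the command, or a stub for a dry run.
    """
    for rule in RULES:
        messages = folders.get(rule.source, [])
        for message in messages:
            for command in rule.commands:
                run(command, message)
        # Move the processed messages on and empty the source folder.
        folders.setdefault(rule.destination, []).extend(messages)
        folders[rule.source] = []
    return folders
```

For a dry run, pass something like `lambda cmd, msg: print(cmd, msg)` as
`run`. A real version would more likely hand each message to the command
on stdin, for example with `subprocess.run(command, input=message)`, and
would move files between mail folders rather than list entries.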
So, I have my own personal bayes database, which is automatically fed
from my own folders instead of being hand-fed. Plus, anyone else who
uses my home mail server can do the same thing, without our spam/ham
tastes affecting each other.
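
For the midnight report in step 4, something along these lines would do.
Again, this is only a sketch: blacklist_report is a made-up name, and it
assumes procmail's usual log abstract of a "From " line, a "Subject:"
line and a "Folder:" line per delivery, with the folder logged exactly as
"Spam/Blacklist" and no spaces in the folder name. The score is left out
because it depends on how you get it into the log.

```python
def blacklist_report(log_lines, folder="Spam/Blacklist"):
    """Return (sender, subject) pairs for messages delivered to `folder`."""
    report = []
    sender = subject = None
    for line in log_lines:
        stripped = line.strip()
        if line.startswith("From "):
            # "From sender date": start of a new log abstract.
            parts = stripped.split()
            sender = parts[1] if len(parts) > 1 else "(unknown)"
            subject = "(no subject)"
        elif stripped.startswith("Subject:"):
            subject = stripped[len("Subject:"):].strip()
        elif stripped.startswith("Folder:"):
            # "Folder: name size": the first field is the delivery target.
            fields = stripped[len("Folder:"):].split()
            if fields and fields[0] == folder and sender is not None:
                report.append((sender, subject))
            sender = None
    return report
```

Feeding it the open log file (`blacklist_report(open(path))`, where path
is wherever your procmail LOGFILE lives) gives a list you can format and
mail to yourself.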