Bayes scoring working wrong

Thu Dec 18 02:49:30 GMT 2003

On Thu, 18 Dec 2003 01:24 pm, Nortex PageGuys wrote:
> Hello ,
>
> > Return-Path: <ppfwts at hongkong.com>
> > Received: by mailadmin.nortex.net (CommuniGate Pro PIPE 4.1.5)
> >   with PIPE id 32188222; Wed, 17 Dec 2003 19:41:13 -0600
> > Received: from [12.158.34.221] (HELO psmtp.com)
> >   by mailadmin.nortex.net (CommuniGate Pro SMTP 4.1.5)
> >   with SMTP id 32188182 for **REMOVED FOR SECURITY**; Wed, 17 Dec 2003
> > 19:41:00 -0600 Received: from source ([218.235.30.213]) by
> > exprod5mx69.postini.com ([12.158.34.245]) with SMTP; Wed, 17 Dec 2003
> > 17:40:57 PST
> > Received: from [218.235.30.213] by rx357.comIP with HTTP;
> >         Thu, 18 Dec 2003 05:36:45 +0500
> > From: "Riddle Eric" <ppfwts at hongkong.com>
> > To: **REMOVED FOR SECURITY**
> > Subject: Re: %RND_UC_CHAR[2-8], the promised kurolesov

**snipped**

> I have fed the Bayes engine in SpamAssassin lots of spam and ham
> emails over the past 8 months, and this is the result of all my work: it's
> reversing valid spam as not spam.
>
> Any suggestions on what I can do to improve SpamAssassin's scoring on
> this?
> Best regards,
>  Nortex                          mailto:pages at ntin.net

Not a lot we can do about Bayes poisoning :( except create a couple of
customised rules:

header FROM_SPAMMER01   From =~ /\@.*hongkong\.com/i
describe FROM_SPAMMER01 Known spam source 'hongkong.com'
score FROM_SPAMMER01    3.5

body BODY_BAN_CD        /Banned CD/i
describe BODY_BAN_CD    Mentions 'banned CD'
score BODY_BAN_CD       2.0

Now unless my math is out: 3.5 + 2.0 - 0.399 = 5.101
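
To double-check that sum, here is a minimal Python sketch.  It assumes the
-0.399 is what the message scored before the two new rules, and that the
spam threshold is 5.0 (SpamAssassin's usual default; adjust it if your
config differs).  The names rule_scores, total_score and is_spam are made up
for illustration and are not part of SpamAssassin.

```python
# Assumed threshold: 5.0 is the usual SpamAssassin default required score.
REQUIRED_SCORE = 5.0

# Scores taken from the arithmetic in the post.
rule_scores = {
    "FROM_SPAMMER01": 3.5,
    "BODY_BAN_CD": 2.0,
    # Assumption: what the message scored before the new rules were added.
    "existing_total": -0.399,
}


def total_score(scores):
    """Sum the per-rule scores, rounded to 3 places to hide float noise."""
    return round(sum(scores.values()), 3)


def is_spam(scores, required=REQUIRED_SCORE):
    """A message is tagged once its total reaches the required score."""
    return total_score(scores) >= required


if __name__ == "__main__":
    print(total_score(rule_scores), is_spam(rule_scores))
```

With only one of the two rules hitting, the total stays below 5.0, so this
particular message needs both rules to be tagged.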

Bingo :)  Of course you'll need to keep creating rules for each forged
address :-/  Not exactly ideal, but it works.  Plus, with Perl's powerful
regexes, you'll find after a while that most spammers are creatures of
habit, and you can create some pretty powerful filters based on common
themes, like domains that contain only numbers (e.g., 12345.biz in Perl
would be /[0-9]{5}\.biz/i, etc.) or common obfuscating patterns (e.g.,
/([a-zA-Z](?:\_|\ |-|\.)){3,}/i would catch any sequence of 3 or more
letters separated by either "_", " ", "-" or ".")
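
If you want to try those two patterns against sample strings before turning
them into rules, a rough Python sketch follows.  The obfuscation pattern is
rewritten with a character class, which should match the same strings as the
Perl form above; the function names and sample strings are invented for
illustration only.

```python
import re

# Five digits followed by ".biz", as in the 12345.biz example.
NUMERIC_BIZ = re.compile(r"[0-9]{5}\.biz", re.IGNORECASE)

# Three or more letters in a row, each followed by "_", " ", "-" or ".".
# Assumed equivalent to the Perl alternation, written as a character class.
OBFUSCATED = re.compile(r"(?:[a-z][_ .-]){3,}", re.IGNORECASE)


def looks_numeric_biz(address):
    """True if the address contains a five-digit .biz domain."""
    return NUMERIC_BIZ.search(address) is not None


def looks_obfuscated(text):
    """True if the text contains letters spaced out by separators."""
    return OBFUSCATED.search(text) is not None


if __name__ == "__main__":
    for sample in ["sales@12345.biz", "V-I-A-G-R-A", "Hello world", "U.S.A."]:
        print(sample, looks_numeric_biz(sample), looks_obfuscated(sample))
```

Note that the obfuscation pattern also hits ordinary abbreviations such as
"U.S.A.", so a rule built on it probably deserves a modest score rather than
one that tags a message on its own.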

As I said in a post recently, our mail filter at work has a combined false
+ve/-ve rate of less than 0.01%.  We also have two guys (myself and the
other Unix guy) managing the filters.  So far we have created 1523
custom rules to tailor the filters to our specific needs.  This number will
only ever increase :(  However, if you're interested, I'm happy to share
them (in a modified form - without all our internal business-specific
stuff.  There are too many internal addresses/lists to just "put them up on
an ftp somewhere").  Contact me off-list if anyone is interested :)

--James
__________________________________
A random quote of nothing:

BOFH excuse #295:

The Token fell out of the ring. Call us when you find it.