MailScanner content scanning for keywords

Sat Jul 16 00:27:58 IST 2005

> -----Original Message-----
> From: MailScanner mailing list [mailto:MAILSCANNER at JISCMAIL.AC.UK] On
> Behalf Of Matt Kettler
> Sent: Friday, July 15, 2005 5:31 PM
> To: MAILSCANNER at JISCMAIL.AC.UK
> Subject: Re: MailScanner content scanning for keywords
> 
> Daniel Straka wrote:
> > Julian, and list,
> >
> > I know you're all getting tired of my postings so I'll make this my last
> > intrusion on this topic.
> >
> > I don't want to offend anyone on this list, but the comments sent back
> > about my suggestion (see bottom) are a bit programmer-anal. How
> > about just keeping it simple? A nice simple line file delimited by quotes
> > or whatever character, like:
> 
> Disclaimer: I *am* a programmer. I have both bias and experience from years
> of SA rule writing and general programming.
> 
> We're not being "programmer-anal"; we're trying to be helpful.
> 
> AFAIK, there are no off-the-shelf tools that work with MailScanner to do the
> simple single-line text-file thing. It's too inflexible a tool to be useful
> for most people, so it wouldn't exactly be popular. It sounds good, and would
> be easy to start with, but it's a PITA in the long run due to its lack of
> flexibility.
> 
> So if you want a line-by-line string checker, you'll probably have to write
> your own tool. You might be able to hack a script together using the generic
> spam scanner module, but at that point it's more 'code' than writing a couple
> of trivial regexes for SA rules.
> 
> After all that, you'd likely use it for a month to a year and have to junk it
> because it sucked. No, really, it would suck. This stuff is a LOT harder than
> you think. Trust me, I'm trying to help you.
> 
> 
> Spammers use thousands of variants of the word "Viagra". Do you want to
> dictionary them all? One regex rule detects absurd numbers of possible
> spellings:
> 
> /(?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[a4ij1!|l\xCC-\xCF\xEC-\xEF][_\W]{0,3}[ila40\xC0-\xC6\xE0-\xE6@][_\W]{0,3}[xyz]?[gj][_\W]{0,3}rr?[_\W]{0,3}[a40\xC0-\xC6\xE0-\xE6@][_\W]{0,3}x?[_\W]{0,3}(?:\b|\s)/i
> 
> Not even counting increases due to mixed case, that's:
> 32768*2*32768*14*32768*18*32768*4*2*32768*2*32768*16*32768*2*32768 =
> 3.43*10^41
> different strings it will match, all resembling "Viagra".
> 
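
The estimate above is easy to re-check. This is only a sketch of the arithmetic, in Python, reading every operator in that line as a plain multiplication; the factors are taken as given in the message and are not re-derived from the regex.

```python
from math import prod

# Factor the message uses for each of the eight [_\W]{0,3} gaps.
gap = 32768

# Number of choices at each letter position, copied from the product above.
letter_choices = [2, 14, 18, 4 * 2, 2, 16, 2]

# Eight gaps, each contributing the same factor, times the letter choices.
total = gap ** 8 * prod(letter_choices)

print(f"{total:.2e}")
```

The printed value agrees with the 3.43*10^41 figure quoted in the message.
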
> I know I can't dictionary that many combinations. This is a -real world-
> problem, not a programmer's dream. I wrote the above regex for SA. I've
> studied drug spam for years in evolving that rule. It's complex, but there
> really are an insane number of obfuscations used nowadays. WAY too many to
> catch with basic string matching.
> 
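
To get a feel for what that rule catches, here is a rough sketch that tries it against a few strings. It assumes Python's `re` module as a stand-in for the Perl regex engine SpamAssassin actually uses, so edge cases (especially around non-ASCII characters) may differ. The sample strings are made up for illustration.

```python
import re

# The rule from the message, joined onto one logical pattern. Python's re
# module is only a stand-in here; SpamAssassin evaluates rules with Perl.
VIAGRA = re.compile(
    r"(?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[a4ij1!|l\xCC-\xCF\xEC-\xEF]"
    r"[_\W]{0,3}[ila40\xC0-\xC6\xE0-\xE6@][_\W]{0,3}[xyz]?[gj][_\W]{0,3}rr?"
    r"[_\W]{0,3}[a40\xC0-\xC6\xE0-\xE6@][_\W]{0,3}x?[_\W]{0,3}(?:\b|\s)",
    re.IGNORECASE,
)

# Hypothetical test strings: four obfuscated spellings, two innocent words.
samples = [
    "Viagra",
    "V1agra",
    "v.i.a.g.r.a",
    r"buy \/iagra now",
    "vitamins",
    "Niagara",
]

for text in samples:
    print(f"{text!r:22} {bool(VIAGRA.search(text))}")
```

The first four samples match and the last two do not, which is the point: one pattern covers spellings that a plain word list would need a separate entry for.
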
> 
> Besides, if you're a unix sysadmin, regexes really should not scare you. I'm
> not being programmer-centric here. They are not code, at all, and they are
> used in dozens of unix programs and even many windows programs (text searches
> in some apps). If you learn them you'll be able to use ordinary tools like
> "grep" better. They're not hard, just a little weird looking.
> 
> An SA body rule for a single word is pretty trivial. It's not as easy as a
> line-by-line text file, but it's what we have in-hand. It's also a lot more
> powerful and flexible as your skills with it grow.
> 
> Here's some quick conversions of your examples:
> " viagra "
> body L_VIAGRA1	/\bviagra\b/i
> score L_VIAGRA1	5.0
> 
> " vagara "
> body L_VIAGRA2	/\bvagara\b/i
> score L_VIAGRA2	5.0
> 
> " cialis "
> body L_CIALIS	/\bcialis\b/i
> score L_CIALIS	5.0
> 
> Note: I changed your spaces to \b's, which match at any "word boundary",
> whether the word is followed by a space, punctuation, or end-of-line. Much
> more useful, as they won't miss end-of-sentence cases like spaces will. I
> also made them case-insensitive with the trailing i.
> 
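
If the starting point really is a quote-delimited word list, the conversion shown above can be scripted. The following is a minimal sketch in Python, not an existing MailScanner or SpamAssassin tool: `to_sa_rule` and its `L_` naming scheme are hypothetical, the 5.0 score is just the value used in the examples, and `re.escape` is Python's escaping, which is only assumed to be acceptable to Perl for simple keywords. The last lines show why \b is preferable to literal spaces.

```python
import re


def to_sa_rule(word: str, score: float = 5.0) -> str:
    """Render one keyword as a SpamAssassin-style body rule plus score line."""
    word = word.strip()
    # Hypothetical naming scheme: "L_" plus the upper-cased keyword.
    name = "L_" + re.sub(r"\W+", "_", word).upper()
    # Word boundaries on both sides instead of literal spaces.
    pattern = r"\b" + re.escape(word) + r"\b"
    return f"body {name}\t/{pattern}/i\nscore {name}\t{score}"


# Lines as they might appear in a quote-delimited keyword file.
keywords = ['" viagra "', '" vagara "', '" cialis "']
for line in keywords:
    print(to_sa_rule(line.strip('" ')))

# Keyword at the end of a sentence: the spaced form misses it, \b does not.
text = "Cheap viagra."
spaced = re.search(r" viagra ", text, re.IGNORECASE)
bounded = re.search(r"\bviagra\b", text, re.IGNORECASE)
print(bool(spaced), bool(bounded))
```

The generated rules only cover exact spellings, so they are a starting point rather than a replacement for hand-written patterns.
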
> And many of those examples are already built in with SA 3.0.0 or higher to
> begin with (DRUGS_ERECTILE). You can just jack up the score if discussion of
> such drugs is inappropriate at your work (i.e. no off-color-joke mails
> allowed).
> 
> 
> It's not hard. Really. If you don't want the admin hassles of a full-blown SA,
> just disable it and use MCP, which has the same syntax and the same
> flexibility.
> 
> This really is likely your simplest route to go, because it exists. AND it has
> the flexibility to help you when you run into trouble with simple rules. And
> you likely will need it at some point.
> 

Thanks Matt for a very reasoned and simple explanation of the problem and
why it's so difficult to solve in a simplistic fashion!

On a different tack - we were recently asked to implement a solution for a
client in England that used MCP to trap English (as in UK) profanity. We
created an MCP rule set that used their extensive list of "profane" words as
intelligently as possible. This was not simple, as Matt has described. We
also set up rules to:

	1. Forward for review the messages trapped by these rules
	2. Easily release these messages
	3. Create an audit trail of who released what, and when

The whole system is working well and the client appears to be happy with the
results.

Here's the problem: if we wanted to do the same thing for a company here in
the US we'd have to start all over again with a new nasty word list. Seems
that we Yanks have a very different set of Bl**dy nasty words.

Just my 2p 

Steve

Stephen Swaney
Fort Systems Ltd.
stephen.swaney at fsl.com
www.fsl.com
