MailScanner content scanning for keywords

David H. dh at UPTIME.AT
Sat Jul 16 01:21:20 IST 2005



Stephen Swaney wrote:
>>-----Original Message-----
>>From: MailScanner mailing list [mailto:MAILSCANNER at JISCMAIL.AC.UK] On
>>Behalf Of Matt Kettler
>>Sent: Friday, July 15, 2005 5:31 PM
>>Subject: Re: MailScanner content scanning for keywords
>>Daniel Straka wrote:
>>>Julian, and list,
>>>I know you're all getting tired of my postings so I'll make this my last
>>>intrusion on this topic.
>>>I don't want to offend anyone on this list, but the comments sent back
>>>about my suggestion (see bottom) are a bit programmer-anal. How
>>>about just keeping it simple? A nice simple line file delimited by
>>>or whatever character, like:
>>Disclaimer: I *am* a programmer. I have both bias and experience from
>>years of
>>SA rule writing and general programming.
>>We're not being "programmer-anal"; we're trying to be helpful.
>>AFAIK, there are no off-the-shelf tools that work with MailScanner and do the
>>single-line text-file thing. It's too inflexible a tool to be useful for
>>people so it wouldn't exactly be popular. It sounds good, and would be
>>easy to
>>start with, but it's a PITA in the long run due to its lack of
>>So if you want a line-by-line string checker, you'll probably have to
>>write your
>>own tool. You might be able to hack a script together using the generic spam
>>scanner module, but at that point it's more 'code' than writing a couple
>>regexes for SA rules.
>>After all that, you'd likely use it for a month to a year and have to junk
>>it because it sucked. No, really, it would suck. This stuff is a LOT harder
>>than you think. Trust me, I'm trying to help you.
>>Spammers use thousands of variants of the word "Viagra", do you want to
>>dictionary them all? 1 regex rule detects absurd numbers of possible
>>Not even counting increases due to mixed-case that's:
>>32768*2*32768*14*32768*18*32768*4*2*32768*2*32768**16*32768*2*32768 =
>>different strings it will match, all resembling "Viagra".
>>I know I can't dictionary that many combinations. This is a -real world-
>>problem, not a programmer's dream. I wrote the above regex for SA. I've
>>drug spam for years in evolving that rule. It's complex, but there really
>>are an insane number of obfuscations used nowadays. WAY too many to catch with
>>string matching.
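The scaling argument above can be illustrated with a small Python sketch. It is not the SA rule referred to in the message; the substitution table and the names `SUBSTITUTIONS`, `build_pattern` and `all_variants` are hypothetical, chosen only to show how one regex built from per-letter alternatives covers the product of those alternatives, which a line-by-line word list would have to spell out in full.

```python
import itertools
import math
import re

# Hypothetical substitution table, for illustration only. Real spam uses far
# more variants per letter, plus inserted spaces and punctuation.
SUBSTITUTIONS = {
    "v": ["v", "\\/"],
    "i": ["i", "1", "l", "!", "|"],
    "a": ["a", "@", "4"],
    "g": ["g", "9"],
    "r": ["r"],
}
WORD = "viagra"


def build_pattern(word):
    """Build one regex that accepts any listed substitute for each letter."""
    parts = []
    for letter in word:
        alternatives = "|".join(re.escape(alt) for alt in SUBSTITUTIONS[letter])
        parts.append("(?:" + alternatives + ")")
    return re.compile("".join(parts), re.IGNORECASE)


def all_variants(word):
    """Spell out every string a plain word list would need to contain."""
    choices = [SUBSTITUTIONS[letter] for letter in word]
    return ["".join(combo) for combo in itertools.product(*choices)]


pattern = build_pattern(WORD)
variant_list = all_variants(WORD)

# The number of strings is the product of the choices per letter:
# 2 * 5 * 3 * 2 * 1 * 3 = 180 for this small table.
variant_count = math.prod(len(SUBSTITUTIONS[letter]) for letter in WORD)

print(variant_count, "list entries versus 1 regex")
print(all(pattern.fullmatch(v) for v in variant_list))
```

Even this toy table needs 180 list entries to match what a single pattern covers, and every extra substitute for a letter multiplies that count.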
>>Besides, if you're a unix sysadmin, regexes really should not scare you.
>>I'm not
>>being programmer-centric here. They are not code, at all, and they are
>>used in
>>dozens of unix programs and even many windows programs (text searches in
>>apps). If you learn them you'll be able to use ordinary tools like "grep"
>>better. They're not hard, just a little weird looking.
>>A SA rule for a single-word body is pretty trivial. It's not as easy as a
>>line-by-line text file, but it's what we have in-hand. It's also a lot
>>more powerful and flexible as your skills with it grow.
>>Here's some quick conversions of your examples:
>>" viagra "
>>body L_VIAGRA1	/\bviagra\b/i
>>score L_VIAGRA1	5.0
>>" vagara "
>>body L_VIAGRA2	/\bvagara\b/i
>>score L_VIAGRA2	5.0
>>" cialis "
>>body L_CIALIS	/\bcialis\b/i
>>score L_CIALIS	5.0
>>Note: I changed your spaces to \b's, which will match any "word boundary"
>>including space, punctuation, and end-of-line. Much more useful, as they
>>won't miss end-of-sentence cases like using spaces will. I also made them
>>case-insensitive with the trailing i.
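The difference between literal spaces and `\b` can be checked outside SpamAssassin. The following is a minimal Python sketch, assuming Python's `re` treats `\b` the same way as the Perl regexes SA uses for plain ASCII text like this; the sample strings and the helper name `matches` are made up for illustration.

```python
import re

# Literal-space pattern, mirroring the " viagra " text-file entry.
space_pattern = re.compile(r" viagra ", re.IGNORECASE)

# Word-boundary pattern, mirroring the body rule /\bviagra\b/i.
boundary_pattern = re.compile(r"\bviagra\b", re.IGNORECASE)

# Hypothetical message fragments.
samples = [
    "cheap viagra here",  # surrounded by spaces
    "Buy Viagra.",        # end of sentence, followed by punctuation
    "viagra",             # the whole line, no surrounding spaces
    "(VIAGRA)",           # wrapped in punctuation
    "viagras",            # part of a longer word
]


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None


for text in samples:
    print(repr(text),
          "spaces:", matches(space_pattern, text),
          "\\b:", matches(boundary_pattern, text))
```

Only the first sample matches the space-delimited pattern, while the `\b` version also catches the end-of-sentence, whole-line and punctuation cases and still skips the longer word.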
>>And many of those examples are already built-in with SA 3.0.0 or higher to
>>with (DRUGS_ERECTILE). You can just jack up the score if discussion of
>>drugs is inappropriate at your work (ie: no off-color-joke mails allowed).
>>It's not hard. Really. If you don't want the admin hassles of a full-blown
>>just disable it and use MCP, which has the same syntax, but the same
>>This really is likely your simplest route to go, because it exists. AND it
>>has the flexibility to help you when you run into trouble with simple rules. And
>>you likely will need it at some point.
> Thanks Matt for a very reasoned and simple explanation of the problem and
> why it's so difficult to solve in a simplistic fashion!
> On a different tack - we were recently asked to implement a solution for a
> client in England that used MCP to trap English (as in UK) profanity. 

I hope that client has a clear, written agreement with all its customers that
it may scan the body of the mail. Otherwise it is violating privacy
laws, and you might want to tell them that :)

-d

