MailScanner content scanning for keywords

Sat Jul 16 01:21:20 IST 2005


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Stephen Swaney wrote:
>>-----Original Message-----
>>From: MailScanner mailing list [mailto:MAILSCANNER at JISCMAIL.AC.UK] On
>>Behalf Of Matt Kettler
>>Sent: Friday, July 15, 2005 5:31 PM
>>To: MAILSCANNER at JISCMAIL.AC.UK
>>Subject: Re: MailScanner content scanning for keywords
>>
>>Daniel Straka wrote:
>>
>>>Julian, and list,
>>>
>>>I know you're all getting tired of my postings so I'll make this my last
>>>intrusion on this topic.
>>>
>>>I don't want to offend anyone on this list, but the comments sent back
>>>about my suggestion (see bottom) are a bit programmer-anal. How
>>>about just keeping it simple? A nice simple line file delimited by quotes
>>>or whatever character, like:
>>
>>Disclaimer: I *am* a programmer. I have both bias and experience from
>>years of SA rule writing and general programming.
>>
>>We're not being "programmer-anal"; we're trying to be helpful.
>>
>>AFAIK, there are no off-the-shelf tools that work with MailScanner and do
>>the simple single-line text-file thing. It's too inflexible a tool to be
>>useful for most people, so it wouldn't exactly be popular. It sounds good,
>>and would be easy to start with, but it's a PITA in the long run due to its
>>lack of flexibility.
>>
>>So if you want a line-by-line string checker, you'll probably have to write
>>your own tool. You might be able to hack a script together using the generic
>>spam scanner module, but at that point it's more 'code' than writing a
>>couple of trivial regexes for SA rules.
>>
>>After all that, you'd likely use it for a month to a year and have to junk
>>it because it sucked. No, really, it would suck. This stuff is a LOT harder
>>than you think. Trust me, I'm trying to help you.
>>
>>
>>Spammers use thousands of variants of the word "Viagra"; do you want to
>>dictionary them all? One regex rule detects an absurd number of possible
>>spellings:
>>
>>/(?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[a4ij1!|l\xCC-\xCF\xEC-\xEF][_\W]{0,3}[ila40\xC0-\xC6\xE0-\xE6@][_\W]{0,3}[xyz]?[gj][_\W]{0,3}rr?[_\W]{0,3}[a40\xC0-\xC6\xE0-\xE6@][_\W]{0,3}x?[_\W]{0,3}(?:\b|\s)/i
>>
>>Not even counting increases due to mixed case, that's:
>>32768*2*32768*14*32768*18*32768*4*2*32768*2*32768*16*32768*2*32768 =
>>3.43*10^41
>>different strings it will match, all resembling "Viagra".
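The arithmetic above can be checked with a few lines of Python. This is only a sketch that multiplies the factors exactly as they are listed in the post: 32768 is the figure used there for each of the eight [_\W]{0,3} gaps, and the remaining numbers are the per-position choices. The variable names are illustrative.

```python
# Factors as listed in the post. 32768 is the figure used there for each
# [_\W]{0,3} gap; the regex contains eight such gaps.
gap = 32768
num_gaps = 8

# Per-position choices, in the order they appear in the post's product:
# V alternatives, "i" class, "a" class, [xyz]?, [gj], rr?, final "a" class, x?
position_choices = [2, 14, 18, 4, 2, 2, 16, 2]

total = gap ** num_gaps
for n in position_choices:
    total *= n

print(f"{total:.2e}")  # 3.43e+41
```

The exact product is 63 * 2^132, which rounds to the 3.43*10^41 quoted above.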
>>
>>I know I can't dictionary that many combinations. This is a -real world-
>>problem, not a programmer's dream. I wrote the above regex for SA. I've
>>studied drug spam for years in evolving that rule. It's complex, but there
>>really are an insane number of obfuscations used nowadays. WAY too many to
>>catch with basic string matching.
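To make the contrast with basic string matching concrete, here is a small Python sketch. It assumes the wrapped regex quoted above rejoins into a single pattern (the line breaks inside the character classes are treated as mail-wrapping artifacts), and it uses Python's re module rather than the Perl engine SA runs on, so treat it as an illustration, not as SA behaviour. The sample strings and the keyword list are made up for the example.

```python
import re

# The rule from the post, rejoined into one pattern (assumption: the line
# breaks in the archived mail were wrapping artifacts).
PATTERN = (
    r"(?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}"
    r"[a4ij1!|l\xCC-\xCF\xEC-\xEF][_\W]{0,3}"
    r"[ila40\xC0-\xC6\xE0-\xE6@][_\W]{0,3}"
    r"[xyz]?[gj][_\W]{0,3}rr?[_\W]{0,3}"
    r"[a40\xC0-\xC6\xE0-\xE6@][_\W]{0,3}x?[_\W]{0,3}(?:\b|\s)"
)
rule = re.compile(PATTERN, re.IGNORECASE)

# Hypothetical one-keyword-per-line list of the kind discussed in the thread.
keywords = ["viagra", "vagara", "cialis"]

obfuscated = ["Viagra", "V1agra", "V-i-a-g-r-a", "v i @ g r a", "V\xecagr\xe0"]
innocent = ["vinegar", "Niagara Falls"]

for text in obfuscated + innocent:
    regex_hit = bool(rule.search(text))
    list_hit = any(k in text.lower() for k in keywords)
    print(f"{text!r:18} regex={regex_hit!s:5} keyword-list={list_hit}")

# Count how many of the obfuscated samples the plain keyword list catches.
list_hits = sum(any(k in t.lower() for k in keywords) for t in obfuscated)
```

In this sketch the regex flags all five obfuscated samples and neither innocent one, while the keyword list only catches the plain spelling.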
>>
>>
>>Besides, if you're a unix sysadmin, regexes really should not scare you. I'm
>>not being programmer-centric here. They are not code, at all, and they are
>>used in dozens of unix programs and even many windows programs (text
>>searches in some apps). If you learn them you'll be able to use ordinary
>>tools like "grep" better. They're not hard, just a little weird looking.
>>
>>An SA rule for a single-word body match is pretty trivial. It's not as easy
>>as a line-by-line text file, but it's what we have in hand. It's also a lot
>>more powerful and flexible as your skills with it grow.
>>
>>Here's some quick conversions of your examples:
>>" viagra "
>>body L_VIAGRA1	/\bviagra\b/i
>>score L_VIAGRA1	5.0
>>
>>" vagara "
>>body L_VIAGRA2	/\bvagara\b/i
>>score L_VIAGRA2	5.0
>>
>>" cialis "
>>body L_CIALIS	/\bcialis\b/i
>>score L_CIALIS	5.0
>>
>>Note: I changed your spaces to \b's, which will match any "word boundary"
>>including space, punctuation, and end-of-line. Much more useful, as they
>>won't miss end-of-sentence cases like using spaces will. I also made them
>>case-insensitive with the trailing i.
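The difference between space-delimited keywords and \b can be seen in a short Python sketch. It uses Python's re module as a stand-in for the Perl regexes SA uses; the sample strings are invented for illustration.

```python
import re

spaced = " viagra "  # literal keyword padded with spaces, as in the list idea
bounded = re.compile(r"\bviagra\b", re.IGNORECASE)  # same idea as the body rules

samples = ["cheap viagra today", "Buy Viagra.", "viagra", "(VIAGRA)"]

# For each sample: (matched by the space-padded literal, matched by \b regex)
results = {
    s: (spaced in s.lower(), bool(bounded.search(s)))
    for s in samples
}

for s, (literal_hit, regex_hit) in results.items():
    print(f"{s!r:22} spaces={literal_hit!s:5} word-boundary={regex_hit}")
```

Only the first sample has a space on both sides of the word, so the space-padded literal misses the end-of-sentence, whole-line, and parenthesised cases that the \b version catches.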
>>
>>And many of those examples are already built in with SA 3.0.0 or higher to
>>begin with (DRUGS_ERECTILE). You can just jack up the score if discussion of
>>such drugs is inappropriate at your work (i.e., no off-color-joke mails
>>allowed).
>>
>>
>>It's not hard. Really. If you don't want the admin hassles of a full-blown
>>SA, just disable it and use MCP, which has the same syntax and the same
>>flexibility.
>>
>>This really is likely your simplest route to go, because it exists. AND it
>>has the flexibility to help you when you run into trouble with simple rules.
>>And you likely will need it at some point.
>>
> 
> 
> Thanks Matt for a very reasoned and simple explanation of the problem and
> why it's so difficult to solve in a simplistic fashion!
> 
> On a different tack - we were recently asked to implement a solution for a
> client in England that used MCP to trap English (as in UK) profanity. 

I hope that client has a clear, written agreement with all its customers that
it may scan the body of their mail. Otherwise it is violating privacy laws,
and you might want to tell them that :)

- -d
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (Darwin)

iD8DBQFC2FL/PMoaMn4kKR4RAh9OAJ96ujDyX6RobZES21LRJ2Ukqm+kJACfQxF6
QFux5+QL12+ZWT6NjxWBwlQ=
=kuj7
-----END PGP SIGNATURE-----

------------------------ MailScanner list ------------------------
To unsubscribe, email jiscmail at jiscmail.ac.uk with the words:
'leave mailscanner' in the body of the email.
Before posting, read the Wiki (http://wiki.mailscanner.info/) and
the archives (http://www.jiscmail.ac.uk/lists/mailscanner.html).

Support MailScanner development - buy the book off the website!