MailScanner content scanning for keywords

Matt Kettler mkettler at EVI-INC.COM
Fri Jul 15 22:30:30 IST 2005



Daniel Straka wrote:
> Julian, and list,
> 
> I know you're all getting tired of my postings so I'll make this my last 
> intrusion on this topic.
> 
> I don't want to offend anyone on this list, but the comments sent back 
> about my suggestion (see bottom) are a bit programmer-anal. How 
> about just keeping it simple? A nice simple line file delimited by quotes 
> or whatever character, like:

Disclaimer: I *am* a programmer. I have both bias and experience from years of 
SA rule writing and general programming.

We're not being "programmer-anal"; we're trying to be helpful.

AFAIK, there are no off-the-shelf tools that work with MailScanner to do the simple 
single-line text-file thing. It's too inflexible a tool to be useful for most 
people, so it wouldn't exactly be popular. It sounds good, and would be easy to 
start with, but it's a PITA in the long run due to its lack of flexibility.

So if you want a line-by-line string checker, you'll probably have to write your 
own tool. You might be able to hack a script together using the generic spam 
scanner module, but at that point it's more 'code' than writing a couple of 
trivial regexes for SA rules.

After all that, you'd likely use it for a month to a year and have to junk it 
because it sucked. No, really, it would suck. This stuff is a LOT harder than 
you think. Trust me, I'm trying to help you.


Spammers use thousands of variants of the word "Viagra". Do you want to 
dictionary them all? One regex rule detects an absurd number of possible spellings:
 
/(?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[a4ij1!|l\xCC-\xCF\xEC-\xEF][_\W]{0,3}[ila40\xC0-\xC6\xE0-\xE6@][_\W]{0,3}[xyz]?[gj][_\W]{0,3}rr?[_\W]{0,3}[a40\xC0-\xC6\xE0-\xE6@][_\W]{0,3}x?[_\W]{0,3}(?:\b|\s)/i

Not even counting increases due to mixed case, that's: 
32768*2*32768*14*32768*18*32768*4*2*32768*2*32768*16*32768*2*32768 = 3.43*10^41 
different strings it will match, all resembling "Viagra".
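
For anyone who wants to check that arithmetic, here is a rough sketch in Python. It takes the factors exactly as written in the product above (it does not re-count the character classes), and the variable names are just for illustration.

```python
from math import prod

# Each [_\W]{0,3} gap is counted as 32768 (= 32**3) in the figure above,
# and the regex contains eight such gaps.
GAP = 32768
NUM_GAPS = 8

# The remaining factors, in the order they appear in the product above.
OTHER_FACTORS = [2, 14, 18, 4, 2, 2, 16, 2]

total = GAP ** NUM_GAPS * prod(OTHER_FACTORS)
print(f"{total:.2e}")
```

The eight gaps alone contribute 2^120, which is why the total is so large even though each individual character class is small.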

I know I can't dictionary that many combinations. This is a -real world- 
problem, not a programmer's dream. I wrote the above regex for SA. I've studied 
drug spam for years in evolving that rule. It's complex, but there really are 
an insane number of obfuscations used nowadays. WAY too many to catch with basic 
string matching.
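
To get a feel for what that rule catches, here is a small sketch that runs the same pattern through Python's re module. This is only an approximation: SA uses Perl regexes, and Python's handling of the 8-bit ranges and \W on Unicode strings may differ from what SA sees on raw message bytes. The sample strings and the helper name looks_like_viagra are made up for illustration.

```python
import re

# The rule's pattern, split across lines only for readability.
VIAGRA = re.compile(
    r"(?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[a4ij1!|l\xCC-\xCF\xEC-\xEF]"
    r"[_\W]{0,3}[ila40\xC0-\xC6\xE0-\xE6@][_\W]{0,3}[xyz]?[gj][_\W]{0,3}rr?"
    r"[_\W]{0,3}[a40\xC0-\xC6\xE0-\xE6@][_\W]{0,3}x?[_\W]{0,3}(?:\b|\s)",
    re.IGNORECASE,
)

def looks_like_viagra(text):
    """Return True if the pattern matches anywhere in text."""
    return VIAGRA.search(text) is not None

# Hypothetical samples: obfuscated spellings, then ordinary words.
for sample in ["buy viagra now", "V1@gra", "v.i.a.g.r.a",
               "buy \\/iagra today", "viaduct", "vinegar"]:
    print(repr(sample), looks_like_viagra(sample))
```

A plain word list would need a separate entry for every one of the first four samples, while the single pattern covers them and still leaves "viaduct" and "vinegar" alone.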


Besides, if you're a unix sysadmin, regexes really should not scare you. I'm not 
being programmer-centric here. They are not code, at all, and they are used in 
dozens of unix programs and even many windows programs (text searches in some 
apps). If you learn them you'll be able to use ordinary tools like "grep" 
better. They're not hard, just a little weird looking.

An SA rule for a single-word body match is pretty trivial. It's not as easy as a 
line-by-line text file, but it's what we have in-hand. It's also a lot more 
powerful and flexible as your skills with it grow.

Here are some quick conversions of your examples:
" viagra "
body L_VIAGRA1	/\bviagra\b/i
score L_VIAGRA1	5.0

" vagara "
body L_VIAGRA2	/\bvagara\b/i
score L_VIAGRA2	5.0

" cialis "
body L_CIALIS	/\bcialis\b/i
score L_CIALIS	5.0

Note: I changed your spaces to \b's, which will match any "word boundary" 
including space, punctuation, and end-of-line. Much more useful, as they won't 
miss end-of-sentence cases like using spaces will. I also made them 
case-insensitive with the trailing i.
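
The difference between padding with spaces and using \b is easy to see in a quick test. The sketch below uses Python's re module as a stand-in for SA's Perl regexes (for a simple pattern like this they behave the same way); the sample strings and function names are made up.

```python
import re

WORD = re.compile(r"\bviagra\b", re.IGNORECASE)

def space_match(text):
    """The text-file approach: look for the literal string ' viagra '."""
    return " viagra " in text.lower()

def boundary_match(text):
    """The rule approach: word boundaries plus case-insensitivity."""
    return WORD.search(text) is not None

for sample in ["cheap viagra today", "Buy Viagra.", "viagra", "(viagra)"]:
    print(repr(sample), space_match(sample), boundary_match(sample))
```

Only the first sample has a space on both sides of the word, so the space-padded search misses the other three, while the \b version catches all four.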

And many of those examples are already built-in with SA 3.0.0 or higher to begin 
with (DRUGS_ERECTILE). You can just jack up the score if discussion of such 
drugs is inappropriate at your work (ie: no off-color-joke mails allowed).


It's not hard. Really. If you don't want the admin hassles of a full-blown SA, 
just disable it and use MCP, which has the same syntax and the same flexibility.

This really is likely your simplest route to go, because it exists. AND it has 
flexibility to help you when you run into trouble with simple rules. And you 
likely will need it at some point.
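
If you still want to keep a plain keyword file around, the conversions above are mechanical enough to script. Here is a minimal sketch, assuming one quoted word per line as in your examples. The function name, the L_<WORD> naming scheme and the default score are all hypothetical, and it deliberately refuses anything that isn't a plain single word.

```python
import re

def keyword_to_rule(line, score=5.0):
    """Turn one quoted keyword line, e.g. '" viagra "', into SA rule text."""
    # Drop surrounding whitespace, the quotes, and the padding spaces.
    word = line.strip().strip('"').strip()
    if not re.fullmatch(r"[A-Za-z0-9]+", word):
        raise ValueError(f"only plain single words are handled: {line!r}")
    name = "L_" + word.upper()
    # Same shape as the hand-written rules: \b on both sides, trailing i.
    return (f"body {name}\t/\\b{word}\\b/i\n"
            f"score {name}\t{score}")

for entry in ['" viagra "', '" cialis "']:
    print(keyword_to_rule(entry))
    print()
```

Anything beyond single words (phrases, punctuation, obfuscations) is exactly where you'd want to edit the generated regexes by hand.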



