MailScanner content scanning for keywords

Matt Kettler mkettler at EVI-INC.COM
Mon Jul 18 17:15:45 IST 2005

James Gray wrote:
> On Sat, 16 Jul 2005 07:30 am, Matt Kettler wrote:
>>Spammers use thousands of variants of the word "Viagra", do you want to
>>dictionary them all? 1 regex rule detects absurd numbers of possible
> Good grief! That looks like a slightly extended version of the OBFU_VIAGRA 
> rule I wrote about a year ago...I can tell coz it's still got the (?:\b|\s) 
> rules which, syntactically can be replaced with [\b\s].  At least that's 
> how it reads in my custom SA rules /now/ and works just the same (and is 
> faster from my testing).

Actually, it's part of the DRUGS_ERECTILE rule I developed for and 
is now a part of SA 3.0.0+, starting sometime in late 2003, with a public version 
on January 16, 2004.

It's interesting that the rest of our rules are similar, but then again, when 
you break it down it's all straightforward obfuscation handling.

The regex quoted is a slightly newer version of the __DRUGS_ERECTILE1 sub-part 
than is in common distribution via or SA 3.0.x, one I've been 
testing but haven't done a mass-check of yet.

> Perl gurus: Am I correct? does (?:\b|\s) == [\b\s] ??  If not, what's the 
> difference?  AFAICT (?:...) matches something without creating the $x 
> holder to refer to the match later, and [...] does the same thing except 
> matches a set of individual characters.

I *may* have lifted the idea of using (?:\b|\s) from your rule, or from someone 
else's rule. Originally I did use \b only. I believe that later I saw some other 
rule (yours, some SARE rule, dunno) with a mixed-pre-gap clause using the 
combination \b|\s and decided to try it, and was pleased with the improvement. I 
don't think the combo-phrase was added until at least Feb, 2004.

The addition of \s makes considerable sense when you consider that my gap-clause 
could be word or non-word characters ([\W_]{0,3}).
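
A rough sketch of how such a gap-clause behaves, written in Python rather than Perl purely for illustration. The pattern below is hypothetical (the plain word "viagra" with [\W_]{0,3} between letters); it is not the actual __DRUGS_ERECTILE1 rule, and the helper name is_hit is made up for this example.

```python
import re

# Gap-clause from the post: up to three non-word characters or underscores.
GAP = r"[\W_]{0,3}"

# Hypothetical pattern, NOT the real SpamAssassin rule: the letters of
# "viagra" joined by the gap-clause, preceded by the (?:\b|\s) pre-gap clause.
PATTERN = re.compile(r"(?:\b|\s)" + GAP.join("viagra"), re.IGNORECASE)


def is_hit(text):
    """Return True if the illustrative pattern matches anywhere in text."""
    return PATTERN.search(text) is not None


print(is_hit("v.i.a.g.r.a"))          # punctuation in the gaps
print(is_hit("V_I_A_G_R_A"))          # underscore is a word char, hence [\W_]
print(is_hit("buy v i a g r a now"))  # whitespace in the gaps
print(is_hit("vinegar"))              # ordinary word, letters do not line up
```

The underscore has to be listed explicitly because \W alone does not cover it.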

I settled on using (?:\b|\s) instead of simplifying to [\b\s] based on my corpus 
testing. [\b\s] was the first thing that came to my mind, but it in fact does 
not work as well.

My *theory* is this is because \b is not a character, it's a zero-width 
assertion. [] would require a width as it is a character meta-class, reducing 
some of the hit possibilities. But that's a theory.
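
A minimal sketch that is consistent with this theory, in Python rather than Perl (an assumption made for easy testing). In both engines \b inside a bracketed class stands for the backspace character rather than a word boundary, so [\b\s] always has to consume one character. The literal word "viagra" is a stand-in for the real rule body, which is not reproduced here.

```python
import re

# Alternation: \b is a zero-width word-boundary assertion, \s eats one char.
ALTERNATION = re.compile(r"(?:\b|\s)viagra")

# Character class: here \b means the backspace character (0x08), so the
# class must consume exactly one backspace or whitespace character.
CHAR_CLASS = re.compile(r"[\b\s]viagra")

samples = ["viagra", "buy viagra", "(viagra)", "\x08viagra"]

# Map each sample to (alternation matched?, character class matched?).
results = {
    s: (bool(ALTERNATION.search(s)), bool(CHAR_CLASS.search(s)))
    for s in samples
}

for text, (alt_hit, class_hit) in results.items():
    print(repr(text), "alternation:", alt_hit, "class:", class_hit)
```

In this sketch the class form misses a hit at the very start of the text or right after punctuation, since it needs a whitespace or backspace character to consume first.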

> So if you have (?:a|b|c|d|...|z) isn't that exactly the same as [a-z]?  

Yes, because those are all characters. And [a-z] will execute faster because it 
can be simplified.
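
For single characters the equivalence is easy to check. The sketch below (Python, used here only for illustration) builds the full a-through-z alternation and compares it against [a-z] on every printable ASCII character. It makes no speed claim; timeit could be used to measure that on a given engine.

```python
import re
import string

# (?:a|b|c|...|z) built from the lowercase ASCII letters.
alternation = re.compile("(?:" + "|".join(string.ascii_lowercase) + ")")

# The equivalent character class.
char_class = re.compile("[a-z]")

# Compare both patterns on each printable ASCII character, one at a time.
same = all(
    bool(alternation.fullmatch(ch)) == bool(char_class.fullmatch(ch))
    for ch in string.printable
)

print("identical on all printable ASCII characters:", same)
```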

> Obviously something like "fuss(?:ing|ed|y)?" is a case where you'd want the 
> (?:...) syntax - but I'm referring to matching individual characters.

Ahh, but as we saw before \b can be 0 characters :)
