MailScanner content scanning for keywords

Mon Jul 18 17:15:45 IST 2005

James Gray wrote:
> On Sat, 16 Jul 2005 07:30 am, Matt Kettler wrote:
> 
>>Spammers use thousands of variants of the word "Viagra", do you want to
>>dictionary them all? 1 regex rule detects absurd numbers of possible
>>spellings:
>>
>>/(?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[a4ij1!|l\xCC-\xCF\xEC-\xEF][_\W]{0,3}[ila40\xC0-\xC6\xE0-\xE6@][_\W]{0,3}[xyz]?[gj][_\W]{0,3}rr?[_\W]{0,3}[a40\xC0-\xC6\xE0-\xE6@][_\W]{0,3}x?[_\W]{0,3}(?:\b|\s)/i
> 
> 
> Good grief! That looks like a slightly extended version of the OBFU_VIAGRA 
> rule I wrote about a year ago...I can tell coz it's still got the (?:\b|\s) 
> rules which, syntactically can be replaced with [\b\s].  At least that's 
> how it reads in my custom SA rules /now/ and works just the same (and is 
> faster from my testing).

Actually, it's part of the DRUGS_ERECTILE rule I developed for antidrug.cf, 
which is now a part of SA 3.0.0+. I started on it sometime in late 2003 and 
released a public version on January 16, 2004.

http://article.gmane.org/gmane.mail.spam.spamassassin.general/39305

It's interesting that the rest of our rules are similar, but then again, when 
you break it down it's all straightforward obfuscation handling.

http://mywebpages.comcast.net/mkettler/sa/antidrug.cf

The regex quoted is a slightly newer version of the __DRUGS_ERECTILE1 sub-part 
than is in common distribution via antidrug.cf or SA 3.0.x, one I've been 
testing but haven't done a mass-check of yet.
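
For anyone who wants to poke at the quoted expression, here is a minimal sketch 
that runs it against a few spellings. It is transcribed into Python's re module 
rather than Perl, which is an assumption of this example: Python str patterns use 
Unicode semantics for \W, so behaviour on the Latin-1 ranges may differ from a 
byte-oriented SpamAssassin run. The sample strings are made up for illustration 
and do not come from any corpus.

```python
import re

# The quoted regex, re-joined onto one line. Transcribed to Python's re
# module as an illustration; the real rule runs under Perl in SpamAssassin.
DRUGS_ERECTILE1 = re.compile(
    r"(?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[a4ij1!|l\xCC-\xCF\xEC-\xEF]"
    r"[_\W]{0,3}[ila40\xC0-\xC6\xE0-\xE6@][_\W]{0,3}[xyz]?[gj][_\W]{0,3}rr?"
    r"[_\W]{0,3}[a40\xC0-\xC6\xE0-\xE6@][_\W]{0,3}x?[_\W]{0,3}(?:\b|\s)",
    re.IGNORECASE,
)

# Hypothetical sample strings, not taken from any corpus.
obfuscated = ["Viagra", "V.i.a.g.r.a", "V 1 @ g r a", r"buy \/iagra now"]
innocent = ["Niagara", "visa"]

# Print each sample alongside whether the rule would fire on it.
for text in obfuscated + innocent:
    print(repr(text), bool(DRUGS_ERECTILE1.search(text)))
```

The four obfuscated samples should hit and the two innocent words should not, 
which is the whole point of the [_\W]{0,3} gap clauses between letters.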

> 
> Perl gurus: Am I correct? does (?:\b|\s) == [\b\s] ??  If not, what's the 
> difference?  AFAICT (?:...) matches something without creating the $x 
> holder to refer to the match later, and [...] does the same thing except 
> matches a set of individual characters.

I *may* have lifted the idea of using (?:\b|\s) from your rule, or from someone 
else's rule. Originally I used \b only. I believe that later I saw some other 
rule (yours, some SARE rule, dunno) with a mixed pre-gap clause using the 
combination \b|\s and decided to try it, and was pleased with the improvement. I 
don't think the combo-phrase was added until at least February 2004.

The addition of \s makes considerable sense when you consider that my gap-clause 
could be word or non-word characters ([\W_]{0,3}).
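
One way to read that point, sketched in Python's re module (an assumption of 
this example; the real rules are Perl): the \/ form of "V" starts with a 
non-word character, so there is no \b in front of it unless a word character 
happens to sit within gap-clause range. The shortened patterns and sample 
strings below are hypothetical, trimmed down from the full rule to isolate the 
leading clause.

```python
import re

# Simplified, hypothetical patterns: the same gap clause and "V" alternatives
# as the full rule, followed by a literal "iagra" to keep the demo short.
tail = r"[_\W]{0,3}(?:\\\/|V)iagra"
boundary_only = re.compile(r"\b" + tail, re.IGNORECASE)
boundary_or_space = re.compile(r"(?:\b|\s)" + tail, re.IGNORECASE)

samples = [r"buy \/iagra", r"--- \/iagra"]

# Map each sample to (hit with \b only, hit with \b|\s).
hits = {
    s: (bool(boundary_only.search(s)), bool(boundary_or_space.search(s)))
    for s in samples
}
print(hits)
```

With a word before it, both variants find the \/ spelling. With only punctuation 
before it, the \b-only variant has no boundary to anchor on, while the \s 
alternative can start from the space.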

I settled on using (?:\b|\s) instead of simplifying to [\b\s] based on my corpus 
testing. [\b\s] was the first thing that came to my mind, but it in fact does 
not work as well.

My *theory* is this is because \b is not a character, it's a zero-width 
assertion. [] would require a width as it is a character meta-class, reducing 
some of the hit possibilities. But that's a theory.
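
The theory is easy to check. Inside a bracketed character class, Perl treats \b 
as the backspace character (0x08) rather than a word boundary, and Python's re 
module does the same, so the sketch below uses Python as a stand-in (an 
assumption of this example). The shortened patterns and sample strings are 
hypothetical.

```python
import re

# (?:\b|\s): either a zero-width word boundary or one whitespace character.
alternation = re.compile(r"(?:\b|\s)V")
# [\b\s]: inside a class, \b means backspace (0x08), so this form needs one
# real character (backspace or whitespace) in front of the V.
char_class = re.compile(r"[\b\s]V")

samples = ["Viagra", "(Viagra)", "buy Viagra", "buy\x08Viagra"]

# Map each sample to (hit with the alternation, hit with the character class).
results = {
    s: (bool(alternation.search(s)), bool(char_class.search(s)))
    for s in samples
}
print(results)
```

The class form misses "Viagra" at the very start of the text and "(Viagra)" 
after punctuation, because nothing there is a backspace or whitespace character. 
The alternation catches both via the zero-width boundary, which fits the corpus 
result described above.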

> So if you have (?:a|b|c|d|...|z) isn't that exactly the same as [a-z]?  

Yes, because those are all characters. And [a-z] will execute faster because it 
can be simplified.
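
A small sketch of that equivalence, again in Python's re module as an assumed 
stand-in for Perl. It only checks that the two forms accept the same single 
characters; it does not measure speed.

```python
import re
import string

# Build (?:a|b|c|...|z) programmatically and compare it with [a-z].
alternation = re.compile("(?:" + "|".join(string.ascii_lowercase) + ")")
char_class = re.compile("[a-z]")

# Probe every single ASCII character against both patterns.
probes = [chr(c) for c in range(128)]
same = all(
    bool(alternation.fullmatch(p)) == bool(char_class.fullmatch(p))
    for p in probes
)
print(same)
```

Both patterns agree on every probe, since each alternative is exactly one 
character wide.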

> Obviously something like "fuss(?:ing|ed|y)?" is where you'd want the 
> (?:...) syntax - but I'm referring to matching individual characters.

Ahh, but as we saw before \b can be 0 characters :)
