MailScanner content scanning for keywords
Matt Kettler
mkettler at EVI-INC.COM
Mon Jul 18 17:15:45 IST 2005
James Gray wrote:
> On Sat, 16 Jul 2005 07:30 am, Matt Kettler wrote:
>
>>Spammers use thousands of variants of the word "Viagra", do you want to
>>dictionary them all? One regex rule detects absurd numbers of possible
>>spellings:
>>
>>/(?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[a4ij1!|l\xCC-\xCF\xEC-\xEF][_\W]{0,3}[ila40\xC0-\xC6\xE0-\xE6@][_\W]{0,3}[xyz]?[gj][_\W]{0,3}rr?[_\W]{0,3}[a40\xC0-\xC6\xE0-\xE6@][_\W]{0,3}x?[_\W]{0,3}(?:\b|\s)/i
>
>
> Good grief! That looks like a slightly extended version of the OBFU_VIAGRA
> rule I wrote about a year ago...I can tell coz it's still got the (?:\b|\s)
> rules which, syntactically can be replaced with [\b\s]. At least that's
> how it reads in my custom SA rules /now/ and works just the same (and is
> faster from my testing).
Actually, it's part of the DRUGS_ERECTILE rule I developed for antidrug.cf, and
it is now a part of SA 3.0.0+. Work on it started sometime in late 2003, with a
public version on January 16, 2004.
http://article.gmane.org/gmane.mail.spam.spamassassin.general/39305
It's interesting that the rest of our rules are similar, but then again, when
you break it down it's all straightforward obfuscation handling.
http://mywebpages.comcast.net/mkettler/sa/antidrug.cf
The regex quoted is a slightly newer version of the __DRUGS_ERECTILE1 sub-part
than is in common distribution via antidrug.cf or SA 3.0.x, one I've been
testing but haven't done a mass-check of yet.
>
> Perl gurus: Am I correct? does (?:\b|\s) == [\b\s] ?? If not, what's the
> difference? AFAICT (?:...) matches something without creating the $x
> holder to refer to the match later, and [...] does the same thing except
> matches a set of individual characters.
I *may* have lifted the idea of using (?:\b|\s) from your rule, or from someone
else's rule. Originally I used \b only. I believe that later I saw some other
rule (yours, some SARE rule, I don't know which) with a mixed pre-gap clause
using the combination \b|\s, decided to try it, and was pleased with the
improvement. I don't think the combo-phrase was added until at least
February 2004.
The addition of \s makes considerable sense when you consider that my gap-clause
could be word or non-word characters ([\W_]{0,3}).
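To make the gap clause concrete, here is a small sketch in Python, whose re
module treats \b, \s and \W much like Perl for this purpose. The pattern is a
cut-down stand-in for the rule quoted above (plain letters only, no accented
variants), and the names GAP, build, boundary_only and boundary_or_space are
made up for the illustration. It also shows one case where the extra \s appears
to matter: when the word starts with the "\/" lookalike, the first character is
a non-word character, so after whitespace or punctuation there is no \b in
front of it. That is one plausible reading of why the combination helps, not
something checked against a corpus.

```python
import re

# Gap clause from the rule: up to three characters that are either
# non-word characters or underscores (\W alone does not cover "_").
GAP = r"[_\W]{0,3}"

def build(edge):
    """Build a simplified 'viagra' matcher with the given edge clause."""
    # The first letter may be "v" or the two-character lookalike "\/".
    letters = [r"(?:\\/|v)", "i", "a", "g", "r", "a"]
    body = GAP.join(letters)
    return re.compile(edge + GAP + body + GAP + edge, re.IGNORECASE)

boundary_only = build(r"\b")              # hypothetical "\b only" variant
boundary_or_space = build(r"(?:\b|\s)")   # edge clause as used in the rule

samples = ["V.i.a.g.r.a", "buy V_I_A_G_R_A now", "!! \\/iagra !!", "viaduct"]
for text in samples:
    # Columns: sample, hit with \b only, hit with (?:\b|\s)
    print(repr(text),
          bool(boundary_only.search(text)),
          bool(boundary_or_space.search(text)))
```

In this sketch only the third sample separates the two variants: with \b alone
there is no boundary anywhere in front of the backslash, while \s can consume
the space before it.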
I settled on using (?:\b|\s) instead of simplifying to [\b\s] based on my corpus
testing. [\b\s] was the first thing that came to my mind, but it in fact does
not work as well.
My *theory* is that this is because \b is not a character; it's a zero-width
assertion. A [] character class always has to consume one character, which
rules out some of the possible hits. But that's only a theory.
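The zero-width point can be checked directly. Below is a minimal sketch in
Python, used here as a stand-in for Perl; both document that \b inside a
bracketed class means the backspace character rather than a word boundary, so
[\b\s] is really "one backspace or one whitespace character". The sample
strings and variable names are invented for the illustration.

```python
import re

# Zero-width word boundary OR one whitespace character.
alternation = re.compile(r"(?:\b|\s)viagra")
# Exactly one character: backspace (\x08) or whitespace.
char_class = re.compile(r"[\b\s]viagra")

for text in ["viagra", "-viagra", "buy viagra", "\x08viagra"]:
    # Columns: sample, hit with (?:\b|\s), hit with [\b\s]
    print(repr(text),
          bool(alternation.search(text)),
          bool(char_class.search(text)))
```

The first two samples hit only with the alternation: at the start of the text,
or right after punctuation, there is a boundary but no whitespace character for
the class to consume.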
> So if you have (?:a|b|c|d|...|z) isn't that exactly the same as [a-z]?
Yes, because those are all characters. And [a-z] will execute faster because it
can be simplified.
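For the single-character case the equivalence is easy to confirm. The sketch
below (Python as a stand-in for Perl, names invented for the illustration)
builds the 26-way alternation and compares it with [a-z] for every ASCII
character. It only checks that the two accept the same characters; it says
nothing about speed, which depends on the regex engine.

```python
import re
import string

# (?:a|b|...|z) built programmatically, versus the plain character class.
alternation = re.compile("(?:" + "|".join(string.ascii_lowercase) + ")")
char_class = re.compile("[a-z]")

ascii_chars = [chr(code) for code in range(128)]
same = all(
    bool(alternation.fullmatch(ch)) == bool(char_class.fullmatch(ch))
    for ch in ascii_chars
)
print(same)
```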
> Obviously something like "fuss(?:ing|ed|y)?" is a where you'd want the
> (?:...) syntax - but I'm referring to matching individual characters.
Ahh, but as we saw before \b can be 0 characters :)