Help with a regexp

Schramm, Dominik dominik.schramm at businessmart.de
Mon Aug 25 15:44:31 IST 2008


Hi Steve,

Steve Campbell wrote on Monday, August 25, 2008 3:23 PM:

> One of our domain names is cnpapers.com and another is cnpapers.net. 
> The SA rule URI_CHINA_ADJ catches a lot of our mail, and although 
> it is a relatively low scoring rule, it does contribute.
> 
> The rule is defined as follows:
> 
> /^(?:https?:\/\/)?.*\.cn.*/i

The regex says:

an optional protocol prefix ("http://" or "https://"), followed 
by any number of arbitrary characters (possibly none at all), 
followed by ".cn", followed by any number of arbitrary characters 
(again possibly none). So ".cn" is the only mandatory string, and 
it alone is enough for the regex to match, even when it is merely 
the start of a longer label such as "cnpapers". The scanner 
probably finds something like mailhost.cnpapers.com in the headers 
or http://www.cnpapers.com in the footer.
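
A quick way to check this reading is to run the rule against a few
sample strings. The sketch below uses Python's re module instead of
SpamAssassin itself, and example.cn / example.com are made-up hosts,
so treat it as an illustration only:

```python
import re

# The rule as quoted above, rewritten in Python's re syntax.
ORIGINAL = re.compile(r"^(?:https?://)?.*\.cn.*", re.IGNORECASE)

samples = [
    "http://www.cnpapers.com",  # ".cn" is only the start of "cnpapers"
    "mailhost.cnpapers.net",    # same problem, no protocol prefix
    "http://www.example.cn/",   # a real .cn host name (hypothetical)
    "http://www.example.com",   # no ".cn" anywhere
]

for s in samples:
    # bool() turns the match object (or None) into True/False
    print(s, bool(ORIGINAL.match(s)))
```

Only the last sample fails to match; both cnpapers names are hit
just like the real .cn host.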

What it should catch IMHO is:

an optional protocol prefix ("http://" or "https://"), followed 
by an arbitrary amount of arbitrary characters (which may be omitted 
altogether), followed by ".cn", either followed by a slash or followed 
by whitespace, followed by an arbitrary amount of arbitrary characters 
(which may be omitted altogether).

And that would translate back into a regex like this:

/^(?:https?:\/\/)?.*\.cn(?:\/|\s).*/i
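
To see what this changes, here is a small sketch in Python's re
syntax (not run through SpamAssassin; example.cn is a made-up
host):

```python
import re

# The tightened rule: ".cn" must be followed by "/" or whitespace.
TIGHTER = re.compile(r"^(?:https?://)?.*\.cn(?:/|\s).*", re.IGNORECASE)

samples = [
    "http://www.cnpapers.com",           # no longer matches
    "http://www.example.cn/index.html",  # matches: ".cn" then "/"
    "www.example.cn ",                   # matches: ".cn" then a space
    "http://www.example.cn",             # does not match, see note below
]

for s in samples:
    print(repr(s), bool(TIGHTER.match(s)))
```

Note the last sample: a URI that ends right after ".cn" is followed
by neither a slash nor whitespace, so it slips through. If that
case matters, adding "$" as a third alternative, as in
(?:\/|\s|$), would cover it.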

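However, I find the expression too permissive, even like this. The 
".*" before ".cn" also matches slashes, so a ".cn/" anywhere in the 
path of an unrelated URL would still trigger the rule. It should 
restrict the characters between the optional http(s) and ".cn" to 
those allowed in domain names.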
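
One possible way to do that, as a sketch only: replace the first
".*" with a character class. The class [a-z0-9.-] used below is my
own assumption of "allowed in domain names" (letters, digits,
hyphen, dot); it ignores things like user:password@ parts and
internationalized names. Python's re syntax again, with made-up
hosts:

```python
import re

# Rule from above, for comparison.
TIGHTER = re.compile(r"^(?:https?://)?.*\.cn(?:/|\s).*", re.IGNORECASE)

# Hypothetical stricter variant: only host-name characters may
# appear between the optional prefix and ".cn".
STRICT = re.compile(r"^(?:https?://)?[a-z0-9.-]*\.cn(?:/|\s).*",
                    re.IGNORECASE)

samples = [
    "http://www.example.cn/index.html",   # .cn host: both match
    "http://www.example.com/foo.cn/bar",  # ".cn/" only in the path
    "http://www.cnpapers.com/",           # neither matches
]

for s in samples:
    print(s, bool(TIGHTER.match(s)), bool(STRICT.match(s)))
```

The second sample is the interesting one: the rule with ".*" still
matches it, while the variant with the character class does not,
because "/" is not in the class.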

Hope this helps,
Dominik


