Help with a regexp

Tue Aug 26 12:39:01 IST 2008

Thanks Steve and Dominik.

I'll try one and/or the other shortly.

Steve

Schramm, Dominik wrote:
> Hi Steve,
>
> Steve Campbell wrote on Monday, August 25, 2008 3:23 PM:
>
>   
>> One of our domain names is cnpapers.com and another is cnpapers.net. 
>> The SA rule URI_CHINA_ADJ catches a lot of our mail, and although 
>> it is a relatively low scoring rule, it does contribute.
>>
>> The rule is defined as follows:
>>
>> /^(?:https?:\/\/)?.*\.cn.*/i
>>     
>
> The regex says:
>
> an optional protocol prefix ("http://" or "https://"), followed
> by any number of arbitrary characters (possibly none), followed by
> ".cn", followed by any number of arbitrary characters (possibly
> none). So ".cn" is the only mandatory string, and it is sufficient
> for the regex to match; the scanner probably finds something like
> mailhost.cnpapers.com in the headers or http://www.cnpapers.com
> in the footer.
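
A quick way to check this reading of the rule is a small Python sketch. The
rule itself is a Perl regex; Python's re module is used here only on the
assumption that this particular pattern behaves the same in both. The sample
strings and the names ORIGINAL_RULE and rule_hits are made up for
illustration.

```python
import re

# The rule as quoted above, transcribed to Python ("/" needs no escaping here).
ORIGINAL_RULE = re.compile(r"^(?:https?://)?.*\.cn.*", re.IGNORECASE)


def rule_hits(text):
    """Return True if the quoted rule matches the given string."""
    return ORIGINAL_RULE.match(text) is not None


# Hypothetical sample strings, not taken from real mail.
SAMPLES = (
    "http://www.cnpapers.com",   # ".cn" at the start of a host label
    "mailhost.cnpapers.net",     # same, without a protocol prefix
    "http://www.example.cn/",    # a real .cn top-level domain
    "http://www.example.com/",   # no ".cn" anywhere
)

for sample in SAMPLES:
    print(sample, rule_hits(sample))
```

Only the last sample is left alone; the two cnpapers strings hit just like
the genuine .cn address does.
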
>
> What it should catch IMHO is:
>
> an optional protocol prefix ("http://" or "https://"), followed
> by any number of arbitrary characters (possibly none), followed by
> ".cn", followed by either a slash or whitespace, followed by any
> number of arbitrary characters (possibly none).
>
> And that would translate back into a regex like this:
>
> /^(?:https?:\/\/)?.*\.cn(?:\/|\s).*/i
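
The effect of the added "slash or whitespace" requirement can be sketched
the same way in Python (again assuming the pattern behaves as it would in
Perl; PROPOSED_RULE, proposed_hits and the sample strings are hypothetical).

```python
import re

# The proposed pattern from this message, transcribed to Python.
PROPOSED_RULE = re.compile(r"^(?:https?://)?.*\.cn(?:/|\s).*", re.IGNORECASE)


def proposed_hits(text):
    """Return True if the proposed pattern matches the given string."""
    return PROPOSED_RULE.match(text) is not None


# Hypothetical sample strings, not taken from real mail.
SAMPLES = (
    "http://www.cnpapers.com",        # no longer matched: ".cn" is followed by "p"
    "http://www.example.cn/",         # matched: ".cn" is followed by a slash
    "http://www.example.cn",          # not matched: nothing follows ".cn"
    "http://www.example.com/a.cn/",   # matched: ".cn/" occurs in the path
)

for sample in SAMPLES:
    print(sample, proposed_hits(sample))
```

Two things stand out in this sketch: a URI that ends directly after ".cn"
is not caught, and a ".cn/" inside the path of a non-.cn host still is.
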
>
> However, even in this form I find the expression rather loose. It
> should restrict the characters between the optional http(s) prefix
> and ".cn" to those allowed in domain names.
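
One possible reading of that last suggestion, sketched in Python rather
than as a finished SpamAssassin rule: allow only letters, digits, hyphens
and dots before ".cn". Accepting the end of the string after ".cn" is an
extra assumption not made in the message above, and HOST_RULE, host_hits
and the sample strings are hypothetical.

```python
import re

# Assumed tightening: dot-separated host labels made of letters, digits and
# hyphens, then ".cn", then a slash, whitespace, or the end of the string.
HOST_RULE = re.compile(
    r"^(?:https?://)?"                 # optional protocol prefix
    r"[a-z0-9-]+(?:\.[a-z0-9-]+)*"     # one or more host labels
    r"\.cn(?:[/\s]|$)",                # ".cn" must end the host name
    re.IGNORECASE,
)


def host_hits(text):
    """Return True if the host part of the string ends in ".cn"."""
    return HOST_RULE.match(text) is not None


# Hypothetical sample strings, not taken from real mail.
SAMPLES = (
    "http://www.cnpapers.com",
    "mailhost.cnpapers.net",
    "http://www.example.cn",
    "http://www.example.cn/index.html",
    "http://www.example.com/a.cn/",
)

for sample in SAMPLES:
    print(sample, host_hits(sample))
```

In this sketch only the two www.example.cn strings match; the cnpapers
names and the ".cn/" buried in a path are left alone. It does not handle
ports or user:password@ prefixes.
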
>
> Hope this helps,
> Dominik
>
>