fix: regex for removing tags inside links (phishing filter)

Julian Field MailScanner at ecs.soton.ac.uk
Tue May 22 17:51:32 IST 2007


Can you give me a simple example where my regexp gets it wrong and yours
handles it better, please?
I need to see your patch in action before accepting it.

Juan Pablo Salazar Bertín wrote:
> The regexp for removing tags inside links is not very good. Currently, it's
> being done this way:
>
> $squashedtext =~ s/(\<\/?[^>]*\>)*//ig; # Remove tags
>
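> So HTML like this is not stripped properly, because [^>]* stops at the first
> ">" even when it sits inside a quoted attribute value, and the result is
> sometimes detected as phishing (not this example, but other cases):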
>
> <img alt="my image >>>">
>
> I've found a better regexp in
> http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx so
> now I'm successfully using this:
>
> $squashedtext =~
>   s/(\<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|\'.*?\'|[^\'\">\s]+))?)+\s*|\s*)\/?\>)*//ig; # Remove tags
>
> This has to be applied before whitespace is removed, because the pattern
> relies on whitespace to separate attributes.
>
>   
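
For illustration, here is a minimal sketch (not part of the original thread) that runs both substitutions quoted above on the sample tag. The sub names strip_tags_current and strip_tags_proposed are hypothetical wrappers added for this example. Only the two regexps come from the message, and the sketch assumes the proposed one is meant to be read as a single unbroken pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution currently in use.
sub strip_tags_current {
    my ($text) = @_;
    # [^>]* stops at the first ">", even inside a quoted attribute value.
    $text =~ s/(\<\/?[^>]*\>)*//ig;
    return $text;
}

# Hypothetical wrapper around the proposed substitution.
sub strip_tags_proposed {
    my ($text) = @_;
    # Quoted attribute values are consumed as a unit, so ">" inside them is safe.
    $text =~ s/(\<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|\'.*?\'|[^\'\">\s]+))?)+\s*|\s*)\/?\>)*//ig;
    return $text;
}

my $html = '<img alt="my image >>>">';
printf "current:  [%s]\n", strip_tags_current($html);    # prints: current:  [>>">]
printf "proposed: [%s]\n", strip_tags_proposed($html);   # prints: proposed: []
```

With the current regexp the tail of the attribute value survives as text, while the proposed one removes the whole tag. Simple tags such as <b>here</b> are stripped identically by both.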

Jules

- -- 
Julian Field MEng CITP
www.MailScanner.info
Buy the MailScanner book at www.MailScanner.info/store

MailScanner customisation, or any advanced system administration help?
Contact me at Jules at Jules.FM

PGP footprint: EE81 D763 3DB0 0BFD E1DC 7222 11F6 5947 1415 B654
For all your IT requirements visit www.transtec.co.uk