fix: regex for removing tags inside links (phishing filter)

Juan Pablo Salazar Bertín snifer_ at hotmail.com
Tue May 22 17:01:34 IST 2007


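The regexp for removing tags inside links is not robust: it treats the first
'>' it finds as the end of a tag, even when that '>' is inside a quoted
attribute value. Currently, it's being done this way: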

$squashedtext =~ s/(\<\/?[^>]*\>)*//ig; # Remove tags

So HTML like the following is not stripped properly, and in some cases the
leftover text is detected as phishing (not in this example, but in others):

<img alt="my image >>>">

I've found a better regexp in
http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx so
now I'm successfully using this:

$squashedtext =~ s/(\<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|\'.*?\'|[^\'\">\s]+))?)+\s*|\s*)\/?\>)*//ig; # Remove tags

This has to be applied before whitespace is removed, because the pattern
relies on whitespace to separate a tag name from its attributes.
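
To make the difference concrete, here is a small standalone sketch that runs
both substitutions on the example tag. The helper names strip_tags_old and
strip_tags_new and the sample string are made up for illustration; they are
not part of MailScanner, and only the two regexps are taken from this message.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only (not MailScanner code).

# Current pattern: [^>]* stops at the first '>', even inside quotes.
sub strip_tags_old {
    my ($text) = @_;
    $text =~ s/(\<\/?[^>]*\>)*//ig;
    return $text;
}

# Proposed pattern: quoted attribute values are consumed as a whole,
# so a '>' inside "..." or '...' does not end the tag.
sub strip_tags_new {
    my ($text) = @_;
    $text =~ s/(\<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|\'.*?\'|[^\'\">\s]+))?)+\s*|\s*)\/?\>)*//ig;
    return $text;
}

# Sample link text containing the problem tag from this message.
my $html = 'Click <img alt="my image >>>"> here';

print "old: [", strip_tags_old($html), "]\n";   # old: [Click >>"> here]
print "new: [", strip_tags_new($html), "]\n";   # new: [Click  here]
```

With the current pattern, the fragment >>"> is left behind in the squashed
text; with the proposed one, the whole tag is removed.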


