fix: regex for removing tags inside links (phishing filter)
Juan Pablo Salazar Bertín
snifer_ at hotmail.com
Tue May 22 17:01:34 IST 2007
The regexp that removes tags inside links is too simple. Currently, it's
being done this way:
$squashedtext =~ s/(\<\/?[^>]*\>)*//ig; # Remove tags
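So HTML like this is not stripped properly, because [^>]* stops at the first >
inside the quoted attribute value and leaves the rest of the tag behind. This
particular example is not flagged as phishing, but similar cases sometimes are: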
<img alt="my image >>>">
I've found a better regexp in
http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx so
now I'm successfully using this:
$squashedtext =~
s/(\<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|\'.*?\'|[^\'\">\s]+))?)+\s*|\s*)\/?\>)*//ig; # Remove tags
This has to be applied before whitespace is removed, because the pattern relies
on \s+ to separate the tag name from its attributes.
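
A minimal sketch comparing the two patterns on the sample tag above, and showing why the order matters. The helper names strip_tags_old and strip_tags_new and the trailing www.example.com text are made up for illustration; only the two substitutions come from this post, and this is not how MailScanner itself structures the code.

```perl
use strict;
use warnings;

# Hypothetical helper: the substitution currently used.
sub strip_tags_old {
    my ($text) = @_;
    $text =~ s/(\<\/?[^>]*\>)*//ig;
    return $text;
}

# Hypothetical helper: the proposed substitution, which understands
# quoted attribute values.
sub strip_tags_new {
    my ($text) = @_;
    $text =~ s/(\<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|\'.*?\'|[^\'\">\s]+))?)+\s*|\s*)\/?\>)*//ig;
    return $text;
}

# Sample tag from this post, followed by made-up link text.
my $sample = '<img alt="my image >>>">www.example.com';

print "old: ", strip_tags_old($sample), "\n";   # old: >>">www.example.com
print "new: ", strip_tags_new($sample), "\n";   # new: www.example.com

# If whitespace is squashed first, the new pattern no longer sees
# the \s+ between "img" and "alt", so the tag is left in place.
(my $squashed = $sample) =~ s/\s+//g;
print "new after squashing: ", strip_tags_new($squashed), "\n";
```

The last line prints the squashed input unchanged, which is the reason the tag removal has to run before the whitespace removal.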