Google sites still in phishing.bad.sites.conf?

Paul Sand pas at unh.edu
Fri Oct 30 16:02:04 UTC 2015


* Jerry Benton <jerry.benton at mailborder.com> [2015-10-30 07:12]:
> Ok, it would be great to have a mechanism that detects phishing fraud
> links. To accomplish this I used phishtank.com to provide the data. The
> current blacklist has ~11,000 entries based on the latest list from
> phishtank.com. This is after I have already scrubbed the list changing URL
> links into hostnames and then removing the common safe sites derived from
> alexa.com. Am I personally eyeballing every entry into this list on a
> daily basis? Hell no. It is an automated solution that is available if you
> want to use it. This is the best free solution I could come up with for the
> MailScanner community that is regularly updated with the latest threats. 

I hope I haven't given offense. We have been using MailScanner at the
University of New Hampshire since 2002. We are obviously happy with it,
and your hard work is appreciated.

> If anyone has a better solution, please send it to me. I will be happy to
> implement it. The file that is currently used to generate the list on the
> update server is attached. 

Your code seems to want to remove Alexa's top 500 domains from the
data returned from phishtank.com. E.g., since 'google.com' appears in Alexa,
phishtank.com entries like

    www.google.com
    docs.google.com
    sites.google.com
    [...]
    anything-else.google.com

should not go into the master phishing.bad.sites.conf. Correct?

I think that's an excellent idea, but the code doesn't quite do that.

My suggestion would be to construct a regex from the Alexa data.
Around line 84, after the data array is constructed:

    $safesite_regex = '/^(\w[\w-]*\\.)*?(' . implode('|', array_map('preg_quote', $data)) . ')$/i';

Then the test later in the code:

    }elseif(in_array($thing['host'], $safeParsedData)){

could be replaced with 

    }elseif(preg_match($safesite_regex, $thing['host'])){

This would also allow you to get rid of your 'www' hack and
special-case handling for google.com.
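
To illustrate the difference between the two tests, here is a minimal,
untested-against-phishtank sketch. The contents of $data and $hosts are
made-up stand-ins for the Alexa list and the phishtank.com hostnames, and
$results is just a hypothetical array for collecting the outcome. Each
domain is passed through preg_quote so that the dots match literally.

```php
<?php
// Hypothetical sample standing in for the Alexa-derived safe list.
$data = array('google.com', 'yahoo.com');

// Escape each domain so '.' is literal, then build a single pattern:
// zero or more subdomain labels, then one safe domain, anchored at both ends.
$quoted = array_map(function ($d) { return preg_quote($d, '/'); }, $data);
$safesite_regex = '/^(\w[\w-]*\\.)*?(' . implode('|', $quoted) . ')$/i';

// Hypothetical hostnames of the kind the phishtank.com data might contain.
$hosts = array(
    'google.com',
    'docs.google.com',
    'Sites.Google.com',
    'google.com.evil.example',
    'notgoogle.com',
    'googlexcom',
);

$results = array();
foreach ($hosts as $host) {
    $results[$host] = array(
        // Exact string comparison against the safe list.
        'exact' => in_array($host, $data),
        // Suffix match on whole labels, case-insensitive.
        'regex' => preg_match($safesite_regex, $host) === 1,
    );
    printf("%-24s exact=%s regex=%s\n", $host,
        var_export($results[$host]['exact'], true),
        var_export($results[$host]['regex'], true));
}
```

The in_array test only recognizes a host that is spelled exactly like a
list entry, so subdomains slip through. The regex accepts any host that
ends in a listed domain on a label boundary, and should still reject
look-alikes such as 'notgoogle.com' or 'google.com.evil.example'.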

Multiple Disclaimers: PHP is far from my strongest language. Gurus scoff
at my regex skills. I can't test this as robustly as I would like due
to the bandwidth limits at phishtank.com. So (said in my best spooky
Halloween voice): Beware!

-- 
-- Paul A Sand <pas at unh.edu>
-- Information Technology / University of New Hampshire
-- http://pubpages.unh.edu/~pas
-- Parts of this message may have been electronically reproduced.

