idea for next version

Matt Kettler mkettler at evi-inc.com
Tue Oct 10 23:19:07 IST 2006


mailscanner at berger.nl wrote:
> Well, I have been happily using MailScanner for a while now and it still works great.
>
> So I was checking MailWatch this evening and I found out that the spam / ham percentage is 60% / 40% in the daytime and 95% / 5% at night. This is quite logical, because in the daytime everybody is working and at night (well, here in Europe) only spammers are working. This could be used for spam filtering.

Actually, this suggestion isn't very new. It's been made dozens of times over on
the SpamAssassin list. It really doesn't work out in the general case.

Unfortunately, for most folks the split isn't as dramatic as 95/5, and even for
those who do see that, it's still a relatively poor spam rule.

The problem is that rule scores can't be viewed in terms of spam percentage.
That's not how rule scoring in SA works. SA assigns rule scores by "fitting" them
against a real-world test corpus. In the event of overlapping hits on the same
messages, this fitting winds up giving very little, if any, score to the
worst-performing rule in an overlapping group.

Rules with mediocre performance, like a mere 95% accuracy, often wind up finding
themselves with no score because there are better rules to give the points to
that cause fewer FPs.
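
As a toy illustration of that effect, here is a minimal sketch in Python. It
uses a brute-force grid search, which is not SpamAssassin's actual score
optimizer, and everything in it is invented for the example: the rule names
STRONG and NIGHT, the corpus counts, and the assumption that one false positive
costs as much as ten missed spams.

```python
from itertools import product

THRESHOLD = 5.0   # SpamAssassin's default required score
FP_COST = 10      # assumed: one false positive costs as much as 10 misses

# (is_spam, rules hit, message count): made-up corpus for illustration.
# NIGHT hits 95 spam and 5 ham, i.e. it is "95% accurate".
CORPUS = [
    (True,  {"STRONG", "NIGHT"}, 90),
    (True,  {"STRONG"},           5),
    (True,  {"NIGHT"},            5),
    (False, {"NIGHT"},            5),
    (False, set(),               95),
]

def cost(scores):
    """Weighted error count for one candidate score assignment."""
    total = 0
    for is_spam, hits, count in CORPUS:
        flagged = sum(scores[rule] for rule in hits) >= THRESHOLD
        if flagged and not is_spam:
            total += FP_COST * count   # false positives
        elif is_spam and not flagged:
            total += count             # missed spam
    return total

def fit(rules=("STRONG", "NIGHT"), step=0.5, top=5.0):
    """Try every score combination on a grid and keep the cheapest one.

    Ties are broken toward the lowest total score.
    """
    grid = [i * step for i in range(int(top / step) + 1)]
    candidates = (dict(zip(rules, combo))
                  for combo in product(grid, repeat=len(rules)))
    return min(candidates, key=lambda s: (cost(s), sum(s.values())))

print(fit())
```

With these made-up counts the cheapest fit gives STRONG the full 5.0 and NIGHT
0.0. Any NIGHT score below the threshold costs the same, so the search settles
on the lowest one, and raising NIGHT to 5.0 would buy five extra catches at the
price of five false positives.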


My numbers are more like 80/20, even for the "dead of night" hours:
"Oct  9 00:" 81.2% spam
"Oct  9 01:" 86.6% spam
"Oct  9 02:" 83.5% spam
...
"Oct  9 13:" 48.5% spam
...
"Oct  9 21:" 72.6% spam
"Oct  9 22:" 70.7% spam
"Oct  9 23:" 78.3% spam

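Per-hour figures like these can be pulled from a mail log with a short script.
The following is a minimal sketch in Python, not the command used for the
numbers above. It assumes syslog-style lines that start with a timestamp such as
"Oct  9 00:01:12" and contain either "is spam" or "is not spam"; both markers
are assumptions about your logging setup, and the function name and sample
lines are made up for illustration.

```python
from collections import Counter

def hourly_spam_percent(lines):
    """Return {hour_prefix: percent_spam} for syslog-style log lines."""
    spam = Counter()
    total = Counter()
    for line in lines:
        # Assumed verdict markers; adjust to whatever your scanner logs.
        if "is not spam" in line:
            is_spam = False
        elif "is spam" in line:
            is_spam = True
        else:
            continue  # not a verdict line
        hour = line[:10]  # month, day and hour, e.g. 'Oct  9 00:'
        total[hour] += 1
        spam[hour] += is_spam
    return {h: round(100.0 * spam[h] / total[h], 1) for h in total}

# Hypothetical sample lines, only to show the expected input shape.
sample = [
    "Oct  9 00:01:12 mx MailScanner[1]: Message A is spam",
    "Oct  9 00:05:40 mx MailScanner[1]: Message B is spam",
    "Oct  9 00:09:02 mx MailScanner[1]: Message C is spam",
    "Oct  9 00:30:00 mx MailScanner[1]: Message D is not spam",
    "Oct  9 13:02:11 mx MailScanner[1]: Message E is spam",
    "Oct  9 13:10:45 mx MailScanner[1]: Message F is not spam",
]

for hour, pct in sorted(hourly_spam_percent(sample).items()):
    print(f'"{hour}" {pct}% spam')
# "Oct  9 00:" 75.0% spam
# "Oct  9 13:" 50.0% spam
```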

The ratio you see depends heavily on how "localized" your mail is. If
you belong to a lot of globally-used mailing lists, your numbers at night will
be little different from your numbers at noon. Ditto if you have lots of
international contacts.



> I think if it were possible to do, for example, "spamscore * 1.2" between
> 11:00 pm and 7:00 am, it would catch more high-scoring spam at night. Of course
> it would also hit ham, but as there is much less ham at night, that is less
> likely. Besides, most of the overnight ham is mailing-list traffic, which is
> often whitelisted.
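
To make the arithmetic of the proposed "spamscore * 1.2" adjustment concrete,
here is a minimal sketch in Python. It is not a MailScanner or SpamAssassin
feature; the function names are hypothetical, and the threshold of 5.0 is
simply SpamAssassin's default required score.

```python
from datetime import time

THRESHOLD = 5.0       # SpamAssassin's default required score
NIGHT_FACTOR = 1.2    # multiplier from the proposal
NIGHT_START = time(23, 0)
NIGHT_END = time(7, 0)

def is_night(t):
    """True between 23:00 and 07:00; the window wraps past midnight."""
    return t >= NIGHT_START or t < NIGHT_END

def is_spam(score, received_at):
    """Apply the night-time multiplier, then compare with the threshold."""
    factor = NIGHT_FACTOR if is_night(received_at) else 1.0
    return score * factor >= THRESHOLD

print(is_spam(4.5, time(2, 0)))    # True: 4.5 * 1.2 is above 5.0
print(is_spam(4.5, time(13, 0)))   # False: no multiplier in the daytime
print(is_spam(4.0, time(2, 0)))    # False: 4.0 * 1.2 is still below 5.0
```

Multiplying by 1.2 is the same as lowering the night-time threshold from 5.0 to
about 4.17, so the only messages whose verdict changes are those scoring between
roughly 4.17 and 5.0.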

You whitelist mailing lists? Regularly? Wow. I don't. I only do such things for
spam discussion lists.

> 
> Any ideas?

Quite frankly, geographic origin is a whole lot more accurate, and even that
pretty well sucks. You might consider taking advantage of the RelayCountry
plugin and adding some rules like these (adjust the scores, etc., for your own
geography):


# informational, mostly for checking how much these hit
header RELAY_ES X-Relay-Countries=~/\bES\b/
describe RELAY_ES       Relayed through Spain
score RELAY_ES 0.01

header RELAY_UK X-Relay-Countries=~/\bGB\b/
describe RELAY_UK       Relayed through Britain
score RELAY_UK 0.01

header RELAY_FR X-Relay-Countries=~/\bFR\b/
describe RELAY_FR       Relayed through France
score RELAY_FR 0.01

header RELAY_DE X-Relay-Countries=~/\bDE\b/
describe RELAY_DE       Relayed through Germany
score RELAY_DE 0.01

header RELAY_AT X-Relay-Countries=~/\bAT\b/
describe RELAY_AT       Relayed through Austria
score RELAY_AT 0.01

# these have VERY high spam volume and little legit mail
# however, don't go over 3.0 or so with these.

header RELAY_CN X-Relay-Countries=~/\bCN\b/
describe RELAY_CN       Relayed through China
score RELAY_CN 1.5

header RELAY_KR X-Relay-Countries=~/\bKR\b/
describe RELAY_KR       Relayed through Korea
score RELAY_KR 1.5

header RELAY_KP X-Relay-Countries=~/\bKP\b/
describe RELAY_KP       Relayed through North Korea
score RELAY_KP 1.5

# countries prone to abuse, with low legit mail volume
# can't score high due to some legit mail
# however score bias of 0.1 to 1.5 is reasonable here
# depending on the country in question

header RELAY_AP X-Relay-Countries=~/\bAP\b/
describe RELAY_AP       Relayed through generic Asia/Pacific (AP)
score RELAY_AP  0.5

header RELAY_TW X-Relay-Countries=~/\bTW\b/
describe RELAY_TW       Relayed through Taiwan
score RELAY_TW 1.0

header RELAY_SK X-Relay-Countries=~/\bSK\b/
describe RELAY_SK       Relayed through Slovakia
score RELAY_SK 1.0

header RELAY_JP X-Relay-Countries=~/\bJP\b/
describe RELAY_JP       Relayed through Japan
score RELAY_JP 1.0

header RELAY_AR X-Relay-Countries=~/\bAR\b/
describe RELAY_AR       Relayed through Argentina
score RELAY_AR 1.0

header RELAY_BR X-Relay-Countries=~/\bBR\b/
describe RELAY_BR       Relayed through Brazil
score RELAY_BR 1.0

header RELAY_RU X-Relay-Countries=~/\bRU\b/
describe RELAY_RU       Relayed through Russia
score RELAY_RU 1.0

header RELAY_RO X-Relay-Countries=~/\bRO\b/
describe RELAY_RO       Relayed through Romania
score RELAY_RO 1.0

header RELAY_PS X-Relay-Countries=~/\bPS\b/
describe RELAY_PS       Relayed through occupied Palestine
score RELAY_PS 1.0

header RELAY_PL X-Relay-Countries=~/\bPL\b/
describe RELAY_PL       Relayed through Poland
score RELAY_PL 1.0

header RELAY_IL X-Relay-Countries=~/\bIL\b/
describe RELAY_IL       Relayed through Israel
score RELAY_IL 1.0

header RELAY_HU X-Relay-Countries=~/\bHU\b/
describe RELAY_HU       Relayed through Hungary
score RELAY_HU 1.0

header RELAY_NG X-Relay-Countries=~/\bNG\b/
describe RELAY_NG       Relayed through Nigeria
score RELAY_NG 1.0

header RELAY_PK X-Relay-Countries=~/\bPK\b/
describe RELAY_PK       Relayed through Pakistan
score RELAY_PK 1.0

header RELAY_GT X-Relay-Countries=~/\bGT\b/
describe RELAY_GT       Relayed through Guatemala
score RELAY_GT 1.0

