idea for next version

Logan Shaw lshaw at emitinc.com
Tue Oct 10 23:12:16 IST 2006


Roger wrote:
>> So I was checking mailwatch this evening and I found out that the spam /
>> ham percentage is 60% / 40% in the daytime and 95% / 5% at night. This is quite
>> logical because in the daytime everybody is working and at night (well, here in
>> Europe) only spammers are working. This can be used for the spam filtering.
>> I think if it is possible to, e.g., do "spamscore * 1.2" between 11:00 pm
>> and 7:00 am, it will hit more high-scoring spam at night. Of course it will
>> also hit ham, but as there is much less ham at night the chance of that is
>> lower.

On Tue, 10 Oct 2006, Steve Campbell wrote:
> I tend to look at this in a different light. Spam is spam, and should be 
> caught by rules, etc. regardless of the time it arrives. Ham is the same, also
> regardless of its arrival time. A good set of rules should work fine any
> time of the day. The percentages only indicate when people are sending mail, 
> so this is a useless figure for comparing day/night averages.

True enough, but every other rule that SpamAssassin uses
is a heuristic as well.  They're all based on particular
characteristics of the messages (or servers that send them)
and some kind of statistical correlation between those
characteristics and spamminess.

> For instance, if the same message that came in at night were resent during 
> the day, how should the mail be treated? Different score and action?

While I share the feeling that it is a little bit odd that the
time a message arrives could sway its score, this is already
true to some extent:  real-time blacklists change over time
(otherwise they wouldn't be real-time), so the score a message
gets can differ from one hour to the next.

Overall, I think time of arrival could be safely used as
yet another heuristic for determining if something is spam.
The key thing is that the scores would need to be right, which
I suspect means they'd need to be fairly low, something like
0.5 or so.  SpamAssassin already handles setting scores by
running a genetic algorithm (or whatever it is that it uses
that replaced the GA in 3.x), but since this varies so much
by site (what time zone the site is located in, what type
of usage patterns it sees, etc.), there would need to be a
reliable method of determining site-specific scores for this.
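
To make the idea concrete, here is a minimal sketch of what a
site-specific time-of-arrival rule might look like.  It is not
SpamAssassin code and does not use any SpamAssassin API; the
function names, the 0.9 threshold, and the shape of the
hourly_counts table are all assumptions for illustration.  The
only number taken from the discussion is the guessed score of 0.5.

```python
from datetime import datetime

# Assumed low score, following the "0.5 or so" guess above.
NIGHT_RULE_SCORE = 0.5


def spammy_hours(hourly_counts, min_spam_fraction=0.9):
    """Pick the local hours in which this site sees mostly spam.

    hourly_counts maps an hour (0-23) to a (spam_count, ham_count)
    pair taken from the site's own logs (hypothetical input format).
    """
    hours = set()
    for hour, (spam, ham) in hourly_counts.items():
        total = spam + ham
        # Skip hours with no mail at all to avoid dividing by zero.
        if total and spam / total >= min_spam_fraction:
            hours.add(hour)
    return hours


def time_of_arrival_score(arrival, hours, score=NIGHT_RULE_SCORE):
    """Return the extra score to add for a message arriving at `arrival`."""
    return score if arrival.hour in hours else 0.0
```

With Roger's figures (95% spam at night, 60% in the daytime), a
table such as {2: (95, 5), 14: (60, 40)} would mark only hour 2,
so a message arriving at 02:30 gets 0.5 added and one arriving
at 14:00 gets nothing.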

To go in a different direction, as long as we're talking about
time, another possibility is to apply time in other places.
For instance, you might have a time-dependent greylist.
Make the greylist's delay much longer at night and shorter
during the day.  You'd get a lot of the effectiveness of
greylisting but without as much delay during the active periods.
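
A time-dependent greylist could be sketched roughly as follows.
This is only an illustration, not the behaviour of any existing
greylisting package: the 5 and 60 minute delays are made-up
values, the 11:00 pm to 7:00 am window is borrowed from Roger's
message, and the function names are hypothetical.

```python
from datetime import datetime, time, timedelta

# Assumed values; a real site would tune these to its own traffic.
DAY_DELAY = timedelta(minutes=5)
NIGHT_DELAY = timedelta(minutes=60)
NIGHT_START = time(23, 0)
NIGHT_END = time(7, 0)


def is_night(t):
    """True if local time t falls in the night window, which wraps midnight."""
    return t >= NIGHT_START or t < NIGHT_END


def greylist_delay(first_seen):
    """Delay to enforce, chosen by when the sender was first seen."""
    return NIGHT_DELAY if is_night(first_seen.time()) else DAY_DELAY


def should_accept(first_seen, now):
    """Accept a retry only once the applicable delay has elapsed."""
    return now - first_seen >= greylist_delay(first_seen)
```

A sender first seen at 23:30 and retrying ten minutes later would
still be deferred, while the same retry pattern at noon would be
accepted.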

Overall, though, while looking at time does give you
additional information, it is not at all clear that the
positives of using it will outweigh the negatives.
Time is a trait of a message (or message delivery) that has a
strong correlation with spamminess, but there is also a steady
stream of exceptions, so getting value out of the time of
arrival is likely to be that much harder.

   - Logan

