chris at fractalweb.com
Fri Oct 31 20:40:25 GMT 2003
I have an email address that gets tons of spam and is used only for
"scientific purposes" -- specifically monitoring the effects of MailScanner
with Bayes, DCC, Razor, RBLs, etc. I keep a running 10-day archive of spam.
I wrote a little Perl program to quickly analyze all the spam messages stored
in the spam folder of my email client (KMail). In the past couple of months,
I have seen the daily volume of spam increase from about 70 messages a day to
now over 130 per day. One message still sneaks through here and there, but
overall MailScanner with all the plugins is doing a wonderful job.
The interesting point is that over the past two months, I've seen the average
spam score increase from about 15 to almost 18. I assume this is a result
of my "feeding" the Bayes filters copies of all the spam that sneaks through
and ever-so-slightly tweaking some of the Bayes scores.
You'll notice that there are messages tagged as spam here that are very low
scoring--there's one message that only scored 1.1, but was still listed in
the Easynet-DNSBL so was (correctly) identified as spam. Most of the other
ones that scored below 5 and were tagged as spam are there for the same
reason.
For what it's worth, here's the output of my (crude) little program:
Total spams: 1318
High: 56 / Low: 1.1
Average Deviation: 5.16
Standard Deviation: 6.80
mean - 1: 10.99
mean + 1: 24.60
mean + 2: 31.40
mean + 3: 38.21
Count of messages...
under 5.0: 11
within 1 std. deviation +/- of mean: 982 = 74.5%
above 5, but below 1 std. dev.: 165
below mean, but above 1 std. dev.: 511
above mean, but below 1 std. dev.: 471
above 1 +std_dev, below 2: 115
above 2 +std_dev, below 3: 27
above 3 +std_dev: 18
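For anyone who wants to reproduce this kind of summary, here is a minimal Python sketch. It is not the Perl program that produced the output above. The function name `summarize` and the sample scores are made up for illustration, and it assumes a population standard deviation, since the post does not say whether the original divided by n or n - 1.

```python
import statistics


def summarize(scores):
    """Summarize a list of numeric spam scores."""
    mean = sum(scores) / len(scores)
    # Average (mean absolute) deviation from the mean.
    avg_dev = sum(abs(s - mean) for s in scores) / len(scores)
    # Assumption: population standard deviation; the post does not say
    # whether the original program divided by n or n - 1.
    std_dev = statistics.pstdev(scores)
    # Messages whose score lies within one standard deviation of the mean.
    within_one = sum(1 for s in scores if abs(s - mean) <= std_dev)
    return {
        "total": len(scores),
        "high": max(scores),
        "low": min(scores),
        "mean": mean,
        "avg_dev": avg_dev,
        "std_dev": std_dev,
        "within_one_sd": within_one,
        "within_one_sd_pct": 100.0 * within_one / len(scores),
    }


if __name__ == "__main__":
    # Made-up scores for illustration only, not the archive from the post.
    sample = [1, 5, 5, 5, 5, 9]
    for name, value in summarize(sample).items():
        print(f"{name}: {value}")
```

Feeding it the real scores pulled out of the spam folder should give figures comparable to the ones listed above, subject to the standard-deviation assumption.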
Assuming I paid attention in stats class decades ago and haven't killed the
brain cells that stored whatever I learned, this info might tell us
something. Looking at these figures, we can imagine a pretty standard-looking
bell curve with the centerline at 17.8, with the first standard deviation
lines at 11 and 24.6. Almost 75% of the spam I receive falls within that
range.
The question is: based on this data, what should I set my "high spam"
threshold to be? And what else can we learn, if anything, from this data?
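One way to look at the threshold question is to ask what share of this spam would score at or above each candidate cutoff. The rough Python sketch below does that using only the bucket counts printed above. The names `BUCKETS` and `tail_shares` are made up, the cutoffs are limited to the bucket boundaries the program reported, and how the original program treated scores sitting exactly on a boundary is unknown. It says nothing about false positives, because the data contains spam only.

```python
# Bucket counts copied from the program output in this post.
# Each entry: (lower bound of the bucket, number of messages in it).
BUCKETS = [
    (1.1, 11),     # under 5.0 (lowest score seen was 1.1)
    (5.0, 165),    # 5.0 up to mean - 1 std. dev.
    (10.99, 511),  # mean - 1 std. dev. up to the mean
    (17.8, 471),   # mean up to mean + 1 std. dev.
    (24.60, 115),  # mean + 1 up to mean + 2 std. dev.
    (31.40, 27),   # mean + 2 up to mean + 3 std. dev.
    (38.21, 18),   # above mean + 3 std. dev.
]


def tail_shares(buckets):
    """Return (cutoff, count, fraction) of messages at or above each cutoff."""
    total = sum(count for _, count in buckets)
    shares = []
    running = 0
    # Walk from the highest bucket down, accumulating the tail count.
    for lower, count in reversed(buckets):
        running += count
        shares.append((lower, running, running / total))
    shares.reverse()
    return shares


if __name__ == "__main__":
    for cutoff, count, fraction in tail_shares(BUCKETS):
        print(f"score >= {cutoff:5.2f}: {count:4d} messages ({fraction:.1%})")
```

Each printed line shows how much of this 10-day sample a "high spam" threshold at that cutoff would catch.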
chris at fractalweb.com
"Reality is that which, when you stop believing in it, doesn't go
away."
-- Philip K. Dick