statistics

Fri Oct 31 20:40:25 GMT 2003

I have an email address that gets tons of spam and is used only for
"scientific purposes" -- specifically, monitoring the effects of MailScanner
with Bayes, DCC, Razor, RBLs, etc. I keep a running 10-day archive of spam.

I wrote a little Perl program to quickly analyze all the spam messages stored
in the spam folder of my email client (KMail). In the past couple of months,
I have seen the daily volume of spam increase from about 70 messages a day to
over 130. The odd message still sneaks through, but overall MailScanner with
all the plugins is doing a wonderful job.

The interesting point is that over the past two months, I've seen the average
spam score increase from about 15 to almost 18. I assume this is the result
of my "feeding" the Bayes filters copies of all the spam that sneaks through
and ever-so-slightly tweaking some of the Bayes scores.

You'll notice that some of the messages tagged as spam here scored very low.
One message scored only 1.1, but it was listed in the Easynet-DNSBL, so it
was (correctly) identified as spam. Most of the other messages that scored
below 5 and were tagged as spam are there for the same reason.

For what it's worth, here's the output of my (crude) little program:

Spam statistics
---------------
Total spams: 1318
Average: 17.80
High: 56 / Low: 1.1
Range: 54.9
Median: 19.1
Average Deviation: 5.16
Variance: 46.29
Standard Deviation: 6.80

Standard deviations:
mean - 1: 10.99
mean + 1: 24.60
mean + 2: 31.40
mean + 3: 38.21

Count of messages...
under 5.0: 11
within 1 std. deviation +/- of mean: 982 = 74.5%

5.0 up to mean - 1 std. dev.: 165
mean - 1 std. dev. up to mean: 511
mean up to mean + 1 std. dev.: 471
mean + 1 up to mean + 2 std. dev.: 115
mean + 2 up to mean + 3 std. dev.: 27
above mean + 3 std. dev.: 18
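
For anyone who wants to reproduce figures like these, here is a minimal
Python sketch. It is not the Perl program itself, just an illustration of the
same calculations. The function names, the use of population variance, and
the half-open bucket edges are all assumptions; the score list would have to
come from your own spam folder.

```python
import bisect
import statistics


def summarize(scores):
    """Return the headline figures for a list of spam scores."""
    mean = statistics.mean(scores)
    # Assumption: population variance (divide by n). The original program
    # may have used the sample form (divide by n - 1) instead.
    variance = statistics.pvariance(scores, mu=mean)
    return {
        "total": len(scores),
        "average": mean,
        "high": max(scores),
        "low": min(scores),
        "range": max(scores) - min(scores),
        "median": statistics.median(scores),
        # Mean absolute distance from the average.
        "average_deviation": sum(abs(s - mean) for s in scores) / len(scores),
        "variance": variance,
        "std_dev": variance ** 0.5,
    }


def bucket_counts(scores, mean, std_dev, floor=5.0):
    """Count scores per band: below floor, floor to mean - 1 SD, and so on.

    Assumes floor < mean - std_dev so the edges are in ascending order.
    A score equal to an edge is counted in the band above that edge.
    """
    edges = [floor, mean - std_dev, mean, mean + std_dev,
             mean + 2 * std_dev, mean + 3 * std_dev]
    counts = [0] * (len(edges) + 1)
    for s in scores:
        counts[bisect.bisect_right(edges, s)] += 1
    return counts
```

Calling summarize(scores) and then bucket_counts(scores, stats["average"],
stats["std_dev"]) gives seven counts, in the same order as the list above
(under 5.0 first, above mean + 3 std. dev. last).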

Assuming I paid attention in stats class decades ago and haven't killed the
brain cells that stored whatever I learned, this info might tell us
something. Looking at these figures, we can imagine a fairly standard-looking
bell curve centered at 17.8, with the first standard deviation lines at 11
and 24.6. Almost 75% of the spam I receive falls within that area.
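
One way to check the bell-curve picture is to compare the observed counts
with what a true normal distribution would predict. The sketch below is only
a rough sanity check, not a formal test; the helper names are made up for
illustration, and the counts are taken from the output above.

```python
import math


def normal_fraction_within(k):
    """Fraction of a normal distribution within k std. deviations of the mean."""
    return math.erf(k / math.sqrt(2))


def normal_fraction_above(k):
    """Fraction of a normal distribution more than k std. deviations above the mean."""
    return 0.5 * math.erfc(k / math.sqrt(2))


TOTAL = 1318

# Within 1 std. deviation: observed versus a perfect bell curve.
observed_within_1 = 982 / TOTAL                 # about 0.745
expected_within_1 = normal_fraction_within(1)   # about 0.683

# Above mean + 3 std. deviations: observed count versus expected count.
observed_above_3 = 18
expected_above_3 = normal_fraction_above(3) * TOTAL   # roughly 2 messages

print(f"within 1 SD: observed {observed_within_1:.1%}, normal {expected_within_1:.1%}")
print(f"above +3 SD: observed {observed_above_3}, normal {expected_above_3:.1f}")
```

If the numbers differ noticeably, as they appear to here, the distribution is
probably more peaked in the middle and longer-tailed on the high side than a
textbook bell curve.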

The question is: based on this data, what should I set my "high spam"
threshold to be? And what else can we learn, if anything, from this data?
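
As a starting point for the threshold question, this small sketch works out
what share of the archived spam would land above each standard deviation
cutoff, using only the counts reported above. It is arithmetic, not a
recommendation; the function name and data layout are assumptions made for
the example.

```python
TOTAL = 1318

# (cutoff score, messages between this cutoff and the next one up),
# copied from the program output: mean + 1, + 2, and + 3 std. deviations.
BANDS = [(24.60, 115), (31.40, 27), (38.21, 18)]


def share_above(bands, total):
    """Return (cutoff, messages above it, share of total) for each cutoff."""
    results = []
    running = 0
    # Walk from the highest band down so each cutoff includes all bands above it.
    for cutoff, count in reversed(bands):
        running += count
        results.append((cutoff, running, running / total))
    return list(reversed(results))


for cutoff, count, share in share_above(BANDS, TOTAL):
    print(f"threshold {cutoff:5.2f}: {count:4d} messages ({share:.1%}) would be high spam")
```

With these counts, a threshold at mean + 1 std. dev. would catch 160 of the
1318 messages, and one at mean + 2 would catch 45.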

Cheers,
Chris
--
Chris Yuzik
chris at fractalweb.com
604-304-0444

"Reality is that which, when you stop believing in it, doesn't go
away".
                -- Philip K. Dick