Maybe a bit OT, auto adjusting high scoring value..

David dh at UPTIME.AT
Sun Mar 16 14:08:48 GMT 2003


-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160

Hello.

First of all let me explain my setup.

I have a "low" score of 5.3 and a high score of 13. High scoring spam 
is deleted, but the message is still forwarded to me so that I can 
check that it really is not a message of some value to the user. 
This is something we all agreed on.

Out of curiosity I collected 631 Spam messages, all verified by me to 
be actual spam. Some of them are above the threshold of 13, others are 
within the range of 5.3-13.

I have written a little Perl script, which reads that Mbox, collects 
all the Spam Scores and tosses them into a little array on which I am 
able to perform some statistical operations using Statistics::Lite.
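
For anyone who wants to reproduce the collection step, here is a minimal sketch in Python rather than the Perl script described above. The header name `X-MailScanner-SpamCheck` and the `score=` layout are assumptions; the real header depends on the MailScanner configuration, so the pattern will need adjusting. It also assumes the score sits on the first line of the header and not on a folded continuation line.

```python
import re

# Assumed header name and layout (hypothetical); adjust to your own mail.
SCORE_RE = re.compile(
    r"^X-MailScanner-SpamCheck:.*?score=(-?\d+(?:\.\d+)?)",
    re.IGNORECASE | re.MULTILINE,
)

def extract_scores(mbox_text):
    """Return every spam score found in the raw text of an mbox file."""
    return [float(m) for m in SCORE_RE.findall(mbox_text)]

# Two made-up messages, only to show the expected input shape.
sample = (
    "From a@example.com Sun Mar 16 14:00:00 2003\n"
    "X-MailScanner-SpamCheck: spam, SpamAssassin (score=13.4, required 5.3)\n"
    "\n"
    "body\n"
    "\n"
    "From b@example.com Sun Mar 16 14:01:00 2003\n"
    "X-MailScanner-SpamCheck: spam, SpamAssassin (score=5.3, required 5.3)\n"
    "\n"
    "body\n"
)
print(extract_scores(sample))
```

With a real mailbox you would read the file contents into a string and pass that in place of `sample`.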

For me that returns:

Max Value: 31.7
Min Value: 5.3 (kinda expected)
Data Range: 26.4
Variance: 26.2935....
Std. Deviation: 5.0292...
Mean Score: 13.81410...
Median: 13.4
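
The same summary can be sketched with Python's standard `statistics` module. This is not the Statistics::Lite script itself; the `summarize` helper and the sample scores are hypothetical. Note that `statistics.variance` is the sample variance (divides by n-1), and the standard deviation is its square root.

```python
import statistics

def summarize(scores):
    """Return the summary figures listed above for a list of spam scores."""
    return {
        "max": max(scores),
        "min": min(scores),
        "range": max(scores) - min(scores),
        "variance": statistics.variance(scores),  # sample variance (n-1)
        "stddev": statistics.stdev(scores),       # square root of the above
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
    }

# Made-up scores for illustration only, not the 631 collected messages.
scores = [5.3, 7.1, 9.8, 13.4, 14.0, 18.2, 31.7]
for name, value in summarize(scores).items():
    print(f"{name}: {value:.4f}")
```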

Now my question, which I am posting to this list because I know there 
are many talented mathematicians out there:

a) Does collecting data this way make sense?
b) Which statistical functions would make sense?

What I am trying to do is the following.

I notice that there is a LOT of verified spam in the range from 5.3 
to 13, and I am trying to find the best value for our typical spam 
flow: one that catches most verified spam and still lets the rare 
false positive pass through to the user. If you recall, I delete the 
high scoring spam.

So basically I need to find the best value for "High scoring".
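
One possible way to frame the search, as a sketch only: try a few candidate thresholds and, for each, count what fraction of verified spam would be deleted and how many legitimate messages would be deleted with it. This assumes a second collection of verified non-spam scores, which I do not have yet, and it assumes a message counts as high scoring when its score is at or above the threshold. The `sweep` function and all numbers are hypothetical.

```python
def sweep(spam_scores, ham_scores, thresholds):
    """For each threshold, return (threshold, spam fraction deleted, ham deleted)."""
    rows = []
    for t in thresholds:
        # Assumption: "high scoring" means score >= threshold.
        spam_fraction = sum(s >= t for s in spam_scores) / len(spam_scores)
        ham_deleted = sum(h >= t for h in ham_scores)
        rows.append((t, spam_fraction, ham_deleted))
    return rows

# Made-up scores for illustration only.
spam = [5.3, 7.1, 9.8, 13.4, 14.0, 18.2, 31.7]
ham = [0.2, 1.5, 3.9, 6.4, 8.8]

for t, spam_fraction, ham_deleted in sweep(spam, ham, [7, 9, 11, 13]):
    print(f"threshold {t}: {spam_fraction:.0%} of spam deleted, "
          f"{ham_deleted} legitimate message(s) deleted")
```

The lowest threshold at which the count of deleted legitimate messages is still acceptable would then be a candidate for the high score. The mean and standard deviation of the spam scores alone cannot tell you that, because they say nothing about where legitimate mail scores.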

I would be very happy if you could tell me how to tackle this, because 
I really know nothing about math and I think what I just did has little 
to no value.

- -d



- - ❜ Imagination is more important than knowledge.❛ - Albert Einstein
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (Darwin)

iD8DBQE+dIV0iW/Ta/pxHPQRAzVvAKDGv6WRjGyMqc5pRAQyi/467M7fHwCghgsh
TaL4ldLqeIEb0qtZdPwOF2Y=
=Ua2i
-----END PGP SIGNATURE-----




More information about the MailScanner mailing list