Spam Corpus size and false positives?
Matt Kettler
mkettler at EVI-INC.COM
Fri Jun 4 18:37:15 IST 2004
At 10:11 AM 6/4/2004, Max Kipness wrote:
>Here is the size of my spam/ham corpuses:
>
>bayes corpus size: nspam = 15538, nham = 9517
>
>I had my bayes threshold setting at 7, which may have been a bit low, and
>I'm seeing some false positives. I've now raised the auto-learn threshold
>to 12.
Good move. It's generally unwise to set the auto-learn threshold that
low.
(For that matter, I also run with my ham autolearn threshold set closer to
0 than the default 0.1.)
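To make the threshold effect concrete, here is a rough sketch of the auto-learn decision. It is a simplification, not SpamAssassin's actual code: the real auto-learner applies further conditions that are left out here, the function name is made up for illustration, and the 0.05 ham threshold is just an example of "closer to 0".

```python
# Simplified model of the auto-learn decision (illustrative only).
SPAM_THRESHOLD = 12.0  # learn as spam at or above this score
HAM_THRESHOLD = 0.05   # hypothetical value; learn as ham below this score


def autolearn_action(score, spam_threshold=SPAM_THRESHOLD,
                     ham_threshold=HAM_THRESHOLD):
    """Return 'spam', 'ham', or 'none' for a message score."""
    if score >= spam_threshold:
        return "spam"
    if score < ham_threshold:
        return "ham"
    return "none"
```

With a spam threshold of 7, a borderline message scoring 8 gets learned as spam; at 12 it is left alone, which is why the low setting can poison the database with false positives.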
>
>Is there anything else I can do to correct the bayes analysis and get it not
>to tag so much at 99%? Or is feeding ham the only way? This is hard for
>me to do, as another guy and myself are the only ones that really feed it.
There's not a whole lot you can do to "correct" bayes directly, but here
are some approaches you can try:
1) Delete the database and start from scratch. This is brute force, but it
is effective.
2) Step up your ham training. I suggest setting up a "hamtrap" account.
Have all mail to this account auto-fed to bayes as ham, and subscribe it
to a few legitimate sources (news updates, product announcement mailing
lists, industry newsletters, etc.).
3) Use crafted emails as ham training to counterbalance some words. Run
sa-learn --dump and grep the output for tokens with a probability of 0.9
or higher. Look through these for words that are obviously misclassified.
Create an email containing some of these words, send it to yourself, and
ham train it.
Cautions about method 3:
-Use this method sparingly.
-Don't try to micromanage your bayes database contents.
-Tinkering with the bayes tokens using faked emails isn't generally a good
idea, but it is useful if you've got problems and don't want to wipe the
bayes DB.
-Do NOT try this method for spam training.
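If grep gets unwieldy for method 3, the filtering step could be sketched in Python like this. It assumes each token line of the sa-learn --dump output has five whitespace-separated fields (probability, spam count, ham count, access time, token) and that bookkeeping lines start their last field with "non-token data"; check this against your own dump before relying on it. The function name and the sample lines in the usage note are made up for illustration.

```python
def suspicious_tokens(dump_lines, min_prob=0.9):
    """Return (prob, nspam, nham, token) tuples at or above min_prob,
    highest probability first. Assumes five-field dump lines."""
    hits = []
    for line in dump_lines:
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue  # blank or unexpected line
        prob_text, nspam_text, nham_text, _atime, token = parts
        token = token.strip()
        if token.startswith("non-token data"):
            continue  # database bookkeeping, not a real token
        try:
            prob = float(prob_text)
            nspam = int(nspam_text)
            nham = int(nham_text)
        except ValueError:
            continue  # line does not match the assumed layout
        if prob >= min_prob:
            hits.append((prob, nspam, nham, token))
    hits.sort(reverse=True)
    return hits
```

You would feed it the dump output split into lines, then eyeball the result for ordinary business words that have no reason to score that high. Those are the candidates for a crafted ham message.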
>
>I've thought of rebuilding the databases but with a higher auto-learn
>threshold, but this would allow in a flood of spam, right?
Hmm, depends what you mean by "rebuilding". If you're going to delete them,
and retrain with a large enough corpus of spam, it won't matter.
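For the retraining half of that, a minimal sketch follows. It assumes you have saved spam and ham in two directories (the paths in the usage note are hypothetical) and only builds the sa-learn --spam and sa-learn --ham invocations. Wiping the old database is deliberately left out, and dry_run defaults to True so nothing runs until you ask for it.

```python
import subprocess


def retrain_commands(spam_dir, ham_dir):
    """Build the sa-learn invocations for a from-scratch retrain."""
    return [
        ["sa-learn", "--spam", spam_dir],
        ["sa-learn", "--ham", ham_dir],
    ]


def retrain(spam_dir, ham_dir, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in retrain_commands(spam_dir, ham_dir):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling retrain("/var/spool/corpus/spam", "/var/spool/corpus/ham") just prints the two commands so you can check them first.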