slightly OT: how do i know if i've been poisoned? (Bayes)

Matt Kettler mkettler at evi-inc.com
Fri Oct 20 22:38:05 IST 2006


Furnish, Trever G wrote:

<snip, stuff I can't help you with directly, but using the tools below you can
probably help yourself>

> And in the data part of the dump I see lots of what seems to be random
> data.  In fact that's all I see in the data dump -- no tokens I'd
> recognize as anything other than random garbage:
> 
> 0.978          2          0 1161346932  36dbf22fa5
<snip>

> 
> Is that the way it should look?  I expected to see actual words.

Yes, SA 3.0.0 and higher store 40 bits of the SHA-1 hash of each word, not the word
itself. This makes the entries all fixed-size, which offers a significant
performance gain.
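
If you're curious what those fixed-size values are, here's a rough sketch of the
idea in Python (illustrative only; SA's exact byte order and on-disk layout may
differ from this):

  import hashlib

  def token_key(token):
      # Take the first 5 bytes (40 bits) of the SHA-1 digest of the token
      # text; every key ends up the same fixed size, like the 10-hex-digit
      # values in the dump above.
      return hashlib.sha1(token.encode("utf-8")).digest()[:5].hex()

  print(token_key("viagra"))  # prints a short hex string, not the word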

It also makes it impossible to decipher the bayes DB into a human-readable form.
This has some ups and downs.

One benefit is enhanced security. In the old system, if you had a shared bayes
DB, any of the users could read the database and figure out a lot about your email:
	who's been sending mail to your network (H*F tokens)
	what topics are being discussed, by looking at body tokens (hint: specialized
terminology really stands out)


If you really want to see the tokens in text form for a specific message, use
the following:

 spamassassin -D bayes < message.txt

And you should get some debug output like this:

[807] dbg: bayes: token 'I*:what' => 0.99846511627907
[807] dbg: bayes: token 'I*:future' => 0.996181818181818


> 
> Also, one one of the lists (either this one or the mailwatch list)
> someone said that Bayesian filtering was "4 times as effective" when it
> has more ham than spam to learn from -- but that makes no sense to me,
> and it's also not something that seems tenable -- I get about 95% spam.

What's that saying about 75% of statistics being made up on the spot? $5 says the
person was pulling that figure out of their behind, or was parroting a statistic
that applies to some other tool that implements bayes in an obscure fashion.


Technically, the ideal for SA, or nearly any other bayes implementation, would be
an exact 50/50 mix. That is actually supported by the math if you think about how
bayes works. Ideally you want "common" words that appear in both spam and nonspam
to wind up with a token spam probability of 0.500. You'll get that, on average, if
you're training on the same number of spam and nonspam messages. Otherwise, your
average "common word" token will be biased toward roughly your training ratio.

However, SA's use of chi-squared combining makes it very resistant to wild
deviations from that ideal. The impact of any "near the middle" tokens is
heavily drowned out by stronger ones, and a single token near 0.000 can drag the
combined score down substantially against many 0.950s.
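
Here is a rough sketch of that combining step in the spirit of the Robinson/Fisher
approach (not SA's exact code; the constants, clamping, and score cutoffs differ):

  import math

  def chi2q(x2, dof):
      # Survival function of the chi-squared distribution for an even number
      # of degrees of freedom, the standard trick used by Fisher-method filters.
      m = x2 / 2.0
      total = term = math.exp(-m)
      for i in range(1, dof // 2):
          term *= m / i
          total += term
      return min(total, 1.0)

  def chi_combine(probs):
      # Fisher's method applied twice, once to the spammy evidence and once to
      # the hammy evidence, then folded into a single 0..1 score.
      n = len(probs)
      spammy = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
      hammy = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
      return (spammy - hammy + 1.0) / 2.0

  print(chi_combine([0.95] * 10))            # essentially 1.0
  print(chi_combine([0.95] * 10 + [0.001]))  # pulled down noticeably by one token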

Unless your training ratio is approaching 99% spam, you should be fine. And even
then, it would only cause increased false positives; it definitely will NOT cause
BAYES_00 problems. That would only be an issue if your ratio were approaching 1%
spam.


It should also be noted that, technically speaking, SA's "bayes" isn't really
Bayesian. In fact, nearly all "bayes" filters aren't Bayesian. Chi-squared
combining works better and runs faster than real Bayesian combining, but the
term has generally been applied to any statistical token-analysis system,
regardless of how probabilities are calculated and combined.

Fundamentally, there are three common kinds of "bayes" out there, all of which
"work best" at 50/50:

The original Paul Graham method, using naive Bayes combining.
The improved method suggested by Robinson, using geometric means.
The chi-squared method, also called Fisher's method.

Most use chi-squared nowadays. It's faster, works better, and is highly resistant
to being biased by poisoning or uneven training.
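
For contrast, here is roughly what Graham-style naive combining looks like (again
just a sketch under the same assumptions as above, not any filter's exact code):

  import math

  def naive_combine(probs):
      # Graham-style combining: P(spam) = prod(p) / (prod(p) + prod(1 - p)),
      # done in log space to avoid underflow on long messages.
      log_spam = sum(math.log(p) for p in probs)
      log_ham = sum(math.log(1.0 - p) for p in probs)
      return 1.0 / (1.0 + math.exp(log_ham - log_spam))

  # Conflicting evidence: six spammy tokens against five hammy ones.
  print(naive_combine([0.95] * 6 + [0.05] * 5))  # about 0.95, snaps to an extreme

Feed the same conflicting mix to the chi_combine() sketch above and it lands near
0.5 ("unsure"), which is a big part of why the chi-squared approach is harder to
bias with padding or poisoning.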

http://www.bgl.nu/bogofilter/naive.html

Be wary of someone who posts generalities about bayes; they might be quoting
something that applies to a different bayes methodology.






