slightly OT: how do i know if i've been poisoned? (Bayes)
Furnish, Trever G
TGFurnish at herffjones.com
Mon Oct 23 00:52:14 IST 2006
Thanks very much, Matt. Might not have been a direct answer to my
question, but I really appreciate the information nonetheless.
> -----Original Message-----
> From: mailscanner-bounces at lists.mailscanner.info
> [mailto:mailscanner-bounces at lists.mailscanner.info] On Behalf
> Of Matt Kettler
> Sent: Friday, October 20, 2006 5:38 PM
> To: MailScanner discussion
> Subject: Re: slightly OT: how do i know if i've been poisoned? (Bayes)
>
> Furnish, Trever G wrote:
>
> <snip, stuff I can't help you with directly, but using the
> tools below you can probably help yourself>
>
> > And in the data part of the dump I see lots of what seems
> to be random
> > data. In fact that's all I see in the data dump -- no tokens I'd
> > recognize as anything other than random garbage:
> >
> > 0.978 2 0 1161346932 36dbf22fa5
> <snip>
>
> >
> > Is that the way it should look? I expected to see actual words.
>
> Yes, SA 3.0.0 and higher store 40 bits of the SHA1 hash of the
> word, not the word itself. This makes the entries themselves
> all fixed-size, which offers a substantial performance gain.
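A rough sketch of that fixed-width token scheme: hash each word and keep only 40 bits (5 bytes) of the SHA1 digest. This is illustrative only; which bytes of the digest SA actually keeps, and how it encodes them, may differ.

```python
# Fixed-width token IDs: 40 bits of SHA1, as described above.
# Illustrative sketch, not SA's actual Perl implementation.
import hashlib

def token_id(word):
    # First 5 bytes (40 bits) of the SHA1 digest, as hex
    return hashlib.sha1(word.encode("utf-8")).digest()[:5].hex()

# Every token becomes the same size, and the word is unrecoverable:
print(token_id("viagra"))               # always 10 hex characters
print(token_id("a-much-longer-token"))  # still 10 hex characters
```

Because the mapping is one-way, you can compare and count tokens but never read the original words back out of the database.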
>
> It also makes it impossible to decipher the bayes DB into a
> human-readable form.
> This has some ups and downs.
>
> One benefit is enhanced security. In the old system, if you
> had a shared bayes DB, any of the users could read the
> database and learn a lot about your email: who had been
> sending mail to your network (H*F tokens), and, by looking at
> body tokens, the topics of conversation. (Hint: specialized
> terminology really stands out.)
>
>
> If you really want to see the tokens in text form for a
> specific message, use the following:
>
> spamassassin -D bayes < message.txt
>
> And you should get some debug output like this:
>
> [807] dbg: bayes: token 'I*:what' => 0.99846511627907
> [807] dbg: bayes: token 'I*:future' => 0.996181818181818
>
>
> >
> > Also, on one of the lists (either this one or the mailwatch list)
> > someone said that Bayesian filtering was "4 times as
> effective" when
> > it has more ham than spam to learn from -- but that makes
> no sense to
> > me, and it's also not something that seems tenable -- I get
> about 95% spam.
>
> What's that saying, that 75% of statistics are made up on the spot?
> $5 says the person was pulling that fact out of their behind,
> or was parroting a statistic that applies to some other tool
> that implements bayes in an obscure fashion.
>
>
> Technically, the ideal for SA, or nearly any other bayes,
> would be an exact 50/50 mix. That is actually supported by
> the math if you think about how bayes works. Ideally you want
> "common" words that appear in both spam and nonspam to wind
> up with a token spam probability of 0.500. You'll get that,
> on average, if you're training the same number of spam and
> nonspam messages. Otherwise, your average "common word" token
> will be biased to be roughly your training ratio.
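The training-ratio bias described above can be shown with toy numbers. This is an illustration with a naive raw-count estimator, not SA's actual code; the Graham-style variant, which divides hit counts by the number of messages trained in each class, is also an assumption about how normalization removes the bias.

```python
# Toy illustration of training-ratio bias (not SA's actual estimator).
# A "common" word appears in roughly the same fraction of spam and ham,
# so a raw-count probability just tracks whatever ratio you trained at.

def raw_token_prob(spam_hits, ham_hits):
    # Naive estimate: fraction of sightings that were in spam
    return spam_hits / (spam_hits + ham_hits)

def normalized_token_prob(spam_hits, ham_hits, nspam, nham):
    # Graham-style normalization: divide by messages trained per class,
    # so corpus-size imbalance cancels out
    s = spam_hits / nspam
    h = ham_hits / nham
    return s / (s + h)

# Train 950 spam and 50 ham; a neutral word is seen in 10% of each class:
print(raw_token_prob(95, 5))                  # ~0.95, biased to the ratio
print(normalized_token_prob(95, 5, 950, 50))  # 0.5, the unbiased value
```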
>
> However, SA's use of chi-squared combining makes it very
> resistant to wild deviations from that ideal. The impact of
> any "near the middle" tokens is heavily drowned out by
> stronger ones; a single 0.000 token will largely negate
> many 0.950s.
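That "one strong token outweighs many weak ones" behavior can be sketched with a chi-squared combiner. This is a minimal model in the style of the SpamBayes/bogofilter chi-combining code, assumed here for illustration; it is not SA's actual Perl implementation.

```python
# Sketch of Fisher/chi-squared combining (SpamBayes-style, assumed).
import math

def chi2q(x2, v):
    """Survival function P(X > x2) of chi-square with EVEN dof v:
    exp(-x2/2) * sum_{i=0}^{v/2-1} (x2/2)^i / i!"""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    s = -2.0 * sum(math.log(1.0 - p) for p in probs)  # spam evidence
    h = -2.0 * sum(math.log(p) for p in probs)        # ham evidence
    spamminess = 1.0 - chi2q(s, 2 * n)
    hamminess = 1.0 - chi2q(h, 2 * n)
    return (spamminess + (1.0 - hamminess)) / 2.0

# Five mildly spammy tokens score very close to 1.0 ...
many_spammy = chi_combine([0.95] * 5)
# ... but a single very hammy token pulls the score down sharply.
with_hammy = chi_combine([0.95] * 5 + [0.001])
print(many_spammy, with_hammy)
```

Running this, the five 0.95 tokens alone land well above 0.99, while adding one 0.001 token drags the combined score down toward the middle, which is exactly the resistance to middling evidence described above.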
>
> Unless your training ratio is approaching 99% spam, you
> should be fine. And even that will only cause increased
> false positives; it will definitely NOT cause BAYES_00
> problems. Those would only be an issue if your ratio were
> approaching 1% spam.
>
>
> It should also be noted that, technically speaking, SA's
> "bayes" isn't really Bayesian. In fact, nearly all "bayes"
> filters aren't Bayesian: chi-squared combining works better
> and runs faster than true Bayes combining. But the term
> has generally been applied to any statistical token analysis
> system, regardless of how probabilities are calculated and combined.
>
> Fundamentally there are 3 kinds of common "bayes" out there,
> all of which "work best" at 50/50:
>
> The original Paul Graham method, using naive Bayes combining
> The improved method suggested by Robinson using geometric means.
> The chi-squared method, also called Fisher's method.
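The first method in that list, Graham-style naive Bayes combining, is simple enough to sketch in a few lines. This is a textbook illustration of multiplying per-token probabilities, not any particular filter's exact code.

```python
# Minimal sketch of Graham-style naive Bayes combining (illustrative).
import math

def naive_combine(probs):
    # Multiply evidence for spam and for ham, then normalize
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)

# Note how quickly it saturates toward the extremes:
print(naive_combine([0.9, 0.9]))       # ~0.988 from two modest tokens
print(naive_combine([0.5, 0.5]))       # neutral tokens stay at 0.5
```

That tendency to saturate on a handful of tokens is one reason the chi-squared method above displaced it.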
>
> Most use chi-squared nowadays. It's faster, works better, and
> is highly resistant to being biased by poisoning or uneven training.
>
> http://www.bgl.nu/bogofilter/naive.html
>
> Be wary of someone who posts generalities about bayes; they
> might be quoting something that applies to a different bayes
> methodology.
>
> --
> MailScanner mailing list
> mailscanner at lists.mailscanner.info
> http://lists.mailscanner.info/mailman/listinfo/mailscanner
>
> Before posting, read http://wiki.mailscanner.info/posting
>
> Support MailScanner development - buy the book off the website!
>
More information about the MailScanner mailing list