slightly OT: how do i know if i've been poisoned? (Bayes)
Furnish, Trever G
TGFurnish at herffjones.com
Mon Oct 23 00:52:14 IST 2006
Thanks very much, Matt. Might not have been a direct answer to my
question, but I really appreciate the information nonetheless.
> -----Original Message-----
> From: mailscanner-bounces at lists.mailscanner.info
> [mailto:mailscanner-bounces at lists.mailscanner.info] On Behalf
> Of Matt Kettler
> Sent: Friday, October 20, 2006 5:38 PM
> To: MailScanner discussion
> Subject: Re: slightly OT: how do i know if i've been poisoned? (Bayes)
>
> Furnish, Trever G wrote:
>
> <snip, stuff I can't help you with directly, but using the
> tools below you can probably help yourself>
>
> > And in the data part of the dump I see lots of what seems
> to be random
> > data. In fact that's all I see in the data dump -- no tokens I'd
> > recognize as anything other than random garbage:
> >
> > 0.978 2 0 1161346932 36dbf22fa5
> <snip>
>
> >
> > Is that the way it should look? I expected to see actual words.
>
> Yes, SA 3.0.0 and higher store 40 bits of the SHA1 hash of the
> word, not the word itself. This makes the entries themselves
> all fixed-size, which offers a substantial performance gain.
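A rough sketch of that fixed-width token scheme: hash each word and keep only 40 bits (5 bytes) of the SHA1 digest. This is illustrative only; which bytes of the digest SA actually keeps, and how it encodes them, may differ.

```python
# Fixed-width token IDs: 40 bits of SHA1, as described above.
# Illustrative sketch, not SA's actual Perl implementation.
import hashlib

def token_id(word):
    # First 5 bytes (40 bits) of the SHA1 digest, as hex
    return hashlib.sha1(word.encode("utf-8")).digest()[:5].hex()

# Every token becomes the same size, and the word is unrecoverable:
print(token_id("viagra"))               # always 10 hex characters
print(token_id("a-much-longer-token"))  # still 10 hex characters
```

Because the mapping is one-way, you can compare and count tokens but never read the original words back out of the database.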
>
> It also makes it impossible to decipher the bayes DB into a
> human-readable form.
> This has some ups and downs.
>
> One benefit is enhanced security. In the old system, if you
> had a shared bayes DB, any of the users could read the
> database and learn a lot about your email: who had been
> sending mail to your network (H*F tokens), and, by looking at
> body tokens, the topics of conversation. (Hint: specialized
> terminology really stands out.)
>
>
> If you really want to see the tokens in text form for a
> specific message, use the following:
>
> spamassassin -D bayes < message.txt
>
> And you should get some debug output like this:
>
> [807] dbg: bayes: token 'I*:what' => 0.99846511627907
> [807] dbg: bayes: token 'I*:future' => 0.996181818181818
>
>
> >
> > Also, on one of the lists (either this one or the mailwatch list)
> > someone said that Bayesian filtering was "4 times as
> effective" when
> > it has more ham than spam to learn from -- but that makes
> no sense to
> > me, and it's also not something that seems tenable -- I get
> about 95% spam.
>
> What's that saying, that 75% of statistics are made up on the spot?
> $5 says the person was pulling that fact out of their behind,
> or was parroting a statistic that applies to some other tool
> that implements bayes in an obscure fashion.
>
>
> Technically, the ideal for SA, or nearly any other bayes,
> would be an exact 50/50 mix. That is actually supported by
> the math if you think about how bayes works. Ideally you want
> "common" words that appear in both spam and nonspam to wind
> up with a token spam probability of 0.500. You'll get that,
> on average, if you're training the same number of spam and
> nonspam messages. Otherwise, your average "common word" token
> will be biased to be roughly your training ratio.
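The training-ratio bias described above can be shown with toy numbers. This is an illustration with a naive raw-count estimator, not SA's actual code; the Graham-style variant, which divides hit counts by the number of messages trained in each class, is also an assumption about how normalization removes the bias.

```python
# Toy illustration of training-ratio bias (not SA's actual estimator).
# A "common" word appears in roughly the same fraction of spam and ham,
# so a raw-count probability just tracks whatever ratio you trained at.

def raw_token_prob(spam_hits, ham_hits):
    # Naive estimate: fraction of sightings that were in spam
    return spam_hits / (spam_hits + ham_hits)

def normalized_token_prob(spam_hits, ham_hits, nspam, nham):
    # Graham-style normalization: divide by messages trained per class,
    # so corpus-size imbalance cancels out
    s = spam_hits / nspam
    h = ham_hits / nham
    return s / (s + h)

# Train 950 spam and 50 ham; a neutral word is seen in 10% of each class:
print(raw_token_prob(95, 5))                  # ~0.95, biased to the ratio
print(normalized_token_prob(95, 5, 950, 50))  # 0.5, the unbiased value
```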
>
> However, SA's use of chi-squared combining makes it very
> resistant to wild deviations from that ideal. The impact of
> any "near the middle" tokens is heavily drowned out by
> stronger ones; a single 0.000 token will largely negate
> many 0.950s.
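That "one strong token outweighs many weak ones" behavior can be sketched with a chi-squared combiner. This is a minimal model in the style of the SpamBayes/bogofilter chi-combining code, assumed here for illustration; it is not SA's actual Perl implementation.

```python
# Sketch of Fisher/chi-squared combining (SpamBayes-style, assumed).
import math

def chi2q(x2, v):
    """Survival function P(X > x2) of chi-square with EVEN dof v:
    exp(-x2/2) * sum_{i=0}^{v/2-1} (x2/2)^i / i!"""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    s = -2.0 * sum(math.log(1.0 - p) for p in probs)  # spam evidence
    h = -2.0 * sum(math.log(p) for p in probs)        # ham evidence
    spamminess = 1.0 - chi2q(s, 2 * n)
    hamminess = 1.0 - chi2q(h, 2 * n)
    return (spamminess + (1.0 - hamminess)) / 2.0

# Five mildly spammy tokens score very close to 1.0 ...
many_spammy = chi_combine([0.95] * 5)
# ... but a single very hammy token pulls the score down sharply.
with_hammy = chi_combine([0.95] * 5 + [0.001])
print(many_spammy, with_hammy)
```

Running this, the five 0.95 tokens alone land well above 0.99, while adding one 0.001 token drags the combined score down toward the middle, which is exactly the resistance to middling evidence described above.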
>
> Unless your training ratio is approaching 99% spam, you
> should be fine. And even that will only cause increased
> false positives; it will definitely NOT cause BAYES_00
> problems. Those would only be an issue if your ratio were
> approaching 1% spam.
>
>
> It should also be noted that, technically speaking, SA's
> "bayes" isn't really Bayesian. In fact, nearly all "bayes"
> filters aren't Bayesian: chi-squared combining works better
> and runs faster than true Bayes combining. But the term
> has generally been applied to any statistical token analysis
> system, regardless of how probabilities are calculated and combined.
>
> Fundamentally there are 3 kinds of common "bayes" out there,
> all of which "work best" at 50/50:
>
> The original Paul Graham method, using naive Bayes combining
> The improved method suggested by Robinson using geometric means.
> The chi-squared method, also called Fisher's method.
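The first method in that list, Graham-style naive Bayes combining, is simple enough to sketch in a few lines. This is a textbook illustration of multiplying per-token probabilities, not any particular filter's exact code.

```python
# Minimal sketch of Graham-style naive Bayes combining (illustrative).
import math

def naive_combine(probs):
    # Multiply evidence for spam and for ham, then normalize
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)

# Note how quickly it saturates toward the extremes:
print(naive_combine([0.9, 0.9]))       # ~0.988 from two modest tokens
print(naive_combine([0.5, 0.5]))       # neutral tokens stay at 0.5
```

That tendency to saturate on a handful of tokens is one reason the chi-squared method above displaced it.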
>
> Most use chi-squared nowadays. It's faster, works better, and
> is highly resistant to being biased by poisoning or uneven training.
>
> http://www.bgl.nu/bogofilter/naive.html
>
> Be wary of someone who posts generalities about bayes; they
> might be quoting something that applies to a different bayes
> methodology.
>
> --
> MailScanner mailing list
> mailscanner at lists.mailscanner.info
> http://lists.mailscanner.info/mailman/listinfo/mailscanner
>
> Before posting, read http://wiki.mailscanner.info/posting
>
> Support MailScanner development - buy the book off the website!
>
More information about the MailScanner mailing list