should I change score for Bayes_90?

Mark Nienberg mark at TIPPINGMAR.COM
Wed Oct 22 00:57:26 IST 2003


Jeez, this is complicated business.  Thanks for that explanation.  I generally just
leave the defaults alone and it seems to work pretty well for me.

On 21 Oct 2003 at 16:06, Matt Kettler wrote:

> At 01:41 PM 10/21/2003, Mark Nienberg wrote:
> >Also, I don't see why you would have higher scores for lower bayes
> >probabilities, the way your third column does.
>
> One of these days I'll have to enter this stuff into the FAQ on the SA site..
>
> Scores that don't meet your "common sense" expectations are not an uncommon
> phenomenon in SA...
>
> If the rules were scored completely on their own, you'd wind up with
> increasing scores for all the bayes categories, but the scoresets for SA
> aren't calculated on their own, they're calculated as a complete set. Thus,
> the rule scores aren't based on the performance of a rule by itself,
> but on the combinations of other rules they interact with.. this
> makes the scoring VERY nonlinear, which at first glance seems wrong, but
> when you sit down and start studying the patterns that exist in real email,
> that's really how things wind up working.
>
> Also in general, never underestimate the ability of human behaviors to fail
> to fit into any simple mathematical model.
>
> In this case, I suspect there's a fair amount of "spammish" non-spam (e.g.
> crude jokes) which don't go very high on the bayes scale, but wind up in
> the 80 or 90 ballpark and also wind up matching some of the static rules..
> in order to avoid excessive false positives, the GA will tend to hack down
> the score of one or more of the rules, but will try to do so without
> causing excessive false negatives.. it looks like these "upper-mid" range
> bayes scores were the best candidate for sacrificing score to correct these.
>
> This same kind of non-linear behavior can also be seen in older versions of
> SA which had the "spam phrases" ruleset. In that system, the higher the
> spam phrases score, the more spam phrases exist in the email. You'd expect
> this to have a nice, constantly increasing correlation with spam score..
> but of course reality proves otherwise, and you wind up with weird dips in
> the curve which are the results of that ruleset interacting with a whole
> pile of other rules.
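
The effect Matt describes can be reproduced with a toy model. What follows is only a rough sketch, not how SA's GA actually works: it swaps in a brute-force grid search, and the corpus, the fixed scores, and the rule names STATIC_RULE and WEAK_RULE are all invented for illustration. Only BAYES_80 and BAYES_90 are real rule names, and the 5.0 threshold is an assumption. Some ham hits BAYES_90 together with a static rule, so fitting the two bayes scores as a set pushes BAYES_90 below BAYES_80.

```python
from itertools import product

THRESHOLD = 5.0  # assumed spam threshold for this toy model

# Hypothetical static rules with scores held fixed (made-up values).
FIXED = {"STATIC_RULE": 3.0, "WEAK_RULE": 1.0}

# Made-up corpus: (rules hit, is_spam).
CORPUS = [
    ({"BAYES_90", "STATIC_RULE"}, False),              # "spammish" ham, e.g. a crude joke
    ({"BAYES_90", "STATIC_RULE"}, False),
    ({"BAYES_90", "STATIC_RULE", "WEAK_RULE"}, True),
    ({"BAYES_80", "WEAK_RULE"}, True),
    ({"BAYES_80", "WEAK_RULE"}, True),
    ({"WEAK_RULE"}, False),
]


def errors(scores):
    """Count false positives plus false negatives for a complete score set."""
    wrong = 0
    for rules, is_spam in CORPUS:
        total = sum(scores[rule] for rule in rules)
        if (total >= THRESHOLD) != is_spam:
            wrong += 1
    return wrong


def fit():
    """Try every (BAYES_80, BAYES_90) pair on a 0.5 grid; keep the first best."""
    grid = [step * 0.5 for step in range(11)]  # 0.0, 0.5, ..., 5.0
    best = None
    for b80, b90 in product(grid, grid):
        scores = dict(FIXED, BAYES_80=b80, BAYES_90=b90)
        count = errors(scores)
        if best is None or count < best[0]:
            best = (count, b80, b90)
    return best


if __name__ == "__main__":
    print(fit())  # (errors, BAYES_80 score, BAYES_90 score)
```

On this made-up data the search settles on zero errors with BAYES_80 at 4.0 and BAYES_90 at 1.0. Raising BAYES_90 to 2.0 or more would push the two ham messages over the threshold, which is the "sacrificing score" trade-off described above.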


--
Mark W. Nienberg, SE
Tipping Mar + associates
1906 Shattuck Ave, Berkeley, CA  94704
visit our website at http://www.tippingmar.com


