spamassassin doesn't like Norwegian email, apparently..

Matt Kettler mkettler at evi-inc.com
Fri Sep 8 20:38:06 IST 2006


Daniel Maher wrote:
> Hi all,
>
> One of my users has been complaining that his Norwegian-language email
> has been getting tagged as Spam.  I checked the headers, and it doesn’t
> appear to be a locale issue.  In fact, it’s /all Bayes/:

>  6.5 BAYES_99               BODY: Bayesian spam probability is 99 to 100%
>                             [score: 0.9956]

Erm, why is your BAYES_99 score so absurdly high? At 6.5 this one rule can tag a
message as spam all by itself if you're running the default required_score of
5.0. Theoretically, 1% of the emails matching this rule should be nonspam.

Admittedly, in practice chi-squared combining pushes that much lower, to more
like 0.1%, but still, this rule is NOT 100% accurate. Don't treat it as if it were.
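
To get a feel for why chi-squared combining produces such extreme results, here
is a rough Python sketch of Robinson-style chi-squared combining in the form
popularised by SpamBayes. It is an illustration only and not SA's actual code.
The names chi2q and combine are made up for this example, and the token
probabilities are invented.

```python
import math

def chi2q(x2, dof):
    """Probability that a chi-squared variable with an even number of
    degrees of freedom (dof) exceeds x2."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Combine per-token spam probabilities, each strictly between 0 and 1."""
    n = len(probs)
    spam_evidence = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0

# Illustrative values only, not taken from a real message.
spammy = [0.99, 0.98, 0.97, 0.95, 0.95, 0.90]
mixed = spammy + [0.05, 0.03, 0.02, 0.02]

print(round(combine(spammy), 4))  # only spammy tokens
print(round(combine(mixed), 4))   # same tokens plus a few strongly hammy ones
```

With these made-up numbers the first result lands very close to 1.0, while
adding a handful of strongly hammy tokens pulls the second back towards the
middle.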


> Does anybody have any ideas on how I might be able to “fix” this?  Thanks!

Well, bayes is based on YOUR training. Apparently all, or at least most, of the
Norwegian mail that SA has learned so far was spam.

The fix is to train some Norwegian nonspam mail using sa-learn --ham.

See, out of the box bayes doesn't really know the difference between spam and
ham; it only knows what it's been trained on.
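
As a rough illustration of that point, here is a simplified Graham-style
per-token estimate in Python. This is not SA's exact formula (real filters add
smoothing for rarely seen tokens), and the function name and all the counts are
hypothetical.

```python
def token_spam_probability(spam_hits, ham_hits, nspam, nham):
    """Simplified estimate: how much more often a token shows up in
    learned spam than in learned ham."""
    spam_ratio = spam_hits / nspam
    ham_ratio = ham_hits / nham
    if spam_ratio + ham_ratio == 0:
        return 0.5  # never seen: no evidence either way
    return spam_ratio / (spam_ratio + ham_ratio)

# Hypothetical counts for a common Norwegian word.
# Before: seen in 40 of 1000 learned spams and in none of 1000 learned hams.
before = token_spam_probability(spam_hits=40, ham_hits=0, nspam=1000, nham=1000)
# After learning 60 Norwegian ham messages that all contain the word.
after = token_spam_probability(spam_hits=40, ham_hits=60, nspam=1000, nham=1060)

print(before, after)
```

A word that has only ever been seen in spam looks maximally spammy, however
innocent it is, until some ham containing it has been learned.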

You can also see which words in the message are scoring heavily as spam by
feeding one of the messages to SA with debug enabled. Assuming SA 3.1.x or
higher:

spamassassin -D bayes <message.txt


Look for debug lines like these:
[11988] dbg: bayes: token 'H*u:1.0.6' => 0.998818414322251
[11988] dbg: bayes: token 'H*F:D*hu' => 0.996473282442748
[11988] dbg: bayes: token 'happy!' => 0.996181818181818
[11988] dbg: bayes: token 'swamp' => 0.990941176470588
[11988] dbg: bayes: token 'Nigeria' => 0.978
[11988] dbg: bayes: token 'Commissioner' => 0.978
[11988] dbg: bayes: token 'nigeria' => 0.978
[11988] dbg: bayes: token 'twisted' => 0.978




Note: the ones that start with H* are tokens representing headers. I'd ignore
those for now.

Look for tokens with scores near 1.0; those are the ones pushing up the bayes
score. See whether they're Norwegian words or something else.
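
If the debug output is long, a small script can do the sorting for you. The
Python below is a minimal sketch that assumes the token lines look exactly like
the ones shown above. The function names and the 0.9 threshold are my own
choices, and the sample text is just the debug lines from this message. As far
as I know the debug output goes to stderr, so you would need something like
2> debug.log on the spamassassin command to capture it.

```python
import re

# Matches lines such as: [11988] dbg: bayes: token 'swamp' => 0.990941176470588
TOKEN_LINE = re.compile(r"dbg: bayes: token '(.*)' => (\S+)")

def parse_tokens(debug_text):
    """Return (token, probability) pairs found in the debug output."""
    tokens = []
    for line in debug_text.splitlines():
        match = TOKEN_LINE.search(line)
        if match:
            tokens.append((match.group(1), float(match.group(2))))
    return tokens

def spammy_body_tokens(tokens, threshold=0.9):
    """Drop header tokens (H*...) and keep the spammiest body tokens first."""
    body = [(t, p) for t, p in tokens if not t.startswith("H*") and p >= threshold]
    return sorted(body, key=lambda item: item[1], reverse=True)

SAMPLE = """\
[11988] dbg: bayes: token 'H*u:1.0.6' => 0.998818414322251
[11988] dbg: bayes: token 'H*F:D*hu' => 0.996473282442748
[11988] dbg: bayes: token 'happy!' => 0.996181818181818
[11988] dbg: bayes: token 'swamp' => 0.990941176470588
[11988] dbg: bayes: token 'Nigeria' => 0.978
[11988] dbg: bayes: token 'Commissioner' => 0.978
[11988] dbg: bayes: token 'nigeria' => 0.978
[11988] dbg: bayes: token 'twisted' => 0.978
"""

for token, prob in spammy_body_tokens(parse_tokens(SAMPLE)):
    print(f"{prob:.4f}  {token}")
```

To use it on a real run, read the captured log file instead of SAMPLE.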



