slightly OT: how do i know if i've been poisoned? (Bayes)

Furnish, Trever G TGFurnish at herffjones.com
Fri Oct 20 22:00:31 IST 2006


Sorry, this is a bit long with some output from sa-learn --dump, but
it's probably just simple questions for someone here...

I've been running with the same Bayes database for a long time, but
lately a lot of messages are getting through that Bayesian techniques
ought to catch very effectively, and that has me wondering if I have a
problem with my Bayes database.

To be honest I have quite a few questions related to SA's Bayes stuff
that I should have tracked down answers to sooner. :-(

The messages that caused me to start looking are those that all end with
"You must to read".  I say it seems like they ought to be caught easily
by Bayes because:
	- They're simple text messages.
	- Most of the words and phrases appear consistently in all of
	  the versions of this spam I receive.
	- And most importantly, I've been sa-learn'ing them as spam
	  repeatedly.

I have about 900 of these listed in mailwatch in the last three days,
and probably only about half of them were caught as spam, even though
I've run sa-learn on probably 100 of them to train it that this is spam.

What's worse, many of the ones that are listed as ham are triggering
BAYES_00.

Even if I send the exact same message that I've trained as spam back
through, it never gets caught as spam.


So...

Looking at the output of sa-learn --dump, I see the following "magic":
0.000          0          3          0  non-token data: bayes db version
0.000          0    1904995          0  non-token data: nspam
0.000          0     213646          0  non-token data: nham
0.000          0     696343          0  non-token data: ntokens
0.000          0 1161225623          0  non-token data: oldest atime
0.000          0 1161377189          0  non-token data: newest atime
0.000          0 1161377169          0  non-token data: last journal sync atime
0.000          0 1161355561          0  non-token data: last expiry atime
0.000          0      64369          0  non-token data: last expire atime delta
0.000          0    1448252          0  non-token data: last expire reduction count
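
To make those counters easier to read, here is a minimal Python sketch
(not something sa-learn provides) that pulls the values out of the magic
lines above and works out the spam-to-ham ratio and the span between the
oldest and newest token atime. The name parse_magic and the assumption
that the third column holds the value are illustrative, based only on
the layout shown above.

```python
from datetime import datetime, timezone

# A few of the "magic" lines from the dump above, copied verbatim.
MAGIC = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0    1904995          0  non-token data: nspam
0.000          0     213646          0  non-token data: nham
0.000          0     696343          0  non-token data: ntokens
0.000          0 1161225623          0  non-token data: oldest atime
0.000          0 1161377189          0  non-token data: newest atime
"""

def parse_magic(text):
    """Map each 'non-token data' label to the integer in the third column.

    Assumption: the value of interest is always the third field.
    """
    values = {}
    for line in text.splitlines():
        head, sep, label = line.partition("non-token data:")
        if not sep:
            continue  # not a magic line
        values[label.strip()] = int(head.split()[2])
    return values

magic = parse_magic(MAGIC)
spam_per_ham = magic["nspam"] / magic["nham"]
span_days = (magic["newest atime"] - magic["oldest atime"]) / 86400
newest = datetime.fromtimestamp(magic["newest atime"], tz=timezone.utc)

print(f"spam per ham: {spam_per_ham:.2f}")
print(f"token atime span: {span_days:.2f} days")
print(f"newest atime: {newest:%Y-%m-%d %H:%M:%S} UTC")
```

With the figures above this comes to roughly 8.9 spam messages learned
per ham message, and all token atimes fall within a window of under two
days.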

And in the data part of the dump I see lots of what seems to be random
data.  In fact that's all I see in the data dump -- no tokens I'd
recognize as anything other than random garbage:

0.978          2          0 1161346932  36dbf22fa5
0.958          1          0 1161345892  79e8adb687
0.958          1          0 1161346781  3519895456
0.958          1          0 1161354304  6c4a3342f2
0.171       3921       2126 1161375277  f6bd08b094
0.000          0        198 1161377487  dd534744d6
0.459      68115       9002 1161377680  b303caafc0
0.088         17         20 1161364153  8edecfaeac
1.000       1464          0 1161353847  aff4ea7b31
0.009          0          6 1161351795  719fddf880
0.143         92         62 1161373089  e54862ab93
0.985          3          0 1161349368  78041875fa
0.259          3          1 1161352844  be2c8315bb
0.992          6          0 1161300413  f063d1aca5
0.999         33          0 1161372814  c411247e8c
0.999         92          0 1161376548  0a404340c8
0.998         24          0 1161376658  40a64bb94f
0.923        749          7 1161377672  fae3ecc1e9
1.000        129          0 1161375875  4438c3c4e2
0.999         78          0 1161348595  0c172e375f
0.999         51          0 1161377759  c3de5f8083
0.011          0          5 1161302064  fab6bc3637
0.991          5          0 1161292292  fad0ce7ecf
0.995          9          0 1161361757  578d39ad23
0.994          8          0 1161321372  f598e1bbac
0.017          0          3 1161339745  472573f9b9
0.985          3          0 1161372321  198a28fcfe
0.958          1          0 1161294047  0c6d083929
0.958          1          0 1161291228  38e29036dd
0.958          1          0 1161292212  85ce2e63d5

Is that the way it should look?  I expected to see actual words.
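
The 10-hex-character values in the last column are consistent with
hashed tokens rather than plain words. The Python sketch below shows one
such scheme, under the assumption (not confirmed anywhere in this post)
that a version-3 Bayes database keys each token by the last 5 bytes of
its SHA-1 digest. The function name token_id is made up for
illustration.

```python
import hashlib

def token_id(token: str) -> str:
    """Return the last 5 bytes of the token's SHA-1 digest as 10 hex chars.

    Assumption: this mirrors how a version-3 Bayes database keys its
    tokens; it is an illustration, not SpamAssassin's own code.
    """
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return digest[-5:].hex()

# Hash a couple of candidate words to get values in the dump's format.
for word in ("must", "read"):
    print(word, token_id(word))
```

If that assumption holds, the dump cannot be turned back into words; the
most one can do is hash a candidate word and look for the result in the
last column.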

Also, on one of the lists (either this one or the mailwatch list)
someone said that Bayesian filtering was "4 times as effective" when it
has more ham than spam to learn from -- but that makes no sense to me,
and it's also not something that seems tenable -- I get about 95% spam.
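
On the ham-versus-spam balance, it may help to see how the first column
of the token rows relates to the counts. The Python sketch below assumes
a Robinson-style estimate with strength s = 0.1 and prior x = 0.538.
Those constants and the name token_prob are assumptions, chosen because
they reproduce the first column of the rows above when combined with the
nspam and nham values from the magic section.

```python
NSPAM = 1904995   # "nspam" from the magic lines above
NHAM = 213646     # "nham" from the magic lines above
S = 0.1           # assumed strength given to the prior
X = 0.538         # assumed probability for a token with no history

def token_prob(spam_count, ham_count, nspam=NSPAM, nham=NHAM):
    """Robinson-style spam probability for one token (a sketch).

    Expects at least one of the two counts to be non-zero.
    """
    spam_ratio = spam_count / nspam   # share of learned spam with the token
    ham_ratio = ham_count / nham      # share of learned ham with the token
    raw = spam_ratio / (spam_ratio + ham_ratio)
    n = spam_count + ham_count        # total sightings of the token
    return (S * X + n * raw) / (S + n)

# (spam count, ham count) pairs taken from the dump rows above
for spam, ham in [(1, 0), (2, 0), (0, 6), (3921, 2126), (68115, 9002)]:
    print(f"{spam:6d} {ham:6d} {token_prob(spam, ham):.3f}")
```

Under these assumptions each count is divided by nspam or nham before
the two are compared, so the 95% spam mix is already factored in. The
token with 3921 spam and 2126 ham sightings, for example, still comes
out at 0.171 because nham is only about a ninth of nspam.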

--
Trever Furnish, tgfurnish at herffjones.com
Herff Jones, Inc. Unix / Network Administrator
Phone: 317.612.3519
Any sufficiently advanced technology is indistinguishable from Unix.

