slightly OT: how do i know if i've been poisoned? (Bayes)
Furnish, Trever G
TGFurnish at herffjones.com
Fri Oct 20 22:00:31 IST 2006
Sorry, this is a bit long with some output from sa-learn --dump, but
these are probably simple questions for someone here...
I've been running with the same Bayes database for a long time, but
lately I've been seeing a lot of uncaught messages that Bayesian
filtering really ought to catch, which has me wondering whether there's
a problem with my Bayes database.
To be honest I have quite a few questions related to SA's Bayes stuff
that I should have tracked down answers to sooner. :-(
The messages that caused me to start looking are the ones that all end
with "You must to read". It seems like they ought to be caught easily
by Bayes because:
- They're simple text messages
- Most of the words and phrases appear consistently in all of
the versions of this spam I receive.
- And most importantly, I've been sa-learn'ing them as spam
repeatedly.
I have about 900 of these listed in MailWatch over the last three days,
with only about half of them caught as spam, yet I've run sa-learn on
probably 100 of them to train it that they're spam.
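For reference, the training I've been doing is just the stock sa-learn
invocation, something along these lines (the mbox path here is only a
placeholder, not my real spool location):

  sa-learn --spam --mbox /path/to/you-must-to-read.mbox
  sa-learn --sync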
What's worse, many of the ones that are listed as ham are triggering
BAYES_00.
Even when I resubmit the exact same message that I've already trained
as spam, it still never gets caught as spam.
So...
Looking at the output of sa-learn --dump, I see the following "magic":
0.000 0 3 0 non-token data: bayes db version
0.000 0 1904995 0 non-token data: nspam
0.000 0 213646 0 non-token data: nham
0.000 0 696343 0 non-token data: ntokens
0.000 0 1161225623 0 non-token data: oldest atime
0.000 0 1161377189 0 non-token data: newest atime
0.000 0 1161377169 0 non-token data: last journal sync atime
0.000 0 1161355561 0 non-token data: last expiry atime
0.000 0 64369 0 non-token data: last expire atime delta
0.000 0 1448252 0 non-token data: last expire reduction count
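(I believe just that summary section can be pulled on its own, without
the full token listing, with:

  sa-learn --dump magic
)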
And in the data part of the dump, all I see is what looks like random
data -- no tokens I'd recognize as anything other than random garbage:
0.978 2 0 1161346932 36dbf22fa5
0.958 1 0 1161345892 79e8adb687
0.958 1 0 1161346781 3519895456
0.958 1 0 1161354304 6c4a3342f2
0.171 3921 2126 1161375277 f6bd08b094
0.000 0 198 1161377487 dd534744d6
0.459 68115 9002 1161377680 b303caafc0
0.088 17 20 1161364153 8edecfaeac
1.000 1464 0 1161353847 aff4ea7b31
0.009 0 6 1161351795 719fddf880
0.143 92 62 1161373089 e54862ab93
0.985 3 0 1161349368 78041875fa
0.259 3 1 1161352844 be2c8315bb
0.992 6 0 1161300413 f063d1aca5
0.999 33 0 1161372814 c411247e8c
0.999 92 0 1161376548 0a404340c8
0.998 24 0 1161376658 40a64bb94f
0.923 749 7 1161377672 fae3ecc1e9
1.000 129 0 1161375875 4438c3c4e2
0.999 78 0 1161348595 0c172e375f
0.999 51 0 1161377759 c3de5f8083
0.011 0 5 1161302064 fab6bc3637
0.991 5 0 1161292292 fad0ce7ecf
0.995 9 0 1161361757 578d39ad23
0.994 8 0 1161321372 f598e1bbac
0.017 0 3 1161339745 472573f9b9
0.985 3 0 1161372321 198a28fcfe
0.958 1 0 1161294047 0c6d083929
0.958 1 0 1161291228 38e29036dd
0.958 1 0 1161292212 85ce2e63d5
Is that the way it should look? I expected to see actual words.
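In case the distribution matters, here's a quick ad-hoc count I can run
over the dump. This assumes the columns are probability, spam count,
ham count, atime, token, which is how the rows above appear to line up:

  sa-learn --dump data | awk '$2 > 0 && $3 == 0 { s++ } $2 == 0 && $3 > 0 { h++ } END { print s " spam-only tokens, " h " ham-only tokens" }'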
Also, on one of the lists (either this one or the MailWatch list)
someone said that Bayesian filtering was "4 times as effective" when it
has more ham than spam to learn from -- but that makes no sense to me,
and it doesn't seem achievable anyway, since about 95% of what I
receive is spam.
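For what it's worth, the magic output above works out much the same
way: 1904995 nspam against 213646 nham is 1904995 / (1904995 + 213646),
or roughly 90% spam, so a ham-heavy database isn't really an option
here.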
--
Trever Furnish, tgfurnish at herffjones.com
Herff Jones, Inc. Unix / Network Administrator
Phone: 317.612.3519
Any sufficiently advanced technology is indistinguishable from Unix.