Training spamassassin Bayes
DAve
dave.list at pixelhammer.com
Thu Aug 17 18:21:45 IST 2006
Casey T. Deccio wrote:
> On Thu, 2006-08-17 at 11:49 -0400, DAve wrote:
>> Casey T. Deccio wrote:
>>> Should there be any problem with me doing
>>> training using sa-learn as root while also doing auto training (turned
>>> on by default--at least in Debian)? Spam classification has gotten
>>> extremely poor since I began doing that.
>>>
>>>
>
>> spam.assassin.prefs.conf;
>> bayes_path /usr/local/etc/MailScanner/bayes/bayes
>> bayes_file_mode 0770
>> bayes_auto_learn 1
>> bayes_ignore_header X-MailScanner
>> bayes_ignore_header X-MailScanner-SpamCheck
>> bayes_ignore_header X-MailScanner-SpamScore
>> bayes_ignore_header X-MailScanner-Information
>> bayes_ignore_header X-Account_key
>> bayes_ignore_header X-UIDL
>> bayes_ignore_header X-Mozilla-Status
>> bayes_ignore_header X-Mozilla-Status2
>>
>
> MailScanner.conf seems to be okay. However, in spam.assassin.prefs.conf
> I seem to have had my bayes_ignore_header lines misconfigured, so they
> didn't match the X-MailScanner-* headers in MailScanner.conf.
I have those because I train from a Thunderbird mbox, and I don't want
Bayes to learn those headers. YMMV.
>
> Could this be tainting my spam training (significantly)? If so, do I
> need to clear out the old data from my bayes database and start over?
Someone with more Bayes experience would have to answer that, but I
would think it is certainly not helping if bayes is making tokens out of
your MailScanner headers.
>
> Also, should I add certain client headers to this list (e.g., evolution,
> mozilla, or whatever)?
Only add the headers you want Bayes to ignore. So it depends on the
messages you train with. If you only use autolearning then I would think no.
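
One way to decide which headers belong in that list is to look at what
the training mailbox actually contains. The function below is only a
rough sketch: list_x_headers is a made-up name, the example path is an
assumption, and the grep also counts body lines that happen to look like
X- headers, so treat the numbers as approximate.

```sh
#!/bin/sh
# list_x_headers (hypothetical helper): print each distinct X- header name
# found in an mbox file, with a count, most frequent first.
# Rough by design: body lines that look like "X-Foo: bar" are counted too.
list_x_headers() {
    grep -i '^X-[A-Za-z0-9_-]*:' "$1" \
        | cut -d: -f1 \
        | sort \
        | uniq -c \
        | sort -rn
}

# Example (the path is an assumption):
#   list_x_headers "$HOME/Mail/spam-training.mbox"
```

Headers that show up on nearly every training message but never on live
mail (client status flags, for instance) are the likely candidates for
bayes_ignore_header.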
>
>> Perms are,
>> bash-2.05b# ls -la | less
>> total 2462018
>> drwxr-xr-x 2 root cvs 38912 Aug 17 11:43 .
>> dr-xr-xr-x 8 root cvs 1024 Aug 8 14:36 ..
>> -rw----rw- 1 root cvs 10632 Aug 17 11:45 bayes.mutex
>> -rw-rw---- 1 root cvs 78120 Aug 17 11:45 bayes_journal
>> -rw-rw---- 1 root cvs 10190848 Aug 17 11:45 bayes_seen
>> -rw-rw---- 1 root cvs 10174464 Aug 17 11:45 bayes_toks
>
> bash-2.05b# ls -la | less
> -rw------- 1 Debian-exim Debian-exim   651264 2006-08-17 09:21 auto-whitelist
> -rw-rw-rw- 1 root        root           27084 2006-08-17 06:31 bayes.mutex
> -rw------- 1 Debian-exim Debian-exim  1290240 2006-08-17 07:41 bayes_seen
> -rw------- 1 Debian-exim Debian-exim 10522624 2006-08-17 09:29 bayes_toks
> -rw------- 1 Debian-exim Debian-exim  1294336 2006-07-21 17:46 bayes_toks.expire10036
> -rw------- 1 Debian-exim Debian-exim  1409024 2006-07-17 09:16 bayes_toks.expire10080
> -rw------- 1 Debian-exim Debian-exim  1445888 2006-07-15 01:11 bayes_toks.expire10092
> ...
> [many more bayes_toks.expire* files]
Where is your bayes_journal? Also, if you have lots of bayes_toks.expire
files, it is because SA is trying to expire Bayes tokens and doesn't
finish in time. See the MailScanner spam.assassin.prefs.conf file for an
explanation. You need to set bayes_auto_expire to 0, as that file explains:
# When using the scheduled Bayes expiry feature, in MailScanner.conf
# you probably want to turn off auto-expiry in SpamAssassin as it will
# rarely complete before it is killed for taking too long.
# You will just end up with big bayes_toks.new files
# wasting space.
bayes_auto_expire 0
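
With auto-expiry turned off, expiry has to be run some other way, and
any leftover bayes_toks.expire* files still have to be cleaned up by
hand. What follows is a minimal sketch, not a tested recipe: the default
directory is the bayes_path quoted earlier in this message, both function
names are made up, and the prefs file path is a placeholder. The
sa-learn --force-expire option runs one expiry pass on demand.

```sh
#!/bin/sh
# Directory holding bayes_toks etc. The default is the bayes_path directory
# quoted earlier in this message; override BAYES_DIR for other installs.
BAYES_DIR="${BAYES_DIR:-/usr/local/etc/MailScanner/bayes}"

# stale_expire_files (hypothetical helper): list bayes_toks.expire* files in
# directory $1 that are at least two days old (find -mtime +1), i.e.
# leftovers from expiry runs that were killed before finishing.
stale_expire_files() {
    find "$1" -maxdepth 1 -type f -name 'bayes_toks.expire*' -mtime +1
}

# run_expiry (hypothetical helper): force one expiry pass using the prefs
# file given as $1, so sa-learn sees the same bayes_path as MailScanner.
run_expiry() {
    sa-learn -p "$1" --force-expire
}

# Typical use, run as the user that owns the Bayes files:
#   stale_expire_files "$BAYES_DIR"          # review the list first
#   run_expiry /path/to/spam.assassin.prefs.conf
```

Review the list before deleting anything, and run the expiry as the same
user that owns bayes_toks so the file ownership does not change.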
>
>> What does your reporting say? If you train an "insert favorite spam here"
>> message and then see more of them come through later, are they showing
>> Bayes scores?
>
> At first glance no, but I'll need to monitor from here out to see.
>
> Casey
>
>
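
To check whether Bayes is being consulted at all, the database's own
counters are a quick first look. SpamAssassin does not apply Bayes scores
until it has learned enough spam and ham (200 of each by default), so low
counts would explain missing BAYES_ hits. This is a sketch that assumes
the usual column layout of sa-learn --dump magic output; bayes_counts is
a made-up helper name and message.eml is a placeholder.

```sh
#!/bin/sh
# bayes_counts (hypothetical helper): read `sa-learn --dump magic` output on
# stdin and print the learned spam and ham message counts.
# Assumes the count is the third whitespace-separated field on each line.
bayes_counts() {
    awk '
        /non-token data: nspam$/ { spam = $3 }
        /non-token data: nham$/  { ham  = $3 }
        END { printf "nspam=%d nham=%d\n", spam, ham }
    '
}

# Example:
#   sa-learn --dump magic | bayes_counts
# To see whether one message hits a BAYES_ rule:
#   spamassassin -t < message.eml | grep BAYES_
```

Run both commands as the user, and with the prefs file, that the scanner
itself uses; otherwise they may read a different Bayes database.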
DAve
--
Three years now I've asked Google why they don't have a
logo change for Memorial Day. Why do they choose to do logos
for other non-international holidays, but nothing for
Veterans?
Maybe they forgot who made that choice possible.