Training spamassassin Bayes

Thu Aug 17 18:21:45 IST 2006

Casey T. Deccio wrote:
> On Thu, 2006-08-17 at 11:49 -0400, DAve wrote:
>> Casey T. Deccio wrote:
>>> Should there be any problem with me doing
>>> training using sa-learn as root while also doing auto training (turned
>>> on by default--at least in Debian)?  Spam classification has gotten
>>> extremely poor since I began doing that.
>>>
>>>
> 
>> spam.assassin.prefs.conf;
>> bayes_path /usr/local/etc/MailScanner/bayes/bayes
>> bayes_file_mode 0770
>> bayes_auto_learn 1
>> bayes_ignore_header X-MailScanner
>> bayes_ignore_header X-MailScanner-SpamCheck
>> bayes_ignore_header X-MailScanner-SpamScore
>> bayes_ignore_header X-MailScanner-Information
>> bayes_ignore_header X-Account_key
>> bayes_ignore_header X-UIDL
>> bayes_ignore_header X-Mozilla-Status
>> bayes_ignore_header X-Mozilla-Status2
>>
> 
> MailScanner.conf seems to be okay.  However, in spam.assassin.prefs.conf
> I seem to have had my bayes_ignore_header lines misconfigured, so they
> didn't match the X-MailScanner-* headers in MailScanner.conf.

I have those because I train from a Thunderbird mbox and I don't want
Bayes to learn those headers. YMMV.

> 
> Could this be tainting my spam training (significantly)?  If so, do I
> need to clear out the old data from my bayes database and start over?

Someone with more Bayes experience would have to answer that, but I 
would think it is certainly not helping if bayes is making tokens out of 
your MailScanner headers.

> 
> Also, should I add certain client headers to this list (e.g., evolution,
> mozilla, or whatever)?

Only add the headers you want Bayes to ignore. So it depends on the 
messages you train with. If you only use autolearning then I would think no.
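
To see which client headers actually show up in the mail you train from, something like the following could help. It is only a rough sketch: the function names are made up for illustration, it assumes the training mail is in mbox format (as with Thunderbird), and it lists every X-* header as a candidate. You still have to decide which of them deserve a bayes_ignore_header line.

```python
import mailbox
from collections import Counter


def count_x_headers(mbox_path):
    """Count how many messages in an mbox carry each X-* header name."""
    counts = Counter()
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            # A set, so a header repeated inside one message counts once.
            names = {n for n in message.keys() if n.lower().startswith("x-")}
            counts.update(names)
    finally:
        box.close()
    return counts


def suggest_ignore_lines(counts):
    """Turn the counts into candidate config lines, most frequent first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"bayes_ignore_header {name}" for name, _ in ordered]


if __name__ == "__main__":
    import sys

    # Usage: python x_headers.py /path/to/training.mbox   (path is hypothetical)
    for line in suggest_ignore_lines(count_x_headers(sys.argv[1])):
        print(line)
```

Headers that only the MTA or the scanner adds at delivery time will not appear in mail you never train from, which is why the list depends on your training source.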

> 
>> Perms are,
>> bash-2.05b# ls -la | less
>> total 2462018
>> drwxr-xr-x  2 root  cvs     38912 Aug 17 11:43 .
>> dr-xr-xr-x  8 root  cvs      1024 Aug  8 14:36 ..
>> -rw----rw-  1 root  cvs     10632 Aug 17 11:45 bayes.mutex
>> -rw-rw----  1 root  cvs     78120 Aug 17 11:45 bayes_journal
>> -rw-rw----  1 root  cvs  10190848 Aug 17 11:45 bayes_seen
>> -rw-rw----  1 root  cvs  10174464 Aug 17 11:45 bayes_toks
> 
> bash-2.05b# ls -la | less
> -rw-------  1 Debian-exim Debian-exim   651264 2006-08-17 09:21 auto-whitelist
> -rw-rw-rw-  1 root        root           27084 2006-08-17 06:31 bayes.mutex
> -rw-------  1 Debian-exim Debian-exim  1290240 2006-08-17 07:41 bayes_seen
> -rw-------  1 Debian-exim Debian-exim 10522624 2006-08-17 09:29 bayes_toks
> -rw-------  1 Debian-exim Debian-exim  1294336 2006-07-21 17:46 bayes_toks.expire10036
> -rw-------  1 Debian-exim Debian-exim  1409024 2006-07-17 09:16 bayes_toks.expire10080
> -rw-------  1 Debian-exim Debian-exim  1445888 2006-07-15 01:11 bayes_toks.expire10092
> ...
> [many more bayes_toks.expire* files]

Where is your bayes_journal? Also, if you have lots of bayes_toks.expire
files, it is because SA is trying to expire Bayes tokens and doesn't
finish in time. See the MailScanner spam.assassin.prefs.conf file for an
explanation. You need to set bayes_auto_expire to 0:

# When using the scheduled Bayes expiry feature, in MailScanner.conf
# you probably want to turn off auto-expiry in SpamAssassin as it will
# rarely complete before it is killed for taking too long.
# You will just end up with big bayes_toks.new files
# wasting space.
bayes_auto_expire 0
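
If you want to see how much space the abandoned expiry runs are taking up, a small script along these lines would report it. This is a sketch under assumptions: the function names are made up, the directory is whatever your bayes_path points into, and it only lists the files. Whether it is safe to delete them is something I have not verified.

```python
from pathlib import Path


def leftover_expire_files(bayes_dir):
    """Return (path, size_in_bytes) for each bayes_toks.expire* file, oldest first."""
    files = [p for p in Path(bayes_dir).glob("bayes_toks.expire*") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime)
    return [(p, p.stat().st_size) for p in files]


def wasted_bytes(bayes_dir):
    """Total size of all leftover expiry files in the directory."""
    return sum(size for _, size in leftover_expire_files(bayes_dir))


if __name__ == "__main__":
    import sys

    # Usage: python expire_report.py /usr/local/etc/MailScanner/bayes
    # (that directory is just the one from my config; use your own)
    for path, size in leftover_expire_files(sys.argv[1]):
        print(f"{size:>12}  {path.name}")
    print(f"{wasted_bytes(sys.argv[1]):>12}  total")
```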

> 
>> What does your reporting say? If you train an "insert favorite spam here"
>> message and then see more of them come through later, are they showing
>> Bayes scores?
> 
> At first glance no, but I'll need to monitor from here on out to see.
> 
> Casey
> 
> 

DAve

-- 
Three years now I've asked Google why they don't have a
logo change for Memorial Day. Why do they choose to do logos
for other non-international holidays, but nothing for
Veterans?

Maybe they forgot who made that choice possible.