Odd problem with bayes

Stephen Swaney steve.swaney at FSL.COM
Fri Mar 12 00:05:13 GMT 2004


> -----Original Message-----
> From: MailScanner mailing list [mailto:MAILSCANNER at JISCMAIL.AC.UK] On
> Behalf Of John Rudd
> Sent: Thursday, March 11, 2004 6:36 PM
> To: MAILSCANNER at JISCMAIL.AC.UK
> Subject: Odd problem with bayes
>
> Two of my four front line mailscanner machines (the ones that are still
> running sendmail) dramatically slowed down over the last few days.  It
> was really confusing me and making me wonder what strange thing was
> going on.  Especially since they were installed in pairs (1&2, then
> 3&4), and each pair was as identical as I could make them ... and the
> slow machines were 1 and 3.
>
> They weren't taking in extra traffic over the others ... less, in fact.
>
> Eventually I noticed that / was horribly full on them, and I tracked it
> down to the /.spamassassin directory.
>
> What I found were tons of bayes_toks.expire[digits] files and
> bayes.lock.$HOST.digits files ... many of which were days old.  The
> other two machines didn't have this problem (a few of the bayes.lock
> files, but not many that were old, and none of the expire files).
>
>
> What would cause this?  What's the right way to clean it?  What's a good
> time to set up a routine to clean it automatically every night?  What's
> a good way to prevent it?
>
> (was I wrong that the site-wide bayes problems of old were solved?
> though, my production machines are still running an older version still
> ... MailScanner-4.11-1 , with a newer spamassassin ... should I just
> turn off bayes until I upgrade?  I think all we're doing with it right
> now is auto-learning)

I know this answer is in the list archives since I sent it last week. I
probably should have put it in the FAQ :(

The path to the bayes directory may be set in spam.assassin.prefs.conf by
adding a line similar to:

bayes_path <directory_where_bayes_tokens_are_stored>/bayes

Please see http://www.spamassassin.org/doc/Mail_SpamAssassin_Conf.html
for the details. There is a LOT of very useful information there; it is a
"must read" for MailScanner users.

Please note the "/bayes" after the actual directory name. This causes
SpamAssassin to look for files "bayes_*" in the named directory.

The default in SpamAssassin is to automatically add tokens to the bayes
database.

I quote from the link referenced above:

----Start quote -------

bayes_auto_learn ( 0 | 1 ) (default: 1)

Whether SpamAssassin should automatically feed high-scoring mails (or
low-scoring mails, for non-spam) into its learning systems. The only
learning system supported currently is a naive-Bayesian-style classifier.
Note that certain tests are ignored when determining whether a message
should be trained upon: - auto-whitelist (AWL) - rules with tflags set to
'learn' (the Bayesian rules) - rules with tflags set to 'userconf' (user
white/black-listing rules, etc)

Also note that auto-training occurs using scores from either scoreset 0 or
1, depending on what scoreset is used during message check. It is likely
that the message check and auto-train scores will be different.


bayes_auto_learn_threshold_nonspam n.nn (default: 0.1)

The score threshold below which a mail has to score, to be fed into
SpamAssassin's learning systems automatically as a non-spam message.

bayes_auto_learn_threshold_spam n.nn (default: 12.0)

The score threshold above which a mail has to score, to be fed into
SpamAssassin's learning systems automatically as a spam message.
Note: SpamAssassin requires at least 3 points from the header, and 3 points
from the body to auto-learn as spam. Therefore, the minimum working value
for this option is 6.

----End quote -------

Auto-learn is on by default, but these settings and scores can be explicitly
set or changed by adding a "parameter value" line to spam.assassin.prefs.conf
and reloading MailScanner.
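
As an illustration only, the defaults quoted above would look like this if
written out explicitly in spam.assassin.prefs.conf. The values are simply the
documented defaults from the quote, not tuning recommendations:

```
# Explicitly stated defaults (taken from the SpamAssassin doc quoted above)
bayes_auto_learn 1
bayes_auto_learn_threshold_nonspam 0.1
bayes_auto_learn_threshold_spam 12.0
```

Setting bayes_auto_learn to 0 would turn auto-learning off.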

I run a script in /etc/cron.daily called bayes-rebuild. The contents of
bayes-rebuild are simply:

-----Snip ------
#!/bin/bash
# Rebuild the bayes database daily and force an expiry run
/usr/bin/sa-learn -p <path_to_spam.assassin.prefs.conf> --rebuild \
    --force-expire
-----Snip ------
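
If you want the cron job to complain when the prefs file is missing rather
than fail silently, something along these lines might do. This is only a
sketch: the function name rebuild_bayes and the optional second argument
(an alternative sa-learn path, handy for testing) are my own additions, and
the sa-learn options are the same ones used in the script above.

```bash
#!/bin/bash
# Sketch of a guarded bayes rebuild.
# rebuild_bayes and its arguments are hypothetical names for illustration.
rebuild_bayes() {
    local prefs="$1"                          # path to spam.assassin.prefs.conf
    local sa_learn="${2:-/usr/bin/sa-learn}"  # sa-learn binary (assumed location)

    # Refuse to run if the prefs file cannot be read.
    if [ ! -r "$prefs" ]; then
        echo "bayes-rebuild: cannot read prefs file: $prefs" >&2
        return 1
    fi

    # Same options as the daily script: rebuild and force an expiry run.
    "$sa_learn" -p "$prefs" --rebuild --force-expire
}
```

The cron script would then call rebuild_bayes with the real path to
spam.assassin.prefs.conf as its only argument.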

I also set

Rebuild Bayes Every = 0

in MailScanner.conf, since the rebuild is done from the cron.daily job. I
have had zero bayes problems on many systems with this setup, and bayes works
very well with almost no effort and no maintenance headaches.
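
To see whether leftover files like the ones described in the original
question are piling up, a small check along these lines may help. It is a
sketch under assumptions: the file name patterns come from the question
(bayes_toks.expire* and bayes.lock.*), the function name and the one-day
default are made up for illustration, and it only lists candidates. It does
not delete anything.

```bash
#!/bin/bash
# List bayes lock/expire files older than a given number of days.
# Illustrative only: prints candidates, removes nothing.
list_stale_bayes_files() {
    local dir="$1"        # directory holding the bayes files (site-specific)
    local days="${2:-1}"  # age threshold in days (assumed default)

    # Match only the leftover expire and lock files, not the live database.
    find "$dir" -maxdepth 1 -type f \
        \( -name 'bayes_toks.expire*' -o -name 'bayes.lock.*' \) \
        -mtime +"$days" -print | sort
}
```

For the setup in the question this would be called as
list_stale_bayes_files /.spamassassin 1
and anything it prints day after day deserves a closer look before removal.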

Is it better to manually check and feed ham and spam to the Bayesian
database? Absolutely! But if you're too busy to do that, this setup will
still improve overall spam detection.

Hope this helps,

Steve
Stephen Swaney
President
Fortress Systems Ltd.
Steve.Swaney at FSL.com

