Bayesian shenanigans (i.e. problems)

Nathan Johanson nathan at TCPNETWORKS.NET
Thu Jan 22 19:10:10 GMT 2004

Can someone please clarify... Doesn't the sa-learn --rebuild command
expire the tokens (if necessary) by default? Isn't the extra
--force-expire option unnecessary if you regularly rebuild the database?

As an aside, I have been following this thread here and on the sa-talk
list (where surprisingly there were no responses). I too have been
having problems with accumulating lock files and the subsequent creation
of Deleting the lock files and rebuilding the database
seems to fix the problem (albeit temporarily). Since the common
conception seems to be that Bayes is resource and memory-intensive, I
recently upgraded the RAM on this machine from 256MB to 512MB and
haven't seen the problem since. I'm also planning on increasing the
SpamAssassin timeout to 50 or 60 seconds, as this system also seems to
have more than its share of overall timeouts on a daily basis.

Interestingly, while my attempts at sa-learn --rebuild seem to work w/out
issue, adding the --force-expire switch reports the following status.
Subsequent research of this logging suggests that it's more
informational than a true problem. I'm assuming that this is a
side-effect of currently only using auto-learning and not feeding my
bayes database enough. Has anyone else seen this sort of output? 

> debug: bayes: Can't use estimation method for expiry, something fishy, calculating optimal atime delta (first pass)
> debug: bayes: atime     token reduction
> debug: bayes: ========  ===============
> debug: bayes: 43200     69735
> debug: bayes: 86400     39541
> debug: bayes: 172800    891
> debug: bayes: 345600    0
> debug: bayes: 691200    0
> debug: bayes: 1382400   0
> debug: bayes: 2764800   0
> debug: bayes: 5529600   0
> debug: bayes: 11059200  0
> debug: bayes: 22118400  0
> debug: bayes: couldn't find a good delta atime, need more token difference, skipping expire.
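
For anyone reading the table above: the atime deltas are just 12 hours doubled repeatedly, up to 256 days. The short Python sketch below (an illustration only, not part of SpamAssassin) converts the values copied from the log into hours and days and finds the first cutoff at which the reduction drops to zero. Reading "token reduction" as the number of tokens that would be removed at each cutoff is an assumption on my part; under that reading, every token in this database was touched within the last four days.

```python
# Rows copied from the debug output above: (atime delta in seconds, token reduction).
ROWS = [
    (43200, 69735), (86400, 39541), (172800, 891), (345600, 0),
    (691200, 0), (1382400, 0), (2764800, 0), (5529600, 0),
    (11059200, 0), (22118400, 0),
]


def human(seconds):
    """Render a delta as whole hours (below one day) or whole days."""
    hours = seconds // 3600
    return f"{hours} h" if hours < 24 else f"{hours // 24} d"


def first_zero_delta(rows):
    """Return the smallest delta whose reduction is zero, or None if there is none."""
    for delta, reduction in rows:
        if reduction == 0:
            return delta
    return None


if __name__ == "__main__":
    for delta, reduction in ROWS:
        print(f"{human(delta):>6}  {reduction}")
    print("first zero-reduction cutoff:", human(first_zero_delta(ROWS)))
```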


-----Original Message-----
From: Julian Field [mailto:mailscanner at ECS.SOTON.AC.UK] 
Sent: Thursday, January 22, 2004 9:21 AM
Subject: Re: Bayesian shenanigans (i.e. problems)

At 16:52 22/01/2004, you wrote:
>On Thu, 22 Jan 2004, Steve Freegard wrote:
> > I haven't been following this thread closely, so apologies if this has
> > already been covered.
>It hasn't, so your reply is appreciated!
> > Maybe the error is being caused by opportunistic bayes expiry which may
> > take long enough on your system to cause MailScanner to time-out and kill
> > off SA mid-expiry causing your orphaned files??
>That sounds very plausible.  I have gone even deeper into the "maillog"
>files, and these "Delete bayes ..." for a particular MS process occur
>40 seconds after it starts the spam analysis.  And the MS conf has SA
>timeout of 40 seconds.  It all fits.
>So very promising indeed.
> > You could try setting 'bayes_auto_expire 0' in spam.assassin.prefs.conf and
> > then creating a nightly cron job to run a script that does an 'sa-learn
> > /etc/MailScanner/spam.assassin.prefs.conf --rebuild --force-expire'.
>Yes, that might be worth a try, at least as proof of concept.
>But I wonder whether we need a cleaner solution (remember, a few other
>folk have seen one or other variant of this) that, as a default,
>tries to prevent this.  Two possibilities:
>1. MS installation-time (and defaults):  MS defaults 'bayes_auto_expire 0'
>    and accompanies that with setting the cron job?  But setting the cron
>    job is highly OS-specific (i.e. variable!), and overall this doesn't
>    feel quite right.
>2. MS run-time: MS defaults 'bayes_auto_expire 0', but at start up (which
>    it generally does every four hours) it does "--rebuild
>    preferably (if possible) by the appropriate subroutine call to SA.
>This second feels better and cleaner (although there's a residual issue of
>the near simultaneous start-up of around five MS processes).
>Julian: Do you have any thoughts?  I'd be happy to try to cobble together
>a proof of concept patch for that second version (although I'd prefer it
>if it arrived fully-fledged on the doorstep!).
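
The nightly job suggested in the quoted text (with 'bayes_auto_expire 0' set in spam.assassin.prefs.conf) could be sketched as a small Python wrapper for cron to call. This is only a sketch under stated assumptions: the option that introduces the prefs file is missing from the quoted command, so the "-p" used here is my guess at it, and the 30-minute timeout is an arbitrary illustrative value.

```python
import subprocess

# Path taken from the quoted message.
PREFS = "/etc/MailScanner/spam.assassin.prefs.conf"


def expire_command(prefs=PREFS):
    """Build the sa-learn command line for a forced rebuild and expiry."""
    # Assumption: "-p" (prefs file) stands in for the option missing from the quote.
    return ["sa-learn", "-p", prefs, "--rebuild", "--force-expire"]


def run_expiry(prefs=PREFS, timeout=1800):
    """Run the expiry and return sa-learn's exit status.

    The timeout is an illustrative value, not a recommendation;
    subprocess.TimeoutExpired is raised if sa-learn runs longer than that.
    """
    return subprocess.run(expire_command(prefs), timeout=timeout).returncode
```

A cron-driven script would simply call run_expiry() once a night and log or mail the exit status.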

The trouble with option 2 is that the child processes start up
independently of each other, and doing it once at the startup of every
child process would cause a huge holdup while all n children (n could
easily be 12 on a dual-CPU box) ran their own bayes-expire. However, there
are ways around this, as there always are, so I may be able to come up with
a better solution that would do a bayes expire approximately once every
hours or so, which should be plenty. The whole system would have to sit and
hang while this took place, unless I temporarily disabled SpamAssassin (or
*possibly* even just bayes) while it was doing it.
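
One way to keep every child from running its own expiry is a shared timestamp file guarded by a non-blocking lock, so that at most one process does the work and the rest skip it. The Python sketch below illustrates that general idea only; it is not how MailScanner does or would do it, and the names maybe_expire and stamp_path are hypothetical. The interval is left as a parameter because the number of hours is missing from the message above. It also does nothing about pausing SpamAssassin in the other children while the expiry runs.

```python
import fcntl
import os
import time


def maybe_expire(stamp_path, interval, expire, now=None):
    """Run expire() at most once per `interval` seconds across all processes.

    stamp_path holds the time of the last run and doubles as the lock file.
    Returns True if this call ran expire(), False if it was skipped.
    """
    now = time.time() if now is None else now
    fd = os.open(stamp_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            # Non-blocking: if another child is already expiring, do not wait.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        data = os.read(fd, 64).strip()
        last = float(data) if data else 0.0
        if now - last < interval:
            return False  # someone expired recently enough
        expire()
        # Record this run so the other children skip until the interval passes.
        os.lseek(fd, 0, os.SEEK_SET)
        os.ftruncate(fd, 0)
        os.write(fd, repr(now).encode())
        return True
    finally:
        os.close(fd)  # closing the descriptor also releases the lock
```

Each child would call maybe_expire() at start-up with the same stamp_path; only the first one past the interval pays the cost of the expiry.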

This is going to be a bit of a pig to write :-(
Julian Field
MailScanner thanks transtec Computers for their support

PGP footprint: EE81 D763 3DB0 0BFD E1DC 7222 11F6 5947 1415 B654
