Oversight in MailScanner's Bayes Implementation?

Matt Kettler mkettler at EVI-INC.COM
Sat Dec 13 21:59:53 GMT 2003

At 03:41 PM 12/13/2003, Ryan D. Egeland wrote:
>It appears the Bayes feature available through spamassassin specifically
>the way MailScanner implements it evaluates all incoming mail in a bulk
>fashion, i.e. each individual user does not have his own Bayes database.
>Is my assumption correct?

Yes, that is correct.

However, the default manner in which SA processes bayes makes per-user
bayes an impossibility on many mailservers.

You see, in order to do per-user bayes the way SA does it, you need an
account for every user on your server. Many mailservers that run
MailScanner are relaying servers, like mine. This means that my mailserver
doesn't have accounts or home directories per-user.

<long explanation of bayes accuracy trimmed>

Yes, it's well known you get reduced accuracy by doing an aggregate bayes
database, but it's not THAT significant in most real-world cases. In fact,
in some real-world cases you get *better* accuracy, because some users get
too little mail to ever have enough tokens in their bayes DB, and thus
can't reap the benefits if the implementation is per-user.

The only significant case where per-user matters a lot is where all your
users get enough mail to have large bayes DBs, and you have two sub-groups
which have conflicting spam/nonspam email patterns. E.g., if you bayes
together a bunch of sysadmins and a bunch of mortgage brokers, you're going
to have problems.
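
To make that conflict concrete, here is a toy sketch. It assumes a
simplified per-token spam probability (spam frequency over spam plus ham
frequency), which is NOT the exact formula SA uses, and every count in it
is made up for illustration. It shows what happens to the token "mortgage"
when two groups with opposite habits share one database.

```python
def token_spam_prob(spam_hits, ham_hits, n_spam, n_ham):
    """Simplified per-token spam probability (an assumption for this
    sketch, not SpamAssassin's actual scoring)."""
    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    return spam_freq / (spam_freq + ham_freq)


def pooled(a, b):
    """Merge two sets of token counts, as a shared site-wide DB would."""
    return {key: a[key] + b[key] for key in a}


# Hypothetical counts for the token "mortgage" in each group's mail.
sysadmins = dict(spam_hits=400, ham_hits=2, n_spam=1000, n_ham=1000)
brokers = dict(spam_hits=400, ham_hits=600, n_spam=1000, n_ham=1000)

p_sysadmins = token_spam_prob(**sysadmins)               # strong spam sign
p_brokers = token_spam_prob(**brokers)                   # leans ham
p_pooled = token_spam_prob(**pooled(sysadmins, brokers)) # close to neutral

print(f"sysadmins only: {p_sysadmins:.3f}")
print(f"brokers only:   {p_brokers:.3f}")
print(f"pooled:         {p_pooled:.3f}")
```

With these invented numbers the pooled value lands near the middle, so the
token stops being a useful spam sign for the sysadmins and still leans the
wrong way for the brokers.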

It's theoretically possible for MailScanner to do per-user bayes with some
substantial work on Julian's part, but I'd question the value of it.

If you realistically think it's that big a deal, do some side-by-side tests
with corpora, and generate some hard factual statistics that show just how
bad it is. But I can tell you from my real-world experience using bayes
with MailScanner in a site-wide mode for a 100ish-user corporate network,
it works quite well.
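
A side-by-side test of that kind could be tallied along these lines. This
is only a sketch: it assumes you have already scored the same hand-sorted
corpus under each setup and collected (score, is_spam) pairs, and it uses
SA's default threshold of 5.0. The function name and all the numbers below
are placeholders, not measurements.

```python
def error_rates(results, threshold=5.0):
    """Return (false_positive_rate, false_negative_rate) for a list of
    (score, is_spam) pairs. A message counts as flagged when its score
    is at or above the threshold."""
    ham_scores = [score for score, is_spam in results if not is_spam]
    spam_scores = [score for score, is_spam in results if is_spam]
    false_pos = sum(1 for score in ham_scores if score >= threshold)
    false_neg = sum(1 for score in spam_scores if score < threshold)
    return false_pos / len(ham_scores), false_neg / len(spam_scores)


# Placeholder scores only, to exercise the function. Real numbers have to
# come from running your own corpus through each configuration.
site_wide = [(7.2, True), (4.1, True), (9.0, True), (6.3, True),
             (0.5, False), (1.2, False), (3.9, False), (2.0, False)]
per_user = [(7.9, True), (5.6, True), (9.4, True), (6.8, True),
            (0.3, False), (0.9, False), (5.4, False), (1.7, False)]

for name, results in (("site-wide", site_wide), ("per-user", per_user)):
    fp_rate, fn_rate = error_rates(results)
    print(f"{name}: FP {fp_rate:.1%}, FN {fn_rate:.1%}")
```

Reporting false positives and false negatives separately matters here,
since a lost legitimate message usually costs more than a missed spam.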
