Which messages to feed to Bayes?

Thu Feb 26 23:49:12 GMT 2004

At 05:08 PM 2/26/2004, Peter Bonivart wrote:
>If the score triggered the autolearn feature Bayes will not learn from
>the same message again. Look in bayes_seen, it's full of message id:s.
>
>So this feature already exists, just lower the threshold for autolearn
>if you want to, the default is 12. Put "bayes_auto_learn_threshold_spam
>8" or similar in spam.assassin.prefs.conf and it will do it for you

Not to be critical, but autolearn isn't a panacea.

Autolearning is helpful, but it won't learn every message that SA tags,
even if your autolearn and spam threshold scores are the same.

Generally speaking, it's quite hard to get the autolearner to learn a message.

All of the following conditions have to be true to learn as spam:

    1) The score, calculated without AWL, white/blacklists, or Bayes, and
       using a non-Bayes scoreset, must be greater than
       bayes_auto_learn_threshold_spam.
    2) The header rules must total at least 3.0 in score.
    3) The body rules must total at least 3.0 in score.
    4) The existing Bayes score must not be strongly non-spam in nature.
    5) The opportunistic one-try attempt at locking the Bayes database
       must succeed (i.e. nothing else can be updating Bayes at the same
       time).
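
To make the list above concrete, here is a rough sketch of the five
conditions as a single check. It is not SpamAssassin's code: the
ScoreBreakdown class, its field names, and would_autolearn_as_spam() are
hypothetical names made up for illustration, and the 12.0 default is simply
the threshold quoted earlier in this thread.

```python
from dataclasses import dataclass


@dataclass
class ScoreBreakdown:
    """Hypothetical container; the field names are not SpamAssassin's."""
    non_bayes_score: float    # score without AWL, white/blacklists, or Bayes
    header_points: float      # points contributed by header rules
    body_points: float        # points contributed by body rules
    bayes_strongly_ham: bool  # existing Bayes result is strongly non-spam
    got_bayes_lock: bool      # the one-try lock on the Bayes DB succeeded


def would_autolearn_as_spam(s: ScoreBreakdown, threshold: float = 12.0) -> bool:
    """Return True only if all five conditions from the list hold."""
    return (
        s.non_bayes_score > threshold
        and s.header_points >= 3.0
        and s.body_points >= 3.0
        and not s.bayes_strongly_ham
        and s.got_bayes_lock
    )


# A message can score well above the threshold and still be skipped,
# here because its header rules contribute less than 3.0 points.
high_but_skipped = ScoreBreakdown(20.0, 1.5, 18.5, False, True)
learned = ScoreBreakdown(14.0, 4.0, 10.0, False, True)
```

Under these assumptions, high_but_skipped fails the check despite scoring
20, which is the sort of case that keeps high-scoring mail out of autolearn.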

I wouldn't auto-feed all tagged mail back to SA for training, but I would
definitely still do manual training, and I would include tagged spam in my
training.

As you've mentioned, sa-learn already has the feature of not re-learning the
same message, so SA wastes very little CPU time deciding that it has already
seen a given message ID and moving on without learning it. That feature
isn't a reason to avoid training.
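
The idea of skipping already-seen messages can be sketched roughly as
follows. This is only an illustration of the concept, not how sa-learn or
bayes_seen is implemented: the SeenTracker class and its counters are
hypothetical, and an in-memory set stands in for the on-disk database.

```python
class SeenTracker:
    """Toy stand-in for a 'seen' database keyed by message ID."""

    def __init__(self):
        self.seen = set()   # message IDs that have already been learned
        self.learned = 0    # messages actually learned
        self.skipped = 0    # messages skipped as already seen

    def learn(self, message_id: str) -> bool:
        """Learn a message unless its ID was seen before."""
        if message_id in self.seen:
            # Cheap path: a single lookup, no tokenizing or DB update.
            self.skipped += 1
            return False
        self.seen.add(message_id)
        self.learned += 1
        return True


tracker = SeenTracker()
for msgid in ["<a@example.com>", "<b@example.com>", "<a@example.com>"]:
    tracker.learn(msgid)
```

Feeding the same message twice costs one lookup the second time, which is
why re-running manual training over a mailbox is cheap.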

I tested this with a mailbox of 205 messages, all tagged by SA. 108 of them
scored over 15, and 97 scored under 15 but over 5.

110 of the 205 were learnable by sa-learn, which means at most 95 had
already been learned.

I'll admit I'm using the default threshold of 12, but autolearn picked up
fewer than half of these messages (95 of 205). It didn't even autolearn all
108 of the high-scoring messages.
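
The arithmetic behind those figures, as a quick check. This assumes that
every message sa-learn declined to learn was skipped because autolearn had
already recorded it; the variable names are just for illustration.

```python
total_tagged = 205      # messages in the mailbox, all tagged by SA
high_scoring = 108      # scored over 15
low_scoring = 97        # scored under 15 but over 5
still_learnable = 110   # messages sa-learn was able to learn

# Whatever sa-learn skipped is assumed to have been autolearned earlier.
already_autolearned = total_tagged - still_learnable
autolearn_fraction = already_autolearned / total_tagged

# Even if every autolearned message was a high scorer, this many
# high-scoring messages were still missed by autolearn.
min_high_scoring_missed = high_scoring - already_autolearned

print(already_autolearned, round(autolearn_fraction * 100, 1),
      min_high_scoring_missed)
```

That works out to 95 messages, or about 46% of the mailbox, with at least
13 of the messages scoring over 15 never autolearned.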