Bayes not auto-learning at expected threshold

Tue Jun 29 13:37:02 IST 2004

Quoting from man Mail::SpamAssassin::Conf -

========
Note that certain tests are ignored when determining
whether a message should be trained upon:
 - auto-whitelist (AWL)
 - rules with tflags set to 'learn' (the Bayesian rules)
 - rules with tflags set to 'userconf' (user white/black-listing rules, etc)

Also note that auto-training occurs using scores from
either scoreset 0 or 1, depending on what scoreset is
used during message check.  It is likely that the message
check and auto-train scores will be different.
========
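
To make the first point concrete, here is a minimal Python sketch (not
SpamAssassin's actual code) that drops the Bayes hit from Mike's Msg #3
and sums what is left.  The helper names are hypothetical, and I am
assuming BAYES_99 is the only ignored rule among those hits.  The scores
are the ones shown in the report, i.e. from the scoreset used for the
check; the scoreset 0/1 scores used for auto-training are not shown in
the post, so the real auto-learn total will differ.

```python
# Sketch only: illustrates the idea in the man page excerpt above,
# not SpamAssassin's real implementation.

# Rule hits and scores copied from "Msg #3" in the quoted report.
# These come from the scoreset in use during the check, NOT from the
# scoreset 0/1 values that auto-learning would actually use.
MSG3_HITS = {
    "BAYES_99": 5.40,
    "MSGID_FROM_MTA_HEADER": 0.70,
    "RAZOR2_CF_RANGE_51_100": 1.10,
    "RAZOR2_CHECK": 1.05,
    "RCVD_IN_BL_SPAMCOP_NET": 1.50,
    "RCVD_IN_NJABL": 0.10,
    "RCVD_IN_NJABL_SPAM": 1.21,
    "RCVD_IN_RFCI": 0.10,
    "RCVD_IN_SBL": 3.54,
    "RCVD_IN_SORBS": 0.10,
    "WS_URI_RBL": 3.00,
}


def is_ignored(rule):
    """Assumption: only BAYES_* rules are ignored among these hits
    (no AWL or user white/black-list rules fired in Msg #3)."""
    return rule.startswith("BAYES_")


def autolearn_score(hits):
    """Sum the scores of every hit that is not ignored for training."""
    return sum(score for rule, score in hits.items() if not is_ignored(rule))


def autolearn_decision(score, ham_threshold=0.1, spam_threshold=12.0):
    """Hypothetical threshold check using the values from local.cf.
    The exact comparison operators are an assumption."""
    if score >= spam_threshold:
        return "spam"
    if score <= ham_threshold:
        return "ham"
    return "no"  # inside the auto-learn thresholds


print(f"reported total:  {sum(MSG3_HITS.values()):.2f}")
print(f"without Bayes:   {autolearn_score(MSG3_HITS):.2f}")
# 9.873 is the recomputed score from the debug output quoted below.
print(f"decision @9.873: {autolearn_decision(9.873)}")
```

With the displayed scores, the total falls from about 17.80 to about
12.40 once BAYES_99 is left out; re-scoring under scoreset 0 or 1 can
move it further.  The 9.873 passed to autolearn_decision is the
recomputed score from the debug output quoted below, which sits between
the two thresholds.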

Jase

Mike Brudenell wrote:
> Greetings -
>
> We are using:
>     MailScanner     4.29.3
>     SpamAssassin    2.63
>
> In my /etc/mail/spamassassin/local.cf I have:
>     bayes_auto_learn_threshold_nonspam       0.1
>     bayes_auto_learn_threshold_spam         12.0
> and have checked my SpamAssassin config with its "--lint" option.
>
> I thought this would tell SpamAssassin to auto-learn a message into
> the Bayes database if its score was 12.0 or more.
>
> Yet some of the messages I'm getting through achieve higher scores, but
> aren't marked autolearn=spam.
>
> Here are a few samples of interest...
> Message #1 has a score of 19.362 and *IS* auto-learned, whilst
> Message #2 has a higher score but is *NOT* auto-learned.  Message #3
> arrived after I did a total stop/start of MailScanner just to make
> sure I hadn't forgotten previously: it too was *NOT* auto-learned.
>
> =============================================================
> Msg #1
>
> X-York-MailScanner-SpamCheck: spam, SpamAssassin (score=19.362,
>         required 8, autolearn=spam, DCC_CHECK 2.91,
>         FAKE_HELO_MAIL_COM 3.77, FORGED_MUA_OUTLOOK 2.57,
>         FORGED_OUTLOOK_TAGS 1.00, HTML_FONTCOLOR_UNKNOWN 0.10,
>         HTML_FONT_BIG 0.27, HTML_MESSAGE 0.10, HTML_MIME_NO_HTML_TAG
>         1.18, MIME_HTML_ONLY 0.32, MSGID_FROM_MTA_HEADER 0.70,
>         OPT_HEADER 2.40, OPT_IN 0.23, RAZOR2_CF_RANGE_51_100 1.10,
>         RAZOR2_CHECK 1.05, SARE_CHARSET_W1251 1.67)
>
> =============================================================
> Msg #2
>
> X-York-MailScanner-SpamCheck: spam, SpamAssassin (score=19.644,
>         required 8, AS_SEEN_ON 1.49, BAYES_99 5.40, BIZ_TLD 0.10,
>         CLICK_BELOW 0.10, EARN_MONEY 1.01, HTML_70_80 0.10,
>         HTML_FONTCOLOR_RED 0.10, HTML_FONTCOLOR_UNSAFE 0.10,
>         HTML_FONT_BIG 0.27, HTML_LINK_CLICK_HERE 0.10, HTML_MESSAGE
>         0.10, J_CHICKENPOX_13 0.60, J_CHICKENPOX_15 0.60,
>         MIME_HTML_ONLY 0.32, RATWR9_MESSID 0.80, RCVD_IN_DYNABLOCK
>         2.60, RCVD_IN_NJABL_DYNA 3.54, RCVD_IN_SORBS 0.10,
>         SARE_BOUNDARY_07 2.22)
>
> =============================================================
> Msg #3
>
> X-York-MailScanner-SpamCheck: spam, SpamAssassin (score=17.795,
>         required 8, BAYES_99 5.40, MSGID_FROM_MTA_HEADER 0.70,
>         RAZOR2_CF_RANGE_51_100 1.10, RAZOR2_CHECK 1.05,
>         RCVD_IN_BL_SPAMCOP_NET 1.50, RCVD_IN_NJABL 0.10,
>         RCVD_IN_NJABL_SPAM 1.21, RCVD_IN_RFCI 0.10, RCVD_IN_SBL 3.54,
>         RCVD_IN_SORBS 0.10, WS_URI_RBL 3.00)
>
> =============================================================
>
> Can anyone shed any light on this behaviour please?  I'm including
> what I think is the relevant extract of debug output from MailScanner
> below.  I assume it's something to do with the "Score Set" chosen and
> the score shown for the auto-learn line, which is substantially lower
> than the final 14.47.
>
> Is it that only the body-hits are used to determine whether to
> auto-learn, rather than the head-hits being included as well?  (I
> assume the change from a head-hits of 8.365 to 9.07 is something to
> do with recomputing it using a different Score Set?  Can anyone point
> to information about these?)
>
> =============================================================
>
> debug: RBL: success for 16 of 16 queries
> debug: running meta tests; score so far=8.365
> debug: auto-learn? ham=0.1, spam=12, body-hits=8.365, head-hits=5.554
> debug: auto-learn: currently using scoreset 3.  recomputing score based on scoreset 1.
> debug: Score set 1 chosen.
> debug: auto-learn: original score: 9.07, recomputed score: 9.873
> debug: Score set 3 chosen.
> debug: auto-learn? no: inside auto-learn thresholds
> debug: is spam? score=14.47 required=5
>
>         tests=BAYES_99,DCC_CHECK,HTML_FONTCOLOR_RED,HTML_MESSAGE,
>         HTTP_ESCAPED_HOST,MSGID_FROM_MTA_HEADER,RAZOR2_CF_RANGE_51_100,
>         RAZOR2_CHECK,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_RFCI
> Stopping now as you are debugging me.
>
> =============================================================
>
> Cheers,
>
> Mike Brudenell
