No bayesian for me?

Wed Aug 12 18:30:49 IST 2009

Glenn Steen wrote:
> 2009/8/12 Mauricio Tavares <raubvogel at gmail.com>:
>> Glenn Steen wrote:
>>> 2009/8/12 Mauricio Tavares <raubvogel at gmail.com>:
>>>> Jules Field wrote:
>>>>> Did you run sa-learn as the same user you run MailScanner as? ("Run As
>>>>> User" in MailScanner.conf). Otherwise your Bayes database you've been
>>>>> training will be in the wrong place.
>>>>>
>>>>       I see your point. I was indeed running sa-learn as root, not as
>>>> postfix, which should be the user MailScanner runs as. So, I guess I
>>>> should run it as postfix then. Now, should I delete the root-created
>>>> database? Also, where will it save the database?
>>>>
>>> You should delete the one for root, if it resides in root's home
>>> directory, since that will be no help at all... Or move it. But I see
>>> you have configured it to reside somewhere sane, so all you need do is
>>> make it all owned by postfix.
>>        Here is an update: I wrote a script that went through all the virtual
>> email accounts (/var/spool/vmail/domain.com) and scanned the spam (placed in
>> the .Spam folder) and the ham (placed in all the other mail folders). Since
>> I am running it as postfix:postfix and that directory is owned by
>> virtual:virtual, I did not get everyone. Is there a way to let the
>> postfix-owned script check all the mails in the virtual-owned ones? Make
>> postfix part of the virtual group? I think that is what the sticky bit is
>> for, right? In any case, here is the output:
>>
>> postfix at mail /etc/postfix $ sa-learn --dump magic
>> 0.000          0          3          0  non-token data: bayes db version
>> 0.000          0       1837          0  non-token data: nspam
>> 0.000          0     179092          0  non-token data: nham
>> 0.000          0    3104505          0  non-token data: ntokens
>> 0.000          0 1053729759          0  non-token data: oldest atime
>> 0.000          0 1250081652          0  non-token data: newest atime
>> 0.000          0 1250081434          0  non-token data: last journal sync atime
>> 0.000          0 1250034247          0  non-token data: last expiry atime
>> 0.000          0          0          0  non-token data: last expire atime delta
>> 0.000          0          0          0  non-token data: last expire reduction count
>> postfix at mail /etc/postfix $
>>
>>
>> As you can see, there is a lot more ham than spam. I wonder how much harm
>> would that cause in my bayesian filtering...
>>
>>> If you also use MailWatch, you'll need to make the Apache user's group the
>>> "group owner" for the base directory and all the files, and set the
>>> GID bit for the directory (/var/spool/MailScanner/bayes in your case),
>>> so that any new files get the correct group ownership. Once you've
>>> done that, things should start cooking:-).
>>        Thanks for the suggestion! If I ever use MailWatch, I will try to
>> remember to use that. =)
>>
>>> One more thing: Always run your tests (spamassassin --lint and stuff
>>> like that) as your postfix user, to avoid some subtleties that might
>>> otherwise bite.
>> postfix at mail /etc/postfix $ spamassassin --lint
>> [19591] warn: config: warning: score set for non-existent rule WANTS_CREDIT_CARD
>> [19591] warn: config: warning: score set for non-existent rule FORGED_RCVD_HELO
>> [19591] warn: lint: 2 issues detected, please rerun with debug enabled for more information
>> postfix at mail /etc/postfix $
>>
> Hm, I wonder if your postfix user really can read all the .cf files...
> Do as it suggests and see what debug will tell you (spamassassin
> --lint -D, as the PF user). Also try running a message through, or
> else it will not test bayes for you:
> spamassassin -t -D < /path/to/email/file
> ... and look carefully at what it says about bayes. You might want to
> pipe the output to a file (or less). Don't forget to redirect STDERR
> as well ( 2>&1).
> 
> Cheers

	Some interesting findings (to me):

postfix at mail /home/raub/Spam $ spamassassin -D < spam9.eml

Content analysis details:   (10.2 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 1.8 BAD_ENC_HEADER         Message has bad MIME encoding in the header
 3.2 CHARSET_FARAWAY_HEADER A foreign language charset used in headers
 0.0 BAYES_50               BODY: Bayesian spam probability is 40 to 60%
                            [score: 0.5000]
 1.4 MIME_QP_LONG_LINE      RAW: Quoted-printable line longer than 76 chars
 0.9 RCVD_IN_SORBS_DUL      RBL: SORBS: sent directly from dynamic IP address
                            [202.132.194.31 listed in dnsbl.sorbs.net]
 0.9 RCVD_IN_PBL            RBL: Received via a relay in Spamhaus PBL
                            [202.132.194.31 listed in zen.spamhaus.org]
 2.0 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
                            [Blocked - see <http://www.spamcop.net/bl.shtml?202.132.194.31>]
 0.1 RDNS_DYNAMIC           Delivered to trusted network by host with
                            dynamic-looking rDNS
 0.0 MISSING_MIMEOLE        Message has X-MSMail-Priority, but no X-MimeOLE

But, as me:

raub at mail ~/Spam $ spamassassin -D < spam9.eml
[...]

Content analysis details:   (12.7 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 2.9 BAD_ENC_HEADER         Message has bad MIME encoding in the header
 3.2 CHARSET_FARAWAY_HEADER A foreign language charset used in headers
 1.8 MIME_QP_LONG_LINE      RAW: Quoted-printable line longer than 76 chars
 0.9 RCVD_IN_PBL            RBL: Received via a relay in Spamhaus PBL
                            [202.132.194.31 listed in zen.spamhaus.org]
 1.6 RCVD_IN_SORBS_DUL      RBL: SORBS: sent directly from dynamic IP address
                            [202.132.194.31 listed in dnsbl.sorbs.net]
 2.2 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
                            [Blocked - see <http://www.spamcop.net/bl.shtml?202.132.194.31>]
 0.1 RDNS_DYNAMIC           Delivered to trusted network by host with
                            dynamic-looking rDNS
 0.0 MISSING_MIMEOLE        Message has X-MSMail-Priority, but no X-MimeOLE

So, I guess the above means that Bayes was not run when I ran
spamassassin as myself, because my account does not have the rights to
access the database. I can live with that.
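
A rough way to double-check this, offered only as a sketch: save each
report to a file and see whether any BAYES_nn rule shows up in it. The
function name bayes_fired and the file name report.txt are made up for
illustration; nothing here comes from MailScanner or SpamAssassin itself.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the SpamAssassin report read
# from stdin contains a BAYES_nn rule name, which suggests the Bayes
# database was readable and consulted for that run.
bayes_fired() {
    grep -Eq '(^|[[:space:]])BAYES_[0-9]+([[:space:]]|$)'
}

# Example usage (file names are assumptions):
#   spamassassin -t < spam9.eml > report.txt 2>&1
#   if bayes_fired < report.txt; then
#       echo "Bayes rule present"
#   else
#       echo "no Bayes rule in report"
#   fi
```

Running it once as postfix and once as my own user should show the
difference between the two reports above.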

On a related note, why does the report say 5.0 points are required when
MailScanner.conf has

Required SpamAssassin Score = 4.7

Do I also have to define required_hits 4.70 in spam.assassin.prefs.conf?
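
To compare the two values side by side, here is a small sketch. It
pulls the threshold out of a saved report and the setting out of a
MailScanner.conf-style file, both read from stdin. The function names
and the example paths are assumptions for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: print the "required" threshold from the
# "Content analysis details" line of a SpamAssassin report on stdin.
report_required() {
    sed -n 's/^Content analysis details:.*(\([0-9.-]*\) points, \([0-9.]*\) required).*/\2/p'
}

# Hypothetical helper: print the value of "Required SpamAssassin Score"
# from MailScanner.conf-style text on stdin.
mailscanner_required() {
    sed -n 's/^Required SpamAssassin Score[[:space:]]*=[[:space:]]*//p'
}

# Example usage (paths are assumptions):
#   report_required < report.txt
#   mailscanner_required < /etc/MailScanner/MailScanner.conf
```

If the two numbers differ, that at least shows the command-line run and
MailScanner are not taking the threshold from the same place.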