No bayesian for me?
Mauricio Tavares
raubvogel at gmail.com
Wed Aug 12 18:30:49 IST 2009
Glenn Steen wrote:
> 2009/8/12 Mauricio Tavares <raubvogel at gmail.com>:
>> Glenn Steen wrote:
>>> 2009/8/12 Mauricio Tavares <raubvogel at gmail.com>:
>>>> Jules Field wrote:
>>>>> Did you run sa-learn as the same user you run MailScanner as? ("Run As
>>>>> User" in MailScanner.conf). Otherwise your Bayes database you've been
>>>>> training will be in the wrong place.
>>>>>
>>>> I see your point. I was indeed running sa-learn as root, not as
>>>> postfix, which should be the user MailScanner runs as. So, I guess I
>>>> should run it then as postfix. Now, should I delete the root-created
>>>> database?
>>>> Also, where will it save the database at?
>>>>
>>> You should delete the one for root, if it resides in root's home
>>> directory, since that will be no help at all... Or move it. But I see
>>> you have configured it to reside somewhere sane, so all you need do is
>>> make it all owned by postfix.
>> Here is an update: I wrote a script that went through all the virtual
>> email accounts (/var/spool/vmail/domain.com) and scanned the spam (placed in
>> the .Spam folder) and the ham (placed in all the other mail folders). Since
>> I am running it as postfix:postfix and that directory is owned by
>> virtual:virtual, I did not get everyone. Is there a way to let the
>> postfix-owned script check all the mails in the virtual-owned ones? Make
>> postfix part of the virtual group? I think that is what the sticky bit is
>> for, right? In any case, here is the output:
>>
>> postfix at mail /etc/postfix $ sa-learn --dump magic
>> 0.000 0 3 0 non-token data: bayes db version
>> 0.000 0 1837 0 non-token data: nspam
>> 0.000 0 179092 0 non-token data: nham
>> 0.000 0 3104505 0 non-token data: ntokens
>> 0.000 0 1053729759 0 non-token data: oldest atime
>> 0.000 0 1250081652 0 non-token data: newest atime
>> 0.000 0 1250081434 0 non-token data: last journal sync atime
>> 0.000 0 1250034247 0 non-token data: last expiry atime
>> 0.000 0 0 0 non-token data: last expire atime delta
>> 0.000 0 0 0 non-token data: last expire reduction count
>> postfix at mail /etc/postfix $
>>
>>
>> As you can see, there is a lot more ham than spam. I wonder how much harm
>> that would cause to my Bayesian filtering...
>>
>>> If you also use MailWatch, you'll need to make the apache user's group the
>>> "group owner" for the base directory and all the files, and set the
>>> GID bit for the directory (/var/spool/MailScanner/bayes in your case),
>>> so that any new files get the correct group ownership. Once you've
>>> done that, things should start cooking:-).
>> Thanks for the suggestion! If I ever use MailWatch, I will try to
>> remember to use that. =)
>>
>>> One more thing: Always run your tests (spamassassin --lint and stuff
>>> like that) as your postfix user, to avoid some subtleties that might
>>> otherwise bite.
>> postfix at mail /etc/postfix $ spamassassin --lint
>> [19591] warn: config: warning: score set for non-existent rule WANTS_CREDIT_CARD
>> [19591] warn: config: warning: score set for non-existent rule FORGED_RCVD_HELO
>> [19591] warn: lint: 2 issues detected, please rerun with debug enabled for more information
>> postfix at mail /etc/postfix $
>>
> Hm, I wonder if your postfix user really can read all the .cf files...
> Do as it suggests and see what debug will tell you (spamassassin
> --lint -D, as the PF user). Also try running a message through, or
> else it will not test bayes for you:
> spamassassin -t -D < /path/to/email/file
> ... and look carefully at what it says about bayes. You might want to
> pipe the output to a file (or less). Don't forget to redirect STDERR
> as well ( 2>&1).
>
> Cheers
Some interesting findings (to me):
postfix at mail /home/raub/Spam $ spamassassin -D < spam9.eml
Content analysis details: (10.2 points, 5.0 required)
 pts rule name              description
---- ---------------------- --------------------------------------------------
 1.8 BAD_ENC_HEADER         Message has bad MIME encoding in the header
 3.2 CHARSET_FARAWAY_HEADER A foreign language charset used in headers
 0.0 BAYES_50               BODY: Bayesian spam probability is 40 to 60%
                            [score: 0.5000]
 1.4 MIME_QP_LONG_LINE      RAW: Quoted-printable line longer than 76 chars
 0.9 RCVD_IN_SORBS_DUL      RBL: SORBS: sent directly from dynamic IP address
                            [202.132.194.31 listed in dnsbl.sorbs.net]
 0.9 RCVD_IN_PBL            RBL: Received via a relay in Spamhaus PBL
                            [202.132.194.31 listed in zen.spamhaus.org]
 2.0 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
                            [Blocked - see <http://www.spamcop.net/bl.shtml?202.132.194.31>]
 0.1 RDNS_DYNAMIC           Delivered to trusted network by host with
                            dynamic-looking rDNS
 0.0 MISSING_MIMEOLE        Message has X-MSMail-Priority, but no X-MimeOLE
But, as me:
raub at mail ~/Spam $ spamassassin -D < spam9.eml
[...]
Content analysis details: (12.7 points, 5.0 required)
 pts rule name              description
---- ---------------------- --------------------------------------------------
 2.9 BAD_ENC_HEADER         Message has bad MIME encoding in the header
 3.2 CHARSET_FARAWAY_HEADER A foreign language charset used in headers
 1.8 MIME_QP_LONG_LINE      RAW: Quoted-printable line longer than 76 chars
 0.9 RCVD_IN_PBL            RBL: Received via a relay in Spamhaus PBL
                            [202.132.194.31 listed in zen.spamhaus.org]
 1.6 RCVD_IN_SORBS_DUL      RBL: SORBS: sent directly from dynamic IP address
                            [202.132.194.31 listed in dnsbl.sorbs.net]
 2.2 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
                            [Blocked - see <http://www.spamcop.net/bl.shtml?202.132.194.31>]
 0.1 RDNS_DYNAMIC           Delivered to trusted network by host with
                            dynamic-looking rDNS
 0.0 MISSING_MIMEOLE        Message has X-MSMail-Priority, but no X-MimeOLE
So, I guess the above means that the Bayesian check was not run when I ran
spamassassin as me, because my user did not have the rights to access the
database. I can live with that.
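
A quick way to double-check that guess would be to look for a BAYES_* rule in each report. This is only a minimal sketch: the helper name bayes_hit is made up for illustration, and it assumes the report text arrives on stdin.

```shell
# Hypothetical helper: print the first BAYES_* rule named in a
# SpamAssassin report read from stdin, or "none" if no Bayes rule fired.
bayes_hit() {
    # grep exits non-zero when nothing matches; "|| true" keeps that
    # from aborting a script that runs with "set -e".
    hit=$(grep -o 'BAYES_[0-9][0-9]*' | head -n 1) || true
    echo "${hit:-none}"
}

# Usage (run once as postfix and once as the other user):
#   spamassassin -t < spam9.eml 2>&1 | bayes_hit
```

If the postfix run prints something like BAYES_50 and the other run prints "none", that would be consistent with the database not being readable by the second user.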
On a related note, why is it saying 5.0 points required if in
MailScanner.conf I have
Required SpamAssassin Score = 4.7
Do I also have to define required_hits 4.70 in spam.assassin.prefs.conf?
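
To compare what each side is actually configured with, the two values could be pulled out of the files. A rough sketch only: the function names are made up, the file paths in the usage comment are assumptions about a typical install, and it relies on required_hits being the older name for SpamAssassin's required_score setting.

```shell
# Hypothetical helpers for comparing the two thresholds.

# Print the value of "Required SpamAssassin Score" from a MailScanner.conf.
ms_required_score() {
    sed -n 's/^[[:space:]]*Required SpamAssassin Score[[:space:]]*=[[:space:]]*//p' "$1"
}

# Print the value of required_score (or the older name required_hits)
# from a SpamAssassin preferences file. Commented-out lines are skipped
# because their first field starts with "#".
sa_required_score() {
    awk '$1 == "required_score" || $1 == "required_hits" { print $2 }' "$1"
}

# Usage (paths are assumptions, adjust to the local install):
#   ms_required_score /etc/MailScanner/MailScanner.conf
#   sa_required_score /etc/MailScanner/spam.assassin.prefs.conf
```

If the second command prints nothing, the preferences file sets no threshold of its own, and a spamassassin run from the command line would report against its built-in default.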