Thoughts on new Bayes idea
Eric Dantan Rzewnicki
rzewnickie at RFA.ORG
Fri Jun 4 16:56:38 IST 2004
NOTE: don't use the attached script! It has problems. I'm just sending
it to the list to give the OP some ideas and hopefully get some
suggestions for improvement.
On Fri, Jun 04, 2004 at 04:22:57PM +0100, Julian Field wrote:
> At 16:21 04/06/2004, you wrote:
> >On Freitag, 4. Juni 2004 5:08 Max Kipness wrote:
> >
> >> My idea is to basically archive every email that enters the system
> >> (through MS) for a period of a day or so. I've got a script that
> >> deletes all emails older than a time specified from an mbox file.
> >> Then using my script from above, have users forward the email to
> >> spam at ourdomain.com, have a new script
> >> fetch that email out of the archive and feed it to Bayes.
> >
> >I like it. Please share the scripts once you are ready. The only
> >remaining problem would be that archiving mails in this manner might not
> >be allowed by local law. But that is another story...
>
> When Outlook forwards a message, what happens to the Message-ID? If it
> screws that, you may have trouble finding a unique key for the message,
> with the result that you can't find it in your archive.
I'm doing basically what Max is talking about now. I have a script that
pulls the Subject line out of the forwarded mail and uses that to create
a procmail recipe for each subject. Then procmail is called via formail
to pull matching mails out of the archive.
I'll attach it, but I don't recommend using it as is. It has some
problems that I haven't had time to fix. If a reported spam has a blank
subject, the procmail recipe basically matches everything, which is bad.
If a spam happens to have the same subject as some legitimate email,
the legitimate mail gets pulled out of the archive as well. Spam with
the same subject sent to users other than the reporting user also gets
pulled (this bit is generally beneficial).
Because of those problems, I don't have the script feeding sa-learn
automatically. I review the spam mbox the script generates before
running sa-learn on it.
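
To make the subject-matching idea and the blank-subject problem concrete,
here is a minimal sketch in Python. It is not the attached shell script and
does not use procmail or formail; it only illustrates the same approach with
the standard mailbox module. The function names and file paths are
hypothetical. It refuses blank subjects and writes matches to a separate
mbox for manual review rather than calling sa-learn. It does nothing about
legitimate mail that shares a subject with a spam, which is why the review
step stays.

```python
import mailbox


def collapse_ws(text):
    """Collapse runs of whitespace to single spaces and trim the ends."""
    return " ".join((text or "").split())


def pull_by_subject(archive_path, reported_subject, review_path):
    """Copy archive messages whose Subject matches into a review mbox.

    Returns the number of messages copied. A blank reported subject is
    refused, because it would otherwise match every archived message
    that has no subject.
    """
    wanted = collapse_ws(reported_subject)
    if not wanted:
        return 0

    archive = mailbox.mbox(archive_path)
    review = mailbox.mbox(review_path)
    copied = 0
    try:
        for msg in archive:
            # Compare whitespace-collapsed subjects, exact match only.
            if collapse_ws(str(msg.get("Subject", ""))) == wanted:
                review.add(msg)
                copied += 1
        review.flush()
    finally:
        archive.close()
        review.close()
    return copied
```

Something like pull_by_subject("archive.mbox", subject, "spam-review.mbox")
would be run once per reported message, and the review mbox checked by hand
before anything is fed to sa-learn.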
The script does the same for reported false-positives.
Outlook screws up all the headers that it includes in a forwarded
message. From: and To: are rewritten with a different syntax. Date:
can't be trusted because a lot of spam has a date in the future or the
past. Even the Subject: has to be sanitized because Outlook collapses
runs of whitespace to a single space character.
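
As an illustration of the Subject: sanitizing, here is a rough Python
sketch of a comparison key. It collapses whitespace the same way on both
the forwarded copy and the archived original so the two can be compared.
The prefix stripping is an assumption on my part: it covers only "FW:",
"Fwd:" and "Re:", and other clients or locales use different prefixes.
The function name is hypothetical.

```python
import re

# Assumed forward/reply prefixes; not an exhaustive list.
_PREFIX = re.compile(r"^(?:(?:fw|fwd|re)\s*:\s*)+", re.IGNORECASE)


def subject_key(subject):
    """Return a normalized form of a Subject suitable for comparison."""
    # Collapse tabs, newlines from folded headers, and repeated spaces.
    collapsed = " ".join(subject.split())
    # Drop any leading forward/reply prefixes added by the mail client.
    return _PREFIX.sub("", collapsed).strip()
```

Applying subject_key() to both sides before comparing means a subject with
double spaces in the archive still matches the single-spaced version that
comes back in the forward. An empty key should be treated as "no usable
subject" and skipped.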
Anyway, for what it's worth, here's what I have so far. Hope it's
helpful.
-Eric Rz.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: shpam-learn.sh
Type: application/x-sh
Size: 4657 bytes
Desc: not available
Url : http://lists.mailscanner.info/pipermail/mailscanner/attachments/20040604/85e91675/shpam-learn.sh