Which messages to feed to Bayes?

Fri Feb 27 19:23:15 GMT 2004

On Thu, Feb 26, 2004 at 03:47:45PM -0800, Michael St. Laurent wrote:
> Matt Kettler <mailto:mkettler at EVI-INC.COM> wrote:
> > At 04:55 PM 2/26/2004, Michael St. Laurent wrote:
> >> Should we be feeding the Bayes engine in Spamassassin messages that
> > My answer is a very emphatic YES!
> Excellent.  Okay, what about spam messages that have lost their headers
> because the user forwarded it to me (Outlook strips the headers when you do
> that).  Will it still benefit from looking at just the body of the message?

As others have said, you need the original message. I'm archiving all
original mail in date stamped mboxes (for one week). I've worked out a
string of shell commands to grab the subject from forwarded
false-negative spams and use that to dump the original mails to an mbox
for review. Any that don't look like spam can be deleted and the rest
can be fed to sa-learn. I have used this approach successfully, but it
is not fully automated.

I haven't quite finished the script, and what I have is pretty messy.
But, in the interest of sharing ideas, here's what I'm working on
(incomplete semi-pseudo code & not fully tested & I'm not an
experienced programmer & whatever other disclaimers would be pertinent
& criticism welcome):

##########

# placeholder: set this to the mbox holding the user-forwarded reports
spam_reports="<user-reported-spam-mbox>"
archive=/var/spool/MailScanner/archive
sa_prefs=/opt/MailScanner/etc/spam.assassin.prefs.conf

# many of these are protected by quoting and may not need to
# be here. But, I'm not sure which ...
special_chars="\041-\055\072-\077\133-\140\173-\177"

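# collect the unique Subject: lines quoted in the forwarded reports
subjects=`grep -A6 "^-----Original Message-----$" "$spam_reports" |\
    grep "^Subject: " | sort -u`

# turn each subject into a pattern (special chars become "."), then
# pull matching messages out of every archived mbox
for mbox in $archive/20*; do
    printf '%s\n' "$subjects" | perl -pe "s/[$special_chars]/./g" |\
    xargs --replace grepmail -u -h "^{}" "$mbox" >> /tmp/spam
done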

##########

### Then interactively
mutt -f /tmp/spam

### and then train. This could read the grepmail output from a pipe
### but, I'm not comfortable totally automating this, yet.
sa-learn --spam --mbox -p "$sa_prefs" /tmp/spam

There are problems with this approach.

-Blank Subject:'s reported by users are ignored because they match
 everything.
-I'm actually training on many many more messages than the one reported
 by each user since typically we received many spams w/ the same
 subject. This might be a good thing, though.
-Spam subjects could match subjects of legitimate mail. This is one
 reason for the manual review.
-I'm probably replacing too many special characters with "." and would
 rather escape them with "\", but my replacement foo is limited :-(
-grepmail is a RAM pig. It seems to keep all of an mbox in memory as it
 works on it, probably because it has to rewind.
-Maybe the whole idea is just silly?
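
On the escaping and blank-subject problems above, here is a minimal
sketch of one way they might be handled. It is not tested against
grepmail itself, non-ASCII subjects are untested, and the helper names
escape_subject and subject_patterns are made up for illustration:

```shell
# Hypothetical helper: put a backslash in front of every character that
# is not a letter, digit, or space, so a regex engine treats the
# subject as literal text instead of as pattern syntax.
escape_subject() {
    printf '%s\n' "$1" | LC_ALL=C sed 's/[^[:alnum:] ]/\\&/g'
}

# Hypothetical helper: read "Subject: ..." lines on stdin, skip the
# ones with nothing after the colon (they would match every message),
# and print one escaped pattern per remaining line.
subject_patterns() {
    while IFS= read -r line; do
        rest=${line#Subject:}
        case "$rest" in
            *[![:space:]]*) escape_subject "$line" ;;
        esac
    done
}

# Possible use in place of the perl | xargs step (assumes $subjects and
# $mbox are set as in the script above):
#
#   printf '%s\n' "$subjects" | subject_patterns |
#   while IFS= read -r pat; do
#       grepmail -u -h "^$pat" "$mbox" >> /tmp/spam
#   done
```

Reading the patterns with "read -r" rather than xargs keeps the
backslashes intact, since xargs would otherwise treat them as its own
escape characters.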

Anyway, I hope this is useful to someone.

-Eric Rz.