SpamAssassin 2.0 - beware of old spamassassin.cf file!

Sat Jan 26 19:32:58 GMT 2002

On Sat, Jan 26, 2002 at 01:30:22PM +0000, Julian Field wrote:
> At 21:03 25/01/2002, you wrote:
> >On Fri, Jan 25, 2002 at 08:48:47AM +0000, Julian Field wrote:
> > > Please note this is due to SpamAssassin bugs, not MailScanner bugs :-)
> > > I call their API just the way they told me to...
> >
> >Without a doubt (God knows I've spent enough time looking over your code
> >to know that :)
>
> Cool, someone who actually looks at the code!
> I hope you don't think it's too awful... most of the users seem happy with
> it :)

I haven't looked at how you're doing DNS lookups yet, but I have some
code that should kick them into high gear.  I'll send it along when I
dig it up.

> >   If SpamAssassin 2 doesn't work well, I'm going to
> >rewrite large parts of it myself and just use their patterns.
>
> I'm running SpamAssassin 2.01 on our servers and it seems to be behaving
> fine, certainly as well as 1.5 did.

I put SA 2.01 on my test mail server last night, and ran 999 known
spams through it, and it hit on 858.  I'm going to split them out and
run them through the command line version and see what I get.  Even so,
86% isn't bad, and another file of 350 known spams hit higher (91%).

I was thinking that it would make sense to use Vipul's Razor in
MailScanner, and then not run the message through SA if it hits there.
Razor is much quicker than SA, and generally hits on 50% of all the
spam that I get (my file with 354 spams had only 130 that it didn't
recognize), which means that we could speed this up greatly by only
calling SA for the half of the spam that Razor misses.
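
Here is a rough sketch of that ordering, just to show the control flow.
The two checks are passed in as code references; razor_first and both
stand-in checks are names I made up for illustration, not anything from
the MailScanner, Razor, or SpamAssassin APIs.

```perl
use strict;
use warnings;

# Run the cheap check first; only fall through to the expensive one on a
# miss.  Both checks are caller-supplied code refs (hypothetical stand-ins
# for a Razor lookup and a SpamAssassin scan) that return true for spam.
sub razor_first {
    my ($message, $cheap_check, $expensive_check) = @_;
    return 'razor'        if $cheap_check->($message);
    return 'spamassassin' if $expensive_check->($message);
    return '';
}

# Stand-in checks, for illustration only.
my $sa_calls   = 0;
my $fake_razor = sub { $_[0] =~ /known-spam/ };
my $fake_sa    = sub { $sa_calls++; $_[0] =~ /viagra/i };

for my $msg ('known-spam body', 'cheap Viagra', 'hello from Julian') {
    my $verdict = razor_first($msg, $fake_razor, $fake_sa) || 'clean';
    print "$verdict\n";
}
print "expensive check ran $sa_calls times\n";
```

The expensive check only runs for messages the cheap one did not
recognize, which is where the saving would come from.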

Another idea, which would work in conjunction with the high-volume DNS
lookup code mentioned above, would be to create a process that scans the
mail queue before MailScanner gets the messages and does a high-volume
DNS lookup on every single host name in every single message.  (I do 100
or more in parallel, so it wouldn't take long.)  That way, when
MailScanner or SA goes to do its lookups, the answers would already be
sitting in the nameserver's cache, and the lookup would be nearly
instantaneous.  The biggest trick would be getting the address rewriting
for the RBL lookups right, but I don't think it would be terrible.
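
To make the RBL part concrete: a DNS blacklist is queried by reversing
the octets of the IPv4 address and appending the list's zone.  Below is
a minimal sketch, assuming IPv4 only and a deliberately crude host name
pattern.  The names rbl_query_name and hostnames_in and the zone
rbl.example.org are hypothetical, chosen for illustration.

```perl
use strict;
use warnings;

# Build the name a DNSBL lookup asks for: octets reversed, zone appended.
# Returns nothing if $ip is not a dotted-quad IPv4 address.
sub rbl_query_name {
    my ($ip, $zone) = @_;
    my @octets = $ip =~ /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/
        or return;
    return if grep { $_ > 255 } @octets;
    return join('.', reverse(@octets), $zone);
}

# Pull candidate host names out of a message, lowercased, without repeats.
# The pattern is a rough approximation, not a full RFC host name parser.
sub hostnames_in {
    my ($text) = @_;
    my %seen;
    return grep { !$seen{$_}++ }
           map  { lc($_) }
           $text =~ /\b((?:[a-z0-9-]+\.)+[a-z]{2,})\b/gi;
}

my $message = "Received: from mail.example.com (mail.example.com [192.0.2.10])\n"
            . "Visit http://WWW.Example.net/buy now\n";

# One name per line, ready to pipe into a bulk lookup program.
print "$_\n" for hostnames_in($message);
print rbl_query_name('192.0.2.10', 'rbl.example.org'), "\n";
```

The output is one name per line, which is the input format the lookup
program further down expects on stdin.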

Anyway, some thoughts.  I'm not against coding, so I might play around
with some of it when I get a chance in a few weeks.

Good news: the code was right where I thought it was.  Have it under
the GPL.  This isn't the most beautiful Perl that I've ever written.
Oddly, this was the first Perl program for which I was paid, written
about a week after I learned Perl.  Caveat emptor, but it works great.

#!/usr/bin/perl

# By: Michael Darrin Chaney (mdchaney at michaelchaney.com)
# Michael Chaney Consulting Corp.
# 7/22/1999
# Copyright (C) 1999, 2002 Michael Chaney Consulting Corporation
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307
# USA
#
# Synopsis:
# Does a number of nslookups asynchronously.  Performance can be tweaked
# via the 4 variables below (timeout, waitlen, maxwaiting, maxlookups).
# With the values used here (5, 3, .75, and 60) the program averages a
# little over 1000 lookups each minute; it's unlikely that more are
# needed.  I'm not sure what a good value for maxlookups is, but it seems
# to run fine with 60 on a T-1.  A faster net connection means you should
# move that value up.  Just don't kill your nameserver.  You also might
# want to move the timeout up a bit, perhaps to 15 or more seconds.
#
# This program expects to receive a list of hosts via stdin.

use Net::DNS;
use IO::Select;

my $debugme=0;

my $timeout = 5;
my $waitlen = 3;  # should be shorter than the timeout period
my $maxwaiting = .75;
my $maxlookups = 60;

my $waitntry=0;

my $sel = new IO::Select;
my $res = new Net::DNS::Resolver;
$res->tcp_timeout($timeout);
$res->udp_timeout($timeout);  # bgsend() normally goes out over UDP

print "Initial select count: ", $sel->count(), "\n" if ($debugme);

until (eof STDIN && $waitntry==2) {
   # If more than $maxwaiting (here three quarters) of the lookup slots
   # are "active", then I'm going to wait and dump the ones which aren't
   # ready.  Or, if we're at the end of input but some slots are active,
   # wait.
   if (($sel->count() > ($maxlookups*$maxwaiting)) ||
                              ($sel->count()>0 && eof(STDIN))){
      if ($waitntry == 0) {
         # wait for a few seconds and try again
         print $sel->count(), " of ", $maxlookups, " slots used, "
             if ($debugme);
         if (eof(STDIN)) {
            print "last bunch, waiting 10 seconds...\n" if ($debugme);
            sleep 10;  # Give the last bunch a few more seconds
         } else {
            print "waiting $waitlen seconds...\n" if ($debugme);
            sleep $waitlen;  # This should be short!
         }
         $waitntry=1;
      } else {
         # well, we waited and ended up here again, we'll dump them
         # and move on.  It would be better to actually keep track of
         # how long each request has taken, and only dump the oldest
         # ones.  Alternately, if the handles are added to the list
         # at the beginning or end (very likely, check the IO::Select
         # source), then a certain slice of the oldest ones could be
         # dumped.
         $sel->remove($sel->handles);
         # The state changes below must happen whether or not debugging
         # is on, so only the print statements test $debugme.
         if (eof(STDIN)) {
            print "End of file, dumping inactive lookups\n" if ($debugme);
            $waitntry=2;
            last;
         } else {
            print "Too many slots used, dumping inactive lookups\n"
                 if ($debugme);
            $waitntry=0;
         }
      }
   } else {
      if ($sel->count()==0 && eof(STDIN)) {
         $waitntry=2;
         last;
      } else {
         $waitntry=0;
      }
   }

   unless ($waitntry) {
      # This will add some host lookups to bring the current count up to
      # $maxlookups.  Think of this as planting some seeds.
      while (!eof(STDIN) && $sel->count() < $maxlookups) {
         my $line=<STDIN>;
         chomp $line;
         print "Adding host $line\n" if ($debugme);
         $sel->add($res->bgsend($line));
      }
   }

   # Now, time to harvest the ripe ones.
   if (@ready = $sel->can_read($timeout)) {
      foreach $sock (@ready) {
         $packet = $res->bgread($sock);
#        $packet->print;
         if ($packet && $packet->answer) {
            foreach $rr ($packet->answer) {
               if ($rr->type eq "A") {
                  print $rr->name," " if ($debugme);
                  print $rr->address,"\n";
               }
            }
         } else {
            print "Empty return\n" if ($debugme);
         }
         $sel->remove($sock);
         $sock = undef;
      }
   }
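   # Dump selects that have errors.  Depending on the IO::Select version,
   # the method is called either "has_error" or "has_exception", so use
   # whichever one this installation provides.
   my $has_err = $sel->can('has_exception') ? 'has_exception' : 'has_error';
   if (@ready = $sel->$has_err(0)) {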
      print "Removing problem selects\n" if ($debugme);
      foreach $sock (@ready) {
#         $packet = $res->bgread($sock);
#         $packet->print;
         $sel->remove($sock);
         $sock = undef;
      }
   }
}
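
The comment in the dump branch above says it would be better to track
how long each request has taken and dump only the oldest ones.  Here is
an untested-against-the-network sketch of that bookkeeping.  The helper
names (note_start, stale_handles, forget) are hypothetical, and plain
strings stand in for socket handles so it runs without a resolver.

```perl
use strict;
use warnings;

# Start time of each pending lookup, keyed by the stringified handle.
my %started;

# Record when a lookup was sent.
sub note_start {
    my ($handle, $now) = @_;
    $started{"$handle"} = $now;
}

# Return the handles that have been pending longer than $max_age seconds.
# Handles we never recorded are treated as fresh and are not returned.
sub stale_handles {
    my ($max_age, $now, @handles) = @_;
    return grep { $now - ($started{"$_"} // $now) > $max_age } @handles;
}

# Drop the bookkeeping for handles that are finished or dumped.
sub forget {
    my (@handles) = @_;
    delete $started{"$_"} for @handles;
}

# Demonstration with made-up handles and clock values.
note_start('lookup-1', 100);
note_start('lookup-2', 108);
my @old = stale_handles(5, 110, 'lookup-1', 'lookup-2');
print "would dump: @old\n";
forget(@old);
```

In the program above this would mean calling note_start($sock, time)
on the handle returned by bgsend, and replacing the blanket
$sel->remove($sel->handles) with a remove of only
stale_handles($timeout, time, $sel->handles).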

Michael
--
Michael Darrin Chaney
mdchaney at michaelchaney.com
http://www.michaelchaney.com/