It seems that viruses CAN slip through MailScanner under high load!

Brian Hoy brian.hoy at OPUS.CO.NZ
Fri Aug 29 03:16:47 IST 2003


Hi all,

Thanks to everyone for their comments and advice.  It is very much
appreciated.  And especially to Julian for finding and fixing the problem so
quickly!

Our sendmail config does have the load settings configured that many of you
mentioned, but still the mail was flowing in!  The input queue was growing
faster than MailScanner could scan it, and the problem just kept compounding.

The reason is that the "load average" stats are not always a good measure of
the real stress that the machine is under.  If a machine is heavily using
swap space, then the disks and motherboard I/O bandwidth are being consumed
(and CPU also if the disks are ATA, rather than SCSI), yet no useful work is
being done.

If a process is waiting on a page fault, I do not think that it is placed in
the OS's run queue until the page is loaded (and another page swapped out -
still more disk I/O!).  If this is true then the load average does not
increase, yet the machine is clearly starting to struggle with the load.
This is what happened to us the other day.

If you want to experiment with this idea, compile this C program:

// Compile with gcc -o vm_tester vm_tester.c
//
#include <stdio.h>
#include <stdlib.h>   // malloc() and exit()

#define NUM_PASSES 10
#define MB_TO_ALLOC 128
#define BYTES_TO_ALLOC (MB_TO_ALLOC * 1024*1024)

int main(void)
{
  char *mem;
  int pass, r, c;

  if ((mem = (char *) malloc(BYTES_TO_ALLOC)) == NULL)
  {
    fprintf(stderr, "malloc() failed\n");
    exit(EXIT_FAILURE);
  }

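  for (pass=0; pass<NUM_PASSES; pass++)
  {
    // c is the byte offset within a page (assuming 4096-byte pages)
    for (c=0; c<4096; c++)
    {
      // r is the page index, so consecutive accesses are a full page apart
      for (r=0; r<BYTES_TO_ALLOC/4096; r++)
      {
        mem[r*4096 + c]++;
      }
    }
  }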

  return 0;
}

// -----------------------------------------------

It allocates 128M of RAM, and increments bytes in a way that generates as
many page faults as possible.  As an initial suggestion, run as many of
these programs as needed to consume all your RAM and watch your other
processes struggle to get a slice of the CPU.  BTW, don't do this on a
production server, or try to consume more memory than your total VM - you
have been warned!

Use top and vmstat to watch things.  If you start running more of these
programs, then you find that the load average does not increase that much,
but your disks are flat out, and machine responsiveness goes right out the
window (esp on ATA disks).
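
If you would rather log the swap activity than eyeball it, here is a
minimal sketch that pulls a single counter out of a "name value" file.
It assumes a Linux box, where /proc/vmstat has pswpin and pswpout lines
(pages swapped in and out since boot).  The function name read_counter
is just something I made up for the example.

```c
#include <stdio.h>
#include <string.h>

/* Return the value stored under `key` in a file made of "name value"
   lines (such as Linux's /proc/vmstat), or -1 if the key is absent.
   Parsing stops at the first line that does not match that shape. */
long long read_counter(FILE *fp, const char *key)
{
  char name[64];
  long long value;

  rewind(fp);   /* start from the top so the same FILE can be reused */
  while (fscanf(fp, "%63s %lld", name, &value) == 2)
  {
    if (strcmp(name, key) == 0)
      return value;
  }
  return -1;
}
```

Open /proc/vmstat with fopen(), call read_counter(fp, "pswpin") and
read_counter(fp, "pswpout") a few seconds apart, and compare the
differences with the load average that top reports.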

I still think my suggestion (in my first post) for an "unfair" way of
selecting messages for scanning under "high load" has merit.  When our mail
gateway was stressed out the other day, I was using strace to monitor the
system calls in the MailScanner processes, and they were spending 5-30 minutes
just doing the stat() calls before locking messages for scanning.

When your machine is really overloaded, let's do anything to concentrate the
meagre available resources on clearing the queue in the most expedient fashion.

Perhaps "high load" can be determined by the length of the input queue
(rather than the misleading system load average), and be user configurable.

For example, if the input queue has in excess of 1000 messages waiting, peel
off any 30 for scanning.  Ensure that no other MailScanner process evaluates
the length of the queue until a user configurable time has passed (15
mins?).  I know this is easier said than done, but I think it really would
help when the machine is steaming up shit creek.
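
To make the idea concrete, here is a rough sketch in C.  It is not how
MailScanner works (MailScanner is Perl, and I have not looked at how it
would fit in); it only shows that the queue length can be measured, and
a batch grabbed, with readdir() alone and no stat() calls.  The names
count_queue and take_any, and the two #defines, are made up for this
example, and it assumes sendmail-style "qf" control files.  The "do not
re-evaluate for 15 minutes" part is left out.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define HIGH_LOAD_THRESHOLD 1000  /* example value, would be configurable */
#define BATCH_SIZE 30             /* example value, would be configurable */

/* Count the "qf" files in dir using only readdir(); -1 on error. */
int count_queue(const char *dir)
{
  DIR *d = opendir(dir);
  struct dirent *e;
  int n = 0;

  if (d == NULL)
    return -1;
  while ((e = readdir(d)) != NULL)
  {
    if (strncmp(e->d_name, "qf", 2) == 0)
      n++;
  }
  closedir(d);
  return n;
}

/* "Unfair" selection: copy the first `max` qf names in whatever order
   the directory returns them.  Returns how many were copied, -1 on error. */
int take_any(const char *dir, char names[][256], int max)
{
  DIR *d = opendir(dir);
  struct dirent *e;
  int n = 0;

  if (d == NULL)
    return -1;
  while (n < max && (e = readdir(d)) != NULL)
  {
    if (strncmp(e->d_name, "qf", 2) == 0)
    {
      snprintf(names[n], 256, "%s", e->d_name);
      n++;
    }
  }
  closedir(d);
  return n;
}
```

A scanner would call count_queue() on the input queue and, if the result
is above HIGH_LOAD_THRESHOLD, call take_any() with BATCH_SIZE instead of
its normal oldest-first selection.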

Another thought... Sendmail names all its df and qf files such that an
alphabetical listing is sorted by ascending time order too!  If the other
MTAs are the same, then perhaps this fact could be used to remove all the
stat()s and still meet the fairness algorithm?

Comments anyone?

Regards,
Brian


