SpamAssassin timeout results in MS 100% CPU and server lockup

Alan Fiebig mailscanner at ELKNET.NET
Mon Jul 28 03:03:55 IST 2003


Last night a number of emails arrived at my MS/SA server that caused it to crash.

In my testing, here is what I discovered:

  Everything runs fine up until the maillog reports 'spamassassin timed out and was killed'.

  Using the PID in that error message, I check all the children of that parent process.

  One of the children will suddenly start climbing in memory size above the typical 22M, and likewise will start climbing in CPU usage.

  As memory and CPU continue to increase, all the other MS parents and their children get swapped out and go to sleep.

  Before too long, the child causing the problem will reach around 380+M of memory and 99% CPU. No other MS instances are running at all.

  Accordingly, the maillog no longer shows any processing of the incoming queue.

  If left alone, the machine is eventually brought to its knees: even the NIC stops responding (can't ping the server) and the console locks up.

  If the bad child is reniced to a negative value (everything else is at 0), then everything else, including the other MS instances, starts back up.

  If the bad child is killed, everything goes back to normal.


  ...that is of course until another one of the bad messages is picked up for scanning.
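
  The manual check described above (take the PID from the log line, list that parent's children, and look for the one whose memory and CPU are climbing) can be sketched as a small script. This is only an illustration, not part of MailScanner: it assumes a `ps` that accepts `-eo pid,ppid,rss,pcpu,comm`, it uses resident memory (RSS) as the "memory size", and the function names and the 100 MB / 90% thresholds are hypothetical values picked to sit between the typical 22M and the runaway 380+M seen here.

```python
import subprocess


def snapshot():
    """Return raw process-table text (assumes a ps that supports -eo)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,rss,pcpu,comm"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_ps(text):
    """Parse `ps -eo pid,ppid,rss,pcpu,comm` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue  # ignore malformed or truncated lines
        pid, ppid, rss_kb, pcpu, comm = parts
        rows.append({
            "pid": int(pid),
            "ppid": int(ppid),
            "rss_mb": int(rss_kb) / 1024,  # ps reports RSS in kilobytes
            "cpu": float(pcpu),
            "comm": comm,
        })
    return rows


def runaway_children(rows, parent_pid, rss_limit_mb=100.0, cpu_limit=90.0):
    """Return children of parent_pid that exceed either (hypothetical) limit."""
    return [
        r for r in rows
        if r["ppid"] == parent_pid
        and (r["rss_mb"] > rss_limit_mb or r["cpu"] > cpu_limit)
    ]
```

  Calling `runaway_children(parse_ps(snapshot()), parent_pid)` with the PID from the 'spamassassin timed out and was killed' log line would list any child that has grown past the limits, which is the process to renice or kill.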


  My major concern here is not so much what was in the bad message that caused this, but more critically: why does the timeout kill of a SpamAssassin instance cause its calling MS child to go ape and eat the server?

-Alan


