Spamassassin timeout results in MS 100% CPU and server lockup (correction)
Alan Fiebig
mailscanner at ELKNET.NET
Mon Jul 28 04:23:07 IST 2003
A possible correction: I may have been wrong to conclude that a group of emails that came in last night caused this problem.
After running through the logs, I was quite surprised at how many SA timeout assassinations had started taking place in the early hours of the morning, right about the time of the first crash I described. Tonight I finished sanitizing the incoming queue, and started everything up again...
The problem came right back. I could not believe that more messages of the type that came in last night had arrived, so perhaps my first conclusion was flawed. I started looking for what else might have changed on my server about the time all this started. I found it...
My Bayes finally hit 200 hams, and started working at 1:56 AM this morning. Prior to that time, my auto-learn had not accumulated enough hams for Bayes to start functioning. So, I just disabled Bayes, and voilà! No more SA timeouts and deaths.
So, it wasn't some mysterious evil messages that caused the timeouts, but Bayes kicking in.
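For anyone wanting to repeat the test, one way to turn Bayes off is SpamAssassin's `use_bayes` setting. This is only a sketch: the file it belongs in is an assumption (MailScanner installs often use spam.assassin.prefs.conf, but the location varies), and it is not necessarily the exact change made here.

```
# SpamAssassin preferences (file location is an assumption, see above)
# 0 disables the Bayesian classifier entirely; 1 re-enables it
use_bayes 0
```

MailScanner would need a restart or reload to pick up the change.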
HOWEVER! Even with the cause of the timeouts figured out, I still have my PRIMARY concern: why are MS children that encounter an SA timeout and death taking over my server with extremely high RAM and CPU usage that eventually crowds out every other process and crashes the server? THAT is my primary issue.
All help and insight is sure appreciated!
-Alan
>Last night I got a number of emails into my MS/SA server that caused it to crash.
>
>In my testing, here is what I discovered:
>
> Everything runs fine up until the maillog reports 'spamassassin timed out and was killed'.
>
> Using the PID in that error message, I check all the children of that parent process.
>
> One of the children will suddenly start climbing in memory size above the typical 22M, and likewise will start climbing in CPU usage.
>
> As memory and CPU continue to increase, all the other MS parents and their children get swapped out and go to sleep.
>
> Before too long, the child causing the problem will reach around 380+M of memory and 99% CPU. No other MS instances are running at all.
>
> Accordingly, the maillog no longer shows any processing of the incoming queue.
>
> If left alone, the machine eventually comes to its knees, and even the NIC stops responding (can't ping the server) and the console is locked.
>
> If the bad child is reniced to a negative value (everything else is at 0), then everything else, including the other MS instances, start back up.
>
> If the bad child is killed, everything goes back to normal.
>
>
> ...that is of course until another one of the bad messages is picked up for scanning.
>
>
> My major concern here is not so much what was in the bad message that caused this, but more critically, why does the timeout killing of a spamassassin instance cause its calling MS child to go ape and eat the server?
>
>-Alan
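
The manual check described in the quoted message (take the PID from the 'timed out and was killed' log line, list that parent's children, and look for one whose memory and CPU are climbing) could be sketched roughly as below. This is a hedged illustration, not a MailScanner tool: the names `Proc`, `parse_ps`, `runaway_children` and `snapshot` are hypothetical, the 100 MB and 90% limits are assumptions picked to sit between the typical 22M and the 380+M reported above, and it assumes a `ps` that accepts `-eo pid,ppid,rss,pcpu`.

```python
import subprocess
from typing import List, NamedTuple


class Proc(NamedTuple):
    pid: int
    ppid: int
    rss_kb: int      # resident memory in kilobytes, as printed by ps
    cpu_pct: float   # CPU percentage, as printed by ps


def parse_ps(text: str) -> List[Proc]:
    """Parse the output of `ps -eo pid,ppid,rss,pcpu`, header line included."""
    procs = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) != 4:
            continue  # ignore anything that does not look like a data row
        pid, ppid, rss, cpu = fields
        procs.append(Proc(int(pid), int(ppid), int(rss), float(cpu)))
    return procs


def runaway_children(procs: List[Proc], parent_pid: int,
                     rss_limit_kb: int = 100_000,
                     cpu_limit: float = 90.0) -> List[Proc]:
    """Return children of parent_pid that exceed either limit (assumed values)."""
    return [p for p in procs
            if p.ppid == parent_pid
            and (p.rss_kb > rss_limit_kb or p.cpu_pct > cpu_limit)]


def snapshot() -> List[Proc]:
    """Take a live process listing by running ps."""
    out = subprocess.run(["ps", "-eo", "pid,ppid,rss,pcpu"],
                         capture_output=True, text=True, check=True).stdout
    return parse_ps(out)
```

With the parent PID from the maillog in hand, `runaway_children(snapshot(), parent_pid)` would list the candidates; whether to renice or kill them is left as a manual step, since this sketch only reports and does not explain why the child runs away in the first place.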