Spamassassin timeout results in MS 100% CPU and server lockup (correction)

Alan Fiebig mailscanner at ELKNET.NET
Sat Aug 2 03:33:45 IST 2003


Julian,

I replied to this earlier this week (the answer to your question was version 2.60-cvs). Are we any closer to an answer? I'd like to turn Bayes back on...

Remember, I'd be happy to temporarily re-enable Bayes and gather any stats/info you might wish to see.

-Alan
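
As a rough illustration of the kind of stats that could be gathered from the maillog, here is a minimal Python sketch that counts SA timeout events per hour and per reporting PID. The message text 'spamassassin timed out and was killed' is taken from the report quoted below; the syslog line layout ("Mon DD HH:MM:SS host program[pid]: message") and the function name `count_timeouts` are assumptions for illustration, not anything MailScanner provides.

```python
import re
from collections import Counter

# Message text as quoted in the report; matched case-insensitively.
TIMEOUT_RE = re.compile(r"spamassassin timed out and was killed", re.IGNORECASE)

# Assumed syslog layout: "Mon DD HH:MM:SS host program[pid]: message"
PREFIX_RE = re.compile(
    r"^(\w{3}\s+\d+)\s+(\d{2}):\d{2}:\d{2}\s+\S+\s+\S+?\[(\d+)\]:"
)


def count_timeouts(lines):
    """Return (timeouts per hour, timeouts per logging PID) as Counters."""
    per_hour = Counter()
    per_pid = Counter()
    for line in lines:
        if not TIMEOUT_RE.search(line):
            continue
        m = PREFIX_RE.match(line)
        if not m:
            continue  # line does not follow the assumed layout; skip it
        day, hour, pid = m.groups()
        per_hour[f"{day} {hour}:00"] += 1
        per_pid[int(pid)] += 1
    return per_hour, per_pid
```

It could be fed an open log file, e.g. `count_timeouts(open(path))`, where `path` is wherever the mail log lives on the system in question.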

>Are you running SA 2.55? Sounds like it's a Perl bug somewhere. I don't
>think there's any MS code that could loop in this situation...
>
>At 04:23 28/07/2003, you wrote:
>>A possible correction. I may have been wrong in my conclusion that it was
>>a group of emails that came in last night that caused this problem to manifest.
>>After running through the logs, I was quite surprised at how many SA
>>timeout assassinations had started taking place in the early hours of the
>>morning, right about the time of the first crash I described. Tonight I
>>finished sanitizing the in queue, and started everything up again...
>>
>>The problem came right back. I could not believe that it was more of the
>>type of messages that came in last night, so perhaps my first conclusion
>>was flawed. I started looking for what else might have changed on my
>>server about the time all this started. I found it...
>>
>>My Bayes finally hit 200 hams, and started working at 1:56 AM this
>>morning. Prior to that time, my auto-learn had not accumulated enough hams
>>for Bayes to start functioning. So, I just disabled Bayes, and voilà! No
>>more SA timeouts and deaths.
>>
>>So, it wasn't some mysterious evil messages that caused the timeouts, but
>>Bayes kicking in.
>>
>>HOWEVER! Even with the cause of the timeouts figured out, I still have my
>>PRIMARY concern: Why are MS children that encounter an SA timeout and
>>death taking over my server with extremely high RAM and CPU usage that
>>eventually crowd out every other process and crash the server? THAT is my
>>primary issue.
>>
>>All help and insight is sure appreciated!
>>-Alan
>>
>>
>>
>> >Last night I got a number of emails into my MS/SA server that caused it to crash.
>> >
>> >In my testing, here is what I discovered:
>> >
>> >  Everything runs fine up until the maillog reports 'spamassassin timed out and was killed'.
>> >
>> >  Using the PID in that error message, I check all the children of that parent process.
>> >
>> >  One of the children will suddenly start climbing in memory size above the typical 22M, and likewise will start climbing in CPU usage.
>> >
>> >  As memory and CPU continue to increase, all the other MS parents and their children get swapped out and go to sleep.
>> >
>> >  Before too long, the child causing the problem will reach around 380+M of memory and 99% CPU. No other MS instances are running at all.
>> >
>> >  Accordingly, the maillog no longer shows any processing of the incoming queue.
>> >
>> >  If left alone, the machine eventually comes to its knees, and even the NIC stops responding (can't ping the server) and the console is locked.
>> >
>> >  If the bad child is reniced to a negative value (everything else is at 0), then everything else, including the other MS instances, starts back up.
>> >
>> >  If the bad child is killed, everything goes back to normal.
>> >
>> >
>> >  ...that is of course until another one of the bad messages is picked up for scanning.
>> >
>> >
>> >  My major concern here is not so much what was in the bad message that caused this, but more critically, why does the timeout killing of a spamassassin instance cause its calling MS child to go ape and eat the server?
>> >
>> >-Alan
>
>--
>Julian Field
>www.MailScanner.info
>Professional Support Services at www.MailScanner.biz
>MailScanner thanks transtec Computers for their support
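
The manual check described in the original report (take the PID from the timeout message, list that parent's children, and look for one whose memory is climbing well past the typical 22M) can be sketched in Python as follows. This is only an illustration under stated assumptions: it parses the output of `ps -eo pid,ppid,rss,pcpu,comm`, the 100 MB threshold is an arbitrary value chosen between the 22M and 380M figures in the report, and the function names are hypothetical.

```python
import subprocess


def parse_ps(text):
    """Parse `ps -eo pid,ppid,rss,pcpu,comm` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # first line is the header
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue
        pid, ppid, rss, pcpu, comm = parts
        rows.append({
            "pid": int(pid),
            "ppid": int(ppid),
            "rss_kb": int(rss),    # resident set size in kilobytes
            "pcpu": float(pcpu),   # CPU usage in percent
            "comm": comm,
        })
    return rows


def suspect_children(rows, parent_pid, rss_limit_kb=100 * 1024):
    """Children of parent_pid whose resident memory exceeds the limit.

    The default limit (100 MB) is an assumed threshold, not a MailScanner
    setting.
    """
    return [
        r for r in rows
        if r["ppid"] == parent_pid and r["rss_kb"] > rss_limit_kb
    ]


def snapshot():
    """Take a live process snapshot by running ps."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,rss,pcpu,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps(out)
```

Calling `suspect_children(snapshot(), parent_pid)` with the PID from the maillog entry would list any runaway child; what to do with it (renice or kill, as in the report) is left to the operator.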


