Spamassassin timeout results in MS 100% CPU and server lockup (correction)

Mon Jul 28 09:25:42 IST 2003

Are you running SA 2.55? Sounds like it's a Perl bug somewhere. I don't
think there's any MS code that could loop in this situation...
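
For anyone following the thread, here is a rough sketch of the general
"run a check with a timeout, kill it if it overruns, then reap it" pattern
being discussed. It is written in Python purely for illustration and is
not MailScanner's actual code (MailScanner is Perl); the function name
run_with_timeout and its parameters are made up for this example.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; if it exceeds timeout_s seconds, kill it and reap it.

    Returns (returncode, timed_out). Illustrative only: this is not how
    MailScanner calls SpamAssassin.
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout_s)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        proc.kill()   # forcibly stop the overrunning check
        proc.wait()   # reap it so no zombie is left behind
        return proc.returncode, True


if __name__ == "__main__":
    # A stand-in "slow scanner": a child that sleeps longer than the timeout.
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(slow, 1.0))
```

The point of the sketch is only that, after the kill, the caller should
return promptly to its normal work. If the caller instead keeps growing in
memory and CPU, the fault is likely somewhere after the kill.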

At 04:23 28/07/2003, you wrote:
>A possible correction. I may have been wrong in my conclusion that it was
>a group of emails that came in last night that caused this problem to manifest.
>After running through the logs, I was quite surprised at how many SA
>timeout assassinations had started taking place in the early hours of the
>morning, right about the time of the first crash I described. Tonight I
>finished sanitizing the in queue, and started everything up again...
>
>The problem came right back. I could not believe that it was more of the
>type of messages that came in last night, so perhaps my first conclusion
>was flawed. I started looking for what else might have changed on my
>server about the time all this started. I found it...
>
>My Bayes finally hit 200 hams, and started working at 1:56 AM this
>morning. Prior to that time, my auto-learn had not accumulated enough hams
>for Bayes to start functioning. So, I just disabled Bayes, and voilà! No
>more SA timeouts and deaths.
>
>So, it wasn't some mysterious evil messages that caused the timeouts, but
>Bayes kicking in.
>
>HOWEVER! Even with the cause of the timeouts figured out, I still have my
>PRIMARY concern: Why are MS children that encounter an SA timeout and
>death taking over my server with extremely high RAM and CPU usage that
>eventually crowds out every other process and crashes the server? THAT is my
>primary issue.
>
>All help and insight is sure appreciated!
>-Alan
>
>
>
> >Last night I got a number of emails into my MS/SA server that caused it
> >to crash.
> >
> >In my testing, here is what I discovered:
> >
> >  Everything runs fine up until the maillog reports 'spamassassin timed
> >  out and was killed'.
> >
> >  Using the PID in that error message, I check all the children of that
> >  parent process.
> >
> >  One of the children will suddenly start climbing in memory size above
> >  the typical 22M, and likewise will start climbing in CPU usage.
> >
> >  As memory and CPU continue to increase, all the other MS parents and
> >  their children get swapped out and go to sleep.
> >
> >  Before too long, the child causing the problem will reach around 380+M
> >  of memory and 99% CPU. No other MS instances are running at all.
> >
> >  Accordingly, the maillog no longer shows any processing of the
> >  incoming queue.
> >
> >  If left alone, the machine eventually comes to its knees, and even the
> >  NIC stops responding (can't ping the server) and the console is locked.
> >
> >  If the bad child is reniced to a negative value (everything else is at
> >  0), then everything else, including the other MS instances, starts back up.
> >
> >  If the bad child is killed, everything goes back to normal.
> >
> >  ...that is of course until another one of the bad messages is picked
> >  up for scanning.
> >
> >  My major concern here is not so much what was in the bad message that
> >  caused this, but more critically, why does the timeout killing of a
> >  spamassassin instance cause its calling MS child to go ape and eat the
> >  server?
> >
> >-Alan
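
The manual check Alan describes (take the PID from the log, list that
parent's children, look for one whose memory has climbed well past the
usual size) could be scripted along these lines. This is a minimal sketch,
assuming `ps -eo pid,ppid,rss,pcpu,comm` style output with RSS in KB; the
function name, the 100 MB threshold and the sample listing are all
hypothetical, and only the 22M and 380+M figures echo the report above.

```python
def find_runaway_children(ps_output, parent_pid, rss_limit_kb):
    """Return (pid, rss_kb, pcpu) for children of parent_pid over the limit.

    ps_output is text in the shape produced by
    `ps -eo pid,ppid,rss,pcpu,comm` (header line first, RSS in KB).
    """
    runaway = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header line
        fields = line.split(None, 4)
        if len(fields) < 5:
            continue  # ignore malformed lines
        pid, ppid, rss_kb, pcpu, _comm = fields
        if int(ppid) == parent_pid and int(rss_kb) > rss_limit_kb:
            runaway.append((int(pid), int(rss_kb), float(pcpu)))
    return runaway


# Hypothetical listing, made up for illustration (not taken from real logs).
SAMPLE_PS = """
  PID  PPID    RSS %CPU COMMAND
 1200     1  22528  0.1 MailScanner
 1201  1200  22528  0.3 MailScanner
 1203  1200 389120 99.0 MailScanner
 1300     1  22528  0.2 MailScanner
"""

if __name__ == "__main__":
    # Flag children of PID 1200 using more than roughly 100 MB (assumed limit).
    print(find_runaway_children(SAMPLE_PS, 1200, 100 * 1024))
```

In practice the text would come from running ps, and a flagged PID could
then be reniced or killed by hand as described above.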

--
Julian Field
www.MailScanner.info
Professional Support Services at www.MailScanner.biz
MailScanner thanks transtec Computers for their support