Upgraded to 4.67.6, MailScanner scans a batch then hangs at 100 percent CPU
Steve Crumley
scrumley at secure-enterprise.com
Thu Mar 13 04:19:16 GMT 2008
> -----Original Message-----
> From: mailscanner-bounces at lists.mailscanner.info
> [mailto:mailscanner-bounces at lists.mailscanner.info] On Behalf
> Of Julian Field
> Sent: Wednesday, March 12, 2008 5:51 PM
> To: MailScanner discussion
> Subject: Re: Upgraded to 4.67.6, MailScanner scans a batch
> then hangs at 100 percent CPU
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
>
> Steve Crumley wrote:
> >
> >
> >
> >> -----Original Message-----
> >> From: mailscanner-bounces at lists.mailscanner.info
> >> [mailto:mailscanner-bounces at lists.mailscanner.info] On Behalf
> >> Of Julian Field
> >> Sent: Tuesday, March 11, 2008 6:50 PM
> >> To: MailScanner discussion
> >> Subject: Re: Upgraded to 4.67.6, MailScanner scans a batch
> >> then hangs at 100 percent CPU
> >>
> >> * PGP Signed by an unverified key: 03/11/08 at 18:50:26
> >>
> >>
> >>
> >> Steve Crumley wrote:
> >>
> >>>
> >>>
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: mailscanner-bounces at lists.mailscanner.info
> >>>> [mailto:mailscanner-bounces at lists.mailscanner.info] On Behalf
> >>>> Of Glenn Steen
> >>>> Sent: Tuesday, March 11, 2008 4:32 PM
> >>>> To: MailScanner discussion
> >>>> Subject: Re: Upgraded to 4.67.6,MailScanner scans a batch
> >>>> then hangs at 100 percent CPU
> >>>>
> >>>> On 11/03/2008, Steve Crumley <scrumley at secure-enterprise.com> wrote:
> >>>>
> >>>>> > -----Original Message-----
> >>>>> > From: mailscanner-bounces at lists.mailscanner.info
> >>>>> > [mailto:mailscanner-bounces at lists.mailscanner.info] On Behalf
> >>>>> > Of Glenn Steen
> >>>>> > Sent: Tuesday, March 11, 2008 1:21 PM
> >>>>> > To: MailScanner discussion
> >>>>> > Subject: Re: Upgraded to 4.67.6,MailScanner scans a batch
> >>>>> > then hangs at 100 percent CPU
> >>>>> >
> >>>>> > On 11/03/2008, Steve Crumley <scrumley at secure-enterprise.com> wrote:
> >>>>> > >
> >>>>> > >
> >>>>> > > > -----Original Message-----
> >>>>> > > > From: mailscanner-bounces at lists.mailscanner.info
> >>>>> > > > [mailto:mailscanner-bounces at lists.mailscanner.info] On Behalf
> >>>>> > > > Of --[ UxBoD ]--
> >>>>> > >
> >>>>> > > > Sent: Tuesday, March 11, 2008 11:29 AM
> >>>>> > > > To: MailScanner discussion
> >>>>> > > > Subject: Re: Upgraded to 4.67.6, MailScanner scans a batch
> >>>>> > > > then hangs at 100 percent CPU
> >>>>> > > >
> >>>>> > >
> >>>>> > > > Do you have strace installed on the server? If so, when the
> >>>>> > > > process is running at 100% CPU, connect to it and see what it
> >>>>> > > > is doing. I had this before, but for the life of me I cannot
> >>>>> > > > remember what I changed to fix it :(
> >>>>> > > >
> >>>>> > > > Things to check :-
> >>>>> > > >
> >>>>> > > > 1) Permissions, are they all correct
> >>>>> > > > 2) Check MailScanner.conf again just to make sure no typos
> >>>>> > > >
> >>>>> > > > Regards,
> >>>>> > > >
> >>>>> > > > --
> >>>>> > >
> >>>>> > >
> >>>>> > > Here is the output from strace:
> >>>>> > >
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > > waitpid(-1, 0xbff09448, WNOHANG) = 0
> >>>>> > >
> >>>>> > >
> >>>>> > >
> >>>>> > >
> >>>>> > > The system had been running fine for over a year. I can't find
> >>>>> > > any permission or setting change that's doing this, but I could
> >>>>> > > be overlooking something.
> >>>>> > > Thanks,
> >>>>> > > -Steve
> >>>>> > >
> >>>>> > Could perhaps be a busted SQLite SA cache? What does analyse_s<TAB>
> >>>>> > (I don't remember if it is sacache or spamassassin_cache ... the
> >>>>> > command completion should take care of it:-) say? If it looks
> >>>>> > fishy, simply delete the SA cache file and restart MS.
> >>>>> >
> >>>>> > You've run MailScanner --lint, right? Nothing obvious from that?
> >>>>> >
> >>>>> > Oh, and what AV scanners do you use? Obviously not clamavmodule,
> >>>>> > but perhaps clamav or clamd? Are those OK?
> >>>>> >
> >>>>> > Cheers
> >>>>> > --
> >>>>> > -- Glenn
> >>>>> > email: glenn < dot > steen < at > gmail < dot > com
> >>>>> > work: glenn < dot > steen < at > ap1 < dot > se
> >>>>>
> >>>>>
> >>>>>
> >>>>>> --
> >>>>>>
> >>>>>>
> >>>>> > MailScanner mailing list
> >>>>> > mailscanner at lists.mailscanner.info
> >>>>> > http://lists.mailscanner.info/mailman/listinfo/mailscanner
> >>>>> >
> >>>>> > Before posting, read http://wiki.mailscanner.info/posting
> >>>>> >
> >>>>> > Support MailScanner development - buy the book off the website!
> >>>>> >
> >>>>>
> >>>>>
> >>>>>
> >>>>> analyse_SpamAssassin_cache looks clean, MailScanner --lint is
> >>>>> clean too.
> >>>>> I'm running clamd for AV but I've set virus scanning to no while
> >>>>> working on this.
> >>>>>
> >>>>> Thanks,
> >>>>> -Steve
> >>>>>
> >>>>>
> >>>> Couldn't be something easily mended, huh:-)....
> >>>>
> >>>> What you seem to have attached to above (with strace) would be the
> >>>> main MailScanner process, since it basically just waits for its
> >>>> children to end... Or is it? What does a ps listing show (one that
> >>>> shows the command argument list, since Jules rewrites it to show
> >>>> what it thinks it is basically doing)?
> >>>> Do the children restart endlessly when hung? How many children are
> >>>> there, and in what state?
> >>>> Cheers
> >>>> -- Glenn
> >>>>
> >>>>
> >>>
> >>> When I first started it with 8 children, they all ended up quickly
> >>> hanging and consuming CPU. For now, I've set it to 1 child and I've
> >>> been running in debug mode. The ps gives us a good clue! It's the
> >>> only MailScanner process and it reports "MailScanner: extracting
> >>> attachments"
> >>> Thanks,
> >>> -Steve
> >>>
> >>>
> >> In which case go into "sub Explode" in
> >> /usr/lib/MailScanner/MailScanner/Message.pm, and add some
> >> "print STDERR" lines to generate tracing output so you can see how
> >> far it gets. When you do a "MailScanner --debug" it will show you
> >> the STDERR debug output in the terminal session.
> >>
> >
> >
> > OK, here is what's happening. It's using Explode in MessageBatch.pm
> > and not Message.pm.
> > Here is where it dies in MessageBatch.pm:
> >
> > sub Explode {
> > my $this = shift;
> > print STDERR "messagebatch\n"; #crumley
> >
> > my($key, $message);
> >
> > # jjh 2004-03-12 reap as many as we can.
> > # JKF Test 2004-11-23 1 until waitpid(-1, &POSIX::WNOHANG) == -1;
> > print STDERR "about to hang\n";
> > 1 until waitpid(-1, WNOHANG) == -1;
> > print STDERR "we never get here\n";
> >
> But as the comments in the code show, this code hasn't been touched
> since 2004. So I don't understand why you are only now seeing a change
> in behaviour. I would suspect you have upgraded something else in your
> system.
>
> Are other people seeing the same problem?
> What OS, distro, version, kernel, etc are you running?
> Is anyone else running an identical system?
> If so, are they seeing the same symptoms?
>
> From the "perlfunc" man page:
>     waitpid PID,FLAGS
>         Waits for a particular child process to terminate and returns
>         the pid of the deceased process, or "-1" if there is no such
>         child process.
> so it should reap processes until there aren't any left to be reaped.
> What does the documentation for waitpid say on your system? This is a
> POSIX function, so should be the same across most systems.
>
> If you take out the waitpid() call, you will collect <defunct>
> processes, as they are terminating but never being reaped. So this
> call is very necessary.
>
> I'm not going to touch this code with a 10-foot barge pole unless I
> have *very* good reason to.
>
> Jules
>
> - --
> Julian Field MEng CITP CEng

Julian, I really appreciate you looking at this. I understand this code
hasn't changed and I'm certainly not suggesting you change it now. I'm
just trying to track this down. I'm running a pretty standard CentOS
4.6 system plus the rpmforge repositories, so I'm guessing someone else
may run into this as well. I think you are probably right: something
else on the system may be involved. Everything is up to date with a
"yum upgrade". I just don't have a clue as to what could be causing
this.
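
For anyone following along, here is a small standalone sketch (not
MailScanner code) of how waitpid behaves with WNOHANG. As far as I can
tell from perlfunc, a non-blocking waitpid returns 0 while child
processes exist but none has terminated, and only returns -1 once there
are no children left. That would match the "= 0" lines in the strace
output, and it suggests (this is only a guess) that some child of the
MailScanner process is still alive and never exits, so the "1 until"
loop keeps polling. The helper name poll_children and the sleep times
are made up for the illustration.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# Hypothetical helper: poll once and describe what waitpid returned.
sub poll_children {
    my $pid = waitpid(-1, WNOHANG);
    return $pid == -1 ? 'no children'
         : $pid ==  0 ? 'children still running'
         :              "reaped $pid";
}

my $child = fork();
die "fork failed: $!" unless defined $child;
if ($child == 0) {      # child: stay alive for a moment, then exit
    sleep 2;
    exit 0;
}

# The child is alive, so a non-blocking waitpid returns 0, not -1.
my $first = poll_children();
print "while child runs: $first\n";

# Same condition as "1 until waitpid(-1, WNOHANG) == -1", but limited
# to about one second so the sketch terminates. Each pass is one
# waitpid system call that returns immediately: a busy loop.
my $spins    = 0;
my $deadline = time + 1;
$spins++ while time < $deadline && waitpid(-1, WNOHANG) == 0;
print "polls in about one second: $spins\n";

# A blocking wait reaps the child; only then does waitpid report -1.
waitpid($child, 0);
my $last = poll_children();
print "after child exits: $last\n";
```

If that guess is right, a "ps" listing that shows parent PIDs (for
example "ps -ef") taken while the process is hung should show which
child is still running underneath it.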
Thanks,
-Steve