Upgraded to 4.67.6, MailScanner scans a batch then hangs at 100 percent CPU

Steve Crumley scrumley at secure-enterprise.com
Thu Mar 13 04:19:16 GMT 2008


 

> -----Original Message-----
> From: mailscanner-bounces at lists.mailscanner.info 
> [mailto:mailscanner-bounces at lists.mailscanner.info] On Behalf 
> Of Julian Field
> Sent: Wednesday, March 12, 2008 5:51 PM
> To: MailScanner discussion
> Subject: Re: Upgraded to 4.67.6, MailScanner scans a batch 
> then hangs at 100 percent CPU
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> 
> 
> Steve Crumley wrote:
> >  
> >
> >   
> >> -----Original Message-----
> >> From: mailscanner-bounces at lists.mailscanner.info 
> >> [mailto:mailscanner-bounces at lists.mailscanner.info] On Behalf 
> >> Of Julian Field
> >> Sent: Tuesday, March 11, 2008 6:50 PM
> >> To: MailScanner discussion
> >> Subject: Re: Upgraded to 4.67.6, MailScanner scans a batch 
> >> then hangs at 100 percent CPU
> >>
> >> * PGP Signed by an unverified key: 03/11/08 at 18:50:26
> >>
> >>
> >>
> >> Steve Crumley wrote:
> >>     
> >>>  
> >>>
> >>>   
> >>>       
> >>>> -----Original Message-----
> >>>> From: mailscanner-bounces at lists.mailscanner.info 
> >>>> [mailto:mailscanner-bounces at lists.mailscanner.info] On Behalf 
> >>>> Of Glenn Steen
> >>>> Sent: Tuesday, March 11, 2008 4:32 PM
> >>>> To: MailScanner discussion
> >>>> Subject: Re: Upgraded to 4.67.6,MailScanner scans a batch 
> >>>> then hangs at 100 percent CPU
> >>>>
> >>>> On 11/03/2008, Steve Crumley <scrumley at secure-enterprise.com> wrote:
> >>>>>  > -----Original Message-----
> >>>>>  > From: mailscanner-bounces at lists.mailscanner.info
> >>>>>  > [mailto:mailscanner-bounces at lists.mailscanner.info] On Behalf
> >>>>>  > Of Glenn Steen
> >>>>>  > Sent: Tuesday, March 11, 2008 1:21 PM
> >>>>>  > To: MailScanner discussion
> >>>>>  > Subject: Re: Upgraded to 4.67.6,MailScanner scans a batch
> >>>>>  > then hangs at 100 percent CPU
> >>>>>  >
> >>>>>  > On 11/03/2008, Steve Crumley <scrumley at secure-enterprise.com> wrote:
> >>>>>  > >
> >>>>>  > >  > -----Original Message-----
> >>>>>  > >  > From: mailscanner-bounces at lists.mailscanner.info
> >>>>>  > >  > [mailto:mailscanner-bounces at lists.mailscanner.info] On Behalf
> >>>>>  > >  > Of --[ UxBoD ]--
> >>>>>  > >  > Sent: Tuesday, March 11, 2008 11:29 AM
> >>>>>  > >  > To: MailScanner discussion
> >>>>>  > >  > Subject: Re: Upgraded to 4.67.6, MailScanner scans a batch
> >>>>>  > >  > then hangs at 100 percent CPU
> >>>>>  > >  >
> >>>>>  > >
> >>>>>  > >  > Do you have strace installed on the server? If so, when the
> >>>>>  > >  > process is running at 100% CPU connect to it and see what it
> >>>>>  > >  > is doing.  I had this before, but for the life of me I cannot
> >>>>>  > >  > remember what I changed to fix it :(
> >>>>>  > >  >
> >>>>>  > >  > Things to check :-
> >>>>>  > >  >
> >>>>>  > >  > 1) Permissions, are they all correct
> >>>>>  > >  > 2) Check MailScanner.conf again just to make sure no typos
> >>>>>  > >  >
> >>>>>  > >  > Regards,
> >>>>>  > >  >
> >>>>>  > >  > --
> >>>>>  > >
> >>>>>  > >
> >>>>>  > > Here is the output from strace:
> >>>>>  > >
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >  waitpid(-1, 0xbff09448, WNOHANG)        = 0
> >>>>>  > >
> >>>>>  > >
> >>>>>  > >
> >>>>>  > >
> >>>>>  > >  The system had been running fine for over a year.  I can't find
> >>>>>  > >  any permission or setting change that's doing this, but I could
> >>>>>  > >  be overlooking something.
> >>>>>  > >  Thanks,
> >>>>>  > >  -Steve
> >>>>>  > >
> >>>>>  > Could perhaps be a busted SQLite SA cache? What does analyse_s<TAB>
> >>>>>  > (I don't remember if it is sacache or spamassassin_cache ... the
> >>>>>  > command completion should take care of it:-) say? If it looks
> >>>>>  > fishy, simply delete the SA cache file and restart MS.
> >>>>>  >
> >>>>>  > You've run MailScanner --lint, right? Nothing obvious from that?
> >>>>>  >
> >>>>>  > Oh, and what av scanners do you use? Obviously not clamavmodule,
> >>>>>  > but perhaps clamav or clamd? Are those OK?
> >>>>>  >
> >>>>>  > Cheers
> >>>>>  > --
> >>>>>  > -- Glenn
> >>>>>  > email: glenn < dot > steen < at > gmail < dot > com
> >>>>>  > work: glenn < dot > steen < at > ap1 < dot > se
> >>>>>
> >>>>>
> >>>>>
> >>>>> analyse_SpamAssassin_cache looks clean, MailScanner --lint is clean
> >>>>> too.  I'm running clamd for AV but I've set virus scanning to no
> >>>>> while working on this.
> >>>>>
> >>>>> Thanks,
> >>>>>  -Steve
> >>>>>       
> >>>>>           
> >>>> Couldn't be something easily mended, huh:-)....
> >>>>
> >>>> What you seem to have attached to above (with strace) would be the
> >>>> main MailScanner process, since it basically just waits for its
> >>>> children to end... Or is it? What does a ps listing show (one that
> >>>> shows the command argument list, since Jules rewrites it to show
> >>>> what it thinks it is basically doing)?
> >>>> Do the children restart endlessly when hung? How many children are
> >>>> there, and in what state?
> >>>> Cheers
> >>>> -- Glenn
> >>>>     
> >>>>         
> >>>
> >>> When I first started it with 8 children, they all ended up quickly
> >>> hanging and consuming CPU.  For now, I've set it to 1 child and I've
> >>> been running in debug mode.  The ps gives us a good clue!  It's the
> >>> only MailScanner process and it reports "MailScanner: extracting
> >>> attachments"
> >>> Thanks,
> >>> -Steve
> >>>   
> >>>       
> >> In which case go into "sub Explode" in
> >> /usr/lib/MailScanner/MailScanner/Message.pm, and add some
> >> "print STDERR" lines to generate tracing output so you can see how
> >> far it gets. When you do a "MailScanner --debug" it will show you
> >> the STDERR debug output in the terminal session.
> >>     
> >
> >
> > OK, here is what's happening.  It's using Explode in MessageBatch.pm
> > and not Message.pm.
> > Here is where it hangs in MessageBatch.pm:
> >
> > sub Explode {
> >   my $this = shift;
> >   print STDERR "messagebatch\n";  #crumley
> >
> >   my($key, $message);
> >
> >   # jjh 2004-03-12 reap as many as we can.
> >   # JKF Test 2004-11-23 1 until waitpid(-1, &POSIX::WNOHANG) == -1;
> >   print STDERR "about to hang\n";  
> >   1 until waitpid(-1, WNOHANG) == -1;
> >   print STDERR "we never get here\n";  
> >   
> But as the comments in the code show, this code hasn't been touched 
> since 2004. So I don't understand why you are just seeing a change in 
> behaviour. I would suspect you have upgraded something else 
> in your system.
> 
> Are other people seeing the same problem?
> What OS, distro, version, kernel, etc are you running?
> Is anyone else running an identical system?
> If so, are they seeing the same symptoms?
> 
> From the "perlfunc" man page:
>        waitpid PID,FLAGS
>                Waits for a particular child process to terminate and
>                returns the pid of the deceased process, or "-1" if
>                there is no such child process.
> so it should reap processes until there aren't any left to be reaped. 
> What does the documentation for waitpid say on your system? This is a 
> POSIX function, so should be the same across most systems.
> 
> If you take out the waitpid() call, you will collect <defunct>
> processes, as they are terminating but never being reaped. So this
> call is very necessary.
> 
> I'm not going to touch this code with a 10-foot barge pole unless I
> have *very* good reason to.
> 
> Jules
> 
> - -- 
> Julian Field MEng CITP CEng

Julian, I really appreciate you looking at this.  I understand this code
hasn't changed and I'm certainly not suggesting you change it now.  I'm
just trying to track this down.  I'm running a pretty standard CentOS
4.6 system plus the rpmforge repositories so I'm guessing someone else
may run into this as well.  I think you are probably right, something
else on the system may be involved.  Everything is up to date with a
"yum upgrade".  I just don't have a clue as to what could be causing
this.
Thanks,
-Steve
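
P.S. One detail from the strace output quoted above may help narrow this
down: every waitpid(-1, ..., WNOHANG) call returns 0.  As far as I can
tell from perlfunc, a non-blocking waitpid can return 0 when there are
child processes but none of them has terminated yet, and it only returns
-1 once there are no children left.  If that is right, the "1 until
waitpid(-1, WNOHANG) == -1" loop would spin for as long as some child of
that process stays alive.  Which child that would be on this system is
not something I know yet.  Below is a minimal standalone sketch of the
three return values.  It is not MailScanner code, and the sleeping child
is only a hypothetical stand-in for a child that has not exited:

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# Hypothetical stand-in: fork one child that stays alive for 2 seconds.
my $pid = fork();
die "fork failed: $!" unless defined $pid;
if ($pid == 0) {
    sleep 2;
    exit 0;
}

# Child exists but has not exited: the non-blocking waitpid returns 0,
# not -1, so an "until ... == -1" loop would keep spinning here.
my $while_running = waitpid(-1, WNOHANG);
print "child still running: waitpid returned $while_running\n";

# Blocking wait on that specific child: returns its pid once it exits.
my $reaped = waitpid($pid, 0);
print "child exited:        waitpid returned $reaped\n";

# No children left at all: only now does waitpid return -1.
my $none_left = waitpid(-1, WNOHANG);
print "no children left:    waitpid returned $none_left\n";
```

If this matches what is happening here, the next thing to look for would
be a long-lived child of the hung process, for example with a ps listing
that shows parent pids.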

