bottlenecking?

Jim Levie jim at ENTROPHY-FREE.NET
Thu Sep 26 15:24:47 IST 2002


On Thu, 2002-09-26 at 04:56, Julian Field wrote:
> At 03:19 26/09/2002, you wrote:
> > >
> >FWIW: I've seen this sort of problem to one degree or another with every
> >V3 version of MailScanner that I've deployed. It's too soon to say if v4
> >will have the same sorts of problems. My solution is to use a smart Perl
> >monitor, rather than a shell script, to manage the MailScanner
> >processes. The perl code watches for excessive CPU consumption,
> >excessive process size, or a MailScanner that's run longer than it
> >should. If any of the boundary conditions are seen the offending process
> >is killed and restarted. While I suppose a clever shell script could be
> >written to do the same thing it was very easy to do with Perl and I took
> >the path of least resistance.
>
> It would be interesting to discover what is actually causing the problem,
> as I've never seen it on our systems here at all. Have you checked
> everywhere under /var/spool/MailScanner for "core" files? These can take a
> very long time to scan, and should just be deleted most of the time. If
> many other people were seeing the same problem as you, I would have heard
> about it a lot. And I haven't, so I can only think this is a fairly unusual
> problem.
>
I certainly wouldn't say that it is a common problem or that it happens
at all frequently. I only see it happen at infrequent intervals. I don't
know if the problem is load related or message related, but when it
happens all processing of messages from mqueue.in stops and mail starts
backing up. By the time I'd notice the problem (usually 15 minutes to an
hour later) I might have 10-15K messages in the input queue. At that
point the name of the game is to get the queue cleared and make the
phone stop ringing, so investigative work mostly has to be done in
retrospect. I have looked for core files and not found any. So far, simply
killing the MS process and restarting it causes message processing to
resume.
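
The boundary checks mentioned earlier in this thread (excessive CPU,
excessive process size, excessive run time) can be sketched roughly as
follows. This is a minimal illustration in Python, not the Perl monitor
itself. The threshold values and the names Limits, parse_etime and
should_restart are hypothetical, and it assumes the numbers come from
something like "ps -o pcpu=,rss=,etime= -p PID".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    # Hypothetical thresholds; tune them for the site.
    max_cpu_percent: float = 90.0
    max_rss_kb: int = 200_000
    max_elapsed_s: int = 3600


def parse_etime(etime):
    """Convert a ps etime value ([[dd-]hh:]mm:ss) to seconds."""
    etime = etime.strip()
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours with zero
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def should_restart(cpu_percent, rss_kb, elapsed_s, limits=Limits()):
    """Return the names of the limits exceeded; an empty list means OK."""
    reasons = []
    if cpu_percent > limits.max_cpu_percent:
        reasons.append("cpu")
    if rss_kb > limits.max_rss_kb:
        reasons.append("size")
    if elapsed_s > limits.max_elapsed_s:
        reasons.append("runtime")
    return reasons
```

A caller would kill and restart the process whenever should_restart()
returns a non-empty list, and log the reasons for later investigation.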

For a while I thought that the problem only occurred on my large volume
servers and was leaning towards a load related cause. But I have
observed it (even less frequently) on low volume servers (fewer than 15k
messages/day). So far I haven't been able to duplicate that failure when
I save off the contents of the mqueue.in dir and run that through my test
jig. That might imply that there's some critical set of conditions that
has to occur to cause MailScanner to go walk-about. One other thing that
I've observed is that MailScanner always has a batch of messages in
process at the time of the failure. The same message IDs exist both in
the work directory and in the input queue. I guess I don't know exactly
what MS was doing at the time it ran off into the weeds, only that it
appeared to have been processing messages.
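
One way to record that overlap before restarting is to list the message
IDs present in both places. A minimal sketch in Python, under assumptions
that are not confirmed anywhere in this thread: the input queue holds
sendmail-style qf<ID> and df<ID> files, and the work directory contains
files or directories named either <ID> or qf<ID>/df<ID> at some depth.
The function names are hypothetical.

```python
import os


def queue_ids(queue_dir):
    """IDs of messages in a sendmail-style queue (one qf<ID> file each)."""
    return {name[2:] for name in os.listdir(queue_dir)
            if name.startswith("qf")}


def work_names(work_dir):
    """Every file and directory name found anywhere under work_dir."""
    names = set()
    for _root, dirs, files in os.walk(work_dir):
        names.update(dirs)
        names.update(files)
    return names


def ids_in_both(queue_dir, work_dir):
    """Queue IDs that also appear under work_dir (assumed naming)."""
    names = work_names(work_dir)
    return sorted(
        msg_id for msg_id in queue_ids(queue_dir)
        if msg_id in names
        or "qf" + msg_id in names
        or "df" + msg_id in names
    )
```

Saving the output of ids_in_both() along with a copy of those messages
would give a smaller set to replay through a test setup afterwards.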

> >The V4 implementation brings new challenges. Not only do you have the
> >master process, but you also have a number of child processes to deal
> >with. I'd like to see a pid file for each of the children, perhaps with
> >a name of the form mailscanner1.pid, mailscanner2.pid, etc. And it would
> >be awfully nice if killing the master process would cause it to reap its
> >children.
>
> I happened to write that for you last night. There are pid files for all of
> the children, and the master creates and destroys these as the children
> start and stop. I've written an init.d script for it (for RedHat) that has
> start, stop, restart, status and reload commands. It does the "reload"
> operation by doing a "kill -HUP" on all the MailScanner processes.
>
Very nice.
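
For anyone driving the same reload from something other than the init.d
script, the operation amounts to reading each pid file and sending
SIGHUP to the pid it contains. A rough sketch in Python; the
"mailscanner*.pid" pattern is an assumption borrowed from the naming
suggested above, not a confirmed name from the v4 release, and the
function names are hypothetical.

```python
import glob
import os
import signal


def read_pids(pid_dir, pattern="mailscanner*.pid"):
    """Collect the pids stored in pid files matching the assumed pattern."""
    pids = []
    for path in sorted(glob.glob(os.path.join(pid_dir, pattern))):
        try:
            with open(path) as fh:
                pids.append(int(fh.read().strip()))
        except (OSError, ValueError):
            continue  # unreadable or malformed pid file; skip it
    return pids


def signal_all(pids, sig=signal.SIGHUP):
    """Send sig to each pid; return the pids that accepted the signal."""
    delivered = []
    for pid in pids:
        try:
            os.kill(pid, sig)
            delivered.append(pid)
        except (ProcessLookupError, PermissionError):
            pass  # stale pid file, or a process we do not own
    return delivered
```

Calling signal_all(read_pids(some_dir)) performs the reload; passing
sig=0 instead only checks which of the recorded processes still exist.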
--
The instructions said to use Windows 98 or better, so I installed
RedHat.


