Vexing problem

Michael Janssen Janssen at RZ.UNI-FRANKFURT.DE
Wed Jul 23 02:35:11 IST 2003


On Tue, 22 Jul 2003, Thomas DuVally wrote:

> I recently upgraded both SA (2.55) and MS (4-20.3). I am running it in
> parallel to one of the older versions (2.43/4-10).
>
> I process about 70-100k per system per day. Each machine is otherwise
> identical and getting the same number and types of messages (equally
> weighted MX)
>
> Every day around peak time the upgraded system starts to get backed up.
> The incoming queue goes from a normal 2-4 message count up to 1000+.
>
> Restarting MS will begin clearing this out.
>
> Question:  Is there a possible memory issue with either MS or SA I should
> be aware of?  I've got it trimmed down pretty good with no bayes or RBLS
> and only incoming messages content checked.

Do you have reasons to suspect a memory problem? 16 MS workers should
consume up to about 550MB (I count 33MB resident set size, RSS, per worker
as shown by "top"). That should be fine with 4GB; your sendmail(?),
virus scanner and SA can't take all the rest. Is the machine swapping?
It is mostly no problem at all if the machine has swapped out some
never-used data, but it is of course a problem if the machine is actively
freeing and claiming swap space.
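
The estimate above is simple arithmetic; a minimal sketch of it follows.
The figures are the ones quoted in this thread, and the function name is
only illustrative. Summing RSS counts pages shared between workers more
than once, so the result is best read as an upper bound.

```python
# Figures taken from the thread; the helper name is illustrative only.
WORKERS = 16              # number of MS workers
RSS_MB_PER_WORKER = 33    # resident set size per worker, as read from "top"
TOTAL_RAM_MB = 4 * 1024   # 4GB machine


def worker_memory_mb(workers, rss_mb):
    """Upper bound on memory used by all workers (shared pages counted repeatedly)."""
    return workers * rss_mb


total = worker_memory_mb(WORKERS, RSS_MB_PER_WORKER)
share = total / TOTAL_RAM_MB
print(f"{total} MB for all workers, {share:.1%} of RAM")
# prints: 528 MB for all workers, 12.9% of RAM
```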

What are the MS processes doing? Are they standing still or running too
slowly? If they stand still: what is the last log entry, and what do WCHAN,
%CPU and strace output show? (I hope Solaris provides all the kinds of
information I'm familiar with from our Linux systems.)

It's a bit hard to track this for 16 workers. A filter script that sets
the log lines for different PIDs to different colors would probably help
(uhm, 16 readable colors on a console...). Anyway, in case the processes
are "just" slow, it would be interesting to know whether the TIME and CTIME
of the processes differ much (CTIME is cumulative time; as far as I know
it is only provided by top, via the "S" key).
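
A rough sketch of such a color filter, under two assumptions: the log
lines carry the usual syslog "MailScanner[pid]:" tag, and the terminal
understands ANSI escape codes. The names colourise, STYLES and PID_RE are
made up for this example.

```python
import re

# 8 basic ANSI foreground colours, plain and bold: 16 styles in total.
# Plain black (code 30) may be unreadable on a dark console.
STYLES = [f"\033[{bold};{colour}m" for bold in (0, 1) for colour in range(30, 38)]
RESET = "\033[0m"

# Assumes the usual syslog tag "MailScanner[12345]:" appears in each worker line.
PID_RE = re.compile(r"MailScanner\[(\d+)\]")


def colourise(lines):
    """Yield each line wrapped in a colour chosen per worker PID."""
    seen = {}  # pid -> style, assigned in order of first appearance
    for line in lines:
        match = PID_RE.search(line)
        if match is None:
            yield line  # lines without a PID pass through unchanged
            continue
        pid = match.group(1)
        if pid not in seen:
            # Styles wrap around once more than 16 PIDs have been seen.
            seen[pid] = STYLES[len(seen) % len(STYLES)]
        text = line.rstrip("\n")
        yield f"{seen[pid]}{text}{RESET}\n"
```

To use it as a filter, one could feed it standard input, for example
sys.stdout.writelines(colourise(sys.stdin)) after an "import sys", and
pipe the mail log through the script.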


By the way: I've just generated a fresh report for our system (MS 4.22-5):
http://www.rz.uni-frankfurt.de/~janssenm/logstats/daily/07.23.marcy.html

and Batchperformance / Time/Batch (computing how much time was needed to
work on one batch) shows a very suspicious pattern alternating between low
and high scan times. Well, not high in a critical sense, but the pattern is
there and it is correlated with the "dying of old age" messages in the
logs. I can't remember seeing such a pattern before and I really don't like
it, because one might suspect that MS would take more and more time without
the periodic restart mechanism (which is by now regarded as an overly
cautious safeguard against possibly nonexistent problems). We upgraded from
v4.12 last week and switched to sophossavi.... Nice, I'd love to
investigate that more deeply.
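
One way to check that suspicion is to average the time per batch by how
many batches a worker has already processed. This is only a sketch: the
input tuples (pid, timestamp, seconds per batch) are a hypothetical
format that would first have to be extracted from the logs, and it
assumes a PID is not reused within the period examined.

```python
from collections import defaultdict


def batch_time_by_age(records):
    """Mean batch time per position in a worker's life.

    records: iterable of (pid, timestamp_seconds, batch_seconds) tuples,
    a hypothetical format, not something MS writes out itself.
    Returns {batch_index: mean batch_seconds}, where index 0 is the first
    batch a worker handled after it started.
    """
    per_pid = defaultdict(list)
    for pid, timestamp, seconds in records:
        per_pid[pid].append((timestamp, seconds))

    sums = defaultdict(float)
    counts = defaultdict(int)
    for samples in per_pid.values():
        samples.sort()  # order each worker's batches by time
        for index, (_, seconds) in enumerate(samples):
            sums[index] += seconds
            counts[index] += 1

    return {index: sums[index] / counts[index] for index in sorted(sums)}
```

If the means rise steadily with the batch index and drop back after each
restart, that would be consistent with workers slowing down as they age.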


Michael

>
> Specs:  Solaris 9
>         UltraSparc III+
>         2 CPU
>         4G mem
>         Perl 5.8
>
> If I can't figure this out I may have to downgrade.
> --
> Thomas J. DuVally
> Lead Systems Prog.
> CIS, Brown Univ.
>
> http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x15F233F6
>


