MS killing CPU

Alan mailscanner at ELKNET.NET
Mon Sep 27 13:59:07 IST 2004


On Fri, 24 Sep 2004 08:37:16 +1200, Hendrik den Hartog <hden at KCBBS.GEN.NZ>
wrote:

We have had exactly the same symptoms on our MS server. When it finally
'sprang' back to life, the Vispan graphs showed the server load well into
the double digits for anywhere from 1 to 3 hours, matching the time the
server was unavailable.

To me, a load that high indicated that some resource was unavailable, so
that no processes could run. I figured the most likely suspect was the
drives, since I could not see all my RAM suddenly being tied up while
nothing was being processed.
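
For anyone wanting to check this kind of stall on their own box, here is a
minimal sketch (not part of the original diagnosis) that reads the load
averages and lists processes stuck in uninterruptible sleep, state 'D',
which is the state a process sits in while waiting on disk I/O and which
Linux counts toward the load average. It assumes a Linux /proc filesystem;
the function names are made up for illustration.

```python
import os


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    fields = text.split()
    return tuple(float(f) for f in fields[:3])


def state_from_stat(stat_text):
    """Return the one-letter process state from /proc/<pid>/stat text.

    The command name is wrapped in parentheses and may contain spaces,
    so split on the last ')' before reading the state field.
    """
    return stat_text.rsplit(")", 1)[1].split()[0]


def blocked_pids(proc_root="/proc"):
    """Return PIDs currently in uninterruptible sleep (state 'D')."""
    try:
        names = os.listdir(proc_root)
    except OSError:
        return []  # no /proc here (assumption: Linux-only layout)
    pids = []
    for name in names:
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, name, "stat")) as f:
                if state_from_stat(f.read()) == "D":
                    pids.append(int(name))
        except (OSError, IndexError):
            continue  # process exited or entry unreadable
    return pids


if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as f:
        print("load averages:", parse_loadavg(f.read()))
    print("PIDs in uninterruptible sleep:", blocked_pids())
```

A high load average together with many PIDs in state 'D' and little CPU
in use would point toward blocked I/O rather than a busy processor.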

My system was using an IDE drive for the OS and a SCSI drive for the queue
and logs. I had the Adaptec SCSI card set for a 160 MB/s data rate. I backed
it off to 80 MB/s (the next step down), and the problem went away. It appears
that I was getting data errors on the SCSI drive that occasionally made the
drive unavailable, driving the system load way up and making the server
unresponsive.

On a side note, a couple of times I tried to intervene when the server did
this, rather than waiting for it to spring back to life. When I did so, upon
reboot, the boot process would get stuck on 'Recovering Journal' on the SCSI
drive partitions (reiserfs), again supporting the idea that the problem was
related to the SCSI drive. The boot process would just hang on recovering the
journal and go no further. So finally I tried something out of the blue...
this is an SMP server, and when I built the kernel, I also built a
single-processor kernel. When I tried the single-processor kernel, the boot
went fine and got right past the hang spot when recovering the journal. Then
I rebooted as SMP, and all was fine. We repeated this process at least three
times (boot on single processor, then reboot to SMP to get past the journal
hang) before we tried backing down the SCSI data rate, which fixed it for
good.

-Alan
>Recently our MS machine has been grinding to a halt. You can't
>'get' to it, it won't respond to pings etc. These periods have lasted
>several hours. In the past, it has eventually 'sprung' back
>to life of its own accord, allowing me to get a quick look. top
>occasionally showed an MS child hogging 100% CPU. Also we've had
>some SA timeouts recently.
