MailScanner causing server to crash

Mon Jan 6 02:09:35 GMT 2003

We use DAC960 hardware and have seen similar things, usually from
pushing SCSI limits (e.g. a cable that is too long, the wrong kind of
cable, etc.).  The drives themselves are fine, and the events are logged
in /var/log/messages as well as dmesg.  You can control the drives with
/proc/rd/c0/user_command (where c0 stands for controller 0).  You can
see what is going on with /proc/rd/c0/current_status.
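
As a rough illustration of checking that status file from a script, here
is a minimal Python sketch.  It assumes the file lists each physical
device on a "channel:ID  Vendor: ..." line followed by a
"Disk Status: <state>, ..." line; that layout is an assumption, so check
it against what your own controller prints.  The function and constant
names are made up for this example.

```python
import re

# Path from the text above; c0 is controller 0.
STATUS_PATH = "/proc/rd/c0/current_status"

# Assumed layout: a device header line such as "0:2  Vendor: ..." and,
# a few lines later, "Disk Status: <state>, <n> blocks".
DEVICE_RE = re.compile(r"^\s*(\d+):(\d+)\s+Vendor:")
STATE_RE = re.compile(r"^\s*Disk Status:\s*([^,]+),")

# States treated as healthy in this sketch (Standby = hot spare).
HEALTHY_STATES = {"Online", "Standby"}


def find_unhealthy_drives(status_text):
    """Return a list of (channel, target_id, state) for unhealthy drives."""
    unhealthy = []
    current = None  # (channel, target_id) of the device being read
    for line in status_text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = (int(header.group(1)), int(header.group(2)))
            continue
        status = STATE_RE.match(line)
        if status and current is not None:
            state = status.group(1).strip()
            if state not in HEALTHY_STATES:
                unhealthy.append((current[0], current[1], state))
            current = None
    return unhealthy


def read_status(path=STATUS_PATH):
    """Read the controller status text from /proc."""
    with open(path) as status_file:
        return status_file.read()
```

Calling find_unhealthy_drives(read_status()) from cron or a monitoring
check would then give a list of drives to alert on.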

I have a script that dumps the status to a port, and a little Visual C
program (since converted to VB) that our helpdesk uses to monitor the
status of the RAID.  I even wrote a mon script at one point to parse the
output and notify me of a failed drive, and I am planning on writing a
Nagios module for it (this is low priority, though: since I quit
building things that push the SCSI limits, drives don't fail).  Once
notified, you can
echo "make-online channel:ID" > /proc/rd/c0/user_command  replacing
channel and ID with the correct channel and ID of the drive that is
dead.
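
If you would rather script that step than type the echo by hand, a
minimal Python sketch follows.  It only builds the same
"make-online channel:ID" string and writes it, with a trailing newline
as echo would, to the user_command file.  The helper names are
hypothetical, and it assumes you run it as root on a box that actually
has /proc/rd/c0.

```python
# Control file from the text above; c0 is controller 0.
COMMAND_PATH = "/proc/rd/c0/user_command"


def make_online_command(channel, target_id):
    """Build the 'make-online channel:ID' command string."""
    channel = int(channel)
    target_id = int(target_id)
    if channel < 0 or target_id < 0:
        raise ValueError("channel and ID must be non-negative")
    return "make-online %d:%d" % (channel, target_id)


def send_command(command, path=COMMAND_PATH):
    """Write one command line to the control file, like echo does."""
    with open(path, "w") as control_file:
        control_file.write(command + "\n")
```

For example, send_command(make_online_command(0, 2)) is meant to do the
same as echoing "make-online 0:2" into the control file.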

If you boot off the RAID and lose 2 drives (or, as I often see, a whole
channel), you will get a kernel panic.  If you're mounting
/var/spool/mail on the RAID, you will find your machine almost hangs,
just because of the amount of processing going on trying to find where
to put mail on a busy server.

Hope this helps, and if you have any questions please feel free to
contact me directly.

--rat

On Sun, 2003-01-05 at 18:12, Nick Phillips wrote:
> On Monday, January 6, 2003, at 01:34  pm, Jim Levie wrote:
>
> > MailScanner bangs on the disk quite a bit as compared to just
> > sendmail/procmail. My suspicion is that the fault is associated with
> > the disk subsystem activity.
>
> Are you getting log messages from the DAC960 driver at all? You might
> want to check that by, say, fiddling with the control files in /proc
> (sorry, can't remember which ones) to manually take a drive offline
> and see whether it gets logged.
>
> It's just that I've seen problems with a DAC960 before where there were
> communication errors between the controller and the drives (introduced
> by the drive bay's backplane, IIRC), which caused the drives to be
> marked as bad by the controller, one after the other.
>
> Once they were all down, kernel panic followed, IIRC.
>
> What type of server is it (brand, model etc.)?
>
>
>
> Cheers,
>
>
> Nick