OT: Server puts in zombie mode.
glenn.steen at gmail.com
Thu Apr 16 13:46:46 IST 2009
2009/4/14 Scott Silva <ssilva at sgvwater.com>:
> on 4-13-2009 2:47 PM Eduardo Casarero spake the following:
>> Hi, I have an unusual situation on some servers: they just go zombie.
>> After all my research I think it's a hard disk hang or some other storage
>> problem, because I couldn't find any trace in the logs. The failure seems
>> to be random, and a server in zombie mode appears to be online: it answers
>> ping, and if you telnet to the ssh port you get connected, but then
>> the connection is lost (as if the server were trying to read
>> something from the disk).
>> Also, the servers that eventually crash have spent years under heavy load
>> processing email (this backs up my theory of disk failure). After
>> rebooting the server everything seems to be OK.
>> The OS is Slackware, and the servers run either 10.2 or 12.1, so I
>> don't think it is an OS bug. They also have different versions of
>> MailScanner/SA/clamd (because they have been running for years and
>> upgrades are not rolled out everywhere at once). Usually the zombie
>> servers have SATA disks (that also backs up my theory).
>> Does anyone have any idea how I can get a log or something to
>> demonstrate this? Or any other test to reach a better or different
>> conclusion about the cause?
>> Any help would be really appreciated.
>> Thanks, Eduardo.
> Hang a serial console on the system and watch it or log it. It might get some
> kernel messages that don't get written to log. Also try a memory test. Memory
> can also go bad over time, especially on overworked servers that might have
> collected some dust over the years and overheated. Maybe even just pull and
> reseat all the cards, memory, and even the processor in case some oxidation is
> present on the connectors.
I have to agree with Scott... The gut feeling I get is _not_
HDD problems (although that might be a possibility as well, just not
the "obvious first thing to test for" :-)... It might not even be HW,
but rather something leaking memory slowly over time... and some more
or less unfortunate "memory hog prevention" (the kernel's OOM killer)
making your kernel kill things opportunistically. So set up some
monitoring (the sysstat package includes sar, which should give enough
detail). If it is this, you shouldn't need to poll more than every
10-20 minutes to be able to see it (no need to check every few seconds :-).
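If sar isn't installed on one of the boxes, something along these lines
could stand in for it. This is only a rough sketch, not a replacement for
sysstat: it samples /proc/meminfo every 10 minutes and appends one line
per sample to a log file. The function names, the log path
/var/tmp/memwatch.log and the choice of fields are my own assumptions for
illustration. Keep in mind that if the disk really is the culprit, the
last samples before a hang may never reach the log.

```python
import time


def parse_meminfo(text):
    """Return {field: number} for lines such as 'MemFree:  1234 kB'."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def format_sample(values, now, fields=("MemFree", "Cached", "SwapFree")):
    """Build one log line: local timestamp followed by field=value pairs."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    cols = " ".join("%s=%s" % (f, values.get(f, "?")) for f in fields)
    return "%s %s" % (stamp, cols)


def log_once(log_path="/var/tmp/memwatch.log", meminfo_path="/proc/meminfo"):
    """Take one sample and append it to the log (paths are assumptions)."""
    with open(meminfo_path) as src:
        values = parse_meminfo(src.read())
    with open(log_path, "a") as log:
        log.write(format_sample(values, time.time()) + "\n")


if __name__ == "__main__":
    # Poll every 10 minutes; a slow leak shows up as a steady downward trend.
    while True:
        log_once()
        time.sleep(600)
```

A MemFree and SwapFree that both shrink steadily over days, right up to
the moment a box goes zombie, would point towards a leak rather than a disk.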
Other than that, Scott's suggestions are good... If you have a console
on 'em, and are logged into a textmode console session... you might
consider using the "magic sysrq keys" to facilitate a "pre-reboot
email: glenn < dot > steen < at > gmail < dot > com
work: glenn < dot > steen < at > ap1 < dot > se