What do you monitor and alert on for your mailscanner system?

Thu Jul 24 23:44:56 IST 2003

I am wondering what the group generally monitors in order to detect problems
with a MailScanner relay (i.e. things that should cause an alert to be sent
to an admin).

My system is probably a bit more isolated than most. My deployment
(currently only in testing) is intended *only* for inbound mail filtering,
with no outbound delivery (not even for "no such user" bounces, if I can
prevent it), so my list may not make sense for setups that handle outbound
mail. Here are the conditions I'm planning to alert on so far. Have I missed
anything you monitor on your system?

Failure conditions to alert on:
- Abnormally large inbound or outbound mailq.
- Deferred messages in the outbound mailq.  (I deliver only to an internal
server over a lan link.)
- Deferred messages in the inbound mailq. (Not sure that is even possible.)
- Old messages in the inbound or outbound mailqs (ie they haven't been
accessed or modified in a while)
- MailScanner process count (too high or too low)
- Inbound sendmail process count.
- Outbound sendmail queue runner (not sure how to differentiate in vs out
yet though)
- System unreachable via ping, smtp, http, or ssh (or whatever other methods
are appropriate for your system)
- Low disk space.
- High swap usage.
- High load average (especially anything nearing sendmail's deferral load
average).
- High process count.
- Dmesg errors (sends the dmesg output from the box to a monitoring device).
- High filtering percentages (e.g. 100% spam).
- Low filtering percentages (e.g. 10% spam).
- Round trip of test messages, i.e. send a message to a bounceback address
and expect it to return within X minutes.  Could also check to be sure a
MailScanner header was found in the returned message.
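
As a rough illustration of the two queue checks in the list above (queue too
large, messages too old), here is a minimal Python sketch. It assumes a
sendmail-style queue directory where each queued message has one control
file whose name starts with "qf", all in a single flat directory. The
function names and thresholds are made up for the example, and the check
would need adjusting if your queue is split into subdirectories.

```python
import os
import time


def queue_stats(queue_dir, now=None):
    """Return (message_count, oldest_age_seconds) for one queue directory.

    Assumption: one "qf*" control file per queued message, flat directory.
    """
    now = time.time() if now is None else now
    count = 0
    oldest_age = 0.0
    for name in os.listdir(queue_dir):
        if not name.startswith("qf"):
            continue  # skip data files and anything else in the directory
        path = os.path.join(queue_dir, name)
        try:
            mtime = os.stat(path).st_mtime
        except FileNotFoundError:
            continue  # message left the queue between listdir() and stat()
        count += 1
        oldest_age = max(oldest_age, now - mtime)
    return count, oldest_age


def check_queue(queue_dir, max_messages, max_age_seconds, now=None):
    """Return a list of alert strings; an empty list means nothing to report."""
    count, oldest_age = queue_stats(queue_dir, now=now)
    alerts = []
    if count > max_messages:
        alerts.append(
            "%s: %d messages queued (limit %d)" % (queue_dir, count, max_messages)
        )
    if oldest_age > max_age_seconds:
        alerts.append(
            "%s: oldest message is %d seconds old (limit %d)"
            % (queue_dir, int(oldest_age), max_age_seconds)
        )
    return alerts
```

Run from cron against each queue, e.g. check_queue("/var/spool/mqueue.in",
500, 900) for the inbound side and the same call with "/var/spool/mqueue"
for the outbound side, and mail any returned lines to the admin. Those
paths and limits are only guesses at a typical sendmail setup, so
substitute whatever your installation actually uses.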

Any suggestions for other things I could check?  If it breaks, I want to
know *before* anyone else does. :-)