MS/perl segfaults
David Lee
t.d.lee at durham.ac.uk
Mon Nov 10 15:27:05 GMT 2008
Julian: Over the years MailScanner has served us extremely well, and we
continue to rely on it and be thankful for your work on it.
But I'm currently clearing a backlog of 66,000+ emails from the weekend.
Occasionally (perhaps once a year) we get a particular class of problem
(and from skim-reading the list I believe others see this also), namely,
that a message, or messages, will arrive which cause MailScanner (more
likely one of its Perl modules) to segfault. A "shouldn't happen" thing
that, nevertheless, occasionally does happen.
We've just had such an incident over the weekend. And there were enough
such messages (about 100) to cause all the child MS processes (20) to
segfault on most occasions that they processed a batch (30). The net
result is that our inbound queue grew, and very little trickled through,
because the MS processes segfaulted, retried, segfaulted, retried, ...
(The failure of one message in the batch causes the whole batch to be
delayed until the next child attempt; and the chances are that new batch
will also suffer a segfault.)
As I say, such instances are rare, but they do happen. And when they
happen they can hit hard.
For this particular instance, I'd be happy to send you (offlist?) details,
including sample messages, "MailScanner -V", OS etc. (Let me know.)
But that still leaves a general problem of MS (and/or its modules) being
susceptible to emails (possibly malformed HTML spams) that can cause this
behaviour.
So a suggestion for a _general_ fix against general segfaults (to allow
the other emails not to become "collateral damage").
====begin====
When an MS child starts processing a batch, for each email temporarily put
its id (e.g. sendmail "df/qf" number) into a small "being processed"
database (e.g. a trivial db/dbm).
When the child finishes the batch, remove those ids of the batch from that
database.
So for a system of 'c' children and batch-size 'b', the maximum number of
entries at any time in that database will be 'c*b': rarely more than a few
hundred, and so trivial for a db/dbm thing. (And if the inbound mqueue is
empty, the database should correspondingly be empty.)
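As a rough illustration only of that bookkeeping (written in Python rather than MailScanner's own Perl; the function names are invented for this sketch and are not MailScanner code), assuming "db" is any dict-like store such as a handle returned by Python's dbm.open():

```python
import time


def register_batch(db, ids, now=None):
    """Mark every id in the batch as 'being processed'.

    db  : dict-like store (a plain dict, or a dbm handle) -- assumption
    ids : queue ids in this batch (e.g. sendmail df/qf numbers)
    now : timestamp to record; defaults to the current time
    """
    stamp = str(time.time() if now is None else now)
    for msg_id in ids:
        # Stored as text so the same code works with a dbm file.
        db[msg_id] = stamp


def clear_batch(db, ids):
    """Remove the batch's ids once the child has finished cleanly."""
    for msg_id in ids:
        if msg_id in db:
            del db[msg_id]
```

A child that segfaults never reaches clear_batch(), so its ids stay behind in the store; that leftover is the signal the rest of the idea relies on.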
Now here's the crucial detail: When the child starts its batch it also
quickly checks that those ids are not already present in the database.
(In normal use, they would never be present, as MS's existing mechanisms
already ensure that a child takes a batch from beginning right through to
completion.)
If it DOES find one of those ids, this indicates that something has gone
badly wrong (e.g. a previous child segfaulted, so didn't remove the ids in
its batch from the database). Many of those ids, of course, will be
innocent: they will be there only because another email (id) in the same
earlier batch had caused the failure.
To counter that, the database could also store a timestamp. On finding
such an email, a child would skip that id if it was relatively young (e.g.
less than ten minutes since its last timestamp), or process it _on its
own_ if relatively old (e.g. older than ten minutes). That way, the
innocent email would only be held up for a short period (e.g. ten minutes).
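The skip-or-isolate decision could be sketched roughly as follows. Again this is Python for illustration only, not MailScanner code: the name triage(), the 600-second default and the choice to treat an entry of exactly ten minutes as "old" are all assumptions of the sketch, and "db" is assumed to map each id to the timestamp (as text) at which a child last started on it.

```python
HOLD_SECONDS = 600  # "ten minutes" from the description; an assumed default


def triage(batch_ids, db, now, hold_seconds=HOLD_SECONDS):
    """Split a candidate batch into three lists.

    fresh : ids not in the store; process together as a normal batch
    skip  : ids left by a recent failure; leave them in the queue for now
    solo  : ids left by an older failure; process each one on its own
    """
    fresh, skip, solo = [], [], []
    for msg_id in batch_ids:
        if msg_id not in db:
            fresh.append(msg_id)
            continue
        age = now - float(db[msg_id])
        if age < hold_seconds:
            skip.append(msg_id)   # too young: probably collateral, wait
        else:
            solo.append(msg_id)   # old enough: retry in isolation
    return fresh, skip, solo
```

For example, with db = {"B": "1000", "C": "100"} and now=1200, triage(["A", "B", "C"], db, 1200) puts A in the normal batch, skips B (200 seconds old) and marks C (1100 seconds old) for processing alone. If C segfaults again it takes down only itself, not a batch of 30.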
(There are probably some cleverer things that could be done (and
additional things that ought to be done), but at this stage I'm simply
trying to outline the general idea.)
====end====
How does that sound?
Naturally I would be happy to assist beta-testing if you wish.
--
: David Lee I.T. Service :
: Senior Systems Programmer Computer Centre :
: UNIX Team Leader Durham University :
: South Road :
: http://www.dur.ac.uk/t.d.lee/ Durham DH1 3LE :
: Phone: +44 191 334 2752 U.K. :