MS/perl segfaults
Julian Field
MailScanner at ecs.soton.ac.uk
Mon Nov 10 17:11:03 GMT 2008
One immediate thought: the only reproducible instance of this problem
was caused by the HTML parser, and I wrote a fix for that in a
recent release; it's in the Change Log.
But yes, your idea is a possibility, now that I'm using SQLite. Doing it
with a dbm file is not really practical due to high contention for the
exclusive write locks on the file. SQLite may be able to do it rather
better.
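
As a rough illustration of what such a "being processed" table might look like in SQLite, here is a minimal sketch. It is written in Python rather than MailScanner's Perl, and the table name `processing` and the helpers `open_tracker`, `start_batch` and `finish_batch` are all hypothetical, not anything in MailScanner. Each call takes the write lock for one short transaction, which is the property that makes SQLite look more practical here than a dbm file.

```python
import sqlite3
import time


def open_tracker(path=":memory:"):
    """Open (or create) the hypothetical 'being processed' table."""
    # isolation_level=None gives autocommit mode, so the explicit
    # BEGIN IMMEDIATE / COMMIT statements below control the transactions.
    db = sqlite3.connect(path, isolation_level=None, timeout=30)
    db.execute(
        "CREATE TABLE IF NOT EXISTS processing ("
        " id TEXT PRIMARY KEY,"
        " started REAL NOT NULL)"
    )
    return db


def start_batch(db, ids, now=None):
    """Record ids not yet tracked.

    Returns (fresh, seen): fresh is the list of newly recorded ids, and
    seen maps each already-present id to the time it was first recorded.
    """
    now = time.time() if now is None else now
    fresh, seen = [], {}
    db.execute("BEGIN IMMEDIATE")  # take the write lock up front, briefly
    try:
        for msg_id in ids:
            row = db.execute(
                "SELECT started FROM processing WHERE id = ?", (msg_id,)
            ).fetchone()
            if row is None:
                db.execute(
                    "INSERT INTO processing (id, started) VALUES (?, ?)",
                    (msg_id, now),
                )
                fresh.append(msg_id)
            else:
                seen[msg_id] = row[0]
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise
    return fresh, seen


def finish_batch(db, ids):
    """Remove the ids of a completed batch from the table."""
    db.execute("BEGIN IMMEDIATE")
    try:
        db.executemany(
            "DELETE FROM processing WHERE id = ?", [(i,) for i in ids]
        )
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise
```

An id that comes back in `seen` was recorded by an earlier attempt that never reached `finish_batch`, which is the signal that a previous child probably died part-way through its batch.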
There are quite a few routes that lead to a message leaving a batch, and
I would have to catch all of those. Time for a quick code review of a
few chunks, I think.
If a message is more than 20 minutes old and still in the database, then
we do a batch containing only 1 message, and log it. If we find a
message more than 30 minutes old, then we log it and ignore it.
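
The two thresholds above can be sketched as a small decision function. This is only an illustration in Python, not MailScanner code; the names `classify` and `plan_retries` and the action labels are hypothetical. The "defer" case for entries younger than 20 minutes is an assumption borrowed from the quoted proposal (skip young entries), since the rule above does not say what happens to them.

```python
import logging

RETRY_ALONE_AFTER = 20 * 60  # seconds: older than this, batch it on its own
GIVE_UP_AFTER = 30 * 60      # seconds: older than this, log it and ignore it

log = logging.getLogger("batch-tracker")  # hypothetical logger name


def classify(age_seconds):
    """Decide what to do with a message still present in the database."""
    if age_seconds > GIVE_UP_AFTER:
        return "ignore"
    if age_seconds > RETRY_ALONE_AFTER:
        return "alone"
    # Assumption: entries younger than 20 minutes are left for a later pass.
    return "defer"


def plan_retries(seen, now):
    """Group leftover ids by action.

    seen maps message id -> time (in seconds) it was first recorded.
    """
    result = {"alone": [], "ignore": [], "defer": []}
    for msg_id, started in seen.items():
        action = classify(now - started)
        if action == "alone":
            log.warning("retrying %s in a batch of 1", msg_id)
        elif action == "ignore":
            log.warning("giving up on %s", msg_id)
        result[action].append(msg_id)
    return result
```

Each id listed under "alone" would then be handed to a child as a batch containing only that message, so a message that really does crash the scanner cannot take innocent messages down with it.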
How many ways could this process go wrong? All existing exclusion locks
would still apply, so if a message is more than 20 minutes old, is
being retried and is still locked, that lock still applies.
What are the failure modes of this scheme? I refuse to believe there
aren't any. We need to cover as many of them as possible and come up
with remedies for them.
Jules.
David Lee wrote:
>
> Julian: Over the years MailScanner has served us extremely well, and
> we continue to rely on it and be thankful for your work on it.
>
> But I'm currently clearing a backlog of 66,000+ emails from the weekend.
>
>
> Occasionally (perhaps once a year) we get a particular class of
> problem (and from skim-reading the list I believe others see this
> also), namely, that a message, or messages, will arrive which cause
> MailScanner (more likely one of its Perl modules) to segfault. A
> "shouldn't happen" thing that, nevertheless,
> occasionally does happen.
>
> We've just had such an incident over the weekend. And there were
> enough such messages (about 100) to cause all 20 child MS processes
> to segfault on most occasions that they processed a batch (of 30
> messages). The net result is that our inbound queue grew, and very
> little trickled through, because the MS processes segfaulted, retried,
> segfaulted, retried, ...
>
> (The failure of one message in the batch causes the whole batch to be
> delayed until the next child attempt; and the chances are that new
> batch will also suffer a segfault.)
>
> As I say, such instances are rare, but they do happen. And when they
> happen they can hit hard.
>
> For this particular instance, I'd be happy to send you (offlist?)
> details, including sample messages, "MailScanner -V", OS etc. (Let me
> know.)
>
> But that still leaves a general problem of MS (and/or its modules)
> being susceptible to emails (possibly malformed HTML spams) that can
> cause this behaviour.
>
> So here is a suggestion for a _general_ fix against general segfaults
> (to allow the other emails not to become "collateral damage").
>
>
> ====begin====
>
> When an MS child starts processing a batch, for each email temporarily
> put its id (e.g. sendmail "df/qf" number) into a small "being
> processed" database (e.g. a trivial db/dbm).
>
> When the child finishes the batch, remove those ids of the batch from
> that database.
>
> So for a system of 'c' children and batch-size 'b', the maximum number
> of entries at any time in that database will be 'c*b': rarely more
> than a few hundred, and so trivial for a db/dbm thing. (And if the
> inbound mqueue is empty, the database should correspondingly be empty.)
>
> Now here's the crucial detail: When the child starts its batch it
> also quickly checks that those ids are not already present in the
> database. (In normal use, they would never be present, as MS's
> existing mechanisms already ensure that a child takes a batch from
> beginning right through to completion.)
>
> If it DOES find that id, this indicates that something has badly gone
> wrong (e.g. previous child segfaulted, so didn't remove ids in this
> batch from the database). Many of those ids, of course, will be
> innocent: they will be there because another email (id) in an earlier
> batch had failed.
>
> To counter that, the database could also store a timestamp. On
> finding such an email, a child would skip that id if it was relatively
> young (e.g. less than 10 minutes since last timestamp), or process it
> _on its own_ if relatively old (e.g. older than ten minutes). That
> way, the innocent email would only be held up for a short period (e.g.
> ten minutes).
>
> (There are probably some cleverer things that could be done (and
> additional things that ought to be done), but at this stage I'm simply
> trying to outline the general idea.)
>
> ====end====
>
> How does that sound?
>
> Naturally I would be happy to assist beta-testing if you wish.
>
>
Jules
--
Julian Field MEng CITP CEng
www.MailScanner.info
Buy the MailScanner book at www.MailScanner.info/store
MailScanner customisation, or any advanced system administration help?
Contact me at Jules at Jules.FM
PGP footprint: EE81 D763 3DB0 0BFD E1DC 7222 11F6 5947 1415 B654
PGP public key: http://www.jules.fm/julesfm.asc