MS/perl segfaults
Denis Beauchemin
Denis.Beauchemin at USherbrooke.ca
Mon Nov 10 15:47:24 GMT 2008
David Lee wrote:
>
> Julian: Over the years MailScanner has served us extremely well, and
> we continue to rely on it and be thankful for your work on it.
>
> But I'm currently clearing a backlog of 66,000+ emails from the weekend.
>
>
> Occasionally (perhaps once a year) we get a particular class of
> problem (and from skim-reading the list I believe others see this
> also), namely, that a message, or messages, will arrive which cause
> MailScanner (more likely one of its Perl modules) to segfault. It is a
> "shouldn't happen" event that, nevertheless, occasionally does
> happen.
>
> We've just had such an incident over the weekend. And there were
> enough such messages (about 100) to cause all the child MS processes
> (20) to segfault on most occasions that they processed a batch (30).
> The net result is that our inbound queue grew, and very little
> trickled through, because the MS processes segfaulted, retried,
> segfaulted, retried, ...
>
> (The failure of one message in the batch causes the whole batch to be
> delayed until the next child attempt; and the chances are that the new
> batch will also suffer a segfault.)
>
> As I say, such instances are rare, but they do happen. And when they
> happen they can hit hard.
>
> For this particular instance, I'd be happy to send you (offlist?)
> details, including sample messages, "MailScanner -V", OS etc. (Let me
> know.)
>
> But that still leaves a general problem of MS (and/or its modules) being
> susceptible to emails (possibly malformed HTML spams) that can cause
> this behaviour.
>
> So here is a suggestion for a _general_ fix against general segfaults
> (to allow the other emails not to become "collateral damage").
>
>
> ====begin====
>
> When an MS child starts processing a batch, for each email temporarily
> put its id (e.g. sendmail "df/qf" number) into a small "being
> processed" database (e.g. a trivial db/dbm).
>
> When the child finishes the batch, remove those ids of the batch from
> that database.
>
> So for a system of 'c' children and batch-size 'b', the maximum number
> of entries at any time in that database will be 'c*b': rarely more
> than a few hundred, and so trivial for a db/dbm thing. (And if the
> inbound mqueue is empty, the database should correspondingly be empty.)
>
> Now here's the crucial detail: When the child starts its batch it
> also quickly checks that those ids are not already present in the
> database. (In normal use, they would never be present, as MS's
> existing mechanisms already ensure that a child takes a batch from
> beginning right through to completion.)
>
> If it DOES find that id, this indicates that something has gone badly
> wrong (e.g. a previous child segfaulted, so didn't remove the ids in this
> batch from the database). Many of those ids, of course, will be
> innocent: they will be there because another email (id) in an earlier
> batch had failed.
>
> To counter that, the database could also store a timestamp. On
> finding such an email, a child would skip that id if it was relatively
> young (e.g. less than 10 minutes since last timestamp), or process it
> _on its own_ if relatively old (e.g. older than ten minutes). That
> way, the innocent email would only be held up for a short period (e.g.
> ten minutes).
>
> (There are probably some cleverer things that could be done (and
> additional things that ought to be done), but at this stage I'm simply
> trying to outline the general idea.)
>
> ====end====
>
> How does that sound?
>
> Naturally I would be happy to assist beta-testing if you wish.
>
>
I've never been bitten by that problem in the past but I nonetheless
like this idea.
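
To make the proposal concrete, here is a minimal sketch of the
"being processed" database in Python. It is only an illustration of
the idea quoted above, not MailScanner code. The names plan_batch,
finish and QUARANTINE_SECONDS are made up for this example, and a
plain dict stands in for the small db/dbm file.

```python
import time

# Hypothetical name; "ten minutes" is the example figure from the proposal.
QUARANTINE_SECONDS = 600


def plan_batch(db, candidate_ids, now=None, quarantine=QUARANTINE_SECONDS):
    """Decide what a child should do with each candidate message id.

    db maps message id -> timestamp of the last processing attempt.
    Returns (batch, solo, skipped):
      batch   - ids not in db: process together as a normal batch
      solo    - ids left behind long ago: process each on its own
      skipped - ids left behind recently: leave alone for now
    """
    now = time.time() if now is None else now
    batch, solo, skipped = [], [], []
    for msg_id in candidate_ids:
        stamp = db.get(msg_id)
        if stamp is None:
            batch.append(msg_id)      # normal case: never seen before
        elif now - float(stamp) < quarantine:
            skipped.append(msg_id)    # young leftover: hold it back
        else:
            solo.append(msg_id)       # old leftover: retry in isolation
    # Record (or refresh) the timestamp of everything about to be processed.
    for msg_id in batch + solo:
        db[msg_id] = str(now)
    return batch, solo, skipped


def finish(db, ids):
    """Remove ids from db once they have been processed successfully."""
    for msg_id in ids:
        if msg_id in db:
            del db[msg_id]
```

A child would call plan_batch() before scanning and finish() after a
clean run. If the child segfaults, finish() never runs, so the ids
stay in the database and the next child sees them. In this sketch a
message retried on its own gets a fresh timestamp, so if it crashes
the child again it is held back for another quarantine period instead
of taking other mail down with it.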
Denis
--
_
°v° Denis Beauchemin, analyste
/(_)\ Université de Sherbrooke, S.T.I.
^ ^ T: 819.821.8000x62252 F: 819.821.8045