MS/perl segfaults

Mon Nov 10 15:27:05 GMT 2008

Julian: Over the years MailScanner has served us extremely well, and we 
continue to rely on it and be thankful for your work on it.

But I'm currently clearing a backlog of 66,000+ emails from the weekend.

Occasionally (perhaps once a year) we get a particular class of problem 
(and from skim-reading the list I believe others see this also), namely, 
that a message, or messages, will arrive which cause MailScanner (more 
likely one of its perl modules) to segfault.  It is a "shouldn't happen" 
thing that, nevertheless, occasionally does happen.

We've just had such an incident over the weekend.  And there were enough 
such messages (about 100) to cause all the child MS processes (20 of 
them) to segfault on most occasions that they processed a batch (of 30 
messages).  The net result is that our inbound queue grew, and very 
little trickled through, because the MS processes segfaulted, retried, 
segfaulted, retried, ...

(The failure of one message in the batch causes the whole batch to be 
delayed until the next child attempt; and the chances are that new batch 
will also suffer a segfault.)

As I say, such instances are rare, but they do happen.  And when they 
happen they can hit hard.

For this particular instance, I'd be happy to send you (offlist?) details, 
including sample messages, "MailScanner -V", OS etc.  (Let me know.)

But that still leaves a general problem of MS (and/or its perl modules) 
being susceptible to emails (possibly malformed HTML spams) that can 
cause this behaviour.

So here is a suggestion for a _general_ fix against general segfaults (to 
allow the other emails not to become "collateral damage").

====begin====

When an MS child starts processing a batch, for each email temporarily put 
its id (e.g. sendmail "df/qf" number) into a small "being processed" 
database (e.g. a trivial db/dbm).

When the child finishes the batch, remove those ids of the batch from that 
database.

So for a system of 'c' children and batch-size 'b', the maximum number of 
entries at any time in that database will be 'c*b': rarely more than a few 
hundred, and so trivial for a db/dbm thing.  (And if the inbound mqueue is 
empty, the database should correspondingly be empty.)
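To make the bookkeeping concrete, here is a minimal sketch in Python (not 
MailScanner's own perl, and not a proposed implementation).  It assumes 
Python's standard dbm module as the "trivial db/dbm"; the function names, 
the database path and the ids are all made up for illustration.  A real 
version would also need locking between children, which this omits.

```python
import dbm
import time


def mark_batch(db_path, ids):
    """Record each id, with the current time, before the batch is processed."""
    now = str(int(time.time())).encode("ascii")
    with dbm.open(db_path, "c") as db:
        for msg_id in ids:
            db[msg_id.encode("ascii")] = now


def clear_batch(db_path, ids):
    """Remove the ids again once the batch has completed normally."""
    with dbm.open(db_path, "c") as db:
        for msg_id in ids:
            key = msg_id.encode("ascii")
            if key in db:
                del db[key]


def pending_ids(db_path):
    """Return the ids currently recorded as being processed."""
    with dbm.open(db_path, "c") as db:
        return sorted(key.decode("ascii") for key in db.keys())
```

A child would call mark_batch() on taking its batch and clear_batch() on 
finishing it; anything still listed by pending_ids() while no child is 
working on it was left behind by a child that died mid-batch.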

Now here's the crucial detail:  When the child starts its batch it also 
quickly checks that those ids are not already present in the database. 
(In normal use, they would never be present, as MS's existing mechanisms 
already ensure that a child takes a batch from beginning right through to 
completion.)

If it DOES find one of those ids, this indicates that something has gone 
badly wrong (e.g. a previous child segfaulted, so didn't remove the ids of 
its batch from the database).  Many of those ids, of course, will be 
innocent: they will be there only because another email (id) in that 
earlier batch had failed.

To counter that, the database could also store a timestamp.  On finding 
such an id, a child would skip it if it was relatively young (e.g. less 
than ten minutes since its timestamp), or process it _on its own_ if 
relatively old (e.g. ten minutes or more), so that a genuinely bad email 
can only take itself down.  That way, the innocent email would only be 
held up for a short period (e.g. ten minutes).
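The decision a child would make on taking a new batch could be sketched 
like this (again Python, purely illustrative).  The names triage() and 
HOLD_SECONDS are my own, the ten-minute figure is just the example above, 
and the "being processed" database is stood in for by a plain dict of 
id -> timestamp.

```python
HOLD_SECONDS = 600  # ten minutes: the example threshold, not a tuned value


def triage(batch_ids, in_flight, now):
    """Split a candidate batch into (normal, skip, isolate) lists of ids.

    in_flight maps an id to the timestamp (in seconds) left behind by an
    earlier attempt that never finished; now is the current time in seconds.
    """
    normal, skip, isolate = [], [], []
    for msg_id in batch_ids:
        stamp = in_flight.get(msg_id)
        if stamp is None:
            normal.append(msg_id)    # never seen: process as usual
        elif now - stamp < HOLD_SECONDS:
            skip.append(msg_id)      # recent failure: leave it for now
        else:
            isolate.append(msg_id)   # old enough: process in a batch of one
    return normal, skip, isolate
```

The point of the "isolate" list is that each such id is retried alone: if 
it segfaults again, only that one email is affected, and if it goes 
through cleanly it was one of the innocent ones.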

(There are probably some cleverer things that could be done (and 
additional things that ought to be done), but at this stage I'm simply 
trying to outline the general idea.)

====end====

How does that sound?

Naturally I would be happy to assist beta-testing if you wish.

-- 

:  David Lee                                I.T. Service          :
:  Senior Systems Programmer                Computer Centre       :
:  UNIX Team Leader                         Durham University     :
:                                           South Road            :
:  http://www.dur.ac.uk/t.d.lee/            Durham DH1 3LE        :
:  Phone: +44 191 334 2752                  U.K.                  :