MS/perl segfaults

Mon Nov 10 17:11:03 GMT 2008

One immediate thought: the only reproducible instance of this problem
was caused by the HTML parser, and I wrote a fix for that in a recent
release; it's in the Change Log.

But yes, your idea is a possibility, now that I'm using SQLite. Doing it 
with a dbm file is not really practical due to high contention for the 
exclusive write locks on the file. SQLite may be able to do it rather 
better.
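
To make the idea concrete, here is a minimal sketch of the "being
processed" table David describes below, kept in SQLite. It is written in
Python purely for illustration and is not MailScanner code; the table
name `inflight`, its columns, and the function names `open_tracker`,
`claim_batch` and `release_batch` are all assumptions made up for this
example.

```python
import sqlite3
import time


def open_tracker(path=":memory:"):
    """Open (or create) the hypothetical 'being processed' table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inflight ("
        " msg_id TEXT PRIMARY KEY,"
        " started REAL NOT NULL)"
    )
    return conn


def claim_batch(conn, msg_ids, now=None):
    """Record each id in the batch; return the ids that were already there.

    An id that is already present was left behind by a child that never
    finished its batch (for example, because it segfaulted). Its original
    timestamp is kept, so its age keeps counting from the first attempt.
    """
    now = time.time() if now is None else now
    leftovers = []
    with conn:  # one transaction for the whole batch
        for msg_id in msg_ids:
            cur = conn.execute(
                "INSERT OR IGNORE INTO inflight (msg_id, started)"
                " VALUES (?, ?)",
                (msg_id, now),
            )
            if cur.rowcount == 0:  # row existed, so the insert was ignored
                leftovers.append(msg_id)
    return leftovers


def release_batch(conn, msg_ids):
    """Remove the ids of a batch that completed normally."""
    with conn:
        conn.executemany(
            "DELETE FROM inflight WHERE msg_id = ?",
            [(m,) for m in msg_ids],
        )
```

A child would call `claim_batch` before scanning and `release_batch`
once the batch has been delivered; anything returned by `claim_batch`
is a candidate for special handling.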

There are quite a few routes that lead to a message leaving a batch, and
I would have to catch all of those. Time for a quick code review of a
few chunks, I think.

If a message is more than 20 minutes old and still in the database, then
we process it in a batch containing only that 1 message, and log it. If
we find a message more than 30 minutes old, then we log it and ignore it.
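
The age rule above could be sketched roughly as follows (Python, for
illustration only). The constant and function names are hypothetical,
and the "skip" result for messages younger than 20 minutes is an
assumption borrowed from David's proposal below rather than something
decided here.

```python
RETRY_ALONE_AFTER = 20 * 60  # seconds: older than this, retry on its own
GIVE_UP_AFTER = 30 * 60      # seconds: older than this, log and ignore


def decide(started, now):
    """Decide what to do with an id that is still in the database.

    `started` and `now` are timestamps in seconds. Returns one of:
      "ignore" - more than 30 minutes old: log it and leave it alone
      "solo"   - more than 20 minutes old: log it, batch of 1 message
      "skip"   - younger than that: leave it for a later pass (assumed)
    """
    age = now - started
    if age > GIVE_UP_AFTER:
        return "ignore"
    if age > RETRY_ALONE_AFTER:
        return "solo"
    return "skip"
```

Checking the 30-minute limit first matters: a message that old also
satisfies the 20-minute test, and it must not be retried.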

How many ways could this process go wrong? All existing exclusion locks
would still apply, so if a message more than 20 minutes old is being
re-tried and is still locked, that lock still applies.

What are the failure modes of this scheme? I refuse to believe there 
aren't any. We need to cover as many of them as possible and come up 
with remedies for them.

Jules.

David Lee wrote:
>
> Julian: Over the years MailScanner has served us extremely well, and 
> we continue to rely on it and be thankful for your work on it.
>
> But I'm currently clearing a backlog of 66,000+ emails from the weekend.
>
>
> Occasionally (perhaps once a year) we get a particular class of 
> problem (and from skim-reading the list I believe others see this 
> also), namely, that a message, or messages, will arrive which cause 
> MailScanner (more likely one of its perl modules) to segfault.  A 
> (quote) shouldn't happen (unquote) thing that, nevertheless, 
> occasionally does happen.
>
> We've just had such an incident over the weekend.  And there were 
> enough such messages (about 100) to cause all the child MS processes 
> (20) to segfault on most occasions that they processed a batch (30).  
> The net result is that our inbound queue grew, and very little 
> trickled through, because the MS processes segfaulted, re-tried, 
> segfaulted, retried, ...
>
> (The failure of one message in the batch causes the whole batch to be 
> delayed until the next child attempt; and the chances are that new 
> batch will also suffer a segfault.)
>
> As I say, such instances are rare, but they do happen.  And when they 
> happen they can hit hard.
>
> For this particular instance, I'd be happy to send you (offlist?) 
> details, including sample messages, "MailScanner -V", OS etc.  (Let me 
> know.)
>
> But that still leaves a general problem of MS (+/ modules) being 
> susceptible to emails (possibly malformed HTML spams) that can cause 
> this behaviour.
>
> So a suggestion for a _general_ fix against general segfaults (to 
> allow the other emails not to become "collateral damage").
>
>
> ====begin====
>
> When an MS child starts processing a batch, for each email temporarily 
> put its id (e.g. sendmail "df/qf" number) into a small "being 
> processed" database (e.g. a trivial db/dbm).
>
> When the child finishes the batch, remove those ids of the batch from 
> that database.
>
> So for a system of 'c' children and batch-size 'b', the maximum number 
> of entries at any time in that database will be 'c*b': rarely more 
> than a few hundred, and so trivial for a db/dbm thing.  (And if the 
> inbound mqueue is empty, the database should correspondingly be empty.)
>
> Now here's the crucial detail:  When the child starts its batch it 
> also quickly checks that those ids are not already present in the 
> database. (In normal use, they would never be present, as MS's 
> existing mechanisms already ensure that a child takes a batch from 
> beginning right through to completion.)
>
> If it DOES find that id, this indicates that something has badly gone 
> wrong (e.g. previous child segfaulted, so didn't remove ids in this 
> batch from the database).  Many of those ids, of course, will be 
> innocent: they will be there because another email (id) in an earlier 
> batch had failed.
>
> To counter that, the database could also store a timestamp.  On 
> finding such an email, a child would skip that id if it was relatively 
> young (e.g. less than 10 minutes since last timestamp), or process it 
> _on its own_ if relatively old (e.g. older than ten minutes).  That 
> way, the innocent email would only be held up for a short period (e.g. 
> ten minutes).
>
> (There are probably some cleverer things that could be done (and 
> additional things that ought to be done), but at this stage I'm simply 
> trying to outline the general idea.)
>
> ====end====
>
> How does that sound?
>
> Naturally I would be happy to assist beta-testing if you wish.
>
>

Jules

-- 
Julian Field MEng CITP CEng
www.MailScanner.info
Buy the MailScanner book at www.MailScanner.info/store

MailScanner customisation, or any advanced system administration help?
Contact me at Jules at Jules.FM

PGP footprint: EE81 D763 3DB0 0BFD E1DC 7222 11F6 5947 1415 B654
PGP public key: http://www.jules.fm/julesfm.asc
