MS/perl segfaults

Mon Nov 10 15:47:24 GMT 2008

David Lee wrote:
>
> Julian: Over the years MailScanner has served us extremely well, and 
> we continue to rely on it and be thankful for your work on it.
>
> But I'm currently clearing a backlog of 66,000+ emails from the weekend.
>
>
> Occasionally (perhaps once a year) we get a particular class of 
> problem (and from skim-reading the list I believe others see this 
> also), namely, that a message, or messages, will arrive which cause 
> MailScanner (more likely one of its perl modules) to segfault.  A 
> (quote) shouldn't happen (unquote) thing that, nevertheless, 
> occasionally does happen.
>
> We've just had such an incident over the weekend.  And there were 
> enough such messages (about 100) to cause all the child MS processes 
> (20) to segfault on most occasions that they processed a batch (30).  
> The net result is that our inbound queue grew, and very little 
> trickled through, because the MS processes segfaulted, re-tried, 
> segfaulted, retried, ...
>
> (The failure of one message in the batch causes the whole batch to be 
> delayed until the next child attempt; and the chances are that the new 
> batch will also suffer a segfault.)
>
> As I say, such instances are rare, but they do happen.  And when they 
> happen they can hit hard.
>
> For this particular instance, I'd be happy to send you (offlist?) 
> details, including sample messages, "MailScanner -V", OS etc.  (Let me 
> know.)
>
> But that still leaves a general problem of MS (and/or its modules) being 
> susceptible to emails (possibly malformed HTML spams) that can cause 
> this behaviour.
>
> So here is a suggestion for a _general_ fix against general segfaults 
> (to allow the other emails not to become "collateral damage").
>
>
> ====begin====
>
> When an MS child starts processing a batch, for each email temporarily 
> put its id (e.g. sendmail "df/qf" number) into a small "being 
> processed" database (e.g. a trivial db/dbm).
>
> When the child finishes the batch, remove those ids of the batch from 
> that database.
>
> So for a system of 'c' children and batch-size 'b', the maximum number 
> of entries at any time in that database will be 'c*b': rarely more 
> than a few hundred, and so trivial for a db/dbm thing.  (And if the 
> inbound mqueue is empty, the database should correspondingly be empty.)
>
> Now here's the crucial detail:  When the child starts its batch it 
> also quickly checks that those ids are not already present in the 
> database. (In normal use, they would never be present, as MS's 
> existing mechanisms already ensure that a child takes a batch from 
> beginning right through to completion.)
>
> If it DOES find such an id, this indicates that something has gone badly 
> wrong (e.g. a previous child segfaulted, so didn't remove its batch's 
> ids from the database).  Many of those ids, of course, will be 
> innocent: they will be there because another email (id) in an earlier 
> batch had failed.
>
> To counter that, the database could also store a timestamp.  On 
> finding such an email, a child would skip that id if it was relatively 
> young (e.g. less than 10 minutes since last timestamp), or process it 
> _on its own_ if relatively old (e.g. older than ten minutes).  That 
> way, the innocent email would only be held up for a short period (e.g. 
> ten minutes).
>
> (There are probably some cleverer things that could be done (and 
> additional things that ought to be done), but at this stage I'm simply 
> trying to outline the general idea.)
>
> ====end====
>
> How does that sound?
>
> Naturally I would be happy to assist beta-testing if you wish.
>
>

I've never been bitten by that problem in the past but I nonetheless 
like this idea.
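
To make the proposal concrete, here is a rough sketch of the bookkeeping 
as I read it. It is only an illustration in Python, not MailScanner code: 
the names classify, mark, clear and STALE_AFTER are made up, the store is 
any dict-like object (a real one would have to be an on-disk db/dbm that 
survives a crashed child), and locking between children is left out 
entirely.

```python
STALE_AFTER = 600  # seconds; the "ten minutes" from the proposal (assumed value)


def classify(db, ids, now):
    """Split candidate ids into (fresh, retry_alone, skip).

    fresh:       not in the store, safe to process as a normal batch
    retry_alone: left behind by a crashed child long enough ago that
                 each one should now be processed on its own
    skip:        left behind recently; leave for a later pass
    """
    fresh, retry_alone, skip = [], [], []
    for msg_id in ids:
        if msg_id not in db:
            fresh.append(msg_id)
        elif now - float(db[msg_id]) >= STALE_AFTER:
            retry_alone.append(msg_id)
        else:
            skip.append(msg_id)
    return fresh, retry_alone, skip


def mark(db, ids, now):
    """Record ids as 'being processed', with a timestamp, before scanning."""
    for msg_id in ids:
        db[msg_id] = repr(now)


def clear(db, ids):
    """Remove ids once the child has finished them without crashing."""
    for msg_id in ids:
        if msg_id in db:
            del db[msg_id]
```

A child would call classify() on its candidate ids, mark() the fresh ones 
(or a single stale one), scan them, then clear() them. If the child 
segfaults, clear() never runs, and that is what the next child detects. 
With the 20 children and batches of 30 mentioned above, the store would 
never hold more than 20*30 = 600 entries.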

Denis

-- 
   _
  °v°   Denis Beauchemin, analyste
 /(_)\  Université de Sherbrooke, S.T.I.
  ^ ^   T: 819.821.8000x62252 F: 819.821.8045