MS/perl segfaults

Sat Jan 17 14:03:01 GMT 2009

Re-visiting this issue.
Is it still a problem?
Is it worth attempting to solve?

In the following descriptions, all timings would be configurable. It's 
just easier to think about the problem with real numbers in there.

When we scan the queue to build a batch, we look for unlocked messages 
as normal. When we find an unlocked message, we check whether it is in 
the database table and, if so, how long ago it was first scanned.
If it was first scanned less than 20 minutes ago, we ignore it for now, 
in case it was a one-off failure, or a failure caused by other messages 
in the same batch.
If it was first scanned 20-40 minutes ago, we scan it in a batch of 1 
message, on its own.
If it was first scanned more than 40 minutes ago, we ignore it 
completely and log the event as a scanner failure. Or we could mark it 
as infected instead, since a DoS attack attempt would be a reasonable 
conclusion. What are your thoughts here?
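
To make the timing rules above concrete, here is a rough sketch. It is 
in Python purely for illustration (MailScanner itself is Perl), and the 
names Action, decide, RETRY_ALONE_AFTER and GIVE_UP_AFTER are made up 
for this example, not anything in the MailScanner code.

```python
from enum import Enum

# Illustrative thresholds in seconds; in the proposal these would be configurable.
RETRY_ALONE_AFTER = 20 * 60
GIVE_UP_AFTER = 40 * 60


class Action(Enum):
    SCAN_NORMALLY = "scan in a normal batch"
    SKIP = "ignore for now"
    SCAN_ALONE = "scan in a batch of 1"
    GIVE_UP = "ignore completely and log a scanner failure"


def decide(first_scanned, now):
    """Pick what to do with an unlocked message.

    first_scanned is the time the message was first scanned, or None if
    the message is not in the database table. Both times are in seconds.
    """
    if first_scanned is None:
        return Action.SCAN_NORMALLY
    age = now - first_scanned
    if age < RETRY_ALONE_AFTER:
        return Action.SKIP          # may have been a one-off or a bad neighbour
    if age <= GIVE_UP_AFTER:
        return Action.SCAN_ALONE    # retry on its own
    return Action.GIVE_UP           # repeated failure: log it (or mark infected)
```

Which side of each boundary the exact 20 and 40 minute marks fall on is 
an arbitrary choice in this sketch.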

I need to catch every route by which a message leaves the batch and 
remove it from the database table; that's my problem.
I also need to find all the race conditions when checking the database 
for a message, but that's also my problem.
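
One way the table bookkeeping might look, again as a Python sketch 
rather than real MailScanner code. The table name "processing" and the 
functions open_tracker, claim and release are assumptions for the 
example. The check and the insert are wrapped in a single BEGIN 
IMMEDIATE transaction so that two children cannot both decide the same 
message is new.

```python
import sqlite3
import time


def open_tracker(path=":memory:"):
    # isolation_level=None leaves transaction control to the explicit
    # BEGIN/COMMIT statements below.
    db = sqlite3.connect(path, isolation_level=None)
    db.execute(
        "CREATE TABLE IF NOT EXISTS processing ("
        "id TEXT PRIMARY KEY, first_scanned REAL NOT NULL)"
    )
    return db


def claim(db, msg_id, now=None):
    """Return the first-scanned time if msg_id is already in the table.

    If it is not there yet, record it with the current time and return None.
    """
    now = time.time() if now is None else now
    db.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = db.execute(
            "SELECT first_scanned FROM processing WHERE id = ?", (msg_id,)
        ).fetchone()
        if row is None:
            db.execute(
                "INSERT INTO processing (id, first_scanned) VALUES (?, ?)",
                (msg_id, now),
            )
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise
    return None if row is None else row[0]


def release(db, msg_id):
    """Call on every route by which a message leaves a batch."""
    db.execute("DELETE FROM processing WHERE id = ?", (msg_id,))
```

A child that segfaults never reaches release(), so its messages stay in 
the table with their original first-scanned time, which is what the 
next scan of the queue would see.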

What do you think of the approach above?

Your comments would be most welcome.

Cheers,
Jules.

On 10/11/08 17:11, Julian Field wrote:
> One immediate thought: the only reproducible instance of this problem 
> was caused by the HTML parser, and I wrote a solution to that in a 
> recent release, it's in the Change Log.
>
> But yes, your idea is a possibility, now that I'm using SQLite. Doing 
> it with a dbm file is not really practical due to high contention for 
> the exclusive write locks on the file. SQLite may be able to do it 
> rather better.
>
> There are quite a few routes that lead to a message leaving a batch, 
> and I would have to catch all of those, time for a quick code review 
> of a few chunks I think.
>
> If a message is more than 20 minutes old and still in the database, 
> then we do a batch containing only 1 message, and log it. If we find a 
> message more than 30 minutes old, then we log it and ignore it.
>
> How many ways could this process go wrong? All existing 
> exclusion-locks would still apply, so if a message was more than 20 
> minutes old and is being re-tried and is still locked, that lock still 
> applies.
>
> What are the failure modes of this scheme? I refuse to believe there 
> aren't any. We need to cover as many of them as possible and come up 
> with remedies for them.
>
> Jules.
>
> David Lee wrote:
>>
>> Julian: Over the years MailScanner has served us extremely well, and 
>> we continue to rely on it and be thankful for your work on it.
>>
>> But I'm currently clearing a backlog of 66,000+ emails from the weekend.
>>
>>
>> Occasionally (perhaps once a year) we get a particular class of 
>> problem (and from skim-reading the list I believe others see this 
>> also), namely, that a message, or messages, will arrive which cause 
>> MailScanner (more likely one of its perl modules) to segfault.  A 
>> (quote) shouldn't happen (unquote) thing that, nevertheless, 
>> occasionally does happen.
>>
>> We've just had such an incident over the weekend.  And there were 
>> enough such messages (about 100) to cause all the child MS processes 
>> (20) to segfault on most occasions that they processed a batch (30).  
>> The net result is that our inbound queue grew, and very little 
>> trickled through, because the MS processes segfaulted, re-tried, 
>> segfaulted, retried, ...
>>
>> (The failure of one message in the batch causes the whole batch to be 
>> delayed until the next child attempt; and the chances are that new 
>> batch will also suffer a segfault.)
>>
>> As I say, such instances are rare, but they do happen.  And when they 
>> happen they can hit hard.
>>
>> For this particular instance, I'd be happy to send you (offlist?) 
>> details, including sample messages, "MailScanner -V", OS etc.  (Let 
>> me know.)
>>
>> But that still leaves a general problem of MS (+/ modules) being 
>> susceptible to emails (possibly malformed HTML spams) that can cause 
>> this behaviour.
>>
>> So a suggestion for a _general_ fix against general segfaults (to 
>> allow the other emails not to become "collateral damage").
>>
>>
>> ====begin====
>>
>> When an MS child starts processing a batch, for each email 
>> temporarily put its id (e.g. sendmail "df/qf" number) into a small 
>> "being processed" database (e.g. a trivial db/dbm).
>>
>> When the child finishes the batch, remove those ids of the batch from 
>> that database.
>>
>> So for a system of 'c' children and batch-size 'b', the maximum 
>> number of entries at any time in that database will be 'c*b': rarely 
>> more than a few hundred, and so trivial for a db/dbm thing.  (And if 
>> the inbound mqueue is empty, the database should correspondingly be 
>> empty.)
>>
>> Now here's the crucial detail:  When the child starts its batch it 
>> also quickly checks that those ids are not already present in the 
>> database. (In normal use, they would never be present, as MS's 
>> existing mechanisms already ensure that a child takes a batch from 
>> beginning right through to completion.)
>>
>> If it DOES find that id, this indicates that something has badly gone 
>> wrong (e.g. previous child segfaulted, so didn't remove ids in this 
>> batch from the database).  Many of those ids, of course, will be 
>> innocent: they will be there because another email (id) in an earlier 
>> batch had failed.
>>
>> To counter that, the database could also store a timestamp.  On 
>> finding such an email, a child would skip that id if it was 
>> relatively young (e.g. less than 10 minutes since last timestamp), or 
>> process it _on its own_ if relatively old (e.g. older than ten 
>> minutes).  That way, the innocent email would only be held up for a 
>> short period (e.g. ten minutes).
>>
>> (There are probably some cleverer things that could be done (and 
>> additional things that ought to be done), but at this stage I'm 
>> simply trying to outline the general idea.)
>>
>> ====end====
>>
>> How does that sound?
>>
>> Naturally I would be happy to assist beta-testing if you wish.
>>
>>
>
> Jules
>

-- 
Julian Field MEng CITP CEng
www.MailScanner.info
Buy the MailScanner book at www.MailScanner.info/store

MailScanner customisation, or any advanced system administration help?
Contact me at Jules at Jules.FM

PGP footprint: EE81 D763 3DB0 0BFD E1DC 7222 11F6 5947 1415 B654
PGP public key: http://www.jules.fm/julesfm.asc
