Crash protection

Fri Mar 6 09:57:10 GMT 2009

On Thu, 5 Mar 2009, David Lee wrote:

> On Wed, 4 Mar 2009, Julian Field wrote:
> [...]
>> Please try the attached MessageBatch.pm (which I have compressed, of 
>> course).
>> Please let me know if this fixes the problem.
>
> Will do; I have just installed it.  (I made sure the inbound queue was empty 
> and removed the previous "Processing.db" to give it a clean start.)
> [...]

First, the bad news: it is still occurring, so the patch seems not to have 
made any difference.

-----------------------------------------------------------
Tries   Message Last Tried
=====   ======= ==========
1       n2650oUu021398  Fri Mar  6 05:05:35 2009
1       n2647uja010341  Fri Mar  6 04:12:49 2009
1       n2610rCJ022463  Fri Mar  6 01:05:22 2009
1       n2610rjK022464  Fri Mar  6 01:03:38 2009
1       n25J0ovL023772  Thu Mar  5 19:03:52 2009
1       n25I0msJ026885  Thu Mar  5 18:04:11 2009
1       n25H0sF7025852  Thu Mar  5 17:06:29 2009
1       n25H0oK1025828  Thu Mar  5 17:06:26 2009
1       n25C0uSx007184  Thu Mar  5 12:05:31 2009
1       n25A0bJ6029642  Thu Mar  5 10:05:57 2009
1       n25A0qAP029669  Thu Mar  5 10:05:12 2009
1       n25A0ZJX029632  Thu Mar  5 10:04:27 2009
-----------------------------------------------------------

Now the possibly good news.

Note that the times in both the above set and the previous set are 
consistently soon after the hour.  Pattern?  And when I look in the 
logfile for the sendmail id (the "n2..."), their final entries are 
followed within one or two seconds by all the MS processes catching a 
SIGHUP.  More than coincidence?

(Strictly, the times shown are "next retry" values, i.e. time-now plus a 
random addition; what they really reflect are the last updates to 
"Processing.db" from a few minutes earlier.)
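
As a quick check of the "soon after the hour" pattern, here is a small 
Python sketch that computes how many minutes past the hour each "Last 
Tried" value in the table falls.  The timestamps are copied from the 
table above; the function name is just mine, for illustration.

```python
from datetime import datetime

# "Last Tried" values copied from the table above.
LAST_TRIED = [
    "Fri Mar  6 05:05:35 2009",
    "Fri Mar  6 04:12:49 2009",
    "Fri Mar  6 01:05:22 2009",
    "Fri Mar  6 01:03:38 2009",
    "Thu Mar  5 19:03:52 2009",
    "Thu Mar  5 18:04:11 2009",
    "Thu Mar  5 17:06:29 2009",
    "Thu Mar  5 17:06:26 2009",
    "Thu Mar  5 12:05:31 2009",
    "Thu Mar  5 10:05:57 2009",
    "Thu Mar  5 10:05:12 2009",
    "Thu Mar  5 10:04:27 2009",
]


def minutes_past_hour(stamp):
    """Return the offset from the top of the hour, in minutes."""
    # split()/join() collapses the double space before single-digit days.
    normalised = " ".join(stamp.split())
    t = datetime.strptime(normalised, "%a %b %d %H:%M:%S %Y")
    return t.minute + t.second / 60.0


offsets = [minutes_past_hour(s) for s in LAST_TRIED]
print("earliest: %.1f min, latest: %.1f min past the hour"
      % (min(offsets), max(offsets)))
```

Every entry lands within the first quarter of an hour, which is what one 
would expect if something hourly is involved and the stored value has a 
few minutes of retry delay added.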

We have been running your spear-phishing script.  And, of course, this has 
an hourly cron-job which ends: "service MailScanner reload".  Again, more 
than coincidence?

I suspect some sort of interaction.  Going into the realms of speculation: 
When this new, db-enabled, version of MS has successfully processed any 
email it now has to do two things:
   1. Deliver it to the next stage, e.g. out-queue (ham); deletion (spam)
   2. Remove from "Processing.db"

In all cases these need to happen as a single, atomic action.  So I 
suspect there is at least one outcome (particularly when "spam actions are 
delete") in which these events are happening separately and 
non-atomically, with the risk of an MS restart coming between them.

Guess: for a spam-deletion, MS firstly removes the {df,qf} pair from 
in-queue but only later gets around to removing it from "Processing.db". 
If MS stops (HUP signal, etc.) between them, then stale entries are left 
in "Processing.db".

Is there sufficient signal-trapping to keep these things atomic?  (There 
may be other areas where this might apply.)
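
To illustrate the sort of signal-trapping I mean, here is a minimal 
sketch in Python (not MailScanner's actual Perl code, and I have not 
checked what MessageBatch.pm really does).  It holds SIGHUP back while 
the two steps run, so a reload arriving between them is only acted on 
afterwards.  The names finish_message, signals_blocked and the "between" 
hook are hypothetical, and the in-queue and "Processing.db" are stood in 
for by a plain set and dict.

```python
import os
import signal
from contextlib import contextmanager

events = []  # records the order in which things happen (demo only)


def on_hup(signum, frame):
    # Stand-in for "all the MS processes catching a SIGHUP".
    events.append("reload")


@contextmanager
def signals_blocked(*signums):
    """Hold the given signals pending until the with-block exits."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, signums)
    try:
        yield
    finally:
        # Restoring the mask lets any pending signal be delivered now.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)


def finish_message(msg_id, in_queue, processing_db, between=lambda: None):
    """Do both clean-up steps with SIGHUP deferred (hypothetical helper)."""
    with signals_blocked(signal.SIGHUP):
        in_queue.discard(msg_id)          # step 1: e.g. spam deletion
        events.append("removed from in-queue")
        between()                         # demo hook: a reload arrives here
        processing_db.pop(msg_id, None)   # step 2: clear the db entry
        events.append("removed from Processing.db")


signal.signal(signal.SIGHUP, on_hup)
in_queue = {"n2650oUu021398"}
processing_db = {"n2650oUu021398": 1}
finish_message(
    "n2650oUu021398", in_queue, processing_db,
    between=lambda: os.kill(os.getpid(), signal.SIGHUP),
)
print(events)
```

In this sketch the "reload" event is recorded only after both removals, 
so no stale entry is left behind.  Blocking a signal like this only 
covers signals that can be caught; a SIGKILL or a machine crash could 
still land between the two steps.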

Plausible?

-- 

:  David Lee                                I.T. Service          :
:  Senior Systems Programmer                Computer Centre       :
:  UNIX Team Leader                         Durham University     :
:                                           South Road            :
:  http://www.dur.ac.uk/t.d.lee/            Durham DH1 3LE        :
:  Phone: +44 191 334 2752                  U.K.                  :