MailScanner/Postfix message duplication - possible fix

Drew Marshall drew at THEMARSHALLS.CO.UK
Tue Dec 16 22:53:31 GMT 2003


I wonder if this is only part of the story. Not being a programmer (Or
even someone who 'tinkers with code'!) please forgive me if I am being
stupid or just plain don't understand :-)

The queue manager runs the queues when it's either called by receipt of
a 1byte message from another part of Postfix or when it's inactivity
timer times out (As set in the master.cf file). I had a play with this
to start with and when I set the idle timer to 28 days I still got
duplicates and the 'skipped' log entry from when MailScanner happened to
be picking up a queued file and the queue runner had been called by
smtpd because it had just received a message. In some instances (One
larger message of 9Mb) this meant on my slow system that I didn't just
get duplicates but I got the damn thing 5 times, in various states of
delivery as it spooled into the deferred queue.

Now my gamble is that moving your times to 40 seconds or even more will
probably not cure the problem as if your system is fairly busy the queue
runner will be almost continuously running through the deferred queue as
it collects mail and checks for messages that are due for attempted
redelivery (I guess this happens on every visit to the queue to ensure
that ageing messages are not left in deferred for too long). It's that
check that could be the problem. If MS is just about to collect the
message when the queue runner inspects the message for age (Not worth
locking for? Don't know?) then the two paps collide and cause the
situation as seen. It won't matter how long you tell MS to leave the
message there for, the queue runner could still bump into the
collection. On my much quieter system it will probably work more
reliably for longer as the queue runner will be called less by smtpd an
more by the inactivity timer.

One way round this could be to send the messages to the hold queue as
the queue runner never runs in there. Now just to get the messages there...

As I say I could be talking rubbish and I'll go away and keep going with
what ever experiment people want to fix this issue but I though it was
worth knocking some thoughts about.

Regards

Drew

Ash, Howard (Iv,URL/CID) wrote:

>I just got nailed with a few more duplicates, and it was up to 40
>seconds between the postfix "(deferred transport)" log entry and the
>"skipped, still being delivered" log entry.  So I'm going to change the
>waiting period on my system to 60 seconds (time + 940) and see how it
>goes.
>
>
>Howard
>
>
>
>-----Original Message-----
>From: Kash, Howard (Civ,ARL/CISD)
>Sent: Tuesday, December 16, 2003 9:42 AM
>To: MAILSCANNER at JISCMAIL.AC.UK
>Subject: MailScanner/Postfix message duplication - possible fix
>
>
>Here is a solution Julian proposed to the postfix/still being
>delivered/duplicate message problem back in September.  Based on my
>analysis of the Postfix code and logs from actual occurrences of the
>bug, I think this is along the right track.  However, postfix postdates
>messages that it moves into the deferred queue by 1000 seconds
>(minimal_backoff_time default value).  My version of this patch is:
>
>    next if ($ModDate{$file} + 10) > (time + 1000);
>
>or more efficiently:
>
>    next if $ModDate{$file} > (time + 990);
>
>This accounts for the 1000 second postdate period and adds 10 seconds to
>get around the apparent race condition.  In every occurrence that I've
>seen of the bug, MailScanner starts it's scan just as a message is being
>processed (moved into the deferred queue) by postfix.  I think there is
>a brief instance when postfix does not have a lock on the file and
>MailScanner picks it up (and locks it).  Then postfix tries to lock the
>file.  Seeing that it is already locked, it generates the "skipped,
>still being delivered" message and backs off for 60 seconds (see
>nqmgr/qmgr_active.c:qmgr_active_feed()) and then re-queues the message
>again.
>
>You will need to adjust the 1000 second value if you have changed the
>default postfix setting for minimal_backoff_time.  You may also want to
>play around with the 10 second delay if it's too long or short.  Since
>the bug is very difficult to reproduce and occurs so infrequently, it's
>hard to say yet if this is actually working.  If others could try it out
>and let the list know if it seems to be working for them, maybe Julian
>can add it to the next release.  The only side affect of adding this
>line will be a 10 second delay in mail delivery.
>
>
>
>Howard
>
>
>-----Original Message-----
>From: Julian Field [mailto:mailscanner at ECS.SOTON.AC.UK]
>Sent: Thursday, September 04, 2003 6:45 AM
>To: MAILSCANNER at JISCMAIL.AC.UK
>Subject: Re: MailScanner+PostFix ---- try this
>
>
>Here's a patch to Postfix.pm. I know it's not exactly a neat solution to
>the problem, but if it fixes it I will know I have found the problem.
>
>--- Postfix.pm.old      2003-09-01 12:28:21.000000000 +0100
>+++ Postfix.pm  2003-09-04 11:49:17.000000000 +0100
>@@ -1132,6 +1132,9 @@
>        #print STDERR "Files are " . join(', ', @SortedFiles) . "\n";
>        while(defined($file = shift @SortedFiles) &&
>              $HitLimit1+$HitLimit2+$HitLimit3+$HitLimit4<1) {
>+        # Yes I know this is a hack but it will help isolate the
>problem
>+        next if $ModDate{$file} > time-3;
>+
>          # must separate next two lines or $1 gets re-tainted by being
>part of
>          # same expression as $file [mumble mumble grrr mumble mumble]
>          #print STDERR "Reading file $file from list\n";
>
>



More information about the MailScanner mailing list