too many deletions => lost messages
Mariano Absatz
mailscanner at LISTS.COM.AR
Mon Nov 3 18:42:02 GMT 2003
Hi Julian,
a few days ago, I noticed a message was aparently "eaten" by my
MailScanner... I had it arrived in the (zmailer's) smtpserver, but never
showed in my personal MailScanner loggin' (done via CustomConfig.pm).
I did a little bit of debuggin' and found that sometimes
MailScanner::SMDiskStore::WriteEntireMessage() did not find the message
file in $this->{hdpath} (in ZMDiskStore.pm).
As I couldn't figure out why, I handed it to Leo (our Perl wizard, who
actually wrote much of the ZMailer code) and he found what the problem
was (and I found an explanation of why this seems to only affect ZMailer
and not the other mailers).
Leo filled sources with additional loggin' and he found out the
following:
The deletion function (MailScanner::MessageBatch::RemoveDeletedMessages)
may be called many times for the same message (in the main MailScanner
script).
Once after DeliverUnscanned()
Once after DeleteUnwantedCleaned()
Once after DeliverCleaned()
...
Message::DeleteMessage is also called from within DisinfectAndDeliver...
Now, AFAIK, the filename for a message is not reused in the other MTAs
(at least in Sendmail and Exim which have a timestamp codified in the
filename, I dunno about Postfix)... but ZMailer queue filenames are only
based on the inode number, so they actually repeat a lot and (at least on
linux) really fast, since inode numbers can be reused as soon as they
become available (e.g. because the file was deleted).
Now, if I have the followin' situation:
1) MailScanner process X starts processing a batch which comprises the
queue file /var/spool/postoffice-incoming/router/12345 (called like that
by ZMailer because its inode number is 12345).
2) MS process X after running after DeleteUnwantedCleaned() calls
RemoveDeletedMessages(), which actually removes file
/var/spool/postoffice-incoming/router/12345 (thus inode 12345 is now
available)
3) ZMailer's smtpserver receives a new message, creates a file and, since
the OS assigns inode number 12345 to it, the file finally becomes
/var/spool/postoffice-incoming/router/12345
4) MS process Y (not X) starts processing a new batch which comprises
this new queue file /var/spool/postoffice-incoming/router/12345.
5) Now, the old MS process X, after calling DeliverCleaned(), again calls
DeleteUnwantedCleaned() which, as the message object for the already
deleted file is still there with its {deleted} attribute on, tries to
delete it again... and, as the file exists, it _is_ actually deleted...
but this file belonged to Y's batch.
6) MS process Y wants to copy the message to someplace else to keep
processing it, but alas, the file is not there :-(
The message was actually lost and cannot be recovered...
The problem comes from assuming that queue filenames are not repeated,
which is true of Sendmail and Exim (and may be true for Postfix, I don't
know), but is blatanly false for ZMailer...
OTOH, the {deleted} attribute of the Message object, depending on where
it is used means "must be deleted" or "has been deleted".
I'm working with 4.23, but I don't think there were many changes in this
neighborhood in 4.24 or 4.25...
For what I have seen, it should be pretty easy to make {deleted} mean
"has to be deleted" and create a new attribute {actuallydeleted} or
something like that that be set after actual deletion and checked before
it so we don't double-delete messages.
More analysis seem to show that SMDiskStore::Delete() is never called...
is this dead code?
Regards.
--
Mariano Absatz
El Baby
----------------------------------------------------------
/"\ |
\ / ASCII RIBBON CAMPAIGN |
X AGAINST HTML MAIL |
/ \ |
More information about the MailScanner
mailing list