More encoded subject woes

Wed May 24 13:39:29 IST 2006

On 5/23/06, Nick Smith <nick.smith67 at googlemail.com> wrote:
> On 5/20/06, Nick Smith <nick.smith67 at googlemail.com> wrote:
> > On 5/19/06, Nick Smith <nick.smith67 at googlemail.com> wrote:
> > > Hi,
> > >
> > > MS 4.54-2 / Postfix 2.10
> > >
> > > I've got more trouble with encoded subject headers being "mishandled"
> > > from a recipient's point of view. The issue occurs when, for whatever
> > > reason, MIME-Tools is unable to decode an encoded subject properly -
> > > this example is UTF-8, but I don't know if it may affect other
> > > encoding types too
> > >
> > > =?UTF-8?B?5oOF5aCx6YCj57Wh56WoIC0gVVNHcumVt+WQiOitsOW+heOBoSA=?==?
> > > UTF-8?B?LSDnrKzvvJTvvJjlm57lhajml6XmnKwgLSDlsZXnpLrkvJrjga7lh7rlsZU=?=
> > >
> > > If you feed that string to MIME::WordDecoder::unmime it returns:
> > >
> > > ????? - USGr????? - ??????? - ??????
> > >
> > > I have absolutely no idea why this happens - whether it's a bug or
> > > expected behaviour on the part of MIME-Tools, but I assume that each
> > > question mark represents a multi-byte (Japanese in this case)
> > > character that it was not possible to decode
> > >
> > > Drop the same string into an Outlook message and send it via SMTP
> > > (making sure that it bypasses MailScanner), and when it arrives it
> > > should show a bunch of Japanese characters. The recipients are
> > > understandably not happy that the subject of their email when it shows
> > > up has been replaced by a bunch of question marks
> > >
> > > I've worked around this problem with a patch against Postfix.pm
> > > (attached), but I'm less than comfortable with it. Basically what it
> > > does is to unmime into a temporary holding string instead of the
> > > $message structure and then take a look at the results of its
> > > handiwork. If it sees more than an arbitrary number of consecutive ?'s
> > > (I picked more than 3 as a reasonable number), it assumes that the
> > > unmime was unsuccessful and allows the original encoded subject to
> > > pass. Otherwise it assumes decode success and fills the
> > > message->{subject} structure with the unmime result
> > >
> > > The first problem is that the ???? test is far from foolproof -
> > > there's plenty of scope for false positives and false negatives. The second
> > > problem is I'm not sure what issues this might cause if MS has to
> > > alter the subject later. I'm not altering any subjects at all so it
> > > wouldn't be a worry on my system but...
> > >
> > > Clearly I'm working with Postfix here, but this affects other MTAs
> > > too. Equally clearly the proper answer is to figure out what's up with
> > > MIME-Tools, but I'm afraid that's way beyond my capabilities :(
> > >
> > > Thoughts appreciated
> > >
> > > Thanks
> > >
> > > Nick
> > >
> > >
> > >
> >
> > Please ignore all of this - I think I've been fed old news by the
> > group that reported this to me as an issue
> >
> > I'm pretty certain that their problem was actually the "Postfix
> > truncates multi-line subject" thing that Julian already fixed for me,
> > and that when they said they were still having the issue after
> > re-testing they were mistaken
> >
> > I am working on the assumption that the ???? output from the unmime
> > function is just an ASCII representation but it was plenty enough to
> > confuse me :(
> >
> > Sorry for the false alarm
> >
> > Thanks
> >
> > Nick
> >
> Oh dear - it seems that maybe there is something in what I first
> suggested. Please take a look at this UTF-8 encoded string from a mail
> subject:
>
> Subject: =?UTF-8?B?NDXmmYLplpPmrovmpa3otoXpgY7nlLPoq4sg5om/6KqN5L6d6aC8IFtJ?=
>  =?UTF-8?B?S0VEQSBZT0hFSSDmsaDnlLAg5rSL5bmzXSAg?=
>
> MIME-Tools doesn't seem able to decode this, and the original encoded
> subject does get replaced by a bunch of ?'s (a single ? in place of
> where each double byte Japanese character should be). Microsoft seems
> to have no problem decoding this
>
> The thing I still don't get at all with MailScanner is under what
> circumstances the original encoded format subject header gets replaced
> by the unmimed version as part of onward delivery
>
> What I mean by this is that if a subject gets successfully unmimed
> then it gets sent onwards in its original MIME form - if the unmime is
> not successful however (as in this case) then the subject header in
> the message itself gets physically replaced with the "broken" ASCII
> representation where ?'s substitute for double byte characters
>
> I'd very much appreciate any insight into this problem - does the
> unmime function have a return code that could be tested for success
> before using its output for example?
>
> Unfortunately my previous strategy of testing for n successive ?'s
> isn't going to work because I think all double-byte characters will
> appear as a ? in the Perl string test whether or not the decode was
> successful. I
> also have not managed to figure out what dependencies there are here
> that affect MailScanner's ability to do a subject rewrite if it needs
> to insert a string of its own
>
> Thanks
>
> Nick
>
OK - I wonder what the record is for replying to your own posts on this list :)

...anyway, I have finally figured out the exact cause of this so no
more aimless rambling or speculation

When decoded, the string
"=?UTF-8?B?NDXmmYLplpPmrovmpa3otoXpgY7nlLPoq4sg5om/6KqN5L6d6aC8IFtJ?==?UTF-8?B?S0VEQSBZT0hFSSDmsaDnlLAg5rSL5bmzXSAg?="
contains 2 trailing spaces. That isn't immediately obvious, but it
matters because the SweepContent module checks subjects for evidence
of malicious content and attempts to clean them up. This isn't
configurable or optional in any way; it is just what MS does.
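
For anyone who wants to confirm the trailing spaces without going
through MIME-Tools, here is a minimal sketch using only core Perl
modules (MIME::Base64 and Encode). The helper names decoded_subject
and trailing_spaces are made up for this example, and the payloads
are simply the base64 parts of the subject above with the
=?UTF-8?B? ... ?= wrappers stripped by hand:

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(decode);

# The two base64 payloads from the subject header, without the
# =?UTF-8?B? ... ?= wrappers
my @payloads = (
    'NDXmmYLplpPmrovmpa3otoXpgY7nlLPoq4sg5om/6KqN5L6d6aC8IFtJ',
    'S0VEQSBZT0hFSSDmsaDnlLAg5rSL5bmzXSAg',
);

# Decode each payload to raw bytes, join them, then interpret the
# result as UTF-8 text (hypothetical helper, not part of MailScanner)
sub decoded_subject {
    my $bytes = join '', map { decode_base64($_) } @payloads;
    return decode('UTF-8', $bytes);
}

# Count the spaces at the very end of a string
sub trailing_spaces {
    my ($text) = @_;
    my ($tail) = $text =~ /( *)\z/;
    return length $tail;
}

printf "trailing spaces: %d\n", trailing_spaces(decoded_subject());
# prints "trailing spaces: 2"
```

Only the count is printed, so the result does not depend on whether
the terminal can display Japanese.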

One of the things it does is to remove trailing whitespace. However,
if the subject is MIME encoded, it can't act on the subject itself
directly, and instead does its work on the decoded version as returned
by the unmime function

This is fine until the encoded string contains multibyte Unicode
data, which of course cannot be represented in an ASCII string (which
is why it was encoded to begin with). The unmime function uses a ? as
a placeholder for each multibyte character it finds.

Provided that SweepContent doesn't find any "badness" in the decoded
representation of the subject that it's looking at, MS will allow the
*original* encoded subject to pass unmolested. However, if it decides
any changes need to be made it completely replaces the original
encoded subject header with the ("cleaned") decoded representation
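
To make that behaviour concrete, here is a rough model of it. This is
NOT the SweepContent code, just my understanding of its effect. The
function names ascii_view and header_to_deliver are hypothetical, and
I am using Encode's default ASCII substitution (a ? for each
unmappable character) as a stand-in for what unmime returns:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for the decoded form that gets inspected: every character
# outside ASCII becomes a ? (Encode's default substitution for ascii)
sub ascii_view {
    my ($text) = @_;
    return encode('ascii', $text);
}

# Hypothetical model of the clean-up: keep the original encoded header
# unless stripping trailing whitespace changed the decoded view, in
# which case the cleaned ASCII view replaces the header entirely
sub header_to_deliver {
    my ($encoded_header, $decoded_text) = @_;
    my $view = ascii_view($decoded_text);
    (my $cleaned = $view) =~ s/\s+\z//;
    return $cleaned eq $view ? $encoded_header : $cleaned;
}

# "IKEDA" followed by two kanji, written as escapes so the script
# does not depend on the source file's encoding
my $name = "IKEDA \x{6C60}\x{7530}";

print header_to_deliver('<original encoded header>', $name), "\n";
# prints "<original encoded header>"
print header_to_deliver('<original encoded header>', "$name  "), "\n";
# prints "IKEDA ??"
```

The only difference between the two calls is the pair of trailing
spaces, yet the second one loses the Japanese text for good.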

It may well be that the MS team considers this unfortunate but
unavoidable collateral damage, and that the fix is "don't do that"
when it comes to putting spaces at the end of subjects that have to be
encoded. However, I'm sure everybody would agree that it isn't always
easy to convince application developers that their code is "wrong",
particularly when it involves a practice which is not actually
forbidden as such, and even more so when "it works fine with every
other mail gateway".

Anyway, for the meantime, I am doing this in Postfix.pm, which lets
MS tolerate up to 2 trailing spaces (not tabs) if the subject has
been encoded:

-    $message->{subject} = MIME::WordDecoder::unmime($message->{subject});
+    # Decode into temporary storage so the result can be inspected
+    my $TmpSubject = MIME::WordDecoder::unmime($message->{subject});
+    # Compare as strings: ne, not the numeric != operator
+    if ($TmpSubject ne $message->{subject}) {
+      # The unmime function did something - we must be dealing with
+      # an encoded subject. Remove up to 2 trailing spaces if present
+      # so that SweepContent cuts us a little slack. Total replacement
+      # and hence probable destruction of unicode subjects for the sake of
+      # one or two probably harmless trailing spaces is a little harsh
+      $TmpSubject =~ s/ {1,2}$//;
+      $message->{subject} = $TmpSubject;
+    }
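
For what it's worth, the trimming step can be sanity-checked on its
own. This is only a sketch: trim_up_to_two_spaces is a made-up helper
that wraps the same substitution as the patch, and the sample strings
are invented for the test:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution used in the patch
sub trim_up_to_two_spaces {
    my ($text) = @_;
    $text =~ s/ {1,2}$//;
    return $text;
}

# No spaces, one, two, three, and a trailing tab
for my $sample ("abc", "abc ", "abc  ", "abc   ", "abc\t") {
    my $out = trim_up_to_two_spaces($sample);
    printf "%d -> %d characters\n", length $sample, length $out;
}
```

One or two trailing spaces are removed completely. With three, only
two are removed, so one is left behind for SweepContent to object to,
and a trailing tab is not touched at all.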

I'd be grateful if consideration could be given to this problem - my
"fix" probably isn't the most elegant, but perhaps there's a smarter
way round the issue

Thanks

Nick