Strip HTML weirdness
Julian Field
mailscanner at ecs.soton.ac.uk
Wed Oct 1 16:49:21 IST 2003
That is very strange. When MailScanner strips HTML to plain text, it
doesn't surround the links with any punctuation at all; it just inserts them
with a space on either side so they don't run into the surrounding text.
Here's a little example of what is left after processing a short test message:
mime-boundary string here
Content-type: text/plain; charset="us-ascii"
This is an HTML message with a http://www.ecs.soton.ac.uk/ link
in it and a link on a line of its own.
http://soton.ac.uk/ Link number 2
And it also has a very long link in it like
http://www.ecs.soton.ac.uk/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaassssssssssssssssssssssssssssssssssssssssssssssdddddddddddddddddddddddddddddddddddfvfffffffffffffffffffffffffffffffffffff.jpg
in it.
mime-boundary string here
As you see, there are no "<>" characters and no truncation.
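
To make the behaviour described above concrete, here is a minimal Python
sketch of the same idea. It is not MailScanner's own code (MailScanner is
written in Perl); the names LinkInliner and strip_html are made up for
illustration, and treating only <br> and <p> as line breaks is a simplifying
assumption. It drops each link's URL into the text with a space on either
side, adds no "<>" and never shortens the URL:

```python
from html.parser import HTMLParser


class LinkInliner(HTMLParser):
    """Hypothetical helper: keep text, inline each <a href> URL with spaces."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Space on both sides so the URL cannot merge with nearby text.
                self.parts.append(" " + href + " ")
        elif tag in ("br", "p"):
            # Assumption: only these two tags start a new line.
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return plain text with link URLs inlined, whole and unbracketed."""
    parser = LinkInliner()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace within each line; URLs are left untouched.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(lines).strip()


if __name__ == "__main__":
    sample = 'This has a <a href="http://www.ecs.soton.ac.uk/">link</a> in it.'
    print(strip_html(sample))
```

If a converter along these lines produced the text, "<...>" around a URL or
a cut-off URL would have to come from some other step in the mail path.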
At 15:22 01/10/2003, you wrote:
>Julian Field wrote:
> > Can you make sure they are not very long URLs that are being split
> > into multiple lines by Outlook?
>
>No, that's not the case; here's a short extract from the particular mail
>that's generated complaints (lots of my users are on this list). A couple
>of curious things: 1) the URLs are truncated, 2) the URLs are surrounded by
><>. The message was tagged as spam and converted to an attachment, but
>was low-scoring spam, which is not set to be stripped, although it looks
>from the logs as if it also triggered one of the dangerous HTML rules
>(which are set to strip content). I think the original is online here
>http://www.bmra.org.uk/mrbusiness/index.asp (although the links are
>relative in the source of that).
>
>MESSAGE EXTRACT FOLLOWS....
>
> <http://www.bmra.org.u>
>
>
>
> 30 September 2003 Issue 25 <http://www.bmra.org.uk/mrbusine>
>
>print <http://www.bmra.org.uk/mrbusine> search for
>in --Whole Site-- Sep 2003 Ezine 25 Sep 2003 Ezine 24 Jul 2003
>Ezine 23 Jun 2003 Ezine 22 May 2003 Ezine 21 Apr 2003 Ezine 20
>submit
>
>
>
>
> BMRA <http://www.bmra.org.uk>
>
> frontpage <http://www.bmra.org.u>
>
> archive <http://www.bmra.org.uk/>
> contact <mailto:admin at bmra.org.uk>
>
> subscribe <http://www.bmra.org.uk/mr>
> calendar <http://www.bmra.org.uk/mrbusiness/>
>
>
> <http://www.bmra.org.uk/include/ad.asp?BannerID=49&src=http:>
>
>
>
>BMRB International
>http://www.bmrb.co.uk
>+44 (0)20 8566 5000
--
Julian Field
www.MailScanner.info
MailScanner thanks transtec Computers for their support