Mailscanner converting HTML messages with FORM tags

Julian Field mailscanner at ecs.soton.ac.uk
Mon Oct 27 14:33:18 GMT 2003


At 09:21 23/10/2003, you wrote:
>On Wed, 22 Oct 2003, Lancaster, David Matthew wrote:
>
> > > > >Perhaps the Allow/Convert options could be restructured to
> > > > >something like this:?
> > > > >Allow Form Tags = { yes | convert | no}
> > > > >Allow Object Codebase Tags = { yes | convert | no}
> > > > >Allow IFrame Tags = { yes | convert | no}
> > > > >
> > > > >This would also allow further selection of criteria (e.g. javascript,
> > > > >etc) as "Dangerous HTML", while still allowing a great deal of tuning.
> > > >
> > > > Would you like to be able to strip _all_ html out of some messages, or
> > > > just strip out a few specific tags from some messages? The latter is
> > > > much harder.
> > >
> > > I don't currently have the need to strip out specific html...just to
> > > select which of the three criteria (iframe, obj codebase, form) will
> > > cause the message to be converted to plain text.
> > >
> > > D.
> > >
> >
> > Hate to be a pest, but any idea if this could be added to MailScanner?
> > I realize that MailScanner must keep you quite busy, so if it's not
> > something you're interested in looking at, I might take a peek at the
> > code myself...but it doesn't make sense for two people to be duplicating
> > the work.
>
>If I understand it correctly, this revised behaviour and configuration
>also has my vote.  (We want to allow "forms" through unmolested, but to
>convert "object codebase".)
>
>The proposed "Allow <X>= { yes | convert | no}" would seem to achieve such
>flexibility with elegant simplicity and, further, to allow possible future
>extensions beyond the current set of three values for <X>.
>
>Further, one could also envisage (at least in theory) the possibility of
>two (or even more?) different conversions:
>    Allow <X>= { yes | no | convert-all | convert-tag }
>
>Julian:  If you agree in principle, then I'd be happy to work (albeit
>subject to the usual "local busy-ness constraints") with David Lancaster
>to try to implement this framework over the coming weeks.  (I took a
>quick peek yesterday at the relevant 4.24-5 code to see what might be
>needed.)

Hi guys!

Sorry I haven't been around much recently; I have had a lot of other things on.

I like the idea of the
Convert Dangerous HTML = yes | no | object-codebase | iframe | form
where there can be more than one option given on that line. The only snag
is that you don't know which tags cause the entire HTML to be removed, and
which tags cause just those tags to be removed.
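
As a rough illustration of the multi-valued form, the line could be parsed
along these lines. This is a Python sketch for discussion only, not
MailScanner's Perl; the names KNOWN_TAGS and parse_convert_option are
hypothetical, and treating "yes" as "all three tag classes" is an assumption.

```python
# Hypothetical sketch: parse a multi-valued "Convert Dangerous HTML" setting.
# The option names come from the proposal above; everything else is assumed.
KNOWN_TAGS = {"object-codebase", "iframe", "form"}


def parse_convert_option(value):
    """Return the set of tag classes that should trigger conversion."""
    # Accept options separated by spaces or commas, in any letter case.
    words = value.replace(",", " ").lower().split()
    if not words or words == ["no"]:
        return set()
    if words == ["yes"]:
        # Assumption: "yes" is shorthand for every known tag class.
        return set(KNOWN_TAGS)
    unknown = [w for w in words if w not in KNOWN_TAGS]
    if unknown:
        # Mixing "yes"/"no" with tag names also ends up here.
        raise ValueError("unknown option(s): " + ", ".join(unknown))
    return set(words)
```

Note that this only records which tags trigger a conversion; it says nothing
about whether the whole message or just the tag is affected, which is the
snag described above.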

So maybe your solution is better. What's the difference in behaviour
between "no" and "convert-all"?
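
To make the question concrete, here is one possible reading of the per-tag
"Allow <X> = yes | convert | no" scheme, sketched in Python rather than
MailScanner's Perl. The interpretation is an assumption, not something
settled in this thread: "yes" delivers the message unchanged, "convert"
turns the whole message into plain text, and "no" blocks it. The names
SEVERITY, ACTIONS and decide_action are hypothetical.

```python
# Hypothetical sketch of a per-tag policy. The severity order
# (deliver < convert < block) is an assumption made for illustration.
SEVERITY = {"yes": 0, "convert": 1, "no": 2}
ACTIONS = {0: "deliver", 1: "convert-to-text", 2: "block"}


def decide_action(policy, tags_found):
    """Pick the most severe action among the tag classes seen in a message."""
    worst = 0
    for tag in tags_found:
        # Assumption: tag classes with no explicit setting are allowed through.
        setting = policy.get(tag, "yes")
        worst = max(worst, SEVERITY[setting])
    return ACTIONS[worst]


# Example policy matching the wish quoted above: forms pass through
# untouched, object codebase causes a conversion.
example_policy = {"form": "yes", "object-codebase": "convert", "iframe": "no"}
```

Under this reading a "convert-tag" value would need a fourth severity level
and a parser that can remove individual tags, which is the harder case.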

The awkward bit is implementing it. It's all to do with HTML::TokeParser
and related things.
The code that currently does the job is in Message.pm. You want "sub
HTMLEntityToText" and the "sub get_text" following it (which overrides
TokeParser's original code so that it produces slightly different output).

All it can do now is strip all HTML tags leaving the plain text.
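
For readers who want to see the general shape of "strip every tag, keep the
text", here is a minimal sketch using Python's standard html.parser module.
It is not the code in Message.pm and not the attached script; the names
TextOnly and html_to_text are hypothetical, and dropping the contents of
script and style elements is an extra assumption of this sketch.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text content, dropping all tags and script/style bodies."""

    SKIP = {"script", "style"}  # assumption: their contents are not wanted

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string with whitespace runs collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())
```

Tags such as FORM or IFRAME simply vanish here along with everything else,
which is why selecting individual tags for removal needs more work.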

As a trial for what I am going to put into the main code, try running the
attached script, passing an HTML file on the command line.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: clean-html.pl
Type: application/octet-stream
Size: 1439 bytes
Desc: not available
Url : http://lists.mailscanner.info/pipermail/mailscanner/attachments/20031027/c0b04aaf/clean-html.obj
-------------- next part --------------
--
Julian Field
www.MailScanner.info
MailScanner thanks transtec Computers for their support

PGP footprint: EE81 D763 3DB0 0BFD E1DC  7222 11F6 5947 1415 B654

