html2text output not really clean?
Remco Barendse
mailscanner at BARENDSE.TO
Fri Nov 15 08:59:11 GMT 2002
Hi!
I am using MailScanner 4.05-3 and have a mobile user collecting mail on
his laptop. I want to use the html2text feature to avoid expensive
phone calls where downloading e-mail in HTML format keeps the connection
open for hours. MS is running on a RedHat 7.3 box.
I have this line in my /etc/MailScanner/MailScanner.conf:
Convert HTML To Text = /etc/MailScanner/rules/html2text.rules
The html2text.rules contains:
To r.barendse at somedomain.com yes
To remco at somedomain.com yes
Fromorto default no
The output in maillog seems correct:
Nov 15 09:44:19 linuxgw MailScanner[7367]: Content Checks: Need to convert HTML to plain text in 1 messages
Nov 15 09:44:20 linuxgw MailScanner[7367]: Content Checks: Detected and will convert HTML message to plain text in gAF8iAN07366
When I start pine and look in the inbox, short messages are still
huge (13-40 Kb). The top of the e-mail contains stuff like:
@font-face { font-family: Tahoma; } @font-face { font-family: Verdana; }
@page Section1 {size:595.35pt 842.0pt; margin: 26.95pt 70.9pt 1.0in 70.9pt; mso-header-margin:
and similar rubble throughout the e-mail:
….Whaaat ??
<![if !supportEmptyParas]><![endif]>
You gotta be kidding me….?!
Now if I retrieve the contents of the mailbox using Outlook Express, the
e-mail *appears* to be stripped of HTML rubble because the formatting has
changed (colors and font sizes are different). The size of the e-mail is
only slightly reduced: the original HTML mail was 21 Kb and the end result
is 13 Kb, which is still too much for only 80 lines of text.
Why is there still all this font and other rubble in the e-mails and how
can I strip them completely?
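
To make clear what I mean by "completely": the leftovers look like the contents of a <style> block and Word-style conditional markers, so the result I am after is roughly what the small stand-alone Python sketch below produces. This is only an illustration using the standard html.parser module, not how MailScanner does its conversion; the names TextOnly and strip_html are made up for this example.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text and skip the contents of <style> and <script>."""

    SKIP = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Text inside <style>/<script> (e.g. @font-face rules) is dropped.
        if not self.skip_depth:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Markers such as <![if ...]> and <![endif]> never reach handle_data,
    # so they do not show up in the collected text.
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


sample = (
    "<html><head><style>@font-face { font-family: Tahoma; }</style></head>"
    "<body><p>Hello <b>world</b></p>"
    "<![if !supportEmptyParas]><![endif]></body></html>"
)
print(strip_html(sample))
```

In other words, neither the font definitions nor the conditional markers should survive the conversion, only the text itself.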
Thanks!!
Remco