html2text output not really clean?

Remco Barendse mailscanner at BARENDSE.TO
Fri Nov 15 08:59:11 GMT 2002


I am using MailScanner 4.05-3 and have a mobile user who collects mail on
his laptop. I want to use the html2text feature so that he does not have
to make expensive phone calls that keep the connection open for hours just
to download e-mail in HTML format. MailScanner is running on a Red Hat 7.3
box.

I have this line in my /etc/MailScanner/MailScanner.conf:
Convert HTML To Text = /etc/MailScanner/rules/html2text.rules

The html2text.rules contains:
To              r.barendse at       yes
To              remco at            yes
Fromorto        default                         no
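
To make clear what I expect these rules to do, here is a rough Python
sketch of a first-match ruleset lookup. It is only my assumption of how
such a ruleset is evaluated, not MailScanner's actual code. The names
RULES and convert_html and the example.com addresses are made up for
illustration.

```python
from fnmatch import fnmatch

# Hypothetical rules: (direction, address pattern, value).
# The addresses are placeholders, not the real ones from my ruleset.
RULES = [
    ("to", "r.barendse@example.com", "yes"),
    ("to", "remco@example.com", "yes"),
    ("fromorto", "default", "no"),
]

def convert_html(sender, recipients, rules=RULES):
    """Return the value of the first rule that matches this message."""
    for direction, pattern, value in rules:
        # Assumption: a "default" rule matches every message.
        if pattern == "default":
            return value
        candidates = []
        if direction in ("to", "fromorto"):
            candidates += recipients
        if direction in ("from", "fromorto"):
            candidates.append(sender)
        # Compare case-insensitively; fnmatch allows * wildcards.
        if any(fnmatch(addr.lower(), pattern.lower()) for addr in candidates):
            return value
    return "no"  # nothing matched and no default rule was given
```

With these rules, mail addressed to one of the two listed recipients
should come back as "yes" and everything else as "no".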

The output in maillog seems correct:
Nov 15 09:44:19 linuxgw MailScanner[7367]: Content Checks: Need to convert HTML to plain text in 1 messages
Nov 15 09:44:20 linuxgw MailScanner[7367]: Content Checks: Detected and will convert HTML message to plain text in gAF8iAN07366

When I start pine and look in the inbox, small messages still show up as
huge (13-40 KB). The top of the e-mail contains stuff like:
@font-face { font-family: Tahoma; } @font-face { font-family: Verdana; } 
@page Section1 {size:595.35pt 842.0pt; margin: 26.95pt 70.9pt 1.0in 70.9pt; mso-header-margin: 

and similar rubble throughout the e-mail:
….Whaaat ??
<![if !supportEmptyParas]><![endif]> 
You gotta be kidding me&#8230;.?!

Now if I retrieve the contents of the mailbox using Outlook Express, the
e-mail *appears* to be stripped of HTML rubble because the formatting has
changed (colors and font sizes are different). The size of the e-mail is
only slightly reduced: the original HTML mail was 21 KB and the end result
is 13 KB, which is still too much for only 80 lines of text.

Why is there still all this font and other rubble in the e-mails, and how
can I strip it completely?
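
To show the result I am after, here is a minimal sketch using Python's
standard html.parser module. It is not what MailScanner does internally.
It only illustrates that a clean conversion would have to drop the
contents of <style> blocks and the <![if ...]> declarations, not just the
tags. The names TextOnly and html_to_text are made up for this example.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect visible text; skip the bodies of <style> and <script>."""
    SKIP = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments and <![if ...]> declarations never reach this method,
        # so they are dropped without any extra handling.
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Fed a message like the one above, this keeps lines such as "Whaaat ??"
and leaves out the @font-face and supportEmptyParas rubble.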



