Max SpamAssassin Size problems

Logan Shaw lshaw at emitinc.com
Thu Aug 24 20:57:07 IST 2006


On Thu, 24 Aug 2006, Ken A wrote:
> Logan Shaw wrote:
>> On Thu, 24 Aug 2006, Julian Field wrote:
>>> do I chop half way through an image?
>>> do I chop at the end of an image?
>>> do I carry on for a max of 100 lines of Base64 data or until the end of
>>> an image, whichever is earlier?

>> I don't like the last option at all.  It still easily allows
>> a situation where a valid message with a valid image in it
>> gets detected as a corrupt image and hits a rule that scores
>> it as spam.

>> Basically, adding the 100 extra lines is really not much better
>> than chopping right at the max message size barrier, unless
>> you assume that most images aren't much larger than 6K, which
>> I don't think is a valid assumption at all.  So, this option
>> adds extra complexity and doesn't really give much benefit.

> I'm all for #3 and just set "score FUZZY_OCR_CORRUPT_IMG 0" if you are
> worried about false positives. Fuzzyocr will get better at sorting this out.

Well, if you're going to disable FUZZY_OCR_CORRUPT_IMG, then
there is no functional difference between #1 and #3 at all.
In which case, I'd prefer #1 because it already exists, it is
already known to work, and it's less complex.

Contrariwise, if you're going to enable FUZZY_OCR_CORRUPT_IMG,
then #3 has only a slight benefit over #1.  The default "Max
SpamAssassin Size" is 30000 bytes, and base64 data tends to
have 70 to 80 characters per line.  So being flexible about the
cut-off by up to 100 lines means that rather than falling at
exactly 30000, the cut-off will fall somewhere between 30000
and about 37000 to 38000 bytes.  Yes, it can and will happen
that an attachment boundary falls in that window, but I'd be
surprised if it happens anywhere close to 50% of the time on
ham that contains images.
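
To make that arithmetic concrete, here is a rough sketch in Python.
The 70 and 80 character line lengths are the estimates from the
paragraph above, not measured values, and the function name is made
up for illustration.

```python
# Rough estimate of where the option #3 cut-off can land.
MAX_SA_SIZE = 30000   # default "Max SpamAssassin Size", in bytes
EXTRA_LINES = 100     # slack allowed by option #3


def cutoff_window(max_size, extra_lines, line_length):
    """Return (earliest, latest) byte offsets at which the chop can fall."""
    return max_size, max_size + extra_lines * line_length


# Assumed typical base64 line lengths (characters per line).
for line_length in (70, 80):
    earliest, latest = cutoff_window(MAX_SA_SIZE, EXTRA_LINES, line_length)
    print(f"{line_length} chars/line: cut-off between {earliest} and {latest}")
```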

In particular, take the case of a ham message that contains a
single image.  In that case, the image has to be sized between
about 22500 and 28500 bytes (since base64 is 75% efficient at
carrying data, and ignoring the headers and any text part) for
#3 to provide any benefit at all.  But lots of ham that contains
images contains images larger than that.
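
The size range can be checked the same way.  This sketch assumes
the encoded image is essentially the whole message (headers, any
text part, and line terminators are ignored) and uses 80-character
lines for the upper end; the function name is hypothetical.

```python
def raw_image_size_window(max_size=30000, extra_lines=100, line_length=80):
    """Return the (low, high) raw image sizes, in bytes, for which the
    end of the base64-encoded image falls inside the option #3 window.

    base64 carries 3 raw bytes in every 4 encoded characters, so the
    raw size is 3/4 of the encoded size.
    """
    low = max_size * 3 // 4
    high = (max_size + extra_lines * line_length) * 3 // 4
    return low, high


low, high = raw_image_size_window()
print(f"#3 only helps for images of roughly {low} to {high} bytes")
```

Anything smaller fits under the limit without chopping, and anything
larger still gets cut part way through.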

To put it another way, I think #3 should be restated as "chop
half way through an image most of the time, but occasionally
luck out and find an image boundary in a narrow window and
chop at the right place".
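
As a rough illustration of that restatement, here is a hypothetical
sketch of option #3.  It is not MailScanner's code: the function
name, its parameters, the one-byte line terminator, and the "line
starts with --" test for a MIME boundary are all simplifying
assumptions.

```python
def chop_point(lines, max_size, extra_lines=100):
    """Return how many lines of the message to keep under option #3.

    lines: message body as a list of strings without line terminators.
    """
    total = 0
    for i, line in enumerate(lines):
        total += len(line) + 1  # assume one byte per line terminator
        if total > max_size:
            break
    else:
        return len(lines)  # whole message fits, nothing to chop

    # Line i is the first one past the limit; option #1 would keep i lines.
    # Option #3 looks up to extra_lines further for a boundary line.
    window_end = min(i + extra_lines, len(lines))
    for j in range(i, window_end):
        # Base64 data never contains "-", so this cannot match image data.
        if lines[j].startswith("--"):
            return j + 1  # the image ended inside the window
    return window_end  # no boundary found: still chopped mid-image


small = ["A" * 76] * 50 + ["--boundary--"]
large = ["A" * 76] * 500 + ["--boundary--"]
print(chop_point(small, 3080))  # boundary found inside the window
print(chop_point(large, 3080))  # window runs out part way through the image
```

With the larger image the function gives up after the extra 100
lines, so the image is still truncated, just a little later.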

   - Logan

