An express checkout? [was: Re: Postfix and Mailscanner sitting in a tree k-iss-ing]

Julian Field MailScanner at ecs.soton.ac.uk
Sat Jan 1 12:58:13 GMT 2005


paddy wrote:

>Upon reflection I can't see a 'simple criterion' that's cheap enough to be
>a no-brainer to use unless you can do some processing before the incoming mail
>first goes to disk.
>
>
The message has not been received until it has hit the disk. So you're
proposing working on a message using partial information to start with,
to try to guess the spammy state of it.

>  (My first choice would be originating IP. I did briefly, in desperation,
>  consider size).  Anything else is just equivalent to what MailScanner
>  already does (dispatch RBL queries early, etc) only my suggestions
>  were weaker :)
>
>
I already split incoming and outgoing mail on my site. Surely just
having separate servers for mail going in different directions is the
easiest.

>  I'm also imagining that any processing before the mail hits disk
>  is at a premium in a DoS/highload situation, although that may not be the
>  case if the cpu is not the bottleneck ...
>
>
Interesting thought. It would only work with some MTAs though, as it
depends on how they write the messages to disk. We're assuming here that
a message's metadata gets written first, and potentially long before the
message body.

>I don't think the express checkout idea is necessarily a totally lost cause:
>
>  sure, the cost of scheduling can easily drown the value, but a system
>  where the order of operations affects the cost is a promising target.
>
>
One of the major factors here, which I don't think you have commented
on, is that scanning the queue directory at all is a very expensive
operation when the queue is large, which is why I have the "emergency
queue-clearing mode". Just looking at all the queue files can take a
long time and involve loads of i/o. So the cost of the express
checkout tests may well swamp any performance gain you get.
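
To illustrate why any per-file test is costly on a large queue, here is a minimal Python sketch (not MailScanner's own code, which is Perl). The function name oldest_first is hypothetical. It picks the oldest few files from a directory, and to do so it has to stat every entry, so the work grows with the size of the whole queue rather than with the number of messages you actually want.

```python
import os
import tempfile


def oldest_first(queue_dir, limit):
    """Return up to `limit` file names from queue_dir, oldest first.

    Every entry has to be stat()ed to learn its modification time, so the
    cost grows with the total number of queue files, not with `limit`.
    """
    entries = []
    with os.scandir(queue_dir) as it:
        for entry in it:
            if entry.is_file():
                entries.append((entry.stat().st_mtime, entry.name))
    entries.sort()
    return [name for _, name in entries[:limit]]


if __name__ == "__main__":
    # Small demonstration on a throwaway directory (hypothetical file names).
    with tempfile.TemporaryDirectory() as queue:
        for i in range(100):
            path = os.path.join(queue, f"msg{i:03d}")
            open(path, "w").close()
            os.utime(path, (1000 + i, 1000 + i))
        print(oldest_first(queue, 5))
```

An "express checkout" test would add further work per file on top of this scan, which is where the gain can get swamped.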

>  the original intention - differential QoS based on approximate spamminess -
>  still seems good.  The problem is implementing it at acceptable costs.
>  (remember Magnus Pike?)
>
>
Oh yes. One of my great aunts lived next door to him in Hammersmith.
Very funny guy.

MailScanner, in a way, already tries to do quite a lot of the checking
you mention above if you let it. If you have a good RBL such as SBL+XBL,
and use a config like this:

Spam List = SBL+XBL
Check SpamAssassin If On Spam List = no
Spam Lists To Reach High Score = 1
High Scoring Spam Actions = delete

(the 3rd setting is just so I can use the High scoring action to delete
RBL hits, which will probably fit in to your site policy rather better
than using the normal scoring action)

Doing this will completely get rid of any messages hitting the RBL
without any operation on the message body at all. It is all done based
on the content of the headers/envelope.
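
For anyone unfamiliar with how an RBL check avoids touching the message body: all it needs is the connecting IP address. The sketch below shows the general DNSBL convention (reverse the octets, append the list's zone, do a DNS lookup) in Python. It is not MailScanner's code; the function names are hypothetical, and the zone name sbl-xbl.spamhaus.org is an assumption about how SBL+XBL is defined in your spam lists file.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a bad address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """Return True if the DNSBL answers with a 127.0.0.x listing code.

    Needs working DNS; a failed lookup is treated as "not listed".
    """
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return answer.startswith("127.0.0.")


if __name__ == "__main__":
    # 192.0.2.1 is a documentation address; the zone name is an assumption.
    print(dnsbl_query_name("192.0.2.1", "sbl-xbl.spamhaus.org"))
```

Only the query name construction runs in the demonstration; is_listed does a real DNS lookup and is left for you to call.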

>  <more insane and pointless handwaving>
>    I also had this vague idea that using directories for the elevator in the
>    CriticalQueue condition might be cheaper than sorting by date, but the
>    problem is obvious ....
>
>What I realise is:
>
>  I don't really understand the trade-off between batch size and MaxChildren
>
>  I'd certainly appreciate it if you, or anyone for that matter :), could help
>  me with this.  Since they are both limits, I imagine that describing the
>  limiting conditions will help.
>
>
Smaller batches make virus scanning less efficient, but produce a more
"responsive" system under load. The message bandwidth is lower (fewer
messages/hour) but the message latency (delay through MS) can be a lot
lower. So if you inject a message at one end, it pops out the other end
sooner. The cost is that you can't inject as many messages/hour.
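
A rough way to see the trade-off is a toy model in which each batch pays a fixed cost (for example starting the virus scanner) plus a cost per message. The Python sketch below uses made-up numbers purely for illustration; batch_stats, startup_secs and per_message_secs are hypothetical names and values, not measurements.

```python
def batch_stats(batch_size, startup_secs=2.0, per_message_secs=0.1):
    """Return (seconds per batch, messages per hour) for one child.

    startup_secs and per_message_secs are made-up illustrative numbers.
    """
    batch_secs = startup_secs + batch_size * per_message_secs
    messages_per_hour = batch_size / batch_secs * 3600
    return batch_secs, messages_per_hour


if __name__ == "__main__":
    for size in (5, 30, 100):
        secs, per_hour = batch_stats(size)
        print(f"batch of {size:3d}: {secs:5.1f}s per batch, "
              f"{per_hour:8.0f} messages/hour")
```

In this model a smaller batch finishes sooner (lower latency) but spreads the fixed cost over fewer messages (lower messages/hour), which is the effect described above.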

MaxChildren should be set so that all the available resources are being
used all the time. Set it too high and the machine will spend too much
of its time context-switching between children, and too little time
actually doing useful work. Set it too low and there will be times when
at least one of the i/o, disk or net will be idle, which wastes resources.

My initial estimates of 5 per CPU, and possibly 8 per hyper-threaded
CPU, were based on some early testing I did on a dual-cpu box I've got.
5 per cpu gave very good throughput, and the system wasn't
context-switching excessively. If you have a quiet machine, by all means
set it to less. I assume that MailScanner will be running at 100% or
nearly 100% load. After all, if the machine is quiet, who cares if I
waste a few resources? No-one else wanted them anyway.

>  I'm just re-reading the notes in the conf file.
>
>    Does a mailscanner child really consume ~20MB ?  Why ?
>
>
If you are running SpamAssassin it can easily be double that. Perl
processes are big, as the Perl compiler is very big and needs to be in
each process (so you can use cool things like "eval" in your program).
RAM is very cheap anyway.
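
Putting the rules of thumb from this thread together (5 children per CPU, possibly 8 per hyper-threaded CPU, roughly 20MB per child and about double that with SpamAssassin), a back-of-the-envelope estimate could be sketched in Python as follows. The function name is hypothetical and the figures are only starting points, not a sizing formula.

```python
def estimate_children(cpus, hyperthreaded=False, spamassassin=False):
    """Return (suggested Max Children, approximate total RAM in MB).

    Uses the rough figures quoted in this thread: 5 children per CPU
    (8 if hyper-threaded), about 20MB per child, doubled with SpamAssassin.
    """
    per_cpu = 8 if hyperthreaded else 5
    children = cpus * per_cpu
    mb_per_child = 40 if spamassassin else 20
    return children, children * mb_per_child


if __name__ == "__main__":
    children, ram_mb = estimate_children(2, spamassassin=True)
    print(f"Max Children = {children}")
    print(f"Approximate RAM for children: {ram_mb} MB")
```

Treat the result as a first guess and adjust it after watching context-switching and idle time on the real machine.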

>  based on your 'try 5 children per CPU' comment, I'm guessing that more
>  children = more cpu heavy (which makes sense anyway).
>  (must fix my CPU utilisation logging! :)
>
>  Is there even a BatchSize type option? Is MailScanner even batch-oriented
>  in the way I had imagined? is MaxUnscannedMessagesPerScan it ?
>
>
There are 4 options there:
Max Unscanned Bytes Per Scan = 100000000
Max Unsafe Bytes Per Scan = 50000000
Max Unscanned Messages Per Scan = 30
Max Unsafe Messages Per Scan = 30

These stop a batch from getting too big when several huge messages are
picked up all in the same batch.
Total batch size = number of messages * average message size
So you need to limit both the number of messages and the message size to
have control of that calculation.
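
How a pair of limits like these bounds a batch can be sketched in a few lines of Python. This is an illustration only, not how MailScanner actually implements it: take_batch is a hypothetical name, and the choice to always accept the first message (so that one oversized message cannot block the queue) is my assumption.

```python
def take_batch(message_sizes, max_messages=30, max_bytes=100_000_000):
    """Pick messages (given as sizes in bytes) for one batch, in queue order.

    Stops when either the message count or the byte total would be
    exceeded. Assumption: the first message is always accepted, even if it
    alone is larger than max_bytes.
    """
    batch = []
    total = 0
    for size in message_sizes:
        if len(batch) >= max_messages:
            break
        if batch and total + size > max_bytes:
            break
        batch.append(size)
        total += size
    return batch


if __name__ == "__main__":
    # Three 40MB messages followed by small ones: the byte limit stops
    # the batch long before the 30-message limit is reached.
    sizes = [40_000_000] * 3 + [20_000] * 50
    print(len(take_batch(sizes)))
```

With only a message-count limit, a run of huge messages could make the batch arbitrarily large; with only a byte limit, thousands of tiny messages could. Hence both.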

>  I'm also amused to discover (see previous mail) I have
>
>    Max Normal Queue Size = 5000
>
>
I would recommend lowering that; it's pretty big. Try about 1000 or so.

>  This reminds me of the 'per-user SpamAssassin' thread tonight.  There are
>  already so many options, no doubt for each one there is somebody who
>  really needs it, but nobody could really need them all (could they?),
>  and the idea that anybody needs a new one should at least attract a
>  little skepticism.  But then, I expect I'm preaching to the priest !
>
>  would any of the options make sense in multiple units?
>    for (over)simplified example: 5000 mails or 5 mails per GHz of cpu
>    perhaps this is best left to admin and configuration tools?
>
>
It's not as simple as just CPU speed. It's a lot more complex than that.

>And it's easy to think you (I mean me, of course!) know what's going on, until ....
>
>I wish you a Very Happy New Year !
>
>
Happy New Year to you too!

--
Julian Field
www.MailScanner.info
Buy the MailScanner book at www.MailScanner.info/store
Professional Support Services at www.MailScanner.biz
MailScanner thanks transtec Computers for their support

PGP footprint: EE81 D763 3DB0 0BFD E1DC 7222 11F6 5947 1415 B654




