An express checkout? [was: Re: Postfix and Mailscanner sitting in a tree k-iss-ing]

Mon Jan 3 00:29:04 GMT 2005

On Sat, Jan 01, 2005 at 12:58:13PM +0000, Julian Field wrote:
> paddy wrote:
>
> >Upon reflection I can't see a 'simple criteria' that's cheap enough to be
> >a no-brainer to use unless you can do some processing before the incoming
> >mail first goes to disk.
> >
> >
> The message has not been received until it has hit the disk.

Being excessively pedantic for a moment:

  I would put the moment of receipt at the transmission of the 2xx packet
  responding to the DATA command.

  Best practice would be that the message is either already committed to
  non-volatile storage, or is already delivered elsewhere.

So, yes! Agreed.
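
A minimal sketch of that ordering, in Python: write the message, force it
to stable storage, and only then produce the 2xx reply to DATA.  The
function name, file naming and reply strings are illustrative assumptions,
not taken from any particular MTA.

```python
import os
import tempfile


def accept_message(data: bytes, spool_dir: str) -> str:
    """Store the message durably, then return the SMTP reply to send.

    Hypothetical helper: the name, file naming and reply text are
    assumptions made for illustration only.
    """
    tmp_path = None
    try:
        fd, tmp_path = tempfile.mkstemp(dir=spool_dir, prefix="incoming.")
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to stable storage
        # Rename only after the contents are safely written.
        os.rename(tmp_path, tmp_path + ".msg")
    except OSError:
        if tmp_path and os.path.exists(tmp_path):
            os.unlink(tmp_path)
        # Nothing was promised to the sender, so a temporary failure is safe.
        return "451 4.3.0 Temporary failure, please try again later"
    # Only now has responsibility for the message been accepted.
    return "250 2.0.0 Ok: queued"
```

A stricter version would also fsync the spool directory after the rename;
that is left out here to keep the sketch short.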

> So you're
> proposing working on a message using partial information to start with,
> to try to guess the spammy state of it.

I normally start from the assumption that a test of spamminess is
Turing-equivalent - that it takes a human being to say what they consider
to be spam.

Amusingly, shortly after I first read this I came across a 'not-spam'
message in my spam folder, examined the headers to see which rules it had
hit, and came to the conclusion 'looks like spam, is spam'.

So, my theory is somewhat at odds with my practice.  Thanks to Larry McVoy
for helping me to feel comfortable with that ;)

Trying to guess spamminess from partial info is not new but, for all I know,
using that information to prioritise workload, rather than to reject email
outright, may be (prior art in MailScanner, etc. excepted ;)
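
To make the 'prioritise rather than reject' idea concrete, here is a small
Python sketch: messages are queued with a cheap spamminess guess and the
least spammy-looking ones are scanned first.  The class name and the
0.0 to 1.0 guess scale are assumptions for illustration; nothing here is
MailScanner's own code.

```python
import heapq
from itertools import count


class TriageQueue:
    """Order work by a cheap spamminess guess; nothing is ever rejected."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker: equal guesses keep arrival order

    def push(self, queue_id: str, guess: float) -> None:
        # guess is a hypothetical 0.0 (looks clean) to 1.0 (looks spammy) value
        heapq.heappush(self._heap, (guess, next(self._seq), queue_id))

    def pop(self) -> str:
        # Lowest guess first, so likely-legitimate mail gets through sooner.
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

A wrong guess only delays a message; it still gets the full scan when its
turn comes.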

> > (My first choice would be originating IP. I did briefly, in desperation,
> > consider size).  Anything else is just equivalent to what MailScanner
> > already does (dispatch RBL queries early, etc) only my suggestions
> > were weaker :)
> >
> >
> I already split incoming and outgoing mail on my site. Surely just
> having separate servers for mail going in different directions is the
> easiest.

I'm sorry, I don't follow this.  A DoS can pick a single server or MX group,
and potentially hammer them into the ground.  While there are certainly
resources outside the bounds of MailScanner that deal with such problems,
as you have already indicated, MailScanner does not live on an island where
such problems can be totally ignored.  It may be that the particular concern
that I have chosen is not, in fact, a consideration for mailscanner, and I
just haven't seen the light yet.

My outgoing mail is a fraction of my incoming mail - negligible in fact.
(Although, I appreciate, you may find that hard to believe ;)

> > I'm also imagining that any processing before the mail hits disk
> > is at a premium in a DoS/highload situation, although that may not be the
> > case if the cpu is not the bottleneck ...
> >
> >
> Interesting thought. Would only work with some MTA's though,

That postfix thing just keeps haunting this thread!

Sendmail is familiar territory to me and I imagine it wouldn't be too difficult
to arrange a milter that caches certain info and makes it available to
a mailscanner process later in the pipeline.  I spent a little time looking
into postfix (I really wanted to write a program called 'prim':) and the
same hook appears to exist there. I'd expect to find the possibility in
most modern general purpose MTAs, although I wouldn't expect it to be
trivial to set up.
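
A rough Python sketch of the caching half of that idea: something at SMTP
time records a few cheap facts per queue ID, and a later stage looks them
up without opening the queue file.  The class, its fields and the expiry
policy are all hypothetical, and a real milter and scanner would need
shared storage rather than this in-process dictionary.

```python
import time


class HintCache:
    """In-memory store of per-message hints captured at SMTP time.

    Hypothetical sketch: a separate milter and scanner would need a
    shared store (file, database, socket) instead of a dict.
    """

    def __init__(self, max_age_seconds: float = 3600.0):
        self._hints = {}
        self._max_age = max_age_seconds

    def record(self, queue_id, client_ip, size_bytes, now=None):
        """Called by the SMTP-time hook once the queue ID is known."""
        seen_at = time.time() if now is None else now
        self._hints[queue_id] = {
            "client_ip": client_ip,
            "size_bytes": size_bytes,
            "seen_at": seen_at,
        }

    def lookup(self, queue_id, now=None):
        """Called later in the pipeline; returns None if unknown or stale."""
        hint = self._hints.get(queue_id)
        if hint is None:
            return None
        current = time.time() if now is None else now
        if current - hint["seen_at"] > self._max_age:
            del self._hints[queue_id]  # drop stale entries lazily
            return None
        return hint
```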

> it depends on how they write the messages to disk.

I don't follow you here. My objective would be to grab the relevant info
before it hits disk at all if I could.  My speculation on grabbing the
buffers still in RAM after a disk commit, was ... interesting, but a bit
random - might work though :)

> We're assuming here that a
> message's metadata gets written first, and potentially long before the
> message body.

So it could be hard to know when to grab the buffers?

> >I don't think the express checkout idea is necessarily a totally lost
> >cause:
> >
> > sure, the cost of scheduling can easily drown the value, but a system
> > where the order of operations affects the cost is a promising target.
> >
> >
> One of the major factors here, which I don't think you have commented
> on, is that scanning the queue directory at all is a very expensive
> operation when the queue is large. Which is why I have the "emergency
> queue-clearing mode". Just looking at all the queue files at all can
> take a long time and involve loads of i/o. So the cost of the express
> checkout tests may well swamp any performance gain you get.

Absolutely.  Which is why I'm looking so desperately to avoid that cost.
The whole idea doesn't work if you have to read all the files.

> > the original intention - differential QoS based on approximate
> > spamminess - still seems good.  The problem is implementing it at
> > acceptable costs.  (remember Magnus Pike?)
> >
> >
> Oh yes. One of my great aunts lived next door to him in Hammersmith.

Cool!

> Very funny guy.

Absolutely!

> MailScanner, in a way, already tries to do quite a lot of the checking
> you mention above if you let it. If you have a good RBL such as SBL+XBL,
> and use a config like this:
>
> Spam List = SBL+XBL
> Check SpamAssassin If On Spam List = no
> Spam Lists To Reach High Score = 1
> High Scoring Spam Actions = delete
>
> (the 3rd setting is just so I can use the High scoring action to delete
> RBL hits, which will probably fit in to your site policy rather better
> than using the normal scoring action)
>
> Doing this will completely get rid of any messages hitting the RBL
> without any operation on the message body at all. It is all done based
> on the content of the headers/envelope.
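
For readers unfamiliar with why an RBL check needs only the connecting
address: the lookup is a DNS query for the client's IPv4 octets reversed
and prefixed to the list's zone, with an A record in the answer meaning
'listed'.  The Python sketch below only builds the query name; the zone
string in the usage note is an assumption, and no DNS traffic is sent.

```python
import ipaddress


def dnsbl_query_name(client_ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 client in a DNSBL zone."""
    addr = ipaddress.IPv4Address(client_ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"
```

For example, dnsbl_query_name("192.0.2.99", "sbl-xbl.spamhaus.org") gives
"99.2.0.192.sbl-xbl.spamhaus.org", which would then be passed to a
resolver.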

I started with a pair of RBLs, I think,
then just ORBS, then switched to SpamCop.

SpamCop has given me headaches with mailing lists.  I plan to switch to
SBL+XBL, but I regard this as quite a big move.  I call the RBL our
'backstop' - it's saved me several times in the last year or so.

I vaguely recall that when SA times out, we fall back to just RBLs, and
that sometimes, that's precisely why I have a long queue anyway.

I have been reluctant to employ

 Check SpamAssassin If On Spam List = no

because I like to see a score, but I might look at the possibility of a
custom function if there is not already a high-load cut-out on this
config option.  I like that idea!
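
The decision such a custom function would make can be sketched in a few
lines.  This is Python for illustration only (a MailScanner custom
function would be Perl), and the function name and threshold parameter
are hypothetical.

```python
def should_run_spamassassin(on_spam_list: bool,
                            queue_length: int,
                            high_load_threshold: int) -> bool:
    """High-load cut-out: skip SpamAssassin for RBL hits only when busy.

    Hypothetical logic, not an existing MailScanner option.
    """
    if not on_spam_list:
        return True  # no RBL hit, so SpamAssassin is the only spam test left
    # RBL hit: still score it while the queue is short, skip it under load.
    return queue_length < high_load_threshold
```

Under normal load every message still gets a score; only when the queue
passes the threshold do RBL hits bypass SpamAssassin.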

> > <more insane and pointless handwaving>
> >   I also had this vague idea that using directories for the elevator
> >   in the CriticalQueue condition might be cheaper than sorting by date,
> >   but the problem is obvious ....
> >
> >What I realise is:
> >
> > I don't really understand the trade-off between batch size and MaxChildren
> >
> > I'd certainly appreciate it if you, or anyone for that matter :), could
> > help me with this.  Since they are both limits, I imagine that
> > describing the limiting conditions will help.
> >
> >
> Smaller batches make virus scanning less efficient, but produce a more
> "responsive" system under load. The message bandwidth is less (less
> messages/hour) but the message latency (delay through MS) can be a lot
> less. So if you inject a message one end, it pops out the other end
> sooner. The cost is that you can't inject so many messages/hour.

So, quite counter-intuitively, I suspect that I'd be happier with smaller
batches and more children under what for me is 'high load'. ;)
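
A toy model of that trade-off, in Python.  It assumes each batch costs a
fixed overhead (for instance starting the virus scanner) plus a constant
per-message cost; both figures are made-up inputs, not measurements.

```python
def batch_model(batch_size: int,
                per_batch_overhead_s: float,
                per_message_s: float):
    """Return (seconds per batch, messages per hour) for one child.

    Assumed cost model: fixed overhead per batch plus a constant cost
    per message.  Real behaviour will be messier.
    """
    batch_time = per_batch_overhead_s + batch_size * per_message_s
    throughput_per_hour = batch_size / batch_time * 3600
    return batch_time, throughput_per_hour
```

With an assumed 10 s overhead and 1 s per message, a batch of 5 takes
15 s and a batch of 30 takes 40 s, so the small batch releases mail
sooner, while the large batch moves more messages per hour because the
overhead is shared across more of them.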

> MaxChildren should be set so that all the available resources are being
> used all the time. Set it too high and the machine will spend too much
> of its time context-switching between children, and too little time
> actually doing useful work. Set it too low and there will be times when
> at least one of the i/o, disk or net will be idle, which wastes resources.
>
> My initial estimates of 5 per CPU, and possibly 8 per hyper-threaded
> CPU, were based on some early testing I did on a dual-cpu box I've got.
> 5 per cpu gave very good throughput, and the system wasn't
> context-switching excessively. If you have a quiet machine, by all means
> set it to less. I assume that MailScanner will be running 100% or nearly
> 100%. After all, if the machine is quiet, who cares if I waste a few
> resources. No-one else wanted them anyway.
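
The rule of thumb quoted above as a Python starting point.  The function
name is made up, and the result is only an initial value to tune from.

```python
def suggested_max_children(cpu_count: int, hyper_threaded: bool = False) -> int:
    """Starting value from the '5 per CPU, possibly 8 per HT CPU' estimate."""
    per_cpu = 8 if hyper_threaded else 5
    return cpu_count * per_cpu
```

For the dual-CPU box mentioned, that gives 10 children.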

I have a quiet box, except when it's not!

> > I'm just re-reading the notes in the conf file.
> >
> >   Does a mailscanner child really consume ~20MB ?  Why ?
> >
> >
> If you are running SpamAssassin it can easily be double that.

I didn't want to say anything! :)

> Perl
> processes are big, as the Perl compiler is very big and needs to be in
> each process (so you can use cool things like "eval" in your program).

Ah! Yes, one word: eval!

Perl is clearly the language of choice for this problem-space.
How vital is eval?
I confess I've been promised bigger boxes for the new year, and I'm
getting by on RaQ3s now, so it isn't a big question.  Probably the
answer is that programmer time is worth more.

> Ram is very cheap anyway.

No, my boss is cheap (God, I hope he doesn't read this :)
RAM is like my overdraft limit: not enough but I have to live with it.
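
When RAM rather than CPU is the limit, the arithmetic is simple enough to
sketch in Python.  The per-child figures come from the numbers quoted in
this thread (~20MB, or about double with SpamAssassin); the total RAM and
the amount reserved for everything else are made-up inputs.

```python
def children_that_fit(ram_mb: int, reserved_mb: int, per_child_mb: int) -> int:
    """How many scanner children fit in RAM without swapping.

    reserved_mb is whatever the OS, MTA and other daemons need; it is an
    assumed input, not something measured here.
    """
    available = ram_mb - reserved_mb
    if available <= 0 or per_child_mb <= 0:
        return 0
    return available // per_child_mb
```

On a hypothetical box with 256MB, keeping 64MB back, that is 9 children
at 20MB each but only 4 at 40MB each.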

> > based on your 'try 5 children per CPU' comment, I'm guessing that more
> > children = more cpu heavy (which makes sense anyway).
> > (must fix my CPU utilisation logging! :)
> >
> > Is there even a BatchSize type option? Is MailScanner even batch-oriented
> > in the way I had imagined? is MaxUnscannedMessagesPerScan it ?
> >
> >
> There are 4 options there:
> Max Unscanned Bytes Per Scan = 100000000
> Max Unsafe Bytes Per Scan = 50000000
> Max Unscanned Messages Per Scan = 30
> Max Unsafe Messages Per Scan = 30
>
> This stops batches getting too big by picking up several huge messages
> all in the same batch.
> Total batch size = number of messages * average message size
> So you need to limit both the number of messages and the message size to
> have control of that calculation.
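
A sketch of how a count limit and a byte limit combine when picking a
batch.  This is illustrative Python, not MailScanner's actual code; the
defaults mirror the 'Unscanned' values listed above, and always taking
the first message even when it is oversized is an assumption made so a
single huge message cannot stall the queue.

```python
def take_batch(message_sizes, max_messages=30, max_bytes=100_000_000):
    """Return the sizes of the messages chosen for the next batch.

    Stops at whichever limit is hit first.  The first message is always
    taken (assumption) so an oversized message still gets processed.
    """
    batch = []
    total = 0
    for size in message_sizes:
        if len(batch) >= max_messages:
            break  # count limit reached
        if batch and total + size > max_bytes:
            break  # byte limit would be exceeded
        batch.append(size)
        total += size
    return batch
```

Forty 10KB messages stop at the count limit of 30, whereas two 60MB
messages stop at the byte limit after the first.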

Gosh! Is that 100MB for Max Unscanned Bytes Per Scan? I always read it as 10!

I take it this is a chunk of the ~20MB we've been talking about.

Call me lazy ('cos I can always go off and figure it out for myself), but
how much memory does a second mailscanner child consume, before it starts
to read data?

> > I'm also amused to discover (see previous mail) I have
> >
> >   Max Normal Queue Size = 5000
> >
> >
> I would recommend lowering that, it's pretty big. Try about 1000 or so.

You can say that again!  5000 won't kill this box: with mailscanner it'll
chew through them eventually, but it'll take a month of sundays!

1000 sounds much more reasonable!  I seem to have made a poor adjustment
sometime in the past :)
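
My reading of how that threshold is meant to work, sketched in Python:
below the limit the queue is processed oldest first, above it the
date sort is skipped and files are taken in whatever order the directory
listing returns them.  This is an assumption drawn from this thread, not
a description of MailScanner's code, and the function name is made up.

```python
def order_for_processing(entries, max_normal_queue_size=1000):
    """Choose the processing order for queue entries.

    entries: list of (name, mtime) tuples in directory-listing order.
    Assumed behaviour: strict date order normally, listing order once
    the queue is larger than the threshold.
    """
    if len(entries) > max_normal_queue_size:
        return [name for name, _mtime in entries]  # no sort: cheapest option
    return [name for name, _mtime in sorted(entries, key=lambda e: e[1])]
```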

> > This reminds me of the 'per-user spamassassin' thread tonight.  There are
> > already so many options, no doubt for each one there is somebody who
> > really needs it, but nobody could really need them all (could they?),
> > and the idea that anybody needs a new one should at least attract a
> > little skepticism.  But then, I expect I'm preaching to the priest !
> >
> > would any of the options make sense in multiple units?
> >   for (over)simplified example: 5000 mails or 5 mails per GHz of cpu
> >   perhaps this is best left to admin and configuration tools?
> >
> >
> It's not as simple as just CPU speed. It's a lot more complex than that.

Best left to admin and configuration tools, then.

Regards,
Paddy
--
Perl 6 will give you the big knob. -- Larry Wall
