Steve Campbell wrote:
> Martin Hepworth wrote:
>> 2009/1/2 Steve Campbell <campbell at>:
>>> Just got back from the holidays, so my reply is a little overdue.
>>> Ugo Bellavance wrote:
>>>> Steve Campbell wrote:
>>>>> The topic seems to come up quite often, and although the answers are
>>>>> usually pretty much the same, I never really see much of a "Solved" 
>>>>> reply.
>>>>> I upgraded from version 4.58, where I saw maybe 3 or 4 timeouts, to
>>>>> 4.71, and saw an immediate increase to around 100-300 timeouts. I
>>>>> ran all of the --debug and --debug-sa flavors of help I could think
>>>>> of. I reviewed the logs. I run a caching nameserver. And I zeroed
>>>>> out some RBL scores. I still have yet to find why this happens. I
>>>>> eventually upgraded to 4.72, and started using clamd. I still get
>>>>> the large numbers of timeouts. I would think that the fact that this
>>>>> doesn't happen with all of my large batches indicates I'm not using
>>>>> any dead RBLs.
>>>>> I'm still exploring the causes, but haven't had much luck. I find it
>>>>> odd that SA would really keep RBLs that have expired over time in
>>>>> their default files, so I really don't think it's that. I do all of
>>>>> my checking of RBLs in SA. I always do my configuration and language
>>>>> upgrades, and search for rpmnew and rpmsave files. This has happened
>>>>> on 3 different but very similar servers that I run.
>>>>> I'm not really asking for assistance here, but just wanted to let
>>>>> others who are seeing this problem know that there is something
>>>>> unique triggering this. I'm fairly confident that it is not
>>>>> happening at all sites, but something here is causing it. It may not
>>>>> even be related to MS/SA, but totally something else.
>>>>> The most I could ask for is a small checklist of what to ensure I
>>>>> have set. Every time I try to use the debug procedures, the tests
>>>>> perform flawlessly with no errors. It is very sporadic. We receive
>>>>> those normal bursts of spam, but for the most part, the batches are
>>>>> small. The average amount of email per day is usually around 10k
>>>>> emails, but I get the above stated 100-300 timeouts. I'm going to
>>>>> try and match batch numbers to timeouts and see if this will reveal
>>>>> anything. I only run 3 Children on a fairly hefty Dell PowerEdge,
>>>>> but I do use 30 messages per child. I don't think this is excessive
>>>>> though.
>>>>> Hope everyone has a Happy Holiday.
>>>> What is the machine?
>>> The machines are all Dell PowerEdge servers. There are three servers
>>> involved. Two are well equipped. One is just used as an interface for
>>> our webmail users. Not a lot going through it.
>>>> Did you check the optimization section of the MAQ page on the wiki?
>>> No, I haven't, but I will. I have reviewed it before, but will look to
>>> see if anything has changed or been added.
>>>> When running --debug --debug-sa, don't you find anything that is a bit
>>>> slow?
>>> Nothing at all.
>>> I would think that if something DNS or RBL related were causing these,
>>> it would show for almost all of the batches, not just random batches.
>>> So I am guessing it is either network clutter or something else. I
>>> just don't know yet. But still, there is the situation where this all
>>> started to happen after an upgrade. I'm going to review the upgraded
>>> conf files and see if I've missed something.
>>> I have reduced the number of children on all machines from 5 to 3.
>>> This has reduced the total of timeouts - which sort of points to
>>> machine capacity. I only use 10 messages per batch. The main machines
>>> have 1 GB of RAM. The actual number of emails going through MS is
>>> quite low, around 10K, but I have quite a large access file, and the
>>> number of emails getting to the machines is closer to 25k+.
>>> Thanks for the thoughts and ideas. I'll keep digging and maybe find
>>> something.
>>> steve
>> Steve
>> 1 GB of RAM is pretty minimal for SA... it depends on what third-party
>> rules you've got, but I'd consider increasing RAM.
>> I presume you've got a local caching nameserver and you've dropped
>> most of the RBLs by giving them a zero score. Also try using
>> OpenDNS as your forward query servers, which can operate a lot quicker
>> than a lot of ISPs' DNS.
> Martin,
> I see in 'top' that I am very thin on RAM at times, but it still
> doesn't definitively explain the randomness of the timeouts. We run our
> own DNS servers, and I use a caching nameserver on each server. We also
> use OpenDNS for certain purposes, but not mailserver instances.
> I guess the problem is more about the randomness. I don't think the
> upgrade of MS would have caused such a large difference. I was running
> SA 3 before and after the upgrade, so there shouldn't have been a large
> increase there. Now there could have been a big difference in the way
> SA was acting, but I'm not aware (ignorant is probably a better
> adjective for my knowledge) of any great changes.

Well, the randomness can simply be caused by swapping. For some reason,
the system loads a little more into memory than your RAM can hold, and
it starts swapping. As Martin said, 1 GB is minimal for a
MailScanner/SA/AV system. Increasing your batch size to 30 may also
help. But the first thing I'd do is add another GB of RAM.
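
If you want to confirm the swapping theory before buying RAM, something
like the rough Python sketch below could log swap activity so you can
line it up against the timestamps of the timeouts in your mail log. It
assumes a Linux box where /proc/vmstat exposes the pswpin and pswpout
counters; the function names and the 60-second interval are just my own
choices for illustration, not anything that ships with MailScanner or SA.

```python
import time
from pathlib import Path


def parse_swap_counters(vmstat_text):
    """Return (pages_swapped_in, pages_swapped_out) from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        # Each line of /proc/vmstat is "<name> <value>".
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters.get("pswpin", 0), counters.get("pswpout", 0)


def swap_delta(before, after):
    """Pages swapped in and out between two counter snapshots."""
    return after[0] - before[0], after[1] - before[1]


def watch(interval_seconds=60, path="/proc/vmstat"):
    """Print a timestamped line whenever swapping happened in the last interval."""
    previous = parse_swap_counters(Path(path).read_text())
    while True:
        time.sleep(interval_seconds)
        current = parse_swap_counters(Path(path).read_text())
        swapped_in, swapped_out = swap_delta(previous, current)
        if swapped_in or swapped_out:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} swap-in pages: {swapped_in}, "
                  f"swap-out pages: {swapped_out}")
        previous = current
```

Call watch() from a small script, leave it running for a day, and compare
its output with the times of the timeouts. If the timeouts cluster around
the intervals where pages were being swapped, memory is the likely
culprit; if they don't, you can rule it out and look elsewhere.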

