Per-user whitelisting

Tue Jun 15 17:37:36 IST 2004

On Tue, 15 Jun 2004, Don Newcomer wrote:

> I'm still having trouble with this.  I've had a number of messages get
> flagged as spam when I have the appropriate entries in
> /usr/local/MailScanner/spam.bydomain/whitelist/newcomer at dickison.edu.

I can't answer the whitelist aspects.  But...

> Here are the headers from one:
>
> [...]
> X-Dickinson-MailScanner-SpamCheck: spam, SpamAssassin (score=3.152,
>         required 3, BAYES_00 -4.90, DATE_IN_PAST_12_24 0.75,
>         HTML_MESSAGE 0.10, HTTP_WITH_EMAIL_IN_URL 0.20,
>         MAILTO_SUBJ_REMOVE 0.89, MIME_MISSING_BOUNDARY 1.84,
>         MK_BAD_HTML_05 0.30, MSGID_FROM_MTA_HEADER 0.70, OFFERS_ETC 0.23,
>         REMOVE_PAGE 0.50, REMOVE_REMOVAL_1WORD 1.89, REMOVE_SUBJ 0.35,
>         SARE_WEOFFER 0.30)
> X-Dickinson-MailScanner-SpamScore: sss
> [...]
>
> While this one isn't a huge deal, another comes from a credit card company.
> Again, I can see no reason why these aren't being whitelisted.  I haven't
> gotten a lot of complaints because we're only marking mail as spam and
> deleting the really high-scoring (15+) ones.  However, once we roll out
> filtering based on spam score, this will become a big issue.

Using a spam threshold of 3 is definitely "adventurous"!

Spam classification is an inexact art along a grey scale, not an exact
science with simple binary state.  Good emails (hams) will often score
higher than expected, and spams will often score lower than expected.  To
minimise the risk of "false positives" (i.e. wrongly accusing ham of being
spam, as is happening in your example), you should probably set the boundary
no lower than 5.  For a long time at our site (a university, like yours) we
used 7; several months ago we reduced it to 6; I probably won't reduce it
to 5.  I really wouldn't use 3.
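
To make the arithmetic concrete, here is a rough Python sketch (purely
illustrative, not anything MailScanner or SpamAssassin actually runs).  It
adds up the rule scores from the quoted header and compares the total
against thresholds of 3, 5 and 6.  The function name `classify` and the
"total >= threshold" comparison are my own assumptions.  The per-rule
scores in the header are rounded, so they sum to 3.15 rather than the
reported 3.152.

```python
# Rule scores copied from the quoted X-Dickinson-MailScanner-SpamCheck header.
RULE_SCORES = {
    "BAYES_00": -4.90,
    "DATE_IN_PAST_12_24": 0.75,
    "HTML_MESSAGE": 0.10,
    "HTTP_WITH_EMAIL_IN_URL": 0.20,
    "MAILTO_SUBJ_REMOVE": 0.89,
    "MIME_MISSING_BOUNDARY": 1.84,
    "MK_BAD_HTML_05": 0.30,
    "MSGID_FROM_MTA_HEADER": 0.70,
    "OFFERS_ETC": 0.23,
    "REMOVE_PAGE": 0.50,
    "REMOVE_REMOVAL_1WORD": 1.89,
    "REMOVE_SUBJ": 0.35,
    "SARE_WEOFFER": 0.30,
}


def classify(scores, threshold):
    """Hypothetical helper: return (total, is_spam) for the given threshold.

    Assumes a message counts as spam when its total reaches the threshold.
    """
    total = round(sum(scores.values()), 2)
    return total, total >= threshold


for threshold in (3, 5, 6):
    total, is_spam = classify(RULE_SCORES, threshold)
    verdict = "spam" if is_spam else "not spam"
    print(f"threshold {threshold}: total {total:.2f} -> {verdict}")
```

With a threshold of 3 the message only just tips over into "spam"; at 5 or
6 the same message would pass as ham, even before any whitelisting.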

So regardless of the "whitelist" aspect, I would suggest being a little
more lenient in your classification, increasing the threshold score to at
least 5, probably 6.  (How does your Bayes learn?  How immune are you to
so-called "Bayes poisoning"?  Etc.)

You may need to educate users about the real-world messiness and
uncertainty (including time variance of RBL components and Bayes) of spam
classification (i.e. it is not a simplistic, rose-tinted binary certainty)
and the risk balancing aspects of false positives versus false negatives.
They might not like a few spams getting through undetected (as false
negatives).  But how much less would they like to lose even one genuine
email as a false positive?

Hope that helps.

--

:  David Lee                                I.T. Service          :
:  Systems Programmer                       Computer Centre       :
:                                           University of Durham  :
:  http://www.dur.ac.uk/t.d.lee/            South Road            :
:                                           Durham                :
:  Phone: +44 191 334 2752                  U.K.                  :
