How to deliberately skew Bayes self-learn in SA

Tue Aug 12 12:20:59 IST 2003

I would avoid learning these messages into the bayes db.  I would craft SA
rules that handle this...

I would throttle UP the SA score for the HTML_IMAGE_ONLY_02 with BAYES:
Perhaps:

score HTML_IMAGE_ONLY_02    3.50 2.76 2.54 1.97

I guess it depends on what the message scored on your system; see what other
SA rules were triggered and throttle them as needed.  Remember that changing
these rules could affect other combinations.  In my business operations, if
anyone sends me an image embedded in an HTML document with fewer than 200
words, there is a high likelihood that the email is not business related
anyway  :)

CT

----- Original Message -----
From: "Quentin Campbell" <Q.G.Campbell at NEWCASTLE.AC.UK>
To: <MAILSCANNER at JISCMAIL.AC.UK>
Sent: Tuesday, August 12, 2003 7:05 AM
Subject: How to deliberately skew Bayes self-learn in SA

I am seeing an increasing number of spam messages of the form (shown
between the "cut here" delimiters):

---------- cut here
evaluate sayings hopelessly pondering euphoria poop midmorn access
braving barr expanded bomber positively experimenting accumulations
exaltation scriptural actuated tear scientist messiah hungrily acrylate
temptation bolshevism exposure amoco meaningful tells bookmobile adrift
how seashores scramble crewman mercantile addis berlitz countermeasures
brambly brainchildren terry alicia aventine adopt scum tarpaulin
evolutions hydrangea hunger iceland portable actuarially hubbub
televisions satires satires thanked melodramatic posters braggart
imagining bookstores seagull housebroken creeks coruscate teammate boss
hunters medical exchequers savoy metric maximized playgrounds mending
actinometer bethlehem hourglass adolph searchingly telephoned

[ra.gif]

merganser boson bobbed memory hosted adducing bordellos credited body
experimenter expectant mentioners experienced teleprompter horse ali
sank adhesives tangled tame scrim bratwurst bogota plunger horseplay
amerada mediums teaches taproot creased tenements hosiery scraped scat
excretion maximizes hydrant acolytes mate mathematical evict boost
plumped allyn action baltic tanh tensing acoustics examines exit bract
crochets polishing screeched exclusiveness etude porphyry exhales bessie
hydrofluoric creaming albany actualization ar creativity scales antoine
cranny saver midstream crawl cows hyperboloidal however belfast
accountably excommunicating hydrophobic tetravalent terrains scurried
playwrights
---------- cut here

Presumably the blocks of valid words are meant to hide from SA the
presence of the "real" content which is just a single image file. By
itself SA would probably score this highly.

I have a question and an observation on this sort of spam.

QUESTION: How do you formulate a rule to tackle such messages? Analysis
of sentence structure? Counting conjunctions and articles in a block of
words - if not enough then treat as spam?
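
One way to make the "count conjunctions and articles" idea concrete is the
rough Python sketch below. It is not an SA rule or plugin, only an
illustration of the heuristic. The word list, the function names and both
thresholds (min_words, min_ratio) are assumptions picked for the example and
have not been tuned against real mail.

```python
import re

# Hypothetical, deliberately small list of English function words
# (articles, conjunctions, common prepositions and pronouns).
FUNCTION_WORDS = frozenset(
    "a an the and or but nor so yet if of to in on at by for with from "
    "is are was were be it this that i you we they".split()
)


def function_word_ratio(text):
    """Return the fraction of words in text that are function words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in FUNCTION_WORDS)
    return hits / len(words)


def looks_like_word_salad(text, min_words=50, min_ratio=0.10):
    """Flag long blocks of text that contain very few function words.

    min_words and min_ratio are illustrative guesses, not tuned values.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return len(words) >= min_words and function_word_ratio(text) < min_ratio
```

Ordinary prose normally carries a steady share of articles and conjunctions,
while the blocks quoted above contain almost none, so a low ratio over a long
enough block is the signal. Short messages are skipped because the ratio is
too noisy on a handful of words.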

OBSERVATION: Simply feeding these sorts of messages by hand into
"sa-learn" is likely to eventually train SA to recognise many of these
innocuous words as being indicative of spam.

This suggests that it is possible for a spammer to deliberately skew the
Bayes mechanism.

For example a persistent spammer could create messages which score well
over 25 but which *also* contain a number of words that are normally
found only in non-spam messages. If a sufficient volume of these messages
is received at a site then it seems that the Bayes self-learn mechanism
can be subverted to begin rejecting non-spam messages that contain these
(innocent) words.
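
The effect can be sketched with a toy calculation. The Python below uses a
simplified Graham-style per-token estimate, which is an assumption made for
illustration and is not the exact computation SA performs. The corpus sizes
and the counts for the word "meaningful" are invented for the example.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Simplified estimate of P(spam | token) from message counts.

    spam_hits / ham_hits: learned spam / ham messages containing the token.
    n_spam / n_ham: total learned spam / ham messages.
    """
    spam_freq = spam_hits / n_spam if n_spam else 0.0
    ham_freq = ham_hits / n_ham if n_ham else 0.0
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: treat as neutral
    return spam_freq / (spam_freq + ham_freq)


# Hypothetical state: 1000 ham and 1000 spam messages learned so far.
# "meaningful" has appeared in 40 ham messages and 1 spam message.
before = token_spam_probability(spam_hits=1, ham_hits=40,
                                n_spam=1000, n_ham=1000)

# 200 high-scoring word-salad spams containing "meaningful" are then
# learned as spam; the ham counts are unchanged.
after = token_spam_probability(spam_hits=201, ham_hits=40,
                               n_spam=1200, n_ham=1000)
```

Under these made-up numbers the estimate for an innocent word moves from
roughly 0.02 to roughly 0.81, which is the kind of drift described above. A
real Bayes implementation combines many tokens per message, so a single
skewed word matters less than this suggests, but the direction is the same.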

Any comments?

Quentin
---
PHONE: +44 191 222 8209    Computing Service, University of Newcastle
FAX:   +44 191 222 8765    Newcastle upon Tyne, United Kingdom, NE1 7RU.
------------------------------------------------------------------------
"Any opinion expressed above is mine. The University can get its own."