How to deliberately skew Bayes self-learn in SA

Quentin Campbell Q.G.Campbell at NEWCASTLE.AC.UK
Tue Aug 12 12:05:44 IST 2003


I am seeing an increasing number of spam messages of the form (shown
between the "cut here" delimiters):

---------- cut here
evaluate sayings hopelessly pondering euphoria poop midmorn access
braving barr expanded bomber positively experimenting accumulations
exaltation scriptural actuated tear scientist messiah hungrily acrylate
temptation bolshevism exposure amoco meaningful tells bookmobile adrift
how seashores scramble crewman mercantile addis berlitz countermeasures
brambly brainchildren terry alicia aventine adopt scum tarpaulin
evolutions hydrangea hunger iceland portable actuarially hubbub
televisions satires satires thanked melodramatic posters braggart
imagining bookstores seagull housebroken creeks coruscate teammate boss
hunters medical exchequers savoy metric maximized playgrounds mending
actinometer bethlehem hourglass adolph searchingly telephoned

[ra.gif]

merganser boson bobbed memory hosted adducing bordellos credited body
experimenter expectant mentioners experienced teleprompter horse ali
sank adhesives tangled tame scrim bratwurst bogota plunger horseplay
amerada mediums teaches taproot creased tenements hosiery scraped scat
excretion maximizes hydrant acolytes mate mathematical evict boost
plumped allyn action baltic tanh tensing acoustics examines exit bract
crochets polishing screeched exclusiveness etude porphyry exhales bessie
hydrofluoric creaming albany actualization ar creativity scales antoine
cranny saver midstream crawl cows hyperboloidal however belfast
accountably excommunicating hydrophobic tetravalent terrains scurried
playwrights
---------- cut here

Presumably the blocks of valid words are meant to hide from SA the
"real" content, which is just a single image file. On its own, the image
would probably score highly in SA.

I have a question and an observation on this sort of spam.

QUESTION: How do you formulate a rule to tackle such messages? By
analysing sentence structure? By counting conjunctions and articles in a
block of words and treating the message as spam if there are too few?
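
To make the counting idea concrete, here is a rough sketch in Python. It
is not an SA rule and nothing here comes from SpamAssassin: the word
list, the function names, the minimum length and the 10% threshold are
all my own assumptions, picked only for illustration.

```python
import re

# Hypothetical list of common English function words (articles,
# conjunctions, prepositions, a few auxiliaries). Not taken from SA.
FUNCTION_WORDS = frozenset({
    "a", "an", "the", "and", "or", "but", "if", "of", "to", "in",
    "on", "at", "for", "with", "by", "from", "is", "are", "was",
    "were", "be", "it", "this", "that", "as", "not",
})

def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

def function_word_ratio(text):
    """Fraction of words in text that are function words."""
    words = tokenize(text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in FUNCTION_WORDS)
    return hits / len(words)

def looks_like_word_salad(text, min_words=40, threshold=0.10):
    """Flag long blocks of text that contain very few function words.

    min_words and threshold are guesses and would need tuning
    against real ham and spam.
    """
    words = tokenize(text)
    return len(words) >= min_words and function_word_ratio(text) < threshold
```

Ordinary English prose usually has a high proportion of such words, so a
long block with almost none is suspicious. Lists, tables and non-English
mail would also trip a check like this, so it would presumably need a
low score rather than being decisive on its own.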

OBSERVATION: Simply feeding these sorts of messages by hand into
"sa-learn" is likely to eventually train SA to recognise many of these
innocuous words as being indicative of spam.

This suggests that it is possible for a spammer to deliberately skew the
Bayes mechanism.

For example, a persistent spammer could create messages which score well
over 25 but which *also* contain a number of words that are normally
found only in non-spam messages. If a sufficient volume of these messages
is received at a site then it seems that the Bayes self-learn mechanism
can be subverted into rejecting non-spam messages that contain these
(innocent) words.
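
A toy calculation shows the effect I have in mind. This is a simplified
per-token estimate of my own, assuming the probability is just the
token's frequency in spam relative to its frequency in spam plus ham.
SA's real Bayes code differs in detail, and the function names and
message counts below are hypothetical.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Simplified estimate of how 'spammy' a single token looks.

    spam_hits / ham_hits: learned spam / ham messages containing the token.
    n_spam / n_ham: total learned spam / ham messages.
    Returns 0.5 (neutral) when there is no evidence either way.
    """
    if n_spam == 0 or n_ham == 0 or spam_hits + ham_hits == 0:
        return 0.5
    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    return spam_freq / (spam_freq + ham_freq)

def after_poisoning(spam_hits, ham_hits, n_spam, n_ham, poisoned):
    """Same estimate after auto-learning `poisoned` extra spam messages
    that all contain the token."""
    return token_spam_probability(
        spam_hits + poisoned, ham_hits, n_spam + poisoned, n_ham
    )

# An innocent word seen in 50 of 1000 ham and 1 of 1000 spam messages.
before = token_spam_probability(1, 50, 1000, 1000)
# The same word after 500 word-salad spams containing it are learned.
after = after_poisoning(1, 50, 1000, 1000, poisoned=500)
```

With these made-up numbers the word goes from looking strongly like ham
to looking strongly like spam, although nothing about its use in
legitimate mail has changed.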

Any comments?

Quentin
---
PHONE: +44 191 222 8209    Computing Service, University of Newcastle
FAX:   +44 191 222 8765    Newcastle upon Tyne, United Kingdom, NE1 7RU.
------------------------------------------------------------------------
"Any opinion expressed above is mine. The University can get its own." 



