Emails made up of random words

Sat Jan 10 11:26:50 GMT 2004

> -----Original Message-----
> From: MailScanner mailing list [mailto:MAILSCANNER at JISCMAIL.AC.UK]
> On Behalf Of John Rudd
> Sent: Friday, January 09, 2004 1:24 PM
> To: MAILSCANNER at JISCMAIL.AC.UK
> Subject: Re: Emails made up of random words
>
>
> On Jan 9, 2004, at 3:32 AM, Rick Cooper wrote:
>
> >> -----Original Message-----
> >> From: MailScanner mailing list [mailto:MAILSCANNER at JISCMAIL.AC.UK]
> >> On Behalf Of Howard Robinson
> >> Sent: Friday, January 09, 2004 4:33 AM
> >> To: MAILSCANNER at JISCMAIL.AC.UK
> >> Subject: Emails made up of random words
> >>
> >>
> >> Dear List members,
> >> We are getting increasing numbers of emails containing what look
> >> like a selection of random words. It only started here before
> >> Christmas. Is this a new phenomenon or have we just been lucky
> >> before?
> >> Whilst they are still manageable numbers at the moment & can
> >> be quickly deleted, there are one or two members of staff getting
> >> their knickers in a twist about them.
> >> What's the best way to deal with them (the emails, not the staff)?
> >>
> >> Thanks and happy new year to you all.
> >>
> >>
> >>
> >> Regards
> >>
> >> Howard Robinson
> >
> > Go here http://www.emtinc.net/spamhammers.htm and use these rules
> > if you are not already.
>
> Based upon my observation, they miss the point.
>

They do not in and of themselves look for this specific type of
problem, but they do a good job of looking for other problems
(BigEvil, for instance, seems to be working quite well), and the
aggregate of these rules has done well in detecting spam while
passing ham.

> > There is much discussion of this topic (bayes poison) on the
> > spamassassin list and there are a couple of counter measures
> > being developed so you may want to subscribe to spamassassin-talk
> > and follow the thread relating to large collections of random
> > words.
>

> It's not just a bayes poisoning attack, in my observation.  Most of the
> messages I've seen like this have multipart/alternative structure where
> the text and html segments don't match (the text segment is gibberish
> and the html segment has spam).  Rules that try to identify gibberish
> would seem to be rather misguided ... just find a way to check and see
> if the two segments don't match in content.
>
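
One possible reading of "match in content" is word overlap between the
text/plain and text/html alternatives. The sketch below illustrates that
reading only; it is not an existing SpamAssassin or MailScanner rule. The
function names (part_overlap, html_to_text, decoded_body, words), the
Jaccard measure, and the sample message are all assumptions made for the
example, and any cut-off for "too different" would need tuning against
real mail.

```python
import re
from email.message import EmailMessage
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)


def words(text):
    # Crude tokenizer: lowercase runs of ASCII letters only.
    return set(re.findall(r"[a-z]+", text.lower()))


def decoded_body(part):
    payload = part.get_payload(decode=True)
    if payload is None:
        return ""
    charset = part.get_content_charset() or "utf-8"
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:  # unknown charset name in the header
        return payload.decode("utf-8", errors="replace")


def part_overlap(msg):
    """Jaccard word overlap between the first text/plain and first
    text/html parts, or None if the message lacks either one."""
    plain = html = None
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype == "text/plain" and plain is None:
            plain = decoded_body(part)
        elif ctype == "text/html" and html is None:
            html = html_to_text(decoded_body(part))
    if plain is None or html is None:
        return None
    a, b = words(plain), words(html)
    if not a and not b:
        return 1.0  # two empty parts trivially agree
    return len(a & b) / len(a | b)


# Hypothetical message: gibberish text part, sales pitch in the HTML part.
msg = EmailMessage()
msg.set_content("grapple antelope window sofa marble")
msg.add_alternative("<html><body><p>Buy cheap pills now</p></body></html>",
                    subtype="html")
print(part_overlap(msg))  # 0.0 here: the two parts share no words
```

A score near 0 means the two alternatives have almost nothing in common.
Legitimate mail often differs a little between the parts (link text,
footers), so a low overlap is probably better used as one scoring signal
among several than as a verdict on its own.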

I am not sure what you mean by "match content". I think this type
of spam is going to be a real problem because it's going to be
very difficult to test for gibberish "sentences" in an email. I
suppose you could do grammar tests and score the email based on
how many rules get broken, but that would be a tremendous
undertaking to implement in all the various languages (although
it seems most of this spam is English). Right now they are
talking about testing for strings of words that are longer than
four characters with no punctuation, but that will be easy for
the spammers to change.

It is very much bayes poison, since they have taken to using
large volumes of common words in a message that may/will be
tagged as spam, thus degrading the ability to distinguish
spam/ham probability. Of course, I think the main goal of using
these words is to defeat the HTML/image-to-text ratio scores
that used to trip them up much more often when they began trying
image-only spam to defeat the word/phrase checks. If you think
back a short time, they had begun sending the spam as an image
and tacking a bunch of gibberish onto the end of the message,
which was much easier to catch because of a lack of vowels or too
many vowels, extremely long "words", etc. Now they use real
words. When someone comes up with a way to check language syntax
they will just start including parts of Moby Dick in their spams
:-(
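
To make the two heuristics mentioned above concrete, here is a rough
sketch of each: the run of long, unpunctuated words, and the older
vowel-ratio / overlong-"word" check. The function names, the treatment of
short words as run breakers, and every threshold (min_len, low, high,
max_len) are assumptions chosen for illustration, not what any
SpamAssassin rule actually does.

```python
VOWELS = set("aeiou")


def longest_long_word_run(text, min_len=5):
    """Length of the longest run of consecutive purely alphabetic tokens
    of at least min_len characters. A short word or a token carrying
    punctuation or digits ends the run (an assumed interpretation)."""
    longest = current = 0
    for token in text.split():
        if token.isalpha() and len(token) >= min_len:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def vowel_fraction(word):
    """Share of the letters in word that are a, e, i, o or u."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in VOWELS for c in letters) / len(letters)


def looks_like_noise(word, low=0.2, high=0.8, max_len=20):
    """Flag a token with too few vowels, too many vowels, or an
    implausible length. The thresholds are guesses, not tuned values."""
    if len(word) > max_len:
        return True
    fraction = vowel_fraction(word)
    return fraction < low or fraction > high


print(longest_long_word_run("window marble antelope grapple curtain"))  # 5
print(looks_like_noise("xkcdqrt"), looks_like_noise("marble"))  # True False
```

The second check shows why the switch to real words matters: "marble"
passes it just as any dictionary word would, and the first check stops
firing as soon as the sender sprinkles in commas or short words.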

BTW: I checked rule hits on the messages of this type yesterday,
and the items that were common among the few I got were multiple
BACKHAIR hits in the same message and FVGT hits, with BigEvil
showing up in nearly every one that made it to very high spam
scores.

> I tried asking about this on the sa-talk list, even re-posting my
> question, and have had NO response.  The sa-talk list is rather
> annoying in this regard.

This is true, but reading through the volumes of mail on that
list does reward you with some pretty good information, almost
daily.

>
> Which thread topics are the ones you're talking about?  (there are too
> many of them to read each and every one of them to track it down)
>

The latest thread on this topic is "detecting large collections
of random words"
