multiple garbage words/bayes

Dustin Baer dustin.baer at IHS.COM
Mon Jan 26 19:39:20 GMT 2004


Mark Nienberg wrote:
>
> On 26 Jan 2004 at 19:03, Kevin Spicer wrote:
>
> > On Mon, 2004-01-26 at 18:46, Dustin Baer wrote:
> >
> > > body   MULTI_WORD /\w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,}
> > > \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,}
> > > \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,}
> > > \w{4,} \w{4,} \w{4,}/i
> > > describe MULTI_WORD A lot of 4-letter words, with no punctuation
> > > score MULTI_WORD 0.1
> > >
> > > Since I am not a Perl master, can anyone suggest an easier way to write
> > > it?
> > Nice idea I think.
> >
> > I'm not a Perl master either, but I'd suggest...
> >
> > /(\w{4,} ){30,}/
> >
> > (the trailing i is not required since \w matches upper and lower case
> > anyway)
> >
> > You might further allow different numbers of spaces/ tabs etc.  It might
> > also be worthwhile to disable capturing of the parenthesized part of the
> > expression (if memory serves this may make it faster)...
> >
> > /(?:\w{4,}\s+){30,}/
>
> I'm seeing some with punctuation in them.  This is going to complicate things.  Here is
> an example that the proposed rule would miss:

> [snip]

I hate spammers.
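
For anyone who wants to compare the patterns quoted above, here is a rough
sketch that transcribes them into Python's re module (the syntax for these
constructs is the same as Perl's). The PUNCT_TOLERANT pattern and the sample
strings are assumptions made up for illustration; nobody in the thread
proposed that exact pattern, and none of this has been run against real spam.

```python
import re

# Original rule: 30 words of 4+ word characters, separated by single spaces.
LONG_FORM = re.compile(" ".join([r"\w{4,}"] * 30))

# Compact form suggested in the thread: non-capturing group, any whitespace.
# Note that every word, including the 30th, must be followed by whitespace.
COMPACT = re.compile(r"(?:\w{4,}\s+){30,}")

# Hypothetical variant (an assumption, not from the thread): allow one
# optional punctuation mark after each word.
PUNCT_TOLERANT = re.compile(r"(?:\w{4,}[.,;:!?]?\s+){30,}")


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None


# Made-up sample inputs.
plain_30 = " ".join(["garbage"] * 30)       # exactly 30 words, no trailing space
plain_31 = " ".join(["garbage"] * 31)       # one extra word
punctuated = " ".join(["garbage,"] * 31)    # comma after every word

for name, text in [("plain_30", plain_30),
                   ("plain_31", plain_31),
                   ("punctuated", punctuated)]:
    print(name,
          matches(LONG_FORM, text),
          matches(COMPACT, text),
          matches(PUNCT_TOLERANT, text))
```

One thing this shows is that the compact form is not strictly equivalent to
the long one: it wants whitespace after the last word, so a run of exactly 30
words at the very end of the text matches the long form but not the compact
one. Allowing punctuation also loosens the rule, so the score would probably
need to be watched for false positives on ordinary prose.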

Dustin
--
Dustin Baer
Unix Administrator/Postmaster
Information Handling Services
15 Inverness Way East
Englewood, CO 80112
303-397-2836


