multiple garbage words/bayes
Dustin Baer
dustin.baer at IHS.COM
Mon Jan 26 19:39:20 GMT 2004
Mark Nienberg wrote:
>
> On 26 Jan 2004 at 19:03, Kevin Spicer wrote:
>
> > On Mon, 2004-01-26 at 18:46, Dustin Baer wrote:
> >
> > > body MULTI_WORD /\w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,}
> > > \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,}
> > > \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,}
> > > \w{4,} \w{4,} \w{4,}/i
> > > describe MULTI_WORD A long run of words of 4+ letters, with no punctuation
> > > score MULTI_WORD 0.1
> > >
> > > Since I am not a Perl master, can anyone suggest an easier way to write
> > > it?
> > Nice idea I think.
> >
> > I'm not a perl master either, but I'd suggest...
> >
> > /(\w{4,} ){30,}/
> >
> > (the trailing i is not required since \w matches upper and lower case
> > anyway)
> >
> > You might further allow different numbers of spaces, tabs, etc. It might
> > also be worthwhile to disable capturing of the parenthesized part of the
> > expression (if memory serves, this may make it faster)...
> >
> > /(?:\w{4,}\s+){30,}/
>
> I'm seeing some with punctuation in them. This is going to complicate things. Here is
> an example that the proposed rule would miss:
> [snip]
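
The patterns quoted above can be compared side by side. What follows is a minimal sketch in Python rather than Perl, assuming Python's re module handles these particular constructs (\w, \s, {n,}, (?:...)) the same way for plain ASCII text. The names LONG_FORM, COMPACT, PUNCT_TOLERANT and has_word_run are made up for illustration, and the punctuation-tolerant variant is only one guess at handling the case Mark describes, not something proposed in the thread.

```python
import re

# Original rule from the thread: 30 words of 4+ characters,
# each pair separated by exactly one space.
LONG_FORM = re.compile(" ".join([r"\w{4,}"] * 30))

# Compact form suggested in the thread: non-capturing group,
# one or more whitespace characters after every word.
COMPACT = re.compile(r"(?:\w{4,}\s+){30,}")

# Hypothetical variant (an assumption, not from the thread):
# tolerate one optional punctuation mark after each word.
PUNCT_TOLERANT = re.compile(r"(?:\w{4,}[.,;:!?]?\s+){30,}")


def has_word_run(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    samples = {
        "30 words, single spaces": " ".join(["spam"] * 30),
        "30 words, trailing space": "spam " * 30,
        "30 words, tabs": "spam\t" * 30,
        "30 words, commas": "spam, " * 30,
    }
    for label, text in samples.items():
        print(
            label,
            has_word_run(LONG_FORM, text),
            has_word_run(COMPACT, text),
            has_word_run(PUNCT_TOLERANT, text),
        )
```

One thing the sketch brings out: the compact form needs whitespace after every word, including the 30th, so a run of exactly 30 words with nothing after it matches the original rule but not the compact one. Whether that matters in practice depends on how the message body is presented to the rule.
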
I hate spammers.
Dustin
--
Dustin Baer
Unix Administrator/Postmaster
Information Handling Services
15 Inverness Way East
Englewood, CO 80112
303-397-2836