multiple garbage words/bayes

Dustin Baer dustin.baer at IHS.COM
Mon Jan 26 19:28:44 GMT 2004


Kevin Spicer wrote:
>
> On Mon, 2004-01-26 at 18:46, Dustin Baer wrote:
>
> > body   MULTI_WORD /\w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,}
> > \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,}
> > \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,}
> > \w{4,} \w{4,} \w{4,}/i
> > describe MULTI_WORD A lot of words of 4+ characters, with no punctuation
> > score MULTI_WORD 0.1
> >
> > Since I am not a Perl master, can anyone suggest an easier way to write
> > it?
> Nice idea I think.
>
> I'm not a perl master either, but I'd suggest...
>
> /(\w{4,} ){30,}/

Funny, I thought I had tried that, but I must have used /( \w{4,} ){30,}/
(note the leading space), which didn't work.  I don't know why the
leading space breaks the expression.

Yours works.
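
A likely explanation for the leading-space failure, sketched here in Python
(its `re` module treats `\w`, `{n,}` and groups the same way Perl does for
these patterns): with a space on both sides of the group, every repetition
consumes its own leading and trailing space, so adjacent words would need two
spaces between them. The sample strings and variable names are made up for
illustration.

```python
import re

# Hypothetical sample: 30 four-letter words separated by single spaces.
single_spaced = "word " * 30

# Kevin's pattern: each repetition is a word followed by one space.
trailing_only = re.compile(r"(\w{4,} ){30,}")

# The variant with a leading space: each repetition needs a space
# before AND after the word, so adjacent words need two spaces between them.
both_sides = re.compile(r"( \w{4,} ){30,}")

# The same words with one leading space and two spaces between words.
double_spaced = " " + "word  " * 30

print(bool(trailing_only.search(single_spaced)))  # True
print(bool(both_sides.search(single_spaced)))     # False
print(bool(both_sides.search(double_spaced)))     # True
```

Ordinary single-spaced text therefore never reaches 30 repetitions with the
leading-space variant, which would match what was observed.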

> (the trailing i is not required since \w matches upper and lower case
> anyway)

Right.

> You might further allow different amounts of whitespace (spaces, tabs,
> etc.).  It might also be worthwhile to disable capturing of the
> parenthesized part of the expression (if memory serves, this may make
> it faster)...
>
> /(?:\w{4,}\s+){30,}/

That works, also.
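
As a rough illustration of what `\s+` and `(?:...)` change, the Python sketch
below compares the single-space pattern with the whitespace-tolerant,
non-capturing one. The test string is invented, and no speed measurement is
attempted; the sketch only shows that the non-capturing form defines no
capture groups.

```python
import re

# Hypothetical sample: 30 words separated by a mix of tabs, newlines,
# and runs of spaces.
separators = ["\t", "  ", "\n", " "]
messy = "".join("word" + separators[i % 4] for i in range(30))

fixed_space = re.compile(r"(\w{4,} ){30,}")
any_whitespace = re.compile(r"(?:\w{4,}\s+){30,}")

print(bool(fixed_space.search(messy)))     # False: separators are not all single spaces
print(bool(any_whitespace.search(messy)))  # True: \s+ accepts any whitespace run
print(any_whitespace.groups)               # 0: (?:...) creates no capture group
```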

I might also change "\w" to "[a-zA-Z]" to ignore digits and underscores.
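
A small Python sketch of that change, assuming the `\s+` form of the pattern:
`\w` also accepts digits and underscores, so a run of numeric tokens triggers
the `\w` version but not the `[a-zA-Z]` one. The sample strings are
hypothetical.

```python
import re

# Hypothetical samples: 30 purely alphabetic words, and 30 tokens made
# only of digits and underscores.
letters = "word " * 30
digits = "1234_5678 " * 30

word_chars = re.compile(r"(?:\w{4,}\s+){30,}")
letters_only = re.compile(r"(?:[a-zA-Z]{4,}\s+){30,}")

print(bool(word_chars.search(letters)))    # True
print(bool(word_chars.search(digits)))     # True: \w matches digits and "_"
print(bool(letters_only.search(letters)))  # True
print(bool(letters_only.search(digits)))   # False: no letters in the tokens
```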

Thanks for the input, Kevin!  Hopefully, others might find this useful.

Dustin


