multiple garbage words/bayes

Kevin Spicer kevins at BMRB.CO.UK
Mon Jan 26 19:03:28 GMT 2004


On Mon, 2004-01-26 at 18:46, Dustin Baer wrote:

> body   MULTI_WORD /\w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,}
> \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,}
> \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,}
> \w{4,} \w{4,} \w{4,}/i
> describe MULTI_WORD A lot of 4-letter words, with no punctuation
> score MULTI_WORD 0.1
>
> Since I am not a Perl master, can anyone suggest an easier way to write
> it?
Nice idea, I think.

I'm not a Perl master either, but I'd suggest...

/(\w{4,} ){30,}/

(The trailing i is not required, since \w matches both upper and lower
case anyway.)
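
As a rough check of the two claims above, here is a small sketch in Python rather than Perl (the `re` module accepts the same syntax for these particular patterns). The sample word list is made up for illustration, and how SpamAssassin presents the message body to a rule is not modelled. It also shows one small difference: the compact form wants a space after every word, including the 30th, whereas the spelled-out rule does not.

```python
import re

# Spelled-out form from the quoted rule: 30 words joined by single spaces.
long_form = re.compile(" ".join([r"\w{4,}"] * 30))
# Compact form suggested above: 30 or more "word + space" units.
compact = re.compile(r"(\w{4,} ){30,}")

# Hypothetical sample text: 30 words of 4+ characters in mixed case.
words = ["Word", "TEST", "lower", "MiXeD", "abcd"] * 6
no_trailing_space = " ".join(words)
with_trailing_space = no_trailing_space + " "

# \w covers both cases, so neither pattern needs a case-insensitive flag.
long_form_hit = long_form.search(no_trailing_space) is not None
compact_hit = compact.search(with_trailing_space) is not None

# The compact form expects a space after every word, including the 30th.
compact_without_space = compact.search(no_trailing_space) is not None
```

Here `long_form_hit` and `compact_hit` come out true, while `compact_without_space` is false. Bear in mind that \w also matches digits and underscores, so "word" is meant loosely.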

You could also allow for varying amounts of whitespace (spaces, tabs,
etc.) between the words.  It might also be worthwhile to disable
capturing for the parenthesized group (if memory serves, this may make
it faster)...

/(?:\w{4,}\s+){30,}/
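
A similar sketch for the whitespace-tolerant, non-capturing version, again in Python as a stand-in for Perl and with invented test strings. It does not measure speed. It only checks that mixed whitespace is accepted, that short or too few words are rejected, and that the group really captures nothing.

```python
import re

pattern = re.compile(r"(?:\w{4,}\s+){30,}")

# Hypothetical inputs for illustration only.
words = ["garbage"] * 30
mixed_whitespace = " \t ".join(words) + "\n"   # spaces, tabs, final newline
short_words = " ".join(["the"] * 40) + " "      # 3-letter words are too short
too_few = " ".join(words[:29]) + " "            # only 29 qualifying words

matches_mixed = pattern.search(mixed_whitespace) is not None
matches_short = pattern.search(short_words) is not None
matches_few = pattern.search(too_few) is not None

# (?:...) groups without capturing, so the pattern has no capture groups.
group_count = pattern.groups
```

Only `matches_mixed` is true, and `group_count` is 0. Whether skipping the capture makes a measurable difference inside SpamAssassin would need timing on real mail.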




BMRB International
http://www.bmrb.co.uk
+44 (0)20 8566 5000


