multiple garbage words/bayes

Mark Nienberg mark at TIPPINGMAR.COM
Mon Jan 26 19:26:55 GMT 2004


On 26 Jan 2004 at 19:03, Kevin Spicer wrote:

> On Mon, 2004-01-26 at 18:46, Dustin Baer wrote:
>
> > body   MULTI_WORD /\w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,}
> > \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,}
> > \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,} \w{4,}
> > \w{4,} \w{4,} \w{4,}/i
> > describe MULTI_WORD A lot of 4-letter words, with no punctuation
> > score MULTI_WORD 0.1
> >
> > Since I am not a Perl master, can anyone suggest an easier way to write
> > it?
> Nice idea I think.
>
> I'm not a perl master either, but I'd suggest...
>
> /(\w{4,} ){30,}/
>
> (the trailing i is not required since \w matches upper and lower case
> anyway)
>
> You might further allow different numbers of spaces/ tabs etc.  It might
> also be worthwhile to disable capturing of the parenthesized part of the
> expression (if memory serves this may make it faster)...
>
> /(?:\w{4,}\s+){30,}/

I'm seeing some with punctuation in them.  This is going to complicate
things.  Here is an example that the proposed rule would miss:

phloem cutback tau admire irredeemable allyl impeccable
headway muff closeup vine castigate astigmat coagulable
dragging pet cavil clapeyron clapboard boundary ruination
conklin butler thyroid depressant ,rub doubt isotherm melanin
mill keenan constantine widget betatron wells paternoster
blocky competitive lange autonomic - nerve domingo ott thesis
chemistry calder duct ember curry congress ostrich decreeing
conspirator .condensible permanent hades onomatopoeia ice cam
dawn precess teethed whitetail hager damn art castro , coleman
bugle doorman multiplicand firehouse ambiguous greensward
beast rutherford scribble teheran carmine annunciate
countermen joyce cover regrettable stove warmish humiliate
missile thereupon myosin . communicate berniece collectible
bawl bugeyed muscovy gator chinamen resuming sainthood
promulgate adams ,flatland goldenseal ciceronian penh wyman
basemen dharma seedling spinodal stuart falconry budget acco
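
The rule misses this because no run of 4+ letter words separated only by
whitespace reaches 30: the stray punctuation breaks the runs, and so do
three-letter words such as "tau" and "pet".  Below is a rough sketch in
Python (not SpamAssassin syntax) that checks this on the first lines of
the sample and tries one possible relaxation: accept words of three or
more letters and let any run of non-word characters separate them.  The
names STRICT, RELAXED, PROSE and hits are made up for illustration, and
the relaxed pattern is only an assumption about what might work.  I have
not tried it against real mail.

```python
import re

# The rule suggested earlier in the thread: 30+ words of 4+ letters,
# separated only by whitespace.
STRICT = re.compile(r"(?:\w{4,}\s+){30,}")

# Hypothetical relaxation (an assumption, not a tested rule): words of 3+
# letters, separated by any run of non-word characters (spaces, commas,
# periods, hyphens, ...).
RELAXED = re.compile(r"(?:\w{3,}\W+){30,}")

# First six lines of the sample above, joined with single spaces.
SAMPLE = (
    "phloem cutback tau admire irredeemable allyl impeccable "
    "headway muff closeup vine castigate astigmat coagulable "
    "dragging pet cavil clapeyron clapboard boundary ruination "
    "conklin butler thyroid depressant ,rub doubt isotherm melanin "
    "mill keenan constantine widget betatron wells paternoster "
    "blocky competitive lange autonomic - nerve domingo ott thesis"
)

# Ordinary prose: one- and two-letter words keep breaking the run.
PROSE = "This is going to complicate things for all of us " * 10


def hits(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None


print("strict on sample: ", hits(STRICT, SAMPLE))
print("relaxed on sample:", hits(RELAXED, SAMPLE))
print("relaxed on prose: ", hits(RELAXED, PROSE))
```

The relaxed pattern still relies on normal English being full of one- and
two-letter words, so it is looser than the original and would want a low
score until it has been checked for false positives.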

Mark
--
Mark W. Nienberg, SE
Tipping Mar + associates
1906 Shattuck Ave, Berkeley, CA  94704
visit our website at http://www.tippingmar.com
