multiple garbage words/bayes

Mike Brudenell pmb1 at YORK.AC.UK
Tue Jan 27 11:07:46 GMT 2004


Greetings -

--On Monday, January 26, 2004 11:26 am -0800 Mark Nienberg
<mark at TIPPINGMAR.COM> wrote:

> I'm seeing some with punctuation in them.  This is going to complicate
> things.

I think the "\b" pattern may come to the rescue here: it is a zero-width
assertion that matches a word boundary.  That is, to one side there is a
"word character" (\w = [A-Za-z0-9_]) and to the other side either a "not
word character" or the start or end of the text.
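
To see what that means in practice, here is a small sketch.  It uses
Python's "re" module rather than the Perl regexes SpamAssassin itself
runs, on the assumption that "\b" behaves the same way in both for plain
ASCII text like this.  The word "cat" and the sample strings are made up
purely for illustration.

```python
import re

# \b matches between a word character (\w) and a non-word character,
# or between a word character and the start/end of the string.
word = re.compile(r"\bcat\b")

# Made-up samples: punctuation and string edges count as boundaries,
# but letters, digits and "_" on either side do not.
samples = ["cat", "a cat.", "(cat)", "concatenate", "cat_food"]
matches = {s: bool(word.search(s)) for s in samples}

for text, hit in matches.items():
    print(f"{text!r:15} {'match' if hit else 'no match'}")
```

Punctuation next to a word therefore still counts as a boundary, while
an underscore does not, because "_" is itself a word character.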

Someone has subsequently posted a message containing some sample patterns
that use this assertion ("\b" is not a wild-card: it consumes no
characters); you may be able to adapt them further to your own needs...

--On Monday, January 26, 2004 11:02 pm +0100 Peter Bonivart
<peter at UCGBOOK.COM> wrote:

> rawbody  CP_RANDOMWORD_10
> /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){10}/
> describe CP_RANDOMWORD_10       string of 10+ random words
> score    CP_RANDOMWORD_10       0.5
>
> rawbody  CP_RANDOMWORD_15
> /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){15}/
> describe CP_RANDOMWORD_15       string of 15+ random words
> score    CP_RANDOMWORD_15       2.5
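
A rough way to try Peter's pattern outside SpamAssassin is sketched
below.  It is written in Python on the assumption that these constructs
behave the same there as in Perl; the helper name random_word_run, the
allow_punct option and all the sample strings are hypothetical and not
part of the posted rules.  The allow_punct variant is only one possible
adaptation for the punctuation Mark mentioned, and it is untested
against real mail.

```python
import re

# Common words excluded by the posted rules.
STOPWORDS = ("from", "even", "more", "were", "with")

def random_word_run(count, allow_punct=False):
    """Build the posted pattern for `count` consecutive words.

    allow_punct is a hypothetical extension: it lets each word carry
    one trailing punctuation mark before the whitespace.
    """
    stop = "|".join(STOPWORDS)
    tail = r"[.,;:!?]?" if allow_punct else ""
    return re.compile(
        r"(?:\b(?!(?:%s)\b)[a-z]{4,12}%s\s+){%d}" % (stop, tail, count)
    )

RUN_10 = random_word_run(10)

# Made-up test strings.
gibberish = ("lamp river stone cloud table mirror garden "
             "pencil window bottle candle ")
punctuated = ("lamp, river. stone; cloud: table! mirror? garden, "
              "pencil. window; bottle: candle. ")
normal = "we were told that more of them came from town with even less "

print(bool(RUN_10.search(gibberish)))    # run of 4-12 letter words
print(bool(RUN_10.search(punctuated)))   # punctuation breaks each word
print(bool(RUN_10.search(normal)))       # short words and stopwords break the run
print(bool(random_word_run(10, allow_punct=True).search(punctuated)))
```

The original pattern requires whitespace straight after each word, which
is why punctuated runs slip past it unless the pattern is adapted.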

Cheers,

Mike B-)

--
The Computing Service, University of York, Heslington, York YO10 5DD, UK
Tel:+44-1904-433811  FAX:+44-1904-433740

* Unsolicited commercial e-mail is NOT welcome at this e-mail address. *


