multiple garbage words/bayes
Mike Brudenell
pmb1 at YORK.AC.UK
Tue Jan 27 11:07:46 GMT 2004
Greetings -
--On Monday, January 26, 2004 11:26 am -0800 Mark Nienberg
<mark at TIPPINGMAR.COM> wrote:
> I'm seeing some with punctuation in them. This is going to complicate
> things.
I think the "\b" pattern may come to the rescue here: it is a zero-width
assertion that matches a word boundary. That is, to one side there is a
"word character" (\w = [A-Za-z0-9_]) and to the other side a "not word
character".
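To make the word-boundary idea concrete, here is a minimal sketch using Python's re module, which treats "\b" the same way for plain ASCII text. The helper name find_words and the sample string are made up for illustration; they are not taken from any posted rule.

```python
import re

# \b sits between a word character (\w) and a non-word character (or the
# start/end of the string). It matches a position, not a character.
WORD = re.compile(r"\b[a-z]{4,12}\b")

def find_words(text):
    """Return each run of 4-12 lowercase letters that stands alone as a word."""
    return WORD.findall(text)

# Punctuation next to a word still counts as a boundary, but a digit does
# not, because digits are word characters.
print(find_words("plinth, gazebo! (marzipan) ab3cdef xy"))
# → ['plinth', 'gazebo', 'marzipan']
```

Note that "ab3cdef" yields nothing: there is no boundary between "3" and "c", and "ab" is too short on its own.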
Someone has subsequently posted a message containing some sample patterns
that use this particular assertion; you may be able to adapt them
further to your own needs...
--On Monday, January 26, 2004 11:02 pm +0100 Peter Bonivart
<peter at UCGBOOK.COM> wrote:
> rawbody CP_RANDOMWORD_10 /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){10}/
> describe CP_RANDOMWORD_10 string of 10+ random words
> score CP_RANDOMWORD_10 0.5
>
> rawbody CP_RANDOMWORD_15 /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){15}/
> describe CP_RANDOMWORD_15 string of 15+ random words
> score CP_RANDOMWORD_15 2.5
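If you want to try a pattern of this shape before putting it into a rules file, a rough sketch along the following lines may help. It rebuilds the same expression in Python's re module rather than in SpamAssassin itself, so treat it only as an approximation; the function name random_word_rule and the sample strings are invented for the test.

```python
import re

# Common short words that the quoted rules refuse to count as "random".
STOPWORDS = ("from", "even", "more", "were", "with")

def random_word_rule(count):
    """Build a pattern shaped like the quoted rules for `count` words in a row."""
    skip = "|".join(STOPWORDS)
    return re.compile(r"(?:\b(?!(?:%s)\b)[a-z]{4,12}\s+){%d}" % (skip, count))

RULE_10 = random_word_rule(10)

# Invented sample text: ten lowercase words of 4-12 letters, space separated.
junk = ("plinth gazebo marzipan quorum velvet "
        "thimble bramble cobalt fennel juniper ")
prose = "I think we were more or less done with it, even so."

print(bool(RULE_10.search(junk)))                     # ten words in a row
print(bool(RULE_10.search(prose)))                    # short words break the run
print(bool(RULE_10.search(junk.replace(" ", ", "))))  # commas between the words
```

The last check shows where punctuation bites: as written, each word must be followed directly by whitespace, so a comma-separated run would need the "\s+" part adapting before it is caught.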
Cheers,
Mike B-)
--
The Computing Service, University of York, Heslington, York Yo10 5DD, UK
Tel:+44-1904-433811 FAX:+44-1904-433740
* Unsolicited commercial e-mail is NOT welcome at this e-mail address. *
More information about the MailScanner mailing list