OT: pdf spam

Wed Jun 20 15:40:43 IST 2007

[snip]

Another possibility would be for the author of fuzzyocr to recognise
.pdf files and render them so they can be scanned for keywords. I can
think of a few keyword and load issues this could cause though.

Yes, but I'm guessing that, as fuzzyocr currently does with images, it would generate a checksum for each pdf, so the expensive decoding need only occur once for each pdf?

Also, I'm guessing the same checksum approach could be used if the pdf were rendered into a readable format using less.
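
To illustrate what I mean by the checksum idea, here is a rough Python sketch. It is not how fuzzyocr actually does it (I haven't checked its code, and it is a Perl plugin anyway). The names ScanCache and fake_decode are made up for the example, and fake_decode just stands in for whatever expensive rendering step would really be used.

```python
import hashlib


class ScanCache:
    """Remember the decoded text of each attachment, keyed by its checksum."""

    def __init__(self, decode):
        self._decode = decode      # the expensive step (hypothetical callable)
        self._results = {}         # checksum -> decoded text
        self.decode_calls = 0      # counts how often the expensive step ran

    def scan(self, data: bytes) -> str:
        # Identical attachments hash to the same digest, so a repeat of
        # the same pdf is answered from the cache without decoding again.
        key = hashlib.sha256(data).hexdigest()
        if key not in self._results:
            self.decode_calls += 1
            self._results[key] = self._decode(data)
        return self._results[key]


def fake_decode(data: bytes) -> str:
    # Placeholder for the real pdf rendering; assumed for illustration only.
    return data.decode("latin-1").lower()
```

The obvious weakness is that a spammer who changes even one byte per message gets a different checksum, so the cache only helps when the same file is sent many times.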

Jason