[Web4lib] Google Books a tease, not a useful tool, for serious research
John Fereira
jaf30 at cornell.edu
Sat Jul 7 08:39:46 EDT 2007
At 01:36 PM 7/6/2007, Lars Aronsson wrote:
>Stephen Cauffman wrote:
>
> > Per the 'search within full text' issue: on July 3, Google Books
> > released a "View plain text" feature where you can view the OCR
> > text for the public domain items. See their blog post on it
> > at:
> >
> http://booksearch.blogspot.com/2007/07/greater-access-to-public-domain-works.html
>
> > or http://tinyurl.com/ysddmt
>
>
>Oh really, less than three years after the project was announced.
>It took the University of Michigan's "Making of America" longer
>than that before they dared to expose the OCR text to readers.
I worked on a project a couple of years ago that uses the same engine
as the MOA project and the OCR'd text was exposed to end users. We
get complaints about it when the OCR'd text doesn't match the text on
the original document.
>Now we're only waiting for a "edit/proofread this page" link,
>because as we all know there are tons of OCR errors, especially in
>the non-English books.
That sounds like a good idea in theory. One would need some kind of
authority control so that OCR vandalism would not occur (or could at
least be easily rolled back).
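
A rough sketch of what "easily rolled back" could look like, assuming
each page simply keeps every revision along with who made it. The
class and method names here (ProofreadPage, edit, rollback) are made
up for illustration and do not describe any existing system:

```python
class ProofreadPage:
    """Hypothetical page of OCR text that keeps a full revision history."""

    def __init__(self, ocr_text):
        # Revision 0 is always the raw OCR output, attributed to "ocr".
        self.revisions = [("ocr", ocr_text)]

    @property
    def text(self):
        # The current text is whatever the latest revision says.
        return self.revisions[-1][1]

    def edit(self, editor, new_text):
        # Record who changed the page so the change can be undone later.
        self.revisions.append((editor, new_text))

    def rollback(self, editor):
        """Drop the trailing run of revisions made by `editor`.

        Returns the number of revisions removed. The raw OCR revision
        is never removed.
        """
        removed = 0
        while len(self.revisions) > 1 and self.revisions[-1][0] == editor:
            self.revisions.pop()
            removed += 1
        return removed
```

Keeping the whole history rather than only the latest text is what
makes vandalism cheap to undo: nothing a bad edit overwrites is lost.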
In the last issue of Wired I read an interesting little
snippet. I'm sure that you all have seen a CAPTCHA. The article
described something called "reCAPTCHA". Instead of showing just one
word for the user to type, it shows two. One of them is a known word
used for validation. The other is taken from a list of OCR'd words
from scanning efforts, and what the user types for it is used to
correct/verify the OCR text.
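
The two-word idea can be sketched roughly as follows. This is only an
illustration of the scheme as the article described it, not the actual
reCAPTCHA service; the class name, the voting threshold, and the rule
that several users must agree before a reading is accepted are all
assumptions added for the example:

```python
from collections import Counter, defaultdict


class TwoWordChallenge:
    """Hypothetical two-word challenge: one known word, one unread OCR word."""

    def __init__(self, votes_needed=3):
        # Assumed rule: a reading is accepted once this many users agree.
        self.votes_needed = votes_needed
        # unknown word id -> tally of what validated users typed for it
        self.votes = defaultdict(Counter)

    def submit(self, control_answer, typed_control, unknown_id, typed_unknown):
        """Return True if the user typed the known word correctly.

        Only then is the user's reading of the unknown word recorded.
        """
        if typed_control.strip().lower() != control_answer.lower():
            return False
        self.votes[unknown_id][typed_unknown.strip().lower()] += 1
        return True

    def accepted_reading(self, unknown_id):
        """Return the agreed transcription, or None if agreement is lacking."""
        tally = self.votes[unknown_id]
        if not tally:
            return None
        word, count = tally.most_common(1)[0]
        return word if count >= self.votes_needed else None
```

Counting a reading only from users who got the known word right is
what lets the same form do both jobs: the known word screens out bots
and careless typists, and the surviving answers become proofreading
votes.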
John Fereira
jaf30 at cornell.edu
Ithaca, NY