[Web4lib] Google Books a tease, not a useful tool, for serious research

John Fereira jaf30 at cornell.edu
Sat Jul 7 08:39:46 EDT 2007


At 01:36 PM 7/6/2007, Lars Aronsson wrote:
>Stephen Cauffman wrote:
>
> > Per the 'search within full text' issue: on July 3, Google Books
> > released a "View plain text" feature for the public domain items,
> > where you can view the OCR text. See their blog post on it
> > at:
> > http://booksearch.blogspot.com/2007/07/greater-access-to-public-domain-works.html
>
> > or http://tinyurl.com/ysddmt
>
>
>Oh really, less than three years after the project was announced.
>It took the University of Michigan's "Making of America" longer
>than that before they dared to expose the OCR text to readers.

I worked on a project a couple of years ago that uses the same engine 
as the MOA project and the OCR'd text was exposed to end users.  We 
get complaints about it when the OCR'd text doesn't match the text on 
the original document.


>Now we're only waiting for an "edit/proofread this page" link,
>because as we all know there are tons of OCR errors, especially in
>the non-English books.

That sounds like a good idea in theory.   One would need some kind of 
authority control so that OCR vandalism would not occur (or could at 
least be easily rolled back).

In the last issue of Wired I read an interesting little snippet.  I'm
sure that you have all seen Captcha.  The article described something
called "ReCaptcha".  Instead of showing just one word for the user to
type, it shows two.  One of them is a known word used for validation.
The other is taken from a list of OCR'd words from scanning efforts.
When the end user types in both words, the first validates the user
and the second is used to correct/verify the OCR text.
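
To make the two-word idea concrete, here is a rough sketch in Python.
It is only my reading of the description above, not how ReCaptcha is
actually implemented: the function names, the in-memory vote store,
and the "three matching readings" threshold are all assumptions made
up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical store: for each unknown OCR word, count the readings
# that users who passed the control word have typed.
votes = defaultdict(Counter)


def submit(control_answer, typed_control, unknown_id, typed_unknown):
    """Return True if the user passes; record a vote for the unknown word."""
    if typed_control.strip().lower() != control_answer.lower():
        # The known word was wrong, so the other reading is not trusted.
        return False
    votes[unknown_id][typed_unknown.strip().lower()] += 1
    return True


def accepted_reading(unknown_id, min_votes=3):
    """Return the most common reading once enough users agree, else None.

    min_votes is an assumed threshold, not a documented value.
    """
    if not votes[unknown_id]:
        return None
    reading, count = votes[unknown_id].most_common(1)[0]
    return reading if count >= min_votes else None
```

Because a reading only counts when the control word was typed
correctly, and a correction is accepted only after several independent
users agree, this kind of scheme would also address the vandalism
concern mentioned above.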


>_______________________________________________
>Web4lib mailing list
>Web4lib at webjunction.org
>http://lists.webjunction.org/web4lib/

John Fereira
jaf30 at cornell.edu
Ithaca, NY 


